Week 11
Databases, Dashboards, and Agentic Coding

DS105W – Data for Data Science

🗓️ 02 Apr 2026

Part 1: 🗃️ Designing a Database for Your Project

Last week you learned to read from a database using SQL and pd.read_sql().

This week: how to create and populate your own.

Why not just keep everything in CSVs?

RECAP: while CSVs work fine when your data fits in a single flat table, your projects will likely have data from multiple sources that relate to each other:

  • Self-chosen track: different datasets with shared keys
  • WFP track: four APIs producing interconnected tables

A database gives you:

  • Structure: each table has defined columns and types
  • Relationships: tables connect through shared keys (just like pd.merge())
  • Efficiency: query only the rows and columns you need

Approach | When it works
---|---
Single CSV | One data source, flat structure, small data
Multiple CSVs | Several sources, but you load everything into memory every time
SQLite database | Several related tables, you want to query specific subsets, your project needs reproducible data access

For the 📦 Group Project self-chosen track: your submission MUST include a SQLite database with at least 2 tables connected by foreign keys. Those taking the WFP track are not required to use a database but can if they want to.

Why split data into separate tables?

Imagine storing everything in one table:

tconst | primary_title | avg_rating | num_votes | person_name | role
---|---|---|---|---|---
tt0111161 | The Shawshank Redemption | 9.3 | 3120155 | Tim Robbins | actor
tt0111161 | The Shawshank Redemption | 9.3 | 3120155 | Morgan Freeman | actor
tt0111161 | The Shawshank Redemption | 9.3 | 3120155 | Frank Darabont | director
tt0068646 | The Godfather | 9.2 | 2175958 | Marlon Brando | actor
tt0068646 | The Godfather | 9.2 | 2175958 | Al Pacino | actor
tt0068646 | The Godfather | 9.2 | 2175958 | Francis Ford Coppola | director

The title, rating, and vote count repeat for every person in that movie. Six rows, and "The Shawshank Redemption" appears three times with the same rating.

Split into separate tables:

Each piece of information is stored once and connected by tconst:

  • title_basics: one row per movie (title, year, genre)
  • title_ratings: one row per movie (rating, votes)
  • title_principals: one row per person-movie combination

You already know this schema from 🖥️ W09 Lecture and 🖥️ W10 Lecture. Now you’ll learn how to create it yourself.

How to design your database

  • When designing your tables, ask: “If I need to update this value, how many rows do I need to update?”

  • If a movie’s rating changes and you have a single combined table, you’d need to update every row for every person in that movie. With separate tables, you update one row in title_ratings.

  • The same logic applies to your projects. If a station’s location changes, or a country’s population estimate gets revised, you want to change it in one place and let your JOINs (the SQL equivalent of pd.merge()) carry the change through to the other tables when needed.

  • Split your data so that each table describes one kind of thing (movies, ratings, people, countries, stations) and connect them with shared keys.
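The one-place update above can be sketched in a few lines. This is an illustrative example using an in-memory database and a hypothetical rating revision, not a real IMDb update:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
conn.execute("""
    CREATE TABLE title_ratings (
        tconst         TEXT PRIMARY KEY,
        average_rating REAL,
        num_votes      INTEGER
    )
""")
conn.execute("INSERT INTO title_ratings VALUES ('tt0111161', 9.3, 3120155)")

# A rating revision touches exactly one row here, no matter how many
# people are linked to this movie in other tables
conn.execute(
    "UPDATE title_ratings SET average_rating = 9.4 WHERE tconst = 'tt0111161'"
)
conn.commit()
```

Any JOIN against title_ratings now sees the new rating without you touching a single row elsewhere.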

CREATE TABLE in SQLite

CREATE TABLE defines a table’s structure before you put any data into it:

CREATE TABLE IF NOT EXISTS title_basics (
    tconst      TEXT PRIMARY KEY,
    title_type  TEXT,
    primary_title TEXT,
    start_year  INTEGER,
    genres      TEXT
);
  • TEXT, INTEGER, REAL are the common SQLite data types
  • PRIMARY KEY marks the column that uniquely identifies each row
  • IF NOT EXISTS prevents errors if you run the script twice

Running it from Python:

import sqlite3

conn = sqlite3.connect("data/processed/my_project.db")
cursor = conn.cursor()

create_query = """
    CREATE TABLE IF NOT EXISTS title_basics (
        tconst      TEXT PRIMARY KEY,
        title_type  TEXT,
        primary_title TEXT,
        start_year  INTEGER,
        genres      TEXT
    )
"""

cursor.execute(create_query)

conn.commit()

conn.commit() saves the change to disk, similar to how git commit saves changes to a repository.

DROP TABLE in SQLite

If you made a mistake in your table creation and you want to start over, you can use DROP TABLE to remove the table.

cursor.execute("DROP TABLE IF EXISTS title_basics;")
conn.commit()

# Now create it again  
create_query = """
    CREATE TABLE IF NOT EXISTS title_basics (
        tconst      TEXT PRIMARY KEY,
        title_type  TEXT,
        primary_title TEXT,
        start_year  INTEGER,
        genres      TEXT
    )
"""

cursor.execute(create_query)
conn.commit()

Primary keys and foreign keys

  • A primary key uniquely identifies each row. No two rows can share the same primary key value, and it cannot be empty.

  • In IMDb, tconst is the primary key for both title_basics and title_ratings.

  • A foreign key is a column in one table that points to the primary key of another table. It creates the connection between tables.

CREATE TABLE IF NOT EXISTS title_ratings (
    tconst         TEXT PRIMARY KEY,
    average_rating REAL,
    num_votes      INTEGER,
    FOREIGN KEY (tconst)
        REFERENCES title_basics(tconst)
);

👉 The FOREIGN KEY line is super important because it ensures that the data in the title_ratings table is consistent with the data in the title_basics table. It says: “the tconst column in this table MUST match a tconst that exists in title_basics.”

One caveat: SQLite does not enforce foreign keys by default. You must switch enforcement on for each connection with cursor.execute("PRAGMA foreign_keys = ON"). With that in place, inserting a tconst that does not exist in title_basics raises an IntegrityError.
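Here is a minimal sketch of enforcement in action, using an in-memory database and made-up tconst values. Note the PRAGMA line: SQLite only checks foreign keys when you turn the feature on for the connection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # must be set per connection

conn.execute("CREATE TABLE title_basics (tconst TEXT PRIMARY KEY, primary_title TEXT)")
conn.execute("""
    CREATE TABLE title_ratings (
        tconst         TEXT PRIMARY KEY,
        average_rating REAL,
        FOREIGN KEY (tconst) REFERENCES title_basics(tconst)
    )
""")

conn.execute("INSERT INTO title_basics VALUES ('tt0111161', 'The Shawshank Redemption')")
conn.execute("INSERT INTO title_ratings VALUES ('tt0111161', 9.3)")  # OK: parent row exists

try:
    conn.execute("INSERT INTO title_ratings VALUES ('tt9999999', 5.0)")  # no such movie
except sqlite3.IntegrityError as e:
    print(e)  # prints: FOREIGN KEY constraint failed
```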

Data types in SQLite

SQLite is simpler than most databases. It has five storage classes:

SQLite type | What it stores | pandas equivalent
---|---|---
TEXT | Strings | object or string
INTEGER | Whole numbers | int64
REAL | Decimal numbers | float64
BLOB | Binary data (images, files) | rarely used in DS105
NULL | Missing values | NaN / None

When you use df.to_sql(), pandas maps its dtypes to these SQLite types automatically. You only need to think about types when writing CREATE TABLE statements.
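You can see this mapping for yourself with PRAGMA table_info, which reports the column types pandas declared. A quick sketch with a made-up one-row DataFrame:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
df = pd.DataFrame({
    "tconst": ["tt0111161"],   # object dtype  -> TEXT
    "start_year": [1994],      # int64 dtype   -> INTEGER
    "average_rating": [9.3],   # float64 dtype -> REAL
})
df.to_sql("movies", conn, index=False)

# Inspect the column types pandas declared on your behalf
for _, name, col_type, *_ in conn.execute("PRAGMA table_info(movies)"):
    print(name, col_type)  # e.g. "tconst TEXT"
```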

Curating column types in DS105

A throwback to simpler times…

In 🖥️ W02 Lecture we saw that everything in memory is made of bits, and that choosing a representation means choosing how much space and which operations you commit to. More bits let you represent a wider range of values (for example four bits give you \(2^4 = 16\) distinct patterns of bits):

\[ \require{color} \begin{array}{ccccr} \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 0 \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 15 \\ \end{array} \]

I want to see you carry that knowledge forward when you design tables ⏩

Column type affinity in SQLite

The table below is the affinity name examples table from SQLite: Datatypes In SQLite Version 3 (section 3.1.1).

Example typenames from CREATE TABLE or CAST | Affinity
---|---
INT, INTEGER, TINYINT, SMALLINT, MEDIUMINT, BIGINT, UNSIGNED BIG INT, INT2, INT8 | INTEGER
CHARACTER(20), VARCHAR(255), VARYING CHARACTER(255), NCHAR(55), NATIVE CHARACTER(70), NVARCHAR(100), TEXT, CLOB | TEXT
BLOB, (no datatype specified) | BLOB
REAL, DOUBLE, DOUBLE PRECISION, FLOAT | REAL
NUMERIC, DECIMAL(10,5), BOOLEAN, DATE, DATETIME | NUMERIC

Curating column types in DS105

Just to make things interesting, even though SQLite will map the type name to an affinity automatically anyway, I still want you to think back to the bits and bytes we discussed in 🖥️ W02 Lecture. So here are the rules for us in DS105:

Use the most adequate data type: If a column is genuinely a whole number (counts, years, ages), use a suitable integer type, not TEXT. Do not store "42" as a string when the value behaves like a number.

Adjust to the size of the data: Don’t use a BIGINT or a plain INTEGER if the range of values for that column should fit in a SMALLINT or INT8, say. Use this W3Schools guide to help you decide.
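A small sketch of what that looks like in practice, with a hypothetical stations table. The declared names document your intent; as the affinity table above shows, SQLite maps every one of them to INTEGER affinity, so the precision in naming costs nothing at runtime:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
conn.execute("""
    CREATE TABLE stations (
        station_id INT8,      -- small lookup codes
        start_year SMALLINT,  -- four-digit years fit comfortably
        num_votes  BIGINT     -- counts that can exceed 2 billion
    )
""")
```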

The DS105 way to INSERT data

You’ve spent the whole course building skills transforming data with pandas. I want to see you use them, even though you could do many other things in SQL. Reserve your use of SQL for the exploratory data analysis, where SQL will be more useful and interesting.

The workflow for populating your database:

  1. Load and clean your data the way you already know (API calls, CSV reads, JSON parsing)
  2. Build your DataFrames with the columns you need
  3. Push to the database with df.to_sql()

df_basics.to_sql(
    "title_basics",
    conn,
    if_exists="append",
    index=False
)

Why do it this way?

You could write raw SQL INSERT statements to add data row by row. But that skips all the cleaning and reshaping tools you’ve learned.

By building DataFrames first, you can:

  • Use .rename(), .drop(), .astype() to get columns right
  • Use boolean masks to filter out bad rows
  • Use .merge() to combine sources before splitting into target tables

Then to_sql() handles the SQL insertion for you.

If your project needs two database tables from one API response, that’s a good exercise in using pandas to split and reshape your data into the right structure.
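That split-and-reshape step can be sketched like this. The data is made up, and for brevity pandas creates the tables here; in your project, run your CREATE TABLE script first so your declared schema is preserved:

```python
import sqlite3

import pandas as pd

# Hypothetical combined response: movie details repeated for every person
df = pd.DataFrame({
    "tconst": ["tt0111161", "tt0111161", "tt0068646"],
    "primary_title": ["The Shawshank Redemption", "The Shawshank Redemption", "The Godfather"],
    "person_name": ["Tim Robbins", "Morgan Freeman", "Marlon Brando"],
})

conn = sqlite3.connect(":memory:")  # illustrative in-memory database

# One row per movie: keep the movie columns, then de-duplicate
df_basics = df[["tconst", "primary_title"]].drop_duplicates()
df_basics.to_sql("title_basics", conn, if_exists="append", index=False)

# One row per person-movie combination
df_principals = df[["tconst", "person_name"]]
df_principals.to_sql("title_principals", conn, if_exists="append", index=False)
```

Three input rows become two movies in title_basics and three person-movie rows in title_principals.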

if_exists='append' vs if_exists='replace'

Prefer if_exists='append'

Don’t let pandas create the tables for you. Insert rows into the existing table. The structure you declared in SQL is preserved.

# After CREATE TABLE title_basics ( ... ) 
# has already been executed
df_basics.to_sql(
    "title_basics",
    conn,
    if_exists="append",
    index=False,
)

Use this as your default once the empty table is in place.

Avoid if_exists='replace'

replace drops the table and builds a new one from the DataFrame alone.

# Risky: recreates "readings" from df only. It is not the DS105 way
df_all.to_sql(
    "readings", conn,
    if_exists="replace", index=False
)

If you need to reload data, safer patterns are: DELETE FROM readings; then append again, or DROP TABLE and re-run your same CREATE TABLE script, then append.

⚠️ replace is a common source of confusion in group projects: the table looks fine in pandas, but keys and types no longer match what you designed.
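The safer DELETE-then-append reload pattern can be sketched as follows, with a made-up readings table and in-memory database:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
conn.execute("CREATE TABLE readings (station_id TEXT, value REAL)")

df_all = pd.DataFrame({"station_id": ["a", "b"], "value": [1.0, 2.0]})
df_all.to_sql("readings", conn, if_exists="append", index=False)

# Reload: empty the table but keep the schema you declared, then append again
conn.execute("DELETE FROM readings")
conn.commit()
df_all.to_sql("readings", conn, if_exists="append", index=False)
```

The table ends up with fresh data, and the column types and keys you declared in CREATE TABLE survive the reload.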

Data insertion workflow

The DS105 way to insert data:

  1. Define the table with CREATE TABLE and cursor.execute(...) (and conn.commit()), as in the slides above, so your primary keys, foreign keys, and column types are exactly as you want them.
  2. Clean your DataFrame as usual,
  3. Insert with to_sql(..., if_exists="append"). Pandas only inserts rows into the table that already exists; it does not need to invent a new schema.

Managing your connection

Open once, reuse throughout the notebook:

import sqlite3
import pandas as pd

# Open at the start
conn = sqlite3.connect("../data/processed/my_project.db")

# Use throughout
df_1 = pd.read_sql("SELECT * FROM table_a", conn)
df_2 = pd.read_sql("SELECT * FROM table_b", conn)

# ... more work ...

# Close when done
conn.close()

You don’t need to open and close the connection for every query. One connection per notebook is enough.

But do close it when you’re finished. An open connection locks the database file, which can cause problems if another notebook or script tries to access it.

Place conn.close() at the end of your notebook, after all your queries and writes are done.

A good habit: treat the connection like opening and closing a file. Open it early, use it throughout, close it at the end.
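If you like that file-handling analogy, Python’s standard library can enforce it for you. This is an optional sketch using contextlib.closing (with an in-memory database and a made-up table), not a pattern the course requires:

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() guarantees conn.close() runs even if a query in the middle fails
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE table_a (x INTEGER)")
    conn.execute("INSERT INTO table_a VALUES (1)")
    df_1 = pd.read_sql("SELECT * FROM table_a", conn)

# Past this line the connection is closed and the database file is unlocked
```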

Reading back from the database

You already know this from 🖥️ W10 Lecture:

df = pd.read_sql(
    "SELECT * FROM title_basics LIMIT 10",
    conn
)

Last week we had a side-by-side comparison of pandas operations and their SQL equivalents. The same SELECT, WHERE, JOIN, and GROUP BY you practised in W10 all work here against your own database.
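As a quick reminder of that side-by-side equivalence, here is one aggregation done both ways against the same small made-up table:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
df = pd.DataFrame({
    "genres": ["Drama", "Drama", "Crime"],
    "num_votes": [100, 200, 50],
})
df.to_sql("title_basics", conn, index=False)

# Same aggregation, two dialects
sql_result = pd.read_sql(
    "SELECT genres, SUM(num_votes) AS total FROM title_basics GROUP BY genres",
    conn,
)
pandas_result = df.groupby("genres", as_index=False)["num_votes"].sum()
```

Both give Drama 300 and Crime 50; use whichever reads better for the question at hand.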

The full cycle:

  1. Build DataFrames in pandas (your collection notebooks)
  2. Push them to the database with df.to_sql()
  3. Read from the database with pd.read_sql() in your analysis notebooks

Your EDA notebooks should read from the database, not from raw CSV or JSON files. This ensures you’re working with the cleaned, structured version of your data.

For the 📦 Group Project self-chosen track: your NB03 analysis notebooks must read data from the SQLite database, not directly from raw files. This is part of demonstrating a proper data pipeline.

☕ Coffee Break

Part 2: 🎛️ Building Dashboards with Streamlit

Your database holds the data. Now let’s build something people can interact with.

From Markdown to HTML to Streamlit

In 🖥️ W10 Lecture you saw how docs/index.md becomes docs/index.html:

Markdown | HTML | What it renders
---|---|---
# Title | <h1>Title</h1> | Large heading
## Section | <h2>Section</h2> | Section heading
**bold** | <strong>bold</strong> | Bold text
- item | <li>item</li> | List item

Streamlit follows the same idea, but in Python:

st.title("Title")       # like <h1>
st.header("Section")    # like <h2>
st.markdown("**bold**") # renders any Markdown

Why Streamlit?

Streamlit turns a Python script into a web app. You write Python, and Streamlit handles the HTML, CSS, and interactivity for you.

No need to learn web development. If you can write a pandas script, you can build a dashboard.

import streamlit as st
import pandas as pd

st.title("My Dashboard")
st.write("This is a paragraph of text.")

Install Streamlit once in the environment you use for the project (local machine, Nuvolos terminal, and so on):

pip install streamlit

📦 Group project: put pip install streamlit (or your equivalent, for example a requirements.txt line) in your README so teammates and markers can run your dashboard without guessing.

Run it from the terminal:

streamlit run dashboard/app.py

Your first Streamlit app

The building blocks:

(Just a mock example, not real)

import streamlit as st
import pandas as pd
import sqlite3

st.title("IMDb Explorer")
st.write("""
    Browse popular movies from the
    IMDb database.
""")

conn = sqlite3.connect("../data/processed/imdb_2026.db")
df = pd.read_sql(
    "SELECT * FROM title_basics",
    conn
)

st.dataframe(df.head(20))
conn.close()

st.dataframe() renders a sortable, scrollable table. It already handles thousands of rows.

Other display functions:

Function | What it shows
---|---
st.dataframe(df) | Interactive table (sortable)
st.table(df) | Static table
st.metric("Label", value) | Single number with label
st.bar_chart(df) | Quick bar chart
st.line_chart(df) | Quick line chart

Adding interactivity with widgets

Streamlit widgets let users filter and explore the data:

# Dropdown for selecting a genre
genre = st.selectbox(
    "Choose a genre:",
    df["genres"].unique()
)

# Filter the DataFrame
filtered_df = df[df["genres"] == genre]

# Show the filtered result
st.dataframe(filtered_df)
st.write(
    f"Showing {len(filtered_df)} titles."
)

Every time the user picks a new genre, Streamlit reruns the entire script from top to bottom. The variable genre holds the new selection, so filtered_df updates automatically.

Common widgets:

Widget | Use case
---|---
st.selectbox() | Pick one option from a list
st.multiselect() | Pick several options
st.slider() | Choose a numeric range
st.text_input() | Free text search
st.checkbox() | Toggle something on or off
st.download_button() | Let users download a CSV

The pattern is always the same: create a widget, capture its return value, then use that value to filter or transform your DataFrame.

min_votes = st.slider(
    "Minimum votes:", 0, 100000, 10000
)
filtered = df[df["num_votes"] >= min_votes]

Connecting your dashboard to the database

Since Streamlit reruns the entire script on every interaction, you want to avoid reconnecting to the database every time:

@st.cache_resource
def get_connection():
    return sqlite3.connect("data/processed/my_project.db", check_same_thread=False)

conn = get_connection()

@st.cache_resource tells Streamlit to run this function once and reuse the result on every rerun. The connection stays open across interactions.

For loading data, use @st.cache_data instead:

@st.cache_data
def load_data():
    conn = get_connection()
    return pd.read_sql("SELECT * FROM title_basics", conn)

df = load_data()

This caches the DataFrame so the SQL query only runs once, not on every widget click.

Keeping your code clean: utils.py

Without utils.py: everything in app.py

# app.py — gets long fast
import streamlit as st
import pandas as pd
import sqlite3

@st.cache_resource
def get_connection():
    return sqlite3.connect(
        "data/processed/my_project.db",
        check_same_thread=False
    )

@st.cache_data
def load_data():
    conn = get_connection()
    return pd.read_sql(
        "SELECT * FROM my_table", conn
    )

def filter_by_country(df, country):
    return df[df["country"] == country]

# ... 200 more lines of mixed
#     data logic and layout code ...

With utils.py: separate concerns

# utils.py — data logic lives here
import pandas as pd
import sqlite3
import streamlit as st

@st.cache_resource
def get_connection():
    return sqlite3.connect(
        "data/processed/my_project.db",
        check_same_thread=False
    )

@st.cache_data
def load_data():
    conn = get_connection()
    return pd.read_sql(
        "SELECT * FROM my_table", conn
    )

def filter_by_country(df, country):
    return df[df["country"] == country]

# app.py — layout only
import streamlit as st
from utils import load_data, filter_by_country

df = load_data()
country = st.selectbox("Country:", df["country"].unique())
st.dataframe(filter_by_country(df, country))

When is a dashboard the right choice?

A dashboard is the right tool when:

  • The data changes over time and someone needs to monitor it (live metrics, real-time feeds)
  • You want the reader to draw their own conclusions about the data
  • You want to let someone fact-check specific values or filter to their region/category

A dashboard is not the right tool when:

  • You’ve done analysis and have specific findings to communicate
  • You want to tell a story that guides the reader through your reasoning
  • Your data doesn’t change after collection

The test:

“If I remove all interactivity and just show static charts, does the reader still understand my argument?”

If yes, you probably need a report, not a dashboard.

If no, the interactivity is earning its place.

For most data science projects, you need to make an argument first and offer exploration second. The WFP track is a deliberate exception: the dashboard is a monitoring tool for analysts who will use it with their own questions.

Let me show you what I mean

🔗 LINK: Why do children and young people in smaller towns do better academically than those in larger towns? (Office for National Statistics, 2023)

I’m going to open this page and scroll through it together.

Notice how the ONS:

  • Opens with a clear research question as the title
  • Presents findings as a narrative with annotated charts
  • Guides you through each step of the argument
  • Places an interactive dashboard at the very end for readers who want to explore further

The dashboard supports the narrative. It doesn’t replace it.

The DS105 policy on dashboards

Track | Dashboard role
---|---
WFP track | The dashboard is a core deliverable. Virginia’s team will use it to monitor food security data. This is a monitoring use case where interactivity is the point.
Self-chosen track | A dashboard can be an addendum to your narrative website, but it is not the main deliverable. Your website must tell the story. If you include a dashboard, explain in your narrative what users should explore and why.

The risk to avoid: building a dashboard with filters and charts but no interpretation. That shifts the analytical work from you to the reader, and the rubric assesses your analytical decisions and communication, not the reader’s ability to explore your data.

For the 📦 Group Project: the Communication criterion (40 marks) rewards clear findings and well-structured arguments. A dashboard alone, without a narrative that explains what the data shows, will score poorly on Communication.

Part 3: 🤖 Agentic Coding with GitHub Copilot

⚠️ This section is not being recorded. It is only available for students attending in person today.

How people code in 2026

The way software gets written has changed.

In a typical professional workflow today, a developer:

  1. Describes what they want to build
  2. Reviews what the AI suggests
  3. Edits, tests, and iterates
  4. Takes responsibility for the final code

The AI handles boilerplate, syntax lookup, and first drafts. The human handles intent, quality, and accountability.

This is sometimes called agentic coding: you direct an AI agent to write code on your behalf, and you remain in control of the outcome.

“Vibe coding” is a term coined by Andrej Karpathy (a former OpenAI researcher) to describe a more relaxed version of this: you describe what you want in natural language, accept whatever the AI produces, and see if it works.

Vibe coding can be fast for prototyping. But for assessed work, you need to understand what the code does and why it works. The rubric assesses your analytical decisions, and “the AI wrote it” is not a justification.

The skill is learning to work with the AI productively, not handing over control.

GitHub Copilot in VS Code

If you have a GitHub Education account, GitHub Copilot is free.

In VS Code (including on Nuvolos), Copilot appears as:

  • Inline suggestions: greyed-out code that appears as you type. Press Tab to accept, Esc to dismiss.
  • Chat panel: a sidebar where you can ask questions about your code, request changes, or describe what you want to build.
  • Agentic mode: Copilot can edit multiple files, run terminal commands, and iterate on errors.

👨‍💻 Let me set this up and show you how it works live.

The 4Ds: a framework for working with AI

Anthropic (the company behind Claude) collaborated with Professors Rick Dakan and Joseph Feller to develop the 4D AI Fluency Framework. It describes four competencies for working with AI effectively:

Competency | What it means | In practice with Copilot
---|---|---
Delegation | Deciding what work is appropriate for you, for AI, or for both | “Should I ask Copilot to write this filtering logic, or do I need to think through the logic myself first?”
Description | Giving the AI clear context about your goals and constraints | “Write a function that filters df to rows where start_year is between 2000 and 2020 and num_votes > 10000” vs “filter the data”
Discernment | Critically evaluating what the AI produces | “Does this output match what I expected? Are there edge cases it missed? Would I get the same result doing it manually?”
Diligence | Taking responsibility for the final output | “I wrote this code using Copilot. I understand what it does. I’ve tested it. I’m accountable for it.”

Let me show you this live

What this means for your 📦 Group Project

You can use GitHub Copilot and other AI tools in your project. The course has never prohibited AI; it asks you to use it thoughtfully.

Your individual reflection (reflections/<github-username>.md) should describe how you used AI tools and what you learned from doing so. The rubric rewards honest reflection on your process.

Apply the 4Ds:

  • Delegate tasks where AI saves time without skipping learning (boilerplate, syntax lookup, reformatting)
  • Describe your intent clearly so the AI produces something close to what you need
  • Discern whether the output is correct by testing it against your data and your understanding
  • Be diligent about the final result. If you can’t explain what a line of code does, don’t submit it

What’s Next?

  • Easter Break starts after today. There is no more scheduled teaching for DS105W this term. Use the break to advance your 📦 Group Project.

  • Virginia Leape (WFP Regional VAM Officer) visits LSE around 20 April. WFP track groups should have API access and initial data collection working by then.

  • 📦 Group Project deadline: Tuesday 26 May 2026, 8 pm UK time. Both tracks submit through GitHub Classroom.

  • Stay connected on Slack: check #announcements for updates, use your group channels to coordinate, and reach out to the teaching team if you get stuck.

🔖 Appendix

Post-Lecture Actions

  • Self-chosen track: sketch your database schema (which tables, which columns, which keys connect them) before writing any CREATE TABLE statements
  • WFP track: start experimenting with streamlit run on a minimal app.py
  • Both tracks: agree with your group how you’ll organise work across the Easter break
  • If you have a GitHub Education account, set up Copilot in VS Code

Looking Ahead

  • Easter break: Front-load project work. The break is your biggest block of uninterrupted time.
  • ~20 April: Virginia Leape visits LSE (WFP track milestone)
  • 26 May, 8pm: 📦 Group Project final submission (40% of grade + 10% individual reflection)
  • Slack: #announcements for updates, group channels for coordination