DS105 2025-2026 Winter Term

🖥️ Week 10 Lecture

Introduction to SQL and Databases + Group Project Briefing

Author

Dr Jon Cardoso-Silva

Published

26 March 2026

🥅 Learning Goals

By the end of this lecture, you should be able to: i) connect to a SQLite database from Python with sqlite3, ii) inspect database structure using sqlite_master and PRAGMA table_info(), iii) translate pandas operations you already know (head(), boolean masking, groupby(), merge()) into equivalent SQL queries, iv) explain why databases exist and when they are more appropriate than CSV files, v) use .explode() to convert list-valued columns into separate rows for analysis.

No slides this week. This page is the reference copy: read through it after the lecture to revisit the SQL-pandas mapping and the group project briefing at your own pace.

📍 Session Details

  • Date : Thursday, 26 March 2026
  • Time : 16:00 - 18:00
  • Location : CKK.LG.01

📋 Preparation

  • You went to the 🖥️ W09 Lecture and 💻 W09 Lab
  • Your ✍️ Mini-Project 2 should be nearly done or submitted
  • If you have not joined a group yet, read the announcement on Moodle and either join your group’s repository on GitHub, post in #social on Slack, or reply to the Moodle thread

🗣️ Lecture Overview

Hour 1: SQL and Databases
  • Why databases exist and when CSVs stop being practical
  • Connecting to a SQLite database with sqlite3
  • Inspecting the catalogue: sqlite_master and PRAGMA table_info()
  • SELECT, WHERE, COUNT, GROUP BY, ORDER BY, JOIN: each one mapped to the pandas equivalent you already know
  • .explode() for list-valued columns (the genres problem)
Hour 2: Group Project Briefing + Team Activity
  • Introduction to the 📦 Group Project requirements and what to bring to the formative pitch (Mon 30 / Tue 31 March)
  • The 🌍 WFP Track : building a food security data pipeline for the World Food Programme
  • Team activity: create a GitHub Project Board for your group

Hour 1: SQL and Databases

When you work with much larger volumes of data, or need to read data shared across an entire organisation, CSVs will not cut it. You will need databases. Most large organisations store their data in proprietary systems backed by databases, and if you want to analyse that data or build algorithms on top of it, you will need to read from those databases.

The core reason is straightforward: when a table has millions (or billions) of rows, you cannot load the entire thing into your computer’s memory. A CSV forces you to read everything at once. A database lets you ask for exactly the rows and columns you need, and the database engine handles the rest. The language you use to make those requests is called SQL (Structured Query Language), and it has been the standard way to talk to databases since the 1970s.

The good news is that SQL maps closely to pandas operations you already know. We will walk through that mapping side by side, using the same IMDb dataset from the autumn term.

Connecting to the database

One line. That is all you need.

import sqlite3
import pandas as pd

conn = sqlite3.connect("data/processed/imdb_2026.db")

conn is something called a connection object. You pass it to pd.read_sql() every time you want to run a query. The database file (imdb_2026.db) is a curated subset of the IMDb Non-Commercial Datasets filtered to titles with endYear = 2026.
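Before touching the real file, you can try the same pattern on a throwaway in-memory database. The sketch below uses a made-up table and row purely for illustration; the connect-query-close rhythm is identical with imdb_2026.db.

```python
import sqlite3
import pandas as pd

# ":memory:" creates a temporary database that vanishes when closed,
# handy for experimenting without touching imdb_2026.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title_basics (tconst TEXT, primaryTitle TEXT)")
conn.execute("INSERT INTO title_basics VALUES ('tt0000001', 'Example Title')")
conn.commit()

df = pd.read_sql("SELECT * FROM title_basics", conn)
conn.close()
```

Calling conn.close() when you are done is good practice with the real database too.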

What tables exist?

pd.read_sql("""
    SELECT name
    FROM sqlite_master
    WHERE type = 'table'
    ORDER BY name;
""", conn)
name
0 name_basics
1 title_akas
2 title_basics
3 title_crew
4 title_episode
5 title_principals
6 title_ratings

sqlite_master is the database’s own catalogue. It lists everything stored in the database. You used this in the autumn term too.

Inspecting table structure

pd.read_sql("PRAGMA table_info(title_basics)", conn)
name type
0 tconst TEXT
1 titleType TEXT
2 primaryTitle TEXT
3 originalTitle TEXT
4 isAdult INTEGER
5 startYear INTEGER
6 endYear INTEGER
7 runtimeMinutes INTEGER
8 genres TEXT

PRAGMA table_info() is SQLite’s equivalent of looking at .dtypes and .columns on a DataFrame. tconst is the primary key for title_basics and appears as a foreign key in almost every other table. Think of it as the merge key from W08’s pd.merge() work.
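The two inspection queries combine naturally. A sketch (again on a throwaway in-memory database with made-up tables) that builds a {table: columns} overview of the whole schema:

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title_basics (tconst TEXT, primaryTitle TEXT)")
conn.execute("CREATE TABLE title_ratings (tconst TEXT, averageRating REAL)")

tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    conn,
)

# One PRAGMA per table gives a {table: [columns]} dictionary of the schema.
schema = {
    name: pd.read_sql(f"PRAGMA table_info({name})", conn)["name"].tolist()
    for name in tables["name"]
}
conn.close()
```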

SQL vs pandas: a translation guide

Everything below follows the same pattern. The left column shows the SQL query; the right column shows the pandas code that does the same thing. Same result, different grammar.

SELECT + LIMIT = column selection + .head()

SQL

SELECT primaryTitle, startYear
FROM title_basics
LIMIT 5

pandas

title_basics[["primaryTitle", "startYear"]].head(5)

SQL declares what you want; pandas chains operations. LIMIT and .head() do the same job.

WHERE = boolean masking

SQL

SELECT primaryTitle,
       startYear,
       runtimeMinutes
FROM title_basics
WHERE titleType = 'movie'
  AND startYear = 2026
LIMIT 5

pandas

mask = (
    (title_basics["titleType"] == "movie")
    & (title_basics["startYear"] == 2026)
)

(
    title_basics[mask]
    [["primaryTitle", "startYear", "runtimeMinutes"]]
    .head(5)
)

The AND in SQL maps to & in pandas. You practised boolean masking in W02 to W04. The .query() method is an alternative that reads more like SQL:

(
    title_basics
    .query("titleType == 'movie' & startYear == 2026")
    [["primaryTitle", "startYear", "runtimeMinutes"]]
    .head(5)
)
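If you find yourself editing the year or title type inside the query string, pd.read_sql also accepts a params= argument with ? placeholders. The sketch below uses a tiny made-up table:

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_basics "
    "(primaryTitle TEXT, titleType TEXT, startYear INTEGER)"
)
conn.executemany(
    "INSERT INTO title_basics VALUES (?, ?, ?)",
    [("A Movie", "movie", 2026), ("A Show", "tvSeries", 2026)],
)

# "?" placeholders let the driver substitute the values safely,
# so the query string itself never changes.
result = pd.read_sql(
    "SELECT primaryTitle FROM title_basics "
    "WHERE titleType = ? AND startYear = ?",
    conn,
    params=("movie", 2026),
)
conn.close()
```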

IS NOT NULL = .notna()

SQL

SELECT primaryTitle, runtimeMinutes
FROM title_basics
WHERE runtimeMinutes IS NOT NULL
LIMIT 5

pandas

mask = title_basics["runtimeMinutes"].notna()
(
    title_basics[mask][["primaryTitle", "runtimeMinutes"]]
    .head(5)
)

IS NOT NULL in SQL is .notna() in pandas. You did completeness checks with .isna() in W09. Same concept, SQL vocabulary.

COUNT = .shape[0]

SQL

SELECT COUNT(*)
FROM title_basics
WHERE titleType = 'movie'

pandas

title_basics[title_basics["titleType"] == "movie"].shape[0]

Both answer “how many rows match this condition?”

COUNT DISTINCT = .nunique()

SQL

SELECT COUNT(DISTINCT titleType)
FROM title_basics

pandas

title_basics["titleType"].nunique()

COUNT counts rows. COUNT DISTINCT counts unique values. Ask yourself: what is the difference between “how many rows have a titleType?” and “how many different titleType values exist?” That distinction matters.
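A four-row toy Series (made up for illustration) makes the distinction concrete:

```python
import pandas as pd

# Four rows, one missing value, two distinct title types.
title_type = pd.Series(["movie", "movie", "tvSeries", None])

n_rows_with_value = title_type.count()   # non-null rows
n_distinct = title_type.nunique()        # distinct non-null values
```

Here count() returns 3 (the missing row is excluded) while nunique() returns 2.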

GROUP BY + COUNT + ORDER BY = .groupby().size().sort_values()

SQL

SELECT titleType,
       COUNT(*) AS n_titles
FROM title_basics
GROUP BY titleType
ORDER BY n_titles DESC

pandas

(
    title_basics
    .groupby("titleType")
    .size()
    .reset_index(name="n_titles")
    .sort_values("n_titles",
                 ascending=False)
)

Same split-apply-combine logic from W05. SQL reads top to bottom: select what you want, from where, group how, order which way. The pandas chain reads left to right. Same logic, different grammar.
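The groupby-size-sort chain is common enough that pandas offers a shorthand, .value_counts(). A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up data for illustration.
title_basics = pd.DataFrame(
    {"titleType": ["movie", "tvSeries", "movie", "movie"]}
)

# .value_counts() groups, counts, and sorts descending in one call;
# rename_axis/reset_index turn the result back into a two-column DataFrame.
counts = (
    title_basics["titleType"]
    .value_counts()
    .rename_axis("titleType")
    .reset_index(name="n_titles")
)
```

The longer groupby chain is still worth knowing, because it generalises to aggregations other than counting.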

JOIN = pd.merge()

The next two are longer, so the SQL and pandas versions are separated into collapsible blocks.

This query finds the 10 highest-rated titles with more than 100,000 votes. It combines data from title_ratings and title_basics using tconst as the shared key. The autumn lecture notebook ended with exactly this query, so this is the payoff that connects the two terms.

SQL version
SELECT b.primaryTitle,
       r.averageRating,
       r.numVotes
FROM title_ratings AS r
JOIN title_basics AS b
  ON r.tconst = b.tconst
WHERE r.numVotes > 100000
ORDER BY r.averageRating DESC
LIMIT 10

AS r and AS b are table aliases so you do not have to write the full table name each time. JOIN (also called INNER JOIN) keeps only rows where tconst appears in both tables.

pandas equivalent
merged = pd.merge(
    title_ratings,
    title_basics[["tconst", "primaryTitle"]],
    on="tconst",
    how="inner"
)

(
    merged[merged["numVotes"] > 100000]
    .sort_values("averageRating", ascending=False)
    .head(10)
)

You learned pd.merge() with on= and how= in W08. The how="inner" is the default and matches SQL’s JOIN.

LEFT JOIN = pd.merge(how="left")

A LEFT JOIN keeps all rows from the left table. Movies without ratings get NaN in the rating columns, just as they do with pd.merge(how="left").

SQL version
SELECT b.primaryTitle,
       b.startYear,
       r.averageRating,
       r.numVotes
FROM title_basics AS b
LEFT JOIN title_ratings AS r
  ON b.tconst = r.tconst
WHERE b.titleType = 'movie'
LIMIT 5

LEFT JOIN keeps every row from title_basics (the left table). If a movie has no entry in title_ratings, the rating columns come back as NULL.

pandas equivalent
merged = pd.merge(
    title_basics,
    title_ratings,
    on="tconst",
    how="left"
)

(
    merged[merged["titleType"] == "movie"]
    [["primaryTitle", "startYear",
      "averageRating", "numVotes"]]
    .head(5)
)

You learned how="left" in W08. Rows from title_basics without a match in title_ratings get NaN in the rating columns.
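One extra that pd.merge() offers here is indicator=True, which adds a _merge column saying where each row came from. That makes "movies with no rating" a simple filter. A sketch with made-up tables:

```python
import pandas as pd

# Made-up tables for illustration; tt2 has no rating.
title_basics = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]}
)
title_ratings = pd.DataFrame(
    {"tconst": ["tt1", "tt3"], "averageRating": [7.5, 8.1]}
)

# indicator=True tags each row as "both" or "left_only",
# so unmatched rows become a simple boolean mask.
merged = pd.merge(
    title_basics, title_ratings, on="tconst", how="left", indicator=True
)
unrated = merged[merged["_merge"] == "left_only"]["primaryTitle"].tolist()
```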

The genres problem: .explode()

Look at the genres column in title_basics:

title_basics[["primaryTitle", "genres"]].head(5)
primaryTitle genres
0 The Wheel of Time Action,Adventure,Drama
1 Stranger Things Drama,Fantasy,Horror
2 The Handmaid’s Tale Drama,Sci-Fi,Thriller
3 Cobra Kai Action,Comedy,Drama
4 Only Murders in the Building Comedy,Crime,Mystery

Each row has multiple genres packed into one string. If you run .value_counts() on this column, “Action,Adventure,Drama” is treated as a single genre. That is wrong. You want to count each genre separately.

Step 1: Split the string into a list

title_basics["genre_list"] = title_basics["genres"].str.split(",")

title_basics[["primaryTitle", "genres", "genre_list"]].head(3)
primaryTitle genres genre_list
0 The Wheel of Time Action,Adventure,Drama [Action, Adventure, Drama]
1 Stranger Things Drama,Fantasy,Horror [Drama, Fantasy, Horror]
2 The Handmaid’s Tale Drama,Sci-Fi,Thriller [Drama, Sci-Fi, Thriller]

Step 2: Explode the list into separate rows

exploded = title_basics[["primaryTitle", "genre_list"]].explode("genre_list")

Before (3 rows, one list per row)

primaryTitle genre_list
0 The Wheel of Time [Action, Adventure, Drama]
1 Stranger Things [Drama, Fantasy, Horror]
2 The Handmaid’s Tale [Drama, Sci-Fi, Thriller]

After .explode() (9 rows, one genre per row)

primaryTitle genre_list
0 The Wheel of Time Action
0 The Wheel of Time Adventure
0 The Wheel of Time Drama
1 Stranger Things Drama
1 Stranger Things Fantasy
1 Stranger Things Horror
2 The Handmaid’s Tale Drama
2 The Handmaid’s Tale Sci-Fi
2 The Handmaid’s Tale Thriller

Row 0 became three rows. Row 1 became three rows. The index repeats, and that is how you know which rows came from the same original title.

Step 3: Now you can count genres properly

exploded["genre_list"].value_counts().head(10)

And you can group by genre for summary statistics. Note that exploded above does not carry tconst, so we explode again with the key column kept for the merge:

merged_exploded = pd.merge(
    title_basics[["tconst", "genre_list"]].explode("genre_list"),
    title_ratings,
    on="tconst",
    how="left"
)

(
    merged_exploded
    .groupby("genre_list")
    .agg(
        median_rating=("averageRating", "median"),
        n_titles=("averageRating", "count")
    )
    .sort_values("median_rating", ascending=False)
    .head(10)
)

Without .explode(), a title tagged “Action,Adventure,Drama” only appears once. After exploding, it appears in all three genre groups. Distribution analysis (mean, median, histograms) requires each observation represented correctly. That is the connection to W09’s EDA workflow.
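If all you need is the genre counts themselves, .str.get_dummies() collapses the split-then-explode steps into one call. A sketch with made-up data:

```python
import pandas as pd

# Made-up genres column for illustration.
title_basics = pd.DataFrame(
    {"genres": ["Action,Drama", "Drama,Horror", "Comedy"]}
)

# str.get_dummies splits on the separator and one-hot encodes each genre;
# summing the 0/1 columns counts how many titles carry each genre.
genre_counts = title_basics["genres"].str.get_dummies(sep=",").sum()
```

For per-genre summary statistics (medians, histograms), stick with .explode(): the long format keeps each title-genre pair as a row you can merge and group on.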

Group project connection: Your project data may contain nested lists or comma-separated values too (for example, multiple IPC phases per country, or multiple genres per title). The same split-then-explode pattern applies. Whether to explode depends on your research question. If you are comparing totals, you do not need to explode. If you are comparing individual components, you do. Document this choice in your notebooks.

Hour 2: Group Project Briefing

The 📦 Group Project

The group project is worth 40% of your final grade (plus 10% for an individual reflection). Full details, both tracks, rubrics, and repository structure are on the 📦 Group Project specification page.

You choose one of two tracks:

🌍 WFP Track (prescribed)

Build a food security data pipeline for the World Food Programme’s East and Southern Africa regional office. You collect from four APIs (IPC, UNHCR, IOM, World Bank), produce a target CSV with reference-period logic, and deliver a Streamlit dashboard.

Virginia Leape from WFP will visit the LSE around 20 April to meet with WFP track groups.

Full WFP track details

🔍 Self-Chosen Track

Define your own research question, collect data from APIs using requests, store it in a SQLite database with at least 2 related tables, and present findings on a narrative website with a maximum of 3 distinct visualisations.

Full self-chosen track details

🗣️ Formative Pitch (W11)

On Monday 30 March (and Tuesday 31 March for some groups), your team will present your project idea to Jon and the teaching team. This is formative only and does not count toward your grade.

What to show us (on GitHub, in whatever format you like):

  • Your research question or the WFP problem as you understand it
  • Your project board with initial Issues and task assignments
  • Your planned approach: data sources, pipeline steps, who does what
  • One risk or open question you want feedback on

Full pitch details

Team Activity: Create a Project Board

For the last part of today’s lecture, get into your groups and create a project board on GitHub. If you have not joined your group’s repository yet, do it now: GitHub Classroom link.

Your project board should have at minimum:

  1. A column for backlog/to-do items
  2. A column for in-progress work
  3. A column for completed work
  4. At least 3 initial Issues describing your first tasks

Use this time to agree on your track (WFP or self-chosen), sketch a rough plan, and assign early responsibilities. You are preparing for the formative pitch on Monday/Tuesday.

📖 Read more: GitHub: About Projects | Quickstart for Projects

📓 Lecture Materials

Today’s SQL content is covered in a lecture notebook that walks through every query shown above using imdb_2026.db. Download the full Week 10 materials bundle (notebook + database) below.

🔖 Appendix


Looking Ahead

  • Friday W10 Lab: Git collaboration practice (branches, pull requests, merge conflicts) in your groups
  • Monday 30 / Tuesday 31 March: 🗣️ Formative Pitch presentations
  • Wednesday 1 April, 8 pm: ✍️ Mini-Project 2 deadline