🖥️ Week 10 Lecture
Introduction to SQL and Databases + Group Project Briefing
By the end of this lecture, you should be able to:
- connect to a SQLite database from Python with sqlite3
- inspect database structure using sqlite_master and PRAGMA table_info()
- translate pandas operations you already know (head(), boolean masking, groupby(), merge()) into equivalent SQL queries
- explain why databases exist and when they are more appropriate than CSV files
- use .explode() to convert list-valued columns into separate rows for analysis
No slides this week. This page is the reference copy: read through it after the lecture to revisit the SQL-pandas mapping and the group project briefing at your own pace.
📍 Session Details
- Date : Thursday, 26 March 2026
- Time : 16:00 - 18:00
- Location : CKK.LG.01
📋 Preparation
- You went to the 🖥️ W09 Lecture and 💻 W09 Lab
- Your ✍️ Mini-Project 2 should be nearly done or submitted
- If you have not joined a group yet, read the announcement on Moodle and either join your group’s repository on GitHub, post in #social on Slack, or reply to the Moodle thread
🗣️ Lecture Overview
Hour 1: SQL and Databases
- Why databases exist and when CSVs stop being practical
- Connecting to a SQLite database with sqlite3
- Inspecting the catalogue: sqlite_master and PRAGMA table_info()
- SELECT, WHERE, COUNT, GROUP BY, ORDER BY, JOIN: each one mapped to the pandas equivalent you already know
- .explode() for list-valued columns (the genres problem)
Hour 2: Group Project Briefing + Team Activity
- Introduction to the 📦 Group Project requirements and what to bring to the formative pitch (Mon 30 / Tue 31 March)
- The 🌍 WFP Track : building a food security data pipeline for the World Food Programme
- Team activity: create a GitHub Project Board for your group
Hour 1: SQL and Databases
Once you work with much larger volumes of data, or need to read data shared across an entire organisation, CSVs will not cut it. You will need databases. Most large organisations store their data in proprietary systems backed by databases, and if you want to analyse that data or build algorithms on top of it, you will need to read from those databases.
The core reason is straightforward: when a table has millions (or billions) of rows, you cannot load the entire thing into your computer’s memory. A CSV forces you to read everything at once. A database lets you ask for exactly the rows and columns you need, and the database engine handles the rest. The language you use to make those requests is called SQL (Structured Query Language), and it has been the standard way to talk to databases since the 1970s.
The good news is that SQL maps closely to pandas operations you already know. We will walk through that mapping side by side, using the same IMDb dataset from the autumn term.
Connecting to the database
One line. That is all you need.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("data/processed/imdb_2026.db")
```

conn is something called a connection object. You pass it to pd.read_sql() every time you want to run a query. The database file (imdb_2026.db) is a curated subset of the IMDb Non-Commercial Datasets filtered to titles with endYear = 2026.
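To try this pattern without the real database file, here is a minimal, self-contained sketch: it builds a throwaway in-memory SQLite database (`:memory:`), writes an invented two-row table named after the lecture's title_basics, and reads it back with pd.read_sql().

```python
import sqlite3
import pandas as pd

# A throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Invented data; the table name mirrors the lecture's title_basics.
toy = pd.DataFrame({"tconst": ["tt1", "tt2"], "primaryTitle": ["A", "B"]})
toy.to_sql("title_basics", conn, index=False)

# Pass the query string and the connection object to pd.read_sql().
result = pd.read_sql("SELECT primaryTitle FROM title_basics", conn)
```

The same two-argument pd.read_sql(query, conn) call is all you need against imdb_2026.db as well.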
What tables exist?
```python
pd.read_sql("""
SELECT name
FROM sqlite_master
WHERE type = 'table'
ORDER BY name;
""", conn)
```

| | name |
|---|---|
| 0 | name_basics |
| 1 | title_akas |
| 2 | title_basics |
| 3 | title_crew |
| 4 | title_episode |
| 5 | title_principals |
| 6 | title_ratings |
sqlite_master is the database’s own catalogue. It lists everything stored in the database. You used this in the autumn term too.
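Because sqlite_master exists in every SQLite database, the same catalogue query works on a toy in-memory database too. A minimal sketch with two invented tables:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Two invented tables so the catalogue has something to list.
pd.DataFrame({"x": [1]}).to_sql("basics", conn, index=False)
pd.DataFrame({"x": [1]}).to_sql("ratings", conn, index=False)

# sqlite_master lists every object in the database; filter to tables only.
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;",
    conn,
)
```

Running the same query against imdb_2026.db produces the seven-table listing shown above.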
Inspecting table structure
```python
pd.read_sql("PRAGMA table_info(title_basics)", conn)
```

| | name | type |
|---|---|---|
| 0 | tconst | TEXT |
| 1 | titleType | TEXT |
| 2 | primaryTitle | TEXT |
| 3 | originalTitle | TEXT |
| 4 | isAdult | INTEGER |
| 5 | startYear | INTEGER |
| 6 | endYear | INTEGER |
| 7 | runtimeMinutes | INTEGER |
| 8 | genres | TEXT |
PRAGMA table_info() is SQLite’s equivalent of looking at .dtypes and .columns on a DataFrame. tconst is the primary key for title_basics and appears as a foreign key in almost every other table. Think of it as the merge key from W08’s pd.merge() work.
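You can check the .dtypes analogy yourself on a toy table. In this sketch (invented data, column names mirroring the lecture's), PRAGMA table_info() reports one row per column with its SQLite storage type:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"tconst": ["tt1"], "startYear": [2026]})
df.to_sql("title_basics", conn, index=False)

# One row per column: cid, name, type, notnull, dflt_value, pk.
schema = pd.read_sql("PRAGMA table_info(title_basics)", conn)
```

The string column comes back as TEXT and the integer column as INTEGER, matching what object and int64 dtypes become when pandas writes to SQLite.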
SQL vs pandas: a translation guide
Everything below follows the same pattern. The left column shows the SQL query; the right column shows the pandas code that does the same thing. Same result, different grammar.
SELECT + LIMIT = column selection + .head()
SQL

```sql
SELECT primaryTitle, startYear
FROM title_basics
LIMIT 5
```

pandas

```python
title_basics[["primaryTitle", "startYear"]].head(5)
```

SQL declares what you want; pandas chains operations. LIMIT and .head() do the same job.
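A runnable sketch of this equivalence, using a seven-row invented title_basics written to an in-memory database so both sides can be compared directly:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
title_basics = pd.DataFrame({
    "primaryTitle": list("ABCDEFG"),
    "startYear": [2020 + i for i in range(7)],
    "runtimeMinutes": [90 + i for i in range(7)],
})
title_basics.to_sql("title_basics", conn, index=False)

via_sql = pd.read_sql(
    "SELECT primaryTitle, startYear FROM title_basics LIMIT 5", conn
)
via_pandas = title_basics[["primaryTitle", "startYear"]].head(5)

# Same rows, same columns -- only the grammar differs.
assert via_sql.equals(via_pandas.reset_index(drop=True))
```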
WHERE = boolean masking
SQL

```sql
SELECT primaryTitle,
       startYear,
       runtimeMinutes
FROM title_basics
WHERE titleType = 'movie'
  AND startYear = 2026
LIMIT 5
```

pandas

```python
mask = (
    (title_basics["titleType"] == "movie")
    & (title_basics["startYear"] == 2026)
)
(
    title_basics[mask]
    [["primaryTitle", "startYear", "runtimeMinutes"]]
    .head(5)
)
```

The AND in SQL maps to & in pandas. You practised boolean masking in W02 to W04. The .query() method is an alternative that reads more like SQL:
```python
(
    title_basics
    .query("titleType == 'movie' & startYear == 2026")
    [["primaryTitle", "startYear", "runtimeMinutes"]]
    .head(5)
)
```

IS NOT NULL = .notna()
SQL
```sql
SELECT primaryTitle, runtimeMinutes
FROM title_basics
WHERE runtimeMinutes IS NOT NULL
LIMIT 5
```

pandas

```python
mask = title_basics["runtimeMinutes"].notna()
(
    title_basics[mask][["primaryTitle", "runtimeMinutes"]]
    .head(5)
)
```

IS NOT NULL in SQL is .notna() in pandas. You did completeness checks with .isna() in W09. Same concept, SQL vocabulary.
COUNT = .shape[0]
SQL
```sql
SELECT COUNT(*)
FROM title_basics
WHERE titleType = 'movie'
```

pandas

```python
title_basics[title_basics["titleType"] == "movie"].shape[0]
```

Both answer “how many rows match this condition?”
COUNT DISTINCT = .nunique()
SQL
```sql
SELECT COUNT(DISTINCT titleType)
FROM title_basics
```

pandas

```python
title_basics["titleType"].nunique()
```

COUNT counts rows. COUNT DISTINCT counts unique values. Ask yourself: what is the difference between “how many rows have a titleType?” and “how many different titleType values exist?” That distinction matters.
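The distinction is easy to see on invented data: five rows, but only three distinct titleType values.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
title_basics = pd.DataFrame(
    {"titleType": ["movie", "movie", "tvSeries", "short", "movie"]}
)
title_basics.to_sql("title_basics", conn, index=False)

# COUNT(*) counts rows; COUNT(DISTINCT ...) counts unique values.
n_rows = pd.read_sql("SELECT COUNT(*) AS n FROM title_basics", conn)["n"][0]
n_types = pd.read_sql(
    "SELECT COUNT(DISTINCT titleType) AS n FROM title_basics", conn
)["n"][0]
```

Here n_rows is 5 while n_types (and title_basics["titleType"].nunique()) is 3.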
GROUP BY + COUNT + ORDER BY = .groupby().size().sort_values()
SQL
```sql
SELECT titleType,
       COUNT(*) AS n_titles
FROM title_basics
GROUP BY titleType
ORDER BY n_titles DESC
```

pandas

```python
(
    title_basics
    .groupby("titleType")
    .size()
    .reset_index(name="n_titles")
    .sort_values("n_titles", ascending=False)
)
```

Same split-apply-combine logic from W05. SQL reads top to bottom: select what you want, from where, group how, order which way. The pandas chain reads left to right. Same logic, different grammar.
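A self-contained version of the comparison, on invented data small enough to check by hand (three movies, two TV series):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
title_basics = pd.DataFrame(
    {"titleType": ["movie", "movie", "tvSeries", "movie", "tvSeries"]}
)
title_basics.to_sql("title_basics", conn, index=False)

via_sql = pd.read_sql(
    """
    SELECT titleType, COUNT(*) AS n_titles
    FROM title_basics
    GROUP BY titleType
    ORDER BY n_titles DESC
    """,
    conn,
)
via_pandas = (
    title_basics.groupby("titleType")
    .size()
    .reset_index(name="n_titles")
    .sort_values("n_titles", ascending=False)
    .reset_index(drop=True)  # match the fresh 0..n index pd.read_sql returns
)
```

Both frames come out as movie: 3, tvSeries: 2, in that order.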
JOIN = pd.merge()
The next two are longer, so the SQL and pandas versions are separated into collapsible blocks.
This query finds the 10 highest-rated titles with more than 100,000 votes. It combines data from title_ratings and title_basics using tconst as the shared key. The autumn lecture notebook ended with exactly this query, so this is the payoff that connects the two terms.
SQL version
```sql
SELECT b.primaryTitle,
       r.averageRating,
       r.numVotes
FROM title_ratings AS r
JOIN title_basics AS b
  ON r.tconst = b.tconst
WHERE r.numVotes > 100000
ORDER BY r.averageRating DESC
LIMIT 10
```

AS r and AS b are table aliases so you do not have to write the full table name each time. JOIN (also called INNER JOIN) keeps only rows where tconst appears in both tables.
pandas equivalent
```python
merged = pd.merge(
    title_ratings,
    title_basics[["tconst", "primaryTitle"]],
    on="tconst",
    how="inner"
)
(
    merged[merged["numVotes"] > 100000]
    .sort_values("averageRating", ascending=False)
    .head(10)
)
```

You learned pd.merge() with on= and how= in W08. The how="inner" is the default and matches SQL’s JOIN.
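A miniature of the inner-join equivalence, assuming invented data where only two of three titles have ratings:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
title_basics = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]}
)
title_ratings = pd.DataFrame(
    {"tconst": ["tt1", "tt3"], "averageRating": [8.1, 6.4]}
)
title_basics.to_sql("title_basics", conn, index=False)
title_ratings.to_sql("title_ratings", conn, index=False)

# Inner join: only tconst values present in BOTH tables survive.
via_sql = pd.read_sql(
    """
    SELECT b.primaryTitle, r.averageRating
    FROM title_ratings AS r
    JOIN title_basics AS b ON r.tconst = b.tconst
    """,
    conn,
)
via_pandas = pd.merge(title_ratings, title_basics, on="tconst", how="inner")
```

Title B (tt2) has no rating, so both results contain exactly two rows.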
LEFT JOIN = pd.merge(how="left")
A LEFT JOIN keeps all rows from the left table. Movies without ratings get NaN in the rating columns, just as they do with pd.merge(how="left").
SQL version
```sql
SELECT b.primaryTitle,
       b.startYear,
       r.averageRating,
       r.numVotes
FROM title_basics AS b
LEFT JOIN title_ratings AS r
  ON b.tconst = r.tconst
WHERE b.titleType = 'movie'
LIMIT 5
```

LEFT JOIN keeps every row from title_basics (the left table). If a movie has no entry in title_ratings, the rating columns come back as NULL.
pandas equivalent
```python
merged = pd.merge(
    title_basics,
    title_ratings,
    on="tconst",
    how="left"
)
(
    merged[merged["titleType"] == "movie"]
    [["primaryTitle", "startYear",
      "averageRating", "numVotes"]]
    .head(5)
)
```

You learned how="left" in W08. Rows from title_basics without a match in title_ratings get NaN in the rating columns.
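The NULL-to-NaN behaviour is easiest to see on the same invented three-title setup: the unrated title survives the left join on both sides, with a missing rating.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
title_basics = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]}
)
title_ratings = pd.DataFrame(
    {"tconst": ["tt1", "tt3"], "averageRating": [8.1, 6.4]}
)
title_basics.to_sql("title_basics", conn, index=False)
title_ratings.to_sql("title_ratings", conn, index=False)

# tt2 has no rating: LEFT JOIN keeps it with NULL, which pandas reads as NaN.
via_sql = pd.read_sql(
    """
    SELECT b.primaryTitle, r.averageRating
    FROM title_basics AS b
    LEFT JOIN title_ratings AS r ON b.tconst = r.tconst
    """,
    conn,
)
via_pandas = pd.merge(title_basics, title_ratings, on="tconst", how="left")
```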
The genres problem: .explode()
Look at the genres column in title_basics:
```python
title_basics[["primaryTitle", "genres"]].head(5)
```

| | primaryTitle | genres |
|---|---|---|
| 0 | The Wheel of Time | Action,Adventure,Drama |
| 1 | Stranger Things | Drama,Fantasy,Horror |
| 2 | The Handmaid’s Tale | Drama,Sci-Fi,Thriller |
| 3 | Cobra Kai | Action,Comedy,Drama |
| 4 | Only Murders in the Building | Comedy,Crime,Mystery |
Each row has multiple genres packed into one string. If you run .value_counts() on this column, “Action,Adventure,Drama” is treated as a single genre. That is wrong. You want to count each genre separately.
Step 1: Split the string into a list
```python
title_basics["genre_list"] = title_basics["genres"].str.split(",")
title_basics[["primaryTitle", "genres", "genre_list"]].head(3)
```

| | primaryTitle | genres | genre_list |
|---|---|---|---|
| 0 | The Wheel of Time | Action,Adventure,Drama | [Action, Adventure, Drama] |
| 1 | Stranger Things | Drama,Fantasy,Horror | [Drama, Fantasy, Horror] |
| 2 | The Handmaid’s Tale | Drama,Sci-Fi,Thriller | [Drama, Sci-Fi, Thriller] |
Step 2: Explode the list into separate rows
```python
exploded = title_basics[["primaryTitle", "genre_list"]].explode("genre_list")
```

Before (3 rows, one list per row)

| | primaryTitle | genre_list |
|---|---|---|
| 0 | The Wheel of Time | [Action, Adventure, Drama] |
| 1 | Stranger Things | [Drama, Fantasy, Horror] |
| 2 | The Handmaid’s Tale | [Drama, Sci-Fi, Thriller] |
After .explode() (9 rows, one genre per row; first 7 shown)

| | primaryTitle | genre_list |
|---|---|---|
| 0 | The Wheel of Time | Action |
| 0 | The Wheel of Time | Adventure |
| 0 | The Wheel of Time | Drama |
| 1 | Stranger Things | Drama |
| 1 | Stranger Things | Fantasy |
| 1 | Stranger Things | Horror |
| 2 | The Handmaid’s Tale | Drama |
Row 0 became three rows. Row 1 became three rows. The index repeats, and that is how you know which rows came from the same original title.
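The steps above condense into a self-contained miniature you can run to check the repeating-index behaviour (two invented titles, three genres each):

```python
import pandas as pd

toy = pd.DataFrame({
    "primaryTitle": ["The Wheel of Time", "Stranger Things"],
    "genres": ["Action,Adventure,Drama", "Drama,Fantasy,Horror"],
})

# Step 1: split the comma-separated string into a Python list per row.
toy["genre_list"] = toy["genres"].str.split(",")

# Step 2: explode one list element per row; the original index repeats.
exploded = toy.explode("genre_list")
```

Two rows become six, and the index [0, 0, 0, 1, 1, 1] tells you which exploded rows share an original title.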
Step 3: Now you can count genres properly
```python
exploded["genre_list"].value_counts().head(10)
```

And you can group by genre for summary statistics:
```python
# exploded above kept only primaryTitle and genre_list, so re-explode
# here keeping tconst -- the key we need to merge the ratings back in.
merged_exploded = pd.merge(
    title_basics[["tconst", "genre_list"]].explode("genre_list"),
    title_ratings,
    on="tconst",
    how="left"
)
(
    merged_exploded
    .groupby("genre_list")
    .agg(
        median_rating=("averageRating", "median"),
        n_titles=("averageRating", "count")
    )
    .sort_values("median_rating", ascending=False)
    .head(10)
)
```

Without .explode(), a title tagged “Action,Adventure,Drama” only appears once. After exploding, it appears in all three genre groups. Distribution analysis (mean, median, histograms) requires each observation to be represented correctly. That is the connection to W09’s EDA workflow.
Group project connection: Your project data may contain nested lists or comma-separated values too (for example, multiple IPC phases per country, or multiple genres per title). The same split-then-explode pattern applies. Whether to explode depends on your research question. If you are comparing totals, you do not need to explode. If you are comparing individual components, you do. Document this choice in your notebooks.
Hour 2: Group Project Briefing
The 📦 Group Project
The group project is worth 40% of your final grade (plus 10% for an individual reflection). Full details, both tracks, rubrics, and repository structure are on the 📦 Group Project specification page.
You choose one of two tracks:
🌍 WFP Track (prescribed)
Build a food security data pipeline for the World Food Programme’s East and Southern Africa regional office. You collect from four APIs (IPC, UNHCR, IOM, World Bank), produce a target CSV with reference-period logic, and deliver a Streamlit dashboard.
Virginia Leape from WFP will visit the LSE around 20 April to meet with WFP track groups.
🔍 Self-Chosen Track
Define your own research question, collect data from APIs using requests, store it in a SQLite database with at least 2 related tables, and present findings on a narrative website with a maximum of 3 distinct visualisations.
🗣️ Formative Pitch (W11)
On Monday 30 March (and Tuesday 31 March for some groups), your team will present your project idea to Jon and the teaching team. This is formative only and does not count toward your grade.
What to show us (on GitHub, in whatever format you like):
- Your research question or the WFP problem as you understand it
- Your project board with initial Issues and task assignments
- Your planned approach: data sources, pipeline steps, who does what
- One risk or open question you want feedback on
Team Activity: Create a Project Board
For the last part of today’s lecture, get into your groups and create a project board on GitHub. If you have not joined your group’s repository yet, do it now: GitHub Classroom link.
Your project board should have at minimum:
- A column for backlog/to-do items
- A column for in-progress work
- A column for completed work
- At least 3 initial Issues describing your first tasks
Use this time to agree on your track (WFP or self-chosen), sketch a rough plan, and assign early responsibilities. You are preparing for the formative pitch on Monday/Tuesday.
📖 Read more: GitHub: About Projects | Quickstart for Projects
📓 Lecture Materials
Today’s SQL content is covered in a lecture notebook that walks through every query shown above using imdb_2026.db. Download the full Week 10 materials bundle (notebook + database) below.
🔖 Appendix
Useful Links
- 📦 Group Project
- 💻 W10 Lab
- 📔 Syllabus
- ✋ Contact Hours
Looking Ahead
- Friday W10 Lab: Git collaboration practice (branches, pull requests, merge conflicts) in your groups
- Monday 30 / Tuesday 31 March: 🗣️ Formative Pitch presentations
- Wednesday 1 April, 8 pm: ✍️ Mini-Project 2 deadline
