DS105 2025-2026 Winter Term

🖥️ Week 10 Lecture

Introduction to SQL and Databases + Group Project Briefing

Author

Dr Jon Cardoso-Silva

Published

26 March 2026

🥅 Learning Goals

By the end of this lecture, you should be able to: i) connect to a SQLite database from Python with sqlite3, ii) inspect database structure using sqlite_master and PRAGMA table_info(), iii) translate pandas operations you already know (head(), boolean masking, groupby(), merge()) into equivalent SQL queries, iv) explain why databases exist and when they are more appropriate than CSV files, v) use .explode() to convert list-valued columns into separate rows for analysis.

No slides this week. This page is the reference copy: read through it after the lecture to revisit the SQL-pandas mapping and the group project briefing at your own pace.

📍 Session Details

  • Date : Thursday, 26 March 2026
  • Time : 16:00 - 18:00
  • Location : CKK.LG.01

📋 Preparation

  • You went to the 🖥️ W09 Lecture and 💻 W09 Lab
  • Your ✍️ Mini-Project 2 should be nearly done or submitted
  • If you have not joined a group yet, read the announcement on Moodle and either join your group’s repository on GitHub, post in #social on Slack, or reply to the Moodle thread

🗣️ Lecture Overview

Hour 1: SQL and Databases
  • Why databases exist and when CSVs stop being practical
  • Connecting to a SQLite database with sqlite3
  • Inspecting the catalogue: sqlite_master and PRAGMA table_info()
  • SELECT, WHERE, COUNT, GROUP BY, ORDER BY, JOIN: each one mapped to the pandas equivalent you already know
  • .explode() for list-valued columns (the genres problem)
Hour 2: Group Project Briefing + Team Activity
  • Introduction to the 📦 Group Project requirements and what to bring to the formative pitch (Mon 30 / Tue 31 March)
  • The 🌍 WFP Track : building a food security data pipeline for the World Food Programme
  • Team activity: create a GitHub Project Board for your group

Hour 1: SQL and Databases

When you work with much larger volumes of data, or need to read data shared across an entire organisation, CSVs will not cut it. You will need databases. Most large organisations store their data in proprietary systems backed by databases, and if you want to analyse that data or build algorithms on top of it, you will need to read from those databases.

The core reason is straightforward: when a table has millions (or billions) of rows, you cannot load the entire thing into your computer’s memory. A CSV forces you to read everything at once. A database lets you ask for exactly the rows and columns you need, and the database engine handles the rest. The language you use to make those requests is called SQL (Structured Query Language), and it has been the standard way to talk to databases since the 1970s.

The good news is that SQL maps closely to pandas operations you already know. We will walk through that mapping side by side, using the same IMDb dataset from the autumn term.

Connecting to the database

One line. That is all you need.

import sqlite3
import pandas as pd

conn = sqlite3.connect("data/processed/imdb_2026.db")

conn is something called a connection object. You pass it to pd.read_sql() every time you want to run a query. The database file (imdb_2026.db) is a curated subset of the IMDb Non-Commercial Datasets filtered to titles with endYear = 2026.
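Before touching the real file, you can try the same pattern on a throwaway in-memory database. The sketch below uses a made-up table and row purely for illustration; the connect-query-close rhythm is identical with imdb_2026.db.

```python
import sqlite3
import pandas as pd

# ":memory:" creates a temporary database that vanishes when closed,
# handy for experimenting without touching imdb_2026.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title_basics (tconst TEXT, primaryTitle TEXT)")
conn.execute("INSERT INTO title_basics VALUES ('tt0000001', 'Example Title')")
conn.commit()

df = pd.read_sql("SELECT * FROM title_basics", conn)
conn.close()
```

Calling conn.close() when you are done is good practice with the real database too.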

What tables exist?

pd.read_sql("""
    SELECT name
    FROM sqlite_master
    WHERE type = 'table'
    ORDER BY name;
""", conn)
name
0 name_basics
1 title_akas
2 title_basics
3 title_crew
4 title_episode
5 title_principals
6 title_ratings

sqlite_master is the database’s own catalogue. It lists everything stored in the database. You used this in the autumn term too.

Inspecting table structure

pd.read_sql("PRAGMA table_info(title_basics)", conn)
name type
0 tconst TEXT
1 titleType TEXT
2 primaryTitle TEXT
3 originalTitle TEXT
4 isAdult INTEGER
5 startYear INTEGER
6 endYear INTEGER
7 runtimeMinutes INTEGER
8 genres TEXT

PRAGMA table_info() is SQLite’s equivalent of looking at .dtypes and .columns on a DataFrame. tconst is the primary key for title_basics and appears as a foreign key in almost every other table. Think of it as the merge key from W08’s pd.merge() work.
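The two inspection queries combine naturally. A sketch (again on a throwaway in-memory database with made-up tables) that builds a {table: columns} overview of the whole schema:

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title_basics (tconst TEXT, primaryTitle TEXT)")
conn.execute("CREATE TABLE title_ratings (tconst TEXT, averageRating REAL)")

tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    conn,
)

# One PRAGMA per table gives a {table: [columns]} dictionary of the schema.
schema = {
    name: pd.read_sql(f"PRAGMA table_info({name})", conn)["name"].tolist()
    for name in tables["name"]
}
conn.close()
```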

SQL vs pandas: a translation guide

Everything below follows the same pattern. The left column shows the SQL query; the right column shows the pandas code that does the same thing. Same result, different grammar.

SELECT + LIMIT = column selection + .head()

SQL

SELECT primaryTitle, startYear
FROM title_basics
LIMIT 5

pandas

title_basics[["primaryTitle", "startYear"]].head(5)

SQL declares what you want; pandas chains operations. LIMIT and .head() do the same job.

WHERE = boolean masking

SQL

SELECT primaryTitle,
       startYear,
       runtimeMinutes
FROM title_basics
WHERE titleType = 'movie'
  AND startYear = 2026
LIMIT 5

pandas

mask = (
    (title_basics["titleType"] == "movie")
    & (title_basics["startYear"] == 2026)
)

(
    title_basics[mask]
    [["primaryTitle", "startYear", "runtimeMinutes"]]
    .head(5)
)

The AND in SQL maps to & in pandas. You practised boolean masking in W02 to W04. The .query() method is an alternative that reads more like SQL:

(
    title_basics
    .query("titleType == 'movie' & startYear == 2026")
    [["primaryTitle", "startYear", "runtimeMinutes"]]
    .head(5)
)
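If you find yourself editing the year or title type inside the query string, pd.read_sql also accepts a params= argument with ? placeholders. The sketch below uses a tiny made-up table:

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_basics "
    "(primaryTitle TEXT, titleType TEXT, startYear INTEGER)"
)
conn.executemany(
    "INSERT INTO title_basics VALUES (?, ?, ?)",
    [("A Movie", "movie", 2026), ("A Show", "tvSeries", 2026)],
)

# "?" placeholders let the driver substitute the values safely,
# so the query string itself never changes.
result = pd.read_sql(
    "SELECT primaryTitle FROM title_basics "
    "WHERE titleType = ? AND startYear = ?",
    conn,
    params=("movie", 2026),
)
conn.close()
```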

IS NOT NULL = .notna()

SQL

SELECT primaryTitle, runtimeMinutes
FROM title_basics
WHERE runtimeMinutes IS NOT NULL
LIMIT 5

pandas

mask = title_basics["runtimeMinutes"].notna()
(
    title_basics[mask][["primaryTitle", "runtimeMinutes"]]
    .head(5)
)

IS NOT NULL in SQL is .notna() in pandas. You did completeness checks with .isna() in W09. Same concept, SQL vocabulary.

COUNT = .shape[0]

SQL

SELECT COUNT(*)
FROM title_basics
WHERE titleType = 'movie'

pandas

title_basics[title_basics["titleType"] == "movie"].shape[0]

Both answer “how many rows match this condition?”

COUNT DISTINCT = .nunique()

SQL

SELECT COUNT(DISTINCT titleType)
FROM title_basics

pandas

title_basics["titleType"].nunique()

COUNT counts rows. COUNT DISTINCT counts unique values. Ask yourself: what is the difference between “how many rows have a titleType?” and “how many different titleType values exist?” That distinction matters.
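A four-row toy Series (made up for illustration) makes the distinction concrete:

```python
import pandas as pd

# Four rows, one missing value, two distinct title types.
title_type = pd.Series(["movie", "movie", "tvSeries", None])

n_rows_with_value = title_type.count()   # non-null rows
n_distinct = title_type.nunique()        # distinct non-null values
```

Here count() returns 3 (the missing row is excluded) while nunique() returns 2.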

GROUP BY + COUNT + ORDER BY = .groupby().size().sort_values()

SQL

SELECT titleType,
       COUNT(*) AS n_titles
FROM title_basics
GROUP BY titleType
ORDER BY n_titles DESC

pandas

(
    title_basics
    .groupby("titleType")
    .size()
    .reset_index(name="n_titles")
    .sort_values("n_titles",
                 ascending=False)
)

Same split-apply-combine logic from W05. SQL reads top to bottom: select what you want, from where, group how, order which way. The pandas chain reads left to right. Same logic, different grammar.
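The groupby-size-sort chain is common enough that pandas offers a shorthand, .value_counts(). A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up data for illustration.
title_basics = pd.DataFrame(
    {"titleType": ["movie", "tvSeries", "movie", "movie"]}
)

# .value_counts() groups, counts, and sorts descending in one call;
# rename_axis/reset_index turn the result back into a two-column DataFrame.
counts = (
    title_basics["titleType"]
    .value_counts()
    .rename_axis("titleType")
    .reset_index(name="n_titles")
)
```

The longer groupby chain is still worth knowing, because it generalises to aggregations other than counting.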

JOIN = pd.merge()

The next two are longer, so the SQL and pandas versions are separated into collapsible blocks.

This query finds the 10 highest-rated titles with more than 100,000 votes. It combines data from title_ratings and title_basics using tconst as the shared key. The autumn lecture notebook ended with exactly this query, so this is the payoff that connects the two terms.

SQL version
SELECT b.primaryTitle,
       r.averageRating,
       r.numVotes
FROM title_ratings AS r
JOIN title_basics AS b
  ON r.tconst = b.tconst
WHERE r.numVotes > 100000
ORDER BY r.averageRating DESC
LIMIT 10

AS r and AS b are table aliases so you do not have to write the full table name each time. JOIN (also called INNER JOIN) keeps only rows where tconst appears in both tables.

pandas equivalent
merged = pd.merge(
    title_ratings,
    title_basics[["tconst", "primaryTitle"]],
    on="tconst",
    how="inner"
)

(
    merged[merged["numVotes"] > 100000]
    .sort_values("averageRating", ascending=False)
    .head(10)
)

You learned pd.merge() with on= and how= in W08. The how="inner" is the default and matches SQL’s JOIN.

LEFT JOIN = pd.merge(how="left")

A LEFT JOIN keeps all rows from the left table. Movies without ratings get NaN in the rating columns, just as they do with pd.merge(how="left").

SQL version
SELECT b.primaryTitle,
       b.startYear,
       r.averageRating,
       r.numVotes
FROM title_basics AS b
LEFT JOIN title_ratings AS r
  ON b.tconst = r.tconst
WHERE b.titleType = 'movie'
LIMIT 5

LEFT JOIN keeps every row from title_basics (the left table). If a movie has no entry in title_ratings, the rating columns come back as NULL.

pandas equivalent
merged = pd.merge(
    title_basics,
    title_ratings,
    on="tconst",
    how="left"
)

(
    merged[merged["titleType"] == "movie"]
    [["primaryTitle", "startYear",
      "averageRating", "numVotes"]]
    .head(5)
)

You learned how="left" in W08. Rows from title_basics without a match in title_ratings get NaN in the rating columns.
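One extra that pd.merge() offers here is indicator=True, which adds a _merge column saying where each row came from. That makes "movies with no rating" a simple filter. A sketch with made-up tables:

```python
import pandas as pd

# Made-up tables for illustration; tt2 has no rating.
title_basics = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]}
)
title_ratings = pd.DataFrame(
    {"tconst": ["tt1", "tt3"], "averageRating": [7.5, 8.1]}
)

# indicator=True tags each row as "both" or "left_only",
# so unmatched rows become a simple boolean mask.
merged = pd.merge(
    title_basics, title_ratings, on="tconst", how="left", indicator=True
)
unrated = merged[merged["_merge"] == "left_only"]["primaryTitle"].tolist()
```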

The genres problem: .explode()

Look at the genres column in title_basics:

title_basics[["primaryTitle", "genres"]].head(5)
primaryTitle genres
0 The Wheel of Time Action,Adventure,Drama
1 Stranger Things Drama,Fantasy,Horror
2 The Handmaid’s Tale Drama,Sci-Fi,Thriller
3 Cobra Kai Action,Comedy,Drama
4 Only Murders in the Building Comedy,Crime,Mystery

Each row has multiple genres packed into one string. If you run .value_counts() on this column, “Action,Adventure,Drama” is treated as a single genre. That is wrong. You want to count each genre separately.

Step 1: Split the string into a list

title_basics["genre_list"] = title_basics["genres"].str.split(",")

title_basics[["primaryTitle", "genres", "genre_list"]].head(3)
primaryTitle genres genre_list
0 The Wheel of Time Action,Adventure,Drama [Action, Adventure, Drama]
1 Stranger Things Drama,Fantasy,Horror [Drama, Fantasy, Horror]
2 The Handmaid’s Tale Drama,Sci-Fi,Thriller [Drama, Sci-Fi, Thriller]

Step 2: Explode the list into separate rows

exploded = title_basics[["primaryTitle", "genre_list"]].explode("genre_list")

Before (3 rows, one list per row)

primaryTitle genre_list
0 The Wheel of Time [Action, Adventure, Drama]
1 Stranger Things [Drama, Fantasy, Horror]
2 The Handmaid’s Tale [Drama, Sci-Fi, Thriller]

After .explode() (9 rows, one genre per row)

primaryTitle genre_list
0 The Wheel of Time Action
0 The Wheel of Time Adventure
0 The Wheel of Time Drama
1 Stranger Things Drama
1 Stranger Things Fantasy
1 Stranger Things Horror
2 The Handmaid’s Tale Drama
2 The Handmaid’s Tale Sci-Fi
2 The Handmaid’s Tale Thriller

Row 0 became three rows. Row 1 became three rows. The index repeats, and that is how you know which rows came from the same original title.

Step 3: Now you can count genres properly

exploded["genre_list"].value_counts().head(10)

And you can group by genre for summary statistics. Note that exploded above does not carry tconst, so we explode again with the key column kept for the merge:

merged_exploded = pd.merge(
    title_basics[["tconst", "genre_list"]].explode("genre_list"),
    title_ratings,
    on="tconst",
    how="left"
)

(
    merged_exploded
    .groupby("genre_list")
    .agg(
        median_rating=("averageRating", "median"),
        n_titles=("averageRating", "count")
    )
    .sort_values("median_rating", ascending=False)
    .head(10)
)

Without .explode(), a title tagged “Action,Adventure,Drama” only appears once. After exploding, it appears in all three genre groups. Distribution analysis (mean, median, histograms) requires each observation represented correctly. That is the connection to W09’s EDA workflow.
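If all you need is the genre counts themselves, .str.get_dummies() collapses the split-then-explode steps into one call. A sketch with made-up data:

```python
import pandas as pd

# Made-up genres column for illustration.
title_basics = pd.DataFrame(
    {"genres": ["Action,Drama", "Drama,Horror", "Comedy"]}
)

# str.get_dummies splits on the separator and one-hot encodes each genre;
# summing the 0/1 columns counts how many titles carry each genre.
genre_counts = title_basics["genres"].str.get_dummies(sep=",").sum()
```

For per-genre summary statistics (medians, histograms), stick with .explode(): the long format keeps each title-genre pair as a row you can merge and group on.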

Group project connection: Your project data may contain nested lists or comma-separated values too (for example, multiple IPC phases per country, or multiple genres per title). The same split-then-explode pattern applies. Whether to explode depends on your research question. If you are comparing totals, you do not need to explode. If you are comparing individual components, you do. Document this choice in your notebooks.

Hour 2: Group Project Briefing

The 📦 Group Project

The group project is worth 40% of your final grade (plus 10% for an individual reflection). Full details, both tracks, rubrics, and repository structure are on the 📦 Group Project specification page.

You choose one of two tracks:

🌍 WFP Track (prescribed)

Build a food security data pipeline for the World Food Programme’s East and Southern Africa regional office. You collect from four APIs (IPC, UNHCR, IOM, World Bank), produce a target CSV with reference-period logic, and deliver a Streamlit dashboard.

Virginia Leape from WFP will visit the LSE around 20 April to meet with WFP track groups.

Full WFP track details

🔍 Self-Chosen Track

Define your own research question, collect data from APIs using requests, store it in a SQLite database with at least 2 related tables, and present findings on a narrative website with a maximum of 3 distinct visualisations.

Full self-chosen track details

🗣️ Formative Pitch (W11)

On Monday 30 March (and Tuesday 31 March for some groups), your team will present your project idea to Jon and the teaching team. This is formative only and does not count toward your grade.

What to show us (on GitHub, in whatever format you like):

  • Your research question or the WFP problem as you understand it
  • Your project board with initial Issues and task assignments
  • Your planned approach: data sources, pipeline steps, who does what
  • One risk or open question you want feedback on

Full pitch details

Team Activity: Create a Project Board

For the last part of today’s lecture, get into your groups and create a project board on GitHub. If you have not joined your group’s repository yet, do it now: GitHub Classroom link.

Your project board should have at minimum:

  1. A column for backlog/to-do items
  2. A column for in-progress work
  3. A column for completed work
  4. At least 3 initial Issues describing your first tasks

Use this time to agree on your track (WFP or self-chosen), sketch a rough plan, and assign early responsibilities. You are preparing for the formative pitch on Monday/Tuesday.

📖 Read more: GitHub: About Projects | Quickstart for Projects

📓 Lecture Materials

Today’s SQL content is covered in a lecture notebook that walks through every query shown above using imdb_2026.db. Download the full Week 10 materials bundle (notebook + database) below.

🔖 Appendix


Looking Ahead

  • Friday W10 Lab: Git collaboration practice (branches, pull requests, merge conflicts) in your groups
  • Monday 30 / Tuesday 31 March: 🗣️ Formative Pitch presentations
  • Wednesday 1 April, 8 pm: ✍️ Mini-Project 2 deadline