Week 08
SQLite Foundations for Data Projects

DS105A – Data for Data Science

🗓️ 20 Nov 2025

Today’s Goals

  • Learn: Database fundamentals and SQLite storage
  • Understand: Why organisations normalise data into multiple tables
  • Practice: Schema design, pd.merge(), and basic SQL queries

Why this matters: These skills enable structured data storage and support your ✍️ Mini-Project 2 work, where you can optionally use SQLite for organising TfL and ONS data.

1️⃣ Intro to Databases

The Scenario

If you work for a company, you will discover that they likely already have a lot of data in their internal systems.

  • You might find data in Excel spreadsheets (which you can easily read into Python using pandas) or convert them to CSV files.

  • At larger organisations, you will likely find that this operational data is stored in databases.

Files vs Databases

Files

Common formats: TXT, CSV, TSV, JSON, JSONL, XML, Excel
Most used when:
  • data is restricted to single users
  • data does not need to be updated continuously
  • exporting data from systems
  • small-scale data analysis

Databases

Popular systems: SQLite 🆓, MySQL 🆓, PostgreSQL 🆓, Oracle 💰, Microsoft SQL Server 💰, cloud-based databases
Most used when:
  • data is large (millions/billions of rows)
  • data logic is complex (can’t fit in single table)
  • multiple people/systems need to access the same data concurrently
  • updates happen continuously
  • data integrity must be enforced

Remember Binary Files?

In W03, we used the cat command to view files.

cat file.csv    # Readable text
cat image.png   # Gibberish (binary)

📋 NOTE:

You also won’t be able to view the contents of a database file with cat because it’s not a plain text file. Internally, databases use data structures more advanced than the ones we learn in this course to store and guarantee data integrity.

🔗 Link: (Advanced) Click here to learn of how one database system (SQLite) stores its data in its internal file format.

Database as a collection of tables

A relational database is a collection of tables connected by relationships.

Tables store data in rows and columns, similar to DataFrames.

We can connect tables through information they share.

(I labelled the relationships in this imaginary diagram just for the sake of clarity. In reality, the type of databases I am talking about don’t allow you to name the relationships.)

Tables are like enhanced DataFrames

Columns in the pandas DataFrame are like columns in the database table.

Pandas DataFrame (movies)

movie_id
(int64)
title
(object)
rating
(float64)
votes
(int64)
1 Carmencita 5.7 1335
2 Le clown 5.4 842
3 Pauvre Pierrot 6.2 1971

But in pandas, you can:

  • add or remove columns at will
  • change the types of columns anytime
  • have missing values in any column
  • rely on row indices instead of enforcing keys (like foreign keys in databases)

But database tables are stricter

You can’t just do anything you want with a database table. You have to follow the rules.

Database table (movies)

movie_id (PK)
(INTEGER)
title
(TEXT NOT NULL)
average_rating
(REAL)
num_votes
(INTEGER NOT NULL)
1 Carmencita 5.7 1335
2 Le clown 5.4 842
3 Pauvre Pierrot 6.2 1971

Database table (ratings)

movie_id (PK, FK)
(INTEGER)
average_rating
(REAL)
num_votes
(INTEGER)
1 5.7 1335
2 5.4 842
3 6.2 1971

You need to know that:

  • tables must be defined before you insert data, and changing them later is harder
  • a column can only have missing values if it is declared NULL when the table is created
  • tables have primary keys that uniquely identify each row (PK = Primary Key)
  • tables can have foreign keys that reference other tables (FK = Foreign Key)

2️⃣ Reading from databases (IMDb exploration)

There are many databases out there. We’ll use one called SQLite because it’s free, lightweight, and easy to use.

What is IMDb?

  • IMDb (Internet Movie Database) powers millions of lookups.
  • We’re examining the real production schema, not a toy example.
  • Same normalisation principles you’ll apply to TfL + ONS data.

To keep VS Code responsive we created imdb_2025.db, a curated slice of titles whose endYear = 2025.

You should see a copy of it on your workspace at /files/week08/data/imdb_2025.db.

The IMDB homepage as of 20 Nov 2025

Explore a database visually (1/2)

  1. Install the SQLite Viewer extension in VS Code.

    1. Access VS Code on Nuvolos
    2. Click on the Extensions icon in the left sidebar
    3. Search for “SQLite Viewer”
    4. Click on the “Install” button

Explore a database visually (2/2)

  1. Open the database file in the SQLite Viewer.

    1. Go back to the file explorer in VS Code
    2. Navigate to imdb_2025.db inside your Week 08 data folder
    3. Right-click the file and select “Open with SQLite Viewer”
    4. You should see the database file open in the SQLite Viewer

Explore database using Python

From within /files/week08/ you will find a W08-NB01-Lecture notebook which has code like the one below. We need to add the sqlite3 library to the notebook.

# This must be added to the top of the notebook
# SQLite3 is a library for interacting with SQLite databases
# It already comes with Python, no need to pip install anything
import sqlite3

Then we can connect to the database and read the first 5 rows of the title_basics table:

conn = sqlite3.connect("./data/imdb_2025.db")

# This reads the first 5 rows of the title_basics table
pd.read_sql("SELECT * FROM title_basics LIMIT 5;", conn)

(Alt) Connect via Nuvolos Tables

Note from Jon: Nuvolos has this particular feature where it lets us all access the same database. Listen closely during the lecture to see if this is enabled. It if is, you can use the code below to connect to the database.

from nuvolos import get_connection
import pandas as pd

conn = get_connection()
pd.read_sql("SELECT * FROM title_basics LIMIT 5;", conn)

The Nuvolos version (if available) is much bigger than the imdb_2025.db file you have on your workspace and has all the historical data.

SQL Basics Preview

SQL stands for Structured Query Language. It is the language used to query databases and the inspiration for many of pandas’s DataFrame methods.

Tomorrow’s lab builds SQL skills progressively:

  • SELECT specific columns
  • WHERE to filter rows
  • COUNT and COUNT(DISTINCT) for aggregations
  • GROUP BY to split-apply-combine
  • Combine everything to recreate plot_df

Today, we will use the accompanying notebook to explore these patterns. Listen carefully during the lecture.

Explore the Structure

IMDb has several tables. Look at a few:

# Titles and basic information
pd.read_sql("SELECT * FROM title_basics LIMIT 5", con)

# Ratings for those titles
pd.read_sql("SELECT * FROM title_ratings LIMIT 5", con)

# Actors and their roles
pd.read_sql("SELECT * FROM title_principals LIMIT 5", con)

# Names of actors
pd.read_sql("SELECT * FROM name_basics LIMIT 5", con)

Data split across multiple tables.

How IMDb Tables Connect

Look at tconst. It appears in both title_basics and title_ratings.

title_basics:

tconst primaryTitle
tt0000001 Carmencita
tt0000002 Le clown

title_ratings:

tconst averageRating
tt0000001 5.7
tt0000002 5.4

One title has one rating. Connected by tconst.

ORDER BY + LIMIT (rankings)

How can I get the top 10 highest rated popular (100k+ votes) titles?

SELECT *
FROM title_ratings
WHERE numVotes > 100000
ORDER BY averageRating DESC
LIMIT 10;
tconst averageRating numVotes
tt4574334 8.6 1504453
tt9253284 8.6 255970
tt7221388 8.4 237080
tt5834204 8.3 290989
tt31806037 8.1 252695
tt10919420 8.0 727012
tt1869454 8.0 121327
tt1751634 7.6 204936
tt7335184 7.6 354235
tt7462410 7.2 161539

3️⃣ Tidy Data

Tidy Data Principle

  • One table represents one thing.

  • Ask of every table: what does a single row describe?

  • If the answer changes mid-table, you probably need another table.

Unit of Analysis

Each table has a unit of analysis. In our example:

title_basics: Each row describes a title (movie, TV series, episode)

title_ratings: Each row describes a title’s rating

title_principals: Each row describes an actor’s role in a title

Only JOIN When Needed

You won’t always need to join tables. It all depends on the question you are asking.

  • Question: “How many titles per type?” → Query title_basics only

  • Question: “Average rating per title type?” → JOIN title_basics with title_ratings

  • Question: “Which actors appear in highest rated titles?” → JOIN all three

Query design: use minimum tables needed.

4️⃣ pd.merge() and SQL JOIN

🤔 How to grab the title names?

Back to our example. How can I get the top 10 highest rated popular (100k+ votes) titles?

SELECT *
FROM title_ratings
WHERE numVotes > 100000
ORDER BY averageRating DESC
LIMIT 10;
tconst averageRating numVotes
tt4574334 8.6 1504453
tt9253284 8.6 255970
tt7221388 8.4 237080
tt5834204 8.3 290989
tt31806037 8.1 252695
tt10919420 8.0 727012
tt1869454 8.0 121327
tt1751634 7.6 204936
tt7335184 7.6 354235
tt7462410 7.2 161539

🤔 How to grab the title names? (cont.)

If we have this DataFrame:

df_top_ratings = pd.read_sql(
    """
    SELECT 
      tconst, 
      averageRating, 
      numVotes
    FROM title_ratings
    WHERE numVotes > 100000
    ORDER BY averageRating DESC
    LIMIT 10
    """,
    conn
)
df_top_ratings

If title_basics is a DataFrame:

query = """
SELECT 
   tconst, 
   primaryTitle, 
   titleType
FROM title_basics
"""

df_titles = pd.read_sql(query, conn)

mask = df_titles["tconst"].isin(df_top_ratings["tconst"])
df_titles[mask]

.isin() checks whether each tconst in df_titles appears inside the short list of df_top_ratings.

Why filtering is not enough

The output would look like this:

tconst primaryTitle titleType
87 tt10919420 Squid Game tvSeries
340 tt1751634 The Sandman tvSeries
354 tt1869454 Good Omens tvSeries
862 tt31806037 Adolescence tvMiniSeries
5186 tt4574334 Stranger Things tvSeries
5204 tt5834204 The Handmaid’s Tale tvSeries
5240 tt7221388 Cobra Kai tvSeries
5242 tt7335184 You tvSeries
5246 tt7462410 The Wheel of Time tvSeries
5305 tt9253284 Andor tvSeries
  • .isin() works, but you have to manage masks manually.
  • We still want the rating columns and the titles in one DataFrame.
  • This is the moment to learn how to join tables in pandas.

Introducing pd.merge()

df_top = (
    pd.merge(
        df_top_ratings,
        df_titles,
        how="inner",
        on="tconst"
    )
    .sort_values("averageRating", ascending=False)
    .reset_index(drop=True)
)
df_top
  • on="tconst" tells pandas which column matches between tables.
  • how="inner" keeps only the rows that exist in both tables.
  • Result: ratings + human-readable title in one DataFrame.

The output

This is more like how we want to see the data.

tconst averageRating numVotes primaryTitle titleType
0 tt4574334 8.6 1504453 Stranger Things tvSeries
1 tt9253284 8.6 255970 Andor tvSeries
2 tt7221388 8.4 237080 Cobra Kai tvSeries
3 tt5834204 8.3 290989 The Handmaid’s Tale tvSeries
4 tt31806037 8.1 252695 Adolescence tvMiniSeries
5 tt10919420 8.0 727012 Squid Game tvSeries
6 tt1869454 8.0 121327 Good Omens tvSeries
7 tt1751634 7.6 204936 The Sandman tvSeries
8 tt7335184 7.6 354235 You tvSeries
9 tt7462410 7.2 161539 The Wheel of Time tvSeries

Merge parameters cheat sheet

Parameter Meaning
left, right DataFrames you want to combine
on Column present in both DataFrames
left_on, right_on Use when join column names differ
how "inner", "left", "right", "outer"
suffixes Rename overlapping columns (('_rating', '_info'))

This is the same logic you’ll use for MP2’s TfL + ONS merges if you choose to use a database.

SQL JOIN equivalent

Everything above can be run directly inside SQLite as well:

SELECT
    b.primaryTitle,
    r.averageRating,
    r.numVotes
FROM title_ratings AS r
JOIN title_basics AS b
  ON r.tconst = b.tconst
WHERE r.numVotes > 100000
ORDER BY r.averageRating DESC
LIMIT 10;

pd.read_sql() will return the merged DataFrame without needing pandas-side merge().

5️⃣ Creating a SQLite Database

You do not have to build your own database for ✍️ Mini Project 2, but you will for the 📦 Final Project. These slides demonstrate part of the process of creating a SQLite database.

How these tables were created

In Part 1 I showed how we describe a table, what we call the schema.
If you decide to define your own tables (optional for ✍️ Mini Project 2, required later), this is exactly how the IMDb schema looks in SQL:

CREATE TABLE title_basics (
    tconst TEXT PRIMARY KEY,
    primaryTitle TEXT,
    startYear INTEGER
);

CREATE TABLE title_ratings (
    tconst TEXT PRIMARY KEY,
    averageRating REAL,
    numVotes INTEGER,
    FOREIGN KEY (tconst) REFERENCES title_basics(tconst)
);

Schema designed first. Then data populated.

Creating the SQLite file

Once those statements exist you can write them to a .db file like this:

conn = sqlite3.connect("imdb_2025.db")

create_query = """CREATE TABLE ... statements here..."""

# Do it for each table
conn.execute(create_query)

# Then close the connection
conn.close()

Populating IMDb Tables

If this data was in DataFrames, you’d populate like this:

titles_df.to_sql('title_basics', conn, if_exists='append')
ratings_df.to_sql('title_ratings', conn, if_exists='append')

By the way, be careful with the if_exists parameter.

❌ DO NOT DO IT THIS WAY

df.to_sql('table', conn, if_exists='replace')

Loses schema. Loses constraints.

✅ DO IT THIS WAY

df.to_sql('table', conn, if_exists='append')

Preserves schema. Enforces constraints.

Group Question

Why separate ratings from titles? What problems does one table create?

Think about what happens if a rating changes. Or if we want to add more information about titles.

Tomorrow’s Lab

In tomorrow’s 💻 W08 Lab, you’ll:

  • Practice SQL queries progressively (SELECT → WHERE → GROUP BY)
  • Write pandas equivalents to verify understanding
  • Independently create plot_df using SQL by lab end

Bonus: Schema Design

Schema Design Pattern (Optional)

Want to see how to design your own schema? Here’s the pattern:

CREATE TABLE journeys (
    journey_id INTEGER PRIMARY KEY,
    postcode_origin TEXT NOT NULL,
    postcode_destination TEXT NOT NULL,
    total_duration INTEGER
);

CREATE TABLE steps (
    step_id INTEGER PRIMARY KEY,
    journey_id INTEGER NOT NULL,
    step_distance INTEGER,
    step_travel_mode TEXT,
    FOREIGN KEY (journey_id) REFERENCES journeys(journey_id)
);

Key Takeaways

Today you learned:

  1. ✅ Why organisations normalise data
  2. ✅ How to design schemas
  3. pd.merge() with normalised data
  4. ✅ Basic SQL queries

Remember: Normalised storage enables flexible analysis. JOIN when your question requires it.

Questions?

Resources:

  • 📓 Lecture notebook (downloadable on Nuvolos)
  • 💻 W08 Lab tomorrow
  • 💬 Post questions in #help on Slack
  • 📅 Attend drop-in sessions

Looking ahead: 🖥️ W09 Lecture more SQL practice and renewed focus on data visualisation.