💾 Week 02 | Day 04
Data Reshaping: pandas to SQL

ME204 – Data Engineering for the Social World

Dr Jon Cardoso-Silva

24 July 2025

Yesterday’s Recap & Today’s Problem

Yesterday: You became experts at collecting data from APIs. You have posts and you have comments.

Today’s Question: How do we link them together to ask interesting questions?

The Challenge: From Separate Files to Connected Insights

Imagine you have two spreadsheets: posts.csv and comments.csv.

How do you find out which post titles get the most comments? You need to join them.

The pandas Way: pd.merge()

The Core Idea: Joining Tables

Before we write any code, let’s understand the concept. We have two tables, and we want to combine them based on a shared column.

posts

post_id  title
p1       First Post
p2       Second Post
p3       Third Post

comments

comment_id  post_id  comment_body
c1          p1       Great point!
c2          p2       I disagree…
c3          p2       Totally agree!
c4          p3       Interesting.

The Core Idea: Joining Tables

First, we identify the key that connects them. In this case, it’s post_id.

posts

post_id  title
p1       First Post
p2       Second Post
p3       Third Post

comments

comment_id  post_id  comment_body
c1          p1       Great point!
c2          p2       I disagree…
c3          p2       Totally agree!
c4          p3       Interesting.

The Core Idea: Joining Tables

Then, we match the rows where the keys are the same to create new, combined rows.

post_id  title        comment_id  comment_body
p1       First Post   c1          Great point!
p2       Second Post  c2          I disagree…
p2       Second Post  c3          Totally agree!
p3       Third Post   c4          Interesting.

This concept of “joining” is a fundamental, everyday task in data analysis.

Joins: The Daily Bread of Data Work

This “joining” concept you just learned isn’t some academic exercise. It’s the bread and butter of real-world data analysis.

  • Multiple Databases: In any organisation, customer data lives in one system, sales data in another, product data in a third. You join them constantly.
  • Multiple Spreadsheets: Marketing has their Excel file, Finance has theirs, Operations has another. Your job? Connect them to answer business questions.
  • Every Day, All Day: Data analysts spend enormous amounts of time combining datasets from different sources. It’s not optional—it’s core to the job.

You already have a tool for this! pd.merge() is the pandas way of performing this operation.

# The pandas code
import pandas as pd

df_merged = pd.merge(
  df_posts, 
  df_comments, 
  on='post_id'
)
Result:

post_id  title        comment_id  comment_body
p1       First Post   c1          Great point!
p2       Second Post  c2          I disagree…
p2       Second Post  c3          Totally agree!
p3       Third Post   c4          Interesting.
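To try this yourself, here is a self-contained version of the merge using the dummy data from the slides above (the DataFrame construction is ours; the values come straight from the tables):

```python
import pandas as pd

# Dummy data matching the tables on the previous slides
df_posts = pd.DataFrame({
    'post_id': ['p1', 'p2', 'p3'],
    'title': ['First Post', 'Second Post', 'Third Post'],
})
df_comments = pd.DataFrame({
    'comment_id': ['c1', 'c2', 'c3', 'c4'],
    'post_id': ['p1', 'p2', 'p2', 'p3'],
    'comment_body': ['Great point!', 'I disagree…', 'Totally agree!', 'Interesting.'],
})

# Inner join (the default): keep only rows whose post_id appears in BOTH tables
df_merged = pd.merge(df_posts, df_comments, on='post_id')
print(df_merged)
```

The result has one row per comment, with the matching post's title attached; posts with several comments (like p2) appear once per comment.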

But Here’s the Reality Check

pandas is excellent for Python users, but in the professional world of data analytics, SQL is the universal language.

  • Industry Standard: Data analysts, business analysts, data scientists, and even marketing managers use SQL daily.
  • Database Reality: Companies don’t store their billion-record datasets as CSV files. They use databases optimised for scale and performance.
  • Tool Agnostic: Whether you use Python, R, Tableau, Power BI, or Excel, they all speak SQL to databases.

Think of it this way: pandas is like being fluent in English, but SQL is like knowing how to read—it’s fundamental.

How Databases Actually Work (The Magic Inside)

Databases don’t store data the way a CSV file does. They have sophisticated internal structures:

Strong Typing

  • Every column has a precise data type
  • VARCHAR(50) vs INTEGER vs DATETIME
  • Like numpy dtypes, but enforced rigorously

Optimised Storage

  • Data is stored in pages and blocks
  • Indexes make lookups lightning-fast
  • Query optimisers plan the most efficient execution

Keys and Relationships

  • Primary keys ensure uniqueness
  • Foreign keys enforce relationships
  • Think of pandas indexes, but with rules

ACID Properties

  • Atomicity, Consistency, Isolation, Durability
  • Like Git commits: changes are all-or-nothing
  • Multiple users can work simultaneously safely

The good news? You don’t need to understand the internals. Databases handle the complexity for you.

The Database Ecosystem

SQL is the standard language across all major database systems:

Relational Databases:

  • PostgreSQL (Open source, very popular)
  • MySQL (Web applications)
  • SQLite (What we’ll use - simple & portable)
  • SQL Server (Microsoft enterprise)
  • Oracle (Enterprise, expensive)

Cloud Databases:

  • BigQuery (Google Cloud)
  • Redshift (Amazon AWS)
  • Snowflake (Analytics-focused)
  • Azure SQL (Microsoft Cloud)

Different databases, same SQL language. Learn SQL once, use it everywhere.

Why Not Just Stick with pandas?

You might still ask: “If I know pandas, why bother with SQL?”

Scale Limitations

  • pandas loads everything into memory
  • Your laptop has 16GB RAM; corporate databases have terabytes
  • A simple join on 100M rows would crash your computer

Team Collaboration

  • What if your colleague uses R, not Python?
  • What if the business analyst needs the data too?
  • SQL is the common language everyone speaks

Data Integrity

  • CSV files can be accidentally corrupted
  • No rules prevent inconsistent data entry
  • Databases enforce constraints and relationships

Concurrent Access

  • Multiple people reading/writing CSV files = chaos
  • Databases handle thousands of simultaneous users
  • Think Git for data: managed, tracked, safe

From pandas to SQL

Now, let’s see how the language changes. We move from Python functions to SQL (Structured Query Language) commands.

Task            pandas Method                             SQL Command
Select Columns  df[['title', 'author']]                   SELECT title, author FROM posts;
Filter Rows     df[df['score'] > 100]                     SELECT * FROM posts WHERE score > 100;
Combine Tables  pd.merge(posts, comments, on='post_id')   SELECT * FROM posts JOIN comments ON ...
Aggregate Data  df.groupby('author')['score'].sum()       SELECT author, SUM(score) FROM ... GROUP BY ...
Sort Results    df.sort_values('score', ascending=False)  SELECT * FROM ... ORDER BY score DESC;
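To make the correspondence concrete, here is one row of that table (filtering) done both ways. This is a sketch with toy data, using Python's built-in sqlite3 module for the SQL side rather than the course database:

```python
import sqlite3
import pandas as pd

# Toy data (made up for this example)
df = pd.DataFrame({
    'author': ['ana', 'bob', 'ana'],
    'score': [150, 80, 200],
})

# The pandas way: filter rows in your computer's memory
high_pd = df[df['score'] > 100]

# The SQL way: load the same data into an in-memory SQLite database,
# then let the database do the filtering
conn = sqlite3.connect(':memory:')
df.to_sql('posts', conn, index=False)
high_sql = pd.read_sql('SELECT * FROM posts WHERE score > 100;', conn)
conn.close()

print(len(high_pd), len(high_sql))  # both approaches return the same 2 rows
```

Same result either way; the difference is *where* the work happens, which is what matters once the table no longer fits in memory.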

Why Bother with SQL?

You might be asking: “If pandas can do all this, why learn a new language?”

  • It’s the Lingua Franca of Data: SQL is the universal language for interacting with databases. Data analysts, scientists, and engineers all use it. pandas is for Python users; SQL is for everyone.
  • Performance at Scale: Databases are highly optimised. A SQL query on a billion rows is executed efficiently on the database server; a pandas merge would require loading all billion rows into your computer’s memory first.
  • It’s Declarative: With SQL, you declare what you want (e.g., “give me the count of comments per post”). With pandas, you specify how to get it (group by this, then count that, then reset the index…). SQL handles the “how” for you.

Our “Schema-First” Philosophy

In this course, we will follow a “schema-first” approach to databases.

The Golden Rule: CREATE Before You INSERT

You must always define your table structure with a CREATE TABLE statement before you insert any data.

  1. Think about your columns and their data types (VARCHAR, INTEGER, etc.).
  2. CREATE the empty table with a precise schema.
  3. INSERT your DataFrame into the waiting table using if_exists='append'.

NEVER use if_exists='replace'. It drops your carefully designed table and lets pandas recreate it with guessed data types, which defeats the purpose of designing a robust schema.

The Full Workflow: Code View

Here’s how the steps look in your notebook.

1. Create Tables (SQL)

You define the structure first.

CREATE TABLE posts (
    post_id VARCHAR(10) PRIMARY KEY,
    title VARCHAR(300)
);

CREATE TABLE comments (
    comment_id VARCHAR(10) PRIMARY KEY,
    post_id VARCHAR(10),
    comment_body VARCHAR(500),
    FOREIGN KEY (post_id) 
        REFERENCES posts(post_id)
);
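If you want to try these statements outside the notebook, here is a minimal sketch using Python's built-in sqlite3 module. One caveat worth knowing: SQLite only enforces foreign keys after you turn them on with a PRAGMA.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON;')  # SQLite disables FK checks by default

conn.execute("""
CREATE TABLE posts (
    post_id VARCHAR(10) PRIMARY KEY,
    title VARCHAR(300)
);
""")
conn.execute("""
CREATE TABLE comments (
    comment_id VARCHAR(10) PRIMARY KEY,
    post_id VARCHAR(10),
    FOREIGN KEY (post_id) REFERENCES posts(post_id)
);
""")

# The primary key accepts this post...
conn.execute("INSERT INTO posts VALUES ('p1', 'First Post')")

# ...and the foreign key rejects a comment pointing at a post that doesn't exist
try:
    conn.execute("INSERT INTO comments VALUES ('c1', 'p999')")
except sqlite3.IntegrityError as e:
    print('Rejected:', e)
```

This is exactly what "databases enforce constraints and relationships" means in practice: bad data is refused at the door rather than silently accepted.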

2. Insert Data (pandas)

Then you load your DataFrames into them.

# Create database connection
from sqlalchemy import create_engine
engine = create_engine('sqlite:///reddit.db')

# Insert data into the tables we created above.
# index=False stops pandas writing the DataFrame index
# as an extra column our schema doesn't have.
df_posts.to_sql('posts', engine, if_exists='append', index=False)
df_comments.to_sql('comments', engine, if_exists='append', index=False)

Querying with SQL in pandas

Once the data is in the database, you use pd.read_sql() to get exactly what you need.

sql_query = """
SELECT
    p.title,
    COUNT(c.comment_id) AS comment_count
FROM
    posts AS p
JOIN
    comments AS c ON p.post_id = c.post_id
GROUP BY
    p.title
ORDER BY
    comment_count DESC;
"""

# Let the database do the heavy lifting!
df_results = pd.read_sql(sql_query, engine)

The database performs the join and aggregation efficiently, and pandas just receives the final, clean result.
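Putting the whole workflow together, here is a self-contained sketch you can run anywhere. It uses an in-memory SQLite database via the standard-library sqlite3 module (instead of a SQLAlchemy engine and a file on disk), with the dummy data from the earlier slides:

```python
import sqlite3
import pandas as pd

# Dummy data from the earlier slides
df_posts = pd.DataFrame({
    'post_id': ['p1', 'p2', 'p3'],
    'title': ['First Post', 'Second Post', 'Third Post'],
})
df_comments = pd.DataFrame({
    'comment_id': ['c1', 'c2', 'c3', 'c4'],
    'post_id': ['p1', 'p2', 'p2', 'p3'],
})

conn = sqlite3.connect(':memory:')

# 1. Schema first: CREATE before you INSERT
conn.execute("CREATE TABLE posts (post_id VARCHAR(10) PRIMARY KEY, title VARCHAR(300));")
conn.execute("""
CREATE TABLE comments (
    comment_id VARCHAR(10) PRIMARY KEY,
    post_id VARCHAR(10),
    FOREIGN KEY (post_id) REFERENCES posts(post_id)
);
""")

# 2. Append the DataFrames into the waiting tables
df_posts.to_sql('posts', conn, if_exists='append', index=False)
df_comments.to_sql('comments', conn, if_exists='append', index=False)

# 3. Let the database do the join and the aggregation
sql_query = """
SELECT p.title, COUNT(c.comment_id) AS comment_count
FROM posts AS p
JOIN comments AS c ON p.post_id = c.post_id
GROUP BY p.title
ORDER BY comment_count DESC;
"""
df_results = pd.read_sql(sql_query, conn)
print(df_results)
```

"Second Post" comes out on top with 2 comments, which is the same answer the pd.merge approach gives, but here pandas only ever sees the three-row result.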

Moving to the Notebook

Time to put this into practice!

What We’ll Do Now

  1. Open the lecture notebook: ME204_W02D04_lecture.ipynb
  2. Use pd.merge() on our dummy data.
  3. Create an SQLite database and define our table schema.
  4. Insert our posts and comments DataFrames into the database.
  5. Run a SQL query to get our comment counts.

Let’s see how the professionals organise their data! 💻

LSE Summer School 2025 | ME204 Week 02 Day 04