💾 Week 02 | Day 04
Data Reshaping: pandas to SQL

ME204 – Data Engineering for the Social World

Dr Jon Cardoso-Silva

24 July 2025

Yesterday’s Recap & Today’s Problem

Yesterday: You became experts at collecting data from APIs. You have posts and you have comments.

Today’s Question: How do we link them together to ask interesting questions?

The Challenge: From Separate Files to Connected Insights

Imagine you have two spreadsheets: posts.csv and comments.csv.

How do you find out which post titles get the most comments? You need to join them.

The pandas Way: pd.merge()

The Core Idea: Joining Tables

Before we write any code, let’s understand the concept. We have two tables, and we want to combine them based on a shared column.

posts

post_id  title
p1       First Post
p2       Second Post
p3       Third Post

comments

comment_id  post_id  comment_body
c1          p1       Great point!
c2          p2       I disagree…
c3          p2       Totally agree!
c4          p3       Interesting.

The Core Idea: Joining Tables

First, we identify the key that connects them. In this case, it’s post_id.

posts

post_id  title
p1       First Post
p2       Second Post
p3       Third Post

comments

comment_id  post_id  comment_body
c1          p1       Great point!
c2          p2       I disagree…
c3          p2       Totally agree!
c4          p3       Interesting.

The Core Idea: Joining Tables

Then, we match the rows where the keys are the same to create new, combined rows.

post_id  title        comment_id  comment_body
p1       First Post   c1          Great point!
p2       Second Post  c2          I disagree…
p2       Second Post  c3          Totally agree!
p3       Third Post   c4          Interesting.

This concept of “joining” is a fundamental, everyday task in data analysis.

Joins: The Daily Bread of Data Work

This “joining” concept you just learned isn’t some academic exercise. It’s the bread and butter of real-world data analysis.

  • Multiple Databases: In any organisation, customer data lives in one system, sales data in another, product data in a third. You join them constantly.
  • Multiple Spreadsheets: Marketing has their Excel file, Finance has theirs, Operations has another. Your job? Connect them to answer business questions.
  • Every Day, All Day: Data analysts spend enormous amounts of time combining datasets from different sources. It’s not optional—it’s core to the job.

You already have a tool for this! pd.merge() is the pandas way of performing this operation.

# The pandas code
import pandas as pd

df_merged = pd.merge(
  df_posts, 
  df_comments, 
  on='post_id'
)
Result:

post_id  title        comment_id  comment_body
p1       First Post   c1          Great point!
p2       Second Post  c2          I disagree…
p2       Second Post  c3          Totally agree!
p3       Third Post   c4          Interesting.
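To try this yourself, here is a self-contained version of the merge using the dummy data from the slides above (the DataFrame construction is ours; the values come straight from the tables):

```python
import pandas as pd

# Dummy data matching the tables on the previous slides
df_posts = pd.DataFrame({
    'post_id': ['p1', 'p2', 'p3'],
    'title': ['First Post', 'Second Post', 'Third Post'],
})
df_comments = pd.DataFrame({
    'comment_id': ['c1', 'c2', 'c3', 'c4'],
    'post_id': ['p1', 'p2', 'p2', 'p3'],
    'comment_body': ['Great point!', 'I disagree…', 'Totally agree!', 'Interesting.'],
})

# Inner join (the default): keep only rows whose post_id appears in BOTH tables
df_merged = pd.merge(df_posts, df_comments, on='post_id')
print(df_merged)
```

The result has one row per comment, with the matching post's title attached; posts with several comments (like p2) appear once per comment.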

But Here’s the Reality Check

pandas is excellent for Python users, but in the professional world of data analytics, SQL is the universal language.

  • Industry Standard: Data analysts, business analysts, data scientists, and even marketing managers use SQL daily.
  • Database Reality: Companies don’t store their billion-record datasets as CSV files. They use databases optimised for scale and performance.
  • Tool Agnostic: Whether you use Python, R, Tableau, Power BI, or Excel, they all speak SQL to databases.

Think of it this way: pandas is like being fluent in English, but SQL is like knowing how to read—it’s fundamental.

How Databases Actually Work (The Magic Inside)

Databases don’t store data the way a CSV file does. They have sophisticated internal structures:

Strong Typing

  • Every column has a precise data type
  • VARCHAR(50) vs INTEGER vs DATETIME
  • Like numpy dtypes, but enforced rigorously

Optimised Storage

  • Data is stored in pages and blocks
  • Indexes make lookups lightning-fast
  • Query optimisers plan the most efficient execution

Keys and Relationships

  • Primary keys ensure uniqueness
  • Foreign keys enforce relationships
  • Think of pandas indexes, but with rules

ACID Properties

  • Atomicity, Consistency, Isolation, Durability
  • Like Git commits: changes are all-or-nothing
  • Multiple users can work simultaneously safely

The good news? You don’t need to understand the internals. Databases handle the complexity for you.

The Database Ecosystem

SQL is the standard language across all major database systems:

Relational Databases:

  • PostgreSQL (Open source, very popular)
  • MySQL (Web applications)
  • SQLite (What we’ll use - simple & portable)
  • SQL Server (Microsoft enterprise)
  • Oracle (Enterprise, expensive)

Cloud Databases:

  • BigQuery (Google Cloud)
  • Redshift (Amazon AWS)
  • Snowflake (Analytics-focused)
  • Azure SQL (Microsoft Cloud)

Different databases, same SQL language. Learn SQL once, use it everywhere.

Why Not Just Stick with pandas?

You might still ask: “If I know pandas, why bother with SQL?”

Scale Limitations

  • pandas loads everything into memory
  • Your laptop has 16GB RAM; corporate databases have terabytes
  • A simple join on 100M rows would crash your computer

Team Collaboration

  • What if your colleague uses R, not Python?
  • What if the business analyst needs the data too?
  • SQL is the common language everyone speaks

Data Integrity

  • CSV files can be accidentally corrupted
  • No rules prevent inconsistent data entry
  • Databases enforce constraints and relationships

Concurrent Access

  • Multiple people reading/writing CSV files = chaos
  • Databases handle thousands of simultaneous users
  • Think Git for data: managed, tracked, safe

From pandas to SQL

Now, let’s see how the language changes. We move from Python functions to SQL (Structured Query Language) commands.

Task            pandas Method                             SQL Command
Select Columns  df[['title', 'author']]                   SELECT title, author FROM posts;
Filter Rows     df[df['score'] > 100]                     SELECT * FROM posts WHERE score > 100;
Combine Tables  pd.merge(posts, comments, on='post_id')   SELECT * FROM posts JOIN comments ON ...
Aggregate Data  df.groupby('author')['score'].sum()       SELECT author, SUM(score) FROM ... GROUP BY ...
Sort Results    df.sort_values('score', ascending=False)  SELECT * FROM ... ORDER BY score DESC;
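To make the correspondence concrete, here is one row of that table (filtering) done both ways. This is a sketch with toy data, using Python's built-in sqlite3 module for the SQL side rather than the course database:

```python
import sqlite3
import pandas as pd

# Toy data (made up for this example)
df = pd.DataFrame({
    'author': ['ana', 'bob', 'ana'],
    'score': [150, 80, 200],
})

# The pandas way: filter rows in your computer's memory
high_pd = df[df['score'] > 100]

# The SQL way: load the same data into an in-memory SQLite database,
# then let the database do the filtering
conn = sqlite3.connect(':memory:')
df.to_sql('posts', conn, index=False)
high_sql = pd.read_sql('SELECT * FROM posts WHERE score > 100;', conn)
conn.close()

print(len(high_pd), len(high_sql))  # both approaches return the same 2 rows
```

Same result either way; the difference is *where* the work happens, which is what matters once the table no longer fits in memory.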

Why Bother with SQL?

You might be asking: “If pandas can do all this, why learn a new language?”

  • It’s the Lingua Franca of Data: SQL is the universal language for interacting with databases. Data analysts, scientists, and engineers all use it. pandas is for Python users; SQL is for everyone.
  • Performance at Scale: Databases are highly optimised. A SQL query on a billion rows is executed efficiently on the database server; a pandas merge would require loading all billion rows into your computer’s memory first.
  • It’s Declarative: With SQL, you declare what you want (e.g., “give me the count of comments per post”). With pandas, you specify how to get it (group by this, then count that, then reset the index…). SQL handles the “how” for you.

Our “Schema-First” Philosophy

In this course, we will follow a “schema-first” approach to databases.

The Golden Rule: CREATE Before You INSERT

You must always define your table structure with a CREATE TABLE statement before you insert any data.

  1. Think about your columns and their data types (VARCHAR, INTEGER, etc.).
  2. CREATE the empty table with a precise schema.
  3. INSERT your DataFrame into the waiting table using if_exists='append'.

NEVER use if_exists='replace'. It drops your carefully designed table and lets pandas recreate it with guessed data types, which defeats the purpose of designing a robust schema.

The Full Workflow: Code View

Here’s how the steps look in your notebook.

1. Create Tables (SQL)

You define the structure first.

CREATE TABLE posts (
    post_id VARCHAR(10) PRIMARY KEY,
    title VARCHAR(300)
);

CREATE TABLE comments (
    comment_id VARCHAR(10) PRIMARY KEY,
    post_id VARCHAR(10),
    comment_body VARCHAR(500),
    FOREIGN KEY (post_id) 
        REFERENCES posts(post_id)
);
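If you want to try these statements outside the notebook, here is a minimal sketch using Python's built-in sqlite3 module. One caveat worth knowing: SQLite only enforces foreign keys after you turn them on with a PRAGMA.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON;')  # SQLite disables FK checks by default

conn.execute("""
CREATE TABLE posts (
    post_id VARCHAR(10) PRIMARY KEY,
    title VARCHAR(300)
);
""")
conn.execute("""
CREATE TABLE comments (
    comment_id VARCHAR(10) PRIMARY KEY,
    post_id VARCHAR(10),
    FOREIGN KEY (post_id) REFERENCES posts(post_id)
);
""")

# The primary key accepts this post...
conn.execute("INSERT INTO posts VALUES ('p1', 'First Post')")

# ...and the foreign key rejects a comment pointing at a post that doesn't exist
try:
    conn.execute("INSERT INTO comments VALUES ('c1', 'p999')")
except sqlite3.IntegrityError as e:
    print('Rejected:', e)
```

This is exactly what "databases enforce constraints and relationships" means in practice: bad data is refused at the door rather than silently accepted.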

2. Insert Data (pandas)

Then you load your DataFrames into them.

# Create database connection
from sqlalchemy import create_engine
engine = create_engine('sqlite:///reddit.db')

# Insert data into the tables we created above.
# index=False stops pandas writing the DataFrame index
# as an extra column our schema doesn't have.
df_posts.to_sql('posts', engine, if_exists='append', index=False)
df_comments.to_sql('comments', engine, if_exists='append', index=False)

Querying with SQL in pandas

Once the data is in the database, you use pd.read_sql() to get exactly what you need.

sql_query = """
SELECT
    p.title,
    COUNT(c.comment_id) AS comment_count
FROM
    posts AS p
JOIN
    comments AS c ON p.post_id = c.post_id
GROUP BY
    p.title
ORDER BY
    comment_count DESC;
"""

# Let the database do the heavy lifting!
df_results = pd.read_sql(sql_query, engine)

The database performs the join and aggregation efficiently, and pandas just receives the final, clean result.
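Putting the whole workflow together, here is a self-contained sketch you can run anywhere. It uses an in-memory SQLite database via the standard-library sqlite3 module (instead of a SQLAlchemy engine and a file on disk), with the dummy data from the earlier slides:

```python
import sqlite3
import pandas as pd

# Dummy data from the earlier slides
df_posts = pd.DataFrame({
    'post_id': ['p1', 'p2', 'p3'],
    'title': ['First Post', 'Second Post', 'Third Post'],
})
df_comments = pd.DataFrame({
    'comment_id': ['c1', 'c2', 'c3', 'c4'],
    'post_id': ['p1', 'p2', 'p2', 'p3'],
})

conn = sqlite3.connect(':memory:')

# 1. Schema first: CREATE before you INSERT
conn.execute("CREATE TABLE posts (post_id VARCHAR(10) PRIMARY KEY, title VARCHAR(300));")
conn.execute("""
CREATE TABLE comments (
    comment_id VARCHAR(10) PRIMARY KEY,
    post_id VARCHAR(10),
    FOREIGN KEY (post_id) REFERENCES posts(post_id)
);
""")

# 2. Append the DataFrames into the waiting tables
df_posts.to_sql('posts', conn, if_exists='append', index=False)
df_comments.to_sql('comments', conn, if_exists='append', index=False)

# 3. Let the database do the join and the aggregation
sql_query = """
SELECT p.title, COUNT(c.comment_id) AS comment_count
FROM posts AS p
JOIN comments AS c ON p.post_id = c.post_id
GROUP BY p.title
ORDER BY comment_count DESC;
"""
df_results = pd.read_sql(sql_query, conn)
print(df_results)
```

"Second Post" comes out on top with 2 comments, which is the same answer the pd.merge approach gives, but here pandas only ever sees the three-row result.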

Moving to the Notebook

Time to put this into practice!

What We’ll Do Now

  1. Open the lecture notebook: ME204_W02D04_lecture.ipynb
  2. Use pd.merge() on our dummy data.
  3. Create an SQLite database and define our table schema.
  4. Insert our posts and comments DataFrames into the database.
  5. Run a SQL query to get our comment counts.

Let’s see how the professionals organise their data! 💻

LSE Summer School 2025 | ME204 Week 02 Day 04