ME204 – Data Engineering for the Social World
24 July 2025
Yesterday: You became experts at collecting data from APIs. You have posts
and you have comments
.
Today’s Question: How do we link them together to ask interesting questions?
The Challenge: From Separate Files to Connected Insights
Imagine you have two spreadsheets: posts.csv
and comments.csv
.
How do you find out which post titles get the most comments? You need to join them.
pandas
Way: pd.merge()
Before we write any code, let’s understand the concept. We have two tables, and we want to combine them based on a shared column.
posts
post_id | title |
---|---|
p1 |
First Post |
p2 |
Second Post |
p3 |
Third Post |
comments
comment_id | post_id | comment_body |
---|---|---|
c1 |
p1 |
Great point! |
c2 |
p2 |
I disagree… |
c3 |
p2 |
Totally agree! |
c4 |
p3 |
Interesting. |
First, we identify the key that connects them. In this case, it’s post_id
.
posts
post_id | title |
---|---|
p1 |
First Post |
p2 |
Second Post |
p3 |
Third Post |
comments
comment_id | post_id | comment_body |
---|---|---|
c1 |
p1 |
Great point! |
c2 |
p2 |
I disagree… |
c3 |
p2 |
Totally agree! |
c4 |
p3 |
Interesting. |
Then, we match the rows where the keys are the same to create new, combined rows.
post_id | title | comment_id | comment_body |
---|---|---|---|
p1 |
First Post | c1 |
Great point! |
p2 |
Second Post | c2 |
I disagree… |
p2 |
Second Post | c3 |
Totally agree! |
p3 |
Third Post | c4 |
Interesting. |
This concept of “joining” is a fundamental, everyday task in data analysis.
This “joining” concept you just learned isn’t some academic exercise. It’s the bread and butter of real-world data analysis.
You already have a tool for this! pd.merge()
is the pandas
way of performing this operation.
pandas
is excellent for Python users, but in the professional world of data analytics, SQL is the universal language.
Think of it this way: pandas
is like being fluent in English, but SQL is like knowing how to read—it’s fundamental.
Databases don’t store data like we do with CSV files. They have sophisticated internal structures:
VARCHAR(50)
vs INTEGER
vs DATETIME
numpy
dtypes, but enforced rigorouslyThe good news? You don’t need to understand the internals. Databases handle the complexity for you.
SQL is the standard language across all major database systems:
Relational Databases:
Cloud Databases:
Different databases, same SQL language. Learn SQL once, use it everywhere.
pandas
?You might still ask: “If I know pandas
, why bother with SQL?”
pandas
loads everything into memorypandas
to SQLNow, let’s see how the language changes. We move from Python functions to SQL (Structured Query Language) commands.
Task | pandas Method |
SQL Command |
---|---|---|
Select Columns | df[['title', 'author']] |
SELECT title, author FROM posts; |
Filter Rows | df[df['score'] > 100] |
SELECT * FROM posts WHERE score > 100; |
Combine Tables | pd.merge(posts, comments, on='post_id') |
SELECT * FROM posts JOIN comments ON ... |
Aggregate Data | df.groupby('author')['score'].sum() |
SELECT author, SUM(score) FROM ... GROUP BY ... |
Sort Results | df.sort_values('score', ascending=False) |
SELECT * FROM ... ORDER BY score DESC; |
You might be asking: “If pandas
can do all this, why learn a new language?”
pandas
is for Python users; SQL is for everyone.In this course, we will follow a “schema-first” approach to databases.
The Golden Rule: CREATE
Before You INSERT
You must always define your table structure with a CREATE TABLE
statement before you insert any data.
VARCHAR
, INTEGER
, etc.).CREATE
the empty table with a precise schema.INSERT
your DataFrame into the waiting table using if_exists='append'
.NEVER use if_exists='replace'
. This lets pandas guess the data types, which is inefficient and defeats the purpose of designing a robust database.
Here’s how the steps look in your notebook.
You define the structure first.
pandas
Once the data is in the database, you use pd.read_sql()
to get exactly what you need.
sql_query = """
SELECT
p.title,
COUNT(c.comment_id) AS comment_count
FROM
posts AS p
JOIN
comments AS c ON p.post_id = c.post_id
GROUP BY
p.title
ORDER BY
comment_count DESC;
"""
# Let the database do the heavy lifting!
df_results = pd.read_sql(sql_query, engine)
The database performs the join and aggregation efficiently, and pandas just receives the final, clean result.
Time to put this into practice!
What We’ll Do Now
ME204_W02D04_lecture.ipynb
pd.merge()
on our dummy data.posts
and comments
DataFrames into the database.Let’s see how the professionals organise their data! 💻
LSE Summer School 2025 | ME204 Week 02 Day 04