🖥️ Week 02, Day 04 - Lecture

Data Reshaping: From pandas to SQL

Author

Dr Jon Cardoso-Silva

Last updated

24 July 2025

🥅 Learning Objectives

By the end of this session, you should be able to: i) Explain the limitations of using multiple CSV files for related data. ii) Use pandas.merge() to combine related DataFrames. iii) Understand the benefits of using a relational database (SQLite). iv) Design a simple database schema for posts and comments. v) Store pandas DataFrames in an SQLite database. vi) Write basic SQL queries to retrieve and aggregate data (SELECT, GROUP BY, JOIN).

ME204 course icon

Welcome back! Yesterday, we mastered collecting rich data from authenticated APIs. Today, we tackle the next logical question: how do we organise, store, and connect related datasets effectively?

We’ll start with a familiar tool, pandas, and then see why databases are the professional standard for this kind of work.

⏰ Date and Time: Thursday, 24 July 2025 | 10.00am - 1.00pm 📍 Location: CKK.2.06 (see LSE’s 🗺️ campus map)

🗣️ Today’s Plan

  • 🧩 Part 1: The Data Organisation Challenge (45 minutes)
    • Group Activity: How would you design tables for Reddit posts and comments?
    • The pandas solution: Using pd.merge() to combine our data.
  • 💾 Part 2: Introduction to Databases with SQLite (50 minutes)
    • Why databases? The limits of CSV files.
    • Storing our pandas DataFrames in a simple SQLite database.
  • 🔍 Part 3: Querying with SQL (25 minutes)
    • The basics of SQL: SELECT, GROUP BY, and JOIN.
    • Answering the same questions, but now with the power of SQL.

🎬 Lecture Slides

Use your keyboard arrows to move through the slides below. You can also view them in fullscreen.

Prefer a PDF? Download the slides here:

(Sometimes the PDF export is a bit buggy and some text may appear with different formatting.)

📓 Lecture Notebook

We’ll be using a hands-on notebook for all parts of today’s session. Download it here:

Save it to your me204-study-notes repository. You should also see it on Nuvolos.

After the Lecture

This afternoon’s lab is all about practice. You’ll take the Reddit data you’ve personally collected and apply the full workflow: merge it, design a database for it, store it, and query it. This will be a crucial skill for your upcoming assignments.

💻 Today’s Lab

From merge() to SELECT.

➡️ Go to Lab Instructions

Questions?

➡️ Ask on Slack

🔗 Extra Resources: Databases & SQL

Useful references.