๐Ÿ–ฅ๏ธ Week 03, Day 01 - Lecture

Designing Databases and Merging Data

Author

Dr Jon Cardoso-Silva

Last updated

31 July 2025

๐Ÿฅ… Learning Objectives
By the end of this session, you should be able to: i) Use pandas.merge() to combine related DataFrames, ii) Explain the limitations of using flat files for relational data, iii) Design a simple, normalised database schema, iv) Store pandas DataFrames in a SQLite database, v) Write basic SQL JOIN queries to retrieve and combine data.
ME204 course icon

Welcome back! We will start by checking in on your progress with the ๐Ÿ“ฆ Final Project - todayโ€™s session is is directly relevant to your project! Weโ€™ll learn how to organise your collected data into proper database structures, design relationships between different parts of your data, and query your data efficiently using SQL. This content will help you with NB02 (database design) and NB03 (analysis) - the next steps after your data collection.

โฐ Monday, 28 July 2025 | 10:00am - 1:00pm ๐Ÿ“ Location: CKK.2.06 (see LSEโ€™s ๐Ÿ—บ๏ธ campus map)

Project Check-in & Todayโ€™s Roadmap

10:00 โ€“ 10:15

๐Ÿ“’ Project Status Check

We will start with a quick poll to see how everyone is doing with their ๐Ÿ“ฆ Final Project before we dive into todayโ€™s content.

Part 1: From pandas to Databases

10:15 โ€“ 10:45

Setting the Stage

Weโ€™ll combine Reddit posts and comments using pandas.merge(). This shows how to join related data using tools you know.

Key Concepts

  • Data Collection: We will reuse the data collection code from the ๐Ÿ–ฅ๏ธ Week 02 Day 04 Lecture.
  • The pandas Bridge: Understanding pd.merge() and join types
  • Limitations: Why CSV files donโ€™t work well at scale
Notebook Focus

Watch how we link posts and comments using post_id. This shows the basics of relational data.

Quick Stretch Break & Check-in

10:45 โ€“ 10:50

Part 2: Database Design Fundamentals

10:50 โ€“ 11:30

Why Use Databases?

Weโ€™ll explore why organisations need databases instead of large CSV files, focusing on data integrity and scalability, as we explore how databases are structured in practice.

Database Schema Design

  • Normalisation & Keys: How to structure data relationships
  • SQLite Introduction: Creating tables with proper data types. A nod to our ๐Ÿ–ฅ๏ธ Week 01 Day 02 Lecture.
  • Populating the Database: Moving data from pandas to SQLite

Key Learning Moment

Learn to design a two-table schema - an essential step for your ๐Ÿ“ฆ Final Project.

โ˜•๏ธ Coffee Break

11:30 โ€“ 11:45

Part 3: SQL Queries and Analysis

11:45 โ€“ 13:00

SQL JOIN

Weโ€™ll recreate our pandas merge using SQL JOIN to show how databases handle data relationships.

Query Building Blocks

  • Basic SELECT: Getting data
  • WHERE clauses: Filtering
  • JOIN operations: Combining tables
  • GROUP BY: Summarizing data

Follow Along

Compare SQL and pandas solutions side-by-side.

๐Ÿ““ Lecture Notebook

Get the notebook for todayโ€™s hands-on work:

Save it to your me204-study-notes repository or find it in Nuvolos under lab-notebooks/.

After the Lecture

This afternoon: we will recap and revisit some SQL and then help you organise your collected data into proper database structures and writing SQL queries to analyse it. Youโ€™ll apply todayโ€™s concepts directly to your project data.

๐Ÿ’ป Todayโ€™s Lab

Designing schemas and joining data.

โžก๏ธ Go to Lab Instructions

๐Ÿ”— Extra Resources: Databases & SQL

Useful references and cheatsheets.