πŸ’» Week 03, Day 01 - Lab

Designing schemas and joining data

Author

Dr Jon Cardoso-Silva

Last updated

31 July 2025

🎯 Lab Objectives

After this lab, you will be able to: i) Explain why organisations use normalised databases instead of flat files ii) Write SQL queries with aggregations and HAVING clauses iii) Create a normalised database schema for your project data iv) Write SQL queries to answer research questions v) Apply database design concepts to data engineering problems

ME204 course icon

This morning, we emulated a full data engineering workflow: we collected data, organised it in pandas DataFrames, and stored it in a database. We then retrieved the data using SQL.

This afternoon, you’ll apply these concepts to your final project by designing your own database.

⏰ Monday, 28 July 2025 | Either 2:00-3.30pm or 3.30-5:00pm πŸ“ Check your timetable for the location of your class


πŸ›£οΈ Lab Overview

Open the ME204_W03D01.ipynb notebook in your lab-notebooks/ folder on Nuvolos.

Part 1: A recap of the morning’s lecture (15 min)

Your teacher will discuss:

  • Benefits of using well-structured databases vs flat CSV files
  • How database design works in real systems
  • When to use pandas vs SQL

Join the discussion when you can.

Part 2: Writing Analysis Queries (30 min)

Learn how to plan queries for analysis.

Your teacher will show two pre-written queries and explain the syntax and logic behind them.

Think about how these concepts apply to your project. If you have ideas for your project’s tables, consider how you might query them.

Query 1: Engagement Analysis

This query combines post details with comment metrics for engagement analysis. It shows how one query can compute multiple metrics at once.

SELECT 
    p.subreddit,
    p.title,
    p.score as post_score,
    COUNT(c.comment_id) as comment_count,
    AVG(c.score) as avg_comment_score,
    MAX(c.score) as max_comment_score
FROM posts p
LEFT JOIN comments c ON p.post_id = c.post_id
GROUP BY p.post_id, p.subreddit, p.title, p.score
ORDER BY comment_count DESC
LIMIT 10

As the class teacher runs the queries, check your own understanding of the following concepts:

  • Multiple aggregation functions
  • GROUP BY requirements.

Query 2: Active User Analysis

This query adds the HAVING clause to filter out users with only one comment.

SELECT 
    c.author,
    COUNT(c.comment_id) as comment_count,
    AVG(c.score) as avg_score,
    SUM(c.score) as total_score
FROM comments c
WHERE c.author IS NOT NULL
GROUP BY c.author
HAVING comment_count >= 2
ORDER BY total_score DESC
LIMIT 10

As the class teacher runs the queries, check your own understanding of the following concepts:

  • WHERE and HAVING clauses
  • GROUP BY requirements.
  • JOIN operations

πŸ’­ Think about it: When would you use WHERE vs HAVING in your project?

Part 3: Design Your Database (45 min)

This part of the lab is for you to design a database schema for your final project data.

You now have the tools to design a database schema for your final project data.

  1. Before coding, think about your project’s data structure.

    At the end of your NB02 notebook, you must have produced a single database that contains all the tables you will need for your analysis. In NB03, the first thing you will do is read from this database and then produce a query (in pandas or SQL) to answer your research questions. So, it’s important to get that structure right.

    Ask your teacher if you need help with this step.

  2. Design your database schema.

    Refer to the code in the ME204_W03D01.ipynb notebook, in particular the Part 2: Designing a Database section. It should give you a good idea for the types of tables you will need.

    Note: it is very natural for you to have to go back to NB01 to add more columns or other information to your curated dataframes. This is totally normal and expected.

    Ask your teacher if you need help with this step.

  3. Add data to your database.

    If you get an integrity error when you try to add data to your database, it’s because you have not created the tables correctly. You might have duplicated keys, or you might have not created the tables in the correct order.

    Ask your teacher if you get an integrity error.

  4. (if time allows) Write a query to answer your research question.

    Refer to the code in the ME204_W03D01.ipynb notebook, in particular the Part 2: Writing Analysis Queries section.

    Ask your teacher if you need help with this step.

In the notebook, you will find a checklist of the things you can do to ensure your database is well-designed.