π» Week 03, Day 01 - Lab
Designing schemas and joining data
After this lab, you will be able to: i) Explain why organisations use normalised databases instead of flat files ii) Write SQL queries with aggregations and HAVING clauses iii) Create a normalised database schema for your project data iv) Write SQL queries to answer research questions v) Apply database design concepts to data engineering problems
This morning, we emulated a full data engineering workflow: we collected data, organised it in pandas DataFrames, and stored it in a database. We then retrieved the data using SQL.
This afternoon, youβll apply these concepts to your final project by designing your own database.
β° Monday, 28 July 2025 | Either 2:00-3.30pm or 3.30-5:00pm π Check your timetable for the location of your class
π£οΈ Lab Overview
Open the ME204_W03D01.ipynb
notebook in your lab-notebooks/
folder on Nuvolos.
Part 1: A recap of the morningβs lecture (15 min)
Your teacher will discuss:
- Benefits of using well-structured databases vs flat CSV files
- How database design works in real systems
- When to use pandas vs SQL
Join the discussion when you can.
Part 2: Writing Analysis Queries (30 min)
Learn how to plan queries for analysis.
Your teacher will show two pre-written queries and explain the syntax and logic behind them.
Think about how these concepts apply to your project. If you have ideas for your projectβs tables, consider how you might query them.
Query 1: Engagement Analysis
This query combines post details with comment metrics for engagement analysis. It shows how one query can compute multiple metrics at once.
SELECT
p.subreddit,
p.title,as post_score,
p.score COUNT(c.comment_id) as comment_count,
AVG(c.score) as avg_comment_score,
MAX(c.score) as max_comment_score
FROM posts p
LEFT JOIN comments c ON p.post_id = c.post_id
GROUP BY p.post_id, p.subreddit, p.title, p.score
ORDER BY comment_count DESC
LIMIT 10
As the class teacher runs the queries, check your own understanding of the following concepts:
- Multiple aggregation functions
- GROUP BY requirements.
Query 2: Active User Analysis
This query adds the HAVING clause to filter out users with only one comment.
SELECT
c.author,COUNT(c.comment_id) as comment_count,
AVG(c.score) as avg_score,
SUM(c.score) as total_score
FROM comments c
WHERE c.author IS NOT NULL
GROUP BY c.author
HAVING comment_count >= 2
ORDER BY total_score DESC
LIMIT 10
As the class teacher runs the queries, check your own understanding of the following concepts:
- WHERE and HAVING clauses
- GROUP BY requirements.
- JOIN operations
π Think about it: When would you use WHERE vs HAVING in your project?
Part 3: Design Your Database (45 min)
This part of the lab is for you to design a database schema for your final project data.
You now have the tools to design a database schema for your final project data.
Before coding, think about your projectβs data structure.
At the end of your
NB02
notebook, you must have produced a single database that contains all the tables you will need for your analysis. InNB03
, the first thing you will do is read from this database and then produce a query (in pandas or SQL) to answer your research questions. So, itβs important to get that structure right.Ask your teacher if you need help with this step.
Design your database schema.
Refer to the code in the
ME204_W03D01.ipynb
notebook, in particular thePart 2: Designing a Database
section. It should give you a good idea for the types of tables you will need.Note: it is very natural for you to have to go back to
NB01
to add more columns or other information to your curated dataframes. This is totally normal and expected.Ask your teacher if you need help with this step.
Add data to your database.
If you get an integrity error when you try to add data to your database, itβs because you have not created the tables correctly. You might have duplicated keys, or you might have not created the tables in the correct order.
Ask your teacher if you get an integrity error.
(if time allows) Write a query to answer your research question.
Refer to the code in the
ME204_W03D01.ipynb
notebook, in particular thePart 2: Writing Analysis Queries
section.Ask your teacher if you need help with this step.
In the notebook, you will find a checklist of the things you can do to ensure your database is well-designed.