📦 Final Project (75%)
ME204 (2025) – Data Engineering for the Social World
This is your final assessment, worth 75% of your final grade for ME204. Building on the comprehensive data engineering skills you've developed over three intensive weeks, you'll create a complete data pipeline from collection to public communication.
- ⏲️ Due Date: Friday, 01 August 2025 at 6 pm UK time.
- ⬆️ Submission: git push your work to your private GitHub Classroom repository before the deadline. We will mark the latest version pushed to your repo. GitHub automatically rejects pushes after the deadline, so do not leave it to the last minute.
Click on the button below to accept the assignment:
⚠️ Assignment link only available on Moodle.
If you are seeing this from the public website of ME204, you won't be able to find the link to accept the assignment. Please visit the Moodle version of this page instead to find the link.
👥 Working in Pairs: Collaboration Policy
You may choose to work in pairs for the data collection and database design stages (NB01 and NB02). If you do, you will share a single GitHub repository and co-develop these two notebooks together. This is a good, fun way to learn how to use Git in a real-world setting where more than one person is involved.
However, each student must complete their own NB03 (analysis) and their own website (e.g., one on docs/index.md and the other on docs/yourusername.md). This ensures that your exploratory analysis and storytelling are unique to you.
How to set up your GitHub Classroom team:
- If working alone, use your GitHub username as your "team name" (e.g., jonjoncardoso).
- If working in a pair, use both usernames separated by a hyphen (e.g., jonjoncardoso-sbramwell86). The order does not matter.
This naming convention helps us keep track of who collaborated on which repository. If you have any questions about group formation or submission, please ask on Slack before starting!
🎯 Your Mission
You have complete creative freedom to explore a research question that genuinely interests you. Your task is to build a professional data engineering pipeline that demonstrates mastery of the core skills we've covered: API data collection, database design, data analysis, and public communication.
"What story can you tell using data you collect yourself?"
📋 TECHNICAL REQUIREMENTS
Data Source: You must collect the data yourself, whether by using an API, scraping a website, or collecting it manually.
Database: You must create a properly designed SQLite database with at least 2 tables and appropriate relationships.
Public Website: You must create a public-facing website that tells the story of your findings to a general audience.
Professional Structure: You must follow the enforced folder structure detailed below.
🏗️ Project Structure & Audience Strategy
Your repository must follow this exact folder structure:
your-repo/
├── data/
│   ├── database.db                  # Your SQLite database
│   └── raw/                         # Raw API responses (CSV/JSON)
├── docs/
│   └── index.md                     # Public website (main story)
├── notebooks/
│   ├── NB01-data-collection.ipynb   # API data collection
│   ├── NB02-data-processing.ipynb   # Database design & ETL
│   └── NB03-analysis.ipynb          # Exploratory analysis
├── README.md                        # For technical reproduction
└── scripts/                         # Any utility scripts (optional)
👥 Understanding Your Three Audiences
The different files you will produce are designed to be "consumed" by different hypothetical audiences. I design it this way so you can practise storytelling and communication, which are essential ingredients for the success of any data science project.
File / Section | Audience & Purpose | What this audience needs to know |
---|---|---|
📄 README.md | Technical Colleagues: other data scientists who might want to reproduce your work | "How should I set up my Python environment to run your code?" <br> "What steps do I need to take to get an API credential for myself?" <br> "Which scripts/notebooks should I run to reproduce the analysis? Is there a particular order?" |
📓 notebooks/ | Data Analysts: technical professionals who understand Python, pandas, and SQL and want to understand the rationale behind your choices (code-wise and methodologically) | "Why did you choose to filter the data in this particular way here in this notebook?" <br> "What were your key decisions at each step, and what alternatives did you consider?" <br> "Can I follow your reasoning and reproduce your results, including the logic behind your choices?" |
🌐 docs/index.md | General Public: educated readers without technical background | "Why should I care about this?" <br> "What did you discover, and what does it mean?" <br> "Can I understand your findings without needing to know any code?" |
📝 Detailed Requirements
1. Data Collection (notebooks/NB01-data-collection.ipynb)
Choose your data collection approach based on your interests and comfort level. You have complete freedom in your choice, but here are some suggestions to spark your creativity:
💻 Public APIs
Use any API, authenticated or not (including those from the course, e.g. Open-Meteo, Reddit, etc.)
Browse public APIs on GitHub or search for your own
Great for: fast, reliable, and reproducible data collection
Document your API choice and collection process in the notebook
🕷️ Web Scraping (For the Brave!)
Only attempt if you're up for a challenge! (We only cover this quite late in the course: 🖥️ W03 D02 Lecture.)
Respect website terms of service and robots.txt
Good for: data not available via API, unique or niche sources
Document your scraping process and ethical considerations
📝 Manual Data Collection (Real World)
Collect your own data: surveys, observations, measurements, etc.
Must provide evidence of authenticity (e.g., photos, survey forms)
Good for: original research, social/field data, creative projects
Note: Your data must be real and collected by you. AI-generated "data" will inevitably result in a boring project that misses the learning objectives entirely and will make me go like this: 😞
Key Requirements (All Approaches):
- Document your data collection process with clear explanations of your methodology
- Provide evidence of authenticity: show that you genuinely collected the data yourself (in the case of manual collection)
- Save raw data to the data/raw/ folder in any format you like, e.g., TXT, CSV, or JSON (one way to do this is shown in the sketch after this list)
- Respect ethical boundaries: follow terms of service, robots.txt, and privacy considerations. Do NOT collect any personal data without explicit consent. Do not try to use subterfuge to collect data from a website that does not allow it.
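To make this concrete, here is a minimal sketch of what the collection step in NB01 could look like. It uses the Open-Meteo forecast endpoint mentioned in the course; the coordinates, the output filename, and the MY_API_KEY environment variable are placeholders (Open-Meteo itself needs no key), so adapt everything to whatever source you actually choose.

```python
import json
import os
from pathlib import Path

import requests

# If your chosen API needs authentication, keep the credential in an
# environment variable rather than hard-coding it in the notebook.
api_key = os.environ.get("MY_API_KEY")  # hypothetical variable name
headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}

# Open-Meteo forecast endpoint (no authentication required)
url = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": 51.51,   # placeholder: central London
    "longitude": -0.13,
    "hourly": "temperature_2m",
}

response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()  # fail loudly if the request did not succeed

# Save the untouched API response under data/raw/ for NB02 to process later
# (the relative path assumes the notebook runs from the notebooks/ folder)
raw_dir = Path("../data/raw")
raw_dir.mkdir(parents=True, exist_ok=True)
with open(raw_dir / "open_meteo_london.json", "w") as f:
    json.dump(response.json(), f, indent=2)
```

Whichever approach you take, use the surrounding Markdown cells to explain why you chose the source and how the collection works.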
2. Database Design & Processing (notebooks/NB02-data-processing.ipynb)
Transform your collected data into a well-structured relational database:
Database Requirements:
You should read the data you collected in NB01 and then design and create a SQLite database that contains all the data you will need for your analysis later in NB03. Your database must include at least two tables (but you can have more if you want), and these tables should be connected by appropriate relationships that reflect the real-world structure of your data. For each table, carefully choose data types that make sense for the information you are storing; think about whether a column should be text, an integer (and how big an integer), a date, or something else. Every table should have a primary key to uniquely identify each row, and you should use foreign keys to link related tables together [1]. As you build your database, include steps for cleaning and validating your data so that your tables are accurate and reliable.
💡 DATABASE DESIGN TIP: Think about the relationships in your data. If you're analysing Reddit communities, you might have tables for posts, comments, and subreddits. If you're studying music, consider tables for artists, tracks, and playlists. The relationships between these entities should drive your schema design.
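As a purely hypothetical illustration, here is how the Reddit-style schema from the tip above could be created with Python's built-in sqlite3 module. All table and column names are invented for the example; replace them with whatever reflects the real-world structure of your own data.

```python
import sqlite3

# Relative path assumes the notebook runs from the notebooks/ folder
conn = sqlite3.connect("../data/database.db")

conn.executescript("""
CREATE TABLE IF NOT EXISTS subreddits (
    subreddit_id  INTEGER PRIMARY KEY,   -- uniquely identifies each community
    name          TEXT NOT NULL,
    subscribers   INTEGER                -- whole numbers, so INTEGER rather than TEXT
);

CREATE TABLE IF NOT EXISTS posts (
    post_id       TEXT PRIMARY KEY,      -- Reddit post IDs are short strings
    subreddit_id  INTEGER NOT NULL,
    title         TEXT NOT NULL,
    score         INTEGER,
    created_utc   TIMESTAMP,             -- a date/time, not free text
    FOREIGN KEY (subreddit_id) REFERENCES subreddits (subreddit_id)
);
""")

conn.commit()
conn.close()
```

The foreign key is what encodes the one-to-many relationship (one subreddit has many posts) inside the database itself.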
Key Requirements:
- Load raw data from data/raw/
- Apply vectorised pandas operations (no loops!). You first saw this in 🖥️ W02 D01 Lecture
- Create and populate your SQLite database at data/database.db (a sketch of this load-transform-store flow follows this list)
- Document your design decisions and data cleaning steps directly in the notebook
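Continuing the Open-Meteo placeholder from the NB01 sketch earlier, the load-transform-store flow in NB02 might look roughly like this; the filename, column names, and the Fahrenheit conversion are illustrative assumptions rather than required steps (a Reddit-style dataset like the one in the schema sketch would follow the same pattern).

```python
import json
import sqlite3

import pandas as pd

# Load the raw file saved by NB01 (placeholder filename from the earlier sketch)
with open("../data/raw/open_meteo_london.json") as f:
    raw = json.load(f)

# Open-Meteo returns parallel lists under "hourly"; they become DataFrame columns
hourly = pd.DataFrame(raw["hourly"])

# Vectorised cleaning: each line transforms a whole column at once, no for loops
hourly["time"] = pd.to_datetime(hourly["time"])
hourly["temperature_f"] = hourly["temperature_2m"] * 9 / 5 + 32

# Write the cleaned table into the SQLite database at data/database.db
conn = sqlite3.connect("../data/database.db")
hourly.to_sql("hourly_temperatures", conn, if_exists="replace", index=False)
conn.close()
```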
3. Analysis & Insights (notebooks/NB03-analysis.ipynb)
Conduct exploratory analysis using both pandas and SQL to discover interesting patterns:
Technical Requirements:
For this stage, you should read all your data directly from your SQLite database; make sure that every dataset you use in NB03 is stored there. You are free to analyse your data using either pandas or SQL queries, whichever feels more comfortable for you. As you explore your data, you will often need to merge or join related tables [2]. When it comes to presenting your findings, create visualisations using matplotlib or seaborn (plotly is also fine), or use styled tables with pandas Styler if that suits your data better. As you analyse, make sure to use group-by operations and aggregation functions, just as you practised throughout Week 02. Finally, avoid unnecessary for loops and instead use vectorised operations in pandas or SQL [3]. This will make your code more efficient and easier to read, and it ensures you are practising the skills you would need on a big real-world project.
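For illustration only, a typical NB03 cell might combine a SQL join, a pandas group-by, and a plot with a narrative title, as in the sketch below. It reuses the hypothetical Reddit tables from the NB02 example, and the title states a made-up finding, so treat it purely as a template.

```python
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd

conn = sqlite3.connect("../data/database.db")

# Join the two related tables directly in SQL (hypothetical schema from NB02)
query = """
SELECT s.name AS subreddit, p.score
FROM posts AS p
JOIN subreddits AS s ON p.subreddit_id = s.subreddit_id
"""
df = pd.read_sql(query, conn)
conn.close()

# Group-by + aggregation instead of looping over rows
avg_scores = (
    df.groupby("subreddit")["score"]
      .mean()
      .sort_values(ascending=False)
)

# Narrative title: state the insight, not just the variable names
ax = avg_scores.plot(kind="bar")
ax.set_title("Smaller hobby subreddits attract the highest average post scores")
ax.set_ylabel("Average post score")
plt.tight_layout()
plt.show()
```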
Analytical Requirements:
In this part of your project, focus on asking genuine exploratory questions about your data. Rather than trying to prove a hypothesis or make predictions, aim to discover what is interesting or unexpected in your dataset. We have not covered statistical inference or machine learning in this course, so keep your analysis exploratory. As you work, look for patterns, trends, or surprising relationships that emerge from your data. When you create plots, use narrative titles that clearly communicate the main insight or story behind each visualisation. As you draw conclusions, make sure they are directly supported by your analysis and the evidence you have found.
4. Public Website (docs/index.md)
Create a compelling narrative website for a general audience. Your website must be a single page, and it must present three insights from your analysis, each in the form of either a plot or a stylised table. (You can mix and match: one summary table, two plots, etc.)
Feel free to structure it however you like, but it should be a coherent story that is easy to follow, engaging, and, above all, concise! [4]
Hereβs a suggested structure:
# [Title]
[An opening hook like "I decided to explore [topic] using data from [source] to understand [exploratory question].
On this page, I'll show you how I did it and what I found."]
## [Introduction]
"Here's how I collected and processed the data..." (accessible explanations, not technical details).
## [Findings] (this is where you show your visualisations and tables)
## [Conclusion] (this is where you write your conclusions and implications)
Key Requirements:
- Write for educated non-technical readers
- Include your best visualisations with clear explanations (one way to export figures from your notebook is sketched after this list)
- Tell a coherent story with a clear beginning, middle, and end
- Make insights accessible and relevant to real people
- Professional presentation suitable for sharing publicly
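One simple way to get your visualisations onto the page is to save them as image files from NB03 and reference them from docs/index.md. Below is a minimal sketch, assuming a docs/figures/ subfolder (not part of the required structure) and reusing the avg_scores series from the analysis sketch above.

```python
from pathlib import Path

import matplotlib.pyplot as plt

# docs/figures/ is an assumed subfolder; the path is relative to notebooks/
fig_dir = Path("../docs/figures")
fig_dir.mkdir(parents=True, exist_ok=True)

fig, ax = plt.subplots()
avg_scores.plot(kind="bar", ax=ax)  # avg_scores comes from the NB03 sketch above
ax.set_title("Smaller hobby subreddits attract the highest average post scores")
ax.set_ylabel("Average post score")
fig.savefig(fig_dir / "average-scores.png", dpi=150, bbox_inches="tight")
```

In docs/index.md you would then embed it with a standard Markdown image link, e.g. ![Average post score by subreddit](figures/average-scores.png), and explain in plain language what the reader should take away from it.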
If you're interested in more advanced website creation, you can explore Quarto, a powerful tool for creating beautiful, interactive websites from Markdown. You could create docs/index.qmd instead of docs/index.md to use Quarto's enhanced features like:
- Interactive plots and charts
- Better typography and styling
- Code execution and output display
- Advanced layout options
This is completely optional and won't affect your grade - it's just an opportunity to explore professional data science communication tools!
⚖️ Grading Rubric
Your project will be marked out of 100. The breakdown below shows how marks are allocated across the four main criteria:
🏗️ Pipeline Design & Execution (35 marks)
Marks | Level | Description |
---|---|---|
<18 | Poor / Fair | Database schema missing proper relationships or data types. API collection incomplete or poorly documented. Uses inefficient loops throughout. Raw data processing shows little understanding of vectorised operations. Pipeline components don't connect properly. |
~26 | Good! | Well-designed SQLite database with appropriate relationships and data types. Successful API authentication and comprehensive data collection. Clean vectorised pandas operations replacing manual loops. Evidence of progression from mindless procedural to declarative thinking. Pipeline flows logically from collection to analysis. |
30+ | WOW | Sophisticated schema design showing deep understanding of normalisation and constraints. Creative or innovative data collection beyond basic examples. Masterful use of vectorised operations demonstrating genuine efficiency thinking. Pipeline architecture could serve as a template for similar projects. |
📋 Professional Practice (25 marks)
Marks | Level | Description |
---|---|---|
<13 | Poor / Fair | Inconsistent Git usage with poor commit messages. Security lapses like committed credentials. Code lacks error handling or graceful failure modes. Repository organisation makes reproduction difficult. |
~19 | Good! | Consistent Git workflow with meaningful commits and proper branching. Secure credential management using environment variables. Appropriate error handling and fallback strategies. Clean repository structure following professional standards. Documentation enables technical reproduction. |
22+ | WOW | Exemplary version control practices that could teach others. Exceptional attention to security and edge cases. Repository structure and documentation set the standard for professional data science work. Code demonstrates mastery of collaborative development practices. |
🔍 Analytical Reasoning (25 marks)
Marks | Level | Description |
---|---|---|
<13 | Poor / Fair | Superficial analysis that reveals little interesting about the data. Research questions vague or poorly motivated. Limited use of SQL and pandas capabilities. Conclusions not supported by evidence shown. |
~19 | Good! | Thoughtful exploratory analysis revealing genuine patterns in the data. Well-posed research questions systematically investigated. Effective use of both pandas and SQL for different analytical tasks. Clear visualisations with narrative titles. Conclusions directly supported by analysis. |
22+ | WOW | Analysis uncovers genuinely surprising or important insights about the chosen domain. Sophisticated analytical techniques applied appropriately. Exceptional visualisation design that makes complex patterns immediately clear. Evidence of creative thinking and deep engagement with the data. |
📖 Communication (15 marks)
Marks | Level | Description |
---|---|---|
<8 | Poor / Fair | README lacks reproduction instructions. Notebooks poorly documented. Public website too technical or unclear. Poor understanding of different audience needs. |
~11 | Good! | README enables technical colleagues to reproduce work. Notebooks explain analytical choices and reasoning. Public website accessible to general readers with clear narrative flow. Each component appropriately tailored to its intended audience. |
13+ | WOW | Documentation exemplary enough to serve as reference for others. Public website engages readers and makes technical discoveries compelling to non-experts. Exceptional clarity in explaining complex concepts across different technical levels. |
Footnotes
[1] You will have learned about all this in 🖥️ W02 D04 Lecture and 🖥️ W03 D01 Lecture.
[2] You can revisit how to do this in the 🖥️ W02 D04 Lecture and 🖥️ W03 D01 Lecture.
[3] You practised this in 💻 W02 D01 Lab and 💻 W02 D02 Lab.
[4] The wide use of Generative AI has made it such that many people feel like they must write A LOT of text. Don't fall into this trap!