πŸ“¦ Final Project (75%)

ME204 (2025) – Data Engineering for the Social World

Last updated: 01 August 2025


This is your final assessment, worth 75% of your final grade for ME204. Building on the comprehensive data engineering skills you’ve developed over three intensive weeks, you’ll create a complete data pipeline from collection to public communication.

Click on the button below to accept the assignment:

⚠️ Assignment link only available on Moodle.

If you are seeing this from the public website of ME204, you won’t be able to find the link to accept the assignment. Please visit the Moodle version of this page instead to find the link.

πŸ‘₯ Working in Pairs: Collaboration Policy

You may choose to work in pairs for the data collection and database design stages (NB01 and NB02). If you do, you will share a single GitHub repository and co-develop these two notebooks together. This is a fun way to learn how to use Git in a real-world setting where more than one person is involved.

However, each student must complete their own NB03 (analysis) and their own website (e.g., one on docs/index.md and the other on docs/yourusername.md). This ensures that your exploratory analysis and storytelling are unique to you.

How to set up your GitHub Classroom team:

  • If working alone, use your GitHub username as your β€œteam name” (e.g., jonjoncardoso).
  • If working in a pair, use both usernames separated by a hyphen (e.g., jonjoncardoso-sbramwell86). The order does not matter.

This naming convention helps us keep track of who collaborated on which repository. If you have any questions about group formation or submission, please ask on Slack before starting!


🎯 Your Mission

You have complete creative freedom to explore a research question that genuinely interests you. Your task is to build a professional data engineering pipeline that demonstrates mastery of the core skills we’ve covered: API data collection, database design, data analysis, and public communication.

β€œWhat story can you tell using data you collect yourself?”

πŸ“Œ TECHNICAL REQUIREMENTS

Data Source: You must collect the data yourself, either via an API, by scraping a website, or manually.

Database: You must create a properly designed SQLite database with at least 2 tables and appropriate relationships.

Public Website: You must create a public-facing website that tells the story of your findings to a general audience.

Professional Structure: You must follow the enforced folder structure detailed below.


πŸ—‚οΈ Project Structure & Audience Strategy

Your repository must follow this exact folder structure:

your-repo/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ database.db             # Your SQLite database
β”‚   └── raw/                    # Raw API responses (CSV/JSON)
β”œβ”€β”€ docs/
β”‚   └── index.md                # Public website (main story)
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ NB01-data-collection.ipynb    # API data collection
β”‚   β”œβ”€β”€ NB02-data-processing.ipynb    # Database design & ETL
β”‚   └── NB03-analysis.ipynb           # Exploratory analysis
β”œβ”€β”€ README.md                    # For technical reproduction
└── scripts/                    # Any utility scripts (optional)

πŸ“– Understanding Your Three Audiences

The different files you will produce are designed to be β€œconsumed” by different hypothetical audiences. I designed it this way so you can practise storytelling and communication, which are essential ingredients for the success of any data science project.

πŸ“„ README.md
Audience & purpose: Technical Colleagues, i.e. other data scientists who might want to reproduce your work.
What this audience needs to know:
  β€’ β€œHow should I set up my Python environment to run your code?”
  β€’ β€œWhat steps do I need to take to get an API credential for myself?”
  β€’ β€œWhich scripts/notebooks should I run to reproduce the analysis? Is there a particular order?”

πŸ““ notebooks/
Audience & purpose: Data Analysts, i.e. technical professionals who understand Python, pandas, and SQL and want to understand the rationale behind your choices (code-wise and methodologically).
What this audience needs to know:
  β€’ β€œWhy did you choose to filter the data in this particular way here in this notebook?”
  β€’ β€œWhat were your key decisions at each step, and what alternatives did you consider?”
  β€’ β€œCan I follow your reasoning and reproduce your results, including the logic behind your choices?”

🌐 docs/index.md
Audience & purpose: General Public, i.e. educated readers without a technical background.
What this audience needs to know:
  β€’ β€œWhy should I care about this?”
  β€’ β€œWhat did you discover, and what does it mean?”
  β€’ β€œCan I understand your findings without needing to know any code?”

πŸ“ Detailed Requirements

1. Data Collection (notebooks/NB01-data-collection.ipynb)

Choose your data collection approach based on your interests and comfort level. You have complete freedom in your choice, but here are some suggestions to spark your creativity:

πŸ’» Public APIs

  • Use any API, authenticated or not (including those from the course, e.g. Open-Meteo, Reddit, etc.)

  • Browse public APIs on GitHub or search for your own

  • Great for: fast, reliable, and reproducible data collection

  • Document your API choice and collection process in the notebook

πŸ•·οΈ Web Scraping (For the Brave!)

  • Only attempt if you’re up for a challenge!
    (we only cover this quite late in the course: πŸ–₯️ W03 D02 Lecture)

  • Respect website terms of service and robots.txt

  • Good for: data not available via API, unique or niche sources

  • Document your scraping process and ethical considerations

πŸ“‹ Manual Data Collection (Real World)

  • Collect your own data: surveys, observations, measurements, etc.

  • Must provide evidence of authenticity (e.g., photos, survey forms)

  • Good for: original research, social/field data, creative projects

  • Note: Your data must be real and collected by you. AI-generated β€œdata” will inevitably result in a boring project that misses the learning objectives entirely and will make me go like this: πŸ™„

Key Requirements (All Approaches):

  • Document your data collection process with clear explanations of your methodology
  • Provide evidence of authenticity: show that you genuinely collected the data yourself (in the case of manual collection)
  • Save raw data to the data/raw/ folder in any format you like (TXT, CSV, JSON, etc.)
  • Respect ethical boundaries: follow terms of service, robots.txt, and privacy considerations. Do NOT collect any personal data without explicit consent. Do not try to use subterfuge to collect data from a website that does not allow it.

2. Database Design & Processing (notebooks/NB02-data-processing.ipynb)

Transform your collected data into a well-structured relational database:

Database Requirements:

You should read the data you collected in NB01 and then design and create a SQLite database that contains all the data you will need for your analysis later in NB03. Your database must include at least two tables (but you can have more if you want), and these tables should be connected by appropriate relationships that reflect the real-world structure of your data. For each table, carefully choose data types that make sense for the information you are storing; you should think about whether a column should be text, an integer (how big of an integer?), a date, or something else. Every table should have a primary key to uniquely identify each row, and you should use foreign keys to link related tables together [1]. As you build your database, make sure to include steps for cleaning and validating your data so that your tables are accurate and reliable.

πŸ’‘ DATABASE DESIGN TIP: Think about the relationships in your data. If you’re analysing Reddit communities, you might have tables for posts, comments, and subreddits. If you’re studying music, consider tables for artists, tracks, and playlists. The relationships between these entities should drive your schema design.
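
Purely to make that tip concrete, here is a sketch of what a two-table schema for the hypothetical Reddit example could look like; the table and column names are invented, and the relative path assumes you run the notebook from notebooks/:

```python
import sqlite3

conn = sqlite3.connect("../data/database.db")

# One row per subreddit, one row per post; posts point back to their subreddit
conn.executescript("""
CREATE TABLE IF NOT EXISTS subreddits (
    subreddit_id  INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    subscribers   INTEGER
);

CREATE TABLE IF NOT EXISTS posts (
    post_id       TEXT PRIMARY KEY,
    subreddit_id  INTEGER NOT NULL,
    title         TEXT,
    score         INTEGER,
    created_utc   TEXT,  -- ISO-8601 timestamp stored as text
    FOREIGN KEY (subreddit_id) REFERENCES subreddits (subreddit_id)
);
""")

conn.commit()
conn.close()
```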

Key Requirements:

  • Load raw data from data/raw/
  • Apply vectorised pandas operations (no loops!). You first saw this in πŸ–₯️ W02 D01 Lecture
  • Create and populate your SQLite database at data/database.db
  • Document your design decisions and data cleaning steps directly in the notebook

3. Analysis & Insights (notebooks/NB03-analysis.ipynb)

Conduct exploratory analysis using both pandas and SQL to discover interesting patterns:

Technical Requirements:

For this stage, you should read all your data directly from your SQLite database. Make sure that every dataset you use in NB03 is stored there. You are free to analyse your data using either pandas or SQL queries, whichever feels more comfortable for you. As you explore your data, you will often need to merge or join related tables [2]. When it comes to presenting your findings, create visualisations using matplotlib or seaborn (plotly is also fine), or use styled tables with pandas Styler if that suits your data better. As you analyse, make sure to use group-by operations and aggregation functions, just like you practised throughout Week 02. Finally, remember to avoid unnecessary for loops and, instead, use vectorised operations in pandas or SQL [3]. This will make your code more efficient and easier to read, and it ensures you are practising the skills you would need on a big real-world project.
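
For example, a join plus a group-by read straight from the database, ending in a plot with a narrative title, might look like this sketch (the tables, and the β€œinsight” in the title, are the invented Reddit example again, not a finding you should expect):

```python
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd

with sqlite3.connect("../data/database.db") as conn:
    # Join the two example tables and aggregate directly in SQL
    df = pd.read_sql(
        """
        SELECT s.name AS subreddit, AVG(p.score) AS avg_score
        FROM posts AS p
        JOIN subreddits AS s ON s.subreddit_id = p.subreddit_id
        GROUP BY s.name
        ORDER BY avg_score DESC
        """,
        conn,
    )

# A narrative title states the insight; the axis labels carry the technical detail
ax = df.plot.barh(x="subreddit", y="avg_score", legend=False)
ax.set_title("Hypothetical headline: smaller communities post higher-scoring content")
ax.set_xlabel("Average post score")
plt.tight_layout()
plt.show()
```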

Analytical Requirements:

In this part of your project, focus on asking genuine exploratory questions about your data. Rather than trying to prove a hypothesis or make predictions, aim to discover what is interesting or unexpected in your dataset. We have not covered statistical inference or machine learning in this course, so keep your analysis exploratory. As you work, look for patterns, trends, or surprising relationships that emerge from your data. When you create plots, use narrative titles that clearly communicate the main insight or story behind each visualisation. As you draw conclusions, make sure they are directly supported by your analysis and the evidence you have found.

4. Public Website (docs/index.md)

Create a compelling narrative website for a general audience. Your website must be a single page, and it must present three insights from your analysis, each in the form of either a plot or a stylised table. (You can mix and match: e.g., one summary table and two plots.)

Feel free to structure it however you like, but it should be a coherent story that is easy to follow, engaging, and above all, concise (πŸ™)! [4]
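
A practical note on getting those plots onto the page: one simple option is to save each figure from NB03 as an image somewhere under docs/ and embed it with a standard Markdown image link. A minimal sketch, assuming a docs/figures/ subfolder (not part of the required structure, just one tidy choice):

```python
from pathlib import Path

import pandas as pd

# Hypothetical tiny chart, only so there is something to save
counts = pd.Series({"Category A": 3, "Category B": 5})
ax = counts.plot.bar()
ax.set_title("Use a narrative title here, not just 'Counts by category'")

# Save the image where docs/index.md can reference it with a relative path
fig_dir = Path("../docs/figures")
fig_dir.mkdir(parents=True, exist_ok=True)
ax.figure.savefig(fig_dir / "insight1.png", dpi=150, bbox_inches="tight")
```

In docs/index.md you would then embed it with an image link such as ![A short, informative caption](figures/insight1.png).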

Here’s a suggested structure:

# [Title]

[An opening hook like "I decided to explore [topic] using data from [source] to understand [exploratory question]. 
In this page, I'll show you how I did it and what I found."]

## [Introduction]

"Here's how I collected and processed the data..." (accessible explanations, not technical details).

## [Findings] (this is where you show your visualisations and tables)

## [Conclusion] (this is where you write your conclusions and implications)

Key Requirements:

  • Write for educated non-technical readers
  • Include your best visualisations with clear explanations
  • Tell a coherent story with a clear beginning, middle, and end
  • Make insights accessible and relevant to real people
  • Professional presentation suitable for sharing publicly

πŸ’‘ Optional: Quarto Exploration

If you’re interested in more advanced website creation, you can explore Quarto - a powerful tool for creating beautiful, interactive websites from Markdown. You could create docs/index.qmd instead of docs/index.md to use Quarto’s enhanced features like:

  • Interactive plots and charts
  • Better typography and styling
  • Code execution and output display
  • Advanced layout options

This is completely optional and won’t affect your grade - it’s just an opportunity to explore professional data science communication tools!

βœ”οΈ Grading Rubric

Your project will be marked out of 100. The breakdown below shows how marks are allocated across the four main criteria:

πŸ—ƒοΈ Pipeline Design & Execution (35 marks)
Marks Level Description
<18 Poor / Fair Database schema missing proper relationships or data types. API collection incomplete or poorly documented. Uses inefficient loops throughout. Raw data processing shows little understanding of vectorised operations. Pipeline components don’t connect properly.
~26 Good! Well-designed SQLite database with appropriate relationships and data types. Successful API authentication and comprehensive data collection. Clean vectorised pandas operations replacing manual loops. Evidence of progression from mindless procedural to declarative thinking. Pipeline flows logically from collection to analysis.
30+ WOW Sophisticated schema design showing deep understanding of normalisation and constraints. Creative or innovative data collection beyond basic examples. Masterful use of vectorised operations demonstrating genuine efficiency thinking. Pipeline architecture could serve as a template for similar projects.
πŸ”’ Professional Practice (25 marks)
Marks Level Description
<13 Poor / Fair Inconsistent Git usage with poor commit messages. Security lapses like committed credentials. Code lacks error handling or graceful failure modes. Repository organisation makes reproduction difficult.
~19 Good! Consistent Git workflow with meaningful commits and proper branching. Secure credential management using environment variables. Appropriate error handling and fallback strategies. Clean repository structure following professional standards. Documentation enables technical reproduction.
22+ WOW Exemplary version control practices that could teach others. Exceptional attention to security and edge cases. Repository structure and documentation set the standard for professional data science work. Code demonstrates mastery of collaborative development practices.
πŸ“Š Analytical Reasoning (25 marks)
Marks Level Description
<13 Poor / Fair Superficial analysis that reveals little interesting about the data. Research questions vague or poorly motivated. Limited use of SQL and pandas capabilities. Conclusions not supported by evidence shown.
~19 Good! Thoughtful exploratory analysis revealing genuine patterns in the data. Well-posed research questions systematically investigated. Effective use of both pandas and SQL for different analytical tasks. Clear visualisations with narrative titles. Conclusions directly supported by analysis.
22+ WOW Analysis uncovers genuinely surprising or important insights about the chosen domain. Sophisticated analytical techniques applied appropriately. Exceptional visualisation design that makes complex patterns immediately clear. Evidence of creative thinking and deep engagement with the data.
πŸ“ Communication (15 marks)
Marks Level Description
<8 Poor / Fair README lacks reproduction instructions. Notebooks poorly documented. Public website too technical or unclear. Poor understanding of different audience needs.
~11 Good! README enables technical colleagues to reproduce work. Notebooks explain analytical choices and reasoning. Public website accessible to general readers with clear narrative flow. Each component appropriately tailored to its intended audience.
13+ WOW Documentation exemplary enough to serve as reference for others. Public website engages readers and makes technical discoveries compelling to non-experts. Exceptional clarity in explaining complex concepts across different technical levels.

Footnotes

  1. You will have learned about all this in πŸ–₯️ W02 D04 Lecture and πŸ–₯️ W03 D01 Lecture.β†©οΈŽ

  2. You can revisit how to do this in the πŸ–₯️ W02 D04 Lecture and πŸ–₯️ W03 D01 Lecture.β†©οΈŽ

  3. You practised this in πŸ’» W02 D01 Lab and πŸ’» W02 D02 Lab.β†©οΈŽ

  4. The wide use of Generative AI has made it such that many people feel like they must write A LOT of text. Don’t fall into this trap!β†©οΈŽ