
✍️ Mini-Project 1 (20%): Air Quality Analysis

2025/26 Autumn Term

Author

Dr Jon Cardoso-Silva

Published

27 October 2025

🎯 Learning Goals
By the end of this assignment, you will:

  1. Collect data from a real-world API, authenticating with secure credential management
  2. Apply vectorised operations to analyse temporal patterns in environmental data
  3. Make and document independent analytical decisions, citing external sources
  4. Create compelling visualisations that communicate data-driven insights
  5. Reflect on your analytical process and technical decision-making

This is your first graded summative assignment, worth 20% of your final grade in this course.

⏲️ Due Date: Thursday, 6 November 2025 (Week 06) at 8 pm UK time

⬆️ Submission: via your personalised GitHub Classroom repository (link on Moodle)

💎 Weight: 20% of your final grade

📝 Prerequisite: this assignment assumes you are caught up with the course up to Week 04

📚 Preparation

Before you begin:

  1. Make sure you are caught up with the course.

    This assignment assumes you have been working on all prior weekly exercises and that you have worked on the 📝 W04 Practice exercise too. If you are behind, please contact one of us as soon as possible; check the ✋ Contact Hours page for details.

    If you are into AI, you can also start a fresh chat on the DS105A Claude Project and type “Help me catch up!”.

  2. Accept the GitHub Classroom assignment. This will work just the same as the 📝 W04 Practice exercise. You will be taken to a page where you will have to formally Accept the assignment.

    🚫 For security reasons, I cannot post the invitation link here on the public website. Please click here to view the uncensored version of this page on Moodle. The invitation link is available there.

    After accepting, a personalised GitHub repository will be created for you. Grab the SSH URL from there.

  3. Clone your repository to Nuvolos. The repository will contain a partial template for NB01. Everything else is your responsibility to create.

    If you don’t remember how to clone a repository, check the 4️⃣ Git & GitHub Guide for instructions.

  4. (Optional but strongly recommended) Keep a dedicated AI chat window for this assignment.

    🤖 AI Policy Reminder: The use of AI tools is fully authorised in this course. You are not penalised for using AI in your DS105A coursework.

    Why we ask for your AI chat logs:

    Together with your code and reflections, your chat log helps us provide better feedback on how you’re using AI as a learning tool (not a crutch).

    General-Purpose Generative AI Tools you can use:

    • ChatGPT (OpenAI)
    • Gemini (Google)
    • Claude (Anthropic)
    • ⭐️ My recommendation: Use the DS105A Claude Project. I have curated information for this custom Claude bot to help you with your DS105A coursework. Your chat log will also help me check if the bot is helping you or not.

    How to document your AI usage:

    Start a fresh chat and type something like: “I will use this chat for the ✍️ Mini-Project 1, as part of the LSE DS105A (2025/2026) Data for Data Science course.” When you’re done, export the chat log link and include it in one of your notebooks (you decide where it’s most appropriate).

    We can help you with this. Just post a question on the #help channel on Slack.

📝 Your Assignment

The Question

You’ve been commissioned by the fictitiously famous Office of Quirky Inquiries (OQI) to answer the question:¹

“Is London’s air getting better or worse?”

Your goal is to analyse historical air pollution trends in London using the OpenWeather Air Pollution API. You will:

  • Collect historical air pollution data through API authentication and save it to a JSON file (NB01-Data-Collection.ipynb)

    This is a 🆕 concept you will learn by following the instructions in the provided NB01-Data-Collection.ipynb notebook. In the past, we didn’t need any passwords to connect to OpenMeteo, but we also got locked out of the API because we were sending too many requests. By switching to an API that requires authentication, we avoid this problem: each request is uniquely tied to your e-mail address.

  • Transform raw API responses into a tabular format (NB02-Data-Transformation.ipynb)

  • Produce two analytical insights about air quality patterns (NB03-Data-Analysis.ipynb)

    Read the Technical Requirements section below for more details.

  • Use vectorised operations (NumPy/Pandas) throughout your data manipulation (NB02) and analysis (NB03).

    If you feel you must use a for loop, you must document why NumPy/Pandas alternatives wouldn’t work for your specific need.

API resource you will use:

This time, we will switch to a different API called OpenWeather.

Note: The OpenWeather website provides multiple API endpoints (the base_url you use to construct the API request). You will need to figure out which one provides the historical air pollution data you need.
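To make the authentication pattern concrete, here is a minimal sketch of what the opening of NB01 might look like. This is a sketch under assumptions, not a solution: the `base_url` is deliberately a placeholder (finding the right endpoint is part of your job), and the environment variable name `OPENWEATHER_API_KEY` is just an example, assuming your `.env` file contains a line like `OPENWEATHER_API_KEY=your-key-here`.

```python
import os
import json

import requests
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in your project root
api_key = os.getenv("OPENWEATHER_API_KEY")  # example variable name; use your own

base_url = "https://api.openweathermap.org/..."  # placeholder: choosing the real endpoint is your job

params = {
    "lat": 51.5074,    # example coordinates for London (your decision to justify)
    "lon": -0.1278,
    "appid": api_key,  # OpenWeather expects the key in an `appid` parameter
    # plus whatever time-range parameters your chosen endpoint requires
}

response = requests.get(base_url, params=params)
response.raise_for_status()  # fail loudly if authentication didn't work

with open("data/air_pollution_raw.json", "w") as f:
    json.dump(response.json(), f)
```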

📌 DECISIONS, DECISIONS, DECISIONS

You will need to make several decisions independently:

  • API endpoint: Which base_url provides the historical air pollution data you need?
  • Time horizon: What time period will you analyse?
  • Location: Which coordinates for London will you use?
  • Pollutants: Which air quality metrics will you focus on? (PM2.5, NO₂, O₃, etc.)
  • Categories: Will you use OpenWeather’s index levels or define custom thresholds?
  • Temporal aggregation: Daily, weekly, monthly analysis?

If you define custom air quality categories or thresholds, then you must cite official government sources (e.g., UK DEFRA, WHO guidelines) or academic literature to back up your choices.

In your notebooks (you decide where it’s most appropriate), document your decisions and explain your reasoning. You will be graded on decision quality: did you do any research or use an AI chatbot to help you make your decisions? Whatever you discovered, how did it inform your choices and technical implementation?
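For example, if you define custom thresholds, `pd.cut` is one vectorised way to map concentrations to labelled bands. The cut points below are invented purely for illustration; replace them with values from the official source you cite:

```python
import pandas as pd

# Toy data purely for illustration
df = pd.DataFrame({"pm2_5": [4.2, 18.7, 33.1, 60.5]})

# Placeholder thresholds (µg/m³), NOT official values; cite DEFRA/WHO for real ones
bins = [0, 10, 25, 50, float("inf")]
labels = ["Low", "Moderate", "High", "Very High"]

df["pm2_5_band"] = pd.cut(df["pm2_5"], bins=bins, labels=labels)
print(df)
```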

Technical Requirements

Packages you should use:

  • requests, os, json for API interaction
  • python-dotenv for API key management (🆕 - see NB01 instructions)
  • datetime, numpy, pandas for data manipulation
  • seaborn and matplotlib for visualisation

Vectorisation requirement:

Use vectorised operations throughout your data manipulation (NB02) and analysis (NB03). If you feel you must use a for loop, you must document why NumPy/Pandas alternatives wouldn’t work for your specific need.
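To illustrate the difference, here is the same daily-mean computation written both ways, on invented data:

```python
import numpy as np
import pandas as pd

# Toy hourly series; the column name is invented for this example
idx = pd.date_range("2025-01-01", periods=72, freq="h")
df = pd.DataFrame({"pm2_5": np.random.default_rng(42).uniform(5, 40, 72)}, index=idx)

# ❌ Loop version: iterates day by day in plain Python
daily = {}
for day, group in df.groupby(df.index.date):
    daily[day] = group["pm2_5"].mean()

# ✅ Vectorised version: one expression, computed inside pandas
daily_means = df["pm2_5"].resample("D").mean()
```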

Freedom to use Python packages:

You may use any pandas/numpy/seaborn functions (as well as datetime), even advanced features not covered in lectures or elsewhere in the course, as long as you document:

  • How you learned about it (documentation, search engine, AI chatbot)
  • How you tested it to verify understanding before using it

Notebook Structure

You must create three notebooks:

📥 NB01-Data-Collection.ipynb

Partial template provided. You complete the rest.

  • API authentication working
  • Historical data collected
  • JSON file saved to data/ folder

🔄 NB02-Data-Transformation.ipynb

No template provided. You determine structure.

  • JSON loaded and parsed
  • Pandas transformations applied
  • CSV file(s) saved to data/ folder
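For orientation, here is a possible opening for NB02, assuming your saved JSON stores hourly records under a "list" key with Unix timestamps. Verify this against your own file before copying anything:

```python
import json

import pandas as pd

with open("data/air_pollution_raw.json") as f:
    raw = json.load(f)

# Assumption: hourly records live under a "list" key; inspect your JSON first
df = pd.json_normalize(raw["list"])

# Convert Unix timestamps to proper datetimes (vectorised, no loops needed)
df["datetime"] = pd.to_datetime(df["dt"], unit="s")

df.to_csv("data/air_pollution_clean.csv", index=False)
```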

📊 NB03-Data-Analysis.ipynb

No template provided. You determine structure.

  • CSV loaded
  • Vectorised analysis performed
  • Two insights produced:
    • 2 seaborn visualisations OR
    • 1 seaborn plot + 1 styled DataFrame
  • Each insight must have a narrative title, i.e. a title that states the finding directly rather than describing the chart (see the sketch below)
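To show what a narrative title means in practice, here is a hedged seaborn sketch. The numbers and the "finding" are invented purely to illustrate titling; do not reuse them:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented numbers purely to illustrate titling, not a real finding
df = pd.DataFrame({
    "year": [2021, 2022, 2023, 2024],
    "pm2_5": [13.1, 12.4, 11.8, 11.2],
})

ax = sns.lineplot(data=df, x="year", y="pm2_5")

# ❌ Descriptive: "PM2.5 by year"
# ✅ Narrative: states the (illustrative) finding directly
ax.set_title("Annual average PM2.5 fell steadily between 2021 and 2024")
ax.set_xlabel("Year")
ax.set_ylabel("PM2.5 (µg/m³)")
plt.show()
```

If you choose the styled-DataFrame route instead, pandas’ `.style` accessor (for example, `df.style.background_gradient()`) is the kind of tool to reach for.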

Whilst we will not dictate how to structure your notebooks, you will be graded on their hierarchical organisation. Use clear markdown headers, logical section progression, and an appropriate balance of code and markdown cells.

Reflection & Process Documentation

Add 💭 **Personal Reflection Notes:** cells as you work, briefly explaining your technical decisions, your reasons for using particular functions (including anything new), how you tested them, any issues or re-runs, and what you tried when resolving challenges.

Need guidance on effective reflection? Review the first hour of the 🖥️ W04 Lecture, where Jon discusses the importance of reflection.

Your reflection notes should be substantive but concise. Each reflection cell should address the specific decision/process at hand without excessive repetition. Think: “What would a colleague need to know to understand my choice?” rather than “How can I fill space?”.

✔️ How We Will Grade Your Work

I don’t enjoy this but, unfortunately, I must be strict when grading summative assignments to mitigate concerns about grade inflation.

Higher marks are reserved for those who demonstrate exceptional talent or effort, but in a way that aligns with the learning objectives and coding philosophy of this course. (Simply adding more analysis or complicating the code is not sufficient!)

Critical reminder: You are graded on your reasoning process and decision documentation, not just functional code. Correct output with weak justification scores lower than incorrect output with clear evidence of learning.

Coherence matters: We assess whether what you write in your reflections matches what you actually did in your code. If you use sophisticated techniques but can’t explain why you chose them or how they relate to what we taught in lectures and labs, this suggests the work isn’t genuinely yours. We want to see you connect your decisions to specific course activities (e.g., “I used vectorised operations as shown in W04 Lab Section 3” rather than generic statements like “I used pandas because it’s efficient”).

The good news is that, if you have been attentive to the teaching materials and actively engaged with the exercises, it should still be feasible to achieve a ‘Very Good!’ level (70-75 marks).

Here is a rough rubric of how we will grade your work.

🧐 Documentation & Reflection (0-30 marks)

Covers all three notebooks and the README.md file

| Marks awarded | Level | Description |
|---|---|---|
| <12 marks | Poor | Key documentation is missing or has too many critical problems. For example: README missing or mostly empty, notebooks have almost no markdown explaining what you did, no reflections about your decisions, or we simply can’t understand what you did or why you did it. |
| 12-14 marks | Weak | You submitted the work but it has multiple serious problems. For example: documentation is very limited or poorly organised, reflections are extremely generic or superficial with no real explanation of your choices, or multiple key elements are missing or severely incomplete. |
| 15-17 marks | Fair | Adequate work with notable weaknesses. Deliverables are present but reflections are generic, decision justifications are limited, work is disorganised, or lacks connection to course content. |
| 18-20 marks | Good | Competent documentation with reasonable reflection. README and notebooks organised with some specific reasoning and decision documentation. Reflections show some connection to course materials, though it may be limited. |
| 21-23 marks | Very Good! | Solid documentation with substantive reflection showing clear course alignment. Typical of someone who has done all prior practice exercises, attended all lectures/classes, and actively participated in the course. Clear README and well-organised notebooks with specific reasoning, a documented decision-making process, credible sources cited, and reflections that reference specific course activities (lectures, labs, practice exercises). What you write matches what you did. |
| 24+ marks | 🏆 WOW | Exceptional reflection quality with sophisticated understanding. Publishable-quality documentation with evidence of deep research, comparative reasoning, and genuine analytical thinking throughout. |

📥 Data Collection & Transformation (0-30 marks)

Covers NB01-Data-Collection.ipynb and NB02-Data-Transformation.ipynb

| Marks awarded | Level | Description |
|---|---|---|
| <12 marks | Poor | Your code logic doesn’t work (even if it runs OK). For example: authentication failed, you used the wrong API endpoint so you didn’t get historical data, files are missing, or your code has so many errors it can’t run. |
| 12-14 marks | Weak | Your code runs but has multiple serious problems. For example: you collected data that doesn’t actually help answer the question, you used lots of for loops when vectorisation would work, your files are disorganised or mixed together, or your code is very messy and hard to follow. |
| 15-17 marks | Fair | Your workflow works but has notable problems. API authentication and data collection function, but with concerning issues in how you implemented things or organised your code, or a disconnect between the sophistication of your code and what we taught (using advanced techniques without explaining why they’re better than what we showed you). |
| 18-20 marks | Good | Competent technical work. API authentication and data collection working with reasonable use of the techniques we taught (vectorised operations, proper file organisation). If you used techniques beyond the course, you show some understanding of why they were appropriate or better than what we showed you. |
| 21-23 marks | Very Good! | Clean, appropriate technical work clearly aligned with what we taught. Typical of someone who has done all prior practice exercises and attended all lectures/classes. API authentication working, data collected successfully, proper use of vectorised operations, files organised correctly, clean code. If you used advanced techniques we haven’t covered, you clearly explain why they were more appropriate than the simpler approaches we taught. |
| 24+ marks | 🏆 WOW | Exceptional technical implementation. Exceptionally clean pandas transformations, creative method chaining, professional touches such as clearly named custom functions or error handling, and exemplary organisation. |

📊 Data Analysis & Insights (0-40 marks)

Covers NB03-Data-Analysis.ipynb and the README.md file

| Marks awarded | Level | Description |
|---|---|---|
| <16 marks | Poor | Your analysis doesn’t answer the question or has too many fundamental problems. For example: visualisations are missing, you didn’t use seaborn, your code has too many errors to interpret, axes are unlabelled or misleading, or your interpretation is completely wrong or missing. |
| 16-19 marks | Weak | You tried to do analysis but it doesn’t work well. For example: your visualisations don’t actually show what you’re trying to say, your titles just describe what’s in the chart rather than stating a finding, your charts are poorly formatted (missing labels, messy legends), your interpretation is very shallow, or you barely used vectorisation when you should have. |
| 20-23 marks | Fair | Your analysis produces some insights but has notable problems. You’ve done analytical work but there are significant weaknesses in how you communicated findings or implemented the analysis, or a disconnect between the sophistication of your techniques and what we taught (using advanced methods without explaining why they’re better than what we showed you). |
| 24-27 marks | Good | Competent analysis with reasonable insights. Two insights produced with acceptable visualisations, reasonable interpretation, and use of the techniques we taught (vectorised operations, appropriate plot types). If you used analytical methods beyond the course, you show some understanding of why. |
| 28-31 marks | Very Good! | Solid analysis with clear insights, clearly aligned with what we taught. Typical of someone who has done all prior practice exercises and attended all lectures/classes. Two clear insights with narrative titles, properly labelled visualisations, appropriate choices, accurate interpretation, vectorised operations, clean code. If you used advanced analytical techniques we haven’t covered, you clearly explain why they were more appropriate than the simpler methods we taught. |
| 32+ marks | 🏆 WOW | Exceptional analysis with a compelling narrative. Publication-quality visualisations, sophisticated seaborn styling, professional pandas techniques, nuanced pattern discovery, and a convincing narrative. |

🔗 Useful Resources


💻 Course Materials

  • 🖥️ W04 Lecture: Vectorisation concepts and reflection importance
  • 💻 W04 Lab: NumPy and Pandas practice
  • 📝 W04 Practice: Complete workflow example

🆘 Getting Help

Check staff availability on the ✋ Contact Hours page.


Footnotes

  1. A secret source, who asked to remain anonymous, told us that OQI are not that interested in the actual answer, but rather the process you took to answer it.↩︎