DS105 2025-2026 Winter Term

📦 Group Project (40%)

2025/26 Winter Term

Author

Dr Jon Cardoso-Silva

Published

26 March 2026

Modified

27 March 2026

This is your final group assessment for DS105W, worth 40% of your final grade. You will also submit an individual reflection worth 10% (see Individual Reflection below). Building on everything you have learned across both terms, your team will deliver a complete data project from collection to communication.

โณ Deadline Tuesday 26 May 2026 at 8 pm UK time
๐Ÿ’Ž Weight 40% of final grade (group) + 10% (individual reflection)
๐Ÿ‘ฅ Teams 3-4 members (formed in W09 Lab)
๐Ÿ“ค Submission Via your existing GitHub Classroom repository

๐Ÿ“ Overview

You choose one of two tracks:

  • ๐ŸŒ WFP Track (prescribed): Build a food security data pipeline for the World Food Programmeโ€™s East and Southern Africa regional office.
  • ๐Ÿ” Self-Chosen Track: Design your own data project using APIs you select, store data in SQLite, and present findings on a narrative website.

Both tracks are assessed to the same standard using the same three criteria. Pick the one that suits your team's interests and strengths.

Every group must:

  • Collect data programmatically using the requests library
  • Transform and clean data through documented notebook steps
  • Present findings through a website or dashboard hosted on GitHub
  • Use a project board with Issues to coordinate team work
  • Submit individual reflections in reflections/<github-username>.md

โš ๏ธ Team size note: Most groups have 3-4 members. A few groups have been allowed 5 members. If your group has 5 people, we will naturally expect a broader scope or deeper analysis to reflect the additional capacity.

๐ŸŒ WFP Track: Food Security Data Pipeline

The Client and the Problem

The World Food Programme (WFP) East and Southern Africa Regional Bureau needs an interactive dashboard to view, filter, and download food security and displacement data across the region. Your team will build the data pipeline that feeds this dashboard.

The core challenge: food security analysis (from the IPC, the Integrated Food Security Phase Classification) produces data with overlapping reference periods. Each country's analysis generates a "current" period and a "projection" period with different population figures. The dashboard needs to show whichever reference period applies to the current month.

👉 Virginia Leape from WFP will visit the LSE around 20 April to meet with WFP track groups and answer questions about the domain context.

Data Sources

Your pipeline collects from three APIs and one static source:

  1. IPC API: Food security phase classification data (population analysed, food insecure populations by IPC Phase 1-5, reference periods, analysis dates)
  2. UNHCR API (Refugee Data Finder / Open Data Portal): Refugee population counts per country
  3. IOM API: Internally Displaced Person (IDP) population counts per country
  4. World Bank Open Data: Total country population (indicator: "Population, total")

API key note: You need to fill out a form to formally request access to some of these APIs. If you choose this track, do this as soon as possible. If there are delays in getting access, get in touch with Jon to discuss how to proceed.

🚨 DON'T LEAK YOUR API KEYS! Always use a .env file and never write your API key anywhere in your repository or notebooks. Even if you delete it and commit again, it will remain in the Git history.
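
In practice most teams will use the python-dotenv package (`load_dotenv()` followed by `os.getenv(...)`). The dependency-free sketch below just illustrates the principle: keys live in a `.env` file that git never tracks, and code reads them from the environment. The key name is hypothetical.

```python
# Sketch of .env handling without external packages: parse KEY=VALUE lines
# into os.environ and read keys from there, never from the notebook itself.
import os
from pathlib import Path


def load_env(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a .env file into the process environment."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())


# Example usage:
# load_env()
# api_key = os.getenv("IPC_API_KEY")  # hypothetical key name
```

Remember to list `.env` in `.gitignore` before your first commit.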

The Deliverable

Your team produces:

  1. A target CSV combining all four data sources into a single cleaned table (one row per country, columns defined in the data dictionary below)
  2. A Streamlit dashboard that reads from the CSV and provides an interactive table with filtering, sorting, and CSV export
    • This will be discussed in the W11 Lecture.
  3. Pipeline notebooks documenting every collection and transformation step
    • This time you choose how to name your notebooks, but they should follow a clear naming convention and be well organised.
  4. A companion website (docs/index.md or equivalent on GitHub Pages) explaining your pipeline, data sources, and any decisions you made

Data Dictionary

The target CSV must contain these columns (column naming is up to your team, but the content must match):

Field Type Detail
Country String Country name
ISO Code String ISO country code
Date of Analysis YYYY-MM When the IPC analysis was conducted
Analysis Title String e.g. "IPC Zambia Oct 2025"
Reference Period String e.g. "Nov 2025 - Mar 2026 (Projection)"
Population Analysed Integer IPC "pop analysed" figure
Country Population Integer World Bank "Population, total"
IPC Phase 1 Integer Minimal
IPC Phase 2 Integer Stressed
IPC Phase 3 Integer Crisis
IPC Phase 4 Integer Emergency
IPC Phase 5 Integer Famine
IPC 3+ Integer Sum of Phase 3 + 4 + 5
Refugees Covered Boolean Whether refugee data is included in the IPC figures (manual field)
IDPs Covered Boolean Whether IDP data is included in the IPC figures (manual field)
IDP Population Integer From IOM API
Refugee Population Integer From UNHCR API
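
The "IPC 3+" row is a derived column: the sum of Phases 3, 4, and 5. As an illustration, a small pandas sketch with hypothetical column names (use whatever naming your team agreed on):

```python
# Sketch: deriving "IPC 3+" as the row-wise sum of Phases 3-5 with a
# vectorised pandas operation. Column names and figures are illustrative.
import pandas as pd

df = pd.DataFrame({
    "country": ["Zambia", "Malawi"],
    "ipc_phase_3": [1_200_000, 900_000],
    "ipc_phase_4": [300_000, 150_000],
    "ipc_phase_5": [0, 10_000],
})

# Sum across the three phase columns for every row, no Python loop needed
df["ipc_3_plus"] = df[["ipc_phase_3", "ipc_phase_4", "ipc_phase_5"]].sum(axis=1)
print(df[["country", "ipc_3_plus"]])
```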

โš ๏ธ Reference period logic: When building your target CSV, the reference period shown for each country should be the one whose date range contains the current month. For example, if Zambia has a โ€œCurrentโ€ period covering Apr-Sep and a โ€œProjectionโ€ period covering Oct-Mar, and today is in March, your pipeline should select the Projection period. This is the trickiest part of the IPC data and a good place to invest debugging time early.

โš ๏ธ โ€œRefugees Coveredโ€ and โ€œIDPs Coveredโ€: These two boolean columns cannot be derived from the API. They reflect editorial decisions made by the WFP regional office about whether displacement figures are already included in the IPC analysis for a given country. Your team should include these columns in the CSV with placeholder values and document that they require manual input.

Repository Structure

Organise your repository so someone unfamiliar with the project can follow the pipeline from raw data to final outputs. A suggested layout:

your-team-repo/
├── .env              # API keys (not tracked by git)
├── dashboard/        # Streamlit app
├── data/
│   ├── raw/          # JSON responses from APIs
│   └── processed/    # Target CSV and any intermediate files
├── docs/             # Companion website (GitHub Pages)
├── notebooks/        # Pipeline notebooks (your naming convention)
├── reflections/      # One .md file per team member
└── README.md         # Project overview & reproduction steps

Suggested Milestones

For this project, we have a suggested milestone plan. You do not need to submit anything at each milestone, but it's a good idea to set internal deadlines to keep the project on track. The final deadline is Tuesday 26 May 2026 at 8 pm UK time.

Milestone Target Date What to aim for
M1 20 April IPC and at least one other API collected, raw JSON saved, first notebook complete
M2 8 May All four sources collected, transformation notebooks drafted, target CSV taking shape
M3 15 May Target CSV finalised, Streamlit dashboard functional, companion website drafted
M4 26 May Everything polished, reflections written, final push before 8 pm deadline
⚠️ Warning: Time management

The WFP track involves four separate APIs, bespoke reference-period logic, and a Streamlit dashboard. Start early, get raw data saved as JSON before you worry about transformations, and test your reference-period logic on a few countries before scaling to the full region.

If an API is down or an application is delayed, save what you have and move on. A partial pipeline with clear documentation of what worked and what did not is better than a stalled project.

What Happens After Submission

After marking, the best dashboards from WFP track groups may be shared with Virginia and the WFP regional team as examples of what student teams can produce. We will credit your work publicly and share the GitHub repository with them. If you have any concerns about this, please get in touch with Jon.

👉 Note from Jon: I might combine the best elements from multiple projects into a composite dashboard to share with WFP. If so, everyone involved will be credited and linked to their original repositories.

๐Ÿ” Self-Chosen Track: Your Own Data Project

What You Build

Your team defines its own research question, collects data from APIs, stores it in a SQLite database, and presents findings on a narrative website hosted via GitHub Pages.

The key constraint for your website: you must produce a MAXIMUM of 3 distinct visualisations or table summaries to tell your story. Choose them carefully. Each visual must earn its place.

If you use Closeread or another scrollytelling format, the strict count of 3 visuals does not apply, because the narrative structure allows for more flexible presentation. Do whatever serves the story, as long as it does not sprawl.

Data Source Requirements

The same constraints from the autumn term apply:

  • Your primary data must be collected using the requests library. Pre-made API wrappers (e.g. spotipy, praw) are not allowed.
  • Simple bulk CSV downloads are not acceptable as your primary source. They can supplement your analysis.
  • Complex static datasets (e.g. OpenSanctions, World Values Survey) are allowed with Jon's permission.

You can reuse APIs from Mini Project 1 and Mini Project 2 (OpenWeather, OpenMeteo, TfL, ONS) as long as your question is different and you use additional endpoints beyond what we required in the past.

Technical Requirements

  1. SQLite database with at least 2 tables connected by foreign keys
  2. Analysis notebooks that read from the database (not from raw files)
  3. Narrative website via GitHub Pages with a maximum of 3 distinct visualisations or table summaries (unless using Closeread or equivalent)
  4. Proper credential handling via .env files (never hardcoded)
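
Requirements 1 and 2 together can be sketched with the standard-library `sqlite3` module. The two-table schema below is hypothetical; your actual schema should reflect your research question.

```python
# Sketch: a SQLite database with two tables linked by a foreign key,
# queried back the way an analysis notebook would read it.
import sqlite3

conn = sqlite3.connect(":memory:")  # use e.g. data/processed/project.db in practice
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.execute("""
    CREATE TABLE countries (
        iso_code TEXT PRIMARY KEY,
        name     TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE observations (
        id        INTEGER PRIMARY KEY,
        iso_code  TEXT NOT NULL REFERENCES countries(iso_code),
        indicator TEXT NOT NULL,
        value     REAL
    )
""")

conn.execute("INSERT INTO countries VALUES ('ZMB', 'Zambia')")
conn.execute("INSERT INTO observations (iso_code, indicator, value) "
             "VALUES ('ZMB', 'population', 20.6)")
conn.commit()

# Analysis notebooks then read from the database, e.g. via a JOIN:
rows = conn.execute("""
    SELECT c.name, o.indicator, o.value
    FROM observations o JOIN countries c ON c.iso_code = o.iso_code
""").fetchall()
print(rows)
```

Note the `PRAGMA foreign_keys = ON` line: without it, SQLite accepts rows that violate your foreign keys silently.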

💡 FOUR WAYS TO BUILD YOUR WEBSITE

You can choose any of these approaches, all deployed via GitHub Pages:

  1. Plain Markdown (docs/index.md) - Simplest option, minimal setup
  2. Jekyll Themes - More polished appearance with minimal effort
  3. Quarto - Follow the Quarto documentation for websites
  4. AI-Generated HTML - Use Claude or similar to generate custom HTML/CSS for full creative control

Jon will demonstrate AI-assisted website and dashboard creation in the W11 Lecture. All four options are equally valid.

Repository Structure

Choose your own notebook naming convention, but keep things tidy. A reasonable layout:

your-team-repo/
├── .env              # API keys (not tracked by git)
├── data/
│   ├── raw/          # JSON responses from APIs
│   └── processed/    # SQLite database and any intermediate files
├── docs/             # Website (GitHub Pages)
├── notebooks/        # Your naming convention
├── reflections/      # One .md file per team member
└── README.md         # Research question, reproduction steps, member roles

๐Ÿ—ฃ๏ธ Formative Pitch (W11)

On Monday 30 March (and Tuesday 31 March for some groups), your team will present your project idea to Jon and the teaching team. This is formative only and does not count toward your grade.

What to show us (on GitHub, as a page, presentation, images, or any format you like):

  • Your research question or the WFP problem as you understand it
  • Your project board with initial Issues and task assignments
  • Your planned approach (data sources, pipeline steps, who does what)
  • One risk or open question you want feedback on

The W10 Lab is designed to help you prepare for this. Keep it short and focused; we want to give you useful feedback, not watch a polished production.

1:1 Meetings with Jon

From 5 May onward, each self-chosen track group can book a 1:1 meeting with Jon via Calendly (link will be shared on Slack). Use this to get feedback on your approach, troubleshoot data issues, or sense-check your analysis direction.

📊 How We Grade It

Your project is assessed across three criteria. The marking bands below are indicative rather than comprehensive. Given that the two tracks produce different deliverables, we are not prescribing every mark range. Instead, we describe what constitutes a Pass, Good, Really Good, and WOW outcome for each criterion.

🔧 Data Pipeline (40 marks)

Data collection, storage, transformation, and code quality

Level Description
Pass (40%) Data is collected and stored. The pipeline runs, but there are gaps: missing error handling for API calls, inconsistent file organisation, or transformations that lose information without acknowledgement. Code works but would be hard for someone else to follow.
Good (60%) Data collection is methodical and documented. Files are well organised, transformations are justified in the notebooks, and the pipeline reads cleanly from start to finish. Credentials are handled securely (.env, not hardcoded). Code uses vectorised pandas operations where appropriate.
Really Good (70%) Clean, efficient pipeline with professional habits throughout. Database schema or CSV structure reflects thoughtful design choices. Transformation steps are well sequenced so downstream notebooks read from processed outputs rather than repeating work. Code quality is consistent across all notebooks and team members.
๐Ÿ† WOW Exceptional pipeline design. Creative and efficient data transformations, exemplary code organisation, nothing over-engineered. The repository feels like work from an experienced team.

๐ŸŒ Communication (40 marks)

Narrative quality, visualisation design, and clarity of insight

Your website or dashboard tells the story of your analysis. We are looking for economy: a maximum of 3 distinct visualisations or table summaries that each earn their place. Finding a way to convey your insights without sprawling across dozens of plots is a genuine skill.

If you use Closeread or a similarly narrative-driven format, the "3 distinct visualisations" constraint relaxes. In that case, do whatever serves the story well, as long as it is not overly long.

Level Description
Pass (40%) Website or dashboard exists and shows results, but the narrative is thin. If on the self-track, visualisations describe data without stating findings and the reader has to work hard to understand what the project discovered. If on the WFP-track, the visuals do not match the brief.
Good (60%) If on the self-track, clear narrative flow with visualisations that support the text. Plot titles convey insights rather than describing axes. The website reads as a coherent piece of communication, not disconnected notebook outputs. If on the WFP-track, the dashboard meets the brief and the data is presented clearly, but the companion website could explain the pipeline and its decisions more thoroughly.
Really Good (70%) If on the self-track, compelling storytelling with well-designed visualisations. Each visual earns its place. Appropriate hedging of claims, clear acknowledgement of limitations, and a reader who knows nothing about the data could follow the argument. If on the WFP-track, the dashboard matches the brief closely with good design choices, and the companion website clearly documents the pipeline, data sources, and any decisions your team made along the way.
๐Ÿ† WOW If on the self-track, exceptional presentation that would impress a professional audience. Creative visual approaches, sophisticated narrative that engages readers, outstanding attention to the balance between depth and economy. If on the WFP-track, the dashboard feels like a polished professional tool that goes beyond the brief in thoughtful ways, and the companion documentation is exemplary.

👥 Teamwork (20 marks)

Coordination evidence, role distribution, and project management

Your project board and Issues history are the primary evidence for this criterion. We will look at how your team planned work, divided responsibilities, and resolved problems.

Level Description
Pass (40%) Some evidence of coordination, but the project board is sparse or abandoned early, or "made up" at the last minute to appear historical. Contributions appear unbalanced without explanation.
Good (60%) Project board shows a clear plan that the team followed. Workload distribution makes sense given membersโ€™ strengths. The repository feels like a group effort rather than disconnected pieces.
Really Good (70%) Sophisticated coordination with evidence of strategic specialisation. Good use of Issues and branches to track decisions. Code feels unified throughout.
๐Ÿ† WOW Exceptional project management. Evidence of thoughtful planning, clear role specialisation, and seamless integration of contributions. Could serve as an example of good team practice.

👤 Individual Reflection (10%)

Each team member submits a personal reflection as a Markdown file in your project repository:

your-team-repo/
└── reflections/
    ├── <github-username-1>.md
    ├── <github-username-2>.md
    └── <github-username-3>.md

The reflection has two components:

Evidence of Contribution (70%)

Show what you did. The evidence should speak for itself. Link to specific commits, Issues, Pull Requests, or branches that demonstrate your contribution. You do not need to write at length (in fact, don't!). Short, clear pointers to your work are better than long paragraphs describing it.

Include:

  • Links to your most significant commits or PRs
  • Brief descriptions of what each contribution involved
  • Any technical decisions you made and why

Learning Integration (30%)

Write up to two paragraphs about how feedback from earlier assessments (W04 Practice, Mini-Project 1, Mini-Project 2) shaped your approach in this project. Be specific: quote actual feedback, explain what you changed, and point to where the change shows up in your group project work.

To achieve 70/100 or above:

  • Provide clear evidence of substantial contributions (commit links, PR references, Issue threads)
  • Show genuine learning integration with specific feedback examples and behavioural changes
  • Write in your own voice (AI-generated reflections lack the specificity we are looking for)

📮 Need Help?

  • Post questions in the #help Slack channel
  • Check the ✋ Contact Hours page for support times
  • Book office hours via StudentHub for deeper conversations
  • Use the 🤖 Claude Tutor for technical questions about your pipeline

You are building on everything from both terms. Start with a clear plan, communicate with your team regularly, and give yourself enough time to polish both the pipeline and the presentation. Good luck!