📝 Group project

2025/26 Spring Term

Author

Dr. Ghita Berrada (edited by Dr. Stuart Bramwell)

💡 NOTES:

The research questions listed below are starting points, not straitjackets. You are expected to refine, narrow down, or reframe them as your understanding of the data develops.
The datasets provided are sufficient to address the questions at a basic level. However, you are allowed to supplement them with additional publicly available datasets if this strengthens your analysis. Any additional data must be clearly motivated and documented.
You should only provide code if it adds anything to your storytelling: make sure your code confirms, reinforces, or complements your narrative. Adding code just for the sake of it will not improve your grade.
You should prioritise methods seen in the course. If you use methods not covered, you must justify their use and explain how they work clearly enough for a technically literate reader.
You are writing a technical report for a scientific audience. Your goal is to convince that audience that your analytical choices are sound and that your interpretations are well supported.
Be mindful of the balance between detail and clarity. Not everything needs to be in the main text, but the core message of your analysis should be immediately clear. If you are looking for a word limit, we suggest producing a report that is 5,000 words (exclusive of code, references and plot annotations).
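
If you want a quick way to track the suggested word count, the sketch below may help. It assumes your report is a single .qmd file with an optional YAML header and fenced code chunks; the function name `count_report_words` is hypothetical. It skips only the YAML header and code chunks, so references and plot annotations would still need to be subtracted by hand.

```python
def count_report_words(qmd_text: str) -> int:
    """Rough word count of a .qmd source, skipping YAML header and code chunks."""
    fence = "`" * 3  # the three-backtick marker that opens and closes a chunk
    lines = qmd_text.splitlines()

    # Skip a leading YAML header delimited by two '---' lines, if present.
    start = 0
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                start = i + 1
                break

    words = 0
    in_code = False
    for line in lines[start:]:
        if line.lstrip().startswith(fence):
            in_code = not in_code  # a fence line toggles code-chunk state
            continue
        if not in_code:
            # Whitespace-separated tokens; markdown symbols such as '#' count too.
            words += len(line.split())
    return words
```

You could call it with `count_report_words(open("team_alpha.qmd", encoding="utf-8").read())`. Treat the result as an estimate rather than an exact figure.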

⏲️ Due Date: Tuesday, 26th May 2026, 5pm

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧

⚖️ Assignment Weight:

This assignment is worth 40% of your final grade in this course.


📝 Instructions

👉 Read these instructions carefully, as some details might change from one assignment to another.

Step 1 on Day 1 (29th April): Choice of datasets/research questions and group formation

Get acquainted with the dataset/research question pairs and rank them in order of preference by 5pm on 30th April in a document shared on Slack’s #ds202w-central channel.

The research questions for the group projects are as follows:

Project	Dataset	Research question	Notes
1	California school accountability	How sensitive are school accountability outcomes to design choices when student composition and cohort sizes change over time?	Any accountability system involves choices: what to measure, how to aggregate, and where to draw lines. Think about what those choices assume, and which schools might be systematically advantaged or disadvantaged by a given design. You will need enrollment data and academic indicator data (ELA and Mathematics); the choice of year(s) is up to you.
2	Sovereign debt, growth, and the perils of aggregation	What conclusions about the relationship between public debt and economic growth depend on aggregation choices, sample restrictions, and weighting schemes?	This project is inspired by the Reinhart-Rogoff debate, one of the most discussed methodological controversies in empirical economics. The data links given here provide a starting point, but you are expected to assemble your own dataset: the choice of variables, countries, time periods, and any additional sources is yours to make and justify. Before modelling anything, think carefully about the choices that shape what the data appear to say.
3	Algorithmic governance in welfare targeting	How do design choices in automated risk scoring systems shape who is flagged, and how stable are those outcomes across time and populations?	This system makes high-stakes decisions about real people. Think about what it means for such a system to work well, and for whom, and what the available data allows you to examine. You are encouraged to supplement the Lighthouse Reports data with additional administrative or socioeconomic data of your choosing.
4	Global health spending and system resilience	How do countries’ health financing structures relate to their capacity to absorb a major shock, and what does the COVID-19 period reveal about which funding sources and populations are most exposed when health systems come under pressure?	The data covers 195 countries from 2000 to 2023 across three linked tables. Defining and justifying a meaningful scope (in terms of countries, time period, and spending dimensions) is itself an analytical decision. You are encouraged to supplement with country-level contextual indicators and, if relevant to your question, with health outcome data (e.g. burden of disease, COVID mortality, or healthcare access indicators) to connect financing structures to actual system performance.
5	Firm survival and market concentration	How do firm survival patterns relate to market structure, and how do these relationships evolve over time?	Firms do not enter and exit markets randomly, and neither do the conditions they face. Think carefully about what structure in the data might confound simple comparisons, and what it would take to say something more robust. You may supplement with business demography or firm registry data from a country of your choosing.
6	World Bank project effectiveness and equity	Do World Bank development projects deliver equitably across countries and populations, and how much do conclusions about project effectiveness depend on where projects are implemented, what they aim to achieve, and how performance itself is measured?	The ratings in this dataset are produced by an independent evaluation unit, but independence does not mean neutrality. Think about what the ratings capture, what they might miss, and whether the standards they apply travel equally across very different contexts. The dataset contains both project characteristics known at approval and independent outcome assessments; consider what that structure makes possible, and what it does not.
7	Electoral irregularities and electoral integrity	To what extent do apparent signs of electoral irregularity reflect genuine problems in electoral integrity, and how sensitive are substantive conclusions to the choice of indicator, level of aggregation, and election context?	Different elections can appear problematic for very different reasons. Do not treat ‘electoral irregularity’ as a single, self-evident phenomenon; think carefully about which dimensions are being measured, by whom, and at what level of aggregation. You may supplement with election-results data at country or constituency level.
8	Urban service equity and the NYC 311 system	How did the volume, nature, and resolution of 311 service requests change around a major urban shock, and are there systematic disparities in which complaints go unresolved; do those disparities persist or widen after the shock?	The dataset is large (40M+ rows from 2010 to present). Defining and justifying a meaningful subset is itself an analytical decision. Define carefully what ‘unresolved’ means; this choice shapes your entire analysis. Do not specify the shock in advance: identify and justify it from the data. You may supplement with neighbourhood-level socioeconomic indicators (e.g. US Census data).

In the same document, you’ll also be asked to indicate your availability for the planned mentoring slots on 5th May and 12th May (being available for a slot means you can join either in-person or online; when booking slots, we’ll give you the option to choose).
The final group composition will be announced by 8pm on 27th April.

Step 2: Book mentoring session slots

Check Slack’s #ds202w-central channel for a document in which you’ll be able to book mentoring session slots. Book the slots by 1st May (slots are allocated on a first-come, first-served basis, but you can swap slots with other groups if needed).

Step 3: Create the group project repository on GitHub Classroom

Go to our Slack workspace’s #ds202w-central channel to find a GitHub Classroom link entitled 📝 Group project. Do not share this link with anyone outside this course!
Click on the link, sign in to GitHub, and then click on the green Accept this assignment button. The first student in each team to accept the assignment creates a new team (and gives it a name); the others then join that existing team. Agree on the team name among yourselves beforehand so that everyone joins the correct GitHub repository.
You will be redirected to a new private repository named ds202w-2025-2026-group-project-name-of-your-team, where name-of-your-team is the team name you’ve chosen. The repository will be private and blank; it’s up to you to populate it. In particular, add:
- a README.md file: this should document the content of your repository and give instructions on how to use it. See more details about README files here.
- a .qmd file as well as a rendered HTML file corresponding to your final group project report. These files should only contain the amount of analysis needed to answer the research question. Don’t try every single machine learning method under the sun, avoid explanations that are too verbose, and only provide code if it adds anything to your storytelling. The .qmd file should be named after your team.
  1. Fill out the .qmd file with your analysis. Only add code chunks if required for your storytelling. Still, you should provide a nicely formatted notebook.
    - Use headers (in particular section/subsection headers) and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
    - Don’t forget to reference your work properly if using ideas which are not your own.
  2. Once done, render your .qmd file. This will create an .html file with the same name as your .qmd file.
    - If you added any code, ensure it is reproducible. If we were to restart your environment and run your notebook, it should run without errors and produce the same results.
    - Please ensure your code confirms, reinforces, or complements your storytelling. Adding code just for the sake of it will not help you get a higher grade.
- individual contribution reflection files (500 words max each); see Section 1.4 for details.
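
Unseeded randomness is a common reason a notebook gives different results after a restart. As a minimal sketch, assuming your report uses Python chunks, the example below fixes a seed for a random train/test split. The constant `SEED` and the function `split_indices` are hypothetical names chosen for illustration, and only the standard library is used.

```python
import random

SEED = 202  # arbitrary fixed value, chosen for illustration only


def split_indices(n_rows: int, test_share: float, seed: int = SEED):
    """Return (train, test) lists of row indices from a seeded shuffle."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_share)
    return indices[n_test:], indices[:n_test]


# Two calls with the same seed give the same split, run after run.
train_a, test_a = split_indices(10, 0.3)
train_b, test_b = split_indices(10, 0.3)
assert (train_a, test_a) == (train_b, test_b)
```

If you use a modelling library, check how it accepts a seed and set it explicitly there as well, since seeding one generator does not seed the others.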

“What do I submit?”

You will submit:

A Quarto markdown file with the following naming convention: <TEAM_NAME>.qmd, where <TEAM_NAME> is your team name. For example, if your team name is team_alpha, then your file should be named team_alpha.qmd.
An HTML file render of the Quarto markdown file. Your HTML file must be self-contained.

In addition to these two files, each team member will submit an individual contribution reflection file of 500 words maximum. Submit it as a Markdown file at reflections/<username>.md, replacing <username> with your GitHub username (⚠️ don’t forget to send your username to if you haven’t already done so!).

In this file, you should outline:

your technical contribution: e.g. which parts of the analysis you contributed to, which models you implemented, which code you wrote
your role in team collaboration: e.g. examples of how you supported your team members, coordinated work, or helped resolve disagreements
what you learned: e.g. skills you developed, challenges you overcame, or areas you want to work on further

Provide some evidence to back up your reflection (e.g. meeting notes, Slack discussion screenshots, links to GitHub commits or pull requests).

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version. Not sure how to use Git? You can always add files via the GitHub web interface.
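
Before your final push, you may want to check that the expected files are in place. The sketch below is one way to do so, assuming Python 3.9 or later and a local clone of your repository; the function `check_submission` and its arguments are hypothetical. It checks file names and the 500-word reflection limit only; it does not verify that the HTML file is self-contained.

```python
from pathlib import Path


def check_submission(repo: Path, team_name: str, usernames: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []

    # Files expected at the top level of the repository.
    for required in ("README.md", f"{team_name}.qmd", f"{team_name}.html"):
        if not (repo / required).is_file():
            problems.append(f"missing file: {required}")

    # One reflection per team member, at reflections/<username>.md.
    for user in usernames:
        reflection = repo / "reflections" / f"{user}.md"
        if not reflection.is_file():
            problems.append(f"missing file: reflections/{user}.md")
        elif len(reflection.read_text(encoding="utf-8").split()) > 500:
            problems.append(f"reflections/{user}.md exceeds 500 words")

    return problems
```

For example, `check_submission(Path("."), "team_alpha", ["alice", "bob"])` run from the repository root returns an empty list when everything is present; the usernames here are placeholders.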

📋 Your Task

What do we need from you?

Context

While we provide data and a general research question, we will not be prescriptive about your choice of methods. Instead, we ask you to propose your own approach to the data.

Unlike other assignments, the data is provided as is, so you will have to choose your own features and carry out some amount of data cleaning before proceeding with your analysis.

This mirrors real-world scenarios in data science and academic research, where you are often given a dataset and asked to derive insights or address a problem.

Some things we want you to consider when tackling your research question:

What are the characteristics specific to your dataset? What does that imply for your subsequent analyses?
Which features do you need and why? Do you need any pre-processing to make them usable?
Should you use a supervised or unsupervised approach, or both?
How do you define your target variable (if you need one)? Is your definition sound? Does it have any limits?
Can you disentangle causality from your dataset? If so, how, and how robust are your conclusions?
Should you consider issues of fairness or bias when analysing the data?
Should you examine the whole dataset, or consider how the performance and conclusions of your analysis vary within well-chosen subsets?
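
To illustrate the last point, the sketch below compares an overall metric with the same metric computed within subgroups. It assumes a classification setting with accuracy as the metric, which may not match your project; the function `accuracy_by_group` is a hypothetical name, and the labels in the usage lines are invented purely for illustration.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of accuracy per subgroup."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        counts[group] += 1
        hits[group] += int(truth == pred)
    if not counts:
        raise ValueError("no observations supplied")
    overall = sum(hits.values()) / sum(counts.values())
    per_group = {g: hits[g] / counts[g] for g in counts}
    return overall, per_group


# Invented toy labels: group "a" is predicted perfectly, group "b" is not.
overall, per_group = accuracy_by_group(
    y_true=[1, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 1, 0, 1, 0],
    groups=["a", "a", "a", "b", "b", "b"],
)
```

Here the overall accuracy is 4/6, but it is 3/3 for group "a" and only 1/3 for group "b". A single headline number can hide exactly the kind of variation the question above asks you to look for.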

✔️ Assessment criteria

Here is a rough rubric for how we’ll grade this project.

Component	Weight	Things that influence your grade
Report organisation and logic	15%	Your report is clearly structured and organised in sections and subsections. You transition smoothly between sections and don’t jump abruptly between ideas. There is a clear storyline in your report: after reading it, your reader can easily recall the overarching message.
Clarity of the presentation	15%	Your report is formatted correctly, with all figures rendered well and annotated appropriately. You have included the right amount of explanation and your report is not overly verbose. You explain technical details clearly without being abstruse. You have included the right visualisations to present your results.
Appropriateness of the methods chosen with respect to the problem at hand	30%	Did the group choose suitable techniques/strategies to address the research question? Did you justify the use of the techniques in the context of the problem being addressed (dataset/research question)? Did you ground your modelling choices in some literature review? Did you explain your choice of features and/or target variables? Did you explain your data cleaning processes, if any? Did you discuss the impact of your processing/modelling choices on your results?
Quality of the interpretations	30%	Did you derive noteworthy insights from your analysis? Did you interpret your analysis results in the context of your dataset and research question? Did you discuss the strengths and limitations of your approaches? Did you discuss potential mitigation plans/future work plans?
Team coordination and project documentation	10%	Is there evidence that the work was well coordinated and shared fairly between everyone? Is the content of your GitHub repository well structured and documented (in particular, is the README.md file informative)?

A score of 70% or above is considered a distinction by typical LSE standards. This means that you can expect to score around 70% if you provide adequate answers to all questions, in line with the learning outcomes of the course and the instructions provided.

Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not get you a higher score. You need to add unique insights or analyses to get a distinction; we cannot tell you what these are, but these should be things that make us go, “wow, that’s a great idea! I hadn’t thought of that”.

Warning

DO NOT TRY EVERY SINGLE MODEL UNDER THE SUN to tackle the research question. State your modelling hypotheses clearly, justify your choices, and only select a small number of methods that are genuinely suited to your question.
The goal is not to solve the question entirely, but to get as close as possible to it in a principled, well-reasoned way.

🙋 Getting help

Mentoring sessions

There will be two mentoring sessions before the submission of your final group project report on 26th May.

A document will be circulated on Slack on the #ds202w-central channel for each group to book 2 mentoring session slots: one on 5th May and one on 12th May. Each team should book a slot on both dates (not all team members need to be present). Book the slots by 1st May and indicate whether you’d like the session to be in-person, online, or hybrid.

The aim of the first slot on 5th May is to check the feasibility of your project ideas; you won’t be expected to have completed any real analysis by this point.

The aim of the second slot on 12th May is to check on the state of your analysis and address any potential bottlenecks.

Asking for help on Slack

You can post general clarifying questions on Slack.

For example, you can ask:

“Where do I find material that compares different clustering techniques?”
“I came across the term ‘loadings’ when reading about PCA in the textbook, but I don’t fully understand it. Does anyone have a good alternative resource about it?”

You won’t be penalised for posting something on Slack that accidentally crosses the line. Don’t worry; we will delete your message and let you know.

👯 Collaborating with others

You are allowed to discuss the assignment with other teams, work alongside each other, and help each other. However, you cannot share your code with other teams or copy code from them.

🤖 Using AI help?

You can use Generative AI tools such as ChatGPT or Claude, and search online for help, when doing this research. If you use such tools (however minimally), you are asked to report the AI tool(s) you used and add an extra section to your notebook explaining how you used them.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. They also tend to produce formulaic and repetitive responses, which limits your chances of getting a high mark. When it comes to coding, these tools often generate code that is inefficient, outdated, or does not follow the principles taught in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.