Group project
2025/26 Spring Term
- The research questions listed below are starting points, not straitjackets. You are expected to refine, narrow down, or reframe them as your understanding of the data develops.
- The datasets provided are sufficient to address the questions at a basic level. However, you are allowed to supplement them with additional publicly available datasets if this strengthens your analysis. Any additional data must be clearly motivated and documented.
- You should only provide code if it adds anything to your storytelling: make sure your code confirms, reinforces, or complements your narrative. Adding code just for the sake of it will not improve your grade.
- You should prioritise methods seen in the course. If you use methods not covered, you must justify their use and explain how they work clearly enough for a technically literate reader.
- You are writing a technical report for a scientific audience. Your goal is to convince that audience that your analytical choices are sound and that your interpretations are well supported.
- Be mindful of the balance between detail and clarity. Not everything needs to be in the main text, but the core message of your analysis should be immediately clear. If you are looking for a word limit, we suggest a report of 5,000 words (excluding code, references and plot annotations).
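As a rough way to track the suggested word limit, the sketch below counts words in a Quarto document while skipping the YAML front matter and fenced code chunks. The function name `count_report_words` is hypothetical, and the count is only an approximation: it does not exclude references or plot annotations, and it treats Markdown symbols such as `#` as words.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built programmatically for easy embedding


def count_report_words(qmd_text: str) -> int:
    """Approximate word count of a .qmd document, ignoring front matter and code chunks."""
    lines = qmd_text.splitlines()
    words = 0
    in_code = False
    # Front matter is assumed to start on the very first line with '---'.
    in_front_matter = bool(lines) and lines[0].strip() == "---"
    for i, line in enumerate(lines):
        stripped = line.strip()
        if in_front_matter:
            # The front matter ends at the next '---' line after the first one.
            if i > 0 and stripped == "---":
                in_front_matter = False
            continue
        if stripped.startswith(FENCE):
            # Opening or closing fence: toggle the state and skip the fence line itself.
            in_code = not in_code
            continue
        if not in_code:
            words += len(stripped.split())
    return words
```

A typical use would be `count_report_words(open("team_alpha.qmd", encoding="utf-8").read())`, where the file name follows the naming convention described in these instructions.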
Due Date: Tuesday, 26th May 2026, 5pm
If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.
Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧
Assignment Weight:
This assignment is worth 40% of your final grade in this course.
Instructions
Read it carefully, as some details might change from one assignment to another.
Step 1 on Day 1 (29th April): Choice of datasets/research questions and group formation
- Get acquainted with the dataset/research question pairs and rank them by order of preference by 5pm on 30th April in a document shared on Slack’s #ds202w-central channel.
The research questions for the group projects are as follows:
In the same document, you’ll also be asked to indicate your availability for the planned mentoring slots on 5th May and 12th May (being available for a slot means you can join either in-person or online; when booking slots, we’ll give you the option to choose).
The final group composition will be announced by 8pm on 27th April.
Step 2: Book mentoring session slots
- Check Slack’s #ds202w-central channel for a document in which you’ll be able to book mentoring session slots. Book the slots by 1st May (slots are allocated on a first-come, first-served basis, but you can swap slots with other groups if needed).
Step 3: Create the group project repository on GitHub Classroom
- Go to our Slack workspace’s #ds202w-central channel to find a GitHub Classroom link entitled Group project. Do not share this link with anyone outside this course!
- Click on the link, sign in to GitHub, and then click on the green Accept this assignment button. The first student from the team will create a new team (and give it a name), while the others will join the existing team. Coordinate between yourselves on the team name so that you all join the correct GitHub repository.
- You will be redirected to a new private repository named ds202w-2025-2026-group-project-name-of-your-team, where name-of-your-team is the team name you’ve chosen. The repository will be private and blank; it’s up to you to populate it. In particular, add:
  - a README.md file: this should document the content of your repository and give instructions on how to use it. See more details about README files here.
  - a .qmd file as well as a rendered HTML file corresponding to your final group project report. These files should only contain the amount of analysis needed to answer the research question. Don’t try every single machine learning method under the sun, avoid explanations that are too verbose, and only provide code if it adds anything to your storytelling. The .qmd file should be named after your team.
- Fill out the .qmd file with your analysis. Only add code chunks if required for your storytelling. Still, you should provide a nicely formatted notebook.
  - Use headers (in particular section/subsection headers) and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
  - Don’t forget to reference your work properly if using ideas which are not your own.
- Once done, render your .qmd file. This will create an .html file with the same name as your .qmd file.
- If you added any code, ensure it is reproducible. If we were to restart your environment and run your notebook, it should run without errors and produce the same results.
Please ensure your code confirms, reinforces, or complements your storytelling. Adding code just for the sake of it will not help you get a higher grade.
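One common source of non-reproducible results is unseeded randomness, for example in a train/test split. The sketch below, using only the Python standard library, shows a split that depends solely on a fixed seed. The function name `train_test_split_indices` and the seed value are hypothetical; libraries such as NumPy or scikit-learn expose their own seeding mechanisms (for instance the `random_state` parameter in scikit-learn), which serve the same purpose.

```python
import random

SEED = 202  # hypothetical value; any fixed integer works


def train_test_split_indices(n_rows: int, test_fraction: float, seed: int = SEED):
    """Return (train, test) row indices from a shuffle that depends only on `seed`."""
    # A local generator is unaffected by random calls made elsewhere in the notebook.
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    # The first n_test shuffled indices form the test set, the rest the training set.
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# Calling the function twice with the same seed yields identical splits.
first = train_test_split_indices(10, 0.3)
second = train_test_split_indices(10, 0.3)
print(first == second)
```

Restarting the environment and re-running the chunk should give the same split each time, which is the behaviour the reproducibility requirement asks for.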
- The repository must also contain the individual contribution reflection files (500 words max each); see Section 1.4 for details.
“What do I submit?”
You will submit:
- A Quarto markdown file with the following naming convention: <TEAM_NAME>.qmd, where <TEAM_NAME> is your team name. For example, if your team name is team_alpha, then your file should be named team_alpha.qmd.
- An HTML file render of the Quarto markdown file. Your HTML file must be self-contained.
In addition to these two files, each team member will submit an individual contribution reflection file of 500 words maximum. Submit it as a Markdown file at reflections/<username>.md, replacing <username> with your GitHub username (⚠️ don’t forget to send your username to if you haven’t already done so!).
In this file, you should outline:
- your technical contribution: e.g. which parts of the analysis you contributed to, which models you implemented, which code you wrote
- your role in team collaboration: e.g. examples of how you supported your team members, coordinated work, or helped resolve disagreements
- what you learned: e.g. skills you developed, challenges you overcame, or areas you want to work on further
Provide some evidence to back up your reflection (e.g. meeting notes, Slack discussion screenshots, links to GitHub commits or pull requests).
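Before the final push, a quick self-check of the repository layout can catch missing or misnamed files. The sketch below is one possible check, not an official grading script: the function name `check_submission` is hypothetical, and it only verifies the file names described above and the 500-word limit on reflections (counting whitespace-separated tokens).

```python
from pathlib import Path

MAX_REFLECTION_WORDS = 500  # limit stated in the instructions


def check_submission(repo_dir: str, team_name: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    repo = Path(repo_dir)
    problems = []
    # README, report source, and rendered report are expected at the repository root.
    for required in ("README.md", f"{team_name}.qmd", f"{team_name}.html"):
        if not (repo / required).is_file():
            problems.append(f"missing file: {required}")
    # Each team member's reflection is expected at reflections/<username>.md.
    reflections_dir = repo / "reflections"
    reflections = sorted(reflections_dir.glob("*.md")) if reflections_dir.is_dir() else []
    if not reflections:
        problems.append("no reflection files found in reflections/")
    for path in reflections:
        n_words = len(path.read_text(encoding="utf-8").split())
        if n_words > MAX_REFLECTION_WORDS:
            problems.append(f"{path.name}: {n_words} words (limit {MAX_REFLECTION_WORDS})")
    return problems
```

For example, `check_submission(".", "team_alpha")` run from the repository root returns an empty list when everything is in place. The check does not verify that the HTML file is self-contained; in Quarto, setting `embed-resources: true` under `format: html` is one way to bundle resources into a single file.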
You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version. Not sure how to use Git? You can always add files via the GitHub web interface.
Your Task
What do we need from you?
Context
While we provide data and a general research question, we will not be prescriptive about your choice of methods. Instead, we ask you to propose your own approach to the data.
Unlike other assignments, the data is provided as is, so you will have to choose your own features and carry out some amount of data cleaning before proceeding with your analysis.
This mirrors real-world scenarios in data science and academic research, where you are often given a dataset and asked to derive insights or address a problem.
Some things we want you to consider when tackling your research question:
What are the characteristics specific to your dataset? What does that imply for your subsequent analyses?
Which features do you need and why? Do you need any pre-processing to make them usable?
Should you use a supervised or unsupervised approach, or both?
How do you define your target variable (if you need one)? Is your definition sound? Does it have any limits?
Can you disentangle causality from your dataset? If so, how, and how robust are your conclusions?
Should you consider issues of fairness or bias when analysing the data?
Should you examine the whole dataset, or consider how the performance and conclusions of your analysis vary within well-chosen subsets?
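The last two questions (fairness, and how conclusions vary across subsets) can often be explored by computing the same metric separately for each group. The sketch below does this for accuracy using only the standard library. The function name `accuracy_by_group`, the `y_true`/`y_pred` field names, and the toy rows are all made up for illustration; they are not taken from the project datasets.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Map each value of `group_key` to (accuracy, n_rows) over the records in that group."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for row in records:
        group = row[group_key]
        counts[group] += 1
        hits[group] += int(row["y_true"] == row["y_pred"])
    return {g: (hits[g] / counts[g], counts[g]) for g in counts}


# Made-up toy rows, purely for illustration.
toy = [
    {"region": "north", "y_true": 1, "y_pred": 1},
    {"region": "north", "y_true": 0, "y_pred": 0},
    {"region": "south", "y_true": 1, "y_pred": 0},
    {"region": "south", "y_true": 0, "y_pred": 0},
]
print(accuracy_by_group(toy, "region"))
# prints {'north': (1.0, 2), 'south': (0.5, 2)}
```

Here the overall accuracy is 0.75, which hides the gap between the two groups. Reporting the group size alongside the metric helps the reader judge how much weight a difference between small subsets can bear.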
Assessment criteria
Here is a rough rubric for how we’ll grade this project.
| Component | Weight | Things that influence your grade |
|---|---|---|
| Report organisation and logic | 15% | |
| Clarity of the presentation | 15% | |
| Appropriateness of the methods chosen with respect to the problem at hand | 30% | |
| Quality of the interpretations | 30% | |
| Team coordination and project documentation | 10% | |
A score of 70% or above is considered a distinction by typical LSE standards. This means that you can expect to score around 70% if you provide adequate answers to all questions, in line with the learning outcomes of the course and the instructions provided.
Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not get you a higher score. You need to add unique insights or analyses to get a distinction; we cannot tell you what these are, but these should be things that make us go, “wow, that’s a great idea! I hadn’t thought of that”.
- DO NOT TRY EVERY SINGLE MODEL UNDER THE SUN to tackle the research question. State your modelling hypotheses clearly, justify your choices, and only select a small number of methods that are genuinely suited to your question.
- The goal is not to solve the question entirely, but to get as close as possible to it in a principled, well-reasoned way.
Getting help
Mentoring sessions
There will be two mentoring sessions before the submission of your final group project report on 26th May.
A document will be circulated on Slack on the #ds202w-central channel for each group to book 2 mentoring session slots: one on 5th May and one on 12th May. Each team should book a slot on both dates (not all team members need to be present). Book the slots by 1st May and indicate whether you’d like the session to be in-person, online, or hybrid.
The aim of the first slot on 5th May is to check the feasibility of your project ideas; you won’t be expected to have completed any real analysis by this point.
The aim of the second slot on 12th May is to check on the state of your analysis and address any potential bottlenecks.
Asking for help on Slack
You can post general clarifying questions on Slack.
For example, you can ask:
- “Where do I find material that compares different clustering techniques?”
- “I came across the term ‘loadings’ when reading about PCA in the textbook, but I don’t fully understand it. Does anyone have a good alternative resource about it?”
You won’t be penalized for posting something on Slack that accidentally crosses the line. Don’t worry; we will delete your message and let you know.
Collaborating with others
You are allowed to discuss the assignment with other teams, work alongside each other, and help each other. However, you cannot share or copy code from others.
Using AI help?
You can use Generative AI tools such as ChatGPT or Claude when doing this research and search online for help. If you do (however minimal your use), you are asked to report the AI tool(s) you used and add an extra section to your notebook explaining how you used them.
Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. They also tend to produce formulaic and repetitive responses, which limits your chances of getting a high mark. When it comes to coding, these tools often generate code that is inefficient, outdated, or does not follow the principles taught in this course.
To see examples of how to report the use of AI tools, see our Generative AI policy.












