📝 Group project

2024/25 Winter Term

Author

Dr. Ghita Berrada

💡 NOTES:

You should only provide code if it adds anything to your storytelling: make sure your code confirms, reinforces, or complements your storytelling. Adding code just for the sake of it will not help you get a higher grade.
You should prioritize methods seen in the course. But if they are not suitable for the problem at hand (i.e. the dataset you have been given or the problem you’re trying to solve), you are allowed to use methods not seen in the course. If you use any method not seen in the course, justify why you’re using it and explain how it works: we simply need to confirm that you know what you’re doing.
You’re writing a technical report for a scientific audience. Your main goal is to convince that audience that your analysis methods are sound and that you derive solid insights from your analysis.
Be mindful of the balance between detail (do you need everything in the main text? Do you need all the code blocks to be visible?) and explanation/storytelling. The message you’re trying to convey needs to be clear to your audience.

⏲️ Due Date: ~~Tuesday, May 13th, 5pm~~ Thursday, May 15th, 5pm

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧

⚖️ Assignment Weight:

This assignment is worth 40% of your final grade in this course.

📝 Instructions

👉 Read it carefully, as some details might change from one assignment to another.

Step 1 on Day 1 (April 28th): Choice of datasets/research questions and group formation

Get acquainted with the dataset/research question pairs and rank them in order of preference by 5pm on April 28th, in a document shared on Slack’s #announcements channel.

The research questions for the group projects are as follows:

Project	Dataset	Research Question	Notes
1	Varieties of Democracy	Do democracies have different regime support coalitions than autocracies?	Potential features to look at: `v2regsupgroups_0`-`v2regsupgroups_13`
2	Sambanis data imputed by Muchlinski et al	How best can we predict civil war? Choose three (and only three) algorithms to compare.	Article Replication Data. Potential feature of interest: `warstds`
3	World Values Survey, Wave 7	What factors shape attitudes towards economic redistribution?	Potential features of interest: `Q106`-`Q108`
4	Bank of England Inflation Attitudes Survey data + Bank of England/NMG household survey data	How does the perception of inflation influence spending behaviour/patterns?	Hint: To understand the variables in the Bank of England/NMG household survey dataset, go to this page and check the 2004-2011 version of the dataset. Potential features of interest: `q1` and `q2`
5	World Bank World Development Indicators (WDI) + World Bank Infant Mortality Data	What factors shape variation in infant mortality?	To understand each of the WDI indicators, capitalize the indicator name and replace underscores with dots. For example, `ic_reg_proc` becomes `IC.REG.PROC`. Append the transformed code to this URL: `https://databank.worldbank.org/metadataglossary/world-development-indicators/series/`. For `IC.REG.PROC`, for instance, you’ll go to the “https://databank.worldbank.org/metadataglossary/world-development-indicators/series/IC.REG.PROC” page. You can supplement your WDI indicators with additional WDI indicators through either the WDI tables or the WDI website. Main feature of interest: `SP.DYN.IMRT.IN`
6	European Social Survey, Wave 8	Is there variation in support regarding government assistance for different social groups?	Potential features of interest: `gvslvol`, `gvslvue` and `gvcldcr`
7	European Social Survey, Wave 11 / The PopuList	Are economic conditions primarily responsible for the support of radical right-wing populist parties?	You can also find the PopuList dataset here. Potential features of interest: variables with prefix `prtvt`
8	World Bank World Development Indicators (WDI)	By which criteria should we classify economies? Does division based on GNI per capita levels make sense?	See an example of classification here. To understand each of the WDI indicators, capitalize the indicator name and replace underscores with dots. For example, `ic_reg_proc` becomes `IC.REG.PROC`. Append the transformed code to this URL: `https://databank.worldbank.org/metadataglossary/world-development-indicators/series/`. For `IC.REG.PROC`, for instance, you’ll go to the “https://databank.worldbank.org/metadataglossary/world-development-indicators/series/IC.REG.PROC” page. You can supplement your WDI indicators with additional WDI indicators through either the WDI tables or the WDI website.
9	MHMisinfo	Do videos promoting health information and those spreading health misinformation differ systematically from each other?	Potential feature of interest: `label`
10	Political Apologies Database	Why do political leaders apologise?	Potential feature of interest: `description`
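The WDI lookup rule described for projects 5 and 8 (capitalize the indicator name, replace underscores with dots, append to the glossary URL) can be sketched in a few lines. This is only an illustration: the function name `wdi_metadata_url` is hypothetical, and it assumes every column name in your dataset follows the `ic_reg_proc` pattern shown in the table.

```python
# Base URL quoted in the project table above.
BASE_URL = (
    "https://databank.worldbank.org/metadataglossary/"
    "world-development-indicators/series/"
)


def wdi_metadata_url(column_name: str) -> str:
    """Turn a dataset column name such as 'ic_reg_proc' into its glossary URL.

    Hypothetical helper: it applies the two steps from the instructions
    (upper-case the name, replace underscores with dots) and appends the
    result to the base URL.
    """
    indicator_code = column_name.strip().upper().replace("_", ".")
    return BASE_URL + indicator_code


# Example from the instructions: ic_reg_proc becomes IC.REG.PROC.
print(wdi_metadata_url("ic_reg_proc"))
```

Applying the helper to each column name gives one glossary link per indicator, which can be handy when documenting your choice of features.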

In the same document, you’ll also be asked to indicate your availability for the planned mentoring slots on April 30th and May 7th. Being available for a slot means you can join either in person or online; when booking slots, we’ll give you the option to choose between the two.
The final group composition will be announced by 6.30pm on April 28th.

Step 2: Book mentoring session slots

Check the Slack’s #announcements channel for a document on which you’ll be able to book mentoring session slots. Book the slots by April 29th at 5pm.

Step 3: Create the group project repository on GitHub Classroom

Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled 📝 Group project. Do not share this link with anyone outside this course!
Click on the link, sign in to GitHub, and then click on the green Accept this assignment button. The first student from the team will create a new team (and give it a name), while the others will join that existing team. Coordinate among yourselves on the team name so that everyone joins the correct GitHub repository.
You will be redirected to a new private repository created just for you. The repository will be named ds202w-2024-2025-group-project-name-of-your-team, where name-of-your-team is the team name you’ve chosen. The repository will be private and will be blank unlike in previous assignments. It’ll be up to you to populate it. In particular, add:
- a README.md file : this should document the content of your repository and give instructions on how to use it. See more details about README files here
- a .qmd file as well as a rendered HTML file that correspond to your final group project report. These files should only contain the amount of analysis needed to answer the original research question. Don’t try every single machine learning method under the sun to solve the research questions, avoid explanations that are too verbose and only provide code if it adds anything to your storytelling. The .qmd file should be named after your team.
  1. Fill out the .qmd file with your analysis. Only add code chunks if required for your storytelling. Still, you should provide a nicely formatted notebook.
    - Use headers (in particular section/subsection headers) and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
    - Don’t forget to reference your sources properly when using ideas that are not your own.
  2. Once done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.
    - If you added any code, ensure your .qmd code is reproducible. If we were to restart VSCode and run your notebook, it should run without errors, and we should get the same results as you did.
    - If you choose to add code, please ensure your code confirms, reinforces, or complements your storytelling. Adding code just for the sake of it will not help you get a higher grade.
- individual contribution reflection files (500 words max each; see Section 1.4 for this part)
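On the reproducibility requirement above: one common source of results that change between runs is unseeded randomness (for example in train/test splits or resampling). The sketch below shows the general idea with Python’s standard library only. The seed value, the function name `split_rows` and the 25% test fraction are arbitrary assumptions for illustration; the same principle applies in whichever language your .qmd file uses.

```python
import random

SEED = 202  # arbitrary fixed value; set it once and keep it unchanged


def split_rows(row_ids, test_fraction=0.25, seed=SEED):
    """Shuffle row ids with a seeded generator and split into train/test.

    Hypothetical helper: a private random.Random instance is used so the
    split does not depend on any other random calls made in the notebook.
    """
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Two independent calls with the same seed give the same split.
train_a, test_a = split_rows(range(20))
train_b, test_b = split_rows(range(20))
print(train_a == train_b and test_a == test_b)  # True
```

Restarting the session and re-rendering then yields the same split, so any downstream metrics you report stay stable.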

“What do I submit?”

You will submit:

A Quarto markdown file with the following naming convention: <TEAM_NAME>.qmd, where <TEAM_NAME> is your team name. For example, if your team name is team_alpha, then your file should be named team_alpha.qmd.
An HTML file render of the Quarto markdown file. In case this wasn’t clear already, your HTML file needs to be self-contained (this was a requirement for every assignment until now and still is a requirement now).

In addition to these two files, each team member will submit an individual contribution reflection file of 500 words maximum. This should be submitted as a Markdown file, for example reflections/<username>.md, where you replace <username> with your GitHub username (⚠️ don’t forget to send us your GitHub username if you haven’t already done so!). In this file, you should outline:

your technical contribution, e.g. which parts of the analysis you contributed to, which models you implemented, which code you wrote
your role in the team collaboration, e.g. examples of how you supported your team members, coordinated or played a role in defusing conflicts
what you learned from this project, e.g. any skills you developed, challenges you overcame or areas you want to work on further in the future

Provide some evidence to back up your reflection file (e.g. meeting notes, Slack discussion screenshots, links to GitHub Classroom commits or pull requests).

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.
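Before your final push, it can help to check the repository against the naming conventions listed above. The following is a minimal sketch, not an official checker: the function name `check_submission`, the example team name and the example username are all hypothetical, and the word count is a simple whitespace split, which may differ slightly from how the markers count words.

```python
from pathlib import Path

MAX_REFLECTION_WORDS = 500  # limit stated in the instructions


def check_submission(repo_dir, team_name, usernames):
    """Return a list of problems found; an empty list means all checks passed."""
    repo = Path(repo_dir)
    problems = []

    # Files expected at the top level of the repository.
    for name in ["README.md", f"{team_name}.qmd", f"{team_name}.html"]:
        if not (repo / name).is_file():
            problems.append(f"missing file: {name}")

    # One reflection file per team member, each within the word limit.
    for user in usernames:
        reflection = repo / "reflections" / f"{user}.md"
        if not reflection.is_file():
            problems.append(f"missing file: reflections/{user}.md")
            continue
        n_words = len(reflection.read_text(encoding="utf-8").split())
        if n_words > MAX_REFLECTION_WORDS:
            problems.append(
                f"reflections/{user}.md has {n_words} words "
                f"(max {MAX_REFLECTION_WORDS})"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical team name and GitHub username; replace with your own.
    for problem in check_submission(".", "team_alpha", ["octocat"]):
        print(problem)
```

Run it from the root of your local clone; no output means the expected files are present and the reflections are within the limit. It does not check that the HTML file is self-contained or that the .qmd renders without errors.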

📋 Your Task

What do we need from you?

Context

While we provide the data and a general research question, we will not be prescriptive in terms of choice of methods. Instead, we will task you with proposing your approach to the data.

Unlike in other assignments, the data is provided as is, so you will have to choose your own features and do some amount of data cleaning before proceeding with your analysis.

This mirrors real-world scenarios in data science and academic research, where you are often given a dataset and asked to derive insights or address a problem.

Some things we want you to consider when tackling your research questions are as follows:

What are the characteristics specific to our dataset? What does that entail for our subsequent analyses/modeling?
Which features do we need and why? Do we need any pre-processing to make them usable?
Should we go for a supervised or unsupervised model or both?
How do we define our target variable (if you need one)? Is our definition sound? Does it have any limits?

✔️ Assessment criteria

Here is a rough rubric for how we’ll grade this project.

Component	Weight	Things that influence your grade
Report organisation and logic	15%	Your report is clearly structured and organized in sections and subsections. You transition between sections smoothly and don’t jump abruptly between sections and ideas. There is a clear storyline in your report: after reading your report, your reader can easily tell/recall what the overarching message is.
Clarity of the presentation	15%	Your report is formatted correctly, with all figures rendered well and annotated appropriately. You have included the right amount of explanation and your report is not overly verbose. You explain technical details clearly without being abstruse. You included the right visualisations to present your results.
Appropriateness of the methods chosen with respect to the problem at hand	30%	Did the group choose suitable techniques/strategies to address the research question? Did you justify the use of the techniques in the context of the problem being solved (dataset/research question)? Did you ground your modeling choices in some literature review? Did you explain your choice of features and/or target variables? Did you explain your data cleaning processes, if any? Did you discuss the impact of your processing/modeling choices on your results?
Quality of the interpretations	30%	Did you derive noteworthy insights from your analysis? Did you interpret your various analysis results in the context of your dataset and research question? Did you discuss the strengths and limitations of your approaches? Did you discuss potential mitigation plans/future work plans?
Team coordination and project documentation	10%	Is there evidence that the work was well coordinated and shared well between everyone? Is the content of your GitHub repository well structured and documented (in particular, is that README.md file informative)?

A score of 70% or above is considered a distinction by typical LSE standards. This means that you can expect to score around 70% if you provide adequate answers to all questions, in line with the learning outcomes of the course and the instructions provided.

Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not get you a higher score. You need to add unique insights or analyses to get a distinction - we cannot tell you what these are, but these should be things that make us go, “wow, that’s a great idea! I hadn’t thought of that”.

Warning

DO NOT TRY EVERY SINGLE MODEL UNDER THE SUN to tackle the research question. State your modeling hypotheses clearly, justify your choices and only choose a couple of models to try and solve your question.
The goal is also not to answer the question entirely but to get as close as possible to an answer.

🙋 Getting help

Mentoring sessions

There will be two mentoring sessions before the submission of your final group project report on May 15th.

A document will be circulated on Slack on the #announcements channel for each group to book 2 mentoring session slots: one on April 30th and one on May 7th. Each team is supposed to book a slot on both the 30th and 7th (not all team members need to be present). Book the slots by April 29th at 5pm and indicate whether you’d like the session to be in-person or online.

The aim of the first slot on April 30th is simply to check the feasibility of your project ideas (you won’t have had time to start any real analysis by the time of the first mentoring session).

The aim of the second slot on May 7th is to check on the state of advancement of your analysis and to address any potential bottlenecks.

Asking for help on Slack

You can post general clarifying questions on Slack.

For example, you can ask:

“Where do I find material that compares different clustering techniques?”
“I came across the term ‘loadings’ when reading about PCA in the textbook, but I don’t fully understand it. Does anyone have a good alternative resource about it?”

You won’t be penalized for posting something on Slack that violates this principle without realizing it. Don’t worry; we will delete your message and let you know.

👯 Collaborating with others

You are allowed to discuss the assignment with other teams, work alongside each other, and help each other. However, you cannot share or copy code from others — pretty much the same rules as above.

🤖 Using AI help?

You can use Generative AI tools such as ChatGPT when doing this research and search online for help. If you use them, however minimally, you are asked to report the AI tool you used and add an extra section to your notebook to explain how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.