πŸ“ Group project

2024/25 Autumn Term

Author

Dr. Ghita Berrada

πŸ’‘ NOTE: This time, you are not asked to write code as part of the assignment. If you choose to do so, please ensure your code confirms, reinforces, or complements your answers. Adding code just for the sake of it will not help you get a higher grade.

⏲️ Due Date: Wednesday, February 12th, 5pm

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Do you have an extenuating circumstance and need an extension? Send an e-mail to πŸ“§

βš–οΈ Assignment Weight:

This assignment is worth 40% of your final grade in this course.


πŸ“ Instructions

πŸ‘‰ Read these instructions carefully, as some details might change from one assignment to another.

Step 1 on Day 1 (January 29th): Choice of datasets/research questions and group formation

  1. Get acquainted with the dataset/research question pairs and rank them in order of preference by 5pm on January 29th, in a document shared on Slack’s #announcements channel

We have chosen three datasets to base the research questions for the group projects on:

  1. The World Values Survey, Wave 7 (the full World Values Survey questionnaire description is available for download below)

  2. The Wellcome Trust Global Monitor, 2020 (the full Wellcome Trust Global Monitor survey questionnaire description is available for download below)

  3. The European Social Survey, Round 11

The research questions for the group projects are as follows:

| Project | Dataset | Research question |
|---------|---------|--------------------|
| 1 | Wellcome Trust Global Monitor, 2020 | What factors determine the public’s trust in science (question W6)? |
| 2 | Wellcome Trust Global Monitor, 2020 | What factors would determine one’s opinion and outlook on science (questions W11A and W11B)? |
| 3 | Wellcome Trust Global Monitor, 2020 | Would science increase or decrease jobs (question W10)? |
| 4 | Wellcome Trust Global Monitor, 2020 | What factors make it more (or less) likely to be climatosceptic (question W15)? |
| 5 | World Values Survey, Wave 7 | What factors influence the perception that a country is democratic (question Q251)? |
| 6 | World Values Survey, Wave 7 | What factors influence feelings of security (question Q131)? |
| 7 | World Values Survey, Wave 7 | What factors drive an interest in politics (question Q199)? |
| 8 | World Values Survey, Wave 7 | What actions can be justified (questions Q177 to Q195)? Are there regional differences when it comes to this? |
| 9 | World Values Survey, Wave 7 | What factors drive the perception of freedom of choice (question Q48)? |
| 10 | European Social Survey, Round 11 | Can you predict emotional attachment to country (atchctr)? What factors drive it? |
| 11 | European Social Survey, Round 11 | What factors make it more likely to support dividing parliament (more) equally between men and women (eqparep)? |
| 12 | World Values Survey, Wave 7 | Can you predict trust in media (e.g. Q66, Q67, Q117)? |
| 13 | European Social Survey, Round 11 | Can you predict trust in political institutions? |
  2. In the same document, you’ll also be asked to indicate your availability for the planned mentoring slots on January 31st and February 5th.

  3. The final group composition will be announced by 6.30pm on January 29th.

Step 2: Book mentoring session slots

  1. Check Slack’s #announcements channel for a document on which you’ll be able to book mentoring session slots. Book the slots by January 30th at 6pm.

Step 3: Create the group project repository on GitHub Classroom

  1. Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled πŸ“ Group project. Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub, and then click on the green Accept this assignment button. The first student from the team will create a new team (and give it a name), while the others will join that existing team. So coordinate on the team name to make sure you join the correct GitHub repository.

  3. You will be redirected to a new private repository created just for your team. The repository will be named ds202a-2024-group-project-name-of-your-team, where name-of-your-team is the team name you’ve chosen. The repository will be private and, unlike in previous assignments, blank. It’ll be up to you to populate it. In particular, add:

    • a README.md file: this should document the content of your repository and give instructions on how to use it. See more details about README files here

    • a .qmd file as well as a rendered HTML file corresponding to your final group project report. These files should only contain the amount of analysis needed to answer the original research question: don’t try every single machine learning method under the sun, avoid overly verbose explanations, and only provide code if it adds to your storytelling. The .qmd file should be named after your team.

      1. Fill out the .qmd file with your analysis. Only add code chunks if required for your storytelling. Still, you should provide a nicely formatted notebook.

        • Use headers (in particular section/subsection headers) and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
        • Don’t forget to properly reference any ideas that are not your own
      2. Once done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.

        • If you added any code, ensure your .qmd code is reproducible. If we were to restart R and RStudio and run your notebook, it should run without errors, and we should get the same results as you did.

        • If you choose to add code, please ensure your code confirms, reinforces, or complements your storytelling. Adding code just for the sake of it will not help you get a higher grade.

    • individual contribution reflection files (500 words max each) (see Section 1.4 for this part)
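To make this concrete, here is a minimal sketch of what such a .qmd file could look like. The title, the team name team_alpha, and the section headings are purely illustrative, not a required structure:

````markdown
---
title: "Group project: <research question>"
author: "team_alpha"
format: html
---

## Introduction

State the research question and introduce the dataset.

## Data cleaning and feature selection

```{r}
# Load the data and keep only the features justified in the text
```

## Modelling and results

## Interpretation, limitations and future work
````

Rendering this file (via the Render button in RStudio, or `quarto render team_alpha.qmd` on the command line) produces team_alpha.html alongside it.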

β€œWhat do I submit?”

You will submit:

  • A Quarto markdown file with the following naming convention: <TEAM_NAME>.qmd, where <TEAM_NAME> is your team name. For example, if your team name is team_alpha, then your file should be named team_alpha.qmd.

  • An HTML file render of the Quarto markdown file.

In addition to these two files, each team member will submit an individual contribution reflection file of 500 words maximum. This should be submitted as a Markdown file, say reflections/<username>.md, where you replace <username> with your GitHub username (⚠️ don’t forget to send us your GitHub username if you haven’t already done so!). In this file, you should outline:

  • your technical contribution, e.g. which parts of the analysis you contributed to, which models you implemented, which code you wrote
  • your role in the team collaboration, e.g. examples of how you supported your team members, coordinated work, or played a role in defusing conflicts
  • what you learned from this project, e.g. any skills you developed, challenges you overcame, or areas you want to work on further in the future

Provide some evidence to back up your reflection file (e.g. meeting notes, Slack discussion screenshots, links to GitHub Classroom commits or pull requests).
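As an illustration only (the headings below are a suggestion, not a required format), a reflections/<username>.md file could be organised like this:

```markdown
# Individual contribution reflection

## Technical contribution
Which parts of the analysis I contributed to, which models I implemented,
which code I wrote.

## Role in the team collaboration
How I supported my team members, coordinated work, or helped defuse conflicts.

## What I learned
Skills I developed, challenges I overcame, areas I want to work on further.

## Evidence
Links to commits or pull requests, meeting notes, Slack discussion screenshots.
```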

You don’t need to click anything to submit. Your assignment will be submitted automatically when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline; we will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

πŸ“‹ Your Task

What do we need from you?

Context

While we provide the data and a general research question, we will not be prescriptive about the choice of methods. Instead, we will task you with proposing your own approach to the data.

Unlike in other assignments, the data is provided as is, so you will have to choose your own features and do some amount of data cleaning before proceeding with your analysis.

This mirrors real-world scenarios in data science and academic research, where you are often given a dataset and asked to derive insights or address a problem.

Some things we want you to consider when tackling your research questions are as follows:

  • Which features do we need and why? Do we need any pre-processing to make them usable?

  • Should we go for a supervised or unsupervised model or both?

  • How do we define our target variable (if we need one)? Is our definition sound? Does it have any limitations?

βœ”οΈ Assessment criteria

Here is a rough rubric for how we’ll grade this project.

Report organisation and logic (15%)

  • Your report is clearly structured and organised in sections and subsections
  • You transition smoothly between sections and don’t jump abruptly between sections and ideas

Clarity of the presentation (15%)

  • Your report is formatted correctly, with all figures rendered well and annotated appropriately
  • You have included the right amount of explanation, and your report is not overly verbose
  • You explain technical details clearly without being abstruse
  • You included the right visualisations to present your results

Appropriateness of the methods chosen with respect to the problem at hand (30%)

  • Did the group choose suitable techniques/strategies to address the research question?
  • Did you justify the use of the techniques in the context of the problem being solved (dataset/research question)?
  • Did you ground your modelling choices in some literature review?
  • Did you explain your choice of features and/or target variables?
  • Did you explain your data cleaning process, if any?
  • Did you discuss the impact of your processing/modelling choices on your results?

Quality of the interpretations (30%)

  • Did you derive noteworthy insights from your analysis?
  • Did you interpret your various analysis results in the context of your dataset and research question?
  • Did you discuss the strengths and limitations of your approaches?
  • Did you discuss potential mitigation plans/future work?

Team coordination and project documentation (10%)

  • Is there evidence that the work was well coordinated and shared fairly between everyone?
  • Is the content of your GitHub repository well structured and documented (in particular, is the README.md file informative)?

A score of 70% or above is considered a distinction, in line with typical LSE expectations. This means that you can expect to score around 70% if you provide adequate answers to all questions, in line with the learning outcomes of the course and the instructions provided.

Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not get you a higher score. You need to add unique insights or analyses to get a distinction. We cannot tell you what these are, but they should be things that make us go, β€œwow, that’s a great idea! I hadn’t thought of that”.

Warning
  • DO NOT TRY EVERY SINGLE MODEL UNDER THE SUN to tackle the research question. State your modelling hypotheses clearly, justify your choices, and only choose a couple of models to try to solve your question.
  • The goal is not to solve the question entirely but to get as close to an answer as possible.

πŸ™‹ Getting help

Mentoring sessions

There will be two mentoring sessions before the submission of your final group project report on February 12th.

A document will be circulated on Slack on the #announcements channel for each group to book 2 mentoring session slots: one on January 31st and one on February 5th. Each team is supposed to book a slot on both the 31st and 5th (not all team members need to be present). Book the slots by January 30th at 6pm.

The aim of the first slot on January 31st is simply to check the feasibility of your project ideas (you won’t have had time to start any real analysis by the time of the first mentoring session).

The aim of the second slot on February 5th is to check on the progress of your analysis and to address any potential bottlenecks.

Asking for help on Slack

You can post general clarifying questions on Slack.

For example, you can ask:

  • β€œWhere do I find material that compares different clustering techniques?”
  • β€œI came across the term β€˜loadings’ when reading about PCA in the textbook, but I don’t fully understand it. Does anyone have a good alternative resource about it?”

If you post something on Slack that goes beyond general clarifying questions without realising it, you won’t be penalised. Don’t worry; we will delete your message and let you know.

πŸ‘― Collaborating with others

You are allowed to discuss the assignment with other teams, work alongside each other, and help each other. However, you cannot share or copy code from others β€” pretty much the same rules as above.

πŸ€– Using AI help?

You can use generative AI tools such as ChatGPT when doing this research, and you can search online for help. If you do use such a tool, however minimally, you are asked to report which AI tool you used and to add an extra section to your notebook explaining how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to produce formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see πŸ€– Our Generative AI policy.