📝 W07 Summative
2023/24 Autumn Term
⏲️ Due Date:
- 9 November 2023 at 11:59:59 am (London time)
If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.
Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧 Kevin
🎯 Main Objectives:
- Demonstrate your ability to write a report in Quarto Markdown
- Demonstrate your dplyr and ggplot2 skills
- Demonstrate your ability to fit a logistic regression model
- Demonstrate your ability to interpret and evaluate the performance of a logistic regression model
- Demonstrate your ability to defend your model choices
⚖️ Assignment Weight:
This assignment is worth 20% of your final grade in this course.
“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”
Source: LSE
📝 Instructions
1. Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled 📝 W07 Summative. Do not share this link with anyone outside this course!
2. Click on the link, sign in to GitHub and then click on the green button Accept this assignment.
3. You will be redirected to a new private repository created just for you. The repository will be named ds202a-2023-w07-summative-yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.
4. Recall your LSE CANDIDATE NUMBER. You will need it in the next step.
5. Create a <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
6. Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

       ---
       title: "DS202A - W07 Summative"
       author: <CANDIDATE_NUMBER>
       output: html
       self-contained: true
       ---

   Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

       ---
       title: "DS202A - W07 Summative"
       author: 12345
       output: html
       self-contained: true
       ---

7. Fill out the .qmd file with your answers. Use headers and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
8. Once you are done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.
9. Ensure that your .qmd code is reproducible, that is, if we were to restart R and RStudio and run your notebook from scratch, from top to bottom, we would get the same results as you did.
10. Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.
11. Read the section How to get help and how to collaborate with others at the end of this document.
“What do I submit?”
You will submit two files:
- A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
- An HTML file render of the Quarto markdown file.
You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.
🗄️ Get the data
What data will you be using?
We will be using the following dataset that happens to have been published by one of our teachers, Dr Stuart Bramwell:
Nyrup, Jacob, Hikaru Yamagishi, and Stuart Bramwell. 2023a. “Replication Data for: Consolidating Progress: The Selection of Female Ministers in Autocracies and Democracies.” Harvard Dataverse, v1.
The link above will give you the data and the code to replicate the analysis published in the academic paper:
Nyrup, Jacob, Hikaru Yamagishi, and Stuart Bramwell. 2023. “Consolidating Progress: The Selection of Female Ministers in Autocracies and Democracies.” American Political Science Review, July, 1–20.
How often do you have a chance to discuss a research paper with its author? Take advantage of the fact that Stuart is hosting a drop-in session on Wednesday, 1 November 2023. You probably received a calendar invite, but you can also find the details on the Moodle announcements forum.
Use this opportunity to think more conceptually about this data before you start working on the assignment.
Preparation
1. Download the data from the link above. It is a .zip file.
2. Unzip the file to a convenient folder on your computer.
3. Read the CSV file that is under 1_data/df_consolidatingprogress_V1.csv into R. Save it as a data frame called df. Freely explore the data on your own.
You will find a PDF called supplementarymaterials_Consolidating_Progress.pdf that explains the data set in more detail. You can also read the paper that published this data set if you want to learn more about it.
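One way to carry out the read step from the Preparation list, as a minimal sketch. It assumes the readr and dplyr packages are installed and that the unzipped 1_data folder sits in your working directory; adjust the path if you unzipped the data somewhere else. The file.exists() guard is only there so the snippet does not fail when the file is elsewhere.

```r
library(readr)
library(dplyr)

# Path from the Preparation steps, relative to the working directory (an assumption)
csv_path <- file.path("1_data", "df_consolidatingprogress_V1.csv")

if (file.exists(csv_path)) {
  df <- read_csv(csv_path)
  glimpse(df)  # lists every column with its type and first few values
} else {
  message("File not found at ", csv_path, "; check where you unzipped the data.")
}
```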
📋 Your Tasks
What do we actually want from you?
Part 1: Show us your dplyr muscles! (20 marks)
Here, we will focus on the variables related to the BMR index¹, a binary measure that indicates whether a country is a democracy. Feel free to organise your code however you like for this part; you don’t need to use a chunk for each question.
1. Filter df to the years where the BMR index changed compared to the previous year. Save this new data frame as df_bmr_changed.

   💡 Tip: you might want to use the tidyselect::contains() function to select just the variables you need.

2. Using mutate, create a new column called bmr_transition, a string that can only assume one of the two values below:
   - “Autocracy to Democracy”
   - “Democracy to Autocracy”

   It is up to you to figure out which other columns in the data frame to use.

3. Select the columns of df_bmr_changed to show just country_name, year and bmr_transition, sorted by country name and year in ascending order.
4. What countries faced more than one regime transition during the time covered by the data set? Were there any inconsistencies in the data?
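The general pattern behind these tasks (compare each row with the previous year within a country, then label the direction of the change) can be sketched on toy data. This is an illustration under stated assumptions, not the expected solution: the toy tibble and the is_democracy column are hypothetical, and the real data set may name and code its BMR variables differently.

```r
library(dplyr)

# Toy data: is_democracy is a hypothetical 0/1 column, not a name from the real data set
toy <- tibble(
  country_name = rep(c("A", "B"), each = 4),
  year         = rep(2000:2003, times = 2),
  is_democracy = c(0, 0, 1, 1, 1, 0, 0, 1)
)

toy_changed <- toy %>%
  group_by(country_name) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(previous = lag(is_democracy)) %>%   # value in the previous year, per country
  ungroup() %>%
  filter(is_democracy != previous) %>%       # first year per country has NA and is dropped
  mutate(
    bmr_transition = if_else(
      is_democracy == 1,
      "Autocracy to Democracy",
      "Democracy to Autocracy"
    )
  ) %>%
  select(country_name, year, bmr_transition) %>%
  arrange(country_name, year)

# Countries with more than one transition in the toy data
multiple_transitions <- toy_changed %>%
  count(country_name, name = "n_transitions") %>%
  filter(n_transitions > 1)
```

Grouping before calling lag() matters: without it, the first year of one country would be compared with the last year of the previous country in the table.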
Part 2: Create a baseline model (50 marks)
In the paper that published this data set, the authors used a linear regression model to predict share_female-related variables (Nyrup, Yamagishi, and Bramwell 2023). The authors used more advanced linear regression techniques, including mixed-group effects - things we have not covered in this course.
Here, we will tackle this as a classification task. We aim to create a logistic regression model to predict whether the share of females in the cabinet will increase or decrease in the next year.
As in the previous section, you don’t need to use a chunk for each question. Feel free to organise your code and markdown for this part.
1. Create a binary target variable called is_share_female_up. The variable should be set to 1 if the share_female variable in the current year is higher than the share_female variable in the previous year. Otherwise, it should be set to 0.

   To avoid problems, don’t use a recipe here; just use mutate to create the variable.

2. Create a logistic regression model using a single valid predictor. This could be either a column already in the data frame or a new column you create using mutate or with a recipe.
3. Set the last year in the data set as the test set. Use the previous years as the training set.
4. Use whatever metric you feel is most apt for this task to evaluate your model’s performance. Explain why you chose this metric.
5. Explain what the regression coefficients mean in the context of this problem.
6. Comment on the goodness-of-fit of your model and its predictive power.
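The workflow in the list above can be sketched end to end on simulated data. Everything here is an assumption made for illustration: the data are random, previous_share is a hypothetical predictor chosen only because it is known before the year being predicted, base R's glm() stands in for whichever modelling interface you prefer (tidymodels is an alternative), and accuracy with a 0.5 threshold is a placeholder rather than a recommended metric.

```r
library(dplyr)

# Simulated stand-in for the real data (values are random, purely illustrative)
set.seed(202)
toy <- tibble(
  country_name = rep(LETTERS[1:6], each = 10),
  year         = rep(2001:2010, times = 6),
  share_female = round(runif(60, min = 0, max = 0.5), 2)
)

# Target: 1 if share_female is higher than in the previous year, 0 otherwise
toy_model <- toy %>%
  group_by(country_name) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    previous_share     = lag(share_female),   # hypothetical predictor
    is_share_female_up = as.integer(share_female > previous_share)
  ) %>%
  ungroup() %>%
  filter(!is.na(previous_share))              # first year per country has no previous value

# Time-based split: last year is the test set, earlier years are the training set
last_year <- max(toy_model$year)
train <- filter(toy_model, year < last_year)
test  <- filter(toy_model, year == last_year)

# Logistic regression with a single predictor
fit <- glm(is_share_female_up ~ previous_share, data = train, family = binomial())

# Coefficients are on the log-odds scale; exponentiate to read them as odds ratios
odds_ratios <- exp(coef(fit))

# Predicted probabilities and classes on the held-out year
test$prob <- predict(fit, newdata = test, type = "response")
test$pred <- as.integer(test$prob >= 0.5)

accuracy <- mean(test$pred == test$is_share_female_up)
```

Splitting by year rather than at random keeps the model from seeing observations that come after the ones it is asked to predict.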
Part 3: Model some more (30 marks)
You have a choice to make here. You can either:
OPTION 01: Go deep by improving your model’s performance by adding more predictors or feature engineering² techniques, OR
OPTION 02: Perform a more robust time-aware cross-validation evaluation of your model’s performance. That is, you would use the same baseline model but perform multiple train-test splits, ensuring that the test set always comes after the training set.
Whatever you do, this is what we expect from you:
Show us your code and your model.
Explain your choices (of feature engineering or cross-validation strategy)
Evaluate your model’s performance. If you created a new model, compare it to the baseline model. If you performed a more robust cross-validation, compare it to the single train-test split you did in the previous section.
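For OPTION 02, one possible shape of a time-aware evaluation is an expanding window: for each test year, train on every earlier year and test on that year only. The sketch below is an assumption-laden illustration, not a prescribed method: the data are simulated, evaluate_split() is a hypothetical helper, the choice of test years is arbitrary, and accuracy is again just a placeholder metric.

```r
library(dplyr)

# Simulated data with the same structure as the baseline sketch (purely illustrative)
set.seed(202)
toy_model <- tibble(
  country_name = rep(LETTERS[1:6], each = 10),
  year         = rep(2001:2010, times = 6),
  share_female = round(runif(60, min = 0, max = 0.5), 2)
) %>%
  group_by(country_name) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    previous_share     = lag(share_female),
    is_share_female_up = as.integer(share_female > previous_share)
  ) %>%
  ungroup() %>%
  filter(!is.na(previous_share))

# Hypothetical helper: fit on all years before test_year, evaluate on test_year
evaluate_split <- function(test_year, data) {
  train <- filter(data, year < test_year)
  test  <- filter(data, year == test_year)

  fit  <- glm(is_share_female_up ~ previous_share, data = train, family = binomial())
  prob <- predict(fit, newdata = test, type = "response")

  tibble(
    test_year = test_year,
    n_train   = nrow(train),
    accuracy  = mean(as.integer(prob >= 0.5) == test$is_share_female_up)
  )
}

# Expanding window: the training set grows as the test year moves forward
test_years <- 2007:2010
cv_results <- bind_rows(lapply(test_years, evaluate_split, data = toy_model))

mean_accuracy <- mean(cv_results$accuracy)
```

Because every training set ends before its test year begins, no split lets the model see the future; comparing the per-year results with the single-split result shows how much one split's score depends on the year chosen.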
✔️ How we will grade your work
Here, we start to get more rigid about grading your work. If you follow all the instructions, you should expect a score of around 70/100. Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not get you a higher score; you need to add interesting insights or analyses to get a distinction.
Part 1: Show us your dplyr muscles! (20 marks)
Here is a rough rubric for this part:
- 5 marks: You wrote some code but filtered the data incorrectly or did not follow the instructions.
- 10 marks: You created the intermediate data frame df_bmr_changed correctly, but you might have made some mistakes when creating the column bmr_transition, or your conclusions for Task 4 are not correct.
- 15 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
- 20 marks: You did everything correctly, and your submission was perfect. Wow! Your code and markdown were well-organised, and your answers were concise and to the point.
Part 2: Create a baseline model (50 marks)
Here is a rough rubric for this part:
- <10 marks: A deep fail. There is no code, or the code/markdown is so insubstantial or disorganised that we cannot understand what you did.
- 10-20 marks: A fail. You wrote some code and text but ignored important aspects of the instructions (like not using logistic regression).
- 20-30 marks: You made some critical mistakes or did not complete all the tasks. For example: your pre-processing step was incorrect, your model contained some data leakage (seeing the future), or perhaps your analysis of your model was way off.
- 30-35 marks: Good, you just made minor mistakes in your code, or your analysis demonstrated some minor misunderstandings of the concepts.
- ~35 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
- >35 marks: Impressive! You impressed us with your level of technical expertise and deep knowledge of the intricacies of the logistic function. We are likely to print a photo of your submission and hang it on the wall of our offices.
Part 3: Model some more (30 marks)
Here is a rough rubric for this part:
- <10 marks: A fail. There is no code, or the code/markdown is so insubstantial or disorganised that we cannot understand what you did, or you wrote some code and text but ignored important aspects of the instructions.
- 10-20 marks: Good, although you made mistakes in your code, or your analysis demonstrated some misunderstandings of the concepts.
- ~22 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
- >22 marks: Impressive! You impressed us with your level of technical expertise and deep knowledge of the intricacies of the logistic function. We are likely to print a photo of your submission and hang it on the wall of our offices.
How to get help and how to collaborate with others
🙋 Getting help
You can post general coding questions on Slack but should not reveal code that is part of your solution.
For example, you can ask:
- “Does anyone know how I can create a logistic regression in tidymodels without a recipe?”
- “Has anyone figured out how to do time-aware cross-validation, grouped per country??”
You are allowed to share ‘aesthetic’ elements of your code if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic df as the data frame, but you should not share the code for the data wrangling that led to the creation of df.
If we find that you posted something on Slack that violates this principle without realising it, you won’t be penalised for it - don’t worry, but we will delete your message and let you know.
👯 Collaborating with others
You are allowed to discuss the assignment with others, work alongside each other, and help each other. However, you cannot share or copy code from others — pretty much the same rules as above.
🤖 Using AI help?
You can use Generative AI tools such as ChatGPT, and search online for help, when working on this assignment. If you use an AI tool, however minimally, you are asked to report which tool you used and add an extra section to your notebook explaining how much you used it.
Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and does not follow the principles we teach in this course.
To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.
Footnotes
¹ The BMR indicator is defined in Boix, Miller, and Rosato (2013), but I like this summary from Gründler and Krieger (2021):
“The Boix–Miller–Rosato (BMR) indicator is a dichotomous democracy index that labels a regime as democratic (1) if it meets the following conditions (Boix et al., 2013):
- The majority of male citizens was eligible to vote.
- The executive power was directly or indirectly elected in popular elections and is responsible to the voters or to a legislature.
- The legislature (and when directly elected the executive) was chosen in free and fair elections.”
² Feature engineering is creating new variables from existing ones. For example, you could create a new variable that results from a mathematical transformation of an existing variable.