W04 Summative
2023/24 Winter Term
Due Date:
- 08 February 2024 at 5 pm (UK time)
If you submit after this date without an authorised extension, you will receive a late submission penalty.
Do you have extenuating circumstances and need an extension? Send an e-mail to 📧
Main Objectives:
- Apply computer programming to modify data
- Demonstrate that you can fit and apply linear regression models for making predictions
Assignment Weight:
10%
Instructions
- Find the GitHub assignment link: Go to our Slack workspace's #announcements channel to find a GitHub Classroom link entitled W04 Summative.
- Accept the assignment: Click the link, sign in to GitHub, and click the green button.
- Access your private repository: You'll be redirected to a new private repository named ds202w-2023-w04-summative-<yourusername>. The repository will be private and will contain a README.md file with a copy of these instructions.
- Grab your LSE Candidate Number: You'll need it next. Check LSE for You (source). It's a unique 5-digit number used for anonymous grading.
- Create your answer file: Name it <CANDIDATE_NUMBER>.qmd, replacing <CANDIDATE_NUMBER> with your actual number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
- Update your .qmd header: Replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

      ---
      title: "DS202W - W04 Summative"
      author: <CANDIDATE_NUMBER>
      output: html
      self-contained: true
      editor:
        render-on-save: true
        preview: true
      ---

  Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

      ---
      title: "DS202W - W04 Summative"
      author: 12345
      output: html
      self-contained: true
      editor:
        render-on-save: true
        preview: true
      ---

- Fill out your answers: Use headers and code chunks for organization. Refer to Quarto markdown basics for Markdown formatting tips.
- Render your work: Click "Render" in your .qmd file to create an .html file with the same name (e.g., 12345.html).
- Ensure reproducibility: If we run your notebook from scratch, it should produce the same results you got.
- Push to GitHub: Push both .qmd and .html files to your repository (multiple pushes are fine, we'll grade the latest version). Not a Git pro? Use the GitHub web interface!
- Check the rules on peer collaboration and AI tool usage: Refer to the How to get help and how to collaborate with others section for specific rules.
What to Submit
Two files are required:
- Your source Quarto markdown file: <CANDIDATE_NUMBER>.qmd
- An HTML render of your report: the rendered version of your .qmd file, <CANDIDATE_NUMBER>.html
No manual submission is needed! Your assignment is automatically submitted when you commit and push your changes to GitHub. Feel free to commit and push multiple times before the deadline; we'll grade the latest version.
💡 Not a Git expert? Upload your files directly via GitHub's web interface.
Your Tasks
Context
For this summative assessment, we will use a recently updated dataset [1] from the Office for National Statistics (ONS) regarding public sector finances in the UK. It includes indicators like government investment and spending, government borrowing, tax revenues, and more.
Click on the button below to download the dataset:
About the dataset
This dataset was slightly pre-processed compared to the version available on the ONS website. Here is our pre-processing code in case you're curious about what we did:
library(dplyr)
library(readxl)
library(janitor)

# Download the ONS workbook (mode = "wb" keeps the binary .xlsx intact on Windows)
url <- "https://www.ons.gov.uk/file?uri=/economy/governmentpublicsectorandtaxes/publicsectorfinance/datasets/publicsectorfinancessummarytablesappendixm/current/publicsectorfinancesummarytablesappendixmfinal.xlsx"
dir.create("data", showWarnings = FALSE)
download.file(url, "data/UK_public_finance_summary.xlsx", mode = "wb")

# Read the "Time Series" sheet, skipping the first 4 rows
data <- read_excel("data/UK_public_finance_summary.xlsx", sheet = "Time Series", skip = 4)
# Drop the row labelled "Dataset identifier code"
data <- data %>% filter(`Time period` != "Dataset identifier code")
# Standardise column names, make the unit explicit, and strip footnote markers
data <- data %>%
  clean_names() %>%
  select_all(~gsub("million", "pounds_million", .)) %>%
  select_all(~gsub("_note_\\d{1,2}", "", .))
write.csv(data, "data/UK_public_finances_cleaned.csv")
The dataset was processed to make variables more code and model-friendly. If you prefer a dataset with more explicit variable names, you can download it using the button below:
If you'd like to know more about the dataset, have a look here or here.
What we want from you
Your ultimate goal
Create a linear model using tidymodels that predicts the target variable central_government_net_borrowing_pounds_million, per month, as accurately as possible.
To achieve that, we will ask you to create a training set, and you must only fit a model to this training set. Then, with the model you created, you will predict the target variable on a test set. The specification of training and test sets will be given below.
There is just one thing you CANNOT DO: use the future!
Given the nature of our target variable, which is measured at a specific point in time, it's crucial to ensure that the predictors you use do not include any data or information that was not available at the time of prediction. For instance, if you're predicting the value for January 2018, make sure your predictors only use data or calculations that were already available in December 2017 or earlier.
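As a rough illustration of the rule above, the sketch below contrasts a predictor built from the previous month (allowed) with one built from the following month (a leak from the future). The tiny data frame, its values, and the column names borrowing, borrowing_prev_month and borrowing_next_month are made up for this example and are not part of the assignment dataset.

```r
library(dplyr)

# Toy monthly series (made-up values, hypothetical column names)
toy <- tibble(
  time_period = seq(as.Date("2017-10-01"), by = "month", length.out = 4),
  borrowing   = c(10, 12, 9, 15)
) %>%
  arrange(time_period) %>%                      # lag/lead depend on row order
  mutate(
    borrowing_prev_month = lag(borrowing, 1),   # value from one month earlier: known at prediction time
    borrowing_next_month = lead(borrowing, 1)   # value from one month later: uses the future, do not use
  )

toy
```

In this toy table, the January 2018 row gets its lagged value from December 2017, which is information that already existed when the prediction would be made. The first row has no earlier month, so its lagged value is NA and would need to be handled before modelling.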
Follow the steps below to achieve this goal.
- Load the downloaded dataset into a data frame df. Make sure you process the dates in the correct format.
- Create a column named income_tax_receipts_pounds_million, which should correspond to the sum of the following columns:
  - self_assessed_income_tax_pounds_million
  - pay_as_you_earn_income_tax_pounds_million
  - other_income_tax_pounds_million
- Create the training and test sets:
  - create a df_train to contain just the data for the UK up until May 2010
  - create a df_test to contain the data for the UK from June 2010 onwards.
- Now, using only the df_train data as a starting point, create a model that predicts the target variable based on a single variable, income_tax_receipts_pounds_million.
- How well does your model perform? Just as in the Week 3 lab, use the residuals plot and the MAE metric to justify your reasoning. Can you explain the performance change between the training and test sets?
- Your goal is still to create a linear model using tidymodels that can predict the target variable using just the data from df_train. But now, feel free to use any other variables in the dataset. You can use one or multiple variables. If you wish, you can also create new variables from existing ones.
  💡 Remember: Whatever further pre-processing you apply to df_train, you must replicate it independently on df_test.
- Show that your model fits the data well by plotting residuals and by calculating the MAE of your model on df_train. How "well" does it have to fit? Well, that's up to you to decide.
- Show that your model can predict the future well by calculating the residuals and the MAE of your model on df_test.
- Explain your choices. Why did you choose the variables you did? How did you come up with this particular configuration? Did you try anything else that didn't work? What did you learn from this exercise?
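The mechanics of the first few steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not a model answer: it runs on randomly generated toy data instead of the real CSV, it assumes the date column is called time_period and is already a Date, and the numbers it produces mean nothing. Only the column names listed in the steps above come from the assignment.

```r
library(tidymodels)  # attaches dplyr, ggplot2, parsnip, yardstick, ...

# Toy stand-in for the real dataset (made-up values; time_period is an assumed name)
set.seed(42)
n <- 48
df <- tibble(
  time_period = seq(as.Date("2008-01-01"), by = "month", length.out = n),
  self_assessed_income_tax_pounds_million   = runif(n, 500, 3000),
  pay_as_you_earn_income_tax_pounds_million = runif(n, 9000, 12000),
  other_income_tax_pounds_million           = runif(n, 100, 900)
) %>%
  mutate(
    # Sum of the three income tax columns
    income_tax_receipts_pounds_million =
      self_assessed_income_tax_pounds_million +
      pay_as_you_earn_income_tax_pounds_million +
      other_income_tax_pounds_million,
    # Fake target so that the example has something to predict
    central_government_net_borrowing_pounds_million =
      20000 - 0.8 * income_tax_receipts_pounds_million + rnorm(n, sd = 1000)
  )

# Time-based split: up to and including May 2010 vs. June 2010 onwards
df_train <- df %>% filter(time_period <= as.Date("2010-05-01"))
df_test  <- df %>% filter(time_period >= as.Date("2010-06-01"))

# Fit a linear regression on the training set only
model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(
    central_government_net_borrowing_pounds_million ~ income_tax_receipts_pounds_million,
    data = df_train
  )

# augment() adds .pred (and .resid, since the outcome column is present)
train_aug <- augment(model, new_data = df_train)
test_aug  <- augment(model, new_data = df_test)

# Mean absolute error on each set
mae_train <- mae(train_aug,
                 truth = central_government_net_borrowing_pounds_million,
                 estimate = .pred)
mae_test  <- mae(test_aug,
                 truth = central_government_net_borrowing_pounds_million,
                 estimate = .pred)
bind_rows(train = mae_train, test = mae_test, .id = "set")

# Residuals vs. fitted values for the training set; print(resid_plot) to display
resid_plot <- ggplot(train_aug, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```

The split here treats "up until May 2010" as including May 2010, which is an interpretation rather than something the steps state explicitly. With the real data, the date column would first need to be parsed into a proper date type before a comparison like this works.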
How we will grade your work
We will not be extremely rigid on this very first summative. Therefore, it is possible to reach 100/100 marks if you do everything correctly.
Criterion 01: Notebook Organisation (20 marks)
- Does your notebook look organised?
- Did you use appropriate headers and code chunks to keep your work organised?
If we feel like we can say yes to these questions, then you will get full marks for this criterion.
You can refer to the W03 Lab Solutions .qmd notebook as an example of how to organise your work.
Criterion 02: Correctness of your answers (50 marks)
This is simple. Did you follow all the steps we asked you to do? Did you do them correctly?
If we feel like we can say yes to these questions, then you will get full marks for this criterion.
Criterion 03: Quality of your explanations (30 marks)
- Is it easy to follow your thought process?
- Did you explain, using text, what you are trying to achieve at each step of your analysis?
- Is there a clear explanation of your choices and thought process?
If we feel like we can say yes to these questions, then you will get full marks for this criterion.
How to get help and how to collaborate with others
Getting help
You can post general coding questions on Slack, but you should not reveal code that is part of your solution.
For example, you can ask:
- "Does anyone know how I can create a new variable that is the sum of two other variables?"
- "Does anyone understand the difference between lag and lead?"
- "How is everyone ensuring that their training data does not use anything about the future?"
- "I tried using something like df %>% mutate(col1=col2 + col3) but then I got an error" (Reproducible example)
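To make the last kind of question concrete, a reproducible example for Slack could look something like the sketch below. It builds a tiny made-up data frame instead of using the assignment data, so anyone can run it without seeing your solution. The column names col1, col2 and col3 are the generic placeholders from the question above.

```r
library(dplyr)

# Small self-contained data frame with made-up values
df <- tibble(
  col2 = c(1, 2, 3),
  col3 = c(10, 20, 30)
)

# The operation being asked about, on generic data only
result <- df %>% mutate(col1 = col2 + col3)
result
```

When posting, include the exact error message you got alongside the snippet. The reprex package can also format a snippet together with its output for pasting into a message.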
You are allowed to share "aesthetic" elements of your code, if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic df as the data frame, but you should not share the code for the data wrangling that led to the creation of df.
If you unknowingly post something on Slack that violates this principle, don't worry: you won't be penalised for it, but we will delete your message and let you know.
Collaborating with others
You are allowed to discuss the assignment with others, to work alongside each other and to help each other. However, you are not allowed to share code or to copy code from others. Pretty much the same rules as above.
Using AI help?
You are allowed to use Generative AI tools such as ChatGPT when doing this research, as well as to search online for help. If you do use them, however minimally, you are asked to report the AI tool you used and add an extra section to your notebook explaining the extent to which you used it.
Note that, while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to generate responses that are formulaic and repetitive, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is not very efficient, or that is very old and does not follow the principles we teach in this course.
To see examples of how to report the use of AI tools, see Our Generative AI policy.
Footnotes
1. Last updated on 23 January 2024, with the next update scheduled for 21 February 2024.