📝 W04 Summative

2023/24 Winter Term

Author

Dr. Ghita Berrada

⏲️ Due Date:

08 February 2024 at 5 pm (UK time)

If you submit after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧

🎯 Main Objectives:

Use applied computer programming (to modify data)
Demonstrate that you can fit and apply linear regression models for making predictions

⚖️ Assignment Weight:

10%

📝 Instructions

Find the GitHub assignment link: Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled 📝 W04 Summative.
Accept the assignment: Click the link, sign in to GitHub, and click the green button.
Access your private repository: You’ll be redirected to a new private repository named ds202w-2023-w04-summative-<yourusername>. The repository will be private and will contain a README.md file with a copy of these instructions.
Grab your LSE Candidate Number: You’ll need it next. Check LSE for You (source) - it’s a unique 5-digit number used for anonymous grading.
Create your answer file: Name it <CANDIDATE_NUMBER>.qmd, replacing <CANDIDATE_NUMBER> with your actual number.

For example, if your candidate number is 12345, then your file should be named 12345.qmd.

Update your .qmd header: replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

---
title: "DS202W - W04 Summative"
author: <CANDIDATE_NUMBER>
output: html
self-contained: true
editor:
    render-on-save: true
    preview: true
---

Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

---
title: "DS202W - W04 Summative"
author: 12345
output: html
self-contained: true
editor:
    render-on-save: true
    preview: true
---

Fill out your answers: Use headers and code chunks for organization. Refer to Quarto markdown basics for Markdown formatting tips.
Render your work: Click “Render” in your .qmd file to create an .html file with the same name (e.g., 12345.html).
Ensure reproducibility: If we run your notebook from scratch, it should produce the same results you got.
Push to GitHub: Push both .qmd and .html files to your repository (multiple pushes are fine, we’ll grade the latest version). Not a Git pro? Use the GitHub web interface!
Check the rules on peer collaboration and AI tool usage: Refer to the How to get help and how to collaborate with others section for specific rules.

What to Submit

Two files are required:

Your source Quarto markdown file: <CANDIDATE_NUMBER>.qmd
An HTML render of your report: The rendered version of your .qmd file <CANDIDATE_NUMBER>.html

No manual submission is needed! Your assignment is automatically submitted when you commit and push your changes to GitHub. Feel free to commit+push multiple times before the deadline—we’ll grade the latest version.

💡 Not a Git expert? Upload your files directly via GitHub’s web interface.

📋 Your Tasks

Context

For this summative assessment, we will use a recently updated dataset¹ from the Office for National Statistics (ONS) regarding public sector finances in the UK. It includes indicators like government investment and spending, government borrowing, tax revenues, and more.

Click on the button below to download the dataset:

ℹ️ About the dataset

This dataset was slightly pre-processed compared to the version available on the ONS website. Here is our pre-processing code in case you’re curious about what we did:

library(dplyr)
library(readxl)
library(janitor)

# Download the ONS workbook (binary mode keeps the .xlsx intact on Windows)
url <- "https://www.ons.gov.uk/file?uri=/economy/governmentpublicsectorandtaxes/publicsectorfinance/datasets/publicsectorfinancessummarytablesappendixm/current/publicsectorfinancesummarytablesappendixmfinal.xlsx"
dir.create("data", showWarnings = FALSE)
download.file(url, "data/UK_public_finance_summary.xlsx", mode = "wb")

# Read the time series sheet, skipping the first four rows
data <- read_excel("data/UK_public_finance_summary.xlsx", sheet = "Time Series", skip = 4)

# Drop the row of dataset identifier codes, then tidy the column names
data <- data %>%
  filter(`Time period` != "Dataset identifier code") %>%
  clean_names() %>%
  rename_with(~ gsub("million", "pounds_million", .x)) %>%  # make the unit explicit
  rename_with(~ gsub("_note_\\d{1,2}", "", .x))             # strip footnote markers

write.csv(data, "data/UK_public_finances_cleaned.csv")
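To see what the two renaming steps do, here is a minimal sketch in base R. The column names below are hypothetical: they only imitate what clean_names() might produce from the ONS headers, and are not copied from the actual file.

```r
# Hypothetical column names, imitating the output of janitor::clean_names()
raw_names <- c(
  "time_period",
  "central_government_net_borrowing_million_note_3",
  "other_income_tax_million_note_12"
)

# Step 1: make the unit explicit ("million" -> "pounds_million")
with_units <- gsub("million", "pounds_million", raw_names)

# Step 2: drop footnote suffixes such as "_note_3" or "_note_12"
cleaned_names <- gsub("_note_\\d{1,2}", "", with_units)

print(cleaned_names)
```

Names that contain neither "million" nor a footnote suffix, such as time_period, pass through unchanged.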

The dataset was processed to make variables more code and model-friendly. If you prefer a dataset with more explicit variable names, you can download it using the button below:

If you’d like to know more about the dataset, have a look here or here.

What we want from you

🥅 Your ultimate goal

You are to create a linear model using tidymodels that predicts the monthly target variable central_government_net_borrowing_pounds_million as accurately as possible.

To achieve that, we will ask you to create a training set, and you must fit your model to this training set only. Then, with the model you created, you will predict the target variable on a test set. The specification of training and test sets will be given below.

There is just one thing you CANNOT DO: use the future! 🔮

Our target variable is measured at a specific point in time, so it’s crucial that your predictors do not include any data or information that was unavailable at the time of prediction. For instance, if you’re predicting the value for January 2018, make sure your predictors rely only on data or calculations that were already available in December 2017 or earlier.
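One common way to respect this rule is to build predictors from lagged values. The sketch below is only an illustration on made-up numbers, with hypothetical column names (month, borrowing); it contrasts dplyr's lag(), which looks backwards in time, with lead(), which looks forwards and would therefore leak the future.

```r
library(dplyr)

# Toy monthly series (made-up numbers, hypothetical column names)
toy <- data.frame(
  month     = as.Date(c("2017-10-01", "2017-11-01", "2017-12-01", "2018-01-01")),
  borrowing = c(10, 12, 9, 15)
)

toy <- toy %>%
  arrange(month) %>%  # lag/lead depend on row order, so sort by time first
  mutate(
    borrowing_prev_month = lag(borrowing, n = 1),  # past value: allowed
    borrowing_next_month = lead(borrowing, n = 1)  # future value: would leak information
  )

print(toy)
```

In the January 2018 row, borrowing_prev_month holds the December 2017 value, which was known at prediction time. The first row has no previous month, so its lagged value is NA and needs handling before modelling.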

Follow the steps below to achieve this goal.

Load the downloaded dataset into a data frame df. Make sure you process the dates in the correct format.
Create a column named income_tax_receipts_pounds_million, which should correspond to the sum of the following columns:
- self_assessed_income_tax_pounds_million
- pay_as_you_earn_income_tax_pounds_million
- other_income_tax_pounds_million
Create the training and test sets:
- create a df_train to contain just the data for the UK up until May 2010
- create a df_test to contain the data for the UK from June 2010 onwards.
Now, using only the df_train data as a starting point, create a model that predicts the target variable based on a single variable, income_tax_receipts_pounds_million.
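The mechanics of parsing dates, summing columns and splitting by time can be sketched on a toy data frame. Everything here is an assumption made for illustration: the column names (period, tax_a, tax_b) are hypothetical, and the "YYYY-MM" date format may well differ from the one in the real dataset, so check yours before adapting anything.

```r
library(dplyr)

# Toy data: hypothetical "YYYY-MM" period strings and made-up receipts
toy <- data.frame(
  period = c("2010-03", "2010-04", "2010-05", "2010-06", "2010-07"),
  tax_a  = c(1, 2, 3, 4, 5),
  tax_b  = c(10, 20, 30, 40, 50)
)

toy <- toy %>%
  mutate(
    date      = as.Date(paste0(period, "-01")),  # first day of each month
    tax_total = tax_a + tax_b                    # row-wise sum of the components
  )

# Chronological split: no shuffling, the test set lies strictly after the cutoff
cutoff    <- as.Date("2010-05-01")
toy_train <- toy %>% filter(date <= cutoff)  # up to and including May 2010
toy_test  <- toy %>% filter(date > cutoff)   # June 2010 onwards

print(range(toy_train$date))
print(range(toy_test$date))
```

Printing the date ranges is a quick sanity check that the two sets do not overlap in time.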

How well does your model perform? As in the Week 3 lab, use the residuals plot and the MAE metric to justify your reasoning. Can you explain the change in performance between the training and test sets?
Your goal is still to create a linear model using tidymodels that can predict the target variable using just the data from df_train. But now, feel free to use any other variables in the dataset. You can use one or multiple variables. If you wish, you can also create new variables from existing ones.

💡 Remember: Whatever further pre-processing you apply to df_train, you must replicate it independently on df_test.
Show that your model fits the data well by plotting residuals and by calculating the MAE of your model on df_train. How ‘well’ does it have to fit? Well, that’s up to you to decide.
Show that your model can predict the future well by calculating the residuals and the MAE of your model on df_test.
Explain your choices. Why did you choose the variables you did? How did you come up with this particular configuration? Did you try anything else that didn’t work? What did you learn from this exercise?
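For reference, the general tidymodels pattern of fitting on one set and evaluating on another might look like the sketch below. It runs on synthetic data with hypothetical columns x and y, so it shows the workflow only and says nothing about which variables suit the real dataset.

```r
library(tidymodels)

# Synthetic data with a deterministic wiggle, so results are reproducible
toy <- data.frame(x = 1:40)
toy$y <- 3 + 2 * toy$x + sin(toy$x)

toy_train <- toy[1:30, ]   # earlier observations
toy_test  <- toy[31:40, ]  # later observations, never used for fitting

# Fit the linear model on the training rows only
model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ x, data = toy_train)

# augment() adds .pred and .resid columns next to the original data
train_aug <- augment(model, new_data = toy_train)
test_aug  <- augment(model, new_data = toy_test)

# Mean absolute error on each set
mae_train <- mae(train_aug, truth = y, estimate = .pred)
mae_test  <- mae(test_aug, truth = y, estimate = .pred)
print(bind_rows(train = mae_train, test = mae_test, .id = "split"))

# Residuals against predictions; ideally a patternless band around zero
residual_plot <- ggplot(test_aug, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
print(residual_plot)
```

Comparing the two MAE values side by side makes it easier to discuss whether the model generalises to data it has not seen.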

✔️ How we will grade your work

We will not be extremely rigid on this very first summative. Therefore, it is possible to reach 100/100 marks if you do everything correctly.

Criterion 01: Notebook Organisation (20 marks)

Does your notebook look organised?
Did you use appropriate headers and code chunks to keep your work organised?

If we feel like we can say yes to these questions, then you will get full marks for this criterion.

You can refer to the ✅ W03 Lab Solutions .qmd notebook as an example of how to organise your work.

Criterion 02: Correctness of your answers (50 marks)

This is simple. Did you follow all the steps we asked you to do? Did you do them correctly?

If we feel like we can say yes to these questions, then you will get full marks for this criterion.

Criterion 03: Quality of your explanations (30 marks)

Is it easy to follow your thought process?
Did you explain, using text, what you are trying to achieve at each step of your analysis?
Is there a clear explanation of your choices and thought process?

If we feel like we can say yes to these questions, then you will get full marks for this criterion.

How to get help and how to collaborate with others

🙋 Getting help

You can post general coding questions on Slack, but you should not reveal code that is part of your solution.

For example, you can ask:

“Does anyone know how I can create a new variable that is the sum of two other variables?”
“Does anyone understand the difference between lag and lead?”
“What is everyone doing to ensure that their training data does not use anything about the future?”
“I tried using something like df %>% mutate(col1=col2 + col3) but then I got an error” (Reproducible example)
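A reproducible example in this spirit might look like the sketch below: a tiny made-up data frame with generic column names, plus the one operation you are asking about. Nothing in it comes from the assignment dataset or from your solution.

```r
library(dplyr)

# Small made-up data frame that anyone can paste and run
df <- data.frame(col2 = c(1, 2, 3), col3 = c(10, 20, 30))

# The operation being asked about, on generic columns only
df <- df %>% mutate(col1 = col2 + col3)

print(df)
```

If this runs for you but your real code fails, include the exact error message in your question as well.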

You are allowed to share ‘aesthetic’ elements of your code, if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic df as the data frame, but you should not share the code for the data wrangling that led to the creation of df.

If we find that you unintentionally posted something on Slack that violates this principle, don’t worry: you won’t be penalised for it, but we will delete your message and let you know.

👯 Collaborating with others

You are allowed to discuss the assignment with others, to work alongside each other and to help each other. However, you are not allowed to share code or to copy code from others. Pretty much the same rules as above.

🤖 Using AI help?

You are allowed to use Generative AI tools such as ChatGPT for this assignment, as well as to search online for help. If you do use them, however minimally, you are asked to report which AI tool you used and to add an extra section to your notebook explaining the extent to which you used it.

Note that, while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. They also tend to generate responses that are formulaic and repetitive, which limits your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is not very efficient, or that is very old and does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.

Footnotes

¹ Last updated on 23 January 2024, with the next update scheduled for 21 February 2024.