📝 W04 Summative

2023/24 Autumn Term

Author

Dr Jon Cardoso-Silva

⏲️ Due Date:

19 October 2023 at 23:59:59

If you submit after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧

🎯 Main Objectives:

Apply computer programming to modify data
Demonstrate that you can fit and apply linear regression models for making predictions


⚖️ Assignment Weight:

10%

Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

📝 Instructions

Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled 📝 W04 Summative. Do not share this link with anyone outside this course!
Click on the link, sign in to GitHub and then click on the green button Accept this assignment.
You will be redirected to a new private repository created just for you. The repository will be named ds202a-2023-w04-summative--yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.
Recall your LSE CANDIDATE NUMBER. You will need it in the next step.
Create a <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE number.

For example, if your candidate number is 12345, then your file should be named 12345.qmd.
Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:
```
---
title: "DS202A - W04 Summative"
author: <CANDIDATE_NUMBER>
output: html
self-contained: true
---
```
Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:
```
---
title: "DS202A - W04 Summative"
author: 12345
output: html
self-contained: true
---
```
Fill out the .qmd file with your answers. Use headers and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
Once you are done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.

Ensure that your .qmd code is reproducible: if we were to restart R and RStudio and run your notebook from scratch, from top to bottom, we should get the same results as you did.
Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.
Read the section How to get help and how to collaborate with others at the end of this document.

“What do I submit?”

You will submit two files:

A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
An HTML file render of the Quarto markdown file.

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub.

📋 Your Tasks

What do we actually want from you?

Using the UK HPI dataset we have been using in the labs, create a data frame called df_uk that contains just the data for the whole of the UK.
Create a df_uk_train to contain just the data for the UK up until December 2019, and a df_uk_test to contain the data for the UK from January 2020 onwards.
Your goal is to create a linear model using tidymodels that can predict the variable SalesVolume per month, as best as you can, using just the df_uk_train data as a starting point. You can use any of the other variables in the dataset to do so, and you can create new variables from the existing ones, if you so wish.

Your df_uk_train is just the starting point, but it won’t be the dataset you will use to create your model. You still need to further preprocess df_uk_train before you can train a model. You might also need to further preprocess df_uk_test before you can run a pre-trained model on it.
There is just one thing you CANNOT DO: use the future! 🔮

That is, each line of the dataset you pass to the model must contain the target variable (SalesVolume) plus the predictors you are using to train. These predictors cannot refer to the same month as the target variable, nor to any future month.

For example, in the line relative to January 2018, you can use data from December 2017 and before (not just the previous month) but you cannot use any data from January 2018 onwards.
Show that your model fits the data well by plotting the residuals and by calculating the Mean Absolute Error (MAE) of your model on the data relative to df_uk_train. How ‘well’ does it have to fit? Well, that’s up to you to decide.
Show that your model can predict the future well, by calculating the residuals and the MAE of your model on the data relative to df_uk_test.
Explain your choices. Why did you choose the variables you did? How did you come up with this particular configuration? Did you try anything else that didn’t work? What did you learn from this exercise?
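The "no future" rule above can be illustrated with a minimal sketch on synthetic data. This is not a model answer: the tibble `toy`, its made-up values, the helper `with_residuals()`, and the choice of two lagged predictors are all assumptions for illustration only. The real UK HPI data has its own columns and will likely need different preprocessing.

```r
library(tidymodels)  # attaches dplyr, tidyr, ggplot2, parsnip, yardstick, ...

# Made-up monthly series standing in for the UK rows of the UK HPI data.
set.seed(202)
toy <- tibble(
  Date        = seq(as.Date("2015-01-01"), as.Date("2021-12-01"), by = "month"),
  SalesVolume = 80000 + cumsum(rnorm(84, sd = 2000))
)

# Build predictors that only look backwards in time:
# the row for month t holds the values from months t-1 and t-2.
toy <- toy %>%
  arrange(Date) %>%
  mutate(
    SalesVolume_lag1 = dplyr::lag(SalesVolume, 1),
    SalesVolume_lag2 = dplyr::lag(SalesVolume, 2)
  ) %>%
  drop_na()  # the first two months have no complete history

# Time-based split: train up to December 2019, test from January 2020 onwards.
toy_train <- toy %>% filter(Date <  as.Date("2020-01-01"))
toy_test  <- toy %>% filter(Date >= as.Date("2020-01-01"))

# Fit a linear regression on the training rows only.
model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(SalesVolume ~ SalesVolume_lag1 + SalesVolume_lag2, data = toy_train)

# Hypothetical helper: attach predictions and residuals to a data frame.
with_residuals <- function(fit, data) {
  data %>%
    bind_cols(predict(fit, new_data = data)) %>%
    mutate(residual = SalesVolume - .pred)
}

train_results <- with_residuals(model, toy_train)
test_results  <- with_residuals(model, toy_test)

# Mean Absolute Error on each set (each call returns a one-row tibble).
train_mae <- mae(train_results, truth = SalesVolume, estimate = .pred)
test_mae  <- mae(test_results,  truth = SalesVolume, estimate = .pred)

# Residuals against fitted values; print residual_plot to display it.
residual_plot <- ggplot(train_results, aes(x = .pred, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```

In this sketch the lags are computed before the split, so the first test rows use December 2019 and November 2019 values as predictors. That is still data from the past relative to each target month, which is what the rule requires.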

✔️ How we will grade your work

We will not be extremely rigid on this very first summative. Therefore, it is possible to reach 100/100 marks if you do everything correctly.

Criterion 01: Notebook Organisation (20 marks)

Does your notebook look organised?
Did you use appropriate headers and code chunks to keep your work organised?

If we feel like we can say yes to these questions, then you will get full marks for this criterion.

You can refer to ✅ W03 Lab Solutions .qmd notebook as an example of how to organise your work.

Criterion 02: Correctness of your answers (50 marks)

This is simple. Did you follow all the steps we asked you to do? Did you do them correctly?

If we feel like we can say yes to these questions, then you will get full marks for this criterion.

Criterion 03: Quality of your explanations (30 marks)

Is it easy to follow your thought process?
Did you explain, using text, what you are trying to achieve at each step of your analysis?
Is there clear explanation of your choices and thought process?

If we feel like we can say yes to these questions, then you will get full marks for this criterion.

How to get help and how to collaborate with others

🙋 Getting help

You can post general coding questions on Slack, but you should not reveal code that is part of your solution.

For example, you can ask:

“Does anyone know how I can create a new variable that is the sum of two other variables?”
“Does anyone understand the difference between lag and lead?”
“How is everyone ensuring that their training data does not use anything about the future?”
“I tried using something like df %>% mutate(col1=col2 + col3) but then I got an error” (Reproducible example)

You are allowed to share ‘aesthetic’ elements of your code, if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic df as the data frame, but you should not share the code for the data wrangling that led to the creation of df.

If you unknowingly post something on Slack that violates this principle, don’t worry: you won’t be penalised for it, but we will delete your message and let you know.

👯 Collaborating with others

You are allowed to discuss the assignment with others, to work alongside each other and to help each other. However, you are not allowed to share code or to copy code from others. Pretty much the same rules as above.

🤖 Using AI help?

You are allowed to use Generative AI tools such as ChatGPT when doing this assignment, as well as to search online for help. If you do use such a tool, however minimally, you are asked to report which tool you used and to add an extra section to your notebook explaining the extent to which you used it.

Note that, while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to generate responses that are formulaic and repetitive, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is not very efficient, or that is very old and does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.