W04 Summative
2023/24 Autumn Term
Due Date:
- 19 October 2023 at 23:59:59
If you submit after this date without an authorised extension, you will receive a late submission penalty.
Did you have extenuating circumstances and need an extension? Request one by e-mail.
Main Objectives:
- Use applied computer programming (to modify data)
- Demonstrate that you can fit and apply linear regression models for making predictions
Assignment Weight:
10%
“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”
Source: LSE
Instructions
1. Go to our Slack workspace's #announcements channel to find a GitHub Classroom link entitled W04 Summative. Do not share this link with anyone outside this course!
2. Click on the link, sign in to GitHub and then click on the green button Accept this assignment.
3. You will be redirected to a new private repository created just for you. The repository will be named ds202a-2023-w04-summative--yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.
4. Recall your LSE CANDIDATE NUMBER. You will need it in the next step.
5. Create a <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
6. Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

        ---
        title: "DS202A - W04 Summative"
        author: <CANDIDATE_NUMBER>
        output: html
        self-contained: true
        ---

   Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

        ---
        title: "DS202A - W04 Summative"
        author: 12345
        output: html
        self-contained: true
        ---

7. Fill out the .qmd file with your answers. Use headers and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
8. Once you are done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.
9. Ensure that your .qmd code is reproducible. That is, if we were to restart R and RStudio and run your notebook from scratch, from top to bottom, we would get the same results as you did.
10. Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.
11. Read the section How to get help and how to collaborate with others at the end of this document.
“What do I submit?”
You will submit two files:
- A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
- An HTML file render of the Quarto markdown file.
You don't need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.
Your Tasks
What do we actually want from you?
1. Using the UK HPI dataset we have been using in the labs, create a data frame called df_uk that contains just the data for the whole of the UK.
2. Create a df_uk_train to contain just the data for the UK up until December 2019, and a df_uk_test to contain the data for the UK from January 2020 onwards.
3. Your goal is to create a linear model using tidymodels that can predict the variable SalesVolume per month, as best as you can, using just the df_uk_train data as a starting point. You can use any of the other variables in the dataset to do so, and you can create new variables from the existing ones, if you so wish.
4. Your df_uk_train is just the starting point, but it won't be the dataset you will use to create your model. You still need to further preprocess df_uk_train before you can train a model. You might also need to further preprocess df_uk_test before you can run a pre-trained model on it.
5. There is just one thing you CANNOT DO: use the future!
   That is, each line of the dataset you pass to the model must contain the target variable (SalesVolume) plus the predictors you are using to train, and these predictors cannot refer to the same (or future) month(s) as the target variable.
   For example, in the line relative to January 2018, you can use data from December 2017 and before (not just the previous month) but you cannot use any data from January 2018 onwards.
6. Show that your model fits the data well, by plotting residuals and by calculating the MAE of your model on the data relative to df_uk_train. How “well” does it have to fit? Well, that's up to you to decide.
7. Show that your model can predict the future well, by calculating the residuals and the MAE of your model on the data relative to df_uk_test.
8. Explain your choices. Why did you choose the variables you did? How did you come up with this particular configuration? Did you try anything else that didn't work? What did you learn from this exercise?
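To make the "no future" rule and the train/test evaluation concrete, here is a minimal sketch on synthetic data. It is an illustration under stated assumptions, not a model answer: the data frame df_toy, the column name Date, the names toy_train, toy_test and fit_toy, and the choice of two lagged predictors are all made up for this example, and the numbers are random. Only SalesVolume, the December 2019 / January 2020 cut-off and the use of tidymodels come from the task description.

```r
library(tidymodels)  # attaches dplyr, tidyr, parsnip and yardstick, among others

# Synthetic stand-in for df_uk: 60 months of made-up sales volumes.
# The column name `Date` is an assumption; adjust it to match your data.
set.seed(42)
df_toy <- tibble(
  Date        = seq(as.Date("2016-01-01"), by = "month", length.out = 60),
  SalesVolume = 80000 + cumsum(rnorm(60, mean = 0, sd = 2000))
)

# Predictors may only look backwards in time:
# lag(x, 1) is last month's value, lag(x, 2) the value two months back.
df_lagged <- df_toy %>%
  arrange(Date) %>%
  mutate(
    SalesVolume_lag1 = lag(SalesVolume, 1),
    SalesVolume_lag2 = lag(SalesVolume, 2)
  ) %>%
  drop_na()  # the first two rows have no history to lag from

# Split by time: everything up to December 2019 is used for training.
toy_train <- df_lagged %>% filter(Date <  as.Date("2020-01-01"))
toy_test  <- df_lagged %>% filter(Date >= as.Date("2020-01-01"))

# Fit a linear regression on the training rows only.
fit_toy <- linear_reg() %>%
  set_engine("lm") %>%
  fit(SalesVolume ~ SalesVolume_lag1 + SalesVolume_lag2, data = toy_train)

# Mean absolute error on each split; augment() adds a .pred column.
mae_train <- augment(fit_toy, new_data = toy_train) %>%
  mae(truth = SalesVolume, estimate = .pred)
mae_test <- augment(fit_toy, new_data = toy_test) %>%
  mae(truth = SalesVolume, estimate = .pred)

bind_rows(train = mae_train, test = mae_test, .id = "split")
```

Because the lags are computed before the split, the January 2020 row uses December 2019 as a predictor, which the rule allows. When the outcome column is present in new_data, augment() should also return a .resid column, which can feed a residual plot.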
How we will grade your work
We will not be extremely rigid on this very first summative. Therefore, it is possible to reach 100/100 marks if you do everything correctly.
Criterion 01: Notebook Organisation (20 marks)
- Does your notebook look organised?
- Did you use appropriate headers and code chunks to keep your work organised?
If we feel like we can say yes to these questions, then you will get full marks for this criterion.
You can refer to the W03 Lab Solutions .qmd notebook as an example of how to organise your work.
Criterion 02: Correctness of your answers (50 marks)
This is simple. Did you follow all the steps we asked you to do? Did you do them correctly?
If we feel like we can say yes to these questions, then you will get full marks for this criterion.
Criterion 03: Quality of your explanations (30 marks)
- Is it easy to follow your thought process?
- Did you explain, using text, what you are trying to achieve at each step of your analysis?
- Is there a clear explanation of your choices and thought process?
If we feel like we can say yes to these questions, then you will get full marks for this criterion.
How to get help and how to collaborate with others
Getting help
You can post general coding questions on Slack, but you should not reveal code that is part of your solution.
For example, you can ask:
- “Does anyone know how I can create a new variable that is the sum of two other variables?”
- “Does anyone understand the difference between lag and lead?”
- “How is everyone doing to ensure that their training data does not use anything about the future?”
- “I tried using something like df %>% mutate(col1=col2 + col3) but then I got an error” (Reproducible example)
You are allowed to share “aesthetic” elements of your code, if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic df as the data frame, but you should not share the code for the data wrangling that led to the creation of df.
If we find that you posted something on Slack that violates this principle without realising, you won't be penalised for it - don't worry, but we will delete your message and let you know.
Collaborating with others
You are allowed to discuss the assignment with others, to work alongside each other and to help each other. However, you are not allowed to share code or to copy code from others. Pretty much the same rules as above.
Using AI help?
You are allowed to use Generative AI tools such as ChatGPT when doing this research, as well as to search online for help. If you do use them, however minimally, you are asked to report the AI tool you used and add an extra section to your notebook to explain the extent to which you used it.
Note that, while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to generate responses that are formulaic and repetitive, which limits your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is not very efficient, or that is very old and does not follow the principles we teach in this course.
To see examples of how to report the use of AI tools, see Our Generative AI policy.