πŸ“ W08 Summative

2024/25 Autumn Term

Author

⏲️ Due Date:

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to πŸ“§

🎯 Main Objectives:

βš–οΈ Assignment Weight:

This assignment is worth 20% of your final grade in this course.


Do you know your CANDIDATE NUMBER? You will need it.

β€œYour candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

πŸ“ Instructions

  1. Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled πŸ“ W08 Summative. Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub and then click on the green button Accept this assignment.

  3. You will be redirected to a new private repository created just for you. The repository will be named ds202a-2024-w08-summative-yourusername, where yourusername is your GitHub username, and it will contain a README.md file with a copy of these instructions.

  4. Recall your LSE CANDIDATE NUMBER. You will need it in the next step.

  5. Create a <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE candidate number.

    For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  6. Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

    ---
    title: "DS202A - W08 Summative"
    author: <CANDIDATE_NUMBER>
    output: html
    self-contained: true
    ---

    Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

    ---
    title: "DS202A - W08 Summative"
    author: 12345
    output: html
    self-contained: true
    ---
  7. Fill out the .qmd file with your answers. Use headers and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.

  8. Once you are done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.

    Ensure that your .qmd code is reproducible; that is, if we were to restart R and RStudio and run your notebook from scratch, from top to bottom, we would get the same results as you did. A sketch of a setup chunk that helps with this appears after this list.

  9. Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

  10. Read the section How to get help and how to collaborate with others at the end of this document.
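
A quick note on the reproducibility requirement in step 8: one common way to keep a render reproducible is a setup chunk at the very top of the .qmd. The sketch below is only an example; the packages and seed shown are placeholders for whatever your own analysis actually uses.

    # Example setup chunk: keep every library() call and a fixed seed at the top,
    # so that a fresh, top-to-bottom render produces the same results each time.
    library(tidyverse)    # adjust to the packages you actually use
    library(tidymodels)

    set.seed(1234)        # any fixed number works; it keeps random splits identical across renders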

β€œWhat do I submit?”

You will submit two files:

  • A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  • An HTML file render of the Quarto markdown file.

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

πŸ—„οΈ Get the data

What data will you be using?

You will be using two distinct datasets for this summative.

Parts 1 and 2

Your data for these parts come from an air quality dataset collected and shared publicly by Betancourt (2020). It is a β€œcollection of aggregated air quality data from the years 2010–2014 and metadata at more than 5500 air quality monitoring stations all over the world, provided by the first Tropospheric Ozone Assessment Report (TOAR) [and it] focuses in particular on metrics of tropospheric ozone, which has a detrimental effect on climate, human morbidity and mortality, as well as crop yields”.

πŸ“š Preparation

  1. Download the data by clicking on the button below.

  2. Download the dataset variable dictionary below:

Part 3

The dataset for this part is about water quality, more precisely drinking water potability. It is publicly available on Kaggle.

πŸ“š Preparation

  1. Download the data by clicking on the button below.

πŸ“‹ Your Tasks

What do we actually want from you?

Part 1: Show us your dplyr muscles! (10 marks)

You don’t need to use a chunk for each question; feel free to organise your code and markdown for this part as you see fit. A generic (non-solution) dplyr sketch appears after the task list below.

  1. Load the data into a data frame called aq_bench. Freely explore the data on your own.

  2. This dataset comes in a mostly clean format but will require some work before it can be used.

    1. Remove the lat, lon and dataset columns, saving the result in a new data frame called aq_bench_filtered.
  3. What are the 5 countries with the highest number of rows in the dataset? And what are the 5 countries with the lowest number of rows in the dataset?

  4. What is the median NO2 per type of area?

  5. Create a plot that shows the relationship between population density and O3 average values. What does this plot tell you?
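
The tasks above only need a handful of dplyr verbs. As a purely generic illustration (not the solution), here is the general shape of those verbs, using a hypothetical data frame df with made-up columns col_a, col_b, group_col and value_col:

    # Generic dplyr/ggplot2 patterns with hypothetical names (df, col_a, col_b,
    # group_col, value_col) - these are NOT the assignment's column names.
    library(dplyr)
    library(ggplot2)

    df %>% select(-col_a, -col_b)             # drop unwanted columns
    df %>% count(group_col, sort = TRUE)      # rows per group, largest first
    df %>%
      group_by(group_col) %>%
      summarise(median_value = median(value_col, na.rm = TRUE))

    ggplot(df, aes(x = col_a, y = col_b)) +   # scatterplot of two variables
      geom_point()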

Part 2: Create regression models (45 marks)

In this part, we focus on predicting o3_average_values.

As in the previous section, you don’t need to use a chunk for each question; feel free to organise your code and markdown for this part as you see fit. A generic (non-solution) modelling sketch appears after this part’s task list.

  1. Create a baseline linear regression model:

    • Create the training and test sets:

      • create a df_train to contain 75% of your original data
      • create a df_test to contain 25% of your original
    • Now, using only the df_train data as a starting point, create a linear regression model that predicts the target variable

    • How well does your model perform? Just as in the lab on week 3, use the residuals plot and a metric of your choosing to justify your reasoning. Can you explain the performance change between the training and test set?

  2. Now is your time to shine! Come up with your own feature selection, feature engineering and/or model selection strategyΒΉ and try to get a better model performance than you had before. Don’t forget to validate your results using the appropriate resampling techniques!
    Whatever you do, this is what we expect from you:

    • Show us your code and your model.

    • Explain your choices (of feature engineering, model selection or resampling strategy)

    • Evaluate your model’s performance. If you created a new model, compare it to the baseline model. If you performed a more robust resampling, compare it to the single train-test split you did in the previous question.

Part 3: Create classification models (45 marks)

In this part, we’ll focus on predicting Potability. A generic (non-solution) classification sketch appears after the task list below.

  1. Create a baseline logistic regression model:

    • Split your data in training and test set (75% for training set)
    • Use whatever metric you feel is most apt for this task to evaluate your model’s performance. Explain why you chose this metric.
    • Explain what the regression coefficients mean in the context of this problem.
    • Comment on the goodness-of-fit of your model and its predictive power.
  2. Now is your time to shine once again! Come up with your own feature selection, feature engineering and/or model selection strategy and try to get a better model performance than you had before. Don’t forget to validate your results using the appropriate resampling techniques!
    Whatever you do, this is what we expect from you:

    • Show us your code and your model.

    • Explain your choices (of feature engineering, model selection or resampling strategy)

    • Evaluate your model’s performance. If you created a new model, compare it to the baseline model. If you performed a more robust resampling, compare it to the single train-test split you did in the previous question.

βœ”οΈ How we will grade your work

If you follow all the instructions, you should expect a score of around 70/100. You will only get a higher score if you go above and beyond what is asked of you in a meaningful way. Simply adding more codeΒ² or text will not get you a higher mark; you need to add interesting insights or analyses to get a distinction.

⚠️ You will incur a penalty if you only submit a .qmd file and not also a properly rendered .html file alongside it!

Part 1: Show us your dplyr muscles! (10 marks)

Here is a rough rubric for this part:

  • <4 marks: You wrote some code but filtered the data incorrectly or did not follow the instructions.
  • 4-6 marks: You cleaned the initial dataframe correctly correctly, but you might have made some mistakes when tallying the number of rows per countries, calculating the median NO2 per type of area or your plot and conclusions for Task 5 are not correct.
  • 7-9 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
  • 10 marks: You did everything correctly, and your submission was perfect. Wow! Your code and markdown were well-organised, and your answers were concise and to the point.

Part 2: Create regression models (45 marks)

Here is a rough rubric for this part:

  • <11 marks: A deep fail. There is no code, or the code/markdown is so insubstantial or disorganised to the point that we cannot understand what you did.
  • 11-21 marks: A fail. You wrote some code and text but ignored important aspects of the instructions (like not using linear regression)
  • 22-33 marks: You made some critical mistakes or did not complete all the tasks. For example: your pre-processing step was incorrect, your model contained some data leakage (e.g using variables that define others to predict them), or perhaps your analysis of your model was way off.
  • 34-38: Good, you just made minor mistakes in your code, or your analysis demonstrated some minor misunderstandings of the concepts.
  • ~39 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
  • >39 marks: Impressive! You impressed us with your level of technical expertise and deep knowledge of the intricacies of linear regression and other models. We are likely to print a photo of your submission and hang it on the wall of our offices.

Part 3: Create classification models (45 marks)

Here is a rough rubric for this part:

  • <11 marks: A deep fail. There is no code, or the code/markdown is so insubstantial or disorganised to the point that we cannot understand what you did.
  • 11-21 marks: A fail. You wrote some code and text but ignored important aspects of the instructions (like not using logistic regression)
  • 22-33 marks: You made some critical mistakes or did not complete all the tasks. For example: your pre-processing step was incorrect, your model contained some data leakage (e.g using variables that define others to predict them), or perhaps your analysis of your model was way off.
  • 34-38: Good, you just made minor mistakes in your code, or your analysis demonstrated some minor misunderstandings of the concepts.
  • ~39 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
  • >39 marks: Impressive! You impressed us with your level of technical expertise and deep knowledge of the intricacies of the logistic function and other models. We are likely to print a photo of your submission and hang it on the wall of our offices.

How to get help and how to collaborate with others

πŸ™‹ Getting help

You can post general coding questions on Slack but should not reveal code that is part of your solution.

For example, you can ask:

  • β€œDoes anyone know how I can create a logistic regression in tidymodels without a recipe?”
  • β€œHas anyone figured out how to do time-aware cross-validation, grouped per country??”
  • β€œI tried using something like df %>% mutate(col1=col2 + col3) but then I got an error” (Reproducible example)
  • β€œDoes anyone know how I can create a new variable that is the sum of two other variables?”

You are allowed to share β€˜aesthetic’ elements of your code if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic df as the data frame, but you should not share the code for the data wrangling that led to the creation of df.

If we find that you posted something on Slack that violates this principle without realising it, you won’t be penalised for it. Don’t worry; we will simply delete your message and let you know.

πŸ‘― Collaborating with others

You are allowed to discuss the assignment with others, work alongside each other, and help each other. However, you cannot share or copy code from others β€” pretty much the same rules as above.

πŸ€– Using AI help?

You can use Generative AI tools such as ChatGPT when working on this assignment, and you can also search online for help. If you use such a tool, however minimal the use, you are asked to report which AI tool you used and add an extra section to your notebook to explain how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see πŸ€– Our Generative AI policy.

References

Betancourt, Clara. 2020. β€œAQ-Bench.” https://b2share.eudat.eu. https://doi.org/10.23728/B2SHARE.30D42B5A87344E82855A486BF2123E9F.

Footnotes

  1. Feature engineering is creating new variables from existing ones. For example, you could create a new variable that results from a mathematical transformation of an existing variable. Or you could enrich your dataset with some other publicly available data.β†©οΈŽ

  2. Hint: don’t just write code, especially uncommented chunks of code. It won’t get you very far. You need to explain the code results, interpret them and put them in context.β†©οΈŽ