Summative Problem Set 01 | W05-W07
DS202 - Data Science for Social Scientists
Welcome to the first Summative Problem Set of DS202 (2022/23)!
Things to know before you start:
- Deadline: you have until 9 November 2022, 23:59 UK time (Week 07) to complete your solutions and submit via Moodle.
- You can earn a maximum of 100 points for the whole assignment. The points available for each task are shown right next to the task's name.
- This assessment is worth 20% of your final grade.
- Read the instructions carefully and make sure you follow them.
Instructions
Read carefully
1. Download this page as an .Rmd file, as well as the two accompanying CSV files from Moodle (follow this link). Ensure all files are saved in the same directory. Open the RMarkdown file in RStudio.
2. Work on your solutions and fill the allocated spaces with your answers. Each question has one or more empty spaces for you to fill out.
Text answers are identified as:
Your text goes here
R code answers are identified as:
# Replace this by your code
# REMEMBER TO REMOVE ALL HASH SIGNS OR YOUR CODE WILL NOT RUN!!!
# YOU WILL BE MARKED AS 0 IN A QUESTION IF YOUR CODE DOES NOT PRODUCE
# ANY OUTPUT WHEN IT WOULD HAVE BEEN EXPECTED.
Equation answers are identified as:
\[Your~equation~goes~here\]
You can add more R code, plots, text or equations to complement your responses if you want to. For extra tips on formatting equations, see footnote 1.
3. After you answer all the questions, you should press the Knit button, and knit the file to HTML. If you cannot find this button in RStudio, check out this tutorial.
4. An HTML file will appear in your files. This is what you must submit to Moodle.
5. Open Moodle and submit your HTML file.
⚠️ IMPORTANT:
- If your code has any errors, you might not be able to knit the document. You must fix those errors first.
- If you do not know how to answer a question, leave it empty. Do not add incomplete R scripts, as they will lead to knitting errors.
- Most questions do not have a single objective response. You are expected to justify your choices of algorithms, parameters and validation metrics from your experiences in the lectures, labs and your readings of the textbook.
How to get help
It is okay to use whatever additional R packages you can find to help you explore the data and/or models better.
It is okay to team up with class colleagues to brainstorm ideas and discuss these problems together.
- Most questions do not have a single objective response. It is unlikely that you will all write the exact same response so we will spot plagiarism and full copy-pastes very easily.
It is ok to use Slack or a shared Google Drive document to share links to useful content. For example, you can share things like:
- "Tip: I am using this package called tidymodels and it is much simpler than writing for loops!"
- "I found this link useful: What is the difference between type="class" and type="response" in the predict function?"
- "This website has many examples of charts in R, and it has the source code!"
- "This R package called lubridate helped me work with dates a lot easier!"
It is ok to ask for clarification about questions of this problem set publicly on Slack. For example, you can ask questions like:
- "I am a little confused about Question X. Where it says ... does it mean ....? Am I getting this right?"
It is also ok to ask generic programming-related questions publicly on Slack. For example, you can ask questions like:
- "How do I get just the last 10 items of a list in R/tidyverse?" or
- "How do I sum the number of occurrences of a value in a column?"
- "Anyone else getting an unequal lengths error when creating a new vector? How do I solve this?"
- "I am having a hard time understanding the code below (from Week X lab):
  new_list <- seq(length(dim(some_dataframe)[1]))
  Why so many parentheses? Does anyone know how to interpret this? #help" or even
- "How do I select specific columns of a dataframe by their names?" or maybe
- "The pairs plot is too messy, anyone knows of a better way to visualise pairs of variables?"
What we CANNOT accept:
- Sharing your entire script or RMarkdown with others. It is ok, however, to share snippets of code with best practices, or to ask for help, like the type of code people share on Stack Overflow.
- Asking others to do your work for you (LSE regulations on plagiarism apply to computer code too).
We will run Turnitin on your submissions to help flag cases of plagiarism.
THIS IS AN ANONYMOUS SUBMISSION. DO NOT INCLUDE YOUR NAME NOR OTHER PERSONAL INFORMATION ANYWHERE IN THIS FILE OTHER THAN YOUR LSE ID NUMBER.
# Complete the line below by filling it out with your LSE ID
# If you do not add your LSE ID number, knitting will throw an error.
LSE_ID <- ## This is a number that starts with the year you joined LSE, like 202208....
⚙️ Setup
Import the required libraries. There is no need to change this if you are not planning to use other packages.
library(tidyverse)
library(tidymodels)
# If you use other packages, add them here
The Data: Algerian forest fires
This problem set is a chance for you to explore the true power of machine learning and build a model that will predict the occurrence of forest fires. Forest fires are a huge problem for many countries and a significant sustainability issue. In this task, you are expected not only to build a predictive model but also to diagnose which predictors are most strongly associated with fires.
⚠️ Remember once again: most questions do not have a single objective response. You are expected to justify your choices of algorithms, parameters and validation metrics based on your experiences in the lectures and labs and on your readings of the textbook or other recommended reading resources.
How you will be assessed:
Question | Marks | Level | Cumulative total |
---|---|---|---|
Q1 | 2 | Easy | 2 |
Q2 | 3 | Easy | 5 |
Q3 | 2 | Easy | 7 |
Q4 | 3 | Easy | 10 |
Q5 | 8 | Medium-Hard | 18 |
Q6 | 7 | Easy-Medium | 25 |
Q7 | 10 | Medium | 35 |
Q8 | 8 | Medium | 43 |
Q9 | 3 | Easy | 46 |
Q10 | 7 | Medium | 53 |
Q11 | 12 | Hard | 65 |
Q12 | 7 | Easy-Medium | 72 |
Q13 | 8 | Medium-Hard | 80 |
Q14 | 20 | Hard | 100 |
Data Dictionary
We will use a dataset of Algerian forest fires used by Abid & Izeboudjen (2019) (see footnote 2) and sourced from the UCI ML repository. It has observations on 244 days in Algeria from June to September 2012 in two regions:
- Bejaia, and
- Sidi Bel-abbes
The dataset contains the following variables:
Date columns
Column | Description |
---|---|
day | day of monitoring |
month | month of the monitoring ("june" to "september") |
year | Fixed: 2012 |
The column we want to predict
Column | Description |
---|---|
Classes | two classes representing the occurrence of fire |
Loading the data
Use the code below to load the two datasets used in this first problem set.
IMPORTANT: Ensure all the dataset files are in the exact same directory as this RMarkdown.
# read_csv comes from readr, which is loaded as part of the tidyverse
df_forest_fires_bejaia <- read_csv("./Algeria_Forest_Fires_Bejaia_Region_Dataset.csv")

df_forest_fires_sidi <- read_csv("./Algeria_Forest_Fires_Sidi_Bel_Abbes_Region_Dataset.csv")
Take a look at the data
# Look at the first few lines of the dataframe
df_forest_fires_bejaia %>% head()
# Look at the first few lines of the dataframe
df_forest_fires_sidi %>% head()
# What are the dimensions of the dataframes?
df_forest_fires_bejaia %>% dim()
# What are the dimensions of the dataframes?
df_forest_fires_sidi %>% dim()
What we want from you
Your main goal will be to predict the occurrence of fires and understand what this occurrence is associated with.
Next, you will go through a sequence of tasks. For each of them you are given a code cell where you are supposed to write solutions to the tasks.
🎯 Questions - Part I
Q1. Fire days
Using R, count the number of fire days observed in the two regions. (2 points)
- Bejaia region
# Replace this by your code
- Sidi Bel-abbes region
# Replace this by your code
Q2. Fire days in common
Using R, calculate how many days of fire the two regions had in common, and explain how you calculated it. (3 points)
# Replace this by your code
Explain what you did in the code above:
Replace this with your text. Use multiple lines if needed.
Q3. Exploratory Data Analysis - Part I
Run the code below to look at the plot it produces. In your own words, explain what you see: what dataset was used in the plot, what are the variables on the X and Y axes and what do the colours mean? (2 points)
g <- (
  ggplot(df_forest_fires_bejaia,
         aes(x = Temperature, y = RH, colour = Classes))
  + geom_point(size = 3, alpha = 0.6)

  # OPTIONAL: Customising the plot.
  # You can delete these lines below if you don't like the theme
  # Or you can choose other themes from
  # https://ggplot2.tidyverse.org/reference/ggtheme.html
  + theme_bw()
)
g
Your text goes here
Q4. Exploratory Data Analysis - Part II
Now, create a scatterplot using any two predictors from the Sidi Bel-abbes region data. Colour the dots according to their Classes. (3 points)
You can use either base R or ggplot (see footnote 4).
# Your code here
Q5. Exploratory Data Analysis - Part III
Can you spot differences in the distributions of predictors between the two regions (Sidi Bel-abbes vs Bejaia)? Describe the differences for at least one variable. Write your response and provide evidence using R code. You could use, for example, cross-tabulation, descriptive statistics or visualisations to support your point. (8 points)
Replace this with your text. Use multiple lines if needed.
# Your code here
🎯 Questions - Part II
Q6. Logistic Regression Model
Build a logistic regression model for the Bejaia dataset using THREE predictors to predict the occurrence of fire (the Classes variable). You can also add interaction effects amongst these three predictors if you wish. Save it as a variable named model and use R to print its summary. (7 points)
💡 Tip: you might need to convert Classes to a factor.
💡 If you have questions about R programming or conceptual questions about logistic regression, it's ok to ask teachers and colleagues. What you are not allowed to ask: things like "is my solution correct?" or "which variables did you use?", etc.
You can choose to print the summary using base R or any of the functions from the broom package (part of tidymodels).
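As a minimal sketch of the factor conversion mentioned in the tip, the example below uses a small made-up data frame (`toy` and its values are hypothetical, not the forest fire data) and assumes the class labels are spelled "not fire" and "fire". With a binomial `glm`, the first factor level is treated as the failure case, so listing "not fire" first means the fitted probabilities refer to "fire".

```r
# Toy data frame; the values below are made up for illustration only
toy <- data.frame(
  Temperature = c(29, 35, 31, 36),
  Classes = c("not fire", "fire", "not fire", "fire")
)

# Set the levels explicitly so that "not fire" is the reference (first) level
toy$Classes <- factor(toy$Classes, levels = c("not fire", "fire"))

# Inspect the result
levels(toy$Classes)
table(toy$Classes)
```

If the labels in your data contain extra spaces or different spellings, they would need cleaning before this conversion; that is not covered here.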
# Your code here
# If you won't answer this question, erase or comment out the line of code below. Otherwise, you will get an error when knitting this notebook.
model <-
Q7. Logistic Regression Model - Justification
Provide a reasonable explanation for your choice of the three predictors in Q6. Why did you choose those variables? (10 points)
(Optional: add additional R code/visualisations that you feel might help support your answer)
Replace this with your text. Use multiple lines if needed.
Q8. Logistic Regression Model - Diagnostics
Run the code below to look at the plot it produces. In your own words, explain what you see and what this plot tells you about your model. (8 points)
⚠️ If you didn't build a model in Q6, erase or comment out the block of code below. Otherwise, you will get an error when knitting this notebook.
train_classes <- df_forest_fires_bejaia$Classes
train_predictions <- predict(model, df_forest_fires_bejaia, type = "response")

plot_df <- data.frame(train_classes = train_classes,
                      train_predictions = train_predictions)

g <- (
  ggplot(plot_df, aes(x = train_predictions, fill = train_classes))
  + geom_histogram(alpha = 0.8, binwidth = 0.05, position = "stack")

  # OPTIONAL: Customising the plot.
  # You can delete these lines below if you don't like the theme
  # Or you can choose other themes from
  # https://ggplot2.tidyverse.org/reference/ggtheme.html
  + theme_bw()
  + labs(x = "Predictions on the training set",
         y = "Count")
  + scale_fill_brewer(name = "Target", type = "qual", palette = 2)
  + scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1))
  + ggtitle("Histogram of probability distributions fitted to the data")
)
g
Your text goes here
🎯 Questions - Part III
Here we will ask you to reflect on the threshold of your classification model.
💡 TIPS
- You might want to reuse the code of the notebook used in Week 04's workshop to calculate classification metrics for this question.
- If you have questions about the code itself or conceptual questions about thresholds & confusion matrices, it's ok to ask teachers and colleagues.
- What you are not allowed to ask: things like "is my solution correct?" or "what do you think of my solution?" or "which variables did you use?", etc.
- For a more visual analysis of the confusion matrix, you can alternatively use the function plot_confusion_matrix from the package cvms. You can adapt the code from the notebook used in Week 04's workshop.
- Take a look at Chapter 21 of the R for Data Science book to learn how to write loops (such as for loops, seq, seq_along).
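To illustrate the last tip, here is a minimal sketch of a `for` loop that uses `seq` and `seq_along`. The names `values` and `results` are hypothetical and the computation (squaring each value) is only a stand-in for whatever you need to calculate at each step.

```r
# Hypothetical values to iterate over, built with seq()
values <- seq(0, 1, by = 0.25)

# Pre-allocate the output vector, then fill it one element at a time
results <- numeric(length(values))
for (i in seq_along(values)) {
  results[i] <- values[i]^2
}

results
```

Using `seq_along(values)` rather than `1:length(values)` avoids a surprising loop over `c(1, 0)` when the vector is empty.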
First, take a look at the function below that will help you select a good threshold for your model.
The function apply_threshold receives three arguments (model, df & threshold) and returns a vector of predicted classes with one element for each observation in the dataframe df.
apply_threshold <- function(model, df, threshold) {
  # Predicted probabilities from the fitted model
  pred_probs <- predict(model, df, type = "response")

  # Probabilities below the threshold become "not fire", all others "fire"
  pred_classes <- factor(ifelse(pred_probs < threshold, "not fire", "fire"),
                         levels = c("not fire", "fire"),
                         ordered = TRUE)
  return(pred_classes)
}
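The sketch below shows how apply_threshold behaves on simulated data. Everything apart from the function itself is an assumption made for illustration: `toy_df`, `x`, `y` and `toy_model` are hypothetical names and the data are random, not the forest fire data. The function is repeated so that the example runs on its own.

```r
# Same function as above, repeated so this sketch is self-contained
apply_threshold <- function(model, df, threshold) {
  pred_probs <- predict(model, df, type = "response")
  pred_classes <- factor(ifelse(pred_probs < threshold, "not fire", "fire"),
                         levels = c("not fire", "fire"),
                         ordered = TRUE)
  return(pred_classes)
}

# Simulated data (hypothetical): one numeric predictor and a noisy binary outcome
set.seed(1)
toy_df <- data.frame(x = rnorm(100))
toy_df$y <- factor(ifelse(toy_df$x + rnorm(100) > 0, "fire", "not fire"),
                   levels = c("not fire", "fire"))

# Logistic regression on the simulated data
toy_model <- glm(y ~ x, data = toy_df, family = binomial)

# Lowering the threshold can only move observations from "not fire" to "fire"
preds_low  <- apply_threshold(toy_model, toy_df, threshold = 0.2)
preds_high <- apply_threshold(toy_model, toy_df, threshold = 0.8)

table(preds_low)
table(preds_high)
```

Comparing the two tables shows the trade-off a threshold controls: a low threshold labels more days as "fire", a high threshold labels fewer.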
Q9. Logistic Regression Model - Confusion Matrix
Run the code below to look at the table it produces. What does this table show and what does it tell you about your model? (3 points)
train_classes <- df_forest_fires_bejaia$Classes
train_class_predictions <- apply_threshold(model, df_forest_fires_bejaia, threshold=0.50)

# Rows are the observed classes, columns are the predicted classes
confusion_matrix <- table(train_classes, train_class_predictions)
print(confusion_matrix)
Q10. Logistic Regression Model - Classification metrics
Now, consider three other options of threshold: \(t \in \{0.20, 0.40, 0.60\}\). Which of these three options leads to the best F1-score for your model? Write the R code for this and justify your answer. (7 points)
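For reference, the standard definition of the F1-score is given below. It assumes that "fire" is treated as the positive class (the problem set does not state this explicitly), with \(TP\), \(FP\) and \(FN\) denoting the counts of true positives, false positives and false negatives in the confusion matrix.

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
\]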
# Your code here
Replace this with your text. Use multiple lines if needed.
Q11. Logistic Regression Model - Optimal Threshold (Challenging)
Now, consider another set of possible thresholds, \(t \in \{0.00, 0.01, 0.02, \ldots, 0.98, 0.99, 1.00\}\). Find the optimal threshold \(t^*\), the one that leads to the best F1-score. Write the R code for this and justify your answer. (12 points)
# Your code here
Replace this with your text. Use multiple lines if needed.
🎯 Questions - Part IV
Q12. Test set predictions
Follow the instructions below to apply the model you trained in Q6 to predict the probability of forest fires in the Sidi Bel-abbes dataset and produce a plot similar to that of Q8. (7 points)
- Create a vector named test_classes that contains the true observed data (fire vs not fire) of Sidi Bel-abbes (you might need to convert it to factor).
- Create a vector named test_predictions that contains the predicted probability of forest fires in the Sidi Bel-abbes region.
- If the plot is produced and correct, you will get full marks. No need to justify the response.
⚠️ If you don't want to answer this question, erase or comment out the block of code below. Otherwise, you will get an error when knitting this notebook.
# Your code here
test_classes <-
test_predictions <-

plot_df <- data.frame(test_classes = test_classes,
                      test_predictions = test_predictions)

g <- (
  ggplot(plot_df, aes(x = test_predictions, fill = test_classes))
  + geom_histogram(alpha = 0.8, binwidth = 0.05, position = "stack")

  # OPTIONAL: Customising the plot.
  # You can delete these lines below if you don't like the theme
  # Or you can choose other themes from
  # https://ggplot2.tidyverse.org/reference/ggtheme.html
  + theme_bw()
  + labs(x = "Predictions on the test set",
         y = "Count")
  + scale_fill_brewer(name = "Target", type = "qual", palette = 2)
  + scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1))
  + ggtitle("Histogram of probability distributions when applied to Sidi Bel-abbes data")
)
g
Q13. Diagnostics
Using the best threshold you found in either Q10 or Q11, write R code to produce a confusion matrix for the test set (Sidi Bel-abbes dataset). What are the True Positive Rate and the True Negative Rate of your model in the test set? Did your model generalise well from the training set to the test set? (8 points)
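For reference, the two rates are defined below in the usual way, again assuming "fire" is the positive class. \(TP\), \(TN\), \(FP\) and \(FN\) are the counts of true positives, true negatives, false positives and false negatives in the confusion matrix.

\[
\text{TPR} = \frac{TP}{TP + FN}, \qquad
\text{TNR} = \frac{TN}{TN + FP}
\]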
# Your code here
Replace this with your text. Use multiple lines if needed.
🎯 Questions - Part V
Here we will ask you to build an alternative classification model, using an algorithm other than logistic regression.
Q14. Alternative Models (Challenging)
Follow the instructions below to build and explore an alternative classification model. Add as many chunks of code, text and equations as you prefer. (20 points)
- Choose another algorithm (either Naive Bayes, Decision Tree or Support Vector Machine) to build a new classification model.
- Use the same training data you used to build your logistic regression in Q6 (same predictors)
- If the algorithm requires a threshold, choose one that maximises the F1-score using the same logic as in Q10 or Q11.
- Use the same test data you used to validate your logistic regression as in Q12
- If the algorithm does not require a threshold, try to tweak the parameters of the algorithm so as to avoid overfitting the model.
- Use whatever means you find appropriate (for example metrics, matrices, tables, plots) to compare your new model to the logistic model you built in the rest of this notebook.
- Write about what you think makes your alternative model better/worse.
- Provide the full R code you used to build and test your alternative model
💡 TIPS
- Use what you have learned up until the Week 05 lecture.
- Refer to your readings of the textbook if you want to understand more about how the alternative algorithms work. You can find links to the appropriate chapters here.
# Your code here. Copy this chunk of code if you need.
Replace this with your text. Use multiple lines if needed.
Decompress
How do you plan to reward yourself for completing this problem set?
<replace this text with your reward. A cookie? take X days off? etc.>
Footnotes
1. https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference
2. Abid, Faroudja, and Nouma Izeboudjen. "Predicting forest fire in Algeria using data mining techniques: Case study of the decision tree algorithm." International Conference on Advanced Intelligent Systems for Sustainable Development. Springer, Cham, 2019.
3. Read more about the Forest Fire Weather Index (FWI): https://www.wikiwand.com/en/Forest_fire_weather_index
4. Check out Chapter 3 of the R for Data Science book (available online for free).