Summative Problem Set 01 | W05-W07

DS202 - Data Science for Social Scientists

Author

DS202 2022MT Teaching Group

Published

25 October 2022

Welcome to the first Summative Problem Set of DS202 (2022/23)!

Things to know before you start:

Deadline: you have until 9 November 2022, 23:59 UK time (Week 07) to complete your solutions and submit via Moodle.
The whole assignment is worth a maximum of 100 points. The points for each task are shown next to the task’s name.
This assessment is worth 20% of your final grade.
Read the instructions carefully and make sure you follow them.

Important

Your feedback on the formative problem set was very important! We heard that many of you are still learning — and struggling with — R, and this can get in the way of learning about Machine Learning.

So, for this summative, we will not be relying too much on pure R skills. We provide most of the code that you will need, and we explicitly encourage you to publicly ask questions about programming. You can read our comprehensive list of tips below.

📒 Instructions

Read carefully

1. Download this page as an .Rmd file, as well as the two accompanying CSV files from Moodle (follow this link). Ensure all files are saved in the same directory. Open the RMarkdown file in RStudio.

2. Work on your solutions and fill the allocated spaces with your answers. Each question has one or more empty spaces for you to fill out.

Text answers are identified as:

R code answers are identified as:

# Replace this by your code

# REMEMBER TO REMOVE ALL HASH SIGNS OR YOUR CODE WILL NOT RUN!!!
# YOU WILL BE MARKED AS 0 IN A QUESTION IF YOUR CODE DOES NOT PRODUCE
# ANY OUTPUT WHEN IT WOULD HAVE BEEN EXPECTED.

Equation answers are identified as:

\[Your~equation~goes~here\]

You can add more R code, plots, text or equations to complement your responses if you want to. For extra tips on formatting equations see ¹.
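As an illustration of the equation format only (a generic simple linear regression, not the answer to any question in this problem set), the source typed into the .Rmd file could look like this:

```latex
\[
Y = \beta_0 + \beta_1 X + \epsilon
\]
```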

3. After you answer all the questions, you should press the Knit button, and knit the file to HTML. If you cannot find this button in RStudio, check out this tutorial.

4. An HTML file will appear in your files. This is what you must submit to Moodle.

5. Open Moodle and submit your HTML file.

⚠️ IMPORTANT:

If your code has any errors, you might not be able to knit the document. You must fix those errors first.
If you do not know how to answer a question, leave it empty. Do not add incomplete R scripts, as they will lead to knitting errors.
Most questions do not have a single objective response. You are expected to justify your choices of algorithms, parameters and validation metrics from your experiences in the lectures, labs and your readings of the textbook.

How to get help

It is okay to use whatever additional R packages you can find to help you explore the data and/or models better.
It is okay to team up with class colleagues to brainstorm ideas and discuss these problems together.
- Most questions do not have a single objective response. It is unlikely that you will all write the exact same response so we will spot plagiarism and full copy-pastes very easily.
It is ok to use Slack or a shared Google Drive document to share links to useful content. For example, you can share things like:
- “Tip: I am using this package called tidymodels and it is much simpler than writing for loops!”
- ‘I found this link useful: What is the difference between type=“class” and type=“response” in the predict function?’
- This website has many examples of charts in R — and it has the source code!
- ‘This R package called lubridate helped me work with dates a lot easier!’
It is ok to ask for clarification about questions of this problem set publicly on Slack. For example, you can ask questions like:
- “I am a little confused about Question X. Where it says ... does it mean ....? Am I getting this right?”
It is also ok to ask generic programming-related questions publicly on Slack. For example, you can ask questions like:
- “How do I get just the last 10 items of a list in R/tidyverse?” or
- “How do I sum the number of occurrences of a value in a column?”
- “Anyone else getting an unequal lengths error when creating a new vector? How do I solve this?”
- “I am having a hard time understanding the code below (from Week X lab):
```
new_list <- seq(length(dim(some_dataframe)[1]))
```
  Why so many parentheses? Does anyone know how to interpret this? #help” or even
- “How do I select specific columns of a dataframe by their names?” or maybe
- “The pairs plot is too messy, anyone knows of a better way to visualise pairs of variables?”
What we CANNOT accept:
- sharing your entire script or RMarkdown with others. But it is ok to share snippets of code with best practices, or to ask for help, like the type of code people share on Stack Overflow
- asking others to do your work for you (LSE regulations on plagiarism apply to computer code too)
- We will run Turnitin on your submissions to help flag cases of plagiarism

THIS IS AN ANONYMOUS SUBMISSION. DO NOT INCLUDE YOUR NAME NOR OTHER PERSONAL INFORMATION ANYWHERE IN THIS FILE OTHER THAN YOUR LSE ID NUMBER.

# Complete the line below by filling it out with your LSE ID
# If you do not add your LSE ID number, knitting will throw an error.
LSE_ID <- ## This is a number that starts with the year you joined LSE, like 202208....

⚙️ Setup

Import required libraries. No need to change this if you are not planning to use other packages.

library(tidyverse)
library(tidymodels)

# If you use other packages, add them here

The Data: Algerian forest fires

This problem set is a chance for you to explore the true power of machine learning and build a model that will predict the occurrence of forest fires. Forest fires are a huge problem for many countries and a significant sustainability issue. By solving this task, you are expected to build a predictive model but also to try and diagnose which predictors are more predictive of fires.

How you will be assessed:

Question	Marks	Level	Total
Q1	2	Easy	2
Q2	3	Easy	5
Q3	2	Easy	7
Q4	3	Easy	10
Q5	8	Medium-Hard	18
Q6	7	Easy-Medium	25
Q7	10	Medium	35
Q8	8	Medium	43
Q9	3	Easy	46
Q10	7	Medium	53
Q11	12	Hard	65
Q12	7	Easy-Medium	72
Q13	8	Medium-Hard	80
Q14	20	Hard	100

Data Dictionary

We will use a dataset of Algerian forest fires used by Faroudja & Izeboudjen (2019) ² and sourced from the UCI ML repository. It has observations on 244 days in Algeria from June to September 2012 in two regions:

Bejaia, and
Sidi Bel-abbes

The dataset contains the following variables:

Date columns

Column	Description
`day`	day of monitoring
`month`	month of the monitoring (‘june’ to ‘september’)
`year`	Fixed: 2012

Columns related to weather data observations

Column	Description
`Temperature`	temperature at noon (max temperature) in Celsius degrees
`RH`	Relative Humidity in %
`Ws`	Wind speed in km/h
`Rain`	total rainfall for the day in mm

Columns related to FWI components ³

Column	Description
`FFMC`	Fine Fuel Moisture Code index from the FWI system
`DMC`	Duff Moisture Code index from the FWI system
`DC`	Drought Code index from the FWI system
`ISI`	Initial Spread Index from the FWI system
`BUI`	Buildup Index from the FWI system
`FWI`	Fire Weather Index

The column we want to predict

Column	Description
`Classes`	two classes representing the occurrence of fire

Loading the data

Use the code below to load the two datasets used in this first problem set.

IMPORTANT: Ensure all the dataset files are in the exact same directory as this RMarkdown.

# read_csv is a function of the tidyverse package

df_forest_fires_bejaia <- read_csv("./Algeria_Forest_Fires_Bejaia_Region_Dataset.csv")

df_forest_fires_sidi   <- read_csv("./Algeria_Forest_Fires_Sidi_Bel_Abbes_Region_Dataset.csv")

Take a look at the data

# Look at the first few lines of the dataframe
df_forest_fires_bejaia %>% head()

# Look at the first few lines of the dataframe
df_forest_fires_sidi %>% head()

# What are the dimensions of the dataframes?

df_forest_fires_bejaia %>% dim()

# What are the dimensions of the dataframes?

df_forest_fires_sidi %>% dim()

What we want from you

Your main goal will be to predict the occurrence of fires and understand what this occurrence is associated with.

Next, you will go through a sequence of tasks. For each of them you are given a code cell where you are supposed to write solutions to the tasks.

🎯 Questions - Part I

Q1. Fire days

Using R, count the number of fire days observed in the two regions. (2 points)

Bejaia region

# Replace this by your code

Sidi Bel-abbes region

# Replace this by your code

Q2. Fire days in common

Using R, calculate how many days of fire the two regions had in common, and explain how you calculated it. (3 points)

# Replace this by your code

Explain what you did in the code above:

Q3. Exploratory Data Analysis - Part I

Run the code below to look at the plot it produces. In your own words, explain what you see: what dataset was used in the plot, what are the variables in the X and Y axis and what do the colours mean? (2 points)

g <-
    (ggplot(df_forest_fires_bejaia,
            aes(x = Temperature, y = RH, colour = Classes))
        + geom_point(size = 3, alpha = 0.6)

        # OPTIONAL: Customising the plot.
        # You can delete these lines below if you don't like the theme
        # Or you can choose other themes from
        # https://ggplot2.tidyverse.org/reference/ggtheme.html
        + theme_bw()
    )
g

Q4. Exploratory Data Analysis - Part II

Now, create a scatterplot using any two predictors from the Sidi Bel-abbes region data. Colour the dots according to their Classes. (3 points)

You can use either base R or ggplot. ⁴

# Your code here

Q5. Exploratory Data Analysis - Part III

Can you spot differences in the distributions of predictors between the two regions (Sidi Bel-abbes vs Bejaia)? Describe the differences for at least one variable. Write your response and provide evidence using R code. You could use, for example, cross-tabulation, descriptive statistics or visualisations to support your point. (8 points)

# Your code here

🎯 Questions - Part II

Q6. Logistic Regression Model

Build a logistic regression model for the Bejaia dataset using THREE predictors to predict the occurrence of fire (the Classes variable). You can also add interaction effects amongst these three predictors if you wish. Save it as a variable named model and use R to print its summary. (7 points)

💡 Tip: you might need to convert Classes to a factor.

💡 If you have questions about R programming or conceptual questions about logistic regression, it’s ok to ask questions to teachers and colleagues. What you are not allowed to ask: things like “is my solution correct?” or “which variables did you use?”, etc.

You can choose to print the summary using base R or any of the functions from the broom package (part of tidymodels).
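The tip about converting Classes to a factor can be sketched on a toy data frame. The data and the name `toy` below are hypothetical and are not taken from the forest fires files; the sketch only shows the conversion itself and why the level order matters for a binomial `glm()`.

```r
# Toy data frame (hypothetical values, NOT the forest fires data)
toy <- data.frame(
  Classes = c("fire", "not fire", "fire", "not fire"),
  x       = c(3.1, 0.4, 2.7, 0.9)
)

# Convert the character column to a factor with an explicit level order.
# In a binomial glm(), the first level is treated as "failure" and the
# other level as "success", so "fire" would be the outcome being modelled.
toy$Classes <- factor(toy$Classes, levels = c("not fire", "fire"))

# Inspect the result
levels(toy$Classes)
table(toy$Classes)
```

Setting the levels explicitly avoids relying on alphabetical ordering, which is what `factor()` uses by default.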

# Your code here

# If you won't answer this question, erase or comment out the line of code below. Otherwise, you will get an error when knitting this notebook.
model <-

Q7. Logistic Regression Model - Justification

Provide a reasonable explanation for your choice of the three predictors in Q6. Why did you choose those variables? (10 points)

(Optional: add additional R code/visualisations that you feel might help support your answer)

Q8. Logistic Regression Model - Diagnostics

Run the code below to look at the plot it produces. In your own words, explain what you see and what this plot tells you about your model. (8 points)

⚠️ If you didn’t build a model in Q6, erase or comment out the block of code below. Otherwise, you will get an error when knitting this notebook.


train_classes     <- df_forest_fires_bejaia$Classes
train_predictions <- predict(model, df_forest_fires_bejaia, type = "response")

plot_df <- data.frame(train_classes     = train_classes,
                      train_predictions = train_predictions)

g <-
    (ggplot(plot_df, aes(x = train_predictions, fill = train_classes))
        + geom_histogram(alpha = 0.8, binwidth = 0.05, position = "stack")

        # OPTIONAL: Customising the plot.
        # You can delete these lines below if you don't like the theme
        # Or you can choose other themes from
        # https://ggplot2.tidyverse.org/reference/ggtheme.html
        + theme_bw()
        + labs(x = "Predictions on the training set",
               y = "Count")
        + scale_fill_brewer(name = "Target", type = "qual", palette = 2)
        + scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1))
        + ggtitle("Histogram of probability distributions fitted to the data")
    )
g

🎯 Questions - Part III

Here we will ask you to reflect on the threshold of your classification model.

💡 TIPS

You might want to reuse the code of the notebook used in Week 04’s workshop to calculate classification metrics for this question.
If you have questions about the code itself or conceptual questions about thresholds & confusion matrices, it’s ok to ask questions to teachers and colleagues.
What you are not allowed to ask: things like “is my solution correct?” or “what do you think of my solution?” or “which variables did you use?”, etc.
For a more visual analysis of the confusion matrix, you can alternatively use the function plot_confusion_matrix from the cvms package. You can adapt the code from the notebook used in Week 04’s workshop.
Take a look at Chapter 21 of the R for Data Science book to learn how to write loops (such as for loops, seq, seq_along).
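As a quick reminder of the loop tools mentioned in the last tip, here is a minimal sketch using `seq()` and `seq_along()`. The names `candidates` and `results`, and the squaring step, are placeholders chosen for illustration and are not part of the problem set.

```r
# Hypothetical grid of candidate values, built with seq()
candidates <- seq(from = 0, to = 1, by = 0.25)

# Pre-allocate a numeric vector to hold one result per candidate
results <- numeric(length(candidates))

# seq_along(candidates) yields the indices 1, 2, ..., length(candidates)
for (i in seq_along(candidates)) {
  results[i] <- candidates[i]^2   # placeholder computation
}

# Collect candidates and results side by side
data.frame(candidate = candidates, result = results)
```

The same pattern (build a grid, loop over its indices, store one value per iteration) applies whenever a quantity has to be evaluated for several settings.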

First, take a look at the function below that will help you select a good threshold for your model.

The function apply_threshold receives three arguments (model, df & threshold) and returns a vector of predicted classes with one element per observation in the dataframe df.


apply_threshold <- function(model, df, threshold) {

    pred_probs <- predict(model, df, type = "response")

    pred_classes <- factor(ifelse(pred_probs < threshold, "not fire", "fire"),
                           levels = c("not fire", "fire"),
                           ordered = TRUE)

    return(pred_classes)
}
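To see how the function behaves, here is a hedged usage sketch on simulated data. Everything named `toy` or `toy_model` is hypothetical and unrelated to the forest fires files; the function body is repeated only so that the sketch runs on its own.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated toy data (hypothetical; NOT the forest fires data)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$Classes <- factor(ifelse(toy$x + rnorm(n) > 0, "fire", "not fire"),
                      levels = c("not fire", "fire"))

# A one-predictor logistic regression, only for demonstration
toy_model <- glm(Classes ~ x, data = toy, family = binomial)

# Same function as defined above, repeated so this sketch is self-contained
apply_threshold <- function(model, df, threshold) {
    pred_probs <- predict(model, df, type = "response")
    pred_classes <- factor(ifelse(pred_probs < threshold, "not fire", "fire"),
                           levels = c("not fire", "fire"),
                           ordered = TRUE)
    return(pred_classes)
}

# A lower threshold can only label the same or more observations as "fire"
table(apply_threshold(toy_model, toy, threshold = 0.30))
table(apply_threshold(toy_model, toy, threshold = 0.70))
```

Because an observation is labelled "fire" whenever its predicted probability is at or above the threshold, raising the threshold never increases the number of predicted fires.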

Q9. Logistic Regression Model - Confusion Matrix

Run the code below to look at the table it produces. What does this table show and what does it tell you about your model? (3 points)

train_classes           <- df_forest_fires_bejaia$Classes
train_class_predictions <- apply_threshold(model, df_forest_fires_bejaia, threshold=0.50)

confusion_matrix <- table(train_classes, train_class_predictions)
print(confusion_matrix)
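The layout of a two-way `table()` can be sketched with made-up labels. The vectors `truth` and `predicted` below are hypothetical, and treating "fire" as the positive class is an assumption made for this illustration.

```r
# Hypothetical true and predicted labels (NOT taken from the forest fires data)
lvls      <- c("not fire", "fire")
truth     <- factor(c("fire", "fire", "fire", "not fire", "not fire", "fire"),
                    levels = lvls)
predicted <- factor(c("fire", "not fire", "fire", "not fire", "fire", "fire"),
                    levels = lvls)

# The first argument defines the rows (true classes),
# the second argument defines the columns (predicted classes)
cm <- table(truth, predicted)
print(cm)

# Pick out the four cells by name: cm[true class, predicted class]
TP <- cm["fire", "fire"]          # fires predicted as fire
FN <- cm["fire", "not fire"]      # fires predicted as not fire
FP <- cm["not fire", "fire"]      # non-fires predicted as fire
TN <- cm["not fire", "not fire"]  # non-fires predicted as not fire

c(TP = TP, FN = FN, FP = FP, TN = TN)
```

Indexing by name rather than by position keeps the code correct even if the order of the factor levels changes.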

Q10. Logistic Regression Model - Classification metrics

Now, consider three other threshold options: \(t \in \{0.20, 0.40, 0.60\}\). Which of these three options leads to the best f1-score for your model? Write the R code for this and justify your answer. (7 points)
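For reference, the f1-score is the harmonic mean of precision and recall. Writing \(TP\), \(FP\) and \(FN\) for the counts of true positives, false positives and false negatives (with "fire" taken as the positive class, an assumption that matches the labels used in apply_threshold), the standard definitions are:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```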

# Your code here

Q11. Logistic Regression Model - Optimal Threshold (Challenging)

Now, consider another set of possible thresholds, \(t \in \{0.00, 0.01, 0.02, \ldots, 0.98, 0.99, 1.00\}\). Find the optimal threshold \(t^*\), the one that leads to the best f1-score. Write the R code for this and justify your answer. (12 points)

# Your code here

🎯 Questions - Part IV

Q12. Test set predictions

Follow the instructions below to apply the model you trained in Q6 to predict the probability of forest fires in the Sidi Bel-abbes dataset and produce a plot similar to that of Q8. (7 points)

Create a vector named test_classes that contains the true observed data (fire vs not fire) of Sidi Bel-abbes (You might need to convert it to factor)
Create a vector named test_predictions that contains the predicted probability of forest fires in the Sidi Bel-abbes region
If the plot is produced and correct, you will get full marks. No need to justify the response.

⚠️ If you don’t want to answer this question, erase or comment out the block of code below. Otherwise, you will get an error when knitting this notebook.

# Your code here

test_classes <- 
test_predictions <- 

plot_df <- data.frame(test_classes     = test_classes,
                      test_predictions = test_predictions)

g <-
    (ggplot(plot_df, aes(x = test_predictions, fill = test_classes))
        + geom_histogram(alpha = 0.8, binwidth = 0.05, position = "stack")

        # OPTIONAL: Customising the plot.
        # You can delete these lines below if you don't like the theme
        # Or you can choose other themes from
        # https://ggplot2.tidyverse.org/reference/ggtheme.html
        + theme_bw()
        + labs(x = "Predictions on the test set",
               y = "Count")
        + scale_fill_brewer(name = "Target", type = "qual", palette = 2)
        + scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1))
        + ggtitle("Histogram of probability distributions when applied to Sidi Bel-abbes data")
    )
g

Q13. Diagnostics

Using the best threshold you found in either Q10 or Q11, write R code to produce a confusion matrix for the test set (Sidi Bel-abbes dataset). What is the True Positive Rate and True Negative Rate of your model in the test set? Did your model generalise well from the training to test set? (8 points)

# Your code here

🎯 Questions - Part V

Here we will ask you to build an alternative classification model, using an algorithm other than logistic regression.

Q14. Alternative Models (Challenging)

Follow the instructions below to build and explore an alternative classification model. Add as many chunks of code, text and equations as you prefer. (20 points)

Choose another algorithm (either Naive Bayes, Decision Tree or Support Vector Machine) to build a new classification model.
Use the same training data you used to build your logistic regression in Q6 (same predictors)
If the algorithm requires a threshold, choose one that maximises the F1-score using the same logic as in Q10 or Q11.
Use the same test data you used to validate your logistic regression as in Q12
If the algorithm does not require a threshold, try to tweak the parameters of the algorithm so as to avoid overfitting the model.
Use whatever means you find appropriate (for example metrics, matrices, tables, plots) to compare your new model to the logistic model you built in the rest of this notebook.
Write about what you think makes your alternative model better/worse.
Provide the full R code you used to build and test your alternative model