🗣️ Reading week assignment
Preparation: Loading packages and data
Materials to download
Use the links below to download the lab materials and the dataset we will use for this lab:
Loading packages and data
Start by downloading the dataset into your `data` folder.
Install missing packages
Install missing libraries if any:
# make sure you run this code only once and that this chunk is non-executable when you render your qmd
install.packages("viridis")
install.packages("xgboost")
install.packages("kernlab")
install.packages("LiblineaR")
Import libraries and create functions:
library(ggsci)
library(tidymodels)
library(tidyverse)
library(viridis)
library(xgboost)
library(kernlab)
library(LiblineaR)
library(doParallel)
theme_dot <- function() {
  theme_minimal() +
    theme(panel.grid = element_blank(),
          legend.position = "bottom")
}

theme_line <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "bottom")
}
🥅 Homework objectives
We have introduced a range of supervised learning algorithms in the previous labs and lectures. Selecting the best model can be challenging. In this assignment, we will provide you with guidance to carry out this process.
We will compare and contrast the decision boundaries produced by four different models trained on the PTSD data.
Next, we will apply k-fold cross-validation to determine which model performs best.
Finally, we will fine-tune the hyperparameters of a support vector machine model and of an XGBoost model and compare both models.
Loading the PTSD dataset
We will use a sample from the PTSD dataset that you worked with in Lab 5.
set.seed(123)
# Code here
The sample contains the following variables:
- `ptsd`: 1 if the user has PTSD, 0 if the user does not (the outcome)
- `anxiety`: 0-10 self-reported anxiety scale.
- `acute_stress`: 0-10 self-reported acute stress scale.
Train/test split
Start by performing a train/test split (keep 75% of the data for the training set):
set.seed(123)
# Code here
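For reference, one possible sketch using rsample (the name ptsd_data for the loaded tibble is a placeholder, not given above):

ptsd_split <- initial_split(ptsd_data, prop = 0.75, strata = ptsd)  # ptsd_data is a placeholder name
ptsd_train <- training(ptsd_split)
ptsd_test  <- testing(ptsd_split)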
5-fold cross-validation
Now, create 5 cross-validation folds:
# Code here
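A minimal sketch, assuming the training set from the previous step is called ptsd_train:

ptsd_folds <- vfold_cv(ptsd_train, v = 5, strata = ptsd)  # 5 folds, stratified on the outcome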
Selecting the evaluation metrics
Finally, create a metric set that includes the precision, recall and f1-score.
# Code here
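One way this could look with yardstick (f_meas with its default beta = 1 is the f1-score):

ptsd_metrics <- metric_set(precision, recall, f_meas)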
Generating decision boundaries for different models
NOTE: Familiarize yourself with the concept of decision boundaries if needed.
In this assignment, you will have to work with 4 models:
- Logistic regression
- Decision tree
- Linear support vector machine
- Polynomial support vector machine
NOTE: Support vector machines are a highly useful family of algorithms that are widely leveraged by machine learning scientists. For the theory behind SVMs, check out Chapter 9 of Introduction to Statistical Learning.
NOTE: Aside from logistic regression, you will need to specify `mode = "classification"` when instantiating each model.
Initializing different models
You already know the code for instantiating logistic regression and decision trees, but you will need to consult the parsnip documentation to find the relevant specifications for the two SVM models.
# Code here
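A possible set of specifications (degree = 2 for the polynomial SVM is an assumption; the engines shown are the parsnip defaults for the two SVMs):

log_model        <- logistic_reg()
tree_model       <- decision_tree(mode = "classification")
svm_linear_model <- svm_linear(mode = "classification") %>% set_engine("LiblineaR")
svm_poly_model   <- svm_poly(mode = "classification", degree = 2) %>% set_engine("kernlab")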
Proportion of individuals affected by PTSD across different combinations of `acute_stress` and `anxiety` levels
Generate a tibble that contains a grid with the proportion of individuals affected by PTSD across different combinations of `acute_stress` and `anxiety` levels.
# Code here
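A sketch of one approach, assuming ptsd is a factor with levels "0"/"1" and both scales take integer values from 0 to 10:

ptsd_grid <-
  crossing(acute_stress = 0:10, anxiety = 0:10) %>%          # every combination of the two scales
  left_join(ptsd_train %>%
              group_by(acute_stress, anxiety) %>%
              summarise(prop_ptsd = mean(ptsd == "1"), .groups = "drop"),
            by = c("acute_stress", "anxiety"))                # observed proportion with PTSD (NA if unobserved)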
Use `list` to create a named list of the different instantiated models.
# Code here
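For example (the list names are placeholders that reappear in later sketches):

model_specs <- list(logistic_reg  = log_model,
                    decision_tree = tree_model,
                    svm_linear    = svm_linear_model,
                    svm_poly      = svm_poly_model)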
Fitting the 4 models on the training data
Use `map` to apply `fit` over all the models.
# Code here
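One possible sketch, fitting each specification with the same formula:

model_fits <- map(model_specs,
                  ~ fit(.x, ptsd ~ anxiety + acute_stress, data = ptsd_train))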
Generating predictions on the test set with the different models
Use `map_dfr` to apply a function to the list of model fits (`model_fits`) and create a single tibble, `test_preds`, with predictions from each model on the `ptsd_test` dataset, with each model's results nested in its own row.
# Code here
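One way to do this is with augment(), which binds the predicted class (and class probabilities) to the data:

test_preds <-
  map_dfr(model_fits,
          ~ tibble(test_pred = list(augment(.x, new_data = ptsd_test))),  # nest one tibble per model
          .id = "model")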
Repeat the same operation using `ptsd_grid` instead of `ptsd_test`.
# Code here
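The same pattern, swapping in the grid:

grid_preds <-
  map_dfr(model_fits,
          ~ tibble(grid_pred = list(augment(.x, new_data = ptsd_grid))),
          .id = "model")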
Now, merge the two nested tibbles together:
# Code here
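For instance, joining on the model identifier:

all_preds <- left_join(test_preds, grid_preds, by = "model")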
Evaluating the models
Use `map` to compute the f1-score for each model. Remember to set `event_level = "second"`.
# Code here
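A possible sketch, computing the f1-score from each model's nested test-set predictions:

all_preds <-
  all_preds %>%
  mutate(f1 = map(test_pred,
                  ~ f_meas(.x, truth = ptsd, estimate = .pred_class,
                           event_level = "second")))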
Use `unnest` to unnest the grid predictions and the f1-score.
# Code here
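For example, unnesting the one-row f1 tibble first and then expanding the grid predictions:

boundary_data <-
  all_preds %>%
  select(-test_pred) %>%
  unnest(f1) %>%        # adds .metric, .estimator, .estimate
  unnest(grid_pred)     # one row per grid cell per model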
Visualizing decision boundaries
Generate decision boundaries for the four machine learning models trained to predict PTSD based on responses to questions about acute stress and anxiety. You are expected to produce the same plot as the one we generated in Lab 5 to visualize decision boundaries for different values of k in the k-NN algorithm.
# Code here
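A sketch of one way to build the plot (the exact aesthetics used in Lab 5 may differ):

boundary_data %>%
  mutate(model_label = paste0(model, " (F1 = ", round(.estimate, 3), ")")) %>%
  ggplot(aes(x = acute_stress, y = anxiety)) +
  geom_tile(aes(fill = .pred_class), alpha = 0.3) +      # predicted class = decision regions
  geom_point(aes(colour = prop_ptsd), size = 2) +        # observed proportion with PTSD
  facet_wrap(~ model_label) +
  scale_colour_viridis_c(name = "Proportion with PTSD") +
  labs(fill = "Predicted class") +
  theme_dot()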
❓ Question: Which model obtains the highest f1-score?
Cross-validation and comparison between models
Use `workflow_set` to build multiple models into an integrated workflow.
# Code here
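One possible sketch, pairing a single formula with the named list of models:

ptsd_workflows <-
  workflow_set(preproc = list(formula = ptsd ~ anxiety + acute_stress),
               models  = model_specs)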
Now, run each model on each fold of the data generated previously:
# Code here
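For example, resampling every workflow with the folds and metric set created earlier (ptsd_folds and ptsd_metrics are the placeholder names used in the sketches above):

ptsd_fits <-
  ptsd_workflows %>%
  workflow_map("fit_resamples",
               resamples = ptsd_folds,
               metrics   = ptsd_metrics,
               seed      = 123)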
Run the code below and inspect the results.
# Code here
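The chunk referred to above is not reproduced here; given a workflow-set result like ptsd_fits, a plausible line is simply:

autoplot(ptsd_fits)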
The above plot provides very useful information. However, we could make a few improvements:
- The base `ggplot` colour scheme looks a bit dull
- The f1-score is labelled `f_meas`
- The y-axes of the panels differ from one another
We can get the rankings of the models by running `rank_results(ptsd_fits)`. From there, come up with an improved version of the plot using your `ggplot` skills.
# Code here
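One possible improved version, assuming the metric names and wflow_id prefix produced by the sketches above:

rank_results(ptsd_fits) %>%
  mutate(wflow_id = str_remove(wflow_id, "formula_"),
         .metric  = recode(.metric,
                           f_meas    = "F1-score",
                           precision = "Precision",
                           recall    = "Recall")) %>%
  ggplot(aes(x = reorder(wflow_id, rank), y = mean, colour = wflow_id)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0.2) +
  facet_wrap(~ .metric) +                                  # shared y-axis across panels
  scale_colour_viridis_d(guide = "none") +
  labs(x = NULL, y = "Mean score (5-fold CV)") +
  theme_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))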
❓ Question: How well does the SVM with a second-order polynomial perform relative to the other models?
Experiment more with polynomial SVMs using cross-validation
Using the following hyperparameter grid, try tuning a polynomial SVM.
svm_grid <-
  crossing(cost = c(0.000977, 0.0131, 0.177, 2.38, 32),
           degree = 1:3)
From there, we have given you a series of steps. We strongly recommend that you revisit how we performed hyperparameter tuning using k-fold cross-validation in Lab 5 to get a sense of the code needed to complete this task.
# Instantiate a tuneable polynomial SVM
# Code here
# Create a workflow comprised of the model and a formula
# Code here
# Register multiple cores for faster processing
# Code here
# Tune the model, just use one evaluation metric
# Code here
# Collect the metrics from the model and plot the results
# Code here
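A sketch of the full sequence (roc_auc is chosen here so the result can be compared with the XGBoost model later; all object names are placeholders):

# Instantiate a tuneable polynomial SVM
svm_tune_model <-
  svm_poly(mode = "classification", cost = tune(), degree = tune()) %>%
  set_engine("kernlab")

# Create a workflow comprised of the model and a formula
svm_workflow <-
  workflow() %>%
  add_model(svm_tune_model) %>%
  add_formula(ptsd ~ anxiety + acute_stress)

# Register multiple cores for faster processing
registerDoParallel(cores = parallel::detectCores() - 1)

# Tune the model over svm_grid, using a single evaluation metric
svm_tuned <-
  tune_grid(svm_workflow,
            resamples = ptsd_folds,
            grid      = svm_grid,
            metrics   = metric_set(roc_auc))

# Collect the metrics and plot the results
collect_metrics(svm_tuned) %>%
  ggplot(aes(x = cost, y = mean, colour = factor(degree))) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(y = "Mean ROC AUC (5-fold CV)", colour = "degree") +
  theme_line()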
❓ Question: What combination of hyperparameters produces the best-fitting model?
Experiment with XGBoost using cross-validation
Now, go back to the week 5 lab. You'll find it mentions a tree-based model called XGBoost. This time, we'll use it for classification instead of regression and compare it to our best-fitting SVM model.
You need to start by writing a specification of the XGBoost model:
- Use the `boost_tree` function
- Set `learn_rate = tune()`
- Set `mtry = .cols()` and `trees = 150`
- Set the mode to "classification"
- Set the engine to "xgboost"
🚨 ATTENTION: You will need to have the `xgboost` package installed for this algorithm to work.
# write the code for the XGBoost model specification here
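Based on the bullet points above, one way the specification could look:

xgb_model <-
  boost_tree(learn_rate = tune(), mtry = .cols(), trees = 150) %>%
  set_mode("classification") %>%
  set_engine("xgboost")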
Next, create a recipe and workflow as you usually would (no specific pre-processing in the recipe, and all variables used to predict the `ptsd` outcome variable).
#create a recipe
#create a workflow
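A minimal sketch (the recipe uses all remaining variables as predictors and does no other pre-processing):

xgb_recipe <- recipe(ptsd ~ ., data = ptsd_train)

xgb_workflow <-
  workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_model)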
Using the following hyperparameter grid, try tuning the XGBoost model:
<- tibble(learn_rate = c(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5,1)) xgb_grid
- Make sure to use k-fold cross-validation to do your hyperparameter (`learn_rate`) tuning
- Select `roc_auc` as the metric to optimize in your tuning (this is to ensure we can compare the performance of the XGBoost model you select with that of the best performing SVM trained earlier)
- Produce a graph that visualises the best result
- Select and estimate the model with the best AUC
# write your code to tune the learn_rate parameter of the XGBoost model
# write the code to produce a graph that visualises the best result of your XGBoost parameter tuning
# write code to select the learn_rate that produces the best result
# fit the model with the learn_rate that produces the best AUC
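A possible sketch of the four steps (object names are placeholders):

# tune learn_rate over xgb_grid using the 5 folds, optimising ROC AUC
xgb_tuned <-
  tune_grid(xgb_workflow,
            resamples = ptsd_folds,
            grid      = xgb_grid,
            metrics   = metric_set(roc_auc))

# visualise how ROC AUC changes with the learning rate
collect_metrics(xgb_tuned) %>%
  ggplot(aes(x = learn_rate, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(y = "Mean ROC AUC (5-fold CV)") +
  theme_line()

# select the learn_rate with the best AUC and refit on the full training set
best_learn_rate <- select_best(xgb_tuned, metric = "roc_auc")

xgb_final_fit <-
  finalize_workflow(xgb_workflow, best_learn_rate) %>%
  fit(data = ptsd_train)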
❓ Question: How does the XGBoost model compare with the best performing SVM fitted earlier?
#write code to get the value of AUC for the best SVM and best XGBoost models
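One way to compare the two, using the cross-validated AUC of the best configuration from each tuning result (svm_tuned and xgb_tuned are the placeholder names used in the sketches above):

bind_rows(
  show_best(svm_tuned, metric = "roc_auc", n = 1) %>% mutate(model = "Polynomial SVM"),
  show_best(xgb_tuned, metric = "roc_auc", n = 1) %>% mutate(model = "XGBoost")
) %>%
  select(model, .metric, mean, std_err)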