✅ Reading week assignment - Solutions

Author

The DS202 team

Materials to download

This solution file follows the format of the Quarto file .qmd you had to fill in for your homework. If you want to render the document yourselves and play with the code, you can download the .qmd version of this solution file by clicking on the button below:

To be able to use the solutions file, don’t forget the following steps:

Install missing packages

Install any missing packages, if needed:

# Make sure you run this code only once and that this chunk is non-executable when you render your .qmd
install.packages("viridis")
install.packages("xgboost")
install.packages("kernlab")
install.packages("LiblineaR")

Import libraries and create functions:

library(ggsci)
library(tidymodels)
library(tidyverse)
library(viridis)
library(xgboost)
library(kernlab)
library(LiblineaR)
library(doParallel)

theme_dot <- function() {
  
  theme_minimal() +
    theme(panel.grid = element_blank(),
          legend.position = "bottom")
  
}

theme_line <- function() {
  
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "bottom")
  
}

🥅 Homework objectives

We have introduced a range of supervised learning algorithms in the previous labs and lectures. Selecting the best model can be challenging. In this assignment, we will provide you with guidance to carry out this process.

  • We will compare and contrast the decision boundaries produced by four different models trained on the PTSD data.

  • Next, we will apply k-fold cross-validation to determine which model performs best.

  • Finally, we will fine-tune the hyperparameters of a support vector machine model and of an XGBoost model, and compare the two.

Loading the PTSD dataset

We are going to use a sample of the PTSD data set that you worked with in Lab 5.

set.seed(123)

ptsd <- 
  read_csv("data/ptsd-for-multiple-models.csv") %>% 
  mutate(ptsd = as.factor(ptsd)) %>% 
  slice_sample(prop = 0.2)
Rows: 33234 Columns: 3
── Column specification ────────────────────────────────────────────
Delimiter: ","
dbl (3): ptsd, anxiety, acute_stress

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data set contains three variables:

  • ptsd: 1 if the user has PTSD, 0 if they do not (the outcome)
  • anxiety: 0-10 self-reported anxiety scale.
  • acute_stress: 0-10 self-reported acute stress scale.

Train/test split

Start by performing a train/test split (keep 75% of the data for the training set):

set.seed(123)

ptsd_split <- initial_split(ptsd, prop = 0.75, strata = ptsd)
ptsd_train <- training(ptsd_split)
ptsd_test <- testing(ptsd_split)

5-fold cross-validation

Now, create 5 cross-validation folds:

ptsd_cv <- vfold_cv(ptsd_train, v = 5)

Selecting the evaluation metrics

Finally, create a metric set that includes the precision, recall and f1-score.

class_metrics <- metric_set(precision, recall, f_meas)

Generating decision boundaries for different models

👉 NOTE: Familiarize yourself with the concept of decision boundaries if needed.

In this assignment, you will have to work with 4 models:

  • Logistic regression
  • Decision tree
  • Linear support vector machine
  • Polynomial support vector machine

👉 NOTE: Support vector machines are another highly useful family of algorithms that are widely leveraged by machine learning scientists. For the theory behind SVMs, check out Chapter 9 of Introduction to Statistical Learning.

👉 NOTE: Aside from logistic regression, you will need to specify mode = "classification"

Initializing different models

You already know the code for instantiating logistic regression and decision trees, but you will need to consult the parsnip documentation (click here) to find the relevant specifications for the two SVM models.

logit_spec <- logistic_reg()
dt_spec <- decision_tree(mode = "classification")
svm_lin_spec <- svm_linear(mode = "classification")
svm_ply_spec <- svm_poly(mode = "classification", degree = 2)

Proportion of individuals affected by PTSD across different combinations of acute_stress and anxiety levels.

Generate a tibble that contains a grid with the proportion of individuals affected by PTSD across different combinations of acute_stress and anxiety levels.

ptsd_grid <-
  ptsd %>% 
  select(ptsd, anxiety, acute_stress) %>% 
  group_by(anxiety, acute_stress) %>% 
  summarise(ptsd = mean(ptsd == 1, na.rm = TRUE)) %>% 
  ungroup()

Use list to create a named list of different instantiated models.

model_specs <-
  list(
       "Logistic\nregression" = logit_spec,
       "Decision\ntree" = dt_spec,
       "SVM\nlinear" = svm_lin_spec,
       "SVM\npolynomial" = svm_ply_spec
      )

Fitting the 4 models on the training data

Use map to apply fit over all the models.

model_fits <- map(model_specs, ~ fit(.x, ptsd ~ ., data = ptsd_train))

Generating predictions on the test set with the different models

Use map_dfr to apply a function to a list of model fits (model_fits) and create a single tibble test_preds with predictions from each model on the ptsd_test dataset, with each model’s results nested in its own row.

test_preds <-
  model_fits %>% 
  map_dfr(~ .x %>% 
            augment(new_data = ptsd_test) %>% 
            nest(test_preds = everything()), 
          .id = "model")

Repeat the same operation using ptsd_grid instead of ptsd_test.

grid_preds <-
  model_fits %>% 
  map_dfr(~ .x %>% 
            augment(new_data = ptsd_grid) %>% 
            nest(grid_preds = everything()),
          .id = "model")

Now, merge the two nested tibbles together:

test_preds <-
  test_preds %>% 
  left_join(grid_preds, by = "model")

Evaluating the model

Compute the f1-score for each model. Use map to compute the f1-score for the different models. Remember to set event_level = "second".

test_preds <-
  test_preds %>% 
  mutate(f1 = map(test_preds, ~ f_meas(.x, ptsd, .pred_class, event_level = "second")))

Use unnest to unnest the grid predictions and the f1-score.

test_preds <-
  test_preds %>% 
  unnest(c(grid_preds, f1)) 

Visualizing decision boundaries

Generate decision boundaries for the four machine learning models trained to predict PTSD based on responses to questions about acute stress and anxiety. You are expected to generate the same plot as the one we generated in Lab 5 to visualize decision boundaries for different values of k in the k-NN algorithm.

test_preds %>%
  ggplot() +
  facet_wrap(. ~ paste(model, "\n\nf1-score =", round(.estimate, 3))) +
  geom_tile(aes(anxiety, acute_stress, fill = .pred_class == 1), alpha = 0.2) +
  geom_point(aes(anxiety, acute_stress, colour = ptsd), size = 3) +
  scale_x_continuous(breaks = 0:10) +
  scale_y_continuous(breaks = 0:10) +
  scale_colour_viridis(labels = percent, breaks = c(0, 0.4, 0.8), option = "cividis") +
  scale_fill_manual(values = c("lightgrey", "navy")) +
  theme_dot() +
  labs(x = "Anxiety question", y = "Acute stress question",
       colour = "Proportion with PTSD",
       fill = "Predicted PTSD?")

❓ Question:

Which model obtains the highest f1-score?

The f1-scores vary slightly across the models, but the SVM polynomial achieves the highest f1-score (0.74) and logistic regression the lowest (0.72).
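
If you prefer to read the scores off a table rather than the facet titles, a quick sketch using the test_preds tibble built above:

# One f1-score per model (the unnested grid rows duplicate it, hence distinct())
test_preds %>% 
  distinct(model, .metric, .estimate) %>% 
  arrange(desc(.estimate))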

Cross-validation and comparison between models

Use workflow_set to build multiple models into an integrated workflow.

ptsd_wf <-
  workflow_set(
    preproc = list(formula = ptsd ~ .),
    models = list(
      "Logistic\nregression" = logit_spec,
      "Decision\ntree" = dt_spec,
      "SVM\nlinear" = svm_lin_spec,
      "SVM\npolynomial" = svm_ply_spec
    )
  )

Now, run each model on each fold of the data generated previously:

doParallel::registerDoParallel()

ptsd_fits <-
  ptsd_wf %>% 
    workflow_map(
      resamples = ptsd_cv,
      metrics = metric_set(f_meas, recall, precision),
      control = control_resamples(event_level = "second")
    )

Run the code below and inspect the results.

autoplot(ptsd_fits)

The above plot provides very useful information. However, it could be improved in a few ways:

  • The base ggplot colour scheme looks a bit dull

  • The f1-score is expressed as f_meas

  • The y-axes of each plot are different from one another

We can get the rankings of the models by running rank_results(ptsd_fits). From there, come up with an improved version of the plot using your ggplot skills.

ptsd_fits %>% 
  rank_results() %>% 
  mutate(.metric = case_when(.metric == "f_meas" ~ "f1-score", .default = .metric)) %>% 
  ggplot(aes(rank, mean, colour = fct_inorder(str_remove(wflow_id, "formula\\_")))) +
  facet_wrap(. ~ .metric) +
  geom_point() +
  geom_linerange(aes(ymin = mean - (1.96*std_err),
                     ymax = mean + (1.96*std_err))) +
  scale_colour_uchicago() +
  theme_line() +
  labs(x = "Model Rank", y = "Cross-Validated Score",
       colour = NULL)

❓ Question:

How well does the SVM with a second-order polynomial perform, relative to the other models?

The SVM with a second-order polynomial has the best performance for both F1-score and recall. However, it is slightly behind the other models in terms of precision.
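
If you want the exact cross-validated numbers behind this answer rather than the plot, a possible sketch using rank_results():

# Cross-validated f1-scores only, ranked best to worst
ptsd_fits %>% 
  rank_results(rank_metric = "f_meas") %>% 
  filter(.metric == "f_meas") %>% 
  select(wflow_id, mean, std_err, rank)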

Experiment more with polynomial SVMs using cross-validation

Using the following hyperparameter grid, try tuning a polynomial SVM.

svm_grid <-
  crossing(cost = c(0.000977, 0.0131, 0.177, 2.38, 32),
           degree = 1:3)

From there, we have given you a series of steps. We strongly recommend that you revisit how we have performed hyperparameter tuning using k-fold cross-validation in Lab 5 to get a sense of the code needed to complete this task.

# Instantiate a tuneable polynomial SVM

svm_tuned_spec <-
  svm_poly(cost = tune(), degree = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("kernlab")

# Create a workflow comprised of the model and a formula

svm_wf <-
  workflow() %>% 
  add_model(svm_tuned_spec) %>% 
  add_formula(ptsd ~ .)

# Register multiple cores for faster processing

doParallel::registerDoParallel()

# Tune the model, just use one evaluation metric

svm_cv_fit <-
  tune_grid(
    svm_wf,
    resamples = ptsd_cv,
    metrics = metric_set(roc_auc),
    grid = svm_grid,
    control = control_resamples(event_level = "second")
  )

# Collect the metrics from the model and plot the results

svm_cv_fit %>% 
  collect_metrics() %>% 
  mutate(max = mean == max(mean)) %>% 
  ggplot(aes(as.factor(round(cost, 5)), mean, shape = as.factor(degree), colour = max)) +
  geom_point(position = position_dodge(width = 1), size = 3) +
  geom_linerange(aes(ymin = mean - (1.96*std_err), ymax = mean + (1.96*std_err)),
                 position = position_dodge(width = 1)) +
  scale_colour_manual(values = c("grey", "black")) +
  theme_line() +
  guides(colour = "none") +
  labs(x = "Cost complexity", y = "Cross-validated score",
       shape = "nth degree polynomial")

❓ Question:

What combination of hyperparameters produces the best fitting model?

The best combination is a second degree polynomial SVM with a cost of 0.0131.
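
You can also confirm this programmatically instead of reading it off the plot, for example with tune's show_best() and select_best():

# Top hyperparameter combinations by cross-validated AUC
show_best(svm_cv_fit, metric = "roc_auc", n = 3)

# The single best combination (handy if you later want to finalise the SVM workflow)
select_best(svm_cv_fit, metric = "roc_auc")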

Experiment with XGBoost using cross-validation

Now, go back to the week 5 lab. You’ll find it mentions a tree-based model called XGBoost. This time, we’ll use it for classification instead of regression and compare it to our best fitting SVM model.

You need to start by writing a specification of the XGBoost model:

  • Use the boost_tree function
  • Set learn_rate = tune()
  • Set mtry = .cols() and trees = 150
  • Set the mode to “classification”
  • Set the engine to “xgboost”

🚨 ATTENTION: You will need to have the xgboost package installed for this algorithm to work.

xgb_model <-
  boost_tree(mtry = .cols(), trees = 150,
             learn_rate = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")

Next, create a recipe and workflow as you usually would (no specific pre-processing in the recipe and all variables used to predict the ptsd outcome variable)

xgb_rec <-
  recipe(ptsd ~ .,
         data = ptsd_train) 

xgb_wf <-
  workflow() %>% 
  add_recipe(xgb_rec) %>% 
  add_model(xgb_model)

Using the following hyperparameter grid, try tuning the XGBoost model:

xgb_grid <- tibble(learn_rate = c(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1))

  • Make sure to use k-fold cross-validation for your hyperparameter (learn_rate) tuning
  • Select roc_auc as the metric to optimize in your tuning (this ensures we can compare the performance of the selected XGBoost model with that of the best performing SVM trained earlier)
  • Produce a graph that visualises the best result
  • Select and estimate the model with the best AUC

doParallel::registerDoParallel()

xgb_cv_fit <-
  tune_grid(
    xgb_wf,
    resamples = ptsd_cv,
    metrics = metric_set(roc_auc),
    grid = xgb_grid
  )
xgb_cv_fit %>% 
  collect_metrics() %>%
  ggplot(aes(as.factor(learn_rate), mean, group = 1)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_line() +
  labs(x = "Learn rate", y = "Mean Cross-Validated AUC")

best_auc <- select_best(xgb_cv_fit, metric = "roc_auc")
best_auc
# A tibble: 1 × 2
  learn_rate .config              
       <dbl> <chr>                
1       0.01 Preprocessor1_Model05

The learning rate that produces the best AUC is 0.01.

xgb_final_fit <- 
  xgb_wf %>% 
  finalize_workflow(best_auc) %>% 
  fit(ptsd_train)
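
If you want a test-set check of the finalised XGBoost model (not required by the homework), a minimal sketch reusing the class_metrics set defined at the start of this document:

# Hold-out performance of the finalised XGBoost workflow
xgb_final_fit %>% 
  augment(new_data = ptsd_test) %>% 
  class_metrics(truth = ptsd, estimate = .pred_class, event_level = "second")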

❓ Question:

How does the XGBoost model compare with the best performing SVM fitted earlier?

svm_auc <- 
  svm_cv_fit %>% 
  collect_metrics() %>%
  filter(cost == 0.0131 & degree == 2) %>%
  select(mean) %>%
  rename("Best SVM Mean Cross-Validated AUC" = mean)

xgb_auc <- 
  xgb_cv_fit %>% 
  collect_metrics() %>%
  # keep only the learning rate selected by select_best() above
  filter(learn_rate == best_auc$learn_rate) %>%
  select(mean) %>%
  rename("Best XGBoost Mean Cross-Validated AUC" = mean)

bind_cols(svm_auc, xgb_auc)
# A tibble: 1 × 2
  `Best SVM Mean Cross-Validated AUC` `Best XGBoost Mean Cross-Validated AUC`
                                <dbl>                                   <dbl>
1                               0.888                                   0.882

Based on the mean cross-validated AUC, the SVM model (0.888) performs slightly better than the XGBoost model (0.882).

Note

Throughout this homework, we chose to balance precision and recall and to optimize model performance in terms of either the F1-score or AUC. But since we are in a medical setting, where false negatives tend to be more detrimental, we could very well have put more emphasis on recall and used an F-score that weights recall more heavily than precision.
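
For example, yardstick lets you build such a recall-weighted F-score with metric_tweak(); a minimal sketch (the names f2 and recall_heavy_metrics are ours):

# F2-score: an F-measure with beta = 2, which weights recall more than precision
f2 <- metric_tweak("f2", f_meas, beta = 2)

# A metric set that could replace class_metrics if recall matters most
recall_heavy_metrics <- metric_set(f2, recall, precision)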