✅ Reading week assignment - Solutions
Materials to download
This solution file follows the format of the .qmd Quarto file you had to fill in for your homework. If you want to render the document yourself and play with the code, you can download the .qmd version of this solution file by clicking on the button below:
To be able to use the solutions file, don’t forget the following steps:
Install missing packages
Install any missing packages.
# Make sure you run this code only once and that this chunk is set to non-executable when you render your .qmd
install.packages("viridis")
install.packages("xgboost")
install.packages("kernlab")
install.packages("LiblineaR")
Import libraries and create functions:
library(ggsci)
library(tidymodels)
library(tidyverse)
library(viridis)
library(xgboost)
library(kernlab)
library(LiblineaR)
library(doParallel)
library(scales)   # needed for the percent() labels used in the decision-boundary plot
theme_dot <- function() {
  theme_minimal() +
    theme(panel.grid = element_blank(),
          legend.position = "bottom")
}

theme_line <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "bottom")
}
🥅 Homework objectives
We have introduced a range of supervised learning algorithms in the previous labs and lectures. Selecting the best model can be challenging. In this assignment, we will provide you with guidance to carry out this process.
We will compare and contrast the decision boundaries produced by four different models trained on the PTSD data.
Next, we will apply k-fold cross-validation to determine which model performs best.
Finally, we will fine-tune the hyperparameters of a support vector machine model and of an XGBoost model and compare the two models.
Loading the PTSD dataset
We are going to use a sample of the PTSD dataset that you worked with in Lab 5.
set.seed(123)

ptsd <- read_csv("data/ptsd-for-multiple-models.csv") %>%
  mutate(ptsd = as.factor(ptsd)) %>%
  slice_sample(prop = 0.2)
ptsd
- ptsd: 1 if the user has PTSD, 0 if the user does not (the outcome)
- anxiety: 0-10 self-reported anxiety scale.
- acute_stress: 0-10 self-reported acute stress scale.
Rows: 33234 Columns: 3
── Column specification ────────────────────────────────────────────
Delimiter: ","
dbl (3): ptsd, anxiety, acute_stress
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
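The read_csv() message above is purely informational. If you prefer a quieter import, you can follow its suggestion and either silence the message or declare the column types up front; a minimal sketch (ptsd_raw is just an illustrative name, the solution keeps using the ptsd object created above):
# Either silence the column specification message...
ptsd_raw <- read_csv("data/ptsd-for-multiple-models.csv", show_col_types = FALSE)

# ...or declare the column types explicitly
ptsd_raw <- read_csv("data/ptsd-for-multiple-models.csv",
                     col_types = cols(ptsd = col_double(),
                                      anxiety = col_double(),
                                      acute_stress = col_double()))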
Train/test split
Start by performing a train/test split (keep 75% of the data for the training set):
set.seed(123)

ptsd_split <- initial_split(ptsd, prop = 0.75, strata = ptsd)
ptsd_train <- training(ptsd_split)
ptsd_test  <- testing(ptsd_split)
5-fold cross-validation
Now, create 5 cross-validation folds:
ptsd_cv <- vfold_cv(ptsd_train, v = 5)
Selecting the evaluation metrics
Finally, create a metric set that includes the precision, recall and f1-score.
class_metrics <- metric_set(precision, recall, f_meas)
Generating decision boundaries for different models
👉 NOTE: Familiarize yourself with the concept of decision boundaries if needed.
In this assignment, you will have to work with 4 models:
- Logistic regression
- Decision tree
- Linear support vector machine
- Polynomial support vector machine
👉 NOTE: Support vector machines are another highly useful family of algorithms that are widely leveraged by machine learning scientists. For the theory behind SVMs, check out Chapter 9 of Introduction to Statistical Learning.
👉 NOTE: Aside from logistic regression, you will need to specify mode = "classification"
Initializing different models
You already know the code for instantiating logistic regression and decision trees, but you will need to consult the parsnip documentation to find the relevant models for the two SVM specifications.
logit_spec   <- logistic_reg()
dt_spec      <- decision_tree(mode = "classification")
svm_lin_spec <- svm_linear(mode = "classification")
svm_ply_spec <- svm_poly(mode = "classification", degree = 2)
Proportion of individuals affected by PTSD across different combinations of acute_stress and anxiety levels.

Generate a tibble that contains a grid with the proportion of individuals affected by PTSD across different combinations of acute_stress and anxiety levels.
ptsd_grid <-
  ptsd %>%
  select(ptsd, anxiety, acute_stress) %>%
  group_by(anxiety, acute_stress) %>%
  summarise(ptsd = mean(ptsd == 1, na.rm = TRUE)) %>%
  ungroup()
Use list to create a named list of different instantiated models.
model_specs <-
  list(
    "Logistic\nregression" = logit_spec,
    "Decision\ntree" = dt_spec,
    "SVM\nlinear" = svm_lin_spec,
    "SVM\npolynomial" = svm_ply_spec
  )
Fitting the 4 models on the training data
Use map to apply fit over all the models.
model_fits <- map(model_specs, ~ fit(.x, ptsd ~ ., data = ptsd_train))
Generating predictions on the test set with the different models
Use map_dfr to apply a function to a list of model fits (model_fits) and create a single tibble test_preds with predictions from each model on the ptsd_test dataset, with each model's results nested in its own row.
test_preds <-
  model_fits %>%
  map_dfr(~ .x %>%
            augment(new_data = ptsd_test) %>%
            nest(test_preds = everything()),
          .id = "model")
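To see what this nested structure looks like, you can print the tibble and peek inside one of its list-columns (an optional check, using purrr's pluck()):
# One row per model; each test_preds cell contains a tibble of test-set predictions
test_preds
test_preds %>% pluck("test_preds", 1) %>% glimpse()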
Repeat the same operation using ptsd_grid instead of ptsd_test.
grid_preds <-
  model_fits %>%
  map_dfr(~ .x %>%
            augment(new_data = ptsd_grid) %>%
            nest(grid_preds = everything()),
          .id = "model")
Now, merge the two nested tibbles together:
test_preds <-
  test_preds %>%
  left_join(grid_preds, by = "model")
Evaluating the models
Compute the f1-score for each model: use map to apply f_meas to each set of test predictions. Remember to set event_level = "second".
test_preds <-
  test_preds %>%
  mutate(f1 = map(test_preds, ~ f_meas(.x, ptsd, .pred_class, event_level = "second")))
Use unnest to unnest the grid predictions and the f1-score.
test_preds <-
  test_preds %>%
  unnest(c(grid_preds, f1))
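At this point each model's f1-score sits alongside its grid predictions. Before plotting, you can pull out the scores on their own (a small optional check):
# One f1-score per model, repeated across the grid rows, hence distinct()
test_preds %>% distinct(model, .metric, .estimate)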
Visualizing decision boundaries
Generate decision boundaries for the four machine learning models trained to predict PTSD based on responses to questions about acute stress and anxiety. You are expected to generate the same plot as the one we generated in Lab 5 to visualize decision boundaries for different values of k in the k-NN algorithm.
test_preds %>%
  ggplot() +
  facet_wrap(. ~ paste(model, "\n\nf1-score =", round(.estimate, 3))) +
  geom_tile(aes(anxiety, acute_stress, fill = .pred_class == 1), alpha = 0.2) +
  geom_point(aes(anxiety, acute_stress, colour = ptsd), size = 3) +
  scale_x_continuous(breaks = 0:10) +
  scale_y_continuous(breaks = 0:10) +
  scale_colour_viridis(labels = percent, breaks = c(0, 0.4, 0.8), option = "cividis") +
  scale_fill_manual(values = c("lightgrey", "navy")) +
  theme_dot() +
  labs(x = "Anxiety question", y = "Acute stress question",
       colour = "Proportion with PTSD",
       fill = "Predicted PTSD?")
❓ Question:
Which model obtains the highest f1-score?
The f1-scores vary only slightly across the models, but the polynomial SVM achieves the highest f1-score (0.74) and logistic regression the lowest (0.72).
Cross-validation and comparison between models
Use workflow_set to build multiple models into an integrated workflow.
ptsd_wf <-
  workflow_set(
    preproc = list(formula = ptsd ~ .),
    models = list(
      "Logistic\nregression" = logit_spec,
      "Decision\ntree" = dt_spec,
      "SVM\nlinear" = svm_lin_spec,
      "SVM\npolynomial" = svm_ply_spec
    )
  )
Now, run each model on each fold of the data generated previously:
doParallel::registerDoParallel()

ptsd_fits <-
  ptsd_wf %>%
  workflow_map(
    resamples = ptsd_cv,
    metrics = metric_set(f_meas, recall, precision),
    control = control_resamples(event_level = "second")
  )
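If you prefer raw numbers to a plot, collect_metrics() also works directly on a fitted workflow set and returns the cross-validated scores as a tibble (rank_results(), used further below, is another option):
# Cross-validated f1-score, recall and precision for every model, as a tibble
collect_metrics(ptsd_fits)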
Run the code below and inspect the results.
autoplot(ptsd_fits)
The above plot provides very useful information. However, we could make a few improvements:
- The base ggplot colour scheme looks a bit dull
- The f1-score is expressed as f_meas
- The y-axes of each panel are different from one another
We can get the rankings of the models by running rank_results(ptsd_fits). From there, come up with an improved version of the plot using your ggplot skills.
ptsd_fits %>%
  rank_results() %>%
  mutate(.metric = case_when(.metric == "f_meas" ~ "f1-score", .default = .metric)) %>%
  ggplot(aes(rank, mean, colour = fct_inorder(str_remove(wflow_id, "formula\\_")))) +
  facet_wrap(. ~ .metric) +
  geom_point() +
  geom_linerange(aes(ymin = mean - (1.96 * std_err),
                     ymax = mean + (1.96 * std_err))) +
  scale_colour_uchicago() +
  theme_line() +
  labs(x = "Model Rank", y = "Cross-Validated Score",
       colour = NULL)
❓ Question:
How well does the SVM with a second-order polynomial perform, relative to the other models?
The SVM with a second-order polynomial has the best performance on both the f1-score and recall. However, it is slightly behind the other models in terms of precision.
Experiment more with polynomial SVMs using cross-validation
Using the following hyperparameter grid, try tuning a polynomial SVM.
svm_grid <-
  crossing(cost = c(0.000977, 0.0131, 0.177, 2.38, 32),
           degree = 1:3)
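crossing() expands the two vectors into every cost × degree combination, so the grid holds 5 × 3 = 15 candidate models that will each be fitted on every fold. You can confirm the size before launching the (fairly slow) tuning run:
# 15 rows: every combination of the 5 cost values and the 3 polynomial degrees
svm_grid
nrow(svm_grid)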
From there, we have given you a series of steps. We strongly recommend that you revisit how we performed hyperparameter tuning using k-fold cross-validation in Lab 5 to get a sense of the code needed to complete this task.
# Instantiate a tuneable polynomial SVM
svm_tuned_spec <-
  svm_poly(cost = tune(), degree = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
# Create a workflow comprised of the model and a formula
svm_wf <-
  workflow() %>%
  add_model(svm_tuned_spec) %>%
  add_formula(ptsd ~ .)
# Register multiple cores for faster processing
doParallel::registerDoParallel()

# Tune the model, using a single evaluation metric
svm_cv_fit <-
  tune_grid(
    svm_wf,
    resamples = ptsd_cv,
    metrics = metric_set(roc_auc),
    grid = svm_grid,
    control = control_resamples(event_level = "second")
  )
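Besides the plot below, tune's show_best() lists the top hyperparameter combinations directly, which is a handy complement to the visual check:
# Top cost/degree combinations ranked by mean cross-validated ROC AUC
show_best(svm_cv_fit, metric = "roc_auc")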
# Collect the metrics from the model and plot the results
svm_cv_fit %>%
  collect_metrics() %>%
  mutate(max = mean == max(mean)) %>%
  ggplot(aes(as.factor(round(cost, 5)), mean, shape = as.factor(degree), colour = max)) +
  geom_point(position = position_dodge(width = 1), size = 3) +
  geom_linerange(aes(ymin = mean - (1.96 * std_err), ymax = mean + (1.96 * std_err)),
                 position = position_dodge(width = 1)) +
  scale_colour_manual(values = c("grey", "black")) +
  theme_line() +
  guides(colour = "none") +
  labs(x = "Cost complexity", y = "Cross-validated score",
       shape = "nth degree polynomial")
❓ Question:
What combination of hyperparameters produces the best fitting model?
The best combination is a second-degree polynomial SVM with a cost of 0.0131.
Experiment with XGBoost using cross-validation
Now, go back to the week 5 lab. You’ll find it mentions a tree-based model called XGBoost. This time, we’ll use it for classification instead of regression and compare it to our best fitting SVM model.
You need to start by writing a specification of the XGBoost model:
- Use the boost_tree function
- Set learn_rate = tune()
- Set mtry = .cols() and trees = 150
- Set the mode to "classification"
- Set the engine to "xgboost"
🚨 ATTENTION: You will need to have the xgboost package installed for this algorithm to work.
xgb_model <-
  boost_tree(mtry = .cols(), trees = 150,
             learn_rate = tune()) %>%
  set_mode("classification") %>%
  set_engine("xgboost")
Next, create a recipe and workflow as you usually would (no specific pre-processing in the recipe and all variables used to predict the ptsd outcome variable).
xgb_rec <-
  recipe(ptsd ~ .,
         data = ptsd_train)

xgb_wf <-
  workflow() %>%
  add_recipe(xgb_rec) %>%
  add_model(xgb_model)
Using the following hyperparameter grid, try tuning the XGBoost model:
xgb_grid <- tibble(learn_rate = c(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1))
- Make sure to use k-fold cross-validation to do your hyperparameter (learn_rate) tuning
- Select roc_auc as the metric to optimize in your tuning (this is to ensure we can compare the performance of the XGBoost model you select with that of the best-performing SVM trained earlier)
- Produce a graph that visualises the best result
- Select and estimate the model with the best AUC
doParallel::registerDoParallel()

xgb_cv_fit <-
  tune_grid(
    xgb_wf,
    resamples = ptsd_cv,
    metrics = metric_set(roc_auc),
    grid = xgb_grid
  )
xgb_cv_fit %>%
  collect_metrics() %>%
  ggplot(aes(as.factor(learn_rate), mean, group = 1)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_line() +
  labs(x = "Learn rate", y = "Mean Cross-Validated AUC")
best_auc <- select_best(xgb_cv_fit, metric = "roc_auc")
best_auc
# A tibble: 1 × 2
learn_rate .config
<dbl> <chr>
1 0.01 Preprocessor1_Model05
The learning rate that produces the best AUC is 0.01.
xgb_final_fit <-
  xgb_wf %>%
  finalize_workflow(best_auc) %>%
  fit(ptsd_train)
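As an optional extra (not asked for in the homework), you could also evaluate the finalized XGBoost workflow on the held-out test set, reusing the class_metrics set defined at the start; a sketch, where xgb_test_preds is just an illustrative name and .pred_1 is the predicted probability of the "1" class:
# Optional: hold-out evaluation of the finalized XGBoost workflow
xgb_test_preds <- augment(xgb_final_fit, new_data = ptsd_test)

# Precision, recall and f1-score on the test set
class_metrics(xgb_test_preds, truth = ptsd, estimate = .pred_class, event_level = "second")

# Test-set ROC AUC
roc_auc(xgb_test_preds, ptsd, .pred_1, event_level = "second")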
❓ Question:
How does the XGBoost model compare with the best-performing SVM fitted earlier?
svm_auc <- svm_cv_fit %>%
  collect_metrics() %>%
  filter(cost == 0.0131 & degree == 2) %>%
  select(mean) %>%
  rename("Best SVM Mean Cross-Validated AUC" = mean)

# Use the learning rate chosen by select_best() above rather than hard-coding a value
xgb_auc <- xgb_cv_fit %>%
  collect_metrics() %>%
  filter(learn_rate == best_auc$learn_rate) %>%
  select(mean) %>%
  rename("Best XGBoost Mean Cross-Validated AUC" = mean)

bind_cols(svm_auc, xgb_auc)
# A tibble: 1 × 2
`Best SVM Mean Cross-Validated AUC` `Best XGBoost Mean Cross-Validated AUC`
<dbl> <dbl>
1 0.888 0.882
The SVM model performs slightly better than the XGBoost model.
Throughout this homework, we chose to balance precision and recall and to optimise model performance in terms of either the F1-score or AUC. But since we are in a medical setting, where the impact of a false negative tends to be more detrimental, we could very well have chosen to put more emphasis on recall and picked an F-score that weights recall more heavily than precision.
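For example, yardstick's f_meas() takes a beta argument: beta = 2 gives the F2-score, which treats recall as twice as important as precision. A minimal sketch of what that would look like for one of the models fitted earlier (reusing model_fits from the decision-boundary section):
# F2-score (beta = 2 favours recall over precision) for the polynomial SVM on the test set
model_fits[["SVM\npolynomial"]] %>%
  augment(new_data = ptsd_test) %>%
  f_meas(ptsd, .pred_class, beta = 2, event_level = "second")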