🗣️ Reading week assignment
Preparation: Loading packages and data
Materials to download
Use the links below to download the lab materials and the dataset we will use for this lab:
Loading packages and data
Start by downloading the dataset into your `data` folder.
Install missing packages
Install missing libraries if any:
# make sure you run this code only once and that this chunk is non-executable when you render your qmd
install.packages("viridis")
install.packages("xgboost")
install.packages("kernlab")
install.packages("LiblineaR")
Import libraries and create functions:
library(ggsci)
library(tidymodels)
library(tidyverse)
library(viridis)
library(xgboost)
library(kernlab)
library(LiblineaR)
library(doParallel)
theme_dot <- function() {
  theme_minimal() +
    theme(panel.grid = element_blank(),
          legend.position = "bottom")
}

theme_line <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "bottom")
}
🥅 Homework objectives
We have introduced a range of supervised learning algorithms in the previous labs and lectures. Selecting the best model can be challenging. In this assignment, we will provide you with guidance to carry out this process.
We will compare and contrast the decision boundaries produced by four different models trained on the PTSD data.
Next, we will apply k-fold cross-validation to determine which model performs best.
Finally, we will fine-tune the hyperparameters of a support vector machine model and of an XGBoost model and compare both models.
Loading the PTSD dataset
We will use a sample from the PTSD dataset that you worked with in Lab 5.
set.seed(123)
# Code here
The sample contains the following variables:
- `ptsd`: 1 if the user has PTSD, 0 if the user does not (the outcome)
- `anxiety`: 0-10 self-reported anxiety scale.
- `acute_stress`: 0-10 self-reported acute stress scale.
Train/test split
Start by performing a train/test split (keep 75% of the data for the training set):
set.seed(123)
# Code here
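For reference, one possible sketch using rsample (the name ptsd_data for the loaded tibble is a placeholder, not given above):

ptsd_split <- initial_split(ptsd_data, prop = 0.75, strata = ptsd)  # ptsd_data is a placeholder name
ptsd_train <- training(ptsd_split)
ptsd_test  <- testing(ptsd_split)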
5-fold cross-validation
Now, create 5 cross-validation folds:
# Code here
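A minimal sketch, assuming the training set from the previous step is called ptsd_train:

ptsd_folds <- vfold_cv(ptsd_train, v = 5, strata = ptsd)  # 5 folds, stratified on the outcome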
Selecting the evaluation metrics
Finally, create a metric set that includes the precision, recall and f1-score.
# Code here
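One way this could look with yardstick (f_meas with its default beta = 1 is the f1-score):

ptsd_metrics <- metric_set(precision, recall, f_meas)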
Generating decision boundaries for different models
NOTE: Familiarize yourself with the concept of decision boundaries if needed.
In this assignment, you will have to work with 4 models:
- Logistic regression
- Decision tree
- Linear support vector machine
- Polynomial support vector machine
NOTE: Support vector machines are a highly useful family of algorithms that are widely leveraged by machine learning scientists. For the theory behind SVMs, check out Chapter 9 of Introduction to Statistical Learning.
NOTE: Aside from logistic regression, you will need to specify `mode = "classification"` when instantiating each model.
Initializing different models
You already know the code for instantiating logistic regression and decision trees, but you will need to consult the parsnip documentation to find the relevant specifications for the two SVM models.
# Code here
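A possible set of specifications (degree = 2 for the polynomial SVM is an assumption; the engines shown are the parsnip defaults for the two SVMs):

log_model        <- logistic_reg()
tree_model       <- decision_tree(mode = "classification")
svm_linear_model <- svm_linear(mode = "classification") %>% set_engine("LiblineaR")
svm_poly_model   <- svm_poly(mode = "classification", degree = 2) %>% set_engine("kernlab")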
Proportion of individuals affected by PTSD across different combinations of `acute_stress` and `anxiety` levels
Generate a tibble that contains a grid with the proportion of individuals affected by PTSD across different combinations of `acute_stress` and `anxiety` levels.
# Code here
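A sketch of one approach, assuming ptsd is a factor with levels "0"/"1" and both scales take integer values from 0 to 10:

ptsd_grid <-
  crossing(acute_stress = 0:10, anxiety = 0:10) %>%          # every combination of the two scales
  left_join(ptsd_train %>%
              group_by(acute_stress, anxiety) %>%
              summarise(prop_ptsd = mean(ptsd == "1"), .groups = "drop"),
            by = c("acute_stress", "anxiety"))                # observed proportion with PTSD (NA if unobserved)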
Use `list` to create a named list of the different instantiated models.
# Code here
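For example (the list names are placeholders that reappear in later sketches):

model_specs <- list(logistic_reg  = log_model,
                    decision_tree = tree_model,
                    svm_linear    = svm_linear_model,
                    svm_poly      = svm_poly_model)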
Fitting the 4 models on the training data
Use `map` to apply `fit` over all the models.
# Code here
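One possible sketch, fitting each specification with the same formula:

model_fits <- map(model_specs,
                  ~ fit(.x, ptsd ~ anxiety + acute_stress, data = ptsd_train))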
Generating predictions on the test set with the different models
Use `map_dfr` to apply a function to the list of model fits (`model_fits`) and create a single tibble, `test_preds`, with predictions from each model on the `ptsd_test` dataset, with each model's results nested in its own row.
# Code here
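One way to do this is with augment(), which binds the predicted class (and class probabilities) to the data:

test_preds <-
  map_dfr(model_fits,
          ~ tibble(test_pred = list(augment(.x, new_data = ptsd_test))),  # nest one tibble per model
          .id = "model")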
Repeat the same operation using `ptsd_grid` instead of `ptsd_test`.
# Code here
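The same pattern, swapping in the grid:

grid_preds <-
  map_dfr(model_fits,
          ~ tibble(grid_pred = list(augment(.x, new_data = ptsd_grid))),
          .id = "model")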
Now, merge the two nested tibbles together:
# Code here
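For instance, joining on the model identifier:

all_preds <- left_join(test_preds, grid_preds, by = "model")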
Evaluating the models
Use `map` to compute the f1-score for each model. Remember to set `event_level = "second"`.
# Code here
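A possible sketch, computing the f1-score from each model's nested test-set predictions:

all_preds <-
  all_preds %>%
  mutate(f1 = map(test_pred,
                  ~ f_meas(.x, truth = ptsd, estimate = .pred_class,
                           event_level = "second")))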
Use `unnest` to unnest the grid predictions and the f1-score.
# Code here
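For example, unnesting the one-row f1 tibble first and then expanding the grid predictions:

boundary_data <-
  all_preds %>%
  select(-test_pred) %>%
  unnest(f1) %>%        # adds .metric, .estimator, .estimate
  unnest(grid_pred)     # one row per grid cell per model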
Visualizing decision boundaries
Generate decision boundaries for the four machine learning models trained to predict PTSD based on responses to questions about acute stress and anxiety. You are expected to produce the same plot as the one we generated in Lab 5 to visualize decision boundaries for different values of k in the k-NN algorithm.
# Code here
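A sketch of one way to build the plot (the exact aesthetics used in Lab 5 may differ):

boundary_data %>%
  mutate(model_label = paste0(model, " (F1 = ", round(.estimate, 3), ")")) %>%
  ggplot(aes(x = acute_stress, y = anxiety)) +
  geom_tile(aes(fill = .pred_class), alpha = 0.3) +      # predicted class = decision regions
  geom_point(aes(colour = prop_ptsd), size = 2) +        # observed proportion with PTSD
  facet_wrap(~ model_label) +
  scale_colour_viridis_c(name = "Proportion with PTSD") +
  labs(fill = "Predicted class") +
  theme_dot()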
❓ Question: Which model obtains the highest f1-score?
Cross-validation and comparison between models
Use `workflow_set` to build multiple models into an integrated workflow.
# Code here
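One possible sketch, pairing a single formula with the named list of models:

ptsd_workflows <-
  workflow_set(preproc = list(formula = ptsd ~ anxiety + acute_stress),
               models  = model_specs)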
Now, run each model on each fold of the data generated previously:
# Code here
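For example, resampling every workflow with the folds and metric set created earlier (ptsd_folds and ptsd_metrics are the placeholder names used in the sketches above):

ptsd_fits <-
  ptsd_workflows %>%
  workflow_map("fit_resamples",
               resamples = ptsd_folds,
               metrics   = ptsd_metrics,
               seed      = 123)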
Run the code below and inspect the results.
# Code here
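The chunk referred to above is not reproduced here; given a workflow-set result like ptsd_fits, a plausible line is simply:

autoplot(ptsd_fits)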
The above plot provides very useful information. However, we could make a few improvements:
- The base `ggplot` colour scheme looks a bit dull
- The f1-score is labelled `f_meas`
- The y-axes of the panels differ from one another
We can get the rankings of the models by running `rank_results(ptsd_fits)`. From there, come up with an improved version of the plot using your `ggplot` skills.
# Code here
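One possible improved version, assuming the metric names and wflow_id prefix produced by the sketches above:

rank_results(ptsd_fits) %>%
  mutate(wflow_id = str_remove(wflow_id, "formula_"),
         .metric  = recode(.metric,
                           f_meas    = "F1-score",
                           precision = "Precision",
                           recall    = "Recall")) %>%
  ggplot(aes(x = reorder(wflow_id, rank), y = mean, colour = wflow_id)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0.2) +
  facet_wrap(~ .metric) +                                  # shared y-axis across panels
  scale_colour_viridis_d(guide = "none") +
  labs(x = NULL, y = "Mean score (5-fold CV)") +
  theme_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))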
❓ Question: How well does the SVM with a second-order polynomial perform relative to the other models?
Experiment more with polynomial SVMs using cross-validation
Using the following hyperparameter grid, try tuning a polynomial SVM.
svm_grid <-
  crossing(cost = c(0.000977, 0.0131, 0.177, 2.38, 32),
           degree = 1:3)
From there, we have given you a series of steps. We strongly recommend that you revisit how we performed hyperparameter tuning using k-fold cross-validation in Lab 5 to get a sense of the code needed to complete this task.
# Instantiate a tuneable polynomial SVM
# Code here
# Create a workflow comprised of the model and a formula
# Code here
# Register multiple cores for faster processing
# Code here
# Tune the model, just use one evaluation metric
# Code here
# Collect the metrics from the model and plot the results
# Code here
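A sketch of the full sequence (roc_auc is chosen here so the result can be compared with the XGBoost model later; all object names are placeholders):

# Instantiate a tuneable polynomial SVM
svm_tune_model <-
  svm_poly(mode = "classification", cost = tune(), degree = tune()) %>%
  set_engine("kernlab")

# Create a workflow comprised of the model and a formula
svm_workflow <-
  workflow() %>%
  add_model(svm_tune_model) %>%
  add_formula(ptsd ~ anxiety + acute_stress)

# Register multiple cores for faster processing
registerDoParallel(cores = parallel::detectCores() - 1)

# Tune the model over svm_grid, using a single evaluation metric
svm_tuned <-
  tune_grid(svm_workflow,
            resamples = ptsd_folds,
            grid      = svm_grid,
            metrics   = metric_set(roc_auc))

# Collect the metrics and plot the results
collect_metrics(svm_tuned) %>%
  ggplot(aes(x = cost, y = mean, colour = factor(degree))) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(y = "Mean ROC AUC (5-fold CV)", colour = "degree") +
  theme_line()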
❓ Question: What combination of hyperparameters produces the best-fitting model?
Experiment with XGBoost using cross-validation
Now, go back to the week 5 lab. You'll find it mentions a tree-based model called XGBoost. This time, we'll use it for classification instead of regression and compare it to our best-fitting SVM model.
You need to start by writing a specification of the XGBoost model:
- Use the `boost_tree` function
- Set `learn_rate = tune()`
- Set `mtry = .cols()` and `trees = 150`
- Set the mode to "classification"
- Set the engine to "xgboost"
🚨 ATTENTION: You will need to have the `xgboost` package installed for this algorithm to work.
# write the code for the XGBoost model specification here
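Based on the bullet points above, one way the specification could look:

xgb_model <-
  boost_tree(learn_rate = tune(), mtry = .cols(), trees = 150) %>%
  set_mode("classification") %>%
  set_engine("xgboost")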
Next, create a recipe and workflow as you usually would (no specific pre-processing in the recipe, and all variables used to predict the `ptsd` outcome variable).
#create a recipe
#create a workflow
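A minimal sketch (the recipe uses all remaining variables as predictors and does no other pre-processing):

xgb_recipe <- recipe(ptsd ~ ., data = ptsd_train)

xgb_workflow <-
  workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_model)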
Using the following hyperparameter grid, try tuning the XGBoost model:
<- tibble(learn_rate = c(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5,1)) xgb_grid
- Make sure to use k-fold cross-validation to do your hyperparameter (`learn_rate`) tuning
- Select `roc_auc` as the metric to optimize in your tuning (this is to ensure we can compare the performance of the XGBoost model you select with that of the best performing SVM trained earlier)
- Produce a graph that visualises the best result
- Select and estimate the model with the best AUC
# write your code to tune the learn_rate parameter of the XGBoost model
# write the code to produce a graph that visualises the best result of your XGBoost parameter tuning
# write code to select the learn_rate that produces the best result
# fit the model with the learn_rate that produces the best AUC
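A possible sketch of the four steps (object names are placeholders):

# tune learn_rate over xgb_grid using the 5 folds, optimising ROC AUC
xgb_tuned <-
  tune_grid(xgb_workflow,
            resamples = ptsd_folds,
            grid      = xgb_grid,
            metrics   = metric_set(roc_auc))

# visualise how ROC AUC changes with the learning rate
collect_metrics(xgb_tuned) %>%
  ggplot(aes(x = learn_rate, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(y = "Mean ROC AUC (5-fold CV)") +
  theme_line()

# select the learn_rate with the best AUC and refit on the full training set
best_learn_rate <- select_best(xgb_tuned, metric = "roc_auc")

xgb_final_fit <-
  finalize_workflow(xgb_workflow, best_learn_rate) %>%
  fit(data = ptsd_train)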
❓ Question: How does the XGBoost model compare with the best performing SVM fitted earlier?
#write code to get the value of AUC for the best SVM and best XGBoost models
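One way to compare the two, using the cross-validated AUC of the best configuration from each tuning result (svm_tuned and xgb_tuned are the placeholder names used in the sketches above):

bind_rows(
  show_best(svm_tuned, metric = "roc_auc", n = 1) %>% mutate(model = "Polynomial SVM"),
  show_best(xgb_tuned, metric = "roc_auc", n = 1) %>% mutate(model = "XGBoost")
) %>%
  select(model, .metric, mean, std_err)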