Reading week homework
Preparation: Loading packages and data
Materials to download
Use the links below to download the homework materials, as well as the dataset we will use for this homework:
Loading packages and data
Start by downloading the dataset into your data folder.
Install missing packages
Install any missing packages:
# make sure you run this code only once and that this chunk is non-executable when you render your qmd
install.packages("viridis")
install.packages("xgboost")
install.packages("kernlab")
install.packages("LiblineaR")

Import libraries and create functions:
library(ggsci)
library(tidymodels)
library(tidyverse)
library(viridis)
library(xgboost)
library(kernlab)
library(LiblineaR)
library(doParallel)
theme_dot <- function() {
  theme_minimal() +
    theme(panel.grid = element_blank(),
          legend.position = "bottom")
}

theme_line <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "bottom")
}

Homework objectives
We have introduced a range of supervised learning algorithms in the previous labs and lectures. Selecting the best model can be challenging. In this assignment, we will provide you with guidance to carry out this process.
We will compare and contrast the decision boundaries produced by four different models trained on the PTSD data.
Next, we will apply k-fold cross-validation to determine which model performs best.
Finally, we will fine-tune the hyperparameters of a support vector machine model and of an XGBoost model and compare both models.
Loading the PTSD dataset
We will use a sample from the PTSD dataset that you worked with in Lab 5.
set.seed(123)
# Code here

The dataset contains the following variables:

- ptsd: 1 if the user has PTSD, 0 if the user does not (the outcome)
- anxiety: 0-10 self-reported anxiety scale
- acute_stress: 0-10 self-reported acute stress scale
Train/test split
Start by performing a train/test split (keep 75% of the data for the training set):
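If you are unsure where to start, here is a minimal sketch using rsample's initial_split() (loaded with tidymodels). The object names (ptsd_data for the loaded sample, ptsd_split, ptsd_train, ptsd_test) and the stratification on the outcome are illustrative assumptions, not requirements.

```r
# minimal sketch; "ptsd_data" is a placeholder for your loaded PTSD sample
ptsd_split <- initial_split(ptsd_data, prop = 0.75, strata = ptsd)  # keep 75% for training
ptsd_train <- training(ptsd_split)
ptsd_test  <- testing(ptsd_split)
```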
set.seed(123)
# Code here

5-fold cross-validation
Now, create 5 cross-validation folds:
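As a rough guide, vfold_cv() from rsample does this in one line; ptsd_train and ptsd_folds are assumed names carried over from the sketch above.

```r
# 5 cross-validation folds built from the training set
ptsd_folds <- vfold_cv(ptsd_train, v = 5)
```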
# Code here

Selecting the evaluation metrics
Finally, create a metric set that includes the precision, recall and f1-score.
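One possible sketch: in yardstick the f1-score is called f_meas, so a metric set could look like the following (ptsd_metrics is an assumed name).

```r
# precision, recall and the f1-score (f_meas) bundled into a single metric set
ptsd_metrics <- metric_set(precision, recall, f_meas)
```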
# Code here

Generating decision boundaries for different models
NOTE: Familiarize yourself with the concept of decision boundaries if needed.
In this assignment, you will have to work with 4 models:
- Logistic regression
- Decision tree
- Linear support vector machine
- Polynomial support vector machine
NOTE: Support vector machines are a highly useful family of algorithms that are widely leveraged by machine learning scientists. For the theory behind SVMs, check out Chapter 9 of Introduction to Statistical Learning.
NOTE: Aside from logistic regression, you will need to specify mode = "classification" for the other models.
Initializing different models
You already know the code for instantiating logistic regression and decision trees, but you will need to consult the parsnip documentation (click here) to find the relevant specifications for the two SVM models.
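As a hedged example only, the four specifications might look like the sketch below. The engine choices and degree = 2 for the polynomial SVM are assumptions; other engines listed in the parsnip documentation would also work.

```r
# one possible set of model specifications
log_spec      <- logistic_reg() %>% set_engine("glm")
tree_spec     <- decision_tree(mode = "classification") %>% set_engine("rpart")
svm_lin_spec  <- svm_linear(mode = "classification") %>% set_engine("LiblineaR")
svm_poly_spec <- svm_poly(mode = "classification", degree = 2) %>% set_engine("kernlab")
```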
# Code here

Proportion of individuals affected by PTSD across different combinations of acute_stress and anxiety levels
Generate a tibble that contains a grid with the proportion of individuals affected by PTSD across different combinations of acute_stress and anxiety levels.
# Code here

Use list to create a named list of different instantiated models.
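For example (the labels and the *_spec names from the previous sketch are placeholders; use whatever names you prefer, as long as you reuse them consistently):

```r
# named list of the four instantiated model specifications
models <- list(
  logistic_regression = log_spec,
  decision_tree       = tree_spec,
  linear_svm          = svm_lin_spec,
  poly_svm            = svm_poly_spec
)
```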
# Code here

Fitting the 4 models on the training data
Use map to apply fit over all the models.
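A minimal sketch, assuming the outcome ptsd has already been converted to a factor and that models and ptsd_train are the objects created earlier:

```r
# fit every specification with the same formula on the training data
model_fits <- map(models, ~ fit(.x, ptsd ~ ., data = ptsd_train))
```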
# Code here

Generating predictions on the test set with the different models
Use map_dfr to apply a function to a list of model fits (model_fits) and create a single tibble test_preds with predictions from each model on the ptsd_test dataset, with each model's results nested in its own row.
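One possible approach uses augment() to attach predictions and nest() to keep each model's predictions in a single row; predict() followed by bind_cols() would work just as well. The .id = "model" argument turns the list names into a column.

```r
# one nested row of test-set predictions per model
test_preds <- map_dfr(
  model_fits,
  ~ augment(.x, new_data = ptsd_test) %>% nest(preds = everything()),
  .id = "model"
)
```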
# Code here

Repeat the same operation using ptsd_grid instead of ptsd_test.
# Code here

Now, merge the two nested tibbles together:
# Code here

Evaluating the models
Compute the f1-score for each model: use map to apply the metric across the different models' predictions. Remember to set event_level = "second".
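A sketch, assuming the merged tibble still contains a nested column of test-set predictions (called preds here) with the truth column ptsd and the predicted class .pred_class:

```r
# f1-score computed inside each nested tibble of test predictions
test_preds <- test_preds %>%
  mutate(f1 = map(preds,
                  ~ f_meas(.x, truth = ptsd, estimate = .pred_class,
                           event_level = "second")))
```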
# Code here

Use unnest to unnest the grid predictions and the f1-score.
# Code hereVisualizing decision boundaries
Generate decision boundaries for the four machine learning models trained to predict PTSD based on responses to questions about acute stress and anxiety. You are expected to generate the same plot as the one we generated in Lab 5 to visualize decision boundaries for different values of k in the k-NN algorithm.
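The exact plot depends on how you built it in Lab 5, but a rough sketch along these lines (grid_preds and the column names are assumptions) gives one facet per model, a tile layer for the predicted class over the grid, and the test observations on top:

```r
# rough sketch of a decision-boundary plot; adapt to your own object and column names
ggplot() +
  geom_tile(data = grid_preds,
            aes(x = anxiety, y = acute_stress, fill = .pred_class), alpha = 0.4) +
  geom_point(data = ptsd_test,
             aes(x = anxiety, y = acute_stress, colour = factor(ptsd))) +
  facet_wrap(~ model) +
  scale_fill_viridis_d() +
  scale_colour_viridis_d() +
  theme_dot()
```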
# Code here

Question:
Which model obtains the highest f1-score?
Cross-validation and comparison between models
Use workflow_set to build multiple models into an integrated workflow.
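A minimal sketch, reusing the named list of specifications from earlier and a single formula pre-processor:

```r
# cross one formula with the four model specifications
ptsd_workflows <- workflow_set(
  preproc = list(formula = ptsd ~ .),
  models  = models
)
```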
# Code here

Now, run each model on each fold of the data generated previously:
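One way to do this is workflow_map(), which applies the same resampling routine to every workflow in the set; ptsd_folds and ptsd_metrics are the objects created earlier, and ptsd_fits matches the name used by rank_results() below.

```r
# resample every workflow on the 5 folds, collecting the chosen metrics
ptsd_fits <- ptsd_workflows %>%
  workflow_map("fit_resamples",
               resamples = ptsd_folds,
               metrics   = ptsd_metrics,
               seed      = 123)
```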
# Code here

Run the code below and inspect the results.
# Code here

The above plot provides very useful information. However, we could make a few improvements:

- The base ggplot colour scheme looks a bit dull
- The f1-score is labelled as f_meas
- The y-axes of each plot are different from one another
We can get the rankings of the models by running rank_results(ptsd_fits). From there, come up with an improved version of the plot using your ggplot skills.
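As a rough starting point only (the column names come from rank_results(), and the colour scale and theme choices are just one option among many):

```r
# one possible improvement: viridis colours, a readable F1-score label, a single shared axis
rank_results(ptsd_fits) %>%
  filter(.metric == "f_meas") %>%
  ggplot(aes(x = reorder(wflow_id, mean), y = mean, colour = wflow_id)) +
  geom_pointrange(aes(ymin = mean - std_err, ymax = mean + std_err)) +
  scale_colour_viridis_d() +
  labs(x = NULL, y = "F1-score", colour = "Model") +
  theme_dot()
```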
# Code here

Question:
How well does the SVM with a second-order polynomial perform, relative to the other models?
Experiment more with polynomial SVMs using cross-validation
Using the following hyperparameter grid, try tuning a polynomial SVM.
svm_grid <- crossing(cost = c(0.000977, 0.0131, 0.177, 2.38, 32),
                     degree = 1:3)

From there, we have given you a series of steps. We strongly recommend that you revisit how we performed hyperparameter tuning using k-fold cross-validation in Lab 5 to get a sense of the code needed to complete this task.
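To give a sense of how the steps below fit together, here is a condensed, hedged sketch; the object names, the kernlab engine and the use of roc_auc as the single evaluation metric are assumptions you are free to change.

```r
# tuneable polynomial SVM: cost and degree are left as tuning parameters
svm_tune_spec <- svm_poly(mode = "classification",
                          cost = tune(), degree = tune()) %>%
  set_engine("kernlab")

# workflow = model + formula
svm_tune_wf <- workflow() %>%
  add_model(svm_tune_spec) %>%
  add_formula(ptsd ~ .)

# register multiple cores so tuning can run in parallel
registerDoParallel(cores = parallel::detectCores() - 1)

# tune over the grid defined above, tracking a single metric
svm_tune_res <- tune_grid(svm_tune_wf,
                          resamples = ptsd_folds,
                          grid      = svm_grid,
                          metrics   = metric_set(roc_auc))

# inspect the results
collect_metrics(svm_tune_res) %>%
  ggplot(aes(x = cost, y = mean, colour = factor(degree))) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(y = "ROC AUC", colour = "degree") +
  theme_line()
```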
# Instantiate a tuneable polynomial SVM
# Code here
# Create a workflow comprised of the model and a formula
# Code here
# Register multiple cores for faster processing
# Code here
# Tune the model, just use one evaluation metric
# Code here
# Collect the metrics from the model and plot the results
# Code here

Question:
What combination of hyperparameters produces the best fitting model?
Experiment with XGBoost using cross-validation
Now, go back to the week 5 lab. You'll find it mentions a tree-based model called XGBoost. This time, we'll use it for classification instead of regression and compare it to our best fitting SVM model.
You need to start by writing a specification of the XGBoost model:
- Use the boost_tree function
- Set learn_rate = tune()
- Set mtry = .cols() and trees = 150
- Set the mode to "classification"
- Set the engine to "xgboost"
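Putting these requirements together, a possible sketch (the name xgb_spec is just a placeholder) is:

```r
# XGBoost specification: tune the learning rate, fix mtry and the number of trees
xgb_spec <- boost_tree(mtry = .cols(), trees = 150, learn_rate = tune()) %>%
  set_mode("classification") %>%
  set_engine("xgboost")
```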
ATTENTION: You will need to have the xgboost package installed for this algorithm to work.
# write the code for the XGBoost model specification here

Next, create a recipe and workflow as you usually would (no specific pre-processing in the recipe and all variables used to predict the ptsd outcome variable).
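For instance (assuming ptsd_train from earlier and the xgb_spec sketch above; no pre-processing steps are added to the recipe):

```r
# recipe with no additional pre-processing steps
xgb_recipe <- recipe(ptsd ~ ., data = ptsd_train)

# bundle the recipe and the model specification into a workflow
xgb_wf <- workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_spec)
```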
# create a recipe

# create a workflow

Using the following hyperparameter grid, try tuning the XGBoost model (a rough sketch of these steps is given after the list below):
xgb_grid <- tibble(learn_rate = c(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1))

- Make sure you use k-fold cross-validation to do your hyperparameter (learn_rate) tuning
- Select roc_auc as the metric to optimize in your tuning (this is to ensure we can compare the performance of the XGBoost model you select with that of the best performing SVM trained earlier)
- Produce a graph that visualises the best result
- Select and estimate the model with the best AUC
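A condensed sketch of these steps, assuming the objects named in the earlier sketches (xgb_wf, ptsd_folds) and reusing xgb_grid; treat it as one possible route rather than the reference solution.

```r
# tune learn_rate over the grid, optimizing ROC AUC on the 5 folds
xgb_tune_res <- tune_grid(xgb_wf,
                          resamples = ptsd_folds,
                          grid      = xgb_grid,
                          metrics   = metric_set(roc_auc))

# visualise how AUC changes with the learning rate
collect_metrics(xgb_tune_res) %>%
  ggplot(aes(x = learn_rate, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(y = "ROC AUC") +
  theme_line()

# pick the learn_rate with the best AUC and fit the finalised workflow
best_xgb <- select_best(xgb_tune_res, metric = "roc_auc")

xgb_final_fit <- xgb_wf %>%
  finalize_workflow(best_xgb) %>%
  fit(data = ptsd_train)
```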
# write your code to tune the learn_rate parameter of the XGBoost model

# write the code to produce a graph that visualises the best result of your XGBoost parameter tuning

# write code to select the learn_rate that produces the best result

# fit the model with the learn_rate that produces the best AUC

Question: How does the XGBoost model compare with the best performing SVM fitted earlier?
# write code to get the value of AUC for the best SVM and best XGBoost models