🛣️ Week 05 Lab - Roadmap (90 min)
2024/25 Autumn Term
Welcome to our DS202A fifth lab!
This week, you will take a closer look at the \(K\)-NN algorithm. You'll also discover a new family of models that can be used both for regression (as in this lab) and for classification: tree-based models, i.e. decision trees, Random Forests and XGBoost. All along, you'll practise evaluating your models through cross-validation and see how cross-validation can be used in hyperparameter tuning, i.e. selecting the model parameters that yield the best performance (according to a chosen metric), for example how to set \(K\) for \(K\)-NN.
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Fit a \(K\)-NN model in R.
- Evaluate the performance of a \(K\)-NN model using standard classification metrics.
- Understand how a regression problem can be modelled with tree-based models (i.e. decision trees, Random Forests, XGBoost).
- Use cross-validation to tune model hyperparameters and evaluate model performance.
📚 Preparation: Loading packages and data
Materials to download
Please download the following files into your data folder.
Then use the link below to download the lab file:
⚙️ Setup
Install missing libraries:
You probably don't have the xgboost, viridis and rpart.plot libraries installed, so you'll need to install them.
# Make sure you run this code only once and that this chunk is non-executable when you render your qmd
install.packages("rpart.plot")
install.packages("xgboost")
install.packages("viridis")
Import libraries and create functions:
library("ggsci")
library("janitor")
library("kknn")
library("rpart.plot")
library("scales")
library("tidymodels")
library("tidyverse")
library("viridis")
library("xgboost")
theme_scatter <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          legend.position = "bottom")
}

theme_dot <- function() {
  theme_minimal() +
    theme(panel.grid = element_blank(),
          legend.position = "bottom")
}

theme_line <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "bottom")
}

theme_histogram <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank())
}

theme_bar <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.y = element_blank())
}
Predicting Post-Traumatic Stress Disorder with k Nearest Neighbours (50 minutes)
Today, we are going to demonstrate the utility of hyperparameter tuning using k-fold cross-validation.
Exploring a feature space
We are going to take a look at data on Mexican respondents to an online health-risk questionnaire administered during the COVID-19 pandemic. The data providers created a code for whether or not each respondent suffered from PTSD. We will expand on an example given for k nearest neighbours in the paper (see page 10), using two predictors. The full data set consists of:
ptsd <-
  read_csv("data/ptsd-for-knn.csv") %>%
  mutate(ptsd = as.factor(ptsd))
- ptsd: 1 if the user has PTSD, 0 if they do not (the outcome)
- anxiety: 0-10 self-reported anxiety scale.
- acute_stress: 0-10 self-reported acute stress scale.
Let's use ggplot to explore this feature space.
ptsd %>%
  select(ptsd, anxiety, acute_stress) %>%
  group_by(anxiety, acute_stress) %>%
  summarise(ptsd = mean(ptsd == 1, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot(aes(anxiety, acute_stress, colour = ptsd)) +
  geom_point(size = 5) +
  scale_x_continuous(breaks = 0:10) +
  scale_y_continuous(breaks = 0:10) +
  scale_colour_viridis(labels = percent, breaks = c(0, 0.4, 0.8), option = "cividis") +
  theme_dot() +
  labs(x = "Anxiety question", y = "Acute stress question",
       colour = "Proportion of grid with PTSD")
👉 NOTE: The actual data set has many features that can be used. We are only using 2 features because it is hard to visualise a feature space with more than two dimensions.
🗣️ CLASSROOM DISCUSSION:
What does this graph show? How can we identify individuals diagnosed with PTSD? What boundaries can we draw between the two classes?
👨🏻‍🏫 TEACHING MOMENT:
All machine learning classification algorithms draw boundaries between classes in one way or another; they differ in how flexible those boundaries can be.
Introducing k Nearest Neighbours
An algorithm that can be used to classify data points is k Nearest Neighbours (kNN for short). This algorithm works by taking the neighbours of an observation and having them 'vote' on whether or not the observation in question belongs to a specific class.
👉 NOTE: An observation's 'neighbours' are simply the data points closest to it according to a distance metric; it is these nearby points, rather than the rest of the data, that decide its fate.
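To make the voting step concrete, here is a minimal stand-alone sketch in base R. The training points, labels, new observation and value of k are all made up for illustration; the lab itself uses tidymodels rather than hand-rolled code.

# Made-up training data: four points, two features, a binary label
train_x <- matrix(c(1, 2, 2, 1, 8, 9, 9, 8), ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "1", "1"))
new_x   <- c(2, 2)  # the observation we want to classify
k       <- 3

# Euclidean distance from the new point to every training point
dists <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 2, byrow = TRUE))^2))

# The k closest training points vote; the most common label wins
neighbours <- order(dists)[1:k]
names(which.max(table(train_y[neighbours])))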
The question is, how many neighbours get to vote in the decision? This is the key hyperparameter of nearest neighbours and something that we need to experiment with. The most widely acknowledged way to experiment with hyperparameters is via k-fold cross-validation.
k-fold Cross-validation
Let's split our sample into training and testing sets.
# Set a seed to replicate results.
# Perform train / test split
# Uncomment to run
# ptsd_cv <- vfold_cv(ptsd_train, v = 5)
# ptsd_cv
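If you get stuck, here is one possible completion of the chunk above; the 75/25 proportion, the stratification on the outcome and the seed value are assumptions rather than the official solution.

# Set a seed to replicate results.
set.seed(123)

# Perform train / test split (75/25, stratified on the outcome)
ptsd_split <- initial_split(ptsd, prop = 0.75, strata = ptsd)
ptsd_train <- training(ptsd_split)
ptsd_test <- testing(ptsd_split)

# Create 5 cross-validation folds from the training data
ptsd_cv <- vfold_cv(ptsd_train, v = 5)
ptsd_cv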
We added an extra line of code: ptsd_cv <- vfold_cv(ptsd_train, v = 5). All this does is take our training data and split it into 5 folds: for each fold, the majority of the data is used to train the model and the remainder is used to evaluate it (a.k.a. a hold-out set) before the model is evaluated on the test data. We can then calculate cross-validated scores by simply taking the mean of each set of evaluation scores. In other words, before we provide a final evaluation, we are internally auditing our models by building and evaluating them on different subsets of the data. In doing so, k-fold cross-validation provides an extra layer of protection against overfitting when tuning hyperparameters and, as such, should always be used.
The below code instantiates a kNN classifier.
🚨 ATTENTION: You will need to have the kknn package installed for this algorithm to work.
knn_model <-
  nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")
👨🏻‍🏫 TEACHING MOMENT:
What does setting neighbors = tune() mean?
Pre-processing
We are going to use a simple recipe to pre-process the data. All that is happening is that both features are normalised, meaning we subtract each feature's mean and divide by its standard deviation.
⚠️ WARNING: Any algorithm that uses distance metrics must include this preprocessing step, as features with higher variances would otherwise have a disproportionately large influence. So, when you hear "this algorithm uses distance measures", think "standardise the features!"
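To see why, here is a small illustration with made-up numbers: two features on very different scales, before and after standardising.

# Toy data: income is measured in the tens of thousands, age in the tens
toy <- data.frame(income = c(20000, 50000, 80000), age = c(25, 40, 60))

dist(toy)        # pairwise distances are driven almost entirely by income
dist(scale(toy)) # after standardising, both features contribute comparably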
knn_rec <-
  recipe(ptsd ~ ., data = ptsd_train) %>%
  step_normalize(all_numeric_predictors())
👉 NOTE: Check out this link to find out what a recipe is and why recipes are useful.
Creating a workflow
Now that we have a model to build and a recipe to follow, we can combine both into a workflow.
👉 NOTE: Check out this link to find out what a workflow is and why workflows are useful.
wf_knn <-
  workflow() %>%
  add_model(knn_model) %>%
  add_recipe(knn_rec)
Deciding upon performance metrics
Your lecturer will have made you aware of the upsides and downsides of each performance metric. With tidymodels you do not have to settle for just one, so we are going to pick three to create a metric set.
# Create a metric set with precision, recall and f_meas
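If you need a hint, one way to create it is sketched below; the object name class_metrics matches the one used in the tuning code further down.

class_metrics <- metric_set(precision, recall, f_meas)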
Creating a nearest neighbours grid
We then need to specify a parameter grid.
👉 NOTE: This task often involves a lot of experimentation, so do not take these values as gospel for any future kNN models you build.
grid_knn <- tibble(neighbors = c(100, 500, 1000, 3000))
🗣️ CLASSROOM DISCUSSION:
Which number in the grid would represent the most 'complex' model?
👨🏻‍🏫 TEACHING MOMENT:
Your teacher will present some output to see what happens when we vary k.
Estimating a cross-validated model
We now have all the information we need to estimate a cross-validated model. We have
- A workflow, consisting of a model and a recipe
- A series of 5 resampled data sets
- A metric set containing several evaluation metrics and
- A range of hyperparameter options to tune our models.
This will result in 20 different models (5 folds \(\times\) 4 hyperparameter values) being run, which may be slow on your laptops. To get the most out of your laptop, we add doParallel::registerDoParallel() to run the models in parallel using multiple cores.
🚨 ATTENTION: You will need to have the doParallel package installed for doParallel::registerDoParallel() to work.
doParallel::registerDoParallel()

knn_model_cv <-
  tune_grid(
    wf_knn,
    resamples = ptsd_cv,
    metrics = class_metrics,
    grid = grid_knn,
    control = control_resamples(event_level = "second")
  )
👉 NOTE: This may take a while to run, so relax for a bit!
Plot hyperparameters
After knn_model_cv has finished running, try using the collect_metrics() function on the model object. This results in a tibble of cross-validated performance metrics for each hyperparameter combination. Plot the results using a graph of your choice!
knn_model_cv %>%
  collect_metrics()
# Build a ggplot here
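One possible plot, among many valid designs, using the theme_line() helper defined at the top of the lab:

knn_model_cv %>%
  collect_metrics() %>%
  ggplot(aes(neighbors, mean)) +
  facet_wrap(~ .metric) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_line() +
  labs(x = "Number of neighbours (k)", y = "Mean cross-validated score")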
🗣️ CLASSROOM DISCUSSION:
Which model is best?
Fit the best model to the full training set
We can then build a final model using the best hyperparameter combination. Let's take the f1-score as the measure that we want to maximise.
# Select the best model
highest_f1 <- select_best(knn_model_cv, metric = "f_meas")

# Use the best parameter combination and fit on the entire training data
knn_model_final <-
  wf_knn %>%
  finalize_workflow(highest_f1) %>%
  fit(ptsd_train)
Evaluate the model on the test set
From there, all we need to do is follow the evaluation step that we have done for previous models. Use f_meas, setting event_level = "second".
# Code here
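One way to complete this step, assuming the ptsd_test set created in the train / test split sketched earlier:

knn_model_final %>%
  augment(new_data = ptsd_test) %>%
  f_meas(ptsd, .pred_class, event_level = "second")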
Removing superfluous objects
rm(list = ls(pattern = "grid|high|knn|ptsd|metric"))
gc()
Predicting childcare costs in US counties with tree-based models (40 minutes)
We are going to see how tree-based models (a decision tree, a Random Forest and XGBoost) and hyperparameter tuning can help improve predictive performance on a regression problem. We will take the following steps:
- Perform some exploratory data analysis
- Plot a decision tree
- Build a random forest model
- Tune the learning rate using XGBoost
Exploring the data set
# childcare <- CODE FOR LOADING THE DATA GOES HERE
Click here to see a full list of variables. Let's explore the data with a few visualisations.
Median childcare costs are ~$96.53 a week (median across counties)
med_cost <- median(childcare$mcsa)

childcare %>%
  ggplot(aes(mcsa)) +
  geom_histogram(bins = 40, fill = "lightgrey", colour = "black") +
  geom_vline(xintercept = med_cost, linewidth = 2, linetype = "dashed", colour = "red") +
  theme_histogram() +
  scale_x_continuous(labels = dollar) +
  scale_y_continuous(labels = comma) +
  labs(x = "Weekly cost of childcare (USD)", y = "Count")
The costs of childcare increase in counties with higher median incomes
childcare %>%
  ggplot(aes(mhi_2018, mcsa)) +
  geom_point(alpha = 0.125, size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_scatter() +
  scale_x_continuous(labels = dollar) +
  scale_y_continuous(labels = dollar) +
  labs(x = "Household income (2018 USD)",
       y = "Weekly cost of childcare (USD)")
There is considerable variation in the costs of childcare based on the ethnic composition of the county
👉 NOTE: We adapted this graph from Julia Silge (click here for her analysis of the same data set).
childcare %>%
  select(mcsa, starts_with("one_race_"), -one_race_other, two_races) %>%
  pivot_longer(-mcsa, names_to = "race", values_to = "perc") %>%
  mutate(race = str_remove(race, "one\\_race\\_"),
         race = case_when(race == "w" ~ "Both White",
                          race == "b" ~ "Both Black",
                          race == "i" ~ "Both Native",
                          race == "a" ~ "Both Asian",
                          race == "h" ~ "Both Hispanic",
                          race == "two_races" ~ "Mixed Parents")) %>%
  ggplot(aes(perc, mcsa)) +
  facet_wrap(. ~ race, scales = "free_x") +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_scatter() +
  labs(x = "Percent of racial family types in counties",
       y = "Weekly cost of childcare (USD)")
Create a grouped train / test split
If you have inspected childcare, you will see that it tracks `r length(unique(childcare$county_name))` counties from `r min(childcare$study_year)` to `r max(childcare$study_year)`. This poses an interesting challenge for when we perform a train / test split. If we interpret childcare as purely cross-sectional and perform an initial split without taking the time-series dimension into account, we run the risk of:
- Using future values to predict the past
- Using information from the same county in the training and testing sets
Naturally, a few solutions follow from this issue:
- Use earlier years to train the model and later years to test it.
- Use some counties to train the model and other counties to test it.
We will opt for the latter. This can be done by using the group_initial_split function, which takes all the same parameters as initial_split and includes a group parameter. Try using this function below.
# Set a seed
set.seed(123)
# Create a grouped initial split
childcare_split <-
  group_initial_split(childcare,
                      group = county_name,
                      prop = 0.75)

# Isolate the train and test sets
childcare_train <- training(childcare_split)
childcare_test <- testing(childcare_split)
We can also create cross-validation folds using group_vfold_cv. Try using this below to create a 5-fold cross-validation grid.
# Set a seed
set.seed(123)
# Create 5-fold cross-validation grid, grouped by county name
childcare_cv <- group_vfold_cv(childcare_train, group = county_name, v = 5)
Create a decision tree
Use the following steps:
- Use the decision_tree function
- Set the mode to "regression"
- Set the engine to "rpart"
🚨 ATTENTION: You will need to have the rpart package installed for this algorithm to work.
dt_model <-
  decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart")
Develop a recipe
The recipe is fairly straightforward. The only step we have added is to assign the county and state features as IDs, meaning they will not play a part in the model building process.
childcare_rec <-
  childcare_train %>%
  recipe(mcsa ~ .) %>%
  update_role(starts_with(c("county", "state")), new_role = "id")
Fit the model
As we are not experimenting with hyperparameters at this point, we can simply build a workflow and fit it directly to the training data.
dt_fit <-
  workflow() %>%
  add_recipe(childcare_rec) %>%
  add_model(dt_model) %>%
  fit(data = childcare_train)
Describe the following plot
🚨 ATTENTION: You will need to have the rpart.plot package installed for this plot to render.
rpart.plot(extract_fit_engine(dt_fit), roundint=FALSE)
🗣️ CLASSROOM DISCUSSION:
Which feature has the algorithm deemed most important for splitting the data? Can you explain what the tree is doing?
Evaluate the model on the test set
Letβs see how well our simple decision tree does. Create a metric set with the RMSE and r-squared metrics and use them to evaluate the decision tree.
regress_metrics <- metric_set(rsq, rmse)

dt_fit %>%
  augment(new_data = childcare_test) %>%
  regress_metrics(mcsa, .pred)
Instantiate a random forest model
- Use the rand_forest function
- Set mtry = round(sqrt(52), 0) and trees = 1e3

🗣️ CLASSROOM DISCUSSION:
Why do we do this? What happens if we set mtry equal to the number of features?

- Set the mode to "regression"
- Set the engine to "ranger"
rf_model <-
  rand_forest(mtry = round(sqrt(52), 0), trees = 1e3) %>%
  set_mode("regression") %>%
  set_engine("ranger")
Fit the model
Use the same syntax used earlier.
doParallel::registerDoParallel()

rf_fit <-
  workflow() %>%
  add_recipe(childcare_rec) %>%
  add_model(rf_model) %>%
  fit(data = childcare_train)
Evaluate the model on the test set
rf_fit %>%
  augment(new_data = childcare_test) %>%
  regress_metrics(mcsa, .pred)
🗣️ CLASSROOM DISCUSSION:
Do we see an improvement?
Instantiate an XGBoost model
- Use the boost_tree function
- Set learn_rate = tune()
- Set mtry = round(sqrt(52), 0) and trees = 1e3
- Set the mode to "regression"
- Set the engine to "xgboost"

🚨 ATTENTION: You will need to have the xgboost package installed for this algorithm to work.
xgb_model <-
  boost_tree(mtry = round(sqrt(52), 0), trees = 1e3,
             learn_rate = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")
Create a workflow
Use a similar format to the chunk used to create the kNN workflow.
xgb_wf <-
  workflow() %>%
  add_recipe(childcare_rec) %>%
  add_model(xgb_model)
Create a hyperparameter tuning grid
Use c(0.0001, 0.001, 0.01, 0.1, 0.2, 0.3) as the experimental values. Remember that all you need to do is create a tibble of these values with the exact same hyperparameter name used in the model for this grid to work.
xgb_grid <- tibble(learn_rate = c(0.0001, 0.001, 0.01, 0.1, 0.2, 0.3))
Estimating a cross-validated model
We now have all the information we need to estimate a cross-validated model. We have
- A workflow, consisting of a model and a recipe
- A series of 5 resampled data sets
- A metric set containing two evaluation metrics and
- A range of hyperparameter options to tune our models.
doParallel::registerDoParallel()

xgb_cv_fit <-
  tune_grid(
    xgb_wf,
    resamples = childcare_cv,
    metrics = regress_metrics,
    grid = xgb_grid
  )
Which learning rate produces the best RMSE?
Produce a graph that visualises the best result.
xgb_cv_fit %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(as.factor(learn_rate), mean, group = 1)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_line() +
  labs(x = "Learn rate", y = "Mean Cross-Validated RMSE")
Select and estimate the model with the lowest RMSE
lowest_rmse <- select_best(xgb_cv_fit, metric = "rmse")

xgb_final_fit <-
  xgb_wf %>%
  finalize_workflow(lowest_rmse) %>%
  fit(childcare_train)
Evaluate the model on the test set
# Code here
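One possible completion, mirroring the evaluation code used for the decision tree and the random forest:

xgb_final_fit %>%
  augment(new_data = childcare_test) %>%
  regress_metrics(mcsa, .pred)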
🗣️ CLASSROOM DISCUSSION:
How should we rank each model?
(Bonus) Create a graph that summarises our findings!
A bar graph will provide an excellent visual summary. We encourage you to use something like this in your summative work if you are asked to compare different ML models.
# Code here
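A possible sketch, one of many valid designs: compute the test-set RMSE for each fitted model and compare them in a bar chart, using the theme_bar() helper defined at the top of the lab.

# Collect the test-set RMSE of each model into a single tibble
model_rmse <-
  bind_rows(
    dt_fit %>% augment(new_data = childcare_test) %>% rmse(mcsa, .pred) %>% mutate(model = "Decision tree"),
    rf_fit %>% augment(new_data = childcare_test) %>% rmse(mcsa, .pred) %>% mutate(model = "Random forest"),
    xgb_final_fit %>% augment(new_data = childcare_test) %>% rmse(mcsa, .pred) %>% mutate(model = "XGBoost")
  )

# Compare the models in a bar chart
model_rmse %>%
  ggplot(aes(model, .estimate)) +
  geom_col(fill = "lightgrey", colour = "black") +
  theme_bar() +
  labs(x = NULL, y = "Test-set RMSE (lower is better)")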