πŸ›£οΈ Week 05 Lab - Roadmap (90 min)

2024/25 Autumn Term

Author

The DS202 team

Welcome to our DS202A fifth lab!

This week, you will take a closer look at the \(K\)-NN algorithm. You’ll also discover a new family of models that can be used both for regression (as in this lab) and for classification: tree-based models, i.e. decision trees, Random Forest and XGBoost. All along, you’ll practise evaluating your models through cross-validation and see how cross-validation can be used in hyperparameter tuning, i.e. selecting the model parameters that yield the best performance (according to a set metric), for example how to set \(K\) for \(K\)-NN.

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Fit a \(K\)-NN model in R.
  • Evaluate the performance of a \(K\)-NN model using standard classification metrics.
  • Understand how a regression problem can be modelled with tree-based models (i.e. decision trees, Random Forest, XGBoost).
  • Use cross-validation to tune model hyperparameters and evaluate model performance.

πŸ“š Preparation: Loading packages and data

Materials to download

Please download the following files into your data folder.

Then use the link below to download the lab file:

βš™οΈ Setup

Install missing libraries:

You probably don’t have the xgboost, viridis and rpart.plot libraries installed so you’ll need to install them.

#make sure you run this code only once and that this chunk is non-executable when you render your qmd
install.packages("rpart.plot")
install.packages("xgboost")
install.packages("viridis")

Import libraries and create functions:


library("ggsci")
library("janitor")
library("kknn")
library("rpart.plot")
library("scales")
library("tidymodels")
library("tidyverse")
library("viridis")
library("xgboost")

theme_scatter <- function() {
  
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          legend.position = "bottom")
  
}

theme_dot <- function() {
  
  theme_minimal() +
    theme(panel.grid = element_blank(),
          legend.position = "bottom")
  
}

theme_line <- function() {
  
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "bottom")
  
}

theme_histogram <- function() {
  
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank())
  
}

theme_bar <- function() {
  
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.y = element_blank())
  
}

Predicting Post-Traumatic Stress Disorder with k Nearest Neighbours (50 minutes)

Today, we are going to be demonstrating the utility of hyperparameter tuning using k-fold cross validation.

Exploring a feature space

We are going to take a look at data on Mexican survey respondents to an online health risk questionnaire during the COVID-19 pandemic. The data providers created a code for whether or not the users suffered from PTSD. We will expand an example given for k nearest neighbours in the paper (see page 10), using two predictors. The data set we will use consists of:

  • ptsd: 1 if the user has PTSD, 0 if the user does not (the outcome)
  • anxiety: 0-10 self-reported anxiety scale.
  • acute_stress: 0-10 self-reported acute stress scale.


ptsd <- 
  read_csv("data/ptsd-for-knn.csv") %>% 
  mutate(ptsd = as.factor(ptsd))

Let’s use ggplot to explore this feature space.


ptsd %>% 
  select(ptsd, anxiety, acute_stress) %>% 
  group_by(anxiety, acute_stress) %>% 
  summarise(ptsd = mean(ptsd == 1, na.rm = TRUE)) %>% 
  ungroup() %>% 
  ggplot(aes(anxiety, acute_stress, colour = ptsd)) +
  geom_point(size = 5) +
  scale_x_continuous(breaks = 0:10) +
  scale_y_continuous(breaks = 0:10) +
  scale_colour_viridis(labels = percent, breaks = c(0, 0.4, 0.8), option = "cividis") +
  theme_dot() +
  labs(x = "Anxiety question", y = "Acute stress question",
       colour = "Proportion of grid with PTSD")

πŸ‘‰ NOTE: The actual data set has many features that can be used. We are only using 2 features because it is hard to visualise a feature space with more than two dimensions.

πŸ—£οΈ CLASSROOM DISCUSSION:

What does this graph show? How can we identify individuals diagnosed with PTSD? What boundaries can we draw between the two classes?

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT:

All classification algorithms draw boundaries between classes in the feature space; they differ in how those boundaries are constructed.

Introducing k Nearest Neighbours

An algorithm that can be used to classify data points is k Nearest Neighbours (kNN for short). This algorithm works by taking the neighbours of an observation and having them β€œvote” on whether or not the observation in question belongs to a specific class.

πŸ‘‰ NOTE: A data point’s β€œneighbours” are the observations closest to it according to a distance metric; the point is then assigned to the class that the majority of those neighbours belong to.

The question is, how many neighbours get to vote in the decision? This is the key hyperparameter of nearest neighbours and something that we need to experiment with. The most widely acknowledged way to experiment with hyperparameters is via k-fold cross-validation.
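To make the β€œvoting” idea concrete, here is a toy sketch of how k = 3 neighbours would classify a new observation. The data points below are made up and are not part of the PTSD data:

# Toy data: five observations with made-up anxiety / acute stress scores
toy <- tibble(
  anxiety      = c(1, 2, 8, 9, 7),
  acute_stress = c(2, 1, 9, 8, 7),
  ptsd         = factor(c(0, 0, 1, 1, 1))
)

# A new observation we want to classify
new_point <- tibble(anxiety = 8, acute_stress = 8)

k <- 3

toy %>% 
  mutate(dist = sqrt((anxiety - new_point$anxiety)^2 +
                       (acute_stress - new_point$acute_stress)^2)) %>% 
  slice_min(dist, n = k) %>%  # keep the k nearest neighbours...
  count(ptsd, sort = TRUE)    # ...and let them vote on the class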

k-fold Cross-validation

Let’s split our sample into training and testing sets.


# Set a seed to replicate results.

# Perform train / test split

# Uncomment to run
# ptsd_cv <- vfold_cv(ptsd_train, v = 5)
# ptsd_cv
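If you get stuck, one possible way to complete the chunk is sketched below (the 75/25 proportion and stratifying on ptsd are our assumptions; any sensible split will do):

set.seed(123)

ptsd_split <- initial_split(ptsd, prop = 0.75, strata = ptsd)
ptsd_train <- training(ptsd_split)
ptsd_test  <- testing(ptsd_split)

ptsd_cv <- vfold_cv(ptsd_train, v = 5)
ptsd_cv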

We added an extra line of code: ptsd_cv <- vfold_cv(ptsd_train, v = 5). All this does is take our training data and split it into 5 folds; for each fold, the majority of the training data is used to fit the model and the remaining fold (a.k.a. a hold-out set) is used to evaluate it, all before the model ever sees the test data. We can then calculate cross-validated scores by simply taking the mean of each set of evaluation scores. In other words, before we provide a final evaluation, we are internally auditing our models by building and evaluating them on different subsets of the training data. In doing so, k-fold cross-validation provides an extra layer of protection against overfitting when tuning hyperparameters and, as such, should always be used.

The below code instantiates a kNN classifier.

🚨 ATTENTION: You will need to have the kknn package installed for this algorithm to work.


knn_model <- 
  nearest_neighbor(neighbors = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("kknn")

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT:

What does setting neighbors = tune() mean?

Pre-processing

We are going to use a simple recipe to pre-process the data. All that is happening is that both features are normalised, meaning we subtract each feature’s mean and divide by its standard deviation.

⚠️ WARNING: Any algorithm that uses distance metrics needs this preprocessing step, as features with higher variances would otherwise have a disproportionately large influence on the distances. So, when you hear β€œthis algorithm utilises distance measures”, think β€œstandardise features!”


knn_rec <-
  recipe(ptsd ~ ., data = ptsd_train) %>% 
  step_normalize(all_numeric_predictors()) 
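If you want to check what the recipe actually does, you can prep() and bake() it on the training data (a quick sanity check, not a required lab step); after normalisation the features should have mean β‰ˆ 0 and standard deviation β‰ˆ 1:

knn_rec %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  summarise(across(c(anxiety, acute_stress), list(mean = mean, sd = sd)))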

πŸ‘‰ NOTE: Check out this link to find out what a recipe is and why they are useful.

Creating a workflow

Now that we have a model to build and a recipe to follow, we can combine both into a workflow.

πŸ‘‰ NOTE: Check out this link to find out what a workflow is and why they are useful.


wf_knn <-
  workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(knn_rec)

Deciding upon performance metrics

Your lecturer will have made you aware of the upsides and downsides to each performance metric. With tidymodels you do not have to settle for one, so we are going to pick three to create a metric set.


# Create a metric set with precision, recall and f_meas
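One possible way to complete this chunk (the object name class_metrics matches how it is used later when tuning):

class_metrics <- metric_set(precision, recall, f_meas)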

Creating a nearest neighbours grid

We then need to specify a parameter grid.

πŸ‘‰ NOTE: This task often involves a lot of experimentation so do not take these values to be gospel for any future kNN models you build.


grid_knn <- tibble(neighbors = c(100, 500, 1000, 3000))

πŸ—£οΈ CLASSROOM DISCUSSION:

Which number in the grid would represent the most β€œcomplex” model?

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT:

Your teacher will present some output to see what happens when we vary k.

Estimating a cross-validated model

We now have all the information we need to estimate a cross-validated model. We have

  • A workflow, consisting of a model and a recipe
  • A series of 5 resampled data sets
  • A metric set containing several evaluation metrics and
  • A range of hyperparameter options to tune our models.

This will result in 20 different models (5 folds \(\times\) 4 hyperparameter values) being fitted, which may be slow on your laptop. To get the most out of your machine, we add doParallel::registerDoParallel(), which runs the models in parallel across multiple cores.

🚨 ATTENTION: You will need to have the doParallel package installed for doParallel::registerDoParallel() to work.


doParallel::registerDoParallel()

knn_model_cv <-
  tune_grid(
    wf_knn,
    resamples = ptsd_cv,
    metrics = class_metrics,
    grid = grid_knn,
    control = control_resamples(event_level = "second")
  )

πŸ‘‰ NOTE: This may take a while to run, so relax for a bit!

Plot hyperparameters

After knn_model_cv finishes running, try using the collect_metrics() function on the model object. This results in a tibble of cross-validated performance metrics for each hyperparameter value. Plot the results using a graph of your choice!


knn_model_cv %>% 
  collect_metrics()
  # Build a ggplot here
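For example, a line plot of the mean cross-validated score against the number of neighbours, with one line per metric (a sketch only; any sensible plot is fine):

knn_model_cv %>% 
  collect_metrics() %>% 
  ggplot(aes(as.factor(neighbors), mean, group = .metric, colour = .metric)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_line() +
  labs(x = "Number of neighbours (k)", y = "Mean cross-validated score",
       colour = "Metric")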

πŸ—£οΈ CLASSROOM DISCUSSION:

Which model is best?

Fit the best model to the full training set

We can then build a final model using the best hyperparameter combination. Let’s take the f1-score as the measure that we want to maximise.

# Select the best model

highest_f1 <- select_best(knn_model_cv, metric = "f_meas")

# Use the best parameter combination and fit on the entire training data

knn_model_final <- 
  wf_knn %>% 
  finalize_workflow(highest_f1) %>% 
  fit(ptsd_train)

Evaluate the model on the test set

From there, all we need to do is follow the evaluation step that we have done for previous models. Use f_meas, setting event_level = "second".


# Code here
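A possible solution, assuming ptsd_test was created in the earlier train / test split:

knn_model_final %>% 
  augment(new_data = ptsd_test) %>% 
  f_meas(ptsd, .pred_class, event_level = "second")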

Removing superfluous objects

rm(list = ls(pattern = "grid|high|knn|ptsd|metric"))
gc()

Predicting childcare costs in US counties with tree-based models (40 minutes)

We are going to see how tree-based models (a decision tree, a Random Forest and XGBoost) and hyperparameter tuning can help improve predictive performance on a regression task. We will take the following steps:

  • Perform some exploratory data analysis
  • Plot a decision tree
  • Build a random forest model
  • Tune the learning rate using XGBoost

Exploring the data set


# childcare <- CODE FOR LOADING THE DATA GOES HERE
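For instance, assuming the downloaded file was saved as data/childcare-costs.csv (the file name here is an assumption; use whatever name your downloaded file has):

childcare <- read_csv("data/childcare-costs.csv")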

Click here to see a full list of variables. Let’s explore the data with a few visualisations.

Median child care costs are ~ $96.53 a week per county


med_cost <- median(childcare$mcsa)

childcare %>% 
  ggplot(aes(mcsa)) +
  geom_histogram(bins = 40, fill = "lightgrey", colour = "black") +
  geom_vline(xintercept = med_cost, linewidth = 2, linetype = "dashed", colour = "red") +
  theme_histogram() +
  scale_x_continuous(labels = dollar) +
  scale_y_continuous(labels = comma) +
  labs(x = "Weekly cost of childcare (USD)", y = "Count")

The costs of childcare increase in counties with higher median incomes


childcare %>% 
  ggplot(aes(mhi_2018, mcsa)) +
  geom_point(alpha = 0.125, size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_scatter() +
  scale_x_continuous(labels = dollar) +
  scale_y_continuous(labels = dollar) +
  labs(x = "Household income (2018 USD)",
       y = "Weekly cost of childcare (USD)")

There is considerable variation in the costs of childcare based on the ethnic composition of the county

πŸ‘‰ NOTE: We adapted this graph from Julia Silge (click here for her analysis of the same data set).


childcare %>% 
  select(mcsa, starts_with("one_race_"), -one_race_other, two_races) %>%
  pivot_longer(-mcsa, names_to = "race", values_to = "perc") %>% 
  mutate(race = str_remove(race, "one\\_race\\_"),
         race = case_when(race == "w" ~ "Both White",
                          race == "b" ~ "Both Black",
                          race == "i" ~ "Both Native",
                          race == "a" ~ "Both Asian",
                          race == "h" ~ "Both Hispanic",
                          race == "two_races" ~ "Mixed Parents")) %>% 
  ggplot(aes(perc, mcsa)) +
  facet_wrap(. ~ race, scales = "free_x") +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_scatter() +
  labs(x = "Percent of racial family types in counties",
       y = "Weekly cost of childcare (USD)")

Create a grouped train / test split

If you have inspected childcare, you will see that it tracks `r length(unique(childcare$county_name))` counties from `r min(childcare$study_year)` to `r max(childcare$study_year)`. This poses an interesting challenge when we perform a train / test split. If we interpret childcare as purely cross-sectional and perform an initial split without taking the time-series dimension into account, we run the risk of:

  • Using future values to predict the past
  • Using information from the same county in the training and testing sets

Naturally, a few solutions follow from this issue:

  • Use earlier years to train the model and later years to test it.
  • Use some counties to train the model and other counties to test it.

We will opt for the latter. This can be done by using the group_initial_split function, which takes all the same parameters as initial_split and includes a group parameter. Try using this function below.


# Set a seed

set.seed(123)

# Create a grouped initial split

childcare_split <- 
  group_initial_split(childcare, 
                      group = county_name, 
                      prop = 0.75)

# Isolate the train and test sets

childcare_train <- training(childcare_split)
childcare_test <- testing(childcare_split)

We can also create cross-validation folds using group_vfold_cv. Try using this below to create a 5-fold cross-validation grid.


# Set a seed

set.seed(123)

# Create 5-fold cross-validation grid, grouped by county name

childcare_cv <- group_vfold_cv(childcare_train, group = county_name, v = 5)

Create a decision tree

Use the following steps

  • Use the decision_tree function
  • Set mode equal to β€œregression”
  • Set the engine to β€œrpart”

🚨 ATTENTION: You will need to have the rpart package installed for this algorithm to work.


dt_model <-
  decision_tree() %>% 
  set_mode("regression") %>% 
  set_engine("rpart")

Develop a recipe

The recipe is fairly straightforward. The only step we have added is to assign the county and state features as IDs, meaning they will not play a part in the model building process.


childcare_rec <-
  childcare_train %>% 
  recipe(mcsa ~ .) %>% 
  update_role(starts_with(c("county", "state")), new_role = "id")

Fit the model

As we are not experimenting with hyperparameters at this point, we can simply build a workflow and fit it directly to the training data.


dt_fit <-
  workflow() %>% 
  add_recipe(childcare_rec) %>% 
  add_model(dt_model) %>% 
  fit(data = childcare_train)

Describe the following plot

🚨 ATTENTION: You will need to have the rpart.plot package installed for this plot to render.


rpart.plot(extract_fit_engine(dt_fit), roundint=FALSE)

πŸ—£οΈ CLASSROOM DISCUSSION:

Which feature has the algorithm deemed most important for splitting the data? Can you explain what the tree is doing?

Evaluate the model on the test set

Let’s see how well our simple decision tree does. Create a metric set with the RMSE and r-squared metrics and use them to evaluate the decision tree.


regress_metrics <- metric_set(rsq, rmse)

dt_fit %>% 
  augment(new_data = childcare_test) %>% 
  regress_metrics(mcsa, .pred)

Instantiate a random forest model

  • Use the rand_forest function
  • Set mtry = round(sqrt(52), 0) and trees = 1e3

πŸ—£οΈ CLASSROOM DISCUSSION:

Why do we do this? What happens if we set mtry equal to the number of features?

  • Set the mode to β€œregression”
  • Set the engine to β€œranger”


rf_model <-
  rand_forest(mtry = round(sqrt(52), 0), trees = 1e3) %>% 
  set_mode("regression") %>% 
  set_engine("ranger")

Fit the model

Use the same syntax used earlier.

doParallel::registerDoParallel()

rf_fit <-
  workflow() %>% 
  add_recipe(childcare_rec) %>% 
  add_model(rf_model) %>% 
  fit(data = childcare_train)

Evaluate the model on the test set


rf_fit %>% 
  augment(new_data = childcare_test) %>% 
  regress_metrics(mcsa, .pred)

πŸ—£οΈ CLASSROOM DISCUSSION:

Do we see an improvement?

Instantiate an XGBoost model

  • Use the boost_tree function
  • Set learn_rate = tune()
  • Set mtry = round(sqrt(52), 0) and trees = 1e3
  • Set the mode to β€œregression”
  • Set the engine to β€œxgboost”

🚨 ATTENTION: You will need to have the xgboost package installed for this algorithm to work.


xgb_model <-
  boost_tree(mtry = round(sqrt(52), 0), trees = 1e3,
             learn_rate = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost")

Create a workflow

Use a similar format to the chunk used to create the kNN workflow.


xgb_wf <-
  workflow() %>% 
  add_recipe(childcare_rec) %>% 
  add_model(xgb_model)

Create a hyperparameter tuning grid

Use c(0.0001, 0.001, 0.01, 0.1, 0.2, 0.3) as the experimental values. Remember that all you need to do is create a tibble of these values, using the exact same hyperparameter name as in the model, for this grid to work.


xgb_grid <- tibble(learn_rate = c(0.0001, 0.001, 0.01, 0.1, 0.2, 0.3))

Estimating a cross-validated model

We now have all the information we need to estimate a cross-validated model. We have

  • A workflow, consisting of a model and a recipe
  • A series of 5 resampled data sets
  • A metric set containing two evaluation metrics and
  • A range of hyperparameter options to tune our models.

doParallel::registerDoParallel()

xgb_cv_fit <-
  tune_grid(
    xgb_wf,
    resamples = childcare_cv,
    metrics = regress_metrics,
    grid = xgb_grid
  )

Which learning rate produces the best RMSE?

Produce a graph that visualises the best result.


xgb_cv_fit %>% 
  collect_metrics() %>%
  filter(.metric == "rmse") %>% 
  ggplot(aes(as.factor(learn_rate), mean, group = 1)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_line() +
  labs(x = "Learn rate", y = "Mean Cross-Validated RMSE")

Select and estimate the model with the lowest RMSE


lowest_rmse <- select_best(xgb_cv_fit, metric = "rmse")

xgb_final_fit <- 
  xgb_wf %>% 
  finalize_workflow(lowest_rmse) %>% 
  fit(childcare_train)

Evaluate the model on the test set


# Code here
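A possible solution, mirroring the earlier decision tree and random forest evaluations:

xgb_final_fit %>% 
  augment(new_data = childcare_test) %>% 
  regress_metrics(mcsa, .pred)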

πŸ—£οΈ CLASSROOM DISCUSSION:

How should we rank each model?

(Bonus) Create a graph that summarises our findings!

A bar graph will provide an excellent visual summary. We encourage you to use something like this in your summative work if you are asked to compare different ML models.


# Code here
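One possible sketch: collect the test-set metrics for all three models into a single tibble and plot them side by side (the model labels are our own):

model_metrics <- bind_rows(
  dt_fit %>% 
    augment(new_data = childcare_test) %>% 
    regress_metrics(mcsa, .pred) %>% 
    mutate(model = "Decision tree"),
  rf_fit %>% 
    augment(new_data = childcare_test) %>% 
    regress_metrics(mcsa, .pred) %>% 
    mutate(model = "Random forest"),
  xgb_final_fit %>% 
    augment(new_data = childcare_test) %>% 
    regress_metrics(mcsa, .pred) %>% 
    mutate(model = "XGBoost")
)

model_metrics %>% 
  ggplot(aes(model, .estimate, fill = model)) +
  facet_wrap(. ~ .metric, scales = "free_y") +
  geom_col(show.legend = FALSE) +
  theme_bar() +
  labs(x = NULL, y = "Test-set estimate")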