πŸ›£οΈ Week 08 - Lab Roadmap (90 min)

Comparing models and model evaluation

Author

Dr Stuart Bramwell

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Evaluate models using the bias-variance tradeoff
  • Use the tune_grid() function to perform a grid search
  • Fit support vector machines
  • Use ensemble methods

πŸ“‹ Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.

Part 0: Export your chat logs (~ 3 min)

As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:

🎯 ACTION POINTS

  1. πŸ”— CLICK HERE to export your chat log.

    Thanks for being GENIAL! You are now one step closer to earning some prizes! 🎟️

πŸ‘‰ NOTE: You MUST complete the initial form.

If you really don’t want to participate in GENIAL1, just answer β€˜No’ to the Terms & Conditions question - your e-mail address will be deleted from GENIAL’s database the following week.

πŸ“š Preparation

We will be using the same corruption-related dataset as in last week’s lab.

Use the link below to download the lab materials:

We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.

Import required libraries:

# Tidyverse packages we will use
library(ggplot2)
library(dplyr)     
library(tidyr)     
library(readr)     

# Tidymodel packages we will use
library(rsample)
library(yardstick) 
library(parsnip)   
library(recipes)   
library(workflows) 
library(rpart)
library(rpart.plot)
library(vip)
library(tune)

library(randomForest)
# New packages for SVM
library(LiblineaR)
library(kernlab)

Read the data set:

It is the same dataset you downloaded last week.

# Modify the filepath if needed
filepath <- "data/corruption_data_2019_nomissing.csv"
corruption_data_2019 <- read_csv(filepath)

Part I - Explore decision trees, overfitting, and the bias-variance tradeoff (20 min)

In this lab, we’ll learn how to assess overfitting on decision trees, compare different models, and evaluate a model with respect to the bias-variance tradeoff. Let’s begin by going back to decision trees.

πŸ§‘β€πŸ« TEACHING MOMENT:

(Your class teacher will guide you through this section. Just run all the code chunks below together with your class teacher.)

Our goal in this lab is to explore a diverse set of models and different methods for judging whether a model is over- or under-fitting a dataset. We will do this by building models to predict a country’s level of corruption, using a variable similar to the one we created in last week’s lab (this week we are aiming for a binary classification problem rather than a multi-class one). We’ll call this new variable corruption_poor.

🎯 ACTION POINTS:

  1. Create the corruption_poor column by running the following code.
corruption_data_2019 <- 
  corruption_data_2019 %>% 
  mutate(corruption_poor = if_else(cpi_score < 50, "poor", "not poor"),
         corruption_poor = as.factor(corruption_poor))
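A quick sanity check you might run at this point (an addition to the lab, not required) is to count how many countries fall into each class:

# An optional check (not part of the original lab): how many countries fall into each class?
corruption_data_2019 %>% 
  count(corruption_poor)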
  2. Let’s then select the columns we will use to analyse the data.
corruption_data_2019 <- 
  corruption_data_2019 %>% 
  select(corruption_poor, property_rights:education_index)
  3. Let’s again randomly split our dataset into a training set (containing 70% of the rows in our data) and a test set (containing 30% of the rows in our data), and retrieve our resulting training and testing sets.
set.seed(123)
# Randomly split the initial data frame into training and testing sets (70% and 30% of rows, respectively)
split <- initial_split(corruption_data_2019, prop = 0.7, strata = corruption_poor)
training_data <- training(split)
testing_data <- testing(split)
  4. Create a recipe, a decision tree model specification, and wrap them up into a workflow.
# Create a recipe
cpi_rec <-
  recipe(corruption_poor ~ .,
         data = training_data) %>%
  prep()

# Create the specification of a model
dt_spec <- 
    decision_tree(mode = "classification", 
    # You must specify the parameter you want to tune
    min_n = tune()) %>% 
    set_engine("rpart")

wflow <- 
  workflow() %>% 
  add_recipe(cpi_rec) %>% 
  add_model(dt_spec)

πŸ§‘β€πŸ« TEACHING MOMENT: Your class teacher will briefly explain the concept of bias-variance tradeoff.

  5. The bias-variance tradeoff is an essential concept to consider when choosing a model. Bias describes the difference between the model’s average prediction and the true value we are trying to predict. A model with high bias is said to be underfitting the data: it makes simplistic assumptions about the training data, which makes it difficult to learn the underlying pattern. Variance describes how much the model’s predictions change when it is trained on different samples of the data. High variance typically means we are overfitting to our training data, finding patterns and complexity that are a product of randomness rather than a real trend, so the model generalises poorly. Ideally, we are looking for a model with low bias and low variance.

To understand the bias-variance tradeoff, we will vary a hyperparameter of the decision tree to produce tree models of different complexity for comparison. We will do this using the tune_grid() function from the tune package, which uses grid search to train one model for each of the parameter values you have chosen. The hyperparameter we are varying here is min_n: the minimum number of observations that must be present in a node for it to be split further.

set.seed(234)
folds <- vfold_cv(training_data, v = 5)

ctrl <- control_grid(verbose = FALSE, save_pred = TRUE)

# Create a grid specifying the min number we want to try
grid_search <- expand_grid(
  min_n = c(3:8)
)

# This will take a little while
dt_res <- tune_grid(
  wflow,
  # This computes k-fold CV during tuning
  resamples = folds,
  grid = grid_search,
  # Keep the out-of-sample predictions for each resample during tuning (ctrl was defined above)
  control = ctrl
)

If we want to find out which value of the parameter produced the best model, we can run the following command.

show_best(dt_res, metric = "roc_auc")
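To carry the winning value forward, a minimal sketch (an addition to the lab) is to pull out the best min_n with select_best(), plug it back into the workflow with finalize_workflow(), and refit on the full training set:

# A minimal sketch (not part of the original lab): refit using the best min_n
best_params <- select_best(dt_res, metric = "roc_auc")

final_dt_fit <- 
  wflow %>% 
  finalize_workflow(best_params) %>% 
  fit(training_data)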
  6. Plot the values for the roc_auc metric for each value of min_n we have tried with the grid search.
dt_res %>%
  autoplot(metric="roc_auc") +
  theme_bw() +
  labs(y = "ROC-AUC")
  7. Now visualise the out-of-sample performance for each value of min_n we have tried - i.e. the ROC-AUC computed from the validation-set predictions.
collect_predictions(dt_res) %>%
  group_by(min_n) %>%
  # "poor" is the second factor level, so we pair .pred_poor with event_level = "second"
  roc_auc(truth = corruption_poor, .pred_poor, event_level = "second") %>%
  ggplot() +
  geom_point(aes(x = min_n, y = .estimate)) +
  geom_line(aes(x = min_n, y = .estimate))
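The curve above only shows validation performance. To see training and test performance side by side for the discussion below, one hedged sketch (an addition to the lab; the helper compare_fit() is ours, not part of any package) is to refit the tree at each min_n and score both splits:

# A hedged sketch (not part of the original lab): refit the tree at each min_n
# on the full training set and compare training vs test ROC-AUC.
library(purrr)

compare_fit <- function(n) {
  fitted <- wflow %>%
    finalize_workflow(tibble(min_n = n)) %>%
    fit(training_data)
  bind_rows(
    fitted %>% augment(training_data) %>%
      roc_auc(truth = corruption_poor, .pred_poor, event_level = "second") %>%
      mutate(split = "training"),
    fitted %>% augment(testing_data) %>%
      roc_auc(truth = corruption_poor, .pred_poor, event_level = "second") %>%
      mutate(split = "testing")
  ) %>%
    mutate(min_n = n)
}

map_dfr(3:8, compare_fit) %>%
  ggplot(aes(x = min_n, y = .estimate, colour = split)) +
  geom_point() +
  geom_line() +
  labs(y = "ROC-AUC")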

πŸ—£οΈ DISCUSSION:

We trained different models using a variety of values for our min_n parameter. Looking at the training vs testing performance at each value, how do you think the parameter changes the overfitting or underfitting of the model?

Part II - Comparing models (30 min)

πŸ§‘β€πŸ« TEACHING MOMENT: Your class teacher will briefly explain the concept of decision boundaries and how different models produce different decision boundaries.

So far we have looked at decision trees. But how do other models fare at predicting a country’s level of corruption?

Create a new recipe, model specification, and workflow for the support vector machine (SVM) model. Note that this one uses a linear kernel; SVMs can also be used with polynomial, radial, and sigmoid kernels.

# SVMs do not accept NAs or categorical variables
# Create a recipe that drops rows with missing values in the numeric predictors
cpi_rec <-
  recipe(corruption_poor ~ ., data = training_data) %>%
  step_naomit(all_numeric_predictors(), skip = FALSE) %>%
  prep()

# Create the specification of a support vector machine (SVM) model
svm_spec <- 
  svm_linear(mode = "classification") %>%
  set_engine("LiblineaR")

wflow_svm <- 
  workflow() %>% 
  add_recipe(cpi_rec) %>% 
  add_model(svm_spec)

Now that you have a workflow to fit an SVM model:

  1. Fit the model to the training set and evaluate it on the test set, calculating an appropriate metric (e.g. a confusion matrix or ROC-AUC). How does it compare to the decision tree?

model <- 
  wflow_svm %>% 
  fit(training_data)

model %>% 
  augment(new_data = cpi_rec %>% bake(testing_data)) %>% 
  f_meas(truth = corruption_poor, estimate = .pred_class)
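As another point of comparison (an addition to the lab), you could also look at the confusion matrix for the same test-set predictions. Note that the LiblineaR engine does not produce class probabilities, so ROC curves are not available for this particular fit:

# A hedged sketch (not part of the original lab): confusion matrix on the test set
model %>% 
  augment(new_data = cpi_rec %>% bake(testing_data)) %>% 
  conf_mat(truth = corruption_poor, estimate = .pred_class)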
  2. Train an SVM model using a radial kernel.
svm_spec <- 
  svm_rbf() %>%
  set_mode("classification") %>% 
  set_engine("kernlab")

wflow_svm <- 
  workflow() %>% 
  add_recipe(cpi_rec) %>% 
  add_model(svm_spec)

model <- 
  wflow_svm %>% 
  fit(training_data)

model %>% 
  augment(new_data = cpi_rec %>% bake(testing_data)) %>% 
  f_meas(truth = corruption_poor, estimate = .pred_class)
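Unlike LiblineaR, the kernlab engine can return class probabilities, so a hedged sketch (an addition to the lab) for also comparing models on ROC-AUC is:

# A hedged sketch (not part of the original lab): ROC-AUC on the test set
# ("poor" is the second factor level, hence event_level = "second")
model %>% 
  augment(new_data = cpi_rec %>% bake(testing_data)) %>% 
  roc_auc(truth = corruption_poor, .pred_poor, event_level = "second")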

Part III - Ensemble methods (40 min)

πŸ§‘β€πŸ« TEACHING MOMENT: Your class teacher will briefly explain the concept of ensemble methods.

While we can tweak hyperparameters to reduce overfitting and underfitting and so improve the bias-variance tradeoff in decision trees, we also have techniques such as β€˜ensemble methods’ which can help to improve modelling results. Ensemble learning improves predictions by combining several models, which can lead to better predictive performance than a single model alone. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.

Ensemble algorithms such as bootstrap aggregation (bagging) and boosting aim to reduce variance at the small cost of a little extra bias in decision trees.

We suggest you have these tabs open in your browser:

  1. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
  2. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)

This is a model specification for boosting decision trees (this one is for XGBoost).

boost_spec <- boost_tree(trees = 200, tree_depth = 4) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
  1. Apply boosting to our dataset using a workflow.
  2. How does this compare to decision trees?
  3. Tune the hyperparameters using a grid search to improve your model.

(Bonus)

  1. Apply bagging to decision trees to try and reduce overfitting. Can you also do it while tuning hyperparameters?
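One hedged sketch to get you started (an addition to the lab, and not necessarily the intended solution): bagging can be approximated with rand_forest() by letting every split consider all of the predictors, using the randomForest engine loaded earlier.

# A hedged sketch (not part of the original lab): a bagging-style ensemble via
# rand_forest(), with mtry set to the number of predictors so that each tree
# considers every predictor at every split
n_predictors <- ncol(training_data) - 1  # all columns except the outcome

bag_spec <- 
  rand_forest(mode = "classification", mtry = n_predictors, trees = 500) %>% 
  set_engine("randomForest")

wflow_bag <- 
  workflow() %>% 
  add_recipe(cpi_rec) %>% 
  add_model(bag_spec)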

Footnotes

  1. We’re gonna cry a little bit, not gonna lie. But no hard feelings. We’ll get over it.β†©οΈŽ