πŸ›£οΈ Week 08 - Lab Roadmap (90 min)

Comparing models and model evaluation

Author

Tabtim Duenger

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Evaluate models using the bias-variance tradeoff
  • Use the tune_grid() function to perform a grid search
  • Fit support vector machines
  • Use ensemble methods

I am part of the GENIAL project

This week you can still use ChatGPT if you like but you will not be asked to do so. Go about the lab as usual, freely interacting with others, the lab material, the Web or any other resource you may find useful.

The only thing we ask is that you fill out the brief survey at the end of the lab: πŸ”— link, and when asked whether you were asked to use ChatGPT, please answer no.

Thanks for being a GENIAL participant!

πŸ“š Preparation

We will be using the same World Values Survey dataset as in last week’s lab.

Use the link below to download the lab materials:

We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.

πŸ“‹ Lab Tasks

Here are the instructions for this lab:

Import required libraries:

# Tidyverse packages we will use
library(ggplot2)   # plotting
library(dplyr)     # data wrangling verbs
library(tidyr)     # data reshaping (also provides expand_grid())
library(readr)     # reading CSV files

# Tidymodels packages we will use
library(rsample)   # train/test splits and cross-validation folds
library(yardstick) # model performance metrics
library(parsnip)   # model specifications
library(recipes)   # pre-processing recipes
library(workflows) # bundling recipes and models together
library(rpart)     # decision tree engine
library(tune)      # hyperparameter tuning

# New packages for SVM
library(LiblineaR) # linear SVM engine
library(kernlab)   # kernel SVM engine
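If you have not used the two SVM engines before, you may need to install them first (a one-off step; skip it if they are already on your machine):

# Run once if needed, then comment it out again
# install.packages(c("LiblineaR", "kernlab"))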

Read the data set:

This is the same dataset you downloaded last week.

# Modify the filepath if needed
filepath <- "data/WVS_Wave7_modified.csv"
wvs_data <- read_csv(filepath)

Part I - Explore decision trees, overfitting, and the bias-variance tradeoff (20 min)

In this lab, we’ll learn how to assess overfitting on decision trees, compare different models, and evaluate a model with respect to the bias-variance tradeoff. Let’s begin by going back to decision trees.

πŸ§‘β€πŸ« TEACHING MOMENT:

(Your class teacher will guide you through this section. Just run all the code chunks below together with your class teacher.)

Our goal in this lab is to explore a variety of models and different methods for assessing whether a model is over- or under-fitting a dataset. We will do this by predicting the same variable we created in last week’s lab: someone’s trust in institutions (i.e., police, army, justice courts, press and television, labor unions, civil services, political parties, parliament, and government), represented in our dataset by the variable wvs_data["TRUST_INSTITUTIONS"]. See last week’s lab for a refresher on the logic behind how we created this variable.

🎯 ACTION POINTS:

  1. Create the TRUST_INSTITUTIONS column by running the following code.
wvs_data <- 
    wvs_data %>%
    rowwise() %>% 
    mutate(MEAN_I_TRUST_INSTITUTIONS = mean(c(I_TRUSTARMY, I_TRUSTCIVILSERVICES, I_TRUSTPOLICE, I_TRUSTCOURTS, 
                                            I_TRUSTPRESS, I_TRUSTTELEVISION, I_TRUSTUNIONS, I_TRUSTGVT, 
                                            I_TRUSTPARTIES, I_TRUSTPARLIAMENT), 
                                            na.rm = TRUE),
           TRUST_INSTITUTIONS = (1 - MEAN_I_TRUST_INSTITUTIONS) > 0.5,
           TRUST_INSTITUTIONS = factor(TRUST_INSTITUTIONS,
                                       labels=c("No", "Yes"),
                                       levels=c(FALSE, TRUE))) %>%
    ungroup()
  2. Remove the columns we used to compute the TRUST_INSTITUTIONS column.
cols_to_remove <- c("I_TRUSTARMY", "I_TRUSTCIVILSERVICES", "I_TRUSTPOLICE", "I_TRUSTCOURTS", 
                    "I_TRUSTPRESS", "I_TRUSTTELEVISION", "I_TRUSTUNIONS", "I_TRUSTGVT", 
                    "I_TRUSTPARTIES", "I_TRUSTPARLIAMENT", "MEAN_I_TRUST_INSTITUTIONS",
                    "D_INTERVIEW", "W_WEIGHT", "S018", "Q_MODE", "K_DURATION",
                    "Q65", "Q67", "Q68", "Q69", "Q70", "Q71", "Q72", "Q73", "Q74",
                    "Q275","Q276","Q277","Q278","Q275A","Q276A","Q277A","Q278A")

# Filter data to remove unnecessary columns
wvs_data <- 
    wvs_data %>% 
    select(-all_of(cols_to_remove))
  3. Let’s again randomly split our dataset into a training set (containing 70% of the rows in our data) and a test set (containing the remaining 30%), and retrieve the resulting training and testing sets.
set.seed(123)
# Randomly split the initial data frame into training and testing sets (70% and 30% of rows, respectively)
split <- initial_split(wvs_data, prop = 0.7)
training_data <- training(split)
testing_data <- testing(split)
  4. Create a recipe, a decision tree model specification, and wrap them up into a workflow.
# Create a recipe
# (no need to prep() it here: the workflow will prep it when the model is fitted)
wvs_rec <-
  recipe(TRUST_INSTITUTIONS ~ .,
         data = training_data)

# Create the specification of a model
dt_spec <- 
    decision_tree(mode = "classification", 
                  # You must specify the parameter you want to tune
                  min_n = tune()) %>% 
    set_engine("rpart")

wflow <- 
  workflow() %>% 
  add_recipe(wvs_rec) %>% 
  add_model(dt_spec)

πŸ§‘β€πŸ« TEACHING MOMENT: Your class teacher will briefly explain the concept of bias-variance tradeoff.

  5. The bias-variance tradeoff is an essential concept to consider when choosing a model. Bias describes the difference between the model’s average prediction and the true value we are trying to predict. A model with high bias is said to be underfitting the data: it makes simplistic assumptions about the training data, which makes it difficult to learn the underlying pattern. Variance captures how much the model’s predictions change when it is trained on different samples of data. High variance typically means that we are overfitting to our training data, finding patterns and complexity that are a product of randomness rather than a real trend. Ideally, we are looking for a model with low bias and low variance.

In order to understand the bias-variance tradeoff, we will vary a hyperparameter of the decision tree to produce decision trees of different complexity for comparison. We will do this using the tune_grid() function from the tune package, which uses grid search to train different models based on the parameter values you have chosen. The hyperparameter we are varying here is min_n: the minimum number of observations a node must contain for it to be split further.

set.seed(234)
folds <- vfold_cv(training_data, v = 5)

ctrl <- control_grid(verbose = FALSE, save_pred = TRUE)

# Create a grid specifying the min_n values we want to try
grid_search <- expand_grid(
  min_n = seq(1, 500, length.out = 5)
)

# This will take a little while
dt_res <- tune_grid(
  wflow,
  # This computes k-fold CV during tuning
  resamples = folds,
  grid = grid_search,
  # Making sure we keep the out-of-sample predictions for each resample during tuning
  control = ctrl
)

If we want to find out which parameter value produced the best models, we can run the following command.

show_best(dt_res, metric = "roc_auc")
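If you then want to refit a single tree using the best value found, one option is the sketch below (a minimal sketch: select_best() and finalize_workflow() come from the tune package, and the object names best_params, final_wflow and final_fit are our own).

# Pick the best min_n according to ROC AUC
best_params <- select_best(dt_res, metric = "roc_auc")

# Plug it back into the workflow and fit on the full training set
final_wflow <- finalize_workflow(wflow, best_params)
final_fit <- fit(final_wflow, data = training_data)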
  6. Plot the values of the roc_auc metric for each value of min_n we have tried with the grid search.
dt_res %>%
  autoplot(metric="roc_auc")
  7. Now visualise the out-of-sample performance for each value of min_n we have tried, i.e. the ROC AUC computed from the validation-set predictions collected during tuning.
collect_predictions(dt_res)  %>%
  group_by(min_n) %>%
  roc_auc(truth=TRUST_INSTITUTIONS, .pred_Yes, event_level="second") %>%
  ggplot() +
  geom_point(aes(x=min_n, y=.estimate)) +
  geom_line(aes(x=min_n, y=.estimate))

πŸ—£οΈ DISCUSSION:

We trained different models using a variety of values for our min_n parameter. Looking at how performance changes at each value (validation performance above, and training performance if you run the sketch below), how do you think this parameter affects whether the model overfits or underfits?
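If you would like to compare the validation results against training-set performance directly, here is a rough sketch that reuses the final_fit object from the sketch after show_best() above (this assumes you ran that code; you could also refit with other min_n values and compare).

# Training-set (resubstitution) ROC AUC for the refitted tree
training_data %>%
  bind_cols(predict(final_fit, new_data = training_data, type = "prob")) %>%
  roc_auc(truth = TRUST_INSTITUTIONS, .pred_Yes, event_level = "second")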

Part II - Comparing models (30 min)

πŸ§‘β€πŸ« TEACHING MOMENT: Your class teacher will briefly explain the concept of decision boundaries and how different models produce different decision boundaries.

So far we have looked at decision trees. But how do other models compare when it comes to predicting someone’s trust in institutions?

Create a new recipe, model specification, and workflow for the support vector machine (SVM) model. Note that this uses a linear kernel; SVMs can also be used with polynomial kernels.

# SVMs do not accept NAs or categorical variables,
# so this recipe removes categorical predictors and rows with missing values
# (again, no need to prep() it: the workflow will do that when it is fitted)
wvs_rec <-
  recipe(TRUST_INSTITUTIONS ~ .,
         data = training_data) %>%
  step_rm(all_nominal_predictors()) %>%
  step_naomit(everything(), skip = FALSE)

# Create the specification of a support vector machine (SVM) model
svm_spec <- 
    svm_linear(mode = "classification") %>%
    set_engine("LiblineaR")

wflow_svm <- 
  workflow() %>% 
  add_recipe(wvs_rec) %>% 
  add_model(svm_spec)

Now that you have a workflow to fit an SVM model:

  1. Fit the model on the training set and evaluate it on the test set, calculating an appropriate metric (e.g. a confusion matrix or ROC/AUC curve); a minimal starting-point sketch follows after this list. How does it compare to the decision tree?
  2. Change a parameter of the SVM model and check whether it improves the predictions (again, e.g. with a confusion matrix or AUC/ROC curve).
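Here is a minimal sketch of one way to start step 1 (the object names svm_fit, testing_complete and svm_preds are our own; adapt as you see fit). Note that the LiblineaR engine only returns class predictions, not probabilities, so this sketch sticks to class-based metrics; for a ROC curve you would need an engine that produces probabilities (e.g. kernlab).

# Fit the SVM workflow on the training set
svm_fit <- fit(wflow_svm, data = training_data)

# The recipe drops rows with missing values, so drop them from the test set too
# before predicting, keeping predictions and true labels aligned
# (slightly stricter than the recipe's own filtering, but fine for a quick check)
testing_complete <- testing_data %>% drop_na()

# Add class predictions to the test set
svm_preds <- 
  testing_complete %>% 
  bind_cols(predict(svm_fit, new_data = testing_complete))

# Confusion matrix and accuracy on the test set
svm_preds %>% conf_mat(truth = TRUST_INSTITUTIONS, estimate = .pred_class)
svm_preds %>% accuracy(truth = TRUST_INSTITUTIONS, estimate = .pred_class)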

πŸ—£οΈ DISCUSSION:

You will have noticed a difference in the pre-processing of our dataframe for the SVM model. What do you think are the pros and cons of the various algorithms, and of the assumptions they make? Given that different algorithms have different pre-processing requirements, does this influence your view on the cases in which each is best used?

(Bonus)

  1. Train an SVM model using a polynomial kernel (a starting-point sketch follows below).
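As a hint (not a full solution), a polynomial-kernel specification might look like the sketch below; svm_poly() comes from parsnip, kernlab is one engine that supports it, and the degree value is an arbitrary choice.

# A polynomial-kernel SVM specification (kernlab engine)
svm_poly_spec <- 
  svm_poly(mode = "classification", degree = 2) %>% 
  set_engine("kernlab")

# Reuse the SVM recipe in a new workflow
wflow_svm_poly <- 
  workflow() %>% 
  add_recipe(wvs_rec) %>% 
  add_model(svm_poly_spec)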

Part III - Ensemble methods (40 min)

πŸ§‘β€πŸ« TEACHING MOMENT: Your class teacher will briefly explain the concept of ensemble methods.

While we can tweak hyperparameters to reduce overfitting and underfitting and thus improve the bias-variance tradeoff in decision trees, we can also use techniques known as β€˜ensemble methods’ to improve modelling results. Ensemble learning improves predictions by combining several models, which can lead to better predictive performance than a single model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.

Ensemble algorithms such as bootstrap aggregation (bagging) and boosting aim to reduce variance at the small cost of a little extra bias in decision trees. A minimal bagging specification is sketched just below; a boosting specification follows further down.
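For reference, bagged trees are available in tidymodels through the baguette package (not loaded above, so this sketch assumes you install it separately); a minimal bagged decision tree specification might look like this:

# install.packages("baguette")  # run once if the package is not installed
library(baguette)

# A bagged decision tree: 25 trees, each fit on a bootstrap resample of the
# training data, with their predictions combined by voting
bag_spec <- 
  bag_tree(min_n = 10) %>% 
  set_engine("rpart", times = 25) %>% 
  set_mode("classification")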

We suggest you have these tabs open in your browser:

  1. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
  2. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)

This is a model specification for boosting decision trees.

# Note: this uses the xgboost engine, so the xgboost package needs to be installed
boost_spec <- boost_tree(trees = 200, tree_depth = 4) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
  1. Apply boosting to our dataset using a workflow.
  2. How does this compare to decision trees?
  3. Tune the hyperparameters using a grid search to improve your model.

(Bonus)

  1. Apply bagging to decision trees to try to reduce overfitting.

  • Fill out this brief survey at the end of the lab if you are part of GENIAL: πŸ”— link (requires LSE login).