โœจ Bonus Lab (Optional) - Cross-validation

DS202 - Data Science for Social Scientists

Author

Jon Cardoso-Silva

Published

14 November 2022

Update 22/11/2022

This notebook has been modified. It now contains model solutions too.

I show the code only, no plots, otherwise it would make building this website too slow!

Context

This page is an (optional) continuation of the 💻 Week 08 lab. Originally, we intended it to be part of the W08 lab but decided against it, as it would have made the lab too cluttered.

Although optional, we think the exercises here are a great way to solidify your knowledge of SVMs and their parameters.

Topics

Here we show an alternative way to perform k-fold cross-validation using tidymodels instead of the cv.glm we saw in W05 lab.

Setup

๐Ÿ’ก Some of you have mentioned that your R version cannot handle tidymodels. We recommend you update R to version 4.2.0 or above.

Packages you will need

library('ISLR2')       # for the data
library('tidyverse')   # to use things like the pipe (%>%)
library('e1071')       # for SVM model
library('tidymodels')  # for model tuning, cross-validation etc.

# Vanity packages:
library('GGally')      # for pretty correlation plot
library('ggsci')       # for pretty plot colours
library('cvms')        # for pretty confusion matrix plots

The Data

๐ŸŠOrange Juice

This week we will use a different ISLR2 dataset: OJ. We will perform a classification task with the goal of predicting the Purchase column.

The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.

ISLR2::OJ %>% head()

To understand what each variable represents, open the R Console, type the following and hit ENTER:

?ISLR2::OJ

Which variables can help us distinguish the two different brands?

To simplify our plots later on, letโ€™s focus on just two predictors:

plot_df <- ISLR2::OJ %>% select(Purchase, LoyalCH, PriceDiff)

g = (
  ggpairs(plot_df, aes(colour=Purchase))
  
  # Customizing the plot
  + scale_colour_startrek()
  + scale_fill_startrek()
  + theme_bw()
)
g 

๐Ÿ“ˆ Stock Market

We will also use the Smarket dataset from the ISLR2 package. We will perform a regression task with the goal of predicting the percentage return of the S&P 500 stock index on any given day, as represented by the Today column.

Daily percentage returns for the S&P 500 stock index between 2001 and 2005.

ISLR2::Smarket %>% head()
plot_df <- ISLR2::Smarket %>% select(Today, Volume, Lag1, Lag2)

g = (
  ggpairs(plot_df)
  
  # Customizing the plot
  + scale_colour_startrek()
  + scale_fill_startrek()
  + theme_bw()
)
g 

To understand what each variable represents, open the R Console, type the following and hit ENTER:

?ISLR2::Smarket

Step 4: k-fold cross validation with tidymodels

Here we will replicate the cross-validation setup used to generate the structure shown in the 🗓️ Week 04 lecture/workshop.

โฎ๏ธRecap:

But how exactly can cross-validation help overcome overfitting? This is what we will explore in this section of the lab.

Step 4.1: Create training / test split

We start by creating a training / test split of the 📈 Smarket data using the initial_split function.


set.seed(123)

# Keep only Today, Volume and Lag1 (dropping Direction, otherwise we would be "cheating")
filtered_data <- ISLR2::Smarket %>% select(Today, Volume, Lag1)

default_split <- initial_split(filtered_data, prop = 0.75, strata = Today)

internal_validation_set <- training(default_split)

external_validation_set <- testing(default_split)

How many samples are in the internal validation set?

nrow(internal_validation_set)

How many samples were left in the external validation set?

nrow(external_validation_set)

The external validation set will only be used at the end

Step 4.2: Create resampling folds for cross validation

Next, letโ€™s create 10-fold cross-validation data using internal_validation_set. We can achieve this by using the vfold_cv command, specifying v = 10.

k_folds <- vfold_cv(internal_validation_set, v = 10)
k_folds

Notice anything odd? The output is a tibble, but the first column, splits, is a series of lists. This is another thing that makes tibbles different from data frames: you can nest lists within tibbles but not within data frames. We will use this more explicitly in the next lab, when we build k-means clustering models.
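If you are curious, you can peek inside one of these splits. Here is a quick sketch using rsample's analysis() and assessment() helpers (loaded as part of tidymodels):

# Take the first of the 10 splits
first_split <- k_folds$splits[[1]]

# The 'analysis' portion is used to fit the model in that fold...
analysis(first_split) %>% nrow()

# ...and the 'assessment' portion is held out to evaluate it
assessment(first_split) %>% nrow()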

Step 4.3: ๐ŸณSpecifying a recipe

The next step is to create a recipe. Luckily, the recipe function takes the same kind of arguments as lm: a formula and a dataset! We first create the formula Today ~ . and then pass data = internal_validation_set. Printing smarket_recipe, we see one outcome and two predictors.


smarket_recipe <- recipe(Today ~ ., data = internal_validation_set)
smarket_recipe
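We keep this recipe as simple as possible, but recipes can also carry preprocessing steps. For example, if we wanted to standardise all numeric predictors (often a good idea for SVMs), a sketch would look like the one below; we do not use it in the rest of this lab:

# (Not used below) The same recipe with a normalisation step added
smarket_recipe_normalised <-
  recipe(Today ~ ., data = internal_validation_set) %>% 
  step_normalize(all_numeric_predictors())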

Step 4.4: Specify a model

Next, we will specify a support vector machine model. Hereโ€™s where things get a bit more involved.

We specify a radial basis function SVM. svm_rbf takes two hyperparameters: cost and rbf_sigma. Instead of specifying a single value for each, we will instead set them equal to tune(). This indicates that we want to try a range of different values.


svm_regressor <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>% 
  set_mode('regression') 

Step 4.5: Create a hyperparameter grid

Which values for cost and rbf_sigma should we choose? It is often hard to tell, so instead we can experiment with different values.

We can use grid_regular to create a tibble of different hyperparameter combinations. levels = 5 indicates that we want to try out five different values for each hyperparameter.


set.seed(234)

svm_grid <- grid_regular(cost(), rbf_sigma(), levels = 5)
svm_grid
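With two hyperparameters and five levels each, the grid should contain 5 x 5 = 25 combinations. A quick sanity check:

# 5 levels of cost x 5 levels of rbf_sigma = 25 candidate combinations
nrow(svm_grid)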

Step 4.6: Perform cross-validation

We now have all we need to run cross-validation, and the function tune_grid puts everything together. Let’s think intuitively about what this command is doing.

  • We are telling tune_grid that we want to run a regression model on a recipe using different resampling folds.

  • Instead of specifying hyperparameter values we want to run a combination of different values.

  • After this, we want to choose a metric to evaluate different combinations.

  • We opt for rmse, but we could specify several metrics at once using the metric_set command.

โš ๏ธ As you can probably notice, the code below takes quite some time to run. This is expected; after all, we are training A LOT of models to tune the parameters.


smarket_tuned <-
  tune_grid(object = svm_regressor,
            preprocessor = smarket_recipe,
            resamples = k_folds,
            grid = svm_grid,
            metrics = metric_set(rmse),
            control = control_grid(save_pred=TRUE))
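If the wait becomes tedious, one option (a sketch, assuming you have the doParallel package installed) is to register a parallel backend before calling tune_grid; tune will then fit the folds in parallel:

# Register a parallel backend so that tune_grid() can fit folds in parallel
library(doParallel)
registerDoParallel(cores = parallel::detectCores() - 1)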

Step 4.7: Which combination of hyperparameters works best?

Now that we have tuned our models, let’s find out which combination of hyperparameters works best. We can create a ggplot easily using the autoplot command.


smarket_tuned %>% 
  autoplot() +
  theme_minimal() +
  ggsci::scale_color_jco() +
  labs(y = 'RMSE', colour = 'Sigma')

๐ŸŽฏ ACTION POINT: Can you explain what we see in the plot above?

See a model solution

The plot above shows the performance of each variation of the SVM algorithm, in terms of the RMSE metric (Y-axis). Two parameters vary: Cost (X-axis) and Sigma (Colour of the lines).

From the plot, we can observe that lower values of Sigma (\(\le 10^{-4}\)) lead to smaller RMSE regardless of Cost.

On a personal note, I do not really love this plot. autoplot tries to guess the best plot for the dataframe, but I don’t think it communicates the results truthfully. Here is how I would re-do this plot from scratch using ggplot2:

First, collect the metrics as a single dataframe:

plot_df <- collect_metrics(smarket_tuned)
head(plot_df)

Then take care of making the X and Y labels easier to read:

plot_df$rbf_sigma <- factor(plot_df$rbf_sigma)
plot_df$cost <- factor(plot_df$cost)

g <-
  (ggplot(plot_df, 
          aes(x=cost, y=rbf_sigma, size=mean, color=mean)) 
   + geom_point()
   
   + scale_size_continuous(limits=c(1.1, 1.25))
   + scale_colour_viridis_c(limits=c(1.1,1.25))
   + scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
   + scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
   
   + theme_minimal()
   + theme(axis.text.x = element_text(angle=90),
           legend.position = "bottom",
           legend.direction ="horizontal")
   + labs(x="Cost", y="RBF Sigma", title="Results are similar", subtitle="But we get better results for lower Cost and lower Sigma")
   )
g

Well, if we truly care about representing things accurately in the plot, the X and Y axes should be put on their actual continuous scales:

# This time I won't convert cost and sigma to factors
plot_df <- collect_metrics(smarket_tuned)

# In the plot, scale_x_discrete has to be replaced by scale_x_continuous:
g <-
  (ggplot(plot_df, 
          aes(x=cost, y=rbf_sigma, size=mean, color=mean)) 
   + geom_point(alpha=0.3)
   
   + scale_size_continuous(limits=c(1.1, 1.25))
   + scale_colour_viridis_c(limits=c(1.1,1.25))
   + scale_x_continuous(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
   + scale_y_continuous(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
   
   + theme_minimal()
   + theme(axis.text.x = element_text(angle=90),
           legend.position = "bottom",
           legend.direction ="horizontal")
   + labs(x="Cost", y="RBF Sigma", title="Results are similar", subtitle="But we get better results for lower Cost and lower Sigma")
   )
g

We can use the function select_best to identify the hyperparameter combination that leads to the lowest RMSE.

select_best(smarket_tuned)
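If you would rather not copy these numbers by hand (as we do in Q5 below), one option is to plug them straight back into the model specification with finalize_model. A sketch:

# Insert the winning hyperparameters into the model specification
best_params <- select_best(smarket_tuned, metric = "rmse")

final_svm_spec <- finalize_model(svm_regressor, best_params)
final_svm_spec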

๐Ÿ  Take-home exercises (Advanced)

Donโ€™t worry, the next summative problem set will not require you to write code like asked in here.

Q5: Re-run Step 3 (Regression)

Build a standalone SVM model (tidymodels version) on the same data we used in Step 3 of W08 lab, only this time set the parameters of the SVM to the optimal parameters identified in Step 4.

See a model solution
filtered_data <- ISLR2::Smarket %>% select(Today, Volume, Lag1)

alternative_svm_model <-
  svm_rbf(cost=0.0009765625, rbf_sigma=1) %>% 
  set_mode('regression') %>% 
  fit(Today ~ ., data = filtered_data)

alternative_svm_model

Q6: Predict the external validation set

Use the model you built in Q5 and make predictions on the external validation set. How does the RMSE of these predictions compare to the RMSE of the internal validation set?

See a model solution
augment(alternative_svm_model, external_validation_set) %>% 
  rmse(Today, .pred)

The RMSE in the external validation set is comparable to the mean RMSE in the internal validation set (close to RMSE = 1.1)!

That is what is good about cross-validation: the internal cross-validation error tells us roughly how much we can expect to err, on average.
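If you want to see that internal estimate explicitly, you can pull it out of the tuning results, for example:

# Mean cross-validated RMSE of the best hyperparameter combination
show_best(smarket_tuned, metric = "rmse", n = 1)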

Q7: SVM Decision Space (Regression)

Replicate the plot from Step 3.3 of W08 lab roadmap, only this time using the model from Q5.

See a model solution

When plotting the real data, I will focus on the external validation set; it has fewer samples, so the plot will not look as cluttered.

## Simulated data
sim.data <- 
  crossing(Volume   = seq(0,3.5,0.1),
           Lag1 = seq(-5,6,0.2))
sim.data <- augment(alternative_svm_model, sim.data)  # predictions from the Q5 model

## Real Data
plot_df <- 
  augment(alternative_svm_model, external_validation_set) %>% 
  mutate(row_number=row_number())

plot_df$residual_above_2 <- (plot_df$.resid) > 2


g <- (
  plot_df %>%   
    ggplot()
  
    # Tile the background of the plot with SVM predictions
    + geom_tile(data = sim.data, aes(x=Volume, y=Lag1, fill = .pred), alpha = 0.45)
  
    # Actual data
    + geom_point(aes(x=Volume, y=Lag1, colour = residual_above_2, shape = residual_above_2, alpha=residual_above_2), size=2.5, stroke=0.95)
  
    # Define X and Os
    + scale_shape_manual(values = c(4, 1))
    + scale_fill_viridis_c()
    + scale_color_manual(values=c("black", "red"))
    + scale_alpha_manual(values=c(0.1, 0.7))
    
    # (OPTIONAL) Customizing the colours and theme of the plot
    + theme_minimal()
    + theme(panel.grid = element_blank(), legend.position = 'bottom')
    + labs(x = 'Volume', y = 'Lag 1', fill = "Today's Prediction", colour = 'Residual above 2?', shape = 'Residual above 2?', alpha='Residual above 2?', title='Worst predictions are marked as red circles', subtitle="(showing external validation set only)")
)

g

Q8: Compare SVM Decision Space plots

Explain if and how the decision space you obtained in Q7 differs from the one in Step 3.3.

See a model solution

The decision space in Q7 is a lot simpler than the one in Step 3.3 of W08 lab; predictions made by this SVM are very similar to each other, regardless of the value of Lag1 and Volume.

If you are not convinced, take a look at the colour legend at the bottom of the plot. Note that the range goes from 0.01 to 0.05 whereas in the first plot, colours represented a range from -1.0 to 1.5.

This is more apparent when we re-do the plot in Q7 overriding the colour range to be the same as before:

g + scale_fill_viridis_c(limits=c(-1, 1.5))

Not a lot of varianceโ€ฆ

Q9: Grid search for Orange Juice (Classification)

  • For this task, you will use the Orange Juice data set (ISLR2::OJ) with ALL the predictors included

  • Replicate the entire procedure of Step 4, making the necessary adjustments to predict Purchase

  • Use F1 score as the optimisation metric.

๐Ÿ’กYou will have two tweak at least two main things: metric_set and set_mode

See a model solution

### Set aside some data for external validation
set.seed(123)

default_split <- initial_split(ISLR2::OJ , prop = 0.75, strata = Purchase)

internal_validation_set <- training(default_split)

external_validation_set <- testing(default_split)

orange_tuned <-
  tune_grid(object = svm_rbf(cost = tune(), rbf_sigma = tune()) %>% set_mode('classification'),
            preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
            resamples = vfold_cv(internal_validation_set, v = 10),
            grid = grid_regular(cost(), rbf_sigma(), levels = 5),
            metrics = metric_set(f_meas),
            control = control_grid(save_pred=TRUE))
plot_df <- collect_metrics(orange_tuned)
plot_df$rbf_sigma <- factor(plot_df$rbf_sigma)
plot_df$cost <- factor(plot_df$cost)

plot_df$mean %>% summary()
g <-
  (ggplot(plot_df, 
          aes(x=cost, y=rbf_sigma, size=mean, color=mean)) 
   + geom_point()
   
   + scale_size_continuous(limits=c(0.7, 0.9))
   + scale_colour_viridis_c(limits=c(0.7, 0.9))
   + scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
   + scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
   
   + theme_minimal()
   + theme(axis.text.x = element_text(angle=90),
           legend.position = "bottom",
           legend.direction ="horizontal")
   + labs(x="Cost", y="RBF Sigma", title="Results are similar", subtitle="But we get better (larger) F1-scores when sigma is closer to 3.2e-3")
   )
g

Get the best configuration of parameters:

select_best(orange_tuned)

Now, I am curious about the external validation set:

oj_model2 <-
  svm_rbf(cost=2.378414, rbf_sigma=0.003162278) %>% 
  set_mode('classification') %>% 
  fit(Purchase ~ ., data = internal_validation_set)

augment(oj_model2, external_validation_set) %>% 
  f_meas(Purchase, .pred_class)

Good enough! An F1-score of ~ 0.86 is not bad, and it’s close to what we expected from the cross-validation.

Also, the ROC curve applied to the external validation set doesnโ€™t look too bad!

augment(oj_model2, external_validation_set) %>% 
  roc_curve(Purchase, .pred_CH) %>% 
  autoplot()
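We could also inspect a confusion matrix on the external validation set. A quick sketch using yardstick's conf_mat (the cvms package loaded at the top offers fancier confusion matrix plots, if you prefer):

# Cross-tabulate predicted vs actual brands on the external validation set
augment(oj_model2, external_validation_set) %>% 
  conf_mat(Purchase, .pred_class) %>% 
  autoplot(type = "heatmap")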

Q10: Challenge

Replicate the same steps as in Q9 for different SVM kernels (linear, polynomial, etc.). Is it possible to fit a model that is better than the radial kernel, in terms of F1-score?

See a model solution

Linear

A few things are different:

  • The only parameter we can change is cost
  • The algorithm only works if we explicitly tell tidymodels that this SVM kernel can be found in the โ€œkernlabโ€ package, hence the %>% set_engine("kernlab") added to the model specification.
orange_linear_tuned <-
  tune_grid(object = svm_linear(cost = tune()) %>% set_mode('classification') %>% set_engine("kernlab") ,
            preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
            resamples = vfold_cv(internal_validation_set, v = 10),
            grid = grid_regular(cost(), levels = 5),
            metrics = metric_set(f_meas),
            control = control_grid(save_pred=TRUE))

Results do not vary too much from svm_rbf. As long as cost is sufficiently high, F1-scores are pretty decent.

plot_df <- collect_metrics(orange_linear_tuned)
plot_df$cost <- factor(plot_df$cost)

plot_df

Polynomial

A few things are different:

  • We have to tune cost, degree and scale_factor.
  • Some combinations of parameters do not yield valid solutions. You might see a few errors in your console (see the note after the code chunk below for one way to inspect them).
orange_polynomial_tuned <-
  tune_grid(object = svm_poly(cost = tune(), degree=tune(), scale_factor=tune()) %>% set_mode('classification') ,
            preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
            resamples = vfold_cv(internal_validation_set, v = 10),
            grid = grid_regular(cost(), degree(), scale_factor(), levels = 5),
            metrics = metric_set(f_meas),
            control = control_grid(save_pred=TRUE))
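If you do get those warnings or errors, one way to inspect them after tuning (assuming a reasonably recent version of the tune package) is:

# Inspect any warnings/errors captured during tuning
collect_notes(orange_polynomial_tuned)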

Results do not vary too much here either, but there is a “sweet spot”: F1-scores improve once the scale factor is large enough (see the subtitle of the plot below).

plot_df <- collect_metrics(orange_polynomial_tuned)
plot_df$degree <- factor(plot_df$degree)
plot_df$cost <- factor(plot_df$cost)
plot_df$scale_factor <- factor(plot_df$scale_factor)

g <-
  (ggplot(plot_df, 
          aes(x=cost, y=scale_factor, size=mean, color=mean)) 
   + geom_point()
   
   + scale_radius(limits=c(0.7, 0.9), range=c(1, 4))
   + scale_colour_viridis_c(limits=c(0.7, 0.9))
   + scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
   + scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
   
   + theme_bw()
   + theme(axis.text.x = element_text(angle=90),
           legend.position = "bottom",
           legend.direction ="horizontal")
   + labs(x="Cost", y="Scale Factor", title="Results are similar to svm_rbf", subtitle="But we get better (larger) F1-scores when Scale Factor is large enough")
   
   + facet_grid(~ degree, labeller = label_both)
   )
g

Extra

It is possible to tune multiple models at the same time, instead of tuning one algorithm at a time. Check the Tuning and comparing models tutorial from tidymodels.
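As a very rough sketch of what that looks like (assuming the workflowsets package, which is not loaded at the top of this page), you pair one or more recipes with several model specifications and tune them all in one go:

library(workflowsets)

# One recipe, several SVM specifications, tuned together
oj_recipe <- recipe(Purchase ~ ., data = internal_validation_set)

svm_specs <- list(
  rbf    = svm_rbf(cost = tune(), rbf_sigma = tune()) %>% set_mode('classification'),
  linear = svm_linear(cost = tune()) %>% set_mode('classification') %>% set_engine("kernlab")
)

oj_workflows <- workflow_set(preproc = list(oj = oj_recipe), models = svm_specs)

oj_results <-
  oj_workflows %>% 
  workflow_map("tune_grid",
               resamples = vfold_cv(internal_validation_set, v = 10),
               grid = 5,    # 5 candidate parameter combinations per model
               metrics = metric_set(f_meas))

# Rank all tuned workflows by F1-score
rank_results(oj_results)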