✨ Bonus Lab (Optional) - Cross-validation
DS202 - Data Science for Social Scientists
Context
This page is an (optional) continuation of the 💻 Week 08 lab. Originally, we intended it to be part of the W08 lab but decided against it, as it would make the lab too cluttered.
Although optional, we think the exercises in here might be a great way to solidify your knowledge of SVM and its parameters.
Topics
Here we show an alternative way to perform k-fold cross-validation, using tidymodels instead of the cv.glm function we saw in the W05 lab.
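For contrast, here is a minimal sketch of what the cv.glm route looks like. This is only illustrative: it assumes the boot package and a simple linear model of Today on Volume and Lag1, which may not be exactly the model we fitted in W05.

library('boot') # provides cv.glm()

# Hypothetical example: 10-fold cross-validated error for a simple linear model
glm_fit  <- glm(Today ~ Volume + Lag1, data = ISLR2::Smarket)
cv_error <- cv.glm(ISLR2::Smarket, glm_fit, K = 10)

cv_error$delta  # raw and adjusted cross-validated prediction error (MSE)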
Setup
💡 Some of you have mentioned that your R version cannot handle tidymodels. We recommend you update R to version 4.2.0 or above.
Packages you will need
library('ISLR2') # for the data
library('tidyverse') # to use things like the pipe (%>%)
library('e1071') # for SVM model
library('tidymodels') # for model tuning, cross-validation etc.
# Vanity packages:
library('GGally') # for pretty correlation plot
library('ggsci') # for pretty plot colours
library('cvms') # for pretty confusion matrix plots
The Data
🍊 Orange Juice
This week we will use a different ISLR2 dataset: OJ. We will perform a classification task with the goal of predicting the Purchase column.
The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.
ISLR2::OJ %>% head()
To understand what each variable represents, open the R Console, type the following and hit ENTER:
?ISLR2::OJ
Which variables can help us distinguish the two different brands?
To simplify our plots later on, let's focus on just two predictors:
plot_df <- ISLR2::OJ %>% select(Purchase, LoyalCH, PriceDiff)
g = (
    ggpairs(plot_df, aes(colour=Purchase))

    # Customizing the plot
    + scale_colour_startrek()
    + scale_fill_startrek()
    + theme_bw()
)

g
📈 Stock Market
We will also use the Smarket dataset from the ISLR2 package. We will perform a regression task with the goal of predicting the percentage return of the S&P 500 stock index on any given day, as represented by the Today column.
Daily percentage returns for the S&P 500 stock index between 2001 and 2005.
ISLR2::Smarket %>% head()
plot_df <- ISLR2::Smarket %>% select(Today, Volume, Lag1, Lag2)
g = (
    ggpairs(plot_df)

    # Customizing the plot
    + scale_colour_startrek()
    + scale_fill_startrek()
    + theme_bw()
)

g
To understand what each variable represents, open the R Console, type the following and hit ENTER:
?ISLR2::Smarket
Step 4: k-fold cross validation with tidymodels
Here we will replicate the cross-validation setup illustrated in the 🗓️ Week 04 lecture/workshop.
⏮️ Recap:
We also explored training vs test splits, albeit in a different way, in Step 2.1 of 💻 Week 04 - Lab.
k-fold cross-validation was also present in Step 2 of 💻 Week 05 - Lab.
Cross-validation was also mentioned in the 🗓️ Week 05 lecture, when you were introduced to the problem of overfitting.
But how exactly can cross-validation help overcome overfitting? This is what we will explore in this section of the lab.
Step 4.1: Create training / test split
We start by creating a training/test split of the 📈 Smarket data, using the initial_split, training and testing functions.
set.seed(123)
# Remove Direction, otherwise we would be "cheating"
filtered_data <- ISLR2::Smarket %>% select(Today, Volume, Lag1)

default_split <- initial_split(filtered_data, prop = 0.75, strata = Today)

internal_validation_set <- training(default_split)
external_validation_set <- testing(default_split)
How many samples are in the internal validation set?
nrow(internal_validation_set)
How many samples were left in the external validation set?
nrow(external_validation_set)
The external validation set will only be used at the end.
Step 4.2: Create resampling folds for cross validation
Next, let's create 10-fold cross-validation data using internal_validation_set. We can achieve this by using the vfold_cv command, specifying v = 10.
k_folds <- vfold_cv(internal_validation_set, v = 10)

k_folds
Notice anything odd? The output is a tibble, but the first column, splits, is a series of lists. This is another thing that makes tibbles different to data frames - you can nest lists within tibbles but not within data frames. We will use this more explicitly in the next lab, when we build k-means clustering models.
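To see what we mean by nesting, here is a toy illustration (not part of the lab data) of a list column inside a tibble:

# A tibble can hold a list column; a plain data.frame cannot do this directly
toy <- tibble(id = 1:3,
              values = list(1:2, letters[1:3], c(TRUE, FALSE)))

toy

toy$values[[2]]  # access the second element of the list column: "a" "b" "c"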
Step 4.3: 🍳 Specifying a recipe
The next step is to create a recipe. Luckily, the recipe function takes the same kind of formula as the lm model! We first create the formula Today ~ . and then use data = internal_validation_set. Printing smarket_recipe, we see we have one outcome and two predictors.
smarket_recipe <- recipe(Today ~ ., data = internal_validation_set)

smarket_recipe
Step 4.4: Specify a model
Next, we will specify a support vector machine model. Here's where things get a bit more involved.
We specify a radial basis function SVM. svm_rbf takes two hyperparameters: cost and rbf_sigma. Instead of specifying a single value for each, we will set them equal to tune(). This indicates that we want to try a range of different values.
svm_regressor <-
    svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
    set_mode('regression')
Step 4.5: Create a hyperparameter grid
Which values for cost and rbf_sigma should we choose? It is often hard to tell, so instead we can experiment with different values. We can use grid_regular to create a tibble of different hyperparameter combinations. levels = 5 indicates that we want to try out five different values for each hyperparameter.
set.seed(234)
svm_grid <- grid_regular(cost(), rbf_sigma(), levels = 5)

svm_grid
Step 4.6: Perform cross-validation
We now have all we need to run cross-validation, and the function tune_grid puts everything together. Let's think intuitively about what this command is doing.
- We are telling tune_grid that we want to run a regression model on a recipe using different resampling folds.
- Instead of specifying hyperparameter values, we want to try a combination of different values.
- After this, we want to choose a metric to evaluate the different combinations. We opt for rmse, but we could specify several metrics using the metric_set command.
⚠️ As you will probably notice, the code below takes quite some time to run. This is expected; after all, we are training A LOT of models to tune the parameters.
smarket_tuned <-
    tune_grid(object = svm_regressor,
              preprocessor = smarket_recipe,
              resamples = k_folds,
              grid = svm_grid,
              metrics = metric_set(rmse),
              control = control_grid(save_pred=TRUE))
Step 4.7: Which combination of hyperparameters works best?
Now that we have tuned our models, let's find out which combination of hyperparameters works best. We can create a ggplot easily using the autoplot command.
smarket_tuned %>%
    autoplot() +
    theme_minimal() +
    ggsci::scale_color_jco() +
    labs(y = 'RMSE', colour = 'Sigma')
🎯 ACTION POINT: Can you explain what we see in the plot above?
See a model solution
- The plot above shows the performance of each variation of the SVM algorithm, in terms of the RMSE metric (Y-axis). Two parameters vary: Cost (X-axis) and Sigma (colour of the lines).
- From the plot, we can observe that lower values of Sigma (\(\leq 10^{-4}\)) lead to smaller RMSE regardless of Cost.
On a personal note, I do not really love this plot. autoplot tries to guess the best plot for the dataframe, but I don't think it communicates truthfully. Here is how I would re-do this plot from scratch using ggplot2:
First, collect the metrics as a single dataframe:
plot_df <- collect_metrics(smarket_tuned)

head(plot_df)
Then take care of making the X and Y labels easier to read:
plot_df$rbf_sigma <- factor(plot_df$rbf_sigma)
plot_df$cost      <- factor(plot_df$cost)
g <- (
    ggplot(plot_df,
           aes(x=cost, y=rbf_sigma, size=mean, color=mean))

    + geom_point()
    + scale_size_continuous(limits=c(1.1, 1.25))
    + scale_colour_viridis_c(limits=c(1.1, 1.25))
    + scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
    + scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))

    + theme_minimal()
    + theme(axis.text.x = element_text(angle=90),
            legend.position = "bottom",
            legend.direction = "horizontal")

    + labs(x="Cost", y="RBF Sigma",
           title="Results are similar",
           subtitle="But we get better results for lower Cost and lower Sigma")
)

g
Well, if we truly care about representing things accurately in the plot, the X and Y axes should be put in the right scale:
# This time I won't convert cost and sigma to factors
plot_df <- collect_metrics(smarket_tuned)

# In the plot, scale_x_discrete has to be replaced by scale_x_continuous:
g <- (
    ggplot(plot_df,
           aes(x=cost, y=rbf_sigma, size=mean, color=mean))

    + geom_point(alpha=0.3)
    + scale_size_continuous(limits=c(1.1, 1.25))
    + scale_colour_viridis_c(limits=c(1.1, 1.25))
    + scale_x_continuous(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
    + scale_y_continuous(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))

    + theme_minimal()
    + theme(axis.text.x = element_text(angle=90),
            legend.position = "bottom",
            legend.direction = "horizontal")

    + labs(x="Cost", y="RBF Sigma",
           title="Results are similar",
           subtitle="But we get better results for lower Cost and lower Sigma")
)

g
We can use the function select_best to identify the hyperparameter combination that leads to the best performance (here, the lowest RMSE).
select_best(smarket_tuned)
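If you want to carry these values forward (as we do manually in Q5 below), one option, sketched here, is to plug them straight into a model specification with finalize_model. Treat this as an illustrative alternative rather than the required workflow:

best_params <- select_best(smarket_tuned, metric = "rmse")

# Replace the tune() placeholders with the winning cost / rbf_sigma values
final_svm <-
    svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
    set_mode('regression') %>%
    finalize_model(best_params)

final_svm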
🏠 Take-home exercises (Advanced)
Don't worry: the next summative problem set will not require you to write the kind of code asked for here.
Q5: Re-run Step 3 (Regression)
Build a standalone SVM model (tidymodels version) on the same data we used in Step 3 of the W08 lab, only this time set the parameters of the SVM to the optimal parameters identified in Step 4.
See a model solution
filtered_data <- ISLR2::Smarket %>% select(Today, Volume, Lag1)

alternative_svm_model <-
    svm_rbf(cost=0.0009765625, rbf_sigma=1) %>%
    set_mode('regression') %>%
    fit(Today ~ ., data = filtered_data)

alternative_svm_model
Q6: Predict the external validation set
Use the model you built in Q5 and make predictions on the external validation set. How does the RMSE of these predictions compare to the RMSE of the internal validation set?
See a model solution
augment(alternative_svm_model, external_validation_set) %>%
rmse(Today, .pred)
RMSE in the external validation set is comparable to the mean RMSE in the internal validation set (close to RMSE = 1.1)!
That is what is good about cross-validation: the internal cross-validation error tells us by how much we can expect to err, on average.
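If you want to check this claim yourself, a minimal sketch (reusing the Step 4 objects) is to pull the mean cross-validated RMSE of the winning hyperparameter combination and compare it with the number above:

# Mean RMSE across the 10 folds for the best cost / rbf_sigma combination
best_config <- select_best(smarket_tuned, metric = "rmse")$.config

collect_metrics(smarket_tuned) %>%
    filter(.config == best_config)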
Q7: SVM Decision Space (Regression)
Replicate the plot from Step 3.3 of W08 lab roadmap, only this time using the model from Q5.
See a model solution
When using real data, I will focus on the external validation set. There are fewer samples, so the plot will not look as cluttered.
## Simulated data
sim.data <-
    crossing(Volume = seq(0, 3.5, 0.1),
             Lag1   = seq(-5, 6, 0.2))

# Predictions from the Q5 model over the simulated grid
sim.data <- augment(alternative_svm_model, sim.data)

## Real Data
plot_df <-
    augment(alternative_svm_model, external_validation_set) %>%
    mutate(row_number=row_number())

plot_df$residual_above_2 <- plot_df$.resid > 2
g <- (
    plot_df %>%
        ggplot()

    # Tile the background of the plot with SVM predictions
    + geom_tile(data = sim.data, aes(x=Volume, y=Lag1, fill = .pred), alpha = 0.45)

    # Actual data
    + geom_point(aes(x=Volume, y=Lag1, colour = residual_above_2, shape = residual_above_2, alpha=residual_above_2),
                 size=2.5, stroke=0.95)

    # Define X and Os
    + scale_shape_manual(values = c(4, 1))
    + scale_fill_viridis_c()
    + scale_color_manual(values=c("black", "red"))
    + scale_alpha_manual(values=c(0.1, 0.7))

    # (OPTIONAL) Customizing the colours and theme of the plot
    + theme_minimal()
    + theme(panel.grid = element_blank(), legend.position = 'bottom')
    + labs(x = 'Volume', y = 'Lag 1', fill = "Today's Prediction",
           colour = 'Residual above 2?', shape = 'Residual above 2?', alpha = 'Residual above 2?',
           title = 'Worst predictions are marked as red circles',
           subtitle = "(showing external validation set only)")
)

g
Q8: Compare SVM Decision Space plots
Explain if and how the decision space you obtained in Q7 differs from the one in Step 3.3.
See a model solution
The decision space in Q7 is a lot simpler than the one in Step 3.3 of the W08 lab; predictions made by this SVM are very similar to each other, regardless of the values of Lag1 and Volume.
If you are not convinced, take a look at the colour legend at the bottom of the plot. Note that the range goes from 0.01 to 0.05, whereas in the first plot, colours represented a range from -1.0 to 1.5.
This is more apparent when we re-do the plot in Q7 overriding the colour range to be the same as before:
g + scale_fill_viridis_c(limits=c(-1, 1.5))
Not a lot of variance…
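If you would rather check this numerically than visually, a quick sketch (reusing the Q7 objects) is to summarise the predictions directly and compare them with the spread of the actual outcome:

# Spread of the SVM predictions over the simulated grid (narrow, as the legend suggests)
summary(sim.data$.pred)

# Compare with the spread of the observed Today values
summary(external_validation_set$Today)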
Q9: Grid search for Orange Juice (Classification)
- For this task, you will use the Orange Juice data set (ISLR2::OJ) with ALL the predictors included.
- Replicate the entire procedure of Step 4, making the necessary adjustments to predict Purchase.
- Use the F1-score as the optimisation metric.
💡 You will have to tweak at least two main things: metric_set and set_mode.
See a model solution
### Set aside some data for external validation
set.seed(123)
default_split <- initial_split(ISLR2::OJ, prop = 0.75, strata = Purchase)

internal_validation_set <- training(default_split)
external_validation_set <- testing(default_split)

orange_tuned <-
    tune_grid(object = svm_rbf(cost = tune(), rbf_sigma = tune()) %>% set_mode('classification'),
              preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
              resamples = vfold_cv(internal_validation_set, v = 10),
              grid = grid_regular(cost(), rbf_sigma(), levels = 5),
              metrics = metric_set(f_meas),
              control = control_grid(save_pred=TRUE))
plot_df <- collect_metrics(orange_tuned)

plot_df$rbf_sigma <- factor(plot_df$rbf_sigma)
plot_df$cost      <- factor(plot_df$cost)

plot_df$mean %>% summary()
g <- (
    ggplot(plot_df,
           aes(x=cost, y=rbf_sigma, size=mean, color=mean))

    + geom_point()
    + scale_size_continuous(limits=c(0.7, 0.9))
    + scale_colour_viridis_c(limits=c(0.7, 0.9))
    + scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
    + scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))

    + theme_minimal()
    + theme(axis.text.x = element_text(angle=90),
            legend.position = "bottom",
            legend.direction = "horizontal")

    + labs(x="Cost", y="RBF Sigma",
           title="Results are similar",
           subtitle="But we get better (larger) F1-scores when sigma is closer to 3.2e-3")
)

g
Get the best configuration of parameters:
select_best(orange_tuned)
Now, I am curious about the external validation set:
oj_model2 <-
    svm_rbf(cost=2.378414, rbf_sigma=0.003162278) %>%
    set_mode('classification') %>%
    fit(Purchase ~ ., data = external_validation_set)

augment(oj_model2, external_validation_set) %>%
    f_meas(Purchase, .pred_class)
Good enough! An F1-score of ~0.86 is not bad, and it's close to what we expected from the cross-validation.
Also, the ROC curve applied to the external validation set doesn't look too bad!
augment(oj_model2, external_validation_set) %>%
roc_curve(Purchase, .pred_CH) %>%
autoplot()
Q10: Challenge
Replicate the same steps as in Q9 for different SVM kernels (linear, polynomial, etc.). Is it possible to fit a model that is better than the radial kernel, in terms of F1-score?
See a model solution
Linear
A few things are different:
- The only parameter we can change is cost.
- The algorithm only works if we explicitly tell tidymodels that this SVM kernel can be found in the "kernlab" package, hence the %>% set_engine("kernlab") added to the model specification.
orange_linear_tuned <-
    tune_grid(object = svm_linear(cost = tune()) %>% set_mode('classification') %>% set_engine("kernlab"),
              preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
              resamples = vfold_cv(internal_validation_set, v = 10),
              grid = grid_regular(cost(), levels = 5),
              metrics = metric_set(f_meas),
              control = control_grid(save_pred=TRUE))
Results do not vary too much from svm_rbf. As long as cost is sufficiently high, F1-scores are pretty decent.
plot_df <- collect_metrics(orange_linear_tuned)
plot_df$cost <- factor(plot_df$cost)

plot_df
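Rather than printing the raw tibble, you could also plot the mean F1-score against cost. A minimal sketch, using the plot_df created just above:

g <- (
    ggplot(plot_df, aes(x=cost, y=mean, group=1))

    + geom_line()
    + geom_point()

    + theme_minimal()
    + labs(x="Cost", y="Mean F1-score (10-fold CV)",
           title="Linear kernel: F1-scores are decent once cost is high enough")
)

g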
Polynomial
A few things are different:
- We have to tune cost, degree and scale_factor.
- Some combinations of parameters do not yield valid solutions. You might see a few errors in your console.
orange_polynomial_tuned <-
    tune_grid(object = svm_poly(cost = tune(), degree=tune(), scale_factor=tune()) %>% set_mode('classification'),
              preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
              resamples = vfold_cv(internal_validation_set, v = 10),
              grid = grid_regular(cost(), degree(), scale_factor(), levels = 5),
              metrics = metric_set(f_meas),
              control = control_grid(save_pred=TRUE))
Results do not vary too much here either, but there is a "sweet spot": F1-scores improve once the scale factor is large enough.
plot_df <- collect_metrics(orange_polynomial_tuned)

plot_df$degree       <- factor(plot_df$degree)
plot_df$cost         <- factor(plot_df$cost)
plot_df$scale_factor <- factor(plot_df$scale_factor)
g <- (
    ggplot(plot_df,
           aes(x=cost, y=scale_factor, size=mean, color=mean))

    + geom_point()
    + scale_radius(limits=c(0.7, 0.9), range=c(1, 4))
    + scale_colour_viridis_c(limits=c(0.7, 0.9))
    + scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
    + scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))

    + theme_bw()
    + theme(axis.text.x = element_text(angle=90),
            legend.position = "bottom",
            legend.direction = "horizontal")

    + labs(x="Cost", y="Scale Factor",
           title="Results are similar to svm_rbf",
           subtitle="But we get better (larger) F1-scores when Scale Factor is large enough")

    + facet_grid(~ degree, labeller = label_both)
)

g
Extra
It is possible to tune multiple models at the same time, instead of tuning each algorithm one at a time. Check the Tuning and comparing models tutorial from tidymodels.
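For reference, here is a minimal sketch of that idea using the workflowsets package (part of the wider tidymodels ecosystem). Object names like oj_recipe, oj_workflows and oj_results are just illustrative, and this reuses the internal_validation_set from Q9:

library('workflowsets')

# One shared recipe, two candidate kernels
oj_recipe <- recipe(Purchase ~ ., data = internal_validation_set)

oj_workflows <-
    workflow_set(preproc = list(base = oj_recipe),
                 models  = list(rbf    = svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
                                              set_mode('classification'),
                                linear = svm_linear(cost = tune()) %>%
                                              set_mode('classification') %>%
                                              set_engine('kernlab')))

# Tune every workflow with the same folds and metric, then rank the results
oj_results <-
    oj_workflows %>%
    workflow_map("tune_grid",
                 resamples = vfold_cv(internal_validation_set, v = 10),
                 grid      = 5,
                 metrics   = metric_set(f_meas))

rank_results(oj_results, rank_metric = "f_meas")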