✨ Bonus Lab (Optional) - Cross-validation
DS202 - Data Science for Social Scientists
This notebook has been modified. It now contains model solutions too.
I show the code only, not the plots; otherwise building this website would be too slow!
Context
This page is an (optional) continuation of the 💻 Week 08 lab. Originally, we intended it to be part of the W08 lab but decided against it as it would make the lab too cluttered.
Although optional, we think the exercises in here might be a great way to solidify your knowledge of SVM and its parameters.
Topics
Here we show an alternative way to perform k-fold cross-validation using tidymodels instead of the cv.glm approach we saw in the W05 lab.
Setup
💡 Some of you have mentioned that your R version cannot handle tidymodels. We recommend you update R to version 4.2.0 or above.
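If you are unsure which version you have, the (optional) check below prints it; the second line assumes tidymodels is already installed and will error otherwise.
# Optional: check your R and tidymodels versions
R.version.string                 # should report R 4.2.0 or above
packageVersion('tidymodels')     # errors if tidymodels is not installed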
Packages you will need
library('ISLR2') # for the data
library('tidyverse') # to use things like the pipe (%>%)
library('e1071') # for SVM model
library('tidymodels') # for model tuning, cross-validation etc.
# Vanity packages:
library('GGally') # for pretty correlation plot
library('ggsci') # for pretty plot colours
library('cvms') # for pretty confusion matrix plots
The Data
🍊 Orange Juice
This week we will use a different ISLR2 dataset: OJ. We will perform a classification task with the goal of predicting the Purchase column.
The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.
ISLR2::OJ %>% head()
To understand what each variable represents, open the R Console, type the following and hit ENTER:
?ISLR2::OJ
Which variables can help us distinguish the two different brands?
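One quick, optional way to get a feel for the outcome before plotting is to count how many purchases fall into each brand; this uses dplyr's count() and is not part of the original lab roadmap.
# Optional: how balanced are the two brands?
ISLR2::OJ %>% count(Purchase)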
To simplify our plots later on, let's focus on just two predictors:
plot_df <- ISLR2::OJ %>% select(Purchase, LoyalCH, PriceDiff)
g = (
ggpairs(plot_df, aes(colour=Purchase))
# Customizing the plot
+ scale_colour_startrek()
+ scale_fill_startrek()
+ theme_bw()
)
g
📈 Stock Market
We will also use the Smarket dataset from the ISLR2 package. We will perform a regression task with the goal of predicting the percentage return of the S&P 500 stock index on any given day, as represented by the Today column.
Daily percentage returns for the S&P 500 stock index between 2001 and 2005.
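If you want to double-check the size and time span of the data for yourself, an optional one-off check could look like this.
# Optional sanity check: number of rows/columns and years covered
dim(ISLR2::Smarket)
range(ISLR2::Smarket$Year)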
ISLR2::Smarket %>% head()
plot_df <- ISLR2::Smarket %>% select(Today, Volume, Lag1, Lag2)
g = (
ggpairs(plot_df)
# Customizing the plot
+ scale_colour_startrek()
+ scale_fill_startrek()
+ theme_bw()
)
g
To understand what each variable represents, open the R Console, type the following and hit ENTER:
?ISLR2::Smarket
Step 4: k-fold cross-validation with tidymodels
Here we will replicate the cross-validation setup used in the 🗓️ Week 04 lecture/workshop.
⏮️ Recap:
We also explored training vs test splits, albeit in a different way, in Step 2.1 of 💻 Week 04 - Lab
k-fold cross-validation was also present in Step 2 of 💻 Week 05 - Lab
Cross-validation was also mentioned in the 🗓️ Week 05 lecture, when you were introduced to the problem of overfitting.
But how exactly can cross-validation help overcome overfitting? This is what we will explore in this section of the lab.
Step 4.1: Create training / test split
We start by creating a training/test split with the initial_split function, using the 📈 Smarket data.
set.seed(123)
# Remove Direction, otherwise we would be "cheating"
filtered_data <- ISLR2::Smarket %>% select(Today, Volume, Lag1)
default_split <- initial_split(filtered_data, prop = 0.75, strata = Today)
internal_validation_set <- training(default_split)
external_validation_set <- testing(default_split)
How many samples are in the internal validation set?
nrow(internal_validation_set)
How many samples were left in the external validation set?
nrow(external_validation_set)
The external validation set will only be used at the end.
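As a rough, optional check, the share of rows kept for the internal validation set should be close to the prop = 0.75 we requested (stratified sampling rounds within strata, so it will not be exact).
# Optional: proportion of rows that ended up in the internal validation set
nrow(internal_validation_set) / nrow(filtered_data)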
Step 4.2: Create resampling folds for cross validation
Next, let's create 10-fold cross-validation data using internal_validation_set. We can achieve this by using the vfold_cv command, specifying v = 10.
k_folds <- vfold_cv(internal_validation_set, v = 10)
k_folds
Notice anything odd? The output is a tibble but the first column, splits, is a series of lists. This is another thing that makes tibbles different from data frames - you can easily nest lists within tibbles, which is awkward with plain data frames. We will use this more explicitly in the next lab when we build k-means clustering models.
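If you are curious about what each element of the splits column contains, rsample's analysis() and assessment() functions extract the rows used for fitting and for evaluation within a single fold. This is just an optional peek; nothing later depends on it.
# Optional: peek inside the first fold
first_fold <- k_folds$splits[[1]]
nrow(analysis(first_fold))    # rows used to fit the model in this fold
nrow(assessment(first_fold))  # rows held out for evaluation in this fold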
Step 4.3: 🍳 Specifying a recipe
The next step is to create a recipe. Luckily, the recipe() function takes a formula just like the lm model! We first specify the formula Today ~ . and then pass data = internal_validation_set. Printing smarket_recipe shows one outcome and two predictors.
smarket_recipe <- recipe(Today ~ ., data = internal_validation_set)
smarket_recipe
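Recipes become more useful once you add preprocessing steps. We do not need any here, but purely as an illustration (the object name smarket_recipe_normalised is made up for this sketch), you could normalise the numeric predictors like this:
# Optional illustration: a recipe with a preprocessing step
smarket_recipe_normalised <-
  recipe(Today ~ ., data = internal_validation_set) %>%
  step_normalize(all_numeric_predictors())
smarket_recipe_normalised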
Step 4.4: Specify a model
Next, we will specify a support vector machine model. Here's where things get a bit more involved.
We specify a radial basis function SVM. svm_rbf takes two hyperparameters: cost and rbf_sigma. Instead of specifying a single value for each, we set them equal to tune(). This indicates that we want to try a range of different values.
svm_regressor <-
svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
set_mode('regression')
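If you want to see which engine and underlying function parsnip will use for this specification, translate() prints the mapping; this is purely optional and, to my knowledge, svm_rbf() defaults to the kernlab engine.
# Optional: inspect how parsnip translates the specification
svm_regressor %>% translate()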
Step 4.5: Create a hyperparameter grid
Which values for cost and rbf_sigma should we choose? It is often hard to tell, so instead we can experiment with different values.
We can use grid_regular to create a tibble of different hyperparameter combinations. levels = 5 indicates that we want to try out five different values for each hyperparameter.
set.seed(234)
svm_grid <- grid_regular(cost(), rbf_sigma(), levels = 5)
svm_grid
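Since we asked for 5 levels of each of the 2 hyperparameters, the grid should contain 5 × 5 = 25 combinations. If training that many configurations feels like too much, dials also provides grid_random() as a cheaper (random search) alternative; the call below is only an optional sketch.
# 5 levels x 2 hyperparameters = 25 combinations
nrow(svm_grid)
# Optional alternative: a random sample of 10 combinations instead of a full grid
grid_random(cost(), rbf_sigma(), size = 10)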
Step 4.6: Perform cross-validation
We now have all we need to run cross-validation, and the function tune_grid puts everything together. Let's think intuitively about what this command is doing.
- We are telling tune_grid that we want to run a regression model, built from a recipe, using different resampling folds.
- Instead of specifying fixed hyperparameter values, we ask it to try a grid of different combinations.
- After this, we want to choose a metric to evaluate the different combinations.
- We opt for rmse, but we could specify several metrics using the metric_set command.
⚠️ As you will probably notice, the code below takes quite some time to run. This is expected; after all, we are training A LOT of models to tune the parameters.
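If the wait bothers you, one optional way to speed this up is to register a parallel backend before calling tune_grid(), so the folds can be fitted in parallel. This sketch assumes the doParallel package is installed; depending on your version of tune, the future framework may be the preferred route instead.
# Optional: set up a parallel backend before tuning (assumes doParallel is installed)
library(doParallel)
cl <- parallel::makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)
# ... run tune_grid() below as usual ...
# parallel::stopCluster(cl)   # stop the cluster when you are done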
smarket_tuned <-
tune_grid(object = svm_regressor,
preprocessor = smarket_recipe,
resamples = k_folds,
grid = svm_grid,
metrics = metric_set(rmse),
control = control_grid(save_pred=TRUE))
Step 4.7: Which combination of hyperparameters works best?
Now that we have tuned our models, let's find out which combination of hyperparameters works best. We can create a ggplot easily using the autoplot command.
smarket_tuned %>%
autoplot() +
theme_minimal() +
ggsci::scale_color_jco() +
labs(y = 'RMSE', colour = 'Sigma')
🎯 ACTION POINT: Can you explain what we see in the plot above?
See a model solution
The plot above shows the performance of each variation of the SVM algorithm, in terms of the RMSE metric (Y-axis). Two parameters vary: Cost (X-axis) and Sigma (the colour of the lines). From the plot, we can observe that lower values of Sigma (\(\leq 10^{-4}\)) lead to smaller RMSE regardless of Cost.
On a personal note, I do not really love this plot. autoplot tries to guess the best plot for the dataframe, but I don't think it communicates truthfully. Here is how I would re-do this plot from scratch using ggplot2:
First, collect the metrics as a single dataframe:
plot_df <- collect_metrics(smarket_tuned)
head(plot_df)
Then take care of making the X and Y labels easier to read:
plot_df$rbf_sigma <- factor(plot_df$rbf_sigma)
plot_df$cost <- factor(plot_df$cost)
g <-
(ggplot(plot_df,
aes(x=cost, y=rbf_sigma, size=mean, color=mean))
+ geom_point()
+ scale_size_continuous(limits=c(1.1, 1.25))
+ scale_colour_viridis_c(limits=c(1.1,1.25))
+ scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
+ scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
+ theme_minimal()
+ theme(axis.text.x = element_text(angle=90),
legend.position = "bottom",
legend.direction ="horizontal")
+ labs(x="Cost", y="RBF Sigma", title="Results are similar", subtitle="But we get better results for lower Cost and lower Sigma")
)
g
Well, if we truly care about representing things accurately in the plot, the X and Y axes should be put on the right scale:
# This time I won't convert cost and sigma to factors
plot_df <- collect_metrics(smarket_tuned)
# In the plot, scale_x_discrete has to be replaced by scale_x_continuous:
g <-
(ggplot(plot_df,
aes(x=cost, y=rbf_sigma, size=mean, color=mean))
+ geom_point(alpha=0.3)
+ scale_size_continuous(limits=c(1.1, 1.25))
+ scale_colour_viridis_c(limits=c(1.1,1.25))
+ scale_x_continuous(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
+ scale_y_continuous(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
+ theme_minimal()
+ theme(axis.text.x = element_text(angle=90),
legend.position = "bottom",
legend.direction ="horizontal")
+ labs(x="Cost", y="RBF Sigma", title="Results are similar", subtitle="But we get better results for lower Cost and lower Sigma")
)
g
We can use the function select_best to identify the hyperparameter combination that leads to the lowest RMSE.
select_best(smarket_tuned)
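Rather than copying the winning cost and rbf_sigma into a new model by hand (as the Q5 solution below does), you could also splice them in automatically with finalize_model(). This is an optional alternative; best_params and final_svm_spec are placeholder names.
# Optional: plug the best hyperparameters straight into the model specification
best_params <- select_best(smarket_tuned, metric = 'rmse')
final_svm_spec <- finalize_model(svm_regressor, best_params)
final_svm_spec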
🏠 Take-home exercises (Advanced)
Don't worry: the next summative problem set will not require you to write code like the code asked for here.
Q5: Re-run Step 3 (Regression)
Build a standalone SVM model (tidymodels version) on the same data we used in Step 3 of W08 lab, only this time set the parameters of the SVM to the optimal parameters identified in Step 4.
See a model solution
filtered_data <- ISLR2::Smarket %>% select(Today, Volume, Lag1)
alternative_svm_model <-
svm_rbf(cost=0.0009765625, rbf_sigma=1) %>%
set_mode('regression') %>%
fit(Today ~ ., data = filtered_data)
alternative_svm_model
Q6: Predict the external validation set
Use the model you built in Q5 and make predictions on the external validation set. How does the RMSE of these predictions compare to the RMSE of the internal validation set?
See a model solution
augment(alternative_svm_model, external_validation_set) %>%
rmse(Today, .pred)
The RMSE in the external validation set is comparable to the mean RMSE in the internal validation set (close to RMSE = 1.1)!
That is what is good about cross-validation: the internal cross-validation error tells us by how much we can expect to err, on average.
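If you want the internal estimate side by side with the external RMSE above, show_best() returns the mean cross-validated RMSE of the winning configuration (optional check).
# Optional: the cross-validated RMSE of the best configuration, for comparison
show_best(smarket_tuned, metric = 'rmse', n = 1)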
Q7: SVM Decision Space (Regression)
Replicate the plot from Step 3.3 of W08 lab roadmap, only this time using the model from Q5.
See a model solution
When plotting the real data, I will focus on the external validation set. It has fewer samples, so the plot will not look as cluttered.
## Simulated data
sim.data <-
crossing(Volume = seq(0,3.5,0.1),
Lag1 = seq(-5,6,0.2))
sim.data <- augment(alternative_svm_model, sim.data) # predict over the simulated grid with the model from Q5
## Real Data
plot_df <-
augment(alternative_svm_model, external_validation_set) %>%
mutate(row_number=row_number())
plot_df$residual_above_2 <- (plot_df$.resid) > 2
g <- (
plot_df %>%
ggplot()
# Tile the background of the plot with SVM predictions
+ geom_tile(data = sim.data, aes(x=Volume, y=Lag1, fill = .pred), alpha = 0.45)
# Actual data
+ geom_point(aes(x=Volume, y=Lag1, colour = residual_above_2, shape = residual_above_2, alpha=residual_above_2), size=2.5, stroke=0.95)
# Define X and Os
+ scale_shape_manual(values = c(4, 1))
+ scale_fill_viridis_c()
+ scale_color_manual(values=c("black", "red"))
+ scale_alpha_manual(values=c(0.1, 0.7))
# (OPTIONAL) Customizing the colours and theme of the plot
+ theme_minimal()
+ theme(panel.grid = element_blank(), legend.position = 'bottom')
+ labs(x = 'Volume', y = 'Lag 1', fill = "Today's Prediction", colour = 'Residual above 2?', shape = 'Residual above 2?', alpha='Residual above 2?', title='Worst predictions are marked as red circles', subtitle="(showing external validation set only)")
)
g
Q8: Compare SVM Decision Space plots
Explain if and how the decision space you obtained in Q7 differs from the one in Step 3.3.
See a model solution
The decision space in Q7 is a lot simpler than the one in Step 3.3 of the W08 lab; predictions made by this SVM are very similar to each other, regardless of the values of Lag1 and Volume.
If you are not convinced, take a look at the colour legend at the bottom of the plot. Note that the range goes from 0.01 to 0.05 whereas in the first plot, colours represented a range from -1.0 to 1.5.
This is more apparent when we re-do the plot in Q7 overriding the colour range to be the same as before:
g + scale_fill_viridis_c(limits=c(-1, 1.5))
Not a lot of variance…
Q9: Grid search for Orange Juice (Classification)
- For this task, you will use the Orange Juice data set (ISLR2::OJ) with ALL the predictors included.
- Replicate the entire procedure of Step 4, making the necessary adjustments to predict Purchase.
- Use the F1 score as the optimisation metric.
💡 You will have to tweak at least two main things: metric_set and set_mode.
See a model solution
### Set aside some data for external validation
set.seed(123)
default_split <- initial_split(ISLR2::OJ , prop = 0.75, strata = Purchase)
internal_validation_set <- training(default_split)
external_validation_set <- testing(default_split)
orange_tuned <-
tune_grid(object = svm_rbf(cost = tune(), rbf_sigma = tune()) %>% set_mode('classification'),
preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
resamples = vfold_cv(internal_validation_set, v = 10),
grid = grid_regular(cost(), rbf_sigma(), levels = 5),
metrics = metric_set(f_meas),
control = control_grid(save_pred=TRUE))
plot_df <- collect_metrics(orange_tuned)
plot_df$rbf_sigma <- factor(plot_df$rbf_sigma)
plot_df$cost <- factor(plot_df$cost)
plot_df$mean %>% summary()
g <-
(ggplot(plot_df,
aes(x=cost, y=rbf_sigma, size=mean, color=mean))
+ geom_point()
+ scale_size_continuous(limits=c(0.7, 0.9))
+ scale_colour_viridis_c(limits=c(0.7, 0.9))
+ scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
+ scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
+ theme_minimal()
+ theme(axis.text.x = element_text(angle=90),
legend.position = "bottom",
legend.direction ="horizontal")
+ labs(x="Cost", y="RBF Sigma", title="Results are similar", subtitle="But we get better (larger) F1-scores when sigma is closer to 3.2e-3")
)
g
Get the best configuration of parameters:
select_best(orange_tuned)
Now, I am curious about the external validation set:
oj_model2 <-
svm_rbf(cost=2.378414, rbf_sigma=0.003162278) %>%
set_mode('classification') %>%
fit(Purchase ~ ., data = internal_validation_set) # fit on the internal data; the external set below is kept for evaluation only
augment(oj_model2, external_validation_set) %>%
f_meas(Purchase, .pred_class)Good enough! F1-score ~ 0.86 is not bad and itโs closer to what we expected from the cross-validation
Also, the ROC curve applied to the external validation set doesn't look too bad!
augment(oj_model2, external_validation_set) %>%
roc_curve(Purchase, .pred_CH) %>%
autoplot()
Q10: Challenge
Replicate the same steps as in Q9 for different SVM kernels (linear, polynomial, etc.). Is it possible to fit a model that is better than the radial kernel, in terms of F1-score?
See a model solution
Linear
A few things are different:
- The only parameter we can change is cost.
- The algorithm only works if we explicitly tell tidymodels that this SVM kernel can be found in the "kernlab" package, hence the %>% set_engine("kernlab") added to the model specification.
orange_linear_tuned <-
tune_grid(object = svm_linear(cost = tune()) %>% set_mode('classification') %>% set_engine("kernlab") ,
preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
resamples = vfold_cv(internal_validation_set, v = 10),
grid = grid_regular(cost(), levels = 5),
metrics = metric_set(f_meas),
control = control_grid(save_pred=TRUE))
# Load results pre-computed when this page was built
orange_linear_tuned <- readRDS("RMarkdown/orange_linear_tuned.Rds")
Results do not vary too much from svm_rbf. As long as cost is sufficiently high, F1-scores are pretty decent.
plot_df <- collect_metrics(orange_linear_tuned)
plot_df$cost <- factor(plot_df$cost)
plot_df
Polynomial
A few things are different:
- We have to tune cost, degree and scale_factor.
- Some combinations of parameters do not yield valid solutions. You might see a few errors in your console.
orange_polynomial_tuned <-
tune_grid(object = svm_poly(cost = tune(), degree=tune(), scale_factor=tune()) %>% set_mode('classification') ,
preprocessor = recipe(Purchase ~ ., data = internal_validation_set),
resamples = vfold_cv(internal_validation_set, v = 10),
grid = grid_regular(cost(), degree(), scale_factor(), levels = 5),
metrics = metric_set(f_meas),
control = control_grid(save_pred=TRUE))
Results do not vary too much here either, but there is a 'sweet spot': F1-scores improve once the scale factor is large enough, as the plot below shows.
plot_df <- collect_metrics(orange_polynomial_tuned)
plot_df$degree <- factor(plot_df$degree)
plot_df$cost <- factor(plot_df$cost)
plot_df$scale_factor <- factor(plot_df$scale_factor)
g <-
(ggplot(plot_df,
aes(x=cost, y=scale_factor, size=mean, color=mean))
+ geom_point()
+ scale_radius(limits=c(0.7, 0.9), range=c(1, 4))
+ scale_colour_viridis_c(limits=c(0.7, 0.9))
+ scale_x_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
+ scale_y_discrete(labels = function(x) format(as.numeric(x), digits=2, scientific=TRUE))
+ theme_bw()
+ theme(axis.text.x = element_text(angle=90),
legend.position = "bottom",
legend.direction ="horizontal")
+ labs(x="Cost", y="Scale Factor", title="Results are similar to svm_rbf", subtitle="But we get better (larger) F1-scores when Scale Factor is large enough")
+ facet_grid(~ degree, labeller = label_both)
)
g
Extra
It is possible to tune multiple models at the same time, instead of tuning each algorithm separately. Check the Tuning and comparing models tutorial from tidymodels.
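As a rough, optional sketch of what that tutorial covers (it assumes the workflowsets package, which ships with tidymodels, and all object names here are made up):
# Sketch only: tune the radial and linear kernels in one go with workflowsets
oj_recipe <- recipe(Purchase ~ ., data = internal_validation_set)
oj_models <- list(
  rbf    = svm_rbf(cost = tune(), rbf_sigma = tune()) %>% set_mode('classification'),
  linear = svm_linear(cost = tune()) %>% set_mode('classification') %>% set_engine('kernlab')
)
oj_workflows <- workflow_set(preproc = list(all = oj_recipe), models = oj_models)
oj_results <- oj_workflows %>%
  workflow_map('tune_grid',
               resamples = vfold_cv(internal_validation_set, v = 10),
               grid = 5,
               metrics = metric_set(f_meas))
# Rank all tuned workflows by their cross-validated F1-score
rank_results(oj_results, rank_metric = 'f_meas')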
