🗓️ Week 07:
Supervised Learning: Non-linear algorithms

Non-linear algorithms

2023-11-10

A quick recap on cross-validation

  • Once we’re done training our model (e.g. linear/logistic regression), we need to evaluate its out-of-sample performance
  • We could do that on a single held-out test set, but:
    • what if we have limited data?
    • what if we’re not sure the test set has a similar distribution to the training set?

A quick recap on cross-validation

  • Enter K-fold cross-validation, a resampling procedure used to evaluate models on limited data:
    • the data is randomly split into K equal-sized subsamples (K is usually 5 or 10); the model is trained on K-1 of them and evaluated on the remaining one, rotating until each subsample has served as the evaluation set
    • the performance of the model is averaged over those K evaluations
    • this gives a less biased estimate of performance, since every observation is used for evaluation exactly once (a minimal code sketch follows below)

Source:https://upload.wikimedia.org/wikipedia/commons/4/4b/KfoldCV.gif
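To make this concrete, here is a minimal tidymodels sketch of the procedure (the 5-fold choice, the wage ~ age formula and the linear model are just placeholders for illustration; we will meet vfold_cv() and fit_resamples() again later in these slides):

library(tidymodels)
library(ISLR2)   # for the Wage data used later in these slides

set.seed(123)
folds <- vfold_cv(Wage, v = 5)   # 5 folds (the default is v = 10)

lm_wf <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(wage ~ age)

# Train on 4 folds, evaluate on the held-out fold, repeat 5 times
fit_resamples(lm_wf, resamples = folds) %>%
  collect_metrics()   # performance averaged across the 5 folds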

Regression analysis in real life

Source: https://twitter.com/Sansom_Rob/status/1583156024621805568?ref_src=twsrc%5Etfw

The limits of classic regression models

The limits of classic regression models

Linear and logistic regression are a good first shot for building ML models

  • Easy-to-interpret coefficients
  • Intuitive (ish) ways to assess variable importance
  • Often good out-of-the-box predictions

However…

  • The assumption that the predictors are linearly related to the outcome is restrictive
  • Sometimes, accounting for higher-order polynomial relationships can produce a better model fit

Example: Fitting a 4th-degree polynomial to model the relationship between wage and age in the Wage dataset (Dataset source: James et al. 2021)

The Wage dataset provides income survey data for men in the central Atlantic region of the USA:

  • data frame with 3000 observations on the following 11 variables

    • year: Year that wage information was recorded
    • age : Age of worker
    • maritl: factor with levels indicating marital status 1. Never Married 2. Married 3. Widowed 4. Divorced and 5. Separated
    • race: factor with levels indicating race 1. White 2. Black 3. Asian and 4. Other
    • education: factor with levels indicating education level 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad and 5. Advanced Degree
    • region: Region of the country (mid-atlantic only)
    • jobclass: factor with levels indicating type of job 1. Industrial and 2. Information
    • health: factor with levels indicating health level of worker 1. <=Good and 2. >=Very Good
    • health_ins: factor with levels 1. Yes and 2. No
    • logwage: Log of the worker’s wage
    • wage: The worker’s raw wage
  • Data manually assembled by Steve Miller, of Inquidia Consulting (formerly Open BI). From the March 2011 Supplement to Current Population Survey data. (https://www.re3data.org/repository/r3d100011860)

Example: Fitting a 4th-degree polynomial to model the relationship between wage and age in the Wage dataset (Dataset source: James et al. 2021)

Click here for the code
library(tidymodels)
library(ISLR2)

# Linear regression model specification (needed by the workflow below)
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

rec_raw_poly <- recipe(wage ~ age, data = Wage) %>%
  step_poly(age, degree = 4, options = list(raw = TRUE))

raw_poly_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_raw_poly)

raw_poly_fit <- fit(raw_poly_wf, data = Wage)

tidy(raw_poly_fit)

age_range <- tibble(age = seq(min(Wage$age), max(Wage$age)))

regression_lines <- bind_cols(
  augment(raw_poly_fit, new_data = age_range),
  predict(raw_poly_fit, new_data = age_range, type = "conf_int")
)
regression_lines


Wage %>%
  ggplot(aes(age, wage)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), color = "darkgreen",
            data = regression_lines) +
  geom_line(aes(y = .pred_lower), data = regression_lines, 
            linetype = "dashed", color = "blue") +
  geom_line(aes(y = .pred_upper), data = regression_lines, 
            linetype = "dashed", color = "blue")
For more detailed explanations on the code, see here

Enter non-linear methods

  • These algorithms do not make (strong) statistical assumptions about the data
  • The focus is more on predictive power than on explanatory power

The Decision Tree

Decision Tree for a Regression task

Using the Wage dataset, predict wage with a tree-based model using all other features except year and logwage.

Decision Tree for a Regression task

Using the Wage dataset, predict wage with a tree-based model using all other features except year and logwage.

Source Code

Tip

  • Use the code below to replicate the plot from the previous slide.
  • Found a bug? Report it on Slack.
  • 💡 Check out this tutorial of rpart.plot.
library(ISLR2)      # to load the Wage data
library(tidyverse)  # for the pipe (%>%), mutate and if_else
library(tidymodels) # for initial_split(), vfold_cv() and the recipe/workflow tooling

library(rpart)      # a library that contains decision tree models
library(rpart.plot) # a library that plots rpart models 

# Cross-validation setup: an initial train/test split of the data,
# then cross-validation folds created from the training set
set.seed(42)
wage_split <- initial_split(Wage, strata = wage)
wage_train <- training(wage_split)
wage_test <- testing(wage_split)

set.seed(1337)
wage_folds <- vfold_cv(wage_train, strata = wage)
wage_folds

Source Code (continued)

# decision_tree() specifies a decision tree model that we can then fit to the data.
# We follow the familiar recipe/workflow pattern; the only change is the
# function/engine used in the model part of the workflow.
dt_spec <- decision_tree(mode = "regression", tree_depth = 5) %>%
  set_engine("rpart")

rec <- recipe(wage ~ age + maritl + race + education + region + jobclass + health + health_ins, data = wage_train)

wage_flow <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(dt_spec)

# Evaluate the workflow on the cross-validation folds
res <- fit_resamples(wage_flow, wage_folds) %>% collect_metrics()

# Fit the final model on the full training set and extract the underlying rpart object
model <- 
  wage_flow %>% 
  fit(data = wage_train)
tree_fit <- model %>% extract_fit_engine()

rpart.plot(tree_fit, roundint = FALSE) # plot the fitted decision tree

Code for the feature importance plot

library(vip)
tree_fit %>%
  vip(geom = "col", aesthetics = list(fill = "midnightblue", alpha = 0.8)) +
  scale_y_continuous(expand = c(0, 0))

Decision Tree for a Classification task

Using the Default dataset, predict whether default is Yes or No based on student, income and balance:

  • The Default dataset is a simulated data set containing information on ten thousand customers; the aim is to predict which customers will default on their credit card debt.

  • data frame with 10000 observations on the following 4 variables:

    • default: factor with levels No and Yes indicating whether the customer defaulted on their debt
    • student: factor with levels No and Yes indicating whether the customer is a student
    • balance: average balance that the customer has remaining on their credit card after making their monthly payment
    • income: Income of the customer

Decision Tree for a Classification task

Source Code

Tip

  • Use the code below to replicate the plot from the previous slide.
  • Found a bug? Report it on Slack.
  • 💡 Check out this tutorial of rpart.plot.
library(ISLR2)
library(tidymodels)

default_split <- initial_split(Default)
# extract the training data
default_train <- training(default_split)

# resample the training data with 5-fold cross-validation (vfold_cv uses 10 folds by default)
cv <- vfold_cv(default_train, v = 5)
rec <- recipe(default ~ student + balance + income, data = default_train)

Source Code (continued)

#specify the decision tree model
dt_spec <- decision_tree(mode = "classification", tree_depth = 2) %>% 
  set_engine("rpart")

#define the workflow
default_flow <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(dt_spec)
#get the cross-validation going and evaluate the model
res <- fit_resamples(default_flow, cv) %>% collect_metrics()

How does it work?

What’s going on behind the scenes?

How decision trees work:

  • Divide the predictor space into \(J\) distinct regions \(R_1, R_2, \ldots, R_J\)
  • Take the mean of the response values in each region as that region’s prediction (for classification, use the most common class instead)

Here’s how the regions were created in our regression/classification examples ⏭️
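Before looking at those plots, here is a toy numeric illustration of the “mean per region” idea on the Wage data (the cutpoint of age = 40 is an arbitrary choice made for this sketch, not one selected by an actual tree):

library(ISLR2)
library(dplyr)

# Pretend a tree split the predictor space at age = 40:
# R1 = {age < 40}, R2 = {age >= 40}; the prediction in each region is the mean wage there
Wage %>%
  mutate(node = if_else(age < 40, "R1: age < 40", "R2: age >= 40")) %>%
  group_by(node) %>%
  summarise(prediction = mean(wage), n_obs = n())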

Alternative representation of decision tree (Classification)

Source code

Tip

  • Use the code in the following slides to replicate those two plots.
  • Found a bug? Report it on Slack.
  • 💡Check out the parttree documentation for how to customize your plot
  • 💡Learn more about data visualisation with ggplot2 on R for Data Science - Chapter 3

First, you will have to install the parttree package:

# Follow the instructions by the developers of the package
# (https://github.com/grantmcdermott/parttree)

install.packages("remotes")
remotes::install_github("grantmcdermott/parttree", force = TRUE)

Alternative representation of decision tree (Classification)

Alternative representation of decision tree (Classification)

Source code (continued)

Then:

library(tidymodels) # for initial_split() and training()
library(datasets)
library(parttree)
library(rpart)
library(ggplot2)

iris_split <- initial_split(iris)
# extract the training data
iris_train <- training(iris_split)

# We have to forego a recipe here: the visualisation package we're using, parttree,
# is only configured to work with the rpart engine, not with tidymodels.
simple_model <- rpart(Species ~ Petal.Length + Petal.Width, data = iris_train)

ggplot(data = iris_train, mapping=aes(x = Petal.Length, y = Petal.Width)) +
  geom_parttree(data = simple_model, aes(fill = Species), alpha=0.25) +
  geom_point(aes(color = Species), alpha = 0.7) +
  theme_minimal() +
  labs(title="An OK model",
       subtitle="The model finds three different 'boxes' that more or less match the classes")

How are regions created?

Recursive binary splitting

Top down

  • Start from the top of the tree
  • Then perform successive splits, moving down one level of depth at a time

Greedy

  • Splits are “local”, not global: the best split is chosen at each step, with no look-ahead
  • Only the data reaching the current branch is considered

What if we use more splits?…

Source code


unnecessarily_complex_model <- rpart(Species ~ Petal.Length + Petal.Width, 
                                     data = iris_train,
                                     control=rpart.control(minsplit = 1))

ggplot(data = iris_train, mapping=aes(x = Petal.Length, y = Petal.Width)) +
  geom_parttree(data = unnecessarily_complex_model, aes(fill = Species), alpha=0.25) +
  geom_point(aes(color = Species), alpha = 0.7) +
  theme_minimal() +
  labs(title="Allowing the tree to have more splits leads to better model.. or does it?",
       subtitle="We get more blues and greens correctly, but is that safe for the test set?")

For regression…

  • The tree selects a predictor \(X_j\) and a cutpoint \(s\) that minimise the residual sum of squares (RSS).
  • We define two half-planes \(R_1(j,s) = \left\{X|X_j < s\right\}\) and \(R_2(j,s) = \left\{X|X_j \geq s\right\}\) and find \(j\) and \(s\) by minimising the quantity below (a brute-force sketch of this search follows the formula):

\[ \sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]
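Here is the brute-force sketch mentioned above, scanning candidate cutpoints for a single predictor (age in the Wage data) and keeping the one with the lowest RSS; rss_for_cutpoint is a helper written just for this sketch, not part of any package:

library(ISLR2)

rss_for_cutpoint <- function(s, x, y) {
  left  <- y[x <  s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

cutpoints <- sort(unique(Wage$age))[-1]   # candidate values of s (drop the smallest so both regions are non-empty)
rss <- sapply(cutpoints, rss_for_cutpoint, x = Wage$age, y = Wage$wage)
cutpoints[which.min(rss)]                 # the cutpoint s that minimises the criterion above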

For classification…

  • The tree selects a predictor \(X_j\) and a cutpoint \(s\) that maximise node purity, i.e. minimise an impurity measure such as the two below (a small numeric sketch of both follows this list):
  • Gini index: \(G = \sum_{k = 1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})\)
  • Entropy: \(D = - \sum_{k = 1}^{K} \hat{p}_{mk}\log \hat{p}_{mk}\)
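Here is the small numeric sketch of both measures for a single node, given its class proportions (the proportions below are made up for illustration):

p <- c(0.7, 0.2, 0.1)         # hypothetical class proportions in one node

gini    <- sum(p * (1 - p))   # G = sum of p_mk (1 - p_mk); 0.46 here
entropy <- -sum(p * log(p))   # D = -sum of p_mk log(p_mk); about 0.80 here

# Both measures get closer to 0 as the node becomes purer,
# e.g. p <- c(0.98, 0.01, 0.01) gives a Gini index of about 0.04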

What can go wrong

When trees run amok

  • Trees can become too complex if we are not careful
  • This can lead to overfitting:
    • High training set predictive power
    • Low test set predictive power
  • Let’s see one example ⏭️

The following tree is TOO specialised

Tree based on Auto dataset from (James et al. 2021):
the task was to predict gas mileage (mpg variable in the dataset)
for vehicles based on the model year (year) and weight (weight)

How to fix it

Pruning the tree

  • Hyperparameters are model-specific dials that we can tune
    • Things like max tree depth, or min samples per leaf
  • As with model selection, there is no one-size-fits-all approach to hyperparameter tuning.
  • Instead, we experiment with resampling
    • Most frequently, k-fold cross-validation

k-fold cross-validation

Cost Complexity

  • We apply a non-negative tuning parameter \(\alpha\) that penalises tree complexity and is used to prune the tree.
  • For example, setting \(\alpha = 0.02\) yields a less complex tree.
  • In the decision_tree function, you would set the cost_complexity parameter to 0.02 (see the documentation of the function here); a sketch of how to tune this parameter with cross-validation follows below.
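Here is that tuning sketch: a hedged example of how you could tune cost_complexity with tidymodels, assuming the rec recipe and wage_folds cross-validation folds created in the regression-tree code earlier in these slides (the grid of 10 values is an arbitrary choice):

library(tidymodels)

tune_spec <- decision_tree(mode = "regression", cost_complexity = tune()) %>%
  set_engine("rpart")

tune_wf <- workflow() %>%
  add_recipe(rec) %>%      # recipe from the regression-tree slides
  add_model(tune_spec)

# Try 10 candidate values of alpha (cost_complexity) on the cross-validation folds
cc_grid <- grid_regular(cost_complexity(), levels = 10)
tune_res <- tune_grid(tune_wf, resamples = wage_folds, grid = cc_grid)

show_best(tune_res, metric = "rmse")   # the best-performing values of alpha
final_wf <- finalize_workflow(tune_wf, select_best(tune_res, metric = "rmse"))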

What’s Next

After our 10-min break 🍵:

  • Support Vector Machine

Support Vector Machines

Support Vector Machines for Classification

  • Considered one of the best out-of-the-box classifiers (ISLR)
  • Able to accommodate non-linear decision boundaries

Building Intuition: the Hyperplane

Building Intuition: the Hyperplane (cont.)

When \(1 + 2x_1 + 3x_2 < 0\)

Building Intuition: the Hyperplane (cont.)

When \(1 + 2x_1 + 3x_2 > 0\)

Building Intuition: the Hyperplane (cont.)

When \(1 + 2x_1 + 3x_2 = 0\)
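A tiny sketch of those three cases: the sign of \(1 + 2x_1 + 3x_2\) tells us which side of the hyperplane a point lies on (the example points below are arbitrary):

hyperplane <- function(x1, x2) 1 + 2 * x1 + 3 * x2

hyperplane(-2,  1)   # = 0      -> the point lies exactly on the hyperplane
hyperplane( 1,  1)   # = 6  > 0 -> the point lies on one side
hyperplane(-3, -1)   # = -8 < 0 -> the point lies on the other side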

The Maximal Marginal Classifier

The linearly separable case:

Identifying support vectors

The SVs in the plot are the so-called support vectors:

When data is not linearly separable

Suppose we have a case that is not linearly separable, like this one: there are two classes, but class 1 is “sandwiched” in between class 2.

Enter support vector machines

Let’s start with a (linear) support vector classifier function

\[ f(x) = \beta_0 + \sum_{i\in\mathcal{S}} \alpha_i\left\langle x, x_i \right\rangle \]

A couple of points:

  • \(\left\langle x, x_i \right\rangle\) is the inner product of a new observation \(x\) with training observation \(x_i\)
  • \(\alpha_i\) is a parameter fitted for each training observation, estimated from the inner products between all pairs of training observations
  • \(i\in\mathcal{S}\) indexes the observations that are support vectors (all other observations are ignored by setting their \(\alpha_i\) to zero)

Enter support vector machines (cont.)

  • We can replace \(\left\langle x_i, x_{i'} \right\rangle\) with a generalisation of the form \(K(x_i, x_{i'})\), where \(K\) is a kernel.

Two well-known kernels (both written out as plain R functions in the sketch after this list):

  • Polynomial \(K(x_i, x_{i'}) = (1 + \sum_{j = 1}^{p} x_{ij}x_{i'j})^d\) where \(d > 1\).
  • Radial or “Gaussian” \(K(x_i, x_{i'}) = \text{exp}(-\gamma\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2)\) where \(\gamma\) is a positive constant.
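Here are those two kernels as R functions (the default values of d and gamma are arbitrary choices for this sketch):

polynomial_kernel <- function(x, z, d = 2) (1 + sum(x * z))^d
radial_kernel     <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))

x <- c(5.1, 3.5)   # e.g. Sepal.Length and Sepal.Width of one iris flower
z <- c(4.9, 3.0)   # ... and of another

polynomial_kernel(x, z, d = 2)
radial_kernel(x, z, gamma = 1)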

Example: Iris data

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

SVM with Linear Kernel

SVM with Polynomial Kernel

In tidymodels, each type of SVM kernel requires a different function:

  • svm_linear for linear kernels (or alternatively svm_poly(degree = 1), since a polynomial kernel of degree 1 is a linear kernel)
  • svm_poly for polynomial kernels (that’s the function used to produce the plot on this slide, with the degree parameter set to 2)
  • svm_rbf for radial kernels

tidymodels does not have a function for sigmoid kernels; to train one, you have to use the svm function from the e1071 library with its kernel parameter set to "sigmoid".
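For reference, here is what the three tidymodels specifications listed above look like side by side (the degree of 2 mirrors the polynomial plot on this slide; cost and rbf_sigma are left at their defaults here):

library(tidymodels)

svm_linear_spec <- svm_linear() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_poly_spec <- svm_poly(degree = 2) %>%   # polynomial kernel of degree 2
  set_mode("classification") %>%
  set_engine("kernlab")

svm_rbf_spec <- svm_rbf() %>%               # radial ("Gaussian") kernel
  set_mode("classification") %>%
  set_engine("kernlab")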

SVM with Radial Kernel

SVM with Sigmoid Kernel

Source Code (SVM with linear kernel)

Tip

  • Use the code below to replicate the plot from the previous slide.
  • Found a bug? Report it on Slack.
library(tidymodels)
library(tidyverse)
library(ggplot2)
library(recipes)
library(workflows)
library(ISLR2) # kept for consistency with the other examples (iris itself ships with base R)
library(ggsci) # just for pretty colours! provides scale_colour_lancet() and scale_fill_lancet() used in the final plot
library(parsnip)

#The usual recipe
svm_rec <- 
  recipe(Species ~ Sepal.Length + Sepal.Width, data = iris) 

#Write the model specification
svm_spec <- svm_linear() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

# Combining recipe and model specification in a workflow
svm_wflow <- 
  workflow() %>% 
  add_model(svm_spec) %>% 
  add_recipe(svm_rec)

Source Code (SVM with linear kernel) (continued)


# Applying the workflow to fit the model to the data
svm_fit <- svm_wflow %>% 
  fit(data = iris)

# Generate all possible combinations of Sepal.Length and Sepal.Width
feature_space <- crossing(Sepal.Length = seq(4, 8, 0.1), Sepal.Width = seq(2, 5, 0.1))
feature_space <- svm_fit %>% 
                 augment(feature_space)

# Create a data frame just for plotting (with predictions and whether each one is correct)
plot_df <- svm_fit %>%
  augment(iris) %>%
  mutate(correct = (.pred_class == Species))

# Generate the final predictions plot
plot_df %>%   
  ggplot() +
  geom_tile(data = feature_space, aes(x=Sepal.Length, y=Sepal.Width, fill = .pred_class), alpha = 0.25) +
  geom_point(aes(x=Sepal.Length, y=Sepal.Width, colour = Species, shape = correct), size = 4) +
  scale_shape_manual(values = c(4, 1)) +
  scale_colour_lancet() +
  scale_fill_lancet() +
  theme_minimal() +
  theme(panel.grid = element_blank(), legend.position = 'bottom', 
                          plot.title = element_text(hjust = 0.5)) +
  labs(x = 'Sepal.Length', y = 'Sepal.Width', fill = 'Species', colour = 'Species',
       shape = 'Correct prediction?',
       title = sprintf('Overall Training Accuracy = %.2f %%',
       100*(sum(plot_df$correct)/nrow(plot_df))))

Source code (SVM with sigmoid kernel)

We mentioned that tidymodels did not have a function to train and fit an SVM with a sigmoid kernel and that you need the e1071 library for that. So here is the code you would need in such a case.


library(ISLR2)
library(ggsci)
library(tidyverse)
library(e1071) #for the svm function
library(ggplot2)

model <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris, kernel = 'sigmoid')

kernel.points <-
  crossing(Sepal.Length = seq(4, 8, 0.1),
           Sepal.Width = seq(2, 5, 0.1)) %>% 
  mutate(pred = predict(model, .))

plot_df <- iris %>% mutate(pred=predict(model, iris), correct = if_else(pred == Species, TRUE, FALSE))

Source code (SVM with sigmoid kernel) (continued)

plot_df %>%   
  ggplot() +
  geom_tile(data = kernel.points, aes(x=Sepal.Length, y=Sepal.Width, fill = pred), alpha = 0.25) +
  geom_point(aes(x=Sepal.Length, y=Sepal.Width, colour = Species, shape = correct), size = 4) +
  scale_shape_manual(values = c(4, 1)) +
  scale_colour_lancet() +
  scale_fill_lancet() +
  theme_minimal() +
  theme(panel.grid = element_blank(), legend.position = 'bottom', plot.title = element_text(hjust = 0.5)) +
  labs(x = 'Sepal.Length', y = 'Sepal.Width', fill = 'Species', colour = 'Species', shape = 'Correct prediction?', 
       title = sprintf('Overall Training Accuracy = %.2f %%', 100*(sum(plot_df$correct)/nrow(plot_df))))

Overfitting

Overfitting illustrated

Simple models can sometimes be better than complex models.

Simple model on training set

Complex model on training set

Now let’s create some testing data

Look what happens when I set a different seed (nothing else changes) to construct a test set.

Simple model on test set

Complex model on test set

Back to iris

SVM with Radial Kernel, but with tweaked parameters, namely cost and rbf_sigma:

Source Code

Tip

  • The only difference compared to the previous SVM with Radial Kernel plot is that we added the rbf_sigma and cost parameters
  • We used the svm_rbf function instead of the svm_linear function from the SVM with linear kernel code, since we’re training an SVM with a radial kernel
library(tidymodels)
library(tidyverse)
library(ggplot2)
library(recipes)
library(workflows)
library(ISLR2) # kept for consistency with the other examples (iris itself ships with base R)
library(ggsci) # provides scale_colour_lancet() and scale_fill_lancet() used in the final plot
library(parsnip)

#The usual recipe
svm_rec <- 
  recipe(Species ~ Sepal.Length + Sepal.Width, data = iris) 

#Write the model specification
svm_spec <- svm_rbf(cost=10^4,rbf_sigma = 10^2) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

Source code (continued)

# Combining recipe and model specification in a workflow
svm_wflow <- 
  workflow() %>% 
  add_model(svm_spec) %>% 
  add_recipe(svm_rec)

# Applying the workflow to fit the model to the data
svm_fit <- svm_wflow %>% 
  fit(data = iris)

# Generate all possible combinations of Sepal.Length and Sepal.Width
feature_space <- crossing(Sepal.Length = seq(4, 8, 0.1), Sepal.Width = seq(2, 5, 0.1))
feature_space <- svm_fit %>% 
                 augment(feature_space)

# Create a dataframe just for plotting (with predictions)
plot_df <- svm_fit %>%
  augment(iris) %>% 
  mutate(correct=(.pred_class == Species)) 

Source code (continued)

# Generate the final predictions plot
plot_df %>%   
  ggplot() +
  geom_tile(data = feature_space, aes(x=Sepal.Length, y=Sepal.Width, fill = .pred_class), alpha = 0.25) +
  geom_point(aes(x=Sepal.Length, y=Sepal.Width, colour = Species, shape = correct), size = 4) +
  scale_shape_manual(values = c(4, 1)) +
  scale_colour_lancet() +
  scale_fill_lancet() +
  theme_minimal() +
  theme(panel.grid = element_blank(), legend.position = 'bottom',
                     plot.title = element_text(hjust = 0.5)) +
  labs(x = 'Sepal.Length', y = 'Sepal.Width', fill = 'Species',
           colour = 'Species', shape = 'Correct prediction?',
          title = sprintf('Overall Training Accuracy = %.2f %%',
                        100*(sum(plot_df$correct)/nrow(plot_df))))

Cross-validation to the rescue!

Example

  • A 5-fold cross-validation:

Split 1:  Test  | Train | Train | Train | Train
Split 2:  Train | Test  | Train | Train | Train
Split 3:  Train | Train | Test  | Train | Train
Split 4:  Train | Train | Train | Test  | Train
Split 5:  Train | Train | Train | Train | Test

Recommendations

  • Readings:
  • Practice the code that is contained in these slides! Several times!

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd edition. Springer Texts in Statistics. New York NY: Springer. https://www.statlearning.com/.