Non-linear algorithms
2023-11-10
Source: https://twitter.com/Sansom_Rob/status/1583156024621805568
Linear and logistic regression are a good first shot for building ML models
The Wage dataset (dataset source: James et al. 2021) provides income survey data for men in the central Atlantic region of the USA: a data frame with 3,000 observations on the following 11 variables.

year: year that wage information was recorded
age: age of worker
maritl: factor with levels indicating marital status (1. Never Married, 2. Married, 3. Widowed, 4. Divorced, 5. Separated)
race: factor with levels indicating race (1. White, 2. Black, 3. Asian, 4. Other)
education: factor with levels indicating education level (1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree)
region: region of the country (mid-atlantic only)
jobclass: factor with levels indicating type of job (1. Industrial, 2. Information)
health: factor with levels indicating health level of worker (1. <=Good, 2. >=Very Good)
health_ins: factor with levels 1. Yes and 2. No indicating whether the worker has health insurance
logwage: log of worker's wage
wage: worker's raw wage

Data manually assembled by Steve Miller of Inquidia Consulting (formerly Open BI), from the March 2011 Supplement to the Current Population Survey data (https://www.re3data.org/repository/r3d100011860).
Wage dataset (dataset source: James et al. 2021)

library(tidymodels)
library(ISLR2)
rec_raw_poly <- recipe(wage ~ age, data = Wage) %>%
  step_poly(age, degree = 4, options = list(raw = TRUE))
# Linear regression model specification (needed by add_model() below)
lm_spec <- linear_reg() %>%
  set_engine("lm")
raw_poly_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_raw_poly)
raw_poly_fit <- fit(raw_poly_wf, data = Wage)
tidy(raw_poly_fit)
age_range <- tibble(age = seq(min(Wage$age), max(Wage$age)))
regression_lines <- bind_cols(
  augment(raw_poly_fit, new_data = age_range),
  predict(raw_poly_fit, new_data = age_range, type = "conf_int")
)
regression_lines
Wage %>%
  ggplot(aes(age, wage)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), color = "darkgreen",
            data = regression_lines) +
  geom_line(aes(y = .pred_lower), data = regression_lines,
            linetype = "dashed", color = "blue") +
  geom_line(aes(y = .pred_upper), data = regression_lines,
            linetype = "dashed", color = "blue")
Using the Wage dataset, predict wage with a tree-based model using all other features except year and logwage.
Tip: you can visualise the fitted tree with the rpart.plot package.

library(ISLR2)      # to load the Wage data
library(tidyverse)  # to use things like the pipe (%>%), mutate and if_else
library(tidymodels) # for initial_split, vfold_cv, recipes, workflows and decision_tree
library(rpart)      # a library that contains decision tree models
library(rpart.plot) # a library that plots rpart models
#This is where we do cross-validation. We do an initial split of the data and then we create folds based on the training set
set.seed(42)
wage_split <- initial_split(Wage, strata = wage)
wage_train <- training(wage_split)
wage_test <- testing(wage_split)
set.seed(1337)
wage_folds <- vfold_cv(wage_train, strata = wage)
wage_folds
# The function decision_tree below specifies a decision tree that you can then fit to the data
# We're following the familiar recipes/workflow pattern here, the only changes are to the function/engine used in the model part of the workflow
dt_spec <- decision_tree(mode = "regression", tree_depth = 5) %>%
set_engine("rpart")
rec <- recipe(wage ~ age + maritl + race + education + region + jobclass + health + health_ins, data = wage_train)
wage_flow <-
workflow() %>%
add_recipe(rec) %>%
add_model(dt_spec)
res <- fit_resamples(wage_flow, wage_folds) %>% collect_metrics()
model <-
wage_flow %>%
fit(data = wage_train)
tree_fit <- model %>% extract_fit_engine()
library(rpart.plot)
rpart.plot(tree_fit, roundint = FALSE) # plot the decision tree
Code for the feature importance plot
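The plotting code itself was collapsed in the original notes. A minimal sketch of one way to produce a feature importance plot from the fitted rpart tree, using the vip package (an assumption, not necessarily the approach used in the slides):

library(vip) # variable importance plots (assumed here; not confirmed by the original notes)

# Plot the variable importance scores stored in the fitted rpart object
tree_fit %>%
  vip(num_features = 8)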
Using the Default dataset, predict whether default is Yes or No based on student, income and balance:
The Default dataset is a simulated data set containing information on ten thousand customers; the aim is to predict which customers will default on their credit card debt. It is a data frame with 10,000 observations on the following 4 variables:
default: factor with levels No and Yes indicating whether the customer defaulted on their debt
student: factor with levels No and Yes indicating whether the customer is a student
balance: average balance that the customer has remaining on their credit card after making their monthly payment
income: income of customer

Tip: you can visualise the fitted tree with the rpart.plot package.
library(ISLR2)
library(tidymodels)
default_split <- initial_split(Default)
# extract the training data
default_train <- training(default_split)
# resample the training data with 5-fold cross-validation
cv <- vfold_cv(default_train, v = 5)
rec <- recipe(default ~ student + balance + income, data = default_train)
#specify the decision tree model
dt_spec <- decision_tree(mode = "classification", tree_depth = 2) %>%
set_engine("rpart")
#define the workflow
default_flow <-
workflow() %>%
add_recipe(rec) %>%
add_model(dt_spec)
#get the cross-validation going and evaluate the model
res <- fit_resamples(default_flow, cv) %>% collect_metrics()
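The extracted code stops at the resampling metrics. A minimal sketch of how you might then fit the workflow on the training set and draw the tree referenced in the tip above (mirroring the Wage example; not shown in the original notes):

library(rpart.plot)

# Fit the workflow on the training data and pull out the underlying rpart object
default_model <- default_flow %>% fit(data = default_train)
default_tree <- default_model %>% extract_fit_engine()
rpart.plot(default_tree, roundint = FALSE) # plot the decision tree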
How decision trees work:
Here’s how the regions were created in our regression/classification examples ⏭️
Tip: see the parttree documentation for how to customise your plot.

First, you will have to install the parttree package:
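The installation command isn't included in the extracted notes; assuming installation from the package's GitHub repository (grantmcdermott/parttree), something like:

# Assumption: installing parttree from GitHub (check CRAN/R-universe for a released version)
# install.packages("remotes")
remotes::install_github("grantmcdermott/parttree")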
Then:
library(datasets)
library(parttree)
library(rpart)
library(rsample)  # for initial_split() and training()
library(ggplot2)
iris_split <- initial_split(iris)
# extract the training data
iris_train <- training(iris_split)
# We will have to forego a recipe now.
# The visualisation package we're using, parttree, is only configured to work with the rpart engine,
# not with tidymodels.
simple_model <- rpart(Species ~ Petal.Length + Petal.Width, data = iris_train)
ggplot(data = iris_train, mapping = aes(x = Petal.Length, y = Petal.Width)) +
  geom_parttree(data = simple_model, aes(fill = Species), alpha = 0.25) +
  geom_point(aes(color = Species), alpha = 0.7) +
  theme_minimal() +
  labs(title = "An OK model",
       subtitle = "The model finds three different 'boxes' that more or less match the classes")
Recursive binary splitting
unnecessarily_complex_model <- rpart(Species ~ Petal.Length + Petal.Width,
                                     data = iris_train,
                                     control = rpart.control(minsplit = 1))
ggplot(data = iris_train, mapping = aes(x = Petal.Length, y = Petal.Width)) +
  geom_parttree(data = unnecessarily_complex_model, aes(fill = Species), alpha = 0.25) +
  geom_point(aes(color = Species), alpha = 0.7) +
  theme_minimal() +
  labs(title = "Allowing the tree to have more splits leads to a better model... or does it?",
       subtitle = "We get more blues and greens correctly, but is that safe for the test set?")
\[ \sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]
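To make the criterion concrete, here is a small illustrative sketch (not from the original slides) that, for a single predictor, scans candidate split points s and returns the one minimising the two-region RSS above:

# Illustrative only: find the best single split point for one predictor x and response y
best_split <- function(x, y) {
  candidates <- sort(unique(x))[-1] # drop the smallest value so both regions are non-empty
  rss <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(rss)]
}

# e.g. the best single split of wage on age in the Wage data:
# best_split(Wage$age, Wage$wage)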
Tree based on the Auto dataset from (James et al. 2021): the task was to predict gas mileage (the mpg variable in the dataset) for vehicles based on the model year (year) and weight (weight).
max tree depth, or min samples per leaf
k-fold cross-validation
In the decision_tree function, you would set the cost_complexity parameter to 0.02 (see the documentation of the function here), e.g. as in the sketch below.
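A minimal sketch (not the original slide code) of such a specification:

# Pruning via the cost-complexity parameter in a tidymodels decision tree specification
library(tidymodels)

pruned_spec <- decision_tree(mode = "regression", cost_complexity = 0.02) %>%
  set_engine("rpart")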
After our 10-min break 🍵:
When \(1 + 2x_1 + 3x_2 < 0\)
When \(1 + 2x_1 + 3x_2 > 0\)
When \(1 + 2x_1 + 3x_2 = 0\)
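As a quick illustration (not from the slides), you can check which side of the hyperplane \(1 + 2x_1 + 3x_2 = 0\) a point falls on by looking at the sign of that expression:

# Illustrative only: the sign of 1 + 2*x1 + 3*x2 tells us the side of the hyperplane
hyperplane_side <- function(x1, x2) sign(1 + 2 * x1 + 3 * x2)
hyperplane_side(c(0, -1, 1), c(0, 0, -1)) # returns 1 (above), -1 (below), 0 (on the hyperplane)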
The linearly separable case:
The SVs represent the so-called support vectors:
Suppose we have a case that is not linearly separable like this. We have two classes but class 1 is “sandwiched” in between class 2.
Let’s start with a (linear) support vector classifier function
\[ f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \left\langle x, x_i \right\rangle \]
A couple of points:
Two well-known kernels:
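The kernels themselves are not listed in the extracted notes; the two usually meant here, following James et al. (2021), are the polynomial kernel and the radial kernel:

\[ K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^d \qquad \text{(polynomial, degree } d\text{)} \]

\[ K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right) \qquad \text{(radial)} \]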
Iris data

In tidymodels, each type of SVM kernel requires a different function:

svm_linear for linear kernels (or alternatively svm_poly(degree = 1), as a polynomial kernel of degree 1 is a linear kernel)
svm_poly for polynomial kernels (that's the function used to produce the plot of this slide, with parameter degree set to 2; see the sketch after this list)
svm_rbf for radial kernels

tidymodels does not have a function for sigmoid kernels; you have to use the svm function from the e1071 library, with parameter kernel set to sigmoid, to train one.
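A minimal sketch (assumed, not the original slide code) of the degree-2 polynomial-kernel specification mentioned above:

library(tidymodels)

# Degree-2 polynomial kernel SVM specification (kernlab engine)
svm_poly_spec <- svm_poly(degree = 2) %>%
  set_mode("classification") %>%
  set_engine("kernlab")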
Tip
library(tidymodels)
library(tidyverse)
library(ggplot2)
library(recipes)
library(workflows)
library(ISLR2) # textbook datasets (note: iris itself is a base R dataset from the datasets package)
library(ggsci) # just for pretty colours! gives us scale_colour_lancet() and scale_fill_lancet() in the final plot
library(parsnip)
#The usual recipe
svm_rec <-
recipe(Species ~ Sepal.Length + Sepal.Width, data = iris)
#Write the model specification
svm_spec <- svm_linear() %>%
set_mode("classification") %>%
set_engine("kernlab")
# Combining recipe and model specification in a workflow
svm_wflow <-
workflow() %>%
add_model(svm_spec) %>%
add_recipe(svm_rec)
# Applying the workflow to fit the model to the data
svm_fit <- svm_wflow %>%
fit(data = iris)
# Generate all possible combinations of Sepal.Length and Sepal.Width
feature_space <- crossing(Sepal.Length = seq(4, 8, 0.1), Sepal.Width = seq(2, 5, 0.1))
feature_space <- svm_fit %>%
augment(feature_space)
# Create a dataframe just for plotting (with predictions and whether each one is correct)
plot_df <- svm_fit %>%
  augment(iris) %>%
  mutate(correct = (.pred_class == Species))
# Generate the final predictions plot
plot_df %>%
ggplot() +
geom_tile(data = feature_space, aes(x=Sepal.Length, y=Sepal.Width, fill = .pred_class), alpha = 0.25) +
geom_point(aes(x=Sepal.Length, y=Sepal.Width, colour = Species, shape = correct), size = 4) +
scale_shape_manual(values = c(4, 1)) +
scale_colour_lancet() +
scale_fill_lancet() +
theme_minimal() +
theme(panel.grid = element_blank(), legend.position = 'bottom',
plot.title = element_text(hjust = 0.5)) +
labs(x = 'Sepal.Length', y = 'Sepal.Width', fill = 'Species', colour = 'Species',
shape = 'Correct prediction?',
title = sprintf('Overall Training Accuracy = %.2f %%',
100*(sum(plot_df$correct)/nrow(plot_df))))
We mentioned that tidymodels
did not have a function to train and fit an SVM with a sigmoid kernel and that you need the e1071
library for that. So here is the code you would need in such a case.
library(ISLR2)
library(ggsci)
library(tidyverse)
library(e1071) #for the svm function
library(ggplot2)
model <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris, kernel = 'sigmoid')
kernel.points <-
  crossing(Sepal.Length = seq(4, 8, 0.1),
           Sepal.Width = seq(2, 5, 0.1)) %>%
  mutate(pred = predict(model, .))
plot_df <- iris %>% mutate(pred=predict(model, iris), correct = if_else(pred == Species, TRUE, FALSE))
plot_df %>%
ggplot() +
geom_tile(data = kernel.points, aes(x=Sepal.Length, y=Sepal.Width, fill = pred), alpha = 0.25) +
geom_point(aes(x=Sepal.Length, y=Sepal.Width, colour = Species, shape = correct), size = 4) +
scale_shape_manual(values = c(4, 1)) +
scale_colour_lancet() +
scale_fill_lancet() +
theme_minimal() +
theme(panel.grid = element_blank(), legend.position = 'bottom', plot.title = element_text(hjust = 0.5)) +
labs(x = 'Sepal.Length', y = 'Sepal.Width', fill = 'Species', colour = 'Species', shape = 'Correct prediction?',
title = sprintf('Overall Training Accuracy = %.2f %%', 100*(sum(plot_df$correct)/nrow(plot_df))))
Simple models can sometimes be better than complex models.
Look what happens when I set a different seed (nothing else changes) to construct a test set.
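To see why the seed matters, a small illustrative sketch (not from the slides): the same splitting code run under two different seeds selects different test observations, which can shift the reported metrics.

library(rsample)

# Illustrative only: two different seeds give two different train/test splits
set.seed(123); split_a <- initial_split(iris)
set.seed(456); split_b <- initial_split(iris)
identical(testing(split_a), testing(split_b)) # FALSE: the test sets differ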
iris: SVM with radial kernel, but tweaking parameters, namely cost and rbf_sigma:
Tip: note the rbf_sigma and cost parameters, and the use of the svm_rbf function instead of the svm_linear function used in the SVM with linear kernel code, since we're training an SVM with a radial kernel.

library(tidymodels)
library(tidyverse)
library(ggplot2)
library(recipes)
library(workflows)
library(ISLR2) # textbook datasets (note: iris itself is a base R dataset from the datasets package)
library(ggsci) # gives us scale_colour_lancet() and scale_fill_lancet() in the final plot
library(parsnip)
#The usual recipe
svm_rec <-
recipe(Species ~ Sepal.Length + Sepal.Width, data = iris)
#Write the model specification
svm_spec <- svm_rbf(cost = 10^4, rbf_sigma = 10^2) %>%
set_mode("classification") %>%
set_engine("kernlab")
# Combining recipe and model specification in a workflow
svm_wflow <-
workflow() %>%
add_model(svm_spec) %>%
add_recipe(svm_rec)
# Applying the workflow to fit the model to the data
svm_fit <- svm_wflow %>%
fit(data = iris)
# Generate all possible combinations of Sepal.Length and Sepal.Width
feature_space <- crossing(Sepal.Length = seq(4, 8, 0.1), Sepal.Width = seq(2, 5, 0.1))
feature_space <- svm_fit %>%
augment(feature_space)
# Create a dataframe just for plotting (with predictions)
plot_df <- svm_fit %>%
augment(iris) %>%
mutate(correct=(.pred_class == Species))
# Generate the final predictions plot
plot_df %>%
ggplot() +
geom_tile(data = feature_space, aes(x=Sepal.Length, y=Sepal.Width, fill = .pred_class), alpha = 0.25) +
geom_point(aes(x=Sepal.Length, y=Sepal.Width, colour = Species, shape = correct), size = 4) +
scale_shape_manual(values = c(4, 1)) +
scale_colour_lancet() +
scale_fill_lancet() +
theme_minimal() +
theme(panel.grid = element_blank(), legend.position = 'bottom',
plot.title = element_text(hjust = 0.5)) +
labs(x = 'Sepal.Length', y = 'Sepal.Width', fill = 'Species',
colour = 'Species', shape = 'Correct prediction?',
title = sprintf('Overall Training Accuracy = %.2f %%',
100*(sum(plot_df$correct)/nrow(plot_df))))
Split 1: Test  Train Train Train Train
Split 2: Train Test  Train Train Train
Split 3: Train Train Test  Train Train
Split 4: Train Train Train Test  Train
Split 5: Train Train Train Train Test
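In tidymodels, this resampling scheme is what vfold_cv produces; a minimal sketch (reusing wage_train from the earlier Wage example):

library(tidymodels)

# Five folds: each fold plays the role of the held-out "Test" block exactly once
wage_folds_5 <- vfold_cv(wage_train, v = 5)
wage_folds_5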
We experimented with k-fold CV in 🗓️ Week 04’s lecture/workshop
ISLR2 package (link)

LSE DS202 2023/24 Autumn Term