Machine Learning Cheatsheet
DS202 Blog Post
“How do I do ... using tidyverse/tidymodels?” This page has you covered.
Let me know if you would like me to add something here.
⚙️ Setup
What packages do you need?
In most cases, all you need is:
library(tidyverse)
library(tidymodels)
🧰 Part I Preprocessing data for ML using tidymodels
We will use the following important concepts from tidymodels:
A recipe is the tidymodels way of specifying how to preprocess and transform data. Recipes can help automate the data cleaning and preprocessing process, making it easier and more efficient for researchers to work with their data.
The recipes package provides a set of functions for defining and applying recipes to data sets. You will find that the functions of this package can be especially useful in social science research, where data sets often have missing or inconsistent values, and researchers need to create new variables or features for analysis.
A workflow is the tidymodels way of specifying how to perform complete data analysis, from data cleaning and preprocessing to modelling and visualization. Workflows can help researchers keep track of all the steps involved in their data analysis, and can help ensure that their analysis is transparent, replicable, and well-documented.
Workflows can be especially useful in social science research, where researchers often need to perform complex data analyses and reproduce their results, as they streamline the data analysis process and help ensure the reproducibility of results.
How recipes and workflow work together
Recipes can be used as a building block within workflows to specify the data preprocessing and transformation steps necessary for a particular analysis. In other words, recipes provide the “ingredients” for the data analysis, while workflows provide the overall plan for the analysis.
In a typical data science pipeline (workflow), you might want to perform all of the steps below:
- read in data
- clean and preprocess the data
- create new features
- fit a model to the pre-processed data
- visualise the results.
Each of these steps would be specified by one or more recipes while the overall step-by-step procedure listed above would be accomplished by the workflows package.
By breaking down the analysis into smaller, more manageable steps, researchers can more easily understand and troubleshoot each step, and can more easily reproduce the entire analysis if necessary. Additionally, by using
workflows
to document the entire analysis process, researchers can ensure that their analyses are transparent and well-documented.
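To make this concrete, here is a minimal sketch of how those steps fit together in code. The file name, the outcome column, and the preprocessing step are placeholders — swap in your own.
library(tidyverse)
library(tidymodels)

df <- read_csv("my_data.csv")                      # read in data (placeholder file name)

df_recipe <- recipe(outcome ~ ., data = df) %>%    # clean/preprocess & create features
  step_normalize(all_numeric_predictors())

wf <- workflow() %>%                               # bundle preprocessing + model
  add_recipe(df_recipe) %>%
  add_model(linear_reg())

wf_fit <- fit(wf, data = df)                       # fit the model to the pre-processed data

augment(wf_fit, df) %>%                            # visualise observed vs predicted values
  ggplot(aes(outcome, .pred)) +
  geom_point()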
1.1 How to create a recipe
Suppose we have two preprocessing steps to perform on our data set before training an algorithm. We can create a recipe
and make use of the ever-so-helpful pipe operator %>%
:
#create the recipe for further processing
df_recipe <-
  recipe(formula, data = data) %>%
  step_{FILLIN}() %>% # imagine you have a recipe step here
  step_{FILLIN}()     # imagine you have another recipe step here
# 📢 note that the order you write the steps matters.
formula should have two sides connected by the tilde (~). The left side should be the outcome column you want to predict, and the right side contains all the predictors. data is the dataset you will work with.
step_{FILLIN} is the actual data processing step. This can include things like missing data imputation, normalisation, filtering, etc. You can find all the available steps and pre-processing choices on the Step Function Reference page.
But this is not all. After you have created the recipe, you need to apply it to the data. This is done by using the prep()
and bake()
functions. Take a look at the example below for a better understanding.
Example of a recipe
Suppose we have a table as follows containing information about students, including their age, gender, and test scores.
| age | gender | score |
|-----|--------|-------|
| 18  | M      | 85    |
| 19  | F      | 90    |
| 20  | F      | 75    |
| 21  | M      | 80    |
| 22  | F      | 95    |
💡 Now suppose we want to create a new variable that indicates whether each student’s test score is above or below the average score. The code below shows how to do this using a recipe, and then how to apply the recipe to the data.
# Create a recipe
df_recipe <- recipe(score ~ ., data = data) %>%
  step_mutate(
    above_average = ifelse(score > mean(score), "yes", "no")
  )

# Apply the recipe to the data
data_transformed <- df_recipe %>%
  prep() %>%
  bake(new_data = data)

# View the transformed data
data_transformed
This is what the code above does:
- We use recipe() to specify the formula and data.
- We use the step_mutate() function to create a new variable called above_average based on the score variable: it takes the value yes if the score is above the average, and no otherwise.
- We apply df_recipe to the data using the prep() and bake() functions. The prep() function prepares the recipe for use, while the bake() function applies the recipe to the data, creating the above_average variable.
- We can now see the transformed data in the data_transformed variable.
💡 TIPS
This is just a simple example of how a recipe can be used to preprocess and transform data in the tidyverse. In practice, recipes can be much more complex and can be used to handle missing values, standardize variables, and perform a wide range of data preprocessing tasks.
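As a hedged sketch of what a slightly richer recipe might look like (using the student data above with generic column selectors), handling imputation, encoding, and standardisation in one go:
df_recipe <- recipe(score ~ ., data = data) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill in missing numeric values
  step_dummy(all_nominal_predictors()) %>%          # dummy-encode categorical predictors
  step_normalize(all_numeric_predictors())          # centre and scale numeric predictors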
About roles
🎯 This section is meant just to deepen your understanding of recipes; in practice, you might not need to tweak the roles of variables
When you finish building a recipe, each variable has been given a role, typically predictor or outcome. This comes from the fact that recipes are mostly used in conjunction with workflows in a predictive modelling context. In this typical scenario, it is important to distinguish between the variables that will be used to predict the outcome variable (i.e., predictors) and the variable that we are trying to predict (i.e., the outcome or response variable).
When you create a recipe object, it automatically assigns roles based on the formula: the variable on the left-hand side of the ~ gets the outcome role, and every variable on the right-hand side gets the predictor role, whether numeric or categorical.
You can manually change the role of each variable with update_role() if you want, and you can give roles any name you like, not just predictor and outcome.
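For example, if your data had an ID column (here called student_id, a hypothetical name) that should be kept around but not used as a predictor, a sketch would be:
# Keep an ID column in the data without treating it as a predictor
df_recipe <- recipe(score ~ ., data = data) %>%
  update_role(student_id, new_role = "id") %>%   # any label other than predictor/outcome works
  step_normalize(all_numeric_predictors())

summary(df_recipe)   # lists each variable together with its assigned role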
1.2 How to set up a workflow
Now let’s create a workflow to bind the modelling and preprocessing objects together using the workflow() function. It will make your code more logical and easier to read.
This is how you can create a workflow:
# the basic code
model_workflow <- workflow() %>%
  add_recipe(df_recipe) %>%
  add_model(model)
💡 Tip: The order of appearance of add_recipe()
and add_model()
should be based on your questions and working steps.
Example of workflow
Using the same imaginary dataset, let’s apply a linear regression to help us examine whether age and gender affect students’ scores.
# Define a workflow
student_workflow <- workflow() %>%
  add_model(linear_reg()) %>%   # linear regression via the default "lm" engine
  add_recipe(recipe(score ~ ., data = data) %>%
               step_mutate(
                 above_average = ifelse(score > mean(score), "yes", "no")
               ) %>%
               step_dummy(gender) %>%
               step_scale(all_numeric_predictors()) %>%
               step_center(all_numeric_predictors()))

# Fit the workflow to the data
fitted_workflow <- student_workflow %>% fit(data)

# Summarize the underlying lm model
fitted_workflow %>% extract_fit_engine() %>% summary()
- We define a workflow using the workflow() function. The workflow includes a linear regression model (linear_reg(), which runs lm under the hood) with age and gender as predictors, and a recipe that preprocesses the data by creating a new variable above_average (step_mutate), dummy encoding gender (step_dummy), and scaling and centring the numeric predictors (step_scale and step_center).
- We then fit the workflow to the data using the fit() function.
- We finally summarise the fitted model with extract_fit_engine() and summary().
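If you later need the individual pieces of a fitted workflow, the extract_*() helpers pull them out; a short sketch:
fitted_workflow %>% extract_fit_parsnip()   # the fitted parsnip model object
fitted_workflow %>% extract_recipe()        # the trained (prepped) recipe
fitted_workflow %>% extract_preprocessor()  # the recipe as originally specified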
🧰 Part II Resampling
To build the most accurate model, we typically split the data into training and testing sets, and also use cross-validation or bootstrap methods to validate our model.
Main Functions from the rsample package:
Split the data into training and testing parts
# Split the data into training and testing parts
# Tweak `prop` to change the proportion of training/testing data
data_split <- initial_split(data, prop = 3/4)

# Fit your model with this dataset
data_train <- training(data_split)

# Only use this when evaluating your models
data_test <- testing(data_split)
Create a cross-validation object with 10 folds
df_folds <- vfold_cv(data_train, v = 10)

df_folds
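If your outcome is imbalanced, you might prefer stratified folds, so each fold keeps a similar outcome distribution; a sketch (outcome_col is a placeholder for your outcome column):
df_folds <- vfold_cv(data_train, v = 10, strata = outcome_col)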
Apply the cross-validation object
- Fit the model within each of the folds
# Create a custom metrics function based on your question
test_metrics <- metric_set(roc_auc, sens, spec, accuracy)

# Fit resamples
# wf is the workflow you created for your model
dt_rs_model <- wf %>%
  fit_resamples(resamples = df_folds,
                metrics = test_metrics)
- Extract the performance metrics for each fold
# View performance metrics
dt_rs_model %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "accuracy") %>% # use filter() if we want to stick to one metric
  select(-.estimator)
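If instead you want the metrics averaged across all folds, leave summarize at its default:
# Mean and standard error of each metric across the 10 folds
dt_rs_model %>%
  collect_metrics()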
Construct a set of bootstrap replicates of the data
df_boot <- bootstraps(data, times = 10)

df_boot
Apply the bootstrap object
We can set summarize = FALSE in collect_metrics() to see the individual performance metrics for each resample.
- Fit the model within each of the bootstrap resamples
# Create a custom metrics function based on your question
test_metrics <- metric_set(roc_auc, sens, spec)

# Fit resamples
# wf is the workflow you created for your model
dt_rs_model <- wf %>%
  fit_resamples(resamples = df_boot,
                metrics = test_metrics)
- Extract the performance metrics for each resample
As before, we can set summarize = FALSE in collect_metrics() to see the individual performance metrics for each resample.
# View performance metrics
dt_rs_model %>%
  collect_metrics(summarize = FALSE)
🧰 Part III Algorithms
3.1 Regression
We use regression when we want to predict a numeric variable (for example, number of accidents, house price, etc.). If the target variable is not numeric you will get an error or, worse, counterintuitive results.
📌 In 3.1 we do not use the workflow machinery; the aim is to demonstrate a range of standalone solutions for your reference. We show how to use workflow in 3.2.
- Main Function: linear_reg()
How to train:
model <-
  linear_reg() %>%
  set_mode("regression") %>%
  fit(Today ~ ., data = ISLR2::Smarket %>% select(-Direction))
Get a summary of the fitted model:
summary(model$fit)
Make Predictions
You can use your model
to predict the outcome of a given data set. We use the augment() function for that:
df_augmented <- augment(model, df)
Try a different implementation
By default, linear_reg()
runs the same lm()
algorithm we learned about in 💻 Week 03 - Lab but if you want you can use an alternative implementation from another R package. For example, to use a Bayesian implementation of linear regression (from stan
), use:
model <-
  linear_reg() %>%
  set_engine("stan") %>%
  set_mode("regression") %>%
  fit(Today ~ ., data = ISLR2::Smarket %>% select(-Direction))
Diagnostic Plots
How well did the model fit the data?
par(mfrow = c(2, 2))
plot(model$fit)
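As a complement to the base-R diagnostic plots, you can also get numeric goodness-of-fit summaries with broom::glance() (broom is loaded with tidymodels); a small optional sketch:
glance(model$fit)   # R-squared, adjusted R-squared, sigma, AIC, etc.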
- Main Function: decision_tree()
- Required libraries:
rpart
andrpart.plot
How to train:
model <-
  decision_tree() %>%
  set_mode("regression") %>%
  fit(Today ~ ., data = ISLR2::Smarket %>% select(-Direction))
Get a summary of the fitted model:
summary(model$fit)
Plot the model
library(rpart.plot)
rpart.plot(model$fit, roundint=FALSE)
Make Predictions
You can use your model
to predict the outcome of a given data set. We use the augment() function for that:
df_augmented <- augment(model, df)
Try a different implementation
By default, decision_tree()
runs the algorithm contained in the rpart
package. This is the same package we learned about in 💻 Week 07 - Lab, but if you want you can use an alternative implementation from another R package. For example, to use the C5.0 engine (which only supports classification, so we switch the mode and the outcome; requires the C50 package):
model <-
  decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification") %>%
  fit(Direction ~ ., data = ISLR2::Smarket %>% select(-Today))
Change parameters
There are three main parameters you can tweak when using the rpart
engine:
cost_complexity
tree_depth
min_n
tidymodels
might use default values or attempt to guess the best values for the parameters. If you want to choose parameter values explicitly, pass those to the decision_tree()
function. For example:
model <-
  decision_tree(tree_depth = 3) %>%
  set_mode("regression") %>%
  fit(Today ~ ., data = ISLR2::Smarket %>% select(-Direction))
Note that the availability of parameters changes from engine to engine. The decision tree in the C5.0 engine, for example, has one tuning parameter, min_n
. Always check the documentation!
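A quick way to check the available engines and exposed arguments without leaving R — a sketch:
show_engines("decision_tree")   # engines and the modes they support
args(decision_tree)             # tunable arguments exposed by the parsnip function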
- Supported kernels:
- Radial Basis Function (svm_rbf())
- Polynomial (svm_poly())
- Linear (svm_linear())
- Don’t know what kernels are? Check 🗓️ Week 05 - Part II
- Required libraries:
kernlab
andLiblineaR
(for the linear kernel).
How to train:
model <-
  svm_rbf() %>%
  set_mode("regression") %>%
  fit(Today ~ ., data = ISLR2::Smarket %>% select(-Direction))
Get a summary of the fitted model:
model
Make Predictions
You can use your model
to predict the outcome of a given data set. We use the augment() function for that:
df_augmented <- augment(model, df)
Plot the decision space
THIS ONLY WORKS WITH TWO PREDICTORS!
Step 1: Check the min and max values of the two SELECTED predictors
Replace <predictor1>
and <predictor2>
with the name of your selected predictors.
data %>%
  select(c(<predictor1>, <predictor2>)) %>%
  summary()
Identify the min and max values of each predictor.
Step 2: Create a simulated dataset
You will need to find a suitable step_val1
and step_val2
. Play with different values until you find one that you like.
sim.data <-
  crossing(<predictor1> = seq(min_val_predictor1, max_val_predictor1, step_val1),
           <predictor2> = seq(min_val_predictor2, max_val_predictor2, step_val2))
Step 3: Run the fitted model on this simulated data
sim.data <- augment(model, sim.data)
Step 4: Run the fitted model on the data used to train the model
plot_df <- augment(model, data)
Step 5: Build the plot
Remember to replace <predictor1>
and <predictor2>
with the name of your selected predictors.
g <- (
  plot_df %>%
    ggplot()
## Tile the background of the plot with SVM predictions
+ geom_tile(data = sim.data, aes(x=<predictor1>, y=<predictor2>, fill = .pred), alpha = 0.45)
## Actual data
+ geom_point(aes(x=<predictor1>, y=<predictor2>), size=2.5, stroke=0.95)
## Define X and Os
+ scale_shape_manual(values = c(4, 1))
+ scale_fill_viridis_c()
+ scale_color_manual(values=c("black", "red"))
+ scale_alpha_manual(values=c(0.1, 0.7))
## (OPTIONAL) Customizing the colours and theme of the plot
+ theme_minimal()
+ theme(panel.grid = element_blank(),
legend.position = 'bottom',
plot.title = element_text(hjust = 0.5))
)
g
Try a different kernel
Simply replace svm_rbf()
with one of the other kernels available in tidymodels (svm_poly()
or svm_linear()
).
Note that the availability of parameters changes from kernel to kernel. Always check the documentation!
Change parameters
There are two main parameters you can tweak when using the svm_rbf()
function:
cost
rbf_sigma
tidymodels
might use default values or attempt to guess the best values for the parameters. If you want to choose parameter values explicitly, pass those to the svm_rbf()
function. For example:
model <-
  svm_rbf(cost = 1, rbf_sigma = 0.2) %>%
  set_mode("regression") %>%
  fit(Today ~ ., data = ISLR2::Smarket %>% select(-Direction))
Note that the availability of parameters changes from kernel to kernel. Always check the documentation!
3.2 Classification
When the outcome is a categorical class label, whether nominal (e.g., “spam email” or “not spam email”) or ordinal (e.g., “high-risk street” or “low-risk street”), we use classification models. In a classification problem, the goal is to train a model on a labelled dataset so that it can accurately predict the class labels for new, unseen data. This can be done using a variety of algorithms such as Decision Trees, Support Vector Machines (SVMs), Naive Bayes, and Logistic Regression.
For the Decision Trees and Support Vector Machines (SVMs) models, we have already introduced the detailed code in the Regression part above. The main change is to set set_mode("classification"). You can double-check with the Main Functions.
📒 This version of the code for Decision Trees
and Support Vector Machines (SVMs)
does not use workflows – it is an alternative
- Main Function: decision_tree()
- Required libraries:
rpart
andrpart.plot
How to train:
model <-
  decision_tree() %>%
  set_mode("classification") %>%
  fit(Direction ~ ., data = ISLR2::Smarket %>% select(-Today))
Get a summary of the fitted model:
summary(model$fit)
Plot the model
library(rpart.plot)
rpart.plot(model$fit, roundint=FALSE)
Make Predictions
You can use your model
to predict the outcome of a given data set. We use the augment() function for that:
df_augmented <- augment(model, df)
How to train:
model <-
  svm_rbf() %>%
  set_mode("classification") %>%
  fit(Direction ~ ., data = ISLR2::Smarket %>% select(-Today))
Get a summary of the fitted model:
model
Make Predictions
You can use your model
to predict the outcome of a given data set. We use the augment() function for that:
df_augmented <- augment(model, df)
- Main Function logistic_reg()
First, we have to specify which type of model we want to work with.
# your model specification
log_model <-
  logistic_reg() %>%              # model type
  set_engine(engine = "glm") %>%  # model engine
  set_mode("classification")      # model mode
Then, we shall organize a workflow.
wf <-
  workflow() %>%
  add_model(log_model) %>%
  add_recipe(df_recipe)

# show details of our workflow
wf
Train the model
This is the training step: it fits the model and returns the fitted results.
log_fit <-
  wf %>%
  fit(data = data_train)

# check the model's information
log_fit %>%
  summary()
Make Predictions
We use the augment() function in a similar way in this step, this time with the testing dataset to make the predictions.
# make predictions
df_augmented <- augment(log_fit, data_test)

# show the details
df_augmented
Try a different implementation
The default engine for this model is glm, but you can also try six other engines. To switch, simply change the argument in set_engine(engine = "glm").

show_engines("logistic_reg")
#> # A tibble: 7 × 2
#> engine mode
#> <chr> <chr>
#> 1 glm classification
#> 2 glmnet classification
#> 3 LiblineaR classification
#> 4 spark classification
#> 5 keras classification
#> 6 stan classification
#> 7 brulee classification
Detailed code for each engine can be found in the official documentation, e.g. Logistic regression via glmnet.
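As a hedged illustration of swapping the engine, penalised (lasso) logistic regression via glmnet might look like this — the penalty and mixture values are arbitrary placeholders, and the glmnet package must be installed:
log_model_glmnet <-
  logistic_reg(penalty = 0.01, mixture = 1) %>%  # placeholder penalty/mixture values
  set_engine("glmnet") %>%
  set_mode("classification")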
- Main Function: naive_Bayes
Following the same recipe and workflow idea, we organise the data and prepare the model for Naive Bayes. For the data, we still use df_recipe.
# naive_Bayes() comes from the discrim extension package;
# the "naivebayes" engine also needs the naivebayes package installed
library(discrim)

# prepare the model (your model specification)
nb_model <-
  naive_Bayes() %>%                      # model type
  set_engine(engine = "naivebayes") %>%  # model engine
  set_mode("classification")             # model mode

# workflow for the model
wf <-
  workflow() %>%
  add_model(nb_model) %>%
  add_recipe(df_recipe)

# show details of our workflow
wf
Train the model and make prediction
nb_fit <-
  wf %>%
  fit(data = data_train)

# check the model's information
nb_fit %>%
  summary()
We use augment() again for the prediction.
# make predictions
df_augmented <- augment(nb_fit, data_test)

# show the details
df_augmented
🎯 Evaluation of the classification model
For classification problems, we use different assessment methods than for regression, most notably the confusion matrix. These will reveal how well the model fits the data.
df_augmented %>% conf_mat(truth = , estimate = )
- truth refers to the column containing the actual (observed) outcome in the dataset you used to fit the model.
- estimate refers to the column containing the predicted classes; its name normally starts with .pred_
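For instance, if the observed class column were called Direction and the hard class predictions were stored in .pred_class (the names will differ in your data), the filled-in call would look like:
df_augmented %>%
  conf_mat(truth = Direction, estimate = .pred_class)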
You might want to plot the confusion matrix and also get the recall/precision/F1 score results.
# confusion matrix
df_augmented %>%
  conf_mat(truth = , estimate = ) %>%
  autoplot(type = 'heatmap')

# precision and recall
tibble(
  "precision" =
    precision(df_augmented, truth = , estimate = ) %>%
    select(.estimate),
  "recall" =
    recall(df_augmented, truth = , estimate = ) %>%
    select(.estimate)
) %>%
  unnest() %>%
  kable()

# F1 score
df_augmented %>%
  f_meas(truth = , estimate = ) %>%
  select(-.estimator) %>%
  kable()
Another way is the ROC curve and AUC (the area under the curve).
# for the roc plot
df_augmented %>%
  roc_curve(truth = , estimate = ) %>%
  autoplot()

# for the auc number
df_augmented %>%
  roc_auc(truth = , estimate = )
Also, with the metric_set() function, we can easily get a whole set of evaluation metrics at once.
metricsets <- metric_set(accuracy, recall, roc_auc, f_meas)

df_augmented %>% metricsets(truth = , estimate = )
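As a filled-in sketch with hypothetical column names: the class metrics use the .pred_class column, while roc_auc additionally needs the class-probability column (here .pred_Up):
df_augmented %>%
  metricsets(truth = Direction, .pred_Up, estimate = .pred_class)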
3.3 Clustering
Clustering is a form of unsupervised machine learning in which observations are grouped into clusters based on similarities in their data values, or features; in a clustering model, the label is the cluster to which the observation is assigned, based purely on its features.
We introduce how to run the K-Means clustering method in R. We do not fully explore all the arguments and elements of the function here, so see the Tutorial Resources and the Main Function documentation for more details.
K-Means Clustering Method
Main Function kmeans
Required library
library(cluster)     # clustering algorithms
library(factoextra)  # clustering algorithms & visualization
Train the model and make prediction
# train the k-means model
km_model <- kmeans(x = , centers = , nstart = , iter.max = )

# show the details of the model
km_model

# summarize the k-means model
summary(km_model)
- In the kmeans function, only x (the data) and centers are essential; centers defines the number of clusters you want.
- The fitted kmeans object contains several components (see the short sketch after this list):
  - cluster assigns a cluster to each row of the data;
  - centers, withinss, and size contain local information about each cluster;
  - totss, tot.withinss, betweenss, and iter contain global information about the clustering process.
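A quick sketch of pulling a few of these components out of the fitted object directly:
km_model$centers        # coordinates of each cluster centre
km_model$size           # number of observations in each cluster
km_model$tot.withinss   # total within-cluster sum of squares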
Then, you can use the trained Kmeans model to predict the cluster of a given data set.
# make predictions
df_augmented <- augment(km_model, data)

# show the details
summary(df_augmented)
Explore the clusters
Visualize the clusters and assign a unique colour to each cluster.
df_augmented %>%
  ggplot(aes(vector1, vector2, color = .cluster)) +
  geom_point()
Visualize the centroids of each cluster.
df_centroids <- as_tibble(km_model$centers)
df_centroids$cluster_id <- factor(seq(1, nrow(df_centroids)))

df_augmented %>%
  ggplot(aes(vector1, vector2, color = .cluster)) +
  geom_point(data = df_centroids, aes(color = cluster_id), size = , shape = "X")
- vector1 and vector2 refer to the names of the columns used as input data for the clustering
Define the number of clusters
The key decision is the number of clusters. We will show you one possible strategy to automate the process of finding an optimal number of clusters based on the total within-cluster sum of squares, which we refer to as tot.withinss. The elbow point of the plot gives us a direct visual cue.
# create a function
kmean_wss <- function(k) {
  kmeans(data, k)$tot.withinss
}

# Compute and plot wss for k = 1 to k = 10
k.values <- 1:10

# extract wss for 1-10 clusters
wss_values <- map_dbl(k.values, kmean_wss)

# Plot the graph
plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
The fviz_nbclust() function (from factoextra) gives us an easier way to produce the elbow plot.
fviz_nbclust(df, kmeans, method = "wss")
3.4 Dimensionality Reduction
Dimension reduction is the process of reducing the number of variables (also sometimes referred to as features or of course dimensions) to a set of values of variables called principal variables. It can be a good choice when you suspect there are “too many” variables. The main property of principal variables is the preservation of the structure and information carried by the original variables, to some extent.
💡 PCA “rearranges” the original data matrix, producing a new data matrix where all features are intentionally completely uncorrelated to each other.
Training PCA
Prepare a recipe
for the PCA training.
pca_recipe <-
  # First we specify a recipe of what data we are using
  recipe(formula, data = ) %>%
  # PCA requires that data have the same distribution
  # This can be achieved by normalizing the data (mean=0 and std=1)
  step_normalize(all_predictors()) %>%
  # This is where we tell the recipe to run PCA and return 9 Principal Components
  step_pca(all_predictors(), num_comp = 9)

# pca_recipe created a recipe, but it didn't run any of those steps.
# To train the PCA, we have to prepare the recipe -- with prep()
pca_prep <- prep(pca_recipe)
Now that we have prepared our recipe, let’s bake it using our ingredients (the data):
new_df <- bake(pca_prep, data)
Look at the variance explained by each PC
Use the summary() function. You will see the Standard deviation, Proportion of Variance and Cumulative Proportion for each component. Normally, we aim to preserve around 70% of the variance in the data.
summary(pca_prep$steps[[2]]$res)
Another way to decide how many components to keep, based on the variance explained, is to set threshold = 0.7 in step_pca().
recipe(~., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), threshold = 0.7) %>% # here is the threshold
  prep() %>%
  bake(new_data = NULL)
For text mining, we often use LSA to do the Dimensionality Reduction. It analyzes a set of documents and the terms within to find common or divergent concepts related to the documents and terms. We can extend this definition of LSA to be used as a method for classifying documents into different topics.
- Main Function: Latent Semantic Analysis (LSA)
Train the LSA
Before using LSA, you should first create a document-feature matrix.
library("quanteda.textmodels")
<- textmodel_lsa(dfm_pac, nd=2)$docs %>% #nd is the number of dimensions
df_lsa as.data.frame()
Visualize the results
We only consider two dimensions here.
library(plotly)  # plot_ly() comes from the plotly package

plot_ly(data = bind_cols(df_lsa, data),
        x = ~V1,
        y = ~V2,
        type = "scatter",
        mode = "markers",
        text = ~paste('Doc ID:', docname, '\nDescription:\n', post))
📚 Resources
- 💡 Exploratory Data Analysis with R: This tutorial contains the basic skills to use R for data processing and modelling. You could start from this to get familiar with R and RStudio. But you might see different code patterns there, as it mainly shows base R code rather than tidyverse. Overall, this tutorial is a good place to start, as it is easy to follow.
- 💡 Tidymodels and Tidy Modeling with R: These two tutorials are super useful and important. They contain almost all the models you could refer to in the tidymodels packages. They also cover recipe for data wrangling and workflow for model construction.
- 💡 Lisa Lendway’s COMP/STAT 112 website-tidyverse (with code examples and videos): This online resource intuitively shows how the code works on our data and models.