Week 03 - Homework
Tidymodels recipes and workflows - a tutorial
Learning Objectives
By the end of this homework, you will be able to:
- Learn how to use the `recipes` package to pre-process data before fitting a model
- Learn how to use the `workflows` package to combine a recipe and a model into a single object
- Learn how to use the `bake()` function to apply a recipe to a data frame
- Create a custom function in R
Preparation
Same prep as this week's lab.
Then use the link below to download the lab file:
Solutions to this homework will only be posted after you've had some time to work on it, i.e. by Sunday morning.
Tasks
🚨 You need to have the most recent version of `tidymodels` (i.e. 1.2.0) installed for the following code to work! Check that's the case by running the following command in the RStudio Console (or the VSCode terminal, after typing `R` to go into the R environment): `packageVersion("tidymodels")`. If not, update your `tidymodels` with `install.packages("tidymodels")`.
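Put together, the check and (if needed) the update look like this:

# Check which version of tidymodels is installed
packageVersion("tidymodels")

# If it is older than 1.2.0, install the latest version from CRAN
install.packages("tidymodels")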
Part I - Getting Ready
Import required libraries:
# Remember to install.packages() any missing packages
# Tidyverse packages we will use
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(stringr)
library(rsample) # this is what allows you to split data into training and test sets!
# Tidymodels packages we will use
library(yardstick) # this is where metrics like mae() are
library(parsnip) # This is where models like linear_reg() are
library(recipes) # You will learn about this today
library(workflows) # You will learn about this today
# And finally a package for cleaning variable names
library(janitor)
Read the data set:
It is the same data set you used in this week's lab:
# Modify the filepath if needed
filepath <- "data/WVS_Wave_7.csv"
wvs <- read_csv(filepath)
As in this week's lab, if you find that your laptop is unable to handle the full data set without running slowly, try experimenting with the following code (reminder: this takes a random sample of the data, stratifying by country so the sampling algorithm doesn't take more data from one country and less from another).
# Set a seed for reproducibility
set.seed(123)
wvs <-
  # Load the .csv
read_csv("data/WVS_Wave_7.csv") %>%
# Sample a proportion of the data set for each country.
# We have used 25% but you can experiment depending on
# the capability of your machine.
group_by(iso3c) %>%
slice_sample(prop = 0.25) %>%
ungroup()
Part II - Tidymodels recipes
Say our goal is to predict `satisfaction` using the variables available in the dataset, with the exception of `iso3c` (just as you did in this week's lab!). That is:
# Split the data with 75% being used to train the model
wvs_split <- initial_split(wvs, prop = 0.75)

# Create tibbles of the training and test set
wvs_train <- training(wvs_split)
wvs_test  <- testing(wvs_split)

# Then train it with tidymodels:
model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(satisfaction ~ . - iso3c, data = wvs_train)
But today, we want to show you a different way of pre-processing your data before fitting a model!
We start with a recipe
The problem with the code above is that, if you need to use the fitted model to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you only do it once, but if you need to do it many times, it can become a bit of a pain. This is where the `recipes` package (part of `tidymodels`) comes in handy.
A recipe is a set of instructions for how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredients are the data frame, and the instructions are the pre-processing steps we want to apply to it.
Here's how we would construct a recipe to pre-process our data before fitting a linear regression model:
# Create a recipe
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c)

summary(rec)
In this particular case, the only pre-processing we do is remove the variable `iso3c` from consideration in the model: that's what `step_rm()` does. Other examples of pre-processing could include removing the observations/rows that contain missing values (`step_naomit()`) or imputing (i.e. "taking a best guess" at) them (e.g. `step_impute_median()` or `step_impute_linear()`).
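For illustration only (these steps are not needed for this homework, and the names and column selections below are just hypothetical examples), such recipes could look like this:

# Illustrative only: drop any row with a missing predictor value...
rec_naomit <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  step_naomit(all_predictors())

# ...or impute missing numeric values with the column median instead
rec_impute <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  step_impute_median(all_numeric_predictors())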
The previous syntax is enough if all you're going to do is use your recipe to train a model. However, if you want to, say, check the correctness of your pre-processing by printing out the actual dataset you obtain after pre-processing, then you'll need a few extra steps. The previous code will then look like this:
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  prep() %>%
  bake(new_data = wvs_train)
`prep()` estimates all the quantities and statistics that the recipe you've specified (i.e. the pre-processing steps) needs in order to be applied by the `bake()` function. Note that you tell `bake()`, via its `new_data` argument, which dataset the pre-processing steps apply to - in this case the training set. You can think of `prep()` as the ingredient-prepping step in a cooking recipe, before you bake/cook everything and get the final meal!
⚠️ If you don't go through the prep and bake steps, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.
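If you would rather keep the recipe itself around for later reuse, you can also store the prepped recipe and the baked data in separate objects. A minimal sketch (the names prepped_rec and baked_train are just illustrative):

# Prep the recipe once...
prepped_rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  prep()

# ...then bake whichever data set you need with it
baked_train <- bake(prepped_rec, new_data = wvs_train)
head(baked_train)  # iso3c should no longer appear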
How to use this in a model?
So we have our recipe from before:
# Create a recipe
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c)
How do we now use this in a model?
For this we need a workflow to which we can attach the recipe and the model. Creating a workflow of a linear regression looks like this:
# Create the specification of a model but don't fit it yet
lm_model <-
  linear_reg() %>%
  set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model)
print(wflow)
Then, you can fit the workflow to the data:
model <-
  wflow %>%
  fit(data = wvs_train)

model
Do you want to confirm which recipe was used to pre-process the data? You can use `extract_recipe()`:
model %>%
  extract_recipe()
You can use your model to `augment()` the data with predictions:
model %>%
  augment(new_data = wvs_test) %>%
  head()
Notice that this augments our data frame with `.pred` and `.resid` columns.
How well did our model perform?
reg_metrics <- metric_set(rsq, rmse, mae, mape)

model %>%
  augment(new_data = wvs_test) %>%
  reg_metrics(truth = satisfaction, estimate = .pred)
You already know about the \(R^2\) (`rsq`) and \(RMSE\) (`rmse`) metrics: can you make sense of the other two metrics (`mae` for MAE and `mape` for MAPE)?
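If you need a hint, here are the usual definitions, where \(y_i\) are the observed values, \(\hat{y}_i\) the predictions and \(n\) the number of observations:

\[
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\]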
(Bonus) See if you can plot the metrics for a few countries of your choice.
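One possible starting point for the bonus (the country codes chosen here are purely illustrative - pick any that appear in your sample):

# Compute the metrics separately for a few countries, then plot them
model %>%
  augment(new_data = wvs_test) %>%
  filter(iso3c %in% c("AND", "ARG", "AUS")) %>%  # illustrative country codes
  group_by(iso3c) %>%
  reg_metrics(truth = satisfaction, estimate = .pred) %>%
  ggplot(aes(x = iso3c, y = .estimate)) +
  geom_col() +
  facet_wrap(~ .metric, scales = "free_y") +
  theme_bw() +
  labs(x = "Country", y = "Metric estimate")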
Let's look at the residuals plot for our model!
model %>%
  augment(new_data = wvs_test) %>%
ggplot(aes(.pred, .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_bw() +
labs(x = "Predicted life satisfaction", y = "Residuals")
Part III - Practice with recipes and workflows
1. Consider the following simulated data frame.
set.seed(123)
df_sim <-
  tibble(outcome = rnorm(100, 0, 1),
feature1 = rnorm(100, 0, 1),
feature2 = rnorm(100, 0, 20))
We have created a data set with one outcome (creatively labeled `outcome`) and two features, `feature1` and `feature2`, where one variable has been assigned a standard deviation 20 times the size of the other. In some cases, like linear regression, it will be okay to use `outcome ~ feature1 + feature2` as a formula. However, with other models, variables with larger variation will play a disproportionate role. As a result, we can use standardization (a.k.a. normalization) to counter this issue. This involves subtracting the mean of the variable and dividing the result by the standard deviation, i.e. \(z = (x - \bar{x}) / s_x\). Create a recipe for this data frame using a step from the `tidymodels` manual, and make use of graphs to demonstrate that the standardization process "worked".
2. Some models will only accept numeric input, meaning that variables like `employment` or `better_living` will need to be turned into several dummy variables. Using the `tidymodels` manual, can you find a step that can help us achieve this end?
3. Can you tweak the model we wrote in Part I to include the dummy step from Question 2? Evaluate the performance of this new model (first on the training set, then on the test set).
4. Write a function called `plot_residuals()` that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.
plot_residuals <- function(wflow_model, data) {

  # Replace this part with your code
  ...

  g <- ggplot(plot_df, aes(x = .pred, y = .resid, ...)) +
    # Replace this part with your code
    ...

  return(g)
}
5. Using the `plot_residuals()` function you've just created, can you create:
   - one residuals against fitted values plot for the training data used in Question 3,
   - and another, separate residuals against fitted values plot for the test data used in Question 3?
6. (Bonus) Can you train a LASSO model on the data and evaluate its performance using recipes and workflows? How does this model compare to linear regression? (See the sketch below for one possible starting point.)
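If you attempt the bonus, here is a minimal sketch of what a LASSO specification could look like, assuming the glmnet package is installed; lasso_model, lasso_wflow and the penalty value are illustrative choices, and your recipe will likely need extra steps (e.g. the dummy step from Question 2) before glmnet will accept the data:

# A LASSO is a linear regression with an L1 penalty:
# mixture = 1 gives a pure LASSO; penalty sets the amount of regularisation
lasso_model <-
  linear_reg(penalty = 0.01, mixture = 1) %>%  # penalty chosen arbitrarily here
  set_engine("glmnet")

# Reuse the recipe + workflow pattern from Part II
lasso_wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(lasso_model)

lasso_fit <-
  lasso_wflow %>%
  fit(data = wvs_train)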