Week 04 - Lab Roadmap (90 min)
Tidymodel recipes and workflows - a tutorial
Learning Objectives
By the end of this lab, you will be able to:
- Use the recipes package to pre-process data before fitting a model
- Use the workflows package to combine a recipe and a model into a single object
- Use the bake() function to apply a recipe to a data frame
- Create a custom function in R
Preparation
Same prep as last week.
Then use the link below to download the lab file:
Solutions to this lab will only be posted after all labs have ended, on Tuesday.
Lab Tasks
No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.
Part 0: Export your chat logs (~ 3 min)
As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:
ACTION POINTS
CLICK HERE to export your chat log.
Thanks for being GENIAL! You are now one step closer to earning some prizes!
NOTE: You MUST complete the initial form.
If you really don't want to participate in GENIAL¹, just answer "No" to the Terms & Conditions question - your e-mail address will be deleted from GENIAL's database the following week.
Part I - Getting Ready (10 min)
Setup Instructions
Import required libraries:
# Remember to install.packages() any missing packages
# Tidyverse packages we will use
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(stringr)
library(lubridate)
# Tidymodel packages we will use
library(yardstick) # this is where metrics like mae() are
library(parsnip) # This is where models like linear_reg() are
library(recipes) # You will learn about this today
library(workflows) # You will learn about this today
# And finally a package for cleaning variable names
library(janitor)
Read the data set:
It is the same data set you used in the previous lab:
# Modify the filepath if needed
filepath <- "data/UK-HPI-full-file-2023-06.csv"
uk_hpi <- read_csv(filepath)
ACTION POINTS: Run the code chunk below, paying attention to what you are doing.
Let's kick things off by crafting a df data frame that will hold records from select columns of the UK HPI dataset, broken down by individual UK countries. We'll start our analysis from 2005 onward.
df <-
  uk_hpi %>%
  clean_names() %>%
  group_by(region_name) %>%
  mutate(date = dmy(date),
         across(c(index, sales_volume), ~ lag(.x, 1), .names = "lag_1_{.col}")) %>%
  ungroup() %>%
  filter(region_name %in% c("England", "Wales", "Scotland", "Northern Ireland"),
         date >= dmy("01-01-2005")) %>%
  select(date, region_name, sales_volume, starts_with("lag_1"))
Part II - Tidymodels recipes (40 min)
TEACHING MOMENT: Your class teacher will explain all the code below. Follow along, running the code as you go. If you have any questions, don't save them for later, ask away!
Say our goal is to predict sales_volume using the sales volume from the past month, similar to what we've been doing so far in this course. We could use the dplyr functions mutate and lag to achieve that:
df_train <-
  df %>%
  group_by(region_name) %>%
  arrange(date) %>%
  mutate(SalesVolume_lag1 = lag(sales_volume, n = 1)) %>%
  drop_na()
# Then train it with tidymodels:
model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(sales_volume ~ ., data = df_train)
But today, we want to show you a different way of pre-processing your data before fitting a model!
We start with a recipe
The problem with the code above is that, if you need to use the fitted model to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you are only doing it once, but if you need to do it many times, it can become a bit of a pain. This is where the recipes package (part of tidymodels) comes in handy.
A recipe is a set of instructions for how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredients are the columns of the data frame, and the instructions are the pre-processing steps we want to apply to them.
Here's how we would construct a recipe to pre-process our data before fitting a linear regression model:
rec <-
  recipe(sales_volume ~ ., data = df) %>%
  step_naomit(lag_1_index, lag_1_sales_volume, skip = FALSE) %>%
  step_string2factor(region_name) %>%
  update_role(date, new_role = "id") %>%
  prep()
summary(rec)
How do we use this recipe? Well, we can use the bake() function to apply the recipe to our data frame:
rec %>%
  bake(df)
Warning: if you don't bake it, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.
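To make the difference between writing, prepping and baking a recipe concrete, here is a minimal sketch on a made-up four-row data frame. The names toy, toy_rec, toy_prepped and toy_baked are illustrative assumptions and not part of the lab data.

```r
library(dplyr)
library(recipes)

# Toy data, invented for illustration (not the UK HPI data)
toy <- tibble(y = c(1, 2, 3, 4),
              x = c(10, NA, 30, 40))

# An unprepped recipe only stores the instructions
toy_rec <-
  recipe(y ~ x, data = toy) %>%
  step_naomit(x, skip = FALSE)

# prep() estimates whatever the steps need from the training data
toy_prepped <- prep(toy_rec, training = toy)

# bake() applies the prepped steps to a data frame
toy_baked <- bake(toy_prepped, new_data = toy)

nrow(toy)        # 4 rows going in
nrow(toy_baked)  # 3 rows coming out: the row with the missing x was dropped
```

Until bake() is called, toy itself is untouched; the recipe object only describes what should happen.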
How to use this in a model?
For this we need a workflow to which we can attach the recipe and the model. Creating a workflow of a linear regression looks like this:
# Create the specification of a model but don't fit it yet
lm_model <-
  linear_reg() %>%
  set_engine("lm")
# Create a workflow to add the recipe and the model
wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model)
print(wflow)
Then, you can fit the workflow to the data:
model <-
  wflow %>%
  fit(data = df)
model
Do you want to confirm what recipe was used to pre-process the data? You can extract_recipe():
model %>%
  extract_recipe()
Need to extract the fitted model? You can extract_fit_parsnip():
model %>%
  extract_fit_parsnip()
The output of the above is the same as what you would get if you had fitted the model directly using the fit() function (like we did last week). That is, you can use it to augment() the data with predictions:
fitted_model <-
  model %>%
  extract_fit_parsnip()
# To make predictions, I cannot use original data
# Instead, I have to apply the same pre-processing steps (i.e. bake the recipe)
df_baked <-
  rec %>%
  bake(df)
fitted_model %>%
  augment(new_data = df_baked) %>%
  head()
Notice that this augments our data frame with .pred and .resid columns.
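As an aside, a fitted workflow can also be asked for predictions directly, in which case it applies its own recipe to the new data before predicting. The sketch below assumes a made-up data set (toy) and hypothetical object names (toy_rec, toy_fit, toy_preds); it passes an unprepped recipe to the workflow, since fit() trains the recipe as part of fitting.

```r
library(dplyr)
library(parsnip)
library(recipes)
library(workflows)

# Toy data, invented for illustration (not the UK HPI data)
set.seed(1)
toy <- tibble(x     = rnorm(50),
              group = rep(c("a", "b"), 25),
              y     = 2 * x + rnorm(50))

# Unprepped recipe: the workflow preps it when fit() is called
toy_rec <-
  recipe(y ~ ., data = toy) %>%
  step_string2factor(group)

toy_fit <-
  workflow() %>%
  add_recipe(toy_rec) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = toy)

# augment() on the workflow takes the raw data; no manual bake() needed
toy_preds <- augment(toy_fit, new_data = toy)
head(toy_preds)
```

This is a convenience rather than a replacement for the approach above: extracting the parsnip fit and baking by hand makes each step visible, which is the point of this part of the lab.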
How well did it perform, on average, by region? How does this compare to the standard deviation?
fitted_model %>%
  augment(new_data = df_baked) %>%
  group_by(region_name) %>%
  mae(truth = sales_volume, estimate = .pred) %>%
  left_join(fitted_model %>%
              augment(new_data = df_baked) %>%
              group_by(region_name) %>%
              summarise(sd = sd(sales_volume, na.rm = TRUE)),
            by = "region_name") %>%
  pivot_longer(cols = c(.estimate, sd), names_to = "comp") %>%
  ggplot(aes(x = value, y = region_name, fill = comp)) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(values = c("black", "grey"),
                    labels = c("MAE of Model",
                               "Std. Dev. of Outcome")) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(x = "Sales Volume", y = NULL, fill = NULL)
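To see what the two quantities in this comparison measure, here is a small worked sketch on invented numbers. The data frame toy_preds and its columns region and truth are assumptions made for illustration; only mae(), group_by() and summarise() are the same functions used above.

```r
library(dplyr)
library(yardstick)

# Four made-up observations with predictions, two per region
toy_preds <- tibble(
  region = c("A", "A", "B", "B"),
  truth  = c(10, 20, 100, 300),
  .pred  = c(12, 17, 150, 250)
)

# MAE per region: the average absolute gap between truth and prediction
toy_mae <-
  toy_preds %>%
  group_by(region) %>%
  mae(truth = truth, estimate = .pred)
# Region A: (|10 - 12| + |20 - 17|) / 2 = 2.5
# Region B: (|100 - 150| + |300 - 250|) / 2 = 50

# Standard deviation of the outcome per region, for comparison
toy_sd <-
  toy_preds %>%
  group_by(region) %>%
  summarise(sd = sd(truth))

toy_mae
toy_sd
```

Both numbers are in the units of the outcome, which is why they can share an axis: an MAE well below the standard deviation suggests the model does better than simply guessing the regional mean.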
Let's look at our residual plot across regions!
fitted_model %>%
  augment(new_data = df_baked) %>%
  # residual = observed - predicted; only needed if augment() does not add a .resid column
  mutate(.resid = sales_volume - .pred) %>%
  ggplot(aes(.pred, .resid)) +
  facet_wrap(. ~ region_name, scales = "free") +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw() +
  labs(x = "Predicted sales volume", y = "Residuals")
Part III - Practice with recipes and workflows (40 min)
- Consider the following simulated data frame.
set.seed(123)
df_sim <-
  tibble(outcome = rnorm(100, 0, 1),
         feature1 = rnorm(100, 0, 1),
         feature2 = rnorm(100, 0, 20))
We have created a data set with one outcome (creatively labeled outcome) and two features, feature1 and feature2, where one feature has been assigned a standard deviation 20 times the size of the other. In some cases, like linear regression, it will be okay to use outcome ~ feature1 + feature2 as a formula. However, with other models, variables with larger variation will play a disproportionate role. We can use standardization (a.k.a. normalization) to counter this issue. This involves subtracting the mean of the variable and dividing the result by the standard deviation. Create a recipe for this data frame using a step from the tidymodels manual, bake the recipe, and use graphs to demonstrate that the standardization process "worked".
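The arithmetic behind standardization can be checked by hand before looking for the matching recipe step. The sketch below uses a made-up vector x and plain base R; it is not the recipe-based solution the task asks for.

```r
# Made-up values, purely for illustration
x <- c(2, 4, 6, 8)

# Subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # zero, up to floating-point error
sd(z)    # one, up to floating-point error
```

Whatever scale x started on, z ends up with mean zero and standard deviation one, which is the property your graphs should show for both features after baking.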
1. We have used recipes::step_string2factor to convert region_name from a string to a factor variable. However, some models only accept numeric input, meaning that we will have to convert this variable into several dummy variables. Using the tidymodels manual, can you find a step that can help us achieve this end?
2. Take a look at the following data frame.
df <-
  uk_hpi %>%
  clean_names() %>%
  group_by(region_name) %>%
  mutate(date = dmy(date),
         century = paste0(str_remove(as.character(year(date) + 100), "[0-9]{2}$"), "st"),
         across(c(index, sales_volume), ~ lag(.x, 1), .names = "lag_1_{.col}")) %>%
  ungroup() %>%
  filter(region_name %in% c("England", "Wales", "Scotland", "Northern Ireland"),
         date >= dmy("01-01-2005")) %>%
  select(date, century, region_name, sales_volume, starts_with("lag_1"))
We have created a variable that contains zero variance. Can you:
- spot this variable (hint: check out dplyr::count), and
- find an appropriate step in the tidymodels manual that can help us remove this variable?
3. Using the data frame from Step 2, create a recipe that incorporates Steps 1 and 2.
4. Using the recipe in Step 3, train a model that uses as training data only the records of England, Scotland and Wales. Calculate the Mean Absolute Error (MAE) of the training data. Then, respond: how does it compare to the MAE of the model trained on all UK countries (Part II)?
5. Now, let's think of Northern Ireland records as a testing set. Predict sales_volume for Northern Ireland using the model you fitted in Step 4. Calculate the MAE of the predictions. What do you make of it?
6. Write a function called plot_residuals() that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.
plot_residuals <- function(wflow_model, data) {

  # Replace this part with your code
  ...

  g <- ggplot(plot_df, aes(x=.pred, y=.resid, ...)) +
    # Replace this part with your code
    ...

  return(g)
}
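If you have not written an R function that returns a plot before, the general pattern looks like the sketch below. The function plot_histogram(), its arguments and the value column are hypothetical and unrelated to the solution of this task; the sketch only shows how a function builds a ggplot object, stores it, and returns it.

```r
library(ggplot2)

# Hypothetical helper, for illustration only: expects a data frame
# with a numeric column called `value`
plot_histogram <- function(data, bins = 30) {

  g <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = bins) +
    theme_bw()

  return(g)
}

# The returned object can be printed, saved, or extended with more layers
set.seed(42)
g <- plot_histogram(data.frame(value = rnorm(100)), bins = 10)
g + labs(title = "A histogram built by a custom function")
```

Returning the ggplot object, rather than printing it inside the function, lets the caller add layers or save the plot afterwards.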
7. Using the plot_residuals function you've just created, can you create:
- one plot of residuals against fitted values for the training data used in Step 4
- and another, separate plot of residuals against fitted values for the test data used in Step 5?
Footnotes
¹ We're gonna cry a little bit, not gonna lie. But no hard feelings. We'll get over it.