πŸ›£οΈ Week 04 - Lab Roadmap (90 min)

Tidymodel recipes and workflows - a tutorial

Author

Dr. Stuart Bramwell

Published

05 Feb 2024

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Use the recipes package to pre-process data before fitting a model
  • Use the workflows package to combine a recipe and a model into a single object
  • Use the bake() function to apply a recipe to a data frame
  • Create a custom function in R

πŸ“š Preparation

Same prep as last week.

Then use the link below to download the lab file:

Solutions to this lab will only be posted after all labs have ended, on Tuesday.

πŸ“‹ Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.

Part 0: Export your chat logs (~ 3 min)

As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:

🎯 ACTION POINTS

  1. πŸ”— CLICK HERE to export your chat log.

    Thanks for being GENIAL! You are now one step closer to earning some prizes! 🎟️

πŸ‘‰ NOTE: You MUST complete the initial form.

If you really don’t want to participate in GENIAL¹, just answer ‘No’ to the Terms & Conditions question - your e-mail address will be deleted from GENIAL’s database the following week.

Part I - Getting Ready (10 min)

Click here to see βš™οΈ Setup Instructions

Import required libraries:

# Remember to install.packages() any missing packages

# Tidyverse packages we will use
library(dplyr)     
library(ggplot2)
library(tidyr)     
library(readr)
library(stringr)
library(lubridate) 

# Tidymodel packages we will use
library(yardstick) # this is where metrics like mae() are
library(parsnip)   # This is where models like linear_reg() are
library(recipes)   # You will learn about this today
library(workflows) # You will learn about this today

# And finally a package for cleaning variable names
library(janitor)

Read the data set:

It is the same data set you used in the previous lab:

# Modify the filepath if needed
filepath <- "data/UK-HPI-full-file-2023-06.csv"
uk_hpi <- read_csv(filepath)

🎯 ACTION POINTS: Run the code chunk below, paying attention to what you are doing.

Let’s kick things off by crafting a df data frame that will hold records from selected columns of the UK HPI dataset, broken down by individual UK countries. We’ll start our analysis from 2005 onward.

df <-
  uk_hpi %>%
  clean_names() %>% 
  group_by(region_name) %>% 
  mutate(date = dmy(date),
         across(c(index, sales_volume), ~ lag(.x, 1), .names = "lag_1_{.col}")) %>%
  ungroup() %>% 
  filter(region_name %in% c("England", "Wales", "Scotland", "Northern Ireland"),
         date >= dmy("01-01-2005")) %>% 
  select(date, region_name, sales_volume, starts_with("lag_1"))
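
If you want to double-check what this produced (optional, not an action point), a quick glimpse() of the result is a handy sanity check:

# Optional sanity check: column names, types and the first few values
df %>% 
  glimpse()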

Part II - Tidymodels recipes (40 min)

πŸ§‘πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain all the code below. Follow along, running the code as you go. If you have any questions, don’t save it to later, ask away!

Say our goal is to predict sales_volume using the sales volume from the past month, similar to what we have been doing so far in this course. We could use the dplyr functions mutate() and lag() to achieve that:

df_train <-
    df %>% 
    group_by(region_name) %>%
    arrange(date) %>%
    mutate(SalesVolume_lag1 = lag(sales_volume, n=1)) %>% 
    drop_na()

# Then train it with tidymodels:
model <- 
    linear_reg() %>%
    set_engine("lm") %>%
    fit(sales_volume ~ ., data = df_train)

But today, we want to show you a different way of pre-processing your data before fitting a model!

We start with a recipe

The problem with the code above is that, if you need to use the fitted model to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you are only doing it once, but if you need to do it many times, it can become a bit of a pain. This is where the recipes package (of tidymodels) comes in handy.
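
To make that pain concrete, here is a minimal sketch (not part of the original code) of what the dplyr-only approach forces you to repeat for every new batch of data. The new_data subset below is purely illustrative: we simply pretend the most recent rows are β€œnew”.

# Hypothetical "new" data arriving later
new_data <- df %>% filter(date >= dmy("01-01-2023"))

# The exact same manual wrangling has to be redone before we can predict
new_data_prepped <- 
  new_data %>% 
  group_by(region_name) %>% 
  arrange(date) %>% 
  mutate(SalesVolume_lag1 = lag(sales_volume, n = 1)) %>% 
  drop_na()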

A recipe is a set of instructions for how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredient is the data frame, and the instructions are the pre-processing steps we want to apply to it.

Here’s how we would construct a recipe to pre-process our data before fitting a linear regression model:

rec <- 
  recipe(sales_volume ~ ., data = df) %>%
  step_naomit(lag_1_index, lag_1_sales_volume, skip = FALSE) %>%
  step_string2factor(region_name) %>% 
  update_role(date, new_role = "id") %>% 
  prep() 

summary(rec)

How do we use this recipe? Well, we can use the bake() function to apply the recipe to our data frame:

rec %>% 
  bake(df) 

⚠️ If you don’t bake it, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.
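
The nice part is that the same recipe can be baked on any data frame that has the same columns. As a small sketch (the new_rows subset is just for illustration, it is not part of the lab instructions):

# Pretend the most recent rows are new, unseen data
new_rows <- df %>% filter(date >= dmy("01-01-2023"))

# The same pre-processing steps are applied automatically
rec %>% 
  bake(new_data = new_rows) %>% 
  head()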

How to use this in a model?

For this we need a workflow to which we can attach the recipe and the model. Creating a workflow of a linear regression looks like this:

# Create the specification of a model but don't fit it yet
lm_model <- 
  linear_reg() %>%
  set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(lm_model)

print(wflow)

Then, you can fit the workflow to the data:

model <- 
  wflow %>% 
  fit(data = df)

model
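
One convenience of fitting a workflow (shown here as a sketch, it is not required for the lab): you can call predict() on the fitted workflow with the raw data frame, and the attached recipe is applied for you before the predictions are made. Keep in mind that rows dropped by step_naomit() will not receive a prediction.

# The recipe baked into the workflow handles the pre-processing for us
model %>% 
  predict(new_data = df) %>% 
  head()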

Do you want to confirm what recipe was used to pre-process the data? You can extract_recipe():

model %>% 
  extract_recipe()

Need to extract the fitted model? You can extract_fit_parsnip():

model %>% 
  extract_fit_parsnip() 

The output of the above is the same as what you would get if you had fitted the model directly using the fit() function (like we did last week). That is, you can use it to augment() the data with predictions:

fitted_model <- 
  model %>% 
  extract_fit_parsnip()

# To make predictions, I cannot use original data
# Instead, I have to apply the same pre-processing steps (i.e. bake the recipe)
df_baked <- 
  rec %>% 
  bake(df)

fitted_model %>% 
  augment(new_data = df_baked) %>% 
  head()

Notice that this augments our data frame with .pred and .resid columns.

How well did it perform, on average, by region? How does this compare to the standard deviation?

fitted_model %>% 
  augment(new_data = df_baked) %>% 
  group_by(region_name) %>%
  mae(truth = sales_volume, estimate = .pred) %>% 
  left_join(fitted_model %>% 
            augment(new_data = df_baked) %>% 
            group_by(region_name) %>% 
            summarise(sd = sd(sales_volume, na.rm = TRUE)),
            by = "region_name") %>% 
  pivot_longer(cols = c(.estimate, sd), names_to = "comp") %>% 
  ggplot(aes(x = value, y = region_name, fill = comp)) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(values = c("black", "grey"), 
                    labels = c("MAE of Model", 
                               "Std. Dev. of Outcome")) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(x = "Sales Volume", y = NULL, fill = NULL)

Let’s look at our residual plot across regions!

fitted_model %>% 
  augment(new_data = df_baked) %>% 
  mutate(.resid = sales_volume - .pred) %>% # only needed if augment() does not add a .resid column (residual = observed - predicted)
  ggplot(aes(.pred, .resid)) +
  facet_wrap(. ~ region_name, scales = "free") +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw() +
  labs(x = "Predicted sales volume", y = "Residuals")

Part III - Practice with recipes and workflows (40 min)

  1. Consider the following simulated data frame.

set.seed(123)

df_sim <-
  tibble(outcome = rnorm(100, 0, 1),
         feature1 = rnorm(100, 0, 1),
         feature2 = rnorm(100, 0, 20))

We have created a data set with one outcome (creatively labeled outcome) and two features, feature1 and feature2, where one variable has been assigned a standard deviation 20 times the size of the other. In some cases, like linear regression, it is okay to use outcome ~ feature1 + feature2 as a formula. With other models, however, variables with larger variation will play a disproportionate role. We can use standardization (a.k.a. normalization) to counter this issue: subtract the mean of the variable and divide the result by its standard deviation.

Create a recipe for this data frame using a step from the tidymodels manual, bake the recipe, and make use of graphs to demonstrate that the standardization process β€œworked”.
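
Before you go looking for the right step, it may help to see the arithmetic itself. Here is a quick base-R illustration of standardizing a single variable by hand; it only shows what the transformation does, it is not the recipe step you are asked to find:

# Standardize feature2 by hand: subtract its mean, divide by its standard deviation
z_feature2 <- (df_sim$feature2 - mean(df_sim$feature2)) / sd(df_sim$feature2)

# After standardization the variable has (approximately) mean 0 and standard deviation 1
mean(z_feature2)
sd(z_feature2)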

  1. We have used recipes::step_string2factor to convert region_name from a string to a factor variable. However, some models accept only numeric input, meaning that we will have to convert this variable into several dummy variables. Using the tidymodels manual, can you find a step that can help us achieve this?

  2. Take a look at the following data frame.

df <-
  uk_hpi %>%
  clean_names() %>% 
  group_by(region_name) %>% 
  mutate(date = dmy(date),
         century = paste0(str_remove(as.character(year(date) + 100), "[0-9]{2}$"), "st"),
         across(c(index, sales_volume), ~ lag(.x, 1), .names = "lag_1_{.col}")) %>%
  ungroup() %>% 
  filter(region_name %in% c("England", "Wales", "Scotland", "Northern Ireland"),
         date >= dmy("01-01-2005")) %>% 
  select(date, century, region_name, sales_volume, starts_with("lag_1"))

We have created a variable that contains zero variance. Can you:

  1. spot this variable (hint: check out dplyr::count) and
  2. find an appropriate step in the tidymodels manual that can help us remove this variable?

  3. Using the data frame from Step 2, create a recipe that incorporates Steps 1 and 2.

  4. Using the recipe in Step 3, train a model that uses only the records of England, Scotland and Wales as training data. Calculate the Mean Absolute Error (MAE) on the training data. Then answer: how does it compare to the MAE of the model trained on all UK countries (Part II)?

  5. Now, let’s treat the Northern Ireland records as a testing set. Predict sales_volume for Northern Ireland using the model you fitted in Step 4. Calculate the MAE of these predictions. What do you make of it?

  6. Write a function called plot_residuals() that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.

plot_residuals <- function(wflow_model, data) {
    
    ... # Replace this part with your code

    g <- ggplot(plot_df, aes(x=.pred, y=.resid, ...)) +
        ... # Replace this part with your code

    return(g)
    
}

Using the plot_residuals function you’ve just created, can you create:

  1. one residuals vs. fitted values plot for the training data used in Step 4,
  2. and another, separate residuals vs. fitted values plot for the test data used in Step 5?

Footnotes

  1. We’re gonna cry a little bit, not gonna lie. But no hard feelings. We’ll get over it.β†©οΈŽ