πŸ›£οΈ Week 04 - Lab Roadmap (90 min)

Tidymodels recipes and workflows - a tutorial

Author

Dr. Jon Cardoso-Silva

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Use the recipes package to pre-process data before fitting a model
  • Use the workflows package to combine a recipe and a model into a single object
  • Apply a recipe to a data frame with the bake() function
  • Create a custom function in R
This lab is part of the GENIAL project.

If you have never accessed ChatGPT, you must create an account. Go to chat.openai.com and sign up with your email address (it doesn’t have to be your LSE email address).

When you reach Part III of this lab, read the specific instructions for GENIAL participants.

πŸ“š Preparation

Same prep as last week.

Then use the link below to download the lab file:

Solutions to Part III will only be posted after all labs have ended, on Tuesday afternoon.

πŸ“‹ Lab Tasks

Here are the instructions for this lab:

βš™οΈ Setup Instructions

Import required libraries:

# Tidyverse packages we will use
library(dplyr)     
library(tidyr)     
library(readr)     
library(lubridate) 
library(ggplot2)   # Needed for the residual plots in Part III

# Tidymodel packages we will use
library(yardstick) # this is where metrics like mae() are
library(parsnip)   # This is where models like linear_reg() are
library(recipes)   # You will learn about this today
library(workflows) # You will learn about this today

Read the data set:

It is the same data set you used in the previous lab (as well as in the first formative assignment):

# Modify the filepath if needed
filepath <- "data/UK-HPI-full-file-2023-06.csv"
uk_hpi <- read_csv(filepath)

Part I - Getting Ready (10 min)

🎯 ACTION POINTS: Just run all the code chunks below, paying attention to what you are doing.

Let’s kick things off by creating a df data frame that holds a few selected columns of the UK HPI dataset, restricted to the individual UK countries.

# Select just the UK countries
df <- 
    uk_hpi %>% 
    filter(RegionName %in% c("England", "Wales", "Scotland", "Northern Ireland")) %>%
    select(Date, RegionName, SalesVolume) %>%
    mutate(Date = dmy(Date))

Out of curiosity, let’s confirm how many unique countries appear in our dataset by counting the unique values in the RegionName column.

df %>% 
    group_by(RegionName) %>% 
    tally()
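
As an aside, dplyr’s count() collapses this group_by() + tally() pattern into a single call:

# Equivalent shorthand for group_by() followed by tally()
df %>% count(RegionName)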

Just how many non-empty records are there in this dataset per Region?

df %>%
    drop_na() %>%
    group_by(RegionName) %>%
    tally() %>%
    knitr::kable()

These are not the same number of records. Maybe some countries started recording data at a later date? Let’s check:

df %>%
    drop_na() %>%
    group_by(RegionName) %>%
    summarise(min_date = min(Date), max_date = max(Date))

Just to balance things out, let’s start the analysis from 2005 onwards. Let’s recreate our df, adding a filter to only include records from 2005 onwards.

df <- 
    uk_hpi %>% 
    filter(RegionName %in% c("England", "Wales", "Scotland", "Northern Ireland")) %>%
    select(Date, RegionName, SalesVolume) %>%
    mutate(Date = dmy(Date)) %>%
    filter(Date >= dmy("01-01-2005"))

Part II - Tidymodels recipes (40 min)

πŸ§‘πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain all the code below. Follow along, running the code as you go. If you have any questions, don’t save it to later, ask away!

Say our goal is to predict SalesVolume using the sales volume from the past month, similar to what we have been doing so far in this course. We could use the dplyr functions mutate() and lag() to achieve that:

df_train <-
    df %>% 
    group_by(RegionName) %>%
    arrange(Date) %>%
    mutate(SalesVolume_lag1 = lag(SalesVolume, n=1)) %>% 
    drop_na()

# Then train it with tidymodels:
model <- 
    linear_reg() %>%
    set_engine("lm") %>%
    fit(SalesVolume ~ ., data = df_train)

But today, we want to show you a different way of pre-processing your data before fitting a model!

We start with a recipe

The problem with the code above is that, if you need to use the fitted model to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you are only doing it once, but if you need to do it many times, it can become a bit of a pain. This is where the recipes package (of tidymodels) comes in handy.
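
To make this concrete, here is a minimal sketch of what predicting on fresh data looks like without a recipe (new_df is a hypothetical data frame of new UK HPI records, introduced here just for illustration): every pre-processing step has to be repeated by hand before predict() will accept the data.

# Hypothetical new data (new_df): we must manually repeat every
# pre-processing step that was applied to the training data
new_df_prepared <-
    new_df %>%
    group_by(RegionName) %>%
    arrange(Date) %>%
    mutate(SalesVolume_lag1 = lag(SalesVolume, n = 1)) %>%
    drop_na()

predict(model, new_data = new_df_prepared)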

A recipe is a set of instructions for how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredients are the data, and the instructions are the pre-processing steps we want to apply to them.

Here’s how we would construct a recipe to pre-process our data before fitting a linear regression model:

# Create a recipe
rec <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(Date) %>% # We need to sort the data by date before we can use step_lag()
    step_lag(SalesVolume, lag=1) %>%                  # step_lag is a wrapper for dplyr::lag()
    prep() # we need to 'prep' the recipe before we can use it. This is always the final step of a recipe

summary(rec)

How do we use this recipe? Well, we can use the bake() function to apply the recipe to our data frame:

rec %>% 
    bake(df) %>% # This 'bakes' our recipe using whatever data we pass as ingredient
    filter(Date == ymd("2005-01-01")) %>%
    head(8)

⚠️ If you don’t bake it, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.

Fixing incorrect lag

But the way we computed the lag above leads to a problem: the lagged variable is not calculated within each group. To see why that is problematic, let’s look at the first few months in the data set (Jan 2005 and Feb 2005):

rec %>% 
    bake(df) %>% 
    arrange(RegionName,Date) %>% 
    filter(Date < ymd(20050301))

Just like England, all the other countries should have an NA value in the lagged variable at the start of the data set (Jan 2005). But they don’t: the lagged variable is calculated across all the data, not within each country.

There is no step_group_by() in the recipes package, but it turns out that we can use the step_arrange() function to achieve the same effect:

rec <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(RegionName, Date) %>% # Adding RegionName to the arrange call achieves the same as the group_by
    step_lag(SalesVolume, lag=1) %>%                  
    prep() 

summary(rec)

This works as expected:

rec %>%
    bake(df) %>%
    filter(Date <= ymd(20050201)) %>%
    head(8)

Now we can get rid of these NAs!

Remove NAs

Let’s further edit the rec pipeline to include the step_naomit() function (https://recipes.tidymodels.org/reference/step_naomit.html) to remove any rows with missing values. This is equivalent to tidyr’s drop_na() function we have been using. Then bake it to confirm it worked:

rec <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(RegionName, Date) %>%
    step_lag(SalesVolume, lag=1) %>% 
    step_naomit(lag_1_SalesVolume, skip=FALSE) %>%  # This line is new
    prep()

rec %>%
    bake(df) %>%
    filter(Date <= ymd(20050201)) %>%
    head(8)

We need the skip=FALSE parameter because, by default, step_naomit() is skipped whenever a recipe is baked. Setting skip=FALSE explicitly tells the recipe to also remove rows with missing values when we call bake() on new data.
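
To convince yourself, here is a minimal sketch of what would happen with the default, skip=TRUE: the NA-removal step is skipped when the recipe is baked, so the missing lagged values would survive.

# A sketch with the default skip=TRUE: step_naomit() is skipped at bake() time
rec_skipped <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(RegionName, Date) %>%
    step_lag(SalesVolume, lag=1) %>% 
    step_naomit(lag_1_SalesVolume, skip=TRUE) %>%
    prep()

# The rows with a missing lag_1_SalesVolume are still there
rec_skipped %>% 
    bake(df) %>% 
    summarise(n_missing = sum(is.na(lag_1_SalesVolume)))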

Identify columns that won’t be used by algorithms

If you run summary(rec), you will see that Date and RegionName have the role of β€˜predictor’, but we don’t want to use them as predictors; we want to use them as identifiers for each row. We can change the role of a variable using the update_role() function. Let’s edit the rec pipeline to include update_role() to change the role of Date and RegionName to β€˜id’:

rec <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(RegionName, Date) %>%
    step_lag(SalesVolume, lag=1) %>% 
    step_naomit(lag_1_SalesVolume, skip=FALSE) %>%
    update_role(Date, RegionName, new_role = "id") %>% # This line is new
    prep()

summary(rec)

Now we see that only lag_1_SalesVolume is correctly identified as a predictor. The other two columns are now identified as β€˜id’ variables.
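
Note that changing a role does not drop the columns: bake() still returns Date and RegionName, so you can keep using them for grouping and plotting; they are simply not fed to the model as predictors.

# The id columns are still present in the baked data
rec %>% 
    bake(df) %>% 
    colnames()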

How to use this in a model?

For this, we need a workflow, to which we can attach both the recipe and the model. Creating a workflow for a linear regression looks like this:

# Create the specification of a model but don't fit it yet
lm_model <- 
    linear_reg() %>%
    set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <- 
    workflow() %>% 
    add_recipe(rec) %>% 
    add_model(lm_model)

print(wflow)

Then, you can fit the workflow to the data:

model <- 
    wflow %>% 
    fit(data = df)

model

Do you want to confirm what recipe was used to pre-process the data? You can extract_recipe():

model %>% extract_recipe()

Need to extract the fitted model? You can extract_fit_parsnip():

model %>% extract_fit_parsnip()

The output of the above is the same as you would get if you had fitted the model directly using the fit() function (like we did last week). That is, you can use it to augment() the data with predictions:

fitted_model <- model %>% extract_fit_parsnip()

# To make predictions, I cannot use the original data.
# Instead, I have to apply the same pre-processing steps (i.e. bake the recipe)
df_baked <- rec %>% bake(df)

fitted_model %>% 
    augment(new_data = df_baked) %>% 
    head()

Notice that this augments our data frame with .pred and .resid columns.

How well did it perform, on average?

fitted_model %>% 
    augment(new_data = df_baked) %>% 
    group_by(RegionName) %>%
    mae(truth = SalesVolume, estimate = .pred)
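
Other yardstick metrics share the same interface; for example, you could swap mae() for rmse() if you wanted to penalise large errors more heavily:

# Same yardstick interface, different metric
fitted_model %>% 
    augment(new_data = df_baked) %>% 
    group_by(RegionName) %>%
    rmse(truth = SalesVolume, estimate = .pred)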

Compare these errors with the summary statistics of the SalesVolume variable:

# How large is the MAE relative to the typical scale of SalesVolume?
df_baked %>% 
    group_by(RegionName) %>%
    summarise(mean_sales = mean(SalesVolume), sd_sales = sd(SalesVolume))

Part III - Practice with recipes and workflows (40 min)

If you are participating in the GENIAL project, you are asked to:

  • Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
  • Have only the following tabs open in your browser:
    1. These lab instructions
    2. The ChatGPT website (open a new chat window and name it β€˜DS202A - Week 04’)
    3. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
    4. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
  • Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.
  • Fill out this brief survey at the end of the lab: πŸ”— link (requires LSE login)

In case you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

  1. The ChatGPT website (open a new chat window and name it β€˜DS202A - Week 04’)
  2. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
  3. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)

  1. Question: Can you replicate the filtering steps we applied to uk_hpi for England, but this time, use a recipe for the same tasks?
  • Employ a step_filter() to limit the records to those from 2005 and beyond.
  • Use step_mutate() to format the Date column correctly.
  • Set all variables other than SalesVolume as id variables.
  • Generate the lagged variable of SalesVolume as done previously.
  • Eliminate rows with missing values as done previously.

Check that summary(rec) produces exactly the following output:

variable type role source
Date date id original
RegionName factor , unordered, nominal id original
AreaCode factor , unordered, nominal id original
AveragePrice double , numeric id original
Index double , numeric id original
IndexSA double , numeric id original
1m%Change double , numeric id original
12m%Change double , numeric id original
AveragePriceSA double , numeric id original
DetachedPrice double , numeric id original
DetachedIndex double , numeric id original
Detached1m%Change double , numeric id original
Detached12m%Change double , numeric id original
SemiDetachedPrice double , numeric id original
SemiDetachedIndex double , numeric id original
SemiDetached1m%Change double , numeric id original
SemiDetached12m%Change double , numeric id original
TerracedPrice double , numeric id original
TerracedIndex double , numeric id original
Terraced1m%Change double , numeric id original
Terraced12m%Change double , numeric id original
FlatPrice double , numeric id original
FlatIndex double , numeric id original
Flat1m%Change double , numeric id original
Flat12m%Change double , numeric id original
CashPrice double , numeric id original
CashIndex double , numeric id original
Cash1m%Change double , numeric id original
Cash12m%Change double , numeric id original
CashSalesVolume double , numeric id original
MortgagePrice double , numeric id original
MortgageIndex double , numeric id original
Mortgage1m%Change double , numeric id original
Mortgage12m%Change double , numeric id original
MortgageSalesVolume double , numeric id original
FTBPrice double , numeric id original
FTBIndex double , numeric id original
FTB1m%Change double , numeric id original
FTB12m%Change double , numeric id original
FOOPrice double , numeric id original
FOOIndex double , numeric id original
FOO1m%Change double , numeric id original
FOO12m%Change double , numeric id original
NewPrice double , numeric id original
NewIndex double , numeric id original
New1m%Change double , numeric id original
New12m%Change double , numeric id original
NewSalesVolume double , numeric id original
OldPrice double , numeric id original
OldIndex double , numeric id original
Old1m%Change double , numeric id original
Old12m%Change double , numeric id original
OldSalesVolume double , numeric id original
SalesVolume double , numeric outcome original
lag_1_SalesVolume double , numeric predictor derived
# Delete this line and add your code here
  2. Using this recipe, train a model that uses as training data only the records of England, Scotland and Wales. Calculate the Mean Absolute Error (MAE) on the training data. Then answer: how does it compare to the MAE of the model trained on all UK countries (Part II)?
# Delete this line and add your code here
  3. Now, let’s treat the Northern Ireland records as a test set. Predict SalesVolume for Northern Ireland using the model you fitted in Step 2. Calculate the MAE of the predictions. What do you make of it?
# Delete this line and add your code here
  4. Write a function called plot_residuals() that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.
plot_residuals <- function(wflow_model, data) {
    
    ... # Replace this part with your code

    g <- ggplot(plot_df, aes(x=.pred, y=.resid, ...)) +
        ... # Replace this part with your code

    return(g)
}

After filling in the blanks above, create one such plot for the training data used in Step 2 and another, separate one for the test data used in Step 3.

# Delete this line and add your code here
  • Fill out this brief survey at the end of the lab if you are part of GENIAL: πŸ”— link (requires LSE login)