πŸ›£οΈ Week 04 - Lab Roadmap (90 min)

Tidymodels recipes and workflows - a tutorial

Author

Dr. Jon Cardoso-Silva

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Use the recipes package to pre-process data before fitting a model
  • Use the workflows package to combine a recipe and a model into a single object
  • Apply a recipe to a data frame with the bake() function
  • Create a custom function in R
This lab is part of the GENIAL project.

If you have never accessed ChatGPT, you must create an account. Go to chat.openai.com and sign up with your email address (it doesn’t have to be your LSE email address).

When you reach Part III of this lab, read the specific instructions for GENIAL participants.

πŸ“š Preparation

Same prep as last week.

Then use the link below to download the lab file:

Solutions to Part III will only be posted after all labs have ended, on Tuesday afternoon.

πŸ“‹ Lab Tasks

Here are the instructions for this lab:

βš™οΈ Setup Instructions

Import required libraries:

# Tidyverse packages we will use
library(dplyr)     
library(tidyr)     
library(readr)     
library(lubridate) 
library(ggplot2)   # Needed for the residual plots in Part III

# Tidymodel packages we will use
library(yardstick) # this is where metrics like mae() are
library(parsnip)   # This is where models like linear_reg() are
library(recipes)   # You will learn about this today
library(workflows) # You will learn about this today

Read the data set:

It is the same data set you used in the previous lab (as well as in the first formative assignment):

# Modify the filepath if needed
filepath <- "data/UK-HPI-full-file-2023-06.csv"
uk_hpi <- read_csv(filepath)

Part I - Getting Ready (10 min)

🎯 ACTION POINTS: Just run all the code chunks below, paying attention to what you are doing.

Let’s kick things off by creating a df data frame that holds a few selected columns of the UK HPI dataset, restricted to the individual UK countries.

# Select just the UK countries
df <- 
    uk_hpi %>% 
    filter(RegionName %in% c("England", "Wales", "Scotland", "Northern Ireland")) %>%
    select(Date, RegionName, SalesVolume) %>%
    mutate(Date = dmy(Date))

Out of curiosity, let’s confirm how many unique countries appear in our dataset by counting the unique values in the RegionName column.

df %>% 
    group_by(RegionName) %>% 
    tally()
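
As an aside, dplyr’s count() collapses this group_by() + tally() pattern into a single call:

# Equivalent shorthand for group_by() followed by tally()
df %>% count(RegionName)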

Just how many non-empty records are there in this dataset per Region?

df %>%
    drop_na() %>%
    group_by(RegionName) %>%
    tally() %>%
    knitr::kable()

These are not the same number of records. Maybe some countries started recording data at a later date? Let’s check:

df %>%
    drop_na() %>%
    group_by(RegionName) %>%
    summarise(min_date = min(Date), max_date = max(Date))

Just to balance things out, let’s start the analysis from 2005 onwards. Let’s recreate our df, adding a filter to only include records from 2005 onwards.

df <- 
    uk_hpi %>% 
    filter(RegionName %in% c("England", "Wales", "Scotland", "Northern Ireland")) %>%
    select(Date, RegionName, SalesVolume) %>%
    mutate(Date = dmy(Date)) %>%
    filter(Date >= dmy("01-01-2005"))

Part II - Tidymodels recipes (40 min)

πŸ§‘πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain all the code below. Follow along, running the code as you go. If you have any questions, don’t save it to later, ask away!

Say our goal is to predict SalesVolume using the sales volume from the past month, similar to what we have been doing so far in this course. We could use the dplyr functions mutate() and lag() to achieve that:

df_train <-
    df %>% 
    group_by(RegionName) %>%
    arrange(Date) %>%
    mutate(SalesVolume_lag1 = lag(SalesVolume, n=1)) %>% 
    drop_na()

# Then train it with tidymodels:
model <- 
    linear_reg() %>%
    set_engine("lm") %>%
    fit(SalesVolume ~ ., data = df_train)

But today, we want to show you a different way of pre-processing your data before fitting a model!

We start with a recipe

The problem with the code above is that, if you need to use the fitted model to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you are only doing it once, but if you need to do it many times, it can become a bit of a pain. This is where the recipes package (of tidymodels) comes in handy.
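
To make this concrete, here is a minimal sketch of what predicting on fresh data looks like without a recipe (new_df is a hypothetical data frame of new UK HPI records, introduced here just for illustration): every pre-processing step has to be repeated by hand before predict() will accept the data.

# Hypothetical new data (new_df): we must manually repeat every
# pre-processing step that was applied to the training data
new_df_prepared <-
    new_df %>%
    group_by(RegionName) %>%
    arrange(Date) %>%
    mutate(SalesVolume_lag1 = lag(SalesVolume, n = 1)) %>%
    drop_na()

predict(model, new_data = new_df_prepared)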

A recipe is a set of instructions for how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredients are the data, and the instructions are the pre-processing steps we want to apply to them.

Here’s how we would construct a recipe to pre-process our data before fitting a linear regression model:

# Create a recipe
rec <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(Date) %>% # We need to sort the data by date before we can use step_lag()
    step_lag(SalesVolume, lag=1) %>%                  # step_lag is a wrapper for dplyr::lag()
    prep() # we need to 'prep' the recipe before we can use it. This is always the final step of a recipe

summary(rec)

How do we use this recipe? Well, we can use the bake() function to apply the recipe to our data frame:

rec %>% 
    bake(df) %>% # This 'bakes' our recipe using whatever data we pass as ingredient
    filter(Date == ymd("2005-01-01")) %>%
    head(8)

⚠️ If you don’t bake it, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.

Fixing incorrect lag

But the way we computed the lag above leads to a problem: the lagged variable is not calculated within each group. To see why that is problematic, let’s look at the first few months in the data set (Jan 2005 and Feb 2005):

rec %>% 
    bake(df) %>% 
    arrange(RegionName,Date) %>% 
    filter(Date < ymd(20050301))

Just like England, all the other countries should have an NA value in the lagged variable at the start of the data set (Jan 2005). But they don’t: the lagged variable is calculated across all the data, not within each country.

There is no step_group_by() in the recipes package, but it turns out that we can use the step_arrange() function to achieve the same effect:

rec <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(RegionName, Date) %>% # Adding RegionName to the arrange call achieves the same as the group_by
    step_lag(SalesVolume, lag=1) %>%                  
    prep() 

summary(rec)

This works as expected:

rec %>%
    bake(df) %>%
    filter(Date <= ymd(20050201)) %>%
    head(8)

Now we can get rid of these NAs!

Remove NAs

Let’s further edit the rec pipeline to include the step_naomit() function (https://recipes.tidymodels.org/reference/step_naomit.html) to remove any rows with missing values. This is equivalent to tidyr’s drop_na() function we have been using. Then bake it to confirm it worked:

rec <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(RegionName, Date) %>%
    step_lag(SalesVolume, lag=1) %>% 
    step_naomit(lag_1_SalesVolume, skip=FALSE) %>%  # This line is new
    prep()

rec %>%
    bake(df) %>%
    filter(Date <= ymd(20050201)) %>%
    head(8)

We need the skip=FALSE parameter because, by default, step_naomit() is skipped whenever a recipe is baked. Setting skip=FALSE explicitly tells the recipe to also remove rows with missing values when we call bake() on new data.
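
To convince yourself, here is a minimal sketch of what would happen with the default, skip=TRUE: the NA-removal step is skipped when the recipe is baked, so the missing lagged values would survive.

# A sketch with the default skip=TRUE: step_naomit() is skipped at bake() time
rec_skipped <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(RegionName, Date) %>%
    step_lag(SalesVolume, lag=1) %>% 
    step_naomit(lag_1_SalesVolume, skip=TRUE) %>%
    prep()

# The rows with a missing lag_1_SalesVolume are still there
rec_skipped %>% 
    bake(df) %>% 
    summarise(n_missing = sum(is.na(lag_1_SalesVolume)))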

Identify columns that won’t be used by algorithms

If you run summary(rec), you will see that Date and RegionName have the role of β€˜predictor’, but we don’t want to use them as predictors; we want to use them as identifiers for each row. We can change the role of a variable using the update_role() function. Let’s edit the rec pipeline to include update_role() to change the role of Date and RegionName to β€˜id’:

rec <- 
    recipe(SalesVolume ~ ., data = df) %>% 
    step_arrange(RegionName, Date) %>%
    step_lag(SalesVolume, lag=1) %>% 
    step_naomit(lag_1_SalesVolume, skip=FALSE) %>%
    update_role(Date, RegionName, new_role = "id") %>% # This line is new
    prep()

summary(rec)

Now we see that only lag_1_SalesVolume is correctly identified as a predictor. The other two columns are now identified as β€˜id’ variables.
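
Note that changing a role does not drop the columns: bake() still returns Date and RegionName, so you can keep using them for grouping and plotting; they are simply not fed to the model as predictors.

# The id columns are still present in the baked data
rec %>% 
    bake(df) %>% 
    colnames()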

How to use this in a model?

For this, we need a workflow, to which we can attach both the recipe and the model. Creating a workflow for a linear regression looks like this:

# Create the specification of a model but don't fit it yet
lm_model <- 
    linear_reg() %>%
    set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <- 
    workflow() %>% 
    add_recipe(rec) %>% 
    add_model(lm_model)

print(wflow)

Then, you can fit the workflow to the data:

model <- 
    wflow %>% 
    fit(data = df)

model

Do you want to confirm what recipe was used to pre-process the data? You can extract_recipe():

model %>% extract_recipe()

Need to extract the fitted model? You can extract_fit_parsnip():

model %>% extract_fit_parsnip()

The output of the above is the same as you would get if you had fitted the model directly using the fit() function (like we did last week). That is, you can use it to augment() the data with predictions:

fitted_model <- model %>% extract_fit_parsnip()

# To make predictions, I cannot use the original data.
# Instead, I have to apply the same pre-processing steps (i.e. bake the recipe)
df_baked <- rec %>% bake(df)

fitted_model %>% 
    augment(new_data = df_baked) %>% 
    head()

Notice that this augments our data frame with .pred and .resid columns.

How well did it perform, on average?

fitted_model %>% 
    augment(new_data = df_baked) %>% 
    group_by(RegionName) %>%
    mae(truth = SalesVolume, estimate = .pred)
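
Other yardstick metrics share the same interface; for example, you could swap mae() for rmse() if you wanted to penalise large errors more heavily:

# Same yardstick interface, different metric
fitted_model %>% 
    augment(new_data = df_baked) %>% 
    group_by(RegionName) %>%
    rmse(truth = SalesVolume, estimate = .pred)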

Compare these errors with the summary statistics of the SalesVolume variable:

# How large is the MAE relative to the typical scale of SalesVolume?
df_baked %>% 
    group_by(RegionName) %>%
    summarise(mean_sales = mean(SalesVolume), sd_sales = sd(SalesVolume))

Part III - Practice with recipes and workflows (40 min)

If you are participating in the GENIAL project, you are asked to:

  • Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
  • Have only the following tabs open in your browser:
    1. These lab instructions
    2. The ChatGPT website (open a new chat window and name it β€˜DS202A - Week 04’)
    3. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
    4. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
  • Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.
  • Fill out this brief survey at the end of the lab: πŸ”— link (requires LSE login)

In case you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

  1. The ChatGPT website (open a new chat window and name it β€˜DS202A - Week 04’)
  2. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
  3. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)

  1. Question: Can you replicate the filtering steps we applied to uk_hpi for England, but this time, use a recipe for the same tasks?
  • Employ a step_filter() to limit the records to those from 2005 and beyond.
  • Use step_mutate() to format the Date column correctly.
  • Set all variables other than SalesVolume as id variables.
  • Generate the lagged variable of SalesVolume as done previously.
  • Eliminate rows with missing values as done previously.

Check that summary(rec) produces exactly the following output:

variable type role source
Date date id original
RegionName factor , unordered, nominal id original
AreaCode factor , unordered, nominal id original
AveragePrice double , numeric id original
Index double , numeric id original
IndexSA double , numeric id original
1m%Change double , numeric id original
12m%Change double , numeric id original
AveragePriceSA double , numeric id original
DetachedPrice double , numeric id original
DetachedIndex double , numeric id original
Detached1m%Change double , numeric id original
Detached12m%Change double , numeric id original
SemiDetachedPrice double , numeric id original
SemiDetachedIndex double , numeric id original
SemiDetached1m%Change double , numeric id original
SemiDetached12m%Change double , numeric id original
TerracedPrice double , numeric id original
TerracedIndex double , numeric id original
Terraced1m%Change double , numeric id original
Terraced12m%Change double , numeric id original
FlatPrice double , numeric id original
FlatIndex double , numeric id original
Flat1m%Change double , numeric id original
Flat12m%Change double , numeric id original
CashPrice double , numeric id original
CashIndex double , numeric id original
Cash1m%Change double , numeric id original
Cash12m%Change double , numeric id original
CashSalesVolume double , numeric id original
MortgagePrice double , numeric id original
MortgageIndex double , numeric id original
Mortgage1m%Change double , numeric id original
Mortgage12m%Change double , numeric id original
MortgageSalesVolume double , numeric id original
FTBPrice double , numeric id original
FTBIndex double , numeric id original
FTB1m%Change double , numeric id original
FTB12m%Change double , numeric id original
FOOPrice double , numeric id original
FOOIndex double , numeric id original
FOO1m%Change double , numeric id original
FOO12m%Change double , numeric id original
NewPrice double , numeric id original
NewIndex double , numeric id original
New1m%Change double , numeric id original
New12m%Change double , numeric id original
NewSalesVolume double , numeric id original
OldPrice double , numeric id original
OldIndex double , numeric id original
Old1m%Change double , numeric id original
Old12m%Change double , numeric id original
OldSalesVolume double , numeric id original
SalesVolume double , numeric outcome original
lag_1_SalesVolume double , numeric predictor derived
# Delete this line and add your code here
  2. Using this recipe, train a model that uses as training data only the records of England, Scotland and Wales. Calculate the Mean Absolute Error (MAE) on the training data. Then answer: how does it compare to the MAE of the model trained on all UK countries (Part II)?
# Delete this line and add your code here
  3. Now, let’s treat the Northern Ireland records as a test set. Predict SalesVolume for Northern Ireland using the model you fitted in Step 2. Calculate the MAE of the predictions. What do you make of it?
# Delete this line and add your code here
  4. Write a function called plot_residuals() that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.
plot_residuals <- function(wflow_model, data) {
    
    ... # Replace this part with your code

    g <- ggplot(plot_df, aes(x=.pred, y=.resid, ...)) +
        ... # Replace this part with your code

    return(g)
}

After filling in the blanks above, create one such plot for the training data used in Step 2 and another, separate one for the test data used in Step 3.

# Delete this line and add your code here
  • Fill out this brief survey at the end of the lab if you are part of GENIAL: πŸ”— link (requires LSE login)