🛣️ Week 04 - Lab Roadmap (90 min)
Tidymodels recipes and workflows - a tutorial
🥅 Learning Objectives
By the end of this lab, you will be able to:

- Use the `recipes` package to pre-process data before fitting a model
- Use the `workflows` package to combine a recipe and a model into a single object
- Use the `bake()` function to apply a recipe to a data frame
- Create a custom function in R
If you have never accessed ChatGPT, you must create an account. Click on chat.openai.com and sign up with your email address (it doesn't have to be your LSE email address).
When you reach Part III of this lab, read the specific instructions for GENIAL participants.
📚 Preparation
Same prep as last week.
Then use the link below to download the lab file:
Solutions to Part III will only be posted after all labs have ended, on Tuesday afternoon.
📋 Lab Tasks
Here are the instructions for this lab:
Click here to see ⚙️ Setup Instructions
Import required libraries:
```r
# Tidyverse packages we will use
library(dplyr)
library(tidyr)
library(readr)
library(lubridate)

# Tidymodel packages we will use
library(yardstick)  # this is where metrics like mae() are
library(parsnip)    # this is where models like linear_reg() are
library(recipes)    # you will learn about this today
library(workflows)  # you will learn about this today
```
Read the data set:
It is the same data set you used in the previous lab (as well as in the first formative assignment):
```r
# Modify the filepath if needed
filepath <- "data/UK-HPI-full-file-2023-06.csv"
uk_hpi <- read_csv(filepath)
```
Part I - Getting Ready (10 min)
🎯 ACTION POINTS: Just run all the code chunks below, paying attention to what you are doing.
Let's kick things off by crafting a `df` data frame that will hold records from selected columns of the UK HPI dataset, broken down by individual UK countries.
```r
# Select just the UK countries
df <-
  uk_hpi %>%
  filter(RegionName %in% c("England", "Wales", "Scotland", "Northern Ireland")) %>%
  select(Date, RegionName, SalesVolume) %>%
  mutate(Date = dmy(Date))
```
For curiosity's sake, let's confirm the number of unique countries that appear in our dataset. We can uncover this mystery by counting the unique values in the `RegionName` column.
```r
df %>%
  group_by(RegionName) %>%
  tally()
```
Just how many non-empty records are there in this dataset per region?
```r
df %>%
  drop_na() %>%
  group_by(RegionName) %>%
  tally() %>%
  knitr::kable()
```
The counts are not the same across countries. Maybe some countries started recording data at a later date? Let's check:
```r
df %>%
  drop_na() %>%
  group_by(RegionName) %>%
  summarise(min_date = min(Date), max_date = max(Date))
```
Just to balance things out, let's start the analysis from 2005 onwards. Let's recreate our `df`, adding a filter to only include records from that point on.
```r
df <-
  uk_hpi %>%
  filter(RegionName %in% c("England", "Wales", "Scotland", "Northern Ireland")) %>%
  select(Date, RegionName, SalesVolume) %>%
  mutate(Date = dmy(Date)) %>%
  filter(Date >= dmy("01-01-2005"))
```
Part II - Tidymodels recipes (40 min)
🧑🏻‍🏫 TEACHING MOMENT: Your class teacher will explain all the code below. Follow along, running the code as you go. If you have any questions, don't save them for later, ask away!
Say our goal is to predict `SalesVolume` using the sales volume from the past month, similar to what we've been doing so far in this course. We could use the dplyr functions `mutate()` and `lag()` to achieve that. That is:
```r
df_train <-
  df %>%
  group_by(RegionName) %>%
  arrange(Date) %>%
  mutate(SalesVolume_lag1 = lag(SalesVolume, n=1)) %>%
  drop_na()

# Then train it with tidymodels:
model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(SalesVolume ~ ., data = df_train)
```
But today, we want to show you a different way of pre-processing your data before fitting a model!
We start with a recipe
The problem with the code above is that, if you need to use the fitted `model` to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you are only doing it once, but if you need to do it many times, it can become a bit of a pain. This is where the recipes package (part of tidymodels) comes in handy.
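To make the pain point concrete, here is a minimal sketch (`new_data` is a hypothetical data frame with the same columns as `df`): every new batch of data has to be lagged by hand, in exactly the same way, before we can call `predict()`.

```r
# Hypothetical `new_data` with the same columns as `df`.
# Without a recipe, the pre-processing must be repeated by hand every time:
new_data_prepared <-
  new_data %>%
  group_by(RegionName) %>%
  arrange(Date) %>%
  mutate(SalesVolume_lag1 = lag(SalesVolume, n=1)) %>%
  drop_na()

predict(model, new_data = new_data_prepared)
```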
A recipe is a set of instructions for how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredient is the data frame, and the instructions are the pre-processing steps we want to apply to it.
Here's how we would construct a recipe to pre-process our data before fitting a linear regression model:
```r
# Create a recipe
rec <-
  recipe(SalesVolume ~ ., data = df) %>%
  step_arrange(Date) %>%           # we need to sort the data by date before we can use step_lag()
  step_lag(SalesVolume, lag=1) %>% # step_lag() is a wrapper for dplyr::lag()
  prep() # we need to 'prep' the recipe before we can use it; this is always the final step of a recipe

summary(rec)
```
How do we use this recipe? Well, we can use the `bake()` function to apply the recipe to our data frame:
```r
rec %>%
  bake(df) %>% # this 'bakes' our recipe using whatever data we pass as ingredient
  filter(Date == ymd("2005-01-01")) %>%
  head(8)
```
⚠️ If you don't bake it, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.
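As a side note, here is a small sketch of a related convenience: because `prep()` retains a processed copy of the training data by default, `bake(new_data = NULL)` returns that copy directly, without passing `df` again.

```r
# Sketch: prep() keeps a processed copy of the data it was trained on,
# so bake(new_data = NULL) returns it without re-supplying `df`
rec %>%
  bake(new_data = NULL) %>%
  head(8)
```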
Fixing incorrect lag
But the way we perform the lag above leads to a problem because the lagged variable is not calculated within a group. To see why that is problematic, let's look at the first few months in the data set (Jan 2005 and Feb 2005):
```r
rec %>%
  bake(df) %>%
  arrange(RegionName, Date) %>%
  filter(Date < ymd(20050301))
```
Just like with England, all the other countries should have an NA value in the lagged variable at the start of the data set (Jan 2005). But they don't. The lagged variable is calculated across all the data, not within each country.
There is no `step_group_by()` in the recipes package, but it turns out that we can use the `step_arrange()` function to achieve the same:
```r
rec <-
  recipe(SalesVolume ~ ., data = df) %>%
  step_arrange(RegionName, Date) %>% # adding RegionName to the arrange call achieves the same as the group_by
  step_lag(SalesVolume, lag=1) %>%
  prep()

summary(rec)
```
This works as expected:
```r
rec %>%
  bake(df) %>%
  filter(Date <= ymd(20050201)) %>%
  head(8)
```
Now we can get rid of these NAs!
Remove NAs
Let's further edit the `rec` pipeline to include the [`step_naomit()` function](https://recipes.tidymodels.org/reference/step_naomit.html) to remove any rows with missing values. This is equivalent to tidyr's `drop_na()` function we have been using. Then bake it to confirm it worked:
```r
rec <-
  recipe(SalesVolume ~ ., data = df) %>%
  step_arrange(RegionName, Date) %>%
  step_lag(SalesVolume, lag=1) %>%
  step_naomit(lag_1_SalesVolume, skip=FALSE) %>% # this line is new
  prep()

rec %>%
  bake(df) %>%
  filter(Date <= ymd(20050201)) %>%
  head(8)
```
We need the `skip=FALSE` parameter in the `step_naomit()` function to explicitly tell the recipe that we want to remove rows with missing values every time the recipe is baked. By default, `step_naomit()` uses `skip=TRUE`, which means the step would be skipped whenever `bake()` is called.
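To see the difference, here is a minimal sketch (reusing the `df` from Part I): with `step_naomit()`'s default of `skip=TRUE`, the rows with missing lags would survive baking.

```r
# Sketch: the same recipe, but relying on step_naomit()'s default skip=TRUE
rec_skipped <-
  recipe(SalesVolume ~ ., data = df) %>%
  step_arrange(RegionName, Date) %>%
  step_lag(SalesVolume, lag=1) %>%
  step_naomit(lag_1_SalesVolume) %>% # default: skip=TRUE
  prep()

# The NA rows are still there after baking, because the step was skipped
rec_skipped %>%
  bake(df) %>%
  filter(is.na(lag_1_SalesVolume))
```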
Identify columns that won't be used by algorithms
If you run `summary(rec)`, you will see that `Date` and `RegionName` have the role of a "predictor", but we don't want to use them as predictors. We want to use them as "identifiers" for that row. We can change the role of a variable using the `update_role()` function. Let's edit the `rec` pipeline to include the `update_role()` function to change the role of `Date` and `RegionName` to "id":
```r
rec <-
  recipe(SalesVolume ~ ., data = df) %>%
  step_arrange(RegionName, Date) %>%
  step_lag(SalesVolume, lag=1) %>%
  step_naomit(lag_1_SalesVolume, skip=FALSE) %>%
  update_role(Date, RegionName, new_role = "id") %>% # this line is new
  prep()

summary(rec)
```
Now we see that only `lag_1_SalesVolume` is correctly identified as a predictor. The other two columns are now identified as "id" variables.
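Since `summary()` on a prepped recipe returns a tibble, we can also check this programmatically. A quick sketch:

```r
# summary(rec) is a tibble with variable, type, role and source columns,
# so we can filter it down to the predictors only
summary(rec) %>%
  filter(role == "predictor")
```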
How to use this in a model?
For this we need a workflow to which we can attach the recipe and the model. Creating a workflow for a linear regression looks like this:
```r
# Create the specification of a model but don't fit it yet
lm_model <-
  linear_reg() %>%
  set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model)

print(wflow)
```
Then, you can fit the workflow to the data:
```r
model <-
  wflow %>%
  fit(data = df)

model
```
Do you want to confirm what recipe was used to pre-process the data? You can `extract_recipe()`:
```r
model %>% extract_recipe()
```
Need to extract the fitted model? You can `extract_fit_parsnip()`:
```r
model %>% extract_fit_parsnip()
```
The output of the above is the same as you would get if you had fitted the model directly using the `fit()` function (like we did last week). That is, you can use it to `augment()` the data with predictions:
```r
fitted_model <- model %>% extract_fit_parsnip()

# To make predictions, I cannot use the original data.
# Instead, I have to apply the same pre-processing steps (i.e. bake the recipe)
df_baked <- rec %>% bake(df)

fitted_model %>%
  augment(new_data = df_baked) %>%
  head()
```
Notice that this augments our data frame with `.pred` and `.resid` columns.
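If you want to convince yourself of what `.resid` contains, a quick sketch: it is simply the observed value minus the prediction.

```r
# Quick check (sketch): .resid equals the truth minus the prediction
fitted_model %>%
  augment(new_data = df_baked) %>%
  mutate(resid_check = SalesVolume - .pred) %>%
  select(RegionName, Date, SalesVolume, .pred, .resid, resid_check) %>%
  head()
```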
How well did it perform, on average?
```r
fitted_model %>%
  augment(new_data = df_baked) %>%
  group_by(RegionName) %>%
  mae(truth = SalesVolume, estimate = .pred)
```
Compare it with the summary statistics of the `SalesVolume` variable:
```r
# Summary statistics of SalesVolume per country
df_baked %>%
  group_by(RegionName) %>%
  summarise(mean_sales = mean(SalesVolume), sd_sales = sd(SalesVolume))
```
Part III - Practice with recipes and workflows (40 min)
If you are participating in the GENIAL project, you are asked to:

- Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
- Have only the following tabs open in your browser:
  - These lab instructions
  - The ChatGPT website (open a new chat window and name it "DS202A - Week 04")
  - The `tidymodels` documentation page (you can open tabs with documentation pages for each package if you need to)
  - The `tidyverse` documentation page (you can open tabs with documentation pages for each package if you need to)
- Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.
- Fill out this brief survey at the end of the lab: 🔗 link (requires LSE login)
In case you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

- The ChatGPT website (open a new chat window and name it "DS202A - Week 04")
- The `tidymodels` documentation page (you can open tabs with documentation pages for each package if you need to)
- The `tidyverse` documentation page (you can open tabs with documentation pages for each package if you need to)
- Question: Can you replicate the filtering steps we applied to `uk_hpi` for England, but this time, use a recipe for the same tasks?
  - Employ a `step_filter()` to limit the records to those from 2005 and beyond.
  - Use `step_mutate()` to format the `Date` column correctly.
  - Set all variables other than `SalesVolume` as `id` variables.
  - Generate the lagged variable of `SalesVolume` as done previously.
  - Eliminate rows with missing values as done previously.

Check that `summary(rec)` produces exactly the following output:
variable | type | role | source |
---|---|---|---|
Date | date | id | original |
RegionName | factor , unordered, nominal | id | original |
AreaCode | factor , unordered, nominal | id | original |
AveragePrice | double , numeric | id | original |
Index | double , numeric | id | original |
IndexSA | double , numeric | id | original |
1m%Change | double , numeric | id | original |
12m%Change | double , numeric | id | original |
AveragePriceSA | double , numeric | id | original |
DetachedPrice | double , numeric | id | original |
DetachedIndex | double , numeric | id | original |
Detached1m%Change | double , numeric | id | original |
Detached12m%Change | double , numeric | id | original |
SemiDetachedPrice | double , numeric | id | original |
SemiDetachedIndex | double , numeric | id | original |
SemiDetached1m%Change | double , numeric | id | original |
SemiDetached12m%Change | double , numeric | id | original |
TerracedPrice | double , numeric | id | original |
TerracedIndex | double , numeric | id | original |
Terraced1m%Change | double , numeric | id | original |
Terraced12m%Change | double , numeric | id | original |
FlatPrice | double , numeric | id | original |
FlatIndex | double , numeric | id | original |
Flat1m%Change | double , numeric | id | original |
Flat12m%Change | double , numeric | id | original |
CashPrice | double , numeric | id | original |
CashIndex | double , numeric | id | original |
Cash1m%Change | double , numeric | id | original |
Cash12m%Change | double , numeric | id | original |
CashSalesVolume | double , numeric | id | original |
MortgagePrice | double , numeric | id | original |
MortgageIndex | double , numeric | id | original |
Mortgage1m%Change | double , numeric | id | original |
Mortgage12m%Change | double , numeric | id | original |
MortgageSalesVolume | double , numeric | id | original |
FTBPrice | double , numeric | id | original |
FTBIndex | double , numeric | id | original |
FTB1m%Change | double , numeric | id | original |
FTB12m%Change | double , numeric | id | original |
FOOPrice | double , numeric | id | original |
FOOIndex | double , numeric | id | original |
FOO1m%Change | double , numeric | id | original |
FOO12m%Change | double , numeric | id | original |
NewPrice | double , numeric | id | original |
NewIndex | double , numeric | id | original |
New1m%Change | double , numeric | id | original |
New12m%Change | double , numeric | id | original |
NewSalesVolume | double , numeric | id | original |
OldPrice | double , numeric | id | original |
OldIndex | double , numeric | id | original |
Old1m%Change | double , numeric | id | original |
Old12m%Change | double , numeric | id | original |
OldSalesVolume | double , numeric | id | original |
SalesVolume | double , numeric | outcome | original |
lag_1_SalesVolume | double , numeric | predictor | derived |
```r
# Delete this line and add your code here
```
- Using this recipe, train a model that uses as training data only the records of England, Scotland and Wales. Calculate the Mean Absolute Error (MAE) of the training data. Then, respond: how does it compare to the MAE of the model trained on all UK countries (Part II)?
```r
# Delete this line and add your code here
```
- Now, let's think of Northern Ireland records as a testing set. Predict `SalesVolume` for Northern Ireland using the model you fitted in Step 2. Calculate the MAE of the predictions. What do you make of it?
```r
# Delete this line and add your code here
```
- Write a function called `plot_residuals()` that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.
```r
plot_residuals <- function(wflow_model, data) {

  # Replace this part with your code
  ...

  g <- ggplot(plot_df, aes(x=.pred, y=.resid, ...)) +
    # Replace this part with your code
    ...

  return(g)
}
```
After filling in the blanks above, create one such plot for the training data used in Step 2 and another, separate one for the test data used in Step 3.
```r
# Delete this line and add your code here
```
- Fill out this brief survey at the end of the lab if you are part of GENIAL: 🔗 link (requires LSE login)