Week 03 - Homework
Tidymodels recipes and workflows - a tutorial
Learning Objectives
By the end of this homework, you will be able to:
- Learn how to use the `recipes` package to pre-process data before fitting a model
- Learn how to use the `workflows` package to combine a recipe and a model into a single object
- Learn how to use the `bake()` function to apply a recipe to a data frame
- Create a custom function in R
Preparation
Same prep as this week's lab.
Then use the link below to download the lab file:
Solutions to this homework will only be posted after you've had some time to work on it, i.e. by Sunday morning.
Tasks
🚨 You need to have the most recent version of `tidymodels` (i.e. 1.2.0) installed for the following code to work! Check that's the case by running the following command in the RStudio Console (or the VSCode terminal, after typing `R` to go into the R environment): `packageVersion("tidymodels")`. If not, update your `tidymodels` with `install.packages("tidymodels")`.
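Put together, the check and (if needed) the update look like this:

# Check which version of tidymodels is installed
packageVersion("tidymodels")

# If it is older than 1.2.0, install the latest version from CRAN
install.packages("tidymodels")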
Part I - Getting Ready
Import required libraries:
# Remember to install.packages() any missing packages
# Tidyverse packages we will use
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(stringr)
library(rsample) # this is what allows you to split data into training and test sets!
# Tidymodels packages we will use
library(yardstick) # this is where metrics like mae() are
library(parsnip) # This is where models like linear_reg() are
library(recipes) # You will learn about this today
library(workflows) # You will learn about this today
# And finally a package for cleaning variable names
library(janitor)
Read the data set:
It is the same data set you used in this week's lab:
# Modify the filepath if needed
filepath <- "data/WVS_Wave_7.csv"
wvs <- read_csv(filepath)
As in this week's lab, if you find that your laptop is unable to handle the full data set without running slowly, try experimenting with the following code (reminder: this takes a random sample of the data, stratifying by country so the sampling algorithm doesn't take more data from one country and less from another).
# Set a seed for reproducibility
set.seed(123)
wvs <-
  # Load the .csv
read_csv("data/WVS_Wave_7.csv") %>%
# Sample a proportion of the data set for each country.
# We have used 25% but you can experiment depending on
# the capability of your machine.
group_by(iso3c) %>%
slice_sample(prop = 0.25) %>%
ungroup()
Part II - Tidymodels recipes
Say our goal is to predict `satisfaction` using the variables available in the dataset, with the exception of `iso3c` (just as you did in this week's lab!). That is:
# Split the data with 75% being used to train the model
wvs_split <- initial_split(wvs, prop = 0.75)

# Create tibbles of the training and test set
wvs_train <- training(wvs_split)
wvs_test  <- testing(wvs_split)

# Then train it with tidymodels:
model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(satisfaction ~ . - iso3c, data = wvs_train)
But today, we want to show you a different way of pre-processing your data before fitting a model!
We start with a recipe
The problem with the code above is that, if you need to use the fitted model to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you only do it once, but if you need to do it many times, it can become a bit of a pain. This is where the `recipes` package (part of `tidymodels`) comes in handy.
A recipe is a set of instructions for how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredients are the data frame, and the instructions are the pre-processing steps we want to apply to it.
Here's how we would construct a recipe to pre-process our data before fitting a linear regression model:
# Create a recipe
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c)

summary(rec)
In this particular case, the only pre-processing we do is remove the variable `iso3c` from consideration in the model: that's what `step_rm()` does. Other examples of pre-processing could include removing the observations/rows that contain missing values (`step_naomit()`) or imputing (i.e. "taking a best guess" at) them (e.g. `step_impute_median()` or `step_impute_linear()`).
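For illustration only (these steps are not needed for this homework, and the names and column selections below are just hypothetical examples), such recipes could look like this:

# Illustrative only: drop any row with a missing predictor value...
rec_naomit <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  step_naomit(all_predictors())

# ...or impute missing numeric values with the column median instead
rec_impute <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  step_impute_median(all_numeric_predictors())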
The previous syntax is enough if all you're going to do is use your recipe to train a model. However, if you want to, say, check the correctness of your pre-processing by printing out the actual dataset you obtain after pre-processing, then you'll need a few extra steps. The previous code will then look like this:
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  prep() %>%
  bake(new_data = wvs_train)
`prep()` estimates all the quantities and statistics that the recipe you've specified (i.e. the pre-processing steps) needs in order to be applied by the `bake()` function. Note that you tell `bake()`, via its `new_data` argument, which dataset the pre-processing steps apply to - in this case the training set. You can think of `prep()` as the ingredient-prepping step in a cooking recipe, before you bake/cook everything and get the final meal!
⚠️ If you don't go through the prep and bake steps, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.
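If you would rather keep the recipe itself around for later reuse, you can also store the prepped recipe and the baked data in separate objects. A minimal sketch (the names prepped_rec and baked_train are just illustrative):

# Prep the recipe once...
prepped_rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  prep()

# ...then bake whichever data set you need with it
baked_train <- bake(prepped_rec, new_data = wvs_train)
head(baked_train)  # iso3c should no longer appear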
How to use this in a model?
So we have our recipe from before:
# Create a recipe
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c)
How do we now use this in a model?
For this we need a workflow to which we can attach the recipe and the model. Creating a workflow of a linear regression looks like this:
# Create the specification of a model but don't fit it yet
lm_model <-
  linear_reg() %>%
  set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model)
print(wflow)
Then, you can fit the workflow to the data:
model <-
  wflow %>%
  fit(data = wvs_train)

model
Do you want to confirm which recipe was used to pre-process the data? You can use `extract_recipe()`:
model %>%
  extract_recipe()
You can use your model to `augment()` the data with predictions:
model %>%
  augment(new_data = wvs_test) %>%
  head()
Notice that this augments our data frame with `.pred` and `.resid` columns.
How well did our model perform?
reg_metrics <- metric_set(rsq, rmse, mae, mape)

model %>%
  augment(new_data = wvs_test) %>%
  reg_metrics(truth = satisfaction, estimate = .pred)
You already know about the \(R^2\) (`rsq`) and \(RMSE\) (`rmse`) metrics: can you make sense of the other two metrics (`mae` for MAE and `mape` for MAPE)?
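If you need a hint, here are the usual definitions, where \(y_i\) are the observed values, \(\hat{y}_i\) the predictions and \(n\) the number of observations:

\[
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\]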
(Bonus) See if you can plot the metrics for a few countries of your choice.
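One possible starting point for the bonus (the country codes chosen here are purely illustrative - pick any that appear in your sample):

# Compute the metrics separately for a few countries, then plot them
model %>%
  augment(new_data = wvs_test) %>%
  filter(iso3c %in% c("AND", "ARG", "AUS")) %>%  # illustrative country codes
  group_by(iso3c) %>%
  reg_metrics(truth = satisfaction, estimate = .pred) %>%
  ggplot(aes(x = iso3c, y = .estimate)) +
  geom_col() +
  facet_wrap(~ .metric, scales = "free_y") +
  theme_bw() +
  labs(x = "Country", y = "Metric estimate")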
Let's look at the residuals plot for our model!
model %>%
  augment(new_data = wvs_test) %>%
ggplot(aes(.pred, .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_bw() +
labs(x = "Predicted life satisfaction", y = "Residuals")
Part III - Practice with recipes and workflows
1. Consider the following simulated data frame.
set.seed(123)
df_sim <-
  tibble(outcome = rnorm(100, 0, 1),
feature1 = rnorm(100, 0, 1),
feature2 = rnorm(100, 0, 20))
We have created a data set with one outcome (creatively labeled `outcome`) and two features, `feature1` and `feature2`, where one variable has been assigned a standard deviation 20 times the size of the other. In some cases, like linear regression, it will be okay to use `outcome ~ feature1 + feature2` as a formula. However, with other models, variables with larger variation will play a disproportionate role. As a result, we can use standardization (a.k.a. normalization) to counter this issue. This involves subtracting the mean of the variable and dividing the result by the standard deviation, i.e. \(z = (x - \bar{x}) / s_x\). Create a recipe for this data frame using a step from the `tidymodels` manual, and make use of graphs to demonstrate that the standardization process "worked".
2. Some models will only accept numeric input, meaning that variables like `employment` or `better_living` will need to be turned into several dummy variables. Using the `tidymodels` manual, can you find a step that can help us achieve this end?
3. Can you tweak the model we wrote in Part I to include the dummy step from Question 2? Evaluate the performance of this new model (first on the training set, then on the test set).
4. Write a function called `plot_residuals()` that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.
plot_residuals <- function(wflow_model, data) {

  # Replace this part with your code
  ...

  g <- ggplot(plot_df, aes(x = .pred, y = .resid, ...)) +
    # Replace this part with your code
    ...

  return(g)
}
5. Using the `plot_residuals()` function you've just created, can you create:
   - one residuals against fitted values plot for the training data used in Question 3,
   - and another, separate residuals against fitted values plot for the test data used in Question 3?
6. (Bonus) Can you train a LASSO model on the data and evaluate its performance using recipes and workflows? How does this model compare to linear regression? (See the sketch below for one possible starting point.)
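If you attempt the bonus, here is a minimal sketch of what a LASSO specification could look like, assuming the glmnet package is installed; lasso_model, lasso_wflow and the penalty value are illustrative choices, and your recipe will likely need extra steps (e.g. the dummy step from Question 2) before glmnet will accept the data:

# A LASSO is a linear regression with an L1 penalty:
# mixture = 1 gives a pure LASSO; penalty sets the amount of regularisation
lasso_model <-
  linear_reg(penalty = 0.01, mixture = 1) %>%  # penalty chosen arbitrarily here
  set_engine("glmnet")

# Reuse the recipe + workflow pattern from Part II
lasso_wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(lasso_model)

lasso_fit <-
  lasso_wflow %>%
  fit(data = wvs_train)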