✅ Week 03 - Homework: Solutions

Tidymodels recipes and workflows - a tutorial

Author

Dr. Ghita Berrada

Published

16 Oct 2024

Instead of just reading the solutions, you might want to download the .qmd file and run the code yourself.

🚨 You need to have the most recent version of tidymodels (i.e. 1.2.0) installed for the following code to work! Check that's the case by running the following command in the RStudio Console (or the VSCode terminal, after typing R to enter the R environment): packageVersion("tidymodels"). If you have an older version, update tidymodels with install.packages("tidymodels").
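If you prefer a copy-pasteable check, here is a minimal sketch that installs a newer tidymodels only when the installed version is too old:

if (packageVersion("tidymodels") < "1.2.0") {
  # install.packages() fetches the latest CRAN release
  install.packages("tidymodels")
}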

Part I - Getting Ready

Import required libraries:

# Remember to install.packages() any missing packages

# Tidyverse packages we will use
library(dplyr)     
library(ggplot2)
library(tidyr)     
library(readr)
library(stringr)
library(rsample) # this is what allows you to split data into training and test sets!

# Tidymodel packages we will use
library(yardstick) # this is where metrics like mae() are
library(parsnip)   # This is where models like linear_reg() are
library(recipes)   # You will learn about this today
library(workflows) # You will learn about this today

# And finally a package for cleaning variable names
library(janitor)

Read the data set:

It is the same data set you used in this week’s lab:

# Modify the filepath if needed
filepath <- "data/WVS_Wave_7.csv"
wvs <- read_csv(filepath)
Rows: 86867 Columns: 17                                                                                            
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): iso3c, employment, better_living
dbl  (4): satisfaction, age, financial_situ, relig_import
lgl (10): social_trust, male, post_second_edu, rural, married, no_food, no_s...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

As in this week’s lab, if you find that your laptop is unable to handle the full data set without running slowly, try experimenting with the following code (reminder: this takes a random sample of the data, stratifying by country so the sampling algorithm doesn’t take more data from one country and less from another).

# Set a seed for reproducibility

set.seed(123)
  
wvs <-
  # Load the .csv
  read_csv("data/WVS_Wave_7.csv") %>%
  # Sample a proportion of the data set for each country.
  # We have used 25% but you can experiment depending on
  # the capability of your machine.
  group_by(iso3c) %>% 
  slice_sample(prop = 0.25) %>% 
  ungroup()
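If you go with this subsample, a quick sanity check (a sketch) confirms that every country is still represented, each at roughly a quarter of its original number of rows:

# Count the remaining rows per country
wvs %>% 
  count(iso3c) %>% 
  head()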

Part II - Tidymodels recipes

Say our goal is to predict satisfaction using the variables available in the dataset with the exception of iso3c (just as you did in this week’s lab!). That is:

# Split the data with 75% being used to train the model

wvs_split <- initial_split(wvs, prop = 0.75)

# Create tibbles of the training and test set

wvs_train <- training(wvs_split)
wvs_test <- testing(wvs_split)
    

# Then train it with tidymodels:
model <- 
    linear_reg() %>%
    set_engine("lm") %>%
    fit(satisfaction ~ .-iso3c, data = wvs_train)

But today, we want to show you a different way of pre-processing your data before fitting a model!

We start with a recipe

The problem with the code above is that, if you need to use the fitted model to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you are only doing it once, but if you need to do it many times, it can become a bit of a pain. This is where the recipes package (of tidymodels) comes in handy.

A recipe is a set of instructions of how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredients are the variables of the data frame, and the instructions are the pre-processing steps we want to apply to them.

Here’s how we would construct a recipe to pre-process our data before fitting a linear regression model:

#Create a recipe
rec <- 
  recipe(satisfaction ~ . , data = wvs) %>%
  step_rm(iso3c)


summary(rec)
# A tibble: 17 × 4
   variable        type      role      source  
   <chr>           <list>    <chr>     <chr>   
 1 iso3c           <chr [3]> predictor original
 2 social_trust    <chr [1]> predictor original
 3 male            <chr [1]> predictor original
 4 age             <chr [2]> predictor original
 5 post_second_edu <chr [1]> predictor original
 6 rural           <chr [1]> predictor original
 7 employment      <chr [3]> predictor original
 8 financial_situ  <chr [2]> predictor original
 9 married         <chr [1]> predictor original
10 relig_import    <chr [2]> predictor original
11 better_living   <chr [3]> predictor original
12 no_food         <chr [1]> predictor original
13 no_safety       <chr [1]> predictor original
14 no_medical      <chr [1]> predictor original
15 no_cash         <chr [1]> predictor original
16 no_shelter      <chr [1]> predictor original
17 satisfaction    <chr [2]> outcome   original

In this particular case, the only pre-processing we do is remove the variable iso3c from consideration in the model: that’s what step_rm() does. Other examples of pre-processing could include removing the observations/rows that contain missing values (step_naomit()) or imputing them, i.e. ‘taking a best guess’ at what they should be (e.g. step_impute_median() or step_impute_linear()).
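For instance, if our data did have missing values (it doesn’t, as the recipe printout further down confirms), a hypothetical recipe with an imputation step might look like this:

# A sketch: median-impute missing ages instead of dropping incomplete rows
rec_impute <- 
  recipe(satisfaction ~ . , data = wvs) %>%
  step_rm(iso3c) %>%
  step_impute_median(age)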

Warning

The previous syntax is enough if all you’re going to do is use your recipe to train a model. However, if you want to, say, check the correctness of your pre-processing by printing out the actual dataset that you obtain after pre-processing, then you’ll need a few extra steps to get there.

The previous code will then look like this:

rec_baked <- 
  recipe(satisfaction ~ . , data = wvs) %>%
  step_rm(iso3c) %>%
  prep() %>%
  bake(new_data = wvs_train)

prep() estimates all the quantities and statistics needed for the recipe you’ve specified (i.e. the pre-processing steps) to be applied by the bake() function (note that you specify which dataset the bake() function, and ultimately the pre-processing steps, apply to - in this case the training set). You can think of it as the (ingredient) prepping step in a cooking recipe, before you bake/cook everything and get the final meal!

⚠️ If you don’t go through the prep and bake steps, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.
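This is also where the pay-off for new data comes in: once a recipe has been prepped, the very same trained steps can be baked onto any new data set. A minimal sketch, reusing our recipe on the test set:

rec_prepped <- 
  recipe(satisfaction ~ . , data = wvs) %>%
  step_rm(iso3c) %>%
  prep()

# Apply the trained steps to unseen data
bake(rec_prepped, new_data = wvs_test)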

How to use this in a model?

So we have our recipe from before:

#Create a recipe
rec <- 
  recipe(satisfaction ~ . , data = wvs) %>%
  step_rm(iso3c)

How do we now use this in a model?

For this we need a workflow to which we can attach the recipe and the model. Creating a workflow of a linear regression looks like this:

# Create the specification of a model but don't fit it yet
lm_model <- 
  linear_reg() %>%
  set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(lm_model)

print(wflow)
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 Recipe Step

• step_rm()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Then, you can fit the workflow to the data:

model <- 
  wflow %>% 
  fit(data = wvs_train)

model
══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 Recipe Step

• step_rm()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                 (Intercept)              social_trustTRUE  
                   3.7784484                     0.0797187  
                    maleTRUE                           age  
                  -0.0495987                     0.0000462  
         post_second_eduTRUE                     ruralTRUE  
                  -0.0314364                    -0.0465320  
employmentHouse wife/husband               employmentOther  
                   0.0123533                     0.0596715  
         employmentPart time             employmentRetired  
                   0.1207598                    -0.0130359  
     employmentSelf employed             employmentStudent  
                  -0.0007829                    -0.0163621  
        employmentUnemployed                financial_situ  
                  -0.1666486                     0.4774211  
                 marriedTRUE                  relig_import  
                   0.1837895                     0.0591540  
     better_livingBetter off        better_livingWorse off  
                  -0.0153946                    -0.4330708  
                 no_foodTRUE                 no_safetyTRUE  
                  -0.1942058                    -0.0419340  
              no_medicalTRUE                   no_cashTRUE  
                  -0.1346880                     0.0389337  
              no_shelterTRUE  
                  -0.2189660  

Do you want to confirm what recipe was used to pre-process the data? You can extract_recipe():

model %>% 
  extract_recipe()
── Recipe ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:    1
predictor: 16

── Training information 
Training data contained 65150 data points and no incomplete rows.

── Operations 
• Variables removed: iso3c | Trained
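Similarly, if you want the underlying fitted model itself, for instance to get a tidy table of coefficients, you can use extract_fit_parsnip() (a sketch):

model %>% 
  extract_fit_parsnip() %>% 
  tidy() %>% 
  head()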

You can use your model to augment() the data with predictions:

model %>% 
  augment(new_data = wvs_test) %>% 
  head()
# A tibble: 6 × 19
  .pred .resid iso3c satisfaction social_trust male    age post_second_edu rural
  <dbl>  <dbl> <chr>        <dbl> <lgl>        <lgl> <dbl> <lgl>           <lgl>
1  5.59  4.41  AND             10 FALSE        FALSE    43 FALSE           FALSE
2  5.79 -4.79  AND              1 FALSE        TRUE     33 FALSE           FALSE
3  7.90  2.10  AND             10 FALSE        TRUE     39 FALSE           TRUE 
4  7.47 -0.471 AND              7 TRUE         FALSE    66 TRUE            FALSE
5  8.72  0.281 AND              9 FALSE        FALSE    48 FALSE           FALSE
6  6.81 -0.808 AND              6 FALSE        TRUE     60 TRUE            FALSE
# ℹ 10 more variables: employment <chr>, financial_situ <dbl>, married <lgl>,
#   relig_import <dbl>, better_living <chr>, no_food <lgl>, no_safety <lgl>,
#   no_medical <lgl>, no_cash <lgl>, no_shelter <lgl>

Notice that this augments our data frame with .pred and .resid columns.

How well did our model perform?

reg_metrics <- metric_set(rsq, rmse, mae, mape)
model %>%
    augment(new_data=wvs_test)%>%
    reg_metrics(truth = satisfaction, estimate = .pred)
# A tibble: 4 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.328
2 rmse    standard       1.83 
3 mae     standard       1.37 
4 mape    standard      32.9 

You already know about the \(R^2\) (rsq) and \(RMSE\) (rmse) metrics: can you make sense of the other two metrics (mae for MAE and mape for MAPE)?

Based on \(R^2\) (a scale-agnostic metric, since it doesn’t depend on the units of the outcome), our model doesn’t seem like a good fit: for the model to be a good fit, you would ideally want \(R^2\) to be closer to 1. For more on \(R^2\), see this page, this one and, finally, this one.

RMSE is the square root of the mean of the squared residuals. The lower the RMSE, the better a given model fits your dataset. For our dataset, the RMSE turns out to be 1.83, which looks relatively low. However, RMSE is much more useful for comparing different models on the same dataset: interpreting a single RMSE measurement is rather tricky, but comparing the RMSE of different models on the same dataset tells you which one fits better. For more information on RMSE, see this page and this page.

MAPE (or mean absolute percentage error) for our model is 32.9. This means that, on average, the error between the life satisfaction predicted by the model and the actual satisfaction amounts to 32.9% of the actual value. Do you think this is a low or high number? As usual with these sorts of metrics, their value lies more in comparisons with other models: a model with a lower MAPE than another is likely to be a better fit. For more on MAPE, see this page or this page or this one
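For reference, MAPE is computed as

\[\text{MAPE} = \frac{100}{n}\sum_{i=1}^n\left|\frac{y_i-\hat{y}_i}{y_i}\right|\]

which is why it reads as a percentage of the actual values.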

The last metric is MAE (mean absolute error). It simply tells you, on average, how far off the model is from the actual data. The formula for this metric is \[\frac{1}{n}\sum_{i=1}^n|y_i-\hat{y}_i|\]

This metric is expressed in the scale of the variable you are predicting (i.e. if you are predicting a salary in GBP, MAE is in GBP). The lower this metric, the better. It treats all errors equally (i.e. unlike RMSE, it doesn’t emphasize large errors over small ones). However, you can’t use it to compare models trained on different datasets, since it reflects an absolute error. For more on MAE, see this page or this page or this one.

(Bonus) See if you can plot the metrics for a few countries of your choice.

Let’s get the metrics per country (for the sake of example, I am choosing the US (country code: USA), the UK (country code: GBR), Morocco (country code: MAR) and the Netherlands (country code: NLD)):

metrics_per_country <-
  model %>%
    augment(new_data = wvs_test) %>%
    filter(iso3c %in% c('GBR', 'MAR', 'USA', 'NLD')) %>%
    group_by(iso3c) %>%
    reg_metrics(truth = satisfaction, estimate = .pred)

metrics_per_country
# A tibble: 16 × 4
   iso3c .metric .estimator .estimate
   <chr> <chr>   <chr>          <dbl>
 1 GBR   rsq     standard       0.353
 2 MAR   rsq     standard       0.628
 3 NLD   rsq     standard       0.317
 4 USA   rsq     standard       0.312
 5 GBR   rmse    standard       1.51 
 6 MAR   rmse    standard       1.54 
 7 NLD   rmse    standard       1.13 
 8 USA   rmse    standard       1.52 
 9 GBR   mae     standard       1.11 
10 MAR   mae     standard       1.19 
11 NLD   mae     standard       0.793
12 USA   mae     standard       1.13 
13 GBR   mape    standard      21.0  
14 MAR   mape    standard      33.6  
15 NLD   mape    standard      13.3  
16 USA   mape    standard      20.8 

Our various metrics have different scales, so let’s plot them separately in bar charts (a single faceted alternative is sketched after the four plots).

Let’s start with \(R^2\)

metrics_per_country %>%
          rename(country_code=iso3c,Value=.estimate,Metric=.metric)%>%
          filter(Metric=='rsq')%>%
          ggplot(aes(Metric,Value,fill=country_code))+
          geom_bar(stat="identity", position = "dodge")+
          scale_fill_brewer(palette='Pastel1')

Then RMSE:

metrics_per_country %>%
          rename(country_code=iso3c,Value=.estimate,Metric=.metric)%>%
          filter(Metric=='rmse')%>%
          ggplot(aes(Metric,Value,fill=country_code))+
          geom_bar(stat="identity", position = "dodge")+
          scale_fill_brewer(palette='Pastel1')

Then MAE:

metrics_per_country %>%
          rename(country_code=iso3c,Value=.estimate,Metric=.metric)%>%
          filter(Metric=='mae')%>%
          ggplot(aes(Metric,Value,fill=country_code))+
          geom_bar(stat="identity", position = "dodge")+
          scale_fill_brewer(palette='Pastel1')

And finally MAPE:

metrics_per_country %>%
          rename(country_code=iso3c,Value=.estimate,Metric=.metric)%>%
          filter(Metric=='mape')%>%
          ggplot(aes(Metric,Value,fill=country_code))+
          geom_bar(stat="identity", position = "dodge")+
          scale_fill_brewer(palette='Pastel1')
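If you would rather see everything in one figure, a faceted version (a sketch reusing the metrics_per_country tibble from above; free y-scales let each metric keep its own range) could look like this:

metrics_per_country %>%
  rename(country_code = iso3c, Value = .estimate, Metric = .metric) %>%
  ggplot(aes(country_code, Value, fill = country_code)) +
  geom_col() +
  facet_wrap(~ Metric, scales = "free_y") +
  scale_fill_brewer(palette = 'Pastel1') +
  labs(x = "Country", y = "Metric value")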

If we’d stopped after looking at \(R^2\), we would have concluded the model performs best for Morocco. However, the other metrics, particularly MAPE, paint a different picture: errors are, on average, smaller for the Netherlands. The picture for Morocco diverges so much between \(R^2\) and MAPE that it might be worth having a look at the residual plot (see the per-country sketch below) to ascertain the underlying picture. Did we stumble upon any of the limitations of these two metrics in this particular case?

Let’s look at the residuals plot for our model!

model %>% 
  augment(new_data = wvs_test ) %>%
  ggplot(aes(.pred, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw() +
  labs(x = "Predicted life satisfaction", y = "Residuals")

It’s the same residuals plot you saw in the W03 lab! No surprises there since we fit the same model 😉
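To follow up on the Morocco question from the metrics comparison, here is a sketch of the same plot split by country for our four chosen countries:

model %>% 
  augment(new_data = wvs_test) %>%
  filter(iso3c %in% c('GBR', 'MAR', 'USA', 'NLD')) %>%
  ggplot(aes(.pred, .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  facet_wrap(~ iso3c) +
  theme_bw() +
  labs(x = "Predicted life satisfaction", y = "Residuals")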

Part III - Practice with recipes and workflows

  1. Consider the following simulated data frame.

set.seed(123)

df_sim <-
  tibble(outcome = rnorm(100, 0, 1),
         feature1 = rnorm(100, 0, 1),
         feature2 = rnorm(100, 0, 20))

Let’s have a quick look at the first lines of df_sim before we proceed:

df_sim  %>% head(10) %>% knitr::kable()
|    outcome|   feature1|   feature2|
|----------:|----------:|----------:|
| -0.5604756| -0.7104066|  43.976207|
| -0.2301775|  0.2568837|  26.248259|
|  1.5587083| -0.2466919|  -5.302901|
|  0.0705084| -0.3475426|  10.863881|
|  0.1292877| -0.9516186|  -8.286799|
|  1.7150650| -0.0450277|  -9.524938|
|  0.4609162| -0.7849045| -15.772057|
| -1.2650612| -1.6679419| -11.892345|
| -0.6868529| -0.3802265|  33.018149|
| -0.4456620|  0.9189966|  -1.080563|

We have created a data set with one outcome (creatively labeled outcome) and two features feature1 and feature2 where one variable has been assigned a standard deviation 20 times the size of the other variable. In some cases like linear regression, it will be okay to use outcome ~ feature1 + feature2 as a formula. However, with other models, variables with larger variation will play a disproportionate role. As a result, we can use standardization (a.k.a. normalization) to counter this issue. This involves subtracting the mean of the variable and dividing the result by the standard deviation. Create a recipe for this data frame using a step from the tidymodels manual, and make use of graphs to demonstrate that the standardization process “worked”.
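In symbols, standardization replaces each value \(x_i\) of a variable with

\[z_i = \frac{x_i - \bar{x}}{s_x}\]

where \(\bar{x}\) is the sample mean and \(s_x\) the sample standard deviation of that variable.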


# Let's look at the distributions of each feature overlaid!

df_sim %>% 
  pivot_longer(-outcome, names_to = 'feature') %>% 
  ggplot(aes(value, fill = str_remove(feature, "feature"))) +
  geom_histogram(alpha = 0.75) +
  theme_bw() +
  labs(x = "Feature value", y = "Frequency", fill = "Feature")

# Now let's normalize feature2 using step_normalize
df_normalized <-
  df_sim %>% 
  recipe(outcome ~ ., data = .) %>% 
  step_normalize(feature2) %>% 
  prep() %>% 
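  # new_data = NULL tells bake() to return the prepped training data itself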
  bake(new_data = NULL)
# Now let's plot the distributions of the features
df_normalized %>% 
  pivot_longer(-outcome, names_to = 'feature') %>% 
  ggplot(aes(value, fill = str_remove(feature, "feature"))) +
  geom_histogram(alpha = 0.75) +
  theme_bw() +
  labs(x = "Feature value", y = "Frequency", fill = "Feature")

  2. Some models will only require numeric input, meaning that variables like employment or better_living need to be converted into several dummy variables. Using the tidymodels manual, can you find a step that can help us achieve this end?
recipe2 <- 
  recipe(satisfaction ~ . , data = wvs) %>%
  step_rm(iso3c) %>%
  step_dummy(employment, better_living)

I can stop here if all I am doing is using the recipe for modeling (which is what we’ll do in Question 3). To be able to visualise the result of my pre-processing and check I’ve performed all the correct steps, here is an extra piece of code:

recipe2 %>%
  prep() %>%
  bake(new_data = wvs_train)
# A tibble: 65,150 × 23
   social_trust male    age post_second_edu rural financial_situ married
   <lgl>        <lgl> <dbl> <lgl>           <lgl>          <dbl> <lgl>  
 1 FALSE        FALSE    65 FALSE           TRUE               8 FALSE  
 2 FALSE        FALSE    47 FALSE           FALSE             10 TRUE   
 3 FALSE        TRUE     63 FALSE           FALSE              5 FALSE  
 4 FALSE        FALSE    46 TRUE            FALSE              4 FALSE  
 5 TRUE         TRUE     52 TRUE            FALSE              4 TRUE   
 6 TRUE         FALSE    59 FALSE           FALSE              8 FALSE  
 7 FALSE        TRUE     80 FALSE           FALSE              5 FALSE  
 8 FALSE        TRUE     69 FALSE           TRUE               5 FALSE  
 9 FALSE        FALSE    36 FALSE           TRUE               5 TRUE   
10 FALSE        TRUE     22 FALSE           TRUE               9 TRUE   
# ℹ 65,140 more rows
# ℹ 16 more variables: relig_import <dbl>, no_food <lgl>, no_safety <lgl>,
#   no_medical <lgl>, no_cash <lgl>, no_shelter <lgl>, satisfaction <dbl>,
#   employment_House.wife.husband <dbl>, employment_Other <dbl>,
#   employment_Part.time <dbl>, employment_Retired <dbl>,
#   employment_Self.employed <dbl>, employment_Student <dbl>,
#   employment_Unemployed <dbl>, better_living_Better.off <dbl>, …
# ℹ Use `print(n = ...)` to see more rows

If I wanted an even better view of my results, I could do this (try this code in your console - don’t render!):

baked_recipe <- recipe2 %>% 
  prep() %>%
  bake(new_data = wvs_train)

View(baked_recipe)
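One design note on step_dummy(): by default it creates k-1 indicator columns for a variable with k levels (the omitted level becomes the baseline), which is what a linear regression needs. If a model required all k columns, you could ask for one-hot encoding instead, as in this sketch (recipe2_onehot is a hypothetical variant, not used below):

recipe2_onehot <- 
  recipe(satisfaction ~ . , data = wvs) %>%
  step_rm(iso3c) %>%
  step_dummy(employment, better_living, one_hot = TRUE)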
  3. Can you tweak the model we wrote in Part II to include the dummy step from Question 2? Evaluate the performance of this new model (first on the training, then the test set).

The model specification from Part II doesn’t change: we’re still using a linear regression!

# Create the specification of a model but don't fit it yet
lm_model <- 
  linear_reg() %>%
  set_engine("lm")

We tweak our workflow slightly to use the recipe recipe2 that we’ve just defined in Question 2 (and that uses the dummy steps!)

# Create a workflow to add the recipe and the model
wflow <- 
  workflow() %>% 
  add_recipe(recipe2) %>% 
  add_model(lm_model)

print(wflow)
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_rm()
• step_dummy()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Then, you can fit the workflow to the data:

model <- 
  wflow %>% 
  fit(data = wvs_train)

model
══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_rm()
• step_dummy()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                  (Intercept)               social_trustTRUE  
                    3.7784484                      0.0797187  
                     maleTRUE                            age  
                   -0.0495987                      0.0000462  
          post_second_eduTRUE                      ruralTRUE  
                   -0.0314364                     -0.0465320  
               financial_situ                    marriedTRUE  
                    0.4774211                      0.1837895  
                 relig_import                    no_foodTRUE  
                    0.0591540                     -0.1942058  
                no_safetyTRUE                 no_medicalTRUE  
                   -0.0419340                     -0.1346880  
                  no_cashTRUE                 no_shelterTRUE  
                    0.0389337                     -0.2189660  
employment_House.wife.husband               employment_Other  
                    0.0123533                      0.0596715  
         employment_Part.time             employment_Retired  
                    0.1207598                     -0.0130359  
     employment_Self.employed             employment_Student  
                   -0.0007829                     -0.0163621  
        employment_Unemployed       better_living_Better.off  
                   -0.1666486                     -0.0153946  
      better_living_Worse.off  
                   -0.4330708  

After that, you can use your model to augment() the data with predictions and compute metrics just as you did before in Part II:

model %>% 
  augment(new_data = wvs_test) %>% 
  head()
# A tibble: 6 × 19
  .pred .resid iso3c satisfaction social_trust male    age post_second_edu rural
  <dbl>  <dbl> <chr>        <dbl> <lgl>        <lgl> <dbl> <lgl>           <lgl>
1  5.59  4.41  AND             10 FALSE        FALSE    43 FALSE           FALSE
2  5.79 -4.79  AND              1 FALSE        TRUE     33 FALSE           FALSE
3  7.90  2.10  AND             10 FALSE        TRUE     39 FALSE           TRUE 
4  7.47 -0.471 AND              7 TRUE         FALSE    66 TRUE            FALSE
5  8.72  0.281 AND              9 FALSE        FALSE    48 FALSE           FALSE
6  6.81 -0.808 AND              6 FALSE        TRUE     60 TRUE            FALSE
# ℹ 10 more variables: employment <chr>, financial_situ <dbl>, married <lgl>,
#   relig_import <dbl>, better_living <chr>, no_food <lgl>, no_safety <lgl>,
#   no_medical <lgl>, no_cash <lgl>, no_shelter <lgl>
reg_metrics <- metric_set(rsq, rmse, mae, mape)
model %>%
    augment(new_data=wvs_test)%>%
    reg_metrics(truth = satisfaction, estimate = .pred)
# A tibble: 4 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.328
2 rmse    standard       1.83 
3 mae     standard       1.37 
4 mape    standard      32.9  

How well did our model perform?

The dummy steps don’t seem to have made any difference to our model’s performance. That makes sense: the lm engine already converts character predictors into dummy variables internally when it builds the model matrix, so making the encoding explicit with step_dummy() produces exactly the same coefficients as in Part II.

  4. Write a function called plot_residuals() that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.
plot_residuals <- function(wflow_model, data) {
    plot_df <-
      wflow_model %>% # This will be the model fitted with
                      # the workflow, not the workflow 
                      # itself!
      augment(new_data = data)

    g <- 
      ggplot(plot_df, aes(x=.pred, y=.resid)) +
      geom_point(alpha = 0.75) +
      geom_hline(yintercept = 0, linetype = "dashed") +
      theme_bw() +
      labs(x = "Predicted life satisfaction", y = "Residuals")

    return(g)
    
}

Using the plot_residuals function you’ve just created, can you create:

  a. one residuals against the fitted values plot for the training data used in Question 3
  b. and another, separate residuals against the fitted values plot for the test data used in Question 3?
#(a)
plot_residuals(model, wvs_train)

#(b)
plot_residuals(model, wvs_test)

This second plot is the same residuals plot we plotted previously in Part II! No surprises there since we fit what is effectively the same model 😉

  5. (Bonus) Can you train a LASSO model on the data and evaluate its performance using recipes and workflows? How does this model compare to linear regression?

Let’s modify recipe2 slightly to make sure that all variables within our dataset are numeric, since the glmnet engine expects numeric predictors.

recipe3 <- 
  recipe(satisfaction ~ . , data = wvs) %>%
  step_rm(iso3c) %>%
  step_dummy(employment, better_living) %>%
  step_mutate(across(c(social_trust, male, post_second_edu, rural, married,
                       no_food, no_safety, no_medical, no_cash, no_shelter),
                     as.numeric))

Now, we need to modify our model specification: we are no longer fitting an ordinary linear regression but a LASSO, i.e. a linear regression with an L1 penalty. In parsnip, mixture = 1 selects a pure LASSO penalty (mixture = 0 would be ridge regression) and penalty sets the regularization strength.

lasso_model <-
  linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") 

We add recipe and model specification to a workflow just as before:

wflow_lasso <- 
  workflow() %>% 
  add_recipe(recipe3) %>% 
  add_model(lasso_model)

print(wflow_lasso)
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_rm()
• step_dummy()
• step_mutate()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.01
  mixture = 1

Computational engine: glmnet
Then, fit the workflow to the training data:

model_lasso <- 
  wflow_lasso %>% 
  fit(data = wvs_train)

model_lasso
══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_rm()
• step_dummy()
• step_mutate()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian",      alpha = ~1) 

   Df  %Dev  Lambda
1   0  0.00 1.23000
2   1  5.28 1.12000
3   1  9.66 1.02100
4   1 13.30 0.93010
5   1 16.33 0.84750
6   1 18.83 0.77220
7   1 20.92 0.70360
8   1 22.65 0.64110
9   1 24.08 0.58420
10  1 25.27 0.53230
11  1 26.26 0.48500
12  1 27.08 0.44190
13  1 27.77 0.40260
14  1 28.33 0.36690
15  1 28.80 0.33430
16  1 29.19 0.30460
17  1 29.52 0.27750
18  1 29.78 0.25290
19  1 30.01 0.23040
20  2 30.28 0.20990
21  2 30.52 0.19130
22  2 30.73 0.17430
23  2 30.90 0.15880
24  3 31.10 0.14470
25  3 31.29 0.13180
26  3 31.46 0.12010
27  4 31.63 0.10950
28  5 31.78 0.09974
29  7 31.93 0.09088
30  7 32.08 0.08280
31  7 32.20 0.07545
32  7 32.30 0.06874
33  7 32.38 0.06264
34  8 32.45 0.05707
35  8 32.52 0.05200
36  8 32.57 0.04738
37  8 32.62 0.04317
38  8 32.65 0.03934
39  9 32.68 0.03584
40  9 32.71 0.03266
41  9 32.74 0.02976
42 11 32.76 0.02711
43 11 32.78 0.02471
44 12 32.80 0.02251
45 12 32.82 0.02051
46 12 32.83 0.01869

...
and 26 more lines.
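A note on this printout: glmnet fits a whole path of models at once, so each row corresponds to one candidate penalty (Lambda), with Df the number of non-zero coefficients and %Dev the share of deviance explained at that penalty. The predictions below use the penalty we specified in the model specification (0.01).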

After that, you can use your model to augment() the data with predictions and compute metrics just as you did before with the linear model:

model_lasso %>% 
  augment(new_data = wvs_test) %>% 
  head()
# A tibble: 6 × 18
  .pred iso3c satisfaction social_trust male    age post_second_edu rural
  <dbl> <chr>        <dbl> <lgl>        <lgl> <dbl> <lgl>           <lgl>
1  5.55 AND             10 FALSE        FALSE    43 FALSE           FALSE
2  5.74 AND              1 FALSE        TRUE     33 FALSE           FALSE
3  7.94 AND             10 FALSE        TRUE     39 FALSE           TRUE 
4  7.45 AND              7 TRUE         FALSE    66 TRUE            FALSE
5  8.68 AND              9 FALSE        FALSE    48 FALSE           FALSE
6  6.84 AND              6 FALSE        TRUE     60 TRUE            FALSE
# ℹ 10 more variables: employment <chr>, financial_situ <dbl>, married <lgl>,
#   relig_import <dbl>, better_living <chr>, no_food <lgl>, no_safety <lgl>,
#   no_medical <lgl>, no_cash <lgl>, no_shelter <lgl>
reg_metrics <- metric_set(rsq, rmse, mae, mape)
model_lasso %>%
    augment(new_data=wvs_test)%>%
    reg_metrics(truth = satisfaction, estimate = .pred)
# A tibble: 4 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.328
2 rmse    standard       1.83 
3 mae     standard       1.38 
4 mape    standard      33.0  

How well did our model perform?

Overall, our metrics (in particular MAE and MAPE) show LASSO performing slightly worse than linear regression for our particular task. This is not too surprising: we fixed the penalty at 0.01 rather than tuning it, so we pay the cost of shrinking the coefficients without any guarantee that this penalty value is a good choice.