✅ Week 03 - Homework: Solutions
Tidymodels recipes and workflows - a tutorial
Instead of just reading the solutions, you might want to download the .qmd
file and run the code yourself. You can download the .qmd
file by clicking on the download button below:
🚨 You need to have the most recent version of tidymodels
(i.e. 1.2.0) installed for the following code to work! Check that this is the case by running the following command in the RStudio Console (or in the VSCode terminal, after typing R
and entering the R environment): packageVersion("tidymodels")
. If not, update your tidymodels
with install.packages("tidymodels")
.
Part I - Getting Ready
Import required libraries:
# Remember to install.packages() any missing packages
# Tidyverse packages we will use
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(stringr)
library(rsample) # this is what allows you to split data into training and test sets!
# Tidymodel packages we will use
library(yardstick) # this is where metrics like mae() are
library(parsnip) # This is where models like linear_reg() are
library(recipes) # You will learn about this today
library(workflows) # You will learn about this today
# And finally a package for cleaning variable names
library(janitor)
Read the data set:
It is the same data set you used in this week’s lab:
# Modify the filepath if needed
<- "data/WVS_Wave_7.csv"
filepath <- read_csv(filepath) wvs
Rows: 86867 Columns: 17
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): iso3c, employment, better_living
dbl (4): satisfaction, age, financial_situ, relig_import
lgl (10): social_trust, male, post_second_edu, rural, married, no_food, no_s...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
As in this week’s lab, if you find that your laptop is unable to handle the full data set without running slowly, try experimenting with the following code (reminder: this takes a random sample of the data, stratifying by country so the sampling algorithm doesn’t take more data from one country and less from another).
# Set a seed for reproducibility
set.seed(123)
wvs <-
  # Load the .csv
  read_csv("data/WVS_Wave_7.csv") %>%
  # Sample a proportion of the data set for each country.
  # We have used 25% but you can experiment depending on
  # the capability of your machine.
  group_by(iso3c) %>%
  slice_sample(prop = 0.25) %>%
  ungroup()
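If you want to reassure yourself that the sampling really was stratified, here is a quick sanity check (a sketch; wvs_full is a hypothetical helper object holding the unsampled data):
# Hypothetical: re-read the full data so we can compare row counts
wvs_full <- read_csv("data/WVS_Wave_7.csv")

# The proportion kept should be roughly 0.25 for every country
wvs_full %>%
  count(iso3c, name = "n_full") %>%
  left_join(count(wvs, iso3c, name = "n_sample"), by = "iso3c") %>%
  mutate(prop_kept = n_sample / n_full) %>%
  head()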
Part II - Tidymodels recipes
Say our goal is to predict satisfaction
using all the variables available in the dataset with the exception of iso3c
(just as you did in this week’s lab!). That is:
# Split the data with 75% being used to train the model
wvs_split <- initial_split(wvs, prop = 0.75)

# Create tibbles of the training and test set
wvs_train <- training(wvs_split)
wvs_test <- testing(wvs_split)
# Then train it with tidymodels:
model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(satisfaction ~ . - iso3c, data = wvs_train)
But today, we want to show you a different way of pre-processing your data before fitting a model!
We start with a recipe
The problem with the code above is that, if you need to use the fitted model
to make predictions on new data, you will need to apply the same pre-processing steps to that new data. This is not a problem if you are only doing it once, but if you need to do it many times, it can become a bit of a pain. This is where the recipes package (of tidymodels
) comes in handy.
A recipe is a set of instructions for how the data should be pre-processed before it is used in a model. It is a bit like a cooking recipe, where you have a list of ingredients and a set of instructions on how to prepare them. In our case, the ingredients are the data frame, and the instructions are the pre-processing steps we want to apply to it.
Here’s how we would construct a recipe to pre-process our data before fitting a linear regression model:
# Create a recipe
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c)
summary(rec)
# A tibble: 17 × 4
variable type role source
<chr> <list> <chr> <chr>
1 iso3c <chr [3]> predictor original
2 social_trust <chr [1]> predictor original
3 male <chr [1]> predictor original
4 age <chr [2]> predictor original
5 post_second_edu <chr [1]> predictor original
6 rural <chr [1]> predictor original
7 employment <chr [3]> predictor original
8 financial_situ <chr [2]> predictor original
9 married <chr [1]> predictor original
10 relig_import <chr [2]> predictor original
11 better_living <chr [3]> predictor original
12 no_food <chr [1]> predictor original
13 no_safety <chr [1]> predictor original
14 no_medical <chr [1]> predictor original
15 no_cash <chr [1]> predictor original
16 no_shelter <chr [1]> predictor original
17 satisfaction <chr [2]> outcome original
In this particular case, the only pre-processing we do is remove the variable iso3c
from consideration in the model: that’s what step_rm()
does. Other examples of pre-processing could include removing the observations/rows that contain missing values (step_naomit()
) or imputing them, i.e. ‘taking a best guess’ at what they should be (e.g. step_impute_median()
or step_impute_linear()
).
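For illustration, a minimal sketch of such a recipe might look like this (purely hypothetical here, since this data set has no missing values):
rec_impute <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  # Impute missing financial_situ values with the median...
  step_impute_median(financial_situ) %>%
  # ...then drop any rows that still contain missing predictors
  step_naomit(all_predictors())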
The previous syntax is enough if all you’re going to do is use your recipe to train a model. However, if you want to, say, check the correctness of your pre-processing by printing out the actual dataset you obtain after pre-processing, you’ll need a few extra steps to get there.
The previous code will then look like this:
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  prep() %>%
  bake(new_data = wvs_train)
prep()
estimates all the quantities and statistics needed for the recipe you’ve specified (i.e. the pre-processing steps) to be applied by the bake()
function. Note that you tell bake()
which dataset the pre-processing steps should apply to - in this case the training set. You can think of prep() as the (ingredient) prepping step in a cooking recipe, before you bake/cook everything and get the final meal!
⚠️ If you don’t go through the prep and bake steps, the pre-processing steps will not be applied. The recipe is just a recipe, not the final cooked meal. You need to bake it first.
How to use this in a model?
So we have our recipe from before:
# Create a recipe
rec <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c)
How do we now use this in a model?
For this we need a workflow to which we can attach the recipe and the model. Creating a workflow of a linear regression looks like this:
# Create the specification of a model but don't fit it yet
lm_model <-
  linear_reg() %>%
  set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model)
print(wflow)
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 Recipe Step
• step_rm()
── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Then, you can fit the workflow to the data:
model <-
  wflow %>%
  fit(data = wvs_train)

model
══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 Recipe Step
• step_rm()
── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) social_trustTRUE
3.7784484 0.0797187
maleTRUE age
-0.0495987 0.0000462
post_second_eduTRUE ruralTRUE
-0.0314364 -0.0465320
employmentHouse wife/husband employmentOther
0.0123533 0.0596715
employmentPart time employmentRetired
0.1207598 -0.0130359
employmentSelf employed employmentStudent
-0.0007829 -0.0163621
employmentUnemployed financial_situ
-0.1666486 0.4774211
marriedTRUE relig_import
0.1837895 0.0591540
better_livingBetter off better_livingWorse off
-0.0153946 -0.4330708
no_foodTRUE no_safetyTRUE
-0.1942058 -0.0419340
no_medicalTRUE no_cashTRUE
-0.1346880 0.0389337
no_shelterTRUE
-0.2189660
Do you want to confirm which recipe was used to pre-process the data? You can use extract_recipe()
:
model %>%
  extract_recipe()
── Recipe ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 16
── Training information
Training data contained 65150 data points and no incomplete rows.
── Operations
• Variables removed: iso3c | Trained
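In the same spirit, if you’d rather have the fitted coefficients as a tidy tibble than as a long printout, you can pull the underlying parsnip fit out of the workflow (a quick sketch; tidy() here relies on broom, which ships with tidymodels):
model %>%
  extract_fit_parsnip() %>% # the parsnip model inside the workflow
  tidy() %>%                # one row per coefficient
  head()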
You can use your model to augment()
the data with predictions:
model %>%
  augment(new_data = wvs_test) %>%
  head()
# A tibble: 6 × 19
.pred .resid iso3c satisfaction social_trust male age post_second_edu rural
<dbl> <dbl> <chr> <dbl> <lgl> <lgl> <dbl> <lgl> <lgl>
1 5.59 4.41 AND 10 FALSE FALSE 43 FALSE FALSE
2 5.79 -4.79 AND 1 FALSE TRUE 33 FALSE FALSE
3 7.90 2.10 AND 10 FALSE TRUE 39 FALSE TRUE
4 7.47 -0.471 AND 7 TRUE FALSE 66 TRUE FALSE
5 8.72 0.281 AND 9 FALSE FALSE 48 FALSE FALSE
6 6.81 -0.808 AND 6 FALSE TRUE 60 TRUE FALSE
# ℹ 10 more variables: employment <chr>, financial_situ <dbl>, married <lgl>,
# relig_import <dbl>, better_living <chr>, no_food <lgl>, no_safety <lgl>,
# no_medical <lgl>, no_cash <lgl>, no_shelter <lgl>
Notice that this augments our data frame with .pred
and .resid
columns.
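If you want to convince yourself of what .resid contains, a quick check (a sketch) is to recompute it by hand from columns augment() already gives us:
model %>%
  augment(new_data = wvs_test) %>%
  mutate(resid_by_hand = satisfaction - .pred) %>% # truth minus prediction
  select(satisfaction, .pred, .resid, resid_by_hand) %>%
  head()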
How well did our model perform?
reg_metrics <- metric_set(rsq, rmse, mae, mape)

model %>%
  augment(new_data = wvs_test) %>%
  reg_metrics(truth = satisfaction, estimate = .pred)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.328
2 rmse standard 1.83
3 mae standard 1.37
4 mape standard 32.9
You already know about the \(R^2\) (rsq
) and \(RMSE\) (rmse
) metrics: can you make sense of the other two metrics (mae
for MAE and mape
for MAPE)?
Based on \(R^2\) (a scale-agnostic metric), our model doesn’t seem like a good fit. For the model to be a good fit, you would ideally want \(R^2\) to be closer to 1. For more on \(R^2\), see this page, this one and finally, this one.
RMSE represents the square root of the mean of the squared residuals. The lower the RMSE, the better a given model fits your dataset. For our dataset, the RMSE turns out to be 1.83, which looks relatively low and decent. However, RMSE is much more useful when comparing different models on the same dataset: interpreting a single RMSE measurement is rather tricky, whereas comparing the RMSE of different models on the same dataset can tell you which one is a better fit. For more information on RMSE, see this page and this page.
MAPE (or mean absolute percentage error) for our model is 32.9. This means that, on average, the error between the life satisfaction predicted by the model and the actual satisfaction is 32.9% of the actual value. Do you think this is a low or high number? As usual with these sorts of metrics, their value lies more in comparisons with other models: a model with a lower MAPE than another is likely to be a better fit. For more on MAPE, see this page or this page or this one.
The last metric is MAE (mean absolute error). It simply tells you, on average, how far off the model is from the actual data. The actual formula for this metric is \[\frac{1}{n}\sum_{i=1}^n |y_i-\hat{y}_i|\]
This metric is expressed in the scale of the variable you are predicting (i.e. if you are predicting a salary in GBP, MAE is in GBP). The lower this metric, the better. It treats all error values equally (i.e. it doesn’t emphasize low or high errors). However, you can’t use it to compare models trained on different datasets since it reflects an absolute error. For more on MAE, see this page or this page or this one.
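To make both definitions concrete, here is a sketch that recomputes MAE and MAPE by hand from the test-set predictions; the results should match the yardstick values reported above:
model %>%
  augment(new_data = wvs_test) %>%
  summarise(
    mae_by_hand  = mean(abs(satisfaction - .pred)),
    mape_by_hand = 100 * mean(abs((satisfaction - .pred) / satisfaction))
  )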
(Bonus) See if you can plot the metrics for a few countries of your choice.
Let’s get the metrics per country (for the sake of example, I am choosing the US (country code: USA), the UK (country code: GBR), Morocco (country code: MAR) and the Netherlands (country code: NLD)):
metrics_per_country <- model %>%
  augment(new_data = wvs_test) %>%
  filter(iso3c %in% c('GBR', 'MAR', 'USA', 'NLD')) %>%
  group_by(iso3c) %>%
  reg_metrics(truth = satisfaction, estimate = .pred)

metrics_per_country
# A tibble: 16 × 4
iso3c .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 GBR rsq standard 0.353
2 MAR rsq standard 0.628
3 NLD rsq standard 0.317
4 USA rsq standard 0.312
5 GBR rmse standard 1.51
6 MAR rmse standard 1.54
7 NLD rmse standard 1.13
8 USA rmse standard 1.52
9 GBR mae standard 1.11
10 MAR mae standard 1.19
11 NLD mae standard 0.793
12 USA mae standard 1.13
13 GBR mape standard 21.0
14 MAR mape standard 33.6
15 NLD mape standard 13.3
16 USA mape standard 20.8
Our various metrics have different scales so let’s plot them separately in bar charts.
Let’s start with \(R^2\)
metrics_per_country %>%
  rename(country_code = iso3c, Value = .estimate, Metric = .metric) %>%
  filter(Metric == 'rsq') %>%
  ggplot(aes(Metric, Value, fill = country_code)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = 'Pastel1')
Then RMSE:
metrics_per_country %>%
  rename(country_code = iso3c, Value = .estimate, Metric = .metric) %>%
  filter(Metric == 'rmse') %>%
  ggplot(aes(Metric, Value, fill = country_code)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = 'Pastel1')
Then MAE:
metrics_per_country %>%
  rename(country_code = iso3c, Value = .estimate, Metric = .metric) %>%
  filter(Metric == 'mae') %>%
  ggplot(aes(Metric, Value, fill = country_code)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = 'Pastel1')
And finally MAPE:
metrics_per_country %>%
  rename(country_code = iso3c, Value = .estimate, Metric = .metric) %>%
  filter(Metric == 'mape') %>%
  ggplot(aes(Metric, Value, fill = country_code)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = 'Pastel1')
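By the way, instead of four separate charts, you could also plot all four metrics at once with facets (a sketch; free y scales because the metrics live on different scales):
metrics_per_country %>%
  ggplot(aes(iso3c, .estimate, fill = iso3c)) +
  geom_col() +
  facet_wrap(~ .metric, scales = "free_y") +
  scale_fill_brewer(palette = 'Pastel1') +
  labs(x = "Country code", y = "Value", fill = "Country code")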
If we’d stopped at \(R^2\), we would have concluded the model performs best for Morocco. However, the other metrics, particularly MAPE, paint a different picture: errors are, on average, smaller for the Netherlands. The picture for Morocco diverges so much between \(R^2\) and MAPE that it might be worth having a look at the residual plot here to ascertain the underlying picture. Did we stumble upon any of the limitations of these two metrics in this particular case?
Let’s look at the residuals plot for our model!
model %>%
  augment(new_data = wvs_test) %>%
  ggplot(aes(.pred, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw() +
  labs(x = "Predicted life satisfaction", y = "Residuals")
It’s the same residuals plot you saw in the W03 lab! No surprises there since we fit the same model 😉
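As a follow-up to the Morocco question above, one way (a sketch) to look at country-level residuals is to facet the same plot by country:
model %>%
  augment(new_data = wvs_test) %>%
  filter(iso3c %in% c('GBR', 'MAR', 'USA', 'NLD')) %>%
  ggplot(aes(.pred, .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  facet_wrap(~ iso3c) +
  theme_bw() +
  labs(x = "Predicted life satisfaction", y = "Residuals")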
Part III - Practice with recipes and workflows
- Consider the following simulated data frame.
set.seed(123)
df_sim <-
  tibble(outcome = rnorm(100, 0, 1),
         feature1 = rnorm(100, 0, 1),
         feature2 = rnorm(100, 0, 20))
Let’s have a quick look at the first lines of df_sim
before we proceed:
df_sim %>% head(10) %>% knitr::kable()
| outcome| feature1| feature2|
|----------:|----------:|----------:|
| -0.5604756| -0.7104066| 43.976207|
| -0.2301775| 0.2568837| 26.248259|
| 1.5587083| -0.2466919| -5.302901|
| 0.0705084| -0.3475426| 10.863881|
| 0.1292877| -0.9516186| -8.286799|
| 1.7150650| -0.0450277| -9.524938|
| 0.4609162| -0.7849045| -15.772057|
| -1.2650612| -1.6679419| -11.892345|
| -0.6868529| -0.3802265| 33.018149|
| -0.4456620| 0.9189966| -1.080563|
We have created a data set with one outcome (creatively labeled outcome
) and two features, feature1
and feature2
, where one feature has been assigned a standard deviation 20 times the size of the other. In some cases, like linear regression, it will be okay to use outcome ~ feature1 + feature2
as a formula. However, with other models, variables with larger variation will play a disproportionate role. As a result, we can use standardization (a.k.a. normalization) to counter this issue. This involves subtracting the mean of the variable and dividing the result by the standard deviation. Create a recipe for this data frame using a step from the tidymodels
manual, and make use of graphs to demonstrate that the standardization process “worked”.
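In symbols, standardizing a variable \(x\) replaces each value \(x_i\) with \[z_i = \frac{x_i - \bar{x}}{s_x}\] where \(\bar{x}\) is the sample mean and \(s_x\) the sample standard deviation of \(x\).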
# Let's look at the distributions of each feature overlaid!
df_sim %>%
  pivot_longer(-outcome, names_to = 'feature') %>%
  ggplot(aes(value, fill = str_remove(feature, "feature"))) +
  geom_histogram(alpha = 0.75) +
  theme_bw() +
  labs(x = "Feature value", y = "Frequency", fill = "Feature")
# Now let's normalize feature2 using step_normalize
df_normalized <-
  df_sim %>%
  recipe(outcome ~ ., data = .) %>%
  step_normalize(feature2) %>%
  prep() %>%
  bake(new_data = NULL)
# Now let's plot the distributions of the features
df_normalized %>%
  pivot_longer(-outcome, names_to = 'feature') %>%
  ggplot(aes(value, fill = str_remove(feature, "feature"))) +
  geom_histogram(alpha = 0.75) +
  theme_bw() +
  labs(x = "Feature value", y = "Frequency", fill = "Feature")
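Besides graphs, a quick numeric check (a sketch) also shows the normalization “worked”: feature2 should now have mean ≈ 0 and standard deviation ≈ 1, while feature1 is untouched:
df_normalized %>%
  summarise(across(starts_with("feature"),
                   list(mean = mean, sd = sd)))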
- Some models will only accept numeric input, meaning that variables like
employment
or better_living
need to be converted into several dummy variables. Using the tidymodels
manual, can you find a step that can help us achieve this end?
recipe2 <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  step_dummy(employment, better_living)
I can stop here if all I am doing is using the recipe for modeling (which is what we’ll do in Question 3). To be able to visualise the result of my pre-processing and check I’ve performed all the correct steps, here is an extra piece of code:
recipe2 %>%
  prep() %>%
  bake(wvs_train)
# A tibble: 65,150 × 23
social_trust male age post_second_edu rural financial_situ married
<lgl> <lgl> <dbl> <lgl> <lgl> <dbl> <lgl>
1 FALSE FALSE 65 FALSE TRUE 8 FALSE
2 FALSE FALSE 47 FALSE FALSE 10 TRUE
3 FALSE TRUE 63 FALSE FALSE 5 FALSE
4 FALSE FALSE 46 TRUE FALSE 4 FALSE
5 TRUE TRUE 52 TRUE FALSE 4 TRUE
6 TRUE FALSE 59 FALSE FALSE 8 FALSE
7 FALSE TRUE 80 FALSE FALSE 5 FALSE
8 FALSE TRUE 69 FALSE TRUE 5 FALSE
9 FALSE FALSE 36 FALSE TRUE 5 TRUE
10 FALSE TRUE 22 FALSE TRUE 9 TRUE
# ℹ 65,140 more rows
# ℹ 16 more variables: relig_import <dbl>, no_food <lgl>, no_safety <lgl>,
# no_medical <lgl>, no_cash <lgl>, no_shelter <lgl>, satisfaction <dbl>,
# employment_House.wife.husband <dbl>, employment_Other <dbl>,
# employment_Part.time <dbl>, employment_Retired <dbl>,
# employment_Self.employed <dbl>, employment_Student <dbl>,
# employment_Unemployed <dbl>, better_living_Better.off <dbl>, …
# ℹ Use `print(n = ...)` to see more rows
If I wanted to view my results even better, I could do this (try this code in your console - don’t render!):
baked_recipe <- recipe2 %>%
  prep() %>%
  bake(wvs_train)

View(baked_recipe)
- Can you tweak the model we wrote in Part I to include the dummy step from Question 2? Evaluate the performance of this new model (first on the training then the test set).
The model specification from Part I doesn’t change: we’re still using a linear regression!
# Create the specification of a model but don't fit it yet
lm_model <-
  linear_reg() %>%
  set_engine("lm")
We tweak our workflow slightly to use the recipe recipe2
that we’ve just defined in Question 2 (and that uses the dummy step!).
# Create a workflow to add the recipe and the model
wflow <-
  workflow() %>%
  add_recipe(recipe2) %>%
  add_model(lm_model)
print(wflow)
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_rm()
• step_dummy()
── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Then, you can fit the workflow to the data:
model <-
  wflow %>%
  fit(data = wvs_train)

model
══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_rm()
• step_dummy()
── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) social_trustTRUE
3.7784484 0.0797187
maleTRUE age
-0.0495987 0.0000462
post_second_eduTRUE ruralTRUE
-0.0314364 -0.0465320
financial_situ marriedTRUE
0.4774211 0.1837895
relig_import no_foodTRUE
0.0591540 -0.1942058
no_safetyTRUE no_medicalTRUE
-0.0419340 -0.1346880
no_cashTRUE no_shelterTRUE
0.0389337 -0.2189660
employment_House.wife.husband employment_Other
0.0123533 0.0596715
employment_Part.time employment_Retired
0.1207598 -0.0130359
employment_Self.employed employment_Student
-0.0007829 -0.0163621
employment_Unemployed better_living_Better.off
-0.1666486 -0.0153946
better_living_Worse.off
-0.4330708
After that, you can use your model to augment()
the data with predictions and compute metrics, just as you did in Part II:
model %>%
  augment(new_data = wvs_test) %>%
  head()
# A tibble: 6 × 19
.pred .resid iso3c satisfaction social_trust male age post_second_edu rural
<dbl> <dbl> <chr> <dbl> <lgl> <lgl> <dbl> <lgl> <lgl>
1 5.59 4.41 AND 10 FALSE FALSE 43 FALSE FALSE
2 5.79 -4.79 AND 1 FALSE TRUE 33 FALSE FALSE
3 7.90 2.10 AND 10 FALSE TRUE 39 FALSE TRUE
4 7.47 -0.471 AND 7 TRUE FALSE 66 TRUE FALSE
5 8.72 0.281 AND 9 FALSE FALSE 48 FALSE FALSE
6 6.81 -0.808 AND 6 FALSE TRUE 60 TRUE FALSE
# ℹ 10 more variables: employment <chr>, financial_situ <dbl>, married <lgl>,
# relig_import <dbl>, better_living <chr>, no_food <lgl>, no_safety <lgl>,
# no_medical <lgl>, no_cash <lgl>, no_shelter <lgl>
reg_metrics <- metric_set(rsq, rmse, mae, mape)

model %>%
  augment(new_data = wvs_test) %>%
  reg_metrics(truth = satisfaction, estimate = .pred)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.328
2 rmse standard 1.83
3 mae standard 1.37
4 mape standard 32.9
How well did our model perform?
The dummy steps don’t seem to have made a difference to our model’s performance. This makes sense: R’s lm() already converts character and factor predictors into dummy variables behind the scenes, so our recipe has simply made that step explicit.
- Write a function called
plot_residuals()
that takes a fitted workflow model and a data frame in its original form (that is, before any pre-processing) and plots the residuals against the fitted values. The function should return a ggplot object.
plot_residuals <- function(wflow_model, data) {
  plot_df <-
    wflow_model %>% # This will be the model fitted with the
                    # workflow, not the workflow itself!
    augment(new_data = data)

  g <-
    ggplot(plot_df, aes(x = .pred, y = .resid)) +
    geom_point(alpha = 0.75) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_bw() +
    labs(x = "Predicted life satisfaction", y = "Residuals")

  return(g)
}
Using the plot_residuals
function you’ve just created, can you create:
- one residuals against the fitted values plot for the training data used in Question 3
- and another, separate residuals against the fitted values plot for the test data used in Question 3?
#(a)
plot_residuals(model, wvs_train)
This is the residual plot we plotted previously in Part II!
#(b)
plot_residuals(model, wvs_test)
- (Bonus) Can you train a LASSO model on the data and evaluate its performance using recipes and workflows? How does this model compare to linear regression?
Let’s modify recipe2
slightly to make sure that all variables within our dataset are numeric.
recipe3 <-
  recipe(satisfaction ~ ., data = wvs) %>%
  step_rm(iso3c) %>%
  step_dummy(employment, better_living) %>%
  step_mutate(across(c(social_trust, male, post_second_edu, rural, married,
                       no_food, no_safety, no_medical, no_cash, no_shelter),
                     as.numeric))
Now, we need to modify our model specification, since we are no longer fitting a plain linear regression but a LASSO model.
lasso_model <-
  linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")
We add recipe and model specification to a workflow just as before:
wflow_lasso <-
  workflow() %>%
  add_recipe(recipe3) %>%
  add_model(lasso_model)

print(wflow_lasso)
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_rm()
• step_dummy()
• step_mutate()
── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 0.01
mixture = 1
Computational engine: glmnet
model_lasso <-
  wflow_lasso %>%
  fit(data = wvs_train)

model_lasso
══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_rm()
• step_dummy()
• step_mutate()
── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian", alpha = ~1)
Df %Dev Lambda
1 0 0.00 1.23000
2 1 5.28 1.12000
3 1 9.66 1.02100
4 1 13.30 0.93010
5 1 16.33 0.84750
6 1 18.83 0.77220
7 1 20.92 0.70360
8 1 22.65 0.64110
9 1 24.08 0.58420
10 1 25.27 0.53230
11 1 26.26 0.48500
12 1 27.08 0.44190
13 1 27.77 0.40260
14 1 28.33 0.36690
15 1 28.80 0.33430
16 1 29.19 0.30460
17 1 29.52 0.27750
18 1 29.78 0.25290
19 1 30.01 0.23040
20 2 30.28 0.20990
21 2 30.52 0.19130
22 2 30.73 0.17430
23 2 30.90 0.15880
24 3 31.10 0.14470
25 3 31.29 0.13180
26 3 31.46 0.12010
27 4 31.63 0.10950
28 5 31.78 0.09974
29 7 31.93 0.09088
30 7 32.08 0.08280
31 7 32.20 0.07545
32 7 32.30 0.06874
33 7 32.38 0.06264
34 8 32.45 0.05707
35 8 32.52 0.05200
36 8 32.57 0.04738
37 8 32.62 0.04317
38 8 32.65 0.03934
39 9 32.68 0.03584
40 9 32.71 0.03266
41 9 32.74 0.02976
42 11 32.76 0.02711
43 11 32.78 0.02471
44 12 32.80 0.02251
45 12 32.82 0.02051
46 12 32.83 0.01869
...
and 26 more lines.
After that, you can use your model to augment()
the data with predictions and compute metrics just as you did before with the linear model:
model_lasso %>%
  augment(new_data = wvs_test) %>%
  head()
.pred iso3c satisfaction social_trust male age post_second_edu rural
<dbl> <chr> <dbl> <lgl> <lgl> <dbl> <lgl> <lgl>
1 5.55 AND 10 FALSE FALSE 43 FALSE FALSE
2 5.74 AND 1 FALSE TRUE 33 FALSE FALSE
3 7.94 AND 10 FALSE TRUE 39 FALSE TRUE
4 7.45 AND 7 TRUE FALSE 66 TRUE FALSE
5 8.68 AND 9 FALSE FALSE 48 FALSE FALSE
6 6.84 AND 6 FALSE TRUE 60 TRUE FALSE
# ℹ 10 more variables: employment <chr>, financial_situ <dbl>, married <lgl>,
# relig_import <dbl>, better_living <chr>, no_food <lgl>, no_safety <lgl>,
# no_medical <lgl>, no_cash <lgl>, no_shelter <lgl>
reg_metrics <- metric_set(rsq, rmse, mae, mape)

model_lasso %>%
  augment(new_data = wvs_test) %>%
  reg_metrics(truth = satisfaction, estimate = .pred)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.328
2 rmse standard 1.83
3 mae standard 1.38
4 mape standard 33.0
How well did our model perform?
Overall, our metrics (in particular MAE and MAPE) show LASSO performing slightly worse than plain linear regression on this particular task.
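If you want to see what the LASSO penalty actually did to the coefficients (and whether any were shrunk all the way to zero), a sketch using the same extraction helpers as before:
model_lasso %>%
  extract_fit_parsnip() %>% # the glmnet fit inside the workflow
  tidy() %>%                # coefficients at penalty = 0.01
  arrange(abs(estimate)) %>% # smallest (most shrunken) first
  head()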