✨ Intro to tidymodels recipes

DS202 - Data Science for Social Scientists

Author

Jon Cardoso-Silva

Published

27 November 2022

The following didn’t make it into this week’s lab roadmap; we would easily have run out of time. So, I suggest you include it in your studies this week!

Recipes!

We could do PCA directly in base R using a function called prcomp, but since we have decided to rely more on the tidyverse and related packages, we will teach you how to do PCA with tidymodels.

What are recipes?

Among the packages included with tidymodels, there is one called recipes. This package lets us indicate which role each column of our data frame plays in our supervised or unsupervised models. It also lets us reuse the same formula elsewhere, for example in cross-validation. Check out the Introduction to recipes tutorial to learn more.

Let’s look at the supervised model case first (even though today is about an unsupervised technique).

A recipe for supervised models

Remember the Smarket dataset from W08? There, we used Volume and Lag1 to predict Today, and we represented this with an R formula like:

Today ~ Volume + Lag1

or, when the data frame contains only these three columns, simply:

Today ~ .

That is, the target variable comes first, followed by the ~ symbol and then the variables we regress it on. The . stands for all the other columns in the data.

To represent this same idea using a recipe, we simply do this:

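library(tidymodels)  # attaches recipes and dplyr (for select() and %>%)

# Keep only the outcome and the two predictors, then declare the roles
recipe_obj <- recipe(Today ~ .,
                     data = ISLR2::Smarket %>% select(Today, Volume, Lag1))

print(recipe_obj)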

🎯 ACTION POINT: What are the variables listed under each role in the recipe object above?

Step Change roles

You might also remember that the Smarket dataset had more than just the variables above:

ISLR2::Smarket %>% colnames()

[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
[7] "Volume"    "Today"     "Direction"

Remember that the variables Today and Direction are redundant: Direction only records whether the return in Today was positive (Up) or negative (Down), so the two represent almost the same thing. If I were to use all variables to predict Today, I would have to discard Direction.

🎯 ACTION POINT: How would you write a recipe to predict Today using all variables in Smarket except Direction?
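One possible answer, shown here only as a sketch, is to drop Direction from the data frame before passing it to recipe(). The object names smarket_no_direction and recipe_no_direction are made up for this illustration.

```r
library(tidymodels)  # attaches recipes and dplyr (for select() and %>%)

# Drop Direction before building the recipe, so that every remaining
# column other than Today is treated as a predictor
smarket_no_direction <- ISLR2::Smarket %>% select(-Direction)

recipe_no_direction <- recipe(Today ~ ., data = smarket_no_direction)

# summary() returns a tibble with one row per variable and its role
summary(recipe_no_direction)
```

With this approach, Direction is not part of the recipe at all, so it cannot be recovered later from the recipe object.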

Function `update_role()`

From recipe’s documentation:

roles define how variables will be used in the model. Examples are: predictor (independent variables), response, and case weight. This is meant to be open-ended and extensible.

We can use update_role() to change the role of a column, and we can give this role essentially any name we want. See, for example, an alternative solution to the problem of redundant variables in Smarket:

recipe(Today ~ ., data = ISLR2::Smarket) %>%
  update_role(Direction, new_role = "redundant")

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 7
redundant: 1

We marked Direction with new_role = "redundant". Since this is not one of the standard roles (“predictor”, “outcome”, etc.), this variable will not be treated as a predictor or as an outcome.
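To check which role each column ended up with, one option is to call summary() on the recipe. The sketch below rebuilds the recipe from above under the made-up name recipe_with_roles and lists the columns that are not predictors.

```r
library(tidymodels)  # attaches recipes and dplyr

recipe_with_roles <- recipe(Today ~ ., data = ISLR2::Smarket) %>%
  update_role(Direction, new_role = "redundant")

# One row per column, including its name and its role
role_table <- summary(recipe_with_roles)

# Show only the columns whose role is something other than "predictor"
role_table %>% filter(role != "predictor")
```

Unlike dropping the column, this keeps Direction inside the recipe while selectors such as all_predictors() skip it.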

Unsupervised case

Let’s go back to our dataset. We don’t have a \(\mathbf{Y}\) variable to predict; we care only about the features/predictors, \(\mathbf{X}\). How would we represent a recipe for an unsupervised model?

It’s simple: we leave everything before ~ empty in our R formula representation:

~ <var1> + <var2> + ...

For a dataset called df_preprocessed:

recipe(~ ., data = df_preprocessed)

Does the above make sense to you? Do you see why we don’t have an outcome role?

🎯 ACTION POINT: Write a recipe for df_preprocessed and then change the role of period and country_name to role="id":
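As a rough sketch of what this could look like: df_preprocessed is not defined in this section, so the example below uses a small made-up tibble called toy_df. Only the column names period and country_name come from the text; feature_a, feature_b and all the values are hypothetical.

```r
library(tidymodels)  # attaches recipes, tibble and dplyr

# Hypothetical stand-in for df_preprocessed: only period and country_name
# come from the text; feature_a and feature_b are made up for illustration
toy_df <- tibble(
  period       = c(2019, 2019, 2020, 2020),
  country_name = c("A", "B", "A", "B"),
  feature_a    = c(0.1, 0.4, 0.3, 0.9),
  feature_b    = c(10, 12, 9, 15)
)

# Nothing on the left-hand side of ~, so every column starts as a predictor;
# update_role() then moves the two identifier columns to the "id" role
unsupervised_recipe <- recipe(~ ., data = toy_df) %>%
  update_role(period, country_name, new_role = "id")

summary(unsupervised_recipe)
```

The same two lines should carry over to df_preprocessed, assuming it contains columns named period and country_name.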