✨ Intro to tidymodels recipes
DS202 - Data Science for Social Scientists
The following didn't make it into this week's lab roadmap; we would easily run out of time. So, I suggest you include it in your studies this week!
Recipes!
We could do PCA directly in base R using a function called prcomp, but since we have decided to rely more on tidyverse and related packages, we will teach you how to do PCA with tidymodels.
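For reference, the base R route mentioned above might look roughly like the sketch below. It uses USArrests, a dataset that ships with R and is not part of this lab; it is chosen only so the example runs on its own. The argument `scale. = TRUE` standardises each column before the decomposition.

```r
# Base R PCA on a built-in dataset (USArrests is used purely for illustration)
pca_base <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# One row per observation, one column per principal component
head(pca_base$x)

# Proportion of variance explained by each component
summary(pca_base)$importance["Proportion of Variance", ]
```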
What are recipes?
Among the packages included with tidymodels, there is one called recipes. This package lets us indicate the role that each column of our data frame plays in our supervised or unsupervised models. Not only that, it lets us reuse the same formula elsewhere, for example in cross-validation. Check out the Introduction to recipes tutorial to learn more.
Let's look at the supervised model case first (even though today is about an unsupervised technique).
A recipe for supervised models
Remember the Smarket dataset from W08? There, we used Volume and Lag1 to predict Today, and we always represented this with an R formula like:
Today ~ Volume + Lag1
or, simply:
Today ~ .
That is, the target variable comes first, then we have the ~ symbol to represent what we should regress it on.
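To make the formula notation concrete, here is a small base R sketch. The data frame `toy_df` is a hypothetical stand-in with made-up values (it is not the real Smarket data); it only serves to show what the dot expands to.

```r
# A formula is an ordinary R object: target on the left of ~, predictors on the right
f <- Today ~ Volume + Lag1
all.vars(f)   # every variable mentioned in the formula

# The dot means "every other column in the data".
# toy_df is a hypothetical stand-in with made-up values, not the real Smarket data.
toy_df <- data.frame(Today  = c(0.5, -0.2, 0.1),
                     Volume = c(1.2, 1.3, 1.1),
                     Lag1   = c(0.3, 0.5, -0.2))
expanded <- terms(Today ~ ., data = toy_df)
attr(expanded, "term.labels")   # the predictors that the dot expands to
```

Note that the two formulas only coincide when the data contains no columns other than Today, Volume and Lag1.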
To represent this same idea using recipe, we simply do this:
library(tidymodels)  # attaches recipes and dplyr
recipe_obj <- recipe(Today ~ .,
                     data = ISLR2::Smarket %>% select(Today, Volume, Lag1))
print(recipe_obj)
🎯 ACTION POINT: What are the variables listed under each role in the recipe object above?
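If you want to check your answer programmatically, one option is `summary()`, which returns a table with one row per variable. The sketch below assumes the recipes, dplyr and ISLR2 packages are installed and rebuilds the same recipe so that it runs on its own.

```r
library(recipes)  # part of tidymodels
library(dplyr)    # for select() and %>%

recipe_obj <- recipe(Today ~ .,
                     data = ISLR2::Smarket %>% select(Today, Volume, Lag1))

# summary() returns a tibble with one row per variable and its assigned role
role_table <- summary(recipe_obj)
role_table %>% select(variable, role)
```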
Step Change roles
You might also remember that the Smarket dataset had more than just the variables above:
ISLR2::Smarket %>% colnames()
[1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
[7] "Volume" "Today" "Direction"
Note that the variables Today and Direction are redundant: they represent almost the same thing, since Direction only records whether the market went Up or Down that day. If I were to use all variables to predict the variable Today, I would have to discard Direction.
🎯 ACTION POINT: How would you write a recipe to predict Today using all variables of the Smarket dataset, except Direction?
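Try it yourself first. One possible answer, offered as a sketch rather than the official solution, is to drop the column before the data reaches the recipe. It assumes the recipes, dplyr and ISLR2 packages are installed.

```r
library(recipes)
library(dplyr)

# Option: remove Direction before the data reaches the recipe
recipe_no_direction <- recipe(Today ~ .,
                              data = ISLR2::Smarket %>% select(-Direction))

# Count how many variables were assigned to each role
summary(recipe_no_direction) %>% count(role)
```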
Function update_role()
From recipe's documentation:
roles define how variables will be used in the model. Examples are: predictor (independent variables), response, and case weight. This is meant to be open-ended and extensible.
We can use update_role() to change the role of a column, and we can name this role basically whatever we want. See, for example, an alternative solution to the problem of redundant variables in Smarket:
recipe(Today ~ ., data = ISLR2::Smarket) %>% update_role(Direction, new_role = "redundant")
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 7
redundant: 1
We marked Direction with new_role = "redundant", and since this is not a standard role ("predictor", "outcome", etc.), this variable will not be used in any algorithm.
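A small sketch of what this means in practice: selectors such as `all_predictors()` only pick up columns whose role is "predictor", so a column with a custom role is skipped. The example assumes the recipes and ISLR2 packages are installed; `prep()` and `bake()` are used here only to materialise the data the recipe would hand over, since this recipe has no steps.

```r
library(recipes)

rec <- recipe(Today ~ ., data = ISLR2::Smarket) %>%
  update_role(Direction, new_role = "redundant")

# Ask the prepared recipe for the predictor columns only
predictors_only <- bake(prep(rec), new_data = NULL, all_predictors())

# Direction (role "redundant") and Today (role "outcome") are not among them
names(predictors_only)
```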
Unsupervised case
Let's go back to our dataset. We don't have a \(\mathbf{Y}\) variable to predict; we care only about the features/predictors, \(\mathbf{X}\). How would we represent a recipe for an unsupervised model?
It's simple. We leave everything before ~ empty in our R formula representation:
~ <var1> + <var2> + ...
For a dataset called df_preprocessed:
recipe(~ ., data = df_preprocessed)
Does the above make sense to you? Do you see why we don't have an outcome role?
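To see this on data you can run right away, the sketch below applies the same one-sided formula to USArrests, a dataset that ships with R and is used here only as a stand-in for df_preprocessed. It assumes the recipes package is installed.

```r
library(recipes)

# One-sided formula: nothing on the left of ~, so no outcome is declared
unsup_recipe <- recipe(~ ., data = USArrests)

# Every column ends up with the role "predictor"
unsup_roles <- summary(unsup_recipe)
unsup_roles[, c("variable", "role")]
```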
🎯 ACTION POINT: Write a recipe for df_preprocessed and then change the role of period and country_name to role="id":
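One possible shape for the answer is sketched below on hypothetical data. The data frame `toy_countries` is a stand-in for df_preprocessed: only the period and country_name columns come from the text above, while `indicator_a` and `indicator_b` and all values are invented. The `step_normalize()` step is added only to show that predictor selectors leave id columns untouched.

```r
library(recipes)

# Hypothetical stand-in for df_preprocessed: the two id columns come from the
# text above; indicator_a and indicator_b are invented for this sketch
toy_countries <- data.frame(
  period       = c(2019, 2020, 2019, 2020),
  country_name = c("A", "A", "B", "B"),
  indicator_a  = c(1.0, 2.0, 3.0, 4.0),
  indicator_b  = c(10, 30, 20, 40)
)

id_recipe <- recipe(~ ., data = toy_countries) %>%
  update_role(period, country_name, new_role = "id") %>%
  step_normalize(all_numeric_predictors())

baked <- bake(prep(id_recipe), new_data = NULL)

# period keeps its original values because it is no longer a predictor,
# while the indicator columns are centred and scaled
baked$period
```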