🛣️ Week 07 - Lab Roadmap (90 min)
Decision trees with tidymodels
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Apply a decision tree model to a real-world classification problem
- Use cross-validation to build a decision tree model

When you reach Part III of this lab, read the specific instructions for GENIAL participants.
📚 Preparation
This week, we no longer use the UK House Prices dataset for our lab; instead, we move to a new dataset adapted from Wave 7 of the World Values Survey (Haerpfer et al. 2022). Use the link below to download this new dataset:
Put this new file in the data folder within your DS202A project folder.
Then use the link below to download the lab file:
We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.
📋 Lab Tasks
Here are the instructions for this lab:
Import required libraries:
# Tidyverse packages we will use
library(dplyr)
library(tidyr)
library(readr)
# Tidymodels packages (plus the rpart engine) we will use
library(rsample)
library(yardstick)
library(parsnip)
library(recipes)
library(workflows)
library(rpart)
library(tune)
Read the data set:
It is the brand-new dataset you’ve downloaded.
# Modify the filepath if needed
<- "data/WVS_Wave7_modified.csv"
filepath <- read_csv(filepath) wvs_data
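If you want a quick feel for the new dataset before going further, one option (not required by the lab) is to print its dimensions and column types:
# Optional: a quick overview of the number of rows/columns and the column types
dplyr::glimpse(wvs_data)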
Part I - Meet a new data set (20 min)
We’ve been doing a lot of training vs testing splits in this course, primarily by hand and focused on a particular point in time (e.g., test samples are defined to start in 2019).
But not all splits need to be done this way. Some datasets don't have a time-series component and can be split randomly for testing purposes. To make things more robust, we don't just use one single train-versus-test split; instead, we use a technique called cross-validation and split the data into multiple train/test splits.
In this lab, we'll learn how to use cross-validation to build a model that is more robust to changes in the data. Let's kick things off by cleaning up our data and randomly splitting it into a training set (70% of data points/rows) and a test set (30% of data points/rows).
🧑🏫 TEACHING MOMENT:
(Your class teacher will guide you through this section. Just run all the code chunks below together with your class teacher.)
Our goal in this lab is to predict someone's trust in institutions (i.e., police, army, justice courts, press and television, labour unions, civil services, political parties, parliament, and government), as represented in our dataset by the variable wvs_data["TRUST_INSTITUTIONS"].
Here is how we obtain this variable:
Average trust: We take the average of several columns, each of which represents inverse trust (i.e., mistrust) towards a particular institution:
- I_TRUSTARMY (Q65 in the original data set; trust in the army)
- I_TRUSTPRESS (Q66; trust in the press)
- I_TRUSTTELEVISION (Q67; trust in television)
- I_TRUSTUNIONS (Q68; trust in labour unions)
- I_TRUSTPOLICE (Q69; trust in the police)
- I_TRUSTCOURTS (Q70; trust in the justice system/courts)
- I_TRUSTGVT (Q71; trust in the government)
- I_TRUSTPARTIES (Q72; trust in political parties)
- I_TRUSTPARLIAMENT (Q73; trust in parliament)
- I_TRUSTCIVILSERVICES (Q74; trust in the civil services)
These variables are not part of the original data set but were precomputed from it to make it easier to compute trust in institutions. The way we calculated these variables is as follows (a short illustrative sketch of this recoding appears after the steps below):
- if the response to the question is 1, the corresponding variable is assigned a value of 0;
- if it is 2, the corresponding variable is assigned a value of 0.33;
- if it is 3, the corresponding variable is assigned a value of 0.66;
- if it is 4, the corresponding variable is assigned a value of 1;
- otherwise (i.e., if the response is either -1, -2, -4, or -5), the value of the corresponding variable is treated as missing.
We store the result in the MEAN_I_TRUST_INSTITUTIONS column.
Convert from mistrust to trust for ease of interpretability: We subtract the average obtained (i.e., the content of the MEAN_I_TRUST_INSTITUTIONS column) from 1.
Create a binary target variable: We assign a value of 1 if the value obtained after step 2 is higher than 0.5, thus representing trust, and 0 otherwise. We store the result in the TRUST_INSTITUTIONS column.
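Here is a minimal sketch of what that recoding looks like for a single question, purely for illustration: the I_TRUST_* columns already exist in the lab dataset, and wvs_raw below is a hypothetical data frame holding the raw survey responses.
# Illustrative only: the lab dataset already contains the precomputed I_TRUST_* columns.
# `wvs_raw` is a hypothetical data frame with the raw 1-4 responses (negative codes = missing).
wvs_raw %>%
  mutate(I_TRUSTPOLICE = case_when(
    Q69 == 1 ~ 0,
    Q69 == 2 ~ 0.33,
    Q69 == 3 ~ 0.66,
    Q69 == 4 ~ 1,
    TRUE     ~ NA_real_   # -1, -2, -4 or -5: treated as missing
  ))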
🎯 ACTION POINTS:
- Create the TRUST_INSTITUTIONS column. Open a new chunk of code in your notebook and type the following code:
wvs_data <-
  wvs_data %>%
  rowwise() %>%
  mutate(MEAN_I_TRUST_INSTITUTIONS = mean(c(I_TRUSTARMY, I_TRUSTCIVILSERVICES, I_TRUSTPOLICE, I_TRUSTCOURTS,
                                            I_TRUSTPRESS, I_TRUSTTELEVISION, I_TRUSTUNIONS, I_TRUSTGVT,
                                            I_TRUSTPARTIES, I_TRUSTPARLIAMENT), na.rm = TRUE),
         TRUST_INSTITUTIONS = (1 - MEAN_I_TRUST_INSTITUTIONS) > 0.5,
         TRUST_INSTITUTIONS = factor(TRUST_INSTITUTIONS,
                                     labels = c("No", "Yes"),
                                     levels = c(FALSE, TRUE))) %>%
  ungroup()
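As a quick sanity check (not part of the original instructions), you can look at how balanced the new outcome variable is:
# Optional sanity check: how many respondents fall into each class of the outcome?
wvs_data %>%
  count(TRUST_INSTITUTIONS)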
- Remove the columns we used to compute the TRUST_INSTITUTIONS column. Since we used the I_TRUST_* variables above to compute our outcome variable, these variables should not be part of our model! Neither should you use the variables that were used to precompute the I_TRUST_* variables: Q67, Q68, Q69, Q70, Q71, Q72, Q73 or Q74. We only left those variables in the initial dataset for completeness and to explain how trust in institutions was defined and obtained. From here on, we'll operate with this filtered data:
<- c("I_TRUSTARMY", "I_TRUSTCIVILSERVICES", "I_TRUSTPOLICE", "I_TRUSTCOURTS",
cols_to_remove "I_TRUSTPRESS", "I_TRUSTTELEVISION", "I_TRUSTUNIONS", "I_TRUSTGVT",
"I_TRUSTPARTIES", "I_TRUSTPARLIAMENT", "MEAN_I_TRUST_INSTITUTIONS",
"D_INTERVIEW", "W_WEIGHT", "S018", "Q_MODE", "K_DURATION",
"Q65", "Q67", "Q68", "Q69", "Q70", "Q71", "Q72", "Q73", "Q74",
"Q275","Q276","Q277","Q278","Q275A","Q276A","Q277A","Q278A")
#filter data to remove unnecessary columns
<-
wvs_data %>%
wvs_data select(-all_of(cols_to_remove))
- Now, let’s randomly split our dataset into a training set (containing 70% of the rows in our data) and a test set (including 30% of the rows in our data)
# Randomly split the initial data frame into training and testing sets (70% and 30% of rows, respectively)
split <- initial_split(wvs_data, prop = 0.7)
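A small optional addition (not in the original lab code): initial_split() samples rows at random, so if you want exactly the same split every time you rerun your notebook, set a seed just before it. The seed value below is arbitrary.
# Optional: make the random split reproducible by fixing the seed first
set.seed(202)
split <- initial_split(wvs_data, prop = 0.7)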
What is in the training and testing sets?
To get the actual data assigned to either set, use the rsample::training() and rsample::testing() functions:
training_data <- training(split)
testing_data  <- testing(split)
Out of curiosity, let's confirm which unique countries are represented in our training and test sets and by how many data points. We can uncover this mystery by counting the unique values in the COUNTRY_NAME column.
# tallying the number of rows per country in the training set
training_data %>%
  group_by(COUNTRY_NAME) %>%
  tally()

# tallying the number of rows per country in the test set
testing_data %>%
  group_by(COUNTRY_NAME) %>%
  tally()
Just how many non-empty records are there in our datasets per country?
# tallying the number of non-empty rows per country in the training set
training_data %>%
  drop_na() %>%
  group_by(COUNTRY_NAME) %>%
  tally() %>%
  knitr::kable()

# tallying the number of non-empty rows per country in the test set
testing_data %>%
  drop_na() %>%
  group_by(COUNTRY_NAME) %>%
  tally() %>%
  knitr::kable()
Note that missing values are encoded in more than one way in this dataset: in responses to survey questions, negative values also denote missing values (e.g., the respondent didn't know how to answer the question or did not respond to it, the question was not asked, or the response was missing).
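If you are curious how widespread these negative codes are, one way (not part of the lab instructions) to count them per numeric column is:
# Count, for each numeric column, how many responses use a negative "missing" code
wvs_data %>%
  summarise(across(where(is.numeric), ~ sum(.x < 0, na.rm = TRUE)))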
To browse the full dataset documentation, including a description of the variables (though some were recoded for ease of comprehension), you can download the zip archive below:
🗣️ DISCUSSION:
We defined trust in institutions in a specific way, and we will be building a model to predict trust in institutions in a way that fits this definition. Do you think that the definition we gave of trust in institutions is satisfactory? Is it a good modelling objective? If not, what would you do differently? (You can look at the dataset documentation for further ideas).
Part II - Introduction to decision trees (40 min)
🧑🏻🏫 TEACHING MOMENT: In this part, you'll be learning about a new classification model called a decision tree, which is better suited than logistic regression to handling large numbers of features of varying types that potentially contain missing data.
Our dataset is a real-world dataset and ticks all these boxes:
- We mentioned earlier that our goal is to build a model to predict TRUST_INSTITUTIONS. Aside from a few variables that don't look like likely predictors (e.g. S018 and W_WEIGHT, which are two survey-related weightings, K_DURATION, i.e. interview duration, and Q_MODE, i.e. interview type) and variables whose values have not been standardised across countries (e.g. Q275, Q276, Q277, Q278, Q275A, Q276A, Q277A, Q278A, which give the highest level of education achieved by the interviewee and related family members in a country-specific way), we have a large number of potential features/predictors to choose from for our model (552, if we don't count our ID variable D_INTERVIEW, which corresponds to the interview ID and identifies each row/data point).
- Most variables include missing values, which are sometimes encoded as negative numbers.
- The various variables have varying types (e.g., categorical values for answers to survey questions or strings for names of countries or languages).
Your class teacher will explain the basic principle of a decision tree.
As usual, we start with a recipe
wvs_rec <-
  recipe(TRUST_INSTITUTIONS ~ .,
         data = training_data) %>%
  prep()

wvs_rec %>%
  bake(training_data)
Here, we'll simply use our training dataset without further pre-processing to fit our model (we leave possible pre-processing steps as a take-home exercise).
Now, we can fit a decision tree model on our data
You can create the model specification for a decision tree using this scaffolding code (you need to install the rpart library using the install.packages("rpart") command to be able to use this code and, of course, load the rpart library):
# Create the specification of a model but don't fit it yet
dt_spec <-
  decision_tree(mode = "classification", tree_depth = 5) %>%
  set_engine("rpart")
Now that you have the model specification:
- Fit the model to the training set using a workflow and evaluate its performance with an appropriate metric (e.g. AUC/ROC curve)
- Use the fitted workflow to make predictions on the test set and evaluate its performance with the same metric (e.g. AUC/ROC curve)
A minimal sketch of one possible way to wire this up follows.
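In the sketch below, the names dt_wflow and dt_fit are our own, and we build the workflow from an untrained recipe (the workflow preps the recipe itself when it is fitted); treat it as one possible approach rather than the required solution.
# One possible workflow: untrained recipe + the decision tree specification
dt_wflow <-
  workflow() %>%
  add_recipe(recipe(TRUST_INSTITUTIONS ~ ., data = training_data)) %>%
  add_model(dt_spec)

# Fit on the training set
dt_fit <- fit(dt_wflow, data = training_data)

# Evaluate on the test set: attach predicted class probabilities, then compute the AUC
dt_fit %>%
  augment(new_data = testing_data) %>%
  roc_auc(truth = TRUST_INSTITUTIONS, .pred_Yes, event_level = "second")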
You can plot the fitted decision tree using the rpart.plot package:
rpart.plot(dt_model$fit)
PS: You might need to install the rpart.plot package using the install.packages("rpart.plot") command.
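Note that rpart.plot() needs the underlying rpart object. If you fitted the tree inside a workflow (as in the sketch above, where the fitted object is called dt_fit), one way to get at it is extract_fit_engine() from the workflows package:
library(rpart.plot)

# Pull the underlying rpart model out of the fitted workflow before plotting it
rpart.plot(extract_fit_engine(dt_fit))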
Part III - Cross-validation (30 min)
🧑🏻🏫 TEACHING MOMENT: Your class teacher will briefly explain the concept of cross-validation.
If you are participating in the GENIAL project, you are asked to:
- Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
- Have only the following tabs open in your browser:
- These lab instructions
- The ChatGPT website (open a new chat window and name it 'DS202A - Week 07')
- The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
- The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
- Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.
- Fill out this brief survey at the end of the lab: 🔗 link (requires LSE login)
Click here for tips on how to use ChatGPT today
General Tips:
- tidymodels is a relatively new suite of packages. If you don't explicitly ask for responses that use tidymodels-like code, you will likely get old code based on base R instead.
- Try to provide more specific context about your problem. For example, instead of 'How to do cross-validation in R?', try:
I have a data frame with these columns:
- (column) (data type)
- (column) (data type)
...
I need to cross-validate this data with `rsample::vfold_cv()` in R, but I don't know how to specify it correctly.
Remember: ChatGPT is not a search engine. It’s a language model that tries to predict likely words. The more context you give it, the better this process will work for you.
- Use it to debug code smartly. Say you tried something, but your code doesn't run. It throws an error, and you don't understand what it means. Instead of asking ChatGPT to rewrite the code for you, paste your code and write, 'I am getting this error when I run the code above: … What is the source of this error? Explain why it doesn't work.' Then, using the output you get, go to the documentation to cross-check the information. If ChatGPT returned an incorrect explanation, reply: 'your solution is wrong because…'
In case you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.
We suggest you have these tabs open in your browser:
- The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
- The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
- Question: Can you retrain your decision tree model using cross-validation? (A minimal sketch of these steps is given after this list.)
- Use the initial_split() and training() functions to split your data and extract your training set.
- Apply the vfold_cv() function to the training data to create a 10-fold cross-validation (10 folds is the default).
- Fit the model using fit_resamples() and collect your metrics (as in the W05 lecture notebook). How does the model's performance compare to before?
- What happens when you tweak the tree parameters and the cross-validation parameters?
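Here is a minimal sketch of what these steps might look like, assuming the dt_spec specification from Part II is still in memory; the names wvs_folds and dt_cv_results are our own.
# 10-fold cross-validation on the training set (v = 10 is also the default)
wvs_folds <- vfold_cv(training_data, v = 10)

# Resample a workflow (untrained recipe + the Part II tree specification) across the folds
dt_cv_results <-
  workflow() %>%
  add_recipe(recipe(TRUST_INSTITUTIONS ~ ., data = training_data)) %>%
  add_model(dt_spec) %>%
  fit_resamples(resamples = wvs_folds)

# Summarise the performance metrics across the 10 folds
collect_metrics(dt_cv_results)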
Fill out this brief survey at the end of the lab if you are part of GENIAL: 🔗 link (requires LSE login)
(Bonus) Take-home exercise
Check out the following recipe:
wvs_rec <-
  recipe(TRUST_INSTITUTIONS ~ .,
         data = training_data) %>%
  step_impute_knn(everything(), neighbors = 3) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_unknown(all_nominal()) %>%
  step_corr(all_predictors(), threshold = 0.7, method = "spearman") %>%
  prep()
You'll notice a couple of familiar steps in the recipe (recipe() and prep()) and quite a few unfamiliar ones.
Let’s go through them one by one:
Imputing missing values instead of discarding them
Up until now, you had used the step_naomit() function to remove any rows with missing values. Today, since there are too many missing values in our data, removing them would leave us with barely any usable data. So, instead, we try to make a best-informed guess of what the missing values might be: this is what we call (missing value) imputation. There are many ways to do imputation (see (Enders 2022) or (Scheffer 2002) for details on imputation methods): here, we chose a method called K-nearest neighbours imputation, whose principle is to fill in a missing value by "averaging" the values of that feature across the rows (observations) closest, in terms of distance, to the row where the missing value is found. We'll come back to K-nearest neighbours in the lecture on Friday.
For now, it's enough to remember that K-nearest neighbour imputation lets us impute, i.e. "guess", the missing values in our data, and that we use the step_impute_knn() function to do that.
Processing numerical variables
We process the numerical variables in our data in two ways:
- we use step_normalize() to ensure the numeric variables (excluding the outcome variable) are normalised, i.e. they have a standard deviation of one and a mean of zero (z-standardisation).
- with step_zv(), we remove any numeric variable that has zero variance (not useful for predictions!)
Processing nominal variables
When it comes to nominal variables, we use:
- step_novel() to handle factor levels that appear in new data (e.g. the test set) but were not seen in the training data, by assigning them to a new catch-all level
- step_dummy() to convert our nominal variables into numeric binary (0 and 1) variables
- step_unknown() to recode missing categorical data (NA's) as a new "unknown" level.
Take-away exercise: Does the decision tree you trained earlier get better when you apply the pre-processing recipe we just described? (You might want to apply it on a smaller sample of the data for efficiency reasons!)
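Here is one way you might set that up, sketched under the assumption that dt_spec from Part II is still available; the names training_small and wvs_small_rec are our own, and the 10% sampling fraction is arbitrary.
# Work on a random 10% subsample of the training data to keep KNN imputation manageable
set.seed(202)                                  # arbitrary seed, for reproducibility
training_small <- training_data %>% slice_sample(prop = 0.1)

# The same recipe as above, but left untrained so the workflow can prep it during fitting
wvs_small_rec <-
  recipe(TRUST_INSTITUTIONS ~ ., data = training_small) %>%
  step_impute_knn(everything(), neighbors = 3) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_unknown(all_nominal()) %>%
  step_corr(all_predictors(), threshold = 0.7, method = "spearman")

# Fit the tree with the new pre-processing (then evaluate it as in Part II to compare)
workflow() %>%
  add_recipe(wvs_small_rec) %>%
  add_model(dt_spec) %>%
  fit(data = training_small)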
References