🛣️ Week 07 - Lab Roadmap (90 min)

Decision trees with tidymodels

Author

Dr Ghita Berrada

🥅 Learning Objectives

By the end of this lab, you will be able to:

  • Apply a decision tree model to a real-world classification problem
  • Use cross-validation to build a more robust decision tree model
This lab is part of the GENIAL project.

When you reach Part III of this lab, read the specific instructions for GENIAL participants.

📚 Preparation

This week, we no longer use the UK House Prices dataset for our lab; we move to a new dataset adapted from Wave 7 of the World Values Survey (Haerpfer et al. 2022). Use the link below to download this new dataset:

Put this new file in the data folder within your DS202A project folder.

Then use the link below to download the lab file:

We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.

📋 Lab Tasks

Here are the instructions for this lab:

Import required libraries:

# Tidyverse packages we will use
library(dplyr)     # data manipulation
library(tidyr)     # reshaping data and handling missing values
library(readr)     # reading CSV files

# Tidymodels packages we will use
library(rsample)   # train/test splits and cross-validation folds
library(yardstick) # performance metrics
library(parsnip)   # model specifications
library(recipes)   # pre-processing recipes
library(workflows) # bundling recipes and models
library(rpart)     # decision tree engine used by parsnip
library(tune)      # fitting models over resamples

Read the data set:

It is the brand-new dataset you’ve downloaded.

# Modify the filepath if needed
filepath <- "data/WVS_Wave7_modified.csv"
wvs_data <- read_csv(filepath)

Part I - Meet a new data set (20 min)

Important

We’ve been doing a lot of training vs testing splits in this course, primarily by hand and focused on a particular point in time (e.g., test samples are defined to start in 2019).

But not all splits need to be done this way. Some data sets don’t have a time-series component and can be split randomly for testing purposes. To make things more robust, we don’t just use a single train vs test split; instead, we use a technique called cross-validation and split the data into multiple train vs test splits.

In this lab, we’ll learn how to use cross-validation to build a model that is more robust to changes in the data. Let’s kick things off by cleaning up our data and randomly splitting it into a training set (70% of data points/rows) and a test set (30% of data points/rows).

🧑‍🏫 TEACHING MOMENT:

(Your class teacher will guide you through this section. Just run all the code chunks below together with your class teacher.)

Our goal in this lab is to predict someone’s trust in institutions (i.e., police, army, justice courts, press and television, labour unions, civil services, political parties, parliament, and government), as represented in our dataset by the variable wvs_data["TRUST_INSTITUTIONS"].

Here is how we obtain this variable:

  1. Average mistrust: We take the average of several columns, each of which represents inverse trust (i.e., mistrust) towards a particular institution:

    • I_TRUSTARMY (Q65 in the original data set: trust in the army)
    • I_TRUSTPRESS (Q66: trust in the press)
    • I_TRUSTTELEVISION (Q67: trust in television)
    • I_TRUSTUNIONS (Q68: trust in labour unions)
    • I_TRUSTPOLICE (Q69: trust in the police)
    • I_TRUSTCOURTS (Q70: trust in the justice system/courts)
    • I_TRUSTGVT (Q71: trust in the government)
    • I_TRUSTPARTIES (Q72: trust in political parties)
    • I_TRUSTPARLIAMENT (Q73: trust in parliament)
    • I_TRUSTCIVILSERVICES (Q74: trust in the civil services)

    These variables are not part of the original data set; they were precomputed from it to make trust in institutions easier to derive. Each variable was calculated from the corresponding survey question as follows (a code sketch of this recoding appears after the list of steps below):

    • if the response to the question is 1, the corresponding variable is assigned a value of 0;
    • if the response is 2, the variable is assigned a value of 0.33;
    • if the response is 3, the variable is assigned a value of 0.66;
    • if the response is 4, the variable is assigned a value of 1;
    • otherwise (i.e., if the response is -1, -2, -4, or -5), the value is treated as missing.

    We store the average of these variables in the MEAN_I_TRUST_INSTITUTIONS column.

  2. Convert from mistrust to trust for ease of interpretability: We subtract the average obtained (i.e., the content of the MEAN_I_TRUST_INSTITUTIONS column) from 1.

  3. Create a binary target variable: We assign a value of 1 if the value obtained after step 2 is higher than 0.5, thus representing trust, and 0 otherwise. We store the result in the TRUST_INSTITUTIONS column.
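For illustration, here is a minimal sketch of this recoding for a single question, using Q65 (the army question) as an example. The column name I_TRUSTARMY_check is purely hypothetical: the modified dataset already ships with the precomputed I_TRUST_* columns, so you do not need to run this yourself.

# Illustrative recoding of one survey question (Q65 -> mistrust in the army)
recoded_example <- 
    wvs_data %>% 
    mutate(I_TRUSTARMY_check = case_when(
        Q65 == 1 ~ 0,
        Q65 == 2 ~ 0.33,
        Q65 == 3 ~ 0.66,
        Q65 == 4 ~ 1,
        TRUE     ~ NA_real_   # negative codes (-1, -2, -4, -5) are treated as missing
    ))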

🎯 ACTION POINTS:

  1. Create the TRUST_INSTITUTIONS column. Open a new chunk of code in your notebook and type the following code:

# Compute the average mistrust across the ten institutions for each respondent,
# then turn it into a binary trust outcome ("Yes"/"No")
wvs_data <- 
    wvs_data %>%
    rowwise() %>% 
    mutate(MEAN_I_TRUST_INSTITUTIONS = mean(c(I_TRUSTARMY, I_TRUSTCIVILSERVICES, I_TRUSTPOLICE, I_TRUSTCOURTS, 
                                              I_TRUSTPRESS, I_TRUSTTELEVISION, I_TRUSTUNIONS, I_TRUSTGVT, 
                                              I_TRUSTPARTIES, I_TRUSTPARLIAMENT), 
                                            na.rm = TRUE),
           TRUST_INSTITUTIONS = (1 - MEAN_I_TRUST_INSTITUTIONS) > 0.5,
           TRUST_INSTITUTIONS = factor(TRUST_INSTITUTIONS,
                                       levels = c(FALSE, TRUE),
                                       labels = c("No", "Yes"))) %>%
    ungroup()
  2. Remove the columns we used to compute the TRUST_INSTITUTIONS column.

    Since we used the I_TRUST_* variables above to compute our outcome variable, these variables should not be part of our model! Neither should you use the original survey questions from which the I_TRUST_* variables were precomputed: Q65, Q67, Q68, Q69, Q70, Q71, Q72, Q73, or Q74. We only left those variables in the initial dataset for completeness and to explain how trust in institutions was defined and obtained.

    From here on, we’ll operate with this filtered data:

cols_to_remove <- c("I_TRUSTARMY", "I_TRUSTCIVILSERVICES", "I_TRUSTPOLICE", "I_TRUSTCOURTS", 
                    "I_TRUSTPRESS", "I_TRUSTTELEVISION", "I_TRUSTUNIONS", "I_TRUSTGVT", 
                    "I_TRUSTPARTIES", "I_TRUSTPARLIAMENT", "MEAN_I_TRUST_INSTITUTIONS",
                    "D_INTERVIEW", "W_WEIGHT", "S018", "Q_MODE", "K_DURATION",
                    "Q65", "Q67", "Q68", "Q69", "Q70", "Q71", "Q72", "Q73", "Q74",
                    "Q275","Q276","Q277","Q278","Q275A","Q276A","Q277A","Q278A")

# Filter the data to remove unnecessary columns
wvs_data <- 
    wvs_data %>% 
    select(-all_of(cols_to_remove))
  3. Now, let’s randomly split our dataset into a training set (containing 70% of the rows in our data) and a test set (containing the remaining 30%).

# Randomly split the initial data frame into training and testing sets (70% and 30% of rows, respectively)
split <- initial_split(wvs_data, prop = 0.7)
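Note: if you want this random split (and everything that depends on it) to be reproducible across runs, you can set a seed beforehand; the seed value below is an arbitrary choice.

# Setting a seed before the split makes the random 70/30 partition reproducible
set.seed(123)   # any integer works
split <- initial_split(wvs_data, prop = 0.7)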
  4. What is in the training and testing sets?

    To get the actual data assigned to either set, use the rsample::training() and rsample::testing() functions:

training_data <- training(split)
testing_data <- testing(split)

Out of curiosity, let’s confirm which countries are represented in our training and test sets and by how many data points. We can find out by counting the unique values in the COUNTRY_NAME column.

# tallying the number of rows per country in the training set
training_data %>% 
    group_by(COUNTRY_NAME) %>% 
    tally()

# tallying the number of rows per country in the test set
testing_data %>% 
    group_by(COUNTRY_NAME) %>% 
    tally()

Just how many non-empty records are there in our datasets per country?

# tallying the number of non-empty rows per country in the training set
training_data %>%
    drop_na() %>%
    group_by(COUNTRY_NAME) %>%
    tally() %>%
    knitr::kable()

# tallying the number of non-empty rows per country in the test set
testing_data %>%
    drop_na() %>%
    group_by(COUNTRY_NAME) %>%
    tally() %>%
    knitr::kable()

Note that missing values are encoded in diverse ways in this dataset: in responses to survey questions, negative values also denote missing values (e.g., the respondent did not know how to answer or did not respond, the question was not asked, or the response was missing).

To browse the full dataset documentation, including a description of the variables (though some were recoded for ease of comprehension), you can download the zip archive below:

🗣️ DISCUSSION:

We defined trust in institutions in a specific way, and we will be building a model to predict trust in institutions in a way that fits this definition. Do you think that the definition we gave of trust in institutions is satisfactory? Is it a good modelling objective? If not, what would you do differently? (You can look at the dataset documentation for further ideas).

Part II - Introduction to decision trees (40 min)

🧑🏻‍🏫 TEACHING MOMENT: In this part, you’ll be learning about a new classification model called decision trees, which is better suited than logistic regression to handling large numbers of features of varying types that may contain missing data.

Our dataset is a real-world dataset and ticks all these boxes:

  • We mentioned earlier that our goal is to build a model to predict TRUST_INSTITUTIONS. Aside from a few variables that are unlikely predictors (e.g., S018 and W_WEIGHT, which are two survey-related weightings; K_DURATION, i.e. the interview duration; and Q_MODE, i.e. the interview type) and variables whose values have not been standardised across countries (e.g., Q275, Q276, Q277, Q278, Q275A, Q276A, Q277A, Q278A, which give the highest level of education achieved by the interviewee and a related family member in a country-specific way), we have a large number of potential features/predictors to choose from for our model (552 of them, if we don’t count our ID variable D_INTERVIEW, which corresponds to the interview ID and identifies each row/data point).
  • Most variables include missing values, which are sometimes encoded as negative numbers
  • The various variables have varying types (e.g., categorical values for answers to survey questions or strings for names of countries or languages)

Your class teacher will explain the basic principle of a decision tree.

As usual, we start with a recipe:


wvs_rec <-
  recipe(TRUST_INSTITUTIONS ~ .,
         data = training_data) %>%
  prep()

wvs_rec %>%
  bake(training_data)

Here, we’ll simply use our training dataset without further pre-processing to fit our model (we leave possible pre-processing steps as a take-home exercise).

Now, we can fit a decision tree model to our data.

You can create the model specification for a decision tree using this scaffolding code (you need to install the rpart package with the install.packages("rpart") command and, of course, load it with library(rpart) before using this code):

# Create the specification of a model but don't fit it yet
dt_spec <- 
    decision_tree(mode = "classification", tree_depth = 5) %>% 
    set_engine("rpart")

Now that you have the model specification:

  1. Fit the model to the training set using a workflow and evaluate its performance with an appropriate metric (e.g., the ROC curve and AUC)
  2. Evaluate the fitted model on the test set with the same metric and compare the two results
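As a starting point, here is a minimal sketch of step 1. The object names wvs_wf, dt_fit and train_preds are illustrative, not part of the lab file, and note that a workflow expects an untrained recipe, so we pass one without prep():

# Bundle an (untrained) recipe and the model specification into a workflow
wvs_wf <- 
    workflow() %>% 
    add_recipe(recipe(TRUST_INSTITUTIONS ~ ., data = training_data)) %>% 
    add_model(dt_spec)

# Fit the workflow on the training data
dt_fit <- fit(wvs_wf, data = training_data)

# Attach class predictions and class probabilities to the training data
train_preds <- augment(dt_fit, new_data = training_data)

# Compute the AUC ("Yes" is the second level of our outcome factor)
train_preds %>% 
    roc_auc(truth = TRUST_INSTITUTIONS, .pred_Yes, event_level = "second")

For step 2, the same augment()/roc_auc() pattern applied to testing_data tells you how well the model generalises to unseen data.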
Tip

You can plot the fitted decision tree using the rpart.plot package (here, dt_model stands for a model fitted directly with parsnip, e.g., via fit(dt_spec, TRUST_INSTITUTIONS ~ ., data = training_data)):

rpart.plot(dt_model$fit)

PS: You might need to install the rpart.plot package using the install.packages("rpart.plot") command.
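If you fitted the model via a workflow instead (as in the sketch above, where the fitted workflow is called dt_fit), you can pull out the underlying rpart object first with extract_fit_engine(), which ships with the tidymodels packages:

# Extract the underlying rpart fit from the fitted workflow and plot the tree
library(rpart.plot)

dt_fit %>% 
    extract_fit_engine() %>% 
    rpart.plot()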

Part III - Cross-validation (30 min)

🧑🏻‍🏫 TEACHING MOMENT: Your class teacher will briefly explain the concept of cross-validation.

If you are participating in the GENIAL project, you are asked to:

  • Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
  • Have only the following tabs open in your browser:
    1. These lab instructions
    2. The ChatGPT website (open a new chat window and name it ‘DS202A - Week 07’)
    3. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
    4. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
  • Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.
  • Fill out this brief survey at the end of the lab: 🔗 link (requires LSE login)
Tips on how to use ChatGPT today

General Tips:

  1. tidymodels is a relatively new suite of packages. If you don’t explicitly ask for responses that use tidymodels-like code, you will likely get old code based on base R instead.

  2. Try to provide more specific context about your problem. For example, instead of ‘How to do cross-validation in R?’, try:

I have a data frame with these columns: 

- (column) (data type)
- (column) (data type)
...


I need to cross-validate this data using `rsample::vfold_cv()` in R, but I don't know how to specify it correctly.

Remember: ChatGPT is not a search engine. It’s a language model that tries to predict likely words. The more context you give it, the better this process will work for you.

  3. Use it to debug code smartly. Say you tried something, but your code doesn’t run. It throws an error, and you don’t understand what it means. Instead of asking ChatGPT to rewrite the code for you, paste your code and write: ‘I am getting this error when I run the code above: … What is the source of this error? Explain why it doesn’t work.’ Then, using the output you get, go to the documentation to cross-check the information. If ChatGPT returned an incorrect explanation, reply: ‘your solution is wrong because…’

If you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

  1. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
  2. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
  Question: Can you retrain your decision tree model using cross-validation? (A minimal code sketch is given after this list.)

  • Use the initial_split() and training() functions to split your data and extract your training set.

  • Apply the vfold_cv() function to the training data to create cross-validation folds (10 folds by default).

  • Fit the model using fit_resamples() and collect your metrics (as in the W05 lecture notebook). How does the model’s performance compare to before?

  • What happens when you tweak the tree parameters and the cross-validation parameters?

  • Fill out this brief survey at the end of the lab if you are part of GENIAL: 🔗 link (requires LSE login)
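Here is a minimal sketch of these steps, assuming the wvs_wf workflow object from the Part II sketch above (the names folds and dt_cv_results are illustrative):

# Create 10 cross-validation folds from the training data
folds <- vfold_cv(training_data, v = 10)

# Fit the same workflow on each fold and record the AUC
dt_cv_results <- 
    wvs_wf %>% 
    fit_resamples(resamples = folds, metrics = metric_set(roc_auc))

# Average performance across the 10 folds
dt_cv_results %>% 
    collect_metrics()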

(Bonus) Take-home exercise

Check out the following recipe:

wvs_rec <-
  recipe(TRUST_INSTITUTIONS ~ .,
         data = training_data) %>%
  step_impute_knn(everything(), neighbors = 3) %>% 
  step_novel(all_nominal(),-all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_unknown(all_nominal()) %>%
  step_corr(all_predictors(), threshold = 0.7, method = "spearman") %>%
  prep()

You’ll notice a familiar step in the recipe (prep) and quite a few unfamiliar ones.

Let’s go through them one by one:

Imputing missing values instead of discarding them

Up until now, you have used the step_naomit() function to remove any rows with missing values. Today, since there are so many missing values in our data, removing them would leave us with barely any usable data. So, instead, we try to “make a best-informed guess” of what the missing values might be: this is what we call (missing value) imputation. There are many ways to do imputation (see Enders 2022 or Scheffer 2002 for details on imputation methods): here, we chose a method called K-nearest neighbours imputation, whose principle is to fill in a missing value by “averaging” the values of that feature across the rows (i.e., observations) that are closest, in terms of distance, to the row where the missing value is found. We’ll come back to K-nearest neighbours in the lecture on Friday.

For now, it’s enough to remember that K-nearest neighbours imputation lets us impute, i.e., “guess”, the missing values in our data, and that we use the step_impute_knn() function to do so.

Processing numerical variables

We process the numerical variables in our data in two ways:

  • we use step_normalize() to ensure the numeric variables (excluding the outcome variable) are normalised, i.e., they have a mean of zero and a standard deviation of one (z-standardisation).
  • with step_zv(), we remove any numeric variable that has zero variance (not useful for prediction!)

Processing nominal variables

When it comes to nominal variables, we use:

  • step_novel() to handle categories that appear in new data but were not seen in the training data (they get assigned to a new factor level)
  • step_dummy() to convert our nominal variables into numeric binary (0 and 1) variables.
  • step_unknown() to recode missing values (NAs) in categorical variables as an “unknown” level.

Finally, step_corr() removes predictors that are highly correlated with other predictors (here, those with an absolute Spearman correlation above 0.7), since they carry largely redundant information.

Take-away exercise: Does the decision tree you trained earlier get better when you apply the pre-processing recipe we just described? (You might want to apply it to a smaller sample of the data for efficiency reasons!)
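One way to follow that advice is to work on a random subset of the training data. Here is a minimal sketch, where the 5,000-row sample size and the seed are arbitrary choices, and the recipe is the one above minus the final prep() so that it can be added (untrained) to a workflow:

# Draw a reproducible random subset of the training data
set.seed(42)
training_sample <- 
    training_data %>% 
    slice_sample(n = 5000)

# Same pre-processing recipe as above, left untrained so it can go into a workflow
wvs_rec_small <-
  recipe(TRUST_INSTITUTIONS ~ ., data = training_sample) %>%
  step_impute_knn(everything(), neighbors = 3) %>% 
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_unknown(all_nominal()) %>%
  step_corr(all_predictors(), threshold = 0.7, method = "spearman")

# Swap this recipe into a workflow with the decision tree and refit on the subset
dt_fit_preproc <- 
    workflow() %>% 
    add_recipe(wvs_rec_small) %>% 
    add_model(dt_spec) %>% 
    fit(data = training_sample)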

References

Enders, Craig K. 2022. Applied Missing Data Analysis. Second edition. Methodology in the Social Sciences. New York: The Guilford Press.
Haerpfer, Christian, Ronald Inglehart, Alejandro Moreno, Christian Welzel, Kseniya Kizilova, Jaime Diez-Medrano, Marta Lagos, Pippa Norris, Eduard Ponarin, and Bi Puranen. 2022. “World Values Survey Wave 7 (2017-2022) Cross-National Data-Set.” World Values Survey Association. https://doi.org/10.14281/18241.20.
Scheffer, Judi. 2002. “Dealing with Missing Data.” Research Letters in the Information and Mathematical Sciences 3.
