🗣️ Week 07 - Lab Roadmap (90 min)
Decision trees with tidymodels
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Apply a decision tree model to a real-world classification problem
- Use cross-validation to build a more robust decision tree model
📋 Lab Tasks
No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.
Part 0: Export your chat logs (~ 3 min)
As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:
🎯 ACTION POINTS
👉 CLICK HERE to export your chat log.
Thanks for being GENIAL! You are now one step closer to earning some prizes! 🏆
🚨 NOTE: You MUST complete the initial form.
If you really don't want to participate in GENIAL, just answer "No" to the Terms & Conditions question: your e-mail address will be deleted from GENIAL's database the following week.
📚 Preparation
This week, we no longer use the UK House Prices data for our lab and move to a new dataset inspired by (Lima and Delen 2020), whose purpose is to study and predict corruption in countries. Just as in (Lima and Delen 2020), we combine data from the Ease of Doing Business project (but we use the data from 2019 and 2020 instead of 2017 and 2018 as in the paper), data from the Heritage Foundation regarding the components of the Economic Freedom Index for the years 2019/2020, data regarding the Education Index from the Human Development Report (part of the United Nations Development Programme, again for the years 2019/2020), and finally data from Transparency International for the years 2019/2020 about the Corruption Perception Index (CPI).
We excluded from the data all countries for which the CPI was missing or which had too many missing values in the data sourced from the Heritage Foundation. Likewise, we removed variables from the Ease of Doing Business data that either had too many missing values (e.g. duration of electricity outages) or couldn't be interpreted in a straightforward fashion (variables related to VAT, as many countries either don't have a VAT or don't refund it if they do, leading to columns that mix VAT refund durations with annotations explaining these facts).
We also tried to simplify the coding of some column values:
- we replaced the value "No Practice", indicating that a certain type of practice (e.g. tax) didn't exist in a particular country, with an extremely large number (1000000000) to denote that the situation was de facto not possible.
- we replaced "NULL" strings with missing values (i.e. `NA`).
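For illustration, the two recoding rules above could be applied with `dplyr` along these lines (a sketch only; `time_to_pay_tax` is a hypothetical column name, not one from the actual dataset):

```r
library(dplyr)

# Toy example of the two recoding rules described above
raw <- tibble(time_to_pay_tax = c("25", "No Practice", "NULL"))

recoded <- raw %>%
  mutate(
    time_to_pay_tax = case_when(
      time_to_pay_tax == "No Practice" ~ "1000000000",  # de facto not possible
      time_to_pay_tax == "NULL" ~ NA_character_,        # "NULL" strings become NA
      TRUE ~ time_to_pay_tax
    ),
    time_to_pay_tax = as.numeric(time_to_pay_tax)
  )
```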
Use the link below to download the dataset for the year 2019:
And use the link below to download the dataset for the year 2020:
Put these new files in the `data` folder within your `DS202W` project folder.
Then use the link below to download the lab file:
We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.
Here are the instructions for this lab:
Import required libraries:
```r
# Tidyverse packages we will use
library(dplyr)
library(tidyr)
library(readr)
library(stringr)

# Tidymodels packages we will use
library(rsample)
library(yardstick)
library(parsnip)
library(recipes)
library(workflows)
library(tune)

# Decision tree engine
library(rpart)
```
Read the 2019 data set:
This is the first of the brand-new datasets you've just downloaded.
```r
# Modify the filepath if needed
filepath <- "data/corruption_data_2019_nomissing.csv"
corruption_data_2019 <- read_csv(filepath)
```
Part I - Meet a new data set (20 min)
We've been doing a lot of training vs testing splits in this course, primarily by hand and focused on a particular point in time (e.g., test samples are defined to start in 2019).
But not all splits need to be done this way. Some data sets don't have a time-series component and can be split randomly for testing purposes. To make things more robust, we don't simply use one single train versus test split; instead, we use a technique called cross-validation and split the data into multiple train vs test splits.
In this lab, we'll learn how to use cross-validation to build a model that is more robust to changes in the data. Let's kick things off by cleaning up our data and randomly splitting it into a training set (70% of data points/rows) and a test set (30% of data points/rows).
🧑‍🏫 TEACHING MOMENT:
(Your class teacher will guide you through this section. Just run all the code chunks below together with your class teacher.)
Our goal in this lab is to predict the level of corruption in a given country. The Corruption Perception Index (CPI), already present in our data set (in the `cpi_score` column), sorts countries according to their perceived corruption levels. According to (Lima and Delen 2020), "[t]he index captures the assessments of domain experts on corrupt behavioral information, originating a scale from 0 to 100 where economies close to 0 are perceived as highly corrupt while economies close to 100 are perceived as less corrupt". In other words, the CPI already gives us a way to classify countries on scales of corruption. To go from a scale of 0 to 100 (the CPI scale) to a categorical variable (which is what we need for our classification), we simply need the following definition:
Corruption class: We define the following levels of corruption based on the CPI scale:
- if the CPI is lower than 50, then the corruption level is `poor`
- if the CPI is between 50 (included) and 70 (excluded), then the corruption level is `average`
- if the CPI is 70 or higher, then the corruption level is `good`

We store the result in the `corruption_class` column.
Convert the `corruption_class` column from character to factor.
🎯 ACTION POINTS:
1. Create the `corruption_class` column and convert it to `factor`.
2. Now, let's randomly split our dataset into a training set (containing 70% of the rows in our data) and a test set (containing 30% of the rows in our data).
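The first action point could be tackled like this (a minimal sketch; one possible implementation using `dplyr::case_when()`, with the cut-offs taken from the definition above):

```r
library(dplyr)

# Derive corruption_class from cpi_score and store it as a factor
corruption_data_2019 <- corruption_data_2019 %>%
  mutate(
    corruption_class = case_when(
      cpi_score < 50 ~ "poor",     # CPI below 50
      cpi_score < 70 ~ "average",  # CPI in [50, 70)
      TRUE ~ "good"                # CPI of 70 or higher
    ),
    corruption_class = factor(corruption_class, levels = c("poor", "average", "good"))
  )
```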
```r
# Randomly split the initial data frame into training and testing sets (70% and 30% of rows, respectively)
split <- initial_split(corruption_data_2019, prop = 0.7)
```
What is in the training and testing set?
To get the actual data assigned to either set, use the `rsample::training()` and `rsample::testing()` functions:
```r
training_data <- training(split)
testing_data <- testing(split)
```
For curiosity, let's confirm which unique countries are represented in our training and test sets and by how many data points. We can uncover this mystery by counting the unique values in the `economy` column (you could do the same per `region`).
```r
# tallying the number of rows per country in the training set
training_data %>%
  group_by(economy) %>%
  tally()

# tallying the number of rows per country in the test set
testing_data %>%
  group_by(economy) %>%
  tally()
```
Just how many non-empty records are there in our datasets per country?
```r
# tallying the number of non-empty rows per country in the training set
training_data %>%
  drop_na() %>%
  group_by(economy) %>%
  tally() %>%
  knitr::kable()

# tallying the number of non-empty rows per country in the test set
testing_data %>%
  drop_na() %>%
  group_by(economy) %>%
  tally() %>%
  knitr::kable()
```
To browse the original Ease of Doing Business dataset, including a description of the variables (though some were recoded for ease of comprehension) found in the `Metadata` sheet of the file, you can download the Excel file below:
For more information about the other variables of the dataset (i.e. variables related to the Economic Freedom Index, the Education Index or the CPI), check out the explanations in (Lima and Delen 2020).
🗣️ DISCUSSION:
We defined corruption (and corruption levels) in a specific way, and we will be building a model to predict corruption in a way that fits this definition. Do you think that the definition we gave of corruption is satisfactory? Is it a good modelling objective? If not, what would you do differently? (You can look at the dataset documentation for further ideas).
Part II - Introduction to decision trees (40 min)
🧑🏻‍🏫 TEACHING MOMENT: In this part, you'll be learning about a new classification model called a decision tree, which is better suited than logistic regression to handling large numbers of features of varying types that potentially contain missing data.
Our dataset is a real-world dataset and ticks all these boxes:
- We mentioned earlier that our goal is to build a model to predict `corruption_class`. Aside from a few variables that don't look like likely predictors (e.g. `db_year`, `economy`, `region`, `income_group` (though this one is debatable), and `cpi_score` (it is correlated with the outcome variable so can't be used to predict it!)), we have a large number of potential features/predictors to choose from for our model (100, for a dataset that only has 140 rows/data points).
- Many variables include missing values.
- Though we smoothed things out, the variables had varying types (e.g., categorical values, strings or numerical values).
Your class teacher will explain the basic principle of a decision tree.
As usual, we start with a recipe
For computational reasons (too long to run!), we'll be building our model with a random subset of features from the original dataset. We'll randomly choose 25 variables from the original columns of the dataset, excluding the columns `country_code`, `db_year`, `economy`, `region`, `income_group` and `cpi_score`.
We construct a vector `predictors`, which contains the names of our chosen predictors:

```r
# list of the columns we exclude
remove <- c("country_code", "economy", "region", "income_group", "db_year", "cpi_score")

predictors <- corruption_data_2019 %>%
  colnames() %>%
  str_remove_all(paste(remove, collapse = "|")) %>%
  .[!(. == "")] %>%
  sample(25)
```
Then, we write our recipe:
```r
formula_string <- paste("corruption_class ~", paste(predictors, collapse = " + "))
formula <- as.formula(formula_string)

impute_rec <- recipe(formula, data = training_data) %>%
  step_impute_median(all_numeric(), -all_outcomes()) %>%
  prep()
```
Here, the only pre-processing step we perform before fitting our model is a simple median imputation, which fills in the missing values in each numeric column with the median of that column.
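If you are curious about what the imputation does, you can inspect the pre-processed training data with `recipes::bake()`; passing `new_data = NULL` to a prepped recipe returns the processed training set:

```r
# Inspect the training data after median imputation
impute_rec %>%
  bake(new_data = NULL) %>%
  glimpse()
```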
Now, we can fit a decision tree model on our data
You can create the model specification for a decision tree using this scaffolding code (you need to install the `rpart` package with the `install.packages("rpart")` command to be able to use this code and, of course, load the `rpart` library):
```r
# Create the specification of a model but don't fit it yet
dt_spec <- decision_tree(mode = "classification", tree_depth = 5) %>%
  set_engine("rpart")
```
Now that you have the model specification:
- Fit the model to the training set using a workflow and evaluate its performance with an appropriate metric (e.g. ROC curve/AUC)
- Use the fitted workflow to predict on the test set and evaluate its performance with the same metric (e.g. ROC curve/AUC)
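One possible way to do this (a minimal sketch; the object names `dt_wflow`, `dt_fit` and `test_preds` are our own, and we assume the `impute_rec` and `dt_spec` objects defined above; the `.pred_*` column names depend on your factor levels):

```r
# Bundle the recipe and the model specification into a workflow
dt_wflow <- workflow() %>%
  add_recipe(impute_rec) %>%
  add_model(dt_spec)

# Fit on the training set
dt_fit <- dt_wflow %>% fit(data = training_data)

# Predict on the test set and compute a multiclass ROC AUC
test_preds <- augment(dt_fit, new_data = testing_data)
test_preds %>%
  roc_auc(truth = corruption_class, .pred_poor, .pred_average, .pred_good)
```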
You can plot the fitted decision tree using the `rpart.plot` package:

```r
# `dt_model` is a fitted parsnip model object;
# if you fitted via a workflow, extract it first, e.g. with extract_fit_parsnip()
rpart.plot(dt_model$fit)
```
PS: You might need to install the `rpart.plot` package using the `install.packages("rpart.plot")` command.
🤔 What happens if you choose a different set of features to train your model?
Part III - Cross-validation (30 min)
🧑🏻‍🏫 TEACHING MOMENT: Your class teacher will briefly explain the concept of cross-validation.
- Question: Can you retrain your decision tree model using cross-validation?
  - Use the `initial_split()` and `training()` functions to split your data and extract your training set.
  - Apply the `vfold_cv()` function to the data for 10-fold cross-validation (10 folds is the default).
  - Fit the model using `fit_resamples()` and collect your metrics (as in the W05 lecture notebook). How does the model performance compare to before?
  - What happens when you tweak the tree parameters and the cross-validation parameters?
- Question: How does your model perform on (a subset of) the 2020 dataset?
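As a starting point, the cross-validation steps above could look like this (a minimal sketch; the object names `folds` and `cv_results` are our own, and we assume the `impute_rec` recipe and `dt_spec` specification from Part II):

```r
# 10-fold cross-validation on the training set
folds <- vfold_cv(training_data, v = 10)

# Fit the workflow on each fold
cv_results <- workflow() %>%
  add_recipe(impute_rec) %>%
  add_model(dt_spec) %>%
  fit_resamples(resamples = folds)

# Average metrics (accuracy and ROC AUC by default) across the 10 folds
collect_metrics(cv_results)
```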