🗣️ Week 07 - Lab Roadmap (90 min)
Decision trees with tidymodels
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Apply a decision tree model to a real-world classification problem
- Use cross-validation to build a more robust decision tree model
📋 Lab Tasks
No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.
Part 0: Export your chat logs (~ 3 min)
As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:
🎯 ACTION POINTS
👉 CLICK HERE to export your chat log.
Thanks for being GENIAL! You are now one step closer to earning some prizes! 🏆
🚨 NOTE: You MUST complete the initial form.
If you really don't want to participate in GENIAL, just answer "No" to the Terms & Conditions question: your e-mail address will be deleted from GENIAL's database the following week.
📚 Preparation
This week, we no longer use the UK House Prices data for our lab and move to a new dataset inspired by (Lima and Delen 2020), whose purpose is to study and predict corruption in countries. Just as in (Lima and Delen 2020), we combine data from the Ease of Doing Business project (but we use the data from 2019 and 2020 instead of 2017 and 2018 as in the paper), data from the Heritage Foundation regarding the components of the Economic Freedom Index for the years 2019/2020, data regarding the Education Index from the Human Development Report (part of the United Nations Development Programme, again for the years 2019/2020), and finally data from Transparency International for the years 2019/2020 about the Corruption Perception Index (CPI).
We excluded from the data all countries for which the CPI was missing or which had too many missing values in the data sourced from the Heritage Foundation. Likewise, we removed variables from the Ease of Doing Business data that either had too many missing values (e.g. duration of electricity outages) or couldn't be interpreted in a straightforward fashion (variables related to VAT, as many countries either don't have a VAT or don't refund it if they do, leading to columns that mix VAT refund durations with annotations explaining these facts).
We also tried to simplify the coding of some column values:
- we replaced the value "No Practice", indicating that a certain type of practice (e.g. tax) didn't exist in a particular country, with an extremely large number (1000000000) to denote that the situation was de facto not possible.
- we replaced "NULL" strings with missing values (i.e. `NA`).
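For illustration, the two recoding rules above could be applied with `dplyr` along these lines (a sketch only; `time_to_pay_tax` is a hypothetical column name, not one from the actual dataset):

```r
library(dplyr)

# Toy example of the two recoding rules described above
raw <- tibble(time_to_pay_tax = c("25", "No Practice", "NULL"))

recoded <- raw %>%
  mutate(
    time_to_pay_tax = case_when(
      time_to_pay_tax == "No Practice" ~ "1000000000",  # de facto not possible
      time_to_pay_tax == "NULL" ~ NA_character_,        # "NULL" strings become NA
      TRUE ~ time_to_pay_tax
    ),
    time_to_pay_tax = as.numeric(time_to_pay_tax)
  )
```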
Use the link below to download the dataset for the year 2019:
And use the link below to download the dataset for the year 2020:
Put these new files in the `data` folder within your `DS202W` project folder.
Then use the link below to download the lab file:
We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.
Here are the instructions for this lab:
Import required libraries:
```r
# Tidyverse packages we will use
library(dplyr)
library(tidyr)
library(readr)
library(stringr)

# Tidymodels packages we will use
library(rsample)
library(yardstick)
library(parsnip)
library(recipes)
library(workflows)
library(tune)

# Decision tree engine
library(rpart)
```
Read the 2019 data set:
This is the first of the brand-new datasets you've just downloaded.
```r
# Modify the filepath if needed
filepath <- "data/corruption_data_2019_nomissing.csv"
corruption_data_2019 <- read_csv(filepath)
```
Part I - Meet a new data set (20 min)
We've been doing a lot of training vs testing splits in this course, primarily by hand and focused on a particular point in time (e.g., test samples are defined to start in 2019).
But not all splits need to be done this way. Some data sets don't have a time-series component and can be split randomly for testing purposes. To make things more robust, we don't simply use one single train versus test split; instead, we use a technique called cross-validation and split the data into multiple train vs test splits.
In this lab, we'll learn how to use cross-validation to build a model that is more robust to changes in the data. Let's kick things off by cleaning up our data and randomly splitting it into a training set (70% of data points/rows) and a test set (30% of data points/rows).
🧑‍🏫 TEACHING MOMENT:
(Your class teacher will guide you through this section. Just run all the code chunks below together with your class teacher.)
Our goal in this lab is to predict the level of corruption in a given country. The Corruption Perception Index (CPI), already present in our data set (in the `cpi_score` column), sorts countries according to their perceived corruption levels. According to (Lima and Delen 2020), "[t]he index captures the assessments of domain experts on corrupt behavioral information, originating a scale from 0 to 100 where economies close to 0 are perceived as highly corrupt while economies close to 100 are perceived as less corrupt". In other words, the CPI already gives us a way to classify countries on scales of corruption. To go from a scale of 0 to 100 (the CPI scale) to a categorical variable (which is what we need for our classification), we simply need the following definition:
Corruption class: We define the following levels of corruption based on the CPI scale:
- if the CPI is lower than 50, then the corruption level is `poor`
- if the CPI is between 50 (included) and 70 (excluded), then the corruption level is `average`
- if the CPI is 70 or higher, then the corruption level is `good`

We store the result in the `corruption_class` column.
Convert the `corruption_class` column from character to factor.
🎯 ACTION POINTS:
1. Create the `corruption_class` column and convert it to `factor`.
2. Now, let's randomly split our dataset into a training set (containing 70% of the rows in our data) and a test set (containing 30% of the rows in our data).
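The first action point could be tackled like this (a minimal sketch; one possible implementation using `dplyr::case_when()`, with the cut-offs taken from the definition above):

```r
library(dplyr)

# Derive corruption_class from cpi_score and store it as a factor
corruption_data_2019 <- corruption_data_2019 %>%
  mutate(
    corruption_class = case_when(
      cpi_score < 50 ~ "poor",     # CPI below 50
      cpi_score < 70 ~ "average",  # CPI in [50, 70)
      TRUE ~ "good"                # CPI of 70 or higher
    ),
    corruption_class = factor(corruption_class, levels = c("poor", "average", "good"))
  )
```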
```r
# Randomly split the initial data frame into training and testing sets (70% and 30% of rows, respectively)
split <- initial_split(corruption_data_2019, prop = 0.7)
```
What is in the training and testing set?
To get the actual data assigned to either set, use the `rsample::training()` and `rsample::testing()` functions:
```r
training_data <- training(split)
testing_data <- testing(split)
```
For curiosity, let's confirm which unique countries are represented in our training and test sets and by how many data points. We can uncover this mystery by counting the unique values in the `economy` column (you could do the same per `region`).
```r
# tallying the number of rows per country in the training set
training_data %>%
  group_by(economy) %>%
  tally()

# tallying the number of rows per country in the test set
testing_data %>%
  group_by(economy) %>%
  tally()
```
Just how many non-empty records are there in our datasets per country?
```r
# tallying the number of non-empty rows per country in the training set
training_data %>%
  drop_na() %>%
  group_by(economy) %>%
  tally() %>%
  knitr::kable()

# tallying the number of non-empty rows per country in the test set
testing_data %>%
  drop_na() %>%
  group_by(economy) %>%
  tally() %>%
  knitr::kable()
```
To browse the original Ease of Doing Business dataset, including a description of the variables (though some were recoded for ease of comprehension) found in the `Metadata` sheet of the file, you can download the Excel file below:
For more information about the other variables of the dataset (i.e. variables related to the Economic Freedom Index, the Education Index or the CPI), check out the explanations in (Lima and Delen 2020).
🗣️ DISCUSSION:
We defined corruption (and corruption levels) in a specific way, and we will be building a model to predict corruption in a way that fits this definition. Do you think that the definition we gave of corruption is satisfactory? Is it a good modelling objective? If not, what would you do differently? (You can look at the dataset documentation for further ideas).
Part II - Introduction to decision trees (40 min)
🧑🏻‍🏫 TEACHING MOMENT: In this part, you'll be learning about a new classification model called a decision tree, which is better suited than logistic regression to handling large numbers of features of varying types that potentially contain missing data.
Our dataset is a real-world dataset and ticks all these boxes:
- We mentioned earlier that our goal is to build a model to predict `corruption_class`. Aside from a few variables that don't look like likely predictors (e.g. `db_year`, `economy`, `region`, `income_group` (though this one is debatable), and `cpi_score` (it is correlated with the outcome variable so can't be used to predict it!)), we have a large number of potential features/predictors to choose from for our model (100, for a dataset that only has 140 rows/data points).
- Many variables include missing values.
- Though we smoothed things out, the variables had varying types (e.g., categorical values, strings or numerical values).
Your class teacher will explain the basic principle of a decision tree.
As usual, we start with a recipe
For computational reasons (too long to run!), we'll be building our model with a random subset of features from the original dataset. We'll randomly choose 25 variables from the original columns of the dataset, excluding the columns `country_code`, `db_year`, `economy`, `region`, `income_group` and `cpi_score`.
We construct a vector `predictors`, which contains the names of our chosen predictors:

```r
# list of the columns we exclude
remove <- c("country_code", "economy", "region", "income_group", "db_year", "cpi_score")

predictors <- corruption_data_2019 %>%
  colnames() %>%
  str_remove_all(paste(remove, collapse = "|")) %>%
  .[!(. == "")] %>%
  sample(25)
```
Then, we write our recipe:
```r
formula_string <- paste("corruption_class ~", paste(predictors, collapse = " + "))
formula <- as.formula(formula_string)

impute_rec <- recipe(formula, data = training_data) %>%
  step_impute_median(all_numeric(), -all_outcomes()) %>%
  prep()
```
Here, the only pre-processing step we perform before fitting our model is a simple median imputation, which fills in the missing values in each numeric column with the median of that column.
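If you are curious about what the imputation does, you can inspect the pre-processed training data with `recipes::bake()`; passing `new_data = NULL` to a prepped recipe returns the processed training set:

```r
# Inspect the training data after median imputation
impute_rec %>%
  bake(new_data = NULL) %>%
  glimpse()
```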
Now, we can fit a decision tree model on our data
You can create the model specification for a decision tree using this scaffolding code (you need to install the `rpart` package with the `install.packages("rpart")` command to be able to use this code and, of course, load the `rpart` library):
```r
# Create the specification of a model but don't fit it yet
dt_spec <- decision_tree(mode = "classification", tree_depth = 5) %>%
  set_engine("rpart")
```
Now that you have the model specification:
- Fit the model to the training set using a workflow and evaluate its performance with an appropriate metric (e.g. ROC curve/AUC)
- Use the fitted workflow to predict on the test set and evaluate its performance with the same metric (e.g. ROC curve/AUC)
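One possible way to do this (a minimal sketch; the object names `dt_wflow`, `dt_fit` and `test_preds` are our own, and we assume the `impute_rec` and `dt_spec` objects defined above; the `.pred_*` column names depend on your factor levels):

```r
# Bundle the recipe and the model specification into a workflow
dt_wflow <- workflow() %>%
  add_recipe(impute_rec) %>%
  add_model(dt_spec)

# Fit on the training set
dt_fit <- dt_wflow %>% fit(data = training_data)

# Predict on the test set and compute a multiclass ROC AUC
test_preds <- augment(dt_fit, new_data = testing_data)
test_preds %>%
  roc_auc(truth = corruption_class, .pred_poor, .pred_average, .pred_good)
```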
You can plot the fitted decision tree using the `rpart.plot` package:

```r
# `dt_model` is a fitted parsnip model object;
# if you fitted via a workflow, extract it first, e.g. with extract_fit_parsnip()
rpart.plot(dt_model$fit)
```
PS: You might need to install the `rpart.plot` package using the `install.packages("rpart.plot")` command.
🤔 What happens if you choose a different set of features to train your model?
Part III - Cross-validation (30 min)
🧑🏻‍🏫 TEACHING MOMENT: Your class teacher will briefly explain the concept of cross-validation.
- Question: Can you retrain your decision tree model using cross-validation?
  - Use the `initial_split()` and `training()` functions to split your data and extract your training set.
  - Apply the `vfold_cv()` function to the data for 10-fold cross-validation (10 folds is the default).
  - Fit the model using `fit_resamples()` and collect your metrics (as in the W05 lecture notebook). How does the model performance compare to before?
  - What happens when you tweak the tree parameters and the cross-validation parameters?
- Question: How does your model perform on (a subset of) the 2020 dataset?
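As a starting point, the cross-validation steps above could look like this (a minimal sketch; the object names `folds` and `cv_results` are our own, and we assume the `impute_rec` recipe and `dt_spec` specification from Part II):

```r
# 10-fold cross-validation on the training set
folds <- vfold_cv(training_data, v = 10)

# Fit the workflow on each fold
cv_results <- workflow() %>%
  add_recipe(impute_rec) %>%
  add_model(dt_spec) %>%
  fit_resamples(resamples = folds)

# Average metrics (accuracy and ROC AUC by default) across the 10 folds
collect_metrics(cv_results)
```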