LSE DS202W 2025: Week 5 - Cross-validation and decision tree algorithms
2024/25 Winter Term
Welcome to our fifth lab!
⚙️ Setup
Downloading the student notebook
Click on the below button to download the student notebook.
Loading libraries and functions
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import f1_score, root_mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lets_plot import *
LetsPlot.setup_html()
Downloading the data
Download the datasets we will use for this lab.
Use the links below to download this dataset:
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Understand the concept of decision boundaries using a 2-d feature space
- Fit a k-nearest neighbours classifier
- Implement k-fold cross validation
- Implement decision tree-based models for regression problems
- Consolidate skills in implementing train / test splits when the data are TSCS
Predicting who voted Brexit with k Nearest Neighbours (45 minutes)
Today, we are going to be demonstrating the utility of hyperparameter tuning using k-fold cross validation.
Exploring a feature space
We are going to take a look at data adapted from the European Social Survey, round 8 (ESS8)(European Social Survey European Research Infrastructure (ESS ERIC) 2023).
The ESS is a an academically-driven multi-country survey. Its stated aims are “firstly – to monitor and interpret changing public attitudes and values within Europe and to investigate how they interact with Europe’s changing institutions, secondly - to advance and consolidate improved methods of cross-national survey measurement in Europe and beyond, and thirdly - to develop a series of European social indicators, including attitudinal indicators”.
Aside from questions common to all countries, for this round of the survey, the data providers had asked participants from the UK questions around the recently-completed Brexit referendum and more specifically whether they had voted “Leave” (i.e “Brexit”) or “Remain” in the Brexit.
We will expand an example given for k nearest neighbours in Garibay Rubio et al. (2024), using two predictors. The full (adapted) data set consists of:
vtleave
: Whether or not a participant voted “Brexit” or “Remain”atchctr
: 0-10 attachment rating to the United Kingdomatcherp
: 0-10 attachment rating to Europe
= pd.read_csv("data/brexit-ess.csv") brexit
💡Tip: Don’t worry too much about the below code. All we are doing is creating a grid of potential values for illustration purposes. We somewhat adapted the code for this from here!
= [
unique_categories 'atchctr'].unique(),
brexit['atcherp'].unique()
brexit[
]
= ["atchctr", "atcherp"]
names
= pd.MultiIndex.from_product(
multiindex
unique_categories,=names
names
)
= brexit.groupby(["atchctr","atcherp"]).mean().reset_index()
brexit_grid
= (
brexit_grid
brexit_grid
.set_index(names)= np.nan)
.reindex(multiindex, fill_value
.reset_index() )
Let’s use ggplot
to explore this feature space.
("atchctr", "atcherp", colour = "vtleave")) +
ggplot(brexit_grid, aes(= 5) +
geom_point(size = np.arange(0,11,1)) +
scale_x_continuous(breaks = np.arange(0,11,1)) +
scale_y_continuous(breaks =element_blank(),
theme(panel_grid="bottom") +
legend_position= "Attachment to the United Kingdom", y = "Attachment to Europe",
labs(x = "Proportion of grid that voted Brexit")
colour )
👉 NOTE: The actual data set has many features that can be used. We are only using 2 features as it is hard to visualise a feature space that has more than two dimensions.
🗣️ CLASSROOM DISCUSSION: What does this graph show? How can we identify individuals who voted Brexit? What boundaries can we draw between the two classes?
👨🏻🏫 TEACHING MOMENT: All machine learning algorithms can draw boundaries.
Introducing k Nearest Neighbours
An algorithm that can be used to classify data points is k Nearest Neighbours (kNN for short). This algorithm works by taking the neighbours of an observation and having them “vote” on whether or not the observation in question belongs to a specific class.
👉 NOTE: The “neighbours” decide the fate of a data point based on them calculating their proximity to it in relation to other data points that do not belong to their neighbourhood.
The question is, how many neighbours get to vote in the decision? This is the key hyperparameter of nearest neighbours and something that we need to experiment with. The most widely acknowledged way to experiment with hyperparameters is via k-fold cross-validation.
k-fold Cross-validation
Let’s split our sample into training and testing sets.
Pre-processing
We are going to use a simple function to pre-process the data. All that is happening is that both features are normalised, meaning we subtract the mean from each feature and divide by their standard deviation.
⚠️ WARNING: Any algorithm that uses distance metrics must use this preprocessing step as features with higher variances will play a disproportionately high amount of influence. So, when you hear “this algorithm utilizes distance measures”, think “standardise features!”
Note that you could use scikit-learn
’s StandardScaler
method or scipy.stats
’s zscore
method to standardise data instead of your own custom function (see here for details).
# Create a function that normalises variables
def normalise(var):
return (var-var.mean())/var.std()
# Drop the target from the features
= brexit.drop("vtleave", axis = 1)
X
# Normalise features
= X.apply(lambda x: normalise(x), axis=0)
X
# Convert features into a numpy array
= X.to_numpy()
X
# Isolate the target
= brexit["vtleave"]
y
# Create a train / test split
= train_test_split(X,y,stratify=y,random_state=123) X_train, X_test, y_train, y_test
The below code instantiates a kNN classifier.
= KNeighborsClassifier() knn_model
Creating a nearest neighbours grid
We then need to specify a hyperparameter grid.
👉 NOTE: This task often involves a lot of experimentation so do not take these values to be gospel for any future kNN models you build.
= {"n_neighbors":[5, 25, 100, 250]} grid_knn
🗣️ CLASSROOM DISCUSSION: Which number in the grid would represent the most “complex” model?
👨🏻🏫 TEACHING MOMENT:
Your teacher will present some output to see what happens when we vary k.
# Create a function that varies k in k-NN models
def vary_neighbours(n_neighbours, features, target):
= KNeighborsClassifier(n_neighbors=n_neighbours)
knn_model return knn_model.fit(features, target)
# Suggest a k grid
= [5, 25, 100, 250]
neigh_list
# Loop model function of k grid
= [vary_neighbours(neighs, X, y) for neighs in neigh_list]
knn_models
# Create a feature grid
= brexit_grid.drop("vtleave",axis=1).to_numpy()
brexit_feature_grid = normalise(brexit_feature_grid)
brexit_feature_grid
# Apply predictions to grid
= [model.predict(brexit_feature_grid) for model in knn_models]
knn_grid_preds
# Create a list of data frames of predictions
= [pd.DataFrame({"preds": pred.astype(bool)}) for pred in knn_grid_preds]
knn_grid_df
# Append original feature values to list of predictions
= [pd.concat([brexit_grid, pred], axis = 1) for pred in knn_grid_df]
knn_grid_df
# Concatenate to a single data frame
= pd.concat(knn_grid_df, keys = neigh_list, names = ["k","n"])
knn_grid_df = knn_grid_df.reset_index().drop("n", axis = 1) knn_grid_df
# Plot the output
("atchctr", "atcherp", colour = "vtleave")) +
ggplot(knn_grid_df, aes(= "k") +
facet_wrap(facets = 2) +
geom_point(size "atchctr", "atcherp", fill = "preds"), alpha = 0.25) +
geom_tile(aes(= np.arange(0,11,1)) +
scale_x_continuous(breaks = np.arange(0,11,1)) +
scale_y_continuous(breaks =element_blank(),
theme(panel_grid="bottom") +
legend_position= "Attachment to the UK", y = "Attachment to Europe",
labs(x = "Proportion Brexit vote",
colour = "Predicted Brexit vote")
fill )
Estimating a cross-validated model
We now have all the information we need to estimate a cross-validated model. We have
- A k-NN model instantiation (
knn_model
) - A grid of hyperparameter options, expressed as a dictionary (
grid_knn
)
We can now add these as the first and second arguments of GridSearchCV
, and specify the number of cross-validation folds, which we will set to 10. This will result in 40 different models being run (10-folds \(\times\) 4 hyperparameter combinations). We will ask GridSearchCV
to return the f1-score to evaluate the different hyperparameter values. We can then fit the GridSearchCV
instantiation to the training data.
# Code here
Plot hyperparameters
After knn_cv
compiles, locate the cv_results_
attribute of the model object. Try plotting the mean f1-score of for each k using a plot of your choice.
💡HINT: If you are unsure as to how to get the data into the correct format, check out the pandas
introduction in the week 1 lab.
# Code here
🗣️ CLASSROOM DISCUSSION: Which k is best?
Fit the best model to the full training set
We can then build a final model using the best hyperparameter combination. Let’s take the f1-score as the measure that we want to maximise.
# Select the best model
# Use the best parameter combination and fit on the entire training data
Evaluate the model on the test set
From there, all we need to do is follow the evaluation step that we have done for previous models.
# Create predictions from k-NN model for the test set
# Calculate f1-score
# Report the result
Predicting childcare costs in US counties with tree-based models (45 minutes)
We are going to implement a decision tree, Random Forest and an XGBoost.
Exploring the data set
= pd.read_csv("data/childcare-costs.csv") childcare
Our target is mcsa
, defined as “Weekly, full-time median price charged for Center-Based Care for those who are school age based on the results reported in the market rate survey report for the county or the rate zone/cluster to which the county is assigned.” Click here to see a full list of variables. Let’s explore the data with a few visualisations.
Create a grouped train / test split
If you have inspected childcare
, you will see that it tracks 1,786 counties from 2008 to 2018.
📝Task: Create a train / test split using all years up to 2018 to train the model and 2018 to test the model.
# Create a list of all variables with "county" or "state" as a prefix using a logical condition
# Use the above list to drop these variables from the data set
# Use 2018 to test and all other years to train
# Split the data set into features / target
Decision trees
Fit a decision tree that has a maximum depth of three levels using the appropriate function (see the first code chunk). Next, use the plot_tree
function on the model, setting the feature names to the column names.
# Instantiate a decision tree for regression
# Fit the model to the training data
# Plot the output
🗣️ CLASSROOM DISCUSSION: Which feature has the algorithm deemed most important for splitting the data? Can you explain what the tree is doing?
Evaluate the model on the test set
Let’s see how well our simple decision tree does. Here’s a function that takes five parameters:
model
takes ascikit-learn
model objecttest_features
takes apandas
data frame or anumpy
array of test featurestrue_values
takes apandas
data frame or anumpy
array of test outcomesmodel_name
takes a user-created string that specifies the model usedindex
takes an integer value
The output is a 1-row data frame which shows the name of the model and it’s associated RMSE on the test set.
def metric_to_df(model, test_features, true_values, model_name, index):
= model.predict(test_features)
preds = np.round(root_mean_squared_error(true_values,preds), 0)
rmse return pd.DataFrame({"model":model_name, "rmse":rmse}, index=[index])
Try using this function on the model we have built above.
# Code here
Instantiate a random forest model
- Take a look at the library code chunk at the beginning of the notebook and instantiate the relevant model.
- Set
max_features = "sqrt"
- Set
n_estimators = 1000
- Set
🗣️ CLASSROOM DISCUSSION: Why do we do this? What happens if we set max_features
equal to sqrt
?
- Fit the model to the training data.
- Evaluate the model on the test set.
# Instantiate a random forest with the required hyperparameters
# Fit the model to the training data
# Evaluate on the test data
🗣️ CLASSROOM DISCUSSION: Do we see an improvement?
Yes, and quite a dramatic one! Test RMSE is reduced from $34 to $25.
Instantiate an XGBoost model
- Take a look at the library code chunk at the beginning of the notebook and instantiate the relevant model.
- Set
n_estimators = 1000
- Set
colsample_bytree = 0.135
- Set
learning_rate = 0.1
- Set
# Instantiate an XGBoost with the required hyperparameters
# Fit the model to the training data
# Evaluate on the test data
🗣️ CLASSROOM DISCUSSION: Do we see an improvement?
Yes, but only modestly if we compare to the random forest. Test RMSE is only reduced from $25 to $24!
Which model is best?
Create a summary of your findings using either a table or a graph.
# Code here