LSE DS202W 2025: Week 5 - Lab Solutions
2024/25 Winter Term
This solution file follows the format of the Jupyter Notebook (.ipynb) file you had to fill in during the lab session.
⚙️ Setup
Loading libraries and functions
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import f1_score, root_mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lets_plot import *
LetsPlot.setup_html()
Predicting who voted Brexit with k Nearest Neighbours (45 minutes)
Today, we are going to be demonstrating the utility of hyperparameter tuning using k-fold cross validation.
Exploring a feature space
We are going to take a look at data adapted from the European Social Survey, round 8 (ESS8) (European Social Survey European Research Infrastructure (ESS ERIC) 2023).
The ESS is an academically-driven multi-country survey. Its stated aims are “firstly – to monitor and interpret changing public attitudes and values within Europe and to investigate how they interact with Europe’s changing institutions, secondly - to advance and consolidate improved methods of cross-national survey measurement in Europe and beyond, and thirdly - to develop a series of European social indicators, including attitudinal indicators”.
Aside from questions common to all countries, in this round of the survey participants from the UK were asked questions about the recently completed Brexit referendum, specifically whether they had voted “Leave” (i.e. “Brexit”) or “Remain”.
We will expand an example given for k nearest neighbours in Garibay Rubio et al. (2024), using two predictors. The full (adapted) data set consists of:
- vtleave: whether a participant voted “Brexit” or “Remain”
- atchctr: 0-10 attachment rating to the United Kingdom
- atcherp: 0-10 attachment rating to Europe
= pd.read_csv("data/brexit-ess.csv") brexit
💡Tip: Don’t worry too much about the code below. All we are doing is creating a grid of potential values for illustration purposes. We adapted the code for this from here!
unique_categories = [
    brexit['atchctr'].unique(),
    brexit['atcherp'].unique()
]

names = ["atchctr", "atcherp"]

multiindex = pd.MultiIndex.from_product(
    unique_categories,
    names=names
)

brexit_grid = brexit.groupby(["atchctr", "atcherp"]).mean().reset_index()

brexit_grid = (
    brexit_grid
    .set_index(names)
    .reindex(multiindex, fill_value=np.nan)
    .reset_index()
)
Let’s use ggplot to explore this feature space.
("atchctr", "atcherp", colour = "vtleave")) +
ggplot(brexit_grid, aes(= 5) +
geom_point(size = np.arange(0,11,1)) +
scale_x_continuous(breaks = np.arange(0,11,1)) +
scale_y_continuous(breaks =element_blank(),
theme(panel_grid="bottom") +
legend_position= "Attachment to the United Kingdom", y = "Attachment to Europe",
labs(x = "Proportion of grid that voted Brexit")
colour )
👉 NOTE: The actual data set has many features that can be used. We are only using 2 features as it is hard to visualise a feature space that has more than two dimensions.
🗣️ CLASSROOM DISCUSSION: What does this graph show? How can we identify individuals who voted Brexit? What boundaries can we draw between the two classes?
👨🏻🏫 TEACHING MOMENT: All machine learning classifiers work by drawing some form of boundary between classes in the feature space.
Introducing k Nearest Neighbours
An algorithm that can be used to classify data points is k Nearest Neighbours (kNN for short). This algorithm works by taking the neighbours of an observation and having them “vote” on whether or not the observation in question belongs to a specific class.
👉 NOTE: A data point’s “neighbours” are simply the training observations closest to it according to a distance metric. Only these nearby points get a vote on its class; observations outside the neighbourhood have no say.
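To make this concrete, here is a minimal hand-rolled sketch of the voting logic. The toy data and the knn_predict helper are our own, assuming Euclidean distance and a simple majority vote; later on we let scikit-learn’s KNeighborsClassifier do all of this for us.

import numpy as np

# Toy training data: two features, binary labels (illustrative values only)
X_toy = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [5.0, 5.0], [5.5, 4.5]])
y_toy = np.array([0, 0, 0, 1, 1])

def knn_predict(x_new, X, y, k=3):
    # Euclidean distance from the new point to every training point
    dists = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points (the "neighbourhood")
    nearest = np.argsort(dists)[:k]
    # The neighbours "vote": predict the majority class among them
    return np.bincount(y[nearest]).argmax()

print(knn_predict(np.array([0.8, 0.8]), X_toy, y_toy, k=3))  # -> 0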
The question is, how many neighbours get to vote in the decision? This is the key hyperparameter of nearest neighbours and something that we need to experiment with. The most widely acknowledged way to experiment with hyperparameters is via k-fold cross-validation.
k-fold Cross-validation
Let’s split our sample into training and testing sets.
Pre-processing
We are going to use a simple function to pre-process the data. All that is happening is that both features are normalised, meaning we subtract each feature’s mean and divide by its standard deviation.
⚠️ WARNING: Any algorithm that relies on distance metrics needs this preprocessing step, as features with higher variances would otherwise exert a disproportionately large influence. So, when you hear “this algorithm utilises distance measures”, think “standardise features!”
Note that you could use scikit-learn’s StandardScaler class or scipy.stats’s zscore function to standardise data instead of writing your own custom function (see here for details).
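For example, here is a minimal sketch of the StandardScaler route, reusing the brexit data frame loaded above. One caveat worth flagging: StandardScaler divides by the population standard deviation, whereas a pandas .std()-based function uses the sample standard deviation, so the two approaches can differ very slightly.

from sklearn.preprocessing import StandardScaler

# Learn each column's mean and standard deviation, then standardise in one step
scaler = StandardScaler()
X_scaled = scaler.fit_transform(brexit.drop("vtleave", axis=1))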
# Create a function that normalises variables
def normalise(var):
    return (var - var.mean()) / var.std()

# Drop the target from the features
X = brexit.drop("vtleave", axis=1)

# Normalise features
X = X.apply(lambda x: normalise(x), axis=0)

# Convert features into a numpy array
X = X.to_numpy()

# Isolate the target
y = brexit["vtleave"]

# Create a train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=123)
The code below instantiates a kNN classifier.
knn_model = KNeighborsClassifier()
Creating a nearest neighbours grid
We then need to specify a hyperparameter grid.
👉 NOTE: This task often involves a lot of experimentation so do not take these values to be gospel for any future kNN models you build.
= {"n_neighbors":[5, 25, 100, 250]} grid_knn
🗣️ CLASSROOM DISCUSSION:
Which number in the grid would represent the most “complex” model?
The smallest value, k = 5: the fewer neighbours that vote, the more flexible (and wigglier) the decision boundary becomes, whereas a very large k smooths predictions towards the overall majority class.
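One quick way to convince yourself of this (a sketch reusing the training split created above) is to compare training accuracy at the two extremes of the grid; the smallest k hugs the training data far more closely.

# Small k -> a more flexible model that fits the training data more tightly
for k in [5, 250]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k = {k}: training accuracy = {model.score(X_train, y_train):.2f}")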
👨🏻🏫 TEACHING MOMENT:
Your teacher will present some output to see what happens when we vary k.
# Create a function that varies k in k-NN models
def vary_neighbours(n_neighbours, features, target):
    knn_model = KNeighborsClassifier(n_neighbors=n_neighbours)
    return knn_model.fit(features, target)

# Suggest a k grid
neigh_list = [5, 25, 100, 250]

# Loop the model function over the k grid
knn_models = [vary_neighbours(neighs, X, y) for neighs in neigh_list]

# Create a feature grid
brexit_feature_grid = brexit_grid.drop("vtleave", axis=1).to_numpy()
brexit_feature_grid = normalise(brexit_feature_grid)

# Apply predictions to grid
knn_grid_preds = [model.predict(brexit_feature_grid) for model in knn_models]

# Create a list of data frames of predictions
knn_grid_df = [pd.DataFrame({"preds": pred.astype(bool)}) for pred in knn_grid_preds]

# Append original feature values to list of predictions
knn_grid_df = [pd.concat([brexit_grid, pred], axis = 1) for pred in knn_grid_df]

# Concatenate to a single data frame
knn_grid_df = pd.concat(knn_grid_df, keys = neigh_list, names = ["k", "n"])
knn_grid_df = knn_grid_df.reset_index().drop("n", axis = 1)
# Plot the output
(
    ggplot(knn_grid_df, aes("atchctr", "atcherp", colour = "vtleave")) +
    facet_wrap(facets = "k") +
    geom_point(size = 2) +
    geom_tile(aes("atchctr", "atcherp", fill = "preds"), alpha = 0.25) +
    scale_x_continuous(breaks = np.arange(0,11,1)) +
    scale_y_continuous(breaks = np.arange(0,11,1)) +
    theme(panel_grid = element_blank(),
          legend_position = "bottom") +
    labs(x = "Attachment to the UK", y = "Attachment to Europe",
         colour = "Proportion Brexit vote",
         fill = "Predicted Brexit vote")
)
Estimating a cross-validated model
We now have all the information we need to estimate a cross-validated model. We have:
- A k-NN model instantiation (knn_model)
- A grid of hyperparameter options, expressed as a dictionary (grid_knn)
We can now add these as the first and second arguments of GridSearchCV, and specify the number of cross-validation folds, which we will set to 10. This will result in 40 different models being run (10 folds \(\times\) 4 hyperparameter combinations). As a final step, we will ask GridSearchCV to return the f1-score to evaluate the different hyperparameter values.
knn_cv = GridSearchCV(knn_model, grid_knn, cv = 10, scoring="f1")
_ = knn_cv.fit(X_train, y_train)
👉 NOTE: You may be wondering why we used _. In this context, _ is a throw-away variable that can be used to suppress the model output that would otherwise appear underneath code chunks.
Plot hyperparameters
After knn_cv has finished fitting, locate the cv_results_ attribute of the model object. Try plotting the mean f1-score for each k using a plot of your choice.
💡HINT: If you are unsure as to how to get the data into the correct format, check out the pandas introduction in the week 1 lab.
knn_output = pd.DataFrame(knn_cv.cv_results_)

(
    ggplot(knn_output, aes("param_n_neighbors", "mean_test_score")) +
    geom_line(linetype = "dashed") +
    geom_point() +
    labs(x = "Number of neighbours", y = "Mean f1 score")
)
🗣️ CLASSROOM DISCUSSION: Which k is best?
Fit the best model to the full training set
We can then build a final model using the best hyperparameter combination.
# Select the best model
knn_final = KNeighborsClassifier(n_neighbors=knn_cv.best_estimator_.n_neighbors)

# Use the best parameter combination and fit on the entire training data
_ = knn_final.fit(X_train, y_train)
Evaluate the model on the test set
From there, all we need to do is follow the evaluation step that we have done for previous models.
# Create predictions from k-NN model for the test set
preds_knn = knn_final.predict(X_test)

# Calculate f1-score
knn_f1 = np.round(f1_score(y_test, preds_knn), 2)

# Report the result
print(f"Our best k-NN model produces an f1-score of {knn_f1} on the test set!")
Our best k-NN model produces an f1-score of 0.71 on the test set!
Predicting childcare costs in US counties with tree-based models (45 minutes)
We are going to implement a decision tree, a random forest, and an XGBoost model.
Exploring the data set
= pd.read_csv("data/childcare-costs.csv") childcare
Our target is mcsa, defined as “Weekly, full-time median price charged for Center-Based Care for those who are school age based on the results reported in the market rate survey report for the county or the rate zone/cluster to which the county is assigned.” Click here to see a full list of variables. Let’s explore the data with a few visualisations.
Median child care costs are ~ $96.53 a week per county
med_cost = np.median(childcare["mcsa"])

(
    ggplot(childcare, aes("mcsa")) +
    geom_histogram(bins = 40, fill = "lightgrey", colour = "black") +
    geom_vline(xintercept = med_cost, linewidth = 3, linetype = "dashed", colour = "red") +
    labs(x = "Weekly cost of childcare (USD)", y = "Count")
)
The costs of childcare increase in counties with higher median incomes
("mhi_2018", "mcsa")) +
ggplot(childcare, aes(= 0.125, size = 2) +
geom_point(alpha = "lm", se = False, colour = "blue") +
geom_smooth(method = "Household income (2018 USD)",
labs(x = "Weekly cost of childcare (USD)")
y )
There is considerable variation in the costs of childcare based on the ethnic composition of the county
👉 NOTE: We adapted this graph from Julia Silge (click here for her analysis of the same data set using the tidymodels ecosystem in R).
selected_cols = childcare.columns.str.contains("mcsa|one_race_|two_races")

childcare_melted = pd.melt(childcare.loc[:, selected_cols],
                           id_vars="mcsa",
                           var_name="race",
                           value_name="perc")

def transform_category(race):
    if race == "one_race_w":
        return "Both White"
    elif race == "one_race_b":
        return "Both Black"
    elif race == "one_race_i":
        return "Both Native"
    elif race == "one_race_a":
        return "Both Asian"
    elif race == "one_race_h":
        return "Both Hispanic"
    elif race == "two_races":
        return "Mixed Parents"

childcare_melted["race"] = childcare_melted["race"].map(lambda x: transform_category(x))

(
    ggplot(childcare_melted, aes("perc", "mcsa")) +
    facet_wrap(facets = "race", scales = "free_x") +
    geom_point(alpha = 0.05) +
    geom_smooth(method = "lm", se = False) +
    labs(x = "Percent of racial family types in counties",
         y = "Weekly cost of childcare (USD)")
)
Since Python 3.10, in addition to if...elif...else statements (see the Python documentation), you also have match...case statements, so you could rewrite the transform_category function as:
def transform_category(race):
    match race:
        case "one_race_w":
            return "Both White"
        case "one_race_b":
            return "Both Black"
        case "one_race_i":
            return "Both Native"
        case "one_race_a":
            return "Both Asian"
        case "one_race_h":
            return "Both Hispanic"
        case "two_races":
            return "Mixed Parents"
        case _:
            return "Unknown Category"  # Optional default case to handle unexpected inputs gracefully
Create a grouped train / test split
If you have inspected childcare, you will see that it tracks 1,786 counties from 2008 to 2018.
📝Task: Create a train / test split using all years up to 2018 to train the model and 2018 to test the model.
# Create a list of all variables with "county" or "state" as a prefix using a logical condition
feats_to_drop = childcare.columns[childcare.columns.str.contains("county|state")]

# Use the above list to drop these variables from the data set
childcare_cleaned = childcare.drop(feats_to_drop, axis=1)

childcare_cleaned

# Use 2018 to test and all other years to train
train, test = childcare_cleaned.query("study_year < 2018"), childcare_cleaned.query("study_year == 2018")

# Split the data set into features / target
X_train, y_train = train.drop("mcsa", axis=1), train["mcsa"]
X_test, y_test = test.drop("mcsa", axis=1), test["mcsa"]
Decision trees
Fit a decision tree that has a maximum depth of three levels using the appropriate function (see the first code chunk). Next, use the plot_tree function on the model, setting the feature names to the column names.
# Instantiate a decision tree for regression
dt_model = DecisionTreeRegressor(max_depth=3)

# Fit the model to the training data
dt_model.fit(X_train, y_train)

# Plot the output
_ = plot_tree(dt_model,
              feature_names=X_train.columns,
              fontsize=5,
              precision=1,
              impurity=False,
              proportion=False)
🗣️ CLASSROOM DISCUSSION: Which feature has the algorithm deemed most important for splitting the data? Can you explain what the tree is doing?
The decision tree takes a “top-down” and “greedy” approach, meaning it looks at the variable that can make the best split in the data, performs the split based on the best threshold of that variable, and then repeats the process for each split until the required depth is reached.
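To see what “greedy” means in code, here is a minimal sketch of a single split search (the best_split helper is our own illustration, not scikit-learn’s actual implementation): for one candidate feature it tries every observed threshold and keeps the one that minimises the summed squared error of the two resulting groups.

def best_split(x, y):
    # Greedily search for the threshold on one feature that minimises
    # the total squared error around the two resulting group means
    best_thresh, best_sse = None, np.inf
    for thresh in np.unique(x):
        left, right = y[x <= thresh], y[x > thresh]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_thresh, best_sse = thresh, sse
    return best_thresh, best_sse

# e.g. the best single threshold on median household income for predicting mcsa
print(best_split(X_train["mhi_2018"].to_numpy(), y_train.to_numpy()))

A real tree repeats this search across every available feature, picks the overall winner, and then recurses into each half until the maximum depth is reached.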
Evaluate the model on the test set
Let’s see how well our simple decision tree does. Here’s a function that takes five parameters:

- model takes a scikit-learn model object
- test_features takes a pandas data frame or a numpy array of test features
- true_values takes a pandas data frame or a numpy array of test outcomes
- model_name takes a user-created string that specifies the model used
- index takes an integer value

The output is a 1-row data frame which shows the name of the model and its associated RMSE on the test set.
def metric_to_df(model, test_features, true_values, model_name, index):
    preds = model.predict(test_features)
    rmse = np.round(root_mean_squared_error(true_values, preds), 0)
    return pd.DataFrame({"model": model_name, "rmse": rmse}, index=[index])
Try using this function on the model we have built above.
dt_output = metric_to_df(dt_model, X_test, y_test, "Decision Tree", 0)
print(dt_output)
           model  rmse
0  Decision Tree  34.0
Instantiate a random forest model
- Take a look at the library code chunk at the beginning of the notebook and instantiate the relevant model.
- Set max_features = "sqrt"
- Set n_estimators = 1000
🗣️ CLASSROOM DISCUSSION: Why do we do this? What happens if we set max_features equal to sqrt?
By limiting the number of candidate variables considered at each split to the square root of the total number of features, we ensure that different variables are randomly selected when building our “forest”, which decorrelates the individual trees (see the sketch after this task list).
- Fit the model to the training data.
- Evaluate the model on the test set.
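Before fitting the real thing, here is a rough sketch of the feature-subsampling idea (illustrative only: RandomForestRegressor draws a fresh random subset of candidate features at every split of every tree, not once per tree as this toy snippet might suggest).

# With max_features="sqrt", each split only considers ~sqrt(n_features) candidates,
# which decorrelates the trees that get averaged together
n_features = X_train.shape[1]
subset_size = int(np.sqrt(n_features))
rng = np.random.default_rng(123)
print(rng.choice(X_train.columns.to_numpy(), size=subset_size, replace=False))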
# Instantiate a random forest with the required hyperparameters
rf_model = RandomForestRegressor(max_features="sqrt", n_estimators=1000)

# Fit the model to the training data
_ = rf_model.fit(X_train, y_train)

# Evaluate on the test data
rf_output = metric_to_df(rf_model, X_test, y_test, "Random Forest", 1)
print(rf_output)
           model  rmse
1  Random Forest  25.0
🗣️ CLASSROOM DISCUSSION: Do we see an improvement?
Yes, and quite a dramatic one! Test RMSE is reduced from $34 to $25.
Instantiate an XGBoost model
- Take a look at the library code chunk at the beginning of the notebook and instantiate the relevant model.
- Set n_estimators = 1000
- Set colsample_bytree = 0.135
- Set learning_rate = 0.1
# Instantiate an XGBoost regressor with the required hyperparameters
xgb_model = XGBRegressor(n_estimators = 1000, learning_rate = 0.1, colsample_bytree = 0.135)

# Fit the model to the training data
_ = xgb_model.fit(X_train, y_train)

# Evaluate on the test data
xgb_output = metric_to_df(xgb_model, X_test, y_test, "XGBoost", 2)

xgb_output
     model  rmse
2  XGBoost  24.0
🗣️ CLASSROOM DISCUSSION: Do we see an improvement?
Yes, but only modestly if we compare to the random forest. Test RMSE is only reduced from $25 to $24!
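For intuition on what learning_rate = 0.1 is doing, here is a hand-rolled flavour of gradient boosting for squared error (our own illustrative sketch, much simplified relative to what XGBoost actually does): each new shallow tree is fit to the residuals of the current predictions, and its contribution is shrunk by the learning rate, so the ensemble improves in many small, cautious steps.

# Start from a constant prediction, then repeatedly fit trees to the residuals
boost_preds = np.full(len(y_train), y_train.mean())
for _ in range(10):
    tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train - boost_preds)
    # Shrink each tree's contribution by the learning rate (0.1 here)
    boost_preds += 0.1 * tree.predict(X_train)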
Which model is best?
Create a summary of your findings using either a table or a graph.
all_output = pd.concat([dt_output, rf_output, xgb_output], axis=0)
all_output = all_output.sort_values(by="rmse")
("rmse","model")) +
ggplot(all_output, aes(="identity") +
geom_bar(stat=element_blank()) +
theme(panel_grid_major_y="Root mean squared error", y="")
labs(x )