🛣️ Week 05 - Lab Roadmap (90 min)
Parameter tuning with tidymodels
🥅 Learning Objectives
- Calculate confusion matrices of classification models
- Distinguish common metrics for classification models
- In particular, distinguish accuracy from precision and recall
- Grasp the concept of thresholds in classification models
- Experiment with changing the threshold of a logistic regression
- Freely explore creating logistic regression models

This week you can still use ChatGPT if you like but you will not be asked to do so. Go about the lab as usual, freely interacting with others, the lab material, the Web or any other resource you may find useful.
The only thing we ask is that you fill out the brief survey at the end of the lab: 🔗 link, and when asked if you were asked to use ChatGPT, please answer no.
Thanks for being a GENIAL participant!
📋 Lab Tasks
This week we do not have any 🧑‍🏫 TEACHING MOMENT. Instead, you are to follow the material below and play with models and parameters by yourself.
Of course, your class teacher will be there to help you if you need it, and they might choose to do a live demo of some material to address common questions.
Part I: Starting Point (10 min)
First off, let's get you set up. The code you will use as a starting point is similar to the one we used last week in the lecture but introduces some new variables. Here's what's new:
Columns

- `yearly_rate_increase`: This variable captures the percentage change in the average house price in a specific region, comparing it to the same month in the prior year.
- `historical_avg_increase`: This one's a historical variable. It shows the average percentage increase in house prices in a given region, calculated from when the records first started.
- `price_up`: A binary variable that answers a simple yet critical question: has the average house price in a region risen more than usual? This is essentially a check of whether the current rate increase is higher than the `historical_avg_increase`. If it is, the value is `Yes`; otherwise it is `No`.
Data Split

The data is split into two parts:

- `dataset_train`: the training set. It contains all the data up to the end of 2018.
- `dataset_test`: the test set. It contains all the data from 2019 onwards.
Note that this time, we do not need to set the start date to 2005. Data about average price is available from 1969 onwards, so we can use all the data we have.
🎯 ACTION POINTS:
Create a `.qmd` file for today's lab.

Add a code chunk and reserve it for loading the packages you will use today.
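If you are not sure what to put in that chunk yet, here is a minimal sketch of a setup chunk; the package list is an assumption based on the functions used later in this lab, so adjust it to whatever you end up using:

```r
# Packages assumed from the functions used in this lab (adjust as needed)
library(tidyverse)   # dplyr verbs, readr::read_csv(), ggplot2
library(lubridate)   # dmy(), year()
library(janitor)     # clean_names()
library(tidymodels)  # recipes, parsnip, workflows, yardstick, broom
```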
Read the UK HPI dataset:
```r
uk_hpi <- readr::read_csv("data/UK-HPI-full-file-2023-06.csv") %>%
  mutate(Date = dmy(Date)) %>%
  janitor::clean_names()
```
Create a `dataset` object, as well as `dataset_train` and `dataset_test` objects, with the following code:

```r
UK_countries <- c("England", "Wales", "Scotland", "Northern Ireland")

dataset <- uk_hpi %>%
  filter(region_name %in% UK_countries) %>%
  group_by(region_name) %>%
  arrange(date) %>%
  mutate(
    lag_12_average_price = lag(average_price, 12),
    yearly_rate_increase = (average_price / lag_12_average_price) - 1
  ) %>%
  drop_na(lag_12_average_price) %>%
  mutate(
    sum_past_rate_increase  = cumsum(yearly_rate_increase) - yearly_rate_increase,
    n_past_rows             = row_number() - 1,
    historical_avg_increase = sum_past_rate_increase / n_past_rows,
    price_up = factor(yearly_rate_increase > historical_avg_increase,
                      labels = c("No", "Yes"),
                      levels = c(FALSE, TRUE),
                      ordered = TRUE)
  ) %>%
  drop_na(historical_avg_increase) %>%
  arrange(desc(date), region_name) %>%
  select(-c(sum_past_rate_increase, n_past_rows))

dataset_train <- dataset %>% filter(year(date) <= 2018)
dataset_test  <- dataset %>% filter(year(date) > 2018)
```
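Optionally, you may want to sanity-check where the split falls; this quick check is just a suggestion, not part of the required steps:

```r
# dataset is grouped by region_name, so this prints one row per UK country
dataset_train %>% summarise(last_train_date = max(date))
dataset_test %>% summarise(first_test_date = min(date))
```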
Create the baseline workflow and recipe:
```r
log_rec_specification <-
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

rec_baseline <-
  recipe(price_up ~ ., data = dataset_train) %>%
  update_role(-c(price_up), new_role = "ID") %>%
  step_lag(c(yearly_rate_increase, historical_avg_increase), lag = 1) %>%
  step_naomit(all_predictors(), skip = FALSE) %>%
  prep()

baseline_wf <-
  workflow() %>%
  add_recipe(rec_baseline) %>%
  add_model(log_rec_specification)
```
Fit the baseline workflow and extract the fitted model:
```r
baseline_fit <- baseline_wf %>% fit(data = dataset_train)
baseline_model <- baseline_fit %>% extract_fit_parsnip()
```
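If you are curious about the fitted coefficients at this point, `tidy()` (from broom, loaded with tidymodels) works on the extracted parsnip fit; this inspection is entirely optional:

```r
# Coefficient estimates of the fitted logistic regression (optional)
baseline_model %>% tidy()
```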
To see how the model performs, create a separate chunk, use `augment`, then select the relevant variables:

```r
baseline_model %>%
  augment(rec_baseline %>% bake(dataset_train)) %>%
  select(date, region_name, price_up, .pred_class, .pred_Yes, .pred_No)
```
💡 REMEMBER: Logistic regression outputs a number between 0 and 1 that indicates the probability of the outcome being `Yes`. Whenever `.pred_Yes` is greater than 0.5, the model fills the `.pred_class` column with the value `Yes`.

But the best summary is the confusion matrix, a table that shows how many times the model got it right and how many times it got it wrong. To create it, use the `conf_mat` function from the `yardstick` package:

```r
baseline_model %>%
  augment(rec_baseline %>% bake(dataset_train)) %>%
  conf_mat(truth = price_up, estimate = .pred_class)
```
or perhaps more visually:

```r
g <- baseline_model %>%
  augment(rec_baseline %>% bake(dataset_train)) %>%
  conf_mat(truth = price_up, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

# the output is a ggplot object, so you can customise it if you like
g
```
Tip: try setting `type = "mosaic"` to see a different type of plot.
Part II: Read about metrics and `event_level` (20 min)
From the confusion matrix, there are A LOT of metrics we can calculate. Here are the four basic counts everything else is built from:

- True Positives (TP): the number of times the model correctly predicted a `Yes` outcome.
- True Negatives (TN): the number of times the model correctly predicted a `No` outcome.
- False Positives (FP): the number of times the model incorrectly predicted a `Yes` outcome (the true outcome was `No`).
- False Negatives (FN): the number of times the model incorrectly predicted a `No` outcome (the true outcome was `Yes`).
From those, we can go on to calculate other common metrics:

- Accuracy: the proportion of correct predictions. It is calculated as the sum of the diagonal of the matrix divided by the sum of all its values, that is \((TP + TN) / (TP + TN + FP + FN)\).
- Precision: the proportion of `Yes` predictions made by the model that were actually correct, \(TP / (TP + FP)\).
- Recall: the proportion of true `Yes` outcomes that the model correctly predicted as `Yes`, \(TP / (TP + FN)\).
- F1-score: a metric that combines precision and recall. This score ranges from 0 to 1, with 1 being the best. It is calculated as:
\[ 2 \times \frac{precision \times recall}{precision + recall} \]
💡 If you want a balanced model, one that doesn't favour either precision or recall, then you should aim for a high F1-score.
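To make these formulas concrete, here is a small sketch that computes the four metrics from made-up counts; the numbers are purely illustrative and not taken from the UK HPI model:

```r
# Purely illustrative counts, not from the UK HPI model
TP <- 40; TN <- 180; FP <- 20; FN <- 60

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                  # ~0.73
precision <- TP / (TP + FP)                                   # ~0.67
recall    <- TP / (TP + FN)                                   # 0.40
f1        <- 2 * (precision * recall) / (precision + recall)  # 0.50

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```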
How did the baseline model do?
After calculating the confusion matrix, you can run a summary to get the metrics. Here I will focus on the `Yes` label:
```r
baseline_model %>%
  augment(rec_baseline %>% bake(dataset_train)) %>%
  conf_mat(truth = price_up, estimate = .pred_class) %>%
  summary(estimator = "binary", event_level = "second")
```
Note: the last metric listed in the output, `f_meas`, is the F1-score.
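Since `summary()` returns an ordinary tibble, you can keep only the metrics discussed above if the full table feels noisy; this filtering step is optional:

```r
baseline_model %>%
  augment(rec_baseline %>% bake(dataset_train)) %>%
  conf_mat(truth = price_up, estimate = .pred_class) %>%
  summary(estimator = "binary", event_level = "second") %>%
  filter(.metric %in% c("accuracy", "precision", "recall", "f_meas"))
```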
💡 IMPORTANT: Note that we set `event_level="second"` in the `summary` function. This is because the `price_up` variable is an ordered factor, where the first level is `No` and the second level is `Yes`. If we had set `event_level="first"`, the summary would have been calculated for the `No` level, which is not what we want.
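If you want to confirm the level order yourself, `levels()` shows it directly (an optional check):

```r
# Should print "No" "Yes": "No" is the first level, "Yes" the second
levels(dataset_train$price_up)
```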
You probably see a "good" accuracy value for the baseline model (\(> 70\%\)), meaning the model gets it right most of the time. However, recall is only \(\approx 0.4\), meaning the model is not very good at predicting `Yes` outcomes. The model is biased towards predicting `No` outcomes, as you can see from the confusion matrix. After all, there are more instances of `No` than `Yes` in the dataset.
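You can see this imbalance directly by counting the classes in the training data; again, this is an optional check:

```r
# How many No vs Yes rows are there overall in the training set?
dataset_train %>%
  ungroup() %>%   # dataset_train is grouped by region_name
  count(price_up)
```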
Thresholds
The baseline model uses a threshold of 0.5 to decide whether to predict `Yes` or `No`. But what if we changed that threshold? What if we set it to 0.3, for example? Would that improve the model?

All you have to do is rewrite the `.pred_class` column using a different threshold. Here's how you do it:
```r
my_threshold <- 0.3

# Here we set the threshold to my_threshold (0.3)
# Then we convert the result to a factor with labels "No" and "Yes"
baseline_model %>%
  augment(rec_baseline %>% bake(dataset_train)) %>%
  mutate(.pred_class = .pred_Yes > my_threshold,
         .pred_class = factor(.pred_class,
                              labels = c("No", "Yes"),
                              levels = c(FALSE, TRUE),
                              ordered = TRUE)) %>%
  conf_mat(truth = price_up, estimate = .pred_class) %>%
  summary(estimator = "binary", event_level = "second")
```
You will find that recall has improved! Perhaps at a little cost to precision, but hey, you can't have it all.
The threshold is a hyperparameter of the model: a value that is not learned from the data during fitting, but one that you can adjust yourself to improve performance.
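If you would like to search over thresholds more systematically, here is a minimal sketch that scans a grid of candidate values and reports the F1-score for each. The helper `f1_for_threshold` is a hypothetical function written for this sketch, not part of the lab material:

```r
library(purrr)  # for map_dfr(); already loaded if you use the tidyverse

# Predictions on the training data, computed once
preds <- baseline_model %>%
  augment(rec_baseline %>% bake(dataset_train))

# Hypothetical helper: F1-score of the "Yes" class for a given threshold
f1_for_threshold <- function(threshold) {
  preds %>%
    mutate(.pred_class = factor(.pred_Yes > threshold,
                                labels = c("No", "Yes"),
                                levels = c(FALSE, TRUE),
                                ordered = TRUE)) %>%
    f_meas(truth = price_up, estimate = .pred_class, event_level = "second") %>%
    mutate(threshold = threshold)
}

# One row of results per candidate threshold
map_dfr(seq(0.1, 0.9, by = 0.05), f1_for_threshold)
```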
Part III: Craft Your Model (60 min)

Your mission now is to develop a model, trained as before on data up to the end of 2018, that excels in F1-score on both the training and test sets. Can we get closer to `f_meas = 1`?
Strategies to Consider

- Adjust the classification threshold.
- Enrich the model by adding more variables.
- Transform variables, either manually or with `recipes::step_*` functions.
  - For the recipes route, consult the sections on Individual Transformations and Normalisation on the `recipes` documentation page. Just triple-check the output of your recipes. (A sketch of an enriched recipe appears below the golden rule.)
📌 Remember, the golden rule is you can't use future data to forecast the past.
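As a starting point for the strategies above, here is a hedged sketch of an enriched recipe and workflow. The extra lags are an illustrative assumption, not the lab's prescribed solution, and the test-set evaluation mirrors the training-set code from Part II:

```r
# Illustrative only: add more lags of the two rate variables as predictors
rec_enriched <-
  recipe(price_up ~ ., data = dataset_train) %>%
  update_role(-c(price_up), new_role = "ID") %>%
  step_lag(c(yearly_rate_increase, historical_avg_increase), lag = 1:3) %>%
  step_naomit(all_predictors(), skip = FALSE) %>%
  prep()

enriched_wf <-
  workflow() %>%
  add_recipe(rec_enriched) %>%
  add_model(log_rec_specification)

enriched_fit <- enriched_wf %>% fit(data = dataset_train)

# Check the F1-score on the test set as well as the training set
enriched_fit %>%
  extract_fit_parsnip() %>%
  augment(rec_enriched %>% bake(dataset_test)) %>%
  conf_mat(truth = price_up, estimate = .pred_class) %>%
  summary(estimator = "binary", event_level = "second")
```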
🥇 A Friendly Challenge
Stopping here is perfectly okay. But if you're in for a bit more excitement, join our little competition.
The prize? A DSI water bottle for the person who achieves the highest F1-score in both training and testing sets using a legitimate logistic regression model.
Feel free to take this challenge beyond the classroom. You've got until Wednesday at 23:59 to submit your finest model via Moodle. I will reveal the winner in Friday's lecture.