πŸ›£οΈ Week 05 - Lab Roadmap (90 min)

Parameter tuning with tidymodels

Author

Dr. Jon Cardoso-Silva

πŸ₯… Learning Objectives

  • Calculate confusion matrices of classification models
  • Distinguish common metrics for classification models
  • In particular, distinguish accuracy from precision and recall
  • Grasp the concept of thresholds in classification models
  • Explore with changing the threshold of a logistic regression
  • Freely explore creating logistic regression models

πŸ“‹ Lab Tasks

This week we do not have any πŸ§‘β€πŸ« TEACHING MOMENT. Instead, you are to follow the material below and play with models and parameters by yourself.

Of course, your class teacher will be there to help you if you need it, and they might choose to do a live demo of some material to address common questions.

Part 0: Export your chat logs (~ 3 min)

As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:

🎯 ACTION POINTS

  1. πŸ”— CLICK HERE to export your chat log.

    Thanks for being GENIAL! You are now one step closer to earning some prizes! 🎟️

πŸ‘‰ NOTE: You MUST complete the initial form.

If you really don’t want to participate in GENIAL1, just answer β€˜No’ to the Terms & Conditions question - your e-mail address will be deleted from GENIAL’s database the following week.

Part I: Starting Point (10 min)

First off, let’s get you set up. Your starting point in this lab is similar to what you covered in last week’s lecture and serves as a recap for it. Compared to last week’s lab, we have a few new columns:

Columns

  • yearly_rate_increase: This variable captures the percentage change in the average house price in a specific region, comparing it to the same month in the prior year.

  • avg_past_rate_increase: This one’s a historical variable. It shows the average percentage increase in house prices in a given region, calculated from when the records first started.

  • price_up: A binary variable that answers a simple yet critical question: Has the average house price in a region risen more than usual? This is essentially a calculation of whether the current rate increase is higher than the avg_past_rate_increase. If it is, then the value is Yes, otherwise it’s No.

Data Split

The data is split into two parts:

  • dataset_train: This is the training set. It contains all the data up to the end of 2018.
  • dataset_test: This is the test set. It contains all the data from 2019 onwards.

Note that this time, we do not need to set the start date to 2005. Data about average price is available from 1969 onwards, so we can use all the data we have.

🎯 ACTION POINTS:

  1. Create a .qmd file for today’s lab.

  2. Add a code chunk and reserve it for loading the packages you will use today.
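
    For reference, a minimal loading chunk for the packages the code in this lab relies on might look like this (adapt it to whatever you end up using):

    library(tidyverse)   # dplyr, readr, tidyr, ggplot2, purrr, ...
    library(tidymodels)  # recipes, parsnip, workflows, yardstick, ...
    library(lubridate)   # dmy() and year()
    # janitor is called below with its namespace (janitor::clean_names()),
    # so it only needs to be installed, not loaded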

  3. Read the UK HPI dataset:

    # Read the raw UK HPI file, parse the Date column, and standardise column names
    uk_hpi <- 
        readr::read_csv("data/UK-HPI-full-file-2023-06.csv") %>%
        mutate(Date = dmy(Date)) %>%
        janitor::clean_names()
  4. Create a dataset object, as well as dataset_train and dataset_test objects with the following code:

    UK_countries <- c("England", "Wales", "Scotland", "Northern Ireland")
    
    dataset <- 
        uk_hpi %>% 
        filter(region_name %in% UK_countries) %>%
        group_by(region_name) %>%
        arrange(date) %>%
        # Year-on-year change: compare each month with the same month a year before
        mutate(
            lag_12_average_price = lag(average_price, 12),
            yearly_rate_increase = (average_price / lag_12_average_price) - 1
        ) %>%
        drop_na(lag_12_average_price) %>%
        # Average of all past yearly increases, plus the target variable price_up
        mutate(
            sum_past_rate_increase = cumsum(yearly_rate_increase) - yearly_rate_increase,
            n_past_rows = row_number() - 1,
            historical_avg_increase = sum_past_rate_increase / n_past_rows,
            price_up = factor(yearly_rate_increase > historical_avg_increase,
                              levels = c(FALSE, TRUE),
                              labels = c("No", "Yes"),
                              ordered = TRUE)
        ) %>%
        drop_na(historical_avg_increase) %>%
        arrange(desc(date), region_name) %>%
        select(-c(sum_past_rate_increase, n_past_rows))
    
    dataset_train <- dataset %>% filter(year(date) <= 2018)
    dataset_test <- dataset %>% filter(year(date) > 2018)
  5. Create the baseline workflow and recipe:

    # Model specification: logistic regression fitted with the glm engine
    log_rec_specification <- 
        logistic_reg() %>% 
        set_engine("glm") %>% 
        set_mode("classification") 
    
    # Recipe: keep only the 1-month lags of the two rate variables as predictors;
    # everything else is given the "ID" role so it is not used in the model
    rec_baseline <- 
        recipe(price_up ~ ., data = dataset_train) %>% 
        update_role(-c(price_up), new_role = "ID") %>%
        step_lag(c(yearly_rate_increase, historical_avg_increase), lag = 1) %>%
        step_naomit(all_predictors(), skip = FALSE) %>%
        prep()
    
    baseline_wf <-
        workflow() %>%
        add_recipe(rec_baseline) %>%
        add_model(log_rec_specification)
  6. Fit the baseline workflow and extract the fitted model:

    baseline_fit <- baseline_wf %>% fit(data = dataset_train)
    baseline_model <- baseline_fit %>% extract_fit_parsnip()
  7. To see how the model performs, create a separate chunk and use augment to select the relevant variables:

    baseline_model %>%
        augment(rec_baseline %>% bake(dataset_train)) %>%
        select(date, region_name, price_up, .pred_class, .pred_Yes, .pred_No)

    πŸ’‘ REMEMBER: Logistic regression outputs a number between 0 and 1 that indicates the probability of the outcome being Yes. Whenever .pred_Yes is greater than 0.5, the model fills the .pred_class column with the value β€˜Yes’.

  8. A more informative summary, though, is the confusion matrix: a table that shows, for each class, how many times the model got it right and how many times it got it wrong. To create it, use the conf_mat function from the yardstick package:

    baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_train)) %>%
        conf_mat(truth=price_up, estimate=.pred_class)

    or perhaps more visually:

    g <- baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_train)) %>%
        conf_mat(truth=price_up, estimate=.pred_class) %>%
        autoplot(type="heatmap")
    
    # the output is a ggplot object, so you can customise it if you like
    g

    Tip: try setting type="mosaic" to see a different type of plot.

Part II: Read about metrics and event_level (20 min)

From the confusion matrix, there are A LOT of metrics we can calculate. They are all built from four basic counts:

  • True Positives (TP): The number of times the model correctly predicted a Yes outcome.
  • True Negatives (TN): The number of times the model correctly predicted a No outcome.
  • False Positives (FP): The number of times the model incorrectly predicted a Yes outcome.
  • False Negatives (FN): The number of times the model incorrectly predicted a No outcome.

From those, we can go on to calculate other common metrics (written out as formulas after this list):

  • Accuracy: The proportion of correct predictions. It is calculated as the sum of the diagonal divided by the sum of all values in the matrix.

  • Precision: The proportion of Yes predictions made by the model that were actually correct.

  • Recall: The proportion of true Yes outcomes that were predicted by the model, correctly, as Yes.

  • F1-score: A metric that combines precision and recall. This score ranges from 0 to 1, 1 being the best. It is calculated as:

\[ 2 \times \frac{precision \times recall}{precision + recall} \]
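
In terms of the four counts above, the first three metrics can be written as:

\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad Precision = \frac{TP}{TP + FP} \qquad Recall = \frac{TP}{TP + FN} \]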

πŸ’‘ If you want a balanced model, one that doesn’t favour either precision or recall, then you should aim for a high F1-score.

How did the baseline model do?

After calculating the confusion matrix, you can run summary() on it to get all of these metrics at once. Here I will focus on the β€˜Yes’ label:

    baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_train)) %>%
        conf_mat(truth = price_up, estimate = .pred_class) %>%
        summary(estimator = "binary", event_level = "second")

Note: The last row of the output, called f_meas, is the F1-score.

πŸ’‘ IMPORTANT: Note that we set event_level="second" in the summary function. This is because the price_up variable is an ordered factor, where the first level is No and the second level is Yes. If we had set event_level="first" (the default), the metrics would have been calculated for the No level, which is not what we want.
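
If you want to double-check the level order yourself, you can inspect the factor directly:

    # Confirm which level comes first and which comes second
    levels(dataset_train$price_up)
    # should print "No" "Yes", i.e. Yes is the second level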

You probably see a β€˜good’ accuracy value for the baseline model (\(> 70\%\)), meaning the model gets it right most of the time. However, recall is only \(\approx 0.4\), meaning the model is not very good at predicting Yes outcomes. The model is biased towards predicting No outcomes, as you can see from the confusion matrix. After all, there are more instances of No than Yes in the dataset.
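
You can check this class imbalance yourself with a quick count (remember to ungroup() first, since dataset_train is still grouped by region_name):

    dataset_train %>% 
        ungroup() %>% 
        count(price_up)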

Thresholds

The baseline model uses a threshold of 0.5 to decide whether to predict Yes or No. But what if we changed that threshold? What if we set it to 0.3, for example? Would that improve the model?

All you have to do is to rewrite the .pred_class column using a different threshold. Here’s how you do it:

    my_threshold <- 0.3
    
    # Here we set the threshold to my_threshold (0.3)
    # Then we convert the result to a factor with labels "No" and "Yes"
    baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_train)) %>%
        mutate(.pred_class = .pred_Yes > my_threshold,
               .pred_class = factor(.pred_class, 
                                    labels = c("No", "Yes"), 
                                    levels = c(FALSE, TRUE), 
                                    ordered = TRUE)) %>%
        conf_mat(truth = price_up, estimate = .pred_class) %>%
        summary(estimator = "binary", event_level = "second")

You will find that recall has improved! Perhaps at a little cost to precision, but hey, you can’t have it all.

The threshold is a hyperparameter of the model: a value that is not learned from the data during fitting, but that you can set yourself to tune the model’s behaviour.
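
If you want to compare several thresholds at once, here is a hedged sketch (the candidate values and helper object names below are just illustrative) that reuses the baseline predictions and computes the F1-score at each threshold:

    my_candidate_thresholds <- seq(0.1, 0.9, by = 0.1)
    
    # Compute the predictions once, then reclassify them at each threshold
    baseline_preds <- baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_train)) %>%
        ungroup()   # drop any leftover grouping, just in case
    
    purrr::map_dfr(my_candidate_thresholds, function(t) {
        baseline_preds %>%
            mutate(.pred_class = factor(.pred_Yes > t,
                                        levels = c(FALSE, TRUE),
                                        labels = c("No", "Yes"),
                                        ordered = TRUE)) %>%
            f_meas(truth = price_up, estimate = .pred_class,
                   estimator = "binary", event_level = "second") %>%
            mutate(threshold = t)
    })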

Part III: Craft Your Model (60 Minutes)

Your mission now is to develop a model, trained as before on data up to the end of 2018, that excels in F1-score on both the training and test sets. Can we get closer to f_meas = 1?
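
Before you start tweaking, it is worth checking how the baseline does on the test set. The sketch below simply repeats the evaluation pattern from Part I, swapping dataset_train for dataset_test:

    baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_test)) %>%
        conf_mat(truth = price_up, estimate = .pred_class) %>%
        summary(estimator = "binary", event_level = "second")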

Strategies to Consider

  • Adjust the classification threshold
  • Enrich the model by adding more variables.
  • Transform variables β€” either manually or with recipes::step_* functions.
    • For the recipes route, consult the sections on Individual Transformations and Normalisation on the recipes documentation page. Just triple check the output of your recipes.

πŸ‘‰ Remember, the golden rule is you can’t use future data to forecast the past.

πŸ₯‡ A Friendly Challenge

Stopping here is perfectly okay. But if you’re in for a bit more excitement, join our little competition.

Using the same target variable used throughout this lab, can you come up with a set of features that does not involve any data leakage and beats the (admittedly poor) performance of the models trained in this lab? That is, a model specification with a non-zero F1-score for the class β€˜Yes’, and with an F1-score and accuracy as high as possible?

  • Use the same logistic regression as the algorithm specification
  • Use the same target variable
  • Use the same rolling window resampling technique
  • Feel free to play with any combination of features you can think of, and to even bring in additional data if you want.

Winners will be put in our Hall of Fame and receive DSI tote bags (there are four bags for the taking)! πŸŽ‰

Feel free to take this challenge beyond the classroom. You’ve got until Wednesday at 5pm to submit your finest model via Moodle. Jon will reveal the winner in Friday’s lecture!

Footnotes

  1. We’re gonna cry a little bit, not gonna lie. But no hard feelings. We’ll get over it.β†©οΈŽ