πŸ›£οΈ Week 05 - Lab Roadmap (90 min)

Parameter tuning with tidymodels

Author

Dr. Jon Cardoso-Silva

πŸ₯… Learning Objectives

  • Calculate confusion matrices of classification models
  • Distinguish common metrics for classification models
  • In particular, distinguish accuracy from precision and recall
  • Grasp the concept of thresholds in classification models
  • Explore with changing the threshold of a logistic regression
  • Freely explore creating logistic regression models
I am part of the GENIAL project

This week you can still use ChatGPT if you like, but you will not be asked to do so. Go about the lab as usual, freely interacting with others, the lab material, the Web or any other resource you may find useful.

The only thing we ask is that you fill out the brief survey at the end of the lab: πŸ”— link. When asked whether you were asked to use ChatGPT, please answer no.

Thanks for being a GENIAL participant!

πŸ“‹ Lab Tasks

This week we do not have any πŸ§‘β€πŸ« TEACHING MOMENT. Instead, you are to follow the material below and play with models and parameters by yourself.

Of course, your class teacher will be there to help you if you need it, and they might choose to do a live demo of some material to address common questions.

Part I: Starting Point (10 min)

First off, let's get you set up. The code you will use as a starting point is similar to the one we used in last week's lecture, but it introduces some new variables. Here's what's new:

Columns

  • yearly_rate_increase: This variable captures the percentage change in the average house price in a specific region, comparing it to the same month in the prior year.

  • avg_past_rate_increase: This one’s a historical variable. It shows the average percentage increase in house prices in a given region, calculated from when the records first started.

  • price_up: A binary variable that answers a simple yet critical question: Has the average house price in a region risen more than usual? This is essentially a calculation of whether the current rate increase is higher than the avg_past_rate_increase. If it is, then the value is Yes, otherwise it’s No.

Data Split

The data is split into two parts:

  • dataset_train: This is the training set. It contains all the data up to the end of 2018.
  • dataset_test: This is the test set. It contains all the data from 2019 onwards.

Note that this time, we do not need to set the start date to 2005. Data about average price is available from 1969 onwards, so we can use all the data we have.

🎯 ACTION POINTS:

  1. Create a .qmd file for today's lab.

  2. Add a code chunk and reserve it for loading the packages you will use today (the code below relies on, at minimum, tidyverse, lubridate, janitor and tidymodels).

  3. Read the UK HPI dataset:

    uk_hpi <- 
        readr::read_csv("data/UK-HPI-full-file-2023-06.csv") %>%
        mutate(Date = dmy(Date)) %>%
        janitor::clean_names()
  4. Create a dataset object, as well as dataset_train and dataset_test objects with the following code:

    UK_countries <- c("England", "Wales", "Scotland", "Northern Ireland")
    
    dataset <- 
        uk_hpi %>% 
        filter(region_name %in% UK_countries) %>%
        group_by(region_name) %>%
        arrange(date) %>%
        mutate(
            # price in the same month one year earlier, and the yearly % change
            lag_12_average_price = lag(average_price, 12),
            yearly_rate_increase = (average_price/lag_12_average_price) - 1
        ) %>%
        drop_na(lag_12_average_price) %>%
        mutate(
            # running average of all *past* yearly increases (current month excluded)
            sum_past_rate_increase = cumsum(yearly_rate_increase) - yearly_rate_increase,
            n_past_rows = row_number() - 1,
            historical_avg_increase = sum_past_rate_increase/n_past_rows,
            # did prices rise more than the historical average?
            price_up = factor(yearly_rate_increase > historical_avg_increase, 
                              labels=c("No","Yes"),
                              levels=c(FALSE, TRUE),
                              ordered=TRUE)) %>%
        drop_na(historical_avg_increase) %>%
        arrange(desc(date), region_name) %>%
        select(-c(sum_past_rate_increase, n_past_rows))
    
    dataset_train <- dataset %>% filter(year(date) <= 2018)
    dataset_test <- dataset %>% filter(year(date) > 2018)
  5. Create the baseline model specification, recipe and workflow:

    log_rec_specification <- 
        logistic_reg() %>% 
        set_engine("glm") %>% 
        set_mode("classification") 
    
    rec_baseline <- 
        recipe(price_up ~ ., data = dataset_train) %>% 
        # treat every column other than the outcome as an "ID" (not a predictor)
        update_role(-c(price_up), new_role = "ID") %>%
        # the only predictors are last month's values of these two columns
        step_lag(c(yearly_rate_increase, historical_avg_increase), lag = 1) %>%
        step_naomit(all_predictors(), skip=FALSE) %>%
        prep()
    
    baseline_wf <-
        workflow() %>%
        add_recipe(rec_baseline) %>%
        add_model(log_rec_specification)
  6. Fit the baseline workflow and extract the fitted model:

    baseline_fit <- baseline_wf %>% fit(data = dataset_train)
    baseline_model <- baseline_fit %>% extract_fit_parsnip()
  7. To see how the model performs, create a separate chunk, use augment() to attach the model's predictions and then select the relevant variables:

    baseline_model %>%
        augment(rec_baseline %>% bake(dataset_train)) %>%
        select(date, region_name, price_up, .pred_class, .pred_Yes, .pred_No)

    πŸ’‘ REMEMBER: Logistic regression outputs a number between 0 and 1 that indicates the probability of the outcome being Yes. Whenever .pred_Yes is greater than 0.5, the model fills the column .pred_class with the value 'Yes'.

  8. But the best summary is the confusion matrix, a table that shows how many times the model got it right and how many times it got it wrong. To create it, use the conf_mat function from the yardstick package:

    baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_train)) %>%
        conf_mat(truth=price_up, estimate=.pred_class)

    or perhaps more visually:

    g <- baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_train)) %>%
        conf_mat(truth=price_up, estimate=.pred_class) %>%
        autoplot(type="heatmap")
    
    # the output is a ggplot object, so you can customise it if you like
    g

    Tip: try setting type="mosaic" to see a different type of plot.

Part II: Read about metrics and event_level (20 min)

From the confusion matrix, there are A LOT of metrics we can calculate. They are all built from four basic counts:

  β€’ True Positives (TP): The number of times the model correctly predicted a Yes outcome.
  β€’ True Negatives (TN): The number of times the model correctly predicted a No outcome.
  β€’ False Positives (FP): The number of times the model predicted Yes when the true outcome was No.
  β€’ False Negatives (FN): The number of times the model predicted No when the true outcome was Yes.

From those, we can go on to calculate other common metrics:

  • Accuracy: The proportion of correct predictions. It is calculated as the sum of the diagonal divided by the sum of all values in the matrix.

  • Precision: The proportion of Yes predictions made by the model that were actually correct.

  • Recall: The proportion of true Yes outcomes that were predicted by the model, correctly, as Yes.

  • F1-score: A metric that combines precision and recall. This score ranges from 0 to 1, 1 being the best. It is calculated as:

\[ 2 \times \frac{precision \times recall}{precision + recall} \]
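
For reference, the other metrics can also be written in terms of the four counts (TP, TN, FP, FN) defined above:

\[ accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad precision = \frac{TP}{TP + FP}, \qquad recall = \frac{TP}{TP + FN} \]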

πŸ’‘ If you want a balanced model, one that doesn't favour either precision or recall, then you should aim for a high F1-score.

How did the baseline model do?

After calculating the confusion matrix, you can pipe it into summary() to get the metrics. Here I will focus on the 'Yes' label:

    baseline_model %>% 
        augment(rec_baseline %>% bake(dataset_train)) %>%
        conf_mat(truth=price_up, estimate=.pred_class) %>%
        summary(estimator="binary", event_level="second")

Note: The last row of the output, f_meas, is the F1-score.

πŸ’‘ IMPORTANT: Note that we set event_level="second" in the summary() call. This is because price_up is an ordered factor whose first level is No and whose second level is Yes. Had we kept the default, event_level="first", the metrics would have been calculated for the No level, which is not what we want.
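
You can verify the order of the levels yourself (a quick sanity check on the dataset_train object created in Part I):

levels(dataset_train$price_up)
# expected output: [1] "No"  "Yes"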

You probably see a 'good' accuracy value for the baseline model (\(> 70\%\)), meaning the model gets it right most of the time. However, recall is only \(\approx 0.4\), meaning the model is not very good at predicting Yes outcomes. The model is biased towards predicting No outcomes, as you can see from the confusion matrix. After all, there are more instances of No than Yes in the dataset.
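
If you would rather compute individual metrics straight from the augmented predictions, instead of going through conf_mat() and summary(), yardstick also provides one function per metric. A minimal sketch, assuming the objects created in Part I:

preds_train <- baseline_model %>% 
    augment(rec_baseline %>% bake(dataset_train))

# each call returns a one-row tibble with .metric, .estimator and .estimate
precision(preds_train, truth = price_up, estimate = .pred_class, event_level = "second")
recall(preds_train, truth = price_up, estimate = .pred_class, event_level = "second")
f_meas(preds_train, truth = price_up, estimate = .pred_class, event_level = "second")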

Thresholds

The baseline model uses a threshold of 0.5 to decide whether to predict Yes or No. But what if we changed that threshold? What if we set it to 0.3, for example? Would that improve the model?

All you have to do is overwrite the .pred_class column using a different threshold. Here's how you do it:

my_threshold <- 0.3

# Here we set the threshold to my_threshold (0.3)
# Then we convert the result to a factor with labels "No" and "Yes"
baseline_model %>% 
    augment(rec_baseline %>% bake(dataset_train)) %>%
    mutate(.pred_class = .pred_Yes > my_threshold,
           .pred_class = factor(.pred_class, 
                                labels=c("No","Yes"), 
                                levels=c(FALSE, TRUE), 
                                ordered=TRUE)) %>%
    conf_mat(truth=price_up, estimate=.pred_class) %>%
    summary(estimator="binary", event_level="second")

You will find that recall has improved! Perhaps at a little cost to precision, but hey, you can't have it all.

Thresholds are a hyperparameter of the model: unlike the model's coefficients, they are not learned from the data during fitting; instead, you choose them yourself and can tune them to improve the model.
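
If you want to compare several thresholds at once rather than one at a time, here is a minimal sketch (assuming the baseline_model and rec_baseline objects from Part I) that computes the F1-score for a grid of candidate thresholds:

library(purrr)

# as before, predictions on the training data
preds_train <- baseline_model %>% 
    augment(rec_baseline %>% bake(dataset_train))

candidate_thresholds <- seq(0.1, 0.9, by = 0.05)

# one row per threshold, sorted so the best F1-score comes first
map_dfr(candidate_thresholds, function(t) {
    preds_train %>%
        mutate(.pred_class = factor(.pred_Yes > t,
                                    labels = c("No", "Yes"),
                                    levels = c(FALSE, TRUE),
                                    ordered = TRUE)) %>%
        f_meas(truth = price_up, estimate = .pred_class, event_level = "second") %>%
        mutate(threshold = t)
}) %>%
    arrange(desc(.estimate))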

Part III: Craft Your Model (60 Minutes)

Your mission now is to develop a model, trained as before on data up to the end of 2018, that excels in F1-score on both the training and testing sets. Can we get closer to f_meas = 1?
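
Whatever model you end up with, remember to evaluate it on the test set as well; a minimal sketch for the baseline workflow (assuming the objects from Part I):

baseline_model %>%
    augment(rec_baseline %>% bake(dataset_test)) %>%
    f_meas(truth = price_up, estimate = .pred_class, event_level = "second")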

Strategies to Consider

  • Adjust the classification threshold
  • Enrich the model by adding more variables.
  • Transform variables β€” either manually or with recipes::step_* functions.
    • For the recipes route, consult the sections on Individual Transformations and Normalisation on the recipes documentation page. Just triple check the output of your recipes.

πŸ‘‰ Remember, the golden rule is that you can't use future data to forecast the past.
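
As an illustration of the last two strategies, here is a hedged sketch (not a model answer) that lags one extra column and normalises the predictors. The sales_volume column is an assumption about what the cleaned UK HPI file contains; swap in whichever columns your dataset actually has:

rec_enriched <- 
    recipe(price_up ~ ., data = dataset_train) %>% 
    update_role(-c(price_up), new_role = "ID") %>%
    # lag an extra column (sales_volume is assumed to exist after clean_names())
    step_lag(c(yearly_rate_increase, historical_avg_increase, sales_volume), lag = 1) %>%
    step_naomit(all_predictors(), skip = FALSE) %>%
    # put all predictors on the same scale
    step_normalize(all_predictors()) %>%
    prep()

enriched_wf <- 
    workflow() %>%
    add_recipe(rec_enriched) %>%
    add_model(log_rec_specification)

enriched_fit   <- enriched_wf %>% fit(data = dataset_train)
enriched_model <- enriched_fit %>% extract_fit_parsnip()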

πŸ₯‡ A Friendly Challenge

Stopping here is perfectly okay. But if you're in for a bit more excitement, join our little competition.

The prize? A DSI water bottle for the person who achieves the highest F1-score in both training and testing sets using a legitimate logistic regression model.

Feel free to take this challenge beyond the classroom. You've got until Wednesday at 23:59 to submit your finest model via Moodle. I will reveal the winner in Friday's lecture.