πŸ’» LSE DS202A 2024: Week 04 - Lab

Author

Yassine Lahna & Stuart Bramwell

Published

21 Oct 2024

Welcome to our fourth DS202A lab!

This week, you will explore logistic regression, an important classification technique. You will perform data exploration, build a logistic regression model and evaluate its performance using standard classification metrics such as precision, recall and the area under the precision-recall curve.

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Fit a logistic regression model in R.
  • Evaluate the performance of a logistic regression model using standard classification metrics.

πŸ“š Preparation

Materials to download

Use the link below to download the lab materials:

as well as the dataset we will use for this lab:

Put this dataset file in the data folder within your DS202A project folder.

βš™οΈ Setup

In this lab, we will use a few R libraries to help with data handling, model evaluation and visualization:

  • doParallel: to help us utilise multiple cores when running computationally demanding code.
  • ggsci: for nice colour palettes.
  • tidymodels: an ecosystem for machine-learning models.
  • kknn: the engine behind the \(k\)-nn model we’ll explore at the end of this lab.
  • tidyverse: an ecosystem for data manipulation and visualisation.

Install missing libraries:

You probably don’t have the doParallel and kknn libraries installed, so you’ll need to install them.

# Make sure you run this code only once and that this chunk is not executed when you render your .qmd
install.packages("doParallel")
install.packages("kknn")

Import required libraries:

library("ggsci")
library("tidymodels")
library("tidyverse")
library("doParallel")
library("kknn")

Create functions


# Define a metric set
# purrr::partial allows you to preset parameters for defined functions
# and turn them into their own objects.

class_metrics <- metric_set(precision, recall)
class_metrics <- partial(class_metrics, event_level = "second")

# ggplot2 plot themes

theme_histogram <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
  
}

theme_boxplot <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "none")
  
}

theme_line <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "none")
  
}

theme_bar <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank())
  
}

theme_scatter <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom")
  
}

πŸ“‹ Lab Tasks

Part I - Exploratory data analysis (20 min)

The first step is to load the dataset. In this lab, we will be using the diabetes dataset, which contains health-related data. It includes variables associated with medical conditions, lifestyle factors and demographic information:

  • diabetes: indicates whether the individual has diabetes.
  • high_bp: indicates whether the individual has high blood pressure.
  • high_chol: indicates whether the individual has high cholesterol.
  • chol_check: indicates whether the individual has had their cholesterol checked.
  • bmi: represents the individual’s Body Mass Index.
  • smoker: indicates whether the individual is a smoker.
  • stroke: indicates whether the individual has had a stroke.
  • heart_diseaseor_attack: indicates whether the individual has, or has had, heart disease or a heart attack.
  • phys_activity: indicates whether the individual engages in physical activity.
  • fruits and veggies: indicate the consumption of fruits and vegetables.
  • hvy_alcohol_consump: indicates heavy alcohol consumption.
  • no_docbc_cost: refers to whether an individual was unable to see a doctor due to cost-related barriers.
  • any_healthcare: indicates whether the individual has any form of healthcare.
  • gen_hlth: indicates the individual’s self-reported general health.
  • diff_walk: indicates whether the individual has difficulty walking or faces mobility challenges.
  • sex: indicates the individual’s gender.
  • age: represents the individual’s age.
  • education: represents the individual’s education level.
  • income: represents the individual’s income level.
diabetes_data <- read_csv("data/diabetes.csv")

Question 1: Check the dimensions of the dataframe. Check whether there are any missing values.


diabetes_data

diabetes_data %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>% 
  glimpse()

Question 2: What are the types of the variables in the dataset (continuous, discrete, categorical, ordinal)? Convert the continuous variables to numeric, the discrete variables to integers and the categorical variables to factors.


continuous_vars <- c("age", "bmi")
discrete_vars <- c("gen_hlth", "education", "income")
categorical_vars <- c("diabetes", "high_bp", "high_chol", "chol_check",
                      "smoker", "stroke", "heart_diseaseor_attack",
                      "phys_activity", "fruits",  "veggies",
                      "hvy_alcohol_consump", "any_healthcare", "no_docbc_cost",
                      "diff_walk", "sex")

diabetes_data <- 
  diabetes_data %>% 
  mutate(across(all_of(continuous_vars), ~ as.numeric(.x)),
         across(all_of(discrete_vars), ~ as.integer(.x)),
         across(all_of(categorical_vars), ~ as.factor(.x)))

πŸ‘‰Note: Imagine how cumbersome it would be to perform these transformations if we had to do them for each variable! For more information on what across is doing, click here.

Question 3: Generate box plots of the variable bmi for each class of the column diabetes. Repeat the same for the variables education and income. What do you observe from the box plots? How do BMI, education and income vary across the diabetes classes?


diabetes_data %>% 
  ggplot(aes(diabetes, bmi, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "BMI")

diabetes_data %>% 
  ggplot(aes(diabetes, education, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "Education")

diabetes_data %>% 
  ggplot(aes(diabetes, income, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "Income")

Question 4: How many observations fall into each class of the variable diabetes? Comment. What challenges would you expect when training a model?


count(diabetes_data, diabetes)

Part II - Fitting a logistic regression (25 min)

We want to perform a logistic regression using the available variables in the dataset to predict whether an individual has diabetes.

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will formalize the logistic regression in the context of the data at our disposal.

Now, we need to split the data into training and testing sets. Having a test set will help us evaluate how well the model generalizes.

  • In the training phase, we will use part of the data to fit the logistic regression model.

  • In the testing phase, we will assess the model’s performance on the remaining data (test set) which was not used during training.

Question 1: Why can’t we rely solely on the model’s performance on the training set to evaluate its ability to generalize?

Question 2: Split the dataset into training and testing sets using 75% of the data for training and 25% for testing. Use slice_sample to create a sample of 1% of the training data.


set.seed(123)

split_grid <- initial_split(diabetes_data, prop = 0.75, strata = diabetes)
train <- training(split_grid)
test <- testing(split_grid)

train_sample <- slice_sample(train, prop = 0.01)

πŸ’‘Tip: When using initial_split try specifying strata = diabetes. This ensures that the training and test sets will have very similar proportions of the outcome classes.
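To see what stratification buys you, a quick check like the sketch below (not part of the lab answers) compares the class shares in the two sets:

# With strata = diabetes, both sets should show near-identical class shares
bind_rows(train = count(train, diabetes),
          test = count(test, diabetes),
          .id = "set") %>% 
  group_by(set) %>% 
  mutate(prop = n / sum(n))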

Question 3: Fit a logistic regression model on the training data. Should all variables be included in the model?


model <- 
  logistic_reg() %>% 
  fit(diabetes ~ ., data = train)

Question 4: Generate a summary of the model. Compare the \(p\)-values for bmi and for education.


summary(model$fit)

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain how to interpret the summary of the model (\(p-\)values, AIC …).

Part III - Evaluation (45 min)

Question 1: We are going to generate predictions on both the training and testing sets using our logistic regression model. Generate predictions for both sets.


# Generate predictions on the training set
train_preds <- 
  model %>% 
  augment(new_data = train, type.predict = "response")

# Generate predictions on the testing set
test_preds <- 
  model %>% 
  augment(new_data = test, type.predict = "response")

πŸ’‘Tip: When using augment for logistic regression, specify type.predict = "response".

Question 2: Copy and paste pull(test_preds, .pred_1)[1:10] into the console and hit enter to get the first ten predictions for the test set. The model’s predictions are scores ranging between 0 and 1, while the target variable diabetes is binary and takes only the values 0 or 1. How can we use the predictions to classify whether an individual has diabetes?

Question 3: We are going to set an arbitrary threshold \(\alpha\). All scores higher than \(\alpha=0.8\) will be classified as 1, and all other scores will be classified as 0.


# Convert predicted scores to binary outcomes for the training set
train_preds <- 
  train_preds %>% 
  mutate(.pred_class = as.factor(if_else(.pred_1 > 0.8, 1, 0)))

# Convert predicted scores to binary outcomes for the testing set
test_preds <- 
  test_preds %>% 
  mutate(.pred_class = as.factor(if_else(.pred_1 > 0.8, 1, 0)))

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain what precision and recall are.

Question 4: What would be the recall if we only predicted 1’s on the test set?

πŸ’‘πŸ’‘Tips:

  • Create a new factor variable. When using factor() make sure to specify levels = c(0, 1)!
  • Try the recall function; don’t forget to set event_level = "second"!

# Code here
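If you are stuck, here is a minimal sketch of one way to answer this; the object all_ones is our own, hypothetical name:

# Predict 1 for every test observation (mutate() recycles the
# length-one factor across all rows), then compute recall
all_ones <- 
  test_preds %>% 
  mutate(.pred_all_one = factor(1, levels = c(0, 1)))

recall(all_ones, truth = diabetes, estimate = .pred_all_one, event_level = "second")

Because every observation is predicted positive, no true positive can be missed, so recall equals 1, a reminder that recall alone is easy to game.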

Question 5: Now, let’s measure precision and recall for both training and testing sets.


# Code here
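One possible approach, reusing the class_metrics metric set defined in the setup (the combined-tibble layout is our own choice):

# Precision and recall for each set, stacked into one tibble
bind_rows(train = class_metrics(train_preds, truth = diabetes, estimate = .pred_class),
          test = class_metrics(test_preds, truth = diabetes, estimate = .pred_class),
          .id = "set")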

Question 6: Compare the results you obtain for the training and testing sets. Was this expected? Does the model overfit? Underfit?

Question 7: What is the problem with setting an arbitrary threshold \(\alpha=0.8\)? How should we expect the precision and the recall to behave if we increase the threshold \(\alpha\)? If we decrease \(\alpha\)?

Question 8: Compute the precision and recall for the test set for different values of the threshold, namely \(\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7\}\).


# Create a list of probability thresholds

alpha_list <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)

# This function helps create a probability threshold
# and subsets the data.

set_prob_th <-
  as_mapper(
    ~ .x %>% 
      mutate(truth = as.factor(diabetes == "1"),
             estimate = as.factor(as.numeric(.pred_1) > .y)) %>% 
      select(truth, estimate)
  )

# Register multiple cores

doParallel::registerDoParallel()

# Calculate probability thresholds across nested
# tibbles

ev_metrics <-
  crossing(alpha = alpha_list,
           nest(augment(model, new_data = test), .key = "augmented")) %>% 
  mutate(preds = map2(augmented, alpha, ~ set_prob_th(.x, .y)),
         metrics = map(preds, ~ class_metrics(.x, truth = truth, estimate = estimate))) %>% 
  unnest(metrics) %>% 
  select(alpha, .metric, .estimate)

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your teacher will explain what the map function is doing.

  • Because we need to perform operations over list columns, we use map and map2 to apply functions to each cell. With map2 we tell R to calculate each probability threshold in alpha_list over the nested tibbles in augmented.
  • With purrr::map functions, we can use anonymous functions (starting with ~), which provide a more compact syntax for function writing. We could use function(.x, .y) set_prob_th(.x, .y) to achieve the same thing, but this would make the code longer.
  • .x (and .y) are the placeholders used when specifying an anonymous function.

Question 9: Create a plot that shows performance over different thresholds, using colour to distinguish between the different evaluation metrics.


# Code here
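A minimal sketch, assuming the ev_metrics tibble from Question 8 is still in memory:

# One line per metric, threshold on the x-axis
ev_metrics %>% 
  ggplot(aes(alpha, .estimate, colour = .metric)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  scale_colour_npg() +
  labs(x = "Threshold", y = "Estimate", colour = "Metric")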

πŸ’°πŸŽπŸŽ‰ Bonus: Try redefining class_metrics to also include f_meas (the f1-score) and see what happens.

Question 10: Based on the precision and recall values you calculated, create a precision-recall plot to visualize the relationship between these two metrics. How does the shape of the precision-recall curve help you assess model performance? What trade-offs do you observe between precision and recall?


# Code here
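Instead of joining the handful of points from Question 8, you could also let yardstick trace the full curve; a sketch:

# pr_curve() sweeps over every threshold in the test-set scores
test_preds %>% 
  pr_curve(truth = diabetes, .pred_1, event_level = "second") %>% 
  autoplot()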

Question 11: Let’s employ the train_sample we built earlier. Train a new logistic regression model using all the variables, then evaluate its performance on the testing set by generating a precision-recall curve. Compare the precision-recall curves of the original model and the new model.


# Code here

# Code here
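A minimal sketch; sample_model and sample_preds are names we introduce here:

# Fit on the 1% sample, then score the full test set
sample_model <- 
  logistic_reg() %>% 
  fit(diabetes ~ ., data = train_sample)

sample_preds <- augment(sample_model, new_data = test, type.predict = "response")

# Overlay the two precision-recall curves
bind_rows(full = pr_curve(test_preds, truth = diabetes, .pred_1, event_level = "second"),
          sample = pr_curve(sample_preds, truth = diabetes, .pred_1, event_level = "second"),
          .id = "model") %>% 
  ggplot(aes(recall, precision, colour = model)) +
  geom_path() +
  theme_minimal() +
  scale_colour_npg() +
  labs(x = "Recall", y = "Precision", colour = "Model")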

Question 12: What differences do you observe between the two models’ precision-recall curves? How does training on less data affect the performance of the logistic regression model?


# Code here

πŸ’‘πŸ’‘Tips:

  • Try finding a way of combining the evaluation metrics when using the full data and the sample.
  • Create a list with two named elements.
  • Pipe in bind_rows(.id = "sample").
  • Plot the results, add facet_wrap(. ~ .metric) to your ggplot and see what happens (one possible sketch follows this list).
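Putting these tips together might look like the sketch below, assuming ev_metrics_new holds the threshold metrics recomputed for the sampled model (mirroring Question 8):

# Combine both sets of metrics, then facet by metric
list(full = ev_metrics, sample = ev_metrics_new) %>% 
  bind_rows(.id = "sample") %>% 
  ggplot(aes(alpha, .estimate, colour = sample)) +
  geom_line() +
  facet_wrap(. ~ .metric) +
  theme_minimal() +
  scale_colour_npg() +
  labs(x = "Threshold", y = "Estimate", colour = "Sample")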

Part IV - Cross-validation (X min)

In this part, we will explore a technique often used in machine learning to evaluate the performance and generalizability of our models. Cross-validation is often used to get a better estimate of the model’s performance on new data (the test set).

πŸ’‘Tip: We have been creating a lot of objects that, in turn, have used up a lot of memory. It is worth, sometimes, removing the objects we have no use for anymore. We can then call gc() (β€œgarbage collection” πŸ˜‚) to help free up memory in RStudio.

rm(list = ls(pattern = "ev_metrics$|model|pred"))
gc()

The idea of \(k\)-fold cross-validation is to split a dataset into a training set and testing set (just as we did previously), then to split the training set into \(k\) folds. The model is trained on \(k-1\) folds and evaluated on the remaining fold, rotating until every fold has served once as the hold-out set; averaging the \(k\) scores gives a more stable estimate of performance.
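For illustration, this is how you could create such folds with rsample (the lab itself uses the leave-one-out variant below); v = 5 is an arbitrary choice:

# A stratified 5-fold split of the training sample
vfold_cv(train_sample, v = 5, strata = diabetes)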

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will give more details on how cross-validation works and what its purpose is.

Question 1: Use the tidymodels ecosystem to perform leave-one-out cross-validation (the special case of \(k\)-fold where \(k\) equals the number of observations).


# Create leave one out resamples

loocv <- loo_cv(train_sample)

# Instantiate a logistic regression model

logit <- logistic_reg()

# Register multiple cores

doParallel::registerDoParallel()

# Create a nested tibble of fitted models

loocv_fit <-
  loocv %>% 
  mutate(train = map(splits, ~ training(.x)),
         holdout = map(splits, ~ testing(.x)),
         fit = map(train, ~ fit(logit, diabetes ~ ., data = .x))
         )

# Apply each fitted model to its held-out observation

loocv_preds <-
  loocv_fit %>% 
  select(fit, holdout) %>% 
  mutate(preds = map2(fit, holdout, ~ augment(.x, new_data = .y)))

# Unnest the predictions and calculate evaluation metrics

loocv_preds %>% 
  unnest(preds) %>% 
  class_metrics(truth = diabetes, estimate = .pred_class)

Question 2: Try manipulating the probability threshold and create a graph similar to the one in Part III, Question 9.


# Code here
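A minimal sketch that reuses alpha_list and set_prob_th from Part III on the pooled held-out predictions; loocv_all is our own name:

# Pool the held-out predictions, keeping only what set_prob_th needs
loocv_all <- 
  loocv_preds %>% 
  unnest(preds) %>% 
  select(diabetes, .pred_1)

# Sweep over thresholds and plot, as in Part III
crossing(alpha = alpha_list, nest(loocv_all, .key = "augmented")) %>% 
  mutate(preds = map2(augmented, alpha, ~ set_prob_th(.x, .y)),
         metrics = map(preds, ~ class_metrics(.x, truth = truth, estimate = estimate))) %>% 
  unnest(metrics) %>% 
  ggplot(aes(alpha, .estimate, colour = .metric)) +
  geom_line() +
  theme_minimal() +
  scale_colour_npg() +
  labs(x = "Threshold", y = "Estimate", colour = "Metric")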

Part V - \(k\)-nearest neighbours (X min)

πŸ’‘Tip: See previous tip for why the below code is necessary.

rm(list = ls(pattern = "loocv|ev_metrics_new"))
gc()

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain how \(k\)-nn works.

Question 1: Start by normalizing the continuous variables. Explain why it can be useful to carry out this transformation.


# Create a standardise function

standardise <- function(.x) {
  out <- (.x - mean(.x)) / sd(.x)
  out
}

# Apply function to training and testing data

# Code here
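A sketch of one way to do this; train_knn and test_knn are our own names. Strictly speaking, leakage is avoided by standardising the test set with the training set’s means and standard deviations, but here we follow the simpler per-set approach implied by the function above:

# Standardise the continuous variables within each set
train_knn <- train %>% mutate(across(all_of(continuous_vars), standardise))
test_knn <- test %>% mutate(across(all_of(continuous_vars), standardise))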

Question 2: Perform a \(k\)-nn classification with \(k=5\). Generate the predictions for the test set and compute both precision and recall.


# Register multiple cores

# code here

# Fit a kNN model

# code here

# Evaluate the model on the test set

# code here
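A minimal sketch using parsnip’s nearest_neighbor(), assuming the standardised train_knn / test_knn from Question 1:

# Register multiple cores
doParallel::registerDoParallel()

# Fit a 5-nearest-neighbour classifier with the kknn engine
knn_model <- 
  nearest_neighbor(mode = "classification", neighbors = 5) %>% 
  set_engine("kknn") %>% 
  fit(diabetes ~ ., data = train_knn)

# Evaluate on the test set
knn_preds <- augment(knn_model, new_data = test_knn)

class_metrics(knn_preds, truth = diabetes, estimate = .pred_class)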

πŸ’°πŸŽπŸŽ‰ Bonus:

Rerun the following code chunk:


new_model <-
  logistic_reg() %>% 
  fit(diabetes ~ ., data = train_sample)

new_preds <-
  new_model %>% 
  augment(new_data = test, type.predict = "response")

new_preds %>% 
  class_metrics(truth = diabetes, estimate = .pred_class)

Does \(k\)-nn do a better job than logistic regression at predicting diabetes?

πŸ’°πŸŽπŸŽ‰πŸ’°πŸŽπŸŽ‰ Super Bonus

Can you create a bar plot that demonstrates this difference? Again, please consider using one of these plots when comparing model performance.


# Code here
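One possible sketch, assuming knn_preds from Part V and new_preds from the bonus are both in memory:

# Side-by-side bars of precision and recall for the two models
bind_rows(`k-NN` = class_metrics(knn_preds, truth = diabetes, estimate = .pred_class),
          logistic = class_metrics(new_preds, truth = diabetes, estimate = .pred_class),
          .id = "model") %>% 
  ggplot(aes(.metric, .estimate, fill = model)) +
  geom_col(position = "dodge") +
  theme_bar() +
  scale_fill_npg() +
  labs(x = NULL, y = "Estimate", fill = "Model")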