πŸ’» LSE DS202A 2024: Week 04 - Lab

Author

Yassine Lahna & Stuart Bramwell

Published

21 Oct 2024

Welcome to our fourth DS202A lab!

This week, you will explore logistic regression, an important classification technique. You will perform data exploration, build a logistic regression model and evaluate its performance using standard classification metrics such as precision, recall and the area under the precision-recall curve.

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Fit a logistic regression model in R.
  • Evaluate the performance of a logistic regression model using standard classification metrics.

πŸ“š Preparation

Materials to download

Use the link below to download the lab materials:

as well as the dataset we will use for this lab:

Put this dataset file in the data folder within your DS202A project folder.

βš™οΈ Setup

In this lab, we will use a few R libraries to help with data handling, model evaluation and visualization:

  • doParallel: to help us utilise multiple cores when running computationally demanding code.
  • ggsci: for nice colour palettes.
  • tidymodels: an ecosystem for machine-learning models.
  • kknn: the engine behind the \(k\)-nn model we’ll explore at the end of this lab.
  • tidyverse: an ecosystem for data manipulation and visualisation.

Install missing libraries:

You probably don’t have the doParallel and kknn libraries installed, so you’ll need to install them.

# Make sure you run this code only once and that this chunk is not executed when you render your .qmd
install.packages("doParallel")
install.packages("kknn")

Import required libraries:

library("ggsci")
library("tidymodels")
library("tidyverse")
library("doParallel")
library("kknn")

Create functions


# Define a metric set
# purrr::partial allows you to preset parameters for defined functions
# and turn them into their own objects.

class_metrics <- metric_set(precision, recall)
class_metrics <- partial(class_metrics, event_level = "second")

# ggplot2 plot themes

theme_histogram <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
  
}

theme_boxplot <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "none")
  
}

theme_line <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "none")
  
}

theme_bar <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank())
  
}

theme_scatter <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom")
  
}

πŸ“‹ Lab Tasks

Part I - Exploratory data analysis (20 min)

The first step is to load the dataset. In this lab, we will be using the diabetes dataset, which contains health-related data. It includes variables associated with medical conditions, lifestyle factors and demographic information:

  • diabetes: indicates whether the individual has diabetes.
  • high_bp: indicates whether the individual has high blood pressure.
  • high_chol: indicates whether the individual has high cholesterol.
  • chol_check: indicates whether the individual has had their cholesterol checked.
  • bmi: represents the individual’s Body Mass Index.
  • smoker: indicates whether the individual is a smoker.
  • stroke: indicates whether the individual has had a stroke.
  • heart_diseaseor_attack: indicates whether the individual has, or has had, heart disease or a heart attack.
  • phys_activity: indicates whether the individual engages in physical activity.
  • fruits and veggies: indicate the consumption of fruits and vegetables.
  • hvy_alcohol_consump: indicates heavy alcohol consumption.
  • no_docbc_cost: refers to whether an individual was unable to see a doctor due to cost-related barriers.
  • any_healthcare: indicates whether the individual has any form of healthcare.
  • gen_hlth: indicates the individual’s self-reported general health.
  • diff_walk: indicates whether the individual has difficulty walking or faces mobility challenges.
  • sex: indicates the individual’s gender.
  • age: represents the individual’s age.
  • education: represents the individual’s education level.
  • income: represents the individual’s income level.
diabetes_data <- read_csv("data/diabetes.csv")

Question 1: Check the dimensions of the dataframe. Check whether there are any missing values.


diabetes_data

diabetes_data %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>% 
  glimpse()

Question 2: What are the types of the variables in the dataset (continuous, discrete, categorical, ordinal)? Convert the continuous variables to numeric, the discrete variables to integers and the categorical variables to factors.


continuous_vars <- c("age", "bmi")
discrete_vars <- c("gen_hlth", "education", "income")
categorical_vars <- c("diabetes", "high_bp", "high_chol", "chol_check",
                      "smoker", "stroke", "heart_diseaseor_attack",
                      "phys_activity", "fruits",  "veggies",
                      "hvy_alcohol_consump", "any_healthcare", "no_docbc_cost",
                      "diff_walk", "sex")

diabetes_data <- 
  diabetes_data %>% 
  mutate(across(all_of(continuous_vars), ~ as.numeric(.x)),
         across(all_of(discrete_vars), ~ as.integer(.x)),
         across(all_of(categorical_vars), ~ as.factor(.x)))

πŸ‘‰Note: Imagine how cumbersome it would be to perform these transformations if we had to do them for each variable! For more information on what across is doing, click here.

Question 3: Generate box plots of the variable bmi for each class of the column diabetes. Repeat the same for the variables education and income. What do you observe from the box plots? How do BMI, education and income vary across the diabetes classes?


diabetes_data %>% 
  ggplot(aes(diabetes, bmi, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "BMI")

diabetes_data %>% 
  ggplot(aes(diabetes, education, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "Education")

diabetes_data %>% 
  ggplot(aes(diabetes, income, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "Income")

Question 4: How many observations fall into each class of the variable diabetes? Comment. What challenges would you expect when training a model?


count(diabetes_data, diabetes)

Part II - Fitting a logistic regression (25 min)

We want to perform a logistic regression using the available variables in the dataset to predict whether an individual has diabetes.

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will formalize the logistic regression in the context of the data at our disposal.

Now, we need to split the data into training and testing sets. Having a test set will help us evaluate how well the model generalizes.

  • In the training phase, we will use part of the data to fit the logistic regression model.

  • In the testing phase, we will assess the model’s performance on the remaining data (test set) which was not used during training.

Question 1: Why can’t we rely solely on the model’s performance on the training set to evaluate its ability to generalize?

Question 2: Split the dataset into training and testing sets using 75% of the data for training and 25% for testing. Use slice_sample to create a sample of 1% of the training data.


set.seed(123)

split_grid <- initial_split(diabetes_data, prop = 0.75, strata = diabetes)
train <- training(split_grid)
test <- testing(split_grid)

train_sample <- slice_sample(train, prop = 0.01)

πŸ’‘Tip: When using initial_split try specifying strata = diabetes. This ensures that the training and test sets will have very similar proportions of the outcome classes.
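To see what stratification buys you, a quick check like the sketch below (not part of the lab answers) compares the class shares in the two sets:

# With strata = diabetes, both sets should show near-identical class shares
bind_rows(train = count(train, diabetes),
          test = count(test, diabetes),
          .id = "set") %>% 
  group_by(set) %>% 
  mutate(prop = n / sum(n))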

Question 3: Fit a logistic regression model on the training data. Should all variables be included in the model?


model <- 
  logistic_reg() %>% 
  fit(diabetes ~ ., data = train)

Question 4: Generate a summary of the model. Compare the \(p\)-values for bmi and for education.


summary(model$fit)

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain how to interpret the summary of the model (\(p-\)values, AIC …).

Part III - Evaluation (45 min)

Question 1: We are going to generate predictions on both the training and testing sets using our logistic regression model. Generate predictions for both sets.


# Generate predictions on the training set
train_preds <- 
  model %>% 
  augment(new_data = train, type.predict = "response")

# Generate predictions on the testing set
test_preds <- 
  model %>% 
  augment(new_data = test, type.predict = "response")

πŸ’‘Tip: When using augment for logistic regression, specify type.predict = "response".

Question 2: Copy and paste pull(test_preds, .pred_1)[1:10] into the console and hit enter to get the first ten predictions for the test set. The model’s predictions are scores ranging between 0 and 1, while the target variable diabetes is binary and takes only the values 0 or 1. How can we use the predictions to classify whether an individual has diabetes?

Question 3: We are going to set an arbitrary threshold \(\alpha\). All scores higher than \(\alpha=0.8\) will be classified as 1, and all other scores will be classified as 0.


# Convert predicted scores to binary outcomes for the training set
train_preds <- 
  train_preds %>% 
  mutate(.pred_class = as.factor(if_else(.pred_1 > 0.8, 1, 0)))

# Convert predicted scores to binary outcomes for the testing set
test_preds <- 
  test_preds %>% 
  mutate(.pred_class = as.factor(if_else(.pred_1 > 0.8, 1, 0)))

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain what precision and recall are.

Question 4: What would be the recall if we only predicted 1’s on the test set?

πŸ’‘πŸ’‘Tips:

  • Create a new factor variable. When using factor() make sure to specify levels = c(0, 1)!
  • Try the recall function; don’t forget to set event_level = "second"!

# Code here
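If you are stuck, here is a minimal sketch of one way to answer this; the object all_ones is our own, hypothetical name:

# Predict 1 for every test observation (mutate() recycles the
# length-one factor across all rows), then compute recall
all_ones <- 
  test_preds %>% 
  mutate(.pred_all_one = factor(1, levels = c(0, 1)))

recall(all_ones, truth = diabetes, estimate = .pred_all_one, event_level = "second")

Because every observation is predicted positive, no true positive can be missed, so recall equals 1, a reminder that recall alone is easy to game.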

Question 5: Now, let’s measure precision and recall for both training and testing sets.


# Code here
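One possible approach, reusing the class_metrics metric set defined in the setup (the combined-tibble layout is our own choice):

# Precision and recall for each set, stacked into one tibble
bind_rows(train = class_metrics(train_preds, truth = diabetes, estimate = .pred_class),
          test = class_metrics(test_preds, truth = diabetes, estimate = .pred_class),
          .id = "set")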

Question 6: Compare the results you obtain for the training and testing sets. Was this expected? Does the model overfit? Underfit?

Question 7: What is the problem with setting an arbitrary threshold \(\alpha=0.8\)? How should we expect the precision and the recall to behave if we increase the threshold \(\alpha\)? If we decrease \(\alpha\)?

Question 8: Compute the precision and recall for the test set for different values of the threshold, namely \(\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7\}\).


# Create a list of probability thresholds

alpha_list <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)

# This function helps create a probability threshold
# and subsets the data.

set_prob_th <-
  as_mapper(
    ~ .x %>% 
      mutate(truth = as.factor(diabetes == "1"),
             estimate = as.factor(as.numeric(.pred_1) > .y)) %>% 
      select(truth, estimate)
  )

# Register multiple cores

doParallel::registerDoParallel()

# Calculate probability thresholds across nested
# tibbles

ev_metrics <-
  crossing(alpha = alpha_list,
           nest(augment(model, new_data = test), .key = "augmented")) %>% 
  mutate(preds = map2(augmented, alpha, ~ set_prob_th(.x, .y)),
         metrics = map(preds, ~ class_metrics(.x, truth = truth, estimate = estimate))) %>% 
  unnest(metrics) %>% 
  select(alpha, .metric, .estimate)

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your teacher will explain what the map function is doing.

  • Because we need to perform operations over list columns, we use map and map2 to apply functions to each cell. With map2 we tell R to calculate each probability threshold in alpha_list over the nested tibbles in augmented.
  • With purrr::map functions, we can use anonymous functions (starting with ~), which provide a more compact syntax for function writing. We could use function(.x, .y) set_prob_th(.x, .y) to achieve the same thing, but this would make the code longer.
  • .x (and .y) are the placeholders used when specifying an anonymous function.

Question 9: Create a plot that shows performance over different thresholds, using colour to distinguish between the different evaluation metrics.


# Code here
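A minimal sketch, assuming the ev_metrics tibble from Question 8 is still in memory:

# One line per metric, threshold on the x-axis
ev_metrics %>% 
  ggplot(aes(alpha, .estimate, colour = .metric)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  scale_colour_npg() +
  labs(x = "Threshold", y = "Estimate", colour = "Metric")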

πŸ’°πŸŽπŸŽ‰ Bonus: Try redefining class_metrics to also include f_meas (the f1-score) and see what happens.

Question 10: Based on the precision and recall values you calculated, create a precision-recall plot to visualize the relationship between these two metrics. How does the shape of the precision-recall curve help you assess model performance? What trade-offs do you observe between precision and recall?


# Code here
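Instead of joining the handful of points from Question 8, you could also let yardstick trace the full curve; a sketch:

# pr_curve() sweeps over every threshold in the test-set scores
test_preds %>% 
  pr_curve(truth = diabetes, .pred_1, event_level = "second") %>% 
  autoplot()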

Question 11: Let’s employ the train_sample we built earlier. Train a new logistic regression model using all the variables, then evaluate its performance on the testing set by generating a precision-recall curve. Compare the precision-recall curves of the original model and the new model.


# Code here

# Code here
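A minimal sketch; sample_model and sample_preds are names we introduce here:

# Fit on the 1% sample, then score the full test set
sample_model <- 
  logistic_reg() %>% 
  fit(diabetes ~ ., data = train_sample)

sample_preds <- augment(sample_model, new_data = test, type.predict = "response")

# Overlay the two precision-recall curves
bind_rows(full = pr_curve(test_preds, truth = diabetes, .pred_1, event_level = "second"),
          sample = pr_curve(sample_preds, truth = diabetes, .pred_1, event_level = "second"),
          .id = "model") %>% 
  ggplot(aes(recall, precision, colour = model)) +
  geom_path() +
  theme_minimal() +
  scale_colour_npg() +
  labs(x = "Recall", y = "Precision", colour = "Model")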

Question 12: What differences do you observe between the two models’ precision-recall curves? How does training on less data affect the performance of the logistic regression model?


# Code here

πŸ’‘πŸ’‘Tips:

  • Try finding a way of combining the evaluation metrics when using the full data and the sample.
  • Create a list with two named elements.
  • Pipe in bind_rows(.id = "sample").
  • Plot the results, add facet_wrap(. ~ .metric) to your ggplot and see what happens (one possible sketch follows this list).
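Putting these tips together might look like the sketch below, assuming ev_metrics_new holds the threshold metrics recomputed for the sampled model (mirroring Question 8):

# Combine both sets of metrics, then facet by metric
list(full = ev_metrics, sample = ev_metrics_new) %>% 
  bind_rows(.id = "sample") %>% 
  ggplot(aes(alpha, .estimate, colour = sample)) +
  geom_line() +
  facet_wrap(. ~ .metric) +
  theme_minimal() +
  scale_colour_npg() +
  labs(x = "Threshold", y = "Estimate", colour = "Sample")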

Part IV - Cross-validation (X min)

In this part, we will explore a technique often used in machine learning to evaluate the performance and generalizability of our models. Cross-validation is often used to get a better estimate of the model’s performance on new data (the test set).

πŸ’‘Tip: We have been creating a lot of objects that, in turn, have used up a lot of memory. It is worth, sometimes, removing the objects we have no use for anymore. We can then call gc() (β€œgarbage collection” πŸ˜‚) to help free up memory in RStudio.

rm(list = ls(pattern = "ev_metrics$|model|pred"))
gc()

The idea of \(k\)-fold cross-validation is to split a dataset into a training set and testing set (just as we did previously), then to split the training set into \(k\) folds. The model is trained on \(k-1\) folds and evaluated on the remaining fold, rotating until every fold has served once as the hold-out set; averaging the \(k\) scores gives a more stable estimate of performance.
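For illustration, this is how you could create such folds with rsample (the lab itself uses the leave-one-out variant below); v = 5 is an arbitrary choice:

# A stratified 5-fold split of the training sample
vfold_cv(train_sample, v = 5, strata = diabetes)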

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will give more details on how cross-validation works and what its purpose is.

Question 1: Use the tidymodels ecosystem to perform leave-one-out cross-validation (the special case of \(k\)-fold where \(k\) equals the number of observations).


# Create leave one out resamples

loocv <- loo_cv(train_sample)

# Instantiate a logistic regression model

logit <- logistic_reg()

# Register multiple cores

doParallel::registerDoParallel()

# Create a nested tibble of fitted models

loocv_fit <-
  loocv %>% 
  mutate(train = map(splits, ~ training(.x)),
         holdout = map(splits, ~ testing(.x)),
         fit = map(train, ~ fit(logit, diabetes ~ ., data = .x))
         )

# Apply each fitted model to its held-out observation

loocv_preds <-
  loocv_fit %>% 
  select(fit, holdout) %>% 
  mutate(preds = map2(fit, holdout, ~ augment(.x, new_data = .y)))

# Unnest the predictions and calculate evaluation metrics

loocv_preds %>% 
  unnest(preds) %>% 
  class_metrics(truth = diabetes, estimate = .pred_class)

Question 2: Try manipulating the probability threshold and create a graph similar to the one in Part III, Question 9.


# Code here
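A minimal sketch that reuses alpha_list and set_prob_th from Part III on the pooled held-out predictions; loocv_all is our own name:

# Pool the held-out predictions, keeping only what set_prob_th needs
loocv_all <- 
  loocv_preds %>% 
  unnest(preds) %>% 
  select(diabetes, .pred_1)

# Sweep over thresholds and plot, as in Part III
crossing(alpha = alpha_list, nest(loocv_all, .key = "augmented")) %>% 
  mutate(preds = map2(augmented, alpha, ~ set_prob_th(.x, .y)),
         metrics = map(preds, ~ class_metrics(.x, truth = truth, estimate = estimate))) %>% 
  unnest(metrics) %>% 
  ggplot(aes(alpha, .estimate, colour = .metric)) +
  geom_line() +
  theme_minimal() +
  scale_colour_npg() +
  labs(x = "Threshold", y = "Estimate", colour = "Metric")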

Part V - \(k\)-nearest neighbours (X min)

πŸ’‘Tip: See previous tip for why the below code is necessary.

rm(list = ls(pattern = "loocv|ev_metrics_new"))
gc()

πŸ‘¨πŸ»β€πŸ« TEACHING MOMENT: Your class teacher will explain how \(k\)-nn works.

Question 1: Start by normalizing the continuous variables. Explain why it can be useful to carry out this transformation.


# Create a standardise function

standardise <- function(.x) {
  out <- (.x - mean(.x)) / sd(.x)
  out
}

# Apply function to training and testing data

# Code here
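A sketch of one way to do this; train_knn and test_knn are our own names. Strictly speaking, leakage is avoided by standardising the test set with the training set’s means and standard deviations, but here we follow the simpler per-set approach implied by the function above:

# Standardise the continuous variables within each set
train_knn <- train %>% mutate(across(all_of(continuous_vars), standardise))
test_knn <- test %>% mutate(across(all_of(continuous_vars), standardise))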

Question 2: Perform a \(k\)-nn classification with \(k=5\). Generate the predictions for the test set and compute both precision and recall.


# Register multiple cores

# code here

# Fit a kNN model

# code here

# Evaluate the model on the test set

# code here
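A minimal sketch using parsnip’s nearest_neighbor(), assuming the standardised train_knn / test_knn from Question 1:

# Register multiple cores
doParallel::registerDoParallel()

# Fit a 5-nearest-neighbour classifier with the kknn engine
knn_model <- 
  nearest_neighbor(mode = "classification", neighbors = 5) %>% 
  set_engine("kknn") %>% 
  fit(diabetes ~ ., data = train_knn)

# Evaluate on the test set
knn_preds <- augment(knn_model, new_data = test_knn)

class_metrics(knn_preds, truth = diabetes, estimate = .pred_class)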

πŸ’°πŸŽπŸŽ‰ Bonus:

Rerun the following code chunk:


new_model <-
  logistic_reg() %>% 
  fit(diabetes ~ ., data = train_sample)

new_preds <-
  new_model %>% 
  augment(new_data = test, type.predict = "response")

new_preds %>% 
  class_metrics(truth = diabetes, estimate = .pred_class)

Does \(k\)-nn do a better job than logistic regression at predicting diabetes?

πŸ’°πŸŽπŸŽ‰πŸ’°πŸŽπŸŽ‰ Super Bonus

Can you create a bar plot that demonstrates this difference? Again, please consider using one of these plots when comparing model performance.


# Code here
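One possible sketch, assuming knn_preds from Part V and new_preds from the bonus are both in memory:

# Side-by-side bars of precision and recall for the two models
bind_rows(`k-NN` = class_metrics(knn_preds, truth = diabetes, estimate = .pred_class),
          logistic = class_metrics(new_preds, truth = diabetes, estimate = .pred_class),
          .id = "model") %>% 
  ggplot(aes(.metric, .estimate, fill = model)) +
  geom_col(position = "dodge") +
  theme_bar() +
  scale_fill_npg() +
  labs(x = NULL, y = "Estimate", fill = "Model")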