💻 LSE DS202A 2024: Week 04 - Lab
Welcome to our DS202A fourth lab!
This week, you will explore logistic regression, an important classification technique. You will perform data exploration, build a logistic regression model and evaluate its performance using standard classification metrics such as precision, recall and the area under the precision-recall curve.
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Fit a logistic regression model in R.
- Evaluate the performance of a logistic regression model using standard classification metrics.
Preparation
Materials to download
Use the link below to download the lab materials:
as well as the dataset we will use for this lab:
Put this dataset file in the data folder within your DS202A project folder.
⚙️ Setup
In this lab, we will use a few R libraries to help with data handling, model evaluation and visualization:
- doParallel: to help us utilise multiple cores when running code that demands a lot of memory.
- ggsci: for nice colour palettes.
- tidymodels: an ecosystem for machine-learning models.
- kknn: a library specifically for the \(K\)-NN model we'll discover at the end of this lab.
- tidyverse: an ecosystem for data manipulation and visualisation.
Install missing libraries:
You probably don't have the doParallel and kknn libraries installed, so you'll need to install them.
#make sure you run this code only once and that this chunk is non-executable when you render your qmd
install.packages("doParallel")
install.packages("kknn")
Import required libraries:
library("ggsci")
library("tidymodels")
library("tidyverse")
library("doParallel")
library("kknn")
Create functions
# Define a metric set
# purrr::partial allows you to preset parameters for defined functions
# and turn them into their own objects.
class_metrics <- metric_set(precision, recall)
class_metrics <- partial(class_metrics, event_level = "second")
# ggplot2 plot themes
theme_histogram <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank())
}

theme_boxplot <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "none")
}

theme_line <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "none")
}

theme_bar <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.y = element_blank())
}

theme_scatter <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          legend.position = "bottom")
}
Lab Tasks
Part I - Exploratory data analysis (20 min)
The first step is to load the dataset. In this lab, we will be using the diabetes dataset, which contains health-related data. It includes variables associated with medical conditions, lifestyle factors and demographic information:
- diabetes: indicates whether the individual has diabetes.
- high_bp: indicates whether the individual has high blood pressure.
- high_chol: indicates whether the individual has high cholesterol.
- chol_check: indicates whether the individual has had their cholesterol checked.
- bmi: represents the individual's Body Mass Index.
- smoker: indicates whether the individual is a smoker.
- stroke: indicates whether the individual has had a stroke.
- heart_diseaseor_attack: indicates whether the individual has or has had heart disease or a heart attack.
- phys_activity: indicates whether the individual engages in physical activity.
- fruits and veggies: indicate the consumption of fruits and vegetables.
- hvy_alcohol_consump: indicates heavy alcohol consumption.
- no_docbc_cost: refers to whether an individual was unable to see a doctor due to cost-related barriers.
- any_healthcare: indicates whether the individual has any form of healthcare.
- gen_hlth: indicates the individual's self-reported general health.
- diff_walk: indicates whether the individual has difficulty walking or faces mobility challenges.
- sex: indicates the individual's gender.
- age: represents the individual's age.
- education: represents the individual's education level.
- income: represents the individual's income level.
diabetes_data <- read_csv("data/diabetes.csv")
Question 1: Check the dimensions of the dataframe. Check whether there are any missing values.
diabetes_data

diabetes_data %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  glimpse()
Question 2: What are the types of the variables in the dataset (continuous, discrete, categorical, ordinal)? Convert the continuous variables to numeric, discrete variables to integers and categorical variables to factors.
<- c("age", "bmi")
continuous_vars <- c("gen_hlth", "education", "income")
discrete_vars <- c("diabetes", "high_bp", "high_chol", "chol_check",
categorical_vars "smoker", "stroke", "heart_diseaseor_attack",
"phys_activity", "fruits", "veggies",
"hvy_alcohol_consump", "any_healthcare", "no_docbc_cost",
"diff_walk", "sex")
<-
diabetes_data %>%
diabetes_data mutate(across(all_of(continuous_vars), ~ as.numeric(.x)),
across(all_of(discrete_vars), ~ as.integer(.x)),
across(all_of(categorical_vars), ~ as.factor(.x)))
Note: Imagine how cumbersome it would be to perform these transformations if we had to do them for each variable! For more information on what across is doing, click here.
Question 3: Generate box plots for the variable bmi for each class of the column diabetes. Repeat the same for the variables education and income. What do you observe from the box plots? How do BMI, education and income vary across the diabetes classes?
diabetes_data %>%
  ggplot(aes(diabetes, bmi, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "BMI")

diabetes_data %>%
  ggplot(aes(diabetes, education, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "Education")

diabetes_data %>%
  ggplot(aes(diabetes, income, fill = diabetes)) +
  geom_boxplot() +
  theme_boxplot() +
  scale_fill_npg() +
  labs(x = "Diabetes", y = "Income")
Question 4: How many observations fall into each class of the variable diabetes? Comment. What challenges would you expect when training a model on this data?
count(diabetes_data, diabetes)
Part II - Fitting a logistic regression (25 min)
We want to perform a logistic regression using the available variables in the dataset to predict whether an individual has diabetes.
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will formalize the logistic regression in the context of the data at our disposal.
Now, we need to split the data into training and testing sets. Having a test set will help us evaluate how well the model generalises.
In the training phase, we will use part of the data to fit the logistic regression model.
In the testing phase, we will assess the modelβs performance on the remaining data (test set) which was not used during training.
Question 1: Why can't we rely solely on the model's performance on the training set to evaluate its ability to generalize?
Question 2: Split the dataset into training and testing sets using 75% of the data for training and 25% for testing. Use slice_sample to create a sample of 1% of the training data.
set.seed(123)
split_grid <- initial_split(diabetes_data, prop = 0.75, strata = diabetes)
train <- training(split_grid)
test <- testing(split_grid)

train_sample <- slice_sample(train, prop = 0.01)
💡 Tip: When using initial_split, try specifying strata = diabetes. This ensures that the training and test sets have (approximately) the same proportions of the outcome.
Question 3: Fit a logistic regression model on the training data. Should all variables be included in the model?
model <-
  logistic_reg() %>%
  fit(diabetes ~ ., data = train)
Question 4: Generate a summary of the model. Compare the \(p\)-values for bmi and for education.
summary(model$fit)
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will explain how to interpret the summary of the model (\(p\)-values, AIC, ...).
Part III - Evaluation (45 min)
Question 1: We are going to generate predictions on both the training and testing sets using our logistic regression model. Generate predictions for both sets.
# Generate predictions on the training set
train_preds <-
  model %>%
  augment(new_data = train, type.predict = "response")

# Generate predictions on the testing set
test_preds <-
  model %>%
  augment(new_data = test, type.predict = "response")
💡 Tip: When using augment for logistic regression, specify type.predict = "response".
Question 2: Copy and paste pull(test_preds, .pred_1)[1:10] into the console and hit enter to get the first ten predictions for the test set. The model's predictions are scores ranging between 0 and 1, while the target variable diabetes is binary and takes only the values 0 or 1. How can we use the predictions to classify whether an individual has diabetes?
Question 3: We are going to set an arbitrary threshold \(\alpha\). All scores higher than \(\alpha=0.8\) will be classified as 1, and all scores of \(\alpha=0.8\) or lower will be classified as 0.
# Convert predicted scores to binary outcomes for the training set
train_preds <-
  train_preds %>%
  mutate(.pred_class = as.factor(if_else(.pred_1 > 0.8, 1, 0)))

# Convert predicted scores to binary outcomes for the testing set
test_preds <-
  test_preds %>%
  mutate(.pred_class = as.factor(if_else(.pred_1 > 0.8, 1, 0)))
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will explain what precision and recall are.
Question 4: What would be the recall if we only predicted 1's on the test set?
💡💡 Tips:
- Create a new factor variable. When using factor(), make sure to specify levels = c(0, 1)!
- Try the recall function, and don't forget to set event_level = "second"!
# Code here
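One possible sketch, assuming test_preds holds the augmented test-set predictions from Question 1 (the column name .pred_all_ones is made up for illustration):

# Predict class 1 for every observation, then compute recall on the test set
test_preds %>%
  mutate(.pred_all_ones = factor(1, levels = c(0, 1))) %>%
  recall(truth = diabetes, estimate = .pred_all_ones, event_level = "second")

Since every actual positive is then predicted as positive, the recall is 1, which is a reminder that recall on its own can be misleading.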
Question 5: Now, let's measure precision and recall for both the training and testing sets.
# Code here
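One possible answer sketch, assuming train_preds and test_preds already contain the .pred_class column created in Question 3:

# Precision and recall on the training set
train_preds %>%
  class_metrics(truth = diabetes, estimate = .pred_class)

# Precision and recall on the testing set
test_preds %>%
  class_metrics(truth = diabetes, estimate = .pred_class)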
Question 6: Compare the results you obtain for both the training and testing sets. Was this expected? Does the model overfit? Underfit?
Question 7: What is the problem with setting an arbitrary threshold \(\alpha=0.8\)? How should we expect the precision and the recall to behave if we increase the threshold \(\alpha\)? If we decrease \(\alpha\)?
Question 8: Compute the precision and recall for the test set for different values of the threshold \(\alpha\). Compute them for \(\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 \}\).
# Create a list of probability thresholds
alpha_list <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)

# This function helps create a probability threshold
# and subsets the data.
set_prob_th <-
  as_mapper(
    ~ .x %>%
      mutate(truth = as.factor(diabetes == "1"),
             estimate = as.factor(as.numeric(.pred_1) > .y)) %>%
      select(truth, estimate)
  )

# Register multiple cores
doParallel::registerDoParallel()

# Calculate the metrics at each probability threshold across nested tibbles
ev_metrics <-
  crossing(alpha = alpha_list,
           nest(augment(model, new_data = test), .key = "augmented")) %>%
  mutate(preds = map2(augmented, alpha, ~ set_prob_th(.x, .y)),
         metrics = map(preds, ~ class_metrics(.x, truth = truth, estimate = estimate))) %>%
  unnest(metrics) %>%
  select(alpha, .metric, .estimate)
👨🏻‍🏫 TEACHING MOMENT: Your teacher will explain what the map function is doing.
- Because we need to perform operations over list columns, we need map and map2 to apply functions to each cell. With map2, we are telling R to calculate the different probability thresholds found in alpha_list over the nested tibbles in augmented.
- With purrr::map functions, we can use anonymous functions (starting with ~), which provide a more compact syntax for function writing. We could use function(.x, .y) set_prob_th(.x, .y) to achieve the same thing, but this would make the code longer. .x (and .y) are the placeholders used when specifying an anonymous function.
Question 9: Create a plot that shows performance over different thresholds, using colour to distinguish between the different evaluation metrics.
# Code here
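A minimal sketch of such a plot, assuming ev_metrics from Question 8 is available (the axis labels are illustrative choices):

ev_metrics %>%
  ggplot(aes(alpha, .estimate, colour = .metric)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  scale_color_npg() +
  labs(x = "Threshold (alpha)", y = "Estimate", colour = "Metric")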
Bonus: Try redefining class_metrics to also include f_meas (the F1 score) and see what happens.
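One way to redefine it (a sketch; you may want to restore the two-metric definition afterwards so that later chunks behave as before):

class_metrics <- metric_set(precision, recall, f_meas)
class_metrics <- partial(class_metrics, event_level = "second")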
Question 10: Based on the precision and recall values you calculated, create a precision-recall plot to visualize the relationship between these two metrics. How does the shape of the precision-recall curve help you assess model performance? What trade-offs do you observe between precision and recall?
# Code here
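One possible sketch, assuming ev_metrics from Question 8: pivot the metrics wide so that each threshold gives a (recall, precision) pair, then connect the points.

ev_metrics %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  ggplot(aes(recall, precision)) +
  geom_path() +
  geom_point() +
  theme_minimal() +
  labs(x = "Recall", y = "Precision")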
Question 11: Let's employ train_sample built earlier. Train a new logistic regression model using all the variables and then evaluate its performance on the testing set by generating a precision-recall curve. Compare the precision-recall curves of the original model and the new model.
# Code here
# Code here
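A possible sketch, reusing alpha_list and set_prob_th from Question 8 and evaluating on the test set; the names new_model and ev_metrics_new are the ones referred to later in the lab:

# Fit a new logistic regression on the 1% training sample
new_model <-
  logistic_reg() %>%
  fit(diabetes ~ ., data = train_sample)

# Recompute the threshold metrics on the test set for the new model
ev_metrics_new <-
  crossing(alpha = alpha_list,
           nest(augment(new_model, new_data = test), .key = "augmented")) %>%
  mutate(preds = map2(augmented, alpha, ~ set_prob_th(.x, .y)),
         metrics = map(preds, ~ class_metrics(.x, truth = truth, estimate = estimate))) %>%
  unnest(metrics) %>%
  select(alpha, .metric, .estimate)

You can then draw both precision-recall curves as in Question 10 and compare them.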
Question 12: What differences do you observe between the two models' precision-recall curves? How does training on less data affect the performance of the logistic regression model?
# Code here
💡💡 Tips:
- Try finding a way of combining the evaluation metrics when using the full data and the sample.
- Create a list with two named elements.
- Pipe in bind_rows(.id = "sample").
- Plot the results, add facet_wrap(. ~ .metric) to your ggplot and see what happens.
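Putting the tips together, one possible sketch, assuming ev_metrics (model trained on the full training set) and ev_metrics_new (model trained on the 1% sample) both exist:

list(full = ev_metrics, sample = ev_metrics_new) %>%
  bind_rows(.id = "sample") %>%
  ggplot(aes(alpha, .estimate, colour = sample)) +
  geom_line() +
  facet_wrap(. ~ .metric) +
  theme_minimal() +
  scale_color_npg() +
  labs(x = "Threshold (alpha)", y = "Estimate", colour = "Sample")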
Part IV - Cross-validation (X min)
In this part, we will explore a technique often used in machine learning to evaluate the performance and generalisability of our models. Cross-validation is often used to get a better estimate of the model's performance on new, unseen data (the test set).
💡 Tip: We have been creating a lot of objects that, in turn, have used up a lot of memory. It is sometimes worth removing the objects we no longer need. We can then use gc() ("garbage collection"), which helps free up memory in RStudio.
rm(list = ls(pattern = "ev_metrics$|model|pred"))
gc()
The idea of \(k\)-fold cross-validation is to split a dataset into a training set and a testing set (just as we did previously), then to split the training set into \(k\) folds.
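Although this lab uses leave-one-out cross-validation below, here is a minimal sketch of how \(k\)-fold resamples could be created with tidymodels (the object name folds and the choice of \(k = 5\) are illustrative):

# Five folds, stratified on the outcome
folds <- vfold_cv(train_sample, v = 5, strata = diabetes)
folds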
π¨π»βπ« TEACHING MOMENT: Your class teacher will give more details on how cross-validation works and what its purpose is.
Question 1: Use the tidymodels package to perform leave-one-out cross-validation.
# Create leave-one-out resamples
loocv <- loo_cv(train_sample)

# Instantiate a logistic regression model
logit <- logistic_reg()

# Register multiple cores
doParallel::registerDoParallel()

# Create a nested tibble of fitted models
loocv_fit <-
  loocv %>%
  mutate(train = map(splits, ~ training(.x)),
         holdout = map(splits, ~ testing(.x)),
         fit = map(train, ~ fit(logit, diabetes ~ ., data = .x)))

# Apply the model to the holdout data
loocv_preds <-
  loocv_fit %>%
  select(fit, holdout) %>%
  mutate(preds = map2(fit, holdout, ~ augment(.x, new_data = .y)))

# Unnest the predictions and calculate evaluation metrics
loocv_preds %>%
  unnest(preds) %>%
  class_metrics(truth = diabetes, estimate = .pred_class)
Question 2: Try manipulating the probability threshold and create a graph similar to the one from Part III, Question 9.
# Code here
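One possible sketch, pooling the leave-one-out predictions and sweeping over the same alpha_list as before (the object names pooled_preds and loocv_threshold_metrics are my own):

# Pool the held-out predictions from every resample
pooled_preds <-
  loocv_preds %>%
  unnest(preds) %>%
  select(diabetes, .pred_1)

# Compute precision and recall at each threshold
loocv_threshold_metrics <-
  map_dfr(alpha_list, function(a) {
    pooled_preds %>%
      mutate(truth = factor(diabetes == "1", levels = c(FALSE, TRUE)),
             estimate = factor(.pred_1 > a, levels = c(FALSE, TRUE))) %>%
      class_metrics(truth = truth, estimate = estimate) %>%
      mutate(alpha = a)
  })

# Plot the metrics against the threshold
loocv_threshold_metrics %>%
  ggplot(aes(alpha, .estimate, colour = .metric)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  scale_color_npg() +
  labs(x = "Threshold (alpha)", y = "Estimate", colour = "Metric")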
Part V - \(k\)-nearest neighbours (X min)
💡 Tip: See the previous tip for why the code below is necessary.
rm(list = ls(pattern = "loocv|ev_metrics_new"))
gc()
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will explain how \(k\)-NN works.
Question 1: Start by normalizing the continuous variables. Explain why it can be useful to carry out this transformation.
# Create a standardise function
standardise <- function(.x) {
  out <- (.x - mean(.x)) / sd(.x)
  out
}
# Apply function to training and testing data
# Code here
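One possible way to apply it (the names train_knn and test_knn are illustrative; a stricter approach would standardise the test set using the training means and standard deviations):

train_knn <-
  train %>%
  mutate(across(all_of(continuous_vars), standardise))

test_knn <-
  test %>%
  mutate(across(all_of(continuous_vars), standardise))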
Question 2: Perform a \(k\)-NN classification with \(k=5\). Generate the predictions for the test set and compute both precision and recall.
# Register multiple cores
# code here
# Fit a kNN model
# code here
# Evaluate the model on the test set
# code here
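A sketch of one way to do this with parsnip's nearest_neighbor() and the kknn engine, assuming train_knn and test_knn from Question 1 (fitting on the full training set can be slow, so you may prefer a standardised version of train_sample):

# Register multiple cores
doParallel::registerDoParallel()

# Fit a kNN model with k = 5
knn_model <-
  nearest_neighbor(mode = "classification", neighbors = 5) %>%
  set_engine("kknn") %>%
  fit(diabetes ~ ., data = train_knn)

# Evaluate the model on the test set
knn_preds <-
  knn_model %>%
  augment(new_data = test_knn)

knn_preds %>%
  class_metrics(truth = diabetes, estimate = .pred_class)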
Bonus:
Rerun the following code chunk:
new_model <-
  logistic_reg() %>%
  fit(diabetes ~ ., data = train_sample)

new_preds <-
  new_model %>%
  augment(new_data = test, type.predict = "response")

new_preds %>%
  class_metrics(truth = diabetes, estimate = .pred_class)
Does kNN do a better job than logistic regression at predicting diabetes?
Super Bonus
Can you create a bar plot that demonstrates this difference? Again, please consider using one of these plots when comparing model performance.
# Code here
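One possible sketch of such a bar plot, assuming new_preds from the Bonus chunk and knn_preds from Part V are available:

bind_rows(logistic = class_metrics(new_preds, truth = diabetes, estimate = .pred_class),
          knn = class_metrics(knn_preds, truth = diabetes, estimate = .pred_class),
          .id = "model") %>%
  ggplot(aes(.metric, .estimate, fill = model)) +
  geom_col(position = "dodge") +
  theme_bar() +
  scale_fill_npg() +
  labs(x = "Metric", y = "Estimate", fill = "Model")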