🛣️ LSE DS202A 2025: Week 04 - Lab Roadmap
Welcome to our fourth DS202A lab!
This week, you will explore logistic regression, an important classification technique. You will perform data exploration, build a logistic regression model and evaluate its performance using standard classification metrics such as precision, recall and the area under the precision-recall curve.
🎯 Learning Outcomes
By the end of this lab, you will be able to:
- Build and interpret logistic regression models for binary classification: fit logistic regression models using tidymodels, understand how coefficients differ from linear regression, and interpret model outputs in the context of probability and odds.
- Evaluate classification model performance using multiple metrics: calculate and interpret precision, recall, F1-score, and confusion matrices, understanding the trade-offs between different evaluation metrics and how threshold selection impacts performance.
- Apply cross-validation techniques for robust model evaluation: implement k-fold and leave-one-out cross-validation to assess model generalizability and obtain more reliable performance estimates beyond simple train-test splits.
- Compare different classification algorithms and preprocessing techniques: implement k-nearest neighbours (k-NN) classification, understand the importance of feature scaling for distance-based algorithms, and systematically compare model performance across different approaches.
📚 Preparation
In this lab, we will use a few R libraries to help with data handling, model evaluation and visualization:
- doParallel: to help us utilise multiple cores when running code that demands a lot of memory.
- ggsci: for nice colour palettes.
- tidymodels: an ecosystem for machine-learning models.
- kknn: a library specifically for the \(K\)-NN model we’ll discover at the end of this lab.
- tidyverse: an ecosystem for data manipulation and visualisation.
Predicting diabetes
⚙️ Setup
Install missing libraries:
You probably don’t have the doParallel and kknn libraries installed so you’ll need to install them.
# Make sure you run this code only once and that this chunk is non-executable when you render your qmd
install.packages("doParallel")
install.packages("kknn")Alternatively, if you have the pacman or librarian libraries installed, you could do either of:
pacman::p_load("doParallel","kknnn")or
librarian::shelf(doParallel,kknn)Download the lab’s .qmd notebook
Click on the link below to download the .qmd file for this lab. Save it in the DS202A folder you created in the first week. If you need a refresher on the setup, refer back to Part II of W01’s lab.
Import required libraries:
library("ggsci")
library("tidymodels")
library("tidyverse")
library("kknn")📋 Lab Tasks
Part I - Exploratory data analysis (20 min)
The first step is to load the dataset. In this lab, we will be using the diabetes dataset which contains health-related data. It includes variables associated with medical conditions, lifestyle factors and demographic information:
- diabetes: indicates whether the individual has diabetes.
- high_bp: indicates whether the individual has high blood pressure.
- high_chol: indicates whether the individual has high cholesterol.
- chol_check: indicates whether the individual has had their cholesterol checked.
- bmi: represents the individual’s Body Mass Index.
- smoker: indicates whether the individual is a smoker.
- stroke: indicates whether the individual has had a stroke.
- heart_diseaseor_attack: indicates whether the individual has, or has had, heart disease or a heart attack.
- phys_activity: indicates whether the individual engages in physical activity.
- fruits and veggies: indicate the consumption of fruits and vegetables, respectively.
- hvy_alcohol_consump: indicates heavy alcohol consumption.
- no_docbc_cost: refers to whether an individual was unable to see a doctor due to cost-related barriers.
- any_healthcare: indicates whether the individual has any form of healthcare.
- gen_hlth: indicates the individual’s self-reported general health.
- diff_walk: indicates whether the individual has difficulty walking or faces mobility challenges.
- sex: indicates the individual’s gender.
- age: represents the individual’s age.
- education: represents the individual’s education level.
- income: represents the individual’s income level.
diabetes_data <- read_csv("data/diabetes.csv")

Question 1: Check the dimensions of the dataframe. Check whether there are any missing values.
diabetes_data
diabetes_data |>
summarise(across(everything(), ~ sum(is.na(.x)))) |>
glimpse()

Challenge: What are the types of the variables in the dataset (continuous, discrete, categorical, ordinal)? Can you figure out how to convert them to the appropriate R data types efficiently?
Step 1: Explore your data types
First, let’s see what we’re working with:
# Check current data types
glimpse(diabetes_data)
# Or try:
# str(diabetes_data)
# sapply(diabetes_data, class)

Step 2: Classify your variables
Look at each variable and think about what type it should be:
- Continuous variables (can take any value in a range) → should be numeric
- Discrete variables (whole numbers, counts, scales) → should be integer
- Categorical variables (distinct categories, yes/no) → should be factor
🤔 Think about it: Which variables in your dataset fall into each category?
Step 3: Transform efficiently
Here’s your challenge: Can you convert multiple variables at once without writing separate mutate() statements for each one?
# Code here

Step 4: Check your work
After your transformations, your data should look something like this:
> diabetes_data |> head(10) |> glimpse()
Rows: 10
Columns: 20
$ diabetes <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ high_bp <fct> 0, 1, 0, 0, 0, 1, 0, 0, 1, 0
$ high_chol <fct> 0, 1, 0, 0, 1, 1, 1, 1, 1, 0
$ chol_check <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ bmi <dbl> 21, 28, 24, 27, 31, 33, 29, 27, 25, 33
$ smoker <fct> 0, 0, 0, 1, 1, 0, 1, 1, 0, 1
$ stroke <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ heart_diseaseor_attack <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ phys_activity <fct> 0, 1, 1, 1, 0, 0, 1, 0, 1, 1
$ fruits <fct> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1
$ veggies <fct> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1
$ hvy_alcohol_consump <fct> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0
$ any_healthcare <fct> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1
$ no_docbc_cost <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
$ gen_hlth <int> 3, 3, 1, 2, 4, 4, 1, 3, 1, 2
$ diff_walk <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
$ sex <fct> 0, 0, 1, 1, 0, 0, 1, 1, 1, 1
$ age <dbl> 7, 13, 1, 2, 8, 7, 10, 10, 8, 8
$ education <int> 4, 6, 4, 4, 3, 3, 6, 4, 6, 5
$ income <int> 2, 6, 7, 7, 2, 6, 8, 8, 8, 8

Discussion questions:
- What’s the advantage of converting categorical variables to factors?
- Why might you want discrete variables as integers vs. numeric?
- How would you handle this if you had 50+ variables to transform?
Ask Claude for help!
Once you’ve attempted this challenge, ask Claude: “Can you show me the most efficient tidyverse way to transform multiple variables by type? What are the best practices for variable type conversion?”
Best practices to discover:
- 🎯 Efficiency: Transform multiple variables at once
- 🎯 Clarity: Group variables by their intended type
- 🎯 Reproducibility: Make your transformations explicit and documented
- 🎯 Validation: Always check your results with glimpse() or str()
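If you got stuck on Step 3, here is one possible sketch. The variable groupings below are inferred from the glimpse() output shown in Step 4, so double-check them against your own data:

# A sketch: group variables by intended type, then convert each group
# with a single across() call
binary_vars <- c(
  "diabetes", "high_bp", "high_chol", "chol_check", "smoker", "stroke",
  "heart_diseaseor_attack", "phys_activity", "fruits", "veggies",
  "hvy_alcohol_consump", "any_healthcare", "no_docbc_cost", "diff_walk", "sex"
)
ordinal_vars <- c("gen_hlth", "education", "income")

diabetes_data <-
  diabetes_data |>
  mutate(
    across(all_of(binary_vars), as.factor),
    across(all_of(ordinal_vars), as.integer)
  )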
Exploring the diabetes dataset: EDA challenge (15 minutes)

🎯 YOUR TURN TO EXPLORE!

Now that you’ve transformed your variables, it’s time to dig into the data and uncover some interesting patterns. Your mission: create 2-3 visualizations or summaries that reveal something interesting about diabetes and related health factors. Some ideas to spark your curiosity:
- What’s the distribution of diabetes in the dataset?
- How does BMI relate to diabetes status?
- Are there age patterns in diabetes prevalence?
- Do lifestyle factors (smoking, physical activity, diet) show interesting relationships?
- What about the relationship between income/education and health outcomes?
- Are there gender differences in any of the health measures?
- How do different health conditions cluster together?
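To get you started, here is one worked example (a sketch only; adapt it to the questions that interest you):

# Example: how does BMI vary with diabetes status?
diabetes_data |>
  ggplot(aes(x = diabetes, y = bmi)) +
  geom_boxplot() +
  labs(x = "Diabetes (0 = no, 1 = yes)", y = "BMI")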
# Your EDA code here!
# Try different plot types: histograms, boxplots, bar charts, scatter plots
# Use group_by() and summarise() for interesting summaries
# Experiment with faceting by different variables

Class discussion
We’ll share some of the most interesting findings and discuss:
What patterns surprised you the most?
What questions do your findings raise?
How might these insights inform public health strategies?
What additional data would help explain these patterns?
Part II - Fitting a logistic regression
We want to perform a logistic regression using the available variables in the dataset to predict whether an individual has diabetes.
👨🏻🏫 TEACHING MOMENT: Your class teacher will formalize the logistic regression in the context of the data at our disposal.
Now, we need to split the data into training and testing sets. Having a test set will help us evaluate how well the model generalizes.
In the training phase, we will use part of the data to fit the logistic regression model.
In the testing phase, we will assess the model’s performance on the remaining data (test set) which was not used during training.
Question 1: Why can’t we rely solely on the model’s performance on the training set to evaluate its ability to generalize?
Question 2: Split the dataset into training and testing sets using 75% of the data for training and 25% for testing.
# Code here

💡Tip: When using initial_split, try specifying strata = diabetes. This ensures that the training and test sets have (almost) identical proportions of the outcome classes.
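If you need a starting point, a minimal sketch following the tip above (the names train and test are reused later in this lab):

set.seed(123)  # for reproducibility

split <- initial_split(diabetes_data, prop = 0.75, strata = diabetes)
train <- training(split)
test  <- testing(split)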
Challenge: Can you figure out how to fit a logistic regression model using the tidymodels framework?
Hint: Look back at the lasso section from Lab 3 - you’ll need to modify that approach. Think about:
- What type of model specification do you need for logistic regression?
- What engine should you use?
- How do you fit the model to your training data?
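If you are still stuck after working through these hints, here is a minimal sketch (logit_fit is just a name we choose here; it assumes the train set from Question 2):

logit_fit <-
  logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification") |>
  fit(diabetes ~ ., data = train)

# Inspect the estimated coefficients
tidy(logit_fit)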
# Code here

Question 3: Should all variables be included in the model? How might you decide?
Question 4: Once you have your model fitted, try to generate predictions and evaluate performance. How do you interpret logistic regression coefficients differently from linear regression?
# Code here

👨🏻🏫 CLASS DISCUSSION: We’ll discuss model interpretation together:
- How do we read logistic regression output?
- What do the coefficients represent?
- How do p-values help us understand variable importance?
- What metrics should we use to evaluate classification performance?
Part III - Evaluation
Question 1: We are going to generate predictions on the testing set using our logistic regression model.
# Code here

💡Tip: When using augment for logistic regression, specify type.predict = "response".
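For reference, a sketch using a parsnip fit (it assumes the hypothetical logit_fit from Part II; with a parsnip model, augment() adds .pred_class and the class probabilities .pred_0 and .pred_1 automatically):

test_preds <- augment(logit_fit, new_data = test)

test_preds |>
  select(diabetes, .pred_class, .pred_1) |>
  head()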
Question 2: Copy and paste pull(test_preds, .pred_1)[1:10] into the console and hit enter to get the first ten predictions for the test set. The model’s predictions are scores ranging between 0 and 1 while the target variable diabetes is binary and takes only the values 0 or 1. How can we use the predictions to classify whether an individual has diabetes?
# Code here

Question 3: We are going to set a threshold of 0.5. All scores higher than 0.5 will be classified as 1 (diabetes) and scores of 0.5 or lower will be classified as 0 (no diabetes).
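One possible sketch, assuming the test_preds tibble from Question 1 (we store the thresholded classes in a new column so as not to overwrite .pred_class):

test_preds <-
  test_preds |>
  mutate(pred_class_05 = factor(if_else(.pred_1 > 0.5, 1, 0), levels = c(0, 1)))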
# Code here

Question 4: Let’s create a confusion matrix to see how well our model performs:
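A sketch using yardstick’s conf_mat(), assuming the thresholded column created above:

test_preds |>
  conf_mat(truth = diabetes, estimate = pred_class_05)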
# Code here

Question 5: Table discussion - How would you evaluate whether this model is good or bad?
Discuss with your table partners:
- What does each cell in the confusion matrix represent?
- What would make you confident in this model?
- What concerns might you have?
- How might the consequences of false positives vs false negatives differ in a medical context?
Class Discussion: We’ll discuss what these different metrics mean and when each might be most important.
Understanding threshold impacts on precision, recall, and F1-score
Question 6: What is the problem with setting an arbitrary threshold α=0.8? How should we expect the precision and the recall to behave if we increase the threshold α? If we decrease α?
Question 7: Compute the precision, recall, and F1-score for the test set for different values of the threshold α. Compute them for α ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7}.
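If you get stuck, one possible sketch. Note the event_level = "second" argument: yardstick treats the first factor level (here 0) as the positive class by default, so we flip it since diabetes = 1 is our event of interest:

thresholds <- seq(0.1, 0.7, by = 0.1)

threshold_metrics <-
  map_df(thresholds, function(t) {
    # Re-classify the test predictions at threshold t
    preds <-
      test_preds |>
      mutate(.pred_t = factor(if_else(.pred_1 > t, 1, 0), levels = c(0, 1)))
    tibble(
      threshold = t,
      precision = precision(preds, truth = diabetes, estimate = .pred_t, event_level = "second")$.estimate,
      recall    = recall(preds, truth = diabetes, estimate = .pred_t, event_level = "second")$.estimate,
      f1        = f_meas(preds, truth = diabetes, estimate = .pred_t, event_level = "second")$.estimate
    )
  })

threshold_metrics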
# Code here

Visualizing performance across thresholds
Question 8: Create a plot that shows performance over different thresholds, using colour to distinguish between the different evaluation metrics (precision, recall, F1-score).
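A possible sketch, assuming the threshold_metrics tibble from the previous question:

threshold_metrics |>
  pivot_longer(c(precision, recall, f1), names_to = "metric", values_to = "value") |>
  ggplot(aes(x = threshold, y = value, colour = metric)) +
  geom_line() +
  geom_point() +
  labs(x = "Threshold", y = "Score", colour = "Metric")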
# Code here

Creating a precision-recall curve
Question 9: Based on the precision and recall values you calculated, create a precision-recall plot to visualize the relationship between these two metrics. How does the shape of the precision-recall curve help you assess model performance? What trade-offs do you observe between precision and recall?
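You can build this from your own table of precision and recall values, or let yardstick compute the full curve directly from the predicted scores. A sketch of the latter (again assuming test_preds):

pr_curve(test_preds, truth = diabetes, .pred_1, event_level = "second") |>
  autoplot()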
# Code here

Part IV - Cross-validation
In this part, we will explore a technique often used in machine learning to evaluate the performance and generalizability of models. Cross-validation gives a more reliable estimate of how a model will perform on new, unseen data than a single train-test split.
💡Tip: We have been creating a lot of objects that, in turn, have used up a lot of memory. It is sometimes worth removing the objects we no longer need with rm(), then calling gc() (“garbage collection” 😂) to help free up memory in RStudio.
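For example (the object name below is a placeholder for whatever you no longer need):

rm(some_large_object)  # hypothetical name: replace with an object you no longer need
gc()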
The idea of \(k\)-fold cross-validation is to split a dataset into a training set and a testing set (just as we did previously), and then to split the training set into \(k\) folds.
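In tidymodels, the resamples for \(k\)-fold cross-validation can be created with vfold_cv(). A minimal sketch (it assumes the train set from Part II and reuses the logistic specification):

# 5-fold cross-validation on the training set, stratified by the outcome
folds <- vfold_cv(train, v = 5, strata = diabetes)

cv_results <-
  logistic_reg() |>
  set_engine("glm") |>
  fit_resamples(diabetes ~ ., resamples = folds)

# Average performance across the folds
collect_metrics(cv_results)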
👨🏻🏫 TEACHING MOMENT: Your class teacher will give more details on how cross-validation works and what its purpose is.
Challenge: Can you figure out how to perform leave-one-out cross-validation using tidymodels?
Important: Sample only 500 observations from your training data - leave-one-out CV on the full dataset would take too long!
Hints:
- Check the tidymodels documentation for leave-one-out cross-validation functions
- Look for functions starting with loo_
- You’ll need to create resamples, fit models to each fold, and evaluate performance
set.seed(123)
# Sample 500 observations for computational efficiency
train_sample <- slice_sample(train, n = 500)
# LOO CV code here

Resources:
- Leave-one-out cross-validation documentation - look specifically for the loo_cv() function
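If you need a starting point, here is one possible sketch. At the time of writing, tune’s fit_resamples() does not support loo_cv resamples, so one option is to loop over the splits manually with rsample’s analysis() and assessment():

loo_folds <- loo_cv(train_sample)

loo_preds <-
  map_df(loo_folds$splits, function(s) {
    # Fit on all but one observation, predict the held-out one
    fit <-
      logistic_reg() |>
      set_engine("glm") |>
      fit(diabetes ~ ., data = analysis(s))
    augment(fit, new_data = assessment(s))
  })

# Overall accuracy across the 500 held-out observations
accuracy(loo_preds, truth = diabetes, estimate = .pred_class)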
Questions to think about:
- Why do we use cross-validation instead of just train/test splits?
- What are the advantages and disadvantages of leave-one-out CV?
- How does the CV performance compare to your single train/test split?
Part V - \(k\)-nearest neighbours (X min)
💡Tip: See the previous tip for why the code below is necessary.
👨🏻🏫 TEACHING MOMENT: Your class teacher will explain how \(k\)-nn works.
Question 1: Start by normalizing the continuous variables. Explain why it can be useful to carry out this transformation.
# Create a standardise function
standardise <- function(.x) {
out <- (.x - mean(.x)) / sd(.x)
out
}
# Apply function to training and testing data
train_std <-
train |>
mutate(across(where(is.double), ~ standardise(.x)))
test_std <-
test |>
mutate(across(where(is.double), ~ standardise(.x)))

Question 2: Perform a \(k\)-nn classification with \(k=5\). Generate the predictions for the test set and compute both precision and recall.
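The evaluation step below uses class_metrics, a yardstick metric set. If you have not already defined one earlier in your notebook, a minimal version could be:

class_metrics <- metric_set(precision, recall)
# Note: by default yardstick treats the first factor level (0) as the "event";
# you may want to pass event_level = "second" when calling the metric set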
# Register multiple cores
doParallel::registerDoParallel()
# Fit a kNN model
knn_fit <-
nearest_neighbor(neighbors = 5) |>
set_mode("classification") |>
set_engine("kknn") |>
fit(diabetes ~ ., data = train_std)
# Evaluate the model on the test set
knn_fit |>
augment(new_data = test_std) |>
class_metrics(truth = diabetes, estimate = .pred_class)

💰🎁🎉 Bonus:
Does \(k\)-NN outperform logistic regression?
# Code here
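One way to find out (a sketch, assuming the hypothetical logit_fit from Part II and the class_metrics metric set above): compute the same metrics for the logistic model on the test set and compare them with the \(k\)-NN results.

# Evaluate the logistic regression with the same metric set
logit_fit |>
  augment(new_data = test) |>
  class_metrics(truth = diabetes, estimate = .pred_class)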