πŸ’» LSE DS202W 2024: Week 09 - Lab

2023/24 Winter Term

Author

Dr Ghita Berrada

Published

11 Mar 2024

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Use principal component analysis (PCA) to perform dimensionality reduction
  • Understand the difference between supervised and unsupervised learning
  • Apply the k-means algorithm to cluster observations

πŸ“‹ Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below as soon as you arrive in the classroom.

Part 0: Export your chat logs (~ 3 min)

As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:

🎯 ACTION POINTS

  1. πŸ”— CLICK HERE to export your chat log.

    Thanks for being GENIAL! You are now one step closer to earning some prizes! 🎟️

πŸ‘‰ NOTE: You MUST complete the initial form.

If you really don’t want to participate in GENIAL1, just answer β€˜No’ to the Terms & Conditions question - your e-mail address will be deleted from GENIAL’s database the following week.

πŸ“š Preparation

We will be using the same dataset as in W08’s lecture. Click the button below to download it again if needed:

Use the link below to download the lab materials:

We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.

βš™οΈ Setup

Load the libraries for today’s lab.

# Tidyverse packages we will use
library(ggplot2)
library(dplyr)     
library(tidyr)     
library(readr) 
library(lubridate)    

# Tidymodel packages we will use
library(rsample)
library(yardstick) 
library(parsnip)   
library(recipes)   
library(workflows) 
library(rpart)
library(tune)
library(broom)
library(purrr)

library(tidyclust)

Part I: Looking back at clustering (~25 min)

🎯 ACTION POINTS

  1. Read the data set:

It is the dataset you’ve (just) downloaded (or downloaded for the W08 lecture).

# Modify the filepath if needed
filepath <- "data/marketing_campaign.csv"
marketing_tbl <- read_csv(filepath)
  2. Re-create the new columns from the lecture:
df <-
    marketing_tbl %>%
    # Convert data type using lubridate
    mutate(Dt_Customer = dmy(Dt_Customer)) %>%

    # Pretend today is the last date in the dataset
    # Calculate how long the customer has been with the company
    mutate(Customer_Age = (max(Dt_Customer) - Dt_Customer) / ddays(1)) %>%
    select(-Dt_Customer) %>%

    # How much has the customer spent in total across all categories
    mutate(TotalSpent = rowSums(across(starts_with("Mnt")))) %>%
    
    # Remove unnecessary features
    select(-Z_CostContact, -Z_Revenue, -Response) %>%

    janitor::clean_names()
  3. Finalize the pre-processing of the dataset with a recipe:
recipe_kmeans <- recipe(~ ., data = df) %>%
    update_role(c(id, education, marital_status, year_birth), new_role = "ID") %>%
    step_impute_knn(income, neighbors = 3) %>%
    step_normalize(all_numeric_predictors())

Can you make sense of all the steps in the recipe?
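If it helps, you can also inspect the recipe programmatically; a small sketch using functions from the recipes package (tidy() lists the steps, summary() lists the variables and their roles):

# Each row is one pre-processing step, in the order it will be applied
tidy(recipe_kmeans)

# Each row is one variable, with its type and the role we assigned to it
summary(recipe_kmeans)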

πŸ—£οΈ GROUP THEN CLASSROOM DISCUSSION

  1. Let’s have a look at (part of) the dataframe that results from all the pre-processing steps we’ve gone through in the setup:
 recipe_kmeans %>% prep(df) %>% bake(df) %>% print(n=10,width=Inf)

Thinking back to the k-means algorithm you’ve encountered in the W08 lecture, do you think:

  • this dataframe is easily visualized?
  • this dataframe is easily clustered?
  • why or why not?
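To make the visualisation question concrete, here is a quick back-of-the-envelope sketch: with roughly 23 numeric columns (the count quoted in Part II below, so treat it as approximate), an exhaustive set of pairwise scatterplots would already be unwieldy.

# Number of 2-D scatterplots needed to view every pair of ~23 numeric columns
choose(23, 2)
#> [1] 253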

Part II: Dimensionality Reduction (25 min)

Our data set has 23 numeric variables in total. That would be a lot of predictors for a model; so far, we have only ever worked with one or a handful of predictors at a time.
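If you want to verify that count yourself, a one-liner with dplyr should do (note that id and year_birth are numeric too, even though we treat them as ID variables rather than predictors):

# Count the numeric columns in the cleaned dataframe
df %>%
  select(where(is.numeric)) %>%
  ncol()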

Pay Attention

πŸ§‘β€πŸ« TEACHING MOMENT:

(Your class teacher will explain the motivation for dimensionality reduction, focusing on the why rather than the how; the how will be covered in this week’s lecture.)

(You will see this concept again in the W09 lecture, don’t worry if you don’t fully understand it yet.)

The Curse of Dimensionality πŸͺ„

No, that’s not a Harry Potter spell: it refers to the problems that often arise when dealing with high-dimensional data (i.e. data with lots of variables).

Machine learning models tend to perform badly with too many predictors: there is a higher chance of overfitting, and there is also a much bigger computational cost associated with training models on many variables.

Dimensionality Reduction to the Rescue πŸ§‘β€βš•οΈ

We want to reduce the number of variables. Intuitively, this could work because there are probably strong correlations between many of the individual variables, so there seems to be some redundancy in the data.

Dimensionality reduction refers to a collection of methods that can help reduce the number of variables while preserving most of the information.

Principal Component Analysis πŸ’‘

However, rather than deciding which variables to keep and which ones to throw out (that would fall under β€œvariable selection”), we want to compress our high dimensional data into a low dimensional set of variables while retaining as much information from the original data as possible.

flowchart LR
  A["--------------- Lots of Dimensions ---------------"] --> B{PCA}
  B --> C["-- Few Dimensions --"]

In this lab, we will take this idea to the extreme and reduce the dimensionality of the whole data set to just 2. That has the additional benefit that we can also plot the results on a two-dimensional grid.
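If you would like a minimal, self-contained illustration of what PCA does before we apply it to our data, here is a sketch on simulated data (all names and values below are made up for the example); because two of the three variables are nearly identical, the first principal component alone should capture most of the variance:

# Toy example of PCA with base R's prcomp(), unrelated to the marketing data
set.seed(42)                           # arbitrary seed, for reproducibility
x1  <- rnorm(100)
x2  <- x1 + rnorm(100, sd = 0.1)       # x2 is almost a copy of x1 (redundant)
x3  <- rnorm(100)
toy <- data.frame(x1, x2, x3)

toy_pca <- prcomp(toy, scale. = TRUE)  # scale. = TRUE standardizes each column
summary(toy_pca)                       # PC1 should explain most of the variance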

Prep the Data

  1. Let’s create a new recipe.
pca_recipe <-
  # we don't have an outcome variable here
  recipe(~., data = df) %>%
  # make sure to exclude the non-numeric columns (education, marital_status) and the irrelevant identifiers (id, year_birth) from the set of predictors
  update_role(c(id, education, marital_status, year_birth), new_role = "ID") %>%
  # apply kNN imputation to fill in the only column with missing values in our data (i.e. income)
  step_impute_knn(income, neighbors = 3) %>%
  # remove zero variance predictors as they don't carry any information
  step_zv(all_numeric_predictors()) %>%
  # normalize the data to avoid out-sized impact of some variables
  step_normalize(all_numeric_predictors()) %>%
  prep()

summary(pca_recipe)

This gives a summary of our pre-processed data: each variable, its type, and the role it plays in the recipe.

Note that all numeric predictors have now been standardized too, so they have mean 0 and variance 1. This is a recommended (or even necessary) step prior to performing PCA, in order to prevent large-range variables (which naturally have a higher variance) from having an out-sized impact on the results.
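If you want to convince yourself that the normalization did what it should, here is a small optional sanity check, using income as an example column (after imputation and normalization it should have mean β‰ˆ 0 and standard deviation β‰ˆ 1):

# Sanity check: a normalized (and imputed) column should have mean ~0 and sd ~1
bake(pca_recipe, new_data = df) %>%
  summarise(mean_income = mean(income), sd_income = sd(income))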

Perform PCA

  1. Now, we can perform dimensionality reduction using PCA on this data.

This can be done simply by appending a step_pca() to our recipe. We’ll set num_comp = 2, meaning that we want to reduce the dimensionality to just 2. We’ll also keep the original variables rather than replacing them, as we’ll still use some of them later in part III.

pca_recipe <-
  # we don't have an outcome variable here
  recipe(~., data = df) %>%
  # make sure to exclude the non-numeric columns (education, marital_status) and the irrelevant identifiers (id, year_birth) from the set of predictors
  update_role(c(id, education, marital_status, year_birth), new_role = "ID") %>%
  # apply kNN imputation to fill in the only column with missing values in our data (i.e. income)
  step_impute_knn(income, neighbors = 3) %>%
  # remove zero variance predictors as they don't carry any information
  step_zv(all_numeric_predictors()) %>%
  # normalize the data to avoid out-sized impact of some variables
  step_normalize(all_numeric_predictors()) %>%
  # apply principal component analysis
  step_pca(all_numeric_predictors(), num_comp = 2, keep_original_cols=TRUE) %>%
  prep()

# apply the recipe to our data set
pca_results <- bake(pca_recipe, df)
pca_results %>%
  select(PC1, PC2) %>%
  head()

Plot the Results

  1. Our data now contains the two newly created principal components (added as the last two columns). We can plot them.
ggplot(pca_results, aes(PC1, PC2)) +
  geom_point(alpha=0.3,color="#7038d1") +
  ggtitle("Principal Components", subtitle = "2-dimensional representation of our 20+ numeric predictors")

There is clearly still some structure/information in our data, and we will exploit this in part III.

So how much information did we retain/lose?

  1. We can answer this by looking at how much of the variance of the original data is contained in the first n=1,2,… principal components.

The first principal component is the one that captures the most variance, then the second, and so on.

# this is a little awkward but we just want to
# extract the fitted PCA (a prcomp object) that is hidden
# inside the recipe: step_pca is the 4th step
pca_fit <- pca_recipe$steps[[4]]$res

# the eigenvalues determine the amount of variance explained
eigenvalues <-
  pca_fit %>%
  tidy(matrix = "eigenvalues")

eigenvalues %>%
  filter(PC <= 23) %>%
  ggplot(aes(PC, percent)) +
  geom_col()+
  geom_text(aes(label = percent), vjust = -0.2)

The first principal component captures more than 30 percent of the variance in all the numeric predictors we had initially. The second principal component still captures almost 9 percent.

eigenvalues %>%
  filter(PC <= 23) %>%
  ggplot(aes(PC, cumulative)) +
  geom_col()+
  geom_text(aes(label = cumulative), vjust = -0.2)

Cumulatively, the first 2 components retain 40 percent of the variance. If we wanted to retain at least 80 percent of the variance, we would need to include at least 11 principal components. That’s still much less than the 20+ columns we had initially.

Only if we kept all of the principal components would we retain the full 100 percent of the variance.
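If you prefer to get the β€œhow many components for X percent of the variance” answer programmatically rather than by reading the plot, here is a small sketch (broom stores percent and cumulative as proportions, so 80 percent corresponds to 0.80):

# Smallest number of principal components retaining at least 80% of the variance
eigenvalues %>%
  filter(cumulative >= 0.80) %>%
  slice_head(n = 1)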

Part III: K-Means Clustering (40 min)

πŸ‘₯ IN PAIRS:

  1. Run k-means clustering on our two principal components from part II. Pick a number of clusters to pass to k-means.
# replace 3 with the number of clusters of your choice
kclust <-
  pca_results %>%
  select(PC1, PC2) %>%
  kmeans(centers = 3)

tidy(kclust)

The centroids (as x and y coordinates) found by the k-means algorithm are shown in the table above. With those cluster centroids, we can now apply the clustering to our data set.
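One practical note: k-means starts from randomly chosen centroids, so two runs can give (slightly) different clusters. If you want your results to be reproducible (and a bit more stable), one option is to fix the random seed and let kmeans() try several random starts, keeping the best one; the seed value and nstart = 25 below are arbitrary choices.

# Fix the random seed and try several random starts, keeping the best solution
set.seed(202)                       # any fixed number will do
kclust <-
  pca_results %>%
  select(PC1, PC2) %>%
  kmeans(centers = 3, nstart = 25)  # nstart: number of random initializations

tidy(kclust)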

  2. What happens exactly: you can think of this as the analogue of making β€œpredictions”, whereas β€œfitting” the model is the process of finding the centroids themselves. Making predictions means we assign a cluster label to each point depending on the closest centroid.
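Before plotting, you can also peek at the assignments in tabular form; a quick sketch counting how many customers end up in each cluster:

# augment() attaches a .cluster column; count() tallies the points per cluster
augment(kclust, pca_results) %>%
  count(.cluster)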

Visualized, it looks like this:

augment(kclust, pca_results) %>%
  ggplot() +
  geom_point(aes(x=PC1, y=PC2, color=.cluster), alpha=0.3) +
  geom_point(data = tidy(kclust), aes(x=PC1, y=PC2), size = 6, shape = "x")
Pay Attention

πŸ—£ Class Discussion: Your class teacher will mediate a discussion: Was your choice of k a good one? Why or why not?

So how many clusters do we need?

This is the BIG question when doing clustering. Since we don’t have any labels, there is no obvious answer to this question.

  3. To start with, we can do the same thing again but for different values of k, say between k = 1 and k = 9, and then see if we can decide which one provides the best fit.

# this is a very compact way to perform k-means clustering for various k
kclusts <- 
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(select(pca_results, PC1, PC2), .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, pca_results)
  )

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))
  4. Let’s visualize the results using a facet plot.
ggplot(assignments, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = .cluster), alpha = 0.7) + 
  facet_wrap(~ k) +
  geom_point(data = clusters, size = 4, shape = "x")
Pay Attention

πŸ—£ Discussion: Based on this plot, what is the best choice for k?

Judging the fit only visually has its limitations, of course (we couldn’t even do that if we had more than 2 dimensions). So what’s the alternative?

One commonly used method to choose the best number of clusters is the β€œelbow method”. It relies on the β€œwithin cluster sum of squares” (WCSS).

This metric will always keep decreasing as k increases. However, rather than choosing the maximum k, we may find that a moderate k is already good enough, in the sense that the WCSS does not decrease much further beyond it (i.e. the plot often looks like an elbow).
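Before plotting the WCSS, it can help to glance at the raw numbers; a quick sketch using the glance() output we stored in clusterings:

# WCSS (tot.withinss) per k, alongside the total and between-cluster sums of squares
clusterings %>%
  select(k, totss, tot.withinss, betweenss)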

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line(color="red",alpha=0.5,linewidth=1) +
  geom_point(color="red",size=4)+
  theme_grey(base_size = 15)
Pay Attention

πŸ—£ Discussion: Based on the elbow method, which k would you choose? Does the plot even show a clear β€œelbow”?
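If the elbow is hard to spot by eye, one rough heuristic is to look at how much the WCSS drops each time k increases by one, and stop where the relative drop becomes small; a sketch (the column names wcss_drop and relative_drop are made up for this example):

# How much does the WCSS fall with each additional cluster?
clusterings %>%
  arrange(k) %>%
  mutate(
    wcss_drop     = lag(tot.withinss) - tot.withinss,
    relative_drop = wcss_drop / lag(tot.withinss)
  ) %>%
  select(k, tot.withinss, wcss_drop, relative_drop)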

Footnotes

  1. We’re gonna cry a little bit, not gonna lie. But no hard feelings. We’ll get over it.β†©οΈŽ