🛣️ Week 08 - Lab Roadmap (90 min)

Clustering using k-means and dbscan

Author

The DS202A team

🥅 Learning Objectives

By the end of this lab, you will be able to:

Apply the k-means algorithm to cluster observations
Understand the limitations of the k-means algorithm
Apply the dbscan algorithm

📚 Preparation: Loading packages and data

Materials to download

Please download the following files into your data folder.

Then use the link below to download the lab materials:

⚙️ Setup(5 mins)

Load the libraries for today’s lab. If there is a library you haven’t installed yet, install them like in install.packages("dbscan").

# Uncomment and run the line below if needed (optional)
# install.packages("dbscan")
# install.packages("gapminder")
# install.packages("NbClust")

library("dbscan")
library("gapminder")
library("ggsci")
library("NbClust")
library("tidymodels")
library("tidyverse")

📋 Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.

Employing k-means on a two-dimensional customer segmentation data set (15 minutes)

We will create a tibble object named customers that has two features

income shows income of customers in thousands of USD
spending_score provides a 0-100 index that compares spending amongst customers

customers <- 
  read_csv("data/customers-data.csv") %>% 
  select(income, spending_score) %>% 
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))

Let’s plot the correlation between both variables using a scatter plot.


customers %>% 
  ggplot(aes(income, spending_score)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  labs(x = "Income", y = "Spending score")

Looking at the graph, we can see that there are distinct segments of customers organised into somewhat distinct clusters.

🗣️ CLASSROOM DISCUSSION:

How many clusters do you see? Can you describe intuitively what each cluster represents?

Implementing k-means clustering

Implementing k-means clustering in R is fairly straightforward. We pass our 2-d customers tibble to kmeans and specify centers = n.


set.seed(321)
kclust_customers <- kmeans(customers, centers = 5)

After this, we can create a new variable (cluster) by finding the vector of cluster assignments in kclust_customers, and converting the results into a factor. After that, we can use some simple modifications to our ggplot to see some results.


customers %>% 
  mutate(cluster = as.factor(kclust_customers$cluster)) %>% 
  ggplot(aes(income, spending_score, colour = cluster)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom") +
  scale_colour_bmj() +
  labs(x = "Income", y = "Spending score",
       colour = "Cluster #")

Validating k-means

We can tell intuitively that 5 clusters probably makes the most sense. However, in most cases, we will not have the luxury of doing this, and we will need to validate our choice of cluster number.

The elbow method

One widely used method is the elbow method. Simply put, when we see distinct “elbows” in the plot, we decide that adding more clusters will not result in a significant reduction in the total within-sum cluster sum of squared errors. As a result, we can stop and use that many clusters.

🧑‍🏫 TEACHING MOMENT:

What do we mean by total within-sum cluster sum of squared errors? When the k-means algorithm is initialised, k centroids are placed randomly at various points of the feature space. The algorithm then iteratively moves around these centroids on this feature space to a set of coordinates that minimise error (in this case total within-cluster sum of squared errors).

We can create an elbow plot using the nested tibble approach.


set.seed(321)

crossing(k = 1:9, nest(customers, .key = "data")) %>% 
  mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
         glanced = map(model, ~ glance(.x))) %>% 
  unnest(glanced) %>% 
  ggplot(aes(k, tot.withinss)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_minimal() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank()) +
  scale_x_continuous(breaks = 1:9, labels = 1:9) +
  labs(x = "Number of clusters", y = "Total within-cluster SS")

🗣️ CLASSROOM DISCUSSION:

What can we conclude from this plot? Is 5 clusters the best number?

Let the indices vote!

The elbow method, though visually intuitive, can often be difficult to interpret. Furthermore, there are many indices one can choose from to identify the most appropriate number of clusters. Instead of picking another, we can simply let multiple indices vote and the number of clusters with the highest number of votes will then be implemented.


votes <-
  NbClust(customers, distance = "euclidean", min.nc=2, max.nc=8, 
          method = "kmeans", index = "ch")

Implementing this method reveals that 6 clusters are most appropriate, which doesn’t exactly confirm our intuition…Is a group of points that we saw as one single cluster actually two separate clusters? (Note: the index used in the calculation above is the Calinski-Harabasz index. The NbClust library provides a total of 30 different indices you can use to determine to best number of clusters!)

Employing k-means on higher dimensions (15 minutes)

Suppose we are interested in understand the profiles of workers based in a company. We will use three standardised variables from the Absenteeism at Work data set:

transportation_expense how much it costs to travel into work
distance_from_residence_to_work the distance from residence to work
work_load_average_day average workload in a day

Cleaning the data set

Run the following code:

workers <- 
  read_csv("data/absenteeism-at-work-data.csv") %>% 
  janitor::clean_names() %>% 
  select(transportation_expense, distance_from_residence_to_work,
         work_load_average_day) %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))

Find the optimal number of clusters

Use an elbow plot or NbClust to find the optimal number of clusters:


votes <-
  NbClust(workers, distance = "euclidean", min.nc=2, max.nc=8, 
          method = "kmeans", index = "ch")

Apply the cluster labels to the data set

After doing this, run the code underneath to display the following visualisation.


workers %>% 
  mutate(cluster = as.factor(votes$Best.partition)) %>% 
  pivot_longer(-cluster, names_to = "variable") %>% 
  ggplot(aes(value, str_replace_all(variable, "\\_", " "), fill = variable)) +
  facet_wrap(. ~ paste("Cluster", cluster)) +
  geom_boxplot(alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red", linewidth = 1.25) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none") +
  scale_fill_bmj() +
  labs(x = "Value", y = NULL)

🗣️ CLASSROOM DISCUSSION:

Can you describe intuitively what each cluster represents?

Is k-means clustering always the right choice? (10 minutes)

The answer is emphatically no. Due to the fact that k-means clustering uses the distance between points to identify clusters, it will try to create evenly sized clusters when clusters could, in fact, either be uneven or non-existent.

Example 1: Odd conditional distributions


circles <- 
  read_csv("data/circles-data.csv") %>% 
  select(x1, x2)

circles %>% 
  ggplot(aes(x1, x2)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

Find the optimal number of clusters


crossing(k = 1:9, nest(circles, .key = "data")) %>% 
  mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
         glanced = map(model, ~ glance(.x))) %>% 
  unnest(glanced) %>% 
  ggplot(aes(k, tot.withinss)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_minimal() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank())

Run the model


kclust <- kmeans(circles, centers = 4)

Plot the results


circles %>% 
  mutate(cluster = as.factor(kclust$cluster)) %>% 
  ggplot(aes(x1, x2, colour = cluster)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_colour_ucscgb() +
  labs(x = "Variable 1", y = "Variable 2",
       colour = "Cluster #")

🗣️ CLASSROOM DISCUSSION:

What went wrong?

👉 NOTE: To see more about how different clustering algorithms produce different kinds of clusters in different scenarios, click here.

Introducing DBSCAN (20 minutes)

The DBSCAN algorithm overcomes some of the short-comings of the k-means algorithm by using the distance between nearest points.

👥 DISCUSS IN PAIRS/GROUPS:

Enter ?dbscan into the console and look at the eps and minPts hyperparameters.

The below code will implement dbscan.


db_model_circles <- dbscan::dbscan(circles, eps = 0.15, minPts = 5) # we specify that we use the function from the dbscan library (there could be a conflict with the fpc library if you've installed it in the W08 lecture)

circles %>% 
  mutate(cluster = as.factor(db_model_circles$cluster)) %>% 
  ggplot(aes(x1, x2, colour = cluster)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_colour_ucscgb() +
  labs(x = "Variable 1", y = "Variable 2", colour = "Cluster #")

Applying dbscan to the customer segmentation data set

Use eps = 0.325 and minPts = 5 when applying dbscan to the customer segmentation data. Print the model output.


db_model_customers <- dbscan::dbscan(customers, eps = 0.325, minPts = 5) 
db_model_customers

Try plotting the results yourself to see the differences.


customers %>% 
  mutate(cluster = as.factor(db_model_customers$cluster),
         outlier = if_else(cluster == 0, TRUE, FALSE)) %>% 
  ggplot(aes(income, spending_score, colour = cluster, shape = outlier)) +
  geom_point(size = 2) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_colour_manual(values = c("black", pal_bmj()(7)), breaks = 1:7) +
  scale_shape_manual(values = c(16,4)) +
  labs(x = "Income", y = "Spending score",
       colour = "Cluster #", shape = "Outlier")

👥 DISCUSS IN PAIRS/GROUPS:

What difference in clustering do you see with dbscan?

Now it’s your turn: Global life expectancy and GDP per capita in 1997 (30 minutes)

Load the following data set


gapminder_1997 <- 
  gapminder %>% 
  filter(year == 1997) %>% 
  select(gdp_percap = gdpPercap, life_exp = lifeExp) %>%
  mutate(across(everything(),  ~ (.x-mean(.x))/sd(.x)))

Create a scatter plot


# Code here

Find the optimal number of clusters

Try an elbow plot …


# Code here

… or let the indices vote!


# Code here

Run the k-means algorithm and plot the results


# Code here

Try using dbscan to cluster the points

Try setting eps = 0.265 and minPts = 4.


# Code here

Plot the model


# Code here