✅ Week 08 Lab - Solutions

Clustering using k-means and dbscan

Author

The DS202A team

This solution file follows the format of the Quarto (.qmd) file you had to fill in during the lab session. If you want to render the document yourselves and play with the code, you can download the .qmd version of this solution file by clicking on the button below:

🥅 Learning Objectives

By the end of this lab, you will be able to:

  • Apply the k-means algorithm to cluster observations
  • Understand the limitations of the k-means algorithm
  • Apply the dbscan algorithm

⚙️ Setup (5 mins)

Load the libraries for today’s lab. If there is a library you haven’t installed yet, install it first, e.g. install.packages("dbscan").

# Uncomment and run the line below if needed (optional)
# install.packages("dbscan")
# install.packages("gapminder")
# install.packages("NbClust")
library("dbscan")
library("gapminder")
library("ggsci")
library("NbClust")
library("tidymodels")
library("tidyverse")

📋 Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.

Employing k-means on a two-dimensional customer segmentation data set (15 minutes)

We will create a tibble object named customers that has two features

  • income shows the income of customers in thousands of USD
  • spending_score provides a 0-100 index that compares spending amongst customers

customers <- 
  read_csv("data/customers-data.csv") %>% 
  select(income, spending_score) %>% 
  # convert both variables to z-scores so they are on a comparable scale
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
Rows: 200 Columns: 4                                                                                               
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): age, income, spending_score
lgl (1): male

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
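
The final mutate() step converts both variables into z-scores, so income and spending score are on a comparable scale before we start computing distances. If you want to double-check that the standardisation worked, a quick optional sanity check (not part of the original lab) is:

# After standardisation, each column should have mean ~0 and standard deviation ~1
customers %>% 
  summarise(across(everything(), list(mean = mean, sd = sd)))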

Let’s plot the relationship between the two variables using a scatter plot.


customers %>% 
  ggplot(aes(income, spending_score)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  labs(x = "Income", y = "Spending score")

Looking at the graph, we can see that the customers are organised into somewhat distinct segments.

🗣️ CLASSROOM DISCUSSION:

How many clusters do you see? Can you describe intuitively what each cluster represents?

There appear to be 5 clusters. Three of the clusters show customers with low / medium / high incomes and spending scores. However, we see 2 further clusters of customers who have low incomes yet high spending, and vice versa.

Implementing k-means clustering

Implementing k-means clustering in R is fairly straightforward. We pass our 2-d customers tibble to kmeans and specify the number of clusters with the centers argument (here, centers = 5).


set.seed(321)
kclust_customers <- kmeans(customers, centers = 5)

After this, we can create a new variable (cluster) by extracting the vector of cluster assignments from kclust_customers and converting the result into a factor. After that, a few simple modifications to our ggplot let us view the results.


customers %>% 
  mutate(cluster = as.factor(kclust_customers$cluster)) %>% 
  ggplot(aes(income, spending_score, colour = cluster)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom") +
  scale_colour_bmj() +
  labs(x = "Income", y = "Spending score",
       colour = "Cluster #")

Validating k-means

We can tell intuitively that 5 clusters probably makes the most sense. However, in most cases, we will not have the luxury of being able to eyeball the data like this (for instance, when there are more than two features), and we will need to validate our choice of cluster number.

The elbow method

One widely used method is the elbow method. Simply put, when we see a distinct “elbow” in the plot, we decide that adding more clusters will not result in a significant reduction in the total within-cluster sum of squared errors. As a result, we can stop and use that many clusters.

🧑‍🏫 TEACHING MOMENT:

What do we mean by total within-cluster sum of squared errors? When the k-means algorithm is initialised, k centroids are placed randomly at various points in the feature space. The algorithm then iteratively moves these centroids around the feature space until it finds a set of coordinates that minimises the error, i.e. the sum, over all points, of the squared distance between each point and its assigned centroid.
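
To make the definition concrete, here is a minimal optional sketch (not part of the original lab) that recomputes the total within-cluster sum of squares for kclust_customers by hand and compares it with the tot.withinss value stored by kmeans():

# For each cluster, sum the squared deviations of every point from the cluster
# mean (the centroid), then add everything up across clusters and variables.
manual_wcss <- customers %>% 
  mutate(cluster = kclust_customers$cluster) %>% 
  group_by(cluster) %>% 
  mutate(across(c(income, spending_score), ~ (.x - mean(.x))^2)) %>% 
  ungroup() %>% 
  summarise(total = sum(income) + sum(spending_score)) %>% 
  pull(total)

# Should match (up to floating-point error) the value reported by kmeans()
c(manual = manual_wcss, kmeans = kclust_customers$tot.withinss)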

We can create an elbow plot using the nested tibble approach.


set.seed(321)

# one row per candidate number of clusters k, each holding the full data set
crossing(k = 1:9, nest(customers, .key = "data")) %>% 
  # fit k-means for each k and extract a one-row model summary with glance()
  mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
         glanced = map(model, ~ glance(.x))) %>% 
  unnest(glanced) %>%  # glance() exposes tot.withinss, which we plot below
  ggplot(aes(k, tot.withinss)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_minimal() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank()) +
  scale_x_continuous(breaks = 1:9, labels = 1:9) +
  labs(x = "Number of clusters", y = "Total within-cluster SS")

🗣️ CLASSROOM DISCUSSION:

What can we conclude from this plot? Is 5 clusters the best number?

Our initial intuition of 5 clusters seems to make sense here. After 5 clusters, the total within-cluster sum of squared errors no longer drops at a rate that is significant in a substantive sense.

Let the indices vote!

The elbow method, though visually intuitive, can often be difficult to interpret. Furthermore, there are many indices one can choose from to identify the most appropriate number of clusters. Instead of committing to a single index, we can let multiple indices vote; the number of clusters with the highest number of votes is then used.


votes <-
  NbClust(customers, distance = "euclidean", min.nc=2, max.nc=8, 
          method = "kmeans", index = "ch")


votes$Best.nc[1]

Implementing this method reveals that 6 clusters are most appropriate, which doesn’t exactly confirm our intuition… Is a group of points that we saw as one single cluster actually two separate clusters? (Note: the index used in the calculation above is the Calinski-Harabasz index. The NbClust library provides a total of 30 different indices you can use to determine the best number of clusters!)
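
If you want to see an actual vote across many indices rather than just the Calinski-Harabasz index, one option (a sketch, and somewhat slower to run, since NbClust prints extra diagnostics and plots along the way) is to set index = "all" and tally the recommendations:

# Run most of NbClust's indices in one go and count how many of them
# recommend each candidate number of clusters.
votes_all <- 
  NbClust(customers, distance = "euclidean", min.nc = 2, max.nc = 8, 
          method = "kmeans", index = "all")

table(votes_all$Best.nc[1, ])  # first row of Best.nc holds the recommended k per index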

Employing k-means on higher dimensions (15 minutes)

Suppose we are interested in understanding the profiles of workers at a company. We will use three standardised variables from the Absenteeism at Work data set:

  • transportation_expense how much it costs to travel into work
  • distance_from_residence_to_work the distance from residence to work
  • work_load_average_day average workload in a day

Cleaning the data set

Run the following code:

workers <- 
  read_csv("data/absenteeism-at-work-data.csv") %>% 
  janitor::clean_names() %>% 
  select(transportation_expense, distance_from_residence_to_work,
         work_load_average_day) %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
Rows: 740 Columns: 21                                                              
── Column specification ───────────────────────────────────────────────────────────
Delimiter: ","
dbl (21): ID, Reason for absence, Month of absence, Day of the week, Seasons, T...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Find the optimal number of clusters

Use an elbow plot or NbClust to find the optimal number of clusters:


votes <-
  NbClust(workers, distance = "euclidean", min.nc=2, max.nc=8, 
          method = "kmeans", index = "ch")


votes$Best.nc[1]

set.seed(321)

crossing(k = 1:9, nest(workers, .key = "data")) %>% 
  mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
         glanced = map(model, ~ glance(.x))) %>% 
  unnest(glanced) %>% 
  ggplot(aes(k, tot.withinss)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_minimal() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank()) +
  scale_x_continuous(breaks = 1:9, labels = 1:9) +
  labs(x = "Number of clusters", y = "Total within-cluster SS")

Both methods show 5 to be the appropriate number of clusters to choose here. However, for visualisation purposes, we’ll restrict ourselves to 3 clusters.

Apply the cluster labels to the data set

After doing this, run the code underneath to display the following visualisation.


workers %>% 
  mutate(cluster = as.factor(votes$Best.partition)) %>% 
  pivot_longer(-cluster, names_to = "variable") %>% 
  ggplot(aes(value, str_replace_all(variable, "\\_", " "), fill = variable)) +
  facet_wrap(. ~ paste("Cluster", cluster)) +
  geom_boxplot(alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red", linewidth = 1.25) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none") +
  scale_fill_bmj() +
  labs(x = "Value", y = NULL)

🗣️ CLASSROOM DISCUSSION:

Can you describe intuitively what each cluster represents?

Cluster 1 shows workers with middling workloads, transportation costs, and distances from work. Cluster 2 shows workers with lower average workloads and greater distances from work. Cluster 3 shows workers with higher average workloads.
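
If you would like a numeric summary to back up this description, one option (an optional sketch using the same votes$Best.partition labels as the plot above) is to look at each cluster's mean on the standardised variables:

# Mean of each standardised variable within each cluster: values above 0 are
# above the overall average, values below 0 are below it.
workers %>% 
  mutate(cluster = votes$Best.partition) %>% 
  group_by(cluster) %>% 
  summarise(across(everything(), mean))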

Is k-means clustering always the right choice? (10 minutes)

The answer is emphatically no. Because k-means assigns each point to its nearest centroid, it tends to carve the feature space into roughly even, convex clusters, even when the true clusters are uneven, oddly shaped, or non-existent.

Example 1: Odd conditional distributions


circles <- 
  read_csv("data/circles-data.csv") %>% 
  select(x1, x2)

circles %>% 
  ggplot(aes(x1, x2)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

Find the optimal number of clusters



set.seed(321)

crossing(k = 1:9, nest(circles, .key = "data")) %>% 
  mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
         glanced = map(model, ~ glance(.x))) %>% 
  unnest(glanced) %>% 
  ggplot(aes(k, tot.withinss)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_minimal() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank())

Run the model


kclust <- kmeans(circles, centers = 4)

Plot the results


circles %>% 
  mutate(cluster = as.factor(kclust$cluster)) %>% 
  ggplot(aes(x1, x2, colour = cluster)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_colour_ucscgb() +
  labs(x = "Variable 1", y = "Variable 2",
       colour = "Cluster #")

🗣️ CLASSROOM DISCUSSION:

What went wrong?

Four centroids were placed in the 2-dimensional feature space, and the algorithm iterated until the total within-cluster sum of squared errors was minimised. As a result, the k-means algorithm partitioned the data much as we might slice a pie, rather than respecting the ring-like structure of the clusters.

👉 NOTE: To see more about how different clustering algorithms produce different kinds of clusters in different scenarios, click here.

Introducing DBSCAN (20 minutes)

The DBSCAN algorithm overcomes some of the shortcomings of k-means by grouping together points that lie in dense regions (points with enough close neighbours) and labelling points in sparse regions as noise, rather than assigning every point to a centroid.

👥 DISCUSS IN PAIRS/GROUPS:

Enter ?dbscan into the console and look at the eps and minPts hyperparameters.

The code below implements dbscan: eps sets the radius of the neighbourhood around each point, and minPts sets how many points must fall within that radius for a point to count as a core point.


db_model_circles <- dbscan::dbscan(circles, eps = 0.15, minPts = 5) # we specify that we use the function from the dbscan library (there could be a conflict with the fpc library if you've installed it in the W08 lecture)

circles %>% 
  mutate(cluster = as.factor(db_model_circles$cluster)) %>% 
  ggplot(aes(x1, x2, colour = cluster)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_colour_ucscgb() +
  labs(x = "Variable 1", y = "Variable 2", colour = "Cluster #")

Applying dbscan to the customer segmentation data set

Use eps = 0.325 and minPts = 5 when applying dbscan to the customer segmentation data. Print the model output.


db_model_customers <- dbscan::dbscan(customers, eps = 0.325, minPts = 5)

db_model_customers
DBSCAN clustering for 200 objects.
Parameters: eps = 0.325, minPts = 5
Using euclidean distances and borderpoints = TRUE
The clustering contains 7 cluster(s) and 25 noise points.

 0  1  2  3  4  5  6  7 
25 12 12  7  3 88 30 23 

Available fields: cluster, eps, minPts, metric, borderPoints

Try plotting the results yourself to see the differences.


customers %>% 
  mutate(cluster = as.factor(db_model_customers$cluster),
         outlier = if_else(cluster == 0, TRUE, FALSE)) %>% 
  ggplot(aes(income, spending_score, colour = cluster, shape = outlier)) +
  geom_point(size = 2) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_colour_manual(values = c("black", pal_bmj()(7)), breaks = 1:7) +
  scale_shape_manual(values = c(16,4)) +
  labs(x = "Income", y = "Spending score",
       colour = "Cluster #", shape = "Outlier")

👥 DISCUSS IN PAIRS/GROUPS:

What difference in clustering do you see with dbscan?

Firstly, we have 7 clusters instead of 5. Secondly, some of the data points do not belong to a cluster and are considered to be outliers by dbscan. If we used k-means, all points would belong to a cluster, regardless of how far away a given point is from the centroid (as long as it is not closer to another).

Now it’s your turn: Global life expectancy and GDP per capita in 1997 (30 minutes)

Load the following data set


gapminder_1997 <- 
  gapminder %>% 
  filter(year == 1997) %>% 
  select(gdp_percap = gdpPercap, life_exp = lifeExp) %>%
  mutate(across(everything(),  ~ (.x-mean(.x))/sd(.x)))

Create a scatter plot


gapminder_1997 %>% 
  ggplot(aes(gdp_percap, life_exp)) +
  geom_point() +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  labs(x = "GDP per capita", y = "Life expectancy")

Find the optimal number of clusters

Try an elbow plot …


set.seed(321)

crossing(k = 1:9, nest(gapminder_1997, .key = "data")) %>% 
  mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
         glanced = map(model, ~ glance(.x))) %>% 
  unnest(glanced) %>% 
  ggplot(aes(k, tot.withinss)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_minimal() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank()) +
  scale_x_continuous(breaks = 1:9, labels = 1:9) +
  labs(x = "Number of clusters", y = "Total within-cluster SS")

Or try NbClust (again using only the Calinski-Harabasz index):

votes <-
  NbClust(gapminder_1997, distance = "euclidean", min.nc=2, max.nc=8, 
          method = "kmeans", index = "ch")

votes$Best.nc[1]
Number_clusters 
              8 

Run the k-means algorithm and plot the results


kclust_gap <- kmeans(gapminder_1997, centers = 3)

gapminder_1997 %>% 
  mutate(cluster = as.factor(kclust_gap$cluster)) %>% 
  ggplot(aes(gdp_percap, life_exp, colour = cluster)) +
  geom_point(size = 2) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom") +
  scale_colour_bmj() +
  labs(x = "GDP per capita", y = "Life expectancy",
       colour = "Cluster #")

Try using dbscan to cluster the points

Try setting eps = 0.265 and minPts = 4.


db_model_gap <- dbscan::dbscan(gapminder_1997, eps = 0.265, minPts = 4)
db_model_gap
DBSCAN clustering for 142 objects.
Parameters: eps = 0.265, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 3 cluster(s) and 8 noise points.

  0   1   2   3 
  8 104  17  13 

Available fields: cluster, eps, minPts, metric, borderPoints

Plot the model


n_clust <- length(unique(db_model_gap$cluster))  # note: this also counts the noise label (0)

gapminder_1997 %>% 
  mutate(cluster = as.factor(db_model_gap$cluster),
         outlier = if_else(cluster == 0, TRUE, FALSE)) %>% 
  ggplot(aes(gdp_percap, life_exp, colour = cluster, shape = outlier)) +
  geom_point(size = 2) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_colour_manual(values = c("black", pal_bmj()(n_clust - 1)), breaks = 1:(n_clust - 1)) +  # black = noise (cluster 0), as in the customers plot
  scale_shape_manual(values = c(16,4)) +
  labs(x = "GDP per capita", y = "Life expectancy",
       colour = "Cluster #", shape = "Outlier")