✅ Week 08 Lab - Solutions
Clustering using k-means and dbscan
This solution file follows the format of the Quarto file .qmd
you had to fill in during the lab session. If you want to render the document yourselves and play with the code, you can download the .qmd
version of this solution file by clicking on the button below:
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Apply the k-means algorithm to cluster observations
- Understand the limitations of the k-means algorithm
- Apply the dbscan algorithm
⚙️ Setup(5 mins)
Load the libraries for today’s lab. If there is a library you haven’t installed yet, install them like in install.packages("dbscan")
.
# Uncomment and run the line below if needed (optional)
# install.packages("dbscan")
# install.packages("gapminder")
# install.packages("NbClust")
library("dbscan")
library("gapminder")
library("ggsci")
library("NbClust")
library("tidymodels")
library("tidyverse")
📋 Lab Tasks
No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.
Employing k-means on a two-dimensional customer segmentation data set (15 minutes)
We will create a tibble object named customers
that has two features
income
shows income of customers in thousands of USDspending_score
provides a 0-100 index that compares spending amongst customers
<-
customers read_csv("data/customers-data.csv") %>%
select(income, spending_score) %>%
mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
Rows: 200 Columns: 4
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): age, income, spending_score
lgl (1): male
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s plot the correlation between both variables using a scatter plot.
%>%
customers ggplot(aes(income, spending_score)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
labs(x = "Income", y = "Spending score")
Looking at the graph, we can see that there are distinct segments of customers organised into somewhat distinct clusters.
🗣️ CLASSROOM DISCUSSION:
How many clusters do you see? Can you describe intuitively what each cluster represents?
There appear to be 5 clusters. Three of the clusters show customers with low / medium / high incomes and spending scores. However, we see 2 further clusters of customer how have low incomes yet high spending, and vice versa.
Implementing k-means clustering
Implementing k-means clustering in R is fairly straightforward. We pass our 2-d customers
tibble to kmeans
and specify centers = n
.
set.seed(321)
<- kmeans(customers, centers = 5) kclust_customers
After this, we can create a new variable (cluster
) by finding the vector of cluster assignments in kclust_customers
, and converting the results into a factor. After that, we can use some simple modifications to our ggplot
to see some results.
%>%
customers mutate(cluster = as.factor(kclust_customers$cluster)) %>%
ggplot(aes(income, spending_score, colour = cluster)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
legend.position = "bottom") +
scale_colour_bmj() +
labs(x = "Income", y = "Spending score",
colour = "Cluster #")
Validating k-means
We can tell intuitively that 5 clusters probably makes the most sense. However, in most cases, we will not have the luxury of doing this, and we will need to validate our choice of cluster number.
The elbow method
One widely used method is the elbow method. Simply put, when we see distinct “elbows” in the plot, we decide that adding more clusters will not result in a significant reduction in the total within-sum cluster sum of squared errors. As a result, we can stop and use that many clusters.
🧑🏫 TEACHING MOMENT:
What do we mean by total within-sum cluster sum of squared errors? When the k-means algorithm is initialised, k centroids are placed randomly at various points of the feature space. The algorithm then iteratively moves around these centroids on this feature space to a set of coordinates that minimise error (in this case total within-cluster sum of squared errors).
We can create an elbow plot using the nested tibble approach.
set.seed(321)
crossing(k = 1:9, nest(customers, .key = "data")) %>%
mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
glanced = map(model, ~ glance(.x))) %>%
unnest(glanced) %>%
ggplot(aes(k, tot.withinss)) +
geom_point() +
geom_line(linetype = "dashed") +
theme_minimal() +
theme(panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank()) +
scale_x_continuous(breaks = 1:9, labels = 1:9) +
labs(x = "Number of clusters", y = "Total within-cluster SS")
🗣️ CLASSROOM DISCUSSION:
What can we conclude from this plot? Is 5 clusters the best number?
Our initial intuition of 5 clusters seems to make sense here. After 5 clusters, the rate at which the total within cluster sum of squared errors drops appears to be no longer significant (in a substantive sense).
Let the indices vote!
The elbow method, though visually intuitive, can often be difficult to interpret. Furthermore, there are many indices one can choose from to identify the most appropriate number of clusters. Instead of picking another, we can simply let multiple indices vote and the number of clusters with the highest number of votes will then be implemented.
<-
votes NbClust(customers, distance = "euclidean", min.nc=2, max.nc=8,
method = "kmeans", index = "ch")
$Best.nc[1] votes
Implementing this method reveals that 6 clusters are most appropriate, which doesn’t exactly confirm our intuition…Is a group of points that we saw as one single cluster actually two separate clusters? (Note: the index used in the calculation above is the Calinski-Harabasz index. The NbClust
library provides a total of 30 different indices you can use to determine to best number of clusters!)
Employing k-means on higher dimensions (15 minutes)
Suppose we are interested in understand the profiles of workers based in a company. We will use three standardised variables from the Absenteeism at Work data set:
transportation_expense
how much it costs to travel into workdistance_from_residence_to_work
the distance from residence to workwork_load_average_day
average workload in a day
Cleaning the data set
Run the following code:
<-
workers read_csv("data/absenteeism-at-work-data.csv") %>%
::clean_names() %>%
janitorselect(transportation_expense, distance_from_residence_to_work,
%>%
work_load_average_day) mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
Rows: 740 Columns: 21
── Column specification ───────────────────────────────────────────────────────────
Delimiter: ","
dbl (21): ID, Reason for absence, Month of absence, Day of the week, Seasons, T...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Find the optimal number of clusters
Use an elbow plot or NbClust
to find the optimal number of clusters:
<-
votes NbClust(workers, distance = "euclidean", min.nc=2, max.nc=8,
method = "kmeans", index = "ch")
$Best.nc[1]
votes
set.seed(321)
crossing(k = 1:9, nest(workers, .key = "data")) %>%
mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
glanced = map(model, ~ glance(.x))) %>%
unnest(glanced) %>%
ggplot(aes(k, tot.withinss)) +
geom_point() +
geom_line(linetype = "dashed") +
theme_minimal() +
theme(panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank()) +
scale_x_continuous(breaks = 1:9, labels = 1:9) +
labs(x = "Number of clusters", y = "Total within-cluster SS")
Both methods show 5 to be the appropriate number of clusters to choose here. However, for visualisation purposes, we’ll restrict ourselves to 3 clusters.
Apply the cluster labels to the data set
After doing this, run the code underneath to display the following visualisation.
%>%
workers mutate(cluster = as.factor(votes$Best.partition)) %>%
pivot_longer(-cluster, names_to = "variable") %>%
ggplot(aes(value, str_replace_all(variable, "\\_", " "), fill = variable)) +
facet_wrap(. ~ paste("Cluster", cluster)) +
geom_boxplot(alpha = 0.5) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "red", linewidth = 1.25) +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
legend.position = "none") +
scale_fill_bmj() +
labs(x = "Value", y = NULL)
🗣️ CLASSROOM DISCUSSION:
Can you describe intuitively what each cluster represents?
Cluster 1 shows workers with middling work loads, transportation costs, and distances from work. Cluster 2 shows workers with lower average work loads and higher distances from work. Cluster 3 shows workers with a higher average work loads.
Is k-means clustering always the right choice? (10 minutes)
The answer is emphatically no. Due to the fact that k-means clustering uses the distance between points to identify clusters, it will try to create evenly sized clusters when clusters could, in fact, either be uneven or non-existent.
Example 1: Odd conditional distributions
<-
circles read_csv("data/circles-data.csv") %>%
select(x1, x2)
%>%
circles ggplot(aes(x1, x2)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank())
Find the optimal number of clusters
set.seed(321)
crossing(k = 1:9, nest(circles, .key = "data")) %>%
mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
glanced = map(model, ~ glance(.x))) %>%
unnest(glanced) %>%
ggplot(aes(k, tot.withinss)) +
geom_point() +
geom_line(linetype = "dashed") +
theme_minimal() +
theme(panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank())
Run the model
<- kmeans(circles, centers = 4) kclust
Plot the results
%>%
circles mutate(cluster = as.factor(kclust$cluster)) %>%
ggplot(aes(x1, x2, colour = cluster)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_colour_ucscgb() +
labs(x = "Variable 1", y = "Variable 2",
colour = "Cluster #")
🗣️ CLASSROOM DISCUSSION:
What went wrong?
Four centroids were placed in the 2-dimensional feature space, which then iterated towards a point whereby the total within cluster sum of squared errors was minimised. This resulted in the k-means algorithm partitioning the data like we might do with a pie, as opposed to respecting the ring-like structure of the clusters.
👉 NOTE: To see more about how different clustering algorithms produce different kinds of clusters in different scenarios, click here.
Introducing DBSCAN (20 minutes)
The DBSCAN algorithm overcomes some of the short-comings of the k-means algorithm by using the distance between nearest points.
👥 DISCUSS IN PAIRS/GROUPS:
Enter ?dbscan
into the console and look at the eps
and minPts
hyperparameters.
The below code will implement dbscan
.
<- dbscan::dbscan(circles, eps = 0.15, minPts = 5) # we specify that we use the function from the dbscan library (there could be a conflict with the fpc library if you've installed it in the W08 lecture)
db_model_circles
%>%
circles mutate(cluster = as.factor(db_model_circles$cluster)) %>%
ggplot(aes(x1, x2, colour = cluster)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_colour_ucscgb() +
labs(x = "Variable 1", y = "Variable 2", colour = "Cluster #")
Applying dbscan to the customer segmentation data set
Use eps = 0.325
and minPts = 5
when applying dbscan to the customer segmentation data. Print the model output.
<- dbscan::dbscan(customers, eps = 0.325, minPts = 5)
db_model_customers
db_model_customers
DBSCAN clustering for 200 objects.
Parameters: eps = 0.325, minPts = 5
Using euclidean distances and borderpoints = TRUE
The clustering contains 7 cluster(s) and 25 noise points.
0 1 2 3 4 5 6 7
25 12 12 7 3 88 30 23
Available fields: cluster, eps, minPts, metric, borderPoints
Try plotting the results yourself to see the differences.
%>%
customers mutate(cluster = as.factor(db_model_customers$cluster),
outlier = if_else(cluster == 0, TRUE, FALSE)) %>%
ggplot(aes(income, spending_score, colour = cluster, shape = outlier)) +
geom_point(size = 2) +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_colour_manual(values = c("black", pal_bmj()(7)), breaks = 1:7) +
scale_shape_manual(values = c(16,4)) +
labs(x = "Income", y = "Spending score",
colour = "Cluster #", shape = "Outlier")
👥 DISCUSS IN PAIRS/GROUPS:
What difference in clustering do you see with dbscan?
Firstly, we have 6 clusters instead of 5. Secondly, some of the data points do not belong to a cluster and are considered to be outliers by dbscan. If we used k-means, all points would belong to a cluster, regardless of how far away a given point is from the centroid (as long as it is not closer to another).
Now it’s your turn: Global life expectancy and GDP per capita in 1997 (30 minutes)
Load the following data set
<-
gapminder_1997 %>%
gapminder filter(year == 1997) %>%
select(gdp_percap = gdpPercap, life_exp = lifeExp) %>%
mutate(across(everything(), ~ (.x-mean(.x))/sd(.x)))
Create a scatter plot
%>%
gapminder_1997 ggplot(aes(gdp_percap, life_exp)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
labs(x = "GDP per capita", y = "Life expectancy")
Find the optimal number of clusters
Try an elbow plot …
set.seed(321)
crossing(k = 1:9, nest(gapminder_1997, .key = "data")) %>%
mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
glanced = map(model, ~ glance(.x))) %>%
unnest(glanced) %>%
ggplot(aes(k, tot.withinss)) +
geom_point() +
geom_line(linetype = "dashed") +
theme_minimal() +
theme(panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank()) +
scale_x_continuous(breaks = 1:9, labels = 1:9) +
labs(x = "Number of clusters", y = "Total within-cluster SS")
Or try the vote of indices (using the Calinski-Harabasz index):
<-
votes NbClust(gapminder_1997, distance = "euclidean", min.nc=2, max.nc=8,
method = "kmeans", index = "ch")
$Best.nc[1] votes
Number_clusters
8
Run the k-means algorithm and plot the results
<- kmeans(gapminder_1997, centers = 3)
kclust_gap
%>%
gapminder_1997 mutate(cluster = as.factor(kclust_gap$cluster)) %>%
ggplot(aes(gdp_percap, life_exp, colour = cluster)) +
geom_point(size = 2) +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
legend.position = "bottom") +
scale_colour_bmj() +
labs(x = "GDP per capita", y = "Life expectancy",
colour = "Cluster #")
Try using dbscan to cluster the points
Try setting eps = 0.265
and minPts = 4
.
<- dbscan::dbscan(gapminder_1997, eps = 0.265, minPts = 4)
db_model_gap db_model_gap
DBSCAN clustering for 142 objects.
Parameters: eps = 0.265, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 3 cluster(s) and 8 noise points.
0 1 2 3
8 104 17 13
Available fields: cluster, eps, minPts, metric, borderPoints
Plot the model
<- length(unique(db_model_gap$cluster))
n_clust
%>%
gapminder_1997 mutate(cluster = as.factor(db_model_gap$cluster),
outlier = if_else(cluster == 0, TRUE, FALSE)) %>%
ggplot(aes(gdp_percap, life_exp, colour = cluster, shape = outlier)) +
geom_point(size = 2) +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_colour_manual(values = pal_bmj()(n_clust), breaks = 1:4) +
scale_shape_manual(values = c(16,4)) +
labs(x = "GDP per capita", y = "Life expectancy",
colour = "Cluster #", shape = "Outlier")