Week 08 - Lab Roadmap (90 min)
Clustering using k-means and DBSCAN
Learning Objectives
By the end of this lab, you will be able to:
- Apply the k-means algorithm to cluster observations
- Understand the limitations of the k-means algorithm
- Apply the DBSCAN algorithm
Preparation: Loading packages and data
Materials to download
Please download the following files into your data folder.
Then use the link below to download the lab materials:
Setup (5 mins)
Load the libraries for today's lab. If there is a library you haven't installed yet, install it first, e.g. install.packages("dbscan").
# Uncomment and run the lines below if needed (optional)
# install.packages("dbscan")
# install.packages("gapminder")
# install.packages("NbClust")
library("dbscan")
library("gapminder")
library("ggsci")
library("NbClust")
library("tidymodels")
library("tidyverse")
Lab Tasks
No need to wait! Start reading the tasks and tackling the action points below as soon as you arrive in the classroom.
Employing k-means on a two-dimensional customer segmentation data set (15 minutes)
We will create a tibble object named customers that has two features:
- income: income of customers in thousands of USD
- spending_score: a 0-100 index that compares spending amongst customers
# Load the customer data and standardise both features (z-scores)
customers <-
read_csv("data/customers-data.csv") %>%
select(income, spending_score) %>%
mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
Let's plot the relationship between the two variables using a scatter plot.
customers %>%
ggplot(aes(income, spending_score)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
labs(x = "Income", y = "Spending score")
Looking at the graph, we can see that customers fall into several reasonably distinct segments.
CLASSROOM DISCUSSION:
How many clusters do you see? Can you describe intuitively what each cluster represents?
Implementing k-means clustering
Implementing k-means clustering in R is fairly straightforward. We pass our 2-d customers tibble to kmeans() and specify centers = n. We also set a seed first so that the random placement of the initial centroids is reproducible.
set.seed(321)
kclust_customers <- kmeans(customers, centers = 5)
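Before moving on, it can help to peek at what kmeans() returns. The snippet below is optional and simply prints a few components of the fitted object created above:

# Optional: inspect the fitted kmeans object
kclust_customers$centers       # coordinates of the 5 centroids (in standardised units)
kclust_customers$size          # number of customers assigned to each cluster
kclust_customers$tot.withinss  # total within-cluster sum of squares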
After this, we can create a new variable (cluster) by extracting the vector of cluster assignments from kclust_customers and converting it into a factor. After that, some simple modifications to our ggplot let us see the results.
customers %>%
mutate(cluster = as.factor(kclust_customers$cluster)) %>%
ggplot(aes(income, spending_score, colour = cluster)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
legend.position = "bottom") +
scale_colour_bmj() +
labs(x = "Income", y = "Spending score",
colour = "Cluster #")
Validating k-means
We can tell intuitively that 5 clusters probably makes the most sense here. In most cases, however, we will not have the luxury of eyeballing a two-dimensional plot, and we will need to validate our choice of the number of clusters.
The elbow method
One widely used method is the elbow method. Simply put, when we see a distinct "elbow" in the plot, we conclude that adding more clusters will not result in a significant reduction in the total within-cluster sum of squared errors. As a result, we can stop and use that many clusters.
TEACHING MOMENT:
What do we mean by the total within-cluster sum of squared errors? When the k-means algorithm is initialised, k centroids are placed randomly at various points of the feature space. The algorithm then iteratively moves these centroids to a set of coordinates that minimises the error: in this case, the total within-cluster sum of squared errors, i.e. the sum of squared distances between each point and the centroid of the cluster it is assigned to.
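To make this concrete, here is a small optional sanity check, assuming the kclust_customers object from above exists: it recomputes the total within-cluster sum of squares by hand (using the fact that each centroid is simply the mean of the points assigned to its cluster) and compares it with the value stored in the fitted object.

# Optional sanity check: recompute the total within-cluster SS by hand
customers %>%
  mutate(cluster = kclust_customers$cluster) %>%
  group_by(cluster) %>%
  # squared deviation of each point from its own cluster centroid, per feature
  mutate(across(c(income, spending_score), ~ (.x - mean(.x))^2)) %>%
  ungroup() %>%
  summarise(tot_withinss = sum(income + spending_score))

# ... which should match the value reported by kmeans()
kclust_customers$tot.withinss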
We can create an elbow plot using the nested tibble approach.
set.seed(321)
crossing(k = 1:9, nest(customers, .key = "data")) %>%
mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
glanced = map(model, ~ glance(.x))) %>%
unnest(glanced) %>%
ggplot(aes(k, tot.withinss)) +
geom_point() +
geom_line(linetype = "dashed") +
theme_minimal() +
theme(panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank()) +
scale_x_continuous(breaks = 1:9, labels = 1:9) +
labs(x = "Number of clusters", y = "Total within-cluster SS")
CLASSROOM DISCUSSION:
What can we conclude from this plot? Is 5 clusters the best number?
Let the indices vote!
The elbow method, though visually intuitive, can often be difficult to interpret. Furthermore, there are many indices one can choose from to identify the most appropriate number of clusters. Instead of picking just one arbitrarily, we can let multiple indices vote, and the number of clusters with the highest number of votes is the one we then use.
votes <-
NbClust(customers, distance = "euclidean", min.nc = 2, max.nc = 8,
method = "kmeans", index = "ch")
Implementing this method reveals that 6 clusters are most appropriate, which doesn't exactly confirm our intuition… Is a group of points that we saw as one single cluster actually two separate clusters? (Note: the index used in the calculation above is the Calinski-Harabasz index. The NbClust library provides a total of 30 different indices you can use to determine the best number of clusters!)
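If you want an actual vote across many of those indices at once, NbClust can compute them together (this takes noticeably longer to run). A sketch, reusing the same customers tibble; the object name votes_all is just an example:

# Let (most of) NbClust's indices vote on the number of clusters
votes_all <- NbClust(customers, distance = "euclidean", min.nc = 2, max.nc = 8,
                     method = "kmeans", index = "all")
votes_all$Best.nc  # each index's preferred number of clusters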
Employing k-means on higher dimensions (15 minutes)
Suppose we are interested in understanding the profiles of workers at a company. We will use three standardised variables from the Absenteeism at Work data set:
- transportation_expense: how much it costs to travel into work
- distance_from_residence_to_work: the distance from residence to work
- work_load_average_day: average workload in a day
Cleaning the data set
Run the following code:
workers <-
read_csv("data/absenteeism-at-work-data.csv") %>%
janitor::clean_names() %>%  # convert column names to snake_case
select(transportation_expense, distance_from_residence_to_work,
work_load_average_day) %>%
mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))  # standardise
Find the optimal number of clusters
Use an elbow plot or NbClust to find the optimal number of clusters:
votes <-
NbClust(workers, distance = "euclidean", min.nc = 2, max.nc = 8,
method = "kmeans", index = "ch")
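If you would rather use the elbow plot here, the nested-tibble recipe from earlier carries over directly. A sketch, reusing that code with the workers tibble:

set.seed(321)
crossing(k = 1:9, nest(workers, .key = "data")) %>%
  mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
         glanced = map(model, ~ glance(.x))) %>%
  unnest(glanced) %>%
  ggplot(aes(k, tot.withinss)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_minimal() +
  labs(x = "Number of clusters", y = "Total within-cluster SS")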
Apply the cluster labels to the data set
After doing this, run the code below to produce the visualisation.
workers %>%
mutate(cluster = as.factor(votes$Best.partition)) %>%
pivot_longer(-cluster, names_to = "variable") %>%
ggplot(aes(value, str_replace_all(variable, "\\_", " "), fill = variable)) +
facet_wrap(. ~ paste("Cluster", cluster)) +
geom_boxplot(alpha = 0.5) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "red", linewidth = 1.25) +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
legend.position = "none") +
scale_fill_bmj() +
labs(x = "Value", y = NULL)
CLASSROOM DISCUSSION:
Can you describe intuitively what each cluster represents?
Is k-means clustering always the right choice? (10 minutes)
The answer is emphatically no. Because k-means assigns each point to its nearest centroid, it will try to carve the data into compact, roughly evenly sized clusters even when the true clusters are uneven, oddly shaped, or non-existent.
Example 1: Odd conditional distributions
circles <-
read_csv("data/circles-data.csv") %>%
select(x1, x2)
circles %>%
ggplot(aes(x1, x2)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank())
Find the optimal number of clusters
crossing(k = 1:9, nest(circles, .key = "data")) %>%
mutate(model = map2(data, k, ~ kmeans(.x, centers = .y)),
glanced = map(model, ~ glance(.x))) %>%
unnest(glanced) %>%
ggplot(aes(k, tot.withinss)) +
geom_point() +
geom_line(linetype = "dashed") +
theme_minimal() +
theme(panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank())
Run the model
kclust <- kmeans(circles, centers = 4)
Plot the results
circles %>%
mutate(cluster = as.factor(kclust$cluster)) %>%
ggplot(aes(x1, x2, colour = cluster)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_colour_ucscgb() +
labs(x = "Variable 1", y = "Variable 2",
colour = "Cluster #")
CLASSROOM DISCUSSION:
What went wrong?
NOTE: To see more about how different clustering algorithms produce different kinds of clusters in different scenarios, click here.
Introducing DBSCAN (20 minutes)
The DBSCAN algorithm overcomes some of the shortcomings of the k-means algorithm by using the distance between nearest points: it grows clusters out of regions where points are densely packed and labels points in sparse regions as noise.
DISCUSS IN PAIRS/GROUPS:
Enter ?dbscan into the console and look at the eps and minPts hyperparameters.
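As an aside (not needed for the discussion), the dbscan package also ships with a common heuristic for choosing eps: plot each point's distance to its k-th nearest neighbour, sorted, and look for a knee. A sketch on the circles data, with k chosen to match the minPts value used below:

# Heuristic for picking eps: sorted k-nearest-neighbour distance plot
dbscan::kNNdistplot(circles, k = 5)
abline(h = 0.15, lty = 2)  # the eps value used below, shown for reference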
The code below runs dbscan on the circles data.
# We specify that we use the function from the dbscan library (there could be a
# conflict with the fpc library if you've installed it in the W08 lecture)
db_model_circles <- dbscan::dbscan(circles, eps = 0.15, minPts = 5)
circles %>%
mutate(cluster = as.factor(db_model_circles$cluster)) %>%
ggplot(aes(x1, x2, colour = cluster)) +
geom_point() +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_colour_ucscgb() +
labs(x = "Variable 1", y = "Variable 2", colour = "Cluster #")
Applying dbscan to the customer segmentation data set
Use eps = 0.325 and minPts = 5 when applying dbscan to the customer segmentation data. Print the model output.
db_model_customers <- dbscan::dbscan(customers, eps = 0.325, minPts = 5)
db_model_customers
Try plotting the results yourself to see the differences.
customers %>%
mutate(cluster = as.factor(db_model_customers$cluster),
outlier = if_else(cluster == 0, TRUE, FALSE)) %>%
ggplot(aes(income, spending_score, colour = cluster, shape = outlier)) +
geom_point(size = 2) +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_colour_manual(values = c("black", pal_bmj()(7)), breaks = 1:7) +
scale_shape_manual(values = c(16,4)) +
labs(x = "Income", y = "Spending score",
colour = "Cluster #", shape = "Outlier")
DISCUSS IN PAIRS/GROUPS:
What difference in clustering do you see with dbscan?
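If it helps the discussion, you can also count how many observations end up in each dbscan cluster; cluster 0 is dbscan's label for noise, i.e. the points flagged as outliers in the plot above:

table(db_model_customers$cluster)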
Now it's your turn: Global life expectancy and GDP per capita in 1997 (30 minutes)
Load the following data set
gapminder_1997 <-
gapminder %>%
filter(year == 1997) %>%
select(gdp_percap = gdpPercap, life_exp = lifeExp) %>%
mutate(across(everything(), ~ (.x-mean(.x))/sd(.x)))
Create a scatter plot
# Code here
Find the optimal number of clusters
Try an elbow plot …
# Code here
… or let the indices vote!
# Code here
Run the k-means algorithm and plot the results
# Code here
Try using dbscan to cluster the points
Try setting eps = 0.265 and minPts = 4.
# Code here
Plot the model
# Code here