Week 08 - Lab Roadmap
Clustering using k-means and DBSCAN
Welcome to the seventh lab! This week is again about unsupervised learning, and this time we explore another of its techniques: clustering.
⚙️ Setup
Downloading the student notebook
Click on the button below to download the student notebook.
Loading libraries
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from umap import UMAP
from lets_plot import *
LetsPlot.setup_html()
Downloading the data
Use the links below to download the datasets we will use for this lab:
Employing k-means on a two-dimensional customer segmentation data set (45 minutes)
We will create a data frame object named `customers` that has two features:
- `income`: the customer's income in thousands of USD
- `spending_score`: a 0-100 index that compares spending amongst customers
# Open the data set, and select only the variables needed for the analysis
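For instance (a sketch; the file name below is hypothetical, so point it at wherever you saved the dataset):

```python
# Hypothetical file name; adjust the path to your download location
customers = pd.read_csv("../data/customers.csv")[["income", "spending_score"]]
```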
Let's plot the relationship between the two variables using a scatter plot.
# Code here
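One way the plot could look with lets_plot, assuming the column names described above:

```python
# Scatter plot of income against spending score
(
    ggplot(customers, aes(x="income", y="spending_score"))
    + geom_point()
    + labs(x="Income (thousands of USD)", y="Spending score (0-100)")
)
```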
Looking at the graph, we can see that the customers are organised into somewhat distinct segments.
🗣️ CLASSROOM DISCUSSION:
How many clusters do you see? Can you describe intuitively what each cluster represents?
Implementing k-means clustering
Implementing k-means clustering in Python is fairly straightforward. We build a pipeline that standardises the data and instantiates a k-means model with 5 clusters. We then fit this pipeline to `customers`.
# Instantiate a pipeline
# Fit the model to the data
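A sketch of these two steps (the `random_state` is an illustrative choice to make results reproducible):

```python
# Instantiate a pipeline: standardise the features, then fit k-means with 5 clusters
kclust_customers = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=5, random_state=123)),
])

# Fit the model to the data
kclust_customers.fit(customers)
```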
After this, we can create a new variable (`cluster`) by retrieving the vector of cluster assignments from `kclust_customers` and converting the results into a categorical variable. After that, some simple modifications to our `ggplot` call will show us the results.
# Identify cluster labels for each row
# Create a data frame based on customers by adding the cluster labels
# Plot the results
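A sketch, assuming `kclust_customers` is the fitted pipeline from the previous step:

```python
# Identify cluster labels for each row from the fitted k-means step
labels = kclust_customers.named_steps["kmeans"].labels_

# Create a data frame based on customers by adding the cluster labels,
# converted to strings so lets_plot treats them as categorical
customers_clustered = customers.assign(cluster=labels.astype(str))

# Plot the results, colouring each point by its cluster
(
    ggplot(customers_clustered, aes(x="income", y="spending_score", color="cluster"))
    + geom_point()
)
```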
Validating k-means
We can tell intuitively that 5 clusters probably makes the most sense. However, in most cases we will not have the luxury of eyeballing a two-dimensional plot, and will need to validate our choice of cluster number.
The elbow method
One widely used method is the elbow method. Simply put, when we see a distinct "elbow" in the plot, adding more clusters beyond that point no longer yields a significant reduction in the total within-cluster sum of squared errors. As a result, we can stop and use that many clusters.
We can create an elbow plot by writing a function that computes the inertia for a given k and applying it over a range of values.
# Create a function that returns the inertia score for a given k
def find_optimal_cluster(k, data):
    # Create a pipeline that scales the data then fits a k-means model with k clusters
    pipe = Pipeline([("scaler", StandardScaler()), ("kmeans", KMeans(n_clusters=k))])
    # Fit the model to the data
    pipe.fit(data)
    # Return the inertia_ attribute
    return pipe.named_steps["kmeans"].inertia_
# Create a data frame with the k range and associated inertia scores
# Plot the output
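A sketch of how these two steps might look, applying `find_optimal_cluster` over k = 1 to 10:

```python
# Create a data frame with the k range and associated inertia scores
k_range = list(range(1, 11))
elbow_df = pd.DataFrame({
    "k": k_range,
    "inertia": [find_optimal_cluster(k, customers) for k in k_range],
})

# Plot the output; look for the "elbow" where the curve starts to flatten
(
    ggplot(elbow_df, aes(x="k", y="inertia"))
    + geom_line()
    + geom_point()
)
```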
Is k-means clustering always the right choice?
The answer is emphatically no. Because k-means assigns each point to its nearest cluster centroid, it tends to carve the data into compact, roughly evenly sized clusters, even when the true clusters are uneven or irregularly shaped.
Example 1: Odd conditional distributions
# Open the circles data
# Plot x1 and x2 features in circles data
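A sketch, assuming the circles data is a CSV with feature columns `x1` and `x2` (the file name is hypothetical):

```python
# Hypothetical file name; adjust the path to your download location
circles = pd.read_csv("../data/circles.csv")

# Plot the x1 and x2 features
(
    ggplot(circles, aes(x="x1", y="x2"))
    + geom_point()
)
```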
Find the optimal number of clusters
# Create a data frame with the k range and associated inertia scores
# Plot the output
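We can reuse the `find_optimal_cluster` function from earlier:

```python
# Create a data frame with the k range and associated inertia scores
elbow_circles = pd.DataFrame({
    "k": list(range(1, 11)),
    "inertia": [find_optimal_cluster(k, circles) for k in range(1, 11)],
})

# Plot the output
(
    ggplot(elbow_circles, aes(x="k", y="inertia"))
    + geom_line()
    + geom_point()
)
```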
Run the model
# Instantiate model
# Fit to the data
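For example, with k = 2 as a natural first guess for two rings (the `random_state` is illustrative):

```python
# Instantiate a pipeline that scales the data and fits k-means with 2 clusters
kclust_circles = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=2, random_state=123)),
])

# Fit to the data
kclust_circles.fit(circles)
```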
Plot the results
# Code here
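A sketch, reusing the plotting pattern from the customer example:

```python
# Colour each point by its assigned cluster to see what k-means did
circles_clustered = circles.assign(
    cluster=kclust_circles.named_steps["kmeans"].labels_.astype(str)
)
(
    ggplot(circles_clustered, aes(x="x1", y="x2", color="cluster"))
    + geom_point()
)
```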
🗣️ CLASSROOM DISCUSSION:
What went wrong?
👉 NOTE: To see more about how different clustering algorithms produce different kinds of clusters in different scenarios, click here.
Introducing DBSCAN (30 minutes)
The DBSCAN algorithm overcomes some of the shortcomings of k-means by clustering on density: points that have enough close neighbours (at least `min_samples` of them within a radius `eps`) are grouped together, while isolated points are treated as noise. The code below implements DBSCAN.
# Instantiate a DBSCAN model
# Fit the model to the data
# Plot the results
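A minimal sketch on the circles data (the `eps` and `min_samples` values below are illustrative starting points, not values prescribed by the lab):

```python
# Instantiate a DBSCAN model; in its labels_, -1 marks points treated as noise
dbscan_circles = DBSCAN(eps=0.2, min_samples=5)

# Fit the model to the data
dbscan_circles.fit(circles)

# Plot the results, colouring points by their DBSCAN label
circles_db = circles.assign(cluster=dbscan_circles.labels_.astype(str))
(
    ggplot(circles_db, aes(x="x1", y="x2", color="cluster"))
    + geom_point()
)
```

Because DBSCAN follows chains of nearby points rather than distances to a centroid, it can recover each ring as its own cluster, which k-means cannot.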
Applying DBSCAN to the customer segmentation data set
Use `eps = 0.6` and `min_samples = 4` when applying DBSCAN to the customer segmentation data.
# Create a pipeline that standardises the data and instantiates a DBSCAN
# Fit the pipeline to the customers data
# Retrieve the labels using a list comprehension
# Create a list of outlier labels
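One way these steps could be translated into code, using the hyperparameters given above:

```python
# Create a pipeline that standardises the data and instantiates a DBSCAN
dbscan_customers = Pipeline([
    ("scaler", StandardScaler()),
    ("dbscan", DBSCAN(eps=0.6, min_samples=4)),
])

# Fit the pipeline to the customers data
dbscan_customers.fit(customers)

# Retrieve the labels using a list comprehension, creating a list
# in which DBSCAN's -1 noise label becomes "outlier"
labels = ["outlier" if label == -1 else str(label)
          for label in dbscan_customers.named_steps["dbscan"].labels_]
```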
Try plotting the results yourself to see the differences.
# Code here
👥 DISCUSS IN PAIRS/GROUPS:
What difference in clustering do you see with DBSCAN?
Using DBSCAN with UMAP (15 minutes)
You will remember from last week that we can employ Uniform Manifold Approximation and Projection (UMAP) to create two-dimensional representations of data. With the Varieties of Democracy (V-Dem) data set, we found that employing UMAP led to several highly repressive regimes forming their own distinct cluster. We can employ DBSCAN on the UMAP embeddings to find new insights into political regimes.
👉 Task: Let's load the data!

vdem = pd.read_csv("../data/vdem-data-subset.csv")
👉 Task: Next, we can build a UMAP model, fit it to `vdem`, and retrieve the embeddings.
# Set a random seed
np.random.seed(123)

# Instantiate a UMAP with the relevant hyperparameter choice
reducer = UMAP(n_neighbors=100)

# Create a subset of the data using only variables beginning with "v2x"
vdem_subset = vdem[vdem.columns[vdem.columns.str.contains("v2x")]].to_numpy()

# Fit / transform the model to the subset of data to obtain the embeddings
embedding = reducer.fit_transform(vdem_subset)

# Convert the embeddings to a data frame
embedding = pd.DataFrame(embedding, columns=["first_dim", "second_dim"])
👉 Task: After this, we can instantiate a DBSCAN, fit it to the embeddings, and plot the results. Try experimenting with different combinations of minimum points and epsilon neighbourhood.
💡 Hints:
- If you are stuck on what hyperparameters to choose, try starting with `min_samples = 6` and `eps = 1.15`.
- Try experimenting with tooltips:
  - Add the `country_year` labels from `vdem` to the embeddings data frame.
  - Instead of just adding a `geom_point` layer, try creating code that lets the `country_year` label for each row appear whenever you hover over it with your cursor (click here for the relevant documentation).
# Instantiate a dbscan
# Fit the model to the data
# Retrieve the labels using a list comprehension
# Add the country_year variable in vdem to the embeddings data frame
# Plot the output
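A sketch of how these steps might fit together, using the hinted hyperparameters and assuming `vdem` contains a `country_year` column (as the hints suggest):

```python
# Instantiate a DBSCAN with the hinted hyperparameters
dbscan_umap = DBSCAN(eps=1.15, min_samples=6)

# Fit the model to the two-dimensional embeddings
dbscan_umap.fit(embedding[["first_dim", "second_dim"]])

# Retrieve the labels using a list comprehension, renaming the -1 noise label
labels = ["outlier" if label == -1 else str(label) for label in dbscan_umap.labels_]

# Add the cluster labels and the country_year variable from vdem
# (assumed column name) to the embeddings data frame
embedding_db = embedding.assign(cluster=labels, country_year=vdem["country_year"])

# Plot the output, showing country_year as a tooltip on hover
(
    ggplot(embedding_db, aes(x="first_dim", y="second_dim", color="cluster"))
    + geom_point(tooltips=layer_tooltips().line("@country_year"))
)
```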
👥 Classroom discussion: Can we gain any new insights from using DBSCAN with the embeddings?