πŸ““ Week 08 - Lab Roadmap

Clustering using k-means and DBSCAN

Author

The DS202 Team

Published

10 Mar 2025

Welcome to the seventh lab! This week is again about unsupervised learning, and this time we explore another of its main techniques: clustering.

βš™οΈ Setup

Downloading the student notebook

Click on the button below to download the student notebook.

Loading libraries

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from umap import UMAP
from lets_plot import *
LetsPlot.setup_html()

Downloading the data

Use the links below to download the datasets we will use for this lab:

Employing k-means on a two-dimensional customer segmentation data set (45 minutes)

We will create a data frame object named customers that has two features:

  β€’ income: the customer’s income in thousands of USD
  β€’ spending_score: a 0-100 index that compares spending amongst customers
# Open the data set, and select only the variables needed for the analysis
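If you are unsure where to start, a minimal sketch might look like the following (the file path is an assumption; point it at wherever you saved the dataset):

# Assumed path: adjust to the location of your downloaded file
customers = pd.read_csv("../data/customer-segmentation.csv")

# Keep only the two features used in this lab
customers = customers[["income", "spending_score"]]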

Let’s plot the correlation between both variables using a scatter plot.

# Code here
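One possible way to draw this with lets-plot, assuming the customers data frame created above:

(
    ggplot(customers, aes(x="income", y="spending_score")) +
    geom_point() +
    labs(x="Income (thousands of USD)", y="Spending score (0-100)")
)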

Looking at the graph, we can see that customers are organised into a handful of reasonably distinct segments, or clusters.

πŸ—£οΈ CLASSROOM DISCUSSION:

How many clusters do you see? Can you describe intuitively what each cluster represents?

Implementing k-means clustering

Implementing k-means clustering in Python is fairly straightforward. We build a pipeline that standardises the data and then applies a k-means model with 5 clusters. We then fit this pipeline to customers.

# Instantiate a pipeline

# Fit the model to the data
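A sketch of what this could look like (the random_state is our addition, purely for reproducibility):

# Standardise the features, then fit k-means with 5 clusters
kclust_customers = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=5, random_state=123))  # random_state is an assumption
])

# Fit the pipeline to the customers data
kclust_customers.fit(customers)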

After this, we can create a new variable (cluster) by finding the vector of cluster assignments in kclust_customers and converting the results into strings, so that they are treated as categorical. After that, some simple modifications to our ggplot call will let us see the results.

# Identify cluster labels for each row

# Create a data frame based on customers by adding the cluster labels

# Plot the results
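For instance, a sketch along these lines (building on the kclust_customers pipeline above; the variable names are ours):

# Cluster label for each row, converted to strings so the plot treats them as categories
labels = kclust_customers.named_steps["kmeans"].labels_
customers_clustered = customers.assign(cluster=[str(label) for label in labels])

# Colour the scatter plot by cluster assignment
(
    ggplot(customers_clustered, aes(x="income", y="spending_score", color="cluster")) +
    geom_point()
)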

Validating k-means

We can tell intuitively that 5 clusters probably makes the most sense. However, in most cases we will not have the luxury of eyeballing the data like this, and will need to validate our choice of the number of clusters.

The elbow method

One widely used method is the elbow method. Simply put, when we see a distinct β€œelbow” in the plot, we conclude that adding clusters beyond that point will not significantly reduce the total within-cluster sum of squared errors (which scikit-learn calls the inertia). As a result, we can stop and use that many clusters.

We can create an elbow plot by computing the inertia for a range of values of k.

# Create a function that returns the inertia score for a given k
def find_optimal_cluster(k, data):
    # Create a pipeline that scales the data, then fits a k-means model with k clusters
    pipe = Pipeline([("scaler", StandardScaler()), ("kmeans", KMeans(n_clusters=k))])
    # Fit the model to the data
    pipe.fit(data)
    # Return the inertia_ attribute
    return pipe.named_steps["kmeans"].inertia_

# Create a data frame with the k range and associated inertia scores


# Plot the output
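One way to put find_optimal_cluster to work (the range of k values is an assumption; widen it if the elbow is not visible):

# Inertia for k = 1, ..., 10
ks = list(range(1, 11))
elbow = pd.DataFrame({
    "k": ks,
    "inertia": [find_optimal_cluster(k, customers) for k in ks]
})

# Elbow plot: look for the point where the curve flattens out
(
    ggplot(elbow, aes(x="k", y="inertia")) +
    geom_line() +
    geom_point()
)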

Is k-means clustering always the right choice?

The answer is emphatically no. Because k-means assigns each point to its nearest cluster centroid, it tends to carve the data into roughly even, compact clusters, even when the true clusters are uneven or irregularly shaped.

Example 1: Odd conditional distributions

# Open the circles data


# Plot x1 and x2 features in circles data
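A minimal sketch, assuming the circles data set was downloaded to the path below (adjust as needed) and has the two features x1 and x2:

# Assumed path: adjust to the location of your downloaded file
circles = pd.read_csv("../data/circles.csv")

# Scatter plot of the two features
(
    ggplot(circles, aes(x="x1", y="x2")) +
    geom_point()
)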

Find the optimal number of clusters

# Create a data frame with the k range and associated inertia scores


# Plot the output
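We can reuse the find_optimal_cluster function from earlier, for example:

ks = list(range(1, 11))
elbow_circles = pd.DataFrame({
    "k": ks,
    "inertia": [find_optimal_cluster(k, circles) for k in ks]
})

(
    ggplot(elbow_circles, aes(x="k", y="inertia")) +
    geom_line() +
    geom_point()
)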

Run the model

# Instantiate model


# Fit to the data

Plot the results

# Code here
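Putting the last two steps together, a sketch might look like this (k = 2 is an assumption based on the two visible rings):

# Standardise the features, then fit k-means with 2 clusters
kclust_circles = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=2, random_state=123))
])
kclust_circles.fit(circles)

# Colour the scatter plot by the (misleading) cluster assignments
circles_clustered = circles.assign(
    cluster=[str(label) for label in kclust_circles.named_steps["kmeans"].labels_]
)
(
    ggplot(circles_clustered, aes(x="x1", y="x2", color="cluster")) +
    geom_point()
)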

πŸ—£οΈ CLASSROOM DISCUSSION:

What went wrong?

πŸ‘‰ NOTE: To see more about how different clustering algorithms produce different kinds of clusters in different scenarios, click here.

Introducing DBSCAN (30 minutes)

The DBSCAN algorithm overcomes some of the shortcomings of k-means by clustering on density: points that sit close to enough of their nearest neighbours are grouped together, while isolated points are treated as noise.

The code below implements DBSCAN.

# Instantiate a DBSCAN model


# Fit the model to the data


# Plot the results
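A sketch applied to the circles data (the eps and min_samples values here are assumptions; experiment with them):

# Instantiate a DBSCAN model (hyperparameters are assumptions)
dbscan_circles = DBSCAN(eps=0.3, min_samples=5)

# Fit the model to the circles data
dbscan_circles.fit(circles)

# Plot the results; DBSCAN labels noise points as -1
circles_db = circles.assign(cluster=[str(label) for label in dbscan_circles.labels_])
(
    ggplot(circles_db, aes(x="x1", y="x2", color="cluster")) +
    geom_point()
)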

Applying DBSCAN to the customer segmentation data set

Use eps = 0.6 and min_samples = 4 when applying DBSCAN to the customer segmentation data.

# Create a pipeline that standardises the data and instantiates a DBSCAN


# Fit the pipeline to the customers data


# Retrieve the labels using a list comprehension


# Create a list of outlier labels

Try plotting the results yourself to see the differences.

# Code here

πŸ‘₯ DISCUSS IN PAIRS/GROUPS:

What difference in clustering do you see with DBSCAN?

Using DBSCAN with UMAP (15 minutes)

You will remember from last week that we can employ Uniform Manifold Approximation and Projection (UMAP) to create two-dimensional representations of data. With the Varieties of Democracy (V-Dem) data set, we found that employing UMAP led to several highly repressive regimes forming their own distinct cluster. We can employ DBSCAN on the UMAP embeddings to find new insights into political regimes.

πŸ“Task: Let’s load the data!

vdem = pd.read_csv("../data/vdem-data-subset.csv")

πŸ“Task: Next, we can build a UMAP model, fit it to vdem and retrieve the embeddings.

# Set a random seed
np.random.seed(123)

# Instantiate a UMAP with the relevant hyperparameter choice
reducer = UMAP(n_neighbors=100)

# Create a subset of the data using only variables beginning with "v2x"
vdem_subset = vdem[vdem.columns[vdem.columns.str.contains("v2x")]].to_numpy()

# Fit / transform the model to the subset of data to obtain the embeddings
embedding = reducer.fit_transform(vdem_subset)

# Convert the embeddings to a data frame
embedding = pd.DataFrame(embedding, columns=["first_dim", "second_dim"])

πŸ“Task: After this, we can instantiate a DBSCAN, fit it to the embeddings, and plot the results. Try experimenting with different combinations of minimum points and epsilon neighbourhood.

πŸ’‘Hints:

  • If you are stuck with what hyperparameters to choose, try starting with min_samples = 6 and eps = 1.15.

  • Try experimenting with tooltips

    • Add the country_year labels from vdem to the embeddings data frame.
    • Instead of just adding a geom_point layer, try creating code that let’s the country_year label for each row appear whenever you hover over it with your cursor (click here for the relevant documentation).
# Instantiate a DBSCAN


# Fit the model to the data


# Retrieve the labels using a list comprehension


# Add the country_year variable in vdem to the embeddings data frame


# Plot the output
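A sketch that follows the hints above (the hyperparameters come from the first hint; the variable names are ours):

# Instantiate a DBSCAN with the suggested hyperparameters
dbscan_umap = DBSCAN(eps=1.15, min_samples=6)

# Fit the model to the two-dimensional embeddings
dbscan_umap.fit(embedding[["first_dim", "second_dim"]])

# Retrieve the labels as strings using a list comprehension
embedding["cluster"] = [str(label) for label in dbscan_umap.labels_]

# Add the country_year variable in vdem to the embeddings data frame
embedding["country_year"] = vdem["country_year"].to_numpy()

# Plot the output, showing country_year in a tooltip on hover
(
    ggplot(embedding, aes(x="first_dim", y="second_dim", color="cluster")) +
    geom_point(tooltips=layer_tooltips().line("@country_year"))
)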

πŸ‘₯ Classroom discussion: Can we gain any new insights from using DBSCAN with the embeddings?