πŸ›£οΈ LSE DS202W 2025/2026: Week 08 - Lab Roadmap - Clustering Analysis in Python

Author

The DS202 Team

Published

10 Mar 2026

Week 8 Labs - Clustering Analysis in Python (DS202)

Welcome to the eighth lab! This week we continue with unsupervised learning, turning to a second family of methods: clustering.

Overview (90 minutes total)

Today we’ll explore different clustering algorithms and evaluation methods using Multiple Correspondence Analysis (MCA) coordinates. We’ll compare k-means variants, k-medoids, and DBSCAN clustering approaches.

  • Technical Skills: Implement k-means, k-medoids, and DBSCAN algorithms with evaluation methods (elbow, silhouette) and adapt code templates across different clustering approaches

  • Analytical Skills: Compare clustering methods, evaluate their appropriateness for datasets, and interpret metrics to determine optimal cluster numbers and trade-offs

  • Critical Thinking: Justify algorithm selection based on data characteristics, assess outlier treatment decisions, and critique clustering limitations

  • Practical Application: Design complete clustering workflows, troubleshoot parameter selection challenges, and communicate results through effective visualizations

Lab Structure:

  • 10 min: Clustering algorithms overview + k-means template
  • 50 min: Student exploration of algorithms and evaluation methods
  • 30 min: DBSCAN exploration

Part 1: Introduction to Clustering Algorithms (10 minutes)

Key Differences Between Clustering Methods

Algorithm   Centers               Distance     Outlier Sensitivity   Use Case
k-means     Computed centroids    Euclidean    High                  Spherical clusters
k-means++   Smart initialization  Euclidean    High                  Better initial centroids
k-medoids   Actual data points    Any metric   Low                   Non-spherical clusters
DBSCAN      None (density-based)  Any metric   Very low              Arbitrary shapes
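The outlier-sensitivity column is easy to see on a toy dataset. Here is a minimal sketch (synthetic data, not the lab's MCA coordinates) that fits k-means and DBSCAN to the same two blobs plus one far-away point:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Two tight blobs plus a single far-away outlier
X_toy, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)
X_toy = np.vstack([X_toy, [[20.0, 20.0]]])

# k-means must assign the outlier to one of the two clusters,
# dragging that centroid towards it
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)

# DBSCAN can instead flag it as noise (label -1)
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_toy)

print("k-means label of outlier:", km_labels[-1])  # some cluster, 0 or 1
print("DBSCAN label of outlier:", db_labels[-1])   # -1 = noise
```

This is the practical meaning of "High" vs "Very low" sensitivity in the table: centroid-based methods absorb outliers into a cluster, while DBSCAN can set them aside.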

βš™οΈ Setup

Downloading the student notebook

Click on the button below to download the student notebook.

Downloading the data

Click on the button below to download the data.

Loading libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances
import warnings
import kmedoids

warnings.filterwarnings("ignore")

# Load the MCA coordinates from Week 7
mca_coords = pd.read_csv("data/mca-week-7.csv")

print(f"Data shape: {mca_coords.shape}")
print(mca_coords.head())

Part 2: K-means Template & Algorithm Applications (50 minutes)

Step 1: Study the K-means Template (15 minutes)

Here’s the complete workflow for k-means clustering evaluation:

# Set random seed for reproducibility
np.random.seed(123)


def evaluate_kmeans_clustering(data, k_range=range(2, 11)):
    """
    Evaluate k-means clustering using elbow method and silhouette analysis
    """
    results = []

    for k in k_range:
        # Fit k-means
        kmeans = KMeans(n_clusters=k, random_state=123, n_init=25, init='random')
        labels = kmeans.fit_predict(data)

        # Calculate metrics
        inertia = kmeans.inertia_  # Within-cluster sum of squares
        silhouette_avg = silhouette_score(data, labels)

        results.append({"k": k, "inertia": inertia, "silhouette": silhouette_avg})

    return pd.DataFrame(results)


# Evaluate k-means clustering
X = mca_coords[["dim_1", "dim_2"]].values
kmeans_results = evaluate_kmeans_clustering(X)

# Create evaluation plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Elbow plot
ax1.plot(kmeans_results["k"], kmeans_results["inertia"], "bo-")
ax1.set_xlabel("Number of clusters (k)")
ax1.set_ylabel("Within-cluster sum of squares")
ax1.set_title("Elbow Method (minimize)")
ax1.grid(True, alpha=0.3)

# Silhouette plot
ax2.plot(kmeans_results["k"], kmeans_results["silhouette"], "ro-")
ax2.set_xlabel("Number of clusters (k)")
ax2.set_ylabel("Average silhouette score")
ax2.set_title("Silhouette Analysis (maximize)")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("K-means evaluation results:")
print(kmeans_results)

Tip

Check out the yellowbrick library for a simpler way to implement the elbow and silhouette methods (see for example this page).

# Create final clustering with chosen k
optimal_k = 4  # Choose based on your evaluation plots

# Fit final k-means model
final_kmeans = KMeans(n_clusters=optimal_k, random_state=123, n_init=25, init='random')
kmeans_labels = final_kmeans.fit_predict(X)

# Visualize final clustering
plt.figure(figsize=(8, 6))
colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown', 'pink', 'gray', 'olive', 'cyan']
for i in range(optimal_k):
    mask = kmeans_labels == i
    plt.scatter(mca_coords.loc[mask, 'dim_1'], 
               mca_coords.loc[mask, 'dim_2'], 
               c=colors[i], 
               label=f'Cluster {i+1}',
               s=60,
               alpha=0.7)

plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title(f'K-means Clustering (k = {optimal_k})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Quick Study Questions:

  • What does the range(2, 11) create in our evaluation?
  • Why do we use multiple random initializations (n_init=25)?
  • What are the two evaluation methods being calculated?

Step 2: Apply to K-means++ (10 minutes)

Your Task: Modify the template for enhanced k-means with better initialization.

Key change: K-means++ initialization is actually the default in scikit-learn’s KMeans!

# Find optimal k using k-means++
# Build final model and plot the results
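One possible adaptation of the template, shown on synthetic stand-in data: because k-means++ is scikit-learn's default, the only change needed is dropping init='random'.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(123)
X_demo = rng.normal(size=(150, 2))  # stand-in for the MCA coordinates

results = []
for k in range(2, 11):
    # init="k-means++" is the default, so no init argument is needed
    kmeans = KMeans(n_clusters=k, random_state=123, n_init=25)
    labels = kmeans.fit_predict(X_demo)
    results.append({"k": k, "inertia": kmeans.inertia_,
                    "silhouette": silhouette_score(X_demo, labels)})

kmeanspp_results = pd.DataFrame(results)
print(kmeanspp_results)
```

The rest of the workflow (elbow plot, silhouette plot, final model) carries over from the template unchanged.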

Questions:

  1. Do you notice differences in stability compared to basic k-means?
  2. Does the optimal k suggestion change?

Step 3: Apply to k-medoids (10 minutes)

Your Task: Adapt the k-means template for k-medoids clustering.

πŸ’‘Hint: Use the kmedoids library for your code (see documentation)

Questions

  1. What optimal k do the elbow and silhouette methods suggest?
  2. How do k-medoids results compare to k-means?

# find optimal k using kmedoids
# build final model and plot the results

Step 4: Compare all three methods (10 minutes)

Create final clusterings using your chosen optimal k for all three methods:

# Code here
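One way to structure the side-by-side comparison, sketched on synthetic stand-in data: collect each method's fitted labels in a dictionary and loop over subplots. Swap in your own label arrays (e.g. the ones from Steps 1 to 3, including k-medoids) for the stand-ins below.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
X_demo = rng.normal(size=(150, 2))  # stand-in for the MCA coordinates

# Replace these entries with your own fitted label arrays
label_sets = {
    "k-means": KMeans(n_clusters=3, n_init=25, random_state=123,
                      init="random").fit_predict(X_demo),
    "k-means++": KMeans(n_clusters=3, n_init=25,
                        random_state=123).fit_predict(X_demo),
}

fig, axes = plt.subplots(1, len(label_sets),
                         figsize=(6 * len(label_sets), 5), squeeze=False)
for ax, (name, labels) in zip(axes[0], label_sets.items()):
    ax.scatter(X_demo[:, 0], X_demo[:, 1], c=labels, cmap="tab10",
               s=40, alpha=0.7)
    ax.set_title(name)
    ax.set_xlabel("Dimension 1")
    ax.set_ylabel("Dimension 2")
plt.tight_layout()
plt.show()
```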

Discussion Questions:

  1. Which evaluation method (elbow vs silhouette) was most reliable across algorithms?
  2. Which clustering algorithm produced the best results for this dataset?
  3. What are the key trade-offs between the three methods?

Part 3: DBSCAN - Density-Based Clustering (30 minutes)

DBSCAN is fundamentally different:

  • No need to specify k
  • Finds arbitrary shapes
  • Identifies noise points
  • Key parameters: eps (neighborhood radius) and min_samples (minimum points)
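A common heuristic for choosing eps is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbour and look for a "knee" in the curve. A minimal sketch using scikit-learn's NearestNeighbors (on synthetic stand-in data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 2))  # stand-in for the MCA coordinates
min_samples = 5

# Distance from each point to its min_samples-th nearest neighbour
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

plt.plot(k_distances)
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {min_samples}th nearest neighbour")
plt.title("k-distance plot: a knee suggests a value for eps")
plt.show()
```

Points to the right of the knee are in sparse regions and are candidates for noise once eps is set near the knee's height.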

Task 4: Basic DBSCAN

# Start with reasonable parameters
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# Check results
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = list(dbscan_labels).count(-1)

print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
print(f"Cluster labels: {np.unique(dbscan_labels)}")

# Visualize DBSCAN results
plt.figure(figsize=(10, 6))

# Plot clustered points
for i in range(n_clusters):
    mask = dbscan_labels == i
    plt.scatter(mca_coords.loc[mask, 'dim_1'], 
               mca_coords.loc[mask, 'dim_2'], 
               c=colors[i % len(colors)], 
               label=f'Cluster {i+1}',
               s=60,
               alpha=0.7)

# Plot noise points
if n_noise > 0:
    noise_mask = dbscan_labels == -1
    plt.scatter(mca_coords.loc[noise_mask, 'dim_1'], 
               mca_coords.loc[noise_mask, 'dim_2'], 
               c='black', 
               marker='x', 
               s=60, 
               label='Noise',
               alpha=0.8)

plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('DBSCAN Clustering (eps=0.5, min_samples=5)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Task 5: Experiment with Epsilon

Try different eps values and observe the effects:

# Code here
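One way to run the sweep, reusing the cluster/noise counting from Task 4 (shown here on synthetic stand-in data rather than the MCA coordinates):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 2))  # stand-in for the MCA coordinates

for eps in [0.1, 0.3, 0.5, 0.7, 1.0]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

Expect small eps values to fragment the data and flag many noise points, and large values to merge everything into one cluster.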

Key Questions:

  1. How does changing epsilon affect cluster formation?
  2. What happens to noise points as epsilon changes?
  3. Should all points belong to a cluster?
  4. Which epsilon value seems most appropriate?

Task 6: Final Comparison

Create side-by-side plots comparing your best k-means, k-means++, and DBSCAN results:

# Code here

Discussion Questions:

  1. Which method captures the structure of your data best?
  2. When would you choose DBSCAN over k-means/k-medoids?
  3. What are the advantages of identifying noise points?

Summary Discussion

Reflect on:

  • Which clustering method worked best for this dataset and why?
  • How important was the choice of evaluation method?
  • What are the key considerations when choosing clustering algorithms?
  • When might you prefer density-based over centroid-based methods?

Key Takeaways

  • Template approach: Similar code patterns work across clustering methods
  • Evaluation matters: Elbow and silhouette can give different suggestions
  • No universal best: Choice depends on data characteristics and goals
  • DBSCAN advantages: Handles noise and arbitrary shapes, no k required

Homework Challenge

Apply these clustering methods to a dataset from your own field. Which method works best and why?

Additional Python-Specific Notes

Key Python Libraries Used:

  • sklearn.cluster: KMeans, DBSCAN
  • kmedoids: KMedoids (PAM implementation)
  • sklearn.metrics: silhouette_score, silhouette_samples
  • matplotlib.pyplot: Plotting and visualization
  • pandas: Data manipulation and analysis
  • numpy: Numerical computations
  • yellowbrick: elbow and silhouette plots

Installation Requirements:

conda install -c conda-forge scikit-learn matplotlib pandas numpy seaborn kmedoids

Optionally:

pip install yellowbrick

Performance Tips:

  • Use n_init=25 for k-means to ensure stable results
  • Set random_state for reproducible results
  • Consider data scaling if features have very different ranges
  • For large datasets, consider using MiniBatchKMeans instead of KMeans
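MiniBatchKMeans has the same interface as KMeans but fits on random mini-batches, trading a little accuracy for a large speed-up. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(123)
X_big = rng.normal(size=(10_000, 2))  # pretend this is a large dataset

# Same API as KMeans; batch_size controls the mini-batch size
mbk = MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=10,
                      random_state=123)
labels = mbk.fit_predict(X_big)
print("Inertia:", round(mbk.inertia_, 1))
```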