DS101 – Fundamentals of Data Science
17 Nov 2025
Last week (W07) we covered:
Our Spotify example:
Today: Finish supervised learning → pivot to unsupervised!
So far, we have seen:
These make different assumptions about how classes are separated in feature space.
Now we introduce a classifier that takes a geometric view: it explicitly asks which boundary is safest.
Logistic regression draws one straight line. Decision trees draw blocky, step-like regions. Both are useful, but:
Support Vector Machines approach the problem differently: they look for the most robust boundary using geometry.
To motivate why SVMs matter, let’s compare how three familiar classifiers behave on the exact same data (Setosa vs Virginica from the Iris dataset):
Logistic regression
Draws one straight line as the decision boundary
Models the log-odds of the positive class:
\[ \log\left(\frac{p(\text{class}=1)}{p(\text{class}=0)}\right) \] as a linear function of the features
Interpretable, but limited to straight boundaries
Decision tree
Support Vector Machine (SVM)
Next slide: all three models on identical features.
If the two classes can be separated, there isn’t just one solution: there are infinitely many valid separating lines.
SVM asks:
Among all separating lines that classify the training data correctly, which one is the safest?
SVM chooses the line that maximises the margin — the safest separation between classes.
Perfect separation is rare:


Classes naturally overlap. Forcing a perfectly separating boundary often:
Soft-margin SVM relaxes the “no errors” rule:
The key tuning knob is the parameter C.
In the SVM optimisation, C is a positive parameter controlling how strongly we penalise violations of the margin.
Interpretation:
C is the model’s strictness.
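A minimal sketch of how C behaves in practice, assuming scikit-learn and a synthetic two-class dataset (the data and the C values below are illustrative, not the lecture demo):

```python
# Sketch: how C controls the strictness of a linear soft-margin SVM.
# Synthetic blobs stand in for real data; the C values are arbitrary.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=42)

for C in [0.01, 1, 100]:
    model = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide, tolerant margin (many support vectors, violations allowed).
    # Large C: narrow, strict margin (fewer support vectors, violations heavily penalised).
    print(f"C={C:>6}: {len(model.support_)} support vectors, "
          f"training accuracy = {model.score(X, y):.2f}")
```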
Hit vs flop songs using energy and danceability:
This plot also shows why linear SVM is a poor choice for this dataset.

Benign vs malignant tumours (2D projection):

Sometimes even with soft margins:
We need a classifier that can draw smooth curved boundaries while still using the maximum-margin idea.
This is where kernels enter.
Idea:
SVMs do all this implicitly using a kernel function: we never compute the high-dimensional coordinates.
Common kernels:
RBF kernel finds smooth, curved regions separating all three iris species:
For Spotify hits/flops, a curved RBF boundary works far better than any straight line:
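As a rough sketch of why the kernel matters, here is a linear vs RBF comparison in scikit-learn; make_moons stands in for the hits/flops data, and the parameter values are illustrative:

```python
# Sketch: a linear SVM vs an RBF-kernel SVM on data that no straight line separates well.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(f"{kernel:>6} kernel: test accuracy = {clf.score(X_test, y_test):.2f}")
```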
SVMs give us:
They complete our supervised learning toolkit alongside logistic regression and decision trees.
SVM is especially useful when:
Performance Metrics:
Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Flop | 0.89 | 0.73 | 0.80 | 960 |
| Hit | 0.77 | 0.91 | 0.83 | 960 |
| accuracy | | | 0.82 | 1920 |
| macro avg | 0.83 | 0.82 | 0.82 | 1920 |
| weighted avg | 0.83 | 0.82 | 0.82 | 1920 |
Confusion Matrix:

📊 Interpretation:
💡 Conceptual takeaway: The model isn’t just smarter — it’s smoother. It balances flexibility and generalization without memorizing the data.
So far, our models produced decision boundaries with fixed shapes:
Neural networks take a more flexible approach:
They learn complex boundaries by composing many simple functions.
To see how, we start with a single neuron.
A neuron computes:
\[ z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b \]
\[ \text{output} = f(z) \]
Where: the \(w_i\) are weights, the \(x_i\) are input features, \(b\) is a bias term, and \(f\) is an activation function.
It’s a small, flexible function that can bend a decision boundary.
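A minimal NumPy sketch of that computation; the weights, bias, and inputs are made-up numbers, and tanh is just one possible activation \(f\):

```python
# A single artificial neuron: weighted sum plus bias, passed through an activation f.
import numpy as np

def neuron(x, w, b, f=np.tanh):
    z = np.dot(w, x) + b      # z = w1*x1 + ... + wn*xn + b
    return f(z)               # output = f(z)

x = np.array([0.5, -1.2])     # two input features
w = np.array([2.0, 0.7])      # weights (learned in practice; arbitrary here)
b = -0.3                      # bias (learned in practice; arbitrary here)
print(neuron(x, w, b))        # a single number; with tanh it lies between -1 and 1
```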
Diagram:

By linking neurons into layers, networks can learn richer patterns:
Each layer transforms what the previous one learned.

A neural network is any model built from neurons. This includes:
Deep learning refers to neural networks that have many layers stacked together.
The building blocks are the same, but depth allows the model to learn increasingly abstract and complex representations.
All deep learning models are neural networks, but not all neural networks are “deep”.
Deep learning uses networks with multiple successive layers. A deep network:
Formally:
\[ x \rightarrow f_1(x) \rightarrow f_2(f_1(x)) \rightarrow \dots \rightarrow f_L(\dots) \]
Diagram: 
Deep networks build hierarchical representations:
Audio:
Images:
Tabular data:
Depth = expressive power.
But deep networks process sequences poorly on their own, which leads to…

Earlier sequence models used Recurrent Neural Networks (RNNs).
They worked like this:
Conceptually:
word1 → memory → word2 → memory → word3 → memory → ...
Limitations:
Transformers were created to overcome these limits.
Sequential models struggle with long-distance dependencies.
Example (linguistic):
“The animal didn’t cross the street because it was too tired.”
“it” refers to animal, not street.
Example (social science):
“The central bank raised rates, but economists expect them to fall next year.”
“them” refers to rates, not economists.
Models that only remember a small recent context often fail here.
Attention solves the context problem by letting each word:
Look directly at any other word and decide how important it is.
It assigns attention weights that highlight relevant parts of the input.
This enables:
Diagram:

A single attention pattern is often not enough.
Multi-head attention uses several attention mechanisms in parallel:
Each head provides a different perspective on the input.
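For the curious, a minimal NumPy sketch of one attention head (scaled dot-product attention); real transformers learn the query/key/value projections, whereas here Q, K, V are random stand-ins. Multi-head attention simply runs several of these in parallel and combines the results:

```python
# Sketch: scaled dot-product attention for a toy "sentence" of 5 word vectors.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every word to every other word
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted mix of values + the attention weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))                 # 5 "words", each an 8-dimensional vector
output, weights = attention(Q, K, V)
print(weights.round(2))                             # row i: how much word i attends to each word
```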
Transformers remove recurrence and rely entirely on attention:
Transformers now power:
Canonical architecture diagram: 
Imagine you’re a sociologist studying global development.
You have data on ~170 countries:
But no one tells you which countries are:
You just get a big table of numbers — no labels.

Even without labels, you might notice:
At this stage, these are just visual impressions, not conclusions.
We will analyse this dataset properly later in a live clustering demo.
Real-world labels are often:
In many domains, we cannot rely on labels.
We still want to learn structure → this is where unsupervised learning comes in.
Your own brain does unsupervised learning constantly:
What you naturally do:
Unsupervised algorithms aim to do the same, but:
We learn from examples with answers.
(X, Y) pairs → learn mapping X → Y
Examples:
We only have X (no labels).
X → discover structure / patterns
Examples:
We will focus on four broad families:
Clustering
Association rule mining
Dimensionality reduction
Anomaly detection
Before any equations, we will build intuition with real-world examples.
Definition:
Divide data into groups (clusters) such that:
No labels are given — clusters are discovered, not predefined.
To get an idea of the breadth of existing clustering algorithms, see the scikit-learn documentation
Traditional categories like “Labour vs Conservative” can miss important nuances.
Researchers cluster voters based on:
A major UK study by More in Common (2022–2025) identifies seven “segments” such as:
These segments:
Countries signal alliances publicly, but their UN voting records tell another story.
Cluster countries based on:
We often find:
Clustering reveals the latent structure of international politics.
Instead of reading thousands of judgments one by one, we can:
represent cases by:
cluster cases to find:
This helps legal scholars and practitioners see the big picture.
In many diseases, not all patients are alike.
Cluster patients using:
You may discover:
Classic example: breast cancer subtypes (e.g., HER2+, ER+, triple-negative), originally discovered using unsupervised methods on molecular data.
Goal: find rules of the form:
If X is present, Y is often also present
Not about causation — just discovering regularities in large datasets.
Retailers have huge logs of transactions:
{bread, milk, eggs}, {pasta, tomatoes, parmesan}, …
Association rule mining finds patterns (Agrawal, Imieliński, and Swami 1993; Brin, Motwani, and Silverstein 1997) such as:
The famous (and probably exaggerated, but useful to know about) story (see this page for the details):
Customers buying diapers often also bought beer.
Regardless of the exact details, the idea is:
co-occurrence reveals surprising links
which can be used for:
Electronic health records contain:
Association rule mining can uncover comorbidity patterns, e.g.:
These patterns are useful for:
Examples: (El Khalifa, Yew, and Chi 2024) on dementia multimorbidity; (Bata et al. 2025) on diabetes-related comorbidities.
Modern datasets often have many features:
Problems:
We need a way to compress without destroying meaning.
Dimensionality reduction:
Turn many correlated variables into a smaller number of informative dimensions.
Goals:
This is not specific to machine learning — it appears in many fields.
HDI is a classic example of dimensionality reduction in practice.
We have many development-related indicators:
HDI focuses on three core components:
Each component summarises several detailed indicators.
These are:
One number summarises a country’s overall development level.
HDI is not a machine-learning method, but it serves the same purpose as dimensionality reduction:
HDI illustrates what dimensionality reduction does:
Machine learning methods generalise this idea:
But the conceptual goal is the same:
Compress complexity into a smaller number of meaningful dimensions.
You don’t need the details in DS101, just the intuition:
PCA (Principal Component Analysis)
Factor analysis
MCA (Multiple Correspondence Analysis)
Autoencoders (deep learning)
The key takeaway:
Many variables → a few dimensions → structure becomes easier to see.
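As a hedged illustration of the machine-learning version of this idea, here is PCA in scikit-learn on synthetic data (the 10 "indicators" below are randomly generated, not the real development dataset):

```python
# Sketch: compress 10 correlated indicators for 170 "countries" into 2 principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
latent = rng.normal(size=(170, 2))                  # two hidden "development" factors
X = latent @ rng.normal(size=(2, 10)) + rng.normal(scale=0.3, size=(170, 10))  # 10 noisy indicators

X_scaled = StandardScaler().fit_transform(X)        # standardise so no indicator dominates
pca = PCA(n_components=2).fit(X_scaled)
X_2d = pca.transform(X_scaled)                      # 170 rows x 2 columns: easier to plot and interpret

print(pca.explained_variance_ratio_)                # share of the variation each component keeps
```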
Sometimes the most important observations are the rare, strange, or unusual ones.
Anomaly detection:
Identify data points that do not fit with the majority.
Applications:
Normal patterns:
Anomalies:
These anomalies:
Labels are hard here:
For literature on the topic, see (Goswami 2024), (Han et al. 2020), (Berrada et al. 2020) and (Benabderrahmane et al. 2021)
Normal behaviour:
Anomalies:
These stand out as:
Anomaly detection can flag these for investigation.
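One hedged sketch of how this can be automated: an Isolation Forest (one of many possible algorithms) on synthetic "transactions". The feature values and contamination rate are invented for illustration:

```python
# Sketch: flag unusual transactions (amount, hour of day) with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
normal = rng.normal(loc=[50, 12], scale=[20, 4], size=(500, 2))   # everyday purchases, daytime
odd = np.array([[5000, 3], [4200, 4], [3800, 2]])                 # huge amounts in the small hours
X = np.vstack([normal, odd])

labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
print(X[labels == -1])   # points flagged as anomalies (-1); the odd rows should appear here
```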
The Panama Papers (2016):
Impossible for journalists to read everything.
So they used data analysis tools:
graph databases (e.g. Neo4j) to model companies, people, accounts
visual network exploration tools (Linkurious)
text mining to extract names, relationships, jurisdictions
algorithms to highlight unusual patterns:
Many big stories began with anomalous patterns:
In many ways, the anomalies were the story.
For more on the story, see this page
We’ve seen four different families:
They look quite different, but deep down they share a common idea:
We are always asking: what is similar to what, and what is different?
Clustering
Dimensionality reduction
Anomaly detection
Association rules
This common theme guides everything that comes next:
Next:
To use unsupervised methods, we often represent each data point as a vector of features.
We then measure:
How far apart are two vectors? How similar are they?
This requires a distance or similarity function.
The most familiar one:
\[ d(A,B) = \sqrt{\sum_{i=1}^n (A_i - B_i)^2} \]

Use when:
Examples:
Also called L1 distance:
\[ d(A,B) = \sum_{i=1}^n |A_i - B_i| \]

Use when:
Example:
Instead of distance, a similarity:
\[ S_C(A,B) = \frac{\sum_i A_i B_i} {\sqrt{\sum_i A_i^2} \sqrt{\sum_i B_i^2}} \]

Useful when:
Examples:
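A small sketch computing the three measures above for one pair of made-up feature vectors (NumPy only):

```python
# Euclidean and Manhattan distance, and cosine similarity, for two toy vectors.
import numpy as np

A = np.array([2.0, 0.0, 1.0])
B = np.array([4.0, 1.0, 3.0])

euclidean = np.sqrt(np.sum((A - B) ** 2))                        # straight-line distance
manhattan = np.sum(np.abs(A - B))                                # city-block distance
cosine_sim = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))     # angle: 1 = same direction

print(euclidean, manhattan, round(cosine_sim, 3))                # 3.0  5.0  0.965
```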
Not all data are numerical vectors.
We can still define “difference” for:
Hamming distance
Example: cat vs car → distance = 1 (only last letter differs)
Useful for:
Levenshtein (edit) distance
Example: kitten → sitting = 3 edits
Useful for:
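A sketch of both string measures, applied to the slide examples (plain Python, no libraries):

```python
# Hamming distance: mismatched positions (equal-length strings).
# Levenshtein distance: minimum insertions, deletions, or substitutions.

def hamming(s, t):
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    prev = list(range(len(t) + 1))              # edit distances from "" to prefixes of t
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                  # delete a
                            curr[j - 1] + 1,              # insert b
                            prev[j - 1] + (a != b)))      # substitute a -> b (free if equal)
        prev = curr
    return prev[-1]

print(hamming("cat", "car"))             # 1
print(levenshtein("kitten", "sitting"))  # 3
```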
For sets A and B:
\[ J(A,B) = \frac{|A \cap B|}{|A \cup B|} \]
Example: UN voting:
Then J(A,B) tells you:
Example:
Country A votes YES: {Res 1, 3, 5, 7}
Country B votes YES: {Res 1, 3, 6, 8}
Overlap: {1, 3} = 2
Union: {1, 3, 5, 6, 7, 8} = 6
Jaccard = 2/6 = 0.33
Useful for:
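The worked example above, checked directly with Python sets:

```python
# Jaccard similarity of the two countries' YES votes.
A = {1, 3, 5, 7}   # resolutions Country A voted YES on
B = {1, 3, 6, 8}   # resolutions Country B voted YES on

jaccard = len(A & B) / len(A | B)   # |intersection| / |union|
print(jaccard)                      # 2 / 6 ≈ 0.33
```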
| Metric | Data type | Idea | Typical uses |
|---|---|---|---|
| Euclidean | Continuous | Straight-line distance | Geometric data, numeric indicators |
| Manhattan | Continuous | Grid / city-block distance | Robust to outliers, discrete steps |
| Cosine similarity | Continuous | Angle (pattern) | Text, proportions, spending profiles |
| Hamming | Categorical | Mismatched positions | Voting records, fixed-length codes |
| Levenshtein | Strings | Edits needed to match | Text, DNA, version differences |
| Jaccard | Sets | Overlap / union | Co-occurrence, basket analysis, tags, keywords |
Key point: your choice of metric defines what “similar” means and therefore shapes the clusters and anomalies you discover.
Distance tells us how far apart two points are.
We can extend this idea to:
How many neighbours does a point have within some radius?
Define the ε-neighbourhood of a point p as:
all points whose distance from p is less than ε.



We can think of:
density = number of points / volume (or area)
Intuition:
Points in high-density regions often belong to the same cluster
Points in low-density regions may be:
Some algorithms (like DBSCAN) are built entirely around:
We’ll return to this when we reach DBSCAN.
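A minimal sketch of the ε-neighbourhood idea, assuming scikit-learn; the data and the value of ε are made up:

```python
# Count how many points lie within distance eps of each point.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
dense = rng.normal(loc=0.0, scale=0.3, size=(50, 2))   # a tight, high-density blob
sparse = rng.uniform(low=-4, high=4, size=(10, 2))     # scattered, low-density points
X = np.vstack([dense, sparse])

eps = 0.5
nn = NearestNeighbors(radius=eps).fit(X)
counts = np.array([len(idx) - 1 for idx in nn.radius_neighbors(X)[1]])  # exclude the point itself

print("most neighbours within eps:", counts.max())
print("points with no neighbours within eps:", int(np.sum(counts == 0)))
```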
One of the most popular clustering algorithms: K-means.
The name:
Basic idea:
SVM (supervised):
K-means (unsupervised):
Same tool (distance), different objective.
K-means tries to minimise the within-cluster sum of squares (WCSS):
\[ \min_{C_1,\dots,C_K} \sum_{i=1}^K \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
where: \(C_1,\dots,C_K\) are the clusters and \(\mu_i\) is the centroid (mean) of cluster \(C_i\).
In words:
“Choose centres so that points are as close as possible to their cluster centre, on average.”
A live demo: see here
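Before the step-by-step walkthrough, here is the objective computed directly from its definition, for a made-up assignment of six points to \(K = 2\) clusters (a sketch, not part of the demo):

```python
# WCSS: sum of squared distances from each point to its cluster centroid.
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.5],    # cluster 0
              [8.0, 8.0], [9.0, 7.5], [8.5, 9.0]])   # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

wcss = 0.0
for k in range(2):
    cluster = X[labels == k]
    centroid = cluster.mean(axis=0)                  # mu_k
    wcss += np.sum((cluster - centroid) ** 2)        # squared distances to mu_k

print(wcss)   # the quantity K-means tries to minimise over all possible assignments
```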
Step 1: choose K

Step 2: initialise centroids

Step 3: assign to nearest centroid

Step 4: recompute centroids





Assignments change, then centres update, then assignments change, etc.




Eventually:
K-means is guaranteed to converge, although not always to the global best solution.
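In practice we rarely code the loop by hand; a minimal scikit-learn sketch (synthetic blobs, illustrative parameters):

```python
# K-means with scikit-learn: fit, then inspect assignments, centroids, and the final WCSS.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.labels_[:10])        # cluster assignment of the first 10 points
print(km.cluster_centers_)    # the final centroids
print(km.inertia_)            # WCSS at convergence (a local, not necessarily global, minimum)
```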
Strengths:
Limitations:
One popular heuristic:

Real datasets often give:
Other approaches to choosing K:
See this link or this link for more approaches to determine \(k\)
Important message: K-means is simple and useful, but choosing \(K\) is part art, part science.
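A sketch of the elbow heuristic mentioned above, assuming scikit-learn and matplotlib (the synthetic data and the range of \(K\) values are illustrative):

```python
# Fit K-means for several K and look for where the drop in WCSS starts to flatten.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=1)

ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow plot: look for the bend")
plt.show()
```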
Problem:
K-means++ (see original paper (Arthur and Vassilvitskii 2007)):
Benefits:
Most modern implementations (e.g. scikit-learn) use K-means++ by default.
K-means:
assumes roughly spherical clusters
forces every point into a cluster
struggles with:
We need methods that can:
A live demo: see here
DBSCAN = Density-Based Spatial Clustering of Applications with Noise.
Parameters:
Concepts:
Algorithm (informally):
Clusters correspond to regions of high density, separated by low-density regions.
If you want more details, you can have a look at the original DBSCAN paper (Ester et al. 1996).
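A minimal scikit-learn sketch on data where K-means struggles (two interleaving half-moons); the eps and min_samples values below are illustrative, not recommended defaults:

```python
# DBSCAN: the number of clusters is discovered, and sparse points become noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("points labelled as noise:", int(np.sum(db.labels_ == -1)))
```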
| Aspect | K-means | DBSCAN |
|---|---|---|
| # of clusters | Must choose K | Determined by data (via ε, min_samples) |
| Cluster shape | Spherical blobs (roughly) | Arbitrary shapes |
| Outliers | Forced into some cluster | Can be labelled as noise |
| Parameters | K | ε, min_samples |
| Best for | Simple, compact clusters | Clusters of varying shape + noise |
DBSCAN is particularly powerful when:
We return to our development dataset:
Questions:
We will:
You will be able to experiment yourself using:
Many real-world problems don’t come with labels.
Unsupervised learning helps us:
All these methods rely on a notion of similarity/distance.
Choosing the right distance metric is crucial.
We studied K-means in detail and introduced DBSCAN as a density-based alternative.
Labels are expensive, incomplete, or unavailable in many domains.
Unsupervised learning lets us explore structure before committing to labels.
Clusters and patterns are hypotheses, not ground truth.
The choice of:
Domain knowledge is essential to interpret and validate unsupervised results.
This week: Countries + numbers → unsupervised methods on tabular data.
Next week: Documents + words:
Representing text as numbers (bag-of-words, TF–IDF)
Using similarity & clustering for:
The same ideas — distance, clusters, anomalies — will carry over, just on a new type of data.
Thank you!
Reflect on how your own research area might use unsupervised learning.
Think about:
See you next week for text mining and unstructured data.

LSE DS101 2025/26 Autumn Term