Theme: Unsupervised learning
2023-11-24
- assign data points to the closest centroid (in terms of distance)
- recompute the centroids
The algorithm stops when the centroids and data point assignments no longer change (a minimal sketch of this loop follows below).
See (Arthur and Vassilvitskii 2007) for more details
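To make the two steps above concrete, here is a minimal base R sketch of the K-Means loop. This is illustrative only: the helper name simple_kmeans and the random initialisation are our own choices, and in practice we simply call the built-in kmeans(), as we do below.

# illustrative sketch of the K-Means loop: assign, recompute, repeat
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # start from k randomly chosen data points as initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in seq_len(max_iter)) {
    # step 1: assign each point to its closest centroid (squared Euclidean distance)
    dists <- sapply(1:k, function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    new_assignment <- apply(dists, 1, which.min)
    # stop when the assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # step 2: recompute each centroid as the mean of the points assigned to it
    # (empty clusters are not handled here; this is only a sketch)
    centroids <- t(sapply(1:k, function(j) colMeans(X[assignment == j, , drop = FALSE])))
  }
  list(cluster = assignment, centers = centroids)
}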
How many clusters do you see?
Let’s see how K-Means deals with this dataset
Now how about DBSCAN?
Quick recap: \(\epsilon\)-neighbourhood
DBSCAN takes two parameters:
- \(\epsilon\) or eps: does that sound familiar? That’s because it is! The parameter eps defines the radius of the neighborhood around a point \(x\). This is what we called the \(\epsilon\)-neighborhood of \(x\) earlier.
- minPts: it’s simply the minimum number of points required within the \(\epsilon\) radius.
A few definitions:
- A point \(x\) is a core point if its \(\epsilon\)-neighborhood contains at least MinPts points.
- A point \(y\) is a border point if its \(\epsilon\)-neighborhood contains fewer than MinPts points, but it belongs to the \(\epsilon\)-neighborhood of some core point \(z\).
- A point that is neither a core point nor a border point is an outlier (noise) point.
The figure shows the different types of points (core, border and outlier points) using MinPts = 6:
A few more definitions:
- Two points are density-connected if both can be reached from a common core point through a chain of core points, each contained in the \(\epsilon\)-neighborhood of the previous one.
- In density-based clustering, a cluster is defined as a group of density-connected points.
The algorithm works as follows: for each point, find all neighboring points within a distance eps of the starting point (\(x_i\)). Each point with a neighbor count greater than or equal to MinPts is marked as a core point or visited. Core points that are density-connected are grouped into the same cluster, and the points that do not belong to any cluster are treated as outliers or noise.
If this explanation confuses you, you can have a look at this YouTube video.
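To make the core/border/noise definitions concrete, here is a small base R sketch (our own illustrative helper, not the implementation used by the DBSCAN packages below) that labels each point of a data set for a given eps and MinPts:

# label each point as core / border / noise for a given eps and MinPts
# (only the labelling step, not the full cluster-growing algorithm)
label_points <- function(X, eps, MinPts) {
  D <- as.matrix(dist(X))            # pairwise Euclidean distances
  n_in_eps <- rowSums(D <= eps)      # size of each eps-neighbourhood (the point itself counts)
  is_core <- n_in_eps >= MinPts
  # border points: not core, but within eps of at least one core point
  is_border <- !is_core & rowSums(D[, is_core, drop = FALSE] <= eps) > 0
  ifelse(is_core, "core", ifelse(is_border, "border", "noise"))
}
# example on 50 random 2-d points
table(label_points(matrix(rnorm(100), ncol = 2), eps = 1, MinPts = 5))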
Important
Note that for most unsupervised learning techniques, no tidymodels functions are available. So, while we stay in the unsupervised learning realm, our code will follow base R conventions instead of the usual tidyverse/tidymodels ones.
# to run this code you might need to install packages `factoextra`, `fpc` and `dbscan`
library(factoextra) # contains the multishapes data set and lets us visualise the clusters
library(fpc)        # provides fpc::dbscan()
library(dbscan)
library(ggplot2)

# loading the data
data("multishapes")
df <- multishapes[, 1:2]
set.seed(123)

# plotting the data
ggplot(df, aes(x, y)) +
  geom_point()
# running k-means on the data
km.res <- kmeans(df, centers = 5, nstart = 25)

# visualising the results of k-means
fviz_cluster(km.res, df, geom = "point",
             ellipse = FALSE, show.clust.cent = FALSE,
             palette = "jco", ggtheme = theme_classic())
# running DBSCAN
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)

# plotting the DBSCAN results
fviz_cluster(db, data = df, stand = FALSE,
             ellipse = FALSE, show.clust.cent = FALSE,
             geom = "point", palette = "jco", ggtheme = theme_classic())
We already mentioned that DBSCAN detects outliers/anomalies.
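In the output of fpc::dbscan(), noise points are the ones assigned to cluster 0, so we can extract the detected anomalies directly:

# points labelled 0 by DBSCAN are noise, i.e. the detected outliers/anomalies
outliers <- df[db$cluster == 0, ]
nrow(outliers)   # how many points were flagged as noise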
If you want to find out more about what to do when you get new data and about other clustering-based anomaly detection techniques, head this way.
Let’s see an example.
Let’s say we have four points: A, B, C, and D.
If K = 2, the K-neighbors of A (all points within the K-distance of A, i.e. within the distance to A’s 2\(^{nd}\) nearest neighbor) include C, B and D, and therefore \(||N_2(A)|| = 3\): the K-neighborhood can contain more than K points when several neighbors are tied at the K-distance.
Reachability distance:
Written more formally, this would look like this: \(RD(x_i,x_j)=\max(\textrm{K-distance}(x_j),\textrm{distance}(x_i,x_j))\)
Said differently, if a point \(x_i\) belongs to the K-neighbors of \(x_j\), the reachability distance will be the K-distance of \(x_j\) (blue line in the figure); otherwise, the reachability distance will be the actual distance between \(x_i\) and \(x_j\) (orange line).
Local reachability density: it is the inverse of the average reachability distance of a point A from its neighbors. The exact formula is:
\(LRD_k(A)=\frac{1}{\sum_{x_j\in N_k(A)} \frac{RD(A,x_j)}{||N_k(A)||}}\)
Intuitively, according to the LRD formula, the higher the average reachability distance (i.e., the neighbors are far from the point), the lower the density of points around that particular point. This tells us how far a point is from the nearest cluster of points. A low LRD value implies that the closest cluster is far from the point.
In other words, the LRD relies on the intuition that the longer the distance to the next neighbors, the sparser the area the respective point is located in. It tells us how far we have to travel from our point to reach the next point or cluster of points. So the lower it is, the less dense it is, the longer we have to travel.
The local outlier factor (LOF) of a point A is then the ratio of the average LRD of A’s K-neighbors to A’s own LRD. Or more formally: \(LOF_k(A)=\frac{\sum_{x_j\in N_k(A)} LRD_k(x_j)}{||N_k(A)||}\times\frac{1}{LRD_k(A)}\)
Intuitively, if the point is not an outlier (i.e. it is an inlier), the average LRD of its neighbors is approximately equal to the LRD of the point (because the density of a point and the density of its neighbors are roughly equal). In that case, the LOF is nearly equal to 1. On the other hand, if the point is an outlier, its LRD is smaller than the average LRD of its neighbors, and the LOF value will be high.
Generally, if LOF > 1, the point is considered an outlier, but that is not always true. Say we know that there is only one outlier in the data: then we take the maximum among all the LOF values, and the point corresponding to that maximum is considered the outlier.
| Distance | Value |
|---|---|
| d(A,B) | 1 |
| d(A,C) | 2 |
| d(A,D) | 3 |
| d(B,C) | 1 |
| d(B,D) | 4 |
| d(C,D) | 3 |
- K-distance(A): since C is the 2\(^{nd}\) nearest neighbor of A, K-distance(A) = d(A,C) = 2
- K-distance(B): since A and C are tied as the 2\(^{nd}\) nearest neighbor of B, K-distance(B) = d(B,C) = d(B,A) = 1
- K-distance(C): since A is the 2\(^{nd}\) nearest neighbor of C, K-distance(C) = d(C,A) = 2
- K-distance(D): since A and C are tied as the 2\(^{nd}\) nearest neighbor of D, K-distance(D) = d(D,A) = d(D,C) = 3
We use the K-distance and K-neighbourhood to compute the LRD for each point:
\(LRD_2(A)=\frac{1}{\frac{RD(A,B)+RD(A,C)}{||N_2(A)||}}=\frac{1}{\frac{1+2}{2}}=0.667\)
\(LRD_2(B)=\frac{1}{\frac{RD(B,A)+RD(B,C)}{||N_2(B)||}}=\frac{1}{\frac{2+2}{2}}=0.50\)
\(LRD_2(C)=\frac{1}{\frac{RD(C,A)+RD(C,B)}{||N_2(C)||}}=\frac{1}{\frac{2+1}{2}}=0.667\)
\(LRD_2(D)=\frac{1}{\frac{RD(D,A)+RD(D,C)}{||N_2(D)||}}=\frac{1}{\frac{3+3}{2}}=0.333\)
And finally the LOF of each point:
\(LOF_2(A)=\frac{LRD_2(B)+LRD_2(C)}{||N_2(A)||}\times\frac{1}{LRD_2(A)}=\frac{0.5+0.667}{2}\times\frac{1}{0.667}=0.87\)
\(LOF_2(B)=\frac{LRD_2(A)+LRD_2(C)}{||N_2(B)||}\times\frac{1}{LRD_2(B)}=\frac{0.667+0.667}{2}\times\frac{1}{0.5}=1.334\)
\(LOF_2(C)=\frac{LRD_2(A)+LRD_2(B)}{||N_2(C)||}\times\frac{1}{LRD_2(C)}=\frac{0.667+0.5}{2}\times\frac{1}{0.667}=0.87\)
\(LOF_2(D)=\frac{LRD_2(A)+LRD_2(C)}{||N_2(D)||}\times\frac{1}{LRD_2(D)}=\frac{0.667+0.667}{2}\times\frac{1}{0.333}=2\)
The highest LOF among the four points is LOF(D). Therefore, D is an outlier/anomaly.
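As a sanity check, here is a short base R sketch (written for this toy example only, not a general implementation) that recomputes the K-distances, reachability distances, LRDs and LOFs from the distance table above; up to rounding, the values match the hand calculation.

# distance table from the worked example
D <- matrix(c(0, 1, 2, 3,
              1, 0, 1, 4,
              2, 1, 0, 3,
              3, 4, 3, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))
K <- 2
# K-distance: distance to the K-th nearest neighbour (excluding the point itself)
k_dist <- apply(D, 1, function(d) unname(sort(d[d > 0])[K]))
# K-neighbourhood: every other point within the K-distance
neighbours <- sapply(rownames(D), function(p) setdiff(names(which(D[p, ] <= k_dist[p])), p),
                     simplify = FALSE)
# reachability distance RD(x_i, x_j) = max(K-distance(x_j), d(x_i, x_j))
RD <- function(i, j) max(k_dist[j], D[i, j])
# local reachability density: inverse of the mean reachability distance to the neighbours
LRD <- sapply(rownames(D), function(p) 1 / mean(sapply(neighbours[[p]], function(q) RD(p, q))))
# LOF: average LRD of the neighbours divided by the point's own LRD
LOF <- sapply(rownames(D), function(p) mean(LRD[neighbours[[p]]]) / LRD[p])
round(LOF, 2)   # D should stand out with the largest LOF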
Pros
Cons
See (Breunig et al. 2000) for more details on LOF.
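In practice you would not compute LOF by hand. As a hedged sketch, recent versions of the dbscan package (loaded earlier) provide a lof() function that scores every point; note that older versions call the neighbourhood-size argument k rather than minPts.

# LOF scores for the multishapes data: the higher the score, the more "outlying" the point
lof_scores <- dbscan::lof(df, minPts = 5)
summary(lof_scores)
# look at the points with the largest scores, i.e. the candidate anomalies
head(df[order(lof_scores, decreasing = TRUE), ])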
For a detailed demo (with code and PCA!) on a (simulated) fraud dataset, head over this way.
Another popular method of anomaly detection is called isolation forest: if you’d like to learn about it, see (Liu, Ting, and Zhou 2008).
What is Principal Component Analysis (PCA)?
Principal component analysis (PCA) is a technique that transforms high-dimensional data into lower dimensions while retaining as much information as possible.
What does “information” mean in the PCA context?
variance!
Let’s take a look at the variance formula: \(S^2=\frac{\sum (x_i-\bar{x})^2}{n-1}\) (i.e. the average degree to which each point differs from the mean \(\bar{x}\)). Nowhere in the formula is there an association with information. So where does it come from?
Say we’re playing a guessing game and are trying to guess the height of three people.
| Person | Height (cm) |
|---|---|
| Alex | 145 |
| Ben | 160 |
| Chris | 185 |
If we look at the silhouettes below, we can guess that A is Chris, B is Alex and C is Ben.
If we now try to figure out who hides behind these silhouettes when we have this new group of people:
| Person | Height (cm) |
|---|---|
| Tabby | 171 |
| Jon | 172 |
| Alex | 173 |
this suddenly becomes tougher!
That’s because in the earlier group, the heights varied substantially. In the same way, when our data has a higher variance, it holds more information. Variance is information.
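We can check this with base R’s var(), which implements the variance formula above:

var(c(145, 160, 185))   # Alex, Ben, Chris: heights vary a lot, easy to tell apart
var(c(171, 172, 173))   # Tabby, Jon, Alex: heights barely vary, hard to tell apart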
Now suppose we also have the people’s weights to try and differentiate them. Is it useful at all?
| Person | Height (cm) | Weight (kg) |
|---|---|---|
| Alex | 145 | 68 |
| Ben | 160 | 67 |
| Chris | 185 | 70 |
What if height and weight had the same variance?
In that case, PCA combines height and weight linearly to create two new (uncorrelated/orthogonal) variables that capture the total variance of both height and weight: these are the principal components.
In general, with real data, you don’t get a principal component that captures 100% of the variance.
Performing a PCA gives us N principal components, where N is equal to the dimensionality of our original data (i.e. the number of features in the data).
From this list, you usually keep the smallest number of principal components that still explains most of the variance in the original data.
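As a quick illustration (a minimal sketch using the toy height/weight table above and base R’s prcomp()), we can run a PCA and see how the components combine the two variables and how much variance each one explains:

# toy data: the height/weight table from above
people <- data.frame(height = c(145, 160, 185),
                     weight = c(68, 67, 70),
                     row.names = c("Alex", "Ben", "Chris"))
# centre and scale the variables so they are comparable, then run PCA
pca <- prcomp(people, center = TRUE, scale. = TRUE)
pca$rotation   # how each principal component combines height and weight
summary(pca)   # proportion of variance explained by each component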
If you want additional explanations on PCA, check out this YouTube video or this link or this link (more theoretical).