DS101 – Fundamentals of Data Science
18 Nov 2024
              precision    recall  f1-score   support

    Negative       0.83      0.98      0.90        54
    Positive       0.99      0.89      0.94       102

    accuracy                           0.92       156
   macro avg       0.91      0.94      0.92       156
weighted avg       0.93      0.92      0.92       156
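To see where the numbers in such a report come from, here is a sketch that recomputes them from a confusion matrix. The counts below are not given in the slides; they are a hypothetical reconstruction chosen to be consistent with the report above:

```python
# Hypothetical confusion-matrix counts consistent with the report above
# ("Positive" treated as the positive class); not taken from the slides.
tn, fp, fn, tp = 53, 1, 11, 91

precision_pos = tp / (tp + fp)   # 91 / 92  ≈ 0.99
recall_pos = tp / (tp + fn)      # 91 / 102 ≈ 0.89
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

precision_neg = tn / (tn + fn)   # 53 / 64  ≈ 0.83
recall_neg = tn / (tn + fp)      # 53 / 54  ≈ 0.98
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 144 / 156 ≈ 0.92

print(round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
print(round(accuracy, 2))
```

Note how the per-class support (54 and 102) weights the "weighted avg" row, while "macro avg" is a plain mean of the two classes.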
Say you have a dataset with two classes of points:
The goal is to find a line that separates the two classes of points:
It can get more complicated than just a line:
Image from Stack Exchange | TeX
(🗓️ Week 07)
(🗓️ Week 08)
Unsupervised learning algorithms:
Examples of common distance/similarity functions:
Real-valued data:
Euclidean distance: \(d(A,B)=\sqrt{\sum_{i=1}^n |A_i-B_i|^2}\)
Cosine similarity: \(S_C(A,B)=\frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}\) (the corresponding cosine distance is \(1-S_C(A,B)\))
where \(A\) and \(B\) are two n-dimensional vectors of attributes, and \(A_i\) and \(B_i\) are the \(i\)th components of vectors \(A\) and \(B\) respectively
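Both formulas can be sketched directly in Python (the sample vectors below are made up for illustration):

```python
import numpy as np

def euclidean_distance(a, b):
    # d(A, B) = sqrt(sum_i |A_i - B_i|^2)
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    # S_C(A, B) = (A . B) / (||A|| ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
print(euclidean_distance(a, b))   # sqrt(2) ≈ 1.414
print(cosine_similarity(a, b))    # 4 / 5 = 0.8
```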
String similarity:
Levenshtein distance: given two strings, it is the minimal number of edits (substitutions, deletions, insertions) needed to turn one string into the other, e.g.:
The Levenshtein distance between kitten and sitting is 3, since the following 3 edits change one into the other, and there is no way to do it with fewer than 3 edits:
1. kitten → sitten (substitute "s" for "k")
2. sitten → sittin (substitute "i" for "e")
3. sittin → sitting (insert "g" at the end)
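A minimal dynamic-programming implementation (my own sketch, not from the slides) reproduces the kitten/sitting example:

```python
def levenshtein(s, t):
    # dp[i][j] = minimal number of edits turning s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters of s
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```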
Set similarity:
A set is a collection of items with no order or repetition. Sets are typically used to represent relationships or associations between objects or even people.
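The slides do not name a specific measure at this point, but a common choice for comparing sets is the Jaccard similarity, the size of the intersection divided by the size of the union. A quick sketch (the example sets are made up):

```python
def jaccard_similarity(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|: 1.0 for identical sets, 0.0 for disjoint ones
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# e.g. comparing two people's film preferences
movies_alice = {"Alien", "Arrival", "Dune"}
movies_bob = {"Arrival", "Dune", "Solaris"}
print(jaccard_similarity(movies_alice, movies_bob))  # 2 / 4 = 0.5
```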
\(\epsilon\)-neighbourhoods:
Example:
(Images source: https://domino.ai/blog/topology-and-density-based-clustering)
Density:
Example:
If we take the figure above, the local density estimation at \(p=(3,2)\) is the number of points in its \(\epsilon\)-neighbourhood (31) divided by the area of the disc (\(0.25\pi\), for \(\epsilon=0.5\)): \(\frac{31}{0.25\pi} \approx 39.5\)
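The two notions fit together in a few lines of code: the \(\epsilon\)-neighbourhood of \(p\) is the set of points within distance \(\epsilon\), and the local density divides their count by the area of the disc. The point cloud below is synthetic (the figure's data is not available), so the count will differ from the 31 in the example:

```python
import math
import random

def epsilon_neighbourhood(p, points, eps):
    # all points strictly within distance eps of p
    return [q for q in points if math.dist(p, q) < eps]

def local_density(p, points, eps):
    # number of neighbours divided by the area of the disc of radius eps
    return len(epsilon_neighbourhood(p, points, eps)) / (math.pi * eps ** 2)

random.seed(0)
cloud = [(random.uniform(2, 4), random.uniform(1, 3)) for _ in range(500)]
p, eps = (3, 2), 0.5
print(len(epsilon_neighbourhood(p, cloud, eps)))
print(local_density(p, cloud, eps))
```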
After the break:
Demo on Mall customer segmentation (original notebook and data from Kaggle):
Key concepts:
Alternative k-means initialization methods, e.g. k-means++ (see Arthur and Vassilvitskii 2007):
The basic idea of k-means++ is to select the first centroid uniformly at random, then to choose each remaining centroid with probability proportional to its squared distance from the nearest centroid already chosen: the goal is to push the centroids as far as possible from one another.
Here is the detailed K-means++ initialization algorithm:
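A sketch of that initialization step (the D² sampling of Arthur and Vassilvitskii 2007; the data points and k below are made up):

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    # 1. pick the first centroid uniformly at random
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # 2. for each point, squared distance to its nearest centroid
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # 3. sample the next centroid with probability proportional to d2,
        #    so far-away points are much more likely to be chosen
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids

rng = random.Random(42)
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(kmeans_pp_init(points, 3, rng))
```

The standard k-means loop (assign points to the nearest centroid, recompute centroids) then starts from these well-spread seeds instead of a purely random pick.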
Aim: detect abnormal patterns in the data
Based on the notions of similarity/density we discussed previously, can you guess how this goal is accomplished?
Clustering is a useful tool here as well: anomalies are points that lie far from all the others, or points that lie in areas of low density
Quick demo using the same Mall customer dataset from earlier. To download the new demo notebook, click on this button:
To find out more about the DBSCAN algorithm mentioned in the demo, you can head here.
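To give a flavour of the density-based idea before you read up on DBSCAN: the sketch below (my own, not the demo notebook) applies a pared-down version of DBSCAN's noise rule, flagging any point with fewer than min_pts neighbours within distance eps as an anomaly. The full algorithm additionally grows clusters outward from dense "core" points; the data and thresholds here are made up:

```python
import math

def flag_anomalies(points, eps, min_pts):
    # A point with fewer than min_pts neighbours within distance eps sits
    # in a low-density region, so we flag it as an anomaly (this mirrors
    # DBSCAN's "noise" label for points belonging to no dense cluster).
    anomalies = []
    for p in points:
        neighbours = sum(1 for q in points
                         if q is not p and math.dist(p, q) <= eps)
        if neighbours < min_pts:
            anomalies.append(p)
    return anomalies

# a tight cluster around (0, 0) plus one far-away point
cloud = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (8, 8)]
print(flag_anomalies(cloud, eps=0.5, min_pts=2))  # [(8, 8)]
```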
You can also have a look at how anomaly detection is used in the context of credit card fraud here.
LSE DS101 2024/25 Autumn Term