DS101 – Fundamentals of Data Science
24 Nov 2025
Examples of common distance/similarity functions:
Real-valued data:
Euclidean distance: \(d(A,B)=\sqrt{\sum_{i=1}^n |A_i-B_i|^2}\)
Cosine similarity: \(S_C(A,B)=\frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}\) (the corresponding cosine distance is \(1-S_C(A,B)\))
where A and B are two n-dimensional vectors of attributes, and \(A_i\) and \(B_i\) are the \(i\)th components of A and B, respectively
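As a quick illustration, here is a minimal Python sketch of both measures (the example vectors are mine, not from the notes):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity S_C(A, B); cosine distance is 1 - S_C(A, B)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai ** 2 for ai in a))
    norm_b = math.sqrt(sum(bi ** 2 for bi in b))
    return dot / (norm_a * norm_b)

A, B = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean_distance(A, B))  # sqrt(14) ≈ 3.742
print(cosine_similarity(A, B))   # 1.0: B points in the same direction as A
```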
String similarity:
Levenshtein distance: given two strings, it is the minimal number of edits (substitutions, deletions, insertions) needed to turn one string into the other, e.g.:
The Levenshtein distance between kitten and sitting is 3, since the following 3 edits change one into the other, and there is no way to do it with fewer than 3 edits: kitten → sitten (substitute s for k), sitten → sittin (substitute i for e), sittin → sitting (insert g at the end).
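A minimal sketch of the standard dynamic-programming solution (the implementation details are mine, not from the lecture):

```python
def levenshtein(s, t):
    """Edit distance via dynamic programming:
    dp[i][j] = minimal edits to turn s[:i] into t[:j]."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```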
Set similarity:
A set is a collection of items with no order or repetition. Sets are typically used to represent relationships or associations between objects or even people.
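The notes do not name a specific set-similarity measure; one standard choice is the Jaccard index, the size of the intersection divided by the size of the union. A minimal sketch with made-up example sets:

```python
def jaccard_similarity(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# Hypothetical example: shared interests between two people
alice = {"hiking", "jazz", "cooking"}
bob = {"jazz", "cooking", "chess"}
print(jaccard_similarity(alice, bob))  # 2 shared / 4 total = 0.5
```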
\(\epsilon\)-neighbourhoods: the \(\epsilon\)-neighbourhood of a point \(p\) is the set of all points that lie within distance \(\epsilon\) of \(p\).
Example:
(Images source: https://domino.ai/blog/topology-and-density-based-clustering)



Density: the local density at a point \(p\) can be estimated as the number of points inside its \(\epsilon\)-neighbourhood divided by the area of that neighbourhood.
Example:

If we take the figure above with \(\epsilon = 0.5\), the \(\epsilon\)-neighbourhood of \(p=(3,2)\) is a disc of area \(\pi(0.5)^2 = 0.25\pi\) containing 31 points, so the local density estimate at \(p\) is \(\frac{31}{0.25\pi} \approx 39.5\)
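Since the figure is not reproduced here, the data below is invented; the sketch simply mirrors the calculation above, assuming \(\epsilon = 0.5\) (which gives the \(0.25\pi\) area):

```python
import math

def local_density(p, points, eps=0.5):
    """Count the points in the eps-neighbourhood of p and divide
    by the neighbourhood's area (a disc of radius eps)."""
    count = sum(1 for q in points if math.dist(p, q) <= eps)
    return count / (math.pi * eps ** 2)

# Hypothetical 2-D data; in the lecture figure, 31 points fall
# within eps = 0.5 of p = (3, 2), giving 31 / (0.25*pi) ≈ 39.5.
points = [(3.1, 2.0), (2.9, 1.8), (3.4, 2.3), (5.0, 5.0)]
print(local_density((3, 2), points))  # 3 points in range → ≈ 3.82
```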




Demo on development indices (original data and R analysis in this post).
The dataset, the description of the indicators, and the clustering/anomaly detection demo notebook are all available to download on the course website.
Key concepts:
Alternative k-means initialization methods, e.g. k-means++ (see Arthur and Vassilvitskii 2007):
The basic idea of k-means++ is to select the first centroid uniformly at random, then choose each remaining centroid with probability proportional to its squared distance from the nearest centroid already chosen: the goal is to push the centroids as far apart from one another as possible.
Here is the detailed k-means++ initialization algorithm:
1. Choose the first centroid uniformly at random from the data points.
2. For each data point \(x\), compute \(D(x)\), the distance between \(x\) and the nearest centroid chosen so far.
3. Choose the next centroid at random from the data points, picking \(x\) with probability proportional to \(D(x)^2\).
4. Repeat steps 2–3 until \(k\) centroids have been chosen, then proceed with standard k-means.
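A minimal numpy sketch of this seeding step (function and variable names are mine, not from the lecture):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding (Arthur & Vassilvitskii 2007): first centroid
    uniform at random; each subsequent centroid sampled with probability
    proportional to D(x)^2, the squared distance to the nearest
    centroid chosen so far."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Toy data: two well-separated blobs; the two seeds should land far apart
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
print(kmeans_pp_init(X, k=2))
```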
A quick note on anomaly detection
Unstructured data is any type of data that lacks a pre-defined format and organization.
Structured data is collected following a known method and conforms to a known schema (e.g., a table format); unstructured data is everything else.
Unstructured data includes, for example, free-form text (e-mails, social media posts, documents), images, audio, and video.
Unstructured data is reusable:
Because unstructured data can be reused for purposes other than the one it was originally collected for, it can raise privacy concerns for the people it describes.
To demonstrate text mining concepts, we’ll be using the Political Apologies Database. We’ll start by looking at a single apology before moving on to the apologies from a single country. Head to the course website to download the material needed to work with this first demo.
Tokenization: splitting raw text into individual units (tokens), typically words or sub-words.
Lemmatization: reducing each word to its dictionary base form, or lemma (e.g., running → run, mice → mouse).
Stopwords: very common words (e.g., the, a, of) that carry little meaning on their own and are usually removed before analysis.
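As a sketch of these three steps, here is one possible implementation using NLTK (the library choice is an assumption; the course demo may use a different toolkit):

```python
import nltk
nltk.download("punkt")      # tokenizer models (newer NLTK may also need "punkt_tab")
nltk.download("wordnet")    # dictionary used by the lemmatizer
nltk.download("stopwords")  # stopword lists

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

text = "We apologise for the mistakes that were made."

tokens = word_tokenize(text.lower())  # tokenization
stop = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop]  # stopword removal
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in content]    # lemmatization (as verbs)

print(content)  # e.g. ['apologise', 'mistakes', 'made']
print(lemmas)   # e.g. ['apologise', 'mistake', 'make']
```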
We continue using the Political Apologies Database. But this time, we’ll be looking at the whole collection of apologies.
Head to the course website to download the material needed to follow this second demo.
What is a corpus? A corpus is a collection of texts (documents) gathered for analysis; here, the corpus is the whole set of apologies.
Example:
Consider a document containing 100 words where the word cooker appears 5 times. The term frequency (i.e., TF) for cooker is then (5 / 100) = 0.05.
Now, assume we have 10 million documents and the word cooker appears in one thousand of these. Then, the inverse document frequency (i.e., IDF) is calculated as log(10,000,000 / 1,000) = log(10,000) = 4, using a base-10 logarithm.
Therefore, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
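A tiny sketch that reproduces this arithmetic (function and parameter names are mine):

```python
import math

def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """TF-IDF with the conventions used above:
    TF = term_count / doc_length, IDF = log10(n_docs / docs_with_term)."""
    tf = term_count / doc_length
    idf = math.log10(n_docs / docs_with_term)
    return tf * idf

# The worked example: 'cooker' appears 5 times in a 100-word document
# and in 1,000 of 10 million documents.
print(tf_idf(5, 100, 10_000_000, 1_000))  # 0.05 * 4 = 0.2
```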
To learn more about TF-IDF, see the further reading linked on the course website.
Once vectorized, the text can be used as input to ML algorithms (supervised or unsupervised)!
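For instance, here is a minimal scikit-learn sketch (an assumed library choice; note that its TF-IDF defaults differ slightly from the formula above, using a smoothed IDF, natural log, and L2-normalized rows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the apologies collection
corpus = [
    "the government apologises for past wrongs",
    "the minister issued a formal apology",
    "no apology was offered",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix of TF-IDF weights

print(vectorizer.get_feature_names_out())
print(X.shape)  # (3 documents, vocabulary size)
# X can now be fed to clustering (unsupervised) or a classifier (supervised).
```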

LSE DS101 2025/26 Autumn Term