DS101 – Fundamentals of Data Science
25 Nov 2024
Examples of common distance/similarity functions:
Real-valued data:
Euclidean distance: \(d(A,B)=\sqrt{\sum_{i=1}^n (A_i-B_i)^2}\)
Cosine similarity: \(S_C(A,B)=\frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}\) (the corresponding cosine distance is \(1-S_C(A,B)\))
where \(A\) and \(B\) are two n-dimensional vectors of attributes, and \(A_i\) and \(B_i\) are the \(i\)th components of \(A\) and \(B\) respectively.
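A minimal NumPy sketch of both functions (the example vectors are made up):

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # d(A, B) = sqrt(sum_i (A_i - B_i)^2)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # S_C(A, B) = (A . B) / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(euclidean_distance(a, b))     # ~3.742: far apart in magnitude
print(cosine_similarity(a, b))      # 1.0: same direction
print(1 - cosine_similarity(a, b))  # cosine *distance* = 1 - similarity
```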
String similarity:
Levenshtein distance: given two strings, it is the minimal number of edits (substitutions, deletions, insertions) needed to turn one string into the other, e.g.:
The Levenshtein distance between kitten and sitting is 3, since the following 3 edits change one into the other, and there is no way to do it with fewer than 3 edits:
1. kitten → sitten (substitution of "s" for "k")
2. sitten → sittin (substitution of "i" for "e")
3. sittin → sitting (insertion of "g" at the end)
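As a sketch, this distance can be computed with the classic dynamic-programming recurrence over string prefixes:

```python
def levenshtein(s: str, t: str) -> int:
    # prev[j] holds the distance between the current prefix of s
    # and the first j characters of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```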
Set similarity:
A set is a collection of items with no order or repetition. Sets are typically used to represent relationships or associations between objects or even people.
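One widely used set similarity measure is the Jaccard index, \(J(A,B)=\frac{|A \cap B|}{|A \cup B|}\). A minimal sketch (the example sets are made up):

```python
def jaccard_similarity(a: set, b: set) -> float:
    # J(A, B) = |A ∩ B| / |A ∪ B|, between 0 (disjoint) and 1 (identical)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# Illustrative example: films watched by two users
alice = {"Alien", "Blade Runner", "Dune"}
bob = {"Dune", "Blade Runner", "Arrival"}
print(jaccard_similarity(alice, bob))  # 2 / 4 = 0.5
```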
\(\epsilon\)-neighbourhoods: the \(\epsilon\)-neighbourhood of a point \(p\) is the set of all points whose distance to \(p\) is at most \(\epsilon\).
Example: see the illustrations in the source post (images source: https://domino.ai/blog/topology-and-density-based-clustering)
Density: the local density at a point \(p\) can be estimated as the number of points inside its \(\epsilon\)-neighbourhood divided by the neighbourhood's area \(\pi\epsilon^2\).
Example:
If we take the figure above with \(\epsilon=0.5\), there are 31 points in the \(\epsilon\)-neighbourhood of \(p=(3,2)\), whose area is \(\pi\epsilon^2=0.25\pi\); the local density estimate at \(p\) is therefore \(\frac{31}{0.25\pi} \approx 39.5\).
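A sketch of both ideas on a made-up point cloud (\(\epsilon=0.5\) as in the example above; since the points are random, the count, and hence the density, will differ from the figure's 31):

```python
import numpy as np

rng = np.random.default_rng(42)
points = rng.uniform(0, 5, size=(500, 2))  # made-up 2D point cloud

def epsilon_neighbourhood(points: np.ndarray, p: np.ndarray, eps: float) -> np.ndarray:
    # All points within distance eps of p (the eps-neighbourhood of p)
    distances = np.linalg.norm(points - p, axis=1)
    return points[distances <= eps]

def local_density(points: np.ndarray, p: np.ndarray, eps: float) -> float:
    # Number of points in the eps-neighbourhood divided by its area (pi * eps^2)
    count = len(epsilon_neighbourhood(points, p, eps))
    return count / (np.pi * eps**2)

p = np.array([3.0, 2.0])
print(local_density(points, p, eps=0.5))
```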
Demo on development indices (original data and R analysis in this post).
The dataset, the indicators description, and the clustering/anomaly detection demo notebook can all be downloaded from the course page.
Key concepts:
Alternative k-means initialization methods, e.g. k-means++ (see Arthur and Vassilvitskii 2007):
The basic idea of k-means++ is to select the first centroid at random and then to choose each remaining centroid with probability proportional to its squared distance from the nearest centroid already chosen: the goal is to push the centroids as far as possible from one another.
Here is the detailed k-means++ initialization algorithm:
1. Choose the first centroid uniformly at random from the data points.
2. For every point \(x\), compute \(D(x)\), the distance between \(x\) and the nearest centroid chosen so far.
3. Choose the next centroid at random from the data points, with probability proportional to \(D(x)^2\).
4. Repeat steps 2–3 until \(k\) centroids have been chosen, then run standard k-means from these centroids.
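A NumPy sketch of these steps (the data, \(k\), and seed are illustrative):

```python
import numpy as np

def kmeans_pp_init(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    # Step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # made-up data
print(kmeans_pp_init(X, k=3, rng=rng))
```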
A quick note on anomaly detection
Unstructured data is any type of data that lacks pre-defined formats and organization.
Structured data is collected following a known method and conforms to a known schema (i.e., a table format); unstructured data is everything else.
Unstructured data includes, for example, free text, images, audio, and video.
Unstructured data is reusable: because it can be used to satisfy multiple intentions, it can raise privacy concerns for the users it describes.
Project Gutenberg is a library of over 70,000 full texts of books or individual stories in the public domain.
We are going to explore a few key text mining concepts by downloading and processing a book of your choice from that library. Head to the course website to download the notebook (Project Gutenberg case study materials) to follow along with this part.
Tokenization: splitting raw text into smaller units called tokens (typically words or punctuation marks).
Lemmatization: reducing each word to its dictionary base form, or lemma (e.g. "cats" → "cat", "better" → "good").
Stopwords: very common, low-information words (e.g. "the", "of", "and") that are usually removed before analysis.
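A minimal sketch of these three steps using NLTK (assuming the package is installed and its punkt, wordnet, and stopwords resources have been downloaded; the sample sentence is made up):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-off downloads:
# nltk.download("punkt"); nltk.download("wordnet"); nltk.download("stopwords")

text = "The cats were sitting on the mats"

# Tokenization: split the raw string into word tokens
tokens = nltk.word_tokenize(text.lower())

# Stopword removal: drop very frequent, low-information words
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]

# Lemmatization: map each token to its dictionary base form
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(lemmas)  # e.g. ['cat', 'sitting', 'mat']
```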
We use the newspaper3k Python package to extract the most recent UK-related news articles from several newspapers:
Head to the course website to download the material needed to follow this second demo (news case study materials).
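Independently of the demo notebook, a minimal sketch of the kind of extraction newspaper3k allows (the source URL is illustrative):

```python
import newspaper

# Build a source object from a news site's front page
# (memoize_articles=False also re-scans articles seen on previous runs)
paper = newspaper.build("https://www.bbc.co.uk/news", memoize_articles=False)

for article in paper.articles[:5]:
    article.download()  # fetch the HTML
    article.parse()     # extract title, text, authors, publish date, ...
    print(article.title)
    print(article.text[:200])
```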
What is a corpus?
A corpus is a collection of texts (documents) gathered for analysis; here, the extracted news articles form our corpus.
Example:
Consider a document containing 100 words where the word cooker appears 5 times. The term frequency (i.e., TF) for cooker is then (5 / 100) = 0.05.
Now, assume we have 10 million documents and the word cooker appears in one thousand of these. Then, the inverse document frequency (i.e., IDF) is calculated as log(10,000,000 / 1,000) = 4 (using base-10 logarithms).
Therefore, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
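The same arithmetic in Python (the counts are those of the worked example above; note the base-10 logarithm):

```python
import math

tf = 5 / 100                           # cooker appears 5 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)   # cooker appears in 1,000 of 10 million documents
print(tf)        # 0.05
print(idf)       # 4.0
print(tf * idf)  # 0.2
```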
To learn more about TF-IDF, see this link or this one
Once vectorized, the text can be used as input to ML algorithms (supervised or unsupervised)!
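For instance, a sketch using scikit-learn's TfidfVectorizer (the toy corpus is made up; note that scikit-learn uses a smoothed, natural-log variant of the IDF formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cooker is broken",
    "buy a new cooker",
    "the news is on",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix of TF-IDF weights

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # one row per document, one column per term
```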
LSE DS101 2024/25 Autumn Term