DS101 – Fundamentals of Data Science
13 Nov 2023
(🗓️ Week 07)
(🗓️ Week 08)
Unsupervised learning algorithms:
Examples of common distance/similarity functions:
Real-valued data:
Euclidean distance: \(d(A,B)=\sqrt{\sum_{i=1}^n |A_i-B_i|^2}\)
Cosine similarity: \(S_C(A,B)=\frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}\) (the corresponding cosine distance is \(1-S_C(A,B)\))
where A and B are two n-dimensional vectors of attributes, and \(A_i\) and \(B_i\) are the \(i\)th components of vectors A and B respectively.
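As a minimal sketch, the two formulas above can be written directly in plain Python. The function names and the example vectors are illustrative choices, not part of the lecture material.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared component differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up example vectors (n = 3)
A = [1.0, 2.0, 3.0]
B = [4.0, 6.0, 3.0]

print(euclidean_distance(A, B))      # differences are (3, 4, 0), so this prints 5.0
print(cosine_similarity(A, B))       # similarity in [-1, 1]
print(1 - cosine_similarity(A, B))   # the corresponding cosine distance
```

Note that cosine similarity depends only on the direction of the vectors: scaling A or B by a positive constant leaves it unchanged, whereas the Euclidean distance changes.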
String similarity:
Set similarity:
A set is a collection of items with no order or repetition. Sets are typically used to represent relationships or associations between objects or even people.
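The notes do not name a specific set similarity function at this point. One widely used choice, assumed here purely for illustration, is the Jaccard index: the size of the intersection divided by the size of the union. The example sets are made up.

```python
def jaccard_similarity(s, t):
    """Jaccard index: |S ∩ T| / |S ∪ T|, a value between 0 and 1."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(s & t) / len(union)


# Hypothetical example: skills listed by two people
alice = {"python", "statistics", "sql"}
bob = {"python", "sql", "design", "writing"}

print(jaccard_similarity(alice, bob))  # 2 shared items out of 5 distinct items: 0.4
```

Because sets ignore order and repetition, `jaccard_similarity(["a", "b", "a"], ["b", "a"])` is 1.0.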
\(\epsilon\)-neighbourhoods:
Example:
(Images source: https://domino.ai/blog/topology-and-density-based-clustering)
Density:
Example:
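In the figure above, the local density estimate at \(p=(3,2)\) is \(\frac{31}{0.25\pi} \approx 39.5\): 31 points fall within the neighbourhood of radius \(\epsilon=0.5\), whose area is \(\pi\epsilon^2=0.25\pi\).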
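The density calculation can be sketched in a few lines of Python: count the points inside the \(\epsilon\)-neighbourhood of \(p\) and divide by the neighbourhood's area. This sketch assumes 2-D points, Euclidean distance, and a closed neighbourhood (distance at most \(\epsilon\)) that includes \(p\) itself if it is in the data. The point list is made up and is not the data behind the figure.

```python
import math


def epsilon_neighbourhood(points, p, eps):
    """Points whose Euclidean distance to p is at most eps."""
    return [q for q in points if math.dist(p, q) <= eps]


def local_density(points, p, eps):
    """Number of points in the eps-neighbourhood of p divided by its area (2-D)."""
    count = len(epsilon_neighbourhood(points, p, eps))
    area = math.pi * eps ** 2
    return count / area


# Made-up 2-D points for illustration
points = [(3.0, 2.0), (3.1, 2.2), (2.8, 1.9), (3.6, 2.0), (5.0, 5.0)]
p = (3.0, 2.0)

print(epsilon_neighbourhood(points, p, eps=0.5))  # the three points closest to p
print(local_density(points, p, eps=0.5))          # 3 / (0.25 * pi)

# The arithmetic from the example: 31 points, radius 0.5
print(round(31 / (math.pi * 0.5 ** 2), 1))        # 39.5
```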
After the break:
Demo from this Kaggle notebook
Key concepts:
- precision=\(\frac{\textrm{relevant retrieved instances}}{\textrm{all retrieved instances}}=\frac{TP}{TP+FP}\)
- recall=\(\frac{\textrm{relevant retrieved instances}}{\textrm{all relevant instances}}=\frac{TP}{TP+FN}\)
- \(F_1=2\cdot\frac{\textrm{precision}\cdot\textrm{recall}}{\textrm{precision}+\textrm{recall}}\) (harmonic mean of precision and recall)
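A small sketch of these three metrics, computed from the counts of true positives (TP), false positives (FP) and false negatives (FN). The function name, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from TP, FP and FN counts."""
    # Returning 0.0 for an empty denominator is a convention chosen for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 10 instances retrieved, 8 of them relevant,
# and 8 relevant instances missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision)      # 8 / 10 = 0.8
print(recall)         # 8 / 16 = 0.5
print(round(f1, 3))   # 0.615
```

The harmonic mean is pulled towards the smaller of the two values, so \(F_1\) is only high when precision and recall are both high.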
LSE DS101 2023/24 Autumn Term | archive