🗓️ Week 08
Machine Learning II

DS101 – Fundamentals of Data Science

13 Nov 2023

⏪ Recap

  • We saw that we could build (machine learning) models to make predictions either:
    • about the future
    • about new observations
  • We explored one of the machine learning paradigms that could help us do that: supervised learning
  • In supervised learning:
    • a label is attached to each data point
    • the goal for the algorithm(s) is to find the function (however complex) that maps the data to the labels
  • We discussed several examples of algorithms that fall in the supervised learning category

Types of Machine Learning

Supervised
Learning

  • Each data point is associated with a label.
  • The goal is to learn a function that maps the data to the labels.

(🗓️ Week 07)

Unsupervised
Learning

  • There is no such association between the data and the labels.
  • The focus is on similarities between data points.

(🗓️ Week 08)

Unsupervised Learning

What is Unsupervised Learning

Unsupervised learning algorithms:

  • rely on unlabeled data
  • learn hidden patterns in the data without human intervention

Types of Unsupervised Learning

Clustering

  • Consists of dividing data points into clusters, i.e. groups of points that are similar to one another and dissimilar to data points belonging to other clusters
  • Examples:
    • segmenting patients with a particular condition (e.g. cancer) into groups depending on their response to particular treatment drugs (aim: treatment personalisation)
    • customer segmentation according to common traits and/or purchasing behaviour (aim: guiding marketing and business strategies)

Association rule mining

  • Series of algorithms that try to find rules that capture the probability of the co-occurrence of items in a collection
  • E.g. (a minimal worked sketch follows this list):
    • Given a collection of supermarket purchase records/transactions, can we tell how likely a customer is to buy milk if they have bought bread?
    • Can we learn how likely a series of medical diagnoses are, given the symptoms with which they are associated?
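The milk-and-bread question can be made concrete with the notions of support and confidence. A minimal sketch in Python, using a handful of invented transactions (the items and numbers below are made up purely for illustration):

```python
# Toy association-rule calculation: P(milk | bread) estimated as
# confidence(bread -> milk) = support(bread and milk) / support(bread).
# The transactions below are invented for illustration only.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

conf_bread_to_milk = support({"bread", "milk"}, transactions) / support({"bread"}, transactions)
print(f"support(bread & milk) = {support({'bread', 'milk'}, transactions):.2f}")  # 0.60
print(f"confidence(bread -> milk) = {conf_bread_to_milk:.2f}")                    # 0.75
```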

Dimensionality reduction

  • Techniques that reduce the number of features, or dimensions, in a dataset (prior to applying a classification or regression algorithm)
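One widely used dimensionality-reduction technique is principal component analysis (PCA). A minimal scikit-learn sketch, assuming synthetic data and an arbitrary choice of two components:

```python
# Minimal PCA sketch: project 10-dimensional synthetic data onto 2 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 features (synthetic)

pca = PCA(n_components=2)               # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variance each component captures
```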

Anomaly detection

  • Techniques that discover unusual data points in a dataset, i.e. points whose patterns deviate from the dataset’s normal patterns
  • E.g.:
    • Fraud detection (credit card fraud)
    • Cyberattack detection
    • Also useful in predicting whether a patient has or doesn’t have a disease

Unsupervised learning: where is it used in practice?

  • Market Basket Analysis: The goal here is to understand customer purchasing patterns; it involves analyzing data, e.g. purchase history, to reveal product groupings and products likely to be purchased together.
  • Recommendation engines: Using data such as past purchasing or browsing behaviour, past product/service ratings or the behaviour of similar users, unsupervised learning algorithms are used to suggest products, services or information that a user might like or find interesting
  • Medical imaging: Unsupervised learning algorithms are used for tasks such as image detection, classification and segmentation, which are essential in radiology and pathology for diagnosing patients quickly and accurately
  • Computer vision: Visual perception tasks such as object recognition rely on unsupervised learning algorithms
  • Natural language processing (NLP): Google News uses unsupervised learning to group articles covering the same story from various news outlets; for example, the results of the football transfer window would all be categorized under football.
  • Anomaly detection: The idea is to find atypical patterns in the data, i.e. anomalies. Anomalies can be anything from a disease/medical diagnosis or a pattern of fraud (e.g. credit card fraud, insurance fraud, money laundering) to security breaches (e.g. cyberattacks) or faulty equipment
  • Genetic research: Clustering algorithms, in particular hierarchical clustering, are often used to analyze DNA patterns and reveal evolutionary relationships.

Similarity

  • Notion central to clustering and anomaly detection
  • Quantifies how data samples are related or close to each other
  • Often expressed as a number between 0 and 1
  • The higher it is, the closer the points are together, i.e. “the more similar” they are
  • Based on a distance function

Similarity

Examples of common distance/similarity functions:

Real-valued data:

  • Euclidean distance: \(d(A,B)=\sqrt{\sum_{i=1}^n |A_i-B_i|^2}\)

  • Cosine similarity: \(S_C(A,B)=\frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}\) (the corresponding cosine distance is \(1-S_C(A,B)\))

where A and B are two n-dimensional vectors of attributes, and \(A_i\) and \(B_i\) are the \(i\)th components of vectors A and B respectively
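Both formulas translate directly into NumPy. A minimal sketch, using two made-up 3-dimensional vectors:

```python
# Distance/similarity between two n-dimensional real-valued vectors A and B.
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 2.0, 4.0])

euclidean = np.sqrt(np.sum(np.abs(A - B) ** 2))                       # d(A, B)
cosine_similarity = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))   # S_C(A, B)

print(euclidean)           # sqrt(1 + 0 + 1) ≈ 1.414
print(cosine_similarity)   # ≈ 0.98
```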

String similarity:

  • Hamming distance: given two strings of equal length, it is the number of positions at which the corresponding symbols (i.e. letters, bits, or decimal digits) are different, e.g.:
    • 0000 and 1111 is 4
    • cat and car is 1
    • call and cold is 2
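A minimal sketch of the Hamming distance in Python, reproducing the three examples above:

```python
# Hamming distance: number of positions where two equal-length strings differ.
def hamming(s1: str, s2: str) -> int:
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("0000", "1111"))  # 4
print(hamming("cat", "car"))    # 1
print(hamming("call", "cold"))  # 2
```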

Similarity

Examples of common distance/similarity functions:

Set similarity:

A set is a collection of items with no order or repetition. Sets are typically used to represent relationships or associations between objects or even people.

  • Jaccard index: it is often used in recommendation systems and social media analysis and measures the similarity between two sets based on the number of items present in both sets relative to the total number of items
    \(J(A,B)=\frac{|A\cap B|}{|A\cup B|}\)
    where A and B are two sets, \(|A\cap B|\) the number of items present in both sets and \(|A\cup B|\) the total number of distinct items across both sets
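A minimal sketch of the Jaccard index in Python, using two invented sets of purchased items:

```python
# Jaccard index between two sets: |A ∩ B| / |A ∪ B|.
def jaccard(A: set, B: set) -> float:
    return len(A & B) / len(A | B)

# Made-up example: items two users have purchased.
user_a = {"milk", "bread", "eggs", "butter"}
user_b = {"milk", "bread", "jam"}

print(jaccard(user_a, user_b))  # 2 shared items / 5 distinct items = 0.4
```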

Density

\(\epsilon\)-neighbourhoods:

  • given \(\epsilon>0\) and a point \(p\), the \(\epsilon\)-neighbourhood of \(p\) is the set of points that are at distance at most \(\epsilon\) away from \(p\)

Example:
(Images source: https://domino.ai/blog/topology-and-density-based-clustering)

100 points scattered on the [1,3]x[2,4] grid. We pick (3,2) as our point \(p\)

We draw the neighbourhood of \(p\) with radius \(\epsilon=0.5\): 31 points are in the neighbourhood

We draw the neighbourhood of \(p\) with radius \(\epsilon=0.15\): 3 points are in the neighbourhood
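A minimal sketch of how an \(\epsilon\)-neighbourhood can be computed. The 100 random points below only approximate the figure from the blog post, so the counts will not match it exactly:

```python
# Count the points falling inside the epsilon-neighbourhood of p.
import numpy as np

rng = np.random.default_rng(42)
# 100 synthetic points scattered on the [1, 3] x [2, 4] square
points = rng.uniform(low=[1, 2], high=[3, 4], size=(100, 2))

p = np.array([3.0, 2.0])

def epsilon_neighbourhood(points, p, eps):
    """Return the points at Euclidean distance at most eps from p."""
    distances = np.linalg.norm(points - p, axis=1)
    return points[distances <= eps]

print(len(epsilon_neighbourhood(points, p, eps=0.5)))
print(len(epsilon_neighbourhood(points, p, eps=0.15)))
```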

Density (continued)

Density:

  • Roughly defined, \(\text{density}=\frac{\text{“mass”}}{\text{“volume”}}\)
  • If we go back to our previous example, the “mass” would be the number of points in an \(\epsilon\)-neighbourhood centered on \(p\) and the “volume” the area of the disc of radius \(\epsilon\) centered on \(p\)

Example:

If we take the figure above, the local density estimation at \(p=(3,2)\) is \(\frac{31}{0.25\pi} \approx 39.5\) (the area of a disc of radius 0.5 is \(\pi \times 0.5^2 = 0.25\pi\))

  • This number is meaningless on its own
  • If we calculate the densities of all points in our dataset, we could say that points within the same neighbourhood with similar local densities belong to the same cluster \(\rightarrow\) General principle of density-based clustering methods, e.g. DBSCAN (see (Ester et al. 1996) for more details on the DBSCAN algorithm)
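As an illustration of this principle, a minimal DBSCAN sketch with scikit-learn on synthetic two-moons data (the eps and min_samples values are arbitrary choices for this example):

```python
# Density-based clustering with DBSCAN (Ester et al. 1996) via scikit-learn.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic two-moons data, a shape where density-based methods shine
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius, min_samples the "mass" needed for a dense region
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))  # cluster ids; -1 marks points treated as noise
```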

An example of a clustering algorithm: k-means

An example of a clustering algorithm: k-means (continued)

An example of a clustering algorithm: k-means (continued)
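As a complement to the visual walkthrough above, a minimal k-means sketch with scikit-learn on synthetic data (the blob parameters and the choice of \(k=3\) are arbitrary for this example):

```python
# k-means clustering on synthetic data with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 points drawn around 3 centres (synthetic data for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # coordinates of the 3 learned centroids
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
print(kmeans.inertia_)           # within-cluster sum of squares (WCSS)
```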

An overview of existing clustering algorithms

  • For an overview of the clustering algorithms, their pros and cons and how they compare to each other, please head to (scikit-learn 2023a)
  • For a comparison of clustering algorithms on toy datasets (with Python implementation), see (scikit-learn 2023b)

Time for a break 🍵

After the break:

  • Clustering in practice
  • Anomaly detection

Clustering in action: customer segmentation

Demo from this Kaggle notebook

Key concepts:

  • Elbow method:
    • it’s a method used to determine the optimal number of clusters \(k\)
    • the idea is to test various values of \(k\), compute the within-cluster sum of squares (WCSS) for each of them, i.e. the sum of squared distances between each point in a cluster and its centroid, and plot \(k\) against WCSS
    • When \(k=1\), the WCSS is highest; it then decreases as \(k\) increases. The optimal value of \(k\) is reached where the curve starts to look like a straight line (i.e. the elbow); see the sketch below
  • Alternative k-means initialization methods, e.g. k-means++ (see (Arthur and Vassilvitskii 2007))
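A minimal sketch of the elbow method with scikit-learn, assuming synthetic blob data; the “elbow” would appear around the true number of centres:

```python
# Elbow method: plot WCSS (KMeans inertia_) against k and look for the "elbow".
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # synthetic data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.show()  # the optimal k is where the curve starts flattening out
```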

Anomaly detection

  • Aim: detect abnormal patterns in the data
  • Based on the notions of similarity/density we discussed previously, can you guess how this goal is accomplished?
  • Clustering is a useful tool here as well: anomalies are points that are far from any other point, or points that lie in areas of low density (see the sketch below)
  • Quick demo from this Kaggle notebook
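A minimal density-based anomaly-detection sketch (the Kaggle demo may use a different method; Local Outlier Factor is just one common density-based approach, and the data below are synthetic):

```python
# Density-based anomaly detection: flag points lying in low-density regions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # dense "normal" cloud
outliers = rng.uniform(low=-6, high=6, size=(5, 2))      # a few scattered anomalies
X = np.vstack([normal, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)     # +1 = inlier, -1 = anomaly

print(np.where(labels == -1)[0])  # indices of the points flagged as anomalies
```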

A quick aside: metrics notions

  • true positive (TP): A test result that correctly indicates the presence of a condition or characteristic
  • true negative (TN): A test result that correctly indicates the absence of a condition or characteristic
  • false positive (FP): A test result which wrongly indicates that a particular condition or attribute is present
  • false negative (FN): A test result which wrongly indicates that a particular condition or attribute is absent

  • precision = \(\frac{\textrm{relevant retrieved instances}}{\textrm{all retrieved instances}}=\frac{TP}{TP+FP}\)

  • recall = \(\frac{\textrm{relevant retrieved instances}}{\textrm{all relevant instances}}=\frac{TP}{TP+FN}\)

  • \(F_1=2\,\frac{\textrm{precision}\cdot\textrm{recall}}{\textrm{precision}+\textrm{recall}}\) (harmonic mean of precision and recall)
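A minimal sketch computing these three metrics from made-up confusion counts:

```python
# Precision, recall and F1 from counts of TP, FP and FN.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up confusion counts for illustration
print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, ≈0.667, ≈0.727)
```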

References

Arthur, David, and Sergei Vassilvitskii. 2007. “K-Means++: The Advantages of Careful Seeding.” In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–35.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226–31. KDD’96. Portland, Oregon: AAAI Press.
scikit-learn. 2023a. “Scikit Learn User Guide: Clustering.” Scikit-Learn. https://scikit-learn.org/stable/modules/clustering.html.
———. 2023b. “Scikit Learn User Guide: Comparing different clustering algorithms on toy datasets.” Scikit-Learn. https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html.