🗓️ Week 08
Machine Learning II

DS101 – Fundamentals of Data Science

13 Nov 2023

⏪ Recap

  • We saw that we could build (machine learning) models to make predictions either:
    • about the future
    • about new observations
  • We explored one of the machine learning paradigms that could help us do that: supervised learning
  • In supervised learning:
    • a label is attached to each data point
    • the goal for the algorithm(s) is to find the function (however complex) that maps the data to the labels
  • We discussed several examples of algorithms that fall in the supervised learning category

Types of Machine Learning

Supervised
Learning

  • Each data point is associated with a label.
  • The goal is to learn a function that maps the data to the labels.

(🗓️ Week 07)

Unsupervised
Learning

  • There is no such association between the data and the labels.
  • The focus is on similarities between data points.

(🗓️ Week 08)

Unsupervised Learning

What is Unsupervised Learning

Unsupervised learning algorithms:

  • rely on unlabeled data
  • learn hidden patterns in the data without human intervention

Types of Unsupervised Learning

Clustering

  • Consists of dividing data points into clusters, i.e. groups of points that are similar to one another and dissimilar to data points belonging to other clusters
  • Examples:
    • segmenting patients with a particular condition (e.g. cancer) into groups depending on their response to particular treatment drugs (aim: treatment personalisation)
    • customer segmentation according to common traits and/or purchasing behaviour (aim: guiding marketing and business strategies)

Association rule mining

  • Series of algorithms that try to find rules that capture the probability of the co-occurrence of items in a collection
  • E.g. (a minimal worked sketch follows this list):
    • Given a collection of supermarket purchase records/transactions, can we tell how likely a customer is to buy milk if they have bought bread?
    • Can we learn how likely a series of medical diagnoses are, given the symptoms with which they are associated?
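The milk-and-bread question can be made concrete with the notions of support and confidence. A minimal sketch in Python, using a handful of invented transactions (the items and numbers below are made up purely for illustration):

```python
# Toy association-rule calculation: P(milk | bread) estimated as
# confidence(bread -> milk) = support(bread and milk) / support(bread).
# The transactions below are invented for illustration only.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

conf_bread_to_milk = support({"bread", "milk"}, transactions) / support({"bread"}, transactions)
print(f"support(bread & milk) = {support({'bread', 'milk'}, transactions):.2f}")  # 0.60
print(f"confidence(bread -> milk) = {conf_bread_to_milk:.2f}")                    # 0.75
```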

Dimensionality reduction

  • Techniques that reduce the number of features, or dimensions, in a dataset (prior to applying a classification or regression algorithm)
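One widely used dimensionality-reduction technique is principal component analysis (PCA). A minimal scikit-learn sketch, assuming synthetic data and an arbitrary choice of two components:

```python
# Minimal PCA sketch: project 10-dimensional synthetic data onto 2 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 features (synthetic)

pca = PCA(n_components=2)               # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variance each component captures
```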

Anomaly detection

  • Techniques that discover unusual data points in a dataset, i.e. points whose patterns deviate from the dataset’s normal patterns
  • E.g.:
    • Fraud detection (credit card fraud)
    • Cyberattack detection
    • Also useful in predicting whether a patient has or doesn’t have a disease

Unsupervised learning: where is it used in practice?

  • Market Basket Analysis: The goal here is to understand customer purchasing patterns; it involves analyzing data, e.g. purchase history, to reveal product groupings and products likely to be purchased together.
  • Recommendation engines: Using data such as past purchasing or browsing behaviour, past product/service ratings or the behaviour of similar users, unsupervised learning algorithms are used to suggest products, services or information that a user might like or find interesting
  • Medical imaging: Unsupervised learning algorithms are used for tasks such as image detection, classification and segmentation, which are essential in radiology and pathology for diagnosing patients quickly and accurately
  • Computer vision: Visual perception tasks such as object recognition rely on unsupervised learning algorithms
  • Natural language processing (NLP): Google News uses unsupervised learning to group articles covering the same story from various news outlets; for example, the results of the football transfer window would all be categorized under football.
  • Anomaly detection: The idea is to find atypical patterns in the data, i.e. anomalies. Anomalies can be anything from a disease/medical diagnosis or a pattern of fraud (e.g. credit card fraud, insurance fraud, money laundering) to security breaches (e.g. cyberattacks) or faulty equipment
  • Genetic research: Clustering algorithms, in particular hierarchical clustering, are often used to analyze DNA patterns and reveal evolutionary relationships.

Similarity

  • Notion central to clustering and anomaly detection
  • Quantifies how data samples are related or close to each other
  • Often expressed as a number between 0 and 1
  • The higher it is, the closer the points are together, i.e. “the more similar” they are
  • Based on a distance function

Similarity

Examples of common distance/similarity functions:

Real-valued data:

  • Euclidean distance: \(d(A,B)=\sqrt{\sum_{i=1}^n |A_i-B_i|^2}\)

  • Cosine similarity: \(S_C(A,B)=\frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}\) (the corresponding cosine distance is \(1-S_C(A,B)\))

where A and B are two n-dimensional vectors of attributes, and \(A_i\) and \(B_i\) are the \(i\)th components of vectors A and B respectively
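Both formulas translate directly into NumPy. A minimal sketch, using two made-up 3-dimensional vectors:

```python
# Distance/similarity between two n-dimensional real-valued vectors A and B.
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 2.0, 4.0])

euclidean = np.sqrt(np.sum(np.abs(A - B) ** 2))                       # d(A, B)
cosine_similarity = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))   # S_C(A, B)

print(euclidean)           # sqrt(1 + 0 + 1) ≈ 1.414
print(cosine_similarity)   # ≈ 0.98
```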

String similarity:

  • Hamming distance: given two strings of equal length, it is the number of positions at which the corresponding symbols (i.e. letters, bits, or decimal digits) are different, e.g.:
    • 0000 and 1111 is 4
    • cat and car is 1
    • call and cold is 2
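A minimal sketch of the Hamming distance in Python, reproducing the three examples above:

```python
# Hamming distance: number of positions where two equal-length strings differ.
def hamming(s1: str, s2: str) -> int:
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("0000", "1111"))  # 4
print(hamming("cat", "car"))    # 1
print(hamming("call", "cold"))  # 2
```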

Similarity

Examples of common distance/similarity functions:

Set similarity:

A set is a collection of items with no order or repetition. Sets are typically used to represent relationships or associations between objects or even people.

  • Jaccard index: it is often used in recommendation systems and social media analysis and measures the similarity between two sets based on the number of items present in both sets relative to the total number of items
    \(J(A,B)=\frac{|A\cap B|}{|A\cup B|}\)
    where A and B are two sets, \(|A\cap B|\) the number of items present in both sets and \(|A\cup B|\) the total number of distinct items across both sets
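A minimal sketch of the Jaccard index in Python, using two invented sets of purchased items:

```python
# Jaccard index between two sets: |A ∩ B| / |A ∪ B|.
def jaccard(A: set, B: set) -> float:
    return len(A & B) / len(A | B)

# Made-up example: items two users have purchased.
user_a = {"milk", "bread", "eggs", "butter"}
user_b = {"milk", "bread", "jam"}

print(jaccard(user_a, user_b))  # 2 shared items / 5 distinct items = 0.4
```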

Density

\(\epsilon\)-neighbourhoods:

  • given \(\epsilon>0\) and a point \(p\), the \(\epsilon\)-neighbourhood of \(p\) is the set of points that are at distance at most \(\epsilon\) away from \(p\)

Example:
(Images source: https://domino.ai/blog/topology-and-density-based-clustering)

100 points scattered on the [1,3]x[2,4] grid. We pick (3,2) as our point \(p\)

We draw the neighbourhood of \(p\) with radius \(\epsilon=0.5\): 31 points are in the neighbourhood

We draw the neighbourhood of \(p\) with radius \(\epsilon=0.15\): 3 points are in the neighbourhood
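A minimal sketch of how an \(\epsilon\)-neighbourhood can be computed. The 100 random points below only approximate the figure from the blog post, so the counts will not match it exactly:

```python
# Count the points falling inside the epsilon-neighbourhood of p.
import numpy as np

rng = np.random.default_rng(42)
# 100 synthetic points scattered on the [1, 3] x [2, 4] square
points = rng.uniform(low=[1, 2], high=[3, 4], size=(100, 2))

p = np.array([3.0, 2.0])

def epsilon_neighbourhood(points, p, eps):
    """Return the points at Euclidean distance at most eps from p."""
    distances = np.linalg.norm(points - p, axis=1)
    return points[distances <= eps]

print(len(epsilon_neighbourhood(points, p, eps=0.5)))
print(len(epsilon_neighbourhood(points, p, eps=0.15)))
```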

Density (continued)

Density:

  • Roughly defined, \(\text{density}=\frac{\text{“mass”}}{\text{“volume”}}\)
  • If we go back to our previous example, the “mass” would be the number of points in an \(\epsilon\)-neighbourhood centered on \(p\) and the “volume” the area of the disc of radius \(\epsilon\) centered on \(p\)

Example:

If we take the figure above, the local density estimation at \(p=(3,2)\) is \(\frac{31}{0.25\pi} \approx 39.5\) (the area of a disc of radius 0.5 is \(\pi \times 0.5^2 = 0.25\pi\))

  • This number is meaningless on its own
  • If we calculate the densities of all points in our dataset, we could say that points within the same neighbourhood with similar local densities belong to the same cluster \(\rightarrow\) General principle of density-based clustering methods, e.g. DBSCAN (see (Ester et al. 1996) for more details on the DBSCAN algorithm)
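As an illustration of this principle, a minimal DBSCAN sketch with scikit-learn on synthetic two-moons data (the eps and min_samples values are arbitrary choices for this example):

```python
# Density-based clustering with DBSCAN (Ester et al. 1996) via scikit-learn.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic two-moons data, a shape where density-based methods shine
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius, min_samples the "mass" needed for a dense region
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))  # cluster ids; -1 marks points treated as noise
```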

An example of a clustering algorithm: k-means

An example of a clustering algorithm: k-means (continued)

An example of a clustering algorithm: k-means (continued)
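As a complement to the visual walkthrough above, a minimal k-means sketch with scikit-learn on synthetic data (the blob parameters and the choice of \(k=3\) are arbitrary for this example):

```python
# k-means clustering on synthetic data with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 points drawn around 3 centres (synthetic data for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # coordinates of the 3 learned centroids
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
print(kmeans.inertia_)           # within-cluster sum of squares (WCSS)
```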

An overview of existing clustering algorithms

  • For an overview of the clustering algorithms, their pros and cons and how they compare to each other, please head to (scikit-learn 2023a)
  • For a comparison of clustering algorithms on toy datasets (with Python implementation), see (scikit-learn 2023b)

Time for a break 🍵

After the break:

  • Clustering in practice
  • Anomaly detection

Clustering in action: customer segmentation

Demo from this Kaggle notebook

Key concepts:

  • Elbow method:
    • it’s a method used to determine the optimal number of clusters \(k\)
    • the idea is to test various values of \(k\), compute the within-cluster sum of squares (WCSS) for each of them, i.e. the sum of squared distances between each point in a cluster and its centroid, and plot \(k\) against WCSS
    • When \(k=1\), the WCSS is highest; it then decreases as \(k\) increases. The optimal value of \(k\) is reached where the curve starts to look like a straight line (i.e. the elbow); see the sketch below
  • Alternative k-means initialization methods, e.g. k-means++ (see (Arthur and Vassilvitskii 2007))
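A minimal sketch of the elbow method with scikit-learn, assuming synthetic blob data; the “elbow” would appear around the true number of centres:

```python
# Elbow method: plot WCSS (KMeans inertia_) against k and look for the "elbow".
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # synthetic data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.show()  # the optimal k is where the curve starts flattening out
```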

Anomaly detection

  • Aim: detect abnormal patterns in the data
  • Based on the notions of similarity/density we discussed previously, can you guess how this goal is accomplished?
  • Clustering is a useful tool here as well: anomalies are points that are far from any other point, or points that lie in areas of low density (see the sketch below)
  • Quick demo from this Kaggle notebook
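A minimal density-based anomaly-detection sketch (the Kaggle demo may use a different method; Local Outlier Factor is just one common density-based approach, and the data below are synthetic):

```python
# Density-based anomaly detection: flag points lying in low-density regions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # dense "normal" cloud
outliers = rng.uniform(low=-6, high=6, size=(5, 2))      # a few scattered anomalies
X = np.vstack([normal, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)     # +1 = inlier, -1 = anomaly

print(np.where(labels == -1)[0])  # indices of the points flagged as anomalies
```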

A quick aside: metrics notions

  • true positive (TP): A test result that correctly indicates the presence of a condition or characteristic
  • true negative (TN): A test result that correctly indicates the absence of a condition or characteristic
  • false positive (FP): A test result which wrongly indicates that a particular condition or attribute is present
  • false negative (FN): A test result which wrongly indicates that a particular condition or attribute is absent

  • precision = \(\frac{\textrm{relevant retrieved instances}}{\textrm{all retrieved instances}}=\frac{TP}{TP+FP}\)

  • recall = \(\frac{\textrm{relevant retrieved instances}}{\textrm{all relevant instances}}=\frac{TP}{TP+FN}\)

  • \(F_1=2\,\frac{\textrm{precision}\cdot\textrm{recall}}{\textrm{precision}+\textrm{recall}}\) (harmonic mean of precision and recall)
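A minimal sketch computing these three metrics from made-up confusion counts:

```python
# Precision, recall and F1 from counts of TP, FP and FN.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up confusion counts for illustration
print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, ≈0.667, ≈0.727)
```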

References

Arthur, David, and Sergei Vassilvitskii. 2007. “K-Means++: The Advantages of Careful Seeding.” In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–35.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226–31. KDD’96. Portland, Oregon: AAAI Press.
scikit-learn. 2023a. “Scikit Learn User Guide: Clustering.” Scikit-Learn. https://scikit-learn.org/stable/modules/clustering.html.
———. 2023b. “Scikit Learn User Guide: Comparing different clustering algorithms on toy datasets.” Scikit-Learn. https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html.