🛣️ Week 09 - Lab Roadmap
Introduction to anomaly detection in Python
Welcome to our 8th lab!
⚙️ Setup
Downloading the student notebook
Click on the button below to download the student notebook.
Loading libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans, DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors
from sklearn.metrics import calinski_harabasz_score
from kneed import KneeLocator
from yellowbrick.cluster import KElbowVisualizer
from lets_plot import *
LetsPlot.setup_html()
Downloading the data
Download the dataset we will use for this lab.
Use the link below to download it:
🥅 Learning objectives
- Understand the concept of anomaly detection
- Deepen understanding of dimensionality reduction using principal component analysis
- Learn how to implement anomaly detection using various algorithms
Introducing a new data set (10 minutes)
In this lab, we will use outlier detection to deepen our appreciation of 2000s and 2010s pop music. Using data from Spotify, we have a list of features for 919 popular singles released during these two decades. Features include:
- artist: Name of the Artist.
- song: Name of the Track.
- duration_ms: Duration of the track in milliseconds.
- explicit: The lyrics or content of a song or a music video contain one or more of the criteria which could be considered offensive or unsuitable for children.
- year: Release Year of the track.
- popularity: The higher the value the more popular the song is.
- danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
- key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- genre: Genre of the track.
# Load the data set
# Print the shape attribute
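If you get stuck, here is a minimal sketch; the file name spotify_songs.csv and the data frame name df are assumptions, so adapt them to wherever you saved the download:
# Load the data set (the file name here is an assumption)
df = pd.read_csv("spotify_songs.csv")
# Print the shape attribute (number of rows, number of columns)
print(df.shape)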
🗣️ CLASSROOM DISCUSSION
(Your class teacher will mediate this discussion)
How would you explore this data if dimensionality reduction was not an option?
How would you answer: what are the most common types of songs one can find in this data set?
Principal component analysis (20 minutes)
Let's create a list of different musical attributes, and filter the data frame to only include said attributes:
# Create a list of musical attributes
music_attrs = ["danceability","energy","loudness","speechiness","acousticness","instrumentalness","liveness","valence"]
# Create a new filtered data frame
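One possible way to do this, assuming the data frame from the previous step is called df and naming the filtered version music_df:
# Keep only the musical attributes defined above (df and music_df are assumed names)
music_df = df[music_attrs]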
🎯 ACTION POINTS
- To try to make sense of the sheer number of combinations of attributes, we will run PCA and apply it to our data set:
# Create a pipeline that scales the data and performs a PCA (select the first 5 components)
# Call the fit_transform method on the filtered data frame
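If you want a starting point, here is a sketch of such a pipeline; the names pca_pipe, music_df and components are assumptions carried over from the previous steps:
# Scale the attributes, then keep the first 5 principal components
pca_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5))
])
# Fit the pipeline and transform the filtered data frame in one go
components = pca_pipe.fit_transform(music_df)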
- How much "information" are we keeping after compressing the data with PCA?
# Create a data frame with a range of principal components and the cumulative variance
# Plot the output
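A sketch of one way to answer this, assuming the fitted pipeline above is called pca_pipe:
# Cumulative share of variance explained by the first 1..5 principal components
cum_var = pd.DataFrame({
    "component": np.arange(1, 6),
    "cumulative_variance": np.cumsum(pca_pipe.named_steps["pca"].explained_variance_ratio_)
})
(
ggplot(cum_var, aes("component", "cumulative_variance")) +
geom_line() +
geom_point() +
labs(x = "Number of components", y = "Cumulative explained variance")
)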
- Let's focus on the first two components, as plotting them is common practice.
# Create a data frame of the components
# Add information on artist and track to the data frame
# Plot the output, using tooltips to convey what artist / track is being hovered over
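A possible sketch, reusing components and df from the earlier steps; the name output_df is chosen here to match the data frame used in Part III:
# Put the principal component scores in a data frame
output_df = pd.DataFrame(components, columns=[f"PC{i}" for i in range(1, 6)])
# Add information on artist and track so the tooltips are informative
output_df["artist"] = df["artist"].values
output_df["song"] = df["song"].values
(
ggplot(output_df, aes("PC1", "PC2")) +
geom_point(tooltips = layer_tooltips().line("@song").line("@artist")) +
labs(x = "PC1", y = "PC2")
)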
🧑‍🏫 TEACHING MOMENT
Your class teacher will now guide the conversation and explain the plot below. If needed, they will recap how PCA works.
# Create a list of data frames for each loading
# Concatenate the list of data frames to create a singular data frame
# Create a new column showing the absolute value of the loading
# Plot the output
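In case you want a reference, here is one way to build and plot the loadings, assuming the fitted pipeline is called pca_pipe and the attribute list is music_attrs:
# Extract the PCA loadings: one data frame per component
pca = pca_pipe.named_steps["pca"]
loadings_list = [
    pd.DataFrame({"attribute": music_attrs,
                  "loading": pca.components_[i],
                  "component": f"PC{i+1}"})
    for i in range(pca.n_components_)
]
# Concatenate the list of data frames into a single data frame
loadings = pd.concat(loadings_list, ignore_index=True)
# Create a new column showing the absolute value of the loading
loadings["abs_loading"] = loadings["loading"].abs()
# Plot the output
(
ggplot(loadings, aes("attribute", "abs_loading", fill = "loading")) +
geom_bar(stat = "identity") +
facet_wrap(facets = "component") +
labs(x = "Attribute", y = "|loading|")
)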
🗣️ Discussion:
- How does Figure 3 help you interpret Figure 2?
- How does the above help you think about the attributes of the most common type of songs?
Part III: Anomaly detection techniques (1 hour)
👥 IN PAIRS, go through the action points below and discuss your impressions and key takeaways.
🎯 ACTION POINTS
- Take a look at the clusters identified by DBSCAN! We will employ a method that can help you determine values for the epsilon neighbourhood and minimum samples hyperparameters. We adapted this code from here.
# Set min_samples equal to 2 times the number of dimensions
min_samples = 4

# Instantiate a nearest neighbours model, setting n_neighbors equal to min_samples
nearest_neighbors = NearestNeighbors(n_neighbors=min_samples)

# Fit the model to the first two principal components
neighbors = nearest_neighbors.fit(output_df[["PC1","PC2"]])

# Extract the distances and indices from the nearest neighbours model
distances, indices = neighbors.kneighbors(output_df[["PC1","PC2"]])

# Sort the distances to each point's 4th nearest neighbour (column min_samples - 1)
distances = np.sort(distances[:,min_samples-1], axis=0)

# Identify the knee point
i = np.arange(len(distances))
knee = KneeLocator(i, distances, S=1, curve='convex', direction='increasing', interp_method='polynomial')
eps = distances[knee.knee]
print(f"We should set the epsilon neighbourhood value to ~ {np.round(eps,4)}!")

# Instantiate a DBSCAN model
dbscan = DBSCAN(eps = eps, min_samples = min_samples)

# Fit the model to the first two principal components
_ = dbscan.fit(output_df[["PC1","PC2"]])

# Plot the output
to_plot = output_df
to_plot["dbscan"] = [str(lab) for lab in dbscan.labels_]
to_plot["dbscan_outlier"] = np.where(to_plot["dbscan"] == "-1", "Yes", "No")

(
ggplot(to_plot, aes("PC1", "PC2", color = "dbscan_outlier")) +
geom_point(tooltips = layer_tooltips().line("@song").line("@artist")) +
labs(color = "Outlier")
)
🗣️ Discussion: How well do you think DBSCAN performs at anomaly detection on the two principal components?
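To ground the discussion, you could count how many songs DBSCAN labels as noise, using the dbscan_outlier column created above:
# How many songs does DBSCAN flag as outliers (label -1)?
print(to_plot["dbscan_outlier"].value_counts())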
🎯 ACTION POINTS
- Take a look at the clusters identified by k-means:
We have included a cluster evaluation metric different from the Elbow method: the Calinski-Harabasz (CH) score. The number of clusters that yields the highest CH score is taken to be the optimal one.
# Instantiate a k-means model
model = KMeans(random_state=42)

# Instantiate a visualizer from the yellowbrick library
visualizer = KElbowVisualizer(model, k=(2,10), metric = "calinski_harabasz", timings=False)

# Fit the visualizer to the first two principal components
visualizer.fit(output_df[["PC1","PC2"]])

# Finalize and render the figure
visualizer.show()
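If you are curious about what the visualizer computes under the hood, here is a sketch that calculates the score directly with the calinski_harabasz_score function we imported earlier:
# Compute the CH score by hand for k = 2, ..., 9 clusters
for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(output_df[["PC1","PC2"]])
    score = calinski_harabasz_score(output_df[["PC1","PC2"]], labels)
    print(f"k = {k}: CH score = {score:.1f}")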
# Instantiate a model
kmeans = KMeans(n_clusters=4)

# Fit the model to the data
_ = kmeans.fit(output_df[["PC1","PC2"]])

# Add the cluster labels to the plotting data frame
to_plot["kmeans"] = [str(i) for i in kmeans.labels_]

(
ggplot(to_plot, aes("PC1", "PC2", color = "kmeans")) +
geom_point(tooltips = layer_tooltips().line("@song").line("@artist").line("Cluster @kmeans")) +
theme_minimal() +
theme(panel_grid_minor = element_blank(),
      legend_position = "bottom") +
labs(x = "PC1", y = "PC2",
     color = "Cluster #")
)
🗣️ Discussion: How well do you think k-means performs at anomaly detection on the two principal components?
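k-means does not flag outliers by itself. One common workaround (not part of the lab's own code, so treat this as a sketch) is to measure how far each song sits from the centroid of its assigned cluster:
# Distance from each song to the centroid of its assigned cluster
centroids = kmeans.cluster_centers_[kmeans.labels_]
to_plot["dist_to_centroid"] = np.linalg.norm(output_df[["PC1","PC2"]].to_numpy() - centroids, axis=1)
# Inspect the songs furthest from any centroid
print(to_plot.sort_values("dist_to_centroid", ascending=False)[["artist","song","kmeans","dist_to_centroid"]].head(10))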
🎯 ACTION POINTS
- Take a look at the anomaly scores produced by the isolation forest:
# Instantiate a model
pipe = Pipeline([("scaler", StandardScaler()), ("isoforest", IsolationForest(random_state=123))])

# Fit model to training data
pipe.fit(output_df[["PC1","PC2"]])

# Calculate the anomaly scores for the same data frame
to_plot["isoforest"] = pipe.score_samples(output_df[["PC1","PC2"]])

# Thresholds to try out
iso_ths = [-0.7, -0.65, -0.6, -0.55]

# Flag songs whose anomaly score falls at or below each threshold
for th in iso_ths:
    to_plot[f"iso_th_{th}"] = to_plot["isoforest"] <= th

# Keep only the track, artist, first two principal components and the isoforest threshold variables
feats_to_keep = to_plot.columns[to_plot.columns.str.contains("song|artist|PC[1-2]|iso_th")]

# Create a melted data frame to plot
to_plot_melted = (
    to_plot
    .filter(items = feats_to_keep)
    .melt(id_vars = feats_to_keep[feats_to_keep.str.contains("song|artist|PC")])
    .rename(columns = {"variable":"th", "value": "isoforest_outlier"})
    .assign(th = lambda x: x["th"].str.replace("iso_th_","Threshold = "))
)

# Plot the outliers
(
ggplot(to_plot_melted, aes("PC1","PC2",color="isoforest_outlier")) +
geom_point(tooltips = layer_tooltips().line("@song").line("@artist")) +
facet_wrap(facets = "th") +
labs(color = "Outlier?")
)
🗣️ Discussion: What is the relationship between the anomaly score and the number of outliers in the data?
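A quick tally of flagged songs per threshold can help answer this, using the columns created above:
# Number of songs flagged as outliers at each threshold
print(to_plot_melted.groupby("th")["isoforest_outlier"].sum())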
🎯 ACTION POINTS
- Let's see if the Local Outlier Factor (LOF) performs better than DBSCAN/Isolation Forest.
We use the LocalOutlierFactor() function to calculate local outlier factors:
# Instantiate a Local Outlier Factor model, setting n_neighbors to 10
lof = LocalOutlierFactor(n_neighbors = 10)

# Fit the model to the first two principal components
lof.fit(output_df[["PC1","PC2"]])

# Append the negative outlier factor score to the data frame, using absolute values
to_plot["lof"] = np.abs(lof.negative_outlier_factor_)
# Plot the output, using size to distinguish between LOF scores
ggplot(to_plot, aes(x="PC1", y="PC2", size="lof", color="lof")) + \
    geom_point(tooltips=layer_tooltips().line("@song").line("@artist").line("LOF: @lof"),
               alpha=0.5) + \
    theme(legend_position="right") + \
    scale_color_viridis() + \
    labs(color="LOF score",
         caption="\nNote: larger, lighter dots indicate higher LOF scores!") + \
    guides(size="none") # Hides the size legend
🗣️ Discussion: Does LOF perform better than DBSCAN or isolation forests at detecting "anomalous" samples?
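One way to compare the three methods is to list the songs with the highest LOF scores and check whether DBSCAN and the isolation forest flag the same tracks (column names as created above):
# The ten most "anomalous" songs according to LOF
print(to_plot.sort_values("lof", ascending=False)[["artist","song","lof","dbscan_outlier"]].head(10))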