Main theme: Unsupervised learning
2023-11-17
Bias: the error that comes from overly simple assumptions in the model; high-bias models tend to underfit.
Variance: the error that comes from sensitivity to small fluctuations in the training data; high-variance models tend to overfit.
Image source: https://towardsdatascience.com/what-bias-variance-bulls-eye-diagram-really-represent-ff6fb9670993
Image source: https://hshan0103.medium.com/understanding-bias-variance-trade-off-from-learning-curve-a64b4223bb02
We mentioned previously that single decision trees had a tendency to overfit.
Common tree parameters: maximum tree depth, minimum number of observations required to split a node, minimum number of observations per leaf, and the cost-complexity (pruning) parameter.
These parameters define the end condition for building a new tree and are usually tuned to increase accuracy and prevent overfitting.
We can tweak parameters to avoid overfitting or we can do something else…
So, what is better than a single tree?
Answer: Ensembles of trees!
Step 1: In a Random Forest model, a subset of data points and a subset of features are selected to construct each decision tree. In other words, \(n\) data points and \(m\) features are randomly selected out of the dataset to construct each of the \(B\) trees that will compose the forest.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: The final output of the forest is obtained through majority voting (for classification) or averaging of tree predictions (for regression).
Check (Breiman 2001) for more details
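A minimal tidymodels sketch of these four steps (the dataset mtcars, 500 trees, and mtry = 3 are illustrative choices, not values from the example that follows):
library(tidymodels)
# a small illustrative train/test split
set.seed(42)
car_split <- initial_split(mtcars)
car_train <- training(car_split)
car_test <- testing(car_split)
# Steps 1-3: each of the 500 trees is grown on a bootstrap sample,
# considering a random subset of mtry = 3 features at each split
rf_spec <- rand_forest(trees = 500, mtry = 3) %>%
  set_engine("randomForest") %>%
  set_mode("regression")
rf_fit <- fit(rf_spec, mpg ~ ., data = car_train)
# Step 4: for regression, predictions are averaged across trees
augment(rf_fit, new_data = car_test) %>%
  rmse(truth = mpg, estimate = .pred)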
Boosting is a method of combining many weak learners (trees) into a strong classifier.
Usually, the weak learners are shallow trees, added sequentially so that each new tree focuses on the errors made by the ensemble built so far.
There are many different ways of iteratively adding learners to minimize a loss function.
Some of the most common are AdaBoost, gradient boosting, and XGBoost; a minimal hand-rolled sketch of the gradient-boosting idea follows below.
Benefits: boosted trees often achieve excellent predictive accuracy on tabular data and can capture complex non-linear relationships.
Problems: training is sequential (and therefore harder to parallelize), there are more hyperparameters to tune, and the model can overfit noisy data if too many trees are added.
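To make the idea of iteratively adding learners concrete, here is a hand-rolled sketch of gradient boosting for regression with squared-error loss (the synthetic data, learning rate, and tree depth are illustrative; in practice you would use a library such as xgboost, as in the lab below):
library(rpart)
# synthetic regression data
set.seed(1)
x <- runif(200, -3, 3)
y <- sin(x) + rnorm(200, sd = 0.3)
dat <- data.frame(x = x, y = y)
eta <- 0.1                 # learning rate: shrink each tree's contribution
n_trees <- 100
pred <- rep(mean(y), 200)  # start from a constant prediction
trees <- vector("list", n_trees)
for (b in seq_len(n_trees)) {
  dat$resid <- y - pred    # residuals = negative gradient of the squared-error loss
  # weak learner: a shallow tree fitted to the current residuals
  trees[[b]] <- rpart(resid ~ x, data = dat, maxdepth = 2, cp = 0)
  # add its (shrunken) predictions to the ensemble
  pred <- pred + eta * predict(trees[[b]], dat)
}
mean((y - pred)^2)  # training MSE after boosting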
The Boston dataset
The Boston dataset is part of the ISLR2 library and contains information collected by the US Census Service about housing in the Boston area (Massachusetts). It is composed of 13 variables and 506 observations (for details see here).
We train tree ensembles (a bagged forest and a boosted model) to predict the variable medv (i.e. the median value of owner-occupied homes in $1000s).
library(tidymodels)
library(ISLR2)
library(rpart.plot)
library(vip)
# loading the data (the ISLR2 version of Boston, matching the description above)
data("Boston", package = "ISLR2")
Boston <- as_tibble(Boston)
set.seed(1234)
#splitting data in training and test set
Boston_split <- initial_split(Boston)
Boston_train <- training(Boston_split)
Boston_test <- testing(Boston_split)
# bagging specification: mtry = .cols() considers all predictors at each split
bagging_spec <- rand_forest(mtry = .cols()) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("regression")
# fitting model and getting metrics
bagging_fit <- fit(bagging_spec, medv ~ ., data = Boston_train)
augment(bagging_fit, new_data = Boston_test) %>%
rmse(truth = medv, estimate = .pred)
# plotting ground truth against model predictions
augment(bagging_fit, new_data = Boston_test) %>%
  ggplot(aes(medv, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)
#plotting feature importance
vip(bagging_fit)
# boosted trees specification
boost_spec <- boost_tree(trees = 5000, tree_depth = 4) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
# fitting model and getting metrics
boost_fit <- fit(boost_spec, medv ~ ., data = Boston_train)
augment(boost_fit, new_data = Boston_test) %>%
rmse(truth = medv, estimate = .pred)
#plotting ground truth against model predictions
augment(boost_fit, new_data = Boston_test) %>%
  ggplot(aes(medv, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)
#plotting feature importance
vip(boost_fit)
Unsupervised learning algorithms:
Techniques that reduce the number of features, or dimensions, in a dataset (prior to applying a classification or regression algorithm)
Example:
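A common example of such a technique is principal component analysis (PCA); a minimal sketch on a built-in dataset (the choice of iris and of keeping two components is purely illustrative):
# PCA on the four numeric measurements in iris
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # the data re-expressed in the first two principal components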
Examples of common distance/similarity functions:
Real-valued data:
Euclidean distance: \(d(A,B)=\sqrt{\sum_{i=1}^n |A_i-B_i|^2}\)
Cosine similarity: \(S_C(A,B)=\frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}\) (the cosine distance is \(1-S_C(A,B)\))
where A and B are two n-dimensional vectors of attributes, and \(A_i\) and \(B_i\) are the \(i\)th components of vectors A and B respectively.
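Both measures take a couple of lines of R; the vectors below are illustrative:
A <- c(1, 2, 3)
B <- c(2, 4, 1)
# Euclidean distance
sqrt(sum((A - B)^2))
# cosine similarity (the corresponding distance is 1 minus this value)
sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))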
String similarity:
Hamming distance: given two strings of equal length, it is the number of positions at which the corresponding symbols (i.e. letters, bits, or decimal digits) are different, e.g.:
Levenshtein distance: given two strings, it is the minimal number of edits (substitutions, deletions, insertions) needed to turn one string into the other, e.g.:
The Levenshtein distance between kitten and sitting is 3, since the following 3 edits change one into the other, and there is no way to do it with fewer than 3 edits: kitten → sitten (substitute "s" for "k"), sitten → sittin (substitute "i" for "e"), sittin → sitting (insert "g" at the end).
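Both string measures can be computed in base R (adist() returns the Levenshtein distance; the Hamming helper below simply counts mismatching positions of two equal-length strings):
# Hamming distance between two equal-length strings
hamming <- function(s1, s2) {
  sum(strsplit(s1, "")[[1]] != strsplit(s2, "")[[1]])
}
hamming("karolin", "kathrin")  # 3
# Levenshtein distance via base R's generalized edit distance
adist("kitten", "sitting")     # 3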
Examples of common distance/similarity functions:
Set similarity:
A set is a collection of items with no order or repetition. Sets are typically used to represent relationships or associations between objects or even people.
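One common measure of set similarity (an assumption here, since no specific measure is named above) is the Jaccard index, the size of the intersection divided by the size of the union of the two sets:
# Jaccard similarity between two sets (illustrative names)
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
friends_anna <- c("Bob", "Carla", "Dan")
friends_erik <- c("Carla", "Dan", "Fiona", "Gus")
jaccard(friends_anna, friends_erik)  # 2 shared / 5 distinct = 0.4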
\(\epsilon\)-neighbourhoods: given a point \(p\) and a radius \(\epsilon > 0\), the \(\epsilon\)-neighbourhood of \(p\) is the set of all points whose distance from \(p\) is at most \(\epsilon\).
Example:
(Images source: https://domino.ai/blog/topology-and-density-based-clustering)
Density: the local density at a point \(p\) can be estimated as the number of points inside its \(\epsilon\)-neighbourhood divided by the area of that neighbourhood (in two dimensions, \(\pi\epsilon^2\)).
Example:
If we take the figure above, with \(\epsilon = 0.5\) (so the neighbourhood has area \(0.25\pi\)) and 31 points inside the neighbourhood, the local density estimate at \(p=(3,2)\) is \(\frac{31}{0.25\pi} \approx 39.5\).
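A sketch of both ideas on synthetic data (the 500 uniform points, the point p = (3, 2) and the radius eps = 0.5 are illustrative, not the data behind the figure above):
set.seed(123)
pts <- data.frame(x = runif(500, 0, 5), y = runif(500, 0, 5))
p <- c(3, 2)   # point of interest
eps <- 0.5     # neighbourhood radius
# epsilon-neighbourhood: all points within distance eps of p
d <- sqrt((pts$x - p[1])^2 + (pts$y - p[2])^2)
neighbourhood <- pts[d <= eps, ]
# local density estimate: number of neighbours divided by the area of the disc
nrow(neighbourhood) / (pi * eps^2)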
K-means clustering uses iterative refinement to produce a final result.
Algorithm inputs: number of clusters k and data set.
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if \(c_i\) is the collection of centroids in set \(C\), then each data point \(x\) is assigned to a cluster based on
\(\underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2\)
where dist( · ) is the standard (L2) Euclidean distance. Let the set of data point assignments for each ith cluster centroid be \(S_i\).
In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid’s cluster.
\(c_i=\frac{1}{|S_i|}\sum_{x_i \in S_i} x_i\)
The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not necessarily the best possible outcome), meaning that assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
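The two steps translate directly into a few lines of R; the sketch below (synthetic 2D data, k = 3, a fixed number of iterations instead of a convergence check, no handling of empty clusters) illustrates the mechanics rather than replacing kmeans():
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)  # 200 points in 2D
k <- 3
centroids <- X[sample(nrow(X), k), ]   # random initial centroids
for (iter in 1:10) {
  # Step 1: assign each point to its nearest centroid (squared Euclidean distance)
  d2 <- sapply(1:k, function(i) rowSums(sweep(X, 2, centroids[i, ])^2))
  cluster <- max.col(-d2)  # column index of the smallest distance in each row
  # Step 2: recompute each centroid as the mean of the points assigned to it
  centroids <- t(sapply(1:k, function(i) colMeans(X[cluster == i, , drop = FALSE])))
}
table(cluster)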
The elbow method, based on the total within-cluster sum of squares (WSS), is a heuristic used to determine the optimal number of clusters for k-means clustering.
Idea behind the elbow method: as \(k\) increases, the total within-cluster sum of squares always decreases, but beyond the 'elbow' of the curve the improvement from adding further clusters becomes marginal.
The steps to perform the elbow method are: run k-means for a range of values of \(k\); compute the total within-cluster sum of squares for each \(k\); plot it against \(k\); and choose the \(k\) at the bend (elbow) of the curve.
Source: (Dangeti 2017)
The silhouette score of a data point \(i\) is \(s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}\),
where \(a(i)\) is the measure of similarity of the point \(i\) to its own cluster, measured as the average distance of \(i\) from the other points in its cluster, i.e.:
for each data point \(i \in C_I\) (data point \(i\) in cluster \(C_I\)), \(a(i)=\frac{1}{|C_I|-1}\sum_{j\in C_I, j\neq i} d(i,j)\)
and \(b(i)\) is the measure of dissimilarity of \(i\) from the points of other clusters, i.e.:
for each data point \(i \in C_I\), \(b(i)=\min_{J\neq I}\frac{1}{|C_J|}\sum_{j\in C_J}d(i,j)\)
\(d(i,j)\) is the distance between points \(i\) and \(j\), with the Euclidean distance generally used as the distance metric.
Source: https://uc-r.github.io/kmeans_clustering
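The cluster package computes silhouette scores directly from a cluster assignment and a distance matrix; a small sketch on synthetic data (two well-separated groups, k = 2):
library(cluster)
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
km <- kmeans(X, centers = 2)
sil <- silhouette(km$cluster, dist(X))
summary(sil)              # average silhouette width per cluster and overall
mean(sil[, "sil_width"])  # overall average silhouette width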
Calinski-Harabasz Method:
\(CH=\frac{BCSS/(k-1)}{WCSS/(N-k)}=\frac{\sum_{i=1}^k n_i||c_i-c||^2/(k-1)}{\sum_{i=1}^k\sum_{x\in C_i}||x-c_i||^2/(N-k)}\)
where \(n_i\) and \(c_i\) are the number of points and the centroid of the \(i\)-th cluster respectively, \(c\) is the global centroid, \(N\) is the total number of points, and \(\{x_1,\dots,x_N\}\) are the data points that compose the dataset.
See this link or this link for more approaches to determine \(k\)
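The quantities in the CH formula map directly onto the output of kmeans(): betweenss is BCSS and tot.withinss is WCSS. A sketch for a single value of k on synthetic data:
set.seed(1)
X <- matrix(rnorm(300 * 2), ncol = 2)
k <- 3
N <- nrow(X)
km <- kmeans(X, centers = k)
# Calinski-Harabasz index: between-cluster over within-cluster dispersion
(km$betweenss / (k - 1)) / (km$tot.withinss / (N - k))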
Exploring a dataset on employment and demographics from #TidyTuesday (dataset here and original blogpost describing this application here).
We’re exploring employment by race and gender
We then apply the elbow method to figure out the optimal \(k\)
Not exactly conclusive but 4-5 seem best. Let’s try it with 4!
library(tidyverse)
library(tidymodels) # for broom's augment()/glance() helpers used with kmeans
library(plotly) #for plotting
# downloading, loading and tidying the data
employed <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-23/employed.csv")
employed_tidy <- employed %>%
  filter(!is.na(employ_n)) %>%
  group_by(occupation = paste(industry, minor_occupation), race_gender) %>%
  summarise(n = mean(employ_n)) %>%
  ungroup()
# printing out the data
head(employed_tidy)
summary(employed_tidy)
# scaling the data variables
employment_demo <- employed_tidy %>%
  filter(race_gender %in% c("Women", "Black or African American", "Asian")) %>%
  pivot_wider(names_from = race_gender, values_from = n, values_fill = 0) %>%
  janitor::clean_names() %>%
  left_join(employed_tidy %>%
    filter(race_gender == "TOTAL") %>%
    select(-race_gender) %>%
    rename(total = n)) %>%
  filter(total > 1e3) %>%
  mutate(
    across(c(asian, black_or_african_american, women), ~ . / (total)),
    total = log(total),
    across(where(is.numeric), ~ as.numeric(scale(.)))
  ) %>%
  mutate(occupation = snakecase::to_snake_case(occupation))
employment_demo
# a first attempt: k-means with 3 clusters
employment_clust <- kmeans(select(employment_demo, -occupation), centers = 3)
summary(employment_clust)
augment(employment_clust, employment_demo) %>%
  ggplot(aes(total, black_or_african_american, color = .cluster)) +
  geom_point()
#elbow method
kclusts <-
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~ kmeans(select(employment_demo, -occupation), .x)),
    glanced = map(kclust, glance)
  )

kclusts %>%
  unnest(cols = c(glanced)) %>%
  ggplot(aes(k, tot.withinss)) +
  geom_line(alpha = 0.5, size = 1.2, color = "midnightblue") +
  geom_point(size = 2, color = "midnightblue")
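Following the text above ("Let's try it with 4!"), a sketch of re-fitting with k = 4 and re-drawing the earlier scatter plot:
# re-fitting with the k suggested by the elbow plot
final_clust <- kmeans(select(employment_demo, -occupation), centers = 4)
augment(final_clust, employment_demo) %>%
  ggplot(aes(total, black_or_african_american, color = .cluster)) +
  geom_point()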
Pros: simple to implement, computationally fast, and scales well to large datasets.
Cons: the number of clusters \(k\) must be chosen in advance, results depend on the random initialization of the centroids, and the method is sensitive to outliers and struggles with non-spherical or unevenly sized clusters.
More unsupervised learning
LSE DS202 2023/24 Autumn Term | archive