(Diagram: Lots of Dimensions → PCA → Few Dimensions)
💻 LSE DS202A 2023: Week 09 - Lab
2022/23 Autumn Term
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Use principal component analysis (PCA) to perform dimensionality reduction
- Understand the difference between supervised and unsupervised learning
- Apply the k-means algorithm to cluster observations

When you reach Part III of this lab, read the specific instructions for GENIAL participants.
📚 Preparation
We will be using the same World Values Survey dataset as in last week's lab.
Use the link below to download the lab materials:
We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.
⚙️ Setup
Load the libraries for today's lab.
# Tidyverse packages we will use
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
# Tidymodels packages we will use
library(rsample)
library(yardstick)
library(parsnip)
library(recipes)
library(workflows)
library(rpart)
library(tune)
library(broom)
library(purrr)
# New libraries for today
library(dbscan)
Read the data set.
It is the dataset you downloaded in W07.
Note that negative values between -5 and -1, as well as -9999, actually represent NAs in the data, so we'll load them as NAs.
We'll mostly work with numerical columns today, so we will also:
- remove all categorical columns except for continent (we'll use it later)
- remove all columns related to the questionnaire, so that we only keep variables that we can more easily interpret
filepath <- "data/WVS_Wave7_modified.csv"
wvs_data <- read_csv(filepath, na = c("", "NA", -1, -2, -3, -4, -5, -9999))

wvs_data <-
  wvs_data %>%
  select(-D_INTERVIEW, -COUNTRY_NAME, -REGION_NAMES, -TOWN_NAMES, -INTERVIEW_LANGUAGE,
         -starts_with("Q"), -contains("party"))

wvs_data %>%
  head()
That leaves us with a data set with one categorical and 196 numeric columns, which includes variables such as trust in institutions, life expectancy, education levels, GDP, and others.
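If you want to verify that for yourself, a one-line check (just a quick sketch) does the trick:
# count how many columns are numeric vs not
# (should show 196 numeric columns and 1 non-numeric one if the selection above worked as described)
table(sapply(wvs_data, is.numeric))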
Part I: Dimensionality Reduction (25 min)
Our data set still has 196 (!!) numeric variables in total. That would be a lot of predictors for a model; so far, we have only ever used one or a few predictors.
🧑‍🏫 TEACHING MOMENT:
(Your class teacher will explain the motivation for dimensionality reduction, focusing on the why rather than the how; the how will be covered in this week's lecture.)
The Curse of Dimensionality 🪄
No, that's not a Harry Potter spell; it refers to the problems that often arise when dealing with high-dimensional data (ie lots of variables).
Machine learning models tend to perform badly with too many predictors: there is a higher chance of overfitting, plus a much bigger computational cost associated with training models on many variables.
Dimensionality Reduction to the Rescue 🧙
We want to reduce the number of variables. Intuitively, this could work because there are probably strong correlations between many of the individual variables (eg TRUST ARMY vs TRUST POLICE, or health expenditure vs life expectancy), so there seems to be some redundancy in the data.
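If you want to see that redundancy for yourself, a quick correlation check is enough. Note that TRUST_ARMY and TRUST_POLICE below are placeholder column names used only for illustration; check names(wvs_data) for the actual spelling in the data set.
# placeholder column names -- replace with the actual trust columns from names(wvs_data)
wvs_data %>%
  summarise(trust_cor = cor(TRUST_ARMY, TRUST_POLICE, use = "pairwise.complete.obs"))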
Dimensionality reduction refers to a collection of methods that can help reduce the number of variables while preserving most of the information.
Principal Component Analysis 💡
However, rather than deciding which variables to keep and which ones to throw out (that would fall under "variable selection"), we want to compress our high-dimensional data into a low-dimensional set of variables while retaining as much information from the original data as possible.
In this lab, we will take this idea to the extreme and reduce the dimensionality of the whole data set to just 2. That has the additional benefit that we can also plot the results on a grid in two dimensions.
Prep the Data
- Let's create a new recipe.
pca_recipe <-
  # we don't have an outcome variable here
  recipe(~ ., data = wvs_data) %>%
  # remove variables with more than 5% missing values (some have >95% NAs)
  # this is mainly just to avoid dropping too many rows in the next step
  step_filter_missing(all_numeric_predictors(), threshold = 0.05) %>%
  # remove rows with missing values
  step_naomit(all_numeric_predictors(), skip = FALSE) %>%
  # remove zero variance predictors as they don't carry any information
  step_zv(all_numeric_predictors()) %>%
  # normalize the data to avoid out-sized impact of some variables
  step_normalize(all_numeric_predictors()) %>%
  prep()

pca_results <- bake(pca_recipe, wvs_data)
head(pca_results)
This is our pre-processed data. We dropped quite a few columns there (especially the ones with mostly NAs) but we still have over 100 numeric variables.
Note that all columns have been standardized now too, so they have mean 0 and variance 1. This is a recommended (or even necessary) step prior to performing PCA, in order to prevent large-range variables (which naturally have a higher variance) from having an out-sized impact on the results.
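If you want to convince yourself of that, here is a quick sketch that peeks at the means and standard deviations of the first few baked columns:
# after step_normalize, every numeric column should have mean ~0 and sd ~1
pca_results %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), list(mean = ~ round(mean(.x), 2), sd = ~ round(sd(.x), 2)))) %>%
  select(1:6)   # just peek at the first few columns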
Perform PCA
- Now, we can perform dimensionality reduction using PCA on this data.
This can be done simply by appending a step_pca() to our recipe. We'll set num_comp = 2, meaning that we want to reduce the dimensionality to just 2. We'll also keep the original variables rather than replacing them, as we'll still use some of them later in part III.
pca_recipe <-
  # we don't have an outcome variable here
  recipe(~ ., data = wvs_data) %>%
  # remove variables with more than 5% missing values (some have >95% NAs)
  # this is mainly just to avoid dropping too many rows in the next step
  step_filter_missing(all_numeric_predictors(), threshold = 0.05) %>%
  # remove rows with missing values
  step_naomit(all_numeric_predictors(), skip = FALSE) %>%
  # remove zero variance predictors as they don't carry any information
  step_zv(all_numeric_predictors()) %>%
  # normalize the data to avoid out-sized impact of some variables
  step_normalize(all_numeric_predictors()) %>%
  # apply principal component analysis
  step_pca(all_numeric_predictors(), num_comp = 2, keep_original_cols = TRUE) %>%
  prep()

# apply the recipe to our data set
pca_results <- bake(pca_recipe, wvs_data)

pca_results %>%
  select(CONTINENT, PC1, PC2) %>%
  head()
Plot the Results
- Our data now contains the two newly created principal components (added as the last columns). We can plot them.
ggplot(pca_results, aes(PC1, PC2)) +
geom_point(alpha=0.3) +
ggtitle("Principal Components", subtitle = "2-dimensional representation of our 100+ numeric predictors")
There is clearly still some structure/information in our data, and we will exploit this in part II.
So how much information did we retain/lose?
- We can answer this by looking at how much of the variance of the original data is contained in the first n = 1, 2, … principal components.
The first principal component is the one that captures the most variance, then the second, and so on.
# this is a little awkward but we just want to
# extract the fitted PCA that is hidden inside the recipe
pca_fit <- pca_recipe$steps[[5]]$res

# the eigenvalues determine the amount of variance explained
eigenvalues <-
  pca_fit %>%
  tidy(matrix = "eigenvalues")

eigenvalues %>%
  filter(PC <= 30) %>%
  ggplot(aes(PC, percent)) +
  geom_col()
The first principal component captures more than 30 percent of the variance in all the numeric predictors we had initially. The second principal component still captures 10 percent.
eigenvalues %>%
  filter(PC <= 30) %>%
  ggplot(aes(PC, cumulative)) +
  geom_col()
Cumulatively, the first 2 components retain 40 percent of the variance. If we wanted to retain at least 80 percent of the variance, we would need to include at least 27 principal components. That's still much less than the 100+ columns we had initially.
Only if we were to include all principal components would we retain the full 100 percent of the variance.
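You can read that number straight off the eigenvalues tibble we computed above. A minimal sketch (this assumes the cumulative column is stored as a fraction between 0 and 1; if it is reported in percent, filter on >= 80 instead):
# first principal component at which the cumulative variance reaches 80%
eigenvalues %>%
  filter(cumulative >= 0.8) %>%
  slice(1)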
Part II: K-Means Clustering (25 min)
🧑‍🏫 TEACHING MOMENT:
(Your class teacher will explain the difference between supervised and unsupervised learning, and introduce k-means clustering as an example of unsupervised learning.)
Short intro: clustering is an example of unsupervised learning. In contrast to supervised learning, we do not have any labels here, just some variables. The goal of clustering is to assign each data point a label by grouping the data into different clusters.
- Let's run k-means clustering on our two principal components from part I. We'll just go with 3 clusters for now.
kclust <-
  pca_results %>%
  select(PC1, PC2) %>%
  kmeans(centers = 3)

tidy(kclust)
The centroids (as x and y coordinates) found by the k-means algorithm are shown in the table above. With those cluster centroids, we can now apply the clustering to our data set.
- What happens exactly: you can think of this as the analogue to making "predictions", whereas "fitting" the model is the process of finding the centroids themselves. Making predictions means we assign a cluster label to each point depending on the closest centroid.
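To make the "closest centroid" idea concrete, here is a minimal sketch that redoes the assignment by hand (this is not how augment() is implemented, just the same logic written out):
# assign each point to the centroid with the smallest squared Euclidean distance
centroids <- kclust$centers                              # 3 x 2 matrix of centroid coordinates
points    <- as.matrix(select(pca_results, PC1, PC2))
nearest   <- apply(points, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
head(nearest)   # should agree with head(augment(kclust, pca_results)$.cluster)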
Visualized, it looks like this:
augment(kclust, pca_results) %>%
ggplot() +
geom_point(aes(x=PC1, y=PC2, color=.cluster), alpha=0.3) +
geom_point(data = tidy(kclust), aes(x=PC1, y=PC2), size = 6, shape = "x")
🗣️ Discussion: Was the choice of k=3 a good one? Why or why not?
So how many clusters do we need?
This is the BIG question when doing clustering. Since we don't have any labels, there is no obvious answer to this question.
- To start with, we can do the same thing again but for different values of k, say between k=1 and k=9, and then see if we can decide which one provides the best fit.
# this is a very compact way to perform k-means clustering for various k
kclusts <-
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(select(pca_results, PC1, PC2), .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, pca_results)
  )

clusters <-
  kclusts %>%
  unnest(cols = c(tidied))

assignments <-
  kclusts %>%
  unnest(cols = c(augmented))

clusterings <-
  kclusts %>%
  unnest(cols = c(glanced))
- Let's visualize the results using a facet plot.
ggplot(assignments, aes(x = PC1, y = PC2)) +
geom_point(aes(color = .cluster), alpha = 0.7) +
facet_wrap(~ k) +
geom_point(data = clusters, size = 4, shape = "x")
🗣️ Discussion: Based on this plot, what is the best choice for k?
Judging the fit only visually has its limitations, of course (we couldn't even do that if we had more than 2 dimensions). So what's the alternative?
One commonly used method to choose the best number of clusters is the "elbow method". It relies on the "within-cluster sum of squares" (WCSS).
This metric will always keep decreasing as k increases. However, rather than choosing the maximum k, we may find that a moderate k is good enough, such that the WCSS does not actually decrease that much further afterwards (ie the plot often looks like an elbow).
ggplot(clusterings, aes(k, tot.withinss)) +
geom_line() +
geom_point()
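If the WCSS definition feels abstract, here is a minimal sketch that recomputes it by hand for the k = 3 fit from earlier (assuming kclust is still in your environment); the result should agree with kclust$tot.withinss and with the k = 3 point in the plot above.
# for each point, take the squared distance to its own cluster centroid, then sum everything up
centroid_lookup <- tidy(kclust) %>%
  select(.cluster = cluster, c1 = PC1, c2 = PC2)

augment(kclust, pca_results) %>%
  left_join(centroid_lookup, by = ".cluster") %>%
  summarise(wcss = sum((PC1 - c1)^2 + (PC2 - c2)^2))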
🗣️ Discussion: Based on the elbow method, which k would you choose? Does the plot even show a clear "elbow"?
Part III: More PCA & DBSCAN (40 min)
If you are participating in the GENIAL project, you are asked to:
- Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
- Have only the following tabs open in your browser:
  - These lab instructions
  - The ChatGPT website (open a new chat window and name it "DS202A - Week 09")
  - The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
  - The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
- Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.
- Fill out this brief survey at the end of the lab: 🔗 link (requires LSE login)
Click here for tips on how to use ChatGPT today
General Tips:
Provide enough context. For example, rather than simply asking ChatGPT for a plot from scratch, you can also ask it to re-create a plot but with some modifications. You could try something similar for the recipe in part 3.2 too.
ChatGPT often struggles with R (more than it does with Python for example). Therefore, it could help to provide it with the documentation of the function(s) it needs to perform the task.
In case you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.
We suggest you have these tabs open in your browser:
- The ChatGPT website (open a new chat window and name it "DS202A - Week 09")
- The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
- The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
- Fill out this brief survey at the end of the lab if you are part of GENIAL: 🔗 link (requires LSE login)
🧑‍🏫 TEACHING MOMENT:
(Your class teacher will give a brief overview of the three parts to do in this section. Ask for clarification or help whenever needed.)
3.1 Explore our Principal Components
In this part you'll explore our two principal components from part I a bit further to find out what kind of information they still contain.
🎯 ACTION POINTS:
- Take the original data wvs_data and compute the average lifeexpect per CONTINENT.
- Then create a scatter plot using our pca_results just like we did in part I, but now color the points according to either CONTINENT or lifeexpect (actually try out both to see the connection).
- Start with the code provided in the lab material (.qmd) you downloaded.
Question: What are your observations? Do the two principal components contain any information related to life expectancy and/or continent?
3.2 Principal Components as Predictors
We saw that our two principal components still contain lots of information, so they should also be useful predictors.
In this part, you will predict whether an observation has CONTINENT="Africa". For this, we will quickly create a new binary outcome variable is_africa that indicates whether the row has continent Africa ("Yes") or not ("No").
######### leave this code cell as is
log_reg_data <-
  wvs_data %>%
  mutate(is_africa = factor(CONTINENT == 'Africa', labels = c("No", "Yes"),
                            levels = c(FALSE, TRUE), ordered = TRUE))
🎯 ACTION POINTS:
We'll start with the same pca_recipe we've created in part I.
- Modify the recipe to account for the predictive modelling task.
- Check summary(pca_recipe) to ensure that your recipe has exactly one outcome (ie is_africa) and exactly two predictors (ie PC1 and PC2).
- Run the model training and evaluation code that is already provided (the code block after). If you make the right changes to the recipe, this should run and output a ROC curve.
Question: Does the model work well? Explain why (or why not).
3.3 DBSCAN Clustering
In this part, you will apply a different clustering algorithm, DBSCAN. In contrast to k-means, DBSCAN is a density-based clustering method. That means it:
- doesn't require specifying the number of clusters beforehand
- groups together points that are closely packed, separated by areas of lower density
🎯 ACTION POINTS:
- Cluster our pca_results data using DBSCAN.
- Plot your clusters as we did in part II.
This time we won't use tidyverse functions. DBSCAN is an independent library outside the tidyverse, so part of the exercise is to find out how to apply it. As a hint, you only need the dbscan() function.
You can follow this documentation page here, for example.
You don't need to specify the number of clusters beforehand, but you still have some control over the clusters by setting eps (a measure to control the density of clusters) and minPts (the minimum number of points per cluster).
Can you find values that lead to appropriate clusters? You can start with eps=1 and minPts=5.
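If you are unsure where to begin, here is a minimal sketch of the dbscan() interface on toy data (this is deliberately not the lab solution; the toy matrix and parameter values are arbitrary, and the dbscan library was already loaded in the Setup section):
# dbscan() takes a numeric matrix/data frame plus the eps and minPts parameters
set.seed(42)
toy <- matrix(rnorm(200), ncol = 2)         # 100 random 2-d points
db  <- dbscan(toy, eps = 0.5, minPts = 5)   # returns an object with a $cluster vector
table(db$cluster)                           # cluster 0 means "noise" (not assigned to any cluster)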
Questions: How do these results compare to the ones we obtained using k-means? Can you recognize the difference between clustering using centroids and clustering by densities?