πŸ’» LSE DS202A 2023: Week 09 - Lab

2022/23 Autumn Term

Author

Andreas StΓΆffelbauer

Published

19 Nov 2023

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Use principal component analysis (PCA) to perform dimensionality reduction
  • Understand the difference between supervised and unsupervised learning
  • Apply the k-means algorithm to cluster observations
This lab is part of the GENIAL project.

When you reach Part III of this lab, read the specific instructions for GENIAL participants.

πŸ“š Preparation

We will be using the same World Values Survey dataset as in last week’s lab.

Use the link below to download the lab materials:

We will post solutions to Part III on Tuesday afternoon, only after all labs have ended.

βš™οΈ Setup

Load the libraries for today’s lab.

# Tidyverse packages we will use
library(ggplot2)
library(dplyr)     
library(tidyr)     
library(readr)     

# Tidymodels packages (and friends) we will use
library(rsample)
library(yardstick) 
library(parsnip)   
library(recipes)   
library(workflows) 
library(rpart)
library(tune)
library(broom)
library(purrr)

# New library for today
library(dbscan)    # density-based clustering

Read the data set.

It is the dataset you downloaded in W07.

Note that the negative values -1 to -5, as well as -9999, actually represent NAs in the data. So we’ll load them as NAs.

We’ll mostly work with numerical columns today, so we will also:

  • remove all categorical columns except for continent (we’ll use it later)
  • remove all columns related to the questionnaire so that we only have variables here that we can more easily interpret
filepath <- "data/WVS_Wave7_modified.csv"
wvs_data <- read_csv(filepath, na = c("", "NA", -1, -2, -3, -4, -5, -9999))

wvs_data = 
  wvs_data %>%
  select(-D_INTERVIEW, -COUNTRY_NAME, -REGION_NAMES, -TOWN_NAMES, -INTERVIEW_LANGUAGE, -starts_with("Q"), -contains("party"))

wvs_data %>%
  head()

That leaves us with a data set of one categorical and 196 numeric columns, including variables such as trust in institutions, life expectancy, education levels, and GDP.

Part I: Dimensionality Reduction (25 min)

Our data set still has 196 (!!) numeric variables in total. That would be a lot of predictors for a model; so far, we have only ever used one or a handful of predictors.

Pay Attention

πŸ§‘β€πŸ« TEACHING MOMENT:

(Your class teacher will explain the motivation for dimensionality reduction, focusing on the why rather than the how; the how will be covered in this week’s lecture.)

The Curse of Dimensionality πŸͺ„

No, that’s not a Harry Potter spell; it refers to the problems that often arise when dealing with high-dimensional data (ie lots of variables).

Machine learning models tend to perform badly with too many predictors: there is a higher chance of overfitting, and there is also a much bigger computational cost associated with training models on many variables.

Dimensionality Reduction to the Rescue πŸ§‘β€βš•οΈ

We want to reduce the number of variables. Intuitively, this could work because there are probably strong correlations between many of the individual variables (eg TRUST ARMY vs TRUST POLICE, or health expenditure vs life expectancy), so there seems to be some redundancy in the data.
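
One quick way to see this redundancy for yourself is to look at the pairwise correlations between the numeric columns. This is just an optional sketch, not part of the lab tasks, and the 0.5 cut-off is arbitrary (you may also see warnings about mostly-NA or constant columns, which you can ignore here).

# pairwise correlations between all numeric columns
cor_mat <-
  wvs_data %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")

# share of variable pairs with an absolute correlation above 0.5
mean(abs(cor_mat[upper.tri(cor_mat)]) > 0.5, na.rm = TRUE)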

Dimensionality reduction refers to a collection of methods that can help reduce the number of variables while preserving most of the information.

Principal Component Analysis πŸ’‘

However, rather than deciding which variables to keep and which ones to throw out (that would fall under β€œvariable selection”), we want to compress our high dimensional data into a low dimensional set of variables while retaining as much information from the original data as possible.

flowchart LR
  A["--------------- Lots of Dimensions ---------------"] --> B{PCA}
  B --> C["-- Few Dimensions --"]

In this lab, we will take this idea to the extreme and reduce the dimensionality of the whole data set to just 2. That has the additional benefit that we can plot the results on a two-dimensional grid.

Prep the Data

  1. Let’s create a new recipe.
pca_recipe = 
  # we don't have an outcome variable here
  recipe(~., data = wvs_data) %>%
  # remove variables with more than 5% missing values (some have >95% NAs)
  # this is mainly just to avoid dropping too many rows in the next step
  step_filter_missing(all_numeric_predictors(), threshold=0.05) %>%
  # remove rows with missing values
  step_naomit(all_numeric_predictors(), skip=FALSE) %>%
  # remove zero variance predictors as they don't carry any information
  step_zv(all_numeric_predictors()) %>%
  # normalize the data to avoid out-sized impact of some variables
  step_normalize(all_numeric_predictors()) %>%
  prep()

pca_results = bake(pca_recipe, wvs_data)
head(pca_results)

This is our pre-processed data. We dropped quite a few columns there (especially the ones that were mostly NA), but we still have over 100 numeric variables.

Note that all columns have now been standardized too, so they have mean 0 and variance 1. This is a recommended (or even necessary) step prior to performing PCA, as it prevents large-range variables (which naturally have a higher variance) from having an out-sized impact on the results.
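
If you want to convince yourself of this, here is a quick sanity-check sketch (not part of the lab tasks) that summarises the baked data:

# every numeric column should now have mean ~0 and standard deviation ~1
pca_results %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd))) %>%
  glimpse()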

Perform PCA

  1. Now, we can perform dimensionality reduction using PCA on this data.

This can be done simply by appending a step_pca() to our recipe. We set num_comp = 2, meaning that we want to reduce the dimensionality to just 2, and we keep the original variables rather than replacing them, as we’ll still use some of them later in part III.

pca_recipe = 
  # we don't have an outcome variable here
  recipe(~., data = wvs_data) %>%
  # remove variables with more than 5% missing values (some have >95% NAs)
  # this is mainly just to avoid dropping too many rows in the next step
  step_filter_missing(all_numeric_predictors(), threshold=0.05) %>%
  # remove rows with missing values
  step_naomit(all_numeric_predictors(), skip=FALSE) %>%
  # remove zero variance predictors as they don't carry any information
  step_zv(all_numeric_predictors()) %>%
  # normalize the data to avoid out-sized impact of some variables
  step_normalize(all_numeric_predictors()) %>%
  # apply principal component analysis
  step_pca(all_numeric_predictors(), num_comp = 2, keep_original_cols=TRUE) %>%
  prep()

# apply the recipe to our data set
pca_results = bake(pca_recipe, wvs_data)
pca_results %>%
  select(CONTINENT, PC1, PC2) %>%
  head()

Plot the Results

  1. Our data now contains the two newly created principal components (added as the last columns). We can plot them.
ggplot(pca_results, aes(PC1, PC2)) +
  geom_point(alpha=0.3) +
  ggtitle("Principal Components", subtitle = "2-dimensional representation of our 100+ numeric predictors")

There is clearly still some structure/information in our data, and we will exploit this in part II.

So how much information did we retain/lose?

  1. We can answer this by looking at how much of the variance of the original data is contained in the first n=1,2,… principal components.

The first principal component is the one that captures the most variance, then the second, and so on.

# this is a little awkward but we just want to 
# extract the fitted PCA that is hidden inside the recipe
pca_fit = pca_recipe$steps[[5]]$res

# the eigenvalues determine the amount of variance explained
eigenvalues = 
  pca_fit %>%
  tidy(matrix = "eigenvalues")

eigenvalues %>%
  filter(PC <= 30) %>%
  ggplot(aes(PC, percent)) +
  geom_col()

The first principal component captures more than 30 percent of the variance in all the numeric predictors we had initially. The second principal component still captures 10 percent.

eigenvalues %>%
  filter(PC <= 30) %>%
  ggplot(aes(PC, cumulative)) +
  geom_col()

Cumulatively, the first 2 components retain 40 percent of the variance. If we wanted to retain at least 80 percent of the variance, we would need to include at least 27 principal components. That’s still much less than the 100+ columns we had initially.
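
If you want to check that figure yourself, here is a small sketch using the eigenvalues tibble from above (note that cumulative is stored as a fraction between 0 and 1):

# the first principal component at which we pass 80% cumulative variance
eigenvalues %>%
  filter(cumulative >= 0.8) %>%
  slice_head(n = 1)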

Only if we kept all principal components would we retain the full 100 percent of the variance.

Part II: K-Means Clustering (25 min)

Pay Attention

πŸ§‘β€πŸ« TEACHING MOMENT:

(Your class teacher will explain the difference between supervised and unsupervised learning, and introduce k-means clustering as an example of unsupervised learning.)

Short intro: clustering is an example of unsupervised learning. In contrast to supervised learning, we do not have any labels here, just some variables. The goal of clustering is to assign each data point a label by grouping the data into different clusters.

  1. Let’s run k-means clustering on our two principal components from part I. We’ll just go with 3 clusters for now.
kclust = 
  pca_results %>%
  select(PC1, PC2) %>%
  kmeans(centers = 3)

tidy(kclust)

The centroids (as x and y coordinates) found by the k-means algorithm are shown in the table above. With those cluster centroids, we can now apply the clustering to our data set.

  1. What happens exactly: you can think of this as the analogue of making β€œpredictions”, whereas β€œfitting” the model is the process of finding the centroids themselves. Making a prediction means assigning each point the cluster label of its closest centroid.
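
If you are curious what that assignment looks like β€œby hand”, here is a minimal sketch; the augment() call below does exactly this for us, so you don’t need it for the lab.

# each point gets the label of its nearest centroid (squared Euclidean distance)
centroids <- as.matrix(tidy(kclust)[, c("PC1", "PC2")])
points    <- as.matrix(pca_results[, c("PC1", "PC2")])

nearest_centroid <- apply(points, 1, function(p) {
  which.min(colSums((t(centroids) - p)^2))
})

# this should match augment(kclust, pca_results)$.cluster
table(nearest_centroid)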

Visualized, it looks like this:

augment(kclust, pca_results) %>%
  ggplot() +
  geom_point(aes(x=PC1, y=PC2, color=.cluster), alpha=0.3) +
  geom_point(data = tidy(kclust), aes(x=PC1, y=PC2), size = 6, shape = "x")
Pay Attention

πŸ—£ Discussion: Was the choice of k=3 a good one? Why or why not?

So how many clusters do we need?

This is the BIG question when doing clustering. Since we don’t have any labels, there is no obvious answer to this question.

  1. To start with, we can do the same thing again but for different values of k, say between k=1 and k=9, and then see if we can decide which one provides the best fit.

# this is a very compact way to perform k-means clustering for various k
kclusts <- 
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(select(pca_results, PC1, PC2), .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, pca_results)
  )

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))
  1. Let’s visualize the results using a facet plot.
ggplot(assignments, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = .cluster), alpha = 0.7) + 
  facet_wrap(~ k) +
  geom_point(data = clusters, size = 4, shape = "x")
Pay Attention

πŸ—£ Discussion: Based on this plot, what is the best choice for k?

Judging the fit only visually has its limitations, of course (we couldn’t even do that if we had more than 2 dimensions). So what’s the alternative?

One commonly used method to choose the best number of clusters is the β€œelbow method”. It relies on the β€œwithin cluster sum of squares” (WCSS).

This metric always keeps decreasing as k increases. However, rather than choosing the maximum k, we may find that a moderate k is good enough, in the sense that the WCSS does not decrease much further beyond it (ie the plot often looks like an elbow).
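
For reference, the WCSS of a single fit is just the sum of squared distances from each point to its own cluster centroid, which kmeans() reports directly. A quick sketch using our earlier k = 3 fit:

# per-cluster within sum of squares for our earlier k = 3 fit
kclust$withinss

# total WCSS; the same value that glance(kclust) reports as tot.withinss
sum(kclust$withinss)

The plot below shows this total WCSS (tot.withinss) for each value of k.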

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point()
Pay Attention

πŸ—£ Discussion: Based on the elbow method, which k would you choose? Does the plot even show a clear β€œelbow”?

Part III: More PCA & DBSCAN (40 min)

If you are participating in the GENIAL project, you are asked to:

  • Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
  • Have only the following tabs open in your browser:
    1. These lab instructions
    2. The ChatGPT website (open a new chat window and name it β€˜DS202A - Week 09’)
    3. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
    4. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
  • Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.
  • Fill out this brief survey at the end of the lab: πŸ”— link (requires LSE login)
Click here for tips on how to use ChatGPT today

General Tips:

  1. Provide enough context. For example, rather than simply asking ChatGPT for a plot from scratch, you can also ask it to re-create a plot but with some modifications. You could try something similar for the recipe in part 3.2 too.

  2. ChatGPT often struggles with R (more than it does with Python for example). Therefore, it could help to provide it with the documentation of the function(s) it needs to perform the task.

In case you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

  1. The ChatGPT website (open a new chat window and name it β€˜DS202A - Week 09’)
  2. The tidymodels documentation page (you can open tabs with documentation pages for each package if you need to)
  3. The tidyverse documentation page (you can open tabs with documentation pages for each package if you need to)
  • Fill out this brief survey at the end of the lab if you are part of GENIAL: πŸ”— link (requires LSE login)
Teaching Moment

πŸ§‘β€πŸ« TEACHING MOMENT:

(Your class teacher will give a brief overview of the three parts to do in this section. Ask for clarification or help whenever needed.)

3.1 Explore our Principal Components

In this part you’ll explore our two principal components from part I a bit further to find out what kind of information they still contain.

🎯 ACTION POINTS:

  1. Take the original data wvs_data and compute the average lifeexpect per CONTINENT.

  2. Then create a scatter plot using our pca_results just like we did in part I, but now color the points according to either CONTINENT or lifeexpect (actually try out both to see the connection).

Start with the code provided in the lab material (.qmd) you downloaded.

Question

Question: What are your observations? Do the two principal components contain any information related to life expectancy and/or continent?

3.2 Principal Components as Predictors

We saw that our two principal components still contain a lot of information, so they should also be useful predictors.

In this part, you will predict whether an observation has CONTINENT="Africa". For this, we will quickly create a new binary outcome variable is_africa that is β€œYes” if the row’s continent is Africa and β€œNo” otherwise.

######### leave this code cell as is
log_reg_data =
  wvs_data %>%
  mutate(is_africa = factor(CONTINENT == 'Africa', labels=c("No","Yes"), 
                            levels=c(FALSE, TRUE), ordered=TRUE))

🎯 ACTION POINTS:

We’ll start with the same pca_recipe we created in part I.

  1. Modify the recipe to account for the predictive modelling task.
  2. Check summary(pca_recipe) to ensure that your recipe has exactly one outcome (ie is_africa) and exactly two predictors (ie PC1 and PC2).
  3. Run the model training and evaluation code that is already provided (the code block after). If you make the right changes to the recipe, this should run and output a ROC curve.
Question

Question: Does the model work well? Explain why (or why not).

3.3 DBSCAN Clustering

In this part, you will apply a different clustering algorithm, DBSCAN. In contrast to k-means, DBSCAN is a density-based clustering method. That means it:

  • doesn’t require specifying the number of clusters beforehand
  • groups together points that are closely packed, separating clusters by areas of lower density

🎯 ACTION POINTS:

  1. Cluster our pca_results data using DBSCAN.
  2. Plot your clusters as we did in part II.

This time we won’t use tidyverse functions. DBSCAN is an independent library outside the tidyverse, so part of the exercise is to find out how to apply it. As a hint, you only need the dbscan() function.

You can follow this documentation page here for example.

You don’t need to specify the number of clusters beforehand, but you still have some control over the clusters by setting eps (the radius of the neighbourhood used to judge how densely packed points are) and minPts (the minimum number of points required to form a dense cluster).

Can you find values that lead to appropriate clusters? You can start with eps=1 and minPts = 5.
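
If you are unsure what a dbscan() call looks like in general, here is a toy sketch on a built-in dataset (deliberately not the lab data, and the eps/minPts values here are arbitrary):

# toy example just to show the shape of the call
toy <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
db  <- dbscan(toy, eps = 0.4, minPts = 5)

db$cluster   # one label per point; 0 means the point was treated as noise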

Question

Questions: How do these results compare to the ones we obtained with k-means? Can you recognize the difference between clustering by centroids and clustering by density?