πŸ’» LSE DS202W 2023: Week 10 - Lab

2023/24 Winter Term

Published: 18 Mar 2024

This week, we will move on to a new dataset: the OkCupid profiles data (Kim and Escobedo-Land 2015). The code to download it is provided in the .qmd file. We will use it to explore the power of PCA and how it can help us identify outliers in a dataset.

I chose this one because it has a ton of categorical variables, something you are likely to encounter if you do research in the social sciences. (It’s also playful and intuitive, and the variables are almost self-explanatory, which is a bonus!)

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Gain a deeper insight into PCA
  • Interpret PCA plots with more confidence
  • Compare and contrast different clustering techniques (DBSCAN vs k-means)
  • Use DBSCAN and LOF to identify outliers (anomalies) in a dataset in a purely unsupervised manner

However, the most critical aspect of this lab is that you will not be asked to produce code. Instead, you will be asked to interpret the results of the code we provide you with. Discussing the results with your peers will be key to your success in this lab.

This will help you practice for your Summative 03 and January exams. Those assignments will not require you to write code unless you want to get extra credit.

πŸ“š Preparation

Use the link below to download the lab file:

πŸ–‡οΈ USEFUL LINKS

πŸ“‹ Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.

Part 0: Export your chat logs (~ 3 min)

As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:

🎯 ACTION POINTS

  1. πŸ”— CLICK HERE to export your chat log.

    Thanks for being GENIAL! You are now one step closer to earning some prizes! 🎟️

πŸ‘‰ NOTE: You MUST complete the initial form.

If you really don’t want to participate in GENIAL¹, just answer β€˜No’ to the Terms & Conditions question - your e-mail address will be deleted from GENIAL’s database the following week.

Part I - Meet a new dataset (15+5 mins)

No need to wait! Start tackling the action points below when you come to the classroom.

🎯 ACTION POINTS (15 min)

  1. Download the data set using the code provided in the .qmd file.

  2. Read it and look at the data using the code provided in the .qmd file.

Click here to see the code
# This chunk assumes the packages loaded in the .qmd's setup chunk; at a minimum it needs the tidyverse
library(tidyverse)

df_okcupid <- 
  read_csv("profiles_revised.csv") %>%
  # Fix issues with the data, e.g. converting "doesn&rsquo;t" to "doesnt"
  mutate(across(where(is.character), ~ str_replace_all(., stringr::fixed("&rsquo;"), ""))) %>%
  mutate(across(where(is.character), ~ str_replace_all(., stringr::fixed("++"), "pp"))) %>%
  mutate(zodiac_sign=purrr::map_chr(sign, ~ str_split(., " ")[[1]][1]),
         zodiac_importance=case_when(str_detect(sign, "but it doesnt matter") ~ "low importance",
                                     str_detect(sign, "and its fun to think about") ~ "medium importance",
                                     str_detect(sign, "matters a lot") ~ "high importance"),
         religion_importance=case_when(str_detect(religion, "and laughing about it") ~ "very low importance",
                                       str_detect(religion, "but not too serious about it") ~ "low importance",
                                       str_detect(religion, "and somewhat serious about it") ~ "medium importance",
                                       str_detect(religion, "and very serious about it") ~ "high importance"),
         religion=purrr::map_chr(religion, ~ str_split(., " ")[[1]][1])) %>%
  mutate(income=if_else(income == -1, "not informed", sprintf("%d", income))) %>%
  select(-c(sign))

df_okcupid %>% glimpse()
  3. Refine and Tidy Up: A closer look at the dataset reveals that certain columns, like ethnicity, offspring, and speaks, contain values that reflect multiple selected responses. This occurs because respondents were given the option to choose more than one answer:
df_okcupid %>% select(ethnicity, offspring, speaks) %>% head() %>% knitr::kable()
| ethnicity | offspring | speaks |
|---|---|---|
| asian, white | doesnt have kids, but might want them | english |
| white | doesnt have kids, but might want them | english (fluently), spanish (poorly), french (poorly) |
| NA | NA | english, french, cpp |
| white | doesnt want kids | english, german (poorly) |
| asian, black, other | NA | english |
| white | doesnt have kids, but might want them | english (fluently), chinese (okay) |

We want to convert these to dummy variables so that we can make better use of them later. We cannot use recipes here for several reasons (ask Jon if you want to understand them). Instead, we must employ some advanced data-reshaping techniques to create dummies for these particular columns.

⚠️ The code below is very advanced, the kind that only the most advanced students of our sister course on data manipulation, DS105, would be able to understand (provided they know R)! So don’t worry if you don’t follow all of it.

Click here to see the code
df_okcupid <-
  df_okcupid %>% 
  # Split each multiple-choice string into one row per selected response, then pivot to longer format (aka melt the data frame)
  separate_longer_delim(ethnicity, ", ") %>% 
  separate_longer_delim(offspring, ", ") %>%
  separate_longer_delim(speaks, ", ") %>%
  mutate(temporary_var=TRUE) %>%
  pivot_longer(c(ethnicity, offspring, speaks), names_to="multiple_choice_var", values_to="response") %>%
  distinct() %>%
  drop_na(response) %>%
  mutate(response=str_replace_all(response, stringr::fixed("'"), ""),
         response=str_replace(response, stringr::fixed(" ("), "_"),
         response=str_replace(response, stringr::fixed(")"), "")) %>%
  mutate(across(where(is.character), ~ str_replace_all(., stringr::fixed(" / "), "_or_"))) %>%
  mutate(across(where(is.character), ~ str_replace_all(., " ", "_"))) %>%
  # Finally, pivot wider so each unique response becomes its own TRUE/FALSE dummy column
  pivot_wider(names_from = c(multiple_choice_var, response), 
              values_from = temporary_var,
              names_sep = "_",
              values_fill = FALSE)

What we want you to appreciate is that instead of a single column, ethnicity, we have multiple ethnicity-related ones:

df_okcupid %>% select(starts_with("ethnicity")) %>% head()
| ethnicity_asian | ethnicity_white | ethnicity_black | ethnicity_other | ethnicity_hispanic_or_latin | ethnicity_pacific_islander | ethnicity_native_american | ethnicity_middle_eastern | ethnicity_indian |
|---|---|---|---|---|---|---|---|---|
| TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| TRUE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |

The same applies to speaks_ and offspring_ variables.
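
If you want a quick sense of how much wider the data just became, here is a minimal check (a sketch; it assumes only these reshaped columns use the ethnicity_, offspring_ and speaks_ prefixes):

# Count the dummy columns created from the three multiple-choice questions
df_okcupid %>% 
  select(starts_with(c("ethnicity_", "offspring_", "speaks_"))) %>% 
  ncol()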

  4. Investigate the character columns. Just how many unique values are there in each of them?
apply(df_okcupid %>% select(where(is.character)), 2, function(x){length(unique(x))})
          body_type                diet              drinks               drugs 
                 13                  19                   7                   4 
          education              income                 job         orientation 
                 33                  13                  22                   3 
               pets            religion                 sex              smokes 
                 16                  10                   2                   6 
             status         zodiac_sign   zodiac_importance religion_importance 
                  5                  13                   4                   5 
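
If you prefer tidyverse style, the same counts can be computed with across() and n_distinct(); a minimal equivalent (a sketch):

df_okcupid %>%
  summarise(across(where(is.character), n_distinct)) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_unique") %>%
  arrange(desc(n_unique))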

Let’s convert them all to factors!

df_okcupid <- df_okcupid %>% mutate(across(where(is.character), as.factor))
  5. More dummy variables! Let’s use a recipe to convert the remaining categorical columns to dummy variables:
Click here to see the code
dummy_recipe <-
  # We don't have an outcome variable here
  recipe(~., data = df_okcupid) %>%
  # Remove variables with more than 50% missing values
  step_filter_missing(all_predictors(), threshold=0.5) %>%
  # Remove rows with missing values
  step_naomit(all_predictors(), skip=FALSE) %>%
  # Remove zero variance predictors as they don't carry any information
  step_zv(all_numeric_predictors()) %>%
  # Normalize the data to avoid out-sized impact of some variables
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_factor_predictors()) %>%
  # The dummy variables we created for ethnicity, offspring and speaks must be transformed to integers
  step_mutate(across(where(is.logical), as.integer)) %>%
  prep()

dummy_recipe %>% tidy()
| number | operation | type | trained | skip | id |
|---|---|---|---|---|---|
| 1 | step | filter_missing | TRUE | FALSE | filter_missing_luTiF |
| 2 | step | naomit | TRUE | FALSE | naomit_guYYi |
| 3 | step | zv | TRUE | FALSE | zv_MVf9F |
| 4 | step | normalize | TRUE | FALSE | normalize_4VGX5 |
| 5 | step | dummy | TRUE | FALSE | dummy_OCXRC |
| 6 | step | mutate | TRUE | FALSE | mutate_AXwDm |
  6. Feel overwhelmed by taking a look at the resulting 458 columns!
summary(dummy_recipe) %>% arrange(variable)
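
If the recipe summary feels abstract, you can also bake the prepped recipe and inspect the processed data directly; a quick sketch (new_data = NULL returns the training data the recipe was prepped on):

baked_df <- bake(dummy_recipe, new_data = NULL)
dim(baked_df)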

πŸ—£ CLASSROOM DISCUSSION (5 min)

(Your class teacher will mediate this discussion)

  • How would you explore this data if dimensionality reduction was not an option?
  • How would you answer: what are the most common types of profiles one can find on OK Cupid?

Part II - The power of PCA (30 min)

🎯 ACTION POINTS

  1. To try to make sense of the sheer number of combinations of attributes, we will run PCA:
Click here to see the code
pca_recipe <-
  # Reuse existing recipe
  dummy_recipe %>%
  # I'm not sure yet how many components I want to keep; let's say 30.
  step_pca(all_predictors(), num_comp = 30, keep_original_cols=TRUE) %>%
  prep()


# Apply the recipe to our data set
pca_results <- bake(pca_recipe, df_okcupid)
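
If you want to confirm the new score columns are there, here is a quick check (a sketch; it assumes only the PCA scores match this PC-number naming pattern):

pca_results %>% select(matches("^PC\\d+$")) %>% names()
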
  2. How much β€˜information’ are we keeping after compressing the data with PCA?
Click here to see the code
pca_step <- pca_recipe %>% tidy() %>% filter(type == "pca") %>% pull(number)

pca_fit <- pca_recipe$steps[[pca_step]]$res

# the eigenvalues determine the amount of variance explained
eigenvalues <- 
  pca_fit %>%
  tidy(matrix = "eigenvalues")

eigenvalues %>%
  filter(PC <= 5) %>%
  ggplot(aes(PC, cumulative)) +
  geom_col(fill="#C63C4A") +
  geom_text(aes(label=scales::percent(cumulative, accuracy=0.01)),
            nudge_y=0.05, fontface="bold", size=rel(4.5)) + 
  
  # Customise scales
  scale_y_continuous(name="Cumulative Explained Variance", 
                     labels=scales::percent,
                     limits=c(0, 1), 
                     breaks=seq(0, 1, 0.2)) +
  scale_x_continuous(name="Principal Component") +
  
  labs(title="The first PC explains ~33% of the variance in the data.\n Good compression!\n",
       subtitle="The second PC adds some 7% more, and so on...\n",
       caption="Figure 1. Explained variance\nper component") +
  
  # Prettify plot a bit
  theme_bw(base_size=16) 
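
If you would rather pick num_comp from the data than guess, one option (a sketch, with an illustrative 80% threshold) is to find the first component at which the cumulative explained variance crosses it:

eigenvalues %>%
  filter(cumulative >= 0.80) %>%
  slice_head(n = 1)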

  3. Let’s focus on the first two components, since plotting the first two PCs is common practice.

Think of the plot below as a 2D representation of our 400+ numeric/dummy variables that retains roughly 39% of their variance.

Click here to see the code
ggplot(pca_results, aes(PC01, PC02)) +
  geom_point(alpha=0.2, size=rel(2.5)) +
   labs(title = "There is a huge blob in the middle!\n I wonder what that represents.\n", 
       subtitle = "There are some clear outliers here\n",
       caption = "Figure 2. Distribution of samples using\nPC01 and PC02 as coordinates.") +
  
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  
    # Prettify plot a bit
  theme_bw(base_size=14) 

  4. How concentrated is this main blob?

Let’s change the geom to show density instead of individual dots:

Click here to see the code
ggplot(pca_results, aes(PC01, PC02)) +
  geom_bin2d() +
    labs(title = "Here is how concentrated the main blob is", 
       subtitle = "It's time to put your understanding of plots to the test!\n",
       caption = "Figure 3. Density of samples using\nPC01 and PC02 coordinates") +
  
  scale_fill_viridis_b() +
  # I tweaked those manually after seeing the plot
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  
  # Prettify plot a bit
  theme_bw(base_size=14) 
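
If you prefer contours to binned counts, a filled 2D density estimate gives a similar picture; an alternative sketch:

ggplot(pca_results, aes(PC01, PC02)) +
  geom_density_2d_filled() +
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  theme_bw(base_size=14)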

πŸ—£ CLASSROOM DISCUSSION

(Your class teacher will mediate this discussion)

  • How could you use the plots above to investigate the most common types of OK Cupid profiles?

πŸ§‘β€πŸ« TEACHING MOMENT

Your class teacher will now guide the conversation and explain the plot below. If needed, they will recap how PCA works.

Click here to see the code
tidied_pca <- tidy(pca_recipe, pca_step)

plot_df <- 
  tidied_pca %>%
  filter(component %in% paste0("PC", 1:5)) %>%
  arrange(desc(abs(value))) %>% 
  group_by(component) %>% 
  # Get 10 biggest contributors to this PC
  slice_head(n=10) %>%
  mutate(component = fct_inorder(component))

terms_levels <- 
  plot_df %>% 
  group_by(terms) %>% 
  summarise(sum_value=sum(abs(value))) %>% 
  arrange(sum_value) %>% 
  pull(terms)

plot_df$terms <- factor(plot_df$terms, levels=terms_levels)

ggplot(plot_df, aes(value, terms, fill = abs(value))) +
  geom_col() +
  
  scale_fill_viridis_c(name="Abs value") +
  
  facet_wrap(~component, nrow = 1) +
    labs(title="Which variables contributed the most (together) to a particular PC?\n", 
       subtitle="This can be confusing.\n Make sure to ask your class teacher about any doubts you might have\n",
       caption="Figure 4. Contribution of most important\n features to each component",
       y = NULL) +
  # Prettify plot a bit
  theme_bw(base_size=10)
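
If the faceted plot is hard to read, the same loadings can be inspected as a table, one component at a time; for example, the ten largest (in absolute value) contributions to PC1 (a sketch):

tidied_pca %>%
  filter(component == "PC1") %>%
  arrange(desc(abs(value))) %>%
  slice_head(n = 10)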

πŸ—£οΈ Discussion:

  • How does Figure 4 help you interpret Figures 2 & 3?
  • How does the above help you think about the attributes of the most common type of OK Cupid profiles?
  • How would you interpret the 2D coordinates if instead of PC01 vs PC02, we had plotted PC01 vs PC03 in Figures 2 & 3?

Add your answer and reflections here

Part III: Anomaly detection techniques (40 mins)

We will try two different techniques for outlier detection on the two principal components: DBSCAN and LOF.

πŸ‘₯ IN PAIRS, go through the action points below and discuss your impressions and key takeaways.

🎯 ACTION POINTS

  1. Take a look at the clusters identified by DBSCAN:
Click here to see the code
# DBSCAN (the dbscan() and lof() functions come from the {dbscan} package)
eps = 1
minPts = 20

dbscan_fit <- 
  pca_results %>%
  select(PC01, PC02) %>%
  dbscan(eps=eps, minPts = minPts)

# Save the cluster back to the data frame
pca_results$dbscan_cluster <- factor(dbscan_fit$cluster)

ggplot(pca_results, 
       aes(PC01, PC02, color=dbscan_cluster, alpha=dbscan_cluster)) +
  geom_point(size=rel(3.5)) +
  
  scale_colour_manual(name="DBSCAN\nCluster", values=c(`0`="#C63C4A", `1`="#878787")) +
  scale_alpha_manual(name="DBSCAN\nCluster", values=c(`0`=0.6, `1`=0.15)) +
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  
   labs(title="DBSCAN helps us identify outliers", 
       subtitle="The algorithm identified 2 different clusters, labelled 0 & 1",
       caption="Figure 5. Clusters identified by DBSCAN") +
  # Prettify plot a bit
  theme_bw(base_size=14) 
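
The eps and minPts values above were set by hand. If you want to experiment, a common heuristic for choosing eps is to look for an β€˜elbow’ in the k-nearest-neighbour distance plot, with k set to minPts; a sketch using dbscan::kNNdistplot():

pca_results %>%
  select(PC01, PC02) %>%
  dbscan::kNNdistplot(k = minPts)
abline(h = eps, lty = 2)  # the eps = 1 used above, shown for reference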

πŸ—£ Discussion: How well do you think DBSCAN performs at anomaly detection on the two principal components?

  2. Let’s filter to take a look at these samples:

(Ask your class teacher for help if you don’t know what we are doing here)

Click here to see the code
df_anomalies <-
  pca_results %>%
  mutate(outlier_type=case_when(
    dbscan_cluster == 0 & PC02 > 2.5 ~ "upper side",
    dbscan_cluster == 0 & PC02 < -2.5 ~ "lower side",
    .default="Not an outlier"
  )) %>%
  filter(outlier_type != "Not an outlier") %>%
  arrange(desc(outlier_type)) %>%
  select(outlier_type, PC01, PC02, age, height, sex_m)

df_anomalies
| outlier_type | PC01 | PC02 | age | height | sex_m |
|---|---|---|---|---|---|
| upper side | 2.134535 | 5.492575 | -0.4549691 | 6.529850 | 0 |
| upper side | 1.730021 | 6.131443 | -1.1277723 | 6.782886 | 1 |
| upper side | 3.082110 | 5.090541 | -0.4549691 | 5.770742 | 1 |
| lower side | 1.563825 | -4.635300 | -0.8394281 | -6.374985 | 1 |
| lower side | 1.084287 | -4.141511 | 2.2362436 | -3.844625 | 0 |
| lower side | 1.750879 | -4.853435 | -1.2238870 | -6.374985 | 0 |

The problem with the data frame above is that we are looking at the standardised values. It will probably be more helpful if we look at the original data instead.

πŸ’‘ Do you see why we selected these particular variables here?

  3. Reverse the normalisation:

What were the mean and standard deviation of the normalized variables?

normalisation_step <- pca_recipe %>% tidy() %>% filter(type == "normalize") %>% pull(number)
normalisation_stats <- tidy(pca_recipe, normalisation_step)
normalisation_stats
| terms | statistic | value | id |
|---|---|---|---|
| age | mean | 32.733604 | normalize_4VGX5 |
| height | mean | 68.193986 | normalize_4VGX5 |
| age | sd | 10.404231 | normalize_4VGX5 |
| height | sd | 3.952007 | normalize_4VGX5 |

Or in a more useful format:

normalisation_stats <- normalisation_stats %>% 
  pivot_wider(names_from="statistic", values_from="value")

normalisation_stats
| terms | id | mean | sd |
|---|---|---|---|
| age | normalize_4VGX5 | 32.73360 | 10.404231 |
| height | normalize_4VGX5 | 68.19399 | 3.952007 |

Reverse-engineering the process for these variables, we get:

Click here to see the code
age_stats <- normalisation_stats %>% filter(terms == "age")
height_stats <- normalisation_stats %>% filter(terms == "height")

df_anomalies$age <- (df_anomalies$age * age_stats$sd) + age_stats$mean
df_anomalies$height <- (df_anomalies$height * height_stats$sd) + height_stats$mean

# Height is shown in inches, to see it in cm do df_anomalies %>% mutate(height=2.54*height)
df_anomalies
| outlier_type | PC01 | PC02 | age | height | sex_m |
|---|---|---|---|---|---|
| upper side | 2.134535 | 5.492575 | 28 | 94 | 0 |
| upper side | 1.730021 | 6.131443 | 21 | 95 | 1 |
| upper side | 3.082110 | 5.090541 | 28 | 91 | 1 |
| lower side | 1.563825 | -4.635300 | 24 | 43 | 1 |
| lower side | 1.084287 | -4.141511 | 56 | 53 | 0 |
| lower side | 1.750879 | -4.853435 | 20 | 43 | 0 |

πŸ’‘ The outliers on the upper side are unusually tall people, and those on the lower side are unusually short people.
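
If you want to see these profiles in their original units without reversing the normalisation by hand, you can also filter the un-normalised df_okcupid directly; a sketch (the height cut-offs below are simply read off the table above and are purely illustrative):

df_okcupid %>%
  filter(height >= 91 | height <= 53) %>%
  select(age, height, sex)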

  4. A small digression: compare this to how k-means divides the data into k=2 clusters:
Click here to see the code
kmeans_fit <-
  pca_results %>% 
  select(PC01, PC02) %>% 
  # k-means starts from random centres, so call set.seed() beforehand if you want reproducible clusters
  kmeans(centers=2)

# Save the cluster back to the data frame
pca_results$kmeans_cluster <- factor(kmeans_fit$cluster)

ggplot(pca_results, 
       aes(PC01, PC02, color=kmeans_cluster)) +
  geom_point(size=rel(3.5), alpha=0.3) +
  
  scale_colour_discrete(name="k-means\nCluster") + 
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  
   labs(title="k-means captures a completely different pattern", 
       subtitle="What do you make of it?",
       caption="Figure 6. Clusters identified by\nk-means with k=2 clusters") +
  # Prettify plot a bit
  theme_bw(base_size=14)

  5. Let’s see if the Local Outlier Factor (LOF) performs better than DBSCAN.

We use the lof() function to calculate local outlier factors:

lof_results <- pca_results %>%
  select(PC01, PC02) %>%
  # 5 is number of neighbors used to calculate the LOF
  lof(5)

# Append the LOF score as a new column
pca_results$score <- lof_results
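
Before plotting, a quick way to gauge how many samples look anomalous is to count the scores above a threshold (1.5 here, matching the cut-off used in the histogram below; the threshold itself is illustrative):

pca_results %>% count(high_lof = score >= 1.5)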

What is the distribution of scores?

Click here to see the code
ggplot(pca_results, aes(x=score, fill=score >= 1.5)) +
  geom_histogram(binwidth=0.025) +
  
  scale_fill_manual(name="Comparatively\nHigh scores", 
                    values=c(`FALSE`="#878787", `TRUE`="#C63C4A")) +
  
   labs(title="The majority of samples have a score around 1.0", 
       subtitle="As we are looking for anomalies, our interest lies in the samples with outlier scores",
       caption="Figure 7. Distribution of LOF scores") +
  # Prettify plot a bit
  theme_bw(base_size=14) 

We can now plot the results of LOF.

Click here to see the code
ggplot(pca_results, aes(PC01, PC02, size=score, alpha=score >= 1.5, color=score)) +
  geom_point() +

  scale_alpha_manual(name="LOF scores", values=c(`FALSE`=0.1, `TRUE`=0.6), guide="none") +
  scale_radius(name="LOF scores", range=c(1, 10)) +
  scale_color_viridis_c(name="LOF scores") +  
  
labs(title = "Samples coloured and sized according to their LOF scores", caption = "Figure 8. Samples coloured and sized\naccording to their LOF scores") +

# I tweaked those manually after seeing the plot 
scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +

# Prettify plot a bit 
theme_bw(base_size=14) 
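
To connect the two methods, you can pull out the highest-scoring samples and check whether DBSCAN also flagged them as outliers (a sketch):

pca_results %>%
  arrange(desc(score)) %>%
  slice_head(n = 10) %>%
  select(PC01, PC02, score, dbscan_cluster)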

πŸ—£ Discussion: Does LOF perform better than DBSCAN at detecting β€˜anomalous’ samples?

References

Kim, Albert Y., and Adriana Escobedo-Land. 2015. β€œOkCupid Data for Introductory Statistics and Data Science Courses.” Journal of Statistics Education 23 (2): 5. https://doi.org/10.1080/10691898.2015.11889737.

Footnotes

  1. We’re gonna cry a little bit, not gonna lie. But no hard feelings. We’ll get over it.