πŸ’» LSE DS202W 2023: Week 10 - Lab

2023/24 Winter Term

Published: 18 Mar 2024

This week, we will move on to a new dataset: the OkCupid profiles data (Kim and Escobedo-Land 2015). The code to download it is provided in the .qmd file. We will use it to explore the power of PCA and how it can help us identify outliers in a dataset.

I chose this one because it has a ton of categorical variables, something you are likely to encounter if you do research in the social sciences. (It’s also playful and intuitive, and the variables are almost self-explanatory, which is a bonus!)

πŸ₯… Learning Objectives

By the end of this lab, you will be able to:

  • Gain a deeper insight into PCA
  • Interpret PCA plots with more confidence
  • Compare and contrast different clustering techniques (DBSCAN vs k-means)
  • Use DBSCAN and LOF to identify outliers (anomalies) in a dataset in a purely unsupervised manner

However, the most critical aspect of this lab is that you will not be asked to produce code. Instead, you will be asked to interpret the results of the code we provide you with. Discussing the results with your peers will be key to your success in this lab.

This will help you practice for your Summative 03 and January exams. Those assignments will not require you to write code unless you want to get extra credit.

πŸ“š Preparation

Use the link below to download the lab file:

πŸ–‡οΈ USEFUL LINKS

πŸ“‹ Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.

Part 0: Export your chat logs (~ 3 min)

As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:

🎯 ACTION POINTS

  1. πŸ”— CLICK HERE to export your chat log.

    Thanks for being GENIAL! You are now one step closer to earning some prizes! 🎟️

πŸ‘‰ NOTE: You MUST complete the initial form.

If you really don’t want to participate in GENIAL¹, just answer β€˜No’ to the Terms & Conditions question - your e-mail address will be deleted from GENIAL’s database the following week.

Part I - Meet a new dataset (15+5 mins)

No need to wait! Start tackling the action points below when you come to the classroom.

🎯 ACTION POINTS (15 min)

  1. Download the data set using the code provided in the .qmd file.

  2. Read it and look at the data using the code provided in the .qmd file.

Click here to see the code
# This chunk assumes the packages loaded in the .qmd's setup chunk; at a minimum it needs the tidyverse
library(tidyverse)

df_okcupid <- 
  read_csv("profiles_revised.csv") %>%
  # Fix issues with the data, e.g. converting "doesn&rsquo;t" to "doesnt"
  mutate(across(where(is.character), ~ str_replace_all(., stringr::fixed("&rsquo;"), ""))) %>%
  mutate(across(where(is.character), ~ str_replace_all(., stringr::fixed("++"), "pp"))) %>%
  mutate(zodiac_sign=purrr::map_chr(sign, ~ str_split(., " ")[[1]][1]),
         zodiac_importance=case_when(str_detect(sign, "but it doesnt matter") ~ "low importance",
                                     str_detect(sign, "and its fun to think about") ~ "medium importance",
                                     str_detect(sign, "matters a lot") ~ "high importance"),
         religion_importance=case_when(str_detect(religion, "and laughing about it") ~ "very low importance",
                                       str_detect(religion, "but not too serious about it") ~ "low importance",
                                       str_detect(religion, "and somewhat serious about it") ~ "medium importance",
                                       str_detect(religion, "and very serious about it") ~ "high importance"),
         religion=purrr::map_chr(religion, ~ str_split(., " ")[[1]][1])) %>%
  mutate(income=if_else(income == -1, "not informed", sprintf("%d", income))) %>%
  select(-c(sign))

df_okcupid %>% glimpse()
  3. Refine and Tidy Up: A closer look at the dataset reveals that certain columns, like ethnicity, offspring, and speaks, contain values that reflect multiple selected responses. This occurs because respondents were given the option to choose more than one answer:
df_okcupid %>% select(ethnicity, offspring, speaks) %>% head() %>% knitr::kable()
| ethnicity | offspring | speaks |
|---|---|---|
| asian, white | doesnt have kids, but might want them | english |
| white | doesnt have kids, but might want them | english (fluently), spanish (poorly), french (poorly) |
| NA | NA | english, french, cpp |
| white | doesnt want kids | english, german (poorly) |
| asian, black, other | NA | english |
| white | doesnt have kids, but might want them | english (fluently), chinese (okay) |

We want to convert these to dummy variables so that we can make better use of them later. We cannot use recipes here for several reasons (ask Jon if you want to understand them). Instead, we must employ some advanced data-reshaping techniques to create dummies for these particular columns.

⚠️ The code below is very advanced, the kind that only the most advanced students of our sister course on data manipulation, DS105, would be able to understand (provided they know R)! So don’t worry if you don’t follow all of it.

Click here to see the code
df_okcupid <-
  df_okcupid %>% 
  # Split each multiple-choice string into one row per selected response, then pivot to longer format (aka melt the data frame)
  separate_longer_delim(ethnicity, ", ") %>% 
  separate_longer_delim(offspring, ", ") %>%
  separate_longer_delim(speaks, ", ") %>%
  mutate(temporary_var=TRUE) %>%
  pivot_longer(c(ethnicity, offspring, speaks), names_to="multiple_choice_var", values_to="response") %>%
  distinct() %>%
  drop_na(response) %>%
  mutate(response=str_replace_all(response, stringr::fixed("'"), ""),
         response=str_replace(response, stringr::fixed(" ("), "_"),
         response=str_replace(response, stringr::fixed(")"), "")) %>%
  mutate(across(where(is.character), ~ str_replace_all(., stringr::fixed(" / "), "_or_"))) %>%
  mutate(across(where(is.character), ~ str_replace_all(., " ", "_"))) %>%
  # Finally, pivot wider so each unique response becomes its own TRUE/FALSE dummy column
  pivot_wider(names_from = c(multiple_choice_var, response), 
              values_from = temporary_var,
              names_sep = "_",
              values_fill = FALSE)

What we want you to appreciate is that instead of a single column, ethnicity, we have multiple ethnicity-related ones:

df_okcupid %>% select(starts_with("ethnicity")) %>% head()
| ethnicity_asian | ethnicity_white | ethnicity_black | ethnicity_other | ethnicity_hispanic_or_latin | ethnicity_pacific_islander | ethnicity_native_american | ethnicity_middle_eastern | ethnicity_indian |
|---|---|---|---|---|---|---|---|---|
| TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| TRUE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |

The same applies to speaks_ and offspring_ variables.
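
If you want a quick sense of how much wider the data just became, here is a minimal check (a sketch; it assumes only these reshaped columns use the ethnicity_, offspring_ and speaks_ prefixes):

# Count the dummy columns created from the three multiple-choice questions
df_okcupid %>% 
  select(starts_with(c("ethnicity_", "offspring_", "speaks_"))) %>% 
  ncol()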

  4. Investigate the character columns. Just how many unique values are there in each of them?
apply(df_okcupid %>% select(where(is.character)), 2, function(x){length(unique(x))})
          body_type                diet              drinks               drugs 
                 13                  19                   7                   4 
          education              income                 job         orientation 
                 33                  13                  22                   3 
               pets            religion                 sex              smokes 
                 16                  10                   2                   6 
             status         zodiac_sign   zodiac_importance religion_importance 
                  5                  13                   4                   5 
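
If you prefer tidyverse style, the same counts can be computed with across() and n_distinct(); a minimal equivalent (a sketch):

df_okcupid %>%
  summarise(across(where(is.character), n_distinct)) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_unique") %>%
  arrange(desc(n_unique))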

Let’s convert them all to factors!

df_okcupid <- df_okcupid %>% mutate(across(where(is.character), as.factor))
  5. More dummy variables! Let’s use a recipe to convert the remaining categorical columns to dummy variables:
Click here to see the code
dummy_recipe <-
  # We don't have an outcome variable here
  recipe(~., data = df_okcupid) %>%
  # Remove variables with more than 50% missing values
  step_filter_missing(all_predictors(), threshold=0.5) %>%
  # Remove rows with missing values
  step_naomit(all_predictors(), skip=FALSE) %>%
  # Remove zero variance predictors as they don't carry any information
  step_zv(all_numeric_predictors()) %>%
  # Normalize the data to avoid out-sized impact of some variables
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_factor_predictors()) %>%
  # The dummy variables we created for ethnicity, offspring and speaks must be transformed to integers
  step_mutate(across(where(is.logical), as.integer)) %>%
  prep()

dummy_recipe %>% tidy()
| number | operation | type | trained | skip | id |
|---|---|---|---|---|---|
| 1 | step | filter_missing | TRUE | FALSE | filter_missing_luTiF |
| 2 | step | naomit | TRUE | FALSE | naomit_guYYi |
| 3 | step | zv | TRUE | FALSE | zv_MVf9F |
| 4 | step | normalize | TRUE | FALSE | normalize_4VGX5 |
| 5 | step | dummy | TRUE | FALSE | dummy_OCXRC |
| 6 | step | mutate | TRUE | FALSE | mutate_AXwDm |
  6. Feel overwhelmed by taking a look at the resulting 458 columns!
summary(dummy_recipe) %>% arrange(variable)
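
If the recipe summary feels abstract, you can also bake the prepped recipe and inspect the processed data directly; a quick sketch (new_data = NULL returns the training data the recipe was prepped on):

baked_df <- bake(dummy_recipe, new_data = NULL)
dim(baked_df)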

πŸ—£ CLASSROOM DISCUSSION (5 min)

(Your class teacher will mediate this discussion)

  • How would you explore this data if dimensionality reduction was not an option?
  • How would you answer: what are the most common types of profiles one can find on OK Cupid?

Part II - The power of PCA (30 min)

🎯 ACTION POINTS

  1. To try to make sense of the sheer number of combinations of attributes, we will run PCA:
Click here to see the code
pca_recipe <-
  # Reuse existing recipe
  dummy_recipe %>%
  # I'm not sure yet how many components I want to keep; let's say 30.
  step_pca(all_predictors(), num_comp = 30, keep_original_cols=TRUE) %>%
  prep()


# Apply the recipe to our data set
pca_results <- bake(pca_recipe, df_okcupid)
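
If you want to confirm the new score columns are there, here is a quick check (a sketch; it assumes only the PCA scores match this PC-number naming pattern):

pca_results %>% select(matches("^PC\\d+$")) %>% names()
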
  2. How much β€˜information’ are we keeping after compressing the data with PCA?
Click here to see the code
pca_step <- pca_recipe %>% tidy() %>% filter(type == "pca") %>% pull(number)

pca_fit <- pca_recipe$steps[[pca_step]]$res

# the eigenvalues determine the amount of variance explained
eigenvalues <- 
  pca_fit %>%
  tidy(matrix = "eigenvalues")

eigenvalues %>%
  filter(PC <= 5) %>%
  ggplot(aes(PC, cumulative)) +
  geom_col(fill="#C63C4A") +
  geom_text(aes(label=scales::percent(cumulative, accuracy=0.01)),
            nudge_y=0.05, fontface="bold", size=rel(4.5)) + 
  
  # Customise scales
  scale_y_continuous(name="Cumulative Explained Variance", 
                     labels=scales::percent,
                     limits=c(0, 1), 
                     breaks=seq(0, 1, 0.2)) +
  scale_x_continuous(name="Principal Component") +
  
  labs(title="The first PC explains ~33% of the variance in the data.\n Good compression!\n",
       subtitle="The second PC adds some 7% more, and so on...\n",
       caption="Figure 1. Explained variance\nper component") +
  
  # Prettify plot a bit
  theme_bw(base_size=16) 
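
If you would rather pick num_comp from the data than guess, one option (a sketch, with an illustrative 80% threshold) is to find the first component at which the cumulative explained variance crosses it:

eigenvalues %>%
  filter(cumulative >= 0.80) %>%
  slice_head(n = 1)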

  3. Let’s focus on the first two components, since plotting the first two PCs is common practice.

Think of the plot below as a 2D representation of our 400+ numeric/dummy variables that retains roughly 39% of their variance.

Click here to see the code
ggplot(pca_results, aes(PC01, PC02)) +
  geom_point(alpha=0.2, size=rel(2.5)) +
   labs(title = "There is a huge blob in the middle!\n I wonder what that represents.\n", 
       subtitle = "There are some clear outliers here\n",
       caption = "Figure 2. Distribution of samples using\nPC01 and PC02 as coordinates.") +
  
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  
    # Prettify plot a bit
  theme_bw(base_size=14) 

  4. How concentrated is this main blob?

Let’s change the geom to show density instead of individual dots:

Click here to see the code
ggplot(pca_results, aes(PC01, PC02)) +
  geom_bin2d() +
    labs(title = "Here is how concentrated the main blob is", 
       subtitle = "It's time to put your understanding of plots to the test!\n",
       caption = "Figure 3. Density of samples using\nPC01 and PC02 coordinates") +
  
  scale_fill_viridis_b() +
  # I tweaked those manually after seeing the plot
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  
  # Prettify plot a bit
  theme_bw(base_size=14) 
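
If you prefer contours to binned counts, a filled 2D density estimate gives a similar picture; an alternative sketch:

ggplot(pca_results, aes(PC01, PC02)) +
  geom_density_2d_filled() +
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  theme_bw(base_size=14)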

πŸ—£ CLASSROOM DISCUSSION

(Your class teacher will mediate this discussion)

  • How could you use the plots above to investigate the most common types of OK Cupid profiles?

πŸ§‘β€πŸ« TEACHING MOMENT

Your class teacher will now guide the conversation and explain the plot below. If needed, they will recap how PCA works.

Click here to see the code
tidied_pca <- tidy(pca_recipe, pca_step)

plot_df <- 
  tidied_pca %>%
  filter(component %in% paste0("PC", 1:5)) %>%
  arrange(desc(abs(value))) %>% 
  group_by(component) %>% 
  # Get 10 biggest contributors to this PC
  slice_head(n=10) %>%
  mutate(component = fct_inorder(component))

terms_levels <- 
  plot_df %>% 
  group_by(terms) %>% 
  summarise(sum_value=sum(abs(value))) %>% 
  arrange(sum_value) %>% 
  pull(terms)

plot_df$terms <- factor(plot_df$terms, levels=terms_levels)

ggplot(plot_df, aes(value, terms, fill = abs(value))) +
  geom_col() +
  
  scale_fill_viridis_c(name="Abs value") +
  
  facet_wrap(~component, nrow = 1) +
    labs(title="Which variables contributed the most (together) to a particular PC?\n", 
       subtitle="This can be confusing.\n Make sure to ask your class teacher about any doubts you might have\n",
       caption="Figure 4. Contribution of most important\n features to each component",
       y = NULL) +
  # Prettify plot a bit
  theme_bw(base_size=10)
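
If the faceted plot is hard to read, the same loadings can be inspected as a table, one component at a time; for example, the ten largest (in absolute value) contributions to PC1 (a sketch):

tidied_pca %>%
  filter(component == "PC1") %>%
  arrange(desc(abs(value))) %>%
  slice_head(n = 10)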

πŸ—£οΈ Discussion:

  • How does Figure 4 help you interpret Figures 2 & 3?
  • How does the above help you think about the attributes of the most common type of OK Cupid profiles?
  • How would you interpret the 2D coordinates if instead of PC01 vs PC02, we had plotted PC01 vs PC03 in Figures 2 & 3?

Add your answer and reflections here

Part III: Anomaly detection techniques (40 mins)

We will try two different techniques for outlier detection on the two principal components: DBSCAN and LOF.

πŸ‘₯ IN PAIRS, go through the action points below and discuss your impressions and key takeaways.

🎯 ACTION POINTS

  1. Take a look at the clusters identified by DBSCAN:
Click here to see the code
# DBSCAN (the dbscan() and lof() functions come from the {dbscan} package)
eps = 1
minPts = 20

dbscan_fit <- 
  pca_results %>%
  select(PC01, PC02) %>%
  dbscan(eps=eps, minPts = minPts)

# Save the cluster back to the data frame
pca_results$dbscan_cluster <- factor(dbscan_fit$cluster)

ggplot(pca_results, 
       aes(PC01, PC02, color=dbscan_cluster, alpha=dbscan_cluster)) +
  geom_point(size=rel(3.5)) +
  
  scale_colour_manual(name="DBSCAN\nCluster", values=c(`0`="#C63C4A", `1`="#878787")) +
  scale_alpha_manual(name="DBSCAN\nCluster", values=c(`0`=0.6, `1`=0.15)) +
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  
   labs(title="DBSCAN helps us identify outliers", 
       subtitle="The algorithm identified 2 different clusters, labelled 0 & 1",
       caption="Figure 5. Clusters identified by DBSCAN") +
  # Prettify plot a bit
  theme_bw(base_size=14) 
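
The eps and minPts values above were set by hand. If you want to experiment, a common heuristic for choosing eps is to look for an β€˜elbow’ in the k-nearest-neighbour distance plot, with k set to minPts; a sketch using dbscan::kNNdistplot():

pca_results %>%
  select(PC01, PC02) %>%
  dbscan::kNNdistplot(k = minPts)
abline(h = eps, lty = 2)  # the eps = 1 used above, shown for reference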

πŸ—£ Discussion: How well do you think DBSCAN performs at anomaly detection on the two principal components?

  2. Let’s filter to take a look at these samples:

(Ask your class teacher for help if you don’t know what we are doing here)

Click here to see the code
df_anomalies <-
  pca_results %>%
  mutate(outlier_type=case_when(
    dbscan_cluster == 0 & PC02 > 2.5 ~ "upper side",
    dbscan_cluster == 0 & PC02 < -2.5 ~ "lower side",
    .default="Not an outlier"
  )) %>%
  filter(outlier_type != "Not an outlier") %>%
  arrange(desc(outlier_type)) %>%
  select(outlier_type, PC01, PC02, age, height, sex_m)

df_anomalies
| outlier_type | PC01 | PC02 | age | height | sex_m |
|---|---|---|---|---|---|
| upper side | 2.134535 | 5.492575 | -0.4549691 | 6.529850 | 0 |
| upper side | 1.730021 | 6.131443 | -1.1277723 | 6.782886 | 1 |
| upper side | 3.082110 | 5.090541 | -0.4549691 | 5.770742 | 1 |
| lower side | 1.563825 | -4.635300 | -0.8394281 | -6.374985 | 1 |
| lower side | 1.084287 | -4.141511 | 2.2362436 | -3.844625 | 0 |
| lower side | 1.750879 | -4.853435 | -1.2238870 | -6.374985 | 0 |

The problem with the data frame above is that we are looking at the standardised values. It will probably be more helpful if we look at the original data instead.

πŸ’‘ Do you see why we selected these particular variables here?

  3. Reverse the normalisation:

What were the mean and standard deviation of the normalized variables?

normalisation_step <- pca_recipe %>% tidy() %>% filter(type == "normalize") %>% pull(number)
normalisation_stats <- tidy(pca_recipe, normalisation_step)
normalisation_stats
| terms | statistic | value | id |
|---|---|---|---|
| age | mean | 32.733604 | normalize_4VGX5 |
| height | mean | 68.193986 | normalize_4VGX5 |
| age | sd | 10.404231 | normalize_4VGX5 |
| height | sd | 3.952007 | normalize_4VGX5 |

Or in a more useful format:

normalisation_stats <- normalisation_stats %>% 
  pivot_wider(names_from="statistic", values_from="value")

normalisation_stats
| terms | id | mean | sd |
|---|---|---|---|
| age | normalize_4VGX5 | 32.73360 | 10.404231 |
| height | normalize_4VGX5 | 68.19399 | 3.952007 |

Reverse-engineering the process for these variables, we get:

Click here to see the code
age_stats <- normalisation_stats %>% filter(terms == "age")
height_stats <- normalisation_stats %>% filter(terms == "height")

df_anomalies$age <- (df_anomalies$age * age_stats$sd) + age_stats$mean
df_anomalies$height <- (df_anomalies$height * height_stats$sd) + height_stats$mean

# Height is shown in inches, to see it in cm do df_anomalies %>% mutate(height=2.54*height)
df_anomalies
| outlier_type | PC01 | PC02 | age | height | sex_m |
|---|---|---|---|---|---|
| upper side | 2.134535 | 5.492575 | 28 | 94 | 0 |
| upper side | 1.730021 | 6.131443 | 21 | 95 | 1 |
| upper side | 3.082110 | 5.090541 | 28 | 91 | 1 |
| lower side | 1.563825 | -4.635300 | 24 | 43 | 1 |
| lower side | 1.084287 | -4.141511 | 56 | 53 | 0 |
| lower side | 1.750879 | -4.853435 | 20 | 43 | 0 |

πŸ’‘ The outliers on the upper side are unusually tall people, and those on the lower side are unusually short people.
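
If you want to see these profiles in their original units without reversing the normalisation by hand, you can also filter the un-normalised df_okcupid directly; a sketch (the height cut-offs below are simply read off the table above and are purely illustrative):

df_okcupid %>%
  filter(height >= 91 | height <= 53) %>%
  select(age, height, sex)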

  4. A small digression: compare this to how k-means divides the data into k=2 clusters:
Click here to see the code
kmeans_fit <-
  pca_results %>% 
  select(PC01, PC02) %>% 
  # k-means starts from random centres, so call set.seed() beforehand if you want reproducible clusters
  kmeans(centers=2)

# Save the cluster back to the data frame
pca_results$kmeans_cluster <- factor(kmeans_fit$cluster)

ggplot(pca_results, 
       aes(PC01, PC02, color=kmeans_cluster)) +
  geom_point(size=rel(3.5), alpha=0.3) +
  
  scale_colour_discrete(name="k-means\nCluster") + 
  scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +
  
   labs(title="k-means captures a completely different pattern", 
       subtitle="What do you make of it?",
       caption="Figure 6. Clusters identified by\nk-means with k=2 clusters") +
  # Prettify plot a bit
  theme_bw(base_size=14)

  5. Let’s see if the Local Outlier Factor (LOF) performs better than DBSCAN.

We use the lof() function to calculate local outlier factors:

lof_results <- pca_results %>%
  select(PC01, PC02) %>%
  # 5 is number of neighbors used to calculate the LOF
  lof(5)

# Append the LOF score as a new column
pca_results$score <- lof_results
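
Before plotting, a quick way to gauge how many samples look anomalous is to count the scores above a threshold (1.5 here, matching the cut-off used in the histogram below; the threshold itself is illustrative):

pca_results %>% count(high_lof = score >= 1.5)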

What is the distribution of scores?

Click here to see the code
ggplot(pca_results, aes(x=score, fill=score >= 1.5)) +
  geom_histogram(binwidth=0.025) +
  
  scale_fill_manual(name="Comparatively\nHigh scores", 
                    values=c(`FALSE`="#878787", `TRUE`="#C63C4A")) +
  
   labs(title="The majority of samples have a score around 1.0", 
       subtitle="As we are looking for anomalies, our interest lies in the samples with outlier scores",
       caption="Figure 7. Distribution of LOF scores") +
  # Prettify plot a bit
  theme_bw(base_size=14) 

We can now plot the results of LOF.

Click here to see the code
ggplot(pca_results, aes(PC01, PC02, size=score, alpha=score >= 1.5, color=score)) +
  geom_point() +

  scale_alpha_manual(name="LOF scores", values=c(`FALSE`=0.1, `TRUE`=0.6), guide="none") +
  scale_radius(name="LOF scores", range=c(1, 10)) +
  scale_color_viridis_c(name="LOF scores") +  
  
labs(title = "Samples coloured and sized according to their LOF scores", caption = "Figure 8. Samples coloured and sized\naccording to their LOF scores") +

# I tweaked those manually after seeing the plot 
scale_y_continuous(breaks=seq(-7.5, 7.5, 2.5), limits=c(-7.5, 7.5)) +

# Prettify plot a bit 
theme_bw(base_size=14) 
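
To connect the two methods, you can pull out the highest-scoring samples and check whether DBSCAN also flagged them as outliers (a sketch):

pca_results %>%
  arrange(desc(score)) %>%
  slice_head(n = 10) %>%
  select(PC01, PC02, score, dbscan_cluster)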

πŸ—£ Discussion: Does LOF perform better than DBSCAN at detecting β€˜anomalous’ samples?

References

Kim, Albert Y., and Adriana Escobedo-Land. 2015. β€œOkCupid Data for Introductory Statistics and Data Science Courses.” Journal of Statistics Education 23 (2): 5. https://doi.org/10.1080/10691898.2015.11889737.

Footnotes

  1. We’re gonna cry a little bit, not gonna lie. But no hard feelings. We’ll get over it.