🛣️ LSE DS202A 2025: Week 07 - Lab Roadmap

Author

The DS202 Team

Published

15 Sep 2025

Welcome to week 7 of DS202A!
This lab explores three approaches to dimensionality reduction: PCA, MCA, and Autoencoders.

🥅 Learning Objectives

By the end, you will be able to:

  • Explain the motivation behind dimensionality reduction
  • Apply PCA for continuous data
  • Use MCA for categorical data
  • Implement autoencoders for non-linear dimensionality reduction
  • Compare and interpret results across methods

⚙️ Setup (5 mins)

Nuvolos

From this lab onwards, you’ll be using the Nuvolos cloud platform. The environment there is already configured for you, so there is very little setup to do.

If you are using Nuvolos for the first time, follow this guide first and then this one.

After you open RStudio on Nuvolos, you can open the student notebook file and proceed as usual.

Loading libraries

# Load required libraries
library(corrplot)
library(factoextra)
library(FactoMineR)
library(patchwork)
library(psychTools)
library(reshape2)
library(rsample)
library(tidymodels)
library(tidyverse)
library(torch)

(You might have to install new libraries first: install.packages(c('factoextra', 'FactoMineR', 'psychTools', 'reshape2', 'torch')), or librarian::shelf(factoextra, FactoMineR, psychTools, reshape2, torch) if you have the librarian package installed. Note that torch may prompt you to download additional libraries the first time you load it.)

🧑‍🏫 Teaching Moment: The Curse of Dimensionality 🪄

Tip

As the number of variables (dimensions) grows:

  • Data become sparse — points sit far apart; similarity becomes harder to judge.
  • Models risk overfitting — they can memorise quirks of high-dimensional noise.
  • Computation grows — training becomes slower and sometimes unstable.

Dimensionality reduction helps by compressing many variables into fewer, informative dimensions that preserve most of the structure:

  • PCA (continuous data): finds new linear axes (principal components) that capture the largest variance.
  • MCA (categorical data): places categories in space using chi-square distances based on co-occurrence patterns.
  • Autoencoders (neural nets): learn to squeeze inputs into a small set of numbers and rebuild them, enabling non-linear compressions when needed.

Why it matters in practice:

  • Fewer, well-chosen dimensions can reduce overfitting, speed up downstream models, and clarify the structure you’re modelling.
  • Different methods make different assumptions (linearity, data type, distance), which guides which one you should use.
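
To make the sparsity point concrete, here is a small base-R simulation (illustrative only, not part of the lab pipeline): as the number of dimensions grows, randomly scattered points end up farther apart and their pairwise distances become relatively more alike, which is what makes notions like "nearest neighbour" less meaningful.

# Illustrative only: 100 random points in d dimensions
set.seed(1)
sapply(c(2, 10, 100, 1000), function(d) {
  x <- matrix(runif(100 * d), nrow = 100)        # 100 points in d dimensions
  dists <- dist(x)                               # all pairwise distances
  c(dimensions = d,
    mean_distance = mean(dists),
    relative_spread = sd(dists) / mean(dists))   # shrinks as d grows
})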

Part I: Principal Component Analysis (PCA) — Continuous Data (20 mins)

📊 About the Data

The Holzinger–Swineford IQ dataset (1939) 1 measures performance on multiple cognitive tests. We use PCA to summarise broad ability patterns.

🗣 Class Discussion:

Two IQ tests are strongly correlated. What should you expect PCA to do with them?

Load and Prepare Data

# Keep the case ID, school, and the test-score columns;
# drop demographic variables and two extra test columns,
# and convert school to a factor
data <- holzinger.swineford |>
  as_tibble() |>
  select(-c(t25_frmbord2, t26_flags, mo, ageyr, grade, female, agemo)) |>
  mutate(across(c(school), as.factor))

# 80/20 train/test split, stratified by school
set.seed(12345)
data_split <- initial_split(data, strata = "school", prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)

Explore Relationships

train_data |>
  select(starts_with("t")) |>
  cor() |>
  melt(value.name = "correlation") |>
  ggplot(aes(x = Var1, y = Var2, fill = correlation)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "IQ Test Correlation Matrix")

💭 Reflection: What can you say about this correlation heatmap?

Apply PCA

# PCA recipe
pca_recipe <- recipe(school ~ ., data = train_data) |>
  update_role(case, new_role = "ID") |>
  step_zv(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(
    all_numeric_predictors(),
    num_comp = 5,
    keep_original_cols = TRUE
  ) |>
  prep()

# Transform training data
pca_data <- pca_recipe |>
  bake(train_data) |>
  select(case, PC1, PC2, PC3, PC4, PC5, school, everything())
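
As an optional cross-check (not needed for the rest of the lab), you can run the same PCA with base R's prcomp() on the test-score columns. Because the recipe centres and scales the predictors before step_pca, the two fits should agree up to sign flips of the components.

# Optional cross-check with base R: same variables, centred and scaled
pca_base <- train_data |>
  select(starts_with("t")) |>
  prcomp(center = TRUE, scale. = TRUE)

summary(pca_base)               # variance explained per component
head(pca_base$rotation[, 1:2])  # loadings on the first two components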

Interpret PCA Results

# Extract the fitted prcomp object stored by step_pca (the recipe's third step)
pca_fit <- pca_recipe$steps[[3]]$res
eigenvalues <- pca_fit |> tidy(matrix = "eigenvalues")

# Variance explained
p1 <- eigenvalues |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(percent, 2)), vjust = -0.3) +
  labs(title = "Variance Explained by Each PC", y = "Percent Variance") +
  theme_minimal()

# Cumulative variance
p2 <- eigenvalues |>
  ggplot(aes(PC, cumulative)) +
  geom_col(fill = "darkgreen") +
  geom_text(aes(label = round(cumulative, 2)), vjust = -0.3) +
  labs(title = "Cumulative Variance Explained", y = "Cumulative Percent") +
  theme_minimal()

p1 + p2
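
There is no single right number of components to keep. One common (if crude) heuristic is an explained-variance threshold; the sketch below uses the eigenvalues tibble from above and is written so it works whether cumulative is stored as a proportion or a percentage.

# Smallest number of components whose cumulative variance reaches ~80% of the total
eigenvalues |>
  filter(cumulative / max(cumulative) >= 0.8) |>
  slice_head(n = 1) |>
  pull(PC)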

💬 Reflection

  • We set num_comp = 5 in our PCA step. Why might that be reasonable for this dataset?

  • From the plots above, how much cumulative variance is explained by 2, 3, or 5 components?

  • Why might 2 components be chosen by default for visualisation, and what are the risks of doing so without checking variance explained?

Visualise First Two Components

ggplot(pca_data, aes(PC1, PC2, color = school)) +
  geom_point(alpha = 0.7, size = 2) +
  labs(
    title = "First Two Principal Components",
    subtitle = "Colored by school"
  ) +
  theme_minimal()
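
For an alternative view, factoextra (already loaded) can draw a biplot directly from the prcomp object pca_fit extracted earlier, overlaying the variable loadings on the observation scores.

# Optional: biplot of observations and variable loadings
fviz_pca_biplot(pca_fit, repel = TRUE, label = "var") +
  labs(title = "PCA Biplot: IQ Tests") +
  theme_minimal()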

Visualise PCA Loadings (what defines each component?)

pca_loadings <- pca_fit |>
  tidy(matrix = "loadings") |>
  filter(PC %in% c(1, 2, 3)) |>
  mutate(
    PC = factor(PC, levels = c("1", "2", "3")),
    abs_value = abs(value),
    sign = ifelse(value >= 0, "Positive", "Negative")
  )

ggplot(
  pca_loadings,
  aes(x = reorder(column, abs_value), y = value, fill = sign)
) +
  geom_col() +
  facet_wrap(~PC, scales = "free_y") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "steelblue", "Negative" = "coral")) +
  labs(
    x = "Original Variables",
    y = "Loading Value",
    fill = "Direction"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 8),
    strip.text = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

💭 Reflection: What does this loadings plot tell you?

Part II: Multiple Correspondence Analysis (MCA) — Categorical Data (20 mins)

🌍 About the Data

In this part, you’ll use items from the World Values Survey (WVS) on ethical attitudes.

“The World Values Survey (WVS) is an international research program devoted to the scientific and academic study of social, political, economic, religious and cultural values of people in the world. The project’s goal is to assess which impact values stability or change over time has on the social, political and economic development of countries and societies” (Source: World Values Survey website)

For more details on the WVS, see the World Values Survey website or the reference in footnote 2.

Original responses are 0–10; here each item is binarised into “support” vs “oppose” using the global median for interpretability.

🗣 Class discussion

  • Why can’t you run PCA on categorical variables?
  • What might “distance” mean in MCA?

Load and Prepare Ethical Norms Data

# Shorter, readable labels for the 19 ethical-attitude items
values <- c(
  "cheating_benefits", "avoid_transport_fare", "stealing_property",
  "cheating_taxes", "accept_bribes", "homosexuality", "sex_work",
  "abortion", "divorce", "sex_before_marriage", "suicide",
  "euthanasia", "violence_against_spouse", "violence_against_child",
  "social_violence", "terrorism", "casual_sex", "political_violence",
  "death_penalty"
)

# Binarise each item at its own median: "y" = above the median, "n" = at or below
norms <- read_csv("data/wvs-wave-7-ethical-norms.csv") |>
  mutate(across(everything(), ~ as.factor(if_else(.x > median(.x), "y", "n"))))

# Rename the columns with the shorter labels defined above
colnames(norms) <- values
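
Before fitting the MCA, it is worth a quick look at how the binarisation turned out. The sketch below (using the norms tibble you just built) shows the share of respondents coded "y" for each item.

# Share of respondents coded "y" for each binarised item
norms |>
  summarise(across(everything(), ~ mean(.x == "y"))) |>
  pivot_longer(everything(), names_to = "item", values_to = "share_y") |>
  arrange(desc(share_y))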

Apply MCA

# Treat the first variable as supplementary (projected after axes are built)
mca <- MCA(norms, quali.sup = 1, graph = FALSE)

💬 Reflection: Why mark a variable as supplementary here?

Visualise MCA Results

# Biplot: categories in 2D space
fviz_mca_biplot(mca, repel = TRUE, label = "var", invisible = "ind") +
  labs(title = "MCA Biplot: Ethical Attitudes") +
  theme_minimal()

# Contribution of variables to the first dimension
fviz_contrib(mca, choice = "var", axes = 1, top = 10) +
  labs(title = "Contribution to Dimension 1") +
  theme_minimal()
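
The biplot only shows the first two dimensions. To see how much of the total inertia (variance) they actually capture, you can inspect the eigenvalues; a quick sketch with factoextra, using the mca object fitted above:

# Inertia captured by each MCA dimension
get_eigenvalue(mca) |> head()

# Scree plot of the leading dimensions
fviz_screeplot(mca, addlabels = TRUE) +
  labs(title = "MCA Scree Plot") +
  theme_minimal()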

Part III: Autoencoders — Non-linear Dimensionality Reduction (25 mins)

🧠 Concept

An autoencoder is a small neural network that tries to copy its input using fewer numbers:

  • The encoder first squeezes many numbers into a smaller set (encoded features).
  • The decoder then tries to rebuild the original from those few numbers.

If rebuilding is accurate, those few numbers capture most of the important information.

👉 Here, you return to the Holzinger–Swineford IQ dataset and learn a compact representation of the test scores.

💬 Reflection: How is this similar to PCA? What extra flexibility might a neural network provide?

Prepare Data for Autoencoders

# Keep the test-score columns used in Part I; drop the ID, demographics, and school
autoencoder_data <- holzinger.swineford |>
  as_tibble() |>
  select(-c(case, t25_frmbord2, t26_flags, mo, ageyr, grade, school, agemo, female))

# Remove zero-variance columns, normalise, and return the processed data
data_cleaned <- recipe(~., data = autoencoder_data) |>
  step_zv(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)
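
The architecture below is defined with input_dim = 24, so it is worth confirming that this matches the prepared data before going further.

# The autoencoder below assumes 24 input features; check this against the data
ncol(data_cleaned)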

Define Autoencoder Architecture

iq_autoencoder <- nn_module(
  "IQAutoencoder",

  initialize = function(input_dim = 24, encoding_dim = 6) {
    # Encoder
    self$encoder <- nn_sequential(
      nn_linear(input_dim, 16),
      nn_relu(),
      nn_dropout(0.2),
      nn_linear(16, 12),
      nn_relu(),
      nn_dropout(0.2),
      nn_linear(12, encoding_dim)
    )
    # Decoder
    self$decoder <- nn_sequential(
      nn_linear(encoding_dim, 12),
      nn_relu(),
      nn_dropout(0.2),
      nn_linear(12, 16),
      nn_relu(),
      nn_dropout(0.2),
      nn_linear(16, input_dim)
    )
  },

  forward = function(x) {
    encoded <- self$encoder(x)
    decoded <- self$decoder(encoded)
    return(decoded)
  },

  encode = function(x) {
    self$encoder(x)
  }
)
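
Before training, a quick optional shape check can catch dimension mistakes early; tmp_model and tmp_input below are just illustrative names, not part of the lab pipeline.

# Optional sanity check: confirm the encoder/decoder output shapes
tmp_model <- iq_autoencoder(input_dim = 24, encoding_dim = 6)
tmp_input <- torch_randn(5, 24)       # 5 fake observations, 24 features
dim(tmp_model(tmp_input))             # expect 5 x 24 (reconstruction)
dim(tmp_model$encode(tmp_input))      # expect 5 x 6 (encoded features)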

Train the Autoencoder

# Split to emulate train/validation
ae_split <- initial_split(as.data.frame(data_cleaned), prop = 0.8)
train_df <- training(ae_split)
val_df <- testing(ae_split)

train_data_ae <- torch_tensor(as.matrix(train_df), dtype = torch_float32())
val_data_ae   <- torch_tensor(as.matrix(val_df), dtype = torch_float32())

# Initialize and train
model <- iq_autoencoder(input_dim = 24, encoding_dim = 6)
optimizer <- optim_adam(model$parameters, lr = 0.001)
criterion <- nn_mse_loss()

epochs <- 100
train_losses <- c()

cat("Training autoencoder...\n")
for (epoch in 1:epochs) {
  model$train()
  output <- model(train_data_ae)
  loss <- criterion(output, train_data_ae)
  optimizer$zero_grad()
  loss$backward()
  optimizer$step()
  train_losses <- c(train_losses, loss$item())
  if (epoch %% 20 == 0) cat(sprintf("Epoch %d: Loss %.4f\n", epoch, loss$item()))
}
cat("Training complete!\n")

🗣️ Class discussion

  • Why use MSE here?
  • What are its limits and what else could you use?

Extract and Analyse Encoded Representations

model$eval()
data_tensor <- torch_tensor(as.matrix(data_cleaned), dtype = torch_float32())

with_no_grad({
  encoded_data <- as.data.frame(as.matrix(model$encode(data_tensor)$cpu()))
  colnames(encoded_data) <- paste0("AE_Component_", 1:6)
})

# Training loss plot
tibble(epoch = 1:epochs, loss = train_losses) %>%
  ggplot(aes(epoch, loss)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Autoencoder Training Loss",
       y = "Reconstruction Loss",
       x = "Epoch") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

💭 Reflection: What does this plot tell you?

# Heatmap: correlations between original variables and AE components
correlations_ae <- cor(as.matrix(data_cleaned), encoded_data)
rownames(correlations_ae) <- colnames(data_cleaned)

corrplot(correlations_ae,
         method = "color",
         title = "Original IQ Tests vs Autoencoder Components",
         mar = c(0, 0, 2, 0),
         tl.cex = 0.8)

# Quick 2D view of the first two AE components
ggplot(encoded_data, aes(x = AE_Component_1, y = AE_Component_2)) +
  geom_point(alpha = 0.6, color = "darkred", size = 2) +
  labs(
    title = "Autoencoder Components 1 vs 2",
    x = "AE_Component_1",
    y = "AE_Component_2"
  ) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
  theme_minimal()

Part IV: Method Comparison and Discussion (15 mins)

Side-by-Side Comparison

if (exists("pca_data") && exists("encoded_data")) {
  p_pca <- pca_data |>
    ggplot(aes(PC1, PC2)) +
    geom_point(alpha = 0.6, color = "blue") +
    labs(title = "PCA: First 2 Components") +
    theme_minimal()

  p_ae <- encoded_data |>
    ggplot(aes(AE_Component_1, AE_Component_2)) +
    geom_point(alpha = 0.6, color = "red") +
    labs(title = "Autoencoder: First 2 Components") +
    theme_minimal()

  p_pca + p_ae
}
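
The scatter plots are only a visual comparison. As a rough quantitative sketch (assuming data_cleaned and the trained model from Part III are still in memory), you can compare the reconstruction error from six principal components with that of the six-dimensional autoencoder on the same normalised data. Lower is better, though keep in mind the autoencoder was trained on only 80% of these rows.

# Rough comparison of reconstruction error: 6 PCs vs the 6-dimensional autoencoder
X <- as.matrix(data_cleaned)

# PCA reconstruction using 6 components (data already centred and scaled)
pca_full <- prcomp(X, center = FALSE, scale. = FALSE)
X_hat_pca <- pca_full$x[, 1:6] %*% t(pca_full$rotation[, 1:6])
mse_pca <- mean((X - X_hat_pca)^2)

# Autoencoder reconstruction (trained model from Part III)
model$eval()
with_no_grad({
  X_hat_ae <- as.matrix(model(torch_tensor(X, dtype = torch_float32())))
})
mse_ae <- mean((X - X_hat_ae)^2)

c(pca_mse = mse_pca, autoencoder_mse = mse_ae)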

🗣️ Class discussion

  • How do auto-encoders and PCA compare?
  • What does this plot tell you?

🧹 Pre-processing & Missing Values

💬 Reflection

Each method you’ve used today — PCA, MCA, and Autoencoders — made different assumptions about the input data. Before you decide which one to apply to a new dataset, ask yourself:

  1. Can this method handle missing values directly, or do I need to impute first?
  2. Does the method require normalised/scaled variables, or can I leave raw values?
  3. What does “distance” mean for the algorithm — and does scaling or encoding affect that distance?

Think these questions through before you apply any of these methods to a new dataset.

Key Takeaways

🔑 When to use each method:

  • PCA: When you have continuous data and want interpretable linear combinations
  • MCA: When working with categorical data or contingency tables
  • Autoencoders: When you suspect non-linear relationships or need flexible architectures

Discussion Questions

  1. Which method captured the most information with the fewest dimensions?
  2. How do the visualisations differ and what do they imply?
  3. When would you choose autoencoders over PCA?
  4. What are the trade-offs between interpretability and flexibility?
  5. What if the data were longitudinal, not cross-sectional?

Extension Activities

🚀 Try these on your own:

  1. Experiment with different numbers of components/dimensions
  2. Apply these methods to your own datasets
  3. Combine dimensionality reduction with clustering or classification
  4. Explore other variants like kernel PCA or variational autoencoders

Congratulations! You’ve now experienced three fundamental approaches to dimensionality reduction. Each has its strengths and appropriate use cases in your toolkit.

Footnotes

  1. Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monographs, no. 48. Chicago: University of Chicago, Department of Education.
    See also the package documentation (run ?holzinger.swineford in the console) for a fuller description of the dataset.

  2. Haerpfer, C., Inglehart, R., Moreno, A., Welzel, C., Kizilova, K., Diez-Medrano J., M. Lagos, P. Norris, E. Ponarin & B. Puranen (eds.). 2022. World Values Survey: Round Seven – Country-Pooled Datafile Version 6.0. Madrid, Spain & Vienna, Austria: JD Systems Institute & WVSA Secretariat. doi:10.14281/18241.24