---
title: 'Week 07 - Student Notebook'
subtitle: Dimensionality reduction using PCA and UMAP
author: MY NAME (MY CANDIDATE NUMBER)
date: 03 March 2025
---

Let's start by loading the required packages!

In [2]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from umap import UMAP
import plotly.express as px
from lets_plot import *
LetsPlot.setup_html()



# Exploring ethical values and norms in the World Values Survey (15 minutes)

The World Values Survey (WVS) covers 19 ethical issues, each of which respondents rate from 0 (never justifiable) to 10 (always justifiable):

-   `Q177`: Claiming government benefits to which you are not entitled 
-   `Q178`: Avoiding a fare on public transport
-   `Q179`: Stealing property 
-   `Q180`: Cheating on taxes if you have a chance 
-   `Q181`: Someone accepting a bribe in the course of their duties 
-   `Q182`: Homosexuality
-   `Q183`: Prostitution
-   `Q184`: Abortion
-   `Q185`: Divorce
-   `Q186`: Sex before marriage
-   `Q187`: Suicide
-   `Q188`: Euthanasia
-   `Q189`: For a man to beat his wife
-   `Q190`: Parents beating children
-   `Q191`: Violence against other people
-   `Q192`: Terrorism as a political, ideological or religious means
-   `Q193`: Having casual sex
-   `Q194`: Political violence
-   `Q195`: Death penalty

🗣️ **CLASSROOM DISCUSSION:** How could we explore relationships between these features?

🎯 **ACTION POINT:** Read the data set into Python (set `low_memory = False`)

In [None]:
wvs = pd.read_csv("../data/wvs-wave-7-ethical-norms.csv", low_memory=False)

🎯 **ACTION POINT:** Rename the variables to better reflect the ethical value / norm

In [4]:
# Create a list of columns to keep


# Create a list of descriptive labels for the columns


# Create a dictionary to rename the columns in the data frame, adapt the below code:
# vars_renamed = dict(zip(list of columns to keep, list of descriptive labels))

# Employ the changes to data frame
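A minimal sketch of the renaming pattern outlined in the comments above, using a toy data frame rather than the WVS file. The three descriptive labels and the extra `country` column are illustrative assumptions, not names taken from the data set.

```python
import pandas as pd

# Toy stand-in for the WVS data (values are made up for illustration)
toy = pd.DataFrame({
    "Q177": [0, 3, 10],
    "Q178": [1, 5, 2],
    "Q179": [0, 0, 1],
    "country": ["A", "B", "C"],  # hypothetical extra column that we drop
})

# Columns to keep and (hypothetical) descriptive labels, in the same order
cols_to_keep = ["Q177", "Q178", "Q179"]
labels = ["claiming_benefits", "fare_avoidance", "stealing"]

# Pair each original column name with its descriptive label
vars_renamed = dict(zip(cols_to_keep, labels))

# Keep only the selected columns, then rename them
toy_clean = toy[cols_to_keep].rename(columns=vars_renamed)
print(toy_clean.columns.tolist())
```

Because `zip` pairs items by position, the two lists must be in the same order for the labels to land on the right columns.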


# Principal Component Analysis with the World Values Survey (45 minutes)

Besides making data exploration and visualization difficult, high-dimensional data comes with other challenges too, particularly for supervised learning.

::: callout-tip
## Pay Attention

üßë‚Äçüè´ **TEACHING MOMENT:**

### The Curse of Dimensionality ü™Ñ

No, that's not a Harry Potter spell but it refers to the problems that often arise when dealing with high dimensional data (ie lots of variables).

Machine learning models tend to perform badly with too many features - there is a higher chance for overfitting, plus there is also a much bigger computational cost associated to training models with many variables.

### Dimensionality Reduction to the Rescue üßë‚Äç‚öïÔ∏è

We want to reduce the number of variables. Intuitively, this could work because there are often strong correlations between many of the individual variables. In other words, there is redundancy in the data.

Dimensionality reduction refers to a collection of methods that can help reduce the number of variables while preserving most of the information. PCA is one such method.

### Principal Component Analysis (PCA) 💡

Rather than deciding which variables to keep and which ones to throw out (that would fall under "*variable selection*"), PCA compresses our high-dimensional data into a low-dimensional set of new variables while retaining as much information from the original data as possible.

:::
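To make the idea of redundancy concrete, here is a small sketch on synthetic data (not the WVS): three noisy features that are all driven by one hidden factor. The sample size and noise level are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# One hidden factor drives three noisy, strongly correlated features
factor = rng.normal(size=500)
X = np.column_stack([factor + 0.1 * rng.normal(size=500) for _ in range(3)])

# Fit PCA and inspect the share of variance explained by each component
pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(3))
```

Because the three features carry almost the same information, the first component should account for nearly all of the variance, which is what makes compression possible.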

To implement PCA, we must first standardise our variables. Up until now, we have used a user-defined function for this. However, because we have only numeric features, we can perform the data transformation and PCA in one step using pipelines! A `Pipeline` takes a list of tuples of the form `(user-defined step name, model instantiation)`, which are run in order whenever the pipeline is fitted or applied.
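A minimal sketch of such a pipeline on synthetic data. The step names `scaler` and `pca` are arbitrary labels chosen here (any strings would do), and the feature scales are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: 200 rows, 5 features on very different scales
X = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 0.1, 5])

# Each step is a (name, estimator) tuple; steps run in the order listed
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
])

# Fitting the pipeline scales the data and then fits PCA on the scaled values
pipe.fit(X)
print(list(pipe.named_steps))
```

Scaling first matters because PCA is driven by variance: without it, the feature with the largest scale would dominate the first component.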

🎯 **ACTION POINT:** Use `Pipeline` to combine `StandardScaler()` and `PCA()` and fit the pipeline to the cleaned data.

In [5]:
# Create a pipeline whereby all features are scaled


# Fit the pipeline to the full data set


🎯 **ACTION POINT:** With pipelines, accessing the attributes of each step may seem tricky at first. Luckily, any pipeline has a `named_steps` attribute, which lets you explore the attributes of a specific step. Try this with the `explained_variance_ratio_` attribute, which gives the proportion of the total variance explained by each principal component. **Ideally, we would like to see a graph showing the cumulative proportion of variance explained over the first 10 principal components.**

In [1]:
# Code here
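One possible way to approach this, sketched on synthetic data. The step names `scaler` and `pca` and the 12 random features are assumptions; the resulting data frame could then be passed to a plotting library of your choice.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 12))  # synthetic stand-in for the survey items

pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA())]).fit(X)

# Reach into the fitted PCA step via named_steps
ratios = pipe.named_steps["pca"].explained_variance_ratio_

# Cumulative share of variance explained by the first 10 components
cum_var = pd.DataFrame({
    "component": np.arange(1, 11),
    "cumulative_variance": np.cumsum(ratios)[:10],
})
print(cum_var.head())
```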

🗣️ **CLASSROOM DISCUSSION:** Do we notice anything in particular?

🎯 **ACTION POINT:** Use `fit_transform` on the cleaned data to create an array. Use `pd.DataFrame()` to transform this output into a single data frame. Try using a list comprehension to create a series of variable names such as pc1, ..., pc19.

In [6]:
# Employ fit_transform to create an array


# Convert the array to a data frame and rename the columns
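A hedged sketch of these two steps on synthetic data with four features (so the names run from pc1 to pc4 rather than pc19). The step names and the random data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))  # synthetic stand-in with 4 features

pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA())])

# fit_transform returns a NumPy array of component scores
scores = pipe.fit_transform(X)

# Build the names pc1, pc2, ... with a list comprehension
pc_names = [f"pc{i}" for i in range(1, scores.shape[1] + 1)]
scores_df = pd.DataFrame(scores, columns=pc_names)
print(scores_df.columns.tolist())
```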


🎯 **ACTION POINT:** Plot the first two principal components.

In [7]:
# Code here

üë®üèª‚Äçüè´ **TEACHING MOMENT:**  Why is there no correlation between the first and second principle components?

We can interpret each principal component by examining its loadings, which describe how strongly each feature contributes to it. We assess two aspects of the loadings:

-   **Magnitude:** larger absolute loadings show that a given feature will have a greater influence on the principal component.
-   **Sign:** positive / negative loadings indicate that a feature will contribute positively / negatively to the principal component.

In [None]:
# Create a list of data frames for each loading


# Concatenate the list of data frames to create a single data frame


# Create a new column showing the absolute value of the loading
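The three steps in the comments above could be sketched as follows on synthetic data. This treats the rows of scikit-learn's `components_` attribute as the loadings, which is an assumption about what the notebook intends; the feature names `f1` to `f3` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
features = ["f1", "f2", "f3"]  # hypothetical feature names
X = StandardScaler().fit_transform(rng.normal(size=(200, 3)))

pca = PCA().fit(X)

# One data frame per component: each row of components_ holds that component's loadings
loading_dfs = [
    pd.DataFrame({
        "component": f"pc{i + 1}",
        "feature": features,
        "loading": pca.components_[i],
    })
    for i in range(pca.n_components_)
]

# Stack into a single long data frame and add the absolute loading
loadings = pd.concat(loading_dfs, ignore_index=True)
loadings["abs_loading"] = loadings["loading"].abs()
print(loadings.head())
```

The long format (one row per component and feature) is convenient for faceted bar charts, since the `component` column can be used directly as the facet variable.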


🎯 **ACTION POINT:** Plot the loadings of the first 4 principal components using a bar graph. Add `facet_wrap(facets="component")` to the plot to see the loadings for each principal component in isolation.

In [8]:
# Code here

🗣️ **CLASSROOM DISCUSSION:** How can we describe each principal component?

-   **PC1:** No features stand out - in fact, it looks like the principal component indicates moderate positions on all values / norms.
-   **PC2:** Interestingly, issues such as sex before marriage and abortion tend to positively influence the principal component, whereas issues such as the use of violence tend to negatively influence it.
-   **PC3:** We see almost the flip side of PC2, although avoiding transport fares and the death penalty influence the principal component the most.
-   **PC4:** This is the most interesting. On the one hand, we see a strong positive influence of the death penalty, yet political violence and terrorism strongly and negatively influence the principal component.

## Part II: Using UMAP on the Varieties of Democracy Data Set (30 mins)

To round off our exploration of dimensionality reduction, we will look at UMAP (Uniform Manifold Approximation and Projection). UMAP, in essence, takes data with a high number of dimensions and compresses it into a low-dimensional feature space (here, two dimensions).

👉 **NOTE:** Andy Coenen and Adam Pearce of Google PAIR have put together an excellent tutorial on UMAP (click [here](https://pair-code.github.io/understanding-umap/)).

In [9]:
vdem = pd.read_csv("../data/vdem-data-subset.csv")

To explore this in the context of social science data, we will be looking at the Varieties of Democracy data set (2010-2023). Specifically, we will be looking at four different variables:

-   `v2x_polyarchy`: Index of free and fair elections.
-   `v2x_libdem`: Index of liberal democracy (protection of individuals / minorities from the tyranny of the state/majority).
-   `v2x_partipdem`: Index of participatory democracy (participation of citizens in all political processes).
-   `regime`: Four-fold typology of political regimes, namely, closed autocracy, electoral autocracy, electoral democracy, liberal democracy.

We will represent the three indices using a 3-d scatter plot, using colour to distinguish between regime type:

In [10]:
fig = px.scatter_3d(
    vdem, 
    x="v2x_polyarchy", 
    y="v2x_libdem", 
    z="v2x_partipdem", 
    color="regime",
    title="3D Scatter Plot of Democracy Variables"
)

fig.update_layout(
    scene=dict(
        xaxis_title="Polyarchy",
        yaxis_title="Liberal",
        zaxis_title="Participation"
    )
)

fig.show()

This is all well and good, but how can we represent this data in a 2-dimensional space? One increasingly popular method is UMAP. While we need to do some hand-waving, as the mathematics is too complicated to be dealt with here, UMAP has a few key hyperparameters that can be experimented with:

-   Number of nearest neighbours used to construct the initial high-dimensional graph (`n_neighbors`).
-   Minimum distance between points in low dimensional space (`min_dist`).

We will instantiate a UMAP, setting `n_neighbors` to 100, and fit it using our continuous features.

In [11]:
# Set a random seed
np.random.seed(123)

# Instantiate a UMAP with the relevant hyperparameter choice
reducer = UMAP(n_neighbors=100)

# Create a subset of the data using only variables beginning with "v2x"
vdem_subset = vdem[vdem.columns[vdem.columns.str.startswith("v2x")]].to_numpy()

# Fit / transform the model to the subset of data to obtain the embeddings
embedding = reducer.fit_transform(vdem_subset)

# Convert the embeddings to a data frame
embedding = pd.DataFrame(embedding, columns = ["first_dim", "second_dim"])

After this, we can extract the 2-d space created by UMAP and plot it.

In [251]:
(
    ggplot(embedding, aes("first_dim", "second_dim")) +
    geom_point() +
    labs(x = "First dimension", y = "Second dimension")
)

To see how the embedding relates to regime type, we can add the regime classifications to this plot.

In [13]:
# Add country/year and regime information to the embedding
embedding = pd.concat([embedding, vdem.filter(items=["country_year","regime"])], axis= 1)

# Plot the results
(
    ggplot(embedding, aes("first_dim", "second_dim", colour="regime")) +
    geom_point() +
    labs(x = "First dimension", y = "Second dimension", colour = "")
)

Note that UMAP has created a 2-d map that aims to preserve the topological structure of the data. Admittedly, this can be hard to interpret at first. However, we can note the following:

-   The points form a semi-circle along which less democratic regime types become more prominent.
-   Despite this, we see considerable overlap between different regimes, indicating that classifying these countries at a given point in time as distinctively democratic / autocratic may not be warranted in some cases.
-   There is an outlier cluster that is distinct from the semi-circle configuration of points that UMAP provides. We can isolate these observations simply by filtering.

In [14]:
embedding.query("second_dim < -1.5")["country_year"].to_numpy()

array(['Burma/Myanmar-2022', 'Burma/Myanmar-2023', 'Yemen-2016',
       'Yemen-2017', 'Yemen-2018', 'Yemen-2019', 'Yemen-2020',
       'Yemen-2021', 'Yemen-2022', 'Yemen-2023', 'South Sudan-2017',
       'South Sudan-2018', 'South Sudan-2021', 'South Sudan-2022',
       'Afghanistan-2021', 'Afghanistan-2022', 'Afghanistan-2023',
       'North Korea-2010', 'North Korea-2011', 'North Korea-2012',
       'North Korea-2013', 'North Korea-2014', 'North Korea-2015',
       'North Korea-2016', 'North Korea-2017', 'North Korea-2018',
       'North Korea-2019', 'North Korea-2020', 'North Korea-2021',
       'North Korea-2022', 'North Korea-2023', 'Qatar-2010', 'Qatar-2011',
       'Qatar-2012', 'Qatar-2013', 'Qatar-2014', 'Qatar-2015',
       'Qatar-2016', 'Qatar-2017', 'Qatar-2018', 'Qatar-2019',
       'Qatar-2020', 'Qatar-2021', 'Qatar-2022', 'Qatar-2023',
       'Syria-2010', 'Syria-2011', 'Syria-2012', 'Syria-2013',
       'Syria-2014', 'Syria-2015', 'Syria-2016', 'Syria-2017',
       'Syr

This approach yields a very interesting finding: some of the most notoriously undemocratic regimes in the world, such as Afghanistan under the Taliban, North Korea, and Eritrea, all form part of this cluster.