πŸ’¬ Week 10 Lab - Comments

Theme: Dimensionality reduction and interpretation of algorithms.

Author

Dr Jon Cardoso-Silva

Below, you will find some of the reflections we would expect you to have after completing the πŸ’» Week 10 Lab.

πŸ“ Lab Tasks

πŸ“£ IMPORTANT

I am calling these comments rather than solutions because, as frustrating as it can be, most of these questions cannot be settled objectively. There are many ways to analyse a dataset, and it is possible that you would have done things correctly but differently.

The critical thing to remember is that we are always making decisions when analysing data.

It is crucial to justify those decisions clearly and ground them in the workings of our algorithms.

Part I - Meet a new dataset

The main point of this part was to reinforce once again the practices one would generally follow when meeting a new dataset. In particular, we focused on the following:

  • Data cleaning: we used the tidyverse package to clean the data and prepare it for analysis.
  • Data reshaping: sometimes the data is not in the format we need; in this case, we used the tidyr package to reshape it.
  • Feature engineering: we used recipes to create new features from the columns in the original dataset, especially the dummy variables (see the sketch after this list).
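
To make this concrete, here is a minimal sketch of the kind of recipes pipeline we have in mind. This is not the exact lab code; the data frame df_okcupid and its columns are assumed for illustration:

```r
# A minimal sketch, not the exact lab code; `df_okcupid` is assumed
library(tidyverse)
library(recipes)

rec <-
  recipe(~ ., data = df_okcupid) %>%
  step_dummy(all_nominal_predictors()) %>%   # e.g. orientation -> dummy columns
  step_normalize(all_numeric_predictors())   # put features on a comparable scale

df_prepped <- rec %>% prep() %>% bake(new_data = NULL)
```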

πŸ—£ CLASSROOM DISCUSSION

How would you explore this data if dimensionality reduction was not an option?

Exploratory Data Analysis (EDA) is your go-to approach when dimensionality reduction isn’t an option. Even if you are unfamiliar with the term, you will notice that this is something we’ve been practising since the beginning of the course. EDA involves generating summary statistics for data columns, visualising distributions and variable combinations, and more. It’s a highly creative and exploratory process, often guided by your curiosity or specific knowledge about how the data was generated.
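
For instance, a first EDA pass might look like the sketch below (assuming a data frame named df_okcupid with an age column):

```r
# A quick EDA pass (sketch; column names are assumptions)
library(tidyverse)

summary(df_okcupid)            # per-column summary statistics

df_okcupid %>%                 # distribution of a single variable
  ggplot(aes(x = age)) +
  geom_histogram(bins = 30)
```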

How would you answer: what are the most common types of profiles one can find on OK Cupid?

One method is to use the group_by and summarise functions for generating group-specific summary statistics. For instance, grouping by sex and orientation and then summarising the age column can reveal the average age per group. Additionally, using count helps determine the number of individuals in each group.
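
A sketch of what that could look like (assuming sex, orientation and age columns):

```r
library(tidyverse)

# Average age per sex/orientation group (sketch)
df_okcupid %>%
  group_by(sex, orientation) %>%
  summarise(mean_age = mean(age, na.rm = TRUE), .groups = "drop")

# Size of each group, largest first
df_okcupid %>%
  count(sex, orientation, sort = TRUE)
```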

Part II - The power of PCA

How could you use the plots above to investigate the most common types of OK Cupid profiles?

One approach is to focus on the most populated areas of the plot. For instance, we could determine the interquartile range for the first two principal components using df_pca %>% select(PC01, PC02) %>% summary(). Then, we could add a new column to our original dataset, say is_in_yellow_region (e.g. with mutate()), to indicate whether a sample falls within this range, effectively highlighting the yellow part of the plot.

Without delving too deeply, we could next apply the group_by and summarise functions to calculate the average age for each group (is_in_yellow_region = TRUE vs is_in_yellow_region = FALSE), or count to help us understand the size of each group, among other insights (see the sketch below).
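
Here is a hedged sketch of both steps, assuming df_pca holds the scores PC01 and PC02 in the same row order as df_okcupid:

```r
library(tidyverse)

# Flag samples whose PC01 and PC02 both fall within the interquartile range
# (a rough proxy for the densest, 'yellow' region of the plot)
bounds <- df_pca %>%
  summarise(across(c(PC01, PC02),
                   list(q1 = ~ quantile(.x, 0.25),
                        q3 = ~ quantile(.x, 0.75))))

df_okcupid <- df_okcupid %>%
  mutate(is_in_yellow_region =
           df_pca$PC01 >= bounds$PC01_q1 & df_pca$PC01 <= bounds$PC01_q3 &
           df_pca$PC02 >= bounds$PC02_q1 & df_pca$PC02 <= bounds$PC02_q3)

# Compare the two groups
df_okcupid %>%
  group_by(is_in_yellow_region) %>%
  summarise(n = n(), mean_age = mean(age, na.rm = TRUE))
```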

Perhaps more intriguing is a supervised learning approach. Here, is_in_yellow_region could serve as our target variable, and we could train a classifier to predict a sample’s likelihood of falling within the yellow region. Our focus would be on explaining the relationship between the target variable and the predictors rather than purely on prediction, as we have mostly done so far. For example, running a simple (preferably penalised) logistic regression could reveal which variables are most influential in the model, as indicated by lower p-values or, in the case of penalised logistic regression, by which variables were selected.
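
A minimal sketch of the penalised route using glmnet (one library among several that would work here; the formula and columns are assumptions, and tuning is omitted):

```r
library(glmnet)

# model.matrix expands factors into dummy variables; drop the intercept column
X <- model.matrix(is_in_yellow_region ~ ., data = df_okcupid)[, -1]
y <- as.factor(df_okcupid$is_in_yellow_region)

# alpha = 1 gives the lasso; cv.glmnet picks the penalty by cross-validation
fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)

# Non-zero coefficients are the 'selected' variables
coef(fit, s = "lambda.1se")
```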

How does Figure 4 help you interpret Figures 2 & 3?

Figure 4 illustrates the contribution of each feature to the principal components, clarifying the patterns observed as we move horizontally (left to right) and vertically (bottom to top) in Figures 2 & 3.

Specifically, consider how status_single and orientation_straight significantly influence PC01. Given that these variables are binary (FALSE=0 and TRUE=1), it’s reasonable to expect a higher representation of single and straight individuals as you move rightward on the x-axis in Figures 2 or 3.

It’s important to remember that PC01 encapsulates about 33% of the data’s β€˜information’ and that other variables also contribute to it. This means the relationship, while linear, isn’t perfect. For instance, non-straight individuals may still appear on the right side of the plot.
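
If you prefer to inspect this numerically rather than visually, the loadings are available from the fitted PCA object. A sketch, assuming the PCA was fitted with prcomp on the prepared feature table from earlier (note that prcomp names components PC1, PC2, … rather than PC01):

```r
pca <- prcomp(df_prepped)   # features already normalised by the recipe

# Largest loadings (in absolute value) on the first component
head(sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE), 10)

# Proportion of variance explained by PC1 (~33% in our case)
summary(pca)$importance["Proportion of Variance", "PC1"]
```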

How does the above help you think about the attributes of the most common type of OK Cupid profiles?

Putting it all together, we can use just the most relevant variables (as identified in Figure 4) to compare the is_in_yellow_region = TRUE and is_in_yellow_region = FALSE groups.

How would you interpret the 2D coordinates if instead of PC01 vs PC02, we had plotted PC01 vs PC03 in Figures 2 & 3?

The interpretation approach remains the same; simply substitute PC03 for PC02 in the previous discussion. A key distinction with PC03 is that you can anticipate an increase in the samples’ age and height as the Y-axis value rises, whereas with PC02, age and height point in opposite directions.

Part III - Anomaly detection techniques

How well do you think DBSCAN performs at anomaly detection on the two principal components?

It performs quite well, and we can perceive that visually: the algorithm identifies clear ‘outliers’ in the data. Just remember that we are looking at only the first two principal components, and these two PCs do not represent every variable in the data equally.
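
A hedged sketch with the dbscan package (eps and minPts are illustrative and would need tuning for the real data):

```r
library(tidyverse)
library(dbscan)

# DBSCAN on the first two principal components; cluster 0 marks noise points
clusters <- dbscan(df_pca %>% select(PC01, PC02), eps = 0.5, minPts = 10)

df_pca %>%
  mutate(is_noise = clusters$cluster == 0) %>%
  count(is_noise)
```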

Does LOF perform better than DBSCAN to detect β€˜anomalous’ samples?

LOF provides a different, complementary picture of the data. It doesn’t single out global outliers; instead, it scores how much of an outlier each sample is relative to its own neighbourhood.
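
A matching sketch with dbscan::lof (minPts is illustrative):

```r
library(tidyverse)
library(dbscan)

# Higher LOF scores indicate samples that are more 'anomalous'
# relative to their local neighbourhood
scores <- lof(df_pca %>% select(PC01, PC02), minPts = 10)

df_pca %>%
  mutate(lof_score = scores) %>%
  arrange(desc(lof_score)) %>%
  head(10)
```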


Note: After writing everything myself, I used GPT-4 to edit my responses so they were more inclusive and easier to understand. At times, I didn’t like the suggestions, so I kept my own versions. Then, I ran the text through Grammarly to fix any remaining issues.