DS202A-2023-AT-Exam

🗄️ The data

For this assignment, we use data initially sourced from Hansard - UK Parliament, specifically debate transcripts from the UK House of Commons in 2023.¹

📋 Your Tasks

What do we need from you?

Question 1 - The Simple One

Worth: 15 marks

The figure below provides a summary of the UK House of Commons data. However, it needs both a title and a subtitle.

Good titles and subtitles should encapsulate the core question addressed by the figure or its central findings. They should also provide additional insights beyond restating the axis labels.

Q1: Provide a meaningful and informative title and subtitle for Figure 1.
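For reference only: in ggplot2, a title and subtitle are attached to a plot with labs(). The sketch below uses the built-in mtcars data rather than the exam figure, and its title and subtitle are placeholder wording, not a model answer.

library(ggplot2)

# Toy example on mtcars, not the exam figure: labs() is where the title,
# subtitle and axis labels live.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title    = "A placeholder title stating the figure's main message",
    subtitle = "A placeholder subtitle adding context beyond the axis labels",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon"
  )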

Question 2 - The Challenge Begins

Worth: 35 marks

A session known as ‘Prime Minister’s Questions’ (PMQ) occurred on 31 of the 133 days when the House of Commons convened (always on Wednesdays). During PMQs, UK Members of Parliament (MPs) can directly pose questions to the Prime Minister (PM).

The plot below illustrates the structure and dynamics of the Q&As from two specific dates: the first and last PMQs of 2023. By ‘structure and dynamics’, we mean elements such as the sequence of questioning, which MPs put questions to the PM, the duration of questions and responses, and anything else that can be derived from the PMQ transcript other than the content of the speeches themselves.

Q2: Imagine that you have to prepare this PMQ data for the clustering algorithms covered in this course (K-Means and DBSCAN), with the goal of identifying the most typical and the least typical structures within all PMQs of 2023. How would you preprocess this data? Provide an example of how your final dataset would appear just before running the algorithms. What columns would it contain? And how would you create such columns?

Note: The plot you see above was created with ggplot2. While we won’t share the code behind it, you can get the data it’s based on – pmq-2023.csv – from the same place where you downloaded this .html file.
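If you want a first look at the file before answering, the sketch below shows one way to load and inspect it. It assumes pmq-2023.csv sits in your working directory; the column names mentioned in the comments are simply those referenced by the code shown under Q3.

library(readr)
library(dplyr)

# Assumes pmq-2023.csv is in the current working directory.
df_pmq <- read_csv("pmq-2023.csv")

# Inspect the columns; the Q3 code refers to debate_id, speech_order,
# speaker_name, speaker_main_position, speaker_position, speech_raw_text and url.
glimpse(df_pmq)

# How many speeches were recorded in each PMQ debate?
df_pmq %>% count(debate_id, sort = TRUE)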

Question 3 - Reverse engineering

Worth: 5 marks

Doing data analysis is much like posing questions to the available data. You might start with preconceived hypotheses about the dynamics behind the data, or simply be driven by curiosity about what the data can reveal.

Take a look at the plot and accompanying code below:

Code
# Just because this is a test, the full code is provided here with no documentation or comments. 
# Don't do this in real life. 
# Be kind to your future self and your fellow coders – comment your code and break it into readable chunks!

library(dplyr)
library(broom)
library(parsnip)
library(recipes)
library(rsample)
library(tune)
library(workflows)
library(yardstick)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)

df_pmq <- readr::read_csv("pmq-2023.csv")
df_pmq$speech_order <- as.integer(df_pmq$speech_order)

corpus_pmq <- corpus(df_pmq %>% select(-url, -speaker_position), text_field="speech_raw_text")
docnames(corpus_pmq) <- paste("Debate ID:", df_pmq$debate_id, "Speech:", df_pmq$speech_order)

tokens_pmq <- 
  tokens(corpus_pmq, remove_punct = TRUE, remove_numbers=TRUE, remove_symbols = TRUE) %>% 
  tokens_remove(pattern =  c("right", "hon", "friend", "learned", "gentleman", "lady", "prime", "minister", "minister’s")) %>%
  tokens_remove(pattern=stopwords("en")) %>%
  tokens_ngrams(n=1:2)

dfm_pmq <-
  tokens_pmq %>%
  dfm() %>%
  dfm_trim(min_termfreq = 0.1, docfreq_type = "prop")


sunak_vs_starmer <- 
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister")) %>%
  dfm_group(groups=docvars(.)$speaker_name)
tstat1 <- textstat_keyness(sunak_vs_starmer,target = "Rishi Sunak")
textplot_keyness(tstat1, margin = 0.2, n = 20, color=c("#0087dc", "#DC241f"))

Q3: What question(s) do you believe guided the data scientist’s decisions in creating this figure?

Question 4 - We’re always making decisions

Worth: 10 marks

Take a second look at the code provided under Q3.

Q4: List all the decisions made by the data scientist that you can infer from their code, and comment on any potential drawbacks or considerations linked to each decision.

Question 5 - Testing an Unconventional Modelling Strategy

Worth: 15 marks

The previous analysis sparked the data scientist’s curiosity, leading them to explore an unconventional modelling idea:

Code
# Again, because this is a test, the full code is provided here with no documentation or comments. 
# Don't do this in real life. 
# Be kind to your future self and your fellow coders – comment your code and break it into readable chunks!

selected_tokens <- 
  c(tstat1 %>% na.omit() %>% arrange(chi2) %>% head(10) %>% pull(feature),
    tstat1 %>% na.omit() %>% arrange(desc(chi2)) %>% head(10) %>% pull(feature))

df <- 
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister")) %>%
  dfm_select(selected_tokens) %>% 
  convert(to="data.frame")

df$speaker_name <- 
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister")) %>%
  docvars(field='speaker_name') %>%
  factor(levels=c("Keir Starmer", "Rishi Sunak"))


log_reg <- logistic_reg() %>% set_mode("classification")

rec <- 
  recipe(speaker_name ~ ., data=df) %>% 
  update_role(doc_id, new_role="ID") %>% 
  prep()

wf <- 
  workflow() %>% 
  add_model(log_reg) %>% 
  add_recipe(rec) %>% 
  fit(df)

The code ran fine, although it produced a warning:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
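For context, this is the warning R emits when some fitted probabilities are pushed to exactly 0 or 1, which typically happens when the predictors separate the two classes (almost) perfectly. A self-contained toy example, unrelated to the PMQ data, that triggers the same warning:

# Toy data, not the PMQ data: x separates the two classes perfectly.
toy <- data.frame(
  x = c(1, 2, 3, 10, 11, 12),
  y = factor(c("a", "a", "a", "b", "b", "b"))
)

# This call typically emits "fitted probabilities numerically 0 or 1 occurred"
# (often together with a non-convergence warning).
glm(y ~ x, data = toy, family = binomial)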

In the end, the data scientist produced the following diagnostics:

══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)          new      funding       answer      support       crisis         says       tories  
     1.3782       0.6297      17.4246      -2.3992       0.6266      -1.2519      -1.8124     -51.4000  
       make      billion         tory       energy        watch       ensure        local      forward  
     1.3895      17.6648     -20.4497       1.0189     -19.6559       1.3009      16.4616       2.3607  
      blame     continue   tory_party    `non-dom`         fine  
   -19.5775       0.7065      -0.9733      -2.7422      -7.1120  

Degrees of Freedom: 955 Total (i.e. Null);  935 Residual
Null Deviance:      907.1 
Residual Deviance: 478  AIC: 520

Also:

wf %>% augment(df, type="prob") %>% roc_auc(speaker_name, `.pred_Rishi Sunak`)

produces:

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.103
wf %>% augment(df, type="prob") %>% roc_auc(speaker_name, `.pred_Keir Starmer`)

produces:

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.897
wf %>% 
  augment(df, type="prob") %>% 
  mutate(pred=if_else(`.pred_Rishi Sunak` >= 0.5, 'Rishi Sunak', 'Keir Starmer') %>% factor(levels=c("Keir Starmer", "Rishi Sunak"))) %>%
  conf_mat(speaker_name, pred) %>% 
  autoplot(type="heatmap")

Q5: Address the following points: What makes this modelling strategy unconventional? Can you assess its effectiveness with the available information? If yes, explain your judgment and the reasons behind it. If not, clarify why you cannot evaluate the model’s quality.

Question 6 - Exploring Uncharted Territory

Worth: 20 marks

(This one’s a real challenge)

The data scientist didn’t stop with the previous test. They tweaked the unconventional approach a bit further and then created a whole set of models:

Code
# Back to the full dfm
dfm_sunak_starmer <-
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister"))


full_df <-
  dfm_sunak_starmer %>%
  convert(to="data.frame")

full_df$speaker_name <- 
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister")) %>%
  docvars(field='speaker_name')

folds <- vfold_cv(full_df, v=5, repeats=1)

subset_df <- function(split, n_tokens=40){
  
  # Which indices are marked as training in this fold?
  idx_training <- split$in_id
  
  dfm_training <-
    dfm_sunak_starmer %>%
    dfm_subset(1:nrow(dfm_sunak_starmer) %in% idx_training)
  
  sunak_vs_starmer <- 
    dfm_training %>%
    dfm_group(groups=docvars(.)$speaker_name)
  
  tstat1 <- textstat_keyness(sunak_vs_starmer,target = "Rishi Sunak")
  
  selected_tokens <- 
    c(tstat1 %>% na.omit() %>% arrange(desc(chi2)) %>% tail(n_tokens) %>% pull(feature), 
      tstat1 %>% na.omit() %>% arrange(chi2) %>% tail(n_tokens) %>% pull(feature))
  
  df <- 
    dfm_training %>%
    dfm_select(selected_tokens) %>% 
    convert(to="data.frame")
  
  df$speaker_name <-
    dfm_training %>%
    docvars(field='speaker_name') %>%
    factor(levels=c("Keir Starmer", "Rishi Sunak"))
  
  df
  
}

train_alg <- function(split, model_spec){
  
  df_train <- subset_df(split)
  
  rec <- recipe(speaker_name ~ ., data=df_train) %>% update_role(doc_id, new_role="ID") %>% prep()
  wf <- workflow() %>% add_model(model_spec) %>% add_recipe(rec) %>% fit(df_train)
  
  idx_training <- split$in_id
  dfm_test <-
      dfm_sunak_starmer %>%
      dfm_subset(!1:nrow(dfm_sunak_starmer) %in% idx_training) %>%
      dfm_select(colnames(df_train %>% select(-speaker_name, -doc_id)))
      
  df_test <- dfm_test %>% convert(to="data.frame")
  df_test$speaker_name <- dfm_test %>% docvars(field='speaker_name') %>% factor(levels=c("Keir Starmer", "Rishi Sunak"))
    
  metrics <-
    bind_rows(
      wf %>% 
        augment(df_train, type="prob") %>% 
        roc_auc(speaker_name, `.pred_Rishi Sunak`) %>% 
        mutate(target="Rish Sunak",
               sample="train",
               fold_id=split$id$id),
      wf %>% 
        augment(df_train, type="prob") %>% 
        roc_auc(speaker_name, `.pred_Keir Starmer`) %>% 
        mutate(target="Keir Starmer", 
               sample="train",
               fold_id=split$id$id),
      wf %>% 
        augment(df_test, type="prob") %>% 
        roc_auc(speaker_name, `.pred_Rishi Sunak`) %>% 
        mutate(target="Rish Sunak",
               sample="test",
               fold_id=split$id$id),
      wf %>% 
        augment(df_test, type="prob") %>% 
        roc_auc(speaker_name, `.pred_Keir Starmer`) %>% 
        mutate(target="Keir Starmer", 
               sample="test",
               fold_id=split$id$id)
    )
  
  
  
  list("fitted_model"=wf, "metrics"=metrics)
}



log_reg <- logistic_reg() %>% set_mode("classification")
log_reg_models <- lapply(folds$splits, function(split){train_alg(split, log_reg)})

svm_radial <- svm_rbf() %>% set_mode("classification")
svm_models <- lapply(folds$splits, function(split){train_alg(split, svm_radial)})

dt <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification")
dt_models <- lapply(folds$splits, function(split){train_alg(split, dt)})

rf <- rand_forest() %>% set_engine("randomForest") %>% set_mode("classification")
rf_models <- lapply(folds$splits, function(split){train_alg(split, rf)})

df_results <-
  bind_rows(
    lapply(log_reg_models, 
           function(list_element){list_element$metrics %>% mutate(algorithm="logistic_regression")}),
    lapply(svm_models, 
           function(list_element){list_element$metrics %>% mutate(algorithm="svm_radial")}),
    lapply(dt_models, 
           function(list_element){list_element$metrics %>% mutate(algorithm="decision_tree")}),
    lapply(rf_models, 
           function(list_element){list_element$metrics %>% mutate(algorithm="random_forest")})
  )

This produced the huge table of results below:

.metric .estimator .estimate target sample fold_id algorithm
roc_auc binary 0.0144807 Rish Sunak train Fold1 logistic_regression
roc_auc binary 0.9855193 Keir Starmer train Fold1 logistic_regression
roc_auc binary 0.1048833 Rish Sunak test Fold1 logistic_regression
roc_auc binary 0.8951167 Keir Starmer test Fold1 logistic_regression
roc_auc binary 0.0118760 Rish Sunak train Fold2 logistic_regression
roc_auc binary 0.9881240 Keir Starmer train Fold2 logistic_regression
roc_auc binary 0.1132505 Rish Sunak test Fold2 logistic_regression
roc_auc binary 0.8867495 Keir Starmer test Fold2 logistic_regression
roc_auc binary 0.0125444 Rish Sunak train Fold3 logistic_regression
roc_auc binary 0.9874556 Keir Starmer train Fold3 logistic_regression
roc_auc binary 0.1817016 Rish Sunak test Fold3 logistic_regression
roc_auc binary 0.8182984 Keir Starmer test Fold3 logistic_regression
roc_auc binary 0.0088343 Rish Sunak train Fold4 logistic_regression
roc_auc binary 0.9911657 Keir Starmer train Fold4 logistic_regression
roc_auc binary 0.1277632 Rish Sunak test Fold4 logistic_regression
roc_auc binary 0.8722368 Keir Starmer test Fold4 logistic_regression
roc_auc binary 0.0194432 Rish Sunak train Fold5 logistic_regression
roc_auc binary 0.9805568 Keir Starmer train Fold5 logistic_regression
roc_auc binary 0.0884934 Rish Sunak test Fold5 logistic_regression
roc_auc binary 0.9115066 Keir Starmer test Fold5 logistic_regression
roc_auc binary 0.0269413 Rish Sunak train Fold1 svm_radial
roc_auc binary 0.9730587 Keir Starmer train Fold1 svm_radial
roc_auc binary 0.0820025 Rish Sunak test Fold1 svm_radial
roc_auc binary 0.9179975 Keir Starmer test Fold1 svm_radial
roc_auc binary 0.0415101 Rish Sunak train Fold2 svm_radial
roc_auc binary 0.9584899 Keir Starmer train Fold2 svm_radial
roc_auc binary 0.2038302 Rish Sunak test Fold2 svm_radial
roc_auc binary 0.7961698 Keir Starmer test Fold2 svm_radial
roc_auc binary 0.0332527 Rish Sunak train Fold3 svm_radial
roc_auc binary 0.9667473 Keir Starmer train Fold3 svm_radial
roc_auc binary 0.1347319 Rish Sunak test Fold3 svm_radial
roc_auc binary 0.8652681 Keir Starmer test Fold3 svm_radial
roc_auc binary 0.0315314 Rish Sunak train Fold4 svm_radial
roc_auc binary 0.9684686 Keir Starmer train Fold4 svm_radial
roc_auc binary 0.1292619 Rish Sunak test Fold4 svm_radial
roc_auc binary 0.8707381 Keir Starmer test Fold4 svm_radial
roc_auc binary 0.0501573 Rish Sunak train Fold5 svm_radial
roc_auc binary 0.9498427 Keir Starmer train Fold5 svm_radial
roc_auc binary 0.1436258 Rish Sunak test Fold5 svm_radial
roc_auc binary 0.8563742 Keir Starmer test Fold5 svm_radial
roc_auc binary 0.3576923 Rish Sunak train Fold1 decision_tree
roc_auc binary 0.6423077 Keir Starmer train Fold1 decision_tree
roc_auc binary 0.3522727 Rish Sunak test Fold1 decision_tree
roc_auc binary 0.6477273 Keir Starmer test Fold1 decision_tree
roc_auc binary 0.2733271 Rish Sunak train Fold2 decision_tree
roc_auc binary 0.7266729 Keir Starmer train Fold2 decision_tree
roc_auc binary 0.4026915 Rish Sunak test Fold2 decision_tree
roc_auc binary 0.5973085 Keir Starmer test Fold2 decision_tree
roc_auc binary 0.2235862 Rish Sunak train Fold3 decision_tree
roc_auc binary 0.7764138 Keir Starmer train Fold3 decision_tree
roc_auc binary 0.3606061 Rish Sunak test Fold3 decision_tree
roc_auc binary 0.6393939 Keir Starmer test Fold3 decision_tree
roc_auc binary 0.2538857 Rish Sunak train Fold4 decision_tree
roc_auc binary 0.7461143 Keir Starmer train Fold4 decision_tree
roc_auc binary 0.2669539 Rish Sunak test Fold4 decision_tree
roc_auc binary 0.7330461 Keir Starmer test Fold4 decision_tree
roc_auc binary 0.2843153 Rish Sunak train Fold5 decision_tree
roc_auc binary 0.7156847 Keir Starmer train Fold5 decision_tree
roc_auc binary 0.3349338 Rish Sunak test Fold5 decision_tree
roc_auc binary 0.6650662 Keir Starmer test Fold5 decision_tree
roc_auc binary 0.0420711 Rish Sunak train Fold1 random_forest
roc_auc binary 0.9579289 Keir Starmer train Fold1 random_forest
roc_auc binary 0.1050369 Rish Sunak test Fold1 random_forest
roc_auc binary 0.8949631 Keir Starmer test Fold1 random_forest
roc_auc binary 0.0557289 Rish Sunak train Fold2 random_forest
roc_auc binary 0.9442711 Keir Starmer train Fold2 random_forest
roc_auc binary 0.2853002 Rish Sunak test Fold2 random_forest
roc_auc binary 0.7146998 Keir Starmer test Fold2 random_forest
roc_auc binary 0.0606903 Rish Sunak train Fold3 random_forest
roc_auc binary 0.9393097 Keir Starmer train Fold3 random_forest
roc_auc binary 0.1988345 Rish Sunak test Fold3 random_forest
roc_auc binary 0.8011655 Keir Starmer test Fold3 random_forest
roc_auc binary 0.0446971 Rish Sunak train Fold4 random_forest
roc_auc binary 0.9553029 Keir Starmer train Fold4 random_forest
roc_auc binary 0.1237355 Rish Sunak test Fold4 random_forest
roc_auc binary 0.8762645 Keir Starmer test Fold4 random_forest
roc_auc binary 0.0922014 Rish Sunak train Fold5 random_forest
roc_auc binary 0.9077986 Keir Starmer train Fold5 random_forest
roc_auc binary 0.1719371 Rish Sunak test Fold5 random_forest
roc_auc binary 0.8280629 Keir Starmer test Fold5 random_forest
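If the full table is hard to scan, one way to condense it is sketched below; it assumes the df_results object created by the code above and simply averages the test-set estimates per algorithm and target column (interpreting the result is still up to you).

library(dplyr)

# Average test-set ROC AUC per algorithm and probability column used as target.
df_results %>%
  filter(sample == "test") %>%
  group_by(algorithm, target) %>%
  summarise(mean_roc_auc = mean(.estimate), .groups = "drop") %>%
  arrange(algorithm, target)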

Q6: Let’s delve into this: is this a more robust test than the one you saw in Q5? When you consider all the results together, do you believe they confirm the effectiveness of the data scientist’s unconventional strategy, or do they suggest it could be a dead end? Does any particular algorithm stand out in the results?


Lastly:

Q: How do you plan on rewarding yourself after completing this exam?

Footnotes

  1. In fact, we obtained this data from the website They Work For You because it contains the same data in a format that is easier to scrape.