DS202A-2023-AT-Exam

🗄️ The data

For this assignment, we use data initially sourced from Hansard - UK Parliament, specifically debate transcripts from the UK House of Commons in 2023.¹

📋 Your Tasks

What do we need from you?

Question 1 - The Simple One

Worth: 15 marks

The figure below provides a summary of the UK House of Commons data. However, it needs both a title and a subtitle.

Good titles and subtitles should encapsulate the core question addressed by the figure or its central findings. They should also provide additional insights beyond restating the axis labels.

Q1: Provide a meaningful and informative title and subtitle for Figure 1.
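For reference only: in ggplot2, a title and subtitle are attached to a plot with labs(). The sketch below uses the built-in mtcars data rather than the exam figure, and its title and subtitle are placeholder wording, not a model answer.

library(ggplot2)

# Toy example on mtcars, not the exam figure: labs() is where the title,
# subtitle and axis labels live.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title    = "A placeholder title stating the figure's main message",
    subtitle = "A placeholder subtitle adding context beyond the axis labels",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon"
  )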

Question 2 - The Challenge Begins

Worth: 35 marks

A session known as ‘Prime Minister’s Questions’ (PMQ) occurred on 31 of the 133 days when the House of Commons convened (always on Wednesdays). During PMQs, UK Members of Parliament (MPs) can directly pose questions to the Prime Minister (PM).

The plot below illustrates the structure and dynamics of the Q&As from two specific dates: the first and last PMQs of 2023. By ‘structure and dynamics’, we mean elements such as the sequence of questioning, which MPs put questions to the PM, the duration of questions and responses, and anything else that can be derived from the PMQ transcript other than the content of the speeches themselves.

Q2: Imagine that you have to prepare this PMQ data for the clustering algorithms covered in this course (K-Means and DBSCAN), with the goal of identifying the most typical and the least typical structures within all PMQs of 2023. How would you preprocess this data? Provide an example of how your final dataset would appear just before running the algorithms. What columns would it contain? And how would you create such columns?

Note: The plot you see above was created with ggplot2. While we won’t share the code behind it, you can get the data it’s based on – pmq-2023.csv – from the same place where you downloaded this .html file.
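If you want a first look at the file before answering, the sketch below shows one way to load and inspect it. It assumes pmq-2023.csv sits in your working directory; the column names mentioned in the comments are simply those referenced by the code shown under Q3.

library(readr)
library(dplyr)

# Assumes pmq-2023.csv is in the current working directory.
df_pmq <- read_csv("pmq-2023.csv")

# Inspect the columns; the Q3 code refers to debate_id, speech_order,
# speaker_name, speaker_main_position, speaker_position, speech_raw_text and url.
glimpse(df_pmq)

# How many speeches were recorded in each PMQ debate?
df_pmq %>% count(debate_id, sort = TRUE)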

Question 3 - Reverse engineering

Worth: 5 marks

Doing data analysis is much like posing questions to the available data. You might start with preconceived hypotheses about the dynamics behind the data, or simply be driven by curiosity about what the data can reveal.

Take a look at the plot and accompanying code below:

Code
# Just because this is a test, the full code is provided here with no documentation or comments. 
# Don't do this in real life. 
# Be kind to your future self and your fellow coders – comment your code and break it into readable chunks!

library(dplyr)
library(broom)
library(parsnip)
library(recipes)
library(rsample)
library(tune)
library(workflows)
library(yardstick)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)

df_pmq <- readr::read_csv("pmq-2023.csv")
df_pmq$speech_order <- as.integer(df_pmq$speech_order)

corpus_pmq <- corpus(df_pmq %>% select(-url, -speaker_position), text_field="speech_raw_text")
docnames(corpus_pmq) <- paste("Debate ID:", df_pmq$debate_id, "Speech:", df_pmq$speech_order)

tokens_pmq <- 
  tokens(corpus_pmq, remove_punct = TRUE, remove_numbers=TRUE, remove_symbols = TRUE) %>% 
  tokens_remove(pattern =  c("right", "hon", "friend", "learned", "gentleman", "lady", "prime", "minister", "minister’s")) %>%
  tokens_remove(pattern=stopwords("en")) %>%
  tokens_ngrams(n=1:2)

dfm_pmq <-
  tokens_pmq %>%
  dfm() %>%
  dfm_trim(min_termfreq = 0.1, docfreq_type = "prop")


sunak_vs_starmer <- 
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister")) %>%
  dfm_group(groups=docvars(.)$speaker_name)
tstat1 <- textstat_keyness(sunak_vs_starmer,target = "Rishi Sunak")
textplot_keyness(tstat1, margin = 0.2, n = 20, color=c("#0087dc", "#DC241f"))

Q3: What question(s) do you believe guided the data scientist’s decisions in creating this figure?

Question 4 - We’re always making decisions

Worth: 10 marks

Take a second look at the code provided under Q3.

Q4: List all the decisions made by the data scientist that you can infer from their code, and comment on any potential drawbacks or considerations linked to each decision.

Question 5 - Testing an Unconventional Modelling Strategy

Worth: 15 marks

The previous analysis sparked the data scientist’s curiosity, leading them to explore an unconventional modelling idea:

Code
# Again, because this is a test, the full code is provided here with no documentation or comments. 
# Don't do this in real life. 
# Be kind to your future self and your fellow coders – comment your code and break it into readable chunks!

selected_tokens <- 
  c(tstat1 %>% na.omit() %>% arrange(chi2) %>% head(10) %>% pull(feature),
    tstat1 %>% na.omit() %>% arrange(desc(chi2)) %>% head(10) %>% pull(feature))

df <- 
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister")) %>%
  dfm_select(selected_tokens) %>% 
  convert(to="data.frame")

df$speaker_name <- 
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister")) %>%
  docvars(field='speaker_name') %>%
  factor(levels=c("Keir Starmer", "Rishi Sunak"))


log_reg <- logistic_reg() %>% set_mode("classification")

rec <- 
  recipe(speaker_name ~ ., data=df) %>% 
  update_role(doc_id, new_role="ID") %>% 
  prep()

wf <- 
  workflow() %>% 
  add_model(log_reg) %>% 
  add_recipe(rec) %>% 
  fit(df)

The code ran fine, although it produced a warning:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
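For context, this is the warning R emits when some fitted probabilities are pushed to exactly 0 or 1, which typically happens when the predictors separate the two classes (almost) perfectly. A self-contained toy example, unrelated to the PMQ data, that triggers the same warning:

# Toy data, not the PMQ data: x separates the two classes perfectly.
toy <- data.frame(
  x = c(1, 2, 3, 10, 11, 12),
  y = factor(c("a", "a", "a", "b", "b", "b"))
)

# This call typically emits "fitted probabilities numerically 0 or 1 occurred"
# (often together with a non-convergence warning).
glm(y ~ x, data = toy, family = binomial)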

In the end, the data scientist produced the following diagnostics:

══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)          new      funding       answer      support       crisis         says       tories  
     1.3782       0.6297      17.4246      -2.3992       0.6266      -1.2519      -1.8124     -51.4000  
       make      billion         tory       energy        watch       ensure        local      forward  
     1.3895      17.6648     -20.4497       1.0189     -19.6559       1.3009      16.4616       2.3607  
      blame     continue   tory_party    `non-dom`         fine  
   -19.5775       0.7065      -0.9733      -2.7422      -7.1120  

Degrees of Freedom: 955 Total (i.e. Null);  935 Residual
Null Deviance:      907.1 
Residual Deviance: 478  AIC: 520

Also:

wf %>% augment(df, type="prob") %>% roc_auc(speaker_name, `.pred_Rishi Sunak`)

produces:

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.103
wf %>% augment(df, type="prob") %>% roc_auc(speaker_name, `.pred_Keir Starmer`)

produces:

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.897
wf %>% 
  augment(df, type="prob") %>% 
  mutate(pred=if_else(`.pred_Rishi Sunak` >= 0.5, 'Rishi Sunak', 'Keir Starmer') %>% factor(levels=c("Keir Starmer", "Rishi Sunak"))) %>%
  conf_mat(speaker_name, pred) %>% 
  autoplot(type="heatmap")

Q5: Address the following points: What makes this modelling strategy unconventional? Can you assess its effectiveness with the available information? If yes, explain your judgment and the reasons behind it. If not, clarify why you cannot evaluate the model’s quality.

Question 6 - Exploring Uncharted Territory

Worth: 20 marks

(This one’s a real challenge)

The data scientist didn’t stop with the previous test. They tweaked the unconventional approach a bit further and then created a whole set of models:

Code
# Back to the full dfm
dfm_sunak_starmer <-
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister"))


full_df <-
  dfm_sunak_starmer %>%
  convert(to="data.frame")

full_df$speaker_name <- 
  dfm_pmq %>% 
  dfm_subset(speaker_main_position %in% c( "Leader of HM Official Opposition", "The Prime Minister")) %>%
  docvars(field='speaker_name')

folds <- vfold_cv(full_df, v=5, repeats=1)

subset_df <- function(split, n_tokens=40){
  
  # Which indices are marked as training in this fold?
  idx_training <- split$in_id
  
  dfm_training <-
    dfm_sunak_starmer %>%
    dfm_subset(1:nrow(dfm_sunak_starmer) %in% idx_training)
  
  sunak_vs_starmer <- 
    dfm_training %>%
    dfm_group(groups=docvars(.)$speaker_name)
  
  tstat1 <- textstat_keyness(sunak_vs_starmer,target = "Rishi Sunak")
  
  selected_tokens <- 
    c(tstat1 %>% na.omit() %>% arrange(desc(chi2)) %>% tail(n_tokens) %>% pull(feature), 
      tstat1 %>% na.omit() %>% arrange(chi2) %>% tail(n_tokens) %>% pull(feature))
  
  df <- 
    dfm_training %>%
    dfm_select(selected_tokens) %>% 
    convert(to="data.frame")
  
  df$speaker_name <-
    dfm_training %>%
    docvars(field='speaker_name') %>%
    factor(levels=c("Keir Starmer", "Rishi Sunak"))
  
  df
  
}

train_alg <- function(split, model_spec){
  
  df_train <- subset_df(split)
  
  rec <- recipe(speaker_name ~ ., data=df_train) %>% update_role(doc_id, new_role="ID") %>% prep()
  wf <- workflow() %>% add_model(model_spec) %>% add_recipe(rec) %>% fit(df_train)
  
  idx_training <- split$in_id
  dfm_test <-
      dfm_sunak_starmer %>%
      dfm_subset(!1:nrow(dfm_sunak_starmer) %in% idx_training) %>%
      dfm_select(colnames(df_train %>% select(-speaker_name, -doc_id)))
      
  df_test <- dfm_test %>% convert(to="data.frame")
  df_test$speaker_name <- dfm_test %>% docvars(field='speaker_name') %>% factor(levels=c("Keir Starmer", "Rishi Sunak"))
    
  metrics <-
    bind_rows(
      wf %>% 
        augment(df_train, type="prob") %>% 
        roc_auc(speaker_name, `.pred_Rishi Sunak`) %>% 
        mutate(target="Rish Sunak",
               sample="train",
               fold_id=split$id$id),
      wf %>% 
        augment(df_train, type="prob") %>% 
        roc_auc(speaker_name, `.pred_Keir Starmer`) %>% 
        mutate(target="Keir Starmer", 
               sample="train",
               fold_id=split$id$id),
      wf %>% 
        augment(df_test, type="prob") %>% 
        roc_auc(speaker_name, `.pred_Rishi Sunak`) %>% 
        mutate(target="Rish Sunak",
               sample="test",
               fold_id=split$id$id),
      wf %>% 
        augment(df_test, type="prob") %>% 
        roc_auc(speaker_name, `.pred_Keir Starmer`) %>% 
        mutate(target="Keir Starmer", 
               sample="test",
               fold_id=split$id$id)
    )
  
  
  
  list("fitted_model"=wf, "metrics"=metrics)
}



log_reg <- logistic_reg() %>% set_mode("classification")
log_reg_models <- lapply(folds$splits, function(split){train_alg(split, log_reg)})

svm_radial <- svm_rbf() %>% set_mode("classification")
svm_models <- lapply(folds$splits, function(split){train_alg(split, svm_radial)})

dt <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification")
dt_models <- lapply(folds$splits, function(split){train_alg(split, dt)})

rf <- rand_forest() %>% set_engine("randomForest") %>% set_mode("classification")
rf_models <- lapply(folds$splits, function(split){train_alg(split, rf)})

df_results <-
  bind_rows(
    lapply(log_reg_models, 
           function(list_element){list_element$metrics %>% mutate(algorithm="logistic_regression")}),
    lapply(svm_models, 
           function(list_element){list_element$metrics %>% mutate(algorithm="svm_radial")}),
    lapply(dt_models, 
           function(list_element){list_element$metrics %>% mutate(algorithm="decision_tree")}),
    lapply(rf_models, 
           function(list_element){list_element$metrics %>% mutate(algorithm="random_forest")})
  )

This produced the huge table of results below:

.metric .estimator .estimate target sample fold_id algorithm
roc_auc binary 0.0144807 Rish Sunak train Fold1 logistic_regression
roc_auc binary 0.9855193 Keir Starmer train Fold1 logistic_regression
roc_auc binary 0.1048833 Rish Sunak test Fold1 logistic_regression
roc_auc binary 0.8951167 Keir Starmer test Fold1 logistic_regression
roc_auc binary 0.0118760 Rish Sunak train Fold2 logistic_regression
roc_auc binary 0.9881240 Keir Starmer train Fold2 logistic_regression
roc_auc binary 0.1132505 Rish Sunak test Fold2 logistic_regression
roc_auc binary 0.8867495 Keir Starmer test Fold2 logistic_regression
roc_auc binary 0.0125444 Rish Sunak train Fold3 logistic_regression
roc_auc binary 0.9874556 Keir Starmer train Fold3 logistic_regression
roc_auc binary 0.1817016 Rish Sunak test Fold3 logistic_regression
roc_auc binary 0.8182984 Keir Starmer test Fold3 logistic_regression
roc_auc binary 0.0088343 Rish Sunak train Fold4 logistic_regression
roc_auc binary 0.9911657 Keir Starmer train Fold4 logistic_regression
roc_auc binary 0.1277632 Rish Sunak test Fold4 logistic_regression
roc_auc binary 0.8722368 Keir Starmer test Fold4 logistic_regression
roc_auc binary 0.0194432 Rish Sunak train Fold5 logistic_regression
roc_auc binary 0.9805568 Keir Starmer train Fold5 logistic_regression
roc_auc binary 0.0884934 Rish Sunak test Fold5 logistic_regression
roc_auc binary 0.9115066 Keir Starmer test Fold5 logistic_regression
roc_auc binary 0.0269413 Rish Sunak train Fold1 svm_radial
roc_auc binary 0.9730587 Keir Starmer train Fold1 svm_radial
roc_auc binary 0.0820025 Rish Sunak test Fold1 svm_radial
roc_auc binary 0.9179975 Keir Starmer test Fold1 svm_radial
roc_auc binary 0.0415101 Rish Sunak train Fold2 svm_radial
roc_auc binary 0.9584899 Keir Starmer train Fold2 svm_radial
roc_auc binary 0.2038302 Rish Sunak test Fold2 svm_radial
roc_auc binary 0.7961698 Keir Starmer test Fold2 svm_radial
roc_auc binary 0.0332527 Rish Sunak train Fold3 svm_radial
roc_auc binary 0.9667473 Keir Starmer train Fold3 svm_radial
roc_auc binary 0.1347319 Rish Sunak test Fold3 svm_radial
roc_auc binary 0.8652681 Keir Starmer test Fold3 svm_radial
roc_auc binary 0.0315314 Rish Sunak train Fold4 svm_radial
roc_auc binary 0.9684686 Keir Starmer train Fold4 svm_radial
roc_auc binary 0.1292619 Rish Sunak test Fold4 svm_radial
roc_auc binary 0.8707381 Keir Starmer test Fold4 svm_radial
roc_auc binary 0.0501573 Rish Sunak train Fold5 svm_radial
roc_auc binary 0.9498427 Keir Starmer train Fold5 svm_radial
roc_auc binary 0.1436258 Rish Sunak test Fold5 svm_radial
roc_auc binary 0.8563742 Keir Starmer test Fold5 svm_radial
roc_auc binary 0.3576923 Rish Sunak train Fold1 decision_tree
roc_auc binary 0.6423077 Keir Starmer train Fold1 decision_tree
roc_auc binary 0.3522727 Rish Sunak test Fold1 decision_tree
roc_auc binary 0.6477273 Keir Starmer test Fold1 decision_tree
roc_auc binary 0.2733271 Rish Sunak train Fold2 decision_tree
roc_auc binary 0.7266729 Keir Starmer train Fold2 decision_tree
roc_auc binary 0.4026915 Rish Sunak test Fold2 decision_tree
roc_auc binary 0.5973085 Keir Starmer test Fold2 decision_tree
roc_auc binary 0.2235862 Rish Sunak train Fold3 decision_tree
roc_auc binary 0.7764138 Keir Starmer train Fold3 decision_tree
roc_auc binary 0.3606061 Rish Sunak test Fold3 decision_tree
roc_auc binary 0.6393939 Keir Starmer test Fold3 decision_tree
roc_auc binary 0.2538857 Rish Sunak train Fold4 decision_tree
roc_auc binary 0.7461143 Keir Starmer train Fold4 decision_tree
roc_auc binary 0.2669539 Rish Sunak test Fold4 decision_tree
roc_auc binary 0.7330461 Keir Starmer test Fold4 decision_tree
roc_auc binary 0.2843153 Rish Sunak train Fold5 decision_tree
roc_auc binary 0.7156847 Keir Starmer train Fold5 decision_tree
roc_auc binary 0.3349338 Rish Sunak test Fold5 decision_tree
roc_auc binary 0.6650662 Keir Starmer test Fold5 decision_tree
roc_auc binary 0.0420711 Rish Sunak train Fold1 random_forest
roc_auc binary 0.9579289 Keir Starmer train Fold1 random_forest
roc_auc binary 0.1050369 Rish Sunak test Fold1 random_forest
roc_auc binary 0.8949631 Keir Starmer test Fold1 random_forest
roc_auc binary 0.0557289 Rish Sunak train Fold2 random_forest
roc_auc binary 0.9442711 Keir Starmer train Fold2 random_forest
roc_auc binary 0.2853002 Rish Sunak test Fold2 random_forest
roc_auc binary 0.7146998 Keir Starmer test Fold2 random_forest
roc_auc binary 0.0606903 Rish Sunak train Fold3 random_forest
roc_auc binary 0.9393097 Keir Starmer train Fold3 random_forest
roc_auc binary 0.1988345 Rish Sunak test Fold3 random_forest
roc_auc binary 0.8011655 Keir Starmer test Fold3 random_forest
roc_auc binary 0.0446971 Rish Sunak train Fold4 random_forest
roc_auc binary 0.9553029 Keir Starmer train Fold4 random_forest
roc_auc binary 0.1237355 Rish Sunak test Fold4 random_forest
roc_auc binary 0.8762645 Keir Starmer test Fold4 random_forest
roc_auc binary 0.0922014 Rish Sunak train Fold5 random_forest
roc_auc binary 0.9077986 Keir Starmer train Fold5 random_forest
roc_auc binary 0.1719371 Rish Sunak test Fold5 random_forest
roc_auc binary 0.8280629 Keir Starmer test Fold5 random_forest
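If the full table is hard to scan, one way to condense it is sketched below; it assumes the df_results object created by the code above and simply averages the test-set estimates per algorithm and target column (interpreting the result is still up to you).

library(dplyr)

# Average test-set ROC AUC per algorithm and probability column used as target.
df_results %>%
  filter(sample == "test") %>%
  group_by(algorithm, target) %>%
  summarise(mean_roc_auc = mean(.estimate), .groups = "drop") %>%
  arrange(algorithm, target)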

Q6: Let’s delve into this: is this a more robust test than the one you saw in Q5? When you consider all the results together, do you believe they confirm the effectiveness of the data scientist’s unconventional strategy, or do they suggest it could be a dead end? Does any particular algorithm stand out in the results?


Lastly:

Q: How do you plan on rewarding yourself after completing this exam?

Footnotes

  1. In fact, we obtained this data from the website They Work For You because it contains the same data in a format that is easier to scrape.