For this assignment, we use data initially sourced from Hansard - UK Parliament, specifically debate transcripts from the UK House of Commons in 2023 [1].
📋 Your Tasks
What do we need from you?
Question 1 - The Simple One
Worth: 15 marks
The figure below provides a summary of the UK House of Commons data. However, it needs both a title and a subtitle.
Good titles and subtitles should encapsulate the core question addressed by the figure or its central findings. They should also provide additional insights beyond restating the axis labels.
Q1: Provide a meaningful and informative title and subtitle for Figure 1.
Question 2 - The Challenge Begins
Worth: 35 marks
A session known as ‘Prime Minister’s Questions’ (PMQs) took place on 31 of the 133 days when the House of Commons convened. PMQs are always held on Wednesdays. During PMQs, UK Members of Parliament (MPs) can pose questions directly to the Prime Minister (PM).
The plot below illustrates the structure and dynamics of the Q&As from two specific dates: the first and last PMQs of 2023. By ‘structure and dynamics’, we mean elements such as the sequence of questioning, which MPs put questions to the PM, the duration of questions and responses, and anything else that can be obtained from the PMQ transcript apart from the content of the speeches.
Q2: Imagine that you have to prepare this PMQ data for the clustering algorithms covered in this course (K-Means and DBSCAN), with the goal of identifying the most typical and the least typical structures within all PMQs of 2023. How would you preprocess this data? Provide an example of how your final dataset would appear just before running the algorithms. What columns would it contain? And how would you create such columns?
Note: The plot you see above was created with ggplot2. While we won’t share the code behind it, you can get the data it’s based on – pmq-2023.csv – from the same place where you downloaded this .html file.
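If you want to look at the file before answering, a minimal sketch for loading it in R follows. It assumes pmq-2023.csv sits in your working directory; `load_pmq` is a hypothetical helper of ours, not part of the assignment, and the column names are simply those referenced by the code under Question 3, so the actual file may contain more.

```r
# Hypothetical helper (not part of the assignment): load the PMQ data if present.
load_pmq <- function(path = "pmq-2023.csv") {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  # Base R reader; the exam code itself uses readr::read_csv()
  read.csv(path, stringsAsFactors = FALSE)
}

df_pmq <- load_pmq()

if (!is.null(df_pmq)) {
  # Column names referenced by the code later in this document
  expected_cols <- c("debate_id", "speech_order", "speaker_name",
                     "speaker_position", "speaker_main_position",
                     "speech_raw_text", "url")
  # Print any expected column that is absent from the file
  print(setdiff(expected_cols, names(df_pmq)))
  # Compact overview of column types and first values
  str(df_pmq, vec.len = 1)
}
```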
Question 3 - Reverse engineering
Worth: 5 marks
Doing data analysis is like posing questions to the available data. One might have some preconceived hypotheses about the dynamics behind the data, or simply be driven by curiosity about what the data can reveal.
Take a look at the plot and accompanying code below:
Code
# Just because this is a test, the full code is provided here with no documentation or comments.
# Don't do this in real life.
# Be kind to your future self and your fellow coders – comment your code and break it into readable chunks!

library(dplyr)
library(broom)
library(parsnip)
library(recipes)
library(rsample)
library(tune)
library(workflows)
library(yardstick)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)

df_pmq <- readr::read_csv("pmq-2023.csv")
df_pmq$speech_order <- as.integer(df_pmq$speech_order)

corpus_pmq <- corpus(df_pmq %>% select(-url, -speaker_position), text_field = "speech_raw_text")
docnames(corpus_pmq) <- paste("Debate ID:", df_pmq$debate_id, "Speech:", df_pmq$speech_order)

tokens_pmq <- tokens(corpus_pmq, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c("right", "hon", "friend", "learned", "gentleman", "lady", "prime", "minister", "minister’s")) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_ngrams(n = 1:2)

dfm_pmq <- tokens_pmq %>%
  dfm() %>%
  dfm_trim(min_termfreq = 0.1, docfreq_type = "prop")

sunak_vs_starmer <- dfm_pmq %>%
  dfm_subset(speaker_main_position %in% c("Leader of HM Official Opposition", "The Prime Minister")) %>%
  dfm_group(groups = docvars(.)$speaker_name)

tstat1 <- textstat_keyness(sunak_vs_starmer, target = "Rishi Sunak")
textplot_keyness(tstat1, margin = 0.2, n = 20, color = c("#0087dc", "#DC241f"))
Q3: What question(s) do you believe guided the data scientist’s decisions in creating this figure?
Question 4 - We’re always making decisions
Worth: 10 marks
Take a second look at the code provided under Q3.
Q4: List all the decisions made by the data scientist, as you can infer from their code. Provide comments on any potential drawbacks or considerations linked to each decision.
Question 5 - Testing an Unconventional Modelling Strategy
Worth: 15 marks
The previous analysis sparked the data scientist’s curiosity, leading them to explore an unconventional modelling idea:
Code
# Again, because this is a test, the full code is provided here with no documentation or comments.
# Don't do this in real life.
# Be kind to your future self and your fellow coders – comment your code and break it into readable chunks!

selected_tokens <- c(
  tstat1 %>% na.omit() %>% arrange(chi2) %>% head(10) %>% pull(feature),
  tstat1 %>% na.omit() %>% arrange(desc(chi2)) %>% head(10) %>% pull(feature)
)

df <- dfm_pmq %>%
  dfm_subset(speaker_main_position %in% c("Leader of HM Official Opposition", "The Prime Minister")) %>%
  dfm_select(selected_tokens) %>%
  convert(to = "data.frame")

df$speaker_name <- dfm_pmq %>%
  dfm_subset(speaker_main_position %in% c("Leader of HM Official Opposition", "The Prime Minister")) %>%
  docvars(field = 'speaker_name') %>%
  factor(levels = c("Keir Starmer", "Rishi Sunak"))

log_reg <- logistic_reg() %>%
  set_mode("classification")

rec <- recipe(speaker_name ~ ., data = df) %>%
  update_role(doc_id, new_role = "ID") %>%
  prep()

wf <- workflow() %>%
  add_model(log_reg) %>%
  add_recipe(rec) %>%
  fit(df)
The code ran fine, although it produced a warning:
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
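For readers who have not met this warning before: one common situation in which `glm` emits it is when a predictor separates the two classes perfectly, so that some fitted probabilities are pushed to the limits of floating-point precision. The sketch below reproduces the message on a six-row toy dataset that we invented purely for illustration. It is unrelated to the PMQ data, and whether something similar is happening in the model above is for you to judge.

```r
# Toy data (invented for illustration): every x <= 3 has y = 0, every x >= 4 has y = 1
toy <- data.frame(
  y = c(0, 0, 0, 1, 1, 1),
  x = c(1, 2, 3, 4, 5, 6)
)

# Collect warning messages instead of letting them print to the console
warnings_seen <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial, data = toy),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warnings_seen)  # messages raised while fitting
print(coef(fit))      # coefficient estimates from the toy fit
```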
In the end, the data scientist produced the following diagnostics:
══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept)          new      funding       answer      support       crisis         says       tories
     1.3782       0.6297      17.4246      -2.3992       0.6266      -1.2519      -1.8124     -51.4000
       make      billion         tory       energy        watch       ensure        local      forward
     1.3895      17.6648     -20.4497       1.0189     -19.6559       1.3009      16.4616       2.3607
      blame     continue   tory_party    `non-dom`         fine
   -19.5775       0.7065      -0.9733      -2.7422      -7.1120
Degrees of Freedom: 955 Total (i.e. Null); 935 Residual
Null Deviance: 907.1
Residual Deviance: 478 AIC: 520
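As a reading aid, the summary figures at the bottom of this printout can be cross-checked with simple arithmetic. The sketch below assumes the standard relationships for a logistic regression fitted to one-row-per-observation binary data: null degrees of freedom equal the number of observations minus one, and AIC equals the residual deviance plus twice the number of estimated coefficients. The variable names are ours, and the printed values are rounded, so this is a consistency check only.

```r
# Figures copied from the printout above
null_df      <- 955
resid_df     <- 935
resid_dev    <- 478
reported_aic <- 520

# Null model estimates only the intercept, so n = null_df + 1
n_obs <- null_df + 1

# Each extra coefficient uses one degree of freedom; add 1 for the intercept
n_coef <- null_df - resid_df + 1

# AIC for ungrouped binary data: residual deviance + 2 * number of coefficients
aic_check <- resid_dev + 2 * n_coef

cat("Observations:", n_obs, "\n")
cat("Coefficients:", n_coef, "\n")
cat("AIC check:", aic_check, "vs reported", reported_aic, "\n")
```

The coefficient count obtained this way matches the number of entries listed under "Coefficients:" in the printout.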
Q5: Address these points: What makes this modelling strategy unconventional? Can you assess its effectiveness with the available information? If yes, explain your judgment and reasons for it. If not, clarify why you can’t evaluate the model’s quality.
Question 6 - Exploring Uncharted Territory
Worth: 20 marks
(This one’s a real challenge)
The data scientist didn’t stop with the previous test. They tweaked the unconventional approach a bit further and then created a whole set of models:
Q6: Let’s delve into this: is this a more robust test than the one you saw in Q5? When you consider all the results together, do you believe it confirms the effectiveness of the unconventional strategy taken by the data scientist, or does it imply it could be a dead end? Is there any particular algorithm that stands out in the results?
Lastly:
Q: How do you plan on rewarding yourself after completing this exam?
Footnotes
[1] In fact, we obtained this data from the website They Work For You because it contains the same data in a format that is easier to scrape.