✔️ Exam | Solutions & Comments

DS202 - Data Science for Social Scientists

Author

DS202 2022MT Teaching Group

Published

07 February 2023

Below you will find the solutions to the exam. We have also included some comments on the questions and the answers. We hope this will help you understand the rationale behind the marking scheme and the solutions.

✔️ Solutions

Q1

Q1a) You should have spotted all fifteen leaf nodes: 4, 13, 15, 20, 24, 28, 29, 42, 43, 44, 45, 46, 47, 50 & 51

Q1b) Leaf node 4. It contains the most training data points, accounting for 89% of the samples.

Q1c)

  • Case 1 falls under leaf node 13
  • Case 2 falls under leaf node 24
  • Case 3 falls under leaf node 50

Q2

Q2a) Artie used two resampling techniques:

  1. Train/test split: he separated samples from before 2021 (training_data) from those from 2021 onwards (external_set).
  2. Five repeats of 10-fold cross-validation, stratified by Severity. If the parameter v is not specified, the vfold_cv function splits the data into v=10 folds by default (you can check the documentation with ?vfold_cv in RStudio). The argument repeats=5 indicates that the whole 10-fold procedure gets repeated five times. The parameter strata = "severity" ensures that each of these 5x10 folds maintains the same distribution of “Slight” vs “Serious” as the original training_data (a sketch of this set-up is shown below).
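As a point of reference, a minimal sketch of this set-up could look like the snippet below. It reuses the object and column names mentioned above, but the original data frame (here called accidents) and its year column are assumptions; this is not necessarily Artie’s exact code.

```r
# A minimal sketch of Artie's resampling set-up, not his exact code.
# `accidents` and its `year` column are assumed names.
library(dplyr)
library(rsample)

# 1. Train/test split by year
training_data <- accidents %>% filter(year < 2021)
external_set  <- accidents %>% filter(year >= 2021)

# 2. Five repeats of stratified 10-fold cross-validation
set.seed(1)
folds <- vfold_cv(
  training_data,
  v       = 10,          # the default when v is not specified
  repeats = 5,           # repeat the whole 10-fold procedure five times
  strata  = "severity"   # keep the Slight/Serious distribution in every fold
)
```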

Q2b) You should have described the following about that summary table:

  • These are the average values of the metrics across all n=50 folds.
  • Accuracy alone makes it tempting to think this is a good model. However, once we read the F1-score and precision, and most importantly the extremely low recall, we can safely conclude that this is a poor model.
  • The poor precision means that only 25 out of every 100 incidents the model flags as “Serious” are actually serious. In our hypothetical scenario, we would unnecessarily dispatch emergency services 75% of the time when we could have reserved these resources for severe emergencies elsewhere.
  • The poor recall means that the model correctly identifies only 1 out of every 100 actual Serious incidents. In other words, 99% of serious accidents would never trigger a dispatch of emergency services.

Q3

The two sub-questions are very interconnected. Below is a general guide for how we will mark the responses.

The F1-score is the most appropriate metric here because i) we care about both precision and recall, and ii) we have an imbalanced dataset: 80% of the samples in the dataset are of Slight accidents. If you used the F1-score, provided reasons for choosing it, calculated it correctly, and reflected on its use on the external set, you will get high marks on Q3a and Q3b. (Full marks if, on top of that, your formatting was pristine.)
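As a reminder, the F1-score is the harmonic mean of precision and recall. A quick check in R, using illustrative values rather than the exact figures from the exam tables:

```r
# F1 is the harmonic mean of precision and recall.
# Illustrative values only, not the exact figures from the exam tables.
precision <- 0.25
recall    <- 0.01

f1 <- 2 * (precision * recall) / (precision + recall)
f1
#> [1] 0.01923077
```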

If you used precision or recall alone but at least considered the pros and cons of your selected metric, you will still get average marks (Good). Suppose you simply stated that “recall is the most appropriate because we want more Serious accidents to be identified” but didn’t weigh up the problems of maximising recall at the cost of precision. In that case, you will be marked lower (Inaccurate).

If you used Accuracy or “True Negative rate”, you missed the question’s point and will receive a low mark (Very inaccurate).

Q4

Q4a) The new decision tree sampled ~41,000 records of Slight accidents and ~41,000 records of Serious accidents; that is what the numbers inside the root node (displayed as 41e+3) mean. It differs from Artie’s tree because Camylla sampled with replacement, producing a larger dataset in which Slight and Serious accidents appear in equal numbers rather than the original 80/20 split.
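A minimal sketch of this kind of over-sampling with replacement is below. It reuses the training_data object and severity column from earlier questions and is not necessarily how Camylla did it.

```r
# Over-sample each class (with replacement) up to the size of the largest
# class, so that Slight and Serious end up with the same number of records.
# Object and column names are assumptions carried over from earlier questions.
library(dplyr)

set.seed(2022)

n_largest <- training_data %>%
  count(severity) %>%
  pull(n) %>%
  max()

balanced_training_data <- training_data %>%
  group_by(severity) %>%
  slice_sample(n = n_largest, replace = TRUE) %>%
  ungroup()
```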

Q4b) Judging by the cross-validation metrics on the training data, this model has considerably better recall for “Serious”, but the real test is on the external set.

On the external set, we expect you to have noticed that while precision hasn’t changed much, the recall has improved. Instead of 1% in Artie’s case, Camylla’s model would send out ambulances to 26.45% of real serious cases. Because precision has not improved, one could argue that this is still a poor model, but it should be evident that Camylla’s model is superior.
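If you want to reproduce this kind of check, a minimal sketch using yardstick is below; the object and column names (external_preds, severity, .pred_class) are assumptions, not the code used in the exam.

```r
# Hypothetical sketch of the external-set evaluation, not the exam's code.
# `external_preds` is assumed to hold the external set plus the model's
# predictions, with `severity` as the truth and `.pred_class` as the estimate.
library(yardstick)

precision(external_preds, truth = severity, estimate = .pred_class,
          event_level = "second")  # assuming "Serious" is the second factor level
recall(external_preds, truth = severity, estimate = .pred_class,
       event_level = "second")
```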

Q4c) Because Camylla’s model was trained on over-sampled data (sampled with replacement), the proportions shown inside its leaf nodes describe the resampled records, many of which are duplicates, so it is technically impossible to say with mathematical certainty which leaf node would lead to the biggest proportion of false ‘Serious’ alerts. If you spotted this, great! This is a valid answer.

However, as an exercise in reading the decision tree, we will still consider valid solutions that assume the proportion of leaf nodes to be accurate representations of the data. If reasoning this way, seven leaf nodes trigger Serious alerts: 11, 13, 15, 17, 19, 25 and 29. By playing with the numbers inside these leaf nodes, you could conclude that node 17 has the biggest proportion of false ‘Serious’ alerts (45%).

Node 17 represents accidents:

  • That involved exactly 2 vehicles (1.5 ≤ number_vehicles ≤ 2.5)
  • That occurred in urban areas
  • That happened between 11pm and 7am.

You could argue it is still valid to dispatch emergency services because the accident happened at night, or you could argue that it is still worth dispatching due to the improved overall recall of the model. Alternatively, you could argue that we shouldn’t trigger alerts in leaf nodes with this high false-positive proportion. As long as you explain your opinion convincingly, it is alright.

Q5

There isn’t a single optimal solution to this problem. Ideally, you would have drawn from the concepts and ideas introduced in the Week 09 to Week 11 lectures, plus the Week 11 lab, to reinforce your outline proposal.

Excellent (≥25/35) outlines have exceeded our expectations and included the following:

  • Description of how they would structure the data in tidy data format, including names of the columns they would use in their data frames.
  • Description of how they would use tidyverse to “feature engineer” the analysis, for example by including categories for the time of day of the tweet (morning, afternoon, etc.) or the time elapsed between the previous tweet and the current one.
  • Proposal of an ML model to predict the likelihood of a tweet being deleted, followed by naming appropriate classification algorithms and metrics.
  • Resampling techniques to assess the ML model above.
  • The use of quanteda to create a corpus and document-frequency matrix of the text data.
  • Proposals of visual inspection of the data set with word clouds, keyness plots, as well as 2D plots produced after running dimensionality reduction techniques.
  • Proposal of clustering, topic modelling and dimensionality reduction to inspect subgroups within the text data.
  • Proposal of integrating the numerical features (from the feature engineering stage) with the dfm to assess whether the performance of the predictive ML model would improve.
  • An investigation of what we could learn from a model that predicts the likelihood of a tweet being deleted.

An Absolutely Stellar (35/35) outline would, on top of all of the above, have included snippets of R code as a proof of concept of the analysis.
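For example, such a proof-of-concept snippet for the feature-engineering and quanteda steps above might have looked like the sketch below. All object and column names (tweets, text, created_at) are illustrative assumptions, not part of the exam brief.

```r
# Hypothetical proof-of-concept; every object and column name is illustrative.
library(dplyr)
library(lubridate)
library(quanteda)
library(quanteda.textplots)

# Feature engineering: a time-of-day category for each tweet
tweets <- tweets %>%
  mutate(
    hour_posted = hour(created_at),
    time_of_day = case_when(
      hour_posted < 6  ~ "night",
      hour_posted < 12 ~ "morning",
      hour_posted < 18 ~ "afternoon",
      TRUE             ~ "evening"
    )
  )

# Corpus and document-feature matrix of the tweet text
tweet_corpus <- corpus(tweets, text_field = "text")

tweet_dfm <- tweet_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()

# Quick visual inspection with a word cloud
textplot_wordcloud(tweet_dfm, max_words = 100)
```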