💻 LSE DS202A 2024: Week 11 - Lab

2024/25 Autumn Term

Author

The DS202 Team

Learning objectives

  • Build a machine learning pipeline to answer a substantive research question.
  • Apply tidyverse, tidymodels and quanteda knowledge independently.

Background

In the final lab of DS202A, we will be answering the following research question:

How can we distinguish true from false statements?

To do so, we will utilise a subset of the LIAR data set, which contains 5,740 statements from political figures in the United States.

Please download the following files into your data folder.

Ratings have been given (from "true" to "pants on fire") regarding the truthfulness of each statement. We have created two outcomes of interest:

  • true_statement: if the proportion of facts in a statement that are rated "true", "mostly true" or "half true" exceeds 50%, the statement is coded as true; otherwise it is false.
  • perc_true: the proportion of facts in a statement that are rated "true", "mostly true" or "half true".

We have also created a set of binary subject indicators (subj_government_regulation through subj_missouri) that record whether or not a given subject is discussed in a statement.

You also have a few additional variables that add information about each statement:

  • speaker: the person who said or wrote the statement
  • date: the date on which the statement was made
  • speaker_description: a brief description/bio of the person who made the statement (e.g. political position, affiliation, etc.)
  • state_info: if the speaker is an elected representative, the state they represent
  • context: the context of the statement (i.e. mostly the venue/location of the speech or statement)
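To get started, a minimal sketch for loading and inspecting the data might look like the one below. The file name data/liar_subset.csv is a hypothetical placeholder; swap in whichever file(s) you downloaded into your data folder.

```r
library(tidyverse)

# "data/liar_subset.csv" is a hypothetical file name: replace it with the
# file you downloaded into your data folder
liar <- read_csv("data/liar_subset.csv")

# Inspect the variables described above
glimpse(liar)

# Check how balanced the binary outcome is
liar %>%
  count(true_statement) %>%
  mutate(prop = n / sum(n))
```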

You can build either a binary classification model or a regression model, or you can eschew both and opt for an unsupervised learning approach.
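To illustrate the choice, here is a hedged sketch of how the two supervised framings translate into tidymodels model specifications. The model types and engines shown are just examples, and an unsupervised route (e.g. clustering or topic modelling on the text features) needs no outcome variable at all.

```r
library(tidymodels)

# Binary classification: model true_statement (true vs false)
class_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Regression: model perc_true (a proportion between 0 and 1)
reg_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
```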

Your task

We have deliberately left this task open-ended to give you as much opportunity as possible to build your own machine learning model.

You will need to use quanteda to perform feature engineering. From there, it is up to you. We have covered a lot of models, so please use this lab as an opportunity to try things out for yourself.
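As a hedged starting point, a minimal quanteda pipeline might look like the sketch below: build a corpus from the statement text, tokenise, create a document-feature matrix, and convert it into a data frame you can pass to tidymodels. The text column name statement is an assumption, so check the actual column names in your data.

```r
library(quanteda)
library(tidyverse)

# Build a corpus from the statement text
# (assumes the text lives in a column called `statement`)
liar_corpus <- corpus(liar, text_field = "statement")

# Tokenise, dropping punctuation, numbers and English stop words, then stem
liar_tokens <- liar_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem()

# Create a document-feature matrix
liar_dfm <- dfm(liar_tokens)

# Convert to a data frame and reattach the outcome for supervised modelling
liar_features <- convert(liar_dfm, to = "data.frame") %>%
  select(-doc_id) %>%
  mutate(true_statement = liar$true_statement)
```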

Things you might want to try

  • Check out the tokens_ngrams() function to create bi-grams, tri-grams, etc. (see the sketch after this list).
  • Consider experimenting with the minimum number of times a term must appear to be kept as a feature (for example, when trimming the document-feature matrix).
  • You do not need to limit yourself to one kind of algorithm - remember the 📚 reading week homework exercise!
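For example, the first two suggestions could be sketched as follows, building on the liar_tokens object from the earlier quanteda sketch; the frequency thresholds here are arbitrary and worth tuning.

```r
# Add bi-grams and tri-grams alongside the unigrams
liar_tokens_ngrams <- tokens_ngrams(liar_tokens, n = 1:3)

# Keep only terms that occur at least 20 times and in at least 10 documents;
# these cut-offs are arbitrary, so experiment with them
liar_dfm_trimmed <- dfm(liar_tokens_ngrams) %>%
  dfm_trim(min_termfreq = 20, min_docfreq = 10)
```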

You can work on this task independently or in groups of up to 3 people.

🥇 A Friendly Challenge

If you're in for a bit more excitement, join our little competition. Submit your solution (in HTML format) via Moodle by 5pm on Wednesday, December 11th.

There'll be prizes for the top 3 submissions, judged on solution correctness as well as the quality and richness of your insights and interpretations.