💻 LSE DS202A 2024: Week 11 - Lab
2024/25 Autumn Term
Learning objectives
- Build a machine learning pipeline to answer a substantive research question.
- Apply `tidyverse`, `tidymodels` and `quanteda` knowledge independently.
Background
In the final lab of DS202A, we will be answering the following research question:
How can we identify true from false statements?
To do so, we will utilise a subset of the LIAR data set which contains 5,740 statements from political figures in the United States.
Please download the following files into your `data` folder.
Ratings have been given (from “true” to “pants on fire”) regarding the truth of each statement. We have created two outcomes of interest:
- `true_statement`: if the proportion of facts in a statement that are rated “true”, “mostly true” or “half true” exceeds 50%, the statement is coded as true; otherwise it is coded as false.
- `perc_true`: the proportion of facts in a statement that are rated “true”, “mostly true” or “half true”.
- `subj_government_regulation` - `subj_missouri`: binary indicators for whether or not a subject is discussed in a statement.
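The relationship between the two outcomes can be sketched in a few lines of base R. The `perc_true` values below are made up for illustration and do not come from the lab data; the point is only that a statement at exactly 50% does not exceed the threshold and is therefore coded as false.

```r
# Hypothetical proportions of facts rated "true", "mostly true" or "half true"
# (illustrative values only, not taken from the LIAR data)
perc_true <- c(0.20, 0.50, 0.75, 1.00)

# A statement is coded as true only when the proportion exceeds 50%
true_statement <- perc_true > 0.5

data.frame(perc_true, true_statement)
```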
You also have a few additional variables that add information about each statement:
- `speaker`: the person who said or wrote the statement
- `date`: the date on which the statement was made
- `speaker_description`: a brief description/bio of the person who made the statement (e.g. political position, affiliation, etc.)
- `state_info`: if the speaker is an elected representative, the state they represent
- `context`: context of the statement (i.e. mostly the venue/location of the speech or statement)
You can build either a binary classification model or a regression model, or you can eschew both and opt for an unsupervised learning approach.
Your task
Generally, we have left this task open to give you as much opportunity as possible to build your own machine learning model.
You will need to use quanteda to perform feature engineering. From there, it is up to you. We have covered a lot of models, so please use this lab as an opportunity to try things out for yourself.
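As a starting point, here is a minimal sketch of one possible feature engineering route with `quanteda`. It uses two made-up rows rather than the lab files, and the text column name `statement` is an assumption, so adjust it to whatever the downloaded data actually uses. The result is a plain data frame of word counts with the outcome attached, which could then be passed to a model of your choice.

```r
library(quanteda)

# Made-up rows standing in for the lab data; `statement` is an assumed column name
toy <- data.frame(
  statement = c("Taxes went up last year.", "Crime fell in every state."),
  true_statement = c(TRUE, FALSE)
)

# Build a corpus; the remaining columns are kept as document variables
corp <- corpus(toy, text_field = "statement")

# Tokenise, drop punctuation and English stopwords, then count terms per document
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfmat <- dfm(toks)

# Convert to a data frame (one row per statement) and re-attach the outcome
features <- convert(dfmat, to = "data.frame")
features$true_statement <- docvars(dfmat, "true_statement")

head(features)
```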
Things you might want to try
- Check out the `tokens_ngrams` function to create bi-grams, tri-grams etc.
- Consider experimenting with the minimum number of terms.
- You do not need to limit yourself to one kind of algorithm - remember the reading week homework exercise!
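The first two suggestions might look something like the sketch below. It assumes that "minimum number of terms" refers to a minimum term frequency applied with `dfm_trim()`, which is one reasonable reading rather than the only one. The two statements are invented for illustration.

```r
library(quanteda)

# Two made-up statements, used only for illustration (not from the LIAR data)
statements <- c(
  s1 = "Taxes went up last year",
  s2 = "Taxes went down last year"
)

# Tokenise and lower-case, then keep both unigrams and bi-grams
toks <- tokens(statements, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks_ngrams <- tokens_ngrams(toks, n = 1:2)

# Document-feature matrix over all unigrams and bi-grams
dfm_all <- dfm(toks_ngrams)

# Keep only features that occur at least twice across the whole corpus
dfm_small <- dfm_trim(dfm_all, min_termfreq = 2)

featnames(dfm_small)
```

Raising `min_termfreq` shrinks the feature set and usually speeds up model fitting, at the cost of discarding rare terms that may still carry signal, so it is worth comparing a few values.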
You can work on this task independently or in groups of up to 3 people.
🔥 A Friendly Challenge
If you’re in for a bit more excitement, join our little competition. Submit your solution (in HTML format) via Moodle by Wednesday (December 11th) 5pm.
There’ll be prizes for the top 3 submissions (determined in terms of solution correctness but also quality/richness of insights/interpretations).