LSE DS101L 2022/23 - Week 09
This week, we are back to being full-blown empiricists again. Rather than providing an inventory of concepts, I will show you the basic steps that are common to any initial analysis of text as data.
My commentaries throughout the examples will reflect how these topics relate to the things we have been seeing in this course. You might have trouble understanding things if you have not attended the lecture.
📚 Learning Outcomes
Discover the concept of tidy text, akin to that of tidy data
Understand the notion of tokens
Understand what stop words are and why we need them
Conduct a basic exploratory data analysis of text using R
⚙️ Setup (~15 min)
Before we start, let’s ensure we all have the same software dependencies installed. Let’s go through the two action points below.
🎯 ACTION POINTS:
📦 Open the R Console and type the following commands:
install.packages("tidyverse") # a very useful set of R packages install.packages("curl") # to download files from the internet install.packages("gutenbergr") # a package to download books from Gutenberg.org install.packages("tidytext") # a package for text analysis install.packages("quanteda") # a package for different text analysis install.packages("quanteda.textplots")
💡 Did you notice that the chunk above contained code but you could NOT run it directly? If you change from the Visual to the Source view, you might be able to understand why. Ask your instructor if you didn’t spot the difference.
✅ Let’s check that these installations all worked fine. You should be able to run the chunk below without errors:
```r
# These packages are part of tidyverse
# Read more on tidyverse.org
library(dplyr)
library(magrittr)
library(stringr)
library(forcats)
library(ggplot2)

library(gutenbergr)
library(tidytext)
library(quanteda)
library(quanteda.textplots)
```
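💡 If any of the `library()` calls above fails, the corresponding `install.packages()` step probably did not complete. As an optional extra check (not required for this lab), you can also confirm which versions you have installed:

```r
# print the installed version of two of the key packages
packageVersion("tidytext")
packageVersion("quanteda")
```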
1️⃣ Books as data (~20 min)
The internet is full of text. There are countless news pieces, opinion articles, social media posts, encyclopaedic entries, etc. Here, we are going to focus on one particular type of text: books that are in the public domain¹.
🎯 ACTION POINTS:
📚 Browse Project Gutenberg and choose a book (it must be in English) of your liking.
🔎 Take note of the book ID (an integer number). There are several ways to locate this number once you click on a book: you can either copy it from the URL of the page or scroll down until you find the line labelled `EBook-No.`

👨‍💻 Set the variable `BOOK_ID` below to the book ID of your selected book:

```r
# add your book ID here, right after the <- sign.
BOOK_ID <- 39713

my_book <- gutenberg_download(BOOK_ID)
```
✅ Check that the download was successful. Type the following command on the console. You should see a data frame with the content of the book.
```r
View(my_book)
```
(No one in their right mind would choose to read a book like this, but it certainly makes text analysis easier.)
👥 In pairs: Take a look at the book data frame of the person sitting next to you.
2️⃣ Tokens and stop words (~25 min)
🎯 ACTION POINTS:
The first thing one does when performing quantitative text analysis is to identify TOKENS². Run the chunk below to look at the first tokens of your book:
```r
my_tokens <- 
  my_book %>% 
  unnest_tokens(word, text)

# look at the first 20 tokens
my_tokens %>% head(20)
```
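💡 A small aside, not needed for the rest of the lab: as footnote 2 mentions, tokens do not have to be single words. If you are curious, `unnest_tokens()` can also split the text into n-grams, for example bigrams (pairs of consecutive words):

```r
# tokenise into bigrams instead of single words
my_book %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  head(10)
```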
How many tokens are there in total?
```r
n_tokens <- my_tokens %>% nrow()

n_tokens
```
How many unique tokens are there?
```r
n_unique_tokens <- 
  my_tokens %>% 
  select(word) %>% 
  n_distinct()

n_unique_tokens
```
Let’s count the occurrence of each word in this book:
```r
top100 <- 
  my_tokens %>% 
  count(word, sort=TRUE) %>% 
  head(100) # 100 most frequent words

top100
```
👥 DISCUSS IN PAIRS: Do you think this ranking of words provides a good summary of your book?
What do these frequencies represent in percentage terms?
```r
top100 %>% mutate(pctg=n/n_tokens)
```
Enter stop words (a list of words that are too frequent and thus do not help much with the analysis):
```r
# load stop words from the tidytext package
data(stop_words)

stop_words
```
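💡 If you are curious about where these stop words come from: `stop_words` bundles several lexicons. A quick, optional way to see how many words each lexicon contributes:

```r
# number of stop words per lexicon
stop_words %>% count(lexicon, sort = TRUE)
```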
Let’s remove the stop words from our tokens. Take a look at the new list of top 100 tokens:
```r
top100 <- 
  my_tokens %>% 
  anti_join(stop_words) %>% 
  count(word, sort=TRUE) %>% 
  head(100) %>% 
  mutate(pctg=n/n_tokens)

top100
```
👥 DISCUSS IN PAIRS: What do you think of this new list?
💡 If you feel like adding new stop words, use the template below:
```r
stop_words <- 
  bind_rows(stop_words,
            list(word=c("_a_", "_b_"), lexicon=c("Custom")))
```
3️⃣ Per-chapter analysis (~15 min)
Books normally follow a rigid structure. Here, we will exploit the fact that many books are divided into chapters. The chunk below detects lines that look like chapter headings and records, for each line of the book, the chapter it belongs to:
```r
my_book <- 
  my_book %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]", 
                                           ignore_case = TRUE))))

my_book %>% head()
```
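As a quick sanity check (assuming your book’s chapter headings match the `^chapter` pattern used above), you can count how many chapters were detected:

```r
# the largest chapter number; 0 would mean no chapter headings were matched
my_book %>% summarise(n_chapters = max(chapter))
```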
Now we re-tokenise the book, keeping the chapter column we have just created:

```r
my_tokens <- 
  my_book %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  na.omit()
```
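💡 The `str_extract()` call above keeps only the first run of letters (and apostrophes) in each token, which strips things like the underscores Project Gutenberg often uses to mark italics; `na.omit()` then drops tokens with no match at all. A tiny illustration with made-up strings:

```r
# str_extract() returns the first match of the pattern, or NA if there is none
str_extract(c("_emphasis_", "don't", "42"), "[a-z']+")
#> [1] "emphasis" "don't"    NA
```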
```r
# Number of tokens per chapter
tokens_per_chapter <- 
  my_tokens %>%
  group_by(chapter) %>%
  summarise(tokens_in_chapter=n())

tokens_per_chapter
```
Which are the most frequent tokens in each chapter?

```r
top_tokens_per_chapter <- 
  my_tokens %>%
  anti_join(stop_words) %>%
  count(chapter, word, sort=TRUE) %>%
  ungroup() %>%
  left_join(tokens_per_chapter) %>%
  mutate(pctg=n/tokens_in_chapter) %>%
  group_by(chapter) %>%
  slice_max(n, n=10)
```
```r
# 💡 By setting echo & output to FALSE, R will run and use this
# chunk of code, but it won't be visible in the produced HTML.
# I am leaving this here just as an example of how to
# produce a markdown table. You can then copy the text
# that comes out of running this command and place it in
# the SOURCE of your .Rmd or .qmd file.
top_tokens_per_chapter %>% filter(chapter == 1) %>% knitr::kable()
```
Top tokens for chapter 1:
chapter | word | n | tokens_in_chapter | pctg |
---|---|---|---|---|
1 | true | 36 | 3667 | 0.0098173 |
1 | reasoning | 22 | 3667 | 0.0059995 |
1 | theorem | 20 | 3667 | 0.0054540 |
1 | gamma | 17 | 3667 | 0.0046359 |
1 | recurrence | 15 | 3667 | 0.0040905 |
1 | analytic | 14 | 3667 | 0.0038178 |
1 | equality | 14 | 3667 | 0.0038178 |
1 | mathematical | 13 | 3667 | 0.0035451 |
1 | demonstration | 11 | 3667 | 0.0029997 |
1 | purely | 11 | 3667 | 0.0029997 |
1 | science | 11 | 3667 | 0.0029997 |
Plots:
Bar plots:
```r
# Tip: you can control the width and height of images
# with the fig.height and fig.width parameters above.
g <- 
  (ggplot(top_tokens_per_chapter %>% filter(chapter > 0, chapter < 4), 
          aes(x=fct_reorder(word, pctg), y=pctg))
   + geom_col()
   + coord_flip()
   + scale_y_continuous(labels=scales::percent)
   + scale_fill_gradient()
   + labs(x="Tokens", y="Token Frequency",
          caption="Tokens are kind of ordered by frequency.\nIt's complicated")
   + facet_wrap(~ chapter, scales="free_y", labeller = label_both, ncol = 1)
   + theme_bw()
  )

g
```
Dot plots:
```r
# Tip: you can control the width and height of images
# with the fig.height and fig.width parameters above.
plot_df <- 
  top_tokens_per_chapter %>%
  mutate(word=fct_reorder(word, pctg))

selected_words <- levels(plot_df$word)[1:20]

plot_df <- plot_df %>% filter(word %in% selected_words)

g <- 
  (ggplot(plot_df, aes(x=factor(chapter), y=word, size=pctg, color=pctg))
   + geom_point()
   + scale_color_viridis_b(labels=scales::percent)
   + scale_size_area(labels=scales::percent)
   + labs(x="Chapter", y="Tokens",
          caption="Tokens are kind of ordered by frequency.\n(It's complicated)")
   + theme_bw()
   + theme(text=element_text(size=rel(4)),
           legend.text = element_text(size=rel(2)),
           legend.position="bottom")
  )

g
```
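💡 If you would like to share your plot outside of the rendered notebook, `ggsave()` can write it to a file (the filename and dimensions below are just an example):

```r
# save the plot object g to a PNG file
ggsave("top_tokens_dotplot.png", plot = g, width = 6, height = 8)
```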
👥 Let’s look at each other’s plots!
4️⃣ Corpora and the dfm (~20 min)
The analysis we will be doing in this section could also be performed with `tidytext`, but to give you an additional example of a more powerful tool, we will now focus our attention on another R package: `quanteda`.
🎯 ACTION POINTS:
Let’s start by loading the data we will be using:
```r
data(data_char_ukimmig2010)
```
Now, let’s take a look at what is in this data:
```r
View(data_char_ukimmig2010)
```
🗣️ CLASS-WIDE DISCUSSION: What do you see?
Let’s transform this into a corpus:
```r
corp_immig <- corpus(data_char_ukimmig2010, 
                     docvars = data.frame(party = names(data_char_ukimmig2010)))

corp_immig
```
Let’s summarise it:
```r
summary(corp_immig)
```
The `corpus()` function has done much of the work for us!
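💡 For instance, the party names we passed in are now stored as a document-level variable. Two `quanteda` helpers you might find handy here:

```r
# document-level variables attached to the corpus
docvars(corp_immig)

# number of documents in the corpus
ndoc(corp_immig)
```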
We can now extract the tokens, remove stop words, etc.:
```r
tokens_immig <- 
  corp_immig %>%
  tokens(remove_punct = TRUE, remove_symbols=TRUE, remove_numbers=TRUE) %>%
  tokens_select(pattern = stopwords("en"), selection = "remove") %>%
  tokens_select(pattern = c("immig*", "migra*"), selection="remove")

head(tokens_immig)
```
Having filtered the tokens, we can now construct a document-frequency matrix:
```r
dfm_immig <- dfm(tokens_immig)

dfm_immig
```
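💡 A dfm is essentially a (sparse) matrix with one row per document and one column per feature. If you want a feel for its size, something like this should work:

```r
# number of documents and number of features in the dfm
ndoc(dfm_immig)
nfeat(dfm_immig)

# peek at the first few rows and columns of the matrix
dfm_immig[1:3, 1:5]
```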
What are the top features?
```r
topfeatures(dfm_immig, 10) %>% data.frame()
```
🏡 Take-home exercise: the summary above encompasses all documents in the corpus. Can you find a way to use `quanteda` to create a summary per document, similar to what we did per chapter in the books example?
Let’s build a fancy plot. We will construct a feature co-occurrence matrix (fcm) from our dfm and plot it as a network:
```r
# Reduce the dfm
dfm_immig <- dfm(tokens_immig) %>% dfm_trim(min_termfreq = 10)

size <- log(colSums(dfm_immig))

fcm_immig <- dfm_immig %>% fcm()

set.seed(1)
textplot_network(fcm_immig, min_freq = 0.8, vertex_size = size / max(size) * 4)
```
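💡 If you want to keep exploring, `quanteda.textplots` also provides a word cloud that works directly on a dfm; a minimal sketch:

```r
# word cloud of the 50 most frequent features
set.seed(1)
textplot_wordcloud(dfm_immig, max_words = 50)
```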
References
Footnotes
¹ What does it mean for a book to be in the public domain, you ask? Check (Flynn, Giblin, and Petitjean 2019). If you just want a quick, non-scholarly definition of public domain, check Wikipedia.↩︎
² You can think of tokenisation as the process of splitting the text into units. We will use words as units, but in reality, it’s more complex than that. Chapter 1 of (Silge and Robinson 2017) says: “For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.”↩︎