LSE DS101L 2022/23 - Week 09
This week, we are back to being full-blown empiricists again. Rather than providing an inventory of concepts, I will show you the basic steps that are common to any initial analysis of text as data.
My commentaries throughout the examples will reflect how these topics relate to the things we have been seeing in this course. You might have trouble understanding things if you have not attended the lecture.
📚 Learning Outcomes
Discover the concept of tidy text, akin to that of tidy data
Understand the notion of tokens
Understand what stop words are and why we need them
Conduct a basic exploratory data analysis of text using R
⚙️ Setup (~15 min)
Before we start, let’s ensure we all have the same software dependencies installed. Let’s go through the two action points below.
🎯 ACTION POINTS:
📦 Open the R Console and type the following commands:
install.packages("tidyverse") # a very useful set of R packages install.packages("curl") # to download files from the internet install.packages("gutenbergr") # a package to download books from Gutenberg.org install.packages("tidytext") # a package for text analysis install.packages("quanteda") # a package for different text analysis install.packages("quanteda.textplots")
💡 Did you notice that the chunk above contained code but you could NOT run it directly? If you change from the Visual to the Source view, you might be able to understand why. Ask your instructor if you didn’t spot the difference.
✅ Let’s check that these installations all worked fine. You should be able to run the chunk below without errors:
```r
# These packages are part of tidyverse
# Read more on tidyverse.org
library(dplyr)
library(magrittr)
library(stringr)
library(forcats)
library(ggplot2)

library(gutenbergr)
library(tidytext)
library(quanteda)
library(quanteda.textplots)
```
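💡 If any of the `library()` calls above fails, the corresponding `install.packages()` step probably did not complete. As an optional extra check (not required for this lab), you can also confirm which versions you have installed:

```r
# print the installed version of two of the key packages
packageVersion("tidytext")
packageVersion("quanteda")
```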
1️⃣ Books as data (~20 min)
The internet is full of text. There are countless news pieces, opinion articles, social media posts, encyclopaedic entries, etc. Here, we are going to focus on one particular type of text: books that are in the public domain¹.
🎯 ACTION POINTS:
📚 Browse Project Gutenberg and choose a book (it must be in English) of your liking.
🔎 Take note of the book ID (an integer number). There are several ways to locate this number once you click on a book: you can either copy it from the URL of the page or scroll down until you find the line labelled `EBook-No.`

👨‍💻 Set the variable `BOOK_ID` below to the book ID of your selected book:

```r
# add your book ID here, right after the <- sign.
BOOK_ID <- 39713

my_book <- gutenberg_download(BOOK_ID)
```
✅ Check that the download was successful. Type the following command on the console. You should see a data frame with the content of the book.
```r
View(my_book)
```
(No one in their right mind would choose to read a book like this, but it certainly makes text analysis easier.)
👥 In pairs: Take a look at the book data frame of the person sitting next to you.
2️⃣ Tokens and stop words (~25 min)
🎯 ACTION POINTS:
The first thing one does when performing quantitative text analysis is to identify TOKENS². Run the chunk below to look at the first tokens of your book:
```r
my_tokens <- 
  my_book %>% 
  unnest_tokens(word, text)

# look at the first 20 tokens
my_tokens %>% head(20)
```
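💡 A small aside, not needed for the rest of the lab: as footnote 2 mentions, tokens do not have to be single words. If you are curious, `unnest_tokens()` can also split the text into n-grams, for example bigrams (pairs of consecutive words):

```r
# tokenise into bigrams instead of single words
my_book %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  head(10)
```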
How many tokens are there in total?
```r
n_tokens <- my_tokens %>% nrow()

n_tokens
```
How many unique tokens are there?
```r
n_unique_tokens <- 
  my_tokens %>% 
  select(word) %>% 
  n_distinct()

n_unique_tokens
```
Let’s count the occurrence of each word in this book:
```r
top100 <- 
  my_tokens %>% 
  count(word, sort=TRUE) %>% 
  head(100) # 100 most frequent words

top100
```
👥 DISCUSS IN PAIRS: Do you think this ranking of words provides a good summary of your book?
What do these frequencies represent in percentage terms?
```r
top100 %>% mutate(pctg=n/n_tokens)
```
Enter stop words (a list of words that are too frequent and thus do not help much with the analysis):
```r
# load stop words from the tidytext package
data(stop_words)

stop_words
```
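💡 If you are curious about where these stop words come from: `stop_words` bundles several lexicons. A quick, optional way to see how many words each lexicon contributes:

```r
# number of stop words per lexicon
stop_words %>% count(lexicon, sort = TRUE)
```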
Let’s remove the stop words from our tokens. Take a look at the new list of top 100 tokens:
```r
top100 <- 
  my_tokens %>% 
  anti_join(stop_words) %>% 
  count(word, sort=TRUE) %>% 
  head(100) %>% 
  mutate(pctg=n/n_tokens)

top100
```
👥 DISCUSS IN PAIRS: What do you think of this new list?
💡 If you feel like adding new stop words, use the template below:
```r
stop_words <- 
  bind_rows(stop_words,
            list(word=c("_a_", "_b_"), lexicon=c("Custom")))
```
3️⃣ Per-chapter analysis (~15 min)
Books normally follow a rigid structure. Here, we will exploit the fact that many books are divided into chapters. The chunk below detects lines that look like chapter headings and records, for each line of the book, the chapter it belongs to:
```r
my_book <- 
  my_book %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]", 
                                           ignore_case = TRUE))))

my_book %>% head()
```
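As a quick sanity check (assuming your book’s chapter headings match the `^chapter` pattern used above), you can count how many chapters were detected:

```r
# the largest chapter number; 0 would mean no chapter headings were matched
my_book %>% summarise(n_chapters = max(chapter))
```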
Now we re-tokenise the book, keeping the chapter column we have just created:

```r
my_tokens <- 
  my_book %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  na.omit()
```
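💡 The `str_extract()` call above keeps only the first run of letters (and apostrophes) in each token, which strips things like the underscores Project Gutenberg often uses to mark italics; `na.omit()` then drops tokens with no match at all. A tiny illustration with made-up strings:

```r
# str_extract() returns the first match of the pattern, or NA if there is none
str_extract(c("_emphasis_", "don't", "42"), "[a-z']+")
#> [1] "emphasis" "don't"    NA
```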
```r
# Number of tokens per chapter
tokens_per_chapter <- 
  my_tokens %>%
  group_by(chapter) %>%
  summarise(tokens_in_chapter=n())

tokens_per_chapter
```
Which are the most frequent tokens in each chapter?

```r
top_tokens_per_chapter <- 
  my_tokens %>%
  anti_join(stop_words) %>%
  count(chapter, word, sort=TRUE) %>%
  ungroup() %>%
  left_join(tokens_per_chapter) %>%
  mutate(pctg=n/tokens_in_chapter) %>%
  group_by(chapter) %>%
  slice_max(n, n=10)
```
```r
# 💡 By setting echo & output to FALSE, R will run and use this
# chunk of code, but it won't be visible in the produced HTML.
# I am leaving this here just as an example of how to
# produce a markdown table. You can then copy the text
# that comes out of running this command and place it in
# the SOURCE of your .Rmd or .qmd file.
top_tokens_per_chapter %>% filter(chapter == 1) %>% knitr::kable()
```
Top tokens for chapter 1:
chapter | word | n | tokens_in_chapter | pctg |
---|---|---|---|---|
1 | true | 36 | 3667 | 0.0098173 |
1 | reasoning | 22 | 3667 | 0.0059995 |
1 | theorem | 20 | 3667 | 0.0054540 |
1 | gamma | 17 | 3667 | 0.0046359 |
1 | recurrence | 15 | 3667 | 0.0040905 |
1 | analytic | 14 | 3667 | 0.0038178 |
1 | equality | 14 | 3667 | 0.0038178 |
1 | mathematical | 13 | 3667 | 0.0035451 |
1 | demonstration | 11 | 3667 | 0.0029997 |
1 | purely | 11 | 3667 | 0.0029997 |
1 | science | 11 | 3667 | 0.0029997 |
Plots:
Bar plots:
```r
# Tip: you can control the width and height of images
# with the fig.height and fig.width parameters above.
g <- 
  (ggplot(top_tokens_per_chapter %>% filter(chapter > 0, chapter < 4), 
          aes(x=fct_reorder(word, pctg), y=pctg))
   + geom_col()
   + coord_flip()
   + scale_y_continuous(labels=scales::percent)
   + scale_fill_gradient()
   + labs(x="Tokens", y="Token Frequency",
          caption="Tokens are kind of ordered by frequency.\nIt's complicated")
   + facet_wrap(~ chapter, scales="free_y", labeller = label_both, ncol = 1)
   + theme_bw()
  )

g
```
Dot plots:
```r
# Tip: you can control the width and height of images
# with the fig.height and fig.width parameters above.
plot_df <- 
  top_tokens_per_chapter %>%
  mutate(word=fct_reorder(word, pctg))

selected_words <- levels(plot_df$word)[1:20]

plot_df <- plot_df %>% filter(word %in% selected_words)

g <- 
  (ggplot(plot_df, aes(x=factor(chapter), y=word, size=pctg, color=pctg))
   + geom_point()
   + scale_color_viridis_b(labels=scales::percent)
   + scale_size_area(labels=scales::percent)
   + labs(x="Chapter", y="Tokens",
          caption="Tokens are kind of ordered by frequency.\n(It's complicated)")
   + theme_bw()
   + theme(text=element_text(size=rel(4)),
           legend.text = element_text(size=rel(2)),
           legend.position="bottom")
  )

g
```
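💡 If you would like to share your plot outside of the rendered notebook, `ggsave()` can write it to a file (the filename and dimensions below are just an example):

```r
# save the plot object g to a PNG file
ggsave("top_tokens_dotplot.png", plot = g, width = 6, height = 8)
```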
👥 Let’s look at each other’s plots!
4️⃣ Corpora and the dfm (~20 min)
The analysis we will be doing in this section could also be performed with `tidytext`, but to give you an additional example of a more powerful tool, we will now focus our attention on another R package: `quanteda`.
🎯 ACTION POINTS:
Let’s start by loading the data we will be using:
```r
data(data_char_ukimmig2010)
```
Now, let’s take a look at what is in this data:
```r
View(data_char_ukimmig2010)
```
🗣️ CLASS-WIDE DISCUSSION: What do you see?
Let’s transform this into a corpus:
```r
corp_immig <- corpus(data_char_ukimmig2010, 
                     docvars = data.frame(party = names(data_char_ukimmig2010)))

corp_immig
```
Let’s summarise it:
```r
summary(corp_immig)
```
The `corpus()` function has done much of the work for us!
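💡 For instance, the party names we passed in are now stored as a document-level variable. Two `quanteda` helpers you might find handy here:

```r
# document-level variables attached to the corpus
docvars(corp_immig)

# number of documents in the corpus
ndoc(corp_immig)
```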
We can now extract the tokens, remove stop words, etc.:
```r
tokens_immig <- 
  corp_immig %>%
  tokens(remove_punct = TRUE, remove_symbols=TRUE, remove_numbers=TRUE) %>%
  tokens_select(pattern = stopwords("en"), selection = "remove") %>%
  tokens_select(pattern = c("immig*", "migra*"), selection="remove")

head(tokens_immig)
```
Having filtered the tokens, we can now construct a document-frequency matrix:
```r
dfm_immig <- dfm(tokens_immig)

dfm_immig
```
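💡 A dfm is essentially a (sparse) matrix with one row per document and one column per feature. If you want a feel for its size, something like this should work:

```r
# number of documents and number of features in the dfm
ndoc(dfm_immig)
nfeat(dfm_immig)

# peek at the first few rows and columns of the matrix
dfm_immig[1:3, 1:5]
```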
What are the top features?
```r
topfeatures(dfm_immig, 10) %>% data.frame()
```
🏡 Take-home exercise: the summary above encompasses all documents in the corpus. Can you find a way to use `quanteda` to create a summary per document, similar to what we did per chapter in the books example?
Let’s build a fancy plot. We will construct a feature co-occurrence matrix (fcm) from our dfm and plot it as a network:
```r
# Reduce the dfm
dfm_immig <- dfm(tokens_immig) %>% dfm_trim(min_termfreq = 10)

size <- log(colSums(dfm_immig))

fcm_immig <- dfm_immig %>% fcm()

set.seed(1)
textplot_network(fcm_immig, min_freq = 0.8, vertex_size = size / max(size) * 4)
```
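💡 If you want to keep exploring, `quanteda.textplots` also provides a word cloud that works directly on a dfm; a minimal sketch:

```r
# word cloud of the 50 most frequent features
set.seed(1)
textplot_wordcloud(dfm_immig, max_words = 50)
```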
References
Footnotes
¹ What does it mean for a book to be in the public domain, you ask? Check (Flynn, Giblin, and Petitjean 2019). If you just want a quick, non-scholarly definition of public domain, check Wikipedia.↩︎
² You can think of tokenisation as the process of splitting the text into units. We will use words as units, but in reality, it’s more complex than that. Chapter 1 of (Silge and Robinson 2017) says: “For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.”↩︎