LSE DS101L 2022/23 - Week 09

Published: 03/11/2023

This week, we are back to being full-blown empiricists again. Rather than providing an inventory of concepts, I will show you the basic steps that are common to any initial analysis of text as data.

My commentaries throughout the examples will reflect how these topics relate to the things we have been seeing in this course. You might have trouble understanding things if you have not attended the lecture.

📚 Learning Outcomes

⚙️ Setup (~15 min)

Before we start, let’s ensure we all have the same software dependencies installed. Let’s go through the two action points below.

🎯 ACTION POINTS:

  1. 📦 Open the R Console and type the following commands:

    install.packages("tidyverse")   # a very useful set of R packages
    
    install.packages("curl")        # to download files from the internet
    install.packages("gutenbergr")  # a package to download books from Gutenberg.org
    
    install.packages("tidytext")    # a package for text analysis
    install.packages("quanteda")    # a package for different text analysis
    install.packages("quanteda.textplots")

    💡 Did you notice that the chunk above contained code but you could NOT run it directly? If you change from the Visual to the Source view, you might be able to understand why. Ask your instructor if you didn’t spot the difference.

  2. ✅ Let’s check that these installations all worked fine. You should be able to run the chunk below without errors (if anything fails, there is a troubleshooting sketch right after the chunk):

    # These packages are part of tidyverse
    # Read more on tidyverse.org
    library(dplyr)
    library(magrittr)
    library(stringr)
    library(forcats)
    library(ggplot2)
    
    library(gutenbergr)
    
    library(tidytext)
    library(quanteda)
    library(quanteda.textplots)
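
    💡 If any of the library() calls above fails, the corresponding package probably did not install correctly. Here is a minimal troubleshooting sketch (plain base R, nothing course-specific) that lists which of the required packages are still missing, so you know what to re-install:

    # Which of the packages used in this lab are not installed yet?
    required_pkgs <- c("tidyverse", "curl", "gutenbergr",
                       "tidytext", "quanteda", "quanteda.textplots")
    
    missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))
    
    missing_pkgs  # character(0) means everything is in place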

1️⃣ Books as data (~20 min)

The internet is full of text. There are countless news pieces, opinion articles, social media posts, encyclopaedic entries, etc. Here, we are going to focus on one particular type of text: books that are in the public domain¹.

🎯 ACTION POINTS:

  1. 📚 Browse Project Gutenberg and choose a book (it must be in English) of your liking.

  2. 🔎 Take note of the book ID (an integer number). There are several ways to locate this number once you click on a book: you can either copy it from the URL of the page or scroll down until you find the line with the EBook-No. information. (You can also search for a book ID without leaving R; see the sketch after this list.)

  3. 👨‍💻 Set the variable BOOK_ID below to the book ID of your selected book:

    # add your book ID here, right after the <- sign.
    BOOK_ID <- 39713 
    
    my_book <- gutenberg_download(BOOK_ID)
  4. ✅ Check that the download was successful. Type the following command on the console. You should see a data frame with the content of the book.

    View(my_book)

    (No one in their right mind would choose to read a book like this, but it certainly makes text analysis easier.)

  5. 👥 In pairs: Take a look at the book data frame of the person sitting next to you.
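
💡 A side note on action point 2: you do not have to hunt for the EBook-No. on the website. The gutenbergr package ships with Project Gutenberg’s metadata, and gutenberg_works() filters it to works whose full text can be downloaded. A minimal sketch (the title below is just an example; swap in words from the title of a book you are interested in):

library(dplyr)
library(stringr)
library(gutenbergr)

# Search the bundled Project Gutenberg metadata by title
gutenberg_works(str_detect(title, "Sherlock Holmes")) %>% 
  select(gutenberg_id, title, author)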

2️⃣ Tokens and stop words (~25 min)

🎯 ACTION POINTS:

  1. The first thing one does when performing quantitative text analysis is to identify TOKENS². Run the chunk below to look at the first tokens of your book (if you are curious about token units other than single words, there is an optional n-gram sketch at the end of this section):

    my_tokens <-
      my_book %>% 
      unnest_tokens(word, text)
    
    my_tokens %>% head(20)
  2. How many tokens are there in total?

    n_tokens <- my_tokens %>% nrow()
    
    n_tokens
  3. How many unique tokens are there? (We will come back to this number in a short aside after this list.)

    n_unique_tokens <-
      my_tokens %>%
      select(word) %>%
      n_distinct()
    
    n_unique_tokens
  4. Let’s count the occurrence of each word in this book:

    top100 <- 
      my_tokens %>% 
      count(word, sort=TRUE) %>% 
      head(100) # 100 most frequent words
    
    top100
  5. 👥 DISCUSS IN PAIRS: Do you think this ranking of words provides a good summary of your book?

  6. What do these frequencies represent in percentage terms?

    top100 %>% mutate(pctg=n/n_tokens)
  7. Enter stop words (a list of words that are too frequent and thus do not help much with the analysis):

    # load stop words from the tidytext package
    data(stop_words)
    
    stop_words
  8. Let’s remove the stop words from our tokens. Take a look at the new list of top 100 tokens:

    top100 <- 
      my_tokens %>% 
      anti_join(stop_words) %>%
      count(word, sort=TRUE) %>% 
      head(100) %>%
      mutate(pctg=n/n_tokens)
    
    top100
  9. 👥 DISCUSS IN PAIRS: What do you think of this new list?
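
💡 A quick aside on action points 2 and 3: dividing the number of unique tokens by the total number of tokens gives a crude measure of lexical diversity, the type-token ratio. A minimal sketch, reusing the n_tokens and n_unique_tokens objects computed above (keep in mind this ratio is only really comparable across texts of similar length):

# Type-token ratio: the share of tokens that are distinct word types
ttr <- n_unique_tokens / n_tokens
ttr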

💡 If you feel like adding new stop words, use the template below:

stop_words <- 
  bind_rows(stop_words, 
            tibble(word = c("_a_", "_b_"), lexicon = "Custom"))

3️⃣ Per-Chapter analysis (~15 min)

Books normally follow a fairly rigid structure. Here, we will exploit the fact that many books are divided into chapters.

# Flag lines that start a chapter heading (e.g. "Chapter 1", "CHAPTER IV");
# the cumulative sum of these flags gives each line its chapter number.
my_book <-
  my_book %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE))))
my_book %>% head()
# Re-tokenise, now keeping the chapter of each word.
# str_extract() keeps only letters and apostrophes; tokens that end up
# empty (NA) are dropped by na.omit().
my_tokens <- 
  my_book %>% 
  unnest_tokens(word, text) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  na.omit()
# Number of tokens per chapter
tokens_per_chapter <-
  my_tokens %>%
  group_by(chapter) %>%
  summarise(tokens_in_chapter=n())
tokens_per_chapter
# The 10 most frequent tokens in each chapter (ties are kept, so some chapters
# may show more than 10), with their share of that chapter's tokens
top_tokens_per_chapter <-
  my_tokens %>% 
  anti_join(stop_words) %>%
  count(chapter, word, sort=TRUE) %>%
  ungroup() %>% 
  left_join(tokens_per_chapter) %>%
  mutate(pctg=n/tokens_in_chapter) %>%
  group_by(chapter) %>% 
  slice_max(n, n=10)
# 💡 By setting echo & output to FALSE, R will still run this chunk of code,
# but neither the code nor its output will appear in the produced HTML.

# I am leaving this here as just an example of how to 
# produce a markdown table. You can then just copy the text
# that comes out of running this command and place it in
# the SOURCE of your .Rmd or .qmd file.

top_tokens_per_chapter %>% filter(chapter == 1) %>% knitr::kable()

Top tokens for chapter 1:

| chapter | word          |  n | tokens_in_chapter |      pctg |
|--------:|:--------------|---:|------------------:|----------:|
|       1 | true          | 36 |              3667 | 0.0098173 |
|       1 | reasoning     | 22 |              3667 | 0.0059995 |
|       1 | theorem       | 20 |              3667 | 0.0054540 |
|       1 | gamma         | 17 |              3667 | 0.0046359 |
|       1 | recurrence    | 15 |              3667 | 0.0040905 |
|       1 | analytic      | 14 |              3667 | 0.0038178 |
|       1 | equality      | 14 |              3667 | 0.0038178 |
|       1 | mathematical  | 13 |              3667 | 0.0035451 |
|       1 | demonstration | 11 |              3667 | 0.0029997 |
|       1 | purely        | 11 |              3667 | 0.0029997 |
|       1 | science       | 11 |              3667 | 0.0029997 |

Plots:

Bar plots:

# Tip: you can control the width and height of images
#      with the fig.height and fig.width parameters above.

g <- 
  (
   ggplot(top_tokens_per_chapter %>% filter(chapter > 0, chapter < 4), 
          aes(x=fct_reorder(word, pctg), y=pctg))
   + geom_col()
   + coord_flip()
   
   + scale_y_continuous(labels=scales::percent)
   + scale_fill_gradient()  # has no effect here: no fill aesthetic is mapped
   
   # with coord_flip(), the x aesthetic (the words) ends up on the vertical axis
   + labs(x="Tokens", y="Token Frequency", 
          caption="Tokens are kind of ordered by frequency.\nIt's complicated")
   
   + facet_wrap(~ chapter, scales="free_y", labeller = label_both, ncol = 1)
   + theme_bw()
   
  )

g

Dot plots:

# Tip: you can control the width and height of images
#      with the fig.height and fig.width parameters above.

plot_df <- 
  top_tokens_per_chapter %>% 
  mutate(word=fct_reorder(word, pctg)) 

# fct_reorder() sorts the factor levels from lowest to highest (median) pctg,
# so this keeps the 20 *least* frequent of the per-chapter top tokens;
# use tail(levels(plot_df$word), 20) if you want the 20 most frequent instead
selected_words <- levels(plot_df$word)[1:20]

plot_df <- plot_df %>% filter(word %in% selected_words)

g <- 
  (
   ggplot(plot_df, aes(x=factor(chapter), y=word, size=pctg, color=pctg))
   + geom_point()
   
   + scale_color_viridis_b(labels=scales::percent)
   + scale_size_area(labels=scales::percent)
   
   + labs(x="Chapter", y="Tokens", 
          caption="Tokens are kind of ordered by frequency.\n(It's complicated)")
   
   + theme_bw()
   + theme(text=element_text(size=rel(4)),
           legend.text = element_text(size=rel(2)),
           legend.position="bottom")
   
  )

g
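
💡 If you want to keep a copy of a plot outside the rendered document (for slides, for example), ggplot2’s ggsave() writes a plot object to disk. A minimal sketch (the file name and dimensions below are arbitrary placeholders):

# Save the dot plot above as a PNG in the current working directory
ggsave("top_tokens_dotplot.png", plot = g, width = 6, height = 8, dpi = 300)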

👥 Let’s look at each other’s plots!

4️⃣ Corpora and the dfm (~20 min)

The analysis we will be doing in this section could also be performed with tidytext, but to give you an example of another, more powerful tool, we will now turn our attention to a different R package: quanteda.

🎯 ACTION POINTS:

  1. Let’s start by loading the data we will be using:

    data(data_char_ukimmig2010)
  2. Now, let’s take a look at what is in this data:

    View(data_char_ukimmig2010)
  3. 🗣️ CLASS-WIDE DISCUSSION: What do you see?

  4. Let’s transform this into a corpus:

    corp_immig <- corpus(data_char_ukimmig2010, 
                         docvars = data.frame(party = names(data_char_ukimmig2010)))
    corp_immig
  5. Let’s summarise it:

    summary(corp_immig)

    The corpus() function has done much of the work for us! Notice that the summary already reports, for each document, the number of types (unique tokens), tokens, and sentences.

  6. We can now extract the tokens, remove stop words, etc.:

    tokens_immig <- 
      corp_immig %>% 
      tokens(remove_punct = TRUE, remove_symbols=TRUE, remove_numbers=TRUE) %>%
      tokens_select(pattern = stopwords("en"), selection = "remove") %>%
      tokens_select(pattern = c("immig*", "migra*"), selection="remove")
    head(tokens_immig)
  7. Having filtered the tokens, we can now construct a document-feature matrix (dfm):

    dfm_immig <- dfm(tokens_immig)
    dfm_immig
  8. What are the top features?

    topfeatures(dfm_immig, 10) %>% data.frame()

    🏡 Take-home exercise: the summary above encompasses all documents in the corpus. Can you find a way to use quanteda to create a similar summary per document, like the per-chapter analysis we did in the books example?

  9. Let’s build a fancy plot. We will construct a feature co-occurrence matrix (FCM) from our dfm and draw it as a network:

    # Reduce the dfm
    dfm_immig <- 
      dfm(tokens_immig) %>% 
      dfm_trim(min_termfreq = 10)
    
    size <- log(colSums(dfm_immig))
    
    fcm_immig <- dfm_immig %>% fcm()
    
    set.seed(1)
    textplot_network(fcm_immig, min_freq = 0.8, vertex_size = size / max(size) * 4)
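
💡 One more optional visual before we wrap up: quanteda.textplots also provides textplot_wordcloud(), which draws a word cloud directly from a dfm. A minimal sketch using the trimmed dfm from the previous step (max_words is an arbitrary choice):

# Word cloud of the most frequent features across all the manifestos
set.seed(1)
textplot_wordcloud(dfm_immig, max_words = 50)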

References

Flynn, Jacob, Rebecca Giblin, and François Petitjean. 2019. “What Happens When Books Enter the Public Domain? Testing Copyright’s Underuse Hypothesis Across Australia, New Zealand, The United States and Canada.” University of New South Wales Law Journal 42 (4). https://doi.org/10.53637/SRQB5157.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing ; Boston: O’Reilly. https://www.tidytextmining.com/index.html.

Footnotes

  1. What does it mean for a book to be in the public domain, you ask? Check Flynn, Giblin, and Petitjean (2019). But if you are here for just a quick non-scholarly definition of public domain, just check Wikipedia.↩︎

  2. You can think of tokenisation as the process of splitting the text into units. We will use words as units, but in reality, it’s more complex than that. Chapter 1 of Silge and Robinson (2017) says: “For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.”↩︎