πŸ¦ΈπŸ»β€β™€οΈ Lab 11 – Super tech support

Week 03 – Day 03 – Lab Roadmap (90 min)

Published: 26 July 2023

πŸ¦ΈπŸ»β€β™€οΈ Super tech support

We noticed that many of you were interested in working more on your final project, so instead of the (optional) text mining content, use this lab time to work on your projects. Raise your hand if you need help!

Read the project requirements carefully.

Don’t want to focus on your project? See below for what we had originally planned for today’s lab.

This lab is a continuation of what we did in the morning. There are no hidden exercises; all the code is in this document. The objective is for you to run it and understand what it does.

Part 1: Using kwic cleverly

Let’s try to make this data more interesting for further analysis. Here we will:

  • use the power of kwic to try to extract just the object of the apology
  • build a new corpus out of this new subset of text data
  • remove unnecessary tokens (stop words + punctuation)

Extracting the object of the apology

The output of kwic can be converted to a data frame. Let’s look at that same sample again, only this time we increase the window of tokens that show up before and after the keyword:

quanteda::kwic(tokens_pac %>% head(n=10), pattern="apolog*", window=40) %>% as.data.frame()

The information we care about most is in the post column.
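If you want to eyeball just that column, one option (a sketch; dplyr::pull() is simply one way to grab a single column from the converted data frame) is:

quanteda::kwic(tokens_pac %>% head(n=10), pattern="apolog*", window=40) %>%
  as.data.frame() %>%
  dplyr::pull(post) %>%   # keep only the text after the keyword
  head(3)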

A different pattern

This is good, but there is a downside to the keyword we used: not all entries contain a term matching apolog* in their description. We will have to use a more complex pattern:

df_new <- 
  quanteda::kwic(tokens_pac,
                 pattern="apolog*|regre*|sorrow*|recogni*|around*|sorry*|remorse*|failur*",
                 window=40) %>%
  as.data.frame()
dim(df_new)

We seem to have lost some documents: the original data had 367 rows.
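If you are curious which documents the pattern missed, one way to check (a sketch using setdiff()) is to compare the document names in the tokens object with the ones that made it into df_new:

# Sketch: which documents did the pattern not match at all?
missed <- setdiff(quanteda::docnames(tokens_pac), df_new$docname)
length(missed)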

🎯 ACTION POINT: Take a look at View(df_new)

Handling duplicates

Although we lost documents, there are also duplicated rows in df_new because of multiple pattern matches in the same text:

df_new %>% group_by(docname) %>% filter(n() > 1)

Here is how we are going to deal with these duplicates: let’s keep just the match with the longest post text. This is equivalent to selecting the row with the earliest from value in the data frame above.
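Before we drop any rows, you can spot-check that claim on the duplicated matches, for example by comparing from with the character length of post (a sketch; post_chars is just a throwaway column name):

# Sketch: among duplicated matches, do earlier matches (smaller from) have longer post text?
df_new %>%
  group_by(docname) %>%
  filter(n() > 1) %>%
  mutate(post_chars = nchar(post)) %>%
  select(docname, from, post_chars) %>%
  arrange(docname, from)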

# For each document, keep only the earliest match (and therefore the longest post)
df_new <- df_new %>% arrange(from) %>% group_by(docname) %>% slice(1)
dim(df_new)

Note: this is a choice; there is no absolutely objective way to handle this case. Would you do anything differently?

🏠 TAKE-HOME (OPTIONAL) ACTIVITY: We used to have 367 rows; now we have 327. How would you change the pattern to avoid excluding data from the original data frame? (Note: I do not have a ready solution to this! Feel free to share yours.)
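One way to start exploring (not a ready-made solution): pull out the documents that none of the keywords matched, as in the earlier setdiff() sketch, and read their text for other ways of phrasing an apology.

# Sketch: read the documents the pattern missed to look for new candidate keywords
missed <- setdiff(quanteda::docnames(tokens_pac), df_new$docname)
tokens_pac[missed] %>% as.list() %>% head(2)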

Part 2: Rebuilding the corpus, tokens and dfm

A new corpus

corp_pac <- quanteda::corpus(df_new, text_field="post", docid_field="docname")

corp_pac
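A nice side effect of building the corpus from a data frame is that the remaining kwic columns (pre, keyword, from, to) should come along as document variables. You can check that with docvars():

# The other kwic columns should now be document-level variables
quanteda::docvars(corp_pac) %>% head()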

Get rid of unnecessary tokens

tokens_pac <- 
  # Get rid of punctuation
  quanteda::tokens(corp_pac, remove_punct = TRUE) %>% 
  
  # Get rid of stopwords, finally!
  quanteda::tokens_remove(pattern = quanteda::stopwords("en"))
tokens_pac
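If you want to see how much the stop word and punctuation removal shrank each document, ntoken() gives per-document token counts (a quick sketch):

# Per-document token counts after removing punctuation and stop words
quanteda::ntoken(tokens_pac) %>% head()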

Now rebuild the dfm:

dfm_pac <- quanteda::dfm(tokens_pac)
dfm_pac

Top Features

How did the top features change after removing stop words?

dfm_pac %>% quanteda::topfeatures()
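By default, topfeatures() shows the 10 most frequent terms; you can ask for more with the n argument:

# Top 20 features instead of the default 10
dfm_pac %>% quanteda::topfeatures(n = 20)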

Word Cloud

It’s easy to create a word cloud if you already have a dfm:

quanteda.textplots::textplot_wordcloud(dfm_pac)
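If the cloud looks too crowded or too sparse, textplot_wordcloud() has arguments to tune it, for instance max_words and min_count (the values below are just illustrative):

# A tuned word cloud: cap the number of words shown and drop very rare terms
quanteda.textplots::textplot_wordcloud(dfm_pac, max_words = 100, min_count = 2)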

πŸ—£οΈ CLASSROOM DISCUSSIONS: What do you think of the word cloud above, compared to the previous one? How would it be different if we had removed unnecessary tokens but kept the original longer description?

Can you think of any application of this methodology to your own project?