🦸🏻‍♀️ Lab 11 – Super tech support
Week 03 – Day 03 – Lab Roadmap (90 min)
🦸🏻‍♀️ Super tech support
We noticed that many of you were interested in working more on your final project, so instead of the (optional) text mining content, use this lab time to work on your projects. Raise your hand if you need help!
Read the project requirements carefully.
Don't want to focus on your project? See what we had originally planned for today's lab.
This lab is a continuation of what we did in the morning. There are no hidden exercises; all the code is in this document. The objective is for you to run it and understand what it does.
Part 1: Using kwic cleverly
Let's try to make this data more interesting for further analysis. Here we will:
- use the power of kwic to try to extract just the object of the apology
- build a new corpus out of this new subset of text data
- remove unnecessary tokens (stop words + punctuation)
Extracting the object of the apology
The output of kwic can be converted to a data frame. Let's look at that same sample again, only this time we increase the window of tokens that show up before and after the keyword:
quanteda::kwic(tokens_pac %>% head(n=10), pattern="apolog*", window=40) %>%
  as.data.frame()
The info we care about the most is the post column.
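If you want to see everything the conversion gives you, the kwic data frame has a fixed set of columns: docname, from, to, pre, keyword, post, and pattern. A quick sketch (df_kwic is just a throwaway name used for illustration):

df_kwic <-
  quanteda::kwic(tokens_pac %>% head(n=10), pattern="apolog*", window=40) %>%
  as.data.frame()
# Confirm the column names, then keep only the ones we care about
names(df_kwic)
df_kwic %>% select(docname, keyword, post)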
A different pattern
This is good, but there is a downside to the keyword we used: not all entries contain the term apolog* in their description. We will have to use a more complex pattern:
df_new <-
quanteda::kwic(tokens_pac,
pattern="apolog*|regre*|sorrow*|recogni*|around*|sorry*|remorse*|failur*",
window=40) %>%
as.data.frame()
dim(df_new)
We seem to have lost some documents: the original data frame has 367 rows.
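If you are curious which documents were dropped, you can compare document names. A minimal sketch, assuming tokens_pac still holds all 367 original documents (missing_docs is a hypothetical name):

# Documents where none of the patterns matched
missing_docs <- setdiff(quanteda::docnames(tokens_pac), df_new$docname)
length(missing_docs)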
🎯 ACTION POINT: Take a look at View(df_new)
Handling duplicates
Although we lost rows, there are duplicated rows in df_new because of multiple pattern matches in the same text:
df_new %>% group_by(docname) %>% filter(n() > 1)
Here is how we are going to deal with these duplicates: let's keep just the one with the longest post text. This is equivalent to selecting the one with the earliest from value in the data frame above.
df_new <- df_new %>% arrange(from) %>% group_by(docname) %>% slice(1)
dim(df_new)
Note: This is a choice; there is no single objective way to handle this case. Would you do anything differently?
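For reference, dplyr's slice_min() expresses the same choice a bit more directly; with_ties = FALSE guarantees exactly one row per document even if two matches share the same from value:

df_new %>%
  group_by(docname) %>%
  slice_min(from, n = 1, with_ties = FALSE) %>%
  ungroup()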
🏠 TAKE-HOME (OPTIONAL) ACTIVITY: We used to have 367 rows; now we have 327. How would you change the pattern to avoid excluding data from the original data frame? (Note: I do not have a ready solution to this! Feel free to share yours.)
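If you want a starting point (not a solution!), one idea is to look at what the excluded documents actually talk about and mine them for extra keywords. This sketch reuses the hypothetical missing_docs from earlier:

missing_docs <- setdiff(quanteda::docnames(tokens_pac), df_new$docname)
# Most frequent words among the documents the patterns missed
quanteda::tokens_subset(tokens_pac, quanteda::docnames(tokens_pac) %in% missing_docs) %>%
  quanteda::dfm() %>%
  quanteda::topfeatures(n = 30)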
Part 2: Rebuilding the corpus, tokens and dfm
A new corpus
corp_pac <- quanteda::corpus(df_new, text_field="post", docid_field="docname")
corp_pac
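Note that the remaining kwic columns (pre, keyword, from, to, and so on) are carried over as document variables of the new corpus, which you can confirm with docvars():

quanteda::docvars(corp_pac) %>% head()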
Get rid of unnecessary tokens
tokens_pac <-
  # Get rid of punctuation
  quanteda::tokens(corp_pac, remove_punct = TRUE) %>%
  # Get rid of stopwords, finally!
  quanteda::tokens_remove(pattern = quanteda::stopwords("en"))
tokens_pac
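To get a feel for how much the cleanup removed, ntoken() counts the tokens left in each document (remember each post was at most 40 tokens long to begin with):

# Tokens remaining per document after removing punctuation and stop words
quanteda::ntoken(tokens_pac) %>% head()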
Now rebuild the dfm:
dfm_pac <- quanteda::dfm(tokens_pac)
dfm_pac
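A quick sanity check before looking at individual features: dim() gives the documents × features size of the dfm, and featnames() lists the features themselves:

dim(dfm_pac)
# Peek at a few of the remaining features
quanteda::featnames(dfm_pac) %>% head(n=20)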
Top Features
How did the top features change after removing stop words?
dfm_pac %>% quanteda::topfeatures()
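topfeatures() shows 10 features by default; pass n = to see more of the distribution:

dfm_pac %>% quanteda::topfeatures(n = 25)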
Word Cloud
It's easy to create a word cloud if you already have a dfm:
quanteda.textplots::textplot_wordcloud(dfm_pac)
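If the default plot looks too busy, textplot_wordcloud() accepts a few tweaks (these arguments are documented in quanteda.textplots):

quanteda.textplots::textplot_wordcloud(
  dfm_pac,
  max_words = 100,  # plot at most 100 words
  min_count = 2,    # drop words that appear only once
  color = "darkblue"
)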
🗣️ CLASSROOM DISCUSSIONS: What do you think of the word cloud above, compared to the previous one? How would it be different if we had removed unnecessary tokens but kept the original longer description?
Can you think of any application of this methodology to your own project?