🦸 Lab 11 – Super tech support
Week 03 – Day 03 – Lab Roadmap (90 min)
🦸 Super tech support
We noticed that many of you were interested in working more on your final project, so instead of the (optional) text mining content, use this lab time to work on your projects. Raise your hand if you need help!
Read the project requirements carefully.
Don't want to focus on your project? See below for what we had originally planned for today's lab.
This lab is a continuation of what we did in the morning. There are no hidden exercises; all the code is in this document. The objective is for you to run it and understand what it does.
Part 1: Using `kwic` cleverly
Let's try to make this data more interesting for further analysis. Here we will:

- use the power of `kwic` to try to extract just the object of the apology
- build a new corpus out of this new subset of text data
- remove unnecessary tokens (stop words + punctuation)
Extracting the object of the apology
The output of `kwic` can be converted to a data frame. Let's look at that same sample again, only this time we increase the `window` of tokens that show up before and after the keyword:
quanteda::kwic(tokens_pac %>% head(n=10), pattern="apolog*", window=40) %>% as.data.frame()
The info we care about the most is the `post` column.
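If you want to isolate just the parts we care about, here is a small sketch (assuming `dplyr` is loaded, as in the morning session) that keeps only the document name and the text after the keyword:

quanteda::kwic(tokens_pac %>% head(n=10), pattern="apolog*", window=40) %>%
  as.data.frame() %>%
  dplyr::select(docname, post)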
A different pattern
This is good, but there is a downside to the keyword we used: not all entries have the term `apolog*` in their description. We will have to use a more complex pattern:
df_new <-
  quanteda::kwic(tokens_pac,
                 pattern="apolog*|regre*|sorrow*|recogni*|around*|sorry*|remorse*|failur*",
                 window=40) %>%
  as.data.frame()
dim(df_new)
We seem to have lost some documents. The original data frame has 367 rows.
🎯 ACTION POINT: Take a look at `View(df_new)`.
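If you are curious about which documents were lost, one way (a sketch, assuming `tokens_pac` still holds all 367 original documents) is to compare document names before and after the `kwic()` step:

# Documents where none of our patterns matched
lost_docs <- setdiff(quanteda::docnames(tokens_pac), unique(df_new$docname))
length(lost_docs)
head(lost_docs)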
Handling duplicates
Although we lost rows, there are duplicated rows in `df_new` because of multiple pattern matches in the same text:
df_new %>% group_by(docname) %>% filter(n() > 1)
Here is how we are going to deal with these duplicates: let's keep just the one with the longest `post` text. This is equivalent to selecting the one with the earliest `from` value in the data frame above.
df_new <- df_new %>% arrange(from) %>% group_by(docname) %>% slice(1)
dim(df_new)
Note: this is a choice; there is no absolutely objective way to handle this case. Would you do anything differently?
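For instance, a more literal reading of "keep the longest `post` text" would measure the length of `post` directly. The sketch below uses `dplyr::slice_max()` and an illustrative name, `df_new_longest`; it would replace the `arrange()`/`slice()` chunk above, since it needs the data frame while it still contains the duplicates:

# Alternative: per document, keep the match whose post text has the most characters
df_new_longest <- df_new %>%
  group_by(docname) %>%
  slice_max(nchar(post), n = 1, with_ties = FALSE) %>%
  ungroup()
dim(df_new_longest)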
🏠 TAKE-HOME (OPTIONAL) ACTIVITY: We used to have 367 rows; now we have 327. How would you change the `pattern` to avoid excluding data from the original data frame? (Note: I do not have a ready solution to this! Feel free to share yours.)
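If you decide to try it, two things may help. First, `quanteda` also accepts a character vector of patterns, which is easier to edit as you experiment than one long string of `|`-separated globs. Second, a small helper function (the name `pattern_coverage` is purely illustrative, not part of quanteda) can tell you how many distinct documents a candidate pattern reaches:

# The same patterns, written as a vector that is easy to extend
apology_patterns <- c("apolog*", "regre*", "sorrow*", "recogni*",
                      "around*", "sorry*", "remorse*", "failur*")

# Count how many distinct documents a pattern (or vector of patterns) matches
pattern_coverage <- function(pat) {
  quanteda::kwic(tokens_pac, pattern = pat, window = 40) %>%
    as.data.frame() %>%
    dplyr::pull(docname) %>%
    dplyr::n_distinct()
}
pattern_coverage(apology_patterns)  # compare against the 367 original documents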
Part 2: Rebuilding the corpus, tokens and dfm
A new corpus
corp_pac <- quanteda::corpus(df_new, text_field="post", docid_field="docname")
corp_pac
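As a sanity check, you can confirm that the rebuilt corpus has the number of documents and the docvars you expect (a quick sketch; `summary()` on a corpus prints per-document token, type and sentence counts):

quanteda::ndoc(corp_pac)
summary(corp_pac, n = 5)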
Get rid of unnecessary tokens
tokens_pac <-
  # Get rid of punctuation
  quanteda::tokens(corp_pac, remove_punct = TRUE) %>%
  # Get rid of stopwords, finally!
  quanteda::tokens_remove(pattern = quanteda::stopwords("en"))
tokens_pac
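If your own project has other uninformative tokens (brand names, boilerplate words, etc.), the same step can be extended with a custom removal list. A sketch, where the words in `custom_junk` are made up purely for illustration:

# Remove additional project-specific tokens (example words only)
custom_junk <- c("pr", "statement")
tokens_pac_custom <- tokens_pac %>%
  quanteda::tokens_remove(pattern = custom_junk)
tokens_pac_custom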
Now rebuild the dfm:
dfm_pac <- quanteda::dfm(tokens_pac)
dfm_pac
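Checking the dimensions tells you how many documents and features you are left with. If the matrix ever feels too large or too noisy, `quanteda::dfm_trim()` can drop very rare features; the threshold below is just an illustrative value:

dim(dfm_pac)  # documents x features

# Optional: keep only features that appear at least twice overall
dfm_pac_trimmed <- quanteda::dfm_trim(dfm_pac, min_termfreq = 2)
dim(dfm_pac_trimmed)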
Top Features
How did the top features change after removing stop words?
dfm_pac %>% quanteda::topfeatures()
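By default `topfeatures()` returns the ten most frequent features; it also takes an `n` argument, and if you have the `quanteda.textstats` package installed you can get a tidy frequency table instead:

quanteda::topfeatures(dfm_pac, n = 20)

# As a data frame (requires the quanteda.textstats package)
quanteda.textstats::textstat_frequency(dfm_pac, n = 20)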
Word Cloud
It's easy to create a word cloud if you already have a dfm:
quanteda.textplots::textplot_wordcloud(dfm_pac)
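The plot has a few arguments worth experimenting with; the values below are just illustrative:

# Limit the cloud to reasonably frequent words and cap the total count
quanteda.textplots::textplot_wordcloud(dfm_pac,
                                       min_count = 3,
                                       max_words = 100,
                                       rotation = 0.25)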
🗣️ CLASSROOM DISCUSSIONS: What do you think of the word cloud above, compared to the previous one? How would it be different if we had removed unnecessary tokens but kept the original longer description?
Can you think of any application of this methodology to your own project?