🗓️ Week 09
Unstructured data: basics of text mining

DS101 – Fundamentals of Data Science

20 Nov 2023

⏪ Recap

  • We discovered what unsupervised learning was about:
    • the algorithms do not rely on labeled data to learn
    • the focus is on learning hidden patterns in the data without human intervention (and without labels!)
  • There are various types of unsupervised learning that have wide applications:
    • clustering
    • association rule mining
    • dimensionality reduction
    • anomaly detection

What is unstructured data?

  • Unstructured data is any type of data that lacks a pre-defined format or organization.

  • Structured data is collected following a known method, with a known schema (e.g. a table format); unstructured data is everything else.

  • Unstructured data includes:

    • Audio
    • Video
    • Text content, often not formatted for structured data collection, including text messages, emails, social media posts, business documents and news articles
  • Unstructured data is reusable:

    • one piece of unstructured data can be examined in different ways
    • the person collecting the data does not have to define ahead of time what information needs to be collected from it
  • Because unstructured data can be reused for purposes not anticipated at collection time, this flexibility can raise privacy concerns for the people the data describes

Analysing a book from Project Gutenberg: a case study

Project Gutenberg is a library of over 70,000 full texts of books or individual stories in the public domain.

We are going to explore a few key text mining concepts by downloading and processing a book of your choice from that library. Head to the course website to download the notebook to follow along with this part.

A few key definitions

Tokenization:

  • The process of tokenization consists of splitting text into smaller units called tokens (most often words or sentences). In our case study, we split our text into words, as in the sketch below.
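
A minimal sketch of word tokenization, assuming NLTK is installed (the case-study notebook may use a different library):

```python
import nltk
nltk.download("punkt")  # one-off download of the tokenizer models
                        # (newer NLTK versions may ask for "punkt_tab" instead)

from nltk.tokenize import word_tokenize

text = "Tokenization splits text into smaller units called tokens."
print(word_tokenize(text))
# ['Tokenization', 'splits', 'text', 'into', 'smaller', 'units', 'called', 'tokens', '.']
```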

Lemmatization:

  • The process of lemmatization groups together all inflections/variants of a word into a single form. For instance, lemmatizing “program”, “programming”, and “programmed” converts them all to “program” (see the sketch below). This makes certain types of analysis much easier.
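
A minimal sketch using NLTK's WordNetLemmatizer (one common choice; assumes the WordNet data has been downloaded):

```python
import nltk
nltk.download("wordnet")  # one-off download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat each word as a verb
for word in ["program", "programming", "programmed"]:
    print(lemmatizer.lemmatize(word, pos="v"))  # prints "program" three times
```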

Stopwords:

  • Stopwords are the words in a language that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. They tend to be the most common words in the language (e.g. in English “a”, “the”, “is”, “are”, etc.) but can also be task-specific (e.g. the word “immigration” in a corpus entirely about the topic of immigration).
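
A minimal sketch of stopword removal, again with NLTK's built-in English stopword list:

```python
import nltk
nltk.download("stopwords")  # one-off download of the stopword lists

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "cat", "is", "on", "a", "mat"]
print([t for t in tokens if t not in stop_words])
# ['cat', 'mat']
```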

Analysing the news: a case study

We use the newspaper3k Python package to extract the most recent UK-related news articles from several newspapers:

  • The Guardian
  • BBC
  • The Sun
  • The Telegraph
  • The Independent
  • The Financial Times

Head to the course website to download the material needed to follow this second demo.
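
As a rough sketch of how articles can be fetched with newspaper3k (the section URL here is illustrative; the actual demo may configure its sources differently):

```python
from newspaper import build

# Build a news source from a section page and parse its first few articles
paper = build("https://www.theguardian.com/uk", memoize_articles=False)

for article in paper.articles[:3]:
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, text, authors, publish date, ...
    print(article.title)
```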

Introducing the concept of corpus

What is a corpus?

  • Simply put, a corpus is a collection of texts. It can be monolingual or multilingual.
  • The plural of corpus is corpora.
  • You can find some examples of publicly available (downloadable) corpora here.

The concept of Term frequency-inverse document frequency (TF-IDF)

  • A text vectorization method that transforms text into a usable vector, i.e. assigns a number to each word in a text
  • Each number is a function of the word (term) frequency and inverse document frequency:
    • term frequency (TF) is how often a word appears in a document, divided by how many words the document contains; it measures how common a word is: \(TF(t) = \textrm{Number of times term t appears in a document} / \textrm{Total number of terms in the document}\)
    • inverse document frequency (IDF) measures how unique or rare a word is: \(IDF(t) = \log_{10}(\textrm{Total number of documents} / \textrm{Number of documents with term t in it})\) (a base-10 logarithm, matching the worked example below)

Example:

  • Consider a document containing 100 words where the word cooker appears 5 times. The term frequency (i.e., TF) for cooker is then (5 / 100) = 0.05.

  • Now, assume we have 10 million documents and the word cooker appears in one thousand of these. Then, the inverse document frequency (i.e., IDF) is calculated as log₁₀(10,000,000 / 1,000) = 4.

  • Therefore, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
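
The same arithmetic in a couple of lines of Python (base-10 logarithm, matching the formula above):

```python
import math

tf = 5 / 100                          # term frequency of "cooker"
idf = math.log10(10_000_000 / 1_000)  # inverse document frequency = 4.0
print(tf * idf)                       # 0.2
```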

To learn more about TF-IDF, see this link or this one

Once vectorized, the text can be used as input to ML algorithms (supervised or unsupervised)!
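
For example, a minimal sketch with scikit-learn (one possible choice of library; note that its TfidfVectorizer uses a smoothed natural-log IDF by default, so its weights differ slightly from the hand computation above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cooker heats the food",
    "the food is cold",
    "a new cooker",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))                # TF-IDF weights per document
```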

Other Python libraries for NLP
