DS101 – Fundamentals of Data Science
20 Nov 2023
Unstructured data is any type of data that lacks a predefined format or organization.
Structured data is collected following a known method and has a known schema (e.g., a table format); unstructured data is everything else.
Unstructured data includes:
Unstructured data is reusable:
Because unstructured data can be reused for multiple purposes, it can raise privacy concerns for users.
Project Gutenberg is a library of over 70,000 full texts of books or individual stories in the public domain.
We are going to explore a few key text mining concepts by downloading and processing a book of your choice from that library. Head to the course website to download the notebook and follow along with this part.
Tokenization:
Lemmatization:
Stopwords:
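The three steps above can be sketched in plain Python. This is a toy illustration, not the code from the course notebook: the regex tokenizer, the `STOPWORDS` set, and the `LEMMAS` lookup table are hypothetical stand-ins for what a text-processing library would normally provide.

```python
import re

# Hypothetical, hand-picked stopword list; real lists are far longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were", "is"}

# Hypothetical lemma lookup table; real lemmatizers rely on dictionaries
# and part-of-speech information rather than a fixed mapping.
LEMMAS = {"cats": "cat", "mice": "mouse", "chasing": "chase"}


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [token for token in tokens if token not in STOPWORDS]


def lemmatize(tokens):
    """Map each token to its dictionary form when the toy table knows it."""
    return [LEMMAS.get(token, token) for token in tokens]


sentence = "The cats were chasing the mice in the garden."
tokens = tokenize(sentence)
content_tokens = remove_stopwords(tokens)
lemmas = lemmatize(content_tokens)

print(tokens)
print(content_tokens)
print(lemmas)
```

The order of the last two steps is a design choice: here stopwords are removed before lemmatization, which keeps the lookup table small, but the reverse order is equally common.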
We use the newspaper3k Python package to extract the most recent UK-related news articles from several newspapers:
Head to the course website to download the material needed to follow this second demo.
What is a corpus?
Example:
Consider a document containing 100 words where the word cooker appears 5 times. The term frequency (i.e., TF) for cooker is then (5 / 100) = 0.05.
Now, assume we have 10 million documents and the word cooker appears in one thousand of these. Then, the inverse document frequency (i.e., IDF), using a base-10 logarithm, is calculated as log10(10,000,000 / 1,000) = 4.
Therefore, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
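The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative only, and the sketch assumes the base-10 logarithm that makes the IDF come out to 4.

```python
import math

# Numbers taken from the worked example.
words_in_document = 100
occurrences_of_term = 5
total_documents = 10_000_000
documents_with_term = 1_000

# Term frequency: share of the document's words that are the term.
tf = occurrences_of_term / words_in_document

# Inverse document frequency: base-10 log of (all documents / documents with the term).
idf = math.log10(total_documents / documents_with_term)

# TF-IDF weight: product of the two quantities.
tf_idf = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.2f}, TF-IDF = {tf_idf:.2f}")
```

A rarer word would appear in fewer documents, giving a larger IDF and therefore a larger weight for the same term frequency.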
To learn more about TF-IDF, see this link or this one
Once vectorized, the text can be used as input to ML algorithms (supervised or unsupervised)!
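As a rough sketch of what "vectorized" means, the snippet below turns a small corpus into one TF-IDF vector per document using only the standard library. The three-sentence corpus is made up for illustration, whitespace splitting stands in for real tokenization, and the unsmoothed base-10 formula follows the worked example; libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each string is one document.
corpus = [
    "the cooker is new",
    "the kettle is old",
    "the cooker and the kettle",
]

# Naive whitespace tokenization, assumed here for brevity.
documents = [doc.split() for doc in corpus]

# One vector position per distinct word in the corpus.
vocabulary = sorted({word for doc in documents for word in doc})

# Number of documents in which each word appears at least once.
document_frequency = Counter(word for doc in documents for word in set(doc))


def tf_idf_vector(doc):
    """Return one TF-IDF weight per vocabulary word for a tokenized document."""
    counts = Counter(doc)
    return [
        (counts[word] / len(doc))
        * math.log10(len(documents) / document_frequency[word])
        for word in vocabulary
    ]


vectors = [tf_idf_vector(doc) for doc in documents]

for text, vector in zip(corpus, vectors):
    print(text, [round(weight, 3) for weight in vector])
```

Each document becomes a list of numbers of the same length, which is the shape most ML algorithms expect. Note that a word appearing in every document, such as "the" here, gets a weight of zero.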
LSE DS101 2023/24 Autumn Term