DS101 – Fundamentals of Data Science
02 Dec 2024
Unstructured data is any type of data that lacks pre-defined formats and organization.
Structured data is collected by a known method and follows a known schema (e.g., a table format); unstructured data is everything else.
Unstructured data includes:
Unstructured data is reusable:
Because unstructured data can be reused to serve multiple purposes, it can raise privacy concerns for the user.
Tokenization:
Lemmatization:
Stopwords:
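The three preprocessing steps named above (tokenization, lemmatization, stopword removal) can be sketched in plain Python. This is only a toy illustration: the stopword list and the lemma lookup table below are small hand-made assumptions, whereas real projects typically rely on a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "on", "in", "of", "and", "to"}

# Toy lemma lookup (an assumption; real lemmatizers use a full dictionary
# and part-of-speech information rather than a hand-made table).
LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}


def tokenize(text):
    """Tokenization: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Stopwords: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def lemmatize(tokens):
    """Lemmatization: map each token to its dictionary form when known."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(text):
    """Run the three steps in sequence."""
    return lemmatize(remove_stopwords(tokenize(text)))


print(preprocess("The cats were sitting on the mats."))
# → ['cat', 'sit', 'mat']
```

The order matters here only because the toy lemma table is keyed on surface forms; with a real lemmatizer, stopwords can be removed before or after lemmatization.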
We use the newspaper3k package for Python to extract the most recent UK-related news articles from several newspapers:
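A minimal sketch of how such an extraction might look with newspaper3k is shown below. It is not the course's actual script: the keyword list, the helper names `is_uk_related` and `fetch_recent_articles`, and the `limit` parameter are all assumptions made for illustration. The network-dependent function is defined but not called, so the block runs without internet access.

```python
import re

# Hypothetical keyword list used to decide whether an article is UK-related.
UK_KEYWORDS = ("uk", "united kingdom", "britain", "british",
               "england", "scotland", "wales", "northern ireland")


def is_uk_related(title, text):
    """Return True if any keyword appears as a whole word in the title or text."""
    haystack = f"{title} {text}".lower()
    return any(re.search(rf"\b{re.escape(k)}\b", haystack) for k in UK_KEYWORDS)


def fetch_recent_articles(source_url, limit=5):
    """Download up to `limit` articles from one news site and keep UK-related ones.

    Requires the newspaper3k package and network access.
    """
    # Imported here so the keyword helper above works without the package installed.
    import newspaper

    # memoize_articles=False asks newspaper3k not to skip previously seen URLs.
    paper = newspaper.build(source_url, memoize_articles=False)
    results = []
    for article in paper.articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception:
            # Skip articles that fail to download or parse.
            continue
        if is_uk_related(article.title or "", article.text or ""):
            results.append({"title": article.title, "url": article.url, "text": article.text})
    return results


# Example call (the newspaper URL is a placeholder, not taken from the course):
# articles = fetch_recent_articles("https://example-newspaper.co.uk", limit=10)
```

Filtering on keywords after parsing is a simple design choice for a demo; a real pipeline might instead use each site's UK section or feed.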
Head to the course website to download the material needed to follow this second demo (news case study materials).
What is a corpus?
Example:
Consider a document containing 100 words where the word cooker appears 5 times. The term frequency (i.e., TF) for cooker is then (5 / 100) = 0.05.
Now, assume we have 10 million documents and the word cooker appears in one thousand of these. Then, the inverse document frequency (i.e., IDF), using a base-10 logarithm, is calculated as log10(10,000,000 / 1,000) = log10(10,000) = 4.
Therefore, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
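The arithmetic in this example can be checked with a few lines of Python. The sketch assumes a base-10 logarithm, which is the base that yields an IDF of 4 here; the function names are illustrative and not taken from any library.

```python
import math


def term_frequency(term_count, total_terms):
    """TF: share of the document's words that are the term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term):
    """IDF with a base-10 logarithm (an assumption matching the example)."""
    return math.log10(total_docs / docs_with_term)


tf = term_frequency(5, 100)                           # cooker: 5 of 100 words
idf = inverse_document_frequency(10_000_000, 1_000)   # in 1,000 of 10 million docs
tfidf = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.2f}, TF-IDF = {tfidf:.2f}")
# → TF = 0.05, IDF = 4.00, TF-IDF = 0.20
```

Libraries often use a natural logarithm and add smoothing terms, so their TF-IDF values will generally differ from this hand calculation.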
To learn more about TF-IDF, see this link or this one
Once vectorized, the text can be used as input to ML algorithms (supervised or unsupervised)!
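To make "vectorized" concrete, here is a small sketch that turns tokenized documents into word-count vectors and compares them with cosine similarity, the kind of numeric input that clustering or classification algorithms work on. The three-document corpus is invented for illustration, and plain counts are used instead of TF-IDF weights to keep the example short.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Collect every distinct token across the corpus, in sorted order."""
    return sorted({token for doc in docs for token in doc})


def vectorize(doc, vocab):
    """Represent one document as a vector of token counts over the vocabulary."""
    counts = Counter(doc)
    return [counts[token] for token in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented, already-tokenized toy corpus.
corpus = [
    ["cooker", "recipe", "kitchen"],
    ["cooker", "kitchen", "oven"],
    ["election", "vote", "parliament"],
]

vocab = build_vocabulary(corpus)
vectors = [vectorize(doc, vocab) for doc in corpus]

print(vocab)
print(vectors[0])
print(round(cosine_similarity(vectors[0], vectors[1]), 3))  # two cooking documents
print(round(cosine_similarity(vectors[0], vectors[2]), 3))  # cooking vs politics
```

The two cooking documents share two of their three words and score well above zero, while the cooking and politics documents share none and score 0.0.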
Before we start this case, let’s have a quick look at how LLMs work:
Step 1 (~35-40 min)
Read through the following articles:
Within your respective tables, discuss your first impressions of the articles.
LSE DS101 2024/25 Autumn Term