DS205 – Advanced Data Manipulation
03 Mar 2025
10:03 – 10:15
Databases, CSV files
JSON, XML, HTML
Text, images, audio
Sort these data sources from most structured to least structured:

Structured data (databases, CSV files), typically accessed via APIs
Semi-structured data (JSON, XML, HTML), often collected via web scraping
Unstructured data (text, images, audio), handled with text analysis
Terry’s lived experience with ChatLSE - A Real-World RAG System
(Proof of concept, not production ready)
10:15 – 10:50
10:50 – 11:00
11:00 – 11:15
Tokenization
Normalization
Stop Word Removal
“The EU’s 2030 climate target requires a 55% reduction in GHG emissions.”
“We are committed to achieving net-zero emissions by 2050 through the implementation of renewable energy solutions.”
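The three steps above can be sketched in plain Python. This is a minimal illustration, not the course notebook's pipeline: the regex tokenizer and the short stop-word list are illustrative choices (real pipelines usually use a fuller list, e.g. NLTK's).

```python
import re

# Illustrative stop-word list; real pipelines use a much longer one
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is", "are", "by", "we"}

def preprocess(text):
    # Normalization: lowercase everything
    text = text.lower()
    # Tokenization: keep runs of letters, digits and apostrophes
    tokens = re.findall(r"[a-z0-9']+", text)
    # Stop word removal: drop very common, low-information words
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The EU's 2030 climate target requires a 55% reduction in GHG emissions.")
print(tokens)
```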
Notebook Demo: Section 1 & 2
11:15 – 11:40
Doc 1: "Climate change is a global challenge."
Doc 2: "Climate impacts are already visible."
Doc 3: "We need climate action now."
Document | climate | change | global | challenge | impacts | action | … |
---|---|---|---|---|---|---|---|
Doc 1 | 1 | 1 | 1 | 1 | 0 | 0 | … |
Doc 2 | 1 | 0 | 0 | 0 | 1 | 0 | … |
Doc 3 | 1 | 0 | 0 | 0 | 0 | 1 | … |
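The count matrix above can be reproduced with a few lines of plain Python. The whitespace tokenizer and the hand-picked stop-word list are assumptions made for this sketch:

```python
from collections import Counter

docs = [
    "Climate change is a global challenge.",
    "Climate impacts are already visible.",
    "We need climate action now.",
]
stop_words = {"is", "a", "are", "we", "already", "now"}  # illustrative list

def tokenize(text):
    return [w.strip(".").lower() for w in text.split()]

# Shared vocabulary across all documents, stop words excluded
vocab = sorted({t for d in docs for t in tokenize(d) if t not in stop_words})

# One row of term counts per document: the bag-of-words matrix
rows = [[Counter(tokenize(d))[term] for term in vocab] for d in docs]

for doc_id, row in zip(("Doc 1", "Doc 2", "Doc 3"), rows):
    print(doc_id, row)
```

Each row only records which words occur and how often; word order is discarded, which is exactly the "bag" in bag-of-words.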
Term Frequency-Inverse Document Frequency: Weighting Words by Importance
Doc 1: "Climate change is a global challenge."
Doc 2: "Climate impacts are already visible."
Doc 3: "We need climate action now."
Document | climate | change | global | challenge | impacts | action |
---|---|---|---|---|---|---|
Doc 1 | 0.1 | 0.6 | 0.6 | 0.6 | 0.0 | 0.0 |
Doc 2 | 0.1 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 |
Doc 3 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 |
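A minimal sketch of the weighting, using the smoothed idf variant that scikit-learn applies by default. The exact numbers depend on the chosen weighting and normalization scheme, so they will not match the table above cell for cell; the key property does hold: "climate" appears in every document, so it gets the lowest weight.

```python
import math
from collections import Counter

# Tokenized versions of the three documents above, stop words already removed
docs = [
    ["climate", "change", "global", "challenge"],
    ["climate", "impacts", "visible"],
    ["climate", "action", "need"],
]
N = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(t for d in docs for t in set(d))

def tfidf(doc):
    tf = Counter(doc)
    # Smoothed idf: log((1 + N) / (1 + df)) + 1, one common variant
    return {t: (tf[t] / len(doc)) * (math.log((1 + N) / (1 + df[t])) + 1)
            for t in doc}

weights = tfidf(docs[0])
print(weights)  # "climate" receives the lowest weight of the four terms
```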
“You shall know a word by the company it keeps” (Firth, 1957)
CBOW Architecture: Predicting a word from its context
Adapted from the original Word2Vec paper: Mikolov et al., 2013
CBOW Architecture: Predicting “temperature” from context words
Example: “the global temperature is rising”
Real CBOW Implementation: Output is a probability distribution over entire vocabulary
The model learns to assign highest probability to the correct word (“temperature”)
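A toy forward pass illustrating this: the context word embeddings are averaged, scored against every output embedding, and passed through a softmax to give a probability over the whole vocabulary. The weights here are random placeholders for the matrices that training would learn by backpropagation; the tiny vocabulary and embedding size are assumptions for readability.

```python
import math
import random

random.seed(0)
vocab = ["the", "global", "temperature", "is", "rising"]
dim = 4  # toy embedding size; real models use hundreds of dimensions

# Random toy weights; training would learn both matrices
W_in = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def cbow_forward(context):
    # 1. Look up and average the context word embeddings
    vecs = [W_in[vocab.index(w)] for w in context]
    h = [sum(col) / len(vecs) for col in zip(*vecs)]
    # 2. Score the averaged context against every output embedding
    logits = [sum(h_i * w_i for h_i, w_i in zip(h, row)) for row in W_out]
    # 3. Softmax turns scores into a probability distribution over the vocabulary
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["the", "global", "is", "rising"])
print(probs)  # one probability per vocabulary word, summing to 1
```

Training nudges the weights so that, for this context, the probability mass concentrates on "temperature".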
Skip-gram Architecture: Predicting context words from the target word
Adapted from the original Word2Vec paper: Mikolov et al., 2013
The original Word2Vec paper shows that word vectors trained on a large text corpus capture semantic relationships between words.
“Table 8 shows words that follow various relationships. We follow the approach described above: the relationship is defined by subtracting two word vectors, and the result is added to another word. Thus for example, Paris - France + Italy = Rome.”
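The analogy arithmetic can be demonstrated with toy vectors. The 3-d vectors below are invented for illustration only; real Word2Vec embeddings are learned from a large corpus, have hundreds of dimensions, and are compared with cosine similarity just as here.

```python
import math

# Invented 3-d vectors, constructed so the analogy works out
vectors = {
    "Paris":  [2.0, 1.0, 0.1],
    "France": [1.0, 1.0, 0.1],
    "Italy":  [1.0, 0.2, 0.9],
    "Rome":   [2.0, 0.2, 0.9],
    "Berlin": [2.0, 1.8, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Paris - France + Italy, computed component-wise
query = [p - f + i for p, f, i in
         zip(vectors["Paris"], vectors["France"], vectors["Italy"])]

# Nearest remaining word by cosine similarity
best = max((w for w in vectors if w not in {"Paris", "France", "Italy"}),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # "Rome" with these toy vectors
```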
Notebook Demo: Section 3
Notebook Demo: Section 4 & 5
Notebook Demo: Live Exploration
11:40 – 11:55
Eventually we will have to cover:
Advanced representations
Transformers & deep text processing
Building & optimizing RAG systems
Production-ready document intelligence
THE END
Next Steps
Resources
LSE DS205 (2024/25)