🗣️ Week 07 Lecture

First Steps with Unstructured Data: PDF Extraction and Word Embeddings

Author

Published

27 February 2025

Last Updated: 27 February 2025, 18:00

This lecture marks our transition from structured data (APIs, web scraping) to unstructured documents like PDFs and complex text. The techniques we’ll cover are directly applicable to your ✍ Problem Set 1, particularly for those working on the Climate Action Tracker option where you’ll need to process and organize both structured and unstructured content effectively.

📍 Session Details

Date: Monday, 3 March 2025
Time: 10:00 am - 12:00 pm
Location: KSW.1.04 (note the change from our usual location)

📥 Lecture Materials

Click HERE to download the zip file with the notebooks and data for this lecture. You may need to adjust the data file paths in the notebooks to match your local setup.

The zip file contains:

Jupyter notebooks for both lecture and lab
Sample NDC documents (in PDF format)
Pre-processed text versions of the documents
Requirements file for setting up your environment

I recommend setting up a virtual environment:

# After extracting the zip file
cd path/to/DS205-Week07
python -m venv embedding-env
source embedding-env/bin/activate  # On Windows: embedding-env\Scripts\activate
pip install -r requirements.txt

🗣️ Lecture Content

The lecture was organised into these main sections:

1. Key Terminology: The Data Structure Spectrum (10:03 - 10:15)

Understanding the continuum from structured to unstructured data
The structure spectrum: Structured, semi-structured, and unstructured data
Our journey through the course: From APIs to web scraping to text analysis
Real-world applications driving this transition

2. Guest Presentation: ChatLSE - A Real-World RAG System (10:15 - 10:50)

Special guest: Terry, Research Assistant at the DSI
Architecture and components of a production RAG system
Challenges in processing LSE’s diverse document types
Lessons learned and best practices

3. From Raw Text to NLP-Ready Data (11:00 - 11:20)

PDF parsing challenges and solutions
Text extraction methods comparison
NLP preprocessing fundamentals:
- Tokenization
- Stopword removal
- Stemming and lemmatization
Common preprocessing workflows
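
The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the code from the lecture notebooks: the stopword set is a tiny hand-picked sample, `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer, and the sample sentence is invented to mimic typical PDF extraction debris (a word hyphenated across a line break, hard line wraps, doubled spaces).

```python
import re

# Tiny hand-picked stopword sample (assumption: real workflows use a fuller list)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "by", "is", "are", "its", "will"}


def clean_extracted_text(raw: str) -> str:
    """Repair two common PDF extraction artifacts."""
    # Rejoin words hyphenated across a line break: "emis-\nsions" -> "emissions"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse hard line wraps and repeated whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token: str) -> str:
    """Naive suffix stripping; NOT a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Invented sample mimicking raw PDF output
raw = "The country will reduce its emis-\nsions by\ncutting  coal use."

cleaned = clean_extracted_text(raw)
tokens = remove_stopwords(tokenize(cleaned))
stems = [crude_stem(t) for t in tokens]

print(cleaned)
print(tokens)
print(stems)
```

Note that the crude stemmer turns "cutting" into "cutt", which is not a word. Stemming only chops suffixes, whereas lemmatization maps a token to its dictionary form ("cut"), at the cost of needing a vocabulary or language model.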

4. Word Embeddings: Representing Meaning (11:20 - 11:40)

From bag-of-words to distributed representations
Word2Vec: intuition and mechanics
Properties and applications of word embeddings
Visualizing word relationships
Similarity and analogies in the climate domain
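
Similarity and analogy queries on word embeddings reduce to vector arithmetic plus cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration. They are not trained embeddings, and a Word2Vec model trained on the NDC documents would have many more dimensions and may well produce different neighbours. The function names `most_similar` and `analogy` are hypothetical helpers written for this example.

```python
import math

# Hand-made toy vectors (assumption: illustrative only, not trained embeddings).
# The three dimensions loosely mean: fossil, renewable, infrastructure.
vectors = {
    "coal":   [0.9, 0.1, 0.1],
    "mine":   [0.8, 0.1, 0.9],
    "solar":  [0.1, 0.9, 0.1],
    "panel":  [0.1, 0.8, 0.9],
    "policy": [0.3, 0.3, 0.5],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(target, exclude=()):
    """Rank vocabulary words by cosine similarity to a target vector."""
    scores = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Exclude the query words themselves, as analogy tools usually do
    return most_similar(target, exclude={a, b, c})[0][0]


nearest_to_coal = most_similar(vectors["coal"], exclude={"coal"})[0][0]
answer = analogy("coal", "mine", "solar")

print("nearest to 'coal':", nearest_to_coal)
print("coal : mine :: solar :", answer)
```

With these toy vectors the analogy resolves to "panel" because the offset from "coal" to "mine" (adding the infrastructure dimension) is applied to "solar". On real embeddings the same arithmetic works only approximately, and results depend heavily on the training corpus.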

5. From Words to Documents: Building Toward RAG (11:40 - 11:55)

Document embeddings
Sentence/paragraph vectors
The path forward:
- Advanced representations with transformers
- Building and optimizing RAG systems
- Creating production-ready document intelligence
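
One simple way to move from word vectors to document vectors is to average the vectors of the words in each document, then rank documents by cosine similarity to a query. That ranking is the retrieval step a RAG system builds on. The sketch below assumes invented 2-dimensional word vectors and two made-up one-line documents. Averaging is only one of several approaches (paragraph vectors and transformer-based encoders are others), and `embed` and `retrieve` are hypothetical helper names.

```python
import math

# Toy 2-d word vectors (assumption: invented for illustration, not trained)
word_vectors = {
    "coal": [1.0, 0.0],
    "emissions": [0.9, 0.2],
    "phase": [0.7, 0.3],
    "solar": [0.0, 1.0],
    "wind": [0.1, 0.9],
    "capacity": [0.3, 0.8],
}

# Made-up documents standing in for extracted policy text
documents = {
    "doc_fossil": "phase out coal emissions",
    "doc_renewable": "expand solar and wind capacity",
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        raise ValueError("text contains no words with a known vector")
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query):
    """Return document names ordered from most to least similar to the query."""
    query_vec = embed(query)
    return sorted(
        documents,
        key=lambda name: cosine(query_vec, embed(documents[name])),
        reverse=True,
    )


print(retrieve("wind capacity"))
print(retrieve("coal emissions"))
```

A production system would precompute and store the document vectors rather than re-embedding them on every query, and would pass the top-ranked passages to a language model as context.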

The notebook demonstrations showed:
- PDF text extraction with different methods
- Text preprocessing and tokenization
- Training Word2Vec models on climate policy documents
- Visualizing word embeddings with t-SNE
- Exploring word similarities and analogies

🎬 Lecture Slides

Click HERE to download the slides used by Terry in the guest presentation.


🎥 Session Recording

The lecture recording will be available on Moodle by the afternoon of the lecture.