🗣️ Week 07 Lecture
First Steps with Unstructured Data: PDF Extraction and Word Embeddings

Last Updated: 27 February 2025, 18:00
This lecture marks our transition from structured data (APIs, web scraping) to unstructured documents like PDFs and complex text. The techniques we’ll cover are directly applicable to your ✍ Problem Set 1, particularly for those working on the Climate Action Tracker option where you’ll need to process and organize both structured and unstructured content effectively.
📍 Session Details
- Date: Monday, 3 March 2025
- Time: 10:00 am - 12:00 pm
- Location: KSW.1.04 (different location)
📥 Lecture Materials
Click HERE to download the zip file that contains the files (notebooks and data) for this lecture. You may need to tweak the path to the data files to match your local setup.
The zip file contains:
- Jupyter notebooks for both lecture and lab
- Sample NDC documents (in PDF format)
- Pre-processed text versions of the documents
- Requirements file for setting up your environment
I recommend setting up a virtual environment:
# After extracting the zip file
cd path/to/DS205-Week07
python -m venv embedding-env
source embedding-env/bin/activate # On Windows: embedding-env\Scripts\activate
pip install -r requirements.txt
🗣️ Lecture Content
The lecture was organised into these main sections:
1. Key Terminology: The Data Structure Spectrum (10:03 - 10:15)
- Understanding the continuum from structured to unstructured data
- The structure spectrum: Structured, semi-structured, and unstructured data
- Our journey through the course: From APIs to web scraping to text analysis
- Real-world applications driving this transition
2. Guest Presentation: ChatLSE - A Real-World RAG System (10:15 - 10:50)
- Special guest: Terry, Research Assistant at the DSI
- Architecture and components of a production RAG system
- Challenges in processing LSE’s diverse document types
- Lessons learned and best practices
3. From Raw Text to NLP-Ready Data (11:00 - 11:20)
- PDF parsing challenges and solutions
- Text extraction methods comparison
- NLP preprocessing fundamentals:
  - Tokenization
  - Stopword removal
  - Stemming and lemmatization
- Common preprocessing workflows
4. Word Embeddings: Representing Meaning (11:20 - 11:40)
- From bag-of-words to distributed representations
- Word2Vec: intuition and mechanics
- Properties and applications of word embeddings
- Visualizing word relationships
- Similarity and analogies in the climate domain
5. From Words to Documents: Building Toward RAG (11:40 - 11:55)
- Document embeddings
- Sentence/paragraph vectors
- The path forward:
  - Advanced representations with transformers
  - Building and optimizing RAG systems
  - Creating production-ready document intelligence
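To make the preprocessing steps from section 3 concrete (tokenization, stopword removal, stemming), here is a minimal sketch using only the Python standard library. It is not the code from the lecture notebooks: the stopword list, the suffix list, and the function names (`tokenize`, `remove_stopwords`, `crude_stem`, `preprocess`) are all assumptions made for illustration, and the stemmer is deliberately naive rather than an implementation of an established algorithm such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "by", "for", "on", "will", "its"}

# Suffixes for a deliberately crude stemmer (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]


if __name__ == "__main__":
    sample = ("The country is reducing emissions and expanding "
              "renewable targets by 2030.")
    print(preprocess(sample))
```

Note that this tokenizer discards digits, so a year such as "2030" disappears, and the stemmer turns "reducing" into the non-word "reduc". Both are typical trade-offs of aggressive preprocessing; lemmatization with a proper NLP library would return dictionary forms instead.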
The notebook demonstrations showed:
- PDF text extraction with different methods
- Text preprocessing and tokenization
- Training Word2Vec models on climate policy documents
- Visualizing word embeddings with t-SNE
- Exploring word similarities and analogies
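As a rough illustration of what "word similarity" and "document embeddings" mean in practice, the sketch below computes cosine similarity between word vectors and builds a document vector by averaging. The vectors in `TOY_VECTORS` are hand-made for this example and are not the output of any trained model; a real Word2Vec model learns its vectors from a corpus and typically uses a hundred or more dimensions. Averaging is only one simple way to get a document vector, assumed here for clarity.

```python
import math

# Hand-made 3-dimensional vectors, for illustration only (an assumption).
# Trained Word2Vec vectors are learned from text and are much longer.
TOY_VECTORS = {
    "solar": [0.9, 0.1, 0.0],
    "wind": [0.8, 0.2, 0.1],
    "coal": [0.1, 0.9, 0.0],
    "tax": [0.0, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; use 0.0
    return dot / (norm_u * norm_v)


def document_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


if __name__ == "__main__":
    print(cosine_similarity(TOY_VECTORS["solar"], TOY_VECTORS["wind"]))
    print(cosine_similarity(TOY_VECTORS["solar"], TOY_VECTORS["coal"]))
    print(document_vector(["solar", "wind", "subsidy"], TOY_VECTORS))
```

With these toy values, "solar" scores closer to "wind" than to "coal", which is the kind of pattern a trained model is expected to pick up from context. Tokens missing from the vocabulary (here "subsidy") are skipped when averaging.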
🎬 Lecture Slides
Click HERE to download the slides used by Terry in the guest presentation.
🎥 Session Recording
The lecture recording will be available on Moodle by the afternoon of the lecture.