```mermaid
flowchart LR
    opening[Session framing and PS2 positioning] --> blockA[Block A PDF extraction with unstructured]
    blockA --> blockB[Block B starting embeddings]
    blockB --> blockC[Block C where this stage ends]
    blockC --> closing[Closing W09 handoff and lab prep]
```
đĨī¸ Week 08 Lecture
PDF Extraction Quality and Intro to Embeddings
This week we will explore a library called unstructured for extracting text from PDFs, and we will run a first search based on vector embeddings, a deep-learning technique that represents text as numerical vectors.
Use this week to select the PDFs you want to use in your ✍️ Problem Set 2 and to think about what you need to extract from them.
đ Session Details
- Date: Monday, 09 March 2026
- Time: 16:00 - 18:00
- Location: SAL.G.03
đ Preparation
- Re-read the extraction and embedding requirements in âī¸ Problem Set 2.
- Bring 2 to 3 corporate reports you want to test this week (ideally from your chosen PS2 companies).
- Skim the `unstructured` partitioning docs if you want to know more about the library.
Section A: Framing W08 in the PS2 timeline
Today is about getting foundations right before you move into retrieval decisions in W09. We focus on two things only: extraction quality and one embedding baseline you can defend.
For âī¸ Problem Set 2, the sequence is practical. First, we extract usable text from PDFs. Then we split that long report text into smaller units, which we call chunks.
This week, we will use a deliberately simple chunking approach so you can get a full end-to-end baseline running. In W09, we will come back and treat chunking strategy properly, with a deeper discussion of trade-offs and evaluation.
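A deliberately simple baseline can be as little as fixed-size character chunks with a small overlap. A minimal sketch (the function and its parameters are our own illustration, not part of unstructured):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Deliberately naive: W09 revisits chunking strategy properly.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Toy stand-in for extracted report text
report = "Scope 1 emissions fell by 12% in 2024. " * 40
chunks = chunk_text(report, size=200, overlap=20)
print(len(chunks), "chunks; first chunk is", len(chunks[0]), "characters")
```

The overlap means consecutive chunks share their boundary text, which reduces the chance of splitting a sentence exactly where the meaning lives.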
If extraction is noisy, every step after that is harder to trust.
Section B: unstructured and ETL language
In W07 we used ETL and ELT language. Keep using that same framing here:
- Extract: load PDFs and partition content into typed elements
- Transform: clean, filter, and normalise extracted content
- Load: persist outputs for chunking and embedding stages
The unstructured documentation mixes product and open-source language. In DS205, stay with the open-source path and keep your workflow reproducible in your own repo.
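The three ETL steps can be sketched as a small pipeline. The cleaning rules and file paths below are illustrative assumptions, not prescribed by the course; the extract step is shown commented out because it needs unstructured and a real PDF:

```python
import json
from pathlib import Path


def elements_to_records(elements) -> list[dict]:
    """Transform step: keep non-empty elements, normalise whitespace."""
    return [
        {"type": type(el).__name__, "text": " ".join(el.text.split())}
        for el in elements
        if getattr(el, "text", None) and el.text.strip()
    ]


def save_records(records: list[dict], out_path: Path) -> None:
    """Load step: persist JSON for the later chunking/embedding stages."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(records, indent=2))


# Extract step (requires unstructured and a PDF on disk):
# from unstructured.partition.pdf import partition_pdf
# elements = partition_pdf(filename="data/raw/company_a_report.pdf", strategy="auto")
# save_records(elements_to_records(elements), Path("data/interim/company_a_elements.json"))
```

Keeping transform and load as plain functions makes each stage testable without re-running the (slow) extraction.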
Section C: Choosing a partition_pdf strategy
There is no single best extraction strategy. Choose a strategy based on the document you are processing.
Strategy comparison for PS2 work
| Strategy | Best first use | Typical strengths | Typical limits |
|---|---|---|---|
| `auto` | First pass when document quality is unknown | Good default for mixed reports | Can miss edge cases in complex layouts |
| `fast` | Clean PDFs with extractable text | Fast and lightweight | Weak on scanned pages and complex tables |
| `hi_res` | Table-heavy and multi-column reports | Better structural recovery | Slower, more dependencies |
| `ocr_only` | Scanned/image PDFs | Gets text where no embedded text exists | OCR noise and weaker layout fidelity |
Minimal runnable examples
Example 1: Baseline extraction with auto
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="auto",
)

print(f"Extracted {len(elements)} elements")
print(type(elements[0]).__name__, elements[0].text[:200])
```
Example 2: Comparing fast and hi_res on the same page range
```python
from unstructured.partition.pdf import partition_pdf

fast_elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="fast",
)
hires_elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="hi_res",
)

print("fast:", len(fast_elements))
print("hi_res:", len(hires_elements))
```

Practical tip: compare outputs on a small slice first, then scale to full reports.
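Raw element counts can hide what actually changed between strategies; counting element *types* is often more informative. A small helper (the `Counter` summary is our addition, not an unstructured API):

```python
from collections import Counter


def element_type_counts(elements) -> Counter:
    """Summarise how a strategy typed the extracted content (Title, Table, ...)."""
    return Counter(type(el).__name__ for el in elements)


# e.g. after the extractions above:
# print("fast:  ", element_type_counts(fast_elements))
# print("hi_res:", element_type_counts(hires_elements))
```

If `hi_res` finds Table elements where `fast` produced only fragmented NarrativeText, that is a strong signal for table-heavy reports.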
Example 3: Scanned PDF fallback with ocr_only
```python
from unstructured.partition.pdf import partition_pdf

ocr_elements = partition_pdf(
    filename="data/raw/company_scanned_appendix.pdf",
    strategy="ocr_only",
)

print(f"OCR elements: {len(ocr_elements)}")
```

Which strategy when (what you see in TPI reports)
- If text copy-paste works and pages are simple, start with `fast`.
- If reading order or table boundaries are messy, test `hi_res`.
- If the report is scanned or image-based, move to `ocr_only`.
- If you are unsure, start with `auto`, inspect outputs, then narrow.
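The scanned-PDF case can be encoded as a cheap fallback check: if a `fast` pass extracts almost nothing, retry with `ocr_only`. A sketch (the character threshold is an arbitrary assumption; tune it on your own reports):

```python
def looks_scanned(elements, min_total_chars: int = 200) -> bool:
    """Heuristic: a 'fast' pass that extracts almost no text suggests a scanned PDF."""
    total = sum(len(el.text or "") for el in elements)
    return total < min_total_chars


# fast_elements = partition_pdf(filename=path, strategy="fast")
# if looks_scanned(fast_elements):
#     elements = partition_pdf(filename=path, strategy="ocr_only")
```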
Debug by symptoms before embedding
| Symptom | Likely cause | First adjustment |
|---|---|---|
| Almost no text extracted | Scanned/image PDF | try ocr_only or hi_res |
| Tables become fragmented text | Layout complexity | test hi_res, inspect element types |
| Reading order is incoherent | Multi-column parsing issue | compare fast vs hi_res on same sample |
| Runtime is very slow | Heavy strategy on full corpus too early | validate on a few pages/files first |
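The last row can be made concrete: time one strategy on a small sample of files before committing to the full corpus. In this sketch the timing harness is ours, and the extraction call is passed in as a function so you can swap strategies:

```python
import time


def time_extraction(extract_fn, paths: list[str], sample_size: int = 2) -> float:
    """Run extract_fn over a small sample of paths; return elapsed seconds."""
    sample = paths[:sample_size]
    start = time.perf_counter()
    for path in sample:
        extract_fn(path)
    return time.perf_counter() - start


# seconds = time_extraction(
#     lambda p: partition_pdf(filename=p, strategy="hi_res"),
#     all_report_paths,
# )
# print(f"hi_res on {2} files took {seconds:.1f}s")
```

Multiplying the per-file time by your corpus size tells you quickly whether `hi_res` everywhere is realistic.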
Section D: The notion of embeddings
What an embedding represents (visual intuition)
Think of an embedding as a coordinate in a high-dimensional space. Texts with similar meaning end up closer together.
```mermaid
flowchart LR
    textA["Chunk A: Scope 1 emissions decline"] --> encoder[Embedding model]
    textB["Chunk B: financed emissions target"] --> encoder
    queryQ["Query: emissions reduction plan"] --> encoder
    encoder --> vecA["Vector A"]
    encoder --> vecB["Vector B"]
    encoder --> vecQ["Vector Q"]
    vecQ --> sim["Similarity comparison"]
    vecA --> sim
    vecB --> sim
    sim --> topk["Top-k most similar chunks"]
```
Use this mental model in W08: retrieval quality depends on whether your embedded units preserve useful meaning.
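The similarity comparison in the diagram is typically cosine similarity between vectors. A toy sketch with hand-made 3-dimensional vectors (real models produce hundreds of dimensions; these numbers are invented purely for illustration):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Invented toy vectors standing in for embedding-model outputs
chunks = {
    "Chunk A: Scope 1 emissions decline": [0.9, 0.2, 0.1],
    "Chunk B: financed emissions target": [0.4, 0.8, 0.3],
}
query_vec = [0.8, 0.3, 0.1]  # "Query: emissions reduction plan"

# Rank chunks by similarity to the query, most similar first
ranked = sorted(
    chunks.items(),
    key=lambda item: cosine_similarity(query_vec, item[1]),
    reverse=True,
)
for text, _ in ranked:
    print(text)
```

In lab you will replace the invented vectors with real model outputs, but the comparison and top-k ranking work exactly the same way.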
From intuition to model choice
- Word2Vec intuition: nearby words in similar contexts get nearby vectors.
- Transformer encoders: context-sensitive vectors for phrases/sentences.
- Sentence-transformers: practical encoders for search/retrieval pipelines.
How to read a HuggingFace model card for PS2
Before coding in lab, inspect the model card and make one explicit model choice.
Write a short note in your repo that answers:
- Task fit: is it actually built for sentence similarity/retrieval?
- Training data: how close is it to climate disclosure language?
- Embedding size: what does vector dimension imply for storage and runtime?
- Input limits: what practical length limits do you need to respect?
- License: can you use it for coursework and share your repository?
- Usage notes: what caveat did you notice and how will you account for it?
Good starter references:
You will apply this in lab, where you run your first query -> embed -> similarity flow.
Section E: Environment setup (moved)
All setup and environment guidance is now in:
Use that page for:
- Nuvolos baseline environment
- local Windows/macOS/Linux conda variants
- optional GPU paths (Windows NVIDIA and macOS Apple Silicon)
- CI environment notes and caching
- dependency rationale (`poppler`, `tesseract`, `pandoc`, `python-magic`)
Section G: What should go into your PS2 this week
By the end of W08, ideally your repository should contain:
- A runnable extraction step using `unstructured` on selected reports.
- Evidence that you inspected extraction outputs (not just ran scripts).
- One embedding model choice justified from its HuggingFace model card.
- One basic search result file produced in lab from a simple sectioning of the extracted text.
- A clear use of `data/interim/` as your playground area for temporary stage outputs.[^1]
- A short note on one extraction failure and one design decision.
- Updated setup instructions in `README.md`.
đĨ Session Recording
The lecture recording will be available on Moodle by the afternoon of the lecture.
Appendix | Reference Links
Course links
- đģ W08 Lab
- đ Syllabus
- âī¸ Problem Set 2
Unstructured docs
Footnotes
[^1]: We are adopting this as the working convention in DS205 this term, aligned with common industry project layouts: Cookiecutter Data Science project opinions.