DS205 2025-2026 Winter Term

đŸ–Ĩī¸ Week 08 Lecture

PDF Extraction Quality and Intro to Embeddings

Author

Dr Jon Cardoso-Silva

Published

06 March 2026

This week we will explore a library called unstructured to extract text from PDFs, and we will run a first search based on vector embeddings: a technique, powered by deep learning models, that represents text as numerical vectors.

Use this week to select the PDFs you want to use in your ✏️ Problem Set 2 and to think about what you need to extract from the PDF(s).

📍 Session Details

  • Date: Monday, 09 March 2026
  • Time: 16:00 - 18:00
  • Location: SAL.G.03

📋 Preparation

  • Re-read the extraction and embedding requirements in ✏️ Problem Set 2.
  • Bring 2 to 3 corporate reports you want to test this week (ideally from your chosen PS2 companies).
  • Skim the unstructured partitioning docs if you want to know more about the library.

Section A: Framing W08 in the PS2 timeline

Today is about getting foundations right before you move into retrieval decisions in W09. We focus on two things only: extraction quality and one embedding baseline you can defend.

flowchart LR
    opening[Session framing and PS2 positioning] --> blockA[Block A PDF extraction with unstructured]
    blockA --> blockB[Block B starting embeddings]
    blockB --> blockC[Block C where this stage ends]
    blockC --> closing[Closing W09 handoff and lab prep]

For âœī¸ Problem Set 2, the sequence is practical. First, we extract usable text from PDFs. Then we split that long report text into smaller units, which we call chunks.

This week, we will use a deliberately simple chunking approach so you can get a full end-to-end baseline running. In W09, we will come back and treat chunking strategy properly, with a deeper discussion of trade-offs and evaluation.
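As a concrete illustration of that deliberately simple approach, here is a minimal fixed-size character chunker (a sketch only; the size and overlap values below are placeholders, not recommendations):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap.

    Deliberately naive: it ignores sentence and section boundaries,
    which is exactly the limitation we will revisit in W09.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks

# A toy stand-in for extracted report text.
report_text = "Scope 1 emissions fell by 12 percent. " * 50
chunks = chunk_text(report_text, size=200, overlap=20)
print(len(chunks), "chunks")
```

The overlap means the tail of each chunk is repeated at the head of the next one, so a sentence cut in half by one boundary still appears whole in a neighbouring chunk.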

If extraction is noisy, every step after that is harder to trust.

Section B: unstructured and ETL language

In W07 we used ETL and ELT language. Keep using that same framing here:

  • Extract: load PDFs and partition content into typed elements
  • Transform: clean, filter, and normalise extracted content
  • Load: persist outputs for chunking and embedding stages
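A minimal sketch of the Transform and Load stages, assuming extraction has already produced (category, text) pairs. With unstructured installed you would read these from each element's .category and .text attributes; the toy records and the output path here are illustrative only:

```python
import json
import re
from pathlib import Path

def transform(records):
    """Transform: collapse whitespace and drop empty elements."""
    cleaned = []
    for category, text in records:
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned.append({"category": category, "text": text})
    return cleaned

def load(records, path):
    """Load: persist cleaned records for the chunking and embedding stages."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records, indent=2))

# Toy records standing in for extracted elements.
raw = [("Title", "  Climate  Report\n2025 "), ("NarrativeText", "   ")]
clean = transform(raw)
load(clean, "data/interim/company_a_elements.json")
print(clean)
```

Persisting the transformed output as JSON in data/interim/ keeps each stage inspectable on its own, which matters once extraction bugs start hiding behind embedding code.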

The unstructured documentation mixes product and open-source language. In DS205, stay with the open-source path and keep your workflow reproducible in your own repo.

Section C: Choosing a partition_pdf strategy

There is no single best extraction strategy. Choose a strategy based on the document you are processing.

Strategy comparison for PS2 work

Recommended partition_pdf strategy framing for W08:

  • auto: first pass when document quality is unknown. Strengths: good default for mixed reports. Limits: can miss edge cases in complex layouts.
  • fast: clean PDFs with extractable text. Strengths: fast and lightweight. Limits: weak on scanned pages and complex tables.
  • hi_res: table-heavy and multi-column reports. Strengths: better structural recovery. Limits: slower, more dependencies.
  • ocr_only: scanned or image-based PDFs. Strengths: gets text where no embedded text exists. Limits: OCR noise and weaker layout fidelity.

Minimal runnable examples

Example 1: Baseline extraction with auto
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="auto",
)

print(f"Extracted {len(elements)} elements")
print(type(elements[0]).__name__, elements[0].text[:200])

Example 2: Comparing fast and hi_res on the same document
from unstructured.partition.pdf import partition_pdf

fast_elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="fast",
)

hires_elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="hi_res",
)

print("fast:", len(fast_elements))
print("hi_res:", len(hires_elements))

Practical tip: compare outputs on a small slice first, then scale to full reports.
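One quick way to act on that tip is to compare element-type counts rather than eyeballing raw output. A sketch, using stand-in classes so it runs without unstructured installed; in practice you would pass fast_elements and hires_elements from Example 2:

```python
from collections import Counter

def summarise(elements):
    """Count extracted element types so two strategies can be compared at a glance."""
    return Counter(type(el).__name__ for el in elements)

# Stand-in element classes for illustration only; unstructured
# produces typed elements such as Title, NarrativeText, and Table.
class Title: pass
class NarrativeText: pass
class Table: pass

fast_like = [Title(), NarrativeText(), NarrativeText()]
hires_like = [Title(), NarrativeText(), NarrativeText(), Table()]

print("fast:", summarise(fast_like))
print("hi_res:", summarise(hires_like))
```

A strategy that recovers Table elements where another produces only fragmented NarrativeText is usually the better fit for table-heavy reports.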

Example 3: Scanned PDF fallback with ocr_only
from unstructured.partition.pdf import partition_pdf

ocr_elements = partition_pdf(
    filename="data/raw/company_scanned_appendix.pdf",
    strategy="ocr_only",
)

print(f"OCR elements: {len(ocr_elements)}")

Which strategy when (what you see in TPI reports)

  • If text copy-paste works and pages are simple, start with fast.
  • If reading order or table boundaries are messy, test hi_res.
  • If the report is scanned or image-based, move to ocr_only.
  • If you are unsure, start with auto, inspect outputs, then narrow.
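Those bullets can be encoded as a simple first-pass heuristic. A sketch, not a rule: the two boolean inputs are crude simplifications of real document inspection:

```python
def pick_strategy(has_embedded_text: bool, complex_layout: bool) -> str:
    """Suggest a first partition_pdf strategy, following the heuristics above."""
    if not has_embedded_text:
        return "ocr_only"  # scanned or image-based PDF
    if complex_layout:
        return "hi_res"    # messy reading order or table boundaries
    return "fast"          # clean, simple pages with extractable text

print(pick_strategy(has_embedded_text=True, complex_layout=False))
```

When you cannot answer either question confidently, that is exactly the "start with auto, inspect, then narrow" case.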

Debug by symptoms before embedding

Debug-first heuristics for W08 extraction work:

  • Symptom: almost no text extracted. Likely cause: scanned/image PDF. First adjustment: try ocr_only or hi_res.
  • Symptom: tables become fragmented text. Likely cause: layout complexity. First adjustment: test hi_res and inspect element types.
  • Symptom: reading order is incoherent. Likely cause: multi-column parsing issue. First adjustment: compare fast vs hi_res on the same sample.
  • Symptom: runtime is very slow. Likely cause: heavy strategy on the full corpus too early. First adjustment: validate on a few pages/files first.

Section D: The notion of embeddings

What an embedding represents (visual intuition)

Think of an embedding as a coordinate in a high-dimensional space. Texts with similar meaning end up closer together.

flowchart LR
    textA["Chunk A: Scope 1 emissions decline"] --> encoder[Embedding model]
    textB["Chunk B: financed emissions target"] --> encoder
    queryQ["Query: emissions reduction plan"] --> encoder
    encoder --> vecA["Vector A"]
    encoder --> vecB["Vector B"]
    encoder --> vecQ["Vector Q"]
    vecQ --> sim["Similarity comparison"]
    vecA --> sim
    vecB --> sim
    sim --> topk["Top-k most similar chunks"]

Use this mental model in W08: retrieval quality depends on whether your embedded units preserve useful meaning.
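The flow in the diagram can be sketched end to end with toy vectors. These are 3-dimensional purely for readability, and the numbers are made up; real embedding models produce vectors with hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity: the angle-based closeness of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for the two chunks and the query in the diagram.
chunks = {
    "Chunk A: Scope 1 emissions decline": [0.9, 0.1, 0.0],
    "Chunk B: financed emissions target": [0.7, 0.6, 0.2],
}
query = [0.8, 0.2, 0.1]

# Rank chunks by similarity to the query: the top-k step in the diagram.
ranked = sorted(chunks.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
for text, vec in ranked:
    print(f"{cosine(query, vec):.3f}  {text}")
```

Everything interesting happens before this step: if extraction or chunking mangled the text, the similarity ranking will be confidently wrong.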

From intuition to model choice

  1. Word2Vec intuition: nearby words in similar contexts get nearby vectors.
  2. Transformer encoders: context-sensitive vectors for phrases/sentences.
  3. Sentence-transformers: practical encoders for search/retrieval pipelines.

How to read a HuggingFace model card for PS2

Before coding in lab, inspect the model card and make one explicit model choice.

Write a short note in your repo that answers:

  • Task fit: is it actually built for sentence similarity/retrieval?
  • Training data: how close is it to climate disclosure language?
  • Embedding size: what does vector dimension imply for storage and runtime?
  • Input limits: what practical length limits do you need to respect?
  • License: can you use it for coursework and share your repository?
  • Usage notes: what caveat did you notice and how will you account for it?
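One way to keep that note consistent across the class is a small template. The example values below describe sentence-transformers/all-MiniLM-L6-v2 to the best of my knowledge; verify every one of them against the actual model card before relying on it:

```python
NOTE = """\
# Embedding model choice (PS2)
Model: {model}
Task fit: {task_fit}
Training data: {training_data}
Embedding size: {dim}
Input limits: {limits}
License: {license}
Caveat and mitigation: {caveat}
"""

note = NOTE.format(
    model="sentence-transformers/all-MiniLM-L6-v2",
    task_fit="sentence similarity / semantic search",
    training_data="general web sentence-pair corpora, not climate-specific",
    dim="384 dimensions",
    limits="long inputs beyond the model's token limit are truncated",
    license="Apache-2.0",
    caveat="generic training data; spot-check retrieval on disclosure jargon",
)
print(note)
```

Committing this note alongside your code makes the model choice a defensible decision rather than a default you inherited from a tutorial.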

Good starter references:

You will apply this in lab, where you run your first query -> embed -> similarity flow.

Section E: Environment setup (moved)

All setup and environment guidance now lives on the course's dedicated setup page.

Use that page for:

  • Nuvolos baseline environment
  • local Windows/macOS/Linux conda variants
  • optional GPU paths (Windows NVIDIA and macOS Apple Silicon)
  • CI environment notes and caching
  • dependency rationale (poppler, tesseract, pandoc, python-magic)

Section G: What should go into your PS2 this week

By the end of W08, your repository should ideally contain:

  1. A runnable extraction step using unstructured on selected reports.
  2. Evidence that you inspected extraction outputs (not just ran scripts).
  3. One embedding model choice justified from its HuggingFace model card.
  4. One basic search result file produced in lab from a simple sectioning of the extracted text.
  5. A clear use of data/interim/ as your playground area for temporary stage outputs.1
  6. A short note on one extraction failure and one design decision.
  7. Updated setup instructions in README.md.

🎥 Session Recording

The lecture recording will be available on Moodle by the afternoon of the lecture.

Footnotes

  1. We are adopting this as the working convention in DS205 this term, aligned with common industry project layouts: Cookiecutter Data Science project opinions.