DS205 2025-2026 Winter Term

đŸ–Ĩī¸ Week 08 Lecture

PDF Extraction Quality and Intro to Embeddings

Author

Dr Jon Cardoso-Silva

Published

06 March 2026

This week we will explore a library called unstructured to extract text from PDFs, and we will run a first search based on vector embeddings: a technique, powered by deep learning models, that represents text as numerical vectors.

Use this week to select the PDFs you want to use in your ✏️ Problem Set 2 and to think about what you need to extract from the PDF(s).

📍 Session Details

  • Date: Monday, 09 March 2026
  • Time: 16:00 - 18:00
  • Location: SAL.G.03

📋 Preparation

  • Re-read the extraction and embedding requirements in ✏️ Problem Set 2.
  • Bring 2 to 3 corporate reports you want to test this week (ideally from your chosen PS2 companies).
  • Skim the unstructured partitioning docs if you want to know more about the library.

Section A: Framing W08 in the PS2 timeline

Today is about getting foundations right before you move into retrieval decisions in W09. We focus on two things only: extraction quality and one embedding baseline you can defend.

flowchart LR
    opening[Session framing and PS2 positioning] --> blockA[Block A PDF extraction with unstructured]
    blockA --> blockB[Block B starting embeddings]
    blockB --> blockC[Block C where this stage ends]
    blockC --> closing[Closing W09 handoff and lab prep]

For âœī¸ Problem Set 2, the sequence is practical. First, we extract usable text from PDFs. Then we split that long report text into smaller units, which we call chunks.

This week, we will use a deliberately simple chunking approach so you can get a full end-to-end baseline running. In W09, we will come back and treat chunking strategy properly, with a deeper discussion of trade-offs and evaluation.
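As a concrete illustration of that deliberately simple approach, here is a minimal fixed-size character chunker (a sketch only; the size and overlap values below are placeholders, not recommendations):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap.

    Deliberately naive: it ignores sentence and section boundaries,
    which is exactly the limitation we will revisit in W09.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks

# A toy stand-in for extracted report text.
report_text = "Scope 1 emissions fell by 12 percent. " * 50
chunks = chunk_text(report_text, size=200, overlap=20)
print(len(chunks), "chunks")
```

The overlap means the tail of each chunk is repeated at the head of the next one, so a sentence cut in half by one boundary still appears whole in a neighbouring chunk.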

If extraction is noisy, every step after that is harder to trust.

Section B: unstructured and ETL language

In W07 we used ETL and ELT language. Keep using that same framing here:

  • Extract: load PDFs and partition content into typed elements
  • Transform: clean, filter, and normalise extracted content
  • Load: persist outputs for chunking and embedding stages
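A minimal sketch of the Transform and Load stages, assuming extraction has already produced (category, text) pairs. With unstructured installed you would read these from each element's .category and .text attributes; the toy records and the output path here are illustrative only:

```python
import json
import re
from pathlib import Path

def transform(records):
    """Transform: collapse whitespace and drop empty elements."""
    cleaned = []
    for category, text in records:
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned.append({"category": category, "text": text})
    return cleaned

def load(records, path):
    """Load: persist cleaned records for the chunking and embedding stages."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records, indent=2))

# Toy records standing in for extracted elements.
raw = [("Title", "  Climate  Report\n2025 "), ("NarrativeText", "   ")]
clean = transform(raw)
load(clean, "data/interim/company_a_elements.json")
print(clean)
```

Persisting the transformed output as JSON in data/interim/ keeps each stage inspectable on its own, which matters once extraction bugs start hiding behind embedding code.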

The unstructured documentation mixes product and open-source language. In DS205, stay with the open-source path and keep your workflow reproducible in your own repo.

Section C: Choosing a partition_pdf strategy

There is no single best extraction strategy. Choose a strategy based on the document you are processing.

Strategy comparison for PS2 work

Recommended partition_pdf strategy framing for W08:

  • auto: first pass when document quality is unknown. Strengths: good default for mixed reports. Limits: can miss edge cases in complex layouts.
  • fast: clean PDFs with extractable text. Strengths: fast and lightweight. Limits: weak on scanned pages and complex tables.
  • hi_res: table-heavy and multi-column reports. Strengths: better structural recovery. Limits: slower, more dependencies.
  • ocr_only: scanned or image-based PDFs. Strengths: gets text where no embedded text exists. Limits: OCR noise and weaker layout fidelity.

Minimal runnable examples

Example 1: Baseline extraction with auto
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="auto",
)

print(f"Extracted {len(elements)} elements")
print(type(elements[0]).__name__, elements[0].text[:200])

Example 2: Comparing fast and hi_res on the same document
from unstructured.partition.pdf import partition_pdf

fast_elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="fast",
)

hires_elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="hi_res",
)

print("fast:", len(fast_elements))
print("hi_res:", len(hires_elements))

Practical tip: compare outputs on a small slice first, then scale to full reports.
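One quick way to act on that tip is to compare element-type counts rather than eyeballing raw output. A sketch, using stand-in classes so it runs without unstructured installed; in practice you would pass fast_elements and hires_elements from Example 2:

```python
from collections import Counter

def summarise(elements):
    """Count extracted element types so two strategies can be compared at a glance."""
    return Counter(type(el).__name__ for el in elements)

# Stand-in element classes for illustration only; unstructured
# produces typed elements such as Title, NarrativeText, and Table.
class Title: pass
class NarrativeText: pass
class Table: pass

fast_like = [Title(), NarrativeText(), NarrativeText()]
hires_like = [Title(), NarrativeText(), NarrativeText(), Table()]

print("fast:", summarise(fast_like))
print("hi_res:", summarise(hires_like))
```

A strategy that recovers Table elements where another produces only fragmented NarrativeText is usually the better fit for table-heavy reports.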

Example 3: Scanned PDF fallback with ocr_only
from unstructured.partition.pdf import partition_pdf

ocr_elements = partition_pdf(
    filename="data/raw/company_scanned_appendix.pdf",
    strategy="ocr_only",
)

print(f"OCR elements: {len(ocr_elements)}")

Which strategy when (what you see in TPI reports)

  • If text copy-paste works and pages are simple, start with fast.
  • If reading order or table boundaries are messy, test hi_res.
  • If the report is scanned or image-based, move to ocr_only.
  • If you are unsure, start with auto, inspect outputs, then narrow.
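Those bullets can be encoded as a simple first-pass heuristic. A sketch, not a rule: the two boolean inputs are crude simplifications of real document inspection:

```python
def pick_strategy(has_embedded_text: bool, complex_layout: bool) -> str:
    """Suggest a first partition_pdf strategy, following the heuristics above."""
    if not has_embedded_text:
        return "ocr_only"  # scanned or image-based PDF
    if complex_layout:
        return "hi_res"    # messy reading order or table boundaries
    return "fast"          # clean, simple pages with extractable text

print(pick_strategy(has_embedded_text=True, complex_layout=False))
```

When you cannot answer either question confidently, that is exactly the "start with auto, inspect, then narrow" case.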

Debug by symptoms before embedding

Debug-first heuristics for W08 extraction work:

  • Symptom: almost no text extracted. Likely cause: scanned/image PDF. First adjustment: try ocr_only or hi_res.
  • Symptom: tables become fragmented text. Likely cause: layout complexity. First adjustment: test hi_res and inspect element types.
  • Symptom: reading order is incoherent. Likely cause: multi-column parsing issue. First adjustment: compare fast vs hi_res on the same sample.
  • Symptom: runtime is very slow. Likely cause: heavy strategy on the full corpus too early. First adjustment: validate on a few pages/files first.

Section D: The notion of embeddings

What an embedding represents (visual intuition)

Think of an embedding as a coordinate in a high-dimensional space. Texts with similar meaning end up closer together.

flowchart LR
    textA["Chunk A: Scope 1 emissions decline"] --> encoder[Embedding model]
    textB["Chunk B: financed emissions target"] --> encoder
    queryQ["Query: emissions reduction plan"] --> encoder
    encoder --> vecA["Vector A"]
    encoder --> vecB["Vector B"]
    encoder --> vecQ["Vector Q"]
    vecQ --> sim["Similarity comparison"]
    vecA --> sim
    vecB --> sim
    sim --> topk["Top-k most similar chunks"]

Use this mental model in W08: retrieval quality depends on whether your embedded units preserve useful meaning.
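The flow in the diagram can be sketched end to end with toy vectors. These are 3-dimensional purely for readability, and the numbers are made up; real embedding models produce vectors with hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity: the angle-based closeness of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for the two chunks and the query in the diagram.
chunks = {
    "Chunk A: Scope 1 emissions decline": [0.9, 0.1, 0.0],
    "Chunk B: financed emissions target": [0.7, 0.6, 0.2],
}
query = [0.8, 0.2, 0.1]

# Rank chunks by similarity to the query: the top-k step in the diagram.
ranked = sorted(chunks.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
for text, vec in ranked:
    print(f"{cosine(query, vec):.3f}  {text}")
```

Everything interesting happens before this step: if extraction or chunking mangled the text, the similarity ranking will be confidently wrong.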

From intuition to model choice

  1. Word2Vec intuition: nearby words in similar contexts get nearby vectors.
  2. Transformer encoders: context-sensitive vectors for phrases/sentences.
  3. Sentence-transformers: practical encoders for search/retrieval pipelines.

How to read a HuggingFace model card for PS2

Before coding in lab, inspect the model card and make one explicit model choice.

Write a short note in your repo that answers:

  • Task fit: is it actually built for sentence similarity/retrieval?
  • Training data: how close is it to climate disclosure language?
  • Embedding size: what does vector dimension imply for storage and runtime?
  • Input limits: what practical length limits do you need to respect?
  • License: can you use it for coursework and share your repository?
  • Usage notes: what caveat did you notice and how will you account for it?
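One way to keep that note consistent across the class is a small template. The example values below describe sentence-transformers/all-MiniLM-L6-v2 to the best of my knowledge; verify every one of them against the actual model card before relying on it:

```python
NOTE = """\
# Embedding model choice (PS2)
Model: {model}
Task fit: {task_fit}
Training data: {training_data}
Embedding size: {dim}
Input limits: {limits}
License: {license}
Caveat and mitigation: {caveat}
"""

note = NOTE.format(
    model="sentence-transformers/all-MiniLM-L6-v2",
    task_fit="sentence similarity / semantic search",
    training_data="general web sentence-pair corpora, not climate-specific",
    dim="384 dimensions",
    limits="long inputs beyond the model's token limit are truncated",
    license="Apache-2.0",
    caveat="generic training data; spot-check retrieval on disclosure jargon",
)
print(note)
```

Committing this note alongside your code makes the model choice a defensible decision rather than a default you inherited from a tutorial.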

Good starter references:

You will apply this in lab, where you run your first query -> embed -> similarity flow.

Section E: Environment setup (moved)

All setup and environment guidance now lives on the course's dedicated setup page.

Use that page for:

  • Nuvolos baseline environment
  • local Windows/macOS/Linux conda variants
  • optional GPU paths (Windows NVIDIA and macOS Apple Silicon)
  • CI environment notes and caching
  • dependency rationale (poppler, tesseract, pandoc, python-magic)

Section G: What should go into your PS2 this week

By the end of W08, your repository should ideally contain:

  1. A runnable extraction step using unstructured on selected reports.
  2. Evidence that you inspected extraction outputs (not just ran scripts).
  3. One embedding model choice justified from its HuggingFace model card.
  4. One basic search result file produced in lab from a simple sectioning of the extracted text.
  5. A clear use of data/interim/ as your playground area for temporary stage outputs.1
  6. A short note on one extraction failure and one design decision.
  7. Updated setup instructions in README.md.

🎥 Session Recording

The lecture recording will be available on Moodle by the afternoon of the lecture.

Footnotes

  1. We are adopting this as the working convention in DS205 this term, aligned with common industry project layouts: Cookiecutter Data Science project opinions.