🎞️ Week 08 Slides

PDF Extraction Quality and Embedding Foundations for PS2

Dr Jon Cardoso-Silva

2026-03-06

Week 08 in one sentence

After this session, you should be able to explain and defend your extraction and embedding choices for ✍️ Problem Set 2.

What W08 is, and is not

This week is for:

  • diagnosing extraction quality
  • starting embeddings
  • documenting one failure and one decision

This week is not for:

  • full RAG completion
  • polished final UX
  • perfect retrieval performance

W07 to W10 trajectory

flowchart LR
    w07["W07 Pipeline skeleton and automation"] --> w08["W08 Check extraction quality + start embeddings"]
    w08 --> w09["W09 Chunking + vector retrieval quality"]
    w09 --> w10["W10 Integration support and refinement"]

Why extraction quality matters first

Poor extraction gives poor chunks, poor chunks give poor embeddings, and poor embeddings hurt retrieval. Start by checking extraction quality.

unstructured in this course

In DS205, we are using unstructured because it is a powerful open-source tool for turning messy PDFs into structured elements you can actually work with. It is also the toolchain we are using in ✍️ Problem Set 2. Stay on the open-source path for coursework, use platform pages for vocabulary only, and keep your setup assumptions explicit in your repo.

Choosing a partition_pdf strategy

Do not look for a universal “best strategy”. Match strategy to what you see in each report.

  • auto
  • fast
  • hi_res
  • ocr_only
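A minimal sketch of matching strategy to symptoms, assuming the unstructured library is installed; `report.pdf` is a placeholder path and `pick_strategy` is a hypothetical helper, not part of the library.

```python
# A minimal sketch, assuming unstructured is installed and "report.pdf"
# stands in for one of your PDFs. pick_strategy is a hypothetical helper.
from pathlib import Path

def pick_strategy(is_scanned: bool, table_heavy: bool) -> str:
    """Heuristic starting point, not a universal rule: match the symptoms you see."""
    if is_scanned:
        return "ocr_only"   # image-based text needs OCR
    if table_heavy:
        return "hi_res"     # layout model helps with tables and multi-column pages
    return "fast"           # clean, extractable text

pdf = Path("report.pdf")
if pdf.exists():
    from unstructured.partition.pdf import partition_pdf
    elements = partition_pdf(filename=str(pdf), strategy=pick_strategy(False, True))
    print(f"{len(elements)} elements extracted")
```

Whatever you pick, record the reason in your decision log so the choice is defensible.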

Strategy heuristics

| Strategy | Good first use | Typical trade-off |
|----------|----------------|-------------------|
| auto | Unknown PDF quality | Generic defaults can miss edge cases |
| fast | Clean, extractable text | Weak on complex layouts/tables |
| hi_res | Multi-column pages and table-heavy reports | Slower, heavier dependencies |
| ocr_only | Scanned pages or image-based text | OCR noise, weaker structure |

What these reports usually look like

  • mixed native text + scanned appendices
  • table-heavy emissions sections
  • multi-column narrative sections
  • inconsistent heading structures

Pick strategy based on these symptoms, not habit.

Quick debug by symptoms

  • Missing text entirely -> check if PDF is scanned, try ocr_only or hi_res
  • Tables flattened into gibberish -> try hi_res, inspect element types
  • Reading order looks wrong -> compare fast vs hi_res, preserve metadata
  • Very slow extraction -> start with a sample subset, then scale
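The "inspect element types" step above can be sketched as a small tally; `elements` is assumed to be output from `partition_pdf`, and class names like `Table` and `NarrativeText` come from unstructured.

```python
# A quick diagnostic sketch: tally element class names so problems show up,
# e.g. a table-heavy report with zero "Table" elements. Assumes `elements`
# is a list returned by partition_pdf.
from collections import Counter

def element_type_counts(elements) -> Counter:
    """Count element class names, e.g. Title, NarrativeText, Table, ListItem."""
    return Counter(type(el).__name__ for el in elements)

# On real output: if a table-heavy report yields no "Table" entries,
# retry with strategy="hi_res" before blaming the embeddings.
```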

Embedding intuition

Word embeddings place words in a vector space.

Sentence/document embeddings do the same for bigger text units.

For PS2, focus on whether retrieval works on your reports, not on theory for its own sake.
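The vector-space intuition can be made concrete with one similarity check. This is a hedged sketch, assuming sentence-transformers is installed; `all-MiniLM-L6-v2` is one common lightweight baseline, not a course mandate, and the example phrases are illustrative only.

```python
# A minimal sketch, assuming sentence-transformers is installed.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def demo():
    # Call this after installing sentence-transformers; the first run
    # downloads the model weights.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # one lightweight baseline
    emb = model.encode(["scope 1 emissions", "direct greenhouse gas emissions"])
    print(cosine_similarity(emb[0], emb[1]))  # related phrases should score high
```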

How to read a model card

Before coding, leave a short note in your repo covering:

  • task fit for retrieval/similarity
  • training data and domain mismatch risk
  • embedding size and storage/runtime trade-off
  • practical input limits
  • license constraints
  • one caveat from the authors you need to account for

From Word2Vec to sentence-transformers

  • Word2Vec builds local co-occurrence intuition.
  • Transformer encoders build contextual representations.
  • Sentence-transformers make this practical for retrieval tasks.

Baseline embedding choices

  • Start with one lightweight model.
  • Record why you chose it.
  • Evaluate with a small, fixed question set.
  • Change one variable at a time.
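"Evaluate with a small, fixed question set" can be sketched as a top-1 hit rate. The embeddings below are placeholders for your model's output; `top1_accuracy` is a hypothetical helper under those assumptions.

```python
# Hedged sketch: for each question in a fixed set, check whether the
# expected passage is the nearest neighbour. Plug in real embeddings
# from whichever model you are evaluating.
import numpy as np

def top1_accuracy(question_embs: np.ndarray,
                  passage_embs: np.ndarray,
                  expected_idx: list[int]) -> float:
    """Fraction of questions whose expected passage ranks first by cosine."""
    # Normalise rows so the dot product equals cosine similarity.
    q = question_embs / np.linalg.norm(question_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    best = (q @ p.T).argmax(axis=1)
    return float((best == np.array(expected_idx)).mean())
```

Re-run the same fixed set after each single change (model, strategy, split size) so you can attribute any score movement to that one variable.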

Where to stop in W08

By the end of this week, I should be able to open your repo and see:

  1. extraction stage that runs
  2. inspect/diagnose step
  3. one model-card-based embedding choice
  4. notes on one failure and one design decision

What to save in your repo

  • README.md section: run instructions
  • dependency file (requirements.txt or environment.yml)
  • extraction script/notebook with saved outputs
  • brief decision log in markdown

Common traps

  • jumping to retrieval before validating extraction
  • claiming “works” without inspecting element outputs
  • mixing many model choices at once
  • not documenting environment assumptions

Bridge to lab

In lab you will:

  • run extraction on chosen PDFs
  • inspect extracted elements and metadata
  • run your first embedding baseline in code
  • run one basic search using manually split extracted text
  • document one extraction failure and one decision

First executable lab flow

flowchart LR
    query["Write query sentence/question"] --> texts["Texts to embed"]
    parse["parse_pdf + crude manual split"] --> texts
    texts --> embedder["Embedding model"]
    embedder --> sim["Similarity search"]
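The lab flow above can be sketched end to end. This is a minimal sketch, assuming `raw_text` holds extraction output and that embeddings come from a model such as sentence-transformers; `crude_split` and `search` are hypothetical helpers, and the fixed-width split is deliberately crude, since W09 replaces it with proper chunking.

```python
# A runnable sketch of the lab flow: crude manual split -> embed -> search.
import numpy as np

def crude_split(text: str, size: int = 500) -> list[str]:
    """Fixed-width manual split; W09 replaces this with real chunking."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def search(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k chunks most similar to the query (cosine)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    return np.argsort(c @ q)[::-1][:k]

# Wiring it up (assumes an embedding model is available):
#   chunks = crude_split(raw_text)
#   chunk_embs = model.encode(chunks)
#   query_emb = model.encode(["What are the scope 1 emissions?"])[0]
#   for i in search(query_emb, chunk_embs):
#       print(chunks[i][:80])
```

Keep the query, splits, and top hits in your notes; they are the evidence for the failure and decision you document this week.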

Bridge to W09

Next week depends on this week:

  • no reliable chunking without extraction quality
  • no meaningful retrieval evaluation without stable chunks

Your W08 outputs become W09 inputs, so this week is foundation work, not a side task.

Reading and references