🎞️ Week 08 Slides

PDF Extraction Quality and Embedding Foundations for PS2

Dr Jon Cardoso-Silva

2026-03-06

Week 08 in one sentence

After this session, you should be able to explain and defend your extraction and embedding choices for ✍️ Problem Set 2.

What W08 is, and is not

This week is for:

  • diagnosing extraction quality
  • starting embeddings
  • documenting one failure and one decision

This week is not for:

  • full RAG completion
  • polished final UX
  • perfect retrieval performance

W07 to W10 trajectory

flowchart LR
    w07["W07 Pipeline skeleton and automation"] --> w08["W08 Check extraction quality + start embeddings"]
    w08 --> w09["W09 Chunking + vector retrieval quality"]
    w09 --> w10["W10 Integration support and refinement"]

Why extraction quality matters first

Poor extraction gives poor chunks, poor chunks give poor embeddings, and poor embeddings hurt retrieval. Start by checking extraction quality.

unstructured in this course

In DS205, we are using unstructured because it is a powerful open-source tool for turning messy PDFs into structured elements you can actually work with. It is also the toolchain we are using in ✍️ Problem Set 2. Stay on the open-source path for coursework, use platform pages for vocabulary only, and keep your setup assumptions explicit in your repo.

Choosing a partition_pdf strategy

Do not look for a universal “best strategy”. Match strategy to what you see in each report.

  • auto
  • fast
  • hi_res
  • ocr_only
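A minimal sketch of matching strategy to symptoms, assuming the unstructured library is installed; `report.pdf` is a placeholder path and `pick_strategy` is a hypothetical helper, not part of the library.

```python
# A minimal sketch, assuming unstructured is installed and "report.pdf"
# stands in for one of your PDFs. pick_strategy is a hypothetical helper.
from pathlib import Path

def pick_strategy(is_scanned: bool, table_heavy: bool) -> str:
    """Heuristic starting point, not a universal rule: match the symptoms you see."""
    if is_scanned:
        return "ocr_only"   # image-based text needs OCR
    if table_heavy:
        return "hi_res"     # layout model helps with tables and multi-column pages
    return "fast"           # clean, extractable text

pdf = Path("report.pdf")
if pdf.exists():
    from unstructured.partition.pdf import partition_pdf
    elements = partition_pdf(filename=str(pdf), strategy=pick_strategy(False, True))
    print(f"{len(elements)} elements extracted")
```

Whatever you pick, record the reason in your decision log so the choice is defensible.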

Strategy heuristics

| Strategy | Good first use | Typical trade-off |
|----------|----------------|-------------------|
| auto | Unknown PDF quality | Generic defaults can miss edge cases |
| fast | Clean, extractable text | Weak on complex layouts/tables |
| hi_res | Multi-column pages and table-heavy reports | Slower, heavier dependencies |
| ocr_only | Scanned pages or image-based text | OCR noise, weaker structure |

What these reports usually look like

  • mixed native text + scanned appendices
  • table-heavy emissions sections
  • multi-column narrative sections
  • inconsistent heading structures

Pick strategy based on these symptoms, not habit.

Quick debug by symptoms

  • Missing text entirely -> check if PDF is scanned, try ocr_only or hi_res
  • Tables flattened into gibberish -> try hi_res, inspect element types
  • Reading order looks wrong -> compare fast vs hi_res, preserve metadata
  • Very slow extraction -> start with a sample subset, then scale
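The "inspect element types" step above can be sketched as a small tally; `elements` is assumed to be output from `partition_pdf`, and class names like `Table` and `NarrativeText` come from unstructured.

```python
# A quick diagnostic sketch: tally element class names so problems show up,
# e.g. a table-heavy report with zero "Table" elements. Assumes `elements`
# is a list returned by partition_pdf.
from collections import Counter

def element_type_counts(elements) -> Counter:
    """Count element class names, e.g. Title, NarrativeText, Table, ListItem."""
    return Counter(type(el).__name__ for el in elements)

# On real output: if a table-heavy report yields no "Table" entries,
# retry with strategy="hi_res" before blaming the embeddings.
```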

Embedding intuition

Word embeddings place words in a vector space.

Sentence/document embeddings do the same for bigger text units.

For PS2, focus on whether retrieval works on your reports, not on theory for its own sake.
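The vector-space intuition can be made concrete with one similarity check. This is a hedged sketch, assuming sentence-transformers is installed; `all-MiniLM-L6-v2` is one common lightweight baseline, not a course mandate, and the example phrases are illustrative only.

```python
# A minimal sketch, assuming sentence-transformers is installed.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def demo():
    # Call this after installing sentence-transformers; the first run
    # downloads the model weights.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # one lightweight baseline
    emb = model.encode(["scope 1 emissions", "direct greenhouse gas emissions"])
    print(cosine_similarity(emb[0], emb[1]))  # related phrases should score high
```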

How to read a model card

Before coding, leave a short note in your repo covering:

  • task fit for retrieval/similarity
  • training data and domain mismatch risk
  • embedding size and storage/runtime trade-off
  • practical input limits
  • license constraints
  • one caveat from the authors you need to account for

From Word2Vec to sentence-transformers

  • Word2Vec builds local co-occurrence intuition.
  • Transformer encoders build contextual representations.
  • Sentence-transformers make this practical for retrieval tasks.

Baseline embedding choices

  • Start with one lightweight model.
  • Record why you chose it.
  • Evaluate with a small, fixed question set.
  • Change one variable at a time.
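"Evaluate with a small, fixed question set" can be sketched as a top-1 hit rate. The embeddings below are placeholders for your model's output; `top1_accuracy` is a hypothetical helper under those assumptions.

```python
# Hedged sketch: for each question in a fixed set, check whether the
# expected passage is the nearest neighbour. Plug in real embeddings
# from whichever model you are evaluating.
import numpy as np

def top1_accuracy(question_embs: np.ndarray,
                  passage_embs: np.ndarray,
                  expected_idx: list[int]) -> float:
    """Fraction of questions whose expected passage ranks first by cosine."""
    # Normalise rows so the dot product equals cosine similarity.
    q = question_embs / np.linalg.norm(question_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    best = (q @ p.T).argmax(axis=1)
    return float((best == np.array(expected_idx)).mean())
```

Re-run the same fixed set after each single change (model, strategy, split size) so you can attribute any score movement to that one variable.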

Where to stop in W08

By the end of this week, I should be able to open your repo and see:

  1. extraction stage that runs
  2. inspect/diagnose step
  3. one model-card-based embedding choice
  4. notes on one failure and one design decision

What to save in your repo

  • README.md section: run instructions
  • dependency file (requirements.txt or environment.yml)
  • extraction script/notebook with saved outputs
  • brief decision log in markdown

Common traps

  • jumping to retrieval before validating extraction
  • claiming “works” without inspecting element outputs
  • mixing many model choices at once
  • not documenting environment assumptions

Bridge to lab

In lab you will:

  • run extraction on chosen PDFs
  • inspect extracted elements and metadata
  • run your first embedding baseline in code
  • run one basic search using manually split extracted text
  • document one extraction failure and one decision

First executable lab flow

flowchart LR
    query["Write query sentence/question"] --> texts["Texts to embed"]
    parse["parse_pdf + crude manual split"] --> texts
    texts --> embedder["Embedding model"]
    embedder --> sim["Similarity search"]
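The lab flow above can be sketched end to end. This is a minimal sketch, assuming `raw_text` holds extraction output and that embeddings come from a model such as sentence-transformers; `crude_split` and `search` are hypothetical helpers, and the fixed-width split is deliberately crude, since W09 replaces it with proper chunking.

```python
# A runnable sketch of the lab flow: crude manual split -> embed -> search.
import numpy as np

def crude_split(text: str, size: int = 500) -> list[str]:
    """Fixed-width manual split; W09 replaces this with real chunking."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def search(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k chunks most similar to the query (cosine)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    return np.argsort(c @ q)[::-1][:k]

# Wiring it up (assumes an embedding model is available):
#   chunks = crude_split(raw_text)
#   chunk_embs = model.encode(chunks)
#   query_emb = model.encode(["What are the scope 1 emissions?"])[0]
#   for i in search(query_emb, chunk_embs):
#       print(chunks[i][:80])
```

Keep the query, splits, and top hits in your notes; they are the evidence for the failure and decision you document this week.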

Bridge to W09

Next week depends on this week:

  • no reliable chunking without extraction quality
  • no meaningful retrieval evaluation without stable chunks

Your W08 outputs become W09 inputs, so this week is foundation work, not a side task.

Reading and references