flowchart LR
w07["W07 Pipeline skeleton and automation"] --> w08["W08 Check extraction quality + start embeddings"]
w08 --> w09["W09 Chunking + vector retrieval quality"]
w09 --> w10["W10 Integration support and refinement"]
PDF Extraction Quality and Embedding Foundations for PS2
2026-03-06
After this session, you should be able to explain and defend your extraction and embedding choices for ✍️ Problem Set 2.
This week is for:
This week is not for:
Poor extraction gives poor chunks, poor chunks give poor embeddings, and poor embeddings hurt retrieval. Start by checking extraction quality.
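One way to make "checking extraction quality" concrete is to compute a few rough signals on the extracted text before chunking. The sketch below is only an illustration: the function name `extraction_report`, the 20-character threshold, and the set of "expected" punctuation are assumptions, not course requirements, and the numbers are prompts for manual inspection rather than pass/fail criteria.

```python
import re

# Punctuation we expect in ordinary report prose (an assumption; adjust per corpus).
EXPECTED_PUNCT = set(".,;:!?'\"()-%/&")


def extraction_report(text: str) -> dict:
    """Return rough, heuristic signals of extraction quality for one document."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    chars = [c for c in text if not c.isspace()]

    # Characters that are neither alphanumeric nor expected punctuation
    # often point to encoding debris or OCR noise.
    odd = sum(1 for c in chars if not (c.isalnum() or c in EXPECTED_PUNCT))

    # Lines ending in "letter + hyphen" often mean a word was split across lines.
    hyphen_breaks = sum(1 for ln in lines if re.search(r"[A-Za-z]-\s*$", ln))

    # Many very short lines can mean columns or table cells spilled into the text.
    short_lines = sum(1 for ln in lines if len(ln.strip()) < 20)

    return {
        "n_chars": len(chars),
        "odd_char_ratio": odd / len(chars) if chars else 0.0,
        "hyphen_break_lines": hyphen_breaks,
        "short_line_ratio": short_lines / len(lines) if lines else 0.0,
    }
```

Running this on the same report extracted with two different strategies gives a quick side-by-side comparison, which you can then confirm by reading a few pages yourself.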
`unstructured` in this course
In DS205, we are using `unstructured` because it is a powerful open-source tool for turning messy PDFs into structured elements you can actually work with. It is also the toolchain we are using in ✍️ Problem Set 2. Stay on the open-source path for coursework, use platform pages for vocabulary only, and keep your setup assumptions explicit in your repo.
`partition_pdf` strategy
Do not look for a universal “best strategy”. Match the strategy to what you see in each report.
| Strategy | Good first use | Typical trade-off |
|---|---|---|
| `auto` | Unknown PDF quality | Generic defaults can miss edge cases |
| `fast` | Clean, extractable text | Weak on complex layouts/tables |
| `hi_res` | Multi-column pages and table-heavy reports | Slower, heavier dependencies |
| `ocr_only` | Scanned pages or image-based text | OCR noise, weaker structure |
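The table above can be read as a small decision rule. The sketch below encodes it as a starting point only: `suggest_strategy` and `partition_report` are hypothetical helper names, and the two symptom flags are a simplification of what you would actually observe in a report. The call to `partition_pdf` with `filename` and `strategy` arguments follows the open-source `unstructured` library; it is imported lazily so the lookup helper can be used without the library installed.

```python
def suggest_strategy(has_text_layer=None, complex_layout=False):
    """Map observed PDF symptoms to a first partition_pdf strategy to try.

    has_text_layer: True if text can be selected and copied, False if pages
        are scanned images, None if you have not checked yet.
    complex_layout: True for multi-column or table-heavy pages.
    """
    if has_text_layer is None:
        return "auto"       # unknown PDF quality
    if not has_text_layer:
        return "ocr_only"   # scanned pages or image-based text
    if complex_layout:
        return "hi_res"     # multi-column pages and table-heavy reports
    return "fast"           # clean, extractable text


def partition_report(path, strategy):
    """Partition one PDF with the chosen strategy (requires `unstructured`)."""
    # Lazy import: only needed when a PDF is actually partitioned.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, strategy=strategy)
```

Treat the suggestion as a first attempt. If the output looks wrong for a given report, try the next strategy and record why you switched.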
Pick strategy based on these symptoms, not habit.
- `ocr_only` or `hi_res`
- `hi_res`, inspect element types
- `fast` vs `hi_res`, preserve metadata
Word embeddings place words in a vector space.
Sentence/document embeddings do the same for bigger text units.
For PS2, focus on whether retrieval works on your reports, not on theory for its own sake.
flowchart LR
reportChunk1["Chunk 1"] --> embedder["Embedder"]
reportChunk2["Chunk 2"] --> embedder
reportChunk3["Chunk 3"] --> embedder
queryText["Query sentence/question"] --> embedder
embedder --> vectorSet["Vectors"]
vectorSet --> similarity["Similarity scoring"]
similarity --> topResults["Top similar chunks"]
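The flow in the diagram (chunks and query go through the same embedder, then similarity scoring picks the top chunks) can be sketched with nothing but the standard library. Note the loud assumption: `toy_embed` is a bag-of-words count vector standing in for a real embedding model, so it only matches exact words and captures no meaning. The chunk sentences are made up for illustration, and `top_chunks` is a hypothetical helper name.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedder: word-count vector (NOT a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query, chunks, k=2):
    """Embed the query and every chunk, then return the k best (score, chunk) pairs."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up example chunks, for illustration only.
chunks = [
    "Scope 1 emissions fell compared with the previous year.",
    "The board met four times during the reporting period.",
    "Water use at the main site is reported in cubic metres.",
]
results = top_chunks("How did scope 1 emissions change?", chunks, k=1)
```

Swapping `toy_embed` for a real sentence-embedding model keeps the rest of the flow unchanged, which is why it is worth checking the retrieval step on your own reports early.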
Before coding, leave a short note in your repo covering:
By the end of this week, I should be able to open your repo and see:
- `README.md` section: run instructions
- `requirements.txt` or `environment.yml`
In lab you will:
flowchart LR
query["Write query sentence/question"] --> box1["box1"]
parse["parse_pdf + crude manual split"] --> box1
box1 --> embedder["embedder"]
embedder --> sim["search similarity"]
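For the "crude manual split" step in the lab flow, one possible minimal version is to split on blank lines and pack paragraphs into chunks up to a size limit. This is a sketch under assumptions: the name `crude_split` and the 500-character default are illustrative, and it takes plain text as input rather than the output of any particular parser.

```python
import re


def crude_split(text, max_chars=500):
    """Split text on blank lines, then pack paragraphs into chunks.

    Each chunk holds whole paragraphs and stays within max_chars, except that
    a single paragraph longer than max_chars is kept as one over-long chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either it still fits, or this is the first paragraph of a chunk.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

A split like this ignores headings, tables, and sentence boundaries, so it is only good enough to get the embedding and similarity steps running.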
Next week depends on this week:
Your W08 outputs become W09 inputs, so this week is foundation work, not a side task.