DS205 2025-2026 Winter Term

πŸ’» Week 08 Lab

Partitioning PDFs and Building Your First Embedding Baseline

ps2
unstructured
partition-pdf
embeddings
Hands-on extraction quality checks and first embedding baseline for PS2.
Author

Dr Jon Cardoso-Silva

Published

06 March 2026

Modified

06 March 2026

πŸ₯… Learning Goals

By the end of this lab, I want you to be able to: i) extract text from selected PS2 PDFs with unstructured, ii) compare partition_pdf strategy outputs and justify one choice, iii) generate one baseline embedding set for extracted chunks, iv) document one failure mode and one design decision in your repository.

The πŸ–₯️ Week 08 Lecture showed how to check extraction quality and start embeddings for this stage of ✍️ Problem Set 2. In this lab, you will apply those choices to your own selected company reports.

πŸ“ Session Details

  • Date: Tuesday, 10 March 2026
  • Time: Check your timetable for your class slot
  • Duration: 90 minutes

πŸ›£οΈ Lab Roadmap

How the W08 lab is structured
| Part | Activity Type | Focus | Time | Outcome |
|------|---------------|-------|------|---------|
| Part 0 | 👤 Teaching Moment | W08 goals and quality criteria | 10 min | Shared expectations |
| Part 1 | ⏸️ Action Points | Run extraction with partition_pdf | 25 min | Initial element outputs saved |
| Part 2 | ⏸️ Action Points | Compare strategies and debug symptoms | 20 min | One justified strategy choice |
| Part 3 | ⏸️ Action Points | Build one embedding baseline and run similarity search | 20 min | First executable search outputs |
| Part 4 | 🗣️ Wrap-up | Document one failure + one decision | 15 min | PS2-ready notes and next steps |

πŸ‘‰ NOTE: Whenever you see a πŸ‘€ TEACHING MOMENT, pause and focus on your class teacher’s walkthrough.

Part 0: Opening and quality criteria (10 min)

This section is a TEACHING MOMENT

Your class teacher will recap what counts as success for this week:

  • extraction output is inspected, not assumed
  • strategy choice is justified by document symptoms
  • one embedding baseline is generated and recorded
  • one failure and one decision are documented in your repo

Part 1: Run extraction with partition_pdf (25 min)

🎯 ACTION POINTS

Step 1: Pick 2 to 3 reports

Use your chosen PS2 companies and select 2 to 3 PDF reports that give varied layout challenges (for example, one clean narrative report and one table-heavy or scanned document).

Step 2: Run a baseline extraction

Use this as a starter:

from unstructured.partition.pdf import partition_pdf

# "auto" lets unstructured pick a strategy based on the document's
# characteristics (text layer present, images, etc.)
elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="auto",
)

# Quick sanity checks: element count, type of the first element,
# and a preview of its text
print(f"elements: {len(elements)}")
print(type(elements[0]).__name__)
print(elements[0].text[:200])

Step 3: Save simple inspection evidence

Leave a short evidence trail in your notes:

  • number of elements extracted
  • 3 to 5 representative element samples
  • one note on what looks good and one note on what still looks wrong
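
That evidence trail can be generated rather than typed by hand. Below is a small sketch of our own (`summarise_elements` is not part of unstructured; it just walks the `elements` list returned in Step 2):

```python
from collections import Counter


def summarise_elements(elements, n_samples=5):
    """Summarise partition_pdf output for an inspection note.

    Each element from partition_pdf has a class name (e.g. NarrativeText,
    Title, Table) and a .text attribute.
    """
    type_counts = Counter(type(el).__name__ for el in elements)
    samples = [el.text[:120] for el in elements[:n_samples]]
    return {
        "n_elements": len(elements),
        "type_counts": dict(type_counts),
        "samples": samples,
    }
```

Dump the returned dict into your notes or a small JSON file next to the raw output so the numbers you quote are reproducible.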

Part 2: Compare strategies and debug symptoms (20 min)

🎯 ACTION POINTS

Run at least two strategies on the same report section:

  • fast vs hi_res, or
  • auto vs ocr_only (if scanned)

Use a compact comparison table in your notes:

| Strategy | What improved | What got worse | Runtime note | Keep or reject |
|----------|---------------|----------------|--------------|----------------|
| fast | | | | |
| hi_res | | | | |

Quick troubleshooting prompts

  • Missing text entirely -> is this scanned?
  • Tables unusable -> should you test hi_res?
  • Reading order odd -> compare same page range across strategies
  • Runtime too high -> are you testing on full corpus too early?
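
A small timing harness can fill in the runtime column without stopwatch work. This is a sketch under our own naming; `run_extraction` is a hypothetical wrapper you would write around partition_pdf for one report section:

```python
import time


def compare_strategies(run_extraction, strategies):
    """Time each strategy on the same input and collect one comparison row each.

    run_extraction(strategy) should return the list of extracted elements.
    """
    rows = []
    for strategy in strategies:
        start = time.perf_counter()
        elements = run_extraction(strategy)
        rows.append({
            "strategy": strategy,
            "n_elements": len(elements),
            "runtime_s": round(time.perf_counter() - start, 2),
        })
    return rows


# Example wiring (paths assumed, not prescribed):
# rows = compare_strategies(
#     lambda s: partition_pdf("data/raw/company_a_report.pdf", strategy=s),
#     ["fast", "hi_res"],
# )
```

Paste the resulting rows straight into your comparison table, then add the qualitative columns by inspecting outputs.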

Part 3: Build one embedding baseline and run basic search (20 min)

🎯 ACTION POINTS

Use the starter notebook for this part:

Before running it, copy download/week08/.env.example to .env in the same folder and adjust paths for your machine.

Feel free to use Jupyter Notebooks in your own project whenever you want to explore a new library, inspect outputs, or document technical decisions for a future reader.

For ✍️ Problem Set 2, your pipeline must still run as Python scripts.

Run your first end-to-end retrieval flow with explicit file input/output:

  • write a query sentence/question
  • run partition_pdf and do crude manual splitting
  • embed both query and split units
  • run similarity ranking
  • save outputs to files

You are not designing chunking properly yet. For this week, manual splitting is enough to test end-to-end retrieval behaviour.
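
The shape of that end-to-end flow can be sketched with placeholder "embeddings": bag-of-words counts stand in for a real model so the ranking logic stays visible. In the lab itself, swap `embed` for your chosen model (e.g. sentence-transformers); everything else here is illustrative:

```python
import json
import math
from collections import Counter
from pathlib import Path


def embed(text):
    """Placeholder embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_units(query, units, top_k=5):
    """Rank manually split text units by similarity to the query."""
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(u)), u) for u in units]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Toy units; in the lab these come from your crude manual splits
units = [
    "revenue grew strongly in the retail segment",
    "climate risk disclosures expanded this year",
    "the board approved a new dividend policy",
]
top = rank_units("climate risk reporting", units)

# Persist outputs so the run leaves a file trail
Path("data/interim").mkdir(parents=True, exist_ok=True)
Path("data/interim/search_top5.json").write_text(json.dumps(top, indent=2))
```

The structure (query in, ranked units and files out) is what matters this week; the quality of the scores will improve once a real embedding model replaces the placeholder.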

For this week, treat data/interim/ as your playground area for stage outputs while you test ideas. This mirrors a common pipeline structure where source files stay in data/raw/, temporary step outputs live in data/interim/, and stable final artefacts move to data/processed/.

πŸ“– Reference: Cookiecutter Data Science project opinions

Record:

  • model chosen and why you chose it
  • input file path used for parsing
  • output file paths created (manual_units.json, search_top5.json)
  • one example where the top hit was relevant
  • one example where the top hit was poor

Bonus suggested task (if you finish early):

Create a first script version of this notebook flow in your own PS2 repository (for example, pipeline/embed_search.py) so you can start moving from exploration to a runnable pipeline.
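
A starting skeleton for that script could look like the following. The file path comes from the bonus task; the CLI flags are our suggestion, not a required interface:

```python
# pipeline/embed_search.py (suggested location from the bonus task)
import argparse
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(
        description="Embed manually split units and rank them against a query."
    )
    parser.add_argument("--input", type=Path, required=True,
                        help="JSON file with your crude manual splits")
    parser.add_argument("--query", required=True,
                        help="query sentence or question")
    parser.add_argument("--output", type=Path,
                        default=Path("data/interim/search_top5.json"),
                        help="where to write the ranked results")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # TODO: load units from args.input, embed args.query and each unit,
    # rank by similarity, write the top 5 to args.output
    raise NotImplementedError


# Entry point once the TODO is filled in:
# if __name__ == "__main__":
#     main()
```

Even an unfinished skeleton like this forces you to decide the script's inputs and outputs, which is the hard part of moving from notebook exploration to a pipeline.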

Part 4: Write up one failure and one decision (15 min)

🎯 ACTION POINTS

Before you leave, add a short section to your README.md or project notes with:

  1. Failure observed (for example: table extraction broke under fast)
  2. Decision made (for example: moved to hi_res for table-heavy documents)
  3. Next test planned (what you will validate when moving from manual splits to proper W09 chunking)

This gives you written evidence you can use in your PS2 submission.

Optional local setup troubleshooting (non-Nuvolos)

Open only if you are running locally and hitting dependency errors

All setup details are now centralised in:

That page includes Nuvolos baseline setup, local conda variants, GPU optional paths for Windows/macOS, and CI environment guidance.

Optional: GitHub Actions section for conda environment + cache

If you want CI to run this reliably, use a dedicated environment.ci.yml instead of reusing local/Nuvolos specs.

Why:

  • CI should stay minimal and reproducible.
  • Local machine fixes should not leak into CI.
  • GitHub runners are ephemeral, so environment naming conventions matter less than in local dev.

- name: Set up Miniconda
  uses: conda-incubator/setup-miniconda@v3
  with:
    environment-file: download/week08/environment.ci.yml
    activate-environment: rag  # assumed to match the `name:` inside environment.ci.yml
    auto-activate-base: false  # activate the created env, not base
    use-mamba: true

- name: Cache conda + pip packages
  uses: actions/cache@v4
  with:
    path: |
      ~/.conda/pkgs
      ~/.cache/pip
    key: ${{ runner.os }}-conda-${{ hashFiles('download/week08/environment.ci.yml') }}

This is intentionally not a full workflow file. Use W07 lecture/lab guidance to place these steps in your own workflow order.

If you want a Nuvolos-specific variant, create a separate environment-nuvolos.yml and point the appropriate setup to that file.

Example Nuvolos delta:

# environment-nuvolos.yml
name: rag
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - poppler
  - tesseract
  - libreoffice
  - pandoc
  - pip:
      - unstructured[pdf]
      - sentence-transformers

Support for common blockers

πŸ”§ Common blockers this week

  • ModuleNotFoundError for extraction libs: verify your active environment and reinstall dependencies from your chosen file.
  • Extraction runs but outputs look empty: test whether PDF pages are scanned and try ocr_only or hi_res.
  • Very slow extraction: run strategy tests on a small subset first, then scale.
  • Unsure which strategy is β€œcorrect”: pick the one that best serves your target evidence sections and document the trade-off.

If you are stuck, ask your class teacher for a quick diagnosis before changing many things at once.

Appendix | Resources