💻 Week 08 Lab
Partitioning PDFs and Building Your First Embedding Baseline
By the end of this lab, I want you to be able to:
- extract text from selected PS2 PDFs with unstructured
- compare partition_pdf strategy outputs and justify one choice
- generate one baseline embedding set for extracted chunks
- document one failure mode and one design decision in your repository
The 🖥️ Week 08 Lecture showed how to check extraction quality and start embeddings for this stage of ✍️ Problem Set 2. In this lab, you will apply those choices to your own selected company reports.
📅 Session Details
- Date: Tuesday, 10 March 2026
- Time: Check your timetable for your class slot
- Duration: 90 minutes
🗣️ Lab Roadmap
| Part | Activity Type | Focus | Time | Outcome |
|---|---|---|---|---|
| Part 0 | 🤝 Teaching Moment | W08 goals and quality criteria | 10 min | Shared expectations |
| Part 1 | ⏸️ Action Points | Run extraction with partition_pdf | 25 min | Initial element outputs saved |
| Part 2 | ⏸️ Action Points | Compare strategies and debug symptoms | 20 min | One justified strategy choice |
| Part 3 | ⏸️ Action Points | Build one embedding baseline and run similarity search | 20 min | First executable search outputs |
| Part 4 | 🗣️ Wrap-up | Document one failure + one decision | 15 min | PS2-ready notes and next steps |
📌 NOTE: Whenever you see a 🤝 TEACHING MOMENT, pause and focus on your class teacher's walkthrough.
Part 0: Opening and quality criteria (10 min)
This section is a TEACHING MOMENT
Your class teacher will recap what counts as success for this week:
- extraction output is inspected, not assumed
- strategy choice is justified by document symptoms
- one embedding baseline is generated and recorded
- one failure and one decision are documented in your repo
Part 1: Run extraction with partition_pdf (25 min)
🎯 ACTION POINTS
Step 1: Pick 2 to 3 reports
Use your chosen PS2 companies and select 2 to 3 PDF reports that pose varied layout challenges (for example, one clean narrative report and one table-heavy or scanned document).
Step 2: Run a baseline extraction
Use this as a starter:

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="data/raw/company_a_report.pdf",
    strategy="auto",
)

print(f"elements: {len(elements)}")
print(type(elements[0]).__name__)
print(elements[0].text[:200])
```

Step 3: Save simple inspection evidence
Leave a short evidence trail in your notes:
- number of elements extracted
- 3 to 5 representative element samples
- one note on what looks good and one note on what still looks wrong
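If you want to script this evidence trail, here is a minimal sketch. It assumes only that elements expose a `.text` attribute, as the objects returned by `partition_pdf` do; `FakeElement` and the `notes/extraction_evidence.json` path are illustrative stand-ins, not part of the lab requirements.

```python
import json
from pathlib import Path

class FakeElement:
    """Stand-in for unstructured's element objects (which expose .text)."""
    def __init__(self, text):
        self.text = text

def evidence_trail(elements, n_samples=3):
    """Summarise one extraction run so it can be pasted into your notes."""
    return {
        "n_elements": len(elements),
        "samples": [
            # element class name + first 200 chars, mirroring the starter prints
            {"type": type(e).__name__, "text": e.text[:200]}
            for e in elements[:n_samples]
        ],
    }

elements = [FakeElement("Annual Report 2024"),
            FakeElement("Revenue grew by 12 percent over the prior year.")]
trail = evidence_trail(elements, n_samples=2)

Path("notes").mkdir(exist_ok=True)
Path("notes/extraction_evidence.json").write_text(json.dumps(trail, indent=2))
print(trail["n_elements"])
```

With real output, pass the `elements` list from `partition_pdf` straight in; the notes about what looks good or wrong still need to be written by hand.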
Part 2: Compare strategies and debug symptoms (20 min)
🎯 ACTION POINTS
Run at least two strategies on the same report section:
`fast` vs `hi_res`, or `auto` vs `ocr_only` (if scanned)
Use a compact comparison table in your notes:
| Strategy | What improved | What got worse | Runtime note | Keep or reject |
|---|---|---|---|---|
| fast | | | | |
| hi_res | | | | |
Quick troubleshooting prompts
- Missing text entirely -> is this scanned?
- Tables unusable -> should you test `hi_res`?
- Reading order odd -> compare the same page range across strategies
- Runtime too high -> are you testing on the full corpus too early?
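A small timing harness can fill the comparison table from evidence rather than memory. This is a sketch: the commented-out calls assume `unstructured` is installed and the path exists; the demo run uses a dummy callable so the harness itself runs anywhere.

```python
import time

def compare(runs):
    """runs: dict mapping strategy name -> zero-arg callable returning elements."""
    results = {}
    for name, fn in runs.items():
        t0 = time.perf_counter()
        elements = fn()
        results[name] = {
            "n_elements": len(elements),
            "seconds": round(time.perf_counter() - t0, 2),
        }
    return results

# Real usage (assumes unstructured is installed and the path is yours):
# from unstructured.partition.pdf import partition_pdf
# results = compare({
#     "fast": lambda: partition_pdf("data/raw/report.pdf", strategy="fast"),
#     "hi_res": lambda: partition_pdf("data/raw/report.pdf", strategy="hi_res"),
# })

# Demo with a dummy strategy so the harness can be tested without a PDF:
results = compare({"demo": lambda: ["el1", "el2", "el3"]})
print(results)
```

Element counts and runtimes feed the "Runtime note" column directly; "What improved / got worse" still needs your eyes on the element text.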
Part 3: Build one embedding baseline and run basic search (20 min)
🎯 ACTION POINTS
Use the starter notebook for this part:
Before running it, copy `download/week08/.env.example` to `.env` in the same folder and adjust paths for your machine.
Feel free to use Jupyter Notebooks in your own project whenever you want to explore a new library, inspect outputs, or document technical decisions for a future reader.
For ✍️ Problem Set 2, your pipeline must still run as Python scripts.
Run your first end-to-end retrieval flow with explicit file input/output:
- write a query sentence/question
- run `parse_pdf` and do crude manual splitting
- embed both query and split units
- run similarity ranking
- save outputs to files
You are not designing chunking properly yet. For this week, manual splitting is enough to test end-to-end retrieval behaviour.
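A sketch of that flow, with a toy bag-of-words "embedding" standing in for a real sentence-transformers model. The toy scoring is only for testing the plumbing, not for PS2 quality; the output file names mirror the ones suggested below.

```python
import json
import math
from collections import Counter
from pathlib import Path

def embed(text):
    # Toy stand-in for model.encode: word-count vector (NOT for real use)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, units, k=5):
    q = embed(query)
    return sorted(units, key=lambda u: cosine(q, embed(u)), reverse=True)[:k]

# Crude manual splitting: in practice these would come from your parsed PDF
units = ["Revenue grew by 12 percent.",
         "The board met in May.",
         "Climate risk disclosures expanded."]
top = search("revenue growth", units, k=2)

# Explicit file output, using data/interim/ as the playground area
out = Path("data/interim")
out.mkdir(parents=True, exist_ok=True)
(out / "manual_units.json").write_text(json.dumps(units))
(out / "search_top5.json").write_text(json.dumps(top))
print(top[0])
```

To upgrade this, swap `embed` for a real model (for example `SentenceTransformer.encode` from sentence-transformers) and keep the rest of the flow unchanged.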
For this week, treat data/interim/ as your playground area for stage outputs while you test ideas. This mirrors a common pipeline structure where source files stay in data/raw/, temporary step outputs live in data/interim/, and stable final artefacts move to data/processed/.
📖 Reference: Cookiecutter Data Science project opinions
Record:
- model chosen and why you chose it
- input file path used for parsing
- output file paths created (`manual_units.json`, `search_top5.json`)
- one example where the top hit was relevant
- one example where the top hit was poor
Bonus suggested task (if you finish early):
Create a first script version of this notebook flow in your own PS2 repository (for example, pipeline/embed_search.py) so you can start moving from exploration to a runnable pipeline.
Part 4: Write up one failure and one decision (15 min)
🎯 ACTION POINTS
Before you leave, add a short section to your README.md or project notes with:
- Failure observed (for example: table extraction broke under `fast`)
- Decision made (for example: moved to `hi_res` for table-heavy documents)
- Next test planned (what you will validate when moving from manual splits to proper W09 chunking)
This gives you written evidence you can use in your PS2 submission.
Optional local setup troubleshooting (non-Nuvolos)
Open only if you are running locally and hitting dependency errors
All setup details are now centralised in:
That page includes Nuvolos baseline setup, local conda variants, GPU optional paths for Windows/macOS, and CI environment guidance.
Optional: GitHub Actions section for conda environment + cache
If you want CI to run this reliably, use a dedicated environment.ci.yml instead of reusing local/Nuvolos specs.
Why:
- CI should stay minimal and reproducible.
- Local machine fixes should not leak into CI.
- GitHub runners are ephemeral, so environment naming conventions matter less than in local dev.
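The referenced environment.ci.yml is not shown in this handout. A minimal sketch of what it might contain, assuming CI only needs the extraction and embedding libraries (the name and package selection here are illustrative, loosely the Nuvolos example further down minus desktop-only tools):

```yaml
# environment.ci.yml (illustrative sketch; trim to what CI actually exercises)
name: rag-ci
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - poppler
  - pip:
      - unstructured[pdf]
      - sentence-transformers
```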
```yaml
- name: Set up Miniconda
  uses: conda-incubator/setup-miniconda@v3
  with:
    environment-file: download/week08/environment.ci.yml
    auto-activate-base: true
    use-mamba: true

- name: Cache conda + pip packages
  uses: actions/cache@v4
  with:
    path: |
      ~/.conda/pkgs
      ~/.cache/pip
    key: ${{ runner.os }}-conda-${{ hashFiles('download/week08/environment.ci.yml') }}
```

This is intentionally not a full workflow file. Use W07 lecture/lab guidance to place these steps in your own workflow order.
If you want a Nuvolos-specific variant, create a separate environment-nuvolos.yml and point the appropriate setup to that file.
Example Nuvolos delta:
```yaml
# environment-nuvolos.yml
name: rag
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - poppler
  - tesseract
  - libreoffice
  - pandoc
  - pip:
      - unstructured[pdf]
      - sentence-transformers
```

Support for common blockers
🔧 Common blockers this week
- `ModuleNotFoundError` for extraction libs: verify your active environment and reinstall dependencies from your chosen file.
- Extraction runs but outputs look empty: test whether PDF pages are scanned and try `ocr_only` or `hi_res`.
- Very slow extraction: run strategy tests on a small subset first, then scale.
- Unsure which strategy is "correct": pick the one that best serves your target evidence sections and document the trade-off.
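One way to make the "are the pages scanned?" check concrete is a small heuristic over the extracted elements: if most of them carry almost no text, the pages are probably images. A sketch (the thresholds are arbitrary assumptions, and `El` stands in for real element objects exposing `.text`):

```python
class El:
    """Stand-in for unstructured element objects (which expose .text)."""
    def __init__(self, text):
        self.text = text

def looks_scanned(elements, min_chars=20, empty_ratio=0.8):
    """True if most elements have (almost) no extractable text."""
    if not elements:
        return True
    empty = sum(1 for e in elements if len((e.text or "").strip()) < min_chars)
    return empty / len(elements) >= empty_ratio

# Mostly-empty output suggests scanned pages -> try ocr_only or hi_res
print(looks_scanned([El(""), El(" "), El("ok")]))
```

If this returns True on your report, rerun the strategy comparison with `ocr_only` in the mix before concluding the document is broken.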
If you are stuck, ask your class teacher for a quick diagnosis before changing many things at once.
Appendix | Resources
Course links
- 🖥️ W08 Lecture
- 🗒️ W08 Slides
- ✍️ Problem Set 2
- 📚 Syllabus
Unstructured docs