💻 Week 09 Lab
Retrieval Decisions for PS2
By the end of this lab, you should be able to: i) store chunk embeddings in ChromaDB with metadata, ii) choose and test a chunking strategy on your own PS2 PDFs, iii) choose and test an embedding model, iv) build a small reference set and compute Recall@5 as your first benchmark.
In the 🖥️ Week 09 Lecture, we compared chunking strategies, Word2Vec, and two sentence-transformer models on the Ajinomoto PDF. The best configuration reached 50% Recall@5. Today you make the same decisions on your own PS2 data.
The generation step we will cover in 🖥️ W10 has its own challenges, but it is more straightforward: you feed retrieved chunks to a language model and format the output. Chunking + embedding + retrieval is the core of your ✍️ Problem Set 2. The hard work is getting the right chunks back in the first place, and an initial baseline helps you assess whether your choices are actually improving the pipeline.
📅 Session Details
- Date: Tuesday, 17 March 2026
- Time: Check your timetable for your class slot
- Duration: 90 minutes
📚 Preparation
- Attend or watch the 🖥️ W09 Lecture.
- Open your PS2 repository and confirm your PDF extraction code from W08 runs.
- Open the lab notebook from Nuvolos shared files (/files/week09/) or download it below.
🗣️ Lab Roadmap
| Part | Activity Type | Focus | Time | Outcome |
|---|---|---|---|---|
| Part 0 | 🎤 Teaching Moment | Recap and framing | 10 min | Shared understanding of today's goal |
| Part 1 | 🎤 Teaching Moment + 🎯 Action Points | ChromaDB: add, query, inspect | 30 min | Worked example runs, you understand the three operations |
| Part 2 | π― Action Points | Work on Problem Set 2 | 50 min | Chunks stored, queries tested, Recall@5 recorded |
Part 0: Barry's opening (10 min)
This section is a TEACHING MOMENT
Barry will recap the lecture results: two chunking strategies, three retrieval methods, best Recall@5 of 50%. The question for today: can you do better on your own documents by making deliberate choices about chunking and embedding?
The lecture gave you a worked example. The lab is where you make it your own.
Part 1: ChromaDB (30 min)
🎯 ACTION POINTS
ChromaDB is a vector database. It stores embeddings alongside metadata (company name, page number, source file) and lets you search by similarity with filters. In the 🖥️ W09 Lecture notebook, embeddings lived in NumPy arrays, which disappear the moment the kernel shuts down. A vector database persists them to disk, so you can store embeddings once and reuse them across sessions.
The NB02 notebook has a worked Ajinomoto example covering the three things you need to know:
- Add: store chunks with embeddings and metadata in a collection.
- Query: send a query vector and get back the nearest chunks, optionally filtered by metadata.
- Inspect: look at what came back (chunk IDs, distances, pages, text previews).
Run the worked example cells in the notebook. Read the output and make sure you understand what each cell does. If something breaks, fix it before moving to your own data.
The notebook also includes a token budget check: when you retrieve k=5 chunks and paste them into a prompt for a language model (W10), you need to know whether they fit in the context window. The worked example shows you how to measure this.
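One way to approximate that check is a rough character-based estimate (~4 characters per token for English text). This is a hedged sketch, not the notebook's exact method; the window and reserve sizes below are placeholder values:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return int(len(text) / chars_per_token + 0.5)

def fits_context(chunks, prompt_template: str, context_window: int = 8192,
                 reserve_for_answer: int = 512) -> bool:
    """Check whether k retrieved chunks plus the prompt fit the window,
    leaving room for the model's answer."""
    total = estimate_tokens(prompt_template) + sum(estimate_tokens(c) for c in chunks)
    return total + reserve_for_answer <= context_window

chunks = ["chunk text " * 100] * 5  # five ~1,100-character chunks
print(fits_context(chunks, "Answer using the context below:\n"))  # → True
```

A proper tokenizer (e.g. the one shipped with your chosen model) gives exact counts; the heuristic is just enough to flag obvious overflows before W10.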
For anything beyond what the notebook covers, the ChromaDB getting started guide and querying docs are the references.
Part 2: Work on Problem Set 2 (50 min)
👥 FEEL FREE TO WORK WITH OTHERS:
You can work solo or in pairs when establishing a good baseline/ground truth for evaluation, especially if you find a way to complement each other: for example, use the same companies but compare different chunking+embedding approaches, or the reverse, different companies but the same chunking+embedding strategy to see how it transfers.
Finding a combination of chunking and embedding that works well for these documents is a real challenge. We don't even have model solutions! So working with others won't be treated as plagiarism, as long as you acknowledge in your write-up who you worked with and how you tested things together.
🎯 ACTION POINTS
The second half of the notebook has stub cells for each step. Use the rest of the lab to work through them on your own PS2 PDFs.
Here is what you should aim for by the end of the session. Not all of it will be finished today, and that is fine. The point is to have a clear starting position and know what to change next.
Choose a chunking strategy. The lecture showed char-limit packing (Strategy A) and heading-delimited sections (Strategy B). Neither dominated. Your documents may behave differently. You can use the functions from utils.py, or go bolder: try partition_pdf with strategy="hi_res" and infer_table_structure=True for better table extraction, or try RecursiveCharacterTextSplitter from langchain-text-splitters for recursive splitting with overlap. If you add new packages, update your environment.yml.
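As a reference point, char-limit packing (Strategy A) can be sketched like this. This is an illustrative version, not the utils.py implementation, and it assumes paragraphs have already been extracted from the PDF:

```python
def pack_chunks(paragraphs, char_limit=1000):
    """Greedily pack consecutive paragraphs into chunks of up to char_limit chars.

    A paragraph longer than char_limit becomes its own oversized chunk
    rather than being split, to keep the sketch simple.
    """
    chunks, current = [], ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= char_limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
        # Flush oversized single paragraphs immediately.
        if len(current) > char_limit:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks

paras = ["A" * 400, "B" * 400, "C" * 400]
print([len(c) for c in pack_chunks(paras, char_limit=1000)])  # → [802, 400]
```

Whatever strategy you pick, keep the chunker as a single function with the limit as a parameter, so swapping strategies later only changes one call site.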
Choose an embedding model. Will you stick with the Q&A model from the lecture? Try the similarity model and compare? Venture further using the MTEB leaderboard? Filter by "Retrieval" and keep model size under ~100M parameters for Nuvolos. Change one variable at a time so you can tell what made the difference.
Build a reference set and compute Recall@5. You read your PDFs in W08. You know roughly where the emissions targets, activity data, and company information live. Write 3 to 5 (query, expected chunk ID) pairs. Run retrieval against them. Record the number. Low recall with a clear explanation of what you tried is better than high recall you cannot reproduce.
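The metric itself is a few lines. Below is a minimal sketch with a fake in-memory retriever standing in for your real one (which would call collection.query); the queries and chunk IDs are made up:

```python
def recall_at_k(reference, retrieve, k=5):
    """Fraction of reference queries whose expected chunk appears in the top k.

    reference: list of (query, expected_chunk_id) pairs
    retrieve:  function mapping a query to a ranked list of chunk IDs
    """
    hits = sum(expected in retrieve(q)[:k] for q, expected in reference)
    return hits / len(reference)

# Toy stand-in for a real retriever (yours would embed the query and
# search the ChromaDB collection).
fake_index = {
    "net zero target year": ["c7", "c2", "c9", "c1", "c4"],
    "scope 1 emissions 2023": ["c3", "c7", "c5", "c8", "c2"],
    "CEO name": ["c6", "c1", "c2", "c0", "c9"],
}
reference = [
    ("net zero target year", "c2"),
    ("scope 1 emissions 2023", "c5"),
    ("CEO name", "c4"),  # not in the top 5 -> a miss
]
print(recall_at_k(reference, lambda q: fake_index[q], k=5))  # → 0.666...
```

Keeping the reference set as plain (query, expected ID) pairs in your repo means you can rerun the same benchmark after every chunking or embedding change.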
Record your decisions. Make sure to keep a record of what you tried and what the results were. This will be important for your write-up and for your next steps in W10.
Before you leave (last 5 minutes)
Confirm you can state: your chosen chunking strategy (and why), your chosen embedding model (and why), your current Recall@5, and your next adjustment before W10.
Appendix | Resources
Course links
- 🖥️ W09 Lecture
- ✍️ Problem Set 2
- 📖 Syllabus
Embeddings and models