💻 Week 09 Lab
Retrieval Decisions for PS2
By the end of this lab, you should be able to: i) store chunk embeddings in ChromaDB with metadata, ii) choose and test a chunking strategy on your own PS2 PDFs, iii) choose and test an embedding model, iv) build a small reference set and compute Recall@5 as your first benchmark.
In the 🖥️ Week 09 Lecture, we compared chunking strategies, Word2Vec, and two sentence-transformer models on the Ajinomoto PDF. The best configuration reached 50% Recall@5. Today you make the same decisions on your own PS2 data.
The generation step we will cover in 🖥️ W10 has its own challenges, but it is more straightforward: you feed retrieved chunks to a language model and format the output. Chunking + embedding + retrieval is the core of your ✍️ Problem Set 2. The hard work is getting the right chunks back in the first place, and an initial baseline helps you assess whether your choices are actually improving the pipeline.
📅 Session Details
- Date: Tuesday, 17 March 2026
- Time: Check your timetable for your class slot
- Duration: 90 minutes
📚 Preparation
- Attend or watch the 🖥️ W09 Lecture.
- Open your PS2 repository and confirm your PDF extraction code from W08 runs.
- Open the lab notebook from Nuvolos shared files (/files/week09/) or download it below.
🗣️ Lab Roadmap
| Part | Activity Type | Focus | Time | Outcome |
|---|---|---|---|---|
| Part 0 | 🎤 Teaching Moment | Recap and framing | 10 min | Shared understanding of today's goal |
| Part 1 | 🎤 Teaching Moment + 🎯 Action Points | ChromaDB: add, query, inspect | 30 min | Worked example runs, you understand the three operations |
| Part 2 | π― Action Points | Work on Problem Set 2 | 50 min | Chunks stored, queries tested, Recall@5 recorded |
Part 0: Barry's opening (10 min)
This section is a TEACHING MOMENT
Barry will recap the lecture results: two chunking strategies, three retrieval methods, best Recall@5 of 50%. The question for today: can you do better on your own documents by making deliberate choices about chunking and embedding?
The lecture gave you a worked example. The lab is where you make it your own.
Part 1: ChromaDB (30 min)
🎯 ACTION POINTS
ChromaDB is a vector database. It stores embeddings alongside metadata (company name, page number, source file) and lets you search by similarity with filters. In the 🖥️ W09 Lecture notebook, embeddings lived in NumPy arrays, which disappear the moment the kernel shuts down. A vector database persists them to disk, so you can store embeddings once and reuse them across sessions.
The NB02 notebook has a worked Ajinomoto example covering the three things you need to know:
- Add: store chunks with embeddings and metadata in a collection.
- Query: send a query vector and get back the nearest chunks, optionally filtered by metadata.
- Inspect: look at what came back (chunk IDs, distances, pages, text previews).
Run the worked example cells in the notebook. Read the output and make sure you understand what each cell does. If something breaks, fix it before moving to your own data.
The notebook also includes a token budget check: when you retrieve k=5 chunks and paste them into a prompt for a language model (W10), you need to know whether they fit in the context window. The worked example shows you how to measure this.
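One way to approximate that check is a rough character-based estimate (~4 characters per token for English text). This is a hedged sketch, not the notebook's exact method; the window and reserve sizes below are placeholder values:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return int(len(text) / chars_per_token + 0.5)

def fits_context(chunks, prompt_template: str, context_window: int = 8192,
                 reserve_for_answer: int = 512) -> bool:
    """Check whether k retrieved chunks plus the prompt fit the window,
    leaving room for the model's answer."""
    total = estimate_tokens(prompt_template) + sum(estimate_tokens(c) for c in chunks)
    return total + reserve_for_answer <= context_window

chunks = ["chunk text " * 100] * 5  # five ~1,100-character chunks
print(fits_context(chunks, "Answer using the context below:\n"))  # → True
```

A proper tokenizer (e.g. the one shipped with your chosen model) gives exact counts; the heuristic is just enough to flag obvious overflows before W10.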
For anything beyond what the notebook covers, the ChromaDB getting started guide and querying docs are the references.
Part 2: Work on Problem Set 2 (50 min)
👥 FEEL FREE TO WORK WITH OTHERS:
You can work solo or in pairs when establishing a good baseline/ground truth for evaluation, especially if you find a way to complement each other: for example, use the same companies but compare different chunking+embedding approaches, or the reverse, different companies but the same chunking+embedding strategy to see how it transfers.
Finding a combination of chunking and embedding that works well for these documents is a real challenge. We don't even have model solutions! So working with others won't be treated as plagiarism, as long as you acknowledge in your write-up who you worked with and how you tested things together.
🎯 ACTION POINTS
The second half of the notebook has stub cells for each step. Use the rest of the lab to work through them on your own PS2 PDFs.
Here is what you should aim for by the end of the session. Not all of it will be finished today, and that is fine. The point is to have a clear starting position and know what to change next.
Choose a chunking strategy. The lecture showed char-limit packing (Strategy A) and heading-delimited sections (Strategy B). Neither dominated. Your documents may behave differently. You can use the functions from utils.py, or go bolder: try partition_pdf with strategy="hi_res" and infer_table_structure=True for better table extraction, or try RecursiveCharacterTextSplitter from langchain-text-splitters for recursive splitting with overlap. If you add new packages, update your environment.yml.
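As a reference point, char-limit packing (Strategy A) can be sketched like this. This is an illustrative version, not the utils.py implementation, and it assumes paragraphs have already been extracted from the PDF:

```python
def pack_chunks(paragraphs, char_limit=1000):
    """Greedily pack consecutive paragraphs into chunks of up to char_limit chars.

    A paragraph longer than char_limit becomes its own oversized chunk
    rather than being split, to keep the sketch simple.
    """
    chunks, current = [], ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= char_limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
        # Flush oversized single paragraphs immediately.
        if len(current) > char_limit:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks

paras = ["A" * 400, "B" * 400, "C" * 400]
print([len(c) for c in pack_chunks(paras, char_limit=1000)])  # → [802, 400]
```

Whatever strategy you pick, keep the chunker as a single function with the limit as a parameter, so swapping strategies later only changes one call site.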
Choose an embedding model. Will you stick with the Q&A model from the lecture? Try the similarity model and compare? Venture further using the MTEB leaderboard? Filter by "Retrieval" and keep model size under ~100M parameters for Nuvolos. Change one variable at a time so you can tell what made the difference.
Build a reference set and compute Recall@5. You read your PDFs in W08. You know roughly where the emissions targets, activity data, and company information live. Write 3 to 5 (query, expected chunk ID) pairs. Run retrieval against them. Record the number. Low recall with a clear explanation of what you tried is better than high recall you cannot reproduce.
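The metric itself is a few lines. Below is a minimal sketch with a fake in-memory retriever standing in for your real one (which would call collection.query); the queries and chunk IDs are made up:

```python
def recall_at_k(reference, retrieve, k=5):
    """Fraction of reference queries whose expected chunk appears in the top k.

    reference: list of (query, expected_chunk_id) pairs
    retrieve:  function mapping a query to a ranked list of chunk IDs
    """
    hits = sum(expected in retrieve(q)[:k] for q, expected in reference)
    return hits / len(reference)

# Toy stand-in for a real retriever (yours would embed the query and
# search the ChromaDB collection).
fake_index = {
    "net zero target year": ["c7", "c2", "c9", "c1", "c4"],
    "scope 1 emissions 2023": ["c3", "c7", "c5", "c8", "c2"],
    "CEO name": ["c6", "c1", "c2", "c0", "c9"],
}
reference = [
    ("net zero target year", "c2"),
    ("scope 1 emissions 2023", "c5"),
    ("CEO name", "c4"),  # not in the top 5 -> a miss
]
print(recall_at_k(reference, lambda q: fake_index[q], k=5))  # → 0.666...
```

Keeping the reference set as plain (query, expected ID) pairs in your repo means you can rerun the same benchmark after every chunking or embedding change.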
Record your decisions. Make sure to keep a record of what you tried and what the results were. This will be important for your write-up and for your next steps in W10.
Before you leave (last 5 minutes)
Confirm you can state: your chosen chunking strategy (and why), your chosen embedding model (and why), your current Recall@5, and your next adjustment before W10.
Appendix | Resources
Course links
- 🖥️ W09 Lecture
- ✍️ Problem Set 2
- 📖 Syllabus
Embeddings and models