DS205 2025-2026 Winter Term

💻 Week 10 Lab

From Retrieval to Generation

ps2
rag
generation
chromadb
cross-encoder
W10 lab: walking through the full RAG pipeline code, from CLI scripts to generation with citations.
Author

Dr Jon Cardoso-Silva

Published

30 March 2026

Modified

30 March 2026

🥅 Learning Goals

By the end of this lab, you should be able to:

  • understand the full CLI pipeline from extraction to vector storage,
  • see how different chunking strategies compare on real retrieval benchmarks,
  • understand how cross-encoder reranking improves retrieval,
  • follow a generation step from retrieved chunks to a cited answer,
  • articulate what you will try on your own PS2 data before Thursday's deadline.

In the 🖥️ Week 10 Lecture, we covered how language models work (tokenisation, transformers, encoder vs decoder), the HuggingFace text-generation pipeline, and how to add cited answers to a RAG system. Today Barry walks you through the code that implements all of this.

If retrieval does not deliver good chunks, there is little you can tweak in the generation step to get better results. The quality of your curated dataset and of your chunking strategy matters most. Today is about seeing how the pieces fit together so you can make informed decisions for your ✍️ Problem Set 2 submission on Thursday of next week.

📍 Session Details

  • Date: Tuesday, 24 March 2026
  • Time: Check your timetable for your class slot
  • Duration: 90 minutes

📋 Preparation

  • Attend or watch the 🖥️ W10 Lecture.
  • Have your PS2 repository open with your ChromaDB collection from W09.
  • Make sure you can run notebooks in the rag conda environment.

🛣️ Lab Roadmap

How the W10 lab is structured
| Part | Activity Type | Focus | Time | Outcome |
|------|---------------|-------|------|---------|
| Part 0 | 👀 Teaching Moment | Catchup and conda cleanup | 10 min | Everyone on the same page, disk space freed |
| Part 1 | 👀 Teaching Moment | Talk the Code: CLI scripts and NB00 | 25 min | You understand extract → chunk → embed → benchmark |
| Part 2 | 👀 Teaching Moment | Talk the Code: NB01 generation | 15 min | You understand retrieve → prompt → generate → cite |
| Part 3 | 🗣️ Classroom Discussion | What will you try? | 15 min | Each student has a plan for their PS2 pipeline |
| Part 4 | 👀 Teaching Moment | Submission checklist | 10 min | You know exactly what to submit by 2 April |
| Part 5 | 🎯 Action Points | Work on PS2 | 15 min | Progress on your own pipeline |

Part 0: Catchup and cleanup (10 min)

This section is a TEACHING MOMENT

Barry will check in with the class: where are people at with PS2? What blockers did you hit since W09? This is the last lab before the deadline, so now is the time to surface any issues.

🧹 Tip: Clean up old conda environments to save disk space

On Nuvolos (or your own machine), you may have accumulated conda environments from earlier in the course. The food environment from PS1 is no longer needed. Removing it frees several hundred MB of disk space.

Check what you have:

conda env list

You should see something like:

# conda environments:
#
base                  *  /opt/conda
food                     /opt/conda/envs/food
rag                      /opt/conda/envs/rag

Remove the ones you no longer need:

conda env remove -n food

Keep using the rag environment for PS2. If you need to add new packages (e.g. langchain-text-splitters), add them to your environment.yml and run:

conda env update -f environment.yml --prune

The --prune flag removes packages that are no longer listed in the file.
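
For reference, a minimal environment.yml might look like the sketch below. The package list is illustrative, not your actual file; the point is that a new package such as langchain-text-splitters gets added to the file before you run the update command.

# Illustrative environment.yml sketch, not the course's actual file
name: rag
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - pip:
      - chromadb
      - sentence-transformers
      - langchain-text-splitters   # newly added package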

Part 1: Talk the Code — CLI scripts and NB00 (25 min)

This section is a TEACHING MOMENT

Barry walks through Jon's CLI pipeline scripts and the benchmark notebook. The goal is not to run these live (extraction takes 20 minutes) but to read the code together and understand what each piece does.

Barry will show you three scripts and one notebook:

  • 00_extract.py runs partition_pdf on your chosen PDF and saves the extracted elements as a .pkl (pickle) file. Pickle is Python's way of serialising any object to a binary file so you can reload it in seconds instead of re-running a 20-minute extraction. The 🖥️ W10 Lecture slides have a section on pickle if you want the details. The key point: run extraction once, save the result, and never wait for it again. A minimal sketch of this pattern follows this list.

  • 01_chunk_strategies.py reads the pickle file and applies up to six different chunking strategies, saving each as a CSV. Barry will show the strategy implementations briefly. The point is not to memorise them but to see that chunking is an empirical decision: no single strategy dominates, and the right choice depends on your specific PDFs (a toy chunker is sketched after this list). You can run this with:

    conda run --no-capture-output -n rag python 01_chunk_strategies.py --strategy all
  • 02_vector_store.py reads the chunk CSVs, embeds them with the Q&A MiniLM model, and stores them in ChromaDB. One collection per strategy. Barry may run this one live so you can see the rich output table:

    conda run --no-capture-output -n rag python 02_vector_store.py --all
  • NB00 (Benchmark notebook) loads all six collections and benchmarks Recall@5 across three queries, with and without cross-encoder reranking (the metric itself is sketched at the end of this part). Barry will show the key results:

    • The mean recall table showing which strategy won (spoiler: char_limit with reranking, despite being the simplest strategy).
    • The "does reranking help?" comparison showing that the cross-encoder improved recall for every strategy.
    • The side-by-side rank comparison showing where the ground truth chunks actually landed in each strategy's ranking.
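
To make the pickle pattern concrete, here is a hedged sketch of what 00_extract.py roughly does; the file paths and options are assumptions, not the real script.

# Sketch of the extract-once-then-pickle pattern (illustrative only)
import pickle
from pathlib import Path

from unstructured.partition.pdf import partition_pdf

PDF_PATH = Path("data/raw/report.pdf")       # assumed location
PKL_PATH = Path("data/interim/report.pkl")   # assumed location

if PKL_PATH.exists():
    # Reload in seconds instead of re-running a 20-minute extraction
    elements = pickle.loads(PKL_PATH.read_bytes())
else:
    elements = partition_pdf(filename=str(PDF_PATH))  # the slow step
    PKL_PATH.parent.mkdir(parents=True, exist_ok=True)
    PKL_PATH.write_bytes(pickle.dumps(elements))

print(f"{len(elements)} elements available")

And a toy version of the simplest kind of strategy, a plain character limit. This is an assumption about the shape of the real implementations, which also handle overlap and metadata.

def char_limit_chunks(text: str, limit: int = 1000) -> list[str]:
    """Naive char_limit chunking: slice the text every `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]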

The takeaway: try the simple approach first, measure it, and only add complexity if the numbers justify it.
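
If you want to reproduce the benchmark idea on your own data, the metric is small enough to write by hand. A minimal sketch, with made-up chunk IDs and ground truth:

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the ground-truth chunks that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

# Example: 2 of the 3 ground-truth chunks landed in the top 5
retrieved = ["c12", "c03", "c44", "c07", "c31", "c09"]
relevant = {"c03", "c07", "c99"}
print(recall_at_k(retrieved, relevant))  # prints 0.666...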

Part 2: Talk the Code — NB01 generation (15 min)

This section is a TEACHING MOMENT

Barry walks through the generation notebook up to and including the v1 (minimal) prompt. The later sections (stricter prompt, self-critique, extractive-style prompting) are for you to explore on your own.

Barry will show you the key steps in NB01:

  • Loading the winning collection from NB00's benchmark results. The code opens the persisted ChromaDB and goes straight to retrieval.

  • Two-stage retrieval. The embedding model alone first retrieves a broad pool of 50 candidates; the cross-encoder then rescores them and keeps the top 10. The 🖥️ W10 Lecture slides include some hidden slides on this if you need a refresher. A minimal sketch of the two-stage pattern follows this list.

  • Token budget. Before building the prompt, the code checks how many chunks fit inside TinyLlama's 2048-token context window after accounting for the system message, the question, and the reserved output tokens.

  • Building the prompt with numbered sources. Each chunk gets a [Source N] label with its chunk ID, filename, and page number. The model can cite these in its answer.

  • Applying the chat template. TinyLlama expects <|system|>, <|user|>, <|assistant|> tokens. The code uses apply_chat_template to format the prompt correctly; the generation sketch at the end of this part shows this step together with the token-budget check.

  • Running the model and reading the output. The v1 prompt produces an answer. Barry will show how the answer is cleaned (trimming runaway generation) and how the source list is appended.
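
For reference, a minimal sketch of the two-stage retrieve-then-rerank pattern, assuming a persisted ChromaDB store and the ms-marco cross-encoder from sentence-transformers; the path, collection name, and query below are assumptions.

import chromadb
from sentence_transformers import CrossEncoder

client = chromadb.PersistentClient(path="chroma_db")   # assumed path
# Pass the same embedding_function used at creation time if it was not the default
collection = client.get_collection("char_limit")       # assumed collection name

# Stage 1: the embedding model retrieves a broad pool of candidates
query = "What does the report say about 2023 revenue?"
pool = collection.query(query_texts=[query], n_results=50)
docs, ids = pool["documents"][0], pool["ids"][0]

# Stage 2: the cross-encoder rescores each (query, chunk) pair jointly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in docs])

# Keep the top 10 after reranking
top10 = sorted(zip(scores, ids, docs), key=lambda t: t[0], reverse=True)[:10]
for score, chunk_id, doc in top10[:3]:
    print(f"{score:.2f}  {chunk_id}  {doc[:60]}...")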

After this walkthrough, Barry will briefly preview what the rest of the notebook contains (v2 stricter prompt, self-critique loop, extractive-style prompting) so you know what to try on your own.
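
And a hedged sketch of the budget → prompt → template → generate steps. The prompt wording, budget arithmetic, and placeholder chunks below are assumptions; NB01's actual code is more careful.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Placeholder inputs: in NB01 these come from the reranked retrieval step
query = "What does the report say about 2023 revenue?"
reranked = [("chunk_012", "report.pdf", 4, "Example chunk text about 2023 revenue."),
            ("chunk_044", "report.pdf", 17, "Another example chunk.")]

# Token budget: keep chunks only while they fit in the 2048-token window,
# leaving room for the question and up to 256 generated tokens
budget = 2048 - 256 - len(tokenizer.encode(query))
sources = []
for n, (chunk_id, fname, page, text) in enumerate(reranked, start=1):
    block = f"[Source {n}] {chunk_id} ({fname}, p.{page})\n{text}"
    budget -= len(tokenizer.encode(block))
    if budget < 0:
        break
    sources.append(block)

messages = [
    {"role": "system", "content": "Answer using only the numbered sources. Cite them as [Source N]."},
    {"role": "user", "content": "\n\n".join(sources) + f"\n\nQuestion: {query}"},
]

# apply_chat_template adds TinyLlama's <|system|>/<|user|>/<|assistant|> markers
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=256)
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)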

Part 3: What will you try? (15 min)

This section is a CLASSROOM DISCUSSION

Barry leads a structured discussion. Each student (or pair) answers three questions:

🗣️ Discussion prompt

Complete these three sentences:

  1. My PDFs are from [company/sector] and the hardest thing about them is [tables / scanned pages / multilingual content / long narrative sections / other].

  2. For chunking, I plan to try [strategy] because [reason based on what I saw in NB00 or what I know about my documents].

  3. My expected facts for the main driving question are [list 2-3 specific numbers, percentages, or years you will check the answer against].

Barry might use a Mentimeter poll or a show of hands to collect patterns from the discussion. Do most people have the same problem (tables)? Are people converging on the same strategy? Did anyone try something different that worked?

Part 4: Submission checklist (10 min)

This section is a TEACHING MOMENT

Barry walks through the submission requirements. PS2 is due Thursday 2 April, 8pm. For each item, show of hands: who has this done?

| Component | What to check |
|-----------|---------------|
| Pipeline runs end-to-end | python pipeline.py run-all produces output |
| Generation step added | Retrieved chunks go into a language model, answer comes out |
| Citations attached | Each answer shows source chunk ID, PDF filename, page number |
| Evaluation documented | Recall@5 and fact checks for each driving question |
| Diagnosis table | Where did it fail? Retrieval or generation? What did you try? |
| README.md | What it does, how to set up, how to run |
| CONTRIBUTING.md | How the pipeline works internally, known issues |

If you are missing components, prioritise them in the working time below. A pipeline that runs end-to-end with honest evaluation beats a pipeline with fancy features that does not run.

Part 5: Work on PS2 (15 min)

🎯 ACTION POINTS

Use the remaining time to work on your PS2 pipeline. Barry circulates.

Before you leave, write down (for yourself):

  • Which chunking strategy you are using and why.
  • Which generation model you are using (TinyLlama on Nuvolos, or something larger locally).
  • Your three expected facts for your main driving question.
  • The one thing you need to finish before Thursday.

Appendix | Resources