DS205 2025-2026 Winter Term

🖥️ Week 09 Lecture

Chunking, Embeddings, and Retrieval

Author

Dr Jon Cardoso-Silva

Published

16 March 2026

🥅 Learning Goals

By the end of this lecture, you should be able to: (i) compare two chunking strategies; (ii) run a regex baseline for detecting target language; (iii) compare Word2Vec, similarity MiniLM, and Q&A MiniLM retrieval; and (iv) interpret Recall@5 results and choose a model for PS2.

📍 Logistics

📍 Location: Monday, 16 March 2026, 4-6 pm at SAL.G.03

This page is slides-first. Use the deck below during and after class.

📋 Preparation

  • You attended the 🖥️ W08 Lecture and 💻 W08 Lab.
  • You have a first extraction workflow running for your PS2 PDFs.
  • You can run the W09 notebooks in the rag environment.

Tell the LSE about your experience in this course!

Could we ask a small but important favour? The LSE runs a course survey every term, and your feedback genuinely shapes how this module is taught next year. It takes about 3 minutes.

💡 Note: Please assess all the instructors you have interacted with
(Jon counts as a teacher too!).

Last updated: 14 March 2026

🗣️ What we will cover in this lecture

  • Chunking strategy A vs B and why boundaries change retrieval outcomes.
  • Regex baseline and manual reference IDs for target statements.
  • Word2Vec vs sentence-transformer vs Q&A retrieval comparison.
  • Recall@5 interpretation and model-selection decisions for PS2.
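
As a warm-up for the Recall@5 discussion, here is a minimal sketch of how the metric could be computed for one query and averaged over several. It assumes Recall@k means "the fraction of manually labelled relevant chunk IDs that appear in the top k retrieved results"; the lecture notebook may use a different variant (for example, a hit rate that only checks whether any relevant chunk appears). The function names and chunk IDs below are hypothetical and are not taken from the course materials.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k=5):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)


# Hypothetical chunk IDs, invented for illustration only.
ranked = ["c07", "c12", "c03", "c44", "c09", "c21"]  # retriever output, best first
relevant = ["c03", "c21"]                            # manual reference IDs

score = recall_at_k(ranked, relevant, k=5)
print(score)  # 0.5: c03 is in the top 5, c21 is ranked sixth
```

Under this definition, a model that ranks a relevant chunk sixth gets no credit for it at k=5, so small changes in ranking near the cut-off can move the score noticeably when each query has only a few reference chunks.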

📓 Lecture Materials

🎬 Facilitation Slides

Use keyboard arrows to navigate. You can also open the deck in fullscreen.

📥 Lecture Notebook

🔖 Appendix

Useful links