💻 Week 07 Lab
Data Quality Experimentation with Word Embeddings

Last Updated: 3 March 2025, 20:00
📍 Time and Location: Tuesday, 4 March 2025. Check your timetable for the precise time and location of your class.
🗣️ Lab Roadmap
Note to instructor: Welcome students to the lab. Reinforce the essence of today's lab: to explore, hands-on, the impact of decisions we make about data pre-processing and model parameters.
In this lab, we'll explore how different text extraction methods affect word embedding quality. We'll use two versions of the same corpus of Nationally Determined Contributions (NDC) documents to compare embedding models.
Throughout this lab, keep this question in mind:
"How does the quality of a) the text extraction, b) the pre-processing, and c) the model parameters affect the representation of words in the embedding space?"
Part I: Setup and Data Preparation (15 min)
No need to wait for your class teacher. You can start working on this part on your own as soon as you arrive.
Note to instructor: Make sure students have successfully extracted the zip file and can access the notebook. Some students might struggle with setting up the virtual environment, especially on Windows. Be prepared to troubleshoot common issues.
🎯 ACTION POINTS
Download the lab materials.
Click HERE to download the zip file with the files we will use in this lab.
Download the DS205-Week07-Files.zip file and extract it to a location of your choice (e.g., ~/DS205-Week07/). You will end up with a folder called DS205-Week07 with the following contents:
DS205-Week07/
├── data/
│   ├── ndc-docs-lazy/      # Text version of the NDC PDFs (using a lazy approach)
│   ├── ndc-docs-robust/    # Text version of the NDC PDFs (using a more robust approach)
│   └── ndc-pdfs/           # The original NDC PDFs
├── DS205-Week07-Lecture.ipynb
├── DS205-Week07-Lab.ipynb
└── requirements.txt
(Optional but recommended) Dedicated Python Environment.
Before you start working on the lab, we recommend you create a dedicated Python environment and install the requirements:
cd path/to/DS205-Week07
python -m venv embedding-env
source embedding-env/bin/activate  # On Windows: embedding-env\Scripts\activate
pip install -r requirements.txt
Open the notebook in your preferred IDE (e.g., VSCode, JupyterLab, etc.) and make sure to select the kernel that matches the one in the environment (embedding-env).
Part II: Loading and Exploring the Datasets (15 min)
Note to instructor: Ask students to work on Part II of the notebook during this period. You can either ask them to go through it alone as you walk around the room, or you can make this a 🗣️ TEACHING MOMENT where you guide them through the code. Don't change the parameters yet; we will reserve Part IV for this type of experimentation.
🎯 ACTION POINTS
Run the cells of Part II of the notebook. There you will find the code to load the datasets and explore their contents.
🤨 Think about it: Why might language detection be important when working with international policy documents? What challenges might arise from multilingual text in our word embeddings?
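If you want to poke at the files outside the notebook, here is a minimal sketch of how you might load the two text corpora and run a quick language check. It assumes the extracted documents are plain .txt files and that the langdetect package is available; adjust the paths and filenames to match your setup.

from pathlib import Path

from langdetect import detect  # assumed available; pip install langdetect

def load_corpus(folder):
    """Read every .txt file in a folder into a {filename: text} dictionary."""
    return {p.name: p.read_text(encoding="utf-8", errors="ignore")
            for p in sorted(Path(folder).glob("*.txt"))}

lazy = load_corpus("data/ndc-docs-lazy")      # text from the lazy extraction
robust = load_corpus("data/ndc-docs-robust")  # text from the more robust extraction

# NDCs are submitted in several languages, so check what each file actually contains
for name, text in robust.items():
    try:
        lang = detect(text[:5000])  # a sample of each document is enough
    except Exception:
        lang = "unknown"
    print(f"{name}: {lang}")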
Part III: Training and Evaluating Word Embeddings (15 min)
Note to instructor: Guide students through Part III of the notebook and add any demos of your own if you like.
🗣️ TEACHING MOMENT
Your class teacher will go through the Word2Vec model parameters (Part III of the notebook) and how they affect the resulting word embeddings.
This is a great opportunity to ask questions if you don't fully understand how Word2Vec works.
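As a point of reference while you follow along, a minimal gensim Word2Vec call looks roughly like the sketch below. The notebook contains the actual code; the parameter values and file paths here are illustrative placeholders, not the notebook's settings.

from pathlib import Path

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Tokenise each document into a list of lowercase word tokens
docs = [p.read_text(encoding="utf-8", errors="ignore")
        for p in Path("data/ndc-docs-robust").glob("*.txt")]
sentences = [simple_preprocess(text) for text in docs]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # dimensionality of the embedding vectors
    window=5,          # context window on either side of the target word
    min_count=5,       # ignore words that appear fewer than 5 times
    sg=1,              # 1 = skip-gram, 0 = CBOW
    epochs=10,         # passes over the corpus
    workers=4,
)

# Inspect a word's neighbourhood to get a feel for the embedding space
if "emissions" in model.wv.key_to_index:
    print(model.wv.most_similar("emissions", topn=5))

Changing vector_size, window, min_count, or sg is exactly the kind of parameter experimentation Part IV asks for.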
Part IV: Open Exploration (40 min)
Note to instructor: Let students work on this part on their own, but remember to anchor the discussion at the end. Since this is a very open-ended task, it's important to share how everyone explored the models and what they found.
Now it's your turn to explore the models and draw your own conclusions.
🎯 ACTION POINTS:
Alternate between these two tasks:
- Keep the pre-processing (Section 2.3 in the notebook) the same and try different model parameters (Section 3.1 in the notebook).
- Keep the model parameters (Section 3.1 in the notebook) the same and try different pre-processing (Section 2.3 in the notebook) — one example of a pre-processing variation is sketched right after this list.
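As one concrete example of a pre-processing variation (beyond what Section 2.3 already does), you could merge frequent multi-word phrases into single tokens before training. A minimal gensim sketch, assuming sentences is your list of tokenised documents from Section 2.3:

from gensim.models.phrases import Phrases, Phraser

# `sentences` is assumed to be your list of token lists from Section 2.3,
# e.g. sentences = [simple_preprocess(text) for text in docs]
# Learn frequent bigrams so that, say, ["climate", "change"] becomes ["climate_change"]
bigram_model = Phraser(Phrases(sentences, min_count=10, threshold=10.0))
sentences_with_phrases = [bigram_model[tokens] for tokens in sentences]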
💡 ADVICE:
- Test only one thing at a time (the sketch after this list shows one way to compare two models on the same probe words).
- Re-run all relevant cells every time you change the pre-processing or the model parameters.
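To keep those one-at-a-time comparisons honest, it helps to look at the same probe words under two models that differ in a single setting. A minimal sketch; the probe words and model names below are placeholders, not part of the notebook:

def compare_neighbours(model_a, model_b, words, topn=5):
    """Print the nearest neighbours of the same probe words under two models."""
    for word in words:
        print(f"\n== {word} ==")
        for label, model in (("A", model_a), ("B", model_b)):
            if word in model.wv.key_to_index:
                neighbours = [w for w, _ in model.wv.most_similar(word, topn=topn)]
                print(f"  model {label}: {neighbours}")
            else:
                print(f"  model {label}: (not in vocabulary)")

# Example: two models that differ only in window size (names are placeholders)
# compare_neighbours(model_window2, model_window10, ["emissions", "adaptation", "finance"])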
🗣️ CLASSROOM DISCUSSION:
Let's gather everyone's findings and discuss:
- What were the most significant differences you found between the models?
- Which preprocessing techniques seemed most effective?
- What would you use a Word2Vec model for?
- How would you evaluate which model is "better" for a real-world application?
- What implications does this have for text extraction in NLP pipelines?
🚀 Keep Exploring!
If you finish early, continue experimenting with the models. Try different preprocessing techniques, model parameters, or visualisation approaches.
We'd love to hear about any interesting patterns or insights you discover in your exploration!