πŸ’» Week 07 Lab

Data Quality Experimentation with Word Embeddings

Published: 4 March 2025

πŸ₯… Learning Goals
By the end of this lab, you will learn how to: i) compare text quality from different PDF extraction methods, ii) implement basic text preprocessing for word embeddings, iii) train and evaluate Word2Vec models, and iv) visualize how data quality affects embedding relationships in climate policy contexts.

Last Updated: 3 March 2025, 20:00

πŸ“Time and Location: Tuesday, 4 March 2025. Check your timetable for the precise time and location of your class.

πŸ›£οΈ Lab Roadmap

Note to instructor: Welcome students to the lab. Reinforce the essence of today’s lab: to explore, hands-on, the impact of decisions we make about data pre-processing and model parameters.

In this lab, we’ll explore how different text extraction methods affect word embedding quality. We’ll use two versions of the same corpus of Nationally Determined Contributions (NDC) documents to compare embedding models.

Throughout this lab, keep this question in mind:

β€œHow does the quality of a) the text extraction, b) the pre-processing, and c) the model parameters affect the representation of words in the embedding space?”

Part I: Setup and Data Preparation (15 min)

No need to wait for your class teacher. You can start working on this part on your own as soon as you arrive.

Note to instructor: Make sure students have successfully extracted the zip file and can access the notebook. Some students might struggle with setting up the virtual environment, especially on Windows. Be prepared to troubleshoot common issues.

🎯 ACTION POINTS

  1. Download the lab materials.

    Click HERE to download the zip file with the files we will use in this lab.

    Download the DS205-Week07-Files.zip file and extract it to a location of your choice (e.g., ~/DS205-Week07/).

    You will end up with a folder called DS205-Week07 with the following contents:

    DS205-Week07/
    β”œβ”€β”€ data/
    β”‚   β”œβ”€β”€ ndc-docs-lazy/   # Text version of the NDC PDFs (using a lazy approach)
    β”‚   β”œβ”€β”€ ndc-docs-robust/ # Text version of the NDC PDFs (using a more robust approach)
    β”‚   └── ndc-pdfs/        # The original NDC PDFs
    β”œβ”€β”€ DS205-Week07-Lecture.ipynb
    β”œβ”€β”€ DS205-Week07-Lab.ipynb
    └── requirements.txt
  2. (Optional but recommended) Dedicated Python Environment.

    Before you start working on the lab, we recommend you create a dedicated Python environment and install the requirements:

    cd path/to/DS205-Week07
    python -m venv embedding-env
    source embedding-env/bin/activate # On Windows: embedding-env\Scripts\activate
    pip install -r requirements.txt

    Open the notebook in your preferred IDE (e.g., VSCode, JupyterLab, etc.) and make sure to select the kernel that matches the one in the environment (embedding-env).

Part II: Loading and Exploring the Datasets (15 min)

Note to instructor: Ask students to work on Part II of the notebook during this period. You can either ask them to go through it alone as you walk around the room or you can make this a πŸ—£οΈ TEACHING MOMENT where you guide them through the code. Don’t change the parameters yet; we will reserve Part IV for this type of experimentation.

🎯 ACTION POINTS

Run the cells in Part II of the notebook. There you will find the code to load the datasets and explore their contents.
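The notebook handles the loading for you, but if you want to poke at the raw files directly, a minimal sketch looks like this (it assumes the extracted texts are stored as `.txt` files inside the `ndc-docs-lazy` and `ndc-docs-robust` folders; adjust the paths and extension to match what you actually see on disk):

```python
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in `folder` into a dict of {filename: text}."""
    return {
        p.name: p.read_text(encoding="utf-8")
        for p in sorted(Path(folder).glob("*.txt"))
    }

# lazy = load_corpus("data/ndc-docs-lazy")
# robust = load_corpus("data/ndc-docs-robust")
# A quick first comparison: raw character counts per document often reveal
# where the lazy extraction dropped or garbled text.
# for name in lazy:
#     print(name, len(lazy[name]), len(robust.get(name, "")))
```

Large differences in length between the two versions of the same document are a good starting point for your exploration later in the lab.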

🀨 Think about it: Why might language detection be important when working with international policy documents? What challenges might arise from multilingual text in our word embeddings?

Part III: Training and Evaluating Word Embeddings (15 min)

Note to instructor: Guide them through Part 3 of the notebook and show any other demos of your own if you feel like it.

πŸ—£οΈ TEACHING MOMENT

Your class teacher will go through the Word2Vec model parameters (Part 3 of the notebook) and how they affect the resulting word embeddings.

This is a great opportunity to ask questions if you don’t fully get how Word2Vec works.

Part IV: Open Exploration (40 min)

Note to instructor: Let students work on this part on their own but remember to anchor the discussion at the end. This being a very open-ended task, it’s important to share how everyone explored the models and their findings.

Now it’s your turn to explore the models and draw your own conclusions.

🎯 ACTION POINTS:

Alternate between these two tasks:

  • Keep the pre-processing (Section 2.3 in the notebook) the same and try different model parameters (Section 3.1 in the notebook).
  • Keep the model parameters (Section 3.1 in the notebook) the same and try different pre-processing (Section 2.3 in the notebook).

πŸ’‘ ADVICE:

  • Test only one thing at a time.
  • Re-run all relevant cells every time you change the pre-processing or the model parameters.

πŸ—£οΈ CLASSROOM DISCUSSION:

Let’s gather everyone’s findings and discuss:

  1. What were the most significant differences you found between the models?
  2. Which preprocessing techniques seemed most effective?
  3. What would you use a Word2Vec model for?
  4. How would you evaluate which model is β€œbetter” for a real-world application?
  5. What implications does this have for text extraction in NLP pipelines?

🏠 Keep Exploring!

If you finish early, continue experimenting with the models. Try different preprocessing techniques, model parameters, or visualisation approaches.

We’d love to hear about any interesting patterns or insights you discover in your exploration!