💻 Week 08 Lab

Exploring Transformer Models with HuggingFace

Published: 11 March 2025

🥅 Learning Goals
By the end of this lab, you will have explored transformer-based models for climate document analysis, experimented with contextual embeddings, and discussed potential insights that can be derived from these advanced NLP techniques.

Last Updated: 10 March 2025, 18:00

πŸ“Time and Location: Tuesday, 11 March 2025. Check your timetable for the precise time and location of your class.

🧪 Lab Overview

Today’s lab focuses on hands-on exploration of transformer models, building on the concepts introduced in the 🗣️ Week 08 Lecture. You’ll work with the same climate document corpus as last week, this time using more advanced NLP techniques.

Also download the utils.py file.

Prerequisites

We assume you:

  • ✅ Attended or reviewed the 🗣️ Week 08 Lecture
  • ✅ Have your Python environment set up with the required packages
  • ✅ Have access to the NDC corpus from last week’s lab

If you have not installed the packages, we recommend you run the lab’s notebook on Nuvolos. The packages and environment are already set up for you there.

πŸ›£οΈ Lab Structure

Part 1: Setting Up (10 min)

This is an 🎯 ACTION POINT for you to work on individually.

  1. Environment Setup
    • Continue using your embedding-env from last week
    • Update with new requirements (see below)
    • Load the ClimateBERT model
  2. Loading the Data
    • Work with the NDC corpus from last week
    • Compare ‘lazy’ vs ‘robust’ preprocessing approaches
Updated requirements.txt:
# Core data science packages
numpy==1.26.4
pandas==2.2.3
matplotlib==3.10.1
scikit-learn==1.6.1

# NLP and text processing
nltk==3.9.1
gensim==4.3.3
langdetect==1.0.9

# Transformers and deep learning
transformers==4.39.3
datasets==2.18.0
torch==2.2.1

# Visualization
lets-plot==4.6.0

# Utilities
tqdm==4.67.1
ipykernel==6.29.5
ipywidgets==8.1.5
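
Once the environment is updated, loading the model might look roughly like the sketch below. The lab does not say which ClimateBERT checkpoint to use, so the Hub ID here is an assumption (it is one of the checkpoints published under the climatebert organisation), and load_climatebert is a hypothetical helper name, not something taken from utils.py. Swap in whatever checkpoint your class teacher specifies.

```python
# Minimal sketch: load a ClimateBERT checkpoint with HuggingFace transformers.
# ASSUMPTION: the checkpoint ID below is illustrative; the lab text only says
# "the ClimateBERT model" and does not name a specific one.
MODEL_NAME = "climatebert/distilroberta-base-climate-f"


def load_climatebert(model_name: str = MODEL_NAME):
    """Return (tokenizer, model) for the given Hub checkpoint.

    transformers is imported inside the function so that this cell can be
    defined even before the package is installed.
    """
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


if __name__ == "__main__":
    # Downloads the weights on first use, so this needs internet access.
    tokenizer, model = load_climatebert()
    print(type(model).__name__)
```

The first call downloads the weights and caches them locally, so later runs should start faster.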

Part 2: Document Chunking and Embeddings (20 min)

πŸ—£οΈ TEACHING MOMENT

Your class teacher will guide you through:

  • Fine-grained document chunking strategies
  • Computing embeddings with transformer models
  • Comparing embedding similarities
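
To make the three steps above concrete, here is a minimal sketch under stated assumptions. It is not necessarily the approach your class teacher will demonstrate. The function names (chunk_words, embed_chunks, cosine_similarity), the word-window chunking with overlap, and mean pooling over token vectors are all illustrative choices. embed_chunks expects a tokenizer and model already loaded with transformers.

```python
import math
from typing import List, Sequence


def chunk_words(text: str, max_words: int = 100, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed_chunks(chunks: List[str], tokenizer, model) -> List[List[float]]:
    """Mean-pool the last hidden states into one vector per chunk.

    ASSUMPTION: mean pooling is one common choice; the lab may use another
    strategy (for example, the first-token vector).
    """
    import torch  # imported lazily so the pure-Python helpers work without it

    batch = tokenizer(chunks, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return pooled.tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With a tokenizer and model in hand, a comparison could then be written as cosine_similarity(vectors[0], vectors[1]), where vectors = embed_chunks(chunk_words(document_text), tokenizer, model). Smaller chunks give finer-grained matches but more vectors to compute, so it is worth trying a few window sizes.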

Part 3: Open Exploration (50 min)

This is your chance to deeply explore transformer models and their capabilities. Choose from these suggested areas or pursue your own interests.

🎯 ACTION POINTS:

  1. Choose an area to explore
  2. Document your findings
  3. Share screenshots of interesting discoveries in the #social Slack channel

🏠 Looking Ahead

The techniques explored today will be directly relevant to ✍️ Problem Set 2 (to be released soon). Use this lab to:

  • Experiment with different approaches to document analysis
  • Understand how transformer models handle climate-specific language
  • Practice explaining your methodology and findings

Your ability to link practical explorations to theoretical concepts will be key for the upcoming assignment.