πŸ“ Problem Set 2: Climate Policy Information Extraction with NLP (40%)

2024/25 Winter Term

Published: 11 March 2025

πŸ₯… Learning Goals
By the end of this assignment, you will:

  1. Implement and compare text embedding approaches
  2. Build an effective information retrieval system using vector search
  3. Evaluate model performance on domain-specific tasks
  4. Apply NLP techniques to solve real-world climate finance challenges

Overview

In this assignment, you will create a Natural Language Processing (NLP) system to extract specific climate policy information from Nationally Determined Contributions (NDC) documents.

Your goal is to build a data pipeline that can automatically retrieve information from unstructured data (the NDC PDFs) and answer the following question:

β€œWhat emissions reduction target is each country in the NDC registry aiming for by 2030?”

Motivation: You will take on the role of a TPI analyst who has just begun working on the EP2.a.i indicator of the ASCOR assessment. (In practice, the analyst will need to gather additional data beyond this assignment, as the indicator compares targets against 2019 emissions levels.) Your aim is to automate the process as much as possible and create a pipeline that not only answers the question but also identifies the specific page, paragraph, or section of the PDF that contains the relevant piece(s) of information.

NOTE: We value your process leading to a strong final product more than the product itself. You can still achieve a high score even if only a few questions are automatically answered, as long as you demonstrate the use of the right skills.

πŸ’‘ Tip: By the time this assignment is published, you will have been introduced to most of the necessary elements. The upcoming lectures and labs in Weeks 09 and 10 will cover the remaining topics, and we will also set aside time to help you make progress.

πŸ“š Preparation

  1. Click on this GitHub Classroom link¹ to create your designated repository.
  2. Follow the setup instructions in the starter code README to get your environment ready.

πŸ“€ Submission Details

πŸ“… Due Date: Friday 28 March 2025, 8pm UK time

πŸ“€ Submission Method: Push your work to your allocated GitHub repository. If you can see your work on GitHub, we can see it too, and that is what we mark.

πŸ’‘ Tip: Start early and make regular commits. This helps track your progress and ensures you have a working solution by the deadline.

⚠️ Note: Late submissions will receive penalties according to LSE’s policy.

πŸ“š Required Tasks

This assignment consists of three main components:

  1. Document Analysis and Annotation (25%)
  2. Embedding Generation and Comparison (35%)
  3. Information Extraction System (40%)

The table below breaks these three components down into the specific things we expect you to do.

| Requirement | Component | Details | Course Connection |
|---|---|---|---|
| Data Annotation | Document Analysis | - Process PDF files into a structured data format<br>- Label sections of the extracted text with location data (e.g. page number, section heading)<br>- Document your annotation methodology | πŸ’» W07 Lab: Text extraction methods<br>πŸ’» W08 Lab: Document chunking |
| Quality Assurance | Document Analysis | - Validate consistency (πŸ†)<br>- Handle edge cases (πŸ†) | πŸ’» W08 Lab: Exploratory quality assurance |
| Implementation | Embeddings | - Implement keyword search (creative use of Word2Vec, or of approaches like bag-of-words and TF-IDF)<br>- Compare it with Transformer-generated embeddings (see the sketch below the table) | πŸ—£οΈ W07 Lecture: Word2Vec models<br>πŸ—£οΈ W08 Lecture: Transformer models |
| Analysis | Embeddings | - Exploratory analysis of embedding spaces (e.g. pairwise similarity, visualisation)<br>- Document the insights (or lack thereof) from this exploration<br>- Document performance differences | πŸ—£οΈ W08 Lecture: Embedding visualisation<br>πŸ’» W08 Lab: HuggingFace ecosystem |
| Vector Search | Extraction | - Implement similarity search<br>- Apply thresholds or other relevant filters (πŸ†) | πŸ—£οΈ W09 Lecture: PostgreSQL with pgVector<br>πŸ’» W09 Lab: Vector search implementation |
| Pipeline | Extraction | - Design effective prompts<br>- Extract relevant information into a structured data format | πŸ—£οΈ W10 Lecture: Collaborative extraction techniques<br>πŸ’» W10 Lab: Extraction workshops |
| Evaluation | Extraction | - Document- and page-level precision<br>- Paragraph-level precision (πŸ†)<br>- Compare results against manually curated ground-truth annotations (πŸ†) | πŸ’» W09 Lab: Location precision techniques<br>πŸ—£οΈ W10 Lecture: Precision evaluation |

Everything that is NOT marked with a πŸ† is considered a core requirement. The πŸ† marks are intended to reward those who can go above and beyond and deserve a distinction.
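To make the embeddings Implementation row concrete, here is a minimal sketch of the kind of keyword-vs-transformer comparison we have in mind, assuming scikit-learn and sentence-transformers are installed. The chunks, the query, and the model name all-MiniLM-L6-v2 are illustrative choices, not requirements:

```python
# Minimal sketch: keyword-based (TF-IDF) vs transformer-based similarity.
# The texts, query, and model name are illustrative, not required choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

chunks = [
    "We aim to reduce greenhouse gas emissions by 45% by 2030.",
    "The national adaptation plan focuses on coastal resilience.",
]
query = "What is the 2030 emissions reduction target?"

# Keyword baseline: sparse TF-IDF vectors + cosine similarity.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(chunks + [query])
tfidf_scores = cosine_similarity(tfidf[-1], tfidf[:-1]).flatten()

# Transformer embeddings: dense vectors + cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
dense = model.encode(chunks + [query])
dense_scores = cosine_similarity([dense[-1]], dense[:-1]).flatten()

for chunk, kw, tr in zip(chunks, tfidf_scores, dense_scores):
    print(f"TF-IDF={kw:.2f}  transformer={tr:.2f}  {chunk[:60]}")
```

In your own pipeline you would run this comparison over real NDC chunks and document where (and why) the two approaches agree or disagree.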

πŸ› οΈ Getting Started: Starter Code & Project Structure

You won’t start completely from scratch. I’m giving you a good starting point to help you focus on the NLP components rather than boilerplate. Here’s what you’ll find:

climate-policy-extractor/
β”œβ”€β”€ climate_policy_extractor/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ items.py                       
β”‚   β”œβ”€β”€ settings.py                    
β”‚   └── spiders/
β”‚       β”œβ”€β”€ __init__.py
β”‚       └── ndc_spider.py                            
β”œβ”€β”€ notebooks/  # Just a suggestion. Remove/add as needed
β”‚   β”œβ”€β”€ NB01-pdf-extractor.ipynb   
β”‚   β”œβ”€β”€ NB02-embedding-comparison.ipynb
β”‚   β”œβ”€β”€ NB03-information-retrieval.ipynb   
β”‚   └── NB04-evaluation.ipynb   
β”œβ”€β”€ README.md    
β”œβ”€β”€ REPORT.md
β”œβ”€β”€ requirements.txt
└── scrapy.cfg       

What’s Already Implemented:

  • Web scraping pipeline that collects NDC documents and saves the PDFs in the data/pdfs folder

    • Run it with scrapy crawl ndc_spider. Once the spider finishes, you should see the downloaded PDFs in your data/ folder.

    • Search for TODO in the code to find the parts you still need to implement.

  • Basic file management

  • Skeleton for document extraction using the unstructured library

    • The relevant code is in notebooks/NB01-pdf-extractor.ipynb and in notebooks/utils.py (see the sketch after this list).
  • Empty notebooks with suggested workflow structure

  • Documentation templates

  • README with initial setup instructions
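To give you a sense of what the extraction skeleton builds on, here is a minimal sketch of the unstructured workflow, keeping the location metadata you need for annotation. The file path is illustrative, and it assumes you have installed the library with the PDF extra (pip install "unstructured[pdf]"):

```python
# Sketch of PDF extraction with the unstructured library, keeping the
# page-level metadata needed for annotation. The path is illustrative.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="data/pdfs/example_ndc.pdf")

records = [
    {
        "text": el.text,
        "category": el.category,              # e.g. Title, NarrativeText
        "page_number": el.metadata.page_number,
    }
    for el in elements
    if el.text and el.text.strip()
]
print(records[:3])
```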

What You Need to Implement:

  1. Complete the document extraction and annotation process
  2. Build embedding generation and comparison functionality
  3. Develop the information retrieval system (a similarity-search sketch follows below)
  4. Create evaluation metrics and performance analysis

The starter code handles the tedious parts of data collection, allowing you to focus on the NLP components that form the core of this assessment.
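For the information retrieval system (item 3 above), similarity search with pgVector boils down to queries like the one below. This is a hedged sketch, not the required implementation: the connection string, the chunks table, its columns, and the embedding model are all illustrative assumptions.

```python
# Sketch of nearest-neighbour retrieval with PostgreSQL + pgVector.
# Assumes a table like:
#   CREATE TABLE chunks (id serial, doc text, page int,
#                        content text, embedding vector(384));
# Table/column names and the connection string are illustrative.
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

conn = psycopg2.connect("dbname=ndc user=postgres")
register_vector(conn)  # lets psycopg2 pass numpy arrays as vectors

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("2030 emissions reduction target")

with conn.cursor() as cur:
    # <=> is pgVector's cosine distance operator; smaller is closer.
    cur.execute(
        """
        SELECT doc, page, content, embedding <=> %s AS distance
        FROM chunks
        ORDER BY distance
        LIMIT 5
        """,
        (query_vec,),
    )
    for doc, page, content, dist in cur.fetchall():
        print(f"{doc} p.{page}  distance={dist:.3f}  {content[:60]}")
```

Because <=> returns cosine distance, the smallest values are the closest chunks; applying a cut-off on this distance is one way to approach the thresholding πŸ† in the requirements table.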

βœ”οΈ Marking Guide

In line with the unwritten but widely used UK marking conventions, grades must be awarded as follows:

  • 40-49: Basic implementation with significant room for improvement (typically missing many core requirements)
  • 50-59: Working implementation but one that meets only the very basic requirements (it looks very incomplete)
  • 60-69: Good implementation demonstrating solid understanding with small caveats and minor improvements possible
  • 70+: Excellent implementation going beyond expectations, showing creativity and depth of understanding without being overly verbose or over-engineered

Note from Jon: I find this artificial β€˜cap’ at 70+ marks silly and unnecessary, and it clashes with what I understand to be the pedagogical purpose of an undergraduate course that is all about demonstrating hands-on experience. If your work is of a high standard and clearly demonstrates that you are truly and meaningfully engaged with the material beyond a shallow level, I will be happy to award distinctions.

πŸ“‹ Deliverables

| Deliverable | Details |
|---|---|
| GitHub Repository | - Complete implementation code (not just starter code)<br>- Documentation<br>- Requirements file |
| Technical Report | - Annotation methodology<br>- Embedding performance analysis<br>- System evaluation results<br>- Design decisions and trade-offs |
| Interactive Demo Notebook(s) | - System workflow<br>- Example queries<br>- Performance metrics (sketched below)<br>- Summary of results (tables and/or charts) |
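On the performance-metrics point, page-level precision can be as simple as comparing the pages your system points to against hand-annotated gold pages. A minimal sketch, with illustrative data that is not part of the starter code:

```python
# Sketch: page-level precision against manually curated ground truth.
# The sets below are hypothetical examples, not starter-code data.

def page_precision(retrieved: set[int], relevant: set[int]) -> float:
    """Fraction of retrieved pages that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

# Hypothetical example: system output vs gold pages for one country.
retrieved_pages = {3, 4, 12}
gold_pages = {3, 12}
print(f"Page-level precision: {page_precision(retrieved_pages, gold_pages):.2f}")
# -> 0.67
```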

⚠️ Note: Your implementation should be reproducible. Include clear setup instructions and handle dependencies appropriately.

πŸ‘‚ Feedback

You will receive:

  • Detailed feedback on your implementation
  • Suggestions for improvement
  • Justification for marks awarded
  • Specific suggestions for potential contributions to the public repository

By the way, you should receive feedback on your previous assignment by Week 09.

πŸ‘‚ Show you can act on feedback

In Spring Term 2025, we will evaluate all submissions and select the best ones to create a public repository. All students will then have the opportunity to contribute through pull requests (PRs) based on the feedback received on their individual submissions.

These contributions can earn you additional marks beyond your initial grade. This approach allows you to:

  1. Improve your work based on detailed feedback
  2. Gain experience with collaborative development workflows
  3. Build a public portfolio piece that showcases your NLP skills

The specific details about the public repository and contribution process will be shared later.

πŸ‘‰ We understand this timeline might not work for everyone, given potential clashes with this course’s group project as well as the exam period. However, as this is essentially extra marks, don’t feel pressured to take part.

πŸ”— Connection to Future Work

The skills developed in this assignment directly support your upcoming group project where you’ll:

  1. Work in teams to build more complex information extraction systems
  2. Implement automated pipelines with CI/CD (Week 11)
  3. Scale solutions to handle larger document collections
  4. Address real-world information needs for climate policy analysis

πŸ’‘ Tip: Attending Weeks 09-11 will provide critical knowledge for both completing this assignment effectively and preparing for your group project.

Footnotes

  1. Visit the Moodle version of this page to get the link. The link is private and only available to formally enrolled students.β†©οΈŽ