DS205 2025-2026 Winter Term

✍️ Problem Set 2: Climate Transition Intelligence with RAG (40%)

2025/26 Winter Term

Published: 06 March 2026

🥅 Learning Goals
By the end of this assignment, you will:

  1. Design and implement a multi-stage data pipeline for processing unstructured corporate disclosure PDFs
  2. Extract, chunk, and embed text from real-world climate reports using open-source tools
  3. Build a retrieval-augmented generation (RAG) system that answers domain-specific questions about corporate emissions
  4. Evaluate your pipeline’s outputs and explain where and why it fails

Overview

In this assignment, you will build a Retrieval-Augmented Generation (RAG) pipeline that extracts Carbon Performance information from corporate disclosure documents collected by the TPI Centre. (We will introduce all the necessary concepts step by step in Weeks 07-10 so you can build it incrementally.)

Your goal is to process real PDF reports published by companies in a specific sector and build a system that can answer questions about their emissions targets and performance. The pipeline should attempt to answer the following questions using language model prompts, without hardcoding the answers¹:

  1. “What companies are tracked by this pipeline?”
  2. “What are the emissions targets for [Company X]?”
  3. “What is the activity (the denominator of emissions intensity) for [Company X]?”
  4. (More ambitious) “How has the activity or emission targets of [Company X] changed over time?”

Motivation: You are taking on the role of someone building a tool that helps TPI Centre analysts do their work more efficiently. The TPI Centre evaluates companies’ Carbon Performance by comparing their emissions pathways against climate scenarios consistent with the Paris Agreement. Analysts currently read hundreds of pages of corporate disclosures to find specific data points. Your pipeline automates the retrieval step: given a question and a collection of documents, find the relevant passages and generate a structured answer.

🔔 NOTE: You can still score very well even if your pipeline does not produce satisfying answers to all questions. What matters is that you demonstrate the right skills, document your decisions, and explain where and why things fail.

Deadline: Thursday, 26 March 2026, at 8 pm UK time
💎 Weight: 40% of final grade
GitHub Classroom: Accept the assignment

GITHUB ASSIGNMENT INVITATION LINK:
https://classroom.github.com/a/jjDKUVGv

📋 Choose Your Sector

Pick one of the following sectors. Your project must cover at least two companies from that sector.

| Sector | What to expect |
|---|---|
| Food Producers | Continues the food theme from W01-W05. One caveat: many food companies do not disclose the physical weight of their sourced agricultural inputs, so you may encounter gaps in the “denominator” data for emissions intensity. |
| Electrical Utilities | The most straightforward sector for this assignment. Emissions intensity is Scope 1 emissions divided by MWh of electricity produced. Regional benchmarks (OECD vs non-OECD) add an interesting dimension. |
| Diversified Mining | Mining companies extract many different commodities. TPI uses a “Copper Equivalent” denominator to make them comparable. This adds an extra layer of complexity to the data you need to extract. |

More companies means a richer dataset and more interesting comparisons, but also more documents to process and more edge cases to handle. Two companies done well is a stronger submission than six companies done poorly.

🗄️ Data Source

The TPI Centre has provided corporate disclosure PDFs for each sector, organised by company. Access the data at:

📂 TPI Centre Carbon Performance Data (SharePoint)

The folder is structured as one folder per sector, one subfolder per company, and one or more PDF files inside each company folder. The types of reports vary by company; you may find sustainability reports, annual reports, CDP responses, and similar public disclosures.

You do not need to write a web crawler for this assignment². The documents are provided. Download the PDFs for your chosen sector and companies, and focus your effort on the pipeline itself.


💡 TIP ON HOW TO GET STARTED

Start with one company. Pick one PDF. Extract the text. Look at what comes out. Find where the emissions data lives in the document. Only then think about chunking and embedding. Git add, git commit, git push.

A single company processed thoroughly with clear documentation is a stronger submission than five companies processed superficially.
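As a first concrete step, here is a minimal inspection sketch, not a prescribed solution. The helper names and the PDF path are illustrative, and the `partition_pdf` call assumes the `unstructured` library is installed:

```python
from collections import Counter


def summarise_elements(elements: list[tuple[str, str]]) -> Counter:
    """Count element categories (Title, NarrativeText, Table, ...).

    Elements are modelled here as (category, text) pairs, mirroring the
    .category and .text attributes on unstructured's element objects.
    """
    return Counter(category for category, _ in elements)


def inspect_pdf(pdf_path: str) -> None:
    """Extract one PDF and print which element types dominate it."""
    # Requires: pip install "unstructured[pdf]"
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=pdf_path)
    print(summarise_elements([(el.category, el.text) for el in elements]))
```

Call `inspect_pdf("reports/company_x/annual_report.pdf")` (path illustrative) and eyeball the counts before committing to a chunking strategy: a document dominated by `Table` elements needs a different approach than one made of clean paragraphs.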

🔍 Going further: automated document discovery (optional)

If you want to go beyond the provided data and build a system that can discover new documents, we recommend using a search API (such as Perplexity or the Google Search API) as the first stage of your pipeline, rather than writing a bespoke Scrapy spider. This separates document discovery from document processing, which is how the TPI Centre’s own CLEAR system works. This is entirely optional and could instead be a goal for the final group project.

📌 Decisions You Will Need to Make

📌 DECISIONS, DECISIONS, DECISIONS

You will need to make several decisions independently. Document all of them in your README.md or CONTRIBUTING.md as appropriate.

  1. Pipeline architecture: How many stages does your pipeline have? What does each stage do? How do they connect? You should apply the three design principles from 🖥️ W07 Lecture (atomicity, idempotency, modularity), but the specific stages are yours to decide. [W07]

  2. PDF extraction strategy: Corporate reports vary wildly. Some have clean text. Others have tables, infographics, multi-column layouts, or data split across pages. How will you handle this? [W08]

  3. Chunking approach: How do you split extracted text into pieces suitable for embedding? By page? By paragraph? By section heading? What metadata do you preserve with each chunk? [W08-W09]

  4. Embedding model: Which open-source model from HuggingFace will you use? Why? How does it handle the domain-specific language in climate disclosures? [W08-W09]

  5. Retrieval strategy: How do you find the most relevant chunks for a given question? What similarity threshold do you use? How do you handle cases where the answer spans multiple chunks? [W09]

  6. Language model for generation: Which open-source model will you use for the generation step? How do you construct your prompts? [W10]

  7. What you consider a “good” answer: How do you evaluate whether your pipeline produced a useful result? What counts as a null result, and what do you do with it? [W10]
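To make decision 3 concrete, here is one possible paragraph-level chunker that keeps metadata alongside each chunk. This is a sketch under assumed inputs; the field names are illustrative choices, not a required schema:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    company: str      # which company's report this came from
    source_file: str  # which PDF
    page: int         # page number, so answers can cite their source


def chunk_by_paragraph(pages: list[tuple[int, str]], company: str,
                       source_file: str, min_chars: int = 40) -> list[Chunk]:
    """Split each page's text on blank lines; drop fragments too short to embed."""
    chunks = []
    for page_number, text in pages:
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if len(paragraph) >= min_chars:
                chunks.append(Chunk(paragraph, company, source_file, page_number))
    return chunks
```

Whatever strategy you choose, keeping the page number with every chunk makes it cheap to point an analyst back to the original passage.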

🧰 Technical Requirements

  1. You MUST use open-source language models loaded from HuggingFace for your pipeline. If you want to compare against commercial APIs (OpenAI, Anthropic, Google), you may, but results from open-source models are required. [W08-W10]

  2. You MUST use the unstructured library for PDF text extraction. You may use additional extraction tools alongside it if you justify the choice. [W08]

  3. Your pipeline MUST be runnable. Someone cloning your repository should be able to follow your README.md and reproduce your results. If your pipeline requires API keys, large model downloads, or data that cannot be stored on GitHub, document this clearly. [W08-W10]

  4. Your code MUST follow the engineering principles from this course: typed Python where appropriate, minimal documentation that earns its place, no over-engineering. [W08-W10]

  5. You MUST produce a README.md that tells a user of the pipeline what it does and how to run it, and a CONTRIBUTING.md that onboards a fellow developer: how the pipeline works internally, known bugs they might want to fix, and how to set up the development environment.

  6. You are encouraged to maintain documentation of how you configured your AI agents (if you use any): an AGENTS.md, .github/copilot-instructions.md, CLAUDE.md, or equivalent file documenting your AI collaboration rules. If you do, share it with classmates; there is no single authoritative version for the course.
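For the generation step in requirement 1, one way to keep the model grounded in retrieved text is to assemble the prompt explicitly. The template and dict keys below are illustrative assumptions, not a required format:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that asks the model to answer only from retrieved context.

    Each chunk dict is assumed to carry 'text', 'source_file' and 'page' keys.
    """
    context = "\n\n".join(
        f"[{c['source_file']}, p.{c['page']}]\n{c['text']}" for c in chunks
    )
    return (
        "You are assisting a climate analyst. Answer the question using ONLY "
        "the context below. If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Labelling each chunk with its source file and page is a cheap way to get citations into the model's answer, and the explicit "say so" instruction gives you a handle on null results (decision 7).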

📚 Background: TPI’s Carbon Performance Assessment

The TPI Centre checks whether companies are doing their fair share to stay within the global carbon budget. Here is how their Carbon Performance assessment works.

The assessment process

1. Setting the goal (the benchmark). Experts divide the global carbon budget among industries based on where it is most cost-effective to reduce emissions. They calculate how fast an average company in each industry needs to reduce its pollution to meet climate goals like limiting warming to 1.5°C. This creates a target line, or “benchmark pathway”, over time.

2. Checking current emissions. TPI looks at the company’s own public reports to see how much pollution they have created recently. To make comparisons fair across companies of different sizes, they calculate “emissions intensity”: the amount of pollution per unit of activity.

\[ \text{Emissions intensity} = \frac{\text{Emissions}}{\text{Activity}} \]

3. Looking at promises. TPI then looks at the targets the company has set for the future. For these assessments, it is assumed that companies will actually meet their goals, and TPI calculates what the company’s future emission levels will look like.

4. The comparison. TPI draws a line showing the company’s past, present, and promised future emissions, and lays it on top of the scientific benchmark line. If the company’s line is at or below the target, they are considered “aligned” with climate goals. If above, they are falling behind.
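Step 2's intensity calculation can be made concrete with a toy example (the numbers are invented for illustration):

```python
def emissions_intensity(emissions_tco2e: float, activity: float) -> float:
    """Emissions intensity = emissions / activity (e.g. tCO2e per MWh)."""
    if activity <= 0:
        raise ValueError("activity must be positive")
    return emissions_tco2e / activity


# Hypothetical utility: 4.5 MtCO2e of Scope 1 emissions over 10 TWh generated
intensity = emissions_intensity(4_500_000, 10_000_000)  # units: tCO2e / MWh
print(intensity)  # 0.45 tCO2e per MWh
```

The guard against a non-positive denominator matters in practice: missing activity data is exactly the kind of gap flagged for food producers above.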

🔔 NOTE

There is one exception. For industries that need to be phased out entirely (like coal mining), TPI measures absolute emissions rather than intensity. We are not working with coal in this assignment, but the comparison table below includes it for context.

Emissions Scopes

In greenhouse gas accounting, emissions are broken down into three categories called “Scopes”.

  • Scope 1: Direct operational emissions. These come from sources a company owns or controls. For electrical utilities, this is the pollution from burning fossil fuels to generate electricity.

  • Scope 2: Indirect operational emissions. These come from the electricity, heat, or steam a company purchases to run its operations. For diversified miners, the energy needed for extraction and processing often comes from external power grids.

  • Scope 3: Value chain emissions. These are indirect emissions outside a company’s direct control, either upstream (from suppliers) or downstream (from customers using the products). For food producers, nearly 95% of assessed emissions come from upstream agricultural inputs. For diversified miners, downstream emissions from processing and burning sold products dwarf operational emissions.

The TPI Centre customises its assessment to focus on whichever scope represents the largest share of a sector’s climate impact.

How the sectors compare

| Feature | Food Producers | Diversified Mining | Electrical Utilities | Coal Mining |
|---|---|---|---|---|
| Numerator (Emissions) | Scope 1, 2 & Upstream Scope 3 (purchased agriculture) | Scope 1, 2 & Downstream Scope 3 (processing & use of products) | Scope 1 (direct from electricity generation) | Scope 1, 2 & Downstream Scope 3 (use of sold products) |
| Denominator (Activity) | Tonnes of agricultural inputs | Tonnes of Copper Equivalent (CuEq) | Megawatt hours (MWh) of electricity produced | None (absolute emissions index) |
| Benchmark Scenarios | 1.5°C, Below 2°C, 2°C | 1.5°C, Below 2°C, National Pledges | 1.5°C, Below 2°C, National Pledges | 1.5°C, Below 2°C, National Pledges |
| Distinctive Trait | Links complex food products back to raw agricultural inputs | Uses market prices to equate different commodities | Features regional benchmarks (OECD vs non-OECD) | Tracks absolute phase-out rather than intensity improvement |

More detail on each sector

Electrical Utilities (regional approach). Emissions intensity is greenhouse gas emissions from electricity generation divided by MWh produced. Utilities are assessed using regional benchmarks: developed regions are expected to hit net zero by 2035, while non-OECD countries have until 2045.

Diversified Mining (value equivalence approach). Mining companies extract many different resources. Using physical weight alone would unfairly penalise high-volume commodity producers. TPI converts all production into “Copper Equivalent” (CuEq) based on market prices, creating a single value-weighted metric. The assessment weights downstream Scope 3 emissions heavily.
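An illustrative (not official) sketch of a value-based copper-equivalent conversion, using made-up prices; consult TPI's methodology reports for the actual procedure:

```python
def copper_equivalent_tonnes(production: dict[str, float],
                             prices_usd_per_tonne: dict[str, float]) -> float:
    """Convert multi-commodity production to tonnes of copper equivalent.

    Each commodity's tonnage is weighted by its price relative to copper,
    so revenue-equivalent output becomes comparable across miners.
    """
    copper_price = prices_usd_per_tonne["copper"]
    return sum(
        tonnes * prices_usd_per_tonne[commodity] / copper_price
        for commodity, tonnes in production.items()
    )


# Invented numbers for illustration only
production = {"copper": 1_000.0, "iron_ore": 50_000.0}
prices = {"copper": 9_000.0, "iron_ore": 90.0}
print(copper_equivalent_tonnes(production, prices))  # 1000 + 50000 * 90/9000 = 1500.0
```

The point of the weighting is visible in the example: 50,000 tonnes of a cheap commodity contributes less CuEq than its raw tonnage suggests.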

Food Producers (supply chain approach). Because 80% of food sector emissions come from agriculture, TPI measures intensity as tonnes of CO2 equivalent per tonne of agricultural inputs purchased. Instead of measuring output of processed foods, TPI translates production back to raw agricultural commodities.

Coal Mining (absolute phase-out approach, for reference only). Unlike the other sectors, coal mining is assessed on absolute emissions rather than intensity. The industry cannot simply become more efficient; it must undergo a managed phase-out. Production needs to drop over 90% by 2050.

📖 Further reading: TPI’s methodology reports | TPI Centre corporates page

♟️ A Tactical Plan (W07-W10)

| Week | Focus | What to aim for |
|---|---|---|
| W07 | Pipeline design & setup | Accept the PS2 repository and clone it. Choose your sector and companies. Download the PDFs. Build a skeleton pipeline.py with Click (as practised in 💻 W07 Lab). Test it on GitHub Actions (in case you want to use it for the final submission). Read the TPI methodology to understand what data you need to find. |
| W08 | PDF extraction & embeddings intro | Extract text from your PDFs using unstructured. Inspect what comes out. Identify where emissions data appears in the documents (tables? paragraphs? infographics?). Start experimenting with Word2Vec and transformer embeddings. |
| W09 | Chunking & vector search | Decide on a chunking strategy. Generate embeddings for your chunks. Build a retrieval step that finds relevant chunks for a given question. Evaluate whether the right chunks are coming back. |
| W10 | Generation & evaluation | Add the language model generation step. Construct prompts that use retrieved chunks to answer the driving questions. Evaluate your results. Document where and why the pipeline fails. Polish README.md and CONTRIBUTING.md. |

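The W09 retrieval step can be sketched in pure Python as cosine similarity over embedding vectors. This is a minimal sketch with toy vectors; in practice the vectors would come from your chosen HuggingFace embedding model:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float],
          chunk_vecs: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunk_vecs,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

Exhaustively scoring every chunk like this is fine at this assignment's scale (a handful of companies); a vector index only becomes worthwhile at much larger corpus sizes.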

The deadline is Thursday 26 March 2026, 8pm UK time. This falls in W10, so you have W07-W09 for development and the first half of W10 for polish. Start early to make the most of the time available.

✔️ How We Will Grade Your Work

We value the process that leads to a strong final product more than the product itself. Higher marks reward depth of understanding, clarity of documentation, and honest evaluation of what works and what does not. Adding more companies or more pipeline stages without clear reasoning does not score higher.

Over-engineered solutions, unnecessary abstractions, and verbose documentation will be marked down, not up. The same rubric philosophy from ✍️ Problem Set 1 applies here.

📥 Pipeline Implementation (0-40 marks)

Pipeline architecture, PDF extraction, chunking, embedding, retrieval, and generation

| Marks | Level | Description |
|---|---|---|
| <16 | Poor | Pipeline does not run, or code has fundamental errors that prevent any meaningful output. No evidence of engagement with the PDF extraction challenge. |
| 16-19 | Weak | Pipeline runs but produces minimal useful output. PDF extraction attempted but chunking, embedding, or retrieval is missing or broken. Little connection to what was taught in W08-W10. |
| 20-23 | Fair | Pipeline works end-to-end but with notable weaknesses. Extraction produces usable text. Embeddings generated. Retrieval returns results, but quality is poor or unevaluated. |
| 24-27 | Good | Competent pipeline. PDF extraction handles common cases. Chunking strategy is reasonable and justified. Retrieval returns relevant chunks for at least some questions. Language model produces structured answers. Code is clean and follows course engineering principles. |
| 28-31 | Very Good! | Clean pipeline with thoughtful decisions at each stage. Chunking preserves useful metadata. Embedding model choice is justified. Retrieval quality is evaluated. Language model prompts are well-constructed. Code is lean and maintainable. |
| 32+ | Distinction | Exceptional implementation. Sophisticated handling of difficult documents (tables, multi-page data). Retrieval strategy goes beyond naive similarity search. Prompt engineering shows iteration and refinement. Pipeline is reproducible, well-documented, and every component earns its place. |

📊 Evaluation & Analysis (0-30 marks)

How you assess what works, what fails, and why

| Marks | Level | Description |
|---|---|---|
| <12 | Poor | No evaluation of pipeline outputs. No discussion of failure cases. Results presented without interpretation. |
| 12-14 | Weak | Some evaluation attempted but superficial. Failures noted but not explained. No comparison of approaches or models. |
| 15-17 | Fair | Evaluation present with some useful observations. Some failure cases explained. Limited exploration of why retrieval or generation fails for specific questions. |
| 18-20 | Good | Solid evaluation. Clear documentation of what works and what does not. Failure analysis connects to specific pipeline decisions (chunking too coarse, wrong scope extracted, table data lost in extraction). Comparison between at least two approaches (e.g. different embedding models or chunking strategies). |
| 21-23 | Very Good! | Thoughtful evaluation with genuine insight. Analysis of which document types or question types cause failures. Evidence of iterating on the pipeline based on evaluation results. Honest about limitations. Comparison with commercial API results if attempted. |
| 24+ | Distinction | Exceptional analysis. Systematic evaluation across companies and question types. Insights about domain-specific challenges (e.g. scopes matter differently per sector, table extraction degrades specific metrics). Clear recommendations for improvement. |

🧐 Documentation & Engineering Practice (0-30 marks)

README, CONTRIBUTING, code quality, reproducibility, and collaboration readiness

| Marks | Level | Description |
|---|---|---|
| <12 | Poor | README missing or unhelpful. No setup instructions. Code is not reproducible. No evidence of version control discipline. |
| 12-14 | Weak | README exists but is generic or verbose. CONTRIBUTING missing. Code runs but setup is unclear. Commit history shows bulk commits rather than incremental progress. |
| 15-17 | Fair | README explains what the pipeline does and how to run it. Some documentation of decisions. Code is organised but has room for improvement. |
| 18-20 | Good | Clear README for users and CONTRIBUTING for developers. Pipeline decisions documented with reasoning. Code follows course engineering principles. Regular commits. Type hints where appropriate. |
| 21-23 | Very Good! | README and CONTRIBUTING are practical and minimal. Documentation earns its place without being verbose. AGENTS.md or equivalent shows thoughtful AI collaboration rules. Code is clean enough that another student could extend it. |
| 24+ | Distinction | Exceptional documentation and engineering. Someone could fork this repository and build on it. Professional-grade README. CONTRIBUTING explains architectural decisions. Every file earns its place. |


📮 Need Help?

  • Post questions in the #help Slack channel.
  • Book office hours via StudentHub.
  • 🤖 Use the DS205 AI Tutor for help with code, concepts, and debugging.

Start with one company. Get text out of one PDF. See what the data looks like. Build from there.

🔗 Connection to Final Project

The skills you develop here directly support the final group project (W11 onwards, due Spring Term), where you will build more complex RAG systems for TPI research questions with teammates. Think of Problem Set 2 as building your individual competency; the final project tests whether you can apply it collaboratively at larger scale.

Footnotes

  1. That is, your pipeline should provide a way (an API? a terminal command built with Click? a GUI? you choose) for me to pose those questions and get a useful answer.↩︎

  2. If you love web scraping, you can try writing a cron job that runs a spider/Selenium script every X days to act as a tracker for new documents.↩︎