✍️ Problem Set 2: Climate Transition Intelligence with RAG (40%)
2025/26 Winter Term
Overview
In this assignment, you will build a Retrieval-Augmented Generation (RAG) pipeline that extracts Carbon Performance information from corporate disclosure documents collected by the TPI Centre. (We will introduce all the necessary concepts step by step in Weeks 07-10 so you can build it incrementally.)
Your goal is to process real PDF reports published by companies in a specific sector and build a system that can answer questions about their emissions targets and performance. The pipeline should attempt to answer the following questions using language model prompts, without hardcoding the answers¹:
- “What companies are tracked by this pipeline?”
- “What are the emissions targets for [Company X]?”
- “What is the emissions activity for [Company X]?”
- (More ambitious) “How have the activity or emissions targets of [Company X] changed over time?”
Motivation: You are taking on the role of someone building a tool that helps TPI Centre analysts do their work more efficiently. The TPI Centre evaluates companies’ Carbon Performance by comparing their emissions pathways against climate scenarios consistent with the Paris Agreement. Analysts currently read hundreds of pages of corporate disclosures to find specific data points. Your pipeline automates the retrieval step: given a question and a collection of documents, find the relevant passages and generate a structured answer.
🔔 NOTE: You can still score very well even if your pipeline does not produce satisfying answers to all questions. What matters is that you demonstrate the right skills, document your decisions, and explain where and why things fail.
| ⏳ | Deadline | Thursday, 26 March 2026, at 8 pm UK time |
| 💎 | Weight | 40% of final grade |
| ✋ | GitHub Classroom | Accept the assignment |
GITHUB ASSIGNMENT INVITATION LINK:
https://classroom.github.com/a/jjDKUVGv
📋 Choose Your Sector
Pick one of the following sectors. Your project must cover at least two companies from that sector.
| Sector | What to expect |
|---|---|
| Food Producers | Continues the food theme from W01-W05. One caveat: many food companies do not disclose the physical weight of their sourced agricultural inputs, so you may encounter gaps in the “denominator” data for emissions intensity. |
| Electrical Utilities | The most straightforward sector for this assignment. Emissions intensity is Scope 1 emissions divided by MWh of electricity produced. Regional benchmarks (OECD vs non-OECD) add an interesting dimension. |
| Diversified Mining | Mining companies extract many different commodities. TPI uses a “Copper Equivalent” denominator to make them comparable. This adds an extra layer of complexity to the data you need to extract. |
More companies means a richer dataset and more interesting comparisons, but also more documents to process and more edge cases to handle. Two companies done well is a stronger submission than six companies done poorly.
🗄️ Data Source
The TPI Centre has provided corporate disclosure PDFs for each sector, organised by company. Access the data at:
📂 TPI Centre Carbon Performance Data (SharePoint)
The folder is structured as: one folder per sector, one subfolder per company, one or more PDF files inside each company folder. The type of report varies by company: you may find sustainability reports, annual reports, CDP responses, and similar public disclosures.
You do not need to write a web crawler for this assignment². The documents are provided. Download the PDFs for your chosen sector and companies, and focus your effort on the pipeline itself.
💡 TIP ON HOW TO GET STARTED
Start with one company. Pick one PDF. Extract the text. Look at what comes out. Find where the emissions data lives in the document. Only then think about chunking and embedding. Git add, git commit, git push.
A single company processed thoroughly with clear documentation is a stronger submission than five companies processed superficially.
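A minimal sketch of that first extraction step, assuming a locally downloaded PDF (the path in the usage comment is hypothetical) and that `unstructured` is installed:

```python
def extract_elements(pdf_path: str) -> list[tuple[str, str]]:
    """Return (category, text) pairs for every element unstructured finds."""
    from unstructured.partition.pdf import partition_pdf  # imported lazily

    elements = partition_pdf(filename=pdf_path)
    return [(el.category, el.text) for el in elements]


def keep_substantive(
    pairs: list[tuple[str, str]], min_chars: int = 40
) -> list[tuple[str, str]]:
    """Drop page numbers, footers, and tiny fragments before thinking about chunking."""
    return [(cat, txt) for cat, txt in pairs if len(txt) >= min_chars]


# Usage against a real download (hypothetical path):
#   for category, text in keep_substantive(extract_elements("reports/acme_2024.pdf"))[:10]:
#       print(category, "→", text[:80])
```

Inspecting the element categories first (titles, narrative text, tables) is what tells you where the emissions data actually lives before you commit to a chunking strategy.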
🔍 Going further: automated document discovery (optional)
If you want to go beyond the provided data and build a system that can discover new documents, consider using a search API (such as Perplexity or the Google Search API) as the first stage of your pipeline, rather than writing a bespoke Scrapy spider. This separates document discovery from document processing, which is how the TPI Centre’s own CLEAR system works. This is entirely optional and could be a goal for the final group project instead.
📌 Decisions You Will Need to Make
📌 DECISIONS, DECISIONS, DECISIONS
You will need to make several decisions independently. Document all of them in your README.md or CONTRIBUTING.md as appropriate.
Pipeline architecture: How many stages does your pipeline have? What does each stage do? How do they connect? You should apply the three design principles from 🖥️ W07 Lecture (atomicity, idempotency, modularity), but the specific stages are yours to decide. [W07]
PDF extraction strategy: Corporate reports vary wildly. Some have clean text. Others have tables, infographics, multi-column layouts, or data split across pages. How will you handle this? [W08]
Chunking approach: How do you split extracted text into pieces suitable for embedding? By page? By paragraph? By section heading? What metadata do you preserve with each chunk? [W08-W09]
Embedding model: Which open-source model from HuggingFace will you use? Why? How does it handle the domain-specific language in climate disclosures? [W08-W09]
Retrieval strategy: How do you find the most relevant chunks for a given question? What similarity threshold do you use? How do you handle cases where the answer spans multiple chunks? [W09]
Language model for generation: Which open-source model will you use for the generation step? How do you construct your prompts? [W10]
What you consider a “good” answer: How do you evaluate whether your pipeline produced a useful result? What counts as a null result, and what do you do with it? [W10]
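To make the retrieval decision concrete, here is a minimal sketch of naive similarity search, assuming chunks have already been embedded into vectors (the vectors in any real run would come from your chosen embedding model; the threshold and `top_k` values are illustrative defaults, not recommendations):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve(
    query_vec: np.ndarray,
    chunk_vecs: list[np.ndarray],
    chunks: list[str],
    top_k: int = 3,
    threshold: float = 0.2,
) -> list[tuple[float, str]]:
    """Return up to top_k chunks whose similarity clears the threshold."""
    scored = [(cosine_similarity(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, chunk) for score, chunk in scored[:top_k] if score >= threshold]
```

Anything fancier (re-ranking, merging adjacent chunks when an answer spans several) builds on top of this basic loop — and those extensions are exactly the decisions you need to document.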
🧰 Technical Requirements
- You MUST use open-source language models loaded from HuggingFace for your pipeline. If you want to compare against commercial APIs (OpenAI, Anthropic, Google), you may, but results from open-source models are required. [W08-W10]
- You MUST use the `unstructured` library for PDF text extraction. You may use additional extraction tools alongside it if you justify the choice. [W08]
- Your pipeline MUST be runnable. Someone cloning your repository should be able to follow your `README.md` and reproduce your results. If your pipeline requires API keys, large model downloads, or data that cannot be stored on GitHub, document this clearly. [W08-W10]
- Your code MUST follow the engineering principles from this course: typed Python where appropriate, minimal documentation that earns its place, no over-engineering. [W08-W10]
- You MUST produce a `README.md` that tells a user of the pipeline what it does and how to run it, and a `CONTRIBUTING.md` that onboards a fellow developer: how the pipeline works internally, known bugs they might want to fix, and how to set up the development environment.
- You are encouraged to document how you configured your AI agents (if you use any) in an `AGENTS.md`, `.github/copilot-instructions.md`, `CLAUDE.md`, or equivalent file recording your AI collaboration rules. If you do, share it with classmates. There is no single authoritative version for the course.
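As one way to approach the generation requirement, a grounded prompt can be assembled from retrieved chunks before being handed to an open-source model. The template below and the model named in the comment are illustrative assumptions, not course-mandated choices:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that grounds the model in retrieved excerpts."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the excerpts below. "
        "If the answer is not present, say so.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Feeding it to a HuggingFace model (the model name is an example, not a requirement):
#   from transformers import pipeline
#   generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
#   answer = generator(build_prompt(question, retrieved_chunks))
```

Numbering the excerpts makes it easy to ask the model to cite which chunk supported its answer — useful evidence when you evaluate whether the retrieval step is actually working.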
📚 Background: TPI’s Carbon Performance Assessment
The TPI Centre checks whether companies are doing their fair share to stay within the global carbon budget. Here is how their Carbon Performance assessment works.
The assessment process
1. Setting the goal (the benchmark). Experts divide the global carbon budget among industries based on where it is most cost-effective to reduce emissions. They calculate how fast an average company in each industry needs to reduce its pollution to meet climate goals like limiting warming to 1.5°C. This creates a target line, or “benchmark pathway”, over time.
2. Checking current emissions. TPI looks at the company’s own public reports to see how much pollution they have created recently. To make comparisons fair across companies of different sizes, they calculate “emissions intensity”: the amount of pollution per unit of activity.
\[ \text{Emissions intensity} = \frac{\text{Emissions}}{\text{Activity}} \]
3. Looking at promises. TPI then looks at the targets the company has set for the future. For these assessments, it is assumed that companies will actually meet their goals, and TPI calculates what the company’s future emission levels will look like.
4. The comparison. TPI draws a line showing the company’s past, present, and promised future emissions, and lays it on top of the scientific benchmark line. If the company’s line is at or below the target, they are considered “aligned” with climate goals. If above, they are falling behind.
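As a toy illustration of step 2 (all numbers invented, purely for arithmetic):

```python
# A hypothetical electrical utility in one reporting year.
emissions_tco2e = 2_000_000  # Scope 1 emissions (tonnes CO2-equivalent)
activity_mwh = 5_000_000     # electricity produced (MWh)

# Emissions intensity = Emissions / Activity
intensity = emissions_tco2e / activity_mwh
print(intensity)  # 0.4 tCO2e per MWh
```

The same ratio applies in every sector — only the numerator scopes and the denominator metric change, per the comparison table below.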
🔔 NOTE
There is one exception. For industries that need to be phased out entirely (like coal mining), TPI measures absolute emissions rather than intensity. We are not working with coal in this assignment, but the comparison table below includes it for context.
Emissions Scopes
In greenhouse gas accounting, emissions are broken down into three categories called “Scopes”.
Scope 1: Direct operational emissions. These come from sources a company owns or controls. For electrical utilities, this is the pollution from burning fossil fuels to generate electricity.
Scope 2: Indirect operational emissions. These come from the electricity, heat, or steam a company purchases to run its operations. For diversified miners, the energy needed for extraction and processing often comes from external power grids.
Scope 3: Value chain emissions. These are indirect emissions outside a company’s direct control, either upstream (from suppliers) or downstream (from customers using the products). For food producers, nearly 95% of assessed emissions come from upstream agricultural inputs. For diversified miners, downstream emissions from processing and burning sold products dwarf operational emissions.
The TPI Centre customises its assessment to focus on whichever scope represents the largest share of a sector’s climate impact.
How the sectors compare
| Feature | Food Producers | Diversified Mining | Electrical Utilities | Coal Mining |
|---|---|---|---|---|
| Numerator (Emissions) | Scope 1, 2 & Upstream Scope 3 (purchased agriculture) | Scope 1, 2 & Downstream Scope 3 (processing & use of products) | Scope 1 (direct from electricity generation) | Scope 1, 2 & Downstream Scope 3 (use of sold products) |
| Denominator (Activity) | Tonnes of agricultural inputs | Tonnes of Copper Equivalent (CuEq) | Megawatt hours (MWh) of electricity produced | None (absolute emissions index) |
| Benchmark Scenarios | 1.5°C, Below 2°C, 2°C | 1.5°C, Below 2°C, National Pledges | 1.5°C, Below 2°C, National Pledges | 1.5°C, Below 2°C, National Pledges |
| Distinctive Trait | Links complex food products back to raw agricultural inputs | Uses market prices to equate different commodities | Features regional benchmarks (OECD vs non-OECD) | Tracks absolute phase-out rather than intensity improvement |
More detail on each sector
Electrical Utilities (regional approach). Emissions intensity is greenhouse gas emissions from electricity generation divided by MWh produced. Utilities are assessed using regional benchmarks: developed regions are expected to hit net zero by 2035, while non-OECD countries have until 2045.
Diversified Mining (value equivalence approach). Mining companies extract many different resources. Using physical weight alone would unfairly penalise high-volume commodity producers. TPI converts all production into “Copper Equivalent” (CuEq) based on market prices, creating a single value-weighted metric. The assessment weights downstream Scope 3 emissions heavily.
Food Producers (supply chain approach). Because 80% of food sector emissions come from agriculture, TPI measures intensity as tonnes of CO2 equivalent per tonne of agricultural inputs purchased. Instead of measuring output of processed foods, TPI translates production back to raw agricultural commodities.
Coal Mining (absolute phase-out approach, for reference only). Unlike the other sectors, coal mining is assessed on absolute emissions rather than intensity. The industry cannot simply become more efficient; it must undergo a managed phase-out. Production needs to drop over 90% by 2050.
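To illustrate the Copper Equivalent conversion described above (the reference prices and tonnages below are invented; TPI's actual methodology uses its own reference prices):

```python
# Invented reference prices in USD per tonne -- illustrative only.
PRICES = {"copper": 9_000, "iron_ore": 100, "nickel": 16_000}


def copper_equivalent(production_tonnes: dict[str, float]) -> float:
    """Convert a mixed production profile into tonnes of Copper Equivalent (CuEq)."""
    total_value = sum(PRICES[commodity] * t for commodity, t in production_tonnes.items())
    return total_value / PRICES["copper"]


# A hypothetical miner producing 50 Mt of iron ore and 0.5 Mt of copper:
print(copper_equivalent({"iron_ore": 50_000_000, "copper": 500_000}))
```

Value-weighting means a tonne of a cheap bulk commodity counts for far less CuEq than a tonne of copper, which is exactly the fairness property the approach is after.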
📖 Further reading: TPI’s methodology reports | TPI Centre corporates page
♟️ A Tactical Plan (W07-W10)
| Week | Focus | What to aim for |
|---|---|---|
| W07 | Pipeline design & setup | Accept the PS2 repository and clone it. Choose your sector and companies. Download the PDFs. Build a skeleton pipeline.py with Click (as practised in 💻 W07 Lab). Test it on GitHub Actions (in case you want to use it for the final submission). Read the TPI methodology to understand what data you need to find. |
| W08 | PDF extraction & embeddings intro | Extract text from your PDFs using unstructured. Inspect what comes out. Identify where emissions data appears in the documents (tables? paragraphs? infographics?). Start experimenting with Word2Vec and transformer embeddings. |
| W09 | Chunking & vector search | Decide on a chunking strategy. Generate embeddings for your chunks. Build a retrieval step that finds relevant chunks for a given question. Evaluate whether the right chunks are coming back. |
| W10 | Generation & evaluation | Add the language model generation step. Construct prompts that use retrieved chunks to answer the driving questions. Evaluate your results. Document where and why the pipeline fails. Polish README.md and CONTRIBUTING.md. |
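The W07 skeleton `pipeline.py` might start roughly like this (stage names and options are placeholders for your own design, assuming `click` is installed):

```python
import click


@click.group()
def cli() -> None:
    """Carbon Performance RAG pipeline (skeleton)."""


@cli.command()
@click.option("--sector", required=True, help="Sector folder to process.")
def ingest(sector: str) -> None:
    """Stage 1: collect the PDFs for the chosen sector."""
    click.echo(f"Ingesting PDFs for sector: {sector}")


@cli.command()
def extract() -> None:
    """Stage 2: extract text from the collected PDFs."""
    click.echo("Extracting text...")


# Usage from the terminal:
#   python pipeline.py ingest --sector "Electrical Utilities"
#   python pipeline.py extract
```

One subcommand per stage keeps the stages atomic and independently re-runnable, in line with the W07 design principles.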
The deadline is Thursday 26 March 2026, 8pm UK time. This falls in W10, so you have W07-W09 for development and the first half of W10 for polish. Start early to make the most of the time available.
✔️ How We Will Grade Your Work
We value the process that leads to a strong final product, not just the product itself. Higher marks reward depth of understanding, clarity of documentation, and honest evaluation of what works and what does not. Adding more companies or more pipeline stages without clear reasoning does not score higher.
Over-engineered solutions, unnecessary abstractions, and verbose documentation will be marked down, not up. The same rubric philosophy from ✍️ Problem Set 1 applies here.
📥 Pipeline Implementation (0-40 marks)
Pipeline architecture, PDF extraction, chunking, embedding, retrieval, and generation
| Marks | Level | Description |
|---|---|---|
| <16 | Poor | Pipeline does not run, or code has fundamental errors that prevent any meaningful output. No evidence of engagement with the PDF extraction challenge. |
| 16-19 | Weak | Pipeline runs but produces minimal useful output. PDF extraction attempted but chunking, embedding, or retrieval is missing or broken. Little connection to what was taught in W08-W10. |
| 20-23 | Fair | Pipeline works end-to-end but with notable weaknesses. Extraction produces usable text. Embeddings generated. Retrieval returns results, but quality is poor or unevaluated. |
| 24-27 | Good | Competent pipeline. PDF extraction handles common cases. Chunking strategy is reasonable and justified. Retrieval returns relevant chunks for at least some questions. Language model produces structured answers. Code is clean and follows course engineering principles. |
| 28-31 | Very Good! | Clean pipeline with thoughtful decisions at each stage. Chunking preserves useful metadata. Embedding model choice is justified. Retrieval quality is evaluated. Language model prompts are well-constructed. Code is lean and maintainable. |
| 32+ | Distinction | Exceptional implementation. Sophisticated handling of difficult documents (tables, multi-page data). Retrieval strategy goes beyond naive similarity search. Prompt engineering shows iteration and refinement. Pipeline is reproducible, well-documented, and every component earns its place. |
📊 Evaluation & Analysis (0-30 marks)
How you assess what works, what fails, and why
| Marks | Level | Description |
|---|---|---|
| <12 | Poor | No evaluation of pipeline outputs. No discussion of failure cases. Results presented without interpretation. |
| 12-14 | Weak | Some evaluation attempted but superficial. Failures noted but not explained. No comparison of approaches or models. |
| 15-17 | Fair | Evaluation present with some useful observations. Some failure cases explained. Limited exploration of why retrieval or generation fails for specific questions. |
| 18-20 | Good | Solid evaluation. Clear documentation of what works and what does not. Failure analysis connects to specific pipeline decisions (chunking too coarse, wrong scope extracted, table data lost in extraction). Comparison between at least two approaches (e.g. different embedding models or chunking strategies). |
| 21-23 | Very Good! | Thoughtful evaluation with genuine insight. Analysis of which document types or question types cause failures. Evidence of iterating on the pipeline based on evaluation results. Honest about limitations. Comparison with commercial API results if attempted. |
| 24+ | Distinction | Exceptional analysis. Systematic evaluation across companies and question types. Insights about domain-specific challenges (e.g. scopes matter differently per sector, table extraction degrades specific metrics). Clear recommendations for improvement. |
🧐 Documentation & Engineering Practice (0-30 marks)
README, CONTRIBUTING, code quality, reproducibility, and collaboration readiness
| Marks | Level | Description |
|---|---|---|
| <12 | Poor | README missing or unhelpful. No setup instructions. Code is not reproducible. No evidence of version control discipline. |
| 12-14 | Weak | README exists but is generic or verbose. CONTRIBUTING missing. Code runs but setup is unclear. Commit history shows bulk commits rather than incremental progress. |
| 15-17 | Fair | README explains what the pipeline does and how to run it. Some documentation of decisions. Code is organised but has room for improvement. |
| 18-20 | Good | Clear README for users and CONTRIBUTING for developers. Pipeline decisions documented with reasoning. Code follows course engineering principles. Regular commits. Type hints where appropriate. |
| 21-23 | Very Good! | README and CONTRIBUTING are practical and minimal. Documentation earns its place without being verbose. AGENTS.md or equivalent shows thoughtful AI collaboration rules. Code is clean enough that another student could extend it. |
| 24+ | Distinction | Exceptional documentation and engineering. Someone could fork this repository and build on it. Professional-grade README. CONTRIBUTING explains architectural decisions. Every file earns its place. |
📮 Need Help?
- Post questions in the `#help` Slack channel.
- Book office hours via StudentHub.
- 🤖 Use the DS205 AI Tutor for help with code, concepts, and debugging.
Start with one company. Get text out of one PDF. See what the data looks like. Build from there.
🔗 Connection to Final Project
The skills you develop here directly support the final group project (W11 onwards, due Spring Term), where you will build more complex RAG systems for TPI research questions with teammates. Think of Problem Set 2 as building your individual competency; the final project tests whether you can apply it collaboratively at larger scale.
Appendix | Reference Links
Course links
- 🖥️ W07 Lecture
- 💻 W07 Lab
- 📓 Syllabus
- 🤖 DS205 AI Tutor
Slack
TPI Centre
- TPI corporates page
- Methodology report (click on the “Methodology” button)
- Sector methodology notes (search for the sector you chose)
Pipeline concepts
Footnotes
1. That is, your pipeline should give me a way (an API? a terminal command built with `Click`? a GUI? you choose) where I can pose those questions to your pipeline and get a useful answer. ↩︎
2. If you love web scraping, you can try to write a cron job that runs a spider/Selenium script every X days to track new documents. ↩︎