DS205 2025-2026 Winter Term

📦 Final Project (40%)

2025/26 Winter Term

Published 30 March 2026

The final assignment of the course is a group project. You will work in teams of 3–4 students to develop a data product, or to conduct benchmarking that improves an existing data product, in partnership with the TPI Centre. Some projects are closely related, and groups are very much free to talk to each other, share ideas, and learn from one another.

As with ✍️ Problem Set 2, this assignment is open-ended to some degree, as you will be developing solutions to a real-world problem. We do not have model solutions for these projects. When grading, I will look above all at how you planned and approached the problem, and how you documented your development process and discoveries.

Deadline Tuesday 26 May 2026, 8 pm UK time
💎 Weight 40% of the final grade (group submission)
👥 Teams 3–4 students per group
📤 Submission Via your existing GitHub Classroom repository

Forming groups and choosing a project (W11)

We form groups during the Week 11 lecture.

  1. Choose a group name your team is happy to keep for the rest of the term.

    💡 This will be a public repository name, so choose something creative and unique.

  2. Use the GitHub Classroom group-assignment link:

    • Every member must join the team so their GitHub username appears on the Classroom roster. That is how I verify who is in which group.
    • If anyone is not in a team by Wednesday morning, I will assign students manually.

Project preferences (bidding)

After your group exists on Classroom, submit your ranked preferences on the SharePoint Excel sheet (link shared in the lecture). One row per group, not per student. Typical columns:

  • Group name: the name shown in GitHub Classroom
  • First choice: a project code (A–F); for Project A you may add your sector of choice: Food / Utilities / Mining
  • Second choice: another project code
  • Third choice: optional
  • Notes: optional context

Later, after the lecture, I will allocate projects using what I have seen from ✍️ Problem Set 1 and ✍️ Problem Set 2, matching groups to projects where I think you can succeed best.

Requirements for all teams

These expectations apply to every project. Each project card in the section below adds only what is specific to that brief.

Documentation (repository docs)

Every group must ship:

  • README.md for users: people who can follow install and run instructions and, at best, replicate your work, but who do not know your codebase well enough to fix bugs or add features. Say how to run the app or reproduce the benchmark, what the repo contains, and how to get help. Where the deliverable includes a UI, include at least one screenshot or short GIF so a reader can see it working.
  • CONTRIBUTING.md for developers: how the repo is organised, how to set up a dev environment, how to run tests, and any system dependencies (Docker, OCR engines, API keys in .env, and so on).
  • DECISIONS.md for architectural choices and alternatives you considered. If you attend meetings with Sylvan or anyone else from the TPI Centre, record key decisions and outcomes here. If this grows too long, feel free to add a DECISIONS_LOG.md file to keep track of the evolution of discussions, with dates and summaries of the decisions made. Then, keep just the final decisions in DECISIONS.md.

Individual projects may require extra files (for example BENCHMARK.md for Projects A and B1-B2, a token spend summary for C, Docker notes for D, an AI usage log for E, or run logs for F). Those are listed in the relevant card, not repeated here.

Engineering and tooling

  • Python quality: Typed Python, Ruff for linting, and the logging module writing to a log file (not print() calls to the console), unless a project brief explicitly allows otherwise. A minimal logging setup is sketched after this list.
  • Data stores: Either build neat ‘data lakes’ with structured files that follow a consistent naming convention, or use a database (SQLite, ChromaDB, etc.), unless the project specifies something else (for example, Project D uses PostgreSQL with pgvector via Docker).
  • Tests: You may use pytest if you want to guarantee that future changes to the codebase do not change expected behaviour. It is advisable to set up a test suite, ideally on GitHub Actions, so tests run automatically on every push to the main branch.
  • Agentic coding: Use assistants as you like, but stay in charge of design and maintainability. Optional but useful: a shared .github/copilot-instructions.md or AGENTS.md so the whole team uses the same conventions.
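As an illustration of the logging expectation, a minimal setup might look like this (the log file path and format are placeholders, not a required convention):

```python
import logging
from pathlib import Path

LOG_FILE = Path("logs/pipeline.log")  # placeholder location; pick whatever suits your repo
LOG_FILE.parent.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    filename=LOG_FILE,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("Starting extraction for %s", "example.pdf")  # goes to the log file, not the console
```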

Collaboration and repo practice

  • Track work with GitHub Issues and/or a Project Board.
  • Small, low-risk changes can go straight to main, but for larger or riskier changes, use a branch and pull request so others can review before merging.
  • You are encouraged to learn from other teams (including other project codes), but do not copy another group’s code wholesale, otherwise your group’s solution won’t shine! Sharing decisions and trade-offs in conversation is fine.

CLEAR reference repository

Everyone will have read-only access to Sylvan’s CLEAR codebase as a reference for patterns and tech stack choices.

What we will reward

Because everyone in the cohort is using agentic AI coding tools, adding new or complex features does not necessarily make us go “WOW” when looking at your project. I will be looking mostly for evidence of thoughtful planning, reproducible engineering, clear documentation of process, and evidence of teamwork in the repository. More code does not necessarily equal better!

When your AI coding tool drops way too much code and you have no idea what’s going on, consider: can you demonstrate that what you have built/vibe-coded is as lean as possible and still intelligible to human reviewers? Does it follow any existing software engineering principles or best practices? What kind of websearch would help you check that it does? How does it relate to things you’ve seen in the course? Does that code solve a real problem or are you just adding more “stuff” to impress the marker?

The section How we will grade will be updated with the detailed rubric soon.

What are the project options?

The projects range from simpler extensions of Problem Set 2 to more ambitious briefs that require you to design and implement a new tool or service.

Click on each project to expand the full specification.

Project A

Structured Data Extraction from TPI Carbon Performance PDFs

Turn assessment tables into queryable structured data with a browsable Streamlit interface

Number of groups (parallel): up to 3

Background

The goal of this project is to build a data product in which users can upload a PDF (any PDF) and the pipeline extracts its tables, presents them in a unified structured form, and makes them browsable.

The pipeline you built in ✍️ Problem Set 2 treated every PDF as a bag of text chunks, but as you have seen, many of the PDFs used to assess the carbon performance of corporations contain structured tables: emissions intensity by year, scope breakdowns, benchmark comparisons, target commitments, etc. Tables in these PDFs frequently span page breaks, with footers, page numbers, and repeated column headers splitting the content mid-row. Your pipeline must detect and handle this automatically.

Core engineering challenge. Detecting that a table continues across a page break, stripping the noise (footers, page numbers, repeated headers), and merging the fragments into a single coherent table with metadata recording which pages it came from. How will you check that it works for most reasonable PDFs?
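One way to prototype that merge step is sketched below. It assumes pdfplumber (any extraction library is fine) and uses a deliberately naive heuristic, a repeated header on the very next page signals a continuation, which you would refine and benchmark:

```python
import pdfplumber

def extract_merged_tables(pdf_path: str) -> list[dict]:
    """Extract tables and merge fragments that continue across page breaks."""
    tables: list[dict] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for raw in page.extract_tables():
                if not raw:
                    continue
                header, rows = raw[0], raw[1:]
                previous = tables[-1] if tables else None
                # Naive continuation check: same header repeated on the very next page.
                if previous and previous["pages"][-1] == page_number - 1 and header == previous["header"]:
                    previous["rows"].extend(rows)
                    previous["pages"].append(page_number)
                    previous["merged"] = True
                else:
                    tables.append({"header": header, "rows": rows,
                                   "pages": [page_number], "merged": False})
    return tables
```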

What to build

A pipeline with two separable parts: an offline extraction stage and a Streamlit interface.

  1. About the extraction stage: You are free to share the extraction code you built in Problem Set 2 within your group and use or modify it as you see fit. The extraction stage takes a PDF, finds all tables, merges tables that span page breaks, and writes the results to a sensible structured store (SQLite or well-organised files); a storage sketch follows this list. Each extracted table should also carry metadata: page range, confidence score, whether it was merged, and a link to the surrounding non-tabular text.

  2. About the Streamlit app: The Streamlit app should let a user upload a PDF or select a pre-processed document, then browse extracted tables via a dropdown. Selecting a table renders it as a scrollable, filterable dataframe. A collapsible right-hand panel shows the relevant PDF page(s) as an image. Non-tabular text surrounding the table should be accessible, not discarded.
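For the structured storage mentioned in step 1, a minimal SQLite layout could look like this (the schema and field names are suggestions only; it assumes table dicts shaped like the merge sketch above):

```python
import json
import sqlite3

def save_table(db_path: str, doc_id: str, table: dict,
               confidence: float, surrounding_text: str) -> None:
    """Persist one extracted table together with the metadata the brief asks for."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS extracted_tables (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               doc_id TEXT,
               page_start INTEGER,
               page_end INTEGER,
               was_merged INTEGER,
               confidence REAL,
               header_json TEXT,
               rows_json TEXT,
               surrounding_text TEXT
           )"""
    )
    con.execute(
        "INSERT INTO extracted_tables "
        "(doc_id, page_start, page_end, was_merged, confidence, header_json, rows_json, surrounding_text) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (doc_id, table["pages"][0], table["pages"][-1], int(table["merged"]),
         confidence, json.dumps(table["header"]), json.dumps(table["rows"]), surrounding_text),
    )
    con.commit()
    con.close()
```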

Target interface

A reference mockup called TableScope will be shown in the W11 lecture. Build toward that design, or feel free to design your own interface in Streamlit. The dropdown navigates between tables, the sidebar shows the corresponding PDF page(s), and the table footer records provenance. Tables merged across pages are labelled as such in the selector.

Deliverables

  • Pipeline as a Python script or FastAPI API (choose whichever seems simpler and more appropriate). A reproducible extraction pipeline runnable from the command line or as an API endpoint: it accepts a PDF path and outputs structured tables with metadata.
  • Streamlit app. A browsable interface that allows a user to upload a PDF and explore the extracted tables interactively. Design it for non-technical users, so it should be easy to understand and use. Provide a way to download extracted tables as a CSV file, and show somewhere in the interface where in the PDF each table was found.
  • Benchmark report. Select a good number of PDFs from the relevant sector (the more the better), run the pipeline on them, check that it worked (ideally via manual human review), and report on the performance. You define the benchmark methodology, but consider: How many tables were detected? How many merges were correct? Where does the pipeline fail, and why?
  • Documentation. Meet the Requirements for all teams for README.md, CONTRIBUTING.md, and DECISIONS.md. For this project you must also include BENCHMARK.md reporting on benchmark performance.

Sector allocation

We will allow up to three groups to take this project independently, focusing on one of the three sectors: Food Producers, Electrical Utilities, or Diversified Mining. The PDFs used by the TPI Centre to assess carbon performance of corporations in these sectors are the same ones we used in ✍️ Problem Set 2 and are available on SharePoint.

Tools and constraints

General engineering, documentation, and collaboration rules are in Requirements for all teams.

Project-specific:

  • You choose the tech stack, but work that stays close to the TPI Centre stack or that would be easy to maintain alongside Sylvan’s upstream project will be rewarded. Use the CLEAR repository as a reference, not a copy-paste source.
  • You do not need retrieval or generation unless you see a clear benefit.
  • If you use a database, use SQLite (or ChromaDB only if you use embeddings).

Benchmark track (two related projects)

Project B1

PDF Ingestion Strategy Benchmark

If you turn each PDF into Markdown before chunking, does retrieval improve on TPI documents?

Number of groups (parallel): 2

Background

If we convert PDFs into Markdown before chunking, does retrieval improve on TPI documents? The goal of this project is to empirically evaluate this question.

In ✍️ Problem Set 2, every pipeline started the same way: feed a PDF into unstructured, get back a list of elements, chunk them. It works, but has known failure modes: inconsistent element boundaries, dropped table content, and variable behaviour across different PDF layouts. One alternative is to turn the PDF into Markdown first, then chunk that text. Markdown output is cleaner and more uniform.

Core engineering challenge: Designing a fair comparison. Ingestion strategy is the one variable you are testing, and chunking parameters, embedding model, and evaluation method must stay constant across all conditions. Build the reference answer set before running any pipeline condition.

Conditions to compare

  • Baseline: unstructured direct PDF extraction with char-limit chunking, as taught in W09. (It’s fine to vary the number of characters and generate several baselines if you want to.)
  • Condition 1: Use docling to turn the PDF into Markdown, then apply heading-delimited or char-limit chunking on that Markdown.
  • Condition 2: Use markitdown (Microsoft) to turn the PDF into Markdown, then chunk the same way as in Condition 1.
  • Optional Condition 3: unstructured with hi_res mode and infer_table_structure=True, if you want to push what unstructured itself can do.

All conditions use the same embedding model (multi-qa-MiniLM-L6-cos-v1), the same ChromaDB setup, and the same reference answer set.
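Conditions 1 and 2 can be prototyped with a few lines each. This sketch assumes recent docling and markitdown APIs (double-check against the current documentation) and a placeholder input path:

```python
from pathlib import Path

from docling.document_converter import DocumentConverter  # Condition 1
from markitdown import MarkItDown                         # Condition 2

def pdf_to_markdown_docling(pdf_path: str) -> str:
    """Convert a PDF to Markdown with docling."""
    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_markdown()

def pdf_to_markdown_markitdown(pdf_path: str) -> str:
    """Convert a PDF to Markdown with Microsoft's markitdown."""
    return MarkItDown().convert(pdf_path).text_content

if __name__ == "__main__":
    pdf = "data/pdfs/example_report.pdf"  # placeholder path
    out = Path("data/markdown")
    out.mkdir(parents=True, exist_ok=True)
    (out / "docling.md").write_text(pdf_to_markdown_docling(pdf))
    (out / "markitdown.md").write_text(pdf_to_markdown_markitdown(pdf))
```

Whatever chunker you apply to the resulting Markdown (heading-delimited or char-limit) must be identical across both conditions, so that the converter remains the only variable.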

What to build

A reproducible benchmark pipeline where each condition is a separately runnable stage producing a ChromaDB collection (in the same ChromaDB database). A shared evaluation script queries all collections with the same questions and computes Recall@K for each condition. No manual steps between PDF and results table.
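A sketch of what that shared evaluation script could look like, assuming the collections already exist and a manually built reference set where each entry records a substring that identifies the correct chunk (collection names and the match rule are illustrative):

```python
import chromadb
from sentence_transformers import SentenceTransformer

K = 5
CONDITIONS = ["baseline_unstructured", "docling_markdown", "markitdown_markdown"]  # placeholder names
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # the shared embedding model

REFERENCE = [
    # Built manually before any pipeline run.
    {"question": "What was the company's reported 2022 emissions intensity?",
     "expected_substring": "tCO2e"},
]

client = chromadb.PersistentClient(path="chroma_db")

for condition in CONDITIONS:
    collection = client.get_collection(condition)
    hits = 0
    for item in REFERENCE:
        query_embedding = model.encode(item["question"]).tolist()
        results = collection.query(query_embeddings=[query_embedding], n_results=K)
        top_chunks = results["documents"][0]
        # Illustration-only match rule: the expected text appears in one of the top-K chunks.
        if any(item["expected_substring"] in chunk for chunk in top_chunks):
            hits += 1
    print(f"{condition}: Recall@{K} = {hits / len(REFERENCE):.2f}")
```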

The output is a written report in any format you prefer (Markdown, Quarto, Jupyter notebook). Cover: what was compared, how the reference answer set was built, Recall@K results per condition, qualitative observations about chunk legibility, and a recommendation for which strategy to use on TPI CP documents.

Deliverables

  • Benchmark pipeline and evaluation. Reproducible stages per condition, shared evaluation script, and Recall@K results as described above.
  • Written report. What you compared, how the reference set was built, results per condition, legibility notes, and a clear recommendation for TPI CP documents.
  • Documentation. Meet the Requirements for all teams for README.md, CONTRIBUTING.md, and DECISIONS.md. Include BENCHMARK.md with your methodology and results (or fold that content into the main report if you prefer one long document, and if you do, say so in the README).

Reference answer set

Build this manually before running any pipeline condition. Pick questions, read the PDFs directly to identify the correct source chunk for each, and record the expected chunk text or page location. Recall@K then measures whether the correct chunk appears in the top K retrieved results.

Cover companies from at least two sectors to check whether results hold across different document layouts.

If you ask me “Which questions to ask and test?” or “How many documents should I use?”, my answer will be: you decide! But the more the better!

Secondary questions for the report

  • After you turn PDFs into Markdown, do you get more legible chunks? Show examples side by side.
  • Does conversion reduce average chunk length, and by how much?
  • How does each tool handle tables? Show the same table extracted by each method.
  • Which tool is faster on a 20-page PDF vs a 60-page PDF?
  • Are there PDF types where one approach clearly fails?

Tools and constraints

General engineering, documentation, and collaboration rules are in Requirements for all teams.

Project-specific:

  • Do not change the embedding model between conditions. The variable under test is the ingestion strategy, not the embedding model.
  • The pipeline must be fully reproducible from a clean clone. Document any additional system dependencies (e.g. Tesseract, poppler) in CONTRIBUTING.md.

💡 TIP: It is a good idea for B1 and B2 groups to nominate one contact each to check in on each other informally and learn from each other. If B1 finds that turning PDFs into Markdown with a particular tool works well under a specific chunking approach, that information will be useful for the B2 group testing that approach. If B2 finds a chunking configuration that outperforms the Week 9 baseline, B1 would appreciate hearing about it too.

Project B2

Retrieval Configuration Benchmark: Chunking Strategies and Reranking at Scale

Does the simplest strategy still win when you test it across the full TPI dataset?

Number of groups (parallel): 2

Background

Does the W09 baseline hold when you test it at scale, across the whole TPI dataset? The goal of this project is to find out empirically.

In W09, char-limit chunking with cross-encoder reranking outperformed every fancier strategy on a single Ajinomoto PDF. That shaped the baseline you used in ✍️ Problem Set 2. But it was one document. CLEAR uses the same reranking approach in production across all its entity types. Neither result tells us whether these choices hold across the full TPI CP dataset, with multiple sectors, varying document lengths, different table densities, and inconsistent layouts. This project tests both variables systematically.

Core engineering challenge: Keeping everything else constant while varying two independent factors across a dataset large enough to be meaningful. Build the reference answer set before any pipeline run. If the experimental design is flawed, the numbers mean nothing.

Factors to vary

  • Chunking strategy: char-limit (the W09 baseline), heading-delimited sections, and sliding window with overlap. All applied to the same unstructured extraction output so ingestion is not a confound.
  • Reranking: bi-encoder retrieval only vs bi-encoder followed by cross-encoder reranking (ms-marco-MiniLM-L-6-v2, the model used in CLEAR). This gives six conditions in a 3x2 grid.

All conditions use the same embedding model (multi-qa-MiniLM-L6-cos-v1) and ChromaDB setup.
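The reranking factor can be prototyped along these lines (retrieval of the candidate pool from ChromaDB is elided; the cross-encoder model name is the one given in the brief):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidate_chunks: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, chunk) pair with the cross-encoder and keep the best top_k."""
    scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Bi-encoder-only condition: take the top_k chunks straight from the ChromaDB query.
# Reranked condition: retrieve a larger candidate pool (say 20) and pass it through rerank().
```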

What to build

A benchmark pipeline where each condition is a separately runnable script producing a ChromaDB collection. A shared evaluation script queries all collections with the reference questions and computes Recall@K and median query latency per condition. Results are written to a structured output (CSV or JSON) that feeds directly into the report.

The report can be in any format you prefer. Cover: experimental design, how the reference answer set was constructed, the full results table, the latency vs recall trade-off, and a recommendation with a clear rationale. Where results are ambiguous or dataset-dependent, say so.

Deliverables

  • Benchmark pipeline and evaluation. Separate runnable scripts per condition, shared evaluation, structured results feeding the report.
  • Written report. Experimental design, reference set construction, full results, latency vs recall, and a recommendation grounded in the numbers.
  • Documentation. Meet the Requirements for all teams for README.md, CONTRIBUTING.md, and DECISIONS.md. Include BENCHMARK.md with your methodology and results (or fold into the main report, and if you do, say so in the README).

Reference answer set

Build it manually before running any condition. Cover at least two sectors and include both short factual questions (“what was the 2022 intensity figure?”) and longer contextual ones (“what benchmark scenario is the company assessed against?”).

If you ask me “How many question-answer pairs should I use?”, my answer will be: you decide! Aim for something in the range of 30–50, but the more the better.

Questions the report should address

  • Does reranking consistently improve Recall@K, or only for certain question types?
  • What is the median latency cost of reranking per query on Nuvolos hardware?
  • Does the winning chunking strategy from W09 hold at scale, or does a different strategy win on certain sectors or document types?
  • Is there one combination that clearly dominates, or does the answer depend on context?
  • What would you recommend to Sylvan for CLEAR’s retrieval configuration, and why?

Tools and constraints

General engineering, documentation, and collaboration rules are in Requirements for all teams.

Project-specific:

  • Do not change the embedding model or ingestion tool between conditions.
  • The pipeline must be fully reproducible from a clean clone.

💡 TIP: It is a good idea for B1 and B2 groups to nominate one contact each to check in on each other informally and learn from each other. If B2 finds a chunking configuration that outperforms the Week 9 baseline, B1 would appreciate hearing about it too. And vice versa.

Project C

Multi-Step RAG: Decomposed Reasoning vs Single-Shot Generation

Break complex questions into atomic model calls, persist intermediate results, then measure whether it helps

Number of groups (parallel): 1

Background

Does breaking a complex Carbon Performance question into smaller sub-questions, answered one at a time, produce better or more trustworthy results than a single-shot RAG call? The goal of this project is to find out empirically.

A standard RAG pipeline answers a question in one shot: retrieve chunks, build a prompt, generate an answer. That works for simple factual queries. It struggles with questions like “Has this company’s Carbon Performance alignment improved since 2019, and is its current trajectory consistent with the 1.5°C benchmark?” because the model has to retrieve, compare, reason across time, and synthesise findings in a single call. When it fails, you cannot tell where.

Multi-step workflows decompose a complex question into sub-criteria. Each sub-criterion gets its own retrieval and generation call. Intermediate results are written to a database. A final assembly call reads those stored outputs and produces the overall answer. Each call does one thing. Each intermediate result is inspectable. CLEAR currently uses single-shot generation with a capable model. This project tests whether decomposition produces better or more trustworthy answers, and at what cost.

Core engineering challenge: Designing the decomposition. Break a complex question into sub-criteria that the retrieval pipeline can answer independently, that combine into the full answer, and that are not so granular that the assembly step brings back the same failure mode. Justify your decomposition in the report.

What to build

Two pipelines operating on the same TPI CP documents, answering the same set of complex questions.

  • Single-shot baseline: Standard RAG: retrieve top-K chunks, build one prompt, generate one answer. Run with a capable open-weight model via NEBIUS.
  • Multi-step pipeline: For each complex question, define a decomposition into sub-questions. Each sub-question runs its own retrieval and generation call. Intermediate answers are written to a database. A final assembly call reads from the database and produces the overall answer.

The database is not optional. Intermediate results must be persisted between steps, not passed in memory, so the pipeline can be resumed, inspected, and audited. This is the principle Sylvan described: save to the database, then re-read it to assemble the final result.
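A compact sketch of what one decomposed step might look like, with intermediate results persisted to SQLite and token usage recorded per call. The NEBIUS base URL, model identifier, and the retrieve() helper are placeholders; check the NEBIUS documentation and your own pipeline for the real values:

```python
import os
import sqlite3

from openai import OpenAI

# NEBIUS exposes an OpenAI-compatible API; the endpoint and credentials live in .env
client = OpenAI(base_url=os.environ["NEBIUS_BASE_URL"], api_key=os.environ["NEBIUS_API_KEY"])
MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # placeholder model identifier

con = sqlite3.connect("multistep.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS sub_answers ("
    "question_id TEXT, sub_question TEXT, answer TEXT, "
    "prompt_tokens INTEGER, completion_tokens INTEGER)"
)

def answer_sub_question(question_id: str, sub_question: str, retrieve) -> str:
    """One atomic call: retrieve context, generate, persist the result and token spend."""
    context = "\n\n".join(retrieve(sub_question))  # retrieve() is your own top-K retrieval helper
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {sub_question}"},
        ],
    )
    answer = response.choices[0].message.content
    con.execute(
        "INSERT INTO sub_answers VALUES (?, ?, ?, ?, ?)",
        (question_id, sub_question, answer,
         response.usage.prompt_tokens, response.usage.completion_tokens),
    )
    con.commit()
    return answer

# The final assembly call re-reads sub_answers for a question_id and synthesises the overall answer.
```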

Compute budget

This group receives $100 USD of NEBIUS compute tokens to run open-weight models that are not feasible on Nuvolos. Good starting points include Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct, or Mistral-Large. The budget covers both pipelines. Log token usage per run and include a spend summary in the report. Commercial APIs (OpenAI, Anthropic, Gemini) are not permitted. NEBIUS provides access to open-weight models via an OpenAI-compatible client.

Questions to evaluate on

Select complex Carbon Performance questions that genuinely require multi-step reasoning and are answerable from the PDF content. Establish ground-truth answers manually before running either pipeline. Suggested types:

  • Trajectory questions: “Has [Company X]’s emissions intensity improved consistently since [year], and by how much?”
  • Comparative questions: “Which company in the Food Producers sector is closest to the 1.5°C benchmark in the most recent assessment year?”
  • Change-over-time questions: “What has changed in [Company X]’s stated emissions targets between its 2020 and 2023 assessments?”

If you ask me “How many questions should I use?”, my answer will be: you decide! But use at least five, and the more the better.

What the benchmark should measure

  • Answer correctness: Score each answer against the ground truth (correct, partially correct, or incorrect) independently for each pipeline.
  • Answer faithfulness: Trace each claim in the final answer back to a source chunk. Is the answer grounded in retrieved content?
  • Inspectability: For the multi-step pipeline, can you identify which sub-step produced an incorrect intermediate result? Show an example where this mattered.
  • Latency and token cost: Total wall-clock time and token consumption per question for each pipeline. Multi-step will cost more, so report it honestly.

What the report should conclude

A direct comparison with a recommendation addressed to Sylvan: does decomposition improve answer quality on these question types? Does it improve trustworthiness even where accuracy is similar? Is the additional complexity and cost justified for TPI’s use case? Ground the recommendation in the benchmark results.

Deliverables

  • Two pipelines and benchmark report. Single-shot baseline, multi-step pipeline with persisted intermediates, same question set and ground truth, metrics above, and a clear recommendation to Sylvan.
  • Documentation. Meet the Requirements for all teams for README.md, CONTRIBUTING.md, and DECISIONS.md. Summarise token spend per run in the report and document the NEBIUS / .env setup in CONTRIBUTING.md.

Tools and constraints

General engineering, documentation, and collaboration rules are in Requirements for all teams.

Project-specific:

  • NEBIUS API accessed via the OpenAI-compatible client. Open-weight models only.
  • SQLite is sufficient for storing intermediate results. Document the schema in CONTRIBUTING.md.
  • Both pipelines must be reproducible from a clean clone with NEBIUS credentials in .env.
  • Log token spend per run and summarise it in the report.

💡 TIP: The compute budget is fixed and does not roll over. Validate your pipeline logic on Nuvolos with a small model first, then switch to NEBIUS for the final benchmark runs. Do not burn tokens debugging.

Project D

Vector Store Refactoring: ChromaDB vs pgvector

Refactor the pipeline to support both backends, then measure what changes

Number of groups (parallel): 1

Background

ChromaDB and pgvector are both reasonable choices for a retrieval pipeline. Which one is actually better for TPI’s context, and by how much? The goal of this project is to make that comparison concrete.

DS205 this year taught ChromaDB. CLEAR uses PostgreSQL with pgvector, which was taught in DS205 last year. pgvector runs inside a full relational database: you can join vector search results with structured metadata in a single query, use Alembic for schema migrations, and share the same Postgres instance already running in production. ChromaDB is simpler to set up but sits outside the relational world. Neither choice is obviously wrong. This project makes the trade-offs concrete with measurements rather than intuition.

Skill to discover: Docker and Docker Compose. Docker was mentioned in passing in DS205 but not formally taught. This project requires it, because pgvector runs inside a Docker container managed by Docker Compose. You are expected to pick this up independently. Sylvan’s CLEAR repository is a working reference. The docker-compose.yml and Dockerfile.* files show exactly how he sets it up. Start there, then read the Docker documentation. Being comfortable learning tools the course did not cover is part of what this project is testing.

Core engineering challenge: Writing an abstraction layer that lets the pipeline switch between ChromaDB and pgvector via a single environment variable, without leaking backend details into the retrieval or generation code. If the calling code has to know which backend it is talking to, that coupling is itself a finding worth reporting.

What to build

Start from a pipeline in the same style as ✍️ Problem Set 2 (your own ✍️ Problem Set 2 submission is fine as a base). Introduce a typed vector store interface that both backends implement. The ChromaDB implementation follows the pattern you learned in ✍️ Problem Set 2. The pgvector implementation uses SQLAlchemy 2.x and the pgvector Python package, running inside Docker Compose following the CLEAR pattern (pgvector/pgvector:0.7.1-pg16).

The pipeline runs in either mode by setting VECTOR_STORE=chroma or VECTOR_STORE=pgvector. Before benchmarking, verify that both modes return identical results on the same query.
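One shape the abstraction layer could take, using a typed Protocol and an environment-variable factory (module names and method signatures are illustrative, not prescribed):

```python
import os
from typing import Protocol

class VectorStore(Protocol):
    """The interface both backends implement; retrieval code depends only on this."""

    def add(self, ids: list[str], documents: list[str],
            embeddings: list[list[float]], metadatas: list[dict]) -> None: ...

    def query(self, embedding: list[float], top_k: int = 5,
              where: dict | None = None) -> list[dict]: ...

def get_vector_store() -> VectorStore:
    """Pick the backend from VECTOR_STORE without leaking the choice to callers."""
    backend = os.environ.get("VECTOR_STORE", "chroma")
    if backend == "chroma":
        from stores.chroma_store import ChromaStore      # hypothetical module layout
        return ChromaStore()
    if backend == "pgvector":
        from stores.pgvector_store import PgVectorStore  # hypothetical module layout
        return PgVectorStore()
    raise ValueError(f"Unknown VECTOR_STORE value: {backend}")
```

The same pytest suite can then be parametrised over both backends by calling get_vector_store() under each value of the environment variable.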

Dimensions to benchmark

  • Query latency: Top-5 chunk retrieval for 20 representative queries. Three runs each, report median.
  • Ingestion throughput: Time to embed and store all chunks for a 20-page and a 60-page PDF.
  • Deployment complexity: Number of services required, Docker Compose line count, setup steps from a clean machine. Concrete counts, not impressions.
  • Code legibility: Line count of each backend implementation. Number of imports. Cognitive complexity score via radon.
  • Recall@5: On a reference answer set you build manually before running any condition.
  • Metadata filtering expressiveness: Filter by company, year, and sector at query time in each backend. Show the code side by side.

What the report should conclude

A recommendation report addressed to Sylvan: for a team of social science researchers running TPI pipeline queries on a university laptop, which backend is the better default, and at what point would the answer change? Tie the recommendation to your measurements and say what would flip it (for example a latency threshold, deployment cost, or filtering needs).

Deliverables

  • Dual-backend pipeline. Typed abstraction, ChromaDB and pgvector implementations, same pytest suite for both, benchmark dimensions above.
  • Written report. Recommendation to Sylvan with numbers, and say where you would revisit the choice if constraints changed.
  • Documentation. Meet the Requirements for all teams for README.md, CONTRIBUTING.md, and DECISIONS.md. Docker and Compose setup must be reproducible from CONTRIBUTING.md alone.

Tools and constraints

General engineering, documentation, and collaboration rules are in Requirements for all teams.

Project-specific:

  • pgvector runs via Docker Compose. Document the setup in CONTRIBUTING.md completely enough that someone on a fresh machine can reproduce it without asking questions.
  • Use SQLAlchemy 2.x and the pgvector Python package for the pgvector interface. No raw SQL for embedding operations.
  • The ChromaDB implementation must use PersistentClient and where-based metadata filtering, matching the pattern from ✍️ Problem Set 2.
  • The abstraction interface must be typed. Both implementations must pass the same pytest suite.

💡 TIP: This project is self-contained. Build your own reference answer set for Recall@5 independently of any other group.

Project E

Blue Skies Frontend: A New Web Interface for TPI Carbon Performance Data

Build a real FastAPI backend, then vibe the frontend with whatever tools you like

Number of groups (parallel): 1

Background

CLEAR runs as a Streamlit app and is already used by real analysts at the TPI Centre. What would a purpose-built web interface for the same data look like, designed from scratch for a policymaker or investor audience? That is the question this project asks you to answer by building it.

CLEAR is served from Sylvan’s CodeOcean VM. This project asks you to imagine what the future of CLEAR might look like and build accordingly. The output must be a working web application backed by a real API. The frontend can look like anything. The backend must serve retrieved chunks and generated answers from the TPI dataset over HTTP, built with FastAPI following the patterns taught in DS205.

Skill to discover: web frontend development. DS205 taught FastAPI for building APIs, not web frontend development. You are expected to pick up whatever frontend tooling you choose independently. AI-assisted development is explicitly encouraged for the frontend layer: Lovable, v0, Vercel, Cursor, Claude Code, and GitHub Copilot are all fair game. You own every part of the system you submit, regardless of the tools used. The backend must be Python with FastAPI to DS205 engineering standards. The frontend has no language constraint.

What to build

A two-layer system.

  • Backend API (FastAPI): Serves the full RAG pipeline over HTTP. Endpoints must cover at minimum query submission, chunk retrieval with metadata, and the generated answer. Design it so that a different frontend could replace the current one without touching the backend. A minimal endpoint sketch follows this list.
  • Frontend web application: Connects to the backend API. Technology choice is entirely yours. Make TPI Carbon Performance data accessible to someone who knows what Carbon Performance is but has never heard of ChromaDB.
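A minimal sketch of the query endpoint mentioned above (answer_question() stands in for your RAG pipeline and is hypothetical; the response model is one possible contract, not a required one):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="TPI Carbon Performance API")

class QueryRequest(BaseModel):
    question: str
    company: str | None = None

class Chunk(BaseModel):
    text: str
    source_document: str
    page: int

class QueryResponse(BaseModel):
    answer: str
    chunks: list[Chunk]

@app.post("/query", response_model=QueryResponse)
def submit_query(request: QueryRequest) -> QueryResponse:
    """Run the RAG pipeline and return the generated answer with its cited source chunks."""
    answer, chunks = answer_question(request.question, company=request.company)  # your pipeline
    return QueryResponse(answer=answer, chunks=[Chunk(**chunk) for chunk in chunks])
```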

What the frontend must demonstrate

  • A user can submit a natural language question about a TPI company and receive an answer with cited source chunks.
  • The source chunks are visible: the user can see where the answer came from.
  • The interface handles at least two of the four driving questions from ✍️ Problem Set 2, including one that requires temporal reasoning across multiple documents.
  • The application works in a fresh browser without any local setup by the user.

What the project is not

A Figma mockup or a static HTML page. The frontend must talk to a live backend. A backend that hardcodes responses is not acceptable. The RAG pipeline must run for real.

Design ambition

Do not rebuild CLEAR with a nicer theme. Think about who would use this and what they actually need: navigation across companies, comparison views, temporal change visualisation, citation transparency, sector filtering. I would rather see one or two interactions thought through end-to-end than a long list of half-finished features.

Deliverables

  • Backend API. FastAPI application with documented endpoints. Typed, linted, with at least basic integration tests. Deployable locally from a clean clone.
  • Frontend application. Live, accessible web application. Include the URL in the README. If hosted (Vercel, Render, etc.), it must be live at submission time.
  • Design rationale. A short section in the README or a separate doc: who the target user is, what interactions were prioritised, and what was deliberately left out.
  • AI usage log. Brief account of which AI tools were used for which parts. Frontend AI assistance is expected and fine, but the backend must be student-authored.
  • Documentation. Meet the Requirements for all teams for README.md, CONTRIBUTING.md, and DECISIONS.md. Include frontend build steps in CONTRIBUTING.md.

Tools and constraints

General engineering, documentation, and collaboration rules are in Requirements for all teams.

Project-specific:

  • Backend must be Python with FastAPI. No constraint on frontend language or framework.
  • Include frontend build instructions in CONTRIBUTING.md.
  • The RAG pipeline must use the tools taught in DS205: ChromaDB, sentence-transformers, HuggingFace generation. Commercial APIs are not permitted for retrieval or generation.

💡 TIP: Get the backend API working before building the frontend. A week of unblocked frontend work is more productive than two weeks where both layers are being built simultaneously without a stable contract between them.

Project F

Document Discovery: Automated Detection of New TPI-Relevant PDFs

Monitor sources, find new documents, hand them off to the pipeline without human intervention

Number of groups (parallel): 2

Background

Every pipeline in this course started with a PDF already in hand. Someone found it, downloaded it, and put it in the right folder. This project builds the layer that removes that manual step.

For TPI’s purposes, tracking corporate climate disclosures across hundreds of companies globally, manual downloading is the bottleneck. Companies publish new reports on irregular schedules. A live intelligence system that requires a human to notice and download new documents is not really live. This project builds the discovery layer: a system that monitors specific sources, detects new relevant documents, downloads them, and hands them off for ingestion, automatically, on a schedule.

Core engineering challenge: Defining “relevant” precisely enough to filter signal from noise at scale. A search for “annual report emissions” returns thousands of results. You need a filtering strategy (by domain, document structure, keyword presence, or some combination) and you need to evaluate its precision honestly: how many downloaded documents were actually useful, and how many were noise?

Two valid approaches (choose one or combine)

  • Search API approach: Use an external search API (Google Custom Search, Bing Web Search, DuckDuckGo, or Perplexity) to run targeted queries for new climate disclosure documents. Parse results, filter for PDFs from known company domains, download and deduplicate against what is already in the pipeline. Schedule via GitHub Actions cron.
  • Sitemap crawler approach: Build Scrapy spiders that traverse the sitemaps or investor relations pages of a defined list of TPI-tracked companies. Detect new PDFs by comparing against a stored manifest of previously seen documents. Download and flag new arrivals for ingestion. Schedule via GitHub Actions cron.

The two approaches can be combined: use the search API to discover unknown sources, use the crawler for known company sites where the structure is predictable. Justify your choice in the decisions log.

What to build

A scheduled discovery pipeline with three stages: discover (find candidate document URLs), filter (determine which are new and relevant), and hand off (download to a staging folder and log the result). The hand-off interface must be clean enough that an ingestion pipeline built like the one in ✍️ Problem Set 2 could consume it without modification. The discovery system and the ingestion system must be decoupled.
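The filter stage’s deduplication can be as simple as a stored manifest of URLs and content hashes, which also makes it easy to cover with pytest as the constraints below require (the file layout and field names are illustrative):

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("data/manifest.json")  # record of previously seen documents

def load_manifest() -> dict[str, str]:
    """Map of URL -> SHA-256 hash of the downloaded content."""
    return json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

def is_new(url: str, content: bytes, manifest: dict[str, str]) -> bool:
    """A candidate is new if neither its URL nor its content hash has been seen before."""
    digest = hashlib.sha256(content).hexdigest()
    return url not in manifest and digest not in manifest.values()

def record(url: str, content: bytes, manifest: dict[str, str]) -> None:
    """Add a downloaded document to the manifest and persist it for the next scheduled run."""
    manifest[url] = hashlib.sha256(content).hexdigest()
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps(manifest, indent=2))
```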

The scheduling mechanism is GitHub Actions with a cron trigger. The pipeline must run to completion in a clean environment on every scheduled run, logging what it found, what it filtered out, and why.

Scope constraint

Pick a realistic scope: two to four companies from one TPI sector, with a defined set of source URLs or search queries. A system that works reliably on a small, well-defined target scores higher than one that tries to cover everything and works inconsistently.

What the report should cover

  • Precision: over the evaluation period, what fraction of downloaded documents were genuinely new and relevant?
  • Coverage: were there documents that should have been found but were missed, and how do you know?
  • Latency: how quickly after publication does the system detect a new document?
  • Failure modes: what breaks the system, and how does it recover?
  • A recommendation for how this layer could be integrated into CLEAR as a production component.

Deliverables

  • Scheduled discovery pipeline. Discover, filter, hand-off as above, with GitHub Actions cron and inspectable logs after each run.
  • Written report. Cover the bullets under “What the report should cover” with evidence from your evaluation period.
  • Documentation. Meet the Requirements for all teams for README.md, CONTRIBUTING.md, and DECISIONS.md. Document GitHub Secrets for any search API keys in CONTRIBUTING.md, and keep run logs or summaries where a marker can see what each scheduled run did.

Tools and constraints

General engineering, documentation, and collaboration rules are in Requirements for all teams.

Project-specific:

  • Scrapy and Selenium are the taught tools for web crawling and remain the default. Search APIs are permitted without justification for this project.
  • GitHub Actions cron scheduling is required. The pipeline must produce a log inspectable after each run.
  • Document deduplication logic must be tested with pytest. This is the most failure-prone component.
  • Any search API keys must be stored as GitHub Secrets, following the pattern taught in W07. Document the setup in CONTRIBUTING.md.

💡 TIP: Two groups can take this project independently, focusing on different sectors or using different discovery strategies. No coordination between groups is required, but if you discover something useful it is always nice to share.

How we will grade

WIP: This will be introduced during the W11 Lecture and then later updated with the detailed rubric.