DS205 2025-2026 Winter Term

🖥️ Week 04 Lecture

Building Collaborative APIs with FastAPI

Author

Dr Jon Cardoso-Silva

Published

06 March 2026

Last Updated: 09 February 2026

No slides this week. We’ll work through problems together, write code in groups, and build concepts from what you discover. Download the notebook before we start.

📍 Session Details

  • Date: Monday, 09 February 2026
  • Time: 16:00 - 18:00
  • Location: SAL.G.03

📋 Preparation

  • Make as much progress on the web scraping part of ✍️ Problem Set 1 as you can. If you have scraped data and barcodes, bring them.
  • Download the exercise notebook and data files we’ll use in class:

Download validation exercise notebook

Download Waitrose scraped data (JSONL)

Download OpenFoodFacts sample responses (JSON)

  • Nuvolos tip: If you’ve finished with web scraping, switch to the regular VS Code application on Nuvolos (not the one with Chromium + Selenium). The Chromium version is slower and you won’t need a browser for the remaining work.

Motivation

Once you have written your web scraper for ✍️ Problem Set 1, you will inherit someone else’s data, and your immediate next task will be to enrich each product with its corresponding NOVA classification.

We will start this lecture by discussing how to find the corresponding product in the OpenFoodFacts API. Then, we will talk about how to serve this enriched data through a FastAPI application.

The lecture will be divided into three moments:

  1. Moment 1: We will discuss how to find the corresponding product in the OpenFoodFacts API.
  2. Moment 2: We will talk about data models/schemas and their role in web scraping and API development.
  3. Moment 3: We will build a minimal FastAPI application to serve data.

Moment 1: Finding the corresponding product in the OpenFoodFacts API

The OpenFoodFacts API

The product endpoint returns many meaningful fields per product. We will look at the schema that OpenFoodFacts uses to structure the product data returned by this endpoint of their API:

GET /api/v2/product/{barcode}

What is a schema?

For now, think of a schema as the list of fields an API returns, typically in JSON format. The schema tells you the fields’ names and types, and how they are nested.
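To make the nesting concrete, here is a heavily trimmed sketch of the shape of the product endpoint’s response. The field names (`code`, `status`, `product`, `product_name`, `nova_group`) come from the OpenFoodFacts docs; the values are made up:

```python
# A heavily trimmed sketch of the JSON the product endpoint returns.
# Field names follow the OpenFoodFacts schema; the values are invented.
sample_response = {
    "code": "0000000000000",
    "status": 1,  # 1 = product found, 0 = not found
    "product": {
        "product_name": "Baked Beans",
        "nova_group": 4,  # the NOVA classification we are after
    },
}

# Nested fields are reached by chaining keys
nova = sample_response["product"]["nova_group"]
```

Knowing how deep `nova_group` sits in the response is exactly the kind of thing the schema documents for you.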

Handling null values

Not every barcode returns a match. Some products haven’t been uploaded to OpenFoodFacts yet, so you get a 404.

If you drop all unmatched products, you bias your UPF proportion estimate. Store-brand or niche products may be systematically absent. While you can complete your ✍️ Problem Set 1 by just reporting the number of unmatched products and then dropping them, you are more than welcome to find strategies to complete the dataset.

If you want a taste of what it is like to work in a real-world scenario, where you often need to build a data pipeline without full information, here are a few strategies to consider:

  • Search by name: Less reliable than barcode, but broader coverage. Don’t forget to document your decisions.
  • Manual lookup: Check OpenFoodFacts or other sources for a small number of critical products. The API itself allows you to submit data to its server! If that is too much work, you can manually check whether the product appears on the website, or make a judgement call about the NOVA classification based on the information you have about the product. In that case, please document which reference you used to reach your decision.
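For the search-by-name strategy, OpenFoodFacts exposes a free-text search endpoint. A minimal sketch (the function name and parameter choices are mine; check the OpenFoodFacts docs before relying on this in your pipeline):

```python
import requests

SEARCH_URL = "https://world.openfoodfacts.org/cgi/search.pl"


def search_by_name(name: str, page_size: int = 5) -> list[dict]:
    """Fall back to a free-text search when a barcode lookup fails.

    Less reliable than a barcode match, so record that any result
    came from a name search rather than an exact barcode hit.
    """
    params = {
        "search_terms": name,
        "search_simple": 1,
        "action": "process",
        "json": 1,
        "page_size": page_size,
    }
    response = requests.get(SEARCH_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json().get("products", [])
```

Because a name search can return several near-matches, you would still need a rule (and documentation of that rule) for picking which result to trust.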

💡 Transparency matters more than completeness.

Moment 2: Data models/schemas and their uses in web scraping and API development

The schema concept in Scrapy

The word schema appeared when we read the OpenFoodFacts docs. The same concept is available in Scrapy and in API development. In both cases, a schema defines the structure of the data you produce or consume.

Scrapy’s items.py defines the structure of your scraped data:

import scrapy


class WaitroseProduct(scrapy.Item):
    """Represents a single product scraped from Waitrose."""
    category = scrapy.Field()
    name = scrapy.Field()
    barcode = scrapy.Field()
    url = scrapy.Field()
    food_type = scrapy.Field()

The above transparently communicates to anyone reading items.py what fields exist. The Field class also allows you to add metadata to the fields, but this metadata is rarely used in practice. Read more about this in the 📚 Scrapy documentation.

The schema concept in API development

Soon, we will learn how to create our own APIs to serve our data. We will find that our library of choice, the popular FastAPI framework, uses a similar concept to define the structure of the data that is returned by the API. FastAPI relies on another library called Pydantic to do this.

Pydantic does the same for APIs, but with the very welcome addition of type validation. That is, you can specify the precise type of data expected in each field, and if you try to create an object with an invalid type, Pydantic validates it and raises an error on the spot.

For example, take the following Pydantic model:

from pydantic import BaseModel


class Product(BaseModel):
    name: str
    price: float
    url: str

If you try to create a Product with price="two pounds" (a string instead of a float), Pydantic will raise an error. The schema serves as more than just a specification of the fields; it also documents the type of data expected in each field.
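A minimal sketch of what that validation looks like in practice (the field values below are made up for illustration):

```python
from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    name: str
    price: float
    url: str


# A sensible payload is parsed, with safe coercion ("1.99" -> 1.99)
ok = Product(name="Oat milk", price="1.99", url="https://example.com/oat-milk")
assert ok.price == 1.99

# An unparseable type is rejected on the spot
try:
    Product(name="Oat milk", price="two pounds", url="https://example.com/oat-milk")
except ValidationError as err:
    print(err)  # the error names which field failed and why
```

Catching `ValidationError` like this is how you would surface bad rows in your partner’s data early, instead of discovering them when an endpoint misbehaves.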

💡 Read more about Pydantic features in the 📚 Pydantic documentation. In particular, read about the Field class and the BaseModel class.

Moment 3: FastAPI

Creating an API endpoint is like writing your own custom function in Python. The major difference is that you will be able to call (invoke) that function from any other application on the machine (curl, requests.get(), etc.) and, if you host your API on a public-facing server, you will be able to call it from anywhere in the world.

Building an endpoint

We use Wikipedia’s “List of Foods” as our teaching example. The Pydantic model becomes the foundation for an API endpoint:

from fastapi import FastAPI
from pydantic import BaseModel


class Food(BaseModel):
    name: str
    category: str


app = FastAPI()


@app.get("/foods")
def get_foods() -> list[Food]:
    # TODO: get data from the database/files; a placeholder list for now
    all_foods = [Food(name="apple", category="fruit")]
    return all_foods

Run it:

uvicorn main:app --reload

Open http://localhost:8000/docs to view the auto-generated documentation produced by FastAPI (also known as Swagger UI). The URL might actually be slightly different on Nuvolos, because it uses a proxy.

Consuming the API

import requests

response = requests.get("http://localhost:8000/foods")
response.raise_for_status()  # fail loudly if the API returned an error
foods = response.json()
print(foods)

Connecting to Problem Set 1

In Part B, you apply this pattern to Waitrose:

  1. You receive your partner’s scraper and data.
  2. You define Pydantic models matching their data schema.
  3. You build FastAPI endpoints serving that data.

Where does the OpenFoodFacts enrichment happen? Options include a Scrapy pipeline, a post-scraping script, or inside the FastAPI application itself. You choose based on your architecture. We’ll explore these options in the lab.

What’s next

Later, at home, start working on Part B of your ✍️ Problem Set 1. By the end of the lecture, you should already know who your assignment partner is, so get in touch with them and gain access to their repository to start working on the API development part of the problem set.

🎯 ACTION POINTS:

Here are some tips on what to do once you are ready to start working on Part B:

  1. Familiarise yourself with the scraper/README.md file your partner wrote when they were working on Part A.
  2. Check whether they have left you pre-existing data to work with. This should be under the data/scraped/ folder; if they were extra nice and already completed the OpenFoodFacts enrichment part of the problem set, you will also find a data/enriched/ folder.
  3. Think of the API endpoints you want to implement with FastAPI.
  4. Think of how you will read the data provided by your partner into your FastAPI application and reshape it to meet your needs.
  5. Write the code for your FastAPI application and test it out by calling it from the Swagger UI or using curl or requests.get().

You have until 26 February 2026, 8pm UK time, to complete Part B of your ✍️ Problem Set 1. To meet the requirements, you will need to write the relevant endpoints detailed in the assignment specification. If you want to go beyond, you are more than welcome to use the enriched data/API outputs to answer the ambitious question set on the assignment page:

What proportion of groceries and food items available on a UK supermarket website is ultra-processed (UPF)? And for any given UPF item, what is its closest item that is non-UPF?

🎥 Session Recording

The lecture recording will be available on Moodle by the afternoon of the lecture.