🗣️ Week 09 Lecture
Building a Data Pipeline + PostgreSQL Database with SQLAlchemy
Last updated: 16 March 2025, 18:30

📢 Important Note: Today's lecture will be hands-on and collaborative, focusing on getting everyone up to speed with ✍️ Problem Set 2. Bring your laptops and questions!
Our goal is to at least get you set up with a database for your assignment. If time allows, we will also add the pgVector extension to enable vector search directly in the database.
📅 Session Details
- Date: Monday, 17 March 2025
- Time: 10:00 am - 12:00 pm
- Location: KSW.1.04
📚 Preparation
To maximise today's session, please:
- Set up your ✍️ Problem Set 2 repository by accepting the GitHub Classroom assignment
- Clone the repository to your local machine
- Run the starter code to identify any issues or questions
- Bring your laptop to the lecture
If possible:
- Download PostgreSQL and install it on your machine
🗣️ Lecture Content
Throughout this session, we will make progress on our data pipeline together. You can follow the live demo (by `git pull`ing the commits as I make them) or work independently.
I will also code with the help of Cursor to demonstrate the utility of an AI coding assistant. We will test the concept of vibe coding! Just how much of a fad is this?
Hour 1: Adding a Database to Your Data Pipeline
🖥️ Live Demo + Hands-on
In this first hour, we will improve our climate policy scraper by adding a database to our pipeline. We'll:
Review our starter code and discuss download challenges.
Even with `scrapy`'s autothrottling, it seems like we are hitting some limits. I have mapped this challenge to GitHub Issue #1. Let's see if/how Cursor can help us in real time!
Introduce a database (PostgreSQL) to our pipeline.
The way we suggest you work with databases in this course is with SQLAlchemy, which is very similar to the concept of the model layer first introduced in 🗣️ Week 02. I have mapped this task in GitHub Issue #2 to help us get started.
Differences between SQLAlchemy models and scrapy Items.
We will then discuss how SQLAlchemy models differ from scrapy Items, and why we need both if we want our pipeline to be robust and flexible. A minimal sketch below illustrates the distinction.
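To make the contrast concrete, here is a small sketch assuming SQLAlchemy 2.0-style declarative mappings. The class and field names are illustrative only, not the actual schema from the starter code: the scrapy Item describes what the spider extracts, while the SQLAlchemy model describes how a record is stored in PostgreSQL.

```python
import scrapy
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class PolicyDocumentItem(scrapy.Item):
    # What the spider yields while crawling (illustrative fields)
    title = scrapy.Field()
    url = scrapy.Field()


class Base(DeclarativeBase):
    pass


class PolicyDocument(Base):
    # How the same record is persisted in PostgreSQL
    __tablename__ = "policy_documents"

    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str]
    url: Mapped[str] = mapped_column(unique=True)
    downloaded: Mapped[bool] = mapped_column(default=False)
```

The Item is deliberately loose (whatever the spider happens to find), while the model enforces structure: a primary key, a uniqueness constraint on the URL, and a `downloaded` flag the pipeline can use to track state.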
💡 Note how these concepts come up again and again. When we were building APIs, we also engaged with data models when working with Pydantic.
💡 Key Concept: Separating concerns in data pipelines improves reliability. A database helps us track state and manage failures effectively.
Hour 2: Crafting a tasks.py Script for Pipeline Management
🖥️ Advanced Demo
Building a command-line interface with Click.
Now that we have our database set up, we need an easy way to interact with it. We'll use Click to build a command-line interface that feels natural to use:

```bash
python tasks.py init-db       # Set up database tables
python tasks.py recreate-db   # To start fresh
python tasks.py download      # Start downloading PDFs
# what else would you like to do?
```
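Here is a rough sketch of how such a `tasks.py` could be organised with Click. The command bodies and the helpers referenced in the comments are placeholders, not the assignment's required interface.

```python
import click


@click.group()
def cli():
    """Manage the climate policy data pipeline."""


@cli.command(name="init-db")
def init_db():
    """Set up the database tables."""
    click.echo("Creating tables...")
    # e.g. Base.metadata.create_all(engine)


@cli.command(name="recreate-db")
def recreate_db():
    """Drop everything and start fresh."""
    click.echo("Recreating tables...")
    # e.g. Base.metadata.drop_all(engine), then create_all(engine)


@cli.command()
def download():
    """Download PDFs that are not yet on disk."""
    click.echo("Downloading pending PDFs...")
    # e.g. loop over rows where downloaded is False


if __name__ == "__main__":
    cli()
```

Grouping commands under `@click.group()` is what gives us the `python tasks.py <command>` pattern shown above, and each docstring doubles as the help text for `python tasks.py --help`.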
Managing downloads properly.
Now that we have a database, we can consult it to see which files we have already downloaded and fetch only those we haven't. We will aim to implement a proper solution that retries downloads (what if the server is temporarily down?) and shows progress bars using `tqdm` (how long until it's done?). A rough sketch of such a retry loop appears after the next point.
Putting it all together.
We'll see how all the pieces fit: the scraper feeds data to PostgreSQL, and our CLI tool helps manage everything else. Using different tools for different jobs is a common pattern in data engineering pipelines (more on that in Week 11).
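As promised above, here is a minimal sketch of the retry-plus-progress-bar idea, assuming `requests` and `tqdm` are available. The function names and back-off strategy are illustrative, not necessarily the solution we will write in class.

```python
import time
from pathlib import Path

import requests
from tqdm import tqdm


def download_pdf(url: str, dest: Path, max_retries: int = 3) -> bool:
    """Download one PDF, retrying with an increasing wait if the request fails."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            dest.write_bytes(response.content)
            return True
        except requests.RequestException:
            if attempt == max_retries:
                return False
            time.sleep(2 ** attempt)  # back off a little longer each time
    return False


def download_pending(pending: list[tuple[str, Path]]) -> None:
    """Loop over files the database says we still need, with a progress bar."""
    for url, dest in tqdm(pending, desc="Downloading PDFs"):
        download_pdf(url, dest)
```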
⚠️ Note: If we have time, I'll show you how to add vector storage with pgVector. If not, I'll provide some samples for you to test during tomorrow's lab.
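For those who want a head start, here is a very rough sketch of what pgVector storage can look like with SQLAlchemy. It assumes the `pgvector` Python package is installed and the extension is enabled in the database; the table name and embedding size are placeholders.

```python
from pgvector.sqlalchemy import Vector
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class DocumentChunk(Base):
    __tablename__ = "document_chunks"

    id: Mapped[int] = mapped_column(primary_key=True)
    content: Mapped[str]
    embedding = mapped_column(Vector(384))  # placeholder embedding dimension


# The extension is enabled once per database, for example:
#   CREATE EXTENSION IF NOT EXISTS vector;
```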
🎥 Session Recording
The lecture recording will be available on Moodle by the afternoon of the lecture day.
🔮 Looking Ahead
Next week (Week 10), we'll:
- Explore BERT and Transformers further
- Revisit vector retrieval and search based on embeddings
- Discuss: are we over-engineering our pipeline? When can we rely on keyword search? When do we absolutely need vector search?
Remember that ✍️ Problem Set 2 is due on Friday, 28 March 2025, 8pm UK time.