๐Ÿ—ฃ๏ธ Week 09 Lecture

Building a Data Pipeline + PostgreSQL Database with SQLAlchemy


🥅 Learning Goals

By the end of this session, you should be able to:

  • Design data pipelines with error handling
  • Work with PostgreSQL and SQLAlchemy
  • Understand vector databases and perform similarity searches



📢 Important Note: Today's lecture will be hands-on and collaborative, focusing on getting everyone up to speed with ✍️ Problem Set 2. Bring your laptops and questions!

Our goal is, at a minimum, to get you set up with a database for your assignment. If time allows, we will also add the pgVector extension to enable vector search directly in the database.

๐Ÿ“ Session Details

  • Date: Monday, 17 March 2025
  • Time: 10:00 am - 12:00 pm
  • Location: KSW.1.04

📚 Preparation

To make the most of today's session, please:

  1. Set up your ✍️ Problem Set 2 repository by accepting the GitHub Classroom assignment
  2. Clone the repository to your local machine
  3. Run the starter code to identify any issues or questions
  4. Bring your laptop to the lecture

If possible:

  1. Download PostgreSQL and install it on your machine

๐Ÿ—ฃ๏ธ Lecture Content

Throughout this session, we will build up our data pipeline together. You can follow the live demo (by git-pulling the commits as I push them) or work independently.

I will also code with the help of Cursor to demonstrate the utility of an AI coding assistant. We will put the concept of "vibe coding" to the test: just how much of a fad is it?

Hour 1: Adding a Database to Your Data Pipeline

๐Ÿ–ฅ๏ธ Live Demo + Hands-on

In this first hour, we will improve our climate policy scraper by adding a database to our pipeline. We'll:

  1. Review our starter code and discuss download challenges.

    Even with Scrapy's AutoThrottle extension enabled, we seem to be hitting rate limits (see the settings sketch at the end of this section).

    I have mapped this challenge to GitHub Issue #1. Let's see if/how Cursor can help us in real time!

  2. Introduce a database (PostgreSQL) to our pipeline.

    The way we suggest you work with databases in this course is with SQLAlchemy, whose model layer is very similar to the one first introduced in 🗣️ Week 02.

    I have mapped this task to GitHub Issue #2 to help us get started.

  3. Differences between SQLAlchemy models and Scrapy Items.

    We will then discuss how SQLAlchemy models differ from Scrapy Items (a side-by-side sketch follows this list). We will need both if we want our pipeline to be robust and flexible.

    💡 Note how these concepts come up again and again. When we were building APIs, we also engaged with data models when working with Pydantic.
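
To make the models-versus-Items distinction concrete, here is a minimal, hypothetical sketch of the two layers side by side. The Document class and its fields are placeholders; your actual schema will come from the ✍️ Problem Set 2 starter code:

    # models.py -- a minimal SQLAlchemy model (2.0-style declarative mapping)
    from sqlalchemy import String
    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

    class Base(DeclarativeBase):
        pass

    class Document(Base):
        """A scraped climate policy document, as stored in PostgreSQL."""
        __tablename__ = "documents"

        id: Mapped[int] = mapped_column(primary_key=True)
        url: Mapped[str] = mapped_column(String, unique=True)
        title: Mapped[str | None]
        downloaded: Mapped[bool] = mapped_column(default=False)

    # items.py -- the corresponding Scrapy Item: a lightweight container
    # that only ferries data from the spider to the pipeline
    import scrapy

    class DocumentItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()

The Item stays close to the scraping logic, while the model owns persistence concerns (primary keys, uniqueness, state flags such as downloaded); a pipeline step converts one into the other.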

💡 Key Concept: Separating concerns in data pipelines improves reliability. A database helps us track state and manage failures effectively.
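
And, circling back to the download limits from step 1: the throttling and retry knobs live in Scrapy's settings.py. A rough sketch, with placeholder values rather than recommendations:

    # settings.py -- illustrative throttling and retry settings (placeholder values)

    # Let Scrapy adapt its request rate to the server's responsiveness
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0         # initial delay between requests (seconds)
    AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound when the server slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per server

    # Be conservative with parallelism towards a single domain
    CONCURRENT_REQUESTS_PER_DOMAIN = 2

    # Retry transient failures instead of silently dropping requests
    RETRY_ENABLED = True
    RETRY_TIMES = 3
    RETRY_HTTP_CODES = [429, 500, 502, 503, 504]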

Hour 2: Crafting a tasks.py Script for Pipeline Management

๐Ÿ–ฅ๏ธ Advanced Demo

  1. Building a command-line interface with Click.

    Now that we have our database set up, we need an easy way to interact with it. We'll use Click to build a command-line interface that feels natural to use (a fuller sketch of tasks.py follows this list):

    python tasks.py init-db     # Set up database tables
    python tasks.py recreate-db # To start fresh
    python tasks.py download    # Start downloading PDFs
    
    # what else would you like to do?
  2. Managing downloads properly.

    Now that we have a database, we can consult it to see whether we have already downloaded each file, and fetch only the ones we are missing. We will aim to implement a proper solution that retries downloads (what if the server is temporarily down?) and shows progress bars using tqdm (how long until it's done?).

  3. Putting it all together.

    We'll see how all the pieces fit: the scraper feeds data to PostgreSQL, and our CLI tool helps manage everything else. Using different tools for different jobs is a common pattern in data engineering pipelines (more on that in Week 11).
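
To tie the two hours together, here is a pared-down, hypothetical sketch of what tasks.py could look like. The names imported from models (engine, get_session, Document) are stand-ins for whatever the starter code provides:

    # tasks.py -- hypothetical sketch of the CLI (helper names are placeholders)
    import time

    import click
    import requests
    from tqdm import tqdm

    from models import Base, Document, engine, get_session  # hypothetical helpers

    @click.group()
    def cli():
        """Commands for managing the data pipeline."""

    @cli.command("init-db")
    def init_db():
        """Create the database tables."""
        Base.metadata.create_all(engine)
        click.echo("Tables created.")

    @cli.command()
    def download():
        """Download every document not yet marked as downloaded, with retries."""
        with get_session() as session:
            pending = session.query(Document).filter_by(downloaded=False).all()
            for doc in tqdm(pending, desc="Downloading PDFs"):
                for attempt in range(3):  # naive retry loop with backoff
                    try:
                        response = requests.get(doc.url, timeout=30)
                        response.raise_for_status()
                        # ... write response.content to disk here ...
                        doc.downloaded = True
                        session.commit()
                        break
                    except requests.RequestException:
                        time.sleep(2 ** attempt)

    if __name__ == "__main__":
        cli()

Because the database, not the filesystem, is the source of truth for what has been downloaded, re-running python tasks.py download after a failure simply picks up where it left off.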

โš ๏ธ Note: If we have time, Iโ€™ll show you how to add vector storage with pgVector. If not, Iโ€™ll provide some samples for you to test during tomorrowโ€™s lab.

🎥 Session Recording

The lecture recording will be available on Moodle by the afternoon of the lecture day.

🔮 Looking Ahead

Next week (Week 10), we'll:

  1. Explore BERT and Transformers further
  2. Revisit vector retrieval and search based on embeddings
  3. Discuss: are we over-engineering our pipeline? When can we rely on keyword search? When do we absolutely need vector search?

Remember that ✍️ Problem Set 2 is due on Friday, 28 March 2025, 8pm UK time.