๐Ÿ—ฃ๏ธ Week 09 Lecture

Building a Data Pipeline + PostgreSQL Database with SQLAlchemy


🥅 Learning Goals

By the end of this session, you should be able to:

  • Design data pipelines with error handling
  • Work with PostgreSQL and SQLAlchemy
  • Understand vector databases and perform similarity searches



📢 Important Note: Today's lecture will be hands-on and collaborative, focusing on getting everyone up to speed with ✍️ Problem Set 2. Bring your laptops and questions!

Our goal is, at a minimum, to get you set up with a database for your assignment. If time allows, we will also add the pgVector extension to enable vector search directly in the database.

๐Ÿ“ Session Details

  • Date: Monday, 17 March 2025
  • Time: 10:00 am - 12:00 pm
  • Location: KSW.1.04

📚 Preparation

To make the most of today's session, please:

  1. Set up your ✍️ Problem Set 2 repository by accepting the GitHub Classroom assignment
  2. Clone the repository to your local machine
  3. Run the starter code to identify any issues or questions
  4. Bring your laptop to the lecture

If possible:

  1. Download PostgreSQL and install it on your machine

๐Ÿ—ฃ๏ธ Lecture Content

Throughout this session, we will build up our data pipeline together. You can follow the live demo (by git-pulling the commits as I push them) or work independently.

I will also code with the help of Cursor to demonstrate the utility of an AI coding assistant. We will put the concept of "vibe coding" to the test: just how much of a fad is it?

Hour 1: Adding a Database to Your Data Pipeline

๐Ÿ–ฅ๏ธ Live Demo + Hands-on

In this first hour, we will improve our climate policy scraper by adding a database to our pipeline. We'll:

  1. Review our starter code and discuss download challenges.

    Even with Scrapy's AutoThrottle extension enabled, we seem to be hitting rate limits (see the settings sketch at the end of this section).

    I have mapped this challenge to GitHub Issue #1. Let's see if/how Cursor can help us in real time!

  2. Introduce a database (PostgreSQL) to our pipeline.

    The way we suggest you work with databases in this course is with SQLAlchemy, whose model layer is very similar to the one first introduced in 🗣️ Week 02.

    I have mapped this task to GitHub Issue #2 to help us get started.

  3. Differences between SQLAlchemy models and Scrapy Items.

    We will then discuss how SQLAlchemy models differ from Scrapy Items (a side-by-side sketch follows this list). We will need both if we want our pipeline to be robust and flexible.

    💡 Note how these concepts come up again and again. When we were building APIs, we also engaged with data models when working with Pydantic.
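
To make the models-versus-Items distinction concrete, here is a minimal, hypothetical sketch of the two layers side by side. The Document class and its fields are placeholders; your actual schema will come from the ✍️ Problem Set 2 starter code:

    # models.py -- a minimal SQLAlchemy model (2.0-style declarative mapping)
    from sqlalchemy import String
    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

    class Base(DeclarativeBase):
        pass

    class Document(Base):
        """A scraped climate policy document, as stored in PostgreSQL."""
        __tablename__ = "documents"

        id: Mapped[int] = mapped_column(primary_key=True)
        url: Mapped[str] = mapped_column(String, unique=True)
        title: Mapped[str | None]
        downloaded: Mapped[bool] = mapped_column(default=False)

    # items.py -- the corresponding Scrapy Item: a lightweight container
    # that only ferries data from the spider to the pipeline
    import scrapy

    class DocumentItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()

The Item stays close to the scraping logic, while the model owns persistence concerns (primary keys, uniqueness, state flags such as downloaded); a pipeline step converts one into the other.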

💡 Key Concept: Separating concerns in data pipelines improves reliability. A database helps us track state and manage failures effectively.
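
And, circling back to the download limits from step 1: the throttling and retry knobs live in Scrapy's settings.py. A rough sketch, with placeholder values rather than recommendations:

    # settings.py -- illustrative throttling and retry settings (placeholder values)

    # Let Scrapy adapt its request rate to the server's responsiveness
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0         # initial delay between requests (seconds)
    AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound when the server slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per server

    # Be conservative with parallelism towards a single domain
    CONCURRENT_REQUESTS_PER_DOMAIN = 2

    # Retry transient failures instead of silently dropping requests
    RETRY_ENABLED = True
    RETRY_TIMES = 3
    RETRY_HTTP_CODES = [429, 500, 502, 503, 504]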

Hour 2: Crafting a tasks.py Script for Pipeline Management

๐Ÿ–ฅ๏ธ Advanced Demo

  1. Building a command-line interface with Click.

    Now that we have our database set up, we need an easy way to interact with it. We'll use Click to build a command-line interface that feels natural to use (a fuller sketch of tasks.py follows this list):

    python tasks.py init-db     # Set up database tables
    python tasks.py recreate-db # To start fresh
    python tasks.py download    # Start downloading PDFs
    
    # what else would you like to do?
  2. Managing downloads properly.

    Now that we have a database, we can consult it to see whether we have already downloaded each file, and fetch only the ones we are missing. We will aim to implement a proper solution that retries downloads (what if the server is temporarily down?) and shows progress bars using tqdm (how long until it's done?).

  3. Putting it all together.

    We'll see how all the pieces fit: the scraper feeds data to PostgreSQL, and our CLI tool helps manage everything else. Using different tools for different jobs is a common pattern in data engineering pipelines (more on that in Week 11).
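
To tie the two hours together, here is a pared-down, hypothetical sketch of what tasks.py could look like. The names imported from models (engine, get_session, Document) are stand-ins for whatever the starter code provides:

    # tasks.py -- hypothetical sketch of the CLI (helper names are placeholders)
    import time

    import click
    import requests
    from tqdm import tqdm

    from models import Base, Document, engine, get_session  # hypothetical helpers

    @click.group()
    def cli():
        """Commands for managing the data pipeline."""

    @cli.command("init-db")
    def init_db():
        """Create the database tables."""
        Base.metadata.create_all(engine)
        click.echo("Tables created.")

    @cli.command()
    def download():
        """Download every document not yet marked as downloaded, with retries."""
        with get_session() as session:
            pending = session.query(Document).filter_by(downloaded=False).all()
            for doc in tqdm(pending, desc="Downloading PDFs"):
                for attempt in range(3):  # naive retry loop with backoff
                    try:
                        response = requests.get(doc.url, timeout=30)
                        response.raise_for_status()
                        # ... write response.content to disk here ...
                        doc.downloaded = True
                        session.commit()
                        break
                    except requests.RequestException:
                        time.sleep(2 ** attempt)

    if __name__ == "__main__":
        cli()

Because the database, not the filesystem, is the source of truth for what has been downloaded, re-running python tasks.py download after a failure simply picks up where it left off.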

โš ๏ธ Note: If we have time, Iโ€™ll show you how to add vector storage with pgVector. If not, Iโ€™ll provide some samples for you to test during tomorrowโ€™s lab.

🎥 Session Recording

The lecture recording will be available on Moodle by the afternoon of the lecture day.

🔮 Looking Ahead

Next week (Week 10), we'll:

  1. Explore BERT and Transformers further
  2. Revisit vector retrieval and search based on embeddings
  3. Discuss: are we over-engineering our pipeline? When can we rely on keyword search? When do we absolutely need vector search?

Remember that ✍️ Problem Set 2 is due on Friday, 28 March 2025, 8pm UK time.