DS205 2025-2026 Winter Term

💻 Week 07 Lab

Building a Click Pipeline and Wiring It to GitHub Actions

click
github-actions
pipeline
ps2
Build a skeleton Click pipeline for Problem Set 2 and automate it with GitHub Actions.
Author

Dr Jon Cardoso-Silva

Published

06 March 2026

Modified

06 March 2026

🥅 Learning Goals

By the end of this lab, you should be able to:

  • Structure a multi-stage data pipeline as a Click CLI
  • Run pipeline stages individually and in sequence from the terminal
  • Define a GitHub Actions workflow that installs dependencies and runs the same pipeline on a remote machine
  • Read a GitHub Actions run log and match it to local terminal output

The 🖥️ Week 07 Lecture showed how to configure a GitHub Actions workflow to run your pipeline steps on a remote GitHub-hosted machine. This lab makes that correspondence concrete: you will build a Click CLI that defines your pipeline stages, run it locally to confirm it works, then create a GitHub Actions workflow that runs the same commands on a remote machine.

📝 Session Details

  • Date: Tuesday, 03 March 2026
  • Time: Check your timetable for the precise time of your class
  • Duration: 90 minutes

🛣️ Lab Roadmap

How the W07 lab will be structured
| Part | Activity Type | Focus | Time | Outcome |
|------|---------------|-------|------|---------|
| Part 0 | 👤 Teaching Moment | Pipeline principles recap + PS2 goals | 10 min | Shared vocabulary before coding |
| Part 1 | Setup | Accept ✏️ Problem Set 2 repository | 5 min | Cloned repo ready |
| Part 2 | ⏸️ Action Points | Create requirements.txt and pipeline.py with Click | 25-35 min | Working CLI with placeholder stages |
| Part 3 | ⏸️ Action Points | Wire the CLI to GitHub Actions | 30-40 min | Green tick on the Actions tab |
| Part 4 | 🗣️ Discussion (if time allows) | AI agents | 0-15 min | Vocabulary for agent concepts |

👉 NOTE: Whenever you see a 👤 TEACHING MOMENT, this means your class teacher deserves your full attention!

Part 0: Barry’s Opening (10 min)

This section is a TEACHING MOMENT

Barry will recap the three pipeline design principles introduced in the lecture and explain what ✏️ Problem Set 2 asks you to build:

  • Atomicity: each stage does one thing only
  • Idempotency: running the same step twice produces the same output
  • Modularity: each stage is independent of the internal logic of other stages

These principles will guide the TPI pipeline you design for ✏️ Problem Set 2 too. The goal of today’s lab is to practise the pattern of defining pipeline stages as CLI commands, not to write the actual pipeline logic yet.
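To see what idempotency means in code, here is a minimal sketch (the `extract_stage` function and its input are invented for illustration, not part of the Problem Set brief). Because the output depends only on the input, running the stage a second time leaves the file exactly as it was:

```python
import tempfile
from pathlib import Path

def extract_stage(raw_text: str, output_path: Path) -> None:
    """Hypothetical 'extract' stage whose output depends only on its input."""
    # No timestamps, no appending, no randomness: re-running the stage
    # overwrites the file with exactly the same bytes.
    lines = [line.strip().lower() for line in raw_text.splitlines() if line.strip()]
    output_path.write_text("\n".join(lines))

out = Path(tempfile.mkdtemp()) / "extracted.txt"
extract_stage("  Hello \n\n World ", out)
first_run = out.read_text()
extract_stage("  Hello \n\n World ", out)
assert out.read_text() == first_run  # idempotent: second run changes nothing
```

A stage that appended to the file, or stamped the current time into it, would fail this check.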

Part 1: Accept ✏️ Problem Set 2 Repository (5 min)

Accept the GitHub Classroom link below and clone the repository to your machine. The repo contains the ✏️ Problem Set 2 brief. You will create the pipeline files yourself in the next part.

COMING SOON: Link to GitHub Classroom assignment.

Part 2: Create requirements.txt and pipeline.py (25 min)

🎯 ACTION POINTS

Conda would be more robust in the long run but, for now, let’s keep things simple and use pip for this lab. Later on, you may need to switch to conda for more complex dependencies.

Step 1: Create requirements.txt

Create a file called requirements.txt in the root of your repository with these two packages:

click
tqdm

This is intentionally minimal. Your ✏️ Problem Set 2 will need more packages as you build the actual pipeline in W08+, but for today you only need Click to define the CLI structure.

Install the dependencies now:

pip install -r requirements.txt

Step 2: Create pipeline.py

Create a file called pipeline.py in the root of your repository. This file defines your pipeline as a Click CLI with one subcommand per stage. Here is a worked example to start from:

import logging

import click

logging.basicConfig(level=logging.INFO)

@click.group()
def cli():
    """TPI data pipeline."""
    pass

@cli.command()
def crawl():
    """Collect data from TPI website."""
    logging.info("crawl: stage not yet implemented")

@cli.command()
def extract():
    """Extract structured content from collected data."""
    logging.info("extract: stage not yet implemented")

@cli.command()
def embed():
    """Generate embeddings from extracted content."""
    logging.info("embed: stage not yet implemented")

@cli.command()
def serve():
    """Start the search API."""
    logging.info("serve: stage not yet implemented")

if __name__ == "__main__":
    cli()

The stage names (crawl, extract, embed, serve) are just examples. Replace them with whatever stages you imagine your ✏️ Problem Set 2 pipeline will need. It is completely fine to revise these names later as you learn more about TPI data and RAG pipelines in W08+.

Each function body should only log a message for now. The real logic comes later. The point of this lab is to practise the architectural pattern: name your stages, wire them into a CLI, and confirm they run. The docstring on each command becomes help text when you run python pipeline.py --help. Writing a clear one-line description is good practice.
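You do not even need a terminal to check the help text: Click ships a testing helper, click.testing.CliRunner, which invokes commands in-process. A minimal sketch, reusing an abbreviated version of the cli group above:

```python
import click
from click.testing import CliRunner

@click.group()
def cli():
    """TPI data pipeline."""

@cli.command()
def crawl():
    """Collect data from TPI website."""

# Invoke `--help` in-process, exactly as the terminal would.
result = CliRunner().invoke(cli, ["--help"])
print(result.output)  # the crawl docstring appears as its one-line help text
```

The same helper is handy later for writing automated tests of your pipeline's CLI surface.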

📖 Read more: documenting Click commands.

Step 3: Run individual stages

Test each stage on its own:

python pipeline.py crawl
python pipeline.py embed

You should see the logging messages appear. Try python pipeline.py --help to confirm all your stages are listed.
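Stages will eventually need parameters (for example, how many pages to crawl). Click attaches these as per-command options. Here is a hedged sketch; the --limit option is invented for illustration and is not part of the Problem Set brief:

```python
import click

@click.group()
def cli():
    """TPI data pipeline."""

@cli.command()
@click.option("--limit", default=10, show_default=True,
              help="Maximum number of pages to fetch.")
def crawl(limit):
    """Collect data from TPI website."""
    # The option arrives as a regular function argument.
    click.echo(f"crawl: would fetch up to {limit} pages")

if __name__ == "__main__":
    cli()
```

With this in place, `python pipeline.py crawl --limit 5` overrides the default, and the option is documented automatically in `python pipeline.py crawl --help`.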

Step 4: Add a run-all command

You can run all stages in sequence using Clickโ€™s Context.invoke. Add this command to your pipeline.py:

@cli.command("run-all")
@click.pass_context
def run_all(ctx):
    """Run all pipeline stages in sequence."""
    ctx.invoke(crawl)
    ctx.invoke(extract)
    ctx.invoke(embed)
    ctx.invoke(serve)

Then run:

python pipeline.py run-all

All stages should fire in sequence and you should see four logging messages.

🔔 IMPORTANT:

If you rename your stage functions, update the ctx.invoke calls in run-all to match.

📖 Read more: Click: Context.invoke API reference
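If you would rather not repeat ctx.invoke for every stage, one possible refactor is to loop over a list of the command objects, so a renamed stage only needs updating in one place. A self-contained sketch with two placeholder stages (not the official solution, just one design option):

```python
import logging

import click

logging.basicConfig(level=logging.INFO)

@click.group()
def cli():
    """TPI data pipeline."""

@cli.command()
def crawl():
    """Collect data from TPI website."""
    logging.info("crawl: stage not yet implemented")

@cli.command()
def extract():
    """Extract structured content from collected data."""
    logging.info("extract: stage not yet implemented")

# Single source of truth for stage order.
STAGES = [crawl, extract]

@cli.command("run-all")
@click.pass_context
def run_all(ctx):
    """Run all pipeline stages in sequence."""
    for stage in STAGES:
        logging.info("starting stage: %s", stage.name)
        ctx.invoke(stage)

if __name__ == "__main__":
    cli()
```

Context.invoke accepts the command objects directly, so the loop body stays a one-liner.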

Step 5: Push to GitHub

Commit both files and push:

git add requirements.txt pipeline.py
git commit -m "Add skeleton pipeline with Click"
git push

You will need this on GitHub for Part 3.

Part 3: Wire to GitHub Actions (30 min)

🎯 ACTION POINTS

Step 1: Create the workflow file

Create a file called pipeline.yml under the .github/workflows/ directory of your repository. You will need to create the .github/workflows/ directory first if it does not exist.

Type this YAML into the pipeline.yml file:

name: TPI pipeline

on:
  push:
    branches: [main]

jobs:
  pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline
        run: python pipeline.py run-all

Each block in this YAML file maps to something you already understand from the lecture. The run: lines are the same commands you just ran in your terminal. If you want a refresher on what each piece does, see the 🖥️ Week 07 Lecture page.
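One optional tweak, if you want to re-run the pipeline without pushing a new commit: GitHub Actions also supports a manual trigger. Adding workflow_dispatch to the on: block puts a "Run workflow" button on the Actions tab:

```yaml
on:
  push:
    branches: [main]
  workflow_dispatch:   # enables a manual "Run workflow" button on the Actions tab
```

The rest of the workflow file stays exactly as above.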

Step 2: Push and verify

Commit the workflow file and push:

git add .github/workflows/pipeline.yml
git commit -m "Add GitHub Actions workflow"
git push

Open your repository on GitHub and click the Actions tab. You should see a workflow run triggered by your push. Wait for it to finish and look for a green tick.

Step 3: Read the logs

Click into the workflow run, then open the Run pipeline step. You should see the same logging messages you saw locally. The output is identical because the command is identical: python pipeline.py run-all runs the same way on GitHub’s Ubuntu machine as it does on your laptop.

Step 4: Troubleshooting

🔧 Common issues

  • “My workflow isn’t triggering”: check that the file is at exactly .github/workflows/pipeline.yml (not .yaml, not in a different folder).
  • ModuleNotFoundError: No module named 'click': check that requirements.txt is committed and not listed in .gitignore.
  • “The workflow ran but I see no log output”: confirm that pipeline.py uses logging.info() and that logging.basicConfig(level=logging.INFO) is at the top of the file. Alternatively, swap logging.info() for print() if you prefer. Both work.

💡 If you are stuck on the GitHub Actions step, ask Barry for help. The most common mistake is a file path typo in the workflow YAML.

Part 4: AI Agents Discussion (bonus, only if time allows)

Open discussion: What makes something an AI agent?

This discussion only runs if Parts 2 and 3 are complete. It is optional.

Opening question: “You have just built a pipeline that runs automatically when you push code, installs its own dependencies, and executes every stage in sequence without you touching it. What is the difference between that and an AI agent?”

There is no single correct answer, which is precisely the point. Consider the MIT CSAIL four-part characterisation of agents (from their 2025 AI Agent Index):

  1. Autonomy: operating with minimal human oversight
  2. Goal complexity: pursuing high-level objectives through planning
  3. Environmental interaction: interacting with the world through tools and APIs
  4. Generality: handling under-specified instructions and adapting to new tasks

Which of these four will your ✏️ Problem Set 2 pipeline have once fully built? That question is more useful than any definitive answer about what counts as an agent.

📖 Further reading:

Appendix | Resources

Course links

โœ๏ธ Problem Set 2