💻 Week 02, Day 03 - Lab

Mastering API Pagination

Author: Dr Jon Cardoso-Silva

Last updated: 23 July 2025

🥅 Learning Objectives

By the end of this lab, you should be able to:

  1. Understand the role of pagination in data collection from APIs
  2. Identify and use an after token from a JSON response to fetch the next page of data
  3. Write a loop to make multiple, sequential API requests to collect a full dataset
  4. Consolidate paginated results into a single data structure for analysis


This morning, you successfully connected to Reddit’s API and pulled down some data. Brilliant! But here’s the thing: Reddit only gives you 25 posts at a time. That’s fine for a quick look, but what if you need 100 posts for a proper analysis?

You can’t just ask for 100 posts in one go. Instead, you need to make multiple requests and stitch the results together. This process is called pagination, and it’s absolutely essential for serious data collection.

Wednesday, 23 July 2025 | Either 2:00-3:30pm or 3:30-5:00pm 📍 Check your timetable for the location of your class


🛣️ What You’ll Be Doing

You’ll work through the ME204_W02D03_Lab.ipynb notebook, which should be in your lab-notebooks/ folder on Nuvolos. If it isn’t there, download a fresh copy from the course website.

Part I: Getting Started and Understanding the Problem (20 min)

You’ll begin by setting up your Reddit connection using this morning’s credentials, then make your first request to see exactly how pagination works.

Seeing Pagination in Action

Your teacher will show you:

  1. Where to find the after token buried in Reddit’s response (see the sketch just after this list)
  2. How this token works like a bookmark that says “start here for the next batch”
  3. Why APIs do this (hint: imagine if everyone could request a million posts at once!)
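
To make the “bookmark” idea concrete, here’s a stripped-down sketch of the JSON structure Reddit sends back. The values are invented, but the nesting matches what you’ll see in the notebook:

```python
# A made-up, minimal example of a Reddit "Listing" response. Only the nesting
# matters here: note where the after token sits.
example_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",   # the bookmark: pass this back to get the next batch
        "children": [           # the posts themselves, each wrapped in its own "data"
            {"kind": "t3", "data": {"title": "An example post", "ups": 1234}},
            # ... 24 more posts ...
        ],
    },
}

after_token = example_listing["data"]["after"]
print("The next page starts after:", after_token)
```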

🎯 What You’ll Do

Right, time to get your hands dirty. You’ll work through the notebook step by step.

  1. Load Your Reddit Credentials

    In section 1.1, you’ll load the same credentials you set up this morning. The notebook checks you’ve got all four bits: username, password, client ID, and client secret.

  2. Get Connected to Reddit

    In section 1.2, you’ll use the exact same authentication dance from the lecture. Post your credentials, get back an access token, job done.

  3. Make Your First Request

    In section 2.1, you’ll fetch 25 posts from r/Art (or whatever subreddit takes your fancy). More importantly, you’ll examine the response and spot that crucial after token. A sketch of this whole flow appears just after this list.
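
If you want a rough picture of what steps 2 and 3 look like in code, here’s a hedged sketch of the authentication-plus-first-request flow. The credential values and the User-Agent string are placeholders for your own, and the notebook may name things slightly differently:

```python
import requests

# A rough sketch of sections 1.2 and 2.1 combined. The credential values and
# the User-Agent string below are placeholders: swap in the ones you set up
# this morning (the notebook loads them for you).
CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"
USERNAME = "your_reddit_username"
PASSWORD = "your_reddit_password"
USER_AGENT = f"ME204 lab script by u/{USERNAME}"

# 1) Exchange your credentials for an access token
token_response = requests.post(
    "https://www.reddit.com/api/v1/access_token",
    auth=requests.auth.HTTPBasicAuth(CLIENT_ID, CLIENT_SECRET),
    data={"grant_type": "password", "username": USERNAME, "password": PASSWORD},
    headers={"User-Agent": USER_AGENT},
)
access_token = token_response.json()["access_token"]

# 2) Use the token to fetch 25 top posts from r/Art
headers = {"Authorization": f"bearer {access_token}", "User-Agent": USER_AGENT}
response = requests.get(
    "https://oauth.reddit.com/r/Art/top",
    headers=headers,
    params={"limit": 25},
)
data = response.json()

# 3) Spot the after token you'll need for the next page
print("after token:", data["data"]["after"])
```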

Part II: Building Your Collection Loop (40 min)

Now for the meat of it: you’ll write the code that automatically requests multiple pages and combines them into one dataset.

The Pagination Dance

Here’s what you’re building:

  1. Make your first request → Reddit gives you 25 posts plus an after token
  2. Use that after token in your next request → get 25 more posts plus a new after token
  3. Keep going until you’ve got enough data
  4. Combine everything into one tidy dataset

This exact pattern works for Twitter, GitHub, news APIs, and pretty much every other modern API you’ll encounter.
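
Here’s roughly what that pattern looks like in Python. It’s a sketch, not the notebook’s official solution, and it assumes the `headers`, `response` and `data` variables from the earlier sketch are already in place:

```python
import time

import requests

# A sketch of the collection loop in section 2.2. It picks up where the
# authentication sketch left off: `headers` and `data` already exist.
all_pages = [data]                       # page 1, from your first request
after_token = data["data"]["after"]

for _ in range(3):                       # pages 2, 3 and 4 -> 100 posts in total
    params = {"limit": 25, "after": after_token}    # the bookmark goes here
    response = requests.get(
        "https://oauth.reddit.com/r/Art/top", headers=headers, params=params
    )
    page = response.json()
    all_pages.append(page)
    after_token = page["data"]["after"]  # new bookmark for the next iteration
    time.sleep(1)                        # be polite: small pause between requests
```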

🎯 What You’ll Do

This is where you’ll write the pagination logic yourself. No copy-pasting—you need to understand how this works.

  1. Write the Collection Loop

    In section 2.2, you’ll create a for loop that makes 3 more requests (giving you pages 2, 3, and 4 for a total of 100 posts). Each time through the loop:

    • You’ll update your request parameters to include the after token from the previous response
    • Make the API request with these updated parameters
    • Extract the new after token for your next iteration
    • Add a polite time.sleep(1) (or some other small delay) if you think you might be close to the rate limit
  2. Combine Your Results

    In section 2.3, you’ll dig into your 4 pages of JSON data, extract all the individual posts, and create one clean pandas DataFrame (see the sketch after this list).

  3. Check Your Work

    In section 2.4, you’ll verify everything worked properly:

    • Confirm you’ve got exactly 100 unique posts (no duplicates!)
    • Check the data covers the time range you expected
    • Find the most upvoted post in your collection
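
As a rough guide to sections 2.3 and 2.4, here’s a sketch of how the combining and checking might look, assuming `all_pages` holds the four listing dictionaries collected by your loop (the field names id, created_utc, ups and title come straight from Reddit’s post data):

```python
import pandas as pd

# Flatten the four pages into one list of post dictionaries, then into a DataFrame.
records = [
    child["data"]
    for page in all_pages
    for child in page["data"]["children"]
]
posts_df = pd.DataFrame(records)

# Sanity checks: 100 unique posts, the time range covered, and the top post
print("Total posts:", len(posts_df))
print("Unique posts:", posts_df["id"].nunique())   # should also be 100
timestamps = pd.to_datetime(posts_df["created_utc"], unit="s")
print("Time range:", timestamps.min(), "to", timestamps.max())
print("Most upvoted:", posts_df.loc[posts_df["ups"].idxmax(), "title"])
```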

Part III: Saving Your Work and Thinking Bigger (30 min)

Finally, you’ll save your collected data properly and explore how to turn your working code into something you can reuse.

🎯 What You’ll Do

You’ll finish the job properly with good data management practices.

  1. Save Your Data

    You’ll save your posts in two formats: JSON (for backup) and CSV (for analysis). Use sensible filenames like Art_top_100_posts.json. A sketch of this step and the next follows this list.

  2. Double-Check Everything

    You’ll load your saved CSV back into a fresh DataFrame to make sure nothing got corrupted in the saving process.
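
Here’s a small sketch of both steps, assuming the `all_pages` list and `posts_df` DataFrame from the earlier sketches (the filenames are just the examples suggested above):

```python
import json

import pandas as pd

# Save the raw JSON as a backup and the DataFrame as a CSV for analysis.
with open("Art_top_100_posts.json", "w") as f:
    json.dump(all_pages, f)

posts_df.to_csv("Art_top_100_posts.csv", index=False)

# Reload the CSV into a fresh DataFrame to check nothing got corrupted
reloaded_df = pd.read_csv("Art_top_100_posts.csv")
print("Rows after reload:", len(reloaded_df))      # should still be 100
```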

🚀 For the Keen Ones: Function-Based Approach

If you finish early, there’s a bonus section where you’ll refactor your pagination code into a reusable function called fetch_reddit_posts().
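
If you want a starting point, here’s a hedged sketch of what such a function could look like. The parameter names and defaults are illustrative rather than the notebook’s official solution, and it assumes you already have a `headers` dictionary with your access token:

```python
import time

import pandas as pd
import requests


def fetch_reddit_posts(subreddit, headers, n_pages=4, limit=25, delay=1):
    """Collect `n_pages` pages of top posts from a subreddit as one DataFrame.

    A sketch: `headers` is assumed to hold the Authorization and User-Agent
    headers set up earlier; parameter names and defaults are illustrative.
    """
    records = []
    after_token = None
    for _ in range(n_pages):
        params = {"limit": limit}
        if after_token is not None:
            params["after"] = after_token       # the bookmark from the last page
        response = requests.get(
            f"https://oauth.reddit.com/r/{subreddit}/top",
            headers=headers,
            params=params,
        )
        page = response.json()
        records.extend(child["data"] for child in page["data"]["children"])
        after_token = page["data"]["after"]
        if after_token is None:                 # no more pages to fetch
            break
        time.sleep(delay)                       # stay polite between requests
    return pd.DataFrame(records)


# Example usage (with the `headers` dictionary from earlier):
# posts_df = fetch_reddit_posts("Art", headers)
```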

✨ What You’ve Accomplished

Excellent work! You’ve just learnt one of the most important skills in data engineering: API pagination. This technique will serve you well when you’re working with any API that limits response sizes.

What you can now do:

  • Understand how pagination tokens work as bookmarks
  • Make sequential API requests with updated parameters
  • Combine multiple API responses into unified datasets
  • Be mindful of rate limiting when making API requests
  • Verify your data quality with proper checks

🔍 Tomorrow: When Things Get More Complex

Tomorrow’s session builds directly on what you’ve done today.

What happens when you want to collect both posts AND comments from Reddit? You’ll have two related but separate datasets. How do you organise that? How do you connect a comment back to its original post?

This will lead us naturally into database design and SQL.