💻 Week 02, Day 03 - Lab
Mastering API Pagination
By the end of this lab, you should be able to: i) Understand the role of pagination in data collection from APIs, ii) Identify and use an `after` token from a JSON response to fetch the next page of data, iii) Write a loop to make multiple, sequential API requests to collect a full dataset, iv) Consolidate paginated results into a single data structure for analysis.
This morning, you successfully connected to Reddit’s API and pulled down some data. Brilliant! But here’s the thing: Reddit only gives you 25 posts at a time. That’s fine for a quick look, but what if you need 100 posts for a proper analysis?
You can’t just ask for 100 posts in one go. Instead, you need to make multiple requests and stitch the results together. This process is called pagination, and it’s absolutely essential for serious data collection.
⏰ Wednesday, 23 July 2025 | Either 2:00-3:30pm or 3:30-5:00pm 📍 Check your timetable for the location of your class
🛣️ What You’ll Be Doing
You’ll work through the ME204_W02D03_Lab.ipynb notebook, which should be in your lab-notebooks/ folder on Nuvolos.
Part I: Getting Started and Understanding the Problem (20 min)
You’ll begin by setting up your Reddit connection using this morning’s credentials, then make your first request to see exactly how pagination works.
Seeing Pagination in Action
Your teacher will show you:
- Where to find the `after` token buried in Reddit’s response
- How this token works like a bookmark that says “start here for the next batch”
- Why APIs do this (hint: imagine if everyone could request a million posts at once!)
🎯 What You’ll Do
Right, time to get your hands dirty. You’ll work through the notebook step by step.
Load Your Reddit Credentials
In section 1.1, you’ll load the same credentials you set up this morning. The notebook checks you’ve got all four bits: username, password, client ID, and client secret.
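The snippet below is a minimal sketch of one way to load and check those credentials, assuming they live in a local `.env` file read with `python-dotenv`; the environment variable names are placeholders, not necessarily the ones the notebook uses.

```python
import os

from dotenv import load_dotenv  # assumes the python-dotenv package is installed

load_dotenv()  # read a local .env file into environment variables, if one exists

# Placeholder variable names -- match them to whatever you used this morning
credentials = {
    "username": os.getenv("REDDIT_USERNAME"),
    "password": os.getenv("REDDIT_PASSWORD"),
    "client_id": os.getenv("REDDIT_CLIENT_ID"),
    "client_secret": os.getenv("REDDIT_CLIENT_SECRET"),
}

# Confirm all four bits are present before going any further
missing = [name for name, value in credentials.items() if not value]
assert not missing, f"Missing credentials: {missing}"
```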
Get Connected to Reddit
In section 1.2, you’ll use the exact same authentication dance from the lecture. Post your credentials, get back an access token, job done.
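For reference, Reddit’s script-app password flow looks roughly like this. It’s a sketch rather than the notebook’s exact code; the `User-Agent` string is a placeholder you should personalise, and `credentials` is the dictionary from the sketch above.

```python
import requests
from requests.auth import HTTPBasicAuth

# Exchange your username/password plus app credentials for a short-lived access token
auth = HTTPBasicAuth(credentials["client_id"], credentials["client_secret"])
payload = {
    "grant_type": "password",
    "username": credentials["username"],
    "password": credentials["password"],
}
headers = {"User-Agent": "ME204-lab by u/your_username"}  # placeholder -- use your own

response = requests.post(
    "https://www.reddit.com/api/v1/access_token",
    auth=auth, data=payload, headers=headers,
)
response.raise_for_status()
token = response.json()["access_token"]

# Every subsequent request goes to oauth.reddit.com with this header attached
headers["Authorization"] = f"bearer {token}"
```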
Make Your First Request
In section 2.1, you’ll fetch 25 posts from r/Art (or whatever subreddit takes your fancy). More importantly, you’ll examine the response and spot that crucial `after` token.
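In rough terms, that first request and the `after` token look like this. This is a sketch that reuses the `headers` dictionary from the authentication step; r/Art and the limit of 25 mirror the text above.

```python
import requests

# Fetch the first page: 25 top posts from r/Art
params = {"limit": 25}
response = requests.get(
    "https://oauth.reddit.com/r/Art/top",
    headers=headers, params=params,
)
response.raise_for_status()
page = response.json()

posts = page["data"]["children"]  # the 25 post objects
after = page["data"]["after"]     # the bookmark that points at the next batch
print(len(posts), after)
```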
Part II: Building Your Collection Loop (40 min)
Now for the meat of it: you’ll write the code that automatically requests multiple pages and combines them into one dataset.
Here’s what you’re building:
- Make your first request → Reddit gives you 25 posts plus an `after` token
- Use that `after` token in your next request → get 25 more posts plus a new `after` token
- Keep going until you’ve got enough data
- Combine everything into one tidy dataset
This exact pattern works for Twitter, GitHub, and news APIs; pretty much every modern API uses this approach.
🎯 What You’ll Do
This is where you’ll write the pagination logic yourself. No copy-pasting—you need to understand how this works.
Write the Collection Loop
In section 2.2, you’ll create a `for` loop that makes 3 more requests (giving you pages 2, 3, and 4 for a total of 100 posts). Each time through the loop (sketched in code just after this list):
- You’ll update your request parameters to include the `after` token from the previous response
- Make the API request with these updated parameters
- Extract the new `after` token for your next iteration
- Add a polite `time.sleep(1)` (or some other small delay) if you think you might be close to the rate limit
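Here is a minimal sketch of that loop, assuming the `page`, `after`, `headers`, and `params` variables from your first request in section 2.1. Treat it as a map of the logic rather than something to paste in.

```python
import time

import requests

all_pages = [page]  # keep the first response you already have

for _ in range(3):  # pages 2, 3 and 4 -> 100 posts in total
    params["after"] = after  # tell Reddit where the last batch ended
    response = requests.get(
        "https://oauth.reddit.com/r/Art/top",
        headers=headers, params=params,
    )
    response.raise_for_status()
    page = response.json()

    all_pages.append(page)
    after = page["data"]["after"]  # bookmark for the *next* iteration

    time.sleep(1)  # a polite pause, well clear of the rate limit
```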
Combine Your Results
In section 2.3, you’ll dig into your 4 pages of JSON data, extract all the individual posts, and create one clean pandas DataFrame.
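One possible way to flatten those pages is sketched below, assuming the `all_pages` list from the loop sketch above; the column selection is illustrative, not prescriptive.

```python
import pandas as pd

# Each page stores its posts under data -> children -> data
records = [child["data"] for p in all_pages for child in p["data"]["children"]]

df = pd.json_normalize(records)

# Keep a handful of analysis-friendly columns (an illustrative choice)
df = df[["id", "title", "author", "ups", "num_comments", "created_utc"]]
df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s")
print(df.shape)
```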
Check Your Work
In section 2.4, you’ll verify everything worked properly:
- Confirm you’ve got exactly 100 unique posts (no duplicates!)
- Check the data covers the time range you expected
- Find the most upvoted post in your collection
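Checks along these lines would cover all three points; this is a sketch that assumes the `df` columns used in the earlier DataFrame sketch.

```python
# No duplicates, and exactly 100 posts collected
assert df["id"].is_unique, "Found duplicate posts!"
assert len(df) == 100, f"Expected 100 posts, got {len(df)}"

# Time range covered by the collection
print(df["created_utc"].min(), "to", df["created_utc"].max())

# The most upvoted post
top_post = df.loc[df["ups"].idxmax()]
print(top_post["title"], top_post["ups"])
```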
Part III: Saving Your Work and Thinking Bigger (30 min)
Finally, you’ll save your collected data properly and explore how to turn your working code into something you can reuse.
🎯 What You’ll Do
You’ll finish the job properly with good data management practices.
Save Your Data
You’ll save your posts in two formats: JSON (for backup) and CSV (for analysis). Use sensible filenames like `Art_top_100_posts.json`.
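A sketch of saving both formats, assuming the `all_pages` and `df` objects from Part II; the filenames simply follow the example above.

```python
import json

# Raw API responses as a JSON backup of exactly what Reddit returned
with open("Art_top_100_posts.json", "w") as f:
    json.dump(all_pages, f)

# The tidy DataFrame as CSV, ready for analysis
df.to_csv("Art_top_100_posts.csv", index=False)
```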
Double-Check Everything
You’ll load your saved CSV back into a fresh DataFrame to make sure nothing got corrupted in the saving process.
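Reloading is a one-liner, and a quick shape-and-columns comparison catches most problems; again, a sketch assuming the objects above.

```python
import pandas as pd

reloaded = pd.read_csv("Art_top_100_posts.csv", parse_dates=["created_utc"])

# Same number of rows and the same columns as before saving?
print(reloaded.shape == df.shape)
print(list(reloaded.columns) == list(df.columns))
```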
If you finish early, there’s a bonus section where you’ll refactor your pagination code into a reusable function called `fetch_reddit_posts()`.
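If you get that far, one possible shape for the function is sketched below; it is not the model answer, and the parameters are illustrative.

```python
import time

import requests


def fetch_reddit_posts(subreddit, headers, n_pages=4, limit=25, listing="top"):
    """Collect n_pages pages of posts from a subreddit and return the raw post dicts."""
    url = f"https://oauth.reddit.com/r/{subreddit}/{listing}"
    params = {"limit": limit}
    posts = []

    for _ in range(n_pages):
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        data = response.json()["data"]

        posts.extend(child["data"] for child in data["children"])
        params["after"] = data["after"]  # bookmark for the next page

        if params["after"] is None:  # no more pages to fetch
            break
        time.sleep(1)  # stay well clear of the rate limit

    return posts
```

Called as `fetch_reddit_posts("Art", headers)`, it would return the same 100 posts you collected by hand in Part II.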
✨ What You’ve Accomplished
Excellent work! You’ve just learnt one of the most important skills in data engineering: API pagination. This technique will serve you well when you’re working with any API that limits response sizes.
What you can now do:
- Understand how pagination tokens work as bookmarks
- Make sequential API requests with updated parameters
- Combine multiple API responses into unified datasets
- Be mindful of rate limiting when making API requests
- Verify your data quality with proper checks
🔍 Tomorrow: When Things Get More Complex
Tomorrow’s session builds directly on what you’ve done today.
What happens when you want to collect both posts AND comments from Reddit? You’ll have two related but separate datasets. How do you organise that? How do you connect a comment back to its original post?
This will lead us naturally into database design and SQL.