🔍 Choosing Your Data Source

A guide for your DS105W final project

Author: Dr Jon Cardoso-Silva

Published: 31 March 2025


Selecting the Right Data Source

Choosing an appropriate data source is a critical first step for your final project. This guide will help you understand the requirements and make informed decisions about where to get your data.

🚨 Important Reminders

Before exploring data sources, make sure you understand these key requirements:

  1. Simple bulk downloads won’t work as your main data source. For example:

    • Downloading a basic CSV from a website
    • Using a pre-made dataset from Kaggle

    However, these datasets can be used as supplementary data to enrich your analysis.

  2. Stick to the requests library. Pre-made API wrappers like spotipy for Spotify or praw for Reddit are not allowed.

  3. Document your decisions in your notebooks and README.

  4. Focus on quality of analysis over quantity of data. Only increase complexity once you have a first full draft of your project.

Core Requirements

Your primary data source must be one of:

1. Data Collected Through an API

Key considerations (a minimal request sketch follows this list):

  • You are responsible for handling authentication and rate limits
  • Use the requests library (no pre-made wrappers)
  • Store your API keys securely (never commit them to GitHub)
  • Plan for API downtime or rate limiting
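
To make these points concrete, here is a minimal sketch of what a collection script could look like. The endpoint (api.example.com) and the MY_API_KEY environment variable are hypothetical placeholders; swap in the details of whichever API you choose.

```python
import os
import time

import requests

# Hypothetical endpoint -- replace with the API you actually choose
BASE_URL = "https://api.example.com/v1/records"

# Read the key from an environment variable so it never ends up on GitHub
API_KEY = os.environ["MY_API_KEY"]


def fetch_page(page: int) -> dict:
    """Request one page of results, backing off politely if rate-limited."""
    response = requests.get(
        BASE_URL,
        params={"page": page},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    if response.status_code == 429:  # too many requests
        wait = int(response.headers.get("Retry-After", 60))
        time.sleep(wait)             # respect the API's cool-down period
        return fetch_page(page)      # then try again
    response.raise_for_status()      # fail loudly on any other error
    return response.json()
```

A common pattern is to keep the key in a local .env file that is listed in your .gitignore and load it into the environment (for example with the python-dotenv package) before running your scripts.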

2. Self-Collected Data

This is an ambitious but rewarding approach if you have access to interesting data:

Requirements:

  • Clear collection methodology
  • Explicit consent obtained (if collecting from people)
  • Privacy considerations addressed
  • Structured format that can be loaded into pandas

Examples:

  • Spotify: you can register for a free API key, but rather than using it only to query the public catalogue, use Spotify’s Authorisation (OAuth) flow, with proper consent, to access your own listening history (a sketch of this flow follows the list)
  • Personal analytics (health data, app usage, digital behaviour)
  • Survey-based research (with substantial responses)
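
For the Spotify example, here is a rough, simplified sketch of the final steps of the Authorization Code flow. It assumes you have already registered an app in the Spotify developer dashboard, approved the user-read-recently-played scope in your browser, and copied the authorization code that Spotify appended to your redirect URL. The endpoints are taken from Spotify’s public documentation; double-check them there before relying on this.

```python
import os

import pandas as pd
import requests

CLIENT_ID = os.environ["SPOTIFY_CLIENT_ID"]
CLIENT_SECRET = os.environ["SPOTIFY_CLIENT_SECRET"]
REDIRECT_URI = "http://localhost:8888/callback"  # must match your app settings

# The value Spotify appended to your redirect URL after you approved access
code = "PASTE_THE_AUTHORIZATION_CODE_HERE"

# Exchange the authorization code for an access token
token_response = requests.post(
    "https://accounts.spotify.com/api/token",
    data={"grant_type": "authorization_code", "code": code, "redirect_uri": REDIRECT_URI},
    auth=(CLIENT_ID, CLIENT_SECRET),
    timeout=30,
)
token_response.raise_for_status()
access_token = token_response.json()["access_token"]

# With the token, request *your own* recently played tracks
history = requests.get(
    "https://api.spotify.com/v1/me/player/recently-played",
    headers={"Authorization": f"Bearer {access_token}"},
    params={"limit": 50},
    timeout=30,
)
history.raise_for_status()

# Flatten the nested JSON into a tidy DataFrame
df = pd.json_normalize(history.json()["items"])
print(df[["played_at", "track.name"]].head())
```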

3. Complex Static Datasets

Although simple static datasets are not allowed, you can use downloaded datasets if they are complex and require significant reshaping:

Requirements:

  • Dataset must require significant data reshaping
  • Should involve complex JSON structures needing normalisation
  • Must be properly cited and documented
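
As an illustration of what ‘significant reshaping’ can look like, here is a minimal sketch based on a hypothetical downloaded JSON file with nested records. The file name, record path and metadata columns are made up; adapt them to the structure of your own dataset.

```python
import json

import pandas as pd

# Hypothetical example: a downloaded JSON file with deeply nested records
with open("data/raw/funding_awards.json") as f:
    raw = json.load(f)

# Flatten the nested structure into a tidy table
awards = pd.json_normalize(
    raw["results"],              # list of award records
    record_path=["recipients"],  # one row per recipient
    meta=["award_id", "year", ["programme", "name"]],
    sep="_",
)
print(awards.head())
```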

API Suggestions

APIs You’ve Already Used

Weather & Environment

  • OpenMeteo API
    • Clean documentation
    • Free to use
    • Many endpoints and analytical angles we haven’t explored in the course
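
For example, a single request can pull hourly variables for any location and time window. A minimal sketch is below; the parameter names follow the Open-Meteo documentation, so double-check them there.

```python
import pandas as pd
import requests

# Hourly temperature and precipitation for London over the past two weeks
response = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={
        "latitude": 51.5072,
        "longitude": -0.1276,
        "hourly": "temperature_2m,precipitation",
        "past_days": 14,
    },
    timeout=30,
)
response.raise_for_status()

hourly = response.json()["hourly"]
df = pd.DataFrame(hourly)
df["time"] = pd.to_datetime(df["time"])
print(df.head())
```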

Social Media & Text

  • Reddit API
    • We’ve covered the basics of connecting to it
    • Rich information beyond what we explored in Mini-Project 2
    • Perfect for text data analysis
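
Here is a rough sketch of pulling a subreddit’s top posts into pandas. It assumes an application-only (client credentials) token, which may differ from the grant type you used in Mini-Project 2, so adapt it to your own setup.

```python
import os

import pandas as pd
import requests

auth = (os.environ["REDDIT_CLIENT_ID"], os.environ["REDDIT_CLIENT_SECRET"])
headers = {"User-Agent": "ds105w-final-project by u/your_username"}

# Application-only token (your Mini-Project 2 setup may use a different grant)
token = requests.post(
    "https://www.reddit.com/api/v1/access_token",
    auth=auth,
    data={"grant_type": "client_credentials"},
    headers=headers,
    timeout=30,
).json()["access_token"]
headers["Authorization"] = f"Bearer {token}"

# Top posts of the week from a subreddit of your choice
listing = requests.get(
    "https://oauth.reddit.com/r/london/top",
    headers=headers,
    params={"t": "week", "limit": 100},
    timeout=30,
)
listing.raise_for_status()

posts = [child["data"] for child in listing.json()["data"]["children"]]
df = pd.DataFrame(posts)[["title", "selftext", "score", "num_comments", "created_utc"]]
print(df.head())
```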

Other Interesting APIs

Transport & Location

  • Transport for London (TfL)
    • Real-time journey data
    • Documentation can be tricky
    • Active developer forum if you need help
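
As a starting point, here is a minimal sketch of a call to one of TfL’s simpler endpoints (the line-status endpoint of the Unified API). The field names are taken from the API’s responses; verify them against the documentation for the endpoints you end up using.

```python
import pandas as pd
import requests

# Current status of every Tube line; registering for a free app_key on the
# TfL API portal and passing it as a query parameter raises your rate limit
response = requests.get("https://api.tfl.gov.uk/Line/Mode/tube/Status", timeout=30)
response.raise_for_status()

status = pd.DataFrame(
    [
        {
            "line": line["name"],
            "status": line["lineStatuses"][0]["statusSeverityDescription"],
        }
        for line in response.json()
    ]
)
print(status)
```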

Google APIs

  • Maps (location analysis)
  • Trends (search patterns)
  • Books Ngram (text analysis)
  • Note: These might charge a fee if you go beyond their free usage quotas. We are NOT responsible for any charges you might incur.

Government & Public Data

  • UK Government Data
    • data.gov.uk
    • Various datasets on public services, economy, health
  • EU Open Data Portal

Entertainment & Media

  • TMDB (The Movie Database)
    • Film and TV show information
    • Well-documented API
  • News APIs
    • NewsAPI, The Guardian, New York Times
    • Current events and historical articles
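
As one illustration, here is a minimal sketch of a query against the Guardian’s Open Platform (a free API key is required); the other news APIs follow a broadly similar query-parameter pattern.

```python
import os

import pandas as pd
import requests

response = requests.get(
    "https://content.guardianapis.com/search",
    params={
        "q": "air pollution",  # your own search terms
        "from-date": "2024-01-01",
        "page-size": 50,
        "api-key": os.environ["GUARDIAN_API_KEY"],
    },
    timeout=30,
)
response.raise_for_status()

articles = pd.json_normalize(response.json()["response"]["results"])
print(articles[["webTitle", "sectionName", "webPublicationDate"]].head())
```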

Evaluating Data Source Quality

When choosing your data source, consider these factors:

  1. Relevance: Does the data help answer your research question?
  2. Accessibility: Can you reliably access the data throughout your project?
  3. Quality: Is the data accurate, complete, and up-to-date?
  4. Volume: Is there enough data to support meaningful analysis?
  5. Complexity: Does the data require interesting transformation work?
  6. Documentation: Is the API or dataset well-documented?

Common Pitfalls to Avoid

  1. Choosing overly simple datasets: Your project should demonstrate data transformation skills.
  2. Relying on unstable APIs: Some free APIs have severe rate limits or poor reliability.
  3. Collecting too much data: Focus on quality over quantity.
  4. Ignoring data privacy: Always respect privacy and obtain proper consent.
  5. Waiting too long to start: Begin with a small sample to test your approach.

Getting Started

  1. Explore the documentation: Understand what data is available and how to access it.
  2. Test with small requests: Verify you can access and process the data before committing.
  3. Plan your database schema: Design your tables and relationships early (a sketch follows this list).
  4. Document your process: Keep notes on your data collection decisions.
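
To illustrate points 2 and 3, here is a minimal sketch that creates an illustrative schema and appends a tiny test batch. It assumes SQLite as the storage layer; the table and column names are made up, so design yours around the entities in your own data.

```python
import sqlite3

import pandas as pd

# Illustrative schema -- design tables around the entities in *your* data
schema = """
CREATE TABLE IF NOT EXISTS posts (
    post_id      TEXT PRIMARY KEY,
    title        TEXT,
    score        INTEGER,
    created_utc  REAL
);
CREATE TABLE IF NOT EXISTS comments (
    comment_id   TEXT PRIMARY KEY,
    post_id      TEXT REFERENCES posts(post_id),
    body         TEXT,
    score        INTEGER
);
"""

conn = sqlite3.connect("project.db")
conn.executescript(schema)

# Append a small test batch first to confirm the pipeline works end-to-end
sample = pd.DataFrame(
    [{"post_id": "abc123", "title": "test post", "score": 1, "created_utc": 0.0}]
)
sample.to_sql("posts", conn, if_exists="append", index=False)

print(pd.read_sql("SELECT COUNT(*) AS n_posts FROM posts", conn))
conn.close()
```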

💡 Note: You’re encouraged to discover and come up with other interesting data sources. If you’re using an API that’s not listed here, you don’t need to get it pre-approved: just go for it!