Choosing Your Data Source: A Guide

Ideas and requirements for your DS105A final project

Everything you need to know about data sources for your final project
Published: 05 December 2024

🚨 IMPORTANT REMINDERS – Do not ignore this text

Before exploring data sources, make sure you understand these key requirements:

  1. Simple bulk downloads won’t work as your main data source. For example, none of the following are allowed:

    • Downloading a very simple CSV from a website
    • Using a pre-made dataset from Kaggle

    However, these datasets can be used as supplementary data to enrich your analysis.

  2. Stick to the requests library - 'pre-made' API wrappers like spotipy for Spotify or praw for Reddit are not allowed.

  3. Document your decisions in the notebooks.

  4. Focus on quality of analysis over quantity of data. Only grow in complexity once you have a first full draft of your project.
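To make reminder 2 concrete, here is a minimal sketch of calling an API with plain requests while respecting rate limits. It assumes the API signals rate limiting with HTTP status 429 and, optionally, a Retry-After header given in seconds; many APIs do this, but check the documentation of the one you choose. The helper name get_json_with_retry and its parameters are illustrative and not part of the requests library.

```python
import time


def get_json_with_retry(session, url, params=None, headers=None,
                        max_retries=3, sleep=time.sleep):
    """GET `url` and return the parsed JSON, retrying when rate-limited.

    `session` is expected to be a requests.Session (anything with a
    compatible .get() method works, which keeps the helper easy to test).
    """
    for attempt in range(max_retries + 1):
        response = session.get(url, params=params, headers=headers, timeout=30)

        if response.status_code != 429:
            # Not rate-limited: raise on other HTTP errors, else return the data.
            response.raise_for_status()
            return response.json()

        if attempt < max_retries:
            # Assumption: Retry-After, when present, is a number of seconds.
            # Otherwise fall back to exponential backoff: 1s, 2s, 4s, ...
            try:
                wait = float(response.headers.get("Retry-After", ""))
            except ValueError:
                wait = 2.0 ** attempt
            sleep(wait)

    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```

With requests installed, you could call it as get_json_with_retry(requests.Session(), url, params=...). For APIs that need a token, pass it through headers; the exact authentication scheme depends on the API, so document in your notebook how you handled it.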

Core Requirements

Your primary data source must be one of:

  1. Data collected through an API
    • Use the requests library
    • You are responsible for handling authentication and rate limits
  2. Self-collected data with proper documentation
    • Clear collection methodology
    • Explicit consent was obtained
    • Privacy considerations addressed
  3. Complex static datasets
    • Simple static datasets are not allowed, but you ARE allowed to use downloaded datasets that are complex and require significant reshaping.
    • Examples include OpenSanctions, OpenCorporates, the World Values Survey, V-Dem, or any other dataset you came across in your research.
    • Example of the complexity we expect: nested JSON structures that need normalization
    • If you go down this route, you need to get express permission from Jon.
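To give a sense of what "significant reshaping" can look like, here is a small sketch that flattens nested JSON records into one row per entity and sanction. The record structure and field names are made up for illustration and are not the actual schema of OpenSanctions or any other dataset mentioned above.

```python
def flatten_entities(entities):
    """Turn nested entity records into one flat row per (entity, sanction)."""
    rows = []
    for entity in entities:
        # Nested objects may be missing, so fall back to None instead of failing.
        country = entity.get("address", {}).get("country")
        for sanction in entity.get("sanctions", []):
            rows.append({
                "entity_id": entity["id"],
                "name": entity["name"],
                "country": country,
                "authority": sanction["authority"],
                "start_date": sanction.get("start_date"),
            })
    return rows


# Made-up sample data (hypothetical fields, for illustration only).
sample = [
    {"id": "e1", "name": "Acme Ltd", "address": {"country": "GB"},
     "sanctions": [{"authority": "UK", "start_date": "2022-03-01"},
                   {"authority": "EU", "start_date": "2022-04-15"}]},
    {"id": "e2", "name": "Globex", "sanctions": [{"authority": "US"}]},
    {"id": "e3", "name": "Initech", "address": {"country": "US"}, "sanctions": []},
]

rows = flatten_entities(sample)
for row in rows:
    print(row)
```

Note that entities with no sanctions produce no rows in this sketch. That is exactly the kind of decision you should document in your notebook. The pandas function json_normalize can do similar work once your data is loaded.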

💡 Note: You’re encouraged to discover and come up with other interesting data sources. If you’re using an API that’s not listed here, you don’t need to get it pre-approved - just go for it!

Getting Started: Data Source Ideas

APIs You Already Know

This is by far the safest bet, as you’re already familiar with these APIs.

Weather & Environment

  • OpenMeteo API
    • Clean documentation
    • Free to use
    • There are many endpoints and angles to this data we haven’t explored in the course!

💡 Tip: While you’ve used OpenMeteo in your W06 summative, there are many endpoints and analytical angles we haven’t touched in the course.
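As one example of a different angle, the sketch below targets OpenMeteo's historical archive rather than the forecast. The URL, parameter names, and the column-oriented "daily" block in the response reflect our understanding of the OpenMeteo documentation, so verify them there before relying on this. The function names are illustrative.

```python
# Assumption: this is the historical weather endpoint described in the
# OpenMeteo documentation; confirm the URL and parameters there.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_params(latitude, longitude, start_date, end_date,
                         daily=("temperature_2m_max",)):
    """Build the query parameters for a daily historical request."""
    return {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,   # ISO dates, e.g. "2024-01-01"
        "end_date": end_date,
        "daily": ",".join(daily),   # several variables go in one comma-separated string
    }


def daily_to_rows(payload):
    """Reshape the column-oriented "daily" block into one dict per day."""
    daily = payload["daily"]
    columns = list(daily)
    return [dict(zip(columns, values)) for values in zip(*daily.values())]


def fetch_daily(session, **kwargs):
    """Request daily data with a requests.Session and return it as rows."""
    response = session.get(ARCHIVE_URL, params=build_archive_params(**kwargs),
                           timeout=30)
    response.raise_for_status()
    return daily_to_rows(response.json())
```

With requests installed, fetch_daily(requests.Session(), latitude=51.5, longitude=-0.12, start_date="2024-01-01", end_date="2024-01-31") should return a list of rows that you can pass straight to a pandas DataFrame.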

Music & Entertainment

  • Spotify API
    • You’re well familiar with this one from the summative
    • Great for more complex ideas like analyzing music features across genres or tracking popularity trends
    • Authentication setup already familiar to you

💬 Comment: If you are not tired of using this API, this is definitely a safe bet. But what will you come up with to add more complexity?

Social Media & Text

  • Reddit API
    • We’ve covered the basics of connecting to it
    • There’s a wealth of information beyond what we explored
    • Perfect if you want to do analysis on text data (not covered in the course, though)

Any other API!

(This is likely to be the most popular choice for projects)

You can search online for APIs that interest you. Here are just some initial ideas:

Transport for London (TfL)

  • Real-time journey data
  • ⚠️ Documentation can be tricky
  • Active developer forum if you need help

Google APIs

There are multiple interesting endpoints:

  • Maps (location analysis)
  • Trends (search patterns)
  • Books Ngram (text analysis)
  • All require API key registration, and Google may charge a fee if you go beyond certain usage limits. We ARE NOT responsible for any charges you might incur.

Self-Collected Data Approaches

Only choose this approach once you have figured out how to get the data out of your devices or accounts and into a format you can work with (e.g., CSV, JSON, TXT). The data does not need to be perfectly structured at the start, but you can’t just dump a bunch of text files in a folder: you will need a clear plan for how you will structure it.

Some ideas:

Personal Analytics

  • Health and fitness tracking
  • App usage patterns
  • Digital behavior logs

⚠️ CONSENT REQUIREMENTS

When collecting personal data from team members (or anyone else), you must follow these guidelines:

  1. Each person must explicitly agree to share their data (write a terms and conditions document if you are collecting it through an automated API)
  2. Document this consent in your repository
  3. Include your data anonymization strategy
  4. Clearly state what data is being collected and how it will be used
  5. Address any privacy concerns in your documentation

Your repository must include a clear statement about how consent was obtained from each team member.
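One possible building block for the anonymization strategy in guideline 3 is to replace names with stable pseudonyms before the data ever reaches your repository. The sketch below uses a keyed hash from Python's standard library. The function names, the "p_" prefix, and the 12-character length are arbitrary choices for illustration, not a course requirement.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return "p_" + digest.hexdigest()[:12]


def pseudonymize_rows(rows, secret_key, field="name"):
    """Return copies of the rows with `field` replaced by its pseudonym."""
    return [{**row, field: pseudonymize(row[field], secret_key)} for row in rows]


# Hypothetical key and data. Keep the real key out of your repository.
key = b"replace-with-a-secret-kept-out-of-git"
raw = [
    {"name": "Alice", "steps": 8200},
    {"name": "Bob", "steps": 10450},
    {"name": "Alice", "steps": 7900},
]

safe = pseudonymize_rows(raw, key)
for row in safe:
    print(row)
```

The same person always gets the same pseudonym, so you can still group by participant. Keep in mind that this is pseudonymization rather than full anonymization: in a small team, other columns may still reveal who is who, so address that in your privacy documentation.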

Survey-Based Research

💡 Perfect Match: If you’re in your third year or taking another course that requires conducting a survey, this could be an excellent opportunity to combine efforts - just ensure it works for your whole group and matches DS105A’s data manipulation requirements.

Keep in mind:

  • You’ll need substantial responses (think hundreds, not dozens)
  • Consider combining with other data sources
  • Your analysis methods should match your expertise level
  • Note: If you’re using statistical methods, you need to be confident in explaining and justifying their use.