🔍 Choosing Your Data Source
A guide for your DS105W final project

Selecting the Right Data Source
Choosing an appropriate data source is a critical first step for your final project. This guide will help you understand the requirements and make informed decisions about where to get your data.
🚨 Important Reminders
Before exploring data sources, make sure you understand these key requirements:
Simple bulk downloads won’t work as your main data source. For example:
- Downloading a basic CSV from a website
- Using a pre-made dataset from Kaggle
However, these datasets can be used as supplementary data to enrich your analysis.
Stick to the requests library. Pre-made API wrappers like spotipy for Spotify or praw for Reddit are not allowed.
Document your decisions in your notebooks and README.
Focus on quality of analysis over quantity of data. Only increase complexity once you have a first full draft of your project.
Core Requirements
Your primary data source must be one of:
1. Data Collected Through an API
Key considerations:
- You are responsible for handling authentication and rate limits
- Use the requests library (no pre-made wrappers)
- Store your API keys securely (never commit them to GitHub)
- Plan for API downtime or rate limiting
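The considerations above (keys kept out of the repository, rate limits, downtime) can be sketched with plain requests as follows. This is a minimal sketch, not a required pattern: the environment variable name MY_API_KEY, the helper names, and the retry settings are all assumptions chosen for illustration, and your API may signal rate limiting differently.

```python
import os
import time

import requests


def get_api_key(var_name="MY_API_KEY"):
    """Read the key from an environment variable (hypothetical name)
    so it never appears in a notebook or a commit."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


def wait_seconds(retry_after, attempt, base=1.0, cap=60.0):
    """Seconds to wait before the next attempt.

    Uses a numeric Retry-After header when the server sends one,
    otherwise falls back to exponential backoff (1s, 2s, 4s, ...).
    """
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After can also be a date; fall back to backoff
    return min(base * 2 ** attempt, cap)


def get_with_retries(url, params=None, headers=None, max_attempts=4):
    """GET a URL, retrying on HTTP 429 and 5xx responses."""
    for attempt in range(max_attempts):
        response = requests.get(url, params=params, headers=headers, timeout=10)
        if response.status_code != 429 and response.status_code < 500:
            response.raise_for_status()  # other 4xx errors are not retried
            return response
        time.sleep(wait_seconds(response.headers.get("Retry-After"), attempt))
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")
```

How the key is sent (query parameter, header, or OAuth token) depends on the API, so check its documentation before wiring get_api_key into a request.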
2. Self-Collected Data
This is an ambitious but rewarding approach if you have access to interesting data:
Requirements:
- Clear collection methodology
- Explicit consent obtained (if collecting from people)
- Privacy considerations addressed
- Structured format that can be loaded into pandas
Examples:
- Register for a free Spotify API key and, rather than only querying public data, use the API's user authorisation flow (with proper consent) to access your own listening history.
- Personal analytics (health data, app usage, digital behaviour)
- Survey-based research (with substantial responses)
⚠️ Consent Requirements
When collecting personal data from team members (or anyone else), you must follow these guidelines:
- Each person must explicitly agree to share their data
- Document this consent in your repository
- Include your data anonymisation strategy
- Clearly state what data is being collected and how it will be used
- Address any privacy concerns in your documentation
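One possible anonymisation strategy is to replace names with keyed hashes before the data is saved. The sketch below assumes a hypothetical secret string that stays out of your repository and invented example records. Strictly speaking this is pseudonymisation rather than full anonymisation, so other columns may still identify a person and should be reviewed as well.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret):
    """Return a short, stable code for a person.

    The same identifier and secret always give the same code, but the
    code cannot be turned back into the name without the secret.
    """
    cleaned = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), cleaned, hashlib.sha256)
    return digest.hexdigest()[:12]


# Hypothetical records and secret, for illustration only
secret = "keep-this-out-of-github"
records = [
    {"name": "Alice", "minutes_listened": 42},
    {"name": "Bob", "minutes_listened": 17},
]

# Drop the name column and keep only the pseudonymous code
safe_records = [
    {
        "participant": pseudonymise(row["name"], secret),
        "minutes_listened": row["minutes_listened"],
    }
    for row in records
]
```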
3. Complex Static Datasets
Although simple static datasets are not allowed, you can use downloaded datasets if they are complex and require significant reshaping:
Examples:
Requirements:
- Dataset must require significant data reshaping
- Should involve complex JSON structures needing normalisation
- Must be properly cited and documented
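To give a sense of what "complex JSON structures needing normalisation" can look like, here is a small sketch using pandas.json_normalize. The nested records are invented for illustration; a real dataset would be larger and messier.

```python
import pandas as pd

# Hypothetical nested records: each item has a nested dict and a nested list
raw = [
    {
        "id": 1,
        "artist": {"name": "A", "country": "UK"},
        "tracks": [
            {"title": "x", "seconds": 200},
            {"title": "y", "seconds": 180},
        ],
    },
    {
        "id": 2,
        "artist": {"name": "B", "country": "FR"},
        "tracks": [{"title": "z", "seconds": 240}],
    },
]

# One row per track, carrying the parent id and artist name alongside
tracks = pd.json_normalize(
    raw,
    record_path="tracks",
    meta=["id", ["artist", "name"]],
)

# One row per top-level item, with the nested artist dict flattened
artists = pd.json_normalize(raw)[["id", "artist.name", "artist.country"]]
```

Splitting the data into a tracks table and an artists table like this is one design choice among several; the point is that the reshaping is a deliberate, documented step.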
API Suggestions
APIs You’ve Already Used
Weather & Environment
- OpenMeteo API
- Clean documentation
- Free to use
- Many endpoints and analytical angles we haven’t explored in the course
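As a reminder of the general shape of an OpenMeteo request, here is a minimal sketch. It assumes the forecast endpoint and the hourly temperature_2m variable as described in the OpenMeteo documentation at the time of writing, so check the current documentation before relying on it. The function names are hypothetical.

```python
import pandas as pd
import requests

# Assumed endpoint; verify against the current OpenMeteo documentation
OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def fetch_hourly_temperature(latitude, longitude):
    """Request hourly temperatures for one location and return the raw JSON."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": "temperature_2m",
    }
    response = requests.get(OPEN_METEO_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()


def hourly_to_frame(payload):
    """Turn the 'hourly' block of the response into a tidy DataFrame."""
    frame = pd.DataFrame(payload["hourly"])
    frame["time"] = pd.to_datetime(frame["time"])
    return frame
```

A call such as hourly_to_frame(fetch_hourly_temperature(51.5, -0.12)) would then give one row per hour. Keeping the request and the reshaping in separate functions makes it easier to save the raw JSON before transforming it.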
Other Interesting APIs
Transport & Location
- Transport for London (TfL)
- Real-time journey data
- Documentation can be tricky
- Active developer forum if you need help
Google APIs
- Maps (location analysis)
- Trends (search patterns)
- Books Ngram (text analysis)
- Note: These might charge a fee if you go beyond certain rate limits. We are NOT responsible for any charges you might incur.
Government & Public Data
- UK Government Data
- data.gov.uk
- Various datasets on public services, economy, health
- EU Open Data Portal
- data.europa.eu
- Comprehensive datasets on European topics
Entertainment & Media
- TMDB (The Movie Database)
- Film and TV show information
- Well-documented API
- News APIs
- NewsAPI, The Guardian, New York Times
- Current events and historical articles
Evaluating Data Source Quality
When choosing your data source, consider these factors:
- Relevance: Does the data help answer your research question?
- Accessibility: Can you reliably access the data throughout your project?
- Quality: Is the data accurate, complete, and up-to-date?
- Volume: Is there enough data to support meaningful analysis?
- Complexity: Does the data require interesting transformation work?
- Documentation: Is the API or dataset well-documented?
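Some of these factors, such as completeness, recency, and volume, can be checked quickly on a small sample once it is loaded into pandas. The helper below is a rough sketch; the function name and the choice of checks are assumptions, not a required checklist.

```python
import pandas as pd


def quality_summary(frame, time_column):
    """Summarise volume, duplicates, missing values, and date coverage."""
    times = pd.to_datetime(frame[time_column])
    return {
        "rows": len(frame),
        "duplicate_rows": int(frame.duplicated().sum()),
        # Share of missing values per column, between 0 and 1
        "missing_share": frame.isna().mean().round(3).to_dict(),
        "earliest": times.min(),
        "latest": times.max(),
    }


# Hypothetical sample with one duplicated row and one missing value
sample = pd.DataFrame(
    {
        "time": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"],
        "value": [1.0, 2.0, 2.0, None],
    }
)
summary = quality_summary(sample, "time")
```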
Common Pitfalls to Avoid
- Choosing overly simple datasets: Your project should demonstrate data transformation skills.
- Relying on unstable APIs: Some free APIs have severe rate limits or poor reliability.
- Collecting too much data: Focus on quality over quantity.
- Ignoring data privacy: Always respect privacy and obtain proper consent.
- Waiting too long to start: Begin with a small sample to test your approach.
Getting Started
- Explore the documentation: Understand what data is available and how to access it.
- Test with small requests: Verify you can access and process the data before committing.
- Plan your database schema: Design your tables and relationships early.
- Document your process: Keep notes on your data collection decisions.
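The schema-planning step above can be sketched with the standard-library sqlite3 module. The two tables and their columns are hypothetical and only illustrate the idea of designing tables and relationships before collecting much data; your own schema will depend on your source.

```python
import sqlite3

# Hypothetical schema: one table of places, one table of readings per place
SCHEMA = """
CREATE TABLE IF NOT EXISTS locations (
    location_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    latitude    REAL NOT NULL,
    longitude   REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS observations (
    observation_id INTEGER PRIMARY KEY,
    location_id    INTEGER NOT NULL REFERENCES locations(location_id),
    observed_at    TEXT NOT NULL,
    temperature_c  REAL,
    UNIQUE (location_id, observed_at)
);
"""


def create_database(path=":memory:"):
    """Create the tables and return an open connection."""
    connection = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this is switched on
    connection.execute("PRAGMA foreign_keys = ON")
    connection.executescript(SCHEMA)
    return connection
```

The UNIQUE constraint on (location_id, observed_at) is one way to stop the same reading being stored twice if a collection script is re-run.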
💡 Note: You’re encouraged to discover and come up with other interesting data sources. If you’re using an API that’s not listed here, you don’t need to get it pre-approved - just go for it!
Social Media & Text