6️⃣ Choosing Your Data Source: A Guide
Ideas and requirements for your DS105A final project
🚨 IMPORTANT REQUIREMENTS
Simple bulk downloads will not work as your main data source. You cannot download a basic CSV from a website or use a pre-made dataset from Kaggle. These datasets can supplement your analysis but cannot be your primary data source.
You must use the requests library. Pre-made API wrappers like spotipy for Spotify or praw for Reddit are not allowed.
Document your decisions in the notebooks and focus on quality of analysis over quantity of data. Build complexity gradually once you have a complete first draft.
What Counts as a Valid Data Source
Your primary data source must fall into one of three categories:
- API-collected data using the requests library. You handle authentication and rate limits yourself.
- Self-collected data with proper documentation, explicit consent, and addressed privacy considerations.
- Complex static datasets that require significant reshaping. Examples include OpenSanctions, OpenCorporates, World Values Survey, or V-Dem. These datasets typically contain nested JSON structures that need normalisation. You need express permission from Jon before using this approach.
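To give a sense of what "normalisation" means for the third category, here is a minimal sketch using `pandas.json_normalize`. The records and field names (`entity_id`, `officers`, and so on) are hypothetical and do not reflect the actual schema of OpenSanctions or any other dataset named above.

```python
import pandas as pd

# Hypothetical nested records: each entity holds a nested dict and a list of officers.
records = [
    {
        "entity_id": "e1",
        "entity_name": "Acme Ltd",
        "country": {"code": "GB"},
        "officers": [
            {"name": "A. Smith", "role": "director"},
            {"name": "B. Jones", "role": "secretary"},
        ],
    },
    {
        "entity_id": "e2",
        "entity_name": "Globex",
        "country": {"code": "US"},
        "officers": [{"name": "C. Lee", "role": "director"}],
    },
]

# One row per officer, with the parent entity's fields repeated on each row.
officers = pd.json_normalize(
    records,
    record_path="officers",  # the nested list to expand into rows
    meta=["entity_id", "entity_name", ["country", "code"]],  # parent fields to keep
    record_prefix="officer_",  # avoids clashes between officer and entity columns
)

print(officers)
```

Real datasets are usually messier than this (missing keys, lists nested inside lists), so expect to document several reshaping decisions in your notebook.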
💡 Note: You can discover and use any API that interests you. APIs not listed here don’t need pre-approval.
- OpenMeteo API: Clean documentation and free to use. You are already familiar with the basics, but there are many endpoints and types of data you have not explored.
- Reddit API: Offers a wide range of topics and communities, with plenty of opportunities to collect novel data.
- Transport for London (TfL) API: Real-time transport and journey data. You are already familiar with this API from your summative work, and it has many other endpoints you can use to collect data.
- Google APIs: Includes Maps (location data), Trends (search behaviour), and Books Ngram (text analysis). All require API key registration and can charge if you go beyond free limits. Make sure you are aware of potential fees.
- Other APIs: Browse this list of public APIs or search online for any other APIs that interest you. You do not need pre-approval if you discover something not listed here.
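Whichever API you pick, handling rate limits yourself with the requests library might look something like the sketch below. The helper name `fetch_json`, the retry settings, and the example coordinates are illustrative assumptions, not course requirements. The sketch retries only on HTTP 429 and only understands a `Retry-After` header given in whole seconds.

```python
import time

import requests

# Open-Meteo forecast endpoint, used here only as a familiar example.
OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def fetch_json(url, params, session=None, max_retries=3,
               backoff_seconds=1.0, sleep=time.sleep):
    """GET a URL and return the parsed JSON, retrying when the server replies 429."""
    if session is None:
        session = requests.Session()
    for attempt in range(max_retries + 1):
        response = session.get(url, params=params, timeout=10)
        if response.status_code == 429 and attempt < max_retries:
            # Use Retry-After when it is a number of seconds,
            # otherwise back off exponentially: 1s, 2s, 4s, ...
            retry_after = response.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                wait = float(retry_after)
            else:
                wait = backoff_seconds * 2 ** attempt
            sleep(wait)
            continue
        # Raises requests.HTTPError for any remaining 4xx/5xx response.
        response.raise_for_status()
        return response.json()


# Example call (makes a real network request, so it is left commented out):
# data = fetch_json(
#     OPEN_METEO_URL,
#     {"latitude": 51.5074, "longitude": -0.1278, "hourly": "temperature_2m"},
# )
```

Passing `session` and `sleep` as arguments is a design choice that lets you test the retry logic without touching the network or actually waiting.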
Self-Collected Data
Choose this approach only after determining how to extract data from your device or accounts into a workable format like CSV, JSON, or TXT. The data needs clear structure; you cannot simply dump text files in a folder.
Personal analytics might include health and fitness tracking, app usage patterns, or digital behaviour logs. For example, you might collect your group’s Spotify listening history and build your own ‘Wrapped’-style summary.
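As a rough illustration of giving exported files a clear structure, the sketch below combines a folder of JSON exports into one validated CSV. The field names (`endTime`, `artistName`, `trackName`, `msPlayed`) are an assumption about what a listening-history export might contain, so check your own files first. The function names are hypothetical too.

```python
import csv
import json
from pathlib import Path

# Assumed fields in each exported record; adjust to match your actual export.
REQUIRED_FIELDS = ("endTime", "artistName", "trackName", "msPlayed")


def load_listening_history(export_dir):
    """Read every *.json export in a folder into one list of flat rows."""
    rows = []
    for path in sorted(Path(export_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records = json.load(f)
        for record in records:
            missing = [key for key in REQUIRED_FIELDS if key not in record]
            if missing:
                # Fail loudly rather than silently producing an uneven table.
                raise ValueError(f"{path.name}: record is missing {missing}")
            row = {key: record[key] for key in REQUIRED_FIELDS}
            row["source_file"] = path.name  # keeps track of where each row came from
            rows.append(row)
    return rows


def write_csv(rows, csv_path):
    """Write the combined rows to a single CSV with a fixed column order."""
    fieldnames = list(REQUIRED_FIELDS) + ["source_file"]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Recording the source file for every row makes it easier to document your collection process and to trace odd values back to a specific export.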
Survey-based research works particularly well if you’re taking another course requiring survey data. This creates an opportunity to combine efforts while meeting DS105A’s data manipulation requirements.
Survey projects need a substantial number of responses (hundreds, not dozens), and you should consider combining the survey with other data sources. Your analysis methods must match your expertise level. If you’re using statistical methods, you must be able to explain and justify their use with confidence.
⚠️ CONSENT REQUIREMENTS
When collecting personal data from team members or others, follow these guidelines:
Each person must explicitly agree to share their data. Document this consent in your repository. Include your data anonymisation strategy and clearly state what data you’re collecting and how you’ll use it. Address privacy concerns in your documentation.
Your repository must contain a clear statement about how you obtained consent from each participant.
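One possible piece of an anonymisation strategy is to replace names or email addresses with keyed pseudonyms before the data reaches your repository. The sketch below uses Python's standard `hmac` and `hashlib` modules. The function names, the `p_` prefix, and the 12-character length are arbitrary illustrative choices. Note that this is pseudonymisation rather than full anonymisation: other columns may still identify a person, and the secret key must stay out of the repository.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Map an identifier to a stable pseudonym that cannot be reversed without the key."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"p_{digest[:12]}"  # shortened for readability


def pseudonymise_rows(rows, id_field, secret_key):
    """Return copies of the rows with the identifying field replaced by a pseudonym."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows are left untouched
        new_row[id_field] = pseudonymise(row[id_field], secret_key)
        cleaned.append(new_row)
    return cleaned
```

Because the same person always maps to the same pseudonym, you can still group and compare participants in your analysis without storing who they are.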
Getting Started
Begin by identifying what genuinely interests your team. The most successful projects emerge when students choose data sources they care about exploring. Consider what questions you want to answer, then find data sources that can provide those answers.
Remember that discovering interesting patterns in familiar data often produces better results than struggling with complex data you don’t understand. Start with something manageable and grow the complexity as your project develops.
