Choosing Your Data Source: A Guide
Ideas and requirements for your DS105A final project
🚨 IMPORTANT REMINDERS - Do not ignore this text
Before exploring data sources, make sure you understand these key requirements:
Simple bulk downloads won't work as your main data source. For example, none of the following are allowed:
- Downloading a very simple CSV from a website
- Using a pre-made dataset from Kaggle
However, these datasets can be used as supplementary data to enrich your analysis.
Stick to the requests library - 'pre-made' API wrappers like spotipy for Spotify or praw for Reddit are not allowed.
Document your decisions in the notebooks.
Focus on quality of analysis over quantity of data. Only grow in complexity once you have a first full draft of your project.
Core Requirements
Your primary data source must be one of:
- Data collected through an API
- Use the requests library
- You are responsible for handling authentication and rate limits
- Self-collected data with proper documentation
- Clear collection methodology
- Explicit consent was obtained
- Privacy considerations addressed
- Complex static datasets
- Although simple static datasets are not allowed, you ARE allowed to use downloaded datasets if they are complex and require significant reshaping.
- These include datasets like OpenSanctions or OpenCorporates, World Values Survey, or VDEM, or any other you came across in your research.
- Should require significant data reshaping
- Example: Complex JSON structures needing normalization
- If you go down this route, you need to get express permission from Jon.
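To give a sense of what "significant reshaping" can mean for the complex static dataset route, here is a minimal sketch that flattens a nested JSON record into one row per nested item. The record and its field names are invented for illustration and are not taken from OpenSanctions, OpenCorporates, or any other dataset named above.

```python
# Hypothetical nested record; a real dataset will have its own schema.
record = {
    "id": "ent-001",
    "name": "Example Holdings Ltd",
    "addresses": [
        {"city": "London", "country": "GB"},
        {"city": "Dublin", "country": "IE"},
    ],
}


def flatten_addresses(entity):
    """Return one flat dict (row) per address, repeating the entity fields."""
    rows = []
    for address in entity.get("addresses", []):
        rows.append(
            {
                "entity_id": entity["id"],
                "entity_name": entity["name"],
                "city": address.get("city"),
                "country": address.get("country"),
            }
        )
    return rows


rows = flatten_addresses(record)
for row in rows:
    print(row)
```

If you work with pandas, `pandas.json_normalize` with its `record_path` and `meta` arguments can do a similar job; the hand-written version is shown only to make the reshaping step visible.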
💡 Note: You're encouraged to discover and come up with other interesting data sources. If you're using an API that's not listed here, you don't need to get it pre-approved - just go for it!
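Whichever API you pick, handling rate limits is your responsibility. The sketch below shows one possible way to do that with plain requests: wait and retry when the server answers with HTTP 429. The function name `get_with_retry`, its `get` and `sleep` parameters (there only so the function can be tried without network access), and the assumption that `Retry-After` holds a number of seconds are all illustrative choices, not course requirements.

```python
import time

import requests


def get_with_retry(url, params=None, max_attempts=3, get=requests.get, sleep=time.sleep):
    """GET a URL, waiting and retrying when the server answers 429 (Too Many Requests)."""
    for attempt in range(1, max_attempts + 1):
        response = get(url, params=params, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()  # surface any other HTTP error
            return response.json()
        # Retry-After is optional; fall back to a simple growing delay.
        # Assumption: when present, the header holds a number of seconds.
        wait_seconds = float(response.headers.get("Retry-After", 2 ** attempt))
        sleep(wait_seconds)
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts: {url}")
```

Each API documents its own limits and error codes, so check the documentation of the one you choose and record in your notebook how you handled them.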
Getting Started: Data Source Ideas
APIs You Already Know
This is by far the safest bet, as you're already familiar with these APIs.
Weather & Environment
- OpenMeteo API
- Clean documentation
- Free to use
- There are many endpoints and angles to this data we haven't explored in the course!
💡 Tip: While you've used OpenMeteo in your W06 summative, there are many endpoints and analytical angles we haven't touched in the course.
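As a starting point for exploring other variables, here is a rough sketch of requesting daily data from OpenMeteo with plain requests and turning the response into one row per day. The helper names are hypothetical, and the variable names and response layout reflect my reading of the forecast endpoint, so check them against the OpenMeteo documentation before relying on them.

```python
import requests

FORECAST_URL = "https://api.open-meteo.com/v1/forecast"


def build_params(latitude, longitude, daily_variables):
    """Build query parameters for a daily forecast request."""
    return {
        "latitude": latitude,
        "longitude": longitude,
        "daily": ",".join(daily_variables),
        "timezone": "auto",
    }


def daily_to_rows(payload):
    """Turn the column-oriented 'daily' block into one dict per day."""
    daily = payload["daily"]
    names = list(daily)  # e.g. ["time", "temperature_2m_max", ...]
    columns = [daily[name] for name in names]
    return [dict(zip(names, values)) for values in zip(*columns)]


def fetch_daily(latitude, longitude, daily_variables):
    """Call the API (needs network access) and return one row per day."""
    params = build_params(latitude, longitude, daily_variables)
    response = requests.get(FORECAST_URL, params=params, timeout=10)
    response.raise_for_status()
    return daily_to_rows(response.json())
```

A call such as `fetch_daily(51.5, -0.12, ["temperature_2m_max", "precipitation_sum"])` would then give a list of dictionaries that is ready to load into a DataFrame.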
Music & Entertainment
- Spotify API
- You're already familiar with this one from the summative
- Great for more complex ideas like analyzing music features across genres or tracking popularity trends
- Authentication setup already familiar to you
💬 Comment: If you are not tired of using this API, this is definitely a safe bet. But what will you come up with to add more complexity?
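As a reminder of what authentication without spotipy can look like, here is a minimal sketch that assumes Spotify's Client Credentials flow; the setup you used in the summative may differ. The function names and the `post` parameter (included only so the function can be tried without network access) are hypothetical.

```python
import requests

TOKEN_URL = "https://accounts.spotify.com/api/token"


def get_access_token(client_id, client_secret, post=requests.post):
    """Request an app-level token using the Client Credentials flow."""
    response = post(
        TOKEN_URL,
        data={"grant_type": "client_credentials"},
        auth=(client_id, client_secret),  # sent as an HTTP Basic Authorization header
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]


def auth_header(token):
    """Header to attach to later requests to the Web API."""
    return {"Authorization": f"Bearer {token}"}
```

Keep the client ID and secret out of your notebooks and repository, for example by reading them from environment variables.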
Any other API!
(This is likely to be the most popular choice for projects)
You can search online for APIs that interest you. Here are just some initial ideas:
Transport for London (TfL)
- Real-time journey data
- ⚠️ Documentation can be tricky
- Active developer forum if you need help
Google APIs
There are multiple interesting endpoints:
- Maps (location analysis)
- Trends (search patterns)
- Books Ngram (text analysis)
- All require API key registration and they might charge a fee if you go beyond certain rate limits. We ARE NOT responsible for any charges you might incur.
Self-Collected Data Approaches
Only choose this approach once you have figured out a way to get the data out of your device/accounts and into a format you can work with (e.g., CSV, JSON, TXT). The data does need to be structured - you can't just dump a bunch of text files in a folder - and you will need a clear plan for how you will structure it.
Some ideas:
Personal Analytics
- Health and fitness tracking
- App usage patterns
- Digital behavior logs
⚠️ CONSENT REQUIREMENTS
When collecting personal data from team members (or anyone else), you must follow these guidelines:
- Each person must explicitly agree to share their data (write T&Cs if you are using an automated API)
- Document this consent in your repository
- Include your data anonymization strategy
- Clearly state what data is being collected and how it will be used
- Address any privacy concerns in your documentation
Your repository must include a clear statement about how consent was obtained from each team member.
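One possible ingredient of an anonymization strategy is to replace names with stable pseudonyms before the data reaches your repository. The sketch below uses a keyed hash from the Python standard library; the function name, the sample records, and the key are all invented for illustration. Note that this is pseudonymization rather than full anonymization: other columns (dates, locations, rare values) may still identify a person.

```python
import hashlib
import hmac


def pseudonymize(participant_name, secret_key):
    """Return a stable label for a participant that is hard to reverse without the key.

    The same name and key always give the same label, so records can still be
    linked across files.
    """
    digest = hmac.new(secret_key.encode(), participant_name.encode(), hashlib.sha256)
    return "p_" + digest.hexdigest()[:10]


# Invented sample data, standing in for a team member's exported records.
records = [
    {"name": "Alice", "steps": 8200},
    {"name": "Bob", "steps": 10400},
    {"name": "Alice", "steps": 7600},
]
key = "replace-with-a-private-key"  # placeholder; keep the real key out of version control

anonymized = [
    {"participant": pseudonymize(r["name"], key), "steps": r["steps"]} for r in records
]
```

Describe whichever approach you take in your documentation, alongside the consent statement.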
Survey-Based Research
💡 Perfect Match: If you're in your third year or taking another course that requires conducting a survey, this could be an excellent opportunity to combine efforts - just ensure it works for your whole group and matches DS105A's data manipulation requirements.
Keep in mind:
- You'll need substantial responses (think hundreds, not dozens)
- Consider combining with other data sources
- Your analysis methods should match your expertise level
- Note: If you're using statistical methods, you need to be confident in explaining and justifying their use.
Social Media & Text