✋ FAQ

2023/24 Winter Term

Your frequently asked questions answered. This page will be updated as the course progresses.

Frequently asked questions

Q: What skills do I need to show in the final project?

You will be working on a group project for the next few months, and you have until 23 May 2024, 5pm UK time to complete it. The final project will be a data science project, focusing more on data collection and manipulation than a deep analysis. It will involve the following tasks:

Collecting data by yourself: You can either scrape data from a website (using scrapy, including scrapy spiders, or selenium) or collect it from an API (as seen in the 🧑‍🏫 W08 lecture).
- Besides the core scrapy library, you may also use scrapy spiders or selenium if you prefer them for your project.
- If you need extra support for spiders or selenium, let us know so we can plan additional teaching sessions during the Spring Term.
- ⚠️ You cannot use BeautifulSoup, lxml, or any other library to scrape data. You can only use scrapy or selenium for scraping.
- If you decide to collect data from an API, you’re free to choose any API, but you must use the requests library to collect the data.
- ⚠️ You cannot use ready-made libraries like tweepy or praw to collect data from Twitter or Reddit. You must use requests to collect the data from APIs.
Organising the collected data: Adhere to the DS105 style by avoiding for and while loops (unless unavoidable), using list/dict comprehensions to build lists and dictionaries, and placing custom functions in .py modules. Aim to use the pandas .apply() method (on a Series or DataFrame) for efficient single-column data creation/manipulation.
Saving the data in a database: You must save your data using sqlite3 (covered in the 🧑‍🏫 W10 lecture).
Manipulating data in a vectorised manner with pandas: Demonstrate that you find opportunities to use groupby()->apply() and pivot(), either in pandas or in SQL (covered in the 🧑‍🏫 W09 notebook and 🧑‍🏫 W10 lecture).
Creating plots using the grammar-of-graphics style: Use plotnine (covered in the 💻 W08 lab) or altair.
Cleaning text data using regular expressions: Use the re library mindfully (covered in the 🧑‍🏫 W11 lecture).
Effective GitHub collaboration: Use branches, issues and pull requests effectively as a group.
Neat website: Make your group’s website neat and well-organised, with a clear structure and a good design.
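As a rough illustration of how several of these skills fit together, here is a minimal sketch that cleans text with re, builds records with a comprehension instead of an explicit loop, saves them with sqlite3, and aggregates them in SQL. The records, the posts table, and the clean_title function are all hypothetical stand-ins for whatever your group collects; this is not a template you must follow.

```python
import re
import sqlite3

# Hypothetical raw records, standing in for data collected from an API or scraper
raw_posts = [
    {"id": 1, "author": "ana", "title": "  Hello   World!! "},
    {"id": 2, "author": "ben", "title": "Data\tScience 101"},
    {"id": 3, "author": "ana", "title": "Regex   is fun"},
]

# Compile the pattern once so it can be reused for every record
WHITESPACE = re.compile(r"\s+")


def clean_title(title):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return WHITESPACE.sub(" ", title).strip()


# A list comprehension replaces an explicit for loop
cleaned_posts = [{**post, "title": clean_title(post["title"])} for post in raw_posts]

# In-memory database for the sketch; pass a file path instead to persist the data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO posts (id, author, title) VALUES (:id, :author, :title)",
    cleaned_posts,
)
conn.commit()

# Aggregate in SQL: number of posts per author
posts_per_author = dict(
    conn.execute("SELECT author, COUNT(*) FROM posts GROUP BY author ORDER BY author")
)
```

In a real project, clean_title would live in a .py module that your notebooks import, and the same aggregation could equally be done in pandas with groupby().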

Q: What is the final project worth?

The final project is worth 40% of your final grade. This includes a presentation (15%) and the submission of the GitHub repository that contains the source code of your project (25%).

Check out the ✍️ Assessments page.

Q: When is the final project due?

The final project is due on Thursday, 23 May 2024 at 5 pm UK time (Week 04 of the 2023/24 Spring Term).

Q: What will I submit?

The process will be the same as your previous assignments involving code. You will accept a group assignment via GitHub Classroom, and this will automatically create a repository for your group. You will then work on your project in this repository and submit it by pushing your changes to GitHub.

Q: How much data should I use?

You don’t need a lot of data. If your dataset has a few thousand rows, it should be fine. Big data projects are more impressive, though.

Q: What kind of data should I use?

You will have to choose one or several data sources. Your primary data source must be either:
- collected by your group via web scraping, or
- collected by your group via an API
In this course, we focused only on tabular data, but you can also use unstructured data (e.g., text, images, audio, video).

Consult us if you are not sure your data source is appropriate.

Q: Do I have to pick a ‘serious’ topic?

While you can go for a serious, academic topic and therefore collect data from a government API or data-rich portals like the Wikimedia projects, it is absolutely acceptable to choose a more fun, light-hearted subject and collect data from social media, a sports or gaming platform, or a streaming service.

Q: What kind of analysis should I do?

MUST-dos:
- Data cleaning (e.g., using adequate data types, removing missing values, removing duplicates, etc.)
- Data exploration (e.g., summary statistics, histograms, etc.)
- Data visualisation with a grammar-of-graphics package, such as plotnine (static) or altair (interactive)
- Data analysis (insights from summarising the data, a closer look at certain plots, etc.)
CAN-dos (things that are not taught in the course but that would make your project more impressive):
- Machine Learning (e.g., classification, clustering, etc.)
- Natural Language Processing (e.g., sentiment analysis, topic modelling, etc.)
- Network Analysis (e.g., centrality measures, community detection, etc.)
- Deep Learning (e.g., image classification, object detection, etc.)
- Interactive websites (e.g., using Streamlit)
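To make the data cleaning and data exploration MUST-dos above more concrete, here is a minimal pandas sketch. The DataFrame, its column names, and its values are invented purely for illustration; your own cleaning steps will depend on your data.

```python
import pandas as pd

# Hypothetical raw data: everything arrives as text, with a duplicate row
# and a missing value
raw = pd.DataFrame(
    {
        "game": ["A", "B", "B", "C", "D"],
        "genre": ["puzzle", "action", "action", "puzzle", None],
        "price": ["9.99", "19.99", "19.99", "4.99", "14.99"],
        "released": [
            "2021-03-01",
            "2022-07-15",
            "2022-07-15",
            "2023-01-10",
            "2023-05-05",
        ],
    }
)

# Data cleaning as one method chain: drop duplicates and missing values,
# then convert columns to adequate data types
clean = (
    raw.drop_duplicates()
    .dropna(subset=["genre"])
    .assign(
        price=lambda df: df["price"].astype(float),
        released=lambda df: pd.to_datetime(df["released"]),
    )
    .reset_index(drop=True)
)

# Data exploration: summary statistics per genre
summary = clean.groupby("genre")["price"].agg(["count", "mean"])
print(summary)
```

Whether to drop or fill missing values is a judgment call that depends on your research question, so it is worth documenting the choice in your project.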