✋ FAQ
2023/24 Winter Term
Your frequently asked questions answered. This page will be updated as the course progresses.
Frequently asked questions
Q: What skills do I need to show in the final project?
You will be working on a group project for the next few months, and you have until 23 May 2024, 5pm UK time to complete it. The final project will be a data science project that focuses more on data collection and manipulation than on deep analysis. It will involve the following tasks:
- Collecting data by yourself: You can either scrape data from a website (using `scrapy`, spiders, or `selenium`) or collect it from an API (as seen in the 🧑🏫 W08 lecture).
  - Other than the `scrapy` library, we will also allow the use of `scrapy` spiders and `selenium` if you prefer them for your project.
  - If you need extra support for spiders or `selenium`, let us know so we can plan additional teaching sessions during the Spring Term.
  - ⚠️ You cannot use `BeautifulSoup`, `lxml`, or any other library to scrape data. You can only use `scrapy` or `selenium` for scraping.
  - If you decide to collect data from an API, you're free to choose any API, but you must use the `requests` library to collect the data.
  - ⚠️ You cannot use ready-made libraries like `tweepy` or `praw` to collect data from Twitter or Reddit. You must use `requests` to collect the data from APIs.
- Organising the collected data: Adhere to the DS105-style by avoiding `for` and `while` loops (unless unavoidable), using list/dict comprehension for creating dictionaries, and using custom functions in `.py` modules. Aim to use pandas' `apply()` for efficient single-column data creation/manipulation.
- Saving the data in a database: You must save your data using `sqlite3` (covered in the 🧑🏫 W10 lecture).
- Manipulating data in a vectorised manner with `pandas`: Demonstrate that you find opportunities to use `groupby()->apply()` and `pivot()`, either in pandas or in SQL (covered in the 🧑🏫 W09 notebook and 🧑🏫 W10 lecture).
- Creating plots using the grammar-of-graphics style: Use `plotnine` (covered in the 💻 W08 lab) or `altair`.
- Cleaning text data using regular expressions: Use the `re` library mindfully (covered in the 🧑🏫 W11 lecture).
- Effective GitHub collaboration: Your group used branches, issues and pull requests effectively.
- Neat website: Your group's website is neat and well-organised, with a clear structure and a good design.
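As a rough illustration of how the organising, cleaning and saving tasks above might fit together, here is a minimal sketch that uses only Python's standard library. Everything specific in it is made up for the example and is not taken from the course materials: the sample records, the `clean_title` helper, the `posts` table and its column names. The aggregation is done in SQL with `GROUP BY`, which is one of the two routes (pandas or SQL) mentioned above.

```python
import re
import sqlite3

# Hypothetical raw records, standing in for data collected from an API
raw_records = [
    {"id": 1, "title": "  Hello,   World!! ", "author": "ana", "score": "12"},
    {"id": 2, "title": "Data <b>Science</b> 101", "author": "ben", "score": "7"},
    {"id": 3, "title": "Another   post", "author": "ana", "score": "5"},
]


def clean_title(text: str) -> str:
    """Strip HTML-like tags, collapse runs of whitespace, and trim the ends."""
    no_tags = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", no_tags).strip()


# List comprehension instead of an explicit for loop
rows = [
    (rec["id"], clean_title(rec["title"]), rec["author"], int(rec["score"]))
    for rec in raw_records
]

# Dict comprehension: look up a cleaned title by record id
titles_by_id = {row[0]: row[1] for row in rows}

# Save into SQLite (in-memory here; pass a file path instead to persist the data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY, title TEXT, author TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Aggregate per author in SQL: number of posts and total score
totals = conn.execute(
    "SELECT author, COUNT(*), SUM(score) "
    "FROM posts GROUP BY author ORDER BY author"
).fetchall()
conn.close()
```

In a real project, a helper like `clean_title` would more naturally live in its own `.py` module and be imported into the notebook, in line with the custom-functions guideline above.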
Q: What is the final project worth?
The final project is worth 40% of your final grade. This includes a presentation (15%) and the submission of the GitHub repository that contains the source code of your project (25%).
Check out the ✍️ Assessments page.
Q: When is the final project due?
- The final project is due on Thursday, 23 May 2024 at 5 pm (Week 04 of 2023/24 Spring Term)
Q: What will I submit?
- The process will be the same as your previous assignments involving code. You will accept a group assignment via GitHub Classroom, and this will automatically create a repository for your group. You will then work on your project in this repository and submit it by pushing your changes to GitHub.
Q: How much data should I use?
- You don’t need a lot of data. If your dataset has a few thousand rows, it should be fine. Big data projects are more impressive, though.
Q: What kind of data should I use?
You will have to choose one or several data sources. Your main data source must be either:
- collected by your group via web scraping, or
- collected by your group via an API
In this course, we focused only on tabular data, but you can also use unstructured data (e.g., text, images, audio, video).
Consult us if you are not sure your data source is appropriate.
Q: Do I have to pick a ‘serious’ topic?
- While you can go for a serious, academic topic and therefore collect data from a government API or data-rich portals like the Wikimedia projects, it is absolutely acceptable to choose a more fun, light-hearted subject and collect data from social media, a sports or gaming platform, or a streaming service.
Q: What kind of analysis should I do?
- MUST-dos:
  - Data cleaning (e.g., using adequate data types, removing missing values, removing duplicates, etc.)
  - Data exploration (e.g., summary statistics, histograms, etc.)
  - Data visualisation with a grammar-of-graphics package, such as `plotnine` (static) or `altair` (interactive)
  - Data analysis (insights from summarising the data, a closer look at certain plots, etc.)
- CAN-dos (things that are not taught in the course but that would make your project more impressive):
  - Machine Learning (e.g., classification, clustering, etc.)
  - Natural Language Processing (e.g., sentiment analysis, topic modelling, etc.)
  - Network Analysis (e.g., centrality measures, community detection, etc.)
  - Deep Learning (e.g., image classification, object detection, etc.)
  - Interactive websites (e.g., using Streamlit)
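To make the cleaning and exploration MUST-dos a little more concrete, here is a minimal pandas sketch. The data frame, its column names (`user`, `posted_at`, `likes`) and the particular cleaning choices are assumptions made up for illustration; what counts as adequate cleaning will depend on your own data.

```python
import pandas as pd

# Hypothetical raw data with typical problems:
# a missing user, a duplicated row, and numbers/dates stored as text
raw = pd.DataFrame(
    {
        "user": ["ana", "ben", "ben", "cat", None],
        "posted_at": ["2024-01-05", "2024-01-06", "2024-01-06",
                      "2024-02-01", "2024-02-03"],
        "likes": ["10", "3", "3", "7", "2"],
    }
)

clean = (
    raw
    .dropna(subset=["user"])   # remove rows with a missing user
    .drop_duplicates()         # remove exact duplicate rows
    .assign(
        # use adequate data types: dates as datetimes, counts as numbers
        posted_at=lambda df: pd.to_datetime(df["posted_at"]),
        likes=lambda df: pd.to_numeric(df["likes"]),
    )
    .reset_index(drop=True)
)

# Data exploration: summary statistics and a simple grouped total
summary = clean["likes"].describe()
likes_per_user = clean.groupby("user")["likes"].sum()
```

Chaining the steps keeps the whole cleaning recipe in one readable expression and avoids explicit loops.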