Step 4: The final part of the process

2023/24 Winter Term

It is sad, but all good things must come to an end.

What we want to see in the final project.

What we want to see

In short:

  1. A web page that tells the story of your project. The text on your web page must not exceed 5,000 words.
    • Tell us what made you curious about this kind of data in the first place.
    • Explain how you gathered the data (and what challenges you faced).
    • What do the raw data look like? How did you pre-process and store them?
    • Show us a curated view of the key insights from your exploratory data analysis (insightful tables, plots, and summaries).
  2. A GitHub repository that contains your source code.
    • We want to see that you did exploratory data analysis (EDA) and data cleaning in your Jupyter Notebooks or Python scripts.
    • We want to see that you used the tools and practices we taught you in class (all eleven weeks of the course).
    • We will check the repo history to ensure all team members have contributed with commits to the project (it does not have to be an equal contribution).

How we will mark your final project

The final project is worth 25% of your final grade, but it will be marked on a 0-100 scale.

Criterion: Storytelling

🏅 Reward: 15 marks

Do the public-facing documents of your project tell us a story?

Click here to understand what we are looking for in detail

📝 Deliverables

We will look at the following when assessing this criterion:
  • The README of your GitHub repository
  • The public web page of your project

💡 IMPORTANT NOTE: It is typical for groups to challenge themselves and opt to create dashboards (e.g., with Streamlit) or interactive data visualisations at this stage of the course. While this is encouraged, we will still assess the public URL associated with your GitHub repository for this criterion. If your dynamic visualisation tool cannot be rendered directly via the URL, you should still create a ‘static’ website with the key insights extracted from your dashboard. You will not earn extra marks for the presence of dashboards in this criterion.

🗒️ Description

You must produce a public website associated with your GitHub repository. This website must tell a story: the text must be engaging, clear, and free of fluff. Describe the relevant technical steps without going into excessive detail, and end with a satisfying conclusion.

Your GitHub repository’s README file must describe the steps to reproduce your analysis. There should be instructions on setting up the environment, creating the necessary credentials and other secret files, and running the code in the correct order.

📑 Marking Scheme

In the end, the storytelling of your webpage will be graded in one of the following categories:

Outstanding (12-15 marks)
  • We were mesmerised by the writing and the website layout.
  • The narrative is engaging and clear. Your webpage reads like a piece of great data journalism writing. It has depth, it is nuanced, and it is captivating.
  • The website layout is impeccable, to a level of professionalism we were not expecting.
Great (10-11 marks)
  • The narrative is engaging and clear, and the technical steps are described with adequate technical detail.
  • We couldn’t find any major issues with the writing and/or website layout.
Good (7-9 marks)
  • The website’s information and structure are overall clear and organised. However, the aesthetics could be improved for a more engaging experience (it’s either a bit dull or, the opposite, overwhelming).
  • While a narrative is present, the writing could be more engaging, concise, and precise to better captivate the audience.
Poor (<7 marks)

Submissions that fall into one of the descriptions below will receive a mark of 6 or less:

  • The narrative is not engaging, the technical steps are not described clearly, the website layout is not engaging, and the website is hard to navigate.
  • Although a lot has been done, the website is super vague. There is a lot of fluff.
  • The website leaves out crucial details, rendering the report misleading or incomplete.
  • There’s no website or no README file.
  • The group created a dashboard or interactive visualisation tool but did not provide a static website with the key insights extracted from the dashboard.

Criterion: Organisation

🏅 Reward: 20 marks

A well-organised repo is a good sign that you understand the tools and practices we taught you in class.

Click here to understand what we are looking for in detail

📝 Deliverables

We will look at the following when assessing this criterion:
  • Structure and names of files and folders in your GitHub repository
  • Use of markdown in your Jupyter Notebooks
  • Coding style and comments in your Jupyter Notebook and Python scripts
  • Use of relative paths in your code
  • Setup files (like requirements.txt, .gitignore, etc.)

🗒️ Description

Your GitHub repository must be well-organised. Source code files should be organised consistently and thoughtfully, making it easy for someone else to understand and replicate your work. You should use markdown in your Jupyter Notebooks to document your thought process and the steps you took to clean and analyse the data. You should use relative paths in your code to make it easier to run your code on different machines and potentially different OSes. You should have setup files like requirements.txt and .gitignore alongside instructions on the README to make it easier for someone else to replicate your work.
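To illustrate the relative-path advice, here is a minimal sketch using the standard-library pathlib (the folder and file names are hypothetical, not a required layout):

```python
from pathlib import Path

# Build paths relative to the repository root rather than hard-coding
# absolute paths like "C:/Users/alice/project/data" or "/home/alice/data".
# pathlib joins path segments with "/" and renders the correct separator
# on every operating system.
RAW_DIR = Path("data") / "raw"
CLEAN_DIR = Path("data") / "clean"

raw_file = RAW_DIR / "survey_responses.csv"  # hypothetical file name
print(raw_file.as_posix())                   # data/raw/survey_responses.csv
```

Because nothing is anchored to one machine's filesystem, a teammate on a different OS can clone the repository and run the code unchanged.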

📑 Marking Scheme

Your group will be awarded one of the following marks based on the organisation of your GitHub repository:

Outstanding (16-20 marks)
  • The organisation of your repository and website goes beyond our wildest expectations. You went beyond what was specified in the ‘Great’ category in a surprisingly positive way we couldn’t have foreseen.
Great (14-15 marks)
  • Not only is the GitHub repository well-organised, but it is also clear and easy to navigate. There is a consistent standard, regardless of who created which file.
  • The README file is detailed and provides clear instructions on how to run the code. The instructions are mindful of how other people, with potentially different OSes, might run the code.
  • When it comes to the webpage, you used Quarto publishing configured with a GitHub Action to automate the deployment of your website.
  • If the project used an advanced data visualisation tool like Streamlit, there are clear instructions for how a person could replicate the dashboard themselves.
  • The group acknowledges the use of AI tools in the project, and it clearly shows how the group used the tool for learning/productivity hacks rather than simply over-relying on it.
Good (9-13 marks)
  • The GitHub repository’s structure is clear and organised overall, with well-named folders and files. If anything, you forgot small things, like adding conda environment instructions or adding IDE/build/temporary files to the .gitignore.
  • Although there are instructions for how to run the code on the README, they could be clearer or more complete.
  • Most notebooks, Python scripts, and modules are well documented (markdown cells, comments), but the documentation is inconsistent across files.
  • The website is hosted on GitHub Pages; all images and links work. The website might also use a theme to make it look more professional or have been created with Quarto.
  • The group acknowledges the use of AI tools in the project, but the acknowledgment is not precise enough.
Poor (<9 marks)

Submissions that fall into one of the descriptions below will receive a mark of 8 or less:

  • There is minimal organisation in the GitHub repository. Files are scattered around, almost all in the root directory.
  • The folder structure is extremely confusing. It could be that there are way too many folders or the structure is too deeply nested.
  • Important files are missing (there are no README or setup files), or the instructions to run the code are very unclear. It is hard to figure out how to navigate your repository.
  • There is no acknowledgment of AI tools used in the project; we don’t even know if AI chatbots or other AI tools were used to help with the project. OR, the acknowledgment is not precise enough.

Criterion: Collaboration

🏅 Reward: 15 marks

Did we see good evidence of GitHub collaboration in your project?

💡 IMPORTANT NOTE: Since this is a group assignment, we want to see that you collaborated as a team using the Git tools taught in the course. We like to give the group the autonomy to distribute the workload. However, if it is evident that one person carried most of the workload throughout the project, the group will receive a low score on this criterion, and we will evaluate that individual’s contribution separately. Conversely, if a team member contributed very minimally to the project, they will receive a low score on this criterion.


Click here to understand what we are looking for in detail

📝 Deliverables

We will look at the following when assessing this criterion:
  • Names of authors in Jupyter Notebooks and Python scripts
  • Commit history of the repository
  • README file of the repository
  • GitHub Issues & Pull Requests
  • GitHub Project Board

🗒️ Description

We seek evidence that you exercised your Git & GitHub muscles as a team. We want to see that you used GitHub to collaborate, GitHub Issues to track tasks, separate branches to avoid messing up each other’s contributions, and Pull Requests to review each other’s work. We also want to see that you used the GitHub Project Board to track the progress of your project.

☣️ The least favourable scenario is when team members work in isolation, with minimal interaction with each other’s code or outputs, or when the group relies on platforms like Google Drive or Dropbox until the final project is hastily uploaded to GitHub just before the deadline.

📑 Marking Scheme

Your group will be awarded one of the following marks based on your collaboration:

Outstanding (11-15 marks)
  • The group made good use of GitHub’s project board, updating the status of issues regularly.
  • The group made sensible choices about Issues & Pull Requests; they didn’t use them for every little thing. Minor tasks did not require a PR and were merged directly to the main branch.
  • We see a GitHub Issue associated with every major task or bug that was fixed. Issues are labelled with relevant custom issue labels. Issues have meaningful descriptions.
  • The group used Pull Requests to review important changes to the codebase. Pull Requests contain a high-level description of their purpose and instructions on how to validate the results. We see evidence of others reviewing the code and providing feedback.
  • There is a clear list of everyone’s overall contributions to the project somewhere in the project’s webpage or README file.
  • All members contributed substantially to the project, as we can see from the commit history.
  • Notebooks and scripts have clear authorship, denoting the team member(s) who contributed to each file.
Great (10 marks)
  • There is a clear list of everyone’s overall contributions to the project somewhere in the project’s webpage or README file.
  • All members contributed substantially to the project, as we can see from the commit history.
  • Notebooks and scripts have clear authorship, denoting the team member(s) who contributed to each file.
  • There were some good uses of GitHub Issues & Pull Requests, although they could have been better organised.
Good (7-9 marks)
  • There is a list of everyone’s contributions to the project somewhere in the project’s webpage or README file.
  • All members contributed at least one commit to the group’s GitHub repository.
  • There was no meaningful use of GitHub Issues & Pull Requests. The group could have used them more effectively to track the project’s progress.
Poor (<7 marks)

Submissions that fall into one of the descriptions below will receive a mark of 6 or less:

  • The use of Git/GitHub is minimal or non-existent. The group did not use GitHub Issues & Pull Requests to track the project’s progress.
  • A single individual carried most of the workload throughout the project, and the group did not collaborate effectively.

Criterion: Exploratory Data Analysis (EDA)

🏅 Reward: 20 marks

Did you produce nice plots and tables that vividly depict the data and what we can learn from it?

Click here to understand what we are looking for in detail

📝 Deliverables

We will look at the following when assessing this criterion:
  • Data summaries, plots, and tables in your Jupyter Notebooks and public web page
  • Interpretation of the plots and tables in notebooks as well as in your public web page

🗒️ Description

Tell us what you discovered from your data. Show us the most relevant columns, summaries, and distributions. Use plots and tables to paint a vivid picture of what the data looks like. Make sure all labels are clear and visible. All variables should be clearly identified. Only use grammar-of-graphics libraries like ggplot (R), plotnine (python), altair (python), or bokeh (python) to generate the plots.
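For the summaries and distributions, plain pandas is enough (the toy data and column names below are hypothetical; remember that the plots themselves must come from plotnine, altair, or bokeh):

```python
import pandas as pd

# Hypothetical toy data standing in for your cleaned dataset
df = pd.DataFrame({
    "borough": ["Camden", "Camden", "Hackney", "Hackney", "Hackney"],
    "price": [120, 95, 80, 60, 75],
})

# Distribution of a categorical column
counts = df["borough"].value_counts()

# Per-group summary of a numeric column
summary = df.groupby("borough")["price"].agg(["count", "mean", "median"])
print(summary)
```

Curate the output: show only the columns and groups that support your narrative, not every table your notebook produced.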

📑 Marking Scheme

Your group will be awarded one of the following marks based on the quality of your exploratory data analysis:

Outstanding (17-20 marks)
  • On top of everything described in the ‘Great’ category, the group went above and beyond to create a visually stunning and insightful set of plots and tables, beyond our wildest expectations.
Great (14-16 marks)
  • All plots have legends, axes, and titles that are clear and visible. There is nothing blurry or requiring a deep zoom-in.
  • Colour schemes were chosen wisely, and the plots are not overwhelming with too many colours [1], [2].
  • Plot types were always appropriate for the type of data being visualised [3] [4] [5] [6].
  • Plot titles communicate the key insight of the plot rather than simply describing the axis labels.
  • If present, tables are well-formatted and easy to read. They are not overwhelming with too many columns or rows.
  • Interpretation of the plots and tables is clear and insightful. The group tells us what they confirmed or discovered from the data (no need for formal hypothesis testing, though).
  • All plots were created using grammar-of-graphics libraries like ggplot (R), plotnine (python), altair (python), or bokeh (python).
  • If using a dynamic visualisation tool like Streamlit, the group added explanations around the plots so that users know how to engage with the visualisation.
Good (9-13 marks)
  • Most plots have legends, axes, and titles that are clear and visible, although there might have been a few problematic plots.
  • Colour schemes were chosen wisely, but there might have been a few plots with too many colours or colours that were not easy to distinguish.
  • Plot types were mostly appropriate for the type of data being visualised, but there might have been a few plots that were not the best choice.
  • Some plot titles were descriptive rather than insightful, and some tables were not well-formatted or easy to read.
  • Some plots were not created using grammar-of-graphics libraries like ggplot (R), plotnine (python), altair (python), or bokeh (python).
  • If using a dynamic visualisation tool like Streamlit, the group did not add enough explanations around the plots, and we ended up with a repository of plots that seemed a bit disconnected.
Poor (<9 marks)

Submissions that fall into one of the descriptions below will receive a mark of 6 or less:

  • Plots are not engaging, do not communicate any insights, or are not clear and visible.
  • The group did not use grammar-of-graphics libraries like ggplot (R), plotnine (python), altair (python), or bokeh (python) to generate the plots.

Criterion: Data Manipulation

🏅 Reward: 30 marks

This is the core criterion. Did you show us that you can manipulate data effectively as per the practices we taught you in class?

Click here to understand what we are looking for in detail

📝 Deliverables

We will look at the following when assessing this criterion:

  • Web scraping, API calls, or other data gathering techniques
  • Use of pandas functions in your Jupyter Notebooks and Python scripts
  • Presence of meaningful custom functions in your Jupyter Notebooks and Python scripts
  • Use of SQL in your Jupyter Notebooks and Python scripts
  • Use of text mining techniques (if applicable)

🗒️ Description

This is the core criterion. It’s a callback to everything you’ve been training for during this course and in previous assignments.

We want to see that you can manipulate data effectively. Make sure to communicate how your data flows through your code. You should maximise the use of Pandas functions to clean and transform your data. We also want to see that you know how to merge data from different tables using Pandas merge or SQL joins. If you are working with text data, you should use text mining techniques to extract insights from it.
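As a minimal sketch of the merging requirement, here is the same join done both ways: once with pandas merge, once as an SQL join run through the standard-library sqlite3 module (the table and column names are made up for illustration):

```python
import sqlite3

import pandas as pd

# Hypothetical tables sharing a key column
orders = pd.DataFrame({"user_id": [1, 2, 2], "amount": [10.0, 5.0, 7.5]})
users = pd.DataFrame({"user_id": [1, 2], "name": ["Ana", "Bo"]})

# 1) pandas: an inner join on the shared key
merged = orders.merge(users, on="user_id", how="inner")

# 2) SQL: the same join executed inside an in-memory SQLite database
con = sqlite3.connect(":memory:")
orders.to_sql("orders", con, index=False)
users.to_sql("users", con, index=False)
sql_merged = pd.read_sql_query(
    "SELECT o.user_id, o.amount, u.name "
    "FROM orders AS o JOIN users AS u ON o.user_id = u.user_id",
    con,
)
con.close()
print(merged.shape, sql_merged.shape)  # both joins yield 3 rows
```

Either route is acceptable; what matters is that the join logic is explicit and that we can trace which tables were combined and on which keys.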

📑 Marking Scheme

Your group will be awarded one of the following marks based on the quality of your data manipulation:

Outstanding (24-30 marks)
  • On top of everything described in the ‘Great’ category, the group went above and beyond with their data-wrangling efforts beyond our wildest expectations.
Great (21-23 marks)
  • The group used a variety of data manipulation techniques to clean and transform the data. They used Pandas functions effectively and efficiently.
  • The group stored the data with a clear structure, ideally in a database or an extremely well-organised file system.
  • The group used SQL to merge data from different tables or to perform complex queries.
  • If applicable, the group used text mining techniques to extract insights from the data.
  • The group created meaningful custom functions to automate repetitive tasks.
  • The group created batch scripts, cron jobs or GitHub Actions to automate the data collection and cleaning process.
  • When collecting data via web scraping, the group used Scrapy spiders.
  • The group used the Python requests library when collecting data via APIs.
  • Pandas .apply() function was used appropriately.
  • There are no for or while loops that are not strictly necessary.
Very Good (19-20 marks)
  • The group used a variety of data manipulation techniques to clean and transform the data. They used Pandas functions effectively.
  • The group just fell short of the expectations for the ‘Great’ category.
Good (15-18 marks)
  • The group used some data manipulation techniques to clean and transform the data. They used Pandas functions, but not always effectively.
  • The group achieved the goals, but there were big misses, such as not merging data using Pandas merge or SQL joins, or using loops unnecessarily.
Poor (<15 marks)
  • We see minimal data manipulation in the project. The group did not use Pandas functions effectively or did not use them at all.
  • Web scraping was done by copying code from the internet without understanding it. Non-authorised libraries (BeautifulSoup, Parsel, lxml, etc.) were used.
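As an aside, the .apply() and loop guidance in the ‘Great’ band can be sketched as follows (the column names and the banding rule are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 250.0, 40.0]})

# Avoid: an explicit Python loop over rows
# totals = [p * 1.2 for p in df["price"]]

# Prefer: a vectorised expression, computed in one shot by pandas/NumPy
df["price_with_vat"] = df["price"] * 1.2

# Reserve .apply() for per-row logic with no obvious vectorised form
# (this simple banding rule could also be done with np.where; it is
# only a compact illustration of the .apply() pattern)
df["band"] = df["price"].apply(lambda p: "budget" if p < 50 else "standard")
print(df)
```

The vectorised version is both shorter and faster than the loop, which is why unnecessary for/while loops cost marks in this criterion.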