✍️ Midterm Assignment

Author

Dr. Jon Cardoso-Silva

Last updated

11 July 2024

⏲️ Due Date:

Monday 16 July 2024 at 9pm, UK Time.

This is worth 25% of your final grade.

📝 Instructions:

If you haven’t done it already, download the full Waitrose dataset by clicking the button below:

Unzip the file and save the folder to a suitable location.
Create a new Jupyter Notebook and name it LSE_ME204_2024_Midterm.ipynb. Make sure to use an appropriate mix of Markdown and Python cells to document your analysis.
Add code to read all the files into a single DataFrame, df, and to perform an initial preprocessing of the data (e.g., drop duplicates, rename columns, assign appropriate data types, etc.).

You can reuse the code we have been using in class, including that of 🗓️ Week 01 – Day 04.
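As a rough starting point, the reading step could look like the sketch below. It assumes the unzipped folder contains one CSV file per category; the file names, the `Name` and `Price` columns, and the `read_all_files` helper are all hypothetical and only serve the illustration, so adapt them to what the dataset actually contains.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_all_files(data_dir):
    """Read every CSV file in data_dir and stack them into one DataFrame.

    Hypothetical helper: assumes all files share the same columns.
    """
    files = sorted(Path(data_dir).glob("*.csv"))
    frames = [pd.read_csv(f) for f in files]
    df = pd.concat(frames, ignore_index=True)
    # Drop rows that are exact duplicates across all columns
    return df.drop_duplicates().reset_index(drop=True)


# Tiny stand-in for the real dataset (made-up file names and values)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "bakery.csv").write_text("Name,Price\nWhite Loaf,1.20\nWhite Loaf,1.20\n")
    Path(tmp, "dairy.csv").write_text("Name,Price\nWhole Milk,1.55\n")
    df = read_all_files(tmp)

# Initial preprocessing: consistent column names and explicit data types
df = df.rename(columns={"Name": "name", "Price": "item-price"})
df["item-price"] = df["item-price"].astype(float)
```

With the real data, you would call the helper on the folder where you saved the unzipped files instead of a temporary directory.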
TASK 01 – DATA CLEANING: There are still a lot of duplicates because many products appear in multiple categories. Revisit the pre-processing code so that each product appears only once in the df DataFrame, following these steps:
- Reduce the DataFrame so that each row represents a unique product.
- Add a new column, categories, that contains a list of all the categories the product appears in.
- Keep an item-price column with the original price of the item.
More than producing valid code, remember to document your thought process and the steps you took to achieve this.
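One possible way to approach the three steps above is a `groupby` followed by an aggregation, sketched below on made-up data. The `name` and `category` columns are assumptions about how the DataFrame looks after preprocessing, and taking the first price per product assumes the price is identical across categories, which you should verify on the real data.

```python
import pandas as pd

# Made-up rows: the same product listed under two categories
df = pd.DataFrame({
    "name": ["Whole Milk", "Whole Milk", "White Loaf"],
    "category": ["Dairy", "Everyday Value", "Bakery"],
    "item-price": [1.55, 1.55, 1.20],
})

products = (
    df.groupby("name", as_index=False)
      .agg({
          # Collect every distinct category of the product into a list
          "category": lambda s: sorted(s.unique()),
          # Assumption: the price does not vary across categories
          "item-price": "first",
      })
      .rename(columns={"category": "categories"})
)
```

If the dataset has a proper product identifier, grouping by that identifier rather than by name is likely to be more robust.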
TASK 02 – COMPARATIVE ANALYSIS: From the plot in 🗓️ Week 01 – Day 04, we learned that the supermarket’s own brand ('Everyday Value') contains the cheapest products in the dataset. Your goal is to create at least 2 data visualizations with an analysis of how the products in this category compare to their equivalent branded ones in other categories within the dataset.
- This question is intentionally open-ended! You can let your curiosity guide you. The main point is for you to demonstrate your data wrangling skills with pandas.
- For example, you can choose to compare the price of specific groups of products (e.g. fruits? dairy products? etc.).
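To illustrate the kind of wrangling that can feed such a comparison, here is a minimal sketch on made-up data. It assumes the `categories` list column from TASK 01 and uses the median price as the summary statistic; both the product names and the choice of statistic are illustrative, not a prescribed method.

```python
import pandas as pd

# Made-up products, in the shape produced by TASK 01
products = pd.DataFrame({
    "name": ["Value Milk", "Brand Milk", "Value Loaf", "Brand Loaf"],
    "categories": [
        ["Dairy", "Everyday Value"],
        ["Dairy"],
        ["Bakery", "Everyday Value"],
        ["Bakery"],
    ],
    "item-price": [1.00, 1.60, 0.80, 1.40],
})

# Flag products that belong to the supermarket's own brand
products["own_brand"] = products["categories"].apply(
    lambda cats: "Everyday Value" in cats
)

# One row per (product, category), excluding the own-brand label itself
long = products.explode("categories")
long = long[long["categories"] != "Everyday Value"]

# Median price per category, own brand vs. branded, side by side
comparison = (
    long.groupby(["categories", "own_brand"])["item-price"]
        .median()
        .unstack("own_brand")
        .rename(columns={True: "own_brand", False: "branded"})
)
```

A table like `comparison` can then be passed to a plotting call such as `comparison.plot.bar()` (which requires matplotlib) to draw a grouped bar chart.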

GENERATIVE AI ACKNOWLEDGEMENT: You are free to use any Generative AI tool (ChatGPT, Claude, Microsoft Copilot, GitHub Copilot, HuggingChat, etc.) when working on your assignment. However, you MUST acknowledge the use of these tools in your submission and how these tools helped you in completing the assignment.

For example, you can add a Markdown cell at the end of your notebook with the following content:

## 🤖 **Generative AI Acknowledgement**

This assignment was completed with the help of ChatGPT and GitHub Copilot.

More specifically, I used them in the following ways:

- After I wrote the function `X()` to group the data, I got several errors. I typed the code into Copilot's chat and queried it for solutions. The suggestions worked but as they were not exactly in the spirit of the code we learned in the course, I did `Y` and `Z` to adapt it to our context.
- ...

ENSURE REPRODUCIBILITY: Once you have completed all the tasks, click on ‘Restart’ and then ‘Run All’ at the top of the notebook to ensure that all cells run without errors.
SUBMITTING: Save the Jupyter Notebook and submit the LSE_ME204_2024_Midterm.ipynb file to Moodle (under the '✍️ Assignments' section).

📑 How you will be assessed

We will assign you a score from 0 to 100 based on the following criteria:

Organisation (20 points): The notebook is well-structured, the code is clear and easy to follow, there are no unnecessary prints or outputs, the decisions are well justified, and the analysis is well documented.
Reproducibility (20 points): The notebook runs without errors and the results can be reproduced by running all cells.
TASK 01 - Data Cleaning (20 points): The data is cleaned and pre-processed correctly, as per the instructions. The final DataFrame contains all the information needed for TASK 02, and data types are correctly assigned.
TASK 02 - Comparative Analysis (40 points): The analysis is insightful and well documented. The visualizations are clear and well designed. The analysis is based on the data, the visualizations are relevant to the questions asked and you added insightful comments that reflect your understanding of the data.

Grade Expectations

In line with common practices in the UK Higher Education sector, you should expect to score as follows:

You should only expect to get an A+ (80+) if you did a SUPERB job and went way beyond our wildest expectations.
You should get an A (70-79) if we couldn’t find much (or anything) to criticise!
You should expect to score an A- (65-69) if you followed the principles taught in the classroom, albeit with a few minor mistakes here and there, or we could find some things you could improve (better coding practices, better documentation, etc.)
You should get a B+ (60-64) if, even though your solution was correct, it had some errors or you didn’t engage with several of the principles taught in the course (concretely for us: good use of pandas, good use of data types, neat formatting of notebooks).
You should get a B (50-59) if there were important mistakes or you didn’t follow the instructions properly.
You should get a B- (48-49) if there were big mistakes, several errors or you didn’t follow the instructions.
You should get a C+ (42-47) if there were many mistakes, the notebook was incomplete or too many things were missing.
You should expect to get a C (40-41) if it seemed like you didn’t put much effort into the assignment.
If the submission constitutes a very poor effort, then you should expect to get an F (39 or less).