📝 IRDAP COURSEWORK (60%)

2023/24 Academic Year

Author

⏲️ Due Date:

26 August 2024 at 5 pm UK time

Context

You have been given a dataset with information about food products collected from the UK Waitrose supermarket website. The data was collected via web scraping in July 2024, in line with the website's robots.txt policy. The dataset is stored in a CSV file named foods.csv.

The dataset has been minimally cleaned up but still needs some preprocessing. Your first task will be to deduplicate the data and do any other preprocessing you think is necessary. Then, you will be asked to explore the dataset with the supervised and unsupervised learning techniques you have learned in the course. Importantly, you will have to come up with investigative questions and/or hypotheses that you will test with your analysis, and to figure out which of the techniques you have learned are most appropriate for answering these questions.

The dataset has the following kinds of information:

Name and page of the product.
The size of the product.
A longer description of the product.
Price per unit, which varies according to the type of product (e.g., weight, volume, quantity, etc.).
The categories of the product, which indicate under which listings the product is found on the website.
Nutritional information, such as energy, fat, saturated fat, carbohydrates, sugars, protein, salt and others, per serving of the product.
The ingredients of the product, when available.
Additional information about the product, such as whether it is suitable for vegetarians or vegans, or whether it contains allergens.

This assignment is worth 60% of your final grade.

📚 Preparation

Read carefully.

Click on 🔗 THIS LINK to be redirected to a page on GitHub where you will be asked to accept the assignment.
Accept the assignment: Click the link, sign in to GitHub, and click the green button.
Access your private repository: You’ll be redirected to a new private repository named ds202-irdap-coursework-2024-<yourusername>.
Clone your repository and open it on your preferred IDE (RStudio or VSCode): Once you’ve cloned your repository (RStudio instructions, VSCode instructions), a folder ds202-irdap-coursework-2024-<yourusername> will appear on your computer. We will refer to this folder as <github-repo-folder> for the rest of this document.

Note: Your new repository will contain no files, except maybe for a README.md file. You are responsible for creating the files and folders as described below.

Download the dataset: You must download the dataset foods.csv from the button below. Save it to your <github-repo-folder>.

Create a Quarto document: Create a <CANDIDATE_NUMBER>.qmd file, replacing the text <CANDIDATE_NUMBER> with your actual LSE candidate number.

For example, if your candidate number is 12345, then your file should be named 12345.qmd.

You should write all of your code and analysis in this file.
Set the document to self-contained: Replace whatever is between the --- lines at the top of your newly created .qmd file with the following:
```
---
title: "DS202 - IRDAP COURSEWORK (2024)"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
---
```
Frequently check that the code is replicable: Every now and then, you should render (knit) your Quarto document to an HTML file to check that the code is replicable. You can do this by clicking on the “Render” button in RStudio or by running the following command in your terminal:

quarto render <CANDIDATE_NUMBER>.qmd

THIS IS VERY IMPORTANT!! We will assess your work based on your HTML file, not your .qmd file. With every assignment, we receive last-minute panicked e-mails from students who can’t produce an HTML file because they get errors. 99% of the time, these errors appear because you typed things in the R Console that you didn’t include in your .qmd file, or because the order of the code in your .qmd file is not the same as the order in which you ran it. If you render your .qmd file frequently, you will catch these errors early and be able to fix them before the deadline.

Frequently push your changes to GitHub: You should commit and push your changes to GitHub frequently. We will assess your work based on the last version pushed to GitHub before the deadline. You won’t submit anything via Moodle. GitHub is configured to not accept any more pushes after the deadline. So, make sure you push your final version before the deadline.
Update your README.md file: You should update the README.md file in your repository to include a brief description of your project and acknowledge your use of Generative AI (whatever tool you used). Check the GenAI policy of this course.

📋 Tasks

If you follow all the instructions correctly, you should end up with a GitHub repository that has the following structure:

<github-repo-folder>
│
├── <CANDIDATE_NUMBER>.qmd
├── <CANDIDATE_NUMBER>.html
├── foods.csv
└── README.md

OK, but what should you do in your <CANDIDATE_NUMBER>.qmd file? Here are the tasks you need to complete:

Task 1: Clean up the data (30 marks)

Ideally, each row of the dataset should have a unique combination of product name, product page and size. No two rows should have the same combination of these three variables. However, the data collection process was not perfect, and there were some duplicates in the dataset. You will have to preprocess the dataset to produce a clean version of it, according to the unique combination mentioned above.
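
As a minimal sketch of the deduplication described above, the snippet below assumes the three identifying columns are called `name`, `page` and `size`. These names, and the toy values, are hypothetical; check the actual column headers in foods.csv. It first counts repeated combinations and then uses `distinct()` from dplyr to keep one row per combination.

```r
library(dplyr)

# Toy data standing in for foods.csv; column names and values are made up.
toy_foods <- data.frame(
  name  = c("Oat Milk", "Oat Milk", "Oat Milk", "Sourdough"),
  page  = c("/p/1", "/p/1", "/p/1", "/p/2"),
  size  = c("1L", "1L", "500ml", "400g"),
  price = c(1.90, 1.90, 1.10, 2.50)
)

# Count how often each (name, page, size) combination occurs before cleaning.
dupes <- toy_foods %>%
  count(name, page, size) %>%
  filter(n > 1)

# Keep the first row per combination; .keep_all = TRUE retains the other columns.
deduped <- toy_foods %>%
  distinct(name, page, size, .keep_all = TRUE)

nrow(toy_foods)  # 4 rows before
nrow(deduped)    # 3 rows after
```

Note that `distinct()` keeps whichever duplicate appears first. If duplicated rows differ in their other columns, you may want to sort the data first or otherwise decide explicitly which row to keep, and explain that choice in your notebook.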

In addition to deduplication, if you feel that there are other preprocessing steps that are necessary, you should place them at the start of the notebook.

💡 TIP: When you first start working on this assignment, focus on just deduplicating the data. Then, as you work on Tasks 2 and 3, whenever you feel the need to create a new column or modify the contents of an existing column in the dataset, go back to the preprocessing section of your notebook and add the new preprocessing/cleaning code there.

Your code should be well-commented and easy to follow. You should also include a brief explanation of the preprocessing steps you took in the text of your notebook.

Task 2: Unsupervised Learning (30 marks)

Once you have cleaned up the data, take a look at the dataset and think about what kind of questions you could answer with unsupervised learning techniques. You should come up with at least one investigative question, driven by your curiosity, to answer with unsupervised learning techniques.

Please remember to explain why you chose the techniques you used to answer these questions and to show, with plots or summary tables, the results of your analysis.

You will be judged on the quality of the questions you ask, the appropriateness of the techniques you use to answer them, and the clarity of your results.
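
To illustrate the general shape such an analysis could take, here is a rough sketch on simulated data. It is not a recommended question or method: PCA and k-means are used only as two familiar examples of unsupervised techniques, the column names are hypothetical, and the numbers are made up rather than taken from foods.csv.

```r
set.seed(202)  # make the simulated data and the k-means starts reproducible

# Simulated nutritional values for two made-up product groups.
n <- 50
nutrition <- data.frame(
  fat     = c(rnorm(n, mean = 2,  sd = 0.5), rnorm(n, mean = 30, sd = 3)),
  sugars  = c(rnorm(n, mean = 10, sd = 2),   rnorm(n, mean = 1,  sd = 0.3)),
  protein = c(rnorm(n, mean = 3,  sd = 1),   rnorm(n, mean = 25, sd = 3))
)

# Standardise so that no variable dominates purely because of its scale.
scaled <- scale(nutrition)

# PCA: what share of the variance does each component capture?
pca <- prcomp(scaled)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# k-means with k = 2 (k is an assumption here; in practice, compare several values).
km <- kmeans(scaled, centers = 2, nstart = 20)

# Summary table: mean of each nutrient per cluster, on the original scale.
cluster_means <- aggregate(nutrition, by = list(cluster = km$cluster), FUN = mean)

print(round(var_explained, 2))
print(cluster_means)
```

Whatever technique you pick, the same pattern applies: state the question, justify the method, and present a plot or summary table (such as the per-cluster means above) that a reader can interpret.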

Task 3: Supervised Learning (40 marks)

After you have finished the unsupervised learning section, think about what kind of questions you could answer with supervised learning techniques. You should come up with at least two investigative questions, driven by your curiosity, to answer with supervised learning techniques.

Remember to explain why you chose the techniques you used to answer these questions and to demonstrate, with plots or summary tables, the results of your analysis.

You will be judged on the quality of the questions you ask, the appropriateness of the techniques you use to answer them, and the clarity of your results.
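
As one possible illustration of a supervised workflow, the sketch below fits a linear regression on simulated data and evaluates it on a held-out test set. Everything here is an assumption made for the example: the column names are hypothetical, the data are generated rather than read from foods.csv, and linear regression is only one of many techniques you could justify.

```r
set.seed(202)  # reproducible simulation and train/test split

# Simulated data; values are made up and not drawn from foods.csv.
n <- 200
sim <- data.frame(
  fat     = runif(n, 0, 30),
  carbs   = runif(n, 0, 60),
  protein = runif(n, 0, 25)
)
# Energy in kcal from the usual 9/4/4 kcal-per-gram approximation, plus noise.
sim$energy <- 9 * sim$fat + 4 * sim$carbs + 4 * sim$protein + rnorm(n, sd = 10)

# Hold out 25% of the rows so the model is judged on data it has not seen.
test_idx <- sample(n, size = n / 4)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Fit a linear regression on the training set only.
fit <- lm(energy ~ fat + carbs + protein, data = train)

# Evaluate on the held-out test set with root mean squared error.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$energy - pred)^2))

print(round(coef(fit), 2))
print(round(rmse, 2))
```

Reporting a test-set metric alongside the fitted coefficients makes it easier to argue both how well the model answers your question and what it says about the relationship you are investigating.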

📦 Submission

Commit and push your changes to GitHub. Remember that we will assess your work based on the last version pushed to GitHub before the deadline. You won’t submit anything via Moodle. GitHub is configured to not accept any more pushes after the deadline, so make sure you push your final version before the deadline.

Final Tip


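# The category column is a string that represents a list of categories.
# If you want, you can split the string into a list column using str_split()
# Don't blindly copy the code below into your report!
# Think about what the data looks like before and after the transformation and check
# if this is a transformation you need for your analysis.
# Assumes `foods` has already been read in from foods.csv.
library(dplyr)    # mutate() and %>%
library(stringr)  # str_replace_all() and str_split()

foods <-
    foods %>%
    # Remove the square brackets and single quotes around the category names,
    # then split on commas (",\\s*" also drops any space after each comma)
    mutate(category = str_replace_all(category, "[\\[\\]\\']", ""),
           category = str_split(category, ",\\s*"))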
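
To see what the data looks like before and after this transformation, here is a small sketch on a made-up example. The toy strings below are an assumption about how the category column might be formatted; inspect the real values in foods.csv before relying on it. The final `unnest()` step from tidyr is an optional extra, not something the tip above requires.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up example values; the real strings in foods.csv may be formatted differently.
toy <- data.frame(
  name     = c("Sourdough", "Oat Milk"),
  category = c("['Bakery', 'Bread']", "['Drinks']")
)

# Before: category is a plain character column.
class(toy$category)

# Strip brackets and quotes, then split on commas;
# splitting on ",\\s*" also drops any space after each comma.
toy_split <- toy %>%
  mutate(category = str_replace_all(category, "[\\[\\]\\']", ""),
         category = str_split(category, ",\\s*"))

# After: category is a list column holding one character vector per product.
class(toy_split$category)
lengths(toy_split$category)

# Optional: one row per product-category pair, which is easier to count or plot.
toy_long <- toy_split %>%
  unnest(category)
print(toy_long)
```

A list column keeps one row per product, while the unnested version repeats the product once per category. Which shape you need depends on the question you are asking, so check row counts before and after.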