🗓️ Week 01 – Day 04: Reshaping Data for Visualization

Practice makes perfect!

Author

Published

11 July 2024

Today we have practical exercises to hep strengthen what you saw yesterday. We will practice cleaning, summarising and reorganizing data for visualization.

🥅 Learning Objectives

Review the goals for today

At the end of the day you should be able to:

Create your own computational notebook from scratch
Organize your computational notebook in subsections using Markdown
Create new columns as needed to summarize data
Create (lambda) functions to apply to data
Use the groupby -> apply -> combine pattern to summarize data
Create plots using lets-plot

📚 Preparation

🎯 ACTION POINTS:

(Preferably in pairs)

Ensure you have the base miniconda environment activated just like yesterday.
Download the new dataset for today’s exercises by clicking the button below:

Unzip the file and save the folder on the ME204/data folder. You should have a ME204/data/waitrose folder with the several .csv files.
Create a new Jupyter Notebook in the ME204/code folder. Give it a suitable name.
Add a Markdown cell at the top of the notebook with the following content:
```
# Week 01 -- Day 04: Reshaping Data for Visualization
```
Feel free to add more information to the cell later.
Add a Python cell where you will keep the imports for the notebook. You can start with the following:
```
# To list files in a directory
import os 

# For data manipulation
import pandas as pd
```

Create a new Python to load the data from the ME204/data/waitrose folder into your notebook using the code below:

# List all files in the ME204/data/waitrose folder
all_files = [os.path.join('waitrose', file) for file in os.listdir('../data/waitrose') 
             if file.endswith('.csv')]

# Read every single file and concatenate them into a single DataFrame with pandas concat
df = pd.concat((pd.read_csv(file) for file in all_files))

Check that the data was loaded correctly by running the following code:
```
df.head()
```
You should see the first few rows of the dataset. You can also use df.info() to get more information about the dataset.

Perform the initial pre-processing steps to clean the data using the following code:

# Drop duplicates
df = df.drop_duplicates()

df = df.drop(columns=['data-product-name', 
                      'data-product-type', 
                      'data-product-index'])
df = (
    df.rename(columns={
        'data-product-id': 'id',
        'data-product-on-offer': 'offer',
        'product-page': 'page',
        'product-name': 'name',
        'product-size': 'size',
    })
)

# The id does not need 64 bits. 32 bits is enough.
# See https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.intc for ranges.
df['id'] = df['id'].astype('int32')

📋 Exercise 01: Revisiting the `clean_item_price()` function

Yesterday, we created the following function to clean the item price and it worked fine for the bakery.csv file:

def clean_item_price(item_price: str):    
    """
    Cleans the item price string by performing necessary transformations.

    Parameters:
    item_price (str): The item price as a string.

    Returns:
    str: The cleaned item price.
    """
    
    if '£' in item_price:
        item_price = item_price.replace('£', '')
    elif 'p' in item_price:
        item_price = item_price.replace('p', '')
        item_price = '0.' + item_price
        
    return float(item_price)

Now, you will notice that it no longer works when we have more data. We will need to rethink the function to make it more robust.

🎯 ACTION POINTS:

In conversation with your coding partner, try to figure out and articulate why the error is happening

💡 You can document and register your thoughts in a markdown cell to help you remember later.
Spot which rows/columns are causing the issue.
Make decisions about how to handle the issue without dropping any data.
Adjust the clean_item_price() function to handle the new situations you found.
Check that you have a new df['item-price'] column with the cleaned prices.

📋 Exercise 02: What is the distribution of prices per `category`?

We now have more categories in the dataset. We would like to know the distribution of prices per category and eventually plot it.

🎯 ACTION POINTS:

Your goal is to create a dataframe, call it plot_df, where the unit of analysis is the category and the variables are as follows:

category	count	mean	std	min	Q1	median	Q3	max
…	…	…	…	…	…	…	…	…

Where Q1 is the first quartile (25th percentile), Q3 is the third quartile (75th percentile), and std is the standard deviation.

How would you know if you did it correctly?

You will be able to run the following code and see the the plot below:

Click to reveal the code

# This configures what shows up when you hover your mouse over the plot.
tooltip_setup = (
    layer_tooltips()
        .line('@category')
        .line('[@Q1 -- @median -- @Q3]')
        .format('@Q1', '£ {.2f}')
        .format('@median', '£ {.2f}')
        .format('@Q3', '£ {.2f}')
)

g = (
    # Maps the columns to the aesthetics of the plot.
    ggplot(plot_df, aes(y='category', x='median', xmin='Q1', xmax='Q3', fill='category')) +

    # GEOMS

    # Add a line range that 'listens to' columns informed in `ymin` and `ymax` aesthetics
    geom_linerange(size=1, alpha=0.75, tooltips=tooltip_setup) +

    # Add points to the plot (listen to `x` and `y` and fill aesthetics)
    geom_point(size=3, stroke=1, shape=21, tooltips=tooltip_setup) +

    # SCALES

    # Remove the legend (we can already read the categories from the y-axis)
    scale_fill_discrete(guide='none') +

    # Specify names for the axes
    scale_y_continuous(name="Categories\n(from largest to smallest median)", expand=[0.05, 0.05]) +
    scale_x_continuous(name="Price (£)", expand=[0., 0.05], format='£ {.2f}', breaks=np.arange(0, 20, 2.5)) +

    # LABELS
    # It's nice when the plot tells you the key takeaways
    labs(title='"Beer, Wine & Spirits" has the highest median price for individual items',
         subtitle="Dots represent the median price, bars represent the 25th and 75th percentiles") +
    theme(axis_text_x=element_text(size=15),
        axis_text_y=element_text(size=17),
        axis_title_x=element_text(size=20),
        axis_title_y=element_text(size=20),
        plot_title=element_text(size=19, face='bold'),
        plot_subtitle=element_text(size=18),
        legend_position='none') +
    ggsize(1000, 500)

)

g

🔗 Useful Tips and Links to official documentation:

Revisit the groupby() method (see groupby in pandas documentation)
You might want to use the describe() method for pandas Series (see describe in pandas documentation)
If you use categorical variables, you might be able to control the order of the categories (see Categorical in pandas documentation)

📋 Exercise 03: Are there any products that appear in multiple categories?

Also: if so, why didn’t the df.drop_duplicates() method remove them?

🎯 ACTION POINTS:

Write code to investigate whether there are products that appear in multiple categories.

📚 Next Steps

Did you finish it all before the end of the session? If so, you can start working on the Midterm Assignment due on Monday 15 July 2024 at 9pm, UK Time.

🥅 Learning Objectives

📚 Preparation

📋 Exercise 01: Revisiting the clean_item_price() function

📋 Exercise 02: What is the distribution of prices per category?

🔗 Useful Tips and Links to official documentation:

📋 Exercise 03: Are there any products that appear in multiple categories?

📚 Next Steps

📋 Exercise 01: Revisiting the `clean_item_price()` function

📋 Exercise 02: What is the distribution of prices per `category`?