πŸ—“οΈ Week 01 – Day 04: Reshaping Data for Visualization

Practice makes perfect!

Author
Published

11 July 2024

Today we have practical exercises to hep strengthen what you saw yesterday. We will practice cleaning, summarising and reorganizing data for visualization.

πŸ₯… Learning Objectives

Review the goals for today

At the end of the day you should be able to:

  • Create your own computational notebook from scratch
  • Organize your computational notebook in subsections using Markdown
  • Create new columns as needed to summarize data
  • Create (lambda) functions to apply to data
  • Use the groupby -> apply -> combine pattern to summarize data
  • Create plots using lets-plot

πŸ“š Preparation

🎯 ACTION POINTS:

(Preferably in pairs)

  1. Ensure you have the base miniconda environment activated just like yesterday.

  2. Download the new dataset for today’s exercises by clicking the button below:

    Unzip the file and save the folder on the ME204/data folder. You should have a ME204/data/waitrose folder with the several .csv files.

  3. Create a new Jupyter Notebook in the ME204/code folder. Give it a suitable name.

  4. Add a Markdown cell at the top of the notebook with the following content:

    # Week 01 -- Day 04: Reshaping Data for Visualization

    Feel free to add more information to the cell later.

  5. Add a Python cell where you will keep the imports for the notebook. You can start with the following:

    # To list files in a directory
    import os 
    
    # For data manipulation
    import pandas as pd
  6. Create a new Python to load the data from the ME204/data/waitrose folder into your notebook using the code below:

    # List all files in the ME204/data/waitrose folder
    all_files = [os.path.join('waitrose', file) for file in os.listdir('../data/waitrose') 
                 if file.endswith('.csv')]
    
    # Read every single file and concatenate them into a single DataFrame with pandas concat
    df = pd.concat((pd.read_csv(file) for file in all_files))
  7. Check that the data was loaded correctly by running the following code:

    df.head()

    You should see the first few rows of the dataset. You can also use df.info() to get more information about the dataset.

  8. Perform the initial pre-processing steps to clean the data using the following code:

    # Drop duplicates
    df = df.drop_duplicates()
    
    df = df.drop(columns=['data-product-name', 
                          'data-product-type', 
                          'data-product-index'])
    df = (
        df.rename(columns={
            'data-product-id': 'id',
            'data-product-on-offer': 'offer',
            'product-page': 'page',
            'product-name': 'name',
            'product-size': 'size',
        })
    )
    
    # The id does not need 64 bits. 32 bits is enough.
    # See https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.intc for ranges.
    df['id'] = df['id'].astype('int32')

πŸ“‹ Exercise 01: Revisiting the clean_item_price() function

Yesterday, we created the following function to clean the item price and it worked fine for the bakery.csv file:

def clean_item_price(item_price: str):    
    """
    Cleans the item price string by performing necessary transformations.

    Parameters:
    item_price (str): The item price as a string.

    Returns:
    str: The cleaned item price.
    """
    
    if 'Β£' in item_price:
        item_price = item_price.replace('Β£', '')
    elif 'p' in item_price:
        item_price = item_price.replace('p', '')
        item_price = '0.' + item_price
        
    return float(item_price)

Now, you will notice that it no longer works when we have more data. We will need to rethink the function to make it more robust.

🎯 ACTION POINTS:

  1. In conversation with your coding partner, try to figure out and articulate why the error is happening

    πŸ’‘ You can document and register your thoughts in a markdown cell to help you remember later.

  2. Spot which rows/columns are causing the issue.

  3. Make decisions about how to handle the issue without dropping any data.

  4. Adjust the clean_item_price() function to handle the new situations you found.

  5. Check that you have a new df['item-price'] column with the cleaned prices.

πŸ“‹ Exercise 02: What is the distribution of prices per category?

We now have more categories in the dataset. We would like to know the distribution of prices per category and eventually plot it.

🎯 ACTION POINTS:

Your goal is to create a dataframe, call it plot_df, where the unit of analysis is the category and the variables are as follows:

category count mean std min Q1 median Q3 max
… … … … … … … … …

Where Q1 is the first quartile (25th percentile), Q3 is the third quartile (75th percentile), and std is the standard deviation.

How would you know if you did it correctly?

You will be able to run the following code and see the the plot below:

Click to reveal the code
# This configures what shows up when you hover your mouse over the plot.
tooltip_setup = (
    layer_tooltips()
        .line('@category')
        .line('[@Q1 -- @median -- @Q3]')
        .format('@Q1', 'Β£ {.2f}')
        .format('@median', 'Β£ {.2f}')
        .format('@Q3', 'Β£ {.2f}')
)

g = (
    # Maps the columns to the aesthetics of the plot.
    ggplot(plot_df, aes(y='category', x='median', xmin='Q1', xmax='Q3', fill='category')) +

    # GEOMS

    # Add a line range that 'listens to' columns informed in `ymin` and `ymax` aesthetics
    geom_linerange(size=1, alpha=0.75, tooltips=tooltip_setup) +

    # Add points to the plot (listen to `x` and `y` and fill aesthetics)
    geom_point(size=3, stroke=1, shape=21, tooltips=tooltip_setup) +

    # SCALES

    # Remove the legend (we can already read the categories from the y-axis)
    scale_fill_discrete(guide='none') +

    # Specify names for the axes
    scale_y_continuous(name="Categories\n(from largest to smallest median)", expand=[0.05, 0.05]) +
    scale_x_continuous(name="Price (Β£)", expand=[0., 0.05], format='Β£ {.2f}', breaks=np.arange(0, 20, 2.5)) +

    # LABELS
    # It's nice when the plot tells you the key takeaways
    labs(title='"Beer, Wine & Spirits" has the highest median price for individual items',
         subtitle="Dots represent the median price, bars represent the 25th and 75th percentiles") +
    theme(axis_text_x=element_text(size=15),
        axis_text_y=element_text(size=17),
        axis_title_x=element_text(size=20),
        axis_title_y=element_text(size=20),
        plot_title=element_text(size=19, face='bold'),
        plot_subtitle=element_text(size=18),
        legend_position='none') +
    ggsize(1000, 500)

)

g

πŸ“‹ Exercise 03: Are there any products that appear in multiple categories?

Also: if so, why didn’t the df.drop_duplicates() method remove them?

🎯 ACTION POINTS:

  • Write code to investigate whether there are products that appear in multiple categories.

πŸ“š Next Steps

Did you finish it all before the end of the session? If so, you can start working on the Midterm Assignment due on Monday 15 July 2024 at 9pm, UK Time.