🗓️ Week 01 – Day 04: Reshaping Data for Visualization
Practice makes perfect!
Today we have practical exercises to help strengthen what you saw yesterday. We will practice cleaning, summarising and reorganising data for visualization.
🎥 Learning Objectives
Review the goals for today
At the end of the day you should be able to:
- Create your own computational notebook from scratch
- Organize your computational notebook in subsections using Markdown
- Create new columns as needed to summarize data
- Create (lambda) functions to apply to data
- Use the `groupby` → `apply` → `combine` pattern to summarize data
- Create plots using `lets-plot`
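As a quick refresher before the exercises, the split–apply–combine pattern with a lambda can be sketched as below. This is a minimal, self-contained example: the toy data and column names are invented for illustration and are not from today's dataset.

```python
import pandas as pd

# A tiny, made-up dataset: prices of items in two categories
toy_df = pd.DataFrame({
    'category': ['bakery', 'bakery', 'drinks', 'drinks'],
    'price': [1.20, 2.80, 5.00, 7.00],
})

# Split by category, apply a lambda to each group's prices,
# and let pandas combine the results back into a single Series
price_range = toy_df.groupby('category')['price'].apply(lambda s: s.max() - s.min())

print(price_range)  # bakery -> 1.6, drinks -> 2.0
```

The same pattern, with `describe()` or a custom function in place of the lambda, is what you will use to summarize the Waitrose data today.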
📋 Preparation
🎯 ACTION POINTS:
(Preferably in pairs)

1. Ensure you have the `base` miniconda environment activated, just like yesterday.

2. Download the new dataset for today's exercises by clicking the button below:

3. Unzip the file and save the folder in the `ME204/data` folder. You should have a `ME204/data/waitrose` folder with several `.csv` files.

4. Create a new Jupyter Notebook in the `ME204/code` folder. Give it a suitable name.

5. Add a Markdown cell at the top of the notebook with the following content:

   ```markdown
   # Week 01 -- Day 04: Reshaping Data for Visualization
   ```

   Feel free to add more information to the cell later.

6. Add a Python cell where you will keep the imports for the notebook. You can start with the following:

   ```python
   # To list files in a directory
   import os

   # For data manipulation
   import pandas as pd
   ```

7. Create a new Python cell to load the data from the `ME204/data/waitrose` folder into your notebook using the code below:

   ```python
   # List all files in the ME204/data/waitrose folder
   all_files = [os.path.join('../data/waitrose', file)
                for file in os.listdir('../data/waitrose')
                if file.endswith('.csv')]

   # Read every single file and concatenate them into a single DataFrame with pandas
   df = pd.concat((pd.read_csv(file) for file in all_files))
   ```

8. Check that the data was loaded correctly by running the following code:

   ```python
   df.head()
   ```

   You should see the first few rows of the dataset. You can also use `df.info()` to get more information about the dataset.

9. Perform the initial pre-processing steps to clean the data using the following code:

   ```python
   # Drop duplicates
   df = df.drop_duplicates()

   # Drop columns we will not need
   df = df.drop(columns=['data-product-name',
                         'data-product-type',
                         'data-product-index'])

   # Rename the remaining columns to shorter names
   df = (
       df.rename(columns={
           'data-product-id': 'id',
           'data-product-on-offer': 'offer',
           'product-page': 'page',
           'product-name': 'name',
           'product-size': 'size',
       })
   )

   # The id does not need 64 bits. 32 bits is enough.
   # See https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.intc for ranges.
   df['id'] = df['id'].astype('int32')
   ```
📝 Exercise 01: Revisiting the `clean_item_price()` function
Yesterday, we created the following function to clean the item price, and it worked fine for the `bakery.csv` file:
```python
def clean_item_price(item_price: str):
    """
    Cleans the item price string by performing necessary transformations.

    Parameters:
        item_price (str): The item price as a string.

    Returns:
        float: The cleaned item price.
    """
    if '£' in item_price:
        item_price = item_price.replace('£', '')
    elif 'p' in item_price:
        item_price = item_price.replace('p', '')
        item_price = '0.' + item_price

    return float(item_price)
```
Now, you will notice that it no longer works when we have more data. We will need to rethink the function to make it more robust.
🎯 ACTION POINTS:
1. In conversation with your coding partner, try to figure out and articulate why the error is happening.

   💡 You can document and register your thoughts in a Markdown cell to help you remember later.

2. Spot which rows/columns are causing the issue.

3. Make decisions about how to handle the issue without dropping any data.

4. Adjust the `clean_item_price()` function to handle the new situations you found.

5. Check that you have a new `df['item-price']` column with the cleaned prices.
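One way to spot the problematic values is to probe every unique raw price with the old function and collect the ones it cannot handle. This is a sketch of the debugging technique, not the solution: the example strings below are hypothetical, and in your notebook you would use `df['item-price'].unique()` instead.

```python
import pandas as pd

def clean_item_price(item_price: str):
    """Yesterday's cleaner: strips '£', or strips a trailing 'p' and prepends '0.'."""
    if '£' in item_price:
        item_price = item_price.replace('£', '')
    elif 'p' in item_price:
        item_price = item_price.replace('p', '')
        item_price = '0.' + item_price
    return float(item_price)

# Hypothetical raw values; replace with df['item-price'].unique() in your notebook
raw_prices = pd.Series(['£2.50', '85p', '£12.00', 'Typical weight 0.4kg'])

# Try the function on every unique value and collect the ones that raise an error
failing = []
for value in raw_prices.unique():
    try:
        clean_item_price(value)
    except ValueError:
        failing.append(value)

print(failing)  # these are the values your more robust version must handle
```

Documenting the contents of `failing` in a Markdown cell is a good way to articulate why the error happens before you fix it.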
📝 Exercise 02: What is the distribution of prices per `category`?
We now have more categories in the dataset. We would like to know the distribution of prices per category and eventually plot it.
🎯 ACTION POINTS:
Your goal is to create a dataframe, call it `plot_df`, where the unit of analysis is the `category` and the variables are as follows:

| category | count | mean | std | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|---|
| … | … | … | … | … | … | … | … | … |
Where `Q1` is the first quartile (25th percentile), `Q3` is the third quartile (75th percentile), and `std` is the standard deviation.
How would you know if you did it correctly?
You will be able to run the following code and see the plot below:
Click to reveal the code
```python
# These imports are assumed to be in your imports cell as well
import numpy as np
from lets_plot import *

# This configures what shows up when you hover your mouse over the plot.
tooltip_setup = (
    layer_tooltips()
    .line('@category')
    .line('[@Q1 -- @median -- @Q3]')
    .format('@Q1', '£ {.2f}')
    .format('@median', '£ {.2f}')
    .format('@Q3', '£ {.2f}')
)

g = (
    # Maps the columns to the aesthetics of the plot.
    ggplot(plot_df, aes(y='category', x='median', xmin='Q1', xmax='Q3', fill='category')) +

    # GEOMS
    # Add a line range that 'listens to' columns informed in `xmin` and `xmax` aesthetics
    geom_linerange(size=1, alpha=0.75, tooltips=tooltip_setup) +

    # Add points to the plot (listen to `x`, `y` and fill aesthetics)
    geom_point(size=3, stroke=1, shape=21, tooltips=tooltip_setup) +

    # SCALES
    # Remove the legend (we can already read the categories from the y-axis)
    scale_fill_discrete(guide='none') +

    # Specify names for the axes
    scale_y_continuous(name="Categories\n(from largest to smallest median)",
                       expand=[0.05, 0.05]) +
    scale_x_continuous(name="Price (£)", expand=[0., 0.05], format='£ {.2f}',
                       breaks=np.arange(0, 20, 2.5)) +

    # LABELS
    # It's nice when the plot tells you the key takeaways
    labs(title='"Beer, Wine & Spirits" has the highest median price for individual items',
         subtitle="Dots represent the median price, bars represent the 25th and 75th percentiles") +

    theme(axis_text_x=element_text(size=15),
          axis_text_y=element_text(size=17),
          axis_title_x=element_text(size=20),
          axis_title_y=element_text(size=20),
          plot_title=element_text(size=19, face='bold'),
          plot_subtitle=element_text(size=18),
          legend_position='none') +

    ggsize(1000, 500)
)

g
```
🔗 Useful Tips and Links to official documentation:

- Revisit the `groupby()` method (see `groupby` in the pandas documentation)
- You might want to use the `describe()` method for pandas Series (see `describe` in the pandas documentation)
- If you use categorical variables, you might be able to control the order of the categories (see `Categorical` in the pandas documentation)
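To see how the tips above fit together, here is a rough sketch on toy data of how `describe()` output maps onto the column names requested in Exercise 02. The toy values are invented, and renaming the percentile columns is just one possible approach.

```python
import pandas as pd

# Invented data with the same shape as the exercise: a category and a price per row
toy_df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'b'],
    'item-price': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# describe() on each group's prices yields:
# count, mean, std, min, 25%, 50%, 75%, max
summary = toy_df.groupby('category')['item-price'].describe()

# Rename the percentile columns to the names used in the exercise
summary = summary.rename(columns={'25%': 'Q1', '50%': 'median', '75%': 'Q3'})
summary = summary.reset_index()

print(summary.columns.tolist())
# ['category', 'count', 'mean', 'std', 'min', 'Q1', 'median', 'Q3', 'max']
```

With the real data, sorting by `median` (or making `category` a pandas `Categorical` ordered by median) controls the order of the categories in the plot.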
📝 Exercise 03: Are there any products that appear in multiple categories?

Also: if so, why didn't the `df.drop_duplicates()` method remove them?
🎯 ACTION POINTS:
- Write code to investigate whether there are products that appear in multiple categories.
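The key idea, sketched on invented toy data (the column names follow the renamed dataset, but the product values are made up): `drop_duplicates()` only removes rows that are identical in *every* column, so the same product listed under two categories survives.

```python
import pandas as pd

# Invented rows: the same product appears under two different categories
toy_df = pd.DataFrame({
    'name': ['Oat Milk 1l', 'Oat Milk 1l', 'Sourdough Loaf'],
    'category': ['drinks', 'essentials', 'bakery'],
})

# Both 'Oat Milk 1l' rows survive: they differ in 'category',
# so the rows are not identical in every column
deduped = toy_df.drop_duplicates()
print(len(deduped))  # 3 -- nothing was dropped

# Count in how many distinct categories each product appears
categories_per_product = toy_df.groupby('name')['category'].nunique()
multi_category = categories_per_product[categories_per_product > 1]
print(multi_category.index.tolist())  # ['Oat Milk 1l']
```

The same `groupby(...).nunique()` probe, applied to the real `df`, answers the exercise question directly.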
🚀 Next Steps
Did you finish it all before the end of the session? If so, you can start working on the Midterm Assignment, due on Monday, 15 July 2024 at 9pm (UK time).