💻 Week 04 Lab
Cleaning and transforming data with pandas
You might feel very lost if you didn't attend the lecture. This lab is a direct continuation of it, and we will be using the same files and concepts introduced there.
📍 Location: Friday, 25 October 2024. Time varies.
🥅 Learning Objectives
We want you to learn and practise the following in this lab:
📋 Preparation
- Complete at least Task 1 of the 📝 W04 Formative Exercise.
- Attend the 👨🏻‍🏫 W04 Lecture.
- Be ready to play the roles of 🧑‍✈️ Pilot and 🚀 Copilot(s) in this lab again!
NOTE: The skills you are practising today will be essential for the ✍️ W06 Summative.
🛣️ Roadmap
Here is how we will achieve the goal for this lab:
Part I: ⚙️ Set Up (20 min)
Everyone should set up their environment for this lab, regardless of the role they will play.
🎯 INDIVIDUAL ACTION POINTS
Ensure you have a copy of the files from the lecture.
Let's assemble all the files we need for this lab. Choose the option that best fits your situation:
Option 1: I cloned the GitHub repository yesterday in the lecture
You've been paying attention. You're a star! 🌟 Try the instructions below for the ultimate experience.
If you get stuck or run out of time, skip this approach for now and try the second option.
1. Go to the GitHub repository we created yesterday.
2. Create a fork of it! Click on the Fork button in the top right corner of the page. This will create a copy of the repository under your account (so you can make mistakes as you learn without breaking the original).
3. Rename or delete the old repository. We will clone this new version of the repository to your computer/Nuvolos, but the folder name will be the same as the old one, so we need to delete or rename the old one first to avoid conflicts.

   Whether you are on Nuvolos or your own computer:

   - If you took notes on the repository yesterday that you want to keep, rename the folder from `ds105a-2024` to something like `ds105a-2024-old`. If you want to do this via the Terminal, run the command `mv ds105a-2024 ds105a-2024-old` on bash, zsh or PowerShell.
   - Otherwise, it's safe to delete the folder. If you want to do this via the Terminal, run the command `rm -rf ds105a-2024` on bash or zsh. If you are on PowerShell, run `rmdir ds105a-2024` (press "A" when asked).
Option 2: I didn't clone the GitHub repository yesterday, or I found the steps above too complicated
Download the W04-Lecture starting files:
Unzip the file.
Use your computer's file explorer to extract all files and folders from the zip file, then move the extracted folder to an appropriate location on your computer.
If you are on Nuvolos, upload the zip file to Nuvolos, to `/files`, and unzip it there. See the instructions below:
- Hover your mouse on top of the file and select "Extract All".
- If this creates a folder within a folder, you can organise it later with `mv` commands (don't worry about it if you don't have the time right now).
Open the folder on VS Code.
Create a new notebook at `W04-Lecture/code/NB00 - Preprocessing.ipynb`. (Yes, we are taking one step back from where we started in the lecture.)
You should see the following project structure:
Part II: Cleaning up the data (rest of the lab)
We have a couple of goals here:
🏆 TODAY'S CHALLENGES:
- Work with the original CSV files.
- Convert the `item-price` column of the original data from a string to a float.
- Check that your pre-processing worked by plotting the `item-price` column per category.
- If time allows, clean up the `product-size` column whenever it displays a range.
The point is not to get to the solution as quickly as possible. Instead:
👉 Prioritise:
- Discussing the problem with your partner: how would you split the problem into smaller pieces?
- Thinking about the Python or pandas building blocks you could use to solve it.
- Figuring out the best immediate next step to take. Don't try to solve everything at once.
Not all groups will be able to complete all of the puzzles below in the lab, so don't worry if you are not progressing as fast as you would like. We will share the solutions with you at the end of the day, and you can study from there and ask questions.
🦸 Your class teacher is equipped with the superpower of conflict mediation. Ask them for help if you are struggling to collaborate or if you feel stuck as a team.
🎯 ACTION POINTS:
👥 Together: Negotiate the roles.
Ideally, the person less experienced or less confident with their programming skills should take the role of π§ββοΈ Pilot.
🚀 Copilot(s) should not take over if the Pilot is struggling to understand or follow a suggestion. The whole point of this activity is to figure out how to best express what you know and don't know to others.
You can decide if and when you want to swap roles for the rest of the lab.
Starting Point
To make your life easier, here is the pandas code to read all the CSV files into a single DataFrame.
There are new Python features (list comprehension) and new pandas functions (pd.concat) that you might not have seen before. If you understand roughly what this code does, you are in a good place. Ask your class teacher if you have any questions about specific lines.
🐍 Code to read all the CSV files
If you are reading this on Moodle and for some random reason, copying the code below is not working, visit the public website version of this page and copy the code from there.
```python
import os

import pandas as pd

# Identify the location of the original files
# This represents the path: ../data/waitrose-2024-07
data_folder = os.path.join('..', 'data', 'waitrose-2024-07')

# Use a list comprehension to get all the files in the folder
all_files = [os.path.join(data_folder, file) for file in os.listdir(data_folder)
             if file.endswith('.csv')]

# Print the list of files if you want to check
# print(all_files)

# Read every single file as a DataFrame
# Save the dataframes in a list
list_of_dfs = [pd.read_csv(file) for file in all_files]

# Use pd.concat to concatenate all the files into a single DataFrame
df = pd.concat(list_of_dfs)

# Check that we have all the data
df.info()
```
Puzzle 1: Explore the `item-price` column (20 min)
Don't spend more than 20 min here. Jump to the solution if you have exhausted the time.
π― ACTION POINTS:
🧑‍✈️ Pilot: Create a new heading called "Exploring the `item-price` column". Then, create a new cell and take a look at what's inside the `item-price` column.
👥 Together: Discuss the data.
Can we convert these strings straight away to floats?
- If yes, just convert it and you are done with this lab! You can scroll down to the plot and go home!
- If not, what type of data cleaning do we need to do?
🧑‍✈️ Pilot: Write down a data cleaning strategy.
You can add a new markdown cell and place some bullet points with the type of cleaning you need to do. Then, discuss the best way to implement it.
🚀 Copilot(s): you can help by browsing `NB01`, your own notes, or the pandas documentation, or use AI to try to identify which `DataFrame` or `Series` operations could be useful here.
🤖 To make good use of AI: remember to always give it A LOT of context! You need to provide a piece of what the data looks like, the type of problems in the column, what you know about Python/pandas and what you don't know. Only then will you get recommendations that are truly useful for learning.
👥 Together: Discuss validation strategies.
How would you know if your data cleaning worked? Write it down in a markdown cell.
✅ A solution for this part
Ideally you should have identified that the `item-price` column is full of non-numeric characters and that the prices are represented in four different ways:
- Some prices have a currency symbol at the beginning: `£1.99`
- Others just have a `p` at the end: `99p`, to represent 99 pence
- Some represent unitary prices: `£ 2.88 each est.`
- Some represent ranges: `£1.99 - £2.99`
How would you know if your data cleaning worked?
A very simple way is to look at the `../data/waitrose-products-combined-2024-07.csv` file and check if your new version of the `item-price` column looks exactly like the one in the file.
🤫 Here's some code that could help you test that:
```python
# Read the 'cleaned' file
df_cleaned = pd.read_csv('../data/waitrose-products-combined-2024-07.csv')

# The Series method equals() could help you here
df_cleaned['item-price'].equals(df['item-price'])
```
Another, more robust way is to convert the column to a float and check if the conversion worked:
```python
# This should not raise an error
df['item-price'].astype(float)

# There should be no None or NaN values
df['item-price'].astype(float).isna().sum() == 0
```
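If `astype(float)` raising an error feels too blunt, a gentler diagnostic is `pd.to_numeric` with `errors='coerce'`, which lets you count and inspect the rows that still fail. The sketch below uses a made-up toy Series, not the real column:

```python
import pandas as pd

# Hypothetical toy sample standing in for a partially cleaned 'item-price' column
prices = pd.Series(['1.99', '0.99', '£2.88', '2.50'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN,
# so the rows that still need cleaning show up as missing values
converted = pd.to_numeric(prices, errors='coerce')

# Count how many strings failed to convert
n_failed = converted.isna().sum()
```

You can then filter with `prices[converted.isna()]` to see exactly which strings still need cleaning.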
Puzzle 2: Start with the `£` situation
🎯 ACTION POINTS
🧑‍✈️ Pilot: Add a function to clean up a price string.
Add the skeleton of a function below to your notebook:
```python
def clean_item_price(item_price: str):
    return item_price
```
This function is very silly right now. It just returns the same string you pass to it. But it's a good starting point.
💡 Tip: Test the function with just ONE string:
```python
clean_item_price('£1.99')
# or
clean_item_price('£ 2.88 each est.')
```
🧑‍✈️ Pilot: Test the function with `Series.apply()`.
- 🚀 Copilot(s): find out how to apply a custom function to an entire column in one line.
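As a generic illustration of the pattern (on a made-up two-element Series, using the silly skeleton function from above), `Series.apply()` calls a function once per element and returns a new Series:

```python
import pandas as pd

# A made-up, tiny stand-in for the real 'item-price' column
s = pd.Series(['£1.99', '99p'])

# The skeleton function from the lab (it doesn't clean anything yet)
def clean_item_price(item_price: str):
    return item_price

# apply() runs the function on every element and returns a new Series
cleaned = s.apply(clean_item_price)
```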
🧑‍✈️ Pilot: Modify the function to treat the simplest of the four data cleaning scenarios. That is, figure out what to write in the function to convert strings that have the pound sign on the left, e.g. `£ 2.00`. Your function should still return a string, but a clean one that looks like a float.
- 🚀 Copilot(s): remind the pilot to test the function with individual values first before using the `apply()` method.
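If your pair gets stuck, here is a sketch of one possible approach for the pound-sign-on-the-left case only; your own strategy may well differ, and the other three scenarios are deliberately left untouched:

```python
def clean_item_price(item_price: str):
    # Simplest scenario: a pound sign on the left, e.g. '£1.99' or '£ 2.00'
    if item_price.startswith('£'):
        return item_price.replace('£', '').strip()
    # Leave every other scenario untouched (for now)
    return item_price

clean_item_price('£ 2.00')  # '2.00'
```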
Puzzle 3: The `p` situation
Keep going with the same function, this time add code to treat cases where the price is represented in pence.
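In case you want a nudge, here is a standalone sketch of just the pence case. In the lab you would fold this logic into your existing `clean_item_price()` rather than keep a separate function; the helper name below is just for illustration:

```python
def clean_pence(item_price: str):
    # '99p' means 99 pence, i.e. £0.99
    if item_price.endswith('p'):
        pence = float(item_price.rstrip('p'))
        return str(pence / 100)
    return item_price

clean_pence('99p')  # '0.99'
```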
Puzzle 4: The `each` situation
Keep going with the same function, this time add code to treat cases where the price is represented in unitary prices.
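Again, a hedged sketch of one way to treat just this case (the helper name is illustrative; in the lab this logic belongs inside your `clean_item_price()` function):

```python
def clean_each(item_price: str):
    # '£ 2.88 each est.' -> keep just the numeric part
    if 'each' in item_price:
        item_price = item_price.replace('each est.', '').replace('£', '').strip()
    return item_price

clean_each('£ 2.88 each est.')  # '2.88'
```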
Puzzle 5: The range situation
Keep going with the same function, this time add code to treat cases where the price is represented in ranges.
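One (debatable!) option for ranges is to replace them with the midpoint of the two endpoints; whether that is the right summary for your analysis is exactly the kind of thing to discuss with your partner. A sketch, again as a standalone illustrative helper:

```python
def clean_range(item_price: str):
    # '£1.99 - £2.99' -> the midpoint of the two endpoints
    if ' - ' in item_price:
        low, high = [float(p.replace('£', '').strip())
                     for p in item_price.split(' - ')]
        return str((low + high) / 2)
    return item_price
```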
👀 Think about it: do your modifications to the `item-price` column have repercussions for other columns in the DataFrame?
📊 Plotting the data
If, and only if, you managed to create a good `clean_item_price()` function, you can now convert the `item-price` column to a float and run the code below to investigate how much each category costs on average (here we are using the median price to represent an average).
By the way, the code below requires the following imports:
```python
import numpy as np
from lets_plot import *

LetsPlot.setup_html()
```
🐍 Code to compare prices per category
If you are reading this on Moodle and for some random reason, copying the code below is not working, visit the public website version of this page and copy the code from there.
```python
plot_df = (
    df.groupby('category')['item-price'].describe()
    .reset_index()
    .rename(columns={'25%': 'Q1', '50%': 'median', '75%': 'Q3'})
    .sort_values(by='median')
)

# plot_df.head() to see what it looks like

# This configures what shows up when you hover your mouse over the plot.
tooltip_setup = (
    layer_tooltips()
    .line('@category')
    .line('[@Q1 -- @median -- @Q3]')
    .format('@Q1', '£ {.2f}')
    .format('@median', '£ {.2f}')
    .format('@Q3', '£ {.2f}')
)

g = (
    # Maps the columns to the aesthetics of the plot.
    ggplot(plot_df, aes(y='category', x='median', xmin='Q1', xmax='Q3', fill='category')) +

    # GEOMS
    # Add a line range that 'listens to' the columns informed in the `xmin` and `xmax` aesthetics
    geom_linerange(size=1, alpha=0.75, tooltips=tooltip_setup) +

    # Add points to the plot (listen to the `x`, `y` and `fill` aesthetics)
    geom_point(size=3, stroke=1, shape=21, tooltips=tooltip_setup) +

    # SCALES
    # Remove the legend (we can already read the categories from the y-axis)
    scale_fill_discrete(guide='none') +

    # Specify names for the axes
    scale_y_continuous(name="Categories\n(from largest to smallest median)", expand=[0.05, 0.05]) +
    scale_x_continuous(name="Price (£)", expand=[0., 0.05], format='£ {.2f}',
                       breaks=np.arange(0, 20, 2.5)) +

    # LABELS
    # It's nice when the plot tells you the key takeaways
    labs(title='"Beer, Wine & Spirits" has the highest median price for individual items',
         subtitle="Dots represent the median price, bars represent the 25th and 75th percentiles") +
    theme(axis_text_x=element_text(size=15),
          axis_text_y=element_text(size=17),
          axis_title_x=element_text(size=20),
          axis_title_y=element_text(size=20),
          plot_title=element_text(size=19, face='bold'),
          plot_subtitle=element_text(size=18),
          legend_position='none') +
    ggsize(1000, 500)
)

g
```