π» Lab 02 β Tidying up tabular data
Week 01 β Day 02 - Lab Roadmap (90 min)
ποΈ The Brief
We will revisit the data frame we used yesterday and try to clean it up and modify it a bit. We will also learn how to run the scripts as opposed to typing the code in the console.
π Lab Tasks
Feel free to get started on the following tasks as soon as you enter the lab.
Part I: Setting Up (15 min)
π― ACTION POINTS:
- Create a new script file for todayβs lab under the
code
folder.
(VS Code)
Open VS Code and ensure you are on the correct Working Directory
Create a new Python file by clicking on the
File
>New File
menu option (see image below).Add the following code snippet to the new file:
import pandas as pd = pd.read_csv('./data/bakery.csv') df = ['data-product-id', 'data-product-name', 'data-product-on-offer', 'item-price'] selected_columns print(df[selected_columns])
π NOTE: The
./
represents the current folder (directory) from where the script is being run. This corresponds to your working directory.Save the file by clicking on
File
>Save
option or by pressingCtrl + S
on your keyboard.VERY IMPORTANT: Save the file to the
ME204/code
folder with the following nameweek01-lab02.py
. The image below shows the example for Windows, but the process is similar for Mac.
(RStudio)
Open RStudio and ensure you are on the correct Working Directory
Create a new R script file by clicking on the
File
>New File
>R Script
menu option (see image below).Add the following code snippet to the new file:
library(tidyverse) <- read_csv('./data/bakery.csv') df <- c('data-product-id', 'data-product-name', 'data-product-on-offer', 'item-price') selected_columns print(df[selected_columns])
π NOTE: The
./
represents the current folder (directory) from where the script is being run. This corresponds to your working directory.Save the file by clicking on
File
>Save
option or by pressingCtrl + S
on your keyboard.VERY IMPORTANT: Save the file to the
ME204/code
folder with the following nameweek01-lab02.R
. The image below shows the example for Windows, but the process is similar for Mac.
- π§βπ« TEACHING MOMENT: Your class teacher will demonstrate how to run your script and how to see the output of your code, and help you solve any issues you might have.
Part II: Changing Column Names and Dropping Duplicate Rows (35 min)
Add code to your existing script that adresses the following tasks.
π NOTE: Remember to always save the script before running it, either by pressing Ctrl + S
or clicking on File
> Save
.
π― ACTION POINTS:
- Take a look at the columns
data-product-name
andproduct-name
. Write code that establishes, without a doubt, whether these two columns are identical or not. If they are identical, drop one of the columns. If they are not, how would you modify the column names to make them consistent?
π‘ TIPS
Check the Pandas documentation for the
equals
function in Python and the dplyr documentation for theidentical
function in R.You can use the
drop
function in Python and theselect
function in R to drop columns.
β Click to view the solution
In Python, you can use the equals
function from the pandas
library to compare two columns:
print(f"Are the columns identical? {df['product-name'].equals(df['data-product-name'])}")
if df['product-name'].equals(df['data-product-name']):
print("The columns are identical. Dropping the 'data-product-name' column as it is redundant.")
= df.drop(columns=['data-product-name'])
df else:
print("The columns are not identical. Consider renaming the columns.")
In R, you can use the identical
function to compare two columns:
if (identical(df$`product-name`, df$`data-product-name`)) {
print("The columns are identical. Dropping the 'data-product-name' column as it is redundant.")
<- df %>% select(-`data-product-name`)
df else{
}print("The columns are not identical. Consider renaming the columns.")
}
- Change the column names to more concise and meaningful names. For example,
data-product-id
can be renamed to the much simplerid
,product-name
toname
, and so on. Feel free to make your own choices, just be consistent.
π‘ TIPS
In Python, you can use the rename
function from the pandas
library:
= df.rename(columns={'old-column-name': 'new-column-name'}) df
In R, you can use the rename
function from the dplyr
library (part of `tidyverse):
<- df %>% rename(new_column_name = old_column_name) df
- What are the
id
s that appear more than once in the data frame? Print out the values of theseid
s.
π HINT: You can use the duplicated
function from the pandas
library in Python and the duplicated
function from the dplyr
library in R.
β Click to view the solution
In Python, you can use the duplicated
function from the pandas
library:
= df['id'].duplicated()
duplicated_rows print(df[duplicated_rows]['id'].values) # It also works without the `values` part
In R, you can use the duplicated
function from the dplyr
library (part of `tidyverse):
<- df %>% select(id) %>% duplicated()
duplicated_rows print(df %>% filter(duplicated_rows) %>% select(id))
- Challenge: Filter the data frame to show only the rows with the
id
s identified above. Sort the data frame by theid
column and conclude whether the rows which share the sameid
are identical or not.
π HINT: You can use the sort_values
function from the pandas
library in Python and the arrange
function from the dplyr
library in R.
- If the rows are indeed identical, drop the duplicates from the data frame. If they are not, how would you modify the
id
s to make them consistent?
π‘ TIPS
In Python, you can use the drop_duplicates
function from the pandas
library:
= df.drop_duplicates(subset='id') df
In R, you can use the distinct
function from the dplyr
library (part of `tidyverse):
# .keep_all = TRUE ensures that all columns are kept
<- df %>% distinct(id, .keep_all = TRUE) df
- Check the dimensions of the data frame again and compare them with the original dimensions. How many rows were dropped?
β Click to view the solution
In Python:
print(df.shape)
In R:
print(dim(df))
Part III: Cleaning Data Types (40 min)
- Print out the data types of all the columns in the data frame.
Python
print(df.info())
R
In tidyverse:
print(glimpse(df))
Use what you have learned in the morning to modify the data types of the
id
column so it is represented as an integer.Remove all the
Β£
symbols from theprice
column.
π‘ TIPS
In Python, you can use the str.replace
function to remove the Β£
symbol:
'price'] = df['price'].str.replace('Β£', '') df[
In R, you can use the str_replace
function from the stringr
package (contained in tidyverse
) to remove the Β£
symbol:
<- df %>% mutate(price = str_replace(price, 'Β£', '')) df
- Challenge: Convert the
price
column to a numeric data type. If there are any errors, figure out which rows are causing the issue and try to fix them.
The reason this is challenge is because you might need to use conditional statements to check for the presence of Β£
or something else in the price
column.
π‘ TIPS
In Python, you can use if-else
statements inside an apply
function to convert the column to numeric.
In R, you can use the case_when
function from the dplyr
library to convert the column to numeric.
π§βπ« TEACHING MOMENT
When it gets closer to the end of the lab, your class teacher will show you how to solve the challenge task and discuss any issues you might have faced.
π Bonus tasks
If you have completed all the tasks above, go back to the π Bonus Tasks of yesterdayβs lab and see if you can solve them now. If you had already solved them, can you improve your code to use the new techniques you have learned today?