πŸ’» Lab 02 – Tidying up tabular data

Week 01 – Day 02 - Lab Roadmap (90 min)

Author
Published

09 July 2024

πŸ—’οΈ The Brief

We will revisit the data frame we used yesterday and try to clean it up and modify it a bit. We will also learn how to run the scripts as opposed to typing the code in the console.

πŸ“‹ Lab Tasks

Feel free to get started on the following tasks as soon as you enter the lab.

Part I: Setting Up (15 min)

🎯 ACTION POINTS:

  1. Create a new script file for today’s lab under the code folder.
VSCode logo (VS Code)

  • Open VS Code and ensure you are on the correct Working Directory

  • Create a new Python file by clicking on the File > New File menu option (see image below).

  • Add the following code snippet to the new file:

    import pandas as pd
    
    df = pd.read_csv('./data/bakery.csv')
    
    selected_columns = ['data-product-id', 'data-product-name', 'data-product-on-offer', 'item-price']
    
    print(df[selected_columns])

    πŸ“ NOTE: The ./ represents the current folder (directory) from where the script is being run. This corresponds to your working directory.

  • Save the file by clicking on File > Save option or by pressing Ctrl + S on your keyboard.

  • VERY IMPORTANT: Save the file to the ME204/code folder with the following name week01-lab02.py. The image below shows the example for Windows, but the process is similar for Mac.

RStudio logo (RStudio)

  • Open RStudio and ensure you are on the correct Working Directory

  • Create a new R script file by clicking on the File > New File > R Script menu option (see image below).

  • Add the following code snippet to the new file:

    library(tidyverse)
    
    df <- read_csv('./data/bakery.csv')
    
    selected_columns <- c('data-product-id', 'data-product-name', 'data-product-on-offer', 'item-price')
    
    print(df[selected_columns])

    πŸ“ NOTE: The ./ represents the current folder (directory) from where the script is being run. This corresponds to your working directory.

  • Save the file by clicking on File > Save option or by pressing Ctrl + S on your keyboard.

  • VERY IMPORTANT: Save the file to the ME204/code folder with the following name week01-lab02.R. The image below shows the example for Windows, but the process is similar for Mac.

  1. πŸ§‘β€πŸ« TEACHING MOMENT: Your class teacher will demonstrate how to run your script and how to see the output of your code, and help you solve any issues you might have.

Part II: Changing Column Names and Dropping Duplicate Rows (35 min)

Add code to your existing script that adresses the following tasks.

πŸ“ NOTE: Remember to always save the script before running it, either by pressing Ctrl + S or clicking on File > Save.

🎯 ACTION POINTS:

  1. Take a look at the columns data-product-name and product-name. Write code that establishes, without a doubt, whether these two columns are identical or not. If they are identical, drop one of the columns. If they are not, how would you modify the column names to make them consistent?
πŸ’‘ TIPS
  • Check the Pandas documentation for the equals function in Python and the dplyr documentation for the identical function in R.

  • You can use the drop function in Python and the select function in R to drop columns.

βœ… Click to view the solution

In Python, you can use the equals function from the pandas library to compare two columns:

print(f"Are the columns identical? {df['product-name'].equals(df['data-product-name'])}")

if df['product-name'].equals(df['data-product-name']):
    print("The columns are identical. Dropping the 'data-product-name' column as it is redundant.")
    df = df.drop(columns=['data-product-name'])
else:
    print("The columns are not identical. Consider renaming the columns.")

In R, you can use the identical function to compare two columns:

if (identical(df$`product-name`, df$`data-product-name`)) {
  print("The columns are identical. Dropping the 'data-product-name' column as it is redundant.")
  df <- df %>% select(-`data-product-name`)
}else{
  print("The columns are not identical. Consider renaming the columns.")
}
  1. Change the column names to more concise and meaningful names. For example, data-product-id can be renamed to the much simpler id, product-name to name, and so on. Feel free to make your own choices, just be consistent.
πŸ’‘ TIPS

In Python, you can use the rename function from the pandas library:

df = df.rename(columns={'old-column-name': 'new-column-name'})

In R, you can use the rename function from the dplyr library (part of `tidyverse):

df <- df %>% rename(new_column_name = old_column_name)
  1. What are the ids that appear more than once in the data frame? Print out the values of these ids.

πŸ‘‰ HINT: You can use the duplicated function from the pandas library in Python and the duplicated function from the dplyr library in R.

βœ… Click to view the solution

In Python, you can use the duplicated function from the pandas library:

duplicated_rows = df['id'].duplicated()
print(df[duplicated_rows]['id'].values) # It also works without the `values` part

In R, you can use the duplicated function from the dplyr library (part of `tidyverse):

duplicated_rows <- df %>% select(id) %>% duplicated()
print(df %>% filter(duplicated_rows) %>% select(id))
  1. Challenge: Filter the data frame to show only the rows with the ids identified above. Sort the data frame by the id column and conclude whether the rows which share the same id are identical or not.

πŸ‘‰ HINT: You can use the sort_values function from the pandas library in Python and the arrange function from the dplyr library in R.

  1. If the rows are indeed identical, drop the duplicates from the data frame. If they are not, how would you modify the ids to make them consistent?
πŸ’‘ TIPS

In Python, you can use the drop_duplicates function from the pandas library:

df = df.drop_duplicates(subset='id')

In R, you can use the distinct function from the dplyr library (part of `tidyverse):

# .keep_all = TRUE ensures that all columns are kept
df <- df %>% distinct(id, .keep_all = TRUE)
  1. Check the dimensions of the data frame again and compare them with the original dimensions. How many rows were dropped?
βœ… Click to view the solution

In Python:

print(df.shape)

In R:

print(dim(df))

Part III: Cleaning Data Types (40 min)

  1. Print out the data types of all the columns in the data frame.
Python logo Python

print(df.info())
R logo R

In tidyverse:

print(glimpse(df))
  1. Use what you have learned in the morning to modify the data types of the id column so it is represented as an integer.

  2. Remove all the Β£ symbols from the price column.

πŸ’‘ TIPS

In Python, you can use the str.replace function to remove the Β£ symbol:

df['price'] = df['price'].str.replace('Β£', '')

In R, you can use the str_replace function from the stringr package (contained in tidyverse) to remove the Β£ symbol:

df <- df %>% mutate(price = str_replace(price, 'Β£', ''))
  1. Challenge: Convert the price column to a numeric data type. If there are any errors, figure out which rows are causing the issue and try to fix them.

The reason this is challenge is because you might need to use conditional statements to check for the presence of Β£ or something else in the price column.

πŸ’‘ TIPS

In Python, you can use if-else statements inside an apply function to convert the column to numeric.

In R, you can use the case_when function from the dplyr library to convert the column to numeric.

πŸ§‘β€πŸ« TEACHING MOMENT

When it gets closer to the end of the lab, your class teacher will show you how to solve the challenge task and discuss any issues you might have faced.


🎁 Bonus tasks

If you have completed all the tasks above, go back to the 🎁 Bonus Tasks of yesterday’s lab and see if you can solve them now. If you had already solved them, can you improve your code to use the new techniques you have learned today?