๐Ÿ’ป Lab 01 โ€“ Meet the Data Frame

Lab roadmap (90 min)

Author
Published

08 July 2024

๐Ÿ—’๏ธ The Brief

Today, we will learn how to load, manipulate, and visualize tabular data 1 and compare the R vs. Python ways of doing so.

๐Ÿฅ… Learning Objectives

  • Practice setting up your IDE for a productive workflow.
  • View CSV data in its plain text form.
  • Load and view tabular CSV data as a data frame in R and Python.
  • Select and filter the data frame in R and Python.
  • Rank the data frame by column values in R and Python.

๐Ÿ“‹ Lab Tasks

Letโ€™s get started!

Part I: Introductions & Set up (20 min)

๐Ÿง‘๐Ÿปโ€๐Ÿซ TEACHING MOMENT: Your chance to get to know your instructor and classmates.

๐Ÿ’ก TIP: Whenever you see a ๐Ÿง‘๐Ÿปโ€๐Ÿซ TEACHING MOMENT in the lab roadmap, it means your instructor will be talking to you, and they deserve your full attention. Please do not work on the lab exercises during these moments. You will have plenty of time to do so later on.

  1. Meet your class teacher! Say hello and introduce yourself.

  2. ๐Ÿ‘ฅ Pair up! Ideally, you should work in pairs during this lab. If you were an R person in the morning, pair up with a Python person now, and vice versa. If you are in an odd-numbered group, you can work in a group of three.

    • Groups of 4 are also fine. Just make sure you are mixing R and Python people.

๐ŸŽฏ ACTION POINTS:

Time for some action:

  1. Follow the instructions on the Section 4 of the ๐Ÿ“‹ Getting Ready page.

    ๐Ÿ“ NOTE: If you couldnโ€™t install everything on your laptop in the morning, use the PC provided in the classroom. The computers already have all the Python/R tools you need.

  2. Contact your class teacher if you have any issues.

Part II: View the data (20 min)

We will give you some data to demonstrate how to load and view it in R and Python.

Try to figure out what the data is about and where it comes from! ๐Ÿ”ฎ

๐ŸŽฏ ACTION POINTS:

  1. Download the data file below.

    ๐Ÿ‘‰ We will learn more about CSV files and other file formats tomorrow.

  2. IMPORTANT: Save it to the ME204/data folder on your computer.

  3. (Optional) If you are familiar with Microsoft Excel, open the file to see what it looks like. ๐Ÿ‘‰ If prompted by Excel if you want to modify the file, choose 'No'!

  4. View the data in its raw form.

    VSCode logo (VS Code)

    1. On the left sidebar, click on the Explorer icon and navigate to the ME204/data folder.

    2. Click on the bakery.csv file.

    3. You should see the contents of the file in the editor.

    ๐Ÿ’ก TIP: Install the Rainbow CSV extension to make it easier to read CSV files.

    RStudio logo (RStudio)

    1. On the right sidebar, click on the Files tab and navigate to the ME204/data folder.

    2. Click on the bakery.csv file and select the View File option.

  5. Discuss with your partner what you see in the data file.

  6. Load the data into a data frame 2 in R and Python.

    Python logo Python

    1. Open the Terminal on VS Code (Terminal > New Terminal).

    2. Type ipython (or python if that does not work) and press Enter to open the Python interpreter.

    3. Type the following code to load the data as a data frame:

      import pandas as pd
      df = pd.read_csv('data/bakery.csv')
    4. Type df and press Enter to view the data frame.

    R logo R

    1. Open the R Console in RStudio.

    2. Type the following code to load the data as a data frame:

      library(tidyverse)
      df <- read_csv('data/bakery.csv')
    3. Type df and press Enter to view the data frame.

  7. Discuss with your partner how the data is displayed differently in R and Python and respective IDEs.

๐Ÿง‘๐Ÿปโ€๐Ÿซ TEACHING MOMENT:

Your class teacher will answer any questions you have about the data and the code you just ran and mediate the following ๐Ÿ—ฃ๏ธ CLASSROOM DISCUSSION:

  • The data is the same but the level of information displayed is different from VS Code (Python) and RStudio (R). Do you see the differences?

Part III: Select and filter the data (40 min)

Now that you have loaded the data, letโ€™s practice selecting and filtering the data frame. As you go through the exercises, try to stop every now and then to discuss with your partner how the code and results differ between R and Python.

๐ŸŽฏ ACTION POINTS:

  1. Inspect: how many rows and columns are in the data frame?

    Python logo Python

    In Pandas, the shape attribute gives you the dimensions of the data frame:

    df.shape

    Windows logo R

    In R, you can use the dim function:

    dim(df)
  2. List all the columns present in the data frame.

    Python logo Python

    df.columns

    Windows logo R

    colnames(df)

    The precise format will depend on the language you are using, but you should see a list like this:

    [
        'data-product-id', 'data-product-name', 'data-product-type',
        'data-product-on-offer', 'data-product-index', 'image-url',
        'product-page', 'product-name', 'product-size', 'item-price',
        'price-per-unit', 'offer-description', 'category'
    ]

    The above should now give you a better idea of what the data is about.

  3. Select just a subset of columns to view.

    Python logo Python

    Using Pandas:

    df[['product-name', 'item-price', 'category']]

    Windows logo R

    With the tidyverse:

    df %>% select(`product-name`, `item-price`, `category`)

    We are not using base R in this course, but if we were, we would pass the column names like this:

    df[, c(`product-name`, `item-price`, `category`)]

    The c( ) function is Rโ€™s way of creating a vector of column names.

  4. Select the first 5 rows of the data frame.

    Python logo Python

    Use the pandas head method:

    df.head(5)

    You can compound this with the previous step to select columns at the same time:

    df[['product-name', 'item-price', 'category']].head(5)

    Or

    df.head(5)[['product-name', 'item-price', 'category']]

    Windows logo R

    Using the tidyverse:

    df %>% slice_head(n = 5)

    You can compound this with the previous step to select columns at the same time:

    df %>% select(`product-name`, `item-price`, `category`) %>% slice_head(n = 5)

    Or

    df %>% slice_head(n = 5) %>% select(`product-name`, `item-price`, `category`)
  5. View the full record of the first row:

    Python logo Python

    The iloc method, which stands for โ€œinteger location,โ€ can be used to select rows by their position in the data frame. Python uses 0-based indexing, so the first row is at position 0:

    df.iloc[0]

    Windows logo R

    The slice function can be used to select rows by their position in the data frame. R uses 1-based indexing, so the first row is at position 1:

    df %>% slice(1)
  6. Filter the data frame to just the rows where the product has a particular size:

    ๐Ÿ‘‰ Here the code starts to get a bit more complex, as filtering typically involves multiple steps. In a single line of code we typically: identify a column to filter, check the content of all items in that column to establish if it contains a desired value, and then return only the rows that meet the criteria.

    Python logo Python

    There are multiple ways to filter data in pandas.

    OPTION A: One way is to use the query method:

    df.query('`product-size` == "800g"')

    To get a better view of the data, you can filter and select columns at the same time:

    df.query('`product-size` == "800g"')[['product-name', 'item-price', 'category']]

    OPTION B: Another way to filter is to first get a list of True/False values for each row based on the condition, and then use that list to filter the data frame:

    # Run this line to see the list of True/False values
    df['product-size'] == '800g'

    Now, to use this list to filter the data frame, pass it inside the square brackets:

    df[df['product-size'] == '800g'][['product-name', 'item-price', 'category']]

    OPTION C: You can also use the loc method to filter the data frame:

    df.loc[df['product-size'] == '800g', ['product-name', 'item-price', 'category']]

    This method is similar to the iloc method, but it allows you to filter rows based on a condition (not just by position), and select columns at the same time.

    Windows logo R

    There are multiple ways to filter data in R.

    OPTION A (DISCOURAGED): If we were to use base R, we could use the subset function:

    subset(df, `product-size` == "800g")

    To get a better view of the data, you can subset and select columns at the same time:

    subset(df, `product-size` == "800g", select = c(`product-name`, `item-price`, `category`))

    The c( ) function is Rโ€™s way of creating a vector.

    OPTION B: A better way to filter data in R is to use the dplyr package, part of the tidyverse. The filter function is used to filter rows based on a condition:

    df %>% filter(`product-size` == "800g")

    You can add another pipe (%>%) to the same line of code to select columns at the same time:

    df %>% filter(`product-size` == "800g") %>% select(`product-name`, `item-price`, `category`)
  7. Filter based on the presence of a certain text in the product name:

    Python logo Python

    We can use the str.contains method to identify rows of a column that contain a certain string:

    df[df['product-name'].str.contains('Bread')][['product-name', 'item-price', 'category']]

    Windows logo R

    In tidyverse, we have to use an additional function inside the filter function to filter rows based on the presence of a certain text in a column. The str_detect function from the stringr package is used to identify rows of a column that contain a certain string:

    df %>% 
        filter(str_detect(`product-name`, 'Bread')) %>% 
        select(`product-name`, `item-price`, `category`)

Wrap-up & Next Steps (10 min)

Even if you didnโ€™t finish all the exercises, take some time to share your experiences with the class.

๐Ÿ—ฃ๏ธ CLASSROOM DISCUSSION:

Your class teacher will lead a discussion on the following topics:

  • Did you find R or Python easier to work with when manipulating tabular data? Or it was the same?

  • What are the other things you could do with this simple data frame? (Even if you donโ€™t quite know how to do them yet.)


๐ŸŽ Bonus tasks

Here are some additional tasks to try. Those are a bit more challenging because they involve combining multiple steps and searching online to understand a bit more of the syntax of the language you are using.

  1. Mini-challenge: Filter the data frame to show all bread products that are of size โ€˜800gโ€™.

    ๐Ÿ‘‰ Donโ€™t worry if you canโ€™t figure out this one yet. This will become a much easier task by the end of the week.

  2. Challenge: Rank all sliced breads, of any size, based on their price per gram, from the most expensive to the cheapest.

    ๐Ÿ‘‰ Donโ€™t worry if you canโ€™t figure out this one yet. This will become a much easier task by the end of the week.

Footnotes

  1. Tabular data is data that is organized in rows and columns (similar but not quite the same as a spreadsheet), where:

    • Each row represents a single observation or record.
    • Each column represents a variable or attribute of the observation.
    Name Age
    Alice 25
    Bob 30
    Charlie 22

    In this example, the unit of analysis (the thing that each row represents) is a person, and the variables or attributes (columns) are Name and Age.โ†ฉ๏ธŽ

  2. A data frame is the standard way to view and work with tabular data in data science. It is a two-dimensional data structure where each column can be of a different data type (e.g., numeric, character, factor). Data frames are used in R (with base R and tidyverse) and Python (with the pandas library).โ†ฉ๏ธŽ