๐ป Lab 01 โ Meet the Data Frame
Lab roadmap (90 min)
๐๏ธ The Brief
Today, we will learn how to load, manipulate, and visualize tabular data 1 and compare the R vs. Python ways of doing so.
๐ฅ Learning Objectives
- Practice setting up your IDE for a productive workflow.
- View CSV data in its plain text form.
- Load and view tabular CSV data as a data frame in R and Python.
- Select and filter the data frame in R and Python.
- Rank the data frame by column values in R and Python.
๐ Lab Tasks
Letโs get started!
Part I: Introductions & Set up (20 min)
๐ง๐ปโ๐ซ TEACHING MOMENT: Your chance to get to know your instructor and classmates.
๐ก TIP: Whenever you see a ๐ง๐ปโ๐ซ TEACHING MOMENT in the lab roadmap, it means your instructor will be talking to you, and they deserve your full attention. Please do not work on the lab exercises during these moments. You will have plenty of time to do so later on.
Meet your class teacher! Say hello and introduce yourself.
๐ฅ Pair up! Ideally, you should work in pairs during this lab. If you were an R person in the morning, pair up with a Python person now, and vice versa. If you are in an odd-numbered group, you can work in a group of three.
- Groups of 4 are also fine. Just make sure you are mixing R and Python people.
๐ฏ ACTION POINTS:
Time for some action:
Follow the instructions on the Section 4 of the ๐ Getting Ready page.
๐ NOTE: If you couldnโt install everything on your laptop in the morning, use the PC provided in the classroom. The computers already have all the Python/R tools you need.
Contact your class teacher if you have any issues.
Part II: View the data (20 min)
We will give you some data to demonstrate how to load and view it in R and Python.
Try to figure out what the data is about and where it comes from! ๐ฎ
๐ฏ ACTION POINTS:
Download the data file below.
๐ We will learn more about CSV files and other file formats tomorrow.
IMPORTANT: Save it to the
ME204/data
folder on your computer.(Optional) If you are familiar with Microsoft Excel, open the file to see what it looks like. ๐ If prompted by Excel if you want to modify the file, choose 'No'!
View the data in its raw form.
(VS Code)
On the left sidebar, click on the
Explorer
icon and navigate to theME204/data
folder.Click on the
bakery.csv
file.You should see the contents of the file in the editor.
๐ก TIP: Install the Rainbow CSV extension to make it easier to read CSV files.
(RStudio)
On the right sidebar, click on the
Files
tab and navigate to theME204/data
folder.Click on the
bakery.csv
file and select theView File
option.
Discuss with your partner what you see in the data file.
Load the data into a data frame 2 in R and Python.
Python
Open the Terminal on VS Code (
Terminal
>New Terminal
).Type
ipython
(orpython
if that does not work) and pressEnter
to open the Python interpreter.Type the following code to load the data as a data frame:
import pandas as pd = pd.read_csv('data/bakery.csv') df
Type
df
and pressEnter
to view the data frame.
R
Open the R Console in RStudio.
Type the following code to load the data as a data frame:
library(tidyverse) <- read_csv('data/bakery.csv') df
Type
df
and pressEnter
to view the data frame.
Discuss with your partner how the data is displayed differently in R and Python and respective IDEs.
๐ง๐ปโ๐ซ TEACHING MOMENT:
Your class teacher will answer any questions you have about the data and the code you just ran and mediate the following ๐ฃ๏ธ CLASSROOM DISCUSSION:
- The data is the same but the level of information displayed is different from VS Code (Python) and RStudio (R). Do you see the differences?
Part III: Select and filter the data (40 min)
Now that you have loaded the data, letโs practice selecting and filtering the data frame. As you go through the exercises, try to stop every now and then to discuss with your partner how the code and results differ between R and Python.
๐ฏ ACTION POINTS:
Inspect: how many rows and columns are in the data frame?
Python
In Pandas, the
shape
attribute gives you the dimensions of the data frame:df.shape
R
In R, you can use the
dim
function:dim(df)
List all the columns present in the data frame.
Python
df.columns
R
colnames(df)
The precise format will depend on the language you are using, but you should see a list like this:
['data-product-id', 'data-product-name', 'data-product-type', 'data-product-on-offer', 'data-product-index', 'image-url', 'product-page', 'product-name', 'product-size', 'item-price', 'price-per-unit', 'offer-description', 'category' ]
The above should now give you a better idea of what the data is about.
Select just a subset of columns to view.
Python
Using Pandas:
'product-name', 'item-price', 'category']] df[[
R
With the
tidyverse
:%>% select(`product-name`, `item-price`, `category`) df
We are not using base R in this course, but if we were, we would pass the column names like this:
c(`product-name`, `item-price`, `category`)] df[,
The
c( )
function is Rโs way of creating a vector of column names.Select the first 5 rows of the data frame.
Python
Use the pandas
head
method:5) df.head(
You can compound this with the previous step to select columns at the same time:
'product-name', 'item-price', 'category']].head(5) df[[
Or
5)[['product-name', 'item-price', 'category']] df.head(
R
Using the
tidyverse
:%>% slice_head(n = 5) df
You can compound this with the previous step to select columns at the same time:
%>% select(`product-name`, `item-price`, `category`) %>% slice_head(n = 5) df
Or
%>% slice_head(n = 5) %>% select(`product-name`, `item-price`, `category`) df
View the full record of the first row:
Python
The
iloc
method, which stands for โinteger location,โ can be used to select rows by their position in the data frame. Python uses 0-based indexing, so the first row is at position 0:0] df.iloc[
R
The
slice
function can be used to select rows by their position in the data frame. R uses 1-based indexing, so the first row is at position 1:%>% slice(1) df
Filter the data frame to just the rows where the product has a particular size:
๐ Here the code starts to get a bit more complex, as filtering typically involves multiple steps. In a single line of code we typically: identify a column to filter, check the content of all items in that column to establish if it contains a desired value, and then return only the rows that meet the criteria.
Python
There are multiple ways to filter data in pandas.
OPTION A: One way is to use the
query
method:'`product-size` == "800g"') df.query(
To get a better view of the data, you can filter and select columns at the same time:
'`product-size` == "800g"')[['product-name', 'item-price', 'category']] df.query(
OPTION B: Another way to filter is to first get a list of True/False values for each row based on the condition, and then use that list to filter the data frame:
# Run this line to see the list of True/False values 'product-size'] == '800g' df[
Now, to use this list to filter the data frame, pass it inside the square brackets:
'product-size'] == '800g'][['product-name', 'item-price', 'category']] df[df[
OPTION C: You can also use the
loc
method to filter the data frame:'product-size'] == '800g', ['product-name', 'item-price', 'category']] df.loc[df[
This method is similar to the
iloc
method, but it allows you to filter rows based on a condition (not just by position), and select columns at the same time.R
There are multiple ways to filter data in R.
OPTION A (DISCOURAGED): If we were to use base R, we could use the
subset
function:subset(df, `product-size` == "800g")
To get a better view of the data, you can subset and select columns at the same time:
subset(df, `product-size` == "800g", select = c(`product-name`, `item-price`, `category`))
The
c( )
function is Rโs way of creating a vector.OPTION B: A better way to filter data in R is to use the
dplyr
package, part of thetidyverse
. Thefilter
function is used to filter rows based on a condition:%>% filter(`product-size` == "800g") df
You can add another pipe (
%>%
) to the same line of code to select columns at the same time:%>% filter(`product-size` == "800g") %>% select(`product-name`, `item-price`, `category`) df
Filter based on the presence of a certain text in the product name:
Python
We can use the
str.contains
method to identify rows of a column that contain a certain string:'product-name'].str.contains('Bread')][['product-name', 'item-price', 'category']] df[df[
R
In
tidyverse
, we have to use an additional function inside thefilter
function to filter rows based on the presence of a certain text in a column. Thestr_detect
function from thestringr
package is used to identify rows of a column that contain a certain string:%>% df filter(str_detect(`product-name`, 'Bread')) %>% select(`product-name`, `item-price`, `category`)
Wrap-up & Next Steps (10 min)
Even if you didnโt finish all the exercises, take some time to share your experiences with the class.
๐ฃ๏ธ CLASSROOM DISCUSSION:
Your class teacher will lead a discussion on the following topics:
Did you find R or Python easier to work with when manipulating tabular data? Or it was the same?
What are the other things you could do with this simple data frame? (Even if you donโt quite know how to do them yet.)
๐ Bonus tasks
Here are some additional tasks to try. Those are a bit more challenging because they involve combining multiple steps and searching online to understand a bit more of the syntax of the language you are using.
Mini-challenge: Filter the data frame to show all bread products that are of size โ800gโ.
๐ Donโt worry if you canโt figure out this one yet. This will become a much easier task by the end of the week.
Challenge: Rank all sliced breads, of any size, based on their price per gram, from the most expensive to the cheapest.
๐ Donโt worry if you canโt figure out this one yet. This will become a much easier task by the end of the week.
Footnotes
Tabular data is data that is organized in rows and columns (similar but not quite the same as a spreadsheet), where:
- Each row represents a single observation or record.
- Each column represents a variable or attribute of the observation.
Name Age Alice 25 Bob 30 Charlie 22 In this example, the unit of analysis (the thing that each row represents) is a person, and the variables or attributes (columns) are
Name
andAge
.โฉ๏ธA data frame is the standard way to view and work with tabular data in data science. It is a two-dimensional data structure where each column can be of a different data type (e.g., numeric, character, factor). Data frames are used in R (with base R and
tidyverse
) and Python (with thepandas
library).โฉ๏ธ