💻 Lab 01 – Recap of base R and tidyverse fundamentals

Lab roadmap (90 min)

Published: 10 July 2023

📋 LAB DIFFICULTY: 😁 EASY (assumes just basic experience with R)

🥅 Learning Objectives

  • Refresh your R skills
  • Compare and contrast base R and tidyverse solutions to the same problem
  • Practice loading, manipulating, and visualizing data using R and tidyverse

📋 Lab Tasks

In our first lab, you will work through practical exercises on loading, manipulating, and visualizing data using R and tidyverse. We’ll dive into a real dataset called Tesco Grocery 1.0, curated by researchers from Nokia Bell Labs, King’s College London, University of Turin, and Tesco Labs (Aiello et al. 2020).

You can find a detailed description of this dataset in the 📖 Data Dictionary: Tesco Grocery 1.0 webpage.

Now, let’s get started!

Part 1: ⚙️ Setup (15 minutes)

🎯 ACTION POINT

(Whenever you come across this text ‘🎯 ACTION POINT’, it means you have a set of tasks to complete)

Before you dive into coding, take a moment to complete the following steps:

  1. Give a warm hello to your instructor! And don’t forget to high-five the two classmates sitting closest to you! 🙌

    • Yes, we’re serious about this one!
  2. Ensure that R is installed on your computer.

  3. If you haven’t already, we also suggest using an integrated development environment (IDE) like RStudio, which is available for free download.

  4. Open RStudio and create two new R scripts.

    • Save the first script as lab01.R
    • Save the second script as lab01-tidyverse.R

    Start writing your code in the first script. Then, after you have completed the base R exercises, copy and paste your code into the second script and modify it to use tidyverse functions instead.

  5. Head to the dataset page and download the file named Dec_lsoa_grocery.csv.

  6. Save the file in the same folder as your R script.
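
Once the CSV sits next to your script, loading it takes one call in either dialect. The sketch below is only an illustration: so that it runs anywhere, it first writes a tiny stand-in CSV with made-up values and hypothetical column names to a temporary file. In the lab you would pass "Dec_lsoa_grocery.csv" as the path instead. The readr call is skipped if that package is not installed.

```r
# Stand-in for the real file: two made-up rows, hypothetical column names.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("area_id,population,sugar",
             "AREA_1,1500,10.5",
             "AREA_2,1800,9.7"),
           csv_path)

# Base R: read.csv() returns a plain data.frame.
df <- read.csv(csv_path)
str(df)

# tidyverse: readr::read_csv() returns a tibble (run only if readr is installed).
if (requireNamespace("readr", quietly = TRUE)) {
  df_tidy <- readr::read_csv(csv_path)
  print(df_tidy)
}
```

If R cannot find the real file, check getwd(): relative paths are resolved against the working directory, which is not necessarily the folder that holds your script.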

Part 2: Let’s view our data! (20 minutes)

👩🏻‍🏫 TEACHING MOMENT

(Whenever you come across this text ‘👩🏻‍🏫 TEACHING MOMENT’, it means your instructor deserves your full attention)

  • Your instructor will load the dataset into R and name it df. She will run View(df) so you can all explore the dataset’s structure and variables together.

  • Your instructor will filter df to show only the row(s) corresponding to the region of London we are currently in. The LSOA code for the Aldwych area surrounding LSE is E01004735.

    • She will show the base R and the tidyverse ways of doing this.
  • Your instructor will open the Open Geography portal, made available by the Office for National Statistics of the UK, to show you how you can highlight a region on the map by its LSOA code. Keep a tab open on this page, as we will use it later in the lab.

  • πŸ—£οΈ CLASSROOM-WIDE DISCUSSION: Why do you think the authors gave us the dataset in this format instead of, say, simply a list of all the products purchased by customers in that area?
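
The row filter described in this part can be sketched as follows. This is a hedged illustration on a toy data frame, not necessarily the code shown in class: only the column name area_id and the code E01004735 come from this lab, while the other codes and all the numbers are placeholders.

```r
# Toy stand-in for df; the placeholder codes and all values are made up.
df <- data.frame(
  area_id    = c("E01004735", "PLACEHOLDER_1", "PLACEHOLDER_2"),
  population = c(1700, 1500, 1600)
)

# Base R: logical indexing on rows; the empty slot after the comma keeps all columns.
lse_base <- df[df$area_id == "E01004735", ]

# Base R alternative: subset() evaluates the condition inside the data frame.
lse_subset <- subset(df, area_id == "E01004735")

print(lse_base)

# tidyverse: dplyr::filter() (run only if dplyr is installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  lse_tidy <- dplyr::filter(df, area_id == "E01004735")
  print(lse_tidy)
}
```

All three calls return the same single row. The tidyverse version avoids repeating df$ inside the condition.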

Part 3: Now you are the data analyst! (55 minutes)

Try to complete the action points below using base R first (type your solutions in lab01.R). After you’ve finished, convert your results to tidyverse (type them in lab01-tidyverse.R). If you get stuck, ask your instructor for help.

Feel free to 👥 pair up with a classmate to work on the exercises together.

🎯 ACTION POINT

  1. Subset the dataset to keep only the following columns:

    • The identifier column (area_id)
    • Columns with demographic data (population, age, area, etc.)
    • Columns that represent the average consumption of nutrients (check data dictionary for examples) across all LSOA regions – ignore the columns with suffixes.
  2. Identify the top three regions with the highest average alcohol consumption and print them out. Also, determine the three regions with the lowest average alcohol consumption. Repeat the process for sugar consumption.

    • Can you also find out where these regions are located?
  3. Calculate the average and standard deviation of the population sizes across all LSOA regions. Save the results in a single data frame. Print out the data frame.

  4. Choose two nutrients (carbs, sugar, fat, saturated fat, protein, or fibre) and create a scatterplot to visualize their relationship. What observations can you make?

    • Please note that for the base R solution, you should not use the ggplot2 package. You should use the plot() function instead.
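
The four tasks above map onto a handful of common idioms. The sketch below runs them on a six-row toy data frame with invented values. The column names (avg_age, sugar_std, fat, and so on) are assumptions modelled on the data dictionary, so check them against the real file. Treat it as one possible approach rather than a model answer.

```r
# Toy stand-in for the Tesco data; every value here is invented.
df <- data.frame(
  area_id    = paste0("AREA_", 1:6),
  population = c(1500, 1800, 1650, 2100, 1400, 1750),
  avg_age    = c(34, 41, 37, 29, 45, 38),
  sugar      = c(10.2, 9.1, 11.4, 8.7, 10.9, 9.8),
  sugar_std  = c(1.1, 0.9, 1.3, 1.0, 1.2, 0.8),  # a "suffix" column to drop
  fat        = c(8.9, 9.4, 8.1, 9.9, 8.5, 9.0),
  alcohol    = c(0.30, 0.12, 0.25, 0.40, 0.18, 0.22)
)

# Task 1: keep columns by name.
keep <- c("area_id", "population", "avg_age", "sugar", "fat", "alcohol")
df_small <- df[, keep]

# Task 2: sort in descending order, then take rows from either end.
by_alcohol <- df_small[order(df_small$alcohol, decreasing = TRUE), ]
top3    <- head(by_alcohol, 3)  # three highest
bottom3 <- tail(by_alcohol, 3)  # three lowest (the lowest is the last row)

# Task 3: summary statistics stored in a single one-row data frame.
pop_summary <- data.frame(
  mean_population = mean(df_small$population),
  sd_population   = sd(df_small$population)
)
print(pop_summary)

# Task 4: base R scatterplot. When run as a script rather than interactively,
# pdf(NULL) opens a null device so that no Rplots.pdf file is written.
if (!interactive()) pdf(NULL)
plot(df_small$sugar, df_small$fat,
     xlab = "Sugar", ylab = "Fat",
     main = "Toy data: sugar vs fat")

# tidyverse equivalents of tasks 1-3 (run only if dplyr is installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  df_small_tidy <- df %>% select(all_of(keep))
  top3_tidy <- df_small_tidy %>% arrange(desc(alcohol)) %>% head(3)
  pop_summary_tidy <- df_small_tidy %>%
    summarise(mean_population = mean(population),
              sd_population   = sd(population))
  print(top3_tidy)
  print(pop_summary_tidy)
}
```

Sorting once and reading both ends of the result saves a second call to order(). For the sugar rankings, the same pattern applies with the sugar column as the sort key.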

πŸ‘©πŸ»β€πŸ« TEACHING MOMENT

Just before you wrap up, your instructor will assess everyone’s progress with base R and tidyverse. Make sure to jot down any areas that are still unclear to you after the lab, as you’ll have the opportunity to discuss them in tomorrow’s lecture.

Additionally, she might request you to complete a brief poll to gauge the ease with which you were able to generate the base R and tidyverse solutions.

References

Aiello, Luca Maria, Daniele Quercia, Rossano Schifanella, and Lucia Del Prete. 2020. “Tesco Grocery 1.0, a Large-Scale Dataset of Grocery Purchases in London.” Scientific Data 7 (1): 57. https://doi.org/10.1038/s41597-020-0397-7.