🛣️ Week 01 Lab - Roadmap (90 min)

2024/25 Autumn Term

Welcome to the first DS202 lab!

The main goal of this lab is to review some fundamental R/tidyverse concepts that you’ll need throughout this course.

🥅 Learning Objectives

  • Configure your working environment for the course, including R, RStudio or VS Code, and Quarto documents.
  • Start using markdown to write your notes and code for lab activities.
  • Practice consulting and reading the documentation to understand how a function works.
  • Review R/tidyverse fundamental concepts such as vectors, for loops, pipes and functions.
About Using ChatGPT and Similar AI Tools

We’re not against you using AI assistants like ChatGPT in this course. However, for this particular lab, we recommend doing it on your own (searching online is fine, though). This task is your chance to see where you stand with your R programming skills. If you use an AI assistant now, you won’t get a clear picture of your abilities.

If you get stuck, consider asking your instructor or classmates for help. You can also consult the documentation for the functions and packages you’re using, which is a valuable skill in this field. Don’t worry, you’ll have plenty of chances to use AI help in later labs.

🤝 Part I: Introductions (15 min)

🧑🏻‍🏫 TEACHING MOMENT: Your chance to get to know your classmates and instructors.

⚙️ Part II: Setup (15 min)

You can go straight to the set of instructions below.

Start by going through the instructions from the 📚W01 Lab Preparation page.

  1. If you already have R and RStudio or VS Code installed, check that your software is up to date before proceeding. To check your R version, start an R session (i.e. in the R console in RStudio, or by typing R in a terminal such as the VS Code terminal, the Mac terminal or the Windows command line) and run the command sessionInfo(). The current R version is 4.3.2. If your version of R is older than 4.3.2, it might be worth updating your R installation: see how you can update your R here. Also check for RStudio or VS Code updates. For a guide on how to update R and RStudio, check this page; for instructions on how to check for (and run) VS Code updates, head over this way.
  2. Once you know your R and your chosen IDE (i.e RStudio or VSCode) are up to date, make sure you have fully completed the setup steps outlined in the 📚W01 Lab Preparation page before you head to the instructions from the section below.

We will be working with Quarto documents throughout this course. These are .qmd files that combine code, text, and output in a single document. Quarto documents are a great way to keep your code, notes, and output organised and reproducible. (Assignments will also require that you work with .qmd files.)

🎯 ACTION POINTS

Follow the steps below according to your preferred IDE.

Tip

In this course, you’re free to choose the IDE you want to work with. However, if you’re not that comfortable with R and/or Quarto yet, we’d recommend you stick with RStudio.

RStudio

Using RStudio:

  1. Open RStudio.

  2. Create a new project (File > New Project… > New Directory > New Project) and save it in an appropriate folder on your computer. Consider naming it DS202A.

  3. Click on the link below to download the .qmd file for this lab. Save it in the DS202A folder you created in step 2.

  4. Type your responses to the tasks in this lab directly in the .qmd file.
  • Don’t limit yourself to editing code chunks: feel free to add comments and notes in Markdown as well.
  • Try ‘rendering’ the .qmd file to HTML (click the ‘Render’ button in the toolbar at the top of the Source pane) to see what your code and notes look like in a nice report.
VS Code

Using VS Code:

  1. Create a new folder in an appropriate folder on your computer. Consider naming it DS202A.

  2. Open VS Code and open the folder you have just created (File > Open Folder). Let’s call this your ‘project’ folder.

  3. Click on the link below to download the .qmd file for this lab. Save it in the DS202A folder you created in step 1.

  4. Type your responses to the tasks in this lab directly in the .qmd file.
  • Don’t limit yourself to editing code chunks: feel free to add comments and notes in Markdown as well.
  • Try ‘rendering’ the .qmd file to HTML (click the ‘Render’ button in the top-right corner of the tab) to see what your code and notes look like in a nice report.
(Optional) How to create a .qmd file from scratch

We will begin by creating a .qmd file for this lab. You will write your solutions to the tasks below in this file.

🎯 ACTION POINTS

RStudio

Using RStudio:

  1. Open RStudio.

  2. Create a new project (File > New Project… > New Directory > New Project) and save it in an appropriate folder on your computer. Consider naming it DS202A.

  3. In this project, create a new Quarto document by opening the File menu in the top-left corner of the RStudio window, then selecting New File > Quarto Document… from the drop-down menu.

This will create an empty but pre-configured .qmd file in the Source pane of RStudio.

  4. Save the new file as LSE_DS202A_W01_lab.qmd somewhere inside the project folder you created in step 2.
  • You can choose to create sub-folders inside your project folder to keep your files organised. For example, you could create a folder called labs and save this file inside it.
  5. Keep adding your solutions to the tasks below to this file as you work through the lab. Don’t be limited by just code chunks, feel free to add comments and notes in Markdown language as well.
VS Code

Using VS Code:

  1. Create a new folder in an appropriate folder on your computer. Consider naming it DS202A.

  2. Open VS Code and open the folder you have just created (File > Open Folder). Let’s call this your ‘project’ folder.

  3. In this project, create a new Quarto document by clicking File > New File in the top-left corner of the VS Code window, then selecting Quarto Document from the drop-down menu.

This will create an empty but pre-configured .qmd file in VSCode.

  4. Save the new file as LSE_DS202A_W01_lab.qmd somewhere inside the project folder you created in step 2.
  • You can choose to create sub-folders inside your project folder to keep your files organised. For example, you could create a folder called labs and save this file inside it.
  5. Keep adding your solutions to the tasks below to this file as you work through the lab. Don’t be limited by just code chunks, feel free to add comments and notes in Markdown language as well.

🧑🏻‍🏫 TEACHING MOMENT: Your class teacher will give more in-depth explanations about Quarto files as well as how to run R code blocks from them (you can also read the explanations on this page if you want further details) and inform you of the next steps to take.

📖 Part III: Review of R/tidyverse fundamental concepts (55 min)

Finding the R documentation

Before asking your peers and/or class teacher for help, try looking up the R documentation to find out more about a function and how it should be used. For example, to find out more about the c() function used to create vectors (see the Vectors section below for details), type ?c or help(c) in the RStudio console or in the VS Code terminal (after typing R in the terminal to start a new R session). Alternatively, run the same commands through an R code block in Quarto.
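
As a small sketch of the lookup workflow described above, the lines below could be run in an R console. The choice of append() as the function to inspect is only an illustration, and the variable name append_args is hypothetical.

```r
# Open the help page for c(); typing ?c in the console is equivalent
help("c")

# List a function's argument names without opening its help page
append_args <- names(formals(append))
print(append_args)  # "x" "values" "after"
```

The help page remains the place to look for what each argument means; listing the argument names is just a quick reminder of what a function expects.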

Vectors

A vector is the most common and basic data structure in R. It is composed of a series of values of the same type, e.g. character, numeric¹, integer, logical (i.e. TRUE and FALSE), complex or raw. Have a look at this page for a more thorough description of the basic data types in R (with examples!).

You use the c() function² to create a vector. The values you combine within the vector must be comma-separated.

Here are examples of vectors:

x <- c(22L,30L,42L) #vector of integers
[1] 22 30 42
y <- c(6.02214,3.14159,6.674) #vector of numerics
[1] 6.02214 3.14159 6.67400
z <- c(FALSE, TRUE, FALSE) #vector of logicals
[1] FALSE  TRUE FALSE
s <- c("Darth Vader", "Luke Skywalker", "Han Solo") #vector of characters
[1] "Darth Vader"    "Luke Skywalker" "Han Solo" 
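
If you are unsure which type a vector holds, one possible check (a minimal sketch that re-creates the four vectors above so that it runs on its own) is to ask R directly with typeof() and class():

```r
x <- c(22L, 30L, 42L)
y <- c(6.02214, 3.14159, 6.674)
z <- c(FALSE, TRUE, FALSE)
s <- c("Darth Vader", "Luke Skywalker", "Han Solo")

typeof(x)  # "integer"
typeof(y)  # "double": how R stores numeric values internally
class(y)   # "numeric"
typeof(z)  # "logical"
typeof(s)  # "character"
```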

You can use vectors of integers or numerics in arithmetic operations or in computations, for example:

y <- sort(y) # operation to sort the content of y
[1] 3.14159 6.02214 6.67400
v <- x^2/2 - y*3 + 5 
[1] 237.5752 436.9336 866.9780
w <- mean(y) # computing the mean of y
[1] 5.279243
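
Arithmetic on vectors is applied element by element. The sketch below illustrates this with made-up data: unit_price, quantity, line_total and order_total are hypothetical names that are not used elsewhere in this lab.

```r
unit_price <- c(2.5, 4.0, 1.2)  # price of each item
quantity <- c(4, 1, 10)         # how many of each item were bought

# Multiplication pairs up the first elements, the second elements, and so on
line_total <- unit_price * quantity
print(line_total)   # 10 4 12

# Functions such as sum() then reduce a vector to a single value
order_total <- sum(line_total)
print(order_total)  # 26
```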

You can also create vectors that are sequences of numbers:

a <- c(1:10) #a sequence of numbers from 1 to 10 with an increment of 1
[1]  1  2  3  4  5  6  7  8  9 10
b <- seq(-3,3,by=0.5) #a sequence of numbers from -3 to 3 with an increment of 0.5
[1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0
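
Besides by, seq() also accepts a length.out argument when you know how many values you want rather than the step size. A brief sketch, where the names quarters and first_five are only illustrative:

```r
# Five evenly spaced values between 0 and 1 (the step is worked out for you)
quarters <- seq(0, 1, length.out = 5)
print(quarters)    # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) is a shortcut for the integers from 1 to n
first_five <- seq_len(5)
print(first_five)  # 1 2 3 4 5
```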

And you can use the rep() function to repeat sequence items (can you spot the difference between using the times or each keyword with rep in the example below?):

d <- rep(a, times=2)
[1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
e <- rep(a, each=2)
[1]  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10

You can access a vector item by referring to its index number inside brackets []: the first item has index 1, the second item has index 2, and so on. For example, if you want to access the second element of vector s above (i.e. the value "Luke Skywalker"), you would need this line of code:

s[2]

You can access multiple vector elements, thanks to the c() function: the line s[c(1,3)] lets you access the first and third values of the s vector, i.e the values "Darth Vader" and "Han Solo".

You can use negative indexes to access all vector elements except the one(s) specified by the index(es) e.g use the command s[-3] to access all vector s elements except the value at index 3, i.e to access the values at indexes 1 and 2 (values "Darth Vader" and "Luke Skywalker").
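
The three ways of indexing described above can be tried side by side. This is a minimal sketch that re-creates s so it can be run on its own; the names on the left-hand side are hypothetical.

```r
s <- c("Darth Vader", "Luke Skywalker", "Han Solo")

second <- s[2]                 # "Luke Skywalker"
first_and_third <- s[c(1, 3)]  # "Darth Vader" "Han Solo"
all_but_third <- s[-3]         # "Darth Vader" "Luke Skywalker"
only_third <- s[-c(1, 2)]      # "Han Solo": drops indexes 1 and 2
```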

You can also change the value of a particular vector element by referring to its index:

s[3] <- "Princess Leia" #changes the value of the third element of vector s from "Han Solo" to "Princess Leia"

You can add elements to a vector with the append() function e.g:

s <- append(s,c("Han Solo","Yoda"))  #adds the values "Han Solo" and "Yoda" to the end of vector s. 
#You can use the `after` keyword of the `append` function to specify after which index you want your new values inserted (see R documentation)
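
To make the role of the after argument concrete, here is a small sketch using a separate, hypothetical vector called crew (so that s is left untouched):

```r
crew <- c("Darth Vader", "Luke Skywalker", "Princess Leia")

# Default behaviour: the new value goes at the end
crew_end <- append(crew, "Yoda")
print(crew_end)     # "Darth Vader" "Luke Skywalker" "Princess Leia" "Yoda"

# With after = 1, the new value is inserted right after the first element
crew_middle <- append(crew, "Yoda", after = 1)
print(crew_middle)  # "Darth Vader" "Yoda" "Luke Skywalker" "Princess Leia"
```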

🎯 ACTION POINTS

  1. Try the basic vector manipulations from the examples and check that you understand the results.

  2. Could you create three vectors:

    • a vector k which is a sequence from 8 to 1 (i.e descending order) in increments of 2
    • a vector m which contains the content of k repeated three times
    • a vector n containing three copies of k, with a 0 separating each copy from the next one?
  3. Can you check whether m is equal to n?

  4. Note that you should also obtain a warning message because the 2 vectors are not of the same length. How can you check the length of both vectors?

  5. What will happen in each of these examples:

    num_char <- c(1, 2, 3, "a")
    num_logical <- c(1, 2, 3, TRUE)
    char_logical <- c("a", "b", "c", TRUE)
    tricky <- c(1, 2, 3, "4")

    ?

  6. Suppose the following vector represents flu cases over a given number of weeks:

    flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)
    • Could you remove the missing values from the vector?
    • In how many weeks was the number of cases over 10?
    • Could you compute the mean and standard deviation associated with the number of flu cases?
    • Was there a week where the number of cases was equal to 42? And if so, which was it? Same questions with value 3.

Chaining operations with the pipe operator (%>%)

The pipe operator, %>%, comes from the magrittr package by Stefan Milton Bache. Packages in tidyverse load %>% automatically, so you don’t usually load magrittr explicitly.

This means that, before we are able to use the pipe operator, we either have to load the magrittr library or the tidyverse library (we would usually load tidyverse if we are going to use other tidyverse functions) after having made sure we have magrittr and/or tidyverse installed.

Note on installing and loading packages in R

How do I install a package in R e.g tidyverse?

Just run install.packages("tidyverse") in your R console. This will install all the packages that are part of the tidyverse ecosystem.

DO NOT leave an install.packages() command in your .qmd file. Always do this in the R Console. Otherwise, you won’t be able to render your markdown file as an HTML later.

How do I load a package/library that I have already installed e.g magrittr?

To load magrittr, you simply need to execute the call library(magrittr) before you use any function that comes from this package (e.g pipe).

However, in your .qmd file, make it a habit to create and dedicate your first chunk of code to loading all the packages you’ll use using the library() function. While it’s true that you may not know all the packages you will need when starting a new file, making the first chunk a reserved space for the task of loading libraries creates a neat, centralised hub. In this lab, since we simply need the pipe operator for now, we would simply load magrittr with library(magrittr) in that first chunk of code: we can circle back to it if we need to add further libraries. You have to rerun the chunk for the packages to be loaded, every time you add a new library() call to the chunk.

The pipe operator allows you to chain together sequences of operations and has four main advantages:

  • you structure the sequence of your data operations from left to right, as opposed to from the inside out;
  • you avoid nested function calls (one of the possible solutions for creating vector n in question 2 of the Vectors section);
  • you minimize the need for local variables and function definitions;
  • you make it easy to add steps anywhere in the sequence of operations.

What does it mean in practice?

Let’s go back to the flu cases example.

Suppose we want to only look at the first 9 weeks of data, replace the missing values in this subset of data with value 0 and then count how many weeks within the period the number of cases was equal to 0.

We could write the sequence of operations as follows:


flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)
flu_cases <- flu_cases[1:9]
flu_cases <- replace(flu_cases,is.na(flu_cases),0)
zero_cases <- length(subset(flu_cases,flu_cases==0))

Notice the number of assignments and nested function calls.

Alternatively, we could rewrite the sequence with the pipe (%>%) operator:

zero_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208) %>% 
   .[1:9] %>% # here and in the following lines, . stands for the result of the previous step (on this line, the full vector, i.e. flu_cases)
   replace(is.na(.),0) %>%
   subset(.==0) %>%
   length()

In this case, each line performs one operation from the sequence. The pipe is essentially equivalent to the English word ‘then’: define this vector, then take its first nine values, then replace the missing values within it with the value 0, then take the subset of this vector where the value is equal to 0, then take the length of the resulting vector (i.e. count the number of weeks within the first nine weeks where the number of cases is 0).
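
For a smaller illustration of the same idea, the sketch below writes one computation twice, first as nested calls and then as a pipeline. It assumes the magrittr package is installed; the vector values and the names nested and piped are made up for this example.

```r
library(magrittr)  # provides %>% (assumes magrittr is installed)

values <- c(1, 4, 9, 16)

# Nested calls are read from the inside out: sqrt, then mean, then round
nested <- round(mean(sqrt(values)), 1)

# The pipeline is read from left to right, in the order the steps happen
piped <- values %>% sqrt() %>% mean() %>% round(1)

print(piped)              # 2.5
identical(nested, piped)  # TRUE
```

Each step receives the result of the previous one as its first argument, which is why round(1) in the pipeline corresponds to round(..., 1) in the nested version.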

For details on pipes, have a look at this tutorial.

🎯 ACTION POINTS

  1. Can you re-create vector n from question 2 of the Vectors section using the pipe operator?
  2. Use the sample function to create two vectors (you can choose the vector length) whose values are in the range [7, 42]. Append the two vectors together, then scale the resulting vector, then select only the negative values and count them. Use the pipe operator to write your sequence of operations.

For loops

Suppose that, for some reason, you want to print out sentences of the form “The year is [year]”, where [year] is equal to 2019, 2020, and so on up to 2024. You can do this as follows:

print(paste("The year is", 2019))
[1] "The year is 2019"
print(paste("The year is", 2020))
[1] "The year is 2020"
print(paste("The year is", 2021))
[1] "The year is 2021"
print(paste("The year is",2022))
[1] "The year is 2022"
print(paste("The year is", 2023))
[1] "The year is 2023"
print(paste("The year is", 2024))
[1] "The year is 2024"

As you quickly see, this is rather tedious since you copy the same code chunk over and over again. Rather than doing this, you could use a for loop to write repetitive parts of code.

Using a for loop, the code above transforms into:

for (year in 2019:2024){
  print(paste("The year is", year))
}

The best way to understand this loop is as follows: “For each year that is in the sequence 2019:2024, you execute the code chunk print(paste("The year is", year))”. Once the for loop has executed the code chunk for every year in the vector (i.e sequence 2019:2024), the loop stops and goes to the first instruction after the loop block.
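
Loops are also commonly used to fill in a result vector one element at a time. The sketch below is one possible pattern, using made-up data and the hypothetical names weekly_counts, running_total and total: it loops over the positions of a vector with seq_along() and stores a running total.

```r
weekly_counts <- c(3, 5, 2, 8)

# Create the result vector up front, with one slot per input value
running_total <- numeric(length(weekly_counts))
total <- 0

for (i in seq_along(weekly_counts)) {
  total <- total + weekly_counts[i]  # add this week's count
  running_total[i] <- total          # store the total so far at position i
}

print(running_total)  # 3 8 10 18
```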

🎯 ACTION POINTS

  1. Suppose you have a new vector of characters:

    t <- c("R2-D2","Chewbacca","Obi-Wan Kenobi")
    • Could you write code that creates a vector that contains all the Star Wars character names, then, for each name, counts the number of characters, then finds the Star Wars character with longest name? Use a for loop in your code.
    • Could you write code that creates a vector that contains all the Star Wars character names, then, for each name, counts the number of characters and prints out a line of the form “The character name [name of character] is composed of [x] characters” (e.g “The character name Yoda is composed of 4 characters”)?

Functions

In this lab, we’ve encountered and used quite a few different pre-made functions in R (e.g. c(), replace(), subset(), sample(), rep(), length()), but sometimes you just need to write your own function to tackle your data, i.e. your own set/succession of (reproducible) instructions.

A function is simply a code block that performs a specific task (which can be more or less complex), such as calculating a sum.

You should think of writing a function whenever you’ve copied and pasted a block of code more than twice.

In R, functions are of the form:

name_of_the_function <- function(arguments) {
  function_content
}

You give your function a (meaningful) name (name_of_the_function), define your function arguments (arguments), i.e. the parameters it needs to perform the task it is supposed to perform, and put some content in the function. In function_content, you define how the function should deal with its input/arguments to perform that task.

Here’s, as an example, a very simple function to sum two numbers:

sum_twonumbers <- function(number1, number2) {
  result <- number1 + number2
  return(result)
}

As expected, you can invoke this function multiple times with different parameters and get different sum results e.g:

sum_twonumbers(45,6)
[1] 51
sum_twonumbers(1007888,177999)
[1] 1185887
sum_twonumbers(42,1337)
[1] 1379
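
Function arguments can also be given default values, which are used whenever the caller does not supply them. The following is a sketch only: range_width() is a hypothetical function written for this illustration, not one that ships with R.

```r
# Width of the range of a numeric vector; missing values are ignored by default
range_width <- function(values, na.rm = TRUE) {
  width <- max(values, na.rm = na.rm) - min(values, na.rm = na.rm)
  return(width)
}

width_default <- range_width(c(4, NA, 10, 7))                # 6
width_strict <- range_width(c(4, NA, 10, 7), na.rm = FALSE)  # NA
```

Calling the function without na.rm uses the default (TRUE), while passing na.rm = FALSE overrides it, so the missing value propagates to the result.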

🎯 ACTION POINTS

  1. Could you write a function that takes a vector of characters as an input, counts the number of characters for each element of the vector and prints out a line of the form “The character name [name of character] is composed of [x] characters” (e.g. “The character name Yoda is composed of 4 characters”)?
  2. Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors (hint: create example vectors with NA in them to test your function).

Transforming for loops with sapply

for loops are all well and good, and they are rather convenient (and quite easy to grasp and write!). But they’re not always the most efficient solution (computationally) when it comes to executing repetitive pieces of code. R supports vectorization, and solutions that use vectorized operations or apply functions such as lapply and sapply are often more concise, and can be more efficient, than solutions that use loops, in particular for loops. It’s often more convenient to use sapply, as it simplifies its output to a vector where possible.
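
As a sketch of what such a rewrite can look like, the example below builds the “The year is ...” sentences from the For loops section in three ways and checks that the results agree. The names years, labels_loop, labels_sapply and labels_vectorised are hypothetical.

```r
years <- 2019:2024

# Version 1: a for loop that fills a result vector created up front
labels_loop <- character(length(years))
for (i in seq_along(years)) {
  labels_loop[i] <- paste("The year is", years[i])
}

# Version 2: sapply() applies the function to each year and returns a vector
labels_sapply <- sapply(years, function(year) paste("The year is", year))

# Version 3: paste() is itself vectorised, so no explicit iteration is needed
labels_vectorised <- paste("The year is", years)

identical(labels_loop, labels_sapply)      # TRUE
identical(labels_loop, labels_vectorised)  # TRUE
```

When a function already accepts whole vectors, as paste() does here, the third form is usually the simplest.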

🎯 ACTION POINTS

  1. Take the for loops from question 9 (the Star Wars character names task in the For loops section) and see if you can rewrite them using the sapply function. Consult the R documentation to help you with your task or have a look at this link.

Footnotes

  1. the equivalent of Python’s float type↩︎

  2. You can open the R documentation by typing ?c or help(c) in your RStudio console (in VS Code, you would first open a terminal, type R to start an R session and then use the same command as in the RStudio console) or by running ?c or help(c) through an R code block within Quarto↩︎