✅ LSE DS202A 2025: Week 01 - Lab Solutions

Author

The DS202 Team

Published

15 Sep 2025

Here are the solutions to Lab 1.

If you want to render the document yourselves and play with the code, you can download the .qmd version of this solution file.

📖 Review of R/tidyverse fundamental concepts

Finding the R documentation

Before asking your peers and/or class teacher for help, you can try looking up the R documentation to find out more about a function and how it should be used. For example, if you want more details about the c() function used to create vectors (see Section 1.1 for details), you can type the command ?c or help(c) in the RStudio console or in the VSCode terminal (after typing R in the VSCode terminal to start a new R session). Alternatively, you can run the same commands through an R block in Quarto.

Vectors

A vector is the most common and basic data type in R. It is composed of a series of values of the same type, e.g. character, numeric¹, integer, logical (i.e. TRUE and FALSE), complex or raw. Have a look at this page for a more thorough description of the basic data types in R (with examples!).

You use the c() function² to create a vector. The values you combine within the vector must be comma-separated.

Here are examples of vectors:

x <- c(22L, 30L, 42L) #vector of integers
x
y <- c(6.02214, 3.14159, 6.674) #vector of numerics
y
z <- c(FALSE, TRUE, FALSE) #vector of logicals
z
s <- c("Darth Vader", "Luke Skywalker", "Han Solo") #vector of characters
s

[1] 22 30 42
[1] 6.02214 3.14159 6.67400
[1] FALSE  TRUE FALSE
[1] "Darth Vader"    "Luke Skywalker" "Han Solo"

You can use vectors of integers or numerics in arithmetic operations or in computations, for example:

y <- sort(y) # operation to sort the content of y
y
v <- x^2 / 2 - y * 3 + 5
v
w <- mean(y) # computing the mean of y
w

[1] 3.14159 6.02214 6.67400
[1] 237.5752 436.9336 866.9780
[1] 5.279243

You can also create vectors that are sequences of numbers:

a <- c(1:10) #a sequence of numbers from 1 to 10 with an increment of 1
a
b <- seq(-3, 3, by = 0.5) #a sequence of numbers from -3 to 3 with an increment of 0.5
b

[1]  1  2  3  4  5  6  7  8  9 10
[1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0

And you can use the rep() function to repeat sequence items (can you spot the difference between using the times or each keyword with rep in the example below?):

d <- rep(a, times = 2)
d
e <- rep(a, each = 2)
e

[1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
[1]  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10

You can access a vector item by referring to its index number inside brackets []: the first item has index 1, the second item has index 2, and so on. For example, if you want to access the second element of vector s above (i.e. the value "Luke Skywalker"), you would need this line of code:

s[2]
[1] "Luke Skywalker"

You can access multiple vector elements, thanks to the c() function: the line s[c(1,3)] lets you access the first and third values of the s vector, i.e the values "Darth Vader" and "Han Solo".

You can use negative indexes to access all vector elements except the one(s) specified by the index(es) e.g use the command s[-3] to access all vector s elements except the value at index 3, i.e to access the values at indexes 1 and 2 (values "Darth Vader" and "Luke Skywalker").
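The indexing forms described above can be tried side by side. A minimal sketch, re-creating the original vector s so that the block runs on its own:

```r
s <- c("Darth Vader", "Luke Skywalker", "Han Solo")

s[c(1, 3)]  # first and third elements: "Darth Vader" "Han Solo"
s[-3]       # everything except the third element: "Darth Vader" "Luke Skywalker"
s[-c(1, 2)] # several negative indexes at once: only "Han Solo" remains
```

Note that none of these lines modifies s: indexing returns a new vector unless you assign the result back.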

You can also change the value of a particular vector element by referring to its index:

s[3] <- "Princess Leia" #changes the value of the third element of vector s from "Han Solo" to "Princess Leia"
s
[1] "Darth Vader"    "Luke Skywalker" "Princess Leia"

You can add elements to a vector with the append() function e.g:

s <- append(s, c("Han Solo", "Yoda")) #adds the values "Han Solo" and "Yoda" to the end of vector s.
#You can use the `after` keyword of the `append` function to specify after which index you want your new values inserted (see R documentation)
s
[1] "Darth Vader"    "Luke Skywalker" "Princess Leia"  "Han Solo"      
[5] "Yoda"
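The `after` argument mentioned in the comment above can be illustrated with a short sketch. The name s_inserted is ours, chosen for illustration only:

```r
s <- c("Darth Vader", "Luke Skywalker", "Princess Leia")

# `after = 1` inserts the new value right after the first element
s_inserted <- append(s, "Yoda", after = 1)
s_inserted
# "Darth Vader" "Yoda" "Luke Skywalker" "Princess Leia"

# `after = 0` puts the new value at the very beginning
append(s, "Yoda", after = 0)
# "Yoda" "Darth Vader" "Luke Skywalker" "Princess Leia"
```

When `after` is not given, it defaults to the length of the vector, which is why the earlier call added the new values at the end.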

🎯 ACTION POINTS

Try the basic vector manipulations from the examples and check that you understand the results.
Could you create three vectors:
- a vector k which is a sequence from 8 to 2 (i.e. descending order) in increments of 2


k <- seq(8, 2, by = -2)
k

[1] 8 6 4 2

- a vector m which contains the content of k repeated three times


m <- rep(k, times = 3) #pass the vector k itself, not the string "k"
m

 [1] 8 6 4 2 8 6 4 2 8 6 4 2

- a vector n containing three copies of k, with a 0 separating each copy from the next one?


n <- head(rep(c(k, 0), times = 3), -1) #repeat k followed by 0 three times, then drop the trailing 0
n

 [1] 8 6 4 2 0 8 6 4 2 0 8 6 4 2

Can you check whether m is equal to n?


m == n

 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE

Note that you should also obtain a warning message: the two vectors are not of the same length, and the length of the longer one is not a multiple of the length of the shorter one, so R cannot recycle m evenly. How can you check the length of both vectors?


length(m)
length(n)

[1] 12
[1] 14

What will happen in each of these examples:

num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
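One way to answer this question is to inspect the class of each vector. A minimal sketch: since a vector can only hold values of one type, R silently converts (coerces) the values to the most flexible type present.

```r
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")

class(num_char)     # "character": the numbers become "1", "2", "3"
class(num_logical)  # "numeric": TRUE becomes 1
class(char_logical) # "character": TRUE becomes "TRUE"
class(tricky)       # "character": "4" is a string, so 1, 2, 3 are converted too
```

The coercion order is logical, then numeric, then character: the type furthest along that order wins.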

Suppose the following vector represents flu cases over a given number of weeks:

flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)

Could you remove the missing values from the vector?
In how many weeks was the number of cases over 10?
Could you compute the mean and standard deviation associated with the number of flu cases?
Was there a week where the number of cases was equal to 42? And if so, which was it? Same questions with value 3.
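The flu questions above could be approached along the following lines. This is one possible sketch rather than the only solution; the name flu_clean is ours, chosen for illustration.

```r
flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)

# 1. Remove the missing values
flu_clean <- flu_cases[!is.na(flu_cases)]

# 2. Number of weeks with more than 10 cases (TRUE counts as 1 in sum())
sum(flu_clean > 10) # 4

# 3. Mean and standard deviation
mean(flu_clean) # 1026
sd(flu_clean)

# 4. Was there a week with exactly 42 cases? With exactly 3?
any(flu_clean == 42)  # FALSE
which(flu_cases == 3) # 5 7 8: week numbers in the original vector (which() ignores NA)
```

An alternative for step 3 is to keep the original vector and pass na.rm = TRUE to mean() and sd().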

Chaining operations with the pipe operator (`|>`)

The pipe operator, |>, is a native feature in R (available from R 4.1.0 onwards) that allows you to chain together sequences of operations. While the magrittr package introduced the %>% pipe operator, the native |> is now the preferred approach and doesn’t require loading any additional packages.

Native |> vs magrittr %>% Pipe Operators

R now has two pipe operators available. You are welcome to choose the one that suits your coding style.

Native pipe |> (R 4.1.0+):
- Built into base R, no packages required
- Simpler syntax, better performance
- Does NOT support the magrittr dot placeholder (.); since R 4.2.0 it has its own placeholder, _, which must be passed to a named argument
- For other complex operations, use anonymous functions: (\(x) operation_with_x)()

magrittr pipe %>%:
- Requires loading the magrittr or tidyverse packages
- Supports placeholder syntax with . for more complex piping
- Example: data %>% operation(., argument)

When to use which:
- Use |> for most straightforward piping operations
- Use %>% when you need placeholder functionality for complex operations
- In this course, we'll primarily use |> as it's now the R standard

Example comparison:

# magrittr style (requires library(magrittr))
result <- data %>% 
  .[1:5] %>% 
  subset(. > 2)

# Native pipe equivalent
result <- data |> 
  (\(x) x[1:5])() |> 
  (\(x) subset(x, x > 2))()
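The comparison above uses anonymous functions for the native pipe. Assuming R 4.2.0 or later, the native pipe also has a placeholder of its own, written _, which can only be passed to a named argument. A minimal sketch (values is a made-up example vector, not data from the lab):

```r
# Assumes R >= 4.2.0, where the native pipe gained the `_` placeholder.
# `values` is a hypothetical example vector.
values <- c(4, 1, 3, 5, 2)

# Without a placeholder, the piped value becomes the FIRST argument of sort()
sorted_desc <- values |> sort(decreasing = TRUE)

# With `_`, the piped value can be sent to any NAMED argument, here `to`
one_to_max <- max(values) |> seq(from = 1, to = _)

sorted_desc # 5 4 3 2 1
one_to_max  # 1 2 3 4 5
```

Operations such as indexing with [ ] still need the anonymous-function form shown above.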

The pipe operator has four main advantages:

- you structure the sequence of your data operations from left to right, as opposed to from the inside out;
- you avoid nested function calls (that was one of the possible solutions to create vector n in question 2 in Section 1.1);
- you minimize the need for local variables and function definitions;
- you make it easy to add steps anywhere in the sequence of operations.

What does it mean in practice?

Let’s go back to the flu cases example.

Suppose we want to only look at the first 9 weeks of data, replace the missing values in this subset of data with value 0 and then count how many weeks within the period the number of cases was equal to 0.

Note on installing and loading packages in R

How do I install a package in R e.g tidyverse?

Just run install.packages("tidyverse") in your R console. This will install all the packages that are part of the tidyverse ecosystem.

DO NOT leave an install.packages() command in your .qmd file. Always do this in the R Console. Otherwise, you won’t be able to render your markdown file as an HTML later.

How do I load a package/library that I have already installed e.g magrittr?

To load magrittr, you simply need to execute the call library(magrittr) before you use any function that comes from this package (e.g pipe).

However, in your .qmd file, make it a habit to dedicate your first chunk of code to loading all the packages you'll use with the library() function. While it's true that you may not know all the packages you will need when starting a new file, reserving the first chunk for loading libraries creates a neat, centralised hub. In this lab, since we only need the pipe operator for now, we would load magrittr with library(magrittr) in that first chunk of code; we can circle back to it if we need to add further libraries. Every time you add a new library() call to the chunk, you have to rerun the chunk for the packages to be loaded.

We could write the sequence of operations as follows:


flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)
flu_cases <- flu_cases[1:9]
flu_cases <- replace(flu_cases, is.na(flu_cases), 0)
zero_cases <- length(subset(flu_cases, flu_cases == 0))
zero_cases

[1] 4

Notice the number of assignments and nested function calls.
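For contrast, the same count can also be squeezed into a single nested expression, which has to be read from the inside out. A minimal sketch (the names first_nine and zero_cases_nested are ours, and sum() over a logical vector is used here instead of length(subset(...))):

```r
flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)
first_nine <- flu_cases[1:9]

# Read from the innermost call outwards:
# is.na() finds the gaps -> replace() fills them with 0 ->
# `== 0` flags the zero weeks -> sum() counts the TRUE values
zero_cases_nested <- sum(replace(first_nine, is.na(first_nine), 0) == 0)
zero_cases_nested # 4
```

The result is the same, but the order in which you read the code no longer matches the order in which the operations happen.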

Alternatively, we could rewrite the sequence with the pipe (|>) operator. For a challenge, try writing out the same solution using %>%.

# Native version
zero_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208) |>
  (\(x) x[1:9])() |>
  (\(x) replace(x, is.na(x), 0))() |>
  (\(x) subset(x, x == 0))() |>
  length()

zero_cases

[1] 4

# magrittr version
zero_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208) %>%
  .[1:9] %>%
  replace(., is.na(.), 0) %>%
  subset(., . == 0) %>%
  length()

zero_cases

[1] 4

In this case, each line performs one operation from the sequence: the pipe is essentially equivalent to the English word ‘then’. Define this vector, then take its first nine values, then replace the missing values within it with the value 0, then take the subset of this vector where the value is equal to 0, then take the length of the resulting vector (i.e. count the number of weeks within the first nine weeks where the number of cases is 0).

For details on pipes, have a look at this tutorial.

🎯 ACTION POINTS

Can you re-create vector n from question 2 of Section 1.1 using the pipe operator?

k <- seq(8, 2, by = -2)

n <- k |>
  append(0) |> #k followed by a 0
  rep(times = 3) |> #three copies of (k, 0)
  head(-1) #drop the trailing 0

n

 [1] 8 6 4 2 0 8 6 4 2 0 8 6 4 2

Use the sample function to create two vectors (you can choose the vector length) whose values are in the range [7-42]. Append both vectors together, then scale the resulting vector before only selecting negative values and getting a count of negative values. Use the pipe operator to write your sequence of operations.

#write your answer here
vector_range <- 7:42 #specifying the value range of both vectors we'll create
l1 <- 10 #length of the first vector
l2 <- 23 #length of the second vector
v1 <- sample(vector_range, l1) #creating a first vector whose values are in the range [7-42] and of length l1, i.e 10 here
v2 <- sample(vector_range, l2) #creating a second vector whose values are in the range [7-42] and of length l2, i.e 23 here

neg_count <- append(v1, v2) |>
  scale() |>
  (\(x) x[, 1])() |>
  (\(x) subset(x, x < 0))() |>
  length()

neg_count

[1] 16

# magrittr version
neg_count <- append(v1, v2) %>%
  scale() %>%
  .[, 1] %>%
  subset(., . < 0) %>%
  length()

neg_count

[1] 16
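Because sample() draws values at random and the solution above sets no seed, the count you obtain will generally differ from 16 and change from run to run. A minimal sketch of one way to make the result reproducible, assuming an arbitrary seed (202 is our choice, not part of the lab) and counting with sum() over a logical vector:

```r
set.seed(202) # arbitrary seed, chosen only for this illustration

vector_range <- 7:42
v1 <- sample(vector_range, 10)
v2 <- sample(vector_range, 23)

neg_count <- append(v1, v2) |>
  scale() |>          # centre and scale; returns a one-column matrix
  as.vector() |>      # back to a plain numeric vector
  (\(x) sum(x < 0))() # TRUE counts as 1, so this counts the negative values

neg_count
```

Rerunning the whole block, including the set.seed() call, gives the same count every time.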

`For` loops

Suppose that, for some reason, you want to print out sentences of the form “The year is [year]”, where [year] is equal to 2019, 2020, up to 2026. You can do this as follows:

print(paste("The year is", 2019))
print(paste("The year is", 2020))
print(paste("The year is", 2021))
print(paste("The year is", 2022))
print(paste("The year is", 2023))
print(paste("The year is", 2024))
print(paste("The year is", 2025))
print(paste("The year is", 2026))

[1] "The year is 2019"
[1] "The year is 2020"
[1] "The year is 2021"
[1] "The year is 2022"
[1] "The year is 2023"
[1] "The year is 2024"
[1] "The year is 2025"
[1] "The year is 2026"

As you quickly see, this is rather tedious since you copy the same code chunk over and over again. Rather than doing this, you could use a for loop to write repetitive parts of code.

Using a for loop, the code above transforms into:

for (year in 2019:2026) {
  print(paste("The year is", year))
}

[1] "The year is 2019"
[1] "The year is 2020"
[1] "The year is 2021"
[1] "The year is 2022"
[1] "The year is 2023"
[1] "The year is 2024"
[1] "The year is 2025"
[1] "The year is 2026"

The best way to understand this loop is as follows: “For each year that is in the sequence 2019:2026, you execute the code chunk print(paste("The year is", year))”. Once the for loop has executed the code chunk for every year in the vector (i.e sequence 2019:2026), the loop stops and goes to the first instruction after the loop block.

🎯 ACTION POINTS

Suppose you have a new vector of characters:

t <- c("R2-D2", "Chewbacca", "Obi-Wan Kenobi")

- Could you write code that creates a vector that contains all the Star Wars character names, then, for each name, counts the number of characters, then finds the Star Wars character with longest name? Use a `for` loop in your code.

#write your answer here
#Let's start with vector `s` as defined in the Vectors section

s <- c("Darth Vader", "Luke Skywalker", "Han Solo")
# we modify `s` as shown in the Vectors section to include all the Star Wars characters used so far

s[3] <- "Princess Leia"
s <- append(s, c("Han Solo", "Yoda"))

#we combine `s` and `t` into a single vector that includes all (known) Star Wars character names
char_names <- append(s, t)

#we now write our `for` loop

vec_chars <- c() #an (initially empty) vector that will store the number of characters in each name

# now we start the loop
for (i in char_names) {
  vec_chars[i] <- nchar(i) #store the count under the character's name
}

#finding the Star Wars character(s) with the longest name

which(vec_chars == max(vec_chars)) #this returns two values: "Luke Skywalker" (index 2 in the vector) and "Obi-Wan Kenobi" (index 8 in the vector)

Luke Skywalker Obi-Wan Kenobi 
             2              8

- Could you write code that creates a vector that contains all the Star Wars character names, then, for each name, counts the number of characters and prints out a line of the form "The character name [name of character] is composed of [x] characters" (e.g "The character name Yoda is composed of 4 characters")?

#write your answer here

#We start as we did before with vectors `s` and `t`
t <- c("R2-D2", "Chewbacca", "Obi-Wan Kenobi")

s <- c("Darth Vader", "Luke Skywalker", "Han Solo")
# we modify `s` as shown in the Vectors section to include all the Star Wars characters used so far
s[3] <- "Princess Leia"
s <- append(s, c("Han Solo", "Yoda"))

#we combine `s` and `t` into a single vector that includes all (known) Star Wars character names
char_names <- append(s, t)

# we write our `for` loop

for (i in char_names) {
  n_chars <- nchar(i)
  print(paste("The character name", i, "is composed of", n_chars, "characters"))
}

[1] "The character name Darth Vader is composed of 11 characters"
[1] "The character name Luke Skywalker is composed of 14 characters"
[1] "The character name Princess Leia is composed of 13 characters"
[1] "The character name Han Solo is composed of 8 characters"
[1] "The character name Yoda is composed of 4 characters"
[1] "The character name R2-D2 is composed of 5 characters"
[1] "The character name Chewbacca is composed of 9 characters"
[1] "The character name Obi-Wan Kenobi is composed of 14 characters"

Functions

In this lab, we’ve encountered and used quite a few different pre-made functions in R (e.g. c(), replace(), subset(), sample(), rep(), length()), but sometimes you just need to write your own function to tackle your data, i.e. your own set/succession of (reproducible) instructions.

A function is simply a code block that performs a specific task (which can be more or less complex), such as calculating a sum.

You should think of writing a function whenever you’ve copied and pasted a block of code more than twice.

In R, functions are of the form:

name_of_the_function <- function(arguments) {
  function_content
}

You give your function a (meaningful) name (name_of_the_function), define your function arguments (arguments), i.e. the parameters it needs to perform the task it is supposed to perform, and put some content in the function. In function_content, you define how the function should deal with its input/arguments to perform that task.

Here’s, as an example, a very simple function to sum two numbers:

sum_twonumbers <- function(number1, number2) {
  result <- number1 + number2
  return(result)
}

As expected, you can invoke this function multiple times with different parameters and get different sum results e.g:

sum_twonumbers(45, 6)
sum_twonumbers(1007888, 177999)
sum_twonumbers(42, 1337)

[1] 51
[1] 1185887
[1] 1379

🎯 ACTION POINTS

Could you write a function that takes a vector of characters as an input, counts the number of characters for each element of the vector and prints out a line of the form “The character name [name of character] is composed of [x] characters” (e.g. “The character name Yoda is composed of 4 characters”)?


count_character_names <- function(char_vector) {
  for (name in char_vector) {
    char_count <- nchar(name)
    cat(
      "The character name",
      name,
      "is composed of",
      char_count,
      "characters\n"
    )
  }
}
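To check that the function behaves as intended, you could call it on a small vector. A minimal usage sketch: the function body is repeated from the solution above only so that this block runs on its own, and the input vector is just an example.

```r
# Same function as in the solution above, repeated to keep this block self-contained
count_character_names <- function(char_vector) {
  for (name in char_vector) {
    char_count <- nchar(name)
    cat("The character name", name, "is composed of", char_count, "characters\n")
  }
}

count_character_names(c("R2-D2", "Chewbacca", "Obi-Wan Kenobi"))
# The character name R2-D2 is composed of 5 characters
# The character name Chewbacca is composed of 9 characters
# The character name Obi-Wan Kenobi is composed of 14 characters
```

cat() is used rather than print(paste(...)) so that the lines appear without the [1] prefix and the surrounding quotes.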

Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors (hint: create example vectors with NA in them to test your function).


both_na <- function(vector1, vector2) {
  # Check if vectors are the same length
  if (length(vector1) != length(vector2)) {
    stop("Vectors must be the same length")
  }

  # Count positions where both vectors have NA
  sum(is.na(vector1) & is.na(vector2))
}
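Following the hint, the function can be tested on example vectors that contain NA values. A minimal sketch: the function is repeated from the solution above so that the block runs on its own, and v1, v2 and msg are hypothetical test names.

```r
# Same function as in the solution above, repeated to keep this block self-contained
both_na <- function(vector1, vector2) {
  if (length(vector1) != length(vector2)) {
    stop("Vectors must be the same length")
  }
  sum(is.na(vector1) & is.na(vector2))
}

# Hypothetical test vectors: both have NA at positions 2 and 5
v1 <- c(1, NA, 3, NA, NA)
v2 <- c(NA, NA, 3, 4, NA)
both_na(v1, v2) # 2

# Vectors of different lengths trigger the error defined in the function
msg <- tryCatch(both_na(1:3, 1:2), error = function(e) conditionMessage(e))
msg # "Vectors must be the same length"
```

The & operator compares the two logical vectors element by element, so only positions where both are NA count towards the sum.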

Transforming `for` loops with `sapply`

for loops are all well and good and they are rather convenient (and quite easy to grasp and write!). But they’re not always the most concise or (computationally) efficient solution when it comes to executing repetitive pieces of code. R supports vectorization: many functions operate on a whole vector at once, which is usually faster than looping over its elements one by one. R also provides the apply family of functions, such as lapply and sapply, which apply a function to every element without an explicit loop; they are not necessarily faster than a well-written for loop, but they are more compact. It’s often more convenient to use sapply, as it simplifies its output to a vector where possible, whereas lapply always returns a list.

🎯 ACTION POINTS

Take the for loops from question 9 and see if you can rewrite them using the sapply function. Consult the R documentation to help you with your task or have a look at this link.
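One possible way to rewrite the two loops is sketched below. It assumes the char_names vector built in the for loop exercises (re-created here so that the block runs on its own); the name sentences is ours.

```r
char_names <- c(
  "Darth Vader", "Luke Skywalker", "Princess Leia", "Han Solo",
  "Yoda", "R2-D2", "Chewbacca", "Obi-Wan Kenobi"
)

# First loop: sapply() applies nchar() to each name and returns a named vector
vec_chars <- sapply(char_names, nchar)
names(vec_chars)[vec_chars == max(vec_chars)] # the names tied for the longest

# nchar() is itself vectorised, so this gives the same counts (without names)
nchar(char_names)

# Second loop: build one sentence per name, then print them line by line
sentences <- sapply(char_names, function(name) {
  paste("The character name", name, "is composed of", nchar(name), "characters")
})
writeLines(sentences)
```

Here sapply() uses the character strings themselves as names for its result, which is why the longest names can be read directly from names(vec_chars).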

Footnotes

¹ The equivalent of Python's float type.
² You can consult the R documentation by typing ?c or help(c) in your RStudio console (in VSCode, you would first open a terminal, type R to start an R session and then use the same command as in the RStudio console), or by running ?c or help(c) in an R code block within Quarto.
