✅ LSE DS202A 2025: Week 01 - Lab Solutions
Here are the solutions to Lab 1.
If you want to render the document yourselves and play with the code, you can download the .qmd
version of this solution file by clicking on the button below:
📖 Review of R/tidyverse fundamental concepts
Before asking for help from your peers and/or class teacher, you can try looking up the R documentation to find out more about a function and how it should be used. For example, if you want more details about the c() function used to create vectors (see Section 1.1 for details), you can invoke the R documentation through the RStudio console or the VSCode terminal (after typing R in the VSCode terminal to start a new R session) by typing the command ?c or help(c), or alternatively by running the same commands through an R block in Quarto.
Vectors
A vector is the most common and basic data type in R. It is composed of a series of values of the same type, e.g. character, numeric (see footnote 1), integer, logical (i.e. TRUE and FALSE), complex or raw. Have a look at this page for a more thorough description of the basic data types in R (with examples!).
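To make the list of types above concrete, here is a minimal sketch that creates one value of each type and asks R to report it. The variable names are purely illustrative and are not part of the lab.

```r
# One value of each basic type (names are illustrative only)
a_character <- "Yoda"
a_numeric   <- 3.14         # stored internally as a double
an_integer  <- 42L          # the L suffix requests an integer
a_logical   <- TRUE
a_complex   <- 1 + 2i
a_raw       <- as.raw(255)

# typeof() reports the internal storage type of each value
types <- c(
  typeof(a_character), typeof(a_numeric), typeof(an_integer),
  typeof(a_logical), typeof(a_complex), typeof(a_raw)
)
print(types)

# class() uses the friendlier label "numeric" for doubles
print(class(a_numeric))
```

Note that typeof() calls a numeric value "double", while class() calls it "numeric"; both describe the same thing.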
You use the c() function (see footnote 2) to create a vector. The values you combine within the vector must be comma-separated.
Here are examples of vectors:
x <- c(22L, 30L, 42L) #vector of integers
x
y <- c(6.02214, 3.14159, 6.674) #vector of numerics
y
z <- c(FALSE, TRUE, FALSE) #vector of logicals
z
s <- c("Darth Vader", "Luke Skywalker", "Han Solo") #vector of characters
s
[1] 22 30 42
[1] 6.02214 3.14159 6.67400
[1] FALSE TRUE FALSE
[1] "Darth Vader" "Luke Skywalker" "Han Solo"
You can use vectors of integers or numerics in arithmetic operations or in computations, for example:
y <- sort(y) # operation to sort the content of y
y
v <- x^2 / 2 - y * 3 + 5
v
w <- mean(y) # computing the mean of y
w
[1] 3.14159 6.02214 6.67400
[1] 237.5752 436.9336 866.9780
[1] 5.279243
You can also create vectors that are sequences of numbers:
a <- c(1:10) #a sequence of numbers from 1 to 10 with an increment of 1
a
b <- seq(-3, 3, by = 0.5) #a sequence of numbers from -3 to 3 with an increment of 0.5
b
[1] 1 2 3 4 5 6 7 8 9 10
[1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0
And you can use the rep() function to repeat sequence items (can you spot the difference between using the times or each keyword with rep in the example below?):
d <- rep(a, times = 2)
d
e <- rep(a, each = 2)
e
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
[1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
You can access a vector item by referring to its index number inside brackets []: the first item has index 1, the second item has index 2, and so on. For example, if you want to access the second element of vector s above (i.e. the value "Luke Skywalker"), you would need this line of code:
s[2]
[1] "Luke Skywalker"
You can access multiple vector elements thanks to the c() function: the line s[c(1,3)] lets you access the first and third values of the s vector, i.e. the values "Darth Vader" and "Han Solo".
You can use negative indexes to access all vector elements except the one(s) specified by the index(es), e.g. use the command s[-3] to access all elements of vector s except the value at index 3, i.e. to access the values at indexes 1 and 2 (values "Darth Vader" and "Luke Skywalker").
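The indexing commands described in the two paragraphs above can be tried out as follows. This is a small sketch that redefines s with its original three values; the names of the result variables are illustrative.

```r
s <- c("Darth Vader", "Luke Skywalker", "Han Solo")

first_and_third <- s[c(1, 3)]   # positive indexes select elements
all_but_third   <- s[-3]        # a negative index drops that element
only_third      <- s[-c(1, 2)]  # several elements can be dropped at once

print(first_and_third)
print(all_but_third)
print(only_third)
```

Negative indexes never modify s itself; they only return a shorter copy.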
You can also change the value of a particular vector element by referring to its index:
s[3] <- "Princess Leia" #changes the value of the third element of vector s from "Han Solo" to "Princess Leia"
s
[1] "Darth Vader" "Luke Skywalker" "Princess Leia"
You can add elements to a vector with the append() function, e.g.:
s <- append(s, c("Han Solo", "Yoda")) #adds the values "Han Solo" and "Yoda" to the end of vector s.
#You can use the `after` keyword of the `append` function to specify after which index you want your new values inserted (see R documentation)
s
[1] "Darth Vader" "Luke Skywalker" "Princess Leia" "Han Solo"
[5] "Yoda"
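As a quick illustration of the after keyword mentioned in the comment above, the sketch below inserts a value in the middle and at the front of a vector. The result names s_inserted and s_front are made up for this example so that s itself is left untouched.

```r
s <- c("Darth Vader", "Luke Skywalker", "Princess Leia")

# Insert "Yoda" right after the first element instead of at the end
s_inserted <- append(s, "Yoda", after = 1)
print(s_inserted)

# after = 0 places the new value at the very front
s_front <- append(s, "Yoda", after = 0)
print(s_front)
```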
🎯 ACTION POINTS
- Try the basic vector manipulations from the examples and check that you understand the results.
- Could you create three vectors:
  - a vector k which is a sequence from 8 to 2 (i.e. descending order) in increments of 2
k <- seq(8, 2, -2)
k
[1] 8 6 4 2
  - a vector m which contains the content of k repeated three times
m <- rep("k", 3)
m
[1] "k" "k" "k"
  - a vector n containing three copies of k, with a 0 separating each copy from the next one?
n <- rep(c("k", "0"), 3)
n
[1] "k" "0" "k" "0" "k" "0"
- Can you check whether m is equal to n?
m == n
[1] TRUE FALSE TRUE FALSE TRUE FALSE
- Note that the two vectors are not of the same length, so R recycles the shorter one; it only raises a warning when the longer length is not a multiple of the shorter one. How can you check the length of both vectors?
length(m)
length(n)
[1] 3
[1] 6
- What will happen in each of these examples:
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
- Suppose the following vector represents flu cases over a given number of weeks:
flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)
- Could you remove the missing values from the vector?
- How many weeks were the number of cases over 10?
- Could you compute the mean and standard deviation associated with the number of flu cases?
- Was there a week where the number of cases was equal to 42? And if so, which was it? Same questions with value 3.
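The last two action points above come without a worked answer, so here is one possible sketch. It is only one way of approaching them, and names such as flu_clean and weeks_over_10 are illustrative choices rather than anything prescribed by the lab.

```r
# Mixing types in c() coerces every element to the most flexible type present
num_char     <- c(1, 2, 3, "a")         # character
num_logical  <- c(1, 2, 3, TRUE)        # numeric: TRUE becomes 1
char_logical <- c("a", "b", "c", TRUE)  # character: TRUE becomes "TRUE"
tricky       <- c(1, 2, 3, "4")         # character, even though "4" looks numeric
print(sapply(list(num_char, num_logical, char_logical, tricky), class))

flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)

# Remove the missing values
flu_clean <- flu_cases[!is.na(flu_cases)]

# Number of weeks with more than 10 cases
weeks_over_10 <- sum(flu_clean > 10)

# Mean and standard deviation of the non-missing counts
flu_mean <- mean(flu_clean)
flu_sd   <- sd(flu_clean)

# Weeks (positions in the original vector) with exactly 42 or exactly 3 cases;
# which() skips NA, so the original vector can be used directly
weeks_with_42 <- which(flu_cases == 42)  # empty: no such week
weeks_with_3  <- which(flu_cases == 3)

print(weeks_over_10)
print(flu_mean)
print(weeks_with_3)
```

An alternative to removing the missing values first is to pass na.rm = TRUE to mean() and sd().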
Chaining operations with the pipe operator (|>)
The pipe operator, |>, is a native feature in R (available from R 4.1.0 onwards) that allows you to chain together sequences of operations. While the magrittr package introduced the %>% pipe operator, the native |> is now the preferred approach and doesn’t require loading any additional packages.
|> vs magrittr %>% Pipe Operators
R now has two pipe operators available. You are welcome to choose the one that suits your coding style.
Native pipe |> (R 4.1.0+):
- Built into base R, no packages required
- Simpler syntax and negligible overhead (the parser rewrites x |> f() as f(x))
- Does NOT support the magrittr placeholder syntax (.); since R 4.2.0 it offers its own _ placeholder, which must be passed to a named argument
- For complex operations requiring placeholders, use anonymous functions: (\(x) operation_with_x)()
magrittr pipe %>%:
- Requires loading the magrittr or tidyverse packages
- Supports placeholder syntax with . for more complex piping
- Example: data %>% operation(., argument)
When to use which:
- Use |> for most straightforward piping operations
- Use %>% when you need placeholder functionality for complex operations
- In this course, we’ll primarily use |> as it’s now the R standard
Example comparison:
# magrittr style (requires library(magrittr))
result <- data %>%
  .[1:5] %>%
  subset(. > 2)
# Native pipe equivalent
result <- data |>
  (\(x) x[1:5])() |>
  (\(x) subset(x, x > 2))()
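The comparison above uses a generic data object. To see it run, here is a small sketch with a made-up numeric vector (the values are hypothetical and not taken from the lab), showing that the nested and the piped forms give the same result. It needs R 4.1.0 or later for |> and the \(x) shorthand.

```r
data <- c(5, 1, 4, 2, 3, 9)  # hypothetical values, for illustration only

# Nested form: read from the inside out
nested <- subset(data[1:5], data[1:5] > 2)

# Native pipe form: read from top to bottom
piped <- data |>
  (\(x) x[1:5])() |>
  (\(x) subset(x, x > 2))()

print(piped)
print(identical(nested, piped))
```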
How do I install a package in R, e.g. tidyverse?
Just run install.packages("tidyverse") in your R console. This will install all the packages that are part of the tidyverse ecosystem.
DO NOT leave an install.packages() command in your .qmd file. Always do this in the R Console. Otherwise, you won’t be able to render your markdown file as an HTML later.
How do I load a package/library that I have already installed, e.g. magrittr?
To load magrittr, you simply need to execute the call library(magrittr) before you use any function that comes from this package (e.g. the pipe).
However, in your .qmd file, make it a habit to create and dedicate your first chunk of code to loading all the packages you’ll use with the library() function. While it’s true that you may not know all the packages you will need when starting a new file, making the first chunk a reserved space for the task of loading libraries creates a neat, centralised hub. In this lab, since we simply need the pipe operator for now, we would simply load magrittr with library(magrittr) in that first chunk of code: we can circle back to it if we need to add further libraries. You have to rerun the chunk for the packages to be loaded every time you add a new library() call to the chunk.
The pipe operator allows you to chain together sequences of operations and has four main advantages:
- you structure the sequence of your data operations from left to right, as opposed to from inside and out;
- you avoid nested function calls (that was one of the possible solutions to create vector n in question 2 in Section 1.1);
- you minimize the need for local variables and function definitions;
- you make it easy to add steps anywhere in the sequence of operations.
What does it mean in practice?
Let’s go back to the flu cases example.
Suppose we want to only look at the first 9 weeks of data, replace the missing values in this subset of data with value 0 and then count how many weeks within the period the number of cases was equal to 0.
We could write the sequence of operations as follows:
flu_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208)
flu_cases <- flu_cases[1:9]
flu_cases <- replace(flu_cases, is.na(flu_cases), 0)
zero_cases <- length(subset(flu_cases, flu_cases == 0))
zero_cases
[1] 4
Notice the number of assignments and nested function calls.
Alternatively, we could rewrite the sequence with the pipe (|>) operator. For a challenge, try writing out the same solution using %>%.
# Native version
zero_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208) |>
  (\(x) x[1:9])() |>
  (\(x) replace(x, is.na(x), 0))() |>
  (\(x) subset(x, x == 0))() |>
  length()
zero_cases
[1] 4

# magrittr version
zero_cases <- c(NA, 1, 0, 0, 3, NA, 3, 3, 61, 411, 2570, 7208) %>%
  .[1:9] %>%
  replace(., is.na(.), 0) %>%
  subset(., . == 0) %>%
  length()
zero_cases
[1] 4
In this case, each line performs one operation from the sequence: the pipe is essentially equivalent to an English-language ‘then’. Define this vector, then take its first nine values, then replace the missing values within it with the value 0, then take the subset of this vector where the value is equal to 0, then take the length of the resulting vector (i.e. count the number of weeks within the first nine weeks where the number of cases is 0).
For details on pipes, have a look at this tutorial.
🎯 ACTION POINTS
- Can you re-create vector n from question 2 of Section 1.1 using the pipe operator?
n <- m |>
  append(0, after = 1) |>
  append(0, after = 3) |>
  append(0, after = 5)
n
[1] "k" "0" "k" "0" "k" "0"
- Use the sample function to create two vectors (you can choose the vector length) whose values are in the range [7-42]. Append both vectors together, then scale the resulting vector before only selecting negative values and getting a count of negative values. Use the pipe operator to write your sequence of operations.
#write your answer here
vector_range <- 7:42 #specifying the value range of both vectors we'll create
l1 <- 10 #length of the first vector
l2 <- 23 #length of the second vector
v1 <- sample(vector_range, l1) #creating a first vector whose values are in the range [7-42] and of length l1, i.e. 10 here
v2 <- sample(vector_range, l2) #creating a second vector whose values are in the range [7-42] and of length l2, i.e. 23 here

neg_count <- append(v1, v2) |>
  scale() |>
  (\(x) x[, 1])() |>
  (\(x) subset(x, x < 0))() |>
  length()
neg_count #the exact count varies between runs because sample() draws at random
[1] 16

# magrittr version
neg_count <- append(v1, v2) %>%
  scale() %>%
  .[, 1] %>%
  subset(., . < 0) %>%
  length()
neg_count
[1] 16
For loops
Suppose that, for some reason, you want to print out sentences of the form “The year is [year]”, where [year] is equal to 2019, 2020, and so on up to 2026. You can do this as follows:
print(paste("The year is", 2019))
print(paste("The year is", 2020))
print(paste("The year is", 2021))
print(paste("The year is", 2022))
print(paste("The year is", 2023))
print(paste("The year is", 2024))
print(paste("The year is", 2025))
print(paste("The year is", 2026))
[1] "The year is 2019"
[1] "The year is 2020"
[1] "The year is 2021"
[1] "The year is 2022"
[1] "The year is 2023"
[1] "The year is 2024"
[1] "The year is 2025"
[1] "The year is 2026"
As you can quickly see, this is rather tedious since you copy the same code chunk over and over again. Rather than doing this, you could use a for loop to write the repetitive parts of the code.
Using a for loop, the code above transforms into:
for (year in 2019:2026) {
  print(paste("The year is", year))
}
[1] "The year is 2019"
[1] "The year is 2020"
[1] "The year is 2021"
[1] "The year is 2022"
[1] "The year is 2023"
[1] "The year is 2024"
[1] "The year is 2025"
[1] "The year is 2026"
The best way to understand this loop is as follows: “For each year that is in the sequence 2019:2026, you execute the code chunk print(paste("The year is", year))”. Once the for loop has executed the code chunk for every year in the vector (i.e. sequence 2019:2026), the loop stops and goes to the first instruction after the loop block.
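A loop can also iterate over positions rather than values, which is handy when you want to store each result instead of printing it. The sketch below is a variation on the example above, not part of the original lab; the names years and sentences are illustrative.

```r
years <- 2019:2026

# Preallocate an empty character vector with one slot per year
sentences <- character(length(years))

# seq_along(years) yields the positions 1, 2, ..., length(years)
for (i in seq_along(years)) {
  sentences[i] <- paste("The year is", years[i])
}

print(sentences)
```

Creating the result vector at its final size before the loop avoids growing it one element at a time.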
🎯 ACTION POINTS
- Suppose you have a new vector of characters:
t <- c("R2-D2", "Chewbacca", "Obi-Wan Kenobi")
- Could you write code that creates a vector that contains all the Star Wars character names, then, for each name, counts the number of characters, then finds the Star Wars character with longest name? Use a `for` loop in your code.
#write your answer here
#Let's start with vector `s` that was defined in the Vectors section
s <- c("Darth Vader", "Luke Skywalker", "Han Solo")
# we modify `s` as shown in the Vectors section to include all possible known Star Wars characters
s[3] <- "Princess Leia"
s <- append(s, c("Han Solo", "Yoda"))

#we append `s` and `t` in a single vector that includes all (known) Star Wars character names
char_names <- append(s, t)

#we now write our `for` loop
vec_chars <- c() #we need a variable (vector) that stores the number of characters per Star Wars character name

# now we start the loop
for (i in char_names) {
  vec_chars[i] <- nchar(i) #each count is stored under the character's name
}
#finding the Star Wars character with the longest name
which(vec_chars == max(vec_chars)) #this returns two values: "Luke Skywalker" (with its index in the vector, i.e. 2) and "Obi-Wan Kenobi" (with its index in the vector, i.e. 8)
Luke Skywalker Obi-Wan Kenobi
             2              8
- Could you write code that creates a vector that contains all the Star Wars character names, then, for each name, counts the number of characters and prints out a line of the form "The character name [name of character] is composed of [x] characters" (e.g "The character name Yoda is composed of 4 characters")?
#write your answer here
#We start as we did before with vectors `s` and `t`
t <- c("R2-D2", "Chewbacca", "Obi-Wan Kenobi")

s <- c("Darth Vader", "Luke Skywalker", "Han Solo")
# we modify `s` as shown in the Vectors section to include all possible known Star Wars characters
s[3] <- "Princess Leia"
s <- append(s, c("Han Solo", "Yoda"))

#we append `s` and `t` in a single vector that includes all (known) Star Wars character names
char_names <- append(s, t)

# we write our `for` loop
for (i in char_names) {
  n_chars <- nchar(i)
  print(paste("The character name", i, "is composed of", n_chars, "characters"))
}
[1] "The character name Darth Vader is composed of 11 characters"
[1] "The character name Luke Skywalker is composed of 14 characters"
[1] "The character name Princess Leia is composed of 13 characters"
[1] "The character name Han Solo is composed of 8 characters"
[1] "The character name Yoda is composed of 4 characters"
[1] "The character name R2-D2 is composed of 5 characters"
[1] "The character name Chewbacca is composed of 9 characters"
[1] "The character name Obi-Wan Kenobi is composed of 14 characters"
Functions
In this lab, we’ve encountered and used quite a few different pre-made functions in R (e.g. c(), replace(), subset(), sample(), rep(), length()), but sometimes you just need to write your own function to tackle your data, i.e. your own set/succession of (reproducible) instructions.
A function is simply a code block that performs a specific task (which can be more or less complex), such as calculating a sum.
You should think of writing a function whenever you’ve copied and pasted a block of code more than twice.
In R, functions are of the form:
name_of_the_function <- function(arguments) {
  function_content
}
You give your function a (meaningful) name (name_of_the_function), define your function arguments (arguments), i.e. the parameters it needs to perform the task it is supposed to perform, and put some content in the function. In function_content, you define how the function should deal with its input/arguments to perform that task.
Here’s, as an example, a very simple function to sum two numbers:
sum_twonumbers <- function(number1, number2) {
  result <- number1 + number2
  return(result)
}
As expected, you can invoke this function multiple times with different parameters and get different sum results e.g:
sum_twonumbers(45, 6)
sum_twonumbers(1007888, 177999)
sum_twonumbers(42, 1337)
[1] 51
[1] 1185887
[1] 1379
🎯 ACTION POINTS
- Could you write a function that takes a vector of characters as an input, counts the number of characters for each element of the vector and prints out a line of the form “The character name [name of character] is composed of [x] characters” (e.g. “The character name Yoda is composed of 4 characters”)?
count_character_names <- function(char_vector) {
  for (name in char_vector) {
    char_count <- nchar(name)
    cat(
      "The character name",
      name,
      "is composed of",
      char_count,
      "characters\n"
    )
  }
}
- Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors (hint: create example vectors with NA in them to test your function).
both_na <- function(vector1, vector2) {
  # Check if vectors are the same length
  if (length(vector1) != length(vector2)) {
    stop("Vectors must be the same length")
  }
  # Count positions where both vectors have NA
  sum(is.na(vector1) & is.na(vector2))
}
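Following the hint in the last action point, here is one way you might try out both functions. The two function definitions are repeated from the solutions above so that the block runs on its own; the test vectors na_test_a and na_test_b are made up for this sketch.

```r
# Definitions repeated from the solutions above
count_character_names <- function(char_vector) {
  for (name in char_vector) {
    cat("The character name", name, "is composed of", nchar(name), "characters\n")
  }
}

both_na <- function(vector1, vector2) {
  if (length(vector1) != length(vector2)) {
    stop("Vectors must be the same length")
  }
  sum(is.na(vector1) & is.na(vector2))
}

# Prints one line per name
count_character_names(c("Yoda", "R2-D2"))

# Made-up test vectors: both are NA at positions 1 and 5 only
na_test_a <- c(NA, 2, NA, 4, NA)
na_test_b <- c(NA, NA, 3, 4, NA)
n_shared_na <- both_na(na_test_a, na_test_b)
print(n_shared_na)

# Vectors of different lengths trigger the error raised by stop()
length_check <- try(both_na(1:2, 1:3), silent = TRUE)
print(inherits(length_check, "try-error"))
```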
Transforming for loops with sapply
for loops are all well and good and they are rather convenient (and quite easy to grasp and write!). But they are not always the most concise or efficient solution when it comes to executing repetitive pieces of code. R supports vectorization, and solutions that make use of the apply family of functions, such as lapply and sapply, are often more compact than explicit for loops. It is often more convenient to use sapply, as it simplifies its output to a vector when it can, whereas lapply always returns a list.
🎯 ACTION POINTS
- Take the for loops from question 9 and see if you can rewrite them using the sapply function. Consult the R documentation to help you with your task or have a look at this link.
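One possible answer is sketched below, assuming "the for loops from question 9" refers to the two Star Wars loops in the For loops section. The vector char_names is rebuilt directly so the block runs on its own, and longest_names and sentences are illustrative names.

```r
# Same names as in the for-loop solutions, redefined so the block is self-contained
char_names <- c(
  "Darth Vader", "Luke Skywalker", "Princess Leia", "Han Solo",
  "Yoda", "R2-D2", "Chewbacca", "Obi-Wan Kenobi"
)

# First loop rewritten: count the characters in every name
# (sapply names each result after the string it was computed from)
vec_chars <- sapply(char_names, nchar)

# Names with the maximum length (ties are all kept)
longest_names <- names(vec_chars)[vec_chars == max(vec_chars)]
print(longest_names)

# Second loop rewritten: build one sentence per name, then print them
sentences <- sapply(
  char_names,
  function(name) {
    paste("The character name", name, "is composed of", nchar(name), "characters")
  },
  USE.NAMES = FALSE
)
writeLines(sentences)
```

Since nchar() is itself vectorised, nchar(char_names) returns the same counts without sapply; the sapply form is shown here because the exercise asks for it.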
Footnotes
1. The equivalent of the Python float type.
2. You can use the R documentation by typing ?c or help(c) in your RStudio console (in VSCode, you would first open a terminal, type R to start an R session and then use the same command as in the RStudio console) or by running ?c or help(c) through an R code block within Quarto.