🛣️ Week 02 Lab - Roadmap (90 min)

2024/25 Autumn Term

Author

Stuart Bramwell, Tabtim Duenger & Yassine Lahna

Welcome to the second DS202A lab!

We’ll begin by exploring several data sets from the psymetadata package. psymetadata contains information on effects and variances used in 22 meta-analyses of several important research topics in the fields of social, developmental, and cognitive psychology. Meta-analyses combine the findings of multiple studies on the same topic to draw broader conclusions. Each of the data sets typically consists of columns such as a unique study identifier (study_id), an effect size (yi) and its variance (vi).

While meta-analyses are not the focus of this course, we think they provide a great opportunity to learn the basics of the tidyverse ecosystem while learning something interesting and concrete about findings in psychology. More generally, we also think that packages such as psymetadata showcase the added value that user-driven programming languages such as R and Python can provide to social scientists. With a few lines of code, we can have access to amazing data about the social world, and psymetadata merely represents the tip of the iceberg.

The main goal is to practice manipulating and plotting some of these studies using dplyr and ggplot.

🥅 Learning Objectives

  • Configure your working environment for the course, including R, RStudio or VS Code, and Quarto documents.
  • Start using markdown to write your notes and code for lab activities.
  • Practice consulting and reading the documentation to understand how a function works.
  • Practice with dplyr to manipulate and structure data.
  • Practice using ggplot to create plots to visualise your data.

📋 Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below as soon as you arrive in the classroom.

🛠 Part 1: Data manipulation with dplyr (40 min)

The dplyr package is part of the tidyverse ecosystem and will be your bread and butter for performing basic data manipulations.

We will get you exploring various data sets by using key dplyr “verbs”, namely (in no particular order!):

  • count: provides counts of qualitative variables
  • filter: keeps / removes rows
  • select: keeps / removes columns
  • rename: renames existing variables
  • mutate: creates new variables
  • summarise: provides statistical summaries of quantitative variables
  • group_by: groups rows so that subsequent calculations are performed per group
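
Once the tidyverse is loaded, these verbs chain together with the pipe (%>%). Here is a quick preview on a made-up tibble (a sketch, not one of the psymetadata studies):

toy <- tibble(
  domain = c("memory", "memory", "updating"),  # a qualitative variable
  yi     = c(0.2, 0.5, 0.1)                    # a quantitative variable
)

toy %>% 
  filter(domain == "memory") %>%           # keep rows
  mutate(yi_doubled = 2 * yi) %>%          # create a new variable
  group_by(domain) %>%                     # set up group-level calculations
  summarise(mean_yi = mean(yi_doubled))    # statistical summary per group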

We will go through all of these verbs in this lab, but part of becoming a proficient R-user (and data scientist in general) comes from reading the documentation, searching online and experimenting. Here are the links to the documentation for the packages we will be using:

Links to documentation:

  • dplyr: https://dplyr.tidyverse.org/
  • ggplot2: https://ggplot2.tidyverse.org/
  • psymetadata: https://cran.r-project.org/package=psymetadata

🎯 ACTION POINTS

The action points below will help you get there. Work in pairs or groups of three. When stuck, try to look at the documentation of the functions you are using. If that doesn’t help, ask your class instructor.

1.1. Click on the link below to download the .qmd file for this lab. Save it in the DS202A folder you created last week. If you need a refresher on the setup, refer back to Part II of last week’s lab.

1.2. In your .qmd file, insert a new code chunk and import the required libraries:

library(ggsci)       # for nice colour themes
library(psymetadata) # imports meta-analyses in psychology
library(tidyverse)   # imports dplyr, ggplot2 and more
library(dplyr)       # for data manipulation
library(tidyr)       # for data reshaping
library(readr)       # for reading data
library(tidyselect)  # for selecting columns

1.3. By loading psymetadata, we already have access to the meta-analyses, so there is no need to load any data separately. Let’s print the Gnambs (2020) study on the colour red and cognitive performance to inspect it (to read the study, click here).


gnambs2020

1.4. As you can see, printing gnambs2020 results in the whole contents of the data frame being printed out.

⚠️ Do not make a habit out of this.

Printing whole data frames is too much information and can severely detract from the overall flow of the document. Remember that, at the end of the day, data scientists must be able to effectively communicate their work to both technical and non-technical users, and it impresses no-one when whole data sets (particularly extremely large ones!) are printed out. Instead, we can print tibbles, which are data frames with special properties.


gnambs2020 %>% 
  as_tibble()

Note that the tibble header tells us the dimensions of the data (nrow(gnambs2020) rows and ncol(gnambs2020) columns) and that we can see the first ten rows plus however many columns fit into the R console. This is a much more compact way of introducing a data frame.

For the majority of these labs, we will work with tibbles instead of base R data frames. This is not only because they have a nicer output when we print them, but also because we can do advanced things like create list columns (more on this later).
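
As a teaser, here is a minimal sketch of a list column using a made-up tibble: each cell of the scores column holds an entire vector.

tibble(
  study  = c("A", "B"),
  scores = list(c(1, 2, 3), c(4, 5))  # a list column: one vector per row
)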

1.5. We can further inspect the data frame using the glimpse() function from the dplyr package. This can be especially useful when you have lots of columns, as printing tibbles typically only shows as many columns as can fit into the R console.


gnambs2020 %>%
  glimpse()

1.6. 👥 DISCUSS IN PAIRS/GROUPS: Time to switch to another study. Type ?barroso2021 in the console. What columns are in the Barroso et al (2021) study? What is this study about?

1.7. Let’s explore another study. Take a look at the Maldonado et al (2020) study, which explores age differences in executive function.


maldonado2020 %>% 
  as_tibble()

As you can see, the study includes different domains (remember to look at ?maldonado2020). We can see which domains are included in the meta-analysis, and how many studies focus on each, by using the count function.


maldonado2020 %>% 
  as_tibble() %>% 
  count(domain)

Suppose we only wanted to compare effect sizes and variances for the “updating” domain. We can employ the filter verb to do this. How many rows do you think will be in this data frame?


maldonado2020 %>% 
  as_tibble() %>% 
  filter(domain == "updating")

1.8. Now, let’s look at the Nuijten et al (2020) study on intelligence. Suppose we were only interested in the unique study ID, effect size and variance. We can select these columns, which produces a data frame with a reduced number of columns.


nuijten2020 %>% 
  as_tibble() %>% 
  select(study_id, yi, vi)

Sometimes, you may want to change variable names to make them more descriptive. yi and vi may not be helpful, so we can change them to something like effect_size and effect_variance by using the rename function.


nuijten2020 %>% 
  as_tibble() %>% 
  select(study_id, yi, vi) %>% 
  rename(effect_size = yi, effect_variance = vi)

Note: you can also rename columns by citing their relative position in the data frame.


nuijten2020 %>% 
  as_tibble() %>% 
  select(study_id, yi, vi) %>% 
  rename(effect_size = 2, effect_variance = 3)

1.9. Now let’s create a new variable. We’ll do this with the Agadullina et al (2018) meta-analysis on out-group entitativity and prejudice. Suppose we want to see how many effect sizes are statistically significant. We can calculate t-values by dividing the effect size by the square root of the effect variance (a.k.a. the standard error). To do this, we can use the mutate function to create a new variable effect_tvalue.

👨🏻‍🏫 TEACHING MOMENT: Your tutor will briefly explain a succinct way of understanding t-values.
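
As a rough guide (and as used in the scatter plot later in this lab), with reasonably large samples an absolute t-value above 1.96 corresponds to statistical significance at the conventional 5% level (two-sided). A quick sketch with hypothetical numbers:

effect_size     <- 0.40  # hypothetical effect size, purely for illustration
effect_variance <- 0.03  # hypothetical variance

abs(effect_size / sqrt(effect_variance)) > 1.96  # TRUE, i.e. significant at p < 0.05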


agadullina2018 %>% 
  as_tibble() %>% 
  # Let's use some of the code in the previous subsection
  select(study_id, yi, vi) %>% 
  rename(effect_size = yi, effect_variance = vi) %>% 
  mutate(effect_tvalue = effect_size / sqrt(effect_variance))

Now that we have created a new column, we may want to calculate some summary statistics. For example, we can ask how big the t-values are, on average. We can pipe in the summarise command to calculate this average. (Note: we want to take the average of the absolute values, so we need to use the abs function wrapped inside mean.)


# We copy and paste the code in the above chunk ...
agadullina2018 %>% 
  as_tibble() %>% 
  select(study_id, yi, vi) %>% 
  rename(effect_size = yi, effect_variance = vi) %>% 
  mutate(effect_tvalue = effect_size / sqrt(effect_variance)) %>%
# ... and pipe in summarise
  summarise(effect_tvalue = mean(abs(effect_tvalue)))

1.10. Finally, suppose we think that there may be average differences in t-values based on the design of the study. Agadullina et al (2018) have data from both cross-sectional observational studies and experiments. We can tweak our pipeline to calculate group means by design using the group_by function.


agadullina2018 %>% 
  as_tibble() %>% 
  # Remember to select the design column
  select(study_id, yi, vi, design) %>% 
  rename(effect_size = yi, effect_variance = vi) %>% 
  mutate(effect_tvalue = effect_size / sqrt(effect_variance)) %>%
  group_by(design) %>% 
  summarise(effect_tvalue = mean(abs(effect_tvalue)))

We can see that, on average, the effects of cross-sectional studies are roughly 2.5 times (6.74 / 2.66) more precisely estimated! We’ll explore this discrepancy graphically in the next part of this lab.

📋 Part 2: Data visualisation with ggplot (50 min)

They say a picture paints a thousand words, and we are, more often than not, inclined to agree with them (whoever “they” are)! Thankfully, we can use perhaps the most widely used R visualisation package, ggplot2 (click here for documentation), to paint such pictures or (more accurately) build such graphs.
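
Every ggplot follows the same recipe: pipe in a data frame, map columns to aesthetics with aes(), then add layers (geoms), labels and themes with +. A minimal sketch with made-up data (not one of the psymetadata studies):

tibble(x = 1:10, y = (1:10)^2) %>% 
  ggplot(aes(x, y)) +  # map columns to the x and y axes
  geom_point() +       # add a layer of points
  labs(x = "x", y = "x squared")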

2.1 Histograms

Suppose we want to plot the distribution of an outcome of interest. We can use a histogram to plot the effect sizes found in Nuijten et al (2020).


nuijten2020 %>% 
  as_tibble() %>% 
  ggplot(aes(yi)) +  # yi = effect size
  geom_histogram()

👥 WORK IN PAIRS/GROUPS:

  • Include the following custom theme:

theme_histogram <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
  
}
  • Change a few parameters in the geom_histogram layer. Set:
      • bins to 100
      • colour to “black”
      • fill to “lightgrey”
      • alpha to 0.5

nuijten2020 %>% 
  as_tibble() %>% 
  ggplot(aes(yi)) +
  geom_histogram() + #modify this line's code!
  theme_histogram() +
  labs(x = "Effect sizes in Nuijten et al (2020)", y = "Count")
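
If you are unsure where these parameters go, geom layers accept styling arguments directly inside their parentheses. A sketch with made-up data and illustrative values (not the ones asked for above):

ggplot(tibble(x = rnorm(500)), aes(x)) +
  geom_histogram(bins = 30, colour = "white", fill = "steelblue", alpha = 0.8)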

🗣️ CLASSROOM DISCUSSION:

Are there any alterations to this graph that would make it better?

2.2 Bar graphs

We counted the number of studies in the Maldonado et al (2020) meta-analysis that featured different domains. We can use ggplot to create a bar graph. We first need to specify what we want on the x and y axes, which constitute the first and second arguments of the aes function.


maldonado2020 %>% 
  as_tibble() %>% 
  count(domain, name = "n_studies") %>% 
  ggplot(aes(n_studies, domain)) +
  geom_col()

👥 WORK IN PAIRS/GROUPS:

  • Include the following custom theme:

theme_bar <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank())
  
}
  • Edit the axes. Use “Number of studies” on the x-axis and “Domain” on the y-axis by adding the labs function.
  • Check out the fct_reorder function to order the domains by the number of studies. You can do this in a separate mutate line or in the aes wrapper.
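
If fct_reorder is new to you, here is a sketch on a made-up tibble (fct_reorder comes from the forcats package, which is loaded with the tidyverse): it reorders the levels of a factor according to another variable, which in turn changes the order of the bars.

tibble(domain = c("a", "b", "c"), n_studies = c(5, 12, 3)) %>% 
  mutate(domain = fct_reorder(domain, n_studies)) %>%  # order levels by n_studies
  ggplot(aes(n_studies, domain)) +
  geom_col()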

👉 NOTE: Don’t like horizontal bars? Well, we can create vertical bars too. Can you guess how we code this?

2.3 Scatter plots

Scatter plots are a great way to graphically summarise the relationship between two quantitative variables. Suppose we wanted to track effect sizes and statistical significance in the intelligence studies found in Nuijten et al. (2020). We can use a scatter plot to visualise this, using colour to distinguish effect sizes by whether or not they reach conventional levels of significance (p < 0.05).


nuijten2020 %>% 
  as_tibble() %>% 
  drop_na(yi) %>% 
  ggplot(aes(year, yi, colour = abs(yi / sqrt(vi)) > 1.96)) +
  geom_point() 

👥 WORK IN PAIRS/GROUPS:

  • Include the following custom theme:
theme_scatter <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom")
  
}
  • Only include studies from 1980 onward.
  • Include a horizontal dashed line using geom_hline (see the sketch after this list).
  • It is hard to see individual studies because many points overlap; try these two things:
      • Replace geom_point with geom_jitter.
      • Add translucency to each point.
  • Replace the colour scheme (try scale_colour_cosmic()).
  • Label the x-axis with “Year”, the y-axis with “Effect sizes in Nuijten et al. (2020)”, and colour with “Statistically significant?”
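
As promised, here is a sketch of geom_hline on made-up data (the yintercept value is illustrative; pick one that makes sense for the effect sizes):

tibble(x = 1:20, y = rnorm(20)) %>% 
  ggplot(aes(x, y)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")  # horizontal dashed reference line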

2.4 Line plots

Line plots can be used to track parameters of interest over time. With Nuijten et al (2020), we can see how many studies were included in their meta-analysis over time.

nuijten2020 %>% 
  as_tibble() %>% 
  count(year) %>% 
  ggplot(aes(year, n)) +
  geom_line()

👥 WORK IN PAIRS/GROUPS:

  • Include the following custom theme:
theme_line <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
  
}
  • Change the layer from a line to an area (note: this may not be suitable in all cases)
  • Fill the area in “midnightblue” and set the alpha to 0.5
  • Capitalise the x-axis label and change the y-axis label to “Number of studies”

👉 NOTE: We think this line plot works in this context, and that it is useful to introduce you to this style of charting progress over time. However, in other contexts filling the area under the line may not work, so do bear this in mind when making decisions!

2.5 Box plots

Box plots are an excellent way to summarise how values are distributed across different strata, using key quantities of interest (e.g. the median and interquartile range). Recall that Agadullina et al (2018) code whether a study is cross-sectional or experimental, and that we found average discrepancies in significance levels between these two strata. We can use box plots to illustrate this graphically.


agadullina2018 %>% 
  as_tibble() %>% 
  mutate(ti = abs(yi / sqrt(vi))) %>% 
  ggplot(aes(design, ti, fill = design)) +
  geom_boxplot()

👥 WORK IN PAIRS/GROUPS:

  • Include the following custom theme:
theme_boxplot <- function() {
  
  theme_minimal() +
  theme(panel.grid = element_blank(),
        legend.position = "none")
  
}
  • Add scale_fill_bmj()
  • Set the alpha values in the box plot layer to 0.75
  • Set the x-axis label to NULL and y-axis to “T-values in Agadullina et al (2018)”

2.6 Density plots

Let’s look at another way of showing the discrepancies found in Agadullina et al (2018): density plots. Although we lose the summary quantities shown in a box plot (e.g. the median and quartiles), we get a more granular view of how t-values are distributed across the different strata.


agadullina2018 %>% 
  as_tibble() %>% 
  mutate(ti = yi / sqrt(vi)) %>% 
  ggplot(aes(abs(ti), fill = design)) +
  geom_density() 

👥 WORK IN PAIRS/GROUPS:

  • Include the following custom theme:
theme_density <- function() {
  
  theme_minimal() +
  theme(panel.grid = element_blank(),
        legend.position = "bottom")
  
}
  • Add scale_fill_lancet()
  • Set the alpha values in the density layer to 0.75
  • Set the x-axis label to “T-values in Agadullina et al (2018)” and y-axis to “Density”