🛣️ LSE DS202A 2025: Week 02 - Lab Roadmap
🎯 Learning Outcomes
By the end of this lab, you will be able to:
- Master core data manipulation techniques with `dplyr` - Filter rows, select and rename columns, create new variables with `mutate`, group data, and calculate summary statistics using a coherent pipeline approach with the pipe operator.
- Transform and reshape data for analysis - Convert data frames to tibbles, handle categorical variables with conditional logic (`if_else`), and efficiently explore data structure using functions like `glimpse()` and `count()`.
- Create effective data visualisations with `ggplot2` - Build multiple chart types including histograms, bar charts, scatter plots, line plots, box plots, and density plots using the grammar of graphics framework.
- Apply visualisation best practices and customisation - Implement custom themes, modify aesthetic properties (colour, transparency, labels), add trend lines, and make informed decisions about appropriate chart types for different data scenarios.
📋 Lab Tasks
📊 Part I: Data manipulation with `dplyr` (40 min)
⚙️ Setup
Download the lab's `.qmd` notebook

Click on the link below to download the `.qmd` file for this lab. Save it in the `DS202A` folder you created last week. If you need a refresher on the setup, refer back to Part II of last week's lab.
Import required libraries:

```r
library(ggsci)
library(psychTools)
library(psymetadata)
library(tidyverse)
```
Try this option if you are having issues with the imports above:

```r
library(dplyr)       # for data manipulation
library(tidyr)       # for data reshaping
library(readr)       # for reading data
library(tidyselect)  # for selecting columns
library(ggplot2)     # for plotting (needed in Part 2)
library(ggsci)       # for nice colour themes
library(psychTools)  # provides the Holzinger-Swineford dataset
library(psymetadata) # imports meta-analyses in psychology
```
No need to load any data here: we already have access to the Holzinger-Swineford dataset by loading `psychTools`. Let's proceed! 🚀
1.3. Let's load the Holzinger-Swineford dataset, which contains mental ability test scores for school children, and inspect it.

```r
hs_data <- holzinger.swineford
hs_data
```
1.4. As you can see, printing `hs_data` prints out the whole contents of the data frame.

⚠️ Do not make a habit of this.

Printing whole data frames is too much information and can severely detract from the overall flow of the document. Remember that, at the end of the day, data scientists must be able to communicate their work effectively to both technical and non-technical audiences, and it impresses no one when whole datasets (particularly extremely large ones!) are printed out. Instead, we can print tibbles, which are data frames with special properties.
```r
hs_data |>
  as_tibble()
```
Note that we can tell the dimensions of the data (`r nrow(hs_data)` rows and `r ncol(hs_data)` columns) and can see the first ten rows and however many columns fit into the R console. This is a much more compact way of introducing a data frame.
Try it out:

```r
ncol(hs_data)
```
For the majority of these labs, we will work with tibbles instead of base R data frames. This is not only because they have a nicer output when we print them, but also because we can do advanced things like create list columns (more on this later).
1.5. We can further inspect the data frame using the `glimpse()` function from the `dplyr` package. This can be especially useful when you have lots of columns, as printing tibbles typically only shows as many columns as fit into the R console.
```r
hs_data |>
  glimpse()
```
1.7. Let's explore the Holzinger-Swineford dataset further. This dataset includes students from different schools and grades.
```r
hs_data |>
  as_tibble()
```
As you can see, the dataset includes different grades and schools. We can see how the grades are distributed, and the number of students in each, by using the `count` function.

```r
hs_data |>
  as_tibble() |>
  count(grade)
```
Suppose we only wanted to look at students from the "Grant-White" school. We can employ the `filter` verb to do this. How many rows do you think will be in this data frame?

```r
hs_data |>
  as_tibble() |>
  filter(school == "Grant-White")
```
1.8. Now, let's focus on specific variables of interest. Suppose we were only interested in the case ID, grade, gender, and age. We can `select` these columns, which produces a data frame with a reduced number of columns.

```r
hs_data |>
  as_tibble() |>
  select(case, grade, female, ageyr)
```
Sometimes, you may want to change variable names to make them more descriptive. `female` and `ageyr` may not be particularly clear, so we can change them to something like `gender` and `age` by using the `rename` function.

```r
hs_data |>
  as_tibble() |>
  select(case, grade, female, ageyr) |>
  rename(gender = female, age = ageyr)
```
Note: you can also rename columns by citing their relative position in the data frame.

```r
hs_data |>
  as_tibble() |>
  select(case, grade, female, ageyr) |>
  rename(gender = 3, age = 4)
```
1.9. Now let's create a new variable. We'll continue working with the Holzinger-Swineford dataset. Suppose we want a more descriptive gender variable. The current `female` variable is coded numerically, so we can create a categorical version using the `mutate` function.

👨🏻‍🏫 TEACHING MOMENT: Your tutor will briefly explain how the `if_else` function works for creating categorical variables.
```r
hs_data |>
  as_tibble() |>
  # Let's use some of the code in the previous subsection
  select(case, grade, female, ageyr) |>
  rename(gender = 3, age = 4) |>
  mutate(gender_cat = if_else(gender == 2, "female", "male"))
```
Now that we have created a new column, we may want to calculate some summary statistics. For example, we can see how many students are in each gender category by piping in the `count` command.

```r
hs_data |>
  as_tibble() |>
  # Let's use some of the code in the previous subsection
  select(case, grade, female, ageyr) |>
  rename(gender = 3, age = 4) |>
  mutate(gender_cat = if_else(gender == 2, "female", "male")) |>
  count(gender_cat)
```
1.10. As a final part, suppose we think there may be average differences in test performance based on gender. We can tweak our pipeline to calculate group means for a specific test (in this case, number recognition) by using the `group_by` function.

```r
hs_data |>
  as_tibble() |>
  # Let's use some of the code in the previous subsection
  select(case, grade, female, ageyr, t15_numbrecg) |>
  rename(gender = 3, age = 4) |>
  mutate(gender_cat = if_else(gender == 2, "female", "male")) |>
  group_by(gender_cat) |>
  summarise(mean_numbrecg = mean(t15_numbrecg))
```
We can see some differences in average performance between the groups! We'll explore these patterns graphically in the next part of this lab.
📊 Part 2: Data visualisation with `ggplot2` (50 min)
They say a picture paints a thousand words, and we are, more often than not, inclined to agree with them (whoever "they" are)! Thankfully, we can use perhaps the most widely used plotting package, `ggplot2` (click here for documentation), to paint such pictures or (more accurately) build such graphs.
📝 Note: Throughout, your instructor will use `ggplot2`, but we encourage you to experiment with different plotting libraries. At the end of the day, what matters is how you design the graph, though `ggplot2` has several nice features which make building professional graphs a lot easier.
2.1 Histograms
Suppose we want to plot the distribution of an outcome of interest. We can use a histogram to plot the cube test scores from the Holzinger-Swineford dataset.
Design Considerations:
- Bin width matters: Too few bins oversimplify the distribution; too many create noise. Start with 20-40 bins for most datasets and adjust based on your data's characteristics.
- Transparency aids comparison: Using alpha transparency (0.5-0.8) allows overlapping distributions to remain visible when comparing groups.
- Reduce visual clutter: Remove unnecessary grid lines, especially vertical ones that compete with the bars for attention.
- Clear labelling: Descriptive axis labels help readers understand what they're viewing without referring to external documentation.
🔥 DESIGN EXERCISE:
Consider how adjusting bin count affects the story your histogram tells. Experiment with transparency levels to find the balance between visibility and clarity. Think about when vertical grid lines add value versus when they create visual noise.

```r
# Code here
```
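One possible starting point for the exercise (a sketch, not a model answer; it assumes the cube test scores are stored in the `t02_cubes` column, following the `psychTools` naming convention):

```r
library(ggplot2)
library(psychTools) # provides holzinger.swineford

# Assumption: cube test scores live in the t02_cubes column
p_hist <- holzinger.swineford |>
  ggplot(aes(x = t02_cubes)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(x = "Cube test score", y = "Count") +
  theme_minimal() +
  # drop the vertical grid lines that compete with the bars
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

p_hist
```

Try swapping `bins = 30` for `bins = 10` and `bins = 60` to see how the apparent shape of the distribution changes.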
2.2 Bar graphs
We counted the number of students in different schools in the Holzinger-Swineford dataset. We can use `ggplot` to create a bar graph. We first need to specify what we want on the x and y axes, which constitute the first and second arguments of the `aes` function.
Design Principles:
- Start from zero: Bar length represents magnitude, so truncated axes can mislead viewers about relative differences.
- Order thoughtfully: Arrange categories by frequency, alphabetically, or by meaningful progression rather than randomly.
- Minimize decoration: Remove unnecessary elements like 3D effects, heavy borders, or excessive grid lines that don't aid comprehension.
- Consider orientation: Horizontal bars work better for long category names and make labels more readable.
🔥 DESIGN EXERCISE:
Practice creating clean, focused bar charts. Consider when to use horizontal versus vertical orientation. Experiment with different ordering strategies and observe how they change the story your visualisation tells.

```r
# Code here
```
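A minimal sketch to get you going, reusing the school counts from Part I (the horizontal orientation is a design choice you should question, since there are only two schools):

```r
library(ggplot2)
library(dplyr)
library(psychTools) # provides holzinger.swineford

# Count students per school, then map the count to bar length.
# geom_col() starts bars at zero by default, as a bar chart should.
p_bar <- holzinger.swineford |>
  count(school) |>
  ggplot(aes(x = n, y = school)) +
  geom_col(fill = "steelblue") +
  labs(x = "Number of students", y = NULL) +
  theme_minimal()

p_bar
```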
2.3 Scatter plots
Scatter plots are the best way to graphically summarise the relationship between two quantitative variables. Suppose we wanted to explore the relationship between visual perception and form board test scores in the Holzinger-Swineford dataset. We can use a scatter plot to visualise this, distinguishing points by school using colour.
Effective Design Strategies:
- Handle overplotting: Use transparency, jittering, or smaller point sizes when dealing with many overlapping points.
- Scale appropriately: Consider log transformations for skewed data to reveal relationships that might be hidden in linear scales.
- Guide the eye: Clear axis labels and appropriate scales help viewers understand the relationship being shown.
- Show uncertainty: Consider adding trend lines or confidence intervals when appropriate to highlight patterns.
🔥 DESIGN EXERCISE:
Explore how different transformations (log, square root) can reveal hidden patterns in your data. Practice using transparency effectively to handle overplotting while maintaining readability.

```r
# Code here
```
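One way you might sketch this scatter plot (assuming the visual perception and form board scores are stored as `t01_visperc` and `t03_frmbord`, per the `psychTools` column naming):

```r
library(ggplot2)
library(psychTools) # provides holzinger.swineford

# Assumption: visual perception = t01_visperc, form board = t03_frmbord
p_scatter <- holzinger.swineford |>
  ggplot(aes(x = t01_visperc, y = t03_frmbord, colour = school)) +
  geom_point(alpha = 0.6) +                # transparency handles overlap
  geom_smooth(method = "lm", se = FALSE) + # one linear trend per school
  labs(x = "Visual perception score",
       y = "Form board score",
       colour = "School") +
  theme_minimal()

p_scatter
```

Setting `se = TRUE` instead would add a confidence ribbon around each trend line, one way of showing uncertainty as suggested above.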
2.4 Line plots
Line plots can be used to track parameters of interest over time. For this section, we'll use the Nuijten et al. (2020) dataset (`nuijten2020`), which contains longitudinal data, to see how many studies were included in their meta-analysis over time.
Design Best Practices:
- Connect meaningfully: Only connect points where the progression between them is meaningful (typically time-based data).
- Choose appropriate styling: Dotted lines can suggest uncertainty or projection; solid lines imply measured data.
- Layer thoughtfully: Combining points with lines helps readers identify individual measurements while seeing the overall trend.
- Scale to show variation: Ensure your y-axis scale reveals meaningful variation without exaggerating minor fluctuations.
- Consider filled areas: Area charts can effectively show cumulative quantities or emphasize the magnitude of change, but use them judiciously.
🔥 DESIGN EXERCISE:
Experiment with different line styles to convey different types of information. Consider when adding area fill enhances understanding versus when it creates confusion. Practice setting appropriate time axis intervals that match your data's natural rhythm.

```r
# Code here
```
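A heavily hedged sketch of the idea: the column names `year` and `study_id` below are assumptions about `nuijten2020`, so run `glimpse(nuijten2020)` first and substitute the actual names before using this.

```r
library(ggplot2)
library(dplyr)
library(psymetadata) # provides nuijten2020

# Assumptions: nuijten2020 has a publication-year column named `year`
# and a study identifier named `study_id` (check glimpse(nuijten2020)).
# distinct() keeps one row per study, since meta-analytic datasets
# often contain several effect sizes per study.
p_line <- nuijten2020 |>
  distinct(study_id, year) |>
  count(year) |>
  ggplot(aes(x = year, y = n)) +
  geom_area(fill = "steelblue", alpha = 0.3) + # filled area shows magnitude
  geom_line(colour = "steelblue") +
  geom_point(colour = "steelblue") +           # points mark measurements
  labs(x = "Publication year", y = "Number of studies") +
  theme_minimal()

p_line
```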
📝 NOTE: We think this line plot works in this context, and that it is useful to introduce you to this style of charting progress over time. However, for other contexts, filling the area under the line may not work, so do bear this in mind when making decisions!
2.5 Box plots
Box plots are an excellent way to summarise, across different strata, key quantities of interest in a distribution (e.g. median, interquartile range). We can use the Holzinger-Swineford dataset to explore differences in test performance across grades, using box plots to illustrate this graphically.
Design Considerations:
- Handle extreme values: Log scales can be essential when comparing groups with very different ranges or when dealing with skewed data.
- Simplify visual elements: Remove unnecessary grid lines that don't aid in reading values or making comparisons.
- Provide context: Clear group labels and axis titles help viewers understand what comparisons they're making.
- Consider alternatives: Violin plots or strip charts might better serve your purpose when sample sizes are small or when showing full distributions is important.
🔥 DESIGN EXERCISE:
Practice deciding when log transforms reveal meaningful patterns. Experiment with removing different grid elements to create cleaner, more focused visualisations.

```r
# Code here
```
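A possible sketch (again assuming the `t01_visperc` column holds visual perception scores; any test column would work here):

```r
library(ggplot2)
library(dplyr)
library(psychTools) # provides holzinger.swineford

# grade is stored numerically, so convert it to a factor
# to get one box per grade rather than a single box
p_box <- holzinger.swineford |>
  mutate(grade = factor(grade)) |>
  ggplot(aes(x = grade, y = t01_visperc)) +
  geom_boxplot(fill = "steelblue", alpha = 0.5) +
  labs(x = "Grade", y = "Visual perception score") +
  theme_minimal()

p_box
```

Swapping `geom_boxplot()` for `geom_violin()` is one way to try the "consider alternatives" suggestion above.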
2.6 Density plots
Let's look at another way of showing differences in the Holzinger-Swineford dataset, namely density plots. Although we cannot see exact quantities, we can get a better sense of how test scores are distributed across different groups (in this case, gender) in a more detailed manner.
Effective Design Elements:
- Transform when needed: Log scales can reveal patterns in skewed data that would be invisible on linear scales.
- Minimize visual noise: Reduce grid lines and other decorative elements that don't contribute to understanding.
- Consider bandwidth: The smoothing parameter affects how much detail versus generalization your plot shows.
- Enable comparison: When showing multiple densities, use transparency and distinct colors to enable easy comparison.
🔥 DESIGN EXERCISE:
Explore how different transformations affect the insights you can draw from density plots. Practice balancing detail with clarity by adjusting smoothing parameters.

```r
# Code here
```
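One possible sketch, reusing the gender recoding and the number recognition column (`t15_numbrecg`) from Part I:

```r
library(ggplot2)
library(dplyr)
library(psychTools) # provides holzinger.swineford

# Reuse the recoding from Part I: female == 2 is "female", otherwise "male"
p_dens <- holzinger.swineford |>
  mutate(gender_cat = if_else(female == 2, "female", "male")) |>
  ggplot(aes(x = t15_numbrecg, fill = gender_cat, colour = gender_cat)) +
  geom_density(alpha = 0.4) + # transparency lets both curves stay visible
  labs(x = "Number recognition score",
       fill = "Gender", colour = "Gender") +
  theme_minimal()

p_dens
```

The `adjust` argument of `geom_density()` (e.g. `adjust = 0.5` or `adjust = 2`) is the smoothing parameter to experiment with.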
Universal Design Principles
Across all visualisation types, several principles enhance effectiveness:
Accessibility: Ensure your visualisations work for colourblind viewers and can be understood in black and white. Use patterns, shapes, and positioning alongside colour.
Clarity: Every element should serve a purpose. Remove or de-emphasise anything that doesn't directly contribute to understanding your data.
Context: Provide enough information for viewers to understand what they're seeing without overwhelming them with unnecessary detail.
Consistency: Use consistent scales, colours, and styling across related visualisations to enable easy comparison.
Focus: Direct attention to the most important insights through strategic use of colour, size, and positioning.
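As one way to put the accessibility principle into practice: `ggplot2` ships with the colourblind-friendly viridis scales, and varying line type alongside fill means the plot also survives black-and-white printing. A minimal sketch (the `t01_visperc` column is an assumption, as above):

```r
library(ggplot2)
library(psychTools) # provides holzinger.swineford

# scale_fill_viridis_d() is perceptually uniform and colourblind-friendly;
# mapping linetype to school as well keeps the groups distinguishable
# even without colour
p_access <- holzinger.swineford |>
  ggplot(aes(x = t01_visperc, fill = school, linetype = school)) +
  geom_density(alpha = 0.5) +
  scale_fill_viridis_d() +
  labs(x = "Visual perception score",
       fill = "School", linetype = "School") +
  theme_minimal()

p_access
```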
Remember: the goal of data visualisation is to facilitate understanding and insight, not to showcase technical capabilities. Always prioritise clarity and accessibility over visual complexity.