🛣️ LSE DS202A 2025: Week 02 - Lab Roadmap
🎯 Learning Outcomes
By the end of this lab, you will be able to:
- Master core data manipulation techniques with `dplyr` - Filter rows, select and rename columns, create new variables with `mutate`, group data, and calculate summary statistics using a coherent pipeline approach with the pipe operator.
- Transform and reshape data for analysis - Convert data frames to tibbles, handle categorical variables with conditional logic (`if_else`), and efficiently explore data structure using functions like `glimpse()` and `count()`.
- Create effective data visualisations with `ggplot2` - Build multiple chart types including histograms, bar charts, scatter plots, line plots, box plots, and density plots using the grammar of graphics framework.
- Apply visualisation best practices and customisation - Implement custom themes, modify aesthetic properties (colour, transparency, labels), add trend lines, and make informed decisions about appropriate chart types for different data scenarios.
📋 Lab Tasks
📊 Part I: Data manipulation with `dplyr` (40 min)
⚙️ Setup
Download the lab's `.qmd` notebook

Click on the link below to download the `.qmd` file for this lab. Save it in the `DS202A` folder you created last week. If you need a refresher on the setup, refer back to Part II of last week's lab.
Import required libraries:

```r
library(ggsci)
library(psychTools)
library(psymetadata)
library(tidyverse)
```
Try this option if you are having issues with the imports above:

```r
library(dplyr)       # for data manipulation
library(tidyr)       # for data reshaping
library(readr)       # for reading data
library(tidyselect)  # for selecting columns
library(ggplot2)     # for plotting (needed in Part 2)
library(ggsci)       # for nice colour themes
library(psychTools)  # provides the Holzinger-Swineford dataset
library(psymetadata) # imports meta-analyses in psychology
```
No need to load any data here: we already have access to the Holzinger-Swineford dataset by loading `psychTools`. Let's proceed! 🚀
1.3. Let's load the Holzinger-Swineford dataset, which contains mental ability test scores for school children, and inspect it.

```r
hs_data <- holzinger.swineford
hs_data
```
1.4. As you can see, printing `hs_data` prints out the whole contents of the data frame.

⚠️ Do not make a habit of this.

Printing whole data frames is too much information and can severely detract from the overall flow of the document. Remember that, at the end of the day, data scientists must be able to communicate their work effectively to both technical and non-technical audiences, and it impresses no one when whole datasets (particularly extremely large ones!) are printed out. Instead, we can print tibbles, which are data frames with special properties.
```r
hs_data |>
  as_tibble()
```
Note that we can tell the dimensions of the data (`r nrow(hs_data)` rows and `r ncol(hs_data)` columns) and can see the first ten rows and however many columns fit into the R console. This is a much more compact way of introducing a data frame.
Try it out:

```r
ncol(hs_data)
```
For the majority of these labs, we will work with tibbles instead of base R data frames. This is not only because they have a nicer output when we print them, but also because we can do advanced things like create list columns (more on this later).
1.5. We can further inspect the data frame using the `glimpse()` function from the `dplyr` package. This can be especially useful when you have lots of columns, as printing tibbles typically only shows as many columns as fit into the R console.
```r
hs_data |>
  glimpse()
```
1.7. Let's explore the Holzinger-Swineford dataset further. This dataset includes students from different schools and grades.
```r
hs_data |>
  as_tibble()
```
As you can see, the dataset includes different grades and schools. We can see how the grades are distributed, and the number of students in each, by using the `count` function.

```r
hs_data |>
  as_tibble() |>
  count(grade)
```
Suppose we only wanted to look at students from the "Grant-White" school. We can employ the `filter` verb to do this. How many rows do you think will be in this data frame?

```r
hs_data |>
  as_tibble() |>
  filter(school == "Grant-White")
```
1.8. Now, let's focus on specific variables of interest. Suppose we were only interested in the case ID, grade, gender, and age. We can `select` these columns, which produces a data frame with a reduced number of columns.

```r
hs_data |>
  as_tibble() |>
  select(case, grade, female, ageyr)
```
Sometimes, you may want to change variable names to make them more descriptive. `female` and `ageyr` may not be particularly clear, so we can change them to something like `gender` and `age` by using the `rename` function.

```r
hs_data |>
  as_tibble() |>
  select(case, grade, female, ageyr) |>
  rename(gender = female, age = ageyr)
```
Note: you can also rename columns by citing their relative position in the data frame.

```r
hs_data |>
  as_tibble() |>
  select(case, grade, female, ageyr) |>
  rename(gender = 3, age = 4)
```
1.9. Now let's create a new variable. We'll continue working with the Holzinger-Swineford dataset. Suppose we want a more descriptive gender variable. The current `female` variable is coded numerically, so we can create a categorical version using the `mutate` function.

👨🏻‍🏫 TEACHING MOMENT: Your tutor will briefly explain how the `if_else` function works for creating categorical variables.
```r
hs_data |>
  as_tibble() |>
  # Let's use some of the code in the previous subsection
  select(case, grade, female, ageyr) |>
  rename(gender = 3, age = 4) |>
  mutate(gender_cat = if_else(gender == 2, "female", "male"))
```
Now that we have created a new column, we may want to calculate some summary statistics. For example, we can see how many students are in each gender category by piping in the `count` command.

```r
hs_data |>
  as_tibble() |>
  # Let's use some of the code in the previous subsection
  select(case, grade, female, ageyr) |>
  rename(gender = 3, age = 4) |>
  mutate(gender_cat = if_else(gender == 2, "female", "male")) |>
  count(gender_cat)
```
1.10. As a final part, suppose we think there may be average differences in test performance based on gender. We can tweak our pipeline to calculate group means for a specific test (in this case, number recognition) by using the `group_by` function.

```r
hs_data |>
  as_tibble() |>
  # Let's use some of the code in the previous subsection
  select(case, grade, female, ageyr, t15_numbrecg) |>
  rename(gender = 3, age = 4) |>
  mutate(gender_cat = if_else(gender == 2, "female", "male")) |>
  group_by(gender_cat) |>
  summarise(mean_numbrecg = mean(t15_numbrecg))
```
We can see some differences in average performance between the groups! We'll explore these patterns graphically in the next part of this lab.
📊 Part 2: Data visualisation with `ggplot2` (50 min)
They say a picture paints a thousand words, and we are, more often than not, inclined to agree with them (whoever "they" are)! Thankfully, we can use perhaps the most widely used plotting package, `ggplot2` (click here for documentation), to paint such pictures or (more accurately) build such graphs.
📝 Note: Throughout, your instructor will use `ggplot2`, but we encourage you to experiment with different plotting libraries. At the end of the day, what matters is how you design the graph, though `ggplot2` has several nice features which make building professional graphs a lot easier.
2.1 Histograms
Suppose we want to plot the distribution of an outcome of interest. We can use a histogram to plot the cube test scores from the Holzinger-Swineford dataset.
Design Considerations:
- Bin width matters: Too few bins oversimplify the distribution; too many create noise. Start with 20-40 bins for most datasets and adjust based on your data's characteristics.
- Transparency aids comparison: Using alpha transparency (0.5-0.8) allows overlapping distributions to remain visible when comparing groups.
- Reduce visual clutter: Remove unnecessary grid lines, especially vertical ones that compete with the bars for attention.
- Clear labelling: Descriptive axis labels help readers understand what they're viewing without referring to external documentation.
🔥 DESIGN EXERCISE:
Consider how adjusting bin count affects the story your histogram tells. Experiment with transparency levels to find the balance between visibility and clarity. Think about when vertical grid lines add value versus when they create visual noise.

```r
# Code here
```
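One possible starting point for the exercise (a sketch, not a model answer; it assumes the cube test scores are stored in the `t02_cubes` column, following the `psychTools` naming convention):

```r
library(ggplot2)
library(psychTools) # provides holzinger.swineford

# Assumption: cube test scores live in the t02_cubes column
p_hist <- holzinger.swineford |>
  ggplot(aes(x = t02_cubes)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(x = "Cube test score", y = "Count") +
  theme_minimal() +
  # drop the vertical grid lines that compete with the bars
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

p_hist
```

Try swapping `bins = 30` for `bins = 10` and `bins = 60` to see how the apparent shape of the distribution changes.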
2.2 Bar graphs
We counted the number of students in different schools in the Holzinger-Swineford dataset. We can use `ggplot` to create a bar graph. We first need to specify what we want on the x and y axes, which constitute the first and second arguments of the `aes` function.
Design Principles:
- Start from zero: Bar length represents magnitude, so truncated axes can mislead viewers about relative differences.
- Order thoughtfully: Arrange categories by frequency, alphabetically, or by meaningful progression rather than randomly.
- Minimize decoration: Remove unnecessary elements like 3D effects, heavy borders, or excessive grid lines that don't aid comprehension.
- Consider orientation: Horizontal bars work better for long category names and make labels more readable.
🔥 DESIGN EXERCISE:
Practice creating clean, focused bar charts. Consider when to use horizontal versus vertical orientation. Experiment with different ordering strategies and observe how they change the story your visualisation tells.

```r
# Code here
```
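A minimal sketch to get you going, reusing the school counts from Part I (the horizontal orientation is a design choice you should question, since there are only two schools):

```r
library(ggplot2)
library(dplyr)
library(psychTools) # provides holzinger.swineford

# Count students per school, then map the count to bar length.
# geom_col() starts bars at zero by default, as a bar chart should.
p_bar <- holzinger.swineford |>
  count(school) |>
  ggplot(aes(x = n, y = school)) +
  geom_col(fill = "steelblue") +
  labs(x = "Number of students", y = NULL) +
  theme_minimal()

p_bar
```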
2.3 Scatter plots
Scatter plots are the best way to graphically summarise the relationship between two quantitative variables. Suppose we wanted to explore the relationship between visual perception and form board test scores in the Holzinger-Swineford dataset. We can use a scatter plot to visualise this, distinguishing points by school using colour.
Effective Design Strategies:
- Handle overplotting: Use transparency, jittering, or smaller point sizes when dealing with many overlapping points.
- Scale appropriately: Consider log transformations for skewed data to reveal relationships that might be hidden in linear scales.
- Guide the eye: Clear axis labels and appropriate scales help viewers understand the relationship being shown.
- Show uncertainty: Consider adding trend lines or confidence intervals when appropriate to highlight patterns.
🔥 DESIGN EXERCISE:
Explore how different transformations (log, square root) can reveal hidden patterns in your data. Practice using transparency effectively to handle overplotting while maintaining readability.

```r
# Code here
```
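One way you might sketch this scatter plot (assuming the visual perception and form board scores are stored as `t01_visperc` and `t03_frmbord`, per the `psychTools` column naming):

```r
library(ggplot2)
library(psychTools) # provides holzinger.swineford

# Assumption: visual perception = t01_visperc, form board = t03_frmbord
p_scatter <- holzinger.swineford |>
  ggplot(aes(x = t01_visperc, y = t03_frmbord, colour = school)) +
  geom_point(alpha = 0.6) +                # transparency handles overlap
  geom_smooth(method = "lm", se = FALSE) + # one linear trend per school
  labs(x = "Visual perception score",
       y = "Form board score",
       colour = "School") +
  theme_minimal()

p_scatter
```

Setting `se = TRUE` instead would add a confidence ribbon around each trend line, one way of showing uncertainty as suggested above.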
2.4 Line plots
Line plots can be used to track parameters of interest over time. For this section, we'll use the Nuijten et al. (2020) dataset (`nuijten2020`), which contains longitudinal data, to see how many studies were included in their meta-analysis over time.
Design Best Practices:
- Connect meaningfully: Only connect points where the progression between them is meaningful (typically time-based data).
- Choose appropriate styling: Dotted lines can suggest uncertainty or projection; solid lines imply measured data.
- Layer thoughtfully: Combining points with lines helps readers identify individual measurements while seeing the overall trend.
- Scale to show variation: Ensure your y-axis scale reveals meaningful variation without exaggerating minor fluctuations.
- Consider filled areas: Area charts can effectively show cumulative quantities or emphasize the magnitude of change, but use them judiciously.
🔥 DESIGN EXERCISE:
Experiment with different line styles to convey different types of information. Consider when adding area fill enhances understanding versus when it creates confusion. Practice setting appropriate time axis intervals that match your data's natural rhythm.

```r
# Code here
```
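A heavily hedged sketch of the idea: the column names `year` and `study_id` below are assumptions about `nuijten2020`, so run `glimpse(nuijten2020)` first and substitute the actual names before using this.

```r
library(ggplot2)
library(dplyr)
library(psymetadata) # provides nuijten2020

# Assumptions: nuijten2020 has a publication-year column named `year`
# and a study identifier named `study_id` (check glimpse(nuijten2020)).
# distinct() keeps one row per study, since meta-analytic datasets
# often contain several effect sizes per study.
p_line <- nuijten2020 |>
  distinct(study_id, year) |>
  count(year) |>
  ggplot(aes(x = year, y = n)) +
  geom_area(fill = "steelblue", alpha = 0.3) + # filled area shows magnitude
  geom_line(colour = "steelblue") +
  geom_point(colour = "steelblue") +           # points mark measurements
  labs(x = "Publication year", y = "Number of studies") +
  theme_minimal()

p_line
```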
📝 NOTE: We think this line plot works in this context, and that it is useful to introduce you to this style of charting progress over time. However, for other contexts, filling the area under the line may not work, so do bear this in mind when making decisions!
2.5 Box plots
Box plots are an excellent way to summarise, across different strata, key quantities of interest in a distribution (e.g. median, interquartile range). We can use the Holzinger-Swineford dataset to explore differences in test performance across grades, using box plots to illustrate this graphically.
Design Considerations:
- Handle extreme values: Log scales can be essential when comparing groups with very different ranges or when dealing with skewed data.
- Simplify visual elements: Remove unnecessary grid lines that don't aid in reading values or making comparisons.
- Provide context: Clear group labels and axis titles help viewers understand what comparisons they're making.
- Consider alternatives: Violin plots or strip charts might better serve your purpose when sample sizes are small or when showing full distributions is important.
🔥 DESIGN EXERCISE:
Practice deciding when log transforms reveal meaningful patterns. Experiment with removing different grid elements to create cleaner, more focused visualisations.

```r
# Code here
```
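A possible sketch (again assuming the `t01_visperc` column holds visual perception scores; any test column would work here):

```r
library(ggplot2)
library(dplyr)
library(psychTools) # provides holzinger.swineford

# grade is stored numerically, so convert it to a factor
# to get one box per grade rather than a single box
p_box <- holzinger.swineford |>
  mutate(grade = factor(grade)) |>
  ggplot(aes(x = grade, y = t01_visperc)) +
  geom_boxplot(fill = "steelblue", alpha = 0.5) +
  labs(x = "Grade", y = "Visual perception score") +
  theme_minimal()

p_box
```

Swapping `geom_boxplot()` for `geom_violin()` is one way to try the "consider alternatives" suggestion above.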
2.6 Density plots
Let's look at another way of showing differences in the Holzinger-Swineford dataset, namely density plots. Although we cannot see exact quantities, we can get a better sense of how test scores are distributed across different groups (in this case, gender) in a more detailed manner.
Effective Design Elements:
- Transform when needed: Log scales can reveal patterns in skewed data that would be invisible on linear scales.
- Minimize visual noise: Reduce grid lines and other decorative elements that don't contribute to understanding.
- Consider bandwidth: The smoothing parameter affects how much detail versus generalization your plot shows.
- Enable comparison: When showing multiple densities, use transparency and distinct colors to enable easy comparison.
🔥 DESIGN EXERCISE:
Explore how different transformations affect the insights you can draw from density plots. Practice balancing detail with clarity by adjusting smoothing parameters.

```r
# Code here
```
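One possible sketch, reusing the gender recoding and the number recognition column (`t15_numbrecg`) from Part I:

```r
library(ggplot2)
library(dplyr)
library(psychTools) # provides holzinger.swineford

# Reuse the recoding from Part I: female == 2 is "female", otherwise "male"
p_dens <- holzinger.swineford |>
  mutate(gender_cat = if_else(female == 2, "female", "male")) |>
  ggplot(aes(x = t15_numbrecg, fill = gender_cat, colour = gender_cat)) +
  geom_density(alpha = 0.4) + # transparency lets both curves stay visible
  labs(x = "Number recognition score",
       fill = "Gender", colour = "Gender") +
  theme_minimal()

p_dens
```

The `adjust` argument of `geom_density()` (e.g. `adjust = 0.5` or `adjust = 2`) is the smoothing parameter to experiment with.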
Universal Design Principles
Across all visualisation types, several principles enhance effectiveness:
Accessibility: Ensure your visualisations work for colourblind viewers and can be understood in black and white. Use patterns, shapes, and positioning alongside colour.
Clarity: Every element should serve a purpose. Remove or de-emphasise anything that doesn't directly contribute to understanding your data.
Context: Provide enough information for viewers to understand what they're seeing without overwhelming them with unnecessary detail.
Consistency: Use consistent scales, colours, and styling across related visualisations to enable easy comparison.
Focus: Direct attention to the most important insights through strategic use of colour, size, and positioning.
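As one way to put the accessibility principle into practice: `ggplot2` ships with the colourblind-friendly viridis scales, and varying line type alongside fill means the plot also survives black-and-white printing. A minimal sketch (the `t01_visperc` column is an assumption, as above):

```r
library(ggplot2)
library(psychTools) # provides holzinger.swineford

# scale_fill_viridis_d() is perceptually uniform and colourblind-friendly;
# mapping linetype to school as well keeps the groups distinguishable
# even without colour
p_access <- holzinger.swineford |>
  ggplot(aes(x = t01_visperc, fill = school, linetype = school)) +
  geom_density(alpha = 0.5) +
  scale_fill_viridis_d() +
  labs(x = "Visual perception score",
       fill = "School", linetype = "School") +
  theme_minimal()

p_access
```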
Remember: the goal of data visualisation is to facilitate understanding and insight, not to showcase technical capabilities. Always prioritise clarity and accessibility over visual complexity.