✅ A model solution for the W04 formative
What follows is a possible solution for the W04 formative. If you want to render the .qmd that was used to generate this page for yourselves, use the download button below:
⚙️ Setup
We start, as usual, by loading libraries.
#| echo: TRUE
#| warning: FALSE
#| message: FALSE
# Start by loading the required libraries
library(tidyverse)
library(tidymodels) # For modeling
library(readr)
library(janitor) # For cleaning column names
library(knitr) # For nice tables
library(ggplot2)
library(naniar)
library(countrycode)
library(patchwork)
library(ggcorrplot)
library(ggrepel)
library(reshape2)
library(magrittr)For cleaner formatting (and coding), it’s good practice to load all the libraries you’ll be using in your .qmd at the beginning of the file.
Question 1
Loading the data
Importing the data should only require the read_csv method.
# Load VCI dataset
vci_raw <- read_csv("../../data/VCI_data_2018-24_by_demographic.csv")Rows: 8165 Columns: 9
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): country, code, demo_type, demographic
dbl (5): year, children, safe, effective, beliefs
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.When loading the data, you should only ever use relative paths and not absolute paths (i.e not full paths that are specific to your machine).
Doing so ensures your code is more reproducible!
Data inspection
We quickly inspect the content of the loaded dataframe:
# Inspect structure
glimpse(vci_raw)Rows: 8,165
Columns: 9
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"…
$ code <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF"…
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015…
$ demo_type <chr> "Age", "Age", "Age", "Age", "Age", "Age", "Education", "Ed…
$ demographic <chr> "18-24", "25-34", "35-44", "45-54", "55-64", "65", "Primar…
$ children <dbl> 96, 94, 97, 97, 97, 100, 95, 98, 98, 98, 94, 99, 97, 97, 9…
$ safe <dbl> 89, 90, 91, 90, 89, 100, 89, 92, 91, 92, 88, 94, 90, 92, 8…
$ effective <dbl> 93, 92, 92, 95, 95, 98, 92, 95, 96, 94, 92, 96, 94, 94, 92…
$ beliefs <dbl> 89, 87, 87, 84, 85, 96, 85, 91, 97, 90, 85, 93, 84, 87, 87…There will be no need to clean the column names here: they are lowercase and consistent.
vci_raw %>% distinct(year)# A tibble: 8 × 1
year
<dbl>
1 2015
2 2018
3 2019
4 2020
5 2021
6 2022
7 2023
8 2024The dataframe contains data for multiple years. Let’s check how many countries had data collected from per year:
# Count unique countries per year
countries_per_year <- vci_raw %>%
group_by(year) %>%
summarise(unique_countries = n_distinct(country)) %>%
arrange(desc(unique_countries))
countries_per_year# A tibble: 8 × 2
year unique_countries
<dbl> <int>
1 2018 144
2 2015 70
3 2023 70
4 2020 53
5 2019 41
6 2022 28
7 2021 15
8 2024 12The year with the most countries is 2018, providing the best coverage for analysis (i.e a sufficient number of data points for analysis). Other candidate years could be 2015 or 2023: while fewer countries had data collected for those years, the analysis of these years might still be interesting (2015 allows for the analysis of earlier trends while 2023 allows for the analysis of much more recent trends).
Choosing a year is a trade-off between coverage (number of countries) and recency/relevance of data. This inspection ensures the dataset we select for modeling has sufficient observations.
Question 2
Year selection and missing data handling
Let’s say we choose to analyse the year 2018 (the year with most data). Let’s select the data for that year in the VCP dataset:
# Select a single year for analysis
vci_year <- vci_raw %>%
filter(year == 2018)Before we aggregate the data per country, we need to check missingness patterns in the data and address missing responses before aggregation. If we don’t, the weighted means per country that we will calculate will be biased.
vci_year %>%
summarise(across(c(children, safe, effective, beliefs), ~sum(is.na(.))))# A tibble: 1 × 4
children safe effective beliefs
<int> <int> <int> <int>
1 0 0 0 2187There are no missing values in children, safe or effective. There are 2187 missing values in beliefs, which means that all values for this variable are missing: we won’t be able to compute a weighted average for this variable so we discard it. It is safe to do so as we see that only beliefs has missing values (2187 missing), while the other variables (children, safe, effective) have zero missing. That’s a strong hint that the missingness is completely at random (MCAR), because the missingness is not associated with any other variable in your dataset. But just to be thorough, let’s show that the missingness mechanism is indeed MCAR:
# Visualise missingness
p <- vis_miss(vci_year) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
labs(title = "Missing Data by Variable")
# View interactively (in RStudio viewer pane)
print(p)The visualisation makes it clear that the pattern is indeed MCAR. And our only missing values are the beliefs and code (i.e country code) columns, which we won’t use for analysis. So we can essentially just drop these two columns.
vci_year <- vci_year %>% select(-c(code,beliefs))Now we’re all set to aggregate our indicators.
Objective: Aggregating VCP Indicators to the Country Level
We aim to construct country-level vaccine confidence indicators (VCIs) for three core dimensions:
children: “Vaccines are important for children”
safe: “Vaccines are safe”
effective: “Vaccines are effective”
These indicators should reflect the population-weighted national average attitude, accounting for demographic heterogeneity. Some empirical research consistently shows that vaccine confidence varies systematically by age and, to a lesser extent, by gender [Mondal, Sinharoy, and Su (2021)](Presotto et al. 2022) (so when constructing a weighting model we’ll prefer an age-dominated model). Ignoring demographic structure would yield misleading national estimates—for example, overrepresenting older, more vaccine-confident cohorts in aging populations.
The target estimand for any indicator (X) (e.g., agreement that “vaccines are safe”) is the demographically weighted mean:
\[ \bar{X}_{\text{country}} = \frac{\sum_{a \in \mathcal{A}} \sum_{g \in \mathcal{G}} N_{a,g} \cdot X_{a,g}}{\sum_{a,g} N_{a,g}}, \]
where:
- \(\mathcal{A}\) = set of age groups (e.g., 18–24, 25–34, …, 65+)
- \(\mathcal{G}\) = {Male, Female}
- \(N_{a,g}\) = nominal (absolute) population size of age group (a) and gender (g) (from demographic data)
- \(X_{a,g}\) = true mean response in subgroup ((a,g))
Note: We use absolute population counts (e.g., 3,058,081 males aged 18–24 in Afghanistan), not percentages, to ensure correct weighting.
The core challenge: The VCI survey reports only marginal means—by age \((A_a)\) and by gender \((G_g)\)—but not the joint means \(X_{a,g}\). To proceed, we must adopt a structural model that links the unobserved \(X_{a,g}\) to the observed marginals.
Statistical Framework and Notation
Let \(X\) denote the latent vaccine-confidence score (measured as % agreement). We define:
\(A_a = \mathbb{E}[X \mid \text{Age} = a]\): Age-marginal mean — the average response among all individuals in age group (a), regardless of gender. > Example: If VCI reports
children = 97for"18-24", this means 97% of all 18–24-year-olds (men and women combined) agree that “vaccines are important for children.”\(G_g = \mathbb{E}[X \mid \text{Gender} = g]\): Gender-marginal mean — the average response among all individuals of gender (\(g\)), regardless of age. > Example: If VCI reports
children = 96for"Male", this means 96% of all men (across all ages) agree with the statement.\(X_{a,g} = \mathbb{E}[X \mid \text{Age} = a, \text{Gender} = g]\): Joint mean — the average response among individuals who are both in age group (\(a\)) and gender (g).
> Example: The (unobserved) percentage of women aged 25–34 who agree. This is what we need for precise weighting, but it is not reported in the VCI.\(\mu = \mathbb{E}[X]\): Grand population mean — the true national average attitude, which is our ultimate target.
The grand mean can be computed from age marginals and age-group populations: \[ \mu = \sum_{a} w_a A_a, \quad \text{where} \quad w_a = \frac{N_a}{\sum_{a'} N_{a'}}, \quad N_a = \sum_{g} N_{a,g}. \]
Here, \(\mu\) serves as a normalization constant that ensures coherence between age and gender effects in additive/multiplicative models.
Structural Models for Joint Age–Gender Effects
Because \(X_{a,g}\) is unobserved, we evaluate three plausible structural assumptions grounded in empirical vaccine-confidence literature. All models preserve the true age–gender population structure via weights \(N_{a,g}\); none assume demographic independence.
Model A: Age-Dominant (Primary Specification)
\[ X_{a,g} = A_a \]
Interpretation: Vaccine confidence depends only on age; gender differences within age groups are negligible. This implies \(\mathbb{E}[X \mid a, g] = f(a)\).
Assumptions:
- Gender has no independent effect on confidence after conditioning on age.
- Population age–gender structure (\(N_{a,g}\)) is fully accounted for in weighting.
Justification: Consistent with empirical findings that age effects dominate (Mondal, Sinharoy, and Su 2021) (Presotto et al. 2022). Mathematically equivalent to assuming zero gender-by-age interaction.
Model B: Additive Effects (Sensitivity Analysis 1)
\[ X_{a,g} = A_a + (G_g - \mu) \]
Interpretation: Age and gender exert independent additive effects. The term \((G_g - \mu)\) represents the gender-specific deviation from the national mean.
Why subtract (\(\mu\))?
This centers the gender effect so that:
\[
\sum_g \frac{N_{a,g}}{N_a} X_{a,g} = A_a \quad \text{(preserves age marginal)}
\]
\[
\sum_a \frac{N_{a,g}}{N_g} X_{a,g} = G_g \quad \text{(preserves gender marginal)}
\]
Assumptions:
- Gender effect is constant across all age groups (e.g., women score 2 points higher than men at every age).
- No age–gender interaction; effects are linear and separable.
Limitation: May produce estimates outside [0,100] if effects are large (though rare in practice).
Model C: Multiplicative Effects (Sensitivity Analysis 2)
\[ X_{a,g} = A_a \times \frac{G_g}{\mu} \]
Interpretation: Gender modifies confidence proportionally. If women’s average score is 5% higher than (\(\mu\)), they are 5% higher in every age group.
Assumptions:
- Gender effect is scale-invariant (relative, not absolute).
- Proportional adjustment preserves the [0,100] response scale.
Use case: Appropriate when gender differences are better expressed as relative deviations (e.g., in low-confidence contexts).
Assumption Summary
| Model | Equation | Key Assumption | Preserves Marginals? |
|---|---|---|---|
| A. Age-dominant | \(X_{a,g} = A_a\) | No gender effect within age | Age only |
| B. Additive | \(X_{a,g} = A_a + (G_g - \mu)\) | Constant absolute gender effect | Both |
| C. Multiplicative | \(X_{a,g} = A_a \cdot (G_g / \mu)\) | Constant relative gender effect | Both |
Critical note: All models use true \(N_{a,g}\) from demographic data, so no assumption of age–gender independence is made. The only assumptions concern the response structure \(X_{a,g}\).
Implementation of Structural Models
To compute country-level vaccine confidence indicators under each structural assumption, we execute a four-stage workflow:
- Harmonize age group labels across datasets
- Extract marginal VCI responses
- Construct the age–gender population table
- Compute the grand mean ((\(\mu\))) and apply each model
All steps preserve data integrity and handle missingness transparently (beliefs is excluded as it is 100% missing).
Age Group Harmonization
The VCI dataset reports age groups as "18-24", while the demographic dataset uses "18 - 24" (with spaces). We normalize all labels to a consistent format:
normalize_age_group <- function(x) {
x <- trimws(x)
x <- gsub("\\s*\\-\\s*", "-", x) # "18 - 24" → "18-24"
x <- ifelse(x == "65", "65+", x) # align with demographics
return(x)
}# Loading demographics dataset
demographics <- read_csv("../../data/US-Census-population-data-2014-2025.csv") %>% clean_names()
# --- Normalize demographics for 2018 ---
demographics_normalized <- demographics %>%
filter(year == 2018) %>% # restrict to analysis year
rename (country = country_area_name) %>%
mutate(
group_clean = case_when(
is.na(group) ~ "TOTAL", # label total rows
TRUE ~ normalize_age_group(group) # normalize age groups
)
) %>%
select(-group) %>% # remove redundant original column
rename(group = group_clean) # keep the cleaned name as 'group'
# --- Normalize VCI ---
vci_normalized <- vci_year %>%
mutate(
demographic_clean = if_else(
demo_type == "Age", # only normalize age categories
normalize_age_group(demographic),
demographic # leave others (Gender, Education, etc.) as-is
)
) %>%
select(-demographic) %>% # drop redundant original column
rename(demographic = demographic_clean) # rename cleaned version back to 'demographic'Let’s do a quick sanity check to verify the normalization produced the dataframes we expected:
demographics_normalized %>%
count(group) %>%
arrange(group)
vci_normalized %>%
count(demo_type, demographic)> demographics_normalized %>%
+ count(group) %>%
+ arrange(group)
# A tibble: 7 × 2
group n
<chr> <int>
1 18-24 227
2 25-34 227
3 35-44 227
4 45-54 227
5 55-64 227
6 65+ 227
7 TOTAL 227
>
> vci_normalized %>%
+ count(demo_type, demographic)
# A tibble: 18 × 3
demo_type demographic n
<chr> <chr> <int>
1 Age 18-24 144
2 Age 25-34 144
3 Age 35-44 144
4 Age 45-54 144
5 Age 55-64 144
6 Age 65+ 144
7 Education Primary and below 144
8 Education Secondary and Vocational 144
9 Education University and above 142
10 Gender Female 144
11 Gender Male 144
12 Religion Atheist/agnostic 104
13 Religion Buddhist 60
14 Religion Christian 130
15 Religion Hindu 41
16 Religion Jewish 33
17 Religion Muslim 119
18 Religion Other 118This output confirms everything is now nicely standardized:
- Both datasets have clean, consistent age labels (18-24, 25-34, …, 65+). -
demographics_normalizedhas one record per age group per country, plus “TOTAL” (to account for the total population). vci_normalizedincludes multiple demographic types (Age, Gender, Education, Religion, etc.) — exactly what we expected.
Extract Marginal VCI Responses
We isolate age- and gender-specific survey means:
# Age-marginal responses (A_a)
vci_age <- vci_normalized %>%
filter(demo_type == "Age") %>%
select(country, age_group = demographic, children_a = children, safe_a = safe, effective_a = effective)
# Gender-marginal responses (G_g)
vci_gender <- vci_year %>%
filter(demo_type == "Gender") %>%
select(country, gender = demographic, children_g = children, safe_g = safe, effective_g = effective)Construct Age–Gender Population Table
We build the joint population table \(N_{a,g}\):
demo_ag <- demographics_normalized %>%
filter(group != "TOTAL") %>% # remove total rows
select(country, age_group = group, # rename group for clarity
male_population, female_population) %>%
pivot_longer(
cols = c(male_population, female_population),
names_to = "gender_pop",
values_to = "population"
) %>%
mutate(
gender = if_else(gender_pop == "male_population", "Male", "Female")
) %>%
select(country, age_group, gender, population)Compute Grand Mean (\(\mu\))
The grand mean (\(\mu\)) is the population-weighted national average attitude. We compute it by weighting age-specific responses by total age-group populations (summed across genders). This serves as the normalization constant for Models B and C, ensuring that additive and multiplicative adjustments preserve the overall marginal structure.
# Use demographics_normalized instead of the raw demographics dataset
# because labels and "TOTAL" are already handled there.
age_pops <- demographics_normalized %>%
filter(group != "TOTAL") %>% # exclude total rows
select(country, age_group = group, # rename group for consistency
age_pop = population) # keep population for weighting
grand_means <- vci_age %>%
inner_join(age_pops, by = c("country", "age_group")) %>%
group_by(country) %>%
summarise(
mu_children = weighted.mean(children_a, age_pop, na.rm = TRUE),
mu_safe = weighted.mean(safe_a, age_pop, na.rm = TRUE),
mu_effective = weighted.mean(effective_a, age_pop, na.rm = TRUE),
.groups = "drop"
)Apply Structural Models
Model A: Age-Dominant
Model A implementation: We join the age marginals with the population grid and assign the same age-specific score to both males and females within each age group \((\hat{X}_{a,g} = A_a)\). We then compute the population-weighted country mean by summing across all age × gender cells.
ag_A <- demo_ag %>%
left_join(vci_age, by = c("country", "age_group")) %>%
mutate(children_hat = children_a, safe_hat = safe_a, effective_hat = effective_a)
agg_A <- ag_A %>%
group_by(country) %>%
summarise(
children_w = weighted.mean(children_hat, population, na.rm = TRUE),
safe_w = weighted.mean(safe_hat, population, na.rm = TRUE),
effective_w = weighted.mean(effective_hat, population, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(method = "Age-dominant")Model B: Additive
Model B implementation: We join age marginals, gender marginals, and grand means to the population grid. For each cell, we add the gender-specific deviation \((G_g - \mu)\) to the age baseline \(A_a\). This assumes gender has a constant additive effect across all age groups.
ag_B <- demo_ag %>%
left_join(vci_age, by = c("country", "age_group")) %>%
left_join(vci_gender, by = c("country", "gender")) %>%
left_join(grand_means, by = "country") %>%
mutate(
children_hat = children_a + (children_g - mu_children),
safe_hat = safe_a + (safe_g - mu_safe),
effective_hat = effective_a + (effective_g - mu_effective)
)
agg_B <- ag_B %>%
group_by(country) %>%
summarise(
children_w = weighted.mean(children_hat, population, na.rm = TRUE),
safe_w = weighted.mean(safe_hat, population, na.rm = TRUE),
effective_w = weighted.mean(effective_hat, population, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(method = "Additive")Model C: Multiplicative
Model C implementation: We join age marginals, gender marginals, and grand means to the population grid. For each cell, we scale the age baseline \(A_a\) by the relative gender factor \((G_g / \mu)\). This assumes gender effects are proportional rather than additive.
ag_C <- demo_ag %>%
left_join(vci_age, by = c("country", "age_group")) %>%
left_join(vci_gender, by = c("country", "gender")) %>%
left_join(grand_means, by = "country") %>%
mutate(
children_hat = children_a * (children_g / mu_children),
safe_hat = safe_a * (safe_g / mu_safe),
effective_hat = effective_a * (effective_g / mu_effective)
)
agg_C <- ag_C %>%
group_by(country) %>%
summarise(
children_w = weighted.mean(children_hat, population, na.rm = TRUE),
safe_w = weighted.mean(safe_hat, population, na.rm = TRUE),
effective_w = weighted.mean(effective_hat, population, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(method = "Multiplicative")Model Comparison and Final Selection
To ensure that our results are robust to modeling assumptions, we now assess how sensitive the country-level estimates are to the choice of structural model (Age-dominant, Additive, Multiplicative). The key idea is simple: if all three models produce nearly identical results, then the aggregated vaccine confidence indicators are stable and our preferred model choice is defensible.
We proceed in three steps:
- Combine the results from all models into one dataset.
- Compute, for each country and indicator, the range (maximum – minimum) across models.
- Summarize and visualize these ranges to evaluate robustness.
This section performs a sensitivity analysis: we quantify how much our country-level indicators change when we swap the structural assumption used to impute the unobserved Age × Gender means \(X_{a,g}\).
What we’re comparing. We have three country-level outputs:
agg_A: Age-dominant (\(X_{a,g}=A_a\))agg_B: Additive (\(X_{a,g}=A_a+(G_g-\mu)\))agg_C: Multiplicative (\(X_{a,g}=A_a \times (G_g/\mu)\))
Each contains the same indicators—children_w, safe_w, effective_w—computed with the same Age × Gender population weights \(N_{a,g}\), but under different structural assumptions about \(X_{a,g}\).
Step 1 – Combine model outputs. We row-bind (bind_rows) the three tables into a single long table agg_all. This gives, for each country × indicator, three comparable values—one per model. Having them side-by-side allows like-for-like comparison. If there were any countries missing from a model (e.g., due to an upstream join misalignment), this step would expose that.
Step 2 – Compute within-country ranges. We reshape with pivot_longer() so we can process the three indicators generically, then we group_by(country, indicator) and compute:
min_val: the smallest value across A/B/Cmax_val: the largest value across A/B/Crange_val = max_val − min_val: the largest possible disagreement between models for that country–indicator
Why range? With only three models, the range is the most transparent measure of sensitivity: it literally answers “how far apart do the models get?” A variance/SD is less interpretable here. We also set na.rm = TRUE to ensure that any occasional NA (from unmatched joins or missing values after imputation) does not crash the calculation; if a country had fewer than two non-missing model values, range_val might be 0 or NA—that itself is a useful diagnostic.
Step 3 – Summarize the differences. We summarize range_val across countries for each indicator using:
- Median (
median_range): the typical model difference - 95th percentile (
perc95_range): a high-end difference capturing almost all countries - 99th percentile (
perc99_range): a “near worst-case” difference
Interpretation guideline: if the median is ~0.1–0.2 percentage points (pp) and even the 99th percentile is ~≤2–2.5 pp, then practical conclusions (e.g., country rankings, effect sizes) are unlikely to change with the model choice. In your current run, you reported medians ~0.11–0.15 pp and 99th percentiles ~2.04–2.35 pp—excellent robustness.
What this gives us.
If the ranges are small, our estimates are structurally stable: the choice between Age-dominant, Additive, and Multiplicative assumptions has negligible impact on country-level VCIs. That justifies using the simpler, empirically grounded Age-dominant model as the primary series going forward.
#| label: cmp-prepare
#| echo: true
#| message: false
#| warning: false
# --- Step 1: Combine model outputs ---
# Each model (A, B, C) contains country-level weighted indicators.
# We bind them together to compare values for the same country across models.
agg_all <- bind_rows(agg_A, agg_B, agg_C)
# --- Step 2: Compute within-country ranges ---
# For each country and indicator (children, safe, effective),
# we calculate how much the values differ across the three models.
# The range (max - min) measures the largest possible disagreement.
model_range <- agg_all %>%
pivot_longer(
cols = c(children_w, safe_w, effective_w),
names_to = "indicator",
values_to = "value"
) %>%
group_by(country, indicator) %>%
summarise(
min_val = min(value, na.rm = TRUE),
max_val = max(value, na.rm = TRUE),
range_val = max_val - min_val,
.groups = "drop"
)
# --- Step 3: Summarize the ranges across all countries ---
# We report the median (typical difference) and the 95th and 99th percentiles (worst-case differences)
# of these ranges for each indicator. Small values indicate strong robustness.
summary_range <- model_range %>%
group_by(indicator) %>%
summarise(
median_range = median(range_val, na.rm = TRUE),
perc95_range = quantile(range_val, 0.95, na.rm = TRUE),
perc99_range = quantile(range_val, 0.99, na.rm = TRUE),
.groups = "drop"
)Table with sensitivity analysis results
#| label: tbl-model-sensitivity
#| tbl-cap: "Cross-model sensitivity of country-level indicators (range in percentage points). Lower values indicate higher robustness to modeling assumptions."
#| echo: false
#| message: false
#| warning: false
summary_range %>%
dplyr::mutate(
dplyr::across(dplyr::where(is.numeric), ~round(.x, 3))
) %>%
knitr::kable(align = c("l", "r", "r", "r"))| indicator | median_range | perc95_range | perc99_range |
|---|---|---|---|
| children_w | 0.110 | 1.438 | 2.159 |
| effective_w | 0.152 | 1.516 | 2.345 |
| safe_w | 0.119 | 1.622 | 2.045 |
Another way to visualise the sensitivity analysis results
#| label: fig-model-range
#| fig-cap: "Distribution of per-country model ranges by indicator. Boxes clustered near zero confirm that results are stable across structural assumptions."
#| fig-width: 7
#| fig-height: 4.5
#| echo: true
#| message: false
#| warning: false
ggplot(model_range, aes(x = indicator, y = range_val)) +
geom_boxplot(fill = "turquoise3", alpha = 0.6, outlier.alpha = 0.3) +
coord_flip() +
labs(
title = "Cross-model Sensitivity of Vaccine Confidence Indicators (2018)",
subtitle = "Range (max–min) across Age-dominant, Additive, and Multiplicative models",
x = "Indicator",
y = "Range across models (percentage points)"
) +
theme_minimal(base_size = 12)Interpretation.
The table shows median ranges around 0.11–0.15 pp and 99th-percentile ranges around 2.04–2.35 pp (our numbers: children_w ≈ 0.110 / 2.159, effective_w ≈ 0.152 / 2.345, safe_w ≈ 0.119 / 2.045).
The boxplots confirm these differences are tightly clustered near zero with only a few mild outliers.
Substantively, this means:
- The three modeling strategies yield nearly identical country-level estimates.
- Practical conclusions (e.g., ranking countries, interpreting effect sizes) are unchanged by the structural assumption.
- We can confidently proceed using the Age-dominant model (A) as the primary series and merge this data with WHO/UNICEF data.
#| label: select-final-vci-dataset
#| echo: true
#| message: false
#| warning: false
vci_country_2018 <- agg_A %>%
dplyr::rename(children = children_w, safe = safe_w, effective = effective_w) %>%
dplyr::mutate(year = 2018) %>%
dplyr::select(country, year, children, safe, effective)Comparison of the age-dominant model with a model unweighted by population
To better understand the impact of population weighting, we compare the age-dominant model (which uses age-group population weights) with a simple unweighted model that treats all age groups as equally represented within each country. This illustrates how demographic weighting influences national vaccine confidence estimates.
#| label: compare-weighting
#| echo: true
#| message: false
#| warning: false
# 1️⃣ Compute simple unweighted means by country
vci_unweighted <- vci_age %>%
group_by(country) %>%
summarise(
children_unweighted = mean(children_a, na.rm = TRUE),
safe_unweighted = mean(safe_a, na.rm = TRUE),
effective_unweighted = mean(effective_a, na.rm = TRUE),
.groups = "drop"
)
# 2️⃣ Combine with age-dominant (weighted) results
compare_weighting <- agg_A %>%
rename(
children_weighted = children_w,
safe_weighted = safe_w,
effective_weighted = effective_w
) %>%
left_join(vci_unweighted, by = "country") %>%
mutate(
diff_children = children_weighted - children_unweighted,
diff_safe = safe_weighted - safe_unweighted,
diff_effective = effective_weighted - effective_unweighted
)
# 3️⃣ Summarise across countries
summary_weighting <- compare_weighting %>%
summarise(
across(starts_with("diff_"),
list(
mean = ~mean(.x, na.rm = TRUE),
median = ~median(.x, na.rm = TRUE),
p95 = ~quantile(.x, 0.95, na.rm = TRUE),
p99 = ~quantile(.x, 0.99, na.rm = TRUE)
)
),
.groups = "drop"
)
# 4️⃣ Make output tidy for reporting
summary_weighting_long <- summary_weighting %>%
pivot_longer(
everything(),
names_to = c("indicator", "stat"),
names_pattern = "diff_(.*)_(.*)",
values_to = "value"
) %>%
pivot_wider(names_from = stat, values_from = value)
summary_weighting_long# A tibble: 3 × 5
indicator mean median p95 p99
<chr> <dbl> <dbl> <dbl> <dbl>
1 children 0.173 0.174 1.65 3.03
2 safe 0.0924 0.121 1.85 2.83
3 effective -0.0699 0.0626 1.43 2.96Code explanation (step-by-step):
Compute unweighted national means:
We average age-specific responses for each indicator within a country, without considering population size. This assumes all age groups contribute equally to the national mean — an unrealistic but instructive baseline.
Join with weighted estimates (Model A):
The age-dominant model uses actual age-group population weights, yielding population-representative national scores. We join the two sets of results and compute the differences for each indicator (
diff_*).Summarise across countries: We report the mean, median, and 95th and 99th percentile of these differences to understand how much population weighting changes the results globally.
Reshape for readability: The final
pivot_longerandpivot_widersteps simply format the summary into a clear table by indicator and statistic.
Results Interpretation
| Indicator | Mean diff (pp) | Median diff (pp) | 95th percentile (pp) | 99th percentile (pp) |
|---|---|---|---|---|
| children | 0.17 | 0.17 | 1.65 | 3.03 |
| safe | 0.09 | 0.12 | 1.85 | 2.83 |
| effective | -0.07 | 0.06 | 1.43 | 2.96 |
Differences are small but systematic. On average, unweighted estimates are slightly higher (by around 0.1–0.2 pp for most indicators), reflecting the overrepresentation of older, more confident respondents in unweighted means. In a few cases (top 1–5% of countries), the differences reach 2–3 percentage points — enough to alter country rankings.
Interpretation: Weighting by age corrects these biases and ensures that national means reflect true population distributions. This demonstrates, empirically, why the age-dominant (weighted) model provides a more defensible baseline.
Interpretation, Validation, and Limitations
Our aggregation framework produces demographically representative national vaccine confidence indicators that explicitly account for age structure — ensuring that cross-country comparisons reflect population composition rather than sample artifacts.
By anchoring estimates in actual population weights (\(N_{a,g}\)) and grounding our modeling assumptions in empirical literature, we achieve comparability and interpretability across countries.
Validation Against Alternative Approaches
We conducted two complementary robustness checks:
Model structure sensitivity — compared three demographic models (Age-dominant, Additive, Multiplicative). Results were nearly identical: median inter-model differences were below 0.2 percentage points, with 99% of countries differing by less than 2.5 pp. → This confirms that our country-level estimates are robust to reasonable modeling assumptions about gender effects.
Weighting sensitivity — compared the age-dominant (population-weighted) model with unweighted country means (simple averages across age groups). Unweighted estimates were typically 0.1–0.2 percentage points higher, occasionally up to 3 pp in aging countries where older, more confident cohorts are overrepresented. → This shows that age-weighting corrects subtle but systematic demographic bias.
Overall conclusion: Both sensitivity analyses indicate that our results are stable and demographically defensible. The Age-dominant (age-weighted) model is therefore adopted as the primary specification.
Key Limitations
Despite its rigor, our approach is not without limitations. Each limitation below explains both what the issue is and why it matters for the interpretation of results.
1. No joint-response data (Age × Gender cells unobserved)
The VCI dataset only provides marginal means by age and gender, not by their joint combinations (e.g., “women aged 25–34”). To obtain national estimates, we therefore relied on structural models (Age-dominant, Additive, Multiplicative) that approximate these joint means under simplifying assumptions.
Why this matters: If real-world age and gender effects interact — for instance, if younger women are less confident than younger men, but older women are more confident — none of our models can fully capture that pattern. This could slightly misstate national averages in populations with strong demographic interactions.
Implication: Results are robust for overall trends, but may underestimate subgroup heterogeneity within countries. Future data collections should publish cross-tabulated VCI responses by age and gender to enable more precise modeling.
2. Nonresponse and coverage bias
The VCI is primarily based on online or mobile panel surveys, which may systematically underrepresent certain population groups — for example, older adults without internet access, low-income rural populations, or minority-language speakers.
Why this matters: Even if population weights correct for known demographics (e.g., the proportion of older people), they cannot adjust for attitudinal differences within demographic cells. If those who choose to respond are systematically more vaccine-positive than those who don’t, the resulting estimates will overstate true confidence.
Implication: Weighting reduces, but cannot eliminate, self-selection bias. Independent validation against nationally representative surveys (e.g., Gallup or Pew datasets) would strengthen external validity.
3. Population uncertainty and temporal alignment
Demographic population estimates come from international repositories (UN, World Bank, or national statistical offices) and are themselves modeled using censuses and projections. While we restricted analysis to 2018 to align with VCI data, underlying population structures (especially for small or rapidly changing countries) may still be uncertain.
Why this matters: In countries with incomplete census coverage or high migration, age–gender population counts may have nontrivial errors. This affects weighting precision — particularly in small countries where a few thousand people can shift proportions significantly.
Implication: Confidence indicators for such countries should be interpreted with caution, and uncertainty intervals could be incorporated in future work through Bayesian demographic modeling or bootstrap resampling.
4. Interpretational limits of attitudinal indicators
The VCI indicators measure self-reported agreement with statements like “Vaccines are safe” or “Vaccines are important for children.” These are valuable for gauging confidence, but they are not equivalent to behavioral intent or actual uptake.
Why this matters:
A population may strongly agree that vaccines are safe, yet still show low vaccination coverage due to access barriers, misinformation, or political distrust. Conversely, a country with moderate confidence scores might achieve high uptake through effective public health systems.
Implication:
The aggregated indicators reflect attitudinal readiness, not behavioral compliance. Interpretation should therefore avoid assuming a direct one-to-one link between confidence and coverage.
5. Cross-country comparability challenges
Although we harmonized demographic and survey structures, question phrasing, translation, and cultural interpretation can vary across national contexts.
Why this matters: In some languages, “safe” may connote “without any risk,” while in others it implies “generally beneficial.” Such linguistic nuances introduce small but systematic measurement differences across countries.
Implication: Cross-country comparisons should be treated as approximate indicators of relative confidence, not precise interval-scaled values. When possible, analyses should focus on within-country changes over time, where such biases cancel out.
Summary Table: Sources and Implications of Limitations
| Limitation | Source of Bias | Likely Direction | Mitigation |
|---|---|---|---|
| Missing joint Age×Gender data | Structural modeling assumption | May smooth true subgroup variation | Sensitivity analysis across models |
| Nonresponse / coverage bias | Online survey undercoverage | Potential overestimation of confidence | Apply demographic weights; external validation |
| Population uncertainty | Census and projection errors | Random; may add noise | Restrict to well-measured years; bootstrap weighting |
| Attitude vs. behavior gap | Conceptual mismatch | May mispredict uptake | Use VCI as attitudinal, not behavioral, measure |
| Cross-cultural measurement differences | Translation / interpretation variance | Context-dependent | Prefer within-country comparisons |
Why This Approach Is Defensible
This method follows best practices in survey weighting and cross-national comparative analysis:
- Uses true population weights to represent national demographics accurately.
- Grounds structural assumptions in peer-reviewed empirical evidence.
- Includes transparent robustness checks across models and weighting schemes.
- Avoids ecological fallacy by staying at the appropriate level of aggregation.
Weighting doesn’t aim to make the sample look exactly like the population in every single detail. It aims to adjust for known sources of bias in a principled way.
Our workflow accomplishes precisely that for global vaccine confidence estimation.
Merging VCI dataset with WHO/UNICEF and World Bank datasets
To model predictors of vaccine confidence, we now merge the aggregated VCI dataset with vaccination coverage (WHO/UNICEF) and socioeconomic indicators (World Bank). However, the two auxiliary datasets use different formats and naming conventions, so we first clean, standardize, and reshape them for compatibility.
Step 1: Load the WHO/UNICEF and World Bank datasets
who_data <- read_csv('../../data/who_measles_first_dose_coverage.csv') %>% clean_names()
wb_data <- read_csv('../../data/WorldBank_socio-economic_indicators_with_metadata.csv') %>% clean_names()> who_data <- read_csv('../../data/who_measles_first_dose_coverage.csv') %>% clean_names()
Rows: 4831 Columns: 34 s
── Column specification ───────────────────────────────────────────
Delimiter: ","
chr (10): IndicatorCode, Indicator, ValueType, ParentLocationC...
dbl (3): Period, FactValueNumeric, Value
lgl (20): IsLatestYear, Dim1 type, Dim1, Dim1ValueCode, Dim2 t...
dttm (1): DateModified
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> wb_data <- read_csv('../../data/WorldBank_socio-economic_indicators_with_metadata.csv') %>% clean_names()
Rows: 2676 Columns: 15 32.29MB/s, eta: 0s
── Column specification ───────────────────────────────────────────
Delimiter: ","
chr (15): Series Name, Series Code, Country Name, Country Code,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.Code explanation
read_csv()imports the raw CSV files.clean_names()(from janitor) standardizes variable names to lower snake_case (e.g.,IndicatorCode→indicator_code), ensuring they are consistent and easier to reference downstream.Assumption
Both files contain country-level data for multiple years; we will filter to 2018 to match the year of the VCI survey.
Step 2: Prepare WHO/UNICEF data (measles MCV1 coverage)
who_2018 <- who_data %>%
# Keep the MCV1 series, annual values, year 2018
filter(indicator_code == "WHS8_110",
period_type == "Year",
period == 2018) %>%
transmute(
iso3 = spatial_dim_value_code, # already ISO3 in your file (e.g., "MNE", "CAF")
country_who = location, # keep for diagnostics
year = period,
measles_mcv1 = coalesce(fact_value_numeric, as.numeric(value))
) %>%
distinct(iso3, year, .keep_all = TRUE) # one row per iso3-yearCode explanation
filter()selects:
indicator_code == "WHS8_110"→ the WHO/UNICEF code for measles first-dose coverage (MCV1).period_type == "Year"ensures we keep annual data (some WHO tables include monthly or cumulative estimates).period == 2018isolates the year of analysis.
transmute():
- Keeps only relevant columns and renames them for clarity (
iso3,year,measles_mcv1).- Uses
coalesce()to fill missing numeric values by attempting conversion from thevaluecolumn (some WHO exports store numbers as text).
distinct()ensures one observation per country-year pair.Assumptions
- The WHO data already use ISO3 country codes (
spatial_dim_value_code), so we don’t need to normalize names here.- MCV1 coverage is a valid proxy for overall vaccination performance — consistent with prior research: “Thus, measles vaccination serves as a good proxy for ‘full immunisation’, ensuring that a child has received all critical age-appropriate vaccines and has ongoing access to healthcare.” (Vaccher et al. 2025)
Step 3: Prepare World Bank socioeconomic indicators
# Map Series Code -> short column names you want in the model
wb_keep <- tribble(
~series_code, ~var,
"NY.GDP.PCAP.CD", "gdp_per_capita_current_usd",
"SI.POV.NAHC", "poverty_headcount_national",
"SL.UEM.TOTL.ZS", "unemployment_rate",
"SE.ADT.LITR.ZS", "adult_literacy_rate",
"SE.PRM.ENRR", "primary_school_enrollment",
"SH.XPD.GHED.GD.ZS", "domestic_health_expenditure_pct_gdp",
"SH.MED.PHYS.ZS", "physicians_per_1000",
"SH.MED.NUMW.P3", "nurses_and_midwives_per_1000",
"SH.MED.CHWK.ZS", "community_health_workers_per_1000",
"SH.STA.BASS.ZS", "access_to_sanitation"
)
# The WB file is wide by year; keep 2018 only, then pivot wider by Series Code
wb_2018 <- wb_data %>%
select(
series_code, country_code,
x2018_yr2018
) %>%
rename(
iso3 = country_code,
value_2018 = x2018_yr2018
) %>%
# Keep only indicators we care about
inner_join(wb_keep, by = "series_code") %>%
mutate(year = 2018) %>%
select(iso3, year, var, value_2018) %>%
# Pivot to wide so each kept indicator becomes a column
pivot_wider(names_from = var, values_from = value_2018)Code explanation
- The World Bank dataset is in wide format — each year is a separate column (e.g.,
2018 [YR2018]).- We use
select()to keep the country code, series code, and 2018 values only.rename()makes the naming consistent (iso3,value_2018).inner_join(wb_keep)filters the dataset down to the specific indicators we plan to use (based onSeries Codevalues).pivot_wider()converts the long indicator list into one row per country, with each indicator as a separate column.Assumptions
- Indicators are cross-sectionally complete for 2018. Missing values will be carried as
NAwithout imputation.- ISO3 country codes are consistent across all World Bank files.
Step 4: Harmonize country names and add ISO3 codes to the VCI dataset
# Custom fixes for tricky names across sources
# Note: Kosovo (XKX) and Saint Martin (MAF) are not standard ISO3;
# custom-coded here for join consistency across WHO/WB datasets.
custom_match <- c(
"Congo, Democratic Republic of the" = "COD",
"Congo, Republic of the" = "COG",
"Côte d’Ivoire" = "CIV",
"Cote d'Ivoire" = "CIV",
"Eswatini" = "SWZ",
"Russian Federation" = "RUS",
"Viet Nam" = "VNM",
"Iran" = "IRN",
"Iran, Islamic Republic of" = "IRN",
"Egypt" = "EGY",
"Egypt, Arab Rep." = "EGY",
"Syrian Arab Republic" = "SYR",
"Türkiye" = "TUR",
"Gambia, The" = "GMB",
"Lao PDR" = "LAO",
"Kyrgyz Republic" = "KGZ",
"Yemen, Rep." = "YEM",
"Hong Kong SAR, China" = "HKG",
"Macao SAR, China" = "MAC",
"Kosovo" = "XKX",
"Saint Martin" = "MAF"
)
vci_iso <- vci_country_2018 %>%
mutate(
iso3 = countrycode(country, "country.name", "iso3c", custom_match = custom_match)
) %>%
select(iso3, year, country, children, safe, effective)Code explanation
- The VCI dataset currently uses plain English country names (e.g.,
"Congo, Republic of the").countrycode()maps these to ISO3 codes, which serve as stable join keys across datasets.- The
custom_matchvector manually overrides problematic names like"Kosovo"and"Saint Martin"that are not officially recognized ISO members.- We retain only the ISO3 code, country name, year, and the three vaccine confidence indicators.
Assumptions
- All remaining countries can be uniquely matched via
countrycode().- Custom overrides ensure one-to-one mappings for special territories (e.g.,
"Côte d’Ivoire"→CIV).
Step 5: Diagnostics — check for unmatched countries
merged_data <- vci_iso %>%
left_join(who_2018, by = c("iso3", "year")) %>%
left_join(wb_2018, by = c("iso3", "year"))
# Quick diagnostics: which VCI countries failed to find WHO/WB matches?
anti_who <- vci_iso %>%
anti_join(who_2018, by = c("iso3", "year"))
anti_wb <- vci_iso %>%
anti_join(wb_2018, by = c("iso3", "year"))
nrow(merged_data)
anti_who %>% select(country, iso3) %>% distinct() %>% arrange(country)
anti_wb %>% select(country, iso3) %>% distinct() %>% arrange(country)> nrow(merged_data)
[1] 227
> anti_who %>% select(country, iso3) %>% distinct() %>% arrange(country)
# A tibble: 33 × 2
country iso3
<chr> <chr>
1 American Samoa ASM
2 Anguilla AIA
3 Aruba ABW
4 Bermuda BMU
5 Cayman Islands CYM
6 Curaçao CUW
7 Faroe Islands FRO
8 French Polynesia PYF
9 Gibraltar GIB
10 Greenland GRL
# ℹ 23 more rows
# ℹ Use `print(n = ...)` to see more rows
> anti_wb %>% select(country, iso3) %>% distinct() %>% arrange(country)
# A tibble: 10 × 2
country iso3
<chr> <chr>
1 Anguilla AIA
2 Cook Islands COK
3 Guernsey GGY
4 Jersey JEY
5 Montserrat MSR
6 Saint Barthelemy BLM
7 Saint Helena, Ascension, and Tristan da Cunha SHN
8 Saint Pierre and Miquelon SPM
9 Taiwan TWN
10 Wallis and Futuna WLF Code explanation
anti_join()finds records invci_isothat do not have a match in WHO or WB data (based oniso3andyear).- This helps identify territories missing from auxiliary sources — typically small islands or non-sovereign regions (e.g., Hong Kong, Guernsey, Puerto Rico).
- Printing these tables lets us decide whether to drop or reassign them before analysis.
Assumptions
- WHO and World Bank only report for sovereign countries, so non-matching entities will be excluded.
- Dropping these has negligible effect on global or regional patterns due to their small populations.
Step 6: Merge datasets and finalize analysis file
analysis_2018 <- vci_iso %>%
inner_join(who_2018, by = c("iso3", "year")) %>%
left_join(wb_2018, by = c("iso3", "year"))Code explanation
inner_join()ensures we only keep countries with both VCI and WHO vaccination coverage data (since WHO is the dependent variable in later modeling).left_join()appends available World Bank socioeconomic indicators. Missing indicators are left asNA.- The result,
analysis_2018, is a harmonized country-level dataset for 2018, ready for modeling.Assumptions
- The modeling will only use complete cases (countries with valid outcome and covariates).
- Joining on ISO3 ensures consistent cross-dataset alignment and avoids ambiguity in country names.
Question 3
We’ll be exploring distributions, relationships, potential outliers, and missingness in the merged dataset analysis_2018 (from Q2).
Distributions of aggregated VCP indicators and vaccination coverage
#| label: q3-dists
#| echo: true
#| message: false
#| warning: false
d <- analysis_2018
p1 <- ggplot(d, aes(x = children)) +
geom_histogram(binwidth = 2, fill = "turquoise2") +
labs(title = "VCP: Vaccines Important for Children",
x = "Percent agreeing", y = "Countries")
p2 <- ggplot(d, aes(x = safe)) +
geom_histogram(binwidth = 2, fill = "turquoise2") +
labs(title = "VCP: Vaccines Are Safe",
x = "Percent agreeing", y = "Countries")
p3 <- ggplot(d, aes(x = effective)) +
geom_histogram(binwidth = 2, fill = "turquoise2") +
labs(title = "VCP: Vaccines Are Effective",
x = "Percent agreeing", y = "Countries")
p4 <- ggplot(d, aes(x = measles_mcv1)) +
geom_histogram(binwidth = 2, fill = "turquoise2") +
labs(title = "Measles (MCV1) Coverage",
x = "Coverage (%)", y = "Countries")
(p1 | p2) / (p3 | p4)Interpretation
This plot shows histograms summarizing vaccine confidence and coverage across countries, using four indicators:
Top left — “Vaccines Important for Children”
- Most countries have a high percentage of people agreeing that vaccines are important for children (mostly above 70%).
- A smaller number of countries have lower agreement (below 60%).
- Interpretation: There’s broad global consensus on vaccine importance, though a few countries have weaker confidence.
Top right — “Vaccines Are Safe”
- Responses are more spread out, ranging roughly from 30% to 90%.
- The distribution peaks around 70–80%, meaning that in many countries, most people consider vaccines safe—but confidence in safety varies more than in importance to children.
- Interpretation: Perceived safety is a more contentious issue and may explain vaccine hesitancy.
Bottom left — “Vaccines Are Effective”
- Similar to the safety perception, but slightly more concentrated between 60% and 90%.
- Most countries believe in vaccine effectiveness, but there is still noticeable variation.
- Interpretation: While effectiveness is generally accepted, it’s not as universally agreed upon as importance of vaccines to children.
Bottom right — “Measles (MCV1) Coverage”
- Most countries have high coverage rates (80–100%), though a few fall much lower (around 30–60%).
- The sharp peak near 100% suggests that many nations have successfully achieved high measles vaccination rates.
- Interpretation: Actual coverage levels are generally high, but countries with lower confidence (in safety or effectiveness) might also show lower coverage.
Overall interpretation:
- Vaccine confidence is generally high but uneven—people tend to agree that vaccines are important to children, though they’re less certain about safety and effectiveness.
- The high MCV1 (measles) coverage in most countries suggests that even with some doubts, vaccination programs remain widely implemented.
- However, countries with low perceived safety/effectiveness could be at higher risk for reduced vaccine uptake in the future.
Relationships with socio-economic / health system indicators
Let’s select a few variables from the analysis_2018 dataset and rename them for plotting purposes:
rel_vars_clean <- analysis_2018 |>
select(
measles_mcv1,
vax_children = children,
vax_safe = safe,
vax_effective = effective,
gdp_pc = gdp_per_capita_current_usd,
literacy = adult_literacy_rate,
health_spend = domestic_health_expenditure_pct_gdp,
sanitation = access_to_sanitation,
school_enrol = primary_school_enrollment,
poverty = poverty_headcount_national,
doctors = physicians_per_1000,
nurses = nurses_and_midwives_per_1000
) |>
mutate(across(everything(), as.numeric))Warning :There were 8 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(everything(), as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run warnings()dplyr::last_dplyr_warnings() to see the 7
remaining warnings.Correlation heatmap
#| label: q3-correlation-heatmap
#| echo: true
#| message: false
#| warning: false
# Compute correlations
cor_mat <- cor(rel_vars_clean, use = "pairwise.complete.obs")
# Plot
ggcorrplot(
cor_mat,
hc.order = TRUE, # Cluster similar variables together
type = "lower", # Only lower half
lab = TRUE, # Show correlation coefficients
lab_size = 2.5,
colors = c("red", "white", "blue"),
title = "Correlation Heatmap: Vaccine Coverage,\n Confidence, and System Indicators",
tl.cex = 10,
tl.srt = 45
)Key findings from the correlation heatmap
- Vaccine confidence variables are strongly correlated with each other
vax_safe,vax_children, andvax_effectiveshow very high correlations (r ≈ 0.8–0.9). → Countries that believe vaccines are safe also tend to believe they are effective and important for children. → This suggests these attitudes cluster together as a single “vaccine confidence” construct.
- Measles (MCV1) coverage relates moderately to both confidence and system indicators
measles_mcv1correlates positively (r ≈ 0.4–0.6) with:literacysanitationgdp_pcnursesandhealth_spend
These are all system-level indicators (socioeconomic development and healthcare capacity). → Higher coverage tends to occur in more developed contexts — where infrastructure, education, and healthcare are stronger.
With vaccine confidence variables (
vax_safe,vax_effective), correlations are smaller (~0.1–0.3). → While confidence matters, system strength seems to have a stronger statistical link to actual vaccination coverage.
- Poverty is negatively correlated with nearly everything
povertyshows negative correlations (~–0.4 to –0.6) with:literacy,sanitation,gdp_pc,health_spend,nurses, etc. → Expected: poorer countries have weaker infrastructure and lower vaccine coverage.
Also slightly negative with vaccine confidence, implying that poverty may also affect attitudes (perhaps via trust, access, or education).
- Health system indicators are interlinked
gdp_pc,nurses, andhealth_spendcorrelate very strongly (~0.65–0.85). → Wealthier countries spend more on health and have higher staffing — consistent with expected patterns of development.
Summary interpretation of the correlation heatmap
| Theme | Relationship | Strength | Interpretation |
|---|---|---|---|
| Vaccine confidence variables | Very strong positive | ~0.8–0.9 | Beliefs about safety, effectiveness, and importance move together |
| Measles coverage vs. confidence | Weak–moderate positive | ~0.2–0.4 | Some link, but weaker than with development factors |
| Measles coverage vs. system strength | Moderate–strong positive | ~0.4–0.6 | Higher coverage in richer, more educated, better serviced countries |
| Poverty vs. everything else | Strong negative | ~–0.4 to –0.6 | Poverty undermines both system indicators and coverage |
Overall conclusion:
- Vaccine confidence clusters tightly together, forming a coherent belief structure.
- However, measles vaccination coverage correlates more strongly with development and health system indicators than with confidence alone.
- In short: confidence matters, but capacity matters more for actual vaccine delivery outcomes.
Relationship between measles vaccine coverage and the other variables
rel_vars_long <- rel_vars_clean |>
pivot_longer(-measles_mcv1, names_to = "variable", values_to = "value")
ggplot(rel_vars_long, aes(x = value, y = measles_mcv1)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
facet_wrap(~ variable, scales = "free_x") +
labs(
title = "Measles Coverage vs. Confidence and System Indicators",
x = NULL, y = "Measles (MCV1) Coverage (%)"
) +
theme_minimal()cor_target <- cor_mat["measles_mcv1", ] |>
sort(decreasing = TRUE) |>
as.data.frame() |>
rownames_to_column("variable")
colnames(cor_target) <- c("variable", "correlation")
ggplot(cor_target[-1, ], aes(x = reorder(variable, correlation), y = correlation)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Correlation of Measles (MCV1) Coverage with Confidence & System Indicators",
x = NULL, y = "Correlation"
) +
theme_minimal()Correlation Bar Chart
This chart ranks how strongly each factor correlates with measles vaccination coverage.
🔹 Strongest positive correlations
- Sanitation, literacy, doctors, nurses, health_spend, and GDP per capita all show moderate-to-strong positive correlations (r ≈ 0.4–0.6). → Countries with better infrastructure, higher education, and stronger healthcare systems tend to have higher vaccination coverage.
Interpretation: Coverage seems closely linked to national capacity and development, not just vaccine attitudes.
🔹 Weak or negligible correlations
- Vaccine confidence indicators (
vax_safe,vax_effective,vax_children) have near-zero correlations with actual coverage. → People’s reported beliefs about vaccines (safety, effectiveness, importance) don’t clearly translate into coverage differences at the national level. → This could reflect that even if confidence is high, logistical or system constraints still limit access.
🔹 Negative correlation
- Poverty shows a strong negative correlation (~–0.4). → Poorer countries have lower vaccination coverage, likely due to barriers in access, supply, and infrastructure.
📈 Scatterplots: MCV1 Coverage vs Predictors
This plot shows the same relationships more visually, variable by variable.
🔹 System and development variables
- GDP per capita, literacy, sanitation, health spending, and healthcare workforce all show upward trends — as these increase, coverage tends to rise.
- The relationships flatten slightly at high levels (near 100% coverage), suggesting a ceiling effect — most rich or well-developed countries already have nearly universal coverage.
🔹 Poverty
- The scatterplot for poverty shows a clear downward slope, confirming that higher poverty rates are associated with lower coverage.
🔹 Vaccine confidence variables
vax_safe,vax_effective, andvax_childrenare almost flat lines — no clear relationship. → This suggests that differences in beliefs across countries don’t explain much of the variation in vaccination rates once system and economic factors are considered.
🧠 Overall Interpretation
| Theme | Observation | Interpretation |
|---|---|---|
| Health system & development | Strongest positive correlations with MCV1 coverage | Structural capacity — not just public attitudes — drives vaccination success. |
| Poverty | Strong negative relationship | Economic hardship constrains access to healthcare and vaccines. |
| Vaccine confidence | Weak to negligible correlation | Confidence is generally high across countries; it’s not the limiting factor compared to infrastructure. |
| Visual patterns | Clear upward or downward slopes for system indicators; flat for attitudes | Coverage differences largely reflect material/systemic inequalities. |
In summary:
Countries with better socioeconomic development, stronger healthcare systems, and lower poverty achieve higher measles vaccination coverage. While public confidence in vaccines is important for sustaining demand, structural capacity appears to be the dominant factor explaining global variation in actual coverage.
Outliers and influential observations
#| label: q3-outliers
#| echo: true
#| message: false
#| warning: false
#|
ggplot(analysis_2018, aes(x = safe, y = measles_mcv1)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
geom_text_repel(
data = subset(analysis_2018, measles_mcv1 < 60 | safe < 70),
aes(label = country),
size = 3, max.overlaps = 10
) +
labs(
x = "Percent agreeing vaccines are safe",
y = "Measles (MCV1) Coverage (%)",
title = "Outliers in Vaccine Confidence vs. Coverage"
) +
theme_minimal()What this plot shows
We’ve plotted measles (MCV1) vaccination coverage (%) against vaccine safety confidence (% agreeing), with country labels added for the most unusual or illustrative points.
The red line is a simple linear trend, and the labels highlight countries that appear far from the main cluster — visually “outliers” in how confidence relates to coverage.
Key takeaways from the visualization
🔹 1. Two main outlier clusters appear
(a) Low confidence but high coverage (top left quadrant)
Examples: France, Japan, Ukraine, Lithuania, Austria, Bulgaria, Russia, Cyprus, Germany, Latvia, etc.
- These countries achieve very high coverage (≥90%) despite low safety confidence (30–60%).
- Mostly high-income countries with strong public health systems and school-entry vaccination requirements.
- Suggests that structural and policy mechanisms, rather than public sentiment, maintain high vaccination rates.
🧠 Interpretation: These are “confidence outliers” — trust is low, but delivery remains high.
(b) High confidence but low coverage (bottom right quadrant)
Examples: Ethiopia, Yemen, Nigeria, Gabon, Guinea, Togo, Mali**, etc.
- These countries report strong belief in vaccine safety (70–90%) but coverage well below 80%.
- Most are low-income or fragile states, where logistical barriers (infrastructure, access, conflict) outweigh willingness.
🧠 Interpretation: These are “system outliers” — trust is high, but capacity is weak.
🔹 2. Flat regression line
The red regression line is nearly horizontal, confirming that the linear association between confidence and coverage is weak or nonexistent. → This matches the correlation results you observed earlier (r ≈ 0.1–0.2).
🧠 Interpretation: Vaccine confidence varies widely across countries, but it doesn’t systematically predict coverage. Instead, non-attitudinal factors (infrastructure, poverty, governance) dominate.
🔹 3. Regional contrasts
- Europe & East Asia: cluster of low-confidence, high-coverage nations.
- Sub-Saharan Africa & Middle East: high-confidence, lower-coverage cluster.
- Americas and parts of Asia (Peru, Bolivia, Pakistan): middle of the pack.
This reflects a global confidence–coverage paradox:
In wealthy countries, hesitancy doesn’t necessarily reduce coverage, while in poorer countries, confidence isn’t enough to ensure high coverage.
Summary Interpretation
| Pattern | Example countries | What it means |
|---|---|---|
| High coverage, low confidence | France, Japan, Austria | Trust issues but strong systems keep coverage high |
| Low coverage, high confidence | Nigeria, Ethiopia, Yemen | Positive attitudes but weak infrastructure limit access |
| Weak overall association | Flat red line | Vaccine uptake is driven more by structural than attitudinal factors |
Missingness exploration (before any imputation)
Before exploring the missingness in analysis_2018, let’s ensure all columns are of the right format (if numeric variables are encoded as characters, R might mistakenly consider they have no missing values!).
analysis_2018 <- analysis_2018 |>
mutate(across(
c(gdp_per_capita_current_usd,
access_to_sanitation,
poverty_headcount_national,
primary_school_enrollment,
adult_literacy_rate,
unemployment_rate,
domestic_health_expenditure_pct_gdp,
physicians_per_1000,
nurses_and_midwives_per_1000),
~ na_if(., "..") |> as.numeric()
))Let’s now visualise the missingness in the matrix.
#| label: q3-missingness-visualisation
#| echo: true
#| message: false
#| warning: false
# Visual pattern of missingness
# Create a matrix of missingness (TRUE/FALSE)
miss_matrix <- is.na(analysis_2018)
# Convert to long format
miss_long <- melt(miss_matrix)
colnames(miss_long) <- c("Observation", "Variable", "Missing")
# Plot
ggplot(miss_long, aes(x = Variable, y = Observation, fill = Missing)) +
geom_tile(color = "white", size = 0.1) +
scale_fill_manual(values = c("FALSE" = "darkorchid4", "TRUE" = "white")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(title = "Missing Data in analysis_2018",
subtitle = "White = missing, Purple = observed",
fill = "Missing?") +
coord_fixed(ratio = 0.05) # Adjust ratio to make plot taller/wider as neededAnd let’s check the percentage of missing values per variable
# Per-variable missing rates
missing_summary <- analysis_2018 %>%
summarise(across(everything(), ~ mean(is.na(.)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "missing_rate") %>%
arrange(desc(missing_rate))
missing_summary# A tibble: 17 × 2
variable missing_rate
<chr> <dbl>
1 adult_literacy_rate 0.747
2 poverty_headcount_national 0.660
3 children 0.345
4 safe 0.345
5 effective 0.345
6 physicians_per_1000 0.237
7 primary_school_enrollment 0.175
8 nurses_and_midwives_per_1000 0.119
9 unemployment_rate 0.0773
10 access_to_sanitation 0.0412
11 gdp_per_capita_current_usd 0.0258
12 domestic_health_expenditure_pct_gdp 0.0103
13 iso3 0
14 year 0
15 country 0
16 country_who 0
17 measles_mcv1 0 Most variables have a reasonable degree of missingness (generally below 35%). The two exceptions are adult_literacy_rate (almost 75%) and poverty_headcount_national (66%): the missingness in adult_literacy_rate is simply too high for proper imputation so we’ll drop this column. We’ll impute the rest of the columns as the visualisation shows the pattern to likely be MAR (e.g countries that don’t report on children also don’t report on safe or effective or countries that don’t report physicians per 1000 are not likely to report nurses and midwives per 1000).
Question 4
We now build predictive models for measles coverage. beliefs is omitted because it is entirely missing for 2018.
Train/test split
#| label: q4-split
#| echo: true
#| message: false
#| warning: false
set.seed(42)
data_split <- initial_split(analysis_2018, prop = 0.7, strata = measles_mcv1)
train_data <- training(data_split)
test_data <- testing(data_split)Code explanation We use stratified sampling to maintain the coverage distribution in both sets.
Recipe for pre-processing
#| label: q4-recipe
#| echo: true
#| message: false
#| warning: false
model_recipe <- recipe(
measles_mcv1 ~ children + safe + effective +
gdp_per_capita_current_usd + poverty_headcount_national + unemployment_rate +
primary_school_enrollment +
domestic_health_expenditure_pct_gdp + physicians_per_1000 +
nurses_and_midwives_per_1000 +
access_to_sanitation,
data = train_data
) %>%
step_zv(all_predictors()) %>%
step_impute_bag(all_predictors()) %>%
step_normalize(all_predictors())Code explanation
- Removes zero-variance predictors.
- Imputes missing numeric predictors with bagged trees.
- Normalizes (z-scores) for comparability—required for regularized models like LASSO.
Baseline linear regression
#| label: q4-lm
#| echo: true
#| message: false
#| warning: false
lm_spec <- linear_reg() %>% set_engine("lm")
lm_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(lm_spec) %>%
fit(train_data)
lm_preds <- predict(lm_wf, test_data) %>%
bind_cols(test_data %>% select(measles_mcv1))
metrics(lm_preds, truth = measles_mcv1, estimate = .pred)# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 9.48
2 rsq standard 0.464Let’s also compute training set metrics:
lm_preds_train <- predict(lm_wf, train_data) %>%
bind_cols(train_data %>% select(measles_mcv1))
metrics(lm_preds_train, truth = measles_mcv1, estimate = .pred)# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 11.4
2 rsq standard 0.352Interpretation
Training set: R² ≈ 0.35 shows a moderate fit; RMSE ≈ 11 pp indicates predictions typically deviate by about 11 percentage points from observed measles coverage (which seems like a rather large error given the scale of measles vaccine coverage!).
Test set: R² ≈ 0.46 and RMSE ≈ 9.5 pp suggest the model generalises reasonably well, even improving slightly out-of-sample — implying mild over-regularisation or a few influential training points rather than overfitting.
MAE ≈ 7 pp on the test set (respectively 8.11 pp on the training set) means the average country-level prediction error is roughly seven percentage points of coverage (respectively 8.11 percentage points of coverage). The discrepancy between RMSE and MAE points to the existence of a few large outliers.
Overall, the model explains between around 35 to 45 % of cross-country variation in measles coverage, capturing meaningful structure but leaving room for system-level or contextual factors not yet modelled (e.g. supply bottlenecks, conflict). The gapt between training and test set performances suggests the model slightly underfits (i.e it is likely not complex enough e.g missing variables or not capturing potential non-linearities)
Residuals plots
lm_aug <- augment(lm_wf, test_data)
ggplot(lm_aug, aes(x = .pred, y = .resid, label = country)) +
geom_point(alpha = 0.6) +
geom_text_repel(data = subset(lm_aug, abs(.resid) > 15)) +
geom_hline(yintercept = 0, color = "red") +
theme_minimal()1. Overall pattern
- The points are mostly scattered randomly around 0, which is good — it suggests the model isn’t systematically biased (no obvious curvature or funnel shape).
- However, there’s a slight clustering of residuals below 0 for fitted values between ~70–90 — meaning the model slightly overpredicts coverage for those mid-range countries.
- At the high end (~95–100 predicted coverage), residuals are tightly clustered near 0 — the model performs best for already high-coverage countries.
Interpretation: The model fits high-performing countries well but struggles a bit for mid-level coverage nations.
⚠️ 2. Outliers
A few countries stand out with large residuals:
| Country | Residual | Interpretation |
|---|---|---|
| Chad | ≈ –30 | Actual coverage far below what the model predicts → possibly very weak health infrastructure or data quality issues. |
| Nigeria, Ethiopia | –20 to –15 | Model overpredicts their performance — suggests system or conflict-related delivery barriers. |
| Micronesia, Bosnia & Herzegovina | –15 to –20 | Overprediction again — maybe missing predictors relevant to small states or post-conflict systems. |
| Bangladesh | +10 to +15 | Underpredicted — performs better than expected given its predictors (strong immunization programs). |
Interpretation: These are “influential observations” — not necessarily leverage points yet, but they have large residuals (unexpected outcomes given model inputs).
3. Variance pattern (homoscedasticity check)
- The vertical spread of residuals doesn’t clearly widen or narrow with fitted values → variance appears fairly constant.
- There’s no strong funnel shape → no major heteroscedasticity.
Interpretation: The assumption of constant variance holds reasonably well.
📈 4. Model bias
- The slight negative bias for low–mid predicted values implies the model may overestimate coverage in lower-performing contexts.
- We might be missing a predictor that captures fragility, conflict, or inequality, which depresses coverage beyond what GDP or health spending explain.
🧠 Summary interpretation
| Aspect | Observation | Implication |
|---|---|---|
| Centering | Most residuals near 0 | Model fits average well overall |
| Spread | Fairly uniform | No strong heteroscedasticity |
| Outliers | Chad, Nigeria, Ethiopia, Micronesia, Bosnia | Real-world “exceptions” — weaker or stronger than expected |
| Bias | Slight overprediction for mid-range | Missing contextual predictor (governance/conflict?) |
Our model predicts measles coverage fairly well overall, with consistent variance and no severe bias. A handful of countries (especially Chad, Nigeria, Ethiopia) underperform relative to expectations, while Bangladesh exceeds them — pointing to systemic or policy differences not captured by the current predictors.
Regularized model (ElasticNet)
#| label: q4-elastic-net
#| echo: true
#| message: false
#| warning: false
# 1. Define elastic net model specification (fixed penalty and mixture)
elastic_spec <- linear_reg(penalty = 0.1, mixture = 0.5) %>%
set_engine("glmnet")
# 2. Build workflow
elastic_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(elastic_spec)
# 3. Fit the model directly to training data
elastic_fit <- elastic_wf %>%
fit(train_data)
# 4. Predict on the test data
elastic_preds <- predict(elastic_fit, test_data) %>%
bind_cols(test_data %>% select(measles_mcv1))
# 5. Evaluate model performance
metrics(elastic_preds, truth = measles_mcv1, estimate = .pred)# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 9.43
2 rsq standard 0.471#Predict on the training data
elastic_preds_train <- predict(elastic_fit, train_data) %>%
bind_cols(train_data %>% select(measles_mcv1))
# Evaluate model performance
metrics(elastic_preds_train, truth = measles_mcv1, estimate = .pred)# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 11.4
2 rsq standard 0.351We’re selecting ElasticNet here as it’s highly likely that most predictors are relevant (based on the baseline model) and also because we know many predictors to be strongly correlated (e.g safe, effective and children).
Compare models
#| label: q4-compare
#| echo: true
#| message: false
#| warning: false
bind_rows(
lm = metrics(lm_preds, truth = measles_mcv1, estimate = .pred) %>% mutate(model = "Linear Regression (test)"),
elastic_net = metrics(elastic_preds, truth = measles_mcv1, estimate = .pred) %>% mutate(model = "ElasticNet (test)")
) %>%
filter(.metric %in% c("rmse", "rsq")) %>%
select(model, .metric, .estimate)# A tibble: 4 × 3
model .metric .estimate
<chr> <chr> <dbl>
1 Linear Regression (test) rmse 9.48
2 Linear Regression (test) rsq 0.464
3 ElasticNet (test) rmse 9.43
4 ElasticNet (test) rsq 0.471And for the training set:
bind_rows(
lm = metrics(lm_preds_train, truth = measles_mcv1, estimate = .pred) %>% mutate(model = "Linear Regression (train)"),
elastic_net = metrics(elastic_preds_train, truth = measles_mcv1, estimate = .pred) %>% mutate(model = "ElasticNet (train)")
) %>%
filter(.metric %in% c("rmse", "rsq")) %>%
select(model, .metric, .estimate)# A tibble: 4 × 3
model .metric .estimate
<chr> <chr> <dbl>
1 Linear Regression (train) rmse 11.4
2 Linear Regression (train) rsq 0.352
3 ElasticNet (train) rmse 11.4
4 ElasticNet (train) rsq 0.351The performance of both models is sensibly similar with ElasticNet having a very slight edge on the test set.
Comparison of model coefficients
lm_fit <- extract_fit_parsnip(lm_wf)
lm_coefs <- tidy(lm_fit) %>%
select(term, estimate) %>%
rename(ols_estimate = estimate)
elastic_fit_parsnip <- extract_fit_parsnip(elastic_fit)
elastic_coefs <- tidy(elastic_fit_parsnip) %>%
select(term, estimate) %>%
rename(elastic_estimate = estimate)
coef_compare <- full_join(lm_coefs, elastic_coefs, by = "term")
coef_compare_long <- coef_compare %>%
pivot_longer(cols = c(ols_estimate, elastic_estimate),
names_to = "model", values_to = "estimate")
ggplot(coef_compare_long, aes(x = reorder(term, estimate), y = estimate, fill = model)) +
geom_col(position = "dodge") +
coord_flip() +
labs(
title = "Comparison of Coefficients: OLS vs Elastic Net",
x = NULL, y = "Coefficient Estimate",
fill = "Model"
) +
theme_minimal()ElasticNet places slightly more emphasis on poverty_headcount_national (does that hint at the sorts of variables we could try and add to this model to improve it further?) but in general, the coefficients are sensibly the same, which explains the very close performance of both baseline and ElasticNet models.
Question 5
Limitations
- Missing data – Some indicators lack coverage for small or fragile states; even principled imputation adds uncertainty.
- Aggregation – National averages hide within-country inequities (urban–rural, income, gender).
- Collinearity and causality – Socio-economic variables inter-correlate; regression identifies association, not cause.
- Single year – Cross-section cannot reveal temporal dynamics or lag effects.
- Attitude vs behavior – VCP indicators measure belief, not actual uptake; low access can still limit coverage.
Potential improvements
- Multiple or model-based imputation could better capture uncertainty in missing predictors, possibly incorporating regional or income-group cues.
- Cross-validation would help us evaluate our models better. And tuning the parameters of ElasticNet would allow to check whether the performance of this particular model could be improved. We could also try out non-linear models (e.g tree-based models) to see if they improve model performance at all.
- Panel (time-series) data would permit analysis of within-country changes and policy effects over time.
- Richer health-system measures—facility density, cold-chain reliability, workforce distribution—could clarify supply-side drivers.
- Governance and trust indicators (e.g., Worldwide Governance Indicators, Edelman Trust Index) could test institutional determinants of vaccine confidence.
- Uncertainty quantification through bootstrapping or Bayesian hierarchical modeling would make estimates more transparent and interpretable.
Summary reflection
Vaccine-confidence indicators remain closely aligned with vaccination coverage, but enduring structural factors—education, health investment, and system access—still dominate cross-national differences. With richer and longitudinal inputs and potentially better models, this framework could evolve toward a more causal, policy-relevant understanding of how trust, governance, and health capacity jointly sustain immunization coverage.
