✅ A model solution for the W04 formative

Author

What follows is a possible solution for the W04 formative. If you want to render the .qmd that was used to generate this page for yourselves, use the download button below:

⚙️ Setup

We start, as usual, by loading libraries.

#| echo: TRUE
#| warning: FALSE
#| message: FALSE
# Start by loading the required libraries
library(tidyverse)
library(tidymodels) # For modeling
library(readr)
library(janitor)    # For cleaning column names
library(knitr)      # For nice tables
library(ggplot2)
library(naniar)
library(countrycode)
library(patchwork)
library(ggcorrplot)
library(ggrepel)
library(reshape2)
library(magrittr)

Tip

For cleaner formatting (and coding), it’s good practice to load all the libraries you’ll be using in your .qmd at the beginning of the file.

Question 1

Loading the data

Importing the data should only require the read_csv method.

# Load VCI dataset
vci_raw <- read_csv("../../data/VCI_data_2018-24_by_demographic.csv")

Rows: 8165 Columns: 9                                                         
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): country, code, demo_type, demographic
dbl (5): year, children, safe, effective, beliefs

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note

When loading the data, you should only ever use relative paths and not absolute paths (i.e not full paths that are specific to your machine).

Doing so ensures your code is more reproducible!

Data inspection

We quickly inspect the content of the loaded dataframe:

# Inspect structure
glimpse(vci_raw)

Rows: 8,165
Columns: 9
$ country     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"…
$ code        <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF"…
$ year        <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015…
$ demo_type   <chr> "Age", "Age", "Age", "Age", "Age", "Age", "Education", "Ed…
$ demographic <chr> "18-24", "25-34", "35-44", "45-54", "55-64", "65", "Primar…
$ children    <dbl> 96, 94, 97, 97, 97, 100, 95, 98, 98, 98, 94, 99, 97, 97, 9…
$ safe        <dbl> 89, 90, 91, 90, 89, 100, 89, 92, 91, 92, 88, 94, 90, 92, 8…
$ effective   <dbl> 93, 92, 92, 95, 95, 98, 92, 95, 96, 94, 92, 96, 94, 94, 92…
$ beliefs     <dbl> 89, 87, 87, 84, 85, 96, 85, 91, 97, 90, 85, 93, 84, 87, 87…

There will be no need to clean the column names here: they are lowercase and consistent.

vci_raw %>% distinct(year)

# A tibble: 8 × 1
   year
  <dbl>
1  2015
2  2018
3  2019
4  2020
5  2021
6  2022
7  2023
8  2024

The dataframe contains data for multiple years. Let’s check how many countries had data collected from per year:

# Count unique countries per year
countries_per_year <- vci_raw %>%
  group_by(year) %>%
  summarise(unique_countries = n_distinct(country)) %>%
  arrange(desc(unique_countries))

countries_per_year

# A tibble: 8 × 2
   year unique_countries
  <dbl>            <int>
1  2018              144
2  2015               70
3  2023               70
4  2020               53
5  2019               41
6  2022               28
7  2021               15
8  2024               12

The year with the most countries is 2018, providing the best coverage for analysis (i.e a sufficient number of data points for analysis). Other candidate years could be 2015 or 2023: while fewer countries had data collected for those years, the analysis of these years might still be interesting (2015 allows for the analysis of earlier trends while 2023 allows for the analysis of much more recent trends).

Choosing a year is a trade-off between coverage (number of countries) and recency/relevance of data. This inspection ensures the dataset we select for modeling has sufficient observations.

Question 2

Year selection and missing data handling

Let’s say we choose to analyse the year 2018 (the year with most data). Let’s select the data for that year in the VCP dataset:

# Select a single year for analysis
vci_year <- vci_raw %>%
  filter(year == 2018)

Before we aggregate the data per country, we need to check missingness patterns in the data and address missing responses before aggregation. If we don’t, the weighted means per country that we will calculate will be biased.

vci_year %>%
  summarise(across(c(children, safe, effective, beliefs), ~sum(is.na(.))))

# A tibble: 1 × 4
  children  safe effective beliefs
     <int> <int>     <int>   <int>
1        0     0         0    2187

There are no missing values in children, safe or effective. There are 2187 missing values in beliefs, which means that all values for this variable are missing: we won’t be able to compute a weighted average for this variable so we discard it. It is safe to do so as we see that only beliefs has missing values (2187 missing), while the other variables (children, safe, effective) have zero missing. That’s a strong hint that the missingness is completely at random (MCAR), because the missingness is not associated with any other variable in your dataset. But just to be thorough, let’s show that the missingness mechanism is indeed MCAR:

# Visualise missingness
p <- vis_miss(vci_year) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(title = "Missing Data by Variable")

# View interactively (in RStudio viewer pane)
print(p)

The visualisation makes it clear that the pattern is indeed MCAR. And our only missing values are the beliefs and code (i.e country code) columns, which we won’t use for analysis. So we can essentially just drop these two columns.

vci_year <- vci_year %>% select(-c(code,beliefs))

Now we’re all set to aggregate our indicators.

Objective: Aggregating VCP Indicators to the Country Level

We aim to construct country-level vaccine confidence indicators (VCIs) for three core dimensions:

children: “Vaccines are important for children”
safe: “Vaccines are safe”
effective: “Vaccines are effective”

These indicators should reflect the population-weighted national average attitude, accounting for demographic heterogeneity. Some empirical research consistently shows that vaccine confidence varies systematically by age and, to a lesser extent, by gender [Mondal, Sinharoy, and Su (2021)](Presotto et al. 2022) (so when constructing a weighting model we’ll prefer an age-dominated model). Ignoring demographic structure would yield misleading national estimates—for example, overrepresenting older, more vaccine-confident cohorts in aging populations.

The target estimand for any indicator (X) (e.g., agreement that “vaccines are safe”) is the demographically weighted mean:

\[ \bar{X}_{\text{country}} = \frac{\sum_{a \in \mathcal{A}} \sum_{g \in \mathcal{G}} N_{a,g} \cdot X_{a,g}}{\sum_{a,g} N_{a,g}}, \]

where:
- \(\mathcal{A}\) = set of age groups (e.g., 18–24, 25–34, …, 65+)
- \(\mathcal{G}\) = {Male, Female}
- \(N_{a,g}\) = nominal (absolute) population size of age group (a) and gender (g) (from demographic data)
- \(X_{a,g}\) = true mean response in subgroup ((a,g))

Note: We use absolute population counts (e.g., 3,058,081 males aged 18–24 in Afghanistan), not percentages, to ensure correct weighting.

The core challenge: The VCI survey reports only marginal means—by age \((A_a)\) and by gender \((G_g)\)—but not the joint means \(X_{a,g}\). To proceed, we must adopt a structural model that links the unobserved \(X_{a,g}\) to the observed marginals.

Statistical Framework and Notation

Let \(X\) denote the latent vaccine-confidence score (measured as % agreement). We define:

\(A_a = \mathbb{E}[X \mid \text{Age} = a]\): Age-marginal mean — the average response among all individuals in age group (a), regardless of gender. > Example: If VCI reports children = 97 for "18-24", this means 97% of all 18–24-year-olds (men and women combined) agree that “vaccines are important for children.”
\(G_g = \mathbb{E}[X \mid \text{Gender} = g]\): Gender-marginal mean — the average response among all individuals of gender (\(g\)), regardless of age. > Example: If VCI reports children = 96 for "Male", this means 96% of all men (across all ages) agree with the statement.
\(X_{a,g} = \mathbb{E}[X \mid \text{Age} = a, \text{Gender} = g]\): Joint mean — the average response among individuals who are both in age group (\(a\)) and gender (g).
> Example: The (unobserved) percentage of women aged 25–34 who agree. This is what we need for precise weighting, but it is not reported in the VCI.
\(\mu = \mathbb{E}[X]\): Grand population mean — the true national average attitude, which is our ultimate target.

The grand mean can be computed from age marginals and age-group populations: \[ \mu = \sum_{a} w_a A_a, \quad \text{where} \quad w_a = \frac{N_a}{\sum_{a'} N_{a'}}, \quad N_a = \sum_{g} N_{a,g}. \]

Here, \(\mu\) serves as a normalization constant that ensures coherence between age and gender effects in additive/multiplicative models.

Structural Models for Joint Age–Gender Effects

Because \(X_{a,g}\) is unobserved, we evaluate three plausible structural assumptions grounded in empirical vaccine-confidence literature. All models preserve the true age–gender population structure via weights \(N_{a,g}\); none assume demographic independence.

Model A: Age-Dominant (Primary Specification)

\[ X_{a,g} = A_a \]

Interpretation: Vaccine confidence depends only on age; gender differences within age groups are negligible. This implies \(\mathbb{E}[X \mid a, g] = f(a)\).

Assumptions:

Gender has no independent effect on confidence after conditioning on age.
Population age–gender structure (\(N_{a,g}\)) is fully accounted for in weighting.

Justification: Consistent with empirical findings that age effects dominate (Mondal, Sinharoy, and Su 2021) (Presotto et al. 2022). Mathematically equivalent to assuming zero gender-by-age interaction.

Model B: Additive Effects (Sensitivity Analysis 1)

\[ X_{a,g} = A_a + (G_g - \mu) \]

Interpretation: Age and gender exert independent additive effects. The term \((G_g - \mu)\) represents the gender-specific deviation from the national mean.

Why subtract (\(\mu\))?

This centers the gender effect so that:

\[ \sum_g \frac{N_{a,g}}{N_a} X_{a,g} = A_a \quad \text{(preserves age marginal)} \]
\[ \sum_a \frac{N_{a,g}}{N_g} X_{a,g} = G_g \quad \text{(preserves gender marginal)} \]

Assumptions:

Gender effect is constant across all age groups (e.g., women score 2 points higher than men at every age).
No age–gender interaction; effects are linear and separable.

Limitation: May produce estimates outside [0,100] if effects are large (though rare in practice).

Model C: Multiplicative Effects (Sensitivity Analysis 2)

\[ X_{a,g} = A_a \times \frac{G_g}{\mu} \]

Interpretation: Gender modifies confidence proportionally. If women’s average score is 5% higher than (\(\mu\)), they are 5% higher in every age group.

Assumptions:

Gender effect is scale-invariant (relative, not absolute).
Proportional adjustment preserves the [0,100] response scale.

Use case: Appropriate when gender differences are better expressed as relative deviations (e.g., in low-confidence contexts).

Assumption Summary

Model	Equation	Key Assumption	Preserves Marginals?
A. Age-dominant	\(X_{a,g} = A_a\)	No gender effect within age	Age only
B. Additive	\(X_{a,g} = A_a + (G_g - \mu)\)	Constant absolute gender effect	Both
C. Multiplicative	\(X_{a,g} = A_a \cdot (G_g / \mu)\)	Constant relative gender effect	Both

Critical note: All models use true \(N_{a,g}\) from demographic data, so no assumption of age–gender independence is made. The only assumptions concern the response structure \(X_{a,g}\).

Implementation of Structural Models

To compute country-level vaccine confidence indicators under each structural assumption, we execute a four-stage workflow:

Harmonize age group labels across datasets
Extract marginal VCI responses
Construct the age–gender population table
Compute the grand mean ((\(\mu\))) and apply each model

All steps preserve data integrity and handle missingness transparently (beliefs is excluded as it is 100% missing).

Age Group Harmonization

The VCI dataset reports age groups as "18-24", while the demographic dataset uses "18 - 24" (with spaces). We normalize all labels to a consistent format:

normalize_age_group <- function(x) {
  x <- trimws(x)
  x <- gsub("\\s*\\-\\s*", "-", x)  # "18 - 24" → "18-24"
  x <- ifelse(x == "65", "65+", x)  # align with demographics
  return(x)
}

# Loading demographics dataset

demographics <- read_csv("../../data/US-Census-population-data-2014-2025.csv") %>% clean_names()
# --- Normalize demographics for 2018 ---
demographics_normalized <- demographics %>%
  filter(year == 2018) %>% # restrict to analysis year
  rename (country = country_area_name) %>%
  mutate(
    group_clean = case_when(
      is.na(group) ~ "TOTAL",                  # label total rows
      TRUE ~ normalize_age_group(group)        # normalize age groups
    )
  ) %>%
  select(-group) %>%                           # remove redundant original column
  rename(group = group_clean)                  # keep the cleaned name as 'group'

# --- Normalize VCI ---
vci_normalized <- vci_year %>%
  mutate(
    demographic_clean = if_else(
      demo_type == "Age",                      # only normalize age categories
      normalize_age_group(demographic),
      demographic                              # leave others (Gender, Education, etc.) as-is
    )
  ) %>%
  select(-demographic) %>%                     # drop redundant original column
  rename(demographic = demographic_clean)      # rename cleaned version back to 'demographic'

Let’s do a quick sanity check to verify the normalization produced the dataframes we expected:

demographics_normalized %>%
  count(group) %>%
  arrange(group)

vci_normalized %>%
  count(demo_type, demographic)

> demographics_normalized %>%
+     count(group) %>%
+     arrange(group)
# A tibble: 7 × 2
  group     n
  <chr> <int>
1 18-24   227
2 25-34   227
3 35-44   227
4 45-54   227
5 55-64   227
6 65+     227
7 TOTAL   227
> 
> vci_normalized %>%
+     count(demo_type, demographic)
# A tibble: 18 × 3
   demo_type demographic                  n
   <chr>     <chr>                    <int>
 1 Age       18-24                      144
 2 Age       25-34                      144
 3 Age       35-44                      144
 4 Age       45-54                      144
 5 Age       55-64                      144
 6 Age       65+                        144
 7 Education Primary and below          144
 8 Education Secondary and Vocational   144
 9 Education University and above       142
10 Gender    Female                     144
11 Gender    Male                       144
12 Religion  Atheist/agnostic           104
13 Religion  Buddhist                    60
14 Religion  Christian                  130
15 Religion  Hindu                       41
16 Religion  Jewish                      33
17 Religion  Muslim                     119
18 Religion  Other                      118

This output confirms everything is now nicely standardized:

Both datasets have clean, consistent age labels (18-24, 25-34, …, 65+). -demographics_normalized has one record per age group per country, plus “TOTAL” (to account for the total population).
vci_normalized includes multiple demographic types (Age, Gender, Education, Religion, etc.) — exactly what we expected.

Extract Marginal VCI Responses

We isolate age- and gender-specific survey means:

# Age-marginal responses (A_a)
vci_age <- vci_normalized %>%
  filter(demo_type == "Age") %>%
  select(country, age_group = demographic, children_a = children, safe_a = safe, effective_a = effective)

# Gender-marginal responses (G_g)
vci_gender <- vci_year %>%
  filter(demo_type == "Gender") %>%
  select(country, gender = demographic, children_g = children, safe_g = safe, effective_g = effective)

Construct Age–Gender Population Table

We build the joint population table \(N_{a,g}\):

demo_ag <- demographics_normalized %>%
  filter(group != "TOTAL") %>%                # remove total rows
  select(country, age_group = group,          # rename group for clarity
         male_population, female_population) %>%
  pivot_longer(
    cols = c(male_population, female_population),
    names_to = "gender_pop",
    values_to = "population"
  ) %>%
  mutate(
    gender = if_else(gender_pop == "male_population", "Male", "Female")
  ) %>%
  select(country, age_group, gender, population)

Compute Grand Mean (\(\mu\))

The grand mean (\(\mu\)) is the population-weighted national average attitude. We compute it by weighting age-specific responses by total age-group populations (summed across genders). This serves as the normalization constant for Models B and C, ensuring that additive and multiplicative adjustments preserve the overall marginal structure.

# Use demographics_normalized instead of the raw demographics dataset
# because labels and "TOTAL" are already handled there.

age_pops <- demographics_normalized %>%
  filter(group != "TOTAL") %>%                 # exclude total rows
  select(country, age_group = group,           # rename group for consistency
         age_pop = population)                 # keep population for weighting

grand_means <- vci_age %>%
  inner_join(age_pops, by = c("country", "age_group")) %>%
  group_by(country) %>%
  summarise(
    mu_children = weighted.mean(children_a, age_pop, na.rm = TRUE),
    mu_safe = weighted.mean(safe_a, age_pop, na.rm = TRUE),
    mu_effective = weighted.mean(effective_a, age_pop, na.rm = TRUE),
    .groups = "drop"
  )

Apply Structural Models

Model A: Age-Dominant

Model A implementation: We join the age marginals with the population grid and assign the same age-specific score to both males and females within each age group \((\hat{X}_{a,g} = A_a)\). We then compute the population-weighted country mean by summing across all age × gender cells.

ag_A <- demo_ag %>%
  left_join(vci_age, by = c("country", "age_group")) %>%
  mutate(children_hat = children_a, safe_hat = safe_a, effective_hat = effective_a)

agg_A <- ag_A %>%
  group_by(country) %>%
  summarise(
    children_w = weighted.mean(children_hat, population, na.rm = TRUE),
    safe_w = weighted.mean(safe_hat, population, na.rm = TRUE),
    effective_w = weighted.mean(effective_hat, population, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(method = "Age-dominant")

Model B: Additive

Model B implementation: We join age marginals, gender marginals, and grand means to the population grid. For each cell, we add the gender-specific deviation \((G_g - \mu)\) to the age baseline \(A_a\). This assumes gender has a constant additive effect across all age groups.

ag_B <- demo_ag %>%
  left_join(vci_age, by = c("country", "age_group")) %>%
  left_join(vci_gender, by = c("country", "gender")) %>%
  left_join(grand_means, by = "country") %>%
  mutate(
    children_hat = children_a + (children_g - mu_children),
    safe_hat = safe_a + (safe_g - mu_safe),
    effective_hat = effective_a + (effective_g - mu_effective)
  )

agg_B <- ag_B %>%
  group_by(country) %>%
  summarise(
    children_w = weighted.mean(children_hat, population, na.rm = TRUE),
    safe_w = weighted.mean(safe_hat, population, na.rm = TRUE),
    effective_w = weighted.mean(effective_hat, population, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(method = "Additive")

Model C: Multiplicative

Model C implementation: We join age marginals, gender marginals, and grand means to the population grid. For each cell, we scale the age baseline \(A_a\) by the relative gender factor \((G_g / \mu)\). This assumes gender effects are proportional rather than additive.

ag_C <- demo_ag %>%
  left_join(vci_age, by = c("country", "age_group")) %>%
  left_join(vci_gender, by = c("country", "gender")) %>%
  left_join(grand_means, by = "country") %>%
  mutate(
    children_hat = children_a * (children_g / mu_children),
    safe_hat = safe_a * (safe_g / mu_safe),
    effective_hat = effective_a * (effective_g / mu_effective)
  )

agg_C <- ag_C %>%
  group_by(country) %>%
  summarise(
    children_w = weighted.mean(children_hat, population, na.rm = TRUE),
    safe_w = weighted.mean(safe_hat, population, na.rm = TRUE),
    effective_w = weighted.mean(effective_hat, population, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(method = "Multiplicative")

Model Comparison and Final Selection

To ensure that our results are robust to modeling assumptions, we now assess how sensitive the country-level estimates are to the choice of structural model (Age-dominant, Additive, Multiplicative). The key idea is simple: if all three models produce nearly identical results, then the aggregated vaccine confidence indicators are stable and our preferred model choice is defensible.

We proceed in three steps:

Combine the results from all models into one dataset.
Compute, for each country and indicator, the range (maximum – minimum) across models.
Summarize and visualize these ranges to evaluate robustness.

💡 Code explanation (read this before running the code)

This section performs a sensitivity analysis: we quantify how much our country-level indicators change when we swap the structural assumption used to impute the unobserved Age × Gender means \(X_{a,g}\).

What we’re comparing. We have three country-level outputs:

agg_A: Age-dominant (\(X_{a,g}=A_a\))
agg_B: Additive (\(X_{a,g}=A_a+(G_g-\mu)\))
agg_C: Multiplicative (\(X_{a,g}=A_a \times (G_g/\mu)\))

Each contains the same indicators—children_w, safe_w, effective_w—computed with the same Age × Gender population weights \(N_{a,g}\), but under different structural assumptions about \(X_{a,g}\).

Step 1 – Combine model outputs. We row-bind (bind_rows) the three tables into a single long table agg_all. This gives, for each country × indicator, three comparable values—one per model. Having them side-by-side allows like-for-like comparison. If there were any countries missing from a model (e.g., due to an upstream join misalignment), this step would expose that.

Step 2 – Compute within-country ranges. We reshape with pivot_longer() so we can process the three indicators generically, then we group_by(country, indicator) and compute:

min_val: the smallest value across A/B/C
max_val: the largest value across A/B/C
range_val = max_val − min_val: the largest possible disagreement between models for that country–indicator

Why range? With only three models, the range is the most transparent measure of sensitivity: it literally answers “how far apart do the models get?” A variance/SD is less interpretable here. We also set na.rm = TRUE to ensure that any occasional NA (from unmatched joins or missing values after imputation) does not crash the calculation; if a country had fewer than two non-missing model values, range_val might be 0 or NA—that itself is a useful diagnostic.

Step 3 – Summarize the differences. We summarize range_val across countries for each indicator using:

Median (median_range): the typical model difference
95th percentile (perc95_range): a high-end difference capturing almost all countries
99th percentile (perc99_range): a “near worst-case” difference

Interpretation guideline: if the median is ~0.1–0.2 percentage points (pp) and even the 99th percentile is ~≤2–2.5 pp, then practical conclusions (e.g., country rankings, effect sizes) are unlikely to change with the model choice. In your current run, you reported medians ~0.11–0.15 pp and 99th percentiles ~2.04–2.35 pp—excellent robustness.

What this gives us.

If the ranges are small, our estimates are structurally stable: the choice between Age-dominant, Additive, and Multiplicative assumptions has negligible impact on country-level VCIs. That justifies using the simpler, empirically grounded Age-dominant model as the primary series going forward.

#| label: cmp-prepare
#| echo: true
#| message: false
#| warning: false
# --- Step 1: Combine model outputs ---
# Each model (A, B, C) contains country-level weighted indicators.
# We bind them together to compare values for the same country across models.

agg_all <- bind_rows(agg_A, agg_B, agg_C)

# --- Step 2: Compute within-country ranges ---
# For each country and indicator (children, safe, effective),
# we calculate how much the values differ across the three models.
# The range (max - min) measures the largest possible disagreement.

model_range <- agg_all %>%
  pivot_longer(
    cols = c(children_w, safe_w, effective_w),
    names_to = "indicator",
    values_to = "value"
  ) %>%
  group_by(country, indicator) %>%
  summarise(
    min_val   = min(value, na.rm = TRUE),
    max_val   = max(value, na.rm = TRUE),
    range_val = max_val - min_val,
    .groups   = "drop"
  )

# --- Step 3: Summarize the ranges across all countries ---
# We report the median (typical difference) and the 95th and 99th percentiles (worst-case differences)
# of these ranges for each indicator. Small values indicate strong robustness.

summary_range <- model_range %>%
  group_by(indicator) %>%
  summarise(
    median_range = median(range_val, na.rm = TRUE),
    perc95_range = quantile(range_val, 0.95, na.rm = TRUE),
    perc99_range = quantile(range_val, 0.99, na.rm = TRUE),
    .groups = "drop"
  )

Table with sensitivity analysis results

#| label: tbl-model-sensitivity
#| tbl-cap: "Cross-model sensitivity of country-level indicators (range in percentage points). Lower values indicate higher robustness to modeling assumptions."
#| echo: false
#| message: false
#| warning: false
summary_range %>%
  dplyr::mutate(
    dplyr::across(dplyr::where(is.numeric), ~round(.x, 3))
  ) %>%
  knitr::kable(align = c("l", "r", "r", "r"))

indicator	median_range	perc95_range	perc99_range
children_w	0.110	1.438	2.159
effective_w	0.152	1.516	2.345
safe_w	0.119	1.622	2.045

Another way to visualise the sensitivity analysis results

#| label: fig-model-range
#| fig-cap: "Distribution of per-country model ranges by indicator. Boxes clustered near zero confirm that results are stable across structural assumptions."
#| fig-width: 7
#| fig-height: 4.5
#| echo: true
#| message: false
#| warning: false

ggplot(model_range, aes(x = indicator, y = range_val)) +
  geom_boxplot(fill = "turquoise3", alpha = 0.6, outlier.alpha = 0.3) +
  coord_flip() +
  labs(
    title = "Cross-model Sensitivity of Vaccine Confidence Indicators (2018)",
    subtitle = "Range (max–min) across Age-dominant, Additive, and Multiplicative models",
    x = "Indicator",
    y = "Range across models (percentage points)"
  ) +
  theme_minimal(base_size = 12)

Interpretation.

The table shows median ranges around 0.11–0.15 pp and 99th-percentile ranges around 2.04–2.35 pp (our numbers: children_w ≈ 0.110 / 2.159, effective_w ≈ 0.152 / 2.345, safe_w ≈ 0.119 / 2.045).

The boxplots confirm these differences are tightly clustered near zero with only a few mild outliers.

Substantively, this means:

The three modeling strategies yield nearly identical country-level estimates.
Practical conclusions (e.g., ranking countries, interpreting effect sizes) are unchanged by the structural assumption.
We can confidently proceed using the Age-dominant model (A) as the primary series and merge this data with WHO/UNICEF data.

#| label: select-final-vci-dataset
#| echo: true
#| message: false
#| warning: false
vci_country_2018 <- agg_A %>%
  dplyr::rename(children = children_w, safe = safe_w, effective = effective_w) %>%
  dplyr::mutate(year = 2018) %>%
  dplyr::select(country, year, children, safe, effective)

Comparison of the age-dominant model with a model unweighted by population

To better understand the impact of population weighting, we compare the age-dominant model (which uses age-group population weights) with a simple unweighted model that treats all age groups as equally represented within each country. This illustrates how demographic weighting influences national vaccine confidence estimates.

#| label: compare-weighting
#| echo: true
#| message: false
#| warning: false

# 1️⃣ Compute simple unweighted means by country
vci_unweighted <- vci_age %>%
  group_by(country) %>%
  summarise(
    children_unweighted = mean(children_a, na.rm = TRUE),
    safe_unweighted = mean(safe_a, na.rm = TRUE),
    effective_unweighted = mean(effective_a, na.rm = TRUE),
    .groups = "drop"
  )

# 2️⃣ Combine with age-dominant (weighted) results
compare_weighting <- agg_A %>%
  rename(
    children_weighted = children_w,
    safe_weighted = safe_w,
    effective_weighted = effective_w
  ) %>%
  left_join(vci_unweighted, by = "country") %>%
  mutate(
    diff_children = children_weighted - children_unweighted,
    diff_safe = safe_weighted - safe_unweighted,
    diff_effective = effective_weighted - effective_unweighted
  )

# 3️⃣ Summarise across countries
summary_weighting <- compare_weighting %>%
  summarise(
    across(starts_with("diff_"),
           list(
             mean   = ~mean(.x, na.rm = TRUE),
             median = ~median(.x, na.rm = TRUE),
             p95    = ~quantile(.x, 0.95, na.rm = TRUE),
             p99    = ~quantile(.x, 0.99, na.rm = TRUE)
           )
    ),
    .groups = "drop"
  )

# 4️⃣ Make output tidy for reporting
summary_weighting_long <- summary_weighting %>%
  pivot_longer(
    everything(),
    names_to = c("indicator", "stat"),
    names_pattern = "diff_(.*)_(.*)",
    values_to = "value"
  ) %>%
  pivot_wider(names_from = stat, values_from = value)

summary_weighting_long

# A tibble: 3 × 5
  indicator    mean median   p95   p99
  <chr>       <dbl>  <dbl> <dbl> <dbl>
1 children   0.173  0.174   1.65  3.03
2 safe       0.0924 0.121   1.85  2.83
3 effective -0.0699 0.0626  1.43  2.96

Code explanation (step-by-step):

Compute unweighted national means:

We average age-specific responses for each indicator within a country, without considering population size. This assumes all age groups contribute equally to the national mean — an unrealistic but instructive baseline.

Join with weighted estimates (Model A):

The age-dominant model uses actual age-group population weights, yielding population-representative national scores. We join the two sets of results and compute the differences for each indicator (diff_*).

Summarise across countries: We report the mean, median, and 95th and 99th percentile of these differences to understand how much population weighting changes the results globally.

Reshape for readability: The final pivot_longer and pivot_wider steps simply format the summary into a clear table by indicator and statistic.

Results Interpretation

Indicator	Mean diff (pp)	Median diff (pp)	95th percentile (pp)	99th percentile (pp)
children	0.17	0.17	1.65	3.03
safe	0.09	0.12	1.85	2.83
effective	-0.07	0.06	1.43	2.96

Differences are small but systematic. On average, unweighted estimates are slightly higher (by around 0.1–0.2 pp for most indicators), reflecting the overrepresentation of older, more confident respondents in unweighted means. In a few cases (top 1–5% of countries), the differences reach 2–3 percentage points — enough to alter country rankings.

Interpretation: Weighting by age corrects these biases and ensures that national means reflect true population distributions. This demonstrates, empirically, why the age-dominant (weighted) model provides a more defensible baseline.

Interpretation, Validation, and Limitations

Our aggregation framework produces demographically representative national vaccine confidence indicators that explicitly account for age structure — ensuring that cross-country comparisons reflect population composition rather than sample artifacts.

By anchoring estimates in actual population weights (\(N_{a,g}\)) and grounding our modeling assumptions in empirical literature, we achieve comparability and interpretability across countries.

Validation Against Alternative Approaches

We conducted two complementary robustness checks:

Model structure sensitivity — compared three demographic models (Age-dominant, Additive, Multiplicative). Results were nearly identical: median inter-model differences were below 0.2 percentage points, with 99% of countries differing by less than 2.5 pp. → This confirms that our country-level estimates are robust to reasonable modeling assumptions about gender effects.
Weighting sensitivity — compared the age-dominant (population-weighted) model with unweighted country means (simple averages across age groups). Unweighted estimates were typically 0.1–0.2 percentage points higher, occasionally up to 3 pp in aging countries where older, more confident cohorts are overrepresented. → This shows that age-weighting corrects subtle but systematic demographic bias.

Overall conclusion: Both sensitivity analyses indicate that our results are stable and demographically defensible. The Age-dominant (age-weighted) model is therefore adopted as the primary specification.

Key Limitations

Despite its rigor, our approach is not without limitations. Each limitation below explains both what the issue is and why it matters for the interpretation of results.

1. No joint-response data (Age × Gender cells unobserved)

The VCI dataset only provides marginal means by age and gender, not by their joint combinations (e.g., “women aged 25–34”). To obtain national estimates, we therefore relied on structural models (Age-dominant, Additive, Multiplicative) that approximate these joint means under simplifying assumptions.

Why this matters: If real-world age and gender effects interact — for instance, if younger women are less confident than younger men, but older women are more confident — none of our models can fully capture that pattern. This could slightly misstate national averages in populations with strong demographic interactions.
Implication: Results are robust for overall trends, but may underestimate subgroup heterogeneity within countries. Future data collections should publish cross-tabulated VCI responses by age and gender to enable more precise modeling.

2. Nonresponse and coverage bias

The VCI is primarily based on online or mobile panel surveys, which may systematically underrepresent certain population groups — for example, older adults without internet access, low-income rural populations, or minority-language speakers.

Why this matters: Even if population weights correct for known demographics (e.g., the proportion of older people), they cannot adjust for attitudinal differences within demographic cells. If those who choose to respond are systematically more vaccine-positive than those who don’t, the resulting estimates will overstate true confidence.
Implication: Weighting reduces, but cannot eliminate, self-selection bias. Independent validation against nationally representative surveys (e.g., Gallup or Pew datasets) would strengthen external validity.

3. Population uncertainty and temporal alignment

Demographic population estimates come from international repositories (UN, World Bank, or national statistical offices) and are themselves modeled using censuses and projections. While we restricted analysis to 2018 to align with VCI data, underlying population structures (especially for small or rapidly changing countries) may still be uncertain.

Why this matters: In countries with incomplete census coverage or high migration, age–gender population counts may have nontrivial errors. This affects weighting precision — particularly in small countries where a few thousand people can shift proportions significantly.
Implication: Confidence indicators for such countries should be interpreted with caution, and uncertainty intervals could be incorporated in future work through Bayesian demographic modeling or bootstrap resampling.

4. Interpretational limits of attitudinal indicators

The VCI indicators measure self-reported agreement with statements like “Vaccines are safe” or “Vaccines are important for children.” These are valuable for gauging confidence, but they are not equivalent to behavioral intent or actual uptake.

Why this matters:

A population may strongly agree that vaccines are safe, yet still show low vaccination coverage due to access barriers, misinformation, or political distrust. Conversely, a country with moderate confidence scores might achieve high uptake through effective public health systems.
Implication:

The aggregated indicators reflect attitudinal readiness, not behavioral compliance. Interpretation should therefore avoid assuming a direct one-to-one link between confidence and coverage.

5. Cross-country comparability challenges

Although we harmonized demographic and survey structures, question phrasing, translation, and cultural interpretation can vary across national contexts.

Why this matters: In some languages, “safe” may connote “without any risk,” while in others it implies “generally beneficial.” Such linguistic nuances introduce small but systematic measurement differences across countries.
Implication: Cross-country comparisons should be treated as approximate indicators of relative confidence, not precise interval-scaled values. When possible, analyses should focus on within-country changes over time, where such biases cancel out.

Summary Table: Sources and Implications of Limitations

Limitation	Source of Bias	Likely Direction	Mitigation
Missing joint Age×Gender data	Structural modeling assumption	May smooth true subgroup variation	Sensitivity analysis across models
Nonresponse / coverage bias	Online survey undercoverage	Potential overestimation of confidence	Apply demographic weights; external validation
Population uncertainty	Census and projection errors	Random; may add noise	Restrict to well-measured years; bootstrap weighting
Attitude vs. behavior gap	Conceptual mismatch	May mispredict uptake	Use VCI as attitudinal, not behavioral, measure
Cross-cultural measurement differences	Translation / interpretation variance	Context-dependent	Prefer within-country comparisons

Why This Approach Is Defensible

This method follows best practices in survey weighting and cross-national comparative analysis:

Uses true population weights to represent national demographics accurately.
Grounds structural assumptions in peer-reviewed empirical evidence.
Includes transparent robustness checks across models and weighting schemes.
Avoids ecological fallacy by staying at the appropriate level of aggregation.

Weighting doesn’t aim to make the sample look exactly like the population in every single detail. It aims to adjust for known sources of bias in a principled way.

Our workflow accomplishes precisely that for global vaccine confidence estimation.

Merging VCI dataset with WHO/UNICEF and World Bank datasets

To model predictors of vaccine confidence, we now merge the aggregated VCI dataset with vaccination coverage (WHO/UNICEF) and socioeconomic indicators (World Bank). However, the two auxiliary datasets use different formats and naming conventions, so we first clean, standardize, and reshape them for compatibility.

Step 1: Load the WHO/UNICEF and World Bank datasets

who_data <- read_csv('../../data/who_measles_first_dose_coverage.csv') %>% clean_names()
wb_data  <- read_csv('../../data/WorldBank_socio-economic_indicators_with_metadata.csv') %>% clean_names()

> who_data <- read_csv('../../data/who_measles_first_dose_coverage.csv') %>% clean_names()
Rows: 4831 Columns: 34                                             s
── Column specification ───────────────────────────────────────────
Delimiter: ","
chr  (10): IndicatorCode, Indicator, ValueType, ParentLocationC...
dbl   (3): Period, FactValueNumeric, Value
lgl  (20): IsLatestYear, Dim1 type, Dim1, Dim1ValueCode, Dim2 t...
dttm  (1): DateModified

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> wb_data  <- read_csv('../../data/WorldBank_socio-economic_indicators_with_metadata.csv') %>% clean_names()
Rows: 2676 Columns: 15                                             32.29MB/s, eta:  0s
── Column specification ───────────────────────────────────────────
Delimiter: ","
chr (15): Series Name, Series Code, Country Name, Country Code,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code explanation

read_csv() imports the raw CSV files.

clean_names() (from janitor) standardizes variable names to lower snake_case (e.g., IndicatorCode → indicator_code), ensuring they are consistent and easier to reference downstream.

Assumption

Both files contain country-level data for multiple years; we will filter to 2018 to match the year of the VCI survey.

Step 2: Prepare WHO/UNICEF data (measles MCV1 coverage)

who_2018 <- who_data %>%
  # Keep the MCV1 series, annual values, year 2018
  filter(indicator_code == "WHS8_110",
         period_type == "Year",
         period == 2018) %>%
  transmute(
    iso3 = spatial_dim_value_code,   # already ISO3 in your file (e.g., "MNE", "CAF")
    country_who = location,          # keep for diagnostics
    year = period,
    measles_mcv1 = coalesce(fact_value_numeric, as.numeric(value))
  ) %>%
  distinct(iso3, year, .keep_all = TRUE)  # one row per iso3-year

Code explanation

filter() selects:

indicator_code == "WHS8_110" → the WHO/UNICEF code for measles first-dose coverage (MCV1).

period_type == "Year" ensures we keep annual data (some WHO tables include monthly or cumulative estimates).

period == 2018 isolates the year of analysis.

transmute():

Keeps only relevant columns and renames them for clarity (iso3, year, measles_mcv1).

Uses coalesce() to fill missing numeric values by attempting conversion from the value column (some WHO exports store numbers as text).

distinct() ensures one observation per country-year pair.

Assumptions

The WHO data already use ISO3 country codes (spatial_dim_value_code), so we don’t need to normalize names here.

MCV1 coverage is a valid proxy for overall vaccination performance — consistent with prior research: “Thus, measles vaccination serves as a good proxy for ‘full immunisation’, ensuring that a child has received all critical age-appropriate vaccines and has ongoing access to healthcare.” (Vaccher et al. 2025)

Step 3: Prepare World Bank socioeconomic indicators

# Map Series Code -> short column names you want in the model
wb_keep <- tribble(
  ~series_code,             ~var,
  "NY.GDP.PCAP.CD",         "gdp_per_capita_current_usd",
  "SI.POV.NAHC",            "poverty_headcount_national",     
  "SL.UEM.TOTL.ZS",         "unemployment_rate",              
  "SE.ADT.LITR.ZS",         "adult_literacy_rate",            
  "SE.PRM.ENRR",            "primary_school_enrollment",      
  "SH.XPD.GHED.GD.ZS",      "domestic_health_expenditure_pct_gdp",
  "SH.MED.PHYS.ZS",         "physicians_per_1000",
  "SH.MED.NUMW.P3",         "nurses_and_midwives_per_1000",
  "SH.MED.CHWK.ZS",         "community_health_workers_per_1000",
  "SH.STA.BASS.ZS",         "access_to_sanitation"
)

# The WB file is wide by year; keep 2018 only, then pivot wider by Series Code
wb_2018 <- wb_data %>%
  select(
    series_code, country_code,
    x2018_yr2018
  ) %>%
  rename(
    iso3 = country_code,
    value_2018 = x2018_yr2018
  ) %>%
  # Keep only indicators we care about
  inner_join(wb_keep, by = "series_code") %>%
  mutate(year = 2018) %>%
  select(iso3, year, var, value_2018) %>%
  # Pivot to wide so each kept indicator becomes a column
  pivot_wider(names_from = var, values_from = value_2018)

Code explanation

The World Bank dataset is in wide format — each year is a separate column (e.g., 2018 [YR2018]).

We use select() to keep the country code, series code, and 2018 values only.

rename() makes the naming consistent (iso3, value_2018).

inner_join(wb_keep) filters the dataset down to the specific indicators we plan to use (based on Series Code values).

pivot_wider() converts the long indicator list into one row per country, with each indicator as a separate column.

Assumptions

Indicators are cross-sectionally complete for 2018. Missing values will be carried as NA without imputation.

ISO3 country codes are consistent across all World Bank files.

Step 4: Harmonize country names and add ISO3 codes to the VCI dataset

# Custom fixes for tricky names across sources
# Note: Kosovo (XKX) and Saint Martin (MAF) are not standard ISO3;
# custom-coded here for join consistency across WHO/WB datasets.
custom_match <- c(
  "Congo, Democratic Republic of the" = "COD",
  "Congo, Republic of the"            = "COG",
  "Côte d’Ivoire"                     = "CIV",
  "Cote d'Ivoire"                     = "CIV",
  "Eswatini"                          = "SWZ",
  "Russian Federation"                = "RUS",
  "Viet Nam"                          = "VNM",
  "Iran"                              = "IRN",
  "Iran, Islamic Republic of"         = "IRN",
  "Egypt"                             = "EGY",
  "Egypt, Arab Rep."                  = "EGY",
  "Syrian Arab Republic"              = "SYR",
  "Türkiye"                           = "TUR",
  "Gambia, The"                       = "GMB",
  "Lao PDR"                           = "LAO",
  "Kyrgyz Republic"                   = "KGZ",
  "Yemen, Rep."                       = "YEM",
  "Hong Kong SAR, China"              = "HKG",
  "Macao SAR, China"                  = "MAC",
  "Kosovo"                            = "XKX",
  "Saint Martin"                      = "MAF" 
)

vci_iso <- vci_country_2018 %>%
  mutate(
    iso3 = countrycode(country, "country.name", "iso3c", custom_match = custom_match)
  ) %>%
  select(iso3, year, country, children, safe, effective)

Code explanation

The VCI dataset currently uses plain English country names (e.g., "Congo, Republic of the").

countrycode() maps these to ISO3 codes, which serve as stable join keys across datasets.

The custom_match vector manually overrides problematic names like "Kosovo" and "Saint Martin" that are not officially recognized ISO members.

We retain only the ISO3 code, country name, year, and the three vaccine confidence indicators.

Assumptions

All remaining countries can be uniquely matched via countrycode().

Custom overrides ensure one-to-one mappings for special territories (e.g., "Côte d’Ivoire" → CIV).

Step 5: Diagnostics — check for unmatched countries


merged_data <- vci_iso %>%
  left_join(who_2018, by = c("iso3", "year")) %>%
  left_join(wb_2018,  by = c("iso3", "year"))

# Quick diagnostics: which VCI countries failed to find WHO/WB matches?
anti_who <- vci_iso %>%
  anti_join(who_2018, by = c("iso3", "year"))

anti_wb  <- vci_iso %>%
  anti_join(wb_2018,  by = c("iso3", "year"))

nrow(merged_data)
anti_who %>% select(country, iso3) %>% distinct() %>% arrange(country)
anti_wb  %>% select(country, iso3) %>% distinct() %>% arrange(country)

> nrow(merged_data)
[1] 227
> anti_who %>% select(country, iso3) %>% distinct() %>% arrange(country)
# A tibble: 33 × 2
   country          iso3 
   <chr>            <chr>
 1 American Samoa   ASM  
 2 Anguilla         AIA  
 3 Aruba            ABW  
 4 Bermuda          BMU  
 5 Cayman Islands   CYM  
 6 Curaçao          CUW  
 7 Faroe Islands    FRO  
 8 French Polynesia PYF  
 9 Gibraltar        GIB  
10 Greenland        GRL  
# ℹ 23 more rows
# ℹ Use `print(n = ...)` to see more rows
> anti_wb  %>% select(country, iso3) %>% distinct() %>% arrange(country)
# A tibble: 10 × 2
   country                                       iso3 
   <chr>                                         <chr>
 1 Anguilla                                      AIA  
 2 Cook Islands                                  COK  
 3 Guernsey                                      GGY  
 4 Jersey                                        JEY  
 5 Montserrat                                    MSR  
 6 Saint Barthelemy                              BLM  
 7 Saint Helena, Ascension, and Tristan da Cunha SHN  
 8 Saint Pierre and Miquelon                     SPM  
 9 Taiwan                                        TWN  
10 Wallis and Futuna                             WLF

Code explanation

anti_join() finds records in vci_iso that do not have a match in WHO or WB data (based on iso3 and year).

This helps identify territories missing from auxiliary sources — typically small islands or non-sovereign regions (e.g., Hong Kong, Guernsey, Puerto Rico).

Printing these tables lets us decide whether to drop or reassign them before analysis.

Assumptions

WHO and World Bank only report for sovereign countries, so non-matching entities will be excluded.

Dropping these has negligible effect on global or regional patterns due to their small populations.

Step 6: Merge datasets and finalize analysis file

analysis_2018 <- vci_iso %>%
  inner_join(who_2018, by = c("iso3", "year")) %>%
  left_join(wb_2018,  by = c("iso3", "year"))

Code explanation

inner_join() ensures we only keep countries with both VCI and WHO vaccination coverage data (since WHO is the dependent variable in later modeling).

left_join() appends available World Bank socioeconomic indicators. Missing indicators are left as NA.

The result, analysis_2018, is a harmonized country-level dataset for 2018, ready for modeling.

Assumptions

The modeling will only use complete cases (countries with valid outcome and covariates).

Joining on ISO3 ensures consistent cross-dataset alignment and avoids ambiguity in country names.

Question 3

We’ll be exploring distributions, relationships, potential outliers, and missingness in the merged dataset analysis_2018 (from Q2).

Distributions of aggregated VCP indicators and vaccination coverage

#| label: q3-dists
#| echo: true
#| message: false
#| warning: false

d <- analysis_2018

p1 <- ggplot(d, aes(x = children)) +
  geom_histogram(binwidth = 2, fill = "turquoise2") +
  labs(title = "VCP: Vaccines Important for Children",
       x = "Percent agreeing", y = "Countries")

p2 <- ggplot(d, aes(x = safe)) +
  geom_histogram(binwidth = 2, fill = "turquoise2") +
  labs(title = "VCP: Vaccines Are Safe",
       x = "Percent agreeing", y = "Countries")

p3 <- ggplot(d, aes(x = effective)) +
  geom_histogram(binwidth = 2, fill = "turquoise2") +
  labs(title = "VCP: Vaccines Are Effective",
       x = "Percent agreeing", y = "Countries")

p4 <- ggplot(d, aes(x = measles_mcv1)) +
  geom_histogram(binwidth = 2, fill = "turquoise2") +
  labs(title = "Measles (MCV1) Coverage",
       x = "Coverage (%)", y = "Countries")

(p1 | p2) / (p3 | p4)

Interpretation

This plot shows histograms summarizing vaccine confidence and coverage across countries, using four indicators:

Top left — “Vaccines Important for Children”
- Most countries have a high percentage of people agreeing that vaccines are important for children (mostly above 70%).
- A smaller number of countries have lower agreement (below 60%).
- Interpretation: There’s broad global consensus on vaccine importance, though a few countries have weaker confidence.
Top right — “Vaccines Are Safe”
- Responses are more spread out, ranging roughly from 30% to 90%.
- The distribution peaks around 70–80%, meaning that in many countries, most people consider vaccines safe—but confidence in safety varies more than in importance to children.
- Interpretation: Perceived safety is a more contentious issue and may explain vaccine hesitancy.
Bottom left — “Vaccines Are Effective”
- Similar to the safety perception, but slightly more concentrated between 60% and 90%.
- Most countries believe in vaccine effectiveness, but there is still noticeable variation.
- Interpretation: While effectiveness is generally accepted, it’s not as universally agreed upon as importance of vaccines to children.
Bottom right — “Measles (MCV1) Coverage”
- Most countries have high coverage rates (80–100%), though a few fall much lower (around 30–60%).
- The sharp peak near 100% suggests that many nations have successfully achieved high measles vaccination rates.
- Interpretation: Actual coverage levels are generally high, but countries with lower confidence (in safety or effectiveness) might also show lower coverage.

Overall interpretation:

Vaccine confidence is generally high but uneven—people tend to agree that vaccines are important to children, though they’re less certain about safety and effectiveness.
The high MCV1 (measles) coverage in most countries suggests that even with some doubts, vaccination programs remain widely implemented.
However, countries with low perceived safety/effectiveness could be at higher risk for reduced vaccine uptake in the future.

Relationships with socio-economic / health system indicators

Let’s select a few variables from the analysis_2018 dataset and rename them for plotting purposes:

rel_vars_clean <- analysis_2018 |>
  select(
    measles_mcv1,
    vax_children = children,
    vax_safe = safe,
    vax_effective = effective,
    gdp_pc = gdp_per_capita_current_usd,
    literacy = adult_literacy_rate,
    health_spend = domestic_health_expenditure_pct_gdp,
    sanitation = access_to_sanitation,
    school_enrol = primary_school_enrollment,
    poverty = poverty_headcount_national,
    doctors = physicians_per_1000,
    nurses = nurses_and_midwives_per_1000
  ) |>
  mutate(across(everything(), as.numeric))

Warning :There were 8 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(everything(), as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run warnings()dplyr::last_dplyr_warnings() to see the 7
  remaining warnings.

Correlation heatmap

#| label: q3-correlation-heatmap
#| echo: true
#| message: false
#| warning: false

# Compute correlations
cor_mat <- cor(rel_vars_clean, use = "pairwise.complete.obs")

# Plot
ggcorrplot(
  cor_mat,
  hc.order = TRUE,              # Cluster similar variables together
  type = "lower",               # Only lower half
  lab = TRUE,                   # Show correlation coefficients
  lab_size = 2.5,
  colors = c("red", "white", "blue"),
  title = "Correlation Heatmap: Vaccine Coverage,\n Confidence, and System Indicators",
  tl.cex = 10,
  tl.srt = 45
)

Key findings from the correlation heatmap

Vaccine confidence variables are strongly correlated with each other

vax_safe, vax_children, and vax_effective show very high correlations (r ≈ 0.8–0.9). → Countries that believe vaccines are safe also tend to believe they are effective and important for children. → This suggests these attitudes cluster together as a single “vaccine confidence” construct.

Measles (MCV1) coverage relates moderately to both confidence and system indicators

measles_mcv1 correlates positively (r ≈ 0.4–0.6) with:
- literacy
- sanitation
- gdp_pc
- nurses and health_spend
These are all system-level indicators (socioeconomic development and healthcare capacity). → Higher coverage tends to occur in more developed contexts — where infrastructure, education, and healthcare are stronger.
With vaccine confidence variables (vax_safe, vax_effective), correlations are smaller (~0.1–0.3). → While confidence matters, system strength seems to have a stronger statistical link to actual vaccination coverage.

Poverty is negatively correlated with nearly everything

poverty shows negative correlations (~–0.4 to –0.6) with:
- literacy, sanitation, gdp_pc, health_spend, nurses, etc. → Expected: poorer countries have weaker infrastructure and lower vaccine coverage.
Also slightly negative with vaccine confidence, implying that poverty may also affect attitudes (perhaps via trust, access, or education).

Health system indicators are interlinked

gdp_pc, nurses, and health_spend correlate very strongly (~0.65–0.85). → Wealthier countries spend more on health and have higher staffing — consistent with expected patterns of development.

Summary interpretation of the correlation heatmap

Theme	Relationship	Strength	Interpretation
Vaccine confidence variables	Very strong positive	~0.8–0.9	Beliefs about safety, effectiveness, and importance move together
Measles coverage vs. confidence	Weak–moderate positive	~0.2–0.4	Some link, but weaker than with development factors
Measles coverage vs. system strength	Moderate–strong positive	~0.4–0.6	Higher coverage in richer, more educated, better serviced countries
Poverty vs. everything else	Strong negative	~–0.4 to –0.6	Poverty undermines both system indicators and coverage

Overall conclusion:

Vaccine confidence clusters tightly together, forming a coherent belief structure.
However, measles vaccination coverage correlates more strongly with development and health system indicators than with confidence alone.
In short: confidence matters, but capacity matters more for actual vaccine delivery outcomes.

Relationship between measles vaccine coverage and the other variables

rel_vars_long <- rel_vars_clean |>
  pivot_longer(-measles_mcv1, names_to = "variable", values_to = "value")

ggplot(rel_vars_long, aes(x = value, y = measles_mcv1)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~ variable, scales = "free_x") +
  labs(
    title = "Measles Coverage vs. Confidence and System Indicators",
    x = NULL, y = "Measles (MCV1) Coverage (%)"
  ) +
  theme_minimal()

cor_target <- cor_mat["measles_mcv1", ] |>
  sort(decreasing = TRUE) |>
  as.data.frame() |>
  rownames_to_column("variable")

colnames(cor_target) <- c("variable", "correlation")

ggplot(cor_target[-1, ], aes(x = reorder(variable, correlation), y = correlation)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Correlation of Measles (MCV1) Coverage with Confidence & System Indicators",
    x = NULL, y = "Correlation"
  ) +
  theme_minimal()

Correlation Bar Chart

This chart ranks how strongly each factor correlates with measles vaccination coverage.

🔹 Strongest positive correlations

Sanitation, literacy, doctors, nurses, health_spend, and GDP per capita all show moderate-to-strong positive correlations (r ≈ 0.4–0.6). → Countries with better infrastructure, higher education, and stronger healthcare systems tend to have higher vaccination coverage.

Interpretation: Coverage seems closely linked to national capacity and development, not just vaccine attitudes.

🔹 Weak or negligible correlations

Vaccine confidence indicators (vax_safe, vax_effective, vax_children) have near-zero correlations with actual coverage. → People’s reported beliefs about vaccines (safety, effectiveness, importance) don’t clearly translate into coverage differences at the national level. → This could reflect that even if confidence is high, logistical or system constraints still limit access.

🔹 Negative correlation

Poverty shows a strong negative correlation (~–0.4). → Poorer countries have lower vaccination coverage, likely due to barriers in access, supply, and infrastructure.

📈 Scatterplots: MCV1 Coverage vs Predictors

This plot shows the same relationships more visually, variable by variable.

🔹 System and development variables

GDP per capita, literacy, sanitation, health spending, and healthcare workforce all show upward trends — as these increase, coverage tends to rise.
The relationships flatten slightly at high levels (near 100% coverage), suggesting a ceiling effect — most rich or well-developed countries already have nearly universal coverage.

🔹 Poverty

The scatterplot for poverty shows a clear downward slope, confirming that higher poverty rates are associated with lower coverage.

🔹 Vaccine confidence variables

vax_safe, vax_effective, and vax_children are almost flat lines — no clear relationship. → This suggests that differences in beliefs across countries don’t explain much of the variation in vaccination rates once system and economic factors are considered.

🧠 Overall Interpretation

Theme	Observation	Interpretation
Health system & development	Strongest positive correlations with MCV1 coverage	Structural capacity — not just public attitudes — drives vaccination success.
Poverty	Strong negative relationship	Economic hardship constrains access to healthcare and vaccines.
Vaccine confidence	Weak to negligible correlation	Confidence is generally high across countries; it’s not the limiting factor compared to infrastructure.
Visual patterns	Clear upward or downward slopes for system indicators; flat for attitudes	Coverage differences largely reflect material/systemic inequalities.

In summary:

Countries with better socioeconomic development, stronger healthcare systems, and lower poverty achieve higher measles vaccination coverage. While public confidence in vaccines is important for sustaining demand, structural capacity appears to be the dominant factor explaining global variation in actual coverage.

Outliers and influential observations

#| label: q3-outliers
#| echo: true
#| message: false
#| warning: false
#| 
ggplot(analysis_2018, aes(x = safe, y = measles_mcv1)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_text_repel(
    data = subset(analysis_2018, measles_mcv1 < 60 | safe < 70),
    aes(label = country),
    size = 3, max.overlaps = 10
  ) +
  labs(
    x = "Percent agreeing vaccines are safe",
    y = "Measles (MCV1) Coverage (%)",
    title = "Outliers in Vaccine Confidence vs. Coverage"
  ) +
  theme_minimal()

What this plot shows

We’ve plotted measles (MCV1) vaccination coverage (%) against vaccine safety confidence (% agreeing), with country labels added for the most unusual or illustrative points.

The red line is a simple linear trend, and the labels highlight countries that appear far from the main cluster — visually “outliers” in how confidence relates to coverage.

Key takeaways from the visualization

🔹 1. Two main outlier clusters appear

(a) Low confidence but high coverage (top left quadrant)

Examples: France, Japan, Ukraine, Lithuania, Austria, Bulgaria, Russia, Cyprus, Germany, Latvia, etc.

These countries achieve very high coverage (≥90%) despite low safety confidence (30–60%).
Mostly high-income countries with strong public health systems and school-entry vaccination requirements.
Suggests that structural and policy mechanisms, rather than public sentiment, maintain high vaccination rates.

🧠 Interpretation: These are “confidence outliers” — trust is low, but delivery remains high.

(b) High confidence but low coverage (bottom right quadrant)

Examples: Ethiopia, Yemen, Nigeria, Gabon, Guinea, Togo, Mali**, etc.

These countries report strong belief in vaccine safety (70–90%) but coverage well below 80%.
Most are low-income or fragile states, where logistical barriers (infrastructure, access, conflict) outweigh willingness.

🧠 Interpretation: These are “system outliers” — trust is high, but capacity is weak.

🔹 2. Flat regression line

The red regression line is nearly horizontal, confirming that the linear association between confidence and coverage is weak or nonexistent. → This matches the correlation results you observed earlier (r ≈ 0.1–0.2).

🧠 Interpretation: Vaccine confidence varies widely across countries, but it doesn’t systematically predict coverage. Instead, non-attitudinal factors (infrastructure, poverty, governance) dominate.

🔹 3. Regional contrasts

Europe & East Asia: cluster of low-confidence, high-coverage nations.
Sub-Saharan Africa & Middle East: high-confidence, lower-coverage cluster.
Americas and parts of Asia (Peru, Bolivia, Pakistan): middle of the pack.

This reflects a global confidence–coverage paradox:

In wealthy countries, hesitancy doesn’t necessarily reduce coverage, while in poorer countries, confidence isn’t enough to ensure high coverage.

Summary Interpretation

Pattern	Example countries	What it means
High coverage, low confidence	France, Japan, Austria	Trust issues but strong systems keep coverage high
Low coverage, high confidence	Nigeria, Ethiopia, Yemen	Positive attitudes but weak infrastructure limit access
Weak overall association	Flat red line	Vaccine uptake is driven more by structural than attitudinal factors

Missingness exploration (before any imputation)

Before exploring the missingness in analysis_2018, let’s ensure all columns are of the right format (if numeric variables are encoded as characters, R might mistakenly consider they have no missing values!).

analysis_2018 <- analysis_2018 |>
  mutate(across(
    c(gdp_per_capita_current_usd,
      access_to_sanitation,
      poverty_headcount_national,
      primary_school_enrollment,
      adult_literacy_rate,
      unemployment_rate,
      domestic_health_expenditure_pct_gdp,
      physicians_per_1000,
      nurses_and_midwives_per_1000),
    ~ na_if(., "..") |> as.numeric()
  ))

Let’s now visualise the missingness in the matrix.

#| label: q3-missingness-visualisation
#| echo: true
#| message: false
#| warning: false
# Visual pattern of missingness

# Create a matrix of missingness (TRUE/FALSE)

miss_matrix <- is.na(analysis_2018)

# Convert to long format
miss_long <- melt(miss_matrix)
colnames(miss_long) <- c("Observation", "Variable", "Missing")

# Plot
ggplot(miss_long, aes(x = Variable, y = Observation, fill = Missing)) +
  geom_tile(color = "white", size = 0.1) +
  scale_fill_manual(values = c("FALSE" = "darkorchid4", "TRUE" = "white")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(title = "Missing Data in analysis_2018",
       subtitle = "White = missing, Purple = observed",
       fill = "Missing?") +
  coord_fixed(ratio = 0.05)  # Adjust ratio to make plot taller/wider as needed

And let’s check the percentage of missing values per variable

# Per-variable missing rates
missing_summary <- analysis_2018 %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_rate") %>%
  arrange(desc(missing_rate))

missing_summary

# A tibble: 17 × 2
   variable                            missing_rate
   <chr>                                      <dbl>
 1 adult_literacy_rate                       0.747 
 2 poverty_headcount_national                0.660 
 3 children                                  0.345 
 4 safe                                      0.345 
 5 effective                                 0.345 
 6 physicians_per_1000                       0.237 
 7 primary_school_enrollment                 0.175 
 8 nurses_and_midwives_per_1000              0.119 
 9 unemployment_rate                         0.0773
10 access_to_sanitation                      0.0412
11 gdp_per_capita_current_usd                0.0258
12 domestic_health_expenditure_pct_gdp       0.0103
13 iso3                                      0     
14 year                                      0     
15 country                                   0     
16 country_who                               0     
17 measles_mcv1                              0

Most variables have a reasonable degree of missingness (generally below 35%). The two exceptions are adult_literacy_rate (almost 75%) and poverty_headcount_national (66%): the missingness in adult_literacy_rate is simply too high for proper imputation so we’ll drop this column. We’ll impute the rest of the columns as the visualisation shows the pattern to likely be MAR (e.g countries that don’t report on children also don’t report on safe or effective or countries that don’t report physicians per 1000 are not likely to report nurses and midwives per 1000).

Question 4

We now build predictive models for measles coverage. beliefs is omitted because it is entirely missing for 2018.

Train/test split

#| label: q4-split
#| echo: true
#| message: false
#| warning: false

set.seed(42)


data_split <- initial_split(analysis_2018, prop = 0.7, strata = measles_mcv1)
train_data <- training(data_split)
test_data  <- testing(data_split)

Code explanation We use stratified sampling to maintain the coverage distribution in both sets.

Recipe for pre-processing

#| label: q4-recipe
#| echo: true
#| message: false
#| warning: false

model_recipe <- recipe(
  measles_mcv1 ~ children + safe + effective +
    gdp_per_capita_current_usd + poverty_headcount_national + unemployment_rate +
 primary_school_enrollment +
    domestic_health_expenditure_pct_gdp + physicians_per_1000 +
    nurses_and_midwives_per_1000 + 
    access_to_sanitation,
  data = train_data
) %>%
  step_zv(all_predictors()) %>%
  step_impute_bag(all_predictors()) %>%
  step_normalize(all_predictors())

Code explanation

Removes zero-variance predictors.
Imputes missing numeric predictors with bagged trees.
Normalizes (z-scores) for comparability—required for regularized models like LASSO.

Baseline linear regression

#| label: q4-lm
#| echo: true
#| message: false
#| warning: false

lm_spec <- linear_reg() %>% set_engine("lm")

lm_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(lm_spec) %>%
  fit(train_data)

lm_preds <- predict(lm_wf, test_data) %>%
  bind_cols(test_data %>% select(measles_mcv1))

metrics(lm_preds, truth = measles_mcv1, estimate = .pred)

# A tibble: 2 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       9.48 
2 rsq     standard       0.464

Let’s also compute training set metrics:

lm_preds_train <- predict(lm_wf, train_data) %>%
  bind_cols(train_data %>% select(measles_mcv1))

metrics(lm_preds_train, truth = measles_mcv1, estimate = .pred)

# A tibble: 2 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      11.4  
2 rsq     standard       0.352

Interpretation

Training set: R² ≈ 0.35 shows a moderate fit; RMSE ≈ 11 pp indicates predictions typically deviate by about 11 percentage points from observed measles coverage (which seems like a rather large error given the scale of measles vaccine coverage!).
Test set: R² ≈ 0.46 and RMSE ≈ 9.5 pp suggest the model generalises reasonably well, even improving slightly out-of-sample — implying mild over-regularisation or a few influential training points rather than overfitting.
MAE ≈ 7 pp on the test set (respectively 8.11 pp on the training set) means the average country-level prediction error is roughly seven percentage points of coverage (respectively 8.11 percentage points of coverage). The discrepancy between RMSE and MAE points to the existence of a few large outliers.

Overall, the model explains between around 35 to 45 % of cross-country variation in measles coverage, capturing meaningful structure but leaving room for system-level or contextual factors not yet modelled (e.g. supply bottlenecks, conflict). The gapt between training and test set performances suggests the model slightly underfits (i.e it is likely not complex enough e.g missing variables or not capturing potential non-linearities)

Residuals plots

lm_aug <- augment(lm_wf, test_data)

ggplot(lm_aug, aes(x = .pred, y = .resid, label = country)) +
  geom_point(alpha = 0.6) +
  geom_text_repel(data = subset(lm_aug, abs(.resid) > 15)) +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal()

1. Overall pattern

The points are mostly scattered randomly around 0, which is good — it suggests the model isn’t systematically biased (no obvious curvature or funnel shape).
However, there’s a slight clustering of residuals below 0 for fitted values between ~70–90 — meaning the model slightly overpredicts coverage for those mid-range countries.
At the high end (~95–100 predicted coverage), residuals are tightly clustered near 0 — the model performs best for already high-coverage countries.

Interpretation: The model fits high-performing countries well but struggles a bit for mid-level coverage nations.

⚠️ 2. Outliers

A few countries stand out with large residuals:

Country	Residual	Interpretation
Chad	≈ –30	Actual coverage far below what the model predicts → possibly very weak health infrastructure or data quality issues.
Nigeria, Ethiopia	–20 to –15	Model overpredicts their performance — suggests system or conflict-related delivery barriers.
Micronesia, Bosnia & Herzegovina	–15 to –20	Overprediction again — maybe missing predictors relevant to small states or post-conflict systems.
Bangladesh	+10 to +15	Underpredicted — performs better than expected given its predictors (strong immunization programs).

Interpretation: These are “influential observations” — not necessarily leverage points yet, but they have large residuals (unexpected outcomes given model inputs).

3. Variance pattern (homoscedasticity check)

The vertical spread of residuals doesn’t clearly widen or narrow with fitted values → variance appears fairly constant.
There’s no strong funnel shape → no major heteroscedasticity.

Interpretation: The assumption of constant variance holds reasonably well.

📈 4. Model bias

The slight negative bias for low–mid predicted values implies the model may overestimate coverage in lower-performing contexts.
We might be missing a predictor that captures fragility, conflict, or inequality, which depresses coverage beyond what GDP or health spending explain.

🧠 Summary interpretation

Aspect	Observation	Implication
Centering	Most residuals near 0	Model fits average well overall
Spread	Fairly uniform	No strong heteroscedasticity
Outliers	Chad, Nigeria, Ethiopia, Micronesia, Bosnia	Real-world “exceptions” — weaker or stronger than expected
Bias	Slight overprediction for mid-range	Missing contextual predictor (governance/conflict?)

Our model predicts measles coverage fairly well overall, with consistent variance and no severe bias. A handful of countries (especially Chad, Nigeria, Ethiopia) underperform relative to expectations, while Bangladesh exceeds them — pointing to systemic or policy differences not captured by the current predictors.

Regularized model (ElasticNet)

#| label: q4-elastic-net
#| echo: true
#| message: false
#| warning: false

# 1. Define elastic net model specification (fixed penalty and mixture)
elastic_spec <- linear_reg(penalty = 0.1, mixture = 0.5) %>%
  set_engine("glmnet")

# 2. Build workflow
elastic_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(elastic_spec)

# 3. Fit the model directly to training data
elastic_fit <- elastic_wf %>%
  fit(train_data)

# 4. Predict on the test data
elastic_preds <- predict(elastic_fit, test_data) %>%
  bind_cols(test_data %>% select(measles_mcv1))

# 5. Evaluate model performance
metrics(elastic_preds, truth = measles_mcv1, estimate = .pred)

# A tibble: 2 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       9.43 
2 rsq     standard       0.471

#Predict on the training data
elastic_preds_train <- predict(elastic_fit, train_data) %>%
  bind_cols(train_data %>% select(measles_mcv1))

# Evaluate model performance
metrics(elastic_preds_train, truth = measles_mcv1, estimate = .pred)

# A tibble: 2 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      11.4  
2 rsq     standard       0.351

We’re selecting ElasticNet here as it’s highly likely that most predictors are relevant (based on the baseline model) and also because we know many predictors to be strongly correlated (e.g safe, effective and children).

Compare models

#| label: q4-compare
#| echo: true
#| message: false
#| warning: false

bind_rows(
  lm = metrics(lm_preds, truth = measles_mcv1, estimate = .pred) %>% mutate(model = "Linear Regression (test)"),
  elastic_net = metrics(elastic_preds, truth = measles_mcv1, estimate = .pred) %>% mutate(model = "ElasticNet (test)")
) %>%
  filter(.metric %in% c("rmse", "rsq")) %>%
  select(model, .metric, .estimate)

# A tibble: 4 × 3
  model                    .metric .estimate
  <chr>                    <chr>       <dbl>
1 Linear Regression (test) rmse        9.48 
2 Linear Regression (test) rsq         0.464
3 ElasticNet (test)        rmse        9.43 
4 ElasticNet (test)        rsq         0.471

And for the training set:

bind_rows(
  lm = metrics(lm_preds_train, truth = measles_mcv1, estimate = .pred) %>% mutate(model = "Linear Regression (train)"),
  elastic_net = metrics(elastic_preds_train, truth = measles_mcv1, estimate = .pred) %>% mutate(model = "ElasticNet (train)")
) %>%
  filter(.metric %in% c("rmse", "rsq")) %>%
  select(model, .metric, .estimate)

# A tibble: 4 × 3
  model                     .metric .estimate
  <chr>                     <chr>       <dbl>
1 Linear Regression (train) rmse       11.4  
2 Linear Regression (train) rsq         0.352
3 ElasticNet (train)        rmse       11.4  
4 ElasticNet (train)        rsq         0.351

The performance of both models is sensibly similar with ElasticNet having a very slight edge on the test set.

Comparison of model coefficients

lm_fit <- extract_fit_parsnip(lm_wf)
lm_coefs <- tidy(lm_fit) %>%
  select(term, estimate) %>%
  rename(ols_estimate = estimate)

elastic_fit_parsnip <- extract_fit_parsnip(elastic_fit) 
elastic_coefs <- tidy(elastic_fit_parsnip) %>%
  select(term, estimate) %>%
  rename(elastic_estimate = estimate)

coef_compare <- full_join(lm_coefs, elastic_coefs, by = "term")

coef_compare_long <- coef_compare %>%
  pivot_longer(cols = c(ols_estimate, elastic_estimate),
               names_to = "model", values_to = "estimate")

ggplot(coef_compare_long, aes(x = reorder(term, estimate), y = estimate, fill = model)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "Comparison of Coefficients: OLS vs Elastic Net",
    x = NULL, y = "Coefficient Estimate",
    fill = "Model"
  ) +
  theme_minimal()

ElasticNet places slightly more emphasis on poverty_headcount_national (does that hint at the sorts of variables we could try and add to this model to improve it further?) but in general, the coefficients are sensibly the same, which explains the very close performance of both baseline and ElasticNet models.

Question 5

Limitations

Missing data – Some indicators lack coverage for small or fragile states; even principled imputation adds uncertainty.
Aggregation – National averages hide within-country inequities (urban–rural, income, gender).
Collinearity and causality – Socio-economic variables inter-correlate; regression identifies association, not cause.
Single year – Cross-section cannot reveal temporal dynamics or lag effects.
Attitude vs behavior – VCP indicators measure belief, not actual uptake; low access can still limit coverage.

Potential improvements

Multiple or model-based imputation could better capture uncertainty in missing predictors, possibly incorporating regional or income-group cues.

Cross-validation would help us evaluate our models better. And tuning the parameters of ElasticNet would allow to check whether the performance of this particular model could be improved. We could also try out non-linear models (e.g tree-based models) to see if they improve model performance at all.

Panel (time-series) data would permit analysis of within-country changes and policy effects over time.

Richer health-system measures—facility density, cold-chain reliability, workforce distribution—could clarify supply-side drivers.

Governance and trust indicators (e.g., Worldwide Governance Indicators, Edelman Trust Index) could test institutional determinants of vaccine confidence.

Uncertainty quantification through bootstrapping or Bayesian hierarchical modeling would make estimates more transparent and interpretable.

Summary reflection

Vaccine-confidence indicators remain closely aligned with vaccination coverage, but enduring structural factors—education, health investment, and system access—still dominate cross-national differences. With richer and longitudinal inputs and potentially better models, this framework could evolve toward a more causal, policy-relevant understanding of how trust, governance, and health capacity jointly sustain immunization coverage.

References

Mondal, Pritish, Ankita Sinharoy, and Lilly Su. 2021. “Sociodemographic Predictors of COVID-19 Vaccine Acceptance: A Nationwide US-Based Survey Study.” Public Health 198: 252–59. https://www.sciencedirect.com/science/article/pii/S0033350621002985?via%3Dihub.

Presotto, Maria Ada, Rudolf A Jörres, Wolfgang Gesierich, Jörn Bullwinkel, Klaus F Rabe, Konrad Schultz, Franziska Kaestner, et al. 2022. “Gender Differences in Health Care Workers’ Risk-Benefit Trade-Offs for COVID-19 Vaccination.” Respiration 101 (7): 646–53. https://pmc.ncbi.nlm.nih.gov/articles/PMC9059044/.

Vaccher, Stefanie, Moses Laman, Margie Danchin, Fiona Angrisano, and Christopher Morgan. 2025. “Missed Measles Immunisations Places Individuals and Communities at Risk: The Equity Argument for Including Measles in Under-Immunised Definitions.” Vaccines 13 (2). https://doi.org/10.3390/vaccines13020108.