✏️ W04 Formative

2025/26 Autumn Term

Author

Dr Ghita Berrada

⏲️ Due Date:

🎯 Main Objectives:

Please submit your work even if you didn’t manage to go very far with the R code. As this is a formative assignment, it won’t be graded, and the main point is for you to get used to submitting your work through GitHub Classroom.

👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.

📚 Preparation (if you are new to GitHub)

You will use GitHub Classroom to submit your work. You will need to have a GitHub account to do this.

  1. Create an account on GitHub.

Never heard of GitHub? Or maybe you have heard of it but never used it? Then, follow the instructions below to get started.

  1. Go to our Slack workspace’s #announcements channel to find the link to ‘Intro to Git/GitHub’. You will be taken to a page with instructions on how to get started with Git and GitHub.

  2. Read the instructions in the README.md and complete the exercises.

  3. Ask any questions about the exercise above on the #help channel on Slack.

📝 Instructions

  1. Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link. Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub and then click on the green button Accept this assignment.

  3. You will be redirected to a new private repository created just for you. The repository will be named ds202a-2025-2026-w04-formative--yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.

  4. Many of you might still be catching up with R and GitHub, so it’s okay if you can only complete a few questions. You will still get feedback on your answers, which will still count as completed (important for General Course and Exchange students).

  5. Create your own .qmd file with your answers. You can use the .qmd file you used in the W02 lab as a template. Just remove anything that is not relevant to this assignment.

  6. Try to create separate headers and code chunks for each question. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.

  7. Use the #help channel on Slack liberally if you get stuck.

“What do I submit?”

⚠️ Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five-digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

  • A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  • An HTML file render of the Quarto markdown file.

Remember to make your submitted HTML self-contained: add the self-contained: true option to your .qmd header before you render your file into HTML, then check that all the figures show and your text is formatted as expected!

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment.

✔️ How we will grade your work

We won’t! This is formative. But you will get feedback on your answers. It won’t be super detailed at this stage, but it should give you an idea of how you are doing.

General comment: This formative is not graded. The goal is to practice good data science workflow: clean code, consistent documentation, correct methods, and interpretation of results in context.

Important: Providing lots of code is not enough and is discouraged — explanations and interpretations must accompany code. Always justify your choices (e.g., handling of missing values, model choice, parameter settings) and interpret your results in the context of vaccine confidence and socio-economic setting.

Here’s how we’ll provide feedback on the questions of this formative:

Question 1 – Exploring the VCP data

  • Weak: Code runs but little to no explanation. Missing or incorrect renaming. No comment on dataset coverage by year.
  • Developing: Loads data, renames columns, and counts unique countries, but explanations are minimal or superficial.
  • Good: Loads and inspects dataset correctly, renames columns clearly, avoids unnecessary printing. Reports country coverage per year and identifies which year has the most countries, with a short but clear interpretation.
  • Excellent: As in “Good,” but also provides concise comments on dataset structure (e.g., what variables represent, data consistency, potential caveats). Explanations show understanding of how coverage differences across years may affect later analysis.

Question 2 – Aggregating VCP Indicators

  • Weak: Incomplete aggregation or missing justification of weighting. No merge with external datasets.

  • Developing: Performs aggregation but may apply weights incorrectly, or handles missing values inconsistently. Merges data but without careful renaming/cleaning. Minimal explanation.

  • Good:

    • Chooses a valid year with sufficient coverage and justifies the choice.
    • Aggregates VCP indicators across subgroups using the weighted mean formula.
    • Handles missing values appropriately for the type of data, with justification.
    • Reflects on the effect of subgroups/weighting on estimates.
    • Merges VCP, WHO, and World Bank data with correctly renamed columns.
  • Excellent:

    • Demonstrates awareness of why population weighting matters, and what could happen under alternative subgroup/weighting schemes.
    • Provides thoughtful justification for handling missingness (distinguishing categorical vs. continuous data, explaining when imputation is preferable vs. exclusion).
    • Produces a clean, well-structured merged dataset ready for exploration.

Question 3 – Data Exploration

  • Weak: Produces plots or tables without labeling, interpretation, or connection to the question.

  • Developing: Produces plots of distributions and scatterplots, but labeling is poor or interpretations are superficial. Notes missingness but does not explain consequences.

  • Good:

    • Plots distributions of VCP indicators and vaccination coverage with informative titles/axes.
    • Explores relationships with socio-economic/health indicators using scatterplots/correlations.
    • Identifies and comments on clear patterns or outliers.
    • Notes missingness at country level and discusses its consequences for modeling (e.g., reduced sample size, potential bias).
  • Excellent:

    • Provides insightful interpretations of patterns (e.g., links between wealth and coverage, role of literacy or health workforce).
    • Discusses outliers thoughtfully, suggesting possible reasons or implications.
    • Explains clearly how missingness will be dealt with before modeling (e.g., chosen imputation method, exclusion criteria) and why.

Question 4 – Predictive Modelling

  • Weak: Runs regression without train/test split or meaningful evaluation. No feature selection or justification.

  • Developing: Splits data and runs regression, but preprocessing or missing value handling is flawed. LASSO/feature selection is missing or incorrect. Metrics are reported without context.

  • Good:

    • Splits data correctly (70/30, with a fixed random seed for reproducibility).
    • Builds a multivariate linear regression with required predictors.
    • Handles missing values appropriately, with justification.
    • Applies LASSO (or similar) and compares performance with regression.
    • Chooses appropriate evaluation metrics, explains why, and interprets them in context.
  • Excellent:

    • Justifies preprocessing choices carefully (e.g., imputation, scaling).
    • Explains model setup and parameter choices clearly.
    • Evaluates models with well-chosen metrics, showing awareness of pros/cons of each.
    • Provides nuanced interpretation of results: which predictors matter most, how reliable predictions are, and what socio-economic patterns emerge.
    • Presents results cleanly with plots/tables and concise explanations.

Question 5 – Reflection

  • Weak: Very short or generic reflection (e.g., “There were missing values”).
  • Developing: Mentions some limitations or improvements but lacks depth or context.
  • Good: Reflects on at least two key limitations (e.g., missing data, aggregation, variable selection). Suggests concrete improvements if more data were available.
  • Excellent: Provides a thoughtful and nuanced reflection that demonstrates awareness of both technical and conceptual limitations (e.g., bias from country-level aggregation, limits of linear models, issues of comparability across datasets). Suggests realistic improvements or next steps (e.g., subgroup-level modeling, additional explanatory variables, longitudinal analysis).

📚 Tasks

The questions below will build on your code from 💻 W02 Lab and W03 Labs.

About the Data

This assignment uses data from four main sources:

  1. Vaccine Confidence Project (VCP) – survey data on attitudes toward vaccines by country and demographic subgroup (Age × Gender).
  2. World Bank country-level indicators – economic, education, health system, labor, and infrastructure variables.
  3. WHO/UNICEF vaccination coverage – percentage of children vaccinated for key vaccines.
  4. US Census Bureau – percentage of each country's population that belongs to each Age × Gender subgroup (see here for details).

Below is a summary of the variables you will use:

| Source | Variable | Meaning | Documentation / Link |
|---|---|---|---|
| VCP | children | % agreeing vaccines are important for children | VCP Documentation |
| VCP | safe | % agreeing vaccines are safe | VCP Documentation |
| VCP | effective | % agreeing vaccines are effective | VCP Documentation |
| VCP | beliefs | % agreeing vaccines are compatible with their religious beliefs | VCP Documentation |
| World Bank – Economic | GDP_per_capita | GDP per person, current USD | World Bank GDP |
| World Bank – Economic | Poverty_headcount_national | % of population below national/societal poverty line | World Bank Poverty |
| World Bank – Economic | Unemployment_rate | % of total labor force unemployed | World Bank Unemployment |
| World Bank – Education | Adult_literacy_rate | % of adults (15+) who are literate | World Bank Literacy |
| World Bank – Education | Primary_school_enrollment | % gross enrollment in primary school | World Bank Education |
| World Bank – Healthcare | Domestic_health_expenditure_percent_GDP | Government health expenditure as % of GDP | World Bank Health Expenditure |
| World Bank – Healthcare | Physicians_per_1000 | Physicians per 1,000 population | World Bank Health Workforce |
| World Bank – Healthcare | Nurses_and_midwives_per_1000 | Nurses and midwives per 1,000 population | World Bank Health Workforce |
| World Bank – Healthcare | Community_health_workers_per_1000 | Community health workers per 1,000 population | WHO Global Health Observatory |
| World Bank – Infrastructure | Access_to_sanitation | % of population with improved sanitation | World Bank Sanitation |
| WHO/UNICEF | Measles_coverage | % of children vaccinated for measles | WHO Immunization Data |

You can download the VCP dataset by clicking on the button below:

You can download the World Bank dataset by clicking on the button below:

You can download the WHO/UNICEF dataset by clicking on the button below:

You can download the US Census dataset by clicking on the button below:

Your ultimate goal in this assignment is to predict country-level first-dose measles vaccine coverage from the variables in your merged data.

Question 1 – Exploring the VCP data

  1. Load the VCP dataset and inspect its structure. Rename columns if needed.
  2. For each year in the dataset, count the number of unique countries represented. Which year has the most countries?
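
As a rough illustration of the counting step in Question 1, the sketch below uses a tiny invented data frame in place of the real VCP file. The column names `country`, `year` and `safe` are assumptions, so adapt them to whatever the loaded dataset actually contains. It uses base R only; a dplyr version would work just as well.

```r
# Toy stand-in for the VCP data; the column names `country` and `year`
# are assumptions, not the real file's headers.
vcp <- data.frame(
  country = c("A", "A", "B", "A", "B", "C", "C"),
  year    = c(2015, 2015, 2015, 2018, 2018, 2018, 2018),
  safe    = c(80, 85, 70, 82, 75, 90, 88)
)

# Number of distinct countries observed in each year
countries_per_year <- tapply(vcp$country, vcp$year,
                             function(x) length(unique(x)))
print(countries_per_year)

# Year with the widest country coverage (the first one if there is a tie)
best_year <- as.numeric(names(which.max(countries_per_year)))
print(best_year)
```

Counting unique countries rather than rows matters here because each country appears once per Age × Gender subgroup.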

Question 2 – Aggregating VCP Indicators

  • Select a single year with sufficient country coverage. All subsequent analysis will use data only for this year.

  • For each country, aggregate the VCP indicators (children, safe, effective, beliefs) across Age × Gender subgroups using population weights from the demographic table.

  • Weighted mean formula:

\[ \text{Weighted mean for variable } X = \frac{\sum (\text{Population of subgroup} \times X_\text{subgroup})}{\sum \text{Population of subgroup}} \]

Quick weighting example:

Suppose Country A has two subgroups:

| Age × Gender | % of total population | children (%) |
|---|---|---|
| Men 25+ | 60 | 90 |
| Women 25+ | 40 | 95 |

Weighted mean for children:

\[ (0.6 \times 90) + (0.4 \times 95) = 92\% \]
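
The Country A calculation can be checked in R with `weighted.mean()`, and the same idea extends to one weighted mean per country. The sketch below is only one possible approach: the data frame, its column names (`country`, `subgroup`, `pop_share`, `children`) and the helper `weighted_by_country()` are all hypothetical, and dropping missing responses is an illustrative choice rather than a recommendation.

```r
# Reproduce the Country A example: weights are population shares
print(weighted.mean(x = c(90, 95), w = c(60, 40)))   # 92

# Hypothetical long-format data: one row per country x subgroup.
# Column names (`country`, `subgroup`, `pop_share`, `children`) are assumed.
vcp_year <- data.frame(
  country   = c("A", "A", "B", "B"),
  subgroup  = c("Men 25+", "Women 25+", "Men 25+", "Women 25+"),
  pop_share = c(60, 40, 50, 50),
  children  = c(90, 95, 80, NA)
)

# Weighted mean per country. Rows with a missing response are dropped
# together with their weight, so the remaining weights are renormalised.
weighted_by_country <- function(df, var, weight) {
  sapply(split(df, df$country), function(g) {
    keep <- !is.na(g[[var]]) & !is.na(g[[weight]])
    weighted.mean(g[[var]][keep], g[[weight]][keep])
  })
}

children_agg <- weighted_by_country(vcp_year, "children", "pop_share")
print(children_agg)
```

Note that dropping a subgroup implicitly assumes its attitudes resemble those of the subgroups that remain. Whether that is reasonable is exactly the kind of choice you are asked to justify.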

  • Handle missing responses as you see fit; justify your choice.
  • Reflect on how using Age × Gender subgroups and population weights affects the aggregated country-level VCP indicators. How might the estimates differ if different subgroups or weighting schemes had been used?
  • Merge the aggregated VCP indicators with WHO/UNICEF vaccination coverage and World Bank indicators for the same year. Rename the WHO and World Bank dataset columns to match the variable names given in the “About the Data” section.
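
The renaming and merging step might look roughly like the sketch below. All three toy tables and their raw column names (`Country`, `MCV1`, `country_name`, `gdp_pc`) are invented for illustration; only the target names `Measles_coverage` and `GDP_per_capita` come from the “About the Data” section.

```r
# Toy tables; all raw column names below are invented for illustration.
vcp_agg <- data.frame(country = c("A", "B", "C"),
                      children = c(92, 80, 88))
who <- data.frame(Country = c("A", "B", "D"),
                  MCV1 = c(95, 70, 85))
wb <- data.frame(country_name = c("A", "B", "C"),
                 gdp_pc = c(1200, 45000, 8000))

# Rename to the variable names used in the "About the Data" section
names(who)[names(who) == "Country"] <- "country"
names(who)[names(who) == "MCV1"] <- "Measles_coverage"
names(wb)[names(wb) == "country_name"] <- "country"
names(wb)[names(wb) == "gdp_pc"] <- "GDP_per_capita"

# Left joins keep every country that has VCP data; unmatched values become NA
merged <- merge(vcp_agg, who, by = "country", all.x = TRUE)
merged <- merge(merged, wb, by = "country", all.x = TRUE)
print(merged)

# Countries left incomplete by the merge are worth inspecting
print(merged$country[!complete.cases(merged)])
```

A left join is used here so that no VCP country silently disappears. In real data, mismatched country spellings across sources are a common cause of unexpected `NA`s, so it is worth checking the unmatched rows by hand.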

Question 3 – Data Exploration

  1. Examine the distributions of the aggregated VCP indicators and vaccination coverage.
  2. Explore relationships between vaccination coverage and socio-economic / health system indicators (scatterplots, correlations, etc.).
  3. Comment on any patterns or potential outliers.
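
A numerical first pass at these three steps could look like the sketch below. The data are invented, and the 1.5 × IQR rule is only a common convention for flagging candidates, not a formal outlier test. Plots (for example with `hist()` and `plot()`, or ggplot2) would normally accompany these summaries.

```r
# Small invented dataset using variable names from the "About the Data" section
dat <- data.frame(
  Measles_coverage    = c(95, 70, 88, 99, 60, 92, 97, 20),
  GDP_per_capita      = c(1200, 800, 8000, 45000, 600, 15000, 30000, NA),
  Physicians_per_1000 = c(0.5, 0.2, 1.8, 3.9, 0.1, 2.5, 3.0, 1.1)
)

# Share of missing values per column
print(colMeans(is.na(dat)))

# Five-number summary gives a first look at the distribution
print(summary(dat$Measles_coverage))

# Pairwise correlations, using all complete pairs for each variable pair
print(round(cor(dat, use = "pairwise.complete.obs"), 2))

# Flag potential outliers with the 1.5 x IQR rule (a convention, not a test)
q <- quantile(dat$Measles_coverage, c(0.25, 0.75))
fence <- 1.5 * diff(q)
outlier <- dat$Measles_coverage < q[1] - fence |
           dat$Measles_coverage > q[2] + fence
print(dat[outlier, ])
```

A flagged country is a prompt for investigation, not for automatic removal: a very low coverage value may reflect a real event rather than a data error.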

Question 4 – Predictive Modelling

  1. Split your data into training (70%) and testing (30%) sets.

  2. Build a multivariate linear regression model predicting Measles_coverage using:

    • Aggregated VCP indicators (children, safe, effective, beliefs)
    • World Bank socio-economic indicators (GDP_per_capita, Poverty_headcount_national, Unemployment_rate, Adult_literacy_rate, Primary_school_enrollment, Domestic_health_expenditure_percent_GDP, Physicians_per_1000, Nurses_and_midwives_per_1000, Community_health_workers_per_1000, Access_to_sanitation)
  3. Evaluate your model on the test set using appropriate metrics.

  4. Explore variable selection and/or regularization (e.g., LASSO) and compare performance.
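
The split, fit and evaluate steps can be sketched in base R as follows. The data are simulated, only two of the required predictors are used to keep the example short, and RMSE and MAE are shown as two common choices of metric rather than the required ones.

```r
set.seed(123)  # fixed seed so the split is reproducible

# Simulated stand-in for the merged data (values are made up)
n <- 100
dat <- data.frame(
  GDP_per_capita = runif(n, 500, 50000),
  safe           = runif(n, 40, 100)
)
dat$Measles_coverage <- 30 + 0.0004 * dat$GDP_per_capita +
  0.3 * dat$safe + rnorm(n, sd = 5)

# 70/30 split by sampling row indices
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit on the training set only
fit <- lm(Measles_coverage ~ GDP_per_capita + safe, data = train)

# Evaluate on held-out data
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$Measles_coverage - pred)^2))
mae  <- mean(abs(test$Measles_coverage - pred))
print(c(RMSE = rmse, MAE = mae))
```

Both metrics are in percentage points of coverage, which makes them easy to interpret in context. For the LASSO comparison, one common route in R is `glmnet::cv.glmnet()` with `alpha = 1` on a numeric predictor matrix, scored on the same test set; any preprocessing (such as imputation) should be estimated on the training set only.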

Question 5 – Reflection

  1. Reflect on potential limitations of your analysis, such as:

    • Missing data
    • Country-level aggregation
    • Variable selection
  2. Suggest potential improvements if more detailed data or additional variables were available.