✏️ W05 Formative

DS202W – Data Science for Social Scientists (Winter Term 2025/26)

Author

Dr Ghita Berrada

⏲️ Due Date 18 February 2026 at 5pm


🎯 Overall Goal of the Analysis

The goal of this formative is to predict country-level sustainable income growth, measured by Adjusted Net National Income growth, using:

  • Trade structure indicators constructed from bilateral trade data
  • Macroeconomic and institutional indicators from the World Bank
  • Trade globalisation measures from the KOF Globalisation Index

This assignment mirrors a real applied data-science workflow in economics and public policy, where:

  • raw data come at incompatible levels of aggregation
  • key explanatory variables must be constructed rather than downloaded
  • modelling choices require judgement and justification, not recipe-following

🎯 Main Objectives

  • Practice using GitHub Classroom
  • Practice creating and styling Quarto documents
  • Practice writing Python code on your own
  • Practice multi-stage data wrangling
  • Aggregate bilateral trade flows into interpretable country-level indicators
  • Explore trade structure, openness, and globalisation
  • Build and evaluate regression models
  • Diagnose and reason about multicollinearity
  • Interpret results in a development and policy context

Please submit your work even if you did not manage to complete all questions. This is a formative assignment and is not graded. The main point is for you to get used to submitting your work through GitHub Classroom and to formatting files with Quarto.

👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.


📚 Preparation (if you are new to GitHub)

You will use GitHub Classroom 1 to submit your work. You will need to have a GitHub account to do this.

  1. Create an account on GitHub.

Never heard of GitHub2? Or maybe you have heard of it but never used it? Then, follow the instructions below to get started.

  1. Go to our Slack workspace’s #ds202w-central channel to find the link to ‘Intro to Git/GitHub’. You will be taken to a page with instructions on how to get started with Git and GitHub.

  2. Read the instructions in the README.md and complete the exercises.

  3. Ask any questions about the exercise above on the #help channel on Slack.

📝 Instructions

  1. Go to our Slack workspace’s #ds202w-central channel to find a GitHub Classroom link. Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub and then click on the green button Accept this assignment.

  3. You will be redirected to a new private repository created just for you. The repository will be named ds202aw-2025-2026-w05-formative--yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.

  4. Many of you might still be catching up with Python and GitHub, so it’s okay if you can only complete a few questions. You will still get feedback on your answers, so your formative will still count as completed (important for General Course and Exchange students).

  5. Create your own .qmd file with your answers. You can use the .qmd file you used in the W02 lab as a template. Just remove anything that is not relevant to this assignment.

  6. Try to create separate headers and code chunks for each question. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.

  7. Use the #help channel on Slack liberally if you get stuck.

“What do I submit?”

⚠️ Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.

Source: LSE

  • A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  • An HTML file render of the Quarto markdown file.

Remember to make your submitted HTML self-contained: don’t forget to add the self-contained: true command to your .qmd header before you render your file into HTML and make sure all the figures show and your text is formatted as expected!

Also make sure to follow this course’s generative AI policy!

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment.

Tip

You can create a .qmd file from a Jupyter notebook (i.e .ipynb) by going on the VSCode Terminal, making sure you are in the same directory as your Jupyter notebook (use the pwd to check which directory you’re in and cd command to change directory if needed) and then typing the following command:

quarto convert <name_of_notebook>.ipynb

where <name_of_notebook>.ipynb is the name of the Jupyter notebook you want to convert into .qmd

Also check out the Quarto documentation to better understand the conversion from ipynb to qmd.

And check out this tutorial if you want to better understand the commands you can run on your VSCode terminal (e.g to change current directory).


📚 About the Data

This assignment combines three main data sources, each operating at a different level:

  1. CEPII Gravity Dataset – bilateral trade flows (exporter × importer)
  2. World Bank indicators – macroeconomic, institutional, and distributional variables
  3. KOF Globalisation Index – country-level measures of trade globalisation

Your task is to aggregate bilateral trade flows into country-level concentration indices (e.g., Herfindahl-Hirschman Index), merge these with macroeconomic controls, and build regression models to predict adjusted net national income growth—a welfare-sensitive measure (used as an alternative to GDP) that accounts for natural resource depletion and environmental costs.


1️⃣ CEPII Gravity Dataset (Bilateral Trade)

You will not be given this file directly. You’ll need to download and extract the dataset yourself with the given links.

Each row represents trade between two countries in a given year.

Relevant columns include (among others):

  • year
  • iso3_o – exporter ISO3 code
  • iso3_d – importer ISO3 code
  • tradeflow_baci – bilateral trade value (USD)
  • gdp_o, gdp_d
  • pop_o, pop_d

⚠️ This file is large. Efficient subsetting (by year, columns, or both) is part of the task.


2️⃣ World Bank Indicators (Provided)

You are given a cleaned CSV containing the following variables for multiple years (2017–2021):

Indicator Code Full Name What It Measures Units Data Source Official Documentation
NY.GDP.TOTL.RT.ZS Total natural resources rents Payments by extracting sector (oil, gas, minerals, forests) to government as % of GDP % of GDP World Development Indicators (WDI) https://data.worldbank.org/indicator/NY.GDP.TOTL.RT.ZS
NY.ADJ.NNAT.GN.ZS Adjusted net national income growth Annual % change in income after deducting natural resource depletion and pollution costs % annual growth World Development Indicators (WDI) https://data.worldbank.org/indicator/NY.ADJ.NNAT.GN.ZS
NY.GDP.PCAP.PP.CD GDP per capita, PPP Gross domestic product per person adjusted for price differences across countries Current international $ World Development Indicators (WDI) https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD
SP.POP.TOTL Population, total Total number of residents (including refugees) Persons World Development Indicators (WDI) https://data.worldbank.org/indicator/SP.POP.TOTL
SI.POV.GINI GINI index (World Bank estimate) Measure of income inequality (0 = perfect equality, 100 = maximal inequality) Index (0–100) World Development Indicators (WDI) https://data.worldbank.org/indicator/SI.POV.GINI
SE.SEC.CUAT.UP.ZS School enrollment, tertiary (% gross) Total enrollment in tertiary education (ISCED 5–8), regardless of age, expressed as % of official school-age population % World Development Indicators (WDI) https://data.worldbank.org/indicator/SE.SEC.CUAT.UP.ZS
NE.TRD.GNFS.ZS Trade (% of GDP) Sum of exports and imports of goods and services as % of GDP % of GDP World Development Indicators (WDI) https://data.worldbank.org/indicator/NE.TRD.GNFS.ZS
GE.EST Government Effectiveness: Estimate Perceptions of quality of public services, civil service, policy formulation/implementation, and credibility of government’s commitment to policies Normalized percentile rank (-2.5 to +2.5) Worldwide Governance Indicators (WGI) https://info.worldbank.org/governance/wgi/

📥 Download:


3️⃣ KOF Globalisation Index (Provided)

You are also given a CSV containing trade globalisation indicators from the KOF Swiss Economic Institute.

For this assignment, you should focus only on:

Variable Description
KOFTrGIdf Trade globalisation – de facto (actual flows)
KOFTrGIdj Trade globalisation – de jure (policies and regulations)

📖 Documentation: https://ethz.ch/content/dam/ethz/special-interest/dual/kof-dam/documents/Globalization/2025/KOFGI_2025_Variables.pdf

ℹ️ More information on the KOF Globalisation Index:

https://kof.ethz.ch/en/forecasts-and-indicators/indicators/kof-globalisation-index.html

📥 Download:


📚 Tasks

Your overall goal in this formative is to predict country-level adjusted net national income growth using information about:

  • trade structure (derived from bilateral trade flows),
  • trade openness and globalisation,
  • institutional and socio-economic characteristics.

You will start from raw bilateral trade data, which cannot be used directly in a regression model, and progressively construct a country-level modelling dataset.

Each question is designed to make clear what problem you are solving and why each step is necessary.


Question 1 – Understanding the CEPII Bilateral Trade Data

The CEPII Gravity dataset records bilateral trade flows between countries. Before constructing any indicators, you must understand what information the dataset contains and how it is structured.

1.1 Loading and inspecting the CEPII data

  1. Download the CEPII Gravity dataset from the official CEPII website (links given above) and load it into Python.

  2. Inspect the dataset:

    • How many rows and columns does it contain?
    • What does one row represent?
    • Which variables identify the exporting country and the importing country?
    • Which variable contains the value of trade flows?

Explain in your own words:

  • what kind of economic relationship each row corresponds to,
  • why this dataset is described as bilateral rather than country-level.

1.2 Time coverage and country coverage

  1. List all years available in the CEPII dataset.

  2. For each year:

    • count the number of unique exporting countries,
    • count the total number of exporter–importer pairs.
  3. Summarise your findings in a small table or figure.

Interpretation question Why might the number of available countries and trade pairs vary across years?


1.3 Scale and concentration of bilateral trade flows

Choose one year from the CEPII dataset purely for exploration.

  1. Examine the distribution of bilateral trade values:

    • Is the distribution symmetric or skewed?
    • Are most trade flows small relative to a few very large ones?
  2. Pick two exporting countries:

    • one large economy,
    • one smaller economy.
  3. For each exporter:

    • identify the top 5 importing partners by trade value,
    • compute the share of total exports going to these partners.

Interpretation question What do these patterns suggest about how trade relationships differ across countries?


1.4 Why aggregation is required

Explain why the CEPII data cannot be used directly to predict country-level outcomes such as adjusted net national income growth.


Question 2 – Aggregating Bilateral Trade Flows into Country-Level Indicators

The CEPII Gravity dataset records bilateral trade flows: each row corresponds to exports from one country to another. However, your outcome variable (adjusted net national income growth) and the other indicators you will use are country-level.

In this question, you will aggregate bilateral trade flows into interpretable country-level indicators of trade structure, focusing on export concentration and diversification.


What you will construct

For each exporting country in a given year, you will construct at least one indicator that summarises how concentrated or diversified its exports are across trading partners.

A common and widely used measure is the Herfindahl–Hirschman Index (HHI), but you are free to report both HHI and transformed versions (e.g. \(1 - HHI\)) if you find this helpful for interpretation.


How the aggregation works (worked example)

This example is purely illustrative. Its purpose is to show how concentration indices are computed from bilateral trade data like CEPII’s.

Suppose that in a given year, the CEPII data contain the following rows for exporting country A:

exporter importer trade_value
A B 40
A C 35
A D 25

Step 1 – Total exports

Country A’s total exports are the sum of its bilateral exports:

\[ \text{Total exports}_A = 40 + 35 + 25 = 100 \]

Step 2 – Export shares

Each trading partner’s export share is its share of total exports:

Partner Export share
B (40 / 100 = 0.40)
C (35 / 100 = 0.35)
D (25 / 100 = 0.25)

These shares always sum to 1.

Step 3 – Export concentration (HHI)

The Herfindahl–Hirschman Index is the sum of squared export shares:

\[ HHI_A = 0.40^2 + 0.35^2 + 0.25^2 = 0.16 + 0.1225 + 0.0625 = 0.345 \]

Interpretation:

  • If exports were evenly split across many partners, the HHI would be low.
  • If most exports went to one or two partners, the HHI would be high.
  • The index captures how concentrated trade relationships are, not how large total trade is.

In addition to the HHI, you may compute one or both of the following indicators for each exporting country:

  • Export diversification index: \[ \text{Diversification} = \frac{1}{HHI} \quad \text{or} \quad 1 - HHI \]

  • Top-1 partner export share: \[ \max_j (\text{Export share}_{cj}) \]

These indicators capture different aspects of trade structure:

  • HHI summarises overall concentration across all partners,
  • the top-1 share highlights dependence on a single trading partner, which may matter for exposure to geopolitical or demand shocks.

If you compute more than one indicator:

  • briefly explain how they differ conceptually,
  • comment on whether they rank countries similarly or differently.

You do not need to include all of them in your regression model later — this question is about construction and interpretation, not model selection.


Your task

Using the CEPII Gravity dataset:

  1. Choose a single year that you will later be able to merge with the World Bank and KOF datasets.

  2. For each exporting country in that year:

    • compute total exports,
    • compute export shares across partners,
    • compute at least one concentration or diversification indicator (e.g. HHI or (1 - HHI)).
  3. Produce a clean country-level dataset where each row corresponds to one exporting country.

Briefly explain:

  • what your chosen indicators measure,
  • how they differs from raw trade totals,
  • why your chosen indicators are meaningful for studying development and welfare outcomes.

Question 3 – Building the Modelling Dataset

You will now combine trade structure indicators with macro-level data.

3.1 Choosing a common year

Identify one year that is valid for all the datasets (CEPII, KOF and World Bank). Explain why that particular year was chosen.

Explain briefly why all datasets must align on the same year before modelling.


3.2 Merging datasets

Merge:

  • CEPII-derived trade indicators,
  • World Bank indicators,
  • KOF Trade Globalisation indicators (de facto and de jure).

Report:

  • the number of countries before merging,
  • the number of countries after merging.

Identify which types of countries tend to be lost during merging and suggest why.


3.3 Exploring the combined dataset

Explore the merged dataset:

  1. Examine missing values across variables.

  2. Explore relationships between:

    • trade concentration indicators,
    • trade (% of GDP),
    • KOF trade globalisation indicators.
  3. Use plots and/or correlations to support your discussion.

Explain what these relationships suggest about overlap, redundancy, or complementarity between trade indicators.


3.4 Exploring the outcome variable

Explore the adjusted net national income growth variable:

  • Describe its range and shape.
  • Identify any extreme values.
  • Comment on whether its scale suggests any transformations might be useful later.

Do not apply transformations yet — this question is exploratory.


Question 4 – Predictive Modelling

You will now build models to predict adjusted net national income growth.

4.1 Baseline linear model

  1. Split your data into training and test sets.
  2. Choose a set of predictors from the available variables.
  3. Fit a linear regression model.
  4. Evaluate your model using appropriate metrics.

You must justify:

  • which predictors you included or excluded,
  • any transformations you applied,
  • how you evaluated model performance and what that performance means

4.2 Regularised regression

Fit a regularised linear model.

Explain:

  • why regularisation is useful in this context,
  • how coefficient estimates differ from the baseline model,
  • what trade-offs you observe between interpretability and stability.

4.3 Comparing models

Compare your models using:

  • predictive performance,
  • stability of coefficients,
  • economic interpretability.

Your comparison should focus on reasoning, not just numerical performance.


Question 5 – Reflection

  1. What information is gained — and what is lost — by aggregating bilateral trade data into country-level indicators?
  2. How do data availability and indicator overlap constrain your modelling choices? What additional data would you add if you were able to do so?
  3. What would be one concrete improvement you would make to the analysis you performed in this formative?

✔️ How we will provide feedback on your work

We won’t grade this formative. But you will receive structured feedback on your answers, designed to help you prepare for summative work later in the course.

👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student, even if you only complete part of it.

ImportantGeneral comment

This formative is about practicing the full data science workflow:

  • understanding raw data,
  • making defensible aggregation choices,
  • building and evaluating models,
  • and interpreting results in context.

Providing long blocks of code is not sufficient. Explanations and interpretations must accompany your code, and you should always justify your choices (e.g. aggregation strategy, variable selection, transformations, evaluation approach).


Question 1 – Understanding the CEPII Data

Weak

  • Loads the data but treats it as a standard country-level dataset.
  • Little or no explanation of what one row represents.
  • No meaningful discussion of time or country coverage.

Developing

  • Correctly identifies exporters, importers, and trade values.
  • Some discussion of coverage, but interpretation is limited or descriptive.

Good

  • Clearly explains the bilateral structure of the CEPII data.
  • Explains why coverage varies across years.
  • Shows understanding of why scale and skewness matter for later analysis.

Excellent

  • Demonstrates strong conceptual understanding of why bilateral trade data pose modelling challenges.
  • Clearly explains why aggregation is necessary before regression.
  • Uses exploratory analysis to motivate later methodological choices.

Question 2 – Aggregating Bilateral Trade Flows

Weak

  • Aggregation is incorrect or mechanically applied.
  • Concentration measures are computed without explanation.
  • Final dataset is not at the country level.

Developing

  • Correct computation of export shares and concentration indices.
  • Limited explanation of what these measures capture economically.

Good

  • Correctly constructs concentration/diversification indicators.
  • Clearly explains how these differ from raw trade totals.
  • Produces a clean, well-structured country-level dataset.

Excellent

  • Demonstrates clear understanding of what trade concentration captures and what it ignores.
  • Discusses how different indicators (e.g. HHI vs. diversification) change interpretation.
  • Shows awareness of how aggregation choices may affect results.

Question 3 – Data Exploration and Merging

Weak

  • Merges datasets without checking losses or missingness.
  • Produces plots or tables without interpretation.

Developing

  • Identifies missing data and basic relationships.
  • Interpretation remains largely descriptive.

Good

  • Clearly discusses coverage, missingness, and overlaps between indicators.
  • Uses exploratory analysis to identify redundancy or complementarity.
  • Interprets plots and correlations using relevant economic or development context (e.g. trade dependence, institutional quality, resource rents), not just statistical description.

Excellent

  • Thoughtfully connects exploration to later modelling decisions.
  • Clearly explains how data constraints shape what can (and cannot) be estimated.
  • Uses substantive context to explain why certain patterns or outliers may arise (e.g. small open economies, resource-dependent countries, regional trading blocs).
  • Explicitly distinguishes between statistical association and economic interpretation, avoiding causal claims.

Question 4 – Predictive Modelling

Weak

  • Fits models without justification or interpretation.
  • No meaningful evaluation or comparison.

Developing

  • Fits baseline and regularised models but treats regularisation mechanically.
  • Limited discussion of model performance or coefficients.
  • Interprets coefficients or performance metrics mechanically, with limited connection to economic meaning.

Good

  • Justifies predictor choice and transformations.
  • Clearly explains why regularisation is useful in this setting.
  • Compares models in terms of performance and interpretability.
  • Interprets coefficients and model behaviour in light of trade structure, institutions, and development theory.
  • Explains whether estimated relationships are plausible given real-world constraints.

Excellent

  • Demonstrates strong understanding of multicollinearity and coefficient stability.
  • Thoughtfully discusses trade-offs between model complexity, stability, and interpretability.
  • Interprets results in a development and policy context.
  • Demonstrates awareness that model results reflect both data structure and context (e.g. indicator overlap, measurement error, country heterogeneity).
  • Carefully discusses why some predictors appear unstable or sensitive to regularisation, grounding the explanation in both statistics and economics.
  • Avoids over-interpretation of predictive performance, clearly acknowledging limitations.

Question 5 – Reflection

Weak

  • Very short or generic reflections not tied to the analysis.

Developing

  • Identifies limitations but with limited depth or specificity.

Good

  • Clearly reflects on aggregation, data availability, and modelling constraints.
  • Suggests concrete and relevant improvements.

Excellent

  • Demonstrates nuanced understanding of the limits of inference in applied economic analysis.
  • Connects technical decisions to substantive development and policy questions.
  • Proposes realistic next steps grounded in data availability.