✅ A model solution for the W04 formative

Author

Dr Ghita Berrada

What follows is a possible solution for the W04 formative. If you want to render the .qmd used to generate this page yourself, use the download button below:

⚙️ Setup

We start, as usual, by loading libraries.

import pandas as pd
import re
import missingno as msno
import sweetviz as sv
from lets_plot import *
LetsPlot.setup_html()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score, root_mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
import numpy as np
import miceforest as mf
from sklearn.preprocessing import LabelEncoder
Tip

For cleaner formatting (and coding), it’s good practice to load all the libraries you’ll be using in your .qmd (or Jupyter Notebook) at the beginning of the file.

Question 1

Loading the data

Importing the data should only require the pd.read_csv function.

df=pd.read_csv('../../data/Life_Expectancy_Data.csv')
Note

When loading the data, you should only ever use relative paths and not absolute paths (i.e not full paths that are specific to your machine).

Doing so ensures your code is more reproducible!
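For instance, the relative path used above can also be built with pathlib (a minimal sketch, equivalent to the pd.read_csv call above):

from pathlib import Path

# Build the path relative to this .qmd file rather than hard-coding a machine-specific absolute path
data_path = Path('..') / '..' / 'data' / 'Life_Expectancy_Data.csv'
df = pd.read_csv(data_path)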

We quickly inspect the loaded dataframe by taking a look at its first 5 rows.

df.head()
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5

5 rows × 22 columns

A closer look at the dataframe’s column names gives us important information about the patterns we’ll need to deal with when renaming the columns: special characters like - and /, spaces (including leading/trailing ones), and inconsistent capitalization of terms.

df.columns
Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

Renaming columns

We rename the thinness 1-19 years column (which actually refers to thinness among 10-19 year-olds, hence the new name) using the pandas rename method

df.rename(columns={' thinness  1-19 years':'thinness_10_19_years'},inplace=True)
df
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness_10_19_years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5
2933 Zimbabwe 2004 Developing 44.3 723.0 27 4.36 0.000000 68.0 31 67.0 7.13 65.0 33.6 454.366654 12777511.0 9.4 9.4 0.407 9.2
2934 Zimbabwe 2003 Developing 44.5 715.0 26 4.06 0.000000 7.0 998 7.0 6.52 68.0 36.7 453.351155 12633897.0 9.8 9.9 0.418 9.5
2935 Zimbabwe 2002 Developing 44.8 73.0 25 4.43 0.000000 73.0 304 73.0 6.53 71.0 39.8 57.348340 125525.0 1.2 1.3 0.427 10.0
2936 Zimbabwe 2001 Developing 45.3 686.0 25 1.72 0.000000 76.0 529 76.0 6.16 75.0 42.1 548.587312 12366165.0 1.6 1.7 0.427 9.8
2937 Zimbabwe 2000 Developing 46.0 665.0 24 1.68 0.000000 79.0 1483 78.0 7.10 78.0 43.5 547.358878 12222251.0 11.0 11.2 0.434 9.8

2938 rows × 22 columns

We write a function to rename the columns of the dataframe as required:

  • removing trailing/leading spaces from column names
  • removing special characters (e.g. - and /), i.e. non-alphanumeric characters, and replacing them with _
  • replacing any sequences of multiple _ that the renaming could have produced with a single _
  • finally, converting all column names to lowercase
# Function to clean column names correctly
def clean_column_name(name):
    name = name.strip()  # Remove leading/trailing spaces
    name = re.sub(r'[^a-zA-Z0-9]+', '_', name)  # Replace non-alphanumeric sequences (incl. spaces, dashes) with _
    name = re.sub(r'_+', '_', name)  # Replace multiple _ with a single one
    return name.lower()  # Convert to lowercase

# Print before renaming
print("Before renaming:", df.columns.tolist())

# Apply function to all column names
df.rename(columns=lambda col: clean_column_name(col), inplace=True)

# Print after renaming
print("After renaming:", df.columns.tolist())

Having a look at the column names and first rows of the dataframe after renaming shows that our renaming worked as intended

df.columns
Index(['country', 'year', 'status', 'life_expectancy', 'adult_mortality',
       'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b',
       'measles', 'bmi', 'under_five_deaths', 'polio', 'total_expenditure',
       'diphtheria', 'hiv_aids', 'gdp', 'population', 'thinness_10_19_years',
       'thinness_5_9_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')
df.head()
country year status life_expectancy adult_mortality infant_deaths alcohol percentage_expenditure hepatitis_b measles polio total_expenditure diphtheria hiv_aids gdp population thinness_10_19_years thinness_5_9_years income_composition_of_resources schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5

5 rows × 22 columns

Question 2

To look at missingness patterns, we plot the nullity matrix using the missingno library

msno.matrix(df)

There seems to be a MAR mechanism at play here (the nullity correlation heatmap sketched after this list is one way to check these patterns):

  • the missingness patterns of income_composition_of_resources and schooling are correlated (not exactly surprising when you know that income_composition_of_resources is the human development index and schooling is a component of that index)
  • there is a correlation between the patterns of missingness in gdp and population (again, not wholly unexpected, as gdp is, in fact, GDP per capita, i.e. GDP divided by population - which also points to the fact that the gdp column is not named correctly!)
  • total_expenditure is correlated with hepatitis_b and alcohol (again not exactly surprising, given that total_expenditure represents general government expenditure on health as a percentage of total government expenditure (%), and such expenditure would be expected to rise as hepatitis B immunization coverage increases and as alcohol consumption - and its burden on the healthcare system - increases)
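One way to check these correlations directly is missingno’s nullity correlation heatmap, which measures how strongly the missingness of one column predicts the missingness of another (a quick sketch; the dendrogram is an optional complementary view):

# Pairwise correlations between the columns' missingness indicators
msno.heatmap(df)

# Hierarchical clustering of columns by missingness pattern
msno.dendrogram(df)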
report = sv.analyze(df)
report.show_notebook()

The variables with the most missing values are population (22% of missing values) and hepatitis_b (19%). Other variables with missing values include:

  • life_expectancy (<1%)
  • adult_mortality (<1%)
  • polio (<1%)
  • diphtheria (<1%)
  • bmi (1%)
  • thinness_10_19_years (1%)
  • thinness_5_9_years (1%)
  • schooling (6%)
  • income_composition_of_resources (6%)
  • alcohol (7%)
  • total_expenditure (8%)
  • gdp (15%)
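These percentages can be verified directly from the dataframe (a quick sanity check, independent of the sweetviz report):

# Share of missing values per column, as a percentage, largest first
missing_pct = (df.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])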

Given that the proportion of missing values is rather low and that we have a MAR mechanism at play, the best approach for dealing with the missing data prior to modeling (but after the train/test split, to avoid data leakage!) is imputation. The variables with missing values are all numerical, but the dataset also contains categorical variables (e.g. status): we could perform K-NN imputation or multiple imputation (for more robust imputation) after label encoding the categorical variables.
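For illustration, the K-NN option could look like the sketch below. This is only a sketch of the idea, not what we actually do later (Question 5 uses miceforest’s multiple imputation after the split); n_neighbors=5 is an arbitrary choice, and in practice the imputer should be fitted on the training split only, as just discussed.

from sklearn.impute import KNNImputer

# Label encode the categorical column, then impute the numeric columns with K-NN
df_knn = df.copy()
df_knn['status'] = LabelEncoder().fit_transform(df_knn['status'].astype(str))
numeric_cols = df_knn.select_dtypes(include='number').columns

imputer = KNNImputer(n_neighbors=5)  # arbitrary illustrative choice of neighbours
df_knn[numeric_cols] = imputer.fit_transform(df_knn[numeric_cols])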

Question 3

Renaming gdp column

We rename gdp as gdp_per_capita

df.rename(columns={'gdp':'gdp_per_capita'},inplace=True)
df
country year status life_expectancy adult_mortality infant_deaths alcohol percentage_expenditure hepatitis_b measles polio total_expenditure diphtheria hiv_aids gdp_per_capita population thinness_10_19_years thinness_5_9_years income_composition_of_resources schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5
2933 Zimbabwe 2004 Developing 44.3 723.0 27 4.36 0.000000 68.0 31 67.0 7.13 65.0 33.6 454.366654 12777511.0 9.4 9.4 0.407 9.2
2934 Zimbabwe 2003 Developing 44.5 715.0 26 4.06 0.000000 7.0 998 7.0 6.52 68.0 36.7 453.351155 12633897.0 9.8 9.9 0.418 9.5
2935 Zimbabwe 2002 Developing 44.8 73.0 25 4.43 0.000000 73.0 304 73.0 6.53 71.0 39.8 57.348340 125525.0 1.2 1.3 0.427 10.0
2936 Zimbabwe 2001 Developing 45.3 686.0 25 1.72 0.000000 76.0 529 76.0 6.16 75.0 42.1 548.587312 12366165.0 1.6 1.7 0.427 9.8
2937 Zimbabwe 2000 Developing 46.0 665.0 24 1.68 0.000000 79.0 1483 78.0 7.10 78.0 43.5 547.358878 12222251.0 11.0 11.2 0.434 9.8

2938 rows × 22 columns

Creating a gdp column

GDP is simply GDP per capita multiplied by population. We use this to create a gdp column

df['gdp'] = df['gdp_per_capita']*df['population']

Plotting and discussing GDP per capita distributions

Overall GDP per capita distribution

Let’s plot the overall distribution of GDP per capita for all countries

ggplot(df, aes(x='gdp_per_capita')) + \
    geom_histogram(bins=30, fill='blue', color='black', alpha=0.7) + \
    ggtitle("Distribution of GDP per Capita") + \
    xlab("GDP per Capita (USD)") + \
    ylab("Frequency") + \
    theme_minimal()

This distribution is heavily right-skewed, which indicates that wealth is not spread equally among countries and that most countries have a low GDP per capita. There are a few outliers at very high GDP per capita values (> $60,000) that correspond to wealthier nations.
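Because of this heavy right skew, plotting the same histogram on a logarithmic x-axis can make the bulk of the distribution easier to read (a quick sketch using lets-plot’s scale_x_log10):

ggplot(df, aes(x='gdp_per_capita')) + \
    geom_histogram(bins=30, fill='blue', color='black', alpha=0.7) + \
    scale_x_log10() + \
    ggtitle("Distribution of GDP per Capita (log scale)") + \
    xlab("GDP per Capita (USD, log scale)") + \
    ylab("Frequency") + \
    theme_minimal()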

GDP per capita for developing countries

Let’s now plot the GDP per capita for developing countries

ggplot(df.query("status=='Developing'"), aes(x='gdp_per_capita')) + \
    geom_histogram(bins=30, fill='blue', color='black', alpha=0.7) + \
    ggtitle("Distribution of GDP per Capita for developing countries") + \
    xlab("GDP per Capita (USD)") + \
    ylab("Frequency") + \
    theme_minimal()

The GDP per capita distribution for developing countries broadly mirrors the overall one: most countries have low to extremely low GDP per capita. Note that the higher end of the scale has lower values than in the overall plot (the maximum here is at around $89,000 as opposed to around $120,000 in the overall plot) and that even fewer countries reach the higher end of the scale.

GDP per capita for developed countries

Let’s plot GDP per capita for developed countries

ggplot(df.query("status=='Developed'"), aes(x='gdp_per_capita')) + \
    geom_histogram(bins=30, fill='blue', color='black', alpha=0.7) + \
    ggtitle("Distribution of GDP per Capita for developed countries") + \
    xlab("GDP per Capita (USD)") + \
    ylab("Frequency") + \
    theme_minimal()

Compared to the overall plot, this distribution still shows right-skewness but it’s not as extreme.

Most developed countries have GDP per capita values above $10,000, and the distribution is more spread out. There are more countries with GDP per capita values in the $20,000 to $60,000 range. Unlike in the overall GDP per capita plot, there are no extreme concentrations at the very low end.

The highest bar is still at the lower end, which suggests that, even among countries classified as developed, a significant number have relatively lower GDP per capita (perhaps emerging economies within the developed category). Such an observation brings into question the whole notion of development: do we simply quantify a country’s development/economic wellbeing through measurements like GDP or GDP per capita, or should other parameters be taken into account? See the Stiglitz-Sen-Fitoussi report (Stiglitz, Sen, and Fitoussi 2009) on this point.

Again, we have a few outlier countries with a GDP per capita higher than $100,000, most likely wealthy countries like Switzerland, Luxembourg or Norway.

Overall, the distribution of GDP per capita for developed countries is more balanced though disparities still exist within developed countries.
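Rather than separate plots for developing and developed countries, the two groups can also be compared side by side in a single faceted plot (a sketch; using facet_wrap with free x-scales is my own choice here, so that each panel uses its own range):

ggplot(df, aes(x='gdp_per_capita')) + \
    geom_histogram(bins=30, fill='blue', color='black', alpha=0.7) + \
    facet_wrap(facets='status', scales='free') + \
    ggtitle("GDP per Capita by development status") + \
    xlab("GDP per Capita (USD)") + \
    ylab("Frequency") + \
    theme_minimal()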

Question 4

Distributions of human development index

Overall human development index distribution

Let’s plot the overall development index distribution

ggplot(df, aes(x='income_composition_of_resources')) + \
    geom_histogram(aes(y='..density..'), bins=30, fill='blue', color='black', alpha=0.6) + \
    geom_density(color='red', size=1.2) + \
    ggtitle("Histogram & Density of HDI") + \
    xlab("HDI") + \
    ylab("Density") + \
    theme_classic()

In contrast with the overall GDP per capita distribution, this distribution is slightly left-skewed.

There are outliers at the very low end of the scale (human development index equals 0!). Let’s check to which countries they correspond:

df[df['income_composition_of_resources']==0]
country year status life_expectancy adult_mortality infant_deaths alcohol percentage_expenditure hepatitis_b measles total_expenditure diphtheria hiv_aids gdp_per_capita population thinness_10_19_years thinness_5_9_years income_composition_of_resources schooling gdp
74 Antigua and Barbuda 2005 Developing 74.6 147.0 0 8.15 1455.608186 99.0 0 4.41 99.0 0.1 11371.938950 NaN 3.5 3.4 0.0 0.0 NaN
75 Antigua and Barbuda 2004 Developing 74.4 149.0 0 7.28 22.862952 97.0 0 4.21 97.0 0.1 1352.837400 NaN 3.5 3.4 0.0 0.0 NaN
76 Antigua and Barbuda 2003 Developing 74.2 151.0 0 7.16 1158.065259 99.0 0 4.53 99.0 0.1 9739.825560 NaN 3.5 3.5 0.0 0.0 NaN
77 Antigua and Barbuda 2002 Developing 74.0 153.0 0 7.21 927.407585 99.0 0 4.41 98.0 0.1 9386.716452 NaN 3.6 3.5 0.0 0.0 NaN
78 Antigua and Barbuda 2001 Developing 73.8 154.0 0 7.51 163.767698 96.0 0 4.48 97.0 0.1 9358.154162 NaN 3.6 3.5 0.0 0.0 NaN
2853 Vanuatu 2004 Developing 69.6 169.0 0 0.85 334.167337 63.0 0 4.12 69.0 0.1 1787.947230 24143.0 1.6 1.5 0.0 10.7 4.316641e+07
2854 Vanuatu 2003 Developing 69.4 173.0 0 1.20 27.298391 64.0 165 4.20 69.0 0.1 158.527240 198964.0 1.6 1.6 0.0 10.4 3.154121e+07
2855 Vanuatu 2002 Developing 69.3 176.0 0 1.24 171.137361 66.0 101 3.52 7.0 0.1 1353.934819 193956.0 1.7 1.6 0.0 10.2 2.626038e+08
2856 Vanuatu 2001 Developing 69.1 179.0 0 0.91 163.105292 68.0 7 3.37 7.0 0.1 1362.617310 18929.0 1.7 1.6 0.0 10.1 2.579298e+07
2857 Vanuatu 2000 Developing 69.0 18.0 0 1.21 21.900752 7.0 9 3.28 71.0 0.1 1469.849149 18563.0 1.7 1.7 0.0 9.6 2.728481e+07

130 rows × 23 columns

df[df['income_composition_of_resources']==0]['country'].unique()
array(['Antigua and Barbuda', 'Bahamas', 'Bhutan',
       'Bosnia and Herzegovina', 'Burkina Faso', 'Cabo Verde', 'Chad',
       'Comoros', 'Equatorial Guinea', 'Eritrea', 'Ethiopia', 'Georgia',
       'Grenada', 'Guinea-Bissau', 'Kiribati', 'Lebanon', 'Madagascar',
       'Micronesia (Federated States of)', 'Montenegro', 'Nigeria',
       'Oman', 'Saint Lucia', 'Saint Vincent and the Grenadines',
       'Seychelles', 'South Sudan', 'Suriname',
       'The former Yugoslav republic of Macedonia', 'Timor-Leste',
       'Turkmenistan', 'Uzbekistan', 'Vanuatu'], dtype=object)

Looking at the list of countries where the human development index is 0, we can deduce two things:

  • the value is either due to an entry error or to the fact that the statistic was not computed for many of these countries, if not all of them (this seems likely for Oman, Seychelles, Nigeria, Georgia, Ethiopia)
  • the human development index is likely very low in some of the countries in the list (known low-income countries such as Bhutan or Cabo Verde, or countries with a recent history of war and conflict such as the former Yugoslav republics or South Sudan), but the 0 values are again most likely due to the statistic not being available for these countries rather than reflecting a situation where the statistic was computed and found to be equal to 0

Aside from this, the values of the human development index are typically higher than 0.25, and most are in the [0.6-0.8] range, indicating mid to satisfactory levels of human development overall.

The overall median (0.677) of human development index confirms this observation.

np.nanmedian(df['income_composition_of_resources'])
0.677

Distribution for developing countries

ggplot(df.query("status=='Developing'"), aes(x='income_composition_of_resources')) + \
    geom_histogram(aes(y='..density..'), bins=30, fill='blue', color='black', alpha=0.6) + \
    geom_density(color='red', size=1.2) + \
    ggtitle("Histogram & Density of HDI for developing countries") + \
    xlab("HDI") + \
    ylab("Density") + \
    theme_classic()

The distribution is broadly similar to the overall one. Our outliers at 0 are still there. The higher end of the scale is a tad lower, at 0.92. While most values are still found in the [0.6-0.8] range, we also have a fair concentration of values in the [0.4-0.5] range, indicating that there seem to be at least two distinct clusters of developing countries when it comes to the human development index.

And as expected, the median of values for developing countries is slightly lower than the overall median.

np.nanmedian(df.query("status=='Developing'")['income_composition_of_resources'])
0.631

Distribution for developed countries

Let’s plot the distribution for developed countries.

ggplot(df.query("status=='Developed'"), aes(x='income_composition_of_resources')) + \
    geom_histogram(aes(y='..density..'), bins=30, fill='blue', color='black', alpha=0.6) + \
    geom_density(color='red', size=1.2) + \
    ggtitle("Histogram & Density of HDI for developed countries") + \
    xlab("HDI") + \
    ylab("Density") + \
    theme_classic()

The distribution is less left-skewed than the overall one. The range of values is also narrower as all values are within the range [0.7-0.95], which also means that the human development index is generally higher in developed countries than developing ones. At first glance, it looks as if the human development index differentiates between developing and developed countries better than GDP per capita did.
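A quick numerical summary by status backs up this visual impression (a small check, complementary to the plots):

df.groupby('status')['income_composition_of_resources'].describe().round(3)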

Question 5

Train-test split

We split the data into training and test sets. We drop variables such as country, year, gdp and population, which won’t be used for modeling.

x_train, x_test, y_train, y_test = train_test_split(df.drop(['life_expectancy','country','year','gdp','population'], axis=1), df[['life_expectancy']], test_size=0.3, stratify=df["status"],random_state = 42)

Before proceeding, we take a quick look at the types of the variables in our training set.

x_train.dtypes
status                              object
adult_mortality                    float64
infant_deaths                        int64
alcohol                            float64
percentage_expenditure             float64
hepatitis_b                        float64
measles                              int64
bmi                                float64
under_five_deaths                    int64
polio                              float64
total_expenditure                  float64
diphtheria                         float64
hiv_aids                           float64
gdp_per_capita                     float64
thinness_10_19_years               float64
thinness_5_9_years                 float64
income_composition_of_resources    float64
schooling                          float64
dtype: object

Further processing of training and test sets: label encoding and imputation

Not all variables are numeric (e.g. status), so we use scikit-learn’s LabelEncoder to label encode status (and convert it to a numerical feature).

label_encoder = LabelEncoder()
x_train['status'] = label_encoder.fit_transform(x_train['status'].astype(str))
x_train
status adult_mortality infant_deaths alcohol percentage_expenditure hepatitis_b measles bmi under_five_deaths polio total_expenditure diphtheria hiv_aids gdp_per_capita thinness_10_19_years thinness_5_9_years income_composition_of_resources schooling
612 1 285.0 7 3.82 0.000000 7.0 315 25.0 10 69.0 2.79 82.0 3.5 NaN 8.0 7.6 0.558 10.7
775 1 166.0 6 6.02 664.558524 83.0 0 53.6 8 85.0 4.60 88.0 0.9 5451.669600 3.3 3.3 0.697 12.9
294 1 232.0 1 0.17 27.489070 93.0 6 19.5 1 93.0 6.30 93.0 0.4 177.234497 16.8 17.5 0.000 10.9
2264 1 25.0 27 0.29 5.397369 NaN 5839 17.5 50 49.0 4.63 52.0 0.5 473.453380 12.3 12.3 0.378 5.2
1915 0 66.0 0 6.59 15268.064450 NaN 3 58.9 0 93.0 9.26 93.0 0.1 87646.753460 0.7 0.7 0.936 17.4
772 1 157.0 6 5.93 97.522115 8.0 0 56.8 7 82.0 4.12 83.0 0.3 627.555440 3.3 3.2 0.709 13.2
2567 1 194.0 13 0.86 1.142004 NaN 38 31.6 16 84.0 4.59 85.0 0.3 17.815972 4.1 4.1 0.535 9.7
1190 1 193.0 1100 3.00 64.605901 44.0 33634 16.4 1500 79.0 4.33 82.0 0.2 1461.671957 26.9 27.7 0.580 10.8
1301 1 138.0 1 3.65 37.171088 96.0 0 52.0 1 96.0 5.66 96.0 0.5 521.333630 1.8 1.7 0.725 12.8
2804 0 112.0 28 8.52 0.000000 93.0 66 63.8 33 92.0 15.15 96.0 0.1 NaN 0.7 0.6 NaN NaN

2056 rows × 18 columns

y_train
life_expectancy
612 62.9
775 72.7
294 67.4
2264 57.5
1915 81.0
772 73.4
2567 64.0
1190 66.8
1301 75.3
2804 77.5

2056 rows × 1 columns

We encode the test data in the same way, reusing the encoder fitted on the training set. Calling transform (rather than fit_transform) keeps the category-to-code mapping consistent between the two sets.

x_test['status'] = label_encoder.transform(x_test['status'].astype(str))

Now that this is done, we impute the missing values, in the training set and the test set separately (to avoid data leakage). We’ll use multiple imputation.

np.random.seed(42)

x_train = x_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

df_train_imp = mf.ImputationKernel(
  pd.concat([x_train,y_train],axis=1),
  save_all_iterations_data=True,
  random_state=100
)
# Run the MICE algorithm for 5 iterations
df_train_imp.mice(5)
# Return the completed dataset (for MICE)
df_train_imp_mice = df_train_imp.complete_data(dataset=0)
np.random.seed(42)

x_test = x_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

df_test_imp = mf.ImputationKernel(
  pd.concat([x_test,y_test],axis=1),
  save_all_iterations_data=True,
  random_state=100
)
# Run the MICE algorithm for 5 iterations
df_test_imp.mice(5)
# Return the completed dataset (for MICE)
df_test_imp_mice = df_test_imp.complete_data(dataset=0)

Creating a linear regression model

We create a linear regression model using all variables to predict life_expectancy (as stated before, country, year, gdp and population were dropped when creating the training/test split, so they are not used in this model)

# Create a model instance of a linear regression
linear_model = LinearRegression()

# Fit this instance to the training set
linear_model.fit(df_train_imp_mice.drop('life_expectancy', axis=1),df_train_imp_mice[['life_expectancy']])
LinearRegression()

We compute test set metrics for this model

# Create predictions for the test set
linear_preds = linear_model.predict(df_test_imp_mice.drop('life_expectancy', axis=1))

# Calculate performance metrics
r2 = r2_score(df_test_imp_mice[['life_expectancy']], linear_preds)
rmse = root_mean_squared_error(df_test_imp_mice[['life_expectancy']], linear_preds)
mae = mean_absolute_error(df_test_imp_mice[['life_expectancy']], linear_preds)

# Print results
print("The R2 metric for this model is:",np.round(r2, 3),", the MAE is", np.round(mae, 3)," and the RMSE is ", np.round(rmse, 3),"\n")
The R2 metric for this model is: 0.817 , the MAE is 2.957  and the RMSE is  4.029

These metrics tell us the following:

  • this model explains close to 82% of the variance of the data (based on \(R^2\)). However, this model includes quite a few predictors, potentially artificially inflating the \(R^2\) score, so it might be worth looking at the adjusted \(R^2\) to account for the number of predictors
  • the model’s predictions are off by almost 3 years on average based on MAE, and by about 4 years based on RMSE. The discrepancy between MAE and RMSE indicates quite a few outlier residuals, which RMSE weights more heavily. Given the scale of life expectancy (the average life expectancy was 66.8 years in 2000 and 71.8 years in 2015, the first and last years of data in our dataset), this represents predictions that are off by 4.2 to 4.5% (based on MAE) or 5.6 to 6% (based on RMSE), which looks like an acceptable scale of error.

Let’s quickly compute adjusted \(R^2\) for this model:
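Recall that adjusted \(R^2\) penalizes \(R^2\) for the number of predictors \(p\) relative to the number of observations \(n\):

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\]

This is exactly what the line below computes, with \(n\) the number of test observations and \(p\) the number of features.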

adj_r2 = 1 - (1-linear_model.score(df_test_imp_mice.drop('life_expectancy', axis=1), df_test_imp_mice[['life_expectancy']]))*(len(df_test_imp_mice[['life_expectancy']])-1)/(len(df_test_imp_mice[['life_expectancy']])-df_test_imp_mice.drop('life_expectancy', axis=1).shape[1]-1)
print("The adjusted R2 for this model is:", adj_r2)
The adjusted R2 for this model is: 0.8132287101060904

Adjusted \(R^2\) for this model is extremely close to \(R^2\), which suggests that the extra predictors are not merely inflating the \(R^2\) score and that the variables in our model collectively carry genuine explanatory power.

Let’s now examine this model’s residuals and draw the residuals vs fitted plot.

# Predictions on the (imputed) training set
predictions = linear_model.predict(df_train_imp_mice.drop('life_expectancy', axis=1)).flatten()
# Residuals = observed - predicted
residuals = df_train_imp_mice['life_expectancy'].values - predictions

residuals_df = pd.DataFrame({
    'Predicted': predictions,
    'Residuals': residuals
})

# Residuals vs fitted plot
ggplot(residuals_df) + \
    geom_point(aes(x='Predicted', y='Residuals')) + \
    geom_hline(yintercept=0, color='red', size=1) + \
    ggtitle('Residuals Plot for Linear Regression') + \
    xlab('Predicted values') + \
    ylab('Residuals')

Here are the main observations from the residuals plot:

  • the points are fairly randomly scattered around the horizontal line at zero and there are no obvious patterns to suggest non-linearity or heteroscedasticity. Most of the residuals seem to fall within a reasonable range from the horizontal line at zero, which indicates that the model is not systematically over- or under-predicting at different values
  • there is a slightly higher concentration of residuals between 60 and 80 on the x-axis. This suggests that this is the range where most of our observations lie, which is not exactly surprising given the expected range of life expectancy!
  • a few points stand out as potential outliers (e.g., around 40 on the x-axis and -15 to -20 on the y-axis). These deserve further investigation to understand whether they are genuine extreme values or data errors. Many also happen to fall in areas where we have fewer life expectancy observations, e.g. the [30-40] or [50-60] ranges.

In general, this residuals plot suggests that the linear regression model is likely a reasonable fit for the data but we might need to investigate outlier residuals more thoroughly to ensure they are dealt with appropriately.
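As a starting point for that investigation, we can flag the largest residuals and pull out the corresponding rows of the (imputed) training set (a quick sketch; the 10-year threshold and the columns shown are arbitrary choices):

# Rows whose residual exceeds 10 years in absolute value
outlier_idx = residuals_df.index[residuals_df['Residuals'].abs() > 10]
df_train_imp_mice.loc[outlier_idx, ['life_expectancy', 'adult_mortality', 'hiv_aids', 'income_composition_of_resources']]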

Creating a LASSO model

# Create a model instance of a LASSO regression
lasso_model = Lasso(alpha=0.02, max_iter=2000)

# Fit this instance to the training set
lasso_model.fit(df_train_imp_mice.drop('life_expectancy', axis=1),df_train_imp_mice[['life_expectancy']])
Lasso(alpha=0.02, max_iter=2000)
# Create predictions for the test set
lasso_preds = lasso_model.predict(df_test_imp_mice.drop('life_expectancy', axis=1))

# Calculate performance metrics
r2 = r2_score(df_test_imp_mice[['life_expectancy']], lasso_preds)
rmse = root_mean_squared_error(df_test_imp_mice[['life_expectancy']], lasso_preds)
mae = mean_absolute_error(df_test_imp_mice[['life_expectancy']], lasso_preds)

# Print results
print("The R2 metric for this model is:",np.round(r2, 3),", the MAE is", np.round(mae, 3)," and the RMSE is ", np.round(rmse, 3),"\n")
The R2 metric for this model is: 0.816 , the MAE is 2.967  and the RMSE is  4.039
adj_r2_lasso = 1 - (1-lasso_model.score(df_test_imp_mice.drop('life_expectancy', axis=1), df_test_imp_mice[['life_expectancy']]))*(len(df_test_imp_mice[['life_expectancy']])-1)/(len(df_test_imp_mice[['life_expectancy']])-df_test_imp_mice.drop('life_expectancy', axis=1).shape[1]-1)
print("The adjusted R2 for this model is:", adj_r2_lasso)
The adjusted R2 for this model is: 0.8122625566583088

The metrics for this model are very close to those of the linear model (a 0.001 drop in \(R^2\) and adjusted \(R^2\), as well as a 0.01 increase in both MAE and RMSE), so they tell essentially the same story as for the linear model (with the linear model having a tiny edge).

predictions_lasso = lasso_model.predict(df_train_imp_mice.drop('life_expectancy', axis=1)).flatten()
residuals_lasso = df_train_imp_mice[['life_expectancy']].values.flatten() - lasso_model.predict(df_train_imp_mice.drop('life_expectancy', axis=1))

residuals_df_lasso = pd.DataFrame({
    'Predicted': predictions_lasso,
    'Residuals': residuals_lasso
})

# Residuals vs fitted plot
ggplot(residuals_df_lasso) + \
    geom_point(aes(x='Predicted', y='Residuals')) + \
    geom_hline(yintercept=0, color='red', size=1) + \
    ggtitle('Residuals Plot for LASSO Regression') + \
    xlab('Predicted values') + \
    ylab('Residuals')

The residuals plot presents a similar pattern to that of the linear model. Again, the conclusion is that the LASSO model seems to be a reasonable fit and that we need to investigate the outliers further.

Let’s have a look at the coefficients reduced to 0 by the LASSO model.

lasso_output = pd.DataFrame({"feature": df_train_imp_mice.drop('life_expectancy', axis=1).columns, "coefficient": lasso_model.coef_})

# Code a positive / negative vector
lasso_output["positive"] = np.where(lasso_output["coefficient"] >= 0, True, False)

# Take the absolute value of the coefficients
lasso_output["coefficient"] = np.abs(lasso_output["coefficient"])

# Remove the constant and sort the data frame by (absolute) coefficient magnitude
lasso_output = lasso_output.query("feature != 'const'").sort_values("coefficient")
lasso_output["coefficient_rounded"] = lasso_output["coefficient"].round(6)
(
  ggplot(lasso_output, aes("coefficient", "feature", fill = "positive")) +
  geom_bar(stat = "identity") +
  geom_text(aes(label="coefficient_rounded"), ha="left", va="center", nudge_x=0.65) +
  theme(panel_grid_major_y = element_blank()) +
  labs(x = "Lasso coefficient", y = "Feature",
       fill = "Positive?")+
   ggsize(1400, 900)+
   coord_cartesian(xlim=(min(lasso_output["coefficient"]), max(lasso_output["coefficient"]) + 1.5))  # Add margin
)

Let’s now plot the linear regression model coefficients

linear_output = pd.DataFrame({"feature": df_train_imp_mice.drop('life_expectancy', axis=1).columns, "coefficient": linear_model.coef_.flatten()})

# Code a positive / negative vector
linear_output["positive"] = np.where(linear_output["coefficient"] >= 0, True, False)

# Take the absolute value of the coefficients
linear_output["coefficient"] = np.abs(linear_output["coefficient"])

# Remove the constant and sort the data frame by (absolute) coefficient magnitude
linear_output = linear_output.query("feature != 'const'").sort_values("coefficient")
linear_output["coefficient_rounded"] = linear_output["coefficient"].round(6)
(
    ggplot(linear_output, aes("coefficient", "feature", fill="positive")) +
    geom_bar(stat="identity") +
    geom_text(aes(label="coefficient_rounded"), ha="left", va="center", nudge_x=0.65) +  
    theme(panel_grid_major_y=element_blank()) +
    labs(x="Linear regression coefficient", y="Feature", fill="Positive?") +
    ggsize(1400, 900) +  # Increase figure size
    coord_cartesian(xlim=(min(linear_output["coefficient"]), max(linear_output["coefficient"]) + 1.5))  # Add margin
)

From this, we can see that the LASSO model shrinks all of the coefficients but does not reduce any of them to 0. All the features present in the linear regression model therefore remain in the LASSO model.

Overall, given that LASSO doesn’t improve on the linear regression metrics and doesn’t reduce the number of predictive variables in the model either, it might be best to simply select the linear regression model in this case.
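One caveat: the regularization strength alpha=0.02 was fixed by hand, and at that level the penalty is too weak to zero out any coefficient. If we wanted LASSO to act as a genuine feature selector, the usual next step would be to tune alpha by cross-validation, for instance along the lines sketched below (the alpha grid, fold count and other settings are arbitrary choices of mine, not something the formative asked for):

from sklearn.linear_model import LassoCV

# 5-fold cross-validation over a log-spaced grid of alphas, using the imputed training set
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 30), cv=5, max_iter=5000, random_state=42)
lasso_cv.fit(df_train_imp_mice.drop('life_expectancy', axis=1),
             df_train_imp_mice['life_expectancy'])
print("Cross-validated alpha:", lasso_cv.alpha_)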

References

Stiglitz, Joseph E, Amartya Sen, and Jean-Paul Fitoussi. 2009. “Report by the Commission on the Measurement of Economic Performance and Social Progress.” https://ec.europa.eu/eurostat/documents/8131721/8131772/Stiglitz-Sen-Fitoussi-Commission-report.pdf.