✅ A model solution for the W04 formative
What follows is a possible solution for the W04 formative. If you want to render the .qmd
that was used to generate this page for yourselves, use the download button below:
⚙️ Setup
We start, as usual, by loading libraries.
import pandas as pd
import re
import missingno as msno
import sweetviz as sv
from lets_plot import *
LetsPlot.setup_html()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score, root_mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
import numpy as np
import miceforest as mf
from sklearn.preprocessing import LabelEncoder
For cleaner formatting (and coding), it’s good practice to load all the libraries you’ll be using in your .qmd
(or Jupyter Notebook) at the beginning of the file.
Question 1
Loading the data
Importing the data should only require the pd.read_csv
method.
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')
When loading the data, you should only ever use relative paths, not absolute paths (i.e. not full paths that are specific to your machine).
Doing so ensures your code is more reproducible!
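For illustration, here is the difference (a minimal sketch; the absolute path shown is a made-up example of what to avoid):
from pathlib import Path

# Relative path: resolved from wherever the .qmd / notebook lives,
# so it works for anyone who clones the project
data_path = Path("../../data/Life_Expectancy_Data.csv")

# Absolute path (avoid): only exists on one specific machine
# data_path = Path("C:/Users/some_user/Documents/W04/data/Life_Expectancy_Data.csv")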
We quickly inspect the content of the loaded dataframe by taking a look at its first 5 rows.
df.head()
Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | … | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | … | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | … | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | … | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | … | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | … | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
5 rows × 22 columns
A closer look at the dataframe’s column names gives us important information about the patterns we’ll need to deal with when renaming the columns: special characters like - and /, spaces, and inconsistent capitalization of terms.
df.columns
Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness 1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')
Renaming columns
I rename the thinness 1-19 years column using the pandas rename method.
df.rename(columns={' thinness 1-19 years':'thinness_10_19_years'}, inplace=True)
df
Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | … | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_10_19_years | thinness 5-9 years | Income composition of resources | Schooling | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | … | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | … | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | … | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | … | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | … | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2933 | Zimbabwe | 2004 | Developing | 44.3 | 723.0 | 27 | 4.36 | 0.000000 | 68.0 | 31 | … | 67.0 | 7.13 | 65.0 | 33.6 | 454.366654 | 12777511.0 | 9.4 | 9.4 | 0.407 | 9.2 |
2934 | Zimbabwe | 2003 | Developing | 44.5 | 715.0 | 26 | 4.06 | 0.000000 | 7.0 | 998 | … | 7.0 | 6.52 | 68.0 | 36.7 | 453.351155 | 12633897.0 | 9.8 | 9.9 | 0.418 | 9.5 |
2935 | Zimbabwe | 2002 | Developing | 44.8 | 73.0 | 25 | 4.43 | 0.000000 | 73.0 | 304 | … | 73.0 | 6.53 | 71.0 | 39.8 | 57.348340 | 125525.0 | 1.2 | 1.3 | 0.427 | 10.0 |
2936 | Zimbabwe | 2001 | Developing | 45.3 | 686.0 | 25 | 1.72 | 0.000000 | 76.0 | 529 | … | 76.0 | 6.16 | 75.0 | 42.1 | 548.587312 | 12366165.0 | 1.6 | 1.7 | 0.427 | 9.8 |
2937 | Zimbabwe | 2000 | Developing | 46.0 | 665.0 | 24 | 1.68 | 0.000000 | 79.0 | 1483 | … | 78.0 | 7.10 | 78.0 | 43.5 | 547.358878 | 12222251.0 | 11.0 | 11.2 | 0.434 | 9.8 |
2938 rows × 22 columns
I write a function to rename the columns of the dataframe as required:

- removing trailing/leading spaces from column names
- removing special characters (e.g. -, /), i.e. non-alphanumeric characters, and replacing them with _
- replacing sequences of multiple _ that the renaming could have produced with a single _
- finally, converting all column names to lowercase
# Function to clean column names correctly
def clean_column_name(name):
    name = name.strip()  # Remove leading/trailing spaces
    name = re.sub(r'[^a-zA-Z0-9]+', '_', name)  # Replace non-alphanumeric sequences (incl. spaces, dashes) with _
    name = re.sub(r'_+', '_', name)  # Replace multiple _ with a single one
    return name.lower()  # Convert to lowercase
# Print before renaming
print("Before renaming:", df.columns.tolist())

# Apply function to all column names
df.rename(columns=lambda col: clean_column_name(col), inplace=True)

# Print after renaming
print("After renaming:", df.columns.tolist())
Having a look at the column names and first rows of the dataframe after renaming shows that our renaming worked as intended.
df.columns
Index(['country', 'year', 'status', 'life_expectancy', 'adult_mortality',
       'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b',
       'measles', 'bmi', 'under_five_deaths', 'polio', 'total_expenditure',
       'diphtheria', 'hiv_aids', 'gdp', 'population', 'thinness_10_19_years',
       'thinness_5_9_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')
df.head()
country | year | status | life_expectancy | adult_mortality | infant_deaths | alcohol | percentage_expenditure | hepatitis_b | measles | … | polio | total_expenditure | diphtheria | hiv_aids | gdp | population | thinness_10_19_years | thinness_5_9_years | income_composition_of_resources | schooling | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | … | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | … | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | … | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | … | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | … | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
5 rows × 22 columns
Question 2
To look at missingness patterns, we plot the nullity matrix using the missingno library.
msno.matrix(df)
There seems to be a MAR mechanism at play here (a quick way to double-check the suspected correlations follows this list):

- there seems to be a correlation in the missingness patterns of income_composition_of_resources and schooling (not exactly surprising when you know that income_composition_of_resources is the human development index and schooling is a component of that index)
- there is a correlation between the patterns of missingness in gdp and population (again, not wholly unexpected, as gdp is, in fact, GDP per capita, i.e. GDP divided by population - which also points to the fact that the gdp column is not named correctly!)
- total_expenditure is correlated with hepatitis_b and alcohol (again not exactly surprising given that total_expenditure represents general government expenditure on health as a percentage of total government expenditure (%), and such expenditure would increase as hepatitis B immunization coverage increases and as alcohol consumption - and its burden on the healthcare system - increases)
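One optional way to double-check these suspected relationships (not required by the question) is missingno's nullity correlation heatmap, which quantifies how strongly the missingness of one column predicts the missingness of another:
# Nullity correlation heatmap: values near 1 mean two columns tend to be missing
# together (e.g. gdp and population), values near 0 mean their missingness is unrelated
msno.heatmap(df)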
report = sv.analyze(df)
report.show_notebook()
The variables with the most missing values are population (22% of missing values) and hepatitis_b (19%). Other variables with missing values include:

- life_expectancy (<1%)
- adult_mortality (<1%)
- polio (<1%)
- diphtheria (<1%)
- bmi (1%)
- thinness_10_19_years (1%)
- thinness_5_9_years (1%)
- schooling (6%)
- income_composition_of_resources (6%)
- alcohol (7%)
- total_expenditure (8%)
- gdp (15%)
Given that the proportion of missing values is rather low and that we have a MAR mechanism at play, the best approach for dealing with the missing data prior to modeling (but after the train/test split, to avoid data leakage!) is imputation. The variables with missing values are all numerical, but the dataset also contains categorical variables (e.g. status): we could perform K-NN imputation or multiple imputation (for more robust imputation) after label encoding the categorical variables.
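For illustration, a minimal sketch of the K-NN option is given below; it assumes the train/test split and label encoding described in Question 5 have already been carried out, and the variable names are purely illustrative (the model solution itself uses multiple imputation via miceforest instead):
from sklearn.impute import KNNImputer

# Fit the imputer on the training features only, then apply it to both sets,
# so that no information from the test set leaks into the imputation
knn_imputer = KNNImputer(n_neighbors=5)
x_train_knn = pd.DataFrame(knn_imputer.fit_transform(x_train), columns=x_train.columns)
x_test_knn = pd.DataFrame(knn_imputer.transform(x_test), columns=x_test.columns)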
Question 3
Renaming the gdp column
We rename gdp as gdp_per_capita:
df.rename(columns={'gdp':'gdp_per_capita'}, inplace=True)
df
country | year | status | life_expectancy | adult_mortality | infant_deaths | alcohol | percentage_expenditure | hepatitis_b | measles | … | polio | total_expenditure | diphtheria | hiv_aids | gdp_per_capita | population | thinness_10_19_years | thinness_5_9_years | income_composition_of_resources | schooling | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | … | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | … | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | … | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | … | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | … | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2933 | Zimbabwe | 2004 | Developing | 44.3 | 723.0 | 27 | 4.36 | 0.000000 | 68.0 | 31 | … | 67.0 | 7.13 | 65.0 | 33.6 | 454.366654 | 12777511.0 | 9.4 | 9.4 | 0.407 | 9.2 |
2934 | Zimbabwe | 2003 | Developing | 44.5 | 715.0 | 26 | 4.06 | 0.000000 | 7.0 | 998 | … | 7.0 | 6.52 | 68.0 | 36.7 | 453.351155 | 12633897.0 | 9.8 | 9.9 | 0.418 | 9.5 |
2935 | Zimbabwe | 2002 | Developing | 44.8 | 73.0 | 25 | 4.43 | 0.000000 | 73.0 | 304 | … | 73.0 | 6.53 | 71.0 | 39.8 | 57.348340 | 125525.0 | 1.2 | 1.3 | 0.427 | 10.0 |
2936 | Zimbabwe | 2001 | Developing | 45.3 | 686.0 | 25 | 1.72 | 0.000000 | 76.0 | 529 | … | 76.0 | 6.16 | 75.0 | 42.1 | 548.587312 | 12366165.0 | 1.6 | 1.7 | 0.427 | 9.8 |
2937 | Zimbabwe | 2000 | Developing | 46.0 | 665.0 | 24 | 1.68 | 0.000000 | 79.0 | 1483 | … | 78.0 | 7.10 | 78.0 | 43.5 | 547.358878 | 12222251.0 | 11.0 | 11.2 | 0.434 | 9.8 |
2938 rows × 22 columns
Creating a gdp column
GDP is simply GDP per capita multiplied by population. We use this to create a gdp column:
df['gdp'] = df['gdp_per_capita'] * df['population']
Plotting and discussing GDP per capita distributions
Overall GDP per capita distribution
Let’s plot the overall distribution of GDP per capita for all countries
ggplot(df, aes(x='gdp_per_capita')) + \
    geom_histogram(bins=30, fill='blue', color='black', alpha=0.7) + \
    ggtitle("Distribution of GDP per Capita") + \
    xlab("GDP per Capita (USD)") + \
    ylab("Frequency") + \
    theme_minimal()
This is a heavily right-skewed distribution that indicates that wealth is not spread equally among countries and that most countries have low GDP per capita. There are a few outliers at very high values of GDP per capita (> $60,000) that correspond to wealthier nations.
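If we want to put a rough number on that skewness (an optional check, not required by the question), pandas can compute the sample skewness directly:
# Positive values indicate right-skew; the larger the value, the heavier the right tail
print(df['gdp_per_capita'].skew())
print(df.groupby('status')['gdp_per_capita'].skew())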
GDP per capita for developing countries
Let’s now plot the GDP per capita for developing countries
"status=='Developing'"), aes(x='gdp_per_capita')) + \
ggplot(df.query(=30, fill='blue', color='black', alpha=0.7) + \
geom_histogram(bins"Distribution of GDP per Capita for developing countries") + \
ggtitle("GDP per Capita (USD)") + \
xlab("Frequency") + \
ylab( theme_minimal()
The GDP per capita distribution for developing countries closely mirrors the overall one: most countries have low to extremely low GDP per capita. Note that the higher end of the scale has lower values than in the overall plot (the maximum here is around $89,000 as opposed to around $120,000 in the overall plot) and that even fewer countries reach the higher end of the scale.
GDP per capita for developed countries
Let’s plot GDP per capita for developed countries
"status=='Developed'"), aes(x='gdp_per_capita')) + \
ggplot(df.query(=30, fill='blue', color='black', alpha=0.7) + \
geom_histogram(bins"Distribution of GDP per Capita for developed countries") + \
ggtitle("GDP per Capita (USD)") + \
xlab("Frequency") + \
ylab( theme_minimal()
Compared to the overall plot, this distribution still shows right-skewness but it’s not as extreme.
Most developed countries have GDP per capita values above $10,000, and the distribution is more spread out. There are more countries with GDP per capita values in the $20,000 to $60,000 range. Unlike in the overall GDP per capita plot, there are no extreme concentrations at the very low end.
The highest bar is still at the lower end, which suggests that, even among countries classified as developed, a significant number have relatively lower GDP per capita (perhaps emerging economies within the developed category). Such an observation brings into question the whole notion of development: do we simply quantify a country’s development/economic wellbeing through measurements like GDP/GDP per capita, or are other parameters to be taken into account? See the Stiglitz-Sen-Fitoussi report (Stiglitz, Sen, and Fitoussi 2009) on this point.
Again, we have a few outlier countries with a GDP per capita higher than $100,000, most likely wealthy countries like Switzerland, Luxembourg or Norway.
Overall, the distribution of GDP per capita for developed countries is more balanced though disparities still exist within developed countries.
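A quick numerical cross-check of this visual comparison (again optional) is to look at summary statistics of GDP per capita by development status:
# Medians and quartiles by status should confirm the gap described above
df.groupby('status')['gdp_per_capita'].describe()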
Question 4
Distributions of human development index
Overall human development index distribution
Let’s plot the overall development index distribution
ggplot(df, aes(x='income_composition_of_resources')) + \
    geom_histogram(aes(y='..density..'), bins=30, fill='blue', color='black', alpha=0.6) + \
    geom_density(color='red', size=1.2) + \
    ggtitle("Histogram & Density of HDI") + \
    xlab("HDI") + \
    ylab("Density") + \
    theme_classic()
In contrast with the overall GDP per capita distribution, this distribution is slightly left-skewed.
There are outliers at the very low end of the scale (human development index equals 0!). Let’s check to which countries they correspond:
df[df['income_composition_of_resources']==0]
country | year | status | life_expectancy | adult_mortality | infant_deaths | alcohol | percentage_expenditure | hepatitis_b | measles | … | total_expenditure | diphtheria | hiv_aids | gdp_per_capita | population | thinness_10_19_years | thinness_5_9_years | income_composition_of_resources | schooling | gdp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
74 | Antigua and Barbuda | 2005 | Developing | 74.6 | 147.0 | 0 | 8.15 | 1455.608186 | 99.0 | 0 | … | 4.41 | 99.0 | 0.1 | 11371.938950 | NaN | 3.5 | 3.4 | 0.0 | 0.0 | NaN |
75 | Antigua and Barbuda | 2004 | Developing | 74.4 | 149.0 | 0 | 7.28 | 22.862952 | 97.0 | 0 | … | 4.21 | 97.0 | 0.1 | 1352.837400 | NaN | 3.5 | 3.4 | 0.0 | 0.0 | NaN |
76 | Antigua and Barbuda | 2003 | Developing | 74.2 | 151.0 | 0 | 7.16 | 1158.065259 | 99.0 | 0 | … | 4.53 | 99.0 | 0.1 | 9739.825560 | NaN | 3.5 | 3.5 | 0.0 | 0.0 | NaN |
77 | Antigua and Barbuda | 2002 | Developing | 74.0 | 153.0 | 0 | 7.21 | 927.407585 | 99.0 | 0 | … | 4.41 | 98.0 | 0.1 | 9386.716452 | NaN | 3.6 | 3.5 | 0.0 | 0.0 | NaN |
78 | Antigua and Barbuda | 2001 | Developing | 73.8 | 154.0 | 0 | 7.51 | 163.767698 | 96.0 | 0 | … | 4.48 | 97.0 | 0.1 | 9358.154162 | NaN | 3.6 | 3.5 | 0.0 | 0.0 | NaN |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2853 | Vanuatu | 2004 | Developing | 69.6 | 169.0 | 0 | 0.85 | 334.167337 | 63.0 | 0 | … | 4.12 | 69.0 | 0.1 | 1787.947230 | 24143.0 | 1.6 | 1.5 | 0.0 | 10.7 | 4.316641e+07 |
2854 | Vanuatu | 2003 | Developing | 69.4 | 173.0 | 0 | 1.20 | 27.298391 | 64.0 | 165 | … | 4.20 | 69.0 | 0.1 | 158.527240 | 198964.0 | 1.6 | 1.6 | 0.0 | 10.4 | 3.154121e+07 |
2855 | Vanuatu | 2002 | Developing | 69.3 | 176.0 | 0 | 1.24 | 171.137361 | 66.0 | 101 | … | 3.52 | 7.0 | 0.1 | 1353.934819 | 193956.0 | 1.7 | 1.6 | 0.0 | 10.2 | 2.626038e+08 |
2856 | Vanuatu | 2001 | Developing | 69.1 | 179.0 | 0 | 0.91 | 163.105292 | 68.0 | 7 | … | 3.37 | 7.0 | 0.1 | 1362.617310 | 18929.0 | 1.7 | 1.6 | 0.0 | 10.1 | 2.579298e+07 |
2857 | Vanuatu | 2000 | Developing | 69.0 | 18.0 | 0 | 1.21 | 21.900752 | 7.0 | 9 | … | 3.28 | 71.0 | 0.1 | 1469.849149 | 18563.0 | 1.7 | 1.7 | 0.0 | 9.6 | 2.728481e+07 |
130 rows × 23 columns
df[df['income_composition_of_resources']==0]['country'].unique()
array(['Antigua and Barbuda', 'Bahamas', 'Bhutan',
       'Bosnia and Herzegovina', 'Burkina Faso', 'Cabo Verde', 'Chad',
       'Comoros', 'Equatorial Guinea', 'Eritrea', 'Ethiopia', 'Georgia',
       'Grenada', 'Guinea-Bissau', 'Kiribati', 'Lebanon', 'Madagascar',
       'Micronesia (Federated States of)', 'Montenegro', 'Nigeria',
       'Oman', 'Saint Lucia', 'Saint Vincent and the Grenadines',
       'Seychelles', 'South Sudan', 'Suriname',
       'The former Yugoslav republic of Macedonia', 'Timor-Leste',
       'Turkmenistan', 'Uzbekistan', 'Vanuatu'], dtype=object)
Looking at the list of countries where the human development index is 0, we can deduce two things:

- the value is either due to an entry error or to the fact that the statistic was not computed for many, if not all, of these countries (this seems likely for Oman, Seychelles, Nigeria, Georgia, Ethiopia)
- the human development index is likely very low in some of the countries in the list (known low-income countries such as Bhutan or Cabo Verde, or countries with a recent history of war and conflict such as the former Yugoslav republics or South Sudan), but the 0 values are again more likely due to the statistic not being available for these countries than to the statistic having been computed and found to equal 0
Aside from this, the values of the human development index are typically higher than 0.25 and most are in the range [0.6-0.8], indicating mid to satisfactory levels of human development overall.
The overall median (0.677) of human development index confirms this observation.
np.nanmedian(df['income_composition_of_resources'])
0.677
Distribution for developing countries
"status=='Developing'"), aes(x='income_composition_of_resources')) + \
ggplot(df.query(='..density..'), bins=30, fill='blue', color='black', alpha=0.6) + \
geom_histogram(aes(y='red', size=1.2) + \
geom_density(color"Histogram & Density of HDI for developing countries") + \
ggtitle("HDI") + \
xlab("Density") + \
ylab( theme_classic()
The distribution is broadly similar to the overall one. Our outliers at 0 are still there. The higher end of the scale is a tad lower, at 0.92. While most values still fall in the [0.6-0.8] range, we also have a fair concentration of values in the [0.4-0.5] range, indicating that there seem to be at least two distinct clusters of developing countries when it comes to the human development index.
And as expected, the median of values for developing countries is slightly lower than the overall median.
"status=='Developing'")['income_composition_of_resources']) np.nanmedian(df.query(
0.631
Distribution for developed countries
Let’s plot the distribution for developed countries.
"status=='Developed'"), aes(x='income_composition_of_resources')) + \
ggplot(df.query(='..density..'), bins=30, fill='blue', color='black', alpha=0.6) + \
geom_histogram(aes(y='red', size=1.2) + \
geom_density(color"Histogram & Density of HDI for developed countries") + \
ggtitle("HDI") + \
xlab("Density") + \
ylab( theme_classic()
The distribution is less left-skewed than the overall one. The range of values is also narrower as all values are within the range [0.7-0.95], which also means that the human development index is generally higher in developed countries than developing ones. At first glance, it looks as if the human development index differentiates between developing and developed countries better than GDP per capita did.
Question 5
Train-test split
We split the data into training and test sets. We drop variables such as country, year, gdp and population, which won’t be used for modeling.
x_train, x_test, y_train, y_test = train_test_split(
    df.drop(['life_expectancy', 'country', 'year', 'gdp', 'population'], axis=1),
    df[['life_expectancy']],
    test_size=0.3,
    stratify=df["status"],
    random_state=42
)
Before proceeding, we take a quick look at the types of the variables in our training set.
x_train.dtypes
status                              object
adult_mortality                    float64
infant_deaths                        int64
alcohol                            float64
percentage_expenditure             float64
hepatitis_b                        float64
measles                              int64
bmi                                float64
under_five_deaths                    int64
polio                              float64
total_expenditure                  float64
diphtheria                         float64
hiv_aids                           float64
gdp_per_capita                     float64
thinness_10_19_years               float64
thinness_5_9_years                 float64
income_composition_of_resources    float64
schooling                          float64
dtype: object
Further processing of training and test sets: label encoding and imputation
Not all variables are numeric, e.g. status. So we use scikit-learn’s LabelEncoder to label encode status (and convert it to a numerical feature).
label_encoder = LabelEncoder()
x_train['status'] = label_encoder.fit_transform(x_train['status'].astype(str))
x_train
status | adult_mortality | infant_deaths | alcohol | percentage_expenditure | hepatitis_b | measles | bmi | under_five_deaths | polio | total_expenditure | diphtheria | hiv_aids | gdp_per_capita | thinness_10_19_years | thinness_5_9_years | income_composition_of_resources | schooling | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
612 | 1 | 285.0 | 7 | 3.82 | 0.000000 | 7.0 | 315 | 25.0 | 10 | 69.0 | 2.79 | 82.0 | 3.5 | NaN | 8.0 | 7.6 | 0.558 | 10.7 |
775 | 1 | 166.0 | 6 | 6.02 | 664.558524 | 83.0 | 0 | 53.6 | 8 | 85.0 | 4.60 | 88.0 | 0.9 | 5451.669600 | 3.3 | 3.3 | 0.697 | 12.9 |
294 | 1 | 232.0 | 1 | 0.17 | 27.489070 | 93.0 | 6 | 19.5 | 1 | 93.0 | 6.30 | 93.0 | 0.4 | 177.234497 | 16.8 | 17.5 | 0.000 | 10.9 |
2264 | 1 | 25.0 | 27 | 0.29 | 5.397369 | NaN | 5839 | 17.5 | 50 | 49.0 | 4.63 | 52.0 | 0.5 | 473.453380 | 12.3 | 12.3 | 0.378 | 5.2 |
1915 | 0 | 66.0 | 0 | 6.59 | 15268.064450 | NaN | 3 | 58.9 | 0 | 93.0 | 9.26 | 93.0 | 0.1 | 87646.753460 | 0.7 | 0.7 | 0.936 | 17.4 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
772 | 1 | 157.0 | 6 | 5.93 | 97.522115 | 8.0 | 0 | 56.8 | 7 | 82.0 | 4.12 | 83.0 | 0.3 | 627.555440 | 3.3 | 3.2 | 0.709 | 13.2 |
2567 | 1 | 194.0 | 13 | 0.86 | 1.142004 | NaN | 38 | 31.6 | 16 | 84.0 | 4.59 | 85.0 | 0.3 | 17.815972 | 4.1 | 4.1 | 0.535 | 9.7 |
1190 | 1 | 193.0 | 1100 | 3.00 | 64.605901 | 44.0 | 33634 | 16.4 | 1500 | 79.0 | 4.33 | 82.0 | 0.2 | 1461.671957 | 26.9 | 27.7 | 0.580 | 10.8 |
1301 | 1 | 138.0 | 1 | 3.65 | 37.171088 | 96.0 | 0 | 52.0 | 1 | 96.0 | 5.66 | 96.0 | 0.5 | 521.333630 | 1.8 | 1.7 | 0.725 | 12.8 |
2804 | 0 | 112.0 | 28 | 8.52 | 0.000000 | 93.0 | 66 | 63.8 | 33 | 92.0 | 15.15 | 96.0 | 0.1 | NaN | 0.7 | 0.6 | NaN | NaN |
2056 rows × 18 columns
y_train
life_expectancy | |
---|---|
612 | 62.9 |
775 | 72.7 |
294 | 67.4 |
2264 | 57.5 |
1915 | 81.0 |
… | … |
772 | 73.4 |
2567 | 64.0 |
1190 | 66.8 |
1301 | 75.3 |
2804 | 77.5 |
2056 rows × 1 columns
We do the same for the test data.
label_encoder = LabelEncoder()
x_test['status'] = label_encoder.fit_transform(x_test['status'].astype(str))
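A note on this step: because status is binary, fitting a fresh encoder on the test set happens to produce the same 0/1 mapping, but the safer general pattern is to reuse the encoder fitted on the training set. A sketch of that alternative (meant as a replacement for the cell above, not something to run in addition to it):
# Reuse the encoder fitted on x_train so the 'Developed'/'Developing' -> 0/1 mapping
# is guaranteed to be identical in the training and test sets
x_test['status'] = label_encoder.transform(x_test['status'].astype(str))

# The learned mapping can be inspected like this
print(dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))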
Now that it’s done, we proceed to impute missing values, first in the training set then the test set (to avoid data leakage). We’ll use multiple imputation.
np.random.seed(42)

x_train = x_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

df_train_imp = mf.ImputationKernel(
    pd.concat([x_train, y_train], axis=1),
    save_all_iterations_data=True,
    random_state=100
)
# Run the MICE algorithm for 5 iterations
df_train_imp.mice(5)
# Return the completed dataset (for MICE)
df_train_imp_mice = df_train_imp.complete_data(dataset=0)
np.random.seed(42)

x_test = x_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

df_test_imp = mf.ImputationKernel(
    pd.concat([x_test, y_test], axis=1),
    save_all_iterations_data=True,
    random_state=100
)
# Run the MICE algorithm for 5 iterations
df_test_imp.mice(5)
# Return the completed dataset (for MICE)
df_test_imp_mice = df_test_imp.complete_data(dataset=0)
Creating a linear regression model
We create a linear regression model using all variables to predict life_expectancy (as stated before, country, year, gdp and population were dropped when creating the training/test split and are not used in this model).
# Create a model instance of a linear regression
linear_model = LinearRegression()

# Fit this instance to the training set
linear_model.fit(df_train_imp_mice.drop('life_expectancy', axis=1), df_train_imp_mice[['life_expectancy']])
LinearRegression()
We compute test set metrics for this model
# Create predictions for the test set
linear_preds = linear_model.predict(df_test_imp_mice.drop('life_expectancy', axis=1))

# Calculate performance metrics
r2 = r2_score(df_test_imp_mice[['life_expectancy']], linear_preds)
rmse = root_mean_squared_error(df_test_imp_mice[['life_expectancy']], linear_preds)
mae = mean_absolute_error(df_test_imp_mice[['life_expectancy']], linear_preds)

# Print results
print("The R2 metric for this model is:", np.round(r2, 3), ", the MAE is", np.round(mae, 3), " and the RMSE is ", np.round(rmse, 3), "\n")
The R2 metric for this model is: 0.817 , the MAE is 2.957  and the RMSE is  4.029
These metrics tell us the following:
- this model explains close to 82% of the variance in the data (based on \(R^2\)). However, this model includes quite a few predictors, potentially artificially inflating the \(R^2\) score, so it is worth looking at the adjusted \(R^2\) to account for the number of predictors (see the formula below)
- the model's predictions are off by almost 3 years on average based on MAE and by about 4 years based on RMSE. The discrepancy between MAE and RMSE indicates quite a few outlier residuals, which are more heavily weighted in RMSE. Given the scale of life expectancy (the average life expectancy was 66.8 years in 2000 and 71.8 years in 2015, the first and last years represented in our dataset), this represents predictions that are off by 4.2 to 4.5% (based on MAE) or 5.6 to 6% (based on RMSE), which looks like an acceptable scale of error
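For reference, adjusted \(R^2\) penalizes the plain \(R^2\) for the number of predictors \(p\) used, given \(n\) observations; this is exactly what the next cell computes:
\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\]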
Let’s quickly compute adjusted \(R^2\) for this model:
adj_r2 = 1 - (1 - linear_model.score(df_test_imp_mice.drop('life_expectancy', axis=1), df_test_imp_mice[['life_expectancy']])) * (len(df_test_imp_mice[['life_expectancy']]) - 1) / (len(df_test_imp_mice[['life_expectancy']]) - df_test_imp_mice.drop('life_expectancy', axis=1).shape[1] - 1)
print("The adjusted R2 for this model is:", adj_r2)
The adjusted R2 for this model is: 0.8132287101060904
Adjusted \(R^2\) for this model is extremely close to \(R^2\), which suggests that the predictors in our model all contribute explanatory power rather than merely inflating the fit.
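Adjusted \(R^2\) by itself says little about collinearity between predictors; if we wanted to check that directly, one option (a sketch only, assuming statsmodels is available - it is not loaded in the setup cell above) is to compute variance inflation factors:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df_train_imp_mice.drop('life_expectancy', axis=1)
X_const = sm.add_constant(X)

# A VIF above ~10 is a common rule of thumb for problematic collinearity
vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns
)
print(vifs.sort_values(ascending=False))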
Let’s now examine this model’s residuals and draw the residuals vs fitted plot.
predictions = linear_model.predict(df_train_imp_mice.drop('life_expectancy', axis=1)).flatten()
residuals = df_train_imp_mice[['life_expectancy']] - linear_model.predict(df_train_imp_mice.drop('life_expectancy', axis=1))
residuals = residuals.values.flatten()

residuals_df = pd.DataFrame({
    'Predicted': predictions,
    'Residuals': residuals
})

# Residuals vs fitted plot
ggplot(residuals_df) + \
    geom_point(aes(x='Predicted', y='Residuals')) + \
    geom_hline(yintercept=0, color='red', size=1) + \
    ggtitle('Residuals Plot for Linear Regression') + \
    xlab('Predicted values') + \
    ylab('Residuals')
Here are the main observations from the residuals plot:
- the points are fairly randomly scattered around the horizontal line at zero and there are no obvious patterns to suggest non-linearity or heteroscedasticity. Most of the residuals seem to fall within a reasonable range from the horizontal line at zero, which indicates that the model is not systematically over- or under-predicting at different values
- there is a slightly higher concentration of residuals between 60 and 80 on the x-axis. This suggests that this is the range where most of our observations lie, which is not exactly surprising given the expected range of life expectancy!
- a few points stand out as potential outliers (e.g., around 40 on the x-axis and -15 to -20 on the y-axis). These deserve further investigation to understand whether they are genuine extreme values or data errors. Many also happen to fall in areas where we have fewer life expectancy observations, e.g. the [30-40] or [50-60] ranges.
In general, this residuals plot suggests that the linear regression model is likely a reasonable fit for the data but we might need to investigate outlier residuals more thoroughly to ensure they are dealt with appropriately.
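As a starting point for that investigation (an optional sketch using the residuals_df built above), we could pull out the observations with the largest absolute residuals:
# Indices of the 10 largest absolute residuals
worst_idx = residuals_df['Residuals'].abs().sort_values(ascending=False).head(10).index

# The corresponding (imputed) training observations, for closer inspection
df_train_imp_mice.loc[worst_idx]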
Creating a LASSO model
# Create a model instance of a LASSO regression
lasso_model = Lasso(alpha=0.02, max_iter=2000)

# Fit this instance to the training set
lasso_model.fit(df_train_imp_mice.drop('life_expectancy', axis=1), df_train_imp_mice[['life_expectancy']])
Lasso(alpha=0.02, max_iter=2000)
# Create predictions for the test set
lasso_preds = lasso_model.predict(df_test_imp_mice.drop('life_expectancy', axis=1))

# Calculate performance metrics
r2 = r2_score(df_test_imp_mice[['life_expectancy']], lasso_preds)
rmse = root_mean_squared_error(df_test_imp_mice[['life_expectancy']], lasso_preds)
mae = mean_absolute_error(df_test_imp_mice[['life_expectancy']], lasso_preds)

# Print results
print("The R2 metric for this model is:", np.round(r2, 3), ", the MAE is", np.round(mae, 3), " and the RMSE is ", np.round(rmse, 3), "\n")
The R2 metric for this model is: 0.816 , the MAE is 2.967  and the RMSE is  4.039
adj_r2_lasso = 1 - (1 - lasso_model.score(df_test_imp_mice.drop('life_expectancy', axis=1), df_test_imp_mice[['life_expectancy']])) * (len(df_test_imp_mice[['life_expectancy']]) - 1) / (len(df_test_imp_mice[['life_expectancy']]) - df_test_imp_mice.drop('life_expectancy', axis=1).shape[1] - 1)
print("The adjusted R2 for this model is:", adj_r2_lasso)
The adjusted R2 for this model is: 0.8122625566583088
The metrics for this model are very close to those of the linear model (a 0.001 drop in \(R^2\) and adjusted \(R^2\), as well as a 0.01 increase in both MAE and RMSE), so they tell essentially the same story as for the linear model (with the linear model having a tiny edge).
predictions_lasso = lasso_model.predict(df_train_imp_mice.drop('life_expectancy', axis=1)).flatten()
residuals_lasso = df_train_imp_mice[['life_expectancy']].values.flatten() - lasso_model.predict(df_train_imp_mice.drop('life_expectancy', axis=1))

residuals_df_lasso = pd.DataFrame({
    'Predicted': predictions_lasso,
    'Residuals': residuals_lasso
})

# Residuals vs fitted plot
ggplot(residuals_df_lasso) + \
    geom_point(aes(x='Predicted', y='Residuals')) + \
    geom_hline(yintercept=0, color='red', size=1) + \
    ggtitle('Residuals Plot for LASSO Regression') + \
    xlab('Predicted values') + \
    ylab('Residuals')
The residuals plot presents a similar pattern to that of the linear model. Again, the conclusion is the LASSO model seems like a reasonable fit and that we need to investigate outliers further.
Let’s have a look at the coefficients reduced to 0 by the LASSO model.
= pd.DataFrame({"feature": df_train_imp_mice.drop('life_expectancy', axis=1).columns, "coefficient": lasso_model.coef_})
lasso_output
# Code a positive / negative vector
"positive"] = np.where(lasso_output["coefficient"] >= 0, True, False)
lasso_output[
# Take the absolute value of the coefficients
"coefficient"] = np.abs(lasso_output["coefficient"])
lasso_output[
# Remove the constant and sort the data frame by (absolute) coefficient magnitude
= lasso_output.query("feature != 'const'").sort_values("coefficient") lasso_output
"coefficient_rounded"] = lasso_output["coefficient"].round(6)
lasso_output[
("coefficient", "feature", fill = "positive")) +
ggplot(lasso_output, aes(= "identity") +
geom_bar(stat ="coefficient_rounded"), ha="left", va="center", nudge_x=0.65) +
geom_text(aes(label= element_blank()) +
theme(panel_grid_major_y = "Lasso coefficient", y = "Feature",
labs(x = "Positive?")+
fill 1400, 900)+
ggsize(=(min(lasso_output["coefficient"]), max(lasso_output["coefficient"]) + 1.5)) # Add margin
coord_cartesian(xlim )
Let’s now plot the linear regression model coefficients
= pd.DataFrame({"feature": df_train_imp_mice.drop('life_expectancy', axis=1).columns, "coefficient": linear_model.coef_.flatten()})
linear_output
# Code a positive / negative vector
"positive"] = np.where(linear_output["coefficient"] >= 0, True, False)
linear_output[
# Take the absolute value of the coefficients
"coefficient"] = np.abs(linear_output["coefficient"])
linear_output[
# Remove the constant and sort the data frame by (absolute) coefficient magnitude
= linear_output.query("feature != 'const'").sort_values("coefficient") linear_output
"coefficient_rounded"] = linear_output["coefficient"].round(6)
linear_output[
("coefficient", "feature", fill="positive")) +
ggplot(linear_output, aes(="identity") +
geom_bar(stat="coefficient_rounded"), ha="left", va="center", nudge_x=0.65) +
geom_text(aes(label=element_blank()) +
theme(panel_grid_major_y="Linear regression coefficient", y="Feature", fill="Positive?") +
labs(x1400, 900) + # Increase figure size
ggsize(=(min(linear_output["coefficient"]), max(linear_output["coefficient"]) + 1.5)) # Add margin
coord_cartesian(xlim )
From this, we can see that the LASSO model shrinks all coefficients but sets none of them to 0: every coefficient present in the linear regression model is also present in the LASSO model.
Overall, given that LASSO doesn't improve on the linear regression metrics and also doesn't reduce the number of predictive variables compared to linear regression, it might be best to simply select the linear regression model in this case.
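One caveat to this conclusion: the regularisation strength alpha=0.02 was fixed by hand. Before discarding LASSO entirely, it could be worth tuning alpha by cross-validation (a sketch only, reusing the imputed training data from above):
from sklearn.linear_model import LassoCV

# 5-fold cross-validation over an automatically chosen grid of alphas
lasso_cv = LassoCV(cv=5, max_iter=5000, random_state=42)
lasso_cv.fit(
    df_train_imp_mice.drop('life_expectancy', axis=1),
    df_train_imp_mice['life_expectancy']  # 1-D target
)
print("Best alpha found by cross-validation:", lasso_cv.alpha_)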