🗓️ Week 05
Exploratory Data Analysis

DS101 – Fundamentals of Data Science

28 Oct 2024

Statistical inference (continued)

Before we proceed with the main topic of this lecture (i.e. exploratory data analysis), we take a slight step back and revisit some topics linked to statistical inference that we didn’t cover last week.

Hypothesis testing

  • The idea of confidence intervals is closely related to the idea of hypothesis testing.
  • The most common hypothesis test involves testing the null hypothesis:
    • \(H_0\): There is no relationship between \(X\) and \(Y\),
    versus the alternative hypothesis:
    • \(H_A\): There is some relationship between \(X\) and \(Y\).
  • Mathematically, this corresponds to testing:

\[ H_0: \beta_1 = 0 \quad \text{vs} \quad H_A: \beta_1 \neq 0, \]

since if \(\beta_1=0\) then the model reduces to \(Y = \beta_0 + \epsilon\), and \(X\) and \(Y\) are not associated.

p-values

  • To test the null hypothesis, we compute a t-statistic, which follows a bell-shaped t-distribution under the null hypothesis.
  • Using statistical software, it is easy to compute the probability of observing any value equal to \(|t|\) or larger.
  • We call this probability the p-value.
  • If the p-value is less than some pre-specified level of significance, say \(\alpha = 0.05\), then we reject the null hypothesis in favor of the alternative hypothesis.
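
A minimal sketch of this test in Python, using statsmodels on simulated data (all names and numbers here are hypothetical, chosen only for illustration):

import numpy as np
import statsmodels.api as sm

# Simulate data where Y really does depend on X (true beta_1 = 0.5)
rng = np.random.default_rng(42)
X = rng.normal(size=100)
Y = 2 + 0.5 * X + rng.normal(size=100)

# Fit Y = beta_0 + beta_1 X + epsilon and test H0: beta_1 = 0
model = sm.OLS(Y, sm.add_constant(X)).fit()
print(f"t-statistic: {model.tvalues[1]:.2f}, p-value: {model.pvalues[1]:.4f}")
# A p-value below alpha = 0.05 means we reject H0 in favour of H_A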

Now for something (slightly) different

  • RCTs
  • A/B testing
  • Causal inference

Randomised controlled trials

  • A randomised controlled trial (RCT) is a type of experiment in which participants are randomly assigned to one of two or more groups (treatment and control).
  • It is the norm in medicine and the life sciences, but also very common in the social sciences.
  • It is deemed by some to be the gold standard for determining causality.

RCTs: how do they work?

You have a group of people

You split them into two groups at random

  • Half of them get a pill
  • The other half gets a placebo (sugar pill)

Then what?

  • After a while, you measure the outcome of interest
  • You compare the two groups using a statistical hypothesis test
  • If the difference is statistically significant, you can conclude that the treatment caused the difference in outcomes
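
A minimal sketch of such a comparison in Python, on simulated outcome data (the group means, spreads, and sizes below are made up for illustration):

import numpy as np
from scipy.stats import ttest_ind

# Simulated outcome measurements for the two randomised groups
rng = np.random.default_rng(0)
treatment = rng.normal(loc=5.5, scale=1.0, size=50)  # pill group
control = rng.normal(loc=5.0, scale=1.0, size=50)    # placebo group

# Two-sample t-test: H0 is that both groups have the same mean outcome
t_stat, p_value = ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # reject H0 if p < alpha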

A/B testing

Causal Inference

📗 Book recommendation:

Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. London: Allen Lane.

(Pearl and Mackenzie 2018)

Exploratory Data Analysis

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

EDA: the basic principles

  • It’s an important step in the data science workflow: it helps you understand the patterns in your data and form hypotheses about them (prior to modelling)
  • EDA helps us understand the main features of the data, the relationships between variables, and the variables that are relevant to our problem.
  • EDA can also help us identify and handle missing or duplicate values, outliers, and errors in the data
  • One of the main tools in EDA is visualisation.

EDA: the basic principles

Steps of an EDA


  1. Generate questions about your data.
  2. Search for answers by visualising, transforming, and modelling your data.
  3. Use what you learn to refine your questions and/or generate new questions.
  • EDA is a fundamentally creative process; it is not a formal process with a strict set of rules.

  • The key to the process is to generate a large quantity of questions so as to be able to ask quality questions and produce interesting insights into the data.

  • There are no set rules about which questions you should ask to guide your research.

  • However, to gather insights on your data, two types of questions will always be useful. These questions can very loosely be worded as:

    • What type of variation occurs within my variables?
    • What type of covariation occurs between my variables?
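
For instance, with the diamonds data we will use below, a histogram addresses variation within one variable and a scatterplot addresses covariation between two; a quick sketch:

from plotnine import ggplot, aes, geom_histogram, geom_point
from plotnine.data import diamonds

# Variation within one variable: the distribution of carat
(ggplot(diamonds, aes(x="carat")) + geom_histogram(binwidth=0.1))

# Covariation between two variables: how price varies with carat
(ggplot(diamonds, aes(x="carat", y="price")) + geom_point(alpha=0.1))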

EDA: things to keep in mind

During the EDA process, as you explore the data, you need to remember some potential sources of (interpretation/analysis) errors:

  • you should always think about how the data were collected
  • what might be missing
  • whether there are data quality problems
  • and be really strict about the difference between correlation and causation (a topic in and of itself!).

EDA: A story of diamonds

Let’s go back to last week’s diamonds for an example.

This is what the first lines of the dataset look like:

carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

And here are some basic properties of the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
Click here for the code to produce the results from this slide
from plotnine.data import diamonds

diamonds.head()
diamonds.info()

EDA: A story of diamonds

What if I want to have a look at the distribution of the carat variable?

Click here for the code to produce the results from this slide
from plotnine import ggplot, aes, geom_histogram
from plotnine.themes import theme_bw
from plotnine.data import diamonds

(ggplot(diamonds, aes(x="carat")) +
  theme_bw() +
  geom_histogram(binwidth=0.5, fill='lightseagreen'))

Why would I look at this distribution plot? Here are some questions I could ask after seeing this plot.

  • What is the range of the variable carat?
  • Which values are the most common? Why?
  • Which values are rare? Why? Does that match expectations?
  • Can you spot any unusual patterns? What might explain them?

Here, we see that most of the carat values are below 3: we should probably zoom in on the range of carats below 3 to discover more interesting patterns.

Click here for the code to produce the results from this slide
from plotnine import ggplot, aes, geom_histogram
from plotnine.data import diamonds

smaller_diamonds = diamonds.query("carat < 3").copy()

(ggplot(smaller_diamonds, aes(x="carat")) + geom_histogram(binwidth=0.01, fill='rebeccapurple'))

This histogram suggests some interesting questions:

  • Why are there more diamonds at whole carats and common fractions of carats?
  • Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?

EDA: A story of diamonds

  • Outliers are sometimes difficult to see in a histogram when you have lots of data
  • If we look at the distribution of the y variable, the only indication of possible outliers is an unusually wide x-axis.
Click here for the code to produce the results from this slide
(ggplot(diamonds, aes(x="y")) + geom_histogram(binwidth=0.5,fill='slateblue'))

Click here for the code to produce the results from this slide
from plotnine import coord_cartesian
(
    ggplot(diamonds, aes(x="y"))
    + geom_histogram(binwidth=0.5,fill='slateblue')
    + coord_cartesian(ylim=[0, 50])
)

EDA: A story of diamonds

  • If we zoom in to small values of the y variable, we get to see the outlier values in the distribution (0, ~30, and ~60)
x y z price
11963 0.00 0.0 0.00 5139
15951 0.00 0.0 0.00 6381
24067 8.09 58.9 8.06 12210
24520 0.00 0.0 0.00 12800
26243 0.00 0.0 0.00 15686
27429 0.00 0.0 0.00 18034
49189 5.15 31.8 5.12 2075
49556 0.00 0.0 0.00 2130
49557 0.00 0.0 0.00 2130

The y variable as well as the x and z variables measure dimensions of the diamond (in mm):

  • their value can’t be 0 (especially not in all three dimensions!), so the rows where x, y, and z are all 0 are likely incorrect
  • the other rows show implausibly large diamonds, but the price is off (the price should be much higher for such large diamonds!)
Click here for the code to produce the results from this slide
unusual = diamonds.query("y < 3 or y > 20").loc[:, ["x", "y", "z", "price"]]
unusual

What could we do with these outliers?

  • remove them (not recommended)
  • replace them with missing values (see the sketch below)

For the handling of outliers, check this document from the World Bank.
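
A sketch of the second option: rather than dropping whole rows, replace the implausible y values (using the same cut-offs as the query above) with missing values, so that the other columns of those rows remain usable:

import numpy as np

# Keep y only where it is plausible (between 3 and 20 mm); otherwise set NaN
diamonds2 = diamonds.assign(
    y=lambda d: d["y"].where(d["y"].between(3, 20), np.nan)
)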

EDA: Flying from New York

Let’s look at unusual values in another dataset: the nycflights13 dataset (available from the nycflights13 package in Python). It records flights that departed NYC in 2013.

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400 5 15
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416 5 29
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089 5 40
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0

The variable dep_time records flight departure time: flights that were cancelled have a missing value for this variable.

\(\Rightarrow\) Outliers might occur, not due to errors in data collection/entry, but for reasons intrinsic to the problem studied.
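
To see this in the data, the cancelled flights are simply the rows where dep_time is missing (a quick check):

from nycflights13 import flights

cancelled = flights[flights["dep_time"].isna()]
print(len(cancelled))  # number of cancelled departures in 2013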

EDA: Flying from New York

One thing you might want to do is compare the scheduled departure times for cancelled and non-cancelled flights.

Click here for the code to produce the results from this slide
from plotnine import ggplot, aes, geom_freqpoly
from nycflights13 import flights
import pandas as pd

flights2 = flights.assign(
    cancelled=lambda x: pd.isna(x["dep_time"]),
    sched_hour=lambda x: x["sched_dep_time"] // 100,
    sched_min=lambda x: x["sched_dep_time"] % 100,
    sched_dep_time=lambda x: x["sched_hour"] + x["sched_min"] / 60,
)

(
    ggplot(flights2, aes(x="sched_dep_time"))
    + geom_freqpoly(aes(color="cancelled"), binwidth=1 / 4)
)

Not exactly the best plot, given that there are many more flights that were not cancelled than flights that were (thankfully)!
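
One possible fix (a sketch, following the approach in R for Data Science): plot densities instead of counts, so the two groups are comparable despite their very different sizes:

from plotnine import ggplot, aes, geom_freqpoly, after_stat

(
    ggplot(flights2, aes(x="sched_dep_time", y=after_stat("density")))
    + geom_freqpoly(aes(color="cancelled"), binwidth=1 / 4)
)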

EDA: A step-by-step example

Let’s now study a more complete step-by-step example of EDA: we’ll have a look at the Global Country Information Dataset 2023 available on Kaggle.

Its features are as follows:

Name Definition
Country Name of the country
Density (P/Km2) Population density measured in persons per square kilometer
Abbreviation Abbreviation or code representing the country
Agricultural Land (%) Percentage of land area used for agricultural purposes
Land Area (Km2) Total land area of the country in square kilometers
Armed Forces Size Size of the armed forces in the country
Birth Rate Number of births per 1,000 population per year
Calling Code International calling code for the country
Capital/Major City Name of the capital or major city
CO2 Emissions Carbon dioxide emissions in tons
CPI Consumer Price Index, a measure of inflation and purchasing power
CPI Change (%) Percentage change in the Consumer Price Index compared to the previous year
Currency_Code Currency code used in the country
Fertility Rate Average number of children born to a woman during her lifetime
Forested Area (%) Percentage of land area covered by forests
Gasoline_Price Price of gasoline per liter in local currency
GDP Gross Domestic Product, the total value of goods and services produced in the country
Gross Primary Education Enrollment (%) Gross enrollment ratio for primary education
Gross Tertiary Education Enrollment (%) Gross enrollment ratio for tertiary education
Infant Mortality Number of deaths per 1,000 live births before reaching one year of age
Largest City Name of the country’s largest city

EDA: A step-by-step example

Rest of the features

Name Definition
Life Expectancy Average number of years a newborn is expected to live
Maternal Mortality Ratio Number of maternal deaths per 100,000 live births
Minimum Wage Minimum wage level in local currency
Official Language Official language(s) spoken in the country
Out of Pocket Health Expenditure (%) Percentage of total health expenditure paid out-of-pocket by individuals
Physicians per Thousand Number of physicians per thousand people
Population Total population of the country
Population Labor Force Participation (%) Percentage of the population that participates in the labor force
Tax Revenue (%) Tax revenue as a percentage of GDP
Total Tax Rate Overall tax burden as a percentage of commercial profits
Unemployment Rate Percentage of the labor force that is unemployed
Urban Population Percentage of the population living in urban areas
Latitude Latitude coordinate of the country’s location
Longitude Longitude coordinate of the country’s location

The data has 195 rows and 35 columns

EDA: A step-by-step example

Here is a sample of the data

Country Density(P/Km2) Abbreviation Agricultural Land( %) Land Area(Km2) Armed Forces size Birth Rate Calling Code Capital/Major City Co2-Emissions Out of pocket health expenditure Physicians per thousand Population Population: Labor force participation (%) Tax revenue (%) Total tax rate Unemployment rate Urban_population Latitude Longitude
112 Moldova 123 MD 74.20% 33,851 7,000 10.10 373.0 Chișinău 5,115 46.20% 3.21 2,657,637 43.10% 17.70% 38.70% 5.47% 1,135,502 47.411631 28.369885
153 Serbia 100 RS 39.30% 77,474 32,000 9.20 381.0 Belgrade 45,221 40.60% 3.11 6,944,975 54.90% 18.60% 36.60% 12.69% 3,907,243 44.016521 21.005859
35 Chile 26 CL 21.20% 756,096 122,000 12.43 56.0 Santiago 85,822 32.20% 2.59 18,952,038 62.60% 18.20% 34.00% 7.09% 16,610,135 -35.675147 -71.542969
19 Bhutan 20 BT 13.60% 38,394 6,000 17.26 975.0 Thimphu 1,261 19.80% 0.42 727,145 66.70% 16.00% 35.30% 2.34% 317,538 27.514162 90.433601
48 Dominica 96 DM 33.30% 751 NaN 12.00 1.0 Roseau 180 28.40% 1.08 71,808 NaN 22.10% 32.60% NaN 50,830 15.414999 -61.370976

5 rows × 35 columns

Click here for the code to produce the results from this slide
import pandas as pd

df = pd.read_csv('world-data-2023.csv') # You can download the file from Kaggle
df.sample(5)

EDA: A step-by-step example

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 35 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Country                                    195 non-null    object 
 1   Density
(P/Km2)                            195 non-null    object 
 2   Abbreviation                               188 non-null    object 
 3   Agricultural Land( %)                      188 non-null    object 
 4   Land Area(Km2)                             194 non-null    object 
 5   Armed Forces size                          171 non-null    object 
 6   Birth Rate                                 189 non-null    float64
 7   Calling Code                               194 non-null    float64
 8   Capital/Major City                         192 non-null    object 
 9   Co2-Emissions                              188 non-null    object 
 10  CPI                                        178 non-null    object 
 11  CPI Change (%)                             179 non-null    object 
 12  Currency-Code                              180 non-null    object 
 13  Fertility Rate                             188 non-null    float64
 14  Forested Area (%)                          188 non-null    object 
 15  Gasoline Price                             175 non-null    object 
 16  GDP                                        193 non-null    object 
 17  Gross primary education enrollment (%)     188 non-null    object 
 18  Gross tertiary education enrollment (%)    183 non-null    object 
 19  Infant mortality                           189 non-null    float64
 20  Largest city                               189 non-null    object 
 21  Life expectancy                            187 non-null    float64
 22  Maternal mortality ratio                   181 non-null    float64
 23  Minimum wage                               150 non-null    object 
 24  Official language                          190 non-null    object 
 25  Out of pocket health expenditure           188 non-null    object 
 26  Physicians per thousand                    188 non-null    float64
 27  Population                                 194 non-null    object 
 28  Population: Labor force participation (%)  176 non-null    object 
 29  Tax revenue (%)                            169 non-null    object 
 30  Total tax rate                             183 non-null    object 
 31  Unemployment rate                          176 non-null    object 
 32  Urban_population                           190 non-null    object 
 33  Latitude                                   194 non-null    float64
 34  Longitude                                  194 non-null    float64
dtypes: float64(9), object(26)
memory usage: 53.4+ KB

Two types of data: float64 and object (most likely string)
Click here for the code to produce the results from this slide
df.info()

EDA: Step-by-step example

We have 9 float columns in total:
                1) Birth Rate
                2) Calling Code
                3) Fertility Rate
                4) Infant mortality
                5) Life expectancy
                6) Maternal mortality ratio
                7) Physicians per thousand
                8) Latitude
                9) Longitude

Click here for the code to produce the results from this slide
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings

plt.style.use('fivethirtyeight')
plt.style.use('dark_background')
warnings.filterwarnings('ignore')

# numerical columns
numerical_column = df.select_dtypes(include=['float']).columns
print(f'We have {len(numerical_column)} float columns in total:')
for count, column_name in enumerate(numerical_column, 1):
    print(f'\t\t\t\t{count}) {column_name}')

df.describe().T.style.bar(
    subset=['mean'],
    color='purple',  # bar colour
).background_gradient(subset=['std'], cmap=plt.cm.coolwarm).background_gradient(subset='50%', cmap='viridis')

EDA: Step-by-step example

Your selected dataframe has 35 columns.
There are 33 columns that have missing values.
Click here for the code to produce the results from this slide
# credit: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction.
# One of the best notebooks on getting started with a ML problem.

def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

df_missing = missing_values_table(df)
df_missing
index Missing Values % of Total Values
Minimum wage 45 23.1
Tax revenue (%) 26 13.3
Armed Forces size 24 12.3
Gasoline Price 20 10.3
Unemployment rate 19 9.7
Population: Labor force participation (%) 19 9.7
CPI 17 8.7
CPI Change (%) 16 8.2
Currency-Code 15 7.7
Maternal mortality ratio 14 7.2
Gross tertiary education enrollment (%) 12 6.2
Total tax rate 12 6.2
Life expectancy 8 4.1
Agricultural Land( %) 7 3.6
Physicians per thousand 7 3.6
Out of pocket health expenditure 7 3.6
Abbreviation 7 3.6
Gross primary education enrollment (%) 7 3.6
Forested Area (%) 7 3.6
Fertility Rate 7 3.6
Co2-Emissions 7 3.6
Infant mortality 6 3.1
Largest city 6 3.1
Birth Rate 6 3.1
Official language 5 2.6
Urban_population 5 2.6
Capital/Major City 3 1.5
GDP 2 1.0
Population 1 0.5
Calling Code 1 0.5
Land Area(Km2) 1 0.5
Latitude 1 0.5
Longitude 1 0.5

EDA: Step-by-step example

Click here for the code to produce the results from this slide
import missingno as msno
msno.matrix(df)

EDA: Step-by-step example

Click here for the code to produce the results from this slide
msno.bar(df)

EDA: Step-by-step example

Reasons for Missing Values

Before we start handling the missing values, it is important to understand the various reasons behind the missingness in the data. Broadly speaking, there can be three possible reasons (also called (data) missingness mechanisms):

  1. Missing Completely at Random (MCAR)

The missing values on a given random variable (Y) are not associated/correlated with other variables in a given data set or with the variable (Y) itself. In other words, there is no particular reason for the missing values.

  2. Missing at Random (MAR)

The MAR mechanism occurs when the probability of missing values for a given random variable Y is related to some other measured variable (or variables) in the analysis model, but not to the values of Y itself.

EDA: Step-by-step example

Reasons for Missing Values

  3. Missing Not at Random (MNAR)

Missingness depends on variables not observed in the available data, or on the value of the random variable Y itself (the variable with missing values).

For more details, see (Enders 2022) or (Scheffer 2002).
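
A toy simulation can make the three mechanisms concrete (a sketch; all data here is made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({"X": rng.normal(size=1000), "Y": rng.normal(size=1000)})

mcar = toy["Y"].mask(rng.random(1000) < 0.2)  # MCAR: 20% missing, purely at random
mar  = toy["Y"].mask(toy["X"] > 1)            # MAR: missingness depends on observed X
mnar = toy["Y"].mask(toy["Y"] > 1)            # MNAR: missingness depends on Y itself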

EDA: Step-by-step example

Missingness patterns in the data

  • Most columns have missing values
  • “Minimum wage,” “Tax revenue (%),” and “Armed Forces size” have 45, 26, and 24 missing values, respectively, which correspond to 23.08%, 13.33%, and 12.31% of the total observations
  • “Urban_population,” “Longitude,” “Population,” and “Calling Code” have only 5, 1, 1, and 1 missing values, respectively, which account for less than 3% of the total observations
  • a MAR pattern between “Tax revenue (%)” and “Minimum wage” and between “Gross primary education enrollment (%)” and “Gross tertiary education enrollment (%)” \(\Rightarrow\) imputation is the best strategy

EDA: Step-by-step example

Before we proceed to impute missing values, we clean up the data (converting columns from string type to float)

Click here for the code to do that
columns_to_convert = ['Density\n(P/Km2)', 'Agricultural Land( %)', 'Land Area(Km2)',
                      'Birth Rate', 'Co2-Emissions', 'Forested Area (%)',
                      'CPI', 'CPI Change (%)', 'Fertility Rate', 'Gasoline Price', 'GDP',
                      'Gross primary education enrollment (%)', 'Armed Forces size',
                      'Gross tertiary education enrollment (%)', 'Infant mortality',
                      'Life expectancy', 'Maternal mortality ratio', 'Minimum wage', 
                      'Out of pocket health expenditure', 'Physicians per thousand', 
                      'Population', 'Population: Labor force participation (%)', 
                      'Tax revenue (%)', 'Total tax rate', 'Unemployment rate', 'Urban_population']
df[columns_to_convert] = df[columns_to_convert].applymap(
    lambda x: float(str(x).replace('%', '').replace(',', '').replace('$', ''))
)

Missing data imputation

Possibilities:

  • replace the missing value by a constant value
  • replace the missing values by a statistic (mean or median)
  • K-NN imputation (replace missing value by average of value from closest neighbours in terms of distance)
Here, we’ll simply impute by replacing missing values for numerical variables by the median and for categorical variables by the mode.
Click here for the imputation code
for col in df.columns:
  if df[col].dtype=='float':
    df[col].fillna(df[col].median(),inplace=True)
  else:
    df[col].fillna(df[col].mode()[0],inplace=True)
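
For comparison, here is a sketch of the K-NN option mentioned above, using scikit-learn’s KNNImputer. It is an alternative to the median/mode fill, not something we run here, and it assumes the numeric columns have already been converted to float as we did earlier:

from sklearn.impute import KNNImputer

num_cols = df.select_dtypes(include="float").columns
imputer = KNNImputer(n_neighbors=5)  # impute from the 5 most similar rows
df[num_cols] = imputer.fit_transform(df[num_cols])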

We check the number of missing values in our columns.

Country                                      0
Density\n(P/Km2)                             0
Abbreviation                                 0
Agricultural Land( %)                        0
Land Area(Km2)                               0
Armed Forces size                            0
Birth Rate                                   0
Calling Code                                 0
Capital/Major City                           0
Co2-Emissions                                0
CPI                                          0
CPI Change (%)                               0
Currency-Code                                0
Fertility Rate                               0
Forested Area (%)                            0
Gasoline Price                               0
GDP                                          0
Gross primary education enrollment (%)       0
Gross tertiary education enrollment (%)      0
Infant mortality                             0
Largest city                                 0
Life expectancy                              0
Maternal mortality ratio                     0
Minimum wage                                 0
Official language                            0
Out of pocket health expenditure             0
Physicians per thousand                      0
Population                                   0
Population: Labor force participation (%)    0
Tax revenue (%)                              0
Total tax rate                               0
Unemployment rate                            0
Urban_population                             0
Latitude                                     0
Longitude                                    0
dtype: int64
No more missing values!
Click here for the code that produced the above table
missing = df.isnull().sum()
print(missing)

EDA: Step-by-step example

Summary metrics of numeric variables

Click here for the code that produced the above figure
# Calculate summary statistics for numerical columns
numerical_columns = df.select_dtypes([float,int]).columns
describe_numerical = df[numerical_columns].describe()

print('Summary statistics are described below:')
describe_numerical.T.style.bar(
    subset=['mean'],
    color='purple',
).background_gradient(subset=['std'],cmap=plt.cm.coolwarm).background_gradient(subset='50%',cmap='viridis').background_gradient(subset='max',cmap='viridis')

EDA: Step-by-step example

We create a column “GDP per capita” (GDP divided by Population)

Here are the first rows showing the new column.

index Country GDP per capita
0 Afghanistan 502.11548691997746
1 Albania 5352.857411084262
2 Algeria 3948.3432789227913
3 Andorra 40886.39116175365
4 Angola 2973.591159799147
Click here for the code that produced the above results
df["GDP per capita"]=df["GDP"]/df["Population"]
df[["Country","GDP per capita"]].head()

EDA: Step-by-step example

And get some facts about GDP per capita:

Country with highest GDP per capita is Vatican City

Country with lowest GDP per capita is Burundi

Click here for the code that produced the above results
gdppercapita_country = df.groupby('Country')['GDP per capita'].mean().reset_index()
# use .loc label indexing to find the countries with the max and min GDP per capita
max_gdppercapita_country = gdppercapita_country.loc[gdppercapita_country['GDP per capita'].idxmax(), 'Country']
min_gdppercapita_country = gdppercapita_country.loc[gdppercapita_country['GDP per capita'].idxmin(), 'Country']

print('Country with highest GDP per capita is {}\n'.format(max_gdppercapita_country))
print('Country with lowest GDP per capita is {}'.format(min_gdppercapita_country))

top_5_gdppercapita_countries = gdppercapita_country.sort_values(by='GDP per capita', ascending=False).head()

fig1 = px.bar(
    data_frame=top_5_gdppercapita_countries,
    x='Country',
    y='GDP per capita',
    color='Country',
    template='plotly_dark',
)
fig1.update_layout(title='Top 5 GDP per capita Countries', title_x=0.5, title_font=dict(size=29))  # update title
fig1.show()  # show our plot

EDA: Step-by-step example


What does the data tell us so far?

  • The 5 countries with the highest GDP per capita all have small populations, are located in Western Europe, and are known for their very low tax rates and, to some extent, for being tax havens.
  • The 5 countries with the lowest GDP per capita are all low-income African countries.

What kind of economic output does GDP per capita really measure? What is its relationship with population size? Tax rates? Education? Governance (not in this dataset)?

Click here for the code that produced the above results
bottom_5_gdppercapita_countries = gdppercapita_country.sort_values(by='GDP per capita', ascending=True).head()

fig1 = px.bar(
    data_frame=bottom_5_gdppercapita_countries,
    x='Country',
    y='GDP per capita',
    color='Country',
    template='plotly_dark',
)
# update title
fig1.update_layout(title='Bottom 5 GDP per capita Countries', title_x=0.5, title_font=dict(size=29))
# show our plot
fig1.show()

EDA: Step-by-step example

Our exploration of GDP per capita has only just started:

  • we’ve made a few hypotheses based on our basic initial plots
  • but there is much more we could do, e.g. look at the relationship between GDP per capita and the other variables in the dataset, or look at the distribution of GDP per capita itself (do we have clusters of countries with similar GDP per capita? see the sketch below)
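
As a starting point for that last question, a sketch of the distribution of GDP per capita, on a log scale since the variable is heavily right-skewed:

from plotnine import ggplot, aes, geom_histogram, scale_x_log10

# rename the column so it is a valid name inside aes()
gdp_df = df.rename(columns={"GDP per capita": "gdp_per_capita"})

(
    ggplot(gdp_df, aes(x="gdp_per_capita"))
    + geom_histogram(bins=30, fill='lightseagreen')
    + scale_x_log10()
)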

EDA: Visualisation resources

  • Dataviz-inspiration.com: this website “aims at being the biggest list of chart examples available on the web”
  • From Data to Viz: the website “is a classification of chart types based on input data format. It comes in the form of a decision tree leading to a set of potentially appropriate visualizations to represent the dataset.”

References

Enders, Craig K. 2022. Applied Missing Data Analysis. Guilford Publications.
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. London: Allen Lane.
Scheffer, Judi. 2002. “Dealing with Missing Data.”