DS101 – Fundamentals of Data Science
28 Oct 2024
Before we proceed to the main topic of this lecture (i.e., exploratory data analysis), we take a slight step back to cover some topics linked to statistical inference that we didn't get to last week.
\[
\begin{aligned}
H_0&:~ \beta_1 = 0 \\
&\quad\text{vs} \\
H_A&:~ \beta_1 \neq 0,
\end{aligned}
\]
since if \(\beta_1=0\) then the model reduces to \(Y = \beta_0 + \epsilon\), and \(X\) and \(Y\) are not associated.
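As a quick illustration (a sketch on simulated data, not part of the lecture's materials), this is exactly the test that the `statsmodels` OLS summary reports for the slope coefficient:

```python
# Sketch: t-test of H0: beta_1 = 0 on simulated data (illustrative only)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=100)
Y = 2.0 + 0.5 * X + rng.normal(size=100)  # true beta_1 = 0.5

model = sm.OLS(Y, sm.add_constant(X)).fit()
# The x1 row of the summary reports the t-statistic and p-value for H0: beta_1 = 0
print(model.summary())
```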
- You have a group of people.
- You split them into two groups at random.

[Figure: random assignment of people into two groups. Icons: people by Anastasia Latysheva, candy by Samy Menai, arrow by TAQRIB ICON, and pill by Viktor Fediuk; all from the Noun Project.]

[Image source: Devopedia]
📗 Book recommendation:
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. London: Allen Lane.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
EDA is a fundamentally creative process; it is not a formal procedure with a strict set of rules.
The key to the process is to generate a large quantity of questions, so as to be able to ask quality questions and produce interesting insights about the data.
There are no set rules about which questions you should ask to guide your research.
However, to gather insights on your data, two types of questions will always be useful. These questions can very loosely be worded as:

1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?
During the EDA process, as you explore the data, you need to keep in mind some potential sources of interpretation and analysis errors:
Let's go back to last week's diamonds dataset for an example.
This is what the first lines of the dataset look like:
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
And here are some basic properties of the dataset:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
```
What if I want to have a look at the distribution of the `carat` variable?
Why would I look at this distribution plot? Here are some questions I could ask after seeing this plot.
[Distribution plot of the `carat` variable.]

Here, we see that most of the carat values are below 3: we should probably zoom in on the range of carats below 3 to discover more interesting patterns.
```python
from plotnine import ggplot, aes, geom_histogram, geom_bar, labs, scale_fill_discrete, scale_fill_manual, element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds

smaller_diamonds = diamonds.query("carat < 3").copy()

(
    ggplot(smaller_diamonds, aes(x="carat"))
    + geom_histogram(binwidth=0.01, fill="rebeccapurple")
)
```
This histogram suggests some interesting questions:
- With the `y` variable, the only indication of possible outliers is an unusually wide x-axis.
- Zooming in on the `y` variable, we get to see the outlier values in the distribution (0, ~30, and ~60):

| | x | y | z | price |
|---|---|---|---|---|
| 11963 | 0.00 | 0.0 | 0.00 | 5139 |
| 15951 | 0.00 | 0.0 | 0.00 | 6381 |
| 24067 | 8.09 | 58.9 | 8.06 | 12210 |
| 24520 | 0.00 | 0.0 | 0.00 | 12800 |
| 26243 | 0.00 | 0.0 | 0.00 | 15686 |
| 27429 | 0.00 | 0.0 | 0.00 | 18034 |
| 49189 | 5.15 | 31.8 | 5.12 | 2075 |
| 49556 | 0.00 | 0.0 | 0.00 | 2130 |
| 49557 | 0.00 | 0.0 | 0.00 | 2130 |
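As a sketch of how these outliers can be surfaced (reusing the `diamonds` data from `plotnine.data`; the exact thresholds below are illustrative choices, not from the lecture), we can zoom in on the count axis with `coord_cartesian()` and then filter out the unusual rows with pandas:

```python
from plotnine import ggplot, aes, geom_histogram, coord_cartesian
from plotnine.data import diamonds

# Zoom in on small counts so the rare (outlier) bins become visible
(
    ggplot(diamonds, aes(x="y"))
    + geom_histogram(binwidth=0.5)
    + coord_cartesian(ylim=(0, 50))
)

# Extract the unusual rows: y = 0 mm is impossible, y > 20 mm is implausibly large
unusual = diamonds.query("y < 3 or y > 20")[["x", "y", "z", "price"]]
print(unusual)
```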
The `y` variable, as well as the `x` and `z` variables, measures dimensions of the diamond (in mm): `x` is the length, `y` the width, and `z` the depth.
What could we do with these outliers?
For the handling of outliers, check this document from the World Bank.
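One common option, sketched below under the assumption that we want to keep the affected rows, is to replace the implausible measurements with missing values rather than dropping whole observations:

```python
import numpy as np

# Replace implausible y values with NaN instead of dropping the rows,
# so the other (valid) measurements of those diamonds are preserved
diamonds2 = diamonds.assign(
    y=lambda d: d["y"].where((d["y"] >= 3) & (d["y"] <= 20), np.nan)
)
```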
Let's look at unusual values in another dataset: the `nycflights13` dataset (available from the `nycflights13` package in Python). It records flights that departed NYC in 2013.
| | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | … |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | … |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | … |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | … |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | … |
The variable `dep_time` records flight departure time: flights that were cancelled have a missing value for this variable.
\(\Rightarrow\) Outliers might occur not due to errors in data collection/entry, but for reasons intrinsic to the problem studied.
One thing you might want to do is compare the scheduled departure times for cancelled and non-cancelled flights.
```python
from plotnine import ggplot, aes, geom_freqpoly
import pandas as pd
from nycflights13 import flights

# Flag cancelled flights, and convert sched_dep_time (e.g. 1730) to fractional hours (17.5)
flights2 = flights.assign(
    cancelled=lambda x: pd.isna(x["dep_time"]),
    sched_hour=lambda x: x["sched_dep_time"] // 100,
    sched_min=lambda x: x["sched_dep_time"] % 100,
    sched_dep_time=lambda x: x["sched_hour"] + x["sched_min"] / 60,
)

(
    ggplot(flights2, aes(x="sched_dep_time"))
    + geom_freqpoly(aes(color="cancelled"), binwidth=1 / 4)
)
```
Not exactly the best plot, given that there are (thankfully!) many more flights that were not cancelled than flights that were.
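One way to make the comparison fairer (a sketch, not from the lecture) is to compare densities rather than raw counts, so that each group integrates to one:

```python
from plotnine import ggplot, aes, geom_freqpoly, after_stat

# Plot normalised densities so the two groups are comparable despite very different sizes
(
    ggplot(flights2, aes(x="sched_dep_time", color="cancelled"))
    + geom_freqpoly(aes(y=after_stat("density")), binwidth=1 / 4)
)
```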
Let’s now study a more complete step-by-step example of EDA: we’ll have a look at the Global Country Information Dataset 2023 available on Kaggle.
Its features are as follows:
| Name | Definition |
|---|---|
| Country | Name of the country |
| Density (P/Km2) | Population density measured in persons per square kilometer |
| Abbreviation | Abbreviation or code representing the country |
| Agricultural Land (%) | Percentage of land area used for agricultural purposes |
| Land Area (Km2) | Total land area of the country in square kilometers |
| Armed Forces Size | Size of the armed forces in the country |
| Birth Rate | Number of births per 1,000 population per year |
| Calling Code | International calling code for the country |
| Capital/Major City | Name of the capital or major city |
| CO2 Emissions | Carbon dioxide emissions in tons |
| CPI | Consumer Price Index, a measure of inflation and purchasing power |
| CPI Change (%) | Percentage change in the Consumer Price Index compared to the previous year |
| Currency_Code | Currency code used in the country |
| Fertility Rate | Average number of children born to a woman during her lifetime |
| Forested Area (%) | Percentage of land area covered by forests |
| Gasoline_Price | Price of gasoline per liter in local currency |
| GDP | Gross Domestic Product, the total value of goods and services produced in the country |
| Gross Primary Education Enrollment (%) | Gross enrollment ratio for primary education |
| Gross Tertiary Education Enrollment (%) | Gross enrollment ratio for tertiary education |
| Infant Mortality | Number of deaths per 1,000 live births before reaching one year of age |
| Largest City | Name of the country's largest city |
Rest of the features
| Name | Definition |
|---|---|
| Life Expectancy | Average number of years a newborn is expected to live |
| Maternal Mortality Ratio | Number of maternal deaths per 100,000 live births |
| Minimum Wage | Minimum wage level in local currency |
| Official Language | Official language(s) spoken in the country |
| Out of Pocket Health Expenditure (%) | Percentage of total health expenditure paid out-of-pocket by individuals |
| Physicians per Thousand | Number of physicians per thousand people |
| Population | Total population of the country |
| Population: Labor Force Participation (%) | Percentage of the population that participates in the labor force |
| Tax Revenue (%) | Tax revenue as a percentage of GDP |
| Total Tax Rate | Overall tax burden as a percentage of commercial profits |
| Unemployment Rate | Percentage of the labor force that is unemployed |
| Urban Population | Percentage of the population living in urban areas |
| Latitude | Latitude coordinate of the country's location |
| Longitude | Longitude coordinate of the country's location |
The data has 195 rows and 35 columns. Here is a sample of the data:
| | Country | Density(P/Km2) | Abbreviation | Agricultural Land( %) | Land Area(Km2) | Armed Forces size | Birth Rate | Calling Code | Capital/Major City | Co2-Emissions | … | Out of pocket health expenditure | Physicians per thousand | Population | Population: Labor force participation (%) | Tax revenue (%) | Total tax rate | Unemployment rate | Urban_population | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 112 | Moldova | 123 | MD | 74.20% | 33,851 | 7,000 | 10.10 | 373.0 | Chișinău | 5,115 | … | 46.20% | 3.21 | 2,657,637 | 43.10% | 17.70% | 38.70% | 5.47% | 1,135,502 | 47.411631 | 28.369885 |
| 153 | Serbia | 100 | RS | 39.30% | 77,474 | 32,000 | 9.20 | 381.0 | Belgrade | 45,221 | … | 40.60% | 3.11 | 6,944,975 | 54.90% | 18.60% | 36.60% | 12.69% | 3,907,243 | 44.016521 | 21.005859 |
| 35 | Chile | 26 | CL | 21.20% | 756,096 | 122,000 | 12.43 | 56.0 | Santiago | 85,822 | … | 32.20% | 2.59 | 18,952,038 | 62.60% | 18.20% | 34.00% | 7.09% | 16,610,135 | -35.675147 | -71.542969 |
| 19 | Bhutan | 20 | BT | 13.60% | 38,394 | 6,000 | 17.26 | 975.0 | Thimphu | 1,261 | … | 19.80% | 0.42 | 727,145 | 66.70% | 16.00% | 35.30% | 2.34% | 317,538 | 27.514162 | 90.433601 |
| 48 | Dominica | 96 | DM | 33.30% | 751 | NaN | 12.00 | 1.0 | Roseau | 180 | … | 28.40% | 1.08 | 71,808 | NaN | 22.10% | 32.60% | NaN | 50,830 | 15.414999 | -61.370976 |
5 rows × 35 columns
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 35 columns):
 #   Column                                     Non-Null Count  Dtype
---  ------                                     --------------  -----
 0   Country                                    195 non-null    object
 1   Density
(P/Km2)                                    195 non-null    object
 2   Abbreviation                               188 non-null    object
 3   Agricultural Land( %)                      188 non-null    object
 4   Land Area(Km2)                             194 non-null    object
 5   Armed Forces size                          171 non-null    object
 6   Birth Rate                                 189 non-null    float64
 7   Calling Code                               194 non-null    float64
 8   Capital/Major City                         192 non-null    object
 9   Co2-Emissions                              188 non-null    object
 10  CPI                                        178 non-null    object
 11  CPI Change (%)                             179 non-null    object
 12  Currency-Code                              180 non-null    object
 13  Fertility Rate                             188 non-null    float64
 14  Forested Area (%)                          188 non-null    object
 15  Gasoline Price                             175 non-null    object
 16  GDP                                        193 non-null    object
 17  Gross primary education enrollment (%)     188 non-null    object
 18  Gross tertiary education enrollment (%)    183 non-null    object
 19  Infant mortality                           189 non-null    float64
 20  Largest city                               189 non-null    object
 21  Life expectancy                            187 non-null    float64
 22  Maternal mortality ratio                   181 non-null    float64
 23  Minimum wage                               150 non-null    object
 24  Official language                          190 non-null    object
 25  Out of pocket health expenditure           188 non-null    object
 26  Physicians per thousand                    188 non-null    float64
 27  Population                                 194 non-null    object
 28  Population: Labor force participation (%)  176 non-null    object
 29  Tax revenue (%)                            169 non-null    object
 30  Total tax rate                             183 non-null    object
 31  Unemployment rate                          176 non-null    object
 32  Urban_population                           190 non-null    object
 33  Latitude                                   194 non-null    float64
 34  Longitude                                  194 non-null    float64
dtypes: float64(9), object(26)
memory usage: 53.4+ KB
```
The only dtypes present are `float64` and `object` (most likely strings).
```
We have a total of 9 float columns:
    1) Birth Rate
    2) Calling Code
    3) Fertility Rate
    4) Infant mortality
    5) Life expectancy
    6) Maternal mortality ratio
    7) Physicians per thousand
    8) Latitude
    9) Longitude
```
```python
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings

# The second style overrides most of the first; both are applied in the original notebook
plt.style.use('fivethirtyeight')
plt.style.use('dark_background')

warnings.filterwarnings('ignore')

# Numerical (float) columns
numerical_column = df.select_dtypes(include=['float']).columns
print(f'We have a total of {len(numerical_column)} float columns:')
for count, column_name in enumerate(numerical_column, 1):
    print(f'\t{count}) {column_name}')
```
```python
# Styled summary statistics: bars for the mean, colour gradients for std and median
df.describe().T.style.bar(
    subset=['mean'],
    color='purple',  # bar colour
).background_gradient(subset=['std'], cmap=plt.cm.coolwarm).background_gradient(subset='50%', cmap='viridis')
```
```
Your selected dataframe has 35 columns.
There are 33 columns that have missing values.
```
```python
# credit: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
# One of the best notebooks on getting started with a ML problem.
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

df_missing = missing_values_table(df)
df_missing
```
| index | Missing Values | % of Total Values |
|---|---|---|
| Minimum wage | 45 | 23.1 |
| Tax revenue (%) | 26 | 13.3 |
| Armed Forces size | 24 | 12.3 |
| Gasoline Price | 20 | 10.3 |
| Unemployment rate | 19 | 9.7 |
| Population: Labor force participation (%) | 19 | 9.7 |
| CPI | 17 | 8.7 |
| CPI Change (%) | 16 | 8.2 |
| Currency-Code | 15 | 7.7 |
| Maternal mortality ratio | 14 | 7.2 |
| Gross tertiary education enrollment (%) | 12 | 6.2 |
| Total tax rate | 12 | 6.2 |
| Life expectancy | 8 | 4.1 |
| Agricultural Land( %) | 7 | 3.6 |
| Physicians per thousand | 7 | 3.6 |
| Out of pocket health expenditure | 7 | 3.6 |
| Abbreviation | 7 | 3.6 |
| Gross primary education enrollment (%) | 7 | 3.6 |
| Forested Area (%) | 7 | 3.6 |
| Fertility Rate | 7 | 3.6 |
| Co2-Emissions | 7 | 3.6 |
| Infant mortality | 6 | 3.1 |
| Largest city | 6 | 3.1 |
| Birth Rate | 6 | 3.1 |
| Official language | 5 | 2.6 |
| Urban_population | 5 | 2.6 |
| Capital/Major City | 3 | 1.5 |
| GDP | 2 | 1.0 |
| Population | 1 | 0.5 |
| Calling Code | 1 | 0.5 |
| Land Area(Km2) | 1 | 0.5 |
| Latitude | 1 | 0.5 |
| Longitude | 1 | 0.5 |
Reasons for Missing Values
Before we start handling the missing values, it is important to understand the various reasons behind the missingness in the data. Broadly speaking, there can be three possible reasons (also called (data) missingness mechanisms):
**Missing Completely At Random (MCAR)**: the missing values on a given random variable (Y) are not associated/correlated with other variables in the data set or with the variable (Y) itself. In other words, there is no particular reason for the missing values.

**Missing At Random (MAR)**: the MAR mechanism occurs when the probability of missing values for a given random variable Y is related to some other measured variable (or variables) in the analysis model, but not to the values of Y itself.
**Missing Not At Random (MNAR)**: missingness depends on variables unobserved in the available data, or on the value of the random variable Y (which is missing values) itself.
For more details, see (Enders 2022) or (Scheffer 2002).
Missingness patterns in the data
Before we proceed to impute missing values, we clean up the data (converting columns from string type to float):
```python
columns_to_convert = ['Density\n(P/Km2)', 'Agricultural Land( %)', 'Land Area(Km2)',
                      'Birth Rate', 'Co2-Emissions', 'Forested Area (%)',
                      'CPI', 'CPI Change (%)', 'Fertility Rate', 'Gasoline Price', 'GDP',
                      'Gross primary education enrollment (%)', 'Armed Forces size',
                      'Gross tertiary education enrollment (%)', 'Infant mortality',
                      'Life expectancy', 'Maternal mortality ratio', 'Minimum wage',
                      'Out of pocket health expenditure', 'Physicians per thousand',
                      'Population', 'Population: Labor force participation (%)',
                      'Tax revenue (%)', 'Total tax rate', 'Unemployment rate', 'Urban_population']

# Strip % signs, thousands separators, and $ signs, then cast to float
# (note: applymap is deprecated in favour of DataFrame.map from pandas 2.1)
df[columns_to_convert] = df[columns_to_convert].applymap(
    lambda x: float(str(x).replace('%', '').replace(',', '').replace('$', ''))
)
```
Missing data imputation
Possibilities:
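The imputation code itself is not reproduced here. As one hedged sketch (median imputation for numeric columns and mode imputation for categorical ones, which is not necessarily the strategy the lecture used):

```python
# A simple imputation sketch: median for numeric columns, mode for object columns
numeric_cols = df.select_dtypes(include='number').columns
object_cols = df.select_dtypes(include='object').columns

df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df[object_cols] = df[object_cols].fillna(df[object_cols].mode().iloc[0])
```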
We check the number of missing values in our columns.
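Presumably with the standard pandas one-liner:

```python
# Count the remaining missing values per column
print(df.isnull().sum())
```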
```
Country                                      0
Density\n(P/Km2)                             0
Abbreviation                                 0
Agricultural Land( %)                        0
Land Area(Km2)                               0
Armed Forces size                            0
Birth Rate                                   0
Calling Code                                 0
Capital/Major City                           0
Co2-Emissions                                0
CPI                                          0
CPI Change (%)                               0
Currency-Code                                0
Fertility Rate                               0
Forested Area (%)                            0
Gasoline Price                               0
GDP                                          0
Gross primary education enrollment (%)       0
Gross tertiary education enrollment (%)      0
Infant mortality                             0
Largest city                                 0
Life expectancy                              0
Maternal mortality ratio                     0
Minimum wage                                 0
Official language                            0
Out of pocket health expenditure             0
Physicians per thousand                      0
Population                                   0
Population: Labor force participation (%)    0
Tax revenue (%)                              0
Total tax rate                               0
Unemployment rate                            0
Urban_population                             0
Latitude                                     0
Longitude                                    0
dtype: int64
```
No more missing values!
Summary metrics of numeric variables
```python
# Calculate summary statistics for numerical columns
numerical_columns = df.select_dtypes([float, int]).columns
describe_numerical = df[numerical_columns].describe()

print('Summary statistics are described below:')
describe_numerical.T.style.bar(
    subset=['mean'],
    color='purple',
).background_gradient(subset=['std'], cmap=plt.cm.coolwarm).background_gradient(subset='50%', cmap='viridis').background_gradient(subset='max', cmap='viridis')
```
We create a column “GDP per capita” (GDP divided by Population)
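The corresponding code (assuming the cleaned numeric `GDP` and `Population` columns from above) would be something like:

```python
# New feature: GDP per capita
df['GDP per capita'] = df['GDP'] / df['Population']
```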
Here are the first rows showing the new column.
| index | Country | GDP per capita |
|---|---|---|
| 0 | Afghanistan | 502.11548691997746 |
| 1 | Albania | 5352.857411084262 |
| 2 | Algeria | 3948.3432789227913 |
| 3 | Andorra | 40886.39116175365 |
| 4 | Angola | 2973.591159799147 |
And get some facts about GDP per capita:
```
Country with highest GDP per capita is Vatican City

Country with lowest GDP per capita is Burundi
```
```python
# Mean GDP per capita by country (one row per country here, so this is just a re-index)
gdppercapita_country = df.groupby('Country')['GDP per capita'].mean().reset_index()

# Using .loc for label indexing, find the countries with max and min GDP per capita
max_gdppercapita_country = gdppercapita_country.loc[gdppercapita_country['GDP per capita'].idxmax(), 'Country']
min_gdppercapita_country = gdppercapita_country.loc[gdppercapita_country['GDP per capita'].idxmin(), 'Country']

print('Country with highest GDP per capita is {}\n'.format(max_gdppercapita_country))
print('Country with lowest GDP per capita is {}'.format(min_gdppercapita_country))
```
```python
# Top 5 countries by GDP per capita
top_5_gdppercapita_countries = gdppercapita_country.sort_values(
    by='GDP per capita', ascending=False
).head()

fig1 = px.bar(
    data_frame=top_5_gdppercapita_countries,
    x='Country',
    y='GDP per capita',
    color='Country',
    template='plotly_dark',
)
# Update the title
fig1.update_layout(title='Top 5 GDP per capita Countries', title_x=0.5, title_font=dict(size=29))
# Show our plot
fig1.show()
```
What does the data tell us so far?
What kind of economic output does GDP per capita really measure? What is its relationship with population size? With tax rates? With education? With governance (not in this dataset)?
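As a sketch of how one might start probing the population-size question (the log scales are an editorial choice, not from the lecture):

```python
# Scatter GDP per capita against population size; hovering shows the country name
fig = px.scatter(
    df,
    x='Population',
    y='GDP per capita',
    hover_name='Country',
    log_x=True,
    log_y=True,
    template='plotly_dark',
)
fig.show()
```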
```python
# Bottom 5 countries by GDP per capita
bottom_5_gdppercapita_countries = gdppercapita_country.sort_values(
    by='GDP per capita', ascending=True
).head()

fig1 = px.bar(
    data_frame=bottom_5_gdppercapita_countries,
    x='Country',
    y='GDP per capita',
    color='Country',
    template='plotly_dark',
)
# Update the title
fig1.update_layout(title='Bottom 5 GDP per capita Countries', title_x=0.5, title_font=dict(size=29))
# Show our plot
fig1.show()
```
Our exploration of GDP per capita has only just started:
LSE DS101 2024/25 Autumn Term