DS101 – Fundamentals of Data Science
28 Oct 2024
Before we proceed to the main topic of this lecture (i.e., exploratory data analysis), we take a slight step back to cover some topics linked to statistical inference that we didn't get to last week.
\[
\begin{aligned}
H_0&:~ \beta_1 = 0 \\
&\quad\text{vs} \\
H_A&:~ \beta_1 \neq 0,
\end{aligned}
\]
since if \(\beta_1=0\) then the model reduces to \(Y = \beta_0 + \epsilon\), and \(X\) and \(Y\) are not associated.
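As a quick illustration (a sketch on simulated data, not part of the lecture's materials), this is exactly the test that the `statsmodels` OLS summary reports for the slope coefficient:

```python
# Sketch: t-test of H0: beta_1 = 0 on simulated data (illustrative only)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=100)
Y = 2.0 + 0.5 * X + rng.normal(size=100)  # true beta_1 = 0.5

model = sm.OLS(Y, sm.add_constant(X)).fit()
# The x1 row of the summary reports the t-statistic and p-value for H0: beta_1 = 0
print(model.summary())
```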
- You have a group of people.
- You split them into two groups at random.

[Figure: random assignment of people into two groups. Icons: people by Anastasia Latysheva, candy by Samy Menai, arrow by TAQRIB ICON, and pill by Viktor Fediuk; all from the Noun Project.]

[Image source: Devopedia]
📗 Book recommendation:
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. London: Allen Lane.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
EDA is a fundamentally creative process; it is not a formal procedure with a strict set of rules.
The key to the process is to generate a large quantity of questions, so as to be able to ask quality questions and produce interesting insights about the data.
There are no set rules about which questions you should ask to guide your research.
However, to gather insights on your data, two types of questions will always be useful. These questions can very loosely be worded as:

1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?
During the EDA process, as you explore the data, you need to keep in mind some potential sources of interpretation and analysis errors:
Let's go back to last week's diamonds dataset for an example.
This is what the first lines of the dataset look like:
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
And here are some basic properties of the dataset:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
```
What if I want to have a look at the distribution of the `carat` variable?
Why would I look at this distribution plot? Here are some questions I could ask after seeing this plot.
[Distribution plot of the `carat` variable.]

Here, we see that most of the carat values are below 3: we should probably zoom in on the range of carats below 3 to discover more interesting patterns.
```python
from plotnine import ggplot, aes, geom_histogram, geom_bar, labs, scale_fill_discrete, scale_fill_manual, element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds

smaller_diamonds = diamonds.query("carat < 3").copy()

(
    ggplot(smaller_diamonds, aes(x="carat"))
    + geom_histogram(binwidth=0.01, fill="rebeccapurple")
)
```
This histogram suggests some interesting questions:
- With the `y` variable, the only indication of possible outliers is an unusually wide x-axis.
- Zooming in on the `y` variable, we get to see the outlier values in the distribution (0, ~30, and ~60):

| | x | y | z | price |
|---|---|---|---|---|
| 11963 | 0.00 | 0.0 | 0.00 | 5139 |
| 15951 | 0.00 | 0.0 | 0.00 | 6381 |
| 24067 | 8.09 | 58.9 | 8.06 | 12210 |
| 24520 | 0.00 | 0.0 | 0.00 | 12800 |
| 26243 | 0.00 | 0.0 | 0.00 | 15686 |
| 27429 | 0.00 | 0.0 | 0.00 | 18034 |
| 49189 | 5.15 | 31.8 | 5.12 | 2075 |
| 49556 | 0.00 | 0.0 | 0.00 | 2130 |
| 49557 | 0.00 | 0.0 | 0.00 | 2130 |
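As a sketch of how these outliers can be surfaced (reusing the `diamonds` data from `plotnine.data`; the exact thresholds below are illustrative choices, not from the lecture), we can zoom in on the count axis with `coord_cartesian()` and then filter out the unusual rows with pandas:

```python
from plotnine import ggplot, aes, geom_histogram, coord_cartesian
from plotnine.data import diamonds

# Zoom in on small counts so the rare (outlier) bins become visible
(
    ggplot(diamonds, aes(x="y"))
    + geom_histogram(binwidth=0.5)
    + coord_cartesian(ylim=(0, 50))
)

# Extract the unusual rows: y = 0 mm is impossible, y > 20 mm is implausibly large
unusual = diamonds.query("y < 3 or y > 20")[["x", "y", "z", "price"]]
print(unusual)
```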
The `y` variable, as well as the `x` and `z` variables, measures dimensions of the diamond (in mm): `x` is the length, `y` the width, and `z` the depth.
What could we do with these outliers?
For the handling of outliers, check this document from the World Bank.
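One common option, sketched below under the assumption that we want to keep the affected rows, is to replace the implausible measurements with missing values rather than dropping whole observations:

```python
import numpy as np

# Replace implausible y values with NaN instead of dropping the rows,
# so the other (valid) measurements of those diamonds are preserved
diamonds2 = diamonds.assign(
    y=lambda d: d["y"].where((d["y"] >= 3) & (d["y"] <= 20), np.nan)
)
```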
Let's look at unusual values in another dataset: the `nycflights13` dataset (available from the `nycflights13` package in Python). It records flights that departed NYC in 2013.
| | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | … |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | … |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | … |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | … |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | … |
The variable `dep_time` records flight departure time: flights that were cancelled have a missing value for this variable.
\(\Rightarrow\) Outliers might occur not due to errors in data collection/entry, but for reasons intrinsic to the problem studied.
One thing you might want to do is compare the scheduled departure times for cancelled and non-cancelled flights.
```python
from plotnine import ggplot, aes, geom_freqpoly
import pandas as pd
from nycflights13 import flights

# Flag cancelled flights, and convert sched_dep_time (e.g. 1730) to fractional hours (17.5)
flights2 = flights.assign(
    cancelled=lambda x: pd.isna(x["dep_time"]),
    sched_hour=lambda x: x["sched_dep_time"] // 100,
    sched_min=lambda x: x["sched_dep_time"] % 100,
    sched_dep_time=lambda x: x["sched_hour"] + x["sched_min"] / 60,
)

(
    ggplot(flights2, aes(x="sched_dep_time"))
    + geom_freqpoly(aes(color="cancelled"), binwidth=1 / 4)
)
```
Not exactly the best plot, given that there are (thankfully!) many more flights that were not cancelled than flights that were.
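One way to make the comparison fairer (a sketch, not from the lecture) is to compare densities rather than raw counts, so that each group integrates to one:

```python
from plotnine import ggplot, aes, geom_freqpoly, after_stat

# Plot normalised densities so the two groups are comparable despite very different sizes
(
    ggplot(flights2, aes(x="sched_dep_time", color="cancelled"))
    + geom_freqpoly(aes(y=after_stat("density")), binwidth=1 / 4)
)
```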
Let’s now study a more complete step-by-step example of EDA: we’ll have a look at the Global Country Information Dataset 2023 available on Kaggle.
Its features are as follows:
| Name | Definition |
|---|---|
| Country | Name of the country |
| Density (P/Km2) | Population density measured in persons per square kilometer |
| Abbreviation | Abbreviation or code representing the country |
| Agricultural Land (%) | Percentage of land area used for agricultural purposes |
| Land Area (Km2) | Total land area of the country in square kilometers |
| Armed Forces Size | Size of the armed forces in the country |
| Birth Rate | Number of births per 1,000 population per year |
| Calling Code | International calling code for the country |
| Capital/Major City | Name of the capital or major city |
| CO2 Emissions | Carbon dioxide emissions in tons |
| CPI | Consumer Price Index, a measure of inflation and purchasing power |
| CPI Change (%) | Percentage change in the Consumer Price Index compared to the previous year |
| Currency_Code | Currency code used in the country |
| Fertility Rate | Average number of children born to a woman during her lifetime |
| Forested Area (%) | Percentage of land area covered by forests |
| Gasoline_Price | Price of gasoline per liter in local currency |
| GDP | Gross Domestic Product, the total value of goods and services produced in the country |
| Gross Primary Education Enrollment (%) | Gross enrollment ratio for primary education |
| Gross Tertiary Education Enrollment (%) | Gross enrollment ratio for tertiary education |
| Infant Mortality | Number of deaths per 1,000 live births before reaching one year of age |
| Largest City | Name of the country's largest city |
Rest of the features
| Name | Definition |
|---|---|
| Life Expectancy | Average number of years a newborn is expected to live |
| Maternal Mortality Ratio | Number of maternal deaths per 100,000 live births |
| Minimum Wage | Minimum wage level in local currency |
| Official Language | Official language(s) spoken in the country |
| Out of Pocket Health Expenditure (%) | Percentage of total health expenditure paid out-of-pocket by individuals |
| Physicians per Thousand | Number of physicians per thousand people |
| Population | Total population of the country |
| Population: Labor Force Participation (%) | Percentage of the population that participates in the labor force |
| Tax Revenue (%) | Tax revenue as a percentage of GDP |
| Total Tax Rate | Overall tax burden as a percentage of commercial profits |
| Unemployment Rate | Percentage of the labor force that is unemployed |
| Urban Population | Percentage of the population living in urban areas |
| Latitude | Latitude coordinate of the country's location |
| Longitude | Longitude coordinate of the country's location |
The data has 195 rows and 35 columns. Here is a sample of the data:
| | Country | Density(P/Km2) | Abbreviation | Agricultural Land( %) | Land Area(Km2) | Armed Forces size | Birth Rate | Calling Code | Capital/Major City | Co2-Emissions | … | Out of pocket health expenditure | Physicians per thousand | Population | Population: Labor force participation (%) | Tax revenue (%) | Total tax rate | Unemployment rate | Urban_population | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 112 | Moldova | 123 | MD | 74.20% | 33,851 | 7,000 | 10.10 | 373.0 | Chișinău | 5,115 | … | 46.20% | 3.21 | 2,657,637 | 43.10% | 17.70% | 38.70% | 5.47% | 1,135,502 | 47.411631 | 28.369885 |
| 153 | Serbia | 100 | RS | 39.30% | 77,474 | 32,000 | 9.20 | 381.0 | Belgrade | 45,221 | … | 40.60% | 3.11 | 6,944,975 | 54.90% | 18.60% | 36.60% | 12.69% | 3,907,243 | 44.016521 | 21.005859 |
| 35 | Chile | 26 | CL | 21.20% | 756,096 | 122,000 | 12.43 | 56.0 | Santiago | 85,822 | … | 32.20% | 2.59 | 18,952,038 | 62.60% | 18.20% | 34.00% | 7.09% | 16,610,135 | -35.675147 | -71.542969 |
| 19 | Bhutan | 20 | BT | 13.60% | 38,394 | 6,000 | 17.26 | 975.0 | Thimphu | 1,261 | … | 19.80% | 0.42 | 727,145 | 66.70% | 16.00% | 35.30% | 2.34% | 317,538 | 27.514162 | 90.433601 |
| 48 | Dominica | 96 | DM | 33.30% | 751 | NaN | 12.00 | 1.0 | Roseau | 180 | … | 28.40% | 1.08 | 71,808 | NaN | 22.10% | 32.60% | NaN | 50,830 | 15.414999 | -61.370976 |
5 rows × 35 columns
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 35 columns):
 #   Column                                     Non-Null Count  Dtype
---  ------                                     --------------  -----
 0   Country                                    195 non-null    object
 1   Density
(P/Km2)                                    195 non-null    object
 2   Abbreviation                               188 non-null    object
 3   Agricultural Land( %)                      188 non-null    object
 4   Land Area(Km2)                             194 non-null    object
 5   Armed Forces size                          171 non-null    object
 6   Birth Rate                                 189 non-null    float64
 7   Calling Code                               194 non-null    float64
 8   Capital/Major City                         192 non-null    object
 9   Co2-Emissions                              188 non-null    object
 10  CPI                                        178 non-null    object
 11  CPI Change (%)                             179 non-null    object
 12  Currency-Code                              180 non-null    object
 13  Fertility Rate                             188 non-null    float64
 14  Forested Area (%)                          188 non-null    object
 15  Gasoline Price                             175 non-null    object
 16  GDP                                        193 non-null    object
 17  Gross primary education enrollment (%)     188 non-null    object
 18  Gross tertiary education enrollment (%)    183 non-null    object
 19  Infant mortality                           189 non-null    float64
 20  Largest city                               189 non-null    object
 21  Life expectancy                            187 non-null    float64
 22  Maternal mortality ratio                   181 non-null    float64
 23  Minimum wage                               150 non-null    object
 24  Official language                          190 non-null    object
 25  Out of pocket health expenditure           188 non-null    object
 26  Physicians per thousand                    188 non-null    float64
 27  Population                                 194 non-null    object
 28  Population: Labor force participation (%)  176 non-null    object
 29  Tax revenue (%)                            169 non-null    object
 30  Total tax rate                             183 non-null    object
 31  Unemployment rate                          176 non-null    object
 32  Urban_population                           190 non-null    object
 33  Latitude                                   194 non-null    float64
 34  Longitude                                  194 non-null    float64
dtypes: float64(9), object(26)
memory usage: 53.4+ KB
```
The only dtypes present are `float64` and `object` (most likely strings).
```
We have a total of 9 float columns:
    1) Birth Rate
    2) Calling Code
    3) Fertility Rate
    4) Infant mortality
    5) Life expectancy
    6) Maternal mortality ratio
    7) Physicians per thousand
    8) Latitude
    9) Longitude
```
```python
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings

# The second style overrides most of the first; both are applied in the original notebook
plt.style.use('fivethirtyeight')
plt.style.use('dark_background')

warnings.filterwarnings('ignore')

# Numerical (float) columns
numerical_column = df.select_dtypes(include=['float']).columns
print(f'We have a total of {len(numerical_column)} float columns:')
for count, column_name in enumerate(numerical_column, 1):
    print(f'\t{count}) {column_name}')
```
```python
# Styled summary statistics: bars for the mean, colour gradients for std and median
df.describe().T.style.bar(
    subset=['mean'],
    color='purple',  # bar colour
).background_gradient(subset=['std'], cmap=plt.cm.coolwarm).background_gradient(subset='50%', cmap='viridis')
```
```
Your selected dataframe has 35 columns.
There are 33 columns that have missing values.
```
```python
# credit: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
# One of the best notebooks on getting started with a ML problem.
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

df_missing = missing_values_table(df)
df_missing
```
| index | Missing Values | % of Total Values |
|---|---|---|
| Minimum wage | 45 | 23.1 |
| Tax revenue (%) | 26 | 13.3 |
| Armed Forces size | 24 | 12.3 |
| Gasoline Price | 20 | 10.3 |
| Unemployment rate | 19 | 9.7 |
| Population: Labor force participation (%) | 19 | 9.7 |
| CPI | 17 | 8.7 |
| CPI Change (%) | 16 | 8.2 |
| Currency-Code | 15 | 7.7 |
| Maternal mortality ratio | 14 | 7.2 |
| Gross tertiary education enrollment (%) | 12 | 6.2 |
| Total tax rate | 12 | 6.2 |
| Life expectancy | 8 | 4.1 |
| Agricultural Land( %) | 7 | 3.6 |
| Physicians per thousand | 7 | 3.6 |
| Out of pocket health expenditure | 7 | 3.6 |
| Abbreviation | 7 | 3.6 |
| Gross primary education enrollment (%) | 7 | 3.6 |
| Forested Area (%) | 7 | 3.6 |
| Fertility Rate | 7 | 3.6 |
| Co2-Emissions | 7 | 3.6 |
| Infant mortality | 6 | 3.1 |
| Largest city | 6 | 3.1 |
| Birth Rate | 6 | 3.1 |
| Official language | 5 | 2.6 |
| Urban_population | 5 | 2.6 |
| Capital/Major City | 3 | 1.5 |
| GDP | 2 | 1.0 |
| Population | 1 | 0.5 |
| Calling Code | 1 | 0.5 |
| Land Area(Km2) | 1 | 0.5 |
| Latitude | 1 | 0.5 |
| Longitude | 1 | 0.5 |
Reasons for Missing Values
Before we start handling the missing values, it is important to understand the various reasons behind the missingness in the data. Broadly speaking, there can be three possible reasons (also called (data) missingness mechanisms):
**Missing Completely At Random (MCAR)**: the missing values on a given random variable (Y) are not associated/correlated with other variables in the data set or with the variable (Y) itself. In other words, there is no particular reason for the missing values.

**Missing At Random (MAR)**: the MAR mechanism occurs when the probability of missing values for a given random variable Y is related to some other measured variable (or variables) in the analysis model, but not to the values of Y itself.
**Missing Not At Random (MNAR)**: missingness depends on variables unobserved in the available data, or on the value of the random variable Y (which is missing values) itself.
For more details, see (Enders 2022) or (Scheffer 2002).
Missingness patterns in the data
Before we proceed to impute missing values, we clean up the data (converting columns from string type to float):
```python
columns_to_convert = ['Density\n(P/Km2)', 'Agricultural Land( %)', 'Land Area(Km2)',
                      'Birth Rate', 'Co2-Emissions', 'Forested Area (%)',
                      'CPI', 'CPI Change (%)', 'Fertility Rate', 'Gasoline Price', 'GDP',
                      'Gross primary education enrollment (%)', 'Armed Forces size',
                      'Gross tertiary education enrollment (%)', 'Infant mortality',
                      'Life expectancy', 'Maternal mortality ratio', 'Minimum wage',
                      'Out of pocket health expenditure', 'Physicians per thousand',
                      'Population', 'Population: Labor force participation (%)',
                      'Tax revenue (%)', 'Total tax rate', 'Unemployment rate', 'Urban_population']

# Strip % signs, thousands separators, and $ signs, then cast to float
# (note: applymap is deprecated in favour of DataFrame.map from pandas 2.1)
df[columns_to_convert] = df[columns_to_convert].applymap(
    lambda x: float(str(x).replace('%', '').replace(',', '').replace('$', ''))
)
```
Missing data imputation
Possibilities:
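The imputation code itself is not reproduced here. As one hedged sketch (median imputation for numeric columns and mode imputation for categorical ones, which is not necessarily the strategy the lecture used):

```python
# A simple imputation sketch: median for numeric columns, mode for object columns
numeric_cols = df.select_dtypes(include='number').columns
object_cols = df.select_dtypes(include='object').columns

df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df[object_cols] = df[object_cols].fillna(df[object_cols].mode().iloc[0])
```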
We check the number of missing values in our columns.
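Presumably with the standard pandas one-liner:

```python
# Count the remaining missing values per column
print(df.isnull().sum())
```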
```
Country                                      0
Density\n(P/Km2)                             0
Abbreviation                                 0
Agricultural Land( %)                        0
Land Area(Km2)                               0
Armed Forces size                            0
Birth Rate                                   0
Calling Code                                 0
Capital/Major City                           0
Co2-Emissions                                0
CPI                                          0
CPI Change (%)                               0
Currency-Code                                0
Fertility Rate                               0
Forested Area (%)                            0
Gasoline Price                               0
GDP                                          0
Gross primary education enrollment (%)       0
Gross tertiary education enrollment (%)      0
Infant mortality                             0
Largest city                                 0
Life expectancy                              0
Maternal mortality ratio                     0
Minimum wage                                 0
Official language                            0
Out of pocket health expenditure             0
Physicians per thousand                      0
Population                                   0
Population: Labor force participation (%)    0
Tax revenue (%)                              0
Total tax rate                               0
Unemployment rate                            0
Urban_population                             0
Latitude                                     0
Longitude                                    0
dtype: int64
```
No more missing values!
Summary metrics of numeric variables
```python
# Calculate summary statistics for numerical columns
numerical_columns = df.select_dtypes([float, int]).columns
describe_numerical = df[numerical_columns].describe()

print('Summary statistics are described below:')
describe_numerical.T.style.bar(
    subset=['mean'],
    color='purple',
).background_gradient(subset=['std'], cmap=plt.cm.coolwarm).background_gradient(subset='50%', cmap='viridis').background_gradient(subset='max', cmap='viridis')
```
We create a column “GDP per capita” (GDP divided by Population)
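The corresponding code (assuming the cleaned numeric `GDP` and `Population` columns from above) would be something like:

```python
# New feature: GDP per capita
df['GDP per capita'] = df['GDP'] / df['Population']
```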
Here are the first rows showing the new column.
| index | Country | GDP per capita |
|---|---|---|
| 0 | Afghanistan | 502.11548691997746 |
| 1 | Albania | 5352.857411084262 |
| 2 | Algeria | 3948.3432789227913 |
| 3 | Andorra | 40886.39116175365 |
| 4 | Angola | 2973.591159799147 |
And get some facts about GDP per capita:
```
Country with highest GDP per capita is Vatican City

Country with lowest GDP per capita is Burundi
```
```python
# Mean GDP per capita by country (one row per country here, so this is just a re-index)
gdppercapita_country = df.groupby('Country')['GDP per capita'].mean().reset_index()

# Using .loc for label indexing, find the countries with max and min GDP per capita
max_gdppercapita_country = gdppercapita_country.loc[gdppercapita_country['GDP per capita'].idxmax(), 'Country']
min_gdppercapita_country = gdppercapita_country.loc[gdppercapita_country['GDP per capita'].idxmin(), 'Country']

print('Country with highest GDP per capita is {}\n'.format(max_gdppercapita_country))
print('Country with lowest GDP per capita is {}'.format(min_gdppercapita_country))
```
```python
# Top 5 countries by GDP per capita
top_5_gdppercapita_countries = gdppercapita_country.sort_values(
    by='GDP per capita', ascending=False
).head()

fig1 = px.bar(
    data_frame=top_5_gdppercapita_countries,
    x='Country',
    y='GDP per capita',
    color='Country',
    template='plotly_dark',
)
# Update the title
fig1.update_layout(title='Top 5 GDP per capita Countries', title_x=0.5, title_font=dict(size=29))
# Show our plot
fig1.show()
```
What does the data tell us so far?
What kind of economic output does GDP per capita really measure? What is its relationship with population size? With tax rates? With education? With governance (not in this dataset)?
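As a sketch of how one might start probing the population-size question (the log scales are an editorial choice, not from the lecture):

```python
# Scatter GDP per capita against population size; hovering shows the country name
fig = px.scatter(
    df,
    x='Population',
    y='GDP per capita',
    hover_name='Country',
    log_x=True,
    log_y=True,
    template='plotly_dark',
)
fig.show()
```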
```python
# Bottom 5 countries by GDP per capita
bottom_5_gdppercapita_countries = gdppercapita_country.sort_values(
    by='GDP per capita', ascending=True
).head()

fig1 = px.bar(
    data_frame=bottom_5_gdppercapita_countries,
    x='Country',
    y='GDP per capita',
    color='Country',
    template='plotly_dark',
)
# Update the title
fig1.update_layout(title='Bottom 5 GDP per capita Countries', title_x=0.5, title_font=dict(size=29))
# Show our plot
fig1.show()
```
Our exploration of GDP per capita has only just started:
LSE DS101 2024/25 Autumn Term