DS101 – Fundamentals of Data Science
21 Oct 2024
1️⃣ a random sample of 10% of people, then look at their emails on a given day
2️⃣ a random sample of 10% of all emails sent on a given day
✍️ A quick exercise:
References: (Guyan 2022) & (D’Ignazio and Klein 2020)
“This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.”
“More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic processes.”
Source: (Schutt and O’Neil 2013, chap. 2)
#| echo: true
from plotnine import ggplot, geom_histogram, geom_bar, aes, scale_fill_discrete, labs, element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
(ggplot(diamonds) +
geom_bar(aes(x=diamonds.cut, fill=diamonds.cut)) +
scale_fill_discrete() +
theme_bw() +
labs(title="Diamond cut", x="Cut", y="Count")+
theme(figure_size=(16, 12),text=element_text(size=24))
)
from plotnine import ggplot, geom_histogram, geom_bar, aes, scale_fill_continuous, labs, element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
from plotnine.coords import coord_cartesian
import pandas as pd
(ggplot(diamonds) +
geom_bar(aes(x=diamonds.carat, fill=diamonds.carat)) +
scale_fill_continuous() +
theme_bw() +
labs(title="Diamond carat", x="Carat", y="Count")+
coord_cartesian(xlim=(0.2, 5.01))+
theme(figure_size=(20, 12),text=element_text(size=24))
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
(ggplot(data=diamonds) +
geom_point(aes(x=diamonds.carat, y=diamonds.price), color="#800020", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between diamond carat and diamond price", x="Diamond carat", y="Diamond price (in dollars)")
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import midwest
(ggplot(data=midwest) +
geom_point(aes(x=midwest.percollege, y=midwest.percpovertyknown), color="#5C5AD3", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between percentage college educated\n population and percentage of known poverty", x="Percentage of college educated population", y="Percentage of known poverty")
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data=faithful) +
geom_point(aes(x=faithful.eruptions, y=faithful.waiting), color="#007BA7", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between eruptions duration and waiting time between\n eruptions for Old Faithful geyser in Yellowstone National\n Park, Wyoming, USA.", x="Eruptions (in mins)", y="Wating time (in mins)")
)
Numerical data
Pearson correlation coefficient:
\[ \rho = \frac{\sum_{i=1}^{n}{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_{i=1}^{n}{(x_i - \bar{x})^2\sum_{i=1}^{n}{(y_i - \bar{y})^2}}}} \]
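As an aside (not from the original slides), the formula above can be checked directly with NumPy on a hypothetical pair of vectors `x` and `y`; a minimal sketch:

import numpy as np

# Hypothetical toy data: any two numeric vectors of equal length would do
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation coefficient, following the formula above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
rho = num / den

# Should match NumPy's built-in implementation
print(rho, np.corrcoef(x, y)[0, 1])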
Discrete data
Contingency table (cross-tabulation):
| Variable 1 / Variable 2 | Category A | Category B | Category C |
|---|---|---|---|
| Category A | 10 | 20 | 30 |
| Category B | 40 | 50 | 60 |
| Category C | 70 | 80 | 90 |
| Category D | 0 | 10 | 40 |
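In Python, a contingency table like the one above can be built with pandas' cross-tabulation; a minimal sketch using hypothetical columns `var1` and `var2` of a dataframe `df`:

import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    "var1": ["A", "A", "B", "B", "C", "D"],
    "var2": ["A", "B", "B", "C", "A", "C"],
})

# Cross-tabulation: counts of each combination of categories
table = pd.crosstab(df["var1"], df["var2"])
print(table)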
🧮 Let’s calculate like in the old days!

Using the `faithful` dataset:

Hints

You can load the `faithful` dataset with these lines of code:

pip install pydataset # only run this line of code if the next line gives a `ModuleNotFoundError`
from pydataset import data
df = data('faithful') # dataframe with `faithful` data

To access a particular column, e.g. `waiting`, in a dataframe `df`, you need this line:

df['waiting']

Adapt these lines to suit your needs!

You can also download the `faithful` dataset in `.csv` format by clicking the button below.

You probably already heard this before, but it bears remembering the mantra:
“correlation does not imply causation”
Photo by cottonbro studio | Pexels
Check more on the Spurious Correlations website
Let’s talk about three possible interpretations of probability:
Classical
Frequentist
Bayesian
Events of the same kind can be reduced to a certain number of equally possible cases.
Example: coin tosses lead to either heads or tails \(1/2\) of the time ( \(50\%/50\%\))
What would be the outcome if I repeat the process many times?
Example: if I toss a coin \(1,000,000\) times, I expect \(\approx 50\%\) heads and \(\approx 50\%\) tails outcome.
What is your judgement of the likelihood of the outcome, based on previous information?
Example: if I know that this coin has symmetric weight, I expect a \(50\%/50\%\) outcome.
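The frequentist example above is easy to illustrate with a quick simulation (a sketch, not part of the original slides): toss a fair coin many times and look at the proportion of heads.

import numpy as np

rng = np.random.default_rng(42)

# Simulate 1,000,000 fair coin tosses (1 = heads, 0 = tails)
tosses = rng.integers(0, 2, size=1_000_000)

# The proportion of heads should be close to 0.5
print(tosses.mean())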
For our purposes:
Note
💡 Although there is no such thing as “a probability of \(120\%\)” or “a probability of \(-23\%\)”, you could still use this language to refer to an increase or decrease in the probability of an outcome.
After the break:
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 1.0, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Density", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 1.0") +
geom_hline(yintercept = 110.0, color = "red", linetype = "dashed", size = 1.0)
)
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.5, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Density", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.5") +
geom_hline(yintercept = 80.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
)
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.1, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Density", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.1") +
geom_hline(yintercept = 20.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 80.0, color = "#ffb2b2", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
)
Let’s zoom in ➡️
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.1, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[1.5, 5.25]) +
scale_y_continuous(name="Density", breaks = range(0, 20, 4), limits=[0, 20]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.1",subtitle="The two patterns in the data become more apparent when\n we bin it this way.") +
geom_hline(yintercept = 3.0, color = "red", linetype = "dashed", size = 1.0)
)
Note
Probability density function:
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
For more on this topic, see Chapter 5 (sections 5.2 and 5.3) of (Shafer and Zhang 2012)
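As an illustration (not from the original slides), the density above can be evaluated with scipy; a minimal sketch with illustrative parameter values:

import numpy as np
from scipy.stats import norm

# Normal distribution with mean mu = 0 and standard deviation sigma = 1 (illustrative values)
mu, sigma = 0, 1
x = np.linspace(-4, 4, 9)

# scipy's pdf matches the formula f(x) above
print(norm.pdf(x, loc=mu, scale=sigma))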
Probability mass function:
\[ f(x) = \frac{\lambda^xe^{-\lambda}}{x!} \]
For more on Poisson distribution, see this page.
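As with the normal case, scipy also provides the Poisson probability mass function; a small sketch (illustrative rate \(\lambda = 3\)):

import numpy as np
from scipy.stats import poisson

# Poisson distribution with illustrative rate lambda = 3
lam = 3
x = np.arange(0, 10)

# scipy's pmf matches the formula f(x) above
print(poisson.pmf(x, mu=lam))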
For more on probability distributions, see Chapter 4 of (Illowsky and Dean 2013).
Let \(x_i, i=1,2,\dots,N\), be independent random variables, each of which is described by a probability density function (PDF) \(f_i(x)\) (these may all be different) with a mean \(\mu_i\) and variance \(\sigma_i^2\). The random variable \(z=\sum_i \frac{x_i}{N}\), i.e. the average of the \(x_i\), has the following properties:
its expectation value is \(E[z] = \frac{1}{N}\sum_i \mu_i\)
its variance is \(V[z] = \frac{1}{N^2}\sum_i \sigma_i^2\)
as \(N\) becomes large, its distribution tends to a Gaussian (normal) distribution, regardless of the form of the individual \(f_i(x)\)
For examples of how to apply the central limit theorem, see Chapter 6, section 6.2 of (Shafer and Zhang 2012) or this page or this page
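A quick simulation (a sketch, not part of the original slides) makes the theorem concrete: averages of many skewed (exponential) variables end up approximately normal, with the expected mean and variance.

import numpy as np

rng = np.random.default_rng(0)

# 10,000 samples, each the average of N = 100 exponential random variables (mean 1, variance 1)
N = 100
averages = rng.exponential(scale=1.0, size=(10_000, N)).mean(axis=1)

# The averages should have mean ~1 and variance ~1/N, and look roughly Gaussian in a histogram
print(averages.mean(), averages.var(), N * averages.var())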
Let’s say we have the following problem: we want to predict patient medical insurance costs and we have a dataset that contains the following variables:
How do we go about solving the problem?
👉🏻 We want to express medical insurance costs as a function of some or all of the variables contained in the dataset
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, \dots, X_p\).
Warning
The data can be found on Kaggle. You can download it below:
Our data has 1338 rows and 7 columns.
This is what the first lines of the data look like:
|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Your selected dataframe has 7 columns.
There are 0 columns that have missing values.
|   | Missing Values | % of Total Values |
|---|---|---|
| age | 0 | 0.0 |
| sex | 0 | 0.0 |
| bmi | 0 | 0.0 |
| children | 0 | 0.0 |
| smoker | 0 | 0.0 |
| region | 0 | 0.0 |
| charges | 0 | 0.0 |
import pandas as pd
import missingno as msno

def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()
    # Percentage of missing values per column
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Sort the table by percentage of missing values, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] >= 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.query("`Missing Values` != 0").shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns

missing_values_table(data)
|   | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |
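The table above is presumably the output of pandas' `describe` method, which summarises the numeric columns:

data.describe()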
`age`, `bmi` and `children`: variables whose distributions are close to a normal distribution. `charges`, though, has a mean that is much higher than its median: likely the effect of outliers.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

plt.rcParams['figure.figsize'] = [30, 5]
plt.rcParams['font.size'] = 18

# Split the columns into categorical and numerical features
cat_features = data.select_dtypes(include='object').columns
num_features = data.select_dtypes(include=['int', 'float']).columns

# One count plot per categorical feature, three plots per row
fig, ax = plt.subplots(nrows=int(np.ceil(len(cat_features) / 3)), ncols=3)
ax = ax.flatten()
for i, column in enumerate(cat_features):
    sns.countplot(data=data, x=column, ax=ax[i])
`bmi` is indeed normally distributed
`charges` is, as expected, most definitely not normally distributed: the data is right-skewed (i.e. the tail is more pronounced on the right). The distribution is most likely a Gamma distribution
`age` and `children` are also not normally distributed: `age` probably follows a uniform distribution, and `children` a right-skewed distribution
`charges` is highly correlated with `smoker`
from sklearn.preprocessing import LabelEncoder

# Encode the categorical variables as integers so they can enter the correlation matrix
encoder = LabelEncoder()
data['sex'] = encoder.fit_transform(data['sex'])
data['smoker'] = encoder.fit_transform(data['smoker'])
data['region'] = encoder.fit_transform(data['region'])

# Correlation heatmap of the (now fully numeric) dataframe
f, ax = plt.subplots(figsize=(30, 15))
corr = data.corr()
sns.heatmap(corr, annot=True, square=True, ax=ax)
fig, axis = plt.subplots(nrows = 1, ncols = 2)
ax = axis.flatten()
sns.histplot(data[data.smoker == 1]["charges"], kde=True, ax = ax[0])
ax[0].set_title("Distribution of charges for smokers")
sns.histplot(data[data.smoker == 0]["charges"], kde=True, ax = ax[1])
ax[1].set_title("Distribution of charges for non-smokers")
ax= sns.countplot(data = data, x= 'smoker', hue = 'sex')
ax.set_xticks([0, 1])
ax.set_xticklabels(["non-smoker", "smoker"])
ax.legend(["female","male"])
`charges` are, on average, higher for smokers than non-smokers.

Let’s model `charges` with a single predictor: `smoker`.
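The summary below comes from statsmodels; the fitting code is not shown on the slide, but a minimal sketch that would produce a comparable no-intercept fit (assuming the dataframe is still called `data`) is:

import statsmodels.api as sm

# Single predictor, no intercept term (hence the "uncentered" R-squared in the summary below)
X = data['smoker']
y = data['charges']

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())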
OLS Regression Results
=======================================================================================
Dep. Variable: charges R-squared (uncentered): 0.652
Model: OLS Adj. R-squared (uncentered): 0.652
Method: Least Squares F-statistic: 2505.
Date: Mon, 21 Oct 2024 Prob (F-statistic): 1.00e-308
Time: 14:15:14 Log-Likelihood: -14300.
No. Observations: 1338 AIC: 2.860e+04
Df Residuals: 1337 BIC: 2.861e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
smoker 3.205e+04 640.409 50.046 0.000 3.08e+04 3.33e+04
==============================================================================
Omnibus: 79.948 Durbin-Watson: 1.190
Prob(Omnibus): 0.000 Jarque-Bera (JB): 226.414
Skew: -0.279 Prob(JB): 6.84e-50
Kurtosis: 4.936 Cond. No. 1.00
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
First, I now have the equation that represents the (linear) relationship between `charges` and `smoker`:

\[ \operatorname{charges} = 3.205 \times 10^{4} \times \operatorname{smoker} \]
A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.
Let’s compute the confidence intervals for our model coefficients:
|   | 2.5 % | 97.5 % |
|---|---|---|
| smoker | 30793.915664 | 33306.547999 |
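With the statsmodels results object from the sketch above, these bounds would come from:

# 95% confidence intervals for the estimated coefficients
results.conf_int(alpha=0.05)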
\[ H_0: \beta_1 = 0 \quad \text{vs} \quad H_A: \beta_1 \neq 0, \]
since if \(\beta_1=0\) then the model reduces to \(Y = \beta_0 + \epsilon\), and \(X\) and \(Y\) are not associated.
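The test statistic and p-value reported in the regression summary above (t = 50.046, P>|t| = 0.000) can also be read off the same statsmodels results object:

# t-statistics and p-values for each coefficient
print(results.tvalues)
print(results.pvalues)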
Do you think this looks like a good model?
Indicative reading of the week:
Reference: (Aschwanden 2015)
LSE DS101 2024/25 Autumn Term