DS101 – Fundamentals of Data Science
21 Oct 2024
1️⃣ a random sample of 10% of people, then look at their emails on a given day
2️⃣ a random sample of 10% of all emails sent on a given day
✍️ A quick exercise:
References: (Guyan 2022) & (D’Ignazio and Klein 2020)
“This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.”
“More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic processes.”
Source: (Schutt and O’Neil 2013, chap. 2)
#| echo: true
from plotnine import ggplot, geom_histogram, geom_bar, aes, scale_fill_discrete, labs, element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
(ggplot(diamonds) +
geom_bar(aes(x=diamonds.cut, fill=diamonds.cut)) +
scale_fill_discrete() +
theme_bw() +
labs(title="Diamond cut", x="Cut", y="Count")+
theme(figure_size=(16, 12),text=element_text(size=24))
)
from plotnine import ggplot, geom_histogram, geom_bar, aes, scale_fill_continuous, labs, element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
from plotnine.coords import coord_cartesian
import pandas as pd
(ggplot(diamonds) +
geom_bar(aes(x=diamonds.carat, fill=diamonds.carat)) +
scale_fill_continuous() +
theme_bw() +
labs(title="Diamond carat", x="Carat", y="Count")+
coord_cartesian(xlim=(0.2, 5.01))+
theme(figure_size=(20, 12),text=element_text(size=24))
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
(ggplot(data=diamonds) +
geom_point(aes(x=diamonds.carat, y=diamonds.price), color="#800020", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between diamond carat and diamond price", x="Diamond carat", y="Diamond price (in dollars)")
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import midwest
(ggplot(data=midwest) +
geom_point(aes(x=midwest.percollege, y=midwest.percpovertyknown), color="#5C5AD3", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between percentage college educated\n population and percentage of known poverty", x="Percentage of college educated population", y="Percentage of known poverty")
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data=faithful) +
geom_point(aes(x=faithful.eruptions, y=faithful.waiting), color="#007BA7", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between eruptions duration and waiting time between\n eruptions for Old Faithful geyser in Yellowstone National\n Park, Wyoming, USA.", x="Eruptions (in mins)", y="Wating time (in mins)")
)
Numerical data
Pearson correlation coefficient:
\[ \rho = \frac{\sum_{i=1}^{n}{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_{i=1}^{n}{(x_i - \bar{x})^2\sum_{i=1}^{n}{(y_i - \bar{y})^2}}}} \]
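As an aside (not from the original slides), the formula above can be checked directly with NumPy on a hypothetical pair of vectors `x` and `y`; a minimal sketch:

import numpy as np

# Hypothetical toy data: any two numeric vectors of equal length would do
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation coefficient, following the formula above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
rho = num / den

# Should match NumPy's built-in implementation
print(rho, np.corrcoef(x, y)[0, 1])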
Discrete data
Contingency table (cross-tabulation):
| Variable 1 / Variable 2 | Category A | Category B | Category C |
|---|---|---|---|
| Category A | 10 | 20 | 30 |
| Category B | 40 | 50 | 60 |
| Category C | 70 | 80 | 90 |
| Category D | 0 | 10 | 40 |
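In Python, a contingency table like the one above can be built with pandas' cross-tabulation; a minimal sketch using hypothetical columns `var1` and `var2` of a dataframe `df`:

import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    "var1": ["A", "A", "B", "B", "C", "D"],
    "var2": ["A", "B", "B", "C", "A", "C"],
})

# Cross-tabulation: counts of each combination of categories
table = pd.crosstab(df["var1"], df["var2"])
print(table)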
🧮 Let’s calculate like in the old days!

Using the `faithful` dataset:

Hints

You can load the `faithful` dataset with these lines of code:

pip install pydataset # only run this line of code if the next line gives a `ModuleNotFoundError`
from pydataset import data
df = data('faithful') # dataframe with `faithful` data

To access a particular column, e.g. `waiting`, in a dataframe `df`, you need this line:

df['waiting']

Adapt these lines to suit your needs!

You can also download the `faithful` dataset in `.csv` format by clicking the button below.

You probably already heard this before, but it bears remembering the mantra:
“correlation does not imply causation”
Photo by cottonbro studio | Pexels
Check more on the Spurious Correlations website
Let’s talk about three possible interpretations of probability:
Classical
Frequentist
Bayesian
Events of the same kind can be reduced to a certain number of equally possible cases.
Example: coin tosses lead to either heads or tails \(1/2\) of the time ( \(50\%/50\%\))
What would be the outcome if I repeat the process many times?
Example: if I toss a coin \(1,000,000\) times, I expect \(\approx 50\%\) heads and \(\approx 50\%\) tails outcome.
What is your judgement of the likelihood of the outcome, based on previous information?
Example: if I know that this coin has symmetric weight, I expect a \(50\%/50\%\) outcome.
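The frequentist example above is easy to illustrate with a quick simulation (a sketch, not part of the original slides): toss a fair coin many times and look at the proportion of heads.

import numpy as np

rng = np.random.default_rng(42)

# Simulate 1,000,000 fair coin tosses (1 = heads, 0 = tails)
tosses = rng.integers(0, 2, size=1_000_000)

# The proportion of heads should be close to 0.5
print(tosses.mean())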
For our purposes:
Note
💡 Although there is no such thing as “a probability of \(120\%\)” or “a probability of \(-23\%\)”, you could still use this language to refer to an increase or decrease in the probability of an outcome.
After the break:
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 1.0, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Density", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 1.0") +
geom_hline(yintercept = 110.0, color = "red", linetype = "dashed", size = 1.0)
)
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.5, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Density", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.5") +
geom_hline(yintercept = 80.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
)
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.1, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Density", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.1") +
geom_hline(yintercept = 20.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 80.0, color = "#ffb2b2", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
)
Let’s zoom in ➡️
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.1, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[1.5, 5.25]) +
scale_y_continuous(name="Density", breaks = range(0, 20, 4), limits=[0, 20]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.1",subtitle="The two patterns in the data become more apparent when\n we bin it this way.") +
geom_hline(yintercept = 3.0, color = "red", linetype = "dashed", size = 1.0)
)
Note
Probability density function:
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
For more on this topic, see Chapter 5 (sections 5.2 and 5.3) of (Shafer and Zhang 2012)
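As an illustration (not from the original slides), the density above can be evaluated with scipy; a minimal sketch with illustrative parameter values:

import numpy as np
from scipy.stats import norm

# Normal distribution with mean mu = 0 and standard deviation sigma = 1 (illustrative values)
mu, sigma = 0, 1
x = np.linspace(-4, 4, 9)

# scipy's pdf matches the formula f(x) above
print(norm.pdf(x, loc=mu, scale=sigma))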
Probability mass function:
\[ f(x) = \frac{\lambda^xe^{-\lambda}}{x!} \]
For more on Poisson distribution, see this page.
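As with the normal case, scipy also provides the Poisson probability mass function; a small sketch (illustrative rate \(\lambda = 3\)):

import numpy as np
from scipy.stats import poisson

# Poisson distribution with illustrative rate lambda = 3
lam = 3
x = np.arange(0, 10)

# scipy's pmf matches the formula f(x) above
print(poisson.pmf(x, mu=lam))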
For more on probability distributions, see Chapter 4 of (Illowsky and Dean 2013).
Let \(x_i, i=1,2,\dots,N\), be independent random variables, each of which is described by a probability density function (PDF) \(f_i(x)\) (these may all be different) with a mean \(\mu_i\) and variance \(\sigma_i^2\). The random variable \(z=\sum_i \frac{x_i}{N}\), i.e. the average of the \(x_i\), has the following properties:
its expectation value is \(E[z] = \frac{1}{N}\sum_i \mu_i\)
its variance is \(V[z] = \frac{1}{N^2}\sum_i \sigma_i^2\)
as \(N\) becomes large, its distribution tends to a Gaussian (normal) distribution, regardless of the form of the individual \(f_i(x)\)
For examples of how to apply the central limit theorem, see Chapter 6, section 6.2 of (Shafer and Zhang 2012) or this page or this page
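A quick simulation (a sketch, not part of the original slides) makes the theorem concrete: averages of many skewed (exponential) variables end up approximately normal, with the expected mean and variance.

import numpy as np

rng = np.random.default_rng(0)

# 10,000 samples, each the average of N = 100 exponential random variables (mean 1, variance 1)
N = 100
averages = rng.exponential(scale=1.0, size=(10_000, N)).mean(axis=1)

# The averages should have mean ~1 and variance ~1/N, and look roughly Gaussian in a histogram
print(averages.mean(), averages.var(), N * averages.var())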
Let’s say we have the following problem: we want to predict patient medical insurance costs and we have a dataset that contains the following variables:
How do we go about solving the problem?
👉🏻 We want to express medical insurance costs as a function of some or all of the variables contained in the dataset
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, \dots, X_p\).
Warning
The data can be found on Kaggle. You can download it below:
Our data has 1338 rows and 7 columns.
This is what the first lines of the data look like:
|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Your selected dataframe has 7 columns.
There are 0 columns that have missing values.
|   | Missing Values | % of Total Values |
|---|---|---|
| age | 0 | 0.0 |
| sex | 0 | 0.0 |
| bmi | 0 | 0.0 |
| children | 0 | 0.0 |
| smoker | 0 | 0.0 |
| region | 0 | 0.0 |
| charges | 0 | 0.0 |
import pandas as pd
import missingno as msno

def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()
    # Percentage of missing values per column
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Sort the table by percentage of missing values, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] >= 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.query("`Missing Values` != 0").shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns

missing_values_table(data)
|   | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |
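The table above is presumably the output of pandas' `describe` method, which summarises the numeric columns:

data.describe()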
`age`, `bmi` and `children`: variables whose distributions are close to a normal distribution. `charges`, though, has a mean that is much higher than its median: likely the effect of outliers.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

plt.rcParams['figure.figsize'] = [30, 5]
plt.rcParams['font.size'] = 18

# Split the columns into categorical and numerical features
cat_features = data.select_dtypes(include='object').columns
num_features = data.select_dtypes(include=['int', 'float']).columns

# One count plot per categorical feature, three plots per row
fig, ax = plt.subplots(nrows=int(np.ceil(len(cat_features) / 3)), ncols=3)
ax = ax.flatten()
for i, column in enumerate(cat_features):
    sns.countplot(data=data, x=column, ax=ax[i])
`bmi` is indeed normally distributed
`charges` is, as expected, most definitely not normally distributed: the data is right-skewed (i.e. the tail is more pronounced on the right). The distribution is most likely a Gamma distribution
`age` and `children` are also not normally distributed: `age` probably follows a uniform distribution, and `children` a right-skewed distribution
`charges` is highly correlated with `smoker`
from sklearn.preprocessing import LabelEncoder

# Encode the categorical variables as integers so they can enter the correlation matrix
encoder = LabelEncoder()
data['sex'] = encoder.fit_transform(data['sex'])
data['smoker'] = encoder.fit_transform(data['smoker'])
data['region'] = encoder.fit_transform(data['region'])

# Correlation heatmap of the (now fully numeric) dataframe
f, ax = plt.subplots(figsize=(30, 15))
corr = data.corr()
sns.heatmap(corr, annot=True, square=True, ax=ax)
fig, axis = plt.subplots(nrows = 1, ncols = 2)
ax = axis.flatten()
sns.histplot(data[data.smoker == 1]["charges"], kde=True, ax = ax[0])
ax[0].set_title("Distribution of charges for smokers")
sns.histplot(data[data.smoker == 0]["charges"], kde=True, ax = ax[1])
ax[1].set_title("Distribution of charges for non-smokers")
ax= sns.countplot(data = data, x= 'smoker', hue = 'sex')
ax.set_xticks([0, 1])
ax.set_xticklabels(["non-smoker", "smoker"])
ax.legend(["female","male"])
`charges` are, on average, higher for smokers than non-smokers.

Let’s model `charges` with a single predictor: `smoker`.
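The summary below comes from statsmodels; the fitting code is not shown on the slide, but a minimal sketch that would produce a comparable no-intercept fit (assuming the dataframe is still called `data`) is:

import statsmodels.api as sm

# Single predictor, no intercept term (hence the "uncentered" R-squared in the summary below)
X = data['smoker']
y = data['charges']

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())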
OLS Regression Results
=======================================================================================
Dep. Variable: charges R-squared (uncentered): 0.652
Model: OLS Adj. R-squared (uncentered): 0.652
Method: Least Squares F-statistic: 2505.
Date: Mon, 21 Oct 2024 Prob (F-statistic): 1.00e-308
Time: 14:15:14 Log-Likelihood: -14300.
No. Observations: 1338 AIC: 2.860e+04
Df Residuals: 1337 BIC: 2.861e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
smoker 3.205e+04 640.409 50.046 0.000 3.08e+04 3.33e+04
==============================================================================
Omnibus: 79.948 Durbin-Watson: 1.190
Prob(Omnibus): 0.000 Jarque-Bera (JB): 226.414
Skew: -0.279 Prob(JB): 6.84e-50
Kurtosis: 4.936 Cond. No. 1.00
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
First, I now have the equation that represents the (linear) relationship between `charges` and `smoker`:

\[ \operatorname{charges} = 3.205 \times 10^{4} \times \operatorname{smoker} \]
A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.
Let’s compute the confidence intervals for our model coefficients:
|   | 2.5 % | 97.5 % |
|---|---|---|
| smoker | 30793.915664 | 33306.547999 |
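With the statsmodels results object from the sketch above, these bounds would come from:

# 95% confidence intervals for the estimated coefficients
results.conf_int(alpha=0.05)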
\[ H_0: \beta_1 = 0 \quad \text{vs} \quad H_A: \beta_1 \neq 0, \]
since if \(\beta_1=0\) then the model reduces to \(Y = \beta_0 + \epsilon\), and \(X\) and \(Y\) are not associated.
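The test statistic and p-value reported in the regression summary above (t = 50.046, P>|t| = 0.000) can also be read off the same statsmodels results object:

# t-statistics and p-values for each coefficient
print(results.tvalues)
print(results.pvalues)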
Do you think this looks like a good model?
Indicative reading of the week:
Reference: (Aschwanden 2015)
LSE DS101 2024/25 Autumn Term