🗓️ Week 04
Statistical Inference I

DS101 – Fundamentals of Data Science

Dr. Jon Cardoso-Silva

LSE Data Science Institute

06 Feb 2023

The Data Science Workflow

Statistics belongs more closely to this stage of the data science workflow.
We will learn how to use statistics to make inferences about the world from data.

Statistical Inference I & II

What we will see this week and the next:

🗓️ Week 04

Samples, Population & Resampling
Exploratory Data Analysis
Correlation vs Causation
What is Probability?
Probability Distributions

🗓️ Week 05

Hypothesis Testing
Framing Research Questions
Randomised Controlled Trials
A/B Tests
What about Cause and Effect?

Population vs Sample

Population

The population is the set of all elements that we are interested in.
It could be any set of objects or units (not just people):
- all the possible meaningful sentences one can conceive of in a language
- all the stars in the universe
The size of the population is often referred to as \(N\) (uppercase)

Sample

A sample is a subset of the population of size \(n\)
We examine the sample observations to draw conclusions and make inferences about the population.
Samples are always limited by what we can measure.
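
A minimal sketch of the population/sample distinction in R. The population here is simulated (hypothetical heights in cm, an assumption made purely for illustration), so that \(N\), \(n\) and both means can be inspected side by side.

```r
# Toy population (an assumption for illustration): N simulated heights in cm
set.seed(42)
N <- 10000
population <- rnorm(N, mean = 170, sd = 10)

# Draw a simple random sample of size n, without replacement
n <- 100
sample_obs <- sample(population, size = n, replace = FALSE)

# The sample mean estimates the population mean, with some sampling error
population_mean <- mean(population)
sample_mean <- mean(sample_obs)
c(population = population_mean, sample = sample_mean)
```

In practice we rarely get to see the population mean; the sample mean is what we use to make an inference about it.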

How you take the sample matters

The sample you take will influence the conclusions you draw.
Suppose you want to study patterns of e-mail use in a company.
You could take:

1️⃣ a random sample of 10% of people then look at their e-mails on a given day.

2️⃣ a random sample of 10% of all e-mails sent on a given day
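
The two sampling schemes above can be sketched on toy data. Everything here is hypothetical: the company size, the number of e-mails per person, and the small share of very heavy senders are assumptions chosen only to make the contrast visible.

```r
set.seed(1)

# Hypothetical company: 200 people; a few of them send far more e-mails than the rest
n_people <- 200
emails_per_person <- rpois(n_people, lambda = 5) +
  ifelse(runif(n_people) < 0.05, 100, 0)

# One row per e-mail, labelled with its sender
emails <- data.frame(sender = rep(seq_len(n_people), times = emails_per_person))

# Scheme 1: sample 10% of people, then keep all of their e-mails
sampled_people <- sample(seq_len(n_people), size = round(0.10 * n_people))
scheme1 <- emails[emails$sender %in% sampled_people, , drop = FALSE]

# Scheme 2: sample 10% of all e-mails directly
sampled_rows <- sample(nrow(emails), size = round(0.10 * nrow(emails)))
scheme2 <- emails[sampled_rows, , drop = FALSE]

# How many distinct senders does each scheme reach?
c(scheme1 = length(unique(scheme1$sender)),
  scheme2 = length(unique(scheme2$sender)))
```

Under these assumptions, scheme 1 keeps each selected person's full day of e-mails, while scheme 2 gives every e-mail the same chance of selection, so heavy senders contribute more rows.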

✍️ A quick exercise:

We will split the class into 2 groups.
Open RStudio and create an R notebook DS101L_2022_23_Week_04.Rmd
Now write down the pros and cons of your allocated sample
🗣️ Let’s discuss

In the age of Big Data
why don’t we just sample \(N=\text{ALL}\)?

Sampling solves some engineering problems
Bias: we can’t generalise much beyond the population
Data is not objective!

Inference

Statistical Inference

“This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.”

– (Schutt and O’Neil 2013, chap. 2) ⭐

Statistical Inference

“More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic processes¹.”

– (Schutt and O’Neil 2013, chap. 2)

Exploratory Data Analysis (EDA)

Start with an Exploratory Data Analysis

EDA is fundamentally a creative process.
Like most creative processes, the key to asking quality questions is to generate a large number of questions.
💡 It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset.

Steps of an EDA

Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
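
A possible first pass at these steps, sketched with base R on the built-in `faithful` dataset. The choice of dataset and of these particular checks is an assumption; any data frame would do.

```r
data(faithful)

# What variables are there, and of what type?
str(faithful)

# What values do they take? (min, quartiles, mean, max)
summary(faithful)

# Are there missing values to worry about?
colSums(is.na(faithful))
```

Each output tends to suggest the next question, for example why a variable's mean and median differ.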

What do you want to show?

What values can my variable take?

library(ggplot2)
data(diamonds)

g <- (ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, fill=cut)) + 
      scale_fill_discrete(guide=NULL) +
      theme_bw() + theme(axis.text=element_text(size=rel(1.15)), axis.title=element_text(size=rel(1.4))) +
      labs(title="Diamond cut", x="Cut", y="Count")
      )
g

How are the variables co-related?

library(ggplot2)
data(faithful)

g <- (ggplot(data=faithful) + 
      geom_point(mapping=aes(x=eruptions, y=waiting), color="#5C5AD3", size=4, alpha=0.5) + 
      theme_bw() + 
      theme(axis.text=element_text(size=rel(1.15)), axis.title=element_text(size=rel(1.4))) +
      labs(title="Old Faithful geyser in Yellowstone National Park, Wyoming, USA.", x="Eruptions (in mins)", y="Waiting Time (in mins)")
      )
g

Some ways to summarise association

Numerical data

Pearson correlation coefficient:

\[ \rho = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

It measures the strength of the linear association between two variables
Values range from -1 to +1
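
The formula can be checked term by term in R. This sketch uses the full `faithful` dataset and compares a manual calculation against base R's `cor()`; the variable names are illustrative.

```r
data(faithful)
x <- faithful$eruptions
y <- faithful$waiting

# Numerator: sum of the products of deviations from the means
numerator <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: square root of the product of the sums of squared deviations
denominator <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

rho_manual <- numerator / denominator
rho_builtin <- cor(x, y, method = "pearson")

c(manual = rho_manual, builtin = rho_builtin)
```

The two numbers should agree up to floating-point error.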

Discrete data

Contingency table (cross-tabulation):

Variable 1 / Variable 2	Category A	Category B	Category C
Category A	10	20	30
Category B	40	50	60
Category C	70	80	90
Category D	0	10	40
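
A contingency table like this can be built in R with `table()`. The sketch below uses the built-in `mtcars` dataset (an assumption; it is not used elsewhere in these slides) and cross-tabulates number of cylinders against number of gears.

```r
data(mtcars)

# Counts for every combination of the two discrete variables
tab <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
tab

# Add row and column totals
addmargins(tab)

# Row proportions: within each cylinder group, the share of each gear count
prop.table(tab, margin = 1)
```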

✍️ Activity

🧮 Let’s calculate like in the old days!

Form two groups again
Create a new header in your notebook
Calculate the correlation coefficient for the faithful dataset:
- Group 1: Use only observations with \(\operatorname{eruptions} < 3\) minutes
- Group 2: Use only observations with \(\operatorname{eruptions} \ge 3\) minutes
Format your results using markdown
🗣️ Let’s discuss

You have probably heard this before, but the mantra bears repeating:

“correlation does not imply causation”

Spurious Correlations

Time for coffee ☕

After the break:

Interpretations of probability
Probability distributions
Next week’s summative assessment

What is Probability?

Let’s talk about three possible interpretations of probability:

Classical

Events of the same kind can be reduced to a certain number of equally possible cases.

Example: a coin toss has two equally possible outcomes, heads or tails, so each has probability \(1/2\) (\(50\%/50\%\)).

Frequentist

What would be the outcome if I repeated the process many times?

Example: if I toss a coin \(1,000,000\) times, I expect \(\approx 50\%\) heads and \(\approx 50\%\) tails.

Bayesian

What is your judgement of the likelihood of the outcome, based on previous information?

Example: if I know that this coin has symmetric weight, I expect a \(50\%/50\%\) outcome.
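
The frequentist reading can be explored by simulation. This sketch tosses a simulated fair coin an increasing number of times and records the proportion of heads; the seed and the sample sizes are arbitrary choices.

```r
set.seed(123)

# Number of tosses in each experiment
n_tosses <- c(10, 100, 10000, 1000000)

# For each experiment, toss a fair coin n times and compute the share of heads
prop_heads <- sapply(n_tosses, function(n) {
  tosses <- sample(c("H", "T"), size = n, replace = TRUE)
  mean(tosses == "H")
})

data.frame(n_tosses, prop_heads)
```

With few tosses the proportion can be far from \(50\%\); with many tosses it tends to settle close to it.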

What is Probability?

For our purposes:

Probabilities are numbers between 0 and 1
The probabilities of all possible outcomes of an event must sum to 1.
It is useful to think of things as probabilities

Note

💡 Although there is no such thing as “a probability of \(120\%\)” or “a probability of \(-23\%\)”, you could still use this language to refer to an increase or decrease in an outcome.

Distributions

Histograms

library(ggplot2)
data(faithful)

g <- (ggplot(data = faithful) + 
      geom_histogram(mapping = aes(x=eruptions), binwidth = 1.0, fill="#5C5AD3") + 
      scale_x_continuous("Eruptions (in mins)", breaks = seq(0, 6, 1), limits=c(0, 6)) +
      scale_y_continuous("Count", breaks = seq(0, 120, 10), limits=c(0, 120)) +
      theme_bw() + theme(axis.text=element_text(size=rel(1.2)), title=element_text(size=rel(2))) +
      labs(title="Histogram with binwidth = 1.0") +
      geom_hline(yintercept = 110.0, color = "red", linetype = "dashed", size = 1.0)
      )
g

Histograms | What happens if we bin it differently?

library(ggplot2)
data(faithful)

g <- (ggplot(data = faithful) + 
      geom_histogram(mapping = aes(x=eruptions), binwidth = 0.5, fill="#5C5AD3") + 
      scale_x_continuous("Eruptions (in mins)", breaks = seq(0, 6, 1), limits=c(0, 6)) +
      scale_y_continuous("Count", breaks = seq(0, 120, 10), limits=c(0, 120)) +
      theme_bw() +
      theme(axis.text=element_text(size=rel(1.2)), title=element_text(size=rel(2))) +
      labs(title="Histogram with binwidth = 0.5") +      
      geom_hline(yintercept = 80.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
      geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
      )
g

Histograms | What happens if we bin it differently?

library(ggplot2)
data(faithful)

g <- (ggplot(data = faithful) + 
      geom_histogram(mapping = aes(x=eruptions), binwidth = 0.1, fill="#5C5AD3") + 
      scale_x_continuous("Eruptions (in mins)", breaks = seq(0, 6, 1), limits=c(0, 6)) +
      scale_y_continuous("Count", breaks = seq(0, 120, 10), limits=c(0, 120)) +
      theme_bw() +
      theme(axis.text=element_text(size=rel(1.2)), title=element_text(size=rel(2))) +
      labs(title="Histogram with binwidth = 0.1") +      
      geom_hline(yintercept = 20.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
      geom_hline(yintercept = 80.0, color = "#ffb2b2", linetype = "dashed", size = 1.0) +
      geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
      )
g

Let’s zoom in ➡️

Zoom in 🔎

library(ggplot2)
data(faithful)

g <- (ggplot(data = faithful) + 
      geom_histogram(mapping = aes(x=eruptions), binwidth = 0.1, fill="#5C5AD3") + 
      scale_x_continuous("Eruptions (in mins)", breaks = seq(1, 6, 1), limits=c(1.5, 5.25)) +
      scale_y_continuous("Count", breaks = seq(0, 20, 4), limits=c(0, 20)) +
      theme_bw() +
      theme(axis.text=element_text(size=rel(1.2)), title=element_text(size=rel(1.7))) +
      geom_vline(xintercept = 3.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
      labs(title="Histogram with binwidth = 0.1",
           subtitle="The two patterns in the data become more apparent when we bin it this way."))

g
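
The effect of the bin width can also be seen in the raw counts, without plotting. This base R sketch uses `hist(..., plot = FALSE)`; the bin edges start at 1 minute, which is an assumption and need not match the bin alignment ggplot2 chooses, so the counts may differ slightly from the plots.

```r
data(faithful)

# Counts per bin for a given bin width (edges at 1, 1 + width, ..., 6)
bin_counts <- function(width) {
  breaks <- seq(1, 6, by = width)
  hist(faithful$eruptions, breaks = breaks, plot = FALSE)$counts
}

# Tallest bar for each bin width used in the slides
widths <- c(1.0, 0.5, 0.1)
max_counts <- sapply(widths, function(w) max(bin_counts(w)))
data.frame(binwidth = widths, tallest_bin = max_counts)

# The zoomed-in plot suggests two groups, split around 3 minutes
table(short_eruption = faithful$eruptions < 3)
```

Narrower bins spread the same observations over more bars, so the tallest bar gets shorter while the total stays the same.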

Probability Distributions

The probability distribution of a random variable is a function that gives the probabilities of occurrence of the different possible outcomes for that variable.
Think of it as an attempt to model the data.
You hope that the model will be a good representation of the population.
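
One way to see what "model the data" can mean in practice: the sketch below fits a normal distribution (one of the common distributions covered later in these slides) to the waiting times of the longer eruptions in `faithful`, then compares a probability read off the model with the proportion observed in the data. The choice of the normal, of the \(\ge 3\) minute subset, and of the 80-minute threshold are all assumptions made for illustration.

```r
data(faithful)

# Waiting times for the longer eruptions only
long_waits <- faithful$waiting[faithful$eruptions >= 3]

# Fit the model: estimate its two parameters from the data
mu_hat <- mean(long_waits)
sigma_hat <- sd(long_waits)

# Proportion of observed waits of at most 80 minutes
empirical <- mean(long_waits <= 80)

# The same quantity according to the fitted normal model
model_based <- pnorm(80, mean = mu_hat, sd = sigma_hat)

c(empirical = empirical, normal_model = model_based)
```

If the two numbers are close, the model is at least a plausible description of this aspect of the data; if not, a different distribution may be needed.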

Note

The word model has a very precise meaning here.
We will come back to models next week.

Some Very Common Probability Distributions

Normal Distribution

Probability density function, with mean \(\mu\) and standard deviation \(\sigma\):

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
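
The formula translates directly into R. This sketch defines it by hand (`normal_pdf` is an illustrative name, not a built-in) and compares it with base R's `dnorm()` for \(\mu = 0\) and \(\sigma = 1\).

```r
# The normal density, written out as in the formula
normal_pdf <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-3, 3, by = 0.5)
manual <- normal_pdf(x, mu = 0, sigma = 1)
builtin <- dnorm(x, mean = 0, sd = 1)

# Do the two agree up to floating-point error?
all.equal(manual, builtin)

# A density must integrate to 1 over the whole real line
integrate(normal_pdf, lower = -Inf, upper = Inf, mu = 0, sigma = 1)$value
```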

Poisson Distribution

Probability mass function, for counts \(x = 0, 1, 2, \dots\) and rate \(\lambda > 0\):

\[ f(x) = \frac{\lambda^xe^{-\lambda}}{x!} \]
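
The same check works for the Poisson formula. `poisson_pmf` is an illustrative name, and \(\lambda = 3\) is an arbitrary choice; the built-in equivalent is `dpois()`.

```r
# The Poisson probability of observing exactly x events, written out as in the formula
poisson_pmf <- function(x, lambda) {
  lambda^x * exp(-lambda) / factorial(x)
}

x <- 0:10
manual <- poisson_pmf(x, lambda = 3)
builtin <- dpois(x, lambda = 3)

# Do the two agree up to floating-point error?
all.equal(manual, builtin)

# Probabilities over (effectively) all possible counts sum to 1
sum(dpois(0:100, lambda = 3))
```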

What’s Next

Your Week 05 Presentation

Groups:

Each group will focus on a different plot type, as illustrated in R Graph Gallery:
- Group A: Plots of distributions
- Group B: Plots of correlations
- Group C: Plots of rankings
- Group D: Plots of evolution
- Group E: Plots of flow

🎯 Your task:

Prepare a presentation of 12-15 minutes for each group.
Search for 3 interesting examples of plots that belong to your group.
Plots must come from:
- published papers, reports
- or from yourself!
⚠️ DO NOT use the same plots that are in the R Graph Gallery
💡 Use Slack to share your tips on how to find good references!

References

D’Ignazio, Catherine, and Lauren F. Klein. 2020. Data Feminism. Strong Ideas Series. Cambridge, Massachusetts: The MIT Press. https://ebookcentral.proquest.com/lib/londonschoolecons/reader.action?docID=6120950.

DeGroot, Morris H., and Mark J. Schervish. 2003. Probability and Statistics. 3. ed., international edition. Boston Munich: Addison-Wesley.

Guyan, Kevin. 2022. Queer Data: Using Gender, Sex and Sexuality Data for Action. Bloomsbury Studies in Digital Cultures. London: Bloomsbury Academic. https://web-s-ebscohost-com.gate3.library.lse.ac.uk/ehost/detail/detail?nobk=y&vid=2&sid=a8efeedd-6bfc-459a-9f0c-a67dabcc75d1@redis&bdata=JnNpdGU9ZWhvc3QtbGl2ZQ==#AN=3077276&db=nlebk.

Schutt, Rachel, and Cathy O’Neil. 2013. Doing Data Science. First edition. Beijing ; Sebastopol: O’Reilly Media. https://ebookcentral.proquest.com/lib/londonschoolecons/detail.action?docID=1465965.
