🗓️ Week 04
Statistical Inference I

DS101 – Fundamentals of Data Science

16 Oct 2023

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End

  • Statistics belongs most closely to the analysis stages of this workflow (from exploratory data analysis through to obtaining insights).
  • We will learn how to use statistics to make inferences about the world from data.

Statistical Inference I & II

What we will see this week and the next:

🗓️ Week 04


  • Samples, Population & Resampling
  • Exploratory Data Analysis
  • Correlation vs Causation
  • What is Probability?
  • Probability Distributions

🗓️ Week 05


  • Hypothesis Testing
  • Framing Research Questions
  • Randomised Controlled Trials
  • A/B Tests
  • What about Cause and Effect?

Population vs Sample

Population

  • The population is the set of all elements that we are interested in.
  • It could be any set of objects or units (not just people):
    • all the possible meaningful sentences one can conceive of in a language
    • all the stars in the universe
  • The size of the population is often referred to as \(N\) (uppercase)

Sample

  • A sample is a subset of the population of size \(n\)
  • We examine the sample observations to draw conclusions and make inferences about the population.
  • Samples are always limited by what we can measure.

How you take the sample matters

  • The sample you take will influence the conclusions you draw.
  • Suppose you want to study patterns of e-mail use in a company.
  • You could take:

1️⃣ a random sample of 10% of people, then look at their e-mails on a given day.

2️⃣ a random sample of 10% of all e-mails sent on a given day.
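To make the difference between the two schemes concrete, here is a toy sketch in pandas; the e-mail log and sender names are entirely hypothetical:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical e-mail log for one day: one row per e-mail sent
emails = pd.DataFrame({
    'sender': rng.choice([f'person_{i}' for i in range(100)], size=2000),
})

# Scheme 1: sample 10% of *people*, then keep all of their e-mails
people = pd.Series(emails['sender'].unique())
sample1 = emails[emails['sender'].isin(people.sample(frac=0.1, random_state=1))]

# Scheme 2: sample 10% of *e-mails* directly
sample2 = emails.sample(frac=0.1, random_state=1)

print(sample1['sender'].nunique(), sample2['sender'].nunique())  # ~10 vs ~90 distinct senders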

✍️ A quick exercise:

  • Pair up with the neighbours in your table.
  • Each table will be assigned one of the two random samples we’ve just seen (sample 1 or sample 2) to discuss.
  • Now discuss and write down the pros and cons of your allocated sample.
  • 🗣️ Let’s discuss.

In the age of Big Data
why don’t we just sample \(N=\text{ALL}\)?

  • Sampling solves some engineering problems
  • Bias: even \(N=\text{ALL}\) only covers the population that generated the data; we can’t generalise beyond it
  • Data is not objective!

Inference

Statistical Inference




“This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.”

(Schutt and O’Neil 2013, chap. 2)

Population → (take out a sample) → Sample → (form hypotheses) → Conclusions based on the sample → (make inferences about the population) → back to the Population

Statistical Inference

“More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic processes.”

(Schutt and O’Neil 2013, chap. 2)

Exploratory Data Analysis (EDA)

Start with an Exploratory Data Analysis

  • EDA is fundamentally a creative process.
  • As with most creative processes, the key to asking quality questions is to generate a large number of questions.
  • 💡 It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset.

Steps of an EDA

  1. Generate questions about your data.
  2. Search for answers by visualising, transforming, and modelling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

What do you want to show?


What values can my variable take?

from plotnine import ggplot, geom_bar, aes, scale_fill_discrete, labs, theme, theme_bw, element_text
from plotnine.data import diamonds

(ggplot(diamonds) + 
      geom_bar(aes(x='cut', fill='cut')) + 
      scale_fill_discrete() +
      theme_bw() + 
      labs(title="Diamond cut", x="Cut", y="Count") +
      theme(figure_size=(20, 12), text=element_text(size=24))
      )

What values can my variable take?

from plotnine import ggplot, geom_bar, aes, scale_fill_continuous, labs, theme, theme_bw, element_text, coord_cartesian
from plotnine.data import diamonds

(ggplot(diamonds) + 
      geom_bar(aes(x='carat', fill='carat')) + 
      scale_fill_continuous() +
      theme_bw() + 
      labs(title="Diamond carat", x="Carat", y="Count") +
      coord_cartesian(xlim=(0.2, 5.01)) +
      theme(figure_size=(20, 12), text=element_text(size=24))
      )

Some ways to summarise association


Numerical data

Pearson correlation coefficient:

\[ \rho = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

  • The above assumes variables are linearly related
  • A coefficient value of 0 does NOT imply a lack of relationship between variables; it only implies a lack of linear relationship between them
  • Values range from -1 to +1
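As a quick illustration, here is a minimal sketch that implements the formula above with numpy and checks it against numpy’s built-in (the data points are made up):

import numpy as np

# Pearson correlation implemented directly from the formula above
def pearson(x, y):
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly linear in x

print(pearson(x, y))            # close to +1
print(np.corrcoef(x, y)[0, 1])  # same value from numpy's built-in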

Discrete data

Contingency table (cross-tabulation):

Variable 1 / Variable 2    Category A   Category B   Category C
Category A                 10           20           30
Category B                 40           50           60
Category C                 70           80           90
Category D                 0            10           40
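If your data lives in a pandas dataframe, a sketch like the following builds such a table (the two variables here are hypothetical):

import pandas as pd

# Hypothetical observations of two categorical variables
df = pd.DataFrame({
    'var1': ['A', 'A', 'B', 'B', 'C', 'D', 'D'],
    'var2': ['A', 'B', 'A', 'C', 'B', 'C', 'C'],
})

# pd.crosstab counts how often each pair of categories co-occurs
print(pd.crosstab(df['var1'], df['var2']))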

✍️ Activity

🧮 Let’s calculate like in the old days!

  • Form a group with your table neighbours again. Each table is assigned to calculation group 1 or 2 (see below for details)
  • Calculate the correlation coefficient for the faithful dataset:
    • Group 1: Use only samples with \(\operatorname{eruptions} < 3\) minutes
    • Group 2: Use only samples with \(\operatorname{eruptions} \ge 3\) minutes
  • 🗣️ Let’s discuss

Hints

  • Use Google Colab or your Nuvolos account to do the computation
  • You can load the faithful dataset with these lines of code
# Only run the install line below if the import gives a `ModuleNotFoundError`
# %pip install pydataset
from pydataset import data
df = data('faithful')  # dataframe with the `faithful` data
  • To filter the dataframe so that, for example, it only contains waiting times greater than 60 mins, you would use the following line of code
df.query('waiting>60')

To access a particular column, e.g. waiting, in a dataframe df, you need this line:

df['waiting']

Adapt these lines to suit your needs!
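Putting the hints together, the whole computation might look like this minimal sketch (assuming pydataset is installed):

from pydataset import data

df = data('faithful')  # columns: `eruptions` and `waiting`

group1 = df.query('eruptions < 3')   # Group 1
group2 = df.query('eruptions >= 3')  # Group 2

for name, g in [('Group 1', group1), ('Group 2', group2)]:
    r = g['eruptions'].corr(g['waiting'])  # pandas defaults to Pearson
    print(f"{name}: r = {r:.3f}")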

  • Use this tutorial to find out how to compute the correlation coefficient.
  • While you are highly encouraged to use Python for this exercise, if you are really stuck or really don’t know how to write it in Python, you can do the computation using a more familiar tool (e.g. Excel). To allow for this, you can download the faithful dataset in .csv format by clicking the button below:


You probably already heard this before, but it bears remembering the mantra:

“correlation does not imply causation”

Spurious Correlations


Time for a break 🍵

After the break:

  • Interpretations of probability
  • Probability distributions
  • Central limit theorem
  • Handling missing data

What is Probability?


Let’s talk about three possible interpretations of probability:

Classical

Events of the same kind can be reduced to a certain number of equally possible cases.

Example: coin tosses lead to either heads or tails \(1/2\) of the time (\(50\%/50\%\)).

Frequentist

What would be the outcome if I repeated the process many times?

Example: if I toss a coin \(1,000,000\) times, I expect \(\approx 50\%\) heads and \(\approx 50\%\) tails.

Bayesian

What is your judgement of the likelihood of the outcome, based on previous information?

Example: if I know that this coin has a symmetric weight, I expect a \(50\%/50\%\) outcome.

What is Probability?

For our purposes:

  • Probabilities are numbers between 0 and 1.
  • The probabilities of all possible outcomes of an event must sum to 1.
  • It is often useful to frame uncertain quantities as probabilities.

Note

💡 Although there is no such thing as “a probability of \(120\%\)” or “a probability of \(-23\%\)”, you could still use this language to refer to an increase or decrease in an outcome.
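A minimal sanity check of the two rules above, using a hypothetical fair six-sided die:

import numpy as np

# Quick check of the two rules with a hypothetical fair six-sided die
p = np.full(6, 1 / 6)               # probability of each face
assert np.all((0 <= p) & (p <= 1))  # each probability is between 0 and 1
assert np.isclose(p.sum(), 1)       # probabilities of all outcomes sum to 1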

Distributions

Histograms

from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline, theme, theme_bw, element_text
from plotnine.data import faithful

(ggplot(data = faithful) + 
      geom_histogram(aes(x='eruptions'), binwidth = 1.0, fill="#00BFFF") + 
      scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
      scale_y_continuous(name="Count", breaks = range(0, 120, 10), limits=[0, 120]) +
      theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
      labs(title="Histogram with binwidth = 1.0") +
      geom_hline(yintercept = 110.0, color = "red", linetype = "dashed", size = 1.0)
      )

Histograms: What happens if we bin it differently?

from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline, theme, theme_bw, element_text
from plotnine.data import faithful

(ggplot(data = faithful) + 
      geom_histogram(aes(x='eruptions'), binwidth = 0.5, fill="#00BFFF") + 
      scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
      scale_y_continuous(name="Count", breaks = range(0, 120, 10), limits=[0, 120]) +
      theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
      labs(title="Histogram with binwidth = 0.5") +
      geom_hline(yintercept = 80.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
      geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
      )

Histograms: What happens if we bin it differently?

from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline, theme, theme_bw, element_text
from plotnine.data import faithful

(ggplot(data = faithful) + 
      geom_histogram(aes(x='eruptions'), binwidth = 0.1, fill="#00BFFF") + 
      scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
      scale_y_continuous(name="Count", breaks = range(0, 120, 10), limits=[0, 120]) +
      theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
      labs(title="Histogram with binwidth = 0.1") +
      geom_hline(yintercept = 20.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
      geom_hline(yintercept = 80.0, color = "#ffb2b2", linetype = "dashed", size = 1.0) +
      geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
      )

Let’s zoom in ➡️

Zoom in 🔎

from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline, theme, theme_bw, element_text
from plotnine.data import faithful

(ggplot(data = faithful) + 
      geom_histogram(aes(x='eruptions'), binwidth = 0.1, fill="#00BFFF") + 
      scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[1.5, 5.25]) +
      scale_y_continuous(name="Count", breaks = range(0, 20, 4), limits=[0, 20]) +
      theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
      labs(title="Histogram with binwidth = 0.1", subtitle="The two patterns in the data become more apparent when\n we bin it this way.") +
      geom_hline(yintercept = 3.0, color = "red", linetype = "dashed", size = 1.0)
      )

Probability Distributions

  • The probability distribution of a random variable is a function that gives the probabilities of occurrence of the different possible outcomes for that variable.
  • Think of it as an attempt to model the data.
  • You hope that the model will be a good representation of the population.

Note

  • The word model has a very precise meaning here.
  • We will come back to models next week.

Some Very Common Probability Distributions

Normal Distribution

Statistical Function:

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

  • also known as Gaussian distribution
  • \(\mu\) is the mean and \(\sigma\) the standard deviation
  • a symmetric, bell-shaped distribution centred on the mean: data close to the mean occur more frequently than data far from the mean
  • Basis for many statistical methods

For more on this topic, see Chapter 5 (sections 5.2 and 5.3) of (Shafer and Zhang 2012)
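As a quick illustration, here is a minimal sketch of the density formula above, checked against scipy (\(\mu\) and \(\sigma\) default to 0 and 1):

import numpy as np
from scipy.stats import norm

# Gaussian density implemented directly from the formula above
def normal_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-4, 4, 9)
assert np.allclose(normal_pdf(x), norm.pdf(x))  # matches scipy's implementation
print(normal_pdf(0.0))  # ≈ 0.3989: the density peaks at the mean (here mu = 0)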

Poisson Distribution

Statistical Function:

\[ f(x) = \frac{\lambda^xe^{-\lambda}}{x!} \]

  • used to calculate the probability of an event happening a given number of times within a time interval
  • e.g. if an event happens independently and randomly over time, and the mean rate of occurrence is constant over time, then the number of occurrences of the event follows a Poisson distribution

Poisson Distribution

Statistical Function:

\[ f(x) = \frac{\lambda^xe^{-\lambda}}{x!} \]

  • discrete distribution (possibilities listed as 0,1,2,…); depends only on the mean number of occurrences expected
  • examples of random variables that follow Poisson:
    • The number of orders your company receives tomorrow.
    • The number of people who apply for a job tomorrow to your HR department.
    • The number of defects in a finished product.
    • The number of calls your company receives next week for help concerning an “easy-to-assemble” toy.
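As an illustration, a minimal sketch of the formula above, with a hypothetical mean rate of \(\lambda = 4\) orders per day, checked against scipy:

from math import exp, factorial
from scipy.stats import poisson

# Poisson pmf implemented directly from the formula above
def poisson_pmf(x, lam):
    return lam ** x * exp(-lam) / factorial(x)

lam = 4  # hypothetical mean rate: e.g. 4 orders per day
for k in range(8):
    print(k, round(poisson_pmf(k, lam), 4))

assert abs(poisson_pmf(2, lam) - poisson.pmf(2, lam)) < 1e-12  # matches scipy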

For more on probability distributions, see Chapter 4 of (Illowsky and Dean 2013).

Central Limit Theorem

Let \(x_i,\ i=1,2,\ldots,N\), be independent random variables, each of which is described by a probability density function (PDF) \(f_i(x)\) (these may all be different) with mean \(\mu_i\) and variance \(\sigma_i^2\). The random variable \(z=\frac{1}{N}\sum_i x_i\), i.e., the average of the \(x_i\), has the following properties:

  1. its expected value is given by \(\langle z \rangle = \frac{\sum_i \mu_i}{N}\);
  2. its variance is given by \(\operatorname{var}(z)=\frac{\sum_i \sigma_i^2}{N^2}\);
  3. as \(N\rightarrow \infty\), the PDF of \(z\) tends to a normal distribution with the same mean and variance.
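A quick simulation sketch (using exponential variables as an arbitrary skewed choice; numpy assumed) illustrates these properties:

import numpy as np

rng = np.random.default_rng(42)

# Average N skewed (exponential) variables, many times over; by the CLT the
# distribution of the average approaches a normal as N grows
N = 1000
averages = rng.exponential(scale=1.0, size=(10_000, N)).mean(axis=1)

print(averages.mean())  # ≈ 1.0, the mean of the underlying exponential
print(averages.std())   # ≈ 1 / sqrt(N) ≈ 0.032, as properties 1 and 2 predict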

For examples of how to apply the central limit theorem, see Chapter 6, section 6.2 of (Shafer and Zhang 2012) or this page or this page

Handling missing data (demo)

The notebook that contains the demo is here.

You can download the dataset needed to get the demo running by clicking on the button below:
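In the meantime, here is a minimal sketch of typical first steps with pandas; the dataframe below is hypothetical and merely stands in for the demo dataset:

import numpy as np
import pandas as pd

# Hypothetical dataframe with gaps, standing in for the demo dataset
df = pd.DataFrame({'age': [25, np.nan, 31, 28],
                   'income': [30_000, 42_000, np.nan, 38_000]})

print(df.isna().sum())                        # count missing values per column
print(df.dropna())                            # listwise deletion
print(df.fillna(df.mean(numeric_only=True)))  # simple mean imputation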

For more, see (Enders 2022) and (Scheffer 2002)

References

D’Ignazio, Catherine, and Lauren F. Klein. 2020. Data Feminism. Strong Ideas Series. Cambridge, Massachusetts: The MIT Press. https://ebookcentral.proquest.com/lib/londonschoolecons/reader.action?docID=6120950.
DeGroot, Morris H., and Mark J. Schervish. 2003. Probability and Statistics. 3rd ed., international edition. Boston, Munich: Addison-Wesley.
Enders, Craig K. 2022. Applied Missing Data Analysis. Guilford Publications.
Guyan, Kevin. 2022. Queer Data: Using Gender, Sex and Sexuality Data for Action. Bloomsbury Studies in Digital Cultures. London: Bloomsbury Academic. https://web-s-ebscohost-com.gate3.library.lse.ac.uk/ehost/detail/detail?nobk=y&vid=2&sid=a8efeedd-6bfc-459a-9f0c-a67dabcc75d1@redis&bdata=JnNpdGU9ZWhvc3QtbGl2ZQ==#AN=3077276&db=nlebk.
Illowsky, Barbara, and Susan L. Dean. 2013. Introductory Statistics. Houston, Texas: OpenStax College. https://openstax.org/details/books/introductory-statistics.
Scheffer, Judi. 2002. “Dealing with Missing Data.”
Schutt, Rachel, and Cathy O’Neil. 2013. Doing Data Science. First edition. Beijing; Sebastopol: O’Reilly Media. https://ebookcentral.proquest.com/lib/londonschoolecons/detail.action?docID=1465965.
Shafer, Douglas S., and Zhiyi Zhang. 2012. Introductory Statistics. Saylor Foundation. https://saylordotorg.github.io/text_introductory-statistics/.