DS101 – Fundamentals of Data Science
06 Feb 2023
What we will see this week and the next:
1️⃣ Take a random sample of 10% of people, then look at their e-mails on a given day.
2️⃣ Take a random sample of 10% of all e-mails sent on a given day.
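To see why the two sampling schemes are not interchangeable, here is a minimal sketch on simulated data. Everything in it is hypothetical: the population of 1,000 people, the Poisson-distributed number of e-mails per person, and the names `emails`, `sender`, `sample_people` and `sample_emails` are assumptions made for illustration only.

```r
set.seed(101)

# Hypothetical population: 1,000 people, each sending a random number of e-mails in a day
n_people <- 1000
n_sent   <- rpois(n_people, lambda = 5)
emails   <- data.frame(sender = rep(seq_len(n_people), times = n_sent))

# Scheme 1: sample 10% of people, then keep ALL of their e-mails
sampled_people <- sample(seq_len(n_people), size = round(0.10 * n_people))
sample_people  <- emails[emails$sender %in% sampled_people, , drop = FALSE]

# Scheme 2: sample 10% of all e-mails, regardless of who sent them
sampled_rows  <- sample(nrow(emails), size = round(0.10 * nrow(emails)))
sample_emails <- emails[sampled_rows, , drop = FALSE]

# Number of distinct senders represented in each sample
n_senders_scheme1 <- length(unique(sample_people$sender))  # at most 100 by construction
n_senders_scheme2 <- length(unique(sample_emails$sender))  # senders appear only partially
c(scheme1 = n_senders_scheme1, scheme2 = n_senders_scheme2)
```

Under these assumptions, scheme 1 gives a complete picture of a few people, while scheme 2 gives a partial picture of many people, so the two samples support different kinds of questions.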
✍️ A quick exercise:
DS101L_2022_23_Week_04.Rmd
References: (Guyan 2022) & (D’Ignazio and Klein 2020)
“This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.”
“More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic processes.”
Source: (Schutt and O’Neil 2013, chap. 2)
library(ggplot2)
data(faithful)
g <- (ggplot(data=faithful) +
geom_point(mapping=aes(x=eruptions, y=waiting), color="#5C5AD3", size=4, alpha=0.5) +
theme_bw() +
theme(axis.text=element_text(size=rel(1.15)), axis.title=element_text(size=rel(1.4))) +
labs(title="Old Faithful geyser in Yellowstone National Park, Wyoming, USA.", x="Eruptions (in mins)", y="Waiting Time (in mins)")
)
g
Pearson correlation coefficient:
\[ \rho = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \]
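The formula can be checked term by term on the `faithful` data plotted earlier. The sketch below is one possible way to do it; the variable names `numerator`, `denominator` and `rho_manual` are chosen here for illustration.

```r
data(faithful)

x <- faithful$eruptions
y <- faithful$waiting

# Numerator: sum of the products of deviations from the means
numerator <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: square root of the product of the two sums of squared deviations
denominator <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

rho_manual <- numerator / denominator
rho_manual

# Cross-check against R's built-in implementation
all.equal(rho_manual, cor(x, y, method = "pearson"))
```

The result is roughly 0.90, which matches the strong positive association visible in the scatter plot.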
Contingency table (cross-tabulation):
Variable 1 / Variable 2 | Category A | Category B | Category C |
---|---|---|---|
Category A | 10 | 20 | 30 |
Category B | 40 | 50 | 60 |
Category C | 70 | 80 | 90 |
Category D | 0 | 10 | 40 |
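As a small sketch, the counts in the table above can be entered in R to obtain row and column totals and row-wise proportions. The object names `counts` and `tab` are assumptions for illustration.

```r
# Counts copied from the table above; rows are Variable 1, columns are Variable 2
counts <- matrix(
  c(10, 20, 30,
    40, 50, 60,
    70, 80, 90,
     0, 10, 40),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    "Variable 1" = paste("Category", c("A", "B", "C", "D")),
    "Variable 2" = paste("Category", c("A", "B", "C"))
  )
)
tab <- as.table(counts)

# Add row and column totals
addmargins(tab)

# Share of each row that falls in each column (every row sums to 1)
row_shares <- prop.table(tab, margin = 1)
round(row_shares, 2)
```

Reading the proportions row by row makes it easier to compare categories of Variable 1 that have very different totals.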
🧮 Let’s calculate like in the old ages!
faithful dataset:
You have probably heard this before, but the mantra bears repeating:
“correlation does not imply causation”
Photo by cottonbro studio | Pexels
Check more on the Spurious Correlations website
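One common way a correlation arises without causation is a shared third variable. The simulation below is a hypothetical illustration, not real data: `z` is an invented confounder that drives both `x` and `y`, and the coefficients 2 and -3 are arbitrary.

```r
set.seed(101)
n <- 1000

# Hypothetical confounder z drives both x and y; x never enters the formula for y
z <- rnorm(n)
x <- 2 * z + rnorm(n)
y <- -3 * z + rnorm(n)

# x and y are strongly correlated even though neither causes the other
cor_xy <- cor(x, y)
cor_xy

# Once z is accounted for, the association largely disappears:
# correlate what is left of x and y after regressing each on z
cor_given_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
cor_given_z
```

In real data the confounder is often unobserved, which is why a correlation on its own cannot settle a causal question.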
After the break:
Let’s talk about three possible interpretations of probability:
Classical
Frequentist
Bayesian
Classical: events of the same kind can be reduced to a certain number of equally possible cases.
Example: a coin toss has two equally possible outcomes, heads or tails, so each has probability \(1/2\) (\(50\%/50\%\)).
Frequentist: what would the outcome be if I repeated the process many times?
Example: if I toss a coin \(1{,}000{,}000\) times, I expect \(\approx 50\%\) heads and \(\approx 50\%\) tails.
Bayesian: based on previous information, what is your judgement of the likelihood of the outcome?
Example: if I know that this coin is symmetrically weighted, I expect a \(50\%/50\%\) outcome.
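The frequentist example can be tried out directly. This is a minimal simulation sketch that assumes a fair coin (`prob = 0.5`); the seed and the variable names are arbitrary choices.

```r
set.seed(101)

# Simulate 1,000,000 tosses of a fair coin (1 = heads, 0 = tails)
n_tosses <- 1e6
tosses   <- rbinom(n_tosses, size = 1, prob = 0.5)

# Long-run relative frequency of heads
prop_heads <- mean(tosses)
prop_heads

# Running proportion of heads after 10, 100, ..., 1,000,000 tosses
checkpoints <- 10^(1:6)
running     <- cumsum(tosses)[checkpoints] / checkpoints
data.frame(tosses = checkpoints, prop_heads = running)
```

The proportion can be far from 0.5 after only 10 tosses and settles close to 0.5 as the number of tosses grows.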
For our purposes:
Note
💡 Although there is no such thing as “a probability of \(120\%\)” or “a probability of \(-23\%\)”, you could still use this language to refer to an increase or decrease in an outcome.
library(ggplot2)
data(faithful)
g <- (ggplot(data = faithful) +
geom_histogram(mapping = aes(x=eruptions), binwidth = 1.0, fill="#5C5AD3") +
scale_x_continuous("Eruptions (in mins)", breaks = seq(0, 6, 1), limits=c(0, 6)) +
scale_y_continuous("Density", breaks = seq(0, 120, 10), limits=c(0, 120)) +
theme_bw() + theme(axis.text=element_text(size=rel(1.2)), title=element_text(size=rel(2))) +
labs(title="Histogram with binwidth = 1.0") +
geom_hline(yintercept = 110.0, color = "red", linetype = "dashed", linewidth = 1.0)
)
g
library(ggplot2)
data(faithful)
g <- (ggplot(data = faithful) +
geom_histogram(mapping = aes(x=eruptions), binwidth = 0.5, fill="#5C5AD3") +
scale_x_continuous("Eruptions (in mins)", breaks = seq(0, 6, 1), limits=c(0, 6)) +
scale_y_continuous("Density", breaks = seq(0, 120, 10), limits=c(0, 120)) +
theme_bw() +
theme(axis.text=element_text(size=rel(1.2)), title=element_text(size=rel(2))) +
labs(title="Histogram with binwidth = 0.5") +
geom_hline(yintercept = 80.0, color = "#ff0000", linetype = "dashed", linewidth = 1.0) +
geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", linewidth = 1.0)
)
g
library(ggplot2)
data(faithful)
g <- (ggplot(data = faithful) +
geom_histogram(mapping = aes(x=eruptions), binwidth = 0.1, fill="#5C5AD3") +
scale_x_continuous("Eruptions (in mins)", breaks = seq(0, 6, 1), limits=c(0, 6)) +
scale_y_continuous("Density", breaks = seq(0, 120, 10), limits=c(0, 120)) +
theme_bw() +
theme(axis.text=element_text(size=rel(1.2)), title=element_text(size=rel(2))) +
labs(title="Histogram with binwidth = 0.1") +
geom_hline(yintercept = 20.0, color = "#ff0000", linetype = "dashed", linewidth = 1.0) +
geom_hline(yintercept = 80.0, color = "#ffb2b2", linetype = "dashed", linewidth = 1.0) +
geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", linewidth = 1.0)
)
g
Let’s zoom in ➡️
library(ggplot2)
data(faithful)
g <- (ggplot(data = faithful) +
geom_histogram(mapping = aes(x=eruptions), binwidth = 0.1, fill="#5C5AD3") +
scale_x_continuous("Eruptions (in mins)", breaks = seq(1, 6, 1), limits=c(1.5, 5.25)) +
scale_y_continuous("Density", breaks = seq(0, 20, 4), limits=c(0, 20)) +
theme_bw() +
theme(axis.text=element_text(size=rel(1.2)), title=element_text(size=rel(1.7))) +
geom_vline(xintercept = 3.0, color = "#ff0000", linetype = "dashed", linewidth = 1.0) +
labs(title="Histogram with binwidth = 0.1",
subtitle="The two patterns in the data become more apparent when we bin it this way."))
g
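The effect of the bin width can also be inspected numerically. The sketch below uses base R's `hist()` with `plot = FALSE` instead of ggplot2. Its bin edges (`seq(1, 6, by = bw)`) are an assumption and need not match the ones `geom_histogram()` picks, so the counts may differ slightly from the dashed reference lines in the plots. It also reports the density scale, on which bar areas sum to 1, next to the raw counts that `geom_histogram()` draws by default.

```r
data(faithful)
x <- faithful$eruptions

# Bin the eruption durations with three different bin widths (no plot is drawn)
binwidths <- c(1.0, 0.5, 0.1)
hists <- lapply(binwidths, function(bw) {
  hist(x, breaks = seq(1, 6, by = bw), plot = FALSE)
})

summary_table <- data.frame(
  binwidth    = binwidths,
  # Counts always add up to the sample size
  total_count = sapply(hists, function(h) sum(h$counts)),
  # The tallest bar shrinks as the bins get narrower
  max_count   = sapply(hists, function(h) max(h$counts)),
  # On the density scale, bar height times bar width sums to 1
  total_area  = sapply(hists, function(h) sum(h$density * diff(h$breaks)))
)
summary_table
```

Counts depend on the chosen bin width, whereas the total area on the density scale does not, which is what makes densities comparable across bin widths.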
Note
Statistical Function (normal distribution):
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
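This is the density of the normal distribution with mean \(\mu\) and standard deviation \(\sigma\). As a quick check, it can be typed out as written and compared with R's built-in `dnorm()`. The function name `normal_pdf` and the grid of `x` values are illustrative choices.

```r
# Normal density written out exactly as in the formula above
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-3, 3, by = 0.5)

# Compare with R's built-in dnorm()
all.equal(normal_pdf(x, mu = 0, sigma = 1), dnorm(x, mean = 0, sd = 1))

# The density at the mean of a standard normal is 1 / sqrt(2 * pi)
normal_pdf(0)
```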
Statistical Function (Poisson distribution):
\[ f(x) = \frac{\lambda^xe^{-\lambda}}{x!} \]
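This is the probability mass function of the Poisson distribution with rate \(\lambda\), defined for counts \(x = 0, 1, 2, \ldots\). A short sketch comparing the formula with R's built-in `dpois()` follows; the function name `poisson_pmf` and the choice \(\lambda = 3\) are arbitrary.

```r
# Poisson probability mass function written out as in the formula above
poisson_pmf <- function(x, lambda) {
  lambda^x * exp(-lambda) / factorial(x)
}

x <- 0:10

# Compare with R's built-in dpois()
all.equal(poisson_pmf(x, lambda = 3), dpois(x, lambda = 3))

# Probabilities over all counts sum to 1 (0:100 covers essentially all the mass for lambda = 3)
total_prob <- sum(poisson_pmf(0:100, lambda = 3))
total_prob
```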
Groups:
Each group will focus on a different plot type, as illustrated in R Graph Gallery:
Group A: Plots of distributions
Group B: Plots of correlations
Group C: Plots of rankings
Group D: Plots of evolution
Group E: Plots of flow
🎯 Your task:
LSE DS101 2022/23 Lent Term (archive)