DS101 – Fundamentals of Data Science
16 Oct 2023
What we will see this week and the next:
1️⃣ a random sample of 10% of people then look at their e-mails on a given day.
2️⃣ a random sample of 10% of all emails sent on a given day
✍️ A quick exercise:
References: (Guyan 2022) & (D’Ignazio and Klein 2020)
“This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.”
“More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic processes.”
Source: (Schutt and O’Neil 2013, chap. 2)
from plotnine import ggplot, geom_histogram, geom_bar, aes, scale_fill_discrete, labs, element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
(ggplot(diamonds) +
geom_bar(aes(x=diamonds.cut, fill=diamonds.cut)) +
scale_fill_discrete() +
theme_bw() +
labs(title="Diamond cut", x="Cut", y="Count")+
theme(figure_size=(20, 12),text=element_text(size=24))
)
from plotnine import ggplot, geom_histogram, geom_bar, aes, scale_fill_continuous, labs, element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
from plotnine.coords import coord_cartesian
import pandas as pd
(ggplot(diamonds) +
geom_bar(aes(x=diamonds.carat, fill=diamonds.carat)) +
scale_fill_continuous() +
theme_bw() +
labs(title="Diamond carat", x="Carat", y="Count")+
coord_cartesian(xlim=(0.2, 5.01))+
theme(figure_size=(20, 12),text=element_text(size=24))
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import diamonds
(ggplot(data=diamonds) +
geom_point(aes(x=diamonds.carat, y=diamonds.price), color="#800020", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between diamond carat and diamond price", x="Diamond carat", y="Diamond price (in dollars)")
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import midwest
(ggplot(data=midwest) +
geom_point(aes(x=midwest.percollege, y=midwest.percpovertyknown), color="#5C5AD3", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between percentage college educated population and percentage of known poverty", x="Percentage of college educated population", y="Percentage of known poverty")
)
from plotnine import ggplot, geom_point, aes, element_text, labs
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data=faithful) +
geom_point(aes(x=faithful.eruptions, y=faithful.waiting), color="#007BA7", size=4, alpha=0.5) +
theme_bw() +
theme(axis_text=element_text(size=18), axis_title=element_text(size=16)) +
labs(title="Relationship between eruption duration and waiting time between eruptions for Old Faithful geyser in Yellowstone National Park, Wyoming, USA.", x="Eruptions (in mins)", y="Waiting time (in mins)")
)
Pearson correlation coefficient:
\[ \rho = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} \]
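As a minimal sketch of how the formula above translates into code, the hypothetical helper `pearson_r` below computes the coefficient in plain Python, term by term. The lists `xs` and `ys` are made-up illustration values, not data from the lecture.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    `pearson_r` is a hypothetical helper written for illustration only.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Numerator: sum of products of deviations from the means.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Denominator: square root of the product of the two sums of squares.
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")

    return numerator / math.sqrt(ss_x * ss_y)


# Made-up toy data: ys is an exact linear function of xs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

r_increasing = pearson_r(xs, ys)        # perfectly increasing relationship
r_decreasing = pearson_r(xs, ys[::-1])  # perfectly decreasing relationship
print(r_increasing, r_decreasing)
```

A perfectly linear increasing relationship gives a coefficient of 1 and a perfectly linear decreasing one gives -1. Real data, such as the scatter plots above, falls somewhere in between.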
Contingency table (cross-tabulation):
Variable 1 / Variable 2 | Category A | Category B | Category C |
---|---|---|---|
Category A | 10 | 20 | 30 |
Category B | 40 | 50 | 60 |
Category C | 70 | 80 | 90 |
Category D | 0 | 10 | 40 |
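A contingency table like the one above is just a count of how often each pair of categories occurs together. The sketch below builds one from a list of paired observations using only the standard library. The observations and the category labels are made up for illustration.

```python
from collections import Counter

# Made-up paired observations: (category of variable 1, category of variable 2).
observations = [
    ("A", "X"), ("A", "Y"), ("B", "X"),
    ("A", "X"), ("B", "Y"), ("B", "Y"),
]

# Count how many times each (variable 1, variable 2) pair appears.
pair_counts = Counter(observations)

row_labels = sorted({v1 for v1, _ in observations})
col_labels = sorted({v2 for _, v2 in observations})

# Nested dict: table[row][column] -> count (0 when a pair never occurs).
table = {
    row: {col: pair_counts[(row, col)] for col in col_labels}
    for row in row_labels
}

# Print in the same layout as the table above.
print("Variable 1 / Variable 2 | " + " | ".join(col_labels) + " |")
for row in row_labels:
    cells = " | ".join(str(table[row][col]) for col in col_labels)
    print(f"{row} | {cells} |")
```

With a pandas dataframe, `pandas.crosstab` produces the same kind of table in one call.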
🧮 Let’s calculate like in the old ages!
faithful dataset:
Hints
Load the faithful dataset with these lines of code:
pip install pydataset  # only run this line if the next line raises a `ModuleNotFoundError`
from pydataset import data
df = data('faithful')  # dataframe with the `faithful` data
To access a particular column, e.g. waiting, in a dataframe df, you need this line:
df['waiting']
Adapt these lines to suit your needs!
You have probably heard this before, but the mantra bears repeating:
“correlation does not imply causation”
Check more on the Spurious Correlations website
After the break:
Let’s talk about three possible interpretations of probability:
Classical
Events of the same kind can be reduced to a certain number of equally possible cases.
Example: a coin toss has two equally possible outcomes, so heads and tails each have probability \(1/2\) (\(50\%/50\%\)).
Frequentist
What would the outcome be if I repeated the process many times?
Example: if I toss a coin \(1,000,000\) times, I expect \(\approx 50\%\) heads and \(\approx 50\%\) tails.
Bayesian
Based on previous information, what is your judgement of the likelihood of the outcome?
Example: if I know that this coin has symmetric weight, I expect a \(50\%/50\%\) outcome.
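The frequentist reading can be tried out directly by simulating repeated coin tosses. The sketch below is a minimal illustration that assumes a fair coin. The helper `heads_fraction` and the fixed seed are hypothetical choices made so the run is reproducible.

```python
import random


def heads_fraction(n_tosses, seed=0):
    """Fraction of heads in n_tosses simulated tosses of a fair coin.

    Hypothetical helper for illustration; the seed only fixes the random
    sequence so that repeated runs give the same numbers.
    """
    rng = random.Random(seed)
    # Each toss is heads with probability 0.5.
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The observed fraction tends to settle near 0.5 as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, heads_fraction(n))
```

With few tosses the fraction can sit far from \(50\%\). With many tosses it gets close to \(50\%\), which is the sense in which a frequentist speaks of a probability of \(1/2\).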
For our purposes:
Note
💡 Although there is no such thing as “a probability of \(120\%\)” or “a probability of \(-23\%\)”, you could still use this language to refer to an increase or decrease in an outcome.
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 1.0, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Count", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 1.0") +
geom_hline(yintercept = 110.0, color = "red", linetype = "dashed", size = 1.0)
)
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.5, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Count", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.5") +
geom_hline(yintercept = 80.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
)
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.1, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[0, 6]) +
scale_y_continuous(name="Count", breaks = range(0, 120, 10), limits=[0, 120]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.1") +
geom_hline(yintercept = 20.0, color = "#ff0000", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 80.0, color = "#ffb2b2", linetype = "dashed", size = 1.0) +
geom_hline(yintercept = 110.0, color = "#ffb2b2", linetype = "dashed", size = 1.0)
)
Let’s zoom in ➡️
from plotnine import ggplot, geom_histogram, aes, scale_x_continuous, scale_y_continuous, labs, geom_hline,element_text
from plotnine.themes import theme, theme_bw
from plotnine.data import faithful
(ggplot(data = faithful) +
geom_histogram(aes(x=faithful.eruptions), binwidth = 0.1, fill="#00BFFF") +
scale_x_continuous(name="Eruptions (in mins)", breaks = range(0, 6, 1), limits=[1.5, 5.25]) +
scale_y_continuous(name="Count", breaks = range(0, 20, 4), limits=[0, 20]) +
theme_bw() + theme(axis_text=element_text(size=16), title=element_text(size=16)) +
labs(title="Histogram with binwidth = 0.1",subtitle="The two patterns in the data become more apparent when\n we bin it this way.") +
geom_hline(yintercept = 3.0, color = "red", linetype = "dashed", size = 1.0)
)
Note
Statistical Function (normal distribution, probability density function):
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
For more on this topic, see Chapter 5 (sections 5.2 and 5.3) of (Shafer and Zhang 2012)
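To make the normal density formula concrete, here is a minimal sketch that evaluates it directly. The helper name `normal_pdf` and its default parameters (a standard normal, \(\mu = 0\), \(\sigma = 1\)) are assumptions made for illustration.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma.

    Hypothetical helper that mirrors the formula term by term.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# The density is highest at the mean and symmetric around it.
peak = normal_pdf(0.0)
print(peak, normal_pdf(-1.0), normal_pdf(1.0))

# Rough numerical check that the total area under the curve is about 1
# (Riemann sum over [-8, 8], which covers essentially all of the mass).
step = 0.001
area = sum(normal_pdf(-8.0 + i * step) * step for i in range(16_001))
print(area)
```

The value of the density at a point is not itself a probability. Probabilities come from areas under the curve, which is why the total area is the quantity checked here.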
Statistical Function (Poisson distribution, probability mass function):
\[ f(x) = \frac{\lambda^xe^{-\lambda}}{x!} \]
For more on probability distributions, see Chapter 4 of (Illowsky and Dean 2013).
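The Poisson formula \(f(x) = \lambda^x e^{-\lambda} / x!\) can be evaluated in the same spirit. This is a sketch only: the helper `poisson_pmf` and the rate \(\lambda = 3\) are made-up illustration choices.

```python
import math


def poisson_pmf(x, lam):
    """Probability of observing exactly x events when the expected count is lam.

    Hypothetical helper for illustration; x must be a non-negative integer.
    """
    if not isinstance(x, int) or x < 0:
        raise ValueError("x must be a non-negative integer")
    if lam <= 0:
        raise ValueError("lam must be positive")
    return lam ** x * math.exp(-lam) / math.factorial(x)


lam = 3.0  # assumed rate, chosen only for this example

# Probabilities for 0, 1, ..., 49 events; the tail beyond that is negligible.
probs = [poisson_pmf(k, lam) for k in range(50)]

total = sum(probs)                               # should be very close to 1
mean = sum(k * p for k, p in enumerate(probs))   # should be very close to lam
print(total, mean)
```

Unlike the normal density, each value here is a probability in its own right, because the Poisson distribution describes counts (0, 1, 2, ...) rather than a continuous quantity.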
Let \(x_i,i=1,2,...,N\), be independent random variables, each of which is described by a probability density function (PDF) \(f_i(x)\) (these may be all different) with a mean \(\mu_i\) and variance \(\sigma_i^2\). The random variable \(z=\sum_i \frac{x_i}{N}\), i.e., the average of the \(x_i\), has the following properties:
For examples of how to apply the central limit theorem, see Chapter 6, section 6.2 of (Shafer and Zhang 2012).
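A small simulation can make the theorem tangible. The sketch below assumes the standard statement of the central limit theorem: for independent \(x_i\), the average \(z\) has mean \(\frac{1}{N}\sum_i \mu_i\) and variance \(\frac{1}{N^2}\sum_i \sigma_i^2\), and its distribution approaches a Gaussian as \(N\) grows. It averages \(N = 30\) uniform random numbers (each with mean \(1/2\) and variance \(1/12\)) many times over. The helper `simulate_averages` and all parameter values are hypothetical choices.

```python
import random
import statistics


def simulate_averages(n_vars, n_repeats, seed=0):
    """Return n_repeats values of z, each the average of n_vars uniform(0, 1) draws.

    Hypothetical helper for illustration; the seed makes runs reproducible.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() for _ in range(n_vars)) / n_vars
        for _ in range(n_repeats)
    ]


N = 30                        # number of variables averaged into each z
z = simulate_averages(N, 20_000)

expected_mean = 0.5           # average of the individual means
expected_var = (1 / 12) / N   # sum of variances divided by N squared

z_mean = statistics.fmean(z)
z_var = statistics.pvariance(z)
print("mean of z:", z_mean, "expected:", expected_mean)
print("variance of z:", z_var, "expected:", expected_var)

# For a Gaussian, roughly 68% of values lie within one standard deviation
# of the mean; the simulated averages should come close to that.
sd = expected_var ** 0.5
within_one_sd = sum(abs(v - expected_mean) <= sd for v in z) / len(z)
print("fraction within one sd:", within_one_sd)
```

The individual draws are uniform, not bell-shaped, yet their averages behave very much like draws from a normal distribution.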
The notebook that contains the demo is here.
And you can download the dataset to load to get the demo running by clicking on the button below:
For more, see (Enders 2022) and (Scheffer 2002)
LSE DS101 2023/24 Autumn Term | archive