Theme: Supervised Learning
26 Jan 2024
Machine learning (ML) aims to capture patterns in data (think trends, correlations, associations, etc.) and to use them to make predictions or to support decisions.
The next number can be represented as a function, \(f(\ )\), of the previous one:
\[ \operatorname{next number} = f(\operatorname{previous number}) \]
Or, let’s say, as a function of the position of the number in the sequence:
Position | Number |
---|---|
1 | 3 |
2 | 6 |
3 | 9 |
4 | 12 |
5 | 15 |
6 | 18 |
7 | 21 |
8 | 24 |
In equation form:
\[ \operatorname{Number} = f(\operatorname{Position}) \]
where
\[ f(x) = 3x \]
👈🏻 Typically, we use a tabular format like this to represent our data when doing ML.
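To make the pattern concrete, here is the same rule written as a tiny R function (a minimal sketch; the function name `f` is purely illustrative):

```r
# The rule behind the table above: multiply the position by 3
f <- function(position) 3 * position

f(1:8)
#> [1]  3  6  9 12 15 18 21 24
```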
The goal of ML:
Find a function \(f\), in any suitable mathematical form, that can best predict how \(Y\) will vary given \(X\).
Let’s find out:
If we visit the OEIS ® page and paste our sequence: \(3, 6, 9, 12, 15, 18, 21, 24, ...\)
… we will get 19 different sequences that contain those same numbers!
So how do we know which function actually generated our data? The sad truth: we don’t.
The statistician George Box famously wrote:
“All models are wrong, but some are useful.”
This example we just explored can be represented as:
\[ Y = f(X) + \epsilon \]
where:
- \(Y\) is the outcome we want to predict (the Number in our example)
- \(X\) is the input we use to predict it (the Position in our example)
- \(\epsilon\) is an error term capturing whatever the function does not explain

Whenever you are modelling something that can be represented as the equation above, you are doing supervised learning.
You can have multiple columns of \(X\) values 👉
X1 | X2 | X3 | X4 | … | Y |
---|---|---|---|---|---|
1 | 2 | 3 | 10 | … | 3 |
2 | 4 | 6 | 20 | … | 6 |
3 | 6 | 9 | 30 | … | 9 |
4 | 8 | 12 | 40 | … | 12 |
5 | 10 | 15 | 50 | … | 15 |
If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.
You can still use ML to find patterns in the data, a process known as unsupervised learning.
These are, broadly speaking, the two main ways of learning from data: supervised and unsupervised learning.
Linear regression is a simple approach to supervised learning.
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, X_2, \dots, X_p\).
Warning
We assume a model:
\[ Y = \beta_0 + \beta_1 X + \epsilon , \]
where \(\beta_0\) is the (unknown) intercept, \(\beta_1\) is the (unknown) slope, and \(\epsilon\) is a random error term.
We want to estimate:
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]
where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are our estimates of the true coefficients, and \(\hat{y}\) is the resulting prediction of \(Y\) for a given value \(x\).
Suppose you have some data and you suspect there is a linear relationship between \(X\) and \(Y\).
How would you go about fitting a line to it?
A line right through the “centre of gravity” of the cloud of data.
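As an illustration, here is a minimal sketch of drawing such a line with ggplot2, using simulated data (the data frame, seed, and coefficients below are purely illustrative):

```r
# Simulate a cloud of points with a roughly linear relationship
library(ggplot2)

set.seed(1)
df <- data.frame(x = 1:50)
df$y <- 2 + 0.5 * df$x + rnorm(50, sd = 3)

# geom_smooth(method = "lm") draws the least-squares line through the cloud
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```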
There are multiple ways to estimate the coefficients.
Residuals are the vertical distances from each data point to this line.
\(e_i = y_i - \hat{y}_i\) represents the \(i\)th residual
Observed vs. Predicted
From this, we can define the Residual Sum of Squares (RSS) as
\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]
or equivalently as
\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]
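To make this concrete, here is a small R helper that computes the RSS for a candidate intercept and slope (a sketch; the function name and the example vectors are illustrative):

```r
# RSS for a candidate line y = beta0 + beta1 * x
rss <- function(beta0, beta1, x, y) {
  residuals <- y - (beta0 + beta1 * x)
  sum(residuals^2)
}

# With the 3, 6, 9, ... example, the line y = 3x has zero RSS
x <- 1:8
y <- 3 * x
rss(beta0 = 0, beta1 = 3, x = x, y = y)   # 0
rss(beta0 = 1, beta1 = 2, x = x, y = y)   # 140: a worse line
```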
Note
The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.
We treat this as an optimisation problem. We want to minimize RSS:
\[ \begin{align} \min \mathrm{RSS} =& \sum_i^n{e_i^2} \\ =& \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ =& \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
To find \(\hat{\beta}_0\), we set the partial derivative of the RSS with respect to \(\hat{\beta}_0\) to zero and solve:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} = 0 \]
… which will lead you to:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]
where we made use of the sample means:
\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} & \\ 0 &= \sum_i^n{2 (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= 2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div 2) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]
\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]
Similarly, to find \(\hat{\beta}_1\) we solve:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} = 0 \]
… which will lead you to:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]
\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} & \\ 0 &= \sum_i^n{\left(2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= 2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div 2) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]
\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \sum_i^n{\left(\hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{separate sums})\\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \hat{\beta}_1\sum_i^n{\left(\bar{x}x_i - x_i^2\right)} & (\text{took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & (\text{isolate }\hat{\beta}_1) ~~~ \blacksquare \end{align} \]
And that is how OLS works!
\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]
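As a quick sanity check, these closed-form formulas can be verified numerically against base R’s `lm()`, here using the 3, 6, 9, … example from earlier (purely illustrative data):

```r
# OLS estimates computed directly from the closed-form formulas
x <- 1:8
y <- 3 * x

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat = beta0_hat, beta1_hat = beta1_hat)   # 0 and 3, as expected
coef(lm(y ~ x))                                   # lm() gives the same answer
```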
👉 Let’s work with the simple assumption that the price of a house in a month is a function of the price of a house in the previous month.
date | region | average_price | average_price_lag1 (previous month) |
---|---|---|---|
2023-06-01 | England | £ 306447 | £ 303671 |
2023-05-01 | England | £ 303671 | £ 304102 |
2023-04-01 | England | £ 304102 | £ 302772 |
2023-03-01 | England | £ 302772 | £ 305859 |
2023-02-01 | England | £ 305859 | £ 307375 |
2023-01-01 | England | £ 307375 | £ 309949 |
2022-12-01 | England | £ 309949 | £ 312288 |
2022-11-01 | England | £ 312288 | £ 311100 |
Adding a column for the month-on-month difference:
date | region | average_price | average_price_lag1 (previous month) | difference |
---|---|---|---|---|
2023-06-01 | England | £ 306447 | £ 303671 | +2776 |
2023-05-01 | England | £ 303671 | £ 304102 | -431 |
2023-04-01 | England | £ 304102 | £ 302772 | +1330 |
2023-03-01 | England | £ 302772 | £ 305859 | -3087 |
2023-02-01 | England | £ 305859 | £ 307375 | -1516 |
2023-01-01 | England | £ 307375 | £ 309949 | -2574 |
2022-12-01 | England | £ 309949 | £ 312288 | -2339 |
2022-11-01 | England | £ 312288 | £ 311100 | +1188 |
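For reference, a lagged column like `average_price_lag1` could be built with dplyr along these lines (a sketch: the source data frame name `uk_hpi` is an assumption, not the actual object used in the lecture):

```r
# Sort by time, then shift average_price down by one and two months
library(dplyr)

df_england <-
  uk_hpi %>%                                      # hypothetical raw UK HPI data
  filter(region == "England") %>%
  arrange(date) %>%
  mutate(
    average_price_lag1 = lag(average_price, 1),   # previous month's price
    average_price_lag2 = lag(average_price, 2),   # price two months before
    difference         = average_price - average_price_lag1
  )
```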
Linear regression seems like a pretty reasonable modelling choice here.
What values did OLS fit for \(\hat{\beta}_0\) and \(\hat{\beta}_1\)?
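The output below comes from a parsnip/tidymodels fit. A minimal sketch of how such a fit could be produced, assuming the `df_england` data frame above (the object names `lm_spec` and `lm_fit` are illustrative):

```r
# Define a linear regression specification backed by base R's lm()
library(tidymodels)

lm_spec <- linear_reg() %>% set_engine("lm")

# Fit average_price on last month's price and print the parsnip object
lm_fit <-
  lm_spec %>%
  fit(average_price ~ average_price_lag1, data = df_england)

lm_fit
```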
parsnip model object
Call:
stats::lm(formula = average_price ~ average_price_lag1, data = data)
Coefficients:
(Intercept) average_price_lag1
-560.208 1.006
That is:
\[ \begin{align} \hat{\beta}_0 &\approx - 560.21 \\ \hat{\beta}_1 &\approx + 1.006 \end{align} \]
Call:
stats::lm(formula = average_price ~ average_price_lag1, data = data)
Residuals:
Min 1Q Median 3Q Max
-17580.9 -1071.9 101.5 1238.6 15980.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.602e+02 8.636e+02 -0.649 0.517
average_price_lag1 1.006e+00 3.989e-03 252.126 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2525 on 220 degrees of freedom
Multiple R-squared: 0.9966, Adjusted R-squared: 0.9965
F-statistic: 6.357e+04 on 1 and 220 DF, p-value: < 2.2e-16
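With the fitted model in hand, we can predict next month’s average price from the latest observed price (a sketch, assuming `lm_fit` from above; the input is just the June 2023 figure from the table):

```r
# Predict the next month's average price from the most recent observation
library(tibble)

predict(lm_fit, new_data = tibble(average_price_lag1 = 306447))
# Using the rounded coefficients: -560.21 + 1.006 * 306447 ≈ £307,725
```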
# Retrieve a summary of the model
lm_fit2 <-
lm_spec %>%
fit(average_price ~ average_price_lag1 + average_price_lag2,
data=df_england)
lm_fit2$fit %>% summary()
Call:
stats::lm(formula = average_price ~ average_price_lag1 + average_price_lag2,
data = data)
Residuals:
Min 1Q Median 3Q Max
-15607.2 -1088.6 124.5 1375.2 15818.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -642.07477 873.58753 -0.735 0.4631
average_price_lag1 0.88133 0.06742 13.072 <2e-16 ***
average_price_lag2 0.12523 0.06789 1.845 0.0664 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2520 on 217 degrees of freedom
Multiple R-squared: 0.9966, Adjusted R-squared: 0.9965
F-statistic: 3.144e+04 on 2 and 217 DF, p-value: < 2.2e-16
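To compare the one-lag and two-lag models more systematically, broom’s `glance()` can put their fit statistics side by side (a sketch, assuming `lm_fit` and `lm_fit2` as fitted above):

```r
# Side-by-side fit statistics (R^2, adjusted R^2, sigma, AIC, BIC, ...)
library(dplyr)
library(broom)

bind_rows(
  glance(lm_fit$fit)  %>% mutate(model = "lag1 only"),
  glance(lm_fit2$fit) %>% mutate(model = "lag1 + lag2")
) %>%
  select(model, r.squared, adj.r.squared, sigma, AIC, BIC)
```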
How to revise for this course next week.
Your first summative problem set should be available on Moodle after this lecture.
It is due on the 19th of October.
To understand in detail all the assumptions implicitly made by linear models,
read (James et al. 2021, chaps. 2–3)
(Not compulsory but highly recommended reading)