Theme: Supervised Learning
31 Jan 2025
Machine Learning (ML) is about finding ways to capture patterns in data (think trends, correlations, associations, etc.) and using them to make predictions or to support decisions.
Consider the sequence \(3, 6, 9, 12, 15, 18, 21, 24, \dots\) The next number can be represented as a function, \(f(\ )\), of the previous one:
\[ \operatorname{next number} = f(\operatorname{previous number}) \]
Or, let’s say, as a function of the position of the number in the sequence:
Position | Number |
---|---|
1 | 3 |
2 | 6 |
3 | 9 |
4 | 12 |
5 | 15 |
6 | 18 |
7 | 21 |
8 | 24 |
In equation form:
\[ \operatorname{Number} = f(\operatorname{Position}) \]
where
\[ f(x) = 3x \]
👈🏻 Typically, we use a tabular format like this to represent our data when doing ML.
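As a small illustration (not part of the original slides; pandas is assumed to be available in the course environment), here is the same table built in Python, with the Number column generated by \(f(x) = 3x\):

```python
# A minimal sketch: the Position/Number table above as a pandas DataFrame,
# with the Number column generated by f(x) = 3x.
import pandas as pd

df = pd.DataFrame({"Position": range(1, 9)})
df["Number"] = 3 * df["Position"]  # apply f(x) = 3x to each position
print(df)
```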
The goal of ML:
Find a function \(f\), in any suitable mathematical form, that can best predict how \(Y\) will vary given \(X\).
But is \(f(x) = 3x\) the only function that fits our sequence? Let’s find out:
If we visit the OEIS® page and paste our sequence: \(3, 6, 9, 12, 15, 18, 21, 24, ...\)
… we will get 19 different sequences that contain those same numbers!
So, do we ever know for sure which function generated our data? The sad truth: we don’t.
The statistician George Box famously wrote:
“All models are wrong, but some are useful.”
This example we just explored can be represented as:
\[ Y = f(X) + \epsilon \]
where:
- \(Y\) is the outcome we want to predict (Number in our example)
- \(X\) is the input we use to make that prediction (Position in our example)
- \(\epsilon\) is the error term: the part of \(Y\) that \(f(X)\) cannot explain

Whenever you are modelling something that can be represented in the form of the equation above, you are doing supervised learning.
You can have multiple columns of \(X\) values 👉
X1 | X2 | X3 | X4 | … | Y |
---|---|---|---|---|---|
1 | 2 | 3 | 10 | … | 3 |
2 | 4 | 6 | 20 | … | 6 |
3 | 6 | 9 | 30 | … | 9 |
4 | 8 | 12 | 40 | … | 12 |
5 | 10 | 15 | 50 | … | 15 |
If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.
You can still use ML to find patterns in the data, a process known as unsupervised learning.
These are, broadly speaking, the two main ways of learning from data: supervised learning and unsupervised learning.
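As a rough illustration of the unsupervised side (my own sketch, not from the course materials; scikit-learn assumed available), clustering looks for structure in \(X\) alone, with no \(Y\) to predict:

```python
# A minimal sketch of unsupervised learning: k-means clustering on synthetic data.
# There is no Y column here; we only look for groups within X.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two illustrative blobs of points (made-up data)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(50, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignments found purely from patterns in X
```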
Linear regression is a simple approach to supervised learning.
The generic supervised model:
\[ Y = f(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, \dots, X_p\).
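To make the notation concrete, here is a small sketch (illustrative coefficients and noise level chosen by me, not taken from the course) that simulates data from the single-predictor version of this model:

```python
# A minimal sketch: simulate data from Y = beta_0 + beta_1 X + epsilon
# (assumed values: beta_0 = 5, beta_1 = 2, noise standard deviation = 1).
import numpy as np

rng = np.random.default_rng(1)
beta_0, beta_1 = 5.0, 2.0
X = rng.uniform(0, 10, size=100)
epsilon = rng.normal(0, 1.0, size=100)  # the irreducible error term
Y = beta_0 + beta_1 * X + epsilon       # the observed outcome
```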
We assume a model:
\[ Y = \beta_0 + \beta_1 X + \epsilon , \]
where:

- \(\beta_0\) is the intercept
- \(\beta_1\) is the slope
- \(\epsilon\) is the error term
We want to estimate:
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]
where:

- \(\hat{y}\) is the predicted value of \(Y\)
- \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are our estimates of the true coefficients \(\beta_0\) and \(\beta_1\)
Now suppose you have a dataset of \((x, y)\) points, and you suspect there is a linear relationship between \(X\) and \(Y\).
How would you go about fitting a line to it?
A line right through the “centre of gravity” of the cloud of data.
There are multiple ways to estimate the coefficients.
Residuals are the distances from each data point to this line.
\(e_i = y_i - \hat{y}_i\) represents the \(i\)th residual: the difference between the observed and the predicted value.
From this, we can define the Residual Sum of Squares (RSS) as
\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]
or equivalently as
\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]
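In code, the RSS is just the sum of the squared residuals. A minimal sketch with made-up observations and predictions:

```python
# A minimal sketch: residuals and the Residual Sum of Squares (RSS).
import numpy as np

y = np.array([3.0, 6.0, 9.0, 12.0, 15.0])      # observed values (made up)
y_hat = np.array([3.2, 5.9, 9.1, 11.8, 15.3])  # predictions from some fitted line (made up)

residuals = y - y_hat          # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)   # RSS = e_1^2 + ... + e_n^2
print(rss)
```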
Note
The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.
We treat this as an optimisation problem. We want to minimize RSS:
\[ \begin{align} \min \mathrm{RSS} =& \sum_i^n{e_i^2} \\ =& \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ =& \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
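One way to see that this really is an optimisation problem is to hand the RSS to a generic numerical optimiser. This is a sketch of that idea (my own example using scipy, not the method shown in the course notebook):

```python
# A minimal sketch: minimise the RSS numerically instead of using a formula.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=50)  # assumed true beta_0 = 2, beta_1 = 3

def rss(beta):
    beta_0, beta_1 = beta
    return np.sum((y - (beta_0 + beta_1 * x)) ** 2)

result = minimize(rss, x0=[0.0, 0.0])  # start from an arbitrary guess
print(result.x)                        # estimates should land close to (2, 3)
```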
To find \(\hat{\beta}_0\), we set the partial derivative of the RSS with respect to \(\hat{\beta}_0\) to zero:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]
… which will lead you to:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]
where we made use of the sample means \(\bar{x} = \frac{1}{n}\sum_i^n{x_i}\) and \(\bar{y} = \frac{1}{n}\sum_i^n{y_i}\). Step by step:
\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{-2\,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= -2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $-2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div (-2)) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]
\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]
Similarly, to find \(\hat{\beta}_1\) we solve:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]
… which will lead you to:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]
\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{\left(-2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= -2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $-2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div (-2)) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]
\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \sum_i^n{\left(\hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{separate sums})\\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \hat{\beta}_1\sum_i^n{\left(\bar{x}x_i - x_i^2\right)} & (\text{took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_i^n{x_i(y_i - \bar{y})}}{\sum_i^n{x_i(x_i - \bar{x})}} & (\text{isolate }\hat{\beta}_1) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & \left(\text{since } \textstyle\sum_i^n{\bar{x}(y_i - \bar{y})} = 0 \text{ and } \sum_i^n{\bar{x}(x_i - \bar{x})} = 0\right) ~~~ \blacksquare \end{align} \]
And that is how OLS works!
\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]
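As a sanity check, the two formulas above are easy to implement directly. A minimal sketch with simulated data (values chosen by me), compared against numpy's built-in fit:

```python
# A minimal sketch: the closed-form OLS estimators derived above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=100)  # assumed true beta_0 = 1.5, beta_1 = 0.8

beta_1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0_hat = y.mean() - beta_1_hat * x.mean()

print(beta_0_hat, beta_1_hat)
print(np.polyfit(x, y, deg=1))  # returns [slope, intercept]; should match the estimates above
```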
For this part, we’ll switch back to a Jupyter Notebook that you can download on the course website so you can run the code along with my explanations.
A few metrics
\[ \begin{align} R^2 &= 1-\frac{RSS}{TSS}\\ &= 1-\frac{\sum_{i=1}^N (y_i-\hat{y}_i)^2}{\sum_{i=1}^N (y_i-\bar{y})^2} \end{align} \]
RSS is the residual sum of squares, which is the sum of squared residuals. This value captures the prediction error of a model.
TSS is the total sum of squares. To calculate it, assume a simple model in which the prediction for every observation is just the mean of all observed values. TSS is proportional to the variance of the dependent variable, since \(\frac{TSS}{N}\) is the variance of \(y\), where \(N\) is the number of observations. Think of TSS as the error that this mean-only model leaves unexplained.
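Putting the two pieces together, \(R^2\) can be computed directly from RSS and TSS. A minimal sketch with the same made-up numbers as before:

```python
# A minimal sketch: R^2 from RSS and TSS.
import numpy as np

y = np.array([3.0, 6.0, 9.0, 12.0, 15.0])      # observed values (made up)
y_hat = np.array([3.2, 5.9, 9.1, 11.8, 15.3])  # model predictions (made up)

rss = np.sum((y - y_hat) ** 2)     # error left over by the model
tss = np.sum((y - y.mean()) ** 2)  # error of a model that always predicts the mean
r_squared = 1 - rss / tss
print(r_squared)
```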
Caveats:
It does not, on its own, tell you whether your model is a good or appropriate one:
⚠️ WARNING: \(R^2\) increases with every variable you add to your model (even if there is only a chance correlation and the new variable simply adds noise!). A regression model with more independent variables than another can look like a better fit simply because it has more variables!
⚠️ WARNING: When a model has an excessive number of independent variables and/or polynomial terms, it starts to fit the specificities of the data and its random noise too closely rather than actually reflecting the entire population: that’s called overfitting. This phenomenon results in deceptively high \(R^2\) values and decreases the precision of predictions.
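To see the first warning in action, here is a small sketch (my own synthetic example, assuming scikit-learn is available) where the training \(R^2\) keeps creeping up as we add columns of pure noise:

```python
# A minimal sketch: training R^2 never decreases when we add predictors,
# even when the extra predictors are pure noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 50
X = rng.uniform(0, 10, size=(n, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 2, size=n)  # only the first column matters
noise = rng.normal(size=(n, 5))                 # five columns of pure noise

for extra in range(6):
    X_aug = np.hstack([X, noise[:, :extra]])    # add 0..5 noise columns
    r2 = LinearRegression().fit(X_aug, y).score(X_aug, y)
    print(f"{1 + extra} predictors: training R^2 = {r2:.4f}")
```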
Adjusted \(R^2\) adjusts for the number of predictors in a regression model, so you can use it instead of \(R^2\).
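For reference, the most common form of the adjustment (not shown in the original notes; double-check against your course materials) is:

\[ R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}, \]

where \(N\) is the number of observations and \(p\) the number of predictors: adding a predictor only raises \(R^2_{\text{adj}}\) if it improves the fit by more than chance alone would suggest.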
⚠️ WARNING:
- Sensitive to outliers and large errors due to the squaring.
- Emphasizes large errors.
Next week: how to revise for this course.
LSE DS202 2024/25 Winter Term