Theme: Supervised Learning
09 Oct 2025

Important
🥐 Important releases this week
Your first formative is now available!
Deadline: Week 04 (October 23rd 5pm)
Submission: GitHub Classroom
| Traditional Programming | Machine Learning |
|---|---|
| Humans define explicit rules | Algorithm learns/infers rules from data |
| Inputs + Rules → Output | Inputs + Outputs → Model/Rules |
| Deterministic, well-defined tasks | Can handle noisy/complex tasks |
Consider this sequence:
3, 6, 9, 12, 15, 18, 21, 24, …
What comes next?
Note
Most people guess: 27
Why? You recognized the pattern: “add 3 each time”
This is learning! You extracted a rule from examples.
The next number can be represented as a function, \(f(\ )\), of the previous one:
\[ \operatorname{next number} = f(\operatorname{previous number}) \]
Or, let’s say, as a function of the position of the number in the sequence:
| Position | Number |
|---|---|
| 1 | 3 |
| 2 | 6 |
| 3 | 9 |
| 4 | 12 |
| 5 | 15 |
| 6 | 18 |
| 7 | 21 |
| 8 | 24 |
We can express this as a function:
\[\text{Number} = 3 \times \text{Position}\]
Or more generally:
\[Y = f(X)\]
Machine Learning is about finding the best function \(f\) that captures patterns in data.
👈🏻 Typically, we use a tabular format like this to represent our data when doing ML.
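To make this concrete, here is a minimal R sketch (variable names are illustrative) that stores the table above in exactly this tabular format and lets `lm()` recover the rule \(\text{Number} = 3 \times \text{Position}\):

```r
# The sequence as a table of (Position, Number) pairs
seq_data <- data.frame(
  position = 1:8,
  number   = c(3, 6, 9, 12, 15, 18, 21, 24)
)

# Learn f from the examples: a simple linear model Number ~ Position
fit <- lm(number ~ position, data = seq_data)
coef(fit)
# The intercept is (numerically) 0 and the slope is 3,
# i.e. lm() recovers the rule Number = 3 * Position
```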
Visit OEIS.org and search for: 3, 6, 9, 12, 15, 18, 21, 24
Result: 19 different sequences match these numbers!
They all start the same but diverge later.
“All models are wrong, but some are useful.”
— George Box
The goal: Find a useful model, not a perfect one
Statistics asks: “What is the relationship?” Machine Learning asks: “How well can we predict?”
Both are valuable — and complementary.
| Traditional Statistics | Machine Learning |
|---|---|
| Explain relationships | Make predictions |
| Test hypotheses | Optimize performance |
| Understand causality | Forecast outcomes |
| “Why does X affect Y?” | “How well can we predict Y?” |
Both are valuable! They complement each other.
Supervised Learning
Have labeled data: both inputs (X) and outputs (Y)
Goal: Learn to predict Y from X
Example: predicting a house's sale price from its size, location, and number of rooms
Unsupervised Learning
Have only inputs (X), no labels
Goal: Find hidden structure or patterns
Example: grouping customers into segments with similar behaviour, without any pre-defined labels
Today’s focus: Supervised Learning
We observe pairs of data:
\[(\text{Input}_1, \text{Output}_1), (\text{Input}_2, \text{Output}_2), \ldots\]
We assume there’s a relationship:
\[Y = f(X) + \varepsilon\]
Where:
- \(f\) is the unknown, true relationship between \(X\) and \(Y\)
- \(\varepsilon\) is random noise (the irreducible error) that the model cannot explain
Find an approximation \(\hat{f}\) such that:
\[\hat{Y} = \hat{f}(X)\]
is as close as possible to the true \(Y\)
Linear Regression is the simplest way to find \(\hat{f}\)
Note
Example
\[ \textrm{Happiness} = f(\textrm{GNI}, \textrm{Health}, \textrm{Education}, \textrm{Freedom}, \textrm{Social Support}, \ldots) \]
We’ll use this real-world example to illustrate supervised learning through linear regression later today.
You can have multiple columns of \(X\) values 👉
| X1 | X2 | X3 | X4 | … | Y |
|---|---|---|---|---|---|
| 1 | 2 | 3 | 10 | … | 3 |
| 2 | 4 | 6 | 20 | … | 6 |
| 3 | 6 | 9 | 30 | … | 9 |
| 4 | 8 | 12 | 40 | … | 12 |
| 5 | 10 | 15 | 50 | … | 15 |
If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.
You can still use ML to find patterns in the data, a process known as unsupervised learning.
Linear regression is the simplest approach to supervised learning.
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, \dots, X_p\).
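Both forms map directly onto R's `lm()` formula interface. A minimal sketch on simulated data (all names and coefficient values are illustrative):

```r
# Simulate a small dataset: x3 has no real effect on y
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 2 + 1.5 * df$x1 - 0.5 * df$x2 + rnorm(100)

# Single predictor:    Y = beta_0 + beta_1 X + epsilon
fit_simple <- lm(y ~ x1, data = df)

# Multiple predictors: Y = beta_0 + beta_1 X1 + ... + beta_p Xp + epsilon
fit_multi <- lm(y ~ x1 + x2 + x3, data = df)

summary(fit_multi)   # estimated coefficients, R^2, etc.
```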
Warning
We assume a model:
\[ Y = \beta_0 + \beta_1 X + \epsilon , \]
where:
- \(\beta_0\) (intercept) and \(\beta_1\) (slope) are the true, unknown coefficients
- \(\epsilon\) is the error term
We want to estimate:
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]
where:
- \(\hat{y}\) is the predicted value of \(Y\) for a given \(x\)
- \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are our estimates of the true coefficients
Suppose you plot your data and suspect there is a linear relationship between X and Y.
How would you go about fitting a line to it?
A line right through the “centre of gravity” of the cloud of data.
There are multiple ways to estimate the coefficients.
The key results are:
\(\hat{\beta}_1 = \frac{\text{How much X and Y move together}}{\text{How much X varies}}\)
\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)
Translation:
- \(\hat{\beta}_1\) (slope): how much \(Y\) changes, on average, when \(X\) increases by one unit
- \(\hat{\beta}_0\) (intercept): positions the line so that it passes through the point \((\bar{x}, \bar{y})\)
Residuals are the distances from each data point to this line.
\(e_i = y_i - \hat{y}_i\) represents the \(i\)th residual
Observed vs. Predicted
From this, we can define the Residual Sum of Squares (RSS) as
\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]
or equivalently as
\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]
Note
The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.
Why squared residuals? Three good reasons:
- positive and negative residuals do not cancel each other out
- larger errors are penalized more heavily
- the squared function is smooth and differentiable, which makes the minimization straightforward
This gives us the RSS (Residual Sum of Squares):
\[\text{RSS} = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\]
We treat this as an optimisation problem. We want to minimize RSS:
\[ \begin{align} \min \mathrm{RSS} =& \sum_i^n{e_i^2} \\ =& \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ =& \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
To find \(\hat{\beta}_0\), we have to solve the following partial derivative:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]
… which will lead you to:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]
where we made use of the sample means \(\bar{x} = \frac{1}{n}\sum_i^n{x_i}\) and \(\bar{y} = \frac{1}{n}\sum_i^n{y_i}\).
\[ \begin{align} 0 &= \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{-2 (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= -2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $-2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div (-2)) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]
\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]
Similarly, to find \(\hat{\beta}_1\) we solve:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]
… which will lead you to:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]
\[ \begin{align} 0 &= \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{\left(-2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= -2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $-2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div (-2)) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]
\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \sum_i^n{\left(\hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{separate sums})\\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \hat{\beta}_1\sum_i^n{\left(\bar{x}x_i - x_i^2\right)} & (\text{took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_i^n{\left(y_ix_i - \bar{y}x_i\right)}}{\sum_i^n{\left(x_i^2 - \bar{x}x_i\right)}} & (\text{isolate }\hat{\beta}_1) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & (\text{equivalent centred form}) ~~~ \blacksquare \end{align} \]
And that is how OLS works!
\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]
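These two formulas can be typed in directly. A minimal R sketch on simulated data (true coefficients chosen arbitrarily) that checks the closed-form estimates against `lm()`:

```r
# Simulated data with known true coefficients: beta_0 = 1, beta_1 = 2
set.seed(42)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

# Closed-form OLS estimates
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)   # should match lm() up to floating-point error
coef(lm(y ~ x))
```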
How do we know if our model is good? 🧐
Idea: What proportion of variance in Y does our model explain?
\(R^2 = 1 - \frac{\textrm{RSS}}{\textrm{TSS}}\)
Where:
- \(\textrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) is the residual sum of squares (variation the model fails to explain)
- \(\textrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2\) is the total sum of squares (total variation in \(Y\))
Interpretation: \(R^2\) ranges from 0 to 1; an \(R^2\) of 0.75 means the model explains 75% of the variance in \(Y\).
Higher R² is better, but:
- it never decreases when you add more predictors, even useless ones
- a high R² does not guarantee good predictions on new data, nor a causal relationship
Bottom line: R² is useful but not the whole story
Idea: Adjusts R² for the number of predictors, penalizing overly complex models.
\(\textrm{Adjusted } R^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\)
Where:
- \(n\) is the number of observations
- \(p\) is the number of predictors
Interpretation: Adjusted R² only increases if a new predictor improves the fit by more than chance alone would; adding useless predictors makes it go down.
Bottom line: Adjusted R² gives a more honest view when comparing models with different numbers of predictors
Idea: Average size of prediction errors (same units as Y)
\(\textrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\)
Example: an RMSE of 5 means predictions are typically about 5 units away from the actual values.
Interpretation: lower is better; because errors are squared before averaging, a few large errors can dominate the metric.
Idea: Average absolute size of errors
\(\textrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)
Example: an MAE of 5 means predictions are, on average, 5 units away from the actual values.
Comparison to RMSE: MAE weighs all errors equally, whereas RMSE penalizes large errors more, so MAE is less sensitive to outliers.
Idea: Average percentage deviation between prediction and actual
\(\textrm{MAPE} = \frac{100}{n} \sum_{i=1}^n \frac{|y_i - \hat{y}_i|}{|y_i|}\)
Example:
Predicting daily sales: compute the percentage error for each day, then average these % errors across all days → MAPE
Caution: MAPE is undefined (or explodes) when actual values are at or near zero, and it treats over- and under-predictions asymmetrically.
| Metric | When to Use |
|---|---|
| R² / Adjusted R² | Understand proportion of variance explained; compare models |
| RMSE | When large errors are especially bad (penalize outliers) |
| MAE | When all errors are equally important; robust to outliers |
| MAPE | When you want relative errors in percentage; avoid if actuals near 0 |
Tip: Report multiple metrics to get a full picture of model performance
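A minimal R sketch computing all five metrics for one fitted model (simulated data; for simplicity the metrics are computed on the training data here, while evaluation on held-out data comes later in the course):

```r
# Simulate data and fit a linear model (x2 is pure noise)
set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 5 + 2 * x1 + rnorm(n)
fit   <- lm(y ~ x1 + x2)
y_hat <- fitted(fit)
p     <- 2                       # number of predictors

rss <- sum((y - y_hat)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
rmse   <- sqrt(mean((y - y_hat)^2))
mae    <- mean(abs(y - y_hat))
mape   <- 100 * mean(abs(y - y_hat) / abs(y))   # beware: unstable if y is near 0

round(c(R2 = r2, Adj_R2 = adj_r2, RMSE = rmse, MAE = mae, MAPE = mape), 3)
```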
Scenario: predicting house prices, using both size in square feet (sqft) and size in square metres (sqmeters) as predictors
What happens?
The predictors are perfectly correlated—they measure the same thing.
OLS gets confused! Coefficients become unstable.
One size measure can end up with a huge positive coefficient and the other with a huge negative one, effectively cancelling out. This makes no sense!
We need a better approach…
Scenario: Predicting student test scores
You measure 20 things: study hours, sleep, practice tests, breakfast, exercise, stress, motivation, coffee intake, social media time, …
But: Only 3 actually matter for test scores
OLS: Keeps all 20 variables, hard to interpret, prone to overfitting
Extreme case: \(p > n\)
Common in modern data: genomics (thousands of genes, a few hundred patients), text data, image data
Total Error = Bias² + Variance + Irreducible Error
Bias: How far off is our model on average?
Variance: How much does our model change with different data?
OLS is unbiased (under its assumptions), but it can have high variance, for example when predictors are strongly correlated or \(p\) is close to \(n\)
Key insight: Sometimes accepting a little bias can dramatically reduce variance!
This is what regularization does.
Shrinking Coefficients to Reduce Variance
Modify the objective function:
Instead of just minimizing RSS, minimize:
\[\text{RSS} + \lambda \sum_{j=1}^{p}\beta_j^2\]
What this does:
- still tries to fit the data well (the RSS part)
- but also penalizes large coefficients (the \(\lambda\) penalty part)
- forces coefficients to be smaller
λ (lambda) controls the strength of penalization:
- λ = 0: no penalty, back to ordinary OLS
- small λ: mild shrinkage
- large λ: coefficients pushed strongly towards zero
Challenge: Choosing the right λ (more on this later!)
Recall our house price problem with sqft and sqmeters:
OLS (unstable): huge coefficients with opposite signs on sqft and sqmeters, changing wildly from sample to sample
Ridge (λ = 1000): small, stable coefficients that share the effect of size across the two measures
Much better!
Constraint interpretation:
Ridge finds coefficients that minimize RSS, subject to:
\[\sum_{j=1}^{p}\beta_j^2 \leq s\]
✅ Handles multicollinearity well
✅ Reduces variance by shrinking coefficients
✅ Always has a solution (even when \(p > n\))
✅ All variables stay in the model
❌ Doesn’t select variables (nothing is exactly zero)
When to use: You think most predictors are relevant but want to reduce overfitting
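A minimal sketch of ridge regression in R using the glmnet package (assuming it is installed), on simulated data that mimics the "square feet and square metres" problem; variable names and the λ values are illustrative:

```r
library(glmnet)

# Two predictors that measure the same thing in different units
set.seed(1)
n     <- 100
sqft  <- runif(n, 500, 3000)
sqm   <- sqft * 0.0929 + rnorm(n, sd = 1)            # almost perfectly correlated
price <- 50000 + 200 * sqft + rnorm(n, sd = 20000)
X <- cbind(sqft = sqft, sqm = sqm)

# alpha = 0 selects the ridge penalty (sum of squared coefficients);
# we supply a small grid of lambda values to inspect
ridge_fit <- glmnet(X, price, alpha = 0, lambda = c(10000, 1000, 100, 10))

# Coefficients at lambda = 1000; choosing lambda properly
# (via cross-validation) comes later in the course
coef(ridge_fit, s = 1000)
```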
Shrinking AND Selecting
Different penalty:
Instead of squaring coefficients, use absolute values:
\[\text{RSS} + \lambda \sum_{j=1}^{p}|\beta_j|\]
Subtle but crucial difference:
Ridge uses \(\beta_j^2\) → shrinks smoothly
Lasso uses \(|\beta_j|\) → shrinks AND can set coefficients exactly to zero
Scenario: 20 predictors, but only 3 truly matter
OLS: Keeps all 20 (noisy, hard to interpret)
Ridge: Shrinks all 20 but keeps them all
Lasso (λ = 0.5): the 3 relevant predictors keep non-zero coefficients, while the other 17 are set exactly to zero
Perfect! Lasso identified the 3 truly important predictors.
✅ Shrinks coefficients toward zero
✅ Selects variables (sets some to exactly zero)
✅ Creates sparse models (easy to interpret)
✅ Works when \(p > n\)
❌ Can be unstable with correlated predictors
When to use: You suspect only a few predictors truly matter
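A minimal lasso sketch in R with glmnet, on simulated data where only 3 of 20 predictors truly matter (all names, coefficients, and the λ value are illustrative):

```r
library(glmnet)

# 20 predictors, but only x1, x2 and x3 actually affect y
set.seed(2)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- 3 * X[, 1] - 2 * X[, 2] + 1.5 * X[, 3] + rnorm(n)

# alpha = 1 selects the lasso penalty (sum of absolute coefficients)
lasso_fit <- glmnet(X, y, alpha = 1)

# At a moderate lambda, most of the noise coefficients are typically exactly zero
coef(lasso_fit, s = 0.1)
```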
Using the 20-predictor student score data:
Too small λ: Keeps noise (overfitting)
Too large λ: Loses signal (underfitting)
Just right λ: Finds the true signal
Combining Ridge and Lasso
Example: temperature measured in Celsius, Fahrenheit, and Kelvin
These all measure the same thing in different units!
Lasso’s behavior: Arbitrarily picks one, zeros out the others
Problem: Selection is unstable, we lose information
Lasso results: only one of the three temperature variables keeps a non-zero coefficient
Which one survives is somewhat random
Combines both penalties:
\[\text{RSS} + \lambda\sum_{j=1}^{p}\left[\alpha|\beta_j| + (1-\alpha)\beta_j^2\right]\]
Two tuning parameters:
- λ controls the overall strength of the penalty
- α controls the mix: α = 1 is pure Lasso, α = 0 is pure Ridge
Typical choice: α = 0.5 (equal mix)
Elastic Net results (α = 0.5): the correlated temperature variables are kept together, each with a smaller, shared coefficient
Much better!
✅ Combines benefits of Ridge and Lasso
✅ Handles correlated predictors (groups them)
✅ Performs variable selection (sparse models)
✅ More stable than Lasso alone
✅ Works when \(p > n\)
When to use: Correlated predictors + want sparse model
Often a safe default choice!
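A minimal elastic net sketch in R with glmnet, on simulated data with correlated "temperature" measurements plus pure noise variables (names, coefficients, and the λ value are illustrative):

```r
library(glmnet)

# Three measurements of the same temperature, plus three noise variables
set.seed(3)
n      <- 200
temp_c <- rnorm(n, mean = 20, sd = 5)
temp_f <- temp_c * 9 / 5 + 32 + rnorm(n, sd = 0.1)
temp_k <- temp_c + 273.15 + rnorm(n, sd = 0.1)
X <- cbind(temp_c, temp_f, temp_k,
           noise1 = rnorm(n), noise2 = rnorm(n), noise3 = rnorm(n))
y <- 2 * temp_c + rnorm(n)

# alpha = 0.5 mixes the lasso (alpha = 1) and ridge (alpha = 0) penalties
enet_fit <- glmnet(X, y, alpha = 0.5)

# The correlated temperature variables tend to be kept (and shrunk) together,
# while the noise variables tend to be dropped as lambda grows
coef(enet_fit, s = 0.5)
```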
With 3 temperature measures + 2 distance measures + 3 noise variables:
- Ridge: keeps all 8 variables (small coefficients for the noise)
- Lasso: keeps temp_c and dist_km (arbitrarily picks one variable from each correlated group)
- Elastic Net: keeps temp_c, temp_f, temp_k, dist_km, dist_mi (keeps the groups, removes the noise)
Use OLS when: you have few predictors, many more observations than predictors, little multicollinearity, and you want classical interpretation and inference
Use Ridge when: you think most predictors are relevant but they are correlated, and you want to reduce overfitting without dropping variables
Use Lasso when: you suspect only a few predictors truly matter and you want a sparse, interpretable model
Use Elastic Net when: you have correlated predictors and still want a sparse model, or you are unsure; it is often a safe default
In practice: Try several methods and compare!
The big question: How to choose λ (and α)?
For now: treat λ (and α) as dials to experiment with, and compare how the coefficients and fit change across a few values
Later in the course (Week 4):
You’ll learn cross-validation—a systematic method for choosing optimal tuning parameters
As λ increases:
- all coefficients shrink towards zero
- with Lasso and Elastic Net, more and more coefficients become exactly zero
Useful insight: Variables that persist at higher λ are more important!
In R: plot(glmnet_model) shows these paths
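A minimal sketch of such a coefficient path plot, reusing the simulated lasso setting from earlier (`xvar = "lambda"` puts log(λ) on the x-axis and `label = TRUE` labels each variable's path):

```r
library(glmnet)

# Simulated data: only the first 3 of 20 predictors matter
set.seed(2)
X <- matrix(rnorm(200 * 20), 200, 20)
y <- 3 * X[, 1] - 2 * X[, 2] + 1.5 * X[, 3] + rnorm(200)

lasso_fit <- glmnet(X, y, alpha = 1)

# One line per coefficient: as lambda increases, coefficients shrink to zero;
# the variables whose paths stay away from zero the longest matter most
plot(lasso_fit, xvar = "lambda", label = TRUE)
```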
Next week: how to revise for this course.
LSE DS202 2025/26 Autumn Term