Theme: Supervised Learning
09 Oct 2025

Important
🥐 Important releases this week
Your first formative is now available!
Deadline: Week 04 (October 23rd 5pm)
Submission: GitHub Classroom
| Traditional Programming | Machine Learning |
|---|---|
| Humans define explicit rules | Algorithm learns/infers rules from data |
| Inputs + Rules → Output | Inputs + Outputs → Model/Rules |
| Deterministic, well-defined tasks | Can handle noisy/complex tasks |
Consider this sequence:
3, 6, 9, 12, 15, 18, 21, 24, …
What comes next?
Note
Most people guess: 27
Why? You recognized the pattern: “add 3 each time”
This is learning! You extracted a rule from examples.
The next number can be represented as a function, \(f(\ )\), of the previous one:
\[ \operatorname{next number} = f(\operatorname{previous number}) \]
Or, let’s say, as a function of the position of the number in the sequence:
| Position | Number |
|---|---|
| 1 | 3 |
| 2 | 6 |
| 3 | 9 |
| 4 | 12 |
| 5 | 15 |
| 6 | 18 |
| 7 | 21 |
| 8 | 24 |
We can express this as a function:
\[\text{Number} = 3 \times \text{Position}\]
Or more generally:
\[Y = f(X)\]
Machine Learning is about finding the best function \(f\) that captures patterns in data.
👈🏻 Typically, we use a tabular format like this to represent our data when doing ML.
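To make this concrete, here is a minimal R sketch (variable names are illustrative) that stores the table above in exactly this tabular format and lets `lm()` recover the rule \(\text{Number} = 3 \times \text{Position}\):

```r
# The sequence as a table of (Position, Number) pairs
seq_data <- data.frame(
  position = 1:8,
  number   = c(3, 6, 9, 12, 15, 18, 21, 24)
)

# Learn f from the examples: a simple linear model Number ~ Position
fit <- lm(number ~ position, data = seq_data)
coef(fit)
# The intercept is (numerically) 0 and the slope is 3,
# i.e. lm() recovers the rule Number = 3 * Position
```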
Visit OEIS.org and search for: 3, 6, 9, 12, 15, 18, 21, 24
Result: 19 different sequences match these numbers!
They all start the same but diverge later.
“All models are wrong, but some are useful.”
— George Box
The goal: Find a useful model, not a perfect one
Statistics asks: “What is the relationship?” Machine Learning asks: “How well can we predict?”
Both are valuable — and complementary.
| Traditional Statistics | Machine Learning |
|---|---|
| Explain relationships | Make predictions |
| Test hypotheses | Optimize performance |
| Understand causality | Forecast outcomes |
| “Why does X affect Y?” | “How well can we predict Y?” |
Both are valuable! They complement each other.
Supervised Learning
Have labeled data: both inputs (X) and outputs (Y)
Goal: Learn to predict Y from X
Example: predicting a house's sale price from its size, location, and number of rooms
Unsupervised Learning
Have only inputs (X), no labels
Goal: Find hidden structure or patterns
Example: grouping customers into segments with similar behaviour, without any pre-defined labels
Today’s focus: Supervised Learning
We observe pairs of data:
\[(\text{Input}_1, \text{Output}_1), (\text{Input}_2, \text{Output}_2), \ldots\]
We assume there’s a relationship:
\[Y = f(X) + \varepsilon\]
Where:
- \(f\) is the unknown, true relationship between \(X\) and \(Y\)
- \(\varepsilon\) is random noise (the irreducible error) that the model cannot explain
Find an approximation \(\hat{f}\) such that:
\[\hat{Y} = \hat{f}(X)\]
is as close as possible to the true \(Y\)
Linear Regression is the simplest way to find \(\hat{f}\)
Note
Example
\[ \textrm{Happiness} = f(\textrm{GNI}, \textrm{Health}, \textrm{Education}, \textrm{Freedom}, \textrm{Social Support}, \ldots) \]
We’ll use this real-world example to illustrate supervised learning through linear regression later today.
You can have multiple columns of \(X\) values 👉
| X1 | X2 | X3 | X4 | … | Y |
|---|---|---|---|---|---|
| 1 | 2 | 3 | 10 | … | 3 |
| 2 | 4 | 6 | 20 | … | 6 |
| 3 | 6 | 9 | 30 | … | 9 |
| 4 | 8 | 12 | 40 | … | 12 |
| 5 | 10 | 15 | 50 | … | 15 |
If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.
You can still use ML to find patterns in the data, a process known as unsupervised learning.
Linear regression is the simplest approach to supervised learning.
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, \dots, X_p\).
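Both forms map directly onto R's `lm()` formula interface. A minimal sketch on simulated data (all names and coefficient values are illustrative):

```r
# Simulate a small dataset: x3 has no real effect on y
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 2 + 1.5 * df$x1 - 0.5 * df$x2 + rnorm(100)

# Single predictor:    Y = beta_0 + beta_1 X + epsilon
fit_simple <- lm(y ~ x1, data = df)

# Multiple predictors: Y = beta_0 + beta_1 X1 + ... + beta_p Xp + epsilon
fit_multi <- lm(y ~ x1 + x2 + x3, data = df)

summary(fit_multi)   # estimated coefficients, R^2, etc.
```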
Warning
We assume a model:
\[ Y = \beta_0 + \beta_1 X + \epsilon , \]
where:
- \(\beta_0\) (intercept) and \(\beta_1\) (slope) are the true, unknown coefficients
- \(\epsilon\) is the error term
We want to estimate:
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]
where:
- \(\hat{y}\) is the predicted value of \(Y\) for a given \(x\)
- \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are our estimates of the true coefficients
Suppose you plot your data and suspect there is a linear relationship between X and Y.
How would you go about fitting a line to it?
A line right through the “centre of gravity” of the cloud of data.
There are multiple ways to estimate the coefficients.
The key results are:
\(\hat{\beta}_1 = \frac{\text{How much X and Y move together}}{\text{How much X varies}}\)
\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)
Translation:
- \(\hat{\beta}_1\) (slope): how much \(Y\) changes, on average, when \(X\) increases by one unit
- \(\hat{\beta}_0\) (intercept): positions the line so that it passes through the point \((\bar{x}, \bar{y})\)
Residuals are the distances from each data point to this line.
\(e_i = y_i - \hat{y}_i\) represents the \(i\)th residual
Observed vs. Predicted
From this, we can define the Residual Sum of Squares (RSS) as
\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]
or equivalently as
\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]
Note
The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.
Why squared residuals? Three good reasons:
- positive and negative residuals do not cancel each other out
- larger errors are penalized more heavily
- the squared function is smooth and differentiable, which makes the minimization straightforward
This gives us the RSS (Residual Sum of Squares):
\[\text{RSS} = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\]
We treat this as an optimisation problem. We want to minimize RSS:
\[ \begin{align} \min \mathrm{RSS} =& \sum_i^n{e_i^2} \\ =& \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ =& \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
To find \(\hat{\beta}_0\), we have to solve the following partial derivative:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]
… which will lead you to:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]
where we made use of the sample means \(\bar{x} = \frac{1}{n}\sum_i^n{x_i}\) and \(\bar{y} = \frac{1}{n}\sum_i^n{y_i}\).
\[ \begin{align} 0 &= \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{-2 (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= -2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $-2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div (-2)) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]
\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]
Similarly, to find \(\hat{\beta}_1\) we solve:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]
… which will lead you to:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]
\[ \begin{align} 0 &= \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{\left(-2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= -2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $-2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div (-2)) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]
\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \sum_i^n{\left(\hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{separate sums})\\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \hat{\beta}_1\sum_i^n{\left(\bar{x}x_i - x_i^2\right)} & (\text{took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_i^n{\left(y_ix_i - \bar{y}x_i\right)}}{\sum_i^n{\left(x_i^2 - \bar{x}x_i\right)}} & (\text{isolate }\hat{\beta}_1) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & (\text{equivalent centred form}) ~~~ \blacksquare \end{align} \]
And that is how OLS works!
\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]
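These two formulas can be typed in directly. A minimal R sketch on simulated data (true coefficients chosen arbitrarily) that checks the closed-form estimates against `lm()`:

```r
# Simulated data with known true coefficients: beta_0 = 1, beta_1 = 2
set.seed(42)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

# Closed-form OLS estimates
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)   # should match lm() up to floating-point error
coef(lm(y ~ x))
```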
How do we know if our model is good? 🧐
Idea: What proportion of variance in Y does our model explain?
\(R^2 = 1 - \frac{\textrm{RSS}}{\textrm{TSS}}\)
Where:
- \(\textrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) is the residual sum of squares (variation the model fails to explain)
- \(\textrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2\) is the total sum of squares (total variation in \(Y\))
Interpretation: \(R^2\) ranges from 0 to 1; an \(R^2\) of 0.75 means the model explains 75% of the variance in \(Y\).
Higher R² is better, but:
- it never decreases when you add more predictors, even useless ones
- a high R² does not guarantee good predictions on new data, nor a causal relationship
Bottom line: R² is useful but not the whole story
Idea: Adjusts R² for the number of predictors, penalizing overly complex models.
\(\textrm{Adjusted } R^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\)
Where:
- \(n\) is the number of observations
- \(p\) is the number of predictors
Interpretation: Adjusted R² only increases if a new predictor improves the fit by more than chance alone would; adding useless predictors makes it go down.
Bottom line: Adjusted R² gives a more honest view when comparing models with different numbers of predictors
Idea: Average size of prediction errors (same units as Y)
\(\textrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\)
Example: an RMSE of 5 means predictions are typically about 5 units away from the actual values.
Interpretation: lower is better; because errors are squared before averaging, a few large errors can dominate the metric.
Idea: Average absolute size of errors
\(\textrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)
Example: an MAE of 5 means predictions are, on average, 5 units away from the actual values.
Comparison to RMSE: MAE weighs all errors equally, whereas RMSE penalizes large errors more, so MAE is less sensitive to outliers.
Idea: Average percentage deviation between prediction and actual
\(\textrm{MAPE} = \frac{100}{n} \sum_{i=1}^n \frac{|y_i - \hat{y}_i|}{|y_i|}\)
Example:
Predicting daily sales: compute the percentage error for each day, then average these % errors across all days → MAPE
Caution: MAPE is undefined (or explodes) when actual values are at or near zero, and it treats over- and under-predictions asymmetrically.
| Metric | When to Use |
|---|---|
| R² / Adjusted R² | Understand proportion of variance explained; compare models |
| RMSE | When large errors are especially bad (penalize outliers) |
| MAE | When all errors are equally important; robust to outliers |
| MAPE | When you want relative errors in percentage; avoid if actuals near 0 |
Tip: Report multiple metrics to get a full picture of model performance
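A minimal R sketch computing all five metrics for one fitted model (simulated data; for simplicity the metrics are computed on the training data here, while evaluation on held-out data comes later in the course):

```r
# Simulate data and fit a linear model (x2 is pure noise)
set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 5 + 2 * x1 + rnorm(n)
fit   <- lm(y ~ x1 + x2)
y_hat <- fitted(fit)
p     <- 2                       # number of predictors

rss <- sum((y - y_hat)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
rmse   <- sqrt(mean((y - y_hat)^2))
mae    <- mean(abs(y - y_hat))
mape   <- 100 * mean(abs(y - y_hat) / abs(y))   # beware: unstable if y is near 0

round(c(R2 = r2, Adj_R2 = adj_r2, RMSE = rmse, MAE = mae, MAPE = mape), 3)
```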
Scenario: predicting house prices, using both size in square feet (sqft) and size in square metres (sqmeters) as predictors
What happens?
The predictors are perfectly correlated—they measure the same thing.
OLS gets confused! Coefficients become unstable.
One size measure can end up with a huge positive coefficient and the other with a huge negative one, effectively cancelling out. This makes no sense!
We need a better approach…
Scenario: Predicting student test scores
You measure 20 things: study hours, sleep, practice tests, breakfast, exercise, stress, motivation, coffee intake, social media time, …
But: Only 3 actually matter for test scores
OLS: Keeps all 20 variables, hard to interpret, prone to overfitting
Extreme case: \(p > n\)
Common in modern data: genomics (thousands of genes, a few hundred patients), text data, image data
Total Error = Bias² + Variance + Irreducible Error
Bias: How far off is our model on average?
Variance: How much does our model change with different data?
OLS is unbiased (under its assumptions), but it can have high variance, for example when predictors are strongly correlated or \(p\) is close to \(n\)
Key insight: Sometimes accepting a little bias can dramatically reduce variance!
This is what regularization does.
Shrinking Coefficients to Reduce Variance
Modify the objective function:
Instead of just minimizing RSS, minimize:
\[\text{RSS} + \lambda \sum_{j=1}^{p}\beta_j^2\]
What this does:
- still tries to fit the data well (the RSS part)
- but also penalizes large coefficients (the \(\lambda\) penalty part)
- forces coefficients to be smaller
λ (lambda) controls the strength of penalization:
- λ = 0: no penalty, back to ordinary OLS
- small λ: mild shrinkage
- large λ: coefficients pushed strongly towards zero
Challenge: Choosing the right λ (more on this later!)
Recall our house price problem with sqft and sqmeters:
OLS (unstable): huge coefficients with opposite signs on sqft and sqmeters, changing wildly from sample to sample
Ridge (λ = 1000): small, stable coefficients that share the effect of size across the two measures
Much better!
Constraint interpretation:
Ridge finds coefficients that minimize RSS, subject to:
\[\sum_{j=1}^{p}\beta_j^2 \leq s\]
✅ Handles multicollinearity well
✅ Reduces variance by shrinking coefficients
✅ Always has a solution (even when \(p > n\))
✅ All variables stay in the model
❌ Doesn’t select variables (nothing is exactly zero)
When to use: You think most predictors are relevant but want to reduce overfitting
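A minimal sketch of ridge regression in R using the glmnet package (assuming it is installed), on simulated data that mimics the "square feet and square metres" problem; variable names and the λ values are illustrative:

```r
library(glmnet)

# Two predictors that measure the same thing in different units
set.seed(1)
n     <- 100
sqft  <- runif(n, 500, 3000)
sqm   <- sqft * 0.0929 + rnorm(n, sd = 1)            # almost perfectly correlated
price <- 50000 + 200 * sqft + rnorm(n, sd = 20000)
X <- cbind(sqft = sqft, sqm = sqm)

# alpha = 0 selects the ridge penalty (sum of squared coefficients);
# we supply a small grid of lambda values to inspect
ridge_fit <- glmnet(X, price, alpha = 0, lambda = c(10000, 1000, 100, 10))

# Coefficients at lambda = 1000; choosing lambda properly
# (via cross-validation) comes later in the course
coef(ridge_fit, s = 1000)
```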
Shrinking AND Selecting
Different penalty:
Instead of squaring coefficients, use absolute values:
\[\text{RSS} + \lambda \sum_{j=1}^{p}|\beta_j|\]
Subtle but crucial difference:
Ridge uses \(\beta_j^2\) → shrinks smoothly
Lasso uses \(|\beta_j|\) → shrinks AND can set coefficients exactly to zero
Scenario: 20 predictors, but only 3 truly matter
OLS: Keeps all 20 (noisy, hard to interpret)
Ridge: Shrinks all 20 but keeps them all
Lasso (λ = 0.5): the 3 relevant predictors keep non-zero coefficients, while the other 17 are set exactly to zero
Perfect! Lasso identified the 3 truly important predictors.
✅ Shrinks coefficients toward zero
✅ Selects variables (sets some to exactly zero)
✅ Creates sparse models (easy to interpret)
✅ Works when \(p > n\)
❌ Can be unstable with correlated predictors
When to use: You suspect only a few predictors truly matter
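A minimal lasso sketch in R with glmnet, on simulated data where only 3 of 20 predictors truly matter (all names, coefficients, and the λ value are illustrative):

```r
library(glmnet)

# 20 predictors, but only x1, x2 and x3 actually affect y
set.seed(2)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- 3 * X[, 1] - 2 * X[, 2] + 1.5 * X[, 3] + rnorm(n)

# alpha = 1 selects the lasso penalty (sum of absolute coefficients)
lasso_fit <- glmnet(X, y, alpha = 1)

# At a moderate lambda, most of the noise coefficients are typically exactly zero
coef(lasso_fit, s = 0.1)
```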
Using the 20-predictor student score data:
Too small λ: Keeps noise (overfitting)
Too large λ: Loses signal (underfitting)
Just right λ: Finds the true signal
Combining Ridge and Lasso
Example: temperature measured in Celsius, Fahrenheit, and Kelvin
These all measure the same thing in different units!
Lasso’s behavior: Arbitrarily picks one, zeros out the others
Problem: Selection is unstable, we lose information
Lasso results: only one of the three temperature variables keeps a non-zero coefficient
Which one survives is somewhat random
Combines both penalties:
\[\text{RSS} + \lambda\sum_{j=1}^{p}\left[\alpha|\beta_j| + (1-\alpha)\beta_j^2\right]\]
Two tuning parameters:
- λ controls the overall strength of the penalty
- α controls the mix: α = 1 is pure Lasso, α = 0 is pure Ridge
Typical choice: α = 0.5 (equal mix)
Elastic Net results (α = 0.5): the correlated temperature variables are kept together, each with a smaller, shared coefficient
Much better!
✅ Combines benefits of Ridge and Lasso
✅ Handles correlated predictors (groups them)
✅ Performs variable selection (sparse models)
✅ More stable than Lasso alone
✅ Works when \(p > n\)
When to use: Correlated predictors + want sparse model
Often a safe default choice!
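A minimal elastic net sketch in R with glmnet, on simulated data with correlated "temperature" measurements plus pure noise variables (names, coefficients, and the λ value are illustrative):

```r
library(glmnet)

# Three measurements of the same temperature, plus three noise variables
set.seed(3)
n      <- 200
temp_c <- rnorm(n, mean = 20, sd = 5)
temp_f <- temp_c * 9 / 5 + 32 + rnorm(n, sd = 0.1)
temp_k <- temp_c + 273.15 + rnorm(n, sd = 0.1)
X <- cbind(temp_c, temp_f, temp_k,
           noise1 = rnorm(n), noise2 = rnorm(n), noise3 = rnorm(n))
y <- 2 * temp_c + rnorm(n)

# alpha = 0.5 mixes the lasso (alpha = 1) and ridge (alpha = 0) penalties
enet_fit <- glmnet(X, y, alpha = 0.5)

# The correlated temperature variables tend to be kept (and shrunk) together,
# while the noise variables tend to be dropped as lambda grows
coef(enet_fit, s = 0.5)
```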
With 3 temperature measures + 2 distance measures + 3 noise variables:
- Ridge: keeps all 8 variables (small coefficients for the noise)
- Lasso: keeps temp_c and dist_km (arbitrarily picks one variable from each correlated group)
- Elastic Net: keeps temp_c, temp_f, temp_k, dist_km, dist_mi (keeps the groups, removes the noise)
Use OLS when: you have few predictors, many more observations than predictors, little multicollinearity, and you want classical interpretation and inference
Use Ridge when: you think most predictors are relevant but they are correlated, and you want to reduce overfitting without dropping variables
Use Lasso when: you suspect only a few predictors truly matter and you want a sparse, interpretable model
Use Elastic Net when: you have correlated predictors and still want a sparse model, or you are unsure; it is often a safe default
In practice: Try several methods and compare!
The big question: How to choose λ (and α)?
For now: treat λ (and α) as dials to experiment with, and compare how the coefficients and fit change across a few values
Later in the course (Week 4):
You’ll learn cross-validation—a systematic method for choosing optimal tuning parameters
As λ increases:
- all coefficients shrink towards zero
- with Lasso and Elastic Net, more and more coefficients become exactly zero
Useful insight: Variables that persist at higher λ are more important!
In R: plot(glmnet_model) shows these paths
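A minimal sketch of such a coefficient path plot, reusing the simulated lasso setting from earlier (`xvar = "lambda"` puts log(λ) on the x-axis and `label = TRUE` labels each variable's path):

```r
library(glmnet)

# Simulated data: only the first 3 of 20 predictors matter
set.seed(2)
X <- matrix(rnorm(200 * 20), 200, 20)
y <- 3 * X[, 1] - 2 * X[, 2] + 1.5 * X[, 3] + rnorm(200)

lasso_fit <- glmnet(X, y, alpha = 1)

# One line per coefficient: as lambda increases, coefficients shrink to zero;
# the variables whose paths stay away from zero the longest matter most
plot(lasso_fit, xvar = "lambda", label = TRUE)
```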
Next week: how to revise for this course.
LSE DS202 2025/26 Autumn Term