🗓️ Week 02:
Introduction to Regression Algorithms

Theme: Supervised Learning

09 Oct 2025

Quick announcements

Sign in here

Important

🥐 Important releases this week

Your first formative is now available!

Deadline: Week 04 (October 23rd 5pm)

Submission: GitHub Classroom

What is Machine Learning?

Traditional Programming vs Machine Learning


Traditional Programming           | Machine Learning
Humans define explicit rules      | Algorithm learns/infers rules from data
Inputs + Rules → Output           | Inputs + Outputs → Model/Rules
Deterministic, well-defined tasks | Can handle noisy/complex tasks


  • “To learn” in ML usually carries a specific meaning: to capture patterns in data (think trends, correlations, associations, etc.) and use them to make predictions or to support decisions.
  • Different from traditional statistics, which is more focused on inference (i.e. testing hypotheses).

Learning from Patterns: A Simple Example

Consider this sequence:

3, 6, 9, 12, 15, 18, 21, 24, …

What comes next?


Note

Most people guess: 27

Why? You recognized the pattern: “add 3 each time”


This is learning! You extracted a rule from examples.

Formalizing the Pattern

The next number can be represented as a function, \(f(\cdot)\), of the previous one:

\[ \text{next number} = f(\text{previous number}) \]

Or, let’s say, as a function of the position of the number in the sequence:

Position | Number
1        | 3
2        | 6
3        | 9
4        | 12
5        | 15
6        | 18
7        | 21
8        | 24

We can express this as a function:

\[\text{Number} = 3 \times \text{Position}\]

Or more generally:

\[Y = f(X)\]

Machine Learning is about finding the best function \(f\) that captures patterns in data.

👈🏻 Typically, we use a tabular format like this to represent our data when doing ML.
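For instance, here is a minimal R sketch (variable names are illustrative) that stores the sequence in this tabular format and lets lm() recover the “add 3” rule:

# The sequence from the slide in tabular (Position, Number) form
sequence_df <- data.frame(
  position = 1:8,
  number   = c(3, 6, 9, 12, 15, 18, 21, 24)
)

# Let lm() learn the rule Number = f(Position) from the examples
fit <- lm(number ~ position, data = sequence_df)
coef(fit)  # intercept ≈ 0, slope = 3, i.e. Number = 3 × Position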

But… Are We Sure About the Pattern?


Visit OEIS.org and search for: 3, 6, 9, 12, 15, 18, 21, 24

Result: 19 different sequences match these numbers!

They all start the same but diverge later.


“All models are wrong, but some are useful.”
— George Box

The goal: Find a useful model, not a perfect one

A Light Touch of Philosophy

Statistics asks: “What is the relationship?” Machine Learning asks: “How well can we predict?”

Both are valuable — and complementary.

Machine Learning vs Traditional Statistics



Traditional Statistics      | Machine Learning
Explain relationships       | Make predictions
Test hypotheses             | Optimize performance
Understand causality        | Forecast outcomes
“Why does X affect Y?”      | “How well can we predict Y?”


Both are valuable! They complement each other.

ML: two main paradigms


Supervised Learning

  • Have labeled data: both inputs (X) and outputs (Y)

  • Goal: Learn to predict Y from X

  • Example:

    • Predict salary from education, experience, location
    • Predict stock prices
    • Predict temperatures based on environmental factors

Unsupervised Learning

  • Have only inputs (X), no labels

  • Goal: Find hidden structure or patterns

  • Example:

    • Group customers by behavior
    • Group patients by response to treatment
    • Find patients with atypical presentations of a disease

Today’s focus: Supervised Learning

The Supervised Learning Framework

We observe pairs of data:

\[(\text{Input}_1, \text{Output}_1), (\text{Input}_2, \text{Output}_2), \ldots\]

We assume there’s a relationship:

\[Y = f(X) + \varepsilon\]

Where:

  • \(f\) is the (unknown) true relationship
  • \(\varepsilon\) is random noise we can’t predict

Our Goal in Supervised Learning

Find an approximation \(\hat{f}\) such that:

\[\hat{Y} = \hat{f}(X)\]

is as close as possible to the true \(Y\)

  • The “hat” (\(\hat{\phantom{x}}\)) means estimated or predicted
  • \(f\) is unknown: the best we can aim for is an approximation, i.e. a model

Linear Regression is the simplest way to find \(\hat{f}\)

Note

Example

\[ \textrm{Happiness} = f(\textrm{GNI}, \textrm{Health}, \textrm{Education}, \textrm{Freedom}, \textrm{Social Support}, \ldots) \]

We’ll use this real-world example to illustrate supervised learning through linear regression later today.
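As a rough preview, a model like this could be fitted in R along the following lines. The data frame whr, its column names, and the simulated values are hypothetical placeholders, purely to illustrate the lm() call:

set.seed(123)

# Hypothetical stand-in for the real happiness data we will use later:
# every column is simulated, just to make the example runnable
whr <- data.frame(
  gni = rnorm(100), health = rnorm(100), education = rnorm(100),
  freedom = rnorm(100), social_support = rnorm(100)
)
whr$happiness <- 5 + 0.5 * whr$gni + 0.3 * whr$social_support + rnorm(100)

fit <- lm(happiness ~ gni + health + education + freedom + social_support,
          data = whr)
summary(fit)  # the estimated coefficients define our approximation f-hat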

How does this approximation process work?

  • You have to come up with a suitable mathematical form for \(\hat{f}\).
    • Each ML algorithm will have its own way of doing this.
    • You could also come up with your own function if you are so inclined.
  • It’s likely that \(\hat{f}\) will have some parameters that you will need to estimate.
    • Instead of proposing \(\hat{f}(x) = 3x\), we say to ourselves:
      ‘I don’t know if 3 is the absolute best number here, maybe the data can tell me?’
    • We could then propose \(\hat{f}(x) = \beta x\) and set ourselves to find out the optimal value of \(\beta\) that ‘best’ fits the data.

How does this approximation process work? (cont.)

  • To train your model, i.e. to find the best value for the parameters, you need to feed your model with past data that contains both \(X\) and \(Y\) values.
  • You MUST have already collected ‘historical’ data that contains both \(X\) and \(Y\) values.
  • The model will then be able to predict \(Y\) values for any \(X\) values.

You can have multiple columns of \(X\) values 👉

X1 | X2 | X3 | X4 | Y
1  | 2  | 3  | 10 | 3
2  | 4  | 6  | 20 | 6
3  | 6  | 9  | 30 | 9
4  | 8  | 12 | 40 | 12
5  | 10 | 15 | 50 | 15



If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.


You can still use ML to find patterns in the data, a process known as unsupervised learning.

Linear Regression

The basic models

Linear regression is the simplest approach to supervised learning.

The generic supervised model:

\[ Y = \operatorname{f}(X) + \epsilon \]

is defined more explicitly as follows ➡️

Simple linear regression

\[ Y = \beta_0 + \beta_1 X + \epsilon, \]

when we use a single predictor, \(X\).

Multiple linear regression

\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]

when there are multiple predictors, \(X_1, X_2, \dots, X_p\).

Warning

  • Most real-life processes are not linear.
  • Still, linear regression is a good starting point for many problems.
  • Do you know the assumptions underlying linear models?

Linear Regression with a single predictor

We assume a model:

\[ Y = \beta_0 + \beta_1 X + \epsilon , \]

where:

  • \(\beta_0\): an unknown constant that represents the intercept of the line.
  • \(\beta_1\): an unknown constant that represents the slope of the line. It can also be interpreted as the average change in \(Y\) for a one-unit increase in \(X\).
  • \(\epsilon\): the random error term (irreducible)

Linear Regression with a single predictor

We want to estimate:

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]

where:

  • \(\hat{y}\): is a prediction of \(Y\) on the basis of \(X = x\).
  • \(\hat{\beta_0}\): is an estimate of the “true” \(\beta_0\).
  • \(\hat{\beta_1}\): is an estimate of the “true” \(\beta_1\).

Suppose you came across some data:

And you suspect there is a linear relationship between X and Y.

How would you go about fitting a line to it?

Does this line fit?

A line right through the “centre of gravity” of the cloud of data.

Different estimators, different equations

There are multiple ways to estimate the coefficients.

  • If you use different techniques, you might get different equations
  • The most common algorithm is called
    Ordinary Least Squares (OLS)
  • Alternative estimators (Karafiath 2009):
    • Least Absolute Deviation (LAD)
    • Weighted Least Squares (WLS)
    • Generalized Least Squares (GLS)
    • Heteroskedastic-Consistent (HC) variants

Algorithm: Ordinary Least Squares (OLS)

The OLS Solution (Intuition First!)


The key results are:

\(\hat{\beta}_1 = \frac{\text{How much X and Y move together}}{\text{How much X varies}}\)

\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)

Translation:

  • The slope captures the covariance between X and Y, scaled by variance of X
  • The intercept ensures the line passes through the average point \((\bar{x}, \bar{y})\)
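A minimal R sketch of this intuition, using simulated data (all numbers illustrative):

set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)                 # true intercept 2, true slope 3, plus noise

beta1_hat <- cov(x, y) / var(x)             # “move together” / “how much X varies”
beta0_hat <- mean(y) - beta1_hat * mean(x)  # forces the line through (x-bar, y-bar)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))                             # lm() returns the same OLS estimates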

The concept of residuals

Residuals are the distances from each data point to this line.

\(e_i = y_i-\hat{y}_i\) represents the \(i\)th residual.

Observed vs. Predicted

Residual Sum of Squares (RSS)

From this, we can define the Residual Sum of Squares (RSS) as

\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]

or equivalently as

\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]



Note

The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.

Why Square the Residuals?

Three good reasons:

  1. Positive and negative errors don’t cancel out
  2. Large errors are penalized more than small ones
  3. Mathematically convenient (smooth, differentiable)

This gives us the RSS (Residual Sum of Squares):

\[\text{RSS} = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]

OLS: objective function

We treat this as an optimisation problem. We want to minimize RSS:

\[ \begin{align} \min \mathrm{RSS} =& \sum_i^n{e_i^2} \\ =& \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ =& \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
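To see that OLS really is just this optimisation problem, here is a sketch that minimises the RSS numerically with optim() and recovers (essentially) the same coefficients as the closed-form solution:

set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)

# RSS as a function of candidate values (beta0, beta1)
rss <- function(beta, x, y) sum((y - beta[1] - beta[2] * x)^2)

# Minimise RSS numerically, starting from (0, 0)
fit_optim <- optim(par = c(0, 0), fn = rss, x = x, y = y)
fit_optim$par    # numerically minimised (beta0, beta1)
coef(lm(y ~ x))  # closed-form OLS: essentially the same values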

Estimating \(\hat{\beta}_0\)

To find \(\hat{\beta}_0\), we have to solve the following partial derivative:

\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} = 0 \]

… which will lead you to:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]

where we made use of the sample means:

  • \(\bar{y} \equiv \frac{1}{n} \sum_{i=1}^n y_i\)
  • \(\bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i\)

Full derivation

\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} & \\ 0 &= \sum_i^n{-2 (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= -2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $-2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div (-2)) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]

\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]

Estimating \(\hat{\beta}_1\)

Similarly, to find \(\hat{\beta}_1\) we solve:

\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} = 0 \]

… which will lead you to:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]

Full derivation

\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} & \\ 0 &= \sum_i^n{\left(-2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= -2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $-2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div (-2)) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]

\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \sum_i^n{\left(\hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{separate sums})\\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \hat{\beta}_1\sum_i^n{\left(\bar{x}x_i - x_i^2\right)} & (\text{took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & (\text{isolate }\hat{\beta}_1) ~~~ \blacksquare \end{align} \]

(The final step uses the identities \(\sum_i x_i(y_i - \bar{y}) = \sum_i (x_i - \bar{x})(y_i - \bar{y})\) and \(\sum_i x_i(x_i - \bar{x}) = \sum_i (x_i-\bar{x})^2\), which hold because \(\sum_i (y_i - \bar{y}) = \sum_i (x_i - \bar{x}) = 0\).)

Parameter Estimation (OLS)

And that is how OLS works!

\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]

Evaluating Linear Regression


How do we know if our model is good? 🧐

Metric 1: R² (R-squared)

Idea: What proportion of variance in Y does our model explain?

\(R^2 = 1 - \frac{\textrm{RSS}}{\textrm{TSS}}\)

Where:

  • RSS = Residual Sum of Squares (our errors)
  • TSS = Total Sum of Squares (total variation in Y), where \(\mathrm{TSS}=\sum_{i=1}^n (y_i-\bar{y})^2\)

Interpretation:

  • \(R^2 = 0\): Model explains nothing (useless)
  • \(R^2 = 1\): Model explains everything (perfect)
  • \(R^2 = 0.7\): Model explains 70% of variance
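A quick sketch of computing \(R^2\) from its definition and checking it against the value lm() reports (simulated data, names illustrative):

set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)  # Residual Sum of Squares
tss <- sum((y - mean(y))^2)   # Total Sum of Squares
1 - rss / tss                 # R-squared from the definition
summary(fit)$r.squared        # same value from lm()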

R²: Cautions

  • Higher R² is better, but:

    • Doesn’t tell if you have enough data
    • Doesn’t detect wrong predictors
    • Doesn’t imply causality
    • Can be artificially inflated by adding variables

Bottom line: R² is useful but not the whole story

Metric 2: Adjusted R²

Idea: Adjusts R² for the number of predictors, penalizing overly complex models.

\(\textrm{Adjusted } R^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\)

Where:

  • R² = regular R-squared
  • n = number of observations
  • p = number of predictors

Interpretation:

  • Increases only if a new predictor improves the model more than expected by chance
  • Can decrease if a useless variable is added
  • Helps prevent overfitting when comparing models with different numbers of predictors
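Continuing the \(R^2\) sketch above (fit <- lm(y ~ x), 100 observations, one predictor), adjusted \(R^2\) can be computed directly from the formula and checked against lm():

n  <- length(y)  # number of observations
p  <- 1          # number of predictors
r2 <- summary(fit)$r.squared

1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared from the formula
summary(fit)$adj.r.squared            # same value reported by lm()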

Why Variance Matters

  • R² and Adjusted R² measure variance explained: a model is good if it captures signal, not just noise
  • High explained variance → predictions reflect the true pattern in the data
  • Adjusted R² penalizes variables that don’t meaningfully reduce error

Bottom line: Adjusted R² gives a more honest view when comparing models with different numbers of predictors

Metric 3: RMSE (Root Mean Squared Error)

Idea: Average size of prediction errors (same units as Y)

\(\textrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\)

Example:

  • Predicting house prices: Y in $
  • RMSE = 15,000 → average deviation ≈ $15k

Interpretation:

  • Lower is better
  • Penalizes large errors more than small ones

Metric 4: MAE (Mean Absolute Error)

Idea: Average absolute size of errors

\(\textrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)

Example:

  • Predicting exam scores (0–100)
  • MAE = 5 → on average, predictions are 5 points off

Comparison to RMSE:

  • MAE treats all errors equally
  • RMSE penalizes large deviations more
  • MAE more robust to outliers

Metric 5: MAPE (Mean Absolute Percentage Error)

Idea: Average percentage deviation between prediction and actual

\(\textrm{MAPE} = \frac{100}{n} \sum_{i=1}^n \frac{|y_i - \hat{y}_i|}{|y_i|}\)

Example:

  • Predicting daily sales:

    • Actual = 200 units, Predicted = 180 → % error = 10%
  • Average these % errors across all days → MAPE

Caution:

  • Actual values near 0 → denominator very small → huge percentage errors
  • Example: Actual = 0.01, Predicted = 0.1 → error = 900%
  • Prefer RMSE or MAE if data has zeros or very small values
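A minimal sketch of these three metrics as R functions (the sales figures below are made up, purely to show the calls):

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))
mape <- function(actual, predicted) 100 * mean(abs(actual - predicted) / abs(actual))

actual    <- c(200, 150, 320, 280)  # made-up daily sales
predicted <- c(180, 160, 300, 300)
rmse(actual, predicted)
mae(actual, predicted)
mape(actual, predicted)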

Which Metric to Use?

Metric           | When to Use
R² / Adjusted R² | Understand proportion of variance explained; compare models
RMSE             | When large errors are especially bad (penalize outliers)
MAE              | When all errors are equally important; robust to outliers
MAPE             | When you want relative errors in percentage; avoid if actuals near 0

Tip: Report multiple metrics to get a full picture of model performance

When OLS Struggles

Problem 1: Multicollinearity

Scenario: Predicting house prices

sqft <- c(1500, 1800, 2000, 2200, 2500)
sqmeters <- round(sqft * 0.092903)  # Same info (up to rounding), different units!
price <- c(300000, 350000, 380000, 420000, 480000)

What happens?

The two predictors are almost perfectly correlated—they carry essentially the same information.

OLS gets confused! The coefficient estimates become unstable.

Multicollinearity: OLS Results

lm(price ~ sqft + sqmeters)

Coefficients:
(Intercept)     sqft      sqmeters   
   52891       -1247      15311

This makes no sense!

  • Square footage has a negative coefficient?
  • Coefficients are huge and contradictory
  • Small data changes → completely different estimates

We need a better approach…

Problem 2: Too Many Predictors

Scenario: Predicting student test scores

You measure 20 things: study hours, sleep, practice tests, breakfast, exercise, stress, motivation, coffee intake, social media time, …

But: Only 3 actually matter for test scores

OLS: Keeps all 20 variables, hard to interpret, prone to overfitting

Problem 3: More Predictors Than Observations

Extreme case: \(p > n\)

  • 50 observations but 100 predictors
  • OLS solution doesn’t even exist!
  • \(\mathbf{X}^T\mathbf{X}\) is not invertible

Common in modern data:

  • Genomics (thousands of genes, hundreds of patients)
  • Text analysis (thousands of words, hundreds of documents)
  • Image analysis (thousands of pixels, hundreds of images)
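A small simulated illustration of the \(p > n\) problem (dimensions chosen to match the example above):

set.seed(2)
n <- 50; p <- 100
x_wide <- matrix(rnorm(n * p), n, p)  # 100 predictors...
y_wide <- rnorm(n)                    # ...but only 50 observations

fit_ols <- lm(y_wide ~ x_wide)
sum(is.na(coef(fit_ols)))  # roughly half the coefficients come back NA:
                           # OLS simply cannot estimate them all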

The Bias-Variance Tradeoff

Understanding Prediction Error

Total Error = Bias² + Variance + Irreducible Error

Bias: How far off is our model on average?

  • Low bias: Model is flexible, fits well
  • High bias: Model is too simple, misses patterns

Variance: How much does our model change with different data?

  • Low variance: Stable, consistent predictions
  • High variance: Predictions vary wildly with new data

OLS and the Tradeoff

OLS is unbiased (under assumptions)

  • On average, gets the right answer
  • But can have high variance when:
    • Predictors are correlated
    • Many predictors relative to sample size
    • Noise in the data

Key insight: Sometimes accepting a little bias can dramatically reduce variance!

This is what regularization does.

Ridge Regression

Shrinking Coefficients to Reduce Variance

Ridge: The Big Idea

Modify the objective function:

Instead of just minimizing RSS, minimize:

\[\text{RSS} + \lambda \sum_{j=1}^{p}\beta_j^2\]

What this does:

  • Still tries to fit the data well (the RSS part)
  • But also penalizes large coefficients (the \(\lambda\) part)
  • Forces coefficients to be smaller

Ridge: The Tuning Parameter λ

λ (lambda) controls the strength of penalization:

  • \(\lambda = 0\): No penalty → regular OLS
  • \(\lambda\) small: Light penalty → similar to OLS
  • \(\lambda\) large: Heavy penalty → coefficients shrink toward zero
  • \(\lambda \to \infty\): Maximum penalty → all coefficients = 0

Challenge: Choosing the right λ (more on this later!)

Example 1: Ridge Fixes Multicollinearity

Recall our house price problem with sqft and sqmeters:

OLS (unstable):

Coefficients: sqft = -1247, sqmeters = 15311

Ridge (λ = 1000):

Coefficients: sqft = 82.3, sqmeters = 461.2

Much better!

  • Both coefficients are positive (makes sense!)
  • Much smaller, more stable
  • Less sensitive to small changes in data
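A sketch of how a ridge fit like this might be run with the glmnet package. The exact numbers will differ from the illustrative coefficients above, because glmnet standardises the predictors and scales \(\lambda\) internally:

library(glmnet)

# Reusing sqft, sqmeters and price from the house-price example
x_house <- cbind(sqft = sqft, sqmeters = sqmeters)  # predictors as a matrix
ridge_fit <- glmnet(x_house, price, alpha = 0,      # alpha = 0 -> ridge penalty
                    lambda = c(10000, 1000, 100, 10, 1))

coef(ridge_fit, s = 1000)  # heavier penalty: small, stable coefficients
coef(ridge_fit, s = 1)     # lighter penalty: closer to OLS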

Ridge: What’s Happening Geometrically?

Constraint interpretation:

Ridge finds coefficients that minimize RSS, subject to:

\[\sum_{j=1}^{p}\beta_j^2 \leq s\]

Ridge: Properties

Handles multicollinearity well

Reduces variance by shrinking coefficients

Always has a solution (even when \(p > n\))

All variables stay in the model

Doesn’t select variables (nothing is exactly zero)

When to use: You think most predictors are relevant but want to reduce overfitting

Lasso Regression

Shrinking AND Selecting

Lasso: The Big Idea

Different penalty:

Instead of squaring coefficients, use absolute values:

\[\text{RSS} + \lambda \sum_{j=1}^{p}|\beta_j|\]

Subtle but crucial difference:

Ridge uses \(\beta_j^2\) → shrinks smoothly

Lasso uses \(|\beta_j|\) → shrinks AND can set coefficients exactly to zero

Example 2: Lasso Finds Important Variables

Scenario: 20 predictors, but only 3 truly matter

OLS: Keeps all 20 (noisy, hard to interpret)

Ridge: Shrinks all 20 but keeps them all

Lasso (λ = 0.5):

Non-zero coefficients:
X1  (study_hours):      4.87
X5  (sleep_deficit):   -2.93
X10 (practice_tests):   3.78

All other 17 variables: 0.00

Perfect! Lasso identified the 3 truly important predictors.
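A sketch of how a fit like this might be produced with glmnet (alpha = 1 selects the lasso penalty). The data are simulated so that only three of twenty predictors matter; the exact \(\lambda\) at which the noise variables vanish will differ from the illustrative value above:

library(glmnet)
set.seed(1)

# Simulated data: 20 predictors, but only X1, X5 and X10 truly matter
n <- 200; p <- 20
x_sim <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
y_sim <- 5 * x_sim[, 1] - 3 * x_sim[, 5] + 4 * x_sim[, 10] + rnorm(n)

lasso_fit <- glmnet(x_sim, y_sim, alpha = 1)  # alpha = 1 -> lasso penalty
coef(lasso_fit, s = 0.5)  # most noise coefficients are shrunk to exactly zero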

Lasso: Properties

Shrinks coefficients toward zero

Selects variables (sets some to exactly zero)

Creates sparse models (easy to interpret)

Works when \(p > n\)

Can be unstable with correlated predictors

When to use: You suspect only a few predictors truly matter

Example 3: The Effect of λ

Using the 20-predictor student score data:

λ = 0.01:  18 non-zero coefficients
λ = 0.10:  12 non-zero coefficients  
λ = 0.50:   3 non-zero coefficients ← just right!
λ = 1.00:   2 non-zero coefficients
λ = 5.00:   0 non-zero coefficients

Too small λ: Keeps noise (overfitting)

Too large λ: Loses signal (underfitting)

Just right λ: Finds the true signal
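With the lasso fit from the previous sketch, you can build this kind of table by counting non-zero coefficients across a grid of \(\lambda\) values (the exact counts depend on the data):

# Number of non-zero coefficients at several lambda values (using lasso_fit)
sapply(c(0.01, 0.1, 0.5, 1, 5),
       function(l) sum(as.vector(coef(lasso_fit, s = l))[-1] != 0))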

Elastic Net

Combining Ridge and Lasso

The Problem with Correlated Predictors

Example: Temperature measurements

temp_celsius    <- c(20, 22, 18, 25, 19)
temp_fahrenheit <- temp_celsius * 9/5 + 32
temp_kelvin     <- temp_celsius + 273.15

These all measure the same thing in different units!

Lasso’s behavior: Arbitrarily picks one, zeros out the others

Problem: Selection is unstable, we lose information

Example 4: Lasso vs Correlated Predictors

Lasso results:

temp_celsius:     2.15
temp_fahrenheit:  0.00   # Zeroed out!
temp_kelvin:      0.00   # Zeroed out!

Which one survives is somewhat random

  • Different data samples → different selection
  • All three contain the same information!
  • Ideally, we’d keep them as a group

Elastic Net: The Solution

Combines both penalties:

\[\text{RSS} + \lambda \sum_{j=1}^{p}\left[\alpha|\beta_j| + (1-\alpha)\beta_j^2\right]\]

Two tuning parameters:

  • \(\lambda\): overall regularization strength
  • \(\alpha\): balance between Lasso (α=1) and Ridge (α=0)

Typical choice: α = 0.5 (equal mix)

Example 4: Elastic Net Handles Groups

Elastic Net results (α = 0.5):

temp_celsius:     0.87
temp_fahrenheit:  0.64
temp_kelvin:      0.71

Much better!

  • Keeps correlated predictors as a group
  • Distributes weight among them
  • More stable selection
  • Still eliminates noise variables
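A sketch of an elastic-net fit with glmnet on simulated stand-ins for the correlated temperature variables (the outcome and the \(\lambda\) value are made up for illustration):

library(glmnet)
set.seed(7)

# Simulated versions of the correlated temperature measurements
temp_c <- runif(100, 10, 30)
temp_f <- temp_c * 9/5 + 32
temp_k <- temp_c + 273.15
sales  <- 2 * temp_c + rnorm(100)  # hypothetical outcome driven by temperature

x_temp   <- cbind(temp_c, temp_f, temp_k)
enet_fit <- glmnet(x_temp, sales, alpha = 0.5)  # alpha = 0.5: half lasso, half ridge
coef(enet_fit, s = 0.1)  # the weight tends to be shared across the correlated group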

Elastic Net: Properties

Combines benefits of Ridge and Lasso

Handles correlated predictors (groups them)

Performs variable selection (sparse models)

More stable than Lasso alone

Works when \(p > n\)

When to use: Correlated predictors + want sparse model

Often a safe default choice!

Comparing the Methods

Side-by-Side: Temperature Example

With 3 temperature measures + 2 distance measures + 3 noise variables:

Ridge:

Keeps all 8 variables (small coefficients for noise)

Lasso:

temp_c, dist_km (arbitrarily picks one from each group)

Elastic Net:

temp_c, temp_f, temp_k, dist_km, dist_mi
(keeps groups, removes noise)

Decision Guide

Use OLS when:

  • Many observations relative to predictors (\(n \gg p\))
  • Predictors are not highly correlated
  • Interpretability is crucial
  • Linear assumptions seem reasonable

Decision Guide (continued)

Use Ridge when:

  • Multicollinearity is present
  • You believe most predictors matter
  • Want to reduce overfitting
  • Don’t need variable selection

Decision Guide (continued)

Use Lasso when:

  • Want automatic variable selection
  • Believe only few predictors matter
  • Sparse, interpretable model desired
  • Predictors not too correlated

Decision Guide (final)

Use Elastic Net when:

  • Have correlated predictors
  • Want variable selection
  • Unsure between Ridge and Lasso
  • Safe default: α = 0.5

In practice: Try several methods and compare!

Choosing Tuning Parameters

The big question: How to choose λ (and α)?

For now:

  • Try several values
  • Look at coefficient paths
  • See what makes sense for your problem

Later in the course (Week 4):

You’ll learn cross-validation—a systematic method for choosing optimal tuning parameters

Visualizing Coefficient Paths

As λ increases:

  • Ridge: All coefficients smoothly shrink
  • Lasso: Coefficients hit zero one by one
  • Elastic Net: Combination of both

Useful insight: Variables that persist at higher λ are more important!

In R: plot(glmnet_model) shows these paths
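For example, using the lasso_fit object from the earlier lasso sketch:

# Coefficient paths: one curve per predictor, plotted against log(lambda)
plot(lasso_fit, xvar = "lambda", label = TRUE)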

What’s next?

How to revise for this course next week.

  1. To understand in detail all the assumptions implicitly made by linear models,
    read (James et al. 2021, chaps. 2–3)
    (Not compulsory but highly recommended reading)
  2. Have a look at extensions of linear models here
  3. Take a crack at LASSO and Ridge regression models (you might encounter them again in bonus tasks in the Week 3 lab) here and here or on Julia Silge’s Blog (LASSO-related page)

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd edition. Springer Texts in Statistics. New York NY: Springer. https://www.statlearning.com/.
Karafiath, Imre. 2009. “Is There a Viable Alternative to Ordinary Least Squares Regression When Security Abnormal Returns Are the Dependent Variable?” Review of Quantitative Finance and Accounting 32 (1): 17–31. https://doi.org/10.1007/s11156-007-0079-y.