🗓️ Week 02:
Introduction to Regression Algorithms

Theme: Supervised Learning

26 Jan 2024

Machine Learning

What is Machine Learning?

  • Machine Learning (ML) is a subfield of Artificial Intelligence (AI)
    • Traditional AI is (was?) based on explicit programming of rules and logic.
    • Machine Learning is based on learning from examples – from data.
  • “To learn” here often implies the following particular meaning:
    • to capture patterns in data (think trends, correlations, associations, etc.) and use them to make predictions or to support decisions.

    • Different from traditional statistics, which is more focused on inference (i.e. testing hypotheses).

What does it mean to predict something?

  • Say our data is the following simple sequence: \(3, 6, 9, 12, 15, 18, 21, 24, ...\)
  • What number do you expect to come next? Why?
  • It is very likely that you guessed that
    \(\operatorname{next number}=27\)
  • We spot that the sequence follows a pattern:
    \(\operatorname{next number} = \operatorname{previous number} + 3\)
  • If we know the pattern, we can extrapolate (predict) the next number in the sequence.
  • In a way, we have “learned” the pattern from just looking at the data.

Predicting a sequence (formula)


The next number can be represented as a function, \(f(\ )\), of the previous one:

\[ \operatorname{next number} = f(\operatorname{previous number}) \]

Or, let’s say, as a function of the position of the number in the sequence:

Position   Number
       1        3
       2        6
       3        9
       4       12
       5       15
       6       18
       7       21
       8       24

In equation form:

\[ \operatorname{Number} = f(\operatorname{Position}) \]

where

\[ f(x) = 3x \]

👈🏻 Typically, we use a tabular format like this to represent our data when doing ML.
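If it helps to see this in code, here is a minimal R sketch of the same table (the data frame and variable names, e.g. df_seq, are just illustrative, not part of the lecture code):

# The table above as an R data frame -- the usual 'tabular' format for ML
df_seq <- data.frame(position = 1:8,
                     number   = c(3, 6, 9, 12, 15, 18, 21, 24))

all(df_seq$number == 3 * df_seq$position)   # the pattern Number = 3 * Position holds in every row
#> [1] TRUE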

The goal of Machine Learning


The goal of ML:

Find a function \(f\), in any suitable mathematical form, that can best predict how \(Y\) will vary given \(X\).

ML vs traditional stats



  • Seen from afar, this is not that different from the goal of traditional statistics.
  • But a statistician might ask:
    • What is the data generating process that produced the data?
    • What evidence do we have that the pattern we have found is the “true” pattern?
    • How can we be sure that the pattern we have found is not just a coincidence?
    • They have a point 👉

Are there other possible sequences?

Let’s find out:

If we visit the OEIS ® page and paste our sequence: \(3, 6, 9, 12, 15, 18, 21, 24, ...\)

… we will get 19 different sequences that contain those same numbers!


How do we know?

The sad truth: we don’t.

  • The statistician George Box famously wrote:

    “All models are wrong, but some are useful.”

  • A final word on this dichotomy we are exploring:
    • Traditional stats: focuses on testing how well our assumptions (our models) fit the data. Typically via hypothesis testing.
    • Machine Learning: focuses more on assessing how well the model can predict unseen data.

Types of learning

Representation of learning

This example we just explored can be represented as:

\[ Y = f(X) + \epsilon \]

where:

  • \(Y\): the output (the Number in our example)
  • \(X\): the input (the Position in our example)
  • \(f\): a suitable mathematical function (simple or complex)
  • \(\epsilon\): a random error term

Whenever you are modelling something that can be written in the form of the equation above, you are doing supervised learning.

Approximating \(f\)

  • \(f\) is almost always unknown (“all models are wrong”!)
  • The best we can aim for is an approximation (a model).
  • Let’s denote it \(\hat{f}\), which we can then use to predict values of \(Y\) for whatever \(X\) we encounter.
    • That is: \(\hat{Y} = \hat{f}(X)\)

How does this approximation process work?

  • You have to come up with a suitable mathematical form for \(\hat{f}\).
    • Each ML algorithm will have its own way of doing this.
    • You could also come up with your own function if you are so inclined.
  • It’s likely that \(\hat{f}\) will have some parameters that you will need to estimate.
    • Instead of proposing \(\hat{f}(x) = 3x\), we say to ourselves:
      ‘I don’t know if 3 is the absolute best number here, maybe the data can tell me?’
    • We could then propose \(\hat{f}(x) = \beta x\) and set ourselves to find out the optimal value of \(\beta\) that ‘best’ fits the data.
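As a rough sketch of that last idea, we could let the data pick \(\beta\) for us. For the no-intercept model \(\hat{f}(x) = \beta x\), the least-squares answer has a simple closed form, \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\). Illustrative R code (variable names are made up):

# Sketch: let the data choose beta in f(x) = beta * x
x <- 1:8
y <- c(3, 6, 9, 12, 15, 18, 21, 24)

beta_hat <- sum(x * y) / sum(x^2)   # least-squares solution for a no-intercept model
beta_hat
#> [1] 3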

How does this approximation process work? (cont.)

  • To train your model, i.e. to find the best values for its parameters, you need to feed it past data that contains both \(X\) and \(Y\) values.
  • That is, you MUST have already collected ‘historical’ data containing both \(X\) and \(Y\).
  • Once trained, the model will be able to predict \(Y\) values for new \(X\) values.

You can have multiple columns of \(X\) values 👉

X1   X2   X3   X4    Y
 1    2    3   10    3
 2    4    6   20    6
 3    6    9   30    9
 4    8   12   40   12
 5   10   15   50   15
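As a small sketch, a table like the one above could be stored as a data frame and the model specified with a formula listing all predictor columns (df_multi is just an illustrative name):

# Sketch: the multi-column table above as a data frame;
# a supervised model would then be specified as, e.g., Y ~ X1 + X2 + X3 + X4
df_multi <- data.frame(X1 = c(1, 2, 3, 4, 5),
                       X2 = c(2, 4, 6, 8, 10),
                       X3 = c(3, 6, 9, 12, 15),
                       X4 = c(10, 20, 30, 40, 50),
                       Y  = c(3, 6, 9, 12, 15))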



If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.


You can still use ML to find patterns in the data, a process known as unsupervised learning.

Types of learning


These are, broadly speaking, the two main ways of learning from data:

Supervised Learning

  • Each observation \(\mathbf{x}_i = \{X1_i, X2_i, \ldots\}\) has an outcome associated with it (\(y_i\)).
  • Your goal is to find a \(\hat{f}\) that produces \(\hat{Y}\) values close to the true \(Y\) values.
  • Use it to make predictions or to support decisions.
  • Our focus on 🗓️ Weeks 2, 3, 4 & 5.

Unsupervised Learning

  • You have observations \(\mathbf{x}_i = \{X1_i, X2_i, \ldots\}\) but there is no response variable (or you don’t care about one).
  • Focus: identify (dis)similarities in \(X\).
  • Use it to find clusters, anomalies, or other patterns in the data.
  • Our focus on 🗓️ Weeks 7, 8 & 9.

Linear Regression

The basic models

Linear regression is a simple approach to supervised learning.

The generic supervised model:

\[ Y = f(X) + \epsilon \]

is defined more explicitly as follows ➡️

Simple linear regression

\[ Y = \beta_0 + \beta_1 X + \epsilon, \]

when we use a single predictor, \(X\).

Multiple linear regression

\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]

when there are multiple predictors, \(X_1, X_2, \dots, X_p\).

Warning

  • Most real-life processes are not linear.
  • Still, linear regression is a good starting point for many problems.

Linear Regression with a single predictor

We assume a model:

\[ Y = \beta_0 + \beta_1 X + \epsilon , \]

where:

  • \(\beta_0\): an unknown constant that represents the intercept of the line.
  • \(\beta_1\): an unknown constant that represents the slope of the line.
  • \(\epsilon\): the random error term (irreducible)

Linear Regression with a single predictor

We want to estimate:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

where:

  • \(\hat{y}\): a prediction of \(Y\) on the basis of \(X = x\).
  • \(\hat{\beta}_0\): an estimate of the “true” \(\beta_0\).
  • \(\hat{\beta}_1\): an estimate of the “true” \(\beta_1\).

Suppose you came across some data:

And you suspect there is a linear relationship between X and Y.

How would you go about fitting a line to it?

Does this line fit?

A line right through the “centre of gravity” of the cloud of data.
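The plot itself is not reproduced here, but a quick simulation sketch in R illustrates the idea (the data below are made up, not the data from the slide):

# Sketch with simulated data: fit a line and check it passes through the 'centre of gravity'
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)     # a noisy linear relationship

fit <- lm(y ~ x)

predict(fit, newdata = data.frame(x = mean(x)))   # prediction at the mean of x ...
mean(y)                                           # ... equals the mean of y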

Different estimators, different equations

There are multiple ways to estimate the coefficients.

  • If you use different techniques, you might get different equations
  • The most common algorithm is called
    Ordinary Least Squares (OLS)
  • Alternative estimators (Karafiath 2009):
    • Least Absolute Deviation (LAD)
    • Weighted Least Squares (WLS)
    • Generalized Least Squares (GLS)
    • Heteroskedastic-Consistent (HC) variants

Algorithm: Ordinary Least Squares (OLS)

The concept of residuals

Residuals are the vertical distances from each data point to the fitted line.

\(e_i = y_i - \hat{y}_i\) represents the \(i\)th residual.

Observed vs. Predicted

Residual Sum of Squares (RSS)

From this, we can define the Residual Sum of Squares (RSS) as

\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]

or equivalently as

\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]



Note

The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.
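In code, the RSS of any candidate line is a one-liner (toy numbers, purely illustrative):

# Sketch: RSS of a candidate line (beta0, beta1) on a toy data set
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

rss <- function(beta0, beta1) sum((y - (beta0 + beta1 * x))^2)

rss(0, 2)   # a line close to the data  -> RSS ~ 0.11
rss(5, 0)   # a flat line at y = 5      -> RSS ~ 44.9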

OLS: objective function

We treat this as an optimisation problem. We want to minimize RSS:

\[ \begin{align} \min \mathrm{RSS} =& \sum_i^n{e_i^2} \\ =& \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ =& \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
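Because this really is just an optimisation problem, we could hand the RSS to a generic numerical optimiser and get (almost) the same answer that lm() finds analytically. A minimal sketch, reusing the toy data from above:

# Sketch: minimise the RSS numerically and compare with lm()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)

optim(c(0, 0), rss)$par   # numerical estimates of (beta0, beta1)
coef(lm(y ~ x))           # lm() solves the same problem exactly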

Estimating \(\hat{\beta}_0\)

To find \(\hat{\beta}_0\), we set the partial derivative of the RSS with respect to \(\hat{\beta}_0\) to zero:

\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} = 0 \]

… which will lead you to:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]

where we made use of the sample means:

  • \(\bar{y} \equiv \frac{1}{n} \sum_{i=1}^n y_i\)
  • \(\bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i\)

Full derivation

\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} & \\ 0 &= \sum_i^n{-2 (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= -2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $-2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div (-2)) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]

\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]

Estimating \(\hat{\beta}_1\)

Similarly, to find \(\hat{\beta}_1\) we solve:

\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} = 0 \]

… which will lead you to:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]

Full derivation

\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} & \\ 0 &= \sum_i^n{\left(-2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= -2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $-2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div (-2)) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]

\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \sum_i^n{\left(\hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{separate sums})\\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \hat{\beta}_1\sum_i^n{\left(\bar{x}x_i - x_i^2\right)} & (\text{took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_i^n{\left(y_ix_i - \bar{y}x_i\right)}}{\sum_i^n{\left(x_i^2 - \bar{x}x_i\right)}} & (\text{isolate }\hat{\beta}_1) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & (\text{equivalent form, since } \textstyle\sum_i^n{(x_i - \bar{x})} = 0) ~~~ \blacksquare \end{align} \]

Parameter Estimation (OLS)

And that is how OLS works!

\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]
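These two formulas are easy to verify in R. A small sketch with the same toy data as before, checked against lm():

# Sketch: the closed-form OLS estimates vs. R's lm()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))           # same values (up to floating-point error)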

Example: House Prices in England

House Prices in England

  • In the labs, you have been exploring the monthly percentage change in house prices in the UK
  • Here I will focus on the price itself
    • More specifically, the average price of a house in England (in £)
    • The Office for National Statistics (ONS) publishes this data every month; the figure corresponds to the geometric mean of the prices of all houses sold in that territory (England) during that month.

Let’s look at the data

👉 Let’s work with the simple assumption that the price of a house in a month is a function of the price of a house in the previous month.

date         region    average_price   average_price_lag1 (previous month)
2023-06-01   England   £ 306447        £ 303671
2023-05-01   England   £ 303671        £ 304102
2023-04-01   England   £ 304102        £ 302772
2023-03-01   England   £ 302772        £ 305859
2023-02-01   England   £ 305859        £ 307375
2023-01-01   England   £ 307375        £ 309949
2022-12-01   England   £ 309949        £ 312288
2022-11-01   England   £ 312288        £ 311100

Let’s look at the data

👉 The same data, now also showing the month-on-month difference in the average price.

date         region    average_price   average_price_lag1 (previous month)   difference
2023-06-01   England   £ 306447        £ 303671                              +2776
2023-05-01   England   £ 303671        £ 304102                              -431
2023-04-01   England   £ 304102        £ 302772                              +1330
2023-03-01   England   £ 302772        £ 305859                              -3087
2023-02-01   England   £ 305859        £ 307375                              -1516
2023-01-01   England   £ 307375        £ 309949                              -2574
2022-12-01   England   £ 309949        £ 312288                              -2339
2022-11-01   England   £ 312288        £ 311100                              +1188

Plotting the data

Linear regression seems like a pretty reasonable assumption.
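For reference, a plot along these lines could be produced with something like the sketch below (it assumes a data frame df with the two columns used in the model fit on the next slide):

# Sketch of the scatter plot with a fitted line overlaid
library(ggplot2)

ggplot(df, aes(x = average_price_lag1, y = average_price)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Average price in the previous month (£)",
         y = "Average price (£)")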

Fitting a linear model

library(tidymodels)

# Create a linear model
lm_spec <- 
    linear_reg() %>%
    set_engine("lm") %>%
    set_mode("regression")

# Fit the model
lm_fit <- 
    lm_spec %>%
    fit(average_price ~ average_price_lag1, data=df)

Model summary

What values did OLS fit for \(\hat{\beta}_0\) and \(\hat{\beta}_1\)?

# Print a brief description of the model
print(lm_fit)
parsnip model object


Call:
stats::lm(formula = average_price ~ average_price_lag1, data = data)

Coefficients:
       (Intercept)  average_price_lag1  
          -560.208               1.006  

That is:

\[ \begin{align} \hat{\beta}_0 &\approx - 560.21 \\ \hat{\beta}_1 &\approx + 1.006 \end{align} \]
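With these estimates, making a prediction is straightforward. A sketch of a one-month-ahead prediction, feeding in the June 2023 figure from the table above (£306,447):

# Sketch: predict next month's average price from the latest observed price
predict(lm_fit, new_data = tibble(average_price_lag1 = 306447))
#> roughly -560.21 + 1.006 * 306447, i.e. about £307,700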

Interpretation of the coefficients

  • The estimated relationship is almost 1:1.
  • The predicted price increases by about £1.006 for every additional £1 in the previous month’s average price.
  • That is, setting aside the small intercept term, we estimate that average house prices in England grow by roughly 0.6% per month.

Visually inspecting the model

Further inspection

# Retrieve a summary of the model
summary(lm_fit$fit)
Call:
stats::lm(formula = average_price ~ average_price_lag1, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-17580.9  -1071.9    101.5   1238.6  15980.5 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -5.602e+02  8.636e+02  -0.649    0.517    
average_price_lag1  1.006e+00  3.989e-03 252.126   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2525 on 220 degrees of freedom
Multiple R-squared:  0.9966,    Adjusted R-squared:  0.9965 
F-statistic: 6.357e+04 on 1 and 220 DF,  p-value: < 2.2e-16

Does this model apply elsewhere?

Multiple Linear Regression

What if we add a second lagged variable?

# Fit a second model with an additional lagged predictor, then summarise it
lm_fit2 <- 
        lm_spec %>%
        fit(average_price ~ average_price_lag1 + average_price_lag2, 
            data=df_england)

lm_fit2$fit %>% summary()       
Call:
stats::lm(formula = average_price ~ average_price_lag1 + average_price_lag2, 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-15607.2  -1088.6    124.5   1375.2  15818.8 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -642.07477  873.58753  -0.735   0.4631    
average_price_lag1    0.88133    0.06742  13.072   <2e-16 ***
average_price_lag2    0.12523    0.06789   1.845   0.0664 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2520 on 217 degrees of freedom
Multiple R-squared:  0.9966,    Adjusted R-squared:  0.9965 
F-statistic: 3.144e+04 on 2 and 217 DF,  p-value: < 2.2e-16

Compare the two models
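The comparison itself is not reproduced here, but one simple way to put the two fits side by side is via broom::glance() (a sketch, assuming the lm_fit and lm_fit2 objects from the previous slides):

# Sketch: compare the single-lag and two-lag models on a few summary statistics
library(broom)

bind_rows(lag1_only     = glance(lm_fit$fit),
          lag1_and_lag2 = glance(lm_fit2$fit),
          .id = "model") %>%
    select(model, adj.r.squared, sigma, AIC)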

What’s next?

How to revise for this course before next week:

  1. Your first summative problem set should be available on Moodle after this lecture.
    It is due on the 19th of October.

  2. To understand in detail all the assumptions implicitly made by linear models,
    read (James et al. 2021, chaps. 2–3)
    (Not compulsory but highly recommended reading)

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd edition. Springer Texts in Statistics. New York NY: Springer. https://www.statlearning.com/.
Karafiath, Imre. 2009. “Is There a Viable Alternative to Ordinary Least Squares Regression When Security Abnormal Returns Are the Dependent Variable?” Review of Quantitative Finance and Accounting 32 (1): 17–31. https://doi.org/10.1007/s11156-007-0079-y.