🗓️ Week 02:
Introduction to Regression Algorithms

Theme: Supervised Learning

31 Jan 2025

Machine Learning

What is Machine Learning?

  • Machine Learning (ML) is a subfield of Artificial Intelligence (AI)
    • Traditional AI is (was?) based on explicit programming of rules and logic.
    • Machine Learning is based on learning from examples – from data.
  • “To learn” here often implies the following particular meaning:
    • to capture patterns in data (think trends, correlations, associations, etc.) and use them to make predictions or to support decisions.

    • Different from traditional statistics, which is more focused on inference (i.e. testing hypotheses).

What does it mean to predict something?

  • Say our data is the following simple sequence: \(3, 6, 9, 12, 15, 18, 21, 24, ...\)
  • What number do you expect to come next? Why?
  • It is very likely that you guessed that
    \(\operatorname{next number}=27\)
  • We spot that the sequence follows a pattern:
    \(\operatorname{next number} = \operatorname{previous number} + 3\)
  • If we know the pattern, we can extrapolate (predict) the next number in the sequence.
  • In a way, we have “learned” the pattern from just looking at the data.

Predicting a sequence (formula)


The next number can be represented as a function, \(f(\ )\), of the previous one:

\[ \operatorname{next number} = f(\operatorname{previous number}) \]

Or, let’s say, as a function of the position of the number in the sequence:

Position   Number
1          3
2          6
3          9
4          12
5          15
6          18
7          21
8          24

In equation form:

\[ \operatorname{Number} = f(\operatorname{Position}) \]

where

\[ f(x) = 3x \]

👈🏻 Typically, we use a tabular format like this to represent our data when doing ML.
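A tiny sketch of that tabular representation in code, assuming pandas is available (the column names simply mirror the table above), with the pattern we spotted written out as a function:

```python
import pandas as pd

# The sequence laid out in tabular (Position, Number) form
df = pd.DataFrame({
    "Position": [1, 2, 3, 4, 5, 6, 7, 8],
    "Number":   [3, 6, 9, 12, 15, 18, 21, 24],
})

def f(position):
    """The pattern we 'learned' by eye: Number = 3 * Position."""
    return 3 * position

# Extrapolate: predict the Number at the next Position
next_position = df["Position"].max() + 1
print(next_position, f(next_position))  # 9 27
```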

The goal of Machine Learning


The goal of ML:

Find a function \(f\), in any suitable mathematical form, that can best predict how \(Y\) will vary given \(X\).

ML vs traditional stats


The goal of ML:

Find a function \(f\), in any suitable mathematical form, that can best predict how \(Y\) will vary given \(X\).

  • Seen from afar, this is not that different from the goal of traditional statistics.
  • But a statistician might ask:
    • What is the data generating process that produced the data?
    • What evidence do we have that the pattern we have found is the “true” pattern?
    • How can we be sure that the pattern we have found is not just a coincidence?
    • They have a point 👉

Are there other possible sequences?

Let’s find out:

If we visit the OEIS ® page and paste our sequence: \(3, 6, 9, 12, 15, 18, 21, 24, ...\)

… we will get 19 different sequences that contain those same numbers!

How do we know?

The sad truth: we don’t.

  • The statistician George Box famously wrote:

    “All models are wrong, but some are useful.”

  • A final word on this dichotomy we are exploring:
    • Traditional stats: focuses on testing how well our assumptions (our models) fit the data. Typically via hypothesis testing.
    • Machine Learning: focuses more on assessing how well the model can predict unseen data.

Types of learning

Representation of learning

This example we just explored can be represented as:

\[ Y = f(X) + \epsilon \]

where:

  • \(Y\): the output
  • \(X\): a set of inputs
  • \(f\): a suitable mathematical function
  • \(\epsilon\): a random error term

Representation of learning

This example we just explored can be represented as:

\[ Y = f(X) + \epsilon \]

where:

  • \(Y\): the output (the Number in our example)
  • \(X\): the input (the Position in our example)
  • \(f\): a suitable mathematical function (simple or complex)
  • \(\epsilon\): a random error term

Whenever you are modelling something that can be represented somehow as the equation above, you are doing supervised learning.

Approximating \(f\)

  • \(f\) is almost always unknown (“all models are wrong”!)
  • The best we can aim for is an approximation (a model).
  • Let’s denote it \(\hat{f}\), which we can then use to predict values of \(Y\) for whatever \(X\) we encounter.
    • That is: \(\hat{Y} = \hat{f}(X)\)

How does this approximation process work?

  • You have to come up with a suitable mathematical form for \(\hat{f}\).
    • Each ML algorithm will have its own way of doing this.
    • You could also come up with your own function if you are so inclined.
  • It’s likely that \(\hat{f}\) will have some parameters that you will need to estimate.
    • Instead of proposing \(\hat{f}(x) = 3x\), we say to ourselves:
      ‘I don’t know if 3 is the absolute best number here, maybe the data can tell me?’
    • We could then propose \(\hat{f}(x) = \beta x\) and set ourselves to find out the optimal value of \(\beta\) that ‘best’ fits the data.
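A minimal numpy sketch of that idea (my own illustration, not part of the slides' notebook) for the no-intercept model \(\hat{f}(x) = \beta x\): minimising the sum of squared errors gives the closed-form estimate \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\), anticipating the least-squares machinery we derive later.

```python
import numpy as np

# "Historical" data: both X (position) and Y (number) are known
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 6, 9, 12, 15, 18, 21, 24], dtype=float)

# For the no-intercept model f_hat(x) = beta * x, minimising the sum of
# squared errors gives the closed form: beta_hat = sum(x*y) / sum(x^2)
beta_hat = np.sum(x * y) / np.sum(x ** 2)
print(beta_hat)  # 3.0 -- the data "tells us" that 3 is the best-fitting value
```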

How does this approximation process work? (cont.)

  • To train your model, i.e. to find the best value for the parameters, you need to feed your model with past data that contains both \(X\) and \(Y\) values.
  • You MUST have already collected ‘historical’ data that contains both \(X\) and \(Y\) values.
  • The model will then be able to predict \(Y\) values for any \(X\) values.

You can have multiple columns of \(X\) values 👉

X1   X2   X3   X4   Y
1    2    3    10   3
2    4    6    20   6
3    6    9    30   9
4    8    12   40   12
5    10   15   50   15
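A minimal sketch of what "historical data with both \(X\) and \(Y\)" looks like in code, using pandas and scikit-learn's LinearRegression purely as an illustration (one common choice, not prescribed by the slides). The toy columns here are perfectly collinear, so treat it as a schematic of the supervised X/Y split rather than a serious model:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The same toy table: several X columns and one designated Y column
df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 4, 6, 8, 10],
    "X3": [3, 6, 9, 12, 15],
    "X4": [10, 20, 30, 40, 50],
    "Y":  [3, 6, 9, 12, 15],
})

X = df[["X1", "X2", "X3", "X4"]]  # inputs (features)
y = df["Y"]                        # designated output -> supervised learning

# Learn f_hat from the historical (X, Y) pairs; because the columns are
# multiples of each other the coefficients are not unique, but the fitted
# model still reproduces the pattern when predicting for new rows
model = LinearRegression().fit(X, y)
new_row = pd.DataFrame({"X1": [6], "X2": [12], "X3": [18], "X4": [60]})
print(model.predict(new_row))  # approximately [18.]
```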



If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.


You can still use ML to find patterns in the data, a process known as unsupervised learning.

Types of learning


These are, broadly speaking, the two main ways of learning from data:

Supervised Learning

  • Each observation \(\mathbf{x}_i = \{X1_i, X2_i, \ldots\}\) has an outcome associated with it (\(y_i\)).
  • Your goal is to find an \(\hat{f}\) that produces \(\hat{Y}\) values close to the true \(Y\) values.
  • Use it to make predictions or to support decisions.
  • Our focus in 🗓️ Weeks 2, 3, 4 & 5.

Unsupervised Learning

  • You have observations \(\mathbf{x}_i = \{X1_i, X2_i, \ldots\}\) but there is no response variable (or you don’t care about it).
  • Focus: identify (dis)similarities in \(X\).
  • Use it to find clusters, anomalies, or other patterns in the data.
  • Our focus in 🗓️ Weeks 7, 8 & 9.

Linear Regression

The basic models

Linear regression is a simple approach to supervised learning.

The generic supervised model:

\[ Y = f(X) + \epsilon \]

is defined more explicitly as follows ➡️

Simple linear regression

\[ Y = \beta_0 + \beta_1 X + \epsilon, \]

when we use a single predictor, \(X\).

Multiple linear regression

\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]

when there are multiple predictors, \(X_1, \dots, X_p\).

Warning

  • Most real-life processes are not linear.
  • Still, linear regression is a good starting point for many problems.
  • Do you know the assumptions underlying linear models?

Linear Regression with a single predictor

We assume a model:

\[ Y = \beta_0 + \beta_1 X + \epsilon , \]

where:

  • \(\beta_0\): an unknown constant that represents the intercept of the line.
  • \(\beta_1\): an unknown constant that represents the slope of the line
  • \(\epsilon\): the random error term (irreducible)

Linear Regression with a single predictor

We want to estimate:

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]

where:

  • \(\hat{y}\): is a prediction of \(Y\) on the basis of \(X = x\).
  • \(\hat{\beta_0}\): is an estimate of the “true” \(\beta_0\).
  • \(\hat{\beta_1}\): is an estimate of the “true” \(\beta_1\).

Suppose you came across some data:

And you suspect there is a linear relationship between X and Y.

How would you go about fitting a line to it?

Does this line fit?

A line right through the “centre of gravity” of the cloud of data.

Different estimators, different equations

There are multiple ways to estimate the coefficients.

  • If you use different techniques, you might get different equations
  • The most common algorithm is called
    Ordinary Least Squares (OLS)
  • Alternative estimators (Karafiath 2009):
    • Least Absolute Deviation (LAD)
    • Weighted Least Squares (WLS)
    • Generalized Least Squares (GLS)
    • Heteroskedastic-Consistent (HC) variants
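To make the "different estimators, different equations" point concrete, here is a small sketch assuming the statsmodels library is available (my own illustration, not from the lecture notebook). It fits OLS, LAD (via median regression) and WLS to the same synthetic data with a few outliers; the exact numbers are unimportant, only that the estimators can disagree:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
y[:5] += 20  # a few outliers, where OLS and LAD typically disagree

X = sm.add_constant(x)  # adds the intercept column

ols = sm.OLS(y, X).fit()                            # Ordinary Least Squares
lad = sm.QuantReg(y, X).fit(q=0.5)                  # median regression = Least Absolute Deviation
wls = sm.WLS(y, X, weights=np.ones_like(y)).fit()   # WLS (equal weights reduce to OLS)

# Each estimator returns an (intercept, slope) pair -- compare them
print(ols.params, lad.params, wls.params)
```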

Algorithm: Ordinary Least Squares (OLS)

The concept of residuals

Residuals are the distances from each data point to this line.

\(e_i = y_i-\hat{y}_i\) represents the \(i\)th residual.

Observed vs. Predicted

Residual Sum of Squares (RSS)

From this, we can define the Residual Sum of Squares (RSS) as

\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]

or equivalently as

\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]



Note

The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.

OLS: objective function

We treat this as an optimisation problem. We want to minimize RSS:

\[ \begin{align} \min \mathrm{RSS} =& \sum_i^n{e_i^2} \\ =& \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ =& \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
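Before deriving the closed-form solution, it can help to treat this literally as a numerical optimisation problem. A minimal sketch, assuming scipy and numpy are available (synthetic data, my own illustration), that searches for the \((\hat{\beta}_0, \hat{\beta}_1)\) minimising the RSS:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.5 + 2.0 * x + rng.normal(0, 1, size=50)

def rss(beta):
    """Residual Sum of Squares for a candidate (beta_0, beta_1)."""
    b0, b1 = beta
    return np.sum((y - b0 - b1 * x) ** 2)

# Search for the (beta_0, beta_1) that minimises the RSS,
# starting from an arbitrary initial guess
result = minimize(rss, x0=[0.0, 0.0])
print(result.x)  # roughly recovers the values (1.5, 2.0) used to simulate the data
```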

Estimating \(\hat{\beta}_0\)

To find \(\hat{\beta}_0\), we set the corresponding partial derivative of the RSS to zero:

\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]

… which will lead you to:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]

where we made use of the sample means:

  • \(\bar{y} \equiv \frac{1}{n} \sum_{i=1}^n y_i\)
  • \(\bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i\)

Full derivation

\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{-2 (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= -2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $-2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div (-2)) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]

\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]

Estimating \(\hat{\beta}_1\)

Similarly, to find \(\hat{\beta}_1\) we solve:

\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]

… which will lead you to:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]

Full derivation

\[ \begin{align} 0 &= \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{\left(-2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= -2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $-2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div (-2)) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]

\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \sum_i^n{\left(\hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{separate sums})\\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \hat{\beta}_1\sum_i^n{\left(\bar{x}x_i - x_i^2\right)} & (\text{took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & (\text{isolate }\hat{\beta}_1) ~~~ \blacksquare \end{align} \]
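The final step ("isolate \(\hat{\beta}_1\)") also relies on two standard identities that turn the raw sums into the centred form quoted in the result:

\[ \begin{align} \sum_i^n{(x_i - \bar{x})(y_i - \bar{y})} &= \sum_i^n{x_i y_i} - n\bar{x}\bar{y} = \sum_i^n{\left(y_i x_i - \bar{y}x_i\right)}, \\ \sum_i^n{(x_i - \bar{x})^2} &= \sum_i^n{x_i^2} - n\bar{x}^2 = \sum_i^n{\left(x_i^2 - \bar{x}x_i\right)} \end{align} \]

Substituting these into \(0 = \sum_i^n{\left(y_i x_i - \bar{y}x_i\right)} - \hat{\beta}_1\sum_i^n{\left(x_i^2 - \bar{x}x_i\right)}\) and rearranging gives exactly the ratio above.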

Parameter Estimation (OLS)

And that is how OLS works!

\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]
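Those two formulas translate directly into code. A minimal numpy sketch on synthetic data (not the World Bank example that follows), with a cross-check against numpy's built-in least-squares line fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 4.0 - 0.7 * x + rng.normal(0, 2, size=200)

# The OLS closed-form estimates derived above
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Cross-check against numpy's built-in degree-1 least-squares fit,
# which returns the coefficients as [slope, intercept]
check = np.polyfit(x, y, deg=1)
print(beta1_hat, beta0_hat)
print(check)  # the two pairs should agree
```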

Example: Modeling GDP growth (World Bank Data)

For this part, we’ll switch back to a Jupyter Notebook that you can download on the course website so you can run the code along with my explanations.

So what now? Evaluating linear regression

A few metrics

  1. \(R^2\) or coefficient of determination

\[ \begin{align} R^2 &= 1-\frac{\mathrm{RSS}}{\mathrm{TSS}}\\ &= 1-\frac{\sum_{i=1}^N (y_i-\hat{y}_i)^2}{\sum_{i=1}^N (y_i-\bar{y})^2} \end{align} \]

  • RSS is the residual sum of squares, which is the sum of squared residuals. This value captures the prediction error of a model.

  • TSS is the total sum of squares. To calculate it, assume a simple baseline model in which the prediction for every observation is the mean of the observed values. TSS is proportional to the variance of the dependent variable, since \(\frac{TSS}{N}\) is the variance of \(y\), where \(N\) is the number of observations. Think of \(TSS\) as the variation that this simple mean-only model cannot explain.
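A small numeric sketch, with made-up \(y\) and \(\hat{y}\) values and assuming numpy is available, of how RSS, TSS and \(R^2\) fit together:

```python
import numpy as np

y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])       # observed values
y_hat = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # model predictions

rss = np.sum((y - y_hat) ** 2)       # what the model fails to explain
tss = np.sum((y - y.mean()) ** 2)    # what a "predict the mean" model fails to explain
r2 = 1 - rss / tss
print(r2)  # close to 1 here, because the predictions track y closely
```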

Caveats:

It does not:

  • indicate whether enough data points were used to make a solid conclusion!
  • show whether collinearity exists between explanatory variables
  • indicate whether the most appropriate independent variables were used for the model or the correct regression was used
  • indicate whether the model might be improved by using transformed versions of the existing set of independent variables
  • show that the independent variables are a cause of the changes in the dependent variable

⚠️ WARNING: \(R^2\) increases with every variable you add to your model (even if there is only a chance correlation between the variables and the new variable simply adds noise!). A regression model with more independent variables than another can look like a better fit simply because it has more variables!

⚠️ WARNING: When a model has an excessive number of independent variables and/or polynomial terms, it starts to fit the idiosyncrasies and random noise of the sample too closely rather than reflecting the underlying population: that’s called overfitting. This phenomenon results in deceptively high \(R^2\) values and decreases the precision of predictions.

Adjusted \(R^2\) corrects for the number of predictors in a regression model, so you may prefer to report it instead of (or alongside) plain \(R^2\).
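For reference, the usual definition of adjusted \(R^2\), with \(N\) observations and \(p\) predictors, is:

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{N - 1}{N - p - 1} \]

Unlike plain \(R^2\), it can go down when an added predictor contributes little beyond noise.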

So what now? Evaluating linear regression

  2. \(RMSE\) or root mean squared error
    \[RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y_i})^2}\]
  • a metric independent of the dataset size (the \(RSS\) is divided by the number of observations \(N\))
  • measures the average difference between the values predicted by a model and the actual values, providing an estimate of how well the model can predict the target value (accuracy)
  • the lower the value of the Root Mean Squared Error, the better the model is
  • has the advantage of representing the amount of error in the same unit as the predicted column, making it easy to interpret: e.g. if you are trying to predict an amount in GBP, then the Root Mean Squared Error can be interpreted as the amount of error in GBP

⚠️ WARNING: RMSE is sensitive to outliers and, because of the squaring, emphasizes large errors.

See also here

So what now? Evaluating linear regression

  3. \(MAE\) or mean absolute error
    \[MAE=\frac{1}{N}\sum_{i=1}^N|y_i-\hat{y_i}|\]
  • metric independent of the dataset size
  • measures average absolute difference between the predicted values and the actual target values. It provides an estimation of how well the model is able to predict the target value (accuracy)
  • The lower the value of the Mean Absolute Error, the better the model is
  • has the advantage of representing the amount of error in the same unit as the predicted column, making it easy to interpret: e.g. if you are trying to predict an amount in GBP, then the Mean Absolute Error can be interpreted as the amount of error in GBP
  • provides a balanced representation of errors (no error term emphasized)
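Both RMSE (above) and MAE are one-liners in numpy; a small sketch with made-up GBP amounts:

```python
import numpy as np

y = np.array([100.0, 150.0, 200.0, 250.0, 300.0])       # actual amounts (e.g. GBP)
y_hat = np.array([110.0, 140.0, 205.0, 230.0, 310.0])   # model predictions

rmse = np.sqrt(np.mean((y - y_hat) ** 2))   # penalises large errors more heavily
mae = np.mean(np.abs(y - y_hat))            # treats all errors equally

print(rmse, mae)  # both are expressed in the same unit as y (here, GBP)
```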

What’s next?

How to revise for this course next week.

  1. To understand in detail all the assumptions implicitly made by linear models,
    read (James et al. 2023, chaps. 2–3)
    (Not compulsory but highly recommended reading)
  2. Have a look at extensions of linear models here
  3. Take a crack at LASSO and Ridge regression models (you might encounter them in bonus tasks in the Week 3 lab) here and here or on this Datacamp tutorial.
  4. You can also take a look at (Hohl 2009) for an interesting paper on some of the limitations of linear regression

References

Hohl, Katrin. 2009. “Beyond the Average Case: The Mean Focus Fallacy of Standard Linear Regression and the Use of Quantile Regression for the Social Sciences.” Available at SSRN 1434418. http://dx.doi.org/10.2139/ssrn.1434418.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning: With Applications in Python. Springer Texts in Statistics. Cham, Switzerland: Springer Cham. https://www.statlearning.com/.
Karafiath, Imre. 2009. “Is There a Viable Alternative to Ordinary Least Squares Regression When Security Abnormal Returns Are the Dependent Variable?” Review of Quantitative Finance and Accounting 32 (1): 17–31. https://doi.org/10.1007/s11156-007-0079-y.