What happens to R-squared when we add more predictors?
DS202 Blog Post
About \(R^2\)
As we saw briefly in Week 02 and as defined in Chapter 3 of the textbook (James et al. 2021), \(R^2\) represents the proportion of variability in the response variable that can be explained using the independent variables (features). More specifically:
\[ R^2 = 1 - \frac{\operatorname{RSS}}{\operatorname{TSS}} \]
where \(\operatorname{TSS}\) represents the Total Sum of Squares:
\[ \operatorname{TSS} = \sum_{i=1}^n{\left(y_i - \bar{y}\right)^2}, \]
and \(\bar{y}\) is the mean of \(\mathbf{y}\).
As for the \(\operatorname{RSS}\), it represents the Residual Sum of Squares:
\[ \operatorname{RSS} = \sum_{i=1}^n{\left(y_i - \hat{y}_i\right)^2}, \]
where \(\hat{y}_i\) is the value fitted by the regression for observation \(i\).
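To make these quantities concrete, here is a minimal sketch in Python. The response values and fitted values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical response values and fitted values, for illustration only
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
y_hat = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

tss = np.sum((y - y.mean()) ** 2)  # Total Sum of Squares
rss = np.sum((y - y_hat) ** 2)     # Residual Sum of Squares
r_squared = 1 - rss / tss

print(f"TSS = {tss:.3f}, RSS = {rss:.3f}, R^2 = {r_squared:.4f}")
```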
To understand why \(R^2\) can never decrease if we add more features, I think it is useful to visualise linear regression in matrix format.
Data and Variables
Suppose you have the following data:
\[ \mathbf{X} = \left[ \begin{array}{c} x_{11}\\ x_{21}\\ \vdots\\ x_{n1} \end{array}\right], \]
that is, there is only \(p=1\) feature for the \(n\) observations of data, and you want to fit a linear regression to attempt to predict the \(\mathbf{y}\) column below:
\[ \mathbf{y} = \left[ \begin{array}{c} y_{1}\\ y_2\\ \vdots\\ y_n \end{array}\right]. \]
Simple Linear Regression in matrix format
We can also represent the regression coefficients obtained by Ordinary Least Squares (OLS) in matrix format.
\(\hat{\boldsymbol{\beta}}\) is a \((p+1) \times 1\) column vector:
\[ \hat{\boldsymbol{\beta}} = \left[ \begin{array}{c} \hat{\beta}_{0}\\ \hat{\beta}_1 \end{array}\right]. \]
When thinking about regression this way, it is often useful to add an additional column of 1s to \(\mathbf{X}\):
\[ \mathbf{X} = \left[ \begin{array}{cc} 1 & x_{11}\\ 1 & x_{21}\\ \vdots & \vdots\\ 1 & x_{n1} \end{array}\right], \]
so that:
\[ \mathbf{X}\hat{\boldsymbol{\beta}} =\left[ \begin{array}{c} \hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{11} \\ \hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{21} \\ \vdots \\ \hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{n1}\\ \end{array}\right] = \hat{\mathbf{y}} \]
represents the estimated values \(\hat{\mathbf{y}}\).
Residuals can then be represented as \(\mathbf{e} = (\mathbf{y} - \hat{\mathbf{y}})\), or
\[ \mathbf{e} =\left[ \begin{array}{c} y_{1} - (\hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{11}) \\ y_{2} - (\hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{21} )\\ \vdots \\ y_{n} - (\hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{n1})\\ \end{array}\right] \]
OLS finds the coefficients \(\hat{\boldsymbol{\beta}}\) that minimise the \(\operatorname{RSS}\).
In matrix format, the \(\operatorname{RSS}\) can be represented as:
\[ \operatorname{RSS} = \mathbf{e}^T\mathbf{e}. \]
OLS will find an optimal solution that minimises the \(\operatorname{RSS}\). Let's call this minimum \(\operatorname{RSS}_{(p=1)}\).
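Here is a short sketch of the whole \(p=1\) fit in Python. The sample size, seed, true coefficients, and noise level are all arbitrary choices made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: n = 50 observations of a single feature (p = 1);
# the true intercept (2.0), slope (1.5), and noise scale are arbitrary
n = 50
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(scale=0.5, size=n)

# Design matrix X with the additional column of 1s for the intercept
X = np.column_stack([np.ones(n), x1])

# OLS: beta_hat minimises RSS = e^T e
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

e = y - X @ beta_hat  # residual vector e = y - y_hat
rss_p1 = e @ e        # RSS = e^T e
print(f"beta_hat = {np.round(beta_hat, 3)}, RSS_(p=1) = {rss_p1:.3f}")
```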
What if I add another feature?
💡 What would happen if I added a second feature and ran a regression with \(p=2\)?
Think of the implications this has for the coefficient vector \(\hat{\boldsymbol{\beta}}\) and the residual vector \(\mathbf{e}\) if I decided to add a second column of data to \(\mathbf{X}\).
The new extended feature matrix, let's call it \(\mathbf{X}'\), would look like the following:
\[ \mathbf{X}' = \left[ \begin{array}{ccc} 1 & x_{11}&x_{12}\\ 1 & x_{21}&x_{22}\\ \vdots & \vdots&\vdots\\ 1 & x_{n1}&x_{n2} \end{array}\right], \]
OLS would find new coefficients, let's call them \(\hat{\boldsymbol{\beta}}'\):
\[ \hat{\boldsymbol{\beta}}' = \left[ \begin{array}{c} \hat{\beta}'_{0}\\ \hat{\beta}'_1\\ \hat{\beta}'_2 \end{array}\right]. \]
Similarly, the new \(\mathbf{e}'\) could be represented as:
\[ \mathbf{e}' =\left[ \begin{array}{c} y_{1} - (\hat{\beta}'_{0} \times 1 + \hat{\beta}'_1 \times x_{11} + \hat{\beta}'_2 \times x_{12})\\ y_{2} - (\hat{\beta}'_{0} \times 1 + \hat{\beta}'_1 \times x_{21} + \hat{\beta}'_2 \times x_{22})\\ \vdots \\ y_{n} - (\hat{\beta}'_{0} \times 1 + \hat{\beta}'_1 \times x_{n1} + \hat{\beta}'_2 \times x_{n2})\\ \end{array}\right] \]
Again, if we run OLS, the algorithm will find the coefficients that minimise the sum of squares of the residuals above, yielding a minimum \(\operatorname{RSS}_{(p=2)}\). What can we expect of \(\operatorname{RSS}_{(p=2)}\) in relation to \(\operatorname{RSS}_{(p=1)}\)? Let's think through some scenarios:
If it turned out that \(\hat{\beta}'_0 = \hat{\beta}_0\) and \(\hat{\beta}'_1 = \hat{\beta}_1\):
- If \(\hat{\beta}'_2 = 0\), then we could conclude that \(\mathbf{e}' = \mathbf{e}\) and \(\operatorname{RSS}_{(p=2)} = \operatorname{RSS}_{(p=1)}\). That is, by minimising the sum of squares, OLS cannot find a better solution that involves \(\hat{\beta}'_2\).
- If \(\hat{\beta}'_2 \neq 0\), then notice that the new residuals are related to the previous ones: \({e_i}' = (e_i - \hat{\beta}'_2x_{i2}) ~~\forall i\). Since setting \(\hat{\beta}'_2 = 0\) would reproduce the old residuals exactly, OLS only chooses \(\hat{\beta}'_2 \neq 0\) when that choice lowers the sum of squares. Therefore \(\operatorname{RSS}_{(p=2)} \leq \operatorname{RSS}_{(p=1)}\).
A similar logic applies to the case where \(\hat{\beta}'_0 \neq \hat{\beta}_0\) and \(\hat{\beta}'_1 \neq \hat{\beta}_1\): OLS minimises over all coefficients jointly, and the \(p=1\) solution remains attainable by setting \(\hat{\beta}'_2 = 0\), so the minimum cannot get worse (check 1 for a more formal proof). The simulation below illustrates this behaviour.
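This is a minimal simulation of that argument, assuming the same kind of made-up data as in the earlier sketch, plus a second feature \(x_2\) that is pure noise, deliberately unrelated to \(\mathbf{y}\):

```python
import numpy as np

rng = np.random.default_rng(42)

# Same kind of simulated data as before: y depends only on x1
n = 50
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(scale=0.5, size=n)

def fit_rss(X, y):
    """Fit OLS and return the minimised RSS = e^T e."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    return e @ e

X1 = np.column_stack([np.ones(n), x1])  # p = 1
x2 = rng.normal(size=n)                 # pure noise, unrelated to y
X2 = np.column_stack([X1, x2])          # extended matrix X' (p = 2)

tss = np.sum((y - y.mean()) ** 2)
rss_p1, rss_p2 = fit_rss(X1, y), fit_rss(X2, y)

print(f"RSS_(p=1) = {rss_p1:.4f} -> R^2 = {1 - rss_p1 / tss:.6f}")
print(f"RSS_(p=2) = {rss_p2:.4f} -> R^2 = {1 - rss_p2 / tss:.6f}")
# RSS_(p=2) <= RSS_(p=1) even though x2 carries no real signal,
# so R^2 goes up (or stays the same), never down.
```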
Conclusion
By adding a new feature, we give OLS strictly more freedom: the \(p=1\) solution is always still available (just set the new coefficient to zero), so OLS will always find an equivalent or a lower \(\operatorname{RSS}\) value. If you check the \(R^2\) definition once again, you will realise that \(\operatorname{TSS}\) will never change (it depends only on \(\mathbf{y}\), not on \(\mathbf{X}\)); only \(\operatorname{RSS}\) can vary when you add new features. Since \(\operatorname{RSS}\) can only decrease or stay the same, \(R^2\) will always increase or stay the same, never decrease, if you add more features.
To correct for this misleading tendency of R-squared, an adjusted index has been proposed. The adjusted R-squared, shown below, takes the number of features into account and is what you should rely on when assessing the goodness-of-fit of a linear regression.
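For reference, the standard definition is:
\[ R^2_{\text{adj}} = 1 - \frac{\operatorname{RSS}/(n - p - 1)}{\operatorname{TSS}/(n - 1)}, \]
where \(n\) is the number of observations and \(p\) the number of features. Because adding a feature shrinks \(n - p - 1\), the adjusted \(R^2\) can decrease when a new feature does not reduce \(\operatorname{RSS}\) enough to justify it.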