What happens to R-squared when we add more predictors?
DS202 Blog Post
About \(R^2\)
As we saw briefly in Week 02 and as defined in Chapter 3 of the textbook (James et al. 2021), \(R^2\) represents the proportion of variability in the response variable that can be explained using the independent variables (features). More specifically:
\[ R^2 = 1 - \frac{\operatorname{RSS}}{\operatorname{TSS}} \]
where \(\operatorname{TSS}\) represents the Total Sum of Squares:
\[ \operatorname{TSS} = \sum_{i=1}^n{\left(y_i - \bar{y}\right)^2}, \]
and \(\bar{y}\) is the mean of \(\mathbf{y}\).
As for the \(\operatorname{RSS}\), it represents the Residual Sum of Squares:
\[ \operatorname{RSS} = \sum_{i=1}^n{\left(y_i - \hat{y}_i\right)^2}, \]
where \(\hat{y}_i\) is the value fitted by the regression for observation \(i\).
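To make these quantities concrete, here is a minimal sketch in Python. The response values and fitted values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical response values and fitted values, for illustration only
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
y_hat = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

tss = np.sum((y - y.mean()) ** 2)  # Total Sum of Squares
rss = np.sum((y - y_hat) ** 2)     # Residual Sum of Squares
r_squared = 1 - rss / tss

print(f"TSS = {tss:.3f}, RSS = {rss:.3f}, R^2 = {r_squared:.4f}")
```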
To understand why \(R^2\) can never decrease if we add more features, I think it is useful to visualise linear regression in matrix format.
Data and Variables
Suppose you have the following data:
\[ \mathbf{X} = \left[ \begin{array}{c} x_{11}\\ x_{21}\\ \vdots\\ x_{n1} \end{array}\right], \]
that is, there is only \(p=1\) feature for the \(n\) observations of data, and you want to fit a linear regression to attempt to predict the \(\mathbf{y}\) column below:
\[ \mathbf{y} = \left[ \begin{array}{c} y_{1}\\ y_2\\ \vdots\\ y_n \end{array}\right]. \]
Simple Linear Regression in matrix format
We can also represent the regression coefficients obtained by Ordinary Least Squares (OLS) in matrix format.
\(\hat{\boldsymbol{\beta}}\) is a \((p+1) \times 1\) column vector:
\[ \hat{\boldsymbol{\beta}} = \left[ \begin{array}{c} \hat{\beta}_{0}\\ \hat{\beta}_1 \end{array}\right]. \]
When thinking about regression this way, it is often useful to add an additional column of 1s to \(\mathbf{X}\):
\[ \mathbf{X} = \left[ \begin{array}{cc} 1 & x_{11}\\ 1 & x_{21}\\ \vdots & \vdots\\ 1 & x_{n1} \end{array}\right], \]
so that:
\[ \mathbf{X}\hat{\boldsymbol{\beta}} =\left[ \begin{array}{c} \hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{11} \\ \hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{21} \\ \vdots \\ \hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{n1}\\ \end{array}\right] = \hat{\mathbf{y}} \]
represents the estimated values \(\hat{\mathbf{y}}\).
Residuals can then be represented as \(\mathbf{e} = (\mathbf{y} - \hat{\mathbf{y}})\), or
\[ \mathbf{e} =\left[ \begin{array}{c} y_{1} - (\hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{11}) \\ y_{2} - (\hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{21} )\\ \vdots \\ y_{n} - (\hat{\beta}_{0} \times 1 + \hat{\beta}_1 \times x_{n1})\\ \end{array}\right] \]
OLS finds the coefficients \(\hat{\boldsymbol{\beta}}\) that minimise the \(\operatorname{RSS}\).
In matrix format, the \(\operatorname{RSS}\) can be represented as:
\[ \operatorname{RSS} = \mathbf{e}^T\mathbf{e}. \]
OLS will find an optimal solution that minimises the \(\operatorname{RSS}\). Let's call this minimum \(\operatorname{RSS}_{(p=1)}\).
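Here is a short sketch of the whole \(p=1\) fit in Python. The sample size, seed, true coefficients, and noise level are all arbitrary choices made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: n = 50 observations of a single feature (p = 1);
# the true intercept (2.0), slope (1.5), and noise scale are arbitrary
n = 50
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(scale=0.5, size=n)

# Design matrix X with the additional column of 1s for the intercept
X = np.column_stack([np.ones(n), x1])

# OLS: beta_hat minimises RSS = e^T e
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

e = y - X @ beta_hat  # residual vector e = y - y_hat
rss_p1 = e @ e        # RSS = e^T e
print(f"beta_hat = {np.round(beta_hat, 3)}, RSS_(p=1) = {rss_p1:.3f}")
```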
What if I add another feature?
💡 What would happen if I added a second feature and ran a regression with \(p=2\)?
Think of the implications this has for the coefficient vector \(\hat{\boldsymbol{\beta}}\) and the residual vector \(\mathbf{e}\) if I decided to add a second column of data to \(\mathbf{X}\).
The new extended feature matrix, let's call it \(\mathbf{X}'\), would look like the following:
\[ \mathbf{X}' = \left[ \begin{array}{ccc} 1 & x_{11}&x_{12}\\ 1 & x_{21}&x_{22}\\ \vdots & \vdots&\vdots\\ 1 & x_{n1}&x_{n2} \end{array}\right], \]
OLS would find new coefficients, let's call them \(\hat{\boldsymbol{\beta}}'\):
\[ \hat{\boldsymbol{\beta}}' = \left[ \begin{array}{c} \hat{\beta}'_{0}\\ \hat{\beta}'_1\\ \hat{\beta}'_2 \end{array}\right]. \]
Similarly, the new \(\mathbf{e}'\) could be represented as:
\[ \mathbf{e}' =\left[ \begin{array}{c} y_{1} - (\hat{\beta}'_{0} \times 1 + \hat{\beta}'_1 \times x_{11} + \hat{\beta}'_2 \times x_{12})\\ y_{2} - (\hat{\beta}'_{0} \times 1 + \hat{\beta}'_1 \times x_{21} + \hat{\beta}'_2 \times x_{22})\\ \vdots \\ y_{n} - (\hat{\beta}'_{0} \times 1 + \hat{\beta}'_1 \times x_{n1} + \hat{\beta}'_2 \times x_{n2})\\ \end{array}\right] \]
Again, if we run OLS, the algorithm will find the coefficients that minimise the sum of squares of the residuals above, yielding a minimum \(\operatorname{RSS}_{(p=2)}\). What can we expect of \(\operatorname{RSS}_{(p=2)}\) in relation to \(\operatorname{RSS}_{(p=1)}\)? Let's think through some scenarios:
If it turned out that \(\hat{\beta}'_0 = \hat{\beta}_0\) and \(\hat{\beta}'_1 = \hat{\beta}_1\):
- If \(\hat{\beta}'_2 = 0\), then we could conclude that \(\mathbf{e}' = \mathbf{e}\) and \(\operatorname{RSS}_{(p=2)} = \operatorname{RSS}_{(p=1)}\). That is, by minimising the sum of squares, OLS cannot find a better solution that involves \(\hat{\beta}'_2\).
- If \(\hat{\beta}'_2 \neq 0\), then notice that the new residuals are related to the previous ones: \({e_i}' = (e_i - \hat{\beta}'_2x_{i2}) ~~\forall i\). Since setting \(\hat{\beta}'_2 = 0\) would reproduce the old residuals exactly, OLS only chooses \(\hat{\beta}'_2 \neq 0\) when that choice lowers the sum of squares. Therefore \(\operatorname{RSS}_{(p=2)} \leq \operatorname{RSS}_{(p=1)}\).
A similar logic applies to the case where \(\hat{\beta}'_0 \neq \hat{\beta}_0\) and \(\hat{\beta}'_1 \neq \hat{\beta}_1\): OLS minimises over all coefficients jointly, and the \(p=1\) solution remains attainable by setting \(\hat{\beta}'_2 = 0\), so the minimum cannot get worse (check 1 for a more formal proof). The simulation below illustrates this behaviour.
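This is a minimal simulation of that argument, assuming the same kind of made-up data as in the earlier sketch, plus a second feature \(x_2\) that is pure noise, deliberately unrelated to \(\mathbf{y}\):

```python
import numpy as np

rng = np.random.default_rng(42)

# Same kind of simulated data as before: y depends only on x1
n = 50
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(scale=0.5, size=n)

def fit_rss(X, y):
    """Fit OLS and return the minimised RSS = e^T e."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    return e @ e

X1 = np.column_stack([np.ones(n), x1])  # p = 1
x2 = rng.normal(size=n)                 # pure noise, unrelated to y
X2 = np.column_stack([X1, x2])          # extended matrix X' (p = 2)

tss = np.sum((y - y.mean()) ** 2)
rss_p1, rss_p2 = fit_rss(X1, y), fit_rss(X2, y)

print(f"RSS_(p=1) = {rss_p1:.4f} -> R^2 = {1 - rss_p1 / tss:.6f}")
print(f"RSS_(p=2) = {rss_p2:.4f} -> R^2 = {1 - rss_p2 / tss:.6f}")
# RSS_(p=2) <= RSS_(p=1) even though x2 carries no real signal,
# so R^2 goes up (or stays the same), never down.
```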
Conclusion
By adding a new feature, we give OLS strictly more freedom: the \(p=1\) solution is always still available (just set the new coefficient to zero), so OLS will always find an equivalent or a lower \(\operatorname{RSS}\) value. If you check the \(R^2\) definition once again, you will realise that \(\operatorname{TSS}\) will never change (it depends only on \(\mathbf{y}\), not on \(\mathbf{X}\)); only \(\operatorname{RSS}\) can vary when you add new features. Since \(\operatorname{RSS}\) can only decrease or stay the same, \(R^2\) will always increase or stay the same, never decrease, if you add more features.
To correct for this misleading tendency of R-squared, an adjusted index has been proposed. The adjusted R-squared, shown below, takes the number of features into account and is what you should rely on when assessing the goodness-of-fit of a linear regression.
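For reference, the standard definition is:
\[ R^2_{\text{adj}} = 1 - \frac{\operatorname{RSS}/(n - p - 1)}{\operatorname{TSS}/(n - 1)}, \]
where \(n\) is the number of observations and \(p\) the number of features. Because adding a feature shrinks \(n - p - 1\), the adjusted \(R^2\) can decrease when a new feature does not reduce \(\operatorname{RSS}\) enough to justify it.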