🗓️ Week 02:
Multiple Linear Regression

DS202 Data Science for Social Scientists

10/7/22

Assessing goodness-of-fit

How well does my model fit the data?

  • A look at three metrics:
    • \(\mathrm{RSE}\)
    • \(R^2\)
    • Pearson correlation

Residual Standard Error (RSE)

  • Recall the “true model”: \(Y = f(X) + \epsilon\)
  • Even if we knew the true values of \(\beta_0\) and \(\beta_1\) — not just the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) — our predictions of sales might still be off.
  • By how much?

Residual Standard Error (RSE)

  • This is governed by the variance of the errors, \(\sigma^2 = \operatorname{Var}(\epsilon)\).
  • As said earlier, its square root \(\sigma\) can be estimated, in the simple linear regression case, by the Residual Standard Error (\(\mathrm{RSE}\)) formula below:

\[ \sigma \approx \mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-\mathrm{df}}} \]

where \(\mathrm{RSS} = \sum_{i=1}^n{(y_i - \hat{y}_i)^2}\) is the residual sum of squares and \(\mathrm{df}\) is the number of parameters estimated by the model, so that \(n - \mathrm{df}\) is the residual degrees of freedom (\(n - 2\) in simple linear regression).
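As a sanity check, the \(\mathrm{RSE}\) can be computed by hand from a fitted model. A minimal sketch in R, assuming the tv_model (sales regressed on TV) and the advertising data frame fitted earlier:

rss <- sum(residuals(tv_model)^2)   # residual sum of squares
n   <- nobs(tv_model)               # number of observations
df  <- length(coef(tv_model))       # parameters estimated: intercept + slope
sqrt(rss / (n - df))                # the RSE; compare with sigma(tv_model)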

Back to our Advertising linear models

Let’s compare the linear models we fitted earlier:

  • TV 📺
out <- capture.output(summary(tv_model))
cat(out[16], sep = "\n")
Residual standard error: 3.259 on 198 degrees of freedom
  • Radio 📻
out <- capture.output(summary(radio_model))
cat(out[16], sep = "\n")
Residual standard error: 4.275 on 198 degrees of freedom
  • Newspaper 📰
out <- capture.output(summary(newspaper_model))
cat(out[16], sep = "\n")
Residual standard error: 5.092 on 198 degrees of freedom

🗨️ What does it mean?

The \(R^2\) statistic

  • The \(R^2\) statistic, or fraction of variance explained, is defined as:

\[ R^2 = \frac{\mathrm{TSS - RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]

where \(\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2\) is the total sum of squares.

Tip

Intuitively, \(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\).

  • \(R^2\) close to 1 means that a large proportion of the variance in \(Y\) is explained by the regression.

  • \(R^2\) close to 0 means that the regression does not explain much of the variability in \(Y\).
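To make the definition concrete, \(R^2\) can be computed directly from \(\mathrm{TSS}\) and \(\mathrm{RSS}\). A minimal sketch, again assuming tv_model and the advertising data frame from earlier:

rss <- sum(residuals(tv_model)^2)                            # residual sum of squares
tss <- sum((advertising$sales - mean(advertising$sales))^2)  # total sum of squares
1 - rss / tss                                                # compare with summary(tv_model)$r.squared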

Sample correlation coefficient

By the way, in the simple linear regression setting, it can be shown that \(R^2 = (\operatorname{Cor}(X, Y))^2\), where \(\operatorname{Cor}(X, Y)\) is the correlation between \(X\) and \(Y\):

\[ \operatorname{Cor}(X, Y) = \frac{\sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}. \]
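A quick check of this identity for the TV model (assuming the advertising data frame from earlier):

cor(advertising$TV, advertising$sales)^2   # should equal summary(tv_model)$r.squared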

F-statistic

We used the t-statistic to compute p-values for the individual coefficients (\(\hat{\beta}_0\) and \(\hat{\beta}_1\)).
Now, how do I test whether the model, as a whole, makes sense?

  • For this, we perform the hypothesis test: \[ \begin{align} &~~~~H_0:&\beta_1 = \beta_2 = \ldots = \beta_p = 0 \\ &\text{vs} \\ &~~~~H_A:& \text{at least one } \beta_j \neq 0. \end{align} \]
  • which is performed by computing the F-statistic: \[ F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / p}{\mathrm{RSS}/(n - p - 1)} \sim F_{p, n-p-1} \]
  • If \(H_0\) is true (there is no relationship between the response and the predictors), we expect \(F\) to be close to 1.
  • If \(H_A\) is true, we expect \(F\) to be greater than 1.
  • Check (James et al. 2021, 75–77) for an in-depth explanation of this test.
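For a simple regression (\(p = 1\)), the F-statistic can also be reproduced by hand. A sketch, assuming tv_model and the advertising data frame from earlier:

rss <- sum(residuals(tv_model)^2)                            # residual sum of squares
tss <- sum((advertising$sales - mean(advertising$sales))^2)  # total sum of squares
n <- nobs(tv_model); p <- 1
((tss - rss) / p) / (rss / (n - p - 1))                      # compare with summary(tv_model)$fstatistic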

Back to our Advertising linear models

How well do our models explain the variability of the response?

  • TV 📺
out <- capture.output(summary(tv_model))
cat(out[17:18], sep = "\n")
Multiple R-squared:  0.6119,    Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
  • Radio 📻
out <- capture.output(summary(radio_model))
cat(out[17:18], sep = "\n")
Multiple R-squared:  0.332, Adjusted R-squared:  0.3287 
F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16
  • Newspaper 📰
out <- capture.output(summary(newspaper_model))
cat(out[17:18], sep = "\n")
Multiple R-squared:  0.05212,   Adjusted R-squared:  0.04733 
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

🗨️ What does it mean?

Interpreting Multiple Linear Regression

What changes when we fit a regression model using multiple predictors instead of just one predictor at a time?

A multiple linear regression to Advertising

Fitting all predictors:

TV 📺 + Radio 📻 + Newspaper 📰

full_model <- lm(sales ~ ., data=advertising)
summary(full_model)

Call:
lm(formula = sales ~ ., data = advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Confidence Intervals

confint(full_model)
                  2.5 %     97.5 %
(Intercept)  2.32376228 3.55401646
TV           0.04301371 0.04851558
radio        0.17154745 0.20551259
newspaper   -0.01261595 0.01054097
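These intervals can be reconstructed from the coefficient table as estimate \(\pm\) \(t\)-quantile \(\times\) standard error. A minimal sketch for the TV coefficient:

ct <- coef(summary(full_model))  # coefficient table: estimates and standard errors
ct["TV", "Estimate"] +
  c(-1, 1) * qt(0.975, df = df.residual(full_model)) * ct["TV", "Std. Error"]
# compare with the TV row of confint(full_model) above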

Interpreting the coefficients

  • Recall the multiple regression model:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon , \]

  • We interpret \(\beta_j\) as the average effect on \(Y\) of a one unit increase in \(X_j\), holding all other predictors fixed. In the advertising example, the model becomes

\[ \mathrm{sales} = \beta_0 + \beta_1 \times \mathrm{TV} + \beta_2 \times \mathrm{radio} + \beta_3 \times \mathrm{newspaper} + \epsilon . \]
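One way to see this interpretation in action is to compare predictions that differ in a single predictor only; the budget values below are purely hypothetical:

x0 <- data.frame(TV = 100, radio = 20, newspaper = 30)  # hypothetical budgets
x1 <- transform(x0, TV = TV + 1)                        # same budgets, TV one unit higher
predict(full_model, newdata = x1) - predict(full_model, newdata = x0)  # equals coef(full_model)["TV"]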

Interpreting the coefficients

  • The ideal scenario is when the predictors are uncorrelated – a balanced design:
    • Each coefficient can be estimated and tested separately.
    • Interpretations such as “a unit change in \(X_j\) is associated with a \(\beta_j\) change in \(Y\), while all the other variables stay fixed”, are possible.
  • Correlations amongst predictors cause problems:
    • The variance of all coefficients tends to increase, sometimes dramatically.
    • Interpretations become hazardous – when \(X_j\) changes, everything else changes.
  • Claims of causality should be avoided for observational data.

Interaction effects

➡️ Predictors are not truly independent.

🤔 How should I account for the “synergy” (interaction) between them?

🖥️ I will share my screen to show you some examples.
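As a preview of the live demo, one common way to include an interaction ("synergy") term in R is the * operator in the model formula. A sketch (this model is not fitted in these slides; interaction_model is just an illustrative name):

interaction_model <- lm(sales ~ TV * radio, data = advertising)  # main effects plus TV:radio
summary(interaction_model)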

Things to think about

  • What should I do if I have categorical variables? (See the sketch after this list.)
    • For example: gender, education level, marital status, ethnicity?
  • What if I have too many variables? Which ones should I include or exclude?
    • By the way, when \(p \gg n\), ordinary least squares is not reliable.
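For the categorical-variable question, a hypothetical simulated example (not part of the Advertising data) shows how lm() turns a factor into dummy variables automatically:

set.seed(1)
toy <- data.frame(
  income    = rnorm(100, mean = 30, sd = 5),
  education = factor(sample(c("school", "college", "degree"), 100, replace = TRUE))
)
toy$happiness <- 2 + 0.1 * toy$income + ifelse(toy$education == "degree", 0.5, 0) + rnorm(100)
summary(lm(happiness ~ income + education, data = toy))  # one dummy per non-baseline level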

Note

  • There are still many other aspects of linear regression we haven’t covered.
  • We will explore some of these questions briefly in the lab.

What’s Next

Your Checklist:

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Second edition. Springer Texts in Statistics. New York NY: Springer. https://www.statlearning.com/.