DS202 Data Science for Social Scientists
10/7/22
How well does my model fit the data?
\[ \sigma \approx \mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-\mathrm{df}}} \]
where \(\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\) is the residual sum of squares and \(\mathrm{df}\) is the number of degrees of freedom used by our model (2 for a simple linear regression: one intercept plus one slope).
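As a minimal numerical sketch (in Python with NumPy on simulated data, standing in for the course's R workflow), the RSE can be computed directly from the residuals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated simple linear regression: y = 3 + 0.05 * x + noise with sigma = 3
n = 200
x = rng.uniform(0, 300, n)
y = 3 + 0.05 * x + rng.normal(0, 3, n)

# Least-squares fit (polyfit returns slope first, then intercept)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# RSS and RSE with df = 2 (intercept + slope)
rss = np.sum((y - y_hat) ** 2)
rse = np.sqrt(rss / (n - 2))
print(round(rse, 3))  # should land near the true sigma = 3
```

The printed value is the same quantity R's `summary()` reports as "Residual standard error … on 198 degrees of freedom".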
Let’s compare the linear models we fitted earlier:
Residual standard error: 3.259 on 198 degrees of freedom
Residual standard error: 4.275 on 198 degrees of freedom
\[ R^2 = \frac{\mathrm{TSS - RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]
where TSS = \(\sum_{i=1}^n (y_i - \bar{y})^2\) is the total sum of squares.
By the way, in the simple linear regression setting, it can be shown that \(R^2 = (\operatorname{Cor}(X, Y))^2\), where \(\operatorname{Cor}(X, Y)\) is the correlation between \(X\) and \(Y\):
\[ \operatorname{Cor}(X, Y) = \frac{\sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}. \]
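To see that identity numerically (a Python/NumPy sketch on simulated data, not the Advertising data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 300, n)
y = 7 + 0.05 * x + rng.normal(0, 3, n)

# Fit simple linear regression and compute R^2 = 1 - RSS/TSS
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

# The squared sample correlation agrees with R^2 in simple regression
cor = np.corrcoef(x, y)[0, 1]
print(np.isclose(r2, cor ** 2))  # True
```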
We used the t-statistic to compute p-values for the individual coefficients (\(\hat{\beta}_0\) and \(\hat{\beta}_1\)).
Now, how do we test whether the model as a whole makes sense?
How well do our models explain the variability of the response?
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
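The F-statistic in those summaries tests \(H_0\): all slope coefficients are zero, via \(F = \frac{(\mathrm{TSS}-\mathrm{RSS})/p}{\mathrm{RSS}/(n-p-1)}\). A sketch reproducing it on simulated data (Python/NumPy; the numbers above come from R's `summary()`):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 1  # simple regression: one predictor

x = rng.uniform(0, 300, n)
y = 7 + 0.05 * x + rng.normal(0, 3, n)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
print(round(f_stat, 1))  # a large F is evidence against H0: beta_1 = 0
```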
What changes when we fit a regression model using multiple predictors instead of just one predictor at a time?
Fitting all predictors:
TV 📺 + Radio 📻 + Newspaper 📰
Call:
lm(formula = sales ~ ., data = advertising)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
radio 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon , \]
\[ \mathrm{sales} = \beta_0 + \beta_1 \times \mathrm{TV} + \beta_2 \times \mathrm{radio} + \beta_3 \times \mathrm{newspaper} + \epsilon . \]
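A sketch of fitting such a model by least squares (Python/NumPy on simulated data standing in for the Advertising set; the coefficients used to generate the data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Simulated stand-ins for TV, radio and newspaper spend
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
sales = 2.9 + 0.046 * tv + 0.19 * radio + 0.0 * newspaper + rng.normal(0, 1.7, n)

# Design matrix with an intercept column, solved by least squares
X = np.column_stack([np.ones(n), tv, radio, newspaper])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(np.round(beta, 3))  # estimates for (intercept, TV, radio, newspaper)
```

Note the newspaper estimate hovers near zero, mirroring its non-significant coefficient in the R summary above.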
➡️ Predictors are not truly independent.
🤔 How should I account for the “synergy” (interaction) between them?
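One common way to model such synergy (a sketch, not the lecture's own code) is to add a product term, which in matrix form is just one extra column:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)

# Simulated response with a genuine TV x radio interaction
sales = 6 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(0, 1, n)

# Interaction model: the extra column holds the elementwise product
X = np.column_stack([np.ones(n), tv, radio, tv * radio])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(np.round(beta, 4))  # the last entry estimates the interaction effect
```

In R's formula syntax this is `lm(sales ~ TV * radio, data = advertising)`, which expands to `TV + radio + TV:radio`.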
🖥️ I will share my screen to show you some examples.
gender, education level, marital status, ethnicity?

Your Checklist:
📙 Read (James et al. 2021, chap. 3)
👀 Browse the slides again
📝 Take note of anything that isn’t clear to you
📟 Share your questions on the /week02 channel on Slack
💻 Have a look at this week’s lab page