✍️ Week 03 - Lab Solutions
DS202 - Data Science for Social Scientists
Solutions to exercises
Use the `lm()` function to perform a simple linear regression, and use the `summary()` function to print the results:

```r
> library(ISLR2)
> Auto <- na.omit(Auto)
> lm.fit <- lm(mpg ~ horsepower, data = Auto)
> summary(lm.fit)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,	Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
```
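As a quick aside, the point estimates and their 95% confidence intervals can also be pulled out programmatically instead of being read off the summary (a minimal sketch using base R accessors):

```r
coef(lm.fit)     # intercept and slope estimates
confint(lm.fit)  # 95% confidence intervals for both coefficients
```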
Judging from the p-values of the t-test and the F-test, there is a strong relationship between the predictor `horsepower` and the response `mpg`. From the sign of the coefficient, the relationship between the predictor and the response is negative. Using the `predict()` function to predict the value of the response along with its confidence interval, we get:

```r
> predict(lm.fit, data.frame(horsepower = 98), interval = "confidence")
       fit      lwr      upr
1 24.46708 23.97308 24.96108
```
Therefore, the predicted `mpg` associated with a `horsepower` of 98 is 24.47, and the associated 95% confidence interval is [23.97308, 24.96108].
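For comparison, the same call with `interval = "prediction"` gives a prediction interval for an individual car rather than for the mean response; it is necessarily wider because it also accounts for the irreducible error:

```r
# Prediction interval for a single car with horsepower = 98
# (wider than the confidence interval above)
predict(lm.fit, data.frame(horsepower = 98), interval = "prediction")
```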
Use the `plot()` and `abline()` functions:

```r
> attach(Auto)
> plot(horsepower, mpg)  # predictor on the x-axis so abline() overlays correctly
> abline(lm.fit, lwd = 3, col = "red")
> dev.off()
```
Use the `plot()` function to produce diagnostic plots:

```r
> par(mfrow = c(2, 2))
> plot(lm.fit)
> dev.off()
```
Observing the four diagnostic plots, we can see a non-linear pattern in the residuals. This quadratic trend in the residuals could be a problem. We then plot the studentized residuals to identify outliers:
```r
> plot(predict(lm.fit), rstudent(lm.fit))
> dev.off()
```
There are possible outliers in the plot of studentized residuals, as some observations have values greater than 3.
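To flag those points programmatically rather than by eye, one option is to filter the studentized residuals directly (a minimal sketch):

```r
# Indices of observations whose studentized residuals exceed 3 in absolute value
which(abs(rstudent(lm.fit)) > 3)
```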
Use the `pairs()` function to produce a scatterplot matrix:

```r
> pairs(Auto)
> dev.off()
```
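If the full matrix is too crowded, `pairs()` also accepts a formula restricting the plot to selected columns (an optional variation):

```r
# Scatterplot matrix for a subset of the variables
pairs(~ mpg + displacement + horsepower + weight + acceleration, data = Auto)
```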
Use the `cor()` function to compute the matrix of correlations between the variables, excluding the `name` variable:

```r
> cor(subset(Auto, select = -name))
```
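The raw matrix is hard to scan, so it can help to round it (purely cosmetic):

```r
# Same correlation matrix, rounded to two decimal places
round(cor(subset(Auto, select = -name)), 2)
```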
Use the `lm()` function to perform a multiple linear regression, and use the `summary()` function to print the results:

```r
> lm.fit1 <- lm(mpg ~ . - name, data = Auto)
> summary(lm.fit1)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,	Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
```
Yes, there is a relationship between the predictors and the response: testing the null hypothesis that all the regression coefficients are zero, the F-statistic is far from 1 (with a very small p-value), giving strong evidence against the null hypothesis.
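To make that reasoning concrete, the F-test p-value can be recomputed from the reported statistic and degrees of freedom as a sanity check:

```r
# P(F > 252.4) with 7 and 384 degrees of freedom; effectively zero
pf(252.4, df1 = 7, df2 = 384, lower.tail = FALSE)
```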
Observing the p-values associated with each predictor's t-statistic, we see that `displacement`, `weight`, `year`, and `origin` have a statistically significant relationship with the response, while `cylinders`, `horsepower`, and `acceleration` do not.
The regression coefficient for `year` is 0.75. This suggests that, holding all other predictors fixed, `mpg` increases by 0.75 for each additional model year, or about 7.5 mpg over a decade. In other words, cars have become more fuel efficient over time, by roughly 0.75 mpg per year.
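A 95% confidence interval for the `year` coefficient makes this estimate more precise:

```r
# 95% confidence interval for the year coefficient alone
confint(lm.fit1, "year")
```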
Use the `plot()` function to produce diagnostic plots:

```r
> par(mfrow = c(2, 2))
> plot(lm.fit1)
> dev.off()
```
From the leverage plot, we see that point 14 appears to have high leverage, although not a large residual. In addition, the quadratic trend in the residuals could be a problem, so linear regression may not be the best fit for this prediction task.
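The highest-leverage observation can also be located directly, which is the same information the leverage plot displays (a minimal sketch):

```r
# Index of the observation with the largest leverage (hat value)
which.max(hatvalues(lm.fit1))
```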
We plot studentized residuals to identify outliers:
```r
> plot(predict(lm.fit1), rstudent(lm.fit1))
> dev.off()
```
Again, there are possible outliers, as some observations in the plot of studentized residuals have values greater than 3.
Use the `*` and `:` symbols to fit linear regression models with interaction effects:

```r
> lm.fit2 <- lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto)
> summary(lm.fit2)
```
The interaction between `displacement` and `weight` is statistically significant, while the interaction between `cylinders` and `displacement` is not.
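As a footnote on the formula syntax, `a * b` is shorthand for `a + b + a:b`, so the model above can be written either way; the sketch below checks the equivalence (the object names `lm.fit2a` and `lm.fit2b` are just illustrative):

```r
# a * b expands to a + b + a:b, so these two specifications are identical
lm.fit2a <- lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto)
lm.fit2b <- lm(mpg ~ cylinders + displacement + weight +
                 cylinders:displacement + displacement:weight, data = Auto)
all.equal(coef(lm.fit2a), coef(lm.fit2b))  # TRUE
```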