library(ISLR2)
library(rsample)

set.seed(1) # Just so you and I have the same "random" numbers.

Auto.split <- initial_split(Auto, prop = 0.5, strata = mpg)
Auto.train <- training(Auto.split)
Auto.test <- testing(Auto.split)
💻 Week 05 - Lab Roadmap (90 min)
DS202 - Data Science for Social Scientists
Last week we learned how to run classification methods in R. As data scientists, we need to know how to examine the results we generate and how to justify them. So, in this week's lab we will focus on and practise validation methods in R.
If you feel confident at the current stage, feel free to explore more on your own. We have provided you with some supplementary online resources:
- 📖 R for Data Analytics (part of the courses by Abhay Singh)
- 📖 R for Data Science (book by Hadley Wickham & Garrett Grolemund)
R packages you will need:
install.packages("tidyverse")
install.packages("ISLR2")
install.packages("rsample")
install.packages("boot")
Step 1: The Validation Set Approach (15 mins)
In this step, we will explore how to extract a subset of the whole dataset as a training dataset, and then estimate the test error rates of various linear models. The dataset is Auto from the ISLR2 package.
Step 1.1: Training vs Testing splits
Split the Auto dataset into two halves by randomly selecting 196 observations out of the original 392. We performed splits using base R in the last lab; however, this can be done more easily using the rsample package. We first create a split object Auto.split, using prop = 0.5 to specify a \(50\%/50\%\) train/test split. We also specify strata = mpg, as we want our train/test split to have a similar mean and standard deviation for mpg. We then create dataframes using the training and testing functions.
Step 1.2: How good is your model at predicting the test set?
Fit a linear regression model using the training dataset (Auto.train), with mpg as the dependent variable (y) and horsepower as the independent variable (x). Then, use the fitted model to estimate mpg for Auto.test. Finally, calculate the MSE of the 196 observations in the validation set.
# use the lm() function to fit a linear regression model
lm.fit <- lm(mpg ~ horsepower, data = Auto.train)

# estimate the 'mpg' values with the lm.fit model
lm.pred <- predict(lm.fit, Auto.test)

# calculate MSE
mean((Auto.test$mpg - lm.pred)^2)
[1] 25.58657
Therefore, we have estimated the test MSE for the linear regression model: \(\mathbf{25.59}\). (Well done! 💪)
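If the MSE line above looks opaque, remember it is just the average of the squared gaps between the observed and predicted values. Here is a tiny worked example on made-up numbers (not from Auto):

```r
# Toy illustration of the MSE formula on invented values
observed  <- c(20, 25, 30)
predicted <- c(22, 24, 27)

# squared errors are 4, 1 and 9, so the MSE is (4 + 1 + 9) / 3
mean((observed - predicted)^2)
```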
Step 1.3: Does it get better if I modify the features using polynomial terms?
Repeat the second part to estimate the test error for the quadratic and cubic regressions.
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto.train)
mean((Auto.test$mpg - predict(lm.fit2, Auto.test))^2)
[1] 19.08374
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto.train)
mean((Auto.test$mpg - predict(lm.fit3, Auto.test))^2)
[1] 19.00317
💡 We can see that a model predicting mpg using a quadratic function of horsepower performs better than a model involving only a linear function of horsepower. Furthermore, adding a cubic function of horsepower lowers the test MSE only marginally (19.00 vs 19.08), so there is little evidence that going beyond the quadratic function of horsepower helps here.
Step 1.4: Visualisation
Want to see how well the quadratic fit maps onto the raw data?
Let's create a scatter plot with horsepower on the x-axis and mpg on the y-axis. Now, we can predict mpg using lm.fit2 and specify interval = 'confidence' to get \(95\%\) confidence intervals. Along with geom_line, we can use geom_ribbon to plot the line of best fit and the associated confidence intervals. Remember to specify alpha so that the ribbon is translucent - otherwise it will hide the predicted values.
library(tidyverse)

sim.data <- data.frame(horsepower = 46:230)
sim.pred <- predict(lm.fit2, sim.data, interval = 'confidence')
sim.data <- cbind(sim.data, sim.pred)
ggplot(data = sim.data, aes(x = horsepower, y = fit)) +
geom_point(data = Auto, aes(x = horsepower, y = mpg)) +
geom_line() +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.125, fill = 'blue') +
theme_minimal() +
labs(x = 'Horsepower', y = 'MPG')
Step 2: k-Fold Cross-Validation (5 mins)
In this step it is easy to repeat the procedure from Step 1, using the cv.glm function to implement k-fold CV.
Step 2.1: Using K=10 folds
Let's estimate the cross-validation errors corresponding to the polynomial fits of orders one to ten using ten-fold cross-validation (via K = 10).
The cv.glm() function is part of the boot library. Meanwhile, you can explore the object returned by cv.glm() by yourself to see what call, K, delta and seed mean; the cv.glm() help page will be useful when interpreting the results.
library(boot)
set.seed(17)
cv.error.10 <- c()

for(i in 1:10){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
}

cv.error.10
[1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
[9] 18.87013 20.95520
# we can plot the results by passing a data frame to ggplot
cv.data <- data.frame(poly = 1:10, cv.errs = cv.error.10)
ggplot(data = cv.data, aes(x = poly, y = cv.errs)) +
geom_point() +
geom_line(linetype = 'dashed') +
scale_x_continuous(breaks = 1:10) +
theme_minimal() +
labs(x = 'Degree of polynomial', y = 'Cross-validated MSE')
Note that in the line cv.error.10[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1], the argument name must be a capital K, not a lowercase k - cv.glm() is strict about this.
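If you want to inspect those components (call, K, delta, seed) directly, fit one model and keep the cv.glm() result in an object. This is a small sketch; the name cv.err is just illustrative:

```r
library(ISLR2)
library(boot)

set.seed(17)
glm.fit <- glm(mpg ~ poly(horsepower, 2), data = Auto)
cv.err <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)

names(cv.err)   # the list components: call, K, delta and seed
cv.err$K        # the number of folds actually used
cv.err$delta    # raw and bias-corrected CV estimates of the test MSE
```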
Step 3: The Bootstrap (20 mins)
We have learnt the theory behind the bootstrap. It may be a bit difficult for beginners in statistics, but here we will mainly focus on the coding implementation and visualisation. We will also introduce how to create a function below.
Functions are "self-contained" modules of code that accomplish a specific task. (Referenced from Functions and their arguments.)
With the help of a function, you can reuse the same pattern of code via a simple function name. In fact, you work with functions all the time in R - perhaps without even realising it!
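For instance, here is a minimal function of our own (the name and numbers are invented for illustration), called the same way as any built-in function:

```r
# A tiny user-defined function: takes a numeric vector and
# returns the width of its range (maximum minus minimum)
range.width <- function(x) {
  max(x) - min(x)
}

range.width(c(3, 10, 7))  # 10 - 3 = 7
mean(c(3, 10, 7))         # mean() is a built-in function you call the same way
```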
Step 3.1: Estimating the Accuracy of a Linear Regression Model
In this step, we will use the bootstrap approach to assess the variability of a coefficient estimate. For the sake of simplicity, we will look at the relationship between weight and horsepower, which appears to be linear.
Create a function called boot.fn():
boot.fn <- function(data, index) {
  lm(horsepower ~ weight, data = data[index, ])$coefficients
}
boot.fn simply returns a vector of coefficient estimates. It takes two parameters: data and index. data is a placeholder for the dataframe used in the model; index is a placeholder for the sample used to subset the dataframe. Other than this, the body of the function should look familiar: we estimate a linear model predicting horsepower from weight, and then extract the coefficients.
Now, compare the bootstrap estimates with the standard estimates:
# bootstrap with 1,000 resamples
boot(Auto, boot.fn, 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Auto, statistic = boot.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* -12.18348470 1.659517e-01 3.221766853
t2* 0.03917702 -6.479214e-05 0.001251433
summary(lm(horsepower ~ weight, data = Auto))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.18348470 3.570431493 -3.412328 7.115312e-04
weight 0.03917702 0.001153214 33.972031 1.364347e-118
We can see that the bootstrap gives \(\mathrm{SE}(\hat{\beta}_{0}) = 3.2218\) and \(\mathrm{SE}(\hat{\beta}_{1}) = 0.0013\), while the standard formula gives \(\mathrm{SE}(\hat{\beta}_{0}) = 3.5704\) and \(\mathrm{SE}(\hat{\beta}_{1}) = 0.0012\). The coefficient estimates themselves, \(\hat{\beta}_{Intercept}=-12.1835\) and \(\hat{\beta}_{weight}=0.0392\), are identical in both outputs.
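Where does the bootstrap standard error actually come from? It is simply the standard deviation of the coefficient estimates across the bootstrap replicates, which boot() stores in the $t matrix. A self-contained sketch (boot.out and the seed are introduced here for illustration; your exact numbers will differ with a different seed):

```r
library(ISLR2)
library(boot)

boot.fn <- function(data, index) {
  lm(horsepower ~ weight, data = data[index, ])$coefficients
}

set.seed(2)  # arbitrary seed, just so this sketch is reproducible
boot.out <- boot(Auto, boot.fn, 1000)

# The 'std. error' column printed by boot() is the column-wise
# standard deviation of the replicate estimates in boot.out$t
apply(boot.out$t, 2, sd)
```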
To get a better intuition of what the bootstrap algorithm does, let's create a ggplot. We can take the estimated intercepts and slopes and overlay them on a scatterplot (weight on the x-axis, horsepower on the y-axis). We will create 50 bootstrap resamples for ease of visualisation and use geom_abline to overlay all the lines of best fit.
boot.model <- boot(Auto, boot.fn, 50)

boot.df <- as.data.frame(boot.model$t)
names(boot.df) <- c('b0', 'b1')
ggplot(data = Auto, aes(x = weight, y = horsepower)) +
geom_point() +
geom_abline(data = boot.df,
aes(intercept = b0, slope = b1),
alpha = 0.1, colour = 'blue') +
theme_minimal() +
labs(x = 'Weight (lbs.)', y = 'Engine horsepower')
Step 4: 🎯 Practical Exercises (50 mins)
By now, we have learnt and implemented the code for cross-validation and the bootstrap. In this practical case, we will use the new datasets Default and Weekly from the ISLR2 package. Do not forget to set a random seed before beginning your analysis.
Some questions are listed below. You are required to try to answer these questions in pairs using R commands. We will go over the solutions once everyone has finished these questions.
Q1: Train vs test sets
For the Default dataset, please split the sample set into a training set and a test set, then fit a logistic regression model that uses income and balance to predict default:
- Q1.1: Use three different splits of the observations into a training set and a test set.
- Q1.2: Fit three multiple logistic regression models using only the training observations.
- Q1.3: Based on the three models, obtain a prediction of default status for each individual in the test set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
- Q1.4: Based on the three models, compute the test set error, which is the fraction of the observations in the test set that are misclassified.
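If you are unsure where to start, here is one possible skeleton for a single split (the seed, the prop value and the object names are choices made for this sketch, not the only valid answer - repeat with different seeds for Q1.1):

```r
library(ISLR2)
library(rsample)

set.seed(42)                                   # any seed of your choice
d.split <- initial_split(Default, prop = 0.5)  # one of your three splits
d.train <- training(d.split)
d.test  <- testing(d.split)

# Q1.2: fit the logistic regression on the training observations only
glm.def <- glm(default ~ income + balance, data = d.train, family = binomial)

# Q1.3: posterior probabilities of default, then a 0.5 threshold
post.prob <- predict(glm.def, d.test, type = 'response')
pred.def  <- ifelse(post.prob > 0.5, 'Yes', 'No')

# Q1.4: test set error = fraction of misclassified test observations
mean(pred.def != d.test$default)
```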
Q2: Bootstrap
For the Default dataset, we continue to consider using a logistic regression model to predict the probability of default from income and balance. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (2.1) using the bootstrap, and (2.2) using the standard formula for computing the standard errors in the glm() function. Proceed as follows:
1. Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
2. Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
3. Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance. Then, create a histogram of the bootstrap parameter estimates with ggplot2, setting bins = 20 and the title to 1,000 Bootstrap Parameter Estimates - 'balance' & 'income'.
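As a starting point for the bootstrap part of Q2, the boot.fn() signature mirrors the one from Step 3, swapping lm() for glm(). This is a sketch, not the full answer (the seed and object names are choices for illustration, R is kept small here to run quickly, and the ggplot2 histogram is left to you):

```r
library(ISLR2)
library(boot)

# Same idea as Step 3, but a logistic regression on Default
boot.fn <- function(data, index) {
  glm(default ~ income + balance, data = data[index, ],
      family = binomial)$coefficients
}

set.seed(42)                             # any seed of your choice
boot.out <- boot(Default, boot.fn, 100)  # use R = 1000 for the exercise itself

# Bootstrap SEs for the (Intercept), income and balance coefficients
apply(boot.out$t, 2, sd)
```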