✔️ Week 05 - Lab Solutions
DS202 - Data Science for Social Scientists
🔑 Solutions to exercises
Q1: Train vs test sets
For the Default
dataset, please split the sample set into a training set and a validation set, then fit a logistic regression model that uses income
and balance
to predict default
.
library(ISLR2)
library(tidyverse)
?Default
Q1.1
Use three different splits of the observations into a training set and a test set.
First generate the splits:
library(tidymodels)
<- initial_split(Default, prop = 0.5, strata = default)
Default.split1 <- initial_split(Default, prop = 0.5, strata = default)
Default.split2 <- initial_split(Default, prop = 0.5, strata = default) Default.split3
or maybe with different random proportions?
<- initial_split(Default, prop = 0.6, strata = default)
Default.split1 <- initial_split(Default, prop = 0.6, strata = default)
Default.split2 <- initial_split(Default, prop = 0.7, strata = default) Default.split3
Then, using those splits, create separate training and test sets from the original data:
<- training(Default.split1)
Default.train1 <- testing(Default.split1)
Default.test1
<- training(Default.split2)
Default.train2 <- testing(Default.split2)
Default.test2
<- training(Default.split3)
Default.train3 <- testing(Default.split3) Default.test3
Q1.2
Fit three multiple logistic regression models using only the training observations.
.1 = glm(default ~ income + balance, data = Default.train1, family = "binomial")
glm.fit
.2 = glm(default ~ income + balance, data = Default.train2, family = "binomial")
glm.fit
.3 = glm(default ~ income + balance, data = Default.train3, family = "binomial")
glm.fit
summary(glm.fit.1)
Q1.3
Based on the three models, obtain a prediction of default status for each individual in the test set by computing the posterior probability of default for that individual, and classifying the individual to the default
category if the posterior probability is greater than 0.5.
# Predict the test set
.1 <- predict(glm.fit.1, Default.test1, type = "response")
glm.probs
# Make predictions using threshold=0.5
.1 = if_else(glm.probs.1 > 0.5, "Yes", "No")
glm.preds
# do the same for the other datasets
Q1.4
Based on the three models, compute the test set error, which is the fraction of the observations in the test set that are misclassified.
mean(glm.preds.1 != Default.test1$default)
mean(glm.preds.2 != Default.test2$default)
mean(glm.preds.3 != Default.test3$default)
Q2: Bootstrap
For the Default
dataset, We continue to consider the use of a logistic regression model to predict the probability of default
using income
and balance
on the Default
data set.
In particular, we will now compute estimates for the standard errors of the income
and balance
logistic regression coefficients in two different ways: 2.1. Using the bootstrap
, 2.2. Using the standard formula for computing the standard errors in the glm()
function. As following,
i
Using the summary()
and glm()
functions, determine the estimated standard errors for the coefficients associated with income
and balance
in a multiple logistic regression model that uses both predictors.
# multiple logistic regression model
<- glm(default ~ income + balance, data = Default, family = "binomial")
log_def
summary(log_def)
summary(log_def)$coefficients[, 2] ## standard errors
ii
Write a function,boot.fn()
,that takes as input the Default
data set as well as an index of the observations, and that outputs the coefficient estimates for income
and balance
in the multiple logistic regression model.
# This is a modified version of the original `boot.fn` function
<- function(data, index = 1:nrow(data)) {
boot.fn coef(glm(default ~ income + balance, data = data, subset = index, family = "binomial"))[-1]
}
Test it:
## all data --without intercept
boot.fn(Default)
iii
Use the boot()
function together with your boot.fn()
function to estimate the standard errors of the logistic regression coefficients for income
and balance
. Then, Create a histogram of the bootstrap parameter estimates with ggplot2
, and also set the bins=20
, title as 1,000 Bootstrap Parameter Estimates - 'balance' & 'income
.
<- boot(data = Default, statistic = boot.fn, R = 1000)
boot_results boot_results