✔️ Week 05 - Lab Solutions

DS202 - Data Science for Social Scientists

Author

Xiaowei Gao/Mustafa Can Ozkan

Published

01 November 2022

🔑 Solutions to exercises

Q1: Train vs test sets

For the Default dataset, please split the sample set into a training set and a validation set, then fit a logistic regression model that uses income and balance to predict default.

library(ISLR2)
library(tidyverse)

?Default

Q1.1

Use three different splits of the observations into a training set and a test set.

First generate the splits:

library(tidymodels)

Default.split1 <- initial_split(Default, prop = 0.5, strata = default)
Default.split2 <- initial_split(Default, prop = 0.5, strata = default)
Default.split3 <- initial_split(Default, prop = 0.5, strata = default)

or maybe with different random proportions?

Default.split1 <- initial_split(Default, prop = 0.6, strata = default)
Default.split2 <- initial_split(Default, prop = 0.6, strata = default)
Default.split3 <- initial_split(Default, prop = 0.7, strata = default)

Then, using those splits, create separate training and test sets from the original data:

Default.train1 <- training(Default.split1)
Default.test1 <- testing(Default.split1)

Default.train2 <- training(Default.split2)
Default.test2 <- testing(Default.split2)

Default.train3 <- training(Default.split3)
Default.test3 <- testing(Default.split3)

Q1.2

Fit three multiple logistic regression models using only the training observations.

glm.fit.1 = glm(default ~ income + balance, data = Default.train1, family = "binomial")

glm.fit.2 = glm(default ~ income + balance, data = Default.train2, family = "binomial")

glm.fit.3 = glm(default ~ income + balance, data = Default.train3, family = "binomial")

summary(glm.fit.1)

Q1.3

Based on the three models, obtain a prediction of default status for each individual in the test set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

# Predict the test set
glm.probs.1 <- predict(glm.fit.1, Default.test1, type = "response")

# Make predictions using threshold=0.5
glm.preds.1 = if_else(glm.probs.1 > 0.5, "Yes", "No")

# do the same for the other datasets

Q1.4

Based on the three models, compute the test set error, which is the fraction of the observations in the test set that are misclassified.

mean(glm.preds.1 != Default.test1$default)
mean(glm.preds.2 != Default.test2$default)
mean(glm.preds.3 != Default.test3$default)

Q2: Bootstrap

For the Default dataset, We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set.

In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: 2.1. Using the bootstrap, 2.2. Using the standard formula for computing the standard errors in the glm() function. As following,

i

Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

# multiple logistic regression model 
log_def <- glm(default ~ income + balance, data = Default, family = "binomial")

summary(log_def)
summary(log_def)$coefficients[, 2]  ## standard errors

ii

Write a function,boot.fn(),that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

# This is a modified version of the original `boot.fn` function
boot.fn <- function(data, index = 1:nrow(data)) {
      coef(glm(default ~ income + balance, data = data, subset = index, family = "binomial"))[-1]
    }

Test it:

## all data --without intercept
boot.fn(Default)

iii

Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance. Then, Create a histogram of the bootstrap parameter estimates with ggplot2, and also set the bins=20, title as 1,000 Bootstrap Parameter Estimates - 'balance' & 'income.

boot_results <- boot(data = Default, statistic = boot.fn, R = 1000)
boot_results