✔️ Week 04 - Lab Solutions
DS202 - Data Science for Social Scientists
🔑 Solutions to exercises
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

library(ISLR2)
Auto = na.omit(Auto)
mpg01 = rep(0, dim(Auto)[1])
mpg01[Auto$mpg > median(Auto$mpg)] = 1
Auto = data.frame(Auto, mpg01)
head(Auto)
Or, an easier way, using the ifelse() function:

library(ISLR2)
Auto = na.omit(Auto)
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
head(Auto)
Before we move to the next step, we need to label mpg01 and origin as factors so that R recognises them as qualitative variables instead of quantitative variables.

Auto$mpg01 = as.factor(Auto$mpg01)
Auto$origin = as.factor(Auto$origin)
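If you want to double-check that the conversion worked, a minimal sanity check (not part of the original solution) is to inspect the column classes and factor levels:

# Confirm that mpg01 and origin are now factors rather than numeric columns
sapply(Auto[, c("mpg01", "origin")], class)
# Inspect the levels R will use when these variables enter a model
levels(Auto$origin)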
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

par(mfrow = c(2, 3))
plot(Auto$mpg01, Auto$cylinders, xlab = "mpg01", ylab = "Number of engine cylinders")
plot(Auto$mpg01, Auto$displacement, xlab = "mpg01", ylab = "Engine displacement (cubic inches)")
plot(Auto$mpg01, Auto$horsepower, xlab = "mpg01", ylab = "Horsepower")
plot(Auto$mpg01, Auto$weight, xlab = "mpg01", ylab = "Weight (pounds)")
plot(Auto$mpg01, Auto$acceleration, xlab = "mpg01", ylab = "Time to reach 60 mph (seconds)")
plot(Auto$mpg01, Auto$year, xlab = "mpg01", ylab = "Manufacture year")
mtext("Boxplots for cars with above (1) and below (0) median mpg", outer = TRUE, line = -3)
Boxplots were plotted to compare the distribution of each quantitative variable between cars with above-median mpg and cars with below-median mpg. If the distribution of a predictor varies noticeably with the response, that predictor may help to predict the response; if the distribution barely changes between the two classes, it is unlikely to be useful. The boxplots suggest that cylinders, displacement, horsepower, and weight might be the most useful in predicting mpg01. The function par() is used to change the layout of the output plots.

We can also use ggplot() to create a fancier plot that combines a boxplot with a jittered scatterplot:

library(ggplot2)
ggplot(data = Auto, aes(acceleration, mpg01, colour = mpg01, fill = mpg01)) +
  geom_boxplot(alpha = 0.125) +
  geom_jitter(alpha = 0.5, size = 2)
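As a numerical complement to the boxplots, one possible check (an addition to the original solution) is to compare the class-wise medians of each candidate predictor; large gaps between the two groups match the visual impression that cylinders, displacement, horsepower and weight carry the most signal:

# Median of each quantitative predictor within each mpg01 class
aggregate(cbind(cylinders, displacement, horsepower, weight, acceleration, year) ~ mpg01,
          data = Auto, FUN = median)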
To visualise the association between mpg01 and year, a scatterplot is used.

par(mfrow = c(1, 1))
plot(Auto$year, Auto$mpg)
abline(h = median(Auto$mpg), lwd = 2, col = "red")
The above scatterplot of mpg vs year shows that the newer cars in the data set tend to be more fuel efficient. Therefore, while manufacture year might not be as useful as the other four quantitative variables, it still seems worth including.

plot(Auto$origin, Auto$mpg, xlab = "Origin", ylab = "mpg")
abline(h = median(Auto$mpg), lwd = 2, col = "red")
Lastly, when looking at a boxplot that compares the mpg values for each car, categorized by country of origin, we see that there is a clear difference between American cars, which tend to have below-median fuel efficiency, and European and Japanese cars, which tend to have above-median fuel efficiency. Thus, it seems that origin will also be useful in predicting mpg01.
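To back up this visual impression with numbers, one option (again, an extra check rather than part of the original solution) is to tabulate the share of above-median cars within each origin (in the Auto data, 1 = American, 2 = European, 3 = Japanese):

# Row-wise proportions: share of mpg01 = 0 and mpg01 = 1 within each origin
prop.table(table(Auto$origin, Auto$mpg01), margin = 1)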
In conclusion, all of the predictors except for acceleration and name will be used in fitting this classification model for predicting mpg01. Also, mpg will be excluded because it was used directly to create the classification label.

Split the data into a training set and a test set. The training set contains observations from before 1979; the test set contains the rest of the observations.
attach(Auto)
train <- (year < 79)
Auto_train <- Auto[train, ]
Auto_test <- Auto[!train, ]
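Because the split is based on model year rather than random sampling, it is worth checking the size and class balance of each part; a quick sanity check (not shown in the original solution) could be:

# Number of observations in the training and test sets
dim(Auto_train)
dim(Auto_test)
# Class balance of mpg01 in each split
table(Auto_train$mpg01)
table(Auto_test$mpg01)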
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in question 2. What is the test error of the model obtained?

glm.fit = glm(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin,
              data = Auto, subset = train, family = "binomial")
summary(glm.fit)
glm.probs = predict(glm.fit, Auto_test, type = "response")
glm.pred = rep(0, dim(Auto_test)[1])
glm.pred[glm.probs > 0.5] = 1
table(glm.pred, Auto_test$mpg01, dnn = c("Predicted", "Actual"))
mean(glm.pred == Auto_test$mpg01)
[1] 0.877193
As shown, the test error of the model obtained is (1 - 0.877193) = 0.122807.
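Equivalently, the test error can be computed directly as the misclassification rate on the test set, for example:

# Test error = proportion of test observations the model misclassifies
mean(glm.pred != Auto_test$mpg01)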
Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in question 2. What is the test error of the model obtained?

library(e1071)
nb.fit = naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin,
                    data = Auto, subset = train)
nb.fit
nb.class <- predict(nb.fit, Auto_test)
table(nb.class, Auto_test$mpg01, dnn = c("Predicted", "Actual"))
mean(nb.class == Auto_test$mpg01)
As shown, the test error of the model obtained is (1 - 0.877193) = 0.122807.
Which of these two methods appears to provide the best results on this data? Justify your choice.
After comparing the test errors, these two classification models appear equally good. To compare their performance further, we look at the confusion matrices and find that the two models produce identical confusion matrices. This means that the Precision, Recall, Accuracy and F-Score of the two models are all the same. Therefore, we can conclude that the two classification models perform equally well on this data set. One thing to be cautious about is that the test set is imbalanced, so we should not judge performance on this data set by the test error alone. A more detailed explanation can be found here: https://www.analyticsvidhya.com/blog/2021/06/5-techniques-to-handle-imbalanced-data-for-a-classification-problem/.
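To make that comparison concrete, one way (a sketch, not part of the original solution) to compute Accuracy, Precision, Recall and F-Score from the logistic regression confusion matrix, and to inspect the class balance of the test set, is:

# Confusion matrix for the logistic regression predictions
# (assumes both classes appear among the predictions, as they do here)
cm <- table(Predicted = glm.pred, Actual = Auto_test$mpg01)
cm

TP <- cm["1", "1"]; FP <- cm["1", "0"]
FN <- cm["0", "1"]; TN <- cm["0", "0"]

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f_score   <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f_score = f_score)

# The class proportions in the test set show why accuracy alone can be misleading
prop.table(table(Auto_test$mpg01))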