✔️ Week 04 - Lab Solutions
DS202 - Data Science for Social Scientists
🔑 Solutions to exercises
Create a binary variable, `mpg01`, that contains a 1 if `mpg` contains a value above its median, and a 0 if `mpg` contains a value below its median. You can compute the median using the `median()` function. Note you may find it helpful to use the `data.frame()` function to create a single data set containing both `mpg01` and the other `Auto` variables.

    library(ISLR2)
    Auto = na.omit(Auto)                     # drop rows with missing values
    mpg01 = rep(0, dim(Auto)[1])             # start with all zeros
    mpg01[Auto$mpg > median(Auto$mpg)] = 1   # 1 where mpg is above its median
    Auto = data.frame(Auto, mpg01)
    head(Auto)

or, more simply, by using the
`ifelse()` function:

    library(ISLR2)
    Auto = na.omit(Auto)
    Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
    head(Auto)

Before we move to the next step, we need to label
`mpg01` and `origin` as factors so that R recognizes them as qualitative variables instead of quantitative variables.

    Auto$mpg01 = as.factor(Auto$mpg01)
    Auto$origin = as.factor(Auto$origin)

Explore the data graphically in order to investigate the association between
`mpg01` and the other features. Which of the other features seem most likely to be useful in predicting `mpg01`? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

    # Because mpg01 is a factor, plot() draws one boxplot per level
    par(mfrow = c(2, 3))
    plot(Auto$mpg01, Auto$cylinders, xlab = "mpg01", ylab = "Number of engine cylinders")
    plot(Auto$mpg01, Auto$displacement, xlab = "mpg01", ylab = "Engine displacement (cubic inches)")
    plot(Auto$mpg01, Auto$horsepower, xlab = "mpg01", ylab = "Horsepower")
    plot(Auto$mpg01, Auto$weight, xlab = "mpg01", ylab = "Weight (pounds)")
    plot(Auto$mpg01, Auto$acceleration, xlab = "mpg01", ylab = "Time to reach 60 mph (seconds)")
    plot(Auto$mpg01, Auto$year, xlab = "mpg01", ylab = "Manufacture year")
    mtext("Boxplots for cars with above(1) and below(0) median mpg", outer = TRUE, line = -3)

Boxplots were plotted to compare the distribution of each quantitative variable between cars with above-median mpg and cars with below-median mpg. If the distribution of a predictor varies markedly with the response variable, then it may contribute to predicting the response. If the distribution of a predictor does not differ much between the values of the response variable, then it may not contribute to predicting the response. The boxplots suggest that
`cylinders`, `displacement`, `horsepower`, and `weight` might be the most useful in predicting `mpg01`. The function `par()` is used to change the layout of output plots.

We could try to use the
`ggplot()` function to create a fancier plot that combines a boxplot and a scatterplot:

    library(ggplot2)
    ggplot(data = Auto, aes(acceleration, mpg01, colour = mpg01, fill = mpg01)) +
      geom_boxplot(alpha = 0.125) +
      geom_jitter(alpha = 0.5, size = 2)

To visualise the association between
`mpg01` and `year`, a scatterplot of `mpg` against `year` is used, with a red horizontal line marking the median `mpg`.

    par(mfrow = c(1, 1))
    plot(Auto$year, Auto$mpg)
    abline(h = median(Auto$mpg), lwd = 2, col = "red")

The above scatterplot of
`mpg` vs `year` shows that the newer cars in the data set tend to be more fuel efficient. Therefore, while manufacture year might not be as useful as the other four quantitative variables, it still seems worth including.

    plot(Auto$origin, Auto$mpg, xlab = "Origin", ylab = "mpg")
    abline(h = median(Auto$mpg), lwd = 2, col = "red")

Lastly, when looking at a boxplot that compares the
`mpg` values for each car, categorized by country of `origin`, we see that there is a clear difference between American cars, which tend to have below-median fuel efficiency, and European and Japanese cars, which tend to have above-median fuel efficiency. Thus, it seems that `origin` will also be useful in predicting `mpg01`.

In conclusion, all of the predictors except for
`acceleration` and `name` will be used in fitting the classification model to predict `mpg01`. Also, `mpg` will be excluded because it was used directly to create the classification label.

Split the data into a training set and a test set. The training set contains observations from before 1979. The test set contains the rest of the observations.
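    # Logical vector: TRUE for cars with a model year before 1979
    # (Auto$year is referenced directly, so attach(Auto) is not needed)
    train <- (Auto$year < 79)
    Auto_train <- Auto[train, ]
    Auto_test <- Auto[!train, ]

Perform logistic regression on the training data in order to predict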
`mpg01` using the variables that seemed most associated with `mpg01` in question 2. What is the test error of the model obtained?

    glm.fit = glm(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin,
                  data = Auto, subset = train, family = "binomial")
    summary(glm.fit)
    # Predicted probabilities on the test set, classified as 1 above 0.5
    glm.probs = predict(glm.fit, Auto_test, type = "response")
    glm.pred = rep(0, dim(Auto_test)[1])
    glm.pred[glm.probs > 0.5] = 1
    table(glm.pred, Auto_test$mpg01, dnn = c("Predicted", "Actual"))
    mean(glm.pred == Auto_test$mpg01)
    # [1] 0.877193

As shown, the test error of the model obtained is (1 - 0.877193) = 0.122807.
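The relationship between the confusion matrix and the test error can be sketched on a small example. The probabilities and labels below are toy values made up for illustration; they are not taken from the `Auto` data, and the variable names are hypothetical.

```r
# Toy predicted probabilities and true labels, made up for illustration
probs  <- c(0.91, 0.15, 0.62, 0.48, 0.77, 0.30, 0.55, 0.08)
actual <- c(1, 0, 1, 1, 1, 0, 0, 0)

# Classify as 1 when the predicted probability exceeds 0.5
pred <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix with the same layout as in the lab
conf_mat <- table(pred, actual, dnn = c("Predicted", "Actual"))
print(conf_mat)

# Test error = share of off-diagonal (misclassified) observations
test_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
print(test_error)

# Equivalent shortcut: proportion of predictions that differ from the truth
shortcut_error <- mean(pred != actual)
print(shortcut_error)
```

Both routes give the same number, which is also one minus the accuracy returned by `mean(pred == actual)`.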
Perform naive Bayes on the training data in order to predict
`mpg01` using the variables that seemed most associated with `mpg01` in question 2. What is the test error of the model obtained?

    library(e1071)
    nb.fit = naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin,
                        data = Auto, subset = train)
    nb.fit
    # Predicted classes on the test set
    nb.class <- predict(nb.fit, Auto_test)
    table(nb.class, Auto_test$mpg01, dnn = c("Predicted", "Actual"))
    mean(nb.class == Auto_test$mpg01)

As shown, the test error of the model obtained is (1 - 0.877193) = 0.122807.
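When two models report the same test error, it can be worth checking whether they also make the same predictions on each observation, since equal accuracy does not guarantee that. The sketch below uses toy predictions from two hypothetical models (`pred_a`, `pred_b`), made up for illustration and unrelated to the `Auto` results.

```r
# Toy predictions from two hypothetical models, made up for illustration
actual <- factor(c(1, 1, 0, 0, 1, 0))
pred_a <- factor(c(1, 0, 0, 0, 1, 1))
pred_b <- factor(c(0, 1, 0, 1, 1, 0))

# Both models have the same accuracy
acc_a <- mean(pred_a == actual)
acc_b <- mean(pred_b == actual)
print(c(acc_a, acc_b))

# Their confusion matrices are identical as well
cm_a <- table(pred_a, actual, dnn = c("Predicted", "Actual"))
cm_b <- table(pred_b, actual, dnn = c("Predicted", "Actual"))
print(identical(cm_a, cm_b))

# Yet they disagree on individual observations
agreement <- mean(pred_a == pred_b)
print(agreement)
print(table(pred_a, pred_b, dnn = c("Model A", "Model B")))
```

With the lab objects, the same check would be `table(glm.pred, nb.class)`, assuming both prediction vectors are still in the workspace.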
Which of these two methods appears to provide the best results on this data? Justify your choice.
Comparing the test errors, the two classification models perform equally well. Looking further at the confusion matrices, we find that they are identical for the two models, which means that their precision, recall, accuracy and F-score are all the same. We can therefore conclude that the two classification models perform equally well on this data set. One caveat is that the data set is imbalanced, so we should not judge the performance of the models on test error alone. A more detailed explanation can be found here: https://www.analyticsvidhya.com/blog/2021/06/5-techniques-to-handle-imbalanced-data-for-a-classification-problem/.
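As a minimal sketch of the metrics mentioned above, the helper below computes accuracy, precision, recall and F-score from predicted and actual labels. The function `classification_metrics()` is a hypothetical helper, not part of the original lab or of any package, and the labels are toy values made up for illustration.

```r
# Hypothetical helper (not part of the original lab): summary metrics
# for a binary classifier, treating `positive` as the class of interest.
classification_metrics <- function(pred, actual, positive = "1") {
  pred <- as.character(pred)
  actual <- as.character(actual)
  tp <- sum(pred == positive & actual == positive)  # true positives
  fp <- sum(pred == positive & actual != positive)  # false positives
  fn <- sum(pred != positive & actual == positive)  # false negatives
  tn <- sum(pred != positive & actual != positive)  # true negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(accuracy = (tp + tn) / length(actual),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

# Toy labels, made up purely for illustration (not the Auto results)
actual <- c(1, 1, 1, 1, 1, 1, 0, 0)
pred   <- c(1, 1, 1, 1, 1, 0, 1, 0)
metrics <- classification_metrics(pred, actual)
print(round(metrics, 3))

# Class balance of the toy labels: 75% of observations are in class 1
balance <- prop.table(table(actual))
print(balance)
```

With the lab objects, the calls would be `classification_metrics(glm.pred, Auto_test$mpg01)` and `classification_metrics(nb.class, Auto_test$mpg01)`, and `prop.table(table(Auto_test$mpg01))` would show how imbalanced the test set is.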
