βœ”οΈ Week 08 - Lab Solutions

DS202 - Data Science for Social Scientists

Author

Dr. Jon Cardoso-Silva

This notebook contain solutions to Take-home exercises in Week 08 Lab.

🏠 Take-home exercise Q1:

Edit the cell below modifying event_level from "second" to "first". Why do you get different results? What do you think is going on?

πŸ’‘Tip: Read the documentation of f_meas to understand what event_level represents. (Type ?f_meas)

πŸ’‘ Gold Tip: note the Levels of the factor variable called Purchase:

plot_df$Purchase

Solution:

The parameter event_level refers to the levels of the target variable:

levels(plot_df$Purchase)
[1] "CH" "MM"

From the above, we see that CH is the first level, and MM is the second level. Don’t know what a level is? Take a look at Chapter 15 of R for Data Science to revisit the concept of factors (categorical variables).

F1-score is always calculated in reference to one level. So, if we want to get the F1-score for the "CH" class, we use event_level="first":

plot_df %>% f_meas(Purchase, .pred_class, event_level="first")
# A tibble: 1 Γ— 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 f_meas  binary         0.861

Similarly, if we want to calculate the F1-score for the "MM" class, we have to look at event_level="second":

plot_df %>% f_meas(Purchase, .pred_class, event_level="second")
# A tibble: 1 Γ— 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 f_meas  binary         0.780

🏠 Take-home exercise Q2:

Create a plot of the confusion matrix for the orange_tidymodel like we did in Step 1

Q2. Solution

Code
# It's almost the same thing; only this time let's use the `.pred_class` column that was created when we augmented our data frame.
confusion_matrix <- table(expected=plot_df$Purchase, class_pred=plot_df$.pred_class)


plot_confusion_matrix(as_tibble(confusion_matrix), 
                      target_col = "expected", 
                      prediction_col = "class_pred",
                      
                      # Customizing the plot
                      add_normalized = TRUE,
                      add_col_percentages = FALSE,
                      add_row_percentages = FALSE,
                      counts_col = "n",
                      )
Warning in plot_confusion_matrix(as_tibble(confusion_matrix), target_col =
"expected", : 'ggimage' is missing. Will not plot arrows and zero-shading.
Warning in plot_confusion_matrix(as_tibble(confusion_matrix), target_col =
"expected", : 'rsvg' is missing. Will not plot arrows and zero-shading.

🏠 Take-home exercise Q3:

Create a plot of SVM decision space for the orange_tidymodel like we did in Step 1.

Q3. Solution

It’s almost the same thing; only this time we use the augment() function and refer to .pred_class when creating the plot.

Code
sim.data <- crossing(LoyalCH   = seq(0,1,0.05), PriceDiff = seq(-1,1,0.1))
sim.data <- augment(orange_tidymodel, sim.data)


g <- (
  plot_df %>%   
    ggplot()
  
    # Tile the background of the plot with SVM predictions
    + geom_tile(data = sim.data, aes(x=LoyalCH, y=PriceDiff, fill = .pred_class), alpha = 0.25)
  
    # Actual data
    + geom_point(aes(x=LoyalCH, y=PriceDiff, colour = Purchase, shape = is_correct), size=2.5, stroke=0.95, alpha=0.7)
  
    # Define X and Os
    + scale_shape_manual(values = c(4, 1))
    
    # (OPTIONAL) Customizing the colours and theme of the plot
    + scale_x_continuous(labels=scales::percent)
    + scale_colour_startrek()
    + scale_fill_startrek()
    + theme_minimal()
    + theme(panel.grid = element_blank(), legend.position = 'bottom', plot.title = element_text(hjust = 0.5))
    + labs(x = 'Customer brand loyalty for CH', y = 'Sale price of MM less sale price of CH', fill = 'Brand', colour = 'Brand', shape = 'Correct prediction?', title = sprintf('Overall Training Accuracy = %.2f %%', 100*(sum(plot_df$is_correct)/nrow(plot_df))))
)

g

🏠 Take-home exercise Q4:

  • Retrain the model in Step 3 for Smarket, this time using svm_linear instead of svm_rbf.
  • Reuse the code in Step 3.4 to replicate the plots and metric calculations for this new model.
  • Which model fits the target variable (Today) better?

Q4. Solution

Train the model

Code
# Remove Direction, otherwise we would be "cheating" 
filtered_data <- ISLR2::Smarket %>% select(Today, Volume, Lag1)

alternative_svm <-
  svm_linear() %>% 
  set_mode('regression') %>% 
  fit(Today ~ ., data = filtered_data)

alternative_svm
parsnip model object

$TypeDetail
[1] "L2-regularized L2-loss support vector regression primal (L2R_L2LOSS_SVR)"

$Type
[1] 11

$W
         Volume       Lag1        Bias
[1,] 0.05000429 -0.0249507 -0.07358462

$Bias
[1] 1

$NbClass
[1] 2

attr(,"class")
[1] "LiblineaR"

Create a plot_df data frame to use when plotting residuals.

Code
plot_df <- 
  augment(alternative_svm, filtered_data) %>% 
  mutate(row_number=row_number()) # adding this here just to make our plot easier
plot_df
# A tibble: 1,250 Γ— 6
    Today Volume   Lag1    .pred  .resid row_number
    <dbl>  <dbl>  <dbl>    <dbl>   <dbl>      <int>
 1  0.959   1.19  0.381 -0.0235   0.983           1
 2  1.03    1.30  0.959 -0.0327   1.06            2
 3 -0.623   1.41  1.03  -0.0288  -0.594           3
 4  0.614   1.28 -0.623  0.00577  0.608           4
 5  0.213   1.21  0.614 -0.0286   0.242           5
 6  1.39    1.35  0.213 -0.0114   1.40            6
 7 -0.403   1.44  1.39  -0.0361  -0.367           7
 8  0.027   1.41 -0.403  0.00687  0.0201          8
 9  1.30    1.16  0.027 -0.0161   1.32            9
10  0.287   1.23  1.30  -0.0445   0.331          10
# … with 1,240 more rows

Plot the residuals:

Code
g <- (
  ggplot(plot_df, aes(x=row_number, y=.resid))
  + geom_point(alpha=0.6)
  
  + theme_bw()
  + geom_hline(yintercept = c(-2,2), color="red", linetype="dashed")
  + labs(title="Distribution of residuals (the closer to zero the better)")
)

g

Is this model better? It’s not easy to say from the residuals plot alone, so let’s calculate the metrics:

plot_df %>% mae(Today, .pred)
# A tibble: 1 Γ— 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard       0.828
plot_df %>% rmse(Today, .pred)
# A tibble: 1 Γ— 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        1.14

The linear SVM had a slightly worse fit than the SVM with Radial Basis Function (Gaussian) we ran earlier. Both MAE and RMSE were smaller in the svm_rbf results:

MAE RMSE
svm_rbf 0.789891 1.09561
svm_linear 0.82845 1.135357