🗓️ Week 03:
Classifiers - Part II

DS202 Data Science for Social Scientists

10/14/22

Generative models

Bayes’ Theorem

  • Before we go on to explain what Naive Bayes is about, we need to understand the formula below.

\[ P(\mathbf{Y} = k | \mathbf{X} = x) = \frac{P(k)P(\mathbf{X}|\mathbf{Y}=k)}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}} \]

  • Let’s look at it step-by-step ⏭️

Bayes’ Theorem

\[ P(\mathbf{Y} = k | \mathbf{X} = x) = \frac{P(k)P(\mathbf{X}|\mathbf{Y}=k)}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}} \]

New variables

  • \(K \Rightarrow\) the number of classes, labelled \(1, \ldots, K\) (as in the sum below). In the binary case, \(K = 2\).
  • \(P(k) \Rightarrow\) is the probability that a random sample belongs to class \(k\).

Note

The textbook uses a slightly different notation: it writes the prior \(P(k)\) as \(\pi_k\) and the likelihood \(P(\mathbf{X}|\mathbf{Y}=k)\) as \(f_k(x)\).

Bayes’ Theorem

\[ \color{blue}{P(\mathbf{Y} = k | \mathbf{X} = x)} \color{Gainsboro}{= \frac{P(k)P(\mathbf{X}|\mathbf{Y}=k)}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}}} \]

  • The quantity above (in blue) is called the posterior distribution
  • It is what we are interested in when making inferences/predictions
Read it as:
What is the probability that the class is \(k\) given that the sample is \(x\)?

Bayes’ Theorem

\[ \color{Gainsboro}{P(\mathbf{Y} = k | \mathbf{X} = x) =} \frac{\color{blue}{P(k)}\color{Gainsboro}{P(\mathbf{X}|\mathbf{Y}=k)}}{\color{Gainsboro}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}}} \]

  • The quantity above (in blue) is called the prior distribution
  • It represents the proportion of samples of class \(k\) we believe (estimate) we would find if sampling at random.
Read it as:
What is the probability that a randomly chosen sample belongs to class \(k\), before we observe its predictors?

Bayes’ Theorem

\[ \color{Gainsboro}{P(\mathbf{Y} = k | \mathbf{X} = x) =} \frac{\color{Gainsboro}{P(k)}\color{blue}{P(\mathbf{X}|\mathbf{Y}=k)}}{\color{Gainsboro}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}}} \]

  • The quantity above (in blue) is often called the likelihood
  • It represents the density function of \(\mathbf{X}\) for samples of class \(k\).
Think of it as:
What values would I expect \(X\) to take when the class is \(\mathbf{Y} = k\)?

Bayes’ Theorem

\[ \color{Gainsboro}{P(\mathbf{Y} = k | \mathbf{X} = x) =} \frac{\color{Gainsboro}{P(k)}\color{Gainsboro}{P(\mathbf{X}|\mathbf{Y}=k)}}{\color{blue}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}}} \]

  • The quantity above (in blue) represents the density function of \(\mathbf{X}\) regardless of the class
  • It is often called the marginal probability of \(\mathbf{X}\).
    • Note that \(\color{blue}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}} = P(\mathbf{X})\)
Think of it as:
What values would I expect \(X\) to take if I ignored the class?
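To see all four pieces at work, here is a small worked example in R. The numbers are purely illustrative (two classes, one observed value \(x\)); none of them come from a real dataset.

Code
# Purely illustrative numbers: two classes (k = 1, 2), one observed value x
prior      <- c(0.70, 0.30)               # P(1), P(2)
likelihood <- c(0.10, 0.60)               # P(X = x | Y = 1), P(X = x | Y = 2)

# Marginal probability of x: the denominator of Bayes' theorem
marginal <- sum(prior * likelihood)       # P(X = x) = 0.25

# Posterior for each class, via Bayes' theorem
posterior <- prior * likelihood / marginal
posterior                                 # 0.28 0.72 -- always sums to 1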

Bayes’ Theorem

\[ P(\mathbf{Y} = k | \mathbf{X} = x) = \frac{P(k)P(\mathbf{X}|\mathbf{Y}=k)}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}} \]

  • Let’s look at how different algorithms explore this rule ⏭️

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA)

  • Assumptions:
    • Likelihood follows a Gaussian distribution
      • Each class has its own mean, \(\mu_k\)
      • All classes share the same variance
        • That is, \(\sigma^2_1 = \sigma^2_2 = \ldots = \sigma^2_K\), or simply \(\sigma^2\)
    • We denote this as: \(P(\mathbf{X}|\mathbf{Y}=k) \sim N(\mu_k, \sigma^2)\)
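With a single predictor, writing this assumption out gives the familiar Gaussian density (see James et al. 2021, chap. 4):

\[ P(\mathbf{X} = x|\mathbf{Y}=k) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - \mu_k)^2}{2\sigma^2}\right) \]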

LDA - Estimates

  • We estimate the class means, the shared variance, and the class priors as follows:

\[ \begin{align} \hat{\mu}_k &= \frac{1}{n_k}\sum_{i:y_i=k}{x_i}\\ \hat{\sigma}^2 &= \frac{1}{n - K}\sum_{k=1}^K{\sum_{i:y_i=k}{\left(x_i - \hat{\mu}_k\right)^2}} \\ \hat{P}(k) &= \frac{n_k}{n} \end{align} \]

  • where:
    • \(n\) is the total number of training observations
    • \(n_k\) is the number of training observations in the \(k\)th class
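As a sketch of these estimates in R, using balance from ISLR2::Default as a single predictor (the variable names below are ours):

Code
library(dplyr)

default_df <- ISLR2::Default
n_total    <- nrow(default_df)
K          <- n_distinct(default_df$default)   # 2 classes: "No" and "Yes"

estimates <- default_df %>%
  group_by(default) %>%
  summarise(
    n_k    = n(),                          # observations in class k
    mu_hat = mean(balance),                # class mean
    ss     = sum((balance - mu_hat)^2),    # within-class sum of squares
    p_hat  = n_k / n_total                 # estimated prior, n_k / n
  )

# Shared variance: pooled within-class sum of squares over (n - K)
sigma2_hat <- sum(estimates$ss) / (n_total - K)

In practice you would rarely do this by hand: MASS::lda() computes these quantities for you.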

Naive Bayes Classifier

Naive Bayes Classifier

  • Main Assumption:
Within the \(k\)th class, the \(p\) predictors are independent
  • Assuming the features are independent within each class (a stronger condition than merely being uncorrelated), the likelihood factorises: \[ P(\mathbf{X}|\mathbf{Y}=k) = \underbrace{P(x_1 |\mathbf{Y}=k)}_{1\text{st} \text{ predictor}} \times \underbrace{P(x_2 |\mathbf{Y}=k)}_{2\text{nd} \text{ predictor}} \times \ldots \times \underbrace{P(x_p |\mathbf{Y}=k)}_{p\text{-th} \text{ predictor}} \]
  • This means the posterior is given by: \[ P(\mathbf{Y} = k| \mathbf{X} = x) = \frac{\quad\quad P(k) \times P(x_1 |\mathbf{Y}=k) \times P(x_2 |\mathbf{Y}=k) \times \ldots \times P(x_p |\mathbf{Y}=k)}{\sum_{l=1}^K{P(l) \times P(x_1 |\mathbf{Y}=l) \times P(x_2 |\mathbf{Y}=l) \times \ldots \times P(x_p |\mathbf{Y}=l)}} \]

A naive approach indeed

  • This may all look very complicated, but it is actually quite simple
  • If the data are discrete (categorical), you just count the proportion of each category.

Example:

\[ P(\mathbf{X}_j = x_j | \mathbf{Y} = k) = \begin{cases} 0.32 & \text{if } x_j = 1\\ 0.55 & \text{if } x_j = 2\\ 0.13 & \text{if } x_j = 3 \end{cases} \]

  • If the data are continuous, use a histogram as an estimate for the true density of \(x_p\)
    • Alternatively, use a kernel density estimator
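You rarely code this by hand either. As a minimal sketch, the e1071 package provides one common implementation (an assumption on our part, not necessarily the package used in this course; note that its naiveBayes() models continuous predictors with a Gaussian density by default, rather than a histogram or kernel estimate):

Code
# install.packages("e1071") if needed
library(e1071)

nb_model <- naiveBayes(default ~ ., data = ISLR2::Default)

# Posterior probabilities P(Y = k | X = x) for the first few customers
predict(nb_model, head(ISLR2::Default), type = "raw")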

Making decisions

Default: Yes or No?

  • We have looked at how the probabilities (risk of default) change according to the value of predictors
  • But in practice we need to decide whether the risk is too high or tolerable
  • In our example, we might want to ask:
“Will this person default on their credit card? YES or NO?”

Default: Yes or No?

How would you classify the following customers?

Code
# Fit a logistic regression on the full Default dataset
library(tidyverse)

full_model <- 
  glm(default ~ ., data = ISLR2::Default, family = binomial)

# Pick six illustrative customers
set.seed(40)
sample_customers <- 
  ISLR2::Default %>% 
  slice(9986, 9908, 6848, 9762, 9979, 7438)

# Predicted probability of default for each customer
pred <- predict(full_model, sample_customers, type = "response")

# Format it as a percentage
sample_customers$prediction <- 
  sapply(pred, function(x) sprintf("%.2f %%", 100 * x))
sample_customers
  default student   balance   income prediction
1      No      No  842.9494 39957.13     0.27 %
2      No      No 1500.5721 39891.86    10.53 %
3     Yes     Yes 1957.1203 18805.95    44.23 %
4      No      No 1902.1499 35008.67    53.71 %
5     Yes      No 2202.4624 47287.26    87.09 %
6     Yes     Yes 2461.5070 11878.56    93.34 %

Default: Yes or No?

How would you classify the following customers?

  • If we set our threshold \(= 50\%\), we get the following confusion matrix:

                     Actual
    Predicted    No     Yes
    No            2       1
    Yes           1       2

  • If we set our threshold \(= 40\%\), we get the following confusion matrix:

                     Actual
    Predicted    No     Yes
    No            2       0
    Yes           1       3
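Both matrices can be reproduced from the code above (a sketch; pred holds the six predicted probabilities):

Code
# Predicted class at each threshold, cross-tabulated against the truth
pred_50 <- ifelse(pred > 0.50, "Yes", "No")
table(Predicted = pred_50, Actual = sample_customers$default)

pred_40 <- ifelse(pred > 0.40, "Yes", "No")
table(Predicted = pred_40, Actual = sample_customers$default)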


Which of the two is more accurate?

Thresholds

  • When making predictions about classes, we always have to make decisions.
  • Thresholds, applied to the predicted probability scores, are a way to decide whether to favour a particular class over another
  • ⏭️ Next, we will explore several metrics that can help us decide whether our classification model is good or bad.

Classification Metrics

Confusion Matrix

  • Let’s take another look at the confusion matrix. We can think of the numbers in each cell as the following:
                           Actual
    Predicted    No                      Yes
    No           True Negative (TN)      False Negative (FN)
    Yes          False Positive (FP)     True Positive (TP)


  • Ideally, we would have no False Negatives and no False Positives but, of course, that is never the case.

Classification metrics

  • It is convenient to aggregate those quantities into a few other metrics
  • Two of the most common ones are called sensitivity and specificity

\[ \begin{align} \text{Sensitivity} &= \text{True Positive Rate (TPR)} = \frac{TP}{P} \\ \text{Specificity} &= \text{True Negative Rate (TNR)} = \frac{TN}{N} \end{align} \]

  • where \(P = TP + FN\) is the total number of actual positives and \(N = TN + FP\) the total number of actual negatives

  • Another common one is accuracy:

\[ \text{Accuracy} = \frac{TP + TN}{P + N} \]

  • A good model has high sensitivity, high specificity, and high accuracy.
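As a quick illustration, here is a small helper of our own (not from any package) that computes all three metrics from the four cell counts:

Code
classification_metrics <- function(tp, tn, fp, fn) {
  c(
    sensitivity = tp / (tp + fn),                  # TP / P
    specificity = tn / (tn + fp),                  # TN / N
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
  )
}

# Example: the 50%-threshold matrix for the six sample customers above
classification_metrics(tp = 2, tn = 2, fp = 1, fn = 1)   # each ~ 0.67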

Which threshold is better?

📝 Now, looking at the logistic regression model we built for the entire dataset, work out the sensitivity, specificity and accuracy of the following confusion matrices:


Practice

  • ⏲️ 5 min to work out the math
  • 🗳️ Vote on your preferred threshold
    (on Slack)

\(\text{Threshold} = 50\%\):

                     Actual
    Predicted    No      Yes
    No          9627     228
    Yes           40     105

\(\text{Threshold} = 40\%\):

                     Actual
    Predicted    No      Yes
    No          9588     199
    Yes           79     134

Meet the ROC curve

  • The Receiver Operating Characteristic (ROC) curve is another way to assess the model.
  • It shows how sensitivity and specificity trade off as we vary the threshold from 0 to 1
    (the threshold itself is not an axis of the plot).
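One way to draw it in R is with the pROC package (a sketch and an assumption on our part, reusing full_model from the earlier code):

Code
# install.packages("pROC") if needed
library(pROC)

probs   <- predict(full_model, ISLR2::Default, type = "response")
roc_obj <- roc(ISLR2::Default$default, probs)

plot(roc_obj)   # sensitivity vs. specificity across all thresholds
auc(roc_obj)    # area under the curve; closer to 1 is better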

What could go wrong?

Generalisation problems

  • The data used to train algorithms is called training data
  • Often, we want to use the fitted models to make predictions on new, previously unseen data

Important

⚠️ A model that performs well on training data will not necessarily perform well on new data ⚠️

  • To make a robust assessment of our model, we have to split the data in two:
    • the training data and
    • the test data
  • We do NOT use the test data to fit the model
  • We will come back to this next week; it is the topic of 🗓️ Week 04.
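As a preview, a minimal train/test split in base R might look like this (a sketch; the 20% proportion and the seed are arbitrary):

Code
set.seed(1)

# Hold out 20% of the rows as test data
n_obs    <- nrow(ISLR2::Default)
test_idx <- sample(n_obs, size = round(0.2 * n_obs))

train_set <- ISLR2::Default[-test_idx, ]
test_set  <- ISLR2::Default[test_idx, ]

# Fit on the training data only, then assess on the held-out test data
model      <- glm(default ~ ., data = train_set, family = binomial)
test_probs <- predict(model, test_set, type = "response")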

Inappropriate reliance on metrics

  • Accuracy can be very misleading when classes are imbalanced
  • Consider the following model: \(\hat{y} = \text{No}\) (always; see the quick check after this list)
    • Only \(3\%\) of customers default on their credit cards
    • Therefore, this model would have a \(97\%\) accuracy!
    • It is correct ninety-seven percent of the time. But is it a good model?
      • 🙅‍♂️ NO! It never catches a single defaulter (its sensitivity is \(0\)).
  • Similarly, you have to ask yourself about the usefulness of any other metric
    • Is True Positive Rate more or less important than True Negative Rate for the classification problem at hand?
    • Why? Why not?
  • Ultimately, it boils down to how you plan to use this model afterwards.
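The quick check promised above: on ISLR2::Default, the constant always-"No" model scores high accuracy while catching no defaulters at all.

Code
# Accuracy of the model that always predicts "No"
mean(ISLR2::Default$default == "No")   # roughly 0.97, yet sensitivity = 0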

What’s Next

Your Checklist:

  • 📙 Read (James et al. 2021, chap. 4)

  • 👀 Browse the slides again

  • 📝 Take note of anything that isn’t clear to you

  • 📟 Share your questions in the /week04 channel on Slack

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Second edition. Springer Texts in Statistics. New York NY: Springer. https://www.statlearning.com/.