🗓️ Week 03:
Classifiers - Part II

DS202 Data Science for Social Scientists

10/14/22

Generative models

Bayes’ Theorem

  • Before we go on to explain what Naive Bayes is about, we need to understand the formula below.

\[ P(\mathbf{Y} = k | \mathbf{X} = x) = \frac{P(k)P(\mathbf{X}|\mathbf{Y}=k)}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}} \]

  • Let’s look at it step-by-step ⏭️

Bayes’ Theorem

\[ P(\mathbf{Y} = k | \mathbf{X} = x) = \frac{P(k)P(\mathbf{X}|\mathbf{Y}=k)}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}} \]

New variables

  • \(K \Rightarrow\) the number of classes, labelled \(1, \ldots, K\) (as in the sum below). In the binary case, \(K = 2\).
  • \(P(k) \Rightarrow\) is the probability that a random sample belongs to class \(k\).

Note

The textbook uses a slightly different notation: it writes the prior \(P(k)\) as \(\pi_k\) and the likelihood \(P(\mathbf{X}|\mathbf{Y}=k)\) as \(f_k(x)\).

Bayes’ Theorem

\[ \color{blue}{P(\mathbf{Y} = k | \mathbf{X} = x)} \color{Gainsboro}{= \frac{P(k)P(\mathbf{X}|\mathbf{Y}=k)}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}}} \]

  • The quantity above (in blue) is called the posterior distribution
  • It is what we are interested in when making inferences/predictions
Read it as:
What is the probability that the class is \(k\) given that the sample is \(x\)?

Bayes’ Theorem

\[ \color{Gainsboro}{P(\mathbf{Y} = k | \mathbf{X} = x) =} \frac{\color{blue}{P(k)}\color{Gainsboro}{P(\mathbf{X}|\mathbf{Y}=k)}}{\color{Gainsboro}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}}} \]

  • The quantity above (in blue) is called the prior distribution
  • It represents the proportion of samples of class \(k\) we believe (estimate) we would find if sampling at random.
Read it as:
What is the probability that a randomly chosen sample belongs to class \(k\), before we observe its predictors?

Bayes’ Theorem

\[ \color{Gainsboro}{P(\mathbf{Y} = k | \mathbf{X} = x) =} \frac{\color{Gainsboro}{P(k)}\color{blue}{P(\mathbf{X}|\mathbf{Y}=k)}}{\color{Gainsboro}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}}} \]

  • The quantity above (in blue) is often called the likelihood
  • It represents the density function of \(\mathbf{X}\) for samples of class \(k\).
Think of it as:
What values would I expect \(X\) to take when the class is \(\mathbf{Y} = k\)?

Bayes’ Theorem

\[ \color{Gainsboro}{P(\mathbf{Y} = k | \mathbf{X} = x) =} \frac{\color{Gainsboro}{P(k)}\color{Gainsboro}{P(\mathbf{X}|\mathbf{Y}=k)}}{\color{blue}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}}} \]

  • The quantity above (in blue) represents the density function of \(\mathbf{X}\) regardless of the class
  • It is often called the marginal probability of \(\mathbf{X}\).
    • Note that \(\color{blue}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}} = P(\mathbf{X})\)
Think of it as:
What values would I expect \(X\) to take if I ignored the class?
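To see all four pieces at work, here is a small worked example in R. The numbers are purely illustrative (two classes, one observed value \(x\)); none of them come from a real dataset.

Code
# Purely illustrative numbers: two classes (k = 1, 2), one observed value x
prior      <- c(0.70, 0.30)               # P(1), P(2)
likelihood <- c(0.10, 0.60)               # P(X = x | Y = 1), P(X = x | Y = 2)

# Marginal probability of x: the denominator of Bayes' theorem
marginal <- sum(prior * likelihood)       # P(X = x) = 0.25

# Posterior for each class, via Bayes' theorem
posterior <- prior * likelihood / marginal
posterior                                 # 0.28 0.72 -- always sums to 1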

Bayes’ Theorem

\[ P(\mathbf{Y} = k | \mathbf{X} = x) = \frac{P(k)P(\mathbf{X}|\mathbf{Y}=k)}{\sum_{l=1}^{K}{P(l)P(\mathbf{X}|\mathbf{Y}=l)}} \]

  • Let’s look at how different algorithms explore this rule ⏭️

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA)

  • Assumptions:
    • Likelihood follows a Gaussian distribution
      • Each class has its own mean, \(\mu_k\)
      • All classes share the same variance
        • That is, \(\sigma^2_1 = \sigma^2_2 = \ldots = \sigma^2_K\), or simply \(\sigma^2\)
    • We denote this as: \(P(\mathbf{X}|\mathbf{Y}=k) \sim N(\mu_k, \sigma^2)\)
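With a single predictor, writing this assumption out gives the familiar Gaussian density (see James et al. 2021, chap. 4):

\[ P(\mathbf{X} = x|\mathbf{Y}=k) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - \mu_k)^2}{2\sigma^2}\right) \]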

LDA - Estimates

  • We estimate the class means, the shared variance, and the class priors as follows:

\[ \begin{align} \hat{\mu}_k &= \frac{1}{n_k}\sum_{i:y_i=k}{x_i}\\ \hat{\sigma}^2 &= \frac{1}{n - K}\sum_{k=1}^K{\sum_{i:y_i=k}{\left(x_i - \hat{\mu}_k\right)^2}} \\ \hat{P}(k) &= \frac{n_k}{n} \end{align} \]

  • where:
    • \(n\) is the total number of training observations
    • \(n_k\) is the number of training observations in the \(k\)th class
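As a sketch of these estimates in R, using balance from ISLR2::Default as a single predictor (the variable names below are ours):

Code
library(dplyr)

default_df <- ISLR2::Default
n_total    <- nrow(default_df)
K          <- n_distinct(default_df$default)   # 2 classes: "No" and "Yes"

estimates <- default_df %>%
  group_by(default) %>%
  summarise(
    n_k    = n(),                          # observations in class k
    mu_hat = mean(balance),                # class mean
    ss     = sum((balance - mu_hat)^2),    # within-class sum of squares
    p_hat  = n_k / n_total                 # estimated prior, n_k / n
  )

# Shared variance: pooled within-class sum of squares over (n - K)
sigma2_hat <- sum(estimates$ss) / (n_total - K)

In practice you would rarely do this by hand: MASS::lda() computes these quantities for you.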

Naive Bayes Classifier

Naive Bayes Classifier

  • Main Assumption:
Within the \(k\)th class, the \(p\) predictors are independent
  • Assuming the features are independent within each class (a stronger condition than merely being uncorrelated), the likelihood factorises: \[ P(\mathbf{X}|\mathbf{Y}=k) = \underbrace{P(x_1 |\mathbf{Y}=k)}_{1\text{st} \text{ predictor}} \times \underbrace{P(x_2 |\mathbf{Y}=k)}_{2\text{nd} \text{ predictor}} \times \ldots \times \underbrace{P(x_p |\mathbf{Y}=k)}_{p\text{-th} \text{ predictor}} \]
  • This means the posterior is given by: \[ P(\mathbf{Y} = k| \mathbf{X} = x) = \frac{\quad\quad P(k) \times P(x_1 |\mathbf{Y}=k) \times P(x_2 |\mathbf{Y}=k) \times \ldots \times P(x_p |\mathbf{Y}=k)}{\sum_{l=1}^K{P(l) \times P(x_1 |\mathbf{Y}=l) \times P(x_2 |\mathbf{Y}=l) \times \ldots \times P(x_p |\mathbf{Y}=l)}} \]

A naive approach indeed

  • This may all look very complicated, but it is actually quite simple
  • If the data are discrete (categorical), you just count the proportion of each category.

Example:

\[ P(\mathbf{X}_j = x_j | \mathbf{Y} = k) = \begin{cases} 0.32 & \text{if } x_j = 1\\ 0.55 & \text{if } x_j = 2\\ 0.13 & \text{if } x_j = 3 \end{cases} \]

  • If the data are continuous, use a histogram as an estimate for the true density of \(x_p\)
    • Alternatively, use a kernel density estimator
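You rarely code this by hand either. As a minimal sketch, the e1071 package provides one common implementation (an assumption on our part, not necessarily the package used in this course; note that its naiveBayes() models continuous predictors with a Gaussian density by default, rather than a histogram or kernel estimate):

Code
# install.packages("e1071") if needed
library(e1071)

nb_model <- naiveBayes(default ~ ., data = ISLR2::Default)

# Posterior probabilities P(Y = k | X = x) for the first few customers
predict(nb_model, head(ISLR2::Default), type = "raw")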

Making decisions

Default: Yes or No?

  • We have looked at how the probabilities (risk of default) change according to the value of predictors
  • But in practice we need to decide whether the risk is too high or tolerable
  • In our example, we might want to ask:
“Will this person default on their credit card? YES or NO?”

Default: Yes or No?

How would you classify the following customers?

Code
# Fit a logistic regression on the full Default dataset
library(tidyverse)

full_model <- 
  glm(default ~ ., data = ISLR2::Default, family = binomial)

# Pick six illustrative customers
set.seed(40)
sample_customers <- 
  ISLR2::Default %>% 
  slice(9986, 9908, 6848, 9762, 9979, 7438)

# Predicted probability of default for each customer
pred <- predict(full_model, sample_customers, type = "response")

# Format it as a percentage
sample_customers$prediction <- 
  sapply(pred, function(x) sprintf("%.2f %%", 100 * x))
sample_customers
  default student   balance   income prediction
1      No      No  842.9494 39957.13     0.27 %
2      No      No 1500.5721 39891.86    10.53 %
3     Yes     Yes 1957.1203 18805.95    44.23 %
4      No      No 1902.1499 35008.67    53.71 %
5     Yes      No 2202.4624 47287.26    87.09 %
6     Yes     Yes 2461.5070 11878.56    93.34 %

Default: Yes or No?

How would you classify the following customers?

  • If we set our threshold \(= 50\%\), we get the following confusion matrix:

                     Actual
    Predicted    No     Yes
    No            2       1
    Yes           1       2

  • If we set our threshold \(= 40\%\), we get the following confusion matrix:

                     Actual
    Predicted    No     Yes
    No            2       0
    Yes           1       3
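Both matrices can be reproduced from the code above (a sketch; pred holds the six predicted probabilities):

Code
# Predicted class at each threshold, cross-tabulated against the truth
pred_50 <- ifelse(pred > 0.50, "Yes", "No")
table(Predicted = pred_50, Actual = sample_customers$default)

pred_40 <- ifelse(pred > 0.40, "Yes", "No")
table(Predicted = pred_40, Actual = sample_customers$default)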


Which of the two is more accurate?

Thresholds

  • When making predictions about classes, we always have to make decisions.
  • Thresholds, applied to the predicted probability scores, are a way to decide whether to favour a particular class over another
  • ⏭️ Next, we will explore several metrics that can help us decide whether our classification model is good or bad.

Classification Metrics

Confusion Matrix

  • Let’s take another look at the confusion matrix. We can think of the numbers in each cell as the following:
                           Actual
    Predicted    No                      Yes
    No           True Negative (TN)      False Negative (FN)
    Yes          False Positive (FP)     True Positive (TP)


  • Ideally, we would have no False Negatives and no False Positives but, of course, that is never the case.

Classification metrics

  • It is convenient to aggregate those quantities into a few other metrics
  • Two of the most common ones are called sensitivity and specificity

\[ \begin{align} \text{Sensitivity} &= \text{True Positive Rate (TPR)} = \frac{TP}{P} \\ \text{Specificity} &= \text{True Negative Rate (TNR)} = \frac{TN}{N} \end{align} \]

  • where \(P = TP + FN\) is the total number of actual positives and \(N = TN + FP\) the total number of actual negatives

  • Another common one is accuracy:

\[ \text{Accuracy} = \frac{TP + TN}{P + N} \]

  • A good model has high sensitivity, high specificity, and high accuracy.
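As a quick illustration, here is a small helper of our own (not from any package) that computes all three metrics from the four cell counts:

Code
classification_metrics <- function(tp, tn, fp, fn) {
  c(
    sensitivity = tp / (tp + fn),                  # TP / P
    specificity = tn / (tn + fp),                  # TN / N
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
  )
}

# Example: the 50%-threshold matrix for the six sample customers above
classification_metrics(tp = 2, tn = 2, fp = 1, fn = 1)   # each ~ 0.67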

Which threshold is better?

📝 Now, looking at the logistic regression model we built for the entire dataset, work out the sensitivity, specificity and accuracy of the following confusion matrices:


Practice

  • ⏲️ 5 min to work out the math
  • 🗳️ Vote on your preferred threshold
    (on Slack)

\(\text{Threshold} = 50\%\):

                     Actual
    Predicted    No      Yes
    No          9627     228
    Yes           40     105

\(\text{Threshold} = 40\%\):

                     Actual
    Predicted    No      Yes
    No          9588     199
    Yes           79     134

Meet the ROC curve

  • The Receiver Operating Characteristic (ROC) curve is another way to assess the model.
  • It shows how sensitivity and specificity trade off as we vary the threshold from 0 to 1
    (the threshold itself is not an axis of the plot).
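One way to draw it in R is with the pROC package (a sketch and an assumption on our part, reusing full_model from the earlier code):

Code
# install.packages("pROC") if needed
library(pROC)

probs   <- predict(full_model, ISLR2::Default, type = "response")
roc_obj <- roc(ISLR2::Default$default, probs)

plot(roc_obj)   # sensitivity vs. specificity across all thresholds
auc(roc_obj)    # area under the curve; closer to 1 is better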

What could go wrong?

Generalisation problems

  • The data used to train algorithms is called training data
  • Often, we want to use the fitted models to make predictions on new, previously unseen data

Important

⚠️ A model that performs well on training data will not necessarily perform well on new data ⚠️

  • To make a robust assessment of our model, we have to split the data in two:
    • the training data and
    • the test data
  • We do NOT use the test data to fit the model
  • We will come back to this next week; it is the topic of 🗓️ Week 04.
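As a preview, a minimal train/test split in base R might look like this (a sketch; the 20% proportion and the seed are arbitrary):

Code
set.seed(1)

# Hold out 20% of the rows as test data
n_obs    <- nrow(ISLR2::Default)
test_idx <- sample(n_obs, size = round(0.2 * n_obs))

train_set <- ISLR2::Default[-test_idx, ]
test_set  <- ISLR2::Default[test_idx, ]

# Fit on the training data only, then assess on the held-out test data
model      <- glm(default ~ ., data = train_set, family = binomial)
test_probs <- predict(model, test_set, type = "response")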

Inappropriate reliance on metrics

  • Accuracy can be very misleading when classes are imbalanced
  • Consider the following model: \(\hat{y} = \text{No}\) (always; see the quick check after this list)
    • Only \(3\%\) of customers default on their credit cards
    • Therefore, this model would have a \(97\%\) accuracy!
    • It is correct ninety-seven percent of the time. But is it a good model?
      • 🙅‍♂️ NO! It never catches a single defaulter (its sensitivity is \(0\)).
  • Similarly, you have to ask yourself about the usefulness of any other metric
    • Is True Positive Rate more or less important than True Negative Rate for the classification problem at hand?
    • Why? Why not?
  • Ultimately, it boils down to how you plan to use this model afterwards.
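The quick check promised above: on ISLR2::Default, the constant always-"No" model scores high accuracy while catching no defaulters at all.

Code
# Accuracy of the model that always predicts "No"
mean(ISLR2::Default$default == "No")   # roughly 0.97, yet sensitivity = 0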

What’s Next

Your Checklist:

  • 📙 Read (James et al. 2021, chap. 4)

  • 👀 Browse the slides again

  • 📝 Take note of anything that isn’t clear to you

  • 📟 Share your questions in the /week04 channel on Slack

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Second edition. Springer Texts in Statistics. New York NY: Springer. https://www.statlearning.com/.