DS202 Data Science for Social Scientists
10/14/22
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
still applies, only this time \(Y\) is categorical.
Our categorical variables of interest take values in an unordered set \(\mathcal{C}\), such as eye colour \(\in \{\text{brown}, \text{blue}, \text{green}\}\).
What if I just coded each category as a number?
\[ Y = \begin{cases} 1 &\text{if}~\color{brown}{brown},\\ 2 &\text{if}~\color{blue}{blue},\\ 3 &\text{if}~\color{green}{green}. \end{cases} \]
What could go wrong?
How would you interpret a particular prediction if your model returned, say, \(\hat{Y} = 2.5\)?
Let’s talk about three possible interpretations of probability:
- Classical: events of the same kind can be reduced to a certain number of equally possible cases. Example: a coin toss lands heads or tails \(1/2\) of the time (\(50\%/50\%\)).
- Frequentist: what would the outcome be if I repeated the process many times? Example: if I toss a coin \(1{,}000{,}000\) times, I expect \(\approx 50\%\) heads and \(\approx 50\%\) tails.
- Bayesian: what is your judgement of the likelihood of the outcome, based on prior information? Example: if I know this coin has a symmetric weight, I expect a \(50\%/50\%\) outcome.
For our purposes, consider a binary response:
\[ Y \in \{0, 1\}. \]
We model the probability that \(Y = 1\) using the logistic function (a.k.a. the sigmoid curve):
\[ \Pr(Y = 1 \mid X) = p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1 X}} \]
[Illustration: the S-shaped logistic curve. Source: TIBCO]
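You can reproduce the gist of that illustration yourself. Below is a minimal base-R sketch of the S-shaped curve; the coefficient values are arbitrary, chosen only to make the shape visible:

```r
# The logistic (sigmoid) curve: probabilities squeezed between 0 and 1
beta0 <- -5    # illustrative values, not estimates
beta1 <- 0.01
x <- seq(0, 1000, length.out = 500)
p <- exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
plot(x, p, type = "l", xlab = "X", ylab = "Pr(Y = 1 | X)")
```

Note how \(p(X)\) stays between \(0\) and \(1\) for any \(X\), which is exactly what makes it usable as a probability.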
As with linear regression, the coefficients are unknown and need to be estimated from training data:
\[ \hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} \]
We estimate these by maximising the likelihood function:
\[ \ell(\beta_0, \beta_1) = \prod_{i:y_i=1}{p(x_i)} \prod_{i':y_{i'}=0} \left(1 - p(x_{i'})\right), \]
and this method is called maximum likelihood estimation (MLE).
➡️ As usual, there are multiple ways to solve this optimisation problem!
How do you find the latitude and longitude of a mountain peak if you can’t see very far?
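That "climb uphill step by step, even in fog" idea is what numerical optimisers do to the likelihood. Here is a toy gradient-ascent sketch on simulated data — an illustration only, not how R fits the model (R's glm() uses Fisher scoring, as reported at the bottom of its output):

```r
# Gradient ascent on the logistic log-likelihood: walk uphill until the top
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))    # simulated data, true betas (-1, 2)

beta <- c(0, 0)                            # start somewhere on the hillside
for (iter in 1:5000) {
  p    <- plogis(beta[1] + beta[2] * x)    # fitted probabilities at current beta
  grad <- c(sum(y - p), sum((y - p) * x))  # gradient of the log-likelihood
  beta <- beta + 0.01 * grad               # take a small step uphill
}
beta  # close to coef(glm(y ~ x, family = binomial))
```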
Advanced: if, for whatever reason, you find yourself enamoured with maximum likelihood estimation, check (Agresti 2019) for a recent take on the statistical properties of this method.
Since we now mostly care about probabilities, how do the odds change with the features of the customers?
The quantity below is called the odds:
\[ \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \]
📝 Give it a go! Using algebra, can you re-arrange the equation for \(p(X)\) presented in the Logistic regression model slides to arrive at the odds quantity shown above?
Taking the natural log of both sides gives the log-odds (or logit):
\[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \]
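In R, qlogis() and plogis() convert between probabilities and log-odds, which is handy for sanity-checking these formulas:

```r
p <- 0.75
p / (1 - p)        # odds: 3 (i.e. 3 to 1)
log(p / (1 - p))   # log-odds: 1.0986...
qlogis(p)          # the same log-odds, built into R
plogis(qlogis(p))  # and back to the probability: 0.75
```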
Default data
A sample of the data:
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
7 No No 825.5133 24905.227
8 No Yes 808.6675 17600.451
9 No No 1161.0579 37468.529
10 No No 0.0000 29275.268
11 No Yes 0.0000 21871.073
12 No Yes 1220.5838 13268.562
13 No No 237.0451 28251.695
14 No No 606.7423 44994.556
15 No No 1112.9684 23810.174
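This sample comes from the Default data in the ISLR2 package; assuming the package is installed, it can be reproduced with:

```r
library(ISLR2)         # companion package to James et al., ISLR
head(Default, n = 15)  # columns: default, student, balance, income
```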
How the data is spread:
default:
  No  Yes
9667  333
balance:
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.0   481.7   823.6   835.4  1166.3  2654.3
income:
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  772   21340   34553   33517   43808   73554
student:
  No  Yes
7056 2944
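These summaries are base R's table() and summary() applied to each column:

```r
table(Default$default)    # how many customers defaulted
summary(Default$balance)  # credit card balance remaining after monthly payment
summary(Default$income)   # customer income
table(Default$student)    # how many customers are students
```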
Fitting a model with balance as the only predictor gives:
\[ \hat{p}(X) = \frac{e^{-10.65133 + 0.005498917X}}{1 + e^{-10.65133 + 0.005498917X}} \]
That is:
\[ \begin{align} \hat{\beta}_0 &= -10.65133\\ \hat{\beta}_1 &= 0.005498917 \end{align} \]
Interpreting \(\hat{\beta}_0\): for a customer with a balance of zero, the log-odds of default are \(-10.65133\), i.e. a predicted probability of default of \(\approx 0.002\%\) — essentially no risk.
Interpreting \(\hat{\beta}_1\): each additional dollar of balance adds \(0.005498917\) to the log-odds of default, i.e. multiplies the odds by \(e^{0.005498917} \approx 1.0055\) (a \(\approx 0.55\%\) increase per dollar).
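A sketch of how this model is fitted and used for prediction in R (it matches the summary() output shown later in these slides):

```r
fit_balance <- glm(default ~ balance, family = binomial, data = ISLR2::Default)
coef(fit_balance)  # (Intercept) -10.65, balance 0.0055

# Predicted probability of default for a customer with a $1,000 balance
predict(fit_balance, newdata = data.frame(balance = 1000), type = "response")
# ~ 0.0058, i.e. still below 1%
```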
Using income as the only predictor instead:
\[ \hat{p}(X) = \frac{e^{-3.094149 - 8.352575 \times 10^{-6} X}}{1 + e^{-3.094149 - 8.352575 \times 10^{-6} X}} \]
That is:
\[ \begin{align} \hat{\beta}_0 &= - 3.094149\\ \hat{\beta}_1 &= - 8.352575 \times 10^{-6} \end{align} \]
Interpreting \(\hat{\beta}_0\): for a (hypothetical) customer with zero income, the log-odds of default are \(-3.094149\), i.e. a predicted probability of default of \(\approx 4.3\%\).
Interpreting \(\hat{\beta}_1\): each additional dollar of income multiplies the odds of default by \(e^{-8.352575 \times 10^{-6}} \approx 0.9999916\) — a decrease so small per dollar that it is better read on a larger scale (see the sketch below).
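A per-dollar odds multiplier this close to \(1\) is hard to read, so rescale it to a larger change — say, an extra \(\$10{,}000\) of income:

```r
beta1 <- -8.352575e-6
exp(beta1 * 10000)  # ~ 0.92: an extra $10,000 of income shrinks the odds by ~8%
```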
Finally, using the binary student indicator as the predictor (\(X = 1\) if the customer is a student, \(0\) otherwise):
\[ \hat{p}(X) = \frac{e^{-3.504128 + 0.4048871 X}}{1 + e^{-3.504128 + 0.4048871 X}} \]
That is:
\[ \begin{align} \hat{\beta}_0 &= -3.504128\\ \hat{\beta}_1 &= +0.4048871 \end{align} \]
Interpreting \(\hat{\beta}_0\): for non-students (\(X = 0\)), the log-odds of default are \(-3.504128\), i.e. a predicted probability of default of \(\approx 2.9\%\).
Interpreting \(\hat{\beta}_1\): being a student (\(X = 1\)) multiplies the odds of default by \(e^{0.4048871} \approx 1.50\), raising the predicted probability of default to \(\approx 4.3\%\).
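Because the predictor is binary, there are only two possible predicted probabilities; a quick check in R:

```r
fit_student <- glm(default ~ student, family = binomial, data = ISLR2::Default)
predict(fit_student,
        newdata = data.frame(student = c("No", "Yes")),
        type = "response")
# ~ 0.0292 for non-students vs ~ 0.0431 for students
```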
The output of summary() is similar to that of linear regression:
Call:
glm(formula = default ~ balance, family = binomial, data = ISLR2::Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2697 -0.1465 -0.0589 -0.0221 3.7589
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1596.5 on 9998 degrees of freedom
AIC: 1600.5
Number of Fisher Scoring iterations: 8
Logistic regression extends naturally to multiple predictors. The log-odds become:
\[ \log \left( \frac{p(X)}{1-p(X)} \right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]
so that:
\[ p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}} \]
The full model on the Default data:
Call:
glm(formula = default ~ ., family = binomial, data = ISLR2::Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1571.5 on 9996 degrees of freedom
AIC: 1579.5
Number of Fisher Scoring iterations: 8
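Note the sign flip: studentYes is now negative, whereas it was positive in the student-only model. Holding balance and income fixed, a student is less likely to default than a non-student; students only look riskier overall because they tend to carry larger balances. A quick check with predictions at a fixed, illustrative balance and income:

```r
fit_full <- glm(default ~ ., family = binomial, data = ISLR2::Default)
predict(fit_full,
        newdata = data.frame(student = c("No", "Yes"),
                             balance = 1500, income = 40000),
        type = "response")
# ~ 0.105 for the non-student vs ~ 0.058 for the student
```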
After our 10-min break ☕:
DS202 - Data Science for Social Scientists 🤖 🤹