DS101 – Fundamentals of Data Science
11 Nov 2024
We also saw how linear regression models are normally represented mathematically:
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
when we use a single predictor, \(X\).
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon \]
when there are multiple predictors, \(X_1, \dots, X_p\).
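As a quick illustration of how the single-predictor coefficients \(\beta_0\) and \(\beta_1\) are estimated by least squares, here is a minimal sketch on toy data (generated here for illustration, not the course dataset):

```python
import numpy as np

# Toy data generated from Y = 2 + 3X + noise (illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=200)
Y = 2 + 3 * X + rng.normal(0, 1, size=200)

# Least-squares estimates of the slope (beta_1) and intercept (beta_0)
beta_1, beta_0 = np.polyfit(X, Y, deg=1)
print(beta_0, beta_1)  # should be close to 2 and 3
```

With enough data and well-behaved noise, the estimates land close to the true values used to generate the data.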
Note
The typical linear model assumes that:
- the relationship between \(X\) and \(Y\) is linear in the parameters
- the errors \(\epsilon\) are independent of \(X\), have mean zero and constant variance
Important
Barely any real-world process is linear.
We often want to use a model to make predictions
Remember the reading (Hohl 2009) from your 📝 reading week formative?
It mentions the relationship between life satisfaction and household income. Let’s take a closer look at it.
We take data from the latest round of the European Social Survey (ESS), i.e. round 11, conducted in 2024 (European Social Survey European Research Infrastructure (ESS ERIC) 2024), and we only look at the variables linked to life satisfaction and household net income from all sources.
In this dataset:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:      life_satisfaction   R-squared (uncentered):                   0.813
Model:                            OLS   Adj. R-squared (uncentered):              0.813
Method:                 Least Squares   F-statistic:                          9.678e+04
Date:                Mon, 11 Nov 2024   Prob (F-statistic):                        0.00
Time:                        05:17:18   Log-Likelihood:                         -58042.
No. Observations:               22190   AIC:                                  1.161e+05
Df Residuals:                   22189   BIC:                                  1.161e+05
Df Model:                           1
Covariance Type:            nonrobust
==========================================================================================================
                                             coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------
household_total_net_income_all_sources     1.1209      0.004    311.093      0.000       1.114       1.128
==============================================================================
Omnibus:                       15.942   Durbin-Watson:                   1.718
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               14.288
Skew:                          -0.014   Prob(JB):                     0.000790
Kurtosis:                       2.879   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
RMSE: 3.3092749429303825
95% confidence interval:

                                               0         1
household_total_net_income_all_sources  1.113871  1.127996
What do you think of this fit?
Note
We will focus a lot more on predictions from now on.
This is because Machine Learning, in practice, is all about making predictions.
Machine Learning (ML) is a subfield of Computer Science and Artificial Intelligence (AI) that focuses on the design and development of algorithms that can learn from data.
INPUT (data)
⬇️
ALGORITHM
⬇️
OUTPUT (prediction)
If we assume there is a way to map between \(X\) and \(Y\), we could use SUPERVISED LEARNING to learn this mapping.
Suppose you want to be able to tell whether a patient has diabetes or not.
How would you approach this problem?
Image source: Oxford University Research News
If we try to predict risk of diabetes from existing symptoms, we could look at:
All of this information constitutes our input.
Consider a binary response:
\[ Y = \begin{cases} 0 & \text{e.g. healthy} \\ 1 & \text{e.g. diabetes} \end{cases} \]
We model the probability that \(Y = 1\) using the logistic function (aka. sigmoid curve):
\[ Pr(Y = 1|X) = p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1 X}} \]
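The logistic function is easy to compute directly. A minimal sketch, with arbitrary illustrative coefficients (not fitted values):

```python
import math

def logistic_p(x, beta_0=-4.0, beta_1=0.05):
    """P(Y = 1 | X = x) under a logistic model.

    The coefficients are illustrative, not fitted values.
    """
    z = beta_0 + beta_1 * x
    return math.exp(z) / (1 + math.exp(z))

# The output is always a valid probability, strictly between 0 and 1
for x in (0, 80, 200):
    assert 0 < logistic_p(x) < 1

print(logistic_p(80))  # z = -4 + 0.05 * 80 = 0, so p = 0.5
```

This is why the logistic (sigmoid) curve is used instead of a straight line: its output can always be read as a probability.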
Source of illustration: TIBCO
- Most variables are categorical (all except age): we’ll need to encode them in numerical values before applying logistic regression
- We rename the class variable to outcome (class is a reserved keyword, so we avoid it for reasons of Python syntax)

How do we evaluate the model’s performance?
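The preparation steps above can be sketched with pandas; the rows and column names here are hypothetical stand-ins for the symptoms dataset:

```python
import pandas as pd

# Hypothetical rows in the spirit of the symptoms dataset:
# yes/no symptom columns plus a "class" label
df = pd.DataFrame({
    "age": [45, 60, 35],
    "polyuria": ["Yes", "No", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

# "class" is a reserved keyword in Python, so rename it before modelling
df = df.rename(columns={"class": "outcome"})

# Encode a categorical yes/no symptom column as 0/1
df["polyuria"] = (df["polyuria"] == "Yes").astype(int)

print(df)
```

After this, every predictor is numeric and the target column has a name Python is happy with.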
Some definitions:
We are looking at two classes (diabetes and healthy): we consider one “positive” (the one we’re interested in, i.e. diabetes) and the other “negative” (i.e. the healthy class).
We have defined what “positive” and “negative” mean so:
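Once “positive” and “negative” are fixed, predictions split into true/false positives and negatives, and the usual metrics follow. A small sketch with made-up counts (chosen only for illustration):

```python
# Made-up counts for illustration:
# TP = true positives, FP = false positives,
# FN = false negatives, TN = true negatives
TP, FP, FN, TN = 98, 5, 4, 49

precision = TP / (TP + FP)  # of predicted positives, how many are right
recall = TP / (TP + FN)     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(round(precision, 2), round(recall, 2),
      round(f1, 2), round(accuracy, 2))
```

Precision and recall can disagree sharply on imbalanced data, which is why accuracy alone is not enough.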
---------------------------
target value: Positive
number of elements for target value Positive : 320
proportion in data: 0.6153846153846154
---------------------------
target value: Negative
number of elements for target value Negative : 200
proportion in data: 0.38461538461538464
How do we evaluate the model’s performance?
Choices to tackle the imbalance:
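One common choice (among others, such as resampling) is to reweight the classes in the model’s loss. In scikit-learn this is the `class_weight` parameter; a sketch on toy imbalanced data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy imbalanced 2-D data: 80 points in one class, 20 in the other
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (80, 2)),
               rng.normal(-1.0, 1.0, (20, 2))])
y = np.array([1] * 80 + [0] * 20)

# class_weight="balanced" upweights the minority class inversely to its
# frequency, instead of treating every misclassification equally
clf = LogisticRegression(class_weight="balanced").fit(X, y)

print(clf.score(X, y))  # training accuracy
```

Without reweighting, a model on imbalanced data can score well simply by predicting the majority class.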
How does the model we trained perform?
              precision    recall  f1-score   support

    Negative       0.92      0.91      0.92        54
    Positive       0.95      0.96      0.96       102

    accuracy                           0.94       156
   macro avg       0.94      0.93      0.94       156
weighted avg       0.94      0.94      0.94       156
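A table of this shape comes from scikit-learn’s `classification_report`. A minimal sketch on dummy labels (made up here just to show the layout):

```python
from sklearn.metrics import classification_report

# Dummy true/predicted labels, just to show the report layout
y_true = ["Positive", "Positive", "Negative", "Negative", "Positive"]
y_pred = ["Positive", "Negative", "Negative", "Negative", "Positive"]

print(classification_report(y_true, y_pred))
```

The per-class rows report precision, recall, and F1 against each class’s support, followed by accuracy and the macro/weighted averages.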
After the break:
              precision    recall  f1-score   support

    Negative       0.83      0.98      0.90        54
    Positive       0.99      0.89      0.94       102

    accuracy                           0.92       156
   macro avg       0.91      0.94      0.92       156
weighted avg       0.93      0.92      0.92       156
Say you have a dataset with two classes of points:
The goal is to find a line that separates the two classes of points:
It can get more complicated than just a line:
Image from Stack Exchange | TeX
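One standard way to find such a separating line is a linear support vector machine (the slides don’t name the method explicitly; this is an assumption based on the kernel illustration). A sketch on toy separable data:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clouds of 2-D points (toy data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, (30, 2)),
               rng.normal(2.0, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# kernel="linear" looks for a straight separating line; the default
# kernel="rbf" allows curved boundaries for harder cases
clf = SVC(kernel="linear").fit(X, y)

print(clf.score(X, y))  # well-separated data, so accuracy should be ~1.0
```

Swapping in a non-linear kernel is how the “more complicated than just a line” cases in the illustration are handled.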
LSE DS101 2024/25 Autumn Term