🗓️ Week 07
Machine Learning I

DS101 – Fundamentals of Data Science

11 Nov 2024

⏪ Recap

  • A model is a mathematical representation of a real-world process, a simplified version of reality.
  • We saw examples of linear regression models (week 4 lecture and class).
    • The linear regression model contains several assumptions that need to be met for it to be valid.
    • Social scientists have been using linear regression models for decades.

⏪ Recap: Linear Regression

We also saw how linear regression models are normally represented mathematically:

The generic supervised model:

\[ Y = \operatorname{f}(X) + \epsilon \]

is defined more explicitly as follows ➡️

Simple linear regression

\[ Y = \beta_0 + \beta_1 X + \epsilon, \]

when we use a single predictor, \(X\).

Multiple linear regression

\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]

when there are multiple predictors, \(X_1, \dots, X_p\).
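A simple linear regression has a closed-form solution, which we can sketch in a few lines of plain Python (the data below is made up purely for illustration; in practice you would use a library such as statsmodels):

```python
# A minimal sketch of fitting Y = b0 + b1*X by ordinary least squares,
# using only the standard library. Data is illustrative, not from the lecture.

def fit_simple_ols(xs, ys):
    """Return (b0, b1) minimising the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b1 = covariance of (X, Y) divided by variance of X
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Noise-free example: Y = 2 + 3X, so OLS should recover b0 = 2, b1 = 3.
b0, b1 = fit_simple_ols([0, 1, 2, 3], [2, 5, 8, 11])
```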

Note

  • As a well-studied statistical technique, we know a lot about the properties of the model.
  • Researchers can use this knowledge to assess the validity of the model, using things like confidence intervals, hypothesis testing, and many other model diagnostics.

Limitations of Linear Regression Models

The typical linear model assumes that:

  • the relationship between the response and the predictors is linear.
  • the error terms are independent and identically distributed.
  • the error terms have a constant variance.
  • the error terms are normally distributed.

Important

Barely any real-world process is linear.

Making Predictions

We often want to use a model to make predictions

  • either about the future
  • or about new observations.

Linear regression is not always the way to go

Remember the reading (Hohl 2009) from your 📝 reading week formative?

It mentions the relationship between life satisfaction and household income. Let’s take a closer look at it.

We take data from the latest round of the European Social Survey (ESS), i.e. round 11, conducted in 2024 (European Social Survey European Research Infrastructure (ESS ERIC) 2024), and we only look at the variables linked to life satisfaction and household net income from all sources.

Linear regression is not always the way to go

In this dataset:

  • life satisfaction is encoded as a categorical variable on a scale from 0 to 10, with 0 meaning extremely dissatisfied and 10 very satisfied
  • household net income is encoded as a decile
  • values 77, 88 and 99 correspond to missing values for both variables (the respondents didn’t answer, refused to answer or answered “don’t know”). We replace these values with the median of each variable 1
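The recoding step can be sketched in plain Python (in practice you would do this with pandas; the values below are illustrative, not the actual ESS data):

```python
# Hypothetical sketch of the missing-value recoding described above:
# ESS uses 77/88/99 as codes for refusal / don't know / no answer.
from statistics import median

MISSING_CODES = {77, 88, 99}

def replace_missing_with_median(values):
    observed = [v for v in values if v not in MISSING_CODES]
    med = median(observed)
    return [med if v in MISSING_CODES else v for v in values]

# median of the observed values [7, 8, 5, 8] is 7.5
cleaned = replace_missing_with_median([7, 8, 77, 5, 99, 8])
```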

Linear regression is not always the way to go

                                OLS Regression Results                                
=======================================================================================
Dep. Variable:      life_satisfaction   R-squared (uncentered):                   0.813
Model:                            OLS   Adj. R-squared (uncentered):              0.813
Method:                 Least Squares   F-statistic:                          9.678e+04
Date:                Mon, 11 Nov 2024   Prob (F-statistic):                        0.00
Time:                        05:17:18   Log-Likelihood:                         -58042.
No. Observations:               22190   AIC:                                  1.161e+05
Df Residuals:                   22189   BIC:                                  1.161e+05
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
==========================================================================================================
                                             coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------
household_total_net_income_all_sources     1.1209      0.004    311.093      0.000       1.114       1.128
==============================================================================
Omnibus:                       15.942   Durbin-Watson:                   1.718
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               14.288
Skew:                          -0.014   Prob(JB):                     0.000790
Kurtosis:                       2.879   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
RMSE:  3.3092749429303825
95% confidence interval:
                                                0         1
household_total_net_income_all_sources  1.113871  1.127996

Linear regression is not always the way to go

What do you think of this fit?

The thing about predictions

  • We can use a model to make predictions about the future.
  • We can also use a model to make predictions about new observations.
  • But we need to be careful about the assumptions that we make about the data.

Note

We will focus a lot more on predictions from now on.

This is because Machine Learning, in practice, is all about making predictions.

What is Machine Learning?

A definition

Machine Learning (ML) is a subfield of Computer Science and Artificial Intelligence (AI) that focuses on the design and development of algorithms that can learn from data.

How it works

https://xkcd.com/1838

How it works

  • The process is similar to the regular algorithms (think recipes) we saw in W03.
  • Only this time, the ingredients are data.

INPUT (data)

⬇️

ALGORITHM

⬇️

OUTPUT (prediction)

Types of Machine Learning

Supervised
Learning

  • Each data point is associated with a label.
  • The goal is to learn a function that maps the data to the labels.

(🗓️ Week 07)

Unsupervised
Learning

  • There is no such association between the data and the labels.
  • The focus is on similarities between data points.

(🗓️ Week 08)

Supervised Learning

Supervised Learning

[Diagram: X (Input) → Y (Output), with the mapping f: X → Y]


If we assume there is a way to map between X and Y, we could use SUPERVISED LEARNING to learn this mapping.

Supervised Learning



  • The algorithm will teach itself to identify changes in \(Y\) based on the values of \(X\)
  • A dataset of labelled data is required
  • Prediction: presented with new \(X\), the algorithm will be able to predict what \(Y\) would look like

Supervised Learning



  • If \(Y\) is a numerical value, we call it a regression problem.
  • If \(Y\) is a category, we call it a classification problem.

Supervised Learning



  • There are countless ways to learn this mapping; each algorithm does it in its own way.
  • Here are some names of basic algorithms:
    • Linear regression
    • Logistic regression
    • Decision trees
    • Support vector machines
    • Neural networks

One Practical Example

Suppose you want to be able to tell whether a patient has diabetes or not.

How would you approach this problem?

What features could we rely on to try and predict diabetes?

If we try to predict risk of diabetes from existing symptoms, we could look at:

  • patient demographics (i.e. age and gender)
  • the existence or not of excessive thirst (polydipsia)
  • the existence or not of excessive, insatiable hunger (polyphagia)
  • the existence or not of alopecia (baldness)
  • the existence or not of polyuria (abnormally large amounts of urine)
  • the existence or not of sudden weight loss
  • the existence or not of obesity
  • the existence or not of visual blurring
  • the existence or not of genital thrush or itching
  • the existence or not of weakness
  • the existence or not of paresis (i.e. partial or mild loss of muscle control or weakness)
  • the existence or not of muscle stiffness
  • the existence or not of irritability
  • the existence or not of delayed healing

All of this information constitutes our input.

What about the output?

  • The output is something you would normally want to predict or attempt to explain 1
  • Also referred to as the label, “ground truth”, target or dependent variable, or simply \(Y\)
  • Most commonly it is a numerical or categorical variable

Structure of dataset

  • The data comes from a dataset made publicly available in 2020 (Early Stage Diabetes Risk Prediction 2020) on the UC Irvine Machine Learning Repository (or UCI Machine Learning Repository)
  • The UCI repository is “a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms” (Kelly, Nottingham, and Longjohn, n.d., “About Us” page)
    https://archive.ics.uci.edu
  • The dataset contains the signs and symptoms of newly diabetic or would-be diabetic patients; it was collected using direct questionnaires from the patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh, and approved by a doctor.

Algorithms

The Logistic Regression model

Consider a binary response:

\[ Y = \begin{cases} 0 \\ 1 \end{cases} \]

We model the probability that \(Y = 1\) using the logistic function (aka. sigmoid curve):

\[ Pr(Y = 1|X) = p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1 X}} \]
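The logistic function above is easy to compute directly; a minimal sketch with made-up coefficients (the real \(\beta_0, \beta_1\) would be estimated from data):

```python
import math

# The logistic (sigmoid) function from the slide: Pr(Y = 1 | X).
# The coefficients b0 and b1 below are illustrative, not fitted values.
def p_of_x(x, b0, b1):
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# The output is always squeezed into (0, 1).
prob = p_of_x(2.0, b0=-1.0, b1=0.5)  # z = -1 + 0.5*2 = 0, so p = 0.5
```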

Logistic regression applied to our diabetes example

  • Our dataset has mainly categorical (binary) predictors (except for age): we’ll need to encode them as numerical values before applying logistic regression
  • We usually split the data into a training set and a test set (depending on the amount of data available, we generally use about 70% of the data to train the model and 30% to test it)
  • We’ll use all the predictors to predict the class/outcome variable (we rename the class variable to outcome for reasons of Python syntax)
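The two preprocessing steps above can be sketched in plain Python (in practice you would use pandas and scikit-learn; all values and the 70/30 ratio are as described on the slide, the column contents are hypothetical):

```python
import random

# Sketch of the preprocessing described above: binary categorical
# predictors become 0/1, then rows are split 70/30 into train and test.
def encode_binary(values, positive="Yes"):
    return [1 if v == positive else 0 for v in values]

def train_test_split(rows, test_frac=0.3, seed=42):
    rows = rows[:]                       # don't mutate the caller's list
    random.Random(seed).shuffle(rows)    # fixed seed for reproducibility
    n_test = int(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]  # train, test

encoded = encode_binary(["Yes", "No", "Yes"])
train, test = train_test_split(list(range(10)))
```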

Logistic regression applied to our diabetes example

How do we evaluate the model’s performance?

Some definitions:

  • We are looking at two classes (diabetes and healthy): we consider one “positive” - it is the one we’re interested in, i.e. diabetes - and the other “negative”, i.e. the healthy class

  • We have defined what “positive” and “negative” mean so:

    • a “true positive” is simply an instance of the “positive” class that the model we are testing correctly predicts as belonging to the “positive” class, i.e. in our example, diabetes correctly identified as diabetes.
    • a “true negative” is an instance of the “negative” class correctly predicted as belonging to the “negative” class, i.e. in our case, healthy patients correctly labelled as healthy.
    • a “false positive” is an instance of the negative class incorrectly identified as belonging to the positive class, here, healthy patients incorrectly labelled as having diabetes.
    • a “false negative” is an instance of the positive class incorrectly predicted as belonging to the negative class, here, diabetic patients incorrectly labelled as healthy.
    • Which do you think are more important in this setting?
    • Answer: Because we are in a medical setting, false negatives are much more detrimental than false positives: we are missing cases of disease, exposing patients to poor prognosis and risk of death (in the case of false negatives) as opposed to unnecessary tests, treatments and potentially harmful treatment side-effects (in the case of false positives).
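Counting these four outcomes is straightforward; a minimal sketch with illustrative labels (not the actual model predictions):

```python
from collections import Counter

# Count TP/TN/FP/FN for a binary problem where "Positive" is the
# diabetes class. The label lists below are made up for illustration.
def confusion_counts(y_true, y_pred, positive="Positive"):
    c = Counter()
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            c["TP"] += 1            # diabetes correctly identified
        elif t != positive and p != positive:
            c["TN"] += 1            # healthy correctly identified
        elif t != positive and p == positive:
            c["FP"] += 1            # healthy flagged as diabetes
        else:
            c["FN"] += 1            # diabetes missed
    return c

counts = confusion_counts(
    ["Positive", "Positive", "Negative", "Negative"],
    ["Positive", "Negative", "Negative", "Positive"],
)
```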

Logistic regression applied to our diabetes example

How do we evaluate the model’s performance?

Some definitions:

  • First evaluation metric: accuracy. In the general case, accuracy is defined as: \[ \text{Accuracy}=\frac{\text{correct classifications}}{\text{all classifications}}\] In the binary case, this translates as: \[ \text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \] where \(TP\) = true positives, \(FP\) = false positives, \(TN\) = true negatives and \(FN\) = false negatives.
  • However, is a 99% accuracy always a good thing?
  • No, that’s what we call the “accuracy paradox”. Accuracy is a metric that is unsuited for imbalanced datasets
  • What about our dataset?
---------------------------
target value: Positive 
number of elements for target value Positive : 320 
proportion in data: 0.6153846153846154
---------------------------
target value: Negative 
number of elements for target value Negative : 200 
proportion in data: 0.38461538461538464

Logistic regression applied to our diabetes example

How do we evaluate the model’s performance?

Choices to tackle the imbalance:

  1. Resample the data: undersample/downsample it (i.e. reduce the number of data points from the majority class) or oversample it (i.e. increase the number of samples from the minority class):
    • Downsampling is not recommended unless you have massive amounts of data (if you reduce your majority class data points to try and match the proportion of minority class points present in your dataset, you might not be left with much data at all to analyse!)
    • So, in practice, you tend to oversample, and one of the most common methods for this is called SMOTE (Chawla et al. 2002)
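SMOTE creates *synthetic* minority points by interpolating between neighbours; as a simpler stand-in, this sketch shows plain random oversampling, which just duplicates existing minority rows until the classes are balanced (all data is illustrative):

```python
import random

# Plain random oversampling -- a simpler technique than SMOTE, shown
# here only to illustrate the idea of rebalancing the classes.
def random_oversample(rows, labels, minority, seed=0):
    rng = random.Random(seed)
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    # how many extra minority rows we need to match the majority class
    n_needed = sum(l != minority for l in labels) - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(n_needed)]
    return rows + extra, labels + [minority] * n_needed

# 3 "N" vs 2 "P": one duplicated "P" row balances the classes.
X, y = random_oversample([1, 2, 3, 4, 5], ["N", "N", "N", "P", "P"], "P")
```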

Logistic regression applied to our diabetes example

How do we evaluate the model’s performance?

Choices to tackle the imbalance:

  2. Choose metrics more suited for imbalanced datasets (our choice here):
    • The precision score measures the model performance by computing the ratio between true positives and the total number of records labelled “positive” (i.e. true positives + false positives): \[ \text{precision} = \frac{TP}{TP+FP}\]
    • The recall score measures the model performance by computing the ratio between true positives and the total number of actual positive records in the dataset (i.e. true positives + false negatives): \[ \text{recall} = \frac{TP}{TP+FN}\]
  • The F1-score is the harmonic mean of precision and recall; it is used as a metric in scenarios where you don’t want to give more weight to either precision or recall and seek a good compromise/tradeoff between the two. It is also a good metric for imbalanced datasets: \[ \text{F1-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}}\]
  • A visual summary of TP, TN, FP, FN is the confusion matrix, which gives a more detailed look at the performance of the model
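These three metrics follow directly from the formulas; a minimal sketch with made-up counts (not the counts from our model):

```python
# Precision, recall and F1 computed straight from the formulas above.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(prec, rec):
    return 2 * prec * rec / (prec + rec)

# Illustrative counts: 90 true positives, 10 false positives,
# 30 false negatives.
p = precision(90, 10)   # 0.9
r = recall(90, 30)      # 0.75
f = f1(p, r)            # harmonic mean of the two
```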

Logistic regression applied to our diabetes example

How does the model we trained perform?

               precision    recall  f1-score   support

    Negative       0.92      0.91      0.92        54
    Positive       0.95      0.96      0.96       102

    accuracy                           0.94       156
   macro avg       0.94      0.93      0.94       156
weighted avg       0.94      0.94      0.94       156

Time for a break 🍵

After the break:

  • Other examples of supervised learning algorithms
  • Self-supervised learning

The Decision Tree model
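The slide's figure is not reproduced here, but the core idea can be sketched: a decision tree repeatedly splits the data on the feature and threshold that leave the resulting groups as "pure" as possible. A minimal sketch scoring one candidate split with the Gini impurity (all values are illustrative, not from the diabetes data):

```python
# Gini impurity: 0 for a pure group, higher for a mixed one.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Weighted impurity of splitting a numeric feature at a threshold.
def split_impurity(values, labels, threshold):
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A threshold that separates the classes perfectly has impurity 0.
score = split_impurity([1, 2, 8, 9], ["N", "N", "P", "P"], threshold=5)
```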

The Decision Tree model applied to diabetes: the metrics

       precision    recall  f1-score   support

    Negative       0.83      0.98      0.90        54
    Positive       0.99      0.89      0.94       102

    accuracy                           0.92       156
   macro avg       0.91      0.94      0.92       156
weighted avg       0.93      0.92      0.92       156

The Support Vector Machine model

The Support Vector Machine model

Say you have a dataset with two classes of points:

The Support Vector Machine model

The goal is to find a line that separates the two classes of points:

The Support Vector Machine model

It can get more complicated than just a line:
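The separating "line" an SVM learns is, in general, a hyperplane \(w \cdot x + b = 0\); a new point is classified by which side of it the point falls on. A minimal sketch of that decision rule (the weights below are made up, not fitted by an SVM):

```python
# Classify a point by the sign of w·x + b, the rule an SVM applies
# once it has learned w and b. Weights here are illustrative.
def classify(point, w, b):
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return "Positive" if score >= 0 else "Negative"

# With w = (1, 1) and b = -3, the boundary is the line x + y = 3.
label_above = classify((3, 2), w=(1, 1), b=-3)  # 3 + 2 - 3 = 2 >= 0
label_below = classify((1, 1), w=(1, 1), b=-3)  # 1 + 1 - 3 = -1 < 0
```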

Neural Networks

Deep Learning

  • Stack a lot of layers on top of each other, and you get a deep neural network.
  • Forget about interpretability!
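"Stacking layers" means each layer applies a set of weights and then a non-linearity, and the next layer consumes the result. A minimal sketch of one input passing through two tiny layers (weights are made up; a real network learns them by training):

```python
# A tiny two-layer forward pass, with plain lists instead of tensors.
def relu(z):
    return [max(0.0, v) for v in z]  # the usual non-linearity

def layer(x, weights, biases):
    # each output = weighted sum of inputs plus a bias
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

x = [1.0, 2.0]
# hidden layer (2 units), then an output layer (1 unit)
h = relu(layer(x, weights=[[1.0, -1.0], [0.5, 0.5]], biases=[0.0, 0.0]))
out = layer(h, weights=[[1.0, 1.0]], biases=[0.5])
```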

Attention

  • The Transformer architecture is a deep learning model that uses attention to learn contextual relationships between words in a text.
  • It has revolutionized the field of Natural Language Processing (NLP).
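In miniature, attention makes each word's output a weighted average of all the value vectors, with weights derived from query-key similarity. A toy sketch of scaled dot-product attention (all vectors are made-up toy values, not a real Transformer):

```python
import math

def softmax(zs):
    # numerically stable softmax: subtract the max before exponentiating
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values, dim):
    # similarity of the query to each key, scaled by sqrt(dim)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    weights = softmax(scores)
    # output = attention-weighted average of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key best, so the output leans
# towards the first value vector.
out = attention(query=[1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0], [0.0]], dim=2)
```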

What’s next?

What’s next?

  • 🗓️ Week 08: Unsupervised learning
  • 🗓️ Week 09: More on unstructured data
    • You will also explore how to measure the performance of your models
  • 🗓️ Week 10: We’ll think deeper about what it means to predict something
    • We’ll also talk about fairness and bias in machine learning

References

Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16 (June): 321–57. https://doi.org/10.1613/jair.953.
“Early Stage Diabetes Risk Prediction.” 2020. UCI Machine Learning Repository.
European Social Survey European Research Infrastructure (ESS ERIC). 2024. “ESS11 - Integrated File, Edition 1.0.” Sikt - Norwegian Agency for Shared Services in Education and Research. https://doi.org/10.21338/ess11e01_0.
Hohl, Katrin. 2009. “Beyond the Average Case: The Mean Focus Fallacy of Standard Linear Regression and the Use of Quantile Regression for the Social Sciences.” Available at SSRN 1434418. http://dx.doi.org/10.2139/ssrn.1434418.
Kelly, Markelle, Kolby Nottingham, and Rachel Longjohn. n.d. “The UCI Machine Learning Repository.” https://archive.ics.uci.edu.
Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3). https://doi.org/10.1214/10-STS330.