DS101 – Fundamentals of Data Science
06 Nov 2023
We also saw how linear regression models are normally represented mathematically:
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, X_2, \dots, X_p\).
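As a minimal sketch of the arithmetic behind the multi-predictor equation, the snippet below evaluates \(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\) for one observation. The function name `linear_prediction` and all numeric values are hypothetical, chosen only for illustration; the noise term \(\epsilon\) is left out because it is not known when predicting.

```python
def linear_prediction(intercept, coefficients, predictors):
    """Return beta_0 + beta_1*x_1 + ... + beta_p*x_p (noise term omitted)."""
    if len(coefficients) != len(predictors):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))


# Hypothetical values, chosen only to illustrate the arithmetic.
beta_0 = 1.0
betas = [2.0, -0.5]  # beta_1, beta_2
x = [3.0, 4.0]       # x_1, x_2

y_hat = linear_prediction(beta_0, betas, x)
print(y_hat)  # 1.0 + 2.0*3.0 + (-0.5)*4.0 = 5.0
```

With a single predictor, the same function reduces to the simple model: `linear_prediction(beta_0, [beta_1], [x])`.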
Note
The typical linear model assumes that:
Important
Barely any real-world process is linear.
We often want to use a model to make predictions
Remember the simple linear model from W05?
What prediction would this model give if we set the independent variable to 1 (i.e., the maximum possible human development index)?
Since the model equation was: \[\begin{align} \operatorname{life\_expectancy} = & 102.9273 \times \operatorname{income\_composition\_of\_resources} \end{align}\]
this means that a human development index of 1 (i.e., income_composition_of_resources = 1) corresponds to a predicted life expectancy of 102.9273 years!
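The same calculation can be sketched in code. The coefficient 102.9273 comes from the model equation above; the function name `predict_life_expectancy` and the second input value (0.5) are hypothetical additions for illustration.

```python
SLOPE = 102.9273  # coefficient from the model equation (the model has no intercept)


def predict_life_expectancy(income_composition_of_resources):
    """Predicted life expectancy (years) under the simple linear model."""
    return SLOPE * income_composition_of_resources


print(predict_life_expectancy(1.0))  # 102.9273
print(predict_life_expectancy(0.5))  # prediction at a hypothetical index of 0.5
```

Because the model is a straight line through the origin, halving the index halves the predicted life expectancy.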
Note
We will focus a lot more on predictions from now on.
This is because Machine Learning, in practice, is all about making predictions.
Machine Learning (ML) is a subfield of Computer Science and Artificial Intelligence (AI) that focuses on the design and development of algorithms that can learn from data.
INPUT (data)
⬇️
ALGORITHM
⬇️
OUTPUT (prediction)
(🗓️ Week 07)
(🗓️ Week 08)
If we assume there is a way to map between X and Y, we could use SUPERVISED LEARNING to learn this mapping.
Suppose you want to be able to tell whether a patient has breast cancer or not.
How would you approach this problem?
Image source: Smithsonian Magazine
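About 30 features in total: 10 features are computed for each cell nucleus, and each is reported in 3 versions (mean, standard error, and worst value). The 10 features are: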
All of this information constitutes our input.
The data shown before actually corresponds to the Diagnostic Wisconsin Breast Cancer Database (Wolberg and Street 1995), which was made publicly available (in 1995!) on the UC Irvine Machine Learning Repository (also known as the UCI Machine Learning Repository).
The UCI repository is “a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms” (Markelle Kelly, n.d., “About Us” page).
After the break:
Consider a binary response:
\[ Y = \begin{cases} 0 \\ 1 \end{cases} \]
We model the probability that \(Y = 1\) using the logistic function (a.k.a. the sigmoid curve):
\[ \Pr(Y = 1 \mid X) = p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]
Source of illustration: TIBCO
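To make the logistic function concrete, here is a minimal sketch that evaluates \(p(X)\) for a few inputs. The coefficients \(\beta_0 = -3\) and \(\beta_1 = 1.5\) are hypothetical and were not estimated from any data; the function name `logistic_probability` is likewise an assumption.

```python
import math


def logistic_probability(x, beta_0, beta_1):
    """Pr(Y = 1 | X = x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = beta_0 + beta_1 * x
    # The two branches are algebraically identical; choosing by the sign
    # of z keeps math.exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical coefficients, for illustration only.
beta_0, beta_1 = -3.0, 1.5

for x in (0.0, 2.0, 4.0):
    print(x, round(logistic_probability(x, beta_0, beta_1), 3))
# x = 2.0 gives exactly 0.5, because beta_0 + beta_1 * 2.0 = 0
```

Whatever the value of \(X\), the output stays strictly between 0 and 1, which is why this curve is suitable for modelling a probability.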
Say you have a dataset with two classes of points:
The goal is to find a line that separates the two classes of points:
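As a rough sketch of what "separating" means, the snippet below checks whether a candidate line \(w_1 x_1 + w_2 x_2 + b = 0\) puts every point of one class on one side and every point of the other class on the opposite side. The points, the candidate lines, and the function names are all made up for illustration; this only tests a given line and does not show how an algorithm would find one.

```python
def side_of_line(point, w1, w2, b):
    """Sign of w1*x1 + w2*x2 + b: +1 or -1 for either side, 0 if on the line."""
    x1, x2 = point
    value = w1 * x1 + w2 * x2 + b
    return (value > 0) - (value < 0)


def separates(class_a, class_b, w1, w2, b):
    """True if class_a is strictly on the positive side and class_b on the negative side."""
    return (all(side_of_line(p, w1, w2, b) > 0 for p in class_a)
            and all(side_of_line(p, w1, w2, b) < 0 for p in class_b))


# Made-up toy points, for illustration only.
class_a = [(3.0, 3.0), (4.0, 2.5), (3.5, 4.0)]
class_b = [(1.0, 1.0), (0.5, 2.0), (1.5, 0.5)]

print(separates(class_a, class_b, 1.0, 1.0, -4.0))  # True: line x1 + x2 = 4
print(separates(class_a, class_b, 1.0, 0.0, -3.2))  # False: (3.0, 3.0) falls on the wrong side of x1 = 3.2
```

In a logistic model, the points where the predicted probability equals 0.5 are exactly those where the linear part equals zero, so a boundary of this kind arises naturally.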
It can get more complicated than just a line:
Image from Stack Exchange | TeX
LSE DS101 2023/24 Autumn Term