DS101 – Fundamentals of Data Science
10 Nov 2025
We also saw how linear regression models are normally represented mathematically:
The generic supervised model:
\[ Y = \operatorname{f}(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, \dots, X_p\).
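Here is a minimal sketch of fitting such a model in Python with scikit-learn. The data is simulated purely for illustration (the "true" coefficients 2.0 and 0.5 are made up):

```python
# A minimal sketch: fitting Y = beta_0 + beta_1 X + epsilon on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))             # a single predictor
y = 2.0 + 0.5 * X[:, 0] + rng.normal(0, 1, 100)   # true beta_0 = 2.0, beta_1 = 0.5, plus noise

lin_reg = LinearRegression().fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)          # estimated beta_0 and beta_1
```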
Note
The typical linear model assumes that the relationship between the predictors and the outcome is linear and additive.
Important
Hardly any real-world process is truly linear.
We often want to use a model to make predictions.
Linear regression is powerful — but it’s not magic.
Sometimes, the world is more complicated than a single straight line.
Linear regression works well for continuous numeric outcomes — like price, height, or temperature. But what happens when the target is binary (Yes/No)?
Note
Linear regression assumes continuous change — but here, the outcome is either 0 or 1 (benign or malignant).
Trying to “fit a line” between two categories can produce impossible values: linear regression doesn’t know that probabilities must lie between 0 and 1. Here, small tumors get predicted probabilities below 0 and large ones above 1.
We’ll need another model type for this later — logistic regression.

Now let’s look at a real dataset: wage and age from the ISLR book.
Note
Real data shows diminishing returns — wages rise rapidly early in careers, then plateau and sometimes decline slightly.
Linear regression assumes a constant slope: every extra year of age increases wage by the same amount. It draws a single straight line — it misses the curve.
A straight line is too simple for curved, real-world data; that’s why we later turn to polynomial regression and other flexible methods.

Sometimes, even if variables look simple individually, their combination tells the real story.
Note
Linear regression assumes each variable contributes a fixed, independent effect. But here, the effect of gender changes with class.
A single “gender coefficient” can’t capture that. This is what we call an interaction — the effect of one variable depends on another.


Note
If there were no interaction, the lines would be parallel. But they’re not — the “female advantage” shrinks as class drops. Linear regression would try to average across both and get it wrong.
Linear regression struggles when:
| Situation | Why it fails |
|---|---|
| Binary outcomes | Predictions can go below 0 or above 1 |
| Non-linear patterns | A single straight line can’t follow curves |
| Interactions | One “average slope” can’t fit all groups |
Linear regression is like a ruler — powerful, simple, but only straight. The real world isn’t always linear.
Note
We need more flexible tools for prediction. Enter Machine Learning.
Machine Learning (ML) is a subfield of Computer Science and Artificial Intelligence (AI) that focuses on the design and development of algorithms that can learn from data.
INPUT (data)
⬇️
ALGORITHM
⬇️
OUTPUT (prediction)

(🗓️ Week 07 - today: supervised learning)
(🗓️ Week 08: unsupervised learning)
If we assume there is a way to map between X and Y, we could use SUPERVISED LEARNING to learn this mapping.
Suppose you’re a record executive at a major label. Two artists pitch you new songs. You can only invest your marketing budget in one.
Image source: Unsplash
How would you decide which song will be a hit?
If we try to predict whether a song will chart on the Billboard Hot-100, we could look at audio features such as danceability, energy, loudness, and tempo.
All of this information constitutes our input (features extracted from Spotify’s API). You can find the dataset on Kaggle.


| index | track | artist | uri | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | chorus_hit | sections | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wild Things | Alessia Cara | spotify:track:2ZyuwVvV6Z3XJaXIFbspeE | 0.741 | 0.626 | 1 | -4.826 | 0 | 0.0886 | 0.02 | 0.0 | 0.0828 | 0.706 | 108.029 | 188493 | 4 | 41.18681 | 10 | 1 |
| 1 | Surfboard | Esquivel! | spotify:track:61APOtq25SCMuK0V5w2Kgp | 0.447 | 0.247 | 5 | -14.661 | 0 | 0.0346 | 0.871 | 0.814 | 0.0946 | 0.25 | 155.489 | 176880 | 3 | 33.18083 | 9 | 0 |
| 2 | Love Someone | Lukas Graham | spotify:track:2JqnpexlO9dmvjUMCaLCLJ | 0.55 | 0.415 | 9 | -6.557 | 0 | 0.052 | 0.161 | 0.0 | 0.108 | 0.274 | 172.065 | 205463 | 4 | 44.89147 | 9 | 1 |
| 3 | Music To My Ears (feat. Tory Lanez) | Keys N Krates | spotify:track:0cjfLhk8WJ3etPTCseKXtk | 0.502 | 0.648 | 0 | -5.698 | 0 | 0.0527 | 0.00513 | 0.0 | 0.204 | 0.291 | 91.837 | 193043 | 4 | 29.52521 | 7 | 0 |
| 4 | Juju On That Beat (TZ Anthem) | Zay Hilfigerrr & Zayion McCall | spotify:track:1lItf5ZXJc1by9SbPeljFd | 0.807 | 0.887 | 1 | -3.892 | 1 | 0.275 | 0.00381 | 0.0 | 0.391 | 0.78 | 160.517 | 144244 | 4 | 24.99199 | 8 | 1 |
A beautiful song, but not built for the algorithmic dance floor.
Let’s say our response is binary — a song either hits 🎯 or flops 💀:
\[ Y = \begin{cases} 0 & \text{= Flop} \\ 1 & \text{= Hit} \end{cases} \]
We want to model how the probability of a hit changes with some feature (e.g., energy, danceability).
→ Instead of predicting 0 or 1 directly,
we predict a probability between 0 and 1 using the logistic (sigmoid) function:
\[ P(Y = 1 \mid X) = p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]
🌀 As \(X\) increases, the curve smoothly transitions from near 0 (flop)
to near 1 (hit) — perfect for probabilities!
Source of illustration: TIBCO
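A quick way to see that smooth transition numerically, with illustrative (not fitted) coefficient values:

```python
# The logistic (sigmoid) function maps any real number into (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # equivalent to e^z / (1 + e^z)

beta_0, beta_1 = -3.0, 6.0            # illustrative values, not fitted coefficients
for x in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"X = {x:.2f} -> P(hit) = {sigmoid(beta_0 + beta_1 * x):.3f}")
```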
For one feature, we had:
\[ P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]
With multiple predictors (e.g. danceability, energy, tempo…),
we simply add more terms inside the exponent:
\[ P(Y = 1 \mid X_1, X_2, ..., X_p) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}} {1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}} \]
🧠 Think of it as: the same linear combination as before, squashed through the sigmoid into the (0, 1) range.
🎯 Still interpretable: each coefficient \(\beta_j\) tells us how that feature shifts the log-odds of a hit.
Example: Splitting the Data
| Dataset | Portion | Purpose |
|---|---|---|
| 🎓 Training set | 70% of songs | Used to teach the model the relationship between features (danceability, energy, etc.) and outcomes (hit/flop). |
| 🧪 Test set | 30% of songs | Used after training to evaluate how well the model generalizes to unseen songs. |
🧠 Think of it like studying vs. taking the exam — the model “studies” on the training set and gets “tested” on the test set.
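In code, the split and a first model fit might look like the sketch below. It assumes the Kaggle dataset has been loaded into a pandas DataFrame with the columns shown earlier; the file name spotify_hits.csv and the feature subset are hypothetical:

```python
# Sketch: 70/30 train/test split plus a logistic regression fit.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("spotify_hits.csv")   # hypothetical file name

features = ["danceability", "energy", "loudness", "tempo", "acousticness"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["target"], test_size=0.30, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```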
Some key definitions:
We have two classes: Hit (positive class - what we’re interested in) and Flop (negative class)
Based on this, we can have four outcomes: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Question for the class: Which is worse for a record label?
Discussion point: Both are bad in different ways! False positives waste money, false negatives lose opportunity.
Accuracy: Overall, how often is the model correct? \[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
However, is 99% accuracy always good? No! That’s the “accuracy paradox”: with imbalanced classes, a model that always predicts the majority class can score very high accuracy while being useless.
Precision: Of songs we predicted would be hits, how many actually were? \[\text{Precision} = \frac{TP}{TP + FP}\]
Recall: Of actual hits, how many did we catch? \[\text{Recall} = \frac{TP}{TP + FN}\]
F1-Score: Harmonic mean of precision and recall (balanced measure) \[\text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]
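To make the formulas concrete, here they are computed from the confusion-matrix counts reported later in this lecture (TP = 855, TN = 683, FP = 277, FN = 105):

```python
# The four metrics, computed by hand from confusion-matrix counts.
TP, TN, FP, FN = 855, 683, 277, 105

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")    # 0.80
print(f"Precision: {precision:.2f}")   # 0.76
print(f"Recall:    {recall:.2f}")      # 0.89
print(f"F1-score:  {f1:.2f}")          # 0.82
```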
Let’s check our Spotify dataset:
---------------------------
Target value: Hit
Number of songs: 3,199
Proportion: 0.50
---------------------------
Target value: Flop
Number of songs: 3,199
Proportion: 0.50
---------------------------

Note
Good news! Our dataset is balanced (50-50 split). This means accuracy is actually a reasonable metric to use, though we’ll still look at precision, recall, and F1-score for a complete picture.
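A balance check like the one above takes two lines with pandas, reusing the df loaded earlier:

```python
# Class balance: counts and proportions of hits vs. flops.
print(df["target"].value_counts())                 # raw counts per class
print(df["target"].value_counts(normalize=True))   # proportions (here 0.50 / 0.50)
```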
Model Performance:
Classification Report:
precision recall f1-score support
Flop 0.87 0.71 0.78 960
Hit 0.76 0.89 0.82 960
accuracy 0.80 1920
macro avg 0.81 0.80 0.80 1920
weighted avg 0.81 0.80 0.80 1920
Interpretation:
✅ Model correctly classifies songs about 80% of the time
🎵 Better at catching hits (recall = 0.89) than flops (recall = 0.71)
⚖️ Precision trade-off: flop predictions are more trustworthy (precision 0.87) than hit predictions (precision 0.76)
🧠 Generous model: it over-predicts hits and under-predicts flops; it would rather risk promoting a flop than miss a real hit
💬 Would you trust this to spend millions in marketing?

How to read this:
| Term | Meaning | Spotify context |
|---|---|---|
| ✅ True Positives (855) | Predicted Hit → Actually Hit | Correctly spotted popular songs |
| ✅ True Negatives (683) | Predicted Flop → Actually Flop | Correctly dismissed weak songs |
| ⚠️ False Positives (277) | Predicted Hit → Actually Flop | 💸 Wasted promo budget on bad calls |
| ⚠️ False Negatives (105) | Predicted Flop → Actually Hit | 🎯 Missed breakout songs |
Takeaway:
| Prediction → | Flop | Hit |
|---|---|---|
| Actual: Flop | ✅ True Negative → No wasted budget | ⚠️ False Positive → 💸 Wasted marketing spend |
| Actual: Hit | ⚠️ False Negative → 🎯 Missed opportunity | ✅ True Positive → 💰 Promoted real hits |
Key question:
Which mistake costs more for a record label — spending money on a flop, or missing a song that could go viral?
Context matters:
Different stakeholders care about different errors.
Business context determines how we tune the model’s decision threshold.
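One way such tuning could look in practice, reusing the fitted model and test set from earlier (the 0.7 and 0.3 thresholds are illustrative):

```python
# Sketch: shifting the decision threshold to match business priorities.
p_hit = model.predict_proba(X_test)[:, 1]   # P(hit) for each test song

# By default, "hit" means P(hit) > 0.5.
# A label worried about wasted promo budget (false positives)
# can demand stronger evidence before calling a song a hit:
cautious = (p_hit > 0.7).astype(int)

# A label terrified of missing the next viral song (false negatives)
# can lower the bar instead:
eager = (p_hit > 0.3).astype(int)

print("cautious: fraction predicted as hits =", cautious.mean())
print("eager:    fraction predicted as hits =", eager.mean())
```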

Tip
Understanding the chart:
Each bar shows how a musical feature influences the model’s prediction of whether a song becomes a Hit or Flop.
Green bars → features that make a song more likely to be a hit.
Red bars → features that make a song less likely to be a hit.
The direction of the bar shows how a feature affects the outcome (helps vs. hurts), while the length shows how strongly it matters.
The model treats these effects as independent — like separate volume knobs. That’s a simplification: it can’t capture subtle combinations (e.g., quiet and emotional songs that still succeed).
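A chart like this could be built from the fitted model’s coefficients, as sketched below; note that coefficient sizes are only directly comparable if the features were standardized first.

```python
# Sketch: inspecting which features push predictions toward "hit" or "flop".
import pandas as pd

coefs = pd.Series(model.coef_[0], index=features).sort_values()
for name, value in coefs.items():
    direction = "more likely a hit" if value > 0 else "less likely a hit"
    print(f"{name:>15}: {value:+.3f}  ({direction})")
```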
💬 Discussion prompt:
Can you think of a real song that breaks this pattern? Why might the model misclassify it as a “flop”?
| Song | Why it breaks the pattern |
|---|---|
| Adele – “Someone Like You” (2011) | Low energy and acoustic, yet emotionally powerful global hit. |
| Billie Eilish – “When the Party’s Over” (2018) | Very low loudness and valence — minimalist and dark but hugely successful. |
| Bon Iver – “Holocene” (2011) | Quiet, slow, and folk-like — critical success despite low “hit” features. |
| Lewis Capaldi – “Someone You Loved” (2019) | Piano ballad with little energy or danceability, yet chart-topping. |

After the break:
Decision trees make predictions by asking a sequence of yes/no questions about song features — similar to how a person might reason through a choice.
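A minimal sketch of such a tree on the earlier split; max_depth is capped at 3 so the printed questions stay readable:

```python
# Sketch: a small decision tree whose yes/no questions we can print as text.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(export_text(tree, feature_names=features))   # the learned yes/no questions
```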
Tip
How to read this:
Each step narrows down possibilities until the model can make a confident decision.

Performance Metrics:
Classification Report:
precision recall f1-score support
Flop 0.86 0.70 0.77 960
Hit 0.75 0.89 0.81 960
accuracy 0.79 1920
macro avg 0.81 0.79 0.79 1920
weighted avg 0.81 0.79 0.79 1920
Confusion Matrix:

📊 Interpretation:
⚠️ Even if accuracy doesn’t improve, trees help us see relationships that a straight-line model can’t.
⚠️ Warning: Decision trees can easily overfit, i.e., memorize training patterns instead of learning general trends.
Tip
Why they’re useful
But they have limits
💬 In summary: Logistic Regression draws one straight line. Decision Trees draw many rectangles — flexible but choppy. (Spoiler: SVMs will soon draw smooth curves instead.)
One tree can be unstable or biased — so modern methods combine many trees for better results.
Tip
Examples: Random Forests (many trees trained on random subsets of the data, then averaged) and Gradient Boosting (trees built sequentially, each correcting the previous one’s mistakes).
📈 Intuition:
A single tree = one person’s opinion. A forest = a team consensus — more stable and accurate.
🧠 Just remember:
“More trees → more reliable predictions.”
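A minimal sketch of that “team consensus”, reusing the split and the single tree from before:

```python
# Sketch: a random forest (many trees, majority vote) vs. the single tree.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=300, random_state=42)
forest.fit(X_train, y_train)

print(f"single tree test accuracy: {tree.score(X_test, y_test):.2f}")
print(f"forest test accuracy:      {forest.score(X_test, y_test):.2f}")
```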
Tip
1️⃣ Why trees might be a better choice
2️⃣ What songs confuse the tree
3️⃣ Pruning to simplify (see the sketch below)
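For exercise 3, a sketch of cost-complexity pruning with scikit-learn; the ccp_alpha values here are illustrative, not tuned:

```python
# Sketch: larger ccp_alpha prunes more aggressively, trading training fit
# for simpler, more general trees.
from sklearn.tree import DecisionTreeClassifier

for alpha in [0.0, 0.001, 0.01]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha}: {pruned.get_n_leaves()} leaves, "
          f"test accuracy {pruned.score(X_test, y_test):.2f}")
```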
Say you have a dataset with two classes of points. The goal is to find a line (or hyperplane) that best separates the two classes.
It can get more complicated than just a line: the “kernel trick” lets SVMs handle non-linear patterns.
Note

Idea in a nutshell: SVMs find the boundary that best separates the classes — not just any boundary, but the one with the widest margin between hits and flops.
Unlike logistic regression (a straight line) or trees (boxy splits), SVMs can bend the boundary smoothly.
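A minimal sketch using the RBF kernel on the earlier split; SVMs are sensitive to feature scales, so the features are standardized first:

```python
# Sketch: an SVM with the RBF "kernel trick", with standardized features.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print(f"SVM test accuracy: {svm.score(X_test, y_test):.2f}")
```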
Tip
How to think about it:

Performance Metrics:
Classification Report:
precision recall f1-score support
Flop 0.89 0.73 0.80 960
Hit 0.77 0.91 0.83 960
accuracy 0.82 1920
macro avg 0.83 0.82 0.82 1920
weighted avg 0.83 0.82 0.82 1920
Confusion Matrix:

📊 Interpretation:
💡 Conceptual takeaway: The model isn’t just smarter — it’s smoother. It balances flexibility and generalization without memorizing the data.
Tip
Why they help
But they have limits
💬 In summary:
Logistic Regression → straight line
Decision Trees → boxy regions
SVM → smooth, flexible curves
Tip
1️⃣ Why SVMs help
2️⃣ Why simpler models can still win
3️⃣ What confuses SVMs
Neural networks are inspired by how the human brain processes information — they consist of layers of interconnected “neurons.”

Image from Stack Exchange | TeX
Note
You could compare each layer to a filter pipeline:
“Layer 1 extracts beats, layer 2 recognizes rhythm patterns, layer 3 recognizes the song’s mood.” Each layer’s function refines the previous one — like stacking musical effects.
The “deep” in deep learning simply refers to many layers of learned transformations — not infinite ones.
A shallow network has just one or two hidden layers.
A deep network stacks many hidden layers — sometimes 10, 50, or hundreds, depending on the task and data size.
Each additional layer lets the model learn a new level of abstraction, as in the beats → rhythm → mood pipeline above.
These layers are composed functions — \[f(x) = f_L(f_{L-1}(\dots f_1(x) \dots))\] where each \(f_i\) transforms data one step further from raw input toward understanding.
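A toy forward pass makes the composition concrete. The weights below are random, purely to show the shape of \(f_3(f_2(f_1(x)))\); nothing is being trained:

```python
# Toy illustration of composed layers: each one is a linear map + non-linearity.
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    return np.maximum(0, W @ x + b)   # ReLU non-linearity

x  = rng.normal(size=5)                                        # raw input features
h1 = layer(x,  rng.normal(size=(8, 5)), rng.normal(size=8))    # f_1
h2 = layer(h1, rng.normal(size=(8, 8)), rng.normal(size=8))    # f_2
y  = rng.normal(size=(1, 8)) @ h2                              # f_3: output layer
print(y)
```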
Warning
Trade-off: Interpretability
As we add more layers, the network can learn richer representations — but it becomes harder to see what each layer has learned or why it produced a given output.
Note
“Deep doesn’t mean magical — it means stacked.” Each layer learns from the previous one, like a factory assembly line turning raw materials into finished products.
Deep networks are great at recognizing what is present, but not always how different parts relate.
💡 To capture meaning, we need models that can look across the entire input and decide which parts matter most right now.
👉 Enter the Attention Mechanism.

The Transformer architecture introduced an attention mechanism (Vaswani et al. 2017) — it lets the model focus on relevant context within the input.
Attention dynamically computes a new function: \[y = f(\text{input}, \text{context})\] where context is derived from relationships among all inputs.
Example: when reading a sentence, the meaning of each word depends on the words around it; attention lets the model weigh that surrounding context.
Transformers stack layers of attention, combining local understanding with global context — forming deep, context-aware representations.
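The core computation is small enough to sketch in a few lines: scaled dot-product self-attention, where each position mixes information from all others, weighted by relevance (toy random data, a single head, no learned projections):

```python
# Toy scaled dot-product self-attention (the building block of Transformers).
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise relevance
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the context
    return weights @ V                               # weighted mix of values

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))      # 4 tokens, each an 8-dimensional vector
out = attention(X, X, X)         # self-attention: context comes from the input itself
print(out.shape)                 # (4, 8): each token is now context-aware
```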
This architecture revolutionized modern AI: powering ChatGPT, BERT, Gemini, Claude, and others.
Note
“When reading a paragraph, do we process words one by one, or in context?” Attention allows the model to look around — to weigh which parts of the input provide the most useful information.
| Model Type | What it Learns | Depth | Concept | Strengths | Limitations |
|---|---|---|---|---|---|
| Logistic Regression | Linear function | 1 layer | Single transformation | Simple, interpretable | Misses non-linearities |
| Decision Tree | Piecewise rules | – | Conditional logic | Transparent | Overfits easily |
| SVM | Smooth separating function | – | Curved boundary | Flexible | Harder to explain |
| Neural Network | Composed functions | Few–dozens | Layered transformations | Very flexible | “Black box” |
| Transformer | Contextual functions | Many (dozens–hundreds) | Attention-based layers | Understands relationships | Complex, data-hungry |
Tip
Each step adds more depth and context: from a single transformation → layered functions → context-aware reasoning.
Depth = hierarchy of functions, Attention = awareness of relationships.
Our Spotify models achieved around 80% accuracy, catching up to 91% of true hits. But why did they perform so well?
Abundant historical data — 60 years of Billboard chart data (1960–2019). Thousands of examples to learn from.
Stable patterns — musical taste evolves slowly.
Reliable target — the Billboard Hot-100 offers consistent, objective labeling.
Clear feedback — we know which songs succeeded and which didn’t.
Important
But what if…
➡️ That’s when even the smartest models can fail dramatically.
Note
“Supervised models are only as good as their history. They don’t discover truth — they learn what has been true so far.”
Preview of tomorrow’s class
Tomorrow we’ll see what happens when data doesn’t follow the rules — using the case of COVID-19 prediction failures.
Result: Many “good” models failed catastrophically. We’ll explore why — and what this teaches us about trust, uncertainty, and the limits of data-driven prediction.
Note
“Spotify worked because history repeated itself. You’ll see tomorrow that COVID broke our models because the world changed faster than the data.”
Tip
Takeaway: So far, we’ve taught machines to predict outcomes when examples exist. Next, we’ll teach them to find structure — even when no answers are given.

LSE DS101 2025/26 Autumn Term