DS205: Advanced Data Manipulation
10 Mar 2025
Introduction & Outline
Biological inspiration
1. Anatomy & Physiology
Computation & learning computations
2. Functional mapping
3. Prediction
4. Matrix multiplication
5. Errors - landscape
6. Gradients
7. Training lifecycle
Deep Learning
8. Multi-Layered Perceptrons: Stacking layers - abstraction
Code
Data
Consequences
Questions
Linear: A straight line is a model
\[ \color{#a8f} {y = wx + b} \]
\(w\) is the slope \(\rightarrow\) having just one of these means the model is “one-dimensional”
\(b\) is the bias (offset) \(\rightarrow\) an ever-present contribution to the output, no matter what the input is
High-dimensional “spaces” – a pillar of neural computation
\[ \color{#a8f} {y = w_{1}x_{1} + w_{2}x_{2} + \dots + b} \]
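A minimal sketch of this high-dimensional linear model in code (all numbers below are invented for illustration): the weighted sum of the inputs plus the bias is simply a dot product.
import numpy as np

# y = w1*x1 + w2*x2 + ... + b  -- a linear model with several inputs
w = np.array([0.5, -1.2, 2.0])   # illustrative weights (invented values)
x = np.array([3.0, 1.0, 0.5])    # one input point with three features
b = 0.1                          # bias / offset
y = np.dot(w, x) + b             # weighted sum of the inputs, plus the bias
print(y)                         # 0.5*3.0 - 1.2*1.0 + 2.0*0.5 + 0.1 = 1.4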
MLP: 2 modes of prediction …
NLP: language (sequence) modelling – “semi-supervised learning”
Networks of neurones become matrices – used for matrix \(\times\) vector multiplication
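A rough sketch of that idea (weights and inputs invented): a layer of three neurones taking two inputs becomes a 3 \(\times\) 2 weight matrix, and computing every neurone’s output at once is a single matrix \(\times\) vector multiplication.
import numpy as np

W = np.array([[ 0.2, -0.5],
              [ 1.0,  0.3],
              [-0.7,  0.8]])      # 3 neurones x 2 inputs (invented weights)
b = np.array([0.1, 0.0, -0.2])    # one bias per neurone
x = np.array([2.0, 1.0])          # a single input vector
y = W @ x + b                     # all three outputs in one matrix x vector product
print(y)                          # approximately [0.0, 2.3, -0.8]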
Any relationship that isn’t linear!
The whole modelling paradigm becomes much more powerful when the relationships we model are no longer described by simple linear rules
The non-linearity is the essential catalyst and all-important ingredient!
\[ \color{#a8f} {y = \sin(x)} \]
\[ \color{#a8f} {y = w_{1} \cdot \exp(x_{1}) + w_{2} \cdot \cos(x_{2})} \]
\[ \color{#a8f} {y = x^{2} + x^{3} + x^{4}} \]
\[ \color{#a8f} {\vdots} \]
\[ \color{#a8f} {\text{everything else!}} \]
This can render analysis quite intractable
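As a small illustration (weights and inputs invented), one of the non-linear models above is still easy to evaluate numerically – it is fitting and analysing such models that becomes hard.
import numpy as np

# y = w1*exp(x1) + w2*cos(x2)  -- one of the non-linear models above
w1, w2 = 0.5, 2.0                 # invented weights
x1, x2 = 1.0, np.pi               # invented inputs
y = w1 * np.exp(x1) + w2 * np.cos(x2)
print(y)                          # 0.5*e + 2*cos(pi), approx. -0.641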
“Models” are “blocks of numbers”!
Diffuse representations a.k.a. parallel distributed network – a.k.a. “connectionist”
Research in this field is all about finding “good numbers” that perform well at “things we care about”
Loss function – a.k.a. “cost function” or “error”: a “distance measure” of our “gap”
A cost function can be pictured as a smooth, curved line across a graph. For any point on this line, the height above zero represents “cost” in some way – e.g. how much “energy” the model incurs when tuned to a specific setting. The higher the point, the farther the system is from the ideal (of 0). We want to find the lowest point on this line, where the cost is smallest.
Common loss functions include the mean squared error (MSE) for regression – used in the code example below – and the cross-entropy loss for classification.
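A minimal sketch of the “gap” idea with mean squared error (the numbers are invented): the loss collapses all the prediction errors into a single number.
import numpy as np

y_true = np.array([2.0, 4.0, 6.0])     # targets (invented for illustration)
y_pred = np.array([2.5, 3.5, 7.0])     # the model's predictions
mse = ((y_pred - y_true) ** 2).mean()  # mean squared error: the average squared gap
print(mse)                             # (0.25 + 0.25 + 1.0) / 3 = 0.5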
Gradient descent algorithms “feel their way” to the bottom of the curve by picking a point, calculating the slope (or gradient) of the curve around it, and then stepping in the direction in which the curve descends most steeply – i.e. against the gradient
- Imagine this as feeling your way down a mountain in the dark. You may not know exactly where to move, or how close to the valley floor you will get, but if, in general, you head down the slope in its steepest direction, you would hope to arrive at the lowest point in the area
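A bare-bones sketch of that procedure (the cost function, starting point and step size are invented for illustration): repeatedly step against the gradient of a simple bowl-shaped cost \(J(w) = (w - 3)^{2}\), and \(w\) settles near the minimum at \(w = 3\).
# minimise J(w) = (w - 3)**2 by gradient descent
w = 0.0                        # an arbitrary starting point
learning_rate = 0.1            # step size (invented value)
for step in range(50):
    grad = 2 * (w - 3)         # dJ/dw, the slope at the current point
    w -= learning_rate * grad  # step against the gradient, i.e. downhill
print(w)                       # very close to 3, the bottom of the curve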
If \(y = f(u)\) and \(u = g(x)\), then
\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \]
… the “outside function” is \(f()\) and the “inside function” is \(g()\)
The chain rule translates as: “the derivative of the ‘outside’” \(\times\) “the derivative of the ‘inside’”
Alternatively,
If \(y = f[g(x)]\)
Then \(y' = f'[g(x)] \cdot g'(x)\)
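A quick worked example (chosen here for illustration): if \(y = \sin(x^{2})\), the “outside” function is \(\sin(\cdot)\) and the “inside” is \(x^{2}\), so
\[ \frac{dy}{dx} = \cos(x^{2}) \cdot 2x \]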
Backpropagation is a gradient estimation method used to train neural network models. The gradient estimate is used by the optimization algorithm to compute the network parameter updates.
It is an efficient application of the chain rule (known since 1673) to such networks. It is also known as the reverse mode of automatic differentiation.
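To make that concrete, here is a tiny hand-rolled sketch (not the lecture’s code; all values invented) of reverse-mode differentiation for a single-neurone “network”: the forward pass composes simple functions, and the backward pass applies the chain rule to each of them in reverse order.
import numpy as np

x, t = 1.5, 0.5            # one input and its target (invented values)
w = 0.8                    # a single weight to train

# forward pass: compose simple functions
u = w * x                  # linear step
y = np.tanh(u)             # non-linear activation
loss = (y - t) ** 2        # squared error against the target

# backward pass: apply the chain rule to each step, in reverse order
dloss_dy = 2 * (y - t)               # d(loss)/dy
dy_du = 1 - np.tanh(u) ** 2          # d(tanh(u))/du
du_dw = x                            # d(w*x)/dw
dloss_dw = dloss_dy * dy_du * du_dw  # the gradient that backpropagation computes
print(dloss_dw)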
The phrase “Deep Learning” really applies to any design of neural network that makes use of many layers – today, numbering 10s, 100s or even 1000s of layers.
The explosion of interest in Deep Learning began with the success of AlexNet in the ImageNet machine vision competition (2012)
Our “latent space”, which encodes our “deeper information”, becomes more abstract as we penetrate deeper into the network
In the early history of neural computation, the computational power obtained by stacking multiple layers was not fully recognised, and in any case, few researchers had access to the scale of compute necessary to explore these advantages. Early neural networks used around 3-5 layers.
import numpy as np

# f = w * x -- here the true function is f = 2 * x, i.e. there is no bias term
X = np.array([1, 2, 3, 4], dtype=np.float32)
y = np.array([2, 4, 6, 8], dtype=np.float32)

w = 0.0

# model prediction
def forward(x):
    return w * x

# loss = MSE -- the standard choice for linear regression
def loss(y, y_predicted):
    return ((y_predicted - y)**2).mean()

# gradient
# MSE   = 1/N * (w*x - y)**2
# dJ/dw = 1/N * 2x * (w*x - y) -- the symbolically calculated derivative
def gradient(x, y, y_predicted):
    return (2 * x * (y_predicted - y)).mean()

print(f'Prediction before training: f(5) = {forward(5):.3f}')

# Training
learning_rate = 0.01
n_iters = 10

for epoch in range(n_iters):
    # prediction (forward pass)
    y_pred = forward(X)
    # loss
    l = loss(y, y_pred)
    # gradient
    dw = gradient(X, y, y_pred)
    # update the weight by stepping against the gradient
    w -= learning_rate * dw
    if epoch % 1 == 0:
        print(f'epoch {epoch+1}: w = {w:.3f}, loss = {l:.8f}')

print(f'Prediction after training: f(5) = {forward(5):.3f}')
Data is “anything we know how to record (and represent using a number)”
Anyone with records of anything can turn to machine learning
Modelling “the world” – the whole enterprise of neural network modelling aims to capture structure within data
Unstructured data
Intrinsically semantic data
Token sequences (and sets) – now we are interested in relations
The physical world
Dr. Jon Cardoso-Silva
🖥️ Live Demo
LSE DS205 (2024/25)