DS101 – Fundamentals of Data Science
14 Oct 2024
Tuesday 15th October 5pm DSI Visualisation Studio, Col.1.06
Form to sign in here
The mental skills and practices related to the following aspects of computing:
1️⃣ Designing computations that get computers to do the jobs for us
2️⃣ Explaining and interpreting the world as a complex information process
We usually have to think like computers to design the proper computations for the job.
We can use computers to create models of the world.
Let us look at a simple, old example of a computational model for urban segregation (McCown 2014) ➡️
Today our focus is on the first aspect:
1️⃣ Designing computations that get computers to do the jobs for us
2️⃣ Explaining and interpreting the world as a complex information process
Recipes for computers
Home temperature control
How would a simple home heating system work?
algorithm home_temperature_control is:
input:
Heating system state s;
Temperature reading frequency f;
Upper temperature threshold Tupper;
Lower temperature threshrold Tlower;
while (s=='active') do
current_temperature=read(temperature every f seconds)
if current_temperature < Tlower:
turn heating on
else if current_temperature > Tupper:
turn heating off
else if current_temperature >= Tlower and current_temperature <= Tupper:
maintain current heating state
Problem Definition:
Whenever I receive a list of numbers, I want to ensure this list is ordered.
Input:
[10, 20, 42, 29, 50]
Output:
[10, 20, 29, 42, 50]
Computation:
How would you solve it❓
Group up with the people of your respective tables and try to come up with a recipe that solves the problem, regardless of the list size.
Test that your recipe works for the following list:
[237, 153, 311, 33, 854, 212, 368, 42, 892, 755]
After the break:
Graph:
🎯 How do we get the shortest path between A and all the other nodes in this graph (i.e between A and B, A and D, A and G, etc…?
Pair with the people in your respective tables and discuss.
Step 1: Mark the source node with a current distance of 0 and the rest with infinity
Step 2: Set the non-visited node with the smallest current distance, as the current node, let’s say c
(initially c
is set to the source node).
Step 3: For each neighbour N
of the current node c
: add the current distance of c
with the weight of the edge c-N
. If it is smaller than the current distance of N
, set it as the new current distance of N
.
Step 4: Mark c
as visited.
Step 5: Go to Step 2 if there are any unvisited nodes.
Source node: A
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
Distance | 0 | \(\infty\) | \(\infty\) | \(\infty\) | \(\infty\) | \(\infty\) | \(\infty\) |
Unvisited nodes: {A,B,C,D,E,F,G}
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
Distance | 0 | 2 | 6 | \(\infty\) | \(\infty\) | \(\infty\) | \(\infty\) |
Closest node to source node (based on current distances): B
nodes on path: {A, B}
Unvisited nodes: {A,B,C,D,E,F,G}
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
Distance | 0 | 2 | 6 | 7 | \(\infty\) | \(\infty\) | \(\infty\) |
Unvisited nodes: {A,B,C,D,E,F,G}
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
Distance | 0 | 2 | 6 | 7 (via B not C 14) | \(\infty\) | \(\infty\) | \(\infty\) |
Unvisited nodes: {A,B,C,D,E,F,G}
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
Distance | 0 | 2 | 6 | 7 (via B not C 14) | 17 (A–B–D–E) | 22 (A–B–D–F) | \(\infty\) |
Unvisited nodes: {A,B,C,D,E,F,G}
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
Distance | 0 | 2 | 6 | 7 (via B not C 14) | 17 (A–B–D–F) | 22 (A–B–D–E) | 19 (A–B–D–F–G) |
For F: distance unchanged since second option (A–B–D–F–E) more costly (23 instead of 22)
Unvisited nodes: {A,B,C,D,E,F,G}
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
Distance | 0 | 2 | 6 | 7 (via B not C 14) | 17 (A–B–D–E) | 22 (A–B–D–F) | 19 (A–B–D–F–G) |
Possibilities: 22 (A–B–D–F), 23 (A–B–D–E–F), 25 (A–B–D–E–G–F)
Previous distance from (A–B–D–F) unchanged since lowest
No more unvisited nodes
Final shortest distance calculation from A to other nodes
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
Distance | 0 | 2 | 6 | 7 | 17 | 22 | 19 |
What could be potential meanings of our example graph?
What could be potential meanings of our example graph?
Say we have this table that represents ratings that some users gave to some movies currently airing in the UK:
💭 How does this table translate to a graph?
The paths per user are:
U1: Beetlejuice Beetlejuice:4.5 \(->\) Speak No Evil:2 \(->\) The Critic:3.5
U2: Beetlejuice Beetlejuice:2.5 \(->\) Transformers One:3 \(->\) Speak No Evil: 4 \(->\) The Critic:4
U3: Beetlejuice Beetlejuice:3 \(->\) The Critic:3.5
U4: Transformers One:5 \(->\) The Critic:4
U5: Beetlejuice Beetlejuice:4 \(->\) Transformers One:2 \(->\) Speak No Evil:3 \(->\) The Critic:2
Assuming the movies are the graph nodes, we have to somehow find a way to convert the user ratings into edge weights (or “costs”).
Supposing we denote as:
one way of calculating the cost would be the following:
\[ \begin{equation} L_{vv'}=\frac{1}{\sum_{i=1}^n\frac{w_{r_i}w_v'}{\exp{|r_{iv}-r{iv'}|}}} \end{equation} \]
Here is the graph we end up with!
Suppose you are working as an engineer at Uber and you need to find a way to match drivers with clients. How do you go about doing that?
A solution: the Hungarian algorithm!
Suppose you have a table that represents the “cost” of each job (i.e picking up a client) per driver:
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | 10 | 15 | 9 |
Driver B | 9 | 18 | 5 |
Driver C | 6 | 14 | 3 |
Step 1 — Subtract the minimum from every element of each row
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | 10 | 15 | 9 |
Driver B | 9 | 18 | 5 |
Driver C | 6 | 14 | 3 |
\(->\)
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | 1 | 6 | 0 |
Driver B | 4 | 13 | 0 |
Driver C | 3 | 11 | 0 |
Step 2 — Subtract the minimum of each column
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | 1 | 6 | 0 |
Driver B | 4 | 13 | 0 |
Driver C | 3 | 11 | 0 |
\(->\)
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | 0 | 0 | 0 |
Driver B | 3 | 7 | 0 |
Driver C | 2 | 5 | 0 |
Step 3 — Cross 0s with the minimum number of lines needed.
❌
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | |||
Driver B | |||
Driver C |
✅
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | |||
Driver B | 3 | 7 | |
Driver C | 2 | 5 |
If the number of lines is < \(n\), the number of rows and columns in the matrix, reduce the matrix again (i.e step 4). In our example, we have 2 lines, but we have a 3x3 matrix, so we’ll reduce again. If we didn’t have this case, we’d directly jump to step 5.
Step 4 — Find the smallest entry not covered by any line, and subtract this entry from the entire non-crossed matrix (and add it to the elements crossed twice).
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | |||
Driver B | 3 | 7 | |
Driver C | 2 | 5 |
\(->\)
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | |||
Driver B | 1 | 5 | |
Driver C | 0 | 3 |
then go back to step 3: cross the 0s!
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | |||
Driver B | 1 | 5 | |
Driver C |
This time, we have three lines so we can go to step 5.
Step 5 — Assign Jobs and Workers starting with the line with only one zero!
Job 1 | Job 2 | Job 3 | |
---|---|---|---|
Driver A | 0 | 0 | 2 |
Driver B | 1 | 5 | 0 |
Driver C | 0 | 3 | 0 |
LSE DS101 2024/25 Autumn Term