DS202 Data Science for Social Scientists
9/30/22
“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.
Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),
and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”
knows everything about statistics
able to communicate insights perfectly
fully understands businesses like no one
is a fluent computer programmer
We are all jugglers 🤹
It is often said that 80% of the time and effort spent on a data science project goes to the tasks highlighted above.
This course is about Machine Learning. So, in most examples and tutorials, we will assume that we already have good quality data.
Image created with the DALL·E algorithm using the prompt: ‘35mm macro photography of a robot holding a question mark card, white background’
The next number is a function of the previous one:
\[ \text{next number} = f(\text{previous number}) \]
In general terms, we can represent it as:
\[ Y = f(X) \]
where \(X\) is the input (here, the previous number) and \(Y\) is the output (the next number).
Allowing for random error, we can represent it as:
\[ Y = f(X) + \epsilon \]
where \(\epsilon\) is a random error term.
In general terms, there are two main ways to learn from data:
Now let’s shift our attention to understanding:
Let’s go back to our example:
Our simple sequence:
\(6, 9, 12, 15, 18, 21, 24\)
Becomes:
\(X\) | \(Y\) |
---|---|
6 | 9 |
9 | 12 |
12 | 15 |
15 | 18 |
18 | 21 |
21 | 24 |
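The reshaping step above can be sketched in a few lines of Python. This is only an illustration, not the code behind the slides (whose output comes from R); the names `sequence`, `X`, `Y` and `pairs` are chosen here for convenience.

```python
# Turn a sequence into (X, Y) training pairs:
# X is a number in the sequence, Y is the number that follows it.
sequence = [6, 9, 12, 15, 18, 21, 24]

X = sequence[:-1]  # every value except the last
Y = sequence[1:]   # every value except the first

pairs = list(zip(X, Y))
for x, y in pairs:
    print(x, y)
```

The last value, 24, has no successor in the sequence, so it only ever appears as a \(Y\).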
And for prediction:
\(X\) | \(\hat{Y}\) |
---|---|
24 | ? |
We present the \(X\) values and ask the fitted model to give us \(\hat{Y}\).
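As a rough sketch of what "fitting" and then "predicting" could look like, the Python below fits a straight line by ordinary least squares. The choice of a linear model is an assumption made for illustration (the slides do not say which model is fitted), and `fit_line` and `predict` are hypothetical helper names.

```python
# Fit a straight line Y = intercept + slope * X by ordinary least squares.
# Assumption: a linear model; the slides do not name the model being fitted.
X = [6, 9, 12, 15, 18, 21]
Y = [9, 12, 15, 18, 21, 24]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(X, Y)


def predict(x):
    """Hypothetical helper: the fitted line's estimate of Y for a new X."""
    return intercept + slope * x


y_hat = predict(24)
print(y_hat)
```

On this noise-free data the fitted line recovers the "add 3" pattern, so the estimate for \(X = 24\) is 27.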
Let’s create a dataframe to illustrate the process of training an algorithm:
# A tibble: 6 × 2
      X     Y
  <int> <int>
1     6     9
2     9    12
3    12    15
4    15    18
5    18    21
6    21    24
Let’s simulate the introduction of some random error:
# A tibble: 6 × 3
      X     Y  obsY
  <int> <int> <dbl>
1     6     9  11.3
2     9    12  12.1
3    12    15  15.2
4    15    18  17.6
5    18    21  21.2
6    21    24  23.5
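One way to simulate this kind of noise is sketched below in Python. The distribution of \(\epsilon\) is an assumption (normal, mean 0, standard deviation 1), as is the seed, so the numbers it prints will not match the table above.

```python
import random

# Assumption: epsilon is drawn from a normal distribution with mean 0 and
# standard deviation 1; the slides do not state which distribution was used.
random.seed(202)  # arbitrary seed so that repeated runs give the same values

X = [6, 9, 12, 15, 18, 21]
Y = [x + 3 for x in X]  # the noise-free values

epsilon = [random.gauss(0, 1) for _ in Y]
obsY = [y + e for y, e in zip(Y, epsilon)]  # observed = true value + random error

for x, y, obs in zip(X, Y, obsY):
    print(x, y, round(obs, 1))
```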
How much error was introduced by \(\epsilon\) per sample?
# A tibble: 6 × 5
      X     Y  obsY   error absError
  <int> <int> <dbl>   <dbl>    <dbl>
1     6     9  11.3 -2.31     2.31
2     9    12  12.1 -0.0662   0.0662
3    12    15  15.2 -0.162    0.162
4    15    18  17.6  0.411    0.411
5    18    21  21.2 -0.197    0.197
6    21    24  23.5  0.513    0.513
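The two new columns can be reproduced approximately with the Python sketch below. The sign convention (`error = Y - obsY`) is inferred from the table rather than stated in the slides, and because `obsY` is displayed rounded, the results agree with the table only to about one decimal place.

```python
# Values copied from the table above (obsY is shown rounded to 3 significant digits).
Y = [9, 12, 15, 18, 21, 24]
obsY = [11.3, 12.1, 15.2, 17.6, 21.2, 23.5]

# Sign convention inferred from the table: error = Y - obsY.
error = [y - obs for y, obs in zip(Y, obsY)]
absError = [abs(e) for e in error]

for e, a in zip(error, absError):
    print(round(e, 2), round(a, 2))
```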
Averaging the absolute errors over all samples gives a single summary measure, called the Mean Absolute Error (MAE).
This is what we computed:
\[ \operatorname{MAE} = \frac{\sum_{i=1}^n{|(y_i + \epsilon) - y_i|}}{n} \]
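As a small worked example, the Python sketch below averages the `absError` values shown in the table. The function name `mean_absolute_error` is chosen here for illustration and is not taken from the slides.

```python
# Absolute errors copied from the absError column of the table above.
abs_errors = [2.31, 0.0662, 0.162, 0.411, 0.197, 0.513]


def mean_absolute_error(abs_errs):
    """Average of the absolute errors: sum(|error_i|) / n."""
    return sum(abs_errs) / len(abs_errs)


mae = mean_absolute_error(abs_errors)
print(round(mae, 2))
```

With the displayed values the MAE comes to roughly 0.61, most of which is due to the first sample.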