Theme: Supervised Learning
11 Oct 2024
Machine Learning (ML) is about using algorithms to capture patterns in data (think trends, correlations, associations, etc.) and use them to make predictions or to support decisions.
The next number can be represented as a function, \(f(\cdot)\), of the previous one:
\[ \text{next number} = f(\text{previous number}) \]
Or, let’s say, as a function of the position of the number in the sequence:
Position | Number |
---|---|
1 | 3 |
2 | 6 |
3 | 9 |
4 | 12 |
5 | 15 |
6 | 18 |
7 | 21 |
8 | 24 |
In equation form:
\[ \operatorname{Number} = f(\operatorname{Position}) \]
where
\[ f(x) = 3x \]
Typically, we use a tabular format like the one above to represent our data when doing ML.
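To make this concrete, here is a minimal R sketch of the rule we guessed from the table:

# our guessed rule: f(x) = 3x
f <- function(x) 3 * x

f(1:8) # reproduces the Number column: 3 6 9 12 15 18 21 24
f(9)   # and predicts the next number in the sequence: 27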
The goal of ML:
Find a function \(f\), in any suitable mathematical form, that can best predict how \(Y\) will vary given \(X\).
Is \(f(x) = 3x\) the only function that fits this sequence? Let’s find out:
If we visit the OEIS® page and paste our sequence \(3, 6, 9, 12, 15, 18, 21, 24, \dots\)
… we will get 19 different sequences that contain those same numbers!
So, do we ever know for sure which function generated our data? The sad truth: we don’t.
The statistician George Box famously wrote:
“All models are wrong, but some are useful.”
This example we just explored can be represented as:
\[ Y = f(X) + \epsilon \]
where:

- \(Y\) is the outcome, or response, variable (Number in our example)
- \(X\) is the predictor, or input, variable (Position in our example)
- \(\epsilon\) is an error term that captures everything the function \(f\) misses

Whenever you are modelling something that can be represented as the equation above, you are doing supervised learning.
You can have multiple columns of \(X\) values 👉
X1 | X2 | X3 | X4 | … | Y |
---|---|---|---|---|---|
1 | 2 | 3 | 10 | … | 3 |
2 | 4 | 6 | 20 | … | 6 |
3 | 6 | 9 | 30 | … | 9 |
4 | 8 | 12 | 40 | … | 12 |
5 | 10 | 15 | 50 | … | 15 |
If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.
You can still use ML to find patterns in the data, a process known as unsupervised learning.
These are, broadly speaking, the two main ways of learning from data: supervised and unsupervised learning.
Linear regression is a simple approach to supervised learning.
The generic supervised model:
\[ Y = f(X) + \epsilon \]
is defined more explicitly as follows ➡️
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
when we use a single predictor, \(X\).
\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]
when there are multiple predictors, \(X_1, X_2, \dots, X_p\).
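To see these models in action, here is a small simulation sketch (with made-up coefficients, not from the slides) showing that R’s lm() recovers the \(\beta\)s from noisy data:

set.seed(42)
x <- runif(100, 0, 10)                # a single predictor
y <- 2 + 0.5 * x + rnorm(100, sd = 1) # true beta0 = 2, beta1 = 0.5
coef(lm(y ~ x))                       # estimates land close to (2, 0.5)

# the multiple-predictor case works the same way
x2 <- runif(100, 0, 5)
y2 <- 1 + 0.5 * x - 2 * x2 + rnorm(100, sd = 1)
coef(lm(y2 ~ x + x2))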
We assume a model:
\[ Y = \beta_0 + \beta_1 X + \epsilon , \]
where \(\beta_0\) is the (unknown) intercept, \(\beta_1\) is the (unknown) slope, and \(\epsilon\) is a random error term.
We want to estimate:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
where \(\hat{y}\) is the predicted value of \(Y\), and \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are our estimates of the true coefficients \(\beta_0\) and \(\beta_1\).
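Once the coefficients are estimated, predictions at new values of \(x\) follow straight from this equation. A quick sketch, reusing the simulated data from above:

fit <- lm(y ~ x)
predict(fit, newdata = data.frame(x = c(0, 5, 10))) # yhat = b0_hat + b1_hat * x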
Suppose you have plotted your data as a scatterplot, and you suspect there is a linear relationship between \(X\) and \(Y\).
How would you go about fitting a line to it?
A line right through the “centre of gravity” of the cloud of data.
There are multiple ways to estimate the coefficients.
Residuals are the vertical distances from each data point to this line.
\(e_i = y_i - \hat{y}_i\) represents the \(i\)th residual: observed minus predicted.
From this, we can define the Residual Sum of Squares (RSS) as
\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]
or equivalently as
\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]
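As a sketch, RSS is easy to compute by hand for any candidate pair of coefficients (again using the simulated x and y from earlier):

rss <- function(b0, b1, x, y) sum((y - b0 - b1 * x)^2)

rss(2, 0.5, x, y) # RSS near the true coefficients
rss(0, 1, x, y)   # a worse candidate line yields a much larger RSS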
Note: The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.
We treat this as an optimisation problem. We want to minimize RSS:
\[ \begin{align} \min \mathrm{RSS} &= \sum_i^n{e_i^2} \\ &= \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ &= \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
To find \(\hat{\beta}_0\), we set the corresponding partial derivative to zero:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]
… which will lead you to:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]
where we made use of the sample means \(\bar{x} = \frac{1}{n}\sum_i^n{x_i}\) and \(\bar{y} = \frac{1}{n}\sum_i^n{y_i}\). Step by step:
\[ \begin{align} 0 &= \frac{\partial}{\partial \hat{\beta}_0}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} & \\ 0 &= \sum_i^n{-2 (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= -2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $-2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div (-2)) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]
\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]
Similarly, to find \(\hat{\beta}_1\) we solve:
\[ \frac{\partial ~\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1}\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]
… which will lead you to:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]
\[ \begin{align} 0 &= \frac{\partial}{\partial \hat{\beta}_1}{\sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}} & \\ 0 &= \sum_i^n{\left(-2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= -2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $-2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div (-2)) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]
\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{x_i\left(y_i - \bar{y}\right)} - \hat{\beta}_1\sum_i^n{x_i\left(x_i - \bar{x}\right)} & (\text{separate sums, took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n x_i(y_i - \bar{y})}{\sum_{i=1}^n x_i(x_i-\bar{x})} & (\text{isolate }\hat{\beta}_1) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & \left(\text{since } \textstyle\sum_i^n{(y_i-\bar{y})} = \sum_i^n{(x_i-\bar{x})} = 0\right) ~~~ \blacksquare \end{align} \]
And that is how OLS works!
\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]
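We can check these closed-form estimators numerically; the hand-computed values should agree with lm() (continuing with the simulated data from earlier):

b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

c(b0_hat, b1_hat)
coef(lm(y ~ x)) # identical, up to floating-point error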
ResponseId | EdLevel | DevType | RemoteWork | YearsCodePro | Gender | ConvertedCompYearly |
---|---|---|---|---|---|---|
3 | Graduate degree | Data scientist or machine learning specialist;Developer, front-end;Engineer, data;Engineer, site reliability | Hybrid (some remote, some in-person) | 5 | Man | 40205 |
11 | Bachelor’s degree | Developer, full-stack;Developer, back-end | Hybrid (some remote, some in-person) | 2 | Man | 60307 |
85 | Bachelor’s degree | Developer, full-stack | Fully remote | 7 | Man | 69102 |
109 | Bachelor’s degree | Developer, full-stack | Hybrid (some remote, some in-person) | 7 | Man | 75384 |
112 | Bachelor’s degree | Developer, front-end;Developer, full-stack;Developer, back-end | Hybrid (some remote, some in-person) | 10 | Man | 78525 |
The data comes from the Stack Overflow Developer Survey's `survey_results_public.csv` file. The code below reproduces the dataframe from the previous slide. You'll need the `tidyverse` library for all the code in the following slides!
library(tidyverse)
filtered_gender <- c("Man", "Woman", "Non-binary")
survey_results <- read_csv("survey_results_public.csv") %>% # change the filepath to the location of the survey file you've downloaded!!
dplyr::filter(
Country == "United Kingdom of Great Britain and Northern Ireland",
Employment == "Employed, full-time",
ConvertedCompYearly > 0,
ConvertedCompYearly < 2e6
)
## identify non-ICs, to remove
managers_ctos <- survey_results %>%
dplyr::filter(str_detect(DevType, "Engineering manager|Product manager|Senior executive/VP"))
## identify academics, to remove
academics <- survey_results %>%
dplyr::filter(str_detect(DevType, "Academic researcher|Scientist|Educator"))
results <- survey_results %>%
anti_join(managers_ctos) %>%
anti_join(academics) %>%
transmute(ResponseId,
EdLevel = fct_collapse(EdLevel,
`Less than bachelor's` = c(
"Primary/elementary school",
"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",
"Some college/university study without earning a degree",
"Associate degree (A.A., A.S., etc.)"
),
`Other` = "Something else",
`Bachelor's degree` = "Bachelor’s degree (B.A., B.S., B.Eng., etc.)",
`Graduate degree` = c(
"Other doctoral degree (Ph.D., Ed.D., etc.)",
"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",
"Professional degree (JD, MD, etc.)"
)
),
DevType,
RemoteWork,
YearsCodePro = parse_number(YearsCodePro),
Gender = case_when(
str_detect(Gender, "Non-binary") ~ "Non-binary",
TRUE ~ Gender
),
ConvertedCompYearly
) %>%
dplyr::filter(Gender %in% filtered_gender)
results %>%
  head(5) %>%
  knitr::kable() # alternatively, run view(results %>% head(5)) to inspect the dataframe interactively
Gender | Total respondents | Median annual salary for UK respondents |
---|---|---|
Woman | 103.00 | $62,820.00 |
Non-binary | 38.00 | $70,452.50 |
Man | 1,823.00 | $81,414.00 |
The code below reproduces the figure from the previous slide.
results %>%
ggplot(aes(ConvertedCompYearly, fill = Gender, color = Gender)) +
geom_density(alpha = 0.2, linewidth = 1.5) +
scale_x_log10(labels = scales::dollar_format()) +
labs(
x = "Annual salary (USD)",
y = "Density",
title = "Salary for respondents on the Stack Overflow Developer Survey",
subtitle = "Overall, in the UK, men earn more than women and non-binary developers"
)
And this code reproduces the table:
results %>%
group_by(Gender) %>%
summarise(
Total = n(),
Salary = median(ConvertedCompYearly)
) %>%
arrange(Salary) %>%
mutate(
Total = formattable::comma(Total),
Salary = scales::dollar(Salary)
) %>%
knitr::kable(
align = "lrr",
col.names = c("Gender", "Total respondents", "Median annual salary for UK respondents")
)
Use this code to produce the figures from the last two slides:
filtered_devtype <- c(
"Other", "Student",
"Marketing or sales professional"
)
survey_results_parsed <- results %>%
mutate(DevType = str_split(DevType, pattern = ";")) %>%
unnest(DevType) %>%
mutate(
DevType = case_when(
str_detect(str_to_lower(DevType), "data scientist") ~ "Data scientist",
str_detect(str_to_lower(DevType), "data or business") ~ "Data analyst",
str_detect(str_to_lower(DevType), "desktop") ~ "Desktop",
str_detect(str_to_lower(DevType), "embedded") ~ "Embedded",
str_detect(str_to_lower(DevType), "devops") ~ "DevOps",
str_detect(DevType, "Engineer, data") ~ "Data engineer",
str_detect(str_to_lower(DevType), "site reliability") ~ "DevOps",
TRUE ~ DevType
),
DevType = str_remove_all(DevType, "Developer, "),
DevType = str_to_sentence(DevType),
DevType = str_replace_all(DevType, "Qa", "QA"),
DevType = str_replace_all(DevType, "Sre", "SRE"),
DevType = str_replace_all(DevType, "Devops", "DevOps")
) %>%
dplyr::filter(
!DevType %in% filtered_devtype,
!is.na(DevType)
)
survey_results_parsed %>%
mutate(Gender = fct_infreq(Gender)) %>%
ggplot(aes(Gender, ConvertedCompYearly, color = Gender)) +
geom_boxplot(outlier.colour = NA) +
geom_jitter(aes(alpha = Gender), width = 0.15) +
facet_wrap(~DevType) +
scale_y_log10(labels = scales::dollar_format()) +
scale_alpha_discrete(range = c(0.04, 0.4)) +
coord_flip() +
theme(legend.position = "none") +
labs(
x = NULL, y = NULL,
title = "Salary and Gender in the 2022 Stack Overflow Developer Survey",
subtitle = "Annual salaries for UK developers"
)
survey_results_parsed %>%
mutate(Gender = fct_infreq(Gender)) %>%
group_by(Gender, DevType, RemoteWork) %>%
summarise(YearsCodePro = median(YearsCodePro, na.rm = TRUE)) %>%
ungroup() %>%
ggplot(aes(Gender, YearsCodePro, fill = RemoteWork)) +
geom_col(position = position_dodge(preserve = "single")) +
facet_wrap(~DevType) +
labs(
x = NULL,
y = "Median years of professional coding experience",
fill = "Remote Work?",
title = "Years of experience, remote work, and gender in the 2022 Stack Overflow Developer Survey",
subtitle = "Women are more likely to work remotely or in hybrid mode and are less experienced"
)
Let’s first do it the way you’re used to:
modeling_df <- survey_results_parsed %>%
drop_na(YearsCodePro) %>%
dplyr::filter(
ConvertedCompYearly < 1e7,
YearsCodePro < 60
) %>%
dplyr::filter(Gender %in% c("Man", "Woman")) %>%
select(-ResponseId) %>%
mutate(ConvertedCompYearly = log(ConvertedCompYearly))
simple1 <- lm(ConvertedCompYearly ~ 0 + DevType + ., data = modeling_df)
summary(simple1)
Residuals:
Min 1Q Median 3Q Max
-8.0265 -0.4767 -0.1906 0.1690 2.9927
Coefficients:
Estimate Std. Error
DevTypeBack-end 11.085174 0.071129
DevTypeBlockchain 11.143958 0.349419
DevTypeCloud infrastructure engineer 11.222275 0.092274
DevTypeData analyst 10.886505 0.116062
DevTypeData engineer 11.114575 0.106025
DevTypeData scientist 10.878111 0.118479
DevTypeDatabase administrator 10.890003 0.101539
DevTypeDesigner 10.925094 0.114665
DevTypeDesktop 10.982192 0.084270
DevTypeDevOps 11.181084 0.088409
DevTypeEmbedded 10.983093 0.113421
DevTypeFront-end 10.920042 0.077666
DevTypeFull-stack 11.020301 0.068800
DevTypeGame or graphics 10.752088 0.171024
DevTypeMobile 11.005939 0.101414
DevTypeOther (please specify): 11.176998 0.119455
DevTypeProject manager 10.815152 0.183126
DevTypeQA or test 10.966565 0.116540
DevTypeSecurity professional 11.327253 0.151561
DevTypeSenior executive (c-suite, vp, etc.) 11.236853 0.182258
DevTypeSystem administrator 10.856546 0.107406
EdLevelBachelor's degree 0.077822 0.037406
EdLevelGraduate degree 0.216754 0.043232
EdLevelOther -0.217376 0.200976
RemoteWorkFully remote 0.207421 0.062595
RemoteWorkHybrid (some remote, some in-person) 0.088225 0.061308
YearsCodePro 0.015683 0.001729
GenderWoman -0.218579 0.074597
t value Pr(>|t|)
DevTypeBack-end 155.847 < 2e-16 ***
DevTypeBlockchain 31.893 < 2e-16 ***
DevTypeCloud infrastructure engineer 121.619 < 2e-16 ***
DevTypeData analyst 93.799 < 2e-16 ***
DevTypeData engineer 104.829 < 2e-16 ***
DevTypeData scientist 91.814 < 2e-16 ***
DevTypeDatabase administrator 107.250 < 2e-16 ***
DevTypeDesigner 95.278 < 2e-16 ***
DevTypeDesktop 130.322 < 2e-16 ***
DevTypeDevOps 126.470 < 2e-16 ***
DevTypeEmbedded 96.834 < 2e-16 ***
DevTypeFront-end 140.603 < 2e-16 ***
DevTypeFull-stack 160.180 < 2e-16 ***
DevTypeGame or graphics 62.869 < 2e-16 ***
DevTypeMobile 108.525 < 2e-16 ***
DevTypeOther (please specify): 93.567 < 2e-16 ***
DevTypeProject manager 59.058 < 2e-16 ***
DevTypeQA or test 94.101 < 2e-16 ***
DevTypeSecurity professional 74.737 < 2e-16 ***
DevTypeSenior executive (c-suite, vp, etc.) 61.654 < 2e-16 ***
DevTypeSystem administrator 101.080 < 2e-16 ***
EdLevelBachelor's degree 2.080 0.037550 *
EdLevelGraduate degree 5.014 5.58e-07 ***
EdLevelOther -1.082 0.279498
RemoteWorkFully remote 3.314 0.000929 ***
RemoteWorkHybrid (some remote, some in-person) 1.439 0.150217
YearsCodePro 9.071 < 2e-16 ***
GenderWoman -2.930 0.003408 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9065 on 3901 degrees of freedom
Multiple R-squared: 0.9938, Adjusted R-squared: 0.9937
F-statistic: 2.226e+04 on 28 and 3901 DF, p-value: < 2.2e-16
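Because the outcome was log-transformed when we built modeling_df, these coefficients live on the log-salary scale; exponentiating them gives multiplicative effects. For example, taking the GenderWoman estimate from the output above:

# coefficients are on the log scale, so exponentiate to interpret them
exp(-0.218579) # roughly 0.80
# i.e. holding DevType, education, remote work, and experience constant,
# women's predicted salary in this sample is about 20% lower than men's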
The tidymodels way of writing this same model

You can write this same model with tidymodels using the following code:
library(tidymodels)
# Create a linear model
lm_spec <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
lm_fit <-
lm_spec %>%
fit(ConvertedCompYearly ~ 0 + DevType + ., data=modeling_df)
print(lm_fit)
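Because lm_fit is a parsnip model object, you can also pull tidy coefficient estimates out of it with tidy() (re-exported by tidymodels from broom):

lm_fit %>% tidy()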
You’ll be seeing more of this in the week 3 lab.
Also check Julia Silge’s blog for an exploration of an earlier version (the 2019 version) of this dataset!
A few metrics
\[ \begin{align} R^2 &= 1-\frac{RSS}{TSS}\\ &= 1-\frac{\sum_{i=1}^N (y_i-\hat{y}_i)^2}{\sum_{i=1}^N (y_i-\bar{y})^2} \end{align} \]
RSS is the residual sum of squares, as defined earlier. This value captures the prediction error of a model.
TSS is the total sum of squares. To calculate this value, assume a simple model in which the prediction for each observation is the mean of all the observed actuals. TSS is proportional to the variance of the dependent variable, as \(\frac{TSS}{N}\) is the actual variance of \(y\), where \(N\) is the number of observations. Think of \(TSS\) as the variation that a simple mean model cannot explain.
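As a sketch, \(R^2\) computed by hand matches the value summary() reports (again using the simulated simple regression from earlier):

fit <- lm(y ~ x)
rss_fit <- sum(residuals(fit)^2) # residual sum of squares
tss_fit <- sum((y - mean(y))^2)  # total sum of squares

1 - rss_fit / tss_fit
summary(fit)$r.squared           # same value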
Caveats: \(R^2\) only tells you how much of the variation in \(y\) the model explains within the data it was fitted to.

It does not:

- tell you whether the model will predict well on new, unseen data
- tell you whether you have chosen the right functional form for \(f\)
- imply any causal relationship between \(X\) and \(Y\)
How to revise for this course next week.
LSE DS202 2024/25 Autumn Term