🗓️ Week 02:
Introduction to Regression Algorithms

Theme: Supervised Learning

11 Oct 2024

Machine Learning

What is Machine Learning?

  • Machine Learning (ML) is a subfield of Artificial Intelligence (AI)
    • Traditional AI is (was?) based on explicit programming of rules and logic.
    • Machine Learning is based on learning from examples – from data.
  • “To learn” here often implies the following particular meaning:
    • to capture patterns in data (think trends, correlations, associations, etc.) and use them to make predictions or to support decisions.

    • Different from traditional statistics, which is more focused on inference (i.e. testing hypotheses).

What does it mean to predict something?

  • Say our data is the following simple sequence: \(3, 6, 9, 12, 15, 18, 21, 24, ...\)
  • What number do you expect to come next? Why?
  • It is very likely that you guessed that
    \(\operatorname{next number}=27\)
  • We spot that the sequence follows a pattern:
    \(\operatorname{next number} = \operatorname{previous number} + 3\)
  • If we know the pattern, we can extrapolate (predict) the next number in the sequence.
  • In a way, we have “learned” the pattern from just looking at the data.

Predicting a sequence (formula)


The next number can be represented as a function, \(f(\cdot)\), of the previous one:

\[ \operatorname{next number} = f(\operatorname{previous number}) \]

Or, let’s say, as a function of the position of the number in the sequence:

| Position | Number |
|---------:|-------:|
| 1        | 3      |
| 2        | 6      |
| 3        | 9      |
| 4        | 12     |
| 5        | 15     |
| 6        | 18     |
| 7        | 21     |
| 8        | 24     |

In equation form:

\[ \operatorname{Number} = f(\operatorname{Position}) \]

where

\[ f(x) = 3x \]

👈🏻 Typically, we use a tabular format like this to represent our data when doing ML.
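As a minimal sketch in R (the language used in the examples later in this deck), such a table is just a small data frame; seq_df is a name of my choosing:

library(tibble)

# The sequence as tabular data: one input column (Position), one output column (Number)
seq_df <- tibble(
  Position = 1:8,
  Number   = seq(3, 24, by = 3)
)
seq_df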

The goal of Machine Learning


The goal of ML:

Find a function \(f\), in any suitable mathematical form, that can best predict how \(Y\) will vary given \(X\).

ML vs traditional stats


  • Seen from afar, this is not that different from the goal of traditional statistics.
  • But a statistician might ask:
    • What is the data generating process that produced the data?
    • What evidence do we have that the pattern we have found is the “true” pattern?
    • How can we be sure that the pattern we have found is not just a coincidence?
    • They have a point 👉

Are there other possible sequences?

Let’s find out:

If we visit the OEIS® page and paste our sequence: \(3, 6, 9, 12, 15, 18, 21, 24, ...\)

… we will get 19 different sequences that contain those same numbers!


How do we know?

The sad truth: we don’t.

  • The statistician George Box famously wrote:

    “All models are wrong, but some are useful.”

  • A final word on this dichotomy we are exploring:
    • Traditional stats: focuses on testing how well our assumptions (our models) fit the data. Typically via hypothesis testing.
    • Machine Learning: focuses more on assessing how well the model can predict unseen data.

Types of learning


Representation of learning

This example we just explored can be represented as:

\[ \operatorname{Y} = f(\operatorname{X}) + \epsilon \]

where:

  • \(Y\): the output (the Number in our example)
  • \(X\): the input (the Position in our example)
  • \(f\): a suitable mathematical function (simple or complex)
  • \(\epsilon\): a random error term

Whenever you are modelling something that can be represented somehow as the equation above, you are doing supervised learning.

Approximating \(f\)

  • \(f\) is almost always unknown (“all models are wrong”!)
  • The best we can aim for is an approximation (a model).
  • Let’s denote it \(\hat{f}\), which we can then use to predict values of \(Y\) for whatever \(X\) we encounter.
    • That is: \(\hat{Y} = \hat{f}(X)\)

How does this approximation process work?

  • You have to come up with a suitable mathematical form for \(\hat{f}\).
    • Each ML algorithm will have its own way of doing this.
    • You could also come up with your own function if you are so inclined.
  • It’s likely that \(\hat{f}\) will have some parameters that you will need to estimate (see the sketch after this list).
    • Instead of proposing \(\hat{f}(x) = 3x\), we say to ourselves:
      ‘I don’t know if 3 is the absolute best number here, maybe the data can tell me?’
    • We could then propose \(\hat{f}(x) = \beta x\) and set ourselves to find out the optimal value of \(\beta\) that ‘best’ fits the data.
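As a sketch of what this looks like in practice, R’s lm() can estimate \(\beta\) for us (the 0 + in the formula drops the intercept, leaving just \(\beta x\); the data frame is the toy sequence from earlier):

# Fit Number = beta * Position and let the data pick beta
seq_df <- data.frame(Position = 1:8, Number = seq(3, 24, by = 3))
fit <- lm(Number ~ 0 + Position, data = seq_df)
coef(fit)  # beta = 3: the pattern is recovered exactly from the data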

How does this approximation process work? (cont.)

  • To train your model, i.e. to find the best value for the parameters, you need to feed your model with past data that contains both \(X\) and \(Y\) values.
  • You MUST have already collected ‘historical’ data that contains both \(X\) and \(Y\) values.
  • The model will then be able to predict \(Y\) values for any \(X\) values.

You can have multiple columns of \(X\) values 👉

| X1 | X2 | X3 | X4 | Y  |
|---:|---:|---:|---:|---:|
| 1  | 2  | 3  | 10 | 3  |
| 2  | 4  | 6  | 20 | 6  |
| 3  | 6  | 9  | 30 | 9  |
| 4  | 8  | 12 | 40 | 12 |
| 5  | 10 | 15 | 50 | 15 |



If you have nothing specific to predict, or no designated \(Y\), then you are not engaging in supervised learning.


You can still use ML to find patterns in the data, a process known as unsupervised learning.

Types of learning


These are, broadly speaking, the two main ways of learning from data:

Supervised Learning

  • Each observation \(\mathbf{x}_i = \{X1_i, X2_i, \ldots\}\) has an outcome associated with it (\(y_i\)).
  • Your goal is to find a \(\hat{f}\) that produces \(\hat{Y}\) values close to the true \(Y\) values.
  • Use it to make predictions or to support decisions.
  • Our focus in 🗓️ Weeks 2, 3, 4 & 5.

Unsupervised Learning

  • You have observations \(\mathbf{x}_i = \{X1_i, X2_i, \ldots\}\) but there is no response variable (or you don’t care about it).
  • Focus: identify (dis)similarities in \(X\).
  • Use it to find clusters, anomalies, or other patterns in the data.
  • Our focus in 🗓️ Weeks 7, 8 & 9.

Linear Regression

The basic models

Linear regression is a simple approach to supervised learning.

The generic supervised model:

\[ Y = \operatorname{f}(X) + \epsilon \]

is defined more explicitly as follows ➡️

Simple linear regression

\[ Y = \beta_0 + \beta_1 X + \epsilon, \]

when we use a single predictor, \(X\).

Multiple linear regression

\[ \begin{align} Y = \beta_0 &+ \beta_1 X_1 + \beta_2 X_2 \\ &+ \dots \\ &+ \beta_p X_p + \epsilon \end{align} \]

when there are multiple predictors, \(X_1, X_2, \dots, X_p\).
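As a minimal sketch (with simulated data and toy coefficients of my choosing), the R formula syntax maps directly onto this equation:

# Simulate data where the true coefficients are known...
set.seed(42)
n  <- 100
df <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
df$Y <- 1 + 2 * df$X1 - 0.5 * df$X2 + 0.3 * df$X3 + rnorm(n, sd = 0.1)

# ... and check that lm() estimates one coefficient per predictor, plus the intercept
coef(lm(Y ~ X1 + X2 + X3, data = df))  # approx. (1, 2, -0.5, 0.3)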

Warning

  • Most real-life processes are not linear.
  • Still, linear regression is a good starting point for many problems.
  • Do you know the assumptions underlying linear models?

Linear Regression with a single predictor

We assume a model:

\[ Y = \beta_0 + \beta_1 X + \epsilon , \]

where:

  • \(\beta_0\): an unknown constant that represents the intercept of the line.
  • \(\beta_1\): an unknown constant that represents the slope of the line.
  • \(\epsilon\): the random error term (irreducible).

Linear Regression with a single predictor

We want to estimate:

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]

where:

  • \(\hat{y}\): is a prediction of \(Y\) on the basis of \(X = x\).
  • \(\hat{\beta_0}\): is an estimate of the “true” \(\beta_0\).
  • \(\hat{\beta_1}\): is an estimate of the “true” \(\beta_1\).

Suppose you came across some data:

And you suspect there is a linear relationship between X and Y.

How would you go about fitting a line to it?

Does this line fit?

A line right through the “centre of gravity” of the cloud of data.

Different estimators, different equations

There are multiple ways to estimate the coefficients.

  • If you use different techniques, you might get different equations
  • The most common algorithm is called
    Ordinary Least Squares (OLS)
  • Alternative estimators (Karafiath 2009):
    • Least Absolute Deviation (LAD)
    • Weighted Least Squares (WLS)
    • Generalized Least Squares (GLS)
    • Heteroskedastic-Consistent (HC) variants

Algorithm: Ordinary Least Squares (OLS)

The concept of residuals

Residuals are the distances from each data point to this line.

\(e_i = y_i - \hat{y}_i\) represents the \(i\)th residual.

Observed vs. Predicted

Residual Sum of Squares (RSS)

From this, we can define the Residual Sum of Squares (RSS) as

\[ \mathrm{RSS}= e_1^2 + e_2^2 + \dots + e_n^2, \]

or equivalently as

\[ \mathrm{RSS}= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \]
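In R, the RSS of any fitted linear model is just the sum of its squared residuals; a quick sketch using the built-in cars dataset (my choice, purely for illustration):

fit <- lm(dist ~ speed, data = cars)  # cars: a built-in example dataset
sum(residuals(fit)^2)                 # the RSS of this fit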



Note

The (ordinary) least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.

OLS: objective function

We treat this as an optimisation problem. We want to minimize RSS:

\[ \begin{align} \min_{\hat{\beta}_0,\,\hat{\beta}_1} \mathrm{RSS} =& \sum_i^n{e_i^2} \\ =& \sum_i^n{\left(y_i - \hat{y}_i\right)^2} \\ =& \sum_i^n{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2} \end{align} \]
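To see that this really is just an optimisation problem, here is a sketch that minimises the RSS numerically with R’s general-purpose optimiser, optim(), on simulated toy data:

set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)  # toy data: true intercept 2, true slope 3

# RSS as a function of the candidate parameters (beta0, beta1)
rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)

optim(c(0, 0), rss)$par  # numerically close to the OLS solution
coef(lm(y ~ x))          # ... which lm() computes analytically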

Estimating \(\hat{\beta}_0\)

To find \(\hat{\beta}_0\), we set the corresponding partial derivative to zero:

\[ \frac{\partial \,\mathrm{RSS}}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0} \sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]

… which will lead you to:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]

where we made use of the sample means:

  • \(\bar{y} \equiv \frac{1}{n} \sum_{i=1}^n y_i\)
  • \(\bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i\)

Full derivation

\[ \begin{align} 0 &= \frac{\partial}{\partial \hat{\beta}_0} \sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{-2 (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{after chain rule})\\ 0 &= -2 \sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\text{we took $-2$ out}) \\ 0 &=\sum_i^n{ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)} & (\div (-2)) \\ 0 &=\sum_i^n{y_i} - \sum_i^n{\hat{\beta}_0} - \sum_i^n{\hat{\beta}_1 x_i} & (\text{sep. sums}) \end{align} \]

\[ \begin{align} 0 &=\sum_i^n{y_i} - n\hat{\beta}_0 - \hat{\beta}_1\sum_i^n{ x_i} & (\text{simplified}) \\ n\hat{\beta}_0 &= \sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i} & (+ n\hat{\beta}_0) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i} - \hat{\beta}_1\sum_i^n{ x_i}}{n} & (\text{isolate }\hat{\beta}_0 ) \\ \hat{\beta}_0 &= \frac{\sum_i^n{y_i}}{n} - \hat{\beta}_1\frac{\sum_i^n{x_i}}{n} & (\text{after rearranging})\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} ~~~ \blacksquare & \end{align} \]

Estimating \(\hat{\beta}_1\)

Similarly, to find \(\hat{\beta}_1\) we set:

\[ \frac{\partial \,\mathrm{RSS}}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1} \sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = 0 \]

… which will lead you to:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \]

Full derivation

\[ \begin{align} 0 &= \frac{\partial}{\partial \hat{\beta}_1} \sum_i^n{(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} & \\ 0 &= \sum_i^n{\left(-2x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{after chain rule})\\ 0 &= -2\sum_i^n{\left( x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\text{we took $-2$ out}) \\ 0 &= \sum_i^n{\left(x_i~ (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\right)} & (\div (-2)) \\ 0 &= \sum_i^n{\left(y_ix_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2\right)} & (\text{distributed } x_i) \end{align} \]

\[ \begin{align} 0 &= \sum_i^n{\left(y_ix_i - (\bar{y} - \hat{\beta}_1 \bar{x})x_i - \hat{\beta}_1 x_i^2\right)} & (\text{replaced } \hat{\beta}_0) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i + \hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{rearranged}) \\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \sum_i^n{\left(\hat{\beta}_1 \bar{x}x_i - \hat{\beta}_1 x_i^2\right)} & (\text{separate sums})\\ 0 &= \sum_i^n{\left(y_ix_i - \bar{y}x_i\right)} + \hat{\beta}_1\sum_i^n{\left(\bar{x}x_i - x_i^2\right)} & (\text{took $\hat{\beta}_1$ out}) \\ \hat{\beta}_1 &= \frac{\sum_i^n{(y_i - \bar{y})x_i}}{\sum_i^n{(x_i - \bar{x})x_i}} & (\text{isolate }\hat{\beta}_1) \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} & \left(\text{since } \textstyle\sum_i^n{(y_i - \bar{y})} = \sum_i^n{(x_i - \bar{x})} = 0\right) ~~~ \blacksquare \end{align} \]

Parameter Estimation (OLS)

And that is how OLS works!

\[ \begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \]
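The two formulas are easy to verify in R; a sketch on simulated toy data, checked against lm():

set.seed(123)
x <- runif(50, 0, 10)
y <- 5 + 1.5 * x + rnorm(50)  # toy data with known coefficients

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)  # hand-rolled OLS estimates
coef(lm(y ~ x))          # identical, up to floating-point error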

Example: Modeling salary and gender in the tech industry (Stack Overflow Developer Survey)

Stack Overflow Developer Survey Data

  • We’ll have a look at the 2022 Stack Overflow Developer Survey Data
  • Here I will focus on the data for the United Kingdom
    • More specifically, the annual salary (in $)
    • Stack Overflow publishes this data every year. Note that 2022 is the last year in which gender statistics were recorded (though surveys exist up to 2024)!

Let’s take a quick look at a subset of the data

| ResponseId | EdLevel | DevType | RemoteWork | YearsCodePro | Gender | ConvertedCompYearly |
|---|---|---|---|---|---|---|
| 3 | Graduate degree | Data scientist or machine learning specialist;Developer, front-end;Engineer, data;Engineer, site reliability | Hybrid (some remote, some in-person) | 5 | Man | 40205 |
| 11 | Bachelor’s degree | Developer, full-stack;Developer, back-end | Hybrid (some remote, some in-person) | 2 | Man | 60307 |
| 85 | Bachelor’s degree | Developer, full-stack | Fully remote | 7 | Man | 69102 |
| 109 | Bachelor’s degree | Developer, full-stack | Hybrid (some remote, some in-person) | 7 | Man | 75384 |
| 112 | Bachelor’s degree | Developer, front-end;Developer, full-stack;Developer, back-end | Hybrid (some remote, some in-person) | 10 | Man | 78525 |

Stack Overflow Survey Data

The code below reproduces the dataframe from the previous slide. You’ll need the tidyverse library for all the code in the following slides!

library(tidyverse)
filtered_gender <- c("Man", "Woman", "Non-binary")
survey_results <- read_csv("survey_results_public.csv") %>% # change the filepath to the location of the survey file you've downloaded!!
  dplyr::filter(
    Country == "United Kingdom of Great Britain and Northern Ireland",
    Employment == "Employed, full-time",
    ConvertedCompYearly > 0,
    ConvertedCompYearly < 2e6
  )

## identify non-ICs, to remove
managers_ctos <- survey_results %>%
  dplyr::filter(str_detect(DevType, "Engineering manager|Product manager|Senior executive/VP"))

## identify academics, to remove
academics <- survey_results %>%
  dplyr::filter(str_detect(DevType, "Academic researcher|Scientist|Educator"))

results <- survey_results %>%
  anti_join(managers_ctos) %>%
  anti_join(academics) %>%
  transmute(ResponseId,
            EdLevel = fct_collapse(EdLevel,
                                   `Less than bachelor's` = c(
                                     "Primary/elementary school",
                                     "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",
                                     "Some college/university study without earning a degree",
                                     "Associate degree (A.A., A.S., etc.)"
                                   ),
                                   `Other` = "Something else",
                                   `Bachelor's degree` = "Bachelor’s degree (B.A., B.S., B.Eng., etc.)",
                                   `Graduate degree` = c(
                                     "Other doctoral degree (Ph.D., Ed.D., etc.)",
                                     "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",
                                     "Professional degree (JD, MD, etc.)"
                                    )
            ),
            DevType,
            RemoteWork,
            YearsCodePro = parse_number(YearsCodePro),
            Gender = case_when(
              str_detect(Gender, "Non-binary") ~ "Non-binary",
              TRUE ~ Gender
            ),
            ConvertedCompYearly
  ) %>%
  dplyr::filter(Gender %in% filtered_gender)


results %>%
  head(5) %>%
  knitr::kable() # note: alternatively, you could use View(results %>% head(5))

A tiny bit of data exploration before the modeling

| Gender | Total respondents | Median annual salary for UK respondents |
|--------|------------------:|-----------------------------------------:|
| Woman | 103 | $62,820.00 |
| Non-binary | 38 | $70,452.50 |
| Man | 1,823 | $81,414.00 |

A tiny bit of exploration before modeling: code (part 1)

The code below reproduces the figure from the previous slide.

results %>%
  ggplot(aes(ConvertedCompYearly, fill = Gender, color = Gender)) +
  geom_density(alpha = 0.2, size = 1.5) +
  scale_x_log10(labels = scales::dollar_format()) +
  labs(
    x = "Annual salary (USD)",
    y = "Density",
    title = "Salary for respondents on the Stack Overflow Developer Survey",
    subtitle = "Overall, in the UK, men earn more than women and non-binary developers"
  )

And this code reproduces the table:

results %>%
  group_by(Gender) %>%
  summarise(
    Total = n(),
    Salary = median(ConvertedCompYearly)
  ) %>%
  arrange(Salary) %>%
  mutate(
    Total = formattable::comma(Total),
    Salary = scales::dollar(Salary)
  ) %>%
  knitr::kable(
    align = "lrr",
    col.names = c("Gender", "Total respondents", "Median annual salary for UK respondents")
  )

A tiny bit of data exploration before the modeling: code (part 2)

Use this code to produce the two exploration figures (salary vs. gender by developer type, and years of professional experience by remote-work status and gender):

filtered_devtype <- c(
  "Other", "Student",
  "Marketing or sales professional"
)


survey_results_parsed <- results %>%
  mutate(DevType = str_split(DevType, pattern = ";")) %>%
  unnest(DevType) %>%
  mutate(
    DevType = case_when(
      str_detect(str_to_lower(DevType), "data scientist") ~ "Data scientist",
      str_detect(str_to_lower(DevType), "data or business") ~ "Data analyst",
      str_detect(str_to_lower(DevType), "desktop") ~ "Desktop",
      str_detect(str_to_lower(DevType), "embedded") ~ "Embedded",
      str_detect(str_to_lower(DevType), "devops") ~ "DevOps",
      str_detect(DevType, "Engineer, data") ~ "Data engineer",
      str_detect(str_to_lower(DevType), "site reliability") ~ "DevOps",
      TRUE ~ DevType
    ),
    DevType = str_remove_all(DevType, "Developer, "),
    DevType = str_to_sentence(DevType),
    DevType = str_replace_all(DevType, "Qa", "QA"),
    DevType = str_replace_all(DevType, "Sre", "SRE"),
    DevType = str_replace_all(DevType, "Devops", "DevOps")
  ) %>%
  dplyr::filter(
    !DevType %in% filtered_devtype,
    !is.na(DevType)
  )

survey_results_parsed %>%
  mutate(Gender = fct_infreq(Gender)) %>%
  ggplot(aes(Gender, ConvertedCompYearly, color = Gender)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(aes(alpha = Gender), width = 0.15) +
  facet_wrap(~DevType) +
  scale_y_log10(labels = scales::dollar_format()) +
  scale_alpha_discrete(range = c(0.04, 0.4)) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(
    x = NULL, y = NULL,
    title = "Salary and Gender in the 2022 Stack Overflow Developer Survey",
    subtitle = "Annual salaries for UK developers"
  )


survey_results_parsed %>%
  mutate(Gender = fct_infreq(Gender)) %>%
  group_by(Gender, DevType,RemoteWork) %>%
  summarise(YearsCodePro = median(YearsCodePro, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot(aes(Gender, YearsCodePro, fill = RemoteWork)) +
  geom_col(position = position_dodge(preserve = "single")) +
  facet_wrap(~DevType) +
  labs(
    x = NULL,
    y = "Median years of professional coding experience",
    fill = "Remote Work?",
    title = "Years of experience, remote work, and gender in the 2022 Stack Overflow Developer Survey",
    subtitle = "Women are more likely to work remotely or in hybrid mode and are less experienced"
  )

A linear model of our data

Let’s first do it the way you’re used to:

modeling_df <- survey_results_parsed %>%
  drop_na(YearsCodePro)%>%
  dplyr::filter(
    ConvertedCompYearly < 1e7,
    YearsCodePro < 60
  ) %>%
  dplyr::filter(Gender %in% c("Man", "Woman")) %>%
  select(-ResponseId) %>%
  mutate(ConvertedCompYearly = log(ConvertedCompYearly))

simple1 <- lm(ConvertedCompYearly ~ 0 + DevType + ., data = modeling_df)
summary(simple1)

What are our results and what do they tell us?

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0265 -0.4767 -0.1906  0.1690  2.9927 

Coefficients:
                                                Estimate Std. Error
DevTypeBack-end                                11.085174   0.071129
DevTypeBlockchain                              11.143958   0.349419
DevTypeCloud infrastructure engineer           11.222275   0.092274
DevTypeData analyst                            10.886505   0.116062
DevTypeData engineer                           11.114575   0.106025
DevTypeData scientist                          10.878111   0.118479
DevTypeDatabase administrator                  10.890003   0.101539
DevTypeDesigner                                10.925094   0.114665
DevTypeDesktop                                 10.982192   0.084270
DevTypeDevOps                                  11.181084   0.088409
DevTypeEmbedded                                10.983093   0.113421
DevTypeFront-end                               10.920042   0.077666
DevTypeFull-stack                              11.020301   0.068800
DevTypeGame or graphics                        10.752088   0.171024
DevTypeMobile                                  11.005939   0.101414
DevTypeOther (please specify):                 11.176998   0.119455
DevTypeProject manager                         10.815152   0.183126
DevTypeQA or test                              10.966565   0.116540
DevTypeSecurity professional                   11.327253   0.151561
DevTypeSenior executive (c-suite, vp, etc.)    11.236853   0.182258
DevTypeSystem administrator                    10.856546   0.107406
EdLevelBachelor's degree                        0.077822   0.037406
EdLevelGraduate degree                          0.216754   0.043232
EdLevelOther                                   -0.217376   0.200976
RemoteWorkFully remote                          0.207421   0.062595
RemoteWorkHybrid (some remote, some in-person)  0.088225   0.061308
YearsCodePro                                    0.015683   0.001729
GenderWoman                                    -0.218579   0.074597
                                               t value Pr(>|t|)    
DevTypeBack-end                                155.847  < 2e-16 ***
DevTypeBlockchain                               31.893  < 2e-16 ***
DevTypeCloud infrastructure engineer           121.619  < 2e-16 ***
DevTypeData analyst                             93.799  < 2e-16 ***
DevTypeData engineer                           104.829  < 2e-16 ***
DevTypeData scientist                           91.814  < 2e-16 ***
DevTypeDatabase administrator                  107.250  < 2e-16 ***
DevTypeDesigner                                 95.278  < 2e-16 ***
DevTypeDesktop                                 130.322  < 2e-16 ***
DevTypeDevOps                                  126.470  < 2e-16 ***
DevTypeEmbedded                                 96.834  < 2e-16 ***
DevTypeFront-end                               140.603  < 2e-16 ***
DevTypeFull-stack                              160.180  < 2e-16 ***
DevTypeGame or graphics                         62.869  < 2e-16 ***
DevTypeMobile                                  108.525  < 2e-16 ***
DevTypeOther (please specify):                  93.567  < 2e-16 ***
DevTypeProject manager                          59.058  < 2e-16 ***
DevTypeQA or test                               94.101  < 2e-16 ***
DevTypeSecurity professional                    74.737  < 2e-16 ***
DevTypeSenior executive (c-suite, vp, etc.)     61.654  < 2e-16 ***
DevTypeSystem administrator                    101.080  < 2e-16 ***
EdLevelBachelor's degree                         2.080 0.037550 *  
EdLevelGraduate degree                           5.014 5.58e-07 ***
EdLevelOther                                    -1.082 0.279498    
RemoteWorkFully remote                           3.314 0.000929 ***
RemoteWorkHybrid (some remote, some in-person)   1.439 0.150217    
YearsCodePro                                     9.071  < 2e-16 ***
GenderWoman                                     -2.930 0.003408 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9065 on 3901 degrees of freedom
Multiple R-squared:  0.9938,    Adjusted R-squared:  0.9937 
F-statistic: 2.226e+04 on 28 and 3901 DF,  p-value: < 2.2e-16

  • it pays off to be a data scientist, but maybe not as much as being an executive or a back-end or full-stack developer 😉
  • women earn less than men
  • your level of education matters!
  • there is a benefit to working remotely (at least according to this model!)

The tidymodels way of writing this same model

You can write this same model with tidymodels with the following code:

library(tidymodels)

# Create a linear model
lm_spec <- 
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")


lm_fit <- 
  lm_spec %>%
  fit(ConvertedCompYearly ~ 0 + DevType + ., data=modeling_df)

print(lm_fit)

You’ll be seeing more of this in the week 3 lab.
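Once fitted, the parsnip object can be inspected and used much like a plain lm fit; a quick sketch using the standard tidymodels/broom interfaces (loaded by library(tidymodels) above):

tidy(lm_fit)                                   # coefficient table, as a tibble
predict(lm_fit, new_data = head(modeling_df))  # predictions for (new) data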

Also check Julia Silge’s blog for an exploration of an earlier version (the 2019 version) of this dataset!

So what now? Evaluating linear regression

A few metrics

  1. \(R^2\), or coefficient of determination

\[ \begin{align} R^2 &= 1-\frac{RSS}{TSS}\\ &= 1-\frac{\sum_{i=1}^N (y_i-\hat{y})^2}{\sum_{i=1}^N (y_i-\bar{y})^2} \end{align} \]

  • RSS is the residual sum of squares, i.e. the sum of squared residuals. This value captures the prediction error of the model.

  • TSS is the total sum of squares. To calculate it, assume a simple baseline model whose prediction for every observation is the mean of all the observed actuals. TSS is proportional to the variance of the dependent variable, since \(TSS/N\) is the variance of \(y\), where \(N\) is the number of observations. Think of \(TSS\) as the variation that a simple mean model cannot explain. (A quick check in code follows below.)
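A minimal sketch of computing \(R^2\) from its definition, using the built-in cars dataset (my choice, for illustration), and checking it against what lm() reports:

fit <- lm(dist ~ speed, data = cars)         # cars: built-in example dataset
rss <- sum(residuals(fit)^2)                 # prediction error of the model
tss <- sum((cars$dist - mean(cars$dist))^2)  # error of the mean-only model

1 - rss / tss           # R^2, computed from its definition
summary(fit)$r.squared  # the same value, as reported by lm()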

Caveats:

It does not:

  • indicate whether enough data points were used to make a solid conclusion!
  • show whether collinearity exists between explanatory variables
  • indicate whether the most appropriate independent variables were used, or whether the right form of regression was chosen
  • indicate whether the model might be improved by using transformed versions of the existing set of independent variables
  • show that the independent variables are a cause of the changes in the dependent variable

So what now? Evaluating linear regression

  2. \(RMSE\), or root mean squared error
    \[RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y_i})^2}\]
  • a metric independent of the dataset size (the \(RSS\) is divided by the number of observations \(N\))
  • measures the average difference between the values predicted by a model and the actual values, i.e. it estimates how accurately the model predicts the target
  • the lower the RMSE, the better the model
  • expresses the error in the same unit as the predicted column, which makes it easy to interpret: e.g. if you are predicting an amount in GBP, the RMSE is also an amount in GBP (see the sketch below)
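A sketch of RMSE in base R, again on the built-in cars dataset (my choice), together with the equivalent helper from tidymodels’ yardstick package (assuming it is installed):

fit <- lm(dist ~ speed, data = cars)
sqrt(mean(residuals(fit)^2))  # RMSE, in the same units as 'dist'

# The equivalent, via yardstick:
yardstick::rmse_vec(truth = cars$dist, estimate = fitted(fit))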

What’s next?

How to revise for this course before next week:

  1. To understand in detail all the assumptions implicitly made by linear models,
    read (James et al. 2021, chaps. 2–3)
    (Not compulsory but highly recommended reading)
  2. Have a look at extensions of linear models here
  3. Take a crack at LASSO and Ridge regression models (you might encounter them in bonus tasks in the Week 3 lab) here and here or on Julia Silge’s Blog (LASSO-related page)

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd edition. Springer Texts in Statistics. New York NY: Springer. https://www.statlearning.com/.
Karafiath, Imre. 2009. “Is There a Viable Alternative to Ordinary Least Squares Regression When Security Abnormal Returns Are the Dependent Variable?” Review of Quantitative Finance and Accounting 32 (1): 17–31. https://doi.org/10.1007/s11156-007-0079-y.