Non-linear algorithms
10/28/22
Linear and logistic regression are a good first shot at building ML models — recall modelling medv with lstat in the Boston dataset (💻 Week 05 Lab).

Using the Auto dataset, predict mpg with a tree-based model using weight and year as features:
library(ISLR2) # to load Boston data
library(tidyverse) # to use things like the pipe (%>%), mutate and if_else
library(rpart) # a library that contains decision tree models
library(rpart.plot) # a library that plots rpart models
# The function rpart below fits a decision tree to the data
# You can control various aspects of the rpart fit with the parameter `control`
# Type ?rpart.control in the R console to see what else you can change in the algorithm
tree.reg <- rpart(mpg ~ weight + year, data = Auto, control = list(maxdepth = 2))
rpart.plot(tree.reg)
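A possible follow-up (not part of the original lab): once the tree is fitted, `predict()` reads off the fitted value for any combination of features. The car below is hypothetical.

```r
library(ISLR2) # to load Auto data
library(rpart) # a library that contains decision tree models

tree.reg <- rpart(mpg ~ weight + year, data = Auto, control = list(maxdepth = 2))

# Predict mpg for a hypothetical car: 3,000 lbs, model year 76
# The prediction is simply the mean mpg of the leaf this car falls into
predict(tree.reg, newdata = data.frame(weight = 3000, year = 76))
```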
Using the Boston dataset, predict whether medv is above the median using lstat and tax:
library(ISLR2) # to load Boston data
library(tidyverse) # to use things like the pipe (%>%), mutate and if_else
library(rpart) # a library that contains decision tree models
library(rpart.plot) # a library that plots rpart models
# Add a column named `medv_gtmed` to indicate whether median home value (medv) is above its median
Boston <- Boston %>% mutate(medv_gtmed = if_else(medv > median(medv), TRUE, FALSE))
# The function rpart below fits a decision tree to the data
# You can control various aspects of the rpart fit with the parameter `control`
# Type ?rpart.control in the R console to see what else you can change in the algorithm
tree.class <- rpart(medv_gtmed ~ lstat + tax, data = Boston, control = list(maxdepth = 2))
rpart.plot(tree.class)
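A sketch of how you might assess this tree (an assumption on my part, not shown in the original lab): because `medv_gtmed` is logical, `rpart` defaults to a regression tree on 0/1 values, so predictions are leaf-level proportions that can be read as probabilities and thresholded at 0.5.

```r
library(ISLR2)     # to load Boston data
library(tidyverse) # for the pipe (%>%), mutate and if_else
library(rpart)     # a library that contains decision tree models

Boston <- Boston %>% mutate(medv_gtmed = if_else(medv > median(medv), TRUE, FALSE))
tree.class <- rpart(medv_gtmed ~ lstat + tax, data = Boston, control = list(maxdepth = 2))

# Predictions are leaf proportions (rpart treats TRUE/FALSE as 1/0 here)
pred <- predict(tree.class, newdata = Boston)

# Confusion table after thresholding at 0.5
table(predicted = pred > 0.5, actual = Boston$medv_gtmed)
```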
How decision trees work:
Here’s how the regions were created in our regression/classification examples ⏭️
First, you will have to install the parttree
package:
# Follow the instructions by the developers of the package
# (https://github.com/grantmcdermott/parttree)
install.packages("remotes")
remotes::install_github("grantmcdermott/parttree", force = TRUE)
Then:
library(ISLR2) # to load Boston data
library(tidyverse) # to use things like the pipe (%>%), mutate and if_else
library(rpart) # a library that contains decision tree models
library(parttree) # R package for plotting simple decision tree partitions
# The function rpart below fits a decision tree to the data
# You can control various aspects of the rpart fit with the parameter `control`
# Type ?rpart.control in the R console to see what else you can change in the algorithm
tree.reg <- rpart(mpg ~ weight + year, data = Auto, control = list(maxdepth = 2))
Auto %>%
ggplot(aes(x = weight, y = year)) +
geom_jitter(size = 3, alpha = 0.25) +
geom_parttree(data = tree.reg, aes(fill = mpg), alpha = 0.2) +
theme_minimal() +
theme(panel.grid = element_blank(), legend.position = 'bottom') +
scale_x_continuous(labels = scales::comma) +
scale_fill_steps2() +
labs(x = 'Weight (lbs)', y = 'Year', fill = 'Miles per gallon')
Similarly, for the classification example:
library(ISLR2) # to load Boston data
library(tidyverse) # to use things like the pipe (%>%), mutate and if_else
library(rpart) # a library that contains decision tree models
library(parttree) # R package for plotting simple decision tree partitions
# Add a column named `medv_gtmed` to indicate whether median home value (medv) is above its median
Boston <- Boston %>% mutate(medv_gtmed = if_else(medv > median(medv), TRUE, FALSE))
# The function rpart below fits a decision tree to the data
# You can control various aspects of the rpart fit with the parameter `control`
# Type ?rpart.control in the R console to see what else you can change in the algorithm
tree.class <- rpart(medv_gtmed ~ lstat + tax, data = Boston, control = list(maxdepth = 2))
Boston %>%
ggplot(aes(x = lstat, y = tax)) +
geom_jitter(size = 3, alpha = 0.25) +
geom_parttree(data = tree.class, aes(fill = medv_gtmed), alpha = 0.2) +
theme_minimal() +
theme(panel.grid = element_blank(), legend.position = 'bottom') +
scale_x_continuous(labels = scales::percent_format(scale = 1)) +
scale_y_continuous(labels = scales::dollar) +
scale_fill_steps2() +
labs(x = 'Proportion lower status', y = 'Tax rate per $10,000', fill = 'Probability above median')
Recursive binary splitting: at each step, the algorithm picks the predictor \(j\) and cutpoint \(s\) that minimise the total residual sum of squares across the two resulting regions:

\[ \sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]

Splitting then continues within each region until a stopping criterion is met, such as max tree depth or min samples per leaf.
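The RSS criterion above can be made concrete with a brute-force sketch (an illustration only, not how rpart is actually implemented): for a single predictor, try every observed value as a cutpoint s and keep the one with the lowest total RSS.

```r
library(ISLR2) # to load the Auto data

# Total RSS of splitting response y on predictor x at cutpoint s,
# matching the two sums in the formula above
rss_for_split <- function(x, y, s) {
  left  <- y[x < s]   # region R1(j, s)
  right <- y[x >= s]  # region R2(j, s)
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Candidate cutpoints for weight (drop the minimum so R1 is never empty)
candidates <- sort(unique(Auto$weight))[-1]
rss <- sapply(candidates, function(s) rss_for_split(Auto$weight, Auto$mpg, s))

# Best single split of mpg on weight by this criterion
candidates[which.min(rss)]
```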
After our 10-min break ☕:
DS202 - Data Science for Social Scientists 🤖 🤹