W08 summative possible solution
What follows is one possible solution for the W08 summative. It is, by no means, the only possible solution and is only provided to give you an idea of what was expected!
Possible run-through of the solution
Setup
library(readr)
library(ggplot2)
library(naniar)
library(skimr)
library(dplyr)
library(viridis)
library(recipes)
library(yardstick)
library(rsample)
library(tune)
library(workflows)
library(broom)
library(parsnip)
library(doParallel)
library(tidyr)
library(xgboost)
library(themis)
library(vip)
library(ggsci)
theme_scatter <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          legend.position = "bottom")
}

theme_line <- function() {
  theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          legend.position = "none")
}
Part 1
We know that, in this dataset, -999 values are missing values. We could replace them with NA after reading the data, e.g. with the naniar library's `replace_with_na_all()`, but to simplify things, we are choosing to replace these values at loading time, thanks to the `na` option of the `read_csv()` function:
aq_bench <- read_csv('data/AQbench_dataset.csv', na = c("-999"))
Rows: 5577 Columns: 53
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (6): country, htap_region, climatic_zone, type, type_of_area, dataset
dbl (47): id, lon, lat, alt, relative_alt, water_25km, evergreen_needleleaf_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
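For reference, the naniar alternative mentioned above would look something like this (only a sketch, assuming the file is first read without the na option so that the -999 sentinel values are still present):

# Sketch of the naniar alternative (not run): replace every -999 with NA
aq_bench_raw <- read_csv('data/AQbench_dataset.csv')
aq_bench_alt <- aq_bench_raw %>%
  replace_with_na_all(condition = ~.x == -999)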
If we now look at our data, we have the following:
vis_miss(aq_bench)
and
skim(aq_bench)
── Data Summary ────────────────────────
Values
Name aq_bench
Number of rows 5577
Number of columns 53
_______________________
Column type frequency:
character 6
numeric 47
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 country 0 1 4 24 0 70 0
2 htap_region 0 1 3 3 0 15 0
3 climatic_zone 0 1 7 16 0 13 0
4 type 0 1 5 10 0 5 0
5 type_of_area 0 1 5 8 0 5 0
6 dataset 0 1 3 5 0 3 0
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean
1 id 0 1 11482.
2 lon 0 1 13.5
3 lat 0 1 39.3
4 alt 0 1 265.
5 relative_alt 0 1 49.8
6 water_25km 0 1 12.7
7 evergreen_needleleaf_forest_25km 0 1 2.88
8 evergreen_broadleaf_forest_25km 0 1 0.512
9 deciduous_needleleaf_forest_25km 0 1 0.00224
10 deciduous_broadleaf_forest_25km 0 1 2.68
11 mixed_forest_25km 0 1 19.8
12 closed_shrublands_25km 0 1 0.213
13 open_shrublands_25km 0 1 3.10
14 woody_savannas_25km 0 1 5.20
15 savannas_25km 0 1 0.426
16 grasslands_25km 0 1 5.43
17 permanent_wetlands_25km 0 1 0.344
18 croplands_25km 0 1 21.0
19 urban_and_built-up_25km 0 1 12.4
20 cropland-natural_vegetation_mosaic_25km 0 1 11.5
21 snow_and_ice_25km 0 1 0.167
22 barren_or_sparsely_vegetated_25km 0 1 0.291
23 wheat_production 0 1 0.899
24 rice_production 0 1 1.48
25 nox_emissions 0 1 14.2
26 no2_column 0 1 4.08
27 population_density 0 1 24611.
28 max_population_density_5km 0 1 33295.
29 max_population_density_25km 0 1 57533.
30 nightlight_1km 0 1 46.0
31 nightlight_5km 0 1 39.8
32 max_nightlight_25km 0 1 59.9
33 o3_average_values 0 1 27.9
34 o3_daytime_avg 1098 0.803 32.9
35 o3_nighttime_avg 1087 0.805 21.9
36 o3_median 1130 0.797 26.5
37 o3_perc25 1130 0.797 15.8
38 o3_perc75 1130 0.797 37.2
39 o3_perc90 1130 0.797 47.2
40 o3_perc98 1130 0.797 60.8
41 o3_dma8eu 1013 0.818 58.2
42 o3_avgdma8epax 1025 0.816 37.9
43 o3_drmdmax1h 32 0.994 54.2
44 o3_w90 1087 0.805 163.
45 o3_aot40 1116 0.800 16345.
46 o3_nvgt070 1025 0.816 8.63
47 o3_nvgt100 1087 0.805 1.52
sd p0 p25 p50 p75 p100 hist
1 4041. 3336 8252 11732 15431 17722 ▃▅▆▅▇
2 88.1 -171. -76.0 7.48 127. 175. ▂▃▇▁▆
3 13.2 -90.0 35.4 39.8 45.8 82.5 ▁▁▁▇▂
4 466. -4 20 90 287 5500 ▇▁▁▁▁
5 107. -136 8 21 47 1826 ▇▁▁▁▁
6 19.4 0 0 0 23.4 100 ▇▂▁▁▁
7 9.20 0 0 0 1.6 96.4 ▇▁▁▁▁
8 4.17 0 0 0 0 99.5 ▇▁▁▁▁
9 0.0804 0 0 0 0 4.8 ▇▁▁▁▁
10 9.03 0 0 0 0 94.5 ▇▁▁▁▁
11 21.7 0 1.1 12 33.8 99.1 ▇▃▂▁▁
12 1.07 0 0 0 0 16.7 ▇▁▁▁▁
13 12.1 0 0 0 0 99.7 ▇▁▁▁▁
14 11.3 0 0 0 3.9 85.6 ▇▁▁▁▁
15 2.80 0 0 0 0 54.6 ▇▁▁▁▁
16 15.5 0 0 0 2.6 100 ▇▁▁▁▁
17 1.75 0 0 0 0 41 ▇▁▁▁▁
18 22.8 0 3.7 12.8 30.7 99.7 ▇▂▁▁▁
19 16.8 0 1.5 5.4 16.2 91.8 ▇▁▁▁▁
20 15.5 0 0 4.3 17.5 85.4 ▇▂▁▁▁
21 3.29 0 0 0 0 100 ▇▁▁▁▁
22 3.22 0 0 0 0 100 ▇▁▁▁▁
23 1.79 0 0 0.079 0.946 22.0 ▇▁▁▁▁
24 4.45 0 0 0 0 39.3 ▇▁▁▁▁
25 28.9 0 1.26 4.52 16.2 967. ▇▁▁▁▁
26 3.46 0 1.95 2.93 4.88 20.8 ▇▂▁▁▁
27 48021. 0 1480 7422 23939 940909 ▇▁▁▁▁
28 59626. 0 2515 11477 32349 973786 ▇▁▁▁▁
29 85087. 0 9180 24824 58521 1669078 ▇▁▁▁▁
30 21.7 0 30 59 63 63 ▂▁▁▁▇
31 20.7 0 22.0 45.1 59.3 63 ▃▂▂▃▇
32 8.83 0 62 63 63 63 ▁▁▁▁▇
33 6.40 3.62 23.7 27.4 31.4 65.6 ▁▇▆▁▁
34 6.21 8.37 28.9 33.0 36.7 56.1 ▁▂▇▃▁
35 7.10 2.69 17.1 20.7 25.4 54.7 ▁▇▃▁▁
36 6.91 2.51 22.0 26.2 30.5 54.7 ▁▅▇▂▁
37 7.81 0.2 10.2 14.8 20.2 48 ▃▇▃▁▁
38 6.34 11.5 33.4 37.3 41 61.6 ▁▂▇▂▁
39 7.05 16.3 42.8 47.6 52 74.9 ▁▂▇▃▁
40 9.58 22.1 55.5 60.8 66.8 115. ▁▆▇▁▁
41 9.17 20.4 53.0 58.4 64.1 108. ▁▅▇▁▁
42 6.40 10.7 33.7 38.3 42.1 71.2 ▁▃▇▁▁
43 8.27 16.6 49.5 54.3 58.7 102. ▁▃▇▁▁
44 115. 1.35 78.0 132. 225. 738. ▇▅▂▁▁
45 10262. 0 8943. 14985. 22370. 72431. ▇▇▂▁▁
46 11.9 0 1 4.25 12 180. ▇▁▁▁▁
47 3.99 0 0 0 1 93 ▇▁▁▁▁
The dataset doesn’t have variables with missing values other than the o3_* variables, which are likely to be highly redundant variants of the outcome variable o3_average_values and should not, for simplicity (and to avoid unintentional data leakage!), be used for modelling.
So, we won’t be doing any imputation (none of the variables that would need such a treatment will be used).
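If you want a quick tabular confirmation of which variables contain missing values, naniar (already loaded above) provides miss_var_summary(); a possible sketch:

# Sketch: list only the variables that actually contain missing values
miss_var_summary(aq_bench) %>%
  filter(n_miss > 0)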
- We filter our data:
aq_bench_filtered <- aq_bench %>%
  subset(select = -c(lat, lon, dataset))
head(aq_bench_filtered)
# A tibble: 6 × 50
id country htap_region climatic_zone alt relative_alt type type_of_area
<dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 3336 Germany EUR cool_moist 12 3 backg… rural
2 3338 Germany EUR cool_moist 1 1 backg… rural
3 3339 Germany EUR cool_moist 205 66 backg… urban
4 3340 Germany EUR cool_moist 102 29 backg… urban
5 3341 Germany EUR cool_moist 45 8 backg… rural
6 3342 Germany EUR cool_moist 40 6 indus… urban
# ℹ 42 more variables: water_25km <dbl>,
# evergreen_needleleaf_forest_25km <dbl>,
# evergreen_broadleaf_forest_25km <dbl>,
# deciduous_needleleaf_forest_25km <dbl>,
# deciduous_broadleaf_forest_25km <dbl>, mixed_forest_25km <dbl>,
# closed_shrublands_25km <dbl>, open_shrublands_25km <dbl>,
# woody_savannas_25km <dbl>, savannas_25km <dbl>, grasslands_25km <dbl>, …
The 5 countries with the lowest number of rows in the dataset are:
aq_bench_filtered %>%
  group_by(country) %>%
  tally(sort = TRUE) %>%
  arrange(n) %>%
  head(5) %>%
  knitr::kable()
country | n |
---|---|
Algeria | 1 |
American Samoa | 1 |
Armenia | 1 |
Barbados | 1 |
Bermuda | 1 |

Returning the countries in alphabetical order is fine. Other custom orderings (as long as they are justified) are also okay.
The 5 countries with the highest number of rows in the dataset are:
aq_bench_filtered %>%
  group_by(country) %>%
  tally(sort = TRUE) %>%
  head(5) %>%
  knitr::kable()
country | n |
---|---|
United States of America | 1390 |
Japan | 1182 |
Spain | 440 |
France | 404 |
Italy | 361 |

Things are clearer for this ordering.
We could explore this further but it seems relatively clear that most of the data only comes from very few parts of the world (mainly North America, Europe and East Asia). This calls into question:
- the actual representativeness of this dataset
- whether including country as a predictor (at least without any additional precautions) is a good idea given the extreme imbalance of this variable.
The median NO2 per area is simply calculated as:
aq_bench_filtered %>%
  group_by(type_of_area) %>%
  summarise(Median = median(no2_column)) %>%
  knitr::kable()
type_of_area | Median |
---|---|
remote | 0.81 |
rural | 2.32 |
suburban | 3.44 |
unknown | 2.46 |
urban | 4.05 |
NO2 is an air pollutant that forms when fossil fuels are burnt at high temperatures (e.g. in vehicles such as cars, motorcycles and buses). As such, it’s not exactly surprising to see higher concentrations of this pollutant in urban and suburban areas compared to rural and remote areas: urban and suburban areas tend to have more manufacturing plants (which generate NO2!) as well as more traffic from vehicles likely to generate NO2 (e.g. cars, buses, etc.).
- Mileage varies here, of course, but one possible plot would be the following:
aq_bench_filtered %>%
  ggplot(aes(population_density, o3_average_values, color = o3_average_values)) +
  geom_point() +
  theme_scatter() +
  scale_color_viridis() +
  labs(y = "Average O3 levels (in ppb)",
       x = bquote('Population Density (in person/km'^2~')')) +
  ggtitle("The relationship between average O3 levels and population density is anything but linear!")
The plot shows quite clearly that the relationship between average ozone levels and population density is not linear. We can see that the average level of ozone has the highest variance when population density is very low and decreases in variance as population density increases.
The non-linear relationship is a tad more obvious when we redo the same plot as earlier with non-linear interpolation added (loess smoother or polynomial regression):
aq_bench_filtered %>%
  ggplot(aes(population_density, o3_average_values, color = o3_average_values)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, color = "blue", se = TRUE) +
  theme_scatter() +
  scale_color_viridis() +
  labs(y = "Average O3 levels (in ppb)",
       x = bquote('Population Density (in person/km'^2~')')) +
  ggtitle("The relationship between average O3 levels and population density is anything but linear!")
aq_bench_filtered %>%
  ggplot(aes(population_density, o3_average_values, color = o3_average_values)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "blue", se = TRUE) +
  theme_scatter() +
  scale_color_viridis() +
  labs(y = "Average O3 levels (in ppb)",
       x = bquote('Population Density (in person/km'^2~')')) +
  ggtitle("The relationship between average O3 levels and population density is anything but linear!")
And if you’re still not convinced, you can apply a log transformation to population density and redo the earlier plot: only then does the relationship between the two variables become approximately linear (though still not quite)!
aq_bench_filtered %>%
  ggplot(aes(log(population_density), o3_average_values, color = o3_average_values)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, color = "blue", se = TRUE) +
  theme_scatter() +
  scale_color_viridis() +
  labs(y = "Average O3 levels (in ppb)",
       x = bquote('Log(Population Density)')) +
  ggtitle("It takes a log transformation to make the relationship between average O3 levels and population density approximately linear, though not quite!")
Part 2
- Let’s construct the baseline linear model.
We split the data into training and test set.
set.seed(42)
split_grid <- initial_split(aq_bench_filtered, prop = 0.75)
df_train <- training(split_grid)
df_test <- testing(split_grid)
As predictors for this model, we’ll be using all variables except:
- the variables whose names start with `o3` (except the outcome variable `o3_average_values`): such variables are likely to be too highly correlated with the outcome variable `o3_average_values`, lead to issues of circular reasoning and induce potential data leakage
- the `id` variable, as it’s not predictive
- the `country` variable (the prediction should not be tied to a particular country)
We create a recipe to pre-process the data:
reg_rec <- recipe(o3_average_values ~ . , data = df_train) %>%
  update_role(starts_with("o3_"), new_role = "id") %>%
  update_role(o3_average_values, new_role = "outcome") %>%
  update_role(c("id", "country"), new_role = "id") %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(htap_region, climatic_zone, type_of_area, type)
`step_novel()` handles any unseen levels of nominal predictor variables.
We then write the linear model specification, create the workflow and fit the model:
lm_model <-
  linear_reg() %>%
  set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <-
  workflow() %>%
  add_recipe(reg_rec) %>%
  add_model(lm_model)

model <-
  wflow %>%
  fit(data = df_train)
Let’s evaluate the model performance:
reg_metrics <- metric_set(rsq, rmse, mae, mape)

model %>%
  augment(new_data = df_test) %>%
  reg_metrics(truth = o3_average_values, estimate = .pred) %>%
  knitr::kable()
.metric | .estimator | .estimate |
---|---|---|
rsq | standard | 0.620273 |
rmse | standard | 4.036346 |
mae | standard | 3.001378 |
mape | standard | 11.972501 |
Warning message:
In predict.lm(object = object$fit, newdata = new_data, type = "response", :
prediction from rank-deficient fit; consider predict(., rankdeficient="NA")
\(R^2\) is in the middle range at 0.620 (62% of the variance of the data is explained). RMSE shows the model to be off by 4.04 parts per billion (ppb), while MAE shows it to be off by 3.00 ppb on average. The range of o3_average_values in the (training) set is between 3.6212 and 65.5899 ppb, with a standard deviation of \(\approx 6.403\). This would suggest that the RMSE and MAE for this model are acceptable.
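These figures can be checked directly on the training set; a possible sketch:

# Sketch: scale of the outcome in the training set, for context on RMSE/MAE
df_train %>%
  summarise(min_o3 = min(o3_average_values),
            max_o3 = max(o3_average_values),
            sd_o3  = sd(o3_average_values))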
Let’s have a look at the residual plot for this model
model %>%
  augment(new_data = df_test) %>%
  ggplot(aes(.pred, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw() +
  labs(x = "Predicted average O3 levels (ppb)", y = "Residuals")
Here are a few observations about the residuals plot:
- The residuals appear to be distributed roughly symmetrically around the zero line. This suggests that the model has no systematic bias in its predictions, meaning it is not consistently overpredicting or underpredicting.
- There doesn’t seem to be a clear pattern (e.g., curved or systematic trends) in the residuals. This indicates that the linear model assumptions may hold, and the errors are independent.
- There seems to be a fairly consistent spread of residuals across the range of predicted values, particularly in the center of the plot. However, at the higher predicted values (closer to 60), there is a slight indication of higher variance, as the residuals spread out more. This might indicate heteroscedasticity. We might want to formally test for this (see the sketch after this list)! Outliers with large residuals further imply the presence of extreme conditions or unaccounted-for variables that the model struggles to represent accurately.
- A few residuals are quite far from zero, especially in the high and low predicted values range. These could be outliers or data points where the model performed poorly.
- There are isolated residuals that are noticeably farther from the majority of the points. These might need further investigation to check whether they are due to data entry errors, extreme leverage points, or other issues.
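If we wanted to formally test for heteroscedasticity, one option is a Breusch-Pagan test on the underlying lm fit. This is only a sketch and assumes the lmtest package is installed (it is not loaded in the setup above):

# Sketch: Breusch-Pagan test on the engine-level lm object (lmtest assumed installed)
model %>%
  extract_fit_engine() %>%
  lmtest::bptest()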
Let’s see how the model performs on the 5 countries with the highest number of rows in the dataset:
metrics_per_country <- model %>%
  augment(new_data = df_test) %>%
  filter(country %in% c('United States of America', 'Japan', 'Spain', 'France', 'Italy')) %>%
  group_by(country) %>%
  reg_metrics(truth = o3_average_values, estimate = .pred)
metrics_per_country %>%
  rename(Value = .estimate, Metric = .metric) %>%
  filter(Metric == 'rmse') %>%
  ggplot(aes(Metric, Value, fill = country)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = 'Pastel1')
It looks as if the model performs best for Japan and worst for Italy based on RMSE. One could think this was a function of the number of rows per country in the dataset, as Italy has the lowest number of rows among the countries examined; however, the second worst performance is for the USA, which is the country with the most rows in the dataset.
- We have our baseline model in place. It’s time to try and improve it. In this part, we will first test two new models: a LASSO model and an XGBoost model. We keep the same set of features as for the baseline model.
If we did want to conduct feature engineering, here are some points and explanations we could have provided:
- After creating the baseline, we can conduct correlation analysis on the fitted baseline model augmented with the training data. This allows us to investigate the relationships between the predictors and their associations with the target variable (o3_average_values).
- We find that there is clear variation in the correlation of predictors with the target variable.
- Using domain knowledge, we will aim to improve upon the baseline model through transforming our raw features into a more effective set of inputs for our chosen model.
- e.g. after examining the factors affecting O₃ levels, we found that EU countries had implemented O₃ reduction targets before the 2010-2014 period when the AQ_Bench dataset was compiled (More Efforts Required to Reduce Ozone Pollution in Europe, n.d.). To reflect this, we can introduce a binary variable (EU) to indicate whether a country was an EU member.
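A sketch of that last idea follows; `eu_members` is a hypothetical character vector of EU member-state names that you would need to define yourself:

# Sketch: add a binary EU membership indicator (eu_members is hypothetical)
aq_bench_fe <- aq_bench_filtered %>%
  mutate(EU = as.integer(country %in% eu_members))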
The aim with LASSO is to prune out irrelevant features (our model would still be linear though). With XGBoost, we want to test a non-linear model which requires minimal pre-processing, is rather explainable, is scalable and known to be rather less sensitive to overfitting.
To improve the generalizability of our results and reduce the risk of overfitting, we will also be using 5-fold cross-validation to evaluate our results.
set.seed(42)
#setting-up cross validation
ozone_cv <- vfold_cv(df_train, v = 5)
LASSO model
lasso_model <-
  linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_grid <-
  crossing(penalty = c(0.001, 0.002, 0.005, 0.1, 0.2, 0.5, 1, 1.5, 2))

lasso_wf <-
  workflow() %>%
  add_recipe(reg_rec) %>%
  add_model(lasso_model)
# Register multiple cores for faster processing
doParallel::registerDoParallel()

# Tune the model, just use one evaluation metric
lasso_cv_fit <-
  tune_grid(
    lasso_wf,
    resamples = ozone_cv,
    metrics = metric_set(rmse),
    grid = lasso_grid
  )
# Collect the metrics from the model and plot the results
lasso_cv_fit %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(as.factor(penalty), mean, group = 1)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_line() +
  labs(x = "Penalty", y = "Mean Cross-Validated RMSE")
lowest_rmse <- select_best(lasso_cv_fit, metric = "rmse")
lowest_rmse
# A tibble: 1 × 2
penalty .config
<dbl> <chr>
1 0.005 Preprocessor1_Model3
lasso_final_fit <-
  lasso_wf %>%
  finalize_workflow(lowest_rmse) %>%
  fit(df_train)

lasso_final_fit %>%
  augment(new_data = df_test) %>%
  reg_metrics(o3_average_values, .pred) %>%
  knitr::kable()
.metric | .estimator | .estimate |
---|---|---|
rsq | standard | 0.6194571 |
rmse | standard | 4.0406143 |
mae | standard | 3.0048991 |
mape | standard | 11.9981317 |
The best penalty setting for LASSO is 0.005. However, this model doesn’t improve on the performance of the baseline linear regression model: \(R^2\) is a very similar 0.619, RMSE is essentially the same at 4.04 ppb and the same can be said for MAE at 3.00 ppb. It looks as if there isn’t much scope for LASSO to improve the model by pruning the feature set further, and we’re also hitting the limits of a linear model built on untransformed features with no interactions.
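One way to verify how little pruning happened is to inspect the fitted LASSO coefficients and count how many were shrunk exactly to zero; a possible sketch:

# Sketch: coefficients shrunk to zero by LASSO at the selected penalty
lasso_final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(estimate == 0)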
Let’s see if XGBoost does any better.
XGBoost Model
We’ll tune both the `trees` and `learn_rate` parameters for the XGBoost model.
We first define the model specification and workflow as well as grid search parameters.
xgb_model <-
  boost_tree(learn_rate = tune(), trees = tune(), mtry = .cols()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

xgb_grid <-
  crossing(learn_rate = c(0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1),
           trees = c(150, 250, 500, 750, 1000))

xgb_wf <-
  workflow() %>%
  add_recipe(reg_rec) %>%
  add_model(xgb_model)
We then tune the model:
# Register multiple cores for faster processing
doParallel::registerDoParallel()

# Tune the model, just use one evaluation metric (RMSE)
xgb_cv_fit <-
  tune_grid(
    xgb_wf,
    resamples = ozone_cv,
    metrics = metric_set(rmse),
    grid = xgb_grid
  )
We then plot the results of the hyperparameter tuning and select the hyperparameters that produce the best RMSE.
# Collect the metrics from the model and plot the result
g <- xgb_cv_fit %>%
  collect_metrics() %>%
  mutate(max = mean == max(mean))

g %>% ggplot(aes(x = learn_rate, y = mean, color = factor(trees), group = factor(trees))) +
  geom_line(linewidth = 1) +             # Thicker lines for visibility
  geom_point(size = 2) +                 # Points to emphasize data points
  labs(title = "Mean Performance vs Learn Rate by Number of Trees",
       x = "Learn Rate",
       y = "Mean RMSE",
       color = "Number of Trees") +      # Legend title for trees
  scale_color_brewer(palette = "Set2") + # Use a color palette for distinct colors
  theme_minimal(base_size = 14) +        # Increase base font size for readability
  theme(
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1), # Rotate x-axis labels for clarity
    axis.text.y = element_text(size = 10),                        # Increase y-axis label size
    panel.grid.major = element_line(color = "gray80"),            # Major grid lines
    panel.grid.minor = element_blank()                            # Optional: remove minor grid lines
  ) +
  scale_x_continuous(labels = scales::number_format(accuracy = 0.0001)) + # Adjust x-axis label format
  scale_y_continuous(breaks = seq(min(g$mean), max(g$mean), by = 2))
lowest_rmse <- select_best(xgb_cv_fit, metric = "rmse")
lowest_rmse
# A tibble: 1 × 3
trees learn_rate .config
<dbl> <dbl> <chr>
1 750 0.05 Preprocessor1_Model24
The best RMSE is reached with 750 trees and a learning rate of 0.05. We fit the final model using these hyperparameters and collect the metrics:
xgb_final_fit <-
  xgb_wf %>%
  finalize_workflow(lowest_rmse) %>%
  fit(df_train)

xgb_final_fit %>%
  augment(new_data = df_test) %>%
  reg_metrics(o3_average_values, .pred) %>%
  knitr::kable()
.metric | .estimator | .estimate |
---|---|---|
rsq | standard | 0.7084506 |
rmse | standard | 3.5349506 |
mae | standard | 2.5543592 |
mape | standard | 10.2185989 |
XGBoost is an improvement on both the baseline and the LASSO model: \(R^2\) is 0.708, indicating a stronger proportion of variance explained by the model. RMSE is noticeably lower at 3.53 ppb and the same can be said of MAE at 2.55 ppb, reflecting more precise average predictions. It looks likely that the data has some non-linear patterns that XGBoost, with its tree-based gradient boosting approach, is able to exploit.
Part 3
Loading the data
water_quality <- read_csv('data/water_potability.csv')
water_quality$Potability <- as.factor(water_quality$Potability)
Rows: 3276 Columns: 10
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We convert the variable `Potability` into a factor since it’s our outcome variable and will need to be a factor to be used in our models.
Let’s have a look at the data:
skim(water_quality)
── Data Summary ────────────────────────
Values
Name water_quality
Number of rows 3276
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None
── Variable type: factor ───────────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique top_counts
1 Potability 0 1 FALSE 2 0: 1998, 1: 1278
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25
1 ph 491 0.850 7.08 1.59 0 6.09
2 Hardness 0 1 196. 32.9 47.4 177.
3 Solids 0 1 22014. 8769. 321. 15667.
4 Chloramines 0 1 7.12 1.58 0.352 6.13
5 Sulfate 781 0.762 334. 41.4 129 308.
6 Conductivity 0 1 426. 80.8 181. 366.
7 Organic_carbon 0 1 14.3 3.31 2.20 12.1
8 Trihalomethanes 162 0.951 66.4 16.2 0.738 55.8
9 Turbidity 0 1 3.97 0.780 1.45 3.44
p50 p75 p100 hist
1 7.04 8.06 14 ▁▂▇▂▁
2 197. 217. 323. ▁▂▇▃▁
3 20928. 27333. 61227. ▂▇▅▁▁
4 7.13 8.11 13.1 ▁▂▇▃▁
5 333. 360. 481. ▁▁▇▆▁
6 422. 482. 753. ▁▇▇▂▁
7 14.2 16.6 28.3 ▁▅▇▂▁
8 66.6 77.3 124 ▁▂▇▅▁
9 3.96 4.50 6.74 ▁▅▇▃▁
vis_miss(water_quality)
The pH, trihalomethanes and sulfate variables all have missing values and the missingness mechanism is MAR: we will need to perform imputation before including these variables in our models.
water_quality %>%
  count(Potability) %>%
  mutate(proportion = n / sum(n)) %>%
  knitr::kable()
Potability | n | proportion |
---|---|---|
0 | 1998 | 0.6098901 |
1 | 1278 | 0.3901099 |
Checking for class imbalance, the counts above show that there are almost twice as many instances of non-potable water as there are of potable water!
- We split the data in training and test sets
set.seed(42)
split_water <- initial_split(water_quality, prop = 0.75, strata = Potability)
train_water <- training(split_water)
test_water <- testing(split_water)
Our training/test split is a stratified split on the outcome variable to take into account the imbalanced nature of the data.
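A quick sanity check (only a sketch) that the stratification preserved the class proportions in both sets:

# Sketch: compare class proportions between training and test sets
bind_rows(train = count(train_water, Potability),
          test  = count(test_water, Potability),
          .id = "set") %>%
  group_by(set) %>%
  mutate(proportion = n / sum(n))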
Before we fit our models, we need to impute missing values for pH, trihalomethanes and sulfate. These variables, like all the other predictors in the dataset, are numeric. We will use bagged tree imputation (a more robust method of imputation, less likely to distort our data distribution). This is one step of our recipe.
water_rec <- recipe(Potability ~ .,
                    data = train_water) %>%
  step_impute_bag(all_predictors(), -all_outcomes())
Then, we write our baseline model (logistic regression) specification and combine the specification and recipe into a workflow
logit_model <-
  logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

wflow_logit <-
  workflow() %>%
  add_recipe(water_rec) %>%
  add_model(logit_model)
We then fit the model and get the confusion matrix for the training data
model_logit <-
  wflow_logit %>%
  fit(data = train_water)

model_logit %>%
  augment(new_data = train_water, type.predict = "response") %>%
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  autoplot(type = "heatmap", event_level = "second")
The model does not look particularly good already: while it produces no false positives (it classifies all instances of non-potable water correctly), it misclassifies almost all positive instances (i.e. almost all instances of potable water are classified as non-potable, potentially resulting in unnecessary water treatment and unnecessary costs). This is most likely due to the class imbalance in our dataset. We could potentially apply some oversampling and see if it improves matters. With that said, for this particular application, the opposite scenario (i.e. misclassifying non-potable water as potable) would have been much more damaging, as it would have been a safety hazard.
model_logit %>%
  augment(new_data = train_water, type.predict = "response") %>%
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  summary(event_level = "second")
# A tibble: 13 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.611
2 kap binary 0.00255
3 sens binary 0.00209
4 spec binary 1
5 ppv binary 1
6 npv binary 0.610
7 mcc binary 0.0357
8 j_index binary 0.00209
9 bal_accuracy binary 0.501
10 detection_prevalence binary 0.000814
11 precision binary 1
12 recall binary 0.00209
13 f_meas binary 0.00417
The summary metrics confirm the observations from the confusion matrix: the precision is 1, which indicates that there are no false positives; recall is extremely low at 0.00209, which indicates that almost all positive instances are misclassified; and, as a result, the F1-score is also very low. Balanced accuracy is 0.501, which indicates this classifier is no better than random. So, overall, this is a poor model.
We can opt to place more importance on our precision metric when evaluating the model’s performance, as it directly measures the proportion of correctly identified potable water instances among all instances predicted as potable. In this context, precision is particularly critical because it reflects the risk of false positives: instances where the model incorrectly predicts water to be potable when it is actually non-potable. Minimizing false positives is essential, as the consequences of such errors can be severe. False negatives (where potable water is classified as non-potable) are less damaging, since they only result in potentially wasted potable water and/or extra (unnecessary) water treatment costs.
How does the model perform on the test data?
model_logit %>%
  augment(new_data = test_water, type.predict = "response") %>%
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  autoplot(type = "heatmap", event_level = "second")
model_logit %>%
  augment(new_data = test_water, type.predict = "response") %>%
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  summary(event_level = "second")
# A tibble: 13 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.611
2 kap binary 0.00255
3 sens binary 0.00209
4 spec binary 1
5 ppv binary 1
6 npv binary 0.610
7 mcc binary 0.0357
8 j_index binary 0.00209
9 bal_accuracy binary 0.501
10 detection_prevalence binary 0.000814
11 precision binary 1
12 recall binary 0.00209
13 f_meas binary 0.00417
The model is similarly poor on the test data and all the remarks made on the training data remain valid here.
Let’s try to improve things here.
In terms of pre-processing, let’s do some oversampling with SMOTE to address the class imbalance in our dataset. SMOTE works by synthesizing new samples for the minority class (potable water in this case) by creating interpolated data points between existing minority samples. This technique increases the representation of potable water instances in the training set, ensuring the model has sufficient information to distinguish between potable and non-potable water. SMOTE preserves the variability of the data and, by not simply duplicating minority samples, reduces the risk of overfitting to the synthetic data. By applying SMOTE, we aim to improve the recall of the potable water class (minority class) without sacrificing the precision of the non-potable water class.
water_rec_oversamp <- recipe(Potability ~ .,
                             data = train_water) %>%
  step_impute_bag(all_predictors(), -all_outcomes()) %>%
  step_smote(Potability)
In terms of models, we’ll once again use an XGBoost model
xgb_model <-
  boost_tree(learn_rate = tune(), trees = tune(), mtry = .cols()) %>%
  set_mode("classification") %>%
  set_engine("xgboost")

xgb_grid <-
  crossing(learn_rate = c(0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1),
           trees = c(150, 250, 500, 750, 1000))

xgb_wf <-
  workflow() %>%
  add_recipe(water_rec_oversamp) %>%
  add_model(xgb_model)
We will once again perform cross-validation to ensure the robustness of our hyperparameter tuning results and prevent overfitting (we select the F1-score as a metric to use for hyperparameter tuning):
set.seed(42)
#setting-up cross validation
water_cv <- vfold_cv(train_water, v = 5, strata = Potability)
We tune our hyperparameters:
# Register multiple cores for faster processing
doParallel::registerDoParallel()

# Tune the model, just use one evaluation metric (F1-score)
xgb_cv_fit <-
  tune_grid(
    xgb_wf,
    resamples = water_cv,
    metrics = metric_set(f_meas),
    grid = xgb_grid
  )
We then plot the results of the hyperparameter tuning and select the hyperparameters that produce the best F1-score.
g <- xgb_cv_fit %>%
  collect_metrics() %>%
  mutate(max = mean == max(mean))

g %>% ggplot(aes(x = learn_rate, y = mean, color = factor(trees), group = factor(trees))) +
  geom_line(linewidth = 1) +             # Thicker lines for visibility
  geom_point(size = 2) +                 # Points to emphasize data points
  labs(title = "Mean Performance vs Learn Rate by Number of Trees",
       x = "Learn Rate",
       y = "Mean F1-score",
       color = "Number of Trees") +      # Legend title for trees
  scale_color_brewer(palette = "Set2") + # Use a color palette for distinct colors
  theme_minimal(base_size = 14) +        # Increase base font size for readability
  theme(
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1), # Rotate x-axis labels for clarity
    axis.text.y = element_text(size = 10),                        # Increase y-axis label size
    panel.grid.major = element_line(color = "gray80"),            # Major grid lines
    panel.grid.minor = element_blank()                            # Optional: remove minor grid lines
  ) +
  scale_x_continuous(labels = scales::number_format(accuracy = 0.0001)) + # Adjust x-axis label format
  scale_y_continuous(breaks = seq(min(g$mean), max(g$mean), by = 2))
best_fmeas <- select_best(xgb_cv_fit, metric = "f_meas")
best_fmeas
# A tibble: 1 × 3
trees learn_rate .config
<dbl> <dbl> <chr>
1 750 0.05 Preprocessor1_Model24
The best hyperparameters are 750 trees and a learning rate of 0.05. We use these hyperparameters to fit our last model and evaluate it:
xgb_final_fit <-
  xgb_wf %>%
  finalize_workflow(best_fmeas) %>%
  fit(train_water)

xgb_final_fit %>%
  augment(new_data = test_water) %>%
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  autoplot(type = "heatmap", event_level = "second")
We’ve lost precision as we’ve increased recall (as was to be expected, since there is a trade-off between precision and recall), but we are now classifying about half of the potable water instances correctly.
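One way to visualise this trade-off directly is a precision-recall curve; only a sketch, assuming the class-1 probability column produced by augment() is named .pred_1:

# Sketch: precision-recall curve for the potable (positive) class
xgb_final_fit %>%
  augment(new_data = test_water) %>%
  pr_curve(truth = Potability, .pred_1, event_level = "second") %>%
  autoplot()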
Let’s look at the summary metrics more closely:
xgb_final_fit %>%
  augment(new_data = test_water) %>%
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  summary(event_level = "second")
# A tibble: 13 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.585
2 kap binary 0.139
3 sens binary 0.503
4 spec binary 0.638
5 ppv binary 0.471
6 npv binary 0.667
7 mcc binary 0.140
8 j_index binary 0.141
9 bal_accuracy binary 0.571
10 detection_prevalence binary 0.417
11 precision binary 0.471
12 recall binary 0.503
13 f_meas binary 0.486
The F1-score has indeed improved, though at 0.486, this is hardly a particularly good model. As expected, precision has decreased significantly, which could be rather problematic for this particular application. And balanced accuracy indicates the model to be only marginally better than random. So, overall, also not a good model (and obviously not a usable one!) though it looks like an improvement on the baseline (at least based on F1-score and balanced accuracy).
Note that, if we really wanted to be particularly thorough, and if we wanted to weigh precision more than we do recall while still taking a measure of recall into account, we should evaluate our models and do our hyperparameter tuning with respect to an \(F_\beta\) score instead of the F1-score, where: \[F_\beta = (1+\beta^2)\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\cdot\mathrm{precision}+\mathrm{recall}}\]
You can specify such a score by still using the `f_meas()` function from the `yardstick` library with an extra option `beta`: to make it so that precision is emphasized over recall, `beta` should be chosen to be lower than 1 (a `beta` of 0.5, for example, weighs precision twice as much as recall). See the `f_meas()` page in the documentation for details.
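As a sketch of what that could look like (assuming a yardstick version that provides metric_tweak()), you could build an F0.5 metric and apply it to the test set:

# Sketch: an F_0.5 score, weighing precision twice as much as recall
f_meas_05 <- metric_tweak("f_meas_05", f_meas, beta = 0.5)

xgb_final_fit %>%
  augment(new_data = test_water) %>%
  f_meas_05(truth = Potability, estimate = .pred_class, event_level = "second")

Such a metric could also be passed to metric_set() and used during hyperparameter tuning.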
N.B: Additional analysis you might want to do in this part:
- examine the coefficients of the logistic regression model and their meaning
`tidy(model_logit) %>% knitr::kable()` gives a summary of the parameters associated with the logistic regression model as follows:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0642678 | 0.7066225 | 0.0909506 | 0.9275318 |
ph | 0.0068396 | 0.0282697 | 0.2419421 | 0.8088250 |
Hardness | -0.0009036 | 0.0012783 | -0.7068222 | 0.4796770 |
Solids | 0.0000077 | 0.0000048 | 1.6074163 | 0.1079631 |
Chloramines | 0.0378306 | 0.0261191 | 1.4483884 | 0.1475084 |
Sulfate | -0.0013970 | 0.0011712 | -1.1928339 | 0.2329344 |
Conductivity | -0.0000136 | 0.0005114 | -0.0265884 | 0.9787880 |
Organic_carbon | -0.0197693 | 0.0125542 | -1.5747195 | 0.1153212 |
Trihalomethanes | -0.0001119 | 0.0026459 | -0.0422873 | 0.9662697 |
Turbidity | -0.0158731 | 0.0532957 | -0.2978306 | 0.7658325 |
From this, we can, for example, deduce that a 1-unit increase in pH increases the odds of potability by roughly 0.69% (we could then try to convert this increase in the odds into probabilities).
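The arithmetic behind that statement is simply the exponentiated coefficient (a sketch):

# Sketch: multiplicative change in the odds for a 1-unit increase in pH
exp(0.0068396)   # ~1.0069, i.e. an increase of roughly 0.69% in the odds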
- look at the feature importance in our models
vip(xgb_final_fit, num_features = 10, geom = "col", mapping = aes(fill = Variable)) +
scale_fill_brewer(palette="PuBu")
vip(model_logit, num_features = 10, geom = "col", mapping = aes(fill = Variable)) +
scale_fill_brewer(palette="PuBu")
Note how logistic regression and XGBoost place the emphasis on rather different predictive variables: solids, sulfate and chloramines are important features in both models (though not in the same order), however, pH, which is the most important feature for XGBoost, barely factors in for logistic regression and the same can be said for organic carbon, which is one of the most important features in logistic regression though a rather negligible factor in XGBoost.
You could obviously conduct the same type of analysis for regression models.
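For instance, a sketch for the Part 2 baseline linear regression (the model object is still available, as it was never overwritten; this mirrors the vip() usage above and assumes vip handles workflow objects, with lm engines typically ranked by the absolute value of the t-statistic):

# Sketch: feature importance for the Part 2 baseline regression model
vip(model, num_features = 10, geom = "col")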