W08 summative possible solution

Author

Dr Ghita Berrada & Tabtim Duenger

What follows is one possible solution for the W08 summative. It is, by no means, the only possible solution and is only provided to give you an idea of what was expected!

Possible run-through of the solution

Setup

library(readr)
library(ggplot2)
library(naniar)
library(skimr)
library(dplyr)
library(viridis)
library(recipes)
library(yardstick)
library(rsample)
library(tune)
library(workflows)
library(broom)
library(parsnip)
library(doParallel)
library(tidyr)
library(xgboost)
library(themis)
library(vip)
library(ggsci)


theme_scatter <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom")
  
}

theme_line <- function() {
  
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "none")
  
}

Part 1:

We know that, in this dataset, -999 values encode missing values. We could replace them with NA after reading the data, e.g. with the naniar library’s replace_with_na_all(), but to simplify things, we choose to replace these values at loading time, thanks to the na option of the read_csv function:

aq_bench <- read_csv('data/AQbench_dataset.csv',na=c("-999"))
Rows: 5577 Columns: 53                                                                                             
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): country, htap_region, climatic_zone, type, type_of_area, dataset
dbl (47): id, lon, lat, alt, relative_alt, water_25km, evergreen_needleleaf_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
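For completeness, the naniar-based alternative mentioned above would look something like the following sketch (not evaluated here):

# Alternative: read the raw file first, then convert every -999 to NA with naniar
aq_bench_raw <- read_csv('data/AQbench_dataset.csv')
aq_bench <- aq_bench_raw %>%
  replace_with_na_all(condition = ~.x == -999)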

If we now look at our data, we see the following:

vis_miss(aq_bench)

and

skim(aq_bench)
── Data Summary ────────────────────────
                           Values  
Name                       aq_bench
Number of rows             5577    
Number of columns          53      
_______________________            
Column type frequency:             
  character                6       
  numeric                  47      
________________________           
Group variables            None    

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 country               0             1   4  24     0       70          0
2 htap_region           0             1   3   3     0       15          0
3 climatic_zone         0             1   7  16     0       13          0
4 type                  0             1   5  10     0        5          0
5 type_of_area          0             1   5   8     0        5          0
6 dataset               0             1   3   5     0        3          0

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable                           n_missing complete_rate        mean
 1 id                                              0         1     11482.     
 2 lon                                             0         1        13.5    
 3 lat                                             0         1        39.3    
 4 alt                                             0         1       265.     
 5 relative_alt                                    0         1        49.8    
 6 water_25km                                      0         1        12.7    
 7 evergreen_needleleaf_forest_25km                0         1         2.88   
 8 evergreen_broadleaf_forest_25km                 0         1         0.512  
 9 deciduous_needleleaf_forest_25km                0         1         0.00224
10 deciduous_broadleaf_forest_25km                 0         1         2.68   
11 mixed_forest_25km                               0         1        19.8    
12 closed_shrublands_25km                          0         1         0.213  
13 open_shrublands_25km                            0         1         3.10   
14 woody_savannas_25km                             0         1         5.20   
15 savannas_25km                                   0         1         0.426  
16 grasslands_25km                                 0         1         5.43   
17 permanent_wetlands_25km                         0         1         0.344  
18 croplands_25km                                  0         1        21.0    
19 urban_and_built-up_25km                         0         1        12.4    
20 cropland-natural_vegetation_mosaic_25km         0         1        11.5    
21 snow_and_ice_25km                               0         1         0.167  
22 barren_or_sparsely_vegetated_25km               0         1         0.291  
23 wheat_production                                0         1         0.899  
24 rice_production                                 0         1         1.48   
25 nox_emissions                                   0         1        14.2    
26 no2_column                                      0         1         4.08   
27 population_density                              0         1     24611.     
28 max_population_density_5km                      0         1     33295.     
29 max_population_density_25km                     0         1     57533.     
30 nightlight_1km                                  0         1        46.0    
31 nightlight_5km                                  0         1        39.8    
32 max_nightlight_25km                             0         1        59.9    
33 o3_average_values                               0         1        27.9    
34 o3_daytime_avg                               1098         0.803    32.9    
35 o3_nighttime_avg                             1087         0.805    21.9    
36 o3_median                                    1130         0.797    26.5    
37 o3_perc25                                    1130         0.797    15.8    
38 o3_perc75                                    1130         0.797    37.2    
39 o3_perc90                                    1130         0.797    47.2    
40 o3_perc98                                    1130         0.797    60.8    
41 o3_dma8eu                                    1013         0.818    58.2    
42 o3_avgdma8epax                               1025         0.816    37.9    
43 o3_drmdmax1h                                   32         0.994    54.2    
44 o3_w90                                       1087         0.805   163.     
45 o3_aot40                                     1116         0.800 16345.     
46 o3_nvgt070                                   1025         0.816     8.63   
47 o3_nvgt100                                   1087         0.805     1.52   
           sd      p0     p25       p50       p75      p100 hist 
 1  4041.     3336    8252    11732     15431       17722   ▃▅▆▅▇
 2    88.1    -171.    -76.0      7.48    127.        175.  ▂▃▇▁▆
 3    13.2     -90.0    35.4     39.8      45.8        82.5 ▁▁▁▇▂
 4   466.       -4      20       90       287        5500   ▇▁▁▁▁
 5   107.     -136       8       21        47        1826   ▇▁▁▁▁
 6    19.4       0       0        0        23.4       100   ▇▂▁▁▁
 7     9.20      0       0        0         1.6        96.4 ▇▁▁▁▁
 8     4.17      0       0        0         0          99.5 ▇▁▁▁▁
 9     0.0804    0       0        0         0           4.8 ▇▁▁▁▁
10     9.03      0       0        0         0          94.5 ▇▁▁▁▁
11    21.7       0       1.1     12        33.8        99.1 ▇▃▂▁▁
12     1.07      0       0        0         0          16.7 ▇▁▁▁▁
13    12.1       0       0        0         0          99.7 ▇▁▁▁▁
14    11.3       0       0        0         3.9        85.6 ▇▁▁▁▁
15     2.80      0       0        0         0          54.6 ▇▁▁▁▁
16    15.5       0       0        0         2.6       100   ▇▁▁▁▁
17     1.75      0       0        0         0          41   ▇▁▁▁▁
18    22.8       0       3.7     12.8      30.7        99.7 ▇▂▁▁▁
19    16.8       0       1.5      5.4      16.2        91.8 ▇▁▁▁▁
20    15.5       0       0        4.3      17.5        85.4 ▇▂▁▁▁
21     3.29      0       0        0         0         100   ▇▁▁▁▁
22     3.22      0       0        0         0         100   ▇▁▁▁▁
23     1.79      0       0        0.079     0.946      22.0 ▇▁▁▁▁
24     4.45      0       0        0         0          39.3 ▇▁▁▁▁
25    28.9       0       1.26     4.52     16.2       967.  ▇▁▁▁▁
26     3.46      0       1.95     2.93      4.88       20.8 ▇▂▁▁▁
27 48021.        0    1480     7422     23939      940909   ▇▁▁▁▁
28 59626.        0    2515    11477     32349      973786   ▇▁▁▁▁
29 85087.        0    9180    24824     58521     1669078   ▇▁▁▁▁
30    21.7       0      30       59        63          63   ▂▁▁▁▇
31    20.7       0      22.0     45.1      59.3        63   ▃▂▂▃▇
32     8.83      0      62       63        63          63   ▁▁▁▁▇
33     6.40      3.62   23.7     27.4      31.4        65.6 ▁▇▆▁▁
34     6.21      8.37   28.9     33.0      36.7        56.1 ▁▂▇▃▁
35     7.10      2.69   17.1     20.7      25.4        54.7 ▁▇▃▁▁
36     6.91      2.51   22.0     26.2      30.5        54.7 ▁▅▇▂▁
37     7.81      0.2    10.2     14.8      20.2        48   ▃▇▃▁▁
38     6.34     11.5    33.4     37.3      41          61.6 ▁▂▇▂▁
39     7.05     16.3    42.8     47.6      52          74.9 ▁▂▇▃▁
40     9.58     22.1    55.5     60.8      66.8       115.  ▁▆▇▁▁
41     9.17     20.4    53.0     58.4      64.1       108.  ▁▅▇▁▁
42     6.40     10.7    33.7     38.3      42.1        71.2 ▁▃▇▁▁
43     8.27     16.6    49.5     54.3      58.7       102.  ▁▃▇▁▁
44   115.        1.35   78.0    132.      225.        738.  ▇▅▂▁▁
45 10262.        0    8943.   14985.    22370.      72431.  ▇▇▂▁▁
46    11.9       0       1        4.25     12         180.  ▇▁▁▁▁
47     3.99      0       0        0         1          93   ▇▁▁▁▁

The dataset doesn’t have any variables with missing values other than the o3_* variables, which are likely to be highly redundant variations of the outcome variable o3_average_values and should not, for simplicity (and to avoid unintentional data leakage!), be used for modeling.

So, we won’t be doing any imputation (all the variables that would need such a treatment won’t be used).
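If you want to check that claim programmatically, a quick sketch using naniar’s miss_var_summary() lists only the variables that contain missing values:

aq_bench %>%
  miss_var_summary() %>%   # one row per variable with its missing-value count
  filter(n_miss > 0)       # keep only the variables with at least one missing value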

  1. We filter our data (a more idiomatic dplyr alternative is shown after the output below):
aq_bench_filtered <- aq_bench %>%
                    subset(select=-c(lat,lon,dataset))
head(aq_bench_filtered)
# A tibble: 6 × 50
     id country htap_region climatic_zone   alt relative_alt type   type_of_area
  <dbl> <chr>   <chr>       <chr>         <dbl>        <dbl> <chr>  <chr>       
1  3336 Germany EUR         cool_moist       12            3 backg… rural       
2  3338 Germany EUR         cool_moist        1            1 backg… rural       
3  3339 Germany EUR         cool_moist      205           66 backg… urban       
4  3340 Germany EUR         cool_moist      102           29 backg… urban       
5  3341 Germany EUR         cool_moist       45            8 backg… rural       
6  3342 Germany EUR         cool_moist       40            6 indus… urban       
# ℹ 42 more variables: water_25km <dbl>,
#   evergreen_needleleaf_forest_25km <dbl>,
#   evergreen_broadleaf_forest_25km <dbl>,
#   deciduous_needleleaf_forest_25km <dbl>,
#   deciduous_broadleaf_forest_25km <dbl>, mixed_forest_25km <dbl>,
#   closed_shrublands_25km <dbl>, open_shrublands_25km <dbl>,
#   woody_savannas_25km <dbl>, savannas_25km <dbl>, grasslands_25km <dbl>, …
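As an aside, the same filtering can be written with dplyr’s select(), which is equivalent to the subset() call above and arguably more idiomatic in a tidyverse pipeline:

# Equivalent dplyr version of the filtering step
aq_bench_filtered <- aq_bench %>%
  select(-c(lat, lon, dataset))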
  2. The 5 countries with the lowest number of rows in the dataset are:

    aq_bench_filtered %>%
    group_by(country)%>%
    tally(sort=TRUE) %>%
    arrange(n) %>%
    head(5) %>% knitr::kable()
    country n
    Algeria 1
    American Samoa 1
    Armenia 1
    Barbados 1
    Bermuda 1

    Returning the countries in alphabetical order is fine. Other custom orderings (as long as they are justified) are also okay.

    The 5 countries with the highest number of rows in the dataset are:

    aq_bench_filtered %>%
    group_by(country)%>%
    tally(sort=TRUE) %>%
    head(5) %>% knitr::kable()
    country n
    United States of America 1390
    Japan 1182
    Spain 440
    France 404
    Italy 361

    The ordering is unambiguous this time, since the counts are all distinct.

    We could explore this further but it seems relatively clear that most of the data only comes from very few parts of the world (mainly North America, Europe and East Asia). This calls into question:

    • the actual representativeness of this dataset
    • whether including country as a predictor (at least without any additional precautions) is a good idea given the extreme imbalance of this variable.
  3. The median NO2 per area is simply calculated as:

aq_bench_filtered %>%
  group_by(type_of_area)%>%
  summarise(Median=median(no2_column)) %>%
  knitr::kable()
type_of_area Median
remote 0.81
rural 2.32
suburban 3.44
unknown 2.46
urban 4.05

NO2 is an air pollutant that forms when fossil fuels are burnt at high temperatures (e.g. in vehicles such as cars, motorcycles and buses). As such, it’s not exactly surprising to see higher concentrations of this pollutant in urban and suburban areas compared to rural and remote areas: urban and suburban areas tend to have more manufacturing plants (which generate NO2!) as well as more traffic from vehicles likely to generate NO2 (e.g. cars, buses, etc.).

  4. Mileage varies here, of course, but one possible plot would be the following:
aq_bench_filtered %>% 
  ggplot(aes(population_density, o3_average_values, color = o3_average_values)) +
  geom_point() +
  theme_scatter()+
  scale_color_viridis() +
  labs(y = "Average O3 levels (in ppb)", x= bquote('Population Density (in person/km'^2~')'))+
  ggtitle("The relationship between average O3 levels and population density is all but linear!")

The plot shows quite clearly that the relationship between average ozone levels and population density is not linear. We can see that the average level of ozone has the highest variance when population density is very low and decreases in variance as population density increases.

The non-linear relationship is a tad more obvious when we redo the same plot as earlier with a non-linear fit added (loess smoother or polynomial regression):

aq_bench_filtered %>% 
  ggplot(aes(population_density, o3_average_values, color = o3_average_values)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, color = "blue", se = TRUE) + 
  theme_scatter() +
  scale_color_viridis() +
  labs(y = "Average O3 levels (in ppb)", 
       x = bquote('Population Density (in person/km'^2~')')) +
  ggtitle("The relationship between average O3 levels and population density is all but linear!")

aq_bench_filtered %>% 
  ggplot(aes(population_density, o3_average_values, color = o3_average_values)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x,2), color = "blue", se = TRUE) + 
  theme_scatter() +
  scale_color_viridis() +
  labs(y = "Average O3 levels (in ppb)", 
       x = bquote('Population Density (in person/km'^2~')')) +
  ggtitle("The relationship between average O3 levels and population density is all but linear!")

And if you’re not convinced yet, you can apply a log transformation to population density and redo the earlier plot: only then does the relationship between the two variables become approximately linear (and even then, not quite)!

aq_bench_filtered %>% 
  ggplot(aes(log(population_density), o3_average_values, color = o3_average_values)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, color = "blue", se = TRUE) + 
  theme_scatter() +
  scale_color_viridis() +
  labs(y = "Average O3 levels (in ppb)", 
       x = bquote('Log(Population Density)')) +
  ggtitle("It takes a log transformation to make the relationship between average O3 levels and population density approximately linear, though not quite!")

Part 2

  1. Let’s construct the baseline linear model.

We split the data into training and test set.

set.seed(42)
split_grid <- initial_split(aq_bench_filtered, prop = 0.75)
df_train <- training(split_grid)
df_test <- testing(split_grid)

As predictors for this model, we’ll be using all variables except:

  • the variables whose names start with o3 (except the outcome variable o3_average_values): such variables are likely to be highly correlated with the outcome variable o3_average_values, leading to circular reasoning and potential data leakage
  • the id variable as it’s not predictive
  • the country variable (the prediction should not be tied to a particular country)

We create a recipe to pre-process the data:

reg_rec <- recipe(o3_average_values ~ . , data = df_train) %>%
           update_role(starts_with("o3_"),new_role="id") %>%
           update_role(o3_average_values,new_role="outcome") %>%
           update_role(c("id","country"),new_role="id") %>%
           step_novel(all_nominal_predictors())%>%
           step_dummy(htap_region,climatic_zone,type_of_area,type)

step_novel() handles any levels of the nominal predictors that appear in new data but were not seen during training.
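If you want to sanity-check the recipe, a quick sketch is to prep it and bake the training data, which shows the columns (including the dummy variables created by step_dummy) that the model will actually see:

reg_rec %>%
  prep() %>%                  # estimate the pre-processing steps on df_train
  bake(new_data = NULL) %>%   # apply them to the training data
  glimpse()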

We then write the linear model specification, create the workflow and fit the model:

lm_model <- 
  linear_reg() %>%
  set_engine("lm")

# Create a workflow to add the recipe and the model
wflow <- 
  workflow() %>% 
  add_recipe(reg_rec) %>% 
  add_model(lm_model)


model <- 
  wflow %>% 
  fit(data = df_train)

Let’s evaluate the model performance:

reg_metrics <- metric_set(rsq, rmse, mae, mape)
model %>%
  augment(new_data=df_test)%>%
  reg_metrics(truth = o3_average_values, estimate = .pred) %>%
  knitr::kable()
.metric .estimator .estimate
rsq standard 0.620273
rmse standard 4.036346
mae standard 3.001378
mape standard 11.972501
Warning message:
In predict.lm(object = object$fit, newdata = new_data, type = "response",  :
  prediction from rank-deficient fit; consider predict(., rankdeficient="NA")

\(R^2\) is in the middle range at 0.620 (62% of the variance of the data is explained). RMSE shows the model to be off by 4.04 parts per billion (ppb) while MAE shows it to be off by 3.00 ppb. The range of o3_average_values in the (training) set is between 3.6212 and 65.5899 ppb with a standard deviation of \(\approxeq 6.403\). This would suggest that the RMSE and MAE for this model are acceptable.
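For reference, the training-set figures quoted above can be computed as follows (a sketch):

df_train %>%
  summarise(min_o3 = min(o3_average_values),
            max_o3 = max(o3_average_values),
            sd_o3  = sd(o3_average_values))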

Let’s have a look at the residual plot for this model

model %>% 
  augment(new_data = df_test) %>%
  ggplot(aes(.pred, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw() +
  labs(x = "Predicted average O3 levels (ppb)", y = "Residuals")

Here are a few observations about the residuals plot:

  • The residuals appear to be distributed roughly symmetrically around the zero line. This suggests that the model has no systematic bias in its predictions, meaning it is not consistently overpredicting or underpredicting.
  • There doesn’t seem to be a clear pattern (e.g., curved or systematic trends) in the residuals. This suggests that the linearity assumption of the model may hold and that the errors are independent.
  • There seems to be a fairly consistent spread of residuals across the range of predicted values, particularly in the center of the plot. However, at the higher predicted values (closer to 60), there is a slight indication of higher variance, as the residuals spread out more. This might indicate heteroscedasticity; we might want to test for this (see the sketch after this list)! Outliers with large residuals further suggest the presence of extreme conditions or unaccounted-for variables that the model struggles to represent accurately.
  • A few residuals are quite far from zero, especially in the high and low predicted values range. These could be outliers or data points where the model performed poorly.
  • There are isolated residuals that are noticeably farther from the majority of the points. These might need further investigation to check whether they are due to data entry errors, extreme leverage points, or other issues.
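As mentioned in the list above, one way to formally test for heteroscedasticity would be a Breusch-Pagan test on the underlying lm fit; a sketch, assuming the lmtest package is installed (it is not loaded in the Setup section):

library(lmtest)

model %>%
  extract_fit_engine() %>%  # pull the underlying lm object out of the fitted workflow
  bptest()                  # Breusch-Pagan test: a small p-value suggests heteroscedasticity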

Let’s see how the model performs on the 5 countries with the highest number of rows in the dataset:

metrics_per_country <- model %>%
  augment(new_data=df_test)%>%
  filter(country %in% c('United States of America','Japan','Spain','France','Italy'))%>%
  group_by(country)%>%
  reg_metrics(truth = o3_average_values, estimate = .pred)

metrics_per_country %>%
  rename(Value=.estimate,Metric=.metric)%>%
  filter(Metric=='rmse')%>%
  ggplot(aes(Metric,Value,fill=country))+
  geom_bar(stat="identity", position = "dodge")+
  scale_fill_brewer(palette='Pastel1')

It looks as if the model performs best for Japan and worst for Italy based on RMSE. One could think this is a function of the number of rows per country in the dataset, as Italy has the fewest rows among the countries examined; however, the second-worst performance is for the USA, which has the most rows in the dataset.

  2. We have our baseline model in place. It’s time to try and improve it. In this part, we will first test two new models: a LASSO model and an XGBoost model. We keep the same set of features as for the baseline model.

If we did want to conduct feature engineering, here are some points and explanations we could have provided:

  • After creating the baseline, we can conduct correlation analysis on the fitted baseline model augmented with the training data. This allows us to investigate the relationships between the predictors and their associations with the target variable (o3_average_values).
  • We find that there is clear variation in how strongly the predictors correlate with the target variable.
  • Using domain knowledge, we would then aim to improve upon the baseline model by transforming our raw features into a more effective set of inputs for our chosen model.
  • For example, after examining the factors affecting O₃ levels, we found that EU countries had implemented O₃ reduction targets before the 2010-2014 period when the AQ_Bench dataset was compiled (More Efforts Required to Reduce Ozone Pollution in Europe, n.d.). To reflect this, we can introduce a binary variable (EU) to indicate whether a country was an EU member (a sketch of this idea follows the list).
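Here is a minimal sketch of that last idea; the eu_countries vector and the aq_bench_eu name are purely illustrative (you would want the full list of member states for the 2010-2014 period):

# Illustrative only: a handful of member states, not the full list
eu_countries <- c("Germany", "France", "Spain", "Italy", "Poland")

aq_bench_eu <- aq_bench_filtered %>%
  mutate(EU = as.integer(country %in% eu_countries))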

The aim with LASSO is to prune out irrelevant features (our model would still be linear, though). With XGBoost, we want to test a non-linear model that requires minimal pre-processing, is fairly explainable, scales well and is known to be relatively robust to overfitting.

To improve the generalizability of our results and reduce the risk of overfitting, we will also be using 5-fold cross-validation to evaluate our results.

set.seed(42)
#setting-up cross validation
ozone_cv <- vfold_cv(df_train, v = 5)

LASSO model

lasso_model <-
  linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") 

lasso_grid <-
  crossing(penalty = c(0.001, 0.002, 0.005, 0.1, 0.2, 0.5, 1, 1.5,2))


lasso_wf <-
  workflow() %>% 
  add_recipe(reg_rec) %>%
  add_model(lasso_model) 


# Register multiple cores for faster processing

doParallel::registerDoParallel()

# Tune the model, just use one evaluation metric

lasso_cv_fit <-
  tune_grid(
    lasso_wf,
    resamples = ozone_cv,
    metrics = metric_set(rmse),
    grid = lasso_grid
  )

# Collect the metrics from the model and plot the results

lasso_cv_fit %>% 
  collect_metrics() %>%
  filter(.metric == "rmse") %>% 
  ggplot(aes(as.factor(penalty), mean, group = 1)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  theme_line() +
  labs(x = "Penalty", y = "Mean Cross-Validated RMSE")

lowest_rmse <- select_best(lasso_cv_fit, metric = "rmse")
lowest_rmse

# A tibble: 1 × 2
  penalty .config             
    <dbl> <chr>               
1   0.005 Preprocessor1_Model3
lasso_final_fit <- 
  lasso_wf %>% 
  finalize_workflow(lowest_rmse) %>% 
  fit(df_train)

lasso_final_fit %>%
  augment(new_data = df_test) %>% 
  reg_metrics(o3_average_values, .pred)%>%
  knitr::kable()
.metric .estimator .estimate
rsq standard 0.6194571
rmse standard 4.0406143
mae standard 3.0048991
mape standard 11.9981317

The best penalty setting for LASSO is 0.005. However, this model doesn’t improve on the performance of the baseline linear regression model: \(R^2\) is essentially unchanged at 0.619, RMSE is the same at 4.04 ppb and the same can be said for MAE at 3.00 ppb. It looks as if there isn’t much scope for LASSO to reduce the feature set further and improve the model, and we are also hitting the limits of a purely linear model with no feature transformations or interactions.
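To see how little pruning LASSO actually did at this penalty, one option (a sketch) is to tidy the fitted glmnet model and look for coefficients shrunk exactly to zero:

lasso_final_fit %>%
  extract_fit_parsnip() %>%  # pull the parsnip/glmnet fit out of the workflow
  tidy() %>%                 # coefficients at the selected penalty
  filter(estimate == 0)      # the features LASSO dropped entirely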

Let’s see if XGBoost does any better.

XGBoost Model

We’ll tune both the trees and learn_rate parameters for the XGBoost model.

We first define the model specification and workflow as well as grid search parameters.

xgb_model <-
  boost_tree(learn_rate = tune(),trees=tune(), mtry=.cols()) %>%
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgb_grid <-
  crossing(learn_rate = c(0.0001, 0.001,0.005, 0.01,0.05, 0.1, 0.2, 0.3,0.5,0.7,1),trees=c(150,250,500,750,1000))


xgb_wf <-
  workflow() %>% 
  add_recipe(reg_rec) %>%
  add_model(xgb_model) 

We then tune the model:

# Register multiple cores for faster processing

doParallel::registerDoParallel()

# Tune the model, just use one evaluation metric (RMSE)

xgb_cv_fit <-
  tune_grid(
    xgb_wf,
    resamples = ozone_cv,
    metrics = metric_set(rmse),
    grid = xgb_grid
  )

We then plot the results of the hyperparameter tuning and select the hyperparameters that produce the best RMSE.

# Collect the metrics from the model and plot the result

g<- xgb_cv_fit %>% 
  collect_metrics() %>% 
  mutate(max = mean == max(mean))

g %>% ggplot(aes(x = learn_rate, y = mean, color = factor(trees), group = factor(trees))) +
  geom_line(linewidth = 1) +  # Thicker lines for visibility
  geom_point(size = 2) +  # Points to emphasize data points
  labs(title = "Mean Performance vs Learn Rate by Number of Trees",
       x = "Learn Rate",
       y = "Mean RMSE",
       color = "Number of Trees") +  # Legend title for trees
  scale_color_brewer(palette = "Set2") +  # Use a color palette for distinct colors
  theme_minimal(base_size = 14) +  # Increase base font size for readability
  theme(
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),  # Rotate x-axis labels for clarity
    axis.text.y = element_text(size = 10),  # Increase y-axis label size
    panel.grid.major = element_line(color = "gray80"),  # Major grid lines
    panel.grid.minor = element_blank()  # Optional: Remove minor grid lines
  ) +
  scale_x_continuous(labels = scales::number_format(accuracy = 0.0001)) +  # Adjust x-axis label format
  scale_y_continuous(breaks = seq(min(g$mean), max(g$mean), by = 2))

lowest_rmse <- select_best(xgb_cv_fit, metric = "rmse")
lowest_rmse

# A tibble: 1 × 3
  trees learn_rate .config              
  <dbl>      <dbl> <chr>                
1   750       0.05 Preprocessor1_Model24

The best RMSE is reached with 750 trees and a learning rate of 0.05. We fit the final model using these hyperparameters and collect metrics:

xgb_final_fit <- 
  xgb_wf %>% 
  finalize_workflow(lowest_rmse) %>% 
  fit(df_train)

xgb_final_fit %>%
  augment(new_data = df_test) %>% 
  reg_metrics(o3_average_values, .pred) %>%
  knitr::kable()
.metric .estimator .estimate
rsq standard 0.7084506
rmse standard 3.5349506
mae standard 2.5543592
mape standard 10.2185989

XGBoost is an improvement on both the baseline and the LASSO model: \(R^2\) is 0.708, indicating a stronger proportion of variance explained by the model. RMSE is a tad lower at 3.53 ppb and the same can be said of MAE at 2.55 ppb, reflecting more precise average predictions. It looks likely that the data has some non-linearity patterns that XGBoost, with its tree-based gradient boosting approach, is able to exploit.

Part 3

Loading the data

water_quality <- read_csv('data/water_potability.csv')
water_quality$Potability <- as.factor(water_quality$Potability)   
Rows: 3276 Columns: 10                                                                                             
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We convert the variable Potability into a factor since it’s our outcome variable and needs to be a factor to be used in our models.

Let’s have a look at the data:

skim(water_quality)
── Data Summary ────────────────────────
                           Values       
Name                       water_quality
Number of rows             3276         
Number of columns          10           
_______________________                 
Column type frequency:                  
  factor                   1            
  numeric                  9            
________________________                
Group variables            None         

── Variable type: factor ───────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts      
1 Potability            0             1 FALSE          2 0: 1998, 1: 1278

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable   n_missing complete_rate     mean       sd      p0      p25
1 ph                    491         0.850     7.08    1.59    0         6.09
2 Hardness                0         1       196.     32.9    47.4     177.  
3 Solids                  0         1     22014.   8769.    321.    15667.  
4 Chloramines             0         1         7.12    1.58    0.352     6.13
5 Sulfate               781         0.762   334.     41.4   129       308.  
6 Conductivity            0         1       426.     80.8   181.      366.  
7 Organic_carbon          0         1        14.3     3.31    2.20     12.1 
8 Trihalomethanes       162         0.951    66.4    16.2     0.738    55.8 
9 Turbidity               0         1         3.97    0.780   1.45      3.44
       p50      p75     p100 hist 
1     7.04     8.06    14    ▁▂▇▂▁
2   197.     217.     323.   ▁▂▇▃▁
3 20928.   27333.   61227.   ▂▇▅▁▁
4     7.13     8.11    13.1  ▁▂▇▃▁
5   333.     360.     481.   ▁▁▇▆▁
6   422.     482.     753.   ▁▇▇▂▁
7    14.2     16.6     28.3  ▁▅▇▂▁
8    66.6     77.3    124    ▁▂▇▅▁
9     3.96     4.50     6.74 ▁▅▇▃▁
vis_miss(water_quality)

The pH, trihalomethanes and sulfate variables all have missing values and the missingness mechanism is MAR: we will need to perform imputation before including these variables in our models.

Let’s also check for class imbalance:

water_quality %>%
  count(Potability) %>%
  mutate(proportion = n / sum(n)) %>%
  knitr::kable()
Potability n proportion
0 1998 0.6098901
1 1278 0.3901099

There are roughly 1.6 times as many instances of non-potable water as there are of potable water!

  1. We split the data into training and test sets
set.seed(42)
split_water <- initial_split(water_quality, prop = 0.75, strata=Potability)
train_water <- training(split_water)
test_water <- testing(split_water)

Our training/test split is a stratified split on the outcome variable to take into account the imbalanced nature of the data.
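A quick sketch to check that stratification has preserved the class proportions in both splits:

train_water %>% count(Potability) %>% mutate(prop = n / sum(n))
test_water %>% count(Potability) %>% mutate(prop = n / sum(n))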

Before we fit our models, we need to impute missing values for pH, trihalomethanes and sulfate. These variables, like all the other predictors in the dataset, are numeric. We will use bagged tree imputation (a more robust method of imputation, less likely to distort the data distribution). This is the imputation step of our recipe.

water_rec<- recipe( Potability ~ .,
                         data = train_water) %>%
  step_impute_bag(all_predictors(),-all_outcomes())

Then, we write our baseline model (logistic regression) specification and combine the specification and recipe into a workflow

logit_model <-
  logistic_reg() %>%
  set_mode("classification")%>%
  set_engine("glm") 

wflow_logit <- 
  workflow() %>% 
  add_recipe(water_rec) %>% 
  add_model(logit_model)

We then fit the model and get the confusion matrix for the training data

model_logit <- 
  wflow_logit %>% 
  fit(data = train_water)

model_logit %>% 
  augment(new_data=train_water,type.predict = "response") %>%
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  autoplot(type="heatmap",event_level="second")

The model does not look particularly good already: while it has no false positives (it classifies all instances of non-potable water correctly), it misclassifies almost all positive instances (i.e. almost all instances of potable water are misclassified as non-potable, potentially resulting in unnecessary water treatment and unnecessary costs). This is most likely due to the class imbalance in our dataset. We could potentially apply some oversampling and see if it improves matters. With that said, for this particular application, the opposite scenario, i.e. misclassifying non-potable water as potable, would have been much more damaging, as it would have been a safety hazard.

model_logit %>%
   augment(new_data=train_water,type.predict="response") %>%
   conf_mat(truth = Potability, estimate = .pred_class) %>%
   summary(event_level="second")
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary      0.611   
 2 kap                  binary      0.00255 
 3 sens                 binary      0.00209 
 4 spec                 binary      1       
 5 ppv                  binary      1       
 6 npv                  binary      0.610   
 7 mcc                  binary      0.0357  
 8 j_index              binary      0.00209 
 9 bal_accuracy         binary      0.501   
10 detection_prevalence binary      0.000814
11 precision            binary      1       
12 recall               binary      0.00209 
13 f_meas               binary      0.00417 

The summary metrics confirm the observations from the confusion matrix: precision is 1, which indicates that there are no false positives; recall is extremely low at 0.00209, which indicates that almost all positive instances are misclassified; and, as a result, the F1-score is also very low. Balanced accuracy is 0.501, which indicates this classifier is barely better than random. So, overall, this is a poor model.

We can opt to place more importance on our precision metric when evaluating the model’s performance, as it directly measures the proportion of correctly identified potable water instances among all instances predicted as potable. In this context, precision is particularly critical because it reflects the risk of false positives, i.e. instances where the model incorrectly predicts water to be potable when it is actually non-potable. Minimizing false positives is essential, as the consequences of such errors can be severe. False negatives (where potable water is classified as non-potable) are less damaging since they only result in potentially wasted potable water and/or extra (unnecessary) water treatment costs.

How does the model perform on the test data?

model_logit %>% 
  augment(new_data=test_water,type.predict = "response") %>%
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  autoplot(type="heatmap",event_level="second")

model_logit %>%
   augment(new_data=test_water,type.predict="response") %>%
   conf_mat(truth = Potability, estimate = .pred_class) %>%
   summary(event_level="second")
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary      0.611   
 2 kap                  binary      0.00255 
 3 sens                 binary      0.00209 
 4 spec                 binary      1       
 5 ppv                  binary      1       
 6 npv                  binary      0.610   
 7 mcc                  binary      0.0357  
 8 j_index              binary      0.00209 
 9 bal_accuracy         binary      0.501   
10 detection_prevalence binary      0.000814
11 precision            binary      1       
12 recall               binary      0.00209 
13 f_meas               binary      0.00417 

The model is similarly poor on the test data and all the remarks made on the training data remain valid here.

Let’s try to improve things here.

In terms of pre-processing, let’s do some oversampling with SMOTE to address the class imbalance in our dataset. SMOTE works by synthesizing new samples for the minority class (potable water in this case) by creating interpolated data points between existing minority samples. This technique increases the representation of potable water instances in the training set, ensuring the model has sufficient information to distinguish between potable and non-potable water. SMOTE preserves the variability of the data and prevents overfitting to the synthetic data by not simply duplicating minority samples. By applying SMOTE, we aim to improve the recall of the potable water class (minority class) without sacrificing the precision of the non-potable water class.

water_rec_oversamp<- recipe( Potability ~ .,
                         data = train_water) %>%
  step_impute_bag(all_predictors(),-all_outcomes()) %>%
  step_smote(Potability) 

In terms of models, we’ll once again use an XGBoost model

xgb_model <-
  boost_tree(learn_rate = tune(),trees=tune(), mtry=.cols()) %>%
  set_mode("classification") %>% 
  set_engine("xgboost") 

xgb_grid <-
  crossing(learn_rate = c(0.0001, 0.001,0.005, 0.01,0.05, 0.1, 0.2, 0.3,0.5,0.7,1),trees=c(150,250,500,750,1000))


xgb_wf <-
  workflow() %>% 
  add_recipe(water_rec_oversamp) %>%
  add_model(xgb_model) 

We will once again perform cross-validation to ensure the robustness of our hyperparameter tuning results and prevent overfitting (we select the F1-score as a metric to use for hyperparameter tuning):

set.seed(42)
#setting-up cross validation
water_cv <- vfold_cv(train_water, v = 5, strata=Potability)

We tune our hyperparameters:

# Register multiple cores for faster processing

doParallel::registerDoParallel()


# Tune the model, just use one evaluation metric (f_meas)

xgb_cv_fit <-
  tune_grid(
    xgb_wf,
    resamples = water_cv,
    metrics = metric_set(f_meas),
    grid = xgb_grid
  )

We then plot the results of the hyperparameter tuning and select the hyperparameters that produce the best F1-score.

g<- xgb_cv_fit %>% 
  collect_metrics() %>% 
  mutate(max = mean == max(mean))

g %>% ggplot(aes(x = learn_rate, y = mean, color = factor(trees), group = factor(trees))) +
  geom_line(linewidth = 1) +  # Thicker lines for visibility
  geom_point(size = 2) +  # Points to emphasize data points
  labs(title = "Mean Performance vs Learn Rate by Number of Trees",
       x = "Learn Rate",
       y = "Mean F1-score",
       color = "Number of Trees") +  # Legend title for trees
  scale_color_brewer(palette = "Set2") +  # Use a color palette for distinct colors
  theme_minimal(base_size = 14) +  # Increase base font size for readability
  theme(
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),  # Rotate x-axis labels for clarity
    axis.text.y = element_text(size = 10),  # Increase y-axis label size
    panel.grid.major = element_line(color = "gray80"),  # Major grid lines
    panel.grid.minor = element_blank()  # Optional: Remove minor grid lines
  ) +
  scale_x_continuous(labels = scales::number_format(accuracy = 0.0001)) +  # Adjust x-axis label format
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5))  # F1 is on a 0-1 scale, so the 'by = 2' breaks carried over from the RMSE plot would collapse to a single tick

best_fmeas <- select_best(xgb_cv_fit, metric = "f_meas")
best_fmeas

# A tibble: 1 × 3
  trees learn_rate .config              
  <dbl>      <dbl> <chr>                
1   750       0.05 Preprocessor1_Model24

The best hyperparameters are 750 trees and a learning rate of 0.05. We use these hyperparameters to fit our last model and evaluate it:

xgb_final_fit <- 
  xgb_wf %>% 
  finalize_workflow(best_fmeas) %>% 
  fit(train_water)

xgb_final_fit %>%
  augment(new_data = test_water) %>% 
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  autoplot(type="heatmap",event_level="second")

We’ve lost precision as we’ve increased recall (as was to be expected, since precision and recall trade off against each other), but we are now classifying about half of the potable water instances correctly.

Let’s look at the summary metrics more closely:

xgb_final_fit %>%
  augment(new_data = test_water) %>% 
  conf_mat(truth = Potability, estimate = .pred_class) %>%
  summary(event_level="second")
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.585
 2 kap                  binary         0.139
 3 sens                 binary         0.503
 4 spec                 binary         0.638
 5 ppv                  binary         0.471
 6 npv                  binary         0.667
 7 mcc                  binary         0.140
 8 j_index              binary         0.141
 9 bal_accuracy         binary         0.571
10 detection_prevalence binary         0.417
11 precision            binary         0.471
12 recall               binary         0.503
13 f_meas               binary         0.486

The F1-score has indeed improved, though at 0.486, this is hardly a particularly good model. As expected, precision has decreased significantly, which could be rather problematic for this particular application. And balanced accuracy (0.571) indicates the model is only marginally better than random. So, overall, this is also not a good model (and obviously not a usable one!), though it does look like an improvement on the baseline (at least based on F1-score and balanced accuracy).

Note that, if we really wanted to be particularly thorough, and if we wanted to weigh precision more than we do recall while still taking a measure of recall into account, we should evaluate our models and do our hyperparameter tuning with respect to an \(F_\beta\) score instead of the F1-score, where: \[F_\beta = \frac{(1+\beta^2)\,\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\,\mathrm{precision}+\mathrm{recall}}\]

You can specify such a score by still using the f_meas() function from the yardstick library with an extra option beta: to make it so that precision is emphasized over recall, beta should be chosen to be lower than 1 (a beta of 0.5 for example weighs precision twice as much as recall). See the f_meas() page in the documentation for details.
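For instance, here is a sketch of how such a metric could be set up with yardstick’s metric_tweak() (f_half is an arbitrary name) and then passed to tune_grid() in place of f_meas:

# F0.5: weighs precision twice as much as recall
f_half <- metric_tweak("f_half", f_meas, beta = 0.5)

# e.g. use metrics = metric_set(f_half) in the tune_grid() call above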

N.B: Additional analysis you might want to do in this part:

  • examine the coefficients of the logistic regression model and their meaning
tidy(model_logit) %>% knitr::kable()

gives a summary of the parameters associated with the logistic regression model as follows:

term estimate std.error statistic p.value
(Intercept) 0.0642678 0.7066225 0.0909506 0.9275318
ph 0.0068396 0.0282697 0.2419421 0.8088250
Hardness -0.0009036 0.0012783 -0.7068222 0.4796770
Solids 0.0000077 0.0000048 1.6074163 0.1079631
Chloramines 0.0378306 0.0261191 1.4483884 0.1475084
Sulfate -0.0013970 0.0011712 -1.1928339 0.2329344
Conductivity -0.0000136 0.0005114 -0.0265884 0.9787880
Organic_carbon -0.0197693 0.0125542 -1.5747195 0.1153212
Trihalomethanes -0.0001119 0.0026459 -0.0422873 0.9662697
Turbidity -0.0158731 0.0532957 -0.2978306 0.7658325

From this, we can, for example, deduce that a 1-unit increase in ph increases the odds of potability by ~0.68% (we could then try to convert this odds increase into probabilities).
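The arithmetic behind that statement (a sketch):

exp(0.0068396)   # odds ratio ~1.0069, i.e. the ~0.68% increase quoted above (for small coefficients, exp(b) - 1 ≈ b)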

  • look at the feature importance in our models
vip(xgb_final_fit, num_features = 10, geom = "col", mapping = aes(fill = Variable)) +
  scale_fill_brewer(palette="PuBu")

vip(model_logit, num_features = 10, geom = "col", mapping = aes(fill = Variable)) +
  scale_fill_brewer(palette="PuBu")

Note how logistic regression and XGBoost place the emphasis on rather different predictive variables: solids, sulfate and chloramines are important features in both models (though not in the same order). However, pH, which is the most important feature for XGBoost, barely factors into the logistic regression; conversely, organic carbon is one of the most important features in the logistic regression but a rather negligible factor in XGBoost.

You could obviously conduct the same type of analysis for regression models.