A beneficial property of the lasso penalty is that it shrinks coefficients to zero. Less beneficial is that, in the process, the lasso tends to overshrink large coefficients. It has been argued that the lasso should "be considered as a variable screener rather than a model selector" (Su, Bogdan, & Candès, 2017). There appears to be a trade-off between prediction and selection: the penalty parameter value optimal for variable selection will overshrink the large coefficients, making it suboptimal for prediction, while the penalty parameter value optimal for prediction ("lambda.min") selects too many variables. The relaxed lasso and the adaptive lasso have been proposed to improve upon this problem.
When fitting prediction rule ensembles (PREs), the high false-positive selection rate of the lasso may be a nuisance, because often we want to obtain a sparse set of rules and linear terms. This vignette aims to show how the relaxed and adaptive lassos may deliver sparser ensembles, while retaining (relatively) high predictive accuracy.
An excellent discussion of consistency, predictive performance, and selection accuracy of the lasso and its variations is provided in the master's thesis of Kirkland (2014).
The relaxed lasso was originally proposed by Meinshausen (2007). Investigations of Su et al. (2017) provide insight into why the relaxed lasso is beneficial. Hastie, Tibshirani, & Tibshirani (2017) propose a simplified version of the relaxed lasso, which is implemented in package glmnet and can be employed in package pre. Hastie et al. (2017) find that best subset selection generally performs better in high signal-to-noise ratio (SNR) regimes, the lasso better in low SNR regimes, and that "the relaxed lasso [...] is the overall winner, performing just about as well as the lasso in low SNR scenarios, and as well as best subset selection in high SNR scenarios". Function pre supports use of the relaxed lasso through passing of argument relax. A short introduction to the relaxed lasso is provided in glmnet vignette "The Relaxed Lasso", accessible in R by typing vignette("relax", "glmnet").
The adaptive lasso has been proposed by Zou (2006). It applies positive weighting factors to the lasso penalty to control the bias, shrinking each coefficient with a weight inversely proportional to the size of an initial estimate of that coefficient. It thus aims to shrink small coefficients more and large coefficients less. It requires an initial estimate of the coefficients, for which OLS (if \(N > p\)) or ridge (if \(N < p\)) estimation is usually employed, to obtain a vector of weights. These weights can then be used to scale the predictor matrix or to scale the penalty; both approaches have the same effect.
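In formula form (a sketch using standard adaptive lasso notation for a Gaussian outcome, not specific to package pre), the adaptive lasso solves

\[
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} w_j \, |\beta_j|, \qquad w_j = \frac{1}{|\hat{\beta}_j^{\,\text{init}}|},
\]

where \(\hat{\beta}^{\,\text{init}}\) denotes the initial OLS or ridge estimate.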
Function pre allows for adaptive lasso estimation through specification of argument ad.alpha. It first uses ridge regression for computing the penalty weights. In principle, any elastic net solution can be used, but use of ridge is recommended. Other solutions can be obtained by specifying the ad.alpha and ad.penalty arguments: lasso regression can be used by specifying ad.alpha = 1, and OLS can be used by specifying ad.penalty = 0. Next, the inverses of the absolute values of the estimated coefficients are supplied as penalty factors to the cv.glmnet function. For the initial and final estimates, the same fold assignments are used in the cross-validation.
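This mechanism can be illustrated directly with glmnet (a sketch on the raw airquality predictors rather than on rules; it is not the exact code pre runs internally, but it uses the same penalty.factor and foldid arguments of cv.glmnet):

library("glmnet")
aq <- airquality[complete.cases(airquality), ]
x <- model.matrix(Ozone ~ ., data = aq)[ , -1]
y <- aq$Ozone

## use the same fold assignments for the initial and the final fit
set.seed(42)
foldid <- sample(rep(1:10, length.out = nrow(x)))

## step 1: initial ridge fit (alpha = 0) to obtain coefficient estimates
init <- cv.glmnet(x, y, alpha = 0, foldid = foldid)
w <- 1 / abs(as.numeric(coef(init, s = "lambda.min"))[-1])  ## drop intercept

## step 2: adaptive lasso = lasso with per-variable penalty factors
ad <- cv.glmnet(x, y, alpha = 1, foldid = foldid, penalty.factor = w)
coef(ad, s = "lambda.1se")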
It should not come as a surprise that a combination has also been proposed: the relaxed adaptive lasso (Zhang, Zhao, Lu, & Xu, 2022). It can easily be employed through specifying both ad.alpha = 0 and relax = TRUE.
We fit a PRE to predict Ozone and employ the relaxed lasso by specifying relax = TRUE:
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens.rel <- pre(Ozone ~ ., data = airq, relax = TRUE)

If we specify relax = TRUE, the gamma argument (see ?cv.glmnet for documentation on arguments relax and gamma) will by default be set to a range of five values in the interval [0, 1]. This can be overruled by specifying different values for argument gamma in the call to function pre (but the default likely suffices in most applications).
We take a look at the regularization paths for the relaxed fits:
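One way to inspect these paths is to plot the cross-validated glmnet fit that pre stores in the fitted object (a sketch; this plots the CV error curves per value of \(\gamma\), and the exact figure in this vignette may have been produced differently):

plot(airq.ens.rel$glmnet.fit)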
We obtained one regularization path for each value of \(\gamma\). \(\gamma\) is a mixing parameter that determines the weight of the original lasso solution, relative to a solution containing only the selected variables but with unpenalized coefficient estimates. The path for \(\gamma = 1\) is the default lasso path, which we would also have obtained without specifying relax = TRUE. Lower values of \(\gamma\) 'unshrink' the non-zero lasso coefficients towards their unpenalized values. We see that for the \(\lambda\) value yielding the minimum MSE (indicated by the left-most vertical dotted line), the value of \(\gamma\) does not make a lot of difference for the MSE, but as \(\lambda\) increases, lower values of \(\gamma\) tend to yield better predictive performance. This is a common pattern for \(\lambda\) and \(\gamma\).
For model selection using the "lambda.min" criterion, by default the \(\lambda\) and \(\gamma\) combination yielding the lowest CV error is returned. For the "lambda.1se" criterion, the \(\lambda\) and \(\gamma\) combination yielding the sparsest solution within 1 standard error of the minimum CV error is returned:
fit <- airq.ens.rel$glmnet.fit$relaxed
mat <- data.frame(lambda.1se = c(fit$lambda.1se, fit$gamma.1se, fit$nzero.1se),
                  lambda.min = c(fit$lambda.min, fit$gamma.min, fit$nzero.min),
                  row.names = c("lambda", "gamma", "# of non-zero terms"))
mat

##                     lambda.1se lambda.min
## lambda                6.193185   3.382889
## gamma                 0.000000   0.000000
## # of non-zero terms   9.000000  12.000000

Thus, as the dotted vertical lines in the plots already suggest, with the default "lambda.1se" criterion a final model with 9 terms will be selected, with coefficients obtained using a \(\lambda\) value of 6.193 and a \(\gamma\) value of 0. With the "lambda.min" criterion, we obtain a more complex fit; \(\gamma = 0\) still yields the lowest CV error. Note that use of "lambda.min" increases the likelihood of overfitting, because function pre uses the same data to extract the rules and fit the penalized regression, so in most cases the default "lambda.1se" criterion can be expected to provide a less complex, better generalizable, and often more accurate fit.
The default of function pre is to use the "lambda.1se" criterion. When relax = TRUE has been specified in the call to function pre, the default of all functions and S3 methods applied to objects of class pre (print, plot, coef, predict, importance, explain, cvpre, singleplot, pairplot, interact) is to use the solution obtained with "lambda.1se" and the \(\gamma\) value yielding the lowest CV error at that value of \(\lambda\). This can be overruled by specifying a different value of \(\lambda\) (penalty.par.val) and/or \(\gamma\) (gamma). Some examples:
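The output shown below can be produced with calls along the following lines (a sketch; the numeric \(\lambda\) value of 7.814913 is one of the values on the fitted path, reported further below by prune_pre):

print(airq.ens.rel)
print(airq.ens.rel, penalty = "lambda.min")
print(airq.ens.rel, penalty = 7.814913, gamma = 0)
print(airq.ens.rel, penalty = 7.814913, gamma = 1)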
## 
## Final ensemble with cv error within 1se of minimum: 
## 
## lambda = 6.193185 
## gamma = 0
## number of terms = 9
## mean cv error (se) = 304.7364 (79.60512)
## 
## cv error type : Mean-Squared Error

## Final ensemble with minimum cv error: 
## 
## lambda = 3.382889 
## gamma = 0
## number of terms = 12
## mean cv error (se) = 244.7256 (67.54855)
## 
## cv error type : Mean-Squared Error

## Final ensemble: 
## 
## lambda = 7.814913 
## gamma = 0
## number of terms = 5
## mean cv error (se) = 390.2582 (101.9163)
## 
## cv error type : Mean-Squared Error

## Final ensemble: 
## 
## lambda = 7.814913 
## gamma = 1
## number of terms = 5
## mean cv error (se) = 682.127 (146.0948)
## 
## cv error type : Mean-Squared Error

Note how the lowest CV error is indeed obtained with the "lambda.min" criterion, while the default "lambda.1se" yields a sparser model, with accuracy within 1 standard error of "lambda.min". If we want to go (much) sparser, we need to specify a higher value for the \(\lambda\) penalty, and a lower value of \(\gamma\) should likely be preferred, to retain good-enough predictive accuracy.
Some rules for specification of\(\lambda\) and\(\gamma\):
If a numeric value of \(\lambda\) has been supplied, a (numeric) value for \(\gamma\) must be supplied.
Otherwise (if the default "lambda.1se" criterion is employed, or "lambda.min" is specified), the \(\gamma\) value yielding the lowest CV error (at the \(\lambda\) value associated with the specified criterion) will be used; this \(\gamma\) value can be overruled by supplying the desired \(\gamma\) value to the gamma argument.
Multiple values of \(\gamma\) can be passed to function pre, but all other methods and functions accept only a single value for \(\gamma\) (this differs from several glmnet functions).
If a specific \(\lambda\) value is supplied, results are returned for the penalty parameter value that was used in the path and is closest to the specified value.
Also note that in the code chunk above we refer to the penalty.par.val argument by abbreviating it to penalty; this has the same effect as writing penalty.par.val in full.
Using \(\gamma = 0\) amounts to a forward stepwise selection approach, with the entry order of the variables (rules and linear terms) determined by the lasso. This approach can be useful if we want a rule ensemble with low complexity and high generalizability, especially when we want to decide a priori on the number of terms we want to retain. By specifying a high value of \(\lambda\), we can retain a small number of rules, while specifying \(\gamma = 0\) will provide unbiased (unpenalized) coefficients. This avoids the overshrinking of large coefficients. In terms of predictive accuracy, this approach may not perform best, but if low complexity (interpretability) is most important, it is a very useful approach that does not reduce predictive accuracy too much.
To use forward stepwise regression with variable entry order determined by the lasso, we specify a \(\gamma\) value of 0 and specify the number of variables we want to retain through specification of \(\lambda\) (penalty.par.val). To find the value of \(\lambda\) corresponding to the number of terms one wants to retain, check the \(\lambda\) values and numbers of non-zero terms along the fitted path (results not shown for space considerations); one way to do this is sketched below.
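A sketch of one way to do this, using the lambda and nzero components that cv.glmnet fits store:

## lambda values on the path and the corresponding number of non-zero terms
data.frame(lambda = airq.ens.rel$glmnet.fit$lambda,
           nzero  = airq.ens.rel$glmnet.fit$nzero)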
Function prune_pre is helpful for selecting sparser ensembles. Say we want to retain an ensemble with only five rules; prune_pre will then return the \(\lambda\) and \(\gamma\) values that yield an ensemble of the specified size with optimal cross-validated predictive accuracy.
Here, we request the optimal parameter values for a five-term ensemble:
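The output below can be requested with a call like the following (a sketch; the argument name nonzero for the requested number of terms is an assumption, see ?prune_pre for the exact interface):

prune_pre(airq.ens.rel, nonzero = 5)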
## The best ensemble with 5 non-zero terms is obtained with a lambda value of 6.797013 and a gamma value of 0.
## 
## Overview of performance of ensembles selected with the nearest lambda values:
##       lambda number_of_nonzero_terms optimal_gamma mean_cv_error
## s7  8.985251                       2             0      425.7366
## s8  8.576857                       2             0      414.1260
## s9  8.187026                       4             0      407.1132
## s10 7.814913                       5             0      390.2582
## s11 7.459713                       5             0      378.8175
## s12 7.120658                       5          0.25      384.3582
## s13 6.797013                       5             0      370.8112
## s14 6.488078                       6             0      347.7939
## s15 6.193185                       9             0      304.7364
## s16 5.911695                       9             0      294.8892

Note that the mean_cv_error may be slightly optimistic. Cross-validation was performed on the same data that were used to generate the rules. A less optimistic estimate of generalization error can be obtained using function cvpre.
Finally, we fit a PRE with the adaptive lasso to predict Ozone levels:
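A call like the following could produce the output shown below (a sketch; the object name airq.ens.ad is ours, and the seed is assumed to match the earlier fits):

set.seed(42)
airq.ens.ad <- pre(Ozone ~ ., data = airq, ad.alpha = 0)
print(airq.ens.ad)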
## 
## Final ensemble with cv error within 1se of minimum: 
## 
## lambda = 16.74798
## number of terms = 8
## mean cv error (se) = 318.0806 (91.18799)
## 
## cv error type : Mean-Squared Error

The adaptive lasso did not provide a sparser ensemble, while the mean CV error suggests better predictive accuracy than the standard lasso, but not than the relaxed lasso. Adaptive lasso settings can further be adjusted through specification of argument ad.penalty.
We can also fit a rule ensemble using the relaxed adaptive lasso:
set.seed(42)
airq.ens.rel.ad <- pre(Ozone ~ ., data = airq, relax = TRUE, ad.alpha = 0)
print(airq.ens.rel.ad)

## 
## Final ensemble with cv error within 1se of minimum: 
## 
## lambda = 40.53227 
## gamma = 0.25
## number of terms = 4
## mean cv error (se) = 302.991 (91.09976)
## 
## cv error type : Mean-Squared Error
## 
##          rule coefficient                  description
##   (Intercept)    75.01177                            1
##       rule191   -21.82831      Wind > 5.7 & Temp <= 87
##       rule173   -18.12641      Wind > 5.7 & Temp <= 82
##       rule204    12.38470 Wind <= 10.3 & Solar.R > 148
##        rule51   -11.81249      Wind > 5.7 & Temp <= 84

The summary suggests that, when using the default "lambda.1se" criterion, the relaxed adaptive lasso provides higher predictive accuracy than the standard, the relaxed, and the adaptive lasso. Note, however, that the training data have been used to generate the rules, to estimate the weights for the penalty factors using ridge regression, and to estimate the final lasso model. Thus, the printed CV error can provide an overly optimistic estimate of predictive accuracy. To obtain an honest estimate of predictive accuracy, it should be computed on a separate test dataset or using an additional layer of cross-validation (e.g., using function cvpre).
Use of the relaxed lasso improves accuracy and sparsity of the final ensemble. The relaxed lasso can be used to obtain an ensemble of pre-specified sparsity that still provides good predictive performance. Use of adaptive lasso penalties may further improve predictive accuracy.
In case you obtain different results: the results above were obtained with the following setup:
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=C                       LC_CTYPE=Dutch_Netherlands.utf8   
## [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=Dutch_Netherlands.utf8    
## 
## time zone: Europe/Amsterdam
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pROC_1.18.5    caret_6.0-94   lattice_0.22-5 ggplot2_3.5.1  pre_1.0.8     
## [6] mice_3.16.0   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4041.110    libcoin_1.0-10      
##  [4] dplyr_1.1.4          fastmap_1.1.1        digest_0.6.35       
##  [7] rpart_4.1.23         timechange_0.3.0     lifecycle_1.0.4     
## [10] survival_3.5-8       magrittr_2.0.3       compiler_4.3.3      
## [13] rlang_1.1.4          sass_0.4.9           tools_4.3.3         
## [16] partykit_1.2-20      plotrix_3.8-4        utf8_1.2.4          
## [19] yaml_2.3.8           data.table_1.15.4    knitr_1.45          
## [22] interp_1.1-6         plyr_1.8.9           earth_5.3.3         
## [25] withr_3.0.0          purrr_1.0.2          stats4_4.3.3        
## [28] nnet_7.3-19          grid_4.3.3           fansi_1.0.6         
## [31] jomo_2.7-6           colorspace_2.1-0     future_1.33.2       
## [34] globals_0.16.3       scales_1.3.0         iterators_1.0.14    
## [37] MASS_7.3-60.0.1      cli_3.6.3            inum_1.0-5          
## [40] mvtnorm_1.2-5        rmarkdown_2.26       generics_0.1.3      
## [43] rstudioapi_0.16.0    future.apply_1.11.2  reshape2_1.4.4      
## [46] minqa_1.2.7          cachem_1.0.8         stringr_1.5.1       
## [49] splines_4.3.3        parallel_4.3.3       vctrs_0.6.5         
## [52] hardhat_1.3.1        boot_1.3-29          glmnet_4.1-8        
## [55] Matrix_1.6-5         jsonlite_1.8.8       mitml_0.4-5         
## [58] Formula_1.2-5        listenv_0.9.1        foreach_1.5.2       
## [61] gower_1.0.1          tidyr_1.3.1          jquerylib_0.1.4     
## [64] recipes_1.0.10       glue_1.7.0           parallelly_1.37.1   
## [67] nloptr_2.1.1         pan_1.9              plotmo_3.6.3        
## [70] codetools_0.2-19     lubridate_1.9.3      stringi_1.8.3       
## [73] shape_1.4.6.1        gtable_0.3.6         deldir_2.0-4        
## [76] lme4_1.1-35.4        munsell_0.5.1        tibble_3.2.1        
## [79] pillar_1.9.0         htmltools_0.5.8      ipred_0.9-14        
## [82] lava_1.8.0           R6_2.5.1             evaluate_0.24.0     
## [85] highr_0.10           backports_1.5.0      broom_1.0.6         
## [88] bslib_0.7.0          class_7.3-22         MatrixModels_0.5-3  
## [91] Rcpp_1.0.12          nlme_3.1-164         prodlim_2023.08.28  
## [94] xfun_0.43            ModelMetrics_1.2.2.2 pkgconfig_2.0.3