| Title: | Flexible, Ensemble-Based Variable Selection with Potentially Missing Data |
| Version: | 0.0.5 |
| Description: | Perform variable selection in settings with possibly missing data based on extrinsic (algorithm-specific) and intrinsic (population-level) variable importance. Uses a Super Learner ensemble to estimate the underlying prediction functions that give rise to estimates of variable importance. For more information about the methods, please see Williamson and Huang (2024) <doi:10.1515/ijb-2023-0059>. |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| Depends: | R (≥ 3.1.0) |
| Imports: | SuperLearner, dplyr, magrittr, tibble, caret, mvtnorm, kernlab, rlang, ranger |
| Suggests: | vimp, stabs, testthat, knitr, rmarkdown, mice, xgboost, glmnet, polspline |
| URL: | https://github.com/bdwilliamson/flevr |
| BugReports: | https://github.com/bdwilliamson/flevr/issues |
| VignetteBuilder: | knitr |
| License: | MIT + file LICENSE |
| NeedsCompilation: | no |
| Packaged: | 2025-12-05 17:43:10 UTC; L107067 |
| Author: | Brian D. Williamson |
| Maintainer: | Brian D. Williamson <brian.d.williamson@kp.org> |
| Repository: | CRAN |
| Date/Publication: | 2025-12-06 14:30:02 UTC |
flevr: Flexible, Ensemble-Based Variable Selection with Potentially Missing Data
Description
A framework for flexible, ensemble-based variable selection using either extrinsic or intrinsic variable importance. You provide the data and a library of candidate algorithms for estimating the conditional mean outcome given covariates; flevr handles the rest.
Author(s)
Maintainer: Brian Williamson https://bdwilliamson.github.io/
Methodology authors:
Brian D. Williamson
Ying Huang
See Also
Papers:
Other useful links:
Report bugs at https://github.com/bdwilliamson/flevr/issues
Imports
The packages that we import either make the internal code nice (dplyr, magrittr, tibble) or are directly relevant for estimating variable importance (SuperLearner, caret).
We suggest several other packages: xgboost, ranger, glmnet, kernlab, polspline, and quadprog allow a flexible library of candidate learners in the SuperLearner; stabs allows importance to be embedded within stability selection; testthat and covr help with unit tests; and knitr, rmarkdown, and RCurl help with the vignettes and examples.
Author(s)
Maintainer: Brian D. Williamson brian.d.williamson@kp.org (ORCID)
See Also
Useful links:
Super Learner wrapper for a ranger object with variable importance
Description
Super Learner wrapper for a ranger object with variable importance
Usage
SL.ranger.imp(
  Y,
  X,
  newX,
  family,
  obsWeights = rep(1, length(Y)),
  num.trees = 500,
  mtry = floor(sqrt(ncol(X))),
  write.forest = TRUE,
  probability = family$family == "binomial",
  min.node.size = ifelse(family$family == "gaussian", 5, 1),
  replace = TRUE,
  sample.fraction = ifelse(replace, 1, 0.632),
  num.threads = 1,
  verbose = FALSE,
  importance = "impurity",
  ...
)
Arguments
Y | Outcome variable |
X | Training dataframe |
newX | Test dataframe |
family | Gaussian or binomial |
obsWeights | Observation-level weights |
num.trees | Number of trees. |
| mtry | Number of variables to possibly split at in each node. Default is the (rounded down) square root of the number of variables. |
| write.forest | Save ranger.forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction intended. |
probability | Grow a probability forest as in Malley et al. (2012). |
| min.node.size | Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 10 for probability. |
replace | Sample with replacement. |
| sample.fraction | Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement. |
num.threads | Number of threads to use. |
verbose | If TRUE, display additional output during execution. |
importance | Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see |
... | Any additional arguments, not currently used. |
Value
a named list with elements pred (predictions on newX) and fit (the fitted ranger object).
References
Breiman, L. (2001). Random forests. Machine learning 45:5-32.
Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.
See Also
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit
set.seed(20231129)
fit <- SL.ranger.imp(Y = y, X = x, newX = x, family = binomial())
fit
Wrapper for using Super Learner-based extrinsic selection within stability selection
Description
A wrapper function for Super Learner-based extrinsic variable selection within stability selection, using the stabs package.
Usage
SL_stabs_fitfun(x, y, q, ...)
Arguments
x | the features. |
y | the outcome of interest. |
q | the number of features to select on average. |
... | other arguments to pass to |
Value
a named list, with elements: selected (a logical vector indicating whether or not each variable was selected); and path (a logical matrix indicating which variable was selected at each step).
See Also
stabsel for general usage of stability selection.
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# use stability selection with SL (using small number of folds for CV,
# small SL library and small number of bootstrap replicates for illustration only)
set.seed(20231129)
library("SuperLearner")
sl_stabs <- stabs::stabsel(x = x, y = y,
                           fitfun = SL_stabs_fitfun,
                           args.fitfun = list(SL.library = "SL.glm", cvControl = list(V = 2)),
                           q = 2, B = 5, PFER = 5)
sl_stabs
Example biomarker data
Description
A dataset inspired by data collected by the Early Detection Research Network (EDRN). Biomarkers developed at six "labs" are validated at at least one of four "validation sites" on 306 cysts. The data also include two binary outcome variables: whether or not the cyst was classified as mucinous, and whether or not the cyst was determined to have high malignant potential.
Usage
biomarkers
Format
biomarkers: a tibble with 306 rows and 24 columns, where the first column is the validation site, the next two columns are the possible outcomes, and the remaining columns are the biomarkers:
- institution
the validation site
- mucinous
a binary indicator of whether the cyst was classified as mucinous
- high_malignancy
a binary indicator of whether the cyst was classified as having high malignant potential
- lab1_actb
a biomarker
- lab1_molecules_score
a biomarker
- lab1_telomerase_score
a biomarker
- lab2_fluorescence_score
a biomarker
- lab3_muc3ac_score
a biomarker
- lab3_muc5ac_score
a biomarker
- lab4_areg_score
a biomarker
- lab4_glucose_score
a biomarker
- lab5_mucinous_call
a biomarker (binary)
- lab5_neoplasia_v1_call
a biomarker (binary)
- lab5_neoplasia_v2_call
a biomarker (binary)
- lab6_ab_score
a biomarker
- cea
a biomarker
- lab1_molecules_neoplasia_call
binary indicator of whether lab1_molecules_score > 25
- lab1_telomerase_neoplasia_call
binary indicator of whether lab1_telomerase_score > 730
- lab2_fluorescence_mucinous_call
binary indicator of whether lab2_fluorescence_score > 1.23
- lab4_areg_mucinous_call
binary indicator of whether lab4_areg_score > 112
- lab4_glucose_mucinous_call
binary indicator of whether lab4_glucose_score < 50
- lab4_combined_mucinous_call
binary indicator of whether lab4_areg_score > 112 and lab4_glucose_score < 50
- lab6_ab_neoplasia_call
binary indicator of whether lab6_ab_score > 0.104
- cea_call
binary indicator of whether cea > 192
Source
Inspired by data collected by the EDRN https://edrn.nci.nih.gov/.
Extract extrinsic importance from a Super Learner object
Description
Extract the individual-algorithm extrinsic importance from each fitted algorithm within the Super Learner; compute the average weighted rank of the importance scores, with weights specified by each algorithm's weight in the Super Learner.
Usage
extract_importance_SL(fit, feature_names, import_type = "all", ...)
Arguments
fit | the fitted Super Learner ensemble |
feature_names | the names of the features |
import_type | the level of granularity for importance: |
... | other arguments to pass to individual-algorithm extractors. |
Value
a tibble, with columns feature (the feature) and rank (the weighted feature importance rank, with 1 indicating the most important feature).
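The weighted-rank aggregation described here can be sketched in base R. This is a minimal illustration, not the package's implementation: the importance scores and Super Learner weights are made up, weighted_rank_sketch is a hypothetical helper, and ties are handled by rank()'s default averaging.

```r
# Hypothetical helper: aggregate per-learner importance into a weighted rank.
# importance_mat: features in rows, learners in columns (larger = more important)
# sl_weights: one nonnegative Super Learner weight per learner
weighted_rank_sketch <- function(importance_mat, sl_weights) {
  # rank features within each learner; rank 1 = most important
  per_learner_ranks <- apply(-importance_mat, 2, rank)
  # weight each learner's ranks by its normalized Super Learner weight
  w <- sl_weights / sum(sl_weights)
  avg_rank <- as.numeric(per_learner_ranks %*% w)
  data.frame(feature = rownames(importance_mat), rank = avg_rank)
}

# made-up importance scores for three features and two learners
imp <- matrix(c(0.9, 0.1, 0.5,
                0.2, 0.8, 0.6),
              nrow = 3,
              dimnames = list(c("f1", "f2", "f3"), c("learner_a", "learner_b")))
res <- weighted_rank_sketch(imp, sl_weights = c(0.75, 0.25))
res[order(res$rank), ]
```

Because the ranks are averaged with weights, they are generally not integers, so a rank threshold used for selection need not be an integer either.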
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit (using a simple library and 2 folds for illustration only)
set.seed(20231129)
library("SuperLearner")
fit <- SuperLearner::SuperLearner(Y = y, X = x, SL.library = c("SL.glm", "SL.mean"),
                                  cvControl = list(V = 2))
# extract importance using all learners
importance <- extract_importance_SL(fit = fit, feature_names = feature_nms)
importance
# extract importance of best learner
best_importance <- extract_importance_SL(fit = fit, feature_names = feature_nms,
                                         import_type = "best")
best_importance
Extract the learner-specific importance from a fitted SuperLearner algorithm
Description
Extract the individual-algorithm extrinsic importance from one fitted algorithm within the Super Learner, along with the importance rank.
Usage
extract_importance_SL_learner(fit = NULL, coef = 0, feature_names = "", ...)
Arguments
fit | the specific learner (e.g., from the Super Learner's |
coef | the Super Learner coefficient associated with the learner. |
feature_names | the feature names |
... | other arguments to pass to algorithm-specific importance extractors. |
Value
a tibble, with columns algorithm (the fitted algorithm), feature (the feature), importance (the algorithm-specific extrinsic importance of the feature), rank (the feature importance rank, with 1 indicating the most important feature), and weight (the algorithm's weight in the Super Learner)
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit (using a simple library and 2 folds for illustration only)
library("SuperLearner")
set.seed(20231129)
fit <- SuperLearner::SuperLearner(Y = y, X = x, SL.library = c("SL.glm", "SL.mean"),
                                  cvControl = list(V = 2))
# extract importance
importance <- extract_importance_SL_learner(fit = fit$fitLibrary[[1]]$object,
                                            feature_names = feature_nms, coef = fit$coef[1])
importance
Extract the learner-specific importance from a glm object
Description
Extract the individual-algorithm extrinsic importance from a glm object, along with the importance rank.
Usage
extract_importance_glm(fit = NULL, feature_names = "", coef = 0)
Arguments
fit | the |
feature_names | the feature names |
coef | the Super Learner coefficient associated with the learner. |
Value
a tibble, with columns algorithm (the fitted algorithm), feature (the feature), importance (the algorithm-specific extrinsic importance of the feature), rank (the feature importance rank, with 1 indicating the most important feature), and weight (the algorithm's weight in the Super Learner)
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit
fit <- stats::glm(y ~ ., family = "binomial", data = data.frame(y = y, x))
# extract importance
importance <- extract_importance_glm(fit = fit, feature_names = feature_nms)
importance
Extract the learner-specific importance from a glmnet object
Description
Extract the individual-algorithm extrinsic importance from a glmnet object, along with the importance rank.
Usage
extract_importance_glmnet(fit = NULL, feature_names = "", coef = 0)
Arguments
fit | the |
feature_names | the feature names |
coef | the Super Learner coefficient associated with the learner. |
Value
a tibble, with columns algorithm (the fitted algorithm), feature (the feature), importance (the algorithm-specific extrinsic importance of the feature), rank (the feature importance rank, with 1 indicating the most important feature), and weight (the algorithm's weight in the Super Learner)
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit (using only 3 CV folds for illustration only)
set.seed(20231129)
fit <- glmnet::cv.glmnet(x = as.matrix(x), y = y, family = "binomial", nfolds = 3)
# extract importance
importance <- extract_importance_glmnet(fit = fit, feature_names = feature_nms)
importance
Extract the learner-specific importance from a mean object
Description
Extract the individual-algorithm extrinsic importance from a mean object, along with the importance rank.
Usage
extract_importance_mean(fit = NULL, feature_names = "", coef = 0)
Arguments
fit | the |
feature_names | the feature names |
coef | the Super Learner coefficient associated with the learner. |
Value
a tibble, with columns algorithm (the fitted algorithm), feature (the feature), importance (the algorithm-specific extrinsic importance of the feature), rank (the feature importance rank, with 1 indicating the most important feature), and weight (the algorithm's weight in the Super Learner)
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the mean outcome
fit <- mean(y)
# extract importance
importance <- extract_importance_mean(fit = fit, feature_names = feature_nms)
importance
Extract the learner-specific importance from a polymars object
Description
Extract the individual-algorithm extrinsic importance from a polymars object, along with the importance rank.
Usage
extract_importance_polymars(fit = NULL, feature_names = "", coef = 0)
Arguments
fit | the |
feature_names | the feature names |
coef | the Super Learner coefficient associated with the learner. |
Value
a tibble, with columns algorithm (the fitted algorithm), feature (the feature), importance (the algorithm-specific extrinsic importance of the feature), rank (the feature importance rank, with 1 indicating the most important feature), and weight (the algorithm's weight in the Super Learner)
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
x_mat <- as.matrix(x)
# get the fit
set.seed(20231129)
fit <- polspline::polyclass(y, x_mat)
# extract importance
importance <- extract_importance_polymars(fit = fit, feature_names = feature_nms)
importance
Extract the learner-specific importance from a ranger object
Description
Extract the individual-algorithm extrinsic importance from a ranger object, along with the importance rank.
Usage
extract_importance_ranger(fit = NULL, feature_names = "", coef = 0)
Arguments
fit | the |
feature_names | the feature names |
coef | the Super Learner coefficient associated with the learner. |
Value
a tibble, with columns algorithm (the fitted algorithm), feature (the feature), importance (the algorithm-specific extrinsic importance of the feature), rank (the feature importance rank, with 1 indicating the most important feature), and weight (the algorithm's weight in the Super Learner)
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit
set.seed(20231129)
fit <- ranger::ranger(y ~ ., data = data.frame(y = y, x), importance = "impurity")
# extract importance
importance <- extract_importance_ranger(fit = fit, feature_names = feature_nms)
importance
Extract the learner-specific importance from an svm object
Description
Extract the individual-algorithm extrinsic importance from an svm object, along with the importance rank.
Usage
extract_importance_svm(
  fit = NULL,
  feature_names = "",
  coef = 0,
  x = NULL,
  y = NULL,
  K = 10
)
Arguments
fit | the |
feature_names | the feature names |
coef | the Super Learner coefficient associated with the learner. |
x | the features |
y | the outcome |
K | the number of cross-validation folds |
Value
a tibble, with columns algorithm (the fitted algorithm), feature (the feature), importance (the algorithm-specific extrinsic importance of the feature), rank (the feature importance rank, with 1 indicating the most important feature), and weight (the algorithm's weight in the Super Learner)
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- as.data.frame(dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))])
x_mat <- as.matrix(x)
feature_nms <- names(x)
# get the fit
set.seed(20231129)
fit <- kernlab::ksvm(x_mat, y)
# extract importance
importance <- extract_importance_svm(fit = fit, feature_names = feature_nms, x = x, y = y)
importance
Extract the learner-specific importance from an xgboost object
Description
Extract the individual-algorithm extrinsic importance from an xgboost object, along with the importance rank.
Usage
extract_importance_xgboost(fit = NULL, feature_names = "", coef = 0)
Arguments
fit | the |
feature_names | the feature names |
coef | the Super Learner coefficient associated with the learner. |
Value
a tibble, with columns algorithm (the fitted algorithm), feature (the feature), importance (the algorithm-specific extrinsic importance of the feature), rank (the feature importance rank, with 1 indicating the most important feature), and weight (the algorithm's weight in the Super Learner)
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- as.matrix(dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))])
# x is a matrix, so use colnames() (names() would return NULL)
feature_nms <- colnames(x)
set.seed(20231129)
xgbmat <- xgboost::xgb.DMatrix(data = x, label = y)
# get the fit, using a small number of rounds for illustration only
fit <- xgboost::xgb.train(
  data = xgbmat, nrounds = 10,
  params = list("objective" = "binary:logistic", "nthread" = 1, "max_depth" = 1)
)
# extract importance
importance <- extract_importance_xgboost(fit = fit, feature_names = feature_nms)
importance
Perform extrinsic, ensemble-based variable selection
Description
Based on a fitted Super Learner ensemble, extract extrinsic variable importance estimates, rank them, and do variable selection using the specified rank threshold.
Usage
extrinsic_selection(
  fit = NULL,
  feature_names = "",
  threshold = 20,
  import_type = "all",
  ...
)
Arguments
fit | the fitted Super Learner ensemble. |
| feature_names | the names of the features (a character vector of length |
| threshold | the threshold for selection based on ranked variable importance; rank 1 is the most important. Defaults to 20 (though this is arbitrary, and really should be specified for the task at hand). |
import_type | the type of extrinsic importance (either |
... | other arguments to pass to algorithm-specific importance extractors. |
Value
a tibble with the estimated extrinsic variable importance, the corresponding variable importance ranks, and the selected variables.
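How a rank threshold turns ranked importance into a selected set can be sketched in base R. The ranks and feature names below are made up, and the comparison rule (keep features ranked at or below the threshold) is an assumption; the package may use a strict comparison instead.

```r
# made-up weighted ranks (1 = most important); feature names are hypothetical
ranks <- data.frame(feature = c("f1", "f2", "f3", "f4"),
                    rank = c(1.5, 3.5, 2.0, 3.0))
threshold <- 2
# assumption: keep features ranked at or below the threshold
ranks$selected <- ranks$rank <= threshold
ranks
```

A smaller threshold yields a smaller selected set, which is why the default of 20 should be replaced with a value suited to the number of features at hand.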
See Also
SuperLearner for specific usage of the SuperLearner function and package.
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit (using a simple library and 2 folds for illustration only)
library("SuperLearner")
set.seed(20231129)
fit <- SuperLearner::SuperLearner(Y = y, X = x, SL.library = c("SL.glm", "SL.mean"),
                                  cvControl = list(V = 2))
# extract importance
importance <- extrinsic_selection(fit = fit, feature_names = feature_nms, threshold = 1.5,
                                  import_type = "all")
importance
Get an augmented set based on the next-most significant variables
Description
Based on the adjusted p-values from a FWER-controlling procedure and a more general error rate for which control is desired (e.g., generalized FWER, proportion of false positives, or FDR), augment the set based on FWER control with the next-most significant variables.
Usage
get_augmented_set(
  p_values = NULL,
  num_rejected = 0,
  alpha = 0.05,
  quantity = "gFWER",
  q = 0.05,
  k = 1
)
Arguments
p_values | the adjusted p-values. |
| num_rejected | the number of rejected null hypotheses from the base FWER-controlling procedure. |
alpha | the significance level. |
quantity | the quantity to control (i.e., |
q | the proportion for FDR or PFP control. |
k | the number of false positives for gFWER control. |
Value
a list of the variables selected into the augmentation set. Contains the following values:
- set, a numeric vector where 1 denotes that the variable was selected and 0 otherwise
- k, the value of k used
- q_star, the value of q-star used
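For gFWER control, the augmentation step can be sketched in base R as "add the next k most significant variables after the base rejections". This is an illustration under stated assumptions, not the package's code: augment_gfwer_sketch is a hypothetical helper, the p-values are made up, the base set is assumed to consist of the num_rejected smallest adjusted p-values, and the returned vector marks only the added variables.

```r
# Hypothetical helper: augment a FWER-controlling rejection set for gFWER(k).
# adj_p: adjusted p-values from the base procedure
# num_rejected: number of hypotheses rejected by the base procedure
# k: number of additional rejections allowed
augment_gfwer_sketch <- function(adj_p, num_rejected, k = 1) {
  ord <- order(adj_p)  # most significant first
  n_aug <- min(k, length(adj_p) - num_rejected)
  set <- rep(0, length(adj_p))
  if (n_aug > 0) {
    # the augmentation set is the next n_aug variables in order of significance
    set[ord[num_rejected + seq_len(n_aug)]] <- 1
  }
  set
}

adj_p <- c(0.01, 0.30, 0.04, 0.60, 0.12)
aug <- augment_gfwer_sketch(adj_p, num_rejected = 2, k = 1)
aug
```

Here the base procedure is taken to have rejected the first and third hypotheses, so the fifth (adjusted p-value 0.12) is the single variable added.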
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# estimate SPVIMs (using simple library and V = 2 for illustration only)
set.seed(20231129)
library("SuperLearner")
est <- vimp::sp_vim(Y = y, X = x, V = 2, type = "auc", SL.library = "SL.glm",
                    cvControl = list(V = 2))
# get base set
base_set <- get_base_set(test_statistics = est$test_statistic, p_values = est$p_value,
                         alpha = 0.2, method = "Holm")
# get augmented set
augmented_set <- get_augmented_set(p_values = base_set$p_values,
                                   num_rejected = sum(base_set$decision), alpha = 0.2,
                                   quantity = "gFWER", k = 1)
augmented_set$set
Get an initial selected set based on intrinsic importance and a base method
Description
Using the estimated intrinsic importance and a base method designed to control the family-wise error rate (e.g., Holm), obtain an initial selected set.
Usage
get_base_set(
  test_statistics = NULL,
  p_values = NULL,
  alpha = 0.05,
  method = "maxT",
  B = 10000,
  Sigma = diag(1, nrow = length(test_statistics)),
  q = NULL
)
Arguments
test_statistics | the test statistics (used with "maxT") |
p_values | (used with "minP" or "Holm") |
alpha | the alpha level |
method | the method (one of "none", "BY", "maxT", "minP", or "Holm") |
B | the number of resamples (for minP or maxT) |
Sigma | the estimated covariance matrix for the test statistics |
q | the false discovery rate (for method = "BY") |
Value
the initial selected set, a list of the following:
- decision, a numeric vector with 1 indicating that the variable was selected and 0 otherwise
- p_values, the p-values used to make the decision
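For the Holm base method, the decision rule can be sketched with base R's stats::p.adjust. The p-values are made up, and both the use of p.adjust and the non-strict comparison against alpha are assumptions about how such a base set could be formed, not a description of the package internals.

```r
# made-up raw p-values for five features
p_values <- c(0.001, 0.20, 0.012, 0.45, 0.03)
alpha <- 0.05
# Holm step-down adjustment (base R)
adj_p <- stats::p.adjust(p_values, method = "holm")
# 1 = selected into the base set, 0 = not selected
decision <- as.numeric(adj_p <= alpha)
data.frame(p_values, adj_p, decision)
```

The adjusted p-values, rather than the raw ones, are what a later augmentation step would consume.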
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# estimate SPVIMs (using simple library and V = 2 for illustration only)
set.seed(20231129)
library("SuperLearner")
est <- vimp::sp_vim(Y = y, X = x, V = 2, type = "auc", SL.library = "SL.glm",
                    cvControl = list(V = 2))
# get base set
base_set <- get_base_set(test_statistics = est$test_statistic, p_values = est$p_value,
                         alpha = 0.2, method = "Holm")
base_set$decision
Control parameters for intrinsic variable selection
Description
Control parameters for SPVIM-based intrinsic variable selection.
Usage
intrinsic_control(
  quantity = "gFWER",
  base_method = "Holm",
  fdr_method = "Holm",
  q = 0.2,
  k = 5
)
Arguments
quantity | the desired quantity for error-rate control: possible valuesare |
| base_method | the family-wise error rate controlling method to use for obtaining the initial set of selected variables. Possible values are |
fdr_method | the method for controlling the FDR (if |
q | the desired proportion of false positives (only used if |
| k | the desired number of family-wise errors (an integer, greater than or equal to zero). |
Value
a list with the control parameters.
Examples
control <- intrinsic_control(quantity = "gFWER", base_method = "Holm",
                             fdr_method = "Holm", k = 1)
control
Perform intrinsic, ensemble-based variable selection
Description
Based on estimated SPVIM values, do variable selection using the specified error-controlling method.
Usage
intrinsic_selection(
  spvim_ests = NULL,
  sample_size = NULL,
  feature_names = "",
  alpha = 0.05,
  control = list(quantity = "gFWER", base_method = "Holm", fdr_method = NULL, q = NULL,
    k = NULL)
)
Arguments
spvim_ests | the estimated SPVIM values (an object of class |
| sample_size | the number of independent observations used to estimate the SPVIM values. |
| feature_names | the names of the features (a character vector of length |
| alpha | the nominal generalized family-wise error rate, proportion of false positives, or false discovery rate level to control at (e.g., 0.05). |
| control | a list of parameters to control the variable selection process. Parameters include |
Value
a tibble with the estimated intrinsic variable importance, the corresponding variable importance ranks, and the selected variables.
See Also
sp_vim for specific usage of the sp_vim function and the vimp package for estimating intrinsic variable importance.
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# estimate SPVIMs (using simple library and V = 2 for illustration only)
set.seed(20231129)
library("SuperLearner")
est <- vimp::sp_vim(Y = y, X = x, V = 2, type = "auc", SL.library = "SL.glm",
                    cvControl = list(V = 2))
# do intrinsic selection
intrinsic_set <- intrinsic_selection(spvim_ests = est, sample_size = nrow(dat_cc), alpha = 0.2,
                                     feature_names = feature_nms,
                                     control = list(quantity = "gFWER", base_method = "Holm",
                                                    k = 1))
intrinsic_set
Pool selected sets from multiply-imputed data
Description
Pool the selected sets from multiply-imputed or bootstrap + imputed data. Uses the "stability" of the variables over the multiple selected sets to select variables that are stable across the sets, where stability is determined by presence in a certain fraction of the selected sets (and the fraction must be above the specified threshold to be "stable").
Usage
pool_selected_sets(sets = list(), threshold = 0.8)
Arguments
| sets | a list of sets of selected variables from the multiply-imputed datasets. Expects each set of selected variables to be a binary vector, where 1 denotes that the variable was selected. |
| threshold | a numeric threshold between 0 and 1 determining the "stability" of a feature; only features with stability above the threshold after pooling will be in the final selected set of variables. |
Value
a vector denoting the final set of selected variables (1 denotes selected, 0 denotes not selected)
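The pooling rule can be sketched in base R as "fraction of sets in which each feature was selected, compared with the threshold". The selected sets below are made up, and the strict comparison follows the wording "above the threshold"; whether the package uses a strict or non-strict comparison is an assumption here.

```r
# three made-up selected sets over four features (1 = selected)
sets <- list(c(1, 0, 1, 0),
             c(1, 1, 0, 0),
             c(1, 1, 1, 0))
threshold <- 0.6
# fraction of sets in which each feature was selected
stability <- colMeans(do.call(rbind, sets))
# assumption: strict comparison, following "above the threshold"
pooled <- as.numeric(stability > threshold)
pooled
```

With few imputations the achievable stability values are coarse (here 0, 1/3, 2/3, or 1), so the threshold effectively picks how many of the sets must agree.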
Examples
data("biomarkers")
x <- biomarkers[, !(names(biomarkers) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
library("dplyr")
library("SuperLearner")
# do multiple imputation (with a small number for illustration only)
library("mice")
n_imp <- 2
set.seed(20231129)
mi_biomarkers <- mice::mice(data = biomarkers, m = n_imp, printFlag = FALSE)
imputed_biomarkers <- mice::complete(mi_biomarkers, action = "long") %>%
  rename(imp = .imp, id = .id)
# set up a list to collect selected sets (one per imputed dataset)
all_selected_vars <- vector("list", length = n_imp)
for (i in 1:n_imp) {
  # fit a Super Learner using simple library for illustration only
  these_data <- imputed_biomarkers %>%
    filter(imp == i)
  this_y <- these_data$mucinous
  this_x <- these_data %>%
    select(starts_with("lab"), starts_with("cea"))
  this_x_df <- as.data.frame(this_x)
  fit <- SuperLearner::SuperLearner(Y = this_y, X = this_x_df,
                                    SL.library = "SL.glm",
                                    cvControl = list(V = 2),
                                    family = "binomial")
  # do extrinsic selection
  all_selected_vars[[i]] <- extrinsic_selection(
    fit = fit, feature_names = feature_nms, threshold = 5, import_type = "all"
  )$selected
}
# perform extrinsic variable selection
selected_vars <- pool_selected_sets(sets = all_selected_vars, threshold = 1 / n_imp)
feature_nms[selected_vars]
Pool SPVIM Estimates Using Rubin's Rules
Description
If multiple imputation was used due to the presence of missing data, pool SPVIM estimates from individual imputed datasets using Rubin's rules. Results in point estimates averaged over the imputations, along with within-imputation variance estimates and across-imputation variance estimates; and test statistics and p-values for hypothesis testing.
Usage
pool_spvims(spvim_ests = NULL)
Arguments
spvim_ests | a list of estimated SPVIMs (of class |
Value
a list of results containing the following:
- est, the average SPVIM estimate over the multiply-imputed datasets
- se, the average of the within-imputation SPVIM variance estimates
- test_statistics, the test statistics for hypothesis tests of zero importance, using the Rubin's rules standard error estimator and average SPVIM estimate
- p_values, p-values computed using the above test statistics
- tau_n, the across-imputation variance estimates
- vcov, the overall variance-covariance matrix
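The standard form of Rubin's rules can be sketched in base R on made-up numbers. This is a generic illustration and may differ in detail from what pool_spvims computes; in particular, the one-sided normal-approximation p-value is an assumption, and all object names are hypothetical.

```r
# made-up SPVIM estimates and variance estimates:
# rows = imputed datasets (M = 3), columns = features
ests <- rbind(c(0.10, 0.02),
              c(0.14, 0.04),
              c(0.12, 0.03))
vars <- rbind(c(0.0004, 0.0001),
              c(0.0006, 0.0001),
              c(0.0005, 0.0001))
M <- nrow(ests)
pooled_est <- colMeans(ests)        # average point estimate
within_var <- colMeans(vars)        # average within-imputation variance
between_var <- apply(ests, 2, var)  # across-imputation variance
# Rubin's rules total variance
total_var <- within_var + (1 + 1 / M) * between_var
test_stat <- pooled_est / sqrt(total_var)
# assumption: one-sided test of zero importance, normal approximation
p_value <- stats::pnorm(test_stat, lower.tail = FALSE)
data.frame(pooled_est, within_var, between_var, total_var, test_stat, p_value)
```

The (1 + 1 / M) factor inflates the across-imputation variance to account for using a finite number of imputations.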
Examples
data("biomarkers")
library("dplyr")
# do multiple imputation (with a small number for illustration only)
library("mice")
n_imp <- 2
set.seed(20231129)
mi_biomarkers <- mice::mice(data = biomarkers, m = n_imp, printFlag = FALSE)
imputed_biomarkers <- mice::complete(mi_biomarkers, action = "long") %>%
  rename(imp = .imp, id = .id)
# estimate SPVIMs for each imputed dataset, using simple library for illustration only
library("SuperLearner")
est_lst <- lapply(as.list(1:n_imp), function(l) {
  this_x <- imputed_biomarkers %>%
    filter(imp == l) %>%
    select(starts_with("lab"), starts_with("cea"))
  this_y <- biomarkers$mucinous
  suppressWarnings(
    vimp::sp_vim(Y = this_y, X = this_x, V = 2, type = "auc",
                 SL.library = "SL.glm", gamma = 0.1, alpha = 0.05, delta = 0,
                 cvControl = list(V = 2), env = environment())
  )
})
# pool the SPVIMs using Rubin's rules
pooled_spvims <- pool_spvims(spvim_ests = est_lst)
pooled_spvims
Extract a Variance-Covariance Matrix for SPVIM Estimates
Description
Extract a variance-covariance matrix based on the efficient influence function for each of the estimated SPVIMs.
Usage
spvim_vcov(spvim_ests = NULL)
Arguments
spvim_ests | estimated SPVIMs |
Value
a variance-covariance matrix
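The general idea of an influence-function-based covariance estimate can be sketched in base R: take the empirical covariance of the per-observation influence function values and divide by the sample size. The matrix below is simulated noise standing in for estimated influence function values; the SPVIM-specific influence function and any package-specific scaling are not shown on this page and are not reproduced here.

```r
set.seed(1)
n <- 200
# simulated stand-in for estimated influence function values:
# one row per observation, one column per feature
eif <- matrix(rnorm(n * 3), nrow = n, ncol = 3)
# empirical covariance of the influence function values, scaled by n
vcov_sketch <- stats::cov(eif) / n
vcov_sketch
```

The diagonal entries play the role of squared standard errors, and the off-diagonal entries capture dependence between the importance estimates for different features.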
Examples
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# estimate SPVIMs (using simple library and V = 2 for illustration only)
set.seed(20231129)
library("SuperLearner")
est <- vimp::sp_vim(Y = y, X = x, V = 2, type = "auc", SL.library = "SL.glm",
                    cvControl = list(V = 2))
# get variance-covariance matrix
vcov <- spvim_vcov(spvim_ests = est)