| Type: | Package |
| Title: | Stability-enHanced Approaches using Resampling Procedures |
| Version: | 1.4.8 |
| Date: | 2025-07-23 |
| Maintainer: | Barbara Bodinier <barbara.bodinier@gmail.com> |
| URL: | https://github.com/barbarabodinier/sharp |
| BugReports: | https://github.com/barbarabodinier/sharp/issues |
| Description: | In stability selection (N Meinshausen, P Bühlmann (2010) <doi:10.1111/j.1467-9868.2010.00740.x>) and consensus clustering (S Monti et al (2003) <doi:10.1023/A:1023949509487>), resampling techniques are used to enhance the reliability of the results. In this package (B Bodinier et al (2025) <doi:10.18637/jss.v112.i05>), hyper-parameters are calibrated by maximising model stability, which is measured under the null hypothesis that all selection (or co-membership) probabilities are identical (B Bodinier et al (2023a) <doi:10.1093/jrsssc/qlad058> and B Bodinier et al (2023b) <doi:10.1093/bioinformatics/btad635>). Functions are readily implemented for the use of LASSO regression, sparse PCA, sparse (group) PLS or graphical LASSO in stability selection, and hierarchical clustering, partitioning around medoids, K means or Gaussian mixture models in consensus clustering. |
| License: | GPL (≥ 3) |
| Language: | en-GB |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.1 |
| Depends: | fake (≥ 1.4.0), R (≥ 3.5) |
| Imports: | abind, beepr, future, future.apply, glassoFast (≥ 1.0.0),glmnet, grDevices, igraph, mclust, nloptr, plotrix, Rdpack,withr (≥ 2.4.0) |
| Suggests: | cluster, corpcor, dbscan, elasticnet, gglasso, mixOmics,nnet, OpenMx, RCy3, randomcoloR, rCOSA, rmarkdown, rpart,sgPLS, sparcl, survival (≥ 3.2.13), testthat (≥ 3.0.0),visNetwork |
| Additional_repositories: | https://barbarabodinier.github.io/drat |
| Config/testthat/edition: | 3 |
| RdMacros: | Rdpack |
| NeedsCompilation: | no |
| Packaged: | 2025-07-23 21:10:57 UTC; barbara |
| Author: | Barbara Bodinier [aut, cre] |
| Repository: | CRAN |
| Date/Publication: | 2025-07-23 21:30:02 UTC |
sharp: Stability-enHanced Approaches using Resampling Procedures
Description
In stability selection and consensus clustering, resampling techniques areused to enhance the reliability of the results. In this package,hyper-parameters are calibrated by maximising model stability, which ismeasured under the null hypothesis that all selection (or co-membership)probabilities are identical. Functions are readily implemented for the use ofLASSO regression, sparse PCA, sparse (group) PLS or graphical LASSO instability selection, and hierarchical clustering, partitioning aroundmedoids, K means or Gaussian mixture models in consensus clustering.
Details
| Package: | sharp |
| Type: | Package |
| Version: | 1.4.8 |
| Date: | 2025-07-23 |
| License: | GPL (>= 3) |
| Maintainer: | Barbara Bodinierbarbara.bodinier@gmail.com |
References
Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.
Monti S, Tamayo P, Mesirov J, Golub T (2003).“Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.”Machine Learning,52(1), 91–118.doi:10.1023/A:1023949509487.
Examples
oldpar <- par(no.readonly = TRUE)par(mar = c(5, 5, 5, 5))## Regression models# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50)# Stability selectionstab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)CalibrationPlot(stab)summary(stab)SelectedVariables(stab)## Graphical models# Data simulationset.seed(1)simul <- SimulateGraphical(n = 100, pk = 20, topology = "scale-free")# Stability selectionstab <- GraphicalModel(xdata = simul$data)CalibrationPlot(stab)summary(stab)plot(stab)## PCA modelsif (requireNamespace("elasticnet", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateComponents(pk = c(5, 3, 4)) plot(simul) # Stability selection stab <- BiSelection( xdata = simul$data, ncomp = 3, implementation = SparsePCA ) CalibrationPlot(stab) summary(stab) SelectedVariables(stab)}## PLS modelsif (requireNamespace("sgPLS", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(n = 50, pk = c(10, 20, 30), family = "gaussian") # Stability selection stab <- BiSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", ncomp = 3, implementation = SparsePLS ) CalibrationPlot(stab) summary(stab) plot(stab)}par(oldpar)Adjacency matrix from object
Description
Returns the adjacency matrix from anigraph object or from the output ofsimulation or inference functions from the present package.
Usage
AdjacencyFromObject(object)Arguments
object | input object. |
Value
A vector without missing values or NULL.
Summarised coefficients conditionally on selection
Description
Computes descriptive statistics (defined byFUN) for coefficients ofthe (calibrated) models conditionally on selection across resamplingiterations.
Usage
AggregatedEffects( stability, lambda_id = NULL, side = "X", comp = 1, FUN = stats::median, ...)Arguments
stability | output of |
lambda_id | parameter ID with respect to the grid |
side | character string indicating if coefficients of predictors( |
comp | component ID. Only applicable to PLS models. |
FUN | function to use to aggregate coefficients of visited models overresampling iterations. Recommended functions include |
... | additional arguments to be passed to |
Value
A matrix of summarised coefficients conditionally on selection acrossresampling iterations. Missing values (NA) are returned forvariables that are never selected.
See Also
VariableSelection,BiSelection,Refit
Examples
# Example with univariate outcomeset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")median_betas <- AggregatedEffects(stab)# Comparison with refitted modelrefitted <- Refit(xdata = simul$xdata, ydata = simul$ydata, stability = stab)refitted_betas <- coef(refitted)[-1, 1]plot(median_betas[names(refitted_betas), ], refitted_betas, panel.first = abline(0, 1, lty = 2))# Extracting mean betas conditionally on selectionmean_betas <- AggregatedEffects(stab, FUN = mean)plot(median_betas, mean_betas)# Regression with multivariate outcomesset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, q = 2, family = "gaussian")stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "mgaussian")median_betas <- AggregatedEffects(stab)dim(median_betas)# Sparse PLS with multivariate outcomeif (requireNamespace("sgPLS", quietly = TRUE)) { set.seed(1) simul <- SimulateRegression(n = 50, pk = 15, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), implementation = SparsePLS ) median_betas <- AggregatedEffects(stab) dim(median_betas) median_betas <- AggregatedEffects(stab, side = "Y") dim(median_betas)}Calibrated hyper-parameter(s)
Description
Extracts the calibrated hyper-parameters (or their indices forArgmaxId) with respect to the grids provided inLambdaandpi_list in argumentstability.
Usage
ArgmaxId(stability = NULL, S = NULL)Argmax(stability)Arguments
stability | output of |
S | matrix of stability scores obtained with different combinations ofparameters where rows correspond to different values of the parametercontrolling the level of sparsity in the underlying feature selectionalgorithm and columns correspond to different values of the threshold inselection proportions. If |
Value
A matrix of hyper-parameters (Argmax) or indices(ArgmaxId). For multi-block graphical models, rows correspondto different blocks.
See Also
VariableSelection,GraphicalModel
Examples
# Data simulationset.seed(1)simul <- SimulateGraphical(pk = 20)# Stability selectionstab <- GraphicalModel(xdata = simul$data)# Extracting calibrated hyper-parametersArgmax(stab)# Extracting calibrated hyper-parameters IDsids <- ArgmaxId(stab)ids# Relationship between the two functionsstab$Lambda[ids[1], 1]stab$params$pi_list[ids[2]]Stability selection of predictors and/or outcomes
Description
Performs stability selection for dimensionality reduction. The underlyingvariable selection algorithm (e.g. sparse PLS) is run with differentcombinations of parameters controlling the sparsity (e.g. number of selectedvariables per component) and thresholds in selection proportions. Thesehyper-parameters are jointly calibrated by maximisation of the stabilityscore.
Usage
BiSelection( xdata, ydata = NULL, group_x = NULL, group_y = NULL, LambdaX = NULL, LambdaY = NULL, AlphaX = NULL, AlphaY = NULL, ncomp = 1, scale = TRUE, pi_list = seq(0.01, 0.99, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = NULL, family = "gaussian", implementation = SparsePLS, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, n_cores = 1, output_data = FALSE, verbose = TRUE, beep = NULL, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
group_x | vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. Only used formodels with group penalisation (e.g. |
group_y | optional vector encoding the grouping structure amongoutcomes. This argument indicates the number of variables in each group.Only used if |
LambdaX | matrix of parameters controlling the number of selectedvariables (for sparse PCA/PLS) or groups (for group and sparse group PLS)in X. |
LambdaY | matrix of parameters controlling the number of selectedvariables (for sparse PLS) or groups (for group or sparse group PLS) in Y.Only used if |
AlphaX | matrix of parameters controlling the level of sparsity withingroups in X. Only used if |
AlphaY | matrix of parameters controlling the level of sparsity withingroups in X. Only used if |
ncomp | number of components. |
scale | logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
family | type of PLS model. This parameter must be set to |
implementation | function to use for feature selection. Possiblefunctions are: |
resampling | resampling approach. Possible values are: |
cpss | logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is split |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
n_cores | number of cores to use for parallel computing (see argument |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should beprinted. |
beep | sound indicating the end of the run. Possible values are: |
... | additional parameters passed to the functions provided in |
Details
In stability selection, a feature selection algorithm is fitted onK subsamples (or bootstrap samples) of the data with differentparameters controlling the sparsity (LambdaX,LambdaY,AlphaX, and/orAlphaY). For a given (set of) sparsityparameter(s), the proportion out of theK models in which eachfeature is selected is calculated. Features with selection proportionsabove a threshold pi are considered stably selected. The stabilityselection model is controlled by the sparsity parameter(s) (denoted by\lambda) for the underlying algorithm, and the threshold in selectionproportion:
V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}
For sparse and sparse group dimensionality reduction, "feature" refers tovariable (variable selection model). For group PLS, "feature" refers togroup (group selection model). For (sparse) group PLS, groups need to bedefineda priori and specified in argumentsgroup_x and/orgroup_y.
These parameters can be calibrated by maximisation of a stability score(seeConsensusScore ifn_cat=NULL orStabilityScore otherwise) calculated under the nullhypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully tocheck that the grids of parametersLambda andpi_list do notrestrict the calibration to a region that would not include the globalmaximum (seeCalibrationPlot). In particular, the gridLambda may need to be extended when the maximum stability isobserved on the left or right edges of the calibration heatmap. In someinstances, multiple peaks of stability score can be observed. Simulationstudies suggest that the peak corresponding to the largest number ofselected features tend to give better selection performances. This is notnecessarily the highest peak (which is automatically retained by thefunctions in this package). The user can decide to manually choose anotherpeak.
To control the expected number of False Positives (Per Family Error Rate)in the results, a thresholdPFER_thr can be specified. Theoptimisation problem is then constrained to sets of parameters thatgenerate models with an upper-bound in PFER belowPFER_thr (seeMeinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
Possible resampling procedures include defining (i)K subsamples ofa proportiontau of the observations, (ii)K bootstrap sampleswith the full sample size (obtained with replacement), and (iii)K/2splits of the data in half for complementary pair stability selection (seeargumentsresampling andcpss). In complementary pairstability selection, a feature is considered selected at a given resamplingiteration if it is selected in the two complementary subsamples.
For categorical outcomes (argumentfamily is"binomial" or"multinomial"), the proportions of observations from each categoryin all subsamples or bootstrap samples are the same as in the full sample.
To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.
For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession.
Value
An object of classbi_selection. A list with:
summary | amatrix of the best stability scores and corresponding parameterscontrolling the level of sparsity in the underlying algorithm for differentnumbers of components. Possible columns include: |
summary_full | a matrix of the best stability scores for differentcombinations of parameters controlling the sparsity and components. |
selectedX | a binary matrix encoding stably selected predictors. |
selpropX | a matrix of calibrated selection proportions forpredictors. |
selectedY | a binary matrix encoding stably selectedoutcomes. Only returned for PLS models. |
selpropY | a matrix ofcalibrated selection proportions for outcomes. Only returned for PLSmodels. |
selected | a binary matrix encoding stable relationshipsbetween predictor and outcome variables. Only returned for PLS models. |
selectedX_full | a binary matrix encoding stably selected predictors. |
selpropX_full | a matrix of selection proportions for predictors. |
selectedY_full | a binary matrix encoding stably selected outcomes.Only returned for PLS models. |
selpropY_full | a matrix of selectionproportions for outcomes. Only returned for PLS models. |
coefX | anarray of estimated loadings coefficients for the different components(rows), for the predictors (columns), as obtained across the |
coefY | an array ofestimated loadings coefficients for the different components (rows), forthe outcomes (columns), as obtained across the |
method | alist with |
params | a list with values usedfor arguments |
The rows ofsummary and columns ofselectedX,selectedY,selpropX,selpropY,selected,coefX andcoefY are ordered in the same way and correspond to components andparameter values stored insummary. The rows ofsummary_fulland columns ofselectedX_full,selectedY_full,selpropX_full andselpropY_full are ordered in the same wayand correspond to components and parameter values stored insummary_full.
References
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.
Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.
Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016).“Group and sparse group partial least square approaches applied in genomics context.”Bioinformatics,32(1), 35-42.ISSN 1367-4803,doi:10.1093/bioinformatics/btv535.
KA LC, Rossouw D, Robert-Granié C, Besse P (2008).“A sparse PLS for variable selection when integrating omics data.”Stat Appl Genet Mol Biol,7(1), Article 35.ISSN 1544-6115,doi:10.2202/1544-6115.1390.
Shen H, Huang JZ (2008).“Sparse principal component analysis via regularized low rank matrix approximation.”Journal of Multivariate Analysis,99(6), 1015-1034.ISSN 0047-259X,doi:10.1016/j.jmva.2007.06.007.
Zou H, Hastie T, Tibshirani R (2006).“Sparse Principal Component Analysis.”Journal of Computational and Graphical Statistics,15(2), 265-286.doi:10.1198/106186006X113430.
See Also
SparsePCA,SparsePLS,GroupPLS,SparseGroupPLS,VariableSelection,Resample,StabilityScore
Other stability functions:Clustering(),GraphicalModel(),StructuralModel(),VariableSelection()
Examples
if (requireNamespace("sgPLS", quietly = TRUE)) { oldpar <- par(no.readonly = TRUE) par(mar = c(12, 5, 1, 1)) ## Sparse Principal Component Analysis # Data simulation set.seed(1) simul <- SimulateComponents(pk = c(5, 3, 4)) # sPCA: sparsity on X (unsupervised) stab <- BiSelection( xdata = simul$data, ncomp = 2, LambdaX = seq_len(ncol(simul$data) - 1), implementation = SparsePCA ) print(stab) # Calibration plot CalibrationPlot(stab) # Visualisation of the results summary(stab) plot(stab) SelectedVariables(stab) ## Sparse (Group) Partial Least Squares # Data simulation (continuous outcomes) set.seed(1) simul <- SimulateRegression(n = 100, pk = 15, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # sPLS: sparsity on X stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), implementation = SparsePLS ) CalibrationPlot(stab) summary(stab) plot(stab) # sPLS: sparsity on both X and Y stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), LambdaY = seq_len(ncol(y) - 1), implementation = SparsePLS, n_cat = 2 ) CalibrationPlot(stab) summary(stab) plot(stab) # sgPLS: sparsity on X stab <- BiSelection( xdata = x, ydata = y, K = 10, group_x = c(2, 8, 5), family = "gaussian", ncomp = 3, LambdaX = seq_len(2), AlphaX = seq(0.1, 0.9, by = 0.1), implementation = SparseGroupPLS ) CalibrationPlot(stab) summary(stab) par(oldpar)}Binomial probabilities for stability score
Description
Computes the probabilities of observing each category of selectionproportions under the assumption of a uniform selection procedure.
Usage
BinomialProbabilities(q, N, pi, K, n_cat = 3)Arguments
q | average number of features selected by the underlying algorithm. |
N | total number of features. |
pi | threshold in selection proportions. If n_cat=3, these values mustbe >0.5 and <1. If n_cat=2, these values must be >0 and <1. |
K | number of resampling iterations. |
n_cat | computation options for the stability score. Default is |
Value
A list of probabilities for each of the 2 or 3 categories ofselection proportions.
Multi-block grid
Description
Generates a matrix of parameters controlling the sparsity of the underlyingselection algorithm for multi-block calibration.
Usage
BlockLambdaGrid(Lambda, lambda_other_blocks = NULL)Arguments
Lambda | vector or matrix of penalty parameters. |
lambda_other_blocks | optional vector of penalty parameters to use forother blocks in the iterative multi-block procedure. |
Value
A list with:
Lambda | a matrix of (block-specific) penaltyparameters. In multi-block stability selection, rows correspond to sets ofpenalty parameters and columns correspond to different blocks. |
Sequential_template | logical matrix encoding the type of procedurefor data with multiple blocks in stability selection graphical modelling.For multi-block estimation, each block is calibrated separately whileothers blocks are weakly penalised ( |
See Also
Examples
# Multi-block gridLambda <- matrix( c( 0.8, 0.6, 0.3, 0.5, 0.4, 0.2, 0.7, 0.5, 0.1 ), ncol = 3, byrow = TRUE)mygrid <- BlockLambdaGrid(Lambda, lambda_other_blocks = 0.1)# Multi-parameter grid (not recommended)Lambda <- matrix( c( 0.8, 0.6, 0.3, 0.5, 0.4, 0.2, 0.7, 0.5, 0.1 ), ncol = 3, byrow = TRUE)mygrid <- BlockLambdaGrid(Lambda, lambda_other_blocks = NULL)Classification And Regression Trees
Description
Runs decision trees using implementation fromrpart.This function is not using stability.
Usage
CART(xdata, ydata, Lambda = NULL, family, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
Lambda | matrix of parameters controlling the number of splits in thedecision tree. |
family | type of regression model. This argument is defined as in |
... | additional parameters passed to |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors. |
beta_full | array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors. Indicesalong the third dimension correspond to outcome variable(s). |
References
Breiman L, Friedman JH, Olshen R, Stone CJ (1984).Classification and Regression Trees.Wadsworth.
See Also
SelectionAlgo,VariableSelection
Other underlying algorithm functions:ClusteringAlgo(),PenalisedGraphical(),PenalisedOpenMx(),PenalisedRegression()
Examples
if (requireNamespace("rpart", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(pk = 50) # Running the LASSO mycart <- CART( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian" ) head(mycart$selected)}Calibration curve (internal)
Description
Creates a calibration curve for consensus clustering.
Usage
CalibrationCurve( stability, bty = "o", xlab = NULL, ylab = NULL, cex.axis = 1, cex.lab = 1.5, cex.legend = 1.2, pch = 19, lines = TRUE, col = NULL, legend = TRUE, ncol = 1)Arguments
stability | output of |
bty | character string indicating if the box around the plot should bedrawn. Possible values include: |
xlab | label of the x-axis. |
ylab | label of the y-axis. |
cex.axis | font size for axes. |
cex.lab | font size for labels. |
cex.legend | font size for text legend entries. |
pch | type of point, as in |
lines | logical indicating if the points should be linked by lines. Onlyused if |
col | vector of colours. |
legend | logical indicating if the legend should be included. |
ncol | integer indicating the number of columns in the legend. |
Value
a calibration curve.
Calibration plot
Description
Creates a plot showing the stability score as a function of the parameter(s)controlling the level of sparsity in the underlying feature selectionalgorithm and/or the threshold in selection proportions. See examples inVariableSelection,GraphicalModel,Clustering andBiSelection.
Usage
CalibrationPlot( stability, block_id = NULL, col = NULL, pch = 19, cex = 0.7, xlim = NULL, ylim = NULL, bty = "o", lines = TRUE, lty = 3, lwd = 2, show_argmax = TRUE, show_pix = FALSE, show_piy = FALSE, offset = 0.3, legend = TRUE, legend_length = NULL, legend_range = NULL, ncol = 1, xlab = NULL, ylab = NULL, zlab = expression(italic(q)), xlas = 2, ylas = NULL, zlas = 2, cex.lab = 1.5, cex.axis = 1, cex.legend = 1.2, xgrid = FALSE, ygrid = FALSE, params = c("ny", "alphay", "nx", "alphax"))Arguments
stability | output of |
block_id | ID of the block to visualise. Only used for multi-blockstability selection graphical models. If |
col | vector of colours. |
pch | type of point, as in |
cex | size of point. |
xlim | displayed range along the x-axis. Only used if |
ylim | displayed range along the y-axis. Only used if |
bty | character string indicating if the box around the plot should bedrawn. Possible values include: |
lines | logical indicating if the points should be linked by lines. Onlyused if |
lty | line type, as in |
lwd | line width, as in |
show_argmax | logical indicating if the calibrated parameter(s) shouldbe indicated by lines. |
show_pix | logical indicating if the calibrated threshold in selectionproportion in |
show_piy | logical indicating if the calibrated threshold in selectionproportion in |
offset | distance between the point and the text, as in |
legend | logical indicating if the legend should be included. |
legend_length | length of the colour bar. Only used if |
legend_range | range of the colour bar. Only used if |
ncol | integer indicating the number of columns in the legend. |
xlab | label of the x-axis. |
ylab | label of the y-axis. |
zlab | label of the z-axis. Only used if |
xlas | orientation of labels on the x-axis, as |
ylas | orientation of labels on the y-axis, as |
zlas | orientation of labels on the z-axis, as |
cex.lab | font size for labels. |
cex.axis | font size for axes. |
cex.legend | font size for text legend entries. |
xgrid | logical indicating if a vertical grid should be drawn. Only usedif |
ygrid | logical indicating if a horizontal grid should be drawn. Onlyused if |
params | vector of possible parameters if |
Value
A calibration plot.
See Also
VariableSelection,GraphicalModel,Clustering,BiSelection
Checking input data (regression model)
Description
Checks if input data formats are appropriate. For inappropriate inputs, thisfunction (i) fixes the data format, or (ii) stops the run and generates anerror message.
Usage
CheckDataRegression(xdata, ydata = NULL, family = "gaussian", verbose = TRUE)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
family | type of regression model. This argument is defined as in |
verbose | logical indicating if a loading bar and messages should beprinted. |
Checking input parameters (clustering)
Description
Checks if input parameters are valid. For invalid parameters, this function(i) stops the run and generates an error message, or (ii) sets the invalidparameter to its default value and reports it in a warning message.
Usage
CheckInputClustering( xdata, Lambda = NULL, pi_list = seq(0.6, 0.9, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = 3, implementation = HierarchicalClustering, scale = TRUE, resampling = "subsampling", verbose = TRUE)Arguments
xdata | data matrix with observations as rows and variables as columns. |
Lambda | vector of penalty parameters for weighted distance calculation.Only used for distance-based clustering, including for example |
K | number of resampling iterations. |
tau | subsample size. |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
implementation | function to use for clustering. Possible functionsinclude |
scale | logical indicating if the data should be scaled to ensure thatall variables contribute equally to the clustering of the observations. |
verbose | logical indicating if a loading bar and messages should beprinted. |
Checking input parameters (graphical model)
Description
Checks if input parameters are valid. For invalid parameters, this function(i) stops the run and generates an error message, or (ii) sets the invalidparameter to its default value and reports it in a warning message.
Usage
CheckInputGraphical( xdata, pk = NULL, Lambda = NULL, lambda_other_blocks = 0.1, pi_list = seq(0.6, 0.9, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = 3, implementation = PenalisedGraphical, start = "cold", scale = TRUE, resampling = "subsampling", PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, Lambda_cardinal = 50, lambda_max = NULL, lambda_path_factor = 1e-04, max_density = 0.3, verbose = TRUE)Arguments
xdata | data matrix with observations as rows and variables as columns.For multi-block stability selection, the variables in data have to beordered by group. |
pk | optional vector encoding the grouping structure. Only used formulti-block stability selection where |
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
lambda_other_blocks | optional vector of parameters controlling thelevel of sparsity in neighbour blocks for the multi-block procedure. To usejointly a specific set of parameters for each block, |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
implementation | function to use for graphical modelling. If |
start | character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters ( |
scale | logical indicating if the correlation ( |
resampling | resampling approach. Possible values are: |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
Lambda_cardinal | number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used if |
lambda_max | optional maximum value for the grid in penalty parameters.If |
lambda_path_factor | multiplicative factor used to define the minimumvalue in the grid. |
max_density | threshold on the density. The grid is defined such thatthe density of the estimated graph does not exceed max_density. |
verbose | logical indicating if a loading bar and messages should beprinted. |
Checking that a package is installed
Description
Checks if a package is installed and returns an error message if not.
Usage
CheckPackageInstalled(package)Arguments
package | character string indicating the name of the package. |
Checking input parameters (regression model)
Description
Checks if input parameters are valid. For invalid parameters, this function(i) stops the run and generates an error message, or (ii) sets the invalidparameter to its default value and reports it in a warning message.
Usage
CheckParamRegression( Lambda = NULL, pi_list = seq(0.6, 0.9, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = NULL, family = "gaussian", implementation = PenalisedRegression, resampling = "subsampling", PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, Lambda_cardinal = 100, verbose = TRUE)Arguments
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
family | type of regression model. This argument is defined as in |
implementation | function to use for variable selection. Possiblefunctions are: |
resampling | resampling approach. Possible values are: |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
Lambda_cardinal | number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used if |
verbose | logical indicating if a loading bar and messages should beprinted. |
Consensus clustering
Description
Performs consensus (weighted) clustering. The underlying algorithm (e.g.hierarchical clustering) is run with different number of clustersnc.In consensus weighed clustering, weighted distances are calculated using thecosa2 algorithm with different penalty parametersLambda. The hyper-parameters are calibrated by maximisation of theconsensus score.
Usage
Clustering( xdata, nc = NULL, eps = NULL, Lambda = NULL, K = 100, tau = 0.5, seed = 1, n_cat = 3, implementation = HierarchicalClustering, scale = TRUE, linkage = "complete", row = TRUE, optimisation = c("grid_search", "nloptr"), n_cores = 1, output_data = FALSE, verbose = TRUE, beep = NULL, ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
eps | radius in density-based clustering, see |
Lambda | vector of penalty parameters for weighted distance calculation.Only used for distance-based clustering, including for example |
K | number of resampling iterations. |
tau | subsample size. |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
implementation | function to use for clustering. Possible functionsinclude |
scale | logical indicating if the data should be scaled to ensure thatall variables contribute equally to the clustering of the observations. |
linkage | character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude |
row | logical indicating if rows (if |
optimisation | character string indicating the type of optimisationmethod to calibrate the regularisation parameter (only used if |
n_cores | number of cores to use for parallel computing (see argument |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should beprinted. |
beep | sound indicating the end of the run. Possible values are: |
... | additional parameters passed to the functions provided in |
Details
In consensus clustering, a clustering algorithm is applied onK subsamples of the observations with different numbers of clustersprovided innc. Ifrow=TRUE (the default), the observations(rows) are the items to cluster. Ifrow=FALSE, the variables(columns) are the items to cluster. For a given number of clusters, theconsensus matrixcoprop stores the proportion of iterations wheretwo items were in the same estimated cluster, out of all iterations whereboth items were drawn in the subsample.
Stable cluster membership is obtained by applying a distance-basedclustering method using(1-coprop) as distance (seeClusters).
These parameters can be calibrated by maximisation of a stability score(seeConsensusScore) calculated under the null hypothesis ofequiprobability of co-membership.
It is strongly recommended to examine the calibration plot (seeCalibrationPlot) to check that there is a clear maximum. Theabsence of a clear maximum suggests that the clustering is not stable,consensus clustering outputs should not be trusted in that case.
To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.
For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession.
Value
An object of classclustering. A list with:
Sc | a matrixof the best stability scores for different (sets of) parameters controllingthe number of clusters and penalisation of attribute weights. |
nc | amatrix of numbers of clusters. |
Lambda | a matrix of regularisationparameters for attribute weights. |
Q | a matrix of the average numberof selected attributes by the underlying algorithm with differentregularisation parameters. |
coprop | an array of consensus matrices.Rows and columns correspond to items. Indices along the third dimensioncorrespond to different parameters controlling the number of clusters andpenalisation of attribute weights. |
selprop | an array of selectionproportions. Columns correspond to attributes. Rows correspond to differentparameters controlling the number of clusters and penalisation of attributeweights. |
method | a list with |
params | a list with values used for arguments |
The rows ofSc,nc,Lambda,Q,selprop and indices along the thirddimension ofcoprop are ordered in the same way and correspond toparameter values stored innc andLambda.
References
Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.
Bodinier B, Vuckovic D, Rodrigues S, Filippi S, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration of consensus weighted distance-based clustering approaches using sharp.”Bioinformatics, btad635.ISSN 1367-4811,doi:10.1093/bioinformatics/btad635, https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad635/52191190/btad635.pdf.
Kampert MM, Meulman JJ, Friedman JH (2017).“rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.”Journal of Classification,34(3), 514–547.doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004).“Clustering objects on subsets of attributes (with discussion).”Journal of the Royal Statistical Society: Series B (Statistical Methodology),66(4), 815-849.doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x,https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
Monti S, Tamayo P, Mesirov J, Golub T (2003).“Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.”Machine Learning,52(1), 91–118.doi:10.1023/A:1023949509487.
See Also
Resample,ConsensusScore,HierarchicalClustering,PAMClustering,KMeansClustering,GMMClustering
Other stability functions:BiSelection(),GraphicalModel(),StructuralModel(),VariableSelection()
Examples
# Consensus clusteringset.seed(1)simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1, ev_xc = 0.5)stab <- Clustering(xdata = simul$data)print(stab)CalibrationPlot(stab)summary(stab)Clusters(stab)plot(stab)# Consensus weighted clusteringif (requireNamespace("rCOSA", quietly = TRUE)) { set.seed(1) simul <- SimulateClustering( n = c(30, 30, 30), pk = 20, theta_xc = c(rep(1, 10), rep(0, 10)), ev_xc = 0.9 ) stab <- Clustering( xdata = simul$data, Lambda = LambdaSequence(lmin = 0.1, lmax = 10, cardinal = 10), noit = 20, niter = 10 ) print(stab) CalibrationPlot(stab) summary(stab) Clusters(stab) plot(stab) WeightBoxplot(stab)}(Weighted) clustering algorithm
Description
Runs the (weighted) clustering algorithm specified in the argumentimplementation and returns matrices of variable weights, and theco-membership structure. This function is not using stability.
Usage
ClusteringAlgo( xdata, nc = NULL, eps = NULL, Lambda = NULL, scale = TRUE, row = TRUE, implementation = HierarchicalClustering, ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
eps | radius in density-based clustering, see |
Lambda | vector of penalty parameters. |
scale | logical indicating if the data should be scaled to ensure thatall variables contribute equally to the clustering of the observations. |
row | logical indicating if rows (if |
implementation | function to use for clustering. Possible functionsinclude |
... | additional parameters passed to the function provided in |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors. |
weight | array of model coefficients. Rows correspond todifferent model parameters. Columns correspond to predictors. Indices alongthe third dimension correspond to outcome variable(s). |
comembership | array of model coefficients. Rows correspond todifferent model parameters. Columns correspond to predictors. Indices alongthe third dimension correspond to outcome variable(s). |
See Also
Other underlying algorithm functions:CART(),PenalisedGraphical(),PenalisedOpenMx(),PenalisedRegression()
Examples
# Simulation of 15 observations belonging to 3 groupsset.seed(1)simul <- SimulateClustering( n = c(5, 5, 5), pk = 100)# Running hierarchical clusteringmyclust <- ClusteringAlgo( xdata = simul$data, nc = 2:5, implementation = HierarchicalClustering)Clustering performance
Description
Computes different metrics of clustering performance by comparing true andpredicted co-membership. This function can only be used in simulation studies(i.e. when the true cluster membership is known).
Usage
ClusteringPerformance(theta, theta_star, ...)Arguments
theta | output from |
theta_star | output from |
... | additional arguments to be passed to |
Value
A matrix of selection metrics including:
TP | number of True Positives (TP) |
FN | number of FalseNegatives (TN) |
FP | number of False Positives (FP) |
TN | numberof True Negatives (TN) |
sensitivity | sensitivity, i.e. TP/(TP+FN) |
specificity | specificity, i.e. TN/(TN+FP) |
accuracy | accuracy,i.e. (TP+TN)/(TP+TN+FP+FN) |
precision | precision (p), i.e.TP/(TP+FP) |
recall | recall (r), i.e. TP/(TP+FN) |
F1_score | F1-score, i.e. 2*p*r/(p+r) |
rand | Rand Index, i.e.(TP+TN)/(TP+FP+TN+FN) |
ari | Adjusted Rand Index (ARI), i.e.2*(TP*TN-FP*FN)/((TP+FP)*(TN+FP)+(TP+FN)*(TN+FN)) |
jaccard | Jaccardindex, i.e. TP/(TP+FP+FN) |
See Also
Other functions for model performance:SelectionPerformance(),SelectionPerformanceGraph()
Examples
# Data simulationset.seed(1)simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1)plot(simul)# Consensus clusteringstab <- Clustering( xdata = simul$data, nc = seq_len(5))# Clustering performanceClusteringPerformance(stab, simul)# Alternative formulationClusteringPerformance( theta = CoMembership(Clusters(stab)), theta_star = simul$theta)Pairwise co-membership
Description
Generates a symmetric and binary matrix indicating, if two items areco-members, i.e. belong to the same cluster.
Usage
CoMembership(groups)Arguments
groups | vector of group membership. |
Value
A symmetric and binary matrix.
Examples
# Simulated grouping structuremygroups <- c(rep(1, 3), rep(2, 5), rep(3, 2))# Co-membership matrixCoMembership(mygroups)Model coefficients
Description
Extracts the coefficients of visited models at different resamplingiterations of a stability selection run. This function can be applied to theoutput ofVariableSelection.
Usage
Coefficients(stability, side = "X", comp = 1, iterations = NULL)Arguments
stability | output of |
side | character string indicating if coefficients of the predictor( |
comp | component ID. Only applicable to PLS models. |
iterations | vector of iteration IDs. If |
See Also
Merging stability selection outputs
Description
Merges the outputs from two runs ofVariableSelection,GraphicalModel orClustering. The two runs musthave been done using the samemethods and the sameparams butwith differentseeds. The combined output will contain results basedon iterations from bothstability1 andstability2. Thisfunction can be used for parallelisation.
Usage
Combine(stability1, stability2, include_beta = TRUE)Arguments
stability1 | output from a first run of |
stability2 | output from a second run of |
include_beta | logical indicating if the beta coefficients of visitedmodels should be concatenated. Only applicable to variable selection orclustering. |
Value
A single output of the same format.
See Also
VariableSelection,GraphicalModel
Examples
## Variable selection# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")# Two runsstab1 <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, seed = 1, K = 10)stab2 <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, seed = 2, K = 10)# Merging the outputsstab <- Combine(stability1 = stab1, stability2 = stab2, include_beta = FALSE)str(stab)## Graphical modelling# Data simulationsimul <- SimulateGraphical(pk = 20)# Two runsstab1 <- GraphicalModel(xdata = simul$data, seed = 1, K = 10)stab2 <- GraphicalModel(xdata = simul$data, seed = 2, K = 10)# Merging the outputsstab <- Combine(stability1 = stab1, stability2 = stab2)str(stab)## Clustering# Data simulationsimul <- SimulateClustering(n = c(15, 15, 15))# Two runsstab1 <- Clustering(xdata = simul$data, seed = 1)stab2 <- Clustering(xdata = simul$data, seed = 2)# Merging the outputsstab <- Combine(stability1 = stab1, stability2 = stab2)str(stab)Concatenate stability objects
Description
Generates a single stability object from two stability objects. This functionis used to concatenate results when usingnloptr.
Usage
Concatenate(stability1, stability2 = NULL, order_output = FALSE)Arguments
stability1 | a stability object. |
stability2 | another stability object (optional). |
Value
A single stability object.
Consensus score
Description
Computes the consensus score from the consensus matrix, matrix of co-samplingcounts and consensus clusters. The score is a z statistic for the comparisonof the co-membership proportions observed within and between the consensusclusters.
Usage
ConsensusScore(prop, K, theta)Arguments
prop | consensus matrix. |
K | matrix of co-sampling counts. |
theta | consensus co-membership matrix. |
Details
To calculate the consensus score, the features are classified as being stablyselected or not (in selection) or as being in the same consensus cluster ornot (in clustering). In selection, the quantitiesX_w andX_b aredefined as the sum of the selection counts for features that are stablyselected or not, respectively. In clustering, the quantitiesX_w andX_b are defined as the sum of the co-membership counts for pairs ofitems in the same consensus cluster or in different consensus clusters,respectively.
Conditionally on this classification, and under the assumption that theselection (or co-membership) probabilities are the same for all features (oritem pairs) in each of these two categories, the quantitiesX_w andX_b follow binomial distributions with probabilitiesp_w andp_b, respectively.
In the most unstable situation, we suppose that all features (or item pairs)would have the same probability of being selected (or co-members). Theconsensus score is the z statistic from a z test where the null hypothesis isp_w \leq p_b.
The consensus score increases with stability.
Value
A consensus score.
See Also
Other stability metric functions:FDP(),PFER(),StabilityMetrics(),StabilityScore()
Examples
# Data simulationset.seed(2)simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1)plot(simul)# Consensus clusteringstab <- Clustering( xdata = simul$data)stab$Sc[3]# Calculating the consensus scoretheta <- CoMembership(Clusters(stab, argmax_id = 3))ConsensusScore( prop = (stab$coprop[, , 3])[upper.tri(stab$coprop[, , 3])], K = stab$sampled_pairs[upper.tri(stab$sampled_pairs)], theta = theta[upper.tri(theta)])(Weighted) density-based clustering
Description
Runs Density-Based Spatial Clustering of Applications with Noise (DBSCAN)clustering using implementation fromdbscan. This isalso known as the k-medoids algorithm. IfLambda is provided,clustering is applied on the weighted distance matrix calculated using theCOSA algorithm as implemented incosa2. Otherwise,distances are calculated usingdist. This function isnot using stability.
Usage
DBSCANClustering( xdata, nc = NULL, eps = NULL, Lambda = NULL, distance = "euclidean", ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
eps | radius in density-based clustering, see |
Lambda | vector of penalty parameters (see argument |
distance | character string indicating the type of distance to use. If |
... | additional parameters passed to |
Value
A list with:
comembership | an array of binary and symmetricco-membership matrices. |
weights | a matrix of median weights byfeature. |
References
Kampert MM, Meulman JJ, Friedman JH (2017).“rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.”Journal of Classification,34(3), 514–547.doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004).“Clustering objects on subsets of attributes (with discussion).”Journal of the Royal Statistical Society: Series B (Statistical Methodology),66(4), 815-849.doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x,https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
See Also
Other clustering algorithms:GMMClustering(),HierarchicalClustering(),KMeansClustering(),PAMClustering()
Examples
if (requireNamespace("dbscan", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateClustering(n = c(10, 10), pk = 50) plot(simul) # DBSCAN clustering myclust <- DBSCANClustering( xdata = simul$data, eps = seq(0, 2 * sqrt(ncol(simul$data) - 1), by = 0.1) ) # Weighted PAM clustering (using COSA) if (requireNamespace("rCOSA", quietly = TRUE)) { myclust <- DBSCANClustering( xdata = simul$data, eps = c(0.25, 0.5, 0.75), Lambda = c(0.2, 0.5) ) }}Categorical from dummy variables
Description
Generates a single categorical variable from corresponding dummy variables.
Usage
DummyToCategories(x, verbose = FALSE)Arguments
x | matrix of dummy variables. |
verbose | logical indicating if messages should be printed. |
Value
A single categorical variable (numeric).
Ensemble model
Description
Creates an ensemble predictive model fromVariableSelectionoutputs.
Usage
Ensemble(stability, xdata, ydata)Arguments
stability | output of |
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
Value
An object of classensemble_model. A list with:
intercept | a vector of refitted intercepts for the |
beta | a matrix of beta coefficients from the |
models | a list of |
family | type of regression, extracted from |
See Also
Other ensemble model functions:EnsemblePredictions()
Examples
# Linear regressionset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")ensemble <- Ensemble(stability = stab, xdata = simul$xdata, ydata = simul$ydata)# Logistic regressionset.seed(1)simul <- SimulateRegression(n = 200, pk = 20, family = "binomial")stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "binomial")ensemble <- Ensemble(stability = stab, xdata = simul$xdata, ydata = simul$ydata)Predictions from ensemble model
Description
Makes predictions using an ensemble model created fromVariableSelection outputs. For each observation inxdata, the predictions are calculated as the average predicted valuesobtained for that observation over theK models fitted in calibratedstability selection.
Usage
EnsemblePredictions(ensemble, xdata, ...)Arguments
ensemble | output of |
xdata | matrix of predictors with observations as rows and variables ascolumns. |
... | additional parameters passed to |
Value
A matrix of predictions computed from the observations inxdata.
See Also
Other ensemble model functions:Ensemble()
Examples
# Data simulationset.seed(1)simul <- SimulateRegression(n = 1000, pk = 50, family = "gaussian")# Training/test splitids <- Split(data = simul$ydata, tau = c(0.8, 0.2))stab <- VariableSelection( xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ])# Constructing the ensemble modelensemble <- Ensemble( stability = stab, xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ])# Making predictionsyhat <- EnsemblePredictions( ensemble = ensemble, xdata = simul$xdata[ids[[2]], ])# Calculating Q-squaredcor(simul$ydata[ids[[2]], ], yhat)^2Prediction performance in regression
Description
Calculates model performance for linear (measured by Q-squared), logistic(AUC) or Cox (C-statistic) regression. This is done by (i) refitting themodel on a training set including a proportiontau of theobservations, and (ii) evaluating the performance on the remainingobservations (test set). For more reliable results, the procedure can berepeatedK times (defaultK=1).
Usage
ExplanatoryPerformance( xdata, ydata, new_xdata = NULL, new_ydata = NULL, stability = NULL, family = NULL, implementation = NULL, prediction = NULL, resampling = "subsampling", K = 1, tau = 0.8, seed = 1, n_thr = NULL, time = 1000, verbose = FALSE, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
new_xdata | optional test set (predictor data). |
new_ydata | optional test set (outcome data). |
stability | output of |
family | type of regression model. Possible values include |
implementation | optional function to refit the model. If |
prediction | optional function to compute predicted values from themodel refitted with |
resampling | resampling approach to create the training set. The defaultis |
K | number of training-test splits. Only used if |
tau | proportion of observations used in the training set. Only used if |
seed | value of the seed to ensure reproducibility of the results. Onlyused if |
n_thr | number of thresholds to use to construct the ROC curve. If |
time | numeric indicating the time for which the survival probabilitiesare computed. Only applicable to Cox regression. |
verbose | logical indicating if a loading bar and messages should beprinted. |
... | additional parameters passed to the function provided in |
Details
For a fair evaluation of the prediction performance, the data issplit into a training set (including a proportiontau of theobservations) and test set (remaining observations). The regression modelis fitted on the training set and applied on the test set. Performancemetrics are computed in the test set by comparing predicted and observedoutcomes.
For logistic regression, a Receiver Operating Characteristic (ROC) analysisis performed: the True and False Positive Rates (TPR and FPR), and AreaUnder the Curve (AUC) are computed for different thresholds in predictedprobabilities.
For Cox regression, the Concordance Index (as implemented inconcordance) looking at survival probabilities upto a specifictime is computed.
For linear regression, the squared correlation between predicted andobserved outcome in the test set (Q-squared) is reported.
Value
A list with:
TPR | True Positive Rate (for logistic regressiononly). |
FPR | False Positive Rate (for logistic regression only). |
AUC | Area Under the Curve (for logistic regression only). |
concordance | Concordance index (for Cox regression only). |
Beta | matrix of estimated beta coefficients across the |
See Also
Other prediction performance functions:Incremental()
Examples
# Data simulationset.seed(1)simul <- SimulateRegression( n = 1000, pk = 20, family = "binomial", ev_xy = 0.8)# Data split: selection, training and test setids <- Split( data = simul$ydata, family = "binomial", tau = c(0.4, 0.3, 0.3))xselect <- simul$xdata[ids[[1]], ]yselect <- simul$ydata[ids[[1]], ]xtrain <- simul$xdata[ids[[2]], ]ytrain <- simul$ydata[ids[[2]], ]xtest <- simul$xdata[ids[[3]], ]ytest <- simul$ydata[ids[[3]], ]# Stability selectionstab <- VariableSelection( xdata = xselect, ydata = yselect, family = "binomial")# Performances in test set of model refitted in training setroc <- ExplanatoryPerformance( xdata = xtrain, ydata = ytrain, new_xdata = xtest, new_ydata = ytest, stability = stab)plot(roc)roc$AUC# Alternative with multiple training/test splitsroc <- ExplanatoryPerformance( xdata = rbind(xtrain, xtest), ydata = c(ytrain, ytest), stability = stab, K = 100)plot(roc)boxplot(roc$AUC)# Partial Least Squares Discriminant Analysisif (requireNamespace("sgPLS", quietly = TRUE)) { stab <- VariableSelection( xdata = xselect, ydata = yselect, implementation = SparsePLS, family = "binomial" ) # Defining wrapping functions for predictions from PLS-DA PLSDA <- function(xdata, ydata, family = "binomial") { model <- mixOmics::plsda(X = xdata, Y = as.factor(ydata), ncomp = 1) return(model) } PredictPLSDA <- function(xdata, model) { xdata <- xdata[, rownames(model$loadings$X), drop = FALSE] predicted <- predict(object = model, newdata = xdata)$predict[, 2, 1] return(predicted) } # Performances with custom models roc <- ExplanatoryPerformance( xdata = rbind(xtrain, xtest), ydata = c(ytrain, ytest), stability = stab, K = 100, implementation = PLSDA, prediction = PredictPLSDA ) plot(roc)}False Discovery Proportion
Description
Computes the False Discovery Proportion (upper-bound) as a ratio of the PFER(upper-bound) over the number of stably selected features. In stabilityselection, the FDP corresponds to the expected proportion of stably selectedfeatures that are not relevant to the outcome (i.e. proportion of FalsePositives among stably selected features).
Usage
FDP(selprop, PFER, pi)Arguments
selprop | matrix or vector of selection proportions. |
PFER | Per Family Error Rate. |
pi | threshold in selection proportions. |
Value
The estimated upper-bound in FDP.
See Also
Other stability metric functions:ConsensusScore(),PFER(),StabilityMetrics(),StabilityScore()
Examples
# Simulating set of selection proportionsselprop <- round(runif(n = 20), digits = 2)# Computing the FDP with a threshold of 0.8fdp <- FDP(PFER = 3, selprop = selprop, pi = 0.8)Splitting observations into folds
Description
Generates a list ofn_folds non-overlapping sets of observation IDs(folds).
Usage
Folds(data, family = NULL, n_folds = 5)Arguments
data | vector or matrix of data. In regression, this should be theoutcome data. |
family | type of regression model. This argument is defined as in |
n_folds | number of folds. |
Details
For categorical outcomes (i.e.family argument is set to"binomial","multinomial" or"cox"), the split is donesuch that the proportion of observations from each of the categories ineach of the folds is representative of that of the full sample.
Value
A list of lengthn_folds with sets of non-overlappingobservation IDs.
Examples
# Splitting into 5 foldssimul <- SimulateRegression()ids <- Folds(data = simul$ydata)lapply(ids, length)# Balanced folds with respect to a binary variablesimul <- SimulateRegression(family = "binomial")ids <- Folds(data = simul$ydata, family = "binomial")lapply(ids, FUN = function(x) { table(simul$ydata[x, ])})Model-based clustering
Description
Runs clustering with Gaussian Mixture Models (GMM) using implementation fromMclust. This function is not using stability.
Usage
GMMClustering(xdata, nc = NULL, ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
... | additional parameters passed to |
Value
A list with:
comembership | an array of binary and symmetricco-membership matrices. |
weights | a matrix of median weights byfeature. |
See Also
Other clustering algorithms:DBSCANClustering(),HierarchicalClustering(),KMeansClustering(),PAMClustering()
Examples
# Data simulationset.seed(1)simul <- SimulateClustering(n = c(10, 10), pk = 50)# Clustering using Gaussian Mixture Modelsmygmm <- GMMClustering(xdata = simul$data, nc = seq_len(30))Graph visualisation
Description
Produces anigraph object from anadjacency matrix.
Usage
Graph( adjacency, node_label = NULL, node_colour = NULL, node_shape = NULL, edge_colour = "grey60", label_colour = "grey20", mode = "undirected", weighted = FALSE, satellites = FALSE)Arguments
adjacency | adjacency matrix or output of |
node_label | optional vector of node labels. This vector must contain asmany entries as there are rows/columns in the adjacency matrix and must bein the same order (the order is used to assign labels to nodes). |
node_colour | optional vector of node colours. This vector must containas many entries as there are rows/columns in the adjacency matrix and mustbe in the same order (the order is used to assign colours to nodes).Integers, named colours or RGB values can be used. |
node_shape | optional vector of node shapes. This vector must contain asmany entries as there are rows/columns in the adjacency matrix and must bein the same order (the order is used to assign shapes to nodes). Possiblevalues are |
edge_colour | optional character string for edge colour. Integers, namedcolours or RGB values can be used. |
label_colour | optional character string for label colour. Integers,named colours or RGB values can be used. |
mode | character string indicating how the adjacency matrix should beinterpreted. Possible values include |
weighted | indicating if entries of the adjacency matrix should defineedge width. If |
satellites | logical indicating if unconnected nodes (satellites) shouldbe included in the igraph object. |
Details
All functionalities implemented inigraph can be used on the output.These include cosmetic changes for the visualisation, but also varioustools for network analysis (including topological properties and communitydetection).
The R packagevisNetwork offersinteractive network visualisation tools. Anigraph object can easily be convertedto avisNetwork object (seeexample below).
For Cytoscape users, theRCy3 package can be usedto open the network in Cytoscape.
Value
An igraph object.
See Also
Adjacency,GraphicalModel,igraph manual,visNetwork manual,Cytoscape
Examples
## From adjacency matrix# Un-weightedadjacency <- SimulateAdjacency(pk = 20, topology = "scale-free")plot(Graph(adjacency))# Weightedadjacency <- adjacency * runif(prod(dim(adjacency)))adjacency <- adjacency + t(adjacency)plot(Graph(adjacency, weighted = TRUE))# Node colours and shapesplot(Graph(adjacency, weighted = TRUE, node_shape = "star", node_colour = "red"))## From stability selection outputs# Graphical modelset.seed(1)simul <- SimulateGraphical(pk = 20)stab <- GraphicalModel(xdata = simul$data)plot(Graph(stab))# Sparse PLSif (requireNamespace("sgPLS", quietly = TRUE)) { set.seed(1) simul <- SimulateRegression(n = 50, pk = c(5, 5, 5), family = "gaussian") x <- simul$xdata y <- simul$ydata stab <- BiSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), implementation = SparsePLS ) plot(Graph(stab))}## Tools from other packages# Applying some igraph functionalitiesadjacency <- SimulateAdjacency(pk = 20, topology = "scale-free")mygraph <- Graph(adjacency)igraph::degree(mygraph)igraph::betweenness(mygraph)igraph::shortest_paths(mygraph, from = 1, to = 2)igraph::walktrap.community(mygraph)# Interactive view using visNetworkif (requireNamespace("visNetwork", quietly = TRUE)) { vgraph <- mygraph igraph::V(vgraph)$shape <- rep("dot", length(igraph::V(vgraph))) v <- visNetwork::visIgraph(vgraph) mylayout <- as.matrix(v$x$nodes[, c("x", "y")]) mylayout[, 2] <- -mylayout[, 2] plot(mygraph, layout = mylayout)}# Opening in Cytoscape using RCy3if (requireNamespace("RCy3", quietly = TRUE)) { # Make sure that Cytoscape is open before running the following line # RCy3::createNetworkFromIgraph(mygraph)}Edge-wise comparison of two graphs
Description
Generates anigraph object representingthe common and graph-specific edges.
Usage
GraphComparison( graph1, graph2, col = c("tomato", "forestgreen", "navy"), lty = c(2, 3, 1), node_colour = NULL, show_labels = TRUE, ...)Arguments
graph1 | first graph. Possible inputs are: adjacency matrix, or |
graph2 | second graph. |
col | vector of edge colours. The first entry of the vector definesthe colour of edges in |
lty | vector of line types for edges. The order is defined as forargument |
node_colour | optional vector of node colours. This vector must containas many entries as there are rows/columns in the adjacency matrix and mustbe in the same order (the order is used to assign colours to nodes).Integers, named colours or RGB values can be used. |
show_labels | logical indicating if the node labels should be displayed. |
... | additional arguments to be passed to |
Value
An igraph object.
See Also
Examples
# Data simulationset.seed(1)simul1 <- SimulateGraphical(pk = 30)set.seed(2)simul2 <- SimulateGraphical(pk = 30)# Edge-wise comparison of the two graphsmygraph <- GraphComparison( graph1 = simul1, graph2 = simul2)plot(mygraph, layout = igraph::layout_with_kk(mygraph))Graphical model algorithm
Description
Runs the algorithm specified in the argumentimplementation andreturns the estimated adjacency matrix. This function is not using stability.
Usage
GraphicalAlgo( xdata, pk = NULL, Lambda, Sequential_template = NULL, scale = TRUE, implementation = PenalisedGraphical, start = "cold", ...)Arguments
xdata | matrix with observations as rows and variables as columns. |
pk | optional vector encoding the grouping structure. Only used formulti-block stability selection where |
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
Sequential_template | logical matrix encoding the type of procedure touse for data with multiple blocks in stability selection graphicalmodelling. For multi-block estimation, the stability selection model isconstructed as the union of block-specific stable edges estimated while theothers are weakly penalised ( |
scale | logical indicating if the correlation ( |
implementation | function to use for graphical modelling. If |
start | character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters ( |
... | additional parameters passed to the function provided in |
Details
The use of the procedure from Equation (4) or (5) is controlled bythe argument "Sequential_template".
Value
An array with binary and symmetric adjacency matrices along the thirddimension.
See Also
GraphicalModel,PenalisedGraphical
Other wrapping functions:SelectionAlgo()
Examples
# Data simulationset.seed(1)simul <- SimulateGraphical()# Running graphical LASSOmyglasso <- GraphicalAlgo( xdata = simul$data, Lambda = cbind(c(0.1, 0.2)))Stability selection graphical model
Description
Performs stability selection for graphical models. The underlying graphicalmodel (e.g. graphical LASSO) is run with different combinations of parameterscontrolling the sparsity (e.g. penalty parameter) and thresholds in selectionproportions. These two hyper-parameters are jointly calibrated bymaximisation of the stability score.
Usage
GraphicalModel( xdata, pk = NULL, Lambda = NULL, lambda_other_blocks = 0.1, pi_list = seq(0.01, 0.99, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = NULL, implementation = PenalisedGraphical, start = "warm", scale = TRUE, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, Lambda_cardinal = 50, lambda_max = NULL, lambda_path_factor = 0.001, max_density = 0.5, optimisation = c("grid_search", "nloptr"), n_cores = 1, output_data = FALSE, verbose = TRUE, beep = NULL, ...)Arguments
xdata | data matrix with observations as rows and variables as columns.For multi-block stability selection, the variables in data have to beordered by group. |
pk | optional vector encoding the grouping structure. Only used formulti-block stability selection where |
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
lambda_other_blocks | optional vector of parameters controlling thelevel of sparsity in neighbour blocks for the multi-block procedure. To usejointly a specific set of parameters for each block, |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
implementation | function to use for graphical modelling. If |
start | character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters ( |
scale | logical indicating if the correlation ( |
resampling | resampling approach. Possible values are: |
cpss | logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is split |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
Lambda_cardinal | number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used if |
lambda_max | optional maximum value for the grid in penalty parameters.If |
lambda_path_factor | multiplicative factor used to define the minimumvalue in the grid. |
max_density | threshold on the density. The grid is defined such thatthe density of the estimated graph does not exceed max_density. |
optimisation | character string indicating the type of optimisationmethod. With |
n_cores | number of cores to use for parallel computing (see argument |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should beprinted. |
beep | sound indicating the end of the run. Possible values are: |
... | additional parameters passed to the functions provided in |
Details
In stability selection, a feature selection algorithm is fitted onK subsamples (or bootstrap samples) of the data with differentparameters controlling the sparsity (Lambda). For a given (set of)sparsity parameter(s), the proportion out of theK models in whicheach feature is selected is calculated. Features with selection proportionsabove a threshold pi are considered stably selected. The stabilityselection model is controlled by the sparsity parameter(s) for theunderlying algorithm, and the threshold in selection proportion:
V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}
These parameters can be calibrated by maximisation of a stability score(seeConsensusScore ifn_cat=NULL orStabilityScore otherwise) calculated under the nullhypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully tocheck that the grids of parametersLambda andpi_list do notrestrict the calibration to a region that would not include the globalmaximum (seeCalibrationPlot). In particular, the gridLambda may need to be extended when the maximum stability isobserved on the left or right edges of the calibration heatmap. In someinstances, multiple peaks of stability score can be observed. Simulationstudies suggest that the peak corresponding to the largest number ofselected features tend to give better selection performances. This is notnecessarily the highest peak (which is automatically retained by thefunctions in this package). The user can decide to manually choose anotherpeak.
To control the expected number of False Positives (Per Family Error Rate)in the results, a thresholdPFER_thr can be specified. Theoptimisation problem is then constrained to sets of parameters thatgenerate models with an upper-bound in PFER belowPFER_thr (seeMeinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
Possible resampling procedures include defining (i)K subsamples ofa proportiontau of the observations, (ii)K bootstrap sampleswith the full sample size (obtained with replacement), and (iii)K/2splits of the data in half for complementary pair stability selection (seeargumentsresampling andcpss). In complementary pairstability selection, a feature is considered selected at a given resamplingiteration if it is selected in the two complementary subsamples.
To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.
For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession. Alternatively,the function can be run manually with differentseeds and all otherparameters equal. The results can then be combined usingCombine.
The generated network can be converted intoigraph object usingGraph. The R packagevisNetwork can be used forinteractive network visualisation (see examples inGraph).
Value
An object of classgraphical_model. A list with:
S | amatrix of the best stability scores for different (sets of) parameterscontrolling the level of sparsity in the underlying algorithm. |
Lambda | a matrix of parameters controlling the level of sparsity inthe underlying algorithm. |
Q | a matrix of the average number ofselected features by the underlying algorithm with different parameterscontrolling the level of sparsity. |
Q_s | a matrix of the calibratednumber of stably selected features with different parameters controllingthe level of sparsity. |
P | a matrix of calibrated thresholds inselection proportions for different parameters controlling the level ofsparsity in the underlying algorithm. |
PFER | a matrix of upper-boundsin PFER of calibrated stability selection models with different parameterscontrolling the level of sparsity. |
FDP | a matrix of upper-bounds inFDP of calibrated stability selection models with different parameterscontrolling the level of sparsity. |
S_2d | a matrix of stabilityscores obtained with different combinations of parameters. Columnscorrespond to different thresholds in selection proportions. |
PFER_2d | a matrix of upper-bounds in FDP obtained with differentcombinations of parameters. Columns correspond to different thresholds inselection proportions. Only returned if |
FDP_2d | a matrix of upper-bounds in PFER obtained with differentcombinations of parameters. Columns correspond to different thresholds inselection proportions. Only returned if |
selprop | an array of selection proportions. Rows and columnscorrespond to nodes in the graph. Indices along the third dimensioncorrespond to different parameters controlling the level of sparsity in theunderlying algorithm. |
sign | a matrix of signs of Pearson'scorrelations estimated from |
method | a list with |
params | a list with values used for arguments |
The rows ofS,Lambda,Q,Q_s,P,PFER,FDP,S_2d,PFER_2d andFDP_2d, andindices along the third dimension ofselprop are ordered in the sameway and correspond to parameter values stored inLambda. Formulti-block inference, the columns ofS,Lambda,Q,Q_s,P,PFER andFDP, and indices along thethird dimension ofS_2d correspond to the different blocks.
References
Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.
Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.
Friedman J, Hastie T, Tibshirani R (2008).“Sparse inverse covariance estimation with the graphical lasso.”Biostatistics,9(3), 432–441.
See Also
PenalisedGraphical,GraphicalAlgo,LambdaGridGraphical,Resample,StabilityScoreGraph,Adjacency,
Other stability functions:BiSelection(),Clustering(),StructuralModel(),VariableSelection()
Examples
oldpar <- par(no.readonly = TRUE)par(mar = rep(7, 4))## Single-block stability selection# Data simulationset.seed(1)simul <- SimulateGraphical(n = 100, pk = 20, nu_within = 0.1)# Stability selectionstab <- GraphicalModel(xdata = simul$data)print(stab)# Calibration heatmapCalibrationPlot(stab)# Visualisation of the resultssummary(stab)plot(stab)# Extraction of adjacency matrix or igraph objectAdjacency(stab)Graph(stab)## Multi-block stability selection# Data simulationset.seed(1)simul <- SimulateGraphical(pk = c(10, 10))# Stability selectionstab <- GraphicalModel(xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10)print(stab)# Calibration heatmap# par(mfrow = c(1, 3))CalibrationPlot(stab) # Producing three plots# Visualisation of the resultssummary(stab)plot(stab)# Multi-parameter stability selection (not recommended)Lambda <- matrix(c(0.8, 0.6, 0.3, 0.5, 0.4, 0.3, 0.7, 0.5, 0.1), ncol = 3)stab <- GraphicalModel( xdata = simul$data, pk = c(10, 10), Lambda = Lambda, lambda_other_blocks = NULL)stab$Lambda## Example with user-defined function: shrinkage estimation and selection# Data simulationset.seed(1)simul <- SimulateGraphical(n = 100, pk = 20, nu_within = 0.1)if (requireNamespace("corpcor", quietly = TRUE)) { # Writing user-defined algorithm in a portable function ShrinkageSelection <- function(xdata, Lambda, ...) { mypcor <- corpcor::pcor.shrink(xdata, verbose = FALSE) adjacency <- array(NA, dim = c(nrow(mypcor), ncol(mypcor), nrow(Lambda))) for (k in seq_len(nrow(Lambda))) { A <- ifelse(abs(mypcor) >= Lambda[k, 1], yes = 1, no = 0) diag(A) <- 0 adjacency[, , k] <- A } return(list(adjacency = adjacency)) } # Running the algorithm without stability myglasso <- GraphicalAlgo( xdata = simul$data, Lambda = matrix(c(0.05, 0.1), ncol = 1), implementation = ShrinkageSelection ) # Stability selection using shrinkage estimation and selection stab <- GraphicalModel( xdata = simul$data, Lambda = matrix(c(0.01, 0.05, 0.1), ncol = 1), implementation = ShrinkageSelection ) CalibrationPlot(stab) stable_adjacency <- Adjacency(stab)}par(oldpar)Group Partial Least Squares
Description
Runs a group Partial Least Squares model using implementation fromsgPLS-package. This function is not using stability.
Usage
GroupPLS( xdata, ydata, family = "gaussian", group_x, group_y = NULL, Lambda, keepX_previous = NULL, keepY = NULL, ncomp = 1, scale = TRUE, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
family | type of PLS model. If |
group_x | vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. |
group_y | optional vector encoding the grouping structure amongoutcomes. This argument indicates the number of variables in each group. |
Lambda | matrix of parameters controlling the number of selected groupsat current component, as defined by |
keepX_previous | number of selected groups in previous components. Onlyused if |
keepY | number of selected groups of outcome variables. This argument isdefined as in |
ncomp | number of components. |
scale | logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). Onlyused if |
... |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors. |
beta_full | array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors (startingwith "X") or outcomes (starting with "Y") variables for differentcomponents (denoted by "PC"). |
References
Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016).“Group and sparse group partial least square approaches applied in genomics context.”Bioinformatics,32(1), 35-42.ISSN 1367-4803,doi:10.1093/bioinformatics/btv535.
See Also
Other penalised dimensionality reduction functions:SparseGroupPLS(),SparsePCA(),SparsePLS()
Examples
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Group PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running gPLS with 1 group and sparsity of 0.5 mypls <- GroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 25), ) # Running gPLS with groups on outcomes mypls <- GroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 25), group_y = c(2, 1), keepY = 1 )}(Weighted) hierarchical clustering
Description
Runs hierarchical clustering using implementation fromhclust. IfLambda is provided, clustering isapplied on the weighted distance matrix calculated using thecosa2 algorithm. Otherwise, distances are calculatedusingdist. This function is not using stability.
Usage
HierarchicalClustering( xdata, nc = NULL, Lambda = NULL, distance = "euclidean", linkage = "complete", ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
Lambda | vector of penalty parameters (see argument |
distance | character string indicating the type of distance to use. If |
linkage | character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude |
... | additional parameters passed to |
Value
A list with:
comembership | an array of binary and symmetricco-membership matrices. |
weights | a matrix of median weights byfeature. |
References
Kampert MM, Meulman JJ, Friedman JH (2017).“rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.”Journal of Classification,34(3), 514–547.doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004).“Clustering objects on subsets of attributes (with discussion).”Journal of the Royal Statistical Society: Series B (Statistical Methodology),66(4), 815-849.doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x,https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
See Also
Other clustering algorithms:DBSCANClustering(),GMMClustering(),KMeansClustering(),PAMClustering()
Examples
# Data simulationset.seed(1)simul <- SimulateClustering(n = c(10, 10), pk = 50)# Hierarchical clusteringmyhclust <- HierarchicalClustering( xdata = simul$data, nc = seq_len(20))# Weighted Hierarchical clustering (using COSA)if (requireNamespace("rCOSA", quietly = TRUE)) { myhclust <- HierarchicalClustering( xdata = simul$data, weighted = TRUE, nc = seq_len(20), Lambda = c(0.2, 0.5) )}Incremental prediction performance in regression
Description
Computes the prediction performance of regression models where predictors aresequentially added by order of decreasing selection proportion. This functioncan be used to evaluate the marginal contribution of each of the selectedpredictors over and above more stable predictors. Performances are evaluatedas inExplanatoryPerformance.
Usage
Incremental( xdata, ydata, new_xdata = NULL, new_ydata = NULL, stability = NULL, family = NULL, implementation = NULL, prediction = NULL, resampling = "subsampling", n_predictors = NULL, K = 100, tau = 0.8, seed = 1, n_thr = NULL, time = 1000, verbose = TRUE, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
new_xdata | optional test set (predictor data). |
new_ydata | optional test set (outcome data). |
stability | output of |
family | type of regression model. Possible values include |
implementation | optional function to refit the model. If |
prediction | optional function to compute predicted values from themodel refitted with |
resampling | resampling approach to create the training set. The defaultis |
n_predictors | number of predictors to consider. |
K | number of training-test splits. Only used if |
tau | proportion of observations used in the training set. Only used if |
seed | value of the seed to ensure reproducibility of the results. Onlyused if |
n_thr | number of thresholds to use to construct the ROC curve. If |
time | numeric indicating the time for which the survival probabilitiesare computed. Only applicable to Cox regression. |
verbose | logical indicating if a loading bar and messages should beprinted. |
... | additional parameters passed to the function provided in |
Value
An object of classincremental.
For logistic regression, a list with:
FPR | A list with, for each ofthe models (sequentially added predictors), the False Positive Rates fordifferent thresholds (columns) and different data splits (rows). |
TPR | A list with, for each of the models (sequentially addedpredictors), the True Positive Rates for different thresholds (columns) anddifferent data splits (rows). |
AUC | A list with, for each of themodels (sequentially added predictors), a vector of Area Under the Curve(AUC) values obtained with different data splits. |
Beta | Estimatedregression coefficients from visited models. |
names | Names of thepredictors by order of inclusion. |
stable | Binary vector indicatingwhich predictors are stably selected. Only returned if |
For Cox regression, a list with:
concordance | A list with, for eachof the models (sequentially added predictors), a vector of concordanceindices obtained with different data splits. |
Beta | Estimatedregression coefficients from visited models. |
names | Names of thepredictors by order of inclusion. |
stable | Binary vector indicatingwhich predictors are stably selected. Only returned if |
For linear regression, a list with:
Q_squared | A list with, for eachof the models (sequentially added predictors), a vector of Q-squaredobtained with different data splits. |
Beta | Estimated regressioncoefficients from visited models. |
names | Names of the predictors byorder of inclusion. |
stable | Binary vector indicating whichpredictors are stably selected. Only returned if |
See Also
Other prediction performance functions:ExplanatoryPerformance()
Examples
# Data simulationset.seed(1)simul <- SimulateRegression( n = 1000, pk = 20, family = "binomial", ev_xy = 0.8)# Data split: selection, training and test setids <- Split( data = simul$ydata, family = "binomial", tau = c(0.4, 0.3, 0.3))xselect <- simul$xdata[ids[[1]], ]yselect <- simul$ydata[ids[[1]], ]xtrain <- simul$xdata[ids[[2]], ]ytrain <- simul$ydata[ids[[2]], ]xtest <- simul$xdata[ids[[3]], ]ytest <- simul$ydata[ids[[3]], ]# Stability selectionstab <- VariableSelection( xdata = xselect, ydata = yselect, family = "binomial")# Performances in test set of model refitted in training setincr <- Incremental( xdata = xtrain, ydata = ytrain, new_xdata = xtest, new_ydata = ytest, stability = stab, n_predictors = 10)plot(incr)# Alternative with multiple training/test splitsincr <- Incremental( xdata = rbind(xtrain, xtest), ydata = c(ytrain, ytest), stability = stab, K = 10, n_predictors = 10)plot(incr)(Sparse) K-means clustering
Description
Runs k-means clustering using implementation fromkmeans. This function is not using stability.
Usage
KMeansClustering(xdata, nc = NULL, Lambda = NULL, ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
Lambda | vector of penalty parameters (see argument |
... | additional parameters passed to |
Value
A list with:
comembership | an array of binary and symmetricco-membership matrices. |
weights | a matrix of median weights byfeature. |
References
Witten DM, Tibshirani R (2010).“A Framework for Feature Selection in Clustering.”Journal of the American Statistical Association,105(490), 713-726.doi:10.1198/jasa.2010.tm09415, PMID: 20811510.
See Also
Other clustering algorithms:DBSCANClustering(),GMMClustering(),HierarchicalClustering(),PAMClustering()
Examples
# Data simulationset.seed(1)simul <- SimulateClustering(n = c(10, 10), pk = 50)# K means clusteringmykmeans <- KMeansClustering(xdata = simul$data, nc = seq_len(20))# Sparse K means clusteringif (requireNamespace("sparcl", quietly = TRUE)) { mykmeans <- KMeansClustering( xdata = simul$data, nc = seq_len(20), Lambda = c(2, 5) )}Grid of penalty parameters (graphical model)
Description
Generates a relevant grid of penalty parameter values for penalised graphicalmodels.
Usage
LambdaGridGraphical( xdata, pk = NULL, lambda_other_blocks = 0.1, K = 100, tau = 0.5, n_cat = 3, implementation = PenalisedGraphical, start = "cold", scale = TRUE, resampling = "subsampling", PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, Lambda_cardinal = 50, lambda_max = NULL, lambda_path_factor = 0.001, max_density = 0.5, ...)Arguments
xdata | data matrix with observations as rows and variables as columns.For multi-block stability selection, the variables in data have to beordered by group. |
pk | optional vector encoding the grouping structure. Only used formulti-block stability selection where |
lambda_other_blocks | optional vector of parameters controlling thelevel of sparsity in neighbour blocks for the multi-block procedure. To usejointly a specific set of parameters for each block, |
K | number of resampling iterations. |
tau | subsample size. Only used if |
n_cat | computation options for the stability score. Default is |
implementation | function to use for graphical modelling. If |
start | character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters ( |
scale | logical indicating if the correlation ( |
resampling | resampling approach. Possible values are: |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
Lambda_cardinal | number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. |
lambda_max | optional maximum value for the grid in penalty parameters.If |
lambda_path_factor | multiplicative factor used to define the minimumvalue in the grid. |
max_density | threshold on the density. The grid is defined such thatthe density of the estimated graph does not exceed max_density. |
... | additional parameters passed to the functions provided in |
Value
A matrix of lambda values withlength(pk) columns andLambda_cardinal rows.
See Also
Other lambda grid functions:LambdaGridRegression(),LambdaSequence()
Examples
# Single-block simulationset.seed(1)simul <- SimulateGraphical()# Generating grid of 10 valuesLambda <- LambdaGridGraphical(xdata = simul$data, Lambda_cardinal = 10)# Ensuring PFER < 5Lambda <- LambdaGridGraphical(xdata = simul$data, Lambda_cardinal = 10, PFER_thr = 5)# Multi-block simulationset.seed(1)simul <- SimulateGraphical(pk = c(10, 10))# Multi-block gridLambda <- LambdaGridGraphical(xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10)# Denser neighbouring blocksLambda <- LambdaGridGraphical( xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10, lambda_other_blocks = 0)# Using different neighbour penaltiesLambda <- LambdaGridGraphical( xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10, lambda_other_blocks = c(0.1, 0, 0.1))stab <- GraphicalModel( xdata = simul$data, pk = c(10, 10), Lambda = Lambda, lambda_other_blocks = c(0.1, 0, 0.1))stab$Lambda# Visiting from empty to full graphs with max_density=1Lambda <- LambdaGridGraphical( xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10, max_density = 1)bigblocks <- BlockMatrix(pk = c(10, 10))bigblocks_vect <- bigblocks[upper.tri(bigblocks)]N_blocks <- unname(table(bigblocks_vect))N_blocks # max number of edges per blockstab <- GraphicalModel(xdata = simul$data, pk = c(10, 10), Lambda = Lambda)apply(stab$Q, 2, max, na.rm = TRUE) # max average number of edges from underlying algoGrid of penalty parameters (regression model)
Description
Generates a relevant grid of penalty parameter values for penalisedregression using the implementation inglmnet.
Usage
LambdaGridRegression( xdata, ydata, tau = 0.5, seed = 1, family = "gaussian", resampling = "subsampling", Lambda_cardinal = 100, check_input = TRUE, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
family | type of regression model. This argument is defined as in |
resampling | resampling approach. Possible values are: |
Lambda_cardinal | number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. |
check_input | logical indicating if input values should be checked(recommended). |
... | additional parameters passed to the functions provided in |
Value
A matrix of lambda values with one column and as many rows asindicated inLambda_cardinal.
See Also
Other lambda grid functions:LambdaGridGraphical(),LambdaSequence()
Examples
# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") # simulated data# Lambda grid for linear regressionLambda <- LambdaGridRegression( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", Lambda_cardinal = 20)# Grid can be used in VariableSelection()stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", Lambda = Lambda)print(SelectedVariables(stab))Sequence of penalty parameters
Description
Generates a sequence of penalty parameters from extreme values and therequired number of elements. The sequence is defined on the log-scale.
Usage
LambdaSequence(lmax, lmin, cardinal = 100)Arguments
lmax | maximum value in the grid. |
lmin | minimum value in the grid. |
cardinal | number of values in the grid. |
Value
A vector with values between "lmin" and "lmax" and as many values asindicated by "cardinal".
See Also
Other lambda grid functions:LambdaGridGraphical(),LambdaGridRegression()
Examples
# Grid from extreme valuesmygrid <- LambdaSequence(lmax = 0.7, lmin = 0.001, cardinal = 10)Matrix from linear system outputs
Description
Returns a matrix from output ofPenalisedLinearSystem.
Usage
LinearSystemMatrix(vect, adjacency)Arguments
vect | vector of coefficients to assign to entries of the matrix. |
adjacency | binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined. |
Value
An asymmetric matrix.
See Also
Transforms NA into NULL
Description
Returns a vector with no missing values or NULL if there are no non-missingvalues.
Usage
NAToNULL(x)Arguments
x | input vector. |
Value
A vector without missing values or NULL.
Matrix from OpenMx outputs
Description
Returns a matrix from output ofmxPenaltySearch.
Usage
OpenMxMatrix(vect, adjacency, residual_covariance = NULL)Arguments
vect | vector of coefficients to assign to entries of the matrix. |
adjacency | binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined. |
residual_covariance | binary and symmetric matrix encoding the nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation). By default, this is the identity matrix (noresidual covariance). |
Value
An asymmetric matrix.
See Also
Writing OpenMx model (matrix specification)
Description
Returns matrix specification for use inmxModel from(i) the adjacency matrix of a Directed Acyclic Graph (asymmetric matrix A inReticular Action Model notation), and (ii) a binary matrix encoding nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation).
Usage
OpenMxModel(adjacency, residual_covariance = NULL, manifest = NULL)Arguments
adjacency | binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined. |
residual_covariance | binary and symmetric matrix encoding the nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation). By default, this is the identity matrix (noresidual covariance). |
manifest | optional vector of manifest variable names. |
Value
A list of RAM matrices that can be used inmxRun.
See Also
Examples
if (requireNamespace("OpenMx", quietly = TRUE)) { # Definition of simulated effects pk <- c(3, 2, 3) dag <- LayeredDAG(layers = pk) theta <- dag theta[2, 4] <- 0 theta[3, 7] <- 0 theta[4, 7] <- 0 # Data simulation set.seed(1) simul <- SimulateStructural(n = 500, v_between = 1, theta = theta, pk = pk) # Writing RAM matrices for mxModel ram_matrices <- OpenMxModel(adjacency = dag) # Running unpenalised model unpenalised <- OpenMx::mxRun(OpenMx::mxModel( "Model", OpenMx::mxData(simul$data, type = "raw"), ram_matrices$Amat, ram_matrices$Smat, ram_matrices$Fmat, ram_matrices$Mmat, OpenMx::mxExpectationRAM("A", "S", "F", "M", dimnames = colnames(dag)), OpenMx::mxFitFunctionML() ), silent = TRUE, suppressWarnings = TRUE) unpenalised$A$values # Incorporating latent variables ram_matrices <- OpenMxModel( adjacency = dag, manifest = paste0("x", seq_len(7)) ) ram_matrices$Fmat$values # Running unpenalised model unpenalised <- OpenMx::mxRun(OpenMx::mxModel( "Model", OpenMx::mxData(simul$data[, seq_len(7)], type = "raw"), ram_matrices$Amat, ram_matrices$Smat, ram_matrices$Fmat, ram_matrices$Mmat, OpenMx::mxExpectationRAM("A", "S", "F", "M", dimnames = colnames(dag)), OpenMx::mxFitFunctionML() ), silent = TRUE, suppressWarnings = TRUE) unpenalised$A$values}(Weighted) Partitioning Around Medoids
Description
Runs Partitioning Around Medoids (PAM) clustering using implementation frompam. This is also known as the k-medoids algorithm. IfLambda is provided, clustering is applied on the weighted distancematrix calculated using the COSA algorithm as implemented incosa2. Otherwise, distances are calculated usingdist. This function is not using stability.
Usage
PAMClustering(xdata, nc = NULL, Lambda = NULL, distance = "euclidean", ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
Lambda | vector of penalty parameters (see argument |
distance | character string indicating the type of distance to use. If |
... | additional parameters passed to |
Value
A list with:
comembership | an array of binary and symmetricco-membership matrices. |
weights | a matrix of median weights byfeature. |
References
Kampert MM, Meulman JJ, Friedman JH (2017).“rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.”Journal of Classification,34(3), 514–547.doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004).“Clustering objects on subsets of attributes (with discussion).”Journal of the Royal Statistical Society: Series B (Statistical Methodology),66(4), 815-849.doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x,https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
See Also
Other clustering algorithms:DBSCANClustering(),GMMClustering(),HierarchicalClustering(),KMeansClustering()
Examples
if (requireNamespace("cluster", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateClustering(n = c(10, 10), pk = 50) # PAM clustering myclust <- PAMClustering( xdata = simul$data, nc = seq_len(20) ) # Weighted PAM clustering (using COSA) if (requireNamespace("rCOSA", quietly = TRUE)) { myclust <- PAMClustering( xdata = simul$data, nc = seq_len(20), Lambda = c(0.2, 0.5) ) }}Per Family Error Rate
Description
Computes the Per Family Error Rate upper-bound of a stability selection modelusing the methods proposed by Meinshausen and Bühlmann (2010) or Shah andSamworth (2013). In stability selection, the PFER corresponds to the expectednumber of stably selected features that are not relevant to the outcome (i.e.False Positives).
Usage
PFER(q, pi, N, K, PFER_method = "MB")Arguments
q | average number of features selected by the underlying algorithm. |
pi | threshold in selection proportions. |
N | total number of features. |
K | number of resampling iterations. |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
Value
The estimated upper-bound in PFER.
References
Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.
Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.
See Also
Other stability metric functions:ConsensusScore(),FDP(),StabilityMetrics(),StabilityScore()
Examples
# Computing PFER for 10/50 selected features and threshold of 0.8pfer_mb <- PFER(q = 10, pi = 0.8, N = 50, K = 100, PFER_method = "MB")pfer_ss <- PFER(q = 10, pi = 0.8, N = 50, K = 100, PFER_method = "SS")Partial Least Squares 'a la carte'
Description
Runs a Partial Least Squares (PLS) model in regression mode using algorithmimplemented inpls. This function allows for theconstruction of components based on different sets of predictor and/oroutcome variables. This function is not using stability.
Usage
PLS( xdata, ydata, selectedX = NULL, selectedY = NULL, family = "gaussian", ncomp = NULL, scale = TRUE)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
selectedX | binary matrix of size |
selectedY | binary matrix of size |
family | type of PLS model. Only |
ncomp | number of components. |
scale | logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). |
Details
All matrices are defined as in (Wold et al. 2001). The weight matrixWmat is the equivalent ofloadings$X inpls. The loadings matrixPmat is theequivalent ofmat.c inpls. The scorematricesTmat andQmat are the equivalent ofvariates$X andvariates$Y inpls.
Value
A list with:
Wmat | matrix of X-weights. |
Wstar | matrix oftransformed X-weights. |
Pmat | matrix of X-loadings. |
Cmat | matrix of Y-weights. |
Tmat | matrix of X-scores. |
Umat | matrix of Y-scores. |
Qmat | matrix needed forpredictions. |
Rmat | matrix needed for predictions. |
meansX | vector used for centering of predictors, needed forpredictions. |
sigmaX | vector used for scaling of predictors, neededfor predictions. |
meansY | vector used for centering of outcomes,needed for predictions. |
sigmaY | vector used for scaling of outcomes,needed for predictions. |
methods | a list with |
params | a list with |
References
Wold S, Sjöström M, Eriksson L (2001).“PLS-regression: a basic tool of chemometrics.”Chemometrics and Intelligent Laboratory Systems,58(2), 109-130.ISSN 0169-7439,doi:10.1016/S0169-7439(01)00155-1, PLS Methods.
See Also
Examples
if (requireNamespace("mixOmics", quietly = TRUE)) { oldpar <- par(no.readonly = TRUE) # Data simulation set.seed(1) simul <- SimulateRegression(n = 200, pk = 15, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # PLS mypls <- PLS(xdata = x, ydata = y, ncomp = 3) if (requireNamespace("sgPLS", quietly = TRUE)) { # Sparse PLS to identify relevant variables stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), LambdaY = seq_len(ncol(y) - 1), implementation = SparsePLS, n_cat = 2 ) plot(stab) # Refitting of PLS model mypls <- PLS( xdata = x, ydata = y, selectedX = stab$selectedX, selectedY = stab$selectedY ) # Nonzero entries in weights are the same as in selectedX par(mfrow = c(2, 2)) Heatmap(stab$selectedX, legend = FALSE ) title("Selected in X") Heatmap(ifelse(mypls$Wmat != 0, yes = 1, no = 0), legend = FALSE ) title("Nonzero entries in Wmat") Heatmap(stab$selectedY, legend = FALSE ) title("Selected in Y") Heatmap(ifelse(mypls$Cmat != 0, yes = 1, no = 0), legend = FALSE ) title("Nonzero entries in Cmat") } # Multilevel PLS # Generating random design z <- rep(seq_len(50), each = 4) # Extracting the within-variability x_within <- mixOmics::withinVariation(X = x, design = cbind(z)) # Running PLS on within-variability mypls <- PLS(xdata = x_within, ydata = y, ncomp = 3) par(oldpar)}Graphical LASSO
Description
Runs the graphical LASSO algorithm for estimation of a Gaussian GraphicalModel (GGM). This function is not using stability.
Usage
PenalisedGraphical( xdata, pk = NULL, Lambda, Sequential_template = NULL, scale = TRUE, start = "cold", output_omega = FALSE, ...)Arguments
xdata | matrix with observations as rows and variables as columns. |
pk | optional vector encoding the grouping structure. Only used formulti-block stability selection where |
Lambda | matrix of parameters controlling the level of sparsity. |
Sequential_template | logical matrix encoding the type of procedure touse for data with multiple blocks in stability selection graphicalmodelling. For multi-block estimation, the stability selection model isconstructed as the union of block-specific stable edges estimated while theothers are weakly penalised ( |
scale | logical indicating if the correlation ( |
start | character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters ( |
output_omega | logical indicating if the estimated precision matricesshould be stored and returned. |
... | additional parameters passed to the function provided in |
Details
The use of the procedure from Equation (4) or (5) is controlled bythe argument "Sequential_template".
Value
An array with binary and symmetric adjacency matrices along the thirddimension.
References
Friedman J, Hastie T, Tibshirani R (2008).“Sparse inverse covariance estimation with the graphical lasso.”Biostatistics,9(3), 432–441.
See Also
Other underlying algorithm functions:CART(),ClusteringAlgo(),PenalisedOpenMx(),PenalisedRegression()
Examples
# Data simulationset.seed(1)simul <- SimulateGraphical()# Running graphical LASSOmyglasso <- PenalisedGraphical( xdata = simul$data, Lambda = matrix(c(0.1, 0.2), ncol = 1))# Returning estimated precision matrixmyglasso <- PenalisedGraphical( xdata = simul$data, Lambda = matrix(c(0.1, 0.2), ncol = 1), output_omega = TRUE)Penalised Structural Equation Model
Description
Runs penalised Structural Equation Modelling using implementations fromOpenMx functions (forPenalisedOpenMx),or using series of penalised regressions withglmnet(forPenalisedLinearSystem). The functionPenalisedLinearSystem does not accommodate latent variables.These functions are not using stability.
Usage
PenalisedOpenMx( xdata, adjacency, penalised = NULL, residual_covariance = NULL, Lambda, ...)PenalisedLinearSystem(xdata, adjacency, penalised = NULL, Lambda = NULL, ...)Arguments
xdata | matrix with observations as rows and variables as columns.Column names must be defined and in line with the row and column names of |
adjacency | binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined. |
penalised | optional binary matrix indicating which coefficients areregularised. |
residual_covariance | binary and symmetric matrix encoding the nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation). By default, this is the identity matrix (noresidual covariance). |
Lambda | matrix of parameters controlling the level of sparsity. Onlythe minimum, maximum and length are used in |
... | additional parameters passed to |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different regularisation parameters. Columns correspond todifferent parameters to estimated. |
beta_full | matrix of modelcoefficients. Rows correspond to different regularisation parameters.Columns correspond to different parameters to estimated. |
References
Jacobucci R, Grimm KJ, McArdle JJ (2016).“Regularized structural equation modeling.”Structural equation modeling: a multidisciplinary journal,23(4), 555–566.doi:10.1080/10705511.2016.1154793.
See Also
SelectionAlgo,VariableSelection,OpenMxMatrix,LinearSystemMatrix
Other underlying algorithm functions:CART(),ClusteringAlgo(),PenalisedGraphical(),PenalisedRegression()
Examples
# Data simulationpk <- c(3, 2, 3)dag <- LayeredDAG(layers = pk)theta <- dagtheta[2, 4] <- 0set.seed(1)simul <- SimulateStructural(theta = theta, pk = pk, output_matrices = TRUE)# Running regularised SEM (OpenMx)if (requireNamespace("OpenMx", quietly = TRUE)) { mysem <- PenalisedOpenMx( xdata = simul$data, adjacency = dag, Lambda = seq(1, 10, 1) ) OpenMxMatrix(vect = mysem$selected[3, ], adjacency = dag)}# Running regularised SEM (glmnet)mysem <- PenalisedLinearSystem( xdata = simul$data, adjacency = dag)LinearSystemMatrix(vect = mysem$selected[20, ], adjacency = dag)Penalised regression
Description
Runs penalised regression using implementation fromglmnet. This function is not using stability.
Usage
PenalisedRegression( xdata, ydata, Lambda = NULL, family, penalisation = c("classic", "randomised", "adaptive"), gamma = NULL, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
Lambda | matrix of parameters controlling the level of sparsity. |
family | type of regression model. This argument is defined as in |
penalisation | type of penalisation to use. If |
gamma | parameter for randomised or adaptive regularisation. Default is |
... | additional parameters passed to |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors. |
beta_full | array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors. Indicesalong the third dimension correspond to outcome variable(s). |
References
Zou H (2006).“The adaptive lasso and its oracle properties.”Journal of the American statistical association,101(476), 1418–1429.
Tibshirani R (1996).“Regression Shrinkage and Selection via the Lasso.”Journal of the Royal Statistical Society. Series B (Methodological),58(1), 267–288.ISSN 00359246,http://www.jstor.org/stable/2346178.
See Also
SelectionAlgo,VariableSelection
Other underlying algorithm functions:CART(),ClusteringAlgo(),PenalisedGraphical(),PenalisedOpenMx()
Examples
# Data simulationset.seed(1)simul <- SimulateRegression(pk = 50)# Running the LASSOmylasso <- PenalisedRegression( xdata = simul$xdata, ydata = simul$ydata, Lambda = c(0.1, 0.2), family = "gaussian")# Using glmnet argumentsmylasso <- PenalisedRegression( xdata = simul$xdata, ydata = simul$ydata, Lambda = c(0.1), family = "gaussian", penalty.factor = c(rep(0, 10), rep(1, 40)))mylasso$beta_fullPartial Least Squares predictions
Description
Computes predicted values from a Partial Least Squares (PLS) model inregression mode applied onxdata. This function is using the algorithmimplemented inpredict.pls.
Usage
PredictPLS(xdata, model)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
model | output of |
Value
An array of predicted values.
See Also
Examples
if (requireNamespace("mixOmics", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = c(5, 5, 5), family = "gaussian") x <- simul$xdata y <- simul$ydata # PLS mypls <- PLS(xdata = x, ydata = y, ncomp = 3) # Predicted values predicted <- PredictPLS(xdata = x, model = mypls)}Regression model refitting
Description
Refits the regression model with stably selected variables as predictors(without penalisation). Variables inxdata not evaluated in thestability selection model will automatically be included as predictors.
Usage
Refit( xdata, ydata, stability = NULL, family = NULL, implementation = NULL, Lambda = NULL, seed = 1, verbose = TRUE, ...)Recalibrate( xdata, ydata, stability = NULL, family = NULL, implementation = NULL, Lambda = NULL, seed = 1, verbose = TRUE, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
stability | output of |
family | type of regression model. Possible values include |
implementation | optional function to refit the model. If |
Lambda | optional vector of penalty parameters. |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
verbose | logical indicating if a loading bar and messages should beprinted. |
... | additional arguments to be passed to the function provided in |
Value
The output as obtained from:
stats::lm | forlinear regression ( |
survival::coxph | for Cox regression ( |
stats::glm | for logistic regression( |
nnet::multinom | formultinomial regression ( |
See Also
Examples
## Linear regression# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")# Data splitids_train <- Resample( data = simul$ydata, tau = 0.5, family = "gaussian")xtrain <- simul$xdata[ids_train, , drop = FALSE]ytrain <- simul$ydata[ids_train, , drop = FALSE]xrefit <- simul$xdata[-ids_train, , drop = FALSE]yrefit <- simul$ydata[-ids_train, , drop = FALSE]# Stability selectionstab <- VariableSelection(xdata = xtrain, ydata = ytrain, family = "gaussian")print(SelectedVariables(stab))# Refitting the modelrefitted <- Refit( xdata = xrefit, ydata = yrefit, stability = stab)refitted$coefficients # refitted coefficientshead(refitted$fitted.values) # refitted predicted values# Fitting the full model (including all possible predictors)refitted <- Refit( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")refitted$coefficients # refitted coefficients## Logistic regression# Data simulationset.seed(1)simul <- SimulateRegression(n = 200, pk = 20, family = "binomial")# Data splitids_train <- Resample( data = simul$ydata, tau = 0.5, family = "binomial")xtrain <- simul$xdata[ids_train, , drop = FALSE]ytrain <- simul$ydata[ids_train, , drop = FALSE]xrefit <- simul$xdata[-ids_train, , drop = FALSE]yrefit <- simul$ydata[-ids_train, , drop = FALSE]# Stability selectionstab <- VariableSelection(xdata = xtrain, ydata = ytrain, family = "binomial")# Refitting the modelrefitted <- Refit( xdata = xrefit, ydata = yrefit, stability = stab)refitted$coefficients # refitted coefficientshead(refitted$fitted.values) # refitted predicted probabilities## Partial Least Squares (multiple components)if (requireNamespace("sgPLS", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(n = 500, pk = 15, q = 3, family = "gaussian") # Data split ids_train <- Resample( data = simul$ydata, tau = 0.5, family = "gaussian" ) xtrain <- simul$xdata[ids_train, , drop = FALSE] ytrain <- simul$ydata[ids_train, , drop = FALSE] xrefit <- simul$xdata[-ids_train, , drop = FALSE] yrefit <- simul$ydata[-ids_train, , drop = FALSE] # Stability selection stab <- BiSelection( xdata = xtrain, ydata = ytrain, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(xtrain) - 1), LambdaY = seq_len(ncol(ytrain) - 1), implementation = SparsePLS ) plot(stab) # Refitting the model refitted <- Refit( xdata = xrefit, ydata = yrefit, stability = stab ) refitted$Wmat # refitted X-weights refitted$Cmat # refitted Y-weights}Resampling observations
Description
Generates a vector of resampled observation IDs.
Usage
Resample(data, family = NULL, tau = 0.5, resampling = "subsampling", ...)Arguments
data | vector or matrix of data. In regression, this should be theoutcome data. |
family | type of regression model. This argument is defined as in |
tau | subsample size. Only used if |
resampling | resampling approach. Possible values are: |
... | additional parameters passed to the function provided in |
Details
With categorical outcomes (i.e. "family" argument is set to"binomial", "multinomial" or "cox"), the resampling is done such that theproportion of observations from each of the categories is representative ofthat of the full sample.
Value
A vector of resampled IDs.
Examples
## Linear regression framework# Data simulationsimul <- SimulateRegression()# Subsamplingids <- Resample(data = simul$ydata, family = "gaussian")sum(duplicated(ids))# Bootstrappingids <- Resample(data = simul$ydata, family = "gaussian", resampling = "bootstrap")sum(duplicated(ids))## Logistic regression framework# Data simulationsimul <- SimulateRegression(family = "binomial")# Subsamplingids <- Resample(data = simul$ydata, family = "binomial")sum(duplicated(ids))prop.table(table(simul$ydata))prop.table(table(simul$ydata[ids]))# Data simulation for a binary confounderconf <- ifelse(runif(n = 100) > 0.5, yes = 1, no = 0)# User-defined resampling functionBalancedResampling <- function(data, tau, Z, ...) { s <- NULL for (z in unique(Z)) { s <- c(s, sample(which((data == "0") & (Z == z)), size = tau * sum((data == "0") & (Z == z)))) s <- c(s, sample(which((data == "1") & (Z == z)), size = tau * sum((data == "1") & (Z == z)))) } return(s)}# Resampling keeping proportions by Y and Zids <- Resample(data = simul$ydata, family = "binomial", resampling = BalancedResampling, Z = conf)prop.table(table(simul$ydata, conf))prop.table(table(simul$ydata[ids], conf[ids]))# User-defined resampling for stability selectionstab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "binomial", resampling = BalancedResampling, Z = conf)Variable selection algorithm
Description
Runs the variable selection algorithm specified in the argumentimplementation. This function is not using stability.
Usage
SelectionAlgo( xdata, ydata = NULL, Lambda, group_x = NULL, scale = TRUE, family = NULL, implementation = PenalisedRegression, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
group_x | vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. Only used formodels with group penalisation (e.g. |
scale | logical indicating if the predictor data should be scaled. |
family | type of regression model. This argument is defined as in |
implementation | function to use for variable selection. Possiblefunctions are: |
... | additional parameters passed to the function provided in |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors. |
beta_full | array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors. Indicesalong the third dimension correspond to outcome variable(s). |
See Also
VariableSelection,PenalisedRegression,SparsePCA,SparsePLS,GroupPLS,SparseGroupPLS
Other wrapping functions:GraphicalAlgo()
Examples
# Data simulation (univariate outcome)set.seed(1)simul <- SimulateRegression(pk = 50)# Running the LASSOmylasso <- SelectionAlgo( xdata = simul$xdata, ydata = simul$ydata, Lambda = c(0.1, 0.2), family = "gaussian",)# Data simulation (multivariate outcome)set.seed(1)simul <- SimulateRegression(pk = 50, q = 3)# Running multivariate Gaussian LASSOmylasso <- SelectionAlgo( xdata = simul$xdata, ydata = simul$ydata, Lambda = c(0.1, 0.2), family = "mgaussian")str(mylasso)Selection performance
Description
Computes different metrics of selection performance by comparing the set ofselected features to the set of true predictors/edges. This function can onlybe used in simulation studies (i.e. when the true model is known).
Usage
SelectionPerformance(theta, theta_star, pk = NULL, cor = NULL, thr = 0.5)Arguments
theta | output from |
theta_star | output from |
pk | optional vector encoding the grouping structure. Only used formulti-block stability selection where |
cor | optional correlation matrix. Only used in graphical modelling. |
thr | optional threshold in correlation. Only used in graphicalmodelling and when argument "cor" is not NULL. |
Value
A matrix of selection metrics including:
TP | number of True Positives (TP) |
FN | number of FalseNegatives (TN) |
FP | number of False Positives (FP) |
TN | numberof True Negatives (TN) |
sensitivity | sensitivity, i.e. TP/(TP+FN) |
specificity | specificity, i.e. TN/(TN+FP) |
accuracy | accuracy,i.e. (TP+TN)/(TP+TN+FP+FN) |
precision | precision (p), i.e.TP/(TP+FP) |
recall | recall (r), i.e. TP/(TP+FN) |
F1_score | F1-score, i.e. 2*p*r/(p+r) |
If argument "cor" is provided, the number of False Positives amongcorrelated (FP_c) and uncorrelated (FP_i) pairs, defined as havingcorrelations (provided in "cor") above or below the threshold "thr", arealso reported.
Block-specific performances are reported if "pk" is not NULL. In this case,the first row of the matrix corresponds to the overall performances, andsubsequent rows correspond to each of the blocks. The order of the blocksis defined as inBlockStructure.
See Also
Other functions for model performance:ClusteringPerformance(),SelectionPerformanceGraph()
Examples
# Variable selection modelset.seed(1)simul <- SimulateRegression(pk = 30, nu_xy = 0.5)stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)# Selection performanceSelectionPerformance(theta = stab, theta_star = simul)# Alternative formulationSelectionPerformance( theta = SelectedVariables(stab), theta_star = simul$theta)Graph representation of selection performance
Description
Generates an igraph object representing the True Positive, False Positive andFalse Negative edges by comparing the set of selected edges to the set oftrue edges. This function can only be used in simulation studies (i.e. whenthe true model is known).
Usage
SelectionPerformanceGraph( theta, theta_star, col = c("tomato", "forestgreen", "navy"), lty = c(2, 3, 1), node_colour = NULL, show_labels = TRUE, ...)Arguments
theta | binary adjacency matrix or output of |
theta_star | true binary adjacency matrix or output of |
col | vector of edge colours. The first entry of the vector definesthe colour of False Positive edges, second entry is for True Negatives andthird entry is for True Positives. |
lty | vector of line types for edges. The order is defined as forargument |
node_colour | optional vector of node colours. This vector must containas many entries as there are rows/columns in the adjacency matrix and mustbe in the same order (the order is used to assign colours to nodes).Integers, named colours or RGB values can be used. |
show_labels | logical indicating if the node labels should be displayed. |
... | additional arguments to be passed to |
Value
An igraph object.
See Also
SimulateGraphical,SimulateRegression,GraphicalModel,VariableSelection,BiSelection
Other functions for model performance:ClusteringPerformance(),SelectionPerformance()
Examples
# Data simulationset.seed(1)simul <- SimulateGraphical(pk = 30)# Stability selectionstab <- GraphicalModel(xdata = simul$data, K = 10)# Performance graphperfgraph <- SelectionPerformanceGraph( theta = stab, theta_star = simul)plot(perfgraph)Selection performance (internal)
Description
Computes different metrics of selection performance from a categoricalvector/matrix with 3 for True Positive, 2 for False Negative, 1 for FalsePositive and 0 for True Negative.
Usage
SelectionPerformanceSingle(Asum, cor = NULL, thr = 0.5)Arguments
Asum | vector (in variable selection) or matrix (in graphical modelling)containing values of |
cor | optional correlation matrix. Only used in graphical modelling. |
thr | optional threshold in correlation. Only used in graphicalmodelling and when argument "cor" is not NULL. |
Value
A matrix of selection metrics including:
TP | number of True Positives (TP) |
FN | number of FalseNegatives (TN) |
FP | number of False Positives (FP) |
TN | numberof True Negatives (TN) |
sensitivity | sensitivity, i.e. TP/(TP+FN) |
specificity | specificity, i.e. TN/(TN+FP) |
accuracy | accuracy,i.e. (TP+TN)/(TP+TN+FP+FN) |
precision | precision (p), i.e.TP/(TP+FP) |
recall | recall (r), i.e. TP/(TP+FN) |
F1_score | F1-score, i.e. 2*p*r/(p+r) |
If argument "cor" is provided, the number of False Positives amongcorrelated (FP_c) and uncorrelated (FP_i) pairs, defined as havingcorrelations (provided in "cor") above or below the threshold "thr", arealso reported.
Selection/co-membership proportions
Description
Extracts selection proportions (for stability selection) or co-membershipproportions (for consensus clustering).
Usage
SelectionProportions(stability, argmax_id = NULL)ConsensusMatrix(stability, argmax_id = NULL)Arguments
stability | output of |
argmax_id | optional indices of hyper-parameters. If |
Value
A vector or matrix of proportions.
See Also
VariableSelection,GraphicalModel,BiSelection,Clustering
Examples
# Stability selectionset.seed(1)simul <- SimulateRegression(pk = 50)stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)SelectionProportions(stab)# Consensus clusteringset.seed(1)simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1, ev_xc = 0.5)stab <- Clustering(xdata = simul$data)ConsensusMatrix(stab)Selection proportions (graphical model)
Description
Extracts the selection proportions of the (calibrated) stability selectionmodel.
Usage
SelectionProportionsGraphical(stability, argmax_id = NULL)Arguments
stability | output of |
argmax_id | optional indices of hyper-parameters. If |
Value
A symmetric matrix.
Selection proportions (variable selection)
Description
Extracts the selection proportions of the (calibrated) stability selectionmodel.
Usage
SelectionProportionsRegression(stability, argmax_id = NULL)Arguments
stability | output of |
argmax_id | optional indices of hyper-parameters. If |
Value
A vector of selection proportions.
Consensus clustering (internal)
Description
Performs consensus (weighted) clustering. The underlying algorithm (e.g.hierarchical clustering) is run with different number of clustersnc.In consensus weighed clustering, weighted distances are calculated using thecosa2 algorithm with different penalty parametersLambda. The hyper-parameters are calibrated by maximisation of theconsensus score. This function uses a serial implementation and requires the grids ofhyper-parameters as input (for internal use only).
Usage
SerialClustering( xdata, nc, eps, Lambda, K = 100, tau = 0.5, seed = 1, n_cat = 3, implementation = HierarchicalClustering, scale = TRUE, linkage = "complete", row = TRUE, output_data = FALSE, verbose = TRUE, ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
eps | radius in density-based clustering, see |
Lambda | vector of penalty parameters for weighted distance calculation.Only used for distance-based clustering, including for example |
K | number of resampling iterations. |
tau | subsample size. |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
implementation | function to use for clustering. Possible functionsinclude |
scale | logical indicating if the data should be scaled to ensure thatall variables contribute equally to the clustering of the observations. |
linkage | character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude |
row | logical indicating if rows (if |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should beprinted. |
... | additional parameters passed to the functions provided in |
Value
A list with:
Sc | a matrixof the best stability scores for different (sets of) parameters controllingthe number of clusters and penalisation of attribute weights. |
nc | amatrix of numbers of clusters. |
Lambda | a matrix of regularisationparameters for attribute weights. |
Q | a matrix of the average numberof selected attributes by the underlying algorithm with differentregularisation parameters. |
coprop | an array of consensus matrices.Rows and columns correspond to items. Indices along the third dimensioncorrespond to different parameters controlling the number of clusters andpenalisation of attribute weights. |
selprop | an array of selectionproportions. Columns correspond to attributes. Rows correspond to differentparameters controlling the number of clusters and penalisation of attributeweights. |
method | a list with |
params | a list with values used for arguments |
The rows ofSc,nc,Lambda,Q,selprop and indices along the thirddimension ofcoprop are ordered in the same way and correspond toparameter values stored innc andLambda.
Stability selection graphical model (internal)
Description
Runs stability selection graphical models with different combinations ofparameters controlling the sparsity of the underlying selection algorithm(e.g. penalty parameter for regularised models) and thresholds in selectionproportions. These two parameters are jointly calibrated by maximising thestability score of the model (possibly under a constraint on the expectednumber of falsely stably selected features). This function uses a serialimplementation and requires the grid of parameters controlling the underlyingalgorithm as input (for internal use only).
Usage
SerialGraphical( xdata, pk = NULL, Lambda, lambda_other_blocks = 0.1, pi_list = seq(0.6, 0.9, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = n_cat, implementation = PenalisedGraphical, start = "cold", scale = TRUE, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, output_data = FALSE, verbose = TRUE, ...)Arguments
xdata | data matrix with observations as rows and variables as columns.For multi-block stability selection, the variables in data have to beordered by group. |
pk | optional vector encoding the grouping structure. Only used formulti-block stability selection where |
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
lambda_other_blocks | optional vector of parameters controlling thelevel of sparsity in neighbour blocks for the multi-block procedure. To usejointly a specific set of parameters for each block, |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
implementation | function to use for graphical modelling. If |
start | character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters ( |
scale | logical indicating if the correlation ( |
resampling | resampling approach. Possible values are: |
cpss | logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is split |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should beprinted. |
... | additional parameters passed to the functions provided in |
Value
A list with:
S | a matrix of the best stability scores fordifferent (sets of) parameters controlling the level of sparsity in theunderlying algorithm. |
Lambda | a matrix of parameters controlling thelevel of sparsity in the underlying algorithm. |
Q | a matrix of theaverage number of selected features by the underlying algorithm withdifferent parameters controlling the level of sparsity. |
Q_s | amatrix of the calibrated number of stably selected features with differentparameters controlling the level of sparsity. |
P | a matrix ofcalibrated thresholds in selection proportions for different parameterscontrolling the level of sparsity in the underlying algorithm. |
PFER | a matrix of upper-bounds in PFER of calibrated stabilityselection models with different parameters controlling the level ofsparsity. |
FDP | a matrix of upper-bounds in FDP of calibratedstability selection models with different parameters controlling the levelof sparsity. |
S_2d | a matrix of stability scores obtained withdifferent combinations of parameters. Columns correspond to differentthresholds in selection proportions. |
PFER_2d | a matrix ofupper-bounds in FDP obtained with different combinations of parameters.Columns correspond to different thresholds in selection proportions. Onlyreturned if |
FDP_2d | a matrix of upper-bounds inPFER obtained with different combinations of parameters. Columns correspondto different thresholds in selection proportions. Only returned if |
selprop | an array of selection proportions.Rows and columns correspond to nodes in the graph. Indices along the thirddimension correspond to different parameters controlling the level ofsparsity in the underlying algorithm. |
sign | a matrix of signs ofPearson's correlations estimated from |
method | a listwith |
params | a list with values used for arguments |
The rows ofS,Lambda,Q,Q_s,P,PFER,FDP,S_2d,PFER_2d andFDP_2d, andindices along the third dimension ofselprop are ordered in the sameway and correspond to parameter values stored inLambda. Formulti-block inference, the columns ofS,Lambda,Q,Q_s,P,PFER andFDP, and indices along thethird dimension ofS_2d correspond to the different blocks.
Stability selection in regression (internal)
Description
Runs stability selection regression models with different combinations ofparameters controlling the sparsity of the underlying selection algorithm(e.g. penalty parameter for regularised models) and thresholds in selectionproportions. These two parameters are jointly calibrated by maximising thestability score of the model (possibly under a constraint on the expectednumber of falsely stably selected features). This function uses a serialimplementation and requires the grid of parameters controlling the underlyingalgorithm as input (for internal use only).
Usage
SerialRegression( xdata, ydata = NULL, Lambda, pi_list = seq(0.6, 0.9, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = 3, family = "gaussian", implementation = PenalisedRegression, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, group_x = NULL, group_penalisation = FALSE, output_data = FALSE, verbose = TRUE, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
family | type of regression model. This argument is defined as in |
implementation | function to use for variable selection. Possiblefunctions are: |
resampling | resampling approach. Possible values are: |
cpss | logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is split |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
group_x | vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. Only used formodels with group penalisation (e.g. |
group_penalisation | logical indicating if a group penalisation shouldbe considered in the stability score. The use of |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should beprinted. |
... | additional parameters passed to the functions provided in |
Value
A list with:
S | a matrix of the best stability scores fordifferent parameters controlling the level of sparsity in the underlyingalgorithm. |
Lambda | a matrix of parameters controlling the level ofsparsity in the underlying algorithm. |
Q | a matrix of the averagenumber of selected features by the underlying algorithm with differentparameters controlling the level of sparsity. |
Q_s | a matrix of thecalibrated number of stably selected features with different parameterscontrolling the level of sparsity. |
P | a matrix of calibratedthresholds in selection proportions for different parameters controllingthe level of sparsity in the underlying algorithm. |
PFER | a matrix ofupper-bounds in PFER of calibrated stability selection models withdifferent parameters controlling the level of sparsity. |
FDP | amatrix of upper-bounds in FDP of calibrated stability selection models withdifferent parameters controlling the level of sparsity. |
S_2d | amatrix of stability scores obtained with different combinations ofparameters. Columns correspond to different thresholds in selectionproportions. |
PFER_2d | a matrix of upper-bounds in FDP obtained withdifferent combinations of parameters. Columns correspond to differentthresholds in selection proportions. |
FDP_2d | a matrix ofupper-bounds in PFER obtained with different combinations of parameters.Columns correspond to different thresholds in selection proportions. |
selprop | a matrix of selection proportions. Columns correspond topredictors from |
Beta | an array of model coefficients.Columns correspond to predictors from |
method | a list with |
params | alist with values used for arguments |
For allmatrices and arrays returned, the rows are ordered in the same way andcorrespond to parameter values stored inLambda.
Sparse group Partial Least Squares
Description
Runs a sparse group Partial Least Squares model using implementation fromsgPLS-package. This function is not using stability.
Usage
SparseGroupPLS( xdata, ydata, family = "gaussian", group_x, group_y = NULL, Lambda, alpha.x, alpha.y = NULL, keepX_previous = NULL, keepY = NULL, ncomp = 1, scale = TRUE, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
family | type of PLS model. If |
group_x | vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. |
group_y | optional vector encoding the grouping structure amongoutcomes. This argument indicates the number of variables in each group. |
Lambda | matrix of parameters controlling the number of selected groupsat current component, as defined by |
alpha.x | vector of parameters controlling the level of sparsity withingroups of predictors. |
alpha.y | optional vector of parameters controlling the level ofsparsity within groups of outcomes. Only used if |
keepX_previous | number of selected groups in previous components. Onlyused if |
keepY | number of selected groups of outcome variables. This argument isdefined as in |
ncomp | number of components. |
scale | logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). Onlyused if |
... |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors. |
beta_full | array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors (startingwith "X") or outcomes (starting with "Y") variables for differentcomponents (denoted by "PC"). |
References
Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016).“Group and sparse group partial least square approaches applied in genomics context.”Bioinformatics,32(1), 35-42.ISSN 1367-4803,doi:10.1093/bioinformatics/btv535.
See Also
Other penalised dimensionality reduction functions:GroupPLS(),SparsePCA(),SparsePLS()
Examples
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Sparse group PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 30, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running sgPLS with 1 group and sparsity of 0.5 mypls <- SparseGroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 5), alpha.x = 0.5 ) # Running sgPLS with groups on outcomes mypls <- SparseGroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 5), alpha.x = 0.5, group_y = c(2, 1), keepY = 1, alpha.y = 0.9 ) ## Sparse group PLS-DA # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "binomial") # Running sgPLS-DA with 1 group and sparsity of 0.9 mypls <- SparseGroupPLS( xdata = simul$xdata, ydata = simul$ydata, Lambda = 1, family = "binomial", group_x = c(10, 15, 25), alpha.x = 0.9 )}Sparse K means clustering
Description
Runs sparse K means clustering using implementation fromKMeansSparseCluster. This function is not usingstability.
Usage
SparseKMeansClustering(xdata, nc = NULL, Lambda, ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
Lambda | vector of penalty parameters (see argument |
... | additional parameters passed to |
Value
A list with:
comembership | an array of binary and symmetricco-membership matrices. |
weights | a matrix of median weights byfeature. |
References
Witten DM, Tibshirani R (2010).“A Framework for Feature Selection in Clustering.”Journal of the American Statistical Association,105(490), 713-726.doi:10.1198/jasa.2010.tm09415, PMID: 20811510.
Sparse Principal Component Analysis
Description
Runs a sparse Principal Component Analysis model using implementation fromspca (ifalgo="sPCA") orspca (ifalgo="rSVD"). This function is notusing stability.
Usage
SparsePCA( xdata, Lambda, ncomp = 1, scale = TRUE, keepX_previous = NULL, algorithm = "sPCA", ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
Lambda | matrix of parameters controlling the number of selectedvariables at current component, as defined by |
ncomp | number of components. |
scale | logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). |
keepX_previous | number of selected predictors in previous components.Only used if |
algorithm | character string indicating the name of the algorithm to use forsparse PCA. Possible values are: "sPCA" (for the algorithm proposed by Zou,Hastie and Tibshirani and implemented in |
... | additional arguments to be passed to |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors. |
beta_full | array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors (startingwith "X") or outcomes (starting with "Y") variables for differentcomponents (denoted by "PC"). |
References
Zou H, Hastie T, Tibshirani R (2006).“Sparse Principal Component Analysis.”Journal of Computational and Graphical Statistics,15(2), 265-286.doi:10.1198/106186006X113430.
Shen H, Huang JZ (2008).“Sparse principal component analysis via regularized low rank matrix approximation.”Journal of Multivariate Analysis,99(6), 1015-1034.ISSN 0047-259X,doi:10.1016/j.jmva.2007.06.007.
See Also
Other penalised dimensionality reduction functions:GroupPLS(),SparseGroupPLS(),SparsePLS()
Examples
# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")x <- simul$xdata# Sparse PCA (by Zou, Hastie, Tibshirani)if (requireNamespace("elasticnet", quietly = TRUE)) { mypca <- SparsePCA( xdata = x, ncomp = 2, Lambda = c(1, 2), keepX_previous = 10, algorithm = "sPCA" )}# Sparse PCA (by Shen and Huang)if (requireNamespace("mixOmics", quietly = TRUE)) { mypca <- SparsePCA( xdata = x, ncomp = 2, Lambda = c(1, 2), keepX_previous = 10, algorithm = "rSVD" )}Sparse Partial Least Squares
Description
Runs a sparse Partial Least Squares model using implementation fromsgPLS-package. This function is not using stability.
Usage
SparsePLS( xdata, ydata, Lambda, family = "gaussian", ncomp = 1, scale = TRUE, keepX_previous = NULL, keepY = NULL, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
Lambda | matrix of parameters controlling the number of selectedpredictors at current component, as defined by |
family | type of PLS model. If |
ncomp | number of components. |
scale | logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). Onlyused if |
keepX_previous | number of selected predictors in previous components.Only used if |
keepY | number of selected outcome variables. This argument is definedas in |
... |
Value
A list with:
selected | matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors. |
beta_full | array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors (startingwith "X") or outcomes (starting with "Y") variables for differentcomponents (denoted by "PC"). |
References
KA LC, Rossouw D, Robert-Granié C, Besse P (2008).“A sparse PLS for variable selection when integrating omics data.”Stat Appl Genet Mol Biol,7(1), Article 35.ISSN 1544-6115,doi:10.2202/1544-6115.1390.
See Also
Other penalised dimensionality reduction functions:GroupPLS(),SparseGroupPLS(),SparsePCA()
Examples
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Sparse PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running sPLS with 2 X-variables and 1 Y-variable mypls <- SparsePLS(xdata = x, ydata = y, Lambda = 2, family = "gaussian", keepY = 1) ## Sparse PLS-DA # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 20, family = "binomial") # Running sPLS-DA with 2 X-variables and 1 Y-variable mypls <- SparsePLS(xdata = simul$xdata, ydata = simul$ydata, Lambda = 2, family = "binomial")}Splitting observations into non-overlapping sets
Description
Generates a list oflength(tau) non-overlapping sets of observationIDs.
Usage
Split(data, family = NULL, tau = c(0.5, 0.25, 0.25))Arguments
data | vector or matrix of data. In regression, this should be theoutcome data. |
family | type of regression model. This argument is defined as in |
tau | vector of the proportion of observations in each of the sets. |
Details
With categorical outcomes (i.e.family argument is set to"binomial","multinomial" or"cox"), the split is donesuch that the proportion of observations from each of the categories ineach of the sets is representative of that of the full sample.
Value
A list of lengthlength(tau) with sets of non-overlappingobservation IDs.
Examples
# Splitting into 3 setssimul <- SimulateRegression()ids <- Split(data = simul$ydata)lapply(ids, length)# Balanced splits with respect to a binary variablesimul <- SimulateRegression(family = "binomial")ids <- Split(data = simul$ydata, family = "binomial")lapply(ids, FUN = function(x) { table(simul$ydata[x, ])})Adjacency from bipartite
Description
Generates a symmetric adjacency matrix encoding a bipartite graph.
Usage
Square(x)Arguments
x | matrix encoding the edges between two types of nodes (rows andcolumns). |
Value
A symmetric adjacency matrix encoding a bipartite graph.
Examples
# Simulated links between two setsset.seed(1)mat <- matrix(sample(c(0, 1), size = 15, replace = TRUE), nrow = 5, ncol = 3)# Adjacency matrix of a bipartite graphSquare(mat)Stability selection metrics
Description
Computes the stability score (seeStabilityScore) andupper-bounds of thePFER andFDP from selectionproportions of models with a given parameter controlling the sparsity of theunderlying algorithm and for different thresholds in selection proportions.
Usage
StabilityMetrics( selprop, pk = NULL, pi_list = seq(0.6, 0.9, by = 0.01), K = 100, n_cat = NULL, PFER_method = "MB", PFER_thr_blocks = Inf, FDP_thr_blocks = Inf, Sequential_template = NULL, graph = TRUE, group = NULL)Arguments
selprop | array of selection proportions. |
pk | optional vector encoding the grouping structure. Only used formulti-block stability selection where |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
n_cat | computation options for the stability score. Default is |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr_blocks | vector of block-specific thresholds in PFER forconstrained calibration by error control. If |
FDP_thr_blocks | vector of block-specific thresholds in the expectedproportion of falsely selected features (or False Discovery Proportion,FDP) for constrained calibration by error control. If |
Sequential_template | logical matrix encoding the type of procedure touse for data with multiple blocks in stability selection graphicalmodelling. For multi-block estimation, the stability selection model isconstructed as the union of block-specific stable edges estimated while theothers are weakly penalised ( |
graph | logical indicating if stability selection is performed in aregression (if |
group | vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group and only needs tobe provided for group (but not sparse group) penalisation. |
Value
A list with:
S | a matrix of the best (block-specific) stabilityscores for different (sets of) penalty parameters. In multi-block stabilityselection, rows correspond to different sets of penalty parameters, (valuesare stored in the output "Lambda") and columns correspond to differentblocks. |
Lambda | a matrix of (block-specific) penalty parameters. Inmulti-block stability selection, rows correspond to sets of penaltyparameters and columns correspond to different blocks. |
Q | a matrixof average numbers of (block-specific) edges selected by the underlyingalgorithm for different (sets of) penalty parameters. In multi-blockstability selection, rows correspond to different sets of penaltyparameters, (values are stored in the output "Lambda") and columnscorrespond to different blocks. |
Q_s | a matrix of calibrated numbersof (block-specific) stable edges for different (sets of) penaltyparameters. In multi-block stability selection, rows correspond todifferent sets of penalty parameters, (values are stored in the output"Lambda") and columns correspond to different blocks. |
P | a matrix ofcalibrated (block-specific) thresholds in selection proportions fordifferent (sets of) penalty parameters. In multi-block stability selection,rows correspond to different sets of penalty parameters, (values are storedin the output "Lambda") and columns correspond to different blocks. |
PFER | a matrix of computed (block-specific) upper-bounds in PFER ofcalibrated graphs for different (sets of) penalty parameters. Inmulti-block stability selection, rows correspond to different sets ofpenalty parameters, (values are stored in the output "Lambda") and columnscorrespond to different blocks. |
FDP | a matrix of computed(block-specific) upper-bounds in FDP of calibrated stability selectionmodels for different (sets of) penalty parameters. In multi-block stabilityselection, rows correspond to different sets of penalty parameters, (valuesare stored in the output "Lambda") and columns correspond to differentblocks. |
S_2d | an array of (block-specific) stability scores obtainedwith different combinations of parameters. Rows correspond to different(sets of) penalty parameters and columns correspond to different thresholdsin selection proportions. In multi-block stability selection, indices alongthe third dimension correspond to different blocks. |
PFER_2d | anarray of computed upper-bounds of PFER obtained with different combinationsof parameters. Rows correspond to different penalty parameters and columnscorrespond to different thresholds in selection proportions. Not availablein multi-block stability selection graphical modelling. |
FDP_2d | anarray of computed upper-bounds of FDP obtained with different combinationsof parameters. Rows correspond to different penalty parameters and columnscorrespond to different thresholds in selection proportions. Not availablein multi-block stability selection graphical modelling. |
References
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.
Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.
See Also
Other stability metric functions:ConsensusScore(),FDP(),PFER(),StabilityScore()
Examples
## Sparse or sparse group penalisation# Simulating set of selection proportionsset.seed(1)selprop <- matrix(round(runif(n = 20), digits = 2), nrow = 2)# Computing stability scores for different thresholdsmetrics <- StabilityMetrics( selprop = selprop, pi = c(0.6, 0.7, 0.8), K = 100, graph = FALSE)## Group penalisation# Simulating set of selection proportionsset.seed(1)selprop <- matrix(round(runif(n = 6), digits = 2), nrow = 2)selprop <- cbind( selprop[, 1], selprop[, 1], selprop[, 2], selprop[, 2], matrix(rep(selprop[, 3], each = 6), nrow = 2, byrow = TRUE))# Computing stability scores for different thresholdsmetrics <- StabilityMetrics( selprop = selprop, pi = c(0.6, 0.7, 0.8), K = 100, graph = FALSE, group = c(2, 2, 6))Stability score
Description
Computes the stability score from selection proportions of models with agiven parameter controlling the sparsity and for different thresholds inselection proportions. The score measures how unlikely it is that theselection procedure is uniform (i.e. uninformative) for a given combinationof parameters.
Usage
StabilityScore( selprop, pi_list = seq(0.6, 0.9, by = 0.01), K, n_cat = 3, group = NULL)Arguments
selprop | array of selection proportions. |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
n_cat | computation options for the stability score. Default is |
group | vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group and only needs tobe provided for group (but not sparse group) penalisation. |
Details
The stability score is derived from the likelihood under theassumption of uniform (uninformative) selection.
We classify the features into three categories: the stably selected ones(that have selection proportions\ge \pi), the stably excluded ones(selection proportion\le 1-\pi), and the unstable ones (selectionproportions between1-\pi and\pi).
Under the hypothesis of equiprobability of selection (instability), thelikelihood of observing stably selected, stably excluded and unstablefeatures can be expressed as:
L_{\lambda, \pi} = \prod_{j=1}^N [ ( 1 - F( K \pi - 1 ) )^{1_{H_{\lambda} (j) \ge K \pi}} \times ( F( K \pi - 1 ) - F( K ( 1 - \pi ) )^{1_{ (1-\pi) K < H_{\lambda} (j) < K \pi }} \times F( K ( 1 - \pi ) )^{1_{ H_{\lambda} (j) \le K (1-\pi) }} ]
whereH_{\lambda} (j) is the selection count of featurej andF(x) is the cumulative probability function of the binomialdistribution with parametersK and the average proportion of selectedfeatures over resampling iterations.
The stability score is computed as the minus log-transformed likelihoodunder the assumption of equiprobability of selection:
S_{\lambda, \pi} = -log(L_{\lambda, \pi})
The stability score increases with stability.
Alternatively, the stability score can be computed by considering only twosets of features: stably selected (selection proportions\ge \pi) ornot (selection proportions< \pi). This can be done usingn_cat=2.
Value
A vector of stability scores obtained with the different thresholdsin selection proportions.
References
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
See Also
Other stability metric functions:ConsensusScore(),FDP(),PFER(),StabilityMetrics()
Examples
# Simulating set of selection proportionsset.seed(1)selprop <- round(runif(n = 20), digits = 2)# Computing stability scores for different thresholdsscore <- StabilityScore(selprop, pi_list = c(0.6, 0.7, 0.8), K = 100)Stable results
Description
Extracts stable results for stability selection or consensus clustering.
Usage
Stable(stability, argmax_id = NULL, linkage = "complete")SelectedVariables(stability, argmax_id = NULL)Adjacency(stability, argmax_id = NULL)Clusters(stability, linkage = "complete", argmax_id = NULL)Arguments
stability | output of |
argmax_id | optional indices of hyper-parameters. If |
linkage | character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude |
Value
A binary vector or matrix encoding the selection status (1 ifselected,0 otherwise) in stability selection or stable clustermembership in consensus clustering.
See Also
VariableSelection,BiSelection,GraphicalModel,Clustering
Examples
# Variable selectionset.seed(1)simul <- SimulateRegression(pk = 20)stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)SelectedVariables(stab)Stable(stab)# Graphical modelset.seed(1)simul <- SimulateGraphical(pk = 10)stab <- GraphicalModel(xdata = simul$data)Adjacency(stab)Stable(stab)# Clusteringset.seed(1)simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1)stab <- Clustering(xdata = simul$data)Clusters(stab)Stable(stab)Stability selection in Structural Equation Modelling
Description
Performs stability selection for Structural Equation Models. The underlyingarrow selection algorithm (e.g. regularised Structural Equation Modelling) isrun with different combinations of parameters controlling the sparsity (e.g.penalty parameter) and thresholds in selection proportions. These twohyper-parameters are jointly calibrated by maximisation of the stabilityscore.
Usage
StructuralModel( xdata, adjacency, residual_covariance = NULL, Lambda = NULL, pi_list = seq(0.01, 0.99, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = NULL, implementation = PenalisedLinearSystem, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, Lambda_cardinal = 100, optimisation = c("grid_search", "nloptr"), n_cores = 1, output_data = FALSE, verbose = TRUE, ...)Arguments
xdata | matrix with observations as rows and variables as columns.Column names must be defined and in line with the row and column names of |
adjacency | binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined. |
residual_covariance | binary and symmetric matrix encoding the nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation). By default, this is the identity matrix (noresidual covariance). |
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
implementation | function to use for variable selection. |
resampling | resampling approach. Possible values are: |
cpss | logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is split |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
Lambda_cardinal | number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used if |
optimisation | character string indicating the type of optimisationmethod. With |
n_cores | number of cores to use for parallel computing (see argument |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should beprinted. |
... | additional parameters passed to the functions provided in |
Details
In stability selection, a feature selection algorithm is fitted onK subsamples (or bootstrap samples) of the data with differentparameters controlling the sparsity (Lambda). For a given (set of)sparsity parameter(s), the proportion out of theK models in whicheach feature is selected is calculated. Features with selection proportionsabove a threshold pi are considered stably selected. The stabilityselection model is controlled by the sparsity parameter(s) for theunderlying algorithm, and the threshold in selection proportion:
V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}
In Structural Equation Modelling, "feature" refers to an arrow in thecorresponding Directed Acyclic Graph.
These parameters can be calibrated by maximisation of a stability score(seeConsensusScore ifn_cat=NULL orStabilityScore otherwise) calculated under the nullhypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully tocheck that the grids of parametersLambda andpi_list do notrestrict the calibration to a region that would not include the globalmaximum (seeCalibrationPlot). In particular, the gridLambda may need to be extended when the maximum stability isobserved on the left or right edges of the calibration heatmap. In someinstances, multiple peaks of stability score can be observed. Simulationstudies suggest that the peak corresponding to the largest number ofselected features tend to give better selection performances. This is notnecessarily the highest peak (which is automatically retained by thefunctions in this package). The user can decide to manually choose anotherpeak.
To control the expected number of False Positives (Per Family Error Rate)in the results, a thresholdPFER_thr can be specified. Theoptimisation problem is then constrained to sets of parameters thatgenerate models with an upper-bound in PFER belowPFER_thr (seeMeinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
Possible resampling procedures include defining (i)K subsamples ofa proportiontau of the observations, (ii)K bootstrapsamples with the full sample size (obtained with replacement), and (iii)K/2 splits of the data in half for complementary pair stabilityselection (see argumentsresampling andcpss). Incomplementary pair stability selection, a feature is considered selected ata given resampling iteration if it is selected in the two complementarysubsamples.
To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.
For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession. Alternatively,the function can be run manually with differentseeds and all otherparameters equal. The results can then be combined usingCombine.
Value
An object of classvariable_selection. A list with:
S | amatrix of the best stability scores for different parameters controllingthe level of sparsity in the underlying algorithm. |
Lambda | a matrixof parameters controlling the level of sparsity in the underlyingalgorithm. |
Q | a matrix of the average number of selected features bythe underlying algorithm with different parameters controlling the level ofsparsity. |
Q_s | a matrix of the calibrated number of stably selectedfeatures with different parameters controlling the level of sparsity. |
P | a matrix of calibrated thresholds in selection proportions fordifferent parameters controlling the level of sparsity in the underlyingalgorithm. |
PFER | a matrix of upper-bounds in PFER of calibratedstability selection models with different parameters controlling the levelof sparsity. |
FDP | a matrix of upper-bounds in FDP of calibratedstability selection models with different parameters controlling the levelof sparsity. |
S_2d | a matrix of stability scores obtained withdifferent combinations of parameters. Columns correspond to differentthresholds in selection proportions. |
PFER_2d | a matrix ofupper-bounds in FDP obtained with different combinations of parameters.Columns correspond to different thresholds in selection proportions. |
FDP_2d | a matrix of upper-bounds in PFER obtained with differentcombinations of parameters. Columns correspond to different thresholds inselection proportions. |
selprop | a matrix of selection proportions.Columns correspond to predictors from |
Beta | an arrayof model coefficients. Columns correspond to predictors from |
method | a list with |
params | a list with values used for arguments |
For all matrices and arrays returned, the rowsare ordered in the same way and correspond to parameter values stored inLambda.
References
Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.
Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.
Jacobucci R, Grimm KJ, McArdle JJ (2016).“Regularized structural equation modeling.”Structural equation modeling: a multidisciplinary journal,23(4), 555–566.doi:10.1080/10705511.2016.1154793.
See Also
SelectionAlgo,Resample,StabilityScore
Other stability functions:BiSelection(),Clustering(),GraphicalModel(),VariableSelection()
Examples
oldpar <- par(no.readonly = TRUE)par(mar = rep(7, 4))# Data simulationset.seed(1)pk <- c(3, 2, 3)simul <- SimulateStructural( n = 500, pk = pk, nu_between = 0.5, v_between = 1, v_sign = 1)# Stability selection (using glmnet)dag <- LayeredDAG(layers = pk)stab <- StructuralModel( xdata = simul$data, adjacency = dag)CalibrationPlot(stab)LinearSystemMatrix(vect = Stable(stab), adjacency = dag)# Stability selection (using OpenMx)if (requireNamespace("OpenMx", quietly = TRUE)) { stab <- StructuralModel( xdata = simul$data, implementation = PenalisedOpenMx, Lambda = seq(50, 500, by = 50), adjacency = dag ) CalibrationPlot(stab) OpenMxMatrix(SelectedVariables(stab), adjacency = dag)}## Not run: # Data simulation with latent variablesset.seed(1)pk <- c(3, 2, 3)simul <- SimulateStructural( n = 500, pk = pk, nu_between = 0.5, v_sign = 1, v_between = 1, n_manifest = 3, ev_manifest = 0.95)# Stability selection (using OpenMx)if (requireNamespace("OpenMx", quietly = TRUE)) { dag <- LayeredDAG(layers = pk, n_manifest = 3) penalised <- dag penalised[, seq_len(ncol(simul$data))] <- 0 stab <- StructuralModel( xdata = simul$data, implementation = PenalisedOpenMx, adjacency = dag, penalised = penalised, Lambda = seq(10, 100, by = 20), K = 10 # to increase for real use ) CalibrationPlot(stab) ids_latent <- grep("f", colnames(dag)) OpenMxMatrix(SelectedVariables(stab), adjacency = dag )[ids_latent, ids_latent]}## End(Not run)par(oldpar)Unweighted K-means clustering
Description
Runs k-means clustering using implementation fromkmeans. This function is not using stability.
Usage
UnweightedKMeansClustering(xdata, nc = NULL, ...)Arguments
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in theunderlying algorithm specified in |
... | additional parameters passed to |
Value
A list with:
comembership | an array of binary and symmetricco-membership matrices. |
weights | a matrix of median weights byfeature. |
Stability selection in regression
Description
Performs stability selection for regression models. The underlying variableselection algorithm (e.g. LASSO regression) is run with differentcombinations of parameters controlling the sparsity (e.g. penalty parameter)and thresholds in selection proportions. These two hyper-parameters arejointly calibrated by maximisation of the stability score.
Usage
VariableSelection( xdata, ydata = NULL, Lambda = NULL, pi_list = seq(0.01, 0.99, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = NULL, family = "gaussian", implementation = PenalisedRegression, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, Lambda_cardinal = 100, group_x = NULL, group_penalisation = FALSE, optimisation = c("grid_search", "nloptr"), n_cores = 1, output_data = FALSE, verbose = TRUE, beep = NULL, ...)Arguments
xdata | matrix of predictors with observations as rows and variables ascolumns. |
ydata | optional vector or matrix of outcome(s). If |
Lambda | matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified in |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator andensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
family | type of regression model. This argument is defined as in |
implementation | function to use for variable selection. Possiblefunctions are: |
resampling | resampling approach. Possible values are: |
cpss | logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is split |
PFER_method | method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by errorcontrol. If |
FDP_thr | threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. If |
Lambda_cardinal | number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used if |
group_x | vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. Only used formodels with group penalisation (e.g. |
group_penalisation | logical indicating if a group penalisation shouldbe considered in the stability score. The use of |
optimisation | character string indicating the type of optimisationmethod. With |
n_cores | number of cores to use for parallel computing (see argument |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should beprinted. |
beep | sound indicating the end of the run. Possible values are: |
... | additional parameters passed to the functions provided in |
Details
In stability selection, a feature selection algorithm is fitted onK subsamples (or bootstrap samples) of the data with differentparameters controlling the sparsity (Lambda). For a given (set of)sparsity parameter(s), the proportion out of theK models in whicheach feature is selected is calculated. Features with selection proportionsabove a threshold pi are considered stably selected. The stabilityselection model is controlled by the sparsity parameter(s) for theunderlying algorithm, and the threshold in selection proportion:
V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}
If argumentgroup_penalisation=FALSE, "feature" refers to variable(variable selection model). If argumentgroup_penalisation=TRUE,"feature" refers to group (group selection model). In this case, groupsneed to be defineda priori and specified in argumentgroup_x.
These parameters can be calibrated by maximisation of a stability score(seeConsensusScore ifn_cat=NULL orStabilityScore otherwise) calculated under the nullhypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully tocheck that the grids of parametersLambda andpi_list do notrestrict the calibration to a region that would not include the globalmaximum (seeCalibrationPlot). In particular, the gridLambda may need to be extended when the maximum stability isobserved on the left or right edges of the calibration heatmap. In someinstances, multiple peaks of stability score can be observed. Simulationstudies suggest that the peak corresponding to the largest number ofselected features tend to give better selection performances. This is notnecessarily the highest peak (which is automatically retained by thefunctions in this package). The user can decide to manually choose anotherpeak.
To control the expected number of False Positives (Per Family Error Rate)in the results, a thresholdPFER_thr can be specified. Theoptimisation problem is then constrained to sets of parameters thatgenerate models with an upper-bound in PFER belowPFER_thr (seeMeinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
Possible resampling procedures include defining (i)K subsamples ofa proportiontau of the observations, (ii)K bootstrapsamples with the full sample size (obtained with replacement), and (iii)K/2 splits of the data in half for complementary pair stabilityselection (see argumentsresampling andcpss). Incomplementary pair stability selection, a feature is considered selected ata given resampling iteration if it is selected in the two complementarysubsamples.
For categorical or time to event outcomes (argumentfamily is"binomial","multinomial" or"cox"), the proportionsof observations from each category in all subsamples or bootstrap samplesare the same as in the full sample.
To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.
For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession. Alternatively,the function can be run manually with differentseeds and all otherparameters equal. The results can then be combined usingCombine.
Value
An object of classvariable_selection. A list with:
S | amatrix of the best stability scores for different parameters controllingthe level of sparsity in the underlying algorithm. |
Lambda | a matrixof parameters controlling the level of sparsity in the underlyingalgorithm. |
Q | a matrix of the average number of selected features bythe underlying algorithm with different parameters controlling the level ofsparsity. |
Q_s | a matrix of the calibrated number of stably selectedfeatures with different parameters controlling the level of sparsity. |
P | a matrix of calibrated thresholds in selection proportions fordifferent parameters controlling the level of sparsity in the underlyingalgorithm. |
PFER | a matrix of upper-bounds in PFER of calibratedstability selection models with different parameters controlling the levelof sparsity. |
FDP | a matrix of upper-bounds in FDP of calibratedstability selection models with different parameters controlling the levelof sparsity. |
S_2d | a matrix of stability scores obtained withdifferent combinations of parameters. Columns correspond to differentthresholds in selection proportions. |
PFER_2d | a matrix ofupper-bounds in FDP obtained with different combinations of parameters.Columns correspond to different thresholds in selection proportions. |
FDP_2d | a matrix of upper-bounds in PFER obtained with differentcombinations of parameters. Columns correspond to different thresholds inselection proportions. |
selprop | a matrix of selection proportions.Columns correspond to predictors from |
Beta | an arrayof model coefficients. Columns correspond to predictors from |
method | a list with |
params | a list with values used for arguments |
For all matrices and arrays returned, the rowsare ordered in the same way and correspond to parameter values stored inLambda.
References
Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.
Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.
Tibshirani R (1996).“Regression Shrinkage and Selection via the Lasso.”Journal of the Royal Statistical Society. Series B (Methodological),58(1), 267–288.ISSN 00359246,http://www.jstor.org/stable/2346178.
See Also
PenalisedRegression,SelectionAlgo,LambdaGridRegression,Resample,StabilityScoreRefit,ExplanatoryPerformance,Incremental,
Other stability functions:BiSelection(),Clustering(),GraphicalModel(),StructuralModel()
Examples
oldpar <- par(no.readonly = TRUE)par(mar = rep(7, 4))# Linear regressionset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")# Calibration plotCalibrationPlot(stab)# Extracting the resultssummary(stab)Stable(stab)SelectionProportions(stab)plot(stab)# Using randomised LASSOstab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", penalisation = "randomised")plot(stab)# Using adaptive LASSOstab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", penalisation = "adaptive")plot(stab)# Using additional arguments from glmnet (e.g. penalty.factor)stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", penalty.factor = c(rep(1, 45), rep(0, 5)))head(coef(stab))# Using CARTif (requireNamespace("rpart", quietly = TRUE)) { stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, implementation = CART, family = "gaussian", ) plot(stab)}# Regression with multivariate outcomesset.seed(1)simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian")stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "mgaussian")summary(stab)# Logistic regressionset.seed(1)simul <- SimulateRegression(n = 200, pk = 10, family = "binomial", ev_xy = 0.8)stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "binomial")summary(stab)# Sparse PCA (1 component, see BiSelection for more components)if (requireNamespace("elasticnet", quietly = TRUE)) { set.seed(1) simul <- SimulateComponents(pk = c(5, 3, 4)) stab <- VariableSelection( xdata = simul$data, Lambda = seq_len(ncol(simul$data) - 1), implementation = SparsePCA ) CalibrationPlot(stab, xlab = "") summary(stab)}# Sparse PLS (1 outcome, 1 component, see BiSelection for more options)if (requireNamespace("sgPLS", quietly = TRUE)) { set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, Lambda = seq_len(ncol(simul$xdata) - 1), implementation = SparsePLS, family = "gaussian" ) CalibrationPlot(stab, xlab = "") SelectedVariables(stab)}# Group PLS (1 outcome, 1 component, see BiSelection for more options)if (requireNamespace("sgPLS", quietly = TRUE)) { stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, Lambda = seq_len(5), group_x = c(5, 5, 10, 20, 10), group_penalisation = TRUE, implementation = GroupPLS, family = "gaussian" ) CalibrationPlot(stab, xlab = "") SelectedVariables(stab)}# Example with more hyper-parameters: elastic netset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")TuneElasticNet <- function(xdata, ydata, family, alpha) { stab <- VariableSelection( xdata = xdata, ydata = ydata, family = family, alpha = alpha, verbose = FALSE ) return(max(stab$S, na.rm = TRUE))}myopt <- optimise(TuneElasticNet, lower = 0.1, upper = 1, maximum = TRUE, xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", alpha = myopt$maximum)summary(stab)enet <- SelectedVariables(stab)# Comparison with LASSOstab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")summary(stab)lasso <- SelectedVariables(stab)table(lasso, enet)# Example using an external function: group-LASSO with gglassoif (requireNamespace("gglasso", quietly = TRUE)) { set.seed(1) simul <- SimulateRegression(n = 200, pk = 20, family = "binomial") ManualGridGroupLasso <- function(xdata, ydata, family, group_x, ...) { # Defining the grouping group <- do.call(c, lapply(seq_len(length(group_x)), FUN = function(i) { rep(i, group_x[i]) })) if (family == "binomial") { ytmp <- ydata ytmp[ytmp == min(ytmp)] <- -1 ytmp[ytmp == max(ytmp)] <- 1 return(gglasso::gglasso(xdata, ytmp, loss = "logit", group = group, ...)) } else { return(gglasso::gglasso(xdata, ydata, lambda = lambda, loss = "ls", group = group, ...)) } } Lambda <- LambdaGridRegression( xdata = simul$xdata, ydata = simul$ydata, family = "binomial", Lambda_cardinal = 20, implementation = ManualGridGroupLasso, group_x = rep(5, 4) ) GroupLasso <- function(xdata, ydata, Lambda, family, group_x, ...) { # Defining the grouping group <- do.call(c, lapply(seq_len(length(group_x)), FUN = function(i) { rep(i, group_x[i]) })) # Running the regression if (family == "binomial") { ytmp <- ydata ytmp[ytmp == min(ytmp)] <- -1 ytmp[ytmp == max(ytmp)] <- 1 mymodel <- gglasso::gglasso(xdata, ytmp, lambda = Lambda, loss = "logit", group = group, ...) } if (family == "gaussian") { mymodel <- gglasso::gglasso(xdata, ydata, lambda = Lambda, loss = "ls", group = group, ...) } # Extracting and formatting the beta coefficients beta_full <- t(as.matrix(mymodel$beta)) beta_full <- beta_full[, colnames(xdata)] selected <- ifelse(beta_full != 0, yes = 1, no = 0) return(list(selected = selected, beta_full = beta_full)) } stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, implementation = GroupLasso, family = "binomial", Lambda = Lambda, group_x = rep(5, 4), group_penalisation = TRUE ) summary(stab)}par(oldpar)Stable attribute weights
Description
Creates a boxplots of the distribution of (calibrated) median attributeweights obtained from the COSA algorithm across the subsampling iterations.See examples inClustering.
Usage
WeightBoxplot( stability, at = NULL, argmax_id = NULL, col = NULL, boxwex = 0.3, xlab = "", ylab = "Weight", cex.lab = 1.5, las = 3, frame = "F", add = FALSE, ...)Arguments
stability | output of |
at | coordinates along the x-axis (more details in |
argmax_id | optional indices of hyper-parameters. If |
col | optional vector of colours. |
boxwex | box width (more details in |
xlab | label of the x-axis. |
ylab | label of the y-axis. |
cex.lab | font size for labels. |
las | orientation of labels on the x-axis (see |
frame | logical indicating if the box around the plot should be drawn(more details in |
add | logical indicating if the boxplot should be added to the currentplot. |
... | additional parameters passed to |
Value
A boxplot.
See Also
Star-shaped nodes
Description
Produces star-shaped nodes in an igraph object.
Usage
mystar(coords, v = NULL, params)Arguments
coords | a matrix of coordinates(see |
v | a vector of node IDs(see |
params | node graphical parameters(see |
See Also
Triangular nodes
Description
Produces triangular nodes in an igraph object.
Usage
mytriangle(coords, v = NULL, params)Arguments
coords | a matrix of coordinates(see |
v | a vector of node IDs(see |
params | node graphical parameters(see |
See Also
Consensus matrix heatmap
Description
Creates a heatmap of the (calibrated) consensus matrix. See examples inClustering.
Usage
## S3 method for class 'clustering'plot( x, linkage = "complete", argmax_id = NULL, theta = NULL, theta_star = NULL, col = c("ivory", "navajowhite", "tomato", "darkred"), lines = TRUE, col.lines = c("blue"), lwd.lines = 2, tick = TRUE, axes = TRUE, col.axis = NULL, cex.axis = 1, xlas = 2, ylas = 2, bty = "n", ...)Arguments
x | output of |
linkage | character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude |
argmax_id | optional indices of hyper-parameters. If |
theta | optional vector of cluster membership. If provided, the orderingof the items should be the same as in |
theta_star | optional vector of true cluster membership. If provided,the ordering of the items should be the same as in |
col | vector of colours. |
lines | logical indicating if lines separating the clusters provided in |
col.lines | colour of the lines separating the clusters. |
lwd.lines | width of the lines separating the clusters. |
tick | logical indicating if axis tickmarks should be displayed. |
axes | logical indicating if item labels should be displayed. |
col.axis | optional vector of cluster colours. |
cex.axis | font size for axes. |
xlas | orientation of labels on the x-axis, as |
ylas | orientation of labels on the y-axis, as |
bty | character string indicating if the box around the plot should bedrawn. Possible values include: |
... | additional arguments passed to |
Value
A heatmap.
Plot of incremental performance
Description
Represents prediction performances upon sequential inclusion of thepredictors in a logistic or Cox regression model as produced byIncremental. The median andquantiles of the performancemetric are reported. See examples inIncremental.
Usage
## S3 method for class 'incremental'plot( x, quantiles = c(0.05, 0.95), col = c("red", "grey"), col.axis = NULL, xgrid = FALSE, ygrid = FALSE, output_data = FALSE, ...)IncrementalPlot( x, quantiles = c(0.05, 0.95), col = c("red", "grey"), col.axis = NULL, xgrid = FALSE, ygrid = FALSE, output_data = FALSE, ...)PlotIncremental( x, quantiles = c(0.05, 0.95), col = c("red", "grey"), col.axis = NULL, xgrid = FALSE, ygrid = FALSE, output_data = FALSE, ...)Arguments
x | output of |
quantiles | quantiles defining the lower and upper bounds. |
col | vector of colours by stable selection status. |
col.axis | optional vector of label colours by stable selection status. |
xgrid | logical indicating if a vertical grid should be drawn. |
ygrid | logical indicating if a horizontal grid should be drawn. |
output_data | logical indicating if the median and quantiles should bereturned in a matrix. |
... | additional plotting arguments (see |
Value
A plot.
See Also
Receiver Operating Characteristic (ROC) band
Description
Plots the True Positive Rate (TPR) as a function of the False Positive Rate(FPR) for different thresholds in predicted probabilities. If the resultsfrom multiple ROC analyses are provided (e.g. output ofExplanatoryPerformance with largeK), the point-wisemedian is represented and flanked by a transparent band defined by point-wisequantiles. See examples inROC andExplanatoryPerformance.
Usage
## S3 method for class 'roc_band'plot( x, col_band = NULL, alpha = 0.5, quantiles = c(0.05, 0.95), add = FALSE, ...)Arguments
x | output of |
col_band | colour of the band defined by point-wise |
alpha | level of opacity for the band. |
quantiles | point-wise quantiles of the performances defining the band. |
add | logical indicating if the curve should be added to the currentplot. |
... | additional plotting arguments (see |
Value
A base plot.
See Also
Plot of selection proportions
Description
Makes a barplot of selection proportions in decreasing order. See examples inVariableSelection.
Usage
## S3 method for class 'variable_selection'plot( x, col = c("red", "grey"), col.axis = NULL, col.thr = "darkred", lty.thr = 2, n_predictors = NULL, ...)Arguments
x | output of |
col | vector of colours by stable selection status. |
col.axis | optional vector of label colours by stable selection status. |
col.thr | threshold colour. |
lty.thr | threshold line type as |
n_predictors | number of predictors to display. |
... | additional plotting arguments (see |
Value
A plot.
See Also
Predict method for stability selection
Description
Computes predicted values from the output ofVariableSelection.
Usage
## S3 method for class 'variable_selection'predict( object, xdata, ydata, newdata = NULL, method = c("ensemble", "refit"), ...)Arguments
object | output of |
xdata | predictor data (training set). |
ydata | outcome data (training set). |
newdata | optional predictor data (test set). |
method | character string indicating if predictions should be obtainedfrom an |
... | additional arguments passed to |
Value
Predicted values.
See Also
Refit,Ensemble,EnsemblePredictions
Examples
## Linear regression# Data simulationset.seed(1)simul <- SimulateRegression(n = 500, pk = 50, family = "gaussian")# Training/test splitids <- Split(data = simul$ydata, tau = c(0.8, 0.2))# Stability selectionstab <- VariableSelection( xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ])# Predictions from post stability selection estimationyhat <- predict(stab, xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ], newdata = simul$xdata[ids[[2]], ], method = "refit")cor(simul$ydata[ids[[2]], ], yhat)^2 # Q-squared# Predictions from ensemble modelyhat <- predict(stab, xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ], newdata = simul$xdata[ids[[2]], ], method = "ensemble")cor(simul$ydata[ids[[2]], ], yhat)^2 # Q-squared## Logistic regression# Data simulationset.seed(1)simul <- SimulateRegression(n = 500, pk = 20, family = "binomial", ev_xy = 0.9)# Training/test splitids <- Split(data = simul$ydata, family = "binomial", tau = c(0.8, 0.2))# Stability selectionstab <- VariableSelection( xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ], family = "binomial")# Predictions from post stability selection estimationyhat <- predict(stab, xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ], newdata = simul$xdata[ids[[2]], ], method = "refit", type = "response")plot(ROC(predicted = yhat, observed = simul$ydata[ids[[2]], ]))# Predictions from ensemble modelyhat <- predict(stab, xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ], newdata = simul$xdata[ids[[2]], ], method = "ensemble", type = "response")plot(ROC(predicted = yhat, observed = simul$ydata[ids[[2]], ]), add = TRUE, col = "blue")