Movatterモバイル変換


[0]ホーム

URL:


Type:Package
Title:Stability-enHanced Approaches using Resampling Procedures
Version:1.4.8
Date:2025-07-23
Maintainer:Barbara Bodinier <barbara.bodinier@gmail.com>
URL:https://github.com/barbarabodinier/sharp
BugReports:https://github.com/barbarabodinier/sharp/issues
Description:In stability selection (N Meinshausen, P Bühlmann (2010) <doi:10.1111/j.1467-9868.2010.00740.x>) and consensus clustering (S Monti et al (2003) <doi:10.1023/A:1023949509487>), resampling techniques are used to enhance the reliability of the results. In this package (B Bodinier et al (2025) <doi:10.18637/jss.v112.i05>), hyper-parameters are calibrated by maximising model stability, which is measured under the null hypothesis that all selection (or co-membership) probabilities are identical (B Bodinier et al (2023a) <doi:10.1093/jrsssc/qlad058> and B Bodinier et al (2023b) <doi:10.1093/bioinformatics/btad635>). Functions are readily implemented for the use of LASSO regression, sparse PCA, sparse (group) PLS or graphical LASSO in stability selection, and hierarchical clustering, partitioning around medoids, K means or Gaussian mixture models in consensus clustering.
License:GPL (≥ 3)
Language:en-GB
Encoding:UTF-8
RoxygenNote:7.3.1
Depends:fake (≥ 1.4.0), R (≥ 3.5)
Imports:abind, beepr, future, future.apply, glassoFast (≥ 1.0.0),glmnet, grDevices, igraph, mclust, nloptr, plotrix, Rdpack,withr (≥ 2.4.0)
Suggests:cluster, corpcor, dbscan, elasticnet, gglasso, mixOmics,nnet, OpenMx, RCy3, randomcoloR, rCOSA, rmarkdown, rpart,sgPLS, sparcl, survival (≥ 3.2.13), testthat (≥ 3.0.0),visNetwork
Additional_repositories:https://barbarabodinier.github.io/drat
Config/testthat/edition:3
RdMacros:Rdpack
NeedsCompilation:no
Packaged:2025-07-23 21:10:57 UTC; barbara
Author:Barbara Bodinier [aut, cre]
Repository:CRAN
Date/Publication:2025-07-23 21:30:02 UTC

sharp: Stability-enHanced Approaches using Resampling Procedures

Description

In stability selection and consensus clustering, resampling techniques areused to enhance the reliability of the results. In this package,hyper-parameters are calibrated by maximising model stability, which ismeasured under the null hypothesis that all selection (or co-membership)probabilities are identical. Functions are readily implemented for the use ofLASSO regression, sparse PCA, sparse (group) PLS or graphical LASSO instability selection, and hierarchical clustering, partitioning aroundmedoids, K means or Gaussian mixture models in consensus clustering.

Details

Package: sharp
Type: Package
Version:1.4.8
Date: 2025-07-23
License: GPL (>= 3)
Maintainer: Barbara Bodinierbarbara.bodinier@gmail.com

References

Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.

Monti S, Tamayo P, Mesirov J, Golub T (2003).“Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.”Machine Learning,52(1), 91–118.doi:10.1023/A:1023949509487.

Examples

oldpar <- par(no.readonly = TRUE)par(mar = c(5, 5, 5, 5))## Regression models# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50)# Stability selectionstab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)CalibrationPlot(stab)summary(stab)SelectedVariables(stab)## Graphical models# Data simulationset.seed(1)simul <- SimulateGraphical(n = 100, pk = 20, topology = "scale-free")# Stability selectionstab <- GraphicalModel(xdata = simul$data)CalibrationPlot(stab)summary(stab)plot(stab)## PCA modelsif (requireNamespace("elasticnet", quietly = TRUE)) {  # Data simulation  set.seed(1)  simul <- SimulateComponents(pk = c(5, 3, 4))  plot(simul)  # Stability selection  stab <- BiSelection(    xdata = simul$data,    ncomp = 3,    implementation = SparsePCA  )  CalibrationPlot(stab)  summary(stab)  SelectedVariables(stab)}## PLS modelsif (requireNamespace("sgPLS", quietly = TRUE)) {  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 50, pk = c(10, 20, 30), family = "gaussian")  # Stability selection  stab <- BiSelection(    xdata = simul$xdata, ydata = simul$ydata,    family = "gaussian", ncomp = 3,    implementation = SparsePLS  )  CalibrationPlot(stab)  summary(stab)  plot(stab)}par(oldpar)

Adjacency matrix from object

Description

Returns the adjacency matrix from anigraph object or from the output ofsimulation or inference functions from the present package.

Usage

AdjacencyFromObject(object)

Arguments

object

input object.

Value

A vector without missing values or NULL.


Summarised coefficients conditionally on selection

Description

Computes descriptive statistics (defined byFUN) for coefficients ofthe (calibrated) models conditionally on selection across resamplingiterations.

Usage

AggregatedEffects(  stability,  lambda_id = NULL,  side = "X",  comp = 1,  FUN = stats::median,  ...)

Arguments

stability

output ofVariableSelection orBiSelection.

lambda_id

parameter ID with respect to the gridLambda. IfNULL, aggregated coefficients across the models run with thecalibrated parameter are returned.

side

character string indicating if coefficients of predictors(side="X") or outcomes (side="Y") should be returned. Onlyapplicable to PLS models.

comp

component ID. Only applicable to PLS models.

FUN

function to use to aggregate coefficients of visited models overresampling iterations. Recommended functions includemedian ormean.

...

additional arguments to be passed toFUN.

Value

A matrix of summarised coefficients conditionally on selection acrossresampling iterations. Missing values (NA) are returned forvariables that are never selected.

See Also

VariableSelection,BiSelection,Refit

Examples

# Example with univariate outcomeset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")median_betas <- AggregatedEffects(stab)# Comparison with refitted modelrefitted <- Refit(xdata = simul$xdata, ydata = simul$ydata, stability = stab)refitted_betas <- coef(refitted)[-1, 1]plot(median_betas[names(refitted_betas), ], refitted_betas,  panel.first = abline(0, 1, lty = 2))# Extracting mean betas conditionally on selectionmean_betas <- AggregatedEffects(stab, FUN = mean)plot(median_betas, mean_betas)# Regression with multivariate outcomesset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, q = 2, family = "gaussian")stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "mgaussian")median_betas <- AggregatedEffects(stab)dim(median_betas)# Sparse PLS with multivariate outcomeif (requireNamespace("sgPLS", quietly = TRUE)) {  set.seed(1)  simul <- SimulateRegression(n = 50, pk = 15, q = 3, family = "gaussian")  x <- simul$xdata  y <- simul$ydata  stab <- BiSelection(    xdata = x, ydata = y,    family = "gaussian", ncomp = 3,    LambdaX = seq_len(ncol(x) - 1),    implementation = SparsePLS  )  median_betas <- AggregatedEffects(stab)  dim(median_betas)  median_betas <- AggregatedEffects(stab, side = "Y")  dim(median_betas)}

Calibrated hyper-parameter(s)

Description

Extracts the calibrated hyper-parameters (or their indices forArgmaxId) with respect to the grids provided inLambdaandpi_list in argumentstability.

Usage

ArgmaxId(stability = NULL, S = NULL)Argmax(stability)

Arguments

stability

output ofVariableSelection orGraphicalModel.

S

matrix of stability scores obtained with different combinations ofparameters where rows correspond to different values of the parametercontrolling the level of sparsity in the underlying feature selectionalgorithm and columns correspond to different values of the threshold inselection proportions. IfS=NULL, argumentstability must beprovided.

Value

A matrix of hyper-parameters (Argmax) or indices(ArgmaxId). For multi-block graphical models, rows correspondto different blocks.

See Also

VariableSelection,GraphicalModel

Examples

# Data simulationset.seed(1)simul <- SimulateGraphical(pk = 20)# Stability selectionstab <- GraphicalModel(xdata = simul$data)# Extracting calibrated hyper-parametersArgmax(stab)# Extracting calibrated hyper-parameters IDsids <- ArgmaxId(stab)ids# Relationship between the two functionsstab$Lambda[ids[1], 1]stab$params$pi_list[ids[2]]

Stability selection of predictors and/or outcomes

Description

Performs stability selection for dimensionality reduction. The underlyingvariable selection algorithm (e.g. sparse PLS) is run with differentcombinations of parameters controlling the sparsity (e.g. number of selectedvariables per component) and thresholds in selection proportions. Thesehyper-parameters are jointly calibrated by maximisation of the stabilityscore.

Usage

BiSelection(  xdata,  ydata = NULL,  group_x = NULL,  group_y = NULL,  LambdaX = NULL,  LambdaY = NULL,  AlphaX = NULL,  AlphaY = NULL,  ncomp = 1,  scale = TRUE,  pi_list = seq(0.01, 0.99, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = NULL,  family = "gaussian",  implementation = SparsePLS,  resampling = "subsampling",  cpss = FALSE,  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  n_cores = 1,  output_data = FALSE,  verbose = TRUE,  beep = NULL,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

group_x

vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. Only used formodels with group penalisation (e.g.implementation=GroupPLS orimplementation=SparseGroupPLS).

group_y

optional vector encoding the grouping structure amongoutcomes. This argument indicates the number of variables in each group.Only used ifimplementation=GroupPLS orimplementation=SparseGroupPLS.

LambdaX

matrix of parameters controlling the number of selectedvariables (for sparse PCA/PLS) or groups (for group and sparse group PLS)in X.

LambdaY

matrix of parameters controlling the number of selectedvariables (for sparse PLS) or groups (for group or sparse group PLS) in Y.Only used iffamily="gaussian".

AlphaX

matrix of parameters controlling the level of sparsity withingroups in X. Only used ifimplementation=SparseGroupPLS.

AlphaY

matrix of parameters controlling the level of sparsity withingroups in X. Only used ifimplementation=SparseGroupPLS andfamily="gaussian".

ncomp

number of components.

scale

logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one).

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

family

type of PLS model. This parameter must be set tofamily="gaussian" for continuous outcomes, or tofamily="binomial" for categorical outcomes. Only used ifydata is provided.

implementation

function to use for feature selection. Possiblefunctions are:SparsePCA,SparsePLS,GroupPLS,SparseGroupPLS.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

cpss

logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is splitK/2 times (K models are fitted). Only used ifPFER_method="MB".

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

n_cores

number of cores to use for parallel computing (see argumentworkers inmultisession). Usingn_cores>1 is only supported withoptimisation="grid_search".

output_data

logical indicating if the input datasetsxdata andydata should be included in the output.

verbose

logical indicating if a loading bar and messages should beprinted.

beep

sound indicating the end of the run. Possible values are:NULL (no sound) or an integer between 1 and 11 (see argumentsound inbeep).

...

additional parameters passed to the functions provided inimplementation orresampling.

Details

In stability selection, a feature selection algorithm is fitted onK subsamples (or bootstrap samples) of the data with differentparameters controlling the sparsity (LambdaX,LambdaY,AlphaX, and/orAlphaY). For a given (set of) sparsityparameter(s), the proportion out of theK models in which eachfeature is selected is calculated. Features with selection proportionsabove a threshold pi are considered stably selected. The stabilityselection model is controlled by the sparsity parameter(s) (denoted by\lambda) for the underlying algorithm, and the threshold in selectionproportion:

V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}

For sparse and sparse group dimensionality reduction, "feature" refers tovariable (variable selection model). For group PLS, "feature" refers togroup (group selection model). For (sparse) group PLS, groups need to bedefineda priori and specified in argumentsgroup_x and/orgroup_y.

These parameters can be calibrated by maximisation of a stability score(seeConsensusScore ifn_cat=NULL orStabilityScore otherwise) calculated under the nullhypothesis of equiprobability of selection.

It is strongly recommended to examine the calibration plot carefully tocheck that the grids of parametersLambda andpi_list do notrestrict the calibration to a region that would not include the globalmaximum (seeCalibrationPlot). In particular, the gridLambda may need to be extended when the maximum stability isobserved on the left or right edges of the calibration heatmap. In someinstances, multiple peaks of stability score can be observed. Simulationstudies suggest that the peak corresponding to the largest number ofselected features tend to give better selection performances. This is notnecessarily the highest peak (which is automatically retained by thefunctions in this package). The user can decide to manually choose anotherpeak.

To control the expected number of False Positives (Per Family Error Rate)in the results, a thresholdPFER_thr can be specified. Theoptimisation problem is then constrained to sets of parameters thatgenerate models with an upper-bound in PFER belowPFER_thr (seeMeinshausen and Bühlmann (2010) and Shah and Samworth (2013)).

Possible resampling procedures include defining (i)K subsamples ofa proportiontau of the observations, (ii)K bootstrap sampleswith the full sample size (obtained with replacement), and (iii)K/2splits of the data in half for complementary pair stability selection (seeargumentsresampling andcpss). In complementary pairstability selection, a feature is considered selected at a given resamplingiteration if it is selected in the two complementary subsamples.

For categorical outcomes (argumentfamily is"binomial" or"multinomial"), the proportions of observations from each categoryin all subsamples or bootstrap samples are the same as in the full sample.

To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.

For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession.

Value

An object of classbi_selection. A list with:

summary

amatrix of the best stability scores and corresponding parameterscontrolling the level of sparsity in the underlying algorithm for differentnumbers of components. Possible columns include:comp (componentindex),nx (number of predictors to include, parameter of theunderlying algorithm),alphax (sparsity within the predictor groups,parameter of the underlying algorithm),pix (threshold in selectionproportion for predictors),ny (number of outcomes to include,parameter of the underlying algorithm),alphay (sparsity within theoutcome groups, parameter of the underlying algorithm),piy(threshold in selection proportion for outcomes),S (stabilityscore). Columns that are not relevant to the model are not reported (e.g.alpha_x andalpha_y are not returned for sparse PLS models).

summary_full

a matrix of the best stability scores for differentcombinations of parameters controlling the sparsity and components.

selectedX

a binary matrix encoding stably selected predictors.

selpropX

a matrix of calibrated selection proportions forpredictors.

selectedY

a binary matrix encoding stably selectedoutcomes. Only returned for PLS models.

selpropY

a matrix ofcalibrated selection proportions for outcomes. Only returned for PLSmodels.

selected

a binary matrix encoding stable relationshipsbetween predictor and outcome variables. Only returned for PLS models.

selectedX_full

a binary matrix encoding stably selected predictors.

selpropX_full

a matrix of selection proportions for predictors.

selectedY_full

a binary matrix encoding stably selected outcomes.Only returned for PLS models.

selpropY_full

a matrix of selectionproportions for outcomes. Only returned for PLS models.

coefX

anarray of estimated loadings coefficients for the different components(rows), for the predictors (columns), as obtained across theKvisited models (along the third dimension).

coefY

an array ofestimated loadings coefficients for the different components (rows), forthe outcomes (columns), as obtained across theK visited models(along the third dimension). Only returned for PLS models.

method

alist withtype="bi_selection" and values used for argumentsimplementation,family,scale,resampling,cpss andPFER_method.

params

a list with values usedfor argumentsK,group_x,group_y,LambdaX,LambdaY,AlphaX,AlphaY,pi_list,tau,n_cat,pk,n (number of observations),PFER_thr,FDP_thr andseed. The datasetsxdataandydata are also included ifoutput_data=TRUE.

The rows ofsummary and columns ofselectedX,selectedY,selpropX,selpropY,selected,coefX andcoefY are ordered in the same way and correspond to components andparameter values stored insummary. The rows ofsummary_fulland columns ofselectedX_full,selectedY_full,selpropX_full andselpropY_full are ordered in the same wayand correspond to components and parameter values stored insummary_full.

References

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.

Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.

Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016).“Group and sparse group partial least square approaches applied in genomics context.”Bioinformatics,32(1), 35-42.ISSN 1367-4803,doi:10.1093/bioinformatics/btv535.

KA LC, Rossouw D, Robert-Granié C, Besse P (2008).“A sparse PLS for variable selection when integrating omics data.”Stat Appl Genet Mol Biol,7(1), Article 35.ISSN 1544-6115,doi:10.2202/1544-6115.1390.

Shen H, Huang JZ (2008).“Sparse principal component analysis via regularized low rank matrix approximation.”Journal of Multivariate Analysis,99(6), 1015-1034.ISSN 0047-259X,doi:10.1016/j.jmva.2007.06.007.

Zou H, Hastie T, Tibshirani R (2006).“Sparse Principal Component Analysis.”Journal of Computational and Graphical Statistics,15(2), 265-286.doi:10.1198/106186006X113430.

See Also

SparsePCA,SparsePLS,GroupPLS,SparseGroupPLS,VariableSelection,Resample,StabilityScore

Other stability functions:Clustering(),GraphicalModel(),StructuralModel(),VariableSelection()

Examples

if (requireNamespace("sgPLS", quietly = TRUE)) {  oldpar <- par(no.readonly = TRUE)  par(mar = c(12, 5, 1, 1))  ## Sparse Principal Component Analysis  # Data simulation  set.seed(1)  simul <- SimulateComponents(pk = c(5, 3, 4))  # sPCA: sparsity on X (unsupervised)  stab <- BiSelection(    xdata = simul$data,    ncomp = 2,    LambdaX = seq_len(ncol(simul$data) - 1),    implementation = SparsePCA  )  print(stab)  # Calibration plot  CalibrationPlot(stab)  # Visualisation of the results  summary(stab)  plot(stab)  SelectedVariables(stab)  ## Sparse (Group) Partial Least Squares  # Data simulation (continuous outcomes)  set.seed(1)  simul <- SimulateRegression(n = 100, pk = 15, q = 3, family = "gaussian")  x <- simul$xdata  y <- simul$ydata  # sPLS: sparsity on X  stab <- BiSelection(    xdata = x, ydata = y,    family = "gaussian", ncomp = 3,    LambdaX = seq_len(ncol(x) - 1),    implementation = SparsePLS  )  CalibrationPlot(stab)  summary(stab)  plot(stab)  # sPLS: sparsity on both X and Y  stab <- BiSelection(    xdata = x, ydata = y,    family = "gaussian", ncomp = 3,    LambdaX = seq_len(ncol(x) - 1),    LambdaY = seq_len(ncol(y) - 1),    implementation = SparsePLS,    n_cat = 2  )  CalibrationPlot(stab)  summary(stab)  plot(stab)  # sgPLS: sparsity on X  stab <- BiSelection(    xdata = x, ydata = y, K = 10,    group_x = c(2, 8, 5),    family = "gaussian", ncomp = 3,    LambdaX = seq_len(2), AlphaX = seq(0.1, 0.9, by = 0.1),    implementation = SparseGroupPLS  )  CalibrationPlot(stab)  summary(stab)  par(oldpar)}

Binomial probabilities for stability score

Description

Computes the probabilities of observing each category of selectionproportions under the assumption of a uniform selection procedure.

Usage

BinomialProbabilities(q, N, pi, K, n_cat = 3)

Arguments

q

average number of features selected by the underlying algorithm.

N

total number of features.

pi

threshold in selection proportions. If n_cat=3, these values mustbe >0.5 and <1. If n_cat=2, these values must be >0 and <1.

K

number of resampling iterations.

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

Value

A list of probabilities for each of the 2 or 3 categories ofselection proportions.


Multi-block grid

Description

Generates a matrix of parameters controlling the sparsity of the underlyingselection algorithm for multi-block calibration.

Usage

BlockLambdaGrid(Lambda, lambda_other_blocks = NULL)

Arguments

Lambda

vector or matrix of penalty parameters.

lambda_other_blocks

optional vector of penalty parameters to use forother blocks in the iterative multi-block procedure.

Value

A list with:

Lambda

a matrix of (block-specific) penaltyparameters. In multi-block stability selection, rows correspond to sets ofpenalty parameters and columns correspond to different blocks.

Sequential_template

logical matrix encoding the type of procedurefor data with multiple blocks in stability selection graphical modelling.For multi-block estimation, each block is calibrated separately whileothers blocks are weakly penalised (TRUE only for the blockcurrently being calibrated andFALSE for other blocks). Otherapproaches with joint calibration of the blocks are allowed (all entriesare set toTRUE).

See Also

GraphicalModel

Examples

# Multi-block gridLambda <- matrix(  c(    0.8, 0.6, 0.3,    0.5, 0.4, 0.2,    0.7, 0.5, 0.1  ),  ncol = 3, byrow = TRUE)mygrid <- BlockLambdaGrid(Lambda, lambda_other_blocks = 0.1)# Multi-parameter grid (not recommended)Lambda <- matrix(  c(    0.8, 0.6, 0.3,    0.5, 0.4, 0.2,    0.7, 0.5, 0.1  ),  ncol = 3, byrow = TRUE)mygrid <- BlockLambdaGrid(Lambda, lambda_other_blocks = NULL)

Classification And Regression Trees

Description

Runs decision trees using implementation fromrpart.This function is not using stability.

Usage

CART(xdata, ydata, Lambda = NULL, family, ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

Lambda

matrix of parameters controlling the number of splits in thedecision tree.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

...

additional parameters passed torpart.

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors.

beta_full

array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors. Indicesalong the third dimension correspond to outcome variable(s).

References

Breiman L, Friedman JH, Olshen R, Stone CJ (1984).Classification and Regression Trees.Wadsworth.

See Also

SelectionAlgo,VariableSelection

Other underlying algorithm functions:ClusteringAlgo(),PenalisedGraphical(),PenalisedOpenMx(),PenalisedRegression()

Examples

if (requireNamespace("rpart", quietly = TRUE)) {  # Data simulation  set.seed(1)  simul <- SimulateRegression(pk = 50)  # Running the LASSO  mycart <- CART(    xdata = simul$xdata,    ydata = simul$ydata,    family = "gaussian"  )  head(mycart$selected)}

Calibration curve (internal)

Description

Creates a calibration curve for consensus clustering.

Usage

CalibrationCurve(  stability,  bty = "o",  xlab = NULL,  ylab = NULL,  cex.axis = 1,  cex.lab = 1.5,  cex.legend = 1.2,  pch = 19,  lines = TRUE,  col = NULL,  legend = TRUE,  ncol = 1)

Arguments

stability

output ofVariableSelection,GraphicalModel orBiSelection.

bty

character string indicating if the box around the plot should bedrawn. Possible values include:"o" (default, the box is drawn), or"n" (no box).

xlab

label of the x-axis.

ylab

label of the y-axis.

cex.axis

font size for axes.

cex.lab

font size for labels.

cex.legend

font size for text legend entries.

pch

type of point, as inpoints.

lines

logical indicating if the points should be linked by lines. Onlyused ifstability is the output ofBiSelection orClustering.

col

vector of colours.

legend

logical indicating if the legend should be included.

ncol

integer indicating the number of columns in the legend.

Value

a calibration curve.


Calibration plot

Description

Creates a plot showing the stability score as a function of the parameter(s)controlling the level of sparsity in the underlying feature selectionalgorithm and/or the threshold in selection proportions. See examples inVariableSelection,GraphicalModel,Clustering andBiSelection.

Usage

CalibrationPlot(  stability,  block_id = NULL,  col = NULL,  pch = 19,  cex = 0.7,  xlim = NULL,  ylim = NULL,  bty = "o",  lines = TRUE,  lty = 3,  lwd = 2,  show_argmax = TRUE,  show_pix = FALSE,  show_piy = FALSE,  offset = 0.3,  legend = TRUE,  legend_length = NULL,  legend_range = NULL,  ncol = 1,  xlab = NULL,  ylab = NULL,  zlab = expression(italic(q)),  xlas = 2,  ylas = NULL,  zlas = 2,  cex.lab = 1.5,  cex.axis = 1,  cex.legend = 1.2,  xgrid = FALSE,  ygrid = FALSE,  params = c("ny", "alphay", "nx", "alphax"))

Arguments

stability

output ofVariableSelection,GraphicalModel orBiSelection.

block_id

ID of the block to visualise. Only used for multi-blockstability selection graphical models. Ifblock_id=NULL, all blocksare represented in separate panels.

col

vector of colours.

pch

type of point, as inpoints.

cex

size of point.

xlim

displayed range along the x-axis. Only used ifstabilityis the output ofBiSelection.

ylim

displayed range along the y-axis. Only used ifstabilityis the output ofBiSelection.

bty

character string indicating if the box around the plot should bedrawn. Possible values include:"o" (default, the box is drawn), or"n" (no box).

lines

logical indicating if the points should be linked by lines. Onlyused ifstability is the output ofBiSelection orClustering.

lty

line type, as inpar. Only used ifstability is the output ofBiSelection.

lwd

line width, as inpar. Only used ifstability is the output ofBiSelection.

show_argmax

logical indicating if the calibrated parameter(s) shouldbe indicated by lines.

show_pix

logical indicating if the calibrated threshold in selectionproportion inX should be written for each point. Only used ifstability is the output ofBiSelection.

show_piy

logical indicating if the calibrated threshold in selectionproportion inY should be written for each point. Only used ifstability is the output ofBiSelection withpenalisation of the outcomes.

offset

distance between the point and the text, as intext. Only used ifshow_pix=TRUE orshow_piy=TRUE.

legend

logical indicating if the legend should be included.

legend_length

length of the colour bar. Only used ifstabilityis the output ofVariableSelection orGraphicalModel.

legend_range

range of the colour bar. Only used ifstability isthe output ofVariableSelection orGraphicalModel.

ncol

integer indicating the number of columns in the legend.

xlab

label of the x-axis.

ylab

label of the y-axis.

zlab

label of the z-axis. Only used ifstability is the outputofVariableSelection orGraphicalModel.

xlas

orientation of labels on the x-axis, aslas inpar.

ylas

orientation of labels on the y-axis, aslas inpar.

zlas

orientation of labels on the z-axis, aslas inpar.

cex.lab

font size for labels.

cex.axis

font size for axes.

cex.legend

font size for text legend entries.

xgrid

logical indicating if a vertical grid should be drawn. Only usedifstability is the output ofBiSelection.

ygrid

logical indicating if a horizontal grid should be drawn. Onlyused ifstability is the output ofBiSelection.

params

vector of possible parameters ifstability is of classbi_selection. The order of these parameters defines the order inwhich they are represented. Only used ifstability is the output ofBiSelection.

Value

A calibration plot.

See Also

VariableSelection,GraphicalModel,Clustering,BiSelection


Checking input data (regression model)

Description

Checks if input data formats are appropriate. For inappropriate inputs, thisfunction (i) fixes the data format, or (ii) stops the run and generates anerror message.

Usage

CheckDataRegression(xdata, ydata = NULL, family = "gaussian", verbose = TRUE)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

verbose

logical indicating if a loading bar and messages should beprinted.


Checking input parameters (clustering)

Description

Checks if input parameters are valid. For invalid parameters, this function(i) stops the run and generates an error message, or (ii) sets the invalidparameter to its default value and reports it in a warning message.

Usage

CheckInputClustering(  xdata,  Lambda = NULL,  pi_list = seq(0.6, 0.9, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = 3,  implementation = HierarchicalClustering,  scale = TRUE,  resampling = "subsampling",  verbose = TRUE)

Arguments

xdata

data matrix with observations as rows and variables as columns.

Lambda

vector of penalty parameters for weighted distance calculation.Only used for distance-based clustering, including for exampleimplementation=HierarchicalClustering,implementation=PAMClustering, orimplementation=DBSCANClustering.

K

number of resampling iterations.

tau

subsample size.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

implementation

function to use for clustering. Possible functionsincludeHierarchicalClustering (hierarchical clustering),PAMClustering (Partitioning Around Medoids),KMeansClustering (k-means) andGMMClustering(Gaussian Mixture Models). Alternatively, a user-defined function takingxdata andLambda as arguments and returning a binary andsymmetric matrix for which diagonal elements are equal to zero can be used.

scale

logical indicating if the data should be scaled to ensure thatall variables contribute equally to the clustering of the observations.

verbose

logical indicating if a loading bar and messages should beprinted.


Checking input parameters (graphical model)

Description

Checks if input parameters are valid. For invalid parameters, this function(i) stops the run and generates an error message, or (ii) sets the invalidparameter to its default value and reports it in a warning message.

Usage

CheckInputGraphical(  xdata,  pk = NULL,  Lambda = NULL,  lambda_other_blocks = 0.1,  pi_list = seq(0.6, 0.9, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = 3,  implementation = PenalisedGraphical,  start = "cold",  scale = TRUE,  resampling = "subsampling",  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  Lambda_cardinal = 50,  lambda_max = NULL,  lambda_path_factor = 1e-04,  max_density = 0.3,  verbose = TRUE)

Arguments

xdata

data matrix with observations as rows and variables as columns.For multi-block stability selection, the variables in data have to beordered by group.

pk

optional vector encoding the grouping structure. Only used formulti-block stability selection wherepk indicates the number ofvariables in each group. Ifpk=NULL, single-block stabilityselection is performed.

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.IfLambda=NULL andimplementation=PenalisedGraphical,LambdaGridGraphical is used to define a relevant grid.Lambda can be provided as a vector or a matrix withlength(pk) columns.

lambda_other_blocks

optional vector of parameters controlling thelevel of sparsity in neighbour blocks for the multi-block procedure. To usejointly a specific set of parameters for each block,lambda_other_blocks must be set toNULL (not recommended).Only used for multi-block stability selection, i.e. iflength(pk)>1.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

implementation

function to use for graphical modelling. Ifimplementation=PenalisedGraphical, the algorithm implemented inglassoFast is used for regularised estimation ofa conditional independence graph. Alternatively, a user-defined functioncan be provided.

start

character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters (start="warm") or not (start="cold"). Usingstart="warm" can speed-up the computations, but could lead toconvergence issues (in particular with smallLambda_cardinal). Onlyused forimplementation=PenalisedGraphical (see argument"start" inglassoFast).

scale

logical indicating if the correlation (scale=TRUE) orcovariance (scale=FALSE) matrix should be used as input ofglassoFast ifimplementation=PenalisedGraphical. Otherwise, this argument must beused in the function provided inimplementation.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

Lambda_cardinal

number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used ifLambda=NULL.

lambda_max

optional maximum value for the grid in penalty parameters.Iflambda_max=NULL, the maximum value is set to the maximumcovariance in absolute value. Only used ifimplementation=PenalisedGraphical andLambda=NULL.

lambda_path_factor

multiplicative factor used to define the minimumvalue in the grid.

max_density

threshold on the density. The grid is defined such thatthe density of the estimated graph does not exceed max_density.

verbose

logical indicating if a loading bar and messages should beprinted.


Checking that a package is installed

Description

Checks if a package is installed and returns an error message if not.

Usage

CheckPackageInstalled(package)

Arguments

package

character string indicating the name of the package.


Checking input parameters (regression model)

Description

Checks if input parameters are valid. For invalid parameters, this function(i) stops the run and generates an error message, or (ii) sets the invalidparameter to its default value and reports it in a warning message.

Usage

CheckParamRegression(  Lambda = NULL,  pi_list = seq(0.6, 0.9, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = NULL,  family = "gaussian",  implementation = PenalisedRegression,  resampling = "subsampling",  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  Lambda_cardinal = 100,  verbose = TRUE)

Arguments

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.IfLambda=NULL andimplementation=PenalisedRegression,LambdaGridRegression is used to define a relevant grid.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

implementation

function to use for variable selection. Possiblefunctions are:PenalisedRegression,SparsePLS,GroupPLS andSparseGroupPLS. Alternatively, a user-definedfunction can be provided.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

Lambda_cardinal

number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used ifLambda=NULL.

verbose

logical indicating if a loading bar and messages should beprinted.


Consensus clustering

Description

Performs consensus (weighted) clustering. The underlying algorithm (e.g.hierarchical clustering) is run with different number of clustersnc.In consensus weighed clustering, weighted distances are calculated using thecosa2 algorithm with different penalty parametersLambda. The hyper-parameters are calibrated by maximisation of theconsensus score.

Usage

Clustering(  xdata,  nc = NULL,  eps = NULL,  Lambda = NULL,  K = 100,  tau = 0.5,  seed = 1,  n_cat = 3,  implementation = HierarchicalClustering,  scale = TRUE,  linkage = "complete",  row = TRUE,  optimisation = c("grid_search", "nloptr"),  n_cores = 1,  output_data = FALSE,  verbose = TRUE,  beep = NULL,  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

eps

radius in density-based clustering, seedbscan. Only used ifimplementation=DBSCANClustering.

Lambda

vector of penalty parameters for weighted distance calculation.Only used for distance-based clustering, including for exampleimplementation=HierarchicalClustering,implementation=PAMClustering, orimplementation=DBSCANClustering.

K

number of resampling iterations.

tau

subsample size.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

implementation

function to use for clustering. Possible functionsincludeHierarchicalClustering (hierarchical clustering),PAMClustering (Partitioning Around Medoids),KMeansClustering (k-means) andGMMClustering(Gaussian Mixture Models). Alternatively, a user-defined function takingxdata andLambda as arguments and returning a binary andsymmetric matrix for which diagonal elements are equal to zero can be used.

scale

logical indicating if the data should be scaled to ensure thatall variables contribute equally to the clustering of the observations.

linkage

character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude"complete","single" and"average" (seeargument"method" inhclust for a full list).Only used ifimplementation=HierarchicalClustering.

row

logical indicating if rows (ifrow=TRUE) or columns (ifrow=FALSE) contain the items to cluster.

optimisation

character string indicating the type of optimisationmethod to calibrate the regularisation parameter (only used ifLambda is notNULL). Withoptimisation="grid_search"(the default), all values inLambda are visited.Alternatively, optimisation algorithms implemented innloptr can be used withoptimisation="nloptr".By default, we use"algorithm"="NLOPT_GN_DIRECT_L","xtol_abs"=0.1,"ftol_abs"=0.1 and"maxeval" defined aslength(Lambda). These values can bechanged by providing the argumentopts (seenloptr).

n_cores

number of cores to use for parallel computing (see argumentworkers inmultisession). Usingn_cores>1 is only supported withoptimisation="grid_search".

output_data

logical indicating if the input datasetsxdata andydata should be included in the output.

verbose

logical indicating if a loading bar and messages should beprinted.

beep

sound indicating the end of the run. Possible values are:NULL (no sound) or an integer between 1 and 11 (see argumentsound inbeep).

...

additional parameters passed to the functions provided inimplementation orresampling.

Details

In consensus clustering, a clustering algorithm is applied onK subsamples of the observations with different numbers of clustersprovided innc. Ifrow=TRUE (the default), the observations(rows) are the items to cluster. Ifrow=FALSE, the variables(columns) are the items to cluster. For a given number of clusters, theconsensus matrixcoprop stores the proportion of iterations wheretwo items were in the same estimated cluster, out of all iterations whereboth items were drawn in the subsample.

Stable cluster membership is obtained by applying a distance-basedclustering method using(1-coprop) as distance (seeClusters).

These parameters can be calibrated by maximisation of a stability score(seeConsensusScore) calculated under the null hypothesis ofequiprobability of co-membership.

It is strongly recommended to examine the calibration plot (seeCalibrationPlot) to check that there is a clear maximum. Theabsence of a clear maximum suggests that the clustering is not stable,consensus clustering outputs should not be trusted in that case.

To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.

For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession.

Value

An object of classclustering. A list with:

Sc

a matrixof the best stability scores for different (sets of) parameters controllingthe number of clusters and penalisation of attribute weights.

nc

amatrix of numbers of clusters.

Lambda

a matrix of regularisationparameters for attribute weights.

Q

a matrix of the average numberof selected attributes by the underlying algorithm with differentregularisation parameters.

coprop

an array of consensus matrices.Rows and columns correspond to items. Indices along the third dimensioncorrespond to different parameters controlling the number of clusters andpenalisation of attribute weights.

selprop

an array of selectionproportions. Columns correspond to attributes. Rows correspond to differentparameters controlling the number of clusters and penalisation of attributeweights.

method

a list withtype="clustering" and valuesused for argumentsimplementation,linkage, andresampling.

params

a list with values used for argumentsK,tau,pk,n (number of observations inxdata), andseed.

The rows ofSc,nc,Lambda,Q,selprop and indices along the thirddimension ofcoprop are ordered in the same way and correspond toparameter values stored innc andLambda.

References

Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.

Bodinier B, Vuckovic D, Rodrigues S, Filippi S, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration of consensus weighted distance-based clustering approaches using sharp.”Bioinformatics, btad635.ISSN 1367-4811,doi:10.1093/bioinformatics/btad635, https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad635/52191190/btad635.pdf.

Kampert MM, Meulman JJ, Friedman JH (2017).“rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.”Journal of Classification,34(3), 514–547.doi:10.1007/s00357-017-9240-z.

Friedman JH, Meulman JJ (2004).“Clustering objects on subsets of attributes (with discussion).”Journal of the Royal Statistical Society: Series B (Statistical Methodology),66(4), 815-849.doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x,https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.

Monti S, Tamayo P, Mesirov J, Golub T (2003).“Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.”Machine Learning,52(1), 91–118.doi:10.1023/A:1023949509487.

See Also

Resample,ConsensusScore,HierarchicalClustering,PAMClustering,KMeansClustering,GMMClustering

Other stability functions:BiSelection(),GraphicalModel(),StructuralModel(),VariableSelection()

Examples

# Consensus clusteringset.seed(1)simul <- SimulateClustering(  n = c(30, 30, 30), nu_xc = 1, ev_xc = 0.5)stab <- Clustering(xdata = simul$data)print(stab)CalibrationPlot(stab)summary(stab)Clusters(stab)plot(stab)# Consensus weighted clusteringif (requireNamespace("rCOSA", quietly = TRUE)) {  set.seed(1)  simul <- SimulateClustering(    n = c(30, 30, 30), pk = 20,    theta_xc = c(rep(1, 10), rep(0, 10)),    ev_xc = 0.9  )  stab <- Clustering(    xdata = simul$data,    Lambda = LambdaSequence(lmin = 0.1, lmax = 10, cardinal = 10),    noit = 20, niter = 10  )  print(stab)  CalibrationPlot(stab)  summary(stab)  Clusters(stab)  plot(stab)  WeightBoxplot(stab)}

(Weighted) clustering algorithm

Description

Runs the (weighted) clustering algorithm specified in the argumentimplementation and returns matrices of variable weights, and theco-membership structure. This function is not using stability.

Usage

ClusteringAlgo(  xdata,  nc = NULL,  eps = NULL,  Lambda = NULL,  scale = TRUE,  row = TRUE,  implementation = HierarchicalClustering,  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, nrow(xdata)).

eps

radius in density-based clustering, seedbscan. Only used ifimplementation=DBSCANClustering.

Lambda

vector of penalty parameters.

scale

logical indicating if the data should be scaled to ensure thatall variables contribute equally to the clustering of the observations.

row

logical indicating if rows (ifrow=TRUE) or columns (ifrow=FALSE) contain the items to cluster.

implementation

function to use for clustering. Possible functionsincludeHierarchicalClustering (hierarchical clustering),PAMClustering (Partitioning Around Medoids),KMeansClustering (k-means) andGMMClustering(Gaussian Mixture Models). Alternatively, a user-defined function takingxdata andLambda as arguments and returning a binary andsymmetric matrix for which diagonal elements are equal to zero can be used.

...

additional parameters passed to the function provided inimplementation.

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors.

weight

array of model coefficients. Rows correspond todifferent model parameters. Columns correspond to predictors. Indices alongthe third dimension correspond to outcome variable(s).

comembership

array of model coefficients. Rows correspond todifferent model parameters. Columns correspond to predictors. Indices alongthe third dimension correspond to outcome variable(s).

See Also

VariableSelection

Other underlying algorithm functions:CART(),PenalisedGraphical(),PenalisedOpenMx(),PenalisedRegression()

Examples

# Simulation of 15 observations belonging to 3 groupsset.seed(1)simul <- SimulateClustering(  n = c(5, 5, 5), pk = 100)# Running hierarchical clusteringmyclust <- ClusteringAlgo(  xdata = simul$data, nc = 2:5,  implementation = HierarchicalClustering)

Clustering performance

Description

Computes different metrics of clustering performance by comparing true andpredicted co-membership. This function can only be used in simulation studies(i.e. when the true cluster membership is known).

Usage

ClusteringPerformance(theta, theta_star, ...)

Arguments

theta

output fromClustering. Alternatively, it can bethe estimated co-membership matrix (seeCoMembership).

theta_star

output fromSimulateClustering.Alternatively,it can be the true co-membership matrix (seeCoMembership).

...

additional arguments to be passed toClusters.

Value

A matrix of selection metrics including:

TP

number of True Positives (TP)

FN

number of FalseNegatives (TN)

FP

number of False Positives (FP)

TN

numberof True Negatives (TN)

sensitivity

sensitivity, i.e. TP/(TP+FN)

specificity

specificity, i.e. TN/(TN+FP)

accuracy

accuracy,i.e. (TP+TN)/(TP+TN+FP+FN)

precision

precision (p), i.e.TP/(TP+FP)

recall

recall (r), i.e. TP/(TP+FN)

F1_score

F1-score, i.e. 2*p*r/(p+r)

rand

Rand Index, i.e.(TP+TN)/(TP+FP+TN+FN)

ari

Adjusted Rand Index (ARI), i.e.2*(TP*TN-FP*FN)/((TP+FP)*(TN+FP)+(TP+FN)*(TN+FN))

jaccard

Jaccardindex, i.e. TP/(TP+FP+FN)

See Also

Other functions for model performance:SelectionPerformance(),SelectionPerformanceGraph()

Examples

# Data simulationset.seed(1)simul <- SimulateClustering(  n = c(30, 30, 30), nu_xc = 1)plot(simul)# Consensus clusteringstab <- Clustering(  xdata = simul$data, nc = seq_len(5))# Clustering performanceClusteringPerformance(stab, simul)# Alternative formulationClusteringPerformance(  theta = CoMembership(Clusters(stab)),  theta_star = simul$theta)

Pairwise co-membership

Description

Generates a symmetric and binary matrix indicating, if two items areco-members, i.e. belong to the same cluster.

Usage

CoMembership(groups)

Arguments

groups

vector of group membership.

Value

A symmetric and binary matrix.

Examples

# Simulated grouping structuremygroups <- c(rep(1, 3), rep(2, 5), rep(3, 2))# Co-membership matrixCoMembership(mygroups)

Model coefficients

Description

Extracts the coefficients of visited models at different resamplingiterations of a stability selection run. This function can be applied to theoutput ofVariableSelection.

Usage

Coefficients(stability, side = "X", comp = 1, iterations = NULL)

Arguments

stability

output ofVariableSelection.

side

character string indicating if coefficients of the predictor(side="X") or outcome (side="Y") coefficients should bereturned. Optionside="Y" is only applicable to PLS models.

comp

component ID. Only applicable to PLS models.

iterations

vector of iteration IDs. Ifiterations=NULL, thecoefficients of all visited models are returned. This number must besmaller than the number of iterationsK used for the stabilityselection run.

See Also

VariableSelection


Merging stability selection outputs

Description

Merges the outputs from two runs ofVariableSelection,GraphicalModel orClustering. The two runs musthave been done using the samemethods and the sameparams butwith differentseeds. The combined output will contain results basedon iterations from bothstability1 andstability2. Thisfunction can be used for parallelisation.

Usage

Combine(stability1, stability2, include_beta = TRUE)

Arguments

stability1

output from a first run ofVariableSelection,GraphicalModel, orClustering.

stability2

output from a second run ofVariableSelection,GraphicalModel, orClustering.

include_beta

logical indicating if the beta coefficients of visitedmodels should be concatenated. Only applicable to variable selection orclustering.

Value

A single output of the same format.

See Also

VariableSelection,GraphicalModel

Examples

## Variable selection# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")# Two runsstab1 <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, seed = 1, K = 10)stab2 <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, seed = 2, K = 10)# Merging the outputsstab <- Combine(stability1 = stab1, stability2 = stab2, include_beta = FALSE)str(stab)## Graphical modelling# Data simulationsimul <- SimulateGraphical(pk = 20)# Two runsstab1 <- GraphicalModel(xdata = simul$data, seed = 1, K = 10)stab2 <- GraphicalModel(xdata = simul$data, seed = 2, K = 10)# Merging the outputsstab <- Combine(stability1 = stab1, stability2 = stab2)str(stab)## Clustering# Data simulationsimul <- SimulateClustering(n = c(15, 15, 15))# Two runsstab1 <- Clustering(xdata = simul$data, seed = 1)stab2 <- Clustering(xdata = simul$data, seed = 2)# Merging the outputsstab <- Combine(stability1 = stab1, stability2 = stab2)str(stab)

Concatenate stability objects

Description

Generates a single stability object from two stability objects. This functionis used to concatenate results when usingnloptr.

Usage

Concatenate(stability1, stability2 = NULL, order_output = FALSE)

Arguments

stability1

a stability object.

stability2

another stability object (optional).

Value

A single stability object.


Consensus score

Description

Computes the consensus score from the consensus matrix, matrix of co-samplingcounts and consensus clusters. The score is a z statistic for the comparisonof the co-membership proportions observed within and between the consensusclusters.

Usage

ConsensusScore(prop, K, theta)

Arguments

prop

consensus matrix.

K

matrix of co-sampling counts.

theta

consensus co-membership matrix.

Details

To calculate the consensus score, the features are classified as being stablyselected or not (in selection) or as being in the same consensus cluster ornot (in clustering). In selection, the quantitiesX_w andX_b aredefined as the sum of the selection counts for features that are stablyselected or not, respectively. In clustering, the quantitiesX_w andX_b are defined as the sum of the co-membership counts for pairs ofitems in the same consensus cluster or in different consensus clusters,respectively.

Conditionally on this classification, and under the assumption that theselection (or co-membership) probabilities are the same for all features (oritem pairs) in each of these two categories, the quantitiesX_w andX_b follow binomial distributions with probabilitiesp_w andp_b, respectively.

In the most unstable situation, we suppose that all features (or item pairs)would have the same probability of being selected (or co-members). Theconsensus score is the z statistic from a z test where the null hypothesis isp_w \leq p_b.

The consensus score increases with stability.

Value

A consensus score.

See Also

Other stability metric functions:FDP(),PFER(),StabilityMetrics(),StabilityScore()

Examples

# Data simulationset.seed(2)simul <- SimulateClustering(  n = c(30, 30, 30),  nu_xc = 1)plot(simul)# Consensus clusteringstab <- Clustering(  xdata = simul$data)stab$Sc[3]# Calculating the consensus scoretheta <- CoMembership(Clusters(stab, argmax_id = 3))ConsensusScore(  prop = (stab$coprop[, , 3])[upper.tri(stab$coprop[, , 3])],  K = stab$sampled_pairs[upper.tri(stab$sampled_pairs)],  theta = theta[upper.tri(theta)])

(Weighted) density-based clustering

Description

Runs Density-Based Spatial Clustering of Applications with Noise (DBSCAN)clustering using implementation fromdbscan. This isalso known as the k-medoids algorithm. IfLambda is provided,clustering is applied on the weighted distance matrix calculated using theCOSA algorithm as implemented incosa2. Otherwise,distances are calculated usingdist. This function isnot using stability.

Usage

DBSCANClustering(  xdata,  nc = NULL,  eps = NULL,  Lambda = NULL,  distance = "euclidean",  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

eps

radius in density-based clustering, seedbscan.

Lambda

vector of penalty parameters (see argumentlambda incosa2). Unweighted distance matrices are used ifLambda=NULL.

distance

character string indicating the type of distance to use. IfLambda=NULL, possible values include"euclidean","maximum","canberra","binary", and"minkowski" (see argumentmethod indist). Otherwise, possible values include"euclidean" (pwr=2) or"absolute" (pwr=1) (seeargumentpwr incosa2).

...

additional parameters passed todbscan(except forminPts which is fixed to2),dist, orcosa2. Ifweighted=TRUE, parametersniter (default to 1) andnoit (default to 100) correspond to the number of iterations incosa2 to calculate weights and may need to bemodified.

Value

A list with:

comembership

an array of binary and symmetricco-membership matrices.

weights

a matrix of median weights byfeature.

References

Kampert MM, Meulman JJ, Friedman JH (2017).“rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.”Journal of Classification,34(3), 514–547.doi:10.1007/s00357-017-9240-z.

Friedman JH, Meulman JJ (2004).“Clustering objects on subsets of attributes (with discussion).”Journal of the Royal Statistical Society: Series B (Statistical Methodology),66(4), 815-849.doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x,https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.

See Also

Other clustering algorithms:GMMClustering(),HierarchicalClustering(),KMeansClustering(),PAMClustering()

Examples

if (requireNamespace("dbscan", quietly = TRUE)) {  # Data simulation  set.seed(1)  simul <- SimulateClustering(n = c(10, 10), pk = 50)  plot(simul)  # DBSCAN clustering  myclust <- DBSCANClustering(    xdata = simul$data,    eps = seq(0, 2 * sqrt(ncol(simul$data) - 1), by = 0.1)  )  # Weighted PAM clustering (using COSA)  if (requireNamespace("rCOSA", quietly = TRUE)) {    myclust <- DBSCANClustering(      xdata = simul$data,      eps = c(0.25, 0.5, 0.75),      Lambda = c(0.2, 0.5)    )  }}

Categorical from dummy variables

Description

Generates a single categorical variable from corresponding dummy variables.

Usage

DummyToCategories(x, verbose = FALSE)

Arguments

x

matrix of dummy variables.

verbose

logical indicating if messages should be printed.

Value

A single categorical variable (numeric).


Ensemble model

Description

Creates an ensemble predictive model fromVariableSelectionoutputs.

Usage

Ensemble(stability, xdata, ydata)

Arguments

stability

output ofVariableSelection.

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

Value

An object of classensemble_model. A list with:

intercept

a vector of refitted intercepts for theKcalibrated models.

beta

a matrix of beta coefficients from theK calibrated models.

models

a list ofK models thatcan be used for prediction. These models are of class"lm" iffamily="gaussian" or"glm" iffamily="binomial".

family

type of regression, extracted fromstability. Possible values are"gaussian" or"binomial".

See Also

Other ensemble model functions:EnsemblePredictions()

Examples

# Linear regressionset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")ensemble <- Ensemble(stability = stab, xdata = simul$xdata, ydata = simul$ydata)# Logistic regressionset.seed(1)simul <- SimulateRegression(n = 200, pk = 20, family = "binomial")stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "binomial")ensemble <- Ensemble(stability = stab, xdata = simul$xdata, ydata = simul$ydata)

Predictions from ensemble model

Description

Makes predictions using an ensemble model created fromVariableSelection outputs. For each observation inxdata, the predictions are calculated as the average predicted valuesobtained for that observation over theK models fitted in calibratedstability selection.

Usage

EnsemblePredictions(ensemble, xdata, ...)

Arguments

ensemble

output ofEnsemble.

xdata

matrix of predictors with observations as rows and variables ascolumns.

...

additional parameters passed topredict.

Value

A matrix of predictions computed from the observations inxdata.

See Also

predict.variable_selection

Other ensemble model functions:Ensemble()

Examples

# Data simulationset.seed(1)simul <- SimulateRegression(n = 1000, pk = 50, family = "gaussian")# Training/test splitids <- Split(data = simul$ydata, tau = c(0.8, 0.2))stab <- VariableSelection(  xdata = simul$xdata[ids[[1]], ],  ydata = simul$ydata[ids[[1]], ])# Constructing the ensemble modelensemble <- Ensemble(  stability = stab,  xdata = simul$xdata[ids[[1]], ],  ydata = simul$ydata[ids[[1]], ])# Making predictionsyhat <- EnsemblePredictions(  ensemble = ensemble,  xdata = simul$xdata[ids[[2]], ])# Calculating Q-squaredcor(simul$ydata[ids[[2]], ], yhat)^2

Prediction performance in regression

Description

Calculates model performance for linear (measured by Q-squared), logistic(AUC) or Cox (C-statistic) regression. This is done by (i) refitting themodel on a training set including a proportiontau of theobservations, and (ii) evaluating the performance on the remainingobservations (test set). For more reliable results, the procedure can berepeatedK times (defaultK=1).

Usage

ExplanatoryPerformance(  xdata,  ydata,  new_xdata = NULL,  new_ydata = NULL,  stability = NULL,  family = NULL,  implementation = NULL,  prediction = NULL,  resampling = "subsampling",  K = 1,  tau = 0.8,  seed = 1,  n_thr = NULL,  time = 1000,  verbose = FALSE,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

new_xdata

optional test set (predictor data).

new_ydata

optional test set (outcome data).

stability

output ofVariableSelection. Ifstability=NULL (the default), a model including all variables inxdata as predictors is fitted. Argumentfamily must beprovided in this case.

family

type of regression model. Possible values include"gaussian" (linear regression),"binomial" (logisticregression), and"cox" (survival analysis). If provided, thisargument must be consistent with inputstability.

implementation

optional function to refit the model. Ifimplementation=NULL andstability is the output ofVariableSelection,lm (linearregression),coxph (Cox regression),glm (logistic regression), ormultinom (multinomial regression) is used.

prediction

optional function to compute predicted values from themodel refitted withimplementation.

resampling

resampling approach to create the training set. The defaultis"subsampling" for sampling without replacement of a proportiontau of the observations. Alternatively, this argument can be afunction to use for resampling. This function must use arguments nameddata andtau and return the IDs of observations to beincluded in the resampled dataset.

K

number of training-test splits. Only used ifnew_xdata andnew_ydata are not provided.

tau

proportion of observations used in the training set. Only used ifnew_xdata andnew_ydata are not provided.

seed

value of the seed to ensure reproducibility of the results. Onlyused ifnew_xdata andnew_ydata are not provided.

n_thr

number of thresholds to use to construct the ROC curve. Ifn_thr=NULL, all predicted probability values are iteratively used asthresholds. For faster computations on large data, less thresholds can beused. Only applicable to logistic regression.

time

numeric indicating the time for which the survival probabilitiesare computed. Only applicable to Cox regression.

verbose

logical indicating if a loading bar and messages should beprinted.

...

additional parameters passed to the function provided inresampling.

Details

For a fair evaluation of the prediction performance, the data issplit into a training set (including a proportiontau of theobservations) and test set (remaining observations). The regression modelis fitted on the training set and applied on the test set. Performancemetrics are computed in the test set by comparing predicted and observedoutcomes.

For logistic regression, a Receiver Operating Characteristic (ROC) analysisis performed: the True and False Positive Rates (TPR and FPR), and AreaUnder the Curve (AUC) are computed for different thresholds in predictedprobabilities.

For Cox regression, the Concordance Index (as implemented inconcordance) looking at survival probabilities upto a specifictime is computed.

For linear regression, the squared correlation between predicted andobserved outcome in the test set (Q-squared) is reported.

Value

A list with:

TPR

True Positive Rate (for logistic regressiononly).

FPR

False Positive Rate (for logistic regression only).

AUC

Area Under the Curve (for logistic regression only).

concordance

Concordance index (for Cox regression only).

Beta

matrix of estimated beta coefficients across theKiterations. Coefficients are extracted using thecoeffunction.

See Also

VariableSelection,Refit

Other prediction performance functions:Incremental()

Examples

# Data simulationset.seed(1)simul <- SimulateRegression(  n = 1000, pk = 20,  family = "binomial", ev_xy = 0.8)# Data split: selection, training and test setids <- Split(  data = simul$ydata,  family = "binomial",  tau = c(0.4, 0.3, 0.3))xselect <- simul$xdata[ids[[1]], ]yselect <- simul$ydata[ids[[1]], ]xtrain <- simul$xdata[ids[[2]], ]ytrain <- simul$ydata[ids[[2]], ]xtest <- simul$xdata[ids[[3]], ]ytest <- simul$ydata[ids[[3]], ]# Stability selectionstab <- VariableSelection(  xdata = xselect,  ydata = yselect,  family = "binomial")# Performances in test set of model refitted in training setroc <- ExplanatoryPerformance(  xdata = xtrain, ydata = ytrain,  new_xdata = xtest, new_ydata = ytest,  stability = stab)plot(roc)roc$AUC# Alternative with multiple training/test splitsroc <- ExplanatoryPerformance(  xdata = rbind(xtrain, xtest),  ydata = c(ytrain, ytest),  stability = stab, K = 100)plot(roc)boxplot(roc$AUC)# Partial Least Squares Discriminant Analysisif (requireNamespace("sgPLS", quietly = TRUE)) {  stab <- VariableSelection(    xdata = xselect,    ydata = yselect,    implementation = SparsePLS,    family = "binomial"  )  # Defining wrapping functions for predictions from PLS-DA  PLSDA <- function(xdata, ydata, family = "binomial") {    model <- mixOmics::plsda(X = xdata, Y = as.factor(ydata), ncomp = 1)    return(model)  }  PredictPLSDA <- function(xdata, model) {    xdata <- xdata[, rownames(model$loadings$X), drop = FALSE]    predicted <- predict(object = model, newdata = xdata)$predict[, 2, 1]    return(predicted)  }  # Performances with custom models  roc <- ExplanatoryPerformance(    xdata = rbind(xtrain, xtest),    ydata = c(ytrain, ytest),    stability = stab, K = 100,    implementation = PLSDA, prediction = PredictPLSDA  )  plot(roc)}

False Discovery Proportion

Description

Computes the False Discovery Proportion (upper-bound) as a ratio of the PFER(upper-bound) over the number of stably selected features. In stabilityselection, the FDP corresponds to the expected proportion of stably selectedfeatures that are not relevant to the outcome (i.e. proportion of FalsePositives among stably selected features).

Usage

FDP(selprop, PFER, pi)

Arguments

selprop

matrix or vector of selection proportions.

PFER

Per Family Error Rate.

pi

threshold in selection proportions.

Value

The estimated upper-bound in FDP.

See Also

Other stability metric functions:ConsensusScore(),PFER(),StabilityMetrics(),StabilityScore()

Examples

# Simulating set of selection proportionsselprop <- round(runif(n = 20), digits = 2)# Computing the FDP with a threshold of 0.8fdp <- FDP(PFER = 3, selprop = selprop, pi = 0.8)

Splitting observations into folds

Description

Generates a list ofn_folds non-overlapping sets of observation IDs(folds).

Usage

Folds(data, family = NULL, n_folds = 5)

Arguments

data

vector or matrix of data. In regression, this should be theoutcome data.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

n_folds

number of folds.

Details

For categorical outcomes (i.e.family argument is set to"binomial","multinomial" or"cox"), the split is donesuch that the proportion of observations from each of the categories ineach of the folds is representative of that of the full sample.

Value

A list of lengthn_folds with sets of non-overlappingobservation IDs.

Examples

# Splitting into 5 foldssimul <- SimulateRegression()ids <- Folds(data = simul$ydata)lapply(ids, length)# Balanced folds with respect to a binary variablesimul <- SimulateRegression(family = "binomial")ids <- Folds(data = simul$ydata, family = "binomial")lapply(ids, FUN = function(x) {  table(simul$ydata[x, ])})

Model-based clustering

Description

Runs clustering with Gaussian Mixture Models (GMM) using implementation fromMclust. This function is not using stability.

Usage

GMMClustering(xdata, nc = NULL, ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

...

additional parameters passed toMclust.

Value

A list with:

comembership

an array of binary and symmetricco-membership matrices.

weights

a matrix of median weights byfeature.

See Also

Other clustering algorithms:DBSCANClustering(),HierarchicalClustering(),KMeansClustering(),PAMClustering()

Examples

# Data simulationset.seed(1)simul <- SimulateClustering(n = c(10, 10), pk = 50)# Clustering using Gaussian Mixture Modelsmygmm <- GMMClustering(xdata = simul$data, nc = seq_len(30))

Graph visualisation

Description

Produces anigraph object from anadjacency matrix.

Usage

Graph(  adjacency,  node_label = NULL,  node_colour = NULL,  node_shape = NULL,  edge_colour = "grey60",  label_colour = "grey20",  mode = "undirected",  weighted = FALSE,  satellites = FALSE)

Arguments

adjacency

adjacency matrix or output ofGraphicalModel.

node_label

optional vector of node labels. This vector must contain asmany entries as there are rows/columns in the adjacency matrix and must bein the same order (the order is used to assign labels to nodes).

node_colour

optional vector of node colours. This vector must containas many entries as there are rows/columns in the adjacency matrix and mustbe in the same order (the order is used to assign colours to nodes).Integers, named colours or RGB values can be used.

node_shape

optional vector of node shapes. This vector must contain asmany entries as there are rows/columns in the adjacency matrix and must bein the same order (the order is used to assign shapes to nodes). Possiblevalues are"circle","square","triangle" or"star".

edge_colour

optional character string for edge colour. Integers, namedcolours or RGB values can be used.

label_colour

optional character string for label colour. Integers,named colours or RGB values can be used.

mode

character string indicating how the adjacency matrix should beinterpreted. Possible values include"undirected" or"directed" (seegraph_from_adjacency_matrix).

weighted

indicating if entries of the adjacency matrix should defineedge width. Ifweighted=FALSE, an unweighted igraph object iscreated, all edges have the same width. Ifweighted=TRUE, edge widthis defined by the corresponding value in the adjacency matrix. Ifweighted=NULL, nodes are linked by as many edges as indicated in theadjacency matrix (integer values are needed).

satellites

logical indicating if unconnected nodes (satellites) shouldbe included in the igraph object.

Details

All functionalities implemented inigraph can be used on the output.These include cosmetic changes for the visualisation, but also varioustools for network analysis (including topological properties and communitydetection).

The R packagevisNetwork offersinteractive network visualisation tools. Anigraph object can easily be convertedto avisNetwork object (seeexample below).

For Cytoscape users, theRCy3 package can be usedto open the network in Cytoscape.

Value

An igraph object.

See Also

Adjacency,GraphicalModel,igraph manual,visNetwork manual,Cytoscape

Examples

## From adjacency matrix# Un-weightedadjacency <- SimulateAdjacency(pk = 20, topology = "scale-free")plot(Graph(adjacency))# Weightedadjacency <- adjacency * runif(prod(dim(adjacency)))adjacency <- adjacency + t(adjacency)plot(Graph(adjacency, weighted = TRUE))# Node colours and shapesplot(Graph(adjacency, weighted = TRUE, node_shape = "star", node_colour = "red"))## From stability selection outputs# Graphical modelset.seed(1)simul <- SimulateGraphical(pk = 20)stab <- GraphicalModel(xdata = simul$data)plot(Graph(stab))# Sparse PLSif (requireNamespace("sgPLS", quietly = TRUE)) {  set.seed(1)  simul <- SimulateRegression(n = 50, pk = c(5, 5, 5), family = "gaussian")  x <- simul$xdata  y <- simul$ydata  stab <- BiSelection(    xdata = simul$xdata, ydata = simul$ydata,    family = "gaussian", ncomp = 3,    LambdaX = seq_len(ncol(x) - 1),    implementation = SparsePLS  )  plot(Graph(stab))}## Tools from other packages# Applying some igraph functionalitiesadjacency <- SimulateAdjacency(pk = 20, topology = "scale-free")mygraph <- Graph(adjacency)igraph::degree(mygraph)igraph::betweenness(mygraph)igraph::shortest_paths(mygraph, from = 1, to = 2)igraph::walktrap.community(mygraph)# Interactive view using visNetworkif (requireNamespace("visNetwork", quietly = TRUE)) {  vgraph <- mygraph  igraph::V(vgraph)$shape <- rep("dot", length(igraph::V(vgraph)))  v <- visNetwork::visIgraph(vgraph)  mylayout <- as.matrix(v$x$nodes[, c("x", "y")])  mylayout[, 2] <- -mylayout[, 2]  plot(mygraph, layout = mylayout)}# Opening in Cytoscape using RCy3if (requireNamespace("RCy3", quietly = TRUE)) {  # Make sure that Cytoscape is open before running the following line  # RCy3::createNetworkFromIgraph(mygraph)}

Edge-wise comparison of two graphs

Description

Generates anigraph object representingthe common and graph-specific edges.

Usage

GraphComparison(  graph1,  graph2,  col = c("tomato", "forestgreen", "navy"),  lty = c(2, 3, 1),  node_colour = NULL,  show_labels = TRUE,  ...)

Arguments

graph1

first graph. Possible inputs are: adjacency matrix, origraph object, or output ofGraphicalModel,VariableSelection,BiSelection, or output ofSimulateGraphical,SimulateRegression.

graph2

second graph.

col

vector of edge colours. The first entry of the vector definesthe colour of edges ingraph1 only, second entry is for edges ingraph2 only and third entry is for common edges.

lty

vector of line types for edges. The order is defined as forargumentcol.

node_colour

optional vector of node colours. This vector must containas many entries as there are rows/columns in the adjacency matrix and mustbe in the same order (the order is used to assign colours to nodes).Integers, named colours or RGB values can be used.

show_labels

logical indicating if the node labels should be displayed.

...

additional arguments to be passed toGraph.

Value

An igraph object.

See Also

SelectionPerformanceGraph

Examples

# Data simulationset.seed(1)simul1 <- SimulateGraphical(pk = 30)set.seed(2)simul2 <- SimulateGraphical(pk = 30)# Edge-wise comparison of the two graphsmygraph <- GraphComparison(  graph1 = simul1,  graph2 = simul2)plot(mygraph, layout = igraph::layout_with_kk(mygraph))

Graphical model algorithm

Description

Runs the algorithm specified in the argumentimplementation andreturns the estimated adjacency matrix. This function is not using stability.

Usage

GraphicalAlgo(  xdata,  pk = NULL,  Lambda,  Sequential_template = NULL,  scale = TRUE,  implementation = PenalisedGraphical,  start = "cold",  ...)

Arguments

xdata

matrix with observations as rows and variables as columns.

pk

optional vector encoding the grouping structure. Only used formulti-block stability selection wherepk indicates the number ofvariables in each group. Ifpk=NULL, single-block stabilityselection is performed.

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.IfLambda=NULL andimplementation=PenalisedGraphical,LambdaGridGraphical is used to define a relevant grid.Lambda can be provided as a vector or a matrix withlength(pk) columns.

Sequential_template

logical matrix encoding the type of procedure touse for data with multiple blocks in stability selection graphicalmodelling. For multi-block estimation, the stability selection model isconstructed as the union of block-specific stable edges estimated while theothers are weakly penalised (TRUE only for the block currently beingcalibrated andFALSE for other blocks). Other approaches with jointcalibration of the blocks are allowed (all entries are set toTRUE).

scale

logical indicating if the correlation (scale=TRUE) orcovariance (scale=FALSE) matrix should be used as input ofglassoFast ifimplementation=PenalisedGraphical. Otherwise, this argument must beused in the function provided inimplementation.

implementation

function to use for graphical modelling. Ifimplementation=PenalisedGraphical, the algorithm implemented inglassoFast is used for regularised estimation ofa conditional independence graph. Alternatively, a user-defined functioncan be provided.

start

character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters (start="warm") or not (start="cold"). Usingstart="warm" can speed-up the computations, but could lead toconvergence issues (in particular with smallLambda_cardinal). Onlyused forimplementation=PenalisedGraphical (see argument"start" inglassoFast).

...

additional parameters passed to the function provided inimplementation.

Details

The use of the procedure from Equation (4) or (5) is controlled bythe argument "Sequential_template".

Value

An array with binary and symmetric adjacency matrices along the thirddimension.

See Also

GraphicalModel,PenalisedGraphical

Other wrapping functions:SelectionAlgo()

Examples

# Data simulationset.seed(1)simul <- SimulateGraphical()# Running graphical LASSOmyglasso <- GraphicalAlgo(  xdata = simul$data,  Lambda = cbind(c(0.1, 0.2)))

Stability selection graphical model

Description

Performs stability selection for graphical models. The underlying graphicalmodel (e.g. graphical LASSO) is run with different combinations of parameterscontrolling the sparsity (e.g. penalty parameter) and thresholds in selectionproportions. These two hyper-parameters are jointly calibrated bymaximisation of the stability score.

Usage

GraphicalModel(  xdata,  pk = NULL,  Lambda = NULL,  lambda_other_blocks = 0.1,  pi_list = seq(0.01, 0.99, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = NULL,  implementation = PenalisedGraphical,  start = "warm",  scale = TRUE,  resampling = "subsampling",  cpss = FALSE,  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  Lambda_cardinal = 50,  lambda_max = NULL,  lambda_path_factor = 0.001,  max_density = 0.5,  optimisation = c("grid_search", "nloptr"),  n_cores = 1,  output_data = FALSE,  verbose = TRUE,  beep = NULL,  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.For multi-block stability selection, the variables in data have to beordered by group.

pk

optional vector encoding the grouping structure. Only used formulti-block stability selection wherepk indicates the number ofvariables in each group. Ifpk=NULL, single-block stabilityselection is performed.

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.IfLambda=NULL andimplementation=PenalisedGraphical,LambdaGridGraphical is used to define a relevant grid.Lambda can be provided as a vector or a matrix withlength(pk) columns.

lambda_other_blocks

optional vector of parameters controlling thelevel of sparsity in neighbour blocks for the multi-block procedure. To usejointly a specific set of parameters for each block,lambda_other_blocks must be set toNULL (not recommended).Only used for multi-block stability selection, i.e. iflength(pk)>1.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

implementation

function to use for graphical modelling. Ifimplementation=PenalisedGraphical, the algorithm implemented inglassoFast is used for regularised estimation ofa conditional independence graph. Alternatively, a user-defined functioncan be provided.

start

character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters (start="warm") or not (start="cold"). Usingstart="warm" can speed-up the computations, but could lead toconvergence issues (in particular with smallLambda_cardinal). Onlyused forimplementation=PenalisedGraphical (see argument"start" inglassoFast).

scale

logical indicating if the correlation (scale=TRUE) orcovariance (scale=FALSE) matrix should be used as input ofglassoFast ifimplementation=PenalisedGraphical. Otherwise, this argument must beused in the function provided inimplementation.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

cpss

logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is splitK/2 times (K models are fitted). Only used ifPFER_method="MB".

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

Lambda_cardinal

number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used ifLambda=NULL.

lambda_max

optional maximum value for the grid in penalty parameters.Iflambda_max=NULL, the maximum value is set to the maximumcovariance in absolute value. Only used ifimplementation=PenalisedGraphical andLambda=NULL.

lambda_path_factor

multiplicative factor used to define the minimumvalue in the grid.

max_density

threshold on the density. The grid is defined such thatthe density of the estimated graph does not exceed max_density.

optimisation

character string indicating the type of optimisationmethod. Withoptimisation="grid_search" (the default), all values inLambda are visited. Alternatively, optimisation algorithmsimplemented innloptr can be used withoptimisation="nloptr". By default, we use"algorithm"="NLOPT_GN_DIRECT_L","xtol_abs"=0.1,"ftol_abs"=0.1 and"maxeval"=Lambda_cardinal. These valuescan be changed by providing the argumentopts (seenloptr). For stability selection using penalisedregression,optimisation="grid_search" may be faster as it allowsfor warm start.

n_cores

number of cores to use for parallel computing (see argumentworkers inmultisession). Usingn_cores>1 is only supported withoptimisation="grid_search".

output_data

logical indicating if the input datasetsxdata andydata should be included in the output.

verbose

logical indicating if a loading bar and messages should beprinted.

beep

sound indicating the end of the run. Possible values are:NULL (no sound) or an integer between 1 and 11 (see argumentsound inbeep).

...

additional parameters passed to the functions provided inimplementation orresampling.

Details

In stability selection, a feature selection algorithm is fitted onK subsamples (or bootstrap samples) of the data with differentparameters controlling the sparsity (Lambda). For a given (set of)sparsity parameter(s), the proportion out of theK models in whicheach feature is selected is calculated. Features with selection proportionsabove a threshold pi are considered stably selected. The stabilityselection model is controlled by the sparsity parameter(s) for theunderlying algorithm, and the threshold in selection proportion:

V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}

These parameters can be calibrated by maximisation of a stability score(seeConsensusScore ifn_cat=NULL orStabilityScore otherwise) calculated under the nullhypothesis of equiprobability of selection.

It is strongly recommended to examine the calibration plot carefully tocheck that the grids of parametersLambda andpi_list do notrestrict the calibration to a region that would not include the globalmaximum (seeCalibrationPlot). In particular, the gridLambda may need to be extended when the maximum stability isobserved on the left or right edges of the calibration heatmap. In someinstances, multiple peaks of stability score can be observed. Simulationstudies suggest that the peak corresponding to the largest number ofselected features tend to give better selection performances. This is notnecessarily the highest peak (which is automatically retained by thefunctions in this package). The user can decide to manually choose anotherpeak.

To control the expected number of False Positives (Per Family Error Rate)in the results, a thresholdPFER_thr can be specified. Theoptimisation problem is then constrained to sets of parameters thatgenerate models with an upper-bound in PFER belowPFER_thr (seeMeinshausen and Bühlmann (2010) and Shah and Samworth (2013)).

Possible resampling procedures include defining (i)K subsamples ofa proportiontau of the observations, (ii)K bootstrap sampleswith the full sample size (obtained with replacement), and (iii)K/2splits of the data in half for complementary pair stability selection (seeargumentsresampling andcpss). In complementary pairstability selection, a feature is considered selected at a given resamplingiteration if it is selected in the two complementary subsamples.

To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.

For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession. Alternatively,the function can be run manually with differentseeds and all otherparameters equal. The results can then be combined usingCombine.

The generated network can be converted intoigraph object usingGraph. The R packagevisNetwork can be used forinteractive network visualisation (see examples inGraph).

Value

An object of classgraphical_model. A list with:

S

amatrix of the best stability scores for different (sets of) parameterscontrolling the level of sparsity in the underlying algorithm.

Lambda

a matrix of parameters controlling the level of sparsity inthe underlying algorithm.

Q

a matrix of the average number ofselected features by the underlying algorithm with different parameterscontrolling the level of sparsity.

Q_s

a matrix of the calibratednumber of stably selected features with different parameters controllingthe level of sparsity.

P

a matrix of calibrated thresholds inselection proportions for different parameters controlling the level ofsparsity in the underlying algorithm.

PFER

a matrix of upper-boundsin PFER of calibrated stability selection models with different parameterscontrolling the level of sparsity.

FDP

a matrix of upper-bounds inFDP of calibrated stability selection models with different parameterscontrolling the level of sparsity.

S_2d

a matrix of stabilityscores obtained with different combinations of parameters. Columnscorrespond to different thresholds in selection proportions.

PFER_2d

a matrix of upper-bounds in FDP obtained with differentcombinations of parameters. Columns correspond to different thresholds inselection proportions. Only returned iflength(pk)=1.

FDP_2d

a matrix of upper-bounds in PFER obtained with differentcombinations of parameters. Columns correspond to different thresholds inselection proportions. Only returned iflength(pk)=1.

selprop

an array of selection proportions. Rows and columnscorrespond to nodes in the graph. Indices along the third dimensioncorrespond to different parameters controlling the level of sparsity in theunderlying algorithm.

sign

a matrix of signs of Pearson'scorrelations estimated fromxdata.

method

a list withtype="graphical_model" and values used for argumentsimplementation,start,resampling,cpss andPFER_method.

params

a list with values used for argumentsK,pi_list,tau,n_cat,pk,n(number of observations inxdata),PFER_thr,FDP_thr,seed,lambda_other_blocks, andSequential_template.

The rows ofS,Lambda,Q,Q_s,P,PFER,FDP,S_2d,PFER_2d andFDP_2d, andindices along the third dimension ofselprop are ordered in the sameway and correspond to parameter values stored inLambda. Formulti-block inference, the columns ofS,Lambda,Q,Q_s,P,PFER andFDP, and indices along thethird dimension ofS_2d correspond to the different blocks.

References

Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.

Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.

Friedman J, Hastie T, Tibshirani R (2008).“Sparse inverse covariance estimation with the graphical lasso.”Biostatistics,9(3), 432–441.

See Also

PenalisedGraphical,GraphicalAlgo,LambdaGridGraphical,Resample,StabilityScoreGraph,Adjacency,

Other stability functions:BiSelection(),Clustering(),StructuralModel(),VariableSelection()

Examples

oldpar <- par(no.readonly = TRUE)par(mar = rep(7, 4))## Single-block stability selection# Data simulationset.seed(1)simul <- SimulateGraphical(n = 100, pk = 20, nu_within = 0.1)# Stability selectionstab <- GraphicalModel(xdata = simul$data)print(stab)# Calibration heatmapCalibrationPlot(stab)# Visualisation of the resultssummary(stab)plot(stab)# Extraction of adjacency matrix or igraph objectAdjacency(stab)Graph(stab)## Multi-block stability selection# Data simulationset.seed(1)simul <- SimulateGraphical(pk = c(10, 10))# Stability selectionstab <- GraphicalModel(xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10)print(stab)# Calibration heatmap# par(mfrow = c(1, 3))CalibrationPlot(stab) # Producing three plots# Visualisation of the resultssummary(stab)plot(stab)# Multi-parameter stability selection (not recommended)Lambda <- matrix(c(0.8, 0.6, 0.3, 0.5, 0.4, 0.3, 0.7, 0.5, 0.1), ncol = 3)stab <- GraphicalModel(  xdata = simul$data, pk = c(10, 10),  Lambda = Lambda, lambda_other_blocks = NULL)stab$Lambda## Example with user-defined function: shrinkage estimation and selection# Data simulationset.seed(1)simul <- SimulateGraphical(n = 100, pk = 20, nu_within = 0.1)if (requireNamespace("corpcor", quietly = TRUE)) {  # Writing user-defined algorithm in a portable function  ShrinkageSelection <- function(xdata, Lambda, ...) {    mypcor <- corpcor::pcor.shrink(xdata, verbose = FALSE)    adjacency <- array(NA, dim = c(nrow(mypcor), ncol(mypcor), nrow(Lambda)))    for (k in seq_len(nrow(Lambda))) {      A <- ifelse(abs(mypcor) >= Lambda[k, 1], yes = 1, no = 0)      diag(A) <- 0      adjacency[, , k] <- A    }    return(list(adjacency = adjacency))  }  # Running the algorithm without stability  myglasso <- GraphicalAlgo(    xdata = simul$data,    Lambda = matrix(c(0.05, 0.1), ncol = 1), implementation = ShrinkageSelection  )  # Stability selection using shrinkage estimation and selection  stab <- GraphicalModel(    xdata = simul$data, Lambda = matrix(c(0.01, 0.05, 0.1), ncol = 1),    implementation = ShrinkageSelection  )  CalibrationPlot(stab)  stable_adjacency <- Adjacency(stab)}par(oldpar)

Group Partial Least Squares

Description

Runs a group Partial Least Squares model using implementation fromsgPLS-package. This function is not using stability.

Usage

GroupPLS(  xdata,  ydata,  family = "gaussian",  group_x,  group_y = NULL,  Lambda,  keepX_previous = NULL,  keepY = NULL,  ncomp = 1,  scale = TRUE,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

family

type of PLS model. Iffamily="gaussian", a group PLSmodel as defined ingPLS is run (for continuousoutcomes). Iffamily="binomial", a PLS-DA model as defined ingPLSda is run (for categorical outcomes).

group_x

vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group.

group_y

optional vector encoding the grouping structure amongoutcomes. This argument indicates the number of variables in each group.

Lambda

matrix of parameters controlling the number of selected groupsat current component, as defined byncomp.

keepX_previous

number of selected groups in previous components. Onlyused ifncomp > 1. The argumentkeepX insgPLS is obtained by concatenatingkeepX_previous andLambda.

keepY

number of selected groups of outcome variables. This argument isdefined as insgPLS. Only used iffamily="gaussian".

ncomp

number of components.

scale

logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). Onlyused iffamily="gaussian".

...

additional arguments to be passed togPLS orgPLSda.

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors.

beta_full

array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors (startingwith "X") or outcomes (starting with "Y") variables for differentcomponents (denoted by "PC").

References

Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016).“Group and sparse group partial least square approaches applied in genomics context.”Bioinformatics,32(1), 35-42.ISSN 1367-4803,doi:10.1093/bioinformatics/btv535.

See Also

VariableSelection,BiSelection

Other penalised dimensionality reduction functions:SparseGroupPLS(),SparsePCA(),SparsePLS()

Examples

if (requireNamespace("sgPLS", quietly = TRUE)) {  ## Group PLS  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 100, pk = 50, q = 3, family = "gaussian")  x <- simul$xdata  y <- simul$ydata  # Running gPLS with 1 group and sparsity of 0.5  mypls <- GroupPLS(    xdata = x, ydata = y, Lambda = 1, family = "gaussian",    group_x = c(10, 15, 25),  )  # Running gPLS with groups on outcomes  mypls <- GroupPLS(    xdata = x, ydata = y, Lambda = 1, family = "gaussian",    group_x = c(10, 15, 25),    group_y = c(2, 1), keepY = 1  )}

(Weighted) hierarchical clustering

Description

Runs hierarchical clustering using implementation fromhclust. IfLambda is provided, clustering isapplied on the weighted distance matrix calculated using thecosa2 algorithm. Otherwise, distances are calculatedusingdist. This function is not using stability.

Usage

HierarchicalClustering(  xdata,  nc = NULL,  Lambda = NULL,  distance = "euclidean",  linkage = "complete",  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

Lambda

vector of penalty parameters (see argumentlambda incosa2). Unweighted distance matrices are used ifLambda=NULL.

distance

character string indicating the type of distance to use. IfLambda=NULL, possible values include"euclidean","maximum","canberra","binary", and"minkowski" (see argumentmethod indist). Otherwise, possible values include"euclidean" (pwr=2) or"absolute" (pwr=1) (seeargumentpwr incosa2).

linkage

character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude"complete","single" and"average" (seeargument"method" inhclust for a full list).Only used ifimplementation=HierarchicalClustering.

...

additional parameters passed tohclust,dist, orcosa2. Parametersniter (default to 1) andnoit (default to 100) correspond tothe number of iterations incosa2 to calculate weightsand may need to be modified. Argumentpwr incosa2 is ignored, please providedistanceinstead.

Value

A list with:

comembership

an array of binary and symmetricco-membership matrices.

weights

a matrix of median weights byfeature.

References

Kampert MM, Meulman JJ, Friedman JH (2017).“rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.”Journal of Classification,34(3), 514–547.doi:10.1007/s00357-017-9240-z.

Friedman JH, Meulman JJ (2004).“Clustering objects on subsets of attributes (with discussion).”Journal of the Royal Statistical Society: Series B (Statistical Methodology),66(4), 815-849.doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x,https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.

See Also

Other clustering algorithms:DBSCANClustering(),GMMClustering(),KMeansClustering(),PAMClustering()

Examples

# Data simulationset.seed(1)simul <- SimulateClustering(n = c(10, 10), pk = 50)# Hierarchical clusteringmyhclust <- HierarchicalClustering(  xdata = simul$data,  nc = seq_len(20))# Weighted Hierarchical clustering (using COSA)if (requireNamespace("rCOSA", quietly = TRUE)) {  myhclust <- HierarchicalClustering(    xdata = simul$data,    weighted = TRUE,    nc = seq_len(20),    Lambda = c(0.2, 0.5)  )}

Incremental prediction performance in regression

Description

Computes the prediction performance of regression models where predictors aresequentially added by order of decreasing selection proportion. This functioncan be used to evaluate the marginal contribution of each of the selectedpredictors over and above more stable predictors. Performances are evaluatedas inExplanatoryPerformance.

Usage

Incremental(  xdata,  ydata,  new_xdata = NULL,  new_ydata = NULL,  stability = NULL,  family = NULL,  implementation = NULL,  prediction = NULL,  resampling = "subsampling",  n_predictors = NULL,  K = 100,  tau = 0.8,  seed = 1,  n_thr = NULL,  time = 1000,  verbose = TRUE,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

new_xdata

optional test set (predictor data).

new_ydata

optional test set (outcome data).

stability

output ofVariableSelection. Ifstability=NULL (the default), a model including all variables inxdata as predictors is fitted. Argumentfamily must beprovided in this case.

family

type of regression model. Possible values include"gaussian" (linear regression),"binomial" (logisticregression), and"cox" (survival analysis). If provided, thisargument must be consistent with inputstability.

implementation

optional function to refit the model. Ifimplementation=NULL andstability is the output ofVariableSelection,lm (linearregression),coxph (Cox regression),glm (logistic regression), ormultinom (multinomial regression) is used.

prediction

optional function to compute predicted values from themodel refitted withimplementation.

resampling

resampling approach to create the training set. The defaultis"subsampling" for sampling without replacement of a proportiontau of the observations. Alternatively, this argument can be afunction to use for resampling. This function must use arguments nameddata andtau and return the IDs of observations to beincluded in the resampled dataset.

n_predictors

number of predictors to consider.

K

number of training-test splits. Only used ifnew_xdata andnew_ydata are not provided.

tau

proportion of observations used in the training set. Only used ifnew_xdata andnew_ydata are not provided.

seed

value of the seed to ensure reproducibility of the results. Onlyused ifnew_xdata andnew_ydata are not provided.

n_thr

number of thresholds to use to construct the ROC curve. Ifn_thr=NULL, all predicted probability values are iteratively used asthresholds. For faster computations on large data, less thresholds can beused. Only applicable to logistic regression.

time

numeric indicating the time for which the survival probabilitiesare computed. Only applicable to Cox regression.

verbose

logical indicating if a loading bar and messages should beprinted.

...

additional parameters passed to the function provided inresampling.

Value

An object of classincremental.

For logistic regression, a list with:

FPR

A list with, for each ofthe models (sequentially added predictors), the False Positive Rates fordifferent thresholds (columns) and different data splits (rows).

TPR

A list with, for each of the models (sequentially addedpredictors), the True Positive Rates for different thresholds (columns) anddifferent data splits (rows).

AUC

A list with, for each of themodels (sequentially added predictors), a vector of Area Under the Curve(AUC) values obtained with different data splits.

Beta

Estimatedregression coefficients from visited models.

names

Names of thepredictors by order of inclusion.

stable

Binary vector indicatingwhich predictors are stably selected. Only returned ifstability isprovided.

For Cox regression, a list with:

concordance

A list with, for eachof the models (sequentially added predictors), a vector of concordanceindices obtained with different data splits.

Beta

Estimatedregression coefficients from visited models.

names

Names of thepredictors by order of inclusion.

stable

Binary vector indicatingwhich predictors are stably selected. Only returned ifstability isprovided.

For linear regression, a list with:

Q_squared

A list with, for eachof the models (sequentially added predictors), a vector of Q-squaredobtained with different data splits.

Beta

Estimated regressioncoefficients from visited models.

names

Names of the predictors byorder of inclusion.

stable

Binary vector indicating whichpredictors are stably selected. Only returned ifstability isprovided.

See Also

VariableSelection,Refit

Other prediction performance functions:ExplanatoryPerformance()

Examples

# Data simulationset.seed(1)simul <- SimulateRegression(  n = 1000, pk = 20,  family = "binomial", ev_xy = 0.8)# Data split: selection, training and test setids <- Split(  data = simul$ydata,  family = "binomial",  tau = c(0.4, 0.3, 0.3))xselect <- simul$xdata[ids[[1]], ]yselect <- simul$ydata[ids[[1]], ]xtrain <- simul$xdata[ids[[2]], ]ytrain <- simul$ydata[ids[[2]], ]xtest <- simul$xdata[ids[[3]], ]ytest <- simul$ydata[ids[[3]], ]# Stability selectionstab <- VariableSelection(  xdata = xselect,  ydata = yselect,  family = "binomial")# Performances in test set of model refitted in training setincr <- Incremental(  xdata = xtrain, ydata = ytrain,  new_xdata = xtest, new_ydata = ytest,  stability = stab, n_predictors = 10)plot(incr)# Alternative with multiple training/test splitsincr <- Incremental(  xdata = rbind(xtrain, xtest),  ydata = c(ytrain, ytest),  stability = stab, K = 10, n_predictors = 10)plot(incr)

(Sparse) K-means clustering

Description

Runs k-means clustering using implementation fromkmeans. This function is not using stability.

Usage

KMeansClustering(xdata, nc = NULL, Lambda = NULL, ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

Lambda

vector of penalty parameters (see argumentwbounds inKMeansSparseCluster).

...

additional parameters passed tokmeans (ifLambda isNULL) orKMeansSparseCluster.

Value

A list with:

comembership

an array of binary and symmetricco-membership matrices.

weights

a matrix of median weights byfeature.

References

Witten DM, Tibshirani R (2010).“A Framework for Feature Selection in Clustering.”Journal of the American Statistical Association,105(490), 713-726.doi:10.1198/jasa.2010.tm09415, PMID: 20811510.

See Also

Other clustering algorithms:DBSCANClustering(),GMMClustering(),HierarchicalClustering(),PAMClustering()

Examples

# Data simulationset.seed(1)simul <- SimulateClustering(n = c(10, 10), pk = 50)# K means clusteringmykmeans <- KMeansClustering(xdata = simul$data, nc = seq_len(20))# Sparse K means clusteringif (requireNamespace("sparcl", quietly = TRUE)) {  mykmeans <- KMeansClustering(    xdata = simul$data, nc = seq_len(20),    Lambda = c(2, 5)  )}

Grid of penalty parameters (graphical model)

Description

Generates a relevant grid of penalty parameter values for penalised graphicalmodels.

Usage

LambdaGridGraphical(  xdata,  pk = NULL,  lambda_other_blocks = 0.1,  K = 100,  tau = 0.5,  n_cat = 3,  implementation = PenalisedGraphical,  start = "cold",  scale = TRUE,  resampling = "subsampling",  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  Lambda_cardinal = 50,  lambda_max = NULL,  lambda_path_factor = 0.001,  max_density = 0.5,  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.For multi-block stability selection, the variables in data have to beordered by group.

pk

optional vector encoding the grouping structure. Only used formulti-block stability selection wherepk indicates the number ofvariables in each group. Ifpk=NULL, single-block stabilityselection is performed.

lambda_other_blocks

optional vector of parameters controlling thelevel of sparsity in neighbour blocks for the multi-block procedure. To usejointly a specific set of parameters for each block,lambda_other_blocks must be set toNULL (not recommended).Only used for multi-block stability selection, i.e. iflength(pk)>1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

implementation

function to use for graphical modelling. Ifimplementation=PenalisedGraphical, the algorithm implemented inglassoFast is used for regularised estimation ofa conditional independence graph. Alternatively, a user-defined functioncan be provided.

start

character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters (start="warm") or not (start="cold"). Usingstart="warm" can speed-up the computations, but could lead toconvergence issues (in particular with smallLambda_cardinal). Onlyused forimplementation=PenalisedGraphical (see argument"start" inglassoFast).

scale

logical indicating if the correlation (scale=TRUE) orcovariance (scale=FALSE) matrix should be used as input ofglassoFast ifimplementation=PenalisedGraphical. Otherwise, this argument must beused in the function provided inimplementation.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

Lambda_cardinal

number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm.

lambda_max

optional maximum value for the grid in penalty parameters.Iflambda_max=NULL, the maximum value is set to the maximumcovariance in absolute value. Only used ifimplementation=PenalisedGraphical.

lambda_path_factor

multiplicative factor used to define the minimumvalue in the grid.

max_density

threshold on the density. The grid is defined such thatthe density of the estimated graph does not exceed max_density.

...

additional parameters passed to the functions provided inimplementation orresampling.

Value

A matrix of lambda values withlength(pk) columns andLambda_cardinal rows.

See Also

Other lambda grid functions:LambdaGridRegression(),LambdaSequence()

Examples

# Single-block simulationset.seed(1)simul <- SimulateGraphical()# Generating grid of 10 valuesLambda <- LambdaGridGraphical(xdata = simul$data, Lambda_cardinal = 10)# Ensuring PFER < 5Lambda <- LambdaGridGraphical(xdata = simul$data, Lambda_cardinal = 10, PFER_thr = 5)# Multi-block simulationset.seed(1)simul <- SimulateGraphical(pk = c(10, 10))# Multi-block gridLambda <- LambdaGridGraphical(xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10)# Denser neighbouring blocksLambda <- LambdaGridGraphical(  xdata = simul$data, pk = c(10, 10),  Lambda_cardinal = 10, lambda_other_blocks = 0)# Using different neighbour penaltiesLambda <- LambdaGridGraphical(  xdata = simul$data, pk = c(10, 10),  Lambda_cardinal = 10, lambda_other_blocks = c(0.1, 0, 0.1))stab <- GraphicalModel(  xdata = simul$data, pk = c(10, 10),  Lambda = Lambda, lambda_other_blocks = c(0.1, 0, 0.1))stab$Lambda# Visiting from empty to full graphs with max_density=1Lambda <- LambdaGridGraphical(  xdata = simul$data, pk = c(10, 10),  Lambda_cardinal = 10, max_density = 1)bigblocks <- BlockMatrix(pk = c(10, 10))bigblocks_vect <- bigblocks[upper.tri(bigblocks)]N_blocks <- unname(table(bigblocks_vect))N_blocks # max number of edges per blockstab <- GraphicalModel(xdata = simul$data, pk = c(10, 10), Lambda = Lambda)apply(stab$Q, 2, max, na.rm = TRUE) # max average number of edges from underlying algo

Grid of penalty parameters (regression model)

Description

Generates a relevant grid of penalty parameter values for penalisedregression using the implementation inglmnet.

Usage

LambdaGridRegression(  xdata,  ydata,  tau = 0.5,  seed = 1,  family = "gaussian",  resampling = "subsampling",  Lambda_cardinal = 100,  check_input = TRUE,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

Lambda_cardinal

number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm.

check_input

logical indicating if input values should be checked(recommended).

...

additional parameters passed to the functions provided inimplementation orresampling.

Value

A matrix of lambda values with one column and as many rows asindicated inLambda_cardinal.

See Also

Other lambda grid functions:LambdaGridGraphical(),LambdaSequence()

Examples

# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") # simulated data# Lambda grid for linear regressionLambda <- LambdaGridRegression(  xdata = simul$xdata, ydata = simul$ydata,  family = "gaussian", Lambda_cardinal = 20)# Grid can be used in VariableSelection()stab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata,  family = "gaussian", Lambda = Lambda)print(SelectedVariables(stab))

Sequence of penalty parameters

Description

Generates a sequence of penalty parameters from extreme values and therequired number of elements. The sequence is defined on the log-scale.

Usage

LambdaSequence(lmax, lmin, cardinal = 100)

Arguments

lmax

maximum value in the grid.

lmin

minimum value in the grid.

cardinal

number of values in the grid.

Value

A vector with values between "lmin" and "lmax" and as many values asindicated by "cardinal".

See Also

Other lambda grid functions:LambdaGridGraphical(),LambdaGridRegression()

Examples

# Grid from extreme valuesmygrid <- LambdaSequence(lmax = 0.7, lmin = 0.001, cardinal = 10)

Matrix from linear system outputs

Description

Returns a matrix from output ofPenalisedLinearSystem.

Usage

LinearSystemMatrix(vect, adjacency)

Arguments

vect

vector of coefficients to assign to entries of the matrix.

adjacency

binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined.

Value

An asymmetric matrix.

See Also

PenalisedLinearSystem


Transforms NA into NULL

Description

Returns a vector with no missing values or NULL if there are no non-missingvalues.

Usage

NAToNULL(x)

Arguments

x

input vector.

Value

A vector without missing values or NULL.


Matrix from OpenMx outputs

Description

Returns a matrix from output ofmxPenaltySearch.

Usage

OpenMxMatrix(vect, adjacency, residual_covariance = NULL)

Arguments

vect

vector of coefficients to assign to entries of the matrix.

adjacency

binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined.

residual_covariance

binary and symmetric matrix encoding the nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation). By default, this is the identity matrix (noresidual covariance).

Value

An asymmetric matrix.

See Also

PenalisedOpenMx,OpenMxModel


Writing OpenMx model (matrix specification)

Description

Returns matrix specification for use inmxModel from(i) the adjacency matrix of a Directed Acyclic Graph (asymmetric matrix A inReticular Action Model notation), and (ii) a binary matrix encoding nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation).

Usage

OpenMxModel(adjacency, residual_covariance = NULL, manifest = NULL)

Arguments

adjacency

binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined.

residual_covariance

binary and symmetric matrix encoding the nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation). By default, this is the identity matrix (noresidual covariance).

manifest

optional vector of manifest variable names.

Value

A list of RAM matrices that can be used inmxRun.

See Also

PenalisedOpenMx,OpenMxMatrix

Examples

if (requireNamespace("OpenMx", quietly = TRUE)) {  # Definition of simulated effects  pk <- c(3, 2, 3)  dag <- LayeredDAG(layers = pk)  theta <- dag  theta[2, 4] <- 0  theta[3, 7] <- 0  theta[4, 7] <- 0  # Data simulation  set.seed(1)  simul <- SimulateStructural(n = 500, v_between = 1, theta = theta, pk = pk)  # Writing RAM matrices for mxModel  ram_matrices <- OpenMxModel(adjacency = dag)  # Running unpenalised model  unpenalised <- OpenMx::mxRun(OpenMx::mxModel(    "Model",    OpenMx::mxData(simul$data, type = "raw"),    ram_matrices$Amat,    ram_matrices$Smat,    ram_matrices$Fmat,    ram_matrices$Mmat,    OpenMx::mxExpectationRAM("A", "S", "F", "M", dimnames = colnames(dag)),    OpenMx::mxFitFunctionML()  ), silent = TRUE, suppressWarnings = TRUE)  unpenalised$A$values  # Incorporating latent variables  ram_matrices <- OpenMxModel(    adjacency = dag,    manifest = paste0("x", seq_len(7))  )  ram_matrices$Fmat$values  # Running unpenalised model  unpenalised <- OpenMx::mxRun(OpenMx::mxModel(    "Model",    OpenMx::mxData(simul$data[, seq_len(7)], type = "raw"),    ram_matrices$Amat,    ram_matrices$Smat,    ram_matrices$Fmat,    ram_matrices$Mmat,    OpenMx::mxExpectationRAM("A", "S", "F", "M", dimnames = colnames(dag)),    OpenMx::mxFitFunctionML()  ), silent = TRUE, suppressWarnings = TRUE)  unpenalised$A$values}

(Weighted) Partitioning Around Medoids

Description

Runs Partitioning Around Medoids (PAM) clustering using implementation frompam. This is also known as the k-medoids algorithm. IfLambda is provided, clustering is applied on the weighted distancematrix calculated using the COSA algorithm as implemented incosa2. Otherwise, distances are calculated usingdist. This function is not using stability.

Usage

PAMClustering(xdata, nc = NULL, Lambda = NULL, distance = "euclidean", ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

Lambda

vector of penalty parameters (see argumentlambda incosa2). Unweighted distance matrices are used ifLambda=NULL.

distance

character string indicating the type of distance to use. IfLambda=NULL, possible values include"euclidean","maximum","canberra","binary", and"minkowski" (see argumentmethod indist). Otherwise, possible values include"euclidean" (pwr=2) or"absolute" (pwr=1) (seeargumentpwr incosa2).

...

additional parameters passed topam,dist, orcosa2. Ifweighted=TRUE, parametersniter (default to 1) andnoit (default to 100) correspond to the number of iterations incosa2 to calculate weights and may need to bemodified.

Value

A list with:

comembership

an array of binary and symmetricco-membership matrices.

weights

a matrix of median weights byfeature.

References

Kampert MM, Meulman JJ, Friedman JH (2017).“rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.”Journal of Classification,34(3), 514–547.doi:10.1007/s00357-017-9240-z.

Friedman JH, Meulman JJ (2004).“Clustering objects on subsets of attributes (with discussion).”Journal of the Royal Statistical Society: Series B (Statistical Methodology),66(4), 815-849.doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x,https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.

See Also

Other clustering algorithms:DBSCANClustering(),GMMClustering(),HierarchicalClustering(),KMeansClustering()

Examples

if (requireNamespace("cluster", quietly = TRUE)) {  # Data simulation  set.seed(1)  simul <- SimulateClustering(n = c(10, 10), pk = 50)  # PAM clustering  myclust <- PAMClustering(    xdata = simul$data,    nc = seq_len(20)  )  # Weighted PAM clustering (using COSA)  if (requireNamespace("rCOSA", quietly = TRUE)) {    myclust <- PAMClustering(      xdata = simul$data,      nc = seq_len(20),      Lambda = c(0.2, 0.5)    )  }}

Per Family Error Rate

Description

Computes the Per Family Error Rate upper-bound of a stability selection modelusing the methods proposed by Meinshausen and Bühlmann (2010) or Shah andSamworth (2013). In stability selection, the PFER corresponds to the expectednumber of stably selected features that are not relevant to the outcome (i.e.False Positives).

Usage

PFER(q, pi, N, K, PFER_method = "MB")

Arguments

q

average number of features selected by the underlying algorithm.

pi

threshold in selection proportions.

N

total number of features.

K

number of resampling iterations.

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

Value

The estimated upper-bound in PFER.

References

Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.

Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.

See Also

Other stability metric functions:ConsensusScore(),FDP(),StabilityMetrics(),StabilityScore()

Examples

# Computing PFER for 10/50 selected features and threshold of 0.8pfer_mb <- PFER(q = 10, pi = 0.8, N = 50, K = 100, PFER_method = "MB")pfer_ss <- PFER(q = 10, pi = 0.8, N = 50, K = 100, PFER_method = "SS")

Partial Least Squares 'a la carte'

Description

Runs a Partial Least Squares (PLS) model in regression mode using algorithmimplemented inpls. This function allows for theconstruction of components based on different sets of predictor and/oroutcome variables. This function is not using stability.

Usage

PLS(  xdata,  ydata,  selectedX = NULL,  selectedY = NULL,  family = "gaussian",  ncomp = NULL,  scale = TRUE)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

selectedX

binary matrix of size(ncol(xdata) * ncomp). Thebinary entries indicate which predictors (in rows) contribute to thedefinition of each component (in columns). IfselectedX=NULL, allpredictors are selected for all components.

selectedY

binary matrix of size(ncol(ydata) * ncomp). Thebinary entries indicate which outcomes (in rows) contribute to thedefinition of each component (in columns). IfselectedY=NULL, alloutcomes are selected for all components.

family

type of PLS model. Onlyfamily="gaussian" is supported.This corresponds to a PLS model as defined inpls(for continuous outcomes).

ncomp

number of components.

scale

logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one).

Details

All matrices are defined as in (Wold et al. 2001). The weight matrixWmat is the equivalent ofloadings$X inpls. The loadings matrixPmat is theequivalent ofmat.c inpls. The scorematricesTmat andQmat are the equivalent ofvariates$X andvariates$Y inpls.

Value

A list with:

Wmat

matrix of X-weights.

Wstar

matrix oftransformed X-weights.

Pmat

matrix of X-loadings.

Cmat

matrix of Y-weights.

Tmat

matrix of X-scores.

Umat

matrix of Y-scores.

Qmat

matrix needed forpredictions.

Rmat

matrix needed for predictions.

meansX

vector used for centering of predictors, needed forpredictions.

sigmaX

vector used for scaling of predictors, neededfor predictions.

meansY

vector used for centering of outcomes,needed for predictions.

sigmaY

vector used for scaling of outcomes,needed for predictions.

methods

a list withfamily andscale values used for the run.

params

a list withselectedX andselectedY values used for the run.

References

Wold S, Sjöström M, Eriksson L (2001).“PLS-regression: a basic tool of chemometrics.”Chemometrics and Intelligent Laboratory Systems,58(2), 109-130.ISSN 0169-7439,doi:10.1016/S0169-7439(01)00155-1, PLS Methods.

See Also

VariableSelection,BiSelection

Examples

if (requireNamespace("mixOmics", quietly = TRUE)) {  oldpar <- par(no.readonly = TRUE)  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 200, pk = 15, q = 3, family = "gaussian")  x <- simul$xdata  y <- simul$ydata  # PLS  mypls <- PLS(xdata = x, ydata = y, ncomp = 3)  if (requireNamespace("sgPLS", quietly = TRUE)) {    # Sparse PLS to identify relevant variables    stab <- BiSelection(      xdata = x, ydata = y,      family = "gaussian", ncomp = 3,      LambdaX = seq_len(ncol(x) - 1),      LambdaY = seq_len(ncol(y) - 1),      implementation = SparsePLS,      n_cat = 2    )    plot(stab)    # Refitting of PLS model    mypls <- PLS(      xdata = x, ydata = y,      selectedX = stab$selectedX,      selectedY = stab$selectedY    )    # Nonzero entries in weights are the same as in selectedX    par(mfrow = c(2, 2))    Heatmap(stab$selectedX,      legend = FALSE    )    title("Selected in X")    Heatmap(ifelse(mypls$Wmat != 0, yes = 1, no = 0),      legend = FALSE    )    title("Nonzero entries in Wmat")    Heatmap(stab$selectedY,      legend = FALSE    )    title("Selected in Y")    Heatmap(ifelse(mypls$Cmat != 0, yes = 1, no = 0),      legend = FALSE    )    title("Nonzero entries in Cmat")  }  # Multilevel PLS  # Generating random design  z <- rep(seq_len(50), each = 4)  # Extracting the within-variability  x_within <- mixOmics::withinVariation(X = x, design = cbind(z))  # Running PLS on within-variability  mypls <- PLS(xdata = x_within, ydata = y, ncomp = 3)  par(oldpar)}

Graphical LASSO

Description

Runs the graphical LASSO algorithm for estimation of a Gaussian GraphicalModel (GGM). This function is not using stability.

Usage

PenalisedGraphical(  xdata,  pk = NULL,  Lambda,  Sequential_template = NULL,  scale = TRUE,  start = "cold",  output_omega = FALSE,  ...)

Arguments

xdata

matrix with observations as rows and variables as columns.

pk

optional vector encoding the grouping structure. Only used formulti-block stability selection wherepk indicates the number ofvariables in each group. Ifpk=NULL, single-block stabilityselection is performed.

Lambda

matrix of parameters controlling the level of sparsity.

Sequential_template

logical matrix encoding the type of procedure touse for data with multiple blocks in stability selection graphicalmodelling. For multi-block estimation, the stability selection model isconstructed as the union of block-specific stable edges estimated while theothers are weakly penalised (TRUE only for the block currently beingcalibrated andFALSE for other blocks). Other approaches with jointcalibration of the blocks are allowed (all entries are set toTRUE).

scale

logical indicating if the correlation (scale=TRUE) orcovariance (scale=FALSE) matrix should be used as input ofglassoFast ifimplementation=PenalisedGraphical. Otherwise, this argument must beused in the function provided inimplementation.

start

character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters (start="warm") or not (start="cold"). Usingstart="warm" can speed-up the computations, but could lead toconvergence issues (in particular with smallLambda_cardinal). Onlyused forimplementation=PenalisedGraphical (see argument"start" inglassoFast).

output_omega

logical indicating if the estimated precision matricesshould be stored and returned.

...

additional parameters passed to the function provided inimplementation.

Details

The use of the procedure from Equation (4) or (5) is controlled bythe argument "Sequential_template".

Value

An array with binary and symmetric adjacency matrices along the thirddimension.

References

Friedman J, Hastie T, Tibshirani R (2008).“Sparse inverse covariance estimation with the graphical lasso.”Biostatistics,9(3), 432–441.

See Also

GraphicalModel

Other underlying algorithm functions:CART(),ClusteringAlgo(),PenalisedOpenMx(),PenalisedRegression()

Examples

# Data simulationset.seed(1)simul <- SimulateGraphical()# Running graphical LASSOmyglasso <- PenalisedGraphical(  xdata = simul$data,  Lambda = matrix(c(0.1, 0.2), ncol = 1))# Returning estimated precision matrixmyglasso <- PenalisedGraphical(  xdata = simul$data,  Lambda = matrix(c(0.1, 0.2), ncol = 1),  output_omega = TRUE)

Penalised Structural Equation Model

Description

Runs penalised Structural Equation Modelling using implementations fromOpenMx functions (forPenalisedOpenMx),or using series of penalised regressions withglmnet(forPenalisedLinearSystem). The functionPenalisedLinearSystem does not accommodate latent variables.These functions are not using stability.

Usage

PenalisedOpenMx(  xdata,  adjacency,  penalised = NULL,  residual_covariance = NULL,  Lambda,  ...)PenalisedLinearSystem(xdata, adjacency, penalised = NULL, Lambda = NULL, ...)

Arguments

xdata

matrix with observations as rows and variables as columns.Column names must be defined and in line with the row and column names ofadjacency.

adjacency

binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined.

penalised

optional binary matrix indicating which coefficients areregularised.

residual_covariance

binary and symmetric matrix encoding the nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation). By default, this is the identity matrix (noresidual covariance).

Lambda

matrix of parameters controlling the level of sparsity. Onlythe minimum, maximum and length are used inPenalisedOpenMx.

...

additional parameters passed toOpenMx functions (forPenalisedOpenMx), orglmnet (forPenalisedLinearSystem).

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different regularisation parameters. Columns correspond todifferent parameters to estimated.

beta_full

matrix of modelcoefficients. Rows correspond to different regularisation parameters.Columns correspond to different parameters to estimated.

References

Jacobucci R, Grimm KJ, McArdle JJ (2016).“Regularized structural equation modeling.”Structural equation modeling: a multidisciplinary journal,23(4), 555–566.doi:10.1080/10705511.2016.1154793.

See Also

SelectionAlgo,VariableSelection,OpenMxMatrix,LinearSystemMatrix

Other underlying algorithm functions:CART(),ClusteringAlgo(),PenalisedGraphical(),PenalisedRegression()

Examples

# Data simulationpk <- c(3, 2, 3)dag <- LayeredDAG(layers = pk)theta <- dagtheta[2, 4] <- 0set.seed(1)simul <- SimulateStructural(theta = theta, pk = pk, output_matrices = TRUE)# Running regularised SEM (OpenMx)if (requireNamespace("OpenMx", quietly = TRUE)) {  mysem <- PenalisedOpenMx(    xdata = simul$data, adjacency = dag,    Lambda = seq(1, 10, 1)  )  OpenMxMatrix(vect = mysem$selected[3, ], adjacency = dag)}# Running regularised SEM (glmnet)mysem <- PenalisedLinearSystem(  xdata = simul$data, adjacency = dag)LinearSystemMatrix(vect = mysem$selected[20, ], adjacency = dag)

Penalised regression

Description

Runs penalised regression using implementation fromglmnet. This function is not using stability.

Usage

PenalisedRegression(  xdata,  ydata,  Lambda = NULL,  family,  penalisation = c("classic", "randomised", "adaptive"),  gamma = NULL,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

Lambda

matrix of parameters controlling the level of sparsity.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

penalisation

type of penalisation to use. Ifpenalisation="classic" (the default), penalised regression is donewith the same regularisation parameter, or usingpenalty.factor, ifspecified. Ifpenalisation="randomised", the regularisation for eachof the variables is uniformly chosen betweenlambda andlambda/gamma. Ifpenalisation="adaptive", the regularisationfor each of the variables is weighted by1/abs(beta)^gamma wherebeta is the regression coefficient obtained from unpenalisedregression.

gamma

parameter for randomised or adaptive regularisation. Default isgamma=0.5 for randomised regularisation andgamma=2 foradaptive regularisation. The parametergamma should be between0 and1 for randomised regularisation.

...

additional parameters passed toglmnet.

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors.

beta_full

array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors. Indicesalong the third dimension correspond to outcome variable(s).

References

Zou H (2006).“The adaptive lasso and its oracle properties.”Journal of the American statistical association,101(476), 1418–1429.

Tibshirani R (1996).“Regression Shrinkage and Selection via the Lasso.”Journal of the Royal Statistical Society. Series B (Methodological),58(1), 267–288.ISSN 00359246,http://www.jstor.org/stable/2346178.

See Also

SelectionAlgo,VariableSelection

Other underlying algorithm functions:CART(),ClusteringAlgo(),PenalisedGraphical(),PenalisedOpenMx()

Examples

# Data simulationset.seed(1)simul <- SimulateRegression(pk = 50)# Running the LASSOmylasso <- PenalisedRegression(  xdata = simul$xdata, ydata = simul$ydata,  Lambda = c(0.1, 0.2), family = "gaussian")# Using glmnet argumentsmylasso <- PenalisedRegression(  xdata = simul$xdata, ydata = simul$ydata,  Lambda = c(0.1), family = "gaussian",  penalty.factor = c(rep(0, 10), rep(1, 40)))mylasso$beta_full

Partial Least Squares predictions

Description

Computes predicted values from a Partial Least Squares (PLS) model inregression mode applied onxdata. This function is using the algorithmimplemented inpredict.pls.

Usage

PredictPLS(xdata, model)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

model

output ofPLS.

Value

An array of predicted values.

See Also

PLS

Examples

if (requireNamespace("mixOmics", quietly = TRUE)) {  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 100, pk = c(5, 5, 5), family = "gaussian")  x <- simul$xdata  y <- simul$ydata  # PLS  mypls <- PLS(xdata = x, ydata = y, ncomp = 3)  # Predicted values  predicted <- PredictPLS(xdata = x, model = mypls)}

Regression model refitting

Description

Refits the regression model with stably selected variables as predictors(without penalisation). Variables inxdata not evaluated in thestability selection model will automatically be included as predictors.

Usage

Refit(  xdata,  ydata,  stability = NULL,  family = NULL,  implementation = NULL,  Lambda = NULL,  seed = 1,  verbose = TRUE,  ...)Recalibrate(  xdata,  ydata,  stability = NULL,  family = NULL,  implementation = NULL,  Lambda = NULL,  seed = 1,  verbose = TRUE,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

stability

output ofVariableSelection orBiSelection. Ifstability=NULL (the default), a modelincluding all variables inxdata as predictors is fitted. Argumentfamily must be provided in this case.

family

type of regression model. Possible values include"gaussian" (linear regression),"binomial" (logisticregression),"multinomial" (multinomial regression), and"cox" (survival analysis). If provided, this argument must beconsistent with inputstability.

implementation

optional function to refit the model. Ifstability is the output ofVariableSelection, a regression model is refitted.Ifimplementation=NULL andLambda=0, this is done usinglm (for linearregression),coxph (Cox regression),glm (logistic regression), ormultinom (multinomial regression).IfLambda=NULL, a Ridge regression is fitted and calibrated by cross validation usingcv.glmnet.The functionPLS is used ifstability is the output ofBiSelection.

Lambda

optional vector of penalty parameters.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

verbose

logical indicating if a loading bar and messages should beprinted.

...

additional arguments to be passed to the function provided inimplementation.

Value

The output as obtained from:

stats::lm

forlinear regression ("gaussian" family).

survival::coxph

for Cox regression ("cox"family).

stats::glm

for logistic regression("binomial" family).

nnet::multinom

formultinomial regression ("multinomial" family).

See Also

VariableSelection

Examples

## Linear regression# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")# Data splitids_train <- Resample(  data = simul$ydata,  tau = 0.5, family = "gaussian")xtrain <- simul$xdata[ids_train, , drop = FALSE]ytrain <- simul$ydata[ids_train, , drop = FALSE]xrefit <- simul$xdata[-ids_train, , drop = FALSE]yrefit <- simul$ydata[-ids_train, , drop = FALSE]# Stability selectionstab <- VariableSelection(xdata = xtrain, ydata = ytrain, family = "gaussian")print(SelectedVariables(stab))# Refitting the modelrefitted <- Refit(  xdata = xrefit, ydata = yrefit,  stability = stab)refitted$coefficients # refitted coefficientshead(refitted$fitted.values) # refitted predicted values# Fitting the full model (including all possible predictors)refitted <- Refit(  xdata = simul$xdata, ydata = simul$ydata,  family = "gaussian")refitted$coefficients # refitted coefficients## Logistic regression# Data simulationset.seed(1)simul <- SimulateRegression(n = 200, pk = 20, family = "binomial")# Data splitids_train <- Resample(  data = simul$ydata,  tau = 0.5, family = "binomial")xtrain <- simul$xdata[ids_train, , drop = FALSE]ytrain <- simul$ydata[ids_train, , drop = FALSE]xrefit <- simul$xdata[-ids_train, , drop = FALSE]yrefit <- simul$ydata[-ids_train, , drop = FALSE]# Stability selectionstab <- VariableSelection(xdata = xtrain, ydata = ytrain, family = "binomial")# Refitting the modelrefitted <- Refit(  xdata = xrefit, ydata = yrefit,  stability = stab)refitted$coefficients # refitted coefficientshead(refitted$fitted.values) # refitted predicted probabilities## Partial Least Squares (multiple components)if (requireNamespace("sgPLS", quietly = TRUE)) {  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 500, pk = 15, q = 3, family = "gaussian")  # Data split  ids_train <- Resample(    data = simul$ydata,    tau = 0.5, family = "gaussian"  )  xtrain <- simul$xdata[ids_train, , drop = FALSE]  ytrain <- simul$ydata[ids_train, , drop = FALSE]  xrefit <- simul$xdata[-ids_train, , drop = FALSE]  yrefit <- simul$ydata[-ids_train, , drop = FALSE]  # Stability selection  stab <- BiSelection(    xdata = xtrain, ydata = ytrain,    family = "gaussian", ncomp = 3,    LambdaX = seq_len(ncol(xtrain) - 1),    LambdaY = seq_len(ncol(ytrain) - 1),    implementation = SparsePLS  )  plot(stab)  # Refitting the model  refitted <- Refit(    xdata = xrefit, ydata = yrefit,    stability = stab  )  refitted$Wmat # refitted X-weights  refitted$Cmat # refitted Y-weights}

Resampling observations

Description

Generates a vector of resampled observation IDs.

Usage

Resample(data, family = NULL, tau = 0.5, resampling = "subsampling", ...)

Arguments

data

vector or matrix of data. In regression, this should be theoutcome data.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

...

additional parameters passed to the function provided inresampling.

Details

With categorical outcomes (i.e. "family" argument is set to"binomial", "multinomial" or "cox"), the resampling is done such that theproportion of observations from each of the categories is representative ofthat of the full sample.

Value

A vector of resampled IDs.

Examples

## Linear regression framework# Data simulationsimul <- SimulateRegression()# Subsamplingids <- Resample(data = simul$ydata, family = "gaussian")sum(duplicated(ids))# Bootstrappingids <- Resample(data = simul$ydata, family = "gaussian", resampling = "bootstrap")sum(duplicated(ids))## Logistic regression framework# Data simulationsimul <- SimulateRegression(family = "binomial")# Subsamplingids <- Resample(data = simul$ydata, family = "binomial")sum(duplicated(ids))prop.table(table(simul$ydata))prop.table(table(simul$ydata[ids]))# Data simulation for a binary confounderconf <- ifelse(runif(n = 100) > 0.5, yes = 1, no = 0)# User-defined resampling functionBalancedResampling <- function(data, tau, Z, ...) {  s <- NULL  for (z in unique(Z)) {    s <- c(s, sample(which((data == "0") & (Z == z)), size = tau * sum((data == "0") & (Z == z))))    s <- c(s, sample(which((data == "1") & (Z == z)), size = tau * sum((data == "1") & (Z == z))))  }  return(s)}# Resampling keeping proportions by Y and Zids <- Resample(data = simul$ydata, family = "binomial", resampling = BalancedResampling, Z = conf)prop.table(table(simul$ydata, conf))prop.table(table(simul$ydata[ids], conf[ids]))# User-defined resampling for stability selectionstab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata, family = "binomial",  resampling = BalancedResampling, Z = conf)

Variable selection algorithm

Description

Runs the variable selection algorithm specified in the argumentimplementation. This function is not using stability.

Usage

SelectionAlgo(  xdata,  ydata = NULL,  Lambda,  group_x = NULL,  scale = TRUE,  family = NULL,  implementation = PenalisedRegression,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.IfLambda=NULL andimplementation=PenalisedRegression,LambdaGridRegression is used to define a relevant grid.

group_x

vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. Only used formodels with group penalisation (e.g.implementation=GroupPLS orimplementation=SparseGroupPLS).

scale

logical indicating if the predictor data should be scaled.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

implementation

function to use for variable selection. Possiblefunctions are:PenalisedRegression,SparsePLS,GroupPLS andSparseGroupPLS. Alternatively, a user-definedfunction can be provided.

...

additional parameters passed to the function provided inimplementation.

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors.

beta_full

array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors. Indicesalong the third dimension correspond to outcome variable(s).

See Also

VariableSelection,PenalisedRegression,SparsePCA,SparsePLS,GroupPLS,SparseGroupPLS

Other wrapping functions:GraphicalAlgo()

Examples

# Data simulation (univariate outcome)set.seed(1)simul <- SimulateRegression(pk = 50)# Running the LASSOmylasso <- SelectionAlgo(  xdata = simul$xdata, ydata = simul$ydata,  Lambda = c(0.1, 0.2), family = "gaussian",)# Data simulation (multivariate outcome)set.seed(1)simul <- SimulateRegression(pk = 50, q = 3)# Running multivariate Gaussian LASSOmylasso <- SelectionAlgo(  xdata = simul$xdata, ydata = simul$ydata,  Lambda = c(0.1, 0.2), family = "mgaussian")str(mylasso)

Selection performance

Description

Computes different metrics of selection performance by comparing the set ofselected features to the set of true predictors/edges. This function can onlybe used in simulation studies (i.e. when the true model is known).

Usage

SelectionPerformance(theta, theta_star, pk = NULL, cor = NULL, thr = 0.5)

Arguments

theta

output fromVariableSelection,BiSelection, orGraphicalModel. Alternatively,it can be a binary matrix of selected variables (in variable selection) ora binary adjacency matrix (in graphical modelling)

theta_star

output fromSimulateRegression,SimulateComponents, orSimulateGraphical.Alternatively, it can be a binary matrix of true predictors (in variableselection) or the true binary adjacency matrix (in graphical modelling).

pk

optional vector encoding the grouping structure. Only used formulti-block stability selection wherepk indicates the number ofvariables in each group. Ifpk=NULL, single-block stabilityselection is performed.

cor

optional correlation matrix. Only used in graphical modelling.

thr

optional threshold in correlation. Only used in graphicalmodelling and when argument "cor" is not NULL.

Value

A matrix of selection metrics including:

TP

number of True Positives (TP)

FN

number of FalseNegatives (TN)

FP

number of False Positives (FP)

TN

numberof True Negatives (TN)

sensitivity

sensitivity, i.e. TP/(TP+FN)

specificity

specificity, i.e. TN/(TN+FP)

accuracy

accuracy,i.e. (TP+TN)/(TP+TN+FP+FN)

precision

precision (p), i.e.TP/(TP+FP)

recall

recall (r), i.e. TP/(TP+FN)

F1_score

F1-score, i.e. 2*p*r/(p+r)

If argument "cor" is provided, the number of False Positives amongcorrelated (FP_c) and uncorrelated (FP_i) pairs, defined as havingcorrelations (provided in "cor") above or below the threshold "thr", arealso reported.

Block-specific performances are reported if "pk" is not NULL. In this case,the first row of the matrix corresponds to the overall performances, andsubsequent rows correspond to each of the blocks. The order of the blocksis defined as inBlockStructure.

See Also

Other functions for model performance:ClusteringPerformance(),SelectionPerformanceGraph()

Examples

# Variable selection modelset.seed(1)simul <- SimulateRegression(pk = 30, nu_xy = 0.5)stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)# Selection performanceSelectionPerformance(theta = stab, theta_star = simul)# Alternative formulationSelectionPerformance(  theta = SelectedVariables(stab),  theta_star = simul$theta)

Graph representation of selection performance

Description

Generates an igraph object representing the True Positive, False Positive andFalse Negative edges by comparing the set of selected edges to the set oftrue edges. This function can only be used in simulation studies (i.e. whenthe true model is known).

Usage

SelectionPerformanceGraph(  theta,  theta_star,  col = c("tomato", "forestgreen", "navy"),  lty = c(2, 3, 1),  node_colour = NULL,  show_labels = TRUE,  ...)

Arguments

theta

binary adjacency matrix or output ofGraphicalModel,VariableSelection, orBiSelection.

theta_star

true binary adjacency matrix or output ofSimulateGraphical orSimulateRegression.

col

vector of edge colours. The first entry of the vector definesthe colour of False Positive edges, second entry is for True Negatives andthird entry is for True Positives.

lty

vector of line types for edges. The order is defined as forargumentcol.

node_colour

optional vector of node colours. This vector must containas many entries as there are rows/columns in the adjacency matrix and mustbe in the same order (the order is used to assign colours to nodes).Integers, named colours or RGB values can be used.

show_labels

logical indicating if the node labels should be displayed.

...

additional arguments to be passed toGraph.

Value

An igraph object.

See Also

SimulateGraphical,SimulateRegression,GraphicalModel,VariableSelection,BiSelection

Other functions for model performance:ClusteringPerformance(),SelectionPerformance()

Examples

# Data simulationset.seed(1)simul <- SimulateGraphical(pk = 30)# Stability selectionstab <- GraphicalModel(xdata = simul$data, K = 10)# Performance graphperfgraph <- SelectionPerformanceGraph(  theta = stab,  theta_star = simul)plot(perfgraph)

Selection performance (internal)

Description

Computes different metrics of selection performance from a categoricalvector/matrix with 3 for True Positive, 2 for False Negative, 1 for FalsePositive and 0 for True Negative.

Usage

SelectionPerformanceSingle(Asum, cor = NULL, thr = 0.5)

Arguments

Asum

vector (in variable selection) or matrix (in graphical modelling)containing values of0,1,2 or3.

cor

optional correlation matrix. Only used in graphical modelling.

thr

optional threshold in correlation. Only used in graphicalmodelling and when argument "cor" is not NULL.

Value

A matrix of selection metrics including:

TP

number of True Positives (TP)

FN

number of FalseNegatives (TN)

FP

number of False Positives (FP)

TN

numberof True Negatives (TN)

sensitivity

sensitivity, i.e. TP/(TP+FN)

specificity

specificity, i.e. TN/(TN+FP)

accuracy

accuracy,i.e. (TP+TN)/(TP+TN+FP+FN)

precision

precision (p), i.e.TP/(TP+FP)

recall

recall (r), i.e. TP/(TP+FN)

F1_score

F1-score, i.e. 2*p*r/(p+r)

If argument "cor" is provided, the number of False Positives amongcorrelated (FP_c) and uncorrelated (FP_i) pairs, defined as havingcorrelations (provided in "cor") above or below the threshold "thr", arealso reported.


Selection/co-membership proportions

Description

Extracts selection proportions (for stability selection) or co-membershipproportions (for consensus clustering).

Usage

SelectionProportions(stability, argmax_id = NULL)ConsensusMatrix(stability, argmax_id = NULL)

Arguments

stability

output ofVariableSelection,GraphicalModel,BiSelection, orClustering.

argmax_id

optional indices of hyper-parameters. Ifargmax_id=NULL, the calibrated hyper-parameters are used.

Value

A vector or matrix of proportions.

See Also

VariableSelection,GraphicalModel,BiSelection,Clustering

Examples

# Stability selectionset.seed(1)simul <- SimulateRegression(pk = 50)stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)SelectionProportions(stab)# Consensus clusteringset.seed(1)simul <- SimulateClustering(  n = c(30, 30, 30), nu_xc = 1, ev_xc = 0.5)stab <- Clustering(xdata = simul$data)ConsensusMatrix(stab)

Selection proportions (graphical model)

Description

Extracts the selection proportions of the (calibrated) stability selectionmodel.

Usage

SelectionProportionsGraphical(stability, argmax_id = NULL)

Arguments

stability

output ofGraphicalModel.

argmax_id

optional indices of hyper-parameters. Ifargmax_id=NULL, the calibrated hyper-parameters are used.

Value

A symmetric matrix.


Selection proportions (variable selection)

Description

Extracts the selection proportions of the (calibrated) stability selectionmodel.

Usage

SelectionProportionsRegression(stability, argmax_id = NULL)

Arguments

stability

output ofVariableSelection.

argmax_id

optional indices of hyper-parameters. Ifargmax_id=NULL, the calibrated hyper-parameters are used.

Value

A vector of selection proportions.


Consensus clustering (internal)

Description

Performs consensus (weighted) clustering. The underlying algorithm (e.g.hierarchical clustering) is run with different number of clustersnc.In consensus weighed clustering, weighted distances are calculated using thecosa2 algorithm with different penalty parametersLambda. The hyper-parameters are calibrated by maximisation of theconsensus score. This function uses a serial implementation and requires the grids ofhyper-parameters as input (for internal use only).

Usage

SerialClustering(  xdata,  nc,  eps,  Lambda,  K = 100,  tau = 0.5,  seed = 1,  n_cat = 3,  implementation = HierarchicalClustering,  scale = TRUE,  linkage = "complete",  row = TRUE,  output_data = FALSE,  verbose = TRUE,  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

eps

radius in density-based clustering, seedbscan. Only used ifimplementation=DBSCANClustering.

Lambda

vector of penalty parameters for weighted distance calculation.Only used for distance-based clustering, including for exampleimplementation=HierarchicalClustering,implementation=PAMClustering, orimplementation=DBSCANClustering.

K

number of resampling iterations.

tau

subsample size.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

implementation

function to use for clustering. Possible functionsincludeHierarchicalClustering (hierarchical clustering),PAMClustering (Partitioning Around Medoids),KMeansClustering (k-means) andGMMClustering(Gaussian Mixture Models). Alternatively, a user-defined function takingxdata andLambda as arguments and returning a binary andsymmetric matrix for which diagonal elements are equal to zero can be used.

scale

logical indicating if the data should be scaled to ensure thatall variables contribute equally to the clustering of the observations.

linkage

character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude"complete","single" and"average" (seeargument"method" inhclust for a full list).Only used ifimplementation=HierarchicalClustering.

row

logical indicating if rows (ifrow=TRUE) or columns (ifrow=FALSE) contain the items to cluster.

output_data

logical indicating if the input datasetsxdata andydata should be included in the output.

verbose

logical indicating if a loading bar and messages should beprinted.

...

additional parameters passed to the functions provided inimplementation orresampling.

Value

A list with:

Sc

a matrixof the best stability scores for different (sets of) parameters controllingthe number of clusters and penalisation of attribute weights.

nc

amatrix of numbers of clusters.

Lambda

a matrix of regularisationparameters for attribute weights.

Q

a matrix of the average numberof selected attributes by the underlying algorithm with differentregularisation parameters.

coprop

an array of consensus matrices.Rows and columns correspond to items. Indices along the third dimensioncorrespond to different parameters controlling the number of clusters andpenalisation of attribute weights.

selprop

an array of selectionproportions. Columns correspond to attributes. Rows correspond to differentparameters controlling the number of clusters and penalisation of attributeweights.

method

a list withtype="clustering" and valuesused for argumentsimplementation,linkage, andresampling.

params

a list with values used for argumentsK,tau,pk,n (number of observations inxdata), andseed.

The rows ofSc,nc,Lambda,Q,selprop and indices along the thirddimension ofcoprop are ordered in the same way and correspond toparameter values stored innc andLambda.


Stability selection graphical model (internal)

Description

Runs stability selection graphical models with different combinations ofparameters controlling the sparsity of the underlying selection algorithm(e.g. penalty parameter for regularised models) and thresholds in selectionproportions. These two parameters are jointly calibrated by maximising thestability score of the model (possibly under a constraint on the expectednumber of falsely stably selected features). This function uses a serialimplementation and requires the grid of parameters controlling the underlyingalgorithm as input (for internal use only).

Usage

SerialGraphical(  xdata,  pk = NULL,  Lambda,  lambda_other_blocks = 0.1,  pi_list = seq(0.6, 0.9, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = n_cat,  implementation = PenalisedGraphical,  start = "cold",  scale = TRUE,  resampling = "subsampling",  cpss = FALSE,  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  output_data = FALSE,  verbose = TRUE,  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.For multi-block stability selection, the variables in data have to beordered by group.

pk

optional vector encoding the grouping structure. Only used formulti-block stability selection wherepk indicates the number ofvariables in each group. Ifpk=NULL, single-block stabilityselection is performed.

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.Ifimplementation="glassoFast",Lambda contains penaltyparameters.

lambda_other_blocks

optional vector of parameters controlling thelevel of sparsity in neighbour blocks for the multi-block procedure. To usejointly a specific set of parameters for each block,lambda_other_blocks must be set toNULL (not recommended).Only used for multi-block stability selection, i.e. iflength(pk)>1.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

implementation

function to use for graphical modelling. Ifimplementation=PenalisedGraphical, the algorithm implemented inglassoFast is used for regularised estimation ofa conditional independence graph. Alternatively, a user-defined functioncan be provided.

start

character string indicating if the algorithm should beinitialised at the estimated (inverse) covariance with previous penaltyparameters (start="warm") or not (start="cold"). Usingstart="warm" can speed-up the computations, but could lead toconvergence issues (in particular with smallLambda_cardinal). Onlyused forimplementation=PenalisedGraphical (see argument"start" inglassoFast).

scale

logical indicating if the correlation (scale=TRUE) orcovariance (scale=FALSE) matrix should be used as input ofglassoFast ifimplementation=PenalisedGraphical. Otherwise, this argument must beused in the function provided inimplementation.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

cpss

logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is splitK/2 times (K models are fitted). Only used ifPFER_method="MB".

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

output_data

logical indicating if the input datasetsxdata andydata should be included in the output.

verbose

logical indicating if a loading bar and messages should beprinted.

...

additional parameters passed to the functions provided inimplementation orresampling.

Value

A list with:

S

a matrix of the best stability scores fordifferent (sets of) parameters controlling the level of sparsity in theunderlying algorithm.

Lambda

a matrix of parameters controlling thelevel of sparsity in the underlying algorithm.

Q

a matrix of theaverage number of selected features by the underlying algorithm withdifferent parameters controlling the level of sparsity.

Q_s

amatrix of the calibrated number of stably selected features with differentparameters controlling the level of sparsity.

P

a matrix ofcalibrated thresholds in selection proportions for different parameterscontrolling the level of sparsity in the underlying algorithm.

PFER

a matrix of upper-bounds in PFER of calibrated stabilityselection models with different parameters controlling the level ofsparsity.

FDP

a matrix of upper-bounds in FDP of calibratedstability selection models with different parameters controlling the levelof sparsity.

S_2d

a matrix of stability scores obtained withdifferent combinations of parameters. Columns correspond to differentthresholds in selection proportions.

PFER_2d

a matrix ofupper-bounds in FDP obtained with different combinations of parameters.Columns correspond to different thresholds in selection proportions. Onlyreturned iflength(pk)=1.

FDP_2d

a matrix of upper-bounds inPFER obtained with different combinations of parameters. Columns correspondto different thresholds in selection proportions. Only returned iflength(pk)=1.

selprop

an array of selection proportions.Rows and columns correspond to nodes in the graph. Indices along the thirddimension correspond to different parameters controlling the level ofsparsity in the underlying algorithm.

sign

a matrix of signs ofPearson's correlations estimated fromxdata.

method

a listwithtype="graphical_model" and values used for argumentsimplementation,start,resampling,cpss andPFER_method.

params

a list with values used for argumentsK,pi_list,tau,n_cat,pk,n(number of observations inxdata),PFER_thr,FDP_thr,seed,lambda_other_blocks, andSequential_template.

The rows ofS,Lambda,Q,Q_s,P,PFER,FDP,S_2d,PFER_2d andFDP_2d, andindices along the third dimension ofselprop are ordered in the sameway and correspond to parameter values stored inLambda. Formulti-block inference, the columns ofS,Lambda,Q,Q_s,P,PFER andFDP, and indices along thethird dimension ofS_2d correspond to the different blocks.


Stability selection in regression (internal)

Description

Runs stability selection regression models with different combinations ofparameters controlling the sparsity of the underlying selection algorithm(e.g. penalty parameter for regularised models) and thresholds in selectionproportions. These two parameters are jointly calibrated by maximising thestability score of the model (possibly under a constraint on the expectednumber of falsely stably selected features). This function uses a serialimplementation and requires the grid of parameters controlling the underlyingalgorithm as input (for internal use only).

Usage

SerialRegression(  xdata,  ydata = NULL,  Lambda,  pi_list = seq(0.6, 0.9, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = 3,  family = "gaussian",  implementation = PenalisedRegression,  resampling = "subsampling",  cpss = FALSE,  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  group_x = NULL,  group_penalisation = FALSE,  output_data = FALSE,  verbose = TRUE,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.Withimplementation="glmnet",Lambda contains penaltyparameters.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

implementation

function to use for variable selection. Possiblefunctions are:PenalisedRegression,SparsePLS,GroupPLS andSparseGroupPLS. Alternatively, a user-definedfunction can be provided.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

cpss

logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is splitK/2 times (K models are fitted). Only used ifPFER_method="MB".

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

group_x

vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. Only used formodels with group penalisation (e.g.implementation=GroupPLS orimplementation=SparseGroupPLS).

group_penalisation

logical indicating if a group penalisation shouldbe considered in the stability score. The use ofgroup_penalisation=TRUE strictly applies to group (not sparse-group)penalisation.

output_data

logical indicating if the input datasetsxdata andydata should be included in the output.

verbose

logical indicating if a loading bar and messages should beprinted.

...

additional parameters passed to the functions provided inimplementation orresampling.

Value

A list with:

S

a matrix of the best stability scores fordifferent parameters controlling the level of sparsity in the underlyingalgorithm.

Lambda

a matrix of parameters controlling the level ofsparsity in the underlying algorithm.

Q

a matrix of the averagenumber of selected features by the underlying algorithm with differentparameters controlling the level of sparsity.

Q_s

a matrix of thecalibrated number of stably selected features with different parameterscontrolling the level of sparsity.

P

a matrix of calibratedthresholds in selection proportions for different parameters controllingthe level of sparsity in the underlying algorithm.

PFER

a matrix ofupper-bounds in PFER of calibrated stability selection models withdifferent parameters controlling the level of sparsity.

FDP

amatrix of upper-bounds in FDP of calibrated stability selection models withdifferent parameters controlling the level of sparsity.

S_2d

amatrix of stability scores obtained with different combinations ofparameters. Columns correspond to different thresholds in selectionproportions.

PFER_2d

a matrix of upper-bounds in FDP obtained withdifferent combinations of parameters. Columns correspond to differentthresholds in selection proportions.

FDP_2d

a matrix ofupper-bounds in PFER obtained with different combinations of parameters.Columns correspond to different thresholds in selection proportions.

selprop

a matrix of selection proportions. Columns correspond topredictors fromxdata.

Beta

an array of model coefficients.Columns correspond to predictors fromxdata. Indices along the thirddimension correspond to different resampling iterations. With multivariateoutcomes, indices along the fourth dimension correspond to outcome-specificcoefficients.

method

a list withtype="variable_selection"and values used for argumentsimplementation,family,resampling,cpss andPFER_method.

params

alist with values used for argumentsK,pi_list,tau,n_cat,pk,n (number of observations),PFER_thr,FDP_thr andseed. The datasetsxdataandydata are also included ifoutput_data=TRUE.

For allmatrices and arrays returned, the rows are ordered in the same way andcorrespond to parameter values stored inLambda.


Sparse group Partial Least Squares

Description

Runs a sparse group Partial Least Squares model using implementation fromsgPLS-package. This function is not using stability.

Usage

SparseGroupPLS(  xdata,  ydata,  family = "gaussian",  group_x,  group_y = NULL,  Lambda,  alpha.x,  alpha.y = NULL,  keepX_previous = NULL,  keepY = NULL,  ncomp = 1,  scale = TRUE,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

family

type of PLS model. Iffamily="gaussian", a sparse groupPLS model as defined insgPLS is run (for continuousoutcomes). Iffamily="binomial", a PLS-DA model as defined insgPLSda is run (for categorical outcomes).

group_x

vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group.

group_y

optional vector encoding the grouping structure amongoutcomes. This argument indicates the number of variables in each group.

Lambda

matrix of parameters controlling the number of selected groupsat current component, as defined byncomp.

alpha.x

vector of parameters controlling the level of sparsity withingroups of predictors.

alpha.y

optional vector of parameters controlling the level ofsparsity within groups of outcomes. Only used iffamily="gaussian".

keepX_previous

number of selected groups in previous components. Onlyused ifncomp > 1. The argumentkeepX insgPLS is obtained by concatenatingkeepX_previous andLambda.

keepY

number of selected groups of outcome variables. This argument isdefined as insgPLS. Only used iffamily="gaussian".

ncomp

number of components.

scale

logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). Onlyused iffamily="gaussian".

...

additional arguments to be passed tosgPLS orsgPLSda.

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors.

beta_full

array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors (startingwith "X") or outcomes (starting with "Y") variables for differentcomponents (denoted by "PC").

References

Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016).“Group and sparse group partial least square approaches applied in genomics context.”Bioinformatics,32(1), 35-42.ISSN 1367-4803,doi:10.1093/bioinformatics/btv535.

See Also

VariableSelection,BiSelection

Other penalised dimensionality reduction functions:GroupPLS(),SparsePCA(),SparsePLS()

Examples

if (requireNamespace("sgPLS", quietly = TRUE)) {  ## Sparse group PLS  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 100, pk = 30, q = 3, family = "gaussian")  x <- simul$xdata  y <- simul$ydata  # Running sgPLS with 1 group and sparsity of 0.5  mypls <- SparseGroupPLS(    xdata = x, ydata = y, Lambda = 1, family = "gaussian",    group_x = c(10, 15, 5), alpha.x = 0.5  )  # Running sgPLS with groups on outcomes  mypls <- SparseGroupPLS(    xdata = x, ydata = y, Lambda = 1, family = "gaussian",    group_x = c(10, 15, 5), alpha.x = 0.5,    group_y = c(2, 1), keepY = 1, alpha.y = 0.9  )  ## Sparse group PLS-DA  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 100, pk = 50, family = "binomial")  # Running sgPLS-DA with 1 group and sparsity of 0.9  mypls <- SparseGroupPLS(    xdata = simul$xdata, ydata = simul$ydata, Lambda = 1, family = "binomial",    group_x = c(10, 15, 25), alpha.x = 0.9  )}

Sparse K means clustering

Description

Runs sparse K means clustering using implementation fromKMeansSparseCluster. This function is not usingstability.

Usage

SparseKMeansClustering(xdata, nc = NULL, Lambda, ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

Lambda

vector of penalty parameters (see argumentwbounds inKMeansSparseCluster).

...

additional parameters passed toKMeansSparseCluster.

Value

A list with:

comembership

an array of binary and symmetricco-membership matrices.

weights

a matrix of median weights byfeature.

References

Witten DM, Tibshirani R (2010).“A Framework for Feature Selection in Clustering.”Journal of the American Statistical Association,105(490), 713-726.doi:10.1198/jasa.2010.tm09415, PMID: 20811510.


Sparse Principal Component Analysis

Description

Runs a sparse Principal Component Analysis model using implementation fromspca (ifalgo="sPCA") orspca (ifalgo="rSVD"). This function is notusing stability.

Usage

SparsePCA(  xdata,  Lambda,  ncomp = 1,  scale = TRUE,  keepX_previous = NULL,  algorithm = "sPCA",  ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

Lambda

matrix of parameters controlling the number of selectedvariables at current component, as defined byncomp.

ncomp

number of components.

scale

logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one).

keepX_previous

number of selected predictors in previous components.Only used ifncomp > 1.

algorithm

character string indicating the name of the algorithm to use forsparse PCA. Possible values are: "sPCA" (for the algorithm proposed by Zou,Hastie and Tibshirani and implemented inspca) or"rSVD" (for the regularised SVD approach proposed by Shen and Huang andimplemented inspca).

...

additional arguments to be passed tospca (ifalgorithm="sPCA") orspca (ifalgorithm="rSVD").

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors.

beta_full

array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors (startingwith "X") or outcomes (starting with "Y") variables for differentcomponents (denoted by "PC").

References

Zou H, Hastie T, Tibshirani R (2006).“Sparse Principal Component Analysis.”Journal of Computational and Graphical Statistics,15(2), 265-286.doi:10.1198/106186006X113430.

Shen H, Huang JZ (2008).“Sparse principal component analysis via regularized low rank matrix approximation.”Journal of Multivariate Analysis,99(6), 1015-1034.ISSN 0047-259X,doi:10.1016/j.jmva.2007.06.007.

See Also

VariableSelection,BiSelection

Other penalised dimensionality reduction functions:GroupPLS(),SparseGroupPLS(),SparsePLS()

Examples

# Data simulationset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")x <- simul$xdata# Sparse PCA (by Zou, Hastie, Tibshirani)if (requireNamespace("elasticnet", quietly = TRUE)) {  mypca <- SparsePCA(    xdata = x, ncomp = 2,    Lambda = c(1, 2), keepX_previous = 10, algorithm = "sPCA"  )}# Sparse PCA (by Shen and Huang)if (requireNamespace("mixOmics", quietly = TRUE)) {  mypca <- SparsePCA(    xdata = x, ncomp = 2,    Lambda = c(1, 2), keepX_previous = 10, algorithm = "rSVD"  )}

Sparse Partial Least Squares

Description

Runs a sparse Partial Least Squares model using implementation fromsgPLS-package. This function is not using stability.

Usage

SparsePLS(  xdata,  ydata,  Lambda,  family = "gaussian",  ncomp = 1,  scale = TRUE,  keepX_previous = NULL,  keepY = NULL,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

Lambda

matrix of parameters controlling the number of selectedpredictors at current component, as defined byncomp.

family

type of PLS model. Iffamily="gaussian", a sparse PLSmodel as defined insPLS is run (for continuousoutcomes). Iffamily="binomial", a PLS-DA model as defined insPLSda is run (for categorical outcomes).

ncomp

number of components.

scale

logical indicating if the data should be scaled (i.e.transformed so that all variables have a standard deviation of one). Onlyused iffamily="gaussian".

keepX_previous

number of selected predictors in previous components.Only used ifncomp > 1. The argumentkeepX insPLS is obtained by concatenatingkeepX_previous andLambda.

keepY

number of selected outcome variables. This argument is definedas insPLS. Only used iffamily="gaussian".

...

additional arguments to be passed tosPLS orsPLSda.

Value

A list with:

selected

matrix of binary selection status. Rowscorrespond to different model parameters. Columns correspond topredictors.

beta_full

array of model coefficients. Rows correspondto different model parameters. Columns correspond to predictors (startingwith "X") or outcomes (starting with "Y") variables for differentcomponents (denoted by "PC").

References

KA LC, Rossouw D, Robert-Granié C, Besse P (2008).“A sparse PLS for variable selection when integrating omics data.”Stat Appl Genet Mol Biol,7(1), Article 35.ISSN 1544-6115,doi:10.2202/1544-6115.1390.

See Also

VariableSelection,BiSelection

Other penalised dimensionality reduction functions:GroupPLS(),SparseGroupPLS(),SparsePCA()

Examples

if (requireNamespace("sgPLS", quietly = TRUE)) {  ## Sparse PLS  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian")  x <- simul$xdata  y <- simul$ydata  # Running sPLS with 2 X-variables and 1 Y-variable  mypls <- SparsePLS(xdata = x, ydata = y, Lambda = 2, family = "gaussian", keepY = 1)  ## Sparse PLS-DA  # Data simulation  set.seed(1)  simul <- SimulateRegression(n = 100, pk = 20, family = "binomial")  # Running sPLS-DA with 2 X-variables and 1 Y-variable  mypls <- SparsePLS(xdata = simul$xdata, ydata = simul$ydata, Lambda = 2, family = "binomial")}

Splitting observations into non-overlapping sets

Description

Generates a list oflength(tau) non-overlapping sets of observationIDs.

Usage

Split(data, family = NULL, tau = c(0.5, 0.25, 0.25))

Arguments

data

vector or matrix of data. In regression, this should be theoutcome data.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

tau

vector of the proportion of observations in each of the sets.

Details

With categorical outcomes (i.e.family argument is set to"binomial","multinomial" or"cox"), the split is donesuch that the proportion of observations from each of the categories ineach of the sets is representative of that of the full sample.

Value

A list of lengthlength(tau) with sets of non-overlappingobservation IDs.

Examples

# Splitting into 3 setssimul <- SimulateRegression()ids <- Split(data = simul$ydata)lapply(ids, length)# Balanced splits with respect to a binary variablesimul <- SimulateRegression(family = "binomial")ids <- Split(data = simul$ydata, family = "binomial")lapply(ids, FUN = function(x) {  table(simul$ydata[x, ])})

Adjacency from bipartite

Description

Generates a symmetric adjacency matrix encoding a bipartite graph.

Usage

Square(x)

Arguments

x

matrix encoding the edges between two types of nodes (rows andcolumns).

Value

A symmetric adjacency matrix encoding a bipartite graph.

Examples

# Simulated links between two setsset.seed(1)mat <- matrix(sample(c(0, 1), size = 15, replace = TRUE),  nrow = 5, ncol = 3)# Adjacency matrix of a bipartite graphSquare(mat)

Stability selection metrics

Description

Computes the stability score (seeStabilityScore) andupper-bounds of thePFER andFDP from selectionproportions of models with a given parameter controlling the sparsity of theunderlying algorithm and for different thresholds in selection proportions.

Usage

StabilityMetrics(  selprop,  pk = NULL,  pi_list = seq(0.6, 0.9, by = 0.01),  K = 100,  n_cat = NULL,  PFER_method = "MB",  PFER_thr_blocks = Inf,  FDP_thr_blocks = Inf,  Sequential_template = NULL,  graph = TRUE,  group = NULL)

Arguments

selprop

array of selection proportions.

pk

optional vector encoding the grouping structure. Only used formulti-block stability selection wherepk indicates the number ofvariables in each group. Ifpk=NULL, single-block stabilityselection is performed.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr_blocks

vector of block-specific thresholds in PFER forconstrained calibration by error control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrained calibration is used.

FDP_thr_blocks

vector of block-specific thresholds in the expectedproportion of falsely selected features (or False Discovery Proportion,FDP) for constrained calibration by error control. IfPFER_thr=InfandFDP_thr=Inf, unconstrained calibration is used.

Sequential_template

logical matrix encoding the type of procedure touse for data with multiple blocks in stability selection graphicalmodelling. For multi-block estimation, the stability selection model isconstructed as the union of block-specific stable edges estimated while theothers are weakly penalised (TRUE only for the block currently beingcalibrated andFALSE for other blocks). Other approaches with jointcalibration of the blocks are allowed (all entries are set toTRUE).

graph

logical indicating if stability selection is performed in aregression (ifFALSE) or graphical (ifTRUE)framework.

group

vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group and only needs tobe provided for group (but not sparse group) penalisation.

Value

A list with:

S

a matrix of the best (block-specific) stabilityscores for different (sets of) penalty parameters. In multi-block stabilityselection, rows correspond to different sets of penalty parameters, (valuesare stored in the output "Lambda") and columns correspond to differentblocks.

Lambda

a matrix of (block-specific) penalty parameters. Inmulti-block stability selection, rows correspond to sets of penaltyparameters and columns correspond to different blocks.

Q

a matrixof average numbers of (block-specific) edges selected by the underlyingalgorithm for different (sets of) penalty parameters. In multi-blockstability selection, rows correspond to different sets of penaltyparameters, (values are stored in the output "Lambda") and columnscorrespond to different blocks.

Q_s

a matrix of calibrated numbersof (block-specific) stable edges for different (sets of) penaltyparameters. In multi-block stability selection, rows correspond todifferent sets of penalty parameters, (values are stored in the output"Lambda") and columns correspond to different blocks.

P

a matrix ofcalibrated (block-specific) thresholds in selection proportions fordifferent (sets of) penalty parameters. In multi-block stability selection,rows correspond to different sets of penalty parameters, (values are storedin the output "Lambda") and columns correspond to different blocks.

PFER

a matrix of computed (block-specific) upper-bounds in PFER ofcalibrated graphs for different (sets of) penalty parameters. Inmulti-block stability selection, rows correspond to different sets ofpenalty parameters, (values are stored in the output "Lambda") and columnscorrespond to different blocks.

FDP

a matrix of computed(block-specific) upper-bounds in FDP of calibrated stability selectionmodels for different (sets of) penalty parameters. In multi-block stabilityselection, rows correspond to different sets of penalty parameters, (valuesare stored in the output "Lambda") and columns correspond to differentblocks.

S_2d

an array of (block-specific) stability scores obtainedwith different combinations of parameters. Rows correspond to different(sets of) penalty parameters and columns correspond to different thresholdsin selection proportions. In multi-block stability selection, indices alongthe third dimension correspond to different blocks.

PFER_2d

anarray of computed upper-bounds of PFER obtained with different combinationsof parameters. Rows correspond to different penalty parameters and columnscorrespond to different thresholds in selection proportions. Not availablein multi-block stability selection graphical modelling.

FDP_2d

anarray of computed upper-bounds of FDP obtained with different combinationsof parameters. Rows correspond to different penalty parameters and columnscorrespond to different thresholds in selection proportions. Not availablein multi-block stability selection graphical modelling.

References

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.

Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.

See Also

Other stability metric functions:ConsensusScore(),FDP(),PFER(),StabilityScore()

Examples

## Sparse or sparse group penalisation# Simulating set of selection proportionsset.seed(1)selprop <- matrix(round(runif(n = 20), digits = 2), nrow = 2)# Computing stability scores for different thresholdsmetrics <- StabilityMetrics(  selprop = selprop, pi = c(0.6, 0.7, 0.8),  K = 100, graph = FALSE)## Group penalisation# Simulating set of selection proportionsset.seed(1)selprop <- matrix(round(runif(n = 6), digits = 2), nrow = 2)selprop <- cbind(  selprop[, 1], selprop[, 1],  selprop[, 2], selprop[, 2],  matrix(rep(selprop[, 3], each = 6), nrow = 2, byrow = TRUE))# Computing stability scores for different thresholdsmetrics <- StabilityMetrics(  selprop = selprop, pi = c(0.6, 0.7, 0.8),  K = 100, graph = FALSE, group = c(2, 2, 6))

Stability score

Description

Computes the stability score from selection proportions of models with agiven parameter controlling the sparsity and for different thresholds inselection proportions. The score measures how unlikely it is that theselection procedure is uniform (i.e. uninformative) for a given combinationof parameters.

Usage

StabilityScore(  selprop,  pi_list = seq(0.6, 0.9, by = 0.01),  K,  n_cat = 3,  group = NULL)

Arguments

selprop

array of selection proportions.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

group

vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group and only needs tobe provided for group (but not sparse group) penalisation.

Details

The stability score is derived from the likelihood under theassumption of uniform (uninformative) selection.

We classify the features into three categories: the stably selected ones(that have selection proportions\ge \pi), the stably excluded ones(selection proportion\le 1-\pi), and the unstable ones (selectionproportions between1-\pi and\pi).

Under the hypothesis of equiprobability of selection (instability), thelikelihood of observing stably selected, stably excluded and unstablefeatures can be expressed as:

L_{\lambda, \pi} = \prod_{j=1}^N [ ( 1 - F( K \pi - 1 ) )^{1_{H_{\lambda} (j) \ge K \pi}} \times ( F( K \pi - 1 ) - F( K ( 1 - \pi ) )^{1_{ (1-\pi) K < H_{\lambda} (j) < K \pi }} \times F( K ( 1 - \pi ) )^{1_{ H_{\lambda} (j) \le K (1-\pi) }} ]

whereH_{\lambda} (j) is the selection count of featurej andF(x) is the cumulative probability function of the binomialdistribution with parametersK and the average proportion of selectedfeatures over resampling iterations.

The stability score is computed as the minus log-transformed likelihoodunder the assumption of equiprobability of selection:

S_{\lambda, \pi} = -log(L_{\lambda, \pi})

The stability score increases with stability.

Alternatively, the stability score can be computed by considering only twosets of features: stably selected (selection proportions\ge \pi) ornot (selection proportions< \pi). This can be done usingn_cat=2.

Value

A vector of stability scores obtained with the different thresholdsin selection proportions.

References

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

See Also

Other stability metric functions:ConsensusScore(),FDP(),PFER(),StabilityMetrics()

Examples

# Simulating set of selection proportionsset.seed(1)selprop <- round(runif(n = 20), digits = 2)# Computing stability scores for different thresholdsscore <- StabilityScore(selprop, pi_list = c(0.6, 0.7, 0.8), K = 100)

Stable results

Description

Extracts stable results for stability selection or consensus clustering.

Usage

Stable(stability, argmax_id = NULL, linkage = "complete")SelectedVariables(stability, argmax_id = NULL)Adjacency(stability, argmax_id = NULL)Clusters(stability, linkage = "complete", argmax_id = NULL)

Arguments

stability

output ofVariableSelection,BiSelection,GraphicalModel orClustering.

argmax_id

optional indices of hyper-parameters. Ifargmax_id=NULL, the calibrated hyper-parameters are used.

linkage

character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude"complete","single" and"average" (seeargument"method" inhclust for a full list).

Value

A binary vector or matrix encoding the selection status (1 ifselected,0 otherwise) in stability selection or stable clustermembership in consensus clustering.

See Also

VariableSelection,BiSelection,GraphicalModel,Clustering

Examples

# Variable selectionset.seed(1)simul <- SimulateRegression(pk = 20)stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)SelectedVariables(stab)Stable(stab)# Graphical modelset.seed(1)simul <- SimulateGraphical(pk = 10)stab <- GraphicalModel(xdata = simul$data)Adjacency(stab)Stable(stab)# Clusteringset.seed(1)simul <- SimulateClustering(  n = c(30, 30, 30),  nu_xc = 1)stab <- Clustering(xdata = simul$data)Clusters(stab)Stable(stab)

Stability selection in Structural Equation Modelling

Description

Performs stability selection for Structural Equation Models. The underlyingarrow selection algorithm (e.g. regularised Structural Equation Modelling) isrun with different combinations of parameters controlling the sparsity (e.g.penalty parameter) and thresholds in selection proportions. These twohyper-parameters are jointly calibrated by maximisation of the stabilityscore.

Usage

StructuralModel(  xdata,  adjacency,  residual_covariance = NULL,  Lambda = NULL,  pi_list = seq(0.01, 0.99, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = NULL,  implementation = PenalisedLinearSystem,  resampling = "subsampling",  cpss = FALSE,  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  Lambda_cardinal = 100,  optimisation = c("grid_search", "nloptr"),  n_cores = 1,  output_data = FALSE,  verbose = TRUE,  ...)

Arguments

xdata

matrix with observations as rows and variables as columns.Column names must be defined and in line with the row and column names ofadjacency.

adjacency

binary adjacency matrix of the Directed Acyclic Graph(transpose of the asymmetric matrix A in Reticular Action Model notation).The row and column names of this matrix must be defined.

residual_covariance

binary and symmetric matrix encoding the nonzeroentries in the residual covariance matrix (symmetric matrix S in ReticularAction Model notation). By default, this is the identity matrix (noresidual covariance).

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

implementation

function to use for variable selection.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

cpss

logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is splitK/2 times (K models are fitted). Only used ifPFER_method="MB".

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

Lambda_cardinal

number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used ifLambda=NULL.

optimisation

character string indicating the type of optimisationmethod. Withoptimisation="grid_search" (the default), all values inLambda are visited. Alternatively, optimisation algorithmsimplemented innloptr can be used withoptimisation="nloptr". By default, we use"algorithm"="NLOPT_GN_DIRECT_L","xtol_abs"=0.1,"ftol_abs"=0.1 and"maxeval"=Lambda_cardinal. These valuescan be changed by providing the argumentopts (seenloptr). For stability selection using penalisedregression,optimisation="grid_search" may be faster as it allowsfor warm start.

n_cores

number of cores to use for parallel computing (see argumentworkers inmultisession). Usingn_cores>1 is only supported withoptimisation="grid_search".

output_data

logical indicating if the input datasetsxdata andydata should be included in the output.

verbose

logical indicating if a loading bar and messages should beprinted.

...

additional parameters passed to the functions provided inimplementation orresampling.

Details

In stability selection, a feature selection algorithm is fitted onK subsamples (or bootstrap samples) of the data with differentparameters controlling the sparsity (Lambda). For a given (set of)sparsity parameter(s), the proportion out of theK models in whicheach feature is selected is calculated. Features with selection proportionsabove a threshold pi are considered stably selected. The stabilityselection model is controlled by the sparsity parameter(s) for theunderlying algorithm, and the threshold in selection proportion:

V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}

In Structural Equation Modelling, "feature" refers to an arrow in thecorresponding Directed Acyclic Graph.

These parameters can be calibrated by maximisation of a stability score(seeConsensusScore ifn_cat=NULL orStabilityScore otherwise) calculated under the nullhypothesis of equiprobability of selection.

It is strongly recommended to examine the calibration plot carefully tocheck that the grids of parametersLambda andpi_list do notrestrict the calibration to a region that would not include the globalmaximum (seeCalibrationPlot). In particular, the gridLambda may need to be extended when the maximum stability isobserved on the left or right edges of the calibration heatmap. In someinstances, multiple peaks of stability score can be observed. Simulationstudies suggest that the peak corresponding to the largest number ofselected features tend to give better selection performances. This is notnecessarily the highest peak (which is automatically retained by thefunctions in this package). The user can decide to manually choose anotherpeak.

To control the expected number of False Positives (Per Family Error Rate)in the results, a thresholdPFER_thr can be specified. Theoptimisation problem is then constrained to sets of parameters thatgenerate models with an upper-bound in PFER belowPFER_thr (seeMeinshausen and Bühlmann (2010) and Shah and Samworth (2013)).

Possible resampling procedures include defining (i)K subsamples ofa proportiontau of the observations, (ii)K bootstrapsamples with the full sample size (obtained with replacement), and (iii)K/2 splits of the data in half for complementary pair stabilityselection (see argumentsresampling andcpss). Incomplementary pair stability selection, a feature is considered selected ata given resampling iteration if it is selected in the two complementarysubsamples.

To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.

For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession. Alternatively,the function can be run manually with differentseeds and all otherparameters equal. The results can then be combined usingCombine.

Value

An object of classvariable_selection. A list with:

S

amatrix of the best stability scores for different parameters controllingthe level of sparsity in the underlying algorithm.

Lambda

a matrixof parameters controlling the level of sparsity in the underlyingalgorithm.

Q

a matrix of the average number of selected features bythe underlying algorithm with different parameters controlling the level ofsparsity.

Q_s

a matrix of the calibrated number of stably selectedfeatures with different parameters controlling the level of sparsity.

P

a matrix of calibrated thresholds in selection proportions fordifferent parameters controlling the level of sparsity in the underlyingalgorithm.

PFER

a matrix of upper-bounds in PFER of calibratedstability selection models with different parameters controlling the levelof sparsity.

FDP

a matrix of upper-bounds in FDP of calibratedstability selection models with different parameters controlling the levelof sparsity.

S_2d

a matrix of stability scores obtained withdifferent combinations of parameters. Columns correspond to differentthresholds in selection proportions.

PFER_2d

a matrix ofupper-bounds in FDP obtained with different combinations of parameters.Columns correspond to different thresholds in selection proportions.

FDP_2d

a matrix of upper-bounds in PFER obtained with differentcombinations of parameters. Columns correspond to different thresholds inselection proportions.

selprop

a matrix of selection proportions.Columns correspond to predictors fromxdata.

Beta

an arrayof model coefficients. Columns correspond to predictors fromxdata.Indices along the third dimension correspond to different resamplingiterations. With multivariate outcomes, indices along the fourth dimensioncorrespond to outcome-specific coefficients.

method

a list withtype="variable_selection" and values used for argumentsimplementation,family,resampling,cpss andPFER_method.

params

a list with values used for argumentsK,pi_list,tau,n_cat,pk,n(number of observations),PFER_thr,FDP_thr andseed.The datasetsxdata andydata are also included ifoutput_data=TRUE.

For all matrices and arrays returned, the rowsare ordered in the same way and correspond to parameter values stored inLambda.

References

Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.

Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.

Jacobucci R, Grimm KJ, McArdle JJ (2016).“Regularized structural equation modeling.”Structural equation modeling: a multidisciplinary journal,23(4), 555–566.doi:10.1080/10705511.2016.1154793.

See Also

SelectionAlgo,Resample,StabilityScore

Other stability functions:BiSelection(),Clustering(),GraphicalModel(),VariableSelection()

Examples

oldpar <- par(no.readonly = TRUE)par(mar = rep(7, 4))# Data simulationset.seed(1)pk <- c(3, 2, 3)simul <- SimulateStructural(  n = 500,  pk = pk,  nu_between = 0.5,  v_between = 1,  v_sign = 1)# Stability selection (using glmnet)dag <- LayeredDAG(layers = pk)stab <- StructuralModel(  xdata = simul$data,  adjacency = dag)CalibrationPlot(stab)LinearSystemMatrix(vect = Stable(stab), adjacency = dag)# Stability selection (using OpenMx)if (requireNamespace("OpenMx", quietly = TRUE)) {  stab <- StructuralModel(    xdata = simul$data,    implementation = PenalisedOpenMx,    Lambda = seq(50, 500, by = 50),    adjacency = dag  )  CalibrationPlot(stab)  OpenMxMatrix(SelectedVariables(stab), adjacency = dag)}## Not run: # Data simulation with latent variablesset.seed(1)pk <- c(3, 2, 3)simul <- SimulateStructural(  n = 500,  pk = pk,  nu_between = 0.5,  v_sign = 1,  v_between = 1,  n_manifest = 3,  ev_manifest = 0.95)# Stability selection (using OpenMx)if (requireNamespace("OpenMx", quietly = TRUE)) {  dag <- LayeredDAG(layers = pk, n_manifest = 3)  penalised <- dag  penalised[, seq_len(ncol(simul$data))] <- 0  stab <- StructuralModel(    xdata = simul$data,    implementation = PenalisedOpenMx,    adjacency = dag,    penalised = penalised,    Lambda = seq(10, 100, by = 20),    K = 10 # to increase for real use  )  CalibrationPlot(stab)  ids_latent <- grep("f", colnames(dag))  OpenMxMatrix(SelectedVariables(stab),    adjacency = dag  )[ids_latent, ids_latent]}## End(Not run)par(oldpar)

Unweighted K-means clustering

Description

Runs k-means clustering using implementation fromkmeans. This function is not using stability.

Usage

UnweightedKMeansClustering(xdata, nc = NULL, ...)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in theunderlying algorithm specified inimplementation. Ifnc isnot provided, it is set toseq(1, tau*nrow(xdata)).

...

additional parameters passed tokmeans.

Value

A list with:

comembership

an array of binary and symmetricco-membership matrices.

weights

a matrix of median weights byfeature.


Stability selection in regression

Description

Performs stability selection for regression models. The underlying variableselection algorithm (e.g. LASSO regression) is run with differentcombinations of parameters controlling the sparsity (e.g. penalty parameter)and thresholds in selection proportions. These two hyper-parameters arejointly calibrated by maximisation of the stability score.

Usage

VariableSelection(  xdata,  ydata = NULL,  Lambda = NULL,  pi_list = seq(0.01, 0.99, by = 0.01),  K = 100,  tau = 0.5,  seed = 1,  n_cat = NULL,  family = "gaussian",  implementation = PenalisedRegression,  resampling = "subsampling",  cpss = FALSE,  PFER_method = "MB",  PFER_thr = Inf,  FDP_thr = Inf,  Lambda_cardinal = 100,  group_x = NULL,  group_penalisation = FALSE,  optimisation = c("grid_search", "nloptr"),  n_cores = 1,  output_data = FALSE,  verbose = TRUE,  beep = NULL,  ...)

Arguments

xdata

matrix of predictors with observations as rows and variables ascolumns.

ydata

optional vector or matrix of outcome(s). Iffamily is setto"binomial" or"multinomial",ydata can be a vectorwith character/numeric values or a factor.

Lambda

matrix of parameters controlling the level of sparsity in theunderlying feature selection algorithm specified inimplementation.IfLambda=NULL andimplementation=PenalisedRegression,LambdaGridRegression is used to define a relevant grid.

pi_list

vector of thresholds in selection proportions. Ifn_cat=NULL orn_cat=2, these values must be>0 and<1. Ifn_cat=3, these values must be>0.5 and<1.

K

number of resampling iterations.

tau

subsample size. Only used ifresampling="subsampling" andcpss=FALSE.

seed

value of the seed to initialise the random number generator andensure reproducibility of the results (seeset.seed).

n_cat

computation options for the stability score. Default isNULL to use the score based on a z test. Other possible values are 2or 3 to use the score based on the negative log-likelihood.

family

type of regression model. This argument is defined as inglmnet. Possible values include"gaussian"(linear regression),"binomial" (logistic regression),"multinomial" (multinomial regression), and"cox" (survivalanalysis).

implementation

function to use for variable selection. Possiblefunctions are:PenalisedRegression,SparsePLS,GroupPLS andSparseGroupPLS. Alternatively, a user-definedfunction can be provided.

resampling

resampling approach. Possible values are:"subsampling" for sampling without replacement of a proportiontau of the observations, or"bootstrap" for sampling withreplacement generating a resampled dataset with as many observations as inthe full sample. Alternatively, this argument can be a function to use forresampling. This function must use arguments nameddata andtau and return the IDs of observations to be included in theresampled dataset.

cpss

logical indicating if complementary pair stability selectionshould be done. For this, the algorithm is applied on two non-overlappingsubsets of half of the observations. A feature is considered as selected ifit is selected for both subsamples. With this method, the data is splitK/2 times (K models are fitted). Only used ifPFER_method="MB".

PFER_method

method used to compute the upper-bound of the expectednumber of False Positives (or Per Family Error Rate, PFER). IfPFER_method="MB", the method proposed by Meinshausen and Bühlmann(2010) is used. IfPFER_method="SS", the method proposed by Shah andSamworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by errorcontrol. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selectedfeatures (or False Discovery Proportion) for constrained calibration byerror control. IfPFER_thr=Inf andFDP_thr=Inf, unconstrainedcalibration is used (the default).

Lambda_cardinal

number of values in the grid of parameters controllingthe level of sparsity in the underlying algorithm. Only used ifLambda=NULL.

group_x

vector encoding the grouping structure among predictors. Thisargument indicates the number of variables in each group. Only used formodels with group penalisation (e.g.implementation=GroupPLS orimplementation=SparseGroupPLS).

group_penalisation

logical indicating if a group penalisation shouldbe considered in the stability score. The use ofgroup_penalisation=TRUE strictly applies to group (not sparse-group)penalisation.

optimisation

character string indicating the type of optimisationmethod. Withoptimisation="grid_search" (the default), all values inLambda are visited. Alternatively, optimisation algorithmsimplemented innloptr can be used withoptimisation="nloptr". By default, we use"algorithm"="NLOPT_GN_DIRECT_L","xtol_abs"=0.1,"ftol_abs"=0.1 and"maxeval"=Lambda_cardinal. These valuescan be changed by providing the argumentopts (seenloptr). For stability selection using penalisedregression,optimisation="grid_search" may be faster as it allowsfor warm start.

n_cores

number of cores to use for parallel computing (see argumentworkers inmultisession). Usingn_cores>1 is only supported withoptimisation="grid_search".

output_data

logical indicating if the input datasetsxdata andydata should be included in the output.

verbose

logical indicating if a loading bar and messages should beprinted.

beep

sound indicating the end of the run. Possible values are:NULL (no sound) or an integer between 1 and 11 (see argumentsound inbeep).

...

additional parameters passed to the functions provided inimplementation orresampling.

Details

In stability selection, a feature selection algorithm is fitted onK subsamples (or bootstrap samples) of the data with differentparameters controlling the sparsity (Lambda). For a given (set of)sparsity parameter(s), the proportion out of theK models in whicheach feature is selected is calculated. Features with selection proportionsabove a threshold pi are considered stably selected. The stabilityselection model is controlled by the sparsity parameter(s) for theunderlying algorithm, and the threshold in selection proportion:

V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}

If argumentgroup_penalisation=FALSE, "feature" refers to variable(variable selection model). If argumentgroup_penalisation=TRUE,"feature" refers to group (group selection model). In this case, groupsneed to be defineda priori and specified in argumentgroup_x.

These parameters can be calibrated by maximisation of a stability score(seeConsensusScore ifn_cat=NULL orStabilityScore otherwise) calculated under the nullhypothesis of equiprobability of selection.

It is strongly recommended to examine the calibration plot carefully tocheck that the grids of parametersLambda andpi_list do notrestrict the calibration to a region that would not include the globalmaximum (seeCalibrationPlot). In particular, the gridLambda may need to be extended when the maximum stability isobserved on the left or right edges of the calibration heatmap. In someinstances, multiple peaks of stability score can be observed. Simulationstudies suggest that the peak corresponding to the largest number ofselected features tend to give better selection performances. This is notnecessarily the highest peak (which is automatically retained by thefunctions in this package). The user can decide to manually choose anotherpeak.

To control the expected number of False Positives (Per Family Error Rate)in the results, a thresholdPFER_thr can be specified. Theoptimisation problem is then constrained to sets of parameters thatgenerate models with an upper-bound in PFER belowPFER_thr (seeMeinshausen and Bühlmann (2010) and Shah and Samworth (2013)).

Possible resampling procedures include defining (i)K subsamples ofa proportiontau of the observations, (ii)K bootstrapsamples with the full sample size (obtained with replacement), and (iii)K/2 splits of the data in half for complementary pair stabilityselection (see argumentsresampling andcpss). Incomplementary pair stability selection, a feature is considered selected ata given resampling iteration if it is selected in the two complementarysubsamples.

For categorical or time to event outcomes (argumentfamily is"binomial","multinomial" or"cox"), the proportionsof observations from each category in all subsamples or bootstrap samplesare the same as in the full sample.

To ensure reproducibility of the results, the starting number of the randomnumber generator is set toseed.

For parallelisation, stability selection with different sets of parameterscan be run onn_cores cores. Usingn_cores > 1 creates amultisession. Alternatively,the function can be run manually with differentseeds and all otherparameters equal. The results can then be combined usingCombine.

Value

An object of classvariable_selection. A list with:

S

amatrix of the best stability scores for different parameters controllingthe level of sparsity in the underlying algorithm.

Lambda

a matrixof parameters controlling the level of sparsity in the underlyingalgorithm.

Q

a matrix of the average number of selected features bythe underlying algorithm with different parameters controlling the level ofsparsity.

Q_s

a matrix of the calibrated number of stably selectedfeatures with different parameters controlling the level of sparsity.

P

a matrix of calibrated thresholds in selection proportions fordifferent parameters controlling the level of sparsity in the underlyingalgorithm.

PFER

a matrix of upper-bounds in PFER of calibratedstability selection models with different parameters controlling the levelof sparsity.

FDP

a matrix of upper-bounds in FDP of calibratedstability selection models with different parameters controlling the levelof sparsity.

S_2d

a matrix of stability scores obtained withdifferent combinations of parameters. Columns correspond to differentthresholds in selection proportions.

PFER_2d

a matrix ofupper-bounds in FDP obtained with different combinations of parameters.Columns correspond to different thresholds in selection proportions.

FDP_2d

a matrix of upper-bounds in PFER obtained with differentcombinations of parameters. Columns correspond to different thresholds inselection proportions.

selprop

a matrix of selection proportions.Columns correspond to predictors fromxdata.

Beta

an arrayof model coefficients. Columns correspond to predictors fromxdata.Indices along the third dimension correspond to different resamplingiterations. With multivariate outcomes, indices along the fourth dimensioncorrespond to outcome-specific coefficients.

method

a list withtype="variable_selection" and values used for argumentsimplementation,family,resampling,cpss andPFER_method.

params

a list with values used for argumentsK,pi_list,tau,n_cat,pk,n(number of observations),PFER_thr,FDP_thr andseed.The datasetsxdata andydata are also included ifoutput_data=TRUE.

For all matrices and arrays returned, the rowsare ordered in the same way and correspond to parameter values stored inLambda.

References

Bodinier B, Rodrigues S, Karimi M, Filippi S, Chiquet J, Chadeau-Hyam M (2025).“Stability Selection and Consensus Clustering in R: The R Package sharp.”Journal of Statistical Software,112(5), btad635.doi:10.18637/jss.v112.i05.

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023).“Automated calibration for stability selection in penalised regression and graphical models.”Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058.ISSN 0035-9254,doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Shah RD, Samworth RJ (2013).“Variable selection with error control: another look at stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),75(1), 55-80.doi:10.1111/j.1467-9868.2011.01034.x.

Meinshausen N, Bühlmann P (2010).“Stability selection.”Journal of the Royal Statistical Society: Series B (Statistical Methodology),72(4), 417-473.doi:10.1111/j.1467-9868.2010.00740.x.

Tibshirani R (1996).“Regression Shrinkage and Selection via the Lasso.”Journal of the Royal Statistical Society. Series B (Methodological),58(1), 267–288.ISSN 00359246,http://www.jstor.org/stable/2346178.

See Also

PenalisedRegression,SelectionAlgo,LambdaGridRegression,Resample,StabilityScoreRefit,ExplanatoryPerformance,Incremental,

Other stability functions:BiSelection(),Clustering(),GraphicalModel(),StructuralModel()

Examples

oldpar <- par(no.readonly = TRUE)par(mar = rep(7, 4))# Linear regressionset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")stab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata,  family = "gaussian")# Calibration plotCalibrationPlot(stab)# Extracting the resultssummary(stab)Stable(stab)SelectionProportions(stab)plot(stab)# Using randomised LASSOstab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata,  family = "gaussian", penalisation = "randomised")plot(stab)# Using adaptive LASSOstab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata,  family = "gaussian", penalisation = "adaptive")plot(stab)# Using additional arguments from glmnet (e.g. penalty.factor)stab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata, family = "gaussian",  penalty.factor = c(rep(1, 45), rep(0, 5)))head(coef(stab))# Using CARTif (requireNamespace("rpart", quietly = TRUE)) {  stab <- VariableSelection(    xdata = simul$xdata, ydata = simul$ydata,    implementation = CART,    family = "gaussian",  )  plot(stab)}# Regression with multivariate outcomesset.seed(1)simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian")stab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata,  family = "mgaussian")summary(stab)# Logistic regressionset.seed(1)simul <- SimulateRegression(n = 200, pk = 10, family = "binomial", ev_xy = 0.8)stab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata,  family = "binomial")summary(stab)# Sparse PCA (1 component, see BiSelection for more components)if (requireNamespace("elasticnet", quietly = TRUE)) {  set.seed(1)  simul <- SimulateComponents(pk = c(5, 3, 4))  stab <- VariableSelection(    xdata = simul$data,    Lambda = seq_len(ncol(simul$data) - 1),    implementation = SparsePCA  )  CalibrationPlot(stab, xlab = "")  summary(stab)}# Sparse PLS (1 outcome, 1 component, see BiSelection for more options)if (requireNamespace("sgPLS", quietly = TRUE)) {  set.seed(1)  simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")  stab <- VariableSelection(    xdata = simul$xdata, ydata = simul$ydata,    Lambda = seq_len(ncol(simul$xdata) - 1),    implementation = SparsePLS, family = "gaussian"  )  CalibrationPlot(stab, xlab = "")  SelectedVariables(stab)}# Group PLS (1 outcome, 1 component, see BiSelection for more options)if (requireNamespace("sgPLS", quietly = TRUE)) {  stab <- VariableSelection(    xdata = simul$xdata, ydata = simul$ydata,    Lambda = seq_len(5),    group_x = c(5, 5, 10, 20, 10),    group_penalisation = TRUE,    implementation = GroupPLS, family = "gaussian"  )  CalibrationPlot(stab, xlab = "")  SelectedVariables(stab)}# Example with more hyper-parameters: elastic netset.seed(1)simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")TuneElasticNet <- function(xdata, ydata, family, alpha) {  stab <- VariableSelection(    xdata = xdata, ydata = ydata,    family = family, alpha = alpha, verbose = FALSE  )  return(max(stab$S, na.rm = TRUE))}myopt <- optimise(TuneElasticNet,  lower = 0.1, upper = 1, maximum = TRUE,  xdata = simul$xdata, ydata = simul$ydata,  family = "gaussian")stab <- VariableSelection(  xdata = simul$xdata, ydata = simul$ydata,  family = "gaussian", alpha = myopt$maximum)summary(stab)enet <- SelectedVariables(stab)# Comparison with LASSOstab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")summary(stab)lasso <- SelectedVariables(stab)table(lasso, enet)# Example using an external function: group-LASSO with gglassoif (requireNamespace("gglasso", quietly = TRUE)) {  set.seed(1)  simul <- SimulateRegression(n = 200, pk = 20, family = "binomial")  ManualGridGroupLasso <- function(xdata, ydata, family, group_x, ...) {    # Defining the grouping    group <- do.call(c, lapply(seq_len(length(group_x)), FUN = function(i) {      rep(i, group_x[i])    }))    if (family == "binomial") {      ytmp <- ydata      ytmp[ytmp == min(ytmp)] <- -1      ytmp[ytmp == max(ytmp)] <- 1      return(gglasso::gglasso(xdata, ytmp, loss = "logit", group = group, ...))    } else {      return(gglasso::gglasso(xdata, ydata, lambda = lambda, loss = "ls", group = group, ...))    }  }  Lambda <- LambdaGridRegression(    xdata = simul$xdata, ydata = simul$ydata,    family = "binomial", Lambda_cardinal = 20,    implementation = ManualGridGroupLasso,    group_x = rep(5, 4)  )  GroupLasso <- function(xdata, ydata, Lambda, family, group_x, ...) {    # Defining the grouping    group <- do.call(c, lapply(seq_len(length(group_x)), FUN = function(i) {      rep(i, group_x[i])    }))    # Running the regression    if (family == "binomial") {      ytmp <- ydata      ytmp[ytmp == min(ytmp)] <- -1      ytmp[ytmp == max(ytmp)] <- 1      mymodel <- gglasso::gglasso(xdata, ytmp, lambda = Lambda, loss = "logit", group = group, ...)    }    if (family == "gaussian") {      mymodel <- gglasso::gglasso(xdata, ydata, lambda = Lambda, loss = "ls", group = group, ...)    }    # Extracting and formatting the beta coefficients    beta_full <- t(as.matrix(mymodel$beta))    beta_full <- beta_full[, colnames(xdata)]    selected <- ifelse(beta_full != 0, yes = 1, no = 0)    return(list(selected = selected, beta_full = beta_full))  }  stab <- VariableSelection(    xdata = simul$xdata, ydata = simul$ydata,    implementation = GroupLasso, family = "binomial", Lambda = Lambda,    group_x = rep(5, 4),    group_penalisation = TRUE  )  summary(stab)}par(oldpar)

Stable attribute weights

Description

Creates a boxplots of the distribution of (calibrated) median attributeweights obtained from the COSA algorithm across the subsampling iterations.See examples inClustering.

Usage

WeightBoxplot(  stability,  at = NULL,  argmax_id = NULL,  col = NULL,  boxwex = 0.3,  xlab = "",  ylab = "Weight",  cex.lab = 1.5,  las = 3,  frame = "F",  add = FALSE,  ...)

Arguments

stability

output ofClustering.

at

coordinates along the x-axis (more details inboxplot).

argmax_id

optional indices of hyper-parameters. Ifargmax_id=NULL, the calibrated hyper-parameters are used.

col

optional vector of colours.

boxwex

box width (more details inboxplot).

xlab

label of the x-axis.

ylab

label of the y-axis.

cex.lab

font size for labels.

las

orientation of labels on the x-axis (seepar).

frame

logical indicating if the box around the plot should be drawn(more details inboxplot).

add

logical indicating if the boxplot should be added to the currentplot.

...

additional parameters passed toboxplot).

Value

A boxplot.

See Also

Clustering


Star-shaped nodes

Description

Produces star-shaped nodes in an igraph object.

Usage

mystar(coords, v = NULL, params)

Arguments

coords

a matrix of coordinates(seeadd_shape).

v

a vector of node IDs(seeadd_shape).

params

node graphical parameters(seeadd_shape).

See Also

add_shape


Triangular nodes

Description

Produces triangular nodes in an igraph object.

Usage

mytriangle(coords, v = NULL, params)

Arguments

coords

a matrix of coordinates(seeadd_shape).

v

a vector of node IDs(seeadd_shape).

params

node graphical parameters(seeadd_shape).

See Also

add_shape


Consensus matrix heatmap

Description

Creates a heatmap of the (calibrated) consensus matrix. See examples inClustering.

Usage

## S3 method for class 'clustering'plot(  x,  linkage = "complete",  argmax_id = NULL,  theta = NULL,  theta_star = NULL,  col = c("ivory", "navajowhite", "tomato", "darkred"),  lines = TRUE,  col.lines = c("blue"),  lwd.lines = 2,  tick = TRUE,  axes = TRUE,  col.axis = NULL,  cex.axis = 1,  xlas = 2,  ylas = 2,  bty = "n",  ...)

Arguments

x

output ofClustering.

linkage

character string indicating the type of linkage used inhierarchical clustering to define the stable clusters. Possible valuesinclude"complete","single" and"average" (seeargument"method" inhclust for a full list).

argmax_id

optional indices of hyper-parameters. Ifargmax_id=NULL, the calibrated hyper-parameters are used.

theta

optional vector of cluster membership. If provided, the orderingof the items should be the same as inClusters. This argumentis used to re-order the consensus matrix.

theta_star

optional vector of true cluster membership. If provided,the ordering of the items should be the same as inClusters.This argument is used to define item colours.

col

vector of colours.

lines

logical indicating if lines separating the clusters provided intheta should be displayed.

col.lines

colour of the lines separating the clusters.

lwd.lines

width of the lines separating the clusters.

tick

logical indicating if axis tickmarks should be displayed.

axes

logical indicating if item labels should be displayed.

col.axis

optional vector of cluster colours.

cex.axis

font size for axes.

xlas

orientation of labels on the x-axis, aslas inpar.

ylas

orientation of labels on the y-axis, aslas inpar.

bty

character string indicating if the box around the plot should bedrawn. Possible values include:"o" (default, the box is drawn), or"n" (no box).

...

additional arguments passed toHeatmap.

Value

A heatmap.


Plot of incremental performance

Description

Represents prediction performances upon sequential inclusion of thepredictors in a logistic or Cox regression model as produced byIncremental. The median andquantiles of the performancemetric are reported. See examples inIncremental.

Usage

## S3 method for class 'incremental'plot(  x,  quantiles = c(0.05, 0.95),  col = c("red", "grey"),  col.axis = NULL,  xgrid = FALSE,  ygrid = FALSE,  output_data = FALSE,  ...)IncrementalPlot(  x,  quantiles = c(0.05, 0.95),  col = c("red", "grey"),  col.axis = NULL,  xgrid = FALSE,  ygrid = FALSE,  output_data = FALSE,  ...)PlotIncremental(  x,  quantiles = c(0.05, 0.95),  col = c("red", "grey"),  col.axis = NULL,  xgrid = FALSE,  ygrid = FALSE,  output_data = FALSE,  ...)

Arguments

x

output ofIncremental.

quantiles

quantiles defining the lower and upper bounds.

col

vector of colours by stable selection status.

col.axis

optional vector of label colours by stable selection status.

xgrid

logical indicating if a vertical grid should be drawn.

ygrid

logical indicating if a horizontal grid should be drawn.

output_data

logical indicating if the median and quantiles should bereturned in a matrix.

...

additional plotting arguments (seepar).

Value

A plot.

See Also

Incremental


Receiver Operating Characteristic (ROC) band

Description

Plots the True Positive Rate (TPR) as a function of the False Positive Rate(FPR) for different thresholds in predicted probabilities. If the resultsfrom multiple ROC analyses are provided (e.g. output ofExplanatoryPerformance with largeK), the point-wisemedian is represented and flanked by a transparent band defined by point-wisequantiles. See examples inROC andExplanatoryPerformance.

Usage

## S3 method for class 'roc_band'plot(  x,  col_band = NULL,  alpha = 0.5,  quantiles = c(0.05, 0.95),  add = FALSE,  ...)

Arguments

x

output ofROC orExplanatoryPerformance.

col_band

colour of the band defined by point-wisequantiles.

alpha

level of opacity for the band.

quantiles

point-wise quantiles of the performances defining the band.

add

logical indicating if the curve should be added to the currentplot.

...

additional plotting arguments (seepar).

Value

A base plot.

See Also

ROC,ExplanatoryPerformance


Plot of selection proportions

Description

Makes a barplot of selection proportions in decreasing order. See examples inVariableSelection.

Usage

## S3 method for class 'variable_selection'plot(  x,  col = c("red", "grey"),  col.axis = NULL,  col.thr = "darkred",  lty.thr = 2,  n_predictors = NULL,  ...)

Arguments

x

output ofVariableSelection.

col

vector of colours by stable selection status.

col.axis

optional vector of label colours by stable selection status.

col.thr

threshold colour.

lty.thr

threshold line type aslty inpar.

n_predictors

number of predictors to display.

...

additional plotting arguments (seepar).

Value

A plot.

See Also

VariableSelection


Predict method for stability selection

Description

Computes predicted values from the output ofVariableSelection.

Usage

## S3 method for class 'variable_selection'predict(  object,  xdata,  ydata,  newdata = NULL,  method = c("ensemble", "refit"),  ...)

Arguments

object

output ofVariableSelection.

xdata

predictor data (training set).

ydata

outcome data (training set).

newdata

optional predictor data (test set).

method

character string indicating if predictions should be obtainedfrom anEnsemble model (ifmethod="ensemble") or aRefitted model (ifmethod="refit").

...

additional arguments passed topredict.

Value

Predicted values.

See Also

Refit,Ensemble,EnsemblePredictions

Examples

## Linear regression# Data simulationset.seed(1)simul <- SimulateRegression(n = 500, pk = 50, family = "gaussian")# Training/test splitids <- Split(data = simul$ydata, tau = c(0.8, 0.2))# Stability selectionstab <- VariableSelection(  xdata = simul$xdata[ids[[1]], ],  ydata = simul$ydata[ids[[1]], ])# Predictions from post stability selection estimationyhat <- predict(stab,  xdata = simul$xdata[ids[[1]], ],  ydata = simul$ydata[ids[[1]], ],  newdata = simul$xdata[ids[[2]], ],  method = "refit")cor(simul$ydata[ids[[2]], ], yhat)^2 # Q-squared# Predictions from ensemble modelyhat <- predict(stab,  xdata = simul$xdata[ids[[1]], ],  ydata = simul$ydata[ids[[1]], ],  newdata = simul$xdata[ids[[2]], ],  method = "ensemble")cor(simul$ydata[ids[[2]], ], yhat)^2 # Q-squared## Logistic regression# Data simulationset.seed(1)simul <- SimulateRegression(n = 500, pk = 20, family = "binomial", ev_xy = 0.9)# Training/test splitids <- Split(data = simul$ydata, family = "binomial", tau = c(0.8, 0.2))# Stability selectionstab <- VariableSelection(  xdata = simul$xdata[ids[[1]], ],  ydata = simul$ydata[ids[[1]], ],  family = "binomial")# Predictions from post stability selection estimationyhat <- predict(stab,  xdata = simul$xdata[ids[[1]], ],  ydata = simul$ydata[ids[[1]], ],  newdata = simul$xdata[ids[[2]], ],  method = "refit", type = "response")plot(ROC(predicted = yhat, observed = simul$ydata[ids[[2]], ]))# Predictions from ensemble modelyhat <- predict(stab,  xdata = simul$xdata[ids[[1]], ],  ydata = simul$ydata[ids[[1]], ],  newdata = simul$xdata[ids[[2]], ],  method = "ensemble", type = "response")plot(ROC(predicted = yhat, observed = simul$ydata[ids[[2]], ]),  add = TRUE,  col = "blue")

[8]ページ先頭

©2009-2025 Movatter.jp