Movatterモバイル変換

Type:

Package

Title:

Hierarchical Shrinkage Stan Models for Biomarker Selection

Version:

0.8.2

Date:

2024-01-13

Description:

Linear and logistic regression models penalized with hierarchical shrinkage priors for selection of biomarkers (or more general variable selection), which can be fitted using Stan (Carpenter et al. (2017) <doi:10.18637/jss.v076.i01>). It implements the horseshoe and regularized horseshoe priors (Piironen and Vehtari (2017) <doi:10.1214/17-EJS1337SI>), as well as the projection predictive selection approach to recover a sparse set of predictive biomarkers (Piironen, Paasiniemi and Vehtari (2020) <doi:10.1214/20-EJS1711>).

License:

GPL-3

URL:

https://github.com/mcol/hsstan

BugReports:

https://github.com/mcol/hsstan/issues

LazyData:

yes

Biarch:

true

Depends:

R (≥ 3.6)

Imports:

ggplot2, loo (≥ 2.1.0), parallel, pROC, Rcpp, methods, rstan(≥ 2.26.0), rstantools (≥ 2.0.0), stats, utils

LinkingTo:

BH (≥ 1.66.0.1), Rcpp (≥ 0.12.15), RcppEigen (≥0.3.3.4.0), RcppParallel (≥ 5.0.1), StanHeaders (≥ 2.26.0),rstan (≥ 2.26.0)

Suggests:

testthat (≥ 2.1.0)

ByteCompile:

yes

NeedsCompilation:

yes

SystemRequirements:

GNU make

Encoding:

UTF-8

RoxygenNote:

7.2.1

Packaged:

2024-01-13 20:59:47 UTC; mcolombo

Author:

Marco Colombo

[aut, cre], Paul McKeigue

[aut], Athina Spiliopoulou

[ctb]

Maintainer:

Marco Colombo <mar.colombo13@gmail.com>

Repository:

CRAN

Date/Publication:

2024-01-13 21:20:02 UTC

Hierarchical shrinkage Stan models for biomarker selection

Description

Thehsstan package provides linear and logistic regression modelspenalized with hierarchical shrinkage priors for selection of biomarkers.Models are fitted with Stan (Carpenter et al. (2017)), which allows toperform full Bayesian inference.

Details

The package implements the horseshoe and regularized horseshoe priors(Piironen and Vehtari (2017)), and the projection predictive selectionapproach to recover a sparse set of predictive biomarkers (Piironen,Paasiniemi and Vehtari (2020)).

The approach is particularly suited to selection from high-dimensionalpanels of biomarkers, such as those that can be measured by MSMS or similartechnologies (Colombo, Valo, McGurnaghan et al. (2019), Colombo, McGurnaghan,Blackbourn et al. (2020)).

References

B. Carpenter et al. (2017),Stan: a probabilistic programming language,Journal of Statistical Software, 76 (1).doi:10.18637/jss.v076.i01

J. Piironen and A. Vehtari (2017),Sparsity information and regularization in the horseshoe and other shrinkagepriors,Electronic Journal of Statistics, 11 (2), 5018-5051.doi:10.1214/17-EJS1337SI

J. Piironen, M. Paasiniemi and A. Vehtari (2020),Projective inference in high-dimensional problems: prediction and featureselection,Electronic Journal of Statistics, 14 (1), 2155-2197.doi:10.1214/20-EJS1711

M. Colombo, E. Valo, S.J. McGurnaghan et al. (2019),Biomarkers associated with progression of renal disease in type 1 diabetes,Diabetologia, 62 (9), 1616-1627.doi:10.1007/s00125-019-4915-0

M. Colombo, S.J. McGurnaghan, L.A.K. Blackbourn et al. (2020),Comparison of serum and urinary biomarker panels with albumin creatininratio in the prediction of renal function decline in type 1 diabetes,Diabetologia, 63 (4), 788-798.doi:10.1007/s00125-019-05081-8

M. Colombo, A. Asadi Shehni, I. Thoma et al. (2021),Quantitative levels of serum N-glycans in type 1 diabetes and theirassociation with kidney disease,Glycobiology, 31 (5), 613-623.doi:10.1093/glycob/cwaa106

Bayesian and LOO-adjusted R-squared

Description

Compute the Bayesian and the LOO-adjusted R-squared from the posteriorsamples. For Bayesian R-squared it uses the modelled residual variance(rather than the variance of the posterior distribution of the residuals).The LOO-adjusted R-squared uses Pareto smoothed importance sampling LOOresiduals and Bayesian bootstrap.

Usage

## S3 method for class 'hsstan'bayes_R2(object, prob = 0.95, summary = TRUE, ...)## S3 method for class 'hsstan'loo_R2(object, prob = 0.95, summary = TRUE, ...)

Arguments

object

An object of classhsstan.

prob

Width of the posterior interval (0.95, by default). It isignored ifsummary=FALSE.

summary

Whether a summary of the distribution of the R-squaredshould be returned rather than the pointwise values (TRUE bydefault).

...

Currently ignored.

Value

The mean, standard deviation and posterior interval of R-squared ifsummary=TRUE, or a vector of R-squared values with length equal tothe size of the posterior sample ifsummary=FALSE.

References

Andrew Gelman, Ben Goodrich, Jonah Gabry and Aki Vehtari (2019),R-squared for Bayesian regression models,The American Statistician, 73 (3), 307-309.doi:10.1080/00031305.2018.1549100

Aki Vehtari, Andrew Gelman, Ben Goodrich and Jonah Gabry (2019),Bayesian R2 and LOO-R2.https://avehtari.github.io/bayes_R2/bayes_R2.html

Examples

# continued from ?hsstanbayes_R2(hs.biom)loo_R2(hs.biom)

Next variable to enter the current submodel

Description

Return the index of the variable that should be added to the current modelaccording to the smallest KL-divergence (linear regression) or the largestscore test (logistic regression).

Usage

choose.next(x, sigma2, fit, fitp, chosen, is.logistic)

Arguments

x

Design matrix.

sigma2

Residual variance (1 for logistic regression).

fit

Matrix of fitted values for the full model.

fitp

Matrix of fitted values for the projected model.

chosen

Vector of indices of the columns ofx in the current submodel.

is.logistic

Set toTRUE for a logistic regression model.

Diabetes data with interaction terms

Description

The dataset consists of observations on 442 individuals for which aquantitative measure of diabetes progression is recorded in variableY.Predictors include 10 baseline measurements, 45 interactions and 9 quadraticterms, for a total of 64 variables for each individual. Each variable hasbeen standardized by subtracting the mean and then dividing it by itsstandard deviation.

Format

A data frame with 442 rows and 65 columns (centred and scaled).

Source

B. Efron, T. Hastie, I. Johnstone and R. Tibshirani (2004),Least angle regression,The Annals of Statistics, 32 (2), 407-499.doi:10.1214/009053604000000067

The original dataset is available fromhttps://web.stanford.edu/~hastie/Papers/LARS/data64.txt

Examples

data(diabetes, package="hsstan")

Compute the fit of a submodel

Description

Compute the fit of a submodel

Usage

fit.submodel(x, sigma2, mu, chosen, xt, yt, logistic)

Arguments

x

Design matrix.

sigma2

Residual variance (1 for logistic regression).

mu

Matrix of fitted values for the full model.

chosen

Vector of indices of the columns ofx in the current submodel.

xt

Design matrix for test data.

yt

Outcome variable for test data.

logistic

Set toTRUE for a logistic regression model.

Hierarchical shrinkage models

Description

Run the No-U-Turn Sampler (NUTS) as implemented in Stan to fit a hierarchicalshrinkage model.

Usage

hsstan(  x,  covs.model,  penalized = NULL,  family = gaussian,  iter = 2000,  warmup = floor(iter/2),  scale.u = 2,  regularized = TRUE,  nu = ifelse(regularized, 1, 3),  par.ratio = 0.05,  global.df = 1,  slab.scale = 2,  slab.df = 4,  qr = TRUE,  seed = 123,  adapt.delta = NULL,  keep.hs.pars = FALSE,  ...)

Arguments

x

Data frame containing outcome, covariates and penalized predictors.Continuous predictors and outcome variable should be standardizedbefore fitting the models as priors assume them to have mean zero andunit variance.

covs.model

Formula containing the unpenalized covariates.

penalized

Names of the variables to be used as penalized predictors.Any variable that is already part of thecovs.model formula will bepenalized. IfNULL or an empty vector, a model with only unpenalizedcovariates is fitted.

family

Type of model fitted: eithergaussian() for linear regression(default) orbinomial() for logistic regression.

iter

Total number of iterations in each chain, including warmup(2000 by default).

warmup

Number of warmup iterations per chain (by default, half thetotal number of iterations).

scale.u

Prior scale (standard deviation) for the unpenalizedcovariates.

regularized

IfTRUE (default), the regularized horseshoe prioris used as opposed to the original horseshoe prior.

nu

Number of degrees of freedom of the half-Student-t prior on thelocal shrinkage parameters (by default, 1 ifregularized=TRUEand 3 otherwise).

par.ratio

Expected ratio of non-zero to zero coefficients (ignoredifregularized=FALSE). The scale of the global shrinkage parametercorresponds topar.ratio divided by the square root of the number ofobservations; for linear regression only, it's further multiplied bythe residual standard deviationsigma.

global.df

Number of degrees of freedom for the global shrinkageparameter (ignored ifregularized=FALSE). Larger values induce moreshrinkage.

slab.scale

Scale of the regularization parameter (ignored ifregularized=FALSE).

slab.df

Number of degrees of freedom of the regularization parameter(ignored ifregularized=FALSE).

qr

Whether the thin QR decomposition should be used to decorrelate thepredictors (TRUE by default). This is silently set toFALSE ifthere are more predictors than observations.

seed

Optional integer defining the seed for the pseudo-random numbergenerator.

adapt.delta

Target average proposal acceptance probability foradaptation, a value between 0.8 and 1 (excluded). If unspecified,it's set to 0.99 for hierarchical shrinkage models and to 0.95 forbase models.

keep.hs.pars

Whether the parameters for the horseshoe prior should bekept in thestanfit object returned (FALSE by default).

...

Further arguments passed torstan::sampling(),such aschains (4 by default),cores (the value ofoptions("mc.cores") by default),refresh (iter / 10 by default).

Value

An object of classhsstan containing the following fields:

stanfit

an object of classstanfit containing the outputproduced by Stan, including posterior samples and diagnostic summaries.It can be manipulated using methods from therstan package.

betas

posterior means of the unpenalized and penalized regressionparameters.

call

the matched call.

data

the dataset used in fitting the model.

model.terms

a list of names for the outcome variable, the unpenalizedcovariates and the penalized predictors.

family

thefamily object used.

hsstan.settings

the optional settings used in the model.

Examples

data(diabetes)# non-default settings for speed of the exampledf <- diabetes[1:50, ]hs.biom <- hsstan(df, Y ~ age + sex, penalized=colnames(df)[5:10],                  chains=2, iter=250)

K-fold cross-validation

Description

Perform K-fold cross-validation using the same settings used when fittingthe model on the whole data.

Usage

## S3 method for class 'hsstan'kfold(  x,  folds,  chains = 1,  store.fits = TRUE,  cores = getOption("mc.cores", 1),  ...)

Arguments

x

An object of classhsstan.

folds

Integer vector with one element per observation indicating thecross-validation fold in which the observation should be withdrawn.

chains

Number of Markov chains to run. By default this is set to 1,independently of the number of chains used forx.

store.fits

Whether the fitted models for each fold should be storedin the returned object (TRUE by default).

cores

Number of cores to use for parallelization (the value ofoptions("mc.cores") by default). The cross-validation folds willbe distributed to the available cores, and the Markov chains for eachmodel will be run sequentially.

...

Further arguments passed torstan::sampling().

Value

An object with classeskfold andloo that has a similar structure as theobjects returned byloo() andwaic() and is compatible with theloo_compare function forcomparing models. The object contains the following fields:

estimates

a matrix containing point estimates and standard errors ofthe expected log pointwise predictive density ("elpd_kfold"),the effective number of parameters ("p_kfold", alwaysNA) and theK-fold information criterion "kfoldic" (which is-2 * elpd_kfold,i.e., converted to the deviance scale).

pointwise

a matrix containing the pointwise contributions of"elpd_kfold", "p_kfold" and "kfoldic".

fits

a matrix with two columns and number of rows equal to the numberof cross-validation folds. Columnfit contains the fittedhsstan objects for each fold, and columntest.idx containsthe indices of the withdrawn observations for each fold. This is notpresent ifstore.fits=FALSE.

data

the dataset used in fitting the model (before withdrawingobservations). This is not present ifstore.fits=FALSE.

Examples

# continued from ?hsstan# only 2 folds for speed of examplefolds <- rep(1:2, length.out=length(df$Y))cv.biom <- kfold(hs.biom, folds=folds, cores=2)

Compute projections of full predictors on to subspace of predictors

Description

Compute projections of full predictors on to subspace of predictors

Usage

lm_proj(x, fit, sigma2, indproj, is.logistic)

Arguments

x

Design matrix.

fit

Matrix of fitted values for the full model.

sigma2

Residual variance (1 for logistic regression).

indproj

Vector of indices of the columns ofx that form theprojection subspace.

is.logistic

Set toTRUE for a logistic regression model.

Pointwise log-likelihood

Description

Compute the pointwise log-likelihood.

Usage

## S3 method for class 'hsstan'log_lik(object, newdata = NULL, ...)

Arguments

object

An object of classhsstan.

newdata

Optional data frame to use to evaluate the log-likelihood.IfNULL (default), the model matrix is used.

...

Currently ignored.

Value

A matrix of sizeS byN, whereS is number of draws from the posteriordistribution, andN is the number of data points.

Examples

# continued from ?hsstanlog_lik(hs.biom)

Predictive information criteria for Bayesian models

Description

Compute an efficient approximate leave-one-out cross-validationusing Pareto smoothed importance sampling (PSIS-LOO), or the widelyapplicable information criterion (WAIC), also known as the Watanabe-Akaikeinformation criterion.

Usage

## S3 method for class 'hsstan'loo(x, cores = getOption("mc.cores"), ...)## S3 method for class 'hsstan'waic(x, cores = getOption("mc.cores"), ...)

Arguments

x

An object of classhsstan.

cores

Number of cores used for parallelisation (the value ofoptions("mc.cores") by default).

...

Currently ignored.

Value

Aloo object.

Examples

# continued from ?hsstanloo(hs.biom)waic(hs.biom)

Number of posterior samples

Description

Extracts the number of posterior samples stored in a fitted model.

Usage

## S3 method for class 'hsstan'nsamples(object, ...)

Arguments

object

An object of classhsstan.

...

Currently ignored.

Value

The total number of posterior samples across the chains after discardingthe warmup iterations.

Examples

# continued from ?hsstannsamples(hs.biom)

Plot of relative explanatory power of predictors

Description

The function plots the relative explanatory power of each predictor in orderof selection. The relative explanatory power of predictors is computedaccording to the KL divergence from the full model to each submodel, scaledin such a way that the baseline model (either the null model or the modelcontaining only unpenalized covariates) is at 0, while the full model is at 1.

Usage

## S3 method for class 'projsel'plot(  x,  title = NULL,  max.points = NULL,  max.labels = NULL,  from.covariates = TRUE,  font.size = 12,  hadj = 0.05,  vadj = 0,  ...)

Arguments

x

A data frame created byprojsel().

title

Title of the plot. IfNULL, no title is displayed.

max.points

Maximum number of predictors to be plotted. IfNULL(default) or 0, all points are plotted.

max.labels

Maximum number of predictors to be labelled. IfNULL(default), all predictor labels present inx are displayed, whichmay result in overprinting.

from.covariates

Whether the plotting should start from the unpenalizedcovariates (TRUE by default). If set toFALSE, the plot includes apoint for the null (intercept-only) model.

font.size

Size of the textual elements (labels and axes).

hadj,vadj

Horizontal and vertical adjustment for the labels.

...

Currently ignored.

Value

Aggplot2 object showing the relative incremental contribution of eachpredictor starting from the initial set of unpenalized covariates.

Posterior uncertainty intervals

Description

Compute posterior uncertainty intervals forhsstan objects.

Usage

## S3 method for class 'hsstan'posterior_interval(object, pars = NULL, prob = 0.95, ...)

Arguments

object

An object of classhsstan.

pars

Names of parameters for which posterior intervals should bereturned, which can be specified as regular expressions. IfNULL(default) then this refers to the set of predictors used in the model.

prob

A value between 0 and 1 indicating the desired probabilityto be covered by the uncertainty intervals (0.95, by default).

...

Currently ignored.

Value

A matrix with lower and upper interval bounds as columns and as many rowsas selected parameters.

Examples

# continued from ?hsstanposterior_interval(hs.biom)

Posterior distribution of the linear predictor

Description

Extract the posterior draws of the linear predictor, possibly transformedby the inverse-link function.

Usage

## S3 method for class 'hsstan'posterior_linpred(object, transform = FALSE, newdata = NULL, ...)

Arguments

object

An object of classhsstan.

transform

Whether the linear predictor should be transformed usingthe inverse-link function (FALSE by default).

newdata

Optional data frame containing the variables to use topredict. IfNULL (default), the model matrix is used. Ifspecified, its continuous variables should be standardized, sincethe model coefficients are learnt on standardized data.

...

Currently ignored.

Value

A matrix of sizeS byN, whereS is the number of draws from theposterior distribution of the (transformed) linear predictor, andN isthe number of data points.

Examples

# continued from ?hsstanposterior_linpred(hs.biom)

Posterior measures of performance

Description

Compute the log-likelihood and a relevant measure of performance (R-squaredor AUC) from the posterior samples.

Usage

posterior_performance(  obj,  prob = 0.95,  sub.idx = NULL,  summary = TRUE,  cores = getOption("mc.cores", 1))

Arguments

obj

An object of classhsstan orkfold.

prob

Width of the posterior interval (0.95, by default). It isignored ifsummary=FALSE.

sub.idx

Vector of indices of observations in the dataset to be usedin computing the performance measures. IfNULL (default), allobservations in the dataset are used.

summary

Whether a summary of the distribution of the performancemeasure should be returned rather than the pointwise values(TRUE by default).

cores

Number of cores to use for parallelization (the value ofoptions("mc.cores") by default).

Value

The mean, standard deviation and posterior interval of the performancemeasure (R-squared or AUC) ifsummary=TRUE, or a vector of valuesof the performance measure with length equal to the size of the posteriorsample ifsummary=FALSE. Attributetype reports whether the performancemeasures are cross-validated or not. Ifsub.idx is notNULL, attributesubset reports the index of observations used in the computations.

Examples

# continued from ?hsstanposterior_performance(hs.biom, cores=1)

Posterior predictive distribution

Description

Draw from the posterior predictive distribution of the outcome.

Usage

## S3 method for class 'hsstan'posterior_predict(object, newdata = NULL, nsamples = NULL, seed = NULL, ...)

Arguments

object

An object of classhsstan.

newdata

nsamples

A positive integer indicating the number of posterior samplesto use. IfNULL (default) all samples are used.

seed

Optional integer defining the seed for the pseudo-random numbergenerator.

...

Currently ignored.

Value

A matrix of sizeS byN, whereS is the number of simulations fromthe posterior predictive distribution, andN is the number of data points.

Examples

# continued from ?hsstanposterior_predict(hs.biom)

Posterior summary

Description

Produce a summary of the posterior samples for the quantities of interest.

Usage

posterior_summary(x, prob, ...)## Default S3 method:posterior_summary(x, prob = 0.95, ...)## S3 method for class 'hsstan'posterior_summary(x, prob = 0.95, pars = NULL, ...)

Arguments

x

An object containing or representing posterior samples. Ifxis a matrix, it should have sizeS byQ, whereSis the number of posterior samples, andQ is the number ofquantities of interest.

prob

Width of the posterior intervals (0.95, by default).

...

Further arguments passed to or from other methods.

pars

Vector of parameter names to be extracted. IfNULL(default) then this refers to the set of predictors used in themodel.

Value

A matrix with columns containing mean, standard deviation and posteriorintervals for the given quantities.

Examples

# continued from ?hsstanposterior_summary(hs.biom)

Print a summary for the fitted model

Description

Print a summary for the fitted model

Usage

## S3 method for class 'hsstan'print(x, ...)

Arguments

x

An object of classhsstan.

...

Further arguments tosummary().

Forward selection minimizing KL-divergence in projection

Description

Forward selection minimizing KL-divergence in projection

Usage

projsel(obj, max.iters = 30, start.from = NULL, out.csv = NULL)

Arguments

obj

Object of classhsstan.

max.iters

Maximum number of iterations (number of predictors selected)after which the selection procedure should stop.

start.from

Vector of variable names to be used in the startingsubmodel. IfNULL (default), selection starts from the set ofunpenalized covariates if the model contains penalized predictors,otherwise selection starts from the intercept-only model.

out.csv

If notNULL, the name of a CSV file to save theoutput to.

Value

A data frame of classprojsel where each row corresponds to aforward-selected submodel that contains all variables listed up to that row.Attributestart.from reports the predictors in the initial model.The data frame contains the following columns:

var

names of the variables selected.

kl

KL-divergence from the full model to the submodel.

rel.kl.null

relative explanatory power of predictors starting from theintercept-only model.

rel.kl

relative explanatory power of predictors starting from theinitial submodel.

elpd

the expected log predictive density of the submodels.

delta.elpd

the difference in elpd from the full model.

Examples

# continued from ?hsstansel <- projsel(hs.biom, max.iters=3)plot(sel)

Sampler statistics

Description

Report statistics on the parameters used in the sampler, the samplerbehaviour and the sampling time.

Usage

sampler.stats(object)

Arguments

object

An object of classhsstan.

Value

A matrix withC + 1 rows, whereC is the number of Markovchains, reporting average acceptance probability, average stepsize, numberof divergent transitions, maximum tree depth, total number of gradientevaluations, warmup and sampling times in seconds.

Examples

# continued from ?hsstansampler.stats(hs.biom)

Summary for the fitted model

Description

Summary for the fitted model

Usage

## S3 method for class 'hsstan'summary(  object,  pars = NULL,  prob = 0.95,  digits = 2,  sort = NULL,  decreasing = TRUE,  max.rows = NULL,  ...)

Arguments

object

An object of classhsstan.

pars

Vector of parameter names to be extracted. IfNULL(default) then this refers to the set of predictors used in themodel.

prob

Width of the posterior intervals (0.95, by default).

digits

Number of decimal places to be reported (2 by default).

sort

Column name used to sort the results according to the absolutevalue of the column. IfNULL (default) or the column name cannotbe found, no sorting occurs.

decreasing

Whether the results should be sorted in decreasing orderwhen a valid column name is specified insort (TRUE by default).

max.rows

Maximum number of rows to be returned. IfNULL(default) or 0, all results are returned.

...

Currently ignored.

Value

A matrix with summaries from the posterior distribution of the parametersof interest.

Examples

# continued from ?hsstansummary(hs.biom)

Movatterモバイル変換

Hierarchical shrinkage Stan models for biomarker selection

Description

Details

References

Bayesian and LOO-adjusted R-squared

Description

Usage

Arguments

Value

References

Examples

Next variable to enter the current submodel

Description

Usage

Arguments

Diabetes data with interaction terms

Description

Format

Source

Examples

Compute the fit of a submodel

Description

Usage

Arguments

Hierarchical shrinkage models

Description

Usage

Arguments

Value

See Also

Examples

K-fold cross-validation

Description

Usage

Arguments

Value

Examples

Compute projections of full predictors on to subspace of predictors

Description

Usage

Arguments

Pointwise log-likelihood

Description

Usage

Arguments

Value

Examples

Predictive information criteria for Bayesian models

Description

Usage

Arguments

Value

Examples

Number of posterior samples

Description

Usage

Arguments

Value

Examples

Plot of relative explanatory power of predictors

Description

Usage

Arguments

Value

Posterior uncertainty intervals

Description

Usage

Arguments

Value

Examples

Posterior distribution of the linear predictor

Description

Usage

Arguments

Value

Examples

Posterior measures of performance

Description

Usage