| Type: | Package |
| Title: | Hierarchical Shrinkage Stan Models for Biomarker Selection |
| Version: | 0.8.2 |
| Date: | 2024-01-13 |
| Description: | Linear and logistic regression models penalized with hierarchical shrinkage priors for selection of biomarkers (or more general variable selection), which can be fitted using Stan (Carpenter et al. (2017) <doi:10.18637/jss.v076.i01>). It implements the horseshoe and regularized horseshoe priors (Piironen and Vehtari (2017) <doi:10.1214/17-EJS1337SI>), as well as the projection predictive selection approach to recover a sparse set of predictive biomarkers (Piironen, Paasiniemi and Vehtari (2020) <doi:10.1214/20-EJS1711>). |
| License: | GPL-3 |
| URL: | https://github.com/mcol/hsstan |
| BugReports: | https://github.com/mcol/hsstan/issues |
| LazyData: | yes |
| Biarch: | true |
| Depends: | R (≥ 3.6) |
| Imports: | ggplot2, loo (≥ 2.1.0), parallel, pROC, Rcpp, methods, rstan(≥ 2.26.0), rstantools (≥ 2.0.0), stats, utils |
| LinkingTo: | BH (≥ 1.66.0.1), Rcpp (≥ 0.12.15), RcppEigen (≥0.3.3.4.0), RcppParallel (≥ 5.0.1), StanHeaders (≥ 2.26.0),rstan (≥ 2.26.0) |
| Suggests: | testthat (≥ 2.1.0) |
| ByteCompile: | yes |
| NeedsCompilation: | yes |
| SystemRequirements: | GNU make |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.2.1 |
| Packaged: | 2024-01-13 20:59:47 UTC; mcolombo |
| Author: | Marco Colombo |
| Maintainer: | Marco Colombo <mar.colombo13@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2024-01-13 21:20:02 UTC |
Hierarchical shrinkage Stan models for biomarker selection
Description
Thehsstan package provides linear and logistic regression modelspenalized with hierarchical shrinkage priors for selection of biomarkers.Models are fitted with Stan (Carpenter et al. (2017)), which allows toperform full Bayesian inference.
Details
The package implements the horseshoe and regularized horseshoe priors(Piironen and Vehtari (2017)), and the projection predictive selectionapproach to recover a sparse set of predictive biomarkers (Piironen,Paasiniemi and Vehtari (2020)).
The approach is particularly suited to selection from high-dimensionalpanels of biomarkers, such as those that can be measured by MSMS or similartechnologies (Colombo, Valo, McGurnaghan et al. (2019), Colombo, McGurnaghan,Blackbourn et al. (2020)).
References
B. Carpenter et al. (2017),Stan: a probabilistic programming language,Journal of Statistical Software, 76 (1).doi:10.18637/jss.v076.i01
J. Piironen and A. Vehtari (2017),Sparsity information and regularization in the horseshoe and other shrinkagepriors,Electronic Journal of Statistics, 11 (2), 5018-5051.doi:10.1214/17-EJS1337SI
J. Piironen, M. Paasiniemi and A. Vehtari (2020),Projective inference in high-dimensional problems: prediction and featureselection,Electronic Journal of Statistics, 14 (1), 2155-2197.doi:10.1214/20-EJS1711
M. Colombo, E. Valo, S.J. McGurnaghan et al. (2019),Biomarkers associated with progression of renal disease in type 1 diabetes,Diabetologia, 62 (9), 1616-1627.doi:10.1007/s00125-019-4915-0
M. Colombo, S.J. McGurnaghan, L.A.K. Blackbourn et al. (2020),Comparison of serum and urinary biomarker panels with albumin creatininratio in the prediction of renal function decline in type 1 diabetes,Diabetologia, 63 (4), 788-798.doi:10.1007/s00125-019-05081-8
M. Colombo, A. Asadi Shehni, I. Thoma et al. (2021),Quantitative levels of serum N-glycans in type 1 diabetes and theirassociation with kidney disease,Glycobiology, 31 (5), 613-623.doi:10.1093/glycob/cwaa106
Bayesian and LOO-adjusted R-squared
Description
Compute the Bayesian and the LOO-adjusted R-squared from the posteriorsamples. For Bayesian R-squared it uses the modelled residual variance(rather than the variance of the posterior distribution of the residuals).The LOO-adjusted R-squared uses Pareto smoothed importance sampling LOOresiduals and Bayesian bootstrap.
Usage
## S3 method for class 'hsstan'bayes_R2(object, prob = 0.95, summary = TRUE, ...)## S3 method for class 'hsstan'loo_R2(object, prob = 0.95, summary = TRUE, ...)Arguments
object | An object of class |
prob | Width of the posterior interval (0.95, by default). It isignored if |
summary | Whether a summary of the distribution of the R-squaredshould be returned rather than the pointwise values ( |
... | Currently ignored. |
Value
The mean, standard deviation and posterior interval of R-squared ifsummary=TRUE, or a vector of R-squared values with length equal tothe size of the posterior sample ifsummary=FALSE.
References
Andrew Gelman, Ben Goodrich, Jonah Gabry and Aki Vehtari (2019),R-squared for Bayesian regression models,The American Statistician, 73 (3), 307-309.doi:10.1080/00031305.2018.1549100
Aki Vehtari, Andrew Gelman, Ben Goodrich and Jonah Gabry (2019),Bayesian R2 and LOO-R2.https://avehtari.github.io/bayes_R2/bayes_R2.html
Examples
# continued from ?hsstanbayes_R2(hs.biom)loo_R2(hs.biom)Next variable to enter the current submodel
Description
Return the index of the variable that should be added to the current modelaccording to the smallest KL-divergence (linear regression) or the largestscore test (logistic regression).
Usage
choose.next(x, sigma2, fit, fitp, chosen, is.logistic)Arguments
x | Design matrix. |
sigma2 | Residual variance (1 for logistic regression). |
fit | Matrix of fitted values for the full model. |
fitp | Matrix of fitted values for the projected model. |
chosen | Vector of indices of the columns of |
is.logistic | Set to |
Diabetes data with interaction terms
Description
The dataset consists of observations on 442 individuals for which aquantitative measure of diabetes progression is recorded in variableY.Predictors include 10 baseline measurements, 45 interactions and 9 quadraticterms, for a total of 64 variables for each individual. Each variable hasbeen standardized by subtracting the mean and then dividing it by itsstandard deviation.
Format
A data frame with 442 rows and 65 columns (centred and scaled).
Source
B. Efron, T. Hastie, I. Johnstone and R. Tibshirani (2004),Least angle regression,The Annals of Statistics, 32 (2), 407-499.doi:10.1214/009053604000000067
The original dataset is available fromhttps://web.stanford.edu/~hastie/Papers/LARS/data64.txt
Examples
data(diabetes, package="hsstan")Compute the fit of a submodel
Description
Compute the fit of a submodel
Usage
fit.submodel(x, sigma2, mu, chosen, xt, yt, logistic)Arguments
x | Design matrix. |
sigma2 | Residual variance (1 for logistic regression). |
mu | Matrix of fitted values for the full model. |
chosen | Vector of indices of the columns of |
xt | Design matrix for test data. |
yt | Outcome variable for test data. |
logistic | Set to |
Hierarchical shrinkage models
Description
Run the No-U-Turn Sampler (NUTS) as implemented in Stan to fit a hierarchicalshrinkage model.
Usage
hsstan( x, covs.model, penalized = NULL, family = gaussian, iter = 2000, warmup = floor(iter/2), scale.u = 2, regularized = TRUE, nu = ifelse(regularized, 1, 3), par.ratio = 0.05, global.df = 1, slab.scale = 2, slab.df = 4, qr = TRUE, seed = 123, adapt.delta = NULL, keep.hs.pars = FALSE, ...)Arguments
x | Data frame containing outcome, covariates and penalized predictors.Continuous predictors and outcome variable should be standardizedbefore fitting the models as priors assume them to have mean zero andunit variance. |
covs.model | Formula containing the unpenalized covariates. |
penalized | Names of the variables to be used as penalized predictors.Any variable that is already part of the |
family | Type of model fitted: either |
iter | Total number of iterations in each chain, including warmup(2000 by default). |
warmup | Number of warmup iterations per chain (by default, half thetotal number of iterations). |
scale.u | Prior scale (standard deviation) for the unpenalizedcovariates. |
regularized | If |
nu | Number of degrees of freedom of the half-Student-t prior on thelocal shrinkage parameters (by default, 1 if |
par.ratio | Expected ratio of non-zero to zero coefficients (ignoredif |
global.df | Number of degrees of freedom for the global shrinkageparameter (ignored if |
slab.scale | Scale of the regularization parameter (ignored if |
slab.df | Number of degrees of freedom of the regularization parameter(ignored if |
qr | Whether the thin QR decomposition should be used to decorrelate thepredictors ( |
seed | Optional integer defining the seed for the pseudo-random numbergenerator. |
adapt.delta | Target average proposal acceptance probability foradaptation, a value between 0.8 and 1 (excluded). If unspecified,it's set to 0.99 for hierarchical shrinkage models and to 0.95 forbase models. |
keep.hs.pars | Whether the parameters for the horseshoe prior should bekept in the |
... | Further arguments passed to |
Value
An object of classhsstan containing the following fields:
stanfit | an object of class |
betas | posterior means of the unpenalized and penalized regressionparameters. |
call | the matched call. |
data | the dataset used in fitting the model. |
model.terms | a list of names for the outcome variable, the unpenalizedcovariates and the penalized predictors. |
family | the |
hsstan.settings | the optional settings used in the model. |
See Also
kfold() for cross-validating a fitted object.
Examples
data(diabetes)# non-default settings for speed of the exampledf <- diabetes[1:50, ]hs.biom <- hsstan(df, Y ~ age + sex, penalized=colnames(df)[5:10], chains=2, iter=250)K-fold cross-validation
Description
Perform K-fold cross-validation using the same settings used when fittingthe model on the whole data.
Usage
## S3 method for class 'hsstan'kfold( x, folds, chains = 1, store.fits = TRUE, cores = getOption("mc.cores", 1), ...)Arguments
x | An object of class |
folds | Integer vector with one element per observation indicating thecross-validation fold in which the observation should be withdrawn. |
chains | Number of Markov chains to run. By default this is set to 1,independently of the number of chains used for |
store.fits | Whether the fitted models for each fold should be storedin the returned object ( |
cores | Number of cores to use for parallelization (the value of |
... | Further arguments passed to |
Value
An object with classeskfold andloo that has a similar structure as theobjects returned byloo() andwaic() and is compatible with theloo_compare function forcomparing models. The object contains the following fields:
estimates | a matrix containing point estimates and standard errors ofthe expected log pointwise predictive density ("elpd_kfold"),the effective number of parameters ("p_kfold", always |
pointwise | a matrix containing the pointwise contributions of"elpd_kfold", "p_kfold" and "kfoldic". |
fits | a matrix with two columns and number of rows equal to the numberof cross-validation folds. Column |
data | the dataset used in fitting the model (before withdrawingobservations). This is not present if |
Examples
# continued from ?hsstan# only 2 folds for speed of examplefolds <- rep(1:2, length.out=length(df$Y))cv.biom <- kfold(hs.biom, folds=folds, cores=2)Compute projections of full predictors on to subspace of predictors
Description
Compute projections of full predictors on to subspace of predictors
Usage
lm_proj(x, fit, sigma2, indproj, is.logistic)Arguments
x | Design matrix. |
fit | Matrix of fitted values for the full model. |
sigma2 | Residual variance (1 for logistic regression). |
indproj | Vector of indices of the columns of |
is.logistic | Set to |
Pointwise log-likelihood
Description
Compute the pointwise log-likelihood.
Usage
## S3 method for class 'hsstan'log_lik(object, newdata = NULL, ...)Arguments
object | An object of class |
newdata | Optional data frame to use to evaluate the log-likelihood.If |
... | Currently ignored. |
Value
A matrix of sizeS byN, whereS is number of draws from the posteriordistribution, andN is the number of data points.
Examples
# continued from ?hsstanlog_lik(hs.biom)Predictive information criteria for Bayesian models
Description
Compute an efficient approximate leave-one-out cross-validationusing Pareto smoothed importance sampling (PSIS-LOO), or the widelyapplicable information criterion (WAIC), also known as the Watanabe-Akaikeinformation criterion.
Usage
## S3 method for class 'hsstan'loo(x, cores = getOption("mc.cores"), ...)## S3 method for class 'hsstan'waic(x, cores = getOption("mc.cores"), ...)Arguments
x | An object of class |
cores | Number of cores used for parallelisation (the value of |
... | Currently ignored. |
Value
Aloo object.
Examples
# continued from ?hsstanloo(hs.biom)waic(hs.biom)Number of posterior samples
Description
Extracts the number of posterior samples stored in a fitted model.
Usage
## S3 method for class 'hsstan'nsamples(object, ...)Arguments
object | An object of class |
... | Currently ignored. |
Value
The total number of posterior samples across the chains after discardingthe warmup iterations.
Examples
# continued from ?hsstannsamples(hs.biom)Plot of relative explanatory power of predictors
Description
The function plots the relative explanatory power of each predictor in orderof selection. The relative explanatory power of predictors is computedaccording to the KL divergence from the full model to each submodel, scaledin such a way that the baseline model (either the null model or the modelcontaining only unpenalized covariates) is at 0, while the full model is at 1.
Usage
## S3 method for class 'projsel'plot( x, title = NULL, max.points = NULL, max.labels = NULL, from.covariates = TRUE, font.size = 12, hadj = 0.05, vadj = 0, ...)Arguments
x | A data frame created by |
title | Title of the plot. If |
max.points | Maximum number of predictors to be plotted. If |
max.labels | Maximum number of predictors to be labelled. If |
from.covariates | Whether the plotting should start from the unpenalizedcovariates ( |
font.size | Size of the textual elements (labels and axes). |
hadj,vadj | Horizontal and vertical adjustment for the labels. |
... | Currently ignored. |
Value
Aggplot2 object showing the relative incremental contribution of eachpredictor starting from the initial set of unpenalized covariates.
Posterior uncertainty intervals
Description
Compute posterior uncertainty intervals forhsstan objects.
Usage
## S3 method for class 'hsstan'posterior_interval(object, pars = NULL, prob = 0.95, ...)Arguments
object | An object of class |
pars | Names of parameters for which posterior intervals should bereturned, which can be specified as regular expressions. If |
prob | A value between 0 and 1 indicating the desired probabilityto be covered by the uncertainty intervals (0.95, by default). |
... | Currently ignored. |
Value
A matrix with lower and upper interval bounds as columns and as many rowsas selected parameters.
Examples
# continued from ?hsstanposterior_interval(hs.biom)Posterior distribution of the linear predictor
Description
Extract the posterior draws of the linear predictor, possibly transformedby the inverse-link function.
Usage
## S3 method for class 'hsstan'posterior_linpred(object, transform = FALSE, newdata = NULL, ...)Arguments
object | An object of class |
transform | Whether the linear predictor should be transformed usingthe inverse-link function ( |
newdata | Optional data frame containing the variables to use topredict. If |
... | Currently ignored. |
Value
A matrix of sizeS byN, whereS is the number of draws from theposterior distribution of the (transformed) linear predictor, andN isthe number of data points.
Examples
# continued from ?hsstanposterior_linpred(hs.biom)Posterior measures of performance
Description
Compute the log-likelihood and a relevant measure of performance (R-squaredor AUC) from the posterior samples.
Usage
posterior_performance( obj, prob = 0.95, sub.idx = NULL, summary = TRUE, cores = getOption("mc.cores", 1))Arguments
obj | An object of class |
prob | Width of the posterior interval (0.95, by default). It isignored if |
sub.idx | Vector of indices of observations in the dataset to be usedin computing the performance measures. If |
summary | Whether a summary of the distribution of the performancemeasure should be returned rather than the pointwise values( |
cores | Number of cores to use for parallelization (the value of |
Value
The mean, standard deviation and posterior interval of the performancemeasure (R-squared or AUC) ifsummary=TRUE, or a vector of valuesof the performance measure with length equal to the size of the posteriorsample ifsummary=FALSE. Attributetype reports whether the performancemeasures are cross-validated or not. Ifsub.idx is notNULL, attributesubset reports the index of observations used in the computations.
Examples
# continued from ?hsstanposterior_performance(hs.biom, cores=1)Posterior predictive distribution
Description
Draw from the posterior predictive distribution of the outcome.
Usage
## S3 method for class 'hsstan'posterior_predict(object, newdata = NULL, nsamples = NULL, seed = NULL, ...)Arguments
object | An object of class |
newdata | Optional data frame containing the variables to use topredict. If |
nsamples | A positive integer indicating the number of posterior samplesto use. If |
seed | Optional integer defining the seed for the pseudo-random numbergenerator. |
... | Currently ignored. |
Value
A matrix of sizeS byN, whereS is the number of simulations fromthe posterior predictive distribution, andN is the number of data points.
Examples
# continued from ?hsstanposterior_predict(hs.biom)Posterior summary
Description
Produce a summary of the posterior samples for the quantities of interest.
Usage
posterior_summary(x, prob, ...)## Default S3 method:posterior_summary(x, prob = 0.95, ...)## S3 method for class 'hsstan'posterior_summary(x, prob = 0.95, pars = NULL, ...)Arguments
x | An object containing or representing posterior samples. If |
prob | Width of the posterior intervals (0.95, by default). |
... | Further arguments passed to or from other methods. |
pars | Vector of parameter names to be extracted. If |
Value
A matrix with columns containing mean, standard deviation and posteriorintervals for the given quantities.
See Also
summary() to produce summaries ofhsstan objects that include the numberof effective samples and the split-Rhat diagnostic.
Examples
# continued from ?hsstanposterior_summary(hs.biom)Print a summary for the fitted model
Description
Print a summary for the fitted model
Usage
## S3 method for class 'hsstan'print(x, ...)Arguments
x | An object of class |
... | Further arguments to |
Forward selection minimizing KL-divergence in projection
Description
Forward selection minimizing KL-divergence in projection
Usage
projsel(obj, max.iters = 30, start.from = NULL, out.csv = NULL)Arguments
obj | Object of class |
max.iters | Maximum number of iterations (number of predictors selected)after which the selection procedure should stop. |
start.from | Vector of variable names to be used in the startingsubmodel. If |
out.csv | If not |
Value
A data frame of classprojsel where each row corresponds to aforward-selected submodel that contains all variables listed up to that row.Attributestart.from reports the predictors in the initial model.The data frame contains the following columns:
var | names of the variables selected. |
kl | KL-divergence from the full model to the submodel. |
rel.kl.null | relative explanatory power of predictors starting from theintercept-only model. |
rel.kl | relative explanatory power of predictors starting from theinitial submodel. |
elpd | the expected log predictive density of the submodels. |
delta.elpd | the difference in elpd from the full model. |
Examples
# continued from ?hsstansel <- projsel(hs.biom, max.iters=3)plot(sel)Sampler statistics
Description
Report statistics on the parameters used in the sampler, the samplerbehaviour and the sampling time.
Usage
sampler.stats(object)Arguments
object | An object of class |
Value
A matrix withC + 1 rows, whereC is the number of Markovchains, reporting average acceptance probability, average stepsize, numberof divergent transitions, maximum tree depth, total number of gradientevaluations, warmup and sampling times in seconds.
Examples
# continued from ?hsstansampler.stats(hs.biom)Summary for the fitted model
Description
Summary for the fitted model
Usage
## S3 method for class 'hsstan'summary( object, pars = NULL, prob = 0.95, digits = 2, sort = NULL, decreasing = TRUE, max.rows = NULL, ...)Arguments
object | An object of class |
pars | Vector of parameter names to be extracted. If |
prob | Width of the posterior intervals (0.95, by default). |
digits | Number of decimal places to be reported (2 by default). |
sort | Column name used to sort the results according to the absolutevalue of the column. If |
decreasing | Whether the results should be sorted in decreasing orderwhen a valid column name is specified in |
max.rows | Maximum number of rows to be returned. If |
... | Currently ignored. |
Value
A matrix with summaries from the posterior distribution of the parametersof interest.
Examples
# continued from ?hsstansummary(hs.biom)