How to Use the rstanarm Package

Jonah Gabry and Ben Goodrich

2025-09-29

library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())

Introduction

This vignette provides an overview of how to use the functions in the rstanarm package that focuses on commonalities. The other rstanarm vignettes go into the particularities of each of the individual model-estimating functions.

The goal of the rstanarm package is to make Bayesian estimation routine for the most common regression models that applied researchers use. This will enable researchers to avoid the counter-intuitiveness of the frequentist approach to probability and statistics with only minimal changes to their existing R scripts.

The four steps of a Bayesian analysis are

  1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknowns. This joint distribution is proportional to a posterior distribution of the unknowns conditional on the observed data.
  2. Draw from the posterior distribution using Markov Chain Monte Carlo (MCMC).
  3. Evaluate how well the model fits the data and possibly revise the model.
  4. Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects (a function of) the outcome(s).

Step 1 is necessarily model-specific and is covered in more detail in the other vignettes that cover specific forms of the marginal prior distribution and likelihood of the outcome. It is somewhat more involved than the corresponding first step of a frequentist analysis, which only requires that the likelihood of the outcome be specified. However, the default priors in the rstanarm package should work well in the majority of cases. Steps 2, 3, and 4 are the focus of this vignette because they are largely not specific to how the joint distribution in Step 1 is specified.

The key concept in Step 3 and Step 4 is the posterior predictive distribution, which is the distribution of the outcome implied by the model after having used the observed data to update our beliefs about the unknown parameters. Frequentists, by definition, have no posterior predictive distribution, and frequentist predictions are subtly different from what applied researchers seek. Maximum likelihood estimates do not condition on the observed outcome data, and so the uncertainty in the estimates pertains to the variation in the sampling distribution of the estimator, i.e. the distribution of estimates that would occur if we could repeat the process of drawing a random sample from a well-defined population and apply the estimator to each sample. It is possible to construct a distribution of predictions under the frequentist paradigm but it evokes the hypothetical of repeating the process of drawing a random sample, applying the estimator each time, and generating point predictions of the outcome. In contrast, the posterior predictive distribution conditions on the observed outcome data in hand to update beliefs about the unknowns, and the variation in the resulting distribution of predictions reflects the remaining uncertainty in our beliefs about the unknowns.

Step 1: Specify a posterior distribution

For the sake of discussion, we need some posterior distribution to draw from. We will utilize an example from the HSAUR3 package by Brian S. Everitt and Torsten Hothorn, which is used in their 2014 book A Handbook of Statistical Analyses Using R (3rd Edition) (Chapman & Hall / CRC). This book is frequentist in nature and we will show how to obtain the corresponding Bayesian results.

The model in section 6.3.2 pertains to whether a survey respondent agrees or disagrees with a conservative statement about the role of women in society, which is modeled as a function of the gender and education of the respondents. The posterior distribution — with independent priors — can be written as \[f\left(\alpha, \beta_1, \beta_2 | \mathbf{y}, \mathbf{X}\right) \propto f\left(\alpha\right) f\left(\beta_1\right) f\left(\beta_2\right) \times \prod_{i=1}^J g^{-1}\left(\eta_i\right)^{y_i} \left(1 - g^{-1}\left(\eta_i\right)\right)^{n_i - y_i},\] where \(\eta_i = \alpha + \beta_1 \mbox{education}_i + \beta_2 \mbox{Female}_i\) is the linear predictor and a function of an intercept \(\left(\alpha\right)\), a coefficient on the years of education \(\left(\beta_1\right)\), and an intercept-shift \(\left(\beta_2\right)\) for the case where the respondent is female. These data are organized such that \(y_i\) is the number of respondents who agree with the statement that have the same level of education and the same gender, and \(n_i - y_i\) is the number of such people who disagree with the statement. The inverse link function, \(p = g^{-1}\left(\eta_i\right)\), for a binomial likelihood can be one of several Cumulative Distribution Functions (CDFs) but in this case is the standard logistic CDF, \(g^{-1}\left(\eta_i\right) = \frac{1}{1 + e^{-\eta_i}}\).

Suppose we believe — prior to seeing the data — that \(\alpha\), \(\beta_1\), and \(\beta_2\) are probably close to zero, are as likely to be positive as they are to be negative, but have a small chance of being quite far from zero. These beliefs can be represented by Student t distributions with a few degrees of freedom in order to produce moderately heavy tails. In particular, we will specify seven degrees of freedom. Note that these purported beliefs may well be more skeptical than your actual beliefs, which are probably that women and people with more education have less conservative societal views.
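
As a quick visual check (a sketch, not part of the original analysis), we can compare the tails of such a Student t prior to a normal distribution with the same scale; the scale of 5 anticipates the value passed to stan_glm in Step 2:

curve(dt(x / 5, df = 7) / 5, from = -20, to = 20,
      xlab = expression(beta), ylab = "prior density")  # Student t, df = 7, scale = 5
curve(dnorm(x, mean = 0, sd = 5), add = TRUE, lty = 2)  # normal with the same scale
legend("topright", legend = c("Student t (df = 7, scale = 5)", "Normal (sd = 5)"),
       lty = 1:2, bty = "n")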

Note on “prior beliefs” and default priors

In this vignette we use the term “prior beliefs” to refer in generality to the information content of the prior distribution (conditional on the model). Sometimes previous research on the topic of interest motivates beliefs about model parameters, but other times such work may not exist or several studies may make contradictory claims. Regardless, we nearly always have some knowledge that should be reflected in our choice of prior distributions. For example, no one believes a logistic regression coefficient will be greater than five in absolute value if the predictors are scaled reasonably. You may also have seen examples of so-called “non-informative” (or “vague”, “diffuse”, etc.) priors like a normal distribution with a variance of 1000. When data are reasonably scaled, these priors are almost always a bad idea for various reasons (they give non-trivial weight to extreme values, reduce computational efficiency, etc.). The default priors in rstanarm are designed to be weakly informative, by which we mean that they avoid placing unwarranted prior weight on nonsensical parameter values and provide some regularization to avoid overfitting, but also do allow for extreme values if warranted by the data. If additional information is available, the weakly informative defaults can be replaced with more informative priors.

Step 2: Draw from the posterior distribution

The likelihood for the sample is just the product over the \(J\) groups of \[g^{-1}\left(\eta_i\right)^{y_i} \left(1 - g^{-1}\left(\eta_i\right)\right)^{n_i - y_i},\] which can be maximized over \(\alpha\), \(\beta_1\), and \(\beta_2\) to obtain frequentist estimates by calling

data("womensrole",package ="HSAUR3")womensrole$total<- womensrole$agree+ womensrole$disagreewomensrole_glm_1<-glm(cbind(agree, disagree)~ education+ gender,data = womensrole,family =binomial(link ="logit"))round(coef(summary(womensrole_glm_1)),3)
             Estimate Std. Error z value Pr(>|z|)
(Intercept)     2.509      0.184  13.646    0.000
education      -0.271      0.015 -17.560    0.000
genderFemale   -0.011      0.084  -0.136    0.892

The p-value for the null hypothesis that \(\beta_1 = 0\) is very small, while the p-value for the null hypothesis that \(\beta_2 = 0\) is very large. However, frequentist p-values are awkward because they do not pertain to the probability that a scientific hypothesis is true but rather to the probability of observing a \(z\)-statistic that is so large (in magnitude) if the null hypothesis were true. The desire to make probabilistic statements about a scientific hypothesis is one reason why many people are drawn to the Bayesian approach.

A model with the same likelihood but Student t priors with seven degrees of freedom can be specified using the rstanarm package in a similar way by prepending stan_ to the glm call and specifying priors (and optionally the number of cores on your computer to utilize):

library(rstanarm)
womensrole_bglm_1 <- stan_glm(cbind(agree, disagree) ~ education + gender,
                              data = womensrole,
                              family = binomial(link = "logit"),
                              prior = student_t(df = 7, 0, 5),
                              prior_intercept = student_t(df = 7, 0, 5),
                              cores = 2, seed = 12345)
womensrole_bglm_1
stan_glm
 family:       binomial [logit]
 formula:      cbind(agree, disagree) ~ education + gender
 observations: 42
 predictors:   3
------
             Median MAD_SD
(Intercept)   2.5    0.2
education    -0.3    0.0
genderFemale  0.0    0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

As can be seen, the “Bayesian point estimates” — which are represented by the posterior medians — are very similar to the maximum likelihood estimates. Frequentists would ask whether the point estimate is greater in magnitude than double the estimated standard deviation of the sampling distribution. But here we simply have estimates of the standard deviation of the marginal posterior distributions, which are based on a scaling of the Median Absolute Deviation (MAD) from the posterior medians to obtain a robust estimator of the posterior standard deviation. In addition, we can use the posterior_interval function to obtain a Bayesian uncertainty interval for \(\beta_1\):

ci95 <- posterior_interval(womensrole_bglm_1, prob = 0.95, pars = "education")
round(ci95, 2)
          2.5% 97.5%
education -0.3 -0.24

Unlike frequentist confidence intervals — which are not interpretable in terms of post-data probabilities — the Bayesian uncertainty interval indicates we believe after seeing the data that there is a \(0.95\) probability that \(\beta_1\) is between ci95[1,1] and ci95[1,2]. Alternatively, we could say that there is essentially zero probability that \(\beta_1 > 0\), although frequentists cannot make such a claim coherently.
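
That last claim can be checked directly from the posterior draws. A minimal sketch (the as.matrix method extracts the draws from a stanreg object):

draws <- as.matrix(womensrole_bglm_1)  # one column per parameter
mean(draws[, "education"] > 0)         # Pr(beta_1 > 0 | data); essentially zero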

Many of the post-estimation methods that are available for a model that is estimated by glm are also available for a model that is estimated by stan_glm. For example,

cbind(Median = coef(womensrole_bglm_1), MAD_SD = se(womensrole_bglm_1))
                  Median     MAD_SD
(Intercept)   2.52098276 0.18285768
education    -0.27153061 0.01556542
genderFemale -0.01262136 0.08463091
summary(residuals(womensrole_bglm_1))  # not deviance residuals
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's
-0.3076575 -0.0359870 -0.0041319 -0.0003265  0.0660755  0.2822688          1
cov2cor(vcov(womensrole_bglm_1))
             (Intercept)   education genderFemale
(Intercept)    1.0000000 -0.93963167  -0.23059559
education     -0.9396317  1.00000000  -0.02463045
genderFemale  -0.2305956 -0.02463045   1.00000000

rstanarm does provide a confint method, although it is reserved for computing confidence intervals in the case that the user elects to estimate a model by (penalized) maximum likelihood. When using full Bayesian inference (the rstanarm default) or approximate Bayesian inference, the posterior_interval function should be used to obtain Bayesian uncertainty intervals.
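
For example, omitting the pars argument returns uncertainty intervals for all parameters at once (a sketch using a 90% central interval):

posterior_interval(womensrole_bglm_1, prob = 0.9)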

Step 3: Criticize the model

The launch_shinystan function in the shinystan package provides almost all the tools you need to visualize the posterior distribution and diagnose any problems with the Markov chains. In this case, the results are fine, and to verify that, you can call

launch_shinystan(womensrole_bglm_1, ppd = FALSE)

which will open a web browser that drives the visualizations.

For the rest of this subsection, we focus on what users can do programmatically to evaluate whether a model is adequate. A minimal requirement for Bayesian estimates is that the model should fit the data that the estimates conditioned on. The key function here is posterior_predict, which can be passed a new data.frame to predict out-of-sample, but in this case is omitted to obtain in-sample posterior predictions:

y_rep <- posterior_predict(womensrole_bglm_1)
dim(y_rep)
[1] 4000   42

The resulting matrix has rows equal to the number of posterior simulations, which in this case is \(4000\), and columns equal to the number of observations in the original dataset, which is \(42\) combinations of education and gender. Each element of this matrix is a predicted number of respondents with that value of education and gender who agreed with the survey question and thus should be reasonably close to the observed proportion of agreements in the data. We can create a plot to check this:

par(mfrow = 1:2, mar = c(5, 3.7, 1, 0) + 0.1, las = 3)
boxplot(sweep(y_rep[, womensrole$gender == "Male"], 2,
              STATS = womensrole$total[womensrole$gender == "Male"], FUN = "/"),
        axes = FALSE, main = "Male", pch = NA,
        xlab = "Years of Education", ylab = "Proportion of Agrees")
with(womensrole, axis(1, at = education[gender == "Male"] + 1, labels = 0:20))
axis(2, las = 1)
with(womensrole[womensrole$gender == "Male", ],
     points(education + 1, agree / (agree + disagree), pch = 16, col = "red"))
boxplot(sweep(y_rep[, womensrole$gender == "Female"], 2,
              STATS = womensrole$total[womensrole$gender == "Female"], FUN = "/"),
        axes = FALSE, main = "Female", pch = NA,
        xlab = "Years of Education", ylab = "")
with(womensrole, axis(1, at = education[gender == "Female"] + 1, labels = 0:20))
with(womensrole[womensrole$gender == "Female", ],
     points(education + 1, agree / (agree + disagree), pch = 16, col = "red"))
Posterior predictive boxplots vs. observed data points

Here the boxplots provide the median, interquartile range, and hinges of the posterior predictive distribution for a given gender and level of education, while the red points represent the corresponding observed data. As can be seen, the model predicts the observed data fairly well for six to sixteen years of education but predicts less well for very low or very high levels of education, where there are fewer data.
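
An equivalent check can be expressed more compactly with the bayesplot package loaded at the start of this vignette. This is a sketch rather than part of the original analysis; the rows with zero respondents are dropped to avoid dividing by zero:

ok <- womensrole$total > 0                               # drop empty cells
prop_obs <- womensrole$agree[ok] / womensrole$total[ok]  # observed proportions
prop_rep <- sweep(y_rep[, ok], 2, womensrole$total[ok], FUN = "/")
ppc_intervals_grouped(y = prop_obs, yrep = prop_rep,
                      x = womensrole$education[ok],
                      group = womensrole$gender[ok])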

Consequently, we might consider a model where education has a quadratic effect on agreement, which is easy to specify using R’s formula-based syntax.

(womensrole_bglm_2 <- update(womensrole_bglm_1, formula. = . ~ . + I(education^2)))
stan_glm
 family:       binomial [logit]
 formula:      cbind(agree, disagree) ~ education + gender + I(education^2)
 observations: 42
 predictors:   4
------
               Median MAD_SD
(Intercept)     2.1    0.4
education      -0.2    0.1
genderFemale    0.0    0.1
I(education^2)  0.0    0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Frequentists would test the null hypothesis that the coefficient on the squared level of education is zero. Bayesians might ask whether such a model is expected to produce better out-of-sample predictions than a model with only the level of education. The latter question can be answered using leave-one-out cross-validation or the approximation thereof provided by the loo function in the loo package, for which a method is provided by the rstanarm package.

loo_bglm_1 <- loo(womensrole_bglm_1)
loo_bglm_2 <- loo(womensrole_bglm_2)

First, we verify that the posterior is not too sensitive to any particular observation in the dataset.

par(mfrow = 1:2, mar = c(5, 3.8, 1, 0) + 0.1, las = 3)
plot(loo_bglm_1, label_points = TRUE)
plot(loo_bglm_2, label_points = TRUE)

There are only one or two moderate outliers (whose Pareto \(k\) statistics are greater than \(0.5\)), which should not have too much of an effect on the resulting model comparison:

loo_compare(loo_bglm_1, loo_bglm_2)
                  elpd_diff se_diff
womensrole_bglm_1  0.0       0.0
womensrole_bglm_2 -0.7       1.7

In this case, there is little difference in the expected log pointwise deviance between the two models, so we are essentially indifferent between them after taking into account that the second model estimates an additional parameter. The “LOO Information Criterion (LOOIC)”

loo_bglm_1
Computed from 4000 by 42 log-likelihood matrix.

         Estimate   SE
elpd_loo   -104.8  9.5
p_loo         4.2  1.7
looic       209.7 18.9
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume independent draws (r_eff=1).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     41    97.6%   446
   (0.7, 1]   (bad)       0     0.0%   <NA>
   (1, Inf)   (very bad)  1     2.4%   <NA>
See help('pareto-k-diagnostic') for details.

has the same purpose as the Akaike Information Criterion (AIC) that is used by frequentists. Both are intended to estimate the expected log predictive density (ELPD) for a new dataset. However, the AIC ignores priors and assumes that the posterior distribution is multivariate normal, whereas the functions from the loo package used here do not assume that the posterior distribution is multivariate normal and integrate over uncertainty in the parameters. This only assumes that any one observation can be omitted without having a major effect on the posterior distribution, which can be judged using the plots above.
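
For a rough side-by-side on the deviance scale (a sketch only, since the two criteria are not strictly comparable because the AIC ignores the priors):

AIC(womensrole_glm_1)                      # frequentist fit from Step 2
loo_bglm_1$estimates["looic", "Estimate"]  # LOOIC point estimate from above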

Step 4: Analyze manipulations of predictors

Frequentists attempt to interpret the estimates of the model, which is difficult except when the model is linear, has no inverse link function, and contains no interaction terms. Bayesians can avoid this difficulty simply by inspecting the posterior predictive distribution at different levels of the predictors. For example,

# note: in newdata we want agree and disagree to sum to the number of people we
# want to predict for. the values of agree and disagree don't matter so long as
# their sum is the desired number of trials. we need to explicitly imply the
# number of trials like this because our original data are aggregate. if we had
# bernoulli data then it would be a given we wanted to predict for single
# individuals.
newdata <- data.frame(agree = c(0, 0), disagree = c(100, 100),
                      education = c(12, 16),
                      gender = factor("Female", levels = c("Male", "Female")))
y_rep <- posterior_predict(womensrole_bglm_2, newdata)
summary(apply(y_rep, 1, diff))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -43.00  -24.00  -20.00  -19.76  -16.00    3.00

As can be seen, out of \(100\) women who have a college degree versus \(100\) women with only a high school degree, we would expect about \(20\) fewer college-educated women to agree with the question. There is an even chance that the difference is between \(16\) and \(24\), a one-in-four chance that it is greater, and a one-in-four chance that it is less.
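
The same comparison can be expressed as posterior probabilities, reusing y_rep from the previous chunk (a sketch):

diff_draws <- apply(y_rep, 1, diff)          # per-draw difference: 16 vs 12 years
quantile(diff_draws, probs = c(0.25, 0.75))  # the even-chance interval
mean(diff_draws < 0)                         # Pr(fewer college-educated women agree)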

Troubleshooting

This section provides suggestions for how to proceed when you encounter warning messages generated by the modeling functions in the rstanarm package. The example models below are used just for the purposes of concisely demonstrating certain difficulties and possible remedies (we won’t worry about the merit of the models themselves). The references at the end provide more information on the relevant issues.

Markov chains did not converge

Recommendation: run the chains for more iterations (in certain cases; see the qualification below).

By default, all rstanarm modeling functions will run four randomly initialized Markov chains, each for 2000 iterations (including a warmup period of 1000 iterations that is discarded). All chains must converge to the target distribution for inferences to be valid. For most models, the default settings are sufficient, but if you see a warning message about Markov chains not converging, the first thing to try is increasing the number of iterations. This can be done by specifying the iter argument. However, if all parameters have proper priors (no priors were set to NULL), you used the default values for iterations (2000) and chains (4), and the Rhats (explained below) are greater than 2, then increasing the number of iterations alone is unlikely to solve the problem.

One way to monitor whether a chain has converged to the equilibrium distribution is to compare its behavior to other randomly initialized chains. This is the motivation for the Gelman and Rubin potential scale reduction statistic, Rhat. The Rhat statistic measures the ratio of the average variance of the draws within each chain to the variance of the pooled draws across chains; if all chains are at equilibrium, these will be the same and Rhat will be one. If the chains have not converged to a common distribution, the Rhat statistic will tend to be greater than one.
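
To make the idea concrete, here is a minimal sketch of the classic (non-split) formula for a single parameter, assuming a hypothetical matrix draws with one column of post-warmup draws per chain; the computation actually used by Stan is the more elaborate split-Rhat:

rhat_basic <- function(draws) {
  n <- nrow(draws)                     # draws per chain
  B <- n * var(colMeans(draws))        # between-chain variance
  W <- mean(apply(draws, 2, var))      # average within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)  # approximately 1 when the chains agree
}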

Gelman and Rubin’s recommendation is that the independent Markov chains be initialized with diffuse starting values for the parameters and sampled until all values for Rhat are below 1.1. When any Rhat values are above 1.1, rstanarm will print a warning message like this:

Markov chains did not converge! Do not analyze results!

To illustrate how to check the Rhat values after fitting a model using rstanarm, we’ll fit two models and run them for different numbers of iterations.

bad_rhat <- stan_glm(mpg ~ ., data = mtcars, iter = 20, chains = 2, seed = 12345)
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See https://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
Warning: The largest R-hat is 2.83, indicating chains have not mixed. Running the chains for more iterations may help. See https://mc-stan.org/misc/warnings.html#r-hat
Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable. Running the chains for more iterations may help. See https://mc-stan.org/misc/warnings.html#bulk-ess
Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable. Running the chains for more iterations may help. See https://mc-stan.org/misc/warnings.html#tail-ess
Warning: Markov chains did not converge! Do not analyze results!
good_rhat <- update(bad_rhat, iter = 1000, chains = 2, seed = 12345)

Here the first model leads to the warning message about convergence but the second model does not. Indeed, we can see that many Rhat values are much bigger than 1 for the first model:

rhat <- summary(bad_rhat)[, "Rhat"]
rhat[rhat > 1.1]
  (Intercept)           cyl          drat            wt          qsec
     2.341370      3.218821      2.270997      1.334041      1.350006
           vs          gear          carb      mean_PPD log-posterior
     2.335472      1.542550      1.948308      1.715434      1.564986

Since we didn’t get a warning for the second model, we shouldn’t find any parameters with an Rhat far from 1:

any(summary(good_rhat)[, "Rhat"] > 1.1)
[1] FALSE

Details on the computation of Rhat and some of its limitations can be found in the Stan Modeling Language User’s Guide and Reference Manual.

Divergent transitions

Recommendation: increase the target acceptance rate adapt_delta.

Hamiltonian Monte Carlo (HMC), the MCMC algorithm used by Stan, works by simulating the evolution of a Hamiltonian system. Stan uses a symplectic integrator to approximate the exact solution of the Hamiltonian dynamics. When the step size parameter is too large relative to the curvature of the log posterior, this approximation can diverge and threaten the validity of the sampler. rstanarm will print a warning if there are any divergent transitions after the warmup period, in which case the posterior sample may be biased. The recommended method is to increase the adapt_delta parameter – the target average proposal acceptance probability in the adaptation – which will in turn reduce the step size. Each of the modeling functions accepts an adapt_delta argument, so to increase adapt_delta you can simply change the value from the default value to a value closer to \(1\). To reduce the frequency with which users need to manually set adapt_delta, the default value depends on the prior distribution used (see help("adapt_delta", package = "rstanarm") for details).
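
For example, refitting an earlier model with a higher target acceptance rate is a one-liner (a sketch; any value closer to 1 than the default will shrink the step size):

womensrole_bglm_adj <- update(womensrole_bglm_1, adapt_delta = 0.999)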

The downside to increasing the target acceptance rate – and, as a consequence, decreasing the step size – is that sampling will tend to be slower. Intuitively, this is because a smaller step size means that more steps are required to explore the posterior distribution. Since the validity of the estimates is not guaranteed if there are any post-warmup divergent transitions, the slower sampling is a minor cost.

Maximum treedepth exceeded

Recommendation: increase the maximum allowed treedepth max_treedepth unless all other convergence diagnostics are ok.

Configuring the No-U-Turn Sampler (the variant of HMC used by Stan) involves putting a cap on the depth of the trees that it evaluates during each iteration. This is controlled through a maximum depth parameter max_treedepth. When the maximum allowed tree depth is reached, it indicates that NUTS is terminating prematurely to avoid excessively long execution time. If rstanarm prints a warning about transitions exceeding the maximum treedepth, you should try increasing the max_treedepth parameter using the optional control argument. For example, to increase max_treedepth to 16 (the default used by rstanarm is 15) you can provide the argument control = list(max_treedepth = 16) to any of the rstanarm modeling functions. If you do not see a warning about hitting the maximum treedepth (which is rare), then you do not need to worry.
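
For example (a sketch with an arbitrary model, not a recommendation to change the default):

fit <- stan_glm(mpg ~ wt, data = mtcars,
                control = list(max_treedepth = 16),
                seed = 12345)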

With the models rstanarm is capable of fitting, when you get a warning about max treedepth you will typically also get warnings about other diagnostics. However, if you see a max treedepth warning but all other convergence diagnostics are fine, you can typically ignore the warning. In that case the warning likely indicates efficiency issues but not that the results are invalid to analyze.

Conclusion

In this vignette, we have gone through the four steps of a Bayesian analysis. The first step — specifying the posterior distribution — varies considerably from one analysis to the next because the likelihood function employed differs depending on the nature of the outcome variable, and our prior beliefs about the parameters in the model vary not only from situation to situation but from researcher to researcher. However, given a posterior distribution, and given that this posterior distribution can be drawn from using the rstanarm package, the remaining steps are conceptually similar across analyses. The key is to draw from the posterior predictive distribution of the outcome, which is what the model predicts the outcome to be after having updated our beliefs about the unknown parameters with the observed data. Posterior predictive distributions can be used for model checking and for making inferences about how manipulations of the predictors would affect the outcome.

Of course, all of this assumes that you have obtained draws from the posterior distribution faithfully. The functions in the rstanarm package will throw warnings if there is evidence that the draws are tainted, and we have discussed some steps to remedy these problems. For the most part, the model-fitting functions in the rstanarm package are unlikely to produce many such warnings, but they may appear in more complicated models.

If the posterior distribution that you specify in the first step cannot be sampled from using the rstanarm package, then it is often possible to create a hand-written program in the Stan language so that the posterior distribution can be drawn from using the rstan package. See the documentation for the rstan package or https://mc-stan.org for more details about this more advanced usage of Stan. However, many relatively simple models can be fit using the rstanarm package without writing any code in the Stan language, which is illustrated for each estimating function in the rstanarm package in the other vignettes.

References

Betancourt, M. J., & Girolami, M. (2013). Hamiltonian Monte Carlo for hierarchical models. arXiv preprint.

Stan Development Team. (2015). Stan Modeling Language User’s Guide and Reference Manual, Version 2.9.0. https://mc-stan.org/docs/. See the ‘Hamiltonian Monte Carlo Sampling’ chapter.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.

Gelman, A., & Shirley, K. (2011). Inference from simulations and monitoring convergence. In S. Brooks, A. Gelman, G. Jones, & X. Meng (Eds.), Handbook of Markov Chain Monte Carlo. Boca Raton: Chapman & Hall/CRC.

