```r
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
```

This vignette explains how to use the `stan_lmer`, `stan_glmer`, `stan_nlmer`, and `stan_gamm4` functions in the **rstanarm** package to estimate linear and generalized (non-)linear models with parameters that may vary across groups. Before continuing, we recommend reading the vignettes (navigate up one level) for the various ways to use the `stan_glm` function. The _Hierarchical Partial Pooling_ vignette also has examples of both `stan_glm` and `stan_glmer`.
Models with this structure are referred to by many names: multilevel models, (generalized) linear mixed (effects) models (GLMM), hierarchical (generalized) linear models, etc. In the simplest case, the model for an outcome can be written as \[\mathbf{y} = \alpha + \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{b} + \boldsymbol{\epsilon},\] where \(\mathbf{X}\) is a matrix of predictors that is analogous to that in Generalized Linear Models and \(\mathbf{Z}\) is a matrix that encodes deviations in the predictors across specified groups.
The terminology for the unknowns in the model is diverse. To frequentists, the error term consists of \(\mathbf{Z}\mathbf{b} + \boldsymbol{\epsilon}\) and the observations within each group are _not_ independent conditional on \(\mathbf{X}\) alone. Since \(\mathbf{b}\) is considered part of the random error term, frequentists allow themselves to make distributional assumptions about \(\mathbf{b}\), invariably that it is distributed multivariate normal with mean vector zero and structured covariance matrix \(\boldsymbol{\Sigma}\). If \(\epsilon_i\) is also distributed (univariate) normal with mean zero and standard deviation \(\sigma\), then \(\mathbf{b}\) can be integrated out, which implies \[\mathbf{y} \thicksim \mathcal{N}\left(\alpha + \mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I} + \mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^\top\right),\] and it is possible to maximize this likelihood function by choosing proposals for the parameters \(\alpha\), \(\boldsymbol{\beta}\), \(\sigma\), and (the free elements of) \(\boldsymbol{\Sigma}\).
Consequently, frequentists refer to \(\mathbf{b}\) as the _random effects_ because they capture the random deviation in the effects of predictors from one group to the next. In contradistinction, \(\alpha\) and \(\boldsymbol{\beta}\) are referred to as _fixed effects_ because they are the same for all groups. Moreover, \(\alpha\) and \(\boldsymbol{\beta}\) persist in the model in hypothetical replications of the analysis that draw the members of the groups afresh every time, whereas \(\mathbf{b}\) would differ from one replication to the next. Consequently, \(\mathbf{b}\) is not a "parameter" to be estimated because parameters are unknown constants that are fixed in repeated sampling.
Bayesians condition on the data in-hand without reference to repeated sampling and describe their _beliefs_ about the unknowns with prior distributions before observing the data. Thus, the likelihood in a simple hierarchical model in **rstanarm** is \[\mathbf{y} \thicksim \mathcal{N}\left(\alpha + \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b}, \sigma^2 \mathbf{I}\right)\] and the observations are independent conditional on \(\mathbf{X}\) and \(\mathbf{Z}\). In this formulation, there is no fundamental distinction between "fixed" and "random" effects: \(\alpha\), \(\boldsymbol{\beta}\), and \(\mathbf{b}\) are all unknowns to be estimated rather than integrated out.
Bayesians are compelled to state their prior beliefs about all unknowns and the usual assumption (which is maintained in **rstanarm**) is that \(\mathbf{b} \thicksim \mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}\right),\) but it is then necessary to state prior beliefs about \(\boldsymbol{\Sigma}\), in addition to \(\alpha\), \(\boldsymbol{\beta}\), and \(\sigma\).
One of the many challenges of fitting models to data comprising multiple groupings is confronting the tradeoff between validity and precision. An analysis that disregards between-group heterogeneity can yield parameter estimates that are wrong if there is between-group heterogeneity but would be relatively precise if there actually were no between-group heterogeneity. Group-by-group analyses, on the other hand, are valid but produce estimates that are relatively imprecise. While complete pooling or no pooling of data across groups is sometimes called for, models that ignore the grouping structures in the data tend to underfit or overfit (Gelman et al., 2013). Hierarchical modeling provides a compromise by allowing parameters to vary by group at lower levels of the hierarchy while estimating common parameters at higher levels. Inference for each group-level parameter is informed not only by the group-specific information contained in the data but also by the data for other groups as well. This is commonly referred to as _borrowing strength_ or _shrinkage_.
In **rstanarm**, these models can be estimated using the `stan_lmer` and `stan_glmer` functions, which are similar in syntax to the `lmer` and `glmer` functions in the **lme4** package. However, rather than performing (restricted) maximum likelihood ((RE)ML) estimation, Bayesian estimation is performed via MCMC. The Bayesian model adds independent prior distributions on the regression coefficients (in the same way as `stan_glm`) as well as priors on the terms of a decomposition of the covariance matrices of the group-specific parameters. These priors are discussed in greater detail below.
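For instance (an illustration not in the original analysis, using **lme4**'s `sleepstudy` data), a model with intercepts and slopes for `Days` that vary by `Subject` can be fit as follows:

```r
library(rstanarm)
data("sleepstudy", package = "lme4")  # requires lme4 to be installed

# Varying intercepts and varying slopes for Days across Subjects,
# estimated by full Bayesian inference via MCMC (default priors)
fit <- stan_lmer(
  Reaction ~ Days + (Days | Subject),
  data = sleepstudy,
  chains = 2, cores = 2, seed = 12345
)
summary(fit)
```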
In this section we discuss a flexible family of prior distributions for the unknown covariance matrices of the group-specific coefficients.
For each group, we assume the vector of varying slopes and intercepts is a zero-mean random vector following a multivariate Gaussian distribution with an unknown covariance matrix to be estimated. Unfortunately, expressing prior information about a covariance matrix is not intuitive and can also be computationally challenging. When the covariance matrix is not \(1\times 1\), it is often both much more intuitive and efficient to work instead with the _correlation_ matrix and _variances_. When the covariance matrix is \(1\times 1\), we still denote it as \(\boldsymbol{\Sigma}\) but most of the details in this section do not apply.
The variances are in turn decomposed into the product of a simplex vector (probability vector) and the trace of the implied covariance matrix, which is defined as the sum of its diagonal elements. Finally, this trace is set equal to the product of the order of the matrix and the square of a scale parameter. This implied prior on a covariance matrix is represented by the `decov` (short for decomposition of covariance) function in **rstanarm**.
Using the decomposition described above, the prior used for a correlation matrix \(\Omega\) is called the LKJ distribution and has a probability density function proportional to the determinant of the correlation matrix raised to a power of \(\zeta\) minus one:
\[ f(\Omega | \zeta) \propto\text{det}(\Omega)^{\zeta - 1}, \quad \zeta > 0. \]
The shape of this prior depends on the value of the regularization parameter, \(\zeta\), in the following ways:

- If \(\zeta = 1\) (the default), the LKJ prior is jointly uniform over all correlation matrices of the same dimension as \(\Omega\).
- If \(\zeta > 1\), the mode of the distribution is the identity matrix, and the larger the value of \(\zeta\), the more sharply the prior concentrates around the identity matrix, favoring smaller correlations.
- If \(0 < \zeta < 1\), the prior has a trough at the identity matrix, placing more mass on correlation matrices with more extreme correlations.
The \(J \times J\) covariance matrix \(\Sigma\) of a random vector \(\boldsymbol{\theta} = (\theta_1, \dots, \theta_J)\) has diagonal entries \({\Sigma}_{jj} = \sigma^2_j = \text{var}(\theta_j)\). Therefore, the trace of the covariance matrix is equal to the sum of the variances. We set the trace equal to the product of the order of the covariance matrix and the square of a positive scale parameter \(\tau\):
\[\text{tr}(\Sigma) = \sum_{j=1}^{J}\Sigma_{jj} = J\tau^2.\]
The vector of variances is set equal to the product of a simplex vector \(\boldsymbol{\pi}\) (which is non-negative and sums to 1) and the scalar trace: \(J \tau^2 \boldsymbol{\pi}\). Each element \(\pi_j\) of \(\boldsymbol{\pi}\) then represents the proportion of the trace (total variance) attributable to the corresponding variable \(\theta_j\).
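As a small numeric sketch of this decomposition (the values of \(J\), \(\tau\), and \(\boldsymbol{\pi}\) below are made up for illustration):

```r
# Hypothetical values, for illustration only
J       <- 3                  # order of the covariance matrix
tau     <- 2                  # scale parameter
simplex <- c(0.5, 0.3, 0.2)   # non-negative and sums to 1

# Per-variable variances implied by the decomposition: 6.0, 3.6, 2.4
variances <- J * tau^2 * simplex

# The variances sum back to the trace, J * tau^2 = 12
all.equal(sum(variances), J * tau^2)
```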
For the simplex vector \(\boldsymbol{\pi}\) we use a symmetric Dirichlet prior, which has a single _concentration_ parameter \(\gamma > 0\):

- If \(\gamma = 1\) (the default), the prior is jointly uniform over the space of simplex vectors of that size.
- If \(\gamma > 1\), the prior mode corresponds to all variables accounting for equal proportions of the total variance, and larger values of \(\gamma\) concentrate the prior more tightly around that mode.
- If \(0 < \gamma < 1\), the prior is polarized, placing more mass on simplexes in which a few variables account for most of the total variance.
If all the elements of \(\boldsymbol{\theta}\) were multiplied by the same number \(k\), the trace of their covariance matrix would increase by a factor of \(k^2\). For this reason, it is sensible to use a scale-invariant prior for \(\tau\). We choose a Gamma distribution, with shape and scale parameters both set to \(1\) by default, implying a unit-exponential distribution. Users can set the shape hyperparameter to some value greater than one to ensure that the posterior trace is not zero. In the case where \(\boldsymbol{\Sigma}\) is \(1\times 1\), \(\tau\) is the cross-group standard deviation in the parameters and its square is the variance (so the Gamma prior with its shape and scale directly applies to the cross-group standard deviation in the parameters).
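The arguments of the `decov` function map directly onto these hyperparameters: `regularization` is the LKJ \(\zeta\), `concentration` is the Dirichlet \(\gamma\), and `shape` and `scale` govern the Gamma prior on \(\tau\). All four default to \(1\):

```r
library(rstanarm)

# decov() shown with all arguments at their default values;
# larger regularization/concentration values regularize more strongly
cov_prior <- decov(regularization = 1, concentration = 1,
                   shape = 1, scale = 1)
```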
There are several advantages to estimating these models using **rstanarm** rather than the **lme4** package. There are also a few drawbacks. In this section we briefly discuss what we find to be the two most important advantages as well as an important disadvantage.
While **lme4** uses (restricted) maximum likelihood ((RE)ML) estimation, **rstanarm** enables full Bayesian inference via MCMC to be performed. It is well known that (RE)ML tends to underestimate uncertainties because it relies on point estimates of hyperparameters. Full Bayes, on the other hand, propagates the uncertainty in the hyperparameters throughout all levels of the model and provides more appropriate estimates of uncertainty for models that consist of a mix of common and group-specific parameters.
The `stan_glmer` and `stan_lmer` functions allow the user to specify prior distributions over the regression coefficients as well as any unknown covariance matrices. There are various reasons to specify priors, from helping to stabilize computation to incorporating important information into an analysis that does not enter through the data.
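As a sketch (the prior values below are chosen purely for illustration, not as recommendations), user-specified priors for the `sleepstudy` model shown earlier might look like this:

```r
fit_informative <- stan_lmer(
  Reaction ~ Days + (Days | Subject),
  data = sleepstudy,
  prior = normal(0, 2, autoscale = TRUE),        # regression coefficients
  prior_intercept = normal(0, 10),               # intercept
  prior_covariance = decov(regularization = 2),  # group-specific terms
  chains = 2, cores = 2, seed = 12345
)
```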
The benefits of full Bayesian inference (via MCMC) come with a cost. Fitting models with (RE)ML will tend to be much faster than fitting a similar model using MCMC. Speed comparable to **lme4** can be obtained with **rstanarm** using approximate Bayesian inference via the mean-field and full-rank variational algorithms (see `help("rstanarm-package", "rstanarm")` for details). These algorithms can be useful to narrow the set of candidate models in large problems, but MCMC should always be used for final statistical inference.
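The variational algorithms are selected via the `algorithm` argument. A sketch, again using the `sleepstudy` example:

```r
# Fast approximation for model exploration only; use the default
# algorithm = "sampling" (MCMC) for final statistical inference
approx_fit <- stan_lmer(
  Reaction ~ Days + (Days | Subject),
  data = sleepstudy,
  algorithm = "meanfield",  # or "fullrank"
  seed = 12345
)
```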
## Relationship to glmer

In the **lme4** package, there is a fundamental distinction between the way that Linear Mixed Models and Generalized Linear Mixed Models are estimated. In Linear Mixed Models, \(\mathbf{b}\) can be integrated out analytically, leaving a likelihood function that can be maximized over proposals for the parameters. To estimate a Linear Mixed Model, one can call the `lmer` function.
Generalized Linear Mixed Models are appropriate when the conditional mean of the outcome is determined by an inverse link function, \(\boldsymbol{\mu} = g\left(\alpha + \mathbf{X} \boldsymbol{\beta} + \mathbf{Z}\mathbf{b}\right)\). If \(g\left(\cdot\right)\) is not the identity function, then it is not possible to integrate out \(\mathbf{b}\) analytically and numerical integration must be used. To estimate a Generalized Linear Mixed Model, one can call the `glmer` function and specify the `family` argument.
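For example (using **lme4**'s built-in `cbpp` data, as in the `glmer` documentation), a binomial GLMM with intercepts that vary by herd:

```r
library(lme4)
data(cbpp)  # contagious bovine pleuropneumonia incidence data

# b cannot be integrated out analytically here, so glmer
# approximates the integral numerically
gm1 <- glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp, family = binomial
)
```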
In the **rstanarm** package, there is no such fundamental distinction; in fact `stan_lmer` simply calls `stan_glmer` with `family = gaussian(link = "identity")`. Bayesians do not (have to) integrate \(\mathbf{b}\) out of the likelihood and if \(\mathbf{b}\) is not of interest, then the margins of its posterior distribution can simply be ignored.
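The Bayesian counterpart of the `cbpp` model above just swaps in `stan_glmer` (the MCMC settings shown are illustrative):

```r
library(rstanarm)

# Same model as gm1, estimated by MCMC; b is retained in the posterior
stan_gm1 <- stan_glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp, family = binomial,
  chains = 2, cores = 2, seed = 12345
)
```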
## Relationship to gamm4

The **rstanarm** package includes a `stan_gamm4` function that is similar to the `gamm4` function in the **gamm4** package, which is in turn similar to the `gamm` function in the **mgcv** package. The substring `gamm` stands for Generalized Additive Mixed Models, which differ from Generalized Additive Models (GAMs) due to the presence of group-specific terms that can be specified with the syntax of **lme4**. Both GAMs and GAMMs include nonlinear functions of (non-categorical) predictors called "smooths". In the example below, so-called "thin-plate splines" are used to model counts of roaches where we might fear that the number of roaches in the current period is an exponentially increasing function of the number of roaches in the previous period. Unlike `stan_glmer`, in `stan_gamm4` it is necessary to specify group-specific terms as a one-sided formula that is passed to the `random` argument, as in the `lme` function in the **nlme** package.
```r
library(rstanarm)
data(roaches)
roaches$roach1 <- roaches$roach1 / 100
roaches$log_exposure2 <- log(roaches$exposure2)

post <- stan_gamm4(
  y ~ s(roach1) + treatment + log_exposure2,
  random = ~ (1 | senior),
  data = roaches,
  family = neg_binomial_2,
  QR = TRUE,
  cores = 2,
  chains = 2,
  adapt_delta = 0.99,
  seed = 12345
)
plot_nonlinear(post)
```

Here we see that the relationship between past and present roaches is estimated to be nonlinear. For a small number of past roaches, the function is steep and then it appears to flatten out, although we become highly uncertain about the function in the rare cases where the number of past roaches is large.
## Relationship to nlmer

The `stan_gamm4` function allows designated predictors to have a nonlinear effect on what would otherwise be called the "linear" predictor in Generalized Linear Models. The `stan_nlmer` function is similar to the `nlmer` function in the **lme4** package, and essentially allows a wider range of nonlinear functions that relate the linear predictor to the conditional expectation of a Gaussian outcome.
To estimate an example model with the `nlmer` function in the **lme4** package, we start by rescaling the outcome and main predictor(s) by a constant:
data("Orange",package ="datasets")Orange$age<- Orange$age/100Orange$circumference<- Orange$circumference/100Although doing so has no substantive effect on the inferencesobtained, it is numerically much easier for Stan and forlme4 to work with variables whose units are such thatthe estimated parameters tend to be single-digit numbers that are nottoo close to zero. Thenlmer function requires that theuser pass starting values to the ironically-named self-startingnon-linear function:
startvec<-c(Asym =2,xmid =7.25,scal =3.5)library(lme4)nm1<-nlmer(circumference~SSlogis(age, Asym, xmid, scal)~ Asym|Tree,data = Orange,start = startvec)summary(nm1)Warning in vcov.merMod(object, use.hessian = use.hessian): variance-covariance matrix computed from finite-difference Hessian isnot positive definite or contains NA values: falling back to var-cov estimated from RXWarning in vcov.merMod(object, correlation = correlation, sigm = sig): variance-covariance matrix computed from finite-difference Hessian isnot positive definite or contains NA values: falling back to var-cov estimated from RXNonlinear mixed model fit by maximum likelihood ['nlmerMod']Formula: circumference ~ SSlogis(age, Asym, xmid, scal) ~ Asym | Tree Data: Orange AIC BIC logLik -2*log(L) df.resid -49.2 -41.4 29.6 -59.2 30 Scaled residuals: Min 1Q Median 3Q Max -1.9170 -0.5421 0.1754 0.7116 1.6820 Random effects: Groups Name Variance Std.Dev. Tree Asym 0.100149 0.31646 Residual 0.006151 0.07843 Number of obs: 35, groups: Tree, 5Fixed effects: Estimate Std. Error t valueAsym 1.9205 0.1558 12.32xmid 7.2791 0.3444 21.14scal 3.4807 0.2631 13.23Correlation of Fixed Effects: Asym xmid xmid 0.384 scal 0.362 0.762Note the warning messages indicating difficulty estimating thevariance-covariance matrix. Althoughlme4 has afallback mechanism, the need to utilize it suggests that the sample istoo small to sustain the asymptotic assumptions underlying the maximumlikelihood estimator.
In the above example, we use the `SSlogis` function, which is a lot like the logistic CDF, but with an additional `Asym` argument that need not be one and indicates what value the function approaches for large values of the first argument. In this case, we can interpret the asymptote as the maximum possible circumference for an orange. However, this asymptote is allowed to vary from tree to tree using the `Asym | Tree` syntax, which reflects an assumption that the asymptote for a randomly-selected tree deviates from the asymptote for the population of orange trees in a Gaussian fashion with mean zero and an unknown standard deviation.
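Concretely, `SSlogis(x, Asym, xmid, scal)` evaluates the three-parameter logistic curve \(\text{Asym} / \left(1 + e^{(\text{xmid} - x)/\text{scal}}\right)\), which can be checked against the formula written out by hand:

```r
x <- seq(0, 16, by = 2)

# as.vector() drops the gradient attribute that self-starting
# functions attach to their return value
all.equal(
  as.vector(SSlogis(x, Asym = 2, xmid = 7.25, scal = 3.5)),
  2 / (1 + exp((7.25 - x) / 3.5))
)  # TRUE
```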
The `nlmer` function supports user-defined non-linear functions, whereas the `stan_nlmer` function only supports the pre-defined non-linear functions starting with `SS` in the **stats** package, which are
[1] "SSasymp" "SSasympOff" "SSasympOrig" "SSbiexp" "SSfol" [6] "SSfpl" "SSgompertz" "SSlogis" "SSmicmen" "SSweibull"To fit essentially the same model using Stan’s implementation ofMCMC, we add astan_ prefix
post1<-stan_nlmer(circumference~SSlogis(age, Asym, xmid, scal)~ Asym|Tree,data = Orange,cores =2,seed =12345,init_r =0.5)post1stan_nlmer family: gaussian [inv_SSlogis] formula: circumference ~ SSlogis(age, Asym, xmid, scal) ~ Asym | Tree observations: 35------ Median MAD_SDAsym 1.9 0.1 xmid 7.2 0.4 scal 3.4 0.3 Auxiliary parameter(s): Median MAD_SDsigma 0.1 0.0 Error terms: Groups Name Std.Dev. Tree Asym 0.31 Residual 0.09 Num. levels: Tree 5 ------* For help interpreting the printed output see ?print.stanreg* For info on the priors used see ?prior_summary.stanregInstan_nlmer, it is not necessary to supply startingvalues; however, in this case it was necessary to specify theinit_r argument so that the randomly-chosen starting valueswere not more than\(0.5\) away fromzero (in the unconstrained parameter space). The default value of\(2.0\) produced suboptimal results.
As can be seen, the posterior medians and estimated standard deviations in the MCMC case are quite similar to the maximum likelihood estimates and estimated standard errors. However, `stan_nlmer` produces uncertainty estimates for the tree-specific deviations in the asymptote, which are considerable.
```r
plot(post1, regex_pars = "^[b]")
```

As can be seen, the age of the tree has a non-linear effect on the predicted circumference of the tree (here for an out-of-sample tree):
nd<-data.frame(age =1:20,Tree =factor("6",levels =1:6))PPD<-posterior_predict(post1,newdata = nd)PPD_df<-data.frame(age =as.factor(rep(1:20,each =nrow(PPD))),circumference =c(PPD))ggplot(PPD_df,aes(age, circumference))+geom_boxplot()If we were pharmacological, we could evaluate drug concentrationusing a first-order compartment model, such as
post3<-stan_nlmer(conc~SSfol(Dose, Time, lKe, lKa, lCl)~ (0+ lKe+ lKa+ lCl| Subject),data = Theoph,cores =2,seed =12345,QR =TRUE,init_r =0.25,adapt_delta =0.999)pairs(post3,regex_pars ="^l")pairs(post3,regex_pars ="igma")However, in this case the posterior distribution is bimodal Thus, youshould always be running many chains when using Stan, especiallystan_nlmer.
There are model fitting functions in the **rstanarm** package that can do essentially all of what can be done in the **lme4** and **gamm4** packages (in the sense that they can fit models with multilevel structure and/or nonlinear relationships) and propagate the uncertainty in the parameter estimates to the predictions and other functions of interest. The documentation of **lme4** and **gamm4** has various warnings that acknowledge that the estimated standard errors, confidence intervals, etc. are not entirely correct, even from a frequentist perspective.
A frequentist point estimate would also completely miss the second mode in the last example with `stan_nlmer`. Thus, there is considerable reason to prefer the **rstanarm** variants of these functions for regression modeling. The only disadvantage is the execution time required to produce an answer that properly captures the uncertainty in the estimates of complicated models such as these.