```r
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
```

This vignette explains how to use the `stan_lmer`, `stan_glmer`, `stan_nlmer`, and `stan_gamm4` functions in the **rstanarm** package to estimate linear and generalized (non-)linear models with parameters that may vary across groups. Before continuing, we recommend reading the vignettes (navigate up one level) for the various ways to use the `stan_glm` function. The _Hierarchical Partial Pooling_ vignette also has examples of both `stan_glm` and `stan_glmer`.
Models with this structure are referred to by many names: multilevel models, (generalized) linear mixed (effects) models (GLMM), hierarchical (generalized) linear models, etc. In the simplest case, the model for an outcome can be written as \[\mathbf{y} = \alpha + \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{b} + \boldsymbol{\epsilon},\] where \(\mathbf{X}\) is a matrix of predictors that is analogous to that in Generalized Linear Models and \(\mathbf{Z}\) is a matrix that encodes deviations in the predictors across specified groups.
The terminology for the unknowns in the model is diverse. To frequentists, the error term consists of \(\mathbf{Z}\mathbf{b} + \boldsymbol{\epsilon}\) and the observations within each group are _not_ independent conditional on \(\mathbf{X}\) alone. Since \(\mathbf{b}\) is considered part of the random error term, frequentists allow themselves to make distributional assumptions about \(\mathbf{b}\), invariably that it is distributed multivariate normal with mean vector zero and structured covariance matrix \(\boldsymbol{\Sigma}\). If \(\epsilon_i\) is also distributed (univariate) normal with mean zero and standard deviation \(\sigma\), then \(\mathbf{b}\) can be integrated out, which implies \[\mathbf{y} \thicksim \mathcal{N}\left(\alpha + \mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I} + \mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^\top\right),\] and it is possible to maximize this likelihood function by choosing proposals for the parameters \(\alpha\), \(\boldsymbol{\beta}\), \(\sigma\), and (the free elements of) \(\boldsymbol{\Sigma}\).
Consequently, frequentists refer to \(\mathbf{b}\) as the _random effects_ because they capture the random deviation in the effects of predictors from one group to the next. In contradistinction, \(\alpha\) and \(\boldsymbol{\beta}\) are referred to as _fixed effects_ because they are the same for all groups. Moreover, \(\alpha\) and \(\boldsymbol{\beta}\) persist in the model in hypothetical replications of the analysis that draw the members of the groups afresh every time, whereas \(\mathbf{b}\) would differ from one replication to the next. Consequently, \(\mathbf{b}\) is not a "parameter" to be estimated because parameters are unknown constants that are fixed in repeated sampling.
Bayesians condition on the data in-hand without reference to repeated sampling and describe their _beliefs_ about the unknowns with prior distributions before observing the data. Thus, the likelihood in a simple hierarchical model in **rstanarm** is \[\mathbf{y} \thicksim \mathcal{N}\left(\alpha + \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b}, \sigma^2 \mathbf{I}\right)\] and the observations are independent conditional on \(\mathbf{X}\) and \(\mathbf{Z}\). In this formulation, there is no fundamental distinction between "fixed" and "random" effects: \(\alpha\), \(\boldsymbol{\beta}\), and \(\mathbf{b}\) are all unknowns to be estimated rather than integrated out.
Bayesians are compelled to state their prior beliefs about all unknowns and the usual assumption (which is maintained in **rstanarm**) is that \(\mathbf{b} \thicksim \mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}\right),\) but it is then necessary to state prior beliefs about \(\boldsymbol{\Sigma}\), in addition to \(\alpha\), \(\boldsymbol{\beta}\), and \(\sigma\).
One of the many challenges of fitting models to data comprising multiple groupings is confronting the tradeoff between validity and precision. An analysis that disregards between-group heterogeneity can yield parameter estimates that are wrong if there is between-group heterogeneity but would be relatively precise if there actually were no between-group heterogeneity. Group-by-group analyses, on the other hand, are valid but produce estimates that are relatively imprecise. While complete pooling or no pooling of data across groups is sometimes called for, models that ignore the grouping structures in the data tend to underfit or overfit (Gelman et al., 2013). Hierarchical modeling provides a compromise by allowing parameters to vary by group at lower levels of the hierarchy while estimating common parameters at higher levels. Inference for each group-level parameter is informed not only by the group-specific information contained in the data but also by the data for other groups as well. This is commonly referred to as _borrowing strength_ or _shrinkage_.
In **rstanarm**, these models can be estimated using the `stan_lmer` and `stan_glmer` functions, which are similar in syntax to the `lmer` and `glmer` functions in the **lme4** package. However, rather than performing (restricted) maximum likelihood ((RE)ML) estimation, Bayesian estimation is performed via MCMC. The Bayesian model adds independent prior distributions on the regression coefficients (in the same way as `stan_glm`) as well as priors on the terms of a decomposition of the covariance matrices of the group-specific parameters. These priors are discussed in greater detail below.
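For instance (an illustration not in the original analysis, using **lme4**'s `sleepstudy` data), a model with intercepts and slopes for `Days` that vary by `Subject` can be fit as follows:

```r
library(rstanarm)
data("sleepstudy", package = "lme4")  # requires lme4 to be installed

# Varying intercepts and varying slopes for Days across Subjects,
# estimated by full Bayesian inference via MCMC (default priors)
fit <- stan_lmer(
  Reaction ~ Days + (Days | Subject),
  data = sleepstudy,
  chains = 2, cores = 2, seed = 12345
)
summary(fit)
```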
In this section we discuss a flexible family of prior distributions for the unknown covariance matrices of the group-specific coefficients.
For each group, we assume the vector of varying slopes and intercepts is a zero-mean random vector following a multivariate Gaussian distribution with an unknown covariance matrix to be estimated. Unfortunately, expressing prior information about a covariance matrix is not intuitive and can also be computationally challenging. When the covariance matrix is not \(1\times 1\), it is often both much more intuitive and efficient to work instead with the _correlation_ matrix and _variances_. When the covariance matrix is \(1\times 1\), we still denote it as \(\boldsymbol{\Sigma}\) but most of the details in this section do not apply.
The variances are in turn decomposed into the product of a simplex vector (probability vector) and the trace of the implied covariance matrix, which is defined as the sum of its diagonal elements. Finally, this trace is set equal to the product of the order of the matrix and the square of a scale parameter. This implied prior on a covariance matrix is represented by the `decov` (short for decomposition of covariance) function in **rstanarm**.
Using the decomposition described above, the prior used for a correlation matrix \(\Omega\) is called the LKJ distribution and has a probability density function proportional to the determinant of the correlation matrix raised to a power of \(\zeta\) minus one:
\[ f(\Omega | \zeta) \propto\text{det}(\Omega)^{\zeta - 1}, \quad \zeta > 0. \]
The shape of this prior depends on the value of the regularization parameter, \(\zeta\), in the following ways:

- If \(\zeta = 1\) (the default), the LKJ prior is jointly uniform over all correlation matrices of the same dimension as \(\Omega\).
- If \(\zeta > 1\), the mode of the distribution is the identity matrix, and the larger the value of \(\zeta\), the more sharply the prior concentrates around the identity matrix, favoring smaller correlations.
- If \(0 < \zeta < 1\), the prior has a trough at the identity matrix, placing more mass on correlation matrices with more extreme correlations.
The \(J \times J\) covariance matrix \(\Sigma\) of a random vector \(\boldsymbol{\theta} = (\theta_1, \dots, \theta_J)\) has diagonal entries \({\Sigma}_{jj} = \sigma^2_j = \text{var}(\theta_j)\). Therefore, the trace of the covariance matrix is equal to the sum of the variances. We set the trace equal to the product of the order of the covariance matrix and the square of a positive scale parameter \(\tau\):
\[\text{tr}(\Sigma) = \sum_{j=1}^{J}\Sigma_{jj} = J\tau^2.\]
The vector of variances is set equal to the product of a simplex vector \(\boldsymbol{\pi}\) (which is non-negative and sums to 1) and the scalar trace: \(J \tau^2 \boldsymbol{\pi}\). Each element \(\pi_j\) of \(\boldsymbol{\pi}\) then represents the proportion of the trace (total variance) attributable to the corresponding variable \(\theta_j\).
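As a small numeric sketch of this decomposition (the values of \(J\), \(\tau\), and \(\boldsymbol{\pi}\) below are made up for illustration):

```r
# Hypothetical values, for illustration only
J       <- 3                  # order of the covariance matrix
tau     <- 2                  # scale parameter
simplex <- c(0.5, 0.3, 0.2)   # non-negative and sums to 1

# Per-variable variances implied by the decomposition: 6.0, 3.6, 2.4
variances <- J * tau^2 * simplex

# The variances sum back to the trace, J * tau^2 = 12
all.equal(sum(variances), J * tau^2)
```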
For the simplex vector \(\boldsymbol{\pi}\) we use a symmetric Dirichlet prior, which has a single _concentration_ parameter \(\gamma > 0\):

- If \(\gamma = 1\) (the default), the prior is jointly uniform over the space of simplex vectors of that size.
- If \(\gamma > 1\), the prior mode corresponds to all variables accounting for equal proportions of the total variance, and larger values of \(\gamma\) concentrate the prior more tightly around that mode.
- If \(0 < \gamma < 1\), the prior is polarized, placing more mass on simplexes in which a few variables account for most of the total variance.
If all the elements of \(\boldsymbol{\theta}\) were multiplied by the same number \(k\), the trace of their covariance matrix would increase by a factor of \(k^2\). For this reason, it is sensible to use a scale-invariant prior for \(\tau\). We choose a Gamma distribution, with shape and scale parameters both set to \(1\) by default, implying a unit-exponential distribution. Users can set the shape hyperparameter to some value greater than one to ensure that the posterior trace is not zero. In the case where \(\boldsymbol{\Sigma}\) is \(1\times 1\), \(\tau\) is the cross-group standard deviation in the parameters and its square is the variance (so the Gamma prior with its shape and scale directly applies to the cross-group standard deviation in the parameters).
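The arguments of the `decov` function map directly onto these hyperparameters: `regularization` is the LKJ \(\zeta\), `concentration` is the Dirichlet \(\gamma\), and `shape` and `scale` govern the Gamma prior on \(\tau\). All four default to \(1\):

```r
library(rstanarm)

# decov() shown with all arguments at their default values;
# larger regularization/concentration values regularize more strongly
cov_prior <- decov(regularization = 1, concentration = 1,
                   shape = 1, scale = 1)
```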
There are several advantages to estimating these models using **rstanarm** rather than the **lme4** package. There are also a few drawbacks. In this section we briefly discuss what we find to be the two most important advantages as well as an important disadvantage.
While **lme4** uses (restricted) maximum likelihood ((RE)ML) estimation, **rstanarm** enables full Bayesian inference via MCMC to be performed. It is well known that (RE)ML tends to underestimate uncertainties because it relies on point estimates of hyperparameters. Full Bayes, on the other hand, propagates the uncertainty in the hyperparameters throughout all levels of the model and provides more appropriate estimates of uncertainty for models that consist of a mix of common and group-specific parameters.
The `stan_glmer` and `stan_lmer` functions allow the user to specify prior distributions over the regression coefficients as well as any unknown covariance matrices. There are various reasons to specify priors, from helping to stabilize computation to incorporating important information into an analysis that does not enter through the data.
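As a sketch (the prior values below are chosen purely for illustration, not as recommendations), user-specified priors for the `sleepstudy` model shown earlier might look like this:

```r
fit_informative <- stan_lmer(
  Reaction ~ Days + (Days | Subject),
  data = sleepstudy,
  prior = normal(0, 2, autoscale = TRUE),        # regression coefficients
  prior_intercept = normal(0, 10),               # intercept
  prior_covariance = decov(regularization = 2),  # group-specific terms
  chains = 2, cores = 2, seed = 12345
)
```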
The benefits of full Bayesian inference (via MCMC) come with a cost. Fitting models with (RE)ML will tend to be much faster than fitting a similar model using MCMC. Speed comparable to **lme4** can be obtained with **rstanarm** using approximate Bayesian inference via the mean-field and full-rank variational algorithms (see `help("rstanarm-package", "rstanarm")` for details). These algorithms can be useful to narrow the set of candidate models in large problems, but MCMC should always be used for final statistical inference.
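The variational algorithms are selected via the `algorithm` argument. A sketch, again using the `sleepstudy` example:

```r
# Fast approximation for model exploration only; use the default
# algorithm = "sampling" (MCMC) for final statistical inference
approx_fit <- stan_lmer(
  Reaction ~ Days + (Days | Subject),
  data = sleepstudy,
  algorithm = "meanfield",  # or "fullrank"
  seed = 12345
)
```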
## Relationship to glmer

In the **lme4** package, there is a fundamental distinction between the way that Linear Mixed Models and Generalized Linear Mixed Models are estimated. In Linear Mixed Models, \(\mathbf{b}\) can be integrated out analytically, leaving a likelihood function that can be maximized over proposals for the parameters. To estimate a Linear Mixed Model, one can call the `lmer` function.
Generalized Linear Mixed Models are appropriate when the conditional mean of the outcome is determined by an inverse link function, \(\boldsymbol{\mu} = g\left(\alpha + \mathbf{X} \boldsymbol{\beta} + \mathbf{Z}\mathbf{b}\right)\). If \(g\left(\cdot\right)\) is not the identity function, then it is not possible to integrate out \(\mathbf{b}\) analytically and numerical integration must be used. To estimate a Generalized Linear Mixed Model, one can call the `glmer` function and specify the `family` argument.
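For example (using **lme4**'s built-in `cbpp` data, as in the `glmer` documentation), a binomial GLMM with intercepts that vary by herd:

```r
library(lme4)
data(cbpp)  # contagious bovine pleuropneumonia incidence data

# b cannot be integrated out analytically here, so glmer
# approximates the integral numerically
gm1 <- glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp, family = binomial
)
```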
In the **rstanarm** package, there is no such fundamental distinction; in fact `stan_lmer` simply calls `stan_glmer` with `family = gaussian(link = "identity")`. Bayesians do not (have to) integrate \(\mathbf{b}\) out of the likelihood and if \(\mathbf{b}\) is not of interest, then the margins of its posterior distribution can simply be ignored.
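The Bayesian counterpart of the `cbpp` model above just swaps in `stan_glmer` (the MCMC settings shown are illustrative):

```r
library(rstanarm)

# Same model as gm1, estimated by MCMC; b is retained in the posterior
stan_gm1 <- stan_glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp, family = binomial,
  chains = 2, cores = 2, seed = 12345
)
```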
## Relationship to gamm4

The **rstanarm** package includes a `stan_gamm4` function that is similar to the `gamm4` function in the **gamm4** package, which is in turn similar to the `gamm` function in the **mgcv** package. The substring `gamm` stands for Generalized Additive Mixed Models, which differ from Generalized Additive Models (GAMs) due to the presence of group-specific terms that can be specified with the syntax of **lme4**. Both GAMs and GAMMs include nonlinear functions of (non-categorical) predictors called "smooths". In the example below, so-called "thin-plate splines" are used to model counts of roaches where we might fear that the number of roaches in the current period is an exponentially increasing function of the number of roaches in the previous period. Unlike `stan_glmer`, in `stan_gamm4` it is necessary to specify group-specific terms as a one-sided formula that is passed to the `random` argument, as in the `lme` function in the **nlme** package.
```r
library(rstanarm)
data(roaches)
roaches$roach1 <- roaches$roach1 / 100
roaches$log_exposure2 <- log(roaches$exposure2)

post <- stan_gamm4(
  y ~ s(roach1) + treatment + log_exposure2,
  random = ~ (1 | senior),
  data = roaches,
  family = neg_binomial_2,
  QR = TRUE,
  cores = 2,
  chains = 2,
  adapt_delta = 0.99,
  seed = 12345
)
plot_nonlinear(post)
```

Here we see that the relationship between past and present roaches is estimated to be nonlinear. For a small number of past roaches, the function is steep and then it appears to flatten out, although we become highly uncertain about the function in the rare cases where the number of past roaches is large.
## Relationship to nlmer

The `stan_gamm4` function allows designated predictors to have a nonlinear effect on what would otherwise be called the "linear" predictor in Generalized Linear Models. The `stan_nlmer` function is similar to the `nlmer` function in the **lme4** package, and essentially allows a wider range of nonlinear functions that relate the linear predictor to the conditional expectation of a Gaussian outcome.
To estimate an example model with the `nlmer` function in the **lme4** package, we start by rescaling the outcome and main predictor(s) by a constant:
data("Orange",package ="datasets")Orange$age<- Orange$age/100Orange$circumference<- Orange$circumference/100Although doing so has no substantive effect on the inferencesobtained, it is numerically much easier for Stan and forlme4 to work with variables whose units are such thatthe estimated parameters tend to be single-digit numbers that are nottoo close to zero. Thenlmer function requires that theuser pass starting values to the ironically-named self-startingnon-linear function:
startvec<-c(Asym =2,xmid =7.25,scal =3.5)library(lme4)nm1<-nlmer(circumference~SSlogis(age, Asym, xmid, scal)~ Asym|Tree,data = Orange,start = startvec)summary(nm1)Warning in vcov.merMod(object, use.hessian = use.hessian): variance-covariance matrix computed from finite-difference Hessian isnot positive definite or contains NA values: falling back to var-cov estimated from RXWarning in vcov.merMod(object, correlation = correlation, sigm = sig): variance-covariance matrix computed from finite-difference Hessian isnot positive definite or contains NA values: falling back to var-cov estimated from RXNonlinear mixed model fit by maximum likelihood ['nlmerMod']Formula: circumference ~ SSlogis(age, Asym, xmid, scal) ~ Asym | Tree Data: Orange AIC BIC logLik -2*log(L) df.resid -49.2 -41.4 29.6 -59.2 30 Scaled residuals: Min 1Q Median 3Q Max -1.9170 -0.5421 0.1754 0.7116 1.6820 Random effects: Groups Name Variance Std.Dev. Tree Asym 0.100149 0.31646 Residual 0.006151 0.07843 Number of obs: 35, groups: Tree, 5Fixed effects: Estimate Std. Error t valueAsym 1.9205 0.1558 12.32xmid 7.2791 0.3444 21.14scal 3.4807 0.2631 13.23Correlation of Fixed Effects: Asym xmid xmid 0.384 scal 0.362 0.762Note the warning messages indicating difficulty estimating thevariance-covariance matrix. Althoughlme4 has afallback mechanism, the need to utilize it suggests that the sample istoo small to sustain the asymptotic assumptions underlying the maximumlikelihood estimator.
In the above example, we use the `SSlogis` function, which is a lot like the logistic CDF, but with an additional `Asym` argument that need not be one and indicates what value the function approaches for large values of the first argument. In this case, we can interpret the asymptote as the maximum possible circumference for an orange. However, this asymptote is allowed to vary from tree to tree using the `Asym | Tree` syntax, which reflects an assumption that the asymptote for a randomly-selected tree deviates from the asymptote for the population of orange trees in a Gaussian fashion with mean zero and an unknown standard deviation.
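Concretely, `SSlogis(x, Asym, xmid, scal)` evaluates the three-parameter logistic curve \(\text{Asym} / \left(1 + e^{(\text{xmid} - x)/\text{scal}}\right)\), which can be checked against the formula written out by hand:

```r
x <- seq(0, 16, by = 2)

# as.vector() drops the gradient attribute that self-starting
# functions attach to their return value
all.equal(
  as.vector(SSlogis(x, Asym = 2, xmid = 7.25, scal = 3.5)),
  2 / (1 + exp((7.25 - x) / 3.5))
)  # TRUE
```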
The `nlmer` function supports user-defined non-linear functions, whereas the `stan_nlmer` function only supports the pre-defined non-linear functions starting with `SS` in the **stats** package, which are
[1] "SSasymp" "SSasympOff" "SSasympOrig" "SSbiexp" "SSfol" [6] "SSfpl" "SSgompertz" "SSlogis" "SSmicmen" "SSweibull"To fit essentially the same model using Stan’s implementation ofMCMC, we add astan_ prefix
post1<-stan_nlmer(circumference~SSlogis(age, Asym, xmid, scal)~ Asym|Tree,data = Orange,cores =2,seed =12345,init_r =0.5)post1stan_nlmer family: gaussian [inv_SSlogis] formula: circumference ~ SSlogis(age, Asym, xmid, scal) ~ Asym | Tree observations: 35------ Median MAD_SDAsym 1.9 0.1 xmid 7.2 0.4 scal 3.4 0.3 Auxiliary parameter(s): Median MAD_SDsigma 0.1 0.0 Error terms: Groups Name Std.Dev. Tree Asym 0.31 Residual 0.09 Num. levels: Tree 5 ------* For help interpreting the printed output see ?print.stanreg* For info on the priors used see ?prior_summary.stanregInstan_nlmer, it is not necessary to supply startingvalues; however, in this case it was necessary to specify theinit_r argument so that the randomly-chosen starting valueswere not more than\(0.5\) away fromzero (in the unconstrained parameter space). The default value of\(2.0\) produced suboptimal results.
As can be seen, the posterior medians and estimated standard deviations in the MCMC case are quite similar to the maximum likelihood estimates and estimated standard errors. However, `stan_nlmer` produces uncertainty estimates for the tree-specific deviations in the asymptote, which are considerable.
```r
plot(post1, regex_pars = "^[b]")
```

As can be seen, the age of the tree has a non-linear effect on the predicted circumference of the tree (here for an out-of-sample tree):
nd<-data.frame(age =1:20,Tree =factor("6",levels =1:6))PPD<-posterior_predict(post1,newdata = nd)PPD_df<-data.frame(age =as.factor(rep(1:20,each =nrow(PPD))),circumference =c(PPD))ggplot(PPD_df,aes(age, circumference))+geom_boxplot()If we were pharmacological, we could evaluate drug concentrationusing a first-order compartment model, such as
post3<-stan_nlmer(conc~SSfol(Dose, Time, lKe, lKa, lCl)~ (0+ lKe+ lKa+ lCl| Subject),data = Theoph,cores =2,seed =12345,QR =TRUE,init_r =0.25,adapt_delta =0.999)pairs(post3,regex_pars ="^l")pairs(post3,regex_pars ="igma")However, in this case the posterior distribution is bimodal Thus, youshould always be running many chains when using Stan, especiallystan_nlmer.
There are model fitting functions in the **rstanarm** package that can do essentially all of what can be done in the **lme4** and **gamm4** packages (in the sense that they can fit models with multilevel structure and/or nonlinear relationships) and propagate the uncertainty in the parameter estimates to the predictions and other functions of interest. The documentation of **lme4** and **gamm4** has various warnings that acknowledge that the estimated standard errors, confidence intervals, etc. are not entirely correct, even from a frequentist perspective.
A frequentist point estimate would also completely miss the second mode in the last example with `stan_nlmer`. Thus, there is considerable reason to prefer the **rstanarm** variants of these functions for regression modeling. The only disadvantage is the execution time required to produce an answer that properly captures the uncertainty in the estimates of complicated models such as these.