Estimating Regularized Linear Models with rstanarm

Jonah Gabry and Ben Goodrich

2025-09-29

library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())

Introduction

This vignette explains how to estimate linear models using the stan_lm function in the rstanarm package.

The four steps of a Bayesian analysis are

  1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknowns. This joint distribution is proportional to a posterior distribution of the unknowns conditional on the observed data.
  2. Draw from the posterior distribution using Markov chain Monte Carlo (MCMC).
  3. Evaluate how well the model fits the data and possibly revise the model.
  4. Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects (a function of) the outcome(s).

Steps 3 and 4 are covered in more depth by the vignette entitled “How to Use the rstanarm Package”. This vignette focuses on Step 1 when the likelihood is the product of independent normal distributions.

The goal of the rstanarm package is to make Bayesian estimation of common regression models routine. That goal can be partially accomplished by providing interfaces that are similar to the popular formula-based interfaces to frequentist estimators of those regression models. But fully accomplishing that goal sometimes entails utilizing priors that applied researchers are unaware that they prefer. These priors are intended to work well for any data a user might pass to the interface, provided the data were generated according to the assumptions of the likelihood function.

It is important to distinguish between priors that are easy for applied researchers to specify and priors that are easy for applied researchers to conceptualize. The prior described below emphasizes the former, but we outline its derivation so that applied researchers may feel more comfortable utilizing it.

Likelihood

The likelihood for one observation under a linear model can be written as a conditionally normal PDF \[\frac{1}{\sigma_{\epsilon} \sqrt{2 \pi}} e^{-\frac{1}{2} \left(\frac{y - \mu}{\sigma_{\epsilon}}\right)^2},\] where \(\mu = \alpha + \mathbf{x}^\top \boldsymbol{\beta}\) is a linear predictor and \(\sigma_{\epsilon}\) is the standard deviation of the error in predicting the outcome, \(y\). The likelihood of the entire sample is the product of \(N\) individual likelihood contributions.
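As a concrete illustration, the following sketch evaluates this log-likelihood on simulated data (all values and variable names here are hypothetical, chosen only for the example):

set.seed(123)
N <- 100
K <- 3
X <- matrix(rnorm(N * K), N, K)    # N x K matrix of predictors
alpha <- 1
beta <- c(0.5, -0.25, 2)
sigma <- 1.5
y <- rnorm(N, mean = alpha + X %*% beta, sd = sigma)

# log-likelihood of the sample: a sum of N conditionally normal log densities
sum(dnorm(y, mean = alpha + X %*% beta, sd = sigma, log = TRUE))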

It is well-known that the likelihood of the sample is maximized when the sum-of-squared residuals is minimized, which occurs when \[ \widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}, \] \[ \widehat{\alpha} = \overline{y} - \overline{\mathbf{x}}^\top \widehat{\boldsymbol{\beta}}, \] \[ \widehat{\sigma}_{\epsilon}^2 = \frac{\left(\mathbf{y} - \widehat{\alpha} - \mathbf{X} \widehat{\boldsymbol{\beta}}\right)^\top \left(\mathbf{y} - \widehat{\alpha} - \mathbf{X} \widehat{\boldsymbol{\beta}}\right)}{N}, \] where \(\overline{\mathbf{x}}\) is a vector that contains the sample means of the \(K\) predictors, \(\mathbf{X}\) is a \(N \times K\) matrix of centered predictors, \(\mathbf{y}\) is a \(N\)-vector of outcomes and \(\overline{y}\) is the sample mean of the outcome.
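Continuing the simulated-data sketch from above, these closed-form solutions can be checked against lm (Xc below is the centered predictor matrix):

Xc <- scale(X, center = TRUE, scale = FALSE)          # centered predictors
beta_hat <- solve(crossprod(Xc), crossprod(Xc, y))    # (X'X)^{-1} X'y
alpha_hat <- mean(y) - colMeans(X) %*% beta_hat       # ybar - xbar' beta_hat
sigma2_hat <- sum((y - mean(y) - Xc %*% beta_hat)^2) / N
# beta_hat and alpha_hat match coef(lm(y ~ X)); sigma2_hat is the (biased)
# maximum likelihood estimate: the residual sum of squares divided by N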

QR Decomposition

The lm function in R actually performs a QR decomposition of the design matrix, \(\mathbf{X} = \mathbf{Q}\mathbf{R}\), where \(\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}\) and \(\mathbf{R}\) is upper triangular. Thus, the OLS solution for the coefficients can be written as \(\left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y} = \mathbf{R}^{-1} \mathbf{Q}^\top \mathbf{y}\). The lm function utilizes the QR decomposition for numerical stability reasons, but the QR decomposition is also useful for thinking about priors in a Bayesian version of the linear model. In addition, writing the likelihood in terms of \(\mathbf{Q}\) allows it to be evaluated in a very efficient manner in Stan.
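This equivalence is easy to verify; the following sketch (reusing Xc, y, and beta_hat from the previous blocks) mirrors what lm does internally:

QR <- qr(Xc)                          # QR decomposition of the centered design matrix
Q <- qr.Q(QR)                         # columns are orthonormal: Q'Q = I
R <- qr.R(QR)                         # upper triangular
theta <- crossprod(Q, y)              # theta = Q'y
beta_qr <- backsolve(R, theta)        # R^{-1} Q'y
all.equal(c(beta_qr), c(beta_hat))    # TRUE: identical to the OLS solution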

Priors

The key innovation in the stan_lm function in the rstanarm package is the prior for the parameters in the QR-reparameterized model. To understand this prior, think about the equations that characterize the maximum likelihood solutions before observing the data on \(\mathbf{X}\) and especially \(\mathbf{y}\).

What would the prior distribution of \(\boldsymbol{\theta} = \mathbf{Q}^\top \mathbf{y}\) be? We can write its \(k\)-th element as \(\theta_k = \rho_k \sigma_y \sqrt{N - 1}\), where \(\rho_k\) is the correlation between the \(k\)-th column of \(\mathbf{Q}\) and the outcome, \(\sigma_y\) is the standard deviation of the outcome, and \(\frac{1}{\sqrt{N-1}}\) is the standard deviation of the \(k\)-th column of \(\mathbf{Q}\). Then let \(\boldsymbol{\rho} = \sqrt{R^2}\mathbf{u}\) where \(\mathbf{u}\) is a unit vector that is uniformly distributed on the surface of a hypersphere. Consequently, \(R^2 = \boldsymbol{\rho}^\top \boldsymbol{\rho}\) is the familiar coefficient of determination for the linear model.
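The identity \(\theta_k = \rho_k \sigma_y \sqrt{N - 1}\) holds exactly in-sample because the columns of \(\mathbf{Q}\) have mean zero (the predictors were centered) and standard deviation \(\frac{1}{\sqrt{N-1}}\); continuing the sketch from the previous section:

rho <- cor(Q, y)    # correlations between the columns of Q and the outcome
all.equal(c(crossprod(Q, y)), c(rho * sd(y) * sqrt(N - 1)))    # TRUE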

An uninformative prior on \(R^2\) would be standard uniform, which is a special case of a Beta distribution with both shape parameters equal to \(1\). A non-uniform prior on \(R^2\) is somewhat analogous to ridge regression, which is popular in data mining and produces better out-of-sample predictions than least squares because it penalizes \(\boldsymbol{\beta}^\top \boldsymbol{\beta}\), usually after standardizing the predictors. An informative prior on \(R^2\) effectively penalizes \(\boldsymbol{\rho}^\top \boldsymbol{\rho}\), which encourages \(\boldsymbol{\beta} = \mathbf{R}^{-1} \boldsymbol{\theta}\) to be closer to the origin.

Lewandowski, Kurowicka, and Joe (2009) derive a distribution for a correlation matrix that depends on a single shape parameter \(\eta > 0\), which implies the variance of one variable given the remaining \(K\) variables is \(\mathrm{Beta}\left(\eta,\frac{K}{2}\right)\). Thus, the \(R^2\) is distributed \(\mathrm{Beta}\left(\frac{K}{2},\eta\right)\) and any prior information about the location of \(R^2\) can be used to choose a value of the hyperparameter \(\eta\). The R2(location, what) function in the rstanarm package supports four ways of choosing \(\eta\), illustrated in the sketch after this list:

  1. what = "mode" andlocation is some priormode on the\(\left(0,1\right)\)interval. This is the default but since the mode of a\(\mathrm{Beta}\left(\frac{K}{2},\eta\right)\)distribution is\(\frac{\frac{K}{2} -1}{\frac{K}{2} + \eta - 2}\) the mode only exists if\(K > 2\). If\(K \leq 2\), then the user must specifysomething else forwhat.
  2. what = "mean" andlocation is some priormean on the\(\left(0,1\right)\)interval, where the mean of a\(\mathrm{Beta}\left(\frac{K}{2},\eta\right)\)distribution is\(\frac{\frac{K}{2}}{\frac{K}{2} +\eta}\).
  3. what = "median" andlocation is some priormedian on the\(\left(0,1\right)\)interval. The median of a\(\mathrm{Beta}\left(\frac{K}{2},\eta\right)\)distribution is not available in closed form but if\(K > 2\) it is approximately equal to\(\frac{\frac{K}{2} - \frac{1}{3}}{\frac{K}{2}+ \eta - \frac{2}{3}}\). Regardless of whether\(K > 2\), theR2 functioncan numerically solve for the value of\(\eta\) that is consistent with a givenprior median utilizing the quantile function.
  4. what = "log" andlocation is some(negative) prior value for\(\mathbb{E} \lnR^2 =\psi\left(\frac{K}{2}\right)- \psi\left(\frac{K}{2}+\eta\right)\),where\(\psi\left(\cdot\right)\) is thedigamma function. Again, given a prior value for theleft-hand side it is easy to numerically solve for the correspondingvalue of\(\eta\).

There is no default value for the location argument of the R2 function. This is an informative prior on \(R^2\), which must be chosen by the user in light of the research project. However, specifying location = 0.5 is often safe, in which case \(\eta = \frac{K}{2}\) regardless of whether what is "mode", "mean", or "median". In addition, it is possible to specify location = NULL, in which case a standard uniform prior on \(R^2\) is utilized.

We set \(\sigma_y = \omega s_y\) where \(s_y\) is the sample standard deviation of the outcome and \(\omega > 0\) is an unknown scale parameter to be estimated. The only prior for \(\omega\) that does not contravene Bayes’ theorem in this situation is Jeffreys prior, \(f\left(\omega\right) \propto \frac{1}{\omega}\), which is proportional to a Jeffreys prior on the unknown \(\sigma_y\), \(f\left(\sigma_y\right) \propto \frac{1}{\sigma_y} = \frac{1}{\omega s_y} \propto \frac{1}{\omega}\). This parameterization and prior make it easy for Stan to work with any continuous outcome variable, no matter what its units of measurement are.

It would seem that we need a prior for \(\sigma_{\epsilon}\), but our prior beliefs about \(\sigma_{\epsilon} = \omega s_y \sqrt{1 - R^2}\) are already implied by our prior beliefs about \(\omega\) and \(R^2\). That only leaves a prior for \(\alpha = \overline{y} - \overline{\mathbf{x}}^\top \mathbf{R}^{-1} \boldsymbol{\theta}\). The default choice is an improper uniform prior, but a normal prior can also be specified, such as one with mean zero and standard deviation \(\frac{\sigma_y}{\sqrt{N}}\).

Posterior

The previous sections imply a posterior distribution for \(\omega\), \(\alpha\), \(\mathbf{u}\), and \(R^2\). The parameters of interest can then be recovered as generated quantities: \[\sigma_y = \omega s_y, \qquad \boldsymbol{\theta} = \sqrt{R^2}\,\mathbf{u}\,\sigma_y \sqrt{N - 1}, \qquad \boldsymbol{\beta} = \mathbf{R}^{-1} \boldsymbol{\theta}, \qquad \sigma_{\epsilon} = \sigma_y \sqrt{1 - R^2}.\]

The implementation actually utilizes an improper uniform prior on \(\ln \omega\). Consequently, if \(\ln \omega = 0\), then the marginal standard deviation of the outcome implied by the model is the same as the sample standard deviation of the outcome. If \(\ln \omega > 0\), then the marginal standard deviation of the outcome implied by the model exceeds the sample standard deviation, so the model overfits the data. If \(\ln \omega < 0\), then the marginal standard deviation of the outcome implied by the model is less than the sample standard deviation, so the model underfits the data or the data-generating process is nonlinear. Given the regularizing nature of the prior on \(R^2\), a minor underfit would be considered ideal if the goal is to obtain good out-of-sample predictions. If the model badly underfits or overfits the data, then you may want to reconsider the model.

Example

We will utilize an example from the HSAUR3 package by Brian S. Everitt and Torsten Hothorn, which is used in their 2014 book A Handbook of Statistical Analyses Using R (3rd Edition) (Chapman & Hall / CRC). This book is frequentist in nature and we will show how to obtain the corresponding Bayesian results.

The model in section 5.3.1 analyzes an experiment where clouds were seeded with different amounts of silver iodide to see if there was increased rainfall. This effect could vary according to covariates, which (except for time) are interacted with the treatment variable. Most people would probably be skeptical that cloud hacking could explain very much of the variation in rainfall and thus the prior mode of the \(R^2\) would probably be fairly small.

The frequentist estimator of this model can be replicated by executing

data("clouds",package ="HSAUR3")ols<-lm(rainfall~ seeding* (sne+ cloudcover+ prewetness+ echomotion)+            time,data = clouds)round(coef(ols),3)
                    (Intercept)                      seedingyes 
                         -0.346                          15.683 
                            sne                      cloudcover 
                          0.420                           0.388 
                     prewetness            echomotionstationary 
                          4.108                           3.153 
                           time                  seedingyes:sne 
                         -0.045                          -3.197 
          seedingyes:cloudcover           seedingyes:prewetness 
                         -0.486                          -2.557 
seedingyes:echomotionstationary 
                         -0.562 

Note that we have not looked at the estimated \(R^2\) or \(\sigma\) for the ordinary least squares model. We can estimate a Bayesian version of this model by prepending stan_ to the lm call, specifying a prior mode for \(R^2\), and optionally specifying how many cores the computer may utilize:

library(rstanarm)
post <- stan_lm(
  rainfall ~ seeding * (sne + cloudcover + prewetness + echomotion) + time,
  data = clouds,
  prior = R2(location = 0.2),
  seed = 12345
)
post
stan_lm
 family:       gaussian [identity]
 formula:      rainfall ~ seeding * (sne + cloudcover + prewetness + echomotion) +
    time
 observations: 24
 predictors:   11
------
                                Median MAD_SD
(Intercept)                      2.4    2.3  
seedingyes                       6.8    3.8  
sne                              0.2    0.7  
cloudcover                       0.2    0.2  
prewetness                       1.7    2.8  
echomotionstationary             1.4    1.5  
time                             0.0    0.0  
seedingyes:sne                  -1.4    1.0  
seedingyes:cloudcover           -0.2    0.2  
seedingyes:prewetness           -1.1    3.5  
seedingyes:echomotionstationary -0.2    2.0  

Auxiliary parameter(s):
              Median MAD_SD
R2            0.3    0.1   
log-fit_ratio 0.0    0.1   
sigma         2.6    0.4   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

In this case, the “Bayesian point estimates”, which are represented by the posterior medians, appear quite different from the ordinary least squares estimates. However, the log-fit_ratio (i.e. \(\ln \omega\)) is quite small, indicating that the model only slightly overfits the data when the prior derived above is utilized. Thus, it would be safe to conclude that the ordinary least squares estimator considerably overfits the data since there are only \(24\) observations to estimate \(12\) parameters with and no prior information on the parameters.
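Beyond the printed summary, the posterior draws of \(\ln \omega\) can be inspected directly; a minimal sketch, assuming the parameter is stored under the name "log-fit_ratio" in the fitted stanreg object:

omega_draws <- as.matrix(post, pars = "log-fit_ratio")    # posterior draws of ln(omega)
round(quantile(omega_draws, probs = c(0.05, 0.5, 0.95)), 2)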

Also, it is not obvious what the estimated average treatment effect is since the treatment variable, seeding, is interacted with four other correlated predictors. However, it is easy to estimate or visualize the average treatment effect (ATE) using rstanarm’s posterior_predict function.

clouds_cf <- clouds
clouds_cf$seeding[] <- "yes"
y1_rep <- posterior_predict(post, newdata = clouds_cf)
clouds_cf$seeding[] <- "no"
y0_rep <- posterior_predict(post, newdata = clouds_cf)
qplot(x = c(y1_rep - y0_rep), geom = "histogram", xlab = "Estimated ATE")

As can be seen, the treatment effect is not estimated precisely and is almost as likely to be negative as it is to be positive.
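That visual impression can be quantified from the same posterior predictive draws; for example:

ate_draws <- c(y1_rep - y0_rep)
mean(ate_draws)        # posterior mean of the estimated ATE
mean(ate_draws > 0)    # posterior probability that the ATE is positive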

Alternative Approach

The prior derived above works well in many situations and is quite simple to use since it only requires the user to specify the prior location of the \(R^2\). Nevertheless, the implications of the prior are somewhat difficult to conceptualize. Thus, it is perhaps worthwhile to compare to another estimator of a linear model that simply puts independent Cauchy priors on the regression coefficients. This simpler approach can be executed by calling the stan_glm function with family = gaussian() and specifying the priors:

simple <- stan_glm(
  rainfall ~ seeding * (sne + cloudcover + prewetness + echomotion) + time,
  data = clouds,
  family = gaussian(),
  prior = cauchy(),
  prior_intercept = cauchy(),
  seed = 12345
)

We can compare the two approaches using an approximation to Leave-One-Out (LOO) cross-validation, which is implemented by the loo function in the loo package.

(loo_post <- loo(post))
Computed from 4000 by 24 log-likelihood matrix.

         Estimate   SE
elpd_loo    -60.3  5.3
p_loo         5.9  2.4
looic       120.5 10.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume independent draws (r_eff=1).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
loo_compare(loo_post, loo(simple))
Warning: Found 3 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 3 times to compute the ELPDs for the problematic observations directly.
       elpd_diff se_diff
post    0.0       0.0   
simple -0.9       3.0   

The results indicate that the first approach is expected to produce better out-of-sample predictions, but the Warning messages are at least as important. Many of the estimated shape parameters for the Generalized Pareto distribution are above \(0.5\) in the model with Cauchy priors, which indicates that these estimates are only going to converge slowly to the true out-of-sample deviance measures. Thus, with only \(24\) observations, they should not be considered reliable. The more complicated prior derived above is stronger, as evidenced by the fact that the effective number of parameters is about half of what it is under the simpler approach (versus \(12\) for the maximum likelihood estimator), and only has a few of the \(24\) Pareto shape estimates in the “danger zone”. We might want to reexamine these observations

plot(loo_post, label_points = TRUE)

because the posterior is sensitive to them but, overall, the results seem tolerable.

In general, we would expect the joint prior derived here to work better when there are many predictors relative to the number of observations. Placing independent, heavy-tailed priors on the coefficients neither reflects the beliefs of the researcher nor conveys enough information to stabilize all the computations.

Conclusion

This vignette has discussed the prior distribution utilized in the stan_lm function, which has the same likelihood and a similar syntax as the lm function in R but adds the ability to express prior beliefs about the location of the \(R^2\), which is the familiar proportion of variance in the outcome variable that is attributable to the predictors under a linear model. Since the \(R^2\) is a well-understood bounded scalar, it is easy to specify prior information about it, whereas other Bayesian approaches require the researcher to specify a joint prior distribution for the regression coefficients (and the intercept and error variance).

However, most researchers have little inclination to specify all these prior distributions thoughtfully and take a short-cut by specifying one prior distribution that is taken to apply to all the regression coefficients as if they were independent of each other (and the intercept and error variance). This short-cut is available in the stan_glm function and is described in more detail in other rstanarm vignettes for Generalized Linear Models (GLMs), which can be found by navigating up one level.

We are optimistic that this prior on the \(R^2\) will greatly help in accomplishing our goal for rstanarm of making Bayesian estimation of regression models routine. The same approach is used to specify a prior in ANOVA models (see stan_aov) and proportional-odds models for ordinal outcomes (see stan_polr).

Finally, the stan_biglm function can be used when the design matrix is too large for the qr function to process. The stan_biglm function inputs the output of the biglm function in the biglm package, which utilizes an incremental QR decomposition that does not require the entire dataset to be loaded into memory simultaneously. However, the biglm function needs to be called in a particular way in order to work with stan_biglm. In particular, the means of the columns of the design matrix, the sample mean of the outcome, and the sample standard deviation of the outcome all need to be passed to the stan_biglm function, as well as a flag indicating whether the model really does include an intercept. Also, the number of columns of the design matrix currently cannot exceed the number of rows. Although stan_biglm should run fairly quickly and without much memory, the resulting object is a fairly plain stanfit object rather than an enhanced stanreg object like that produced by stan_lm. Many of the enhanced capabilities of a stanreg object depend on being able to access the full design matrix, so doing posterior predictions, posterior checks, etc. with the output of stan_biglm would require some custom R code.
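A rough sketch of that workflow is below. The xbar, ybar, and s_y arguments are the summaries the text says must be supplied, but the exact calling convention for biglm, and whether the intercept flag is passed explicitly or inferred from the biglm output, may differ by rstanarm version; treat the details here as assumptions and consult ?stan_biglm:

library(biglm)
library(rstanarm)
data("clouds", package = "HSAUR3")
# design matrix without the intercept column, for computing the column means
X <- model.matrix(rainfall ~ seeding + sne + cloudcover, data = clouds)[, -1]
b_out <- biglm(rainfall ~ seeding + sne + cloudcover, data = clouds)
post_big <- stan_biglm(
  b_out,
  xbar = colMeans(X),              # means of the design matrix columns
  ybar = mean(clouds$rainfall),    # sample mean of the outcome
  s_y = sd(clouds$rainfall),       # sample standard deviation of the outcome
  prior = R2(location = 0.2)       # same R2 prior as used by stan_lm
)
# post_big is a plain stanfit object rather than a stanreg object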

References

Lewandowski, D., Kurowicka, D., and Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100(9), 1989–2001.

