Movatterモバイル変換


[0]ホーム

URL:


Modeling Rates/Proportions using BetaRegression with rstanarm

Imad Ali, Jonah Gabry and Ben Goodrich

2025-09-29

library(ggplot2)library(bayesplot)theme_set(bayesplot::theme_default())

Introduction

This vignette explains how to model continuous outcomes on the openunit interval using thestan_betareg function in therstanarm package.

The four steps of a Bayesian analysis are

  1. Specify a joint distribution for the outcome(s) and all theunknowns, which typically takes the form of a marginal priordistribution for the unknowns multiplied by a likelihood for theoutcome(s) conditional on the unknowns. This joint distribution isproportional to a posterior distribution of the unknowns conditional onthe observed data
  2. Draw from posterior distribution using Markov Chain Monte Carlo(MCMC).
  3. Evaluate how well the model fits the data and possibly revise themodel.
  4. Draw from the posterior predictive distribution of the outcome(s)given interesting values of the predictors in order to visualize how amanipulation of a predictor affects (a function of) the outcome(s).

Steps 3 and 4 are covered in more depth by the vignette entitled“How to Use therstanarmPackage”. This vignette focuses on Step 1 when the likelihood is theproduct of beta distributions.

Likelihood

Beta regression uses the beta distribution as the likelihood for thedata,\[f(y_i | a, b) = \frac{y_i^{(a-1)}(1-y_i)^{(b-1)}}{B(a,b)}\] where\(B(\cdot)\) is thebeta function. The shape parameters for the distribution are\(a\) and\(b\) and enter into the model according tothe following transformations,\[a = \mu\cdot\phi \\b = (1-\mu)\cdot\phi\]

Let\(g_1(\cdot)\) be some linkfunction. Then, in the specification of the shape parameters above,\(\mu =g_1^{-1}(\mathbf{X}\boldsymbol{\beta})\), where\(\boldsymbol{X}\) is a\(N\times K\) dimensional matrix ofpredictors, and\(\boldsymbol{\beta}\)is a\(K\) dimensional vector ofparameters associated with each predictor.

In the simplest case (with only one set of regressors),\(\phi\) is a scalar parameter.Alternatively, it is possible to model\(\phi\) using a second set of regressors\(\mathbf{Z}\). In this context let\(g_2(\cdot)\) be some link functionthat is not necessarily identical to\(g_1(\cdot)\). Then\(\phi =g_2^{-1}(\mathbf{Z}\boldsymbol{\gamma})\), where\(\boldsymbol{\gamma}\) is a\(J\) dimensional vector of parametersassociated with the\(N\times J\)dimensional matrix of predictors\(\mathbf{Z}\).

After substituting the shape parameter values in, the likelihood usedin beta regression takes the following form,\[f(y_i | \mu, \phi) =\frac{y_i^{(\mu\phi-1)}(1-y_i)^{((1-\mu)\phi-1)}}{B(\mu\phi,(1-\mu)\phi)}\]

Priors

A full Bayesian analysis requires specifying prior distributions\(f(\boldsymbol{\beta})\) and\(f(\phi)\) for the vector of regressioncoefficients and\(\phi\). When usingstan_betareg, these distributions can be set using theprior_intercept,prior, andprior_phi arguments. Thestan_betareg functionsupports a variety of prior distributions, which are explained in therstanarm documentation(help(priors, package = 'rstanarm')).

When modeling\(\phi\) with a linearpredictor a full Bayesian analysis requires specifying the priordistributions\(f(\boldsymbol{\beta})\)and\(f(\boldsymbol{\gamma})\). Instan_betareg the prior distributions on\(\boldsymbol{\gamma}\) can be set using theprior_intercept_z andprior_z arguments.

As an example, suppose we have\(K\)predictors and believe — prior to seeing the data — that\(\beta_1, \dots, \beta_K\) and\(\phi\) are as likely to be positive as theyare to be negative, but are highly unlikely to be far from zero. Thesebeliefs can be represented by normal distributions with mean zero and asmall scale (standard deviation). To give\(\phi\) and each of the\(\beta\)s this prior (with a scale of 1,say), in the call tostan_betareg we would include theargumentsprior_intercept = normal(0,1),prior = normal(0,1), andprior_phi = normal(0,1).

If, on the other hand, we have less a priori confidence that theparameters will be close to zero then we could use a larger scale forthe normal distribution and/or a distribution with heavier tails thanthe normal like the Student t distribution.Step 1 inthe “How to Use therstanarm Package” vignettediscusses one such example.

After fitting the model we can use theprior_summaryfunction to print information about the prior distributions used whenfitting the model.

Posterior

When using only asingle set of regressors, the posteriordistribution of\(\boldsymbol{\beta}\)and\(\phi\) is proportional to theproduct of the likelihood contributions, the\(K\) priors on the\(\beta_k\) parameters, and\(\phi\),\[f(\boldsymbol{\beta},\phi|\mathbf{y},\mathbf{X}) \propto\prod_{i=1}^N f(y_i | a, b) \times\prod_{k=1}^K f(\beta_k) \timesf(\phi)\]

When usingtwo sets of regressors, the posteriordistribution of\(\boldsymbol{\beta}\)and\(\boldsymbol{\gamma}\) isproportional to the product of the likelihood contribution, the\(K\) priors on the\(\beta_k\) parameters, and the\(J\) priors on the\(\gamma_j\) parameters,

\[f(\boldsymbol{\beta},\boldsymbol{\gamma}|\mathbf{y},\mathbf{X}) \propto\prod_{i=1}^N f(y_i | a, b) \times\prod_{k=1}^K f(\beta_k) \times\prod_{j=1}^J f(\gamma_j)\]

An Example Using Simulated Data

In this example the outcome variable\(\mathbf{y}\) is simulated in a way thatwarrants the use of beta regression. It is worth mentioning that thedata generation process is quite convoluted, which is apparent in theidentification of the likelihood above.

The data simulated below uses the logistic link function on the firstset of regressors and the log link function on the second set ofregressors.

SEED<-1234set.seed(SEED)eta<-c(1,-0.2)gamma<-c(1.8,0.4)N<-200x<-rnorm(N,2,2)z<-rnorm(N,0,2)mu<-binomial(link = logit)$linkinv(eta[1]+ eta[2]*x)phi<-binomial(link = log)$linkinv(gamma[1]+ gamma[2]*z)y<-rbeta(N, mu* phi, (1- mu)* phi)dat<-data.frame(cbind(y, x, z))hist(dat$y,col ="darkgrey",border = F,main ="Distribution of Outcome Variable",xlab ="y",breaks =20,freq = F)

The model can be fit by callingstan_betareg, using theappropriate link functions.

library(rstanarm)fit1<-stan_betareg(y~ x| z,data = dat,link ="logit",link.phi ="log",cores =2,seed =12345)fit2<-stan_betareg(y~-1+ x ,data = dat,link ="logit",link.phi ="log",cores =2,seed =12345)round(coef(fit1),2)round(coef(fit2),2)
      (Intercept)                 x (phi)_(Intercept)           (phi)_z              0.93             -0.20              1.84              0.31
                x (phi)_(Intercept)              0.00              1.08

For clarity we can useprior_summary to print theinformation about the prior distributions used to fit the models. Thepriors used infit1 are provided below.

prior_summary(fit1)
Priors for model 'fit1' ------Intercept (after predictors centered) ~ normal(location = 0, scale = 2.5)Coefficients  Specified prior:    ~ normal(location = 0, scale = 2.5)  Adjusted prior:    ~ normal(location = 0, scale = 1.2)Intercept_z (after predictors centered) ~ normal(location = 0, scale = 2.5)Coefficients_z  Specified prior:    ~ normal(location = 0, scale = 2.5)  Adjusted prior:    ~ normal(location = 0, scale = 1.2)------See help('prior_summary.stanreg') for more details

The usual posterior analyses are available inrstanarm. The plots below illustrate simulated valuesof the outcome variable. The incorrect model noticeably fails to capturethe top of the distribution consistently in comparison to the truemodel.

library(ggplot2)library(bayesplot)bayesplot_grid(pp_check(fit1),pp_check(fit2),xlim =c(0,1),ylim =c(0,4),titles =c("True Model: y ~ x | z","False Model: y ~ x - 1"),grid_args =list(ncol =2))

We can also compare models by evaluating the expected log pointwisepredictive density (elpd), which can be calculated usingtheloo method, which provides an interface forrstanarm models to the functionality in theloo package.

loo1<-loo(fit1)loo2<-loo(fit2)loo_compare(loo1, loo2)
     elpd_diff se_difffit1   0.0       0.0  fit2 -79.9      11.8

The difference inelpd is negative indicating that theexpected predictive accuracy for the first model is higher.

An Example Using Gasoline Data

In some applied contexts it may be necessary to work with an outcomevariable that is a proportion. If the proportion is bound on the openunit interval then beta regression can be considered a reasonableestimation method. Thebetareg package provides a dataseton the proportion of crude oil converted to gasoline after distillationand fractionation. This variable is defined as yield. Belowstan_betareg is used to model yield as a function oftemperature, pressure, and the batch of conditions.

library(rstanarm)data("GasolineYield",package ="betareg")gas_fit1<-stan_betareg(yield~ temp+ batch,data = GasolineYield,link ="logit",seed =12345)gas_fit2<-stan_betareg(yield~ temp+ batch| pressure,data = GasolineYield,link ="logit",seed =12345)round(coef(gas_fit1),2)round(coef(gas_fit2),2)
(Intercept)        temp      batch1      batch2      batch3      batch4       -5.16        0.01        1.35        0.95        1.15        0.77      batch5      batch6      batch7      batch8      batch9       (phi)        0.80        0.72        0.33        0.26        0.15       12.04
      (Intercept)              temp            batch1            batch2             -6.06              0.01              1.69              1.29            batch3            batch4            batch5            batch6              1.53              1.03              1.10              1.01            batch7            batch8            batch9 (phi)_(Intercept)              0.52              0.47              0.37              5.34    (phi)_pressure              0.04

The plots below illustrate simulated values of gasoline yield. Whilethe first model accounts for variation in batch conditions itspredictions looks somewhat uniform rather than resembling the peaked andright-skewed behavior of the true data. The second model does a somewhatbetter job at capturing the shape of the distribution, however itslocation is off as it is centered around 0.50 rather than 0.20.

library(ggplot2)bayesplot_grid(pp_check(gas_fit1),pp_check(gas_fit2),xlim =c(0,1),ylim =c(0,5),titles =c("gas_fit1","gas_fit2"),grid_args =list(ncol =2))

gas_loo1<-loo(gas_fit1)gas_loo2<-loo(gas_fit2)loo_compare(gas_loo1, gas_loo2)
         elpd_diff se_diffgas_fit2   0.0       0.0  gas_fit1 -33.9       3.3

Evaluating the expected log predictive distribution usingloo reveals that the second of the two models ispreferred.

References

Ferrari, SLP and Cribari-Neto, F (2004) “Beta Regression for ModelingRates and Proportions”.Journal of Applied Statistics. Vol. 31,No. 07, p799-815.


[8]ページ先頭

©2009-2025 Movatter.jp