Estimating Generalized Linear Models for Binary and Binomial Data with rstanarm

Jonah Gabry and Ben Goodrich

2025-09-29

library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())

Introduction

This vignette explains how to estimate generalized linear models (GLMs) for binary (Bernoulli) and binomial response variables using the stan_glm function in the rstanarm package.

The four steps of a Bayesian analysis are

  1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknowns. This joint distribution is proportional to a posterior distribution of the unknowns conditional on the observed data.
  2. Draw from the posterior distribution using Markov Chain Monte Carlo (MCMC).
  3. Evaluate how well the model fits the data and possibly revise the model.
  4. Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects (a function of) the outcome(s).

Steps 3 and 4 are covered in more depth by the vignette entitled “How to Use the rstanarm Package”. This vignette focuses on Step 1 when the likelihood is the product of conditionally independent binomial distributions (possibly with only one trial per observation).

Likelihood

For a binomial GLM the likelihood for one observation \(y\) can be written as a conditionally binomial PMF \[\binom{n}{y} \pi^{y} (1 - \pi)^{n - y},\] where \(n\) is the known number of trials, \(\pi = g^{-1}(\eta)\) is the probability of success and \(\eta = \alpha + \mathbf{x}^\top \boldsymbol{\beta}\) is a linear predictor. For a sample of size \(N\), the likelihood of the entire sample is the product of \(N\) individual likelihood contributions.
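To make the notation concrete, here is a minimal base-R sketch (not part of rstanarm) of a single observation’s likelihood contribution, with the inverse link \(g^{-1}\) passed in as a function; it agrees with dbinom, which implements the same PMF:

lik_one <- function(y, n, eta, invlink = plogis) {
  p <- invlink(eta)  # pi = g^{-1}(eta), e.g. plogis (logit) or pnorm (probit)
  choose(n, y) * p^y * (1 - p)^(n - y)
}
# same value as base R's built-in binomial PMF
all.equal(lik_one(y = 3, n = 10, eta = -0.4), dbinom(3, 10, plogis(-0.4)))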

Because \(\pi\) is a probability, for a binomial model the link function \(g\) maps between the unit interval (the support of \(\pi\)) and the set of all real numbers \(\mathbb{R}\). When applied to a linear predictor \(\eta\) with values in \(\mathbb{R}\), the inverse link function \(g^{-1}(\eta)\) therefore returns a valid probability between 0 and 1.

The two most common link functions used for binomial GLMs are the logit and probit functions. With the logit (or log-odds) link function \(g(x) = \ln{\left(\frac{x}{1-x}\right)}\), the likelihood for a single observation becomes

\[\binom{n}{y}\left(\text{logit}^{-1}(\eta)\right)^y\left(1 - \text{logit}^{-1}(\eta)\right)^{n-y} =\binom{n}{y} \left(\frac{e^{\eta}}{1 + e^{\eta}}\right)^{y}\left(\frac{1}{1 + e^{\eta}}\right)^{n - y}\]

and the probit link function \(g(x) = \Phi^{-1}(x)\) yields the likelihood

\[\binom{n}{y} \left(\Phi(\eta)\right)^{y}\left(1 - \Phi(\eta)\right)^{n - y},\]

where \(\Phi\) is the CDF of the standard normal distribution. The differences between the logit and probit functions are minor and – if, as rstanarm does by default, the probit is scaled so its slope at the origin matches the logit’s – the two link functions should yield similar results. With stan_glm, binomial models with a logit link function can typically be fit slightly faster than the identical model with a probit link because of how the two models are implemented in Stan. Unless the user has a specific reason to prefer the probit link, we recommend the logit simply because it will be slightly faster and more numerically stable.
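As a quick numerical illustration of this similarity (a sketch, not rstanarm’s internal parameterization), dividing \(\eta\) by roughly 1.6 rescales the probit so its slope at the origin matches the logit’s, and the two inverse links then nearly coincide:

eta <- seq(-3, 3, by = 1)
round(cbind(eta, logit = plogis(eta), probit_scaled = pnorm(eta / 1.6)), 3)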

In theory, there are infinitely many possible link functions, although in practice only a few are typically used. Other common choices are the cauchit and cloglog functions, which can also be used with stan_glm (every link function compatible with glm will work with stan_glm).

Priors

A full Bayesian analysis requires specifying prior distributions \(f(\alpha)\) and \(f(\boldsymbol{\beta})\) for the intercept and vector of regression coefficients. When using stan_glm, these distributions can be set using the prior_intercept and prior arguments. The stan_glm function supports a variety of prior distributions, which are explained in the rstanarm documentation (help(priors, package = 'rstanarm')).

As an example, suppose we have \(K\) predictors and believe — prior to seeing the data — that \(\alpha, \beta_1, \dots, \beta_K\) are as likely to be positive as they are to be negative, but are highly unlikely to be far from zero. These beliefs can be represented by normal distributions with mean zero and a small scale (standard deviation). To give \(\alpha\) and each of the \(\beta\)s this prior (with a scale of 1, say), in the call to stan_glm we would include the arguments prior_intercept = normal(0,1) and prior = normal(0,1).
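A call along these lines would use those priors (a sketch with a hypothetical data frame dat containing a binary outcome y and predictors x1 and x2; these names are not from the example analyzed later):

fit <- stan_glm(y ~ x1 + x2, data = dat,
                family = binomial(link = "logit"),
                prior = normal(0, 1), prior_intercept = normal(0, 1))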

If, on the other hand, we have less a priori confidence that the parameters will be close to zero, then we could use a larger scale for the normal distribution and/or a distribution with heavier tails than the normal, like the Student t distribution. Step 1 in the “How to Use the rstanarm Package” vignette discusses one such example.

Posterior

With independent prior distributions, the joint posterior distribution for \(\alpha\) and \(\boldsymbol{\beta}\) is proportional to the product of the priors and the \(N\) likelihood contributions:

\[f\left(\alpha,\boldsymbol{\beta} |\mathbf{y},\mathbf{X}\right) \propto f\left(\alpha\right) \times \prod_{k=1}^K f\left(\beta_k\right) \times \prod_{i=1}^N { g^{-1}\left(\eta_i\right)^{y_i} \left(1 - g^{-1}\left(\eta_i\right)\right)^{n_i-y_i}}.\]

This is the posterior distribution that stan_glm will draw from when using MCMC.
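For intuition, the target can be written out directly in R. The following is a minimal sketch of the unnormalized log posterior for a logit-link model with the independent normal(0, 1) priors from the example above (dbinom includes the binomial coefficient, a constant that does not affect the shape of the posterior; rstanarm itself evaluates this in Stan, not in R):

log_post <- function(alpha, beta, y, n, X) {
  eta <- alpha + as.vector(X %*% beta)                        # linear predictor
  sum(dnorm(c(alpha, beta), mean = 0, sd = 1, log = TRUE)) +  # log priors
    sum(dbinom(y, size = n, prob = plogis(eta), log = TRUE))  # log likelihood
}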

Logistic Regression Example

When the logit link function is used, the model is often referred to as a logistic regression model (the inverse logit function is the CDF of the standard logistic distribution). As an example, here we will show how to carry out a few parts of the analysis from Chapter 5.4 of Gelman and Hill (2007) using stan_glm.

Gelman and Hill describe a survey of 3200 residents in a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well in the nearby area, and the survey was conducted several years later to learn which of the affected residents had switched wells. The goal of the analysis presented by Gelman and Hill is to learn about the factors associated with switching wells.

To start, we’ll use dist (the distance from the respondent’s house to the nearest well with safe drinking water) as the only predictor of switch (1 if switched, 0 if not). Then we’ll expand the model by adding the arsenic level of the water in the resident’s own well as a predictor and compare this larger model to the original.

After loading the wells data, we first rescale the dist variable (measured in meters) so that it is measured in units of 100 meters. If we leave dist in its original units then the corresponding regression coefficient will represent the effect of the marginal meter, which is too small to have a useful interpretation.

library(rstanarm)
data(wells)
wells$dist100 <- wells$dist / 100

Before estimating any models we can visualize the distribution of dist100 in the data:

ggplot(wells, aes(x = dist100, y = after_stat(density), fill = switch == 1)) +
  geom_histogram() +
  scale_fill_manual(values = c("gray30", "skyblue"))

In the plot above the blue bars correspond to the 1737 residents who said they switched wells and the darker bars show the distribution of dist100 for the 1283 residents who didn’t switch. As we would expect, for the residents who switched wells, the distribution of dist100 is more concentrated at smaller distances.

A Bayesian version of Gelman and Hill’s initial logistic regression model can be estimated using the stan_glm function. Here we’ll use a Student t prior with 7 degrees of freedom and a scale of 2.5, which, as discussed above, is a reasonable default prior when coefficients should be close to zero but have some chance of being large.

t_prior <- student_t(df = 7, location = 0, scale = 2.5)
fit1 <- stan_glm(switch ~ dist100, data = wells,
                 family = binomial(link = "logit"),
                 prior = t_prior, prior_intercept = t_prior,
                 cores = 2, seed = 12345)
round(coef(fit1), 3)

(Intercept)     dist100 
      0.605      -0.621 

The formula, data and family arguments to stan_glm are specified in exactly the same way as for glm. We’ve also added the optional additional arguments cores (how many cores we want the computer to utilize) and seed (for reproducibility). You can read about other possible arguments in the stan_glm documentation (help(stan_glm, package = 'rstanarm')).

To get a sense for the uncertainty in our estimates we can use the posterior_interval function to get Bayesian uncertainty intervals. The uncertainty intervals are computed by finding the relevant quantiles of the draws from the posterior distribution. For example, to compute 50% intervals we use:

round(posterior_interval(fit1, prob = 0.5), 2)

              25%   75%
(Intercept)  0.57  0.65
dist100     -0.69 -0.56

For more on posterior_interval and interpreting the parameter estimates from a Bayesian model see Step 2 in the “How to Use the rstanarm Package” vignette.

Using the coefficient estimates we can plot the predicted probability of switch = 1 (as a function of dist100) together with the observed outcomes:

# Predicted probability as a function of x
pr_switch <- function(x, ests) plogis(ests[1] + ests[2] * x)
# A function to slightly jitter the binary data
jitt <- function(...) {
  geom_point(aes_string(...), position = position_jitter(height = 0.05, width = 0.1),
             size = 2, shape = 21, stroke = 0.2)
}
ggplot(wells, aes(x = dist100, y = switch, color = switch)) +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  jitt(x = "dist100") +
  stat_function(fun = pr_switch, args = list(ests = coef(fit1)),
                size = 2, color = "gray35")

The plot shows that under this model the predicted probability of switching is a decent bit above 50% for residents living very close to wells with safe drinking water. As expected, larger values of dist100 are associated with lower predicted probabilities of switching. At the extreme (\(\approx 300\) meters), the probability is about 25%.
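To check that last number, we can plug the extreme distance into the fitted curve using the pr_switch helper defined in the code above:

pr_switch(3, coef(fit1))  # predicted probability at dist100 = 3, i.e. 300 meters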

Next, we incorporate an additional predictor into the model: the arsenic level of water in the respondent’s well. According to Gelman and Hill, “At the levels present in the Bangladesh drinking water, the health risks from arsenic are roughly proportional to exposure, and so we would expect switching to be more likely from wells with high arsenic levels” (pg. 90). We only need to change the formula, so we can use the update function:

fit2 <- update(fit1, formula = switch ~ dist100 + arsenic)
(coef_fit2 <- round(coef(fit2), 3))

(Intercept)     dist100     arsenic 
      0.002      -0.896       0.462 

As expected, the coefficient on arsenic is positive. The plot below shows distance on the x-axis and arsenic level on the y-axis, with the predicted probability of well-switching mapped to the color of the background tiles (the lighter the color the higher the probability). The observed value of switch is indicated by the color of the points.

pr_switch2 <- function(x, y, ests) plogis(ests[1] + ests[2] * x + ests[3] * y)
grid <- expand.grid(dist100 = seq(0, 4, length.out = 100),
                    arsenic = seq(0, 10, length.out = 100))
grid$prob <- with(grid, pr_switch2(dist100, arsenic, coef(fit2)))
ggplot(grid, aes(x = dist100, y = arsenic)) +
  geom_tile(aes(fill = prob)) +
  geom_point(data = wells, aes(color = factor(switch)), size = 2, alpha = 0.85) +
  scale_fill_gradient() +
  scale_color_manual("switch", values = c("white", "black"), labels = c("No", "Yes"))

We can see that the black points (switch = 1) are predominantly clustered in the upper-left region of the plot, where the predicted probability of switching is highest.

Another way we can visualize the data and model is to follow Gelman and Hill and create separate plots for varying the arsenic level and distance. Here we’ll plot curves representing the predicted probability of switching for the minimum, maximum and quartile values of both variables.

# Quantiles (min, quartiles, max) of each predictor
q_dist <- quantile(wells$dist100, seq(0, 1, 0.25))
q_ars <- quantile(wells$arsenic, seq(0, 1, 0.25))
base <- ggplot(wells) + xlim(c(0, NA)) +
  scale_y_continuous(breaks = c(0, 0.5, 1))
vary_arsenic <- base + jitt(x = "arsenic", y = "switch", color = "switch")
vary_dist <- base + jitt(x = "dist100", y = "switch", color = "switch")
for (i in 1:5) {
  # hold arsenic fixed at a quantile while distance varies along the x-axis
  vary_dist <- vary_dist +
    stat_function(fun = pr_switch2, color = "gray35",
                  args = list(ests = coef(fit2), y = q_ars[i]))
  # hold distance fixed at a quantile while arsenic varies along the x-axis
  vary_arsenic <- vary_arsenic +
    stat_function(fun = pr_switch2, color = "gray35",
                  args = list(ests = coef(fit2), x = q_dist[i]))
}
bayesplot_grid(vary_dist, vary_arsenic, grid_args = list(ncol = 2))

We can compare our two models (with and without arsenic) using an approximation to Leave-One-Out (LOO) cross-validation, which is a method for estimating out-of-sample predictive performance and is implemented by the loo function in the loo package:

(loo1 <- loo(fit1))

Computed from 4000 by 3020 log-likelihood matrix.

         Estimate   SE
elpd_loo  -2040.1 10.4
p_loo         2.0  0.0
looic      4080.2 20.8
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume independent draws (r_eff=1).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
(loo2 <- loo(fit2))

Computed from 4000 by 3020 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1968.4 15.7
p_loo         3.2  0.1
looic      3936.9 31.3
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume independent draws (r_eff=1).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
loo_compare(loo1, loo2)

     elpd_diff se_diff
fit2   0.0       0.0  
fit1 -71.7      12.2  

These results favor fit2 over fit1, as the estimated difference in elpd (the expected log pointwise predictive density for a new dataset) is so much larger than its standard error. LOO penalizes models for adding additional predictors (this helps counter overfitting), but in this case fit2 represents enough of an improvement over fit1 that the penalty for including arsenic is negligible (as it should be if arsenic is an important predictor).
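One way to see that the difference is “much larger” than its standard error is to express it in standard-error units (a sketch; loo_compare returns a matrix that can be indexed by model name):

cmp <- loo_compare(loo1, loo2)
cmp["fit1", "elpd_diff"] / cmp["fit1", "se_diff"]  # -71.7 / 12.2, roughly -5.9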

The vignette for the stan_lm function also has an example of using the loo function, where the results are quite a bit different from what we see here and some important additional considerations are discussed.

Conditional Logit Models

The previous example relies on the fact that observations are plausibly conditionally independent. In contrast, so-called “case-control studies” require that there are a fixed number of successes and failures within each stratum, and the question is: which members of each stratum succeed and fail? The stan_clogit function estimates such a model and is very similar to the clogit function in the survival package. The main syntactical difference is that the clogit function requires that the user call the strata function in the model formula, whereas the stan_clogit function has a required strata argument. In addition, in the stan_clogit case the data must be sorted by the variable passed to strata. The advantage to these changes is that stan_clogit can optionally utilize the multilevel syntax from the lme4 package to specify group-specific terms, rather than the more limited multilevel structure supported by the frailty function in the survival package. The vignette for the stan_glmer function discusses the lme4-style syntax in more detail. For example,

post <- stan_clogit(case ~ spontaneous + induced + (1 | parity),
                    data = infert[order(infert$stratum), ], # order necessary
                    strata = stratum,
                    QR = TRUE,
                    cores = 2, seed = 12345)
post
stan_clogit
 family:       binomial [clogit]
 formula:      case ~ spontaneous + induced + (1 | parity)
 observations: 248
------
            Median MAD_SD
spontaneous 2.0    0.4   
induced     1.4    0.4   

Error terms:
 Groups Name        Std.Dev.
 parity (Intercept) 1.4     
Num. levels: parity 6 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The posterior predictions are also constrained such that there is exactly one success (in this case) for each of the strata, and thus the posterior distribution of the probabilities is also so constrained:

PPD <- posterior_predict(post)
stopifnot(rowSums(PPD) == max(infert$stratum))
PLP <- posterior_linpred(post, transform = TRUE)
stopifnot(round(rowSums(PLP)) == max(infert$stratum))

Binomial Models

Although the example in this vignette focused on a binary response variable, we can use nearly identical code if we have the sum of multiple binary variables. For example, imagine a hypothetical dataset similar to the well-switching data but spanning multiple villages. Each observation (each row) of this data.frame corresponds to an entire village: switch[i] is the number of ‘yes’ responses to the well-switching question for village i, dist100[i] is the average distance to the closest well with clean water for village i, etc. We also now have a variable n where n[i] is the number of respondents from village i.
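Concretely, such a data frame might look like the following (hypothetical values; this dataset does not ship with rstanarm):

village_data <- data.frame(
  switch  = c(12, 30, 7, 18, 22),      # number of 'yes' responses per village
  n       = c(40, 45, 25, 30, 35),     # number of respondents per village
  dist100 = c(0.5, 1.2, 2.1, 0.8, 1.5),  # avg distance (100s of meters) to a safe well
  arsenic = c(1.1, 0.8, 2.3, 1.7, 0.9)   # avg arsenic level
)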

For this data we can estimate a model similar to the one we used in the binary case by changing the formula to

cbind(switch, n - switch) ~ dist100 + arsenic

The left-hand side is now a 2-column matrix where the first column is the number of ‘yes’ responses and the second column is the number of ‘no’ responses (or more generally, the number of successes and number of failures). The same model can also be specified using the proportion of ‘yes’ responses and the total number of responses in each village. This corresponds to the formula

prop_switch ~ dist100 + arsenic

where prop_switch = switch / n is the proportion of ‘yes’ responses. The total number of responses is provided using the weights argument. In this case we would add weights = n to the call to stan_glm.
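Putting both formulations together, the two calls below fit the same model to the hypothetical village_data sketched above:

fit_vil1 <- stan_glm(cbind(switch, n - switch) ~ dist100 + arsenic,
                     data = village_data,
                     family = binomial(link = "logit"))

village_data$prop_switch <- village_data$switch / village_data$n
fit_vil2 <- stan_glm(prop_switch ~ dist100 + arsenic,
                     data = village_data, weights = n,
                     family = binomial(link = "logit"))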

An example of a similar model can also be found in Step 1 of the “How to Use the rstanarm Package” vignette.

Going Further

In the hypothetical scenario above, if we also have access to the observations for each individual in all of the villages (not just the aggregate data), then a natural extension would be to consider a multilevel model that takes advantage of the inherent multilevel structure of the data (individuals nested within villages). The vignette for the stan_glmer function discusses these models.

References

Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge, UK.

