Estimating Ordinal Regression Models with rstanarm

Jonah Gabry and Ben Goodrich

2025-09-29

library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())

Introduction

This vignette explains how to estimate models for ordinal outcomes using the stan_polr function in the rstanarm package.

The four steps of a Bayesian analysis are

  1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknowns. This joint distribution is proportional to a posterior distribution of the unknowns conditional on the observed data.
  2. Draw from the posterior distribution using Markov Chain Monte Carlo (MCMC).
  3. Evaluate how well the model fits the data and possibly revise the model.
  4. Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects (a function of) the outcome(s).

Steps 3 and 4 are covered in more depth by the vignette entitled "How to Use the rstanarm Package". This vignette focuses on Step 1.

One of the strengths of doing MCMC with Stan — as opposed to a Gibbs sampler — is that reparameterizations are essentially costless, which allows the user to specify priors on parameters that are either more intuitive, numerically stable, or computationally efficient without changing the posterior distribution of the parameters that enter the likelihood. Advantageous parameterizations are already built into the Stan programs used in the rstanarm package, so it is just a matter of using these vignettes to explain how the priors work in the context of these reparameterizations.

Likelihood

Ordinal outcomes fall in one of \(J\) categories. One way to motivate an ordinal model is to introduce a latent variable, \(y^\ast\), that is related to the observed outcomes via an observation mechanism: \[y=\begin{cases}1 & \mbox{if } y^{\ast}<\zeta_{1}\\ 2 & \mbox{if } \zeta_{1}\leq y^{\ast}<\zeta_{2}\\ \vdots\\ J & \mbox{if } \zeta_{J-1}\leq y^{\ast}\end{cases},\] where \(\boldsymbol{\zeta}\) is a vector of cutpoints of length \(J-1\).

Then \(y^\ast\) is modeled as a linear function of \(K\) predictors \[y^\ast = \mu + \epsilon = \mathbf{x}^\top\boldsymbol{\beta} + \epsilon,\] where \(\epsilon\) has mean zero and unit scale but can be specified as being drawn from one of several distributions. Note that there is no "intercept" in this model since the data cannot distinguish an intercept from the cutpoints. However, if \(J = 2\), then \(\zeta_1\) can be referred to as either the cutpoint or the intercept.
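
To make the observation mechanism concrete, here is a minimal R simulation of it. Only the structure (a latent \(y^\ast\) plus cutpoints \(\boldsymbol{\zeta}\)) comes from the model above; the particular coefficient, cutpoints, sample size, and error distribution are illustrative assumptions.

# A minimal simulation of the latent-variable mechanism; the coefficient,
# cutpoints, and sample size below are arbitrary illustrative choices.
set.seed(123)
N <- 500
x <- rnorm(N)                          # one standardized predictor
beta <- 0.8                            # assumed coefficient
zeta <- c(-1, 0, 1)                    # assumed cutpoints, so J = 4 categories
y_star <- x * beta + rlogis(N)         # latent outcome with logistic errors, unit scale
y <- cut(y_star, breaks = c(-Inf, zeta, Inf), labels = 1:4, ordered_result = TRUE)
table(y)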

A Bayesian can treat \(y^\ast\) as another unknown parameter, although for computational efficiency the Stan code essentially integrates each \(y^\ast\) out of the posterior distribution, leaving the posterior distribution of \(\boldsymbol{\beta}\) and \(\boldsymbol{\zeta}\). Nevertheless, it is useful to motivate the model theoretically as if \(y^\ast\) were just an unknown parameter with a distribution truncated by the relevant element(s) of \(\boldsymbol{\zeta}\).

Priors

If \(y^\ast\) were observed, we would simply have a linear regression model for it, and the description of the priors in the vignette entitled "Estimating Linear Models with the rstanarm Package" would apply directly. Another way to say the same thing is that, conditional on a realization of \(y^\ast\), we have a linear regression model and the description of the priors in the other vignette does apply (and should be read before continuing with this subsection).

The stan_lm function essentially specifies a prior on \(\boldsymbol{\theta} = \mathbf{R}^{-1}\boldsymbol{\beta}\), where \(\mathbf{R}\) is the upper triangular matrix in the QR decomposition of the design matrix, \(\mathbf{X} = \mathbf{Q}\mathbf{R}\). Furthermore, in stan_lm, \(\sigma_{\epsilon} = \sigma_y \sqrt{1 - R^2}\), where \(R^2\) is the proportion of variance in the outcome that is attributable to the coefficients in a linear model.

The main difference in the context of a model for an ordinal outcome is that the scale of \(y^\ast\) is not identified by the data. Thus, the ordinal model specifies that \(\sigma_{\epsilon} = 1\), which implies that \(\sigma_{y^\ast} = 1 / \sqrt{1 - R^2}\) is an intermediate parameter rather than a primitive parameter.

It is somewhat more difficult to specify a prior value for the \(R^2\) in an ordinal model because \(R^2\) refers to the proportion of variance in the \(y^\ast\) that is attributable to the predictors under a linear model. In general, the \(R^2\) tends to be lower in an ordinal model than in a linear model where the continuous outcome is observed.
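
In rstanarm the prior location of \(R^2\) is supplied through the R2() function, whose what argument says whether the location should be interpreted as the prior mode, mean, median, or expected log of \(R^2\). A deliberately lower location for an ordinal outcome might be written like the following sketch, where the value 0.25 is purely illustrative.

library(rstanarm)
# Interpret 0.25 as the prior mode of R^2; the location value is an assumption
R2(location = 0.25, what = "mode")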

The other difference is that an ordinal model does not have a global intercept but rather a vector of \(J-1\) cutpoints. The implied prior on these cutpoints used by the rstanarm package is somewhat novel. The user instead specifies a Dirichlet prior on \(\Pr\left(y=j \, \left.\right| \, \overline{\mathbf{x}} \right)\), which is to say the prior probability of the outcome falling in each of the \(J\) categories given that the predictors are at their sample means. The Dirichlet prior is for a simplex random variable, whose elements are non-negative and sum to \(1\). The Dirichlet PDF can be written as \[f\left(\boldsymbol{\pi}|\boldsymbol{\alpha}\right) \propto \prod_{j=1}^J{\pi_j^{\alpha_j - 1}},\] where \(\boldsymbol{\pi}\) is a simplex vector such that \(\pi_j = \Pr\left(y=j \, \left.\right| \, \overline{\mathbf{x}} \right)\).

The Dirichlet prior is one of the easiest to specify because the so-called "concentration" hyperparameters \(\boldsymbol{\alpha}\) can be interpreted as prior counts, i.e., prior observations for each of the \(J\) categories (although they need not be integers). If \(\alpha_j = 1\) for every \(j\) (the default used by rstanarm), then the Dirichlet prior is jointly uniform over the space of these simplexes. This corresponds to a prior count of one observation falling in each of the \(J\) ordinal categories when the predictors are at their sample means and conveys the reasonable but weak prior information that no category has probability zero. If, for each \(j\), \(\alpha_j = \alpha > 1\), then the prior mode is that the \(J\) categories are equiprobable, with prior probability \(1/J\) of the outcome falling in each of the \(J\) categories. The larger the value of \(\alpha\), the more sharply peaked the distribution is at the mode.

The \(j\)-th cutpoint \(\zeta_j\) is then given by \[\zeta_j = F_{y^\ast}^{-1}\left(\sum_{i=1}^j{\pi_i}\right),\] where \(F_{y^\ast}^{-1}\) is an inverse CDF function, which depends on the assumed distribution of \(y^\ast\). Common choices include the normal and logistic distributions. The scale parameter of this distribution is again \(\sigma_{y^\ast} = 1/\sqrt{1 - R^2}\). In short, making each \(\zeta_j\) a function of \(\boldsymbol{\pi}\) allows us to specify a Dirichlet prior on \(\boldsymbol{\pi}\), which is simpler than specifying a prior on \(\boldsymbol{\zeta}\) directly.
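
The following R sketch shows how this mapping works numerically. The simplex and the \(R^2\) value are arbitrary choices for illustration; only the cumulative-sum-then-inverse-CDF construction and the scale \(1/\sqrt{1 - R^2}\) come from the formula above.

# Illustrative mapping from a simplex to cutpoints; the probabilities and R^2
# value are made up, only the construction follows the formula in the text.
probs <- c(0.1, 0.3, 0.4, 0.2)            # prior probabilities for J = 4 categories
R2_val <- 0.3                             # assumed R^2
scale_ystar <- 1 / sqrt(1 - R2_val)       # sigma_{y*} = 1 / sqrt(1 - R^2)
J <- length(probs)
# Logistic latent distribution; for a probit model use qnorm() with sd = scale_ystar
zeta <- qlogis(cumsum(probs)[-J], location = 0, scale = scale_ystar)
zeta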

Example

In this section, we start with an ordinal model of tobacco consumption as a function of age and alcohol consumption. Frequentist estimates can be obtained using the polr function in the MASS package:

library(MASS)
print(polr(tobgp ~ agegp + alcgp, data = esoph), digits = 1)
Call:
polr(formula = tobgp ~ agegp + alcgp, data = esoph)

Coefficients:
agegp.L agegp.Q agegp.C agegp^4 agegp^5 alcgp.L alcgp.Q alcgp.C 
  -0.37   -0.38   -0.24    0.04   -0.04   -0.19   -0.02    0.03 

Intercepts:
0-9g/day|10-19    10-19|20-29      20-29|30+ 
          -1.0            0.2            1.3 

Residual Deviance: 241.8195 
AIC: 263.8195

To obtain Bayesian estimates, we prepend stan_ and specify the priors:

library(rstanarm)
post0 <- stan_polr(tobgp ~ agegp + alcgp, data = esoph,
                   prior = R2(0.25), prior_counts = dirichlet(1),
                   seed = 12345)
print(post0, digits = 1)
stan_polr
 family:       ordered [logistic]
 formula:      tobgp ~ agegp + alcgp
 observations: 88
------
        Median MAD_SD
agegp.L -0.2    0.4  
agegp.Q -0.2    0.4  
agegp.C -0.1    0.3  
agegp^4  0.0    0.3  
agegp^5  0.0    0.3  
alcgp.L -0.1    0.3  
alcgp.Q  0.0    0.3  
alcgp.C  0.0    0.3  

Cutpoints:
               Median MAD_SD
0-9g/day|10-19 -1.0    0.2  
10-19|20-29     0.2    0.2  
20-29|30+       1.3    0.2  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The point estimates, represented by the posterior medians, are qualitatively similar to the maximum-likelihood estimates but are somewhat shrunk toward zero due to the regularizing prior on the coefficients. Since these cutpoints are actually known, it would be more appropriate for the model to take that into account, but stan_polr does not currently support that.
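
As the printed output suggests, the priors that were actually used can be inspected after fitting; a quick check (output not shown) is:

prior_summary(post0)   # reports the R2 prior and the Dirichlet concentration parameters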

Next, we utilize an example from the MASS package where low birthweight is the binary outcome of interest. First, we recode some of the variables:

data("birthwt",package ="MASS")birthwt$race<-factor(birthwt$race,levels =1:3,labels =c("white","black","other"))birthwt$bwt<- birthwt$bwt/1000# convert from grams to kilogramsbirthwt$low<-factor(birthwt$low,levels =0:1,labels =c("no","yes"))

It is usually a good idea to rescale variables by constants so that all the numbers are in single or double digits. We start by estimating a linear model for birthweight in kilograms, flipping the sign so that positive coefficients are associated with lower birthweights.

post1<-stan_lm(-bwt~ smoke+ age+ race+ ptl+ ht+ ftv,data = birthwt,prior =R2(0.5),seed =12345)
print(post1)
stan_lm
 family:       gaussian [identity]
 formula:      -bwt ~ smoke + age + race + ptl + ht + ftv
 observations: 189
 predictors:   8
------
            Median MAD_SD
(Intercept) -3.3    0.2  
smoke        0.4    0.1  
age          0.0    0.0  
raceblack    0.4    0.2  
raceother    0.4    0.1  
ptl          0.2    0.1  
ht           0.4    0.2  
ftv          0.0    0.0  

Auxiliary parameter(s):
              Median MAD_SD
R2            0.2    0.0   
log-fit_ratio 0.0    0.1   
sigma         0.7    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Next, we estimate an "ordinal" model for the incidence of low birthweight, which is defined as a birth weight of less than \(2.5\) kilograms. Even though this outcome is binary, a binary variable is a special case of an ordinal variable with \(J=2\) categories and is acceptable to stan_polr. We can think of bwt as something proportional to \(y^\ast\) and pretend that it is not observed, forcing us to estimate an ordinal model.

post2<-stan_polr(low~ smoke+ age+ race+ ptl+ ht+ ftv,data = birthwt,prior =R2(0.5),prior_counts =dirichlet(c(1,1)),method ="probit",seed =12345)
plot(loo(post2))
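
The same information can also be read off numerically; this sketch stores the loo object and uses pareto_k_values() from the loo package (on which rstanarm depends) to extract the Pareto shape parameters that the plot displays.

loo2 <- loo(post2)                    # PSIS-LOO object used in the plot above
print(loo2)                           # includes a table of Pareto k diagnostics
max(loo::pareto_k_values(loo2))       # compare against the 0.5 threshold discussed below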

This prior seems to have worked well in this case because none of the points in the plot are above \(0.5\), which would have indicated that the posterior is very sensitive to those observations. If we compare the estimated coefficients,

round(cbind(Linear = coef(post1), Ordinal = coef(post2),
            Rescaled = coef(post1) / sigma(post1)), 3)
            Linear Ordinal Rescaled
(Intercept) -3.254  -0.535   -4.812
smoke        0.361   0.514    0.534
age         -0.003  -0.025   -0.005
raceblack    0.394   0.514    0.582
raceother    0.400   0.530    0.592
ptl          0.154   0.400    0.228
ht           0.368   0.696    0.544
ftv         -0.004  -0.005   -0.006

they have the same signs and similar magnitudes, with the exception of the "Intercept". In an ordinal model where the outcome only has \(J=2\) categories, this "Intercept" is actually \(\zeta_1\), but it is more conventional to call it the "Intercept" so that it agrees with stan_glm when family = binomial(link = 'probit'). Recall that \(\sigma_{\epsilon} = 1\) in an ordinal model, so if we rescale the coefficients from a linear model by dividing by the posterior median of \(\sigma\), the resulting coefficients are even closer to those of the ordinal model.

This illustrates the fundamental similarity between a linear model for a continuous observed outcome and a linear model for a latent \(y^\ast\) that generates an ordinal observed outcome. The main difference is that when the outcome is continuous and observed, we can estimate the scale of the errors meaningfully. When the outcome is ordinal, we can only fix the scale of the latent errors to \(1\) arbitrarily.

Finally, when \(J = 2\), the stan_polr function allows you to specify non-NULL values of the shape and rate arguments, which implies a "scobit" likelihood where the probability of success is given by \(F\left(y^\ast\right)^\alpha\), where \(F\left(\cdot\right)\) is the logistic CDF and \(\alpha > 0\) is a skewing parameter that has a gamma prior with a given shape and rate. If \(\alpha \neq 1\), then the relationship between \(y^\ast\) and the probability of success is asymmetric. In principle, it seems appropriate to estimate \(\alpha\), but in practice a lot of data is needed to estimate \(\alpha\) with adequate precision. In the previous example, if we specify shape = 2 and rate = 2 to reflect the prior beliefs that \(\alpha\) is expected to be \(1\) but has a variance of \(\frac{1}{2}\), then the loo calculation yields many Pareto shape parameters that are excessively large. However, with more than \(189\) observations, such a model may be more fruitful.
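
A sketch of what such a call could look like, using the shape = 2 and rate = 2 values mentioned above, is shown below; the object name post3 is made up, and as noted in the text this specification is not expected to perform well with only 189 observations.

# Scobit-style model: alpha ~ Gamma(shape = 2, rate = 2) with a logistic F
post3 <- stan_polr(low ~ smoke + age + race + ptl + ht + ftv, data = birthwt,
                   prior = R2(0.5), prior_counts = dirichlet(c(1, 1)),
                   method = "logistic", shape = 2, rate = 2, seed = 12345)
plot(loo(post3))   # expect many large Pareto k values with this little data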

Conclusion

The posterior distribution for an ordinal model requires priors on the coefficients and the cutpoints. The priors used by the stan_polr function are unconventional but should work well for a variety of problems. The prior on the coefficients is essentially the same as that used by the stan_lm function but omits a scale parameter because the standard deviation of the latent \(y^\ast\) is not identified by the data. The cutpoints are conditionally deterministic given a simplex vector for the probability of falling in each of the \(J\) ordinal categories given that the predictors are at their sample means. Thus, a Dirichlet prior — which is relatively easy to specify and has a good default of jointly uniform — on this simplex completes the posterior distribution.

This approach provides an alternative to stan_glm with family = binomial() even if the outcome variable has only two categories. The stan_glm function has more options for the prior on the coefficients and the prior on the intercept (which can be interpreted as the first cutpoint when \(J = 2\)). However, it may be more difficult to obtain efficient sampling with those priors.
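
For reference, a stan_glm counterpart to the binary model fit earlier might look like the following sketch; the object name is made up and the priors are simply left at their defaults rather than being a recommended choice.

# A stan_glm analogue of post2 (illustrative; default priors, probit link)
post2_glm <- stan_glm(low ~ smoke + age + race + ptl + ht + ftv, data = birthwt,
                      family = binomial(link = "probit"), seed = 12345)
print(post2_glm, digits = 1)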

