Empirical Bayes method


Empirical Bayes methods are procedures for statistical inference in which the prior probability distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out.[1]

Introduction


Empirical Bayes methods can be seen as an approximation to a fully Bayesian treatment of a hierarchical Bayes model.

In a two-stage hierarchical Bayes model, for example, observed data $y=\{y_{1},y_{2},\dots ,y_{n}\}$ are assumed to be generated from an unobserved set of parameters $\theta =\{\theta _{1},\theta _{2},\dots ,\theta _{n}\}$ according to a probability distribution $p(y\mid \theta )$. In turn, the parameters $\theta$ can be considered samples drawn from a population characterised by hyperparameters $\eta$ according to a probability distribution $p(\theta \mid \eta )$. In the hierarchical Bayes model, though not in the empirical Bayes approximation, the hyperparameters $\eta$ are considered to be drawn from an unparameterized distribution $p(\eta )$.

Information about a particular quantity of interest $\theta _{i}$ therefore comes not only from the properties of those data $y$ that directly depend on it, but also from the properties of the population of parameters $\theta$ as a whole, inferred from the data as a whole, summarised by the hyperparameters $\eta$.

Using Bayes' theorem,

$$p(\theta \mid y)={\frac {p(y\mid \theta )\,p(\theta )}{p(y)}}={\frac {p(y\mid \theta )}{p(y)}}\int p(\theta \mid \eta )\,p(\eta )\,d\eta \,.$$

In general, this integral will not be tractable analytically or symbolically and must be evaluated by numerical methods. Stochastic (random) or deterministic approximations may be used. Example stochastic methods are Markov chain Monte Carlo and Monte Carlo sampling. Deterministic approximations are discussed in quadrature.

Alternatively, the expression can be written as

$$p(\theta \mid y)=\int p(\theta \mid \eta ,y)\,p(\eta \mid y)\,d\eta =\int {\frac {p(y\mid \theta )\,p(\theta \mid \eta )}{p(y\mid \eta )}}\,p(\eta \mid y)\,d\eta \,,$$

and the final factor in the integral can in turn be expressed as

$$p(\eta \mid y)=\int p(\eta \mid \theta )\,p(\theta \mid y)\,d\theta .$$

These suggest an iterative scheme, qualitatively similar in structure to a Gibbs sampler, to evolve successively improved approximations to $p(\theta \mid y)$ and $p(\eta \mid y)$. First, calculate an initial approximation to $p(\theta \mid y)$ ignoring the $\eta$ dependence completely; then calculate an approximation to $p(\eta \mid y)$ based upon the initial approximate distribution of $p(\theta \mid y)$; then use this $p(\eta \mid y)$ to update the approximation for $p(\theta \mid y)$; then update $p(\eta \mid y)$; and so on.

When the true distribution $p(\eta \mid y)$ is sharply peaked, the integral determining $p(\theta \mid y)$ may be not much changed by replacing the probability distribution over $\eta$ with a point estimate $\eta ^{*}$ representing the distribution's peak (or, alternatively, its mean),

$$p(\theta \mid y)\simeq {\frac {p(y\mid \theta )\,p(\theta \mid \eta ^{*})}{p(y\mid \eta ^{*})}}\,.$$

With this approximation, the above iterative scheme becomes the EM algorithm.
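
As an illustration of how the truncated scheme reduces to EM, the following sketch (not part of the original article) assumes a Gaussian–Gaussian hierarchy: $\theta_i \sim N(\mu ,\tau^2)$ with hyperparameters $\eta =(\mu ,\tau^2)$, and $y_i \mid \theta_i \sim N(\theta_i ,\sigma^2)$ with $\sigma^2$ known. All variable names and numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy hierarchy: theta_i ~ N(mu, tau2), y_i | theta_i ~ N(theta_i, sigma2),
# with sigma2 known.  The true hyperparameters are hidden from the estimator.
sigma2 = 1.0
mu_true, tau2_true = 3.0, 4.0
theta = rng.normal(mu_true, np.sqrt(tau2_true), size=500)
y = rng.normal(theta, np.sqrt(sigma2))

# EM-style iteration: alternate between the posterior of theta given (mu, tau2)
# and a point update of the hyperparameters given that posterior.
mu, tau2 = 0.0, 1.0                                  # initial hyperparameter guesses
for _ in range(200):
    # posterior of each theta_i is Gaussian with these moments ("E-step")
    w = tau2 / (tau2 + sigma2)                       # weight given to the data
    post_mean = mu + w * (y - mu)                    # E[theta_i | y_i, mu, tau2]
    post_var = w * sigma2                            # Var[theta_i | y_i, mu, tau2]
    # point update of the hyperparameters ("M-step")
    mu = post_mean.mean()
    tau2 = np.mean((post_mean - mu) ** 2) + post_var

# final posterior means under the fitted hyperparameters
w = tau2 / (tau2 + sigma2)
post_mean = mu + w * (y - mu)

print(f"estimated mu = {mu:.2f}, tau2 = {tau2:.2f}")   # roughly recovers 3 and 4
print("empirical Bayes estimates of theta (first 5):", np.round(post_mean[:5], 2))
```

Each pass alternates between the closed-form posterior for the $\theta_i$ under the current hyperparameters and a point update of $(\mu ,\tau^2)$, which is the EM recursion for the marginal likelihood of the $y_i$ in this assumed model.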

The term "Empirical Bayes" can cover a wide variety of methods, but most can be regarded as an early truncation of either the above scheme or something quite like it. Point estimates, rather than the whole distribution, are typically used for the parameter(s)η{\displaystyle \eta \;}. The estimates forη{\displaystyle \eta ^{*}\;} are typically made from the first approximation top(θy){\displaystyle p(\theta \mid y)\;} without subsequent refinement. These estimates forη{\displaystyle \eta ^{*}\;} are usually made without considering an appropriate prior distribution forη{\displaystyle \eta }.

Point estimation


Robbins' method: non-parametric empirical Bayes (NPEB)


Robbins[2] considered a case of sampling from a mixed distribution, where the probability for each $y_{i}$ (conditional on $\theta _{i}$) is specified by a Poisson distribution,

$$p(y_{i}\mid \theta _{i})={\frac {\theta _{i}^{\,y_{i}}\,e^{-\theta _{i}}}{y_{i}!}}$$

while the prior on $\theta$ is unspecified except that it is also i.i.d. from an unknown distribution, with cumulative distribution function $G(\theta )$. Compound sampling arises in a variety of statistical estimation problems, such as accident rates and clinical trials.[citation needed] We simply seek a point prediction of $\theta _{i}$ given all the observed data. Because the prior is unspecified, we seek to do this without knowledge of $G$.[3]

Under squared error loss (SEL), the conditional expectation $\operatorname {E} (\theta _{i}\mid Y_{i}=y_{i})$ is a reasonable quantity to use for prediction. For the Poisson compound sampling model, this quantity is

$$\operatorname {E} (\theta _{i}\mid y_{i})={\frac {\int (\theta ^{y_{i}+1}e^{-\theta }/y_{i}!)\,dG(\theta )}{\int (\theta ^{y_{i}}e^{-\theta }/y_{i}!)\,dG(\theta )}}.$$

This can be simplified by multiplying and dividing the numerator by $(y_{i}+1)$, yielding

$$\operatorname {E} (\theta _{i}\mid y_{i})={\frac {(y_{i}+1)\,p_{G}(y_{i}+1)}{p_{G}(y_{i})}},$$

where $p_{G}$ is the marginal probability mass function obtained by integrating out $\theta$ over $G$.

To take advantage of this, Robbins[2] suggested estimating the marginals with their empirical frequencies ($\#\{Y_{j}\}$), yielding the fully non-parametric estimate:

$$\operatorname {E} (\theta _{i}\mid y_{i})\approx (y_{i}+1)\,{\frac {\#\{Y_{j}=y_{i}+1\}}{\#\{Y_{j}=y_{i}\}}},$$

where $\#$ denotes "number of". (See also Good–Turing frequency estimation.)

Example – Accident rates

Suppose each customer of an insurance company has an "accident rate" Θ and is insured against accidents; the probability distribution of Θ is the underlying distribution, and is unknown. The number of accidents suffered by each customer in a specified time period has a Poisson distribution with expected value equal to the particular customer's accident rate. The actual number of accidents experienced by a customer is the observable quantity. A crude way to estimate the underlying probability distribution of the accident rate Θ is to estimate the proportion of members of the whole population suffering 0, 1, 2, 3, ... accidents during the specified time period as the corresponding proportion in the observed random sample. Having done so, it is then desired to predict the accident rate of each customer in the sample. As above, one may use the conditional expected value of the accident rate Θ given the observed number of accidents during the baseline period. Thus, if a customer suffers six accidents during the baseline period, that customer's estimated accident rate is 7 × [the proportion of the sample who suffered 7 accidents] / [the proportion of the sample who suffered 6 accidents]. Note that if the proportion of people suffering $k$ accidents is a decreasing function of $k$, the customer's predicted accident rate will often be lower than their observed number of accidents.

This shrinkage effect is typical of empirical Bayes analyses.
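
A minimal sketch of Robbins' estimator applied to simulated accident counts follows. Everything here, including the gamma mixing distribution used only to generate data, is an illustrative assumption; the estimator itself never uses $G$.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Simulated portfolio: each customer has a latent accident rate theta_i drawn from
# an unknown distribution G (a gamma here, purely for the demo), and an observed
# Poisson count y_i with that rate.
theta = rng.gamma(shape=2.0, scale=0.5, size=10_000)   # stand-in for the unknown G
y = rng.poisson(theta)

counts = Counter(y.tolist())                            # #{Y_j = k} for each observed k

def robbins_estimate(k: int) -> float:
    """Robbins' non-parametric estimate of E[theta | y = k]."""
    if counts[k] == 0:
        return float("nan")                             # no customers with this count
    return (k + 1) * counts[k + 1] / counts[k]

for k in range(6):
    print(k, round(robbins_estimate(k), 3))
```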

Gaussian


Suppose $X,Y$ are random variables such that $Y$ is observed but $X$ is hidden. The problem is to find the expectation of $X$ conditional on $Y$. Suppose further that $Y\mid X\sim {\mathcal {N}}(X,\Sigma )$; that is, $Y=X+Z$, where $Z$ is a multivariate Gaussian with covariance $\Sigma$.

Then we have the identity $\Sigma \nabla _{y}\,\rho (y\mid x)=\rho (y\mid x)\,(x-y)$, obtained by direct calculation with the probability density function of a multivariate Gaussian. Integrating over $\rho (x)\,dx$, we obtain

$$\Sigma \nabla _{y}\,\rho (y)=(\operatorname {E} [x\mid y]-y)\,\rho (y)\quad \implies \quad \operatorname {E} [x\mid y]=y+\Sigma \nabla _{y}\ln \rho (y).$$

In particular, this means that one can perform Bayesian estimation of $X$ without access to either the prior density of $X$ or the posterior density of $X$ given $Y$. The only requirement is access to the score function of $Y$. This has applications in score-based generative modeling.[4]
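
As a sanity check on this identity (often called Tweedie's formula), the following sketch assumes a one-dimensional Gaussian prior $X\sim N(\mu_0 ,\tau^2)$, for which the marginal score of $Y$ is available in closed form; the estimate $y+\sigma^2\,\partial_y \ln \rho (y)$ then matches the familiar Gaussian posterior mean. All names and values are illustrative.

```python
import numpy as np

# Assumed 1-D setting: X ~ N(mu0, tau2) (hidden), Y = X + Z with Z ~ N(0, sigma2).
mu0, tau2, sigma2 = 1.0, 4.0, 2.0
y = 3.5                                            # an observed value

# Marginal of Y is N(mu0, tau2 + sigma2), so its score has a closed form.
score = -(y - mu0) / (tau2 + sigma2)               # d/dy log rho(y)

# Empirical Bayes / Tweedie estimate: E[X | Y = y] = y + sigma2 * score
tweedie = y + sigma2 * score

# Direct Gaussian posterior mean for comparison.
posterior_mean = (tau2 * y + sigma2 * mu0) / (tau2 + sigma2)

print(tweedie, posterior_mean)                     # the two agree
```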

Parametric empirical Bayes


If the likelihood and its prior take on simple parametric forms (such as 1- or 2-dimensional likelihood functions with simple conjugate priors), then the empirical Bayes problem is only to estimate the marginal $m(y\mid \eta )$ and the hyperparameters $\eta$ using the complete set of empirical measurements. For example, one common approach, called parametric empirical Bayes point estimation, is to approximate the marginal using the maximum likelihood estimate (MLE), or a moments expansion, which allows one to express the hyperparameters $\eta$ in terms of the empirical mean and variance. This simplified marginal allows one to plug the empirical averages into a point estimate for the prior $\theta$. The resulting equation for the prior $\theta$ is greatly simplified, as shown below.

There are several common parametric empirical Bayes models, including the Poisson–gamma model (below), the beta-binomial model, the Gaussian–Gaussian model, the Dirichlet-multinomial model, as well as specific models for Bayesian linear regression (see below) and Bayesian multivariate linear regression. More advanced approaches include hierarchical Bayes models and Bayesian mixture models.

Gaussian–Gaussian model


For an example of empirical Bayes estimation using a Gaussian–Gaussian model, see Empirical Bayes estimators.

Poisson–gamma model


Continuing the example above, let the likelihood be a Poisson distribution, and let the prior now be specified by the conjugate prior, which is a gamma distribution $G(\alpha ,\beta )$ (where $\eta =(\alpha ,\beta )$):

$$\rho (\theta \mid \alpha ,\beta )\,d\theta ={\frac {(\theta /\beta )^{\alpha -1}\,e^{-\theta /\beta }}{\Gamma (\alpha )}}\,{\frac {d\theta }{\beta }}\quad {\text{for }}\theta >0,\ \alpha >0,\ \beta >0.$$

It is straightforward to show the posterior is also a gamma distribution. Write

$$\rho (\theta \mid y)\propto \rho (y\mid \theta )\,\rho (\theta \mid \alpha ,\beta ),$$

where the marginal distribution has been omitted since it does not depend explicitly on $\theta$. Expanding terms which do depend on $\theta$ gives the posterior as:

$$\rho (\theta \mid y)\propto (\theta ^{y}\,e^{-\theta })\,(\theta ^{\alpha -1}\,e^{-\theta /\beta })=\theta ^{y+\alpha -1}\,e^{-\theta (1+1/\beta )}.$$

So the posterior density is also a gamma distribution $G(\alpha ',\beta ')$, where $\alpha '=y+\alpha$ and $\beta '=(1+1/\beta )^{-1}$. Also notice that the marginal is simply the integral of the likelihood times the prior over all $\theta$, which turns out to be a negative binomial distribution.

To apply empirical Bayes, we will approximate the marginal using the maximum likelihood estimate (MLE). But since the posterior is a gamma distribution, the MLE of the marginal turns out to be just the mean of the posterior, which is the point estimate $\operatorname {E} (\theta \mid y)$ we need. Recalling that the mean $\mu$ of a gamma distribution $G(\alpha ',\beta ')$ is simply $\alpha '\beta '$, we have

$$\operatorname {E} (\theta \mid y)=\alpha '\beta '={\frac {{\bar {y}}+\alpha }{1+1/\beta }}={\frac {\beta }{1+\beta }}\,{\bar {y}}+{\frac {1}{1+\beta }}\,(\alpha \beta ).$$

To obtain the values of $\alpha$ and $\beta$, empirical Bayes prescribes estimating the mean $\alpha \beta$ and variance $\alpha \beta ^{2}$ using the complete set of empirical data.

The resulting point estimate $\operatorname {E} (\theta \mid y)$ is therefore like a weighted average of the sample mean $\bar {y}$ and the prior mean $\mu =\alpha \beta$. This turns out to be a general feature of empirical Bayes: the point estimates for the prior (i.e. the mean) will look like a weighted average of the sample estimate and the prior estimate (likewise for estimates of the variance).
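
A minimal sketch of this Poisson–gamma recipe on simulated counts follows. Here the hyperparameters are backed out by matching the mean and variance of the negative-binomial marginal ($\operatorname {E} [Y]=\alpha \beta$, $\operatorname {Var} [Y]=\alpha \beta +\alpha \beta ^{2}$), which is one common moment-based choice rather than the article's exact prescription; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from an assumed Poisson-gamma model: theta_i ~ Gamma(alpha, scale=beta),
# y_i | theta_i ~ Poisson(theta_i).  True hyperparameters are hidden from the estimator.
alpha_true, beta_true = 3.0, 1.5
theta = rng.gamma(alpha_true, beta_true, size=5_000)
y = rng.poisson(theta)

# Moment matching against the negative-binomial marginal:
#   E[Y] = alpha*beta,   Var[Y] = alpha*beta + alpha*beta**2
m, s2 = y.mean(), y.var()
beta_hat = (s2 - m) / m            # requires overdispersion, s2 > m
alpha_hat = m / beta_hat

# Empirical Bayes point estimate for each theta_i: a weighted average of y_i
# and the estimated prior mean alpha*beta, with weight beta/(1+beta) on the data.
w = beta_hat / (1.0 + beta_hat)
theta_hat = w * y + (1.0 - w) * alpha_hat * beta_hat

print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")
print("first 5 shrinkage estimates:", np.round(theta_hat[:5], 2))
```

Each estimate is visibly pulled from the observed count toward the estimated prior mean, mirroring the weighted-average form derived above.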


References

  1. Carlin, Bradley P.; Louis, Thomas A. (2002). "Empirical Bayes: Past, Present, and Future". In Raftery, Adrian E.; Tanner, Martin A.; Wells, Martin T. (eds.). Statistics in the 21st Century. Chapman & Hall. pp. 312–318. ISBN 1-58488-272-7.
  2. Robbins, Herbert (1956). "An Empirical Bayes Approach to Statistics". Breakthroughs in Statistics. Springer Series in Statistics. pp. 157–163. doi:10.1007/978-1-4612-0919-5_26. ISBN 978-0-387-94037-3. MR 0084919.
  3. Carlin, Bradley P.; Louis, Thomas A. (2000). Bayes and Empirical Bayes Methods for Data Analysis (2nd ed.). Chapman & Hall/CRC. Sec. 3.2 and Appendix B. ISBN 978-1-58488-170-4.
  4. Saremi, Saeed; Hyvärinen, Aapo (2019). "Neural Empirical Bayes". Journal of Machine Learning Research. 20 (181): 1–23. ISSN 1533-7928.
