Bayes estimator

From Wikipedia, the free encyclopedia

In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

Definition


Suppose an unknown parameter $\theta$ is known to have a prior distribution $\pi$. Let $\widehat{\theta} = \widehat{\theta}(x)$ be an estimator of $\theta$ (based on some measurements $x$), and let $L(\theta, \widehat{\theta})$ be a loss function, such as squared error. The Bayes risk of $\widehat{\theta}$ is defined as $E_{\pi}(L(\theta, \widehat{\theta}))$, where the expectation is taken over the joint distribution of $\theta$ and $x$, with $\theta$ distributed according to $\pi$: this defines the risk as a function of the estimator $\widehat{\theta}$. An estimator $\widehat{\theta}$ is said to be a Bayes estimator if it minimizes the Bayes risk among all estimators. Equivalently, the estimator which minimizes the posterior expected loss $E(L(\theta, \widehat{\theta}) \mid x)$ for each $x$ also minimizes the Bayes risk and therefore is a Bayes estimator.[1]

If the prior is improper, then an estimator which minimizes the posterior expected loss for each $x$ is called a generalized Bayes estimator.[2]
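
The definition can be made concrete with a small numerical sketch. The three-point posterior, the grid of candidate estimates, and the function names below are all hypothetical choices made for illustration; the only idea taken from the text is that the Bayes action minimizes the posterior expected loss.

```python
def posterior_expected_loss(action, posterior, loss):
    """Posterior expected loss: sum over theta of loss(theta, action) * p(theta | x)."""
    return sum(p * loss(theta, action) for theta, p in posterior.items())


def bayes_action(actions, posterior, loss):
    """Return the candidate action with the smallest posterior expected loss."""
    return min(actions, key=lambda a: posterior_expected_loss(a, posterior, loss))


# Hypothetical posterior p(theta | x) over three parameter values.
posterior = {0.0: 0.2, 1.0: 0.5, 2.0: 0.3}
squared_error = lambda theta, a: (theta - a) ** 2

# Candidate estimates on a grid from 0.00 to 2.00 in steps of 0.01.
grid = [i / 100 for i in range(201)]
best = bayes_action(grid, posterior, squared_error)
posterior_mean = sum(theta * p for theta, p in posterior.items())
print(best, posterior_mean)
```

Under squared error the grid search should land on the posterior mean (1.1 for these numbers); swapping in another loss function changes the minimizer without changing the procedure.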

Examples


Minimum mean square error estimation

Main article: Minimum mean square error

The most common risk function used for Bayesian estimation is the mean square error (MSE), also called squared error risk. The MSE is defined by

$$\mathrm{MSE} = E\left[(\widehat{\theta}(x) - \theta)^{2}\right],$$

where the expectation is taken over the joint distribution of $\theta$ and $x$.

Posterior mean


Using the MSE as risk, the Bayes estimate of the unknown parameter is simply the mean of the posterior distribution,[3]

$$\widehat{\theta}(x) = E[\theta \mid x] = \int \theta \, p(\theta \mid x) \, d\theta.$$

This is known as the minimum mean square error (MMSE) estimator.
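
The reason the posterior mean is optimal can be seen from a standard decomposition of the posterior expected squared error. The symbol $a$ for a candidate estimate is notation introduced here, and the posterior variance is assumed to be finite.

```latex
\begin{aligned}
E\left[(a-\theta)^{2} \mid x\right]
  &= E\left[(\theta - E[\theta \mid x])^{2} \mid x\right]
     + \left(a - E[\theta \mid x]\right)^{2} \\
  &= \operatorname{Var}(\theta \mid x) + \left(a - E[\theta \mid x]\right)^{2}.
\end{aligned}
```

The cross term vanishes because $E[\theta - E[\theta \mid x] \mid x] = 0$. The first term does not depend on $a$, so the expression is minimized by $a = E[\theta \mid x]$.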

Bayes estimators for conjugate priors

Main article: Conjugate prior

If there is no inherent reason to prefer one prior probability distribution over another, a conjugate prior is sometimes chosen for simplicity. A conjugate prior is defined as a prior distribution belonging to some parametric family, for which the resulting posterior distribution also belongs to the same family. This is an important property, since the Bayes estimator, as well as its statistical properties (variance, confidence interval, etc.), can all be derived from the posterior distribution.

Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.
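
Sequential updating with a conjugate prior can be sketched as follows. The Beta prior with Bernoulli observations, the prior parameters, and the data stream are assumptions chosen for illustration; `update_beta` is a hypothetical helper name.

```python
from fractions import Fraction


def update_beta(a, b, outcome):
    """One Bayesian update of a Beta(a, b) prior after a single Bernoulli outcome (1 or 0)."""
    return a + outcome, b + (1 - outcome)


# Hypothetical prior and data stream; Fractions keep the arithmetic exact.
a, b = Fraction(2), Fraction(2)
outcomes = [1, 0, 1, 1, 0, 1]

estimates = []
for y in outcomes:
    a, b = update_beta(a, b, y)      # the posterior becomes the prior for the next step
    estimates.append(a / (a + b))    # posterior mean = Bayes estimate under MSE

# Processing all data at once gives the same posterior as the sequential pass.
batch_a = Fraction(2) + sum(outcomes)
batch_b = Fraction(2) + len(outcomes) - sum(outcomes)
print(estimates[-1], batch_a / (batch_a + batch_b))
```

Because the posterior stays in the Beta family, each step only updates two numbers, and the estimate is available in closed form after every observation.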

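The following are some examples of conjugate priors.

  • If $x \mid \theta$ is normal, $x \mid \theta \sim N(\theta, \sigma^{2})$, and the prior is normal, $\theta \sim N(\mu, \tau^{2})$, then the posterior is also normal and the Bayes estimator under MSE is given by

$$\widehat{\theta}(x) = \frac{\sigma^{2}}{\sigma^{2}+\tau^{2}}\mu + \frac{\tau^{2}}{\sigma^{2}+\tau^{2}}x.$$

  • If $x_{1}, \ldots, x_{n}$ are iid Poisson random variables, $x_{i} \mid \theta \sim P(\theta)$, and the prior is Gamma distributed, $\theta \sim G(a, b)$ (shape $a$, rate $b$), then the posterior is also Gamma distributed, and the Bayes estimator under MSE is given by

$$\widehat{\theta}(X) = \frac{n\overline{X} + a}{n + b}.$$

  • If $x_{1}, \ldots, x_{n}$ are iid uniformly distributed, $x_{i} \mid \theta \sim U(0, \theta)$, and the prior is Pareto distributed, $\theta \sim Pa(\theta_{0}, a)$ (scale $\theta_{0}$, shape $a$), then the posterior is also Pareto distributed, and the Bayes estimator under MSE is given by

$$\widehat{\theta}(X) = \frac{(a+n)\max(\theta_{0}, x_{1}, \ldots, x_{n})}{a+n-1}.$$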
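
The three closed-form estimators can be written as short functions. This is a sketch under the standard model assumptions behind these formulas (normal data with a normal prior, Poisson data with a Gamma prior in shape and rate form, and uniform data on $(0, \theta)$ with a Pareto prior); the function names and the sample inputs are hypothetical.

```python
def normal_normal(x, sigma2, mu, tau2):
    """Posterior mean for x | theta ~ N(theta, sigma2) with prior theta ~ N(mu, tau2)."""
    return sigma2 / (sigma2 + tau2) * mu + tau2 / (sigma2 + tau2) * x


def poisson_gamma(xs, a, b):
    """Posterior mean for iid Poisson(theta) data with a Gamma(shape=a, rate=b) prior."""
    n = len(xs)
    return (sum(xs) + a) / (n + b)   # sum(xs) equals n times the sample mean


def uniform_pareto(xs, theta0, a):
    """Posterior mean for iid U(0, theta) data with a Pareto(scale=theta0, shape=a) prior."""
    n = len(xs)
    return (a + n) * max(theta0, *xs) / (a + n - 1)


print(normal_normal(x=3.0, sigma2=1.0, mu=0.0, tau2=1.0))   # 1.5
print(poisson_gamma([2, 4, 3], a=1.0, b=1.0))               # 2.5
print(uniform_pareto([0.5, 2.0, 1.5], theta0=1.0, a=2.0))   # 2.5
```

In each case the estimate blends prior information with the data, and the prior's influence shrinks as the number of observations grows.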

Alternative risk functions


Risk functions are chosen depending on how one measures the distance between the estimate and the unknown parameter. The MSE is the most common risk function in use, primarily due to its simplicity. However, alternative risk functions are also occasionally used. The following are several examples of such alternatives. We denote the posterior generalized distribution function by $F$.

Posterior median and other quantiles

Main article: Bias of an estimator § Median-unbiased estimators

A "linear" loss function, with $a > 0$, which yields the posterior median as the Bayes estimate:

$$L(\theta, \widehat{\theta}) = a|\theta - \widehat{\theta}|$$
$$F(\widehat{\theta}(x) \mid x) = \tfrac{1}{2}.$$

Another "linear" loss function, which assigns different "weights" $a, b > 0$ to underestimation and overestimation, respectively. It yields a quantile from the posterior distribution, and is a generalization of the previous loss function:

$$L(\theta, \widehat{\theta}) = \begin{cases} a|\theta - \widehat{\theta}|, & \text{for } \theta - \widehat{\theta} \geq 0 \\ b|\theta - \widehat{\theta}|, & \text{for } \theta - \widehat{\theta} < 0 \end{cases}$$
$$F(\widehat{\theta}(x) \mid x) = \frac{a}{a+b}.$$
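
The quantile property can be checked numerically. The sketch below represents a hypothetical posterior by 99 equally weighted points and searches those points for the minimizer of the expected asymmetric loss; the function names and the weights used are illustrative assumptions.

```python
def asymmetric_linear_loss(theta, estimate, a, b):
    """a * |theta - estimate| when theta >= estimate (underestimate), b * |...| otherwise."""
    diff = theta - estimate
    return a * diff if diff >= 0 else -b * diff


def expected_loss(estimate, draws, a, b):
    """Average loss over equally weighted posterior draws."""
    return sum(asymmetric_linear_loss(t, estimate, a, b) for t in draws) / len(draws)


# Hypothetical posterior represented by 99 equally weighted points 0.01, ..., 0.99.
draws = [i / 100 for i in range(1, 100)]

median_like = min(draws, key=lambda e: expected_loss(e, draws, a=1.0, b=1.0))
upper = min(draws, key=lambda e: expected_loss(e, draws, a=3.0, b=1.0))
print(median_like, upper)
```

With equal weights the minimizer is the median of the points (0.5), and with $a = 3$, $b = 1$ it moves to the $a/(a+b) = 0.75$ quantile, since underestimation is penalized three times as heavily.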

Posterior mode


The following loss function is trickier: it yields either the posterior mode, or a point close to it, depending on the curvature and properties of the posterior distribution. Small values of the parameter $K > 0$ are recommended, in order to use the mode as an approximation (here $L > 0$ is a constant):

$$L(\theta, \widehat{\theta}) = \begin{cases} 0, & \text{for } |\theta - \widehat{\theta}| < K \\ L, & \text{for } |\theta - \widehat{\theta}| \geq K. \end{cases}$$
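
Since the expected loss equals $L$ times the posterior probability that $|\theta - \widehat{\theta}| \geq K$, minimizing it amounts to maximizing the posterior mass inside the window of half-width $K$. The sketch below illustrates this on a grid, assuming a hypothetical skewed posterior with density proportional to $\theta(1-\theta)^{4}$ (mode at 0.2); the grid resolution and function names are illustrative.

```python
def window_mass(weights, j, k):
    """Unnormalized posterior mass of grid points i with |i - j| < k (the zero-loss region)."""
    lo, hi = max(0, j - k + 1), min(len(weights), j + k)
    return sum(weights[lo:hi])


# Hypothetical skewed posterior: density proportional to theta * (1 - theta)**4
# on the grid theta_i = i / 1000; its mode is at theta = 0.2.
n = 1000
weights = [(i / n) * (1 - i / n) ** 4 for i in range(n + 1)]


def zero_one_estimate(k):
    """Grid point minimizing the expected loss for K = k / n, i.e. maximizing window mass."""
    j = max(range(n + 1), key=lambda j: window_mass(weights, j, k))
    return j / n


small_K = zero_one_estimate(10)    # K = 0.01
large_K = zero_one_estimate(300)   # K = 0.30
print(small_K, large_K)
```

For the small window the estimate sits essentially at the mode, while the wide window pulls it away from the mode toward the bulk of the posterior mass, which is why small $K$ is recommended.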

Lp Estimators


One can also consider $L^{p}$ risk, for which the loss is given by

$$L(\theta, \widehat{\theta}) = |\theta - \widehat{\theta}|^{p}, \quad p > 0.$$

While optimal $L^{p}$ estimators can be difficult to characterize in closed form, they share many properties with those in the $L^{2}$ case.[4]
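
Without a closed form, an $L^{p}$ estimate can still be approximated numerically. The sketch below uses a brute-force grid search over a hypothetical set of equally weighted posterior draws; it is meant only to show how the minimizer moves with $p$, not as an efficient method.

```python
def lp_estimate(draws, p, grid):
    """Grid point minimizing the sum of |theta - a|**p over equally weighted draws."""
    return min(grid, key=lambda a: sum(abs(t - a) ** p for t in draws))


# Hypothetical posterior draws with one large value, and a grid with step 0.01.
draws = [0.0, 1.0, 2.0, 3.0, 10.0]
grid = [i / 100 for i in range(1001)]

for p in (1, 2, 4):
    print(p, lp_estimate(draws, p, grid))
```

For $p = 1$ the search returns the median of the draws and for $p = 2$ their mean; larger $p$ penalizes big errors more and moves the estimate further toward the outlying value.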


Other loss functions can be conceived, although the mean squared error is the most widely used and validated. Other loss functions are used in statistics, particularly in robust statistics.

Generalized Bayes estimators

See also: Admissible decision rule § Bayes rules and generalized Bayes rules

The prior distribution $p$ has thus far been assumed to be a true probability distribution, in that

$$\int p(\theta) \, d\theta = 1.$$

However, occasionally this can be a restrictive requirement. For example, there is no distribution (covering the set, $\mathbb{R}$, of all real numbers) for which every real number is equally likely. Yet, in some sense, such a "distribution" seems like a natural choice for a non-informative prior, i.e., a prior distribution which does not imply a preference for any particular value of the unknown parameter. One can still define a function $p(\theta) = 1$, but this would not be a proper probability distribution since it has infinite mass,

$$\int p(\theta) \, d\theta = \infty.$$

Such measures $p(\theta)$, which are not probability distributions, are referred to as improper priors.

The use of an improper prior means that the Bayes risk is undefined (since the prior is not a probability distribution and we cannot take an expectation under it). As a consequence, it is no longer meaningful to speak of a Bayes estimator that minimizes the Bayes risk. Nevertheless, in many cases, one can define the posterior distribution

$$p(\theta \mid x) = \frac{p(x \mid \theta) p(\theta)}{\int p(x \mid \theta) p(\theta) \, d\theta}.$$

This is a definition, and not an application of Bayes' theorem, since Bayes' theorem can only be applied when all distributions are proper. However, it is not uncommon for the resulting "posterior" to be a valid probability distribution. In this case, the posterior expected loss

$$\int L(\theta, a) p(\theta \mid x) \, d\theta$$

is typically well-defined and finite. Recall that, for a proper prior, the Bayes estimator minimizes the posterior expected loss. When the prior is improper, an estimator which minimizes the posterior expected loss is referred to as a generalized Bayes estimator.[2]

Example


A typical example is estimation of a location parameter with a loss function of the type $L(a - \theta)$. Here $\theta$ is a location parameter, i.e., $p(x \mid \theta) = f(x - \theta)$.

It is common to use the improper prior $p(\theta) = 1$ in this case, especially when no other more subjective information is available. This yields

$$p(\theta \mid x) = \frac{p(x \mid \theta) p(\theta)}{p(x)} = \frac{f(x - \theta)}{p(x)}$$

so the posterior expected loss is

$$E[L(a - \theta) \mid x] = \int L(a - \theta) p(\theta \mid x) \, d\theta = \frac{1}{p(x)} \int L(a - \theta) f(x - \theta) \, d\theta.$$

The generalized Bayes estimator is the value $a(x)$ that minimizes this expression for a given $x$. This is equivalent to minimizing

$$\int L(a - \theta) f(x - \theta) \, d\theta \quad \text{for a given } x. \tag{1}$$

In this case it can be shown that the generalized Bayes estimator has the form $x + a_{0}$, for some constant $a_{0}$. To see this, let $a_{0}$ be the value minimizing (1) when $x = 0$. Then, given a different value $x_{1}$, we must minimize

$$\int L(a - \theta) f(x_{1} - \theta) \, d\theta = \int L(a - x_{1} - \theta') f(-\theta') \, d\theta', \tag{2}$$

where $\theta' = \theta - x_{1}$. This is identical to (1) with $x = 0$, except that $a$ has been replaced by $a - x_{1}$. Thus, the minimizer satisfies $a - x_{1} = a_{0}$, so that the optimal estimator has the form

$$a(x) = a_{0} + x.$$
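
The shift property can be checked numerically by minimizing expression (1) on a grid for two different observations. The exponential noise density, the squared-error loss, and the grid ranges below are hypothetical choices; any $f$ and $L$ for which the integral is finite would serve.

```python
import math


def noise_density(u):
    """Hypothetical noise density f: standard exponential, f(u) = exp(-u) for u >= 0."""
    return math.exp(-u) if u >= 0 else 0.0


def loss(d):
    """Hypothetical loss L(a - theta); squared error is used here."""
    return d * d


# Grids are built from integers so that the float values are reproducible.
thetas = [k / 50 for k in range(-1000, 201)]      # theta from -20 to 4, step 0.02
actions = [i / 100 for i in range(-300, 301)]     # a from -3 to 3, step 0.01


def generalized_bayes(x):
    """Minimize a Riemann-sum version of (1), sum of L(a - theta) f(x - theta), over the grid."""
    weights = [noise_density(x - t) for t in thetas]
    return min(actions, key=lambda a: sum(loss(a - t) * w for t, w in zip(thetas, weights)))


a0 = generalized_bayes(0.0)
x1 = 1.5
a1 = generalized_bayes(x1)
print(a0, a1 - x1)
```

Both printed numbers should agree, and for this particular $f$ and $L$ they are close to $-1$, because the flat-prior posterior is a reflected exponential ending at $x$ whose mean is $x - 1$.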

Empirical Bayes estimators

Main article: Empirical Bayes method

A Bayes estimator derived through the empirical Bayes method is called an empirical Bayes estimator. Empirical Bayes methods enable the use of auxiliary empirical data, from observations of related parameters, in the development of a Bayes estimator. This is done under the assumption that the estimated parameters are obtained from a common prior. For example, if independent observations of different parameters are performed, then the estimation performance of a particular parameter can sometimes be improved by using data from other observations.

There are both parametric and non-parametric approaches to empirical Bayes estimation.[5]

Example


The following is a simple example of parametric empirical Bayes estimation. Given past observations $x_{1}, \ldots, x_{n}$ having conditional distribution $f(x_{i} \mid \theta_{i})$, one is interested in estimating $\theta_{n+1}$ based on $x_{n+1}$. Assume that the $\theta_{i}$'s have a common prior $\pi$ which depends on unknown parameters. For example, suppose that $\pi$ is normal with unknown mean $\mu_{\pi}$ and variance $\sigma_{\pi}^{2}$. We can then use the past observations to determine the mean and variance of $\pi$ in the following way.

First, we estimate the mean $\mu_{m}$ and variance $\sigma_{m}^{2}$ of the marginal distribution of $x_{1}, \ldots, x_{n}$ using the maximum likelihood approach:

$$\widehat{\mu}_{m} = \frac{1}{n} \sum x_{i},$$
$$\widehat{\sigma}_{m}^{2} = \frac{1}{n} \sum (x_{i} - \widehat{\mu}_{m})^{2}.$$

Next, we use the law of total expectation to express $\mu_{m}$ and the law of total variance to express $\sigma_{m}^{2}$:

$$\mu_{m} = E_{\pi}[\mu_{f}(\theta)],$$
$$\sigma_{m}^{2} = E_{\pi}[\sigma_{f}^{2}(\theta)] + E_{\pi}[(\mu_{f}(\theta) - \mu_{m})^{2}],$$

where $\mu_{f}(\theta)$ and $\sigma_{f}^{2}(\theta)$ are the mean and variance of the conditional distribution $f(x_{i} \mid \theta_{i})$, which are assumed to be known. In particular, suppose that $\mu_{f}(\theta) = \theta$ and that $\sigma_{f}^{2}(\theta) = K$; we then have

$$\mu_{\pi} = \mu_{m},$$
$$\sigma_{\pi}^{2} = \sigma_{m}^{2} - \sigma_{f}^{2} = \sigma_{m}^{2} - K.$$

Finally, we obtain the estimated moments of the prior,

$$\widehat{\mu}_{\pi} = \widehat{\mu}_{m},$$
$$\widehat{\sigma}_{\pi}^{2} = \widehat{\sigma}_{m}^{2} - K.$$

For example, if $x_{i} \mid \theta_{i} \sim N(\theta_{i}, 1)$, and if we assume a normal prior (which is a conjugate prior in this case), we conclude that $\theta_{n+1} \sim N(\widehat{\mu}_{\pi}, \widehat{\sigma}_{\pi}^{2})$, from which the Bayes estimator of $\theta_{n+1}$ based on $x_{n+1}$ can be calculated.
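
The steps above can be sketched in code for the normal case. The function name and the sample data are hypothetical, and clipping the estimated prior variance at zero is an added safeguard that the text does not discuss.

```python
def empirical_bayes_normal(past_xs, x_new, noise_var=1.0):
    """Parametric empirical Bayes with x_i | theta_i ~ N(theta_i, noise_var) and a normal prior."""
    n = len(past_xs)
    mu_m = sum(past_xs) / n                                # ML estimate of the marginal mean
    var_m = sum((x - mu_m) ** 2 for x in past_xs) / n      # ML estimate of the marginal variance
    mu_pi = mu_m                                           # estimated prior mean
    var_pi = max(var_m - noise_var, 0.0)                   # estimated prior variance (clipped: assumption)
    # Posterior mean of theta_{n+1} under the fitted N(mu_pi, var_pi) prior.
    shrink = noise_var / (noise_var + var_pi)
    return shrink * mu_pi + (1 - shrink) * x_new


past = [1.0, 3.0, 5.0, 7.0, 9.0]
print(empirical_bayes_normal(past, x_new=10.0))   # 9.375
```

Here the marginal mean is 5 and the marginal variance is 8, so the fitted prior is $N(5, 7)$ and the new observation 10 is shrunk toward 5 with weight $1/8$ on the prior mean.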

Properties


Admissibility

See also: Admissible decision rule

Bayes rules having finite Bayes risk are typically admissible. The following are some specific examples of admissibility theorems.

  • If a Bayes rule is unique then it is admissible.[6] For example, as stated above, under mean squared error (MSE) the Bayes rule is unique and therefore admissible.
  • If θ belongs to a discrete set, then all Bayes rules are admissible.
  • If θ belongs to a continuous (non-discrete) set, and if the risk function R(θ,δ) is continuous in θ for every δ, then all Bayes rules are admissible.

By contrast, generalized Bayes rules often have undefined Bayes risk in the case of improper priors. These rules are often inadmissible and the verification of their admissibility can be difficult. For example, the generalized Bayes estimator of a location parameter θ based on Gaussian samples (described in the "Generalized Bayes estimators" section above) is inadmissible when the dimension $p$ of θ satisfies $p > 2$; this is known as Stein's phenomenon.

Asymptotic efficiency


Let $\theta$ be an unknown random variable, and suppose that $x_{1}, x_{2}, \ldots$ are iid samples with density $f(x_{i} \mid \theta)$. Let $\delta_{n} = \delta_{n}(x_{1}, \ldots, x_{n})$ be a sequence of Bayes estimators of $\theta$ based on an increasing number of measurements. We are interested in analyzing the asymptotic performance of this sequence of estimators, i.e., the performance of $\delta_{n}$ for large $n$.

To this end, it is customary to regard $\theta$ as a deterministic parameter whose true value is $\theta_{0}$. Under specific conditions,[7] for large samples (large values of $n$), the posterior density of $\theta$ is approximately normal. In other words, for large $n$, the effect of the prior probability on the posterior is negligible. Moreover, if $\delta_{n}$ is the Bayes estimator under MSE risk, then it is asymptotically unbiased and it converges in distribution to the normal distribution:

$$\sqrt{n}(\delta_{n} - \theta_{0}) \to N\left(0, \frac{1}{I(\theta_{0})}\right),$$

where $I(\theta_{0})$ is the Fisher information of $\theta_{0}$. It follows that the Bayes estimator $\delta_{n}$ under MSE is asymptotically efficient.

Another estimator which is asymptotically normal and efficient is the maximum likelihood estimator (MLE). The relations between the maximum likelihood and Bayes estimators can be shown in the following simple example.

Example: estimating p in a binomial distribution

Consider the estimator of $\theta$ based on a binomial sample $x \sim b(\theta, n)$, where $\theta$ denotes the probability of success. Assuming $\theta$ is distributed according to the conjugate prior, which in this case is the Beta distribution $B(a, b)$, the posterior distribution is known to be $B(a + x, b + n - x)$. Thus, the Bayes estimator under MSE is

$$\delta_{n}(x) = E[\theta \mid x] = \frac{a + x}{a + b + n}.$$

The MLE in this case is $x/n$, and so we get

$$\delta_{n}(x) = \frac{a + b}{a + b + n} E[\theta] + \frac{n}{a + b + n} \delta_{MLE}.$$

The last equation implies that, for $n \to \infty$, the Bayes estimator (in the described problem) is close to the MLE.
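
The weighted-average form and the convergence to the MLE can be checked with exact arithmetic. The prior parameters and the 70% success rate below are hypothetical, and integer $a$, $b$ are assumed so that `Fraction` can be used directly.

```python
from fractions import Fraction


def bayes_binomial(x, n, a, b):
    """Posterior mean of theta for x successes in n trials under a Beta(a, b) prior."""
    return Fraction(a + x, a + b + n)


def as_weighted_average(x, n, a, b):
    """The same estimate written as a mix of the prior mean a/(a+b) and the MLE x/n."""
    w_prior = Fraction(a + b, a + b + n)
    return w_prior * Fraction(a, a + b) + (1 - w_prior) * Fraction(x, n)


a, b = 2, 2   # hypothetical prior parameters
for n in (10, 100, 10000):
    x = (7 * n) // 10                      # 70% successes, so the MLE is 0.7
    est = bayes_binomial(x, n, a, b)
    assert est == as_weighted_average(x, n, a, b)
    print(n, float(est), float(abs(est - Fraction(x, n))))
```

The printed gap between the Bayes estimate and the MLE shrinks roughly like $1/n$, since the prior's weight is $(a+b)/(a+b+n)$.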

On the other hand, when $n$ is small, the prior information is still relevant to the decision problem and affects the estimate. To see the relative weight of the prior information, assume that $a = b$; in this case each measurement brings in 1 new bit of information; the formula above shows that the prior information has the same weight as $a + b$ bits of the new information. In applications, one often knows very little about fine details of the prior distribution; in particular, there is no reason to assume that it coincides with $B(a, b)$ exactly. In such a case, one possible interpretation of this calculation is: "there is a non-pathological prior distribution with the mean value 0.5 and the standard deviation $d$ which gives the weight of prior information equal to $1/(4d^{2}) - 1$ bits of new information."

Another example of the same phenomenon is the case when the prior estimate and a measurement are normally distributed. If the prior is centered at $B$ with deviation $\Sigma$, and the measurement is centered at $b$ with deviation $\sigma$, then the posterior is centered at $\frac{\alpha}{\alpha+\beta}B + \frac{\beta}{\alpha+\beta}b$, with the weights in this weighted average being $\alpha = \sigma^{2}$, $\beta = \Sigma^{2}$. Moreover, the squared posterior deviation is $\frac{\Sigma^{2}\sigma^{2}}{\Sigma^{2}+\sigma^{2}}$, so the posterior precision $1/\Sigma^{2} + 1/\sigma^{2}$ is the sum of the prior and measurement precisions. In other words, the prior is combined with the measurement in exactly the same way as if it were an extra measurement to take into account.

For example, if $\Sigma = \sigma/2$, then the deviation of 4 measurements combined matches the deviation of the prior (assuming that errors of measurements are independent). And the weights $\alpha, \beta$ in the formula for the posterior match this: the weight of the prior is 4 times the weight of the measurement. Combining this prior with $n$ measurements with average $v$ results in the posterior centered at $\frac{4}{4+n}V + \frac{n}{4+n}v$, where $V$ is the center of the prior; in particular, the prior plays the same role as 4 measurements made in advance. In general, the prior has the weight of $(\sigma/\Sigma)^{2}$ measurements.
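
The "prior as extra measurements" reading can be verified with the standard normal-normal update, in which precisions (inverse variances) add. The numbers below are hypothetical, and `posterior_normal` is an illustrative helper name.

```python
def posterior_normal(prior_mean, prior_sd, meas_mean, meas_sd, n=1):
    """Posterior mean and sd after n independent measurements with average meas_mean."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / meas_sd ** 2
    total = prior_precision + data_precision           # precisions add
    mean = (prior_precision * prior_mean + data_precision * meas_mean) / total
    return mean, total ** -0.5


# Hypothetical numbers: the prior sd is half the measurement sd,
# so the prior should count as 4 measurements.
V, v, sigma, n = 10.0, 20.0, 2.0, 6
mean, sd = posterior_normal(V, sigma / 2, v, sigma, n)
print(mean, 4 / (4 + n) * V + n / (4 + n) * v)   # both 16.0 up to rounding
print(sd, sigma / (4 + n) ** 0.5)                # same sd as 4 + n measurements
```

The posterior deviation matches that of $4 + n$ pooled measurements, which is the sense in which the prior acts like 4 measurements made in advance.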

Compare to the example of binomial distribution: there the prior has the weight of (σ/Σ)²−1 measurements. One can see that the exact weight does depend on the details of the distribution, but when σ≫Σ, the difference becomes small.

Practical example of Bayes estimators


The Internet Movie Database uses a formula for calculating and comparing the ratings of films by its users, including their Top Rated 250 Titles, which is claimed to give "a true Bayesian estimate".[8] The following Bayesian formula was initially used to calculate a weighted average score for the Top 250, though the formula has since changed:

$$W = \frac{Rv + Cm}{v + m}$$

where:

$W$ = weighted rating
$R$ = average rating for the movie as a number from 1 to 10 (mean) = (Rating)
$v$ = number of votes/ratings for the movie = (votes)
$m$ = weight given to the prior estimate (in this case, the number of votes IMDb deemed necessary for the average rating to approach statistical validity)
$C$ = the mean vote across the whole pool (currently 7.0)

Note that $W$ is just the weighted arithmetic mean of $R$ and $C$ with weight vector $(v, m)$. As the number of ratings surpasses $m$, the confidence of the average rating surpasses the confidence of the mean vote for all films ($C$), and the weighted Bayesian rating ($W$) approaches a straight average ($R$). The closer $v$ (the number of ratings for the film) is to zero, the closer $W$ is to $C$. In simpler terms, the fewer ratings cast for a film, the more that film's weighted rating skews towards the average across all films, while a film with many ratings has a weighted rating approaching its pure arithmetic average rating.

IMDb's approach ensures that a film with only a few ratings, all at 10, would not rank above "The Godfather", for example, with a 9.2 average from over 500,000 ratings.
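
The formula is easy to try out. In the sketch below the threshold `m = 25000` is a hypothetical value chosen for illustration, while `C = 7.0` and the two films' figures follow the numbers quoted in the text.

```python
def weighted_rating(R, v, m, C):
    """W = (R*v + C*m) / (v + m): the average rating R shrunk toward the global mean C."""
    return (R * v + C * m) / (v + m)


# Hypothetical threshold m; C = 7.0 is the value quoted in the text.
m, C = 25000, 7.0
few_votes = weighted_rating(R=10.0, v=5, m=m, C=C)          # stays near C
many_votes = weighted_rating(R=9.2, v=500000, m=m, C=C)     # stays near R
print(round(few_votes, 4), round(many_votes, 4))
```

With these inputs the five-vote film scores barely above 7.0, while the film with 500,000 votes keeps a score close to its 9.2 average, so the ranking behaves as described.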

Notes

  1. Lehmann and Casella (1998), Theorem 4.1.1.
  2. Lehmann and Casella (1998), Definition 4.2.9.
  3. Jaynes, E. T. (2007). Probability Theory: The Logic of Science (5th printing). Cambridge: Cambridge University Press. p. 172. ISBN 978-0-521-59271-0.
  4. Dytso, A.; Bustin, R.; Tuninetti, D.; Devroye, N.; Poor, H. V.; Shamai Shitz, S. (2018). "On the Minimum Mean pth Error in Gaussian Noise Channels and Its Applications". IEEE Transactions on Information Theory. 64 (3). IEEE: 2012–2037. arXiv:1607.01461. doi:10.1109/TIT.2017.2782786.
  5. Berger (1980), section 4.5.
  6. Lehmann and Casella (1998), Theorem 5.2.4.
  7. Lehmann and Casella (1998), section 6.8.
  8. IMDb Top 250.

References

  • Berger, James O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag. ISBN 0-387-96098-8. MR 0804611.
  • Lehmann, E. L.; Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. ISBN 0-387-98502-6.
  • Pilz, Jürgen (1991). "Bayesian estimation". Bayesian Estimation and Experimental Design in Linear Regression Models. Chichester: John Wiley & Sons. pp. 38–117. ISBN 0-471-91732-X.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Bayes_estimator&oldid=1315344769"