Maximum entropy probability distribution

Probability distribution with the greatest entropy within a specified class

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

Definition of entropy and differential entropy

Further information: Entropy (information theory)

If X is a continuous random variable with probability density p(x), then the differential entropy of X is defined as[1][2][3]

H(X) = -\int_{-\infty}^{\infty} p(x)\,\log p(x)\,dx \,.

If X is a discrete random variable with distribution given by

\Pr(X = x_k) = p_k \qquad \text{for } k = 1, 2, \ldots

then the entropy of X is defined as

H(X) = -\sum_{k \geq 1} p_k \log p_k \,.

The seemingly divergent term p(x) log p(x) is replaced by zero whenever p(x) = 0.

This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and Differential entropy. In connection with maximum entropy distributions, this is the only one needed, because maximizing H(X) will also maximize the more general forms.

The base of the logarithm is not important, as long as the same one is used consistently: change of base merely rescales the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists often prefer the natural logarithm, resulting in a unit of nats for the entropy.
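
To make the base-change relationship concrete, here is a minimal Python sketch (added for illustration, not part of the original article) that computes the entropy of a small discrete distribution in nats and in bits; the example probabilities are arbitrary:

```python
import numpy as np

# Entropy of a small discrete distribution in nats (natural log) and bits (log2).
p = np.array([0.5, 0.25, 0.125, 0.125])

H_nats = -np.sum(p * np.log(p))
H_bits = -np.sum(p * np.log2(p))

# Changing the base only rescales the entropy: H_bits = H_nats / ln 2.
assert np.isclose(H_bits, H_nats / np.log(2))
print(H_nats, H_bits)  # approximately 1.2130 nats and exactly 1.75 bits
```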

However, the chosen measure dx is crucial, even though the typical use of the Lebesgue measure is often defended as a "natural" choice: which measure is chosen determines the entropy and the consequent maximum entropy distribution.

Distributions with measured constants

Many statistical distributions of applicable interest are those for which the moments or other measurable quantities are constrained to be constants. The following theorem by Ludwig Boltzmann gives the form of the probability density under these constraints.

Continuous case

Suppose S is a continuous, closed subset of the real numbers ℝ and we choose to specify n measurable functions f_1, ..., f_n and n numbers a_1, ..., a_n. We consider the class C of all real-valued random variables which are supported on S (i.e. whose density function is zero outside of S) and which satisfy the n moment conditions:

\operatorname{E}[f_j(X)] \geq a_j \qquad \text{for } j = 1, \ldots, n

If there is a member in C whose density function is positive everywhere in S, and if there exists a maximal entropy distribution for C, then its probability density p(x) has the following form:

p(x) = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x) \right) \qquad \text{for all } x \in S

where we assume that f_0(x) = 1. The constant λ_0 and the n Lagrange multipliers λ = (λ_1, ..., λ_n) solve the constrained optimization problem with a_0 = 1 (which ensures that p integrates to unity):[4]

\max_{\lambda_0;\, \boldsymbol{\lambda}} \left\{ \sum_{j=0}^{n} \lambda_j a_j - \int \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x) \right) dx \right\} \qquad \text{subject to } \boldsymbol{\lambda} \geq \mathbf{0}

Using the Karush–Kuhn–Tucker conditions, it can be shown that the optimization problem has a unique solution because the objective function in the optimization is concave in λ.

Note that when the moment constraints are equalities (instead of inequalities), that is,

\operatorname{E}[f_j(X)] = a_j \qquad \text{for } j = 1, \ldots, n \,,

then the constraint condition λ ≥ 0 can be dropped, which makes optimization over the Lagrange multipliers unconstrained.
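
As an illustrative numerical sketch of this theorem (an addition, assuming NumPy and SciPy are available), the following code solves the single equality constraint E[X] = m on S = [0, ∞); the resulting density exp(λ_0 + λ_1 x) is the exponential distribution with rate 1/m. The target mean m = 2 and the helper mean_of are illustrative choices, not part of the article:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

# Maximum entropy density on S = [0, oo) under the single equality constraint
# E[X] = m has the form p(x) = exp(lambda0 + lambda1 * x) with lambda1 < 0.
m = 2.0  # illustrative target mean

def mean_of(lam1):
    # Mean of the density proportional to exp(lam1 * x) on [0, oo).
    Z, _ = quad(lambda x: np.exp(lam1 * x), 0, np.inf)
    mx, _ = quad(lambda x: x * np.exp(lam1 * x), 0, np.inf)
    return mx / Z

# Solve E[X] = m for lambda1; lambda0 is then fixed by normalization.
lam1 = brentq(lambda l: mean_of(l) - m, -50.0, -1e-3)
print(lam1)  # ~ -0.5 = -1/m, i.e. the exponential distribution with rate 1/m
```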

Discrete case

Suppose S = {x_1, x_2, ...} is a (finite or infinite) discrete subset of the reals, and that we choose to specify n functions f_1, ..., f_n and n numbers a_1, ..., a_n. We consider the class C of all discrete random variables X which are supported on S and which satisfy the n moment conditions

\operatorname{E}[f_j(X)] \geq a_j \qquad \text{for } j = 1, \ldots, n

If there exists a member of class C which assigns positive probability to all members of S and if there exists a maximum entropy distribution for C, then this distribution has the following shape:

\Pr(X = x_k) = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x_k) \right) \qquad \text{for } k = 1, 2, \ldots

where we assume that f_0 = 1 and the constants λ_0, λ ≡ (λ_1, ..., λ_n) solve the constrained optimization problem with a_0 = 1:[5]

\max_{\lambda_0;\, \boldsymbol{\lambda}} \left\{ \sum_{j=0}^{n} \lambda_j a_j - \sum_{k \geq 1} \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x_k) \right) \right\} \qquad \text{for which } \boldsymbol{\lambda} \geq \mathbf{0}

Again as above, if the moment conditions are equalities (instead of inequalities), then the constraint condition λ ≥ 0 is not present in the optimization.

Proof in the case of equality constraints

In the case of equality constraints, this theorem is proved with the calculus of variations and Lagrange multipliers. The constraints can be written as

\int_{-\infty}^{\infty} f_j(x)\, p(x)\, dx = a_j

We consider the functional

J(p) = \int_{-\infty}^{\infty} p(x) \ln p(x)\, dx - \eta_0 \left( \int_{-\infty}^{\infty} p(x)\, dx - 1 \right) - \sum_{j=1}^{n} \lambda_j \left( \int_{-\infty}^{\infty} f_j(x)\, p(x)\, dx - a_j \right)

where η_0 and the λ_j, j ≥ 1, are the Lagrange multipliers. The zeroth constraint ensures the second axiom of probability. The other constraints are that the measurements of the function are given constants up to order n. The entropy attains an extremum when the functional derivative is equal to zero:

\frac{\delta J(p)}{\delta p} = \ln p(x) + 1 - \eta_0 - \sum_{j=1}^{n} \lambda_j f_j(x) = 0

Therefore, the extremal entropy probability distribution in this case must be of the form (with λ_0 := η_0 - 1):

p(x) = e^{-1+\eta_0}\, e^{\sum_{j=1}^{n} \lambda_j f_j(x)} = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x) \right),

remembering that f_0(x) = 1. It can be verified that this is the maximal solution by checking that the variation around this solution is always negative.
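
As a worked instance of this variational calculation (added here for illustration; it reproduces the standard derivation referenced later in the article), take the equality constraints E[X] = μ and E[(X − μ)²] = σ² on S = ℝ:

```latex
% Illustrative worked case: equality constraints E[X] = \mu and E[(X-\mu)^2] = \sigma^2 on S = \mathbb{R}.
% The stationarity condition \delta J/\delta p = 0 gives
\ln p(x) = -1 + \eta_0 + \lambda_1 x + \lambda_2 (x-\mu)^2,
\qquad\text{i.e.}\qquad
p(x) = \exp\!\bigl(\lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2\bigr).
% Integrability on \mathbb{R} forces \lambda_2 < 0; the mean constraint gives \lambda_1 = 0,
% and normalization together with the variance constraint gives
\lambda_2 = -\frac{1}{2\sigma^2},
\qquad
e^{\lambda_0} = \frac{1}{\sqrt{2\pi\sigma^2}},
\qquad\text{so}\qquad
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}.
```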

Uniqueness of the maximum

Suppose p and p′ are distributions satisfying the expectation-constraints. Letting α ∈ (0,1) and considering the distribution q = αp + (1−α)p′, it is clear that this distribution satisfies the expectation-constraints and furthermore has support supp(q) = supp(p) ∪ supp(p′). From basic facts about entropy, it holds that H(q) ≥ αH(p) + (1−α)H(p′). Taking the limits α → 1 and α → 0, respectively, yields H(q) ≥ H(p) and H(q) ≥ H(p′).

It follows that a distribution satisfying the expectation-constraints and maximising entropy must necessarily have full support; that is, the distribution is almost everywhere strictly positive. In turn, the maximising distribution must be an internal point in the space of distributions satisfying the expectation-constraints, that is, a local extreme. Thus it suffices to show that the local extreme is unique: this shows both that the entropy-maximising distribution is unique and that the local extreme is the global maximum.

Suppose p and p′ are local extremes. Reformulating the above computations, these are characterised by parameters λ, λ′ ∈ ℝⁿ via p(x) = exp⟨λ, f(x)⟩ / C(λ), and similarly for p′, where C(λ) = ∫_ℝ exp⟨λ, f(x)⟩ dx. We now note a series of identities: via the satisfaction of the expectation-constraints and utilising gradients / directional derivatives, one has

D \log C(\cdot)\,\big|_{\boldsymbol{\lambda}} = \left. \frac{D C(\cdot)}{C(\cdot)} \right|_{\boldsymbol{\lambda}} = \operatorname{E}_p[\mathbf{f}(X)] = \mathbf{a}

and similarly for λ′. Letting u = λ′ − λ ∈ ℝⁿ, one obtains:

0 = \langle u, \mathbf{a} - \mathbf{a} \rangle = D_u \log C(\cdot)\,\big|_{\boldsymbol{\lambda}'} - D_u \log C(\cdot)\,\big|_{\boldsymbol{\lambda}} = D_u^2 \log C(\cdot)\,\big|_{\boldsymbol{\gamma}}

where γ = θλ + (1−θ)λ′ for some θ ∈ (0,1), by the mean value theorem. Computing further, one has

\begin{aligned}
0 &= D_u^2 \log C(\cdot)\,\big|_{\boldsymbol{\gamma}} \\
  &= D_u\!\left( \frac{D_u C(\cdot)}{C(\cdot)} \right)\!\bigg|_{\boldsymbol{\gamma}}
   = \frac{D_u^2 C(\cdot)}{C(\cdot)}\bigg|_{\boldsymbol{\gamma}} - \frac{\left(D_u C(\cdot)\right)^2}{C(\cdot)^2}\bigg|_{\boldsymbol{\gamma}} \\
  &= \operatorname{E}_q\!\left[ \langle u, \mathbf{f}(X) \rangle^2 \right] - \left( \operatorname{E}_q\!\left[ \langle u, \mathbf{f}(X) \rangle \right] \right)^2 \\
  &= \operatorname{Var}_q\!\left[ \langle u, \mathbf{f}(X) \rangle \right]
\end{aligned}

where q is of the same exponential form as above, only parameterised by γ. Assuming that no non-trivial linear combination of the observables is almost everywhere (a.e.) constant (which holds, for example, if the observables are independent and not a.e. constant), it follows that ⟨u, f(X)⟩ has non-zero variance unless u = 0. By the above equation, the latter must therefore be the case. Hence λ′ − λ = u = 0, so the parameters characterising the local extremes p, p′ are identical, which means that the distributions themselves are identical. Thus the local extreme is unique and, by the above discussion, the maximum is unique, provided a local extreme actually exists.
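
The key identity used above, that the second directional derivative of log C equals a variance under q, can be checked numerically. The following sketch (an addition; the support points, observables, and parameter values are arbitrary choices) compares a finite-difference Hessian of log C with the covariance of f(X):

```python
import numpy as np

# Small discrete check of the identity D^2 log C(lambda) = Cov_q[f(X)].
xs = np.array([1.0, 2.0, 3.0, 4.0])      # support S (illustrative)
f = np.stack([xs, xs**2])                # observables f_1(x) = x, f_2(x) = x^2
lam = np.array([0.3, -0.2])              # an arbitrary parameter point

def logC(l):
    return np.log(np.sum(np.exp(f.T @ l)))

# Numerical Hessian of log C via central differences.
eps = 1e-5
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        e_i, e_j = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        H[i, j] = (logC(lam + e_i + e_j) - logC(lam + e_i - e_j)
                   - logC(lam - e_i + e_j) + logC(lam - e_i - e_j)) / (4 * eps**2)

# Covariance of f(X) under q(x) proportional to exp(<lam, f(x)>).
w = np.exp(f.T @ lam)
q = w / w.sum()
mean_f = f @ q
cov_f = (f * q) @ f.T - np.outer(mean_f, mean_f)

print(np.allclose(H, cov_f, atol=1e-4))  # True: Hessian of log C equals Cov_q[f(X)]
```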

Caveats

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on ℝ with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy.[a] It is also possible that the expected value restrictions for the class C force the probability distribution to be zero in certain subsets of S. In that case our theorem doesn't apply, but one can work around this by shrinking the set S.

Examples

Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy. To see this, rewrite the density as p(x) = exp(ln p(x)) and compare to the expression of the theorem above. By choosing ln p(x) → f(x) to be the measurable function and

\int \exp(f(x))\, f(x)\, dx = -H

to be the constant, p(x) is the maximum entropy probability distribution under the constraint

\int p(x)\, f(x)\, dx = -H \,.

Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy. These are often found by starting with the same procedure ln p(x) → f(x) and finding that f(x) can be separated into parts.

A table of examples of maximum entropy distributions is given in Lisman (1972)[6] and Park & Bera (2009).[7]

Uniform and piecewise uniform distributions

The uniform distribution on the interval [a, b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b], and thus the probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision a = a_0 < a_1 < ... < a_k = b of the interval [a, b] and probabilities p_1, ..., p_k that add up to one, then we can consider the class of all continuous distributions such that

\Pr(a_{j-1} \leq X < a_j) = p_j \qquad \text{for } j = 1, \ldots, k

The density of the maximum entropy distribution for this class is constant on each of the intervals [a_{j−1}, a_j). The uniform distribution on the finite set {x_1, ..., x_n} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
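
For illustration (not part of the article), a short Python sketch constructing the piecewise-constant maximum entropy density for an assumed subdivision and assumed cell probabilities:

```python
import numpy as np

# Piecewise-uniform maximum entropy density: given a subdivision a_0 < ... < a_k
# and probabilities p_1, ..., p_k, the maximizing density is constant on each cell.
edges = np.array([0.0, 1.0, 3.0, 4.0])   # assumed subdivision a_0, ..., a_3
probs = np.array([0.2, 0.5, 0.3])        # assumed cell probabilities (sum to 1)

heights = probs / np.diff(edges)         # density value on each [a_{j-1}, a_j)

def density(x):
    # Evaluate the maximum entropy density at x (0 outside [a_0, a_k)).
    j = np.searchsorted(edges, x, side="right") - 1
    inside = (j >= 0) & (j < len(heights))
    return np.where(inside, heights[np.clip(j, 0, len(heights) - 1)], 0.0)

# Differential entropy of a piecewise-constant density: -sum_j p_j * log(height_j).
H = -np.sum(probs * np.log(heights))
print(density(np.array([0.5, 2.0, 3.5, 4.5])), H)
```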

Positive and specified mean: the exponential distribution

The exponential distribution, for which the density function is

p(x \mid \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0, \\ 0 & x < 0, \end{cases}

is the maximum entropy distribution among all continuous distributions supported in [0,∞) that have a specified mean of 1/λ.

In the case of distributions supported on [0,∞), the maximum entropy distribution depends on relationships between the first and second moments. In specific cases, it may be the exponential distribution, or may be another distribution, or may be undefinable.[8]

Specified mean and variance: the normal distribution

The normal distribution N(μ, σ²), for which the density function is

p(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},

has maximum entropy among all real-valued distributions supported on (−∞, ∞) with a specified variance σ² (a particular moment). The same is true when the mean μ and the variance σ² are specified (the first two moments), since entropy is translation invariant on (−∞, ∞). Therefore, the assumption of normality imposes the minimal prior structural constraint beyond these moments. (See the differential entropy article for a derivation.)
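
A quick numerical comparison (added for illustration) of the differential entropies of a few unit-variance distributions shows the normal attaining the maximum value ½ ln(2πeσ²):

```python
import numpy as np

# Differential entropies (in nats) of a few unit-variance distributions.
sigma2 = 1.0

H_normal = 0.5 * np.log(2 * np.pi * np.e * sigma2)   # ~1.4189, the maximum
H_uniform = np.log(np.sqrt(12 * sigma2))             # width sqrt(12) -> ~1.2425
b = np.sqrt(sigma2 / 2)                              # Laplace scale with variance sigma2
H_laplace = 1 + np.log(2 * b)                        # ~1.3466

print(H_normal, H_laplace, H_uniform)  # the normal is the largest
```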

Discrete distributions with specified mean

Among all the discrete distributions supported on the set {x_1, ..., x_n} with a specified mean μ, the maximum entropy distribution has the following shape:

\Pr(X = x_k) = C r^{x_k} \qquad \text{for } k = 1, \ldots, n

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

For example, suppose a large number N of dice are thrown and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x_1, ..., x_6} = {1, ..., 6} and μ = S/N.
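
A small sketch of this dice calculation (an addition, assuming SciPy; the target mean μ = 4.5 is an arbitrary choice) that solves for C and r numerically:

```python
import numpy as np
from scipy.optimize import brentq

# Maximum entropy distribution on {1,...,6} with a specified mean mu,
# i.e. Pr(X = k) = C * r**k.
mu = 4.5
ks = np.arange(1, 7)

def mean_given_r(r):
    w = r ** ks
    return np.sum(ks * w) / np.sum(w)

# mean_given_r is increasing in r; r < 1 skews toward 1, r > 1 toward 6.
r = brentq(lambda r: mean_given_r(r) - mu, 1e-6, 1e6)
C = 1.0 / np.sum(r ** ks)
p = C * r ** ks
print(p, p @ ks)   # probabilities summing to 1, with mean ~4.5
```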

Finally, among all the discrete distributions supported on the infinite set {x_1, x_2, ...} with mean μ, the maximum entropy distribution has the shape:

\Pr(X = x_k) = C r^{x_k} \qquad \text{for } k = 1, 2, \ldots,

where again the constants C and r are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ. For example, in the case that x_k = k, this gives

C = \frac{1}{\mu - 1}, \qquad r = \frac{\mu - 1}{\mu},

such that the respective maximum entropy distribution is the geometric distribution.

Circular random variables

For a continuous random variable θ_i distributed about the unit circle, the von Mises distribution maximizes the entropy when the real and imaginary parts of the first circular moment are specified,[9] or, equivalently, when the circular mean and circular variance are specified.

When the mean and variance of the angles θ_i modulo 2π are specified, the wrapped normal distribution maximizes the entropy.[9]

Maximizer for specified mean, variance and skew

There exists an upper bound on the entropy of continuous random variables on ℝ with a specified mean, variance, and skew. However, there is no distribution which achieves this upper bound, because p(x) = c exp(λ_1 x + λ_2 x² + λ_3 x³) is unbounded when λ_3 ≠ 0 (see Cover & Thomas (2006: chapter 12)).

However, the maximum entropy is ε-achievable: a distribution's entropy can be arbitrarily close to the upper bound. Start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many σ larger than the mean. The skewness, being proportional to the third moment, will be affected more than the lower order moments.
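
A numerical illustration of this perturbation argument (an addition; the perturbation size eps, location x0, and bump width w are arbitrary choices) on a discretized standard normal:

```python
import numpy as np
from scipy.stats import norm

# Perturb a standard normal with a tiny, narrow bump far out in the right tail.
eps, x0, w = 1e-4, 20.0, 0.1

x = np.linspace(-15, 30, 400_001)
dx = x[1] - x[0]
p = (1 - eps) * norm.pdf(x) + eps * norm.pdf(x, loc=x0, scale=w)

m1 = np.sum(x * p) * dx                              # mean: barely moves
var = np.sum((x - m1) ** 2 * p) * dx                 # variance: ~1.04
skew = np.sum((x - m1) ** 3 * p) * dx / var ** 1.5   # skewness: jumps to ~0.75
H = -np.sum(p * np.log(np.maximum(p, 1e-300))) * dx  # entropy: ~1.42, near 0.5*ln(2*pi*e)

print(m1, var, skew, H)
```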

This is a special case of the general fact that the exponential of any odd-order polynomial in x is unbounded on ℝ. For example, c e^{λx} will likewise be unbounded on ℝ, but when the support is limited to a bounded or semi-bounded interval the upper entropy bound may be achieved (e.g. if x lies in the interval [0, ∞) and λ < 0, the exponential distribution will result).

Maximizer for specified mean and deviation risk measure

Every distribution with log-concave density is a maximal entropy distribution with specified mean μ and deviation risk measure D.[10]

In particular, the maximal entropy distribution with specified mean E(X) ≡ μ and deviation measure D(X) ≡ d is:

Other examples

In the table below, each listed distribution maximizes the entropy for a particular set of functional constraints listed in the third column, and the constraint that x be included in the support of the probability density, which is listed in the fourth column.[6][7]

Several listed examples (Bernoulli, geometric, exponential, Laplace, Pareto) are trivially true, because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity.

For reference, Γ(x) = ∫_0^∞ e^(−t) t^(x−1) dt is the gamma function, ψ(x) = d/dx ln Γ(x) = Γ′(x)/Γ(x) is the digamma function, B(p, q) = Γ(p)Γ(q)/Γ(p+q) is the beta function, and γ_E is the Euler–Mascheroni constant.

Table of probability distributions and corresponding maximum entropy constraints
Distribution name | Probability density / mass function | Maximum entropy constraint | Support
Uniform (discrete) | f(k) = 1/(b − a + 1) | None | {a, a+1, ..., b−1, b}
Uniform (continuous) | f(x) = 1/(b − a) | None | [a, b]
Bernoulli | f(k) = p^k (1 − p)^(1−k) | E[K] = p | {0, 1}
Geometric | f(k) = (1 − p)^(k−1) p | E[K] = 1/p | ℕ ∖ {0} = {1, 2, 3, ...}
Exponential | f(x) = λ exp(−λx) | E[X] = 1/λ | [0, ∞)
Laplace | f(x) = (1/(2b)) exp(−|x − μ|/b) | E[|X − μ|] = b | ℝ
Asymmetric Laplace | f(x) = λ exp[−(x − m) λ s κ^s] / (κ + 1/κ), where s ≡ sgn(x − m) | E[(X − m) s κ^s] = 1/λ | ℝ
Pareto | f(x) = α x_m^α / x^(α+1) | E[ln X] = 1/α + ln(x_m) | [x_m, ∞)
Normal | f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)) | E[X] = μ, E[X²] = σ² + μ² | ℝ
Truncated normal | (see article) | E[X] = μ_T, E[X²] = σ_T² + μ_T² | [a, b]
von Mises | f(θ) = (1/(2π I₀(κ))) exp(κ cos(θ − μ)) | E[cos Θ] = (I₁(κ)/I₀(κ)) cos μ, E[sin Θ] = (I₁(κ)/I₀(κ)) sin μ | [0, 2π)
Rayleigh | f(x) = (x/σ²) exp(−x²/(2σ²)) | E[X²] = 2σ², E[ln X] = (ln(2σ²) − γ_E)/2 | [0, ∞)
Beta | f(x) = x^(α−1) (1 − x)^(β−1) / B(α, β) for 0 ≤ x ≤ 1 | E[ln X] = ψ(α) − ψ(α+β), E[ln(1 − X)] = ψ(β) − ψ(α+β) | [0, 1]
Cauchy | f(x) = 1/(π(1 + x²)) | E[ln(1 + X²)] = 2 ln 2 | ℝ
Chi | f(x) = (2/(2^(k/2) Γ(k/2))) x^(k−1) exp(−x²/2) | E[X²] = k, E[ln X] = (1/2)[ψ(k/2) + ln 2] | [0, ∞)
Chi-squared | f(x) = (1/(2^(k/2) Γ(k/2))) x^(k/2 − 1) exp(−x/2) | E[X] = k, E[ln X] = ψ(k/2) + ln 2 | [0, ∞)
Erlang | f(x) = (λ^k/(k − 1)!) x^(k−1) exp(−λx) | E[X] = k/λ, E[ln X] = ψ(k) − ln λ | [0, ∞)
Gamma | f(x) = x^(k−1) e^(−x/θ) / (θ^k Γ(k)) | E[X] = kθ, E[ln X] = ψ(k) + ln θ | [0, ∞)
Lognormal | f(x) = (1/(σx√(2π))) exp(−(ln x − μ)²/(2σ²)) | E[ln X] = μ, E[(ln X)²] = σ² + μ² | (0, ∞)
Maxwell–Boltzmann | f(x) = (1/a³) √(2/π) x² exp(−x²/(2a²)) | E[X²] = 3a², E[ln X] = 1 + ln(a/√2) − γ_E/2 | [0, ∞)
Weibull | f(x) = (k/λ^k) x^(k−1) exp(−x^k/λ^k) | E[X^k] = λ^k, E[ln X] = ln λ − γ_E/k | [0, ∞)
Multivariate normal | f_X(x) = exp[−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)] / √((2π)^N |Σ|) | E[x] = μ, E[(x − μ)(x − μ)ᵀ] = Σ | ℝⁿ
Binomial | f(k) = (n choose k) p^k (1 − p)^(n−k) | E[X] = μ, f in the class of n-generalized binomial distributions[11] | {0, ..., n}
Poisson | f(k) = λ^k e^(−λ)/k! | E[X] = λ, f in the class of ∞-generalized binomial distributions[11] | ℕ = {0, 1, ...}
Logistic | f(x) = e^(−x)/(1 + e^(−x))² = e^x/(e^x + 1)² | E[X] = 0, E[ln(1 + e^(−X))] = 1 | (−∞, ∞)
Random Group Formation distribution[12] | f(k) = A e^(−bk)/k^γ | Σ_k f(k) ln(k N(k)), where the N(k) are group sizes | [0, ∞)

The maximum entropy principle can be used to upper bound the entropy of statistical mixtures.[13]

Notes

  1. ^ For example, the class of all continuous distributions X on ℝ with E(X) = 0 and E(X²) = E(X³) = 1 (see Cover, Ch. 12).

Citations

  1. ^ Williams, D. (2001). Weighing the Odds. Cambridge University Press. pp. 197–199. ISBN 0-521-00618-X.
  2. ^ Bernardo, J. M.; Smith, A. F. M. (2000). Bayesian Theory. Wiley. pp. 209, 366. ISBN 0-471-49464-X.
  3. ^ O'Hagan, A. (1994). Bayesian Inference. Kendall's Advanced Theory of Statistics, Vol. 2B. Edward Arnold. Section 5.40. ISBN 0-340-52922-9.
  4. ^ Botev, Z. I.; Kroese, D. P. (2011). "The generalized cross entropy method, with applications to probability density estimation" (PDF). Methodology and Computing in Applied Probability. 13 (1): 1–27. doi:10.1007/s11009-009-9133-7. S2CID 18155189.
  5. ^ Botev, Z. I.; Kroese, D. P. (2008). "Non-asymptotic bandwidth selection for density estimation of discrete data". Methodology and Computing in Applied Probability. 10 (3): 435. doi:10.1007/s11009-007-9057-z. S2CID 122047337.
  6. ^ Lisman, J. H. C.; van Zuylen, M. C. A. (1972). "Note on the generation of most probable frequency distributions". Statistica Neerlandica. 26 (1): 19–23. doi:10.1111/j.1467-9574.1972.tb00152.x.
  7. ^ Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (PDF). Journal of Econometrics. 150 (2): 219–230. CiteSeerX 10.1.1.511.9750. doi:10.1016/j.jeconom.2008.12.014. Archived from the original (PDF) on 2016-03-07. Retrieved 2011-06-02.
  8. ^ Dowson, D.; Wragg, A. (September 1973). "Maximum-entropy distributions having prescribed first and second moments". IEEE Transactions on Information Theory (correspondence). 19 (5): 689–693. doi:10.1109/tit.1973.1055060. ISSN 0018-9448.
  9. ^ Jammalamadaka, S. Rao; SenGupta, A. (2001). Topics in Circular Statistics. New Jersey: World Scientific. ISBN 978-981-02-3778-3. Retrieved 2011-05-15.
  10. ^ Grechuk, Bogdan; Molyboha, Anton; Zabarankin, Michael (2009). "Maximum entropy principle with general deviation measures". Mathematics of Operations Research. 34 (2): 445–467. doi:10.1287/moor.1090.0377.
  11. ^ Harremoës, Peter (2001). "Binomial and Poisson distributions as maximum entropy distributions". IEEE Transactions on Information Theory. 47 (5): 2039–2041. doi:10.1109/18.930936. S2CID 16171405.
  12. ^ Baek, Seung Ki; Bernhardsson, Sebastian; Minnhagen, Petter (7 April 2011). "Zipf's law unzipped" (PDF). New Journal of Physics. 13.
  13. ^ Nielsen, Frank; Nock, Richard (2017). "MaxEnt upper bounds for the differential entropy of univariate continuous distributions". IEEE Signal Processing Letters. 24 (4). IEEE: 402–406. Bibcode:2017ISPL...24..402N. doi:10.1109/LSP.2017.2666792. S2CID 14092514.

References

Cover, T. M.; Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.