Kullback–Leibler divergence

Mathematical statistics distance measure

In mathematical statistics, the Kullback–Leibler (KL) divergence (also called relative entropy and I-divergence[1]), denoted $D_{\text{KL}}(P\parallel Q)$, is a type of statistical distance: a measure of how much an approximating probability distribution Q differs from a true probability distribution P.[2][3] Mathematically, it is defined as

$$D_{\text{KL}}(P\parallel Q)=\sum_{x\in\mathcal{X}}P(x)\,\log\frac{P(x)}{Q(x)}.$$

A simple interpretation of the KL divergence of P from Q is the expected excess surprisal from using the approximation Q instead of P when the actual distribution is P. While it is a measure of how different two distributions are and is thus a distance in some sense, it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions (in contrast to variation of information), and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence,[4] a generalization of squared distance, and for certain classes of distributions (notably an exponential family), it satisfies a generalized Pythagorean theorem (which applies to squared distances).[5]

Relative entropy is always a non-negative real number, with value 0 if and only if the two distributions in question are identical. It has diverse applications, both theoretical, such as characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference; and practical, such as applied statistics, fluid mechanics, neuroscience, bioinformatics, and machine learning.

Introduction and context

Consider two probability distributions, a true P and an approximating Q. Often, P represents the data, the observations, or a measured probability distribution, and Q represents instead a theory, a model, a description, or another approximation of P. However, sometimes the true distribution P represents a model and the approximating distribution Q represents (simulated) data that are intended to match the true distribution. The Kullback–Leibler divergence $D_{\text{KL}}(P\parallel Q)$ is then interpreted as the average difference of the number of bits required for encoding samples of P using a code optimized for Q rather than one optimized for P.

Note that the roles of P and Q can be reversed in some situations where that is easier to compute and the goal is to minimize $D_{\text{KL}}(P\parallel Q)$, such as with the expectation–maximization algorithm (EM) and evidence lower bound (ELBO) computations. This role-reversal approach exploits that $D_{\text{KL}}(P\parallel Q)=0$ if and only if $D_{\text{KL}}(Q\parallel P)=0$ and that, in many cases, reducing one has the effect of reducing the other.

Etymology

The relative entropy was introduced by Solomon Kullback and Richard Leibler in Kullback & Leibler (1951) as "the mean information for discrimination between $H_{1}$ and $H_{2}$ per observation from $\mu_{1}$",[6] where one is comparing two probability measures $\mu_{1},\mu_{2}$, and $H_{1},H_{2}$ are the hypotheses that one is selecting from measure $\mu_{1},\mu_{2}$ (respectively). They denoted this by $I(1:2)$, and defined the "'divergence' between $\mu_{1}$ and $\mu_{2}$" as the symmetrized quantity $J(1,2)=I(1:2)+I(2:1)$, which had already been defined and used by Harold Jeffreys in 1948.[7] In Kullback (1959), the symmetrized form is again referred to as the "divergence", and the relative entropies in each direction are referred to as "directed divergences" between two distributions;[8] Kullback preferred the term discrimination information.[9] The term "divergence" is in contrast to a distance (metric), since the symmetrized divergence does not satisfy the triangle inequality.[10] Numerous references to earlier uses of the symmetrized divergence and to other statistical distances are given in Kullback (1959, pp. 6–7, §1.3 Divergence). The asymmetric "directed divergence" has come to be known as the Kullback–Leibler divergence, while the symmetrized "divergence" is now referred to as the Jeffreys divergence.

Definition

For discrete probability distributions P and Q defined on the same sample space, $\mathcal{X}$, the relative entropy from Q to P is defined[11] to be

$$D_{\text{KL}}(P\parallel Q)=\sum_{x\in\mathcal{X}}P(x)\,\log\frac{P(x)}{Q(x)},$$

which is equivalent to

$$D_{\text{KL}}(P\parallel Q)=\left(-\sum_{x\in\mathcal{X}}P(x)\,\log Q(x)\right)-\left(-\sum_{x\in\mathcal{X}}P(x)\,\log P(x)\right).$$

In other words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P.

Relative entropy is only defined in this way if, for all x, $Q(x)=0$ implies $P(x)=0$ (absolute continuity). Otherwise, it is often defined as $+\infty$,[1] but the value $+\infty$ is possible even if $Q(x)\neq 0$ everywhere,[12][13] provided that $\mathcal{X}$ is infinite in extent. Analogous comments apply to the continuous and general measure cases defined below.

Whenever $P(x)$ is zero, the contribution of the corresponding term is interpreted as zero because

$$\lim_{x\to 0^{+}}x\,\log(x)=0.$$
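The discrete definition and its two conventions (zero contribution when $P(x)=0$, and $+\infty$ when absolute continuity fails) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `kl_divergence` and its `base` parameter are assumptions made for this example.

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for two discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same sample space")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # 0 * log(0 / q) is interpreted as 0
        if qx == 0:
            return math.inf  # P is not absolutely continuous with respect to Q
        total += px * math.log(px / qx)
    return total / math.log(base)  # convert from nats to the requested base

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))      # identical distributions
print(kl_divergence([1.0, 0.0], [0.5, 0.5], 2))   # term with P(x) = 0 is skipped
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))      # Q(x) = 0 where P(x) > 0
```

Returning `math.inf` in the last case follows the convention described above; raising an error instead would be an equally reasonable design choice.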

For distributions P and Q of a continuous random variable, relative entropy is defined to be the integral[14]

$$D_{\text{KL}}(P\parallel Q)=\int_{-\infty}^{\infty}p(x)\,\log\frac{p(x)}{q(x)}\,dx,$$

where p and q denote the probability densities of P and Q.

More generally, if P and Q are probability measures on a measurable space $\mathcal{X}$, and P is absolutely continuous with respect to Q, then the relative entropy from Q to P is defined as

$$D_{\text{KL}}(P\parallel Q)=\int_{x\in\mathcal{X}}\log\frac{P(dx)}{Q(dx)}\,P(dx),$$

where $\frac{P(dx)}{Q(dx)}$ is the Radon–Nikodym derivative of P with respect to Q, i.e. the unique Q-almost-everywhere defined function r on $\mathcal{X}$ such that $P(dx)=r(x)\,Q(dx)$, which exists because P is absolutely continuous with respect to Q. We also assume that the expression on the right-hand side exists. Equivalently (by the chain rule), this can be written as

$$D_{\text{KL}}(P\parallel Q)=\int_{x\in\mathcal{X}}\frac{P(dx)}{Q(dx)}\,\log\frac{P(dx)}{Q(dx)}\,Q(dx),$$

which is the entropy of P relative to Q. Continuing in this case, if $\mu$ is any measure on $\mathcal{X}$ for which densities p and q with $P(dx)=p(x)\,\mu(dx)$ and $Q(dx)=q(x)\,\mu(dx)$ exist (meaning that P and Q are both absolutely continuous with respect to $\mu$), then the relative entropy from Q to P is given as

$$D_{\text{KL}}(P\parallel Q)=\int_{x\in\mathcal{X}}p(x)\,\log\frac{p(x)}{q(x)}\,\mu(dx).$$

Note that such a measure $\mu$ for which densities can be defined always exists, since one can take $\mu=\frac{1}{2}\left(P+Q\right)$, although in practice it will usually be one that applies in the context, such as counting measure for discrete distributions, or Lebesgue measure or a convenient variant thereof such as Gaussian measure or the uniform measure on the sphere, Haar measure on a Lie group, etc. for continuous distributions.

The logarithms in these formulae are usually taken to base 2 if information is measured in units of bits, or to base e if information is measured in nats. Most formulas involving relative entropy hold regardless of the base of the logarithm.

Various conventions exist for referring to $D_{\text{KL}}(P\parallel Q)$ in words. Often it is referred to as the divergence between P and Q, but this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be described as the divergence of P from Q or as the divergence from Q to P. This reflects the asymmetry in Bayesian inference, which starts from a prior Q and updates to the posterior P. Another common way to refer to $D_{\text{KL}}(P\parallel Q)$ is as the relative entropy of P with respect to Q or the information gain from P over Q.

Basic example

Kullback[3] gives the following example (Table 2.1, Example 2.1). Let P and Q be the distributions shown in the table and figure. P is the distribution on the left side of the figure, a binomial distribution with $N=2$ and $p=0.4$. Q is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes x = 0, 1, 2 (i.e. $\mathcal{X}=\{0,1,2\}$), each with probability $p=1/3$.

Two distributions to illustrate relative entropy

| x | 0 | 1 | 2 |
|---|---|---|---|
| Distribution $P(x)$ | 9/25 | 12/25 | 4/25 |
| Distribution $Q(x)$ | 1/3 | 1/3 | 1/3 |

Relative entropies $D_{\text{KL}}(P\parallel Q)$ and $D_{\text{KL}}(Q\parallel P)$ are calculated as follows. This example uses the natural log with base e, designated ln, to get results in nats (see units of information):

$$\begin{aligned}D_{\text{KL}}(P\parallel Q)&=\sum_{x\in\mathcal{X}}P(x)\,\ln\frac{P(x)}{Q(x)}\\&=\frac{9}{25}\ln\frac{9/25}{1/3}+\frac{12}{25}\ln\frac{12/25}{1/3}+\frac{4}{25}\ln\frac{4/25}{1/3}\\&=\frac{1}{25}\left(32\ln 2+55\ln 3-50\ln 5\right)\\&\approx 0.0852996,\end{aligned}$$

$$\begin{aligned}D_{\text{KL}}(Q\parallel P)&=\sum_{x\in\mathcal{X}}Q(x)\,\ln\frac{Q(x)}{P(x)}\\&=\frac{1}{3}\,\ln\frac{1/3}{9/25}+\frac{1}{3}\,\ln\frac{1/3}{12/25}+\frac{1}{3}\,\ln\frac{1/3}{4/25}\\&=\frac{1}{3}\left(-4\ln 2-6\ln 3+6\ln 5\right)\\&\approx 0.097455.\end{aligned}$$
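The two values above can be reproduced numerically. The sketch below evaluates both directions of the divergence for the tabulated distributions and compares them with the closed forms from the worked example; the helper name `kl_nats` is an assumption made for this illustration.

```python
import math
from fractions import Fraction

# Distributions from the table: P is Binomial(N=2, p=0.4), Q is uniform on {0, 1, 2}.
P = [Fraction(9, 25), Fraction(12, 25), Fraction(4, 25)]
Q = [Fraction(1, 3)] * 3

def kl_nats(p, q):
    """D_KL(p || q) in nats; assumes every probability is strictly positive."""
    return sum(float(px) * math.log(px / qx) for px, qx in zip(p, q))

d_pq = kl_nats(P, Q)
d_qp = kl_nats(Q, P)

# Closed forms quoted in the worked example.
closed_pq = (32 * math.log(2) + 55 * math.log(3) - 50 * math.log(5)) / 25
closed_qp = (-4 * math.log(2) - 6 * math.log(3) + 6 * math.log(5)) / 3

print(f"D_KL(P || Q) = {d_pq:.7f}  (closed form {closed_pq:.7f})")
print(f"D_KL(Q || P) = {d_qp:.6f}  (closed form {closed_qp:.6f})")
```

The two directions give different numbers, which illustrates the asymmetry of relative entropy on a concrete pair of distributions.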

Interpretations

Statistics

In the field of statistics, the Neyman–Pearson lemma states that the most powerful way to distinguish between the two distributions P and Q based on an observation Y (drawn from one of them) is through the log of the ratio of their likelihoods: $\log P(Y)-\log Q(Y)$. The KL divergence is the expected value of this statistic if Y is actually drawn from P. Kullback motivated the statistic as an expected log likelihood ratio.[15]

Coding

In the context of coding theory, $D_{\text{KL}}(P\parallel Q)$ can be constructed by measuring the expected number of extra bits required to code samples from P using a code optimized for Q rather than the code optimized for P.

Inference

In the context of machine learning, $D_{\text{KL}}(P\parallel Q)$ is often called the information gain achieved if P were used instead of Q, which is currently used. By analogy with information theory, it is called the relative entropy of P with respect to Q.

Expressed in the language of Bayesian inference, $D_{\text{KL}}(P\parallel Q)$ is a measure of the information gained by revising one's beliefs from the prior probability distribution Q to the posterior probability distribution P. In other words, it is the amount of information lost when Q is used to approximate P.[16]

Information geometry

In applications, P typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q typically represents a theory, model, description, or approximation of P. In order to find a distribution Q that is closest to P, we can minimize the KL divergence and compute an information projection.

While it is a statistical distance, it is not a metric, the most familiar type of distance, but instead it is a divergence.[4] While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general $D_{\text{KL}}(P\parallel Q)$ does not equal $D_{\text{KL}}(Q\parallel P)$, and the asymmetry is an important part of the geometry.[4] The infinitesimal form of relative entropy, specifically its Hessian, gives a metric tensor that equals the Fisher information metric; see § Fisher information metric. The Fisher information metric on a given family of probability distributions determines the natural gradient for information-geometric optimization algorithms.[17] Its quantum version is the Fubini–Study metric.[18] Relative entropy satisfies a generalized Pythagorean theorem for exponential families (geometrically interpreted as dually flat manifolds), and this allows one to minimize relative entropy by geometric means, for example by information projection and in maximum likelihood estimation.[5]

The relative entropy is the Bregman divergence generated by the negative entropy, but it is also of the form of an f-divergence. For probabilities over a finite alphabet, it is unique in being a member of both of these classes of statistical divergences. An application of Bregman divergences can be found in mirror descent.[19]

Finance (game theory)

Consider a growth-optimizing investor in a fair game with mutually exclusive outcomes (e.g. a "horse race" in which the official odds add up to one). The rate of return expected by such an investor is equal to the relative entropy between the investor's believed probabilities and the official odds.[20] This is a special case of a much more general connection between financial returns and divergence measures.[21]
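The horse-race statement can be checked on a small example. The sketch below assumes that the growth-optimizing investor bets proportionally, staking a fraction of wealth on each outcome equal to the believed probability of that outcome, and that the expectation is taken under the investor's own beliefs. The numbers and function names are hypothetical and chosen only for illustration.

```python
import math

def expected_log_growth(beliefs, odds_probs, bet_fractions):
    """Expected log-wealth growth per race, under the investor's own beliefs.

    odds_probs are the probabilities implied by the official odds (summing to
    one in a fair game), so a unit stake on outcome i pays 1 / odds_probs[i].
    """
    return sum(
        p * math.log(b / r)
        for p, r, b in zip(beliefs, odds_probs, bet_fractions)
        if p > 0
    )

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

beliefs = [0.5, 0.3, 0.2]      # hypothetical investor beliefs
odds_probs = [0.4, 0.4, 0.2]   # hypothetical official odds, as implied probabilities

# Proportional betting: stake a fraction of wealth equal to each belief.
growth_proportional = expected_log_growth(beliefs, odds_probs, beliefs)
# Betting in line with the official odds earns nothing in expectation.
growth_market = expected_log_growth(beliefs, odds_probs, odds_probs)

print(growth_proportional, kl_divergence(beliefs, odds_probs), growth_market)
```

Under these assumptions the expected growth rate of the proportional bettor coincides with the relative entropy of the beliefs with respect to the odds-implied probabilities.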

Financial risks are connected to $D_{\text{KL}}$ via information geometry.[22] Investors' views, the prevailing market view, and risky scenarios form triangles on the relevant manifold of probability distributions. The shape of the triangles determines key financial risks (both qualitatively and quantitatively). For instance, obtuse triangles in which investors' views and risk scenarios appear on "opposite sides" relative to the market describe negative risks, acute triangles describe positive exposure, and the right-angled situation in the middle corresponds to zero risk. Extending this concept, relative entropy can be hypothetically utilised to identify the behaviour of informed investors, if one takes this to be represented by the magnitude and deviations away from the prior expectations of fund flows, for example.[23]

Motivation

Illustration of the relative entropy for two normal distributions. The typical asymmetry is clearly visible.

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value $x_{i}$ out of a set of possibilities X can be seen as representing an implicit probability distribution $q(x_{i})=2^{-\ell_{i}}$ over X, where $\ell_{i}$ is the length of the code for $x_{i}$ in bits. Therefore, relative entropy can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P: it is the excess entropy.

$$\begin{aligned}D_{\text{KL}}(P\parallel Q)&=\sum_{x\in\mathcal{X}}p(x)\log\frac{1}{q(x)}-\sum_{x\in\mathcal{X}}p(x)\log\frac{1}{p(x)}\\[5pt]&=\mathrm{H}(P,Q)-\mathrm{H}(P)\end{aligned}$$

where $\mathrm{H}(P,Q)$ is the cross entropy of Q relative to P and $\mathrm{H}(P)$ is the entropy of P (which is the same as the cross-entropy of P with itself).
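The coding interpretation can be made concrete with a small sketch. The source distribution and the two sets of code lengths below are illustrative assumptions: one code has lengths matched to P, the other is a fixed-length code whose implicit distribution $q(x_i)=2^{-\ell_i}$ is uniform.

```python
import math

P = [0.5, 0.25, 0.125, 0.125]   # true source distribution (illustrative)
len_p = [1, 2, 3, 3]            # code lengths matched to P: l_i = -log2 P(x_i)
len_q = [2, 2, 2, 2]            # fixed-length code, matched to a uniform Q

# Implicit distribution q(x_i) = 2 ** (-l_i) induced by the fixed-length code.
Q = [2.0 ** -l for l in len_q]

entropy = sum(p * l for p, l in zip(P, len_p))         # H(P) in bits
cross_entropy = sum(p * l for p, l in zip(P, len_q))   # H(P, Q) in bits
kl_bits = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(entropy, cross_entropy, kl_bits)   # 1.75 2.0 0.25
```

The 0.25 extra bits per symbol paid by the mismatched code equal $\mathrm{H}(P,Q)-\mathrm{H}(P)$, in line with the decomposition above.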

The relative entropy $D_{\text{KL}}(P\parallel Q)$ can be thought of geometrically as a statistical distance, a measure of how far the distribution Q is from the distribution P. Geometrically it is a divergence: an asymmetric, generalized form of squared distance. The cross-entropy $H(P,Q)$ is itself such a measurement (formally a loss function), but it cannot be thought of as a distance, since $H(P,P)=:H(P)$ is not zero. This can be fixed by subtracting $H(P)$ to make $D_{\text{KL}}(P\parallel Q)$ agree more closely with our notion of distance, as the excess loss. The resulting function is asymmetric, and while this can be symmetrized (see § Symmetrised divergence), the asymmetric form is more useful. See § Interpretations for more on the geometric interpretation.

Relative entropy is related to the "rate function" in the theory of large deviations.[24][25]

Arthur Hobson proved that relative entropy is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy.[26] Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.

Properties

Relative entropy is always non-negative, $D_{\text{KL}}(P\parallel Q)\geq 0$, with equality if and only if P and Q are identical. In particular, if $P(dx)=p(x)\,\mu(dx)$ and $Q(dx)=q(x)\,\mu(dx)$, then equality requires $p(x)=q(x)$ $\mu$-almost everywhere. The entropy $\mathrm{H}(P)$ thus sets a minimum value for the cross-entropy $\mathrm{H}(P,Q)$, the expected number of bits required when using a code based on Q rather than P; and the Kullback–Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value x drawn from X, if a code is used corresponding to the probability distribution Q, rather than the "true" distribution P.

Relative entropy also admits a Taylor series expansion, which converges if and only if $P\leq 2Q$ almost surely. The proof is as follows.

Denote $f(\alpha):=D_{\text{KL}}((1-\alpha)Q+\alpha P\parallel Q)$ and note that $D_{\text{KL}}(P\parallel Q)=f(1)$. The first derivative of $f$ may be derived and evaluated as follows, where the second line uses $\sum_{x\in\mathcal{X}}(P(x)-Q(x))=0$:

$$\begin{aligned}f'(\alpha)&=\sum_{x\in\mathcal{X}}(P(x)-Q(x))\left(\log\left(\frac{(1-\alpha)Q(x)+\alpha P(x)}{Q(x)}\right)+1\right)\\&=\sum_{x\in\mathcal{X}}(P(x)-Q(x))\log\left(\frac{(1-\alpha)Q(x)+\alpha P(x)}{Q(x)}\right)\\f'(0)&=0\end{aligned}$$

Further derivatives (for $n\geq 2$) may be derived and evaluated as follows:

$$\begin{aligned}f''(\alpha)&=\sum_{x\in\mathcal{X}}\frac{(P(x)-Q(x))^{2}}{(1-\alpha)Q(x)+\alpha P(x)}\\f''(0)&=\sum_{x\in\mathcal{X}}\frac{(P(x)-Q(x))^{2}}{Q(x)}\\f^{(n)}(\alpha)&=(-1)^{n}(n-2)!\sum_{x\in\mathcal{X}}\frac{(P(x)-Q(x))^{n}}{\left((1-\alpha)Q(x)+\alpha P(x)\right)^{n-1}}\\f^{(n)}(0)&=(-1)^{n}(n-2)!\sum_{x\in\mathcal{X}}\frac{(P(x)-Q(x))^{n}}{Q(x)^{n-1}}\end{aligned}$$

Hence solving for $D_{\text{KL}}(P\parallel Q)$ via the Taylor expansion of $f$ about $0$ evaluated at $\alpha=1$ yields

$$\begin{aligned}D_{\text{KL}}(P\parallel Q)&=\sum_{n=0}^{\infty}\frac{f^{(n)}(0)}{n!}\\&=\sum_{n=2}^{\infty}\frac{1}{n(n-1)}\sum_{x\in\mathcal{X}}\frac{(Q(x)-P(x))^{n}}{Q(x)^{n-1}}\end{aligned}$$

$P\leq 2Q$ a.s. is a sufficient condition for convergence of the series by the following absolute convergence argument:

$$\begin{aligned}\sum_{n=2}^{\infty}\left\vert\frac{1}{n(n-1)}\sum_{x\in\mathcal{X}}\frac{(Q(x)-P(x))^{n}}{Q(x)^{n-1}}\right\vert&\leq\sum_{n=2}^{\infty}\frac{1}{n(n-1)}\sum_{x\in\mathcal{X}}\left\vert Q(x)-P(x)\right\vert\left\vert 1-\frac{P(x)}{Q(x)}\right\vert^{n-1}\\&\leq\sum_{n=2}^{\infty}\frac{1}{n(n-1)}\sum_{x\in\mathcal{X}}\left\vert Q(x)-P(x)\right\vert\\&\leq\sum_{n=2}^{\infty}\frac{1}{n(n-1)}\\&=1\end{aligned}$$

$P\leq 2Q$ a.s. is also a necessary condition for convergence of the series by the following proof by contradiction. Assume that $P>2Q$ with measure strictly greater than $0$. It then follows that there must exist some values $\varepsilon>0$, $\rho>0$, and $U<\infty$ such that $P\geq 2Q+\varepsilon$ and $Q\leq U$ with measure $\rho$. The previous proof of sufficiency demonstrated that the measure $1-\rho$ component of the series where $P\leq 2Q$ is bounded, so we need only concern ourselves with the behavior of the measure $\rho$ component of the series where $P\geq 2Q+\varepsilon$. The absolute value of the $n$th term of this component of the series is then lower bounded by $\frac{1}{n(n-1)}\rho\left(1+\frac{\varepsilon}{U}\right)^{n}$, which is unbounded as $n\to\infty$, so the series diverges.
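The behaviour of the series can be observed numerically. The sketch below compares partial sums of the expansion with the directly computed divergence for a pair satisfying $P\leq 2Q$, and then prints a few terms for a pair that violates the condition. The function names and the second pair of distributions are assumptions made for this illustration.

```python
import math

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def series_term(p, q, n):
    """n-th term of the expansion: sum_x (Q - P)^n / Q^(n-1), divided by n(n-1)."""
    inner = sum((qx - px) ** n / qx ** (n - 1) for px, qx in zip(p, q))
    return inner / (n * (n - 1))

def taylor_kl(p, q, n_terms):
    """Partial sum of the first n_terms terms, starting at n = 2."""
    return sum(series_term(p, q, n) for n in range(2, n_terms + 2))

# P <= 2Q holds at every outcome here, so the partial sums converge.
P = [9 / 25, 12 / 25, 4 / 25]
Q = [1 / 3, 1 / 3, 1 / 3]
for n_terms in (2, 5, 20, 80):
    print(n_terms, taylor_kl(P, Q, n_terms), kl_divergence(P, Q))

# Here P(x) > 2 Q(x) at the first outcome, so the terms grow in magnitude.
P_bad = [0.9, 0.1]
Q_bad = [0.3, 0.7]
print([abs(series_term(P_bad, Q_bad, n)) for n in (10, 20, 40)])
```

In the second case the divergence itself is still finite; only the series representation fails.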

Duality formula for variational inference

The following result, due to Donsker and Varadhan,[29] is known as Donsker and Varadhan's variational formula.

Theorem [Duality Formula for Variational Inference] Let $\Theta$ be a set endowed with an appropriate $\sigma$-field $\mathcal{F}$, and two probability measures P and Q, which formulate two probability spaces $(\Theta,\mathcal{F},P)$ and $(\Theta,\mathcal{F},Q)$, with $Q\ll P$. ($Q\ll P$ indicates that Q is absolutely continuous with respect to P.) Let h be a real-valued integrable random variable on $(\Theta,\mathcal{F},P)$. Then the following equality holds:

$$\log E_{P}[\exp h]=\sup_{Q\ll P}\{E_{Q}[h]-D_{\text{KL}}(Q\parallel P)\}.$$

Further, the supremum on the right-hand side is attained if and only if

$$\frac{Q(d\theta)}{P(d\theta)}=\frac{\exp h(\theta)}{E_{P}[\exp h]},$$

almost surely with respect to probability measure P, where $\frac{Q(d\theta)}{P(d\theta)}$ denotes the Radon–Nikodym derivative of Q with respect to P.

Proof

For a short proof assuming integrability of $\exp(h)$ with respect to P, let $Q^{*}$ have P-density $\frac{\exp h(\theta)}{E_{P}[\exp h]}$, i.e. $Q^{*}(d\theta)=\frac{\exp h(\theta)}{E_{P}[\exp h]}P(d\theta)$. Then

$$D_{\text{KL}}(Q\parallel Q^{*})-D_{\text{KL}}(Q\parallel P)=-E_{Q}[h]+\log E_{P}[\exp h].$$

Therefore,

$$E_{Q}[h]-D_{\text{KL}}(Q\parallel P)=\log E_{P}[\exp h]-D_{\text{KL}}(Q\parallel Q^{*})\leq\log E_{P}[\exp h],$$

where the last inequality follows from $D_{\text{KL}}(Q\parallel Q^{*})\geq 0$, for which equality occurs if and only if $Q=Q^{*}$. The conclusion follows.
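On a finite space the duality formula can be verified directly. The sketch below uses an arbitrary three-point reference measure P and function h (both hypothetical), builds the maximizer $Q^{*}$ from the proof, and evaluates the objective $E_{Q}[h]-D_{\text{KL}}(Q\parallel P)$ for a few other choices of Q.

```python
import math

def kl_divergence(q, p):
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

def objective(q, p, h):
    """E_Q[h] - D_KL(Q || P), the quantity maximized over Q."""
    return sum(qx * hx for qx, hx in zip(q, h)) - kl_divergence(q, p)

P = [0.2, 0.5, 0.3]     # reference measure on a three-point space (illustrative)
h = [1.0, -0.5, 2.0]    # arbitrary bounded function h

# Left-hand side of the formula: log E_P[exp h].
log_mgf = math.log(sum(px * math.exp(hx) for px, hx in zip(P, h)))

# Maximizer Q* with P-density exp(h) / E_P[exp h].
Z = math.exp(log_mgf)
Q_star = [px * math.exp(hx) / Z for px, hx in zip(P, h)]

print(log_mgf, objective(Q_star, P, h))   # the two values agree
candidates = ([0.2, 0.5, 0.3], [1 / 3, 1 / 3, 1 / 3], [0.1, 0.1, 0.8])
for Q in candidates:
    print(objective(Q, P, h))             # each is no larger than log_mgf
```

A finite check of this kind only illustrates the statement; it does not replace the measure-theoretic argument given in the proof.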

Examples

Multivariate normal distributions

Suppose that we have two multivariate normal distributions, with means $\mu_{0},\mu_{1}$ and with (non-singular) covariance matrices $\Sigma_{0},\Sigma_{1}$. If the two distributions have the same dimension, k, then the relative entropy between the distributions is as follows:[30]

$$D_{\text{KL}}\left(\mathcal{N}_{0}\parallel\mathcal{N}_{1}\right)=\frac{1}{2}\left[\operatorname{tr}\left(\Sigma_{1}^{-1}\Sigma_{0}\right)-k+\left(\mu_{1}-\mu_{0}\right)^{\mathsf{T}}\Sigma_{1}^{-1}\left(\mu_{1}-\mu_{0}\right)+\ln\frac{\det\Sigma_{1}}{\det\Sigma_{0}}\right].$$

The logarithm in the last term must be taken to base e since all terms apart from the last are base-e logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by $\ln(2)$ yields the divergence in bits.

In a numerical implementation, it is helpful to express the result in terms of the Cholesky decompositions $L_{0},L_{1}$ such that $\Sigma_{0}=L_{0}L_{0}^{\mathsf{T}}$ and $\Sigma_{1}=L_{1}L_{1}^{\mathsf{T}}$. Then with M and y solutions to the triangular linear systems $L_{1}M=L_{0}$ and $L_{1}y=\mu_{1}-\mu_{0}$,

$$D_{\text{KL}}\left(\mathcal{N}_{0}\parallel\mathcal{N}_{1}\right)=\frac{1}{2}\left(\sum_{i,j=1}^{k}\left(M_{ij}\right)^{2}-k+|y|^{2}+2\sum_{i=1}^{k}\ln\frac{(L_{1})_{ii}}{(L_{0})_{ii}}\right).$$

A special case, and a common quantity in variational inference, is the relative entropy between a diagonal multivariate normal and a standard normal distribution (with zero mean and unit variance):

$$D_{\text{KL}}\left(\mathcal{N}\left(\left(\mu_{1},\ldots,\mu_{k}\right)^{\mathsf{T}},\operatorname{diag}\left(\sigma_{1}^{2},\ldots,\sigma_{k}^{2}\right)\right)\parallel\mathcal{N}\left(\mathbf{0},\mathbf{I}\right)\right)=\frac{1}{2}\sum_{i=1}^{k}\left[\sigma_{i}^{2}+\mu_{i}^{2}-1-\ln\left(\sigma_{i}^{2}\right)\right].$$

For two univariate normal distributions $p=\mathcal{N}(\mu_{0},\sigma_{0}^{2})$ and $q=\mathcal{N}(\mu_{1},\sigma_{1}^{2})$ the above simplifies to[31]

$$D_{\text{KL}}\left(p\parallel q\right)=\log\frac{\sigma_{1}}{\sigma_{0}}+\frac{\sigma_{0}^{2}+\left(\mu_{0}-\mu_{1}\right)^{2}}{2\sigma_{1}^{2}}-\frac{1}{2}$$

In the case of co-centered normal distributions with $k=\sigma_{1}/\sigma_{0}$, this simplifies[32] to:

$$D_{\text{KL}}\left(p\parallel q\right)=\log_{2}k+(k^{-2}-1)/2/\ln(2)\ \mathrm{bits}$$
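The univariate and diagonal closed forms are simple enough to check against each other and against the defining integral. The sketch below is one way to do this with the standard library only; the function names, the integration range, and the parameter values are assumptions made for this illustration.

```python
import math

def kl_univariate_normal(mu0, sigma0, mu1, sigma1):
    """D_KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)) in nats, closed form."""
    return (math.log(sigma1 / sigma0)
            + (sigma0 ** 2 + (mu0 - mu1) ** 2) / (2 * sigma1 ** 2)
            - 0.5)

def kl_diag_vs_standard(mu, sigma):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)) in nats."""
    return 0.5 * sum(s ** 2 + m ** 2 - 1 - math.log(s ** 2)
                     for m, s in zip(mu, sigma))

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def kl_numeric(mu0, sigma0, mu1, sigma1, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule approximation of the defining integral, as a cross-check."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_p = log_normal_pdf(x, mu0, sigma0)
        log_q = log_normal_pdf(x, mu1, sigma1)
        # Working with log-densities avoids log(0) where p underflows.
        total += math.exp(log_p) * (log_p - log_q) * dx
    return total

closed = kl_univariate_normal(1.0, 0.5, 0.0, 1.0)
print(closed, kl_numeric(1.0, 0.5, 0.0, 1.0))
print(kl_diag_vs_standard([1.0, -2.0], [0.5, 1.5]))

# Co-centered case with k = sigma1 / sigma0, converted to bits.
k = 2.0
bits_general = kl_univariate_normal(0.0, 1.0, 0.0, k) / math.log(2)
bits_special = math.log2(k) + (k ** -2 - 1) / 2 / math.log(2)
print(bits_general, bits_special)
```

For full covariance matrices, the Cholesky-based expression above is the usual route, and a linear algebra library would normally be used for the triangular solves.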

Uniform distributions

Consider two uniform distributions, with the support of p, $[A,B]$, enclosed within the support of q, $[C,D]$ ($C\leq A<B\leq D$). Then the information gain is:

$$D_{\text{KL}}\left(p\parallel q\right)=\log\frac{D-C}{B-A}$$

Intuitively,[32] the information gain to a k times narrower uniform distribution contains $\log_{2}k$ bits. This connects with the use of bits in computing, where $\log_{2}k$ bits would be needed to identify one element of a k long stream.

Exponential family

The exponential family of distributions is given by

$$p_{X}(x|\theta)=h(x)\exp\left(\theta^{\mathsf{T}}T(x)-A(\theta)\right)$$

where $h(x)$ is the reference measure, $T(x)$ is the sufficient statistic, $\theta$ is the canonical natural parameter, and $A(\theta)$ is the log-partition function.

The KL divergence between two distributions $p(x|\theta_{1})$ and $p(x|\theta_{2})$ is given by[33]

$$D_{\text{KL}}(\theta_{1}\parallel\theta_{2})=\left(\theta_{1}-\theta_{2}\right)^{\mathsf{T}}\mu_{1}-A(\theta_{1})+A(\theta_{2})$$

where $\mu_{1}=E_{\theta_{1}}[T(X)]=\nabla A(\theta_{1})$ is the mean parameter of $p(x|\theta_{1})$.

For example, for the Poisson distribution with mean $\lambda$, the sufficient statistic is $T(x)=x$, the natural parameter is $\theta=\log\lambda$, and the log-partition function is $A(\theta)=e^{\theta}$. As such, the divergence between two Poisson distributions with means $\lambda_{1}$ and $\lambda_{2}$ is

$$D_{\text{KL}}(\lambda_{1}\parallel\lambda_{2})=\lambda_{1}\log\frac{\lambda_{1}}{\lambda_{2}}-\lambda_{1}+\lambda_{2}.$$

As another example, for a normal distribution with unit variance $N(\mu,1)$, the sufficient statistic is $T(x)=x$, the natural parameter is $\theta=\mu$, and the log-partition function is $A(\theta)=\mu^{2}/2$. Thus, the divergence between two normal distributions $N(\mu_{1},1)$ and $N(\mu_{2},1)$ is

$$D_{\text{KL}}(\mu_{1}\parallel\mu_{2})=\left(\mu_{1}-\mu_{2}\right)\mu_{1}-\frac{\mu_{1}^{2}}{2}+\frac{\mu_{2}^{2}}{2}=\frac{\left(\mu_{2}-\mu_{1}\right)^{2}}{2}.$$
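The exponential-family formula can be checked against both examples. The sketch below handles only a scalar natural parameter, and the function names and parameter values are assumptions made for this illustration. For the Poisson case it also compares against a truncated direct sum over the probability mass function.

```python
import math

def kl_exponential_family(theta1, theta2, log_partition, mean_param):
    """(theta1 - theta2) * mu1 - A(theta1) + A(theta2) for a scalar natural parameter."""
    mu1 = mean_param(theta1)  # mu1 = E_theta1[T(X)] = A'(theta1)
    return (theta1 - theta2) * mu1 - log_partition(theta1) + log_partition(theta2)

def kl_poisson_direct(lam1, lam2, k_max=200):
    """Truncated sum of p1(k) * log(p1(k) / p2(k)) over k = 0..k_max."""
    total = 0.0
    for k in range(k_max + 1):
        log_p1 = k * math.log(lam1) - lam1 - math.lgamma(k + 1)
        log_p2 = k * math.log(lam2) - lam2 - math.lgamma(k + 1)
        total += math.exp(log_p1) * (log_p1 - log_p2)
    return total

lam1, lam2 = 3.0, 5.0
# Poisson: theta = log(lambda), A(theta) = exp(theta), A'(theta) = exp(theta).
kl_family = kl_exponential_family(math.log(lam1), math.log(lam2), math.exp, math.exp)
kl_closed = lam1 * math.log(lam1 / lam2) - lam1 + lam2
print(kl_family, kl_closed, kl_poisson_direct(lam1, lam2))

# Unit-variance normal: theta = mu, A(theta) = theta^2 / 2, A'(theta) = theta.
mu1, mu2 = 0.5, 2.0
kl_normal = kl_exponential_family(mu1, mu2, lambda t: t * t / 2, lambda t: t)
print(kl_normal, (mu2 - mu1) ** 2 / 2)
```

Passing the log-partition function and its derivative as arguments keeps the formula generic, at the cost of requiring the caller to supply a correct derivative.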

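As a final example, formally substituting the natural parameter and log-partition function of a normal distribution with unit variance $N(\mu,1)$ and those of a Poisson distribution with mean $\lambda$ into the formula above gives

$$D_{\text{KL}}(\mu\parallel\lambda)=(\mu-\log\lambda)\mu-\frac{\mu^{2}}{2}+\lambda.$$

This expression is only formal: the formula above applies to two members of the same exponential family, whereas a normal distribution is not absolutely continuous with respect to a Poisson distribution, so under the definition given earlier the relative entropy between them is $+\infty$.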

Relation to metrics

While relative entropy is a statistical distance, it is not a metric on the space of probability distributions, but instead it is a divergence.[4] While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general $D_{\text{KL}}(P\parallel Q)$ does not equal $D_{\text{KL}}(Q\parallel P)$, and while this can be symmetrized (see § Symmetrised divergence), the asymmetry is an important part of the geometry.[4]

It generates a topology on the space of probability distributions. More concretely, if $\{P_{1},P_{2},\ldots\}$ is a sequence of distributions such that

$$\lim_{n\to\infty}D_{\text{KL}}(P_{n}\parallel Q)=0,$$

then it is said that

$$P_{n}\xrightarrow{D}Q.$$

Pinsker's inequality entails that

$$P_{n}\xrightarrow{D}P\Rightarrow P_{n}\xrightarrow{TV}P,$$

where the latter stands for the usual convergence in total variation.
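The implication can be illustrated with Pinsker's inequality in its usual form, which bounds the total variation distance by $\sqrt{D_{\text{KL}}/2}$ when the divergence is measured in nats. The sketch below uses a hypothetical sequence $P_{n}=(1-1/n)\,Q+(1/n)\,R$ that converges to Q; the distributions Q and R are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

Q = [0.5, 0.3, 0.2]   # limit distribution (illustrative)
R = [0.1, 0.1, 0.8]   # perturbation mixed in with weight 1/n

results = []
for n in (1, 10, 100, 1000):
    P_n = [(1 - 1 / n) * qx + (1 / n) * rx for qx, rx in zip(Q, R)]
    kl = kl_divergence(P_n, Q)
    tv = total_variation(P_n, Q)
    results.append((n, kl, tv))
    # Pinsker's inequality (in nats): TV <= sqrt(KL / 2).
    print(n, kl, tv, math.sqrt(kl / 2))
```

As the divergence shrinks along the sequence, the bound forces the total variation distance to shrink with it.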

Fisher information metric

Relative entropy is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions P and Q are both parameterized by some (possibly multi-dimensional) parameter $\theta$. Consider then two close-by values $P = P(\theta)$ and $Q = P(\theta_0)$, so that the parameter $\theta$ differs by only a small amount from the parameter value $\theta_0$. Specifically, up to first order one has (using the Einstein summation convention)

$$P(\theta) = P(\theta_0) + \Delta\theta_j \, P_j(\theta_0) + \cdots$$

with $\Delta\theta_j = (\theta - \theta_0)_j$ a small change of $\theta$ in the j direction, and $P_j(\theta_0) = \frac{\partial P}{\partial \theta_j}(\theta_0)$ the corresponding rate of change in the probability distribution. Since relative entropy has an absolute minimum 0 for $P = Q$, i.e. $\theta = \theta_0$, it changes only to second order in the small parameters $\Delta\theta_j$. More formally, as for any minimum, the first derivatives of the divergence vanish

$$\left.\frac{\partial}{\partial \theta_j}\right|_{\theta = \theta_0} D_{\text{KL}}(P(\theta) \parallel P(\theta_0)) = 0,$$

and by the Taylor expansion one has up to second order

$$D_{\text{KL}}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2} \, \Delta\theta_j \, \Delta\theta_k \, g_{jk}(\theta_0) + \cdots$$

where the Hessian matrix of the divergence

$$g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial \theta_j \, \partial \theta_k}\right|_{\theta = \theta_0} D_{\text{KL}}(P(\theta) \parallel P(\theta_0))$$

must be positive semidefinite. Letting $\theta_0$ vary (and dropping the subindex 0) the Hessian $g_{jk}(\theta)$ defines a (possibly degenerate) Riemannian metric on the $\theta$ parameter space, called the Fisher information metric.
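For a one-parameter family the Hessian reduces to a single second derivative, which can be estimated by finite differences. The sketch below uses the Bernoulli family as an assumed example (it is not taken from the article) because its Fisher information, $1/(\theta(1-\theta))$, is known in closed form; the function names and the step size `h` are illustrative.

```python
import math

def kl_bernoulli(theta, theta0):
    """D_KL(Bernoulli(theta) || Bernoulli(theta0)) in nats, for 0 < theta, theta0 < 1."""
    return (theta * math.log(theta / theta0)
            + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - theta0)))

def second_derivative_at_minimum(theta0, h=1e-4):
    """Central finite-difference estimate of d^2/dtheta^2 D_KL at theta = theta0."""
    return (kl_bernoulli(theta0 + h, theta0)
            - 2.0 * kl_bernoulli(theta0, theta0)
            + kl_bernoulli(theta0 - h, theta0)) / (h * h)

theta0 = 0.3
hessian = second_derivative_at_minimum(theta0)
fisher = 1.0 / (theta0 * (1.0 - theta0))   # closed-form Fisher information of a Bernoulli
print(hessian, fisher)
```

The estimate should match the closed form up to discretisation error, and the divergence for a nearby parameter should be close to half the Fisher information times the squared parameter shift.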

Fisher information metric theorem

There is an associated theorem.[3] When $p(x, \rho)$ satisfies the following regularity conditions:

$\frac{\partial \log(p)}{\partial \rho}, \frac{\partial^2 \log(p)}{\partial \rho^2}, \frac{\partial^3 \log(p)}{\partial \rho^3}$ exist,

$$\begin{aligned}
\left|\frac{\partial p}{\partial \rho}\right| &< F(x): \int_{x=0}^{\infty} F(x)\,dx < \infty, \\
\left|\frac{\partial^2 p}{\partial \rho^2}\right| &< G(x): \int_{x=0}^{\infty} G(x)\,dx < \infty, \\
\left|\frac{\partial^3 \log(p)}{\partial \rho^3}\right| &< H(x): \int_{x=0}^{\infty} p(x,0)\,H(x)\,dx < \xi < \infty,
\end{aligned}$$

where $\xi$ is independent of $\rho$, and

$$\left.\int_{x=0}^{\infty} \frac{\partial p(x,\rho)}{\partial \rho}\right|_{\rho=0} dx = \left.\int_{x=0}^{\infty} \frac{\partial^2 p(x,\rho)}{\partial \rho^2}\right|_{\rho=0} dx = 0,$$

then:

$$\mathcal{D}(p(x,0) \parallel p(x,\rho)) = \frac{c\rho^2}{2} + \mathcal{O}\left(\rho^3\right) \text{ as } \rho \to 0.$$

Variation of information

Another information-theoretic metric is variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.

MAUVE Metric

MAUVE is a measure of the statistical gap between two text distributions, such as the difference between text generated by a model and human-written text. This measure is computed using Kullback–Leibler divergences between the two distributions in a quantized embedding space of a foundation model.

Relation to other quantities of information theory

Many of the other quantities of information theory can be interpreted as applications of relative entropy to specific cases.

Self-information

Main article: Information content

The self-information, also known as the information content of a signal, random variable, or event, is defined as the negative logarithm of the probability of the given outcome occurring.

When applied to a discrete random variable, the self-information can be represented as[citation needed]

$$\operatorname{I}(m) = D_{\text{KL}}\left(\delta_{im} \parallel \{p_i\}\right),$$

that is, as the relative entropy of the probability distribution $P(i)$ from a Kronecker delta representing certainty that $i = m$, i.e. the number of extra bits that must be transmitted to identify i if only the probability distribution $P(i)$ is available to the receiver, not the fact that $i = m$.
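As a small illustration of this identity, the sketch below compares the negative log-probability of an outcome with the divergence of the distribution from a Kronecker delta on that outcome. The example distribution, the chosen index `m` and the helper names are assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def self_information(p, m):
    """Negative log-probability of outcome m, in nats."""
    return -math.log(p[m])

p = [0.5, 0.25, 0.125, 0.125]
m = 2
delta = [1.0 if i == m else 0.0 for i in range(len(p))]   # certainty that i == m

via_kl = kl_divergence(delta, p)
direct = self_information(p, m)
print(via_kl, direct, direct / math.log(2))   # last value is the surprisal in bits
```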

Mutual information

Main article: Mutual information § Relation to Kullback–Leibler divergence

The mutual information,

$$\begin{aligned}
\operatorname{I}(X;Y) &= D_{\text{KL}}(P_{X,Y} \parallel P_X \cdot P_Y) \\
&= \operatorname{E}_X[D_{\text{KL}}^{Y}(P_{Y \mid X} \parallel P_Y)] \\
&= \operatorname{E}_Y[D_{\text{KL}}^{X}(P_{X \mid Y} \parallel P_X)]
\end{aligned}$$

is the relative entropy of the joint probability distribution $P_{X,Y}(x,y)$ from the product $(P_X \cdot P_Y)(x,y) = P_X(x) P_Y(y)$ of the two marginal probability distributions, i.e. the expected number of extra bits that must be transmitted to identify X and Y if they are coded using only their marginal distributions instead of the joint distribution.
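The first two lines of this identity can be compared on a small joint table. In the sketch below the 2×2 joint distribution and all variable names are hypothetical, chosen only to make the check concrete.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

joint = [[0.3, 0.1],
         [0.1, 0.5]]                      # joint[x][y] = P(X=x, Y=y)
p_x = [sum(row) for row in joint]         # marginal of X
p_y = [sum(joint[x][y] for x in range(2)) for y in range(2)]   # marginal of Y

# Divergence of the joint from the product of marginals.
flat_joint = [joint[x][y] for x in range(2) for y in range(2)]
flat_product = [p_x[x] * p_y[y] for x in range(2) for y in range(2)]
mi_joint_vs_product = kl_divergence(flat_joint, flat_product)

# Expected divergence of the conditional P(Y | X=x) from the marginal P(Y).
mi_expected_conditional = sum(
    p_x[x] * kl_divergence([joint[x][y] / p_x[x] for y in range(2)], p_y)
    for x in range(2)
)
print(mi_joint_vs_product, mi_expected_conditional)
```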

Shannon entropy

The Shannon entropy,

$$\begin{aligned}
\mathrm{H}(X) &= \operatorname{E}\left[\operatorname{I}_X(x)\right] \\
&= \log N - D_{\text{KL}}\left(p_X(x) \parallel P_U(X)\right)
\end{aligned}$$

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the relative entropy of the uniform distribution on the random variates of X, $P_U(X)$, from the true distribution $P(X)$, i.e. less the expected number of bits saved, which would have had to be sent if the value of X were coded according to the uniform distribution $P_U(X)$ rather than the true distribution $P(X)$. This definition of Shannon entropy forms the basis of E.T. Jaynes's alternative generalization to continuous distributions, the limiting density of discrete points (as opposed to the usual differential entropy), which defines the continuous entropy as

$$\lim_{N \to \infty} H_N(X) = \log N - \int p(x) \log \frac{p(x)}{m(x)} \, dx,$$

which is equivalent to $\log(N) - D_{\text{KL}}(p(x) \parallel m(x))$.
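The discrete identity can be verified directly. This is a minimal sketch in bits, with an example distribution and helper names chosen for illustration.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def shannon_entropy_bits(p):
    """H(p) = -sum p log2 p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)

p = [0.5, 0.25, 0.125, 0.125]
n = len(p)
uniform = [1.0 / n] * n

entropy_direct = shannon_entropy_bits(p)
entropy_via_kl = math.log2(n) - kl_divergence_bits(p, uniform)
print(entropy_direct, entropy_via_kl)
```

Here the uniform code would spend 2 bits per symbol, and the divergence from the uniform distribution is the part of those 2 bits that a code matched to p saves.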

Conditional entropy

The conditional entropy[34],

$$\begin{aligned}
\mathrm{H}(X \mid Y) &= \log N - D_{\text{KL}}(P(X,Y) \parallel P_U(X) P(Y)) \\
&= \log N - D_{\text{KL}}(P(X,Y) \parallel P(X) P(Y)) - D_{\text{KL}}(P(X) \parallel P_U(X)) \\
&= \mathrm{H}(X) - \operatorname{I}(X;Y) \\
&= \log N - \operatorname{E}_Y\left[D_{\text{KL}}\left(P(X \mid Y) \parallel P_U(X)\right)\right]
\end{aligned}$$

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the relative entropy of the true joint distribution $P(X,Y)$ from the product distribution $P_U(X) P(Y)$, i.e. less the expected number of bits saved which would have had to be sent if the value of X were coded according to the uniform distribution $P_U(X)$ rather than the conditional distribution $P(X \mid Y)$ of X given Y.
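The first two lines of this chain can be checked against the direct definition $-\sum p(x,y) \log p(x \mid y)$. The sketch below does so for a hypothetical 2×2 joint table; the table and the variable names are assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

joint = [[0.3, 0.1],
         [0.1, 0.5]]                      # joint[x][y] = P(X=x, Y=y)
n = len(joint)                            # number of values X can take
xs, ys = range(n), range(len(joint[0]))
p_x = [sum(joint[x][y] for y in ys) for x in xs]
p_y = [sum(joint[x][y] for x in xs) for y in ys]

# Direct definition: H(X|Y) = -sum p(x,y) log p(x|y).
h_direct = -sum(joint[x][y] * math.log(joint[x][y] / p_y[y]) for x in xs for y in ys)

flat_joint = [joint[x][y] for x in xs for y in ys]
uniform_times_py = [p_y[y] / n for x in xs for y in ys]
product_of_marginals = [p_x[x] * p_y[y] for x in xs for y in ys]

# First line of the identity: log N - D_KL(P(X,Y) || P_U(X) P(Y)).
h_line1 = math.log(n) - kl_divergence(flat_joint, uniform_times_py)
# Second line: subtract the mutual information and D_KL(P(X) || P_U(X)).
h_line2 = (math.log(n)
           - kl_divergence(flat_joint, product_of_marginals)
           - kl_divergence(p_x, [1.0 / n] * n))
print(h_direct, h_line1, h_line2)
```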

Cross entropy

When we have a set of possible events, coming from the distribution p, we can encode them (with a lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g. the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)). If we know the distribution p in advance, we can devise an encoding that would be optimal (e.g. using Huffman coding), meaning that the messages we encode will have the shortest length on average (assuming the encoded events are sampled from p). This average length is bounded below by the Shannon entropy of p (denoted as $\mathrm{H}(p)$) and equals it when, as in the example, every probability is a power of 1/2. However, if we use a different probability distribution (q) when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities. This new (larger) number is measured by the cross entropy between p and q.

The cross entropy between two probability distributions (p and q) measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p. The cross entropy for two distributions p and q over the same probability space is thus defined as follows.

$$\mathrm{H}(p, q) = \operatorname{E}_p[-\log q] = \mathrm{H}(p) + D_{\text{KL}}(p \parallel q)$$

For an explicit derivation of this, see the Motivation section above.

Under this scenario, the relative entropy (KL divergence) can be interpreted as the extra number of bits, on average, that are needed (beyond $\mathrm{H}(p)$) for encoding the events because of using q for constructing the encoding scheme instead of p.
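The coding interpretation can be made concrete with the three-event example from the text. In the sketch below, the mismatched distribution `q` and its code are hypothetical choices added for illustration; the code words for `p` are the ones quoted above.

```python
import math

def entropy_bits(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)

def cross_entropy_bits(p, q):
    """H(p, q) = E_p[-log2 q]; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.5, 0.25, 0.25]            # true distribution of events (A, B, C)
code_for_p = ["0", "10", "11"]   # prefix-free code from the text, matched to p
q = [0.25, 0.25, 0.5]            # hypothetical mismatched model
code_for_q = ["10", "11", "0"]   # a prefix-free code matched to q instead

# Average code length when events are actually drawn from p.
avg_len_matched = sum(pi * len(c) for pi, c in zip(p, code_for_p))
avg_len_mismatched = sum(pi * len(c) for pi, c in zip(p, code_for_q))

print(avg_len_matched, entropy_bits(p))
print(avg_len_mismatched, cross_entropy_bits(p, q))
print(avg_len_mismatched - avg_len_matched, kl_divergence_bits(p, q))
```

Because all probabilities here are powers of 1/2, the average code lengths coincide with the entropy and the cross entropy, and their difference is the relative entropy.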

Bayesian updating

In Bayesian statistics, relative entropy can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: $p(x) \to p(x \mid I)$. If some new fact $Y = y$ is discovered, it can be used to update the posterior distribution for X from $p(x \mid I)$ to a new posterior distribution $p(x \mid y, I)$ using Bayes' theorem:

$$p(x \mid y, I) = \frac{p(y \mid x, I) \, p(x \mid I)}{p(y \mid I)}$$

This distribution has a new entropy:

$$\mathrm{H}\big(p(x \mid y, I)\big) = -\sum_x p(x \mid y, I) \log p(x \mid y, I),$$

which may be less than or greater than the original entropy $\mathrm{H}(p(x \mid I))$. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on $p(x \mid I)$ instead of a new code based on $p(x \mid y, I)$ would have added an expected number of bits:

$$D_{\text{KL}}\big(p(x \mid y, I) \parallel p(x \mid I)\big) = \sum_x p(x \mid y, I) \log \frac{p(x \mid y, I)}{p(x \mid I)}$$

to the message length. This therefore represents the amount of useful information, or information gain, about X, that has been learned by discovering $Y = y$.
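A single update step and its information gain can be sketched as follows. The two-hypothesis coin setup, the numbers and the function names are hypothetical and serve only to illustrate the formulas above.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def bayes_update(prior, likelihood):
    """Posterior p(x | y, I) from prior p(x | I) and likelihood p(y | x, I) for the observed y."""
    unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
    evidence = sum(unnormalized)          # p(y | I)
    return [u / evidence for u in unnormalized]

# Hypothetical setup: X is "which coin", with P(heads) = 0.5 or 0.9.
prior = [0.5, 0.5]
likelihood_heads = [0.5, 0.9]             # p(y = heads | x, I)

posterior = bayes_update(prior, likelihood_heads)
information_gain = kl_divergence_bits(posterior, prior)
print(posterior, information_gain)
```

Observing a single head shifts belief only modestly toward the biased coin, so the gain is a small fraction of a bit.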

If a further piece of data, $Y_2 = y_2$, subsequently comes in, the probability distribution for x can be updated further, to give a new best guess $p(x \mid y_1, y_2, I)$. If one reinvestigates the information gain for using $p(x \mid y_1, I)$ rather than $p(x \mid I)$, it turns out that it may be either greater or less than previously estimated:

$$\sum_x p(x \mid y_1, y_2, I) \log \frac{p(x \mid y_1, y_2, I)}{p(x \mid I)}$$ may be ≤ or > than $\sum_x p(x \mid y_1, I) \log \frac{p(x \mid y_1, I)}{p(x \mid I)}$

and so the combined information gain does not obey the triangle inequality:

$$D_{\text{KL}}\big(p(x \mid y_1, y_2, I) \parallel p(x \mid I)\big)$$ may be <, = or > than $$D_{\text{KL}}\big(p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I)\big) + D_{\text{KL}}\big(p(x \mid y_1, I) \parallel p(x \mid I)\big)$$

All one can say is that on average, averaging using $p(y_2 \mid y_1, x, I)$, the two sides will average out.

Bayesian experimental design

A common goal in Bayesian experimental design is to maximise the expected relative entropy between the prior and the posterior.[35] When posteriors are approximated to be Gaussian distributions, a design maximising the expected relative entropy is called Bayes d-optimal.

Discrimination information

Relative entropy $D_{\text{KL}}\bigl(p(x \mid H_1) \parallel p(x \mid H_0)\bigr)$ can also be interpreted as the expected discrimination information for $H_1$ over $H_0$: the mean information per sample for discriminating in favor of a hypothesis $H_1$ against a hypothesis $H_0$, when hypothesis $H_1$ is true.[36] Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for $H_1$ over $H_0$ to be expected from each sample.

The expected weight of evidence for $H_1$ over $H_0$ is not the same as the information gain expected per sample about the probability distribution $p(H)$ of the hypotheses,

$$D_{\text{KL}}(p(x \mid H_1) \parallel p(x \mid H_0)) \neq IG = D_{\text{KL}}(p(H \mid x) \parallel p(H \mid I)).$$

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.

On the entropy scale of information gain there is very little difference between near certainty and absolute certainty: coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous, perhaps infinite; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination information

The idea of relative entropy as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution f should be chosen which is as hard to discriminate from the original distribution $f_0$ as possible, so that the new data produces as small an information gain $D_{\text{KL}}(f \parallel f_0)$ as possible.

For example, if one had a prior distribution $p(x, a)$ over x and a, and subsequently learnt the true distribution of a was $u(a)$, then the relative entropy between the new joint distribution for x and a, $q(x \mid a) u(a)$, and the earlier prior distribution would be:

$$D_{\text{KL}}(q(x \mid a) u(a) \parallel p(x, a)) = \operatorname{E}_{u(a)}\left\{D_{\text{KL}}(q(x \mid a) \parallel p(x \mid a))\right\} + D_{\text{KL}}(u(a) \parallel p(a)),$$

i.e. the sum of the relative entropy of $p(a)$, the prior distribution for a, from the updated distribution $u(a)$, plus the expected value (using the probability distribution $u(a)$) of the relative entropy of the prior conditional distribution $p(x \mid a)$ from the new conditional distribution $q(x \mid a)$. (Note that the latter expected value is often called the conditional relative entropy (or conditional Kullback–Leibler divergence) and denoted by $D_{\text{KL}}(q(x \mid a) \parallel p(x \mid a))$.[3][34]) This is minimized if $q(x \mid a) = p(x \mid a)$ over the whole support of $u(a)$; and we note that this result incorporates Bayes' theorem, if the new distribution $u(a)$ is in fact a δ function representing certainty that a has one particular value.
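The decomposition, and the fact that keeping the prior conditional minimises it, can be checked on a small table. In the sketch below the prior joint, the new marginal `u_a`, the candidate conditional `q_x_given_a` and all names are hypothetical values picked for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical prior joint p(x, a), indexed as prior_joint[a][x].
prior_joint = [[0.10, 0.30],
               [0.40, 0.20]]
p_a = [sum(row) for row in prior_joint]                        # prior marginal p(a)
p_x_given_a = [[v / p_a[a] for v in row] for a, row in enumerate(prior_joint)]

u_a = [0.2, 0.8]                                               # newly learnt distribution of a
q_x_given_a = [[0.6, 0.4],                                     # an arbitrary candidate q(x | a)
               [0.3, 0.7]]

def joint_divergence(cond):
    """D_KL(cond(x|a) u(a) || p(x, a)) computed directly on the joint."""
    new_joint = [cond[a][x] * u_a[a] for a in range(2) for x in range(2)]
    old_joint = [prior_joint[a][x] for a in range(2) for x in range(2)]
    return kl_divergence(new_joint, old_joint)

lhs = joint_divergence(q_x_given_a)
rhs = (sum(u_a[a] * kl_divergence(q_x_given_a[a], p_x_given_a[a]) for a in range(2))
       + kl_divergence(u_a, p_a))
minimum = joint_divergence(p_x_given_a)    # the choice q(x|a) = p(x|a)
print(lhs, rhs, minimum)
```

With $q(x \mid a) = p(x \mid a)$ the conditional term vanishes and only $D_{\text{KL}}(u(a) \parallel p(a))$ remains.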

MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the relative entropy continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising relative entropy from m to p with respect to m is equivalent to minimizing the cross-entropy of p and m, since

$$\mathrm{H}(p, m) = \mathrm{H}(p) + D_{\text{KL}}(p \parallel m),$$

which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often not the task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is attempting to optimise by minimising $D_{\text{KL}}(p \parallel m)$ subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be $D_{\text{KL}}(p \parallel m)$, rather than $\mathrm{H}(p, m)$[citation needed].

Relationship to available work

Figure: Pressure versus volume plot of available work from a mole of argon gas relative to ambient, calculated as $T_o$ times the Kullback–Leibler divergence.

Surprisals[37] add where probabilities multiply. The surprisal for an event of probability p is defined as $s = -k \ln p$. If k is $\{1, 1/\ln 2, 1.38 \times 10^{-23}\}$ then surprisal is in {nats, bits, or J/K} so that, for instance, there are N bits of surprisal for landing all "heads" on a toss of N coins.

Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal S (entropy) for a given set of control parameters (like pressure P or volume V). This constrained entropy maximization, both classically[38] and quantum mechanically,[39] minimizes Gibbs availability in entropy units[40] $A \equiv -k \ln Z$ where Z is a constrained multiplicity or partition function.

When temperature T is fixed, free energy ($T \times A$) is also minimized. Thus if $T, V$ and number of molecules N are constant, the Helmholtz free energy $F \equiv U - TS$ (where U is energy and S is entropy) is minimized as a system "equilibrates." If T and P are held constant (say during processes in your body), the Gibbs free energy $G = U + PV - TS$ is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature $T_o$ and pressure $P_o$ is $W = \Delta G = N k T_o \Theta(V/V_o)$ where $V_o = N k T_o / P_o$ and $\Theta(x) = x - 1 - \ln x \geq 0$ (see also Gibbs inequality).

More generally[41] the work available relative to some ambient is obtained by multiplying ambient temperature $T_o$ by relative entropy or net surprisal $\Delta I \geq 0,$ defined as the average value of $k \ln(p/p_o)$ where $p_o$ is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of $V_o$ and $T_o$ is thus $W = T_o \Delta I$, where relative entropy

$$\Delta I = N k \left[\Theta\left(\frac{V}{V_o}\right) + \frac{3}{2} \Theta\left(\frac{T}{T_o}\right)\right].$$

The resulting contours of constant relative entropy, shown in the accompanying figure for a mole of argon at standard temperature and pressure, for example put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to convert boiling water to ice water discussed in the cited reference.[42] Thus relative entropy measures thermodynamic availability in bits.
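The ideal-gas expressions above are simple enough to evaluate directly. The sketch below is an illustration under stated assumptions: the function names are invented for the example, the constants are the SI values of the Boltzmann and Avogadro constants, and the chosen state (twice the ambient volume at an ambient temperature of 298.15 K) is hypothetical.

```python
import math

BOLTZMANN_K = 1.380649e-23      # J/K
AVOGADRO = 6.02214076e23        # particles per mole

def theta(x):
    """Theta(x) = x - 1 - ln x, which is >= 0 with equality only at x = 1."""
    return x - 1.0 - math.log(x)

def net_surprisal(n_particles, v_ratio, t_ratio):
    """Delta I = N k [Theta(V/Vo) + (3/2) Theta(T/To)], in J/K."""
    return n_particles * BOLTZMANN_K * (theta(v_ratio) + 1.5 * theta(t_ratio))

def available_work(n_particles, v_ratio, t_ratio, t_ambient):
    """W = To * Delta I, in joules."""
    return t_ambient * net_surprisal(n_particles, v_ratio, t_ratio)

# Hypothetical case: one mole at ambient temperature but twice the ambient volume.
work = available_work(AVOGADRO, v_ratio=2.0, t_ratio=1.0, t_ambient=298.15)
print(work)
```

At ambient conditions both ratios equal 1, so the net surprisal and the available work vanish.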

Quantum information theory

For density matrices P and Q on a Hilbert space, the quantum relative entropy from Q to P is defined to be

$$D_{\text{KL}}(P \parallel Q) = \operatorname{Tr}(P(\log P - \log Q)).$$

In quantum information science the minimum of $D_{\text{KL}}(P \parallel Q)$ over all separable states Q can also be used as a measure of entanglement in the state P.

Relationship between models and reality

See also: Maximum likelihood estimation § Relation to minimizing Kullback–Leibler divergence and cross entropy
Further information: Model selection

Just as relative entropy of "actual from ambient" measures thermodynamic availability, relative entropy of "reality from a model" is also useful even if the only clues we have about reality are some experimental measurements. In the former case relative entropy describes distance to equilibrium or (when multiplied by ambient temperature) the amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, how much the model has yet to learn.

Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to selecting a statistical model via the Akaike information criterion is particularly well described in papers[43] and a book[44] by Burnham and Anderson. In a nutshell the relative entropy of reality from a model may be estimated, to within a constant additive term, by a function of the deviations observed between data and the model's predictions (like the mean squared deviation). Estimates of such divergence for models that share the same additive term can in turn be used to select among models.

When trying to fit parametrized models to data there are various estimators which attempt to minimize relative entropy, such as maximum likelihood and maximum spacing estimators.[citation needed]

Symmetrised divergence

Kullback & Leibler (1951) also considered the symmetrized function:[6]

$$D_{\text{KL}}(P \parallel Q) + D_{\text{KL}}(Q \parallel P)$$

which they referred to as the "divergence", though today the "KL divergence" refers to the asymmetric function (see § Etymology for the evolution of the term). This function is symmetric and nonnegative, and had already been defined and used by Harold Jeffreys in 1948;[7] it is accordingly called the Jeffreys divergence.

This quantity has sometimes been used for feature selection in classification problems, where P and Q are the conditional pdfs of a feature under two different classes. In the banking and finance industries, this quantity is referred to as the Population Stability Index (PSI), and is used to assess distributional shifts in model features through time.
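Adding the two directed divergences term by term gives the single sum $\sum_i (p_i - q_i) \log(p_i / q_i)$. The sketch below compares the two forms on a pair of hypothetical binned distributions; the bin values and function names are assumptions, and every bin is taken to be strictly positive in both distributions.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes all entries of p and q are positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jeffreys_divergence(p, q):
    """Symmetrised divergence D_KL(p || q) + D_KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def jeffreys_single_sum(p, q):
    """Equivalent single sum: sum_i (p_i - q_i) * log(p_i / q_i)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical binned feature distributions at two points in time.
baseline = [0.25, 0.35, 0.25, 0.15]
current = [0.20, 0.30, 0.30, 0.20]

print(jeffreys_divergence(baseline, current), jeffreys_single_sum(baseline, current))
```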

An alternative is given via the $\lambda$-divergence,

$$D_\lambda(P \parallel Q) = \lambda D_{\text{KL}}(P \parallel \lambda P + (1 - \lambda) Q) + (1 - \lambda) D_{\text{KL}}(Q \parallel \lambda P + (1 - \lambda) Q),$$

which can be interpreted as the expected information gain about X from discovering which probability distribution X is drawn from, P or Q, if they currently have probabilities $\lambda$ and $1 - \lambda$ respectively.[clarification needed][citation needed]

The value $\lambda = 0.5$ gives the Jensen–Shannon divergence, defined by

$$D_{\text{JS}} = \tfrac{1}{2} D_{\text{KL}}(P \parallel M) + \tfrac{1}{2} D_{\text{KL}}(Q \parallel M)$$

where M is the average of the two distributions,

$$M = \tfrac{1}{2}\left(P + Q\right).$$

We can also interpret $D_{\text{JS}}$ as the capacity of a noisy information channel with two inputs giving the output distributions P and Q. The Jensen–Shannon divergence, like all f-divergences, is locally proportional to the Fisher information metric. It is similar to the Hellinger metric (in the sense that it induces the same affine connection on a statistical manifold).
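A short sketch of the Jensen–Shannon divergence follows, using natural logarithms so that its maximum value is $\ln 2$. The example distributions and helper names are illustrative assumptions; note that the pair below has mismatched supports, so the plain Kullback–Leibler divergence between them would be infinite while the Jensen–Shannon divergence stays finite.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jensen_shannon(p, q):
    """D_JS = 1/2 D_KL(p || m) + 1/2 D_KL(q || m) with m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1, 0.0]
q = [0.0, 0.1, 0.3, 0.6]
js_pq = jensen_shannon(p, q)
js_qp = jensen_shannon(q, p)
js_disjoint = jensen_shannon([1.0, 0.0], [0.0, 1.0])   # fully disjoint supports
print(js_pq, js_qp, js_disjoint, math.log(2))
```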

Furthermore, the Jensen–Shannon divergence can be generalized using abstract statistical M-mixtures relying on an abstract mean M.[45][46]

Relationship to other probability-distance measures

There are many other important measures of probability distance. Some of these are particularly connected with relative entropy. For example:

Other notable measures of distance include the Hellinger distance, histogram intersection, Chi-squared statistic, quadratic form distance, match distance, Kolmogorov–Smirnov distance, and earth mover's distance.[49]

Data differencing

Main article: Data differencing

Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical background for data differencing: the absolute entropy of a set of data in this sense is the data required to reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target given the source (minimum size of a patch).

References

  1. Csiszar, I (February 1975). "I-Divergence Geometry of Probability Distributions and Minimization Problems". Ann. Probab. 3 (1): 146–158. doi:10.1214/aop/1176996454.
  2. Kullback, S.; Leibler, R.A. (1951). "On information and sufficiency". Annals of Mathematical Statistics. 22 (1): 79–86. doi:10.1214/aoms/1177729694. JSTOR 2236703. MR 0039968.
  3. Kullback 1959.
  4. Amari 2016, p. 11.
  5. Amari 2016, p. 28.
  6. Kullback & Leibler 1951, p. 80.
  7. Jeffreys 1948, p. 158.
  8. Kullback 1959, p. 7.
  9. Kullback, S. (1987). "Letter to the Editor: The Kullback–Leibler distance". The American Statistician. 41 (4): 340–341. doi:10.1080/00031305.1987.10475510. JSTOR 2684769.
  10. Kullback 1959, p. 6.
  11. MacKay, David J.C. (2003). Information Theory, Inference, and Learning Algorithms (1st ed.). Cambridge University Press. p. 34. ISBN 978-0-521-64298-9 – via Google Books.
  12. "What's the maximum value of Kullback-Leibler (KL) divergence?". Machine learning. Statistics Stack Exchange (stats.stackexchange.com). Cross validated.
  13. "In what situations is the integral equal to infinity?". Integration. Mathematics Stack Exchange (math.stackexchange.com).
  14. Bishop, Christopher M. Pattern recognition and machine learning. p. 55. OCLC 1334664824.
  15. Kullback 1959, p. 5.
  16. Burnham, K. P.; Anderson, D. R. (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer. p. 51. ISBN 978-0-387-95364-9.
  17. Abdulkadirov, Ruslan; Lyakhov, Pavel; Nagornov, Nikolay (January 2023). "Survey of Optimization Algorithms in Modern Neural Networks". Mathematics. 11 (11): 2466. doi:10.3390/math11112466. ISSN 2227-7390.
  18. Matassa, Marco (December 2021). "Fubini-Study metrics and Levi-Civita connections on quantum projective spaces". Advances in Mathematics. 393: 108101. arXiv:2010.03291. doi:10.1016/j.aim.2021.108101. ISSN 0001-8708.
  19. Lan, Guanghui (March 2023). "Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes". Mathematical Programming. 198 (1): 1059–1106. arXiv:2102.00135. doi:10.1007/s10107-022-01816-5. ISSN 1436-4646.
  20. Kelly, J. L. Jr. (1956). "A New Interpretation of Information Rate". Bell Syst. Tech. J. 35 (4): 917–926. doi:10.1002/j.1538-7305.1956.tb03809.x.
  21. Soklakov, A. N. (2020). "Economics of Disagreement – Financial Intuition for the Rényi Divergence". Entropy. 22 (8): 860. arXiv:1811.08308. Bibcode:2020Entrp..22..860S. doi:10.3390/e22080860. PMC 7517462. PMID 33286632.
  22. Soklakov, A. N. (2023). "Information Geometry of Risks and Returns". Risk. June. SSRN 4134885.
  23. Henide, Karim (30 September 2024). "Flow Rider: Tradable Ecosystems' Relative Entropy of Flows As a Determinant of Relative Value". The Journal of Investing. 33 (6): 34–58. doi:10.3905/joi.2024.1.321.
  24. Sanov, I.N. (1957). "On the probability of large deviations of random magnitudes". Mat. Sbornik. 42 (84): 11–44.
  25. Novak S.Y. (2011), Extreme Value Methods with Applications to Finance, ch. 14.5 (Chapman & Hall). ISBN 978-1-4398-3574-6.
  26. Hobson, Arthur (1971). Concepts in statistical mechanics. New York: Gordon and Breach. ISBN 978-0-677-03240-5.
  27. Bonnici, V. (2020). "Kullback-Leibler divergence between quantum distributions, and its upper-bound". arXiv:2008.05932 [cs.LG].
  28. See the section "differential entropy – 4" in Relative Entropy video lecture by Sergio Verdú, NIPS 2009.
  29. Donsker, Monroe D.; Varadhan, SR Srinivasa (1983). "Asymptotic evaluation of certain Markov process expectations for large time. IV". Communications on Pure and Applied Mathematics. 36 (2): 183–212. doi:10.1002/cpa.3160360204.
  30. Duchi J. "Derivations for Linear Algebra and Optimization" (PDF). p. 13.
  31. Belov, Dmitry I.; Armstrong, Ronald D. (2011-04-15). "Distributions of the Kullback-Leibler divergence with applications". British Journal of Mathematical and Statistical Psychology. 64 (2): 291–309. doi:10.1348/000711010x522227. ISSN 0007-1102. PMID 21492134.
  32. Buchner, Johannes (2022-04-29). An intuition for physicists: information gain from experiments. OCLC 1363563215.
  33. Nielsen, Frank; Garcia, Vincent (2011). "Statistical exponential families: A digest with flash cards". arXiv:0911.4863 [cs.LG].
  34. Cover, Thomas M.; Thomas, Joy A. (1991), Elements of Information Theory, John Wiley & Sons, p. 22.
  35. Chaloner, K.; Verdinelli, I. (1995). "Bayesian experimental design: a review". Statistical Science. 10 (3): 273–304. doi:10.1214/ss/1177009939. hdl:11299/199630.
  36. Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. (2007). "Section 14.7.2. Kullback–Leibler Distance". Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge University Press. ISBN 978-0-521-88068-8.
  37. Tribus, Myron (1959). Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, with Engineering Applications. Van Nostrand.
  38. Jaynes, E. T. (1957). "Information theory and statistical mechanics" (PDF). Physical Review. 106 (4): 620–630. Bibcode:1957PhRv..106..620J. doi:10.1103/physrev.106.620. S2CID 17870175.
  39. Jaynes, E. T. (1957). "Information theory and statistical mechanics II" (PDF). Physical Review. 108 (2): 171–190. Bibcode:1957PhRv..108..171J. doi:10.1103/physrev.108.171.
  40. Gibbs, Josiah Willard (1871). A Method of Geometrical Representation of the Thermodynamic Properties of Substances by Means of Surfaces. The Academy. footnote page 52.
  41. ^ Tribus, M.; McIrvine, E. C. (1971). "Energy and information". Scientific American. 225 (3): 179–186. Bibcode:1971SciAm.225c.179T. doi:10.1038/scientificamerican0971-179.
  42. ^ Fraundorf, P. (2007). "Thermal roots of correlation-based complexity". Complexity. 13 (3): 18–26. arXiv:1103.2481. Bibcode:2008Cmplx..13c..18F. doi:10.1002/cplx.20195. S2CID 20794688. Archived from the original on 2011-08-13.
  43. ^ Burnham, K. P.; Anderson, D. R. (2001). "Kullback–Leibler information as a basis for strong inference in ecological studies". Wildlife Research. 28 (2): 111–119. doi:10.1071/WR99107.
  44. ^ Burnham, Kenneth P. (December 2010). Model selection and multimodel inference: a practical information-theoretic approach. Springer. ISBN 978-1-4419-2973-0. OCLC 878132909.
  45. ^ Nielsen, Frank (2019). "On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means". Entropy. 21 (5): 485. arXiv:1904.04017. Bibcode:2019Entrp..21..485N. doi:10.3390/e21050485. PMC 7514974. PMID 33267199.
  46. ^ Nielsen, Frank (2020). "On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid". Entropy. 22 (2): 221. arXiv:1912.00610. Bibcode:2020Entrp..22..221N. doi:10.3390/e22020221. PMC 7516653. PMID 33285995.
  47. ^ Bretagnolle, J.; Huber, C. (1978), "Estimation des densités : Risque minimax", Séminaire de Probabilités XII, Lecture Notes in Mathematics (in French), vol. 649, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 342–363, doi:10.1007/bfb0064610, ISBN 978-3-540-08761-8, S2CID 122597694. Lemma 2.1.
  48. ^ Tsybakov, Alexandre B. (2010). Introduction to nonparametric estimation. Springer. ISBN 978-1-4419-2709-5. OCLC 757859245. Equation 2.25.
  49. ^ Rubner, Y.; Tomasi, C.; Guibas, L. J. (2000). "The earth mover's distance as a metric for image retrieval". International Journal of Computer Vision. 40 (2): 99–121. doi:10.1023/A:1026543900054. S2CID 14106275.
