Conditional entropy

From Wikipedia, the free encyclopedia
Measure of relative information in probability theory
Venn diagram showing additive and subtractive relationships among various information measures associated with correlated variables $X$ and $Y$. The area contained by both circles is the joint entropy $\mathrm{H}(X,Y)$. The circle on the left (red and violet) is the individual entropy $\mathrm{H}(X)$, with the red being the conditional entropy $\mathrm{H}(X|Y)$. The circle on the right (blue and violet) is $\mathrm{H}(Y)$, with the blue being $\mathrm{H}(Y|X)$. The violet is the mutual information $\operatorname{I}(X;Y)$.

In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable $Y$ given that the value of another random variable $X$ is known. Here, information is measured in shannons, nats, or hartleys. The entropy of $Y$ conditioned on $X$ is written as $\mathrm{H}(Y|X)$.

Definition


The conditional entropy of $Y$ given $X$ is defined as

$$\mathrm{H}(Y|X) = -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y)\log\frac{p(x,y)}{p(x)}$$

where $\mathcal{X}$ and $\mathcal{Y}$ denote the support sets of $X$ and $Y$.

Note: Here, the convention is that the expression $0\log 0$ should be treated as being equal to zero. This is because $\lim_{\theta\to 0^{+}}\theta\log\theta = 0$.[1]

Intuitively, notice that by the definition of expected value and of conditional probability, $\mathrm{H}(Y|X)$ can be written as $\mathrm{H}(Y|X) = \mathbb{E}[f(X,Y)]$, where $f$ is defined as $f(x,y) := -\log\!\left(\frac{p(x,y)}{p(x)}\right) = -\log(p(y|x))$. One can think of $f$ as associating each pair $(x,y)$ with a quantity measuring the information content of $(Y=y)$ given $(X=x)$; this is exactly the amount of information needed to describe the event $(Y=y)$ given $(X=x)$. Hence, by computing the expected value of $f$ over all pairs of values $(x,y)\in\mathcal{X}\times\mathcal{Y}$, the conditional entropy $\mathrm{H}(Y|X)$ measures how much information, on average, is still needed to describe $Y$ once $X$ is known.
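
As a concrete illustration, the following minimal Python sketch evaluates the defining sum directly on an assumed example joint pmf (the table is hypothetical, chosen only for illustration), using the $0\log 0 = 0$ convention:

```python
# Minimal sketch: H(Y|X) for two discrete variables, computed straight from
# the definition on an assumed example joint pmf (values are illustrative only).
import numpy as np

# Rows index x, columns index y; entries are p(x, y) and sum to 1.
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

p_x = p_xy.sum(axis=1, keepdims=True)       # marginal p(x)

# H(Y|X) = -sum_{x,y} p(x,y) * log2( p(x,y) / p(x) ), with 0 log 0 := 0
mask = p_xy > 0
h_y_given_x = -np.sum(p_xy[mask] * np.log2((p_xy / p_x)[mask]))
print(f"H(Y|X) = {h_y_given_x:.4f} bits")   # ≈ 0.8610 bits for this table
```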

Motivation


Let $\mathrm{H}(Y|X=x)$ be the entropy of the discrete random variable $Y$ conditioned on the discrete random variable $X$ taking a certain value $x$. Denote the support sets of $X$ and $Y$ by $\mathcal{X}$ and $\mathcal{Y}$. Let $Y$ have probability mass function $p_Y(y)$. The unconditional entropy of $Y$ is calculated as $\mathrm{H}(Y) := \mathbb{E}[\operatorname{I}(Y)]$, i.e.

$$\mathrm{H}(Y) = \sum_{y\in\mathcal{Y}} \Pr(Y=y)\,\operatorname{I}(y) = -\sum_{y\in\mathcal{Y}} p_Y(y)\log_2 p_Y(y),$$

where $\operatorname{I}(y_i)$ is the information content of the outcome of $Y$ taking the value $y_i$. The entropy of $Y$ conditioned on $X$ taking the value $x$ is defined by:

$$\mathrm{H}(Y|X=x) = -\sum_{y\in\mathcal{Y}} \Pr(Y=y|X=x)\log_2 \Pr(Y=y|X=x).$$

Note that $\mathrm{H}(Y|X)$ is the result of averaging $\mathrm{H}(Y|X=x)$ over all possible values $x$ that $X$ may take. Also, if the above sum is taken over a sample $y_1,\dots,y_n$, the expected value $\mathbb{E}_X[\mathrm{H}(y_1,\dots,y_n \mid X=x)]$ is known in some domains as equivocation.[2]

Given discrete random variables $X$ with image $\mathcal{X}$ and $Y$ with image $\mathcal{Y}$, the conditional entropy of $Y$ given $X$ is defined as the weighted sum of $\mathrm{H}(Y|X=x)$ for each possible value of $x$, using $p(x)$ as the weights:[3]: 15

$$\begin{aligned}
\mathrm{H}(Y|X) &\equiv \sum_{x\in\mathcal{X}} p(x)\,\mathrm{H}(Y|X=x)\\
&= -\sum_{x\in\mathcal{X}} p(x)\sum_{y\in\mathcal{Y}} p(y|x)\log_2 p(y|x)\\
&= -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x)\,p(y|x)\log_2 p(y|x)\\
&= -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y)\log_2 \frac{p(x,y)}{p(x)}.
\end{aligned}$$
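
For a worked check, the sketch below (reusing the same illustrative joint pmf as above) evaluates both the weighted sum of per-value entropies and the joint-sum form and confirms they agree:

```python
# Sketch: weighted-sum and joint-sum forms of H(Y|X) agree (illustrative pmf).
import numpy as np

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])             # p(x, y), rows index x
p_x = p_xy.sum(axis=1)                      # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]           # conditional p(y|x)

def entropy(p):
    """Shannon entropy in bits, using the 0 log 0 = 0 convention."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# sum_x p(x) H(Y|X=x)
weighted = sum(p_x[i] * entropy(p_y_given_x[i]) for i in range(len(p_x)))

# -sum_{x,y} p(x,y) log2( p(x,y) / p(x) )
mask = p_xy > 0
joint_form = -np.sum(p_xy[mask] * np.log2(p_y_given_x[mask]))

print(weighted, joint_form)                 # both ≈ 0.8610 bits
```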

Properties


Conditional entropy equals zero

$\mathrm{H}(Y|X) = 0$ if and only if the value of $Y$ is completely determined by the value of $X$.

Conditional entropy of independent random variables


Conversely, $\mathrm{H}(Y|X) = \mathrm{H}(Y)$ if and only if $Y$ and $X$ are independent random variables.
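
Both boundary cases can be checked numerically; the sketch below uses two assumed example joint tables, one where $Y$ is a deterministic function of $X$ and one where $X$ and $Y$ are independent:

```python
# Sketch: boundary cases of H(Y|X) on assumed illustrative joint tables.
import numpy as np

def cond_entropy(p_xy):
    """H(Y|X) for a joint table with rows indexing X and columns indexing Y."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    mask = p_xy > 0
    return -np.sum(p_xy[mask] * np.log2((p_xy / p_x)[mask]))

# Y completely determined by X (here Y = X): H(Y|X) = 0
determined = np.array([[0.5, 0.0],
                       [0.0, 0.5]])
print(cond_entropy(determined))             # 0.0 (may print as -0.0)

# X and Y independent with Y uniform: H(Y|X) = H(Y) = 1 bit
independent = np.outer([0.3, 0.7], [0.5, 0.5])
print(cond_entropy(independent))            # 1.0
```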

Chain rule


Assume that the combined system determined by two random variables $X$ and $Y$ has joint entropy $\mathrm{H}(X,Y)$, that is, we need $\mathrm{H}(X,Y)$ bits of information on average to describe its exact state. Now if we first learn the value of $X$, we have gained $\mathrm{H}(X)$ bits of information. Once $X$ is known, we only need $\mathrm{H}(X,Y) - \mathrm{H}(X)$ bits to describe the state of the whole system. This quantity is exactly $\mathrm{H}(Y|X)$, which gives the chain rule of conditional entropy:

$$\mathrm{H}(Y|X) = \mathrm{H}(X,Y) - \mathrm{H}(X).$$[3]: 17

The chain rule follows from the above definition of conditional entropy:

$$\begin{aligned}
\mathrm{H}(Y|X) &= \sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y)\log\left(\frac{p(x)}{p(x,y)}\right)\\
&= \sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y)\bigl(\log(p(x)) - \log(p(x,y))\bigr)\\
&= -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y)\log(p(x,y)) + \sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y)\log(p(x))\\
&= \mathrm{H}(X,Y) + \sum_{x\in\mathcal{X}} p(x)\log(p(x))\\
&= \mathrm{H}(X,Y) - \mathrm{H}(X).
\end{aligned}$$
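
A quick numerical sanity check of the two-variable chain rule (again on the assumed illustrative pmf used above):

```python
# Sketch: verify H(Y|X) = H(X,Y) - H(X) on an assumed example joint pmf.
import numpy as np

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_xy = entropy(p_xy.ravel())                # joint entropy H(X,Y)
h_x  = entropy(p_xy.sum(axis=1))            # marginal entropy H(X)

p_x = p_xy.sum(axis=1, keepdims=True)
mask = p_xy > 0
h_y_given_x = -np.sum(p_xy[mask] * np.log2((p_xy / p_x)[mask]))

assert np.isclose(h_y_given_x, h_xy - h_x)  # chain rule holds
```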

In general, a chain rule for multiple random variables holds:

$$\mathrm{H}(X_1,X_2,\ldots,X_n) = \sum_{i=1}^{n}\mathrm{H}(X_i|X_1,\ldots,X_{i-1})$$[3]: 22

It has a similar form to the chain rule in probability theory, except that addition is used instead of multiplication.

Bayes' rule


Bayes' rule for conditional entropy states

$$\mathrm{H}(Y|X) = \mathrm{H}(X|Y) - \mathrm{H}(X) + \mathrm{H}(Y).$$

Proof. $\mathrm{H}(Y|X) = \mathrm{H}(X,Y) - \mathrm{H}(X)$ and $\mathrm{H}(X|Y) = \mathrm{H}(Y,X) - \mathrm{H}(Y)$. Symmetry entails $\mathrm{H}(X,Y) = \mathrm{H}(Y,X)$. Subtracting the two equations implies Bayes' rule.
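
The identity can likewise be confirmed numerically; the sketch below reuses the illustrative pmf from earlier:

```python
# Sketch: check Bayes' rule H(Y|X) = H(X|Y) - H(X) + H(Y) numerically.
import numpy as np

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])             # assumed example joint pmf

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cond_entropy(p_ab):
    """H(B|A) for a joint table with rows indexing A and columns indexing B."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    mask = p_ab > 0
    return -np.sum(p_ab[mask] * np.log2((p_ab / p_a)[mask]))

h_y_given_x = cond_entropy(p_xy)            # H(Y|X)
h_x_given_y = cond_entropy(p_xy.T)          # H(X|Y)
h_x = entropy(p_xy.sum(axis=1))
h_y = entropy(p_xy.sum(axis=0))

assert np.isclose(h_y_given_x, h_x_given_y - h_x + h_y)
```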

If $Y$ is conditionally independent of $Z$ given $X$, we have:

$$\mathrm{H}(Y|X,Z) = \mathrm{H}(Y|X).$$

Other properties


For any $X$ and $Y$:

$$\begin{aligned}
\mathrm{H}(Y|X) &\leq \mathrm{H}(Y),\\
\mathrm{H}(X,Y) &= \mathrm{H}(X|Y) + \mathrm{H}(Y|X) + \operatorname{I}(X;Y),\\
\mathrm{H}(X,Y) &= \mathrm{H}(X) + \mathrm{H}(Y) - \operatorname{I}(X;Y),\\
\operatorname{I}(X;Y) &\leq \mathrm{H}(X),
\end{aligned}$$

where $\operatorname{I}(X;Y)$ is the mutual information between $X$ and $Y$.

For independent $X$ and $Y$:

$\mathrm{H}(Y|X) = \mathrm{H}(Y)$ and $\mathrm{H}(X|Y) = \mathrm{H}(X)$

Although the specific conditional entropy $\mathrm{H}(X|Y=y)$ can be either less or greater than $\mathrm{H}(X)$ for a given random variate $y$ of $Y$, $\mathrm{H}(X|Y)$ can never exceed $\mathrm{H}(X)$.
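
The sketch below constructs an assumed example pmf in which $\mathrm{H}(X|Y=y)$ exceeds $\mathrm{H}(X)$ for one particular outcome $y$, while the average $\mathrm{H}(X|Y)$ still stays below $\mathrm{H}(X)$:

```python
# Sketch: H(X|Y=y) can exceed H(X) for one y, but H(X|Y) <= H(X) on average.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Assumed example pmf: rows index x, columns index y.
p_xy = np.array([[0.85, 0.05],
                 [0.05, 0.05]])
p_x = p_xy.sum(axis=1)                      # [0.9, 0.1]
p_y = p_xy.sum(axis=0)                      # [0.9, 0.1]

h_x          = entropy(p_x)                                    # ≈ 0.469 bits
h_x_given_y1 = entropy(p_xy[:, 1] / p_y[1])                    # 1.0 bit > H(X)
h_x_given_Y  = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))

print(h_x, h_x_given_y1, h_x_given_Y)       # ≈ 0.469, 1.0, ≈ 0.379
assert h_x_given_Y <= h_x                   # averaging never exceeds H(X)
```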

Conditional differential entropy


Definition


The above definition is for discrete random variables. The continuous version of discrete conditional entropy is called conditional differential (or continuous) entropy. Let $X$ and $Y$ be continuous random variables with a joint probability density function $f(x,y)$. The differential conditional entropy $h(X|Y)$ is defined as[3]: 249

$$h(X|Y) = -\int_{\mathcal{X},\mathcal{Y}} f(x,y)\log f(x|y)\,dx\,dy.$$

Properties


In contrast to the conditional entropy for discrete random variables, the conditional differential entropy may be negative.
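
For example, for jointly Gaussian variables the conditional density $f(x|y)$ is itself Gaussian, which gives a closed form for $h(X|Y)$ and shows that it can indeed be negative; a minimal sketch with assumed illustrative parameters:

```python
# Sketch: conditional differential entropy of a bivariate Gaussian (in nats).
# If (X, Y) are jointly Gaussian with correlation rho, then X | Y = y is
# Gaussian with variance sigma_x**2 * (1 - rho**2), so
#   h(X|Y) = 0.5 * log(2 * pi * e * sigma_x**2 * (1 - rho**2)).
import numpy as np

sigma_x, rho = 1.0, 0.99                    # assumed illustrative parameters
h_x         = 0.5 * np.log(2 * np.pi * np.e * sigma_x**2)
h_x_given_y = 0.5 * np.log(2 * np.pi * np.e * sigma_x**2 * (1 - rho**2))

print(h_x, h_x_given_y)                     # ≈ 1.419, ≈ -0.539
# h(X|Y) is negative here, which cannot happen for discrete conditional entropy.
```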

As in the discrete case, there is a chain rule for differential entropy:

$$h(Y|X) = h(X,Y) - h(X)$$[3]: 253

Note, however, that this rule may not hold if the differential entropies involved do not exist or are infinite.

Conditional differential entropy is also used in the definition of the mutual information between continuous random variables:

$$\operatorname{I}(X;Y) = h(X) - h(X|Y) = h(Y) - h(Y|X)$$
$h(X|Y) \leq h(X)$ with equality if and only if $X$ and $Y$ are independent.[3]: 253

Relation to estimator error


The conditional differential entropy yields a lower bound on the expected squared error of an estimator. For any random variable $X$, observation $Y$, and estimator $\widehat{X}$ the following holds:[3]: 255

$$\mathbb{E}\left[\bigl(X - \widehat{X}(Y)\bigr)^{2}\right] \geq \frac{1}{2\pi e}\,e^{2h(X|Y)}$$

This is related to theuncertainty principle fromquantum mechanics.
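
As a sanity check on the bound above, the Monte Carlo sketch below uses an assumed jointly Gaussian model ($X$ plus independent Gaussian noise) and the conditional-mean estimator, for which the inequality is met with equality:

```python
# Sketch: Monte Carlo check of E[(X - Xhat(Y))^2] >= exp(2*h(X|Y)) / (2*pi*e)
# for an assumed Gaussian model Y = X + N with the conditional-mean estimator.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
sigma_x, sigma_n = 1.0, 0.5                 # assumed illustrative parameters

x = rng.normal(0.0, sigma_x, n)
y = x + rng.normal(0.0, sigma_n, n)         # noisy observation of X

# Conditional-mean (MMSE) estimator for this Gaussian model:
x_hat = (sigma_x**2 / (sigma_x**2 + sigma_n**2)) * y
mse = np.mean((x - x_hat) ** 2)

# X | Y is Gaussian with variance sigma_x^2 * sigma_n^2 / (sigma_x^2 + sigma_n^2),
# so h(X|Y) = 0.5 * log(2*pi*e*var_cond) in nats.
var_cond = sigma_x**2 * sigma_n**2 / (sigma_x**2 + sigma_n**2)
h_x_given_y = 0.5 * np.log(2 * np.pi * np.e * var_cond)
bound = np.exp(2 * h_x_given_y) / (2 * np.pi * np.e)

print(mse, bound)                           # both ≈ 0.2 here (equality case)
```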

Generalization to quantum theory


In quantum information theory, the conditional entropy is generalized to the conditional quantum entropy. The latter can take negative values, unlike its classical counterpart.


References

  1. ^ "David MacKay: Information Theory, Pattern Recognition and Neural Networks: The Book". www.inference.org.uk. Retrieved 2019-10-25.
  2. ^ Hellman, M.; Raviv, J. (1970). "Probability of error, equivocation, and the Chernoff bound". IEEE Transactions on Information Theory. 16 (4): 368–372. CiteSeerX 10.1.1.131.2865. doi:10.1109/TIT.1970.1054466.
  3. ^ T. Cover; J. Thomas (1991). Elements of Information Theory. Wiley. ISBN 0-471-06259-6.