
Bregman divergence

From Wikipedia, the free encyclopedia
Measure of difference between two points
Figure: Bregman divergence between two points on the real line for the case $F = \exp$; in this case the divergence is asymmetric.

In mathematics, specifically statistics and information geometry, a Bregman divergence or Bregman distance is a measure of difference between two points, defined in terms of a strictly convex function; they form an important class of divergences. When the points are interpreted as probability distributions – notably as either values of the parameter of a parametric model or as a data set of observed values – the resulting distance is a statistical distance. The most basic Bregman divergence is the squared Euclidean distance.

Bregman divergences are similar to metrics, but satisfy neither the triangle inequality (ever) nor symmetry (in general). However, they satisfy a generalization of the Pythagorean theorem, and in information geometry the corresponding statistical manifold is interpreted as a (dually) flat manifold. This allows many techniques of optimization theory to be generalized to Bregman divergences, geometrically as generalizations of least squares.

Bregman divergences are named after Soviet and Israeli mathematician Lev M. Bregman, who introduced the concept in 1967.

Definition


Let $F \colon \Omega \to \mathbb{R}$ be a continuously differentiable, strictly convex function defined on a convex set $\Omega$.

The Bregman distance associated with $F$ for points $p, q \in \Omega$ is the difference between the value of $F$ at point $p$ and the value of the first-order Taylor expansion of $F$ around point $q$ evaluated at point $p$: $$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle.$$
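
The definition translates directly into code. The following is a minimal Python sketch (the helper names bregman_divergence, F and grad_F are illustrative, not from the article) that evaluates $D_F$ from a convex function and its gradient, and checks that $F(x) = \|x\|^2$ recovers the squared Euclidean distance:

    import numpy as np

    def bregman_divergence(F, grad_F, p, q):
        # D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
        return F(p) - F(q) - np.dot(grad_F(q), p - q)

    # F(x) = ||x||^2 is strictly convex; its Bregman divergence is ||p - q||^2.
    F = lambda x: np.dot(x, x)
    grad_F = lambda x: 2.0 * x

    p = np.array([1.0, 2.0])
    q = np.array([3.0, -1.0])
    print(bregman_divergence(F, grad_F, p, q))  # 13.0
    print(np.sum((p - q) ** 2))                 # 13.0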

Properties

- Non-negativity: $D_F(p, q) \geq 0$ for all $p, q$, with equality if and only if $p = q$; this follows from the strict convexity of $F$.
- Convexity in the first argument: $D_F(p, q)$ is convex in $p$, though in general not in $q$.
- Linearity in $F$: $D_{F_1 + \lambda F_2} = D_{F_1} + \lambda D_{F_2}$ for $\lambda \geq 0$.
- Duality: if $F$ is of Legendre type with convex conjugate $F^*$, then $D_F(p, q) = D_{F^*}(\nabla F(q), \nabla F(p))$.
- Mean as minimizer: for a random vector $X$, the expected divergence $\mathbb{E}[D_F(X, s)]$ is minimized over $s$ by the mean $s = \mathbb{E}[X]$.
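
The duality property can be checked numerically. A minimal sketch, assuming the negative entropy $F(p) = \sum_i p_i \log p_i$ on the positive orthant, whose convex conjugate is $F^*(y) = \sum_i e^{y_i - 1}$ (all helper names are illustrative):

    import numpy as np

    def bregman(F, gradF, p, q):
        return F(p) - F(q) - np.dot(gradF(q), p - q)

    # Negative entropy and its convex conjugate (Legendre transform).
    F      = lambda p: np.sum(p * np.log(p))
    gradF  = lambda p: np.log(p) + 1.0
    Fs     = lambda y: np.sum(np.exp(y - 1.0))
    gradFs = lambda y: np.exp(y - 1.0)

    p = np.array([0.5, 1.5, 2.0])
    q = np.array([1.0, 0.5, 3.0])
    lhs = bregman(F, gradF, p, q)
    rhs = bregman(Fs, gradFs, gradF(q), gradF(p))  # arguments swapped in dual space
    print(np.isclose(lhs, rhs))  # True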

Proofs

Non-negativity follows directly from strict convexity: for $p \neq q$ the graph of $F$ lies strictly above its tangent hyperplane at $q$, that is, $F(p) > F(q) + \langle \nabla F(q), p - q \rangle$, hence $D_F(p, q) > 0$; and $D_F(p, p) = 0$ holds by definition.

Classification theorems

- A Bregman divergence $D_f$ is symmetric, that is, $D_f(x, y) = D_f(y, x)$ for all $x, y$, exactly when the generator $f$ is quadratic; in that case $D_f$ is a squared generalized Euclidean (Mahalanobis) distance, $D_f(x, y) = (x - y)^{\top} A (x - y)$ with $A \succ 0$.
Proof
Figure: Bregman divergence interpreted as areas.

For any $x \neq y \in X$, define $r = \|y - x\|$, $v = (y - x)/r$, and $g(t) = f(x + tv)$ for $t \in [0, r]$. Let $z(t) = x + tv$.

Then $g'(t) = \langle \nabla f(z(t)), v \rangle$ for $t \in (0, r)$, and since $\nabla f$ is continuous, also for $t = 0, r$.

Then, from the diagram, we see that in order to have $D_f(x; z(t)) = D_f(z(t); x)$ for all $t \in [0, r]$, $g'(t)$ must be linear on $[0, r]$.
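
Without the figure at hand, the same step can be verified analytically; a minimal derivation, assuming additionally that $g$ is twice differentiable: $$D_f(x; z(t)) = f(x) - g(t) + t\,g'(t), \qquad D_f(z(t); x) = g(t) - f(x) - t\,g'(0),$$ so the symmetry condition reads $g(t) = f(x) + \tfrac{t}{2}\bigl(g'(t) + g'(0)\bigr)$ for all $t \in [0, r]$. Differentiating in $t$ and simplifying gives $g'(t) - g'(0) = t\,g''(t)$, whose only solutions are the linear functions $g'(t) = g'(0) + ct$.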

Thus we find that $\nabla f$ varies linearly along any direction. By the next lemma, $f$ is quadratic. Since $f$ is also strictly convex, it is of the form $f(x) = x^{\top} A x + B^{\top} x + C$, where $A \succ 0$.

Lemma: If $S$ is an open subset of $\mathbb{R}^n$, $f : S \to \mathbb{R}$ has continuous derivative, and for every line segment $[x, x + v] \subset S$ the function $h(t) := \langle \nabla f(x + tv), v \rangle$ is linear in $t$, then $f$ is a quadratic function.

Proof idea: For any quadratic function $q : S \to \mathbb{R}$, the difference $f - q$ still has this derivative-linearity, so we may subtract away a few quadratic functions and show that $f$ becomes zero.

The proof idea can be illustrated fully for the case of $S = \mathbb{R}^2$, so we prove it in this case.

By the derivative-linearity, $f$ is a quadratic function on any line segment in $\mathbb{R}^2$. We subtract away four quadratic functions such that $g := f - q_0 - q_1 - q_2 - q_3$ becomes identically zero on the x-axis, the y-axis, and the $\{x = y\}$ line.

Let $q_0(x, y) = f(0,0) + \nabla f(0,0) \cdot (x, y)$, $q_1(x, y) = A_1 x^2$, $q_2(x, y) = A_2 y^2$, $q_3(x, y) = A_3 xy$, for well-chosen $A_1, A_2, A_3$. Now use $q_0$ to remove the linear term, and use $q_1, q_2, q_3$ respectively to remove the quadratic terms along the three lines.

For any $(x, y) \in \mathbb{R}^2$ other than the origin, there exists a line $l$ through $(x, y)$ that intersects the x-axis, the y-axis, and the $\{x = y\}$ line at three different points. Since $g$ is quadratic on $l$ and is zero at three different points, $g$ is identically zero on $l$; thus $g(x, y) = 0$. Hence $f = q_0 + q_1 + q_2 + q_3$ is quadratic.

The following characterization concerns divergences on $\Gamma_n$, the set of all probability measures on $\{1, 2, \ldots, n\}$, with $n \geq 2$.

Define a divergence on $\Gamma_n$ as any function of type $D : \Gamma_n \times \Gamma_n \to [0, \infty]$ such that $D(x, x) = 0$ for all $x \in \Gamma_n$. Then:

- The only divergence on $\Gamma_n$ that is both a Bregman divergence and an f-divergence is the Kullback–Leibler divergence, when $n \geq 3$; the binary case $n = 2$ admits further examples.[6]

Given a Bregman divergence $D_F$, its "opposite", defined by $D_F^*(v, w) = D_F(w, v)$, is generally not a Bregman divergence. For example, the Kullback–Leibler divergence is both a Bregman divergence and an f-divergence. Its reverse is also an f-divergence, but by the above characterization, the reverse KL divergence cannot be a Bregman divergence.
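
As a numerical illustration (a minimal sketch; the helper names are not from the article), the Kullback–Leibler divergence arises as the Bregman divergence of the negative entropy on the probability simplex, and swapping the arguments exhibits the asymmetry:

    import numpy as np

    def bregman(F, gradF, p, q):
        return F(p) - F(q) - np.dot(gradF(q), p - q)

    neg_entropy = lambda p: np.sum(p * np.log(p))
    grad_neg_entropy = lambda p: np.log(p) + 1.0

    p = np.array([0.6, 0.3, 0.1])
    q = np.array([0.2, 0.5, 0.3])

    kl = np.sum(p * np.log(p / q))  # closed-form KL(p || q)
    print(np.isclose(bregman(neg_entropy, grad_neg_entropy, p, q), kl))  # True
    # Asymmetry: the two argument orderings give different values.
    print(bregman(neg_entropy, grad_neg_entropy, p, q),
          bregman(neg_entropy, grad_neg_entropy, q, p))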

Examples

- Squared Euclidean distance: $F(x) = \|x\|^2$ gives $D_F(p, q) = \|p - q\|^2$.
- Squared Mahalanobis distance: $F(x) = \tfrac{1}{2} x^{\top} Q x$ with $Q \succ 0$ gives $D_F(p, q) = \tfrac{1}{2} (p - q)^{\top} Q (p - q)$.
- Generalized Kullback–Leibler divergence: the negative entropy $F(p) = \sum_i p_i \log p_i$ gives, on the probability simplex, $D_F(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$.
- Itakura–Saito distance: $F(p) = -\sum_i \log p_i$ gives $D_F(p, q) = \sum_i \left( \frac{p_i}{q_i} - \log \frac{p_i}{q_i} - 1 \right)$.
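
For instance, the Itakura–Saito entry can be checked directly; a minimal sketch, using the Burg entropy $F(p) = -\sum_i \log p_i$ from the list above (helper names are illustrative):

    import numpy as np

    def bregman(F, gradF, p, q):
        return F(p) - F(q) - np.dot(gradF(q), p - q)

    # Burg entropy F(p) = -sum(log p_i) yields the Itakura-Saito distance.
    F = lambda p: -np.sum(np.log(p))
    gradF = lambda p: -1.0 / p

    p = np.array([1.0, 2.0, 4.0])
    q = np.array([2.0, 2.0, 1.0])
    direct = np.sum(p / q - np.log(p / q) - 1.0)  # closed form
    print(np.isclose(bregman(F, gradF, p, q), direct))  # True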

Generalizing projective duality


A key tool in computational geometry is the idea of projective duality, which maps points to hyperplanes and vice versa, while preserving incidence and above-below relationships. There are numerous analytical forms of the projective dual: one common form maps the point $p = (p_1, \ldots, p_d)$ to the hyperplane $x_{d+1} = \sum_{i=1}^{d} 2 p_i x_i$. This mapping can be interpreted (identifying the hyperplane with its normal) as the convex conjugate mapping that takes the point $p$ to its dual point $p^* = \nabla F(p)$, where $F$ defines the $d$-dimensional paraboloid $x_{d+1} = \sum_i x_i^2$.
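
A minimal sketch of this dual mapping in the paraboloid case (function names are illustrative): the dual of $p$ is $\nabla F(p) = 2p$, read as the coefficient vector of the hyperplane $x_{d+1} = \sum_i 2 p_i x_i$.

    import numpy as np

    def dual_point(grad_F, p):
        # Convex-conjugate (Legendre) map: p -> p* = grad F(p).
        return grad_F(p)

    # Paraboloid F(x) = sum(x_i^2): grad F(p) = 2p.
    grad_paraboloid = lambda x: 2.0 * x

    p = np.array([1.0, 3.0])
    print(dual_point(grad_paraboloid, p))  # [2. 6.] -> hyperplane x_3 = 2*x_1 + 6*x_2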

If we now replace the paraboloid by an arbitrary convex function, we obtain a different dual mapping that retains the incidence and above-below properties of the standard projective dual. This implies that natural dual concepts in computational geometry like Voronoi diagrams and Delaunay triangulations retain their meaning in distance spaces defined by an arbitrary Bregman divergence. Thus, algorithms from "normal" geometry extend directly to these spaces (Boissonnat, Nielsen and Nock, 2010).

Generalization of Bregman divergences


Bregman divergences can be interpreted as limit cases of skewed Jensen divergences (see Nielsen and Boltz, 2011). Jensen divergences can be generalized using comparative convexity, and limit cases of these generalized skewed Jensen divergences yield generalized Bregman divergences (see Nielsen and Nock, 2017). The Bregman chord divergence[7] is obtained by taking a chord instead of a tangent line.

Bregman divergence on other objects


Bregman divergences can also be defined between matrices, between functions, and between measures (distributions). Bregman divergences between matrices include Stein's loss and von Neumann entropy. Bregman divergences between functions include total squared error, relative entropy, and squared bias; see the references by Frigyik et al. below for definitions and properties. Similarly, Bregman divergences have also been defined over sets, through a submodular set function, which is known as the discrete analog of a convex function. The submodular Bregman divergences subsume a number of discrete distance measures, like the Hamming distance, precision and recall, mutual information, and some other set-based distance measures (see Iyer & Bilmes, 2012 for more details and properties of the submodular Bregman).
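
As an illustration, Stein's loss is the matrix Bregman divergence generated by $F(X) = -\log\det X$ on symmetric positive-definite matrices, giving $D(X, Y) = \operatorname{tr}(XY^{-1}) - \log\det(XY^{-1}) - n$. A minimal sketch (helper names are illustrative):

    import numpy as np

    def stein_loss(X, Y):
        # Bregman divergence of F(X) = -log det X for SPD matrices:
        # D(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n
        n = X.shape[0]
        M = X @ np.linalg.inv(Y)
        return np.trace(M) - np.log(np.linalg.det(M)) - n

    X = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
    Y = np.eye(2)
    print(stein_loss(X, Y))  # > 0
    print(stein_loss(X, X))  # 0.0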

For a list of common matrix Bregman divergences, see Table 15.1 in [8].

Applications


In machine learning, Bregman divergences are used to calculate the bi-tempered logistic loss, which performs better than the softmax function on noisy datasets.[9]

Bregman divergence is used in the formulation of mirror descent, which includes optimization algorithms used in machine learning such as gradient descent and the hedge algorithm.
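
A minimal sketch of mirror descent on the probability simplex, assuming the standard negative-entropy mirror map (which yields the exponentiated-gradient / multiplicative-weights update underlying the hedge algorithm; names and step size are illustrative):

    import numpy as np

    def mirror_descent_step(x, grad, eta):
        # Negative-entropy mirror map on the simplex: multiply by exp(-eta * grad)
        # and renormalize (exponentiated-gradient update).
        y = x * np.exp(-eta * grad)
        return y / y.sum()

    # Minimize f(x) = <c, x> over the simplex; the minimum puts all mass on argmin(c).
    c = np.array([0.8, 0.2, 0.5])
    x = np.ones(3) / 3
    for _ in range(200):
        x = mirror_descent_step(x, c, eta=0.5)  # grad of <c, x> is c
    print(np.round(x, 3))  # mass concentrates on coordinate 1 (smallest cost)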

References

  1. ^ a b "Learning with Bregman Divergences" (PDF). utexas.edu. Retrieved 19 August 2023.
  2. ^ Adamčík, Martin (2014). "The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning". Entropy. 16 (12): 6338–6381. Bibcode:2014Entrp..16.6338A. doi:10.3390/e16126338.
  3. ^ Dhillon, Inderjit; Tropp, Joel (2008). "Matrix Nearness Problems with Bregman Divergences" (PDF). SIAM Journal on Matrix Analysis and Applications. 29 (4): 1120–1146. doi:10.1137/060649021. "Suppose $D_\varphi$ is a Bregman divergence, and suppose that $\{C_k\}$ is a finite collection of closed, convex sets whose intersection is nonempty. Given an input matrix $Y$ our goal is to produce a matrix $X$ in the intersection that diverges the least from $Y$, i.e. to solve $\min_{X} D_\varphi(X; Y)$ subject to $X \in \bigcap_k C_k$. Under mild conditions, the solution is unique and it has a variational characterization analogous with the characterization of an orthogonal projection onto a convex set" (see s2.4, page 1125 for more).
  4. ^ Nielsen, Frank (28 October 2021). "Fast Approximations of the Jeffreys Divergence between Univariate Gaussian Mixtures via Mixture Conversions to Exponential-Polynomial Distributions". Entropy. 23 (11): 1417. arXiv:2107.05901. Bibcode:2021Entrp..23.1417N. doi:10.3390/e23111417. ISSN 1099-4300. PMC 8619509. PMID 34828115.
  5. ^ Nielsen, Frank; Boissonnat, Jean-Daniel; Nock, Richard (September 2010). "Bregman Voronoi Diagrams: Properties, Algorithms and Applications". Discrete & Computational Geometry. 44 (2): 281–307. arXiv:0709.2196. doi:10.1007/s00454-010-9256-1. ISSN 0179-5376. S2CID 1327029.
  6. ^ a b Jiao, Jiantao; Courtade, Thomas; No, Albert; Venkat, Kartik; Weissman, Tsachy (December 2014). "Information Measures: the Curious Case of the Binary Alphabet". IEEE Transactions on Information Theory. 60 (12): 7616–7626. arXiv:1404.6810. Bibcode:2014ITIT...60.7616J. doi:10.1109/TIT.2014.2360184. ISSN 0018-9448. S2CID 13108908.
  7. ^ Nielsen, Frank; Nock, Richard (2019). "The Bregman Chord Divergence". Geometric Science of Information. Lecture Notes in Computer Science. Vol. 11712. pp. 299–308. arXiv:1810.09113. doi:10.1007/978-3-030-26980-7_31. ISBN 978-3-030-26979-1. S2CID 53046425.
  8. ^ "Matrix Information Geometry", R. Nock, B. Magdalou, E. Briys and F. Nielsen, pdf, from this book.
  9. ^ Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren (2019). "Robust Bi-Tempered Logistic Loss Based on Bregman Divergences". Conference on Neural Information Processing Systems. pp. 14987–14996. pdf
Retrieved from "https://en.wikipedia.org/w/index.php?title=Bregman_divergence&oldid=1324351228"
Categories:
Hidden categories:

[8]ページ先頭

©2009-2025 Movatter.jp