The Bregman distance associated with $F$ for points $p, q$ is the difference between the value of $F$ at point $p$ and the value of the first-order Taylor expansion of $F$ around point $q$ evaluated at point $p$:
$$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle.$$
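As a minimal numerical sketch of this definition (the helper name `bregman` and the use of NumPy are our own illustrative choices, not part of the original text), the divergence can be evaluated directly once $F$ and its gradient are supplied; with $F(x) = \|x\|^2$ it reproduces the squared Euclidean distance:

```python
import numpy as np

def bregman(F, grad_F, p, q):
    """Evaluate D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# Squared Euclidean distance arises from F(x) = ||x||^2.
F = lambda x: np.dot(x, x)
grad_F = lambda x: 2.0 * x

p, q = np.array([1.0, 2.0]), np.array([0.0, -1.0])
print(bregman(F, grad_F, p, q))   # 10.0
print(np.sum((p - q) ** 2))       # 10.0, matches
```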
Non-negativity: $D_F(p, q) \ge 0$ for all $p, q$. This is a consequence of the convexity of $F$.
Positivity: When $F$ is strictly convex, $D_F(p, q) = 0$ if and only if $p = q$.
Uniqueness up to affine difference: $D_F = D_G$ if and only if $F - G$ is an affine function.
Convexity: $D_F(p, q)$ is convex in its first argument, but not necessarily in the second argument. If $F$ is strictly convex, then $D_F(p, q)$ is strictly convex in its first argument.
For example, take $f(x) = |x|$, smooth it at $0$, and fix the first argument (say $y = 1$); then the map $x \mapsto D_f(y, x)$ fails to be convex in the second argument.
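For a numerical illustration of this failure of convexity in the second argument (the particular smoothing $f(x) = \sqrt{x^2 + \varepsilon^2}$ and the sample points below are our own choices, meant only as a sketch), one can check that midpoint convexity is violated:

```python
import numpy as np

eps = 0.1
f  = lambda x: np.sqrt(x**2 + eps**2)          # one possible smoothing of |x| at 0
df = lambda x: x / np.sqrt(x**2 + eps**2)

def D(y, x):
    """Bregman divergence of the smoothed absolute value."""
    return f(y) - f(x) - df(x) * (y - x)

y, x1, x2 = 1.0, -0.3, -0.05
mid = 0.5 * (x1 + x2)
print(D(y, mid))                       # ~ 1.82
print(0.5 * (D(y, x1) + D(y, x2)))     # ~ 1.64 < D(y, mid): not convex in x
```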
Linearity: If we think of the Bregman distance as an operator on the function $F$, then it is linear with respect to non-negative coefficients. In other words, for $F_1, F_2$ strictly convex and differentiable, and $\lambda \ge 0$,
$$D_{F_1 + \lambda F_2}(p, q) = D_{F_1}(p, q) + \lambda D_{F_2}(p, q).$$
Duality: If $F$ is strictly convex, then $F$ has a convex conjugate $F^*$ which is also strictly convex and continuously differentiable on some convex set $X^*$. The Bregman distance defined with respect to $F^*$ is dual to $D_F(p, q)$ as
$$D_{F^*}(p^*, q^*) = D_F(q, p).$$
Here, $p^* = \nabla F(p)$ and $q^* = \nabla F(q)$ are the dual points corresponding to $p$ and $q$. Moreover, using the same notations:
$$D_F(p, q) = F(p) + F^*(q^*) - \langle p, q^* \rangle.$$
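As an illustrative one-dimensional check of this duality (the choice $F(x) = e^x$, with conjugate $F^*(y) = y \log y - y$, is our own and not from the text):

```python
import numpy as np

F, dF         = np.exp, np.exp                            # F(x) = e^x
Fstar, dFstar = (lambda y: y * np.log(y) - y), np.log     # its convex conjugate

def breg(G, dG, a, b):
    return G(a) - G(b) - dG(b) * (a - b)

p, q = 0.7, -0.4
p_star, q_star = dF(p), dF(q)                 # dual points p* = F'(p), q* = F'(q)
print(breg(Fstar, dFstar, p_star, q_star))    # D_{F*}(p*, q*) ~ 0.8717
print(breg(F, dF, q, p))                      # D_F(q, p)      ~ 0.8717
```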
Integral form: by the integral remainder form of Taylor's theorem, a Bregman divergence can be written as the integral of the Hessian of $F$ along the line segment between the Bregman divergence's arguments:
$$D_F(p, q) = \int_0^1 (1 - t)\, (p - q)^\top \nabla^2 F\big(q + t(p - q)\big)\, (p - q)\; dt.$$
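A quick numerical confirmation of the integral form in one dimension (with $F(x) = x^4$, an arbitrary choice for illustration):

```python
import numpy as np

F   = lambda x: x**4
dF  = lambda x: 4.0 * x**3
d2F = lambda x: 12.0 * x**2          # the "Hessian" in one dimension
p, q = 1.5, 0.3

# Direct evaluation of D_F(p, q)
direct = F(p) - F(q) - dF(q) * (p - q)

# Integral (Taylor-remainder) form, approximated by a midpoint rule on [0, 1]
n = 200000
t = (np.arange(n) + 0.5) / n
integral = np.mean((1.0 - t) * d2F(q + t * (p - q)) * (p - q) ** 2)

print(direct, integral)              # both ~ 4.9248
```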
Mean as minimizer: A key result about Bregman divergences is that, given a random vector $X$, the mean vector $\mathbb{E}[X]$ minimizes the expected Bregman divergence from the random vector, i.e. $\mathbb{E}[X] = \arg\min_s \mathbb{E}\,[D_F(X, s)]$. This result generalizes the textbook result that the mean of a set minimizes the total squared error to the elements of the set. It was proved for the vector case by (Banerjee et al. 2005), and extended to the case of functions/distributions by (Frigyik et al. 2008). This result is important because it further justifies using a mean as a representative of a random set, particularly in Bayesian estimation.
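The following sketch illustrates the mean-as-minimizer property numerically for the generalized KL divergence on the positive reals (the gamma-distributed sample and the grid search are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=100000)   # a positive random sample, mean ~ 3

# Generalized KL divergence on the positive reals, generated by F(t) = t*log(t)
D = lambda a, s: a * np.log(a / s) - a + s

s_grid   = np.linspace(0.5, 6.0, 2000)
expected = np.array([D(x, s).mean() for s in s_grid])

print(s_grid[expected.argmin()])   # minimizer of E[D(X, s)]
print(x.mean())                    # ~ the same value
```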
Bregman balls are bounded, and compact if X is closed: Define the Bregman ball centered at $x$ with radius $r$ by $B_F(x, r) := \{y \in X : D_F(y, x) \le r\}$. When $X \subset \mathbb{R}^n$ is finite dimensional, for every $x \in X$, if $x$ is in the relative interior of $X$, or if $X$ is locally closed at $x$ (that is, there exists a closed ball $B(x, \epsilon)$ centered at $x$ such that $B(x, \epsilon) \cap X$ is closed), then $B_F(x, r)$ is bounded for all $r$. If $X$ is closed, then $B_F(x, r)$ is compact for all $r$.
Parallelogram law: for any $p, q, z$,
$$D_F(p, z) + D_F(q, z) = D_F\!\Big(p, \tfrac{p + q}{2}\Big) + D_F\!\Big(q, \tfrac{p + q}{2}\Big) + 2\, D_F\!\Big(\tfrac{p + q}{2}, z\Big).$$
[Figure: Generalized Pythagorean theorem for Bregman divergence.[2]]
Bregman projection: For any $W \subset X$, define the "Bregman projection" of $q$ onto $W$:
$$P_W(q) = \operatorname*{arg\,min}_{\omega \in W} D_F(\omega, q).$$
Then (a numerical sketch of such a projection follows the list below):
if $W$ is convex, then the projection is unique if it exists;
if $W$ is nonempty, closed, and convex and $X \subset \mathbb{R}^n$ is finite dimensional, then the projection exists and is unique.[3]
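As a minimal sketch of computing such a projection numerically (the generalized KL divergence, the box constraint, and the use of scipy.optimize.minimize are our own illustrative choices; closed-form projections exist in many special cases):

```python
import numpy as np
from scipy.optimize import minimize

# Generalized KL divergence, the Bregman divergence of F(x) = sum_i x_i log x_i
def D_kl(w, q):
    return np.sum(w * np.log(w / q) - w + q)

q = np.array([0.2, 1.7, 3.0])        # point to be projected
bounds = [(0.5, 2.0)] * 3            # W: a box, which is nonempty, closed and convex

res = minimize(lambda w: D_kl(w, q), x0=np.full(3, 1.0), bounds=bounds)
print(res.x)   # the unique Bregman projection, ~ [0.5, 1.7, 2.0]
```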
Generalized Pythagorean Theorem:[1] For any $w \in W$ and $v \in X$,
$$D_F(w, v) \ge D_F(w, P_W(v)) + D_F(P_W(v), v).$$
This is an equality if $P_W(v)$ is in the relative interior of $W$. In particular, this always happens when $W$ is an affine set.
Lack of triangle inequality: Since the Bregman divergence is essentially a generalization of squared Euclidean distance, there is no triangle inequality in general. Indeed, $D_F(p, q) + D_F(q, r) - D_F(p, r) = \langle \nabla F(r) - \nabla F(q),\, p - q \rangle$, which may be positive or negative.
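A small numeric example of this failure, using the generalized KL divergence on the positive reals (the points are our own choices):

```python
import numpy as np

# Generalized KL divergence on the positive reals (F(t) = t*log(t))
D = lambda a, b: a * np.log(a / b) - a + b

p, q, r = 1.0, 2.0, 4.0
print(D(p, q) + D(q, r))                    # ~ 0.921
print(D(p, r))                              # ~ 1.614, so the triangle inequality fails
print((np.log(r) - np.log(q)) * (p - q))    # ~ -0.693, the inner-product identity above
```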
Proof that Bregman balls are bounded, and compact if X is closed:
Fix $x \in X$. Apply an affine shift to $F$ (which leaves $D_F$ unchanged) so that $\nabla F(x) = 0$.
Take some $\epsilon > 0$ such that the sphere $\partial B(x, \epsilon)$ is contained in $X$. Then consider the "radial-directional" derivative of $F$ on this Euclidean sphere, $\big\langle \nabla F(y),\, \tfrac{y - x}{\|y - x\|} \big\rangle$ for all $y \in \partial B(x, \epsilon)$.
Since $\partial B(x, \epsilon)$ is compact, this derivative achieves its minimal value $\delta$ at some $y_0 \in \partial B(x, \epsilon)$.
Since $F$ is strictly convex and $\nabla F(x) = 0$, we have $\langle \nabla F(y), y - x \rangle > 0$ for all $y \neq x$, so $\delta > 0$. Then, by convexity, $D_F(y, x) = F(y) - F(x) \ge \delta(\|y - x\| - \epsilon)$ whenever $\|y - x\| \ge \epsilon$, so $B_F(x, r)$ is contained in the Euclidean ball $B(x, \epsilon + r/\delta)$ and is therefore bounded for all $r$.
Since $\nabla F$ is continuous, $D_F(\cdot, x)$ is continuous, thus $B_F(x, r)$ is closed if $X$ is closed, and, being also bounded, compact.
Proof that the projection is well-defined when $W$ is closed and convex: Fix $v \in X$. Take some $w_0 \in W$, then let $r := D_F(w_0, v)$ and consider the set $W \cap B_F(v, r)$. It is closed and bounded, thus compact. Since $D_F(\cdot, v)$ is continuous and strictly convex on it, and bounded below by $0$, it achieves a unique minimum on it.
Proof of the Pythagorean inequality: By the cosine law, $D_F(w, v) - D_F(w, P_W(v)) - D_F(P_W(v), v) = \langle \nabla F(v) - \nabla F(P_W(v)),\, P_W(v) - w \rangle$, which must be $\ge 0$, since $P_W(v)$ minimizes $D_F(\cdot, v)$ in $W$, and $W$ is convex.
Proof of the Pythagorean equality when $P_W(v)$ is in the relative interior of $W$:
If $\langle \nabla F(v) - \nabla F(P_W(v)),\, P_W(v) - w \rangle > 0$ for some $w \in W$, then since $P_W(v)$ is in the relative interior of $W$, we can move from $P_W(v)$ a short distance in the direction opposite of $w$ while staying in $W$, thereby decreasing $D_F(\cdot, v)$, a contradiction. Hence the inner product is $0$ and equality holds.
To see which Bregman divergences are symmetric, suppose $D_F(x, y) = D_F(y, x)$ for all $x, y$. Then, from the (omitted) diagram interpreting Bregman divergences as areas, we see that this forces $\nabla F$ to be linear on the segment $[x, y]$.
Thus we find that $\nabla F$ varies linearly along any direction. By the next lemma, $F$ is quadratic. Since $F$ is also strictly convex, it is of the form $F(x) = \tfrac{1}{2} x^\top A x + b^\top x + c$, where $A$ is positive definite, and the corresponding divergence is a squared Mahalanobis distance.
Lemma: If $S$ is an open subset of $\mathbb{R}^n$, $f: S \to \mathbb{R}$ has continuous derivative, and for every line segment $[x, x + v] \subset S$ the function $h(t) := \langle \nabla f(x + t v),\, v \rangle$ is linear in $t$, then $f$ is a quadratic function.
Proof idea: For any quadratic function $q$, the difference $f - q$ still has such derivative-linearity, so we will subtract away a few quadratic functions and show that $f$ becomes zero.
The proof idea can be illustrated fully for the case of $n = 2$, so we prove it in this case.
By the derivative-linearity, $f$ is a quadratic function on any line segment in $\mathbb{R}^2$. We subtract away four quadratic functions, such that the difference becomes identically zero on the x-axis, the y-axis, and the diagonal line $\{x = y\}$.
Let $g := f - q_0 - q_1 - q_2 - q_3$, for well-chosen quadratic functions $q_0, q_1, q_2, q_3$. Now use $q_0$ to remove the linear term, and use $q_1, q_2, q_3$ respectively to remove the quadratic terms along the three lines.
For any point $z$ not at the origin, there exists a line through $z$ that intersects the x-axis, the y-axis, and the diagonal at three different points. Since $g$ is quadratic on that line, and is zero at three different points, $g$ is identically zero on it, thus $g(z) = 0$. Hence $g \equiv 0$ and $f$ is quadratic.
The following two characterizations are for divergences on $\Gamma_n$, the set of all probability measures on $\{1, 2, \ldots, n\}$, with $n \ge 2$.
Define a divergence on $\Gamma_n$ as any function of type $D: \Gamma_n \times \Gamma_n \to [0, \infty]$, such that $D(x, x) = 0$ for all $x \in \Gamma_n$. Then:
The only divergence on $\Gamma_n$ that is both a Bregman divergence and an f-divergence is the Kullback–Leibler divergence.
If $n \ge 3$, then any Bregman divergence on $\Gamma_n$ that satisfies the data processing inequality must be the Kullback–Leibler divergence. (In fact, a weaker assumption of "sufficiency" is enough.) Counterexamples exist when $n = 2$.[6]
Given a Bregman divergence $D_F$, its "opposite", defined by $D_F^*(p, q) := D_F(q, p)$, is generally not a Bregman divergence. For example, the Kullback–Leibler divergence is both a Bregman divergence and an f-divergence. Its reverse is also an f-divergence, but by the above characterization, the reverse KL divergence cannot be a Bregman divergence.
The canonical example of a Bregman distance is the squared Euclidean distance $D_F(p, q) = \|p - q\|^2$, generated by the convex function $F(x) = \|x\|^2$.
The squared Mahalanobis distance $D_F(p, q) = \tfrac{1}{2}(p - q)^\top Q (p - q)$ is generated by the convex quadratic form $F(x) = \tfrac{1}{2} x^\top Q x$. The squared Euclidean distance is the special case where $Q$ is the identity, i.e. for $F(x) = \tfrac{1}{2}\|x\|^2$. As noted above, affine differences, i.e. lower-order terms added to $F$, are irrelevant to $D_F$.
The generalized Kullback–Leibler divergence
$$D_F(p, q) = \sum_i p_i \log\frac{p_i}{q_i} - \sum_i p_i + \sum_i q_i$$
is generated by the negative entropy function
$$F(p) = \sum_i p_i \log p_i.$$
When restricted to the simplex, the last two terms cancel, giving the usual Kullback–Leibler divergence for distributions.
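Both examples above can be generated from a single generic routine once $F$ and $\nabla F$ are supplied (an illustrative sketch; the helper names and test points are our own choices):

```python
import numpy as np

def bregman(F, grad_F, p, q):
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# Squared Mahalanobis distance from F(x) = 1/2 x^T Q x
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
F_mah, grad_mah = (lambda x: 0.5 * x @ Q @ x), (lambda x: Q @ x)

# Generalized KL divergence from negative entropy F(x) = sum_i x_i log x_i
F_ent, grad_ent = (lambda x: np.sum(x * np.log(x))), (lambda x: np.log(x) + 1.0)

p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])   # both on the simplex
print(bregman(F_mah, grad_mah, p, q), 0.5 * (p - q) @ Q @ (p - q))   # ~ 0.09, 0.09
print(bregman(F_ent, grad_ent, p, q), np.sum(p * np.log(p / q)))     # ~ 0.193, 0.193
```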
A key tool in computational geometry is the idea of projective duality, which maps points to hyperplanes and vice versa, while preserving incidence and above-below relationships. There are numerous analytical forms of the projective dual: one common form maps the point $p = (p_1, \ldots, p_d)$ to the hyperplane $x_{d+1} = \sum_{i=1}^{d} 2 p_i x_i - \sum_{i=1}^{d} p_i^2$. This mapping can be interpreted (identifying the hyperplane with its normal) as the convex conjugate mapping that takes the point $p$ to its dual point $p^* = \nabla F(p)$, where $F$ defines the $d$-dimensional paraboloid $x_{d+1} = \sum_i x_i^2$.
If we now replace the paraboloid by an arbitrary convex function, we obtain a different dual mapping that retains the incidence and above-below properties of the standard projective dual. This implies that natural dual concepts in computational geometry like Voronoi diagrams and Delaunay triangulations retain their meaning in distance spaces defined by an arbitrary Bregman divergence. Thus, algorithms from "normal" geometry extend directly to these spaces (Boissonnat, Nielsen and Nock, 2010).
Bregman divergences can be interpreted as limit cases of skewed Jensen divergences (see Nielsen and Boltz, 2011). Jensen divergences can be generalized using comparative convexity, and limit cases of these generalized skewed Jensen divergences yield generalized Bregman divergences (see Nielsen and Nock, 2017). The Bregman chord divergence[7] is obtained by taking a chord instead of a tangent line.
Bregman divergences can also be defined between matrices, between functions, and between measures (distributions). Bregman divergences between matrices include Stein's loss and the von Neumann entropy. Bregman divergences between functions include total squared error, relative entropy, and squared bias; see the references by Frigyik et al. below for definitions and properties. Similarly, Bregman divergences have also been defined over sets, through a submodular set function, which is known as the discrete analog of a convex function. The submodular Bregman divergences subsume a number of discrete distance measures, like the Hamming distance, precision and recall, mutual information, and some other set-based distance measures (see Iyer & Bilmes, 2012, for more details and properties of the submodular Bregman divergences).
For a list of common matrix Bregman divergences, see Table 15.1 in [8].
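As one concrete matrix instance, here is a sketch of the von Neumann divergence on positive definite matrices, generated by $F(X) = \operatorname{tr}(X \log X - X)$ (the helper and test matrices below are our own illustrative choices, not taken from the cited table):

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_divergence(X, Y):
    """Bregman divergence of F(X) = tr(X log X - X) on positive definite matrices."""
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[1.5, 0.0], [0.0, 1.2]])
print(von_neumann_divergence(A, B))    # > 0
print(von_neumann_divergence(A, A))    # ~ 0
```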
In machine learning, Bregman divergences are used to calculate the bi-tempered logistic loss, performing better than the softmax function with noisy datasets.[9]
^ Dhillon, Inderjit; Tropp, Joel (2008). "Matrix Nearness Problems with Bregman Divergence" (PDF). SIAM Journal on Matrix Analysis and Applications. 29 (4): 1120–1146. doi:10.1137/060649021. "Suppose … is a Bregman divergence, suppose that … is a finite collection of closed, convex sets whose intersection is nonempty. Given an input matrix Y our goal is to produce a matrix X in the intersection that diverges the least from Y, i.e. to solve … subject to …. Under mild conditions, the solution is unique and it has a variational characterization analogous with the characterization of an orthogonal projection onto a convex set" (see s2.4, page 1125 for more).
^"Matrix Information Geometry", R. Nock, B. Magdalou, E. Briys and F. Nielsen,pdf, from thisbook
^Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren (2019). "Robust Bi-Tempered Logistic Loss Based on Bregman Divergences". Conference on Neural Information Processing Systems. pp. 14987-14996.pdf
Bregman, L. M. (1967). "The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming". USSR Computational Mathematics and Mathematical Physics. 7 (3): 200–217. doi:10.1016/0041-5553(67)90040-7.
Frigyik, Bela A.; Srivastava, Santosh; Gupta, Maya R. (2008). An Introduction to Functional Derivatives (PDF). UWEE Tech Report 2008-0001. University of Washington, Dept. of Electrical Engineering. Archived from the original (PDF) on 17 February 2017. Retrieved 20 March 2014.
Nielsen, Frank; Nock, Richard (2006). "On approximating the smallest enclosing Bregman Balls". Proc. 22nd ACM Symposium on Computational Geometry. pp. 485–486. doi:10.1145/1137856.1137931.