Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

Flow-based generative model

From Wikipedia, the free encyclopedia
(Redirected fromNormalizing flow)
Statistical model used in machine learning
Part of a series on
Machine learning
anddata mining

Aflow-based generative model is agenerative model used inmachine learning that explicitly models aprobability distribution by leveragingnormalizing flow,[1][2][3] which is a statistical method using thechange-of-variable law of probabilities to transform a simple distribution into a complex one.

The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as theloss function. Additionally, novel samples can be generated by sampling from the initial distribution, and applying the flow transformation.

In contrast, many alternative generative modeling methods such asvariational autoencoder (VAE) andgenerative adversarial network do not explicitly represent the likelihood function.

Method

[edit]
Scheme for normalizing flows

Letz0{\displaystyle z_{0}} be a (possibly multivariate)random variable with distributionp0(z0){\displaystyle p_{0}(z_{0})}.

Fori=1,...,K{\displaystyle i=1,...,K}, letzi=fi(zi1){\displaystyle z_{i}=f_{i}(z_{i-1})} be a sequence of random variables transformed fromz0{\displaystyle z_{0}}. The functionsf1,...,fK{\displaystyle f_{1},...,f_{K}} should be invertible, i.e. theinverse functionfi1{\displaystyle f_{i}^{-1}} exists. The final outputzK{\displaystyle z_{K}} models the target distribution.

The log likelihood ofzK{\displaystyle z_{K}} is (seederivation):

logpK(zK)=logp0(z0)i=1Klog|detdfi(zi1)dzi1|{\displaystyle \log p_{K}(z_{K})=\log p_{0}(z_{0})-\sum _{i=1}^{K}\log \left|\det {\frac {df_{i}(z_{i-1})}{dz_{i-1}}}\right|}

Learning probability distributions by differentiating such log Jacobians originated in the Infomax (maximum likelihood) approach to ICA,[4] which forms a single-layer (K=1) flow-based model. Relatedly, the single layer precursor of conditional generative flows appeared in.[5]

To efficiently compute the log likelihood, the functionsf1,...,fK{\displaystyle f_{1},...,f_{K}} should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functionsf1,...,fK{\displaystyle f_{1},...,f_{K}} are modeled usingdeep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,[6] RealNVP,[7] and Glow.[8]

Derivation of log likelihood

[edit]

Considerz1{\displaystyle z_{1}} andz0{\displaystyle z_{0}}. Note thatz0=f11(z1){\displaystyle z_{0}=f_{1}^{-1}(z_{1})}.

By thechange of variable formula, the distribution ofz1{\displaystyle z_{1}} is:

p1(z1)=p0(z0)|detdf11(z1)dz1|{\displaystyle p_{1}(z_{1})=p_{0}(z_{0})\left|\det {\frac {df_{1}^{-1}(z_{1})}{dz_{1}}}\right|}

Wheredetdf11(z1)dz1{\displaystyle \det {\frac {df_{1}^{-1}(z_{1})}{dz_{1}}}} is thedeterminant of theJacobian matrix off11{\displaystyle f_{1}^{-1}}.

By theinverse function theorem:

p1(z1)=p0(z0)|det(df1(z0)dz0)1|{\displaystyle p_{1}(z_{1})=p_{0}(z_{0})\left|\det \left({\frac {df_{1}(z_{0})}{dz_{0}}}\right)^{-1}\right|}

By the identitydet(A1)=det(A)1{\displaystyle \det(A^{-1})=\det(A)^{-1}} (whereA{\displaystyle A} is aninvertible matrix), we have:

p1(z1)=p0(z0)|detdf1(z0)dz0|1{\displaystyle p_{1}(z_{1})=p_{0}(z_{0})\left|\det {\frac {df_{1}(z_{0})}{dz_{0}}}\right|^{-1}}

The log likelihood is thus:

logp1(z1)=logp0(z0)log|detdf1(z0)dz0|{\displaystyle \log p_{1}(z_{1})=\log p_{0}(z_{0})-\log \left|\det {\frac {df_{1}(z_{0})}{dz_{0}}}\right|}

In general, the above applies to anyzi{\displaystyle z_{i}} andzi1{\displaystyle z_{i-1}}. Sincelogpi(zi){\displaystyle \log p_{i}(z_{i})} is equal tologpi1(zi1){\displaystyle \log p_{i-1}(z_{i-1})} subtracted by a non-recursive term, we can infer byinduction that:

logpK(zK)=logp0(z0)i=1Klog|detdfi(zi1)dzi1|{\displaystyle \log p_{K}(z_{K})=\log p_{0}(z_{0})-\sum _{i=1}^{K}\log \left|\det {\frac {df_{i}(z_{i-1})}{dz_{i-1}}}\right|}

Training method

[edit]

As is generally done when training a deep learning model, the goal with normalizing flows is to minimize theKullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denotingpθ{\displaystyle p_{\theta }} the model's likelihood andp{\displaystyle p^{*}} the target distribution to learn, the (forward) KL-divergence is:

DKL[p(x)pθ(x)]=Ep(x)[logpθ(x)]+Ep(x)[logp(x)]{\displaystyle D_{\text{KL}}[p^{*}(x)\|p_{\theta }(x)]=-\mathop {\mathbb {E} } _{p^{*}(x)}[\log p_{\theta }(x)]+\mathop {\mathbb {E} } _{p^{*}(x)}[\log p^{*}(x)]}

The second term on the right-hand side of the equation corresponds to the entropy of the target distribution and is independent of the parameterθ{\displaystyle \theta } we want the model to learn, which only leaves the expectation of the negative log-likelihood to minimize under the target distribution. This intractable term can be approximated with a Monte-Carlo method byimportance sampling. Indeed, if we have a dataset{xi}i=1N{\displaystyle \{x_{i}\}_{i=1}^{N}} of samples each independently drawn from the target distributionp(x){\displaystyle p^{*}(x)}, then this term can be estimated as:

E^p(x)[logpθ(x)]=1Ni=0Nlogpθ(xi){\displaystyle -{\hat {\mathop {\mathbb {E} } }}_{p^{*}(x)}[\log p_{\theta }(x)]=-{\frac {1}{N}}\sum _{i=0}^{N}\log p_{\theta }(x_{i})}

Therefore, the learning objective

argminθ DKL[p(x)pθ(x)]{\displaystyle {\underset {\theta }{\operatorname {arg\,min} }}\ D_{\text{KL}}[p^{*}(x)\|p_{\theta }(x)]}

is replaced by

argmaxθ i=0Nlogpθ(xi){\displaystyle {\underset {\theta }{\operatorname {arg\,max} }}\ \sum _{i=0}^{N}\log p_{\theta }(x_{i})}

In other words, minimizing theKullback–Leibler divergence between the model's likelihood and the target distribution is equivalent tomaximizing the model likelihood under observed samples of the target distribution.[9]

A pseudocode for training normalizing flows is as follows:[10]

Variants

[edit]

Planar Flow

[edit]

The earliest example.[11] Fix some activation functionh{\displaystyle h}, and letθ=(u,w,b){\displaystyle \theta =(u,w,b)} with the appropriate dimensions, thenx=fθ(z)=z+uh(w,z+b){\displaystyle x=f_{\theta }(z)=z+uh(\langle w,z\rangle +b)}The inversefθ1{\displaystyle f_{\theta }^{-1}} has no closed-form solution in general.

The Jacobian is|det(I+h(w,z+b)uwT)|=|1+h(w,z+b)u,w|{\displaystyle |\det(I+h'(\langle w,z\rangle +b)uw^{T})|=|1+h'(\langle w,z\rangle +b)\langle u,w\rangle |}.

For it to be invertible everywhere, it must be nonzero everywhere. For example,h=tanh{\displaystyle h=\tanh } andu,w>1{\displaystyle \langle u,w\rangle >-1} satisfies the requirement.

Nonlinear Independent Components Estimation (NICE)

[edit]

Letx,zR2n{\displaystyle x,z\in \mathbb {R} ^{2n}} be even-dimensional, and split them in the middle.[6] Then the normalizing flow functions arex=[x1x2]=fθ(z)=[z1z2]+[0mθ(z1)]{\displaystyle x={\begin{bmatrix}x_{1}\\x_{2}\end{bmatrix}}=f_{\theta }(z)={\begin{bmatrix}z_{1}\\z_{2}\end{bmatrix}}+{\begin{bmatrix}0\\m_{\theta }(z_{1})\end{bmatrix}}}wheremθ{\displaystyle m_{\theta }} is any neural network with weightsθ{\displaystyle \theta }.

fθ1{\displaystyle f_{\theta }^{-1}} is justz1=x1,z2=x2mθ(x1){\displaystyle z_{1}=x_{1},z_{2}=x_{2}-m_{\theta }(x_{1})}, and the Jacobian is just 1, that is, the flow is volume-preserving.

Whenn=1{\displaystyle n=1}, this is seen as a curvy shearing along thex2{\displaystyle x_{2}} direction.

Real Non-Volume Preserving (Real NVP)

[edit]

The Real Non-Volume Preserving model generalizes NICE model by:[7]x=[x1x2]=fθ(z)=[z1esθ(z1)z2]+[0mθ(z1)]{\displaystyle x={\begin{bmatrix}x_{1}\\x_{2}\end{bmatrix}}=f_{\theta }(z)={\begin{bmatrix}z_{1}\\e^{s_{\theta }(z_{1})}\odot z_{2}\end{bmatrix}}+{\begin{bmatrix}0\\m_{\theta }(z_{1})\end{bmatrix}}}

Its inverse isz1=x1,z2=esθ(x1)(x2mθ(x1)){\displaystyle z_{1}=x_{1},z_{2}=e^{-s_{\theta }(x_{1})}\odot (x_{2}-m_{\theta }(x_{1}))}, and its Jacobian isi=1nesθ(z1,){\displaystyle \prod _{i=1}^{n}e^{s_{\theta }(z_{1,})}}. The NICE model is recovered by settingsθ=0{\displaystyle s_{\theta }=0}.Since the Real NVP map keeps the first and second halves of the vectorx{\displaystyle x} separate, it's usually required to add a permutation(x1,x2)(x2,x1){\displaystyle (x_{1},x_{2})\mapsto (x_{2},x_{1})} after every Real NVP layer.

Generative Flow (Glow)

[edit]

In generative flow model,[8] each layer has 3 parts:

The idea of using the invertible 1x1 convolution is to permute all layers in general, instead of merely permuting the first and second half, as in Real NVP.

Masked Autoregressive Flow (MAF)

[edit]

An autoregressive model of a distribution onRn{\displaystyle \mathbb {R} ^{n}} is defined as the following stochastic process:[12]

x1N(μ1,σ12)x2N(μ2(x1),σ2(x1)2)xnN(μn(x1:n1),σn(x1:n1)2){\displaystyle {\begin{aligned}x_{1}\sim &N(\mu _{1},\sigma _{1}^{2})\\x_{2}\sim &N(\mu _{2}(x_{1}),\sigma _{2}(x_{1})^{2})\\&\cdots \\x_{n}\sim &N(\mu _{n}(x_{1:n-1}),\sigma _{n}(x_{1:n-1})^{2})\\\end{aligned}}}whereμi:Ri1R{\displaystyle \mu _{i}:\mathbb {R} ^{i-1}\to \mathbb {R} } andσi:Ri1(0,){\displaystyle \sigma _{i}:\mathbb {R} ^{i-1}\to (0,\infty )} are fixed functions that define the autoregressive model.

By thereparameterization trick, the autoregressive model is generalized to a normalizing flow:x1=μ1+σ1z1x2=μ2(x1)+σ2(x1)z2xn=μn(x1:n1)+σn(x1:n1)zn{\displaystyle {\begin{aligned}x_{1}=&\mu _{1}+\sigma _{1}z_{1}\\x_{2}=&\mu _{2}(x_{1})+\sigma _{2}(x_{1})z_{2}\\&\cdots \\x_{n}=&\mu _{n}(x_{1:n-1})+\sigma _{n}(x_{1:n-1})z_{n}\\\end{aligned}}}The autoregressive model is recovered by settingzN(0,In){\displaystyle z\sim N(0,I_{n})}.

The forward mapping is slow (because it's sequential), but the backward mapping is fast (because it's parallel).

The Jacobian matrix is lower-diagonal, so the Jacobian isσ1σ2(x1)σn(x1:n1){\displaystyle \sigma _{1}\sigma _{2}(x_{1})\cdots \sigma _{n}(x_{1:n-1})}.

Reversing the two mapsfθ{\displaystyle f_{\theta }} andfθ1{\displaystyle f_{\theta }^{-1}} of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.[13]

Continuous Normalizing Flow (CNF)

[edit]

Instead of constructing flow by function composition, another approach is to formulate the flow as a continuous-time dynamic.[14][15] Letz0{\displaystyle z_{0}} be the latent variable with distributionp(z0){\displaystyle p(z_{0})}. Map this latent variable to data space with the following flow function:

x=F(z0)=zT=z0+0Tf(zt,t)dt{\displaystyle x=F(z_{0})=z_{T}=z_{0}+\int _{0}^{T}f(z_{t},t)dt}

wheref{\displaystyle f} is an arbitrary function and can be modeled with e.g. neural networks.

The inverse function is then naturally:[14]

z0=F1(x)=zT+T0f(zt,t)dt=zT0Tf(zt,t)dt{\displaystyle z_{0}=F^{-1}(x)=z_{T}+\int _{T}^{0}f(z_{t},t)dt=z_{T}-\int _{0}^{T}f(z_{t},t)dt}

And the log-likelihood ofx{\displaystyle x} can be found as:[14]

log(p(x))=log(p(z0))0TTr[fzt]dt{\displaystyle \log(p(x))=\log(p(z_{0}))-\int _{0}^{T}{\text{Tr}}\left[{\frac {\partial f}{\partial z_{t}}}\right]dt}

Since the trace depends only on the diagonal of the Jacobianztf{\displaystyle \partial _{z_{t}}f}, this allows "free-form" Jacobian.[16] Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be only upper- or lower-diagonal, so that the Jacobian can be evaluated efficiently.

The trace can be estimated by "Hutchinson's trick":[17][18]

Given any matrixWRn×n{\displaystyle W\in \mathbb {R} ^{n\times n}}, and any randomuRn{\displaystyle u\in \mathbb {R} ^{n}} withE[uuT]=I{\displaystyle E[uu^{T}]=I}, we haveE[uTWu]=tr(W){\displaystyle E[u^{T}Wu]=tr(W)}. (Proof: expand the expectation directly.)

Usually, the random vector is sampled fromN(0,I){\displaystyle N(0,I)} (normal distribution) or{±n1/2}n{\displaystyle \{\pm n^{-1/2}\}^{n}} (Rademacher distribution).

Whenf{\displaystyle f} is implemented as a neural network,neural ODE methods[19] would be needed. Indeed, CNF was first proposed in the same paper that proposed neural ODE.

There are two main deficiencies of CNF, one is that a continuous flow must be ahomeomorphism, thus preserve orientation andambient isotopy (for example, it's impossible to flip a left-hand to a right-hand by continuous deforming of space, and it's impossible toturn a sphere inside out, or undo a knot), and the other is that the learned flowf{\displaystyle f} might be ill-behaved, due to degeneracy (that is, there are an infinite number of possiblef{\displaystyle f} that all solve the same problem).

By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just like how one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE".[20]

Any homeomorphism ofRn{\displaystyle \mathbb {R} ^{n}} can be approximated by a neural ODE operating onR2n+1{\displaystyle \mathbb {R} ^{2n+1}}, proved by combiningWhitney embedding theorem for manifolds and theuniversal approximation theorem for neural networks.[21]

To regularize the flowf{\displaystyle f}, one can impose regularization losses. The paper[17] proposed the following regularization loss based onoptimal transport theory:λK0Tf(zt,t)2dt+λJ0Tzf(zt,t)F2dt{\displaystyle \lambda _{K}\int _{0}^{T}\left\|f(z_{t},t)\right\|^{2}dt+\lambda _{J}\int _{0}^{T}\left\|\nabla _{z}f(z_{t},t)\right\|_{F}^{2}dt}whereλK,λJ>0{\displaystyle \lambda _{K},\lambda _{J}>0} are hyperparameters. The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space. Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.

Flows on manifolds

[edit]

When aprobabilistic flow transforms a distribution on anm{\displaystyle m}-dimensionalsmooth manifold embedded inRn{\displaystyle \mathbb {R} ^{n}}, wherem<n{\displaystyle m<n}, and where the transformation is specified as a function,RnRn{\displaystyle \mathbb {R} ^{n}\to \mathbb {R} ^{n}}, the scaling factor between the source and transformedPDFs isnot given by the naive computation of thedeterminant of then-by-n{\displaystyle n{\text{-by-}}n} Jacobian (which is zero), but instead by the determinant(s) of one or more suitably definedm-by-m{\displaystyle m{\text{-by-}}m} matrices. This section is an interpretation of the tutorial in the appendix of Sorrenson et al.(2023),[22] where the more general case of non-isometrically embeddedRiemann manifolds is also treated. Here we restrict attention toisometrically embedded manifolds.

As running examples of manifolds with smooth, isometric embedding inRn{\displaystyle \mathbb {R} ^{n}} we shall use:

As a first example of a spherical manifold flow transform, consider thenormalized linear transform, which radially projects onto the unitsphere the output of an invertible linear transform, parametrized by then-by-n{\displaystyle n{\text{-by-}}n} invertible matrixM{\displaystyle \mathbf {M} }:

flin(x;M)=MxMx{\displaystyle f_{\text{lin}}(\mathbf {x} ;\mathbf {M} )={\frac {\mathbf {Mx} }{\lVert \mathbf {Mx} \rVert }}}

In full Euclidean space,flin:RnRn{\displaystyle f_{\text{lin}}:\mathbb {R} ^{n}\to \mathbb {R} ^{n}} isnot invertible, but if we restrict the domain and co-domain to the unitsphere, thenflin:Sn1Sn1{\displaystyle f_{\text{lin}}:\mathbb {S} ^{n-1}\to \mathbb {S} ^{n-1}}is invertible (more specifically it is abijection and ahomeomorphism and adiffeomorphism), with inverseflin(;M1){\displaystyle f_{\text{lin}}(\cdot \,;\mathbf {M} ^{-1})}. The Jacobian offlin:RnRn{\displaystyle f_{\text{lin}}:\mathbb {R} ^{n}\to \mathbb {R} ^{n}}, aty=flin(x;M){\displaystyle \mathbf {y} =f_{\text{lin}}(\mathbf {x} ;\mathbf {M} )} isMx1(Inyy)M{\displaystyle \lVert \mathbf {Mx} \rVert ^{-1}(\mathbf {I} _{n}-\mathbf {yy} ')\mathbf {M} }, which has rankn1{\displaystyle n-1} and determinant of zero; whileas explained here, the factor (see subsection below) relating source and transformed densities is:Mxn|detM|{\displaystyle \lVert \mathbf {Mx} \rVert ^{-n}\left|\operatorname {det} \mathbf {M} \right|}.

Differential volume ratio

[edit]

Form<n{\displaystyle m<n}, letMRn{\displaystyle {\mathcal {M}}\subset \mathbb {R} ^{n}} be anm{\displaystyle m}-dimensional manifold with a smooth, isometric embedding intoRn{\displaystyle \mathbb {R} ^{n}}. Letf:RnRn{\displaystyle f:\mathbb {R} ^{n}\to \mathbb {R} ^{n}} be a smooth flow transform with range restricted toM{\displaystyle {\mathcal {M}}}. LetxM{\displaystyle \mathbf {x} \in {\mathcal {M}}} be sampled from a distribution with densityPX{\displaystyle P_{X}}. Lety=f(x){\displaystyle \mathbf {y} =f(\mathbf {x} )}, with resultant (pushforward) densityPY{\displaystyle P_{Y}}. LetUM{\displaystyle U\subset {\mathcal {M}}} be a small, convex region containingx{\displaystyle \mathbf {x} } and letV=f(U){\displaystyle V=f(U)} be its image, which containsy{\displaystyle \mathbf {y} }; then by conservation of probability mass:

PX(x)volume(U)PY(y)volume(V){\displaystyle P_{X}(\mathbf {x} )\operatorname {volume} (U)\approx P_{Y}(\mathbf {y} )\operatorname {volume} (V)}

where volume (for very small regions) is given byLebesgue measure inm{\displaystyle m}-dimensionaltangent space. By making the regions infinitessimally small, the factor relating the two densities is the ratio of volumes, which we term thedifferential volume ratio.

To obtain concrete formulas for volume on them{\displaystyle m}-dimensional manifold, we constructU{\displaystyle U} by mapping anm{\displaystyle m}-dimensional rectangle in (local) coordinate space to the manifold via a smooth embedding function:RmRn{\displaystyle \mathbb {R} ^{m}\to \mathbb {R} ^{n}}. At very small scale, the embedding function becomes essentially linear so thatU{\displaystyle U} is aparallelotope (multidimensional generalization of a parallelogram). Similarly, the flow transform,f{\displaystyle f} becomes linear, so that the image,V=f(U){\displaystyle V=f(U)} is also a parallelotope. InRm{\displaystyle \mathbb {R} ^{m}}, we can represent anm{\displaystyle m}-dimensional parallelotope with anm-by-m{\displaystyle m{\text{-by-}}m} matrix whose column-vectors are a set of edges (meeting at a common vertex) that span the paralellotope. Thevolume is given by the absolute value of the determinant of this matrix. If more generally (as is the case here), anm{\displaystyle m}-dimensional paralellotope is embedded inRn{\displaystyle \mathbb {R} ^{n}}, it can be represented with a (tall)n-by-m{\displaystyle n{\text{-by-}}m} matrix, sayV{\displaystyle \mathbf {V} }. Denoting the parallelotope as/V/{\displaystyle /\mathbf {V} \!/}, its volume is then given by the square root of theGram determinant:

volume/V/=|det(VV)|{\displaystyle \operatorname {volume} /\mathbf {V} \!/={\sqrt {\left|\operatorname {det} (\mathbf {V} '\mathbf {V} )\right|}}}

In the sections below, we show various ways to use this volume formula to derive the differential volume ratio.

Simplex flow

[edit]

As a first example, we develop expressions for the differential volume ratio of a simplex flow,q=f(p){\displaystyle \mathbf {q} =f(\mathbf {p} )}, wherep,qM=Δn1{\displaystyle \mathbf {p} ,\mathbf {q} \in {\mathcal {M}}=\Delta ^{n-1}}. Define theembedding function:

e:p~=(p1,pn1)p=(p1,pn1,1i=1n1pi){\displaystyle e:{\tilde {\mathbf {p} }}=(p_{1}\dots ,p_{n-1})\mapsto \mathbf {p} =(p_{1}\dots ,p_{n-1},1-\sum _{i=1}^{n-1}p_{i})}

which maps a conveniently chosen,(n1){\displaystyle (n-1)}-dimensional representation,p~{\displaystyle {\tilde {\mathbf {p} }}}, to the embedded manifold. Then-by-(n1){\displaystyle n{\text{-by-}}(n-1)} Jacobian isE=[In11]{\displaystyle \mathbf {E} ={\begin{bmatrix}\mathbf {I} _{n-1}\\-{\boldsymbol {1}}'\end{bmatrix}}}.To defineU{\displaystyle U}, the differential volume element at the transformation input (pΔn1{\displaystyle \mathbf {p} \in \Delta ^{n-1}}), we start with a rectangle inp~{\displaystyle {\tilde {\mathbf {p} }}}-space, having (signed) differential side-lengths,dp1,,dpn1{\displaystyle dp_{1},\dots ,dp_{n-1}} from which we form the square diagonal matrixD{\displaystyle \mathbf {D} }, the columns of which span the rectangle. At very small scale, we getU=e(D)=/ED/{\displaystyle U=e(\mathbf {D} )=/\mathbf {ED} \!/}, with:

For the 1-simplex (blue) embedded inR2{\displaystyle \mathbb {R} ^{2}}, when we pull backLebesgue measure fromtangent space (parallel to the simplex), via the embeddingp1(p1,1p1){\displaystyle p_{1}\mapsto (p_{1},1-p_{1})}, with JacobianE=[11]{\displaystyle \mathbf {E} ={\begin{bmatrix}1&-1\end{bmatrix}}'}, a scaling factor ofEE=2{\displaystyle {\sqrt {\mathbf {E} '\mathbf {E} }}={\sqrt {2}}} results.
volume(U)=|det(DEED)|=|det(EE)||detD)|=ni=1n1|dpi|{\displaystyle \operatorname {volume} (U)={\sqrt {\left|\operatorname {det} (\mathbf {DE} '\mathbf {ED} )\right|}}={\sqrt {\left|\operatorname {det} (\mathbf {E} '\mathbf {E} )\right|}}\,\left|\operatorname {det} \mathbf {D} )\right|={\sqrt {n}}\prod _{i=1}^{n-1}\left|dp_{i}\right|}

To understand the geometric interpretation of the factorn{\displaystyle {\sqrt {n}}}, see the example for the 1-simplex in the diagram at right.

The differential volume element at the transformation output (qΔn1{\displaystyle \mathbf {q} \in \Delta ^{n-1}}), is the parallelotope,V=f(U)=/FpED/{\displaystyle V=f(U)=/\mathbf {F_{p}ED} \!/}, whereFp{\displaystyle \mathbf {F_{p}} } is then-by-n{\displaystyle n{\text{-by-}}n} Jacobian off{\displaystyle f} atp=e(p~){\displaystyle \mathbf {p} =e({\tilde {\mathbf {p} }})}. Its volume is:

volume(V)=|det(DEFpFpED)|=|det(EFpFpE)||detD)|{\displaystyle \operatorname {volume} (V)={\sqrt {\left|\operatorname {det} (\mathbf {DE} '\mathbf {F_{p}} '\mathbf {F_{p}ED} )\right|}}={\sqrt {\left|\operatorname {det} (\mathbf {E} '\mathbf {F_{p}} '\mathbf {F_{p}E} )\right|}}\,\left|\operatorname {det} \mathbf {D} )\right|}

so that the factor|detD)|{\displaystyle \left|\operatorname {det} \mathbf {D} )\right|} cancels in the volume ratio, which can now already be numerically evaluated. It can however be rewritten in a sometimes more convenient form by also introducing therepresentation function,r:pp~{\displaystyle r:\mathbf {p} \mapsto {\tilde {\mathbf {p} }}}, which simply extracts the first(n1){\displaystyle (n-1)} components. The Jacobian isR=[In0]{\displaystyle \mathbf {R} ={\begin{bmatrix}\mathbf {I} _{n}&{\boldsymbol {0}}\end{bmatrix}}}. Observe that, sinceerf=f{\displaystyle e\circ r\circ f=f}, thechain rule for function composition gives:ERFp=Fp{\displaystyle \mathbf {ERF_{p}} =\mathbf {F_{p}} }. By plugging this expansion into the above Gram determinant and then refactoring it as a product of determinants of square matrices, we can extract the factor|det(EE)|=n{\displaystyle {\sqrt {\left|\operatorname {det} (\mathbf {E} '\mathbf {E} )\right|}}={\sqrt {n}}}, which now also cancels in the ratio, which finally simpifies to the determinant of the Jacobian of the "sandwiched" flow transformation,rfe{\displaystyle r\circ f\circ e}:

RfΔ(p)=volume(V)volume(U)=|det(RFpE)|{\displaystyle R_{f}^{\Delta }(\mathbf {p} )={\frac {\operatorname {volume} (V)}{\operatorname {volume} (U)}}=\left|\operatorname {det} (\mathbf {RF_{p}E} )\right|}

which, ifpPP{\displaystyle \mathbf {p} \sim P_{\mathbf {P} }}, can be used to derive the pushforward density after a change of variables,q=f(p){\displaystyle \mathbf {q} =f(\mathbf {p} )}:

PQ(q)=PP(p)RfΔ(p),wherep=f1(q){\displaystyle P_{\mathbf {Q} }(\mathbf {q} )={\frac {P_{\mathbf {P} }(\mathbf {p} )}{R_{f}^{\Delta }(\mathbf {p} )}}\,,\;{\text{where}}\;\;\mathbf {p} =f^{-1}(\mathbf {q} )}

This formula is valid only because the simplex is flat and the Jacobian,E{\displaystyle \mathbf {E} } is constant. The more general case for curved manifolds is discussed below, after we present two concrete examples of simplex flow transforms.

Simplex calibration transform

[edit]

Acalibration transform,fcal:Δn1Δn1{\displaystyle f_{\text{cal}}:\Delta ^{n-1}\to \Delta ^{n-1}}, which is sometimes used in machine learning for post-processing of the (class posterior) outputs of a probabilisticn{\displaystyle n}-class classifier,[23][24] uses thesoftmax function to renormalize categorical distributions after scaling and translation of the input distributions in log-probability space. Forp,qΔn1{\displaystyle \mathbf {p} ,\mathbf {q} \in \Delta ^{n-1}} and with parameters,a0{\displaystyle a\neq 0} andcRn{\displaystyle \mathbf {c} \in \mathbb {R} ^{n}} the transform can be specified as:

q=fcal(p;a,c)=softmax(a1logp+c)p=fcal1(q;a,c)=softmax(alogqac){\displaystyle \mathbf {q} =f_{\text{cal}}(\mathbf {p} ;a,\mathbf {c} )=\operatorname {softmax} (a^{-1}\log \mathbf {p} +\mathbf {c} )\;\iff \;\mathbf {p} =f_{\text{cal}}^{-1}(\mathbf {q} ;a,\mathbf {c} )=\operatorname {softmax} (a\log \mathbf {q} -a\mathbf {c} )}

where the log is applied elementwise. After some algebra thedifferential volume ratio can be expressed as:

RcalΔ(p;a,c)=|det(RFpE)|=|a|1ni=1nqipi{\displaystyle R_{\text{cal}}^{\Delta }(\mathbf {p} ;a,\mathbf {c} )=\left|\operatorname {det} (\mathbf {RF_{p}E} )\right|=\left|a\right|^{1-n}\prod _{i=1}^{n}{\frac {q_{i}}{p_{i}}}}

While calibration transforms are most often trained asdiscriminative models, the reinterpretation here as a probabilistic flow allows also the design ofgenerative calibration models based on this transform. When used for calibration, the restrictiona>0{\displaystyle a>0} can be imposed to prevent direction reversal in log-probability space. With the additional restrictionc=0{\displaystyle \mathbf {c} ={\boldsymbol {0}}}, this transform (with discriminative training) is known in machine learning astemperature scaling.

Generalized calibration transform

[edit]

The above calibration transform can be generalized tofgcal:Δn1Δn1{\displaystyle f_{\text{gcal}}:\Delta ^{n-1}\to \Delta ^{n-1}}, with parameterscRn{\displaystyle \mathbf {c} \in \mathbb {R} ^{n}} andA{\displaystyle \mathbf {A} }n-by-n{\displaystyle n{\text{-by-}}n} invertible:[26]

q=fgcal(p;A,c)=softmax(Alogp+c),subject toA1=λ1{\displaystyle \mathbf {q} =f_{\text{gcal}}(\mathbf {p} ;\mathbf {A} ,\mathbf {c} )=\operatorname {softmax} (\mathbf {A} \log \mathbf {p} +\mathbf {c} )\,,\;{\text{subject to}}\;\mathbf {A1} =\lambda \mathbf {1} }

where the condition thatA{\displaystyle \mathbf {A} } has1{\displaystyle \mathbf {1} } as aneigenvector ensures invertibility by sidestepping the information loss due to the invariance:softmax(x+α1)=softmax(x){\displaystyle \operatorname {softmax} (\mathbf {x} +\alpha \mathbf {1} )=\operatorname {softmax} (\mathbf {x} )}. Note in particular thatA=λIn{\displaystyle \mathbf {A} =\lambda \mathbf {I} _{n}} is theonly allowed diagonal parametrization, in which case we recoverfcal(p;λ1,c){\displaystyle f_{\text{cal}}(\mathbf {p} ;\lambda ^{-1},\mathbf {c} )}, while (forn>2{\displaystyle n>2}) generalizationis possible with non-diagonal matrices. Theinverse is:

p=fgcal1(q;A,c)=fgcal(q;A1,A1c),whereA1=λ1A11=λ11{\displaystyle \mathbf {p} =f_{\text{gcal}}^{-1}(\mathbf {q} ;\mathbf {A} ,\mathbf {c} )=f_{\text{gcal}}(\mathbf {q} ;\mathbf {A} ^{-1},-\mathbf {A} ^{-1}\mathbf {c} )\,,\;{\text{where}}\;\mathbf {A1} =\lambda \mathbf {1} \Longrightarrow \mathbf {A} ^{-1}\mathbf {1} =\lambda ^{-1}\mathbf {1} }

Thedifferential volume ratio is:

RgcalΔ(p;A,c)=|det(A)||λ|i=1nqipi{\displaystyle R_{\text{gcal}}^{\Delta }(\mathbf {p} ;\mathbf {A} ,\mathbf {c} )={\frac {\left|\operatorname {det} (\mathbf {A} )\right|}{|\lambda |}}\prod _{i=1}^{n}{\frac {q_{i}}{p_{i}}}}

Iffgcal{\displaystyle f_{\text{gcal}}} is to be used as a calibration transform, further constraint could be imposed, for example thatA{\displaystyle \mathbf {A} } bepositive definite, so that(Ax)x>0{\displaystyle (\mathbf {Ax} )'\mathbf {x} >0}, which avoids direction reversals. (This is one possible generalization ofa>0{\displaystyle a>0} in thefcal{\displaystyle f_{\text{cal}}} parameter.)

Forn=2{\displaystyle n=2},a>0{\displaystyle a>0} andA{\displaystyle \mathbf {A} } positive definite, thenfcal{\displaystyle f_{\text{cal}}} andfgcal{\displaystyle f_{\text{gcal}}} are equivalent in the sense that in both cases,logp1p2logq1q2{\displaystyle \log {\frac {p_{1}}{p_{2}}}\mapsto \log {\frac {q_{1}}{q_{2}}}} is a straight line, the (positive) slope and offset of which are functions of the transform parameters. Forn>2,{\displaystyle n>2,}fgcal{\displaystyle f_{\text{gcal}}}does generalizefcal{\displaystyle f_{\text{cal}}}.

It must however be noted that chaining multiplefgcal{\displaystyle f_{\text{gcal}}} flow transformations doesnot give a further generalization, because:

fgcal(;A1,c1)fgcal(;A2,c2)=fgcal(;A1A2,c1+A1c2){\displaystyle f_{\text{gcal}}(\cdot \,;\mathbf {A} _{1},\mathbf {c} _{1})\circ f_{\text{gcal}}(\cdot \,;\mathbf {A} _{2},\mathbf {c} _{2})=f_{\text{gcal}}(\cdot \,;\mathbf {A} _{1}\mathbf {A} _{2},\mathbf {c} _{1}+\mathbf {A} _{1}\mathbf {c} _{2})}

In fact, the set offgcal{\displaystyle f_{\text{gcal}}} transformations form agroup under function composition. The set offcal{\displaystyle f_{\text{cal}}} transformations form a subgroup.

Also see:Dirichlet calibration,[27] which generalizesfgcal{\displaystyle f_{\text{gcal}}}, by not placing any restriction on the matrix,A{\displaystyle \mathbf {A} }, so that invertibility is not guaranteed. While Dirichlet calibration is trained as a discriminative model,fgcal{\displaystyle f_{\text{gcal}}} can also be trained as part of a generative calibration model.

Differential volume ratio for curved manifolds

[edit]

Consider a flow,y=f(x){\displaystyle \mathbf {y} =f(\mathbf {x} )} on a curved manifold, for exampleSn1{\displaystyle \mathbb {S} ^{n-1}} which we equip with the embedding function,e{\displaystyle e} that maps a set of(n1){\displaystyle (n-1)}angular spherical coordinates toSn1{\displaystyle \mathbb {S} ^{n-1}}. The Jacobian ofe{\displaystyle e} is non-constant and we have to evaluate it at both input (Ex{\displaystyle \mathbf {E_{x}} }) and output (Ey{\displaystyle \mathbf {E_{y}} }). The same applies tor{\displaystyle r}, the representation function that recovers spherical coordinates from points onSn1{\displaystyle \mathbb {S} ^{n-1}}, for which we need the Jacobian at the output (Ry{\displaystyle \mathbf {R_{y}} }). The differential volume ratio now generalizes to:

Rf(x)=|det(RyFxEx)||det(EyEy)||det(ExEx)|{\displaystyle R_{f}(\mathbf {x} )=\left|\operatorname {det} (\mathbf {R_{y}F_{x}E_{x}} )\right|\,{\frac {\sqrt {\left|\operatorname {det} (\mathbf {E} _{\mathbf {y} }'\mathbf {E_{y}} )\right|}}{\sqrt {\left|\operatorname {det} (\mathbf {E} _{\mathbf {x} }'\mathbf {E_{x}} )\right|}}}}

For geometric insight, considerS2{\displaystyle \mathbf {S} ^{2}}, where the spherical coordinates are co-latitude,θ[0,π]{\displaystyle \theta \in [0,\pi ]} and longitude,ϕ[0,2π){\displaystyle \phi \in [0,2\pi )}. Atx=e(θ,ϕ){\displaystyle \mathbf {x} =e(\theta ,\phi )}, we get|det(ExEx)|=sinθ{\displaystyle {\sqrt {\left|\operatorname {det} (\mathbf {E} _{\mathbf {x} }'\mathbf {E_{x}} )\right|}}=\sin \theta }, which gives the radius of the circle at that latitude (compare e.g. polar circle to equator). The differential volume (surface area on the sphere) is:sinθdθdϕ{\displaystyle \sin \theta \,d\theta \,d\phi }.

The above derivation forRf{\displaystyle R_{f}} is fragile in the sense that when usingfixed functionse,r{\displaystyle e,r}, there may be places where they are not well-defined, for example at the poles of the 2-sphere where longitude is arbitrary. This problem is sidestepped (using standard manifold machinery) by generalizing tolocal coordinates (charts), where in the vicinities ofx,yM{\displaystyle \mathbf {x} ,\mathbf {y} \in {\mathcal {M}}}, we map from localm{\displaystyle m}-dimensional coordinates toRn{\displaystyle \mathbb {R} ^{n}} and back using the respective function pairsex,rx{\displaystyle e_{\mathbf {x} },r_{\mathbf {x} }} andey,ry{\displaystyle e_{\mathbf {y} },r_{\mathbf {y} }}. We continue to use the same notation for the Jacobians of these functions (Ex,Ey,Ry{\displaystyle \mathbf {E_{x}} ,\mathbf {E_{y}} ,\mathbf {R_{y}} }), so that the above formula forRf{\displaystyle R_{f}} remains valid.

Wecan however, choose our local coordinate system in a way that simplifies the expression forRf{\displaystyle R_{f}} and indeed also its practical implementation.[22] Letπ:PRn{\displaystyle \pi :{\mathcal {P}}\to \mathbb {R} ^{n}} be a smooth idempotent projection (ππ=π{\displaystyle \pi \circ \pi =\pi }) from theprojectible set,PRn{\displaystyle {\mathcal {P}}\subseteq \mathbb {R} ^{n}}, onto the embedded manifold. For example:

For everyxM{\displaystyle \mathbf {x} \in {\mathcal {M}}}, we require ofπ{\displaystyle \pi } that itsn-by-n{\displaystyle n{\text{-by-}}n} Jacobian,Πx{\displaystyle {\boldsymbol {\Pi _{x}}}} has rankm{\displaystyle m} (the manifold dimension), in which caseΠx{\displaystyle {\boldsymbol {\Pi _{x}}}} is anidempotent linear projection onto the local tangent space (orthogonal for the unitsphere:Inxx{\displaystyle \mathbf {I} _{n}-\mathbf {xx} '};oblique for the simplex:Inx1{\displaystyle \mathbf {I} _{n}-{\boldsymbol {x1}}'}). The columns ofΠx{\displaystyle {\boldsymbol {\Pi _{x}}}} span them{\displaystyle m}-dimensional tangent space atx{\displaystyle \mathbf {x} }. We use the notation,Tx{\displaystyle \mathbf {T_{x}} } for anyn-by-m{\displaystyle n{\text{-by-}}m} matrix with orthonormal columns (TxTx=Im{\displaystyle \mathbf {T} _{\mathbf {x} }'\mathbf {T_{x}} =\mathbf {I} _{m}}) that span the local tangent space. Also note:ΠxTx=Tx{\displaystyle {\boldsymbol {\Pi _{x}}}\mathbf {T_{x}} =\mathbf {T_{x}} }. We can now choose our local coordinate embedding function,ex:RmRn{\displaystyle e_{\mathbf {x} }:\mathbb {R} ^{m}\to \mathbb {R} ^{n}}:

ex(x~)=π(x+Txx~),with Jacobian:Ex=Txatx~=0.{\displaystyle e_{\mathbf {x} }({\tilde {x}})=\pi (\mathbf {x} +\mathbf {T_{x}{\tilde {x}}} )\,,{\text{with Jacobian:}}\,\mathbf {E_{x}} =\mathbf {T_{x}} \,{\text{at}}\,{\tilde {\mathbf {x} }}=\mathbf {0} .}

Since the Jacobian is injective (full rank:m{\displaystyle m}), a local (not necessarily unique)left inverse, sayrx{\displaystyle r_{\mathbf {x} }^{*}} with JacobianRx{\displaystyle \mathbf {R} _{\mathbf {x} }^{*}}, exists such thatrx(ex(x~))=x~{\displaystyle r_{\mathbf {x} }^{*}(e_{\mathbf {x} }({\tilde {x}}))={\tilde {x}}} andRxTx=Im{\displaystyle \mathbf {R} _{\mathbf {x} }^{*}\mathbf {T_{x}} =\mathbf {I} _{m}}. In practice we do not need the left inverse function itself, but wedo need its Jacobian, for which the above equation does not give a unique solution. We can however enforce a unique solution for the Jacobian by choosing the left inverse as,rx:RnRm{\displaystyle r_{\mathbf {x} }:\mathbb {R} ^{n}\to \mathbb {R} ^{m}}:

rx(z)=rx(π(z)),with Jacobian:Rx=Tx{\displaystyle r_{\mathbf {x} }(\mathbf {z} )=r_{\mathbf {x} }^{*}(\pi (\mathbf {z} ))\,,{\text{with Jacobian:}}\,\mathbf {R_{x}} =\mathbf {T} _{\mathbf {x} }'}

We can now finally plugEx=Tx{\displaystyle \mathbf {E_{x}} =\mathbf {T_{x}} } andRy=Ty{\displaystyle \mathbf {R_{y}} =\mathbf {T} _{\mathbf {y} }'} into our previous expression forRf{\displaystyle R_{f}}, thedifferential volume ratio, which because of the orthonormal Jacobians, simplifies to:[28]

Rf(x)=|det(TyFxTx)|{\displaystyle R_{f}(\mathbf {x} )=\left|\operatorname {det} (\mathbf {T_{y}} '\mathbf {F_{x}T_{x}} )\right|}

Practical implementation

[edit]

For learning the parameters of a manifold flow transformation, we need access to the differential volume ratio,Rf{\displaystyle R_{f}}, or at least to its gradient w.r.t. the parameters. Moreover, for some inference tasks, we need access toRf{\displaystyle R_{f}} itself. Practical solutions include:

Simple spherical flows

[edit]

In machine learning literature, various complex spherical flows formed by deep neural network architectures may be found.[22] In contrast, this section compiles fromstatistics literature the details of three very simple spherical flow transforms, with simple closed-form expressions for inverses and differential volume ratios. These flows can be used individually, or chained, to generalize distributions on the unitsphere,Sn1{\displaystyle \mathbb {S} ^{n-1}}. All three flows are compositions of an invertible affine transform inRn{\displaystyle \mathbb {R} ^{n}}, followed by radial projection back onto the sphere. The flavours we consider for the affine transform are: pure translation, pure linear and general affine. To make these flows fully functional for learning, inference and sampling, the tasks are:

  • To derive the inverse transform, with suitable restrictions on the parameters to ensure invertibility.
  • To derive in simple closed form thedifferential volume ratio,Rf{\displaystyle R_{f}}.

An interesting property of these simple spherical flows is that they don't make use of any non-linearities apart from the radial projection. Even the simplest of them, the normalized translation flow, can be chained to form perhaps surprisingly flexible distributions.

Normalized translation flow

[edit]

The normalized translation flow,ftrans:Sn1Sn1{\displaystyle f_{\text{trans}}:\mathbb {S} ^{n-1}\to \mathbb {S} ^{n-1}}, with parametercRn{\displaystyle \mathbf {c} \in \mathbb {R} ^{n}}, is given by:

y=ftrans(x;c)=x+cx+c,wherec<1{\displaystyle \mathbf {y} =f_{\text{trans}}(\mathbf {x} ;\mathbf {c} )={\frac {\mathbf {x} +\mathbf {c} }{\lVert \mathbf {x} +\mathbf {c} \rVert }}\,,\;{\text{where}}\;\lVert \mathbf {c} \rVert <1}

The inverse function may be derived by considering, for>0{\displaystyle \ell >0}:y=1(x+c){\displaystyle \mathbf {y} =\ell ^{-1}(\mathbf {x} +\mathbf {c} )} and then usingxx=1{\displaystyle \mathbf {x} '\mathbf {x} =1} to get aquadratic equation to recover{\displaystyle \ell }, which gives:

x=ftrans1(y;c)=yc,where=yc+(yc)2+1cc{\displaystyle \mathbf {x} =f_{\text{trans}}^{-1}(\mathbf {y} ;\mathbf {c} )=\ell \mathbf {y} -\mathbf {c} \,,{\text{where}}\;\ell =\mathbf {y} '\mathbf {c} +{\sqrt {(\mathbf {y} '\mathbf {c} )^{2}+1-\mathbf {c} '\mathbf {c} }}}

from which we see that we needc<1{\displaystyle \lVert \mathbf {c} \rVert <1} to keep{\displaystyle \ell } real and positive for allySn1{\displaystyle \mathbf {y} \in \mathbb {S} ^{n-1}}. Thedifferential volume ratio is given (without derivation) by Boulerice & Ducharme(1994) as:[30]

Rtrans(x;c)=1+xcx+cn{\displaystyle R_{\text{trans}}(\mathbf {x} ;\mathbf {c} )={\frac {1+\mathbf {x} '\mathbf {c} }{\lVert \mathbf {x} +\mathbf {c} \rVert ^{n}}}}

This can indeed be verified analytically:

Finally, it is worth noting thatftrans{\displaystyle f_{\text{trans}}} andftrans1{\displaystyle f_{\text{trans}}^{-1}} do not have the same functional form.

Normalized linear flow

[edit]

The normalized linear flow,flin:Sn1Sn1{\displaystyle f_{\text{lin}}:\mathbb {S} ^{n-1}\to \mathbb {S} ^{n-1}}, where parameterM{\displaystyle \mathbf {M} } is an invertiblen-by-n{\displaystyle n{\text{-by-}}n} matrix, is given by:

y=flin(x;M)=MxMxx=flin1(y;M)=flin(y;M1)=M1yM1y{\displaystyle \mathbf {y} =f_{\text{lin}}(\mathbf {x} ;\mathbf {M} )={\frac {\mathbf {Mx} }{\lVert \mathbf {Mx} \rVert }}\;\iff \;\mathbf {x} =f_{\text{lin}}^{-1}(\mathbf {y} ;\mathbf {M} )=f_{\text{lin}}(\mathbf {y} ;\mathbf {M} ^{-1})={\frac {\mathbf {M^{-1}y} }{\lVert \mathbf {M^{-1}y} \rVert }}}

Thedifferential volume ratio is:

Rlin(x;M)=|detM|Mxn{\displaystyle R_{\text{lin}}(\mathbf {x} ;\mathbf {M} )={\frac {\left|\operatorname {det} \mathbf {M} \right|}{\lVert \mathbf {Mx} \rVert ^{n}}}}

This result can be derived indirectly via theAngular central Gaussian distribution (ACG),[31] which can be obtained via normalized linear transform of either Gaussian, or uniform spherical variates. The first relationship can be used to derive the ACG density by a marginalization integral over the radius; after which the second relationship can be used to factor out the differential volume ratio. For details, seeACG distribution.

Normalized affine flow

[edit]

The normalized affine flow,faff:Sn1Sn1{\displaystyle f_{\text{aff}}:\mathbb {S} ^{n-1}\to \mathbb {S} ^{n-1}}, with parameterscRn{\displaystyle \mathbf {c} \in \mathbb {R} ^{n}} andM{\displaystyle \mathbf {M} },n-by-n{\displaystyle n{\text{-by-}}n} invertible, is given by:

faff(x;M,c)=Mx+cMx+c,whereM1c<1{\displaystyle f_{\text{aff}}(\mathbf {x} ;\mathbf {M} ,\mathbf {c} )={\frac {\mathbf {Mx} +\mathbf {c} }{\lVert \mathbf {Mx} +\mathbf {c} \rVert }}\,,\;{\text{where}}\;\lVert \mathbf {M^{-1}c} \rVert <1}

The inverse function, derived in a similar way to the normalized translation inverse is:

x=faff1(y;M,c)=M1(yc),where=yWc+(yWc)2+yWy(1cWc)yWy{\displaystyle \mathbf {x} =f_{\text{aff}}^{-1}(\mathbf {y} ;\mathbf {M} ,\mathbf {c} )=\mathbf {M} ^{-1}(\ell \mathbf {y} -\mathbf {c} )\,,{\text{where}}\;\ell ={\frac {\mathbf {y} '\mathbf {Wc} +{\sqrt {(\mathbf {y} '\mathbf {Wc} )^{2}+\mathbf {y} '\mathbf {Wy} (1-\mathbf {c} '\mathbf {Wc} )}}}{\mathbf {y} '\mathbf {Wy} }}}

whereW=(MM)1{\displaystyle \mathbf {W} =(\mathbf {MM} ')^{-1}}. Thedifferential volume ratio is:

Raff(x;M,c)=Rlin(x;M+cx)=|detM|(1+xM1c)Mx+cn{\displaystyle R_{\text{aff}}(\mathbf {x} ;\mathbf {M} ,\mathbf {c} )=R_{\text{lin}}(\mathbf {x} ;\mathbf {M} +\mathbf {c} \mathbf {x} ')={\frac {\left|\operatorname {det} \mathbf {M} \right|(1+\mathbf {x} '\mathbf {M^{-1}c} )}{\lVert \mathbf {Mx+c} \rVert ^{n}}}}

The final RHS numerator was expanded fromdet(M+cx){\displaystyle \operatorname {det} (\mathbf {M} +\mathbf {cx} ')} by thematrix determinant lemma. RecallingRf(x)=|det(TyFxTx)|{\displaystyle R_{f}(\mathbf {x} )=\left|\operatorname {det} (\mathbf {T} _{\mathbf {y} }'\mathbf {F_{x}T_{x}} )\right|}, the equality betweenRaff{\displaystyle R_{\text{aff}}} andRlin{\displaystyle R_{\text{lin}}} holds because not only:

xx=1y=faff(x;M,c)=flin(x;M+cx){\displaystyle \mathbf {x} '\mathbf {x} =1\;\Longrightarrow \;\mathbf {y} =f_{\text{aff}}(\mathbf {x} ;\mathbf {M,c} )=f_{\text{lin}}(\mathbf {x} ;\mathbf {M+cx} ')}

but also, by orthogonality ofx{\displaystyle \mathbf {x} } to the local tangent space:

xTx=0FxaffTx=FxlinTx{\displaystyle \mathbf {x} '\mathbf {T_{x}} ={\boldsymbol {0}}\;\Longrightarrow \;\mathbf {F} _{\mathbf {x} }^{\text{aff}}\mathbf {T_{x}} =\mathbf {F} _{\mathbf {x} }^{\text{lin}}\mathbf {T_{x}} }

whereFxlin=Mx+c1(Inyy)(M+cx){\displaystyle \mathbf {F} _{\mathbf {x} }^{\text{lin}}=\lVert \mathbf {Mx} +\mathbf {c} \rVert ^{-1}(\mathbf {I} _{n}-\mathbf {yy} ')(\mathbf {M+cx} ')} is the Jacobian offlin{\displaystyle f_{\text{lin}}} differentiated w.r.t. its input, butnot also w.r.t. to its parameter.

Downsides

[edit]

Despite normalizing flows success in estimating high-dimensional densities, some downsides still exist in their designs. First of all, their latent space where input data is projected onto is not a lower-dimensional space and therefore, flow-based models do not allow for compression of data by default and require a lot of computation. However, it is still possible to perform image compression with them.[32]

Flow-based models are also notorious for failing in estimating the likelihood of out-of-distribution samples (i.e.: samples that were not drawn from the same distribution as the training set).[33] Some hypotheses were formulated to explain this phenomenon, among which thetypical set hypothesis,[34] estimation issues when training models,[35] or fundamental issues due to the entropy of the data distributions.[36]

One of the most interesting properties of normalizing flows is theinvertibility of their learnedbijective map. This property is given by constraints in the design of the models (cf.: RealNVP, Glow) which guarantee theoretical invertibility. The integrity of the inverse is important in order to ensure the applicability of thechange-of-variable theorem, the computation of theJacobian of the map as well as sampling with the model. However, in practice this invertibility is violated and the inverse map explodes because of numerical imprecision.[37]

Applications

[edit]

Flow-based generative models have been applied on a variety of modeling tasks, including:

  • Audio generation[38]
  • Image generation[8]
  • Molecular graph generation[39]
  • Point-cloud modeling[40]
  • Video generation[41]
  • Lossy image compression[32]
  • Anomaly detection[42]

References

[edit]
  1. ^Tabak, Esteban G.; Vanden-Eijnden, Eric (2010)."Density estimation by dual ascent of the log-likelihood".Communications in Mathematical Sciences.8 (1):217–233.doi:10.4310/CMS.2010.v8.n1.a11.
  2. ^Tabak, Esteban G.; Turner, Cristina V. (2012)."A family of nonparametric density estimation algorithms".Communications on Pure and Applied Mathematics.66 (2):145–164.doi:10.1002/cpa.21423.hdl:11336/8930.S2CID 17820269.
  3. ^Papamakarios, George; Nalisnick, Eric; Jimenez Rezende, Danilo; Mohamed, Shakir; Bakshminarayanan, Balaji (2021)."Normalizing flows for probabilistic modeling and inference".Journal of Machine Learning Research.22 (1):2617–2680.arXiv:1912.02762.
  4. ^Bell, A. J.; Sejnowski, T. J. (1995). "An information-maximization approach to blind separation and blind deconvolution".Neural Computation. **7** (6): 1129–1159. doi:10.1162/neco.1995.7.6.1129.
  5. ^Roth, Z.; Baram, Y. (1996). "Multidimensional density shaping by sigmoids".IEEE Transactions on Neural Networks. **7** (5): 1291–1298. doi:10.1109/72.536322.
  6. ^abDinh, Laurent; Krueger, David; Bengio, Yoshua (2014). "NICE: Non-linear Independent Components Estimation".arXiv:1410.8516 [cs.LG].
  7. ^abDinh, Laurent; Sohl-Dickstein, Jascha; Bengio, Samy (2016). "Density estimation using Real NVP".arXiv:1605.08803 [cs.LG].
  8. ^abcKingma, Diederik P.; Dhariwal, Prafulla (2018). "Glow: Generative Flow with Invertible 1x1 Convolutions".arXiv:1807.03039 [stat.ML].
  9. ^Papamakarios, George; Nalisnick, Eric; Rezende, Danilo Jimenez; Shakir, Mohamed; Balaji, Lakshminarayanan (March 2021)."Normalizing Flows for Probabilistic Modeling and Inference".Journal of Machine Learning Research.22 (57):1–64.arXiv:1912.02762.
  10. ^Kobyzev, Ivan; Prince, Simon J.D.; Brubaker, Marcus A. (November 2021). "Normalizing Flows: An Introduction and Review of Current Methods".IEEE Transactions on Pattern Analysis and Machine Intelligence.43 (11):3964–3979.arXiv:1908.09257.Bibcode:2021ITPAM..43.3964K.doi:10.1109/TPAMI.2020.2992934.ISSN 1939-3539.PMID 32396070.S2CID 208910764.
  11. ^Danilo Jimenez Rezende; Mohamed, Shakir (2015). "Variational Inference with Normalizing Flows".arXiv:1505.05770 [stat.ML].
  12. ^Papamakarios, George; Pavlakou, Theo; Murray, Iain (2017)."Masked Autoregressive Flow for Density Estimation".Advances in Neural Information Processing Systems.30. Curran Associates, Inc.arXiv:1705.07057.
  13. ^Kingma, Durk P; Salimans, Tim; Jozefowicz, Rafal; Chen, Xi; Sutskever, Ilya; Welling, Max (2016)."Improved Variational Inference with Inverse Autoregressive Flow".Advances in Neural Information Processing Systems.29. Curran Associates, Inc.arXiv:1606.04934.
  14. ^abcGrathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models".arXiv:1810.01367 [cs.LG].
  15. ^Lipman, Yaron; Chen, Ricky T. Q.; Ben-Hamu, Heli; Nickel, Maximilian; Le, Matt (2022-10-01). "Flow Matching for Generative Modeling".arXiv:2210.02747 [cs.LG].
  16. ^Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018-10-22). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models".arXiv:1810.01367 [cs.LG].
  17. ^abFinlay, Chris; Jacobsen, Joern-Henrik; Nurbekyan, Levon; Oberman, Adam (2020-11-21)."How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization".International Conference on Machine Learning. PMLR:3154–3164.arXiv:2002.02798.
  18. ^Hutchinson, M.F. (January 1989)."A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines".Communications in Statistics - Simulation and Computation.18 (3):1059–1076.doi:10.1080/03610918908812806.ISSN 0361-0918.
  19. ^Chen, Ricky T. Q.; Rubanova, Yulia; Bettencourt, Jesse; Duvenaud, David K. (2018)."Neural Ordinary Differential Equations"(PDF). In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; Garnett, R. (eds.).Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc.arXiv:1806.07366.
  20. ^Dupont, Emilien; Doucet, Arnaud; Teh, Yee Whye (2019)."Augmented Neural ODEs".Advances in Neural Information Processing Systems.32. Curran Associates, Inc.
  21. ^Zhang, Han; Gao, Xi; Unterman, Jacob; Arodz, Tom (2019-07-30). "Approximation Capabilities of Neural ODEs and Invertible Residual Networks".arXiv:1907.12998 [cs.LG].
  22. ^abcdSorrenson, Peter; Draxler, Felix; Rousselot, Armand; Hummerich, Sander; Köthe, Ullrich (2023). "Learning Distributions on Manifolds with Free-Form Flows".arXiv:2312.09852 [cs.LG].
  23. ^Brümmer, Niko; van Leeuwen, D. A. (2006). "On calibration of language recognition scores".Proceedings of IEEE Odyssey: The Speaker and Language Recognition Workshop. San Juan, Puerto Rico. pp. 1–8.doi:10.1109/ODYSSEY.2006.248106.
  24. ^Ferrer, Luciana; Ramos, Daniel (2024). "Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration".arXiv:2408.02841 [stat.ML].
  25. ^Graf, Monique (2019)."The Simplicial Generalized Beta distribution - R-package SGB and applications".Libra. Retrieved26 May 2025.{{cite web}}: CS1 maint: numeric names: authors list (link)
  26. ^Brümmer, Niko (18 October 2010).Measuring, refining and calibrating speaker and language information extracted from speech (PhD thesis). Stellenbosch, South Africa: Department of Electrical & Electronic Engineering, University of Stellenbosch.
  27. ^Meelis Kull, Miquel Perelló‑Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter A. Flach (28 October 2019). "Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration".arXiv:1910.12656 [cs.LG].{{cite arXiv}}: CS1 maint: multiple names: authors list (link)
  28. ^The tangent matrices are not unique: ifT{\displaystyle \mathbf {T} } has orthonormal columns andQ{\displaystyle \mathbf {Q} } is anorthogonal matrix, thenTQ{\displaystyle \mathbf {TQ} } also has orthonormal columns that span the same subspace; it is easy to verify that|det(TyFxTx)|{\displaystyle \left|\operatorname {det} (\mathbf {T_{y}} '\mathbf {F_{x}T_{x}} )\right|} is invariant to such transformations of the tangent representatives.
  29. ^WithPyTorch:
    from torch.linalg import qrfrom torch.func import jacrevdef logRf(pi, m, f, x):    y = f(x)     Fx, PI = jacrev(f)(x), jacrev(pi)    Tx, Ty = [qr(PI(z)).Q[:,:m] for z in (x,y)]    return (Ty.T @ Fx @ Tx).slogdet().logabsdet
  30. ^Boulerice, Bernard; Ducharme, Gilles R. (1994). "Decentered Directional Data".Annals of the Institute of Statistical Mathematics.46 (3):573–586.doi:10.1007/BF00773518.
  31. ^Tyler, David E (1987). "Statistical analysis for the angular central Gaussian distribution on the sphere".Biometrika.74 (3):579–589.doi:10.2307/2336697.JSTOR 2336697.
  32. ^abHelminger, Leonhard; Djelouah, Abdelaziz; Gross, Markus; Schroers, Christopher (2020). "Lossy Image Compression with Normalizing Flows".arXiv:2008.10486 [cs.CV].
  33. ^Nalisnick, Eric; Matsukawa, Teh; Zhao, Yee Whye; Song, Zhao (2018). "Do Deep Generative Models Know What They Don't Know?".arXiv:1810.09136v3 [stat.ML].
  34. ^Nalisnick, Eric; Matsukawa, Teh; Zhao, Yee Whye; Song, Zhao (2019). "Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality".arXiv:1906.02994 [stat.ML].
  35. ^Zhang, Lily; Goldstein, Mark; Ranganath, Rajesh (2021)."Understanding Failures in Out-of-Distribution Detection with Deep Generative Models".Proceedings of Machine Learning Research.139:12427–12436.PMC 9295254.PMID 35860036.
  36. ^Caterini, Anthony L.; Loaiza-Ganem, Gabriel (2022). "Entropic Issues in Likelihood-Based OOD Detection". pp. 21–26.arXiv:2109.10794 [stat.ML].
  37. ^Behrmann, Jens; Vicol, Paul; Wang, Kuan-Chieh; Grosse, Roger; Jacobsen, Jörn-Henrik (2020). "Understanding and Mitigating Exploding Inverses in Invertible Neural Networks".arXiv:2006.09347 [cs.LG].
  38. ^Ping, Wei; Peng, Kainan; Gorur, Dilan; Lakshminarayanan, Balaji (2019). "WaveFlow: A Compact Flow-based Model for Raw Audio".arXiv:1912.01219 [cs.SD].
  39. ^Shi, Chence; Xu, Minkai; Zhu, Zhaocheng; Zhang, Weinan; Zhang, Ming; Tang, Jian (2020). "GraphAF: A Flow-based Autoregressive Model for Molecular Graph Generation".arXiv:2001.09382 [cs.LG].
  40. ^Yang, Guandao; Huang, Xun; Hao, Zekun; Liu, Ming-Yu; Belongie, Serge; Hariharan, Bharath (2019). "PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows".arXiv:1906.12320 [cs.CV].
  41. ^Kumar, Manoj; Babaeizadeh, Mohammad; Erhan, Dumitru; Finn, Chelsea; Levine, Sergey; Dinh, Laurent; Kingma, Durk (2019). "VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation".arXiv:1903.01434 [cs.CV].
  42. ^Rudolph, Marco; Wandt, Bastian; Rosenhahn, Bodo (2021). "Same Same But DifferNet: Semi-Supervised Defect Detection with Normalizing Flows".arXiv:2008.12577 [cs.CV].

External links

[edit]
Retrieved from "https://en.wikipedia.org/w/index.php?title=Flow-based_generative_model&oldid=1323031387"
Categories:
Hidden categories:

[8]ページ先頭

©2009-2025 Movatter.jp