Diffusion model

From Wikipedia, the free encyclopedia
Technique for the generative modeling of a continuous probability distribution
This article is about the technique in generative statistical modeling. For other uses, see Diffusion (disambiguation).
This article discusses diffusion modeling of a continuous distribution. For the modeling of a discrete distribution, see Discrete diffusion model.

In machine learning, diffusion models, also known as diffusion-based generative models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of two major components: the forward diffusion process, and the reverse sampling process. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data.[1] A trained diffusion model can be sampled in many ways, with different efficiency and quality.

There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.[2] They are typically trained using variational inference.[3] The model responsible for denoising is typically called its "backbone". The backbone may be of any kind, but it is typically a U-Net or a transformer.

As of 2024, diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise.[1][4] The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.

Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules, to allow text-conditioned generation.[5]

Other than computer vision, diffusion models have also found applications in natural language processing[6] such as text generation[7] and summarization,[8] sound generation,[9] and reinforcement learning.[10][11]

Denoising diffusion model


Non-equilibrium thermodynamics


Diffusion models were introduced in 2015 as a method to train a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion.[12]

Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in this space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution $\mathcal{N}(0, I)$. A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution.

The equilibrium distribution is the Gaussian distribution $\mathcal{N}(0, I)$, with pdf $\rho(x) \propto e^{-\frac{1}{2}\|x\|^2}$. This is just the Maxwell–Boltzmann distribution of particles in a potential well $V(x) = \frac{1}{2}\|x\|^2$ at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they would all fall to the origin, collapsing the distribution.

Denoising Diffusion Probabilistic Model (DDPM)


The 2020 paper by Ho et al. proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference.[3][13]

Forward diffusion


To present the model, some notation is required.

A forward diffusion process starts at some starting point $x_0 \sim q$, where $q$ is the probability distribution to be learned, then repeatedly adds noise to it by
$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z_t$$
where $z_1, \dots, z_T$ are IID (independent and identically distributed) samples from $\mathcal{N}(0, I)$. The coefficients $\sqrt{1-\beta_t}$ and $\sqrt{\beta_t}$ ensure that $\operatorname{Var}(x_t) = I$ whenever $\operatorname{Var}(x_0) = I$. The values of $\beta_t$ are chosen such that for any starting distribution of $x_0$ with finite second moment, the conditional distribution of $x_t \mid x_0$ converges in distribution to $\mathcal{N}(0, I)$ as $t \to \infty$.

The entire diffusion process then satisfies
$$q(x_{0:T}) = q(x_0)\, q(x_1 \mid x_0) \cdots q(x_T \mid x_{T-1}) = q(x_0)\, \mathcal{N}(x_1 \mid \sqrt{\alpha_1}\, x_0, \beta_1 I) \cdots \mathcal{N}(x_T \mid \sqrt{\alpha_T}\, x_{T-1}, \beta_T I)$$
or
$$\ln q(x_{0:T}) = \ln q(x_0) - \sum_{t=1}^T \frac{1}{2\beta_t} \left\| x_t - \sqrt{1-\beta_t}\, x_{t-1} \right\|^2 + C$$
where $C$ is a normalization constant, often omitted, and $\alpha_t := 1 - \beta_t$. In particular, we note that $x_{1:T} \mid x_0$ is a Gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation with the Gaussian process,
$$x_t \mid x_0 \sim \mathcal{N}\!\left( \sqrt{\bar{\alpha}_t}\, x_0,\, \sigma_t^2 I \right), \qquad x_{t-1} \mid x_t, x_0 \sim \mathcal{N}\!\left( \tilde{\mu}_t(x_t, x_0),\, \tilde{\sigma}_t^2 I \right)$$
where $\bar{\alpha}_t := \alpha_1 \cdots \alpha_t$ and $\sigma_t^2 := 1 - \bar{\alpha}_t$. In particular, notice that for large $t$, the variable $x_t \mid x_0$ converges to $\mathcal{N}(0, I)$. That is, after a long enough diffusion process, we end up with some $x_T$ that is very close to $\mathcal{N}(0, I)$, with all traces of the original $x_0 \sim q$ gone.

For example, since $x_t \mid x_0 \sim \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0, \sigma_t^2 I\right)$, we can sample $x_t \mid x_0$ directly "in one step", instead of going through all the intermediate steps $x_1, x_2, \dots, x_{t-1}$.
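For illustration, a minimal PyTorch sketch of this one-step forward sampling; the linear `betas` schedule and all names here are assumptions made for the example, not prescriptions from the literature:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # \bar{\alpha}_t = \prod_s \alpha_s

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t | x_0 in one step: x_t = sqrt(abar_t) * x0 + sigma_t * z."""
    z = torch.randn_like(x0)
    sigma_t = (1.0 - alpha_bar[t]).sqrt()    # sigma_t^2 = 1 - abar_t
    return alpha_bar[t].sqrt() * x0 + sigma_t * z
```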

Derivation by reparameterization

We know $x_{t-1} \mid x_0$ is a Gaussian, and $x_t \mid x_{t-1}$ is another Gaussian. We also know that these are independent. Thus we can perform a reparameterization:
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\, z, \qquad x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, z'$$
where $z, z'$ are IID Gaussians.

There are 5 variables $x_0, x_{t-1}, x_t, z, z'$ and two linear equations. The two sources of randomness are $z, z'$, which can be reparameterized by rotation, since the IID Gaussian distribution is rotationally symmetric.

By plugging in the equations, we can solve for the first reparameterization:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \underbrace{\sqrt{\alpha_t - \bar{\alpha}_t}\, z + \sqrt{1-\alpha_t}\, z'}_{=\,\sigma_t z''}$$
where $z''$ is a Gaussian with mean zero and variance one.

To find the second one, we complete the rotation matrix:
$$\begin{bmatrix} z'' \\ z''' \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{\alpha_t - \bar{\alpha}_t}}{\sigma_t} & \frac{\sqrt{\beta_t}}{\sigma_t} \\ ? & ? \end{bmatrix} \begin{bmatrix} z \\ z' \end{bmatrix}$$

Since rotation matrices are all of the form $\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$, we know the matrix must be
$$\begin{bmatrix} z'' \\ z''' \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{\alpha_t - \bar{\alpha}_t}}{\sigma_t} & \frac{\sqrt{\beta_t}}{\sigma_t} \\ -\frac{\sqrt{\beta_t}}{\sigma_t} & \frac{\sqrt{\alpha_t - \bar{\alpha}_t}}{\sigma_t} \end{bmatrix} \begin{bmatrix} z \\ z' \end{bmatrix}$$
and since the inverse of a rotation matrix is its transpose,
$$\begin{bmatrix} z \\ z' \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{\alpha_t - \bar{\alpha}_t}}{\sigma_t} & -\frac{\sqrt{\beta_t}}{\sigma_t} \\ \frac{\sqrt{\beta_t}}{\sigma_t} & \frac{\sqrt{\alpha_t - \bar{\alpha}_t}}{\sigma_t} \end{bmatrix} \begin{bmatrix} z'' \\ z''' \end{bmatrix}$$

Plugging back in and simplifying, we have
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sigma_t z'', \qquad x_{t-1} = \tilde{\mu}_t(x_t, x_0) - \tilde{\sigma}_t z'''$$

Backward diffusion


The key idea of DDPM is to use a neural network parametrized by $\theta$. The network takes in two arguments $x_t, t$, and outputs a vector $\mu_\theta(x_t, t)$ and a matrix $\Sigma_\theta(x_t, t)$, such that each step in the forward diffusion process can be approximately undone by $x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$. This then gives us a backward diffusion process $p_\theta$ defined by
$$p_\theta(x_T) = \mathcal{N}(x_T \mid 0, I), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1} \mid \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$
The goal now is to learn the parameters $\theta$ such that $p_\theta(x_0)$ is as close to $q(x_0)$ as possible. To do that, we use maximum likelihood estimation with variational inference.

Variational inference


The ELBO inequality states that
$$\ln p_\theta(x_0) \geq E_{x_{1:T} \sim q(\cdot \mid x_0)}\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right]$$
and taking one more expectation, we get
$$E_{x_0 \sim q}[\ln p_\theta(x_0)] \geq E_{x_{0:T} \sim q}\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right]$$
We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.

Define the loss function
$$L(\theta) := -E_{x_{0:T} \sim q}\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right]$$
and now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to[14]
$$L(\theta) = \sum_{t=1}^T E_{x_{t-1}, x_t \sim q}\left[-\ln p_\theta(x_{t-1} \mid x_t)\right] + E_{x_0 \sim q}\left[D_{KL}(q(x_T \mid x_0) \,\|\, p_\theta(x_T))\right] + C$$
where $C$ does not depend on the parameter, and thus can be ignored. Since $p_\theta(x_T) = \mathcal{N}(x_T \mid 0, I)$ also does not depend on the parameter, the term $E_{x_0 \sim q}[D_{KL}(q(x_T \mid x_0) \,\|\, p_\theta(x_T))]$ can also be ignored. This leaves just $L(\theta) = \sum_{t=1}^T L_t$ with $L_t = E_{x_{t-1}, x_t \sim q}[-\ln p_\theta(x_{t-1} \mid x_t)]$ to be minimized.

Noise prediction network


Since $x_{t-1} \mid x_t, x_0 \sim \mathcal{N}(\tilde{\mu}_t(x_t, x_0), \tilde{\sigma}_t^2 I)$, this suggests that we should use $\mu_\theta(x_t, t) = \tilde{\mu}_t(x_t, x_0)$; however, the network does not have access to $x_0$, and so it has to estimate it instead. Now, since $x_t \mid x_0 \sim \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0, \sigma_t^2 I\right)$, we may write $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sigma_t z$, where $z$ is some unknown Gaussian noise. Now we see that estimating $x_0$ is equivalent to estimating $z$.

Therefore, let the network output a noise vector $\epsilon_\theta(x_t, t)$, and let it predict
$$\mu_\theta(x_t, t) = \tilde{\mu}_t\!\left(x_t, \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) = \frac{x_t - \epsilon_\theta(x_t, t)\, \beta_t / \sigma_t}{\sqrt{\alpha_t}}$$
It remains to design $\Sigma_\theta(x_t, t)$. The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value $\Sigma_\theta(x_t, t) = \zeta_t^2 I$, where either $\zeta_t^2 = \beta_t$ or $\tilde{\sigma}_t^2$ yielded similar performance.

With this, the loss simplifies to
$$L_t = \frac{\beta_t^2}{2\alpha_t \sigma_t^2 \zeta_t^2} E_{x_0 \sim q;\, z \sim \mathcal{N}(0,I)}\left[\left\|\epsilon_\theta(x_t, t) - z\right\|^2\right] + C$$
which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function
$$L_{\mathrm{simple},t} = E_{x_0 \sim q;\, z \sim \mathcal{N}(0,I)}\left[\left\|\epsilon_\theta(x_t, t) - z\right\|^2\right]$$
resulted in better models.
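As a concrete illustration, a minimal sketch of one training step under $L_{\mathrm{simple}}$; the `model`, `optimizer`, and batching conventions are placeholders assumed for the example:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bar):
    """One SGD step on L_simple: teach the network to recover the added noise."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (B,))            # random timestep per sample
    z = torch.randn_like(x0)                              # the noise to be predicted
    ab = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))    # broadcast over data dims
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * z          # one-step forward noising
    loss = F.mse_loss(model(x_t, t), z)                   # || eps_theta(x_t, t) - z ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```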

Backward diffusion process


After a noise prediction network is trained, it can be used for generating data points in the original distribution in a loop as follows (a code sketch follows the list):

  1. Compute the noise estimate $\epsilon \leftarrow \epsilon_\theta(x_t, t)$
  2. Compute the original data estimate $\tilde{x}_0 \leftarrow (x_t - \sigma_t \epsilon)/\sqrt{\bar{\alpha}_t}$
  3. Sample the previous data $x_{t-1} \sim \mathcal{N}(\tilde{\mu}_t(x_t, \tilde{x}_0), \tilde{\sigma}_t^2 I)$
  4. Change time $t \leftarrow t-1$
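A sketch of this loop in PyTorch, with the standard DDPM posterior mean $\tilde{\mu}_t$ and variance $\tilde{\sigma}_t^2$ written out explicitly; all helper names are assumptions:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Ancestral DDPM sampling; `model(x, t)` predicts the noise eps_theta."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = model(x, torch.full((shape[0],), t))            # step 1: noise estimate
        ab, a, b = alpha_bar[t], alphas[t], betas[t]
        x0 = (x - (1.0 - ab).sqrt() * eps) / ab.sqrt()        # step 2: data estimate
        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        # posterior mean \tilde{mu}_t(x_t, x0) of the forward process
        mu = (a.sqrt() * (1 - ab_prev) * x + ab_prev.sqrt() * b * x0) / (1 - ab)
        if t > 0:
            var = (1 - ab_prev) * b / (1 - ab)                # \tilde{sigma}_t^2
            x = mu + var.sqrt() * torch.randn_like(x)         # step 3: sample x_{t-1}
        else:
            x = mu                                            # final step is the mean
    return x
```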

Score-based generative model


The score-based generative model is another formulation of diffusion modelling. Such models are also called noise conditional score networks (NCSN) or score matching with Langevin dynamics (SMLD).[15][16][17][18]

Score matching


The idea of score functions


Consider the problem of image generation. Let $x$ represent an image, and let $q(x)$ be the probability distribution over all possible images. If we have $q(x)$ itself, then we can say for certain how likely a certain image is. However, this is intractable in general.

Most often, we are uninterested in knowing the absolute probability of a certain image. Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors — e.g., how much more likely is an image of a cat compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?

Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in $\nabla_x \ln q(x)$. This has two major effects:

- First, we no longer need to normalize $q(x)$: we may use any unnormalized density $\tilde{q}(x) = Cq(x)$ with unknown constant $C > 0$, since $\nabla_x \ln \tilde{q}(x) = \nabla_x \ln q(x)$.
- Second, we are comparing $q(x)$ only with its immediate neighbors $q(x + dx)$, since to first order $\frac{q(x + dx)}{q(x)} = e^{\langle \nabla_x \ln q(x),\, dx\rangle}$.

Let the score function be $s(x) := \nabla_x \ln q(x)$; then consider what we can do with $s(x)$.

As it turns out, $s(x)$ allows us to sample from $q(x)$ using thermodynamics. Specifically, if we have a potential energy function $U(x) = -\ln q(x)$, and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the Boltzmann distribution
$$q_U(x) \propto e^{-U(x)/k_B T} = q(x)^{1/k_B T}$$
At temperature $k_B T = 1$, the Boltzmann distribution is exactly $q(x)$.

Therefore, to model $q(x)$, we may start with a particle sampled at any convenient distribution (such as the standard Gaussian distribution), then simulate the motion of the particle forwards according to the Langevin equation
$$dx_t = -\nabla_{x_t} U(x_t)\, dt + dW_t$$
and the Boltzmann distribution is, by the Fokker–Planck equation, the unique thermodynamic equilibrium. So no matter what distribution $x_0$ has, the distribution of $x_t$ converges in distribution to $q$ as $t \to \infty$.

Learning the score function


Given a density $q$, we wish to learn a score function approximation $f_\theta \approx \nabla \ln q$. This is score matching.[19] Typically, score matching is formalized as minimizing the Fisher divergence $E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2]$. By expanding the integral and performing an integration by parts,
$$E_q\left[\|f_\theta(x) - \nabla \ln q(x)\|^2\right] = E_q\left[\|f_\theta\|^2 + 2\nabla \cdot f_\theta\right] + C$$
giving us a loss function, also known as the Hyvärinen scoring rule, that can be minimized by stochastic gradient descent.
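A sketch of this loss in PyTorch, with the divergence term $\nabla \cdot f_\theta$ approximated by Hutchinson's trace estimator; the estimator choice is our assumption for the example, not part of the original formulation:

```python
import torch

def hyvarinen_loss(f, x, n_probes=1):
    """Score matching loss E[||f(x)||^2 + 2 div f(x)] on a batch x of shape (B, d).

    div f(x) = tr(df/dx) is estimated as E_v[v^T (df/dx) v] with
    Rademacher probe vectors v (Hutchinson's estimator).
    """
    x = x.detach().requires_grad_(True)
    fx = f(x)
    div = torch.zeros(x.shape[0])
    for _ in range(n_probes):
        v = torch.randint_like(x, 2) * 2.0 - 1.0            # +-1 probe vector
        vjp = torch.autograd.grad(fx, x, grad_outputs=v, create_graph=True)[0]
        div = div + (vjp * v).sum(dim=-1)                   # estimates v^T (df/dx) v
    div = div / n_probes
    return (fx.pow(2).sum(dim=-1) + 2.0 * div).mean()
```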

Annealing the score function


Suppose we need to model the distribution of images, and we want $x_0 \sim \mathcal{N}(0, I)$, a white-noise image. Now, most white-noise images do not look like real images, so $q(x_0) \approx 0$ for large swaths of $x_0 \sim \mathcal{N}(0, I)$. This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function $\nabla_{x_t} \ln q(x_t)$ at that point, then we cannot impose the time-evolution equation on a particle:
$$dx_t = \nabla_{x_t} \ln q(x_t)\, dt + dW_t$$
To deal with this problem, we perform annealing. If $q$ is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.

Continuous diffusion processes


Forward diffusion process


Consider again the forward diffusion process, but this time in continuous time:
$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z_t$$
By taking the limit $\beta_t \to \beta(t)\, dt$, $\sqrt{dt}\, z_t \to dW_t$, we obtain a continuous diffusion process, in the form of a stochastic differential equation:
$$dx_t = -\frac{1}{2}\beta(t)\, x_t\, dt + \sqrt{\beta(t)}\, dW_t$$
where $W_t$ is a Wiener process (multidimensional Brownian motion).

Now, the equation is exactly a special case of the overdamped Langevin equation
$$dx_t = -\frac{D}{k_B T}(\nabla_x U)\, dt + \sqrt{2D}\, dW_t$$
where $D$ is the diffusion tensor, $T$ is the temperature, and $U$ is the potential energy field. If we substitute in $D = \frac{1}{2}\beta(t) I$, $k_B T = 1$, $U = \frac{1}{2}\|x\|^2$, we recover the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models.

Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to $q$ at time $t = 0$; then after a long time, the cloud of particles would settle into the stable distribution of $\mathcal{N}(0, I)$. Let $\rho_t$ be the density of the cloud of particles at time $t$. Then we have
$$\rho_0 = q; \qquad \rho_T \approx \mathcal{N}(0, I)$$
and the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning.

By the Fokker–Planck equation, the density of the cloud evolves according to
$$\partial_t \ln \rho_t = \frac{1}{2}\beta(t)\left(n + (x + \nabla \ln \rho_t) \cdot \nabla \ln \rho_t + \Delta \ln \rho_t\right)$$
where $n$ is the dimension of space, and $\Delta$ is the Laplace operator. Equivalently,
$$\partial_t \rho_t = \frac{1}{2}\beta(t)\left(\nabla \cdot (x \rho_t) + \Delta \rho_t\right)$$

Backward diffusion process


If we have solved $\rho_t$ for time $t \in [0, T]$, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density $\nu_0 = \rho_T$, and let the particles in the cloud evolve according to
$$dy_t = \frac{1}{2}\beta(T-t)\, y_t\, dt + \beta(T-t) \underbrace{\nabla_{y_t} \ln \rho_{T-t}(y_t)}_{\text{score function}}\, dt + \sqrt{\beta(T-t)}\, dW_t$$
then by plugging into the Fokker–Planck equation, we find that $\partial_t \rho_{T-t} = \partial_t \nu_t$. Thus this cloud of points is the original cloud, evolving backwards.[20]

Noise conditional score network (NCSN)


At the continuous limit,
$$\bar{\alpha}_t = (1-\beta_1)\cdots(1-\beta_t) = e^{\sum_i \ln(1-\beta_i)} \to e^{-\int_0^t \beta(s)\, ds}$$
and so
$$x_t \mid x_0 \sim \mathcal{N}\!\left(e^{-\frac{1}{2}\int_0^t \beta(s)\, ds}\, x_0,\, \left(1 - e^{-\int_0^t \beta(s)\, ds}\right) I\right)$$
In particular, we see that we can directly sample from any point in the continuous diffusion process without going through the intermediate steps, by first sampling $x_0 \sim q$ and $z \sim \mathcal{N}(0, I)$, then computing $x_t = e^{-\frac{1}{2}\int_0^t \beta(s)\, ds}\, x_0 + \sqrt{1 - e^{-\int_0^t \beta(s)\, ds}}\, z$. That is, we can quickly sample $x_t \sim \rho_t$ for any $t \geq 0$.

Now, define a certain probability distribution $\gamma$ over $[0, \infty)$; then the score-matching loss function is defined as the expected Fisher divergence:
$$L(\theta) = E_{t \sim \gamma,\, x_t \sim \rho_t}\left[\|f_\theta(x_t, t)\|^2 + 2\nabla \cdot f_\theta(x_t, t)\right]$$
After training, $f_\theta(x_t, t) \approx \nabla \ln \rho_t$, so we can perform the backwards diffusion process by first sampling $x_T \sim \mathcal{N}(0, I)$, then integrating the SDE from $t = T$ to $t = 0$:
$$x_{t-dt} = x_t + \frac{1}{2}\beta(t)\, x_t\, dt + \beta(t)\, f_\theta(x_t, t)\, dt + \sqrt{\beta(t)}\, dW_t$$
This may be done by any SDE integration method, such as the Euler–Maruyama method.
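A minimal Euler–Maruyama sketch of this reverse-time integration; the trained score model and the $\beta(t)$ schedule are placeholders:

```python
import torch

@torch.no_grad()
def reverse_sde_sample(score_model, shape, beta_fn, T=1.0, n_steps=1000):
    """Integrate the reverse SDE from t = T down to t = 0 by Euler-Maruyama."""
    dt = T / n_steps
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for i in range(n_steps, 0, -1):
        t = i * dt
        beta = beta_fn(t)                                    # beta(t), assumed callable
        score = score_model(x, torch.full((shape[0],), t))   # f_theta(x_t, t)
        drift = 0.5 * beta * x + beta * score
        x = x + drift * dt + (beta * dt) ** 0.5 * torch.randn_like(x)
    return x
```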

The name "noise conditional score network" is explained thus:

Their equivalence


DDPM and score-based generative models are equivalent.[16][1][21] This means that a network trained using DDPM can be used as an NCSN, and vice versa.

We know that $x_t \mid x_0 \sim \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0, \sigma_t^2 I\right)$, so by Tweedie's formula, we have
$$\nabla_{x_t} \ln q(x_t) = \frac{1}{\sigma_t^2}\left(-x_t + \sqrt{\bar{\alpha}_t}\, E_q[x_0 \mid x_t]\right)$$
As described previously, the DDPM loss function is $\sum_t L_{\mathrm{simple},t}$ with
$$L_{\mathrm{simple},t} = E_{x_0 \sim q;\, z \sim \mathcal{N}(0,I)}\left[\left\|\epsilon_\theta(x_t, t) - z\right\|^2\right]$$
where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sigma_t z$. By a change of variables,
$$L_{\mathrm{simple},t} = E_{x_0, x_t \sim q}\left[\left\|\epsilon_\theta(x_t, t) - \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sigma_t}\right\|^2\right] = E_{x_t \sim q,\, x_0 \sim q(\cdot \mid x_t)}\left[\left\|\epsilon_\theta(x_t, t) - \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sigma_t}\right\|^2\right]$$
and the term inside becomes a least squares regression, so if the network actually reaches the global minimum of loss, then we have
$$\epsilon_\theta(x_t, t) = \frac{x_t - \sqrt{\bar{\alpha}_t}\, E_q[x_0 \mid x_t]}{\sigma_t} = -\sigma_t \nabla_{x_t} \ln q(x_t)$$

Thus, given a good score-based network, its predicted score is a good prediction of the noise (after scaling byσt{\displaystyle \sigma _{t}}), and thus can be used for denoising.

Conversely, the continuous limit $x_{t-1} = x_{t-dt}$, $\beta_t = \beta(t)\, dt$, $z_t \sqrt{dt} = dW_t$ of the backward equation
$$x_{t-1} = \frac{x_t}{\sqrt{\alpha_t}} - \frac{\beta_t}{\sigma_t \sqrt{\alpha_t}}\, \epsilon_\theta(x_t, t) + \sqrt{\beta_t}\, z_t; \qquad z_t \sim \mathcal{N}(0, I)$$
gives us precisely the same equation as score-based diffusion:
$$x_{t-dt} = x_t(1 + \beta(t)\, dt / 2) + \beta(t)\, \nabla_{x_t} \ln q(x_t)\, dt + \sqrt{\beta(t)}\, dW_t$$
Thus, at infinitesimal steps of DDPM, a denoising network performs score-based diffusion.

Main variants


Noise schedule

[Figure: illustration of a linear diffusion noise schedule, with settings $\beta_1 = 10^{-4}$, $\beta_{1000} = 0.02$.]

In DDPM, the sequence of numbers $0 = \sigma_0 < \sigma_1 < \cdots < \sigma_T < 1$ is called a (discrete time) noise schedule. In general, consider a strictly increasing monotonic function $\sigma$ of type $\mathbb{R} \to (0, 1)$, such as the sigmoid function. In that case, a noise schedule is a sequence of real numbers $\lambda_1 < \lambda_2 < \cdots < \lambda_T$. It then defines a sequence of noises $\sigma_t := \sigma(\lambda_t)$, which then derives the other quantities $\beta_t = 1 - \frac{1 - \sigma_t^2}{1 - \sigma_{t-1}^2}$.
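For concreteness, a sketch deriving the $\beta_t$ from a $\sigma_t$ schedule; the sigmoid choice and the range of $\lambda_t$ are assumptions made for illustration:

```python
import torch

T = 1000
lam = torch.linspace(-6.0, 6.0, T)                     # lambda_1 < ... < lambda_T (assumed)
sigma = torch.sigmoid(lam)                             # sigma_t = sigma(lambda_t) in (0, 1)
sigma_prev = torch.cat([torch.zeros(1), sigma[:-1]])   # with sigma_0 = 0
beta = 1.0 - (1.0 - sigma**2) / (1.0 - sigma_prev**2)  # beta_t derived from the schedule
```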

In order to use arbitrary noise schedules, instead of training a noise prediction model $\epsilon_\theta(x_t, t)$, one trains $\epsilon_\theta(x_t, \sigma_t)$.

Similarly, for the noise conditional score network, instead of training $f_\theta(x_t, t)$, one trains $f_\theta(x_t, \sigma_t)$.

Denoising Diffusion Implicit Model (DDIM)


The original DDPM method for generating images is slow, since the forward diffusion process usually takes $T \sim 1000$ steps to make the distribution of $x_T$ appear close to Gaussian. However, this means the backward diffusion process also takes 1000 steps. Unlike the forward diffusion process, which can skip steps since $x_t \mid x_0$ is Gaussian for all $t \geq 1$, the backward diffusion process does not allow skipping steps. For example, to sample $x_{t-2} \mid x_{t-1} \sim \mathcal{N}(\mu_\theta(x_{t-1}, t-1), \Sigma_\theta(x_{t-1}, t-1))$ requires the model to first sample $x_{t-1}$. Attempting to directly sample $x_{t-2} \mid x_t$ would require us to marginalize out $x_{t-1}$, which is generally intractable.

DDIM[22] is a method to take any model trained on the DDPM loss, and use it to sample with some steps skipped, sacrificing an adjustable amount of quality. If we generalize the Markovian chain of DDPM to the non-Markovian case, DDIM corresponds to the case where the reverse process has variance equal to 0. In other words, the reverse process (and also the forward process) is deterministic. When using fewer sampling steps, DDIM outperforms DDPM.

In detail, the DDIM sampling method is as follows. Start with the forward diffusion process $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sigma_t \epsilon$. Then, during the backward denoising process, given $x_t$ and $\epsilon_\theta(x_t, t)$, the original data is estimated as
$$x_0' = \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$$
Then the backward diffusion process can jump to any step $0 \leq s < t$, and the next denoised sample is
$$x_s = \sqrt{\bar{\alpha}_s}\, x_0' + \sqrt{\sigma_s^2 - (\sigma_s')^2}\, \epsilon_\theta(x_t, t) + \sigma_s' \epsilon$$
where $\sigma_s'$ is an arbitrary real number within the range $[0, \sigma_s]$, and $\epsilon \sim \mathcal{N}(0, I)$ is a newly sampled Gaussian noise.[14] If all $\sigma_s' = 0$, then the backward process becomes deterministic, and this special case of DDIM is also called "DDIM". The original paper noted that when the process is deterministic, samples generated with only 20 steps are already very similar at a high level to ones generated with 1000 steps.

The original paper recommended defining a single "eta value" $\eta \in [0, 1]$, such that $\sigma_s' = \eta \tilde{\sigma}_s$. When $\eta = 1$, this is the original DDPM. When $\eta = 0$, this is the fully deterministic DDIM. For intermediate values, the process interpolates between them.
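A sketch of one DDIM jump from step $t$ to an earlier step $s$ under this convention; the model signature and the form used for $\tilde{\sigma}_s$ are assumptions consistent with the formulas above:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, s, alpha_bar, eta=0.0):
    """Jump x_t -> x_s with s < t. eta=0: deterministic DDIM; eta=1: DDPM-like."""
    ab_t, ab_s = alpha_bar[t], alpha_bar[s]
    sig_t, sig_s = (1 - ab_t).sqrt(), (1 - ab_s).sqrt()
    eps = model(x_t, t)
    x0 = (x_t - sig_t * eps) / ab_t.sqrt()                 # estimate x_0'
    # sigma'_s = eta * tilde{sigma}_s, where eta=1 matches the DDPM posterior std
    sig_p = eta * ((1 - ab_s) / (1 - ab_t)).sqrt() * (1 - ab_t / ab_s).sqrt()
    noise = torch.randn_like(x_t) if eta > 0 else torch.zeros_like(x_t)
    return ab_s.sqrt() * x0 + (sig_s**2 - sig_p**2).sqrt() * eps + sig_p * noise
```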

By the equivalence, the DDIM algorithm also applies for score-based diffusion models.

Latent diffusion model (LDM)

Main article: Latent diffusion model

Since the diffusion model is a general method for modelling probability distributions, if one wants to model a distribution over images, one can first encode the images into a lower-dimensional space by an encoder, then use a diffusion model to model the distribution over encoded images. Then to generate an image, one can sample from the diffusion model, then use a decoder to decode it into an image.[23]

The encoder-decoder pair is most often avariational autoencoder (VAE).

Architectural improvements


Later work[24] proposed various architectural improvements. For example, it proposed log-space interpolation during backward sampling. Instead of sampling from $x_{t-1} \sim \mathcal{N}(\tilde{\mu}_t(x_t, \tilde{x}_0), \tilde{\sigma}_t^2 I)$, it recommended sampling from $\mathcal{N}(\tilde{\mu}_t(x_t, \tilde{x}_0), (\sigma_t^v \tilde{\sigma}_t^{1-v})^2 I)$ for a learned parameter $v$.

In the v-prediction formalism, the noising formula $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t$ is reparameterised by an angle $\phi_t$ such that $\cos \phi_t = \sqrt{\bar{\alpha}_t}$, and a "velocity" defined by $\cos \phi_t\, \epsilon_t - \sin \phi_t\, x_0$. The network is trained to predict the velocity $\hat{v}_\theta$, and denoising is by
$$x_{\phi_t - \delta} = \cos(\delta)\, x_{\phi_t} - \sin(\delta)\, \hat{v}_\theta(x_{\phi_t})$$
[25] This parameterization was found to improve performance, as the model can be trained to reach total noise (i.e. $\phi_t = 90^\circ$) and then reverse it, whereas the standard parameterization never reaches total noise since $\sqrt{\bar{\alpha}_t} > 0$ is always true.[26]
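A sketch of the corresponding training target; the mean-squared-error loss form and all names here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def v_prediction_loss(model, x0, t, alpha_bar):
    """Train the network to predict v = cos(phi_t) * eps - sin(phi_t) * x0."""
    B = x0.shape[0]
    ab = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))
    cos_phi, sin_phi = ab.sqrt(), (1.0 - ab).sqrt()       # cos(phi_t) = sqrt(abar_t)
    eps = torch.randn_like(x0)
    x_t = cos_phi * x0 + sin_phi * eps                    # forward noising
    v_target = cos_phi * eps - sin_phi * x0               # the "velocity" target
    return F.mse_loss(model(x_t, t), v_target)
```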

Classifier guidance


Classifier guidance was proposed in 2021 to improve class-conditional generation by using a classifier. The original publication used CLIP text encoders to improve text-conditional image generation.[27]

Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution $p(x \mid y)$, where $x$ ranges over images, and $y$ ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).

Taking the perspective of the noisy channel model, we can understand the process as follows: to generate an image $x$ conditional on description $y$, we imagine that the requester really had in mind an image $x$, but the image is passed through a noisy channel and came out garbled, as $y$. Image generation is then nothing but inferring which $x$ the requester had in mind.

In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in the noisy-channel model, we use Bayes' theorem to get
$$p(x \mid y) \propto p(y \mid x)\, p(x)$$
In other words, if we have a good model of the space of all images, and a good image-to-class translator, we get a class-to-image translator "for free". In the equation for backward diffusion, the score $\nabla \ln p(x)$ can be replaced by
$$\nabla_x \ln p(x \mid y) = \underbrace{\nabla_x \ln p(x)}_{\text{score}} + \underbrace{\nabla_x \ln p(y \mid x)}_{\text{classifier guidance}}$$
where $\nabla_x \ln p(x)$ is the score function, trained as previously described, and $\nabla_x \ln p(y \mid x)$ is found by using a differentiable image classifier.

During the diffusion process, we need to condition on the time, giving
$$\nabla_{x_t} \ln p(x_t \mid y, t) = \nabla_{x_t} \ln p(y \mid x_t, t) + \nabla_{x_t} \ln p(x_t \mid t)$$
Usually, the classifier model does not depend on time, in which case $p(y \mid x_t, t) = p(y \mid x_t)$.

Classifier guidance is defined for the gradient of the score function, thus for score-based diffusion networks; but as previously noted, score-based diffusion models are equivalent to denoising models by $\epsilon_\theta(x_t, t) = -\sigma_t \nabla_{x_t} \ln p(x_t \mid t)$, and similarly, $\epsilon_\theta(x_t, y, t) = -\sigma_t \nabla_{x_t} \ln p(x_t \mid y, t)$. Therefore, classifier guidance works for denoising diffusion as well, using the modified noise prediction:[27]
$$\epsilon_\theta(x_t, y, t) = \epsilon_\theta(x_t, t) - \underbrace{\sigma_t \nabla_{x_t} \ln p(y \mid x_t, t)}_{\text{classifier guidance}}$$
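A sketch of this modified noise prediction; `classifier(x, t)` returning class logits is a placeholder assumption:

```python
import torch

def guided_eps(model, classifier, x_t, y, t, sigma_t):
    """eps_guided = eps_theta(x_t, t) - sigma_t * grad_x log p(y | x_t, t)."""
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        log_probs = classifier(x, t).log_softmax(dim=-1)   # log p(. | x_t, t)
        sel = log_probs[torch.arange(x.shape[0]), y].sum() # log p(y | x_t) per sample
        grad = torch.autograd.grad(sel, x)[0]              # gradient w.r.t. the image
    return model(x_t, t) - sigma_t * grad
```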

With temperature


The classifier-guided diffusion model samples from $p(x \mid y)$, which is concentrated around the maximum a posteriori estimate $\arg\max_x p(x \mid y)$. If we want to force the model to move towards the maximum likelihood estimate $\arg\max_x p(y \mid x)$, we can use
$$p_\gamma(x \mid y) \propto p(y \mid x)^\gamma\, p(x)$$
where $\gamma > 0$ is interpretable as an inverse temperature. In the context of diffusion models, it is usually called the guidance scale. A high $\gamma$ would force the model to sample from a distribution concentrated around $\arg\max_x p(y \mid x)$. This sometimes improves quality of generated images.[27]

This gives a modification to the previous equation:
$$\nabla_x \ln p_\gamma(x \mid y) = \nabla_x \ln p(x) + \gamma\, \nabla_x \ln p(y \mid x)$$
For denoising models, it corresponds to[28]
$$\epsilon_\theta(x_t, y, t) = \epsilon_\theta(x_t, t) - \gamma\, \sigma_t \nabla_{x_t} \ln p(y \mid x_t, t)$$

Classifier-free guidance (CFG)


If we do not have a classifier $p(y \mid x)$, we can still extract one out of the image model itself:[28]
$$\nabla_x \ln p_\gamma(x \mid y) = (1 - \gamma)\, \nabla_x \ln p(x) + \gamma\, \nabla_x \ln p(x \mid y)$$
Such a model is usually trained by presenting it with both $(x, y)$ and $(x, \mathrm{None})$, allowing it to model both $\nabla_x \ln p(x \mid y)$ and $\nabla_x \ln p(x)$.

Note that for CFG, the diffusion model cannot be merely a generative model of the entire data distribution $\nabla_x \ln p(x)$. It must be a conditional generative model $\nabla_x \ln p(x \mid y)$. For example, in Stable Diffusion, the diffusion backbone takes as input a noisy image $x_t$, a time $t$, and a conditioning vector $y$ (such as a vector encoding a text prompt), and produces a noise prediction $\epsilon_\theta(x_t, y, t)$.

For denoising models, it corresponds to
$$\epsilon_\theta(x_t, y, t, \gamma) = \epsilon_\theta(x_t, t) + \gamma\left(\epsilon_\theta(x_t, y, t) - \epsilon_\theta(x_t, t)\right)$$
As sampled by DDIM, the algorithm can be written as[29]
$$\begin{aligned}
\epsilon_{\text{uncond}} &\leftarrow \epsilon_\theta(x_t, t)\\
\epsilon_{\text{cond}} &\leftarrow \epsilon_\theta(x_t, t, c)\\
\epsilon_{\text{CFG}} &\leftarrow \epsilon_{\text{uncond}} + \gamma(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})\\
x_0 &\leftarrow (x_t - \sigma_t \epsilon_{\text{CFG}})/\sqrt{1 - \sigma_t^2}\\
x_s &\leftarrow \sqrt{1 - \sigma_s^2}\, x_0 + \sqrt{\sigma_s^2 - (\sigma_s')^2}\, \epsilon_{\text{uncond}} + \sigma_s' \epsilon
\end{aligned}$$
A similar technique applies to language model sampling. Also, if the unconditional generation $\epsilon_{\text{uncond}} \leftarrow \epsilon_\theta(x_t, t)$ is replaced by $\epsilon_{\text{neg cond}} \leftarrow \epsilon_\theta(x_t, t, c')$, the result is negative prompting, which pushes the generation away from the condition $c'$.[30][31]
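A sketch of one classifier-free-guided DDIM update following the algorithm above; the conditional model signature, with `None` for the unconditional branch, is an assumed convention:

```python
import torch

@torch.no_grad()
def cfg_ddim_step(model, x_t, t, s, c, sigma, gamma=7.5, sigma_prime=0.0):
    """One guided DDIM jump t -> s, mixing conditional and unconditional noise."""
    eps_uncond = model(x_t, t, None)                       # eps_theta(x_t, t)
    eps_cond = model(x_t, t, c)                            # eps_theta(x_t, t, c)
    eps_cfg = eps_uncond + gamma * (eps_cond - eps_uncond)
    x0 = (x_t - sigma[t] * eps_cfg) / (1 - sigma[t] ** 2).sqrt()
    eps_new = torch.randn_like(x_t)
    return ((1 - sigma[s] ** 2).sqrt() * x0
            + (sigma[s] ** 2 - sigma_prime ** 2) ** 0.5 * eps_uncond
            + sigma_prime * eps_new)
```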

Samplers


Given a diffusion model, one may regard it either as a continuous process and sample from it by integrating an SDE, or regard it as a discrete process and sample from it by iterating the discrete steps. The choice of the "noise schedule" $\beta_t$ can also affect the quality of samples. A noise schedule is a function that sends a natural number to a noise level:
$$t \mapsto \beta_t, \qquad t \in \{1, 2, \dots\},\; \beta_t \in (0, 1)$$
A noise schedule is more often specified by a map $t \mapsto \sigma_t$. The two definitions are equivalent, since $\beta_t = 1 - \frac{1 - \sigma_t^2}{1 - \sigma_{t-1}^2}$.

In the DDPM perspective, one can use the DDPM itself (with noise), or DDIM (with an adjustable amount of noise). The case where one adds noise is sometimes called ancestral sampling.[32] One can interpolate between noise and no noise. The amount of noise is denoted $\eta$ ("eta value") in the DDIM paper, with $\eta = 0$ denoting no noise (as in deterministic DDIM), and $\eta = 1$ denoting full noise (as in DDPM).

In the perspective of SDEs, one can use any of the numerical integration methods, such as the Euler–Maruyama method, Heun's method, linear multistep methods, etc. Just as in the discrete case, one can add an adjustable amount of noise during the integration.[33]

A survey and comparison of samplers in the context of image generation can be found in [34].

Other examples


Notable variants include[35] Poisson flow generative model,[36] consistency model,[37] critically damped Langevin diffusion,[38] GenPhys,[39] cold diffusion,[40] etc.

Flow-based diffusion model


Abstractly speaking, the idea of diffusion models is to take an unknown probability distribution (the distribution of natural-looking images), then progressively convert it to a known probability distribution (the standard Gaussian distribution), by building an absolutely continuous probability path connecting them. The probability path is in fact defined implicitly by the score function $\nabla \ln p_t$.

In denoising diffusion models, the forward process adds noise, and the backward process removes noise. Both the forward and backward processes are SDEs, though the forward process is integrable in closed form, so it can be done at no computational cost. The backward process is not integrable in closed form, so it must be integrated step-by-step by standard SDE solvers, which can be very expensive. The probability path in diffusion models is defined through an Itô process, and one can retrieve the deterministic process by using the probability flow ODE formulation.[1]

In flow-based diffusion models, the forward process is a deterministic flow along a time-dependent vector field, and the backward process is also a deterministic flow along the same vector field, but going backwards. Both processes are solutions toODEs. If the vector field is well-behaved, the ODE will also be well-behaved.

Given two distributions $\pi_0$ and $\pi_1$, a flow-based model is a time-dependent velocity field $v_t(x)$ on $[0, 1] \times \mathbb{R}^d$, such that if we start by sampling a point $x \sim \pi_0$, and let it move according to the velocity field:
$$\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)), \qquad t \in [0, 1], \qquad \text{starting from } \phi_0(x) = x$$
we end up with a point $x_1 \sim \pi_1$. The solution $\phi_t$ of the above ODE defines a probability path $p_t = [\phi_t]_\# \pi_0$ by the pushforward measure operator. In particular, $[\phi_1]_\# \pi_0 = \pi_1$.

The probability path and the velocity field also satisfy the continuity equation, in the sense of probability distributions:
$$\partial_t p_t + \nabla \cdot (v_t p_t) = 0$$
To construct a probability path, we start by constructing a conditional probability path $p_t(x \mid z)$ and the corresponding conditional velocity field $v_t(x \mid z)$, for some conditioning distribution $q(z)$. A natural choice is the Gaussian conditional probability path:
$$p_t(x \mid z) = \mathcal{N}\!\left(m_t(z), \zeta_t^2 I\right)$$
The conditional velocity field which corresponds to the geodesic path between the endpoints of the conditional Gaussian path is
$$v_t(x \mid z) = \frac{\zeta_t'}{\zeta_t}(x - m_t(z)) + m_t'(z)$$
The marginal probability path and velocity field are then computed by marginalizing:
$$p_t(x) = \int p_t(x \mid z)\, q(z)\, dz \qquad \text{and} \qquad v_t(x) = \mathbb{E}_{q(z)}\!\left[\frac{v_t(x \mid z)\, p_t(x \mid z)}{p_t(x)}\right]$$
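In practice the marginal $v_t$ is not computed explicitly; a network $v_\theta$ is regressed onto the conditional velocity. A sketch under the choices $z = x_1$ (a data sample), $m_t(z) = t z$, and $\zeta_t = 1 - (1 - \sigma_{\min}) t$, all of which are assumptions made for this example:

```python
import torch

def conditional_flow_matching_loss(v_model, x1, sigma_min=1e-4):
    """Regress v_theta(x, t) onto v_t(x|z) for the Gaussian path
    p_t(x|z) = N(t*z, zeta_t^2 I) with z = x1 and zeta_t = 1 - (1 - sigma_min)*t."""
    B = x1.shape[0]
    t = torch.rand(B).view(B, *([1] * (x1.dim() - 1)))
    zeta = 1.0 - (1.0 - sigma_min) * t
    x = t * x1 + zeta * torch.randn_like(x1)              # sample x ~ p_t(x | z)
    # v_t(x|z) = (zeta'/zeta)(x - m_t(z)) + m_t'(z), with zeta' = -(1 - sigma_min)
    v_target = (-(1.0 - sigma_min) / zeta) * (x - t * x1) + x1
    return ((v_model(x, t.flatten()) - v_target) ** 2).mean()
```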

Optimal transport flow


The idea of optimal transport flow[41] is to construct a probability path minimizing the Wasserstein metric. The distribution on which we condition is an approximation of the optimal transport plan between $\pi_0$ and $\pi_1$: $z = (x_0, x_1)$ and $q(z) = \Gamma(\pi_0, \pi_1)$, where $\Gamma$ is the optimal transport plan, which can be approximated by mini-batch optimal transport. If the batch size is not large, then the computed transport can be very far from the true optimal transport.

Rectified flow


The idea of rectified flow[42][43] is to learn a flow model such that the velocity is nearly constant along each flow path. This is beneficial, because we can integrate along such a vector field with very few steps. For example, if an ODE $\dot{\phi}_t(x) = v_t(\phi_t(x))$ follows perfectly straight paths, it simplifies to $\phi_t(x) = x_0 + t \cdot v_0(x_0)$, allowing for exact solutions in one step. In practice, we cannot reach such perfection, but when the flow field is nearly so, we can take a few large steps instead of many little steps.

[Figure: linear interpolation, rectified flow, and straightened rectified flow.[1]]

The general idea is to start with two distributions $\pi_0$ and $\pi_1$, then construct a flow field $\phi^0 = \{\phi_t : t \in [0, 1]\}$ from it, then repeatedly apply a "reflow" operation to obtain successive flow fields $\phi^1, \phi^2, \dots$, each straighter than the previous one. When the flow field is straight enough for the application, we stop.

Generally, for any time-differentiable process $\phi_t$, $v_t$ can be estimated by solving:
$$\min_\theta \int_0^1 \mathbb{E}_{x \sim p_t}\left[\left\| v_t(x, \theta) - v_t(x) \right\|^2\right] dt.$$

In rectified flow, by injecting strong priors that intermediate trajectories are straight, it can achieve both theoretical relevance for optimal transport and computational efficiency, as ODEs with straight paths can be simulated precisely without time discretization.

[Figure: transport by rectified flow.[42]]

Specifically, rectified flow seeks to match an ODE with the marginal distributions of the linear interpolation between points from distributions $\pi_0$ and $\pi_1$. Given observations $x_0 \sim \pi_0$ and $x_1 \sim \pi_1$, the canonical linear interpolation $x_t = t x_1 + (1-t) x_0$, $t \in [0, 1]$ yields a trivial case $\dot{x}_t = x_1 - x_0$, which cannot be causally simulated without $x_1$. To address this, $x_t$ is "projected" into a space of causally simulatable ODEs, by minimizing the least squares loss with respect to the direction $x_1 - x_0$:
$$\min_\theta \int_0^1 \mathbb{E}_{\pi_0, \pi_1, p_t}\left[\left\| (x_1 - x_0) - v_t(x_t) \right\|^2\right] dt.$$
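A sketch of this objective for an independently coupled batch; the model signature and the uniform sampling of $t$ are assumptions:

```python
import torch

def rectified_flow_loss(v_model, x0, x1):
    """Match v_theta(x_t, t) to the straight-line direction x1 - x0
    along the linear interpolation x_t = t * x1 + (1 - t) * x0."""
    B = x0.shape[0]
    t = torch.rand(B).view(B, *([1] * (x0.dim() - 1)))
    x_t = t * x1 + (1.0 - t) * x0
    return ((v_model(x_t, t.flatten()) - (x1 - x0)) ** 2).mean()
```

Sampling then amounts to integrating $\dot{x} = v_\theta(x, t)$ from a draw of $\pi_0$, e.g. with a handful of Euler steps.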

The data pair $(x_0, x_1)$ can be any coupling of $\pi_0$ and $\pi_1$, typically independent (i.e., $(x_0, x_1) \sim \pi_0 \times \pi_1$), obtained by randomly combining observations from $\pi_0$ and $\pi_1$. This process ensures that the trajectories closely mirror the density map of $x_t$ trajectories, but reroute at intersections to ensure causality.

[Figure: the reflow process.[42]]

A distinctive aspect of rectified flow is its capability for "reflow", which straightens the trajectory of ODE paths. Denote the rectified flow $\phi^0 = \{\phi_t : t \in [0, 1]\}$ induced from $(x_0, x_1)$ as $\phi^0 = \mathsf{Rectflow}((x_0, x_1))$. Recursively applying this $\mathsf{Rectflow}(\cdot)$ operator generates a series of rectified flows $\phi^{k+1} = \mathsf{Rectflow}((\phi_0^k(x_0), \phi_1^k(x_1)))$. This "reflow" process not only reduces transport costs but also straightens the paths of rectified flows, making $\phi^k$ paths straighter with increasing $k$.

Rectified flow includes a nonlinear extension where the linear interpolation $x_t$ is replaced with any time-differentiable curve that connects $x_0$ and $x_1$, given by $x_t = \alpha_t x_1 + \beta_t x_0$. This framework encompasses DDIM and probability flow ODEs as special cases, with particular choices of $\alpha_t$ and $\beta_t$. However, in the case where the path of $x_t$ is not straight, the reflow process no longer ensures a reduction in convex transport costs, and it no longer straightens the paths of $\phi_t$.[42]

Choice of architecture

Architecture of Stable Diffusion
The denoising process used by Stable Diffusion

Diffusion model


For generating images by DDPM, we need a neural network that takes a time $t$ and a noisy image $x_t$, and predicts the noise $\epsilon_\theta(x_t, t)$ in it. Since predicting the noise is equivalent to predicting the denoised image (one is recovered from the other by rescaling and subtracting from $x_t$), denoising architectures tend to work well. For example, the U-Net, which was found to be good for denoising images, is often used for denoising diffusion models that generate images.[44]

For DDPM, the underlying architecture ("backbone") does not have to be a U-Net. It just has to predict the noise somehow. For example, the diffusion transformer (DiT) uses a Transformer to predict the mean and diagonal covariance of the noise, given the textual conditioning and the partially denoised image. It is the same as a standard U-Net-based denoising diffusion model, with a Transformer replacing the U-Net.[45] Mixture-of-experts Transformers can also be applied.[46]

DDPM can be used to model general data distributions, not just natural-looking images. For example, Human Motion Diffusion[47] models human motion trajectories by DDPM. Each human motion trajectory is a sequence of poses, represented by either joint rotations or positions. It uses a Transformer network to generate a less noisy trajectory out of a noisy one.

Conditioning


The base diffusion model can only generate unconditionally from the whole distribution. For example, a diffusion model learned on ImageNet would generate images that look like a random image from ImageNet. To generate images from just one category, one would need to impose the condition, and then sample from the conditional distribution. Whatever condition one wants to impose, one needs to first convert the conditioning into a vector of floating point numbers, then feed it into the underlying diffusion model neural network. However, one has freedom in choosing how to convert the conditioning into a vector.

Stable Diffusion, for example, imposes conditioning in the form of a cross-attention mechanism, where the query is an intermediate representation of the image in the U-Net, and both key and value are the conditioning vectors. The conditioning can be selectively applied to only parts of an image, and new kinds of conditionings can be finetuned upon the base model, as used in ControlNet.[48]

As a particularly simple example, consider image inpainting. The conditions are $\tilde{x}$, the reference image, and $m$, the inpainting mask. The conditioning is imposed at each step of the backward diffusion process, by first sampling $\tilde{x}_t \sim \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, \tilde{x}, \sigma_t^2 I\right)$, a noisy version of $\tilde{x}$, then replacing $x_t$ with $(1-m) \odot x_t + m \odot \tilde{x}_t$, where $\odot$ means elementwise multiplication.[49] Another application of the cross-attention mechanism is prompt-to-prompt image editing.[50]
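A sketch of this replacement step as it would sit inside a sampling loop; the convention that `mask` is 1 on the known region is an assumption:

```python
import torch

def inpaint_condition(x_t, x_ref, mask, t, alpha_bar):
    """Overwrite the known region of x_t with a matched-noise copy of the reference."""
    ab = alpha_bar[t]
    x_ref_t = ab.sqrt() * x_ref + (1.0 - ab).sqrt() * torch.randn_like(x_ref)
    return (1.0 - mask) * x_t + mask * x_ref_t             # keep generation elsewhere
```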

Conditioning is not limited to generating images from a specific category, or according to a specific caption (as in text-to-image). For example, one study[47] demonstrated generating human motion conditioned on an audio clip of human walking (allowing syncing motion to a soundtrack), on a video of human running, or on a text description of human motion. For how conditional diffusion models are mathematically formulated, see the methodological summary in [51].

Upscaling


As generating an image takes a long time, one can try to generate a small image by a base diffusion model, then upscale it by other models. Upscaling can be done by GAN,[52] Transformer,[53] or signal processing methods like Lanczos resampling.

Diffusion models themselves can be used to perform upscaling. A cascading diffusion model stacks multiple diffusion models one after another, in the style of Progressive GAN. The lowest level is a standard diffusion model that generates a 32×32 image; the image is then upscaled by a diffusion model specifically trained for upscaling, and the process repeats.[44]

In more detail, the diffusion upscaler is trained on pairs of low- and high-resolution images, with each denoising step for the high-resolution image conditioned on a noised copy of the low-resolution image.[44]

Examples


This section collects some notable diffusion models, and briefly describes their architecture.

OpenAI

Main articles: DALL-E and Sora (text-to-video model)

The DALL-E series by OpenAI are text-conditional diffusion models of images.

The first version of DALL-E (2021) is not actually a diffusion model. Instead, it uses a Transformer architecture that autoregressively generates a sequence of tokens, which is then converted to an image by the decoder of a discrete VAE. Released with DALL-E was the CLIP classifier, which was used by DALL-E to rank generated images according to how close the image fits the text.

GLIDE (2022-03)[54] is a 3.5-billion-parameter diffusion model, and a small version was released publicly.[5] Soon after, DALL-E 2 was released (2022-04).[55] DALL-E 2 is a 3.5-billion-parameter cascaded diffusion model that generates images from text by "inverting the CLIP image encoder", a technique termed "unCLIP".

The unCLIP method contains 4 models: a CLIP image encoder, a CLIP text encoder, an image decoder, and a "prior" model (which can be a diffusion model, or an autoregressive model). During training, the prior model is trained to convert CLIP image encodings to CLIP text encodings. The image decoder is trained to convert CLIP image encodings back to images. During inference, a text is converted by the CLIP text encoder to a vector, then it is converted by the prior model to an image encoding, then it is converted by the image decoder to an image.

Sora (2024-02) is a diffusion Transformer model (DiT).

Stability AI

Main article: Stable Diffusion

Stable Diffusion (2022-08), released by Stability AI, consists of a denoising latent diffusion model (860 million parameters), a VAE, and a text encoder. The denoising network is a U-Net, with cross-attention blocks to allow for conditional image generation.[56][23]

Stable Diffusion 3 (2024-03)[57] replaced the U-Net backbone of the latent diffusion model with a Transformer, making it a DiT. It is trained with rectified flow.

Stable Video 4D (2024-07)[58] is a latent diffusion model for videos of 3D objects.

Google


Imagen (2022)[59][60] uses a T5-XXL language model to encode the input text into an embedding vector. It is a cascaded diffusion model with three sub-models. The first denoises white noise into a 64×64 image, conditional on the embedding vector of the text; this model has 2B parameters. The second upscales the image from 64×64 to 256×256, conditional on the embedding; this model has 650M parameters. The third is similar, upscaling from 256×256 to 1024×1024; this model has 400M parameters. The three denoising networks are all U-Nets.

Muse (2023-01)[61] is not a diffusion model, but an encoder-only Transformer that is trained to predict masked image tokens from unmasked image tokens.

Imagen 2 (2023-12) is also diffusion-based. It can generate images based on a prompt that mixes images and text; no further details have been published.[62] The same holds for Imagen 3 (2024-05).[63]

Veo (2024) generates videos by latent diffusion. The diffusion is conditioned on a vector that encodes both a text prompt and an image prompt.[64]

Meta


Make-A-Video (2022) is a text-to-video diffusion model.[65][66]

CM3leon (2023) is not a diffusion model, but an autoregressive causally masked Transformer, with mostly the same architecture as LLaMa-2.[67][68]

[Figure: Transfusion architectural diagram]

Transfusion (2024) is a Transformer that combines autoregressive text generation and denoising diffusion. Specifically, it generates text autoregressively (with causal masking), and generates images by denoising multiple times over image tokens (with all-to-all attention).[69]
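
The following is a minimal sketch of the attention pattern this implies, assuming tokens are laid out in a single sequence with contiguous image blocks; the layout and helper function are illustrative assumptions, not Transfusion's actual implementation.

    import numpy as np

    def transfusion_mask(is_image):
        # is_image: boolean array, True where a token belongs to an image block
        n = len(is_image)
        mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline for text
        i = 0
        while i < n:
            if is_image[i]:
                j = i
                while j < n and is_image[j]:
                    j += 1
                mask[i:j, i:j] = True  # all-to-all attention inside an image block
                i = j
            else:
                i += 1
        return mask  # mask[q, k] == True: query position q may attend to key position k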

Movie Gen (2024) is a series of Diffusion Transformers operating in latent space and trained with flow matching.[70]


References

  1. ^abcdSong, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations".arXiv:2011.13456 [cs.LG].
  2. ^Croitoru, Florinel-Alin; Hondru, Vlad; Ionescu, Radu Tudor; Shah, Mubarak (2023). "Diffusion Models in Vision: A Survey".IEEE Transactions on Pattern Analysis and Machine Intelligence.45 (9):10850–10869.arXiv:2209.04747.Bibcode:2023ITPAM..4510850C.doi:10.1109/TPAMI.2023.3261988.PMID 37030794.S2CID 252199918.
  3. ^abHo, Jonathan; Jain, Ajay; Abbeel, Pieter (2020)."Denoising Diffusion Probabilistic Models".Advances in Neural Information Processing Systems.33. Curran Associates, Inc.:6840–6851.
  4. ^Gu, Shuyang; Chen, Dong; Bao, Jianmin; Wen, Fang; Zhang, Bo; Chen, Dongdong; Yuan, Lu; Guo, Baining (2021). "Vector Quantized Diffusion Model for Text-to-Image Synthesis".arXiv:2111.14822 [cs.CV].
  5. ^abGLIDE, OpenAI, 2023-09-22, retrieved 2023-09-24
  6. ^Li, Yifan; Zhou, Kun; Zhao, Wayne Xin; Wen, Ji-Rong (August 2023)."Diffusion Models for Non-autoregressive Text Generation: A Survey".Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization. pp. 6692–6701.arXiv:2303.06574.doi:10.24963/ijcai.2023/750.ISBN 978-1-956792-03-4.
  7. ^Xu, Weijie; Hu, Wenxiang; Wu, Fanyou; Sengamedu, Srinivasan (2023)."DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM".Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA, USA: Association for Computational Linguistics:9040–9057.arXiv:2310.15296.doi:10.18653/v1/2023.findings-emnlp.606.
  8. ^Zhang, Haopeng; Liu, Xiao; Zhang, Jiawei (2023)."DiffuSum: Generation Enhanced Extractive Summarization with Diffusion".Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg, PA, USA: Association for Computational Linguistics:13089–13100.arXiv:2305.01735.doi:10.18653/v1/2023.findings-acl.828.
  9. ^Yang, Dongchao; Yu, Jianwei; Wang, Helin; Wang, Wen; Weng, Chao; Zou, Yuexian; Yu, Dong (2023)."Diffsound: Discrete Diffusion Model for Text-to-Sound Generation".IEEE/ACM Transactions on Audio, Speech, and Language Processing.31:1720–1733.arXiv:2207.09983.Bibcode:2023ITASL..31.1720Y.doi:10.1109/taslp.2023.3268730.ISSN 2329-9290.
  10. ^Janner, Michael; Du, Yilun; Tenenbaum, Joshua B.; Levine, Sergey (2022-12-20). "Planning with Diffusion for Flexible Behavior Synthesis".arXiv:2205.09991 [cs.LG].
  11. ^Chi, Cheng; Xu, Zhenjia; Feng, Siyuan; Cousineau, Eric; Du, Yilun; Burchfiel, Benjamin; Tedrake, Russ; Song, Shuran (2024-03-14). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion".arXiv:2303.04137 [cs.RO].
  12. ^Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01)."Deep Unsupervised Learning using Nonequilibrium Thermodynamics"(PDF).Proceedings of the 32nd International Conference on Machine Learning.37. PMLR:2256–2265.arXiv:1503.03585.
  13. ^Ho, Jonathan (Jun 20, 2020), hojonathanho/diffusion, retrieved 2024-09-07
  14. ^abWeng, Lilian (2021-07-11)."What are Diffusion Models?".lilianweng.github.io. Retrieved 2023-09-24.
  15. ^"Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song".yang-song.net. Retrieved2023-09-24.
  16. ^abSong, Yang; Ermon, Stefano (2019)."Generative Modeling by Estimating Gradients of the Data Distribution".Advances in Neural Information Processing Systems.32. Curran Associates, Inc.arXiv:1907.05600.
  17. ^Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations".arXiv:2011.13456 [cs.LG].
  18. ^ermongroup/ncsn, ermongroup, 2019, retrieved 2024-09-07
  19. ^"Sliced Score Matching: A Scalable Approach to Density and Score Estimation | Yang Song".yang-song.net. Retrieved2023-09-24.
  20. ^Anderson, Brian D.O. (May 1982)."Reverse-time diffusion equation models".Stochastic Processes and Their Applications.12 (3):313–326.doi:10.1016/0304-4149(82)90051-5.ISSN 0304-4149.
  21. ^Luo, Calvin (2022). "Understanding Diffusion Models: A Unified Perspective".arXiv:2208.11970v1 [cs.LG].
  22. ^Song, Jiaming; Meng, Chenlin; Ermon, Stefano (3 Oct 2023). "Denoising Diffusion Implicit Models".arXiv:2010.02502 [cs.LG].
  23. ^abRombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (13 April 2022). "High-Resolution Image Synthesis With Latent Diffusion Models".arXiv:2112.10752 [cs.CV].
  24. ^Nichol, Alexander Quinn; Dhariwal, Prafulla (2021-07-01)."Improved Denoising Diffusion Probabilistic Models".Proceedings of the 38th International Conference on Machine Learning. PMLR:8162–8171.
  25. ^Salimans, Tim; Ho, Jonathan (2021-10-06).Progressive Distillation for Fast Sampling of Diffusion Models. The Tenth International Conference on Learning Representations (ICLR 2022).
  26. ^Lin, Shanchuan; Liu, Bingchen; Li, Jiashi; Yang, Xiao (2024).Common Diffusion Noise Schedules and Sample Steps Are Flawed. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 5404–5411.
  27. ^abcDhariwal, Prafulla; Nichol, Alex (2021-06-01). "Diffusion Models Beat GANs on Image Synthesis".arXiv:2105.05233 [cs.LG].
  28. ^abHo, Jonathan; Salimans, Tim (2022-07-25). "Classifier-Free Diffusion Guidance".arXiv:2207.12598 [cs.LG].
  29. ^Chung, Hyungjin; Kim, Jeongsol; Park, Geon Yeong; Nam, Hyelin; Ye, Jong Chul (2024-06-12). "CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models".arXiv:2406.08070 [cs.CV].
  30. ^Sanchez, Guillaume; Fan, Honglu; Spangher, Alexander; Levi, Elad; Ammanamanchi, Pawan Sasanka; Biderman, Stella (2023-06-30). "Stay on topic with Classifier-Free Guidance".arXiv:2306.17806 [cs.CL].
  31. ^Armandpour, Mohammadreza; Sadeghian, Ali; Zheng, Huangjie; Sadeghian, Amir; Zhou, Mingyuan (2023-04-26). "Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond".arXiv:2304.04968 [cs.CV].
  32. ^Yang, Ling; Zhang, Zhilong; Song, Yang; Hong, Shenda; Xu, Runsheng; Zhao, Yue; Zhang, Wentao; Cui, Bin;Yang, Ming-Hsuan (2022). "Diffusion Models: A Comprehensive Survey of Methods and Applications".arXiv:2206.00364 [cs.CV].
  33. ^Shi, Jiaxin; Han, Kehang; Wang, Zhe; Doucet, Arnaud; Titsias, Michalis K. (2024). "Simplified and Generalized Masked Diffusion for Discrete Data".arXiv:2406.04329 [cs.LG].
  34. ^Karras, Tero; Aittala, Miika; Aila, Timo; Laine, Samuli (2022). "Elucidating the Design Space of Diffusion-Based Generative Models".arXiv:2206.00364v2 [cs.CV].
  35. ^Cao, Hanqun; Tan, Cheng; Gao, Zhangyang; Xu, Yilun; Chen, Guangyong; Heng, Pheng-Ann; Li, Stan Z. (July 2024). "A Survey on Generative Diffusion Models".IEEE Transactions on Knowledge and Data Engineering.36 (7):2814–2830.Bibcode:2024ITKDE..36.2814C.doi:10.1109/TKDE.2024.3361474.ISSN 1041-4347.
  36. ^Xu, Yilun; Liu, Ziming; Tian, Yonglong; Tong, Shangyuan; Tegmark, Max; Jaakkola, Tommi (2023-07-03)."PFGM++: Unlocking the Potential of Physics-Inspired Generative Models".Proceedings of the 40th International Conference on Machine Learning. PMLR:38566–38591.arXiv:2302.04265.
  37. ^Song, Yang; Dhariwal, Prafulla; Chen, Mark; Sutskever, Ilya (2023-07-03)."Consistency Models".Proceedings of the 40th International Conference on Machine Learning. PMLR:32211–32252.
  38. ^Dockhorn, Tim; Vahdat, Arash; Kreis, Karsten (2021-10-06). "Score-Based Generative Modeling with Critically-Damped Langevin Diffusion".arXiv:2112.07068 [stat.ML].
  39. ^Liu, Ziming; Luo, Di; Xu, Yilun; Jaakkola, Tommi; Tegmark, Max (2023-04-05). "GenPhys: From Physical Processes to Generative Models".arXiv:2304.02637 [cs.LG].
  40. ^Bansal, Arpit; Borgnia, Eitan; Chu, Hong-Min; Li, Jie; Kazemi, Hamid; Huang, Furong; Goldblum, Micah; Geiping, Jonas; Goldstein, Tom (2023-12-15)."Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise".Advances in Neural Information Processing Systems.36:41259–41282.arXiv:2208.09392.
  41. ^Tong, Alexander; Fatras, Kilian; Malkin, Nikolay; Huguet, Guillaume; Zhang, Yanlei; Rector-Brooks, Jarrid; Wolf, Guy; Bengio, Yoshua (2023-11-08)."Improving and generalizing flow-based generative models with minibatch optimal transport".Transactions on Machine Learning Research.arXiv:2302.00482.ISSN 2835-8856.
  42. ^abcdLiu, Xingchao; Gong, Chengyue; Liu, Qiang (2022-09-07). "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow".arXiv:2209.03003 [cs.LG].
  43. ^Liu, Qiang (2022-09-29). "Rectified Flow: A Marginal Preserving Approach to Optimal Transport".arXiv:2209.14577 [stat.ML].
  44. ^abcHo, Jonathan; Saharia, Chitwan; Chan, William; Fleet, David J.; Norouzi, Mohammad; Salimans, Tim (2022-01-01)."Cascaded diffusion models for high fidelity image generation".The Journal of Machine Learning Research.23 (1): 47:2249–47:2281.arXiv:2106.15282.ISSN 1532-4435.
  45. ^Peebles, William; Xie, Saining (March 2023). "Scalable Diffusion Models with Transformers".arXiv:2212.09748v2 [cs.CV].
  46. ^Fei, Zhengcong; Fan, Mingyuan; Yu, Changqian; Li, Debang; Huang, Junshi (2024-07-16). "Scaling Diffusion Transformers to 16 Billion Parameters".arXiv:2407.11633 [cs.CV].
  47. ^abTevet, Guy; Raab, Sigal; Gordon, Brian; Shafir, Yonatan; Cohen-Or, Daniel; Bermano, Amit H. (2022). "Human Motion Diffusion Model".arXiv:2209.14916 [cs.CV].
  48. ^Zhang, Lvmin; Rao, Anyi; Agrawala, Maneesh (2023). "Adding Conditional Control to Text-to-Image Diffusion Models".arXiv:2302.05543 [cs.CV].
  49. ^Lugmayr, Andreas; Danelljan, Martin; Romero, Andres; Yu, Fisher; Timofte, Radu; Van Gool, Luc (2022). "RePaint: Inpainting Using Denoising Diffusion Probabilistic Models".arXiv:2201.09865v4 [cs.CV].
  50. ^Hertz, Amir; Mokady, Ron; Tenenbaum, Jay; Aberman, Kfir; Pritch, Yael; Cohen-Or, Daniel (2022-08-02). "Prompt-to-Prompt Image Editing with Cross Attention Control".arXiv:2208.01626 [cs.CV].
  51. ^Zhao, Zheng; Luo, Ziwei; Sjölund, Jens; Schön, Thomas B. (2024). "Conditional sampling within generative diffusion models".arXiv:2409.09650 [stat.ML].
  52. ^Wang, Xintao; Xie, Liangbin; Dong, Chao; Shan, Ying (2021)."Real-ESRGAN: Training Real-World Blind Super-Resolution With Pure Synthetic Data"(PDF).Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021. International Conference on Computer Vision. pp. 1905–1914.arXiv:2107.10833.
  53. ^Liang, Jingyun; Cao, Jiezhang; Sun, Guolei; Zhang, Kai; Van Gool, Luc; Timofte, Radu (2021)."SwinIR: Image Restoration Using Swin Transformer"(PDF).Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. International Conference on Computer Vision, 2021. pp. 1833–1844.arXiv:2108.10257v1.
  54. ^Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022-03-08). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models".arXiv:2112.10741 [cs.CV].
  55. ^Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-12). "Hierarchical Text-Conditional Image Generation with CLIP Latents".arXiv:2204.06125 [cs.CV].
  56. ^Alammar, Jay."The Illustrated Stable Diffusion".jalammar.github.io. Retrieved 2022-10-31.
  57. ^Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (2024-03-05). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis".arXiv:2403.03206 [cs.CV].
  58. ^Xie, Yiming; Yao, Chun-Han; Voleti, Vikram; Jiang, Huaizu; Jampani, Varun (2024-07-24). "SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency".arXiv:2407.17470 [cs.CV].
  59. ^"Imagen: Text-to-Image Diffusion Models".imagen.research.google. Retrieved2024-04-04.
  60. ^Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily L.; Ghasemipour, Kamyar; Gontijo Lopes, Raphael; Karagol Ayan, Burcu; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (2022-12-06)."Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding".Advances in Neural Information Processing Systems.35:36479–36494.arXiv:2205.11487.
  61. ^Chang, Huiwen; Zhang, Han; Barber, Jarred; Maschinot, A. J.; Lezama, Jose; Jiang, Lu; Yang, Ming-Hsuan; Murphy, Kevin; Freeman, William T. (2023-01-02). "Muse: Text-To-Image Generation via Masked Generative Transformers".arXiv:2301.00704 [cs.CV].
  62. ^"Imagen 2 - our most advanced text-to-image technology".Google DeepMind. Retrieved2024-04-04.
  63. ^Imagen-Team-Google; Baldridge, Jason; Bauer, Jakob; Bhutani, Mukul; Brichtova, Nicole; Bunner, Andrew; Castrejon, Lluis; Chan, Kelvin; Chen, Yichang (2024-12-13), Imagen 3, arXiv:2408.07009
  64. ^"Veo".Google DeepMind. 2024-05-14. Retrieved2024-05-17.
  65. ^"Introducing Make-A-Video: An AI system that generates videos from text".ai.meta.com. Retrieved2024-09-20.
  66. ^Singer, Uriel; Polyak, Adam; Hayes, Thomas; Yin, Xi; An, Jie; Zhang, Songyang; Hu, Qiyuan; Yang, Harry; Ashual, Oron (2022-09-29). "Make-A-Video: Text-to-Video Generation without Text-Video Data".arXiv:2209.14792 [cs.CV].
  67. ^"Introducing CM3leon, a more efficient, state-of-the-art generative model for text and images".ai.meta.com. Retrieved2024-09-20.
  68. ^Chameleon Team (2024-05-16). "Chameleon: Mixed-Modal Early-Fusion Foundation Models".arXiv:2405.09818 [cs.CL].
  69. ^Zhou, Chunting; Yu, Lili; Babu, Arun; Tirumala, Kushal; Yasunaga, Michihiro; Shamis, Leonid; Kahn, Jacob; Ma, Xuezhe; Zettlemoyer, Luke (2024-08-20). "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model".arXiv:2408.11039 [cs.AI].
  70. ^Movie Gen: A Cast of Media Foundation Models, The Movie Gen team @ Meta, October 4, 2024.