Latent diffusion model

From Wikipedia, the free encyclopedia

Diffusion model over latent embedding space

Latent Diffusion Model
Original author	CompVis
Initial release	December 20, 2021
Repository	github.com/CompVis/latent-diffusion
Written in	Python
Type	Generative model, diffusion model
License	MIT

The Latent Diffusion Model (LDM)^[1] is a diffusion model architecture developed by the CompVis (Computer Vision & Learning)^[2] group at LMU Munich.^[3]

Introduced in 2015, diffusion models (DMs) are trained with the objective of removing successive applications of noise (commonly Gaussian) on training images. The LDM improves on the standard DM by performing diffusion modeling in a latent space, and by allowing self-attention and cross-attention conditioning.

LDMs are widely used in practical diffusion models. For instance, Stable Diffusion versions 1.1 to 2.1 were based on the LDM architecture.^[4]

Version history

Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion.^[5] The paper was accompanied by a software implementation in Theano.^[6]

A 2019 paper proposed the noise conditional score network (NCSN), or score-matching with Langevin dynamics (SMLD).^[7] The paper was accompanied by a software package written in PyTorch and released on GitHub.^[8]

A 2020 paper^[9] proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by using variational inference. The paper was accompanied by a software package written in TensorFlow and released on GitHub.^[10] It was reimplemented in PyTorch by lucidrains.^[11]^[12]

On December 20, 2021, the LDM paper was published on arXiv,^[13] and both the Stable Diffusion^[14] and LDM^[15] repositories were published on GitHub. However, the two repositories remained roughly the same. Substantial information concerning Stable Diffusion v1 was only added to GitHub on August 10, 2022.^[16]

All Stable Diffusion (SD) versions from 1.1 to XL were particular instantiations of the LDM architecture.

SD 1.1 to 1.4 were released by CompVis in August 2022. There is no "version 1.0". SD 1.1 was an LDM trained on the laion2B-en dataset. SD 1.1 was finetuned to 1.2 on more aesthetic images. SD 1.2 was finetuned to 1.3, 1.4, and 1.5, with 10% of text-conditioning dropped to improve classifier-free guidance.^[17]^[18] SD 1.5 was released by RunwayML in October 2022.^[18]

Architecture

While the LDM can work for generating arbitrary data conditional on arbitrary data, for concreteness, we describe its operation in conditional text-to-image generation.

The LDM consists of a variational autoencoder (VAE), a modified U-Net, and a text encoder.

The VAE encoder compresses the image from pixel space to a lower-dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output of forward diffusion to recover a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space.^[4]

The denoising step can be conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism.^[4] For conditioning on text, a fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space.^[3]

Variational Autoencoder

To compress the image data, a variational autoencoder (VAE) is first trained on a dataset of images. The encoder part of the VAE takes an image as input and outputs a lower-dimensional latent representation of the image. This latent representation is then used as input to the U-Net. Once the model is trained, the encoder is used to encode images into latent representations, and the decoder is used to decode latent representations back into images.

Let the encoder and the decoder of the VAE be $E$ and $D$, respectively.

To encode an RGB image, its three channels are divided by the maximum value, resulting in a tensor $x$ of shape $(3,512,512)$ with all entries within the range $[0,1]$. The encoded vector is $0.18215\times E(2x-1)$, with shape $(4,64,64)$, where 0.18215 is a hyperparameter that the original authors picked to whiten the encoded vector to roughly unit variance. Conversely, given a latent tensor $y$, the decoded image is $(D(y/0.18215)+1)/2$, then clipped to the range $[0,1]$.^[19]^[20]
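
The scaling arithmetic above can be sketched in NumPy. This is only an illustration of the pre- and post-processing: `toy_encoder` and `toy_decoder` are hypothetical stand-ins for $E$ and $D$ (plain 8x8 average pooling and nearest-neighbour upsampling), chosen so that the shapes match the text, and they are not the trained VAE.

```python
import numpy as np

SCALE = 0.18215  # latent scaling factor quoted in the text

def toy_encoder(x):
    """Hypothetical stand-in for E: (3, H, W) -> (4, H/8, W/8) via 8x8 average pooling."""
    c, h, w = x.shape
    pooled = x.reshape(c, h // 8, 8, w // 8, 8).mean(axis=(2, 4))
    # Append the channel mean as a 4th channel so the latent has 4 channels.
    return np.concatenate([pooled, pooled.mean(axis=0, keepdims=True)], axis=0)

def toy_decoder(z):
    """Hypothetical stand-in for D: (4, h, w) -> (3, 8h, 8w) via nearest-neighbour upsampling."""
    return z[:3].repeat(8, axis=1).repeat(8, axis=2)

def image_to_latent(pixels_uint8):
    x = pixels_uint8.astype(np.float64) / 255.0   # divide by the maximum value -> [0, 1]
    return SCALE * toy_encoder(2.0 * x - 1.0)     # map to [-1, 1], encode, then scale

def latent_to_image(y):
    x = (toy_decoder(y / SCALE) + 1.0) / 2.0      # undo the scaling, decode, map back
    return np.clip(x, 0.0, 1.0)                   # clip to the valid pixel range

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 512, 512), dtype=np.uint8)
latent = image_to_latent(pixels)
image = latent_to_image(latent)
print(latent.shape, image.shape)
```

With a real VAE only the two stand-in functions would change; the division by 255, the map to $[-1,1]$, the factor 0.18215, and the final clipping stay the same.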

In the implemented version,^[3]^{: ldm/models/autoencoder.py} the encoder is a convolutional neural network (CNN) with a single self-attention mechanism near the end. It takes a tensor of shape $(3,H,W)$ and outputs a tensor of shape $(8,H/8,W/8)$, being the concatenation of the predicted mean and variance of the latent vector, each of shape $(4,H/8,W/8)$. The variance is used in training, but after training, usually only the mean is taken, with the variance discarded.

The decoder is also a CNN with a single self-attention mechanism near the end. It takes a tensor of shape $(4,H/8,W/8)$ and outputs a tensor of shape $(3,H,W)$ .
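
As a rough sketch of how the 8-channel encoder output might be split and used, the snippet below separates it into two 4-channel halves and either samples from the resulting Gaussian (as in training) or keeps only the mean (as is usual afterwards). It assumes that the second half stores the log-variance, a common VAE convention that the text above does not state; the function names are hypothetical, and the random `moments` array merely stands in for a real encoder output.

```python
import numpy as np

def split_moments(moments):
    """Split an (8, h, w) encoder output into two (4, h, w) halves."""
    mean, logvar = np.split(moments, 2, axis=0)
    return mean, logvar

def sample_latent(moments, rng, use_mean_only=False):
    mean, logvar = split_moments(moments)
    if use_mean_only:
        # Typical after training: keep the mean and discard the variance.
        return mean
    # Assumption: the second half is a log-variance, so std = exp(logvar / 2).
    std = np.exp(0.5 * logvar)
    return mean + std * rng.standard_normal(mean.shape)

H, W = 512, 512
rng = np.random.default_rng(0)
moments = rng.standard_normal((8, H // 8, W // 8))  # stand-in for the encoder output
z_train = sample_latent(moments, rng)
z_infer = sample_latent(moments, rng, use_mean_only=True)
print(z_train.shape, z_infer.shape)
```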

U-Net

The U-Net backbone takes the following kinds of inputs:

- A latent image array, produced by the VAE encoder. It has dimensions $({\text{channel}},{\text{width}},{\text{height}})$. Typically, $({\text{channel}},{\text{width}},{\text{height}})=(4,64,64)$.
- A timestep-embedding vector, which tells the backbone how much noise there is in the image. For example, an embedding of timestep $t=0$ would indicate that the input image is already noiseless, while $t=100$ would mean there is much noise.
- A modality-embedding vector sequence, which informs the backbone of additional conditions for denoising. For example, in text-to-image generation, the text is divided into a sequence of tokens, then encoded by a text encoder, such as a CLIP encoder, before being fed into the backbone. As another example, an input image can be processed by a Vision Transformer into a sequence of vectors, which can then be used to condition the backbone for tasks such as generating an image in the same style.
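
To make the timestep embedding concrete, here is a minimal sketch of one widely used construction, the sinusoidal embedding from the Transformer literature. The text above does not say which embedding the LDM uses, so the sinusoidal form, the function name, and the values of `dim` and `max_period` are all assumptions made for illustration.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep to a vector of `dim` cosines and sines (dim must be even)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb_clean = timestep_embedding(0, 128)    # t = 0: input is already noiseless
emb_noisy = timestep_embedding(100, 128)  # t = 100: much noise
print(emb_clean.shape, emb_noisy.shape)
```

Different timesteps yield different vectors, which is what lets the backbone adapt its behaviour to the current noise level.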

Each run through the U-Net backbone produces a predicted noise vector. This noise vector is scaled down and subtracted away from the latent image array, resulting in a slightly less noisy latent image. The denoising is repeated according to a denoising schedule ("noise schedule"), and the output of the last step is processed by the VAE decoder into a finished image.
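
The repeated "predict noise, scale, subtract" loop can be sketched as follows. The text does not give the exact update rule, so this sketch assumes the DDPM-style update with a linear noise schedule. The U-Net is replaced by a hypothetical oracle predictor that is exact for a training set containing a single latent `target`, which makes the behaviour easy to check; the final VAE decoding step is omitted.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice of noise schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def make_oracle_predictor(target, alpha_bars):
    """Hypothetical stand-in for the U-Net: the exact noise predictor
    when the training set consists only of `target`."""
    def predict_noise(z, t, cond=None):
        return (z - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])
    return predict_noise

def denoise(z, predict_noise, schedule, rng, cond=None):
    betas, alphas, alpha_bars = schedule
    for t in reversed(range(len(betas))):
        eps = predict_noise(z, t, cond)
        # Scale the predicted noise and subtract it from the latent.
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # DDPM re-injects a small amount of fresh noise except at the last step.
            z = z + np.sqrt(betas[t]) * rng.standard_normal(z.shape)
    return z

rng = np.random.default_rng(0)
schedule = make_schedule(num_steps=50)
target = rng.standard_normal((4, 64, 64))
predictor = make_oracle_predictor(target, schedule[2])
z_T = rng.standard_normal((4, 64, 64))          # start from pure Gaussian noise
z_0 = denoise(z_T, predictor, schedule, rng)
print(np.abs(z_0 - target).max())               # close to zero for this oracle
```

In a real pipeline `predict_noise` would be the trained U-Net called with the latent, the timestep embedding, and the conditioning sequence, and `z_0` would then be passed to the VAE decoder.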

Figure: A single cross-attention mechanism as it appears in a standard Transformer language model.

Figure: Block diagram for the full Transformer architecture. The stack on the right is a standard pre-LN Transformer decoder, which is essentially the same as the `SpatialTransformer`.

Similar to the standard U-Net, the U-Net backbone used in SD 1.5 is essentially composed of down-scaling layers followed by up-scaling layers. However, the U-Net backbone has additional modules that allow it to handle the embeddings. As an illustration, we describe a single down-scaling layer in the backbone:

- The latent array and the time-embedding are processed by a ResBlock:
  - The latent array is processed by a convolutional layer.
  - The time-embedding vector is processed by a one-layered feedforward network, then added to the previous array (broadcast over all pixels).
  - This is processed by another convolutional layer, then another time-embedding.
- The latent array and the embedding vector sequence are processed by a SpatialTransformer, which is essentially a standard pre-LN Transformer decoder without causal masking.
  - In the cross-attentional blocks, the latent array itself serves as the query sequence, one query-vector per pixel. For example, if, at this layer in the U-Net, the latent array has dimensions $(128,32,32)$, then the query sequence has $1024$ vectors, each of which has $128$ dimensions. The embedding vector sequence serves as both the key sequence and as the value sequence.
  - When no embedding vector sequence is input, a cross-attentional block defaults to self-attention, with the latent array serving as the query, key, and value.^[21]^{: line 251}
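
The query, key, and value shapes described above can be checked with a small NumPy sketch. It is a single-head simplification with random projection matrices, not the implementation from the repository. The conditioning sequence of 77 vectors of size 768 and the head width of 64 are assumed values chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, context, w_q, w_k, w_v):
    """Single-head attention: latent pixels are queries, `context` gives keys and values."""
    channels, height, width = latent.shape
    queries = latent.reshape(channels, height * width).T   # one query vector per pixel
    if context is None:
        context = queries                                   # fall back to self-attention
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # (pixels, context length)
    return weights, weights @ v

rng = np.random.default_rng(0)
latent = rng.standard_normal((128, 32, 32))   # the example shape from the text
text = rng.standard_normal((77, 768))         # assumed conditioning sequence
d = 64                                        # assumed head width
w_q = 0.02 * rng.standard_normal((128, d))
w_k = 0.02 * rng.standard_normal((768, d))
w_v = 0.02 * rng.standard_normal((768, d))
weights, out = cross_attention(latent, text, w_q, w_k, w_v)

# Self-attention fallback: keys and values are projected from the latent itself.
w_k_self = 0.02 * rng.standard_normal((128, d))
w_v_self = 0.02 * rng.standard_normal((128, d))
self_weights, self_out = cross_attention(latent, None, w_q, w_k_self, w_v_self)
print(weights.shape, out.shape, self_weights.shape)
```

The $(128,32,32)$ latent yields 1024 queries of 128 dimensions each, so the attention matrix has one row per pixel and one column per conditioning vector (or per pixel in the self-attention case).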

In pseudocode,

    def ResBlock(x, time):
        x_in = x
        # One-layer feedforward network on the time embedding,
        # broadcast over all pixels when added below.
        time_embedding = feedforward_network(time)
        x = conv_layer_1(activate(normalize_1(x))) + time_embedding
        x = conv_layer_2(dropout(activate(normalize_2(x))))
        return x_in + x

    def SpatialTransformer(x, cond):
        x_in = x
        x = normalize(x)
        x = proj_in(x)
        # Latent pixels are the queries; cond supplies the keys and values.
        x = cross_attention(x, cond)
        x = proj_out(x)
        return x_in + x

    def unet(x, time, cond):
        residual_channels = []

        # Down-scaling path: save each ResBlock output for the skip connections.
        for resblock, spatialtransformer in downscaling_layers:
            x = resblock(x, time)
            residual_channels.append(x)
            x = spatialtransformer(x, cond)

        x = middle_layer.resblock_1(x, time)
        x = middle_layer.spatialtransformer(x, cond)
        x = middle_layer.resblock_2(x, time)

        # Up-scaling path: concatenate the saved activations, most recent first.
        for resblock, spatialtransformer in upscaling_layers:
            residual = residual_channels.pop()
            x = resblock(concatenate(x, residual), time)
            x = spatialtransformer(x, cond)

        return x

The detailed architecture is described in the cited references.^[22]^[23]

Training and inference

The LDM is trained by using a Markov chain to gradually add noise to the training images. The model is then trained to reverse this process, starting with a noisy image and gradually removing the noise until it recovers the original image. More specifically, the training process can be described as follows:

- Forward diffusion process: Given a real image $x_{0}$, a sequence of latent variables $x_{1:T}$ is generated by gradually adding Gaussian noise to the image, according to a pre-determined "noise schedule".
- Reverse diffusion process: Starting from a Gaussian noise sample $x_{T}$, the model learns to predict the noise added at each step, in order to reverse the diffusion process and obtain a reconstruction of the original image $x_{0}$.

The model is trained to minimize the difference between the predicted noise and the actual noise added at each step. This is typically done using a mean squared error (MSE) loss function.
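
A minimal sketch of the training target described above: noise a clean sample to a randomly chosen step, then score a noise prediction with the MSE loss. It assumes the DDPM closed form $x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon$ with a linear schedule of 1000 steps, neither of which is specified in the text. The helper names are hypothetical, and no actual network is trained here.

```python
import numpy as np

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bars, rng):
    """Jump directly to step t of the forward process; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def mse_loss(predicted_noise, true_noise):
    return float(np.mean((predicted_noise - true_noise) ** 2))

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars(1000)
x0 = rng.standard_normal((4, 64, 64))     # stand-in for an encoded training image
t = int(rng.integers(0, 1000))            # random timestep for this training example
x_t, eps = add_noise(x0, t, alpha_bars, rng)

perfect = mse_loss(eps, eps)                      # a perfect prediction gives zero loss
untrained = mse_loss(np.zeros_like(eps), eps)     # predicting "no noise" is penalised
print(perfect, untrained > 0.0)
```

During training, the prediction would come from the U-Net applied to `x_t`, the timestep, and the conditioning, and the loss would be minimised by gradient descent.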

Once the model is trained, it can be used to generate new images by simply running the reverse diffusion process starting from a random noise sample. The model gradually removes the noise from the sample, guided by the learned noise distribution, until it generates a final image.

See the diffusion model page for details.

References

[1] Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (2022). High-Resolution Image Synthesis With Latent Diffusion Models. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022. pp. 10684–10695.
[2] "Home". Computer Vision & Learning Group. Retrieved 2024-09-05.
[3] "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Archived from the original on January 18, 2023. Retrieved 17 September 2022.
[4] Alammar, Jay. "The Illustrated Stable Diffusion". jalammar.github.io. Archived from the original on November 1, 2022. Retrieved 2022-10-31.
[5] Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. 37. PMLR: 2256–2265. arXiv:1503.03585.
[6] Sohl-Dickstein, Jascha (2024-09-01). "Sohl-Dickstein/Diffusion-Probabilistic-Models". GitHub. Retrieved 2024-09-07.
[7] "ermongroup/ncsn". ermongroup. 2019. Retrieved 2024-09-07.
[8] Song, Yang; Ermon, Stefano (2019). "Generative Modeling by Estimating Gradients of the Data Distribution". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc. arXiv:1907.05600.
[9] Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (2020). "Denoising Diffusion Probabilistic Models". Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 6840–6851.
[10] Ho, Jonathan (Jun 20, 2020). "hojonathanho/diffusion". GitHub. Retrieved 2024-09-07.
[11] Wang, Phil (2024-09-07). "lucidrains/denoising-diffusion-pytorch". GitHub. Retrieved 2024-09-07.
[12] "The Annotated Diffusion Model". huggingface.co. Retrieved 2024-09-07.
[13] Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (2021-12-20). "High-Resolution Image Synthesis with Latent Diffusion Models". arXiv:2112.10752 [cs.CV].
[14] "Update README.md · CompVis/stable-diffusion@17e64e3". GitHub. Retrieved 2024-09-07.
[15] "Update README.md · CompVis/latent-diffusion@17e64e3". GitHub. Retrieved 2024-09-07.
[16] "stable diffusion · CompVis/stable-diffusion@2ff270f". GitHub. Retrieved 2024-09-07.
[17] "CompVis (CompVis)". huggingface.co. 2023-08-23. Retrieved 2024-03-06.
[18] "runwayml/stable-diffusion-v1-5 · Hugging Face". huggingface.co. Archived from the original on September 21, 2023. Retrieved 2023-08-17.
[19] "Explanation of the 0.18215 factor in textual_inversion? · Issue #437 · huggingface/diffusers". GitHub. Retrieved 2024-09-19.
[20] "diffusion-nbs/Stable Diffusion Deep Dive.ipynb at master · fastai/diffusion-nbs". GitHub. Retrieved 2024-09-19.
[21] "latent-diffusion/ldm/modules/attention.py at main · CompVis/latent-diffusion". GitHub. Retrieved 2024-09-09.
[22] "U-Net for Stable Diffusion". U-Net for Stable Diffusion. Retrieved 2024-08-31.
[23] "Transformer for Stable Diffusion U-Net". Transformer for Stable Diffusion U-Net. Retrieved 2024-09-07.
