This pipeline is deprecated but it can still be used. However, we won’t test the pipeline anymore and won’t accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.
Text2Video-Zero
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel,Zhangyang Wang, Shant Navasardyan,Humphrey Shi.
Text2Video-Zero enables zero-shot video generation using either:
- A textual prompt
- A prompt combined with guidance from poses or edges
- Video Instruct-Pix2Pix (instruction-guided video editing)
Results are temporally consistent and closely follow the guidance and textual prompts.

The abstract from the paper is:
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain.Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object.Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing.As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
You can find additional information about Text2Video-Zero on theproject page,paper, andoriginal codebase.
Usage example
Text-To-Video
To generate a video from prompt, run the following Python code:
import torchfrom diffusersimport TextToVideoZeroPipelineimport imageiomodel_id ="stable-diffusion-v1-5/stable-diffusion-v1-5"pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")prompt ="A panda is playing guitar on times square"result = pipe(prompt=prompt).imagesresult = [(r *255).astype("uint8")for rin result]imageio.mimsave("video.mp4", result, fps=4)
You can change these parameters in the pipeline call:
- Motion field strength (see thepaper, Sect. 3.3.1):
motion_field_strength_xandmotion_field_strength_y. Default:motion_field_strength_x=12,motion_field_strength_y=12
TandT'(see thepaper, Sect. 3.3.1)t0andt1in the range{0, ..., num_inference_steps}. Default:t0=45,t1=48
- Video length:
video_length, the number of frames video_length to be generated. Default:video_length=8
We can also generate longer videos by doing the processing in a chunk-by-chunk manner:
import torchfrom diffusersimport TextToVideoZeroPipelineimport numpyas npmodel_id ="stable-diffusion-v1-5/stable-diffusion-v1-5"pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")seed =0video_length =24#24 ÷ 4fps = 6 secondschunk_size =8prompt ="A panda is playing guitar on times square"# Generate the video chunk-by-chunkresult = []chunk_ids = np.arange(0, video_length, chunk_size -1)generator = torch.Generator(device="cuda")for iinrange(len(chunk_ids)):print(f"Processing chunk{i +1} /{len(chunk_ids)}") ch_start = chunk_ids[i] ch_end = video_lengthif i ==len(chunk_ids) -1else chunk_ids[i +1]# Attach the first frame for Cross Frame Attention frame_ids = [0] +list(range(ch_start, ch_end))# Fix the seed for the temporal consistency generator.manual_seed(seed) output = pipe(prompt=prompt, video_length=len(frame_ids), generator=generator, frame_ids=frame_ids) result.append(output.images[1:])# Concatenate chunks and saveresult = np.concatenate(result)result = [(r *255).astype("uint8")for rin result]imageio.mimsave("video.mp4", result, fps=4)
SDXL Support
In order to use the SDXL model when generating a video from prompt, use theTextToVideoZeroSDXLPipelinepipeline:
import torchfrom diffusersimport TextToVideoZeroSDXLPipelinemodel_id ="stabilityai/stable-diffusion-xl-base-1.0"pipe = TextToVideoZeroSDXLPipeline.from_pretrained( model_id, torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")
Text-To-Video with Pose Control
To generate a video from prompt with additional pose control
Download a demo video
from huggingface_hubimport hf_hub_downloadfilename ="__assets__/poses_skeleton_gifs/dance1_corr.mp4"repo_id ="PAIR/Text2Video-Zero"video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
Read video containing extracted pose images
from PILimport Imageimport imageioreader = imageio.get_reader(video_path,"ffmpeg")frame_count =8pose_images = [Image.fromarray(reader.get_data(i))for iinrange(frame_count)]
To extract pose from actual video, readControlNet documentation.
Run
StableDiffusionControlNetPipelinewith our custom attention processorimport torchfrom diffusersimport StableDiffusionControlNetPipeline, ControlNetModelfrom diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zeroimport CrossFrameAttnProcessormodel_id ="stable-diffusion-v1-5/stable-diffusion-v1-5"controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)pipe = StableDiffusionControlNetPipeline.from_pretrained( model_id, controlnet=controlnet, torch_dtype=torch.float16).to("cuda")# Set the attention processorpipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))# fix latents for all frameslatents = torch.randn((1,4,64,64), device="cuda", dtype=torch.float16).repeat(len(pose_images),1,1,1)prompt ="Darth Vader dancing in a desert"result = pipe(prompt=[prompt] *len(pose_images), image=pose_images, latents=latents).imagesimageio.mimsave("video.mp4", result, fps=4)
SDXL Support
Since our attention processor also works with SDXL, it can be utilized to generate a video from prompt using ControlNet models powered by SDXL:
import torchfrom diffusersimport StableDiffusionXLControlNetPipeline, ControlNetModelfrom diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zeroimport CrossFrameAttnProcessorcontrolnet_model_id ='thibaud/controlnet-openpose-sdxl-1.0'model_id ='stabilityai/stable-diffusion-xl-base-1.0'controlnet = ControlNetModel.from_pretrained(controlnet_model_id, torch_dtype=torch.float16)pipe = StableDiffusionControlNetPipeline.from_pretrained(model_id, controlnet=controlnet, torch_dtype=torch.float16).to('cuda')# Set the attention processorpipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))# fix latents for all frameslatents = torch.randn((1,4,128,128), device="cuda", dtype=torch.float16).repeat(len(pose_images),1,1,1)prompt ="Darth Vader dancing in a desert"result = pipe(prompt=[prompt] *len(pose_images), image=pose_images, latents=latents).imagesimageio.mimsave("video.mp4", result, fps=4)
Text-To-Video with Edge Control
To generate a video from prompt with additional Canny edge control, follow the same steps described above for pose-guided generation usingCanny edge ControlNet model.
Video Instruct-Pix2Pix
To perform text-guided video editing (withInstructPix2Pix):
Download a demo video
from huggingface_hubimport hf_hub_downloadfilename ="__assets__/pix2pix video/camel.mp4"repo_id ="PAIR/Text2Video-Zero"video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
Read video from path
from PILimport Imageimport imageioreader = imageio.get_reader(video_path,"ffmpeg")frame_count =8video = [Image.fromarray(reader.get_data(i))for iinrange(frame_count)]
Run
StableDiffusionInstructPix2PixPipelinewith our custom attention processorimport torchfrom diffusersimport StableDiffusionInstructPix2PixPipelinefrom diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zeroimport CrossFrameAttnProcessormodel_id ="timbrooks/instruct-pix2pix"pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=3))prompt ="make it Van Gogh Starry Night style"result = pipe(prompt=[prompt] *len(video), image=video).imagesimageio.mimsave("edited_video.mp4", result, fps=4)
DreamBooth specialization
MethodsText-To-Video,Text-To-Video with Pose Control andText-To-Video with Edge Controlcan run with customDreamBooth models, as shown below forCanny edge ControlNet model andAvatar style DreamBooth model:
Download a demo video
from huggingface_hubimport hf_hub_downloadfilename ="__assets__/canny_videos_mp4/girl_turning.mp4"repo_id ="PAIR/Text2Video-Zero"video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
Read video from path
from PILimport Imageimport imageioreader = imageio.get_reader(video_path,"ffmpeg")frame_count =8canny_edges = [Image.fromarray(reader.get_data(i))for iinrange(frame_count)]
Run
StableDiffusionControlNetPipelinewith custom trained DreamBooth modelimport torchfrom diffusersimport StableDiffusionControlNetPipeline, ControlNetModelfrom diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zeroimport CrossFrameAttnProcessor# set model id to custom modelmodel_id ="PAIR/text2video-zero-controlnet-canny-avatar"controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)pipe = StableDiffusionControlNetPipeline.from_pretrained( model_id, controlnet=controlnet, torch_dtype=torch.float16).to("cuda")# Set the attention processorpipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))# fix latents for all frameslatents = torch.randn((1,4,64,64), device="cuda", dtype=torch.float16).repeat(len(canny_edges),1,1,1)prompt ="oil painting of a beautiful girl avatar style"result = pipe(prompt=[prompt] *len(canny_edges), image=canny_edges, latents=latents).imagesimageio.mimsave("video.mp4", result, fps=4)
You can filter out some available DreamBooth-trained models withthis link.
Make sure to check out the Schedulersguide to learn how to explore the tradeoff between scheduler speed and quality, and see thereuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
TextToVideoZeroPipeline
classdiffusers.TextToVideoZeroPipeline
<source>(vae: AutoencoderKLtext_encoder: CLIPTextModeltokenizer: CLIPTokenizerunet: UNet2DConditionModelscheduler: KarrasDiffusionSchedulerssafety_checker: StableDiffusionSafetyCheckerfeature_extractor: CLIPImageProcessorrequires_safety_checker: bool = True)
__call__
<source>(prompt: typing.Union[str, typing.List[str]]video_length: typing.Optional[int] = 8height: typing.Optional[int] = Nonewidth: typing.Optional[int] = Nonenum_inference_steps: int = 50guidance_scale: float = 7.5negative_prompt: typing.Union[str, typing.List[str], NoneType] = Nonenum_videos_per_prompt: typing.Optional[int] = 1eta: float = 0.0generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = Nonelatents: typing.Optional[torch.Tensor] = Nonemotion_field_strength_x: float = 12motion_field_strength_y: float = 12output_type: typing.Optional[str] = 'tensor'return_dict: bool = Truecallback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = Nonecallback_steps: typing.Optional[int] = 1t0: int = 44t1: int = 47frame_ids: typing.Optional[typing.List[int]] = None)→TextToVideoPipelineOutput
Parameters
- prompt (
strorList[str],optional) —The prompt or prompts to guide image generation. If not defined, you need to passprompt_embeds. - video_length (
int,optional, defaults to 8) —The number of generated video frames. - height (
int,optional, defaults toself.unet.config.sample_size * self.vae_scale_factor) —The height in pixels of the generated image. - width (
int,optional, defaults toself.unet.config.sample_size * self.vae_scale_factor) —The width in pixels of the generated image. - num_inference_steps (
int,optional, defaults to 50) —The number of denoising steps. More denoising steps usually lead to a higher quality image at theexpense of slower inference. - guidance_scale (
float,optional, defaults to 7.5) —A higher guidance scale value encourages the model to generate images closely linked to the textpromptat the expense of lower image quality. Guidance scale is enabled whenguidance_scale > 1. - negative_prompt (
strorList[str],optional) —The prompt or prompts to guide what to not include in video generation. If not defined, you need topassnegative_prompt_embedsinstead. Ignored when not using guidance (guidance_scale < 1). - num_videos_per_prompt (
int,optional, defaults to 1) —The number of videos to generate per prompt. - eta (
float,optional, defaults to 0.0) —Corresponds to parameter eta (η) from theDDIM paper. Onlyapplies to theDDIMScheduler, and is ignored in other schedulers. - generator (
torch.GeneratororList[torch.Generator],optional) —Atorch.Generatorto makegeneration deterministic. - latents (
torch.Tensor,optional) —Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for videogeneration. Can be used to tweak the same generation with different prompts. If not provided, a latentstensor is generated by sampling using the supplied randomgenerator. - output_type (
str,optional, defaults to"np") —The output format of the generated video. Choose between"latent"and"np". - return_dict (
bool,optional, defaults toTrue) —Whether or not to return aTextToVideoPipelineOutput instead ofa plain tuple. - callback (
Callable,optional) —A function that calls everycallback_stepssteps during inference. The function is called with thefollowing arguments:callback(step: int, timestep: int, latents: torch.Tensor). - callback_steps (
int,optional, defaults to 1) —The frequency at which thecallbackfunction is called. If not specified, the callback is called atevery step. - motion_field_strength_x (
float,optional, defaults to 12) —Strength of motion in generated video along x-axis. See thepaper, Sect. 3.3.1. - motion_field_strength_y (
float,optional, defaults to 12) —Strength of motion in generated video along y-axis. See thepaper, Sect. 3.3.1. - t0 (
int,optional, defaults to 44) —Timestep t0. Should be in the range [0, num_inference_steps - 1]. See thepaper, Sect. 3.3.1. - t1 (
int,optional, defaults to 47) —Timestep t0. Should be in the range [t0 + 1, num_inference_steps - 1]. See thepaper, Sect. 3.3.1. - frame_ids (
List[int],optional) —Indexes of the frames that are being generated. This is used when generating longer videoschunk-by-chunk.
Returns
The output contains andarray of the generated video, whenoutput_type !="latent", otherwise alatent code of generated videos and a list ofbools indicating whether the corresponding generatedvideo contains “not-safe-for-work” (nsfw) content..
The call function to the pipeline for generation.
backward_loop
<source>(latentstimestepsprompt_embedsguidance_scalecallbackcallback_stepsnum_warmup_stepsextra_step_kwargscross_attention_kwargs = None)→latents
Parameters
- latents —Latents at time timesteps[0].
- timesteps —Time steps along which to perform backward process.
- prompt_embeds —Pre-generated text embeddings.
- guidance_scale —A higher guidance scale value encourages the model to generate images closely linked to the text
promptat the expense of lower image quality. Guidance scale is enabled whenguidance_scale > 1. - callback (
Callable,optional) —A function that calls everycallback_stepssteps during inference. The function is called with thefollowing arguments:callback(step: int, timestep: int, latents: torch.Tensor). - callback_steps (
int,optional, defaults to 1) —The frequency at which thecallbackfunction is called. If not specified, the callback is called atevery step. - extra_step_kwargs —Extra_step_kwargs.
- cross_attention_kwargs —A kwargs dictionary that if specified is passed along to the
AttentionProcessoras defined inself.processor. - num_warmup_steps —number of warmup steps.
Returns
latents
Latents of backward process output at time timesteps[-1].
Perform backward process given list of time steps.
encode_prompt
<source>(promptdevicenum_images_per_promptdo_classifier_free_guidancenegative_prompt = Noneprompt_embeds: typing.Optional[torch.Tensor] = Nonenegative_prompt_embeds: typing.Optional[torch.Tensor] = Nonelora_scale: typing.Optional[float] = Noneclip_skip: typing.Optional[int] = None)
Parameters
- prompt (
strorList[str],optional) —prompt to be encoded - device — (
torch.device):torch device - num_images_per_prompt (
int) —number of images that should be generated per prompt - do_classifier_free_guidance (
bool) —whether to use classifier free guidance or not - negative_prompt (
strorList[str],optional) —The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embedsinstead. Ignored when not using guidance (i.e., ignored ifguidance_scaleisless than1). - prompt_embeds (
torch.Tensor,optional) —Pre-generated text embeddings. Can be used to easily tweak text inputs,e.g. prompt weighting. If notprovided, text embeddings will be generated frompromptinput argument. - negative_prompt_embeds (
torch.Tensor,optional) —Pre-generated negative text embeddings. Can be used to easily tweak text inputs,e.g. promptweighting. If not provided, negative_prompt_embeds will be generated fromnegative_promptinputargument. - lora_scale (
float,optional) —A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded. - clip_skip (
int,optional) —Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means thatthe output of the pre-final layer will be used for computing the prompt embeddings.
Encodes the prompt into text encoder hidden states.
forward_loop
<source>(x_t0t0t1generator)→x_t1
Parameters
- x_t0 —Latent code at time t0.
- t0 —Timestep at t0.
- t1 —Timestamp at t1.
- generator (
torch.GeneratororList[torch.Generator],optional) —Atorch.Generatorto makegeneration deterministic.
Returns
x_t1
Forward process applied to x_t0 from time t0 to t1.
Perform DDPM forward process from time t0 to t1. This is the same as adding noise with corresponding variance.
TextToVideoZeroSDXLPipeline
classdiffusers.TextToVideoZeroSDXLPipeline
<source>(vae: AutoencoderKLtext_encoder: CLIPTextModeltext_encoder_2: CLIPTextModelWithProjectiontokenizer: CLIPTokenizertokenizer_2: CLIPTokenizerunet: UNet2DConditionModelscheduler: KarrasDiffusionSchedulersimage_encoder: CLIPVisionModelWithProjection = Nonefeature_extractor: CLIPImageProcessor = Noneforce_zeros_for_empty_prompt: bool = Trueadd_watermarker: typing.Optional[bool] = None)
__call__
<source>(prompt: typing.Union[str, typing.List[str]]prompt_2: typing.Union[str, typing.List[str], NoneType] = Nonevideo_length: typing.Optional[int] = 8height: typing.Optional[int] = Nonewidth: typing.Optional[int] = Nonenum_inference_steps: int = 50denoising_end: typing.Optional[float] = Noneguidance_scale: float = 7.5negative_prompt: typing.Union[str, typing.List[str], NoneType] = Nonenegative_prompt_2: typing.Union[str, typing.List[str], NoneType] = Nonenum_videos_per_prompt: typing.Optional[int] = 1eta: float = 0.0generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = Noneframe_ids: typing.Optional[typing.List[int]] = Noneprompt_embeds: typing.Optional[torch.Tensor] = Nonenegative_prompt_embeds: typing.Optional[torch.Tensor] = Nonepooled_prompt_embeds: typing.Optional[torch.Tensor] = Nonenegative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = Nonelatents: typing.Optional[torch.Tensor] = Nonemotion_field_strength_x: float = 12motion_field_strength_y: float = 12output_type: typing.Optional[str] = 'tensor'return_dict: bool = Truecallback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = Nonecallback_steps: int = 1cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = Noneguidance_rescale: float = 0.0original_size: typing.Optional[typing.Tuple[int, int]] = Nonecrops_coords_top_left: typing.Tuple[int, int] = (0, 0)target_size: typing.Optional[typing.Tuple[int, int]] = Nonet0: int = 44t1: int = 47)
Parameters
- prompt (
strorList[str],optional) —The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds.instead. - prompt_2 (
strorList[str],optional) —The prompt or prompts to be sent to thetokenizer_2andtext_encoder_2. If not defined,promptisused in both text-encoders - video_length (
int,optional, defaults to 8) —The number of generated video frames. - height (
int,optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) —The height in pixels of the generated image. - width (
int,optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) —The width in pixels of the generated image. - num_inference_steps (
int,optional, defaults to 50) —The number of denoising steps. More denoising steps usually lead to a higher quality image at theexpense of slower inference. - denoising_end (
float,optional) —When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to becompleted before it is intentionally prematurely terminated. As a result, the returned sample willstill retain a substantial amount of noise as determined by the discrete timesteps selected by thescheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a“Mixture of Denoisers” multi-pipeline setup, as elaborated inRefining the ImageOutput - guidance_scale (
float,optional, defaults to 7.5) —Guidance scale as defined inClassifier-Free DiffusionGuidance.guidance_scaleis defined aswof equation 2.ofImagen Paper. Guidance scale is enabled by settingguidance_scale > 1. Higher guidance scale encourages to generate images that are closely linked tothe textprompt, usually at the expense of lower image quality. - negative_prompt (
strorList[str],optional) —The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embedsinstead. Ignored when not using guidance (i.e., ignored ifguidance_scaleisless than1). - negative_prompt_2 (
strorList[str],optional) —The prompt or prompts not to guide the image generation to be sent totokenizer_2andtext_encoder_2. If not defined,negative_promptis used in both text-encoders - num_videos_per_prompt (
int,optional, defaults to 1) —The number of videos to generate per prompt. - eta (
float,optional, defaults to 0.0) —Corresponds to parameter eta (η) in the DDIM paper:https://huggingface.co/papers/2010.02502. Onlyapplies toschedulers.DDIMScheduler, will be ignored for others. - generator (
torch.GeneratororList[torch.Generator],optional) —One or a list oftorch generator(s)to make generation deterministic. - frame_ids (
List[int],optional) —Indexes of the frames that are being generated. This is used when generating longer videoschunk-by-chunk. - prompt_embeds (
torch.Tensor,optional) —Pre-generated text embeddings. Can be used to easily tweak text inputs,e.g. prompt weighting. If notprovided, text embeddings will be generated frompromptinput argument. - negative_prompt_embeds (
torch.Tensor,optional) —Pre-generated negative text embeddings. Can be used to easily tweak text inputs,e.g. promptweighting. If not provided, negative_prompt_embeds will be generated fromnegative_promptinputargument. - pooled_prompt_embeds (
torch.Tensor,optional) —Pre-generated pooled text embeddings. Can be used to easily tweak text inputs,e.g. prompt weighting.If not provided, pooled text embeddings will be generated frompromptinput argument. - negative_pooled_prompt_embeds (
torch.Tensor,optional) —Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs,e.g. promptweighting. If not provided, pooled negative_prompt_embeds will be generated fromnegative_promptinput argument. - latents (
torch.Tensor,optional) —Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for imagegeneration. Can be used to tweak the same generation with different prompts. If not provided, a latentstensor will ge generated by sampling using the supplied randomgenerator. - motion_field_strength_x (
float,optional, defaults to 12) —Strength of motion in generated video along x-axis. See thepaper, Sect. 3.3.1. - motion_field_strength_y (
float,optional, defaults to 12) —Strength of motion in generated video along y-axis. See thepaper, Sect. 3.3.1. - output_type (
str,optional, defaults to"pil") —The output format of the generate image. Choose betweenPIL:PIL.Image.Imageornp.array. - return_dict (
bool,optional, defaults toTrue) —Whether or not to return a~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutputinsteadof a plain tuple. - callback (
Callable,optional) —A function that will be called everycallback_stepssteps during inference. The function will becalled with the following arguments:callback(step: int, timestep: int, latents: torch.Tensor). - callback_steps (
int,optional, defaults to 1) —The frequency at which thecallbackfunction will be called. If not specified, the callback will becalled at every step. - cross_attention_kwargs (
dict,optional) —A kwargs dictionary that if specified is passed along to theAttentionProcessoras defined underself.processorindiffusers.cross_attention. - guidance_rescale (
float,optional, defaults to 0.7) —Guidance rescale factor proposed byCommon Diffusion Noise Schedules and Sample Steps areFlawedguidance_scaleis defined asφin equation 16. ofCommon Diffusion Noise Schedules and Sample Steps areFlawed. Guidance rescale factor should fix overexposure whenusing zero terminal SNR. - original_size (
Tuple[int],optional, defaults to (1024, 1024)) —Iforiginal_sizeis not the same astarget_sizethe image will appear to be down- or upsampled.original_sizedefaults to(width, height)if not specified. Part of SDXL’s micro-conditioning asexplained in section 2.2 ofhttps://huggingface.co/papers/2307.01952. - crops_coords_top_left (
Tuple[int],optional, defaults to (0, 0)) —crops_coords_top_leftcan be used to generate an image that appears to be “cropped” from the positioncrops_coords_top_leftdownwards. Favorable, well-centered images are usually achieved by settingcrops_coords_top_leftto (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 ofhttps://huggingface.co/papers/2307.01952. - target_size (
Tuple[int],optional, defaults to (1024, 1024)) —For most cases,target_sizeshould be set to the desired height and width of the generated image. Ifnot specified it will default to(width, height). Part of SDXL’s micro-conditioning as explained insection 2.2 ofhttps://huggingface.co/papers/2307.01952. - t0 (
int,optional, defaults to 44) —Timestep t0. Should be in the range [0, num_inference_steps - 1]. See thepaper, Sect. 3.3.1. - t1 (
int,optional, defaults to 47) —Timestep t0. Should be in the range [t0 + 1, num_inference_steps - 1]. See thepaper, Sect. 3.3.1.
Function invoked when calling the pipeline for generation.
backward_loop
<source>(latentstimestepsprompt_embedsguidance_scalecallbackcallback_stepsnum_warmup_stepsextra_step_kwargsadd_text_embedsadd_time_idscross_attention_kwargs = Noneguidance_rescale: float = 0.0)→latents
Parameters
- latents —Latents at time timesteps[0].
- timesteps —Time steps along which to perform backward process.
- prompt_embeds —Pre-generated text embeddings.
- guidance_scale —A higher guidance scale value encourages the model to generate images closely linked to the text
promptat the expense of lower image quality. Guidance scale is enabled whenguidance_scale > 1. - callback (
Callable,optional) —A function that calls everycallback_stepssteps during inference. The function is called with thefollowing arguments:callback(step: int, timestep: int, latents: torch.Tensor). - callback_steps (
int,optional, defaults to 1) —The frequency at which thecallbackfunction is called. If not specified, the callback is called atevery step. - extra_step_kwargs —Extra_step_kwargs.
- cross_attention_kwargs —A kwargs dictionary that if specified is passed along to the
AttentionProcessoras defined inself.processor. - num_warmup_steps —number of warmup steps.
Returns
latents
latents of backward process output at time timesteps[-1]
Perform backward process given list of time steps
encode_prompt
<source>(prompt: strprompt_2: typing.Optional[str] = Nonedevice: typing.Optional[torch.device] = Nonenum_images_per_prompt: int = 1do_classifier_free_guidance: bool = Truenegative_prompt: typing.Optional[str] = Nonenegative_prompt_2: typing.Optional[str] = Noneprompt_embeds: typing.Optional[torch.Tensor] = Nonenegative_prompt_embeds: typing.Optional[torch.Tensor] = Nonepooled_prompt_embeds: typing.Optional[torch.Tensor] = Nonenegative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = Nonelora_scale: typing.Optional[float] = Noneclip_skip: typing.Optional[int] = None)
Parameters
- prompt (
strorList[str],optional) —prompt to be encoded - prompt_2 (
strorList[str],optional) —The prompt or prompts to be sent to thetokenizer_2andtext_encoder_2. If not defined,promptisused in both text-encoders - device — (
torch.device):torch device - num_images_per_prompt (
int) —number of images that should be generated per prompt - do_classifier_free_guidance (
bool) —whether to use classifier free guidance or not - negative_prompt (
strorList[str],optional) —The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embedsinstead. Ignored when not using guidance (i.e., ignored ifguidance_scaleisless than1). - negative_prompt_2 (
strorList[str],optional) —The prompt or prompts not to guide the image generation to be sent totokenizer_2andtext_encoder_2. If not defined,negative_promptis used in both text-encoders - prompt_embeds (
torch.Tensor,optional) —Pre-generated text embeddings. Can be used to easily tweak text inputs,e.g. prompt weighting. If notprovided, text embeddings will be generated frompromptinput argument. - negative_prompt_embeds (
torch.Tensor,optional) —Pre-generated negative text embeddings. Can be used to easily tweak text inputs,e.g. promptweighting. If not provided, negative_prompt_embeds will be generated fromnegative_promptinputargument. - pooled_prompt_embeds (
torch.Tensor,optional) —Pre-generated pooled text embeddings. Can be used to easily tweak text inputs,e.g. prompt weighting.If not provided, pooled text embeddings will be generated frompromptinput argument. - negative_pooled_prompt_embeds (
torch.Tensor,optional) —Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs,e.g. promptweighting. If not provided, pooled negative_prompt_embeds will be generated fromnegative_promptinput argument. - lora_scale (
float,optional) —A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded. - clip_skip (
int,optional) —Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means thatthe output of the pre-final layer will be used for computing the prompt embeddings.
Encodes the prompt into text encoder hidden states.
forward_loop
<source>(x_t0t0t1generator)→x_t1
Parameters
- x_t0 —Latent code at time t0.
- t0 —Timestep at t0.
- t1 —Timestamp at t1.
- generator (
torch.GeneratororList[torch.Generator],optional) —Atorch.Generatorto makegeneration deterministic.
Returns
x_t1
Forward process applied to x_t0 from time t0 to t1.
Perform DDPM forward process from time t0 to t1. This is the same as adding noise with corresponding variance.
TextToVideoPipelineOutput
classdiffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput
<source>(images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray]nsfw_content_detected: typing.Optional[typing.List[bool]])
Parameters
- images (
[List[PIL.Image.Image],np.ndarray]) —List of denoised PIL images of lengthbatch_sizeor NumPy array of shape(batch_size, height, width, num_channels). - nsfw_content_detected (
[List[bool]]) —List indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content orNoneif safety checking could not be performed.
Output class for zero-shot text-to-video pipeline.