Serve Diffusion Transformer models using the xDiT container on Cloud GPUs

xDiT is an open-source library that accelerates inference for Diffusion Transformer (DiT) models by using parallelism and optimization techniques. These techniques enable a scalable multi-GPU setup for demanding workloads. This page demonstrates how to deploy DiT models by using xDiT and Cloud GPUs on Vertex AI.

For more information about xDiT, see the xDiT GitHub project.

Benefits

The following list describes the key benefits of using xDiT to serve DiT models on Vertex AI:

  • Up to three times faster generation: Generate high-resolution images and videos in a fraction of the time compared to other serving solutions.
  • Scalable multi-GPU support: Efficiently distribute workloads across multiple GPUs for optimal performance.
    • Hybrid parallelism: xDiT supports various parallel processing approaches, such as unified sequence parallelism, PipeFusion, CFG parallelism, and data parallelism. These methods can be combined in a unique recipe to optimize performance.
  • Optimized single-GPU performance: xDiT provides faster inference even on a single GPU.
    • GPU acceleration: xDiT incorporates several kernel acceleration methods and uses techniques from DiTFastAttn to speed up inference on a single GPU.
  • Easy deployment: Get started quickly with one-click deployment or Colab Enterprise notebooks in Vertex AI Model Garden.

Supported models

xDiT is available for certain DiT model architectures in Vertex AI Model Garden, such as Flux.1 Schnell, CogVideoX-2b, and Wan2.1 text-to-video model variants. To check whether a DiT model supports xDiT, view its model card in Model Garden.

Hybrid parallelism for multi-GPU performance

xDiT uses a combination of parallelism techniques to maximize performance on multi-GPU setups. These techniques work together to distribute the workload and optimize resource utilization, as shown in the sketch after the following list:

  • Unified sequence parallelism: This technique splits the input data (such as splitting an image into patches) across multiple GPUs, reducing memory usage and improving scalability.
  • PipeFusion: PipeFusion divides the DiT model into stages and assigns each stage to a different GPU, enabling parallel processing of different parts of the model.
  • CFG parallelism: This technique optimizes models that use classifier-free guidance, a common method for controlling the style and content of generated images. It parallelizes the computation of the conditional and unconditional branches, leading to faster inference.
  • Data parallelism: This method replicates the entire model on each GPU, with each GPU processing a different batch of input data, increasing the overall throughput of the system.
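
These techniques compose multiplicatively: the product of the configured parallel degrees must equal the number of GPUs. The following is a minimal sketch of an 8-GPU recipe, expressed as the deployment environment variables described in the xDiT arguments section later on this page; the specific degrees are illustrative, not a tuned configuration:

# Illustrative 8-GPU hybrid-parallelism recipe (not a tuned configuration).
# USE_CFG_PARALLEL contributes a constant factor of 2 when enabled:
# 2 (PipeFusion) x 2 (Ulysses) x 1 (Ring) x 2 (CFG) = 8 = N_GPUS.
container_env_vars = {
    "N_GPUS": "8",
    "PIPEFUSION_PARALLEL_DEGREE": "2",
    "ULYSSES_DEGREE": "2",
    "RING_DEGREE": "1",
    "USE_CFG_PARALLEL": "true",
}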

For more information about performance improvements, see xDiT's report on Flux.1 Schnell or CogVideoX-2b. Google was able to reproduce these results on Vertex AI Model Garden.

Single GPU acceleration

The xDiT library provides benefits for single-GPU serving by using torch.compile and onediff to enhance runtime speed on GPUs. These techniques can also be used in conjunction with hybrid parallelism.

xDiT also has an efficient attention computation technique, called DiTFastAttn, to address DiT's computational bottleneck. For now, this technique is only available for single-GPU setups or in conjunction with data parallelism.
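
As a minimal sketch, these single-GPU optimizations map onto the deployment environment variables described in the xDiT arguments section later on this page; the combination and values shown here are illustrative:

# Illustrative single-GPU configuration that combines torch.compile with
# DiTFastAttn. COCO_PATH is required when USE_FAST_ATTN is "true"; the
# value shown is a placeholder, not a real path.
container_env_vars = {
    "N_GPUS": "1",
    "USE_TORCH_COMPILE": "true",
    "USE_FAST_ATTN": "true",
    "N_CALIB": "8",
    "COCO_PATH": "<PATH_TO_COCO_DATASET>",
}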

Get started in Model Garden

The xDiT optimized Cloud GPU serving container is provided within Vertex AI Model Garden. For supported models, deployments use this container when you use one-click deployment or the Colab Enterprise notebook examples.

The following examples use the Flux.1-schnell model to demonstrate how to deploy a DiT model on an xDiT container.

Use one-click deployment

You can deploy a custom Vertex AI endpoint with the xDiT container by using a model card.

  1. Navigate to the model card page and click Deploy.

  2. For the model variation that you want to use, select a machine type for your deployment.

  3. Click Deploy to begin the deployment process. You receive two email notifications: one when the model is uploaded and another when the endpoint is ready.

Use the Colab Enterprise notebook

For flexibility and customization, use the Colab Enterprise notebook examples to deploy a Vertex AI endpoint with the xDiT container by using the Vertex AI SDK for Python.

  1. Navigate to the model card page and click Open notebook.

  2. Select the Vertex Serving notebook. The notebook opens in Colab Enterprise.

  3. Run through the notebook to deploy a model by using the xDiT container and send prediction requests to the endpoint. The code snippet for the deployment is as follows:

import vertexai
from vertexai import model_garden

vertexai.init(project="<YOUR_PROJECT_ID>", location="<REGION>")

model = model_garden.OpenModel("black-forest-labs/FLUX.1-schnell")
endpoint = model.deploy()
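
After the endpoint is ready, you can send prediction requests with the same SDK. The following is a hedged sketch: the instance schema (a "text" field holding the prompt) is an assumption for illustration, and the notebook shows the container's actual request format.

# Sketch only: request an image from the deployed text-to-image endpoint.
# The "text" field name is an assumption; see the notebook for the schema.
response = endpoint.predict(
    instances=[{"text": "A photo of a steaming cup of coffee on a wooden table"}]
)
print(response.predictions)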

xDiT arguments

xDiT offers a range of server arguments that can be configured to optimize performance for specific use cases. These arguments are set as environment variables during deployment. The following lists describe the key arguments that you might need to configure:

Model Configuration
  • MODEL_ID (string): Specifies the model identifier to load. This should match the model name in your registry or path.
Runtime Optimization Arguments
  • N_GPUS (integer): Specifies the number of GPUs to use for inference. The default value is 1.
  • WARMUP_STEPS (integer): Number of warmup steps required before inference begins. This is particularly important when PipeFusion is enabled to ensure stable performance. The default value is 1.
  • USE_PARALLEL_VAE (boolean): Enables efficient processing of high-resolution images (greater than 2048 pixels) by parallelizing the VAE component across devices. This prevents out-of-memory (OOM) issues for large images. The default value is false.
  • USE_TORCH_COMPILE (boolean): Enables single-GPU acceleration through torch.compile, providing kernel-level optimizations for improved performance. The default value is false.
  • USE_ONEDIFF (boolean): Enables OneDiff compilation acceleration technology to optimize GPU kernel execution speed. The default value is false.
Data Parallel Arguments
  • DATA_PARALLEL_DEGREE (integer): Sets the degree of data parallelism. Leave empty to disable, or set to the selected parallel degree.
  • USE_CFG_PARALLEL (boolean): Enables parallel computation for classifier-free guidance (CFG), also known as Split Batch. When enabled, the constant parallelism degree is 2. Set to true when using CFG for controlling output style and content. The default value is false.
Sequence Parallel Arguments (USP - Unified Sequence Parallelism)
  • ULYSSES_DEGREE (integer): Sets the Ulysses degree for the unified sequence parallel approach, which combines DeepSpeed-Ulysses and Ring-Attention. This controls the all-to-all communication pattern. Leave empty to use the default.
  • RING_DEGREE (integer): Sets the Ring degree for peer-to-peer communication in sequence parallelism. Works in conjunction with ULYSSES_DEGREE to form the 2D process mesh. Leave empty to use the default.
Tensor Parallel Arguments
  • TENSOR_PARALLEL_DEGREE (integer): Sets the degree of tensor parallelism, which splits model parameters across devices along feature dimensions to reduce memory costs per device. Leave empty to disable.
  • SPLIT_SCHEME (string): Defines how to split the model tensors across devices (for example, by attention heads or hidden dimensions). Leave empty for the default splitting scheme.
Ray Distributed Arguments
  • USE_RAY (boolean): Enables the Ray distributed execution framework for scaling computations across multiple nodes. The default value is false.
  • RAY_WORLD_SIZE (integer): Total number of processes in the Ray cluster. The default value is 1.
  • VAE_PARALLEL_SIZE (integer): Number of processes dedicated to VAE parallel processing when using Ray. The default value is 0.
  • DIT_PARALLEL_SIZE (integer): Number of processes dedicated to DiT backbone parallel processing when using Ray. The default value is 0.
PipeFusion Parallel Arguments
  • PIPEFUSION_PARALLEL_DEGREE (integer): Sets the degree of parallelism for PipeFusion, a sequence-level pipeline parallelism that takes advantage of the input temporal redundancy characteristics of diffusion models. Higher values increase parallelism but require more memory. The default value is 1.
  • NUM_PIPELINE_PATCH (integer): Number of patches to split the sequence into for pipeline processing. Leave empty for automatic determination.
  • ATTN_LAYER_NUM_FOR_PP (string): Specifies which attention layers to use for pipeline parallelism. Can be comma-separated (for example, "10,9") or space-separated (for example, "10 9"). Leave empty to use all layers.
Memory Optimization Arguments
  • ENABLE_MODEL_CPU_OFFLOAD (boolean): Offloads model weights to CPU memory when not in use, reducing GPU memory usage at the cost of increased latency. The default value is false.
  • ENABLE_SEQUENTIAL_CPU_OFFLOAD (boolean): Sequentially offloads model layers to the CPU during the forward pass, enabling inference of models larger than GPU memory. The default value is false.
  • ENABLE_TILING (boolean): Reduces GPU memory usage by decoding the VAE component one tile at a time. This argument is useful for larger images or videos and to prevent out-of-memory errors. The default value is false.
  • ENABLE_SLICING (boolean): Reduces GPU memory usage by splitting the input tensor into slices for VAE decoding. The default value is false.
DiTFastAttn Arguments (Attention Optimization)
  • USE_FAST_ATTN (boolean): Enables DiTFastAttn acceleration for single-GPU inference, using Input Temporal Reduction to reduce computational complexity. The default value is false.
  • N_CALIB (integer): Number of calibration samples for DiTFastAttn optimization. The default value is 8.
  • THRESHOLD (float): Similarity threshold for Temporal Similarity Reduction in DiTFastAttn. The default value is 0.5.
  • WINDOW_SIZE (integer): Window size for Window Attention with Residual Caching to reduce spatial redundancy. The default value is 64.
  • COCO_PATH (string): Path to the COCO dataset for DiTFastAttn calibration. Required when USE_FAST_ATTN is true. Leave empty if not using.
Cache Optimization Arguments
  • USE_CACHE (boolean): Enables general caching mechanisms to reduce redundant computations. The default value is false.
  • USE_TEACACHE (boolean): Enables the TeaCache optimization method for caching intermediate results. The default value is false.
  • USE_FBCACHE (boolean): Enables the First-Block-Cache optimization method. The default value is false.
Precision Optimization Arguments
  • USE_FP8_T5_ENCODER (boolean): Enables FP8 (8-bit floating point) precision for the T5 text encoder, reducing memory usage and potentially improving throughput with minimal quality impact. The default value is false.
Note: When you configure parallelism degrees (PIPEFUSION_PARALLEL_DEGREE, ULYSSES_DEGREE, RING_DEGREE, and USE_CFG_PARALLEL), check that the product of these values equals the total number of GPUs (N_GPUS). For example, with 2 GPUs, a valid configuration can have RING_DEGREE set to 1 and ULYSSES_DEGREE set to 2. Because 1 x 2 = 2, this configuration is valid.

For a full list of arguments, see the xFuserArgs class in the xDiT GitHub project.
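
As a quick sanity check before you deploy, you can verify this constraint with a few lines of Python. This is a minimal sketch over a plain dictionary of environment variables, not part of the Vertex AI SDK; it treats USE_CFG_PARALLEL as a constant factor of 2, as the note describes:

# Minimal sketch: check that the parallel degrees multiply to N_GPUS.
# Unset degrees default to 1; USE_CFG_PARALLEL adds a constant factor of 2.
env = {"N_GPUS": "2", "RING_DEGREE": "1", "ULYSSES_DEGREE": "2"}

product = (
    int(env.get("PIPEFUSION_PARALLEL_DEGREE", 1))
    * int(env.get("ULYSSES_DEGREE", 1))
    * int(env.get("RING_DEGREE", 1))
    * (2 if env.get("USE_CFG_PARALLEL", "false") == "true" else 1)
)
assert product == int(env.get("N_GPUS", 1)), (
    "The product of the parallel degrees must equal N_GPUS"
)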

Serving customizations

Model Garden provides default xDiT parallelization configurations for supported models. You can inspect these default settings by using the Vertex AI SDK for Python.

To view the default deployment configuration for a model, such as "black-forest-labs/FLUX.1-schnell", you can run the following code snippet:

import vertexai
from vertexai import model_garden

vertexai.init(project="<YOUR_PROJECT_ID>", location="<REGION>")

model = model_garden.OpenModel("black-forest-labs/FLUX.1-schnell")
deploy_options = model.list_deploy_options()

# Example response:
# ['black-forest-labs/flux1-schnell@flux.1-schnell']
# [model_display_name: "Flux1-schnell"
# container_spec {
#   image_uri: "us-docker.pkg.dev/deeplearning-platform-release/vertex-model-garden/xdit-serve.cu125.0-2.ubuntu2204.py310"
#   env {
#     name: "DEPLOY_SOURCE"
#     value: "UI_NATIVE_MODEL"
#   }
#   env {
#     name: "MODEL_ID"
#     value: "gs://vertex-model-garden-restricted-us/black-forest-labs/FLUX.1-schnell"
#   }
#   env {
#     name: "TASK"
#     value: "text-to-image"
#   }
#   env {
#     name: "N_GPUS"
#     value: "2"
#   }
#   env {
#     name: "USE_TORCH_COMPILE"
#     value: "true"
#   }
#   env {
#     name: "RING_DEGREE"
#     value: "2"
#   }
# ..........]

The list_deploy_options() method returns the container specifications, including the environment variables (env) that define the xDiT configuration.
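
For example, to print only the xDiT environment variables, you can iterate over the returned options. This sketch assumes that each option exposes container_spec.env entries with name and value fields, matching the example response above:

# Sketch: list the environment variables for each deploy option, assuming
# the response structure shown in the example output above.
for option in deploy_options:
    for env_var in option.container_spec.env:
        print(f"{env_var.name}={env_var.value}")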

To customize the parallelism strategy, you can override these environment variables when deploying the model. The following example demonstrates how to modify the RING_DEGREE and ULYSSES_DEGREE for a 2-GPU setup, changing the parallelism approach:

import vertexai
from vertexai import model_garden

# Replace with your project ID and region
vertexai.init(project="<YOUR_PROJECT_ID>", location="<REGION>")

model = model_garden.OpenModel("black-forest-labs/FLUX.1-schnell")

# Custom environment variables to override default settings.
# This example sets N_GPUS to 2, so RING_DEGREE * ULYSSES_DEGREE must equal 2.
container_env_vars = {
    "N_GPUS": "2",
    "RING_DEGREE": "1",
    "ULYSSES_DEGREE": "2",
    # Add other environment variables to customize here
}

machine_type = "a3-highgpu-2g"
accelerator_type = "NVIDIA_H100_80GB"
accelerator_count = 2

# Deploy the model with the custom environment variables
endpoint = model.deploy(
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    container_env_vars=container_env_vars,
)

Remember to consult the xDiT arguments section for details on each environment variable. Ensure that the product of the parallelism degrees (for example, PIPEFUSION_PARALLEL_DEGREE, ULYSSES_DEGREE, RING_DEGREE, and USE_CFG_PARALLEL) equals the total number of GPUs (N_GPUS).

For more examples of serving recipes and configurations for different models, see the xDiT official documentation. For additional information about the Model Garden SDK, see the documentation.
