Schedule training jobs based on resource availability

For Vertex AI serverless training jobs that request GPU resources, Dynamic Workload Scheduler lets you schedule the jobs based on when the requested GPU resources become available. This page shows you how to schedule serverless training jobs by using Dynamic Workload Scheduler, and how to customize the scheduling behavior on Vertex AI.

Recommended use cases

We recommend using Dynamic Workload Scheduler to schedule serverless training jobs in the following situations:

  • The serverless training job requests L4, A100, H100, H200, or B200 GPUs and you want to run the job as soon as the requested resources become available. For example, when Vertex AI allocates the GPU resources outside of peak hours.
  • Your workload requires multiple nodes and can't start running until all GPU nodes are provisioned and ready at the same time. For example, you're creating a distributed training job.
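The second use case, a distributed job that must wait until every node can start together, can be sketched as a job request body. This is an illustrative sketch only: the image URI, replica counts, and wait duration are placeholder values, not values from this page.

```python
# Placeholder machine configuration; every worker pool must use the same one.
machine_spec = {
    "machineType": "a2-highgpu-1g",
    "acceleratorType": "NVIDIA_TESLA_A100",
    "acceleratorCount": 1,
}

job_spec = {
    "workerPoolSpecs": [
        # Chief worker pool.
        {
            "machineSpec": machine_spec,
            "replicaCount": 1,
            "containerSpec": {"imageUri": "gcr.io/my-project/trainer"},
        },
        # Additional workers: same machine configuration, more replicas.
        {
            "machineSpec": machine_spec,
            "replicaCount": 3,
            "containerSpec": {"imageUri": "gcr.io/my-project/trainer"},
        },
    ],
    # FLEX_START delays the start until all requested GPU nodes can be
    # provisioned at the same time.
    "scheduling": {"strategy": "FLEX_START", "maxWaitDuration": "7200s"},
}
```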

Requirements

To use Dynamic Workload Scheduler, your serverless training job must meet the following requirements:

  • Your serverless training job requests L4, A100, H100, H200, or B200 GPUs.
  • Your serverless training job has a maximum timeout of 7 days or less.
  • Your serverless training job uses the same machine configuration for all worker pools.
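As a sketch, the requirements above could be checked before submission with a small helper. The helper name and the set of accelerator type strings other than NVIDIA_TESLA_A100 are illustrative assumptions, not part of the Vertex AI SDK:

```python
# Hypothetical pre-submission check for the Dynamic Workload Scheduler
# requirements; not part of the Vertex AI SDK.
SUPPORTED_GPUS = {
    "NVIDIA_L4",
    "NVIDIA_TESLA_A100",  # accelerator type used in this page's examples
    "NVIDIA_A100_80GB",
    "NVIDIA_H100_80GB",   # the remaining names are illustrative guesses
    "NVIDIA_H200_141GB",
    "NVIDIA_B200",
}
MAX_TIMEOUT_SECONDS = 7 * 24 * 3600  # maximum timeout of 7 days


def meets_dws_requirements(worker_pool_specs, timeout_seconds):
    """Return True if the job spec satisfies the three requirements above."""
    machine_specs = [pool["machineSpec"] for pool in worker_pool_specs]
    # All worker pools must use the same machine configuration.
    same_config = all(spec == machine_specs[0] for spec in machine_specs)
    # The job must request a supported GPU type.
    gpu_ok = machine_specs[0].get("acceleratorType") in SUPPORTED_GPUS
    return same_config and gpu_ok and timeout_seconds <= MAX_TIMEOUT_SECONDS
```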

Supported job types

All serverless training job types are supported, including CustomJob, HyperparameterTuningJob, and TrainingPipeline.

Enable Dynamic Workload Scheduler in your serverless training job

To enable Dynamic Workload Scheduler in your serverless training job, set the scheduling.strategy API field to FLEX_START when you create the job.

For details on how to create a serverless training job, see the following links.

Configure the duration to wait for resource availability

You can configure how long your job can wait for resources in the scheduling.maxWaitDuration field. A value of 0 means that the job waits indefinitely until the requested resources become available. The default value is 1 day.
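In the REST API this field is expressed as a seconds string (for example "1800s" in the request body below), while the Python SDK takes an integer number of seconds. A small illustrative helper, not part of any SDK, makes the two special values concrete:

```python
# Illustrative helper for scheduling.maxWaitDuration values in REST form.
# The "Ns" string format is taken from this page's REST example; the helper
# itself is not part of the Vertex AI API.
def max_wait_duration(seconds: int) -> str:
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    return f"{seconds}s"


DEFAULT_MAX_WAIT = max_wait_duration(24 * 3600)  # the 1-day default
WAIT_FOREVER = max_wait_duration(0)              # 0 means wait indefinitely
```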

Examples

The following examples show you how to enable Dynamic Workload Scheduler for a CustomJob. Select the tab for the interface that you want to use.

gcloud

When submitting a job using the Google Cloud CLI, add the scheduling.strategy field in the config.yaml file.

Example YAML configuration file:

workerPoolSpecs:
  machineSpec:
    machineType: a2-highgpu-1g
    acceleratorType: NVIDIA_TESLA_A100
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/ucaip-test/ucaip-training-test
    args:
      - port=8500
    command:
      - start
scheduling:
  strategy: FLEX_START
  maxWaitDuration: 7200s
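A config.yaml like this would typically be submitted with a command along the following lines; the region and display name are placeholders you'd replace with your own values:

```shell
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=my-dws-job \
  --config=config.yaml
```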

Python

When submitting a job using the Vertex AI SDK for Python, set the scheduling_strategy field in the relevant CustomJob creation method.

from typing import Optional

from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import custom_job as gca_custom_job_compat


def create_custom_job_with_dws_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    service_account: str,
    experiment: str,
    experiment_run: Optional[str] = None,
) -> None:
    aiplatform.init(
        project=project,
        location=location,
        staging_bucket=staging_bucket,
        experiment=experiment,
    )

    job = aiplatform.CustomJob.from_local_script(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        enable_autolog=True,
        machine_type="a2-highgpu-1g",
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=1,
    )

    # max_wait_duration is in seconds; scheduling_strategy enables
    # Dynamic Workload Scheduler.
    job.run(
        service_account=service_account,
        experiment=experiment,
        experiment_run=experiment_run,
        max_wait_duration=1800,
        scheduling_strategy=gca_custom_job_compat.Scheduling.Strategy.FLEX_START,
    )

REST

When submitting a job using the Vertex AI REST API, set the scheduling.strategy and scheduling.maxWaitDuration fields when creating your serverless training job.

Example request JSON body:

{
  "displayName": "MyDwsJob",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "a2-highgpu-1g",
          "acceleratorType": "NVIDIA_TESLA_A100",
          "acceleratorCount": 1
        },
        "replicaCount": 1,
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "python:3.10",
          "command": ["sleep"],
          "args": ["100"]
        }
      }
    ],
    "scheduling": {
      "maxWaitDuration": "1800s",
      "strategy": "FLEX_START"
    }
  }
}
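Assuming the body is saved as request.json, a request of this shape would typically be sent to the customJobs endpoint as sketched below; PROJECT_ID is a placeholder, and the region in the URL must match the location in the path:

```shell
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/customJobs"
```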

Quota

When you submit a job using Dynamic Workload Scheduler, instead of consuming on-demand Vertex AI quota, Vertex AI consumes preemptible quota. For example, for NVIDIA H100 GPUs, instead of consuming:

aiplatform.googleapis.com/custom_model_training_nvidia_h100_gpus,

Vertex AI consumes:

aiplatform.googleapis.com/custom_model_training_preemptible_nvidia_h100_gpus.

However, preemptible quota is used only in name. Your resources aren't preemptible and behave like standard resources.

Before submitting a job using Dynamic Workload Scheduler, ensure that your preemptible quotas have been increased to a sufficient amount. For details on Vertex AI quotas and instructions for making quota increase requests, see Vertex AI quotas and limits.
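The naming pattern above, with preemptible inserted after the custom_model_training_ prefix, can be illustrated with a small helper; the function itself is illustrative, not part of any Google Cloud API:

```python
# Illustrative mapping from an on-demand custom-training quota metric to its
# Dynamic Workload Scheduler (preemptible) counterpart, based on the naming
# pattern shown above for H100 GPUs.
def to_preemptible_quota_metric(metric: str) -> str:
    prefix = "aiplatform.googleapis.com/custom_model_training_"
    if not metric.startswith(prefix):
        raise ValueError(f"unexpected quota metric: {metric}")
    return prefix + "preemptible_" + metric[len(prefix):]
```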

Billing

When using DWS flex start, you're billed according to Dynamic Workload Scheduler pricing. There are serverless training management fees in addition to your infrastructure usage.

What's Next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-18 UTC.