Run Vertex AI serverless training jobs on a persistent resource

This page shows you how to run a serverless training job on a persistent resource by using the Google Cloud CLI, the Vertex AI SDK for Python, or the REST API.

Normally, when you create a serverless training job, you need to specify compute resources that the job creates and runs on. After you create a persistent resource, you can instead configure the serverless training job to run on one or more resource pools of that persistent resource. Running a serverless training job on a persistent resource significantly reduces the job startup time that's otherwise needed for compute resource creation.

Required roles

To get the permission that you need to run serverless training jobs on a persistent resource, ask your administrator to grant you the Vertex AI User (roles/aiplatform.user) IAM role on your project. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the aiplatform.customJobs.create permission, which is required to run serverless training jobs on a persistent resource.

You might also be able to get this permission with custom roles or other predefined roles.
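For example, an administrator can grant the predefined role with the gcloud CLI. The project ID and principal email below are placeholders; substitute your own values:

```shell
# Grant the Vertex AI User role to a principal on a project.
# PROJECT_ID and the user email are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:example-user@example.com" \
    --role="roles/aiplatform.user"
```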

Create a training job that runs on a persistent resource

To create a serverless training job that runs on a persistent resource, make the following modifications to the standard instructions for creating a serverless training job:

gcloud

  • Specify the --persistent-resource-id flag and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
  • Specify the --worker-pool-spec flag such that the values for machine-type and disk-type match exactly with a corresponding resource pool from the persistent resource. Specify one --worker-pool-spec for single-node training and multiple for distributed training.
  • Specify a replica-count less than or equal to the replica-count or max-replica-count of the corresponding resource pool.
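Putting those flags together, a single-node job on a persistent resource might look like the following sketch. The region, job name, resource ID, and container image are placeholders, and the machine type is assumed to match a resource pool of the persistent resource:

```shell
# Run a custom job on an existing persistent resource.
# All uppercase values are placeholders; replace them with your own.
gcloud ai custom-jobs create \
    --region=LOCATION \
    --display-name=JOB_NAME \
    --persistent-resource-id=PERSISTENT_RESOURCE_ID \
    --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=CONTAINER_URI
```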

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

```python
from typing import Optional

from google.cloud import aiplatform


def create_custom_job_on_persistent_resource_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    container_uri: str,
    persistent_resource_id: str,
    service_account: Optional[str] = None,
) -> None:
    aiplatform.init(
        project=project, location=location, staging_bucket=staging_bucket
    )

    worker_pool_specs = [
        {
            "machine_spec": {
                "machine_type": "n1-standard-4",
                "accelerator_type": "NVIDIA_TESLA_K80",
                "accelerator_count": 1,
            },
            "replica_count": 1,
            "container_spec": {
                "image_uri": container_uri,
                "command": [],
                "args": [],
            },
        }
    ]

    custom_job = aiplatform.CustomJob(
        display_name=display_name,
        worker_pool_specs=worker_pool_specs,
        persistent_resource_id=persistent_resource_id,
    )

    custom_job.run(service_account=service_account)
```

REST

  • Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
  • Specify the worker_pool_specs parameter such that the values of machine_spec and disk_spec for each resource pool match exactly with a corresponding resource pool from the persistent resource. Specify one machine_spec for single-node training and multiple for distributed training.
  • Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool.
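As a sketch, a request body sent to the customJobs.create REST method with those parameters might look like the following. The display name, container image, and resource ID are placeholders, and the machine type is assumed to match a resource pool of the persistent resource:

```json
{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "persistentResourceId": "PERSISTENT_RESOURCE_ID",
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-4"
        },
        "replicaCount": 1,
        "containerSpec": {
          "imageUri": "CONTAINER_URI"
        }
      }
    ]
  }
}
```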

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.