Vertex AI serverless training overview
Vertex AI provides a managed training service that helps you operationalize large-scale model training. You can use Vertex AI to run training applications based on any machine learning (ML) framework on Google Cloud infrastructure. For popular ML frameworks such as TensorFlow, PyTorch, scikit-learn, and XGBoost, Vertex AI also has integrated support that simplifies the preparation process for model training and serving.
This page explains the benefits of serverless training on Vertex AI, the workflow involved, and the various training options that are available.
Vertex AI operationalizes training at scale
There are several challenges to operationalizing model training. These challenges include the time and cost needed to train models, the depth of skills required to manage the compute infrastructure, and the need to provide enterprise-level security. Vertex AI addresses these challenges while providing a host of other benefits.
Fully managed compute infrastructure
Model training on Vertex AI is a fully managed service that requires no administration of physical infrastructure. You can train ML models without the need to provision or manage servers. You only pay for the compute resources that you consume. Vertex AI also handles job logging, queuing, and monitoring.
High performance
Vertex AI training jobs are optimized for ML model training, which can provide faster performance than directly running your training application on a Google Kubernetes Engine (GKE) cluster. You can also identify and debug performance bottlenecks in your training job by using Cloud Profiler.
Distributed training
Reduction Server is an all-reduce algorithm in Vertex AI that can increase throughput and reduce latency of multi-node distributed training on NVIDIA graphics processing units (GPUs). This optimization helps reduce the time and cost of completing large training jobs.
Hyperparameter optimization
Hyperparameter tuning jobs run multiple trials of your training application using different hyperparameter values. You specify a range of values to test, and Vertex AI discovers the optimal values for your model within that range.
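For example, a hyperparameter tuning job created with the Vertex AI SDK for Python might look like the following sketch. The project, bucket, container image URI, and the "accuracy" metric name are placeholders; your training code must report whatever metric you name here.

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

# Placeholder project, location, and staging bucket.
aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")

# The custom job that each trial runs; the image URI is a placeholder.
custom_job = aiplatform.CustomJob(
    display_name="trial-job",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/my-trainer:latest"},
    }],
)

# Tune a learning rate and batch size, maximizing a reported
# "accuracy" metric across 16 trials, 4 at a time.
tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="hp-tuning-job",
    custom_job=custom_job,
    metric_spec={"accuracy": "maximize"},
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec(values=[16, 32, 64], scale="linear"),
    },
    max_trial_count=16,
    parallel_trial_count=4,
)
tuning_job.run()
```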
Enterprise security
Vertex AI provides enterprise security features such as customer-managed encryption keys (CMEK), VPC Service Controls, and fine-grained access control through Identity and Access Management (IAM).
ML operations (MLOps) integrations
Vertex AI provides a suite of integrated MLOps tools and features that you can use for purposes such as experiment tracking, model management in Vertex AI Model Registry, and model monitoring.
Workflow for serverless training
The following diagram shows a high-level overview of the serverless training workflow on Vertex AI. The sections that follow describe each step in detail.

[Diagram: the serverless training workflow, from data preparation through training to model import]
Load and prepare training data
For the best performance and support, use a Google Cloud service such as Cloud Storage or BigQuery as your data source. For a comparison of these services, see Data preparation overview.
You can also specify a Vertex AI managed dataset as the data source when using a training pipeline to train your model. Training a custom model and an AutoML model using the same dataset lets you compare the performance of the two models.
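As a sketch, creating a managed tabular dataset with the Vertex AI SDK for Python might look like the following; the display name and Cloud Storage path are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create a Vertex AI managed dataset from a CSV file in Cloud Storage.
# The bucket and file path are placeholders.
dataset = aiplatform.TabularDataset.create(
    display_name="my-training-data",
    gcs_source="gs://my-bucket/data/train.csv",
)
print(dataset.resource_name)
```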
Prepare your training application
To prepare your training application for use on Vertex AI, do the following:
- Implement training code best practices for Vertex AI.
- Determine a type of container image to use.
- Package your training application into a supported format based on the selected container image type.
Implement training code best practices
Your training application should implement the training code best practices for Vertex AI. These best practices relate to the ability of your training application to do the following (a brief sketch follows this list):
- Access Google Cloud services.
- Load input data.
- Enable autologging for experiment tracking.
- Export model artifacts.
- Use the environment variables of Vertex AI.
- Ensure resilience to VM restarts.
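As a sketch of the environment-variable practices, a training script can read the directories that Vertex AI provides and fall back to local paths when run elsewhere. The fallback paths are assumptions for running outside Vertex AI; `AIP_MODEL_DIR` is only set when the job is configured with a base output directory.

```python
import json
import os

# Environment variables that Vertex AI sets for training jobs.
# The local fallback paths are assumptions for running the script
# outside Vertex AI.
model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/model")
checkpoint_dir = os.environ.get("AIP_CHECKPOINT_DIR", "/tmp/checkpoints")
tensorboard_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "/tmp/logs")

# ... training code goes here: write checkpoints to checkpoint_dir so
# a restarted VM can resume, log to tensorboard_dir, and export final
# model artifacts to model_dir ...

print(json.dumps({"model_dir": model_dir, "checkpoint_dir": checkpoint_dir}))
```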
Select a container type
Vertex AI runs your training application in a Docker container image. A Docker container image is a self-contained software package that includes code and all dependencies, which can run in almost any computing environment. You can either specify the URI of a prebuilt container image to use, or create and upload a custom container image that has your training application and dependencies pre-installed.
The following table shows the differences between prebuilt and custom container images:
| Specifications | Prebuilt container images | Custom container images |
|---|---|---|
| ML framework | Each container image is specific to an ML framework. | Use any ML framework or use none. |
| ML framework version | Each container image is specific to an ML framework version. | Use any ML framework version, including minor versions and nightly builds. |
| Application dependencies | Common dependencies for the ML framework are pre-installed. You can specify additional dependencies to install in your training application. | Pre-install the dependencies that your training application needs. |
| Application delivery format | Single Python file or Python source distribution, installed into the prebuilt container image when the training job runs. | Pre-install the training application in the custom container image. |
| Effort to set up | Low | High |
| Recommended for | Python training applications based on an ML framework and framework version that has a prebuilt container image available. | Training applications that need an ML framework or framework version without a prebuilt container image, non-Python code, or dependencies that must be pre-installed. |
Package your training application
After you've determined the type of container image to use, package your training application into one of the following formats based on the container image type:
Single Python file for use in a prebuilt container
Write your training application as a single Python file and use the Vertex AI SDK for Python to create a `CustomJob` or `CustomTrainingJob` class. The Python file is packaged into a Python source distribution and installed to a prebuilt container image. Delivering your training application as a single Python file is suitable for prototyping. For production training applications, you'll likely have your training application arranged into more than one file.

Python source distribution for use in a prebuilt container

Package your training application into one or more Python source distributions and upload them to a Cloud Storage bucket. Vertex AI installs the source distributions to a prebuilt container image when you create a training job.
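A Python source distribution can be built with setuptools. The following minimal setup.py is a sketch; the package name "trainer" and the dependency list are assumptions.

```python
# setup.py -- a minimal sketch for packaging a training application
# as a Python source distribution. Build it with: python setup.py sdist
# The package name "trainer" and the dependency list are assumptions.
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=[
        # Dependencies beyond what the prebuilt container pre-installs.
        "pandas",
    ],
)
```

You would then upload the resulting dist/trainer-0.1.tar.gz archive to a Cloud Storage bucket and reference it when you create the training job.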
Custom container image
Create your own Docker container image that has your training application and dependencies pre-installed, and upload it to Artifact Registry. If your training application is written in Python, you can perform these steps by using one Google Cloud CLI command.
Configure a training job
A Vertex AI training job performs the following tasks:
- Provisions one (single-node training) or more (distributed training) virtual machines (VMs).
- Runs your containerized training application on the provisioned VMs.
- Deletes the VMs after the training job completes.
Vertex AI offers three types of training jobs for running your training application (a sketch follows these descriptions):
A custom job (`CustomJob`) runs your training application. If you're using a prebuilt container image, model artifacts are output to the specified Cloud Storage bucket. For custom container images, your training application can also output model artifacts to other locations.

A hyperparameter tuning job (`HyperparameterTuningJob`) runs multiple trials of your training application using different hyperparameter values until it produces model artifacts with the optimal performing hyperparameter values. You specify the range of hyperparameter values to test and the metrics to optimize for.

A training pipeline (`CustomTrainingJob`) runs a custom job or hyperparameter tuning job and optionally exports the model artifacts to Vertex AI to create a model resource. You can specify a Vertex AI managed dataset as your data source.
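As a sketch, a training pipeline created with the Vertex AI SDK for Python might look like the following. The project, bucket, script path, and both container image URIs are placeholders; check the prebuilt container documentation for current image tags.

```python
from google.cloud import aiplatform

# Placeholder project, location, and staging bucket.
aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")

# A training pipeline that runs a single-file training script in a
# prebuilt container image; the URIs below are placeholders.
job = aiplatform.CustomTrainingJob(
    display_name="my-training-pipeline",
    script_path="task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-17.py310:latest",
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-17:latest"),
)

# Runs the job and, because a serving container image is set, creates
# a Vertex AI model resource from the exported model artifacts.
model = job.run(machine_type="n1-standard-4", replica_count=1)
```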
When creating a training job, specify the compute resources to use for running your training application and configure your container settings.
Compute configurations
Specify the compute resources to use for a training job. Vertex AI supports single-node training, where the training job runs on one VM, and distributed training, where the training job runs on multiple VMs.
The compute resources that you can specify for your training job are as follows (see the configuration sketch after these descriptions):
VM machine type
Different machine types offer different CPUs, memory size, and bandwidth.
Graphics processing units (GPUs)
You can add one or more GPUs to A2 or N1 type VMs. If your training application is designed to use GPUs, adding GPUs can significantly improve performance.
Tensor Processing Units (TPUs)
TPUs are designed specifically for accelerating machine learning workloads. When using a TPU VM for training, you can specify only one worker pool. That worker pool can have only one replica.
Boot disks
You can use SSDs (default) or HDDs for your boot disk. If your training application reads and writes to disk, using SSDs can improve performance. You can also specify the size of your boot disk based on the amount of temporary data that your training application writes to disk. Boot disks can have between 100 GiB (default) and 64,000 GiB. All VMs in a worker pool must use the same type and size of boot disk.
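Putting these options together, a worker pool specification passed to the Vertex AI SDK for Python might look like the following sketch. The machine type, GPU choice, disk size, and image URI are illustrative values, not recommendations.

```python
# One worker pool: an N1 VM with a single NVIDIA T4 GPU and an SSD
# boot disk. All values below are illustrative placeholders.
worker_pool_specs = [{
    "machine_spec": {
        "machine_type": "n1-standard-8",
        "accelerator_type": "NVIDIA_TESLA_T4",
        "accelerator_count": 1,
    },
    "replica_count": 1,
    "disk_spec": {
        "boot_disk_type": "pd-ssd",   # or "pd-standard" for HDD
        "boot_disk_size_gb": 100,     # 100 GiB default, up to 64,000 GiB
    },
    "container_spec": {"image_uri": "gcr.io/my-project/my-trainer:latest"},
}]
```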
Container configurations
The container configurations that you need to make depend on whether you're using a prebuilt or custom container image; a worker pool sketch covering both cases follows these lists.
Prebuilt container configurations:
- Specify the URI of the prebuilt container image that you want to use.
- If your training application is packaged as a Python source distribution, specify the Cloud Storage URI where the package is located.
- Specify the entry point module of your training application.
- Optional: Specify a list of command-line arguments to pass to the entry point module of your training application.
Custom container configurations:
- Specify the URI of your custom container image, which can be a URI from Artifact Registry or Docker Hub.
- Optional: Override the `ENTRYPOINT` or `CMD` instructions in your container image.
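In the worker pool specification, these settings map to a python_package_spec for prebuilt images and a container_spec for custom images. The following sketch shows both; all URIs, module names, and arguments are placeholders.

```python
# Prebuilt container image: Vertex AI installs the packages listed in
# package_uris and runs python_module with the given args.
prebuilt_worker_pool_spec = {
    "machine_spec": {"machine_type": "n1-standard-4"},
    "replica_count": 1,
    "python_package_spec": {
        # Prebuilt image URI is a placeholder; check the docs for tags.
        "executor_image_uri": "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-17.py310:latest",
        "package_uris": ["gs://my-bucket/trainer-0.1.tar.gz"],
        "python_module": "trainer.task",
        "args": ["--epochs=10"],
    },
}

# Custom container image: "command" overrides the image's ENTRYPOINT
# instruction and "args" overrides its CMD instruction.
custom_worker_pool_spec = {
    "machine_spec": {"machine_type": "n1-standard-4"},
    "replica_count": 1,
    "container_spec": {
        "image_uri": "us-central1-docker.pkg.dev/my-project/my-repo/my-trainer:latest",
        "command": ["python", "task.py"],
        "args": ["--epochs=10"],
    },
}
```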
Create a training job
After your data and training application are prepared, run your training application by creating one of the training jobs described earlier: a custom job, a hyperparameter tuning job, or a training pipeline.
To create the training job, you can use the Google Cloud console, Google Cloud CLI, Vertex AI SDK for Python, or the Vertex AI API.
(Optional) Import model artifacts into Vertex AI
Your training application likely outputs one or more model artifacts to a specified location, usually a Cloud Storage bucket. Before you can get inferences in Vertex AI from your model artifacts, first import the model artifacts into Vertex AI Model Registry.
Like container images for training, Vertex AI gives you the choice of using prebuilt or custom container images for inferences. If a prebuilt container image for inferences is available for your ML framework and framework version, we recommend using a prebuilt container image.
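As a sketch, importing model artifacts with the Vertex AI SDK for Python might look like the following; the artifact location and the serving container image URI are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Import exported model artifacts into Vertex AI Model Registry.
# The bucket path and serving container URI are placeholders.
model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://my-bucket/model/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-17:latest"),
)
print(model.resource_name)
```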
What's next
- Get inferences from your model.
- Evaluate your model.
- Try the Hello serverless training tutorial for step-by-step instructions on training a TensorFlow Keras image classification model on Vertex AI.