Prepare training code

Perform serverless training on Vertex AI to run your own machine learning (ML) training code in the cloud, instead of using AutoML. This document describes best practices to consider as you write training code.

Note: This document describes training code best practices specific to Vertex AI, but it doesn't comprehensively explain how to design an ML model or write ML training code. These details vary depending on the purpose of your model and the ML framework that you use for training. If you are new to creating custom ML models, we recommend working through Google's Machine Learning Crash Course with TensorFlow APIs.

Choose a training code structure

First, determine what structure you want your ML training code to take. You can provide training code to Vertex AI in one of the following forms:

  • A Python script to use with a prebuilt container. Use the Vertex AI SDK to create a custom job. This method lets you provide your training application as a single Python script.

  • A Python training application to use with a prebuilt container. Create a Python source distribution with code that trains an ML model and exports it to Cloud Storage. This training application can use any of the dependencies included in the prebuilt container that you plan to use it with.

    Note: If you use the Vertex AI SDK for Python to create a TrainingPipeline resource, then you can provide your training application as a single Python script, rather than as a Python source distribution.

    Use this option if one of the Vertex AI prebuilt containers for training includes all the dependencies that you need for training. For example, if you want to train with PyTorch, scikit-learn, TensorFlow, or XGBoost, then this is likely the better option.

    To learn about best practices specific to this option, read the guide to creating a Python training application.

  • A custom container image. Create a Docker container image with code that trains an ML model and exports it to Cloud Storage. Include any dependencies required by your code in the container image.

    Use this option if you want to use dependencies that are not included in one of the Vertex AI prebuilt containers for training. For example, if you want to train using a Python ML framework that is not available in a prebuilt container, or if you want to train using a programming language other than Python, then this is the better option.

    To learn about best practices specific to this option, read the guide to creating a custom container image.

The rest of this document describes best practices relevant to both training code structures.

Best practices for all serverless training code

When you write serverless training code for Vertex AI, keep in mind that the code will run on one or more virtual machine (VM) instances managed by Google Cloud. This section describes best practices applicable to all custom training code.

Access Google Cloud services in your code

Several of the following sections describe accessing other Google Cloud services from your code. To access Google Cloud services, write your training code to use Application Default Credentials (ADC). Many Google Cloud client libraries authenticate with ADC by default. You don't need to configure any environment variables; Vertex AI automatically configures ADC to authenticate as either the Vertex AI Custom Code Service Agent for your project (by default) or a custom service account (if you have configured one).

Note: If you want to run your training code in your local environment before you run it on Vertex AI, you might want to configure your local environment for ADC. You can do this by downloading a service account key or by using the gcloud auth application-default login command.

However, when you use a Google Cloud client library in your code, Vertex AI might not always connect to the correct Google Cloud project by default. If you encounter permission errors, connecting to the wrong project might be the problem.

This problem occurs because Vertex AI does not run your code directly in your Google Cloud project. Instead, Vertex AI runs your code in one of several separate projects managed by Google. Vertex AI uses these projects exclusively for operations related to your project. Therefore, don't try to infer a project ID from the environment in your training or inference code; specify project IDs explicitly.

If you don't want to hardcode a project ID in your training code, you can reference the CLOUD_ML_PROJECT_ID environment variable: Vertex AI sets this environment variable in every serverless training container to contain the project number of the project where you initiated serverless training. Many Google Cloud tools can accept a project number wherever they take a project ID.

For example, if you want to use the Python Client for Google BigQuery to access a BigQuery table in the same project, then don't try to infer the project in your training code:

Implicit project selection

from google.cloud import bigquery

client = bigquery.Client()

Instead, use code that explicitly selects a project:

Explicit project selection

import os

from google.cloud import bigquery

project_number = os.environ["CLOUD_ML_PROJECT_ID"]
client = bigquery.Client(project=project_number)

If you encounter permission errors after configuring your code in this way, then read the following section about which resources your code can access to adjust the permissions available to your training code.

Which resources your code can access

By default, your training application can access any Google Cloud resources that are available to the Vertex AI Custom Code Service Agent (CCSA) of your project. You can grant the CCSA, and thereby your training application, access to a limited number of other resources by following the instructions in Grant Vertex AI service agents access to other resources. If your training application needs more than read-level access to Google Cloud resources that are not listed on that page, it needs to acquire an OAuth 2.0 access token with the https://www.googleapis.com/auth/cloud-platform scope, which can only be done by using a custom service account.

For example, consider your training code's access to Cloud Storage resources:

By default, Vertex AI can access any Cloud Storage bucket in the Google Cloud project where you're performing serverless training. You can also grant Vertex AI access to Cloud Storage buckets in other projects, or you can precisely customize which buckets a specific job can access by using a custom service account.

Read and write Cloud Storage files with Cloud Storage FUSE

In all serverless training jobs, Vertex AI mounts Cloud Storage buckets that you have access to in the /gcs/ directory of each training node's file system. As a convenient alternative to using the Python Client for Cloud Storage or another library, you can read data from Cloud Storage and write data to Cloud Storage directly through the local file system. For example, to load data from gs://BUCKET/data.csv, you can use the following Python code:

file = open('/gcs/BUCKET/data.csv', 'r')
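
Similarly, writing to a path under /gcs/ uploads the data to the corresponding bucket. A minimal sketch, assuming a bucket named BUCKET that your job can write to (the output path is a placeholder):

with open('/gcs/BUCKET/output/results.txt', 'w') as f:
    f.write('training complete')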

Vertex AI uses Cloud Storage FUSE to mount the storage buckets. Note that directories mounted by Cloud Storage FUSE are not POSIX compliant.

The credentials that you're using for serverless training determine which buckets you can access in this way. The preceding section about which resources your code can access describes exactly which buckets you can access by default and how to customize this access.

Load input data

ML code usually operates on training data in order to train a model. Don't store training data together with your code, whether you create a Python training application or a custom container image. Storing data with code can lead to a poorly organized project, make it difficult to reuse code on different datasets, and cause errors for large datasets.

You can load data from a Vertex AI managed dataset or write your own code to load data from a source outside of Vertex AI, such as BigQuery or Cloud Storage.

For best performance when you load data from Cloud Storage, use a bucket in the region where you're performing serverless training. To learn how to store data in Cloud Storage, read Creating storage buckets and Uploading objects.

To learn about which Cloud Storage buckets you can load data from, read the previous section about which resources your code can access.

To load data from Cloud Storage in your training code, use the Cloud Storage FUSE feature described in the preceding section, or use any library that supports ADC. You don't need to explicitly provide any authentication credentials in your code.

For example, you can use one of the client libraries demonstrated in the Cloud Storage guide to Downloading objects. The Python Client for Cloud Storage, in particular, is included in prebuilt containers. TensorFlow's tf.io.gfile.GFile class also supports ADC.
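
As a minimal sketch of downloading an object with the Python Client for Cloud Storage (BUCKET and data.csv are placeholders for your own bucket and object):

import os

from google.cloud import storage

# Authenticate with ADC; select the project explicitly, as recommended earlier.
client = storage.Client(project=os.environ["CLOUD_ML_PROJECT_ID"])
bucket = client.bucket('BUCKET')
bucket.blob('data.csv').download_to_filename('/tmp/data.csv')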

Load a large dataset

Depending on which machine types you plan to use during serverless training, your VMs might not be able to load the entirety of a large dataset into memory.

If you need to read data that is too large to fit in memory, stream the data or read it incrementally. Different ML frameworks have different best practices for doing this. For example, TensorFlow's tf.data.Dataset class can stream TFRecord or text data from Cloud Storage.
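
For example, a minimal sketch of streaming TFRecord files from Cloud Storage with tf.data (the gs:// path is a placeholder):

import tensorflow as tf

# tf.data reads gs:// paths directly and streams records in batches,
# so the full dataset never needs to fit in memory.
dataset = tf.data.TFRecordDataset('gs://BUCKET/data/train.tfrecord')
dataset = dataset.batch(64).prefetch(tf.data.AUTOTUNE)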

Performing serverless training on multiple VMs with data parallelism is another way to reduce the amount of data each VM loads into memory. See the Writing code for distributed training section of this document.

Export a trained ML model

ML code usually exports a trained model at the end of training in the form of one or more model artifacts. You can then use the model artifacts to get inferences.

After serverless training completes, you can no longer access the VMs that ran your training code. Therefore, your training code must export model artifacts to a location outside of Vertex AI.

We recommend that you export model artifacts to a Cloud Storage bucket. As described in the previous section about which resources your code can access, Vertex AI can access any Cloud Storage bucket in the Google Cloud project where you are performing serverless training. Use a library that supports ADC to export your model artifacts. For example, the TensorFlow APIs for saving Keras models can export artifacts directly to a Cloud Storage path.
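
As a minimal sketch, assuming TensorFlow 2 and a placeholder gs:// path (the model here is a stand-in for your own):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer='adam', loss='mse')
# ...train the model...

# TensorFlow's SavedModel APIs write directly to Cloud Storage paths.
tf.saved_model.save(model, 'gs://BUCKET/model_output/')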

Note: You can't export model artifacts from your training code in a way that directly creates a Model resource. However, if you perform serverless training by creating a TrainingPipeline resource, the TrainingPipeline can export model artifacts to Cloud Storage. The TrainingPipeline can then immediately import those same model artifacts back into Vertex AI as a Model. Learn more in the guide to creating training pipelines.

If you want to use your trained model to serve inferences on Vertex AI, then your code must export model artifacts in a format compatible with one of the prebuilt containers for inference. Learn more in the guide to exporting model artifacts for inference and explanation.

Environment variables for special Cloud Storage directories

If you specify the baseOutputDirectory API field, Vertex AI sets the following environment variables when it runs your training code:

  • AIP_MODEL_DIR: a Cloud Storage URI of a directory intended for saving model artifacts.

  • AIP_CHECKPOINT_DIR: a Cloud Storage URI of a directory intended for saving checkpoints.

  • AIP_TENSORBOARD_LOG_DIR: a Cloud Storage URI of a directory intended for saving TensorBoard logs.

The values of these environment variables differ slightly depending on whether you are using hyperparameter tuning. To learn more, see the API reference for baseOutputDirectory.

Using these environment variables makes it easier to reuse the same training code multiple times, for example with different data or configuration options, and to save model artifacts and checkpoints to different locations, just by changing the baseOutputDirectory API field. However, you aren't required to use the environment variables in your code. For example, you can alternatively hardcode locations for saving checkpoints and exporting model artifacts.
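
For example, a minimal sketch that reads these variables and falls back to local paths when they aren't set (the fallback paths are placeholders):

import os

# Vertex AI sets these variables when you specify baseOutputDirectory.
model_dir = os.environ.get('AIP_MODEL_DIR', '/tmp/model')
checkpoint_dir = os.environ.get('AIP_CHECKPOINT_DIR', '/tmp/checkpoints')

# ...save checkpoints to checkpoint_dir during training, and export the
# final model artifacts to model_dir when training finishes...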

Additionally, if you use a TrainingPipeline for custom training and don't specify the modelToUpload.artifactUri field, then Vertex AI uses the value of the AIP_MODEL_DIR environment variable for modelToUpload.artifactUri. (For hyperparameter tuning, Vertex AI uses the value of the AIP_MODEL_DIR environment variable from the best trial.)

Ensure resilience to restarts

The VMs that run your training code restart occasionally. For example, Google Cloud might need to restart a VM for maintenance reasons. When a VM restarts, Vertex AI starts running your code again from the beginning.

If you expect your training code to run for more than four hours, add the following behaviors to your code to make it resilient to restarts:

  • Frequently export your training progress to Cloud Storage, at least once every four hours, so that you don't lose progress if your VMs restart.

  • At the start of your training code, check whether any training progress already exists in your export location. If so, load the saved training state instead of starting training from scratch.

Four hours is a guideline, not a hard limit. If ensuring resilience is a priority, consider adding these behaviors to your code even if you don't expect it to run for that long.

How you accomplish these behaviors depends on which ML framework you use. For example, if you use TensorFlow Keras, learn how to use the ModelCheckpoint callback for this purpose.
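
As a minimal sketch, assuming TensorFlow Keras and the /gcs/ mount described earlier (the bucket, paths, and model are placeholders):

import tensorflow as tf

CHECKPOINT_DIR = '/gcs/BUCKET/checkpoints'

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer='adam', loss='mse')

# Resume from the most recent checkpoint if a previous run saved one.
latest = tf.train.latest_checkpoint(CHECKPOINT_DIR)
if latest:
    model.load_weights(latest)

# Export progress after every epoch so a restart loses at most one epoch.
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    CHECKPOINT_DIR + '/ckpt-{epoch:02d}', save_weights_only=True)
# model.fit(features, labels, epochs=10, callbacks=[checkpoint_callback])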

To learn more about how Vertex AI manages VMs, see Understand the custom training service.

Best practices for optional serverless training features

If you want to use certain optional serverless training features, you might need to make additional changes to your training code. This section describes code best practices for hyperparameter tuning, GPUs, distributed training, and Vertex AI TensorBoard.

Write code to enable autologging

You can enable autologging by using the Vertex AI SDK for Python to automatically capture parameters and performance metrics when you submit a custom job. For details, see Run training job with experiment tracking.
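
As a minimal sketch, assuming the Vertex AI SDK for Python's autologging support (the project, location, and experiment names are placeholders):

from google.cloud import aiplatform

# Initialize the SDK against your project and an experiment.
aiplatform.init(project='my-project', location='us-central1',
                experiment='my-experiment')

# Automatically capture parameters and metrics from supported ML frameworks.
aiplatform.autolog()

# ...training calls such as model.fit() are now logged to the experiment...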

Write code to return container logs

When you write logs from your service or job, Cloud Logging picks them up automatically as long as the logs are written to any of these locations:

  • Standard output (stdout) streams

  • Standard error (stderr) streams

Most developers are expected to write logs using standard output and standarderror.

The container logs written to these supported locations are automatically associated with the Vertex AI serverless training service, revision, and location, or with the serverless training job. Exceptions contained in these logs are captured by and reported in Error Reporting.

Use plain text versus structured JSON in logs

When you write logs, you can send a plain text string or send a single line of serialized JSON, also called "structured" data. Cloud Logging picks up and parses structured data and places it into jsonPayload. In contrast, a plain text message is placed in textPayload.

Write structured logs

You can pass structured JSON logs in multiple ways. The most common ways are by using the Python logging library or by passing raw JSON using print.

Python logging library

import logging

from pythonjsonlogger import jsonlogger


class CustomJsonFormatter(jsonlogger.JsonFormatter):
    """Formats log lines in JSON."""

    def process_log_record(self, log_record):
        """Modifies fields in the log_record to match Cloud Logging's expectations."""
        log_record['severity'] = log_record['levelname']
        log_record['timestampSeconds'] = int(log_record['created'])
        log_record['timestampNanos'] = int(
            (log_record['created'] % 1) * 1000 * 1000 * 1000)
        return log_record


def configure_logger():
    """Configures python logger to format logs as JSON."""
    formatter = CustomJsonFormatter(
        '%(name)s|%(levelname)s|%(message)s|%(created)f'
        '|%(lineno)d|%(pathname)s',
        '%Y-%m-%dT%H:%M:%S')
    root_logger = logging.getLogger()
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.WARNING)


configure_logger()
logging.warning("This is a warning log")

Raw JSON

import json


def log(severity, message):
    global_extras = {"debug_key": "debug_value"}
    structured_log = {"severity": severity, "message": message, **global_extras}
    print(json.dumps(structured_log))


def main(args):
    log("DEBUG", "Debugging the application.")
    log("INFO", "Info.")
    log("WARNING", "Warning.")
    log("ERROR", "Error.")
    log("CRITICAL", "Critical.")

Special JSON fields in messages

When you provide a structured log as a JSON dictionary, some special fields are stripped from the jsonPayload and are written to the corresponding field in the generated LogEntry, as described in the documentation for special fields.

For example, if your JSON includes a severity property, it is removed from the jsonPayload and appears instead as the log entry's severity. The message property, if present, is used as the main display text of the log entry.

Correlate your container logs with a request log (services only)

In the Logs Explorer, logs correlated by the same trace are viewable in "parent-child" format: when you click the triangle icon at the left of the request log entry, the container logs related to that request show up nested under the request log.

Container logs are not automatically correlated with request logs unless you use a Cloud Logging client library. To correlate container logs with request logs without a client library, you can write a structured JSON log line that contains a logging.googleapis.com/trace field with the trace identifier extracted from the X-Cloud-Trace-Context header.
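
As a minimal sketch (the trace header value and project ID are placeholders; in a real service you would read X-Cloud-Trace-Context from the incoming request):

import json

def log_with_trace(message, trace_header, project_id):
    # X-Cloud-Trace-Context has the form "TRACE_ID/SPAN_ID;o=OPTIONS".
    trace_id = trace_header.split('/')[0]
    print(json.dumps({
        'message': message,
        'severity': 'INFO',
        'logging.googleapis.com/trace': f'projects/{project_id}/traces/{trace_id}',
    }))

log_with_trace('Handled request', '0123456789abcdef0123456789abcdef/1;o=1', 'my-project')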

View logs

To view your container logs in the Google Cloud console, do the following:

  1. In the Google Cloud console, go to the Vertex AI custom jobs page.

    Go to Custom jobs

  2. Click the name of the custom job that you want to see logs for.

  3. Click View logs.

Write code for hyperparameter tuning

Vertex AI can perform hyperparameter tuning on your ML training code. Learn more about how hyperparameter tuning on Vertex AI works and how to configure a HyperparameterTuningJob resource.

If you want to use hyperparameter tuning, your training code must do the following:

  • Parse command-line arguments representing the hyperparameters that you want to tune, and use the parsed values to set the hyperparameters for training.

  • Intermittently report the hyperparameter tuning metric to Vertex AI.

Parse command-line arguments

For hyperparameter tuning, Vertex AI runs your training code multiple times, with different command-line arguments each time. Your training code must parse these command-line arguments and use them as hyperparameters for training. For example, to tune your optimizer's learning rate, you might want to parse a command-line argument named --learning_rate. Learn how to configure which command-line arguments Vertex AI provides.

We recommend that you use Python's argparse library to parse command-line arguments.
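
For example, a minimal sketch of parsing a learning rate and using it for training:

import argparse

parser = argparse.ArgumentParser()
# Vertex AI passes a different value for each tuning trial.
parser.add_argument('--learning_rate', type=float, default=0.01)
args = parser.parse_args()

# ...use args.learning_rate when constructing your optimizer...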

Report the hyperparameter tuning metric

Your training code must intermittently report to Vertex AI the hyperparameter metric that you are trying to optimize. For example, if you want to maximize your model's accuracy, you might report this metric at the end of every training epoch. Vertex AI uses this information to decide which hyperparameters to use for the next training trial. Learn more about selecting and specifying a hyperparameter tuning metric.

Use the cloudml-hypertune Python library to report the hyperparameter tuning metric. This library is included in all prebuilt containers for training, and you can use pip to install it in a custom container.

To learn how to install and use this library, see the cloudml-hypertune GitHub repository, or refer to the Vertex AI: Hyperparameter Tuning codelab.
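
A minimal sketch of reporting a metric with this library (the metric tag and values are placeholders; the tag must match the metric name in your HyperparameterTuningJob configuration):

import hypertune

hpt = hypertune.HyperTune()
# Report the metric being optimized, for example at the end of each epoch.
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='accuracy',
    metric_value=0.87,
    global_step=1)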

Write code for GPUs

You can select VMs with graphics processing units (GPUs) to run your custom training code. Learn more about configuring serverless training to use GPU-enabled VMs.

If you want to train with GPUs, make sure your training code can take advantage of them. Depending on which ML framework you use, this might require changes to your code. For example, if you use TensorFlow Keras, you only need to adjust your code if you want to use more than one GPU. Some ML frameworks can't use GPUs at all.
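
For example, a minimal sketch of one common approach for using multiple GPUs on a single VM with Keras, tf.distribute.MirroredStrategy (see the TensorFlow documentation for other strategies; the model is a placeholder):

import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs on the VM.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer='adam', loss='mse')

# model.fit(...) then trains across all available GPUs.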

In addition, make sure that your container supports GPUs: select a prebuilt container for training that supports GPUs, or install the NVIDIA CUDA Toolkit and NVIDIA cuDNN on your custom container. One way to do this is to use a base image from the nvidia/cuda Docker repository; another way is to use a Deep Learning Containers instance as your base image.

Write code for distributed training

To train on large datasets, you can run your code on multiple VMs in a distributed cluster managed by Vertex AI. Learn how to configure multiple VMs for training.

Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines that automatically coordinate how to divide the work based on environment variables set on each machine. Find out whether Vertex AI sets environment variables to make this possible for your ML framework.
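
For example, for TensorFlow, Vertex AI sets the TF_CONFIG environment variable on each VM in the cluster. A minimal sketch of inspecting it:

import json
import os

# TF_CONFIG describes the cluster and this VM's role ("task") within it.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
print(tf_config.get('cluster'))  # Addresses of all VMs, grouped by worker pool.
print(tf_config.get('task'))     # This VM's task type and index.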

Alternatively, you can run a different container on each of several worker pools. A worker pool is a group of VMs that you configure to use the same compute options and container. In this case, you probably still want to rely on the environment variables set by Vertex AI to coordinate communication between the VMs. You can customize the training code of each worker pool to perform whatever tasks you want; how you do this depends on your goal and which ML framework you use.

Track and visualize serverless training experiments using Vertex AI TensorBoard

Vertex AI TensorBoard is a managed version of TensorBoard, a Google open source project for visualizing machine learning experiments. With Vertex AI TensorBoard, you can track, visualize, and compare ML experiments and then share them with your team. You can also use Cloud Profiler to pinpoint and fix performance bottlenecks so that you can train models faster and at lower cost.

To use Vertex AI TensorBoard with serverless training, you must do the following:

  • Create a Vertex AI TensorBoard instance in your project to store your experiments (see Create a TensorBoard instance).

  • Configure a service account with appropriate permissions to run the serverless training job.

  • Adjust your serverless training code to write TensorBoard-compatible logs to Cloud Storage (see Changes to your training script).

For a step-by-step guide, see Using Vertex AI TensorBoard with serverless training.
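
As a minimal sketch of the last step in the preceding list, assuming TensorFlow Keras and the AIP_TENSORBOARD_LOG_DIR environment variable (which Vertex AI sets when you configure the job with a TensorBoard instance and a baseOutputDirectory; the fallback path is a placeholder):

import os

import tensorflow as tf

# Write TensorBoard-compatible logs to the Cloud Storage location
# that Vertex AI provides.
log_dir = os.environ.get('AIP_TENSORBOARD_LOG_DIR', '/tmp/logs')
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
# model.fit(features, labels, epochs=10, callbacks=[tensorboard_callback])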
