Use Vertex AI TensorBoard with custom training

When you use custom training to train models, you can set up your training job to automatically upload your Vertex AI TensorBoard logs to Vertex AI TensorBoard.

You can use this integration to monitor your training in near real time, as Vertex AI TensorBoard streams in Vertex AI TensorBoard logs as they're written to Cloud Storage.

For initial setup, see Set up for Vertex AI TensorBoard.

Changes to your training script

Your training script must be configured to write TensorBoard logs to the Cloud Storage bucket whose location the Vertex AI Training service automatically makes available through the predefined environment variable AIP_TENSORBOARD_LOG_DIR.

This is usually done by providing os.environ['AIP_TENSORBOARD_LOG_DIR'] as the log directory to the open source TensorBoard log-writing APIs. The location of AIP_TENSORBOARD_LOG_DIR is typically set with the staging_bucket variable.
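For example, a script can read the variable with a local fallback so that the same code runs both on Vertex AI and on a workstation (a minimal sketch; the fallback path logs/local-run is an arbitrary choice, not something Vertex AI defines):

```python
import os

# On Vertex AI, AIP_TENSORBOARD_LOG_DIR points to the Cloud Storage
# location that the Training service provisions for TensorBoard logs.
# Outside Vertex AI the variable is unset, so fall back to a local path.
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "logs/local-run")
print(log_dir)
```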

To configure your training script in TensorFlow 2.x, create a TensorBoard callback and set the log_dir variable to os.environ['AIP_TENSORBOARD_LOG_DIR']. The TensorBoard callback is then included in the TensorFlow model.fit callbacks list.

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
    histogram_freq=1,
)

model.fit(
    x=x_train,
    y=y_train,
    epochs=epochs,
    validation_data=(x_test, y_test),
    callbacks=[tensorboard_callback],
)

Learn more about how Vertex AI sets environment variables in your custom training environment.

Create a custom training job

The following example shows how to create your own custom training job.

For a detailed example of how to create a custom training job, see Hello custom training. For steps to build custom training containers, see Create a custom container image for training.

To create a custom training job, use either the Vertex AI SDK for Python or REST.

Python

def create_training_pipeline_custom_job_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    model_serving_container_image_uri: str,
    dataset_id: Optional[str] = None,
    model_display_name: Optional[str] = None,
    args: Optional[List[Union[str, float, int]]] = None,
    replica_count: int = 0,
    machine_type: str = "n1-standard-4",
    accelerator_type: str = "ACCELERATOR_TYPE_UNSPECIFIED",
    accelerator_count: int = 0,
    training_fraction_split: float = 0.8,
    validation_fraction_split: float = 0.1,
    test_fraction_split: float = 0.1,
    sync: bool = True,
    tensorboard_resource_name: Optional[str] = None,
    service_account: Optional[str] = None,
):
    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)

    job = aiplatform.CustomTrainingJob(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        model_serving_container_image_uri=model_serving_container_image_uri,
    )

    # This example uses an ImageDataset, but you can use another type
    dataset = aiplatform.ImageDataset(dataset_id) if dataset_id else None

    model = job.run(
        dataset=dataset,
        model_display_name=model_display_name,
        args=args,
        replica_count=replica_count,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        training_fraction_split=training_fraction_split,
        validation_fraction_split=validation_fraction_split,
        test_fraction_split=test_fraction_split,
        sync=sync,
        tensorboard=tensorboard_resource_name,
        service_account=service_account,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    print(model.uri)
    return model
  • project: Your project ID. You can find these IDs in the Google Cloud console welcome page.
  • location: The location to run the CustomJob in. This should be the same location as the provided TensorBoard instance.
  • staging_bucket: The Cloud Storage bucket to stage artifacts during API calls, including TensorBoard logs.
  • display_name: Display name of the custom training job.
  • script_path: The path, relative to the working directory on your local file system, to the script that is the entry point for your training code.
  • container_uri: The URI of the training container image; this can be a Vertex AI prebuilt training container or a custom container.
  • model_serving_container_image_uri: The URI of the model serving container suitable for serving the model produced by the training script.
  • dataset_id: The ID number for the dataset to use for training.
  • model_display_name: Display name of the trained model.
  • args: Command line arguments to be passed to the Python script.
  • replica_count: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
  • machine_type: The type of VM to use. For a list of supported VMs, see Machine types.
  • accelerator_type: The type of GPU to attach to each VM in the resource pool. For a list of supported GPUs, seeGPUs.
  • accelerator_count: The number of GPUs to attach to each VM in the resource pool. The default value is 1.
  • training_fraction_split: The fraction of the dataset to use to train your model.
  • validation_fraction_split: The fraction of the dataset to use to validate your model.
  • test_fraction_split: The fraction of the dataset to use to evaluate your model.
  • sync: Whether to execute this method synchronously.
  • tensorboard_resource_name: The resource name of the Vertex TensorBoard instance to which CustomJob will upload TensorBoard logs.
  • service_account: Required when running with TensorBoard. See Create a service account with required permissions.
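The three split fractions are expected to sum to 1.0, so a quick pre-submission check can catch a mismatch before the job fails (a minimal sketch using the default values from the sample above):

```python
# Defaults from the sample above; the split fractions should sum to 1.0.
training_fraction_split = 0.8
validation_fraction_split = 0.1
test_fraction_split = 0.1

total = training_fraction_split + validation_fraction_split + test_fraction_split
# Compare with a tolerance to avoid floating-point surprises.
assert abs(total - 1.0) < 1e-9, f"splits sum to {total}, expected 1.0"
print("splits OK")
```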

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The location to run the CustomJob in, for example, us-central1. This should be the same location as the provided TensorBoard instance.
  • PROJECT_ID: Your project ID.
  • TENSORBOARD_INSTANCE_NAME: (Required) The full name of the existing Vertex AI TensorBoard instance storing your Vertex AI TensorBoard logs:
    projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID
    Note: If the TensorBoard instance doesn't exist, the customJobs creation returns a 404 error.
  • GCS_BUCKET_NAME: "${PROJECT_ID}-tensorboard-logs-${LOCATION}"
  • USER_SA_EMAIL: (Required) The service account created in previous steps, or your own service account. "USER_SA_NAME@${PROJECT_ID}.iam.gserviceaccount.com"
  • TRAINING_CONTAINER: The URI of your training container image.
  • INVOCATION_TIMESTAMP: "$(date +'%Y%m%d-%H%M%S')"
  • JOB_NAME: "tensorboard-example-job-${INVOCATION_TIMESTAMP}"
  • BASE_OUTPUT_DIR: (Required) The Cloud Storage path where all output of the training is written. "gs://$GCS_BUCKET_NAME/$JOB_NAME"
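Assembled in a shell, the replacements above compose like this (a sketch; my-project and us-central1 are placeholder values, not defaults):

```shell
# Placeholder project and location; substitute your own values.
PROJECT_ID="my-project"
LOCATION_ID="us-central1"

# Bucket, timestamp, job name, and output directory as defined above.
GCS_BUCKET_NAME="${PROJECT_ID}-tensorboard-logs-${LOCATION_ID}"
INVOCATION_TIMESTAMP="$(date +'%Y%m%d-%H%M%S')"
JOB_NAME="tensorboard-example-job-${INVOCATION_TIMESTAMP}"
BASE_OUTPUT_DIR="gs://${GCS_BUCKET_NAME}/${JOB_NAME}"

echo "${BASE_OUTPUT_DIR}"
```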

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs

Request JSON body:

{
  "displayName": JOB_NAME,
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": "1",
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "containerSpec": {
          "imageUri": TRAINING_CONTAINER
        }
      }
    ],
    "base_output_directory": {
      "output_uri_prefix": BASE_OUTPUT_DIR
    },
    "serviceAccount": USER_SA_EMAIL,
    "tensorboard": TENSORBOARD_INSTANCE_NAME
  }
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
  "displayName": JOB_NAME,
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": "1",
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "containerSpec": {
          "imageUri": TRAINING_CONTAINER
        }
      }
    ],
    "base_output_directory": {
      "output_uri_prefix": BASE_OUTPUT_DIR
    },
    "serviceAccount": USER_SA_EMAIL,
    "tensorboard": TENSORBOARD_INSTANCE_NAME
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "displayName": JOB_NAME,
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": "1",
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "containerSpec": {
          "imageUri": TRAINING_CONTAINER
        }
      }
    ],
    "base_output_directory": {
      "output_uri_prefix": BASE_OUTPUT_DIR
    },
    "serviceAccount": USER_SA_EMAIL,
    "tensorboard": TENSORBOARD_INSTANCE_NAME
  }
}
'@ | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/customJobs/CUSTOM_JOB_ID",
  "displayName": "DISPLAY_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "replicaCount": "1",
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "IMAGE_URI"
        }
      }
    ],
    "serviceAccount": "SERVICE_ACCOUNT",
    "baseOutputDirectory": {
      "outputUriPrefix": "OUTPUT_URI_PREFIX"
    },
    "tensorboard": "projects//locations/LOCATION_ID/tensorboards/tensorboard-id"
  },
  "state": "JOB_STATE_PENDING",
  "createTime": "CREATE-TIME",
  "updateTime": "UPDATE-TIME"
}
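The job's numeric ID can be pulled from the name field of the response for later calls such as gcloud ai custom-jobs describe (a minimal sketch; the response body below is an abbreviated, hypothetical example):

```python
import json

# Abbreviated, hypothetical response body for illustration.
response_body = """
{
  "name": "projects/my-project/locations/us-central1/customJobs/1234567890",
  "state": "JOB_STATE_PENDING"
}
"""

job = json.loads(response_body)
# The custom job ID is the last segment of the resource name.
custom_job_id = job["name"].rsplit("/", 1)[-1]
print(custom_job_id)
```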

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-18 UTC.