Use a private IP for Vertex AI serverless training

Note: Vertex AI Training doesn't support VPC Peering with H100-mega, H200, and B200 accelerators. A network attachment with PSC-I can be used as an alternative to VPC Peering.

Using private IP to connect to your training jobs provides more network security and lower network latency than using public IP. To use private IP, you use Virtual Private Cloud (VPC) to peer your network with any type of Vertex AI serverless training job. This allows your training code to access private IP addresses inside your Google Cloud or on-premises networks.

This guide shows how to run serverless training jobs in your network after you have already set up VPC Network Peering to peer your network with a Vertex AI CustomJob, HyperparameterTuningJob, or custom TrainingPipeline resource.

Overview

Before you submit a serverless training job using private IP, you must configure private services access to create peering connections between your network and Vertex AI. If you have already set this up, you can use your existing peering connections.

This guide covers the following tasks:

  • Understand which IP ranges to reserve for serverless training.
  • Verify the status of your existing peering connections.
  • Perform Vertex AI serverless training on your network.
  • Check for active training occurring on one network before training on another network.
  • Test that your training code can access private IPs in your network.

Reserve IP ranges for serverless training

When you reserve an IP range for service producers, the range can be used by Vertex AI and other services. This table shows the maximum number of parallel training jobs that you can run with reserved ranges from /16 to /18, assuming the range is used almost exclusively by Vertex AI. If you connect with other service producers using the same range, allocate a larger range to accommodate them and avoid IP exhaustion.

Machine configuration for training job    Reserved range    Maximum number of parallel jobs
Up to 8 nodes                             /16               63
                                          /17               31
                                          /18               15
Up to 16 nodes                            /16               31
                                          /17               15
                                          /18               7
Up to 32 nodes                            /16               15
                                          /17               7
                                          /18               3

Example machine configurations:

  • Up to 8 nodes: 1 primary replica in the first worker pool, 6 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server).
  • Up to 16 nodes: 1 primary replica in the first worker pool, 14 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server).
  • Up to 32 nodes: 1 primary replica in the first worker pool, 30 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server).

Learn more about configuring worker pools for distributed training.
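When planning a reservation, the limits in the table can be encoded as a small lookup. The following Python sketch is illustrative only: the function name and the example range 10.0.0.0/17 are assumptions, and the numbers are copied from the table rather than derived from any documented formula.

```python
import ipaddress

# Limits copied from the table, keyed by (table row size in nodes, prefix length).
MAX_PARALLEL_JOBS = {
    (8, 16): 63, (8, 17): 31, (8, 18): 15,
    (16, 16): 31, (16, 17): 15, (16, 18): 7,
    (32, 16): 15, (32, 17): 7, (32, 18): 3,
}


def max_parallel_jobs(nodes_per_job: int, reserved_range: str) -> int:
    """Return the table's parallel-job limit for a job size and a reserved CIDR range."""
    if nodes_per_job < 1:
        raise ValueError("a job needs at least one node")
    prefix = ipaddress.ip_network(reserved_range, strict=True).prefixlen
    if prefix not in (16, 17, 18):
        raise ValueError(f"/{prefix} is outside the /16 to /18 ranges covered by the table")
    # Round the job size up to the nearest row in the table (8, 16, or 32 nodes).
    for row_size in (8, 16, 32):
        if nodes_per_job <= row_size:
            return MAX_PARALLEL_JOBS[(row_size, prefix)]
    raise ValueError("the table only covers jobs of up to 32 nodes")


# Hypothetical example: 12-node jobs fall in the "up to 16 nodes" row.
print(max_parallel_jobs(12, "10.0.0.0/17"))  # prints 15
```

The lookup deliberately rejects ranges and job sizes that the table does not cover instead of extrapolating.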

Check the status of existing peering connections

If you have existing peering connections that you use with Vertex AI, you can list them to check their status:

gcloud compute networks peerings list --network NETWORK_NAME

You should see that the state of each peering connection is ACTIVE. Learn more about active peering connections.
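If you want to script this check, one option is to add the global --format=json flag to the command and inspect the result. The sketch below assumes the JSON output is a list of network objects, each with a "peerings" list whose entries carry "name" and "state" fields; verify that shape against your own output before relying on it. The function name and the sample data are illustrative.

```python
import json


def non_active_peerings(listing_json: str) -> list[str]:
    """Return 'name: state' for every peering whose state is not ACTIVE."""
    problems = []
    for network in json.loads(listing_json):
        for peering in network.get("peerings", []):
            state = peering.get("state", "UNKNOWN")
            if state != "ACTIVE":
                problems.append(f"{peering.get('name', '<unnamed>')}: {state}")
    return problems


# Illustrative sample only; real output contains more fields.
sample = json.dumps([
    {"name": "my-vpc", "peerings": [
        {"name": "servicenetworking-googleapis-com", "state": "ACTIVE"},
        {"name": "old-peering", "state": "INACTIVE"},
    ]}
])
print(non_active_peerings(sample))  # prints ['old-peering: INACTIVE']
```

An empty list means every listed peering connection reported ACTIVE.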

Perform serverless training

When you perform serverless training, you must specify the name of the network that you want Vertex AI to have access to.

Depending on how you perform serverless training, specify the network in the network field of the resource that you create: a CustomJob, a HyperparameterTuningJob, or a custom TrainingPipeline.

If you don't specify a network name, then Vertex AI runs your serverless training without a peering connection, and without access to private IPs in your project.

Example: Creating a CustomJob with the gcloud CLI

The following example shows how to specify a network when you use the gcloud CLI to run a CustomJob that uses a prebuilt container. If you perform serverless training in a different way, add the network field as described for the type of serverless training job you're using.

  1. Create a config.yaml file to specify the network. If you're using Shared VPC, use your VPC host project number.

    Make sure the network name is formatted correctly:

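    # Look up the project number. The network name uses the number, not the project ID.
    PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")

    # Write the network field to config.yaml. Replace NETWORK_NAME with your network's name.
    cat <<EOF > config.yaml
    network: projects/${PROJECT_NUMBER}/global/networks/NETWORK_NAME
    EOF

  2. Create a training application to run on Vertex AI.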

  3. Create the CustomJob, passing in your config.yaml file:

    gcloud ai custom-jobs create \
      --region=LOCATION \
      --display-name=JOB_NAME \
      --python-package-uris=PYTHON_PACKAGE_URIS \
      --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=PYTHON_MODULE \
      --config=config.yaml

To learn how to replace the placeholders in this command, read Creating custom training jobs.
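A frequent mistake in step 1 is using the project ID where the project number belongs. As a rough illustration, the following Python sketch builds the full network name and rejects obviously malformed inputs before they reach config.yaml. The function name and the sample values are hypothetical, and the name check follows the general Compute Engine resource naming rule (lowercase letters, digits, and hyphens, up to 63 characters).

```python
import re


def full_network_name(project_number: str, network_name: str) -> str:
    """Build projects/PROJECT_NUMBER/global/networks/NETWORK_NAME, validating both parts."""
    # A project number is digits only; a project ID such as "my-project" is rejected.
    if not re.fullmatch(r"[0-9]+", project_number):
        raise ValueError(f"{project_number!r} is not a project number (digits only)")
    # Compute Engine names: start with a lowercase letter, then lowercase letters,
    # digits, or hyphens, no trailing hyphen, at most 63 characters.
    if not re.fullmatch(r"[a-z]([-a-z0-9]{0,61}[a-z0-9])?", network_name):
        raise ValueError(f"{network_name!r} is not a valid network name")
    return f"projects/{project_number}/global/networks/{network_name}"


# Hypothetical values for illustration.
print(f"network: {full_network_name('123456789012', 'my-vpc')}")
# prints network: projects/123456789012/global/networks/my-vpc
```

The printed line has the same shape as the single entry that the config.yaml file in step 1 contains.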

Run jobs on different networks

You can't perform serverless training on a new network while you are still performing serverless training on another network. Before you switch to a different network, you must wait for all submitted CustomJob, HyperparameterTuningJob, and custom TrainingPipeline resources to finish, or you must cancel them.
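One way to confirm that nothing is still running is to list your training resources as JSON (for example, by adding --format=json to a gcloud list command) and filter on their state. The sketch below assumes each listed resource has "displayName" and "state" fields and that finished resources report a state ending in SUCCEEDED, FAILED, CANCELLED, or EXPIRED; the function name and the sample data are illustrative.

```python
# State suffixes treated as finished. Anything else, including a missing
# state, is conservatively reported as still in progress.
TERMINAL_SUFFIXES = ("_SUCCEEDED", "_FAILED", "_CANCELLED", "_EXPIRED")


def unfinished_resources(resources: list[dict]) -> list[str]:
    """Return display names of resources that have not reached a finished state."""
    pending = []
    for resource in resources:
        state = resource.get("state", "")
        if not state.endswith(TERMINAL_SUFFIXES):
            pending.append(resource.get("displayName", "<unnamed>"))
    return pending


# Illustrative sample only.
listed = [
    {"displayName": "job-a", "state": "JOB_STATE_RUNNING"},
    {"displayName": "job-b", "state": "JOB_STATE_SUCCEEDED"},
    {"displayName": "pipeline-c", "state": "PIPELINE_STATE_CANCELLING"},
]
print(unfinished_resources(listed))  # prints ['job-a', 'pipeline-c']
```

A resource that is still cancelling counts as unfinished, so wait for it to reach a cancelled state before you submit work on the other network.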

Test training job access

This section explains how to test that a serverless training resource can access private IPs in your network.

  1. Create a Compute Engine instance in your VPC network.
  2. Check your firewall rules to make sure that they don't restrict ingress from the Vertex AI network. If they do, add a rule to ensure the Vertex AI network can access the IP range you reserved for Vertex AI (and other service producers).
  3. Set up a local server on the VM instance to create an endpoint for a Vertex AI CustomJob to access.
  4. Create a Python training application to run on Vertex AI. Instead of model training code, create code that accesses the endpoint you set up in the previous step.
  5. Follow the previous example to create a CustomJob.
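The test application in step 4 can be very small. The following sketch, which uses only the Python standard library, requests a URL and reports whether the endpoint answered. It assumes the server on the VM instance speaks HTTP; PROBE_URL is a hypothetical environment variable name, and you could equally hard-code the VM's internal IP address and port.

```python
import os
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (reachable, detail) for an HTTP GET against url."""
    # Ignore proxy settings so the request goes straight to the private IP.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, so the network path works even if the status is an error.
        return True, f"HTTP {exc.code}"
    except OSError as exc:
        # URLError subclasses OSError; this covers refused connections and timeouts.
        return False, str(exc)


if __name__ == "__main__":
    # PROBE_URL is a hypothetical name, for example http://INTERNAL_IP:PORT/
    url = os.environ.get("PROBE_URL")
    if url is None:
        print("Set PROBE_URL to the endpoint on your VM instance.")
    else:
        reachable, detail = probe(url)
        print(f"{url}: {'reachable' if reachable else 'unreachable'} ({detail})")
        if not reachable:
            raise SystemExit(1)  # fail the job so the result shows in its status
```

Exiting with a nonzero status when the endpoint is unreachable makes the outcome visible in the job state as well as in the logs.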

Common problems

This section lists some common issues with configuring VPC Network Peering with Vertex AI.

  • When you configure Vertex AI to use your network, specify the full network name:

    "projects/YOUR_PROJECT_NUMBER/global/networks/YOUR_NETWORK_NAME"

  • Make sure that no serverless training is still running on one network before you start serverless training on a different network.

  • Make sure that you've allocated a sufficient IP range for all service producers your network connects to, including Vertex AI.

For additional troubleshooting information, refer to the VPC Network Peering troubleshooting guide.


Last updated 2025-12-15 UTC.