Use private services access endpoints for online inference

Using private services access endpoints to serve online inferences with Vertex AI provides a low-latency, secure connection to the Vertex AI online inference service. This guide shows how to configure private services access on Vertex AI by using VPC Network Peering to peer your network with the Vertex AI online inference service.

Overview

Before you serve online inference with private endpoints, you must configure private services access to create peering connections between your network and Vertex AI. If you have already set this up, you can use your existing peering connections.
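
If you haven't set up private services access yet, the typical flow is to reserve an IP address range for Google services and then create the peering connection. The following is a minimal sketch using the gcloud CLI; the range name and prefix length are illustrative placeholders:

gcloud compute addresses create PREDICTION_RESERVED_RANGE_NAME \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=16 \
  --network=NETWORK_NAME

gcloud services vpc-peerings connect \
  --service=servicenetworking.googleapis.com \
  --ranges=PREDICTION_RESERVED_RANGE_NAME \
  --network=NETWORK_NAME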

This guide covers the following tasks:

  • Verifying the status of your existing peering connections.
  • Verifying the necessary APIs are enabled.
  • Creating a private endpoint.
  • Deploying a model to a private endpoint.
    • Private endpoints support only one model per endpoint. This is different from a public Vertex AI endpoint, where you can split traffic across multiple models deployed to one endpoint.
    • Private endpoints support AutoML tabular and custom-trained models.
  • Sending an inference to a private endpoint.
  • Cleaning up resources.

Check the status of existing peering connections

If you have existing peering connections you use with Vertex AI, you can list them to check status:

gcloud compute networks peerings list --network NETWORK_NAME

You should see that the state of your peering connections is ACTIVE. Learn more about active peering connections.

Enable the necessary APIs

gcloud services enable aiplatform.googleapis.com
gcloud services enable dns.googleapis.com

Create a private endpoint

To create a private endpoint, add the --network flag when you create an endpoint using the Google Cloud CLI:

gcloud beta ai endpoints create \
  --display-name=ENDPOINT_DISPLAY_NAME \
  --network=FULLY_QUALIFIED_NETWORK_NAME \
  --region=REGION

Replace FULLY_QUALIFIED_NETWORK_NAME with the fully qualified name of your network, in the following format:

projects/PROJECT_NUMBER/global/networks/NETWORK_NAME

If you create the endpoint without specifying a network, then you create a public endpoint.

Limitations of private endpoints

Note the following limitations for private endpoints:

  • Private endpoints don't support traffic splitting. As a workaround, you can implement traffic splitting manually by deploying your model to multiple private endpoints and splitting traffic among the resulting inference URLs for each private endpoint.
  • Private endpoints don't support SSL/TLS.
  • To enable access logging on a private endpoint, contact vertex-ai-feedback@google.com.
  • You can use only one network for all private endpoints in a Google Cloud project. If you want to change to another network, contact vertex-ai-feedback@google.com.
  • Client-side retries on recoverable errors are highly recommended; a minimal retry sketch follows this list. Recoverable errors can include the following:
    • Empty response (HTTP error code 0), possibly due to a transient broken connection.
    • HTTP error codes 5xx that indicate the service might be temporarily unavailable.
  • For HTTP error code 429, which indicates that the system is overloaded, consider slowing down traffic to mitigate this issue instead of retrying.
  • Inference requests from prediction service clients (such as PredictionServiceClient) aren't supported.
  • Private services access endpoints don't support tuned foundation models. To deploy a tuned foundation model, use a Private Service Connect endpoint instead.
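
The following is a minimal bash sketch of the retry policy mentioned above, wrapped around curl; the endpoint URL and request file are the placeholders used later in this guide, and the retry count and backoff are illustrative:

for attempt in 1 2 3 4 5; do
  http_code=$(curl -s -o response.json -w "%{http_code}" -X POST \
    -d @example-request.json \
    http://ENDPOINT_ID.aiplatform.googleapis.com/v1/models/DEPLOYED_MODEL_ID:predict)
  if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
    break                      # success
  elif [ "$http_code" -eq 429 ]; then
    break                      # overloaded: slow down traffic instead of retrying
  elif [ "$http_code" -eq 0 ] || [ "$http_code" -ge 500 ]; then
    sleep $(( attempt * 2 ))   # transient failure: back off, then retry
  else
    break                      # other errors are not retried
  fi
done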

Monitor private endpoints

You can use the metrics dashboard to inspect the availability and latency of the traffic sent to a private endpoint.

To customize monitoring, query the following metrics in Cloud Monitoring:

  • aiplatform.googleapis.com/prediction/online/private/response_count

    The number of inference responses. You can filter this metric by deployed_model_id or HTTP response code.

  • aiplatform.googleapis.com/prediction/online/private/prediction_latencies

    The latency of the inference request in milliseconds. You can filter this metric by deployed_model_id; it's reported only for successful requests.

Learn how to select, query, and display these metrics in Metrics Explorer.
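
You can also query these metrics programmatically through the Cloud Monitoring API. The following is a minimal sketch using curl; PROJECT_ID and the time window are placeholders:

curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type = "aiplatform.googleapis.com/prediction/online/private/response_count"' \
  --data-urlencode 'interval.startTime=2024-05-01T00:00:00Z' \
  --data-urlencode 'interval.endTime=2024-05-01T01:00:00Z' \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries"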

Deploy a model

You can import a new model, or deploy an existing model that you have already uploaded. To upload a new model, use gcloud ai models upload. For more information, see Import models to Vertex AI.

  1. To deploy a model to a private endpoint, see the guide to deploy models. Aside from traffic splitting and manually enabling access logging, you can use any of the other options available for deploying custom-trained models. Refer to the limitations of private endpoints to learn more about how they differ from public endpoints. (A minimal gcloud deployment sketch appears after these steps.)

  2. After you deploy the model, you can get the inference URI from the metadata of your private endpoint.

    1. If you have the display name of your private endpoint, run this command to get the endpoint ID:

      ENDPOINT_ID=$(gcloud ai endpoints list \
        --region=REGION \
        --filter=displayName:ENDPOINT_DISPLAY_NAME \
        --format="value(ENDPOINT_ID.scope())")

      Otherwise, to view the endpoint ID and display name for all of your endpoints, run the following command:

      gcloud ai endpoints list --region=REGION
    2. Finally, to get the inference URI, run the following command:

      gcloud beta ai endpoints describe ENDPOINT_ID \
        --region=REGION \
        --format="value(deployedModels.privateEndpoints.predictHttpUri)"
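
As referenced in step 1, the deployment itself can also be done with the gcloud CLI. The following is a minimal sketch for a model that is already uploaded to Vertex AI; the machine type and replica counts are illustrative:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=REGION \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_DISPLAY_NAME \
  --machine-type=n1-standard-4 \
  --min-replica-count=1 \
  --max-replica-count=1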

Private inference URI format

Note: You might see the http://aiplatformHASH_ID.googleapis.com/v1/models/DEPLOYED_MODEL_HASH_ID:predict format for older private endpoints, where HASH_ID is a hash ID in your inference URL that contains six alphanumeric characters, and DEPLOYED_MODEL_HASH_ID is a hash ID for your deployed model that contains 21 characters (two sets of 10 alphanumeric characters connected by a hyphen).

The inference URI looks different for private endpoints compared to Vertex AI public endpoints:

http://ENDPOINT_ID.aiplatform.googleapis.com/v1/models/DEPLOYED_MODEL_ID:predict

If you choose to undeploy the current model and redeploy with a new one, the domain name is reused but the path includes a different deployed model ID.

Send an inference to a private endpoint

  1. Create a Compute Engine instance in your VPC network. Make sure to create the instance in the same VPC network that you have peered with Vertex AI.

  2. SSH into your Compute Engine instance, and install your inference client, if applicable. Otherwise, you can use curl.

  3. When sending inference requests, use the inference URL obtained from model deployment. In this example, you're sending the request from your inference client on your Compute Engine instance in the same VPC network:

    curl -X POST -d @PATH_TO_JSON_FILE http://ENDPOINT_ID.aiplatform.googleapis.com/v1/models/DEPLOYED_MODEL_ID:predict

    In this sample request, PATH_TO_JSON_FILE is the path to your inference request, saved as a JSON file, for example, example-request.json. A sketch of such a file follows.
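
    For reference, the request body follows the standard Vertex AI format of a JSON object with an "instances" array. A hypothetical example-request.json might look like the following; the instance fields depend entirely on your model's input schema:

    {
      "instances": [
        {"feature_1": 1.0, "feature_2": "abc"}
      ]
    }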

Clean up resources

You can undeploy models and delete private endpoints the same way as for public models and endpoints.
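
For example, a minimal cleanup sketch with the gcloud CLI looks like the following; ENDPOINT_ID, DEPLOYED_MODEL_ID, and REGION are the values used earlier in this guide:

gcloud ai endpoints undeploy-model ENDPOINT_ID \
  --region=REGION \
  --deployed-model-id=DEPLOYED_MODEL_ID

gcloud ai endpoints delete ENDPOINT_ID \
  --region=REGION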

Example: Test private endpoints in Shared VPC

This example uses two Google Cloud projects with a Shared VPC network:

  • The host project hosts the Shared VPC network.
  • The client project hosts a Compute Engine instance where you run an inference client, such as curl or your own REST client, to send inference requests.

When you create the Compute Engine instance in the client project, it must be within the custom subnet in the host project's Shared VPC network, and in the same region where the model gets deployed.
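
For example, a minimal sketch of creating such an instance with the gcloud CLI; the instance name, zone, and subnet name are placeholders, and the subnet is referenced by its full path in the host project:

gcloud compute instances create inference-client \
  --project=CLIENT_PROJECT_ID \
  --zone=ZONE \
  --subnet=projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SHARED_SUBNET_NAME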

  1. Create the peering connections for private services access in the host project. Run gcloud services vpc-peerings connect:

    gcloud services vpc-peerings connect \
      --service=servicenetworking.googleapis.com \
      --network=HOST_SHARED_VPC_NAME \
      --ranges=PREDICTION_RESERVED_RANGE_NAME \
      --project=HOST_PROJECT_ID
  2. Create the endpoint in the client project, using the host project's network name. Run gcloud beta ai endpoints create:

    gcloud beta ai endpoints create \
      --display-name=ENDPOINT_DISPLAY_NAME \
      --network=HOST_SHARED_VPC_NAME \
      --region=REGION \
      --project=CLIENT_PROJECT_ID
  3. Send inference requests using the inference client within the client project.
