Use dedicated public endpoints for online inference
A dedicated public endpoint is a public endpoint for online inference. It offers the following benefits:
- Dedicated networking: When you send an inference request to a dedicated public endpoint, it is isolated from other users' traffic.
- Optimized network latency
- Larger payload support: Up to 10 MB.
- Longer request timeouts: Configurable up to 1 hour.
- Generative AI-ready: Streaming and gRPC are supported. Inference timeout is configurable up to 1 hour.
For these reasons, dedicated public endpoints are recommended as a best practice for serving Vertex AI online inferences.
Note: Tuned Gemini models can only be deployed to shared public endpoints. To learn more, see Choose an endpoint type.
Create a dedicated public endpoint and deploy a model to it
You can create a dedicated endpoint and deploy a model to it by using the Google Cloud console. For details, see Deploy a model by using the Google Cloud console.
You can also create a dedicated public endpoint and deploy a model to it by using the Vertex AI API as follows:
- Create a dedicated public endpoint. You can configure the inference timeout and request-response logging settings when you create the endpoint.
- Deploy the model by using the Vertex AI API.
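The two steps above can be sketched with the Vertex AI SDK for Python. This is a minimal sketch, assuming the `google-cloud-aiplatform` package is installed and that your environment has Google Cloud credentials; parameter names such as `dedicated_endpoint_enabled` and `inference_timeout` follow recent SDK releases and may differ in yours, and all IDs are placeholders.

```python
def dedicated_endpoint_args(display_name: str, timeout_seconds: int = 1800) -> dict:
    """Build the keyword arguments for creating a dedicated public endpoint.

    The inference timeout is configurable up to 1 hour (3600 seconds) at
    endpoint creation time.
    """
    if not 0 < timeout_seconds <= 3600:
        raise ValueError("inference timeout must be between 1 second and 1 hour")
    return {
        "display_name": display_name,
        # Requests a dedicated (rather than shared) public endpoint.
        "dedicated_endpoint_enabled": True,
        # Assumed parameter for the configurable inference timeout, in seconds.
        "inference_timeout": timeout_seconds,
    }


def create_and_deploy(project: str, region: str, model_id: str):
    """Create a dedicated endpoint and deploy a model (needs GCP credentials)."""
    from google.cloud import aiplatform  # deferred so the sketch imports cleanly

    aiplatform.init(project=project, location=region)
    endpoint = aiplatform.Endpoint.create(
        **dedicated_endpoint_args("my-dedicated-endpoint")
    )
    model = aiplatform.Model(model_name=model_id)
    # Machine type and replica counts are illustrative; size them for your model.
    model.deploy(endpoint=endpoint, machine_type="n1-standard-4", min_replica_count=1)
    return endpoint
```

Keeping the argument-building logic in a separate pure function makes the timeout bound easy to validate before any API call is made.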
Get online inferences from a dedicated public endpoint
Dedicated endpoints support both HTTP and gRPC communication protocols. For gRPC requests, the x-vertex-ai-endpoint-id header must be included for proper endpoint identification. The following APIs are supported:
- Predict
- RawPredict
- StreamRawPredict
- Chat Completion (Model Garden only)
You can send online inference requests to a dedicated public endpoint by using the Vertex AI SDK for Python. For details, see Send an online inference request to a dedicated public endpoint.
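As a rough illustration of the HTTP path, the sketch below builds a `:predict` request against a dedicated endpoint's endpoint-specific hostname (dedicated endpoints are not reached through the shared regional `aiplatform.googleapis.com` host). The DNS shape and helper names here are assumptions for illustration; check your endpoint's dedicated DNS field for the authoritative hostname, and use a valid OAuth 2.0 access token.

```python
import json
import urllib.request


def dedicated_predict_url(endpoint_id: str, region: str, project_number: str) -> str:
    """Build the :predict URL for a dedicated public endpoint.

    Assumed DNS shape:
    ENDPOINT_ID.REGION-PROJECT_NUMBER.prediction.vertexai.goog
    """
    host = f"{endpoint_id}.{region}-{project_number}.prediction.vertexai.goog"
    path = (
        f"/v1/projects/{project_number}/locations/{region}"
        f"/endpoints/{endpoint_id}:predict"
    )
    return f"https://{host}{path}"


def predict(url: str, instances: list, access_token: str) -> dict:
    """POST an online inference request (network call; requires credentials)."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

gRPC clients target the same dedicated hostname but must also attach the x-vertex-ai-endpoint-id metadata header mentioned above so the request is routed to the right endpoint.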
Tutorial
To learn more, run the "Vertex AI Model Garden - Gemma (Deployment)" notebook in one of the following environments:
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
Limitations
- Deployment of tuned Gemini models isn't supported.
- VPC Service Controls isn't supported. Use a Private Service Connect endpoint instead.
What's next
- Learn about Vertex AI online inference endpoint types.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-17 UTC.