About multi-cluster GKE Inference Gateway

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

The multi-cluster Google Kubernetes Engine (GKE) Inference Gateway load balances your AI/ML inference workloads across multiple GKE clusters. It integrates GKE multi-cluster gateways for cross-cluster traffic routing with Inference Gateway for AI/ML model serving. This integration improves the scalability and high availability of your deployments. This document explains the gateway's core concepts and benefits.

For more information about how to deploy multi-cluster GKE Inference Gateway, see Set up the multi-cluster GKE Inference Gateway.

To understand this document, you must be familiar with the following:

This document targets the following personas:

  • Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists interested in using Kubernetes container orchestration capabilities for serving AI/ML workloads.
  • Cloud architects or Networking specialists who interact with Kubernetes networking.

To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Benefits of the multi-cluster GKE Inference Gateway

The multi-cluster GKE Inference Gateway provides several benefits for managing your AI/ML inference workloads, including the following:

  • Enhances high availability and fault tolerance through intelligent load balancing across multiple GKE clusters, even across different geographical regions. Your inference workloads remain available, and the system automatically reroutes requests if a cluster or region experiences issues, thereby minimizing downtime.
  • Improves scalability and optimizes resource usage by pooling GPU and TPU resources from various clusters to handle increased demand. This pooling allows your workloads to burst beyond the capacity of a single cluster and efficiently use available resources across your fleet.
  • Maximizes performance with globally optimized routing. The gateway uses advanced metrics, such as Key-Value (KV) Cache usage from all clusters, to make efficient routing decisions. This approach helps to ensure that requests go to the cluster best equipped to handle them, thereby maximizing overall performance for your AI/ML inference fleet.

Limitations

The multi-cluster GKE Inference Gateway has the following limitations:

  • Model Armor integration: the multi-cluster GKE Inference Gateway doesn't support Model Armor integration.

  • Envoy Proxy latency reporting: the Envoy Proxy only reports query latency for successful (2xx) requests and ignores errors and timeouts. This behavior can cause the Global Server Load Balancer (GSLB) to underestimate the true load on failing backends, potentially directing more traffic to already overloaded services. To mitigate this issue, configure a larger request timeout, such as the recommended value of 600s, as in the sketch after this list.
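As a minimal sketch of that mitigation, the backend timeout can be raised through a GCPBackendPolicy (described under Key components below) attached to the imported pool. The resource names here are hypothetical, and the timeoutSec field is an assumption to verify against the GCPBackendPolicy schema installed in your cluster:

```yaml
# Sketch: raise the request timeout for the inference backends so that slow
# or failing requests are not cut off before they can be accounted for.
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: vllm-pool-timeout          # hypothetical policy name
spec:
  targetRef:
    group: networking.gke.io
    kind: GCPInferencePoolImport   # the imported pool; see Key components
    name: vllm-pool                # hypothetical pool name
  default:
    timeoutSec: 600                # the larger timeout recommended above
```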

Key components

The multi-cluster GKE Inference Gateway uses several Kubernetes custom resources to manage inference workloads and traffic routing:

  • InferencePool: groups identical model server backends in your target cluster. This resource simplifies the management and scaling of your model serving instances.
  • InferenceObjective: defines routing priorities for specific models within an InferencePool. This routing helps to ensure that certain models receive traffic preference based on your requirements.
  • GCPInferencePoolImport: makes your model backends available for routing configuration by using HTTPRoute in the config cluster. This resource is automatically created in your config cluster when you export an InferencePool from a target cluster. The config cluster acts as the central point of control for your multi-cluster environment.
  • GCPBackendPolicy: customizes how traffic is load balanced to your backends. For example, you can enable load balancing based on custom metrics or set limits on in-flight requests per endpoint to protect your model servers.
  • AutoscalingMetric: defines custom metrics, such as vllm:kv_cache_usage_perc, to export from your model servers. You can then use these metrics within GCPBackendPolicy to make more intelligent load balancing decisions and optimize performance and resource utilization. The sketch after this list shows how the target-cluster resources fit together.
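As a concrete illustration, the following sketch groups a set of model server Pods into a pool and gives traffic served from that pool a routing priority. The names are hypothetical, and the field names follow the v1alpha2-style schema of the open-source Gateway API inference extension; treat them as assumptions to verify against the CRD versions installed in your clusters:

```yaml
# Target cluster: group model server Pods into an InferencePool, then set a
# routing priority for traffic served from that pool.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-pool                  # hypothetical pool name
spec:
  selector:
    app: vllm-llama3-8b            # labels on your model server Pods
  targetPortNumber: 8000           # port your model server listens on
  extensionRef:
    name: vllm-pool-epp            # endpoint picker for this pool (assumption)
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: llama3-objective           # hypothetical objective name
spec:
  poolRef:
    name: vllm-pool                # the pool this objective applies to
  priority: 10                     # higher values get traffic preference (assumption)
```

Exporting vllm-pool from the target cluster is what triggers the automatic creation of the matching GCPInferencePoolImport in the config cluster.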

How the multi-cluster GKE Inference Gateway works

The multi-cluster GKE Inference Gateway manages and routes traffic to your AI/ML models deployed across multiple GKE clusters. It works as follows:

  • Centralized traffic management: a dedicated config cluster defines your traffic routing rules. The config cluster acts as the central point of control for your multi-cluster environment. You designate a GKE cluster as the config cluster when you enable multi-cluster Ingress for your fleet. This centralized approach lets you manage how requests are directed to your models across your entire fleet of GKE clusters from a single place.
  • Flexible model deployment: your actual AI/ML models run in separate target clusters. This separation lets you deploy models where it makes the most sense (for example, closer to data or to clusters with specific hardware).
  • Easy integration of models: when you deploy a model in a target cluster, you group its serving instances using an InferencePool. Exporting this InferencePool automatically makes it available for routing in your config cluster.
  • Intelligent load balancing: the gateway doesn't just distribute traffic; it makes intelligent routing decisions. When you configure it to use various signals, including custom metrics from your model servers, the gateway helps to ensure that incoming requests are sent to the best-equipped cluster or model instance, which can maximize performance and resource utilization. For example, you can route requests to the cluster with the most available inference capacity based on metrics like Key-Value (KV) Cache usage. The sketch after this list shows the config-cluster side of this flow.
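To make the config-cluster side concrete, here is a minimal sketch of an HTTPRoute that sends Gateway traffic to an imported pool. The route, Gateway, and pool names are hypothetical, and referencing a GCPInferencePoolImport as a backendRef is an assumption based on the Key components section:

```yaml
# Config cluster: route traffic from the multi-cluster Gateway to the pool
# that was imported when the target cluster exported its InferencePool.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route            # hypothetical route name
spec:
  parentRefs:
  - name: inference-gateway        # hypothetical multi-cluster Gateway
  rules:
  - backendRefs:
    - group: networking.gke.io
      kind: GCPInferencePoolImport # created automatically on export
      name: vllm-pool              # matches the exported InferencePool
```

A GCPBackendPolicy that targets the same import (as in the Limitations sketch) is where signals such as vllm:kv_cache_usage_perc would be wired into the load balancing decision.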

What's next

  • Set up the multi-cluster GKE Inference Gateway