Serve an LLM with GKE Inference Gateway

This tutorial describes how to deploy a large language model (LLM) on Google Kubernetes Engine (GKE) with the GKE Inference Gateway. The tutorial includes steps for cluster setup, model deployment, GKE Inference Gateway configuration, and handling LLM requests.

This tutorial is for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who want to deploy and manage LLM applications on GKE with GKE Inference Gateway.


GKE Inference Gateway enhances Google Kubernetes Engine (GKE) Gateway to optimize the serving of generative AI applications and workloads on GKE. It provides efficient management and scaling of AI workloads, enables workload-specific performance objectives such as latency, and enhances resource utilization, observability, and AI safety.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document. Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set the compute/zone property instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.
  • Enable the Compute Engine API, the Network Services API, and the Model Armor API if needed.

    Go to Enable access to APIs and follow the instructions.

  • Make sure that you have the following roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin.

  • Ensure your project has sufficient quota for H100 GPUs. To learn more, see Plan GPU quota and Allocation quotas.

  • Create a Hugging Face account if you don't already have one. You will need this to access the model resources for this tutorial.

  • Request access to the Llama 3.1 model and generate an access token. Access to this model requires an approved request on Hugging Face, and the deployment will fail if access has not been granted.

    • Sign the license consent agreement: You must sign the consent agreement to use the Llama 3.1 model. Go to the model's page on Hugging Face, verify your account, and accept the terms.
    • Generate an access token: To access the model, you need a Hugging Face token. In your Hugging Face account, go to Your Profile > Settings > Access Tokens, create a new token with at least Read permissions, and copy it to your clipboard.

GKE Gateway controller requirements

  • GKE version 1.32.3 or later.
  • Google Cloud CLI version 407.0.0 or later.
  • Gateway API is supported on VPC-native clusters only.
  • You must enable a proxy-only subnet.
  • Your cluster must have the HttpLoadBalancing add-on enabled.
  • If you are using Istio, you must upgrade Istio to one of the following versions:
    • 1.15.2 or later
    • 1.14.5 or later
    • 1.13.9 or later
  • If you are using Shared VPC, then in the host project, you need to assign the Compute Network User role to the GKE Service account for the service project.

Restrictions and limitations

The following restrictions and limitations apply:

  • Multi-cluster Gateways are not supported.
  • GKE Inference Gateway is only supported on the gke-l7-regional-external-managed and gke-l7-rilb GatewayClass resources.
  • Cross-region internal Application Load Balancers are not supported.

Configure GKE Inference Gateway

To configure GKE Inference Gateway, consider this example. A team runs vLLM and Llama3 models and actively experiments with two distinct LoRA fine-tuned adapters: "food-review" and "cad-fabricator".

The high-level workflow for configuring GKE Inference Gateway is as follows:

  1. Prepare your environment: set up the necessary infrastructure and components.
  2. Create an inference pool: define a pool of model servers using the InferencePool Custom Resource.
  3. Specify inference objectives: specify inference objectives using the InferenceObjective Custom Resource.
  4. Create the Gateway: expose the inference service using the Gateway API.
  5. Create the HTTPRoute: define how HTTP traffic is routed to the inference service.
  6. Send inference requests: make requests to the deployed model.

Prepare your environment

  1. Install Helm.

  2. Create a GKE cluster:

    • Create a GKE Autopilot or Standard cluster with version 1.32.3 or later. For a one-click deployment reference setup, see the cluster-toolkit gke-a3-highgpu sample.
    • Configure the nodes with your preferred compute family and accelerator.
    • Use GKE Inference Quickstart for pre-configured and tested deployment manifests, based on your selected accelerator, model, and performance needs.
  3. Install the needed Custom Resource Definitions (CRDs) in your GKE cluster:

    • For GKE versions 1.34.0-gke.1626000 or later, install only the alpha InferenceObjective CRD:

      kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml
    • For GKE versions earlier than 1.34.0-gke.1626000, install both the v1 InferencePool and alpha InferenceObjective CRDs:

      kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml

      For more information, see the compatibility matrix.

  4. If you are using a GKE version earlier than v1.32.2-gke.1182001 and you want to use Model Armor with GKE Inference Gateway, you must install the traffic and routing extension CRDs:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcptrafficextensions.yaml
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcproutingextensions.yaml

Create a model server and model deployment

This section shows how to deploy a model server and model. The example uses a vLLM model server with a Llama3 model. The deployment is labeled as app:vllm-llama3-8b-instruct. This deployment also uses two LoRA adapters named food-review and cad-fabricator from Hugging Face.

You can adapt this example with your own model server container and model, serving port, and deployment name. You can also configure LoRA adapters in the deployment, or deploy the base model. The following steps describe how to create the necessary Kubernetes resources.

  1. Create a Kubernetes Secret to store your Hugging Face token. This token is used to access the base model and the LoRA adapters:

    kubectl create secret generic hf-token --from-literal=token=HF_TOKEN

    Replace HF_TOKEN with your Hugging Face token.

  2. Deploy the model server and model. The following command applies a manifest that defines a Kubernetes Deployment for a vLLM model server with a Llama3 model:

    kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/release-1.0/config/manifests/vllm/gpu-deployment.yaml
    Note: If you are using Autopilot, you must manually add a nodeSelector to the default gpu-deployment.yaml manifest to schedule GPU workloads. For more information, see Request GPUs in your containers.

Create an inference pool

The InferencePool Kubernetes custom resource defines a group of Pods with a common base large language model (LLM) and compute configuration. The selector field specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods. The targetPort field defines the ports that the model server uses within the Pods. The extensionRef field references an extension service that provides additional capability for the inference pool. The InferencePool enables GKE Inference Gateway to route traffic to your model server Pods.
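For illustration only, the fields described above map onto a manifest sketch like the following. The apiVersion, port number, and names here are assumptions based on the v1alpha2 schema, and in this tutorial the Helm chart creates the InferencePool for you, so you don't apply a manifest like this directly:

```yaml
# Sketch of an InferencePool; values are illustrative, not from this tutorial.
apiVersion: inference.networking.x-k8s.io/v1alpha2  # assumed alpha API version
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  selector:
    app: vllm-llama3-8b-instruct   # must exactly match the model server Pod labels
  targetPortNumber: 8000           # assumed port the model server listens on
  extensionRef:
    name: vllm-llama3-8b-instruct-epp   # endpoint-picker extension service
```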

Before you create the InferencePool, ensure that the Pods that the InferencePool selects are already running.

To create an InferencePool using Helm, perform the following steps:

helm install vllm-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
  --set provider.name=gke \
  --set inferenceExtension.monitoring.gke.enabled=true \
  --version v1.0.1 \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

Change the following field to match your Deployment:

  • inferencePool.modelServers.matchLabels.app: the key of the label used to select your model server Pods.

For monitoring, metrics scraping for Google Cloud Managed Service for Prometheus is enabled by default.

  • To disable this feature, add the --set inferenceExtension.monitoring.gke.enabled=false flag to the command.
  • If you use the default monitoring on a GKE Autopilot cluster, you must also add the --set provider.gke.autopilot=true flag.

The Helm install automatically installs the necessary timeout policy, the endpoint picker, and the Pods needed for observability.

This creates an InferencePool object named vllm-llama3-8b-instruct, referencing the model endpoint services within the Pods. It also creates a deployment of the Endpoint Picker named app:vllm-llama3-8b-instruct-epp for this created InferencePool.

Specify inference objectives

The InferenceObjective custom resource lets you specify the priority of requests.

The metadata.name field of the InferenceObjective resource specifies the name of the Inference Objective, the priority field specifies its serving criticality, and the poolRef field specifies the InferencePool on which the model is served.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: NAME
spec:
  priority: VALUE
  poolRef:
    name: INFERENCE_POOL_NAME
    group: "inference.networking.k8s.io"

Replace the following:

  • NAME: the name of your Inference Objective. For example, food-review.
  • VALUE: the priority for the Inference Objective. This is an integer where a higher value indicates a more critical request. For example, 10.
  • INFERENCE_POOL_NAME: the name of the InferencePool you created in the previous step. For example, vllm-llama3-8b-instruct.

To create an InferenceObjective, perform the following steps:

  1. Save the following manifest as inference-objectives.yaml. This manifest creates two InferenceObjective resources. The first configures the food-review Inference Objective on the vllm-llama3-8b-instruct InferencePool with a priority of 10. The second configures the llama3-base-model Inference Objective to be served with a higher priority of 20.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceObjective
    metadata:
      name: food-review
    spec:
      priority: 10
      poolRef:
        name: vllm-llama3-8b-instruct
        group: "inference.networking.k8s.io"
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceObjective
    metadata:
      name: llama3-base-model
    spec:
      priority: 20 # Higher priority
      poolRef:
        name: vllm-llama3-8b-instruct
  2. Apply the sample manifest to your cluster:

    kubectl apply -f inference-objectives.yaml

Create the Gateway

The Gateway resource is the entry point for external traffic into your Kubernetes cluster. It defines the listeners that accept incoming connections.

The GKE Inference Gateway works with the following Gateway Classes:

  • gke-l7-rilb: for regional internal Application Load Balancers.
  • gke-l7-regional-external-managed: for regional external Application Load Balancers.

For more information, see the GatewayClasses documentation.

To create a Gateway, perform the following steps:

  1. Save the following sample manifest as gateway.yaml:

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: GATEWAY_NAME
    spec:
      gatewayClassName: GATEWAY_CLASS
      listeners:
        - protocol: HTTP
          port: 80
          name: http

    Replace the following:

    • GATEWAY_NAME: a unique name for your Gateway resource. For example, inference-gateway.
    • GATEWAY_CLASS: the Gateway Class you want to use. For example, gke-l7-regional-external-managed.
  2. Apply the manifest to your cluster:

    kubectl apply -f gateway.yaml

Note: For more information about configuring TLS to secure your Gateway with HTTPS, see the GKE documentation on TLS configuration.

Create the HTTPRoute

The HTTPRoute resource defines how the GKE Gateway routes incoming HTTP requests to backend services, such as your InferencePool. The HTTPRoute resource specifies matching rules (for example, headers or paths) and the backend to which traffic should be forwarded.

  1. To create an HTTPRoute, save the following sample manifest as httproute.yaml:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: HTTPROUTE_NAME
    spec:
      parentRefs:
        - name: GATEWAY_NAME
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: PATH_PREFIX
          backendRefs:
            - name: INFERENCE_POOL_NAME
              group: "inference.networking.k8s.io"
              kind: InferencePool

    Replace the following:

    • HTTPROUTE_NAME: a unique name for your HTTPRoute resource. For example, my-route.
    • GATEWAY_NAME: the name of the Gateway resource that you created. For example, inference-gateway.
    • PATH_PREFIX: the path prefix that you use to match incoming requests. For example, / to match all.
    • INFERENCE_POOL_NAME: the name of the InferencePool resource that you want to route traffic to. For example, vllm-llama3-8b-instruct.
  2. Apply the manifest to your cluster:

    kubectl apply -f httproute.yaml

Send inference requests

After you have configured GKE Inference Gateway, you can send inference requests to your deployed model. This lets you generate text based on your input prompt and specified parameters.

To send inference requests, perform the following steps:

  1. Set the following environment variables:

    export GATEWAY_NAME=GATEWAY_NAME
    export PORT_NUMBER=PORT_NUMBER # Use 80 for HTTP

    Replace the following:

    • GATEWAY_NAME: the name of your Gateway resource.
    • PORT_NUMBER: the port number you configured in the Gateway.
  2. To get the Gateway endpoint, run the following command:

    echo "Waiting for the Gateway IP address..."
    IP=""
    while [ -z "$IP" ]; do
      IP=$(kubectl get gateway/${GATEWAY_NAME} -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
      if [ -z "$IP" ]; then
        echo "Gateway IP not found, waiting 5 seconds..."
        sleep 5
      fi
    done
    echo "Gateway IP address is: $IP"
    PORT=${PORT_NUMBER}
  3. To send a request to the /v1/completions endpoint using curl, run the following command:

    curl -i -X POST ${IP}:${PORT}/v1/completions \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
      -d '{
        "model": "MODEL_NAME",
        "prompt": "PROMPT_TEXT",
        "max_tokens": MAX_TOKENS,
        "temperature": TEMPERATURE
    }'

    Replace the following:

    • MODEL_NAME: the name of the model or LoRA adapter to use.
    • PROMPT_TEXT: the input prompt for the model.
    • MAX_TOKENS: the maximum number of tokens to generate in the response.
    • TEMPERATURE: controls the randomness of the output. Use the value 0 for deterministic output, or a higher number for more creative output.

The following example shows you how to send a sample request to GKE Inference Gateway:

curl -i -X POST ${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -d '{
    "model": "food-review-1",
    "prompt": "What is the best pizza in the world?",
    "max_tokens": 2048,
    "temperature": 0
}'

Be aware of the following behaviors:

  • Request body: the request body can include additional parameters like stop and top_p. Refer to the OpenAI API specification for a complete list of options.
  • Error handling: implement proper error handling in your client code to handle potential errors in the response. For example, check the HTTP status code in the curl response. A non-200 status code generally indicates an error.
  • Authentication and authorization: for production deployments, secure your API endpoint with authentication and authorization mechanisms. Include the appropriate headers (for example, Authorization) in your requests.
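The request-body and error-handling points above can be sketched in a small client helper. This is an illustrative example rather than part of the tutorial: the endpoint URL and model name are placeholders, optional parameters such as stop and top_p are passed through into the request body unchanged, and non-200 responses are returned to the caller instead of raising:

```python
import json
import urllib.error
import urllib.request


def build_completion_request(base_url, model, prompt,
                             max_tokens=256, temperature=0.0, **extra):
    """Build an OpenAI-style /v1/completions request.

    Extra keyword arguments (for example top_p or stop) are passed
    through into the JSON request body.
    """
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": temperature}
    payload.update(extra)
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request):
    """Send the request and return (status_code, parsed_or_raw_body).

    A non-200 status is returned rather than raised, so the caller can
    log the error body or retry as appropriate.
    """
    try:
        with urllib.request.urlopen(request, timeout=60) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")
```

In a real deployment you would also attach an Authorization header (as in the curl examples above) and handle network-level failures such as urllib.error.URLError.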

What's next


Last updated 2025-11-24 UTC.