Serve an LLM using TPU Trillium on GKE with vLLM
This tutorial shows you how to serve large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the vLLM serving framework. In this tutorial, you serve Llama 3.1 70b, use TPU Trillium, and set up horizontal Pod autoscaling using vLLM server metrics.
This document is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when you deploy and serve your AI/ML workloads.
Background
By using TPU Trillium on GKE, you can implement a robust, production-ready serving solution with all the benefits of managed Kubernetes, including efficient scalability and higher availability. This section describes the key technologies used in this guide.
TPU Trillium
TPUs are Google's custom-developed application-specific integrated circuits (ASICs). TPUs are used to accelerate machine learning and AI models built using frameworks such as TensorFlow, PyTorch, and JAX. This tutorial uses TPU Trillium, which is Google's sixth generation TPU.
Before you use TPUs in GKE, we recommend that you complete the following learning path:
- Learn about TPU Trillium system architecture.
- Learn about TPUs in GKE.
vLLM
vLLM is a highly optimized, open-source framework for serving LLMs. vLLM can increase serving throughput on TPUs, with features such as the following:
- Optimized transformer implementation with PagedAttention.
- Continuous batching to improve the overall serving throughput.
- Tensor parallelism and distributed serving on multiple TPUs.
To learn more, refer to the vLLM documentation.
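For orientation, these capabilities map onto flags of vLLM's OpenAI-compatible server, which is the same entry point that the Deployment manifest later in this tutorial uses. The following stand-alone sketch assumes a host that already has vLLM with TPU support installed; it shows how tensor parallelism across eight TPU chips is requested, while PagedAttention and continuous batching are part of the engine and need no extra flags:

# Minimal sketch: launch vLLM's OpenAI-compatible server with tensor parallelism
# across 8 TPU chips. Assumes vLLM with TPU support is already installed locally.
python3 -m vllm.entrypoints.openai.api_server \
  --host=0.0.0.0 \
  --port=8000 \
  --model=meta-llama/Llama-3.1-70B \
  --tensor-parallel-size=8 \
  --max-model-len=4096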
Note: This tutorial focuses on deploying vLLM in a single-host configuration, which is ideal for models that can be served from a single TPU slice, such as Llama 3.1 70b on the ct6e-standard-8t machine type. It's important to note that multi-host configurations are not supported when using vLLM with TPUs on GKE. The lack of multi-host support limits the use of vLLM for serving extremely large models (for example, 400B+ parameters) that require the aggregated memory and compute of multiple hosts. For production systems or models that require a multi-host setup, the recommended and performance-optimized solution is to use JetStream, Google's engine for TPU inference. To get started with a multi-host deployment, see JetStream MaxText inference on v6e TPU.
Cloud Storage FUSE
Cloud Storage FUSE provides access from your GKE cluster to Cloud Storage for model weights that reside in object storage buckets. In this tutorial, the created Cloud Storage bucket will initially be empty. When vLLM starts up, GKE downloads the model from Hugging Face and caches the weights to the Cloud Storage bucket. On Pod restart, or deployment scale-up, subsequent model loads will download cached data from the Cloud Storage bucket, leveraging parallel downloads for optimal performance.
To learn more, refer to the Cloud Storage FUSE CSI driver documentation.
Objectives
This tutorial is intended for MLOps or DevOps engineers or platform administrators who want to use GKE orchestration capabilities for serving LLMs.
This tutorial covers the following steps:
- Create a GKE cluster with the recommended TPU Trillium topology based on the model characteristics.
- Deploy the vLLM framework on a node pool in your cluster.
- Use the vLLM framework to serve Llama 3.1 70b using a load balancer.
- Set up horizontal Pod autoscaling using vLLM server metrics.
- Serve the model.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the required API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Make sure that you have the following role or roles on the project:
roles/container.admin, roles/iam.serviceAccountAdmin, roles/iam.securityAdmin, roles/artifactregistry.writer, roles/container.clusterAdmin

Check for the roles
In the Google Cloud console, go to the IAM page.
Go to IAM
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
In the Google Cloud console, go to the IAM page.
Go to IAM
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for Cloud TPU in GKE.
Prepare the environment
In this section, you provision the resources that you need to deploy vLLM and the model.
Get access to the model
You must sign the consent agreement to use Llama 3.1 70b in the Hugging Face repository.
Generate an access token
If you don't already have one, generate a new Hugging Face token:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
Launch Cloud Shell
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID && \
gcloud config set billing/quota_project PROJECT_ID && \
export PROJECT_ID=$(gcloud config get project) && \
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)") && \
export CLUSTER_NAME=CLUSTER_NAME && \
export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION && \
export ZONE=ZONE && \
export HF_TOKEN=HUGGING_FACE_TOKEN && \
export CLUSTER_VERSION=CLUSTER_VERSION && \
export GSBUCKET=GSBUCKET && \
export KSA_NAME=KSA_NAME && \
export NAMESPACE=NAMESPACE

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. Provide a region that supports TPU Trillium (v6e).
- ZONE: a zone that supports TPU Trillium (v6e).
- CLUSTER_VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. TPU Trillium (v6e) is supported in GKE versions 1.31.2-gke.1115000 or later.
- GSBUCKET: the name of the Cloud Storage bucket to use for Cloud Storage FUSE.
- KSA_NAME: the name of the Kubernetes ServiceAccount that's used to access Cloud Storage buckets. Bucket access is needed for Cloud Storage FUSE to work.
- NAMESPACE: the Kubernetes namespace where you want to deploy the vLLM assets.
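For reference, a filled-in version of these variables might look like the following. Every value here is purely illustrative (hypothetical project, cluster, zone, bucket, and token names); substitute your own, and confirm that the zone you pick currently offers TPU Trillium (v6e) capacity.

# Illustrative values only — replace each one with your own settings.
gcloud config set project my-example-project
gcloud config set billing/quota_project my-example-project
export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu-demo
export CONTROL_PLANE_LOCATION=us-east5       # a region that offers TPU Trillium (v6e)
export ZONE=us-east5-b                       # verify current v6e availability in your zone
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx          # your Hugging Face access token
export CLUSTER_VERSION=1.31.2-gke.1115000    # minimum supported version; use a current release
export GSBUCKET=vllm-tpu-demo-models         # bucket names must be globally unique
export KSA_NAME=vllm-ksa
export NAMESPACE=vllm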
Create a GKE cluster
You can serve LLMs on TPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
Create a GKE Autopilot cluster:
Note: If your Cloud Shell instance disconnects during the tutorial, repeat the preceding step to reset the environment variables.

gcloud container clusters create-auto ${CLUSTER_NAME} \
  --cluster-version=${CLUSTER_VERSION} \
  --location=${CONTROL_PLANE_LOCATION}
Standard
Create a GKE Standard cluster:
gcloud container clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --location=${CONTROL_PLANE_LOCATION} \
  --node-locations=${ZONE} \
  --cluster-version=${CLUSTER_VERSION} \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --addons GcsFuseCsiDriver

Create a TPU slice node pool:

gcloud container node-pools create tpunodepool \
  --location=${CONTROL_PLANE_LOCATION} \
  --node-locations=${ZONE} \
  --num-nodes=1 \
  --machine-type=ct6e-standard-8t \
  --cluster=${CLUSTER_NAME} \
  --enable-autoscaling --total-min-nodes=1 --total-max-nodes=2

GKE creates the following resources for the LLM:
- A GKE Standard cluster that uses Workload Identity Federation for GKE and has the Cloud Storage FUSE CSI driver enabled.
- A TPU Trillium node pool with a ct6e-standard-8t machine type. This node pool has one node, eight TPU chips, and autoscaling enabled.
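If you created a Standard cluster, you can optionally confirm the node pool's machine type and autoscaling settings before continuing. This quick check isn't part of the original procedure:

# Confirm the TPU node pool uses ct6e-standard-8t and has autoscaling enabled.
gcloud container node-pools describe tpunodepool \
  --cluster=${CLUSTER_NAME} \
  --location=${CONTROL_PLANE_LOCATION} \
  --format="value(config.machineType, autoscaling.enabled)"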
Configure kubectl to communicate with your cluster
To configure kubectl to communicate with your cluster, run the following command:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CONTROL_PLANE_LOCATION}

Create a Kubernetes Secret for Hugging Face credentials
Create a namespace. You can skip this step if you are using the default namespace:

kubectl create namespace ${NAMESPACE}

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --namespace ${NAMESPACE}
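Optionally, verify that the Secret exists and exposes the expected key. The describe output lists key names and byte sizes without printing the token value:

# Confirm the Secret was created with the hf_api_token key.
kubectl describe secret hf-secret -n ${NAMESPACE}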
Create a Cloud Storage bucket
In Cloud Shell, run the following command:
gcloud storage buckets create gs://${GSBUCKET} \
  --uniform-bucket-level-access

This creates a Cloud Storage bucket to store the model files you download from Hugging Face.
Set up a Kubernetes ServiceAccount to access the bucket
Create the Kubernetes ServiceAccount:
kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}

Grant read-write access to the Kubernetes ServiceAccount so that it can access the Cloud Storage bucket:

gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
  --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
  --role "roles/storage.objectUser"

Alternatively, you can grant read-write access to all Cloud Storage buckets in the project:

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
  --role "roles/storage.objectUser"

GKE creates the following resources for the LLM:
- A Cloud Storage bucket to store the downloaded model and the compilation cache. A Cloud Storage FUSE CSI driver reads the content of the bucket.
- Volumes with file caching enabled and the parallel download feature of Cloud Storage FUSE.

Best practice: Use a file cache backed by tmpfs or Hyperdisk / Persistent Disk depending on the expected size of the model contents, for example, weight files. In this tutorial, you use a Cloud Storage FUSE file cache backed by RAM.
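Later, after the vLLM server has started for the first time and downloaded the model, you can optionally confirm that the weights and XLA compilation cache landed in the bucket. A simple check:

# List what vLLM wrote to the bucket (model weights under the download
# directory and the XLA compilation cache).
gcloud storage ls --recursive gs://${GSBUCKET}/ | head -n 20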
Deploy vLLM model server
To deploy the vLLM model server, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Inspect the following Deployment manifest, saved as vllm-llama3-70b.yaml, which uses a single replica:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-tpu
  template:
    metadata:
      labels:
        app: vllm-tpu
      annotations:
        gke-gcsfuse/volumes: "true"
        gke-gcsfuse/cpu-limit: "0"
        gke-gcsfuse/memory-limit: "0"
        gke-gcsfuse/ephemeral-storage-limit: "0"
    spec:
      serviceAccountName: KSA_NAME
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
      containers:
      - name: vllm-tpu
        image: vllm/vllm-tpu:latest
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --host=0.0.0.0
        - --port=8000
        - --tensor-parallel-size=8
        - --max-model-len=4096
        - --model=meta-llama/Llama-3.1-70B
        - --download-dir=/data
        - --max-num-batched-tokens=512
        - --max-num-seqs=128
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        - name: VLLM_XLA_CACHE_PATH
          value: "/data"
        - name: VLLM_USE_V1
          value: "1"
        ports:
        - containerPort: 8000
        resources:
          limits:
            google.com/tpu: 8
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
        volumeMounts:
        - name: gcs-fuse-csi-ephemeral
          mountPath: /data
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: gke-gcsfuse-cache
        emptyDir:
          medium: Memory
      - name: dshm
        emptyDir:
          medium: Memory
      - name: gcs-fuse-csi-ephemeral
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: GSBUCKET
            mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-tpu
  type: LoadBalancer
  ports:
    - name: http
      protocol: TCP
      port: 8000
      targetPort: 8000

If you scale up the Deployment to multiple replicas, concurrent writes to the VLLM_XLA_CACHE_PATH will cause the error RuntimeError: filesystem error: cannot create directories. To prevent this error, you have two options:

- Remove the XLA cache location by removing the following block from the Deployment YAML. This means all replicas will recompile the cache.

      - name: VLLM_XLA_CACHE_PATH
        value: "/data"

- Scale the Deployment to 1, and wait for the first replica to become ready and write to the XLA cache. Then scale to additional replicas, as shown in the sketch after this list. This lets the remaining replicas read the cache without attempting to write it.
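The second option can be scripted. Here's a minimal sketch, assuming the Deployment name (vllm-tpu) and namespace used in this tutorial:

# Start with one replica so only a single Pod writes the XLA cache.
kubectl scale deployment/vllm-tpu --replicas=1 -n ${NAMESPACE}

# Wait until that replica is ready (compilation and the cache write are done).
kubectl rollout status deployment/vllm-tpu -n ${NAMESPACE} --timeout=30m

# Scale out; additional replicas read the cache instead of writing it.
kubectl scale deployment/vllm-tpu --replicas=2 -n ${NAMESPACE}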
Apply the manifest by running the following command:
kubectl apply -f vllm-llama3-70b.yaml -n ${NAMESPACE}

View the logs from the running model server:
kubectl logs -f -l app=vllm-tpu -n ${NAMESPACE}

The output should look similar to the following:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
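Downloading and compiling a 70B model can take a long time. If you'd rather block until the Pod passes its readiness probe than follow the logs, one hedged alternative is:

# Wait for the vLLM Pod to report Ready; adjust the timeout for your model size.
kubectl wait --for=condition=Ready pod -l app=vllm-tpu -n ${NAMESPACE} --timeout=60m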
Serve the model
To get the external IP address of the vLLM service, run the following command:

export vllm_service=$(kubectl get service vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' -n ${NAMESPACE})

Interact with the model using curl:

curl http://$vllm_service:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'

The output should be similar to the following:
{"id":"cmpl-6b4bb29482494ab88408d537da1e608f","object":"text_completion","created":1727822657,"model":"meta-llama/Llama-3-8B","choices":[{"index":0,"text":" top holiday destination featuring scenic beauty and","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
Set up the custom autoscaler
In this section, you set up horizontal Pod autoscaling using custom Prometheus metrics. You use the Google Cloud Managed Service for Prometheus metrics from the vLLM server.
To learn more, see Google Cloud Managed Service for Prometheus. This should be enabled by default on the GKE cluster.
Set up the Custom Metrics Stackdriver Adapter on your cluster:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

Add the Monitoring Viewer role to the service account that the Custom Metrics Stackdriver Adapter uses:

Note: Ensure that the service account that your GKE cluster uses has the Monitoring Metric Writer role. This tutorial uses the default Compute Engine service account.

gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
  --role roles/monitoring.viewer \
  --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter

Save the following manifest as vllm_pod_monitor.yaml:

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: vllm-pod-monitoring
spec:
  selector:
    matchLabels:
      app: vllm-tpu
  endpoints:
  - path: /metrics
    port: 8000
    interval: 15s

Apply it to the cluster:

kubectl apply -f vllm_pod_monitor.yaml -n ${NAMESPACE}
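Optionally, confirm that the PodMonitoring resource was accepted by the cluster before you generate load:

# Check that the PodMonitoring resource exists and inspect its status conditions.
kubectl describe podmonitoring vllm-pod-monitoring -n ${NAMESPACE}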
Create load on the vLLM endpoint
Create load on the vLLM server to test how GKE autoscales with a custom vLLM metric.
Run a bash script (load.sh) to send N parallel requests to the vLLM endpoint:

#!/bin/bash
N=PARALLEL_PROCESSES
export vllm_service=$(kubectl get service vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' -n ${NAMESPACE})
for i in $(seq 1 $N); do
  while true; do
    curl http://$vllm_service:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.1-70B", "prompt": "Write a story about san francisco", "max_tokens": 1000, "temperature": 0}'
  done &  # Run in the background
done
wait

Replace PARALLEL_PROCESSES with the number of parallel processes that you want to run.
Run the bash script:

chmod +x load.sh
nohup ./load.sh &
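While the script runs in the background, you can watch its output and the Pods it exercises (nohup writes to nohup.out by default):

# Follow the load generator's output.
tail -f nohup.out

# In another terminal, watch the vLLM Pods; the replica count changes after the HPA is deployed.
kubectl get pods -n ${NAMESPACE} -w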
Verify that Google Cloud Managed Service for Prometheus ingests the metrics
After Google Cloud Managed Service for Prometheus scrapes the metrics and you're adding load to the vLLM endpoint, you can view metrics on Cloud Monitoring.
In the Google Cloud console, go to the Metrics explorer page.
Click < > PromQL.
Enter the following query to observe traffic metrics:
vllm:num_requests_waiting{cluster='CLUSTER_NAME'}
A line graph shows your vLLM metric (num_requests_waiting) measured over time. The vLLM metric scales up from 0 (pre-load) to a value (post-load). This graph confirms your vLLM metrics are being ingested into Google Cloud Managed Service for Prometheus. The following example graph shows a starting pre-load value of 0, which reaches a maximum post-load value of close to 400 within one minute.

Deploy the Horizontal Pod Autoscaler configuration
When deciding which metric to autoscale on, we recommend the following metrics for vLLM TPU:
- num_requests_waiting: This metric relates to the number of requests waiting in the model server's queue. This number starts to noticeably grow when the kv cache is full.
- gpu_cache_usage_perc: This metric relates to the kv cache utilization, which directly correlates to the number of requests being processed for a given inference cycle on the model server. Note that this metric works the same on GPUs and TPUs, though it is tied to the GPU naming schema.

We recommend that you use num_requests_waiting when optimizing for throughput and cost, and when your latency targets are achievable with your model server's maximum throughput.
We recommend that you use gpu_cache_usage_perc when you have latency-sensitive workloads where queue-based scaling isn't fast enough to meet your requirements.
For further explanation, check out Best practices for autoscaling large language model (LLM) inference workloads with TPUs.
When selecting an averageValue target for your HPA config, you will have to determine this experimentally. Check out the Save on GPUs: Smarter autoscaling for your GKE inferencing workloads blog post for additional ideas on how to optimize this part. The profile-generator used in this blog post works for vLLM TPU as well.
In the following instructions, you deploy your HPA configuration by using the num_requests_waiting metric. For demonstration purposes, you set the metric to a low value so that the HPA configuration scales your vLLM replicas to two. To deploy the Horizontal Pod Autoscaler configuration using num_requests_waiting, follow these steps:
Save the following manifest as vllm-hpa.yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-tpu
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: prometheus.googleapis.com|vllm:num_requests_waiting|gauge
        target:
          type: AverageValue
          averageValue: 10

The vLLM metrics in Google Cloud Managed Service for Prometheus follow the vllm:metric_name format.

Best practice: Use num_requests_waiting for scaling throughput. Use gpu_cache_usage_perc for latency-sensitive TPU use cases.

Deploy the Horizontal Pod Autoscaler configuration:
kubectl apply -f vllm-hpa.yaml -n ${NAMESPACE}

GKE schedules another Pod to deploy, which triggers the node pool autoscaler to add a second node before it deploys the second vLLM replica.
Watch the progress of the Pod autoscaling:
kubectl get hpa --watch -n ${NAMESPACE}

The output is similar to the following:

NAME       REFERENCE             TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
vllm-hpa   Deployment/vllm-tpu   <unknown>/10   1         2         0          6s
vllm-hpa   Deployment/vllm-tpu   34972m/10      1         2         1          16s
vllm-hpa   Deployment/vllm-tpu   25112m/10      1         2         2          31s
vllm-hpa   Deployment/vllm-tpu   35301m/10      1         2         2          46s
vllm-hpa   Deployment/vllm-tpu   25098m/10      1         2         2          62s
vllm-hpa   Deployment/vllm-tpu   35348m/10      1         2         2          77s

Success: GKE scales up the vLLM server using custom Prometheus metrics.

Wait for 10 minutes and repeat the steps in the Verify that Google Cloud Managed Service for Prometheus ingests the metrics section. Google Cloud Managed Service for Prometheus ingests the metrics from both vLLM endpoints now.
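If you later decide to scale on KV cache utilization instead of queue depth, you can swap the HPA's metrics block. The following is an illustrative sketch only: it assumes that gpu_cache_usage_perc is exported under the same prometheus.googleapis.com|vllm:<metric>|gauge naming convention described earlier and is reported as a fraction between 0 and 1; verify both against your server's /metrics output and tune the target value experimentally.

# Illustrative only: switch the existing HPA to the KV cache utilization metric.
kubectl patch hpa vllm-hpa -n ${NAMESPACE} --type=merge -p '{
  "spec": {
    "metrics": [{
      "type": "Pods",
      "pods": {
        "metric": {"name": "prometheus.googleapis.com|vllm:gpu_cache_usage_perc|gauge"},
        "target": {"type": "AverageValue", "averageValue": "0.95"}
      }
    }]
  }
}'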
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands:
ps -ef | grep load.sh | awk '{print $2}' | xargs -n1 kill -9

gcloud container clusters delete ${CLUSTER_NAME} \
  --location=${CONTROL_PLANE_LOCATION}
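The commands above don't remove the Cloud Storage bucket that caches the model weights and XLA cache. If you no longer need that data, you can delete the bucket and its contents as well (this is irreversible):

# Delete the bucket and everything in it.
gcloud storage rm --recursive gs://${GSBUCKET}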
What's next

- Learn more about TPUs in GKE.
- Learn more about the available metrics to set up your Horizontal Pod Autoscaler.
- Explore the vLLM GitHub repository and documentation.