Serve an LLM using TPUs on GKE with JetStream and PyTorch
This guide shows you how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with JetStream through PyTorch. In this guide, you download model weights to Cloud Storage and deploy them on a GKE Autopilot or Standard cluster using a container that runs JetStream.
If you need the scalability, resilience, and cost-effectiveness offered by Kubernetes features when deploying your model on JetStream, this guide is a good starting point.
This guide is intended for Generative AI customers who use PyTorch, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving LLMs.
Background
By serving an LLM using TPUs on GKE with JetStream, you can build a robust, production-ready serving solution with all the benefits of managed Kubernetes, including cost-efficiency, scalability, and higher availability. This section describes the key technologies used in this tutorial.
About TPUs
TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning and AI models built using frameworks such as TensorFlow, PyTorch, and JAX.
Before you use TPUs in GKE, we recommend that you complete the following learning path:
- Learn about current TPU version availability with the Cloud TPU system architecture.
- Learn about TPUs in GKE.
This tutorial covers serving various LLM models. GKE deploys the model on single-host TPU v5e nodes with TPU topologies configured based on the model requirements for serving prompts with low latency.
About JetStream
JetStream is an open source inference serving framework developed by Google. JetStream enables high-performance, high-throughput, and memory-optimized inference on TPUs and GPUs. JetStream provides advanced performance optimizations, including continuous batching, KV cache optimizations, and quantization techniques, to facilitate LLM deployment. JetStream enables PyTorch/XLA and JAX TPU serving to achieve optimal performance.
Continuous batching
Continuous batching is a technique that dynamically groups incoming inference requests into batches, reducing latency and increasing throughput.
KV cache quantization
KV cache quantization involves compressing the key-value cache used in attention mechanisms, reducing memory requirements.
Int8 weight quantization
Int8 weight quantization reduces the precision of model weights from 32-bit floating point to 8-bit integers, leading to faster computation and reduced memory usage.
To learn more about these optimizations, refer to the JetStream PyTorch and JetStream MaxText project repositories.
About PyTorch
PyTorch is an open source machine learning framework developed by Meta and now part of the Linux Foundation umbrella. PyTorch provides high-level features such as tensor computation and deep neural networks.
Objectives
- Prepare a GKE Autopilot or Standard cluster with the recommended TPU topology based on the model characteristics.
- Deploy JetStream components on GKE.
- Get and publish your model.
- Serve and interact with the published model.
Architecture
This section describes the GKE architecture used in this tutorial. The architecture includes a GKE Autopilot or Standard cluster that provisions TPUs and hosts JetStream components to deploy and serve the models.
The following diagram shows you the components of this architecture:
This architecture includes the following components:
- A GKE Autopilot or Standard regional cluster.
- Two single-host TPU slice node pools that host the JetStream deployment.
- The Service component spreads inbound traffic to all JetStream HTTP replicas.
- JetStream HTTP is an HTTP server that accepts requests as a wrapper to JetStream's required format and sends them to JetStream's gRPC client.
- JetStream-PyTorch is a JetStream server that performs inferencing with continuous batching.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the required API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin.
Check for the roles
- In the Google Cloud console, go to the IAM page.
  Go to IAM
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
  Go to IAM
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- Click Select a role, then search for the role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Check that you have sufficient quota for eight TPU v5e PodSlice Lite chips. In this tutorial, you use on-demand instances.
- Create a Hugging Face token, if you don't already have one.
Get access to the model
Get access to the models on Hugging Face that you want to deploy to GKE.
Gemma 7B-it
To get access to the Gemma model for deployment to GKE, you must first sign the license consent agreement.
- Access the Gemma model consent page on Hugging Face.
- Log in to Hugging Face if you haven't done so already.
- Review and accept the model Terms and Conditions.
Llama 3 8B
To get access to the Llama 3 model for deployment to GKE, you must first sign the license consent agreement.
- Access the Llama 3 model consent page on Hugging Face.
- Log in to Hugging Face if you haven't done so already.
- Review and accept the model Terms and Conditions.
Prepare the environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you'll need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export CLUSTER_NAME=CLUSTER_NAME
    export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
    export NODE_LOCATION=NODE_LOCATION
    export CLUSTER_VERSION=CLUSTER_VERSION
    export BUCKET_NAME=BUCKET_NAME

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. The region must contain zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4). For Autopilot clusters, ensure that you have sufficient TPU v5e zonal resources for your region of choice.
- NODE_LOCATION (Standard clusters only): the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify this value.
- CLUSTER_VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. For a list of minimum GKE versions available by TPU machine type, see TPU availability in GKE.
- BUCKET_NAME: the name of your Cloud Storage bucket, used to store the JAX compilation cache.
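For example, a filled-in configuration might look like the following. The project, cluster, bucket names, and version shown here are placeholders chosen for illustration only; use your own values and a cluster version that supports your target TPU machine type.

    gcloud config set project my-llm-project
    gcloud config set billing/quota_project my-llm-project
    export PROJECT_ID=$(gcloud config get project)
    export CLUSTER_NAME=jetstream-cluster
    export CONTROL_PLANE_LOCATION=us-west4
    export NODE_LOCATION=us-west4-a
    export CLUSTER_VERSION=1.30
    export BUCKET_NAME=my-llm-project-jax-cache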
Create and configure Google Cloud resources
Follow these instructions to create the required resources.
Note: You may need to create a capacity reservation for usage of some accelerators. To learn how to reserve and consume reserved resources, see Consuming reserved zonal resources.
Create a GKE cluster
You can serve Gemma on TPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
Create an Autopilot GKE cluster:
    gcloud container clusters create-auto CLUSTER_NAME \
        --project=PROJECT_ID \
        --location=CONTROL_PLANE_LOCATION \
        --cluster-version=CLUSTER_VERSION

Replace CLUSTER_VERSION with the appropriate cluster version. For an Autopilot GKE cluster, use a regular release channel version.
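Cluster creation can take several minutes. Optionally, to check whether the cluster is ready, you can run the following standard gcloud command (an extra check that's not part of the original steps):

    gcloud container clusters describe CLUSTER_NAME \
        --location=CONTROL_PLANE_LOCATION \
        --format="value(status)"

The command prints RUNNING when the cluster is ready.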
Standard
Create a regional GKE Standard cluster that uses Workload Identity Federation for GKE:
    gcloud container clusters create CLUSTER_NAME \
        --enable-ip-alias \
        --machine-type=e2-standard-4 \
        --num-nodes=2 \
        --cluster-version=CLUSTER_VERSION \
        --workload-pool=PROJECT_ID.svc.id.goog \
        --location=CONTROL_PLANE_LOCATION

The cluster creation might take several minutes.
Replace CLUSTER_VERSION with the appropriate cluster version.

Create a TPU v5e node pool with a 2x4 topology and two nodes:

    gcloud container node-pools create tpu-nodepool \
        --cluster=CLUSTER_NAME \
        --machine-type=ct5lp-hightpu-8t \
        --project=PROJECT_ID \
        --num-nodes=2 \
        --location=CONTROL_PLANE_LOCATION \
        --node-locations=NODE_LOCATION
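Optionally, you can confirm that the node pool was created with the expected machine type and node count. This verification step is an addition to the original instructions:

    gcloud container node-pools list \
        --cluster=CLUSTER_NAME \
        --location=CONTROL_PLANE_LOCATION

The output should list tpu-nodepool with the ct5lp-hightpu-8t machine type.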
Generate your Hugging Face CLI token in Cloud Shell
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Click New Token.
- Specify a Name of your choice and a Role of at least Read.
- Click Generate a token.
- Edit the permissions of your access token to grant read access to your model's Hugging Face repository.
- Copy the generated token to your clipboard.
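Optionally, to confirm that the token is valid before you store it in your cluster, you can call the Hugging Face whoami endpoint from Cloud Shell. This check is an optional addition to the original steps:

    curl -s -H "Authorization: Bearer HUGGINGFACE_TOKEN" \
        https://huggingface.co/api/whoami-v2

Replace HUGGINGFACE_TOKEN with the token you copied. A valid token returns a JSON document that includes your Hugging Face account name.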
Create a Kubernetes Secret for Hugging Face credentials
In Cloud Shell, do the following:
Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=CONTROL_PLANE_LOCATION

Create a Secret to store the Hugging Face credentials:

    kubectl create secret generic huggingface-secret \
        --from-literal=HUGGINGFACE_TOKEN=HUGGINGFACE_TOKEN

Replace HUGGINGFACE_TOKEN with your Hugging Face token.
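To optionally verify that the Secret exists and contains the expected key without printing its value, you can run the following standard kubectl command (an extra check that's not part of the original steps):

    kubectl describe secret huggingface-secret

The Data section of the output lists HUGGINGFACE_TOKEN along with its size in bytes.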
Configure your workload's access using Workload Identity Federation for GKE
Assign a Kubernetes ServiceAccount to the application and configure that Kubernetes ServiceAccount to act as an IAM service account.
Create an IAM service account for your application:

    gcloud iam service-accounts create wi-jetstream

Add an IAM policy binding for your IAM service account to manage Cloud Storage:

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member "serviceAccount:wi-jetstream@PROJECT_ID.iam.gserviceaccount.com" \
        --role roles/storage.objectUser

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member "serviceAccount:wi-jetstream@PROJECT_ID.iam.gserviceaccount.com" \
        --role roles/storage.insightsCollectorService

Allow the Kubernetes ServiceAccount to impersonate the IAM service account by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM service account:

    gcloud iam service-accounts add-iam-policy-binding wi-jetstream@PROJECT_ID.iam.gserviceaccount.com \
        --role roles/iam.workloadIdentityUser \
        --member "serviceAccount:PROJECT_ID.svc.id.goog[default/default]"

Annotate the Kubernetes service account with the email address of the IAM service account:

    kubectl annotate serviceaccount default \
        iam.gke.io/gcp-service-account=wi-jetstream@PROJECT_ID.iam.gserviceaccount.com
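To optionally confirm that the annotation was applied to the default Kubernetes ServiceAccount (an extra check that's not part of the original steps):

    kubectl describe serviceaccount default

The Annotations field should show iam.gke.io/gcp-service-account pointing to the wi-jetstream IAM service account.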
Deploy JetStream
Deploy the JetStream container to serve your model. This tutorial uses Kubernetes Deployment manifests. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Save the following manifest as jetstream-pytorch-deployment.yaml:
Gemma 7B-it
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: jetstream-pytorch-server
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: jetstream-pytorch-server
      template:
        metadata:
          labels:
            app: jetstream-pytorch-server
        spec:
          nodeSelector:
            cloud.google.com/gke-tpu-topology: 2x4
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          containers:
          - name: jetstream-pytorch-server
            image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-pytorch-server:v0.2.4
            args:
            - --model_id=google/gemma-7b-it
            - --override_batch_size=30
            - --enable_model_warmup=True
            volumeMounts:
            - name: huggingface-credentials
              mountPath: /huggingface
              readOnly: true
            ports:
            - containerPort: 9000
            resources:
              requests:
                google.com/tpu: 8
              limits:
                google.com/tpu: 8
            startupProbe:
              httpGet:
                path: /healthcheck
                port: 8000
                scheme: HTTP
              periodSeconds: 60
              initialDelaySeconds: 90
              failureThreshold: 50
            livenessProbe:
              httpGet:
                path: /healthcheck
                port: 8000
                scheme: HTTP
              periodSeconds: 60
              failureThreshold: 30
            readinessProbe:
              httpGet:
                path: /healthcheck
                port: 8000
                scheme: HTTP
              periodSeconds: 60
              failureThreshold: 30
          - name: jetstream-http
            image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.3
            ports:
            - containerPort: 8000
          volumes:
          - name: huggingface-credentials
            secret:
              defaultMode: 0400
              secretName: huggingface-secret
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: jetstream-svc
    spec:
      selector:
        app: jetstream-pytorch-server
      ports:
      - protocol: TCP
        name: jetstream-http
        port: 8000
        targetPort: 8000

Llama 3 8B
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: jetstream-pytorch-server
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: jetstream-pytorch-server
      template:
        metadata:
          labels:
            app: jetstream-pytorch-server
        spec:
          nodeSelector:
            cloud.google.com/gke-tpu-topology: 2x4
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          containers:
          - name: jetstream-pytorch-server
            image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-pytorch-server:v0.2.4
            args:
            - --model_id=meta-llama/Meta-Llama-3-8B
            - --override_batch_size=30
            - --enable_model_warmup=True
            volumeMounts:
            - name: huggingface-credentials
              mountPath: /huggingface
              readOnly: true
            ports:
            - containerPort: 9000
            resources:
              requests:
                google.com/tpu: 8
              limits:
                google.com/tpu: 8
            startupProbe:
              httpGet:
                path: /healthcheck
                port: 8000
                scheme: HTTP
              periodSeconds: 60
              initialDelaySeconds: 90
              failureThreshold: 50
            livenessProbe:
              httpGet:
                path: /healthcheck
                port: 8000
                scheme: HTTP
              periodSeconds: 60
              failureThreshold: 30
            readinessProbe:
              httpGet:
                path: /healthcheck
                port: 8000
                scheme: HTTP
              periodSeconds: 60
              failureThreshold: 30
          - name: jetstream-http
            image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.3
            ports:
            - containerPort: 8000
          volumes:
          - name: huggingface-credentials
            secret:
              defaultMode: 0400
              secretName: huggingface-secret
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: jetstream-svc
    spec:
      selector:
        app: jetstream-pytorch-server
      ports:
      - protocol: TCP
        name: jetstream-http
        port: 8000
        targetPort: 8000

The manifest sets the following key properties:
- model_id: the model name from Hugging Face (google/gemma-7b-it or meta-llama/Meta-Llama-3-8B). See the supported models.
- override_batch_size: the decoding batch size per device, where one TPU chip equals one device. This value defaults to 30.
- enable_model_warmup: this setting enables model warmup after the model server has started. This value defaults to False.
You can optionally set these properties:
- max_input_length: the maximum input sequence length. This value defaults to 1024.
- max_output_length: the maximum output decode length. This value defaults to 1024.
- quantize_weights: whether the checkpoint is quantized. This value defaults to 0; set it to 1 to enable int8 quantization.
- internal_jax_compilation_cache: the directory for the JAX compilation cache. This value defaults to ~/jax_cache; set it to gs://BUCKET_NAME/jax_cache for remote caching.
In the manifest, a startup probe is configured to ensure that the model server is labeled Ready after the model has been loaded and warmup has completed. Liveness and readiness probes are configured to ensure the health of the model server.
Apply the manifest:
    kubectl apply -f jetstream-pytorch-deployment.yaml

Verify the Deployment:
    kubectl get deployment

The output is similar to the following:

    NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
    jetstream-pytorch-server   0/2     2            0           ##s

For Autopilot clusters, it may take a few minutes to provision the required TPU resources.
View the JetStream-PyTorch server logs to check that the model weights have been loaded and model warmup has completed. It might take the server a few minutes to complete this operation.
    kubectl logs deploy/jetstream-pytorch-server -f -c jetstream-pytorch-server

The output is similar to the following:

    Started jetstream_server....
    2024-04-12 04:33:37,128 - root - INFO - ---------Generate params 0 loaded.---------

Verify the Deployment is ready:
    kubectl get deployment

The output is similar to the following:

    NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
    jetstream-pytorch-server   2/2     2            2           ##s

It might take several minutes for the healthcheck endpoint to register.
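While you wait, you can optionally watch the Pods become ready. This is an extra check that's not part of the original steps; the label selector matches the app label set in the manifest:

    kubectl get pods -l app=jetstream-pytorch-server --watch

Each Pod runs two containers, so it is ready when its READY column shows 2/2. Press Ctrl+C to stop watching.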
Serve the model
In this section, you interact with the model.
Set up port forwarding
You can access the JetStream Deployment through the ClusterIP Service that you created in the preceding step. ClusterIP Services are only reachable from within the cluster. Therefore, to access the Service from outside the cluster, complete the following steps:
To establish a port forwarding session, run the following command:
    kubectl port-forward svc/jetstream-svc 8000:8000

Interact with the model using curl
Verify that you can access the JetStream HTTP server by opening a new terminal and running the following command:
    curl --request POST \
        --header "Content-type: application/json" \
        -s \
        localhost:8000/generate \
        --data \
        '{
            "prompt": "What are the top 5 programming languages",
            "max_tokens": 200
        }'

The initial request can take several seconds to complete due to model warmup. The output is similar to the following:
{ "response": " for data science in 2023?\n\n**1. Python:**\n- Widely used for data science due to its readability, extensive libraries (pandas, scikit-learn), and integration with other tools.\n- High demand for Python programmers in data science roles.\n\n**2. R:**\n- Popular choice for data analysis and visualization, particularly in academia and research.\n- Extensive libraries for statistical modeling and data wrangling.\n\n**3. Java:**\n- Enterprise-grade platform for data science, with strong performance and scalability.\n- Widely used in data mining and big data analytics.\n\n**4. SQL:**\n- Essential for data querying and manipulation, especially in relational databases.\n- Used for data analysis and visualization in various industries.\n\n**5. Scala:**\n- Scalable and efficient for big data processing and machine learning models.\n- Popular in data science for its parallelism and integration with Spark and Spark MLlib."}
You've successfully done the following:
- Deployed the JetStream-PyTorch model server on GKE using TPUs.
- Served and interacted with the model.
Observe model performance
To observe the model performance, you can use the JetStream dashboard integration in Cloud Monitoring. With this dashboard, you can view critical performance metrics like token throughput, request latency, and error rates.
To use the JetStream dashboard, you must enable Google Cloud Managed Service for Prometheus, which collects the metrics from JetStream, in your GKE cluster.
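Managed Service for Prometheus is enabled by default on Autopilot clusters and on recent Standard cluster versions. If it's disabled on your Standard cluster, you can enable it with a command like the following (an optional sketch, not part of the original steps):

    gcloud container clusters update CLUSTER_NAME \
        --location=CONTROL_PLANE_LOCATION \
        --enable-managed-prometheus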
You can then view the metrics by using the JetStream dashboard. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the JetStream observability guidance in the Cloud Monitoring documentation.
Troubleshoot issues
- If you get the message Empty reply from server, it's possible that the container has not finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
- If you see Connection refused, verify that your port forwarding is active.
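As an additional check that isn't part of the original steps, you can confirm that the Service exists and has endpoints, and restart the port-forwarding session if necessary:

    kubectl get svc jetstream-svc
    kubectl get endpoints jetstream-svc
    kubectl port-forward svc/jetstream-svc 8000:8000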
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands and follow the prompts:

    gcloud container clusters delete CLUSTER_NAME --location=CONTROL_PLANE_LOCATION
    gcloud iam service-accounts delete wi-jetstream@PROJECT_ID.iam.gserviceaccount.com

What's next
- Discover how you can run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Learn more about TPUs in GKE.
- Explore the JetStream GitHub repository.
- Explore the Vertex AI Model Garden.