Serve Gemma using TPUs on GKE with JetStream

Note: vLLM is now the recommended solution for serving LLMs on TPUs in GKE. To get started, see Serve an LLM using TPU Trillium on GKE with vLLM.

This tutorial shows you how to serve a Gemma large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE). You deploy a pre-built container with JetStream and MaxText to GKE. You also configure GKE to load the Gemma 7B weights from Cloud Storage at runtime.

This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving LLMs. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, ensure that you're familiar with the following:

Background

This section describes the key technologies used in this tutorial.

Gemma

Gemma is a set of openly available, lightweight, generative artificial intelligence (AI) models released under an open license. These AI models are available to run in your applications, hardware, mobile devices, or hosted services. You can use the Gemma models for text generation, and you can also tune these models for specialized tasks.

To learn more, see the Gemma documentation.

TPUs

TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning and AI models built using frameworks such as TensorFlow, PyTorch, and JAX.

This tutorial covers serving the Gemma 7B model. GKE deploys the model on single-host TPU v5e nodes with TPU topologies configured based on the model requirements for serving prompts with low latency.

JetStream

JetStream is an open source inference serving framework developed by Google. JetStream enables high-performance, high-throughput, and memory-optimized inference on TPUs and GPUs. It provides advanced performance optimizations, including continuous batching and quantization techniques, to facilitate LLM deployment. JetStream enables PyTorch/XLA and JAX TPU serving to achieve optimal performance.

To learn more about these optimizations, refer to the JetStream PyTorch and JetStream MaxText project repositories.

MaxText

MaxText is a performant, scalable, and adaptable JAX LLM implementation, built on open source JAX libraries such as Flax, Orbax, and Optax. MaxText's decoder-only LLM implementation is written in Python. It leverages the XLA compiler heavily to achieve high performance without needing to build custom kernels.

To learn more about the latest models and parameter sizes that MaxText supports, see the MaxText project repository.

Objectives

  1. Prepare a GKE Autopilot or Standard cluster with the recommended TPU topology based on the model characteristics.
  2. Deploy JetStream components on GKE.
  3. Get and publish the Gemma 7B instruction tuned model.
  4. Serve and interact with the published model.

Architecture

This section describes the GKE architecture used in this tutorial. The architecture comprises a GKE Autopilot or Standard cluster that provisions TPUs and hosts JetStream components to deploy and serve the models.

The following diagram shows you the components of this architecture:

Figure: Architecture of a GKE cluster with single-host TPU node pools containing the Maxengine and Max HTTP components.

This architecture includes the following components:

  • A GKE Autopilot or Standard regional cluster.
  • Two single-host TPU slice node pools that host the JetStream deployment.
  • The Service component spreads inbound traffic to all JetStream HTTP replicas.
  • JetStream HTTP is an HTTP server that accepts requests, wraps them in JetStream's required format, and sends them to JetStream's gRPC client.
  • Maxengine is a JetStream server that performs inferencing with continuous batching.

Before you begin

  • Ensure that you have sufficient quota for eight TPU v5e PodSlice Lite chips. In this tutorial, you use on-demand instances.
  • Create a Kaggle account, if you don't already have one.

Get access to the model

To get access to the Gemma model for deployment to GKE, you must first sign the license consent agreement.

Sign the license consent agreement

You must sign the consent agreement to use Gemma. Follow these instructions:

  1. Access the Gemma model consent page on Kaggle.com.
  2. Log in to Kaggle if you haven't done so already.
  3. Click Request Access.
  4. In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for consent.
  5. Accept the model Terms and Conditions.

Generate an access token

To access the model through Kaggle, you need a Kaggle API token.

Follow these steps to generate a new token if you don't have one already:

  1. In your browser, go to Kaggle settings.
  2. Under the API section, click Create New Token.

A file named kaggle.json is downloaded.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you'll need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

  1. In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.

  2. Set the default environment variables:

    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export CLUSTER_NAME=CLUSTER_NAME
    export BUCKET_NAME=BUCKET_NAME
    export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
    export NODE_LOCATION=NODE_LOCATION
    export CLUSTER_VERSION=CLUSTER_VERSION

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • CLUSTER_NAME: the name of your GKE cluster.
    • BUCKET_NAME: the name of your Cloud Storage bucket. You don't need to specify the gs:// prefix.
    • CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. This region must contain zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4). For Autopilot clusters, ensure that you have sufficient TPU v5e zonal resources for your region of choice.
    • (Standard cluster only) NODE_LOCATION: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify this value.
    • CLUSTER_VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. For a list of minimum GKE versions available by TPU machine type, see TPU availability in GKE. You can also list the versions available in your region, as shown in the sketch after this list.
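
As an optional sanity check before you create the cluster, you can list the GKE versions currently available in your chosen region. This is a minimal sketch, assuming you have already exported CONTROL_PLANE_LOCATION:

# List the GKE versions that the region currently offers.
# The exact fields in the output vary by gcloud release.
gcloud container get-server-config \
    --region=${CONTROL_PLANE_LOCATION} \
    --format="yaml(validMasterVersions)"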

Create and configure Google Cloud resources

Follow these instructions to create the required resources.

Note: You might need to create a capacity reservation to use some accelerators. To learn how to reserve and consume reserved resources, see Consuming reserved zonal resources.

Create a GKE cluster

You can serve Gemma on TPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

In Cloud Shell, run the following command:

gcloud container clusters create-auto ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --location=${CONTROL_PLANE_LOCATION} \
    --cluster-version=${CLUSTER_VERSION}

Standard

  1. Create a regional GKE Standard cluster that uses Workload Identity Federation for GKE.

    gcloud container clusters create ${CLUSTER_NAME} \
        --enable-ip-alias \
        --machine-type=e2-standard-4 \
        --num-nodes=2 \
        --cluster-version=${CLUSTER_VERSION} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --location=${CONTROL_PLANE_LOCATION}

    The cluster creation might take several minutes.

  2. Run the following command to create a node pool for your cluster:

    gcloud container node-pools create gemma-7b-tpu-nodepool \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --project=${PROJECT_ID} \
        --num-nodes=2 \
        --location=${CONTROL_PLANE_LOCATION} \
        --node-locations=${NODE_LOCATION}

    GKE creates a TPU v5e node pool with a 2x4 topology and two nodes.
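
Optionally, you can confirm that the node pool was created with the expected machine type and node count. This is a quick sanity check, not a required tutorial step:

# Describe the TPU node pool and print its machine type and per-zone node count.
gcloud container node-pools describe gemma-7b-tpu-nodepool \
    --cluster=${CLUSTER_NAME} \
    --location=${CONTROL_PLANE_LOCATION} \
    --format="yaml(config.machineType,initialNodeCount)"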

Create a Cloud Storage bucket

In Cloud Shell, run the following command:

gcloud storage buckets create gs://${BUCKET_NAME} --location=${CONTROL_PLANE_LOCATION}

This creates a Cloud Storage bucket to store the model files you download from Kaggle.
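
To confirm that the bucket was created in the expected location, you can optionally describe it:

# Print the bucket's name and location.
gcloud storage buckets describe gs://${BUCKET_NAME} \
    --format="yaml(name,location)"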

Upload the access token to Cloud Shell

In Cloud Shell, you can upload the Kaggle API token to your Google Cloud project:

  1. In Cloud Shell, click More > Upload.
  2. Select File and click Choose Files.
  3. Open the kaggle.json file.
  4. Click Upload.
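
Before you create the Kubernetes Secret in the next section, you can optionally confirm that the token landed in your home directory and parses as valid JSON. A minimal sketch:

# Check that kaggle.json exists and is well-formed JSON; prints its top-level keys.
python3 -c 'import json; print(sorted(json.load(open("kaggle.json")).keys()))'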

Create a Kubernetes Secret for Kaggle credentials

In Cloud Shell, do the following:

  1. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CONTROL_PLANE_LOCATION}
  2. Create a Secret to store the Kaggle credentials:

    kubectl create secret generic kaggle-secret \
        --from-file=kaggle.json
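
You can verify that the Secret exists and holds the kaggle.json key without printing the credential itself:

# Describe the Secret; kubectl shows data keys and sizes, not values.
kubectl describe secret kaggle-secret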

Configure your workload's access using Workload Identity Federation for GKE

Assign a Kubernetes ServiceAccount to the application and configure that Kubernetes ServiceAccount to act as an IAM service account.

  1. Create an IAM service account for your application:

    gcloud iam service-accounts create wi-jetstream
  2. Add an IAM policy binding for your IAM service account to manage Cloud Storage:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member "serviceAccount:wi-jetstream@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role roles/storage.objectUser

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member "serviceAccount:wi-jetstream@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role roles/storage.insightsCollectorService
  3. Allow the Kubernetes ServiceAccount to impersonate the IAM service account by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM service account:

    gcloud iam service-accounts add-iam-policy-binding wi-jetstream@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/iam.workloadIdentityUser \
        --member "serviceAccount:${PROJECT_ID}.svc.id.goog[default/default]"
  4. Annotate the Kubernetes ServiceAccount with the email address of the IAM service account:

    kubectl annotate serviceaccount default \
        iam.gke.io/gcp-service-account=wi-jetstream@${PROJECT_ID}.iam.gserviceaccount.com
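
To check that the federation works end to end, you can optionally run a short-lived Pod that uses the annotated default ServiceAccount and prints the identity it resolves to. This is a hedged sketch using the public Cloud SDK image, not an official tutorial step; with the binding in place, it should list the wi-jetstream service account:

# Launch a temporary Pod in the default namespace and list the active credentials.
kubectl run wi-test -it --rm --restart=Never \
    --image=google/cloud-sdk:slim \
    -- gcloud auth list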

Convert the model checkpoints

In this section, you create a Job to do the following:

  1. Download the base Orbax checkpoint from Kaggle.
  2. Upload the checkpoint to a Cloud Storage bucket.
  3. Convert the checkpoint to a MaxText-compatible checkpoint.
  4. Unscan the checkpoint to be used for serving.

Deploy the model checkpoint conversion Job

Follow these instructions to download and convert the Gemma 7B model checkpoint files. This tutorial uses a Kubernetes Job. A Job controller in Kubernetes creates one or more Pods and ensures that they successfully execute a specific task.

  1. Create the following manifest as job-7b.yaml, replacing BUCKET_NAME with the name of your Cloud Storage bucket:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: data-loader-7b
    spec:
      ttlSecondsAfterFinished: 30
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: inference-checkpoint
            image: us-docker.pkg.dev/cloud-tpu-images/inference/inference-checkpoint:v0.2.4
            args:
            - -b=BUCKET_NAME
            - -m=google/gemma/maxtext/7b-it/2
            volumeMounts:
            - mountPath: "/kaggle/"
              name: kaggle-credentials
              readOnly: true
            resources:
              requests:
                google.com/tpu: 8
              limits:
                google.com/tpu: 8
          nodeSelector:
            cloud.google.com/gke-tpu-topology: 2x4
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          volumes:
          - name: kaggle-credentials
            secret:
              defaultMode: 0400
              secretName: kaggle-secret
  2. Apply the manifest:

    kubectl apply -f job-7b.yaml
  3. Wait for the Pod that the Job schedules to begin running:

    kubectl get pod -w

    The output is similar to the following. This might take a few minutes:

    NAME                  READY   STATUS              RESTARTS   AGE
    data-loader-7b-abcd   0/1     ContainerCreating   0          28s
    data-loader-7b-abcd   1/1     Running             0          51s

    For Autopilot clusters, it may take a few minutes to provision the required TPU resources.

  4. View the logs from the Job:

    kubectl logs -f jobs/data-loader-7b

    When the Job is completed, the output is similar to the following:

    Successfully generated decode checkpoint at: gs://BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
    + echo -e '\nCompleted unscanning checkpoint to gs://BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items'
    Completed unscanning checkpoint to gs://BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
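
Before deploying the server, you can optionally confirm that the converted checkpoint is in your bucket:

# List the unscanned checkpoint files that the Job wrote.
gcloud storage ls gs://${BUCKET_NAME}/final/unscanned/gemma_7b-it/0/checkpoints/0/items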

Deploy JetStream

In this section, you deploy the JetStream container to serve the Gemma model.

Follow these instructions to deploy the Gemma 7B instruction-tuned model. This tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

  1. Save the following Deployment manifest as jetstream-gemma-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: maxengine-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: maxengine-server
      template:
        metadata:
          labels:
            app: maxengine-server
        spec:
          nodeSelector:
            cloud.google.com/gke-tpu-topology: 2x4
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          containers:
          - name: maxengine-server
            image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.2
            args:
            - model_name=gemma-7b
            - tokenizer_path=assets/tokenizer.gemma
            - per_device_batch_size=4
            - max_prefill_predict_length=1024
            - max_target_length=2048
            - async_checkpointing=false
            - ici_fsdp_parallelism=1
            - ici_autoregressive_parallelism=-1
            - ici_tensor_parallelism=1
            - scan_layers=false
            - weight_dtype=bfloat16
            - load_parameters_path=gs://BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
            - prometheus_port=PROMETHEUS_PORT
            ports:
            - containerPort: 9000
            resources:
              requests:
                google.com/tpu: 8
              limits:
                google.com/tpu: 8
          - name: jetstream-http
            image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.2
            ports:
            - containerPort: 8000
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: jetstream-svc
    spec:
      selector:
        app: maxengine-server
      ports:
      - protocol: TCP
        name: jetstream-http
        port: 8000
        targetPort: 8000
      - protocol: TCP
        name: jetstream-grpc
        port: 9000
        targetPort: 9000

    The manifest sets the following key properties:

    • tokenizer_path: the path to your model's tokenizer.
    • load_parameters_path: the path in the Cloud Storage bucket where your checkpoints are stored. Replace BUCKET_NAME with the name of your bucket.
    • per_device_batch_size: the decoding batch size per device, where one TPU chip equals one device.
    • max_prefill_predict_length: the maximum length for the prefill when doing autoregression.
    • max_target_length: the maximum sequence length.
    • model_name: the model name (gemma-7b).
    • ici_fsdp_parallelism: the number of shards for fully sharded data parallelism (FSDP).
    • ici_tensor_parallelism: the number of shards for tensor parallelism.
    • ici_autoregressive_parallelism: the number of shards for autoregressive parallelism.
    • prometheus_port: the port on which to expose Prometheus metrics. Replace PROMETHEUS_PORT with a port number such as 9090 before you apply the manifest, or remove this argument if you don't need metrics.
    • scan_layers: whether to scan layers (boolean).
    • weight_dtype: the weight data type (bfloat16).
  2. Apply the manifest:

    kubectl apply -f jetstream-gemma-deployment.yaml
  3. Verify the Deployment:

    kubectl get deployment

    The output is similar to the following:

    NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
    maxengine-server                  2/2     2            2           ##s

    For Autopilot clusters, it may take a few minutes to provision the required TPU resources.

  4. View the HTTP server logs to check that the model has been loaded and compiled. It may take the server a few minutes to complete this operation:

    kubectl logs deploy/maxengine-server -f -c jetstream-http

    The output is similar to the following:

    INFO:     Started server process [1]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
  5. View the MaxEngine logs and verify that the compilation is done:

    kubectl logs deploy/maxengine-server -f -c maxengine-server

    The output is similar to the following:

    2024-03-29 17:09:08,047 - jax._src.dispatch - DEBUG - Finished XLA compilation of jit(initialize) in 0.26236414909362793 sec
    2024-03-29 17:09:08,150 - root - INFO - ---------Generate params 0 loaded.---------
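
If you kept the prometheus_port argument, you can optionally confirm that the server exports metrics. The sketch below assumes you replaced PROMETHEUS_PORT with 9090 in the manifest and that metrics are served on the conventional /metrics path; adjust both if your setup differs:

# Forward the metrics port in the background, then scrape it once.
kubectl port-forward deploy/maxengine-server 9090:9090 &
curl -s localhost:9090/metrics | head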

Serve the model

In this section, you interact with the model.

Set up port forwarding

You can access the JetStream Deployment through the ClusterIP Service that you created in the preceding step. ClusterIP Services are reachable only from within the cluster. Therefore, to access the Service from outside the cluster, complete the following steps:

To establish a port forwarding session, run the following command:

kubectl port-forward svc/jetstream-svc 8000:8000

Success: You've successfully served Gemma using TPUs on GKE with JetStream. You can now interact with the model.

Interact with the model using curl

  1. Verify that you can access the JetStream HTTP server by opening a new terminal and running the following command:

    curl --request POST \
        --header "Content-type: application/json" \
        -s \
        localhost:8000/generate \
        --data \
        '{
            "prompt": "What are the top 5 programming languages",
            "max_tokens": 200
        }'

    The initial request can take several seconds to complete due to model warmup. The output is similar to the following:

    {
        "response": "\nfor data science in 2023?\n\n**1. Python:**\n- Widely used for data science due to its simplicity, readability, and extensive libraries for data wrangling, analysis, visualization, and machine learning.\n- Popular libraries include pandas, scikit-learn, and matplotlib.\n\n**2. R:**\n- Statistical programming language widely used for data analysis, visualization, and modeling.\n- Popular libraries include ggplot2, dplyr, and caret.\n\n**3. Java:**\n- Enterprise-grade language with strong performance and scalability.\n- Popular libraries include Spark, TensorFlow, and Weka.\n\n**4. C++:**\n- High-performance language often used for data analytics and machine learning models.\n- Popular libraries include TensorFlow, PyTorch, and OpenCV.\n\n**5. SQL:**\n- Relational database language essential for data wrangling and querying large datasets.\n- Popular tools"
    }
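
For scripted use, you can pipe the response through jq (preinstalled in Cloud Shell) to keep only the generated text. A small sketch with a shorter, hypothetical prompt:

# Send a prompt and print only the "response" field of the JSON reply.
curl --request POST \
    --header "Content-type: application/json" \
    -s \
    localhost:8000/generate \
    --data '{"prompt": "Why is the sky blue?", "max_tokens": 64}' \
    | jq -r '.response'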

(Optional) Interact with the model through a Gradio chat interface

In this section, you build a web chat application that lets you interact with your instruction-tuned model.

Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.

Deploy the chat interface

  1. In Cloud Shell, save the following manifest as gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.3
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://jetstream-svc:8000"
            - name: LLM_ENGINE
              value: "max"
            - name: MODEL_ID
              value: "gemma"
            - name: USER_PROMPT
              value: "<start_of_turn>user\nprompt<end_of_turn>\n"
            - name: SYSTEM_PROMPT
              value: "<start_of_turn>model\nprompt<end_of_turn>\n"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio
    spec:
      selector:
        app: gradio
      ports:
      - protocol: TCP
        port: 8080
        targetPort: 7860
      type: ClusterIP
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
  3. Wait for the deployment to be available:

    kubectl wait --for=condition=Available --timeout=300s deployment/gradio
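
If the wait times out, you can check the Pod's state directly:

# List the Gradio Pod and its status.
kubectl get pods -l app=gradio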

Use the chat interface

  1. In Cloud Shell, run the following command:

    kubectl port-forward service/gradio 8080:8080

    This creates a port forward from Cloud Shell to the Gradio service.

  2. Click the Web Preview button in the top right of the Cloud Shell taskbar, and then click Preview on Port 8080. A new tab opens in your browser.

  3. Interact with Gemma using the Gradio chat interface. Add a prompt and click Submit.

Troubleshoot issues

  • If you get the message Empty reply from server, it's possible that the container hasn't finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
  • If you see Connection refused, verify that your port forwarding is active.
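
For either symptom, inspecting recent cluster events and the rollout status can help narrow down the cause, for example:

# Show recent events, newest last, to spot scheduling or image-pull problems.
kubectl get events --sort-by=.metadata.creationTimestamp

# Check whether the JetStream Deployment has finished rolling out.
kubectl rollout status deployment/maxengine-server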

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands and follow the prompts:

gcloud container clusters delete ${CLUSTER_NAME} --location=${CONTROL_PLANE_LOCATION}

gcloud iam service-accounts delete wi-jetstream@${PROJECT_ID}.iam.gserviceaccount.com

gcloud storage rm --recursive gs://${BUCKET_NAME}

