Serve Gemma open models using TPUs on GKE with Saxml
This tutorial demonstrates how to deploy and serve a Gemma large language model (LLM) using TPUs on GKE with the Saxml serving framework. This tutorial provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment. You deploy a pre-built container with Saxml to GKE. You also configure GKE to load the Gemma 2B and 7B weights from Cloud Storage at runtime.
This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving LLMs. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Before reading this page, ensure that you're familiar with the following:
- Current TPU version availability with the Cloud TPU system architecture
- TPUs in GKE
If you need a unified managed AI platform to rapidly build and serve ML models cost effectively, we recommend that you try our Vertex AI deployment solution.
Background
This section describes the key technologies used in this tutorial.
Gemma
Gemma is a set of openly available, lightweight generative AI models released under an open license. These AI models are available to run in your applications, hardware, mobile devices, or hosted services. You can use the Gemma models for text generation, and you can also tune these models for specialized tasks.
To learn more, see the Gemma documentation.
TPUs
TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate data processing frameworks such as TensorFlow, PyTorch, and JAX.
This tutorial serves the Gemma 2B and Gemma 7B models. GKE hosts these models on the following single-host TPU v5e node pools:
- Gemma 2B: Instruction tuned model hosted in a TPU v5e node pool with 1x1 topology that represents one TPU chip. The machine type for the nodes is ct5lp-hightpu-1t.
- Gemma 7B: Instruction tuned model hosted in a TPU v5e node pool with 2x2 topology that represents four TPU chips. The machine type for the nodes is ct5lp-hightpu-4t.
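The relationship between topology, chip count, and the TPU quota you need can be sketched in a few lines of Python. This is only an illustration: the dictionary and function names are hypothetical, and the values are copied from the two node pools described above.

```python
from math import prod

# Node pool settings copied from the list above; the dict layout is illustrative.
NODE_POOLS = {
    "gemma-2b-it": {"topology": "1x1", "machine_type": "ct5lp-hightpu-1t"},
    "gemma-7b-it": {"topology": "2x2", "machine_type": "ct5lp-hightpu-4t"},
}


def chips_in_topology(topology: str) -> int:
    """Return the number of TPU chips implied by a topology string such as '2x2'."""
    return prod(int(dim) for dim in topology.split("x"))


def total_chips(pools: dict) -> int:
    """Sum the chips across all node pools, for example to size a quota request."""
    return sum(chips_in_topology(pool["topology"]) for pool in pools.values())


if __name__ == "__main__":
    for model, pool in NODE_POOLS.items():
        print(model, pool["machine_type"], chips_in_topology(pool["topology"]))
    print("total chips:", total_chips(NODE_POOLS))
```

Serving both models at the same time therefore needs one chip plus four chips of TPU v5e quota.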
Saxml
Saxml is an experimental system that serves Paxml, JAX, and PyTorch models for inference. The Saxml system includes the following components:
- Saxml cell or Sax cluster: An admin server and a group of model servers. The admin server keeps track of model servers, assigns published models to model servers to serve, and helps clients locate model servers serving specific published models.
- Saxml client: The user-facing programming interface for the Saxml system. The Saxml client includes a command line tool (saxutil) and a suite of client libraries in Python, C++, and Go.
In this tutorial, you also use the Saxml HTTP server. The Saxml HTTP server is a custom HTTP server that encapsulates the Saxml Python client library and exposes REST APIs to interact with the Saxml system. The REST APIs include endpoints to publish, list, and unpublish models, and to generate predictions.
Objectives
- Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
- Deploy Saxml components on GKE.
- Get and publish the Gemma 2B or Gemma 7B parameter model.
- Serve and interact with the published models.
Architecture
This section describes the GKE architecture used in this tutorial. The architecture comprises a GKE Standard cluster that provisions TPUs and hosts Saxml components to deploy and serve Gemma 2B or 7B models. The following diagram shows the components of this architecture:
This architecture includes the following components:
- A GKE Standard, zonal cluster.
- A single-host TPU slice node pool that depends on the Gemma model you want to serve:
  - Gemma 2B: Configured with a TPU v5e with a 1x1 topology. One instance of the Saxml Model server is configured to use this node pool.
  - Gemma 7B: Configured with a TPU v5e with a 2x2 topology. One instance of the Saxml Model server is configured to use this node pool.
- A default CPU node pool where the Saxml Admin server and Saxml HTTP server are deployed.
- Two Cloud Storage buckets:
  - One Cloud Storage bucket stores the state managed by an Admin server.
  - One Cloud Storage bucket stores the Gemma model checkpoints.
This architecture has the following characteristics:
- A public Artifact Registry manages the container images for the Saxml components.
- The GKE cluster uses Workload Identity Federation for GKE. All Saxml components use Workload Identity Federation, which integrates an IAM service account, to access external services like Cloud Storage buckets.
- The logs generated by Saxml components are integrated into Cloud Logging.
- You can use Cloud Monitoring to analyze the performance metrics of GKE node pools, such as those that this tutorial creates.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the required API.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/iam.policyAdmin, roles/iam.securityAdmin, roles/iam.roleAdmin
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- Click Select a role, then search for the role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Ensure that you have sufficient quotas for 5 TPU v5e chips. In this tutorial, you use on-demand instances.
- Create a Kaggle account, if you don't already have one.
Prepare the environment for Gemma
Launch Cloud Shell
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell is preinstalled with the software you need for this tutorial, including kubectl and gcloud CLI.
In the Google Cloud console, start a Cloud Shell instance.
Set the default environment variables:

    gcloud config set project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export LOCATION=LOCATION
    export CLUSTER_NAME=saxml-tpu

Replace the following values:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The name of the Compute Engine zone where the TPU v5e machine types are available.
Create a Standard cluster
In this section, you create the GKE cluster and node pool.
Gemma 2B-it
Use Cloud Shell to do the following:
Create a Standard cluster that uses Workload Identity Federation for GKE:

    gcloud container clusters create ${CLUSTER_NAME} \
        --enable-ip-alias \
        --machine-type=e2-standard-4 \
        --num-nodes=2 \
        --release-channel=rapid \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --location=${LOCATION}

The cluster creation can take several minutes.
Create a TPU v5e node pool with a 1x1 topology and one node:

    gcloud container node-pools create tpu-v5e-1x1 \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-1t \
        --num-nodes=1 \
        --location=${LOCATION}

You serve the Gemma 2B model in this node pool.
Gemma 7B-it
Use Cloud Shell to do the following:
Create a Standard cluster that uses Workload Identity Federation for GKE:

    gcloud container clusters create ${CLUSTER_NAME} \
        --enable-ip-alias \
        --machine-type=e2-standard-4 \
        --num-nodes=2 \
        --release-channel=rapid \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --location=${LOCATION}

The cluster creation can take several minutes.
Create a TPU v5e node pool with a 2x2 topology and one node:

    gcloud container node-pools create tpu-v5e-2x2 \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-4t \
        --num-nodes=1 \
        --location=${LOCATION}

You serve the Gemma 7B model in this node pool.
Create the Cloud Storage buckets
Create two Cloud Storage buckets to manage the state of the Saxml Admin server and the model checkpoints.
In Cloud Shell, run the following:
Create a Cloud Storage bucket to store Saxml Admin server configurations:

    gcloud storage buckets create gs://ADMIN_BUCKET_NAME

Replace ADMIN_BUCKET_NAME with the name of the Cloud Storage bucket that stores the Saxml Admin server.
Create a Cloud Storage bucket to store model checkpoints:

    gcloud storage buckets create gs://CHECKPOINTS_BUCKET_NAME

Replace CHECKPOINTS_BUCKET_NAME with the name of the Cloud Storage bucket that stores the model checkpoints.
Configure workload access using Workload Identity Federation for GKE
Assign a Kubernetes ServiceAccount to the application and configure that Kubernetes ServiceAccount to act as an IAM service account.
Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${LOCATION}

Create an IAM service account for your application to use:

    gcloud iam service-accounts create wi-sax

Add an IAM policy binding for your IAM service account to read and write to Cloud Storage:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member "serviceAccount:wi-sax@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role roles/storage.objectUser

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member "serviceAccount:wi-sax@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role roles/storage.insightsCollectorService

Allow the Kubernetes ServiceAccount to impersonate the IAM service account by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM service account:

    gcloud iam service-accounts add-iam-policy-binding wi-sax@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/iam.workloadIdentityUser \
        --member "serviceAccount:${PROJECT_ID}.svc.id.goog[default/default]"

Annotate the Kubernetes ServiceAccount with the email address of the IAM service account:

    kubectl annotate serviceaccount default \
        iam.gke.io/gcp-service-account=wi-sax@${PROJECT_ID}.iam.gserviceaccount.com
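The identity strings in this section follow fixed patterns, so a small Python sketch can help you double-check them before you paste them into a command. The function names and the sample project ID my-project are hypothetical. The string formats are copied from the commands in this section, which use the default namespace and the default Kubernetes ServiceAccount.

```python
def iam_service_account_email(project_id: str, account: str = "wi-sax") -> str:
    """Email address of the IAM service account used in this tutorial."""
    return f"{account}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(project_id: str,
                             namespace: str = "default",
                             ksa_name: str = "default") -> str:
    """Member string that identifies a Kubernetes ServiceAccount in the workload pool."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(project_id: str, account: str = "wi-sax") -> dict:
    """Annotation that links the Kubernetes ServiceAccount to the IAM service account."""
    return {"iam.gke.io/gcp-service-account": iam_service_account_email(project_id, account)}


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    print(iam_service_account_email("my-project"))
    print(workload_identity_member("my-project"))
    print(ksa_annotation("my-project"))
```

If you use a dedicated namespace or ServiceAccount instead of default, both the member string and the annotated object change accordingly.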
Get access to the model
To get access to the Gemma models for deployment to GKE, you must sign in to the Kaggle platform, sign the license consent agreement, and get a Kaggle API token. In this tutorial, you use a Kubernetes Secret for the Kaggle credentials.
Sign the license consent agreement
You must sign the consent agreement to use Gemma. Follow these instructions:
- Access the model consent page on Kaggle.com.
- Sign in to Kaggle, if you haven't done so already.
- Click Request Access.
- In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for granting consent.
- Accept the model Terms and Conditions.
Generate an access token
To access the model through Kaggle, you need a Kaggle API token.
Follow these steps to generate a new token, if you don't have one already:
- In your browser, go to Kaggle settings.
- Under the API section, click Create New Token.
Kaggle downloads a file named kaggle.json.
Upload the access token to Cloud Shell
In Cloud Shell, you can upload the Kaggle API token to your Google Cloud project:
- In Cloud Shell, click More > Upload.
- Select File and click Choose Files.
- Open the kaggle.json file.
- Click Upload.
Create Kubernetes Secret for Kaggle credentials
In Cloud Shell, do the following steps:
Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${LOCATION}

Create a Secret to store the Kaggle credentials:

    kubectl create secret generic kaggle-secret \
        --from-file=kaggle.json
Deploy Saxml
In this section, you deploy the Saxml admin server, model servers, and the HTTP server. This tutorial uses Kubernetes Deployment manifests. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Deploy the Saxml admin server
In this section, you deploy the Saxml admin server.
Create the following saxml-admin-server.yaml manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sax-admin-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sax-admin-server
      template:
        metadata:
          labels:
            app: sax-admin-server
        spec:
          hostNetwork: false
          containers:
          - name: sax-admin-server
            image: us-docker.pkg.dev/cloud-tpu-images/inference/sax-admin-server:v1.2.0
            securityContext:
              privileged: true
            ports:
            - containerPort: 10000
            env:
            - name: GSBUCKET
              value: ADMIN_BUCKET_NAME

Replace ADMIN_BUCKET_NAME with the name of the bucket you created in the Create the Cloud Storage buckets section. Don't include the gs:// prefix.
Apply the manifest:

    kubectl apply -f saxml-admin-server.yaml

Verify the admin server deployment:

    kubectl get deployment

The output looks similar to the following:

    NAME               READY   UP-TO-DATE   AVAILABLE   AGE
    sax-admin-server   1/1     1            1           ##s
Deploy the Saxml model server
Follow these instructions to deploy the model server for the Gemma 2B or Gemma 7B model.
Gemma 2B-it
Create the following saxml-model-server-1x1.yaml manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sax-model-server-v5e-1x1
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gemma-server
      strategy:
        type: Recreate
      template:
        metadata:
          labels:
            app: gemma-server
            ai.gke.io/model: gemma-2b-it
            ai.gke.io/inference-server: saxml
            examples.ai.gke.io/source: user-guide
        spec:
          nodeSelector:
            cloud.google.com/gke-tpu-topology: 1x1
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          hostNetwork: false
          restartPolicy: Always
          containers:
          - name: inference-server
            image: us-docker.pkg.dev/cloud-tpu-images/inference/sax-model-server:v1.2.0
            args:
            - "--jax_platforms=tpu"
            - "--platform_chip=tpuv5e"
            - "--platform_topology=1x1"
            - "--port=10001"
            - "--sax_cell=/sax/test"
            ports:
            - containerPort: 10001
            securityContext:
              privileged: true
            env:
            - name: SAX_ROOT
              value: "gs://ADMIN_BUCKET_NAME/sax-root"
            resources:
              requests:
                google.com/tpu: 1
              limits:
                google.com/tpu: 1

Replace ADMIN_BUCKET_NAME with the name of the bucket you created in the Create the Cloud Storage buckets section. Don't include the gs:// prefix.
Apply the manifest:

    kubectl apply -f saxml-model-server-1x1.yaml

Verify the status of the model server Deployment:

    kubectl get deployment

The output looks similar to the following:

    NAME                       READY   STATUS    RESTARTS   AGE
    sax-admin-server           1/1     Running   0          ##m
    sax-model-server-v5e-1x1   1/1     Running   0          ##s
Gemma 7B-it
Create the following saxml-model-server-2x2.yaml manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sax-model-server-v5e-2x2
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gemma-server
      strategy:
        type: Recreate
      template:
        metadata:
          labels:
            app: gemma-server
            ai.gke.io/model: gemma-7b-it
            ai.gke.io/inference-server: saxml
            examples.ai.gke.io/source: user-guide
        spec:
          nodeSelector:
            cloud.google.com/gke-tpu-topology: 2x2
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          hostNetwork: false
          restartPolicy: Always
          containers:
          - name: inference-server
            image: us-docker.pkg.dev/cloud-tpu-images/inference/sax-model-server:v1.2.0
            args:
            - "--jax_platforms=tpu"
            - "--platform_chip=tpuv5e"
            - "--platform_topology=2x2"
            - "--port=10001"
            - "--sax_cell=/sax/test"
            ports:
            - containerPort: 10001
            securityContext:
              privileged: true
            env:
            - name: SAX_ROOT
              value: "gs://ADMIN_BUCKET_NAME/sax-root"
            resources:
              requests:
                google.com/tpu: 4
              limits:
                google.com/tpu: 4

Replace ADMIN_BUCKET_NAME with the name of the bucket you created in the Create the Cloud Storage buckets section. Don't include the gs:// prefix.
Apply the manifest:

    kubectl apply -f saxml-model-server-2x2.yaml

Verify the status of the model server Deployment:

    kubectl get deployment

The output looks similar to the following:

    NAME                       READY   STATUS    RESTARTS   AGE
    sax-admin-server           1/1     Running   0          ##m
    sax-model-server-v5e-2x2   1/1     Running   0          ##s
Deploy the Saxml HTTP server
In this section, you deploy the Saxml HTTP server and create a ClusterIP Service that you use to access the server.
Create the following saxml-http.yaml manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sax-http
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sax-http
      template:
        metadata:
          labels:
            app: sax-http
        spec:
          hostNetwork: false
          containers:
          - name: sax-http
            image: us-docker.pkg.dev/cloud-tpu-images/inference/sax-http:v1.2.0
            imagePullPolicy: Always
            ports:
            - containerPort: 8888
            env:
            - name: SAX_ROOT
              value: "gs://ADMIN_BUCKET_NAME/sax-root"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sax-http-svc
    spec:
      selector:
        app: sax-http
      ports:
      - protocol: TCP
        port: 8888
        targetPort: 8888
      type: ClusterIP

Replace ADMIN_BUCKET_NAME with the name of the Cloud Storage bucket that stores the Saxml Admin server.
Apply the manifest:

    kubectl apply -f saxml-http.yaml

Verify the status of the Saxml HTTP server deployment:

    kubectl get deployment

Gemma 2B-it
The output looks similar to the following:

    NAME                       READY   STATUS    RESTARTS   AGE
    sax-admin-server           1/1     Running   0          ##m
    sax-model-server-v5e-1x1   1/1     Running   0          ##m
    sax-http                   1/1     Running   0          ##s

Gemma 7B-it
The output looks similar to the following:

    NAME                       READY   STATUS    RESTARTS   AGE
    sax-admin-server           1/1     Running   0          ##m
    sax-model-server-v5e-2x2   1/1     Running   0          ##m
    sax-http                   1/1     Running   0          ##s
Download the model checkpoint
In this section, you run a Kubernetes Job that fetches, downloads, and stores the model checkpoint. A Job controller in Kubernetes creates one or more Pods and ensures that they successfully execute a specific task.
Follow the steps for the Gemma model that you want to use:
Gemma 2B-it
Create the following job-2b.yaml manifest:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: fetch-model-scripts
    data:
      fetch_model.sh: |-
        #!/usr/bin/bash -x
        pip install kaggle --break-system-packages && \
        MODEL_NAME=$(echo ${MODEL_PATH} | awk -F'/' '{print $2}') && \
        VARIATION_NAME=$(echo ${MODEL_PATH} | awk -F'/' '{print $4}') && \
        mkdir -p /data/${MODEL_NAME}_${VARIATION_NAME} &&\
        kaggle models instances versions download ${MODEL_PATH} --untar -p /data/${MODEL_NAME}_${VARIATION_NAME} && \
        echo -e "\nCompleted extraction to /data/${MODEL_NAME}_${VARIATION_NAME}" && \
        gcloud storage rsync --recursive --no-clobber /data/${MODEL_NAME}_${VARIATION_NAME} gs://${BUCKET_NAME}/${MODEL_NAME}_${VARIATION_NAME} && \
        echo -e "\nCompleted copy of data to gs://${BUCKET_NAME}/${MODEL_NAME}_${VARIATION_NAME}"
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: data-loader-2b
      labels:
        app: data-loader-2b
    spec:
      ttlSecondsAfterFinished: 120
      template:
        metadata:
          labels:
            app: data-loader-2b
        spec:
          restartPolicy: OnFailure
          containers:
          - name: gcloud
            image: gcr.io/google.com/cloudsdktool/google-cloud-cli:slim
            command:
            - /scripts/fetch_model.sh
            env:
            - name: BUCKET_NAME
              value: CHECKPOINTS_BUCKET_NAME
            - name: KAGGLE_CONFIG_DIR
              value: /kaggle
            - name: MODEL_PATH
              value: "google/gemma/pax/2b-it/2"
            volumeMounts:
            - mountPath: "/kaggle/"
              name: kaggle-credentials
              readOnly: true
            - mountPath: "/scripts/"
              name: scripts-volume
              readOnly: true
          volumes:
          - name: kaggle-credentials
            secret:
              defaultMode: 0400
              secretName: kaggle-secret
          - name: scripts-volume
            configMap:
              defaultMode: 0700
              name: fetch-model-scripts

Replace CHECKPOINTS_BUCKET_NAME with the name of the bucket you created in the Create the Cloud Storage buckets section. Don't include the gs:// prefix.
Apply the manifest:

    kubectl apply -f job-2b.yaml

Wait for the Job to complete:

    kubectl wait --for=condition=complete --timeout=180s job/data-loader-2b

The output looks similar to the following:

    job.batch/data-loader-2b condition met

Verify that the Job completed successfully:

    kubectl get job/data-loader-2b

The output looks similar to the following:

    NAME             COMPLETIONS   DURATION   AGE
    data-loader-2b   1/1           ##s        #m##s

View the logs for the Job:

    kubectl logs --follow job/data-loader-2b

The Job uploads the checkpoint to gs://CHECKPOINTS_BUCKET_NAME/gemma_2b-it/checkpoint_00000000.
Gemma 7B-it
Create the following job-7b.yaml manifest:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: fetch-model-scripts
    data:
      fetch_model.sh: |-
        #!/usr/bin/bash -x
        pip install kaggle --break-system-packages && \
        MODEL_NAME=$(echo ${MODEL_PATH} | awk -F'/' '{print $2}') && \
        VARIATION_NAME=$(echo ${MODEL_PATH} | awk -F'/' '{print $4}') && \
        mkdir -p /data/${MODEL_NAME}_${VARIATION_NAME} &&\
        kaggle models instances versions download ${MODEL_PATH} --untar -p /data/${MODEL_NAME}_${VARIATION_NAME} && \
        echo -e "\nCompleted extraction to /data/${MODEL_NAME}_${VARIATION_NAME}" && \
        gcloud storage rsync --recursive --no-clobber /data/${MODEL_NAME}_${VARIATION_NAME} gs://${BUCKET_NAME}/${MODEL_NAME}_${VARIATION_NAME} && \
        echo -e "\nCompleted copy of data to gs://${BUCKET_NAME}/${MODEL_NAME}_${VARIATION_NAME}"
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: data-loader-7b
      labels:
        app: data-loader-7b
    spec:
      ttlSecondsAfterFinished: 120
      template:
        metadata:
          labels:
            app: data-loader-7b
        spec:
          restartPolicy: OnFailure
          containers:
          - name: gcloud
            image: gcr.io/google.com/cloudsdktool/google-cloud-cli:slim
            command:
            - /scripts/fetch_model.sh
            env:
            - name: BUCKET_NAME
              value: CHECKPOINTS_BUCKET_NAME
            - name: KAGGLE_CONFIG_DIR
              value: /kaggle
            - name: MODEL_PATH
              value: "google/gemma/pax/7b-it/2"
            volumeMounts:
            - mountPath: "/kaggle/"
              name: kaggle-credentials
              readOnly: true
            - mountPath: "/scripts/"
              name: scripts-volume
              readOnly: true
          volumes:
          - name: kaggle-credentials
            secret:
              defaultMode: 0400
              secretName: kaggle-secret
          - name: scripts-volume
            configMap:
              defaultMode: 0700
              name: fetch-model-scripts

Replace CHECKPOINTS_BUCKET_NAME with the name of the bucket you created in the Create the Cloud Storage buckets section. Don't include the gs:// prefix, because the script adds it.
Apply the manifest:

    kubectl apply -f job-7b.yaml

Wait for the Job to complete:

    kubectl wait --for=condition=complete --timeout=360s job/data-loader-7b

The output looks similar to the following:

    job.batch/data-loader-7b condition met

Verify that the Job completed successfully:

    kubectl get job/data-loader-7b

The output looks similar to the following:

    NAME             COMPLETIONS   DURATION   AGE
    data-loader-7b   1/1           ##s        #m##s

View the logs for the Job:

    kubectl logs --follow job/data-loader-7b

The Job uploads the checkpoint to gs://CHECKPOINTS_BUCKET_NAME/gemma_7b-it/checkpoint_00000000.
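The destination path comes from the MODEL_PATH value: the fetch script takes the second and fourth slash-separated fields and joins them with an underscore. The following Python sketch mirrors that awk logic so you can predict the checkpoint location for a given model path. The function name and the bucket name my-bucket are hypothetical, and the checkpoint_00000000 suffix is taken from the paths quoted in this section rather than computed by the script.

```python
def checkpoint_uri(model_path: str, bucket_name: str) -> str:
    """Predict where the data-loader Job stores a checkpoint in Cloud Storage."""
    fields = model_path.split("/")
    model_name = fields[1]       # same field as awk -F'/' '{print $2}'
    variation_name = fields[3]   # same field as awk -F'/' '{print $4}'
    # The script copies to gs://BUCKET/MODEL_VARIATION; the checkpoint
    # directory name below is the one quoted in this tutorial.
    return f"gs://{bucket_name}/{model_name}_{variation_name}/checkpoint_00000000"


if __name__ == "__main__":
    # "my-bucket" is a placeholder bucket name.
    print(checkpoint_uri("google/gemma/pax/2b-it/2", "my-bucket"))
    print(checkpoint_uri("google/gemma/pax/7b-it/2", "my-bucket"))
```

The same value is what you pass as the checkpoint field when you publish the model.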
Expose the Saxml HTTP server
You can access the Saxml HTTP server through the ClusterIP Service that you created when deploying the Saxml HTTP server. ClusterIP Services are only reachable from within the cluster. Therefore, to access the Service from outside the cluster, complete the following steps:
Establish a port forwarding session:

    kubectl port-forward service/sax-http-svc 8888:8888

Verify that you can access the Saxml HTTP server by opening a new terminal and running the following command:

    curl -s localhost:8888

The output looks similar to the following:

    {
        "Message": "HTTP Server for SAX Client"
    }

The Saxml HTTP server encapsulates the client interface to the Saxml system and exposes it through a set of REST APIs. You use these APIs to publish, manage, and interface with Gemma 2B and Gemma 7B models.
Publish the Gemma model
Next, you can publish the Gemma model to a model server that runs in a TPU slice node pool. You use the Saxml HTTP server's publish API to publish a model. Follow these steps to publish the Gemma 2B or 7B parameter model.
To learn more about the Saxml HTTP server's API, see Saxml HTTP APIs.
Gemma 2B-it
Make sure that your port forwarding session is still active:

    curl -s localhost:8888

Publish the Gemma 2B parameter model:

    curl --request POST \
    --header "Content-type: application/json" \
    -s \
    localhost:8888/publish \
    --data \
    '{
        "model": "/sax/test/gemma2bfp16",
        "model_path": "saxml.server.pax.lm.params.gemma.Gemma2BFP16",
        "checkpoint": "gs://CHECKPOINTS_BUCKET_NAME/gemma_2b-it/checkpoint_00000000",
        "replicas": "1"
    }'

The output looks similar to the following:

    {
        "model": "/sax/test/gemma2bfp16",
        "model_path": "saxml.server.pax.lm.params.gemma.Gemma2BFP16",
        "checkpoint": "gs://CHECKPOINTS_BUCKET_NAME/gemma_2b-it/checkpoint_00000000",
        "replicas": 1
    }

See the next step for monitoring the progress of the deployment.
Monitor the progress by observing logs in a model server Pod of the sax-model-server-v5e-1x1 deployment:

    kubectl logs --follow deployment/sax-model-server-v5e-1x1

This deployment can take up to five minutes to complete. Wait until you see a message similar to the following:

    I0125 15:34:31.685555 139063071708736 servable_model.py:699] loading completed.
    I0125 15:34:31.686286 139063071708736 model_service_base.py:532] Successfully loaded model for key: /sax/test/gemma2bfp16

Verify that you can access the model by displaying the model information:

    curl --request GET \
    --header "Content-type: application/json" \
    -s \
    localhost:8888/listcell \
    --data \
    '{
        "model": "/sax/test/gemma2bfp16"
    }'

The output looks similar to the following:

    {
        "model": "/sax/test/gemma2bfp16",
        "model_path": "saxml.server.pax.lm.params.gemma.Gemma2BFP16",
        "checkpoint": "gs://CHECKPOINTS_BUCKET_NAME/gemma_2b-it/checkpoint_00000000",
        "max_replicas": 1,
        "active_replicas": 1
    }
Gemma 7B-it
Make sure that your port forwarding session is still active:

    curl -s localhost:8888

Publish the Gemma 7B parameter model:

    curl --request POST \
    --header "Content-type: application/json" \
    -s \
    localhost:8888/publish \
    --data \
    '{
        "model": "/sax/test/gemma7bfp16",
        "model_path": "saxml.server.pax.lm.params.gemma.Gemma7BFP16",
        "checkpoint": "gs://CHECKPOINTS_BUCKET_NAME/gemma_7b-it/checkpoint_00000000",
        "replicas": "1"
    }'

The output looks similar to the following:

    {
        "model": "/sax/test/gemma7bfp16",
        "model_path": "saxml.server.pax.lm.params.gemma.Gemma7BFP16",
        "checkpoint": "gs://CHECKPOINTS_BUCKET_NAME/gemma_7b-it/checkpoint_00000000",
        "replicas": 1
    }

See the next step for monitoring the progress of the deployment.
Monitor the progress by observing logs in a model server Pod of the sax-model-server-v5e-2x2 deployment:

    kubectl logs --follow deployment/sax-model-server-v5e-2x2

Wait until you see a message similar to the following:

    I0125 15:34:31.685555 139063071708736 servable_model.py:699] loading completed.
    I0125 15:34:31.686286 139063071708736 model_service_base.py:532] Successfully loaded model for key: /sax/test/gemma7bfp16

Verify that the model was published by displaying the model information:

    curl --request GET \
    --header "Content-type: application/json" \
    -s \
    localhost:8888/listcell \
    --data \
    '{
        "model": "/sax/test/gemma7bfp16"
    }'

The output looks similar to the following:

    {
        "model": "/sax/test/gemma7bfp16",
        "model_path": "saxml.server.pax.lm.params.gemma.Gemma7BFP16",
        "checkpoint": "gs://CHECKPOINTS_BUCKET_NAME/gemma_7b-it/checkpoint_00000000",
        "max_replicas": 1,
        "active_replicas": 1
    }
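If you prefer to script the publish call instead of typing the curl command, the request can be assembled with the Python standard library. The following is a minimal sketch, not part of Saxml: the function name is hypothetical, the base URL assumes the port forwarding session from this tutorial, and the payload fields are copied from the curl example above.

```python
import json
import urllib.request


def build_publish_request(model: str, model_path: str, checkpoint: str,
                          replicas: int = 1,
                          base_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build (but do not send) a POST request for the publish endpoint."""
    payload = {
        "model": model,
        "model_path": model_path,
        "checkpoint": checkpoint,
        "replicas": str(replicas),  # the curl example sends this value as a string
    }
    return urllib.request.Request(
        f"{base_url}/publish",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_publish_request(
        model="/sax/test/gemma2bfp16",
        model_path="saxml.server.pax.lm.params.gemma.Gemma2BFP16",
        # Replace CHECKPOINTS_BUCKET_NAME with your bucket name.
        checkpoint="gs://CHECKPOINTS_BUCKET_NAME/gemma_2b-it/checkpoint_00000000",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

To send the request, pass it to urllib.request.urlopen while the port forwarding session is active.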
Use the model
You can interact with the Gemma 2B or 7B models. Use the Saxml HTTP server's generate API to send a prompt to the model.
Gemma 2B-it
Serve a prompt request by using the generate endpoint of the Saxml HTTP server:

    curl --request POST \
    --header "Content-type: application/json" \
    -s \
    localhost:8888/generate \
    --data \
    '{
        "model": "/sax/test/gemma2bfp16",
        "query": "What are the top 5 most popular programming languages?"
    }'

The following is an example of the model response. The actual output varies based on the prompt that you serve:

    [
        [
            "\n\n1. **Python**\n2. **JavaScript**\n3. **Java**\n4. **C++**\n5. **Go**",
            -3.0704939365386963
        ]
    ]

You can run the command with different query parameters. You can also modify extra parameters such as temperature, top_k, and top_p by using the generate API. To learn more about the Saxml HTTP server's API, see Saxml HTTP APIs.
Gemma 7B-it
Serve a prompt request by using the generate endpoint of the Saxml HTTP server:

    curl --request POST \
    --header "Content-type: application/json" \
    -s \
    localhost:8888/generate \
    --data \
    '{
        "model": "/sax/test/gemma7bfp16",
        "query": "What are the top 5 most popular programming languages?"
    }'

The following is an example of the model response. The output might vary with every prompt that you serve:

    [
        [
            "\n\n**1. JavaScript**\n\n* Most widely used language on the web.\n* Used for front-end development, such as websites and mobile apps.\n* Extensive libraries and frameworks available.\n\n**2. Python**\n\n* Known for its simplicity and readability.\n* Versatile, used for various tasks, including data science, machine learning, and web development.\n* Large and active community.\n\n**3. Java**\n\n* Object-oriented language widely used in enterprise applications.\n* Used for web applications, mobile apps, and enterprise software.\n* Strong ecosystem and support.\n\n**4. Go**\n\n",
            -16.806324005126953
        ]
    ]

You can run the command with different query parameters. You can also modify extra parameters such as temperature, top_k, and top_p by using the generate API. To learn more about the Saxml HTTP server's API, see Saxml HTTP APIs.
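The example responses are JSON arrays of [text, score] pairs. The following Python sketch shows one way to pull the text out of such a response. The function name is hypothetical, and it rests on an assumption that this page does not confirm: that the second element is a score for which a higher value means a better candidate.

```python
import json
from typing import Tuple


def best_completion(response_body: str) -> Tuple[str, float]:
    """Return the (text, score) pair with the highest score from a generate response.

    Assumption: each element is a [text, score] pair and a higher score is better.
    """
    candidates = json.loads(response_body)
    text, score = max(candidates, key=lambda pair: pair[1])
    return text, score


if __name__ == "__main__":
    # Sample body shaped like the Gemma 2B example response above.
    sample = '[["\\n\\n1. **Python**\\n2. **JavaScript**", -3.0704939365386963]]'
    text, score = best_completion(sample)
    print(text.strip())
    print(score)
```

With a single candidate, as in the examples above, the function simply returns that candidate.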
Unpublish the model
Follow these steps to unpublish your model:
Gemma 2B-it
To unpublish the Gemma 2B-it model, run the following command:

    curl --request POST \
    --header "Content-type: application/json" \
    -s \
    localhost:8888/unpublish \
    --data \
    '{
        "model": "/sax/test/gemma2bfp16"
    }'

The output looks similar to the following:

    {
        "model": "/sax/test/gemma2bfp16"
    }
Gemma 7B-it
To unpublish the Gemma 7B-it model, run the following command:

    curl --request POST \
    --header "Content-type: application/json" \
    -s \
    localhost:8888/unpublish \
    --data \
    '{
        "model": "/sax/test/gemma7bfp16"
    }'

The output looks similar to the following:

    {
        "model": "/sax/test/gemma7bfp16"
    }
Troubleshoot issues
- If you get the message Empty reply from server, it's possible that the container has not finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
- If you see Connection refused, verify that your port forwarding is active.
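When a request fails because the model is still loading, it can help to check saved model server logs for the "Successfully loaded model for key" line that the publish steps tell you to wait for. The following Python sketch does that for a list of log lines. The function name is hypothetical, and the marker text is copied from the sample log output in this tutorial, so treat it as an assumption that might not hold for other Saxml versions.

```python
from typing import Iterable


def model_loaded(log_lines: Iterable[str], model_key: str) -> bool:
    """Return True if the logs report that the given model key finished loading."""
    marker = f"Successfully loaded model for key: {model_key}"
    return any(marker in line for line in log_lines)


if __name__ == "__main__":
    # Sample lines copied from the log output shown in this tutorial.
    logs = [
        "I0125 15:34:31.685555 139063071708736 servable_model.py:699] loading completed.",
        "I0125 15:34:31.686286 139063071708736 model_service_base.py:532] "
        "Successfully loaded model for key: /sax/test/gemma2bfp16",
    ]
    print(model_loaded(logs, "/sax/test/gemma2bfp16"))
    print(model_loaded(logs, "/sax/test/gemma7bfp16"))
```

You could feed the function the lines captured from the kubectl logs command for the model server deployment.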
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands:

    gcloud container clusters delete ${CLUSTER_NAME} --location=${LOCATION}
    gcloud iam service-accounts delete --quiet wi-sax@${PROJECT_ID}.iam.gserviceaccount.com
    gcloud storage rm --recursive gs://ADMIN_BUCKET_NAME
    gcloud storage rm --recursive gs://CHECKPOINTS_BUCKET_NAME

Replace the following:
- ADMIN_BUCKET_NAME: The name of the Cloud Storage bucket that stores the Saxml Admin server.
- CHECKPOINTS_BUCKET_NAME: The name of the Cloud Storage bucket that stores the model checkpoints.
What's next
- Learn more about TPUs in GKE.
- Explore the Saxml GitHub repository, including the Saxml HTTP APIs.
- Explore the Vertex AI Model Garden.
- Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-18 UTC.