Serve Gemma open models using TPUs on Vertex AI with Saxml

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

This guide shows you how to serve Gemma open models, a family of large language models (LLMs), using Tensor Processing Units (TPUs) on Vertex AI with Saxml. In this guide, you download the 2B and 7B parameter instruction-tuned Gemma models to Cloud Storage and deploy them on Vertex AI, which runs Saxml on TPUs.

Background

By serving Gemma using TPUs on Vertex AI with Saxml, you can take advantage of a managed AI solution that takes care of low-level infrastructure and offers a cost-effective way to serve LLMs. This section describes the key technologies used in this tutorial.

Gemma

Gemma is a set of openly available, lightweight, generative artificial intelligence (AI) models released under an open license. These AI models are available to run in your applications, hardware, mobile devices, or hosted services. You can use the Gemma models for text generation, and you can also tune these models for specialized tasks.

To learn more, see the Gemma documentation.

Saxml

Saxml is an experimental system that serves Paxml, JAX, and PyTorch models for inference. This tutorial covers how to serve Gemma on TPUs, which are more cost efficient for Saxml; setup for GPUs is similar. Saxml provides scripts to build containers for Vertex AI, which this tutorial uses.

TPUs

TPUs are Google's custom-developed application-specific integrated circuits(ASICs) used to accelerate data processing frameworks such as TensorFlow,PyTorch, and JAX.

This tutorial serves the Gemma 2B and Gemma 7B models. Vertex AI hosts these models on the following single-host TPU v5e node pools:

  • Gemma 2B: Hosted in a TPU v5e node pool with a 1x1 topology that represents one TPU chip. The machine type for the nodes is ct5lp-hightpu-1t.
  • Gemma 7B: Hosted in a TPU v5e node pool with a 2x2 topology that represents four TPU chips. The machine type for the nodes is ct5lp-hightpu-4t.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Vertex AI API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API


  5. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

This tutorial assumes that you are using Cloud Shell to interact with Google Cloud. If you want to use a different shell instead of Cloud Shell, then perform the following additional configuration:

  1. Install the Google Cloud CLI.

  2. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  3. To initialize the gcloud CLI, run the following command:

    gcloud init
  6. Make sure that you have sufficient quota for TPU v5e chips for Vertex AI. By default, this quota is 0. For a 1x1 topology, it must be at least 1. For a 2x2 topology, it must be at least 4. To run both topologies, it must be at least 5.
  7. Create a Kaggle account, if you don't already have one.
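Whether you use Cloud Shell or your own shell, it can help to set the default project and region up front so that the commands in later sections run in a consistent context. The following is a minimal sketch; PROJECT_ID is a placeholder for your own project ID, and us-west1 is used because it is the only region where Vertex AI supports TPUs (see the Limitations section):

# Set the default project for the gcloud commands in this tutorial.
gcloud config set project PROJECT_ID

# TPUs on Vertex AI are available only in us-west1; use this value
# wherever the LOCATION placeholder appears in later commands.
gcloud config set ai/region us-west1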

Get access to the model

Note that Cloud Shell might not have sufficient resources to download model weights. If so, you can create a Vertex AI Workbench instance to perform that task.

To get access to the Gemma models for deployment to Vertex AI, you must sign in to the Kaggle platform, sign the license consent agreement, and get a Kaggle API token. In this tutorial, you upload the Kaggle credentials to Cloud Shell.

Sign the license consent agreement

You must sign the consent agreement to use Gemma. Follow these instructions:

  1. Access the model consent page on Kaggle.com.
  2. Sign in to Kaggle if you haven't done so already.
  3. Click Request Access.
  4. In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for consent.
  5. Accept the model Terms and Conditions.

Generate an access token

To access the model through Kaggle, you need a Kaggle API token.

Follow these steps to generate a new token if you don't have one already:

  1. In your browser, go to Kaggle settings.
  2. Under the API section, click Create New Token.

    A file named kaggle.json is downloaded.

Upload the access token to Cloud Shell

In Cloud Shell, you can upload the Kaggle API token to your Google Cloud project:

  1. In Cloud Shell, click More > Upload.
  2. Select File and click Choose Files.
  3. Open the kaggle.json file.
  4. Click Upload.
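The kaggle command-line tool that you use in a later section reads credentials from the ~/.kaggle directory by default. As a small setup sketch, move the uploaded token into place and restrict its permissions (the Kaggle CLI warns if the file is readable by other users):

# Move the API token to the location the Kaggle CLI expects.
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/kaggle.json

# Restrict permissions so only you can read the token.
chmod 600 ~/.kaggle/kaggle.json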

Create the Cloud Storage bucket

Create a Cloud Storage bucket to store the model checkpoints.

In Cloud Shell, run the following:

gcloud storage buckets create gs://CHECKPOINTS_BUCKET_NAME

Replace CHECKPOINTS_BUCKET_NAME with the name of the Cloud Storage bucket that stores the model checkpoints.
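If you don't specify a location, the bucket is created in a default multi-region. To keep the checkpoints in the same region where you deploy the model, you can pass a location flag; a sketch, assuming the us-west1 region used elsewhere in this tutorial:

# Create the bucket in the same region as the deployment.
gcloud storage buckets create gs://CHECKPOINTS_BUCKET_NAME --location=us-west1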

Copy model to Cloud Storage bucket

In Cloud Shell, run the following:

pip install kaggle --break-system-packages

# For Gemma 2B
mkdir -p /data/gemma_2b-it
kaggle models instances versions download google/gemma/pax/2b-it/1 --untar -p /data/gemma_2b-it
gcloud storage cp /data/gemma_2b-it/* gs://CHECKPOINTS_BUCKET_NAME/gemma_2b-it/ --recursive

# For Gemma 7B
mkdir -p /data/gemma_7b-it
kaggle models instances versions download google/gemma/pax/7b-it/1 --untar -p /data/gemma_7b-it
gcloud storage cp /data/gemma_7b-it/* gs://CHECKPOINTS_BUCKET_NAME/gemma_7b-it/ --recursive
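You can optionally confirm that the checkpoints landed in the bucket before moving on; for example:

# List the uploaded checkpoint files for the 2B model.
gcloud storage ls --recursive gs://CHECKPOINTS_BUCKET_NAME/gemma_2b-it/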

Deploying the model

Upload a model

To upload a Model resource that uses your Saxml container, run the following gcloud ai models upload command:

Gemma 2B-it

gcloud ai models upload \
  --region=LOCATION \
  --display-name=DEPLOYED_MODEL_NAME \
  --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest \
  --artifact-uri='gs://CHECKPOINTS_BUCKET_NAME/gemma_2b-it/' \
  --container-args='--model_path=saxml.server.pax.lm.params.gemma.Gemma2BFP16' \
  --container-args='--platform_chip=tpuv5e' \
  --container-args='--platform_topology=2x2' \
  --container-args='--ckpt_path_suffix=checkpoint_00000000' \
  --container-ports=8502

Gemma 7B-it

gcloud ai models upload \
  --region=LOCATION \
  --display-name=DEPLOYED_MODEL_NAME \
  --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest \
  --artifact-uri='gs://CHECKPOINTS_BUCKET_NAME/gemma_7b-it/' \
  --container-args='--model_path=saxml.server.pax.lm.params.gemma.Gemma7BFP16' \
  --container-args='--platform_chip=tpuv5e' \
  --container-args='--platform_topology=2x2' \
  --container-args='--ckpt_path_suffix=checkpoint_00000000' \
  --container-ports=8502

Replace the following:

  • LOCATION: The region where you are using Vertex AI. Note that TPUs are only available in us-west1.
  • DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
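To confirm that the upload succeeded, you can list the model and note its numeric ID, which the deployment commands below look up by display name; for example:

# Verify the uploaded Model resource exists in the region.
gcloud ai models list \
  --region=LOCATION \
  --filter=display_name=DEPLOYED_MODEL_NAME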

Create an endpoint

You must deploy the model to an endpoint before the model can be used to serve online inferences. If you are deploying the model to an existing endpoint, you can skip this step. The following example uses the gcloud ai endpoints create command:

gcloud ai endpoints create \
  --region=LOCATION \
  --display-name=ENDPOINT_NAME

Replace the following:

  • LOCATION: The region where you are using Vertex AI.
  • ENDPOINT_NAME: The display name for the endpoint.

The Google Cloud CLI tool might take a few seconds to create the endpoint.

Deploy the model to the endpoint

After the endpoint is ready, deploy the model to the endpoint.

ENDPOINT_ID=$(gcloud ai endpoints list \
  --region=LOCATION \
  --filter=display_name=ENDPOINT_NAME \
  --format="value(name)")

MODEL_ID=$(gcloud ai models list \
  --region=LOCATION \
  --filter=display_name=DEPLOYED_MODEL_NAME \
  --format="value(name)")

gcloud ai endpoints deploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --model=$MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --machine-type=ct5lp-hightpu-4t \
  --traffic-split=0=100

Replace the following:

  • LOCATION: The region where you are using Vertex AI.
  • ENDPOINT_NAME: The display name for the endpoint.
  • DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.

Gemma 2B can be deployed on a smaller ct5lp-hightpu-1t machine. In that case, specify --platform_topology=1x1 when you upload the model.
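For that 1x1 variant, two values change relative to the Gemma 2B-it commands shown earlier; this is a sketch of just the differing flags, not a complete command:

# In gcloud ai models upload, use the 1x1 topology instead of 2x2:
#   --container-args='--platform_topology=1x1'
# In gcloud ai endpoints deploy-model, use the single-chip machine type:
#   --machine-type=ct5lp-hightpu-1t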

The Google Cloud CLI tool might take a few minutes to deploy the model to the endpoint. When the model is successfully deployed, this command prints the following output:

  Deployed a model to the endpoint xxxxx. Id of the deployed model: xxxxx.
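You can also verify the deployment by describing the endpoint; the deployedModels field in the output should list the model and its ID:

# Inspect the endpoint; the output includes the deployedModels field.
gcloud ai endpoints describe $ENDPOINT_ID \
  --region=LOCATION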

Getting online inferences from the deployed model

To invoke the model through the Vertex AI endpoint, format the inference request by using a standard Inference Request JSON Object.

The following example uses the gcloud ai endpoints predict command:

ENDPOINT_ID=$(gcloud ai endpoints list \
  --region=LOCATION \
  --filter=display_name=ENDPOINT_NAME \
  --format="value(name)")

gcloud ai endpoints predict $ENDPOINT_ID \
  --region=LOCATION \
  --http-headers=Content-Type=application/json \
  --json-request instances.json

Replace the following:

  • LOCATION: The region where you are using Vertex AI.
  • ENDPOINT_NAME: The display name for the endpoint.
  • instances.json: A JSON file in the following format: {"instances": [{"text_batch": "<your prompt>"},{...}]}. An example follows this list.
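For example, you can create a minimal instances.json with a single prompt (the prompt text here is just an illustration):

# Write a request body with one prompt to instances.json.
cat > instances.json <<'EOF'
{
  "instances": [
    {"text_batch": "What is a TPU?"}
  ]
}
EOF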

Cleaning up

To avoid incurring further Vertex AI charges and Artifact Registry charges, delete the Google Cloud resources that you created during this tutorial:

  1. To undeploy the model from the endpoint and delete the endpoint, run the following command in your shell:

    ENDPOINT_ID=$(gcloud ai endpoints list \
      --region=LOCATION \
      --filter=display_name=ENDPOINT_NAME \
      --format="value(name)")

    DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
      --region=LOCATION \
      --format="value(deployedModels.id)")

    gcloud ai endpoints undeploy-model $ENDPOINT_ID \
      --region=LOCATION \
      --deployed-model-id=$DEPLOYED_MODEL_ID

    gcloud ai endpoints delete $ENDPOINT_ID \
      --region=LOCATION \
      --quiet

    Replace LOCATION with the region where you created your model in a previous section.

  2. To delete your model, run the following command in your shell:

    MODEL_ID=$(gcloud ai models list \
      --region=LOCATION \
      --filter=display_name=DEPLOYED_MODEL_NAME \
      --format="value(name)")

    gcloud ai models delete $MODEL_ID \
      --region=LOCATION \
      --quiet

    Replace LOCATION with the region where you created your model in a previous section.
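  3. The Cloud Storage bucket that holds the model checkpoints also continues to incur storage charges. If you no longer need the checkpoints, delete the bucket and its contents by running the following command in your shell:

    gcloud storage rm --recursive gs://CHECKPOINTS_BUCKET_NAME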

Limitations

  • On Vertex AI, Cloud TPUs are supported only in us-west1. For more information, see locations.

What's next

  • Learn how to deploy other Saxml models such as Llama2 and GPT-J.
