Serve Llama 3 open models using multi-host Cloud TPUs on Vertex AI with Saxml
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
Llama 3 is an open-source large language model (LLM) from Meta. This guide shows you how to serve a Llama 3 LLM using multi-host Tensor Processing Units (TPUs) on Vertex AI with Saxml.
In this guide, you download the Llama 3 70B model weights and tokenizer and deploy them on Vertex AI, which runs Saxml on TPUs.
Note: A GPU-only version of Llama 3 is available in Model Garden. For more information about Model Garden, see Explore AI models in Model Garden.

Before you begin
We recommend that you use an M2 memory-optimized VM for downloading the model and converting it to Saxml. This is because the model conversion process requires significant memory and may fail if you choose a machine type that doesn't have enough memory.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the Vertex AI and Artifact Registry APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
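If you prefer to work from the command line, you can enable the same APIs with the gcloud CLI instead of the console. A minimal sketch, assuming your project is already selected:

```
# Enable the Vertex AI and Artifact Registry APIs for the current project.
gcloud services enable aiplatform.googleapis.com artifactregistry.googleapis.com
```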
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
- Follow the Artifact Registry documentation to Install Docker.
- Make sure that you have sufficient quotas for 16 TPU v5e chips for Vertex AI.
This tutorial assumes that you are using Cloud Shell to interact with Google Cloud. If you want to use a different shell instead of Cloud Shell, then perform the following additional configuration:
Install the Google Cloud CLI.
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
To initialize the gcloud CLI, run the following command:

```
gcloud init
```
If you're using a different shell instead of Cloud Shell for model deployment, make sure that the Google Cloud CLI version is later than 475.0.0. You can update the Google Cloud CLI by running the gcloud components update command.

If you're deploying your model using the Vertex AI SDK, make sure that you have version 1.50.0 or later.
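For example, you can verify both versions from your shell. A quick check, assuming the Vertex AI SDK was installed with pip:

```
# Print the installed Google Cloud CLI version (should be later than 475.0.0).
gcloud version

# Print the installed Vertex AI SDK version (should be 1.50.0 or later).
pip show google-cloud-aiplatform
```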
Get access to the model and download the model weights
The following steps are for a Vertex AI Workbench instance that has an M2 memory-optimized VM. For information on changing the machine type of a Vertex AI Workbench instance, see Change machine type of a Vertex AI Workbench instance.
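If you don't already have such an instance, the following is a minimal sketch of creating one with the gcloud CLI; the instance name, zone, and specific M2 machine type shown here are illustrative, so adjust them for your project:

```
# Create a Vertex AI Workbench instance on an M2 memory-optimized machine type.
gcloud workbench instances create llama3-conversion-instance \
    --location=us-central1-a \
    --machine-type=m2-ultramem-208
```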
Go to the Llama model consent page.

Select Llama 3, fill out the consent form, and accept the terms and conditions.
Check your inbox for an email containing a signed URL.
Download the download.sh script from GitHub by executing the following command:

```
wget https://raw.githubusercontent.com/meta-llama/llama3/main/download.sh
chmod +x download.sh
```

To download the model weights, run the download.sh script that you downloaded from GitHub. When prompted, enter the signed URL from the email you received in the previous section.

When prompted for the models to download, enter 70B.
Convert the model weights to Saxml format
Run the following command to download Saxml:

```
git clone https://github.com/google/saxml.git
```

Run the following commands to configure a Python virtual environment:

```
python -m venv .
source bin/activate
```

Run the following commands to install dependencies:

```
pip install --upgrade pip
pip install paxml
pip install praxis
pip install torch
```

To convert the model weights to Saxml format, run the following command:

```
python3 saxml/saxml/tools/convert_llama_ckpt.py \
    --base PATH_TO_META_LLAMA3 \
    --pax PATH_TO_PAX_LLAMA3 \
    --model-size llama3_70b
```

Replace the following:

- PATH_TO_META_LLAMA3: the path to the directory containing the downloaded model weights
- PATH_TO_PAX_LLAMA3: the path to the directory in which to store the converted model weights
The converted model is put into the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder.

Copy the tokenizer file from the original directory into a subfolder named vocabs as follows:

```
cp $PATH_TO_META_LLAMA3/tokenizer.model $PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
```

Add an empty commit_success.txt file in the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder and in its metadata and state subfolders, as follows:

```
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
```

The $PATH_TO_PAX_LLAMA3 folder now contains the following folders and files:

```
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
$PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
```
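Before moving on, it can help to confirm that your directory layout matches this listing. A simple check, not part of the official steps:

```
# List every file under the converted model directory.
find $PATH_TO_PAX_LLAMA3 -type f
```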
Create a Cloud Storage bucket
Create a Cloud Storage bucket to store the converted model weights.
In Cloud Shell, run the following commands, replacing PROJECT_ID with your project ID:

```
projectid=PROJECT_ID
gcloud config set project ${projectid}
```

To create the bucket, run the following command:

```
gcloud storage buckets create gs://WEIGHTS_BUCKET_NAME
```

Replace WEIGHTS_BUCKET_NAME with the name you want to use for the bucket.
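To confirm that the bucket was created, you can optionally describe it:

```
# Print the bucket's metadata; this fails if the bucket doesn't exist.
gcloud storage buckets describe gs://WEIGHTS_BUCKET_NAME
```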
Copy the model weights to the Cloud Storage bucket
To copy the model weights to your bucket, run the following command:
```
gcloud storage cp PATH_TO_PAX_LLAMA3/* gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/ --recursive
```
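To verify that the weights landed in the expected location, you can optionally list the bucket contents:

```
# List the uploaded checkpoint and vocab files.
gcloud storage ls --recursive gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/
```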
Upload the model

A prebuilt Saxml container is available at us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest.
To upload a Model resource to Vertex AI using the prebuilt Saxml container, run the gcloud ai models upload command as follows:

```
gcloud ai models upload \
    --region=LOCATION \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest \
    --artifact-uri='gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b' \
    --container-args='--model_path=saxml.server.pax.lm.params.lm_cloud.LLaMA3_70BFP16x16' \
    --container-args='--platform_chip=tpuv5e' \
    --container-args='--platform_topology=4x4' \
    --container-args='--ckpt_path_suffix=checkpoint_00000000' \
    --container-deployment-timeout-seconds=2700 \
    --container-ports=8502 \
    --project=PROJECT_ID
```

Make the following replacements:

- LOCATION: the region where you are using Vertex AI. Note that TPUs are only available in us-west1.
- MODEL_DISPLAY_NAME: the display name you want for your model
- PROJECT_ID: the ID of your Google Cloud project
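To confirm that the Model resource was created, you can optionally list models by display name:

```
# Print the resource name of the uploaded model.
gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=MODEL_DISPLAY_NAME \
    --format="value(name)"
```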
Create an online inference endpoint
To create the endpoint, run the following command:
```
gcloud ai endpoints create \
    --region=LOCATION \
    --display-name=ENDPOINT_DISPLAY_NAME \
    --project=PROJECT_ID
```

Replace ENDPOINT_DISPLAY_NAME with the display name you want for your endpoint.
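To confirm that the endpoint was created, you can optionally list endpoints by display name:

```
# Print the resource name of the new endpoint.
gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_DISPLAY_NAME \
    --format="value(name)"
```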
Deploy the model to the endpoint
After the endpoint is ready, deploy the model to the endpoint.
In this tutorial, you deploy a Llama 3 70B model that's sharded for 16 Cloud TPU v5e chips using a 4x4 topology. However, you can specify any of the following supported multi-host Cloud TPU topologies:
| Machine type | Topology | Number of TPU chips | Number of hosts |
|---|---|---|---|
| ct5lp-hightpu-4t | 4x4 | 16 | 2 |
| ct5lp-hightpu-4t | 4x8 | 32 | 4 |
| ct5lp-hightpu-4t | 8x8 | 64 | 8 |
| ct5lp-hightpu-4t | 8x16 | 128 | 16 |
| ct5lp-hightpu-4t | 16x16 | 256 | 32 |
If you're deploying a different Llama model that's defined in the Saxml GitHub repo, make sure that it's partitioned to match the number of devices you're targeting and that Cloud TPU has sufficient memory to load the model.

For information about deploying a model on single-host Cloud TPUs, see Deploy a model.

For a full list of supported Cloud TPU types and regions, see Vertex AI Locations.
Get the endpoint ID for the online inference endpoint:
```
ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_NAME \
    --format="value(name)")
```

Get the model ID for your model:

```
MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=DEPLOYED_MODEL_NAME \
    --format="value(name)")
```

Deploy the model to the endpoint:

```
gcloud ai endpoints deploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --model=$MODEL_ID \
    --display-name=DEPLOYED_MODEL_NAME \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=4x4 \
    --traffic-split=0=100
```

Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).
The deployment operation might time out.

The deploy-model command returns an operation ID that can be used to check when the operation is finished. You can poll for the status of the operation until the response includes "done": true. Use the following command to poll the status:

```
gcloud ai operations describe \
    --region=LOCATION \
    OPERATION_ID
```

Replace OPERATION_ID with the operation ID that was returned by the previous command.
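If you'd rather wait in a loop than poll by hand, the following is a minimal sketch of an automated poll; the 30-second interval is arbitrary:

```
# Poll the operation until its done field reports True.
while [[ "$(gcloud ai operations describe OPERATION_ID \
    --region=LOCATION \
    --format='value(done)')" != "True" ]]; do
  echo "Deployment still in progress..."
  sleep 30
done
echo "Deployment finished."
```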
Get online inferences from the deployed model
To get online inferences from the Vertex AI endpoint, run the gcloud ai endpoints predict command.

Run the following command to create a request.json file containing a sample inference request:

```
cat << EOF > request.json
{
  "instances": [
    {
      "text_batch": "the distance between Earth and Moon is "
    }
  ]
}
EOF
```

To send the online inference request to the endpoint, run the following command:

```
gcloud ai endpoints predict $ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION \
    --json-request=request.json
```
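Equivalently, you can call the endpoint's REST predict method directly. The following sketch assumes your gcloud credentials are available to mint the bearer token:

```
# Send the same request with curl against the Vertex AI REST API.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/${ENDPOINT_ID}:predict" \
    -d @request.json
```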
Clean up
To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial:
To undeploy the model from the endpoint and delete the endpoint, run the following commands:
```
ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_NAME \
    --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
    --region=LOCATION \
    --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
    --region=LOCATION \
    --quiet
```

To delete your model, run the following commands:

```
MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=DEPLOYED_MODEL_NAME \
    --format="value(name)")

gcloud ai models delete $MODEL_ID \
    --region=LOCATION \
    --quiet
```
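If you no longer need the converted model weights, you can also delete the Cloud Storage bucket you created earlier. Note that this removes the bucket and everything in it:

```
# Delete the weights bucket and all of its contents.
gcloud storage rm --recursive gs://WEIGHTS_BUCKET_NAME
```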