Serve Llama 3 open models using multi-host Cloud TPUs on Vertex AI with Saxml
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
Llama 3 is an open-source large language model (LLM) from Meta. This guide shows you how to serve a Llama 3 LLM using multi-host Tensor Processing Units (TPUs) on Vertex AI with Saxml.
In this guide, you download the Llama 3 70B model weights and tokenizer and deploy them on Vertex AI, which runs Saxml on TPUs.
Note: A GPU-only version of Llama 3 is available in Model Garden. For more information about Model Garden, see Explore AI models in Model Garden.

Before you begin
We recommend that you use an M2 memory-optimized VM for downloading the model and converting it to Saxml. This is because the model conversion process requires significant memory and may fail if you choose a machine type that doesn't have enough memory.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the Vertex AI and Artifact Registry APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
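If you prefer to work from the command line, you can enable the same APIs with the gcloud CLI instead of the console. A minimal sketch, assuming your project is already selected:

```
# Enable the Vertex AI and Artifact Registry APIs for the current project.
gcloud services enable aiplatform.googleapis.com artifactregistry.googleapis.com
```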
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
- Follow the Artifact Registry documentation to Install Docker.
- Make sure that you have sufficient quotas for 16 TPU v5e chips for Vertex AI.
This tutorial assumes that you are using Cloud Shell to interact with Google Cloud. If you want to use a different shell instead of Cloud Shell, then perform the following additional configuration:
Install the Google Cloud CLI.
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
To initialize the gcloud CLI, run the following command:

```
gcloud init
```
If you're using a different shell instead of Cloud Shell for model deployment, make sure that the Google Cloud CLI version is later than 475.0.0. You can update the Google Cloud CLI by running the gcloud components update command.

If you're deploying your model using the Vertex AI SDK, make sure that you have version 1.50.0 or later.
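For example, you can verify both versions from your shell. A quick check, assuming the Vertex AI SDK was installed with pip:

```
# Print the installed Google Cloud CLI version (should be later than 475.0.0).
gcloud version

# Print the installed Vertex AI SDK version (should be 1.50.0 or later).
pip show google-cloud-aiplatform
```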
Get access to the model and download the model weights
The following steps are for a Vertex AI Workbench instance that has an M2 memory-optimized VM. For information on changing the machine type of a Vertex AI Workbench instance, see Change machine type of a Vertex AI Workbench instance.
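If you don't already have such an instance, the following is a minimal sketch of creating one with the gcloud CLI; the instance name, zone, and specific M2 machine type shown here are illustrative, so adjust them for your project:

```
# Create a Vertex AI Workbench instance on an M2 memory-optimized machine type.
gcloud workbench instances create llama3-conversion-instance \
    --location=us-central1-a \
    --machine-type=m2-ultramem-208
```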
Go to the Llama model consent page.

Select Llama 3, fill out the consent form, and accept the terms and conditions.
Check your inbox for an email containing a signed URL.
Download the download.sh script from GitHub by executing the following command:

```
wget https://raw.githubusercontent.com/meta-llama/llama3/main/download.sh
chmod +x download.sh
```

To download the model weights, run the download.sh script that you downloaded from GitHub. When prompted, enter the signed URL from the email you received in the previous section.

When prompted for the models to download, enter 70B.
Convert the model weights to Saxml format
Run the following command to download Saxml:

```
git clone https://github.com/google/saxml.git
```

Run the following commands to configure a Python virtual environment:

```
python -m venv .
source bin/activate
```

Run the following commands to install dependencies:

```
pip install --upgrade pip
pip install paxml
pip install praxis
pip install torch
```

To convert the model weights to Saxml format, run the following command:

```
python3 saxml/saxml/tools/convert_llama_ckpt.py \
    --base PATH_TO_META_LLAMA3 \
    --pax PATH_TO_PAX_LLAMA3 \
    --model-size llama3_70b
```

Replace the following:

- PATH_TO_META_LLAMA3: the path to the directory containing the downloaded model weights
- PATH_TO_PAX_LLAMA3: the path to the directory in which to store the converted model weights
The converted model is put into the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder.

Copy the tokenizer file from the original directory into a subfolder named vocabs as follows:

```
cp $PATH_TO_META_LLAMA3/tokenizer.model $PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
```

Add an empty commit_success.txt file in the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder and in its metadata and state subfolders, as follows:

```
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
```

The $PATH_TO_PAX_LLAMA3 folder now contains the following folders and files:

```
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
$PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
```
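Before moving on, it can help to confirm that your directory layout matches this listing. A simple check, not part of the official steps:

```
# List every file under the converted model directory.
find $PATH_TO_PAX_LLAMA3 -type f
```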
Create a Cloud Storage bucket
Create a Cloud Storage bucket to store the converted model weights.
In Cloud Shell, run the following commands, replacing PROJECT_ID with your project ID:

```
projectid=PROJECT_ID
gcloud config set project ${projectid}
```

To create the bucket, run the following command:

```
gcloud storage buckets create gs://WEIGHTS_BUCKET_NAME
```

Replace WEIGHTS_BUCKET_NAME with the name you want to use for the bucket.
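To confirm that the bucket was created, you can optionally describe it:

```
# Print the bucket's metadata; this fails if the bucket doesn't exist.
gcloud storage buckets describe gs://WEIGHTS_BUCKET_NAME
```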
Copy the model weights to the Cloud Storage bucket
To copy the model weights to your bucket, run the following command:
```
gcloud storage cp PATH_TO_PAX_LLAMA3/* gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/ --recursive
```
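To verify that the weights landed in the expected location, you can optionally list the bucket contents:

```
# List the uploaded checkpoint and vocab files.
gcloud storage ls --recursive gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/
```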
Upload the model

A prebuilt Saxml container is available at us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest.
To upload a Model resource to Vertex AI using the prebuilt Saxml container, run the gcloud ai models upload command as follows:

```
gcloud ai models upload \
    --region=LOCATION \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest \
    --artifact-uri='gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b' \
    --container-args='--model_path=saxml.server.pax.lm.params.lm_cloud.LLaMA3_70BFP16x16' \
    --container-args='--platform_chip=tpuv5e' \
    --container-args='--platform_topology=4x4' \
    --container-args='--ckpt_path_suffix=checkpoint_00000000' \
    --container-deployment-timeout-seconds=2700 \
    --container-ports=8502 \
    --project=PROJECT_ID
```

Make the following replacements:

- LOCATION: the region where you are using Vertex AI. Note that TPUs are only available in us-west1.
- MODEL_DISPLAY_NAME: the display name you want for your model
- PROJECT_ID: the ID of your Google Cloud project
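To confirm that the Model resource was created, you can optionally list models by display name:

```
# Print the resource name of the uploaded model.
gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=MODEL_DISPLAY_NAME \
    --format="value(name)"
```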
Create an online inference endpoint
To create the endpoint, run the following command:
```
gcloud ai endpoints create \
    --region=LOCATION \
    --display-name=ENDPOINT_DISPLAY_NAME \
    --project=PROJECT_ID
```

Replace ENDPOINT_DISPLAY_NAME with the display name you want for your endpoint.
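To confirm that the endpoint was created, you can optionally list endpoints by display name:

```
# Print the resource name of the new endpoint.
gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_DISPLAY_NAME \
    --format="value(name)"
```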
Deploy the model to the endpoint
After the endpoint is ready, deploy the model to the endpoint.
In this tutorial, you deploy a Llama 3 70B model that's sharded for 16 Cloud TPU v5e chips using a 4x4 topology. However, you can specify any of the following supported multi-host Cloud TPU topologies:
| Machine type | Topology | Number of TPU chips | Number of hosts |
|---|---|---|---|
| ct5lp-hightpu-4t | 4x4 | 16 | 2 |
| ct5lp-hightpu-4t | 4x8 | 32 | 4 |
| ct5lp-hightpu-4t | 8x8 | 64 | 8 |
| ct5lp-hightpu-4t | 8x16 | 128 | 16 |
| ct5lp-hightpu-4t | 16x16 | 256 | 32 |
If you're deploying a different Llama model that's defined in the Saxml GitHub repo, make sure that it's partitioned to match the number of devices you're targeting and that Cloud TPU has sufficient memory to load the model.

For information about deploying a model on single-host Cloud TPUs, see Deploy a model.

For a full list of supported Cloud TPU types and regions, see Vertex AI Locations.
Get the endpoint ID for the online inference endpoint:
```
ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_NAME \
    --format="value(name)")
```

Get the model ID for your model:

```
MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=DEPLOYED_MODEL_NAME \
    --format="value(name)")
```

Deploy the model to the endpoint:

```
gcloud ai endpoints deploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --model=$MODEL_ID \
    --display-name=DEPLOYED_MODEL_NAME \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=4x4 \
    --traffic-split=0=100
```

Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).
The deployment operation might time out.

The deploy-model command returns an operation ID that can be used to check when the operation is finished. You can poll for the status of the operation until the response includes "done": true. Use the following command to poll the status:

```
gcloud ai operations describe \
    --region=LOCATION \
    OPERATION_ID
```

Replace OPERATION_ID with the operation ID that was returned by the previous command.
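If you'd rather wait in a loop than poll by hand, the following is a minimal sketch of an automated poll; the 30-second interval is arbitrary:

```
# Poll the operation until its done field reports True.
while [[ "$(gcloud ai operations describe OPERATION_ID \
    --region=LOCATION \
    --format='value(done)')" != "True" ]]; do
  echo "Deployment still in progress..."
  sleep 30
done
echo "Deployment finished."
```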
Get online inferences from the deployed model
To get online inferences from the Vertex AI endpoint, run the gcloud ai endpoints predict command.

Run the following command to create a request.json file containing a sample inference request:

```
cat << EOF > request.json
{
  "instances": [
    {
      "text_batch": "the distance between Earth and Moon is "
    }
  ]
}
EOF
```

To send the online inference request to the endpoint, run the following command:

```
gcloud ai endpoints predict $ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION \
    --json-request=request.json
```
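Equivalently, you can call the endpoint's REST predict method directly. The following sketch assumes your gcloud credentials are available to mint the bearer token:

```
# Send the same request with curl against the Vertex AI REST API.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/${ENDPOINT_ID}:predict" \
    -d @request.json
```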
Clean up
To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial:
To undeploy the model from the endpoint and delete the endpoint, run the following commands:
```
ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_NAME \
    --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
    --region=LOCATION \
    --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
    --region=LOCATION \
    --quiet
```

To delete your model, run the following commands:

```
MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=DEPLOYED_MODEL_NAME \
    --format="value(name)")

gcloud ai models delete $MODEL_ID \
    --region=LOCATION \
    --quiet
```
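If you no longer need the converted model weights, you can also delete the Cloud Storage bucket you created earlier. Note that this removes the bucket and everything in it:

```
# Delete the weights bucket and all of its contents.
gcloud storage rm --recursive gs://WEIGHTS_BUCKET_NAME
```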