Deploy a model to Cloud TPU VMs
Google Cloud provides access to custom-designed machine learning accelerators called Tensor Processing Units (TPUs). TPUs are optimized to accelerate the training and inference of machine learning models, making them ideal for a variety of applications, including natural language processing, computer vision, and speech recognition.
This page describes how to deploy your models to a single-host Cloud TPU v5e or v6e for online inference in Vertex AI.
Note: Multi-host deployment is now available in Public preview. For more information, see Serve Llama 3 open models using multi-host Cloud TPUs on Vertex AI with Saxml. Only Cloud TPU versions v5e and v6e are supported. Other Cloud TPU generations are not supported.
To learn which locations Cloud TPU versions v5e and v6e are available in, see locations.
Import your model
For deployment on Cloud TPUs, you must import your model to Vertex AI and configure it to use one of the following containers:
- prebuilt optimized TensorFlow runtime container, either the nightly version or version 2.15 or later
- prebuilt PyTorch TPU container, version 2.1 or later
- your own custom container that supports TPUs
Prebuilt optimized TensorFlow runtime container
To import and run a TensorFlow SavedModel on a Cloud TPU, the model must be TPU-optimized. If your TensorFlow SavedModel isn't already TPU-optimized, you can optimize your model automatically. To do this, import your model, and Vertex AI then optimizes your unoptimized model by using an automatic partitioning algorithm. This optimization doesn't work on all models. If optimization fails, you must manually optimize your model.
The following sample code demonstrates how to use automatic model optimization with automatic partitioning:
```python
model = aiplatform.Model.upload(
    display_name='TPU optimized model with automatic partitioning',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[],
)
```

For more information on importing models, see importing models to Vertex AI.
Prebuilt PyTorch container
The instructions to import and run a PyTorch model on Cloud TPU are the same as the instructions to import and run any PyTorch model.
For example, see TorchServe for Cloud TPU v5e Inference. Then, upload the model artifacts to your Cloud Storage folder and upload your model as shown:
```python
model = aiplatform.Model.upload(
    display_name='DenseNet TPU model from SDK PyTorch 2.1',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-tpu.2-1:latest",
    serving_container_args=[],
    serving_container_predict_route="/predictions/model",
    serving_container_health_route="/ping",
    serving_container_ports=[8080],
)
```

For more information, see export model artifacts for PyTorch and the tutorial notebook for Serve a PyTorch model using a prebuilt container.
Custom container
For custom containers, your model does not need to be a TensorFlow model, but it must be TPU-optimized. For information on producing a TPU-optimized model, see the following guides for common ML frameworks:
For information on serving models trained with JAX, TensorFlow, or PyTorch on Cloud TPU v5e, see Cloud TPU v5e Inference.
Make sure your custom container meets the custom container requirements.
You must raise the locked memory limit so the driver can communicate with the TPU chips over direct memory access (DMA). For example:
Command line
```shell
ulimit -l 68719476736
```

Python

```python
import resource

resource.setrlimit(
    resource.RLIMIT_MEMLOCK,
    (
        68_719_476_736_000,  # soft limit
        68_719_476_736_000,  # hard limit
    ),
)
```

Then, see Use a custom container for inference for information on importing a model with a custom container. If you want to implement pre- or post-processing logic, consider using Custom inference routines.
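As a quick sanity check before starting your server, you can read the current locked memory limit with `resource.getrlimit`. This is a minimal sketch; the helper name and default threshold (mirroring the value above) are our own, not part of the Vertex AI API:

```python
import resource


def memlock_is_sufficient(required_bytes=68_719_476_736_000):
    """Return True if the soft RLIMIT_MEMLOCK is unlimited or at least required_bytes.

    The default threshold mirrors the example value above; adjust it for your setup.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY or soft >= required_bytes
```

If this returns False, raise the limit as shown above before loading the model.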
Create an endpoint
The instructions for creating an endpoint for Cloud TPUs are the same as the instructions for creating any endpoint.
For example, the following command creates an endpoint resource:
```python
endpoint = aiplatform.Endpoint.create(display_name='My endpoint')
```

The response contains the new endpoint's ID, which you use in subsequent steps.
For more information on creating an endpoint, see deploy a model to an endpoint.
Deploy a model
The instructions for deploying a model to Cloud TPUs are the same as the instructions for deploying any model, except you specify one of the following supported Cloud TPU machine types:
| Machine type | Number of TPU chips |
|---|---|
| ct6e-standard-1t | 1 |
| ct6e-standard-4t | 4 |
| ct6e-standard-8t | 8 |
| ct5lp-hightpu-1t | 1 |
| ct5lp-hightpu-4t | 4 |
| ct5lp-hightpu-8t | 8 |
TPU accelerators are built into the machine type. You don't have to specify an accelerator type or accelerator count.
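For illustration only, the machine-type table above can be expressed as a small lookup helper. The mapping comes from the table; the dictionary and function names are our own, not part of the Vertex AI SDK:

```python
# Chip counts per supported Cloud TPU machine type (from the table above).
TPU_CHIPS_PER_MACHINE_TYPE = {
    "ct6e-standard-1t": 1,
    "ct6e-standard-4t": 4,
    "ct6e-standard-8t": 8,
    "ct5lp-hightpu-1t": 1,
    "ct5lp-hightpu-4t": 4,
    "ct5lp-hightpu-8t": 8,
}


def tpu_chip_count(machine_type: str) -> int:
    """Return the number of TPU chips for a supported Cloud TPU machine type."""
    try:
        return TPU_CHIPS_PER_MACHINE_TYPE[machine_type]
    except KeyError:
        raise ValueError(f"Unsupported Cloud TPU machine type: {machine_type}") from None
```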
For example, the following command deploys a model by calling deployModel:
```python
machine_type = 'ct5lp-hightpu-1t'

deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name='My deployed model',
    machine_type=machine_type,
    traffic_percentage=100,
    min_replica_count=1,
    sync=True,
)
```

Note: The deployment uses minReplicaCount and ignores maxReplicaCount for the deployed model.

For more information, see deploy a model to an endpoint.
Get online inferences
The instructions for getting online inferences from a Cloud TPU are the same as the instructions for getting online inferences from any other model.
For example, the following command sends an online inference request by calling predict:
```python
deployed_model.predict(...)
```

For custom containers, see the inference request and response requirements for custom containers.
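The exact request shape depends on your model's input signature. As a hedged sketch, the instance contents below are hypothetical; it shows the `instances` list that the predict call takes and the equivalent raw REST request body:

```python
import json

# Hypothetical input instances; the required keys and shapes depend on your model.
instances = [
    {"input_1": [1.0, 2.0, 3.0]},
    {"input_1": [4.0, 5.0, 6.0]},
]

# The SDK call above would pass these as: deployed_model.predict(instances=instances)
# The equivalent raw REST request body is a JSON object with an "instances" key:
request_body = json.dumps({"instances": instances})
```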
Securing capacity
For most regions, the TPU v5e and v6e cores per region quota for custom model serving is 0. In some regions, it is limited.
To request a quota increase, see Request a quota adjustment.
Pricing
TPU machine types are billed per hour, just like all other machine types in Vertex Prediction. For more information, see Prediction pricing.
What's next
- Learn how to get an online inference
Last updated 2025-11-24 UTC.