Deploy a model to Cloud TPU VMs

Google Cloud provides access to custom-designed machine learning accelerators called Tensor Processing Units (TPUs). TPUs are optimized to accelerate the training and inference of machine learning models, making them ideal for a variety of applications, including natural language processing, computer vision, and speech recognition.

This page describes how to deploy your models to a single-host Cloud TPU v5e or v6e for online inference in Vertex AI.

Note: Multi-host deployment is now available in Public preview. For more information, see Serve Llama 3 open models using multi-host Cloud TPUs on Vertex AI with Saxml.

Only Cloud TPU versions v5e and v6e are supported. Other Cloud TPU generations are not supported.

To learn which locations Cloud TPU versions v5e and v6e are available in, see locations.

Import your model

For deployment on Cloud TPUs, you must import your model to Vertex AI and configure it to use one of the following containers:

Prebuilt optimized TensorFlow runtime container

To import and run a TensorFlow SavedModel on a Cloud TPU, the model must be TPU-optimized. If your TensorFlow SavedModel isn't already TPU-optimized, you can optimize it automatically: import your model, and Vertex AI optimizes it by using an automatic partitioning algorithm. This optimization doesn't work on all models. If optimization fails, you must manually optimize your model.

The following sample code demonstrates how to use automatic model optimization with automatic partitioning:

from google.cloud import aiplatform

model = aiplatform.Model.upload(
    display_name='TPU optimized model with automatic partitioning',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[],
)

For more information on importing models, see Importing models to Vertex AI.

Prebuilt PyTorch container

The instructions to import and run a PyTorch model on Cloud TPU are the sameas the instructions to import and run a PyTorch model.

For an example, see TorchServe for Cloud TPU v5e inference. Then, upload the model artifacts to your Cloud Storage folder and upload your model as shown:

from google.cloud import aiplatform

model = aiplatform.Model.upload(
    display_name='DenseNet TPU model from SDK PyTorch 2.1',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-tpu.2-1:latest",
    serving_container_args=[],
    serving_container_predict_route="/predictions/model",
    serving_container_health_route="/ping",
    serving_container_ports=[8080],
)

For more information, see Export model artifacts for PyTorch and the tutorial notebook for Serve a PyTorch model using a prebuilt container.

Custom container

For custom containers, your model does not need to be a TensorFlow model, but it must be TPU-optimized. For information on producing a TPU-optimized model, see the following guides for common ML frameworks:

For information on serving models trained with JAX, TensorFlow, or PyTorch on Cloud TPU v5e, see Cloud TPU v5e Inference.

Make sure your custom container meets the custom container requirements.

You must raise the locked memory limit so the driver can communicate with the TPU chips over direct memory access (DMA). For example:

Command line

ulimit -l 68719476736

Python

import resource

resource.setrlimit(
    resource.RLIMIT_MEMLOCK,
    (
        68_719_476_736_000,  # soft limit
        68_719_476_736_000,  # hard limit
    ),
)
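To confirm that the limit took effect inside your container, you can read it back with the same module:

```python
import resource

# Read back the current locked-memory limits as a (soft, hard) pair.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```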

Then, see Use a custom container for inference for information on importing a model with a custom container. If you want to implement pre- or post-processing logic, consider using Custom inference routines.
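As a rough illustration of where such logic fits, pre- and post-processing can be written as plain functions wrapped around the model call. All names below are illustrative only, not part of the Vertex AI SDK or the Custom inference routines API:

```python
def preprocess(raw_instance: dict) -> list[float]:
    # Illustrative: scale a raw feature vector into [-1, 1] before inference.
    values = raw_instance["features"]
    peak = max(abs(v) for v in values) or 1.0
    return [v / peak for v in values]

def postprocess(scores: list[float]) -> dict:
    # Illustrative: attach the winning class index to the raw scores.
    return {"scores": scores, "predicted_class": scores.index(max(scores))}

# A custom inference routine would call these around the model's predict
# step; here the model call is skipped and the stages are chained by hand.
instance = {"features": [2.0, -4.0, 1.0]}
model_input = preprocess(instance)
result = postprocess(model_input)  # stand-in for a real model call
print(result)
```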

Create an endpoint

The instructions for creating an endpoint for Cloud TPUs are the same as the instructions for creating any endpoint.

For example, the following command creates an endpoint resource:

endpoint = aiplatform.Endpoint.create(display_name='My endpoint')

The response contains the new endpoint's ID, which you use in subsequent steps.
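The endpoint's full resource name follows a fixed pattern, so you can also build it from the ID if a later step needs it. The project number, region, and ID below are placeholders:

```python
# Placeholder values; substitute your own project number, region, and
# the endpoint ID returned by Endpoint.create().
project = "123456789"
location = "us-west1"
endpoint_id = "456"

# Vertex AI endpoint resource names follow this pattern.
endpoint_name = f"projects/{project}/locations/{location}/endpoints/{endpoint_id}"
print(endpoint_name)
```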

For more information on creating an endpoint, see Deploy a model to an endpoint.

Deploy a model

The instructions for deploying a model to Cloud TPUs are the same as the instructions for deploying any model, except you specify one of the following supported Cloud TPU machine types:

Machine Type          Number of TPU chips
ct6e-standard-1t      1
ct6e-standard-4t      4
ct6e-standard-8t      8
ct5lp-hightpu-1t      1
ct5lp-hightpu-4t      4
ct5lp-hightpu-8t      8

TPU accelerators are built into the machine type. You don't have to specify an accelerator type or accelerator count.
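Because the accelerators are implied by the machine type, the chip count can be looked up directly. A small helper table mirroring the machine types above (the dictionary itself is illustrative, not part of the SDK):

```python
# Chip counts for the supported single-host TPU machine types,
# taken from the table above.
TPU_CHIPS_PER_MACHINE = {
    "ct6e-standard-1t": 1,
    "ct6e-standard-4t": 4,
    "ct6e-standard-8t": 8,
    "ct5lp-hightpu-1t": 1,
    "ct5lp-hightpu-4t": 4,
    "ct5lp-hightpu-8t": 8,
}

print(TPU_CHIPS_PER_MACHINE["ct5lp-hightpu-1t"])  # → 1
```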

For example, the following command deploys a model by calling deployModel:

machine_type = 'ct5lp-hightpu-1t'
deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name='My deployed model',
    machine_type=machine_type,
    traffic_percentage=100,
    min_replica_count=1,
    sync=True,
)
Note: Autoscaling for deployments using TPUs is not supported at this time. The system always uses minReplicaCount and ignores maxReplicaCount for the deployed model.

For more information, see Deploy a model to an endpoint.

Get online inferences

The instructions for getting online inferences from a Cloud TPU are the same as the instructions for getting online inferences.

For example, the following command sends an online inference request by calling predict:

deployed_model.predict(...)
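If you call the endpoint over REST instead of the SDK, the request body wraps your instances in the standard `{"instances": [...]}` envelope. The instance fields below are placeholders; the actual payload depends on your model's serving signature:

```python
import json

# Placeholder instance; the required fields depend on the model's
# serving signature.
instances = [{"input": [1.0, 2.0, 3.0]}]

# Vertex AI online prediction requests use this envelope.
request_body = json.dumps({"instances": instances})
print(request_body)
```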

For custom containers, see the inference request and response requirements for custom containers.

Securing capacity

For most regions, the TPU v5e and v6e cores per region quota for custom model serving is 0. In some regions, it is limited.

To request a quota increase, see Request a quota adjustment.

Pricing

TPU machine types are billed per hour, just like all other machine types in Vertex Prediction. For more information, see Prediction pricing.



Last updated 2025-11-24 UTC.