Optimized TensorFlow runtime

The optimized TensorFlow runtime optimizes models for faster and lower-cost inference than open source-based prebuilt TensorFlow Serving containers. The optimized TensorFlow runtime does this by utilizing Google's proprietary and open source technologies.

The larger a machine learning (ML) model is, the more it can cost to serve it. With the optimized TensorFlow runtime, the cost of serving your ML model and the speed of inference can be lower compared to when you use an open source-based TensorFlow runtime. To take advantage of the optimized TensorFlow runtime when you use Vertex AI, you don't need to modify code. Instead, you choose a serving container image that uses it.

The optimized TensorFlow runtime is backward compatible with prebuilt TensorFlow Serving containers. If you're running TensorFlow models with a prebuilt container, you can switch to an optimized TensorFlow runtime container with minimal effort.

While the performance of your model improves when you use the optimized TensorFlow runtime, you should expect the performance impact to vary for different types of models.

Optimized TensorFlow runtime overview

The optimized TensorFlow runtime uses model optimizations and new proprietary Google technologies to improve the speed and lower the cost of inference compared to open source-based prebuilt TensorFlow Serving containers.

The optimization occurs when Vertex AI uploads a model, before it runs. After you deploy a model to an endpoint, the optimization log is added to the inference log. You can use these logs to troubleshoot problems that might occur during optimization.
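
For example, you can read these logs programmatically with the Cloud Logging client library. The following sketch is illustrative only; the filter values, including the resource type and the ENDPOINT_ID placeholder, are assumptions that you should adjust to match the log entries you actually see for your endpoint.

    from google.cloud import logging

    # Illustrative filter; adjust the resource type and labels to match the
    # entries produced for your deployed endpoint.
    client = logging.Client(project="PROJECT_ID")
    log_filter = (
        'resource.type="aiplatform.googleapis.com/Endpoint" '
        'AND resource.labels.endpoint_id="ENDPOINT_ID"'
    )
    for entry in client.list_entries(filter_=log_filter, max_results=20):
        print(entry.timestamp, entry.payload)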

The following topics describe optimization improvements in the optimized TensorFlow runtime.

Model optimizations

The following three model optimizations are included in the optimized TensorFlow runtime.

Model XLA precompilation

When a TensorFlow model runs, all operations run individually. There is a small amount of overhead with running individual operations. The optimized TensorFlow runtime can remove some of this overhead by leveraging XLA to precompile all or a portion of the TensorFlow graph into larger kernels.
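
As an illustration of the general idea (not of how the optimized TensorFlow runtime applies it internally), XLA can fuse a sequence of TensorFlow operations into a single compiled kernel when a function is opted into JIT compilation:

    import tensorflow as tf

    # jit_compile=True asks XLA to compile the whole function into fused kernels
    # instead of dispatching matmul, add, and relu as separate operations.
    @tf.function(jit_compile=True)
    def dense_relu(x, w, b):
        return tf.nn.relu(tf.matmul(x, w) + b)

    x = tf.random.normal([8, 128])
    w = tf.random.normal([128, 64])
    b = tf.zeros([64])
    y = dense_relu(x, w, b)  # the first call triggers compilation; later calls reuse it

On Vertex AI you don't write this yourself; the optimized TensorFlow runtime applies precompilation when you enable the allow_precompilation flag, as described later in this document.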

Model XLA precompilation is optional and disabled by default. To learn how to enable model XLA precompilation during a deployment, see Enable model XLA precompilation.

Model compression optimizations

The optimized TensorFlow runtime can run some models faster, with a small impact on model precision, by enabling model compression optimization. When model compression optimization is enabled, the optimized TensorFlow runtime utilizes techniques such as quantization and weight pruning to run models faster.
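
To illustrate what quantization means in general terms (this is not the runtime's internal implementation), the following sketch maps float32 weights to 8-bit integers and back, showing the small precision loss that compression can introduce:

    import numpy as np

    weights = np.array([0.82, -1.73, 0.05, 2.41], dtype=np.float32)

    # Map float32 weights onto int8 using a single scale factor.
    scale = np.abs(weights).max() / 127.0
    quantized = np.round(weights / scale).astype(np.int8)

    # Dequantize to see the small error introduced by the lower precision.
    restored = quantized.astype(np.float32) * scale
    print("max error:", np.abs(weights - restored).max())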

The model compression optimization feature is disabled by default. To learn how to enable model compression optimization during a deployment, see Enable model compression optimization.

Improved tabular model performance on GPUs

TensorFlow tabular models are usually served on CPUs because they can't utilize accelerators effectively. The optimized TensorFlow runtime addresses this by running the computationally expensive parts of the model on GPUs. The rest of the model runs on CPUs, which minimizes communication between the host and the accelerator. Running the expensive parts of the model on GPUs and the rest on CPUs makes serving tabular models faster and less expensive.

The optimized TensorFlow runtime optimizes serving the following tabular model types.

Automatic model optimization for Cloud TPU

The prebuilt optimized TensorFlow runtime containers that support Cloud TPUs can automatically partition and optimize your models to run on TPUs. For more information, see Deploy to Cloud TPU.

Use of the TensorFlow runtime (TFRT)

The optimized TensorFlow runtime can use the TensorFlow runtime (TFRT). TFRT efficiently uses multithreaded host CPUs, supports asynchronous programming models, and is optimized for low-level efficiency.

TFRT CPU is enabled in all optimized TensorFlow runtime CPU container images except version 2.8. To disable TFRT CPU, set the use_tfrt flag to false.

TFRT GPU is available on nightly optimized TensorFlow runtime GPU container images and stable optimized TensorFlow runtime GPU container images versions 2.13 and later. To enable TFRT GPU, set the use_tfrt and allow_precompilation flags to true. TFRT on a GPU container image minimizes data transfer overhead between the host CPU and the GPU. After you enable TFRT, it works together with XLA compilation. Because XLA precompilation is enabled, you might experience some side effects such as increased latency on the first request. For more information, see Enable model XLA precompilation.
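
For example, assuming the same flag-passing pattern as the model upload sample later in this document, a container_spec that enables TFRT on a GPU image might look like the following sketch; the exact flag value syntax is an assumption.

    # Sketch only: the flag names come from this document; passing them through
    # "args" follows the upload sample shown later.
    container_spec = {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest",
        "args": [
            "--use_tfrt=true",              # enable TFRT on the GPU image
            "--allow_precompilation=true",  # TFRT GPU also requires XLA precompilation
        ],
    }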

Use of the Google runtime

Because the optimized TensorFlow runtime is built using Google's internal stack, itcan take advantage of running on Google's proprietary runtime environment.

Optimized TensorFlow runtime container images

Vertex AI provides two types of optimized TensorFlow runtime container images:stable and nightly.

Stable container images

Stable optimized TensorFlow runtime containers are bound to a specific TensorFlow version, just like the open source-based prebuilt TensorFlow Serving containers. Optimized TensorFlow runtime containers bound to a specific version are maintained for the same duration as the open source build that is bound to the same version. The optimized TensorFlow runtime builds have the same properties as open source TensorFlow builds, except with faster inference.

Builds are backward compatible. This means you should be able to run models trained on older TensorFlow versions using a more recent container. Recent containers should perform better than older ones. In rare exceptions, a model trained on an older TensorFlow version might not work with a more recent container.

Nightly container images

Nightly optimized TensorFlow runtime builds include the most recent improvements and optimizations, but might not be as reliable as stable builds. They are primarily used for experimental purposes. Nightly build names include the label nightly. Unlike the stable container images, nightly containers are not covered by the Vertex AI Service Level Agreement (SLA).

Available container images

The following nightly and stable optimized TensorFlow runtime Docker container images are available.

ML framework version | Supported accelerators (and CUDA version, if applicable) | End of patch and support date | End of availability | Supported images
nightly | CPU only | Not applicable | Not applicable
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest
nightly | GPU (CUDA 12.x) | Not applicable | Not applicable
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest
nightly | Cloud TPU | Not applicable | Not applicable
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.nightly:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.nightly:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.nightly:latest
2.17 | CPU only | Jul 11, 2024 | Jul 11, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-17:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-17:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-17:latest
2.17 | GPU (CUDA 12.x) | Jul 11, 2024 | Jul 11, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-17:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-17:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-17:latest
2.17 | Cloud TPU | Jul 11, 2024 | Jul 11, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-17:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-17:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-17:latest
2.16 | CPU only | Apr 26, 2024 | Apr 26, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-16:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-16:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-16:latest
2.16 | GPU (CUDA 12.x) | Apr 26, 2024 | Apr 26, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-16:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-16:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-16:latest
2.16 | Cloud TPU | Apr 26, 2024 | Apr 26, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-16:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-16:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-16:latest
2.15 | CPU only | Aug 15, 2024 | Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-15:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-15:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-15:latest
2.15 | GPU (CUDA 12.x) | Aug 15, 2024 | Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-15:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-15:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-15:latest
2.15 | Cloud TPU | Aug 15, 2024 | Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest
2.14 | CPU only | Aug 15, 2024 | Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-14:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-14:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-14:latest
2.14 | GPU (CUDA 12.x) | Aug 15, 2024 | Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-14:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-14:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-14:latest
2.13 | CPU only | Aug 15, 2024 | Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-13:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-13:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-13:latest
2.13 | GPU (CUDA 11.x) | Aug 15, 2024 | Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-13:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-13:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-13:latest
2.12 | CPU only | May 15, 2024 | May 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-12:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-12:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-12:latest
2.12 | GPU (CUDA 11.x) | May 15, 2024 | May 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-12:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-12:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-12:latest
2.11 | CPU only | Nov 15, 2023 | Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-11:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-11:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-11:latest
2.11 | GPU (CUDA 11.x) | Nov 15, 2023 | Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-11:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-11:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-11:latest
2.10 | CPU only | Nov 15, 2023 | Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-10:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-10:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-10:latest
2.10 | GPU (CUDA 11.x) | Nov 15, 2023 | Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-10:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-10:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-10:latest
2.9 | CPU only | Nov 15, 2023 | Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest
2.9 | GPU (CUDA 11.x) | Nov 15, 2023 | Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest
2.8 | CPU only | Nov 15, 2023 | Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest
2.8 | GPU (CUDA 11.x) | Nov 15, 2023 | Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest

Use the optimized TensorFlow runtime with a private endpoint

Using private endpoints to serve online inferences with Vertex AI provides a low-latency, secure connection to the Vertex AI online inference service that is faster than using public endpoints. Because the optimized TensorFlow runtime is likely to serve latency-sensitive models, you might consider using it with private endpoints. For more information, see Use private endpoints for online inference.
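
For example, with the Vertex AI SDK for Python you can create a private endpoint and then deploy your model to it. The display name and the NETWORK placeholder in this sketch are assumptions that you replace with your own values.

    from google.cloud import aiplatform

    aiplatform.init(project="PROJECT_ID", location="LOCATION")

    # NETWORK is the full resource name of the VPC network that is peered with
    # Vertex AI, for example:
    # "projects/PROJECT_NUMBER/global/networks/NETWORK_NAME"
    endpoint = aiplatform.PrivateEndpoint.create(
        display_name="DISPLAY_NAME",
        network="NETWORK",
    )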

Deploy a model using the optimized TensorFlow runtime

The process to deploy a model for inference using the optimized TensorFlow runtime is almost the same as the process to deploy models using open source-based prebuilt TensorFlow Serving containers. The only differences are that you specify a container image that uses the optimized TensorFlow runtime when you create your model, and you can enable the optimization flags described earlier in this document. For example, if you deployed your model with the us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest container, you can serve the same model with the optimized TensorFlow runtime by using the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest container.

The following code sample shows you how to create a model with the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest optimized TensorFlow runtime container. To deploy this model, you use the same process that you use to deploy a model with other prebuilt TensorFlow Serving containers.

For more information about the ModelServiceClient used in this sample, see Class ModelServiceClient. For more information about how to deploy models using Vertex AI, see Deploy a model using the Vertex AI API. For more information about the allow_precompilation and allow_compression settings, see Model optimizations described earlier in this document.

    from google.cloud.aiplatform import gapic as aip

    PROJECT_ID = "PROJECT_ID"
    REGION = "LOCATION"
    API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
    PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

    client_options = {"api_endpoint": API_ENDPOINT}
    model_service_client = aip.ModelServiceClient(client_options=client_options)

    tf_opt_model_dict = {
        "display_name": "DISPLAY_NAME",
        "metadata_schema_uri": "",
        "artifact_uri": "MODEL_URI",
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
            "args": [
                # The optimized TensorFlow runtime includes the following
                # options that can be set here.
                # "--allow_precompilation=true" - enable XLA precompilation
                # "--allow_compression=true" - enable
                #    model compression optimization
            ],
        },
    }

    tf_opt_model = model_service_client.upload_model(
        parent=PARENT,
        model=tf_opt_model_dict,
    ).result(timeout=180).model
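
Continuing the sample above, deploying the uploaded model follows the standard Vertex AI flow and is not specific to the optimized TensorFlow runtime. The machine type and display names in this sketch are placeholder assumptions.

    endpoint_service_client = aip.EndpointServiceClient(client_options=client_options)

    endpoint = endpoint_service_client.create_endpoint(
        parent=PARENT,
        endpoint={"display_name": "DISPLAY_NAME"},
    ).result(timeout=300)

    # Deploy the uploaded model to the new endpoint with a single replica.
    endpoint_service_client.deploy_model(
        endpoint=endpoint.name,
        deployed_model={
            "model": tf_opt_model,
            "display_name": "DISPLAY_NAME",
            "dedicated_resources": {
                "machine_spec": {"machine_type": "n1-standard-4"},
                "min_replica_count": 1,
                "max_replica_count": 1,
            },
        },
        traffic_split={"0": 100},
    ).result(timeout=1800)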

Model optimization flags

When you deploy a model using the optimized TensorFlow runtime, you can enable two features that might further optimize serving TensorFlow models.

  1. Model XLA precompilation
  2. Model compression optimization

You can enable model XLA precompilation and model compression optimization atthe same time. The following sections describe how to enable these optionsusing flags during deployment.

Enable model XLA precompilation

To configure the optimized TensorFlow runtime to precompile models, set the allow_precompilation flag to true. Model XLA precompilation works for different kinds of models, and in most cases improves performance. XLA precompilation works best for requests with large batch sizes.

Model XLA precompilation happens when the first request with the new batch size arrives. To ensure that the runtime is initialized before the first request, you can include a warmup requests file. For more information, see SavedModel warmup in the TensorFlow documentation.
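
A minimal warmup file sketch, following the TensorFlow Serving SavedModel warmup convention, is shown below. The input name "examples", the tensor contents, and the MODEL_DIR path are placeholder assumptions; your warmup requests should use the batch sizes your model will actually receive.

    import tensorflow as tf
    from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

    # Warmup requests live next to the SavedModel, in assets.extra/.
    warmup_path = "MODEL_DIR/assets.extra/tf_serving_warmup_requests"

    with tf.io.TFRecordWriter(warmup_path) as writer:
        request = predict_pb2.PredictRequest(
            model_spec=model_pb2.ModelSpec(signature_name="serving_default"),
            inputs={"examples": tf.make_tensor_proto([[1.0, 2.0, 3.0]])},
        )
        log = prediction_log_pb2.PredictionLog(
            predict_log=prediction_log_pb2.PredictLog(request=request),
        )
        writer.write(log.SerializeToString())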

XLA precompilation takes between several seconds and several minutes to complete, depending on the model complexity. If you use model XLA precompilation, you should consider the following.

  • If you use a warmup file, try to include requests with batch sizes that represent the batch sizes you expect your model to receive. Providing a large number of requests in your warmup file slows down the startup of your model server.

  • If you expect your model to receive requests with different batch sizes, you might want to enable server-side batching with a set of fixed values for allow_batch_sizes. For more information about how to enable server-side batching, see Enable server-side request batching for TensorFlow in the TensorFlow documentation. A sketch of this configuration follows this list.

  • Because XLA precompilation adds memory overhead, some large models might fail with an out of memory error on the GPU.
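
The following sketch shows what a server-side batching configuration might look like, assuming the flag-passing pattern used elsewhere in this document; the value format for allow_batch_sizes is an assumption, so verify it against the container's startup behavior.

    # Sketch only: the flag names come from this document, but the exact value
    # format for allow_batch_sizes is an assumption.
    container_spec = {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest",
        "args": [
            "--allow_precompilation=true",
            "--allow_batch_sizes=1,8,32",  # assumed comma-separated list of fixed batch sizes
        ],
    }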

It's recommended that you test XLA precompilation on your model before enabling this feature in production.

Enable model compression optimization

To configure the optimized TensorFlow runtime to use model compression optimization, set its allow_compression flag to true. Test how enabling this flag affects the precision of your model, and then determine whether you want to enable it in production.

Disable optimizations

To configure the optimized TensorFlow runtime to run models without optimization, set its disable_optimizer flag to true.

Optimized TensorFlow runtime limits

The optimized TensorFlow runtime has the following limitations:

  • The optimized TensorFlow runtime is not compatible with older NVIDIA GPUs such as Tesla P4 and Tesla P100.
  • The optimized TensorFlow runtime supports only sampled Shapley explainability at this time.

Pricing

Deploying models using the optimized TensorFlow runtime doesn't incur additional charges. The cost is the same as other inference deployments, where you're charged based on the number of VMs and accelerators that are used. For more information, see Vertex AI pricing.
