Serve open models using Hex-LLM premium container on Cloud TPU

Deprecated: Hex-LLM is deprecated. Use vLLM TPU.

Hex-LLM, a high-efficiency large language model (LLM) serving with XLA, is the Vertex AI LLM serving framework that's designed and optimized for Cloud TPU hardware. Hex-LLM combines LLM serving technologies such as continuous batching and PagedAttention with Vertex AI optimizations that are tailored for XLA and Cloud TPU. It's a high-efficiency and low-cost LLM serving solution on Cloud TPU for open source models.

Hex-LLM is available in Model Garden through model playground, one-click deployment, and notebook.

Features

Hex-LLM is based on open source projects with Google's own optimizations for XLA and Cloud TPU. Hex-LLM achieves high throughput and low latency when serving frequently used LLMs.

Hex-LLM includes the following optimizations:

Token-based continuous batching algorithm to help ensure models are fully utilizing the hardware with a large number of concurrent requests.
A complete rewrite of the attention kernels that are optimized for XLA.
Flexible and composable data parallelism and tensor parallelism strategies with highly optimized weight sharding methods to efficiently run LLMs on multiple Cloud TPU chips.

Hex-LLM supports a wide range of dense and sparse LLMs:

Gemma 2B and 7B
Gemma-2 9B and 27B
Llama-2 7B, 13B and 70B
Llama-3 8B and 70B
Llama-3.1 8B and 70B
Llama-3.2 1B and 3B
Llama-3.3 70B
Llama-Guard-3 1B and 8B
Llama-4 Scout-17B-16E
Mistral 7B
Mixtral 8x7B and 8x22B
Phi-3 mini and medium
Phi-4, Phi-4 reasoning and reasoning plus
Qwen-2 0.5B, 1.5B and 7B
Qwen-2.5 0.5B, 1.5B, 7B, 14B and 32B

Note: Hex-LLM can serve the 70B models in full precision or quantized (int8, int4) precision.

Hex-LLM also provides a variety of features, such as the following:

Hex-LLM is included in a single container. Hex-LLM packages the API server, inference engine, and supported models into a single Docker image to be deployed.
Compatible with the Hugging Face models format. Hex-LLM can load a Hugging Face model from local disk, the Hugging Face Hub, and a Cloud Storage bucket.
Quantization using bitsandbytes and AWQ.
Dynamic LoRA loading. Hex-LLM is able to load the LoRA weights through reading the request argument during serving.

Advanced features

Hex-LLM supports the following advanced features:

Multi-host serving
Disaggregated serving [experimental]
Prefix caching
Chunked prefill
4-bit quantization support

Multi-host serving

Hex-LLM now supports serving models with a multi-host TPU slice. This feature lets you serve large models that can't be loaded into a single host TPU VM, which contains at most eight v5e cores.

To enable this feature, set --num_hosts in the Hex-LLM container arguments and set tpu_topology in the Vertex AI SDK model deployment request (the model.deploy call in the following example). The following example shows how to deploy the Hex-LLM container with a TPU 4x4 v5e topology that serves the Llama 3.1 70B bfloat16 model:

hexllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    "--model=meta-llama/Meta-Llama-3.1-70B",
    "--data_parallel_size=1",
    "--tensor_parallel_size=16",
    "--num_hosts=4",
    "--hbm_utilization_factor=0.9",
]

model = aiplatform.Model.upload(
    display_name=model_name,
    serving_container_image_uri=HEXLLM_DOCKER_URI,
    serving_container_command=["python", "-m", "hex_llm.server.api_server"],
    serving_container_args=hexllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=env_vars,
    serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
    serving_container_deployment_timeout=7200,
    location=TPU_DEPLOYMENT_REGION,
)

model.deploy(
    endpoint=endpoint,
    machine_type=machine_type,
    tpu_topology="4x4",
    deploy_request_timeout=1800,
    service_account=service_account,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
)

For an end-to-end tutorial for deploying the Hex-LLM container with a multi-host TPU topology, see the Vertex AI Model Garden - Llama 3.1 (Deployment) notebook.

In general, the only changes needed to enable multi-host serving are:

Set argument --tensor_parallel_size to the total number of cores within the TPU topology.
Set argument --num_hosts to the number of hosts within the TPU topology.
Set tpu_topology in the Vertex AI SDK model deployment call (model.deploy).
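
The three settings above can be derived from the topology string. The following is a minimal sketch, not part of Hex-LLM or the Vertex AI SDK: the helper name multi_host_settings and its chips_per_host parameter are hypothetical, and the value 4 is inferred from the preceding example, where a 4x4 topology is served with --num_hosts=4.

```python
import math


def multi_host_settings(tpu_topology: str, chips_per_host: int) -> dict:
    """Derive multi-host settings from a topology string such as "4x4".

    Hypothetical helper for illustration only. chips_per_host depends on
    the machine type and must be supplied by the caller.
    """
    dims = [int(d) for d in tpu_topology.lower().split("x")]
    total_cores = math.prod(dims)
    if total_cores % chips_per_host != 0:
        raise ValueError("topology does not map to a whole number of hosts")
    return {
        "tensor_parallel_size": total_cores,
        "num_hosts": total_cores // chips_per_host,
        "tpu_topology": tpu_topology,
    }


settings = multi_host_settings("4x4", chips_per_host=4)
hexllm_args = [
    f"--tensor_parallel_size={settings['tensor_parallel_size']}",
    f"--num_hosts={settings['num_hosts']}",
]
print(hexllm_args)
```

The tpu_topology value is passed separately to the deployment call, as in the preceding example.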

Disaggregated serving [experimental]

Hex-LLM now supports disaggregated serving as an experimental feature. It can be enabled only in a single-host setup, and its performance is still being tuned.

Disaggregated serving is an effective method for balancing Time to First Token (TTFT) and Time Per Output Token (TPOT) for each request, and the overall serving throughput. It separates the prefill phase and the decode phase into different workloads so that they don't interfere with each other. This method is especially useful for scenarios that set strict latency requirements.

To enable this feature, set --disagg_topo in the Hex-LLM container arguments. The following is an example that shows how to deploy the Hex-LLM container on TPU v5e-8 that serves the Llama 3.1 8B bfloat16 model:

hexllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    "--model=meta-llama/Llama-3.1-8B",
    "--data_parallel_size=1",
    "--tensor_parallel_size=2",
    "--disagg_topo=3,1",
    "--hbm_utilization_factor=0.9",
]

model = aiplatform.Model.upload(
    display_name=model_name,
    serving_container_image_uri=HEXLLM_DOCKER_URI,
    serving_container_command=["python", "-m", "hex_llm.server.api_server"],
    serving_container_args=hexllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=env_vars,
    serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
    serving_container_deployment_timeout=7200,
    location=TPU_DEPLOYMENT_REGION,
)

model.deploy(
    endpoint=endpoint,
    machine_type=machine_type,
    deploy_request_timeout=1800,
    service_account=service_account,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
)

The --disagg_topo argument accepts a string in the format "number_of_prefill_workers,number_of_decode_workers". In the preceding example, it is set to "3,1" to configure three prefill workers and one decode worker. Each worker uses two TPU v5e cores.
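
As a quick sanity check on a --disagg_topo value, you can confirm that the workers fit on the available cores. The following sketch is illustrative only: parse_disagg_topo is a hypothetical helper, and it assumes that each worker occupies --tensor_parallel_size cores, which matches the preceding example (two cores per worker with --tensor_parallel_size=2).

```python
def parse_disagg_topo(disagg_topo: str, tensor_parallel_size: int) -> dict:
    """Split "prefill,decode" and compute the cores the workers occupy.

    Hypothetical helper; assumes one worker uses tensor_parallel_size cores.
    """
    prefill_str, decode_str = disagg_topo.split(",")
    prefill_workers, decode_workers = int(prefill_str), int(decode_str)
    return {
        "prefill_workers": prefill_workers,
        "decode_workers": decode_workers,
        "total_cores": (prefill_workers + decode_workers) * tensor_parallel_size,
    }


topo = parse_disagg_topo("3,1", tensor_parallel_size=2)
# Three prefill workers and one decode worker at two cores each use all
# eight cores of a TPU v5e-8.
print(topo)
```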

Prefix caching

Prefix caching reduces Time to First Token (TTFT) for prompts that have identical content at the beginning of the prompt, such as company-wide preambles, common system instructions, and multi-turn conversation history. Instead of processing the same input tokens repeatedly, Hex-LLM can retain a temporary cache of the processed input token computations to improve TTFT.

To enable this feature, set --enable_prefix_cache_hbm in the Hex-LLM container arguments. The following is an example that shows how to deploy the Hex-LLM container on TPU v5e-8 that serves the Llama 3.1 8B bfloat16 model:

hexllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    "--model=meta-llama/Llama-3.1-8B",
    "--data_parallel_size=1",
    "--tensor_parallel_size=4",
    "--hbm_utilization_factor=0.9",
    "--enable_prefix_cache_hbm",
]

model = aiplatform.Model.upload(
    display_name=model_name,
    serving_container_image_uri=HEXLLM_DOCKER_URI,
    serving_container_command=["python", "-m", "hex_llm.server.api_server"],
    serving_container_args=hexllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=env_vars,
    serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
    serving_container_deployment_timeout=7200,
    location=TPU_DEPLOYMENT_REGION,
)

model.deploy(
    endpoint=endpoint,
    machine_type=machine_type,
    deploy_request_timeout=1800,
    service_account=service_account,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
)

Hex-LLM employs prefix caching to optimize performance for prompts exceeding a certain length (512 tokens by default, configurable using prefill_len_padding). Cache hits occur in increments of this value, ensuring the cached token count is always a multiple of prefill_len_padding. The cached_tokens field of usage.prompt_tokens_details in the chat completion API response indicates how many of the prompt tokens were a cache hit.

"usage": {
  "prompt_tokens": 643,
  "total_tokens": 743,
  "completion_tokens": 100,
  "prompt_tokens_details": {
    "cached_tokens": 512
  }
}

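The multiple-of-prefill_len_padding rule gives an upper bound on how many prompt tokens can be served from the cache. The following sketch works that bound out for the usage example in this section. The helper max_cacheable_tokens is hypothetical, and the usage values are copied from the sample response, not produced by a live endpoint.

```python
import json


def max_cacheable_tokens(prompt_tokens: int, prefill_len_padding: int = 512) -> int:
    """Largest multiple of prefill_len_padding that fits in the prompt."""
    return (prompt_tokens // prefill_len_padding) * prefill_len_padding


# Sample "usage" object, copied from the example response in this section.
usage = json.loads(
    '{"prompt_tokens": 643, "total_tokens": 743, "completion_tokens": 100,'
    ' "prompt_tokens_details": {"cached_tokens": 512}}'
)
cached_tokens = usage["prompt_tokens_details"]["cached_tokens"]
upper_bound = max_cacheable_tokens(usage["prompt_tokens"])

# A 643-token prompt can reuse at most one 512-token increment.
print(cached_tokens, upper_bound)
```
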
Chunked prefill

Chunked prefill splits a request prefill into smaller chunks, and mixes prefill and decode into one batch step. Hex-LLM implements chunked prefill to balance the Time to First Token (TTFT) and Time per Output Token (TPOT) and to improve throughput.

To enable this feature, set --enable_chunked_prefill in the Hex-LLM container arguments. The following is an example that shows how to deploy the Hex-LLM container on TPU v5e-8 that serves the Llama 3.1 8B model:

hexllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    "--model=meta-llama/Llama-3.1-8B",
    "--data_parallel_size=1",
    "--tensor_parallel_size=4",
    "--hbm_utilization_factor=0.9",
    "--enable_chunked_prefill",
]

model = aiplatform.Model.upload(
    display_name=model_name,
    serving_container_image_uri=HEXLLM_DOCKER_URI,
    serving_container_command=["python", "-m", "hex_llm.server.api_server"],
    serving_container_args=hexllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=env_vars,
    serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
    serving_container_deployment_timeout=7200,
    location=TPU_DEPLOYMENT_REGION,
)

model.deploy(
    endpoint=endpoint,
    machine_type=machine_type,
    deploy_request_timeout=1800,
    service_account=service_account,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
)
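
The scheduling idea behind chunked prefill can be pictured with a toy model. The following sketch is not Hex-LLM's scheduler: the function names and the chunk_size parameter are hypothetical, and this page doesn't document how Hex-LLM chooses chunk sizes. It only shows a long prompt being split into chunks, with each batch step carrying one prefill chunk alongside one decode token for every sequence that is already generating.

```python
def chunk_prefill(prompt_len: int, chunk_size: int) -> list[int]:
    """Split a prompt of prompt_len tokens into chunks of at most chunk_size."""
    return [
        min(chunk_size, prompt_len - start)
        for start in range(0, prompt_len, chunk_size)
    ]


def schedule_steps(prompt_len: int, chunk_size: int, num_decoding_seqs: int) -> list[dict]:
    """Toy schedule: each step mixes one prefill chunk with ongoing decodes."""
    return [
        {"prefill_tokens": chunk, "decode_tokens": num_decoding_seqs}
        for chunk in chunk_prefill(prompt_len, chunk_size)
    ]


for step in schedule_steps(prompt_len=1200, chunk_size=512, num_decoding_seqs=8):
    print(step)
```

In this toy model, the eight decoding sequences keep producing tokens during all three steps instead of waiting for the full 1200-token prefill to finish.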

4-bit quantization support

Quantization is a technique for reducing the computational and memory costs of running inference by representing the weights or activations with low-precision data types like INT8 or INT4 instead of the usual BF16 or FP32.

Hex-LLM supports INT8 weight-only quantization. Extended support includes models with INT4 weights quantized using AWQ zero-point quantization. Hex-LLM supports INT4 variants of the Mistral, Mixtral and Llama model families.

There is no additional flag required for serving quantized models.
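
To make the term zero-point quantization concrete, the following sketch maps a few floating-point weights to 4-bit integers and back. It illustrates only the generic asymmetric integer mapping. It does not reproduce AWQ's activation-aware scaling or Hex-LLM's kernels, and the function names are hypothetical.

```python
def quantize_zero_point(weights: list[float], num_bits: int = 4):
    """Map floats to unsigned integers in [0, 2**num_bits - 1]."""
    qmax = 2**num_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights are identical.
    scale = (w_max - w_min) / qmax or 1.0
    # The zero point is the integer that represents the real value 0.0.
    zero_point = round(-w_min / scale)
    quantized = [
        min(qmax, max(0, round(w / scale) + zero_point)) for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized: list[int], scale: float, zero_point: int) -> list[float]:
    """Recover approximate float weights from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.5, 0.0, 0.25, 1.0]
quantized, scale, zero_point = quantize_zero_point(weights)
restored = dequantize(quantized, scale, zero_point)
print(quantized, restored)
```

Each weight is stored in 4 bits instead of 16, at the cost of a small rounding error per weight.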

Get started in Model Garden

The Hex-LLM Cloud TPU serving container is integrated into Model Garden. You can access this serving technology through the playground, one-click deployment, and Colab Enterprise notebook examples for a variety of models.

Use playground

Model Garden playground is a pre-deployed Vertex AIendpoint that is reachable by sending requests in the model card.

Enter a prompt and, optionally, include arguments for your request.
Click SUBMIT to get the model response quickly.

Try it out with Gemma!

Use one-click deployment

You can deploy a custom Vertex AI endpoint with Hex-LLM by using a model card.

Navigate to the model card page and click Deploy.
For the model variation that you want to use, select the Cloud TPU v5e machine type for deployment.
Click Deploy at the bottom to begin the deployment process. You receive two email notifications: one when the model is uploaded and another when the endpoint is ready.

Use the Colab Enterprise notebook

For flexibility and customization, you can use Colab Enterprise notebook examples to deploy a Vertex AI endpoint with Hex-LLM by using the Vertex AI SDK for Python.

Navigate to the model card page and click Open notebook.
Select the Vertex Serving notebook. The notebook is opened in Colab Enterprise.
Run through the notebook to deploy a model by using Hex-LLM and send prediction requests to the endpoint. The code snippet for the deployment is as follows:

from google.cloud import aiplatform

# HEXLLM_DOCKER_URI must hold the URI of the Hex-LLM serving container image.

hexllm_args = [
    "--model=google/gemma-2-9b-it",
    "--tensor_parallel_size=4",
    "--hbm_utilization_factor=0.8",
    "--max_running_seqs=512",
]

hexllm_envs = {
    "PJRT_DEVICE": "TPU",
    "MODEL_ID": "google/gemma-2-9b-it",
    "DEPLOY_SOURCE": "notebook",
}

model = aiplatform.Model.upload(
    display_name="gemma-2-9b-it",
    serving_container_image_uri=HEXLLM_DOCKER_URI,
    serving_container_command=["python", "-m", "hex_llm.server.api_server"],
    serving_container_args=hexllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=hexllm_envs,
    serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
    serving_container_deployment_timeout=7200,
)

endpoint = aiplatform.Endpoint.create(display_name="gemma-2-9b-it-endpoint")

model.deploy(
    endpoint=endpoint,
    machine_type="ct5lp-hightpu-4t",
    deploy_request_timeout=1800,
    service_account="<your-service-account>",
    min_replica_count=1,
    max_replica_count=1,
)

Example Colab Enterprise notebooks include:

Configure server arguments and environment variables

You can set the following arguments to launch the Hex-LLM server. You can tailor the arguments to best fit your intended use case and requirements. Note that the arguments are predefined for one-click deployment for enabling the easiest deployment experience. To customize the arguments, you can build off of the notebook examples for reference and set the arguments accordingly.

Model

--model: The model to load. You can specify a Hugging Face model ID, a Cloud Storage bucket path (gs://my-bucket/my-model), or a local path. The model artifacts are expected to follow the Hugging Face format and use safetensors files for the model weights. BitsAndBytes int8 and AWQ quantized model artifacts are supported for Llama, Gemma 2 and Mistral/Mixtral.
--tokenizer: The tokenizer to load. This can be a Hugging Face model ID, a Cloud Storage bucket path (gs://my-bucket/my-model), or a local path. If this argument is not set, it defaults to the value for --model.
--tokenizer_mode: The tokenizer mode. Possible choices are ["auto", "slow"]. The default value is "auto". If this is set to "auto", the fast tokenizer is used if available. The slow tokenizers are written in Python and provided in the Transformers library, while the fast tokenizers offering performance improvement are written in Rust and provided in the Tokenizers library. For more information, see the Hugging Face documentation.
--trust_remote_code: Whether to allow remote code files defined in the Hugging Face model repositories. The default value is False.
--load_format: Format of model checkpoints to load. Possible choices are ["auto", "dummy"]. The default value is "auto". If this is set to "auto", the model weights are loaded in safetensors format. If this is set to "dummy", the model weights are randomly initialized. Setting this to "dummy" is useful for experimentation.
--max_model_len: The maximum context length (input length plus the output length) to serve for the model. The default value is read from the model configuration file in Hugging Face format: config.json. A larger maximum context length requires more TPU memory.
--sliding_window: If set, this argument overrides the model's window size for sliding window attention. Setting this argument to a larger value makes the attention mechanism include more tokens and approaches the effect of standard self attention. This argument is meant for experimental usage only. In general use cases, we recommend using the model's original window size.
--seed: The seed for initializing all random number generators. Changing this argument might affect the generated output for the same prompt through changing the tokens that are sampled as next tokens. The default value is 0.

Inference engine

--num_hosts: The number of hosts to run. The default value is 1. For more details, refer to the documentation on TPU v5e configuration.
--disagg_topo: Defines the number of prefill workers and decode workers with the experimental feature disaggregated serving. The default value is None. The argument follows the format: "number_of_prefill_workers,number_of_decode_workers".
--data_parallel_size: The number of data parallel replicas. The default value is 1. Increasing this from 1 to N improves the throughput by approximately a factor of N, while maintaining the same latency.
--tensor_parallel_size: The number of tensor parallel replicas. The default value is 1. Increasing the number of tensor parallel replicas generally improves latency, because it speeds up matrix multiplication by reducing the matrix size.
--worker_distributed_method: The distributed method to launch the worker. Use mp for the multiprocessing module or ray for the Ray library. The default value is mp.
--enable_jit: Whether to enable JIT (Just-in-Time Compilation) mode. The default value is True. Setting --no-enable_jit disables it. Enabling JIT mode improves inference performance at the cost of requiring additional time spent on initial compilation. In general, the inference performance benefits outweigh the overhead.
--warmup: Whether to warm up the server with sample requests during initialization. The default value is True. Setting --no-warmup disables it. Warmup is recommended, because initial requests trigger heavier compilation and therefore will be slower.
--max_prefill_seqs: The maximum number of sequences that can be scheduled for prefilling per iteration. The default value is 1. The larger this value is, the higher throughput the server can achieve, but with potential adverse effects on latency.
--prefill_seqs_padding: The server pads the prefill batch size to a multiple of this value. The default value is 8. Increasing this value reduces model recompilation times, but increases wasted computation and inference overhead. The optimal setting depends on the request traffic.
--prefill_len_padding: The server pads the sequence length to a multiple of this value. The default value is 512. Increasing this value reduces model recompilation times, but increases wasted computation and inference overhead. The optimal setting depends on the data distribution of the requests.
--max_decode_seqs/--max_running_seqs: The maximum number of sequences that can be scheduled for decoding per iteration. The default value is 256. The larger this value is, the higher throughput the server can achieve, but with potential adverse effects on latency.
--decode_seqs_padding: The server pads the decode batch size to a multiple of this value. The default value is 8. Increasing this value reduces model recompilation times, but increases wasted computation and inference overhead. The optimal setting depends on the request traffic.
--decode_blocks_padding: The server pads the number of memory blocks used for a sequence's Key-Value cache (KV cache) to a multiple of this value during decoding. The default value is 128. Increasing this value reduces model recompilation times, but increases wasted computation and inference overhead. The optimal setting depends on the data distribution of the requests.
--enable_prefix_cache_hbm: Whether to enable prefix caching in HBM. The default value is False. Setting this argument can improve performance by reusing the computations of shared prefixes of prior requests.
--enable_chunked_prefill: Whether to enable chunked prefill. The default value is False. Setting this argument can support longer context length and improve performance.
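
The padding arguments above all trade wasted computation for fewer distinct shapes to compile. The following sketch shows the round-up-to-a-multiple arithmetic that those descriptions imply, using the documented default values. The helper pad_to_multiple is hypothetical and is not a Hex-LLM API.

```python
def pad_to_multiple(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple (ceiling division)."""
    return -(-value // multiple) * multiple


prompt_len = 643
padded_len = pad_to_multiple(prompt_len, 512)  # --prefill_len_padding default
wasted_tokens = padded_len - prompt_len

batch_size = 3
padded_batch = pad_to_multiple(batch_size, 8)  # --prefill_seqs_padding default

print(padded_len, wasted_tokens, padded_batch)
```

A larger multiple means fewer distinct padded shapes, and therefore fewer recompilations, but more padding per request.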

Memory management

--hbm_utilization_factor: The fraction of free Cloud TPU High Bandwidth Memory (HBM) that can be allocated for KV cache after model weights are loaded. The default value is 0.9. Setting this argument to a higher value increases the KV cache size and can improve throughput, but it increases the risk of running out of Cloud TPU HBM during initialization and at runtime.
--num_blocks: Number of device blocks to allocate for KV cache. If this argument is set, the server ignores --hbm_utilization_factor. If this argument is not set, the server profiles HBM usage and computes the number of device blocks to allocate based on --hbm_utilization_factor. Setting this argument to a higher value increases the KV cache size and can improve throughput, but it increases the risk of running out of Cloud TPU HBM during initialization and at runtime.
--block_size: Number of tokens stored in a block. Possible choices are [8, 16, 32, 2048, 8192]. The default value is 32. Setting this argument to a larger value reduces overhead in block management, at the cost of more memory waste. The exact performance impact needs to be determined empirically.
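
To get a feel for how --hbm_utilization_factor, --block_size, and --num_blocks relate, you can estimate the KV cache budget by hand. The following is a back-of-the-envelope sketch, not the profiling that the server performs. The function names are hypothetical, the model dimensions (32 layers, 8 key-value heads, head dimension 128) are assumptions taken from the public Llama 3.1 8B configuration, and the 48 GiB of free HBM is an illustrative figure.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache needed to store one token."""
    # The factor 2 accounts for storing both the key and the value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def estimate_num_blocks(free_hbm_bytes: int, hbm_utilization_factor: float,
                        block_size: int, bytes_per_token: int) -> int:
    """Number of block_size-token blocks that fit in the KV cache budget."""
    budget = int(free_hbm_bytes * hbm_utilization_factor)
    return budget // (block_size * bytes_per_token)


# Assumed Llama 3.1 8B dimensions with bfloat16 (2 bytes per value).
bytes_per_token = kv_bytes_per_token(32, 8, 128, 2)
free_hbm_bytes = 48 * 2**30  # illustrative free HBM after loading weights
num_blocks = estimate_num_blocks(free_hbm_bytes, 0.9, 32, bytes_per_token)
print(bytes_per_token, num_blocks, num_blocks * 32)
```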

Dynamic LoRA

--enable_lora: Whether to enable dynamic LoRA adapters loading from Cloud Storage. The default value is False. This is supported for the Llama model family.
--max_lora_rank: The maximum LoRA rank supported for LoRA adapters defined in requests. The default value is 16. Setting this argument to a higher value allows for greater flexibility in the LoRA adapters that can be used with the server, but increases the amount of Cloud TPU HBM allocated for LoRA weights and decreases throughput.
--enable_lora_cache: Whether to enable caching of dynamic LoRA adapters. The default value is True. Setting --no-enable_lora_cache disables it. Caching improves performance because it removes the need to re-download previously used LoRA adapter files.
--max_num_mem_cached_lora: The maximum number of LoRA adapters stored in TPU memory cache. The default value is 16. Setting this argument to a larger value improves the chance of a cache hit, but it increases the amount of Cloud TPU HBM usage.

You can also configure the server using the following environment variables:

HEX_LLM_LOG_LEVEL: Controls the amount of logging information generated. The default value is INFO. Set this to one of the standard Python logging levels defined in the logging module.
HEX_LLM_VERBOSE_LOG: Whether to enable detailed logging output. Allowed values are true or false. Default value is false.
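
These variables can be passed through the serving_container_environment_variables argument used in the deployment examples on this page. The following sketch builds such a dictionary and validates the values first. The helper build_env_vars is hypothetical, and the set of accepted level names is an assumption based on the standard Python logging level names.

```python
# Standard level names from the Python logging module.
VALID_LOG_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}


def build_env_vars(log_level: str = "INFO", verbose: bool = False) -> dict:
    """Return Hex-LLM logging environment variables as strings."""
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(f"unsupported log level: {log_level}")
    return {
        "HEX_LLM_LOG_LEVEL": log_level,
        # The variable expects the lowercase strings "true" or "false".
        "HEX_LLM_VERBOSE_LOG": "true" if verbose else "false",
    }


env_vars = build_env_vars(log_level="DEBUG", verbose=True)
print(env_vars)
```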

Tune server arguments

The server arguments are interrelated and have a collective effect on the serving performance. For example, a larger --max_model_len value, such as 4096, leads to higher TPU memory usage, and therefore requires larger memory allocation and less batching. In addition, some arguments are determined by the use case, while others can be tuned. Here is a workflow for configuring the Hex-LLM server.

Determine the model family and model variant of interest. For example, Llama 3.1 8B Instruct.
Estimate the lower bound of TPU memory needed based on the model size and precision: model_size * (num_bits / 8). For an 8B model and bfloat16 precision, the lower bound of TPU memory needed would be 8 * (16 / 8) = 16 GB.
Estimate the number of TPU v5e chips needed, where each v5e chip offers 16 GB: tpu_memory / 16. For an 8B model and bfloat16 precision, you need more than 1 chip, because the weights alone fill the 16 GB of a single chip and leave no room for the KV cache. Among the 1-chip, 4-chip and 8-chip configurations, the smallest configuration that offers more than 1 chip is the 4-chip configuration: ct5lp-hightpu-4t. You can subsequently set --tensor_parallel_size=4.
Determine the maximum context length (input length + output length) for the intended use case. For example, 4096. You can subsequently set --max_model_len=4096.
Tune the amount of free TPU memory allocated for KV cache to the maximum value achievable given the model, hardware and server configurations (--hbm_utilization_factor). Start with 0.95. Deploy the Hex-LLM server and test the server with long prompts and high concurrency. If the server runs out of memory, reduce the utilization factor accordingly.
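
The memory and chip-count estimates in this workflow can be written out as a short calculation. The following sketch uses only the figures given above (16 GB per v5e chip and the 1-chip, 4-chip and 8-chip configurations). The function names are hypothetical, and the strict comparison reflects the guidance that the weights alone must not fill the available memory.

```python
V5E_HBM_GB_PER_CHIP = 16
AVAILABLE_CHIP_COUNTS = (1, 4, 8)


def min_weight_memory_gb(params_billion: float, num_bits: int) -> float:
    """Lower bound of TPU memory for weights: model_size * (num_bits / 8)."""
    return params_billion * (num_bits / 8)


def pick_chip_count(memory_gb: float) -> int:
    """Smallest configuration whose total HBM strictly exceeds memory_gb."""
    for chips in AVAILABLE_CHIP_COUNTS:
        if chips * V5E_HBM_GB_PER_CHIP > memory_gb:
            return chips
    raise ValueError("model does not fit on a single-host v5e configuration")


llama_8b_bf16 = min_weight_memory_gb(8, 16)   # 16 GB of weights
llama_70b_int4 = min_weight_memory_gb(70, 4)  # 35 GB of weights

print(pick_chip_count(llama_8b_bf16), pick_chip_count(llama_70b_int4))
```

Both models land on the 4-chip configuration, which is consistent with the --tensor_parallel_size=4 setting in the sample arguments for Llama 3.1 8B Instruct and Llama 3.1 70B Instruct AWQ.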

A sample set of arguments for deploying Llama 3.1 8B Instruct is:

python -m hex_llm.server.api_server \
    --model=meta-llama/Llama-3.1-8B-Instruct \
    --tensor_parallel_size=4 \
    --max_model_len=4096 \
    --hbm_utilization_factor=0.95

A sample set of arguments for deploying Llama 3.1 70B Instruct AWQ on ct5lp-hightpu-4t is:

python -m hex_llm.server.api_server \
    --model=hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --tensor_parallel_size=4 \
    --max_model_len=4096 \
    --hbm_utilization_factor=0.45

Request Cloud TPU quota

In Model Garden, your default quota is 32 Cloud TPU v5e chips in the us-west1 region. This quota applies to one-click deployments and Colab Enterprise notebook deployments. To request a higher quota value, see Request a quota adjustment.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.
