trtllm-serve#

About#

The trtllm-serve command starts an OpenAI-compatible server that supports the following endpoints:

  • /v1/models

  • /v1/completions

  • /v1/chat/completions

For information about the inference endpoints, refer to the OpenAI API Reference.

The server also supports the following endpoints:

  • /health

  • /metrics

  • /version

The /metrics endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.

Starting a Server#

The following abbreviated command syntax shows the commonly used arguments to start a server:

trtllm-serve <model> [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]

For the full syntax and argument descriptions, refer to Syntax.

Inference Endpoints#

After you start the server, you can send inference requests through the Completions API and the Chat API, which are compatible with the corresponding OpenAI APIs. We use TinyLlama-1.1B-Chat-v1.0 for the examples in the following sections.

Chat API#

You can query the Chat API with any HTTP client; a typical example uses the OpenAI Python client:
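Since any HTTP client works, the following standard-library sketch shows the request shape. The host, port, and model name are assumptions matching the examples in this document; the OpenAI Python client's `client.chat.completions.create` issues an equivalent POST.

```python
# Chat API request sketch using only the Python standard library.
# Assumes a trtllm-serve instance at http://localhost:8000 serving TinyLlama.
import json
import urllib.request

def build_chat_payload(prompt, model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                       max_tokens=32):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt, base_url="http://localhost:8000"):
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the server running, `chat("Where is New York?")` returns the generated reply.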

Another example uses curl:
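A minimal sketch of the same request with curl; the address and model name are assumptions matching the TinyLlama example above, and the trailing fallback only reports when no server is reachable.

```shell
curl -s --connect-timeout 2 http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "messages": [{"role": "user", "content": "Where is New York?"}],
          "max_tokens": 16
        }' || echo "request failed: is the server running?"
```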

Completions API#

You can query the Completions API with any HTTP client; a typical example uses the OpenAI Python client:
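As with the Chat API, any HTTP client works; this standard-library sketch shows the request shape, with the host, port, and model name assumed from the examples above (the OpenAI Python client's `client.completions.create` issues an equivalent POST).

```python
# Completions API request sketch using only the Python standard library.
# Assumes a trtllm-serve instance at http://localhost:8000 serving TinyLlama.
import json
import urllib.request

def build_completion_payload(prompt,
                             model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                             max_tokens=32):
    """Build the JSON body for a /v1/completions request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt, base_url="http://localhost:8000"):
    """POST the request and return the generated text."""
    body = json.dumps(build_completion_payload(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```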

Another example uses curl:
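A minimal curl sketch of the same request; the address and model name are assumptions matching the examples above.

```shell
curl -s --connect-timeout 2 http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "prompt": "Where is New York?",
          "max_tokens": 16
        }' || echo "request failed: is the server running?"
```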

Multimodal Serving#

For multimodal models, you need to create a configuration file and start the server with additional options due to the following limitations:

  • TRT-LLM multimodal is currently not compatible with kv_cache_reuse

  • Multimodal models require a chat_template, so only the Chat API is supported

To set up multimodal models:

First, create a configuration file:

cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
    enable_block_reuse: false
EOF

Then, start the server with the configuration file:

trtllm-serve Qwen/Qwen2-VL-7B-Instruct \
    --extra_llm_api_options ./extra-llm-api-config.yml

Multimodal Chat API#

You can query the Chat API with any HTTP client; a typical example uses the OpenAI Python client:
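The multimodal request pairs a text part with a media part in the content list. This standard-library sketch builds and sends such a request; the host, port, and model name are assumptions matching the Qwen2-VL setup above.

```python
# Multimodal Chat API request sketch using only the Python standard library.
# Assumes a trtllm-serve instance at http://localhost:8000 serving Qwen2-VL.
import json
import urllib.request

def build_multimodal_messages(text, image_url):
    """Pair a text prompt with an image URL in the Chat API content format."""
    return [{"role": "user", "content": [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]}]

def multimodal_chat(text, image_url, model="Qwen/Qwen2-VL-7B-Instruct",
                    base_url="http://localhost:8000"):
    """POST the request and return the assistant's reply text."""
    payload = {"model": model, "max_tokens": 64,
               "messages": build_multimodal_messages(text, image_url)}
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```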

Another example uses curl:
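A curl sketch of the same multimodal request; the address, model name, and image URL are assumptions matching the examples in this document.

```shell
curl -s --connect-timeout 2 http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen2-VL-7B-Instruct",
          "messages": [{"role": "user", "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}
          ]}],
          "max_tokens": 64
        }' || echo "request failed: is the server running?"
```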

Multimodal Modality Coverage#

TRT-LLM multimodal supports the following modalities and data types (depending on the model):

Text

  • No type specified:

    {"role": "user", "content": "What's the capital of South Korea?"}
  • Explicit “text” type:

    {"role": "user", "content": [{"type": "text", "text": "What's the capital of South Korea?"}]}

Image

  • Using “image_url” with URL:

    {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}]}
  • Using “image_url” with base64-encoded data:

    {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,{image_base64}"}}]}

Note

To convert images to base64-encoded format, use the utility function tensorrt_llm.utils.load_base64_image(). Refer to the load_base64_image utility for implementation details.
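The data-URL format used above can be illustrated with the standard library. This is a stand-in sketch of the idea, not the tensorrt_llm implementation; the function name and MIME parameter are illustrative.

```python
# Illustration of building a base64 data URL for the image_url field.
# Stand-in sketch; not the tensorrt_llm.utils.load_base64_image implementation.
import base64
from pathlib import Path

def image_to_data_url(path, mime="image/jpeg"):
    """Read an image file and return a data URL usable as an image_url value."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{encoded}"
```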

Video

  • Using “video_url”:

    {"role": "user", "content": [{"type": "text", "text": "What's in this video?"}, {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}}]}

Audio

  • Using “audio_url”:

    {"role": "user", "content": [{"type": "text", "text": "What's in this audio?"}, {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.mp3"}}]}

Multi-node Serving with Slurm#

You can deploy the DeepSeek-V3 model across two nodes with Slurm and trtllm-serve:

echo -e "enable_attention_dp: true\npytorch_backend_config:\n  enable_overlap_scheduler: true" > extra-llm-api-config.yml

srun -N 2 -w [NODES] \
    --output=benchmark_2node.log \
    --ntasks 16 --ntasks-per-node=8 \
    --mpi=pmix --gres=gpu:8 \
    --container-image=<CONTAINER_IMG> \
    --container-mounts=/workspace:/workspace \
    --container-workdir /workspace \
    bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml"

See the source code of trtllm-llmapi-launch for more details.

Metrics Endpoint#

Note

The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the one for the TensorRT backend.

Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.

Enabling enable_iter_perf_stats in the PyTorch backend can slightly impact performance, depending on the serving configuration.

The/metrics endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details.

For the default PyTorch backend, iteration statistics logging is enabled by setting the enable_iter_perf_stats field in a YAML file:

# extra_llm_config.yaml
enable_iter_perf_stats: true

Start the server and specify the --extra_llm_api_options argument with the path to the YAML file:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --extra_llm_api_options extra_llm_config.yaml

After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the /metrics endpoint. Because the statistics are stored in an internal queue and removed once retrieved, poll the endpoint shortly after each request and store the results if needed.

curl -X GET http://localhost:8000/metrics

Example output:

[{
  "gpuMemUsage": 76665782272,
  "iter": 154,
  "iterLatencyMS": 7.00688362121582,
  "kvCacheStats": {
    "allocNewBlocks": 3126,
    "allocTotalBlocks": 3126,
    "cacheHitRate": 0.00128,
    "freeNumBlocks": 101253,
    "maxNumBlocks": 101256,
    "missedBlocks": 3121,
    "reusedBlocks": 4,
    "tokensPerBlock": 32,
    "usedNumBlocks": 3
  },
  "numActiveRequests": 1
  ...
}]

Syntax#

trtllm-serve#

trtllm-serve [OPTIONS] COMMAND [ARGS]...

disaggregated#

Running server in disaggregated mode

trtllm-serve disaggregated [OPTIONS]

Options

-c, --config_file <config_file>#

Specific option for disaggregated mode.

-m, --metadata_server_config_file <metadata_server_config_file>#

Path to metadata server config file

-t, --server_start_timeout <server_start_timeout>#

Server start timeout

-r, --request_timeout <request_timeout>#

Request timeout

-l, --log_level <log_level>#

The logging level.

Options:

internal_error | error | warning | info | verbose | debug | trace

--metrics-log-interval <metrics_log_interval>#

The interval of logging metrics in seconds. Set to 0 to disable metrics logging.

disaggregated_mpi_worker#

Launching disaggregated MPI worker

trtllm-serve disaggregated_mpi_worker [OPTIONS]

Options

-c, --config_file <config_file>#

Specific option for disaggregated mode.

--log_level <log_level>#

The logging level.

Options:

internal_error | error | warning | info | verbose | debug | trace

mm_embedding_serve#

Running an OpenAI API compatible server

MODEL: model name | HF checkpoint path | TensorRT engine path

trtllm-serve mm_embedding_serve [OPTIONS] MODEL

Options

--host <host>#

Hostname of the server.

--port <port>#

Port of the server.

--log_level <log_level>#

The logging level.

Options:

internal_error | error | warning | info | verbose | debug | trace

--max_batch_size <max_batch_size>#

Maximum number of requests that the engine can schedule.

--max_num_tokens <max_num_tokens>#

Maximum number of batched input tokens after padding is removed in each batch.

--gpus_per_node <gpus_per_node>#

Number of GPUs per node. Defaults to None; the value is detected automatically.

--trust_remote_code#

Flag for HF transformers.

--extra_encoder_options <extra_encoder_options>#

Path to a YAML file that overwrites the parameters specified by trtllm-serve.

--metadata_server_config_file <metadata_server_config_file>#

Path to metadata server config file

Arguments

MODEL#

Required argument

serve#

Running an OpenAI API compatible server

MODEL: model name | HF checkpoint path | TensorRT engine path

trtllm-serve serve [OPTIONS] MODEL

Options

--tokenizer <tokenizer>#

Path | Name of the tokenizer. Specify this value only if using a TensorRT engine as the model.

--host <host>#

Hostname of the server.

--port <port>#

Port of the server.

--backend <backend>#

The backend to use to serve the model. Defaults to the pytorch backend.

Options:

pytorch | tensorrt | _autodeploy

--custom_module_dirs <custom_module_dirs>#

Paths to custom module directories to import.

--log_level <log_level>#

The logging level.

Options:

internal_error | error | warning | info | verbose | debug | trace

--max_beam_width <max_beam_width>#

Maximum number of beams for beam search decoding.

--max_batch_size <max_batch_size>#

Maximum number of requests that the engine can schedule.

--max_num_tokens <max_num_tokens>#

Maximum number of batched input tokens after padding is removed in each batch.

--max_seq_len <max_seq_len>#

Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.

--tp_size <tp_size>#

Tensor parallelism size.

--pp_size <pp_size>#

Pipeline parallelism size.

--ep_size <ep_size>#

Expert parallelism size.

--cluster_size <cluster_size>#

Expert cluster parallelism size.

--gpus_per_node <gpus_per_node>#

Number of GPUs per node. Defaults to None; the value is detected automatically.

--kv_cache_free_gpu_memory_fraction <kv_cache_free_gpu_memory_fraction>#

Free GPU memory fraction reserved for KV cache after allocating model weights and buffers.

--num_postprocess_workers <num_postprocess_workers>#

[Experimental] Number of workers to postprocess raw responses to comply with the OpenAI protocol.

--trust_remote_code#

Flag for HF transformers.

--extra_llm_api_options <extra_llm_api_options>#

Path to a YAML file that overwrites the parameters specified by trtllm-serve.

--reasoning_parser <reasoning_parser>#

[Experimental] Specify the parser for reasoning models.

Options:

deepseek-r1 | qwen3

--tool_parser <tool_parser>#

[Experimental] Specify the parser for tool-calling models.

Options:

qwen3 | qwen3_coder

--metadata_server_config_file <metadata_server_config_file>#

Path to metadata server config file

--server_role <server_role>#

Server role. Specify this value only if running in disaggregated mode.

--fail_fast_on_attention_window_too_large#

Exit with a runtime error when the attention window is too large to fit even a single sequence in the KV cache.

--otlp_traces_endpoint <otlp_traces_endpoint>#

Target URL to which OpenTelemetry traces will be sent.

--disagg_cluster_uri <disagg_cluster_uri>#

URI of the disaggregated cluster.

--enable_chunked_prefill#

Enable chunked prefill.

--media_io_kwargs <media_io_kwargs>#

Keyword arguments for media I/O.

Arguments

MODEL#

Required argument

Besides the above examples, trtllm-serve is also used as an entrypoint for performance benchmarking. Refer to Performance Benchmarking with trtllm-serve in the NVIDIA/TensorRT-LLM repository for more details.