trtllm-serve#
About#
The trtllm-serve command starts an OpenAI-compatible server that supports the following endpoints:
- /v1/models
- /v1/completions
- /v1/chat/completions
For information about the inference endpoints, refer to the OpenAI API Reference.
The server also supports the following endpoints:
- /health
- /metrics
- /version
The metrics endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
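A client script usually has to wait until the server is ready before it sends requests. The following is a minimal sketch of such a readiness check, assuming that /health answers with HTTP 200 once the server is up (the response format is not described on this page). The helper names is_healthy and wait_until_healthy are hypothetical and not part of trtllm-serve.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200 (assumed readiness signal)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: treat as "not ready yet".
        return False


def wait_until_healthy(base_url, attempts=30, delay=1.0, probe=is_healthy):
    """Call the probe up to `attempts` times, sleeping `delay` seconds between failures."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False


# Example usage (requires a running server):
#   wait_until_healthy("http://localhost:8000")
```

The probe is passed in as a parameter so that the retry loop can be exercised without a running server.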
Starting a Server#
The following abbreviated command syntax shows the commonly used arguments to start a server:
trtllm-serve <model> [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]
For the full syntax and argument descriptions, refer to Syntax.
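When the server is launched from a script, it can help to assemble the command from the arguments above. The sketch below only builds and prints the argument list; build_serve_command is a hypothetical helper, and the model name and values are illustrative.

```python
import shlex


def build_serve_command(model, tp_size=None, pp_size=None, ep_size=None,
                        host=None, port=None):
    """Build the argv list for trtllm-serve from the commonly used arguments."""
    argv = ["trtllm-serve", model]
    options = {
        "--tp_size": tp_size,
        "--pp_size": pp_size,
        "--ep_size": ep_size,
        "--host": host,
        "--port": port,
    }
    for flag, value in options.items():
        if value is not None:  # omit arguments that were not provided
            argv += [flag, str(value)]
    return argv


cmd = build_serve_command("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                          tp_size=1, host="localhost", port=8000)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.Popen to start the server as a child process.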
Inference Endpoints#
After you start the server, you can send inference requests through the Completions API and the Chat API, which are compatible with the corresponding OpenAI APIs. The examples in the following sections use TinyLlama-1.1B-Chat-v1.0.
Chat API#
You can query the Chat API with any HTTP client. A typical example uses the OpenAI Python client:
### :title OpenAI Chat Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{
        "role": "system",
        "content": "you are a helpful assistant"
    }, {
        "role": "user",
        "content": "Where is New York?"
    }],
    max_tokens=20,
)
print(response)
Another example uses curl:
#! /usr/bin/env bash

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "temperature": 0
    }'
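Because the Chat API is plain HTTP with a JSON body, the same request can also be sent without the openai package. The following sketch uses only the Python standard library and assumes the server returns the standard OpenAI chat-completion response shape (choices[0].message.content). The function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens=16):
    """Build a POST request for <base_url>/chat/completions with a JSON body."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat-completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, model, messages, max_tokens=16):
    """Send the request and return the assistant reply (requires a running server)."""
    request = build_chat_request(base_url, model, messages, max_tokens)
    with urllib.request.urlopen(request) as resp:
        return extract_reply(json.load(resp))


# Example usage (requires a running server):
#   chat("http://localhost:8000/v1", "TinyLlama-1.1B-Chat-v1.0",
#        [{"role": "user", "content": "Where is New York?"}])
```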
Completions API#
You can query the Completions API with any HTTP client. A typical example uses the OpenAI Python client:
### :title OpenAI Completion Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    prompt="Where is New York?",
    max_tokens=20,
)
print(response)
Another example uses curl:
#! /usr/bin/env bash

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
Multimodal Serving#
For multimodal models, you need to create a configuration file and start the server with additional options due to the following limitations:
- TRT-LLM multimodal is currently not compatible with kv_cache_reuse.
- Multimodal models require chat_template, so only the Chat API is supported.
To set up multimodal models:
First, create a configuration file:
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
  enable_block_reuse: false
EOF
Then, start the server with the configuration file:
trtllm-serve Qwen/Qwen2-VL-7B-Instruct \
    --extra_llm_api_options ./extra-llm-api-config.yml
Multimodal Chat API#
You can query the Chat API with any HTTP client, such as the OpenAI Python client or curl. The following example uses curl:
#! /usr/bin/env bash

# SINGLE IMAGE INFERENCE
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages":[{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the natural environment in the image."
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'

# MULTI IMAGE INFERENCE
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages":[{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text":"Tell me the difference between two images"
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
                    }
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'

# SINGLE VIDEO INFERENCE
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages":[{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text":"Tell me what you see in the video briefly."
                },
                {
                    "type":"video_url",
                    "video_url": {
                        "url": "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/OAI-sora-tokyo-walk.mp4"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'
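The same multimodal requests can be sketched with the OpenAI Python client. This is a sketch under stated assumptions rather than an official example: build_multimodal_messages and ask are hypothetical helpers, the model name is copied from the curl commands above, and it is assumed that the client forwards the "video_url" content part to the server unchanged.

```python
def build_multimodal_messages(prompt, image_urls=(), video_urls=()):
    """Build Chat API messages with one text part followed by image and video parts."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content += [{"type": "video_url", "video_url": {"url": url}} for url in video_urls]
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": content},
    ]


def ask(prompt, model="Qwen2.5-VL-3B-Instruct", **media):
    """Send a multimodal chat request (requires `pip install openai` and a running server)."""
    from openai import OpenAI  # imported lazily so the message builder works without it

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")
    response = client.chat.completions.create(
        model=model,
        messages=build_multimodal_messages(prompt, **media),
        max_tokens=64,
        temperature=0,
    )
    return response.choices[0].message.content


# Example usage (requires a running server):
#   ask("Describe the natural environment in the image.",
#       image_urls=["https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"])
```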
Multimodal Modality Coverage#
TRT-LLM multimodal supports the following modalities and data types (depending on the model):
Text
No type specified:
{"role":"user","content":"What's the capital of South Korea?"}
Explicit “text” type:
{"role":"user","content":[{"type":"text","text":"What's the capital of South Korea?"}]}
Image
Using “image_url” with URL:
{"role":"user","content":[{"type":"text","text":"What's in this image?"},{"type":"image_url","image_url":{"url":"https://example.com/image.png"}}]}
Using “image_url” with base64-encoded data:
{"role":"user","content":[{"type":"text","text":"What's in this image?"},{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,{image_base64}"}}]}
Note
To convert images to base64-encoded format, use the utility function tensorrt_llm.utils.load_base64_image(). Refer to the load_base64_image utility for implementation details.
Video
Using “video_url”:
{"role":"user","content":[{"type":"text","text":"What's in this video?"},{"type":"video_url","video_url":{"url":"https://example.com/video.mp4"}}]}
Audio
Using “audio_url”:
{"role":"user","content":[{"type":"text","text":"What's in this audio?"},{"type":"audio_url","audio_url":{"url":"https://example.com/audio.mp3"}}]}
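For the base64 variant of "image_url", the {image_base64} placeholder is the base64 text of the image bytes. The sketch below shows one way to produce such a data URL with the Python standard library only. It is a stand-in for illustration and does not reproduce tensorrt_llm.utils.load_base64_image(), whose signature is not shown on this page; the function names are hypothetical.

```python
import base64
import mimetypes


def image_bytes_to_data_url(data, mime_type="image/jpeg"):
    """Encode raw image bytes as a data URL usable in an "image_url" content part."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_data_url(path):
    """Read an image file and encode it, guessing the MIME type from the file name."""
    mime_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        return image_bytes_to_data_url(f.read(), mime_type)


def image_message(prompt, data_url):
    """Build a user message that pairs a text prompt with a base64-encoded image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }
```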
Benchmark#
You can use any benchmark client that is compatible with the OpenAI API to test the serving performance of trtllm-serve. We recommend genai-perf; the following is a benchmarking recipe.
First, install genai-perf with pip:
pip install genai-perf
Then, start a server with trtllm-serve and TinyLlama-1.1B-Chat-v1.0.
Finally, test performance with the following command:
#! /usr/bin/env bash

genai-perf profile \
    -m TinyLlama-1.1B-Chat-v1.0 \
    --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type chat \
    --random-seed 123 \
    --synthetic-input-tokens-mean 128 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 128 \
    --output-tokens-stddev 0 \
    --request-count 100 \
    --request-rate 10 \
    --profile-export-file my_profile_export.json \
    --url localhost:8000 \
    --streaming
Refer to the README of genai-perf for more guidance.
Multi-node Serving with Slurm#
You can deploy the DeepSeek-V3 model across two nodes with Slurm and trtllm-serve:
echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true" > extra-llm-api-config.yml

srun -N 2 -w [NODES] \
    --output=benchmark_2node.log \
    --ntasks 16 --ntasks-per-node=8 \
    --mpi=pmix --gres=gpu:8 \
    --container-image=<CONTAINER_IMG> \
    --container-mounts=/workspace:/workspace \
    --container-workdir /workspace \
    bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml"
See the source code of trtllm-llmapi-launch for more details.
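The numbers in the srun command are related: 2 nodes with 8 tasks per node give 16 tasks, which matches --tp_size 16. The following sketch checks that arithmetic before a job is submitted. It assumes one MPI task per GPU and a world size of tp_size × pp_size; both are assumptions for illustration and are not stated on this page, and check_launch_shape is a hypothetical helper.

```python
def check_launch_shape(nodes, ntasks_per_node, ntasks, tp_size, pp_size=1):
    """Return a list of mismatches between the Slurm task layout and the parallelism sizes."""
    problems = []
    if nodes * ntasks_per_node != ntasks:
        problems.append(
            f"nodes * ntasks_per_node = {nodes * ntasks_per_node}, but ntasks = {ntasks}"
        )
    world_size = tp_size * pp_size  # assumed: one task (rank) per GPU
    if ntasks != world_size:
        problems.append(
            f"ntasks = {ntasks}, but tp_size * pp_size = {world_size}"
        )
    return problems


# Values taken from the srun command above.
print(check_launch_shape(nodes=2, ntasks_per_node=8, ntasks=16, tp_size=16))
```

An empty list means the task layout and the parallelism sizes agree under these assumptions.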
Metrics Endpoint#
Note
This endpoint is beta maturity.
The statistics for the PyTorch backend are beta and not as comprehensive as those for the TensorRT backend.
Some fields, such as CPU memory usage, are not available for the PyTorch backend.
Enabling enable_iter_perf_stats in the PyTorch backend can impact performance slightly, depending on the serving configuration.
The /metrics endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details. For the TensorRT backend, these statistics are enabled by default. However, for the PyTorch backend, you must explicitly enable iteration statistics logging by setting the enable_iter_perf_stats field in a YAML configuration file as shown in the following example:
# extra-llm-api-config.yml
pytorch_backend_config:
  enable_iter_perf_stats: true
Then start the server and specify the --extra_llm_api_options argument with the path to the YAML file as shown in the following example:
trtllm-serve <model> \
    --extra_llm_api_options <path-to-extra-llm-api-config.yml> \
    [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]
After at least one inference request is sent to the server, you can fetch the runtime-iteration statistics by polling the /metrics endpoint:
curl -X GET http://<host>:<port>/metrics
Example Output
[
  {
    "gpuMemUsage": 56401920000,
    "inflightBatchingStats": {...},
    "iter": 1,
    "iterLatencyMS": 16.505143404006958,
    "kvCacheStats": {...},
    "newActiveRequestsQueueLatencyMS": 0.0007503032684326172
  }
]
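Because the endpoint returns a JSON list with one entry per iteration, a monitoring script can reduce it to a few summary numbers. The sketch below uses only the field names visible in the example output and assumes that gpuMemUsage is reported in bytes, which this page does not state. The function name summarize_iterations is hypothetical, and the sample values are made up for illustration.

```python
import json


def summarize_iterations(stats):
    """Summarize a list of per-iteration statistics as returned by the metrics endpoint."""
    if not stats:
        return None  # no inference request has produced statistics yet
    latencies = [entry["iterLatencyMS"] for entry in stats]
    return {
        "iterations": len(stats),
        "mean_iter_latency_ms": sum(latencies) / len(latencies),
        # Assumption: gpuMemUsage is in bytes, so divide by 2**30 to get GiB.
        "peak_gpu_mem_gib": max(entry["gpuMemUsage"] for entry in stats) / 2**30,
    }


# Illustrative sample with the same field names as the example output.
sample = json.loads(
    '[{"gpuMemUsage": 2147483648, "iter": 1, "iterLatencyMS": 10.0},'
    ' {"gpuMemUsage": 1073741824, "iter": 2, "iterLatencyMS": 20.0}]'
)
print(summarize_iterations(sample))
```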
Syntax#
trtllm-serve#
trtllm-serve [OPTIONS] COMMAND [ARGS]...
disaggregated#
Running server in disaggregated mode
trtllm-serve disaggregated [OPTIONS]
Options
- -c, --config_file <config_file>#
Specific option for disaggregated mode.
- -m, --metadata_server_config_file <metadata_server_config_file>#
Path to metadata server config file
- -t, --server_start_timeout <server_start_timeout>#
Server start timeout
- -r, --request_timeout <request_timeout>#
Request timeout
- -l, --log_level <log_level>#
The logging level.
- Options:
internal_error | error | warning | info | verbose | debug | trace
disaggregated_mpi_worker#
Launching disaggregated MPI worker
trtllm-serve disaggregated_mpi_worker [OPTIONS]
Options
- -c, --config_file <config_file>#
Specific option for disaggregated mode.
- --log_level <log_level>#
The logging level.
- Options:
internal_error | error | warning | info | verbose | debug | trace
serve#
Running an OpenAI API compatible server
MODEL: model name | HF checkpoint path | TensorRT engine path
trtllm-serve serve [OPTIONS] MODEL
Options
- --tokenizer <tokenizer>#
Path | Name of the tokenizer. Specify this value only if using TensorRT engine as model.
- --host <host>#
Hostname of the server.
- --port <port>#
Port of the server.
- --backend <backend>#
Set to ‘pytorch’ for pytorch path. Default is cpp path.
- Options:
pytorch | trt
- --log_level <log_level>#
The logging level.
- Options:
internal_error | error | warning | info | verbose | debug | trace
- --max_beam_width <max_beam_width>#
Maximum number of beams for beam search decoding.
- --max_batch_size <max_batch_size>#
Maximum number of requests that the engine can schedule.
- --max_num_tokens <max_num_tokens>#
Maximum number of batched input tokens after padding is removed in each batch.
- --max_seq_len <max_seq_len>#
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.
- --tp_size <tp_size>#
Tensor parallelism size.
- --pp_size <pp_size>#
Pipeline parallelism size.
- --ep_size <ep_size>#
Expert parallelism size.
- --cluster_size <cluster_size>#
Expert cluster parallelism size.
- --gpus_per_node <gpus_per_node>#
Number of GPUs per node. Defaults to None, in which case it is detected automatically.
- --kv_cache_free_gpu_memory_fraction <kv_cache_free_gpu_memory_fraction>#
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers.
- --num_postprocess_workers <num_postprocess_workers>#
[Experimental] Number of workers to postprocess raw responses to comply with OpenAI protocol.
- --trust_remote_code#
Flag for HF transformers.
- --extra_llm_api_options <extra_llm_api_options>#
Path to a YAML file that overwrites the parameters specified by trtllm-serve.
- --reasoning_parser <reasoning_parser>#
[Experimental] Specify the parser for reasoning models.
- Options:
deepseek-r1
- --metadata_server_config_file <metadata_server_config_file>#
Path to metadata server config file
- --server_role <server_role>#
Server role. Specify this value only if running in disaggregated mode.
- --fail_fast_on_attention_window_too_large#
Exit with runtime error when attention window is too large to fit even a single sequence in the KV cache.
Arguments
- MODEL#
Required argument