TensorRT LLM Benchmarking#
Important
This benchmarking suite is a work in progress. Expect breaking API changes.
TensorRT LLM provides the trtllm-bench CLI, a packaged benchmarking utility that aims to make it easier for users to reproduce our officially published performance overview. trtllm-bench provides the following:
A streamlined way to build tuned engines for benchmarking for a variety of models and platforms.
An entirely Python workflow for benchmarking.
Ability to benchmark various flows and features within TensorRT LLM.
trtllm-bench executes all benchmarks using in-flight batching. For more information, see the in-flight batching section, which describes the concept in further detail.
To benchmark the OpenAI-compatible trtllm-serve, refer to the run benchmarking with trtllm-serve section.
Before Benchmarking#
For rigorous benchmarking where consistent and reproducible results are critical, proper GPU configuration is essential. These settings help maximize GPU utilization, eliminate performance variability, and ensure optimal conditions for accurate measurements. While not strictly required for normal operation, we recommend applying these configurations when conducting performance comparisons or publishing benchmark results.
Persistence mode#
Ensure persistence mode is enabled to maintain consistent GPU state:
sudo nvidia-smi -pm 1
GPU Clock Management#
Allow the GPU to dynamically adjust its clock speeds based on workload and temperature. While locking clocks at maximum frequency might seem beneficial, it can sometimes lead to thermal throttling and reduced performance. Reset GPU clocks using:
sudo nvidia-smi -rgc
Set power limits#
First query the maximum power limit:
nvidia-smi -q -d POWER
Then configure the GPU to operate at its maximum power limit for consistent performance:
sudo nvidia-smi -pl <max_power_limit>
Boost settings#
Some GPUs support boost levels. First query the available boost levels:
sudo nvidia-smi boost-slider -l
If supported, enable the boost slider using one of the available levels for maximum performance:
sudo nvidia-smi boost-slider --vboost <max_boost_slider>
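The preparation steps above can be collected into a small dry-run helper. This is a minimal sketch, not part of trtllm-bench: the function name `gpu_prep_commands` and the example values (700 W, boost level 1) are hypothetical placeholders, and you should substitute the limits reported by your own GPU. It only prints the commands so that they can be reviewed before being run.

```python
import shlex


def gpu_prep_commands(max_power_limit_watts, boost_level=None):
    """Return the nvidia-smi invocations from this section as argv lists."""
    commands = [
        ["sudo", "nvidia-smi", "-pm", "1"],  # enable persistence mode
        ["sudo", "nvidia-smi", "-rgc"],      # reset GPU clocks
        ["sudo", "nvidia-smi", "-pl", str(max_power_limit_watts)],  # power limit
    ]
    if boost_level is not None:
        # Only add the boost slider step when the GPU reports support for it.
        commands.append(
            ["sudo", "nvidia-smi", "boost-slider", "--vboost", str(boost_level)]
        )
    return commands


# Dry run: print the commands instead of executing them.
# 700 and 1 are placeholder values, not recommendations.
for argv in gpu_prep_commands(max_power_limit_watts=700, boost_level=1):
    print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run on machines without a GPU or without sudo access.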
Throughput Benchmarking#
Limitations and Caveats#
Validated Networks for Benchmarking#
While trtllm-bench should be able to run any network that TensorRT LLM supports, the networks that have been validated extensively are the same as those listed on the Performance Overview page.
Tip
trtllm-bench can automatically download the model from Hugging Face Model Hub. Export your token in the HF_TOKEN environment variable.
Supported Quantization Modes#
trtllm-bench supports the following quantization modes:
None (no quantization applied)
FP8
NVFP4
For more information about quantization, refer to Quantization and the support matrix of the supported quantization methods for each network.
Tip
Although TensorRT LLM supports more quantization modes than those listed above, trtllm-bench currently configures only a smaller subset.
Preparing a Dataset#
The throughput benchmark utilizes a fixed JSON schema to specify requests. The schema is defined as follows:
| Key | Required | Type | Description |
|---|---|---|---|
| task_id | Y | String | Unique identifier for the request. |
| prompt | N* | String | Input text for a generation request. |
| input_ids | Y* | List[Integer] | List of token IDs that make up the request prompt. |
| output_tokens | Y | Integer | Number of generated tokens for this request. |
Tip
* Specifying either prompt or input_ids is required, but you cannot define both a prompt and token IDs (input_ids) at the same time. If you specify input_ids, the prompt entry is ignored for request generation.
Refer to the following examples of valid entries for the benchmark:
Entries with a human-readable prompt and no token IDs.
{"task_id":1,"prompt":"Generate an infinite response to the following: This is the song that never ends, it goes on and on my friend.","output_tokens":1000}
{"task_id":2,"prompt":"Generate an infinite response to the following: Na, na, na, na","output_tokens":1000}
Entries that contain token IDs.
{"task_id":0,"input_ids":[863,22056,25603,11943,8932,13195,3132,25032,21747,22213],"output_tokens":128}
{"task_id":1,"input_ids":[14480,13598,15585,6591,1252,8259,30990,26778,7063,30065,21764,11023,1418],"output_tokens":128}
Tip
Specify each entry on one line. To simplify passing the data, a complete JSON entry is on each line so that the benchmarker can simply read a line and assume a complete entry. When creating a dataset, be sure that a complete JSON entry is on every line.
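The schema rules above can be checked before a benchmark run. The following is a minimal sketch of such a check, not a utility shipped with TensorRT LLM: the function name `validate_dataset_lines` is hypothetical, and skipping blank lines is an assumption about what is harmless rather than documented benchmarker behavior.

```python
import json


def validate_dataset_lines(lines):
    """Return a list of (line_number, message) problems; empty means valid."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are simply skipped
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"not a complete JSON entry: {exc.msg}"))
            continue
        if not isinstance(entry, dict):
            problems.append((number, "entry is not a JSON object"))
            continue
        for key in ("task_id", "output_tokens"):
            if key not in entry:
                problems.append((number, f"missing required key '{key}'"))
        has_prompt = "prompt" in entry
        has_ids = "input_ids" in entry
        if not has_prompt and not has_ids:
            problems.append((number, "needs either 'prompt' or 'input_ids'"))
        if has_prompt and has_ids:
            problems.append((number, "both 'prompt' and 'input_ids' are set"))
    return problems


sample = [
    '{"task_id": 1, "prompt": "Na, na, na, na", "output_tokens": 1000}',
    '{"task_id": 0, "input_ids": [863, 22056], "output_tokens": 128}',
    '{"task_id": 2, "output_tokens": 128}',  # no prompt and no input_ids
    '{"task_id": 3, "prompt": "cut off',     # entry split across lines
]
for number, message in validate_dataset_lines(sample):
    print(f"line {number}: {message}")
```

To check a real file, pass the open file object, for example `validate_dataset_lines(open(path))`, since iterating a file yields one line at a time.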
To prepare a synthetic dataset, you can use the provided script in the benchmarks/cpp directory. For example, to generate a synthetic dataset of 1000 requests with a uniform ISL/OSL of 128/128 for meta-llama/Llama-3.1-8B, run:
python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000 > /tmp/synthetic_128_128.txt
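To illustrate the shape of the file this produces, the sketch below writes entries with a fixed input and output length using random token IDs. It is an illustration only and makes no claim about how prepare_dataset.py works internally: the function name `synthetic_entries` and the `vocab_size` of 32000 are hypothetical, and random IDs are not guaranteed to be meaningful for any particular tokenizer.

```python
import json
import random


def synthetic_entries(num_requests, input_len, output_len, vocab_size=32000, seed=0):
    """Yield dataset entries with uniform input and output lengths."""
    rng = random.Random(seed)  # seeded so repeated runs give the same dataset
    for task_id in range(num_requests):
        yield {
            "task_id": task_id,
            # vocab_size is a placeholder; real IDs depend on the tokenizer.
            "input_ids": [rng.randrange(vocab_size) for _ in range(input_len)],
            "output_tokens": output_len,
        }


# One complete JSON entry per line, as the benchmarker expects.
lines = [json.dumps(entry) for entry in synthetic_entries(3, input_len=128, output_len=128)]
for line in lines:
    print(line[:70] + " ...")
```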
Running with the PyTorch Workflow#
To benchmark the PyTorch backend (tensorrt_llm._torch), use the following command with the dataset generated in the previous steps. The throughput benchmark initializes the backend by tuning against the dataset provided via --dataset (or the other build mode settings described above).
Note that CUDA graph is enabled by default. You can add additional PyTorch config with --extra_llm_api_options followed by the path to a YAML file. For more details, refer to the help text by running the command with --help.
Tip
The command below specifies the --model_path option. The model path is optional and used only when you want to run a locally stored checkpoint. When using --model_path, the --model option is still required for reporting reasons and in order to look up parameters for build heuristics.
trtllm-bench --model meta-llama/Llama-3.1-8B \
  --model_path /Ckpt/Path/To/Llama-3.1-8B \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch

# Example output
<snip verbose logging>
===========================================================
= PyTorch backend
===========================================================
Model: meta-llama/Llama-3.1-8B
Model Path: /Ckpt/Path/To/Llama-3.1-8B
TensorRT LLM Version: 0.17.0
Dtype: bfloat16
KV Cache Dtype: None
Quantization: FP8
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 1
PP Size: 1
Max Runtime Batch Size: 2048
Max Runtime Tokens: 4096
Scheduling Policy: Guaranteed No Evict
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 7.6753E+14
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Number of requests: 3000
Average Input Length (tokens): 128.0000
Average Output Length (tokens): 128.0000
Token Throughput (tokens/sec): 20685.5510
Request Throughput (req/sec): 161.6059
Total Latency (ms): 18563.6825
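The figures in the performance overview are related by simple arithmetic, which is useful for sanity-checking a run. The sketch below recomputes the two throughput values from the request count, average output length, and total latency in the example output. It assumes that token throughput counts output tokens only; that reading is inferred from the numbers agreeing to within rounding, not from the trtllm-bench source.

```python
# Values copied from the example output above.
num_requests = 3000
avg_output_tokens = 128.0
total_latency_ms = 18563.6825

total_latency_s = total_latency_ms / 1000.0

# Requests completed per second over the whole run.
request_throughput = num_requests / total_latency_s

# Output tokens generated per second over the whole run
# (assumption: input tokens are not counted in this metric).
token_throughput = num_requests * avg_output_tokens / total_latency_s

print(f"Request throughput: {request_throughput:.4f} req/sec")
print(f"Token throughput:   {token_throughput:.4f} tokens/sec")
```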
When enabling streaming, time to first token (TTFT) and inter-token latency (ITL) metrics will also be recorded.
trtllm-bench --model meta-llama/Llama-3.1-8B \
  --model_path /Ckpt/Path/To/Llama-3.1-8B \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch
Alternatively, users can benchmark the low latency mode:
trtllm-bench --model meta-llama/Llama-3.1-8B \
  --model_path /Ckpt/Path/To/Llama-3.1-8B \
  latency \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch
Benchmarking with LoRA Adapters in PyTorch workflow#
The PyTorch workflow supports benchmarking with LoRA (Low-Rank Adaptation) adapters. This requires preparing a dataset with LoRA metadata and configuring the LoRA settings.
Preparing LoRA Dataset
Use prepare_dataset.py with LoRA-specific options to generate requests with LoRA metadata:
python3 benchmarks/cpp/prepare_dataset.py \
  --stdout \
  --rand-task-id 0 1 \
  --tokenizer /path/to/tokenizer \
  --lora-dir /path/to/loras \
  token-norm-dist \
  --num-requests 100 \
  --input-mean 128 \
  --output-mean 128 \
  --input-stdev 16 \
  --output-stdev 24 \
  > synthetic_lora_data.json
Key LoRA options:
--lora-dir: Parent directory containing LoRA adapter subdirectories named by their task IDs (e.g., 0/, 1/, etc.)
--rand-task-id: Range of LoRA task IDs to randomly assign to requests
--task-id: Fixed LoRA task ID for all requests (alternative to --rand-task-id)
The generated dataset will include LoRA request metadata. Below is an example of a single such request data entry:
{"task_id":0,"input_ids":[3452,88226,102415,...],"output_tokens":152,"lora_request":{"lora_name":"lora_0","lora_int_id":0,"lora_path":"/path/to/loras/0"}}
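The relationship between a task ID, the --lora-dir layout, and the lora_request metadata can be sketched as follows. The naming pattern (`lora_<id>` and `<lora-dir>/<id>`) is inferred from the single example entry above and is an assumption, not a documented contract; the helper name `lora_request_for` is hypothetical, and the `input_ids` list is shortened to the three IDs visible in the example.

```python
import json
import posixpath


def lora_request_for(task_id, lora_dir):
    """Build lora_request metadata following the pattern of the example entry."""
    return {
        "lora_name": f"lora_{task_id}",
        "lora_int_id": task_id,
        # Adapter files are expected in a subdirectory named after the task ID.
        "lora_path": posixpath.join(lora_dir, str(task_id)),
    }


entry = {
    "task_id": 0,
    "input_ids": [3452, 88226, 102415],  # shortened placeholder prompt
    "output_tokens": 152,
    "lora_request": lora_request_for(0, "/path/to/loras"),
}
print(json.dumps(entry))
```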
LoRA Configuration
Create an extra-llm-api-options.yaml file with LoRA configuration:
lora_config:
  lora_dir:
    - /path/to/loras/0
    - /path/to/loras/1
  max_lora_rank: 64
  lora_target_modules:
    - attn_q
    - attn_k
    - attn_v
  trtllm_modules_to_hf_modules:
    attn_q: q_proj
    attn_k: k_proj
    attn_v: v_proj
Running LoRA Benchmark
trtllm-bench --model /path/to/base/model \
  throughput \
  --dataset synthetic_lora_data.json \
  --backend pytorch \
  --extra_llm_api_options extra-llm-api-options.yaml
Note
The LoRA directory structure should have task-specific subdirectories named by their task IDs (e.g., loras/0/, loras/1/). Each subdirectory should contain the LoRA adapter files for that specific task.
Running multi-modal models in the PyTorch Workflow#
To benchmark multi-modal models with the PyTorch workflow, follow an approach similar to the one above.
First, prepare the dataset:
python ./benchmarks/cpp/prepare_dataset.py \
  --tokenizer Qwen/Qwen2-VL-2B-Instruct \
  --stdout \
  dataset \
  --dataset-name lmms-lab/MMMU \
  --dataset-split test \
  --dataset-image-key image \
  --dataset-prompt-key question \
  --num-requests 10 \
  --output-len-dist 128,5 > mm_data.jsonl
This downloads the media files to the /tmp directory and prepares the dataset with their paths. Note that the prompt fields are text and not tokenized IDs, because the prompt and the media (image/video) are processed by a preprocessor for multimodal files.
Sample dataset for multimodal:
{"task_id":0,"prompt":"Brahma Industries sells vinyl replacement windows to home improvement retailers nationwide. The national sales manager believes that if they invest an additional $25,000 in advertising, they would increase sales volume by 10,000 units. <image 1> What is the total contribution margin?","media_paths":["/tmp/tmp9so41y3r.jpg"],"output_tokens":126}
{"task_id":1,"prompt":"Let us compute for the missing amounts under work in process inventory, what is the cost of goods manufactured? <image 1>","media_paths":["/tmp/tmpowsrb_f4.jpg"],"output_tokens":119}
{"task_id":2,"prompt":"Tsuji is reviewing the price of a 3-month Japanese yen/U.S. dollar currency futures contract, using the currency and interest rate data shown below. Because the 3-month Japanese interest rate has just increased to .50%, Itsuji recognizes that an arbitrage opportunity exists nd decides to borrow $1 million U.S. dollars to purchase Japanese yen. Calculate the yen arbitrage profit from Itsuji's strategy, using the following data: <image 1> ","media_paths":["/tmp/tmpxhdvasex.jpg"],"output_tokens":126}
...
Run the benchmark:
trtllm-bench --model Qwen/Qwen2-VL-2B-Instruct \
  throughput \
  --dataset mm_data.jsonl \
  --backend pytorch \
  --num_requests 10 \
  --max_batch_size 4 \
  --modality image
Sample output:
===========================================================
= REQUEST DETAILS
===========================================================
Number of requests: 10
Number of concurrent requests: 5.3019
Average Input Length (tokens): 411.6000
Average Output Length (tokens): 128.7000
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 1
PP Size: 1
EP Size: None
Max Runtime Batch Size: 4
Max Runtime Tokens: 12288
Scheduling Policy: GUARANTEED_NO_EVICT
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 1.4117E+17
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec): 1.4439
Total Output Throughput (tokens/sec): 185.8351
Per User Output Throughput (tokens/sec/user): 38.1959
Per GPU Output Throughput (tokens/sec/gpu): 185.8351
Total Token Throughput (tokens/sec): 780.1607
Total Latency (ms): 6925.4963
Average request latency (ms): 3671.8441
-- Request Latency Breakdown (ms) -----------------------
[Latency] P50: 3936.3022
[Latency] P90: 5514.4701
[Latency] P95: 5514.4701
[Latency] P99: 5514.4701
[Latency] MINIMUM: 2397.1047
[Latency] MAXIMUM: 5514.4701
[Latency] AVERAGE: 3671.8441
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path: /workspaces/tensorrt_llm/mm_data.jsonl
Number of Sequences: 10
-- Percentiles statistics ---------------------------------
        Input        Output       Seq. Length
-----------------------------------------------------------
MIN:    167.0000     119.0000     300.0000
MAX:    1059.0000    137.0000     1178.0000
AVG:    411.6000     128.7000     540.3000
P50:    299.0000     128.0000     427.0000
P90:    1059.0000    137.0000     1178.0000
P95:    1059.0000    137.0000     1178.0000
P99:    1059.0000    137.0000     1178.0000
===========================================================
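Two of the reported values in the sample output can be reproduced from the others, which helps when reading the report. The sketch below treats the number of concurrent requests as the summed request latency divided by the wall-clock latency, and the total token throughput as input plus output tokens per second. Both formulas are assumptions inferred from the sample numbers matching to within rounding; they are not taken from the trtllm-bench implementation.

```python
# Values copied from the sample output above.
num_requests = 10
avg_request_latency_ms = 3671.8441
total_latency_ms = 6925.4963
avg_input_tokens = 411.6
avg_output_tokens = 128.7

# Average number of requests in flight: total time spent inside requests
# divided by the wall-clock duration of the run.
avg_concurrency = num_requests * avg_request_latency_ms / total_latency_ms

# Input and output tokens processed per second across the whole run.
total_tokens = num_requests * (avg_input_tokens + avg_output_tokens)
total_token_throughput = total_tokens / (total_latency_ms / 1000.0)

print(f"Concurrent requests:    {avg_concurrency:.4f}")
print(f"Total token throughput: {total_token_throughput:.4f} tokens/sec")
```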
Notes and Limitations:
Only image datasets are supported for now.
--output-len-dist is a required argument for multimodal datasets.
Tokenizer is unused during the prepare step but it is still a required argument.
Since the images are converted to tokens when the model is run, trtllm-bench uses a large default value for the maximum input sequence length when setting up the execution settings. You can modify this behavior by specifying a different value with the --max_input_len flag that suits your use case.
Quantization in the PyTorch Flow#
To run a quantized benchmark with trtllm-bench using the PyTorch flow, you need a pre-quantized checkpoint. For the Llama-3.1 models, TensorRT LLM provides the following checkpoints via HuggingFace:
To understand more about how to quantize your own checkpoints, refer to the ModelOpt documentation.
trtllm-bench uses the hf_quant_config.json file present in the pre-quantized checkpoints above. The configuration file is present in checkpoints quantized with TensorRT Model Optimizer and describes the compute and KV cache quantization that the checkpoint was compiled with. For example, from the checkpoints above:
{"producer":{"name":"modelopt","version":"0.23.0rc1"},"quantization":{"quant_algo":"FP8","kv_cache_quant_algo":null}}
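The two fields that matter for benchmarking can be read from this file with a few lines of Python. This is a minimal sketch for inspecting a checkpoint by hand; the helper name `read_quantization` is hypothetical and the code is not how trtllm-bench itself loads the file.

```python
import json

# The example configuration shown above, embedded so the sketch is self-contained.
HF_QUANT_CONFIG = (
    '{"producer":{"name":"modelopt","version":"0.23.0rc1"},'
    '"quantization":{"quant_algo":"FP8","kv_cache_quant_algo":null}}'
)


def read_quantization(config_text):
    """Return (compute quantization, KV cache quantization) from the config text."""
    quant = json.loads(config_text).get("quantization", {})
    # JSON null becomes Python None, meaning no KV cache quantization is specified.
    return quant.get("quant_algo"), quant.get("kv_cache_quant_algo")


compute_quant, kv_cache_quant = read_quantization(HF_QUANT_CONFIG)
print("Compute quantization: ", compute_quant)
print("KV cache quantization:", kv_cache_quant)
```

For a checkpoint on disk, read the text of its hf_quant_config.json file and pass it to the same function.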
The checkpoints above are quantized to run with a compute precision of FP8 and default to no KV cache quantization (full FP16 cache). When running trtllm-bench throughput, the benchmark automatically selects the KV cache quantization best suited to the compute precision in the checkpoint if kv_cache_quant_algo is specified as null; otherwise, it is forced to match the specified non-null KV cache quantization. The following are the mappings that trtllm-bench follows when a checkpoint does not specify a KV cache quantization algorithm:
| Checkpoint Compute Quant | Checkpoint KV Cache Quant | | Note |
|---|---|---|---|
| | | | In this case, a quantization config doesn’t exist. |
| | | | Matches the checkpoint |
| | | | Set to |
| | | | Set to |
If you would like to force the KV cache quantization, you can specify the following in the YAML file to force the precision when the checkpoint precision is null:
kv_cache_config:
  dtype: fp8
Tip
The two valid values for kv_cache_config.dtype are auto and fp8.
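When generating the options file from a script, it can help to reject unsupported values before the benchmark starts. The sketch below renders the YAML fragment for one of the two valid dtype values; the helper name `render_kv_cache_config` is hypothetical, and the output is intended to be written to the file passed via --extra_llm_api_options.

```python
VALID_KV_CACHE_DTYPES = ("auto", "fp8")


def render_kv_cache_config(dtype):
    """Return the kv_cache_config YAML fragment for a valid dtype value."""
    if dtype not in VALID_KV_CACHE_DTYPES:
        raise ValueError(
            f"kv_cache_config.dtype must be one of {VALID_KV_CACHE_DTYPES}, got {dtype!r}"
        )
    # Two-space indentation nests dtype under kv_cache_config.
    return f"kv_cache_config:\n  dtype: {dtype}\n"


print(render_kv_cache_config("fp8"), end="")
```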