TensorRT LLM Benchmarking#

Important

This benchmarking suite is a work in progress. Expect breaking API changes.

TensorRT LLM provides the trtllm-bench CLI, a packaged benchmarking utility that aims to make it easier for users to reproduce our officially published performance overview. trtllm-bench provides the following:

A streamlined way to build tuned engines for benchmarking for a variety of models and platforms.
An entirely Python workflow for benchmarking.
Ability to benchmark various flows and features within TensorRT LLM.

trtllm-bench executes all benchmarks using in-flight batching. For more information, see the in-flight batching section, which describes the concept in further detail.

To benchmark the OpenAI-compatible trtllm-serve, please refer to the run benchmarking with trtllm-serve section.

Before Benchmarking#

For rigorous benchmarking where consistent and reproducible results are critical, proper GPU configuration is essential. These settings help maximize GPU utilization, eliminate performance variability, and ensure optimal conditions for accurate measurements. While not strictly required for normal operation, we recommend applying these configurations when conducting performance comparisons or publishing benchmark results.

Persistence mode#

Ensure persistence mode is enabled to maintain consistent GPU state:

sudo nvidia-smi -pm 1

GPU Clock Management#

Allow the GPU to dynamically adjust its clock speeds based on workload and temperature. While locking clocks at maximum frequency might seem beneficial, it can sometimes lead to thermal throttling and reduced performance. Reset GPU clocks using:

sudo nvidia-smi -rgc

Set power limits#

First query the maximum power limit:

nvidia-smi -q -d POWER

Then configure the GPU to operate at its maximum power limit for consistent performance:

sudo nvidia-smi -pl <max_power_limit>

Boost settings#

Some GPUs support boost levels. First query the available boost levels:

sudo nvidia-smi boost-slider -l

If supported, enable the boost slider using one of the available levels for maximum performance:

sudo nvidia-smi boost-slider --vboost <max_boost_slider>
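The settings in this section can be collected into one pre-benchmark checklist. The sketch below only assembles the nvidia-smi command lines shown above as argument lists and prints them; it does not execute anything. The function name `gpu_setup_commands`, its parameters, and the example values are hypothetical. The real power limit and boost level must come from the query commands above.

```python
from typing import List, Optional


def gpu_setup_commands(max_power_limit: int,
                       boost_level: Optional[int] = None) -> List[List[str]]:
    """Return the nvidia-smi invocations from this section as argument lists."""
    commands = [
        ["sudo", "nvidia-smi", "-pm", "1"],                   # enable persistence mode
        ["sudo", "nvidia-smi", "-rgc"],                       # reset GPU clocks
        ["sudo", "nvidia-smi", "-pl", str(max_power_limit)],  # set the power limit
    ]
    if boost_level is not None:
        # Only relevant when the boost-slider query reports supported levels.
        commands.append(
            ["sudo", "nvidia-smi", "boost-slider", "--vboost", str(boost_level)]
        )
    return commands


if __name__ == "__main__":
    # 700 and 1 are placeholder values, not recommendations.
    for cmd in gpu_setup_commands(max_power_limit=700, boost_level=1):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before running anything with sudo.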

Throughput Benchmarking#

Limitations and Caveats#

Validated Networks for Benchmarking#

While trtllm-bench should be able to run any network that TensorRT LLM supports, the following networks have been validated extensively. This is the same list that appears on the Performance Overview page.

Tip

trtllm-bench can automatically download the model from Hugging Face Model Hub. Export your token in the HF_TOKEN environment variable.

Supported Quantization Modes#

trtllm-bench supports the following quantization modes:

None (no quantization applied)
FP8
NVFP4

For more information about quantization, refer to Quantization and the support matrix of the supported quantization methods for each network.

Tip

Although TensorRT LLM supports more quantization modes than listed above, trtllm-bench currently configures only this smaller subset.

Preparing a Dataset#

The throughput benchmark utilizes a fixed JSON schema to specify requests. The schema is defined as follows:

Key	Required	Type	Description
`task_id`	Y	String	Unique identifier for the request.
`prompt`	N*	String	Input text for a generation request.
`input_ids`	Y*	List[Integer]	List of token IDs that make up the request prompt.
`output_tokens`	Y	Integer	Number of generated tokens for this request.

Tip

* Specifying prompt or input_ids is required. However, you cannot have both a prompt and token IDs (input_ids) defined at the same time. If you specify input_ids, the prompt entry is ignored for request generation.
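The schema rules above can be checked before a run. The following is a minimal sketch of such a check, not part of trtllm-bench; the function name `validate_entry` and its messages are hypothetical. It does not check the type of `task_id`, because the table lists it as a string while the examples on this page use integers.

```python
from typing import Any, Dict, List


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the entry looks valid."""
    problems = []
    if "task_id" not in entry:
        problems.append("missing task_id")

    has_prompt = isinstance(entry.get("prompt"), str)
    ids = entry.get("input_ids")
    has_ids = (
        isinstance(ids, list)
        and len(ids) > 0
        and all(isinstance(t, int) for t in ids)
    )
    if not has_prompt and not has_ids:
        problems.append("need either prompt or input_ids")
    if has_prompt and has_ids:
        problems.append("prompt and input_ids are both set")

    out = entry.get("output_tokens")
    if not isinstance(out, int) or isinstance(out, bool) or out <= 0:
        problems.append("output_tokens must be a positive integer")
    return problems


if __name__ == "__main__":
    print(validate_entry({"task_id": 1, "prompt": "Na, na, na, na", "output_tokens": 1000}))
    print(validate_entry({"task_id": 2, "output_tokens": 128}))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a large dataset in a single pass.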

Refer to the following examples of valid entries for the benchmark:

Entries with a human-readable prompt and no token IDs.

{"task_id":1,"prompt":"Generate an infinite response to the following: This is the song that never ends, it goes on and on my friend.","output_tokens":1000}
{"task_id":2,"prompt":"Generate an infinite response to the following: Na, na, na, na","output_tokens":1000}

Entries which contain token IDs.

{"task_id":0,"input_ids":[863,22056,25603,11943,8932,13195,3132,25032,21747,22213],"output_tokens":128}
{"task_id":1,"input_ids":[14480,13598,15585,6591,1252,8259,30990,26778,7063,30065,21764,11023,1418],"output_tokens":128}

Tip

Specify each entry on one line. To simplify passing the data, a complete JSON entry is on each line so that the benchmarker can simply read a line and assume a complete entry. When creating a dataset, be sure that a complete JSON entry is on every line.
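One way to satisfy the one-entry-per-line rule is to serialize each request with `json.dumps`, which escapes any newline inside a string and so never spreads one entry over several lines. The sketch below writes and re-reads such a file; the helper names `write_dataset` and `read_dataset` are hypothetical and are not part of trtllm-bench.

```python
import json
import os
import tempfile


def write_dataset(path, entries):
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def read_dataset(path):
    """Parse each non-empty line as one complete JSON entry."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    entries = [
        {"task_id": 1, "prompt": "Na, na, na, na", "output_tokens": 1000},
        {"task_id": 2, "input_ids": [863, 22056, 25603], "output_tokens": 128},
    ]
    path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl")
    write_dataset(path, entries)
    # Round trip: every line parses back into the entry that was written.
    assert read_dataset(path) == entries
```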

In order to prepare a synthetic dataset, you can use the provided script in the benchmarks/cpp directory. For example, to generate a synthetic dataset of 1000 requests with a uniform ISL/OSL of 128/128 for meta-llama/Llama-3.1-8B, run:

python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000 > /tmp/synthetic_128_128.txt

Running with the PyTorch Workflow#

To benchmark the PyTorch backend (tensorrt_llm._torch), use the following command with the dataset generated in the previous steps. The throughput benchmark initializes the backend by tuning against the dataset provided via --dataset (or the other build mode settings described above).

Note that CUDA graph is enabled by default. You can add additional PyTorch config with --extra_llm_api_options followed by the path to a YAML file. For more details, please refer to the help text by running the command with --help.

Tip

The command below specifies the --model_path option. The model path is optional and used only when you want to run a locally stored checkpoint. When using --model_path, the --model is still required for reporting reasons and in order to look up parameters for build heuristics.

trtllm-bench --model meta-llama/Llama-3.1-8B \
  --model_path /Ckpt/Path/To/Llama-3.1-8B \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch

# Example output
<snip verbose logging>
===========================================================
= PyTorch backend
===========================================================
Model:                  meta-llama/Llama-3.1-8B
Model Path:             /Ckpt/Path/To/Llama-3.1-8B
TensorRT LLM Version:   0.17.0
Dtype:                  bfloat16
KV Cache Dtype:         None
Quantization:           FP8

===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size:                1
PP Size:                1
Max Runtime Batch Size: 2048
Max Runtime Tokens:     4096
Scheduling Policy:      Guaranteed No Evict
KV Memory Percentage:   90.00%
Issue Rate (req/sec):   7.6753E+14

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Number of requests:             3000
Average Input Length (tokens):  128.0000
Average Output Length (tokens): 128.0000
Token Throughput (tokens/sec):  20685.5510
Request Throughput (req/sec):   161.6059
Total Latency (ms):             18563.6825
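For scripted comparisons, it can help to turn the "Label: value" lines of the report into a dictionary. The sketch below is one hedged way to do that with a regular expression; `parse_report` is a hypothetical helper, and the report layout is assumed to match the example output above. As a sanity check, it also recomputes the two throughput figures from the request count, average output length, and total latency. The sample numbers appear consistent with that arithmetic, which is an observation about this output and not a statement about how trtllm-bench computes them internally.

```python
import re


def parse_report(text):
    """Collect 'Label: value' lines into a dict, converting numeric values to float."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:=]+?):\s+(\S.*)$", line)
        if not match:
            continue  # separator lines and section titles have no 'Label: value' shape
        key, value = match.group(1).strip(), match.group(2).strip()
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # keep non-numeric values such as '90.00%' as text
    return metrics


# A few lines copied from the example output above.
SAMPLE = """\
Number of requests:             3000
Average Output Length (tokens): 128.0000
Token Throughput (tokens/sec):  20685.5510
Request Throughput (req/sec):   161.6059
Total Latency (ms):             18563.6825
"""

m = parse_report(SAMPLE)
seconds = m["Total Latency (ms)"] / 1000.0
total_output_tokens = m["Number of requests"] * m["Average Output Length (tokens)"]

# Both recomputed values land close to the reported throughput figures.
print(total_output_tokens / seconds, m["Token Throughput (tokens/sec)"])
print(m["Number of requests"] / seconds, m["Request Throughput (req/sec)"])
```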

When enabling streaming, time to first token (TTFT) and inter-token latency (ITL) metrics will also be recorded.

trtllm-bench --model meta-llama/Llama-3.1-8B \
  --model_path /Ckpt/Path/To/Llama-3.1-8B \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch
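To make the two streaming metrics concrete, the sketch below computes them for a single request from per-token arrival timestamps, using the common definitions: TTFT is the delay from request start to the first token, and ITL is the mean gap between consecutive tokens. This is an illustration under those assumed definitions. The exact definitions and aggregation used by trtllm-bench may differ, and the function name and timestamps are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def ttft_and_itl(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean ITL) in the same time unit as the inputs."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    # Gaps between each token and the one that follows it.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl


# Hypothetical timestamps in milliseconds for one streamed request.
ttft, itl = ttft_and_itl(0.0, [50.0, 60.0, 70.0, 82.0])
print(f"TTFT={ttft:.1f} ms, mean ITL={itl:.2f} ms")
```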

Alternatively, users can benchmark the low latency mode:

trtllm-bench --model meta-llama/Llama-3.1-8B \
  --model_path /Ckpt/Path/To/Llama-3.1-8B \
  latency \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch

Benchmarking with LoRA Adapters in PyTorch workflow#

The PyTorch workflow supports benchmarking with LoRA (Low-Rank Adaptation) adapters. This requires preparing a dataset with LoRA metadata and configuring the LoRA settings.

Preparing LoRA Dataset

Use prepare_dataset.py with LoRA-specific options to generate requests with LoRA metadata:

python3 benchmarks/cpp/prepare_dataset.py \
  --stdout \
  --rand-task-id 0 1 \
  --tokenizer /path/to/tokenizer \
  --lora-dir /path/to/loras \
  token-norm-dist \
  --num-requests 100 \
  --input-mean 128 \
  --output-mean 128 \
  --input-stdev 16 \
  --output-stdev 24 \
  > synthetic_lora_data.json

Key LoRA options:

--lora-dir: Parent directory containing LoRA adapter subdirectories named by their task IDs (e.g., 0/, 1/, etc.)
--rand-task-id: Range of LoRA task IDs to randomly assign to requests
--task-id: Fixed LoRA task ID for all requests (alternative to --rand-task-id)

The generated dataset will include LoRA request metadata. Below is an example of a single such request data entry:

{"task_id":0,"input_ids":[3452,88226,102415,...],"output_tokens":152,"lora_request":{"lora_name":"lora_0","lora_int_id":0,"lora_path":"/path/to/loras/0"}}
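The `lora_request` block in the entry above appears to follow a simple pattern derived from the task ID and the `--lora-dir` value. The sketch below reproduces that pattern for hand-written datasets. The naming scheme is inferred from this single example, so treat it as an assumption and not as the script's documented behavior; `make_lora_request` is a hypothetical helper.

```python
import json
import posixpath


def make_lora_request(task_id: int, lora_dir: str) -> dict:
    """Build a lora_request block for one task ID, mirroring the example entry."""
    return {
        "lora_name": f"lora_{task_id}",
        "lora_int_id": task_id,
        "lora_path": posixpath.join(lora_dir, str(task_id)),
    }


entry = {
    "task_id": 0,
    "input_ids": [3452, 88226, 102415],  # placeholder: a shortened token list
    "output_tokens": 152,
    "lora_request": make_lora_request(0, "/path/to/loras"),
}
print(json.dumps(entry))
```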

LoRA Configuration

Create an extra-llm-api-options.yaml file with LoRA configuration:

lora_config:
  lora_dir:
    - /path/to/loras/0
    - /path/to/loras/1
  max_lora_rank: 64
  lora_target_modules:
    - attn_q
    - attn_k
    - attn_v
  trtllm_modules_to_hf_modules:
    attn_q: q_proj
    attn_k: k_proj
    attn_v: v_proj

Running LoRA Benchmark

trtllm-bench --model /path/to/base/model \
  throughput \
  --dataset synthetic_lora_data.json \
  --backend pytorch \
  --extra_llm_api_options extra-llm-api-options.yaml

Note

The LoRA directory structure should have task-specific subdirectories named by their task IDs (e.g., loras/0/, loras/1/). Each subdirectory should contain the LoRA adapter files for that specific task.
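Before launching a LoRA benchmark, the directory layout described in the note can be checked with a few lines of `pathlib`. This is a minimal sketch that only checks that each task ID has a matching subdirectory; it does not inspect the adapter files inside. The helper name `missing_lora_dirs` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_lora_dirs(lora_dir: str, task_ids: Iterable[int]) -> List[int]:
    """Return the task IDs that have no matching subdirectory under lora_dir."""
    root = Path(lora_dir)
    return [tid for tid in task_ids if not (root / str(tid)).is_dir()]


if __name__ == "__main__":
    # Build a throwaway layout in which only the "0" subdirectory exists.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "0").mkdir()
        print(missing_lora_dirs(tmp, [0, 1]))
```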

Running multi-modal models in the PyTorch Workflow#

To benchmark multi-modal models with the PyTorch workflow, follow a similar approach to the one above.

First, prepare the dataset:

python ./benchmarks/cpp/prepare_dataset.py \
  --tokenizer Qwen/Qwen2-VL-2B-Instruct \
  --stdout \
  dataset \
  --dataset-name lmms-lab/MMMU \
  --dataset-split test \
  --dataset-image-key image \
  --dataset-prompt-key question \
  --num-requests 10 \
  --output-len-dist 128,5 > mm_data.jsonl

The script downloads the media files to the /tmp directory and prepares the dataset with their paths. Note that the prompt fields are text, not tokenized IDs, because the prompt and the media (image/video) are processed by a preprocessor for multimodal files.

Sample dataset for multimodal:

{"task_id":0,"prompt":"Brahma Industries sells vinyl replacement windows to home improvement retailers nationwide. The national sales manager believes that if they invest an additional $25,000 in advertising, they would increase sales volume by 10,000 units. <image 1> What is the total contribution margin?","media_paths":["/tmp/tmp9so41y3r.jpg"],"output_tokens":126}
{"task_id":1,"prompt":"Let us compute for the missing amounts under work in process inventory, what is the cost of goods manufactured? <image 1>","media_paths":["/tmp/tmpowsrb_f4.jpg"],"output_tokens":119}
{"task_id":2,"prompt":"Tsuji is reviewing the price of a 3-month Japanese yen/U.S. dollar currency futures contract, using the currency and interest rate data shown below. Because the 3-month Japanese interest rate has just increased to .50%, Itsuji recognizes that an arbitrage opportunity exists nd decides to borrow $1 million U.S. dollars to purchase Japanese yen. Calculate the yen arbitrage profit from Itsuji's strategy, using the following data: <image 1> ","media_paths":["/tmp/tmpxhdvasex.jpg"],"output_tokens":126}
...
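Because the dataset stores paths to files under /tmp and not the media itself, entries can go stale if those files are cleaned up between the prepare step and the benchmark run. The sketch below flags entries whose `media_paths` are empty or no longer exist on disk. It is an illustrative check and not part of trtllm-bench; `entries_with_missing_media` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def entries_with_missing_media(jsonl_text: str) -> List[int]:
    """Return task_ids whose media_paths list is empty or points at missing files."""
    bad = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        paths = entry.get("media_paths", [])
        if not paths or not all(Path(p).is_file() for p in paths):
            bad.append(entry["task_id"])
    return bad


if __name__ == "__main__":
    # One entry points at a real temporary file, the other at a path that does not exist.
    with tempfile.NamedTemporaryFile(suffix=".jpg") as img:
        sample = "\n".join([
            json.dumps({"task_id": 0, "prompt": "Describe <image 1>",
                        "media_paths": [img.name], "output_tokens": 8}),
            json.dumps({"task_id": 1, "prompt": "Describe <image 1>",
                        "media_paths": ["/nonexistent/x.jpg"], "output_tokens": 8}),
        ])
        print(entries_with_missing_media(sample))
```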

Run the benchmark:

trtllm-bench --model Qwen/Qwen2-VL-2B-Instruct \
  throughput \
  --dataset mm_data.jsonl \
  --backend pytorch \
  --num_requests 10 \
  --max_batch_size 4 \
  --modality image

Sample output:

===========================================================
= REQUEST DETAILS
===========================================================
Number of requests:             10
Number of concurrent requests:  5.3019
Average Input Length (tokens):  411.6000
Average Output Length (tokens): 128.7000
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size:                1
PP Size:                1
EP Size:                None
Max Runtime Batch Size: 4
Max Runtime Tokens:     12288
Scheduling Policy:      GUARANTEED_NO_EVICT
KV Memory Percentage:   90.00%
Issue Rate (req/sec):   1.4117E+17

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     1.4439
Total Output Throughput (tokens/sec):             185.8351
Per User Output Throughput (tokens/sec/user):     38.1959
Per GPU Output Throughput (tokens/sec/gpu):       185.8351
Total Token Throughput (tokens/sec):              780.1607
Total Latency (ms):                               6925.4963
Average request latency (ms):                     3671.8441

-- Request Latency Breakdown (ms) -----------------------

[Latency] P50    : 3936.3022
[Latency] P90    : 5514.4701
[Latency] P95    : 5514.4701
[Latency] P99    : 5514.4701
[Latency] MINIMUM: 2397.1047
[Latency] MAXIMUM: 5514.4701
[Latency] AVERAGE: 3671.8441

===========================================================
= DATASET DETAILS
===========================================================
Dataset Path:         /workspaces/tensorrt_llm/mm_data.jsonl
Number of Sequences:  10

-- Percentiles statistics ---------------------------------

        Input              Output           Seq. Length
-----------------------------------------------------------
MIN:   167.0000           119.0000           300.0000
MAX:  1059.0000           137.0000          1178.0000
AVG:   411.6000           128.7000           540.3000
P50:   299.0000           128.0000           427.0000
P90:  1059.0000           137.0000          1178.0000
P95:  1059.0000           137.0000          1178.0000
P99:  1059.0000           137.0000          1178.0000
===========================================================
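Several figures in the sample output can be related to each other with simple arithmetic, which helps when reading a report. The sketch below recomputes request throughput, output throughput, and total token throughput from the request count, average lengths, and total latency. It also estimates the number of concurrent requests as throughput multiplied by average request latency (Little's law). The sample values appear consistent with these relations, but that is an observation about this output and not a description of the tool's internal formulas.

```python
# Values copied from the sample output above.
num_requests = 10
avg_input_len = 411.6                      # tokens
avg_output_len = 128.7                     # tokens
total_latency_s = 6925.4963 / 1000.0       # Total Latency (ms) in seconds
avg_request_latency_s = 3671.8441 / 1000.0 # Average request latency (ms) in seconds

request_throughput = num_requests / total_latency_s
output_throughput = num_requests * avg_output_len / total_latency_s
total_token_throughput = (
    num_requests * (avg_input_len + avg_output_len) / total_latency_s
)
# Little's law: mean concurrency = arrival rate x mean time in the system.
concurrency = request_throughput * avg_request_latency_s

print(f"request throughput     ~ {request_throughput:.4f} req/sec")
print(f"output throughput      ~ {output_throughput:.4f} tokens/sec")
print(f"total token throughput ~ {total_token_throughput:.4f} tokens/sec")
print(f"concurrent requests    ~ {concurrency:.4f}")
```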

Notes and Limitations:

Only image datasets are supported for now.
--output-len-dist is a required argument for multimodal datasets.
Tokenizer is unused during the prepare step but it is still a required argument.
Since the images are converted to tokens when the model is run, trtllm-bench uses a default large value for the maximum input sequence length when setting up the execution settings. You can also modify the behavior by specifying a different value with the flag --max_input_len that suits your use-case.

Quantization in the PyTorch Flow#

To run a quantized benchmark with trtllm-bench utilizing the PyTorch flow, you will need to use a pre-quantized checkpoint. For the Llama-3.1 models, TensorRT LLM provides the following checkpoints via HuggingFace:

To understand more about how to quantize your own checkpoints, refer to the ModelOpt documentation.

trtllm-bench utilizes the hf_quant_config.json file present in the pre-quantized checkpoints above. The configuration file is present in checkpoints quantized with TensorRT Model Optimizer and describes the compute and KV cache quantization that the checkpoint was compiled with. For example, from the checkpoints above:

{
  "producer": {
    "name": "modelopt",
    "version": "0.23.0rc1"
  },
  "quantization": {
    "quant_algo": "FP8",
    "kv_cache_quant_algo": null
  }
}

The checkpoints above are quantized to run with a compute precision of FP8 and default to no KV cache quantization (full FP16 cache). When running trtllm-bench throughput, the benchmark automatically selects the KV cache quantization best suited to the checkpoint's compute precision if kv_cache_quant_algo is specified as null; otherwise, it is forced to match the specified non-null KV cache quantization. The following are the mappings that trtllm-bench follows when a checkpoint does not specify a KV cache quantization algorithm:

Checkpoint Compute Quant	Checkpoint KV Cache Quant	`trtllm-bench` KV Cache Quant	Note
`null`	`null`	`null`	In this case, a quantization config doesn’t exist.
`FP8`	`FP8`	`FP8`	Matches the checkpoint
`FP8`	`null`	`FP8`	Set to `FP8` via benchmark
`NVFP4`	`null`	`FP8`	Set to `FP8` via benchmark
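The mapping in the table can be written as a small decision function. The sketch below is an illustration of the table only, not the actual trtllm-bench implementation; `select_kv_cache_quant` is a hypothetical name, and it takes the parsed contents of a hf_quant_config.json file (or `None` when no such file exists).

```python
import json
from typing import Optional


def select_kv_cache_quant(quant_config: Optional[dict]) -> Optional[str]:
    """Return the KV cache quantization per the table above (None means unquantized)."""
    if not quant_config:
        return None  # no quantization config exists
    quantization = quant_config.get("quantization", {})
    compute = quantization.get("quant_algo")
    kv_cache = quantization.get("kv_cache_quant_algo")
    if kv_cache is not None:
        return kv_cache  # a non-null checkpoint value is matched as-is
    if compute in ("FP8", "NVFP4"):
        return "FP8"     # set to FP8 by the benchmark
    return None


# The example configuration shown earlier in this section.
config = json.loads(
    '{"producer": {"name": "modelopt", "version": "0.23.0rc1"},'
    ' "quantization": {"quant_algo": "FP8", "kv_cache_quant_algo": null}}'
)
print(select_kv_cache_quant(config))
```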

If you would like to force the KV cache quantization, you can specify the following in the YAML file to force the precision when the checkpoint precision is null:

kv_cache_config:
  dtype: fp8

Tip

The two valid values for kv_cache_config.dtype are auto and fp8.
