trtllm-eval#
About#
The trtllm-eval command provides developers with a unified entry point for accuracy evaluation. It shares the core evaluation logic with the accuracy test suite of TensorRT LLM.
trtllm-eval is built on the offline API, the LLM API. Compared to the online trtllm-serve, the offline API provides clearer error messages and simplifies the debugging workflow.
The following tasks are currently supported:
| Dataset | Task | Metric | Default ISL | Default OSL |
|---|---|---|---|---|
| CNN Dailymail | summarization | rouge | 924 | 100 |
| MMLU | QA; multiple choice | accuracy | 4,094 | 2 |
| GSM8K | QA; regex matching | accuracy | 4,096 | 256 |
| GPQA | QA; multiple choice | accuracy | 32,768 | 4,096 |
| JSON mode eval | structured generation | accuracy | 1,024 | 512 |
Note
trtllm-eval originates from the TensorRT LLM accuracy test suite and serves as a lightweight utility for verifying and debugging accuracy. At this time, trtllm-eval is intended solely for development and is not recommended for production use.
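As a rough illustration of what the defaults in the table imply, the sketch below adds the default ISL (input sequence length) and OSL (output sequence length) for each dataset. It assumes both defaults act as upper bounds on one request, so their sum is the most tokens a single request can occupy. That sum may be a useful reference when choosing --max_seq_len, but the helper name `total_tokens` is hypothetical and not part of trtllm-eval.

```python
# Default input/output sequence lengths copied from the table above.
DEFAULT_LENGTHS = {
    "CNN Dailymail": (924, 100),
    "MMLU": (4094, 2),
    "GSM8K": (4096, 256),
    "GPQA": (32768, 4096),
    "JSON mode eval": (1024, 512),
}


def total_tokens(isl: int, osl: int) -> int:
    """Worst-case tokens for one request: full prompt plus full generation."""
    return isl + osl


for dataset, (isl, osl) in DEFAULT_LENGTHS.items():
    print(f"{dataset}: {total_tokens(isl, osl)} tokens")
```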
Usage and Examples#
Some evaluation tasks (e.g., GSM8K and GPQA) depend on the lm_eval package. To run these tasks, install lm_eval with:
pip install -r requirements-dev.txt
Alternatively, you can install the lm_eval version specified in requirements-dev.txt.
Here are some examples:
# Evaluate Llama-3.1-8B-Instruct on MMLU
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu
# Evaluate Llama-3.1-8B-Instruct on GSM8K
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gsm8k
# Evaluate Llama-3.3-70B-Instruct on GPQA Diamond
trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct gpqa_diamond
The --model argument accepts either a Hugging Face model ID or a local checkpoint path. By default, trtllm-eval runs the model with the PyTorch backend; you can pass --backend tensorrt to switch to the TensorRT backend.
Alternatively, the --model argument also accepts a local path to pre-built TensorRT engines. In this case, you should pass the Hugging Face tokenizer path to the --tokenizer argument.
For more details, see trtllm-eval --help and trtllm-eval <task> --help.
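When scripting evaluations, it can help to assemble the command line programmatically. The sketch below is one possible way to do that, following the ordering used in the examples above: global options such as --backend and --tokenizer come before the task name, and task options such as --num_samples come after it. The helper `build_eval_command` and the engine path are hypothetical and used only for illustration.

```python
import shlex


def build_eval_command(model, task, global_options=None, task_options=None):
    """Assemble an argv list: global options precede the task name, task options follow it."""
    argv = ["trtllm-eval", "--model", model]
    for flag, value in (global_options or {}).items():
        argv += [flag, str(value)]
    argv.append(task)
    for flag, value in (task_options or {}).items():
        argv += [flag, str(value)]
    return argv


# Hypothetical engine directory; the tokenizer is given as a Hugging Face model ID.
argv = build_eval_command(
    "/path/to/engine_dir",
    "gsm8k",
    global_options={
        "--backend": "tensorrt",
        "--tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
    },
    task_options={"--num_samples": 100},
)

# Print a shell-safe command string; pass argv to subprocess.run to execute it.
print(shlex.join(argv))
```

Returning a list rather than a string keeps each argument intact when it is handed to `subprocess.run`, so no shell quoting is needed at execution time.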
Syntax#
trtllm-eval#
trtllm-eval [OPTIONS] COMMAND [ARGS]...
Options
- --model<model>#
Required. Model name, HF checkpoint path, or TensorRT engine path.
- --tokenizer<tokenizer>#
Path or name of the tokenizer. Specify this value only if using a TensorRT engine as the model.
- --backend<backend>#
The backend to use for evaluation. Defaults to the pytorch backend.
- Options:
pytorch | tensorrt
- --log_level<log_level>#
The logging level.
- Options:
internal_error | error | warning | info | verbose | debug | trace
- --max_beam_width<max_beam_width>#
Maximum number of beams for beam search decoding.
- --max_batch_size<max_batch_size>#
Maximum number of requests that the engine can schedule.
- --max_num_tokens<max_num_tokens>#
Maximum number of batched input tokens after padding is removed in each batch.
- --max_seq_len<max_seq_len>#
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.
- --tp_size<tp_size>#
Tensor parallelism size.
- --pp_size<pp_size>#
Pipeline parallelism size.
- --ep_size<ep_size>#
Expert parallelism size.
- --gpus_per_node<gpus_per_node>#
Number of GPUs per node. Defaults to None, in which case it is detected automatically.
- --kv_cache_free_gpu_memory_fraction<kv_cache_free_gpu_memory_fraction>#
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers.
- --trust_remote_code#
Flag for HF transformers. When set, custom code shipped with the model repository is allowed to run.
- --extra_llm_api_options<extra_llm_api_options>#
Path to a YAML file of extra LLM API options that overwrite the parameters.
- --disable_kv_cache_reuse#
Flag for disabling KV cache reuse.
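To make the --kv_cache_free_gpu_memory_fraction description concrete, the sketch below works through the arithmetic: the fraction applies to the memory that remains free after model weights and buffers are allocated, not to the total GPU memory. The function name and all memory figures are illustrative assumptions, not measurements of any real model.

```python
def kv_cache_budget_gib(total_gib, weights_and_buffers_gib, fraction):
    """Memory given to the KV cache: a fraction of what is free after weights and buffers."""
    free_gib = total_gib - weights_and_buffers_gib
    return free_gib * fraction


# Illustrative numbers only: an 80 GiB GPU with 16 GiB taken by weights and buffers.
# 64 GiB remain free, so a fraction of 0.5 reserves half of that for the KV cache.
print(kv_cache_budget_gib(80.0, 16.0, 0.5))  # 32.0
```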
cnn_dailymail#
trtllm-eval cnn_dailymail [OPTIONS]
Options
- --dataset_path<dataset_path>#
The path to CNN Dailymail dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples<num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --rouge_path<rouge_path>#
The path to rouge repository. If unspecified, the repository is downloaded from HF hub.
- --apply_chat_template#
Whether to apply chat template.
- --system_prompt<system_prompt>#
System prompt.
- --max_input_length<max_input_length>#
Maximum prompt length.
- --max_output_length<max_output_length>#
Maximum generation length.
gpqa_diamond#
trtllm-eval gpqa_diamond [OPTIONS]
Options
- --dataset_path<dataset_path>#
The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples<num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --chat_template_kwargs<chat_template_kwargs>#
Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'
- --system_prompt<system_prompt>#
System prompt.
- --max_input_length<max_input_length>#
Maximum prompt length.
- --max_output_length<max_output_length>#
Maximum generation length.
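Because --chat_template_kwargs expects a JSON string, hand-typed quotes are an easy source of errors (JSON requires straight double quotes). One way to avoid this is to serialize a dictionary with `json.dumps`, as in the sketch below. The `thinking_budget` key comes from the option's own example; pairing the option with --apply_chat_template is an assumption made here for illustration.

```python
import json
import shlex

# Serialize the kwargs so the quoting is always valid JSON.
kwargs = {"thinking_budget": 0}
json_arg = json.dumps(kwargs)
print(json_arg)  # {"thinking_budget": 0}

argv = [
    "trtllm-eval", "--model", "meta-llama/Llama-3.3-70B-Instruct",
    "gpqa_diamond",
    "--apply_chat_template",  # assumption: used together with the kwargs
    "--chat_template_kwargs", json_arg,
]

# shlex.join wraps the JSON argument in single quotes for safe pasting into a shell.
print(shlex.join(argv))
```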
gpqa_extended#
trtllm-eval gpqa_extended [OPTIONS]
Options
- --dataset_path<dataset_path>#
The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples<num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --chat_template_kwargs<chat_template_kwargs>#
Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'
- --system_prompt<system_prompt>#
System prompt.
- --max_input_length<max_input_length>#
Maximum prompt length.
- --max_output_length<max_output_length>#
Maximum generation length.
gpqa_main#
trtllm-eval gpqa_main [OPTIONS]
Options
- --dataset_path<dataset_path>#
The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples<num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --chat_template_kwargs<chat_template_kwargs>#
Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'
- --system_prompt<system_prompt>#
System prompt.
- --max_input_length<max_input_length>#
Maximum prompt length.
- --max_output_length<max_output_length>#
Maximum generation length.
gsm8k#
trtllm-eval gsm8k [OPTIONS]
Options
- --dataset_path<dataset_path>#
The path to GSM8K dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples<num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --chat_template_kwargs<chat_template_kwargs>#
Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'
- --fewshot_as_multiturn#
Apply few-shot examples as a multi-turn conversation.
- --system_prompt<system_prompt>#
System prompt.
- --max_input_length<max_input_length>#
Maximum prompt length.
- --max_output_length<max_output_length>#
Maximum generation length.
json_mode_eval#
trtllm-eval json_mode_eval [OPTIONS]
Options
- --dataset_path<dataset_path>#
The path to JSON Mode Eval dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples<num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --system_prompt<system_prompt>#
System prompt.
- --max_input_length<max_input_length>#
Maximum prompt length.
- --max_output_length<max_output_length>#
Maximum generation length.
longbench_v2#
trtllm-eval longbench_v2 [OPTIONS]
Options
- --dataset_path<dataset_path>#
Path to LongBench v2 dataset (HF dataset name or local path).
- --prompts_dir<prompts_dir>#
Path to directory containing prompt templates.
- --num_samples<num_samples>#
Number of samples to evaluate (None for all).
- --start_idx<start_idx>#
Start index for evaluation.
- --difficulty<difficulty>#
Filter by difficulty level.
- Options:
easy | hard
- --length<length>#
Filter by length category.
- Options:
short | medium | long
- --domain<domain>#
Filter by domain.
- --cot#
Enable Chain-of-Thought reasoning.
- --no_context#
Test without long context.
- --rag<rag>#
Use top-N retrieved contexts (0 to disable).
- --max_len<max_len>#
Maximum prompt length in tokens for truncation when building prompts.
- --output_dir<output_dir>#
Directory to save results.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --system_prompt<system_prompt>#
System prompt.
- --max_input_length<max_input_length>#
Maximum prompt length in sampling parameters.
- --max_output_length<max_output_length>#
Maximum generation length in sampling parameters.
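The --difficulty and --length filters make it possible to evaluate one slice of LongBench v2 at a time. The sketch below generates one command per combination of the documented choices, writing each result to its own directory. The helper `longbench_commands`, the model ID, and the output directory layout are hypothetical choices for illustration.

```python
import itertools
import shlex

# Choices listed for --difficulty and --length above.
DIFFICULTIES = ["easy", "hard"]
LENGTHS = ["short", "medium", "long"]


def longbench_commands(model, output_root):
    """Build one trtllm-eval argv list per (difficulty, length) combination."""
    commands = []
    for difficulty, length in itertools.product(DIFFICULTIES, LENGTHS):
        commands.append([
            "trtllm-eval", "--model", model,
            "longbench_v2",
            "--difficulty", difficulty,
            "--length", length,
            "--output_dir", f"{output_root}/{difficulty}_{length}",
        ])
    return commands


for argv in longbench_commands("meta-llama/Llama-3.1-8B-Instruct", "results"):
    print(shlex.join(argv))
```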
mmlu#
trtllm-eval mmlu [OPTIONS]
Options
- --dataset_path<dataset_path>#
The path to MMLU dataset. The commands to prepare the dataset: wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar. If unspecified, the dataset is downloaded automatically.
- --num_samples<num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --num_fewshot<num_fewshot>#
Number of few-shot examples.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --chat_template_kwargs<chat_template_kwargs>#
Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'
- --system_prompt<system_prompt>#
System prompt.
- --max_input_length<max_input_length>#
Maximum prompt length.
- --max_output_length<max_output_length>#
Maximum generation length.
- --check_accuracy#
- --accuracy_threshold<accuracy_threshold>#
mmmu#
trtllm-eval mmmu [OPTIONS]
Options
- --dataset_path<dataset_path>#
The path to MMMU dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples<num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed<random_seed>#
Random seed for dataset processing.
- --chat_template_kwargs<chat_template_kwargs>#
Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'
- --system_prompt<system_prompt>#
The system prompt to add to the prompt. If specified, {'role': 'system', 'content': system_prompt} is added to the prompt.
- --max_input_length<max_input_length>#
Maximum prompt length.
- --max_output_length<max_output_length>#
Maximum generation length.