trtllm-eval#

About#

The trtllm-eval command provides developers with a unified entry point for accuracy evaluation. It shares the core evaluation logic with the accuracy test suite of TensorRT LLM.

trtllm-eval is built on the offline LLM API. Compared to the online trtllm-serve, the offline API provides clearer error messages and simplifies the debugging workflow.

The following tasks are currently supported:

| Dataset | Task | Metric | Default ISL | Default OSL |
|---|---|---|---|---|
| CNN Dailymail | summarization | rouge | 924 | 100 |
| MMLU | QA; multiple choice | accuracy | 4,094 | 2 |
| GSM8K | QA; regex matching | accuracy | 4,096 | 256 |
| GPQA | QA; multiple choice | accuracy | 32,768 | 4,096 |
| JSON mode eval | structured generation | accuracy | 1,024 | 512 |

Note

trtllm-eval originates from the TensorRT LLM accuracy test suite and serves as a lightweight utility for verifying and debugging accuracy. At this time, trtllm-eval is intended solely for development and is not recommended for production use.

Usage and Examples#

Some evaluation tasks (e.g., GSM8K and GPQA) depend on the lm_eval package. To run these tasks, you need to install lm_eval with:

```shell
pip install -r requirements-dev.txt
```

Alternatively, you can install the lm_eval version specified in requirements-dev.txt.

Here are some examples:

```shell
# Evaluate Llama-3.1-8B-Instruct on MMLU
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu

# Evaluate Llama-3.1-8B-Instruct on GSM8K
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gsm8k

# Evaluate Llama-3.3-70B-Instruct on GPQA Diamond
trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct gpqa_diamond
```

The --model argument accepts either a Hugging Face model ID or a local checkpoint path. By default, trtllm-eval runs the model with the PyTorch backend; you can pass --backend tensorrt to switch to the TensorRT backend.

Alternatively, the --model argument also accepts a local path to pre-built TensorRT engines. In this case, you should pass the Hugging Face tokenizer path to the --tokenizer argument.
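
As an illustration, an engine-based invocation could look like the following; the engine directory path below is a placeholder for your own pre-built engines, not a real artifact:

```shell
# Evaluate pre-built TensorRT engines on MMLU.
# ./llama-3.1-8b-engines is a hypothetical local engine directory.
trtllm-eval --model ./llama-3.1-8b-engines \
            --tokenizer meta-llama/Llama-3.1-8B-Instruct \
            --backend tensorrt \
            mmlu
```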

For more details, see trtllm-eval --help and trtllm-eval <task> --help.

Syntax#

trtllm-eval#

trtllm-eval [OPTIONS] COMMAND [ARGS]...

Options

--model <model>#

Required. Model name, HF checkpoint path, or TensorRT engine path.

--tokenizer <tokenizer>#

Path or name of the tokenizer. Specify this value only when using a TensorRT engine as the model.

--backend <backend>#

The backend to use for evaluation. Defaults to the PyTorch backend.

Options:

pytorch | tensorrt

--log_level <log_level>#

The logging level.

Options:

internal_error | error | warning | info | verbose | debug | trace

--max_beam_width <max_beam_width>#

Maximum number of beams for beam search decoding.

--max_batch_size <max_batch_size>#

Maximum number of requests that the engine can schedule.

--max_num_tokens <max_num_tokens>#

Maximum number of batched input tokens per batch after padding is removed.

--max_seq_len <max_seq_len>#

Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.

--tp_size <tp_size>#

Tensor parallelism size.

--pp_size <pp_size>#

Pipeline parallelism size.

--ep_size <ep_size>#

Expert parallelism size.

--gpus_per_node <gpus_per_node>#

Number of GPUs per node. Defaults to None, in which case it is detected automatically.

--kv_cache_free_gpu_memory_fraction <kv_cache_free_gpu_memory_fraction>#

Fraction of free GPU memory reserved for the KV cache, after model weights and buffers are allocated.

--trust_remote_code#

Flag passed to HF transformers.

--extra_llm_api_options <extra_llm_api_options>#

Path to a YAML file that overwrites the parameters.

--disable_kv_cache_reuse#

Flag for disabling KV cache reuse.
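
For example, an overrides file for --extra_llm_api_options might look like the sketch below; the key names (kv_cache_config, free_gpu_memory_fraction, enable_chunked_prefill) are assumptions based on the LLM API configuration and may differ across TensorRT LLM versions:

```shell
# Write a minimal overrides file for --extra_llm_api_options.
# The YAML keys below are assumed LLM API option names, shown for illustration.
cat > extra_llm_options.yaml <<'EOF'
kv_cache_config:
  free_gpu_memory_fraction: 0.8
enable_chunked_prefill: true
EOF
```

The file would then be supplied as, e.g., trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct --extra_llm_api_options extra_llm_options.yaml mmlu.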

cnn_dailymail#

trtllm-eval cnn_dailymail [OPTIONS]

Options

--dataset_path <dataset_path>#

The path to CNN Dailymail dataset. If unspecified, the dataset is downloaded from HF hub.

--num_samples <num_samples>#

Number of samples to run the evaluation; None means full dataset.

--random_seed <random_seed>#

Random seed for dataset processing.

--rouge_path <rouge_path>#

The path to rouge repository. If unspecified, the repository is downloaded from HF hub.

--apply_chat_template#

Whether to apply chat template.

--system_prompt <system_prompt>#

System prompt.

--max_input_length <max_input_length>#

Maximum prompt length.

--max_output_length <max_output_length>#

Maximum generation length.

gpqa_diamond#

trtllm-eval gpqa_diamond [OPTIONS]

Options

--dataset_path <dataset_path>#

The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.

--num_samples <num_samples>#

Number of samples to run the evaluation; None means full dataset.

--random_seed <random_seed>#

Random seed for dataset processing.

--apply_chat_template#

Whether to apply chat template.

--chat_template_kwargs <chat_template_kwargs>#

Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'

--system_prompt <system_prompt>#

System prompt.

--max_input_length <max_input_length>#

Maximum prompt length.

--max_output_length <max_output_length>#

Maximum generation length.
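
Because --chat_template_kwargs takes a JSON string, shell quoting matters; a sketch (model name reused from the earlier examples):

```shell
# Single quotes protect the inner JSON double quotes from the shell
trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct \
            gpqa_diamond \
            --apply_chat_template \
            --chat_template_kwargs '{"thinking_budget": 0}'
```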

gpqa_extended#

trtllm-eval gpqa_extended [OPTIONS]

Options

--dataset_path <dataset_path>#

The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.

--num_samples <num_samples>#

Number of samples to run the evaluation; None means full dataset.

--random_seed <random_seed>#

Random seed for dataset processing.

--apply_chat_template#

Whether to apply chat template.

--chat_template_kwargs <chat_template_kwargs>#

Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'

--system_prompt <system_prompt>#

System prompt.

--max_input_length <max_input_length>#

Maximum prompt length.

--max_output_length <max_output_length>#

Maximum generation length.

gpqa_main#

trtllm-eval gpqa_main [OPTIONS]

Options

--dataset_path <dataset_path>#

The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.

--num_samples <num_samples>#

Number of samples to run the evaluation; None means full dataset.

--random_seed <random_seed>#

Random seed for dataset processing.

--apply_chat_template#

Whether to apply chat template.

--chat_template_kwargs <chat_template_kwargs>#

Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'

--system_prompt <system_prompt>#

System prompt.

--max_input_length <max_input_length>#

Maximum prompt length.

--max_output_length <max_output_length>#

Maximum generation length.

gsm8k#

trtllm-eval gsm8k [OPTIONS]

Options

--dataset_path <dataset_path>#

The path to GSM8K dataset. If unspecified, the dataset is downloaded from HF hub.

--num_samples <num_samples>#

Number of samples to run the evaluation; None means full dataset.

--random_seed <random_seed>#

Random seed for dataset processing.

--apply_chat_template#

Whether to apply chat template.

--chat_template_kwargs <chat_template_kwargs>#

Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'

--fewshot_as_multiturn#

Apply fewshot as multiturn.

--system_prompt <system_prompt>#

System prompt.

--max_input_length <max_input_length>#

Maximum prompt length.

--max_output_length <max_output_length>#

Maximum generation length.

json_mode_eval#

trtllm-eval json_mode_eval [OPTIONS]

Options

--dataset_path <dataset_path>#

The path to JSON Mode Eval dataset. If unspecified, the dataset is downloaded from HF hub.

--num_samples <num_samples>#

Number of samples to run the evaluation; None means full dataset.

--random_seed <random_seed>#

Random seed for dataset processing.

--system_prompt <system_prompt>#

System prompt.

--max_input_length <max_input_length>#

Maximum prompt length.

--max_output_length <max_output_length>#

Maximum generation length.

longbench_v2#

trtllm-eval longbench_v2 [OPTIONS]

Options

--dataset_path <dataset_path>#

Path to LongBench v2 dataset (HF dataset name or local path).

--prompts_dir <prompts_dir>#

Path to directory containing prompt templates.

--num_samples <num_samples>#

Number of samples to evaluate (None for all).

--start_idx <start_idx>#

Start index for evaluation.

--difficulty <difficulty>#

Filter by difficulty level.

Options:

easy | hard

--length <length>#

Filter by length category.

Options:

short | medium | long

--domain <domain>#

Filter by domain.

--cot#

Enable Chain-of-Thought reasoning.

--no_context#

Test without long context.

--rag <rag>#

Use top-N retrieved contexts (0 to disable).

--max_len <max_len>#

Maximum prompt length in tokens for truncation when building prompts.

--output_dir <output_dir>#

Directory to save results.

--random_seed <random_seed>#

Random seed for dataset processing.

--apply_chat_template#

Whether to apply chat template.

--system_prompt <system_prompt>#

System prompt.

--max_input_length <max_input_length>#

Maximum prompt length in sampling parameters.

--max_output_length <max_output_length>#

Maximum generation length in sampling parameters.
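
Putting several of the filters above together, a longbench_v2 run restricted to one slice of the dataset could look like this (the output directory name is arbitrary):

```shell
# Evaluate only the hard, long-length samples with chain-of-thought prompting
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
            longbench_v2 \
            --difficulty hard \
            --length long \
            --cot \
            --output_dir ./longbench_results
```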

mmlu#

trtllm-eval mmlu [OPTIONS]

Options

--dataset_path <dataset_path>#

The path to MMLU dataset. The commands to prepare the dataset: wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar. If unspecified, the dataset is downloaded automatically.

--num_samples <num_samples>#

Number of samples to run the evaluation; None means full dataset.

--num_fewshot <num_fewshot>#

Number of fewshot examples.

--random_seed <random_seed>#

Random seed for dataset processing.

--apply_chat_template#

Whether to apply chat template.

--chat_template_kwargs <chat_template_kwargs>#

Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'

--system_prompt <system_prompt>#

System prompt.

--max_input_length <max_input_length>#

Maximum prompt length.

--max_output_length <max_output_length>#

Maximum generation length.

--check_accuracy#

--accuracy_threshold <accuracy_threshold>#
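
Following the preparation commands from the --dataset_path help above, an offline MMLU run could look like this (assuming the archive extracts to a local data/ directory):

```shell
# Download and extract the MMLU dataset (commands from the option help above)
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar -xf data.tar

# Point the evaluation at the extracted directory
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu --dataset_path data
```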

mmmu#

trtllm-eval mmmu [OPTIONS]

Options

--dataset_path <dataset_path>#

The path to MMMU dataset. If unspecified, the dataset is downloaded from HF hub.

--num_samples <num_samples>#

Number of samples to run the evaluation; None means full dataset.

--random_seed <random_seed>#

Random seed for dataset processing.

--chat_template_kwargs <chat_template_kwargs>#

Chat template kwargs as JSON string, e.g., '{"thinking_budget": 0}'

--system_prompt <system_prompt>#

The system prompt to be added to the prompt. If specified, it adds {'role': 'system', 'content': system_prompt} to the messages.

--max_input_length <max_input_length>#

Maximum prompt length.

--max_output_length <max_output_length>#

Maximum generation length.