Model Recipes

Quick Start for Popular Models

The table below contains trtllm-serve commands that can be used to easily deploy popular models, including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more.

We maintain LLM API configuration files for these models containing recommended performance settings in the examples/configs directory. The TensorRT LLM Docker container makes the config files available at /app/tensorrt_llm/examples/configs, but you can customize this as needed:

export TRTLLM_DIR="/app/tensorrt_llm"  # path to the TensorRT LLM repo in your local environment
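If you are running outside the container, point TRTLLM_DIR at your own checkout instead. The sketch below (using the container's default path) shows how the config paths in the table expand once the variable is set:

```shell
# Set TRTLLM_DIR to wherever the TensorRT LLM repo lives; inside the Docker
# container the default is /app/tensorrt_llm, while a source checkout would
# use its own path.
export TRTLLM_DIR="/app/tensorrt_llm"

# The --extra_llm_api_options paths in the table below then expand like this:
echo "${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml"
# → /app/tensorrt_llm/examples/configs/gpt-oss-120b-throughput.yaml
```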

Note

The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, you may benefit from additional tuning. In the future, we plan to provide more configs for a wider range of traffic patterns.
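For orientation, an extra_llm_api_options file is simply a YAML fragment of LLM API settings. The sketch below is illustrative only — the option values (and whether a given option appears at all) are assumptions, not the shipped recommendations; consult the actual files in examples/configs for the tuned settings:

```yaml
# Illustrative extra_llm_api_options fragment -- values here are placeholders,
# not the recommended settings shipped in examples/configs.
kv_cache_config:
  free_gpu_memory_fraction: 0.85   # fraction of free GPU memory reserved for KV cache
cuda_graph_config:
  enable_padding: true             # pad batches so captured CUDA graphs can be reused
```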

This table is designed to provide a straightforward starting point; for detailed model-specific deployment guides, check out the guides below.

| Model Name | GPU | Inference Scenario | Config | Command |
|---|---|---|---|---|
| DeepSeek-R1 | H100, H200 | Max Throughput | `deepseek-r1-throughput.yaml` | `trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml` |
| DeepSeek-R1 | B200, GB200 | Max Throughput | `deepseek-r1-deepgemm.yaml` | `trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml` |
| DeepSeek-R1 (NVFP4) | B200, GB200 | Max Throughput | `deepseek-r1-throughput.yaml` | `trtllm-serve nvidia/DeepSeek-R1-FP4 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml` |
| DeepSeek-R1 (NVFP4) | B200, GB200 | Min Latency | `deepseek-r1-latency.yaml` | `trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-latency.yaml` |
| gpt-oss-120b | Any | Max Throughput | `gpt-oss-120b-throughput.yaml` | `trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml` |
| gpt-oss-120b | Any | Min Latency | `gpt-oss-120b-latency.yaml` | `trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml` |
| Qwen3-Next-80B-A3B-Thinking | Any | Max Throughput | `qwen3-next.yaml` | `trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3-next.yaml` |
| Qwen3 family (e.g. Qwen3-30B-A3B) | Any | Max Throughput | `qwen3.yaml` | `trtllm-serve Qwen/Qwen3-30B-A3B --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3.yaml` (swap in another Qwen3 model name as needed) |
| Llama-3.3-70B (FP8) | Any | Max Throughput | `llama-3.3-70b.yaml` | `trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml` |
| Llama 4 Scout (FP8) | Any | Max Throughput | `llama-4-scout.yaml` | `trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml` |
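Once one of these servers is running, it exposes an OpenAI-compatible HTTP API (by default on localhost port 8000). The sketch below builds a chat request for the gpt-oss-120b deployment and validates the payload; the actual request is left commented out because it requires a live server:

```shell
# Assumes a server from the table above is running; trtllm-serve serves an
# OpenAI-compatible API on http://localhost:8000 by default.
BASE_URL="http://localhost:8000"
BODY='{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'

# Sanity-check the JSON payload before sending it:
echo "$BODY" | python3 -m json.tool >/dev/null && echo "payload OK"

# Send the request (uncomment with a live server):
# curl -s "$BASE_URL/v1/chat/completions" -H "Content-Type: application/json" -d "$BODY"
```

The `model` field must match the model name you passed to trtllm-serve.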

Model-Specific Deployment Guides

The deployment guides below provide more detailed instructions for serving specific models with TensorRT LLM.