Welcome to TensorRT LLM’s Documentation!
Getting Started
Models
CLI Reference
API Reference
- LLM API Introduction
- API Reference
  - LLM
  - MultimodalEncoder
  - CompletionOutput
  - RequestOutput
  - GuidedDecodingParams
  - SamplingParams
  - DisaggregatedParams
  - KvCacheConfig
  - KvCacheRetentionConfig
  - CudaGraphConfig
  - MoeConfig
  - LookaheadDecodingConfig
  - MedusaDecodingConfig
  - EagleDecodingConfig
  - MTPDecodingConfig
  - SchedulerConfig
  - CapacitySchedulerPolicy
  - BuildConfig
  - QuantConfig
  - QuantAlgo
  - CalibConfig
  - BuildCacheConfig
  - RequestError
  - MpiCommSession
  - ExtendedRuntimePerfKnobConfig
  - BatchingType
  - ContextChunkingPolicy
  - DynamicBatchConfig
  - CacheTransceiverConfig
  - NGramDecodingConfig
  - UserProvidedDecodingConfig
  - TorchCompileConfig
  - DraftTargetDecodingConfig
  - LlmArgs
  - TorchLlmArgs
  - TrtLlmArgs
  - AutoDecodingConfig
  - AttentionDpConfig
  - LoRARequest
  - SaveHiddenStatesDecodingConfig
  - RocketSparseAttentionConfig
  - DeepSeekSparseAttentionConfig
Features
- Feature Combination Matrix
- Multi-Head, Multi-Query, and Group-Query Attention
- Disaggregated Serving
- KV Cache System
- Long Sequences
- LoRA (Low-Rank Adaptation)
- Multimodal Support in TensorRT LLM
- Overlap Scheduler
- Paged Attention, IFB, and Request Scheduling
- Parallelism in TensorRT LLM
- Quantization
- Sampling
- Additional Outputs
- Speculative Decoding
- Checkpoint Loading
- AutoDeploy (Prototype)
- Ray Orchestrator (Prototype)
- Torch Compile & Piecewise CUDA Graph
Developer Guide
Blogs
- ADP Balance Strategy
- Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
- Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly
- Inference Time Compute Implementation in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
- N-Gram Speculative Decoding in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
- How to get best performance on DeepSeek-R1 in TensorRT LLM
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
Quick Links
Indices and tables