TensorRT LLM is an open-source library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
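As a hedged illustration of how one of these optimizations is requested, the sketch below enables FP8 quantization through the Python LLM API. It assumes the `QuantConfig`/`QuantAlgo` helpers exposed under `tensorrt_llm.llmapi` (as used in the LLM API quantization examples); the model name is a placeholder, and the available algorithms depend on the release and the GPU.

```python
# Hedged sketch: enabling FP8 quantization via the LLM API.
# Assumes QuantConfig/QuantAlgo are importable from tensorrt_llm.llmapi;
# the model id is a placeholder and FP8 requires a GPU that supports it.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig


def main():
    quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quant_config=quant_config)
    outputs = llm.generate(["Paged KV caching lets the runtime"])
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    main()
```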
- [Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+ [Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/release/1.1/docs/source/developer-guide/overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.
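For a concrete picture of the LLM API described above, here is a minimal generation sketch in the spirit of the quick-start guide; the model name and sampling values are illustrative placeholders rather than recommendations.

```python
# Minimal LLM API sketch (quick-start style); the model id and sampling
# values below are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Single-GPU by default; multi-GPU and multi-node setups are configured
    # through the same API, as described in the documentation above.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```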
- The full data behind these charts & tables, including larger models with higher TP values, can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html)
+ The full data behind these charts & tables, including larger models with higher TP values, can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)
docs/source/blogs/H200launch.md (1 addition, 1 deletion)
@@ -21,7 +21,7 @@ TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news
<sup>*(1) Largest power-of-2 batch size supported on the given TP configuration.*</sup> <sup>*(2) TP = Tensor Parallelism*</sup>
- Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).
+ Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).
docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1 addition, 1 deletion)
@@ -124,7 +124,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
- For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+ For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
docs/source/features/disagg-serving.md (1 addition, 1 deletion)
@@ -94,7 +94,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
- For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+ For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).