NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, speculative decoding, and sparsity to accelerate models.
[Input] Model Optimizer currently supports inputs of a Hugging Face, PyTorch, or ONNX model.
[Optimize] Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint. Model Optimizer is also integrated with NVIDIA NeMo, Megatron-LM, and Hugging Face Accelerate for inference optimization techniques that require training.
[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM.
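As a concrete sketch of this input → optimize → export flow, the snippet below applies post-training quantization to a Hugging Face model and exports a unified checkpoint. The model ID and calibration prompts are illustrative placeholders, and exact signatures may vary across releases; see the examples and docs linked below for authoritative usage.

```python
# Hypothetical end-to-end sketch: PTQ on a Hugging Face model, then export.
# The model ID and calibration prompts are placeholders -- substitute your own.
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(model):
    # Run a small calibration set through the model so ModelOpt can
    # collect activation statistics for the inserted quantizers.
    for prompt in ["Hello, world!", "Quantization trades precision for speed."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Apply FP8 post-training quantization (other configs, e.g. NVFP4, work similarly).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a quantized checkpoint for downstream frameworks
# (TensorRT-LLM, vLLM, SGLang).
export_hf_checkpoint(model, export_dir="llama-3.1-8b-fp8")
```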
- [2025/10/07] Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer
- [2025/09/17] An Introduction to Speculative Decoding for Reducing Latency in AI Inference
- [2025/09/11] How Quantization Aware Training Enables Low-Precision Accuracy Recovery
- [2025/08/29] Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
- [2025/08/01] Optimizing LLMs for Performance and Accuracy with Post-Training Quantization
- [2025/06/24] Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
- [2025/05/14] NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs
- [2025/04/21] Adobe optimized deployment using TensorRT-Model-Optimizer + TensorRT, leading to a 60% reduction in diffusion latency and a 40% reduction in total cost of ownership
- [2025/04/05] NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick. Check out how to quantize Llama 4 for deployment acceleration here
- [2025/03/18] World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: DeepSeek-R1-FP4, Llama-3.3-70B-Instruct-FP4, Llama-3.1-405B-Instruct-FP4
- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ here.
- [2025/01/28] Model Optimizer is now open source!
- [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: 8B, 70B, 405B.
- [2024/09/10] Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer.
Previous News
- [2024/08/28] Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs
- [2024/08/28] Up to 1.9X Higher Llama 3.1 Performance with Medusa
- [2024/08/15] New features in recent releases: Cache Diffusion, QLoRA workflow with NVIDIA NeMo, and more. Check out our blog for details.
- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow here.
- [2024/05/08] Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance
- [2024/03/27] Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records
- [2024/03/18] GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT
- [2024/03/07] Model Optimizer's 8-bit Post-Training Quantization enables TensorRT to accelerate Stable Diffusion to nearly 2x faster
- [2024/02/01] Speed up inference with Model Optimizer quantization techniques in TRT-LLM
To install stable release packages for Model Optimizer with pip from PyPI:
```bash
pip install -U nvidia-modelopt[all]
```
To install from source in editable mode with all development dependencies or to use the latest features, run:
```bash
# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer
pip install -e .[dev]
```
You can also directly use the TensorRT-LLM docker images (e.g., nvcr.io/nvidia/tensorrt-llm/release:<version>), which have Model Optimizer pre-installed. Make sure to upgrade Model Optimizer to the latest version using pip as described above. Visit our installation guide for more fine-grained control over installed dependencies, or for alternative docker images and environment variables to set up.
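After installing (or inside the container), a quick import check is an easy way to confirm which version you are running; this assumes the package exposes the usual `__version__` attribute:

```python
# Sanity check: confirm Model Optimizer imports and print its version.
import modelopt

print(modelopt.__version__)
```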
| Technique | Description | Examples | Docs |
|---|---|---|---|
| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | [LLMs] [diffusers] [VLMs] [onnx] [windows] | [docs] |
| Quantization Aware Training | Refine accuracy even further with a few training steps (sketch below)! | [NeMo] [Hugging Face] | [docs] |
| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | [PyTorch] | [docs] |
| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | [NeMo] [Hugging Face] | [docs] |
| Speculative Decoding | Train draft modules to predict extra tokens during inference! | [Megatron] [Hugging Face] | [docs] |
| Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations! | [PyTorch] | [docs] |
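As a sketch of how Quantization Aware Training composes with the PTQ flow shown earlier: `mtq.quantize` inserts differentiable (fake) quantizers, so a short fine-tuning run with your existing training loop can recover accuracy. The optimizer settings and `train_dataloader` below are placeholders; see the QAT examples and docs for recommended recipes.

```python
# Hypothetical QAT sketch, continuing from the PTQ snippet above:
# the quantized model remains trainable, so standard fine-tuning applies.
import torch

# `model` and `forward_loop` are defined as in the PTQ sketch above.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyperparameters
model.train()
for batch in train_dataloader:  # placeholder: your fine-tuning data loader
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```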
- Ready-to-deploy checkpoints [🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection]
- Deployable on TensorRT-LLM, vLLM and SGLang (vLLM sketch below)
- More models coming soon!
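For instance, a ModelOpt FP8 checkpoint from the collection can be served with vLLM roughly as follows. The model ID mirrors the FP8 checkpoints listed in the news above, and the `quantization` flag may be auto-detected or differently named depending on your vLLM version:

```python
# Hypothetical deployment sketch: serve a ModelOpt-quantized checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # ModelOpt FP8 checkpoint from the collection
    quantization="modelopt",  # may be auto-detected from the checkpoint config
)
outputs = llm.generate(
    ["What does post-training quantization do?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```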
| Model Type | Support Matrix |
|---|---|
| LLM Quantization | View Support Matrix |
| Diffusers Quantization | View Support Matrix |
| VLM Quantization | View Support Matrix |
| ONNX Quantization | View Support Matrix |
| Windows Quantization | View Support Matrix |
| Quantization Aware Training | View Support Matrix |
| Pruning | View Support Matrix |
| Distillation | View Support Matrix |
| Speculative Decoding | View Support Matrix |
Model Optimizer is now open source! We welcome any feedback, feature requests and PRs. Please read our Contributing guidelines for details on how to contribute to this project.
Happy optimizing!