NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, speculative decoding, and sparsity to accelerate models.
[Input] Model Optimizer currently supports inputs of a Hugging Face, PyTorch, or ONNX model.
[Optimize] Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint. Model Optimizer is also integrated with NVIDIA NeMo, Megatron-LM, and Hugging Face Accelerate for inference optimization techniques that require training.
[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM.
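As a concrete sketch of this input → optimize → export flow, the snippet below applies post-training quantization to a Hugging Face model and exports a unified checkpoint. The model ID and calibration prompts are illustrative placeholders, and exact signatures may vary across releases; see the examples and docs linked below for authoritative usage.

```python
# Hypothetical end-to-end sketch: PTQ on a Hugging Face model, then export.
# The model ID and calibration prompts are placeholders -- substitute your own.
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(model):
    # Run a small calibration set through the model so ModelOpt can
    # collect activation statistics for the inserted quantizers.
    for prompt in ["Hello, world!", "Quantization trades precision for speed."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Apply FP8 post-training quantization (other configs, e.g. NVFP4, work similarly).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a quantized checkpoint for downstream frameworks
# (TensorRT-LLM, vLLM, SGLang).
export_hf_checkpoint(model, export_dir="llama-3.1-8b-fp8")
```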
- [2025/10/07] Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer
- [2025/09/17] An Introduction to Speculative Decoding for Reducing Latency in AI Inference
- [2025/09/11] How Quantization Aware Training Enables Low-Precision Accuracy Recovery
- [2025/08/29] Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
- [2025/08/01] Optimizing LLMs for Performance and Accuracy with Post-Training Quantization
- [2025/06/24] Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
- [2025/05/14] NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs
- [2025/04/21] Adobe optimized deployment using TensorRT-Model-Optimizer + TensorRT, leading to a 60% reduction in diffusion latency and a 40% reduction in total cost of ownership
- [2025/04/05] NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick. Check out how to quantize Llama 4 for deployment acceleration here
- [2025/03/18] World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: DeepSeek-R1-FP4, Llama-3.3-70B-Instruct-FP4, Llama-3.1-405B-Instruct-FP4
- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ here.
- [2025/01/28] Model Optimizer is now open source!
- [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: 8B, 70B, 405B.
- [2024/09/10] Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer.
Previous News
- [2024/08/28] Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs
- [2024/08/28] Up to 1.9X Higher Llama 3.1 Performance with Medusa
- [2024/08/15] New features in recent releases: Cache Diffusion, QLoRA workflow with NVIDIA NeMo, and more. Check out our blog for details.
- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow here.
- [2024/05/08] Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance
- [2024/03/27] Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records
- [2024/03/18] GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT
- [2024/03/07] Model Optimizer's 8-bit Post-Training Quantization enables TensorRT to accelerate Stable Diffusion to nearly 2x faster
- [2024/02/01] Speed up inference with Model Optimizer quantization techniques in TRT-LLM
To install stable release packages for Model Optimizer with pip from PyPI:
```bash
pip install -U nvidia-modelopt[all]
```
To install from source in editable mode with all development dependencies or to use the latest features, run:
```bash
# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer
pip install -e .[dev]
```
You can also directly use the TensorRT-LLM docker images (e.g., nvcr.io/nvidia/tensorrt-llm/release:<version>), which have Model Optimizer pre-installed. Make sure to upgrade Model Optimizer to the latest version using pip as described above. Visit our installation guide for more fine-grained control over installed dependencies, or for alternative docker images and environment variables to set up.
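After installing (or inside the container), a quick import check is an easy way to confirm which version you are running; this assumes the package exposes the usual `__version__` attribute:

```python
# Sanity check: confirm Model Optimizer imports and print its version.
import modelopt

print(modelopt.__version__)
```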
| Technique | Description | Examples | Docs |
|---|---|---|---|
| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | [LLMs] [diffusers] [VLMs] [onnx] [windows] | [docs] |
| Quantization Aware Training | Refine accuracy even further with a few training steps (sketch below)! | [NeMo] [Hugging Face] | [docs] |
| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | [PyTorch] | [docs] |
| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | [NeMo] [Hugging Face] | [docs] |
| Speculative Decoding | Train draft modules to predict extra tokens during inference! | [Megatron] [Hugging Face] | [docs] |
| Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations! | [PyTorch] | [docs] |
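As a sketch of how Quantization Aware Training composes with the PTQ flow shown earlier: `mtq.quantize` inserts differentiable (fake) quantizers, so a short fine-tuning run with your existing training loop can recover accuracy. The optimizer settings and `train_dataloader` below are placeholders; see the QAT examples and docs for recommended recipes.

```python
# Hypothetical QAT sketch, continuing from the PTQ snippet above:
# the quantized model remains trainable, so standard fine-tuning applies.
import torch

# `model` and `forward_loop` are defined as in the PTQ sketch above.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyperparameters
model.train()
for batch in train_dataloader:  # placeholder: your fine-tuning data loader
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```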
- Ready-to-deploy checkpoints [🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection]
- Deployable on TensorRT-LLM, vLLM and SGLang (vLLM sketch below)
- More models coming soon!
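For instance, a ModelOpt FP8 checkpoint from the collection can be served with vLLM roughly as follows. The model ID mirrors the FP8 checkpoints listed in the news above, and the `quantization` flag may be auto-detected or differently named depending on your vLLM version:

```python
# Hypothetical deployment sketch: serve a ModelOpt-quantized checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # ModelOpt FP8 checkpoint from the collection
    quantization="modelopt",  # may be auto-detected from the checkpoint config
)
outputs = llm.generate(
    ["What does post-training quantization do?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```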
| Model Type | Support Matrix |
|---|---|
| LLM Quantization | View Support Matrix |
| Diffusers Quantization | View Support Matrix |
| VLM Quantization | View Support Matrix |
| ONNX Quantization | View Support Matrix |
| Windows Quantization | View Support Matrix |
| Quantization Aware Training | View Support Matrix |
| Pruning | View Support Matrix |
| Distillation | View Support Matrix |
| Speculative Decoding | View Support Matrix |
Model Optimizer is now open source! We welcome any feedback, feature requests and PRs. Please read our Contributing guidelines for details on how to contribute to this project.
Happy optimizing!