A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.


NVIDIA TensorRT Model Optimizer

Documentation | Roadmap


NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques, including quantization, distillation, pruning, speculative decoding, and sparsity, to accelerate models.

[Input] Model Optimizer currently supports inputs of a Hugging Face, PyTorch, or ONNX model.

[Optimize] Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized, quantized checkpoint. Model Optimizer is also integrated with NVIDIA NeMo, Megatron-LM, and Hugging Face Accelerate for inference optimization techniques that require training.

[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM.
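For intuition, the core idea behind the quantization step can be sketched in a few lines of plain Python. This is an illustration of symmetric INT8 weight quantization only, not the ModelOpt API (which handles per-layer calibration, composition with other techniques, and checkpoint export):

```python
# Toy symmetric INT8 post-training quantization of a weight list.
# Illustration only -- not the ModelOpt API.
def quantize_int8(weights):
    """Map float weights to int8 codes with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0  # largest magnitude maps to 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.2, -1.0, 0.4, 0.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)   # close to the original weights
```

Storing int8 codes plus one scale per tensor (or per channel) is what yields the 2x-4x size reduction; the accuracy question is how much the round-trip error in `approx` affects model quality.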

Latest News

Previous News

Install

To install stable release packages for Model Optimizer with pip from PyPI:

pip install -U nvidia-modelopt[all]

To install from source in editable mode with all development dependencies or to use the latest features, run:

# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer
pip install -e .[dev]

You can also directly use the TensorRT-LLM docker images (e.g., nvcr.io/nvidia/tensorrt-llm/release:<version>), which have Model Optimizer pre-installed. Make sure to upgrade Model Optimizer to the latest version using pip as described above. Visit our installation guide for more fine-grained control over installed dependencies, or for alternative docker images and environment variables to set up.

Techniques

| Technique | Description | Examples | Docs |
| --- | --- | --- | --- |
| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | [LLMs] [diffusers] [VLMs] [onnx] [windows] | [docs] |
| Quantization Aware Training | Refine accuracy even further with a few training steps! | [NeMo] [Hugging Face] | [docs] |
| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | [PyTorch] | [docs] |
| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | [NeMo] [Hugging Face] | [docs] |
| Speculative Decoding | Train draft modules to predict extra tokens during inference! | [Megatron] [Hugging Face] | [docs] |
| Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations | [PyTorch] | [docs] |
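As a toy illustration of the sparsity row above (not ModelOpt's actual sparse format or API), "storing only non-zero parameter values and their locations" is the classic coordinate-format idea:

```python
# Minimal coordinate-format (COO-style) sparse storage sketch.
# Illustration of the sparsity concept only -- not the ModelOpt API.
def compress(dense):
    """Return (indices, values) for the non-zero entries of a flat weight list."""
    pairs = [(i, v) for i, v in enumerate(dense) if v != 0.0]
    return [i for i, _ in pairs], [v for _, v in pairs]

def decompress(indices, values, length):
    """Rebuild the dense list from its sparse representation."""
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

weights = [0.0, 0.8, 0.0, 0.0, -0.3, 0.0, 0.0, 0.1]
idx, vals = compress(weights)
assert decompress(idx, vals, len(weights)) == weights  # lossless round trip
```

The savings grow with the fraction of zeros; structured patterns (such as the 2:4 sparsity supported by NVIDIA GPUs) additionally let hardware skip the zero entries during matrix multiplies.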

Pre-Quantized Checkpoints

Resources

Model Support Matrix

| Model Type | Support Matrix |
| --- | --- |
| LLM Quantization | View Support Matrix |
| Diffusers Quantization | View Support Matrix |
| VLM Quantization | View Support Matrix |
| ONNX Quantization | View Support Matrix |
| Windows Quantization | View Support Matrix |
| Quantization Aware Training | View Support Matrix |
| Pruning | View Support Matrix |
| Distillation | View Support Matrix |
| Speculative Decoding | View Support Matrix |

Contributing

Model Optimizer is now open source! We welcome any feedback, feature requests, and PRs. Please read our Contributing guidelines for details on how to contribute to this project.

Top Contributors

Contributors

Happy optimizing!

