[OMNIML-2932] Fusing pre_quant_scale for NVFP4 AWQ (#421)
## What does this PR do?

**Type of change:** ?

**Overview:** This PR and NVIDIA/TensorRT-LLM#8698 enable NVFP4 AWQ deployment for TRT-LLM. Specifically, this PR fuses `pre_quant_scale` in the following two cases:

* For MLP, the `pre_quant_scale` of the gate_proj layer is fused into up_proj's weight, so no extra handling is needed in downstream fused MoE kernels.
* For attention, we try to fuse the `pre_quant_scale` of o_proj into v_proj if their dimensions match, which means fusion is skipped for MQA/GQA models.

## Usage

```python
# Add a code snippet demonstrating how to use this
```

## Testing

Unit tests, plus e2e tests for Qwen3 dense and MoE models.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Additional Information

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
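The fusion idea above can be illustrated with a small numerical sketch. This is not ModelOpt's actual implementation: the function name `fuse_pre_quant_scale`, the weight shapes, and the `(out_features, in_features)` layout are assumptions for demonstration. The key identity is that scaling a consumer layer's input by `s` is equivalent to scaling the producer layer's output channels by `s`, so the runtime multiply can be folded into the producer's weight:

```python
import numpy as np

def fuse_pre_quant_scale(producer_weight: np.ndarray,
                         pre_quant_scale: np.ndarray) -> np.ndarray:
    """Fold a consumer layer's pre_quant_scale into the producer's weight.

    If the consumer computes W_c @ (s * x) and x = W_p @ h, then
    s * (W_p @ h) == (diag(s) @ W_p) @ h, so scaling each output row
    of W_p by s removes the runtime elementwise multiply.
    Weight layout is assumed to be (out_features, in_features).
    """
    if producer_weight.shape[0] != pre_quant_scale.shape[0]:
        # Dimension mismatch (analogous to the MQA/GQA case): skip fusion.
        raise ValueError("dimension mismatch, fusion not possible")
    return pre_quant_scale[:, None] * producer_weight

# Hypothetical example: fuse o_proj's pre_quant_scale into v_proj's weight.
rng = np.random.default_rng(0)
v_proj = rng.standard_normal((8, 16))      # illustrative v_proj weight
scale = rng.uniform(0.5, 2.0, size=8)      # illustrative o_proj pre_quant_scale
h = rng.standard_normal(16)

fused = fuse_pre_quant_scale(v_proj, scale)
# Runtime scaling and the fused weight produce identical activations.
assert np.allclose(scale * (v_proj @ h), fused @ h)
```

The MLP case in the PR follows the same pattern, with gate_proj's `pre_quant_scale` folded into up_proj's weight rows.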
### CHANGELOG.rst (1 addition, 0 deletions)
```diff
@@ -47,6 +47,7 @@ Model Optimizer Changelog (Linux)
 - Enabled native Modelopt quantization support for FP8 and NVFP4 formats in SGLang. See `SGLang quantization documentation <https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/quantization.md#using-nvidia-modelopt>`_ for more details.
 - Added modelopt quantized checkpoints in vLLM/SGLang CI/CD pipelines (PRs are under review).
 - Add support for exporting QLoRA checkpoint finetuned using ModelOpt.
+- Update NVFP4 AWQ checkpoint export. It now fuses scaling factors of o_proj and down_proj layers into the model when possible to facilitate deployment.
```