Commit 5842d73

[OMNIML-2244] enable fp8 and int8 ONNX export (#594)
## What does this PR do?

**Type of change:** Example update

**Overview:**

- Support ONNX export for fp8 and int8 precisions
- Added utility functions to check for fp8 and int8 quantization (will be used in ONNXExporter)
- Fixed a bug in the evaluation API for high batch sizes
- Added a function to replace zeros in scales with the smallest positive fp16 value

## Usage

```bash
python torch_quant_to_onnx.py \
    --quantize_mode fp8/int8 \
    --onnx_save_path <onnx_path>
```

## Testing

Validated the accuracy and latency of the int8 and fp8 models:

| Metric | INT8 | FP8 |
|--------|------|-----|
| Top1 Accuracy | 84.584% | 85.062% |
| Top5 Accuracy | 97.3% | 97.534% |
| Inference Latency | 8.4825 ms | 8.15096 ms |

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: Yes
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
1 parent a5025a2, commit 5842d73

File tree

5 files changed (+53, -7 lines changed)


examples/onnx_ptq/README.md

Lines changed: 5 additions & 5 deletions
@@ -13,7 +13,7 @@ Model Optimizer enables highly performant quantization formats including NVFP4,
 | Pre-Requisites | Required & optional packages to use this technique | [Link](#pre-requisites) | |
 | Getting Started | Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency | [Link](#getting-started) | [docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_onnx_quantization.html) |
 | Support Matrix | View the ONNX export supported LLM models | [Link](#onnx-export-supported-llm-models) | |
-| PyTorch to ONNX | Example scripts demonstrating how to quantize with PyTorch and then convert to ONNX | [Link](#torch-quantization-to-onnx-example-for-mxfp8-int4-or-nvfp4-precision) | |
+| PyTorch to ONNX | Example scripts demonstrating how to quantize with PyTorch and then convert to ONNX | [Link](#torch-quantization-to-onnx-export-example) | |
 | Advanced Features | Examples demonstrating use advanced ONNX quantization features | [Link](#advanced-features) | |
 | Pre-Quantized Checkpoints | Ready to deploy Hugging Face pre-quantized checkpoints | [Link](#pre-quantized-checkpoints) | |
 | Resources | Extra links to relevant resources | [Link](#resources) | |

@@ -80,7 +80,7 @@ python image_prep.py \

 The model can be quantized as an FP8, INT8 or INT4 model using either the CLI or Python API. For FP8 and INT8 quantization, you have a choice between `max` and `entropy` calibration algorithms. For INT4 quantization, [awq_clip](https://arxiv.org/abs/2306.00978) or [rtn_dq](https://ar5iv.labs.arxiv.org/html/2301.12017) algorithms can be chosen.

-> *For NVFP4 and MXFP8 ONNX, see the [PyTorch to ONNX section](#torch-quantization-to-onnx-example-for-mxfp8-int4-or-nvfp4-precision).*
+> *For NVFP4 and MXFP8 ONNX, see the [PyTorch to ONNX section](#torch-quantization-to-onnx-export-example).*

 > *Minimum opset requirements: int8 (13+), fp8 (21+), int4 (21+). ModelOpt will automatically upgrade lower opset versions to meet these requirements.*

@@ -129,9 +129,9 @@ The top5 accuracy of the model is <accuracy score between 0-100%>
 Inference latency of the model is <X> ms
 ```

-## Torch quantization to ONNX example for MXFP8, INT4 or NVFP4 precision
+## Torch quantization to ONNX export example

-This example demonstrates how to quantize a [timm](https://github.com/huggingface/pytorch-image-models) vision model using MXFP8, INT4 or NVFP4 precision formats, and then export it to ONNX. The script leverages the ModelOpt toolkit for both quantization and ONNX export.
+This example demonstrates how to quantize a [timm](https://github.com/huggingface/pytorch-image-models) vision model for various precision formats followed by export to ONNX. The script leverages the ModelOpt toolkit for both quantization and ONNX export.

 > *Opset 20 is used to export the torch models to ONNX.*

@@ -148,7 +148,7 @@ This example demonstrates how to quantize a [timm](https://github.com/huggingfac
 ```bash
 python torch_quant_to_onnx.py \
     --timm_model_name=vit_base_patch16_224 \
-    --quantize_mode=<mxfp8|nvfp4|int4_awq> \
+    --quantize_mode=<fp8|mxfp8|int8|nvfp4|int4_awq> \
     --onnx_save_path=<path to save the exported ONNX model>
 ```

examples/onnx_ptq/evaluation.py

Lines changed: 2 additions & 1 deletion
@@ -152,8 +152,9 @@ def evaluate_accuracy(

         # Calculate accuracy
         outputs = outputs[0] if isinstance(outputs, list) else outputs.data
-
         labels_size = labels.size(0)
+        outputs = outputs[:labels_size]
+
         total += labels_size

         labels = labels.to(outputs.device)

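The one-line trim above is the high-batch-size fix mentioned in the PR overview: when inference runs with a fixed batch size, the final batch of the dataset can hold fewer samples than the padded model output, so the outputs must be cut down to the number of real labels before accuracy is counted. A minimal sketch of that idea (array shapes and class count are illustrative, not taken from the script):

```python
import numpy as np

batch_size = 8                                      # fixed inference batch size
num_classes = 1000
outputs = np.random.rand(batch_size, num_classes)   # padded logits for the final batch
labels = np.array([3, 7, 1])                        # only three real samples remain

labels_size = labels.shape[0]
outputs = outputs[:labels_size]                     # drop padded rows, as in the fix above

predictions = outputs.argmax(axis=1)
correct = int((predictions == labels).sum())
print(f"{correct}/{labels_size} correct in the final partial batch")
```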
examples/onnx_ptq/torch_quant_to_onnx.py

Lines changed: 1 addition & 1 deletion
@@ -323,7 +323,7 @@ def main():
     )
     print(f"Quantized Model - Top-1 Accuracy: {top1:.2f}%, Top-5 Accuracy: {top5:.2f}%")

-    if args.quantize_mode in ["fp8", "int8", "auto"]:
+    if args.quantize_mode in ["auto"]:
         print(
             f"The selected quantization mode {args.quantize_mode} is not supported for ONNX export yet."
         )

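The effect of narrowing the guard above is that fp8 and int8 now fall through to the ONNX export path, while only auto mode is still rejected. A hypothetical helper (not part of the commit) stating that behaviour explicitly:

```python
# Modes torch_quant_to_onnx.py can export after this change; "auto" is still rejected.
SUPPORTED_FOR_ONNX_EXPORT = {"fp8", "int8", "mxfp8", "nvfp4", "int4_awq"}

def can_export_to_onnx(quantize_mode: str) -> bool:
    """Hypothetical helper: True when the selected mode can be exported to ONNX."""
    return quantize_mode in SUPPORTED_FOR_ONNX_EXPORT

assert can_export_to_onnx("fp8") and can_export_to_onnx("int8")
assert not can_export_to_onnx("auto")
```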
modelopt/onnx/quantization/qdq_utils.py

Lines changed: 15 additions & 0 deletions
@@ -1037,6 +1037,21 @@ def remove_graph_input_q(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
     return onnx_model


+def replace_zero_scale_with_smallest_nonzero(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
+    """Replace zero scale values with smallest nonzero fp16 value in the ONNX model."""
+    graph = onnx_model.graph
+    fp16_smallest_nonzero = np.float16(6e-08)
+    scale_nodes = [node.input[1] for node in graph.node if node.op_type == "QuantizeLinear"]
+    for node in graph.node:
+        if node.op_type == "Constant" and node.output[0] in scale_nodes:
+            for attr in node.attribute:
+                if attr.name == "value":
+                    tensor = numpy_helper.to_array(attr.t)
+                    new_tensor = np.where(tensor == 0, fp16_smallest_nonzero, tensor)
+                    attr.t.CopyFrom(numpy_helper.from_array(new_tensor, attr.t.name))
+    return onnx_model
+
+
 def _cast_initializer_to_dtype(
     node: onnx.NodeProto, dtype: str, initializer_map: dict[str, onnx.TensorProto]
 ):

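Taken on its own, the new helper can be applied to any quantized ONNX model whose QuantizeLinear scales may contain zeros. A minimal usage sketch, assuming a standalone model file (the paths below are placeholders, not from the commit):

```python
import onnx

from modelopt.onnx.quantization.qdq_utils import replace_zero_scale_with_smallest_nonzero

# Load a quantized model, clamp any zero-valued Q/DQ scales to the smallest
# positive fp16 value, and write the result back out.
model = onnx.load("model_int8.onnx")         # placeholder input path
model = replace_zero_scale_with_smallest_nonzero(model)
onnx.save(model, "model_int8_fixed.onnx")    # placeholder output path
```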
modelopt/torch/_deploy/utils/torch_onnx.py

Lines changed: 30 additions & 0 deletions
@@ -37,6 +37,7 @@
     qdq_to_dq,
     quantize_weights_to_int4,
     quantize_weights_to_mxfp8,
+    replace_zero_scale_with_smallest_nonzero,
 )
 from modelopt.onnx.utils import (
     get_input_names,

@@ -336,6 +337,32 @@ def is_mxfp8_quantized(model: nn.Module) -> bool:
     return False


+def is_int8_quantized(model: nn.Module) -> bool:
+    """Check if the model is quantized in INT8 mode."""
+    for _, module in model.named_modules():
+        if (
+            hasattr(module, "weight_quantizer")
+            and hasattr(module, "input_quantizer")
+            and module.weight_quantizer._num_bits == 8
+            and module.input_quantizer._num_bits == 8
+        ):
+            return True
+    return False
+
+
+def is_fp8_quantized(model: nn.Module) -> bool:
+    """Check if the model is quantized in FP8 mode."""
+    for _, module in model.named_modules():
+        if (
+            hasattr(module, "weight_quantizer")
+            and hasattr(module, "input_quantizer")
+            and module.weight_quantizer._num_bits == (4, 3)
+            and module.input_quantizer._num_bits == (4, 3)
+        ):
+            return True
+    return False
+
+
 def get_onnx_bytes_and_metadata(
     model: nn.Module,
     dummy_input: Any | tuple,

@@ -510,6 +537,9 @@ def get_onnx_bytes_and_metadata(
             onnx_opt_graph, low_precision_type=weights_dtype, keep_io_types=False
         )

+    # TensorRT expects all scales to be positive
+    onnx_opt_graph = replace_zero_scale_with_smallest_nonzero(onnx_opt_graph)
+
     # If the onnx model contains external data store the external tensors in one file and save the onnx model
     if has_external_data(onnx_save_path):
         tensor_paths = get_external_tensor_paths(onnx_path)

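As a rough illustration of how these predicates might be consumed by the exporter (the dispatch helper below is hypothetical, not part of the commit), a caller can branch on the detected precision before choosing the ONNX post-processing path:

```python
from torch import nn

from modelopt.torch._deploy.utils.torch_onnx import is_fp8_quantized, is_int8_quantized

def detect_quantize_mode(model: nn.Module) -> str:
    """Hypothetical helper: map quantizer attributes on the model to an export mode."""
    if is_fp8_quantized(model):
        return "fp8"
    if is_int8_quantized(model):
        return "int8"
    return "unknown"
```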
0 commit comments

