Commit 6abded4

[5455919] Fix Q/DQ/Cast placement in 'FP32 required' custom ops (#554)
## What does this PR do?

**Type of change:** Bug fix

**Overview:** Fix incorrect quantization of custom ops when some input tensors are required to be in INT8 and some in FP32.

| Before fix | After fix |
|------------|-----------|
| <img width="841" height="623" alt="snap_custom_op_quant_incorrect" src="https://github.com/user-attachments/assets/88e4d460-fbae-4bcb-86c8-139d23ce04c8" /> | <img width="786" height="286" alt="snap_custom_op_quant_correct" src="https://github.com/user-attachments/assets/475079c2-a565-4f0d-b167-6d801ab83dfc" /> |

## Usage

```sh
$ python -m modelopt.onnx.quantization --onnx_path=$MODEL_PATH.onnx \
    --trt_plugins $PLUGIN_PATH.so \
    --trt_plugins_precision $CUSTOM_OP_NAME:$PRECISION
```

## Testing

### 1. BEVFormer model

- Follow step 1 in the [README](https://github.com/NVIDIA/DL4AGX/tree/master/AV-Solutions/bevformer-int8-eq#1-export-model-to-onnx-and-compile-plugins).
- In the quantization step, run:

```sh
$ python -m modelopt.onnx.quantization --onnx_path=/mnt/models/bevformer_tiny_epoch_24_cp2_op13.onnx \
    --trt_plugins=$PLUGIN_PATH \
    --trt_plugins_precision MultiScaleDeformableAttnTRT:[int8,int32,fp32,int8,int8]:[int8] \
    --high_precision_dtype fp16
```

> See the table in "Overview" for the expected graph structure.

### 2. 5455919 model

Validated the model in bug 5455919.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update the [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes

## Additional Information

- /pull/363: Feature expansion.
- /pull/524: The graph cleanup is actually needed after Q/DQ trimming around custom ops. Moved the cleanup lines to inside that function.

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
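As context for the flag format used above, here is a minimal, hypothetical sketch (not the modelopt implementation) of how a `--trt_plugins_precision` value such as `MultiScaleDeformableAttnTRT:[int8,int32,fp32,int8,int8]:[int8]` could map onto the per-op dictionary of 'FP32 required' I/O indices that the diffs below call `tensor_block_dict`. The helper name and the `"out"` key are assumptions; the diffs only consult `"inp"`.

```python
def parse_plugin_precision(spec: str) -> dict[str, dict[str, list[int]]]:
    """Hypothetical helper: map one OP:[inp,...]:[out,...] spec to FP32 I/O indices."""
    op_type, inp_str, out_str = spec.split(":")
    inp = inp_str.strip("[]").split(",")
    out = out_str.strip("[]").split(",")
    return {
        op_type: {
            # Indices of tensors that must stay in FP32.
            "inp": [i for i, p in enumerate(inp) if p == "fp32"],
            "out": [i for i, p in enumerate(out) if p == "fp32"],
        }
    }


print(parse_plugin_precision("MultiScaleDeformableAttnTRT:[int8,int32,fp32,int8,int8]:[int8]"))
# -> {'MultiScaleDeformableAttnTRT': {'inp': [2], 'out': []}}
```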
1 parent e20d218 · commit 6abded4

File tree

7 files changed (+77 / -41 lines changed)


CHANGELOG.rst

Lines changed: 2 additions & 0 deletions

@@ -20,12 +20,14 @@ Model Optimizer Changelog (Linux)
 **Bug Fixes**
 
 - Fix a bug in FastNAS pruning (computer vision models) where the model parameters were sorted twice messing up the ordering.
+- Fix Q/DQ/Cast node placements in 'FP32 required' tensors in custom ops in the ONNX quantization workflow.
 
 **New Features**
 
 - Add MoE (e.g. Qwen3-30B-A3B, gpt-oss-20b) pruning support for ``num_moe_experts``, ``moe_ffn_hidden_size`` and ``moe_shared_expert_intermediate_size`` parameters in Minitron pruning (``mcore_minitron``).
 - Add ``specdec_bench`` example to benchmark speculative decoding performance. See `examples/specdec_bench/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/specdec_bench#speculative-decoding-benchmark>`_ for more details.
 - Add FP8/NVFP4 KV cache quantization support for Megatron Core models.
+- Add flag ``trt_plugins_precision`` in ONNX autocast to indicate custom ops precision. This is similar to the flag already existing in the quantization workflow.
 
 
 0.39 (2025-11-11)

modelopt/onnx/autocast/convert.py

Lines changed: 3 additions & 1 deletion

@@ -194,6 +194,7 @@ def convert_to_f16(
     low_precision_type: str = "fp16",
     keep_io_types: bool = True,
     op_block_list: list[str] = [],
+    tensor_block_dict: dict[str, dict[str, list[int]]] = {},
     trt_plugins: list[str] | None = [],
 ) -> onnx.ModelProto:
     """Convert model to mixed precision, using PrecisionConverter.
@@ -204,8 +205,8 @@ def convert_to_f16(
         model: ONNX model to convert.
         low_precision_type: Target precision to reduce to ('fp16' or 'bf16').
         keep_io_types: Whether to preserve input/output types.
-        disable_shape_infer: Whether to disable shape inference.
         op_block_list: List of operation types that should remain in FP32.
+        tensor_block_dict: Dictionary of tensors (operation type and I/O indices) that should remain in FP32.
         trt_plugins: List of TensorRT plugin library paths in .so format (compiled shared library).
     """
     assert low_precision_type in ["fp16", "bf16"], "low_precision_type must be either fp16 or bf16"
@@ -235,6 +236,7 @@ def convert_to_f16(
         keep_io_types=keep_io_types,
         low_precision_type=low_precision_type,
         custom_ops=sanitizer.custom_ops,
+        tensor_block_dict=tensor_block_dict,
     )
     high_precision_nodes = [node.name for node in model.graph.node if node.op_type in op_block_list]
     low_precision_nodes = [
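A hedged usage sketch of the extended `convert_to_f16` signature above; the model path is a placeholder, and the choice of input index 2 is illustrative (it matches the `fp32` entry in the BEVFormer example from the PR description):

```python
import onnx

from modelopt.onnx.autocast.convert import convert_to_f16

model = onnx.load("model.onnx")  # placeholder path

# Convert the graph to FP16 overall, but keep input index 2 of the
# custom op in FP32 via the new tensor_block_dict argument.
model = convert_to_f16(
    model,
    low_precision_type="fp16",
    keep_io_types=True,
    tensor_block_dict={"MultiScaleDeformableAttnTRT": {"inp": [2]}},
)
onnx.save(model, "model_fp16.onnx")
```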

modelopt/onnx/autocast/precisionconverter.py

Lines changed: 38 additions & 16 deletions

@@ -99,6 +99,7 @@ def __init__(
         min_opset: int = 13,
         max_ir_version: int | None = None,
         trt_plugins: list[str] | None = [],
+        tensor_block_dict: dict[str, dict[str, list[int]]] = {},
     ) -> None:
         """Initialize PrecisionConverter.
 
@@ -112,6 +113,10 @@ def __init__(
             init_conversion_max_bytes: Maximum size in bytes for initializer conversion. Larger initializers will be
                 cast at runtime.
             custom_ops: List of custom ops.
+            min_opset: Minimum opset for conversion.
+            max_ir_version: Max IR version for conversion.
+            trt_plugins: List of custom TensorRT plugin library paths in .so format (compiled shared library).
+            tensor_block_dict: Dictionary of tensors (operation type and I/O indices) that should remain in FP32.
         """
         self.model = deepcopy(model)
         self.value_info_map = value_info_map
@@ -148,6 +153,9 @@ def __init__(
             )
         )
 
+        # Custom mapping of op types to indices of inputs that should not be converted to low precision
+        self.skip_inputs_map = self._create_skip_inputs_mapping(tensor_block_dict)
+
     def convert(
         self,
         high_precision_nodes: list[str],
@@ -211,7 +219,8 @@ def convert(
                 # For the low precision nodes that take a FP32 input, we don't exclude it from
                 # casting up so that the input can be converted to FP32 as expected.
                 exclude_consumers = list(
-                    set(low_precision_nodes) - {fp32_input_to_low_precision_node[tensor_name].name}
+                    set(low_precision_nodes)
+                    - {n.name for n in fp32_input_to_low_precision_node[tensor_name]}
                 )
                 self._add_cast(
                     tensor_name,
@@ -467,12 +476,14 @@
         return high_precision_nodes, low_precision_nodes
 
     def _get_tensors_to_cast(
-        self, low_precision_nodes: list[str]
-    ) -> tuple[list[str], list[str], dict[str, onnx.NodeProto]]:
+        self,
+        low_precision_nodes: list[str],
+        high_precision_tensors: dict[str, dict[str, list[int]]] = {},
+    ) -> tuple[list[str], list[str], dict[str, list[onnx.NodeProto]]]:
         cast_to_fp16 = []  # Tensors to cast down to FP16
         cast_to_fp32 = []  # Tensors to cast up to FP32
         # Keep track of the low precision nodes that take a FP32 input.
-        fp32_input_to_low_precision_node = {}
+        fp32_input_to_low_precision_node = defaultdict(list)
 
         # Get tensors for FP16 nodes
         for node in self.model.graph.node:
@@ -481,7 +492,7 @@ def _get_tensors_to_cast(
             for input in node.input:
                 if self._should_skip_low_precision_input_conversion(node, input):
                     cast_to_fp32.append(input)
-                    fp32_input_to_low_precision_node[input] = node
+                    fp32_input_to_low_precision_node[input].append(node)
                 else:
                     cast_to_fp16.append(input)
 
@@ -536,7 +547,7 @@ def _convert_initializers(
             low_precision_nodes: List of node names that should use low precision initializers.
             high_precision_nodes: List of node names that should use high precision initializers.
         """
-        # 1. Compute a mapping from initiailizers to high precision nodes & low precision nodes that use them.
+        # 1. Compute a mapping from initializers to high precision nodes & low precision nodes that use them.
         low_precision_nodes_set: set[str] = set(low_precision_nodes)
         high_precision_nodes_set: set[str] = set(high_precision_nodes)
         initializer_to_nodes: dict[str, InitializerConsumerTracker] = defaultdict(
@@ -888,7 +899,7 @@ def _add_cast(
             )
 
         if tensor_to_consumers is None:
-            utils.get_consumer_nodes(self.model, tensor_name)
+            consumer_nodes = utils.get_consumer_nodes(self.model, tensor_name)
         else:
            consumer_nodes = tensor_to_consumers.get(tensor_name, [])
         consumer_nodes = [n for n in consumer_nodes if n.name not in exclude_consumers]
@@ -1272,13 +1283,9 @@ def _sanitize_model(self):
         graph_sanitizer.sanitize()
         self.model = graph_sanitizer.model
 
-    def _should_skip_low_precision_input_conversion(
-        self, node: onnx.NodeProto, input_name: str
-    ) -> bool:
-        """Check if the input should be skipped for low precision conversion.
-
-        This is used for nodes that have inputs that MUST remain in FP32.
-        """
+    def _create_skip_inputs_mapping(self, tensor_block_dict: dict[str, dict[str, list[int]]] = {}):
+        """Create mapping of op types to indices of inputs that should not be converted to low precision."""
+        skip_inputs_map = {}
         match self.low_precision_type.str_short:
             case "fp16":
                 skip_inputs_map = SKIP_LOW_PRECISION_MAPPING_FP16
@@ -1287,12 +1294,27 @@ def _should_skip_low_precision_input_conversion(
             case _:
                 raise ValueError(f"Unsupported low precision type: {self.low_precision_type}")
 
-        if node.op_type in skip_inputs_map:
+        # Update mapping with user-defined information
+        for op, tensor_map in tensor_block_dict.items():
+            high_precision_tensor = tensor_map.get("inp", [])
+            if high_precision_tensor:
+                skip_inputs_map.update({op: set(high_precision_tensor)})
+
+        return skip_inputs_map
+
+    def _should_skip_low_precision_input_conversion(
+        self, node: onnx.NodeProto, input_name: str
+    ) -> bool:
+        """Check if the input should be skipped for low precision conversion.
+
+        This is used for nodes that have inputs that MUST remain in FP32.
+        """
+        if node.op_type in self.skip_inputs_map:
            # Figure out the index of the input in the node input
             inputs_lst = list(node.input)
             if input_name not in inputs_lst:
                 raise ValueError(f"Input {input_name} not found in node {node.name}.")
             input_index = inputs_lst.index(input_name)
             # Check if we should skip this input for low precision conversion
-            return input_index in skip_inputs_map[node.op_type]
+            return input_index in self.skip_inputs_map[node.op_type]
         return False
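The merge performed by `_create_skip_inputs_mapping` can be shown standalone. In this sketch the contents of `SKIP_LOW_PRECISION_MAPPING_FP16` are placeholders, and only the `"inp"` indices of `tensor_block_dict` are consulted, as in the diff:

```python
# Illustrative built-in rule; the real mapping lives in precisionconverter.py.
SKIP_LOW_PRECISION_MAPPING_FP16 = {"Resize": {1, 2}}


def create_skip_inputs_mapping(
    tensor_block_dict: dict[str, dict[str, list[int]]],
) -> dict[str, set[int]]:
    skip_inputs_map = dict(SKIP_LOW_PRECISION_MAPPING_FP16)
    # User-defined entries extend (or override) the built-in mapping.
    for op, tensor_map in tensor_block_dict.items():
        high_precision_tensor = tensor_map.get("inp", [])
        if high_precision_tensor:
            skip_inputs_map[op] = set(high_precision_tensor)
    return skip_inputs_map


print(create_skip_inputs_mapping({"MultiScaleDeformableAttnTRT": {"inp": [2]}}))
# -> {'Resize': {1, 2}, 'MultiScaleDeformableAttnTRT': {2}}
```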

modelopt/onnx/quantization/fp8.py

Lines changed: 2 additions & 0 deletions

@@ -169,6 +169,7 @@ def quantize(
     op_types_to_quantize: list[str] | None = None,
     op_types_to_exclude: list[str] | None = None,
     op_types_to_exclude_fp16: list[str] | None = None,
+    custom_ops_to_cast_fp32: dict | None = None,
     nodes_to_quantize: list[str] | None = None,
     nodes_to_exclude: list[str] | None = None,
     use_external_data_format: bool = False,
@@ -324,6 +325,7 @@ def quantize(
             onnx_model,
             keep_io_types=not direct_io_types,
             op_block_list=op_types_to_exclude_fp16 or [],
+            tensor_block_dict=custom_ops_to_cast_fp32 or {},
             low_precision_type=high_precision_dtype,
             trt_plugins=trt_extra_plugin_lib_paths,
         )

modelopt/onnx/quantization/int8.py

Lines changed: 2 additions & 0 deletions

@@ -120,6 +120,7 @@ def quantize(
     op_types_to_quantize: list[str] | None = None,
     op_types_to_exclude: list[str] | None = None,
     op_types_to_exclude_fp16: list[str] | None = None,
+    custom_ops_to_cast_fp32: dict | None = None,
     nodes_to_quantize: list[str] | None = None,
     nodes_to_exclude: list[str] | None = None,
     use_external_data_format: bool = False,
@@ -285,6 +286,7 @@
             onnx_model,
             keep_io_types=not direct_io_types,
             op_block_list=op_types_to_exclude_fp16 or [],
+            tensor_block_dict=custom_ops_to_cast_fp32 or {},
             low_precision_type=high_precision_dtype,
             trt_plugins=trt_extra_plugin_lib_paths,
         )

modelopt/onnx/quantization/qdq_utils.py

Lines changed: 29 additions & 14 deletions

@@ -872,22 +872,32 @@ def remove_input_dq_and_output_q(
         )
 
         # Only remove DQs from the inputs of custom ops
-        if consumers[0].op_type not in quantizable_custom_ops:
+        has_cast = consumers[0].op_type == "Cast"
+        consumers_2 = tensor_consumers[consumers[0].output[0]] if has_cast else consumers
+        if consumers_2[0].op_type not in quantizable_custom_ops:
             continue
 
-        # Rewire graph to connect Q with the node after DQ (skip DQ)
-        for consumer in consumers:
-            for cons_idx, cons_inp in enumerate(consumer.input):
-                if cons_inp == node.output[0]:
-                    # If the input tensor is meant to be quantized, delete DQ. Otherwise, delete both Q/DQ.
-                    if cons_idx in quantizable_custom_ops[consumer.op_type]["inp"]:
-                        consumer.input[cons_idx] = q_node.output[0]
-                    else:
-                        q_node_prev = tensor_producers.get(q_node.input[0], None)
-                        consumer.input[cons_idx] = (
-                            q_node_prev.output[0] if q_node_prev else q_node.input[0]
-                        )
-                    break
+        if has_cast:
+            # Assume that this input tensor is not meant to be quantized as there's a Cast node between DQ
+            # and the custom op. Keep the Cast node and delete both Q/DQ nodes.
+            q_node_prev = tensor_producers.get(q_node.input[0], None)
+            consumers[0].input[0] = (
+                q_node_prev.output[0] if q_node_prev else q_node.input[0]
+            )
+        else:
+            # Rewire graph to connect Q with the node after DQ (skip DQ)
+            for consumer in consumers:
+                for cons_idx, cons_inp in enumerate(consumer.input):
+                    if cons_inp == node.output[0]:
+                        # If the input tensor is meant to be quantized, delete DQ. Otherwise, delete both Q/DQ.
+                        if cons_idx in quantizable_custom_ops[consumer.op_type]["inp"]:
+                            consumer.input[cons_idx] = q_node.output[0]
+                        else:
+                            q_node_prev = tensor_producers.get(q_node.input[0], None)
+                            consumer.input[cons_idx] = (
+                                q_node_prev.output[0] if q_node_prev else q_node.input[0]
+                            )
+                        break
 
         # Track DequantizeLinear node indices for cleanup
         dq_indices.append(node_idx)
@@ -944,6 +954,11 @@
             f"{len(dq_indices)} DQ node{'' if len(dq_indices) == 1 else 's'}"
         )
 
+    # Cleanup graph to remove any dangling Q/DQ nodes
+    graph = gs.import_onnx(onnx_model)
+    graph.cleanup()
+    onnx_model = gs.export_onnx(graph)
+
     # TODO: remove manual ir_version change once ORT supports ir_version 11
     onnx_model.ir_version = 10

modelopt/onnx/quantization/quantize.py

Lines changed: 1 addition & 10 deletions

@@ -430,16 +430,6 @@ def quantize(
     )
     trt_plugins = update_trt_ep_support(calibration_eps, has_dds_op, has_custom_op, trt_plugins)  # type: ignore[arg-type]
 
-    # Update list with op types to exclude from FP16/BF16 conversion
-    op_types_to_exclude_fp16 = list(
-        dict.fromkeys((op_types_to_exclude_fp16 or []) + list(custom_ops_to_cast_fp32.keys()))
-    )
-    if high_precision_dtype == "fp32" and op_types_to_exclude_fp16:
-        logger.warning(
-            "Nodes were detected for exclusion from FP16/BF16 conversion, but 'high_precision_dtype' is set to FP32. "
-            "Since the model won't be converted to a lower precision, this flag is void."
-        )
-
     # Use random scales if calibration data is not supplied
     if calibration_data is None:
         calibration_data_reader = RandomDataProvider(onnx_path, calibration_shapes)
@@ -485,6 +475,7 @@
         op_types_to_quantize=op_types_to_quantize,
         op_types_to_exclude=op_types_to_exclude,
         op_types_to_exclude_fp16=op_types_to_exclude_fp16,
+        custom_ops_to_cast_fp32=custom_ops_to_cast_fp32,
         nodes_to_quantize=nodes_to_quantize,
         nodes_to_exclude=nodes_to_exclude,
         use_external_data_format=use_external_data_format,
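Finally, the end-to-end flow from the Testing section, driven from Python rather than the CLI. This is a sketch under the assumption that the `quantize` entry point mirrors the CLI flags; paths are placeholders:

```python
from modelopt.onnx.quantization import quantize

# Assumed keyword names mirror the CLI flags shown in the PR description.
quantize(
    onnx_path="/mnt/models/bevformer_tiny_epoch_24_cp2_op13.onnx",
    trt_plugins=["/path/to/plugin.so"],  # placeholder plugin path
    trt_plugins_precision=["MultiScaleDeformableAttnTRT:[int8,int32,fp32,int8,int8]:[int8]"],
    high_precision_dtype="fp16",
)
```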
