feat: support aten._cdist_forward converter #2726


Merged
zewenli98 merged 5 commits into main from aten_cdist_forward_converter
May 18, 2024

Conversation

chohk88 (Collaborator) commented Apr 4, 2024 (edited):

Description

New feature to support the aten._cdist_forward converter. I have added test cases ensuring compatibility with both matching and broadcasting input shapes.

Fixes # (issue)

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@chohk88 self-assigned this Apr 4, 2024
@github-actions bot added labels Apr 4, 2024: component: tests (Issues re: Tests), component: conversion (Issues re: Conversion stage), component: converters (Issues re: Specific op converters), component: api [Python] (Issues re: Python API), component: dynamo (Issues relating to the `torch.compile` or `torch._dynamo.export` paths)
@chohk88 linked an issue Apr 4, 2024 that may be closed by this pull request
x1: TRTTensor,
x2: TRTTensor,
p: float,
compute_mode: Optional[int] = None,
Collaborator:

It seems that compute_mode is not used below. Is there any difficulty with this? Ideally, we should mimic all of PyTorch's behaviors (i.e., receiving the same args and then returning the same outputs as PyTorch, based on the schema).
If compute_mode cannot be supported for some reason, we need to use a capability_validator to make sure such cases are not converted and instead fall back to PyTorch.

chohk88 (Author):

compute_mode does not influence the output values. It is designed to optimize computational efficiency in specific situations when calculating the Euclidean distance (p=2) for large-sized inputs.

I have modified the code to clarify:

  1. A note has been added that compute_mode doesn't alter the computational path in our implementation.
  2. A default value of 0 is used for compute_mode when it's unspecified, maintaining behavioral consistency with PyTorch.
  3. I have included a warning for situations where an optimized path for p=2 might be expected but isn't utilized due to our unified (element-wise) approach.
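The point that compute_mode never changes the returned values can be illustrated with a minimal, stdlib-only sketch of cdist semantics for 2-D inputs (the function name and shapes here are illustrative, not the converter's actual code); in PyTorch, compute_mode only selects an execution strategy for p=2:

```python
def cdist(x1, x2, p=2.0):
    """Pairwise p-norm distances between rows of x1 and rows of x2.

    Mirrors the observable semantics of aten._cdist_forward for 2-D
    inputs: an output of shape [len(x1), len(x2)] whose (i, j) entry is
    the p-norm of x1[i] - x2[j]. compute_mode is deliberately absent
    because it does not affect the values, only how they are computed.
    """
    return [
        [sum(abs(a - b) ** p for a, b in zip(r1, r2)) ** (1.0 / p) for r2 in x2]
        for r1 in x1
    ]

x1 = [[0.0, 0.0], [3.0, 4.0]]
x2 = [[0.0, 0.0], [6.0, 8.0]]
d = cdist(x1, x2, p=2.0)
# d[1][0] is the Euclidean norm of (3, 4), i.e. 5.0
```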

Collaborator:

Thanks for the explanation! I'm still wondering whether using matmul would be faster for the previous two scenarios. Considering that we have an out-of-the-box matmul, could you compare their time costs if possible?

chohk88 (Author):

Thank you for your valuable comment. In this update, I've implemented a matrix-multiplication-based approach to compute the distance for p=2. During my tests on my local PC with inputs x1=(150, 100, 50, 50) and x2=(150, 100, 30, 50), I observed that the TRT run time improved by approximately 6.3 times. Here are the relevant logs from those tests:

```
INFO:harness:FX graph= graph():
    %x1 : [num_users=1] = placeholder[target=x1]
    %x2 : [num_users=1] = placeholder[target=x2]
    %_cdist_forward_default : [num_users=1] = call_function[target=torch.ops.aten._cdist_forward.default](args = (%x1, %x2, 2, 0), kwargs = {})
    return _cdist_forward_default
=============== based on matmul : ==================
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.007999
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:33.710717
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 244801024 bytes of Memory
INFO:harness:Interpreter run time(s): 33.719337898997765
INFO:harness:TRT run time(s)= 0.0026535038948059084

INFO:harness:FX graph= graph():
    %x1 : [num_users=1] = placeholder[target=x1]
    %x2 : [num_users=1] = placeholder[target=x2]
    %_cdist_forward_default : [num_users=1] = call_function[target=torch.ops.aten._cdist_forward.default](args = (%x1, %x2, 2, 1), kwargs = {})
    return _cdist_forward_default
=============== based on elementwise pow for diff : ==================
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.003120
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:43.902498
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 4590000640 bytes of Memory
INFO:harness:Interpreter run time(s): 43.90606521600421
INFO:harness:TRT run time(s)= 0.016830976486206056
```
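The matmul-based path rests on the algebraic identity ||a - b||² = ||a||² - 2·a·b + ||b||², which replaces all pairwise differences with a single matrix product. A stdlib-only sketch (function names are illustrative, not the converter's code) shows the two strategies agree:

```python
import math

def cdist_p2_matmul(x1, x2):
    # Expansion ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2: one matrix of dot
    # products instead of materializing every pairwise difference tensor.
    sq1 = [sum(v * v for v in r) for r in x1]
    sq2 = [sum(v * v for v in r) for r in x2]
    dot = [[sum(a * b for a, b in zip(r1, r2)) for r2 in x2] for r1 in x1]
    # max(..., 0.0) guards against tiny negative values from rounding.
    return [
        [math.sqrt(max(sq1[i] - 2.0 * dot[i][j] + sq2[j], 0.0)) for j in range(len(x2))]
        for i in range(len(x1))
    ]

def cdist_p2_elementwise(x1, x2):
    # Direct element-wise route: subtract, square, sum, sqrt per pair.
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2))) for r2 in x2]
        for r1 in x1
    ]

x1 = [[1.0, 2.0], [3.0, 4.0]]
x2 = [[5.0, 6.0], [7.0, 8.0], [0.0, 0.0]]
m = cdist_p2_matmul(x1, x2)
e = cdist_p2_elementwise(x1, x2)
# both routes agree to floating-point tolerance
```

The matmul route also never holds the broadcasted difference tensor in memory, which is consistent with the much smaller engine memory reported in the logs above.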

zewenli98 (Collaborator) commented Apr 29, 2024 (edited):

It seems matmul runs faster in both the compilation stage and the running stage, and saves a lot of memory.

chohk88 (Author):

You're right. While always using matrix multiplication might seem more efficient given TensorRT's capabilities, I've aligned the implementation with the compute mode of the ATen operation for consistency. Thank you for pointing this out!

print("x1 : ", x1)
print("x2 : ", x2)

return torch.ops.aten._cdist_forward.default(x1, x2, p, None)
Collaborator:

Different compute_mode values other than None should be tested.

chohk88 (Author):

I have made modifications to allow testing different compute_mode values from the test-case inputs.

gs-olive (Contributor) left a comment:

Overall looks good - added a few suggestions

x1=args[0],
x2=args[1],
p=args[2],
compute_mode=args[3],
gs-olive (Contributor):

Based on the schema having int? as the type for compute_mode, this should be compute_mode = args_bounds_check(args, 2, None)
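For context, the bounds-checked accessor pattern being requested can be sketched as follows. This is an illustrative reimplementation, not the actual Torch-TensorRT helper (which lives in the project's converter utilities); the point is that an optional schema argument like `int? compute_mode` may simply be absent from `args`:

```python
from typing import Any, Optional, Sequence

def args_bounds_check(args: Sequence[Any], i: int, replacement: Any = None) -> Any:
    """Return args[i] if the argument was supplied, else a default.

    Optional schema arguments (e.g. `int? compute_mode`) may be omitted by
    the caller, so indexing args directly would raise IndexError.
    """
    return args[i] if len(args) > i else replacement

args = ("x1", "x2", 2.0)  # compute_mode (index 3) omitted by the caller
compute_mode: Optional[int] = args_bounds_check(args, 3, None)
# compute_mode is None when args has no fourth element
```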

chohk88 (Author):

Thank you for the review! I've made the requested changes (compute_mode = args_bounds_check(args, 3, None))

x1: TRTTensor,
x2: TRTTensor,
p: float,
compute_mode: int,
gs-olive (Contributor):

Switch to Optional[int], since None is a valid choice

chohk88 (Author):

Thank you for your comment! I've made the requested changes.

Comment on lines 627 to 628:

else:
    raise NotImplementedError(f"Currently, p={p} is not implemented.")
gs-olive (Contributor):

This can be removed, since it seems you have covered all of the valid intervals in Torch's definition of cdist, $[0, \infty]$, as per this
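The full interval of p values that torch.cdist accepts has a few special cases worth keeping in mind; a small stdlib-only sketch of the per-pair distance (illustrative, not the converter's code) covers them:

```python
import math

def minkowski(a, b, p):
    """Per-pair p-norm distance over the interval [0, inf] that
    torch.cdist accepts: p=0 counts nonzero coordinate differences
    (a Hamming-style count), p=inf takes the maximum absolute
    difference, and finite p > 0 is the usual Minkowski norm."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == 0:
        return float(sum(1 for d in diffs if d != 0))
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

a, b = [1.0, 2.0, 3.0], [1.0, 0.0, 7.0]
# p=0 -> 2.0 (two coordinates differ), p=1 -> 6.0, p=inf -> 4.0
```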

chohk88 (Author):

Thank you for the review! I've made the requested changes.

github-actions[bot] left a comment:

There are some changes that do not conform to Python style guidelines:

```diff
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/impl/normalization/ops.py	2024-04-29 14:43:08.889448+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/impl/normalization/ops.py	2024-04-29 14:44:58.684466+00:00
@@ -549,11 +549,11 @@
     - If x1.shape = [2, 3, 10, 5] and x2.shape = [2, 3, 20, 5], both having the same batch dimensions [2, 3], the output shape will be [2, 3, 10, 20].
       This represents computing distances in two batches of three groups, each comparing 10 vectors from x1 with 20 vectors from x2.
     - For x1.shape = [10, 5] (10 vectors, each of 5 features) and x2.shape = [20, 5] (20 vectors, each of 5 features),
       since there are no batch dimensions to match, the output shape is simply [10, 20], comparing all vectors from x1 against all vectors from x2.
-    Note: The `compute_mode` parameter is designed to optimize the performance of the Euclidean distance calculation, especially useful when working with large datasets.  
+    Note: The `compute_mode` parameter is designed to optimize the performance of the Euclidean distance calculation, especially useful when working with large datasets.
     This parameter allows you to control how the distances are computed, with different modes available to leverage matrix multiplication for speed improvements.
     """
     if compute_mode is None:
         compute_mode = 0
```

zewenli98 (Collaborator):

@chohk88 The CI failed. Can you take a look?


zewenli98 (Collaborator):

@chohk88 PR #2819 was merged to main. Can you rebase and check if the error in CI is fixed?


@chohk88 force-pushed the aten_cdist_forward_converter branch from 1962cf8 to 9f901f6 on May 9, 2024 04:33
zewenli98 (Collaborator) left a comment:

@chohk88 It seems the tests use too large a dataset, which caused an OOM error in CI. My advice is to scale them down, but make sure every condition path is still covered.

chohk88 (Author):

@zewenli98 Thank you for your comments! The issue with CI/CD failures due to the dataset size has been resolved. However, while the Windows wheel build has completed successfully, the Linux wheel build is still failing.

zewenli98 (Collaborator):

> @zewenli98 Thank you for your comments! The issue with CI/CD failures due to the dataset size has been resolved. However, while the Windows wheel build has completed successfully, the Linux wheel build is still failing.

It seems to work now. LGTM.

zewenli98 merged commit 6fcccf1 into main on May 18, 2024
zewenli98 deleted the aten_cdist_forward_converter branch on May 18, 2024 01:31

Reviewers

@zewenli98 — requested changes
@github-actions[bot] — requested changes
@apbose — awaiting requested review
@peri044 — awaiting requested review
@gs-olive — awaiting requested review

Assignees

@chohk88

Labels

cla signed, component: api [Python] (Issues re: Python API), component: conversion (Issues re: Conversion stage), component: converters (Issues re: Specific op converters), component: dynamo (Issues relating to the `torch.compile` or `torch._dynamo.export` paths), component: tests (Issues re: Tests)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

aten._cdist_forward

6 participants

@chohk88, @zewenli98, @gs-olive, @narendasan, @facebook-github-bot
