dnn (cuda): support broadcasting if a.rank() != b.rank() #24834


Merged

Conversation

@fengyuentau (Member) commented Jan 9, 2024 (edited)

Inspired by #24786. This PR keeps the fusion of NaryEltwise and Concat while addressing the missing-data problem by supporting broadcasting when a.rank() != b.rank().
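For context, ONNX/NumPy-style broadcasting first aligns ranks by prepending size-1 axes to the lower-rank shape, then matches dimensions pairwise. A minimal sketch of that shape inference, with a hypothetical helper name (this is not the actual OpenCV CUDA code):

#include <stdexcept>
#include <vector>

// Sketch of NumPy/ONNX-style broadcast shape inference for inputs of
// different ranks. Hypothetical helper, not OpenCV's implementation.
std::vector<int> broadcastShape(std::vector<int> a, std::vector<int> b)
{
    // Align ranks: prepend 1s to the lower-rank shape.
    if (a.size() < b.size())
        a.insert(a.begin(), b.size() - a.size(), 1);
    else if (b.size() < a.size())
        b.insert(b.begin(), a.size() - b.size(), 1);

    // Match dimensions pairwise; a size-1 axis stretches to the other size.
    std::vector<int> out(a.size());
    for (std::size_t i = 0; i < a.size(); i++)
    {
        if (a[i] == b[i] || b[i] == 1)
            out[i] = a[i];
        else if (a[i] == 1)
            out[i] = b[i];
        else
            throw std::runtime_error("shapes are not broadcastable");
    }
    return out;
}

For example, shapes [1, 3, 4, 4] and [4] align to [1, 3, 4, 4] and [1, 1, 1, 4], giving the output shape [1, 3, 4, 4].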

Resolves #23977
Resolves #24606
Resolves #24635
Resolves #24721

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable.
    The patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@fengyuentau added the bug, category: gpu/cuda (contrib) (OpenCV 4.0+: moved to opencv_contrib) and category: dnn labels Jan 9, 2024
@fengyuentau added this to the 4.10.0 milestone Jan 9, 2024
@fengyuentau (Member, Author) commented

Tried to add yolov8n to the tests on different backends, but it turns out we may have more problems, especially with the CUDA_FP16 target:

[ RUN      ] DNNTestNetwork.YOLOv8n/0, where GetParam() = CUDA/CUDA
/workspace/cuda_naryeltwise_broadcast/opencv/modules/dnn/test/test_common.impl.hpp:79: Failure
Expected: (normInf) <= (lInf), actual: 0.0010376 vs 0.0001
First run  |ref| = 638.03076171875
/workspace/cuda_naryeltwise_broadcast/opencv/modules/dnn/test/test_common.impl.hpp:79: Failure
Expected: (normInf) <= (lInf), actual: 0.00109863 vs 0.0001
Second run  |ref| = 638.5064697265625
[  FAILED  ] DNNTestNetwork.YOLOv8n/0, where GetParam() = CUDA/CUDA (288 ms)
[ RUN      ] DNNTestNetwork.YOLOv8n/1, where GetParam() = CUDA/CUDA_FP16
/workspace/cuda_naryeltwise_broadcast/opencv/modules/dnn/test/test_common.impl.hpp:76: Failure
Expected: (normL1) <= (l1), actual: 0.0118579 vs 0.004
First run  |ref| = 638.03076171875
/workspace/cuda_naryeltwise_broadcast/opencv/modules/dnn/test/test_common.impl.hpp:79: Failure
Expected: (normInf) <= (lInf), actual: 6.54901 vs 0.02
First run  |ref| = 638.03076171875
/workspace/cuda_naryeltwise_broadcast/opencv/modules/dnn/test/test_common.impl.hpp:76: Failure
Expected: (normL1) <= (l1), actual: 0.0119177 vs 0.004
Second run  |ref| = 638.5064697265625
/workspace/cuda_naryeltwise_broadcast/opencv/modules/dnn/test/test_common.impl.hpp:79: Failure
Expected: (normInf) <= (lInf), actual: 8.83636 vs 0.02
Second run  |ref| = 638.5064697265625
[  FAILED  ] DNNTestNetwork.YOLOv8n/1, where GetParam() = CUDA/CUDA_FP16 (601 ms)
[ RUN      ] DNNTestNetwork.YOLOv8n/2, where GetParam() = OCV/OCL
[ WARN:0@1.061] global ocl4dnn_conv_spatial.cpp:1931 loadTunedConfig OpenCV(ocl4dnn): consider to specify kernel configuration cache directory through OPENCV_OCL4DNN_CONFIG_PATH parameter.
OpenCL program build log: dnn/dummy
Status -11: CL_BUILD_PROGRAM_FAILURE
-cl-no-subgroup-ifp
Error in processing command line: Don't understand command line argument "-cl-no-subgroup-ifp"!
/workspace/cuda_naryeltwise_broadcast/opencv/modules/dnn/test/test_common.impl.hpp:79: Failure
Expected: (normInf) <= (lInf), actual: 0.00161743 vs 0.0001
First run  |ref| = 638.03076171875
/workspace/cuda_naryeltwise_broadcast/opencv/modules/dnn/test/test_common.impl.hpp:79: Failure
Expected: (normInf) <= (lInf), actual: 0.00117493 vs 0.0001
Second run  |ref| = 638.5064697265625
[  FAILED  ] DNNTestNetwork.YOLOv8n/2, where GetParam() = OCV/OCL (3184 ms)
[ RUN      ] DNNTestNetwork.YOLOv8n/3, where GetParam() = OCV/OCL_FP16
[       OK ] DNNTestNetwork.YOLOv8n/3 (545 ms)
[----------] 4 tests from DNNTestNetwork (4618 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (4618 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] DNNTestNetwork.YOLOv8n/0, where GetParam() = CUDA/CUDA
[  FAILED  ] DNNTestNetwork.YOLOv8n/1, where GetParam() = CUDA/CUDA_FP16
[  FAILED  ] DNNTestNetwork.YOLOv8n/2, where GetParam() = OCV/OCL
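For reference, the normL1/normInf figures in this log are distances between the tested backend's output and the reference output. A minimal sketch of the kind of comparison test_common.impl.hpp performs (the tolerance defaults here are illustrative, not the suite's actual values):

#include <opencv2/core.hpp>

// Sketch of the accuracy check behind "Expected: (normInf) <= (lInf)":
// compare a backend's output against a reference within tolerances.
bool outputsMatch(const cv::Mat& out, const cv::Mat& ref,
                  double l1 = 1e-4, double lInf = 1e-4)
{
    double normL1  = cv::norm(out, ref, cv::NORM_L1) / ref.total();  // mean absolute error
    double normInf = cv::norm(out, ref, cv::NORM_INF);               // max absolute error
    return normL1 <= l1 && normInf <= lInf;
}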

@Abdurrahheem (Contributor) commented

@fengyuentau once this PR is complete (currently yolov8 is not supported on CUDA here, AFAIK), does it mean that PR #24786 is going to be obsolete?

@fengyuentau (Member, Author) commented

> does it mean that PR #24786 is going to be obsolete?

Yes.

> currently yolov8 is not supported on CUDA here

That's not true. There are some minor differences in the results between CPU and CUDA/CUDA, which I think is OK, but the differences are much bigger for the CUDA_FP16 target. I guess we lose some accuracy in Sigmoid and similar layers. This needs an in-depth investigation.
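To illustrate the FP16 accuracy concern, a small standalone sketch (assuming a compiler with C++23 <stdfloat> and std::float16_t support; this only demonstrates the rounding error of the fp16 format, not how the CUDA backend computes Sigmoid):

#include <cmath>
#include <cstdio>
#include <stdfloat>  // C++23; assumes std::float16_t is available

// Compare sigmoid evaluated in fp32 against a value rounded through fp16.
int main()
{
    for (float x : {-8.0f, -2.5f, 0.1f, 3.7f})
    {
        float ref = 1.0f / (1.0f + std::exp(-x));
        std::float16_t xh = static_cast<std::float16_t>(x);
        float yh = static_cast<float>(static_cast<std::float16_t>(
            1.0f / (1.0f + std::exp(-static_cast<float>(xh)))));
        std::printf("x=%6.2f  fp32=%.7f  fp16=%.7f  diff=%.2e\n",
                    x, ref, yh, ref - yh);
    }
    return 0;
}

Per-element rounding errors of this magnitude (around 1e-4) can accumulate across layers, which would be consistent with the larger normL1/normInf gaps seen on CUDA_FP16.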

@asmorkalov (Contributor) commented

Locally I observe several test failures like this:

[----------] 1 test from Layer_Test_Eltwise_bcast
[ RUN      ] Layer_Test_Eltwise_bcast.brute_force/10, where GetParam() = ("sum", 3, CUDA/CUDA)
Exception message: OpenCV(4.9.0-dev) /mnt/projects/Projects/OpenCV/opencv-master/modules/dnn/src/cuda/../cuda4dnn/csl/tensor.hpp:1047: error: (-215:Assertion failed) rank() >= 2 in function 'squeeze'
/mnt/projects/Projects/OpenCV/opencv-master/modules/dnn/test/test_layers.cpp:2053: Failure
Expected: re = net.forward() doesn't throw an exception.
  Actual: it throws.
Exception message: OpenCV(4.9.0-dev) /mnt/projects/Projects/OpenCV/opencv-master/modules/dnn/src/cuda/../cuda4dnn/csl/tensor.hpp:1047: error: (-215:Assertion failed) rank() >= 2 in function 'squeeze'
/mnt/projects/Projects/OpenCV/opencv-master/modules/dnn/test/test_layers.cpp:2053: Failure
Expected: re = net.forward() doesn't throw an exception.
  Actual: it throws.
Exception message: OpenCV(4.9.0-dev) /mnt/projects/Projects/OpenCV/opencv-master/modules/dnn/src/cuda/../cuda4dnn/csl/tensor.hpp:1047: error: (-215:Assertion failed) rank() >= 2 in function 'squeeze'
/mnt/projects/Projects/OpenCV/opencv-master/modules/dnn/test/test_layers.cpp:2053: Failure
Expected: re = net.forward() doesn't throw an exception.
  Actual: it throws.
Exception message: OpenCV(4.9.0-dev) /mnt/projects/Projects/OpenCV/opencv-master/modules/dnn/src/cuda/../cuda4dnn/csl/tensor.hpp:1047: error: (-215:Assertion failed) rank() >= 2 in function 'squeeze'
/mnt/projects/Projects/OpenCV/opencv-master/modules/dnn/test/test_layers.cpp:2053: Failure
Expected: re = net.forward() doesn't throw an exception.
  Actual: it throws.
[  FAILED  ] Layer_Test_Eltwise_bcast.brute_force/10, where GetParam() = ("sum", 3, CUDA/CUDA) (1536 ms)

Full list:

[  FAILED  ] Layer_Test_Eltwise_bcast.brute_force/10, where GetParam() = ("sum", 3, CUDA/CUDA)
[  FAILED  ] Layer_Test_Eltwise_bcast.brute_force/11, where GetParam() = ("sum", 3, CUDA/CUDA_FP16)
[  FAILED  ] Layer_Test_Eltwise_bcast.brute_force/15, where GetParam() = ("sum", 4, CUDA/CUDA)
[  FAILED  ] Layer_Test_Eltwise_bcast.brute_force/16, where GetParam() = ("sum", 4, CUDA/CUDA_FP16)
[  FAILED  ] Layer_Test_Eltwise_bcast.brute_force/20, where GetParam() = ("sum", 5, CUDA/CUDA)
[  FAILED  ] Layer_Test_Eltwise_bcast.brute_force/21, where GetParam() = ("sum", 5, CUDA/CUDA_FP16)
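The assertion comes from the CSL tensor's squeeze(): removing a size-1 axis from a rank-1 tensor would leave a rank-0 tensor, which the CUDA backend's tensor types do not support. A simplified sketch of that guard (my paraphrase for illustration, not the actual tensor.hpp code):

#include <cassert>
#include <cstddef>
#include <vector>

// Simplified sketch of why squeeze() asserts rank() >= 2: squeezing a
// rank-1 tensor would produce a rank-0 tensor, which is unsupported.
struct TensorShape
{
    std::vector<int> dims;

    std::size_t rank() const { return dims.size(); }

    void squeeze(int axis)
    {
        assert(rank() >= 2 && "result must keep at least rank 1");
        assert(dims[axis] == 1 && "only size-1 axes can be squeezed");
        dims.erase(dims.begin() + axis);
    }
};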

@fengyuentau (Member, Author) commented

This is because these failing tests have inputs of shape [1] (1-d Mat). The CUDA backend has asserts checking rank >= 2, so it is not feasible to run these tests with the CUDA backend without bypassing those checks.

It worked previously because it was not actually testing the CUDA backend: if two inputs have different numbers of dimensions, it falls back to the CPU implementation, so nothing related to the CUDA backend is tested in these cases. See below for the fallback (lines 804-805; returning an empty Ptr<BackendNode> makes the framework use the default implementation):

auto input_0_shape = inputs[0].dynamicCast<CUDABackendWrapper>()->getShape();
for (int i = 1; i < inputs.size(); i++)
{
    auto input_i_shape = inputs[i].dynamicCast<CUDABackendWrapper>()->getShape();
    if (input_0_shape.size() != input_i_shape.size())
        return Ptr<BackendNode>();
    // check if the shape can be supported by `eltwise_ops.cu`, or return the default BackendNode
    for (int j = 0; j < input_0_shape.size(); j++)
        if (input_0_shape[j] != input_i_shape[j] &&
            input_0_shape[j] != 1 && input_i_shape[j] != 1)
            return Ptr<BackendNode>();
}


With that being said, I propose to disable these tests specifically for the CUDA backend. @asmorkalov What do you think?

@WanliZhong Please join this discussion as well.

@fengyuentau (Member, Author) commented

Or we could still fall back to CPU when the dimension is 1.

@WanliZhong (Member) commented

I propose falling back when the dimension is 1, so that CUDA runs correctly rather than throwing an error.

@fengyuentau (Member, Author) commented

That does not work, because the 1-d Mat is actually produced inside the broadcasting implementation in the CUDA backend itself. Let me find another solution to this.
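One conceivable fix, sketched below under my own assumptions (not necessarily what the new commits actually do), is to pad shapes with leading 1s inside the CUDA broadcasting path so no intermediate ever drops below rank 2; padding with size-1 axes changes neither the element count nor the broadcast semantics:

#include <cstddef>
#include <vector>

// Hypothetical workaround sketch: keep every shape at rank >= 2 by
// prepending size-1 axes, so the rank() >= 2 asserts are never hit.
std::vector<int> padToMinRank(std::vector<int> shape, std::size_t minRank = 2)
{
    if (shape.size() < minRank)
        shape.insert(shape.begin(), minRank - shape.size(), 1);
    return shape;
}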

@fengyuentau (Member, Author) commented

New commits should resolve this problem.

@asmorkalov (Contributor) commented

Tests pass with CUDA locally now.


@fengyuentau (Member, Author) commented

Sporadic crash in PR:4.x / macOS-ARM64-Vulkan / BuildAndTest (pull_request). This patch makes no changes to Caffe or Vulkan.


@asmorkalov merged commit e7ccff9 into opencv:4.x Jan 11, 2024
@fengyuentau deleted the cuda_naryeltwise_broadcast branch January 11, 2024 07:07
@Abdurrahheem mentioned this pull request Jan 11, 2024
@asmorkalov mentioned this pull request Jan 19, 2024

Reviewers

@dkurt approved these changes

@Abdurrahheem approved these changes

Awaiting requested review from @WanliZhong

Assignees

@dkurt

Labels

bug, category: dnn, category: gpu/cuda (contrib) (OpenCV 4.0+: moved to opencv_contrib)

Projects

None yet

Milestone

4.10.0

5 participants

@fengyuentau, @Abdurrahheem, @asmorkalov, @WanliZhong, @dkurt
