dnn: parallelize nary elementwise forward implementation & enable related conformance tests #25630

Merged

asmorkalov merged 21 commits into opencv:4.x from fengyuentau:nary-multi-thread

Jul 3, 2024

Member

fengyuentau commented May 23, 2024 (edited)

This PR introduces the following changes:

Parallelize the binary forward impl
Parallelize the ternary forward impl (Where)
Parallelize the n-ary forward impl (operators that can take >= 1 operands)
Enable the related conformance tests where they work
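
As an illustration of the first item, here is a minimal sketch of splitting a binary elementwise operation across threads. It uses plain `std::thread` rather than OpenCV's `parallel_for_`, and `parallelBinaryEltwise` is a hypothetical helper, not code from this PR. It also ignores broadcasting, which the real layer has to handle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper, not OpenCV code: applies `op` elementwise to two
// equally sized inputs, splitting the index range across `num_workers` threads.
template <typename T, typename Op>
std::vector<T> parallelBinaryEltwise(const std::vector<T>& a,
                                     const std::vector<T>& b,
                                     Op op,
                                     std::size_t num_workers)
{
    assert(a.size() == b.size());
    assert(num_workers > 0);

    std::vector<T> out(a.size());
    const std::size_t total = a.size();
    // Ceil-divide so that every element belongs to exactly one chunk.
    const std::size_t chunk = (total + num_workers - 1) / num_workers;

    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        const std::size_t begin = std::min(total, w * chunk);
        const std::size_t end = std::min(total, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        // Each thread writes a disjoint slice of `out`, so no locking is needed.
        workers.emplace_back([&a, &b, &out, op, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = op(a[i], b[i]);
        });
    }
    for (auto& t : workers) t.join();
    return out;
}
```

For example, `parallelBinaryEltwise(a, b, std::plus<float>(), 4)` adds two `std::vector<float>` inputs using up to four threads. Because each thread owns a disjoint output slice, the result does not depend on thread scheduling.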

Performance

i7-12700K, RAM 64GB, Ubuntu 22.04

Geometric mean (ms)

Name of Test                                   opencv        opencv        opencv
                                               perf          perf          perf
                                               core.x64.0606 core.x64.0606 core.x64.0606
                                                                           vs
                                                                           opencv
                                                                           perf
                                                                           core.x64.0606
                                                                           (x-factor)
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU         16.116        11.161        1.44
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU      17.469        11.446        1.53
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU      17.531        11.469        1.53
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU    28.653        13.682        2.09
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU  21.899        13.422        1.63
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU     21.738        13.185        1.65
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU      16.172        11.473        1.41
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU     16.309        11.565        1.41
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU      16.166        11.454        1.41
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU      16.157        11.443        1.41
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU      163.459       15.234        10.73
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU  10.880        10.868        1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU  10.947        11.058        0.99
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU  10.948        10.910        1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU  10.874        10.871        1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU  10.971        10.920        1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU      17.546        11.462        1.53
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU      16.175        11.475        1.41
NHWC_C::Layer_NaryEltwise::OCV/CPU             11.339        11.333        1.00
NHWC_H::Layer_NaryEltwise::OCV/CPU             16.154        11.102        1.46

Apple M1, RAM 16GB, macOS 14.4.1

Geometric mean (ms)

Name of Test                                   opencv        opencv              opencv
                                               perf          perf                perf
                                               core.m1.0606  core.m1.0606.patch  core.m1.0606.patch
                                                                                 vs
                                                                                 opencv
                                                                                 perf
                                                                                 core.m1.0606
                                                                                 (x-factor)
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU         28.418        3.768               7.54
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU      6.942         5.679               1.22
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU      5.822         5.653               1.03
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU    5.751         5.628               1.02
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU  5.797         5.599               1.04
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU     7.272         5.578               1.30
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU      5.777         5.562               1.04
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU     5.819         5.559               1.05
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU      5.830         5.574               1.05
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU      5.759         5.567               1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU      342.260       74.655              4.58
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU  8.338         8.280               1.01
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU  8.359         8.309               1.01
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU  8.412         8.295               1.01
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU  8.380         8.297               1.01
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU  8.356         8.323               1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU      6.818         5.561               1.23
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU      5.805         5.570               1.04
NHWC_C::Layer_NaryEltwise::OCV/CPU             3.834         4.817               0.80
NHWC_H::Layer_NaryEltwise::OCV/CPU             28.402        3.771               7.53
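
For readers checking the numbers: the rows above are consistent with the x-factor being the baseline time divided by the patched time, so values above 1 mean the patched build is faster. The sketch below shows that arithmetic together with a geometric mean over run times. The function names are hypothetical and this is not the code of the perf summary script.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive samples, computed via logarithms so that the
// running product cannot overflow.
inline double geometricMean(const std::vector<double>& samples_ms)
{
    assert(!samples_ms.empty());
    double log_sum = 0.0;
    for (double s : samples_ms) {
        assert(s > 0.0);
        log_sum += std::log(s);
    }
    return std::exp(log_sum / static_cast<double>(samples_ms.size()));
}

// x-factor as it appears in the tables: baseline time over patched time.
inline double xFactor(double baseline_ms, double patched_ms)
{
    assert(patched_ms > 0.0);
    return baseline_ms / patched_ms;
}
```

Taking the first i7-12700K row, `xFactor(16.116, 11.161)` rounds to the reported 1.44, and the M1 `NHWC_C` row, `xFactor(3.834, 4.817)`, rounds to 0.80, which is a slowdown.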

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

opencv-alalek reviewed

May 27, 2024

modules/dnn/src/layers/nary_eltwise_layers.cpp Outdated

Comment on lines 690 to 706

		double nstripes = getNumThreads();
		parallel_for_(Range(0, nplanes), worker, nstripes);

Contributor

nstripes = getNumThreads();

This should not be used.
It was already discussed several months ago, e.g. in #23047.
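
The comment above does not say what to use instead. One option, assumed here for illustration and not taken from this PR or from #23047, is to derive the stripe count from the amount of work, so that each stripe covers at least a minimum number of elements no matter how many threads exist. The name `stripesForWork` and the 65536-element granularity are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: choose how many stripes to request based on the amount
// of work, not on the number of threads. The result is meant to be passed as
// the third argument of cv::parallel_for_.
inline double stripesForWork(std::size_t total_elems,
                             std::size_t min_elems_per_stripe = 1 << 16)
{
    assert(min_elems_per_stripe > 0);
    const double stripes =
        static_cast<double>(total_elems) / static_cast<double>(min_elems_per_stripe);
    // Never request fewer than one stripe, so small inputs run as one piece.
    return std::max(1.0, stripes);
}
```

With OpenCV available, the reviewed call could then read `parallel_for_(Range(0, nplanes), worker, stripesForWork(total))`, where `total` stands for the number of output elements and is an assumed variable.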

Member, Author

Thank you for the review, but take it easy: this PR is still a draft. I still remember our discussion.

Member, Author

Changed. Performance results are also updated.

fengyuentau added the optimization and category: dnn labels

May 31, 2024

fengyuentau added this to the 4.11.0 milestone

Jun 3, 2024

fengyuentau marked this pull request as ready for review

June 6, 2024 10:07

fengyuentau requested a review from dkurt

June 7, 2024 04:34

Contributor

asmorkalov commented Jun 10, 2024

My results with Jetson TK1 (ARMv7 + NEON):

ubuntu@jetson1:~/Projects/perf-dnn$ python3 ../opencv/modules/ts/misc/summary.py ./4.x-1.xml ./patched-1.xml | grep NaryEltwise
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU         65.891    43.371    1.52
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU      79.287    81.868    0.97
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU      187.457   187.657   1.00
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU    88.643    96.376    0.92
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU  88.694    96.035    0.92
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU     88.716    90.298    0.98
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU      84.722    83.976    1.01
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU     92.757    81.105    1.14
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU      84.285    84.010    1.00
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU      78.594    78.574    1.00
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU      3407.037  3475.724  0.98
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU  189.651   189.454   1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU  87.859    87.771    1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU  87.915    88.053    1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU  84.077    84.063    1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU  85.160    84.625    1.01
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU      86.368    79.089    1.09
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU      89.897    78.993    1.14
NHWC_C::Layer_NaryEltwise::OCV/CPU             77.220    71.425    1.08
NHWC_H::Layer_NaryEltwise::OCV/CPU             67.494    42.832    1.58

Contributor

asmorkalov commented Jun 11, 2024 (edited)

My results for Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz (no AVX2):

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU         24.193    17.846    1.36
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU      24.026    23.313    1.03
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU      27.370    23.279    1.18
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU    35.025    23.254    1.51
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU  32.455    23.260    1.40
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU     32.509    23.321    1.39
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU      23.997    23.262    1.03
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU     24.038    23.270    1.03
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU      23.977    23.269    1.03
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU      23.927    23.279    1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU      320.598   98.029    3.27
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU  24.507    24.488    1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU  24.484    24.477    1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU  24.500    24.471    1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU  24.486    24.482    1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU  24.472    24.476    1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU      23.953    23.281    1.03
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU      23.992    23.274    1.03
NHWC_C::Layer_NaryEltwise::OCV/CPU             18.260    18.489    0.99
NHWC_H::Layer_NaryEltwise::OCV/CPU             24.182    17.829    1.36

Member, Author

fengyuentau commented Jun 12, 2024

Thank you @asmorkalov for adding more performance results :)

fengyuentau mentioned this pull request

Jun 14, 2024

Fix parser for supporting mean operation from conformance tests #25761

Closed

6 tasks

Member, Author

fengyuentau commented Jun 14, 2024

Any review comments?

fengyuentau changed the title from "dnn: parallelize nary elementwise forward implementation" to "dnn: parallelize nary elementwise forward implementation & enable related conformance tests"

Jun 14, 2024

Contributor

asmorkalov commented Jun 19, 2024

The patch leads to a significant degradation of OpenCL pipelines, e.g.:

VIT_B_32::DNNTestNetwork::OCV/CPU       149.576   191.409   0.78
VIT_B_32::DNNTestNetwork::OCV/OCL       104.428   445.013   0.23
VIT_B_32::DNNTestNetwork::OCV/OCL_FP16  102.505   442.994   0.23

I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or some other inference optimization.
I am looking into the details to see whether it is really caused by this PR.

Member, Author

fengyuentau commented Jun 19, 2024

> The patch leads to a significant degradation of OpenCL pipelines, e.g.:
>
> VIT_B_32::DNNTestNetwork::OCV/CPU       149.576   191.409   0.78
> VIT_B_32::DNNTestNetwork::OCV/OCL       104.428   445.013   0.23
> VIT_B_32::DNNTestNetwork::OCV/OCL_FP16  102.505   442.994   0.23
>
> I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or some other inference optimization. I am looking into the details to see whether it is really caused by this PR.

Ok, I will take a look at the problem.

fengyuentau and others added 16 commits

June 24, 2024 15:51

parallelize binary forward impl

d8d0498

fix bug and format

0742117

refactor dispatch logic; add doc

7fd7017

enable some conformance tests

700716c

use NaryEltwiseLayer for num_inputs=1

9c7f617

parallelize ternary forward impl

afa4ccc

filter some conformance tests for vulkan backend

5f37e82

suppport one input forward in cuda backend

7b1acfa

remove check of number of inputs

4205124

cuda: add pow

e307ab5

separate cuda fp16 filter list to make ci happy

bb522f7

ocl: fix when having only one input

7143030

ov: apply filters to some tests

60e54d6

make default ci happy

d98c148

quickfix for ov backend

09eb31a

parallelize nary forward impl

dba76e2

fengyuentau and others added 5 commits

June 24, 2024 15:51

ov: quickfix for namespace

a107489

fix a bug where different threads can read and write ptrs

4649c50

fix ci

20d0d7e

tune threads

ad5be07

fix nary_forward_impl with mean operation; enable test_mean_example

f3adabe

fengyuentau force-pushed the nary-multi-thread branch from 4be1a1f to f3adabe

June 24, 2024 07:53

Member, Author

fengyuentau commented Jun 24, 2024 (edited)

@asmorkalov The performance "degradation" is due to a very out-of-date code base (>450 commits behind 4.x). I have updated the code base. Performance tests (on Intel UHD 770) look fine on my side. Feel free to retest on your side.

On the positive side, those commits brought a large performance boost (OCL is ~4x faster and CPU is ~1.3x faster). Maybe I can add an OCL backend for this layer later :)

vpisarev self-requested a review

June 27, 2024 21:29

vpisarev approved these changes

Jun 27, 2024

Contributor

asmorkalov commented Jun 28, 2024 (edited)

perf-dnn.zip
The OpenCL-related degradation has disappeared. Perf numbers for the updated PR on the Core i5-2500:

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU         24.142    17.999    1.34
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU      23.860    23.265    1.03
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU      27.383    23.282    1.18
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU    39.056    23.292    1.68
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU  32.489    23.290    1.39
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU     32.435    23.257    1.39
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU      23.966    23.269    1.03
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU     23.992    23.276    1.03
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU      23.951    23.273    1.03
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU      23.862    23.272    1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU      320.265   97.879    3.27
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU  24.491    24.487    1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU  24.463    24.464    1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU  24.472    24.465    1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU  24.460    24.453    1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU  24.463    24.530    1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU      23.870    23.271    1.03
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU      23.964    23.764    1.01
NHWC_C::Layer_NaryEltwise::OCV/CPU             18.083    18.458    0.98
NHWC_H::Layer_NaryEltwise::OCV/CPU             24.140    17.857    1.35

Contributor

asmorkalov commented Jul 1, 2024

I also tried a Xiaomi Mi 10 phone. The results are volatile (maybe due to power management), but I do not see a significant performance gain except for NCHW_C_sum and NCHW_NCHW_pow.
perf-dnn-xiaomi-mi10.zip

Member, Author

fengyuentau commented Jul 2, 2024

> The result is volatile (m.b. power management), but I do not see significant performance gain

It is tuned to use multi-threading only if the input scale is large enough. Traditional convolutional nets do not have such a large input scale for elementwise layers.
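
To illustrate the tuning described above, a minimal sketch of a size-based switch follows. The cutoff value and the names `kAssumedMinElems`, `ExecPath` and `choosePath` are hypothetical; the thread does not state the threshold the PR actually uses.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cutoff: inputs with fewer elements stay on the calling thread,
// because starting and joining workers would cost more than the loop itself.
constexpr std::size_t kAssumedMinElems = 1 << 20;

enum class ExecPath { Serial, Parallel };

// Decide which code path an elementwise layer would take for a given input size.
inline ExecPath choosePath(std::size_t total_elems,
                           std::size_t min_elems = kAssumedMinElems)
{
    return total_elems >= min_elems ? ExecPath::Parallel : ExecPath::Serial;
}
```

Under this assumed cutoff, a small tensor takes the serial path and sees no change from the patch, which would match benchmarks where only the largest cases speed up.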

asmorkalov approved these changes

Jul 3, 2024

asmorkalov merged commit a7fd944 into opencv:4.x

Jul 3, 2024

fengyuentau mentioned this pull request

Jul 12, 2024

dnn: merge #25630 to 5.x #25900

Merged

6 tasks

asmorkalov added the port/backport done label (label for maintainers; PR authors can ignore it)

Jul 12, 2024

asmorkalov pushed a commit that referenced this pull request

Jul 15, 2024

Merge pull request #25900 from fengyuentau:dnn/nary_elementwise_multi…

4206634

…_thread

dnn: merge #25630 to 5.x #25900

Sync changes from #25630 to 5.x.

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake

asmorkalov mentioned this pull request

Jul 16, 2024

(5.x) Merge 4.x #25915

Merged

fengyuentau deleted the nary-multi-thread branch

July 30, 2024 15:06

asmorkalov mentioned this pull request

Aug 5, 2024

fix compilation errors caused by namespace #25987

Merged

6 tasks