dnn: parallelize nary elementwise forward implementation & enable related conformance tests #25630

Merged: asmorkalov merged 21 commits into opencv:4.x from fengyuentau:nary-multi-thread on Jul 3, 2024

fengyuentau (Member) commented on May 23, 2024 (edited):

This PR introduces the following changes:

  • Parallelize the binary forward implementation
  • Parallelize the ternary forward implementation (Where)
  • Parallelize the nary forward implementation (operators that take >= 1 operands)
  • Enable the related conformance tests where workable

Performance

i7-12700K, RAM 64GB, Ubuntu 22.04

Geometric mean (ms)
A = opencv perf core.x64.0606, B = opencv perf core.x64.0606, x-factor = B vs A

Name of Test                                          A          B   x-factor
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           16.116     11.161       1.44
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        17.469     11.446       1.53
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        17.531     11.469       1.53
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      28.653     13.682       2.09
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    21.899     13.422       1.63
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       21.738     13.185       1.65
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        16.172     11.473       1.41
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       16.309     11.565       1.41
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        16.166     11.454       1.41
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        16.157     11.443       1.41
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU       163.459     15.234      10.73
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    10.880     10.868       1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    10.947     11.058       0.99
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    10.948     10.910       1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    10.874     10.871       1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    10.971     10.920       1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        17.546     11.462       1.53
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        16.175     11.475       1.41
NHWC_C::Layer_NaryEltwise::OCV/CPU               11.339     11.333       1.00
NHWC_H::Layer_NaryEltwise::OCV/CPU               16.154     11.102       1.46
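For readers checking the numbers: the x-factor values in these tables are consistent with dividing the first timing column by the second, so a value above 1 means the second build is faster. A small sketch of that arithmetic follows; `xFactor` and `roundTo2` are hypothetical helper names, not part of OpenCV's perf tooling.

```cpp
#include <cassert>
#include <cmath>

// Ratio of the baseline time to the patched time (both in milliseconds).
// Values above 1.0 mean the patched build is faster.
inline double xFactor(double baselineMs, double patchedMs)
{
    assert(baselineMs > 0.0 && patchedMs > 0.0);
    return baselineMs / patchedMs;
}

// Round to two decimals, matching the precision shown in the tables.
inline double roundTo2(double v)
{
    return std::round(v * 100.0) / 100.0;
}
```

For example, the NCHW_NCHW_pow row gives 163.459 / 15.234, which rounds to 10.73.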

Apple M1, RAM 16GB, macOS 14.4.1

Geometric mean (ms)
A = opencv perf core.m1.0606, B = opencv perf core.m1.0606.patch, x-factor = B vs A

Name of Test                                          A          B   x-factor
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           28.418      3.768       7.54
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU         6.942      5.679       1.22
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU         5.822      5.653       1.03
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU       5.751      5.628       1.02
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU     5.797      5.599       1.04
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU        7.272      5.578       1.30
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU         5.777      5.562       1.04
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU        5.819      5.559       1.05
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU         5.830      5.574       1.05
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU         5.759      5.567       1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU       342.260     74.655       4.58
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU     8.338      8.280       1.01
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU     8.359      8.309       1.01
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU     8.412      8.295       1.01
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU     8.380      8.297       1.01
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU     8.356      8.323       1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU         6.818      5.561       1.23
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU         5.805      5.570       1.04
NHWC_C::Layer_NaryEltwise::OCV/CPU                3.834      4.817       0.80
NHWC_H::Layer_NaryEltwise::OCV/CPU               28.402      3.771       7.53

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Review comment on lines 690 to 706 (Contributor):

double nstripes = getNumThreads();
parallel_for_(Range(0, nplanes), worker, nstripes);

nstripes = getNumThreads();

This should not be used.
Already discussed several months ago, e.g. #23047
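The thread does not restate the reasoning from #23047 or show the formula the PR finally adopted. As a hedged illustration only, one pattern seen elsewhere in OpenCV is to derive the `nstripes` hint from the amount of work (total elements divided by a grain such as 1 << 16) rather than from the thread count. The helper below is hypothetical, and the grain value is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of OpenCV: suggest a stripe count from the
// amount of work instead of from the number of threads.
inline double suggestStripes(std::size_t nplanes, std::size_t planeSize,
                             std::size_t grain = std::size_t(1) << 16)
{
    assert(grain > 0);
    const double total = static_cast<double>(nplanes) * static_cast<double>(planeSize);
    const double byWork = total / static_cast<double>(grain);
    // Never ask for more stripes than planes, and never fewer than one.
    return std::max(1.0, std::min(byWork, static_cast<double>(nplanes)));
}
```

Under these assumptions the result would be passed as the third argument of `parallel_for_(Range(0, nplanes), worker, nstripes)`, so small inputs collapse to a single stripe while large inputs are split further.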

fengyuentau (Member, Author) replied:

Thank you for the review, but take it easy; this PR is still a draft. I still remember our discussion.

fengyuentau (Member, Author) replied:

Changed. Performance results are also updated.

fengyuentau added this to the 4.11.0 milestone on Jun 3, 2024
fengyuentau marked this pull request as ready for review on June 6, 2024 10:07
fengyuentau requested a review from dkurt on June 7, 2024 04:34

asmorkalov (Contributor) commented:

My results with Jetson tk1 (armv7+neon):

ubuntu@jetson1:~/Projects/perf-dnn$ python3 ../opencv/modules/ts/misc/summary.py ./4.x-1.xml ./patched-1.xml | grep NaryEltwise
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           65.891     43.371       1.52
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        79.287     81.868       0.97
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU       187.457    187.657       1.00
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      88.643     96.376       0.92
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    88.694     96.035       0.92
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       88.716     90.298       0.98
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        84.722     83.976       1.01
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       92.757     81.105       1.14
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        84.285     84.010       1.00
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        78.594     78.574       1.00
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU      3407.037   3475.724       0.98
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU   189.651    189.454       1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    87.859     87.771       1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    87.915     88.053       1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    84.077     84.063       1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    85.160     84.625       1.01
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        86.368     79.089       1.09
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        89.897     78.993       1.14
NHWC_C::Layer_NaryEltwise::OCV/CPU               77.220     71.425       1.08
NHWC_H::Layer_NaryEltwise::OCV/CPU               67.494     42.832       1.58

asmorkalov (Contributor) commented on Jun 11, 2024 (edited):

My results for Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz (no AVX2):

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           24.193     17.846       1.36
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        24.026     23.313       1.03
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        27.370     23.279       1.18
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      35.025     23.254       1.51
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    32.455     23.260       1.40
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       32.509     23.321       1.39
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        23.997     23.262       1.03
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       24.038     23.270       1.03
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        23.977     23.269       1.03
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        23.927     23.279       1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU       320.598     98.029       3.27
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    24.507     24.488       1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    24.484     24.477       1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    24.500     24.471       1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    24.486     24.482       1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    24.472     24.476       1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        23.953     23.281       1.03
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        23.992     23.274       1.03
NHWC_C::Layer_NaryEltwise::OCV/CPU               18.260     18.489       0.99
NHWC_H::Layer_NaryEltwise::OCV/CPU               24.182     17.829       1.36

fengyuentau (Member, Author) commented:

Thank you @asmorkalov for adding more performance results :)

fengyuentau (Member, Author) commented:

Any review comments?

fengyuentau changed the title from "dnn: parallelize nary elementwise forward implementation" to "dnn: parallelize nary elementwise forward implementation & enable related conformance tests" on Jun 14, 2024

asmorkalov (Contributor) commented:

The patch leads to a significant degradation of OpenCL pipelines, e.g.:

VIT_B_32::DNNTestNetwork::OCV/CPU         149.576   191.409   0.78
VIT_B_32::DNNTestNetwork::OCV/OCL         104.428   445.013   0.23
VIT_B_32::DNNTestNetwork::OCV/OCL_FP16    102.505   442.994   0.23

I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or some other inference optimization.
I am looking into the details to see whether this is really caused by the PR.

fengyuentau (Member, Author) commented:

> The patch leads to a significant degradation of OpenCL pipelines, e.g.:
>
> VIT_B_32::DNNTestNetwork::OCV/CPU         149.576   191.409   0.78
> VIT_B_32::DNNTestNetwork::OCV/OCL         104.428   445.013   0.23
> VIT_B_32::DNNTestNetwork::OCV/OCL_FP16    102.505   442.994   0.23
>
> I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or some other inference optimization. I am looking into the details to see whether this is really caused by the PR.

Ok, I will take a look at the problem.

fengyuentau (Member, Author) commented on Jun 24, 2024 (edited):

@asmorkalov The performance "degradation" is due to a very out-of-date code base (>450 commits behind 4.x). I have updated the code base. Performance tests (on Intel UHD 770) look okay on my side. Feel free to retest on your side.

On the bright side, those commits brought a large performance boost (OCL is ~4x faster and CPU is ~1.3x faster). Maybe I can add an OCL backend for this layer later :)

vpisarev self-requested a review on June 27, 2024 21:29

asmorkalov (Contributor) commented on Jun 28, 2024 (edited):

Attachment: perf-dnn.zip
The OpenCL-related degradation has disappeared. Perf numbers for the updated PR on the Core i5-2500:

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           24.142     17.999       1.34
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        23.860     23.265       1.03
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        27.383     23.282       1.18
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      39.056     23.292       1.68
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    32.489     23.290       1.39
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       32.435     23.257       1.39
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        23.966     23.269       1.03
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       23.992     23.276       1.03
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        23.951     23.273       1.03
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        23.862     23.272       1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU       320.265     97.879       3.27
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    24.491     24.487       1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    24.463     24.464       1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    24.472     24.465       1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    24.460     24.453       1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    24.463     24.530       1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        23.870     23.271       1.03
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        23.964     23.764       1.01
NHWC_C::Layer_NaryEltwise::OCV/CPU               18.083     18.458       0.98
NHWC_H::Layer_NaryEltwise::OCV/CPU               24.140     17.857       1.35

asmorkalov (Contributor) commented:

I also tried a Xiaomi Mi 10 phone. The results are volatile (maybe due to power management), and I do not see a significant performance gain except for NCHW_C_sum and NCHW_NCHW_pow.
Attachment: perf-dnn-xiaomi-mi10.zip

fengyuentau (Member, Author) commented:

> The results are volatile (maybe due to power management), and I do not see a significant performance gain

It is tuned to use multi-threading when the input scale is large enough. Traditional convolutional nets do not have such a large input scale for elementwise layers.
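A rough sketch of what such tuning could look like, assuming a fixed element-count threshold below which the layer stays on one thread. The function name `planWork`, the `PlaneRange` alias, and the threshold of 1 << 16 elements are all assumptions for illustration; the thread does not state the values used in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using PlaneRange = std::pair<std::size_t, std::size_t>;  // half-open [begin, end)

// Hypothetical planner: return a single range (serial execution) for small
// inputs, and up to maxStripes contiguous plane ranges for large inputs.
inline std::vector<PlaneRange> planWork(std::size_t nplanes, std::size_t planeSize,
                                        std::size_t maxStripes,
                                        std::size_t minElemsForThreads = std::size_t(1) << 16)
{
    assert(maxStripes > 0);
    const std::size_t total = nplanes * planeSize;
    std::size_t nstripes = 1;
    if (total >= minElemsForThreads)
        nstripes = std::max<std::size_t>(1, std::min(maxStripes, nplanes));
    std::vector<PlaneRange> ranges;
    for (std::size_t s = 0; s < nstripes; ++s)
        ranges.emplace_back(nplanes * s / nstripes, nplanes * (s + 1) / nstripes);
    return ranges;
}
```

With these assumed defaults, a 16 x 64 input yields one range covering all planes, while a 16 x 65536 input with `maxStripes = 4` yields four ranges of four planes each. That matches the observation that small elementwise layers in traditional convolutional nets see little change.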


asmorkalov merged commit a7fd944 into opencv:4.x on Jul 3, 2024
fengyuentau mentioned this pull request on Jul 12, 2024
asmorkalov added the "port/backport done" label on Jul 12, 2024
asmorkalov pushed a commit that referenced this pull request on Jul 15, 2024:
…_thread
dnn: merge #25630 to 5.x #25900
Sync changes from #25630 to 5.x.
asmorkalov mentioned this pull request on Jul 16, 2024
fengyuentau deleted the nary-multi-thread branch on July 30, 2024 15:06

Reviewers

  • opencv-alalek left review comments
  • vpisarev approved these changes
  • asmorkalov approved these changes
  • dkurt: awaiting requested review

Assignees: No one assigned

Labels: category: dnn, optimization, port/backport done

Projects: None yet

Milestone: 4.11.0

Participants (4): @fengyuentau, @asmorkalov, @vpisarev, @opencv-alalek
