
Add Neon optimised RGB2Lab conversion #19883


Merged
alalek merged 7 commits into opencv:3.4 from jondea:arm-neon-optimised-color-lab-3.4 on May 28, 2021

Conversation

@jondea (Contributor) commented Apr 9, 2021 (edited):

A Neon-specific implementation of RGB2Lab increases single-threaded performance by ~25%. Here are the numbers, run on an AWS c6gd.4xlarge with GCC 9.3 (numbers are similar with GCC 10):

| Test set | Test number | After/before ratio | Speedup with 1/million bounds [%] |
|------------|-------------|--------------------|-----------------------------------|
| cvtColor8u | 8   | 0.76835 | 23.2 ± 0.2 |
| cvtColor8u | 34  | 0.76204 | 23.8 ± 0.2 |
| cvtColor8u | 67  | 0.76667 | 23.3 ± 0.2 |
| cvtColor8u | 69  | 0.76773 | 23.2 ± 0.2 |
| cvtColor8u | 71  | 0.76231 | 23.8 ± 0.2 |
| cvtColor8u | 73  | 0.76184 | 23.8 ± 0.2 |
| cvtColor8u | 90  | 0.76851 | 23.1 ± 0.2 |
| cvtColor8u | 103 | 0.76143 | 23.9 ± 0.2 |
| cvtColor8u | 128 | 0.73870 | 26.1 ± 0.1 |
| cvtColor8u | 154 | 0.73760 | 26.2 ± 0.2 |
| cvtColor8u | 187 | 0.73891 | 26.1 ± 0.1 |
| cvtColor8u | 189 | 0.73889 | 26.1 ± 0.1 |
| cvtColor8u | 191 | 0.73802 | 26.2 ± 0.2 |
| cvtColor8u | 193 | 0.73817 | 26.2 ± 0.2 |
| cvtColor8u | 210 | 0.73879 | 26.1 ± 0.1 |
| cvtColor8u | 223 | 0.73745 | 26.3 ± 0.2 |
| cvtColor8u | 248 | 0.73756 | 26.2 ± 0.1 |
| cvtColor8u | 274 | 0.73613 | 26.4 ± 0.1 |
| cvtColor8u | 307 | 0.73768 | 26.2 ± 0.1 |
| cvtColor8u | 309 | 0.73767 | 26.2 ± 0.1 |
| cvtColor8u | 311 | 0.73676 | 26.3 ± 0.2 |
| cvtColor8u | 313 | 0.73672 | 26.3 ± 0.2 |
| cvtColor8u | 330 | 0.73748 | 26.3 ± 0.1 |
| cvtColor8u | 343 | 0.73591 | 26.4 ± 0.1 |

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under the Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV.
  • The PR is proposed to the proper branch.
  • There is a reference to the original bug report and related work.
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable.
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake.
force_builders=linux,docs,ARMv8,ARMv7

@alalek (Member) commented:

Marking this RFC, because it doesn't follow the OpenCV guideline of avoiding raw native intrinsics in OpenCV modules.

@jondea (Contributor, Author) commented:

Hi @alalek, thank you for looking into this. We (me and @fpetrogalli) submitted it like this because we weren't sure what the correct approach was. Also, sorry if this is a silly question, but what does it mean to mark it as RFC?

One possible solution to the raw intrinsics is to keep the #if CV_NEON block in color_lab.cpp but rewrite it using the HAL intrinsics. Another solution would be to split it into a Neon-specific file, as is done for resize.cpp, resize.avx2.cpp and resize.sse4_1.cpp. Is either of these acceptable or preferable? Or is there another way which would achieve the same goal?

@vpisarev (Contributor) commented:

@jondea, thank you for the contribution!

As @alalek said, for the tiny OpenCV core team it's simply unfeasible to maintain separate code branches for the growing amount of code and the growing number of platforms that we support. With time we hope to port most of the remaining native branches to HAL/universal intrinsics. There will be some exceptions, like deep learning, where the number of critical kernels is not that big and where we can afford separate branches, but overall universal intrinsics are by far the preferable option.

I'd start with the first option that you suggested: keep the separate branch under CV_NEON, but rewrite it using HAL intrinsics. I briefly looked at the current implementation and found it too bulky for the equivalent C code that it accelerates. So I'm 60-80% sure that the HAL code you write will be faster than the existing implementation not just on ARM, but on the other platforms as well. In that case we will simply replace that code with yours, i.e. remove the #if CV_NEON ... #endif around your code and remove the other branch.
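
For readers unfamiliar with the distinction being discussed, the sketch below (illustrative only, not code from this PR; the kernel and all names are made up) contrasts a raw NEON loop with the same computation written with OpenCV's universal intrinsics, which map to NEON on ARM and to SSE, VSX, etc. on other platforms:

```cpp
// Illustrative sketch only: a made-up kernel (y = a*x + b) written twice,
// once with raw NEON intrinsics and once with OpenCV universal intrinsics.
#include <opencv2/core.hpp>
#include <opencv2/core/hal/intrin.hpp>

#if CV_NEON                        // raw intrinsics: ARM-only code path
#include <arm_neon.h>
static void scaleAddNeon(const float* src, float* dst, int n, float a, float b)
{
    int i = 0;
    float32x4_t va = vdupq_n_f32(a), vb = vdupq_n_f32(b);
    for (; i <= n - 4; i += 4)
        vst1q_f32(dst + i, vmlaq_f32(vb, vld1q_f32(src + i), va)); // b + x*a
    for (; i < n; i++)                                             // scalar tail
        dst[i] = src[i] * a + b;
}
#endif

// Universal intrinsics: one implementation, vectorized on every SIMD target.
static void scaleAddUniversal(const float* src, float* dst, int n, float a, float b)
{
    int i = 0;
#if CV_SIMD128
    cv::v_float32x4 va = cv::v_setall_f32(a), vb = cv::v_setall_f32(b);
    for (; i <= n - cv::v_float32x4::nlanes; i += cv::v_float32x4::nlanes)
        cv::v_store(dst + i, cv::v_fma(cv::v_load(src + i), va, vb)); // x*a + b
#endif
    for (; i < n; i++)                                                // scalar tail
        dst[i] = src[i] * a + b;
}
```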

@jondea (Contributor, Author) commented:

The changes have been rewritten to use just HAL intrinsics; any feedback would be appreciated.


@fpetrogalli (Contributor) left a comment:


Hi @jondea,

just a couple of minor observations.

Thank you for your work.

Francesco

@fpetrogalli (Contributor) left a comment:


LGTM, with a nit.

The final word is with the maintainers, of course!

Thank you, Francesco

@jondea (Contributor, Author) commented:

@vpisarev, @alalek: is there anything else which needs to be done before this can be merged?

@jondea (Contributor, Author) commented:

Hi @vpisarev, @alalek, would you please take another look at this and let me know if it can be merged?

@vpisarev self-assigned this on May 24, 2021
@vpisarev (Contributor) commented:

@jondea, thank you very much! I tested the code on both Mac-Intel and Mac-ARM (M1); it works well, and the claimed acceleration is achieved. On Intel it's no slower than the previous version, but, unfortunately, it's 128-bit only.

In any case, it can be merged as-is, and later we can modify this code to use some new variations of the v_lut() intrinsic.

👍
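
As an illustrative aside (not code from this PR; the table, function and names are invented), the v_lut() pattern referred to above gathers lookup-table entries using a vector of indices, assuming the v_lut() overload that takes an index vector:

```cpp
// Illustrative sketch only: gather-style table lookup with v_lut(), the
// pattern that table-driven color conversions such as RGB2Lab rely on.
#include <opencv2/core.hpp>
#include <opencv2/core/hal/intrin.hpp>

static void lutGatherDemo(const float* table, const int* indices, float* dst, int n)
{
    int i = 0;
#if CV_SIMD128
    for (; i <= n - cv::v_int32x4::nlanes; i += cv::v_int32x4::nlanes)
    {
        cv::v_int32x4 idx = cv::v_load(indices + i);    // 4 table indices
        cv::v_float32x4 vals = cv::v_lut(table, idx);   // gather table[idx[k]]
        cv::v_store(dst + i, vals);
    }
#endif
    for (; i < n; i++)                                  // scalar tail
        dst[i] = table[indices[i]];
}
```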

@vpisarev self-requested a review on May 25, 2021 05:45
@fpetrogalli (Contributor) commented:

> @jondea, thank you very much! I tested the code on both Mac-Intel and Mac-ARM (M1); it works well, and the claimed acceleration is achieved. On Intel it's no slower than the previous version, but, unfortunately, it's 128-bit only.

@vpisarev, @jondea is working on an equivalent version that uses SVE2 intrinsics. He is using the intrinsic svld1uh_gather_s32index_s32 for the variation of v_lut() that does a gather from the indexes. It is a Vector Length Agnostic (VLA) version, so it could be ported easily to HAL once we make the nlanes field a runtime value and we have the corresponding indexed lut intrinsic.
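
For illustration (not the PR author's SVE2 code; the function, table and names are made up), a vector-length-agnostic 16-bit table gather built around the intrinsic named above might look like this, assuming a compiler targeting SVE (e.g. -march=armv8.2-a+sve) and arm_sve.h:

```cpp
// Illustrative sketch only: VLA gather of 16-bit table entries into 32-bit lanes.
#include <arm_sve.h>
#include <stdint.h>

static void gatherU16Sve(const uint16_t* table, const int32_t* indices,
                         int32_t* dst, int n)
{
    // No fixed vector length is assumed: svcntw() reports how many 32-bit
    // lanes the hardware provides, and svwhilelt builds a predicate that
    // also covers the final partial iteration.
    for (int i = 0; i < n; i += (int)svcntw())
    {
        svbool_t pg = svwhilelt_b32_s32(i, n);
        svint32_t idx = svld1_s32(pg, indices + i);
        // Gather table[idx[k]]: loads uint16 elements, zero-extends to int32.
        svint32_t vals = svld1uh_gather_s32index_s32(pg, table, idx);
        svst1_s32(pg, dst + i, vals);
    }
}
```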


@vpisarev requested a review from asmorkalov on May 28, 2021 04:28
@asmorkalov requested review from asmorkalov and removed the request for asmorkalov on May 28, 2021 07:24
@alalek merged commit 8ecfbdb into opencv:3.4 on May 28, 2021
This was referenced on May 29, 2021

Reviewers

@alalek left review comments
@vpisarev approved these changes
@asmorkalov (awaiting requested review)
@fpetrogalli approved these changes

Assignees

@vpisarev

Labels

optimization, platform: arm (ARM boards related issues: RPi, NVIDIA TK/TX, etc), RFC

Projects

None yet

Milestone

3.4.15


5 participants

@jondea, @alalek, @vpisarev, @fpetrogalli, @asmorkalov
