
Add Neon optimised RGB2Lab conversion #19883


Merged
alalek merged 7 commits into opencv:3.4 from jondea:arm-neon-optimised-color-lab-3.4 on May 28, 2021

Conversation

@jondea (Contributor) commented Apr 9, 2021 (edited):

A Neon-specific implementation of RGB2Lab increases single-threaded performance by ~25%. Here are the numbers, run on an AWS c6gd.4xlarge with GCC 9.3 (numbers are similar with GCC 10):

| Test set | Test number | After/before ratio | Speedup with 1/million bounds [%] |
|------------|-------------|--------------------|-----------------------------------|
| cvtColor8u | 8   | 0.76835 | 23.2 ± 0.2 |
| cvtColor8u | 34  | 0.76204 | 23.8 ± 0.2 |
| cvtColor8u | 67  | 0.76667 | 23.3 ± 0.2 |
| cvtColor8u | 69  | 0.76773 | 23.2 ± 0.2 |
| cvtColor8u | 71  | 0.76231 | 23.8 ± 0.2 |
| cvtColor8u | 73  | 0.76184 | 23.8 ± 0.2 |
| cvtColor8u | 90  | 0.76851 | 23.1 ± 0.2 |
| cvtColor8u | 103 | 0.76143 | 23.9 ± 0.2 |
| cvtColor8u | 128 | 0.73870 | 26.1 ± 0.1 |
| cvtColor8u | 154 | 0.73760 | 26.2 ± 0.2 |
| cvtColor8u | 187 | 0.73891 | 26.1 ± 0.1 |
| cvtColor8u | 189 | 0.73889 | 26.1 ± 0.1 |
| cvtColor8u | 191 | 0.73802 | 26.2 ± 0.2 |
| cvtColor8u | 193 | 0.73817 | 26.2 ± 0.2 |
| cvtColor8u | 210 | 0.73879 | 26.1 ± 0.1 |
| cvtColor8u | 223 | 0.73745 | 26.3 ± 0.2 |
| cvtColor8u | 248 | 0.73756 | 26.2 ± 0.1 |
| cvtColor8u | 274 | 0.73613 | 26.4 ± 0.1 |
| cvtColor8u | 307 | 0.73768 | 26.2 ± 0.1 |
| cvtColor8u | 309 | 0.73767 | 26.2 ± 0.1 |
| cvtColor8u | 311 | 0.73676 | 26.3 ± 0.2 |
| cvtColor8u | 313 | 0.73672 | 26.3 ± 0.2 |
| cvtColor8u | 330 | 0.73748 | 26.3 ± 0.1 |
| cvtColor8u | 343 | 0.73591 | 26.4 ± 0.1 |

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under the Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV.
  • The PR is proposed to the proper branch.
  • There is a reference to the original bug report and related work.
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable.
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake.
force_builders=linux,docs,ARMv8,ARMv7

@alalek (Member) commented:

Marking this RFC, because it doesn't follow the OpenCV guideline of avoiding raw native intrinsics in OpenCV modules.

@jondea (Contributor, Author) commented:

Hi @alalek, thank you for looking into this. We (me and @fpetrogalli) submitted it like this because we weren't sure what the correct approach was. Also, sorry if this is a silly question, but what does it mean to mark it as RFC?

One possible solution to the raw intrinsics is to keep the #if CV_NEON block in color_lab.cpp but rewrite it using the HAL intrinsics. Another solution would be to split it into a Neon-specific file, as is done for resize.cpp, resize.avx2.cpp and resize.sse4_1.cpp. Is either of these acceptable or preferable? Or is there another way which would achieve the same goal?

@vpisarev (Contributor) commented:

@jondea, thank you for the contribution!

As @alalek said, for the tiny OpenCV core team it's simply unfeasible to maintain separate code branches for the growing amount of code and the growing number of platforms that we support. With time we hope to port most of the remaining native branches to HAL/universal intrinsics. There will be some exceptions, like deep learning, where the number of critical kernels is not that big and where we can afford separate branches, but overall universal intrinsics are by far the preferable option.

I'd start with the first option that you suggested: keep the separate branch under CV_NEON, but rewrite it using HAL intrinsics. I briefly looked at the current implementation and found it too bulky for the equivalent C code that it accelerates. So I'm 60-80% sure that the HAL code you write will be faster than the existing implementation not just on ARM, but on the other platforms as well. In that case we will simply replace that code with yours, i.e. remove the #if CV_NEON ... #endif around your code and remove the other branch.
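
For readers unfamiliar with the distinction being discussed, the sketch below (illustrative only, not code from this PR; the kernel and all names are made up) contrasts a raw NEON loop with the same computation written with OpenCV's universal intrinsics, which map to NEON on ARM and to SSE, VSX, etc. on other platforms:

```cpp
// Illustrative sketch only: a made-up kernel (y = a*x + b) written twice,
// once with raw NEON intrinsics and once with OpenCV universal intrinsics.
#include <opencv2/core.hpp>
#include <opencv2/core/hal/intrin.hpp>

#if CV_NEON                        // raw intrinsics: ARM-only code path
#include <arm_neon.h>
static void scaleAddNeon(const float* src, float* dst, int n, float a, float b)
{
    int i = 0;
    float32x4_t va = vdupq_n_f32(a), vb = vdupq_n_f32(b);
    for (; i <= n - 4; i += 4)
        vst1q_f32(dst + i, vmlaq_f32(vb, vld1q_f32(src + i), va)); // b + x*a
    for (; i < n; i++)                                             // scalar tail
        dst[i] = src[i] * a + b;
}
#endif

// Universal intrinsics: one implementation, vectorized on every SIMD target.
static void scaleAddUniversal(const float* src, float* dst, int n, float a, float b)
{
    int i = 0;
#if CV_SIMD128
    cv::v_float32x4 va = cv::v_setall_f32(a), vb = cv::v_setall_f32(b);
    for (; i <= n - cv::v_float32x4::nlanes; i += cv::v_float32x4::nlanes)
        cv::v_store(dst + i, cv::v_fma(cv::v_load(src + i), va, vb)); // x*a + b
#endif
    for (; i < n; i++)                                                // scalar tail
        dst[i] = src[i] * a + b;
}
```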

@jondea (Contributor, Author) commented:

The changes have been rewritten to use just HAL intrinsics; any feedback would be appreciated.


@fpetrogalli (Contributor) left a comment:


Hi @jondea,

just a couple of minor observations.

Thank you for your work.

Francesco

@fpetrogalli (Contributor) left a comment:


LGTM, with a nit.

The final word is with the maintainers, of course!

Thank you, Francesco

@jondea (Contributor, Author) commented:

@vpisarev, @alalek: is there anything else which needs to be done before this can be merged?

@jondea (Contributor, Author) commented:

Hi @vpisarev, @alalek, would you please take another look at this and let me know if it can be merged?

@vpisarev self-assigned this on May 24, 2021
@vpisarev (Contributor) commented:

@jondea, thank you very much! I tested the code on both Mac-Intel and Mac-ARM (M1); it works well, and the claimed acceleration is achieved. On Intel it's no slower than the previous version, but, unfortunately, it's 128-bit only.

In any case, it can be merged as-is, and later we can modify this code to use some new variations of the v_lut() intrinsic.

👍
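
As an illustrative aside (not code from this PR; the table, function and names are invented), the v_lut() pattern referred to above gathers lookup-table entries using a vector of indices, assuming the v_lut() overload that takes an index vector:

```cpp
// Illustrative sketch only: gather-style table lookup with v_lut(), the
// pattern that table-driven color conversions such as RGB2Lab rely on.
#include <opencv2/core.hpp>
#include <opencv2/core/hal/intrin.hpp>

static void lutGatherDemo(const float* table, const int* indices, float* dst, int n)
{
    int i = 0;
#if CV_SIMD128
    for (; i <= n - cv::v_int32x4::nlanes; i += cv::v_int32x4::nlanes)
    {
        cv::v_int32x4 idx = cv::v_load(indices + i);    // 4 table indices
        cv::v_float32x4 vals = cv::v_lut(table, idx);   // gather table[idx[k]]
        cv::v_store(dst + i, vals);
    }
#endif
    for (; i < n; i++)                                  // scalar tail
        dst[i] = table[indices[i]];
}
```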

@vpisarev self-requested a review on May 25, 2021 05:45
@fpetrogalli (Contributor) commented:

> @jondea, thank you very much! I tested the code on both Mac-Intel and Mac-ARM (M1); it works well, and the claimed acceleration is achieved. On Intel it's no slower than the previous version, but, unfortunately, it's 128-bit only.

@vpisarev, @jondea is working on an equivalent version that uses SVE2 intrinsics. He is using the intrinsic svld1uh_gather_s32index_s32 for the variation of v_lut() that does a gather from the indexes. It is a Vector Length Agnostic (VLA) version, so it could be ported easily to HAL once we make the nlanes field a runtime value and we have the corresponding indexed lut intrinsic.
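
For illustration (not the PR author's SVE2 code; the function, table and names are made up), a vector-length-agnostic 16-bit table gather built around the intrinsic named above might look like this, assuming a compiler targeting SVE (e.g. -march=armv8.2-a+sve) and arm_sve.h:

```cpp
// Illustrative sketch only: VLA gather of 16-bit table entries into 32-bit lanes.
#include <arm_sve.h>
#include <stdint.h>

static void gatherU16Sve(const uint16_t* table, const int32_t* indices,
                         int32_t* dst, int n)
{
    // No fixed vector length is assumed: svcntw() reports how many 32-bit
    // lanes the hardware provides, and svwhilelt builds a predicate that
    // also covers the final partial iteration.
    for (int i = 0; i < n; i += (int)svcntw())
    {
        svbool_t pg = svwhilelt_b32_s32(i, n);
        svint32_t idx = svld1_s32(pg, indices + i);
        // Gather table[idx[k]]: loads uint16 elements, zero-extends to int32.
        svint32_t vals = svld1uh_gather_s32index_s32(pg, table, idx);
        svst1_s32(pg, dst + i, vals);
    }
}
```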


@vpisarev requested a review from asmorkalov on May 28, 2021 04:28
@asmorkalov requested review from asmorkalov and removed the request for asmorkalov on May 28, 2021 07:24
@alalek merged commit 8ecfbdb into opencv:3.4 on May 28, 2021
This was referenced on May 29, 2021

Reviewers

@alalek left review comments
@vpisarev approved these changes
@asmorkalov (awaiting requested review)
@fpetrogalli approved these changes

Assignees

@vpisarev

Labels

optimization, platform: arm (ARM boards related issues: RPi, NVIDIA TK/TX, etc), RFC

Projects

None yet

Milestone

3.4.15


5 participants

@jondea, @alalek, @vpisarev, @fpetrogalli, @asmorkalov
