Optimization based on RISC-V P Packed SIMD Extension v0.5.2 #24556
asmorkalov commented Nov 20, 2023
@mshabunin Is it possible to add the P extension to the QEMU configuration on CI? It should help a lot.
vpisarev commented Nov 20, 2023 • edited
@Junyan721113, thank you for the contribution! This is a useful effort. In the long term, however, it will be extremely difficult for our small team to maintain 1000 different branches of the same code. We sometimes do it for critical paths in critical modules, such as deep learning convolution, but using platform-specific intrinsics for general-purpose functions is too much. Please consider implementing a universal intrinsics backend instead: https://github.com/opencv/opencv/tree/4.x/modules/core/include/opencv2/core/hal. In that case many hundreds of optimized loops in OpenCV can immediately make use of these instructions. Many other backends rely on 128-bit extensions, whereas the P extension is 64-bit, as far as I know. The solution could be to use a pair of registers to emulate a 128-bit SIMD register.
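To make the register-pair idea concrete, here is a minimal, portable sketch. It assumes nothing about the real P-extension intrinsics or the OpenCV universal intrinsics API: the `U8x16` type and the `add8x8`, `load`, `store` and `add_wrap` helpers are hypothetical names invented for illustration, and the 64-bit packed add is emulated with plain integer arithmetic where a real backend would presumably use a packed-add instruction.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 128-bit vector of 16 unsigned bytes, stored as two 64-bit halves.
// This is NOT an OpenCV type; it only illustrates the register-pair idea.
struct U8x16 {
    uint64_t lo;  // lanes 0..7
    uint64_t hi;  // lanes 8..15
};

// Wrapping add of eight packed bytes held in one 64-bit word (portable emulation).
static inline uint64_t add8x8(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8080808080808080ULL;  // top bit of every byte
    // Add the low 7 bits of each byte, then restore the top bits with XOR so
    // that carries never cross a lane boundary.
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}

static inline U8x16 load(const uint8_t* p) {
    U8x16 v;
    std::memcpy(&v.lo, p, 8);
    std::memcpy(&v.hi, p + 8, 8);
    return v;
}

static inline void store(uint8_t* p, const U8x16& v) {
    std::memcpy(p, &v.lo, 8);
    std::memcpy(p + 8, &v.hi, 8);
}

// One "128-bit" operation is two independent 64-bit operations.
static inline U8x16 add_wrap(const U8x16& a, const U8x16& b) {
    return U8x16{add8x8(a.lo, b.lo), add8x8(a.hi, b.hi)};
}
```

A loop written against such a type processes 16 bytes per iteration, like the 128-bit backends, while each half still maps onto a single 64-bit register.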
mshabunin commented Nov 21, 2023
I have several questions, concerns and suggestions. Lower level or technical:
Higher level or more strategic questions and proposals:
Junyan721113 commented Nov 21, 2023
Thank you for your guidance! Most of the current optimizations for P extensions are where other platform-specific optimizations already exist (such as int8layers/layers_common.simd.hpp). I would like to know exactly what parts of the code "critical paths in critical modules" refer to, so that P extensions can be optimized in other ways if Universal Intrinsics is not possible.
However, I'm sorry to say that I'm currently having trouble implementing Universal Intrinsics with the P extension for the following reasons:
Junyan721113 commented Dec 5, 2023
This is my fault. RVP v0.5.2 should use
I'm sorry, but Andes toolchain uses
As a test outside of this PR, A 3rdparty component called
T-Head DSP implementation does not support
Supporting only v0.5.2 might be the best solution of this PR.
We have been in communication with Andes; a development board will soon be available for performance tests.
I'm sorry, but currently I don't know about any plans related to Andes adding support to mainline.
mshabunin left a comment
I suggest simplifying the CPU-feature part: instead of adding RVP052 as a separate CPU feature, let's use a custom macro defined in the cmake toolchain file, like it is done in `platforms/linux/riscv64-071-gcc.toolchain.cmake`.
Basically you have to revert all `core` modifications and add some macro definition to the `riscv64-andes-gcc.toolchain.cmake` (e.g. `-D__riscv_andes_rvp052`, or maybe there is one built into the compiler already?). Then use a plain `#ifdef` guard for optimized code sections.
The tricky part is the dispatched `fastConv`, `fastDepthwiseConv` and `fastGEMM`. I suggest adding new files `conv_depthwise.rvp052.cpp/.hpp` with your implementation and including/calling them if that macro is enabled.
Probably some additional cmake variable should be set in the toolchain file, so that `dnn/CMakeLists.txt` would know when to add the new rvp052.cpp files to the build (or they can just be guarded by the same macro and added to the build unconditionally).
cc @opencv-alalek, what do you think?
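The guard pattern suggested above might look roughly like the following sketch. It is illustrative only: `DEMO_HAVE_RVP052` and the `dot_s8*` functions are hypothetical names, the macro `__riscv_andes_rvp052` is the example name from the comment (it is assumed to be passed by the toolchain file, not known to be defined by any compiler), and the "optimized" body simply forwards to the generic code so the sketch builds on any platform.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical macro: in the scheme described above it would be passed by the
// toolchain file (e.g. -D__riscv_andes_rvp052) rather than detected at runtime.
#ifdef __riscv_andes_rvp052
#  define DEMO_HAVE_RVP052 1
#else
#  define DEMO_HAVE_RVP052 0
#endif

// Portable reference implementation, always compiled.
static int32_t dot_s8_generic(const int8_t* a, const int8_t* b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += static_cast<int32_t>(a[i]) * b[i];
    return acc;
}

#if DEMO_HAVE_RVP052
// Placeholder for the optimized body; a real version would use the vendor
// intrinsics here. It forwards to the generic code so the sketch stays buildable.
static int32_t dot_s8_rvp052(const int8_t* a, const int8_t* b, size_t n) {
    return dot_s8_generic(a, b, n);
}
#endif

// Call site: the choice is made at compile time, so no boolean flag or
// runtime dispatch table is involved.
static int32_t dot_s8(const int8_t* a, const int8_t* b, size_t n) {
#if DEMO_HAVE_RVP052
    return dot_s8_rvp052(a, b, n);
#else
    return dot_s8_generic(a, b, n);
#endif
}
```

Because the extension has no runtime check, resolving the choice in the preprocessor keeps the change local to the files that contain optimized code.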
opencv-alalek commented Dec 8, 2023
CPU features use common principles for detection / control / compilation / execution and diagnostics.
Could we reuse generic RISC-V toolchains (with appropriate CPU_BASELINE/CPU_DISPATCH CMake parameters)?
mshabunin commented Dec 8, 2023
Yes, in general I agree, but in this specific case - limited HW availability, specialized toolchain, non-ratified extension, which is not available in generic toolchains - it looks more like RVV 0.7.1. Also there is no actual runtime check for this extension, so dispatched implementations do not make sense, in this PR dispatching was implemented only because of DNN module specifics (no So, IMHO experimental less-invasive approach similar to early RVV 0.7.1 would fit better than generalized P-extension support. Later, when various implementations converge to some stable form and the extension is supported in the upstream, we will implement it as a full-fledged CPU feature.
Junyan721113 commented Dec 12, 2023 • edited
Files with
As for macros, there are 2 macros called
Meanwhile, I wonder if it is acceptable to implement all these 3 convolution functions inside one
In total, is the following code acceptable?

```cpp
// modules/core/include/opencv2/core/cv_cpu_dispatch.h
#if defined(__riscv) && defined(__riscv_dsp) && defined(__ANDES)
#include <nds_intrinsic.h>
#define CV_RVP052 1
#endif

// modules/dnn/src/int8layers/layers_common.simd.hpp
#include "layers_common.dispatch.hpp"

// modules/dnn/src/int8layers/layers_common.dispatch.cpp
namespace cv {
namespace dnn {
namespace opt_RVP052 {
#if CV_RVP052
// RVP Optimizations

// modules/dnn/src/int8layers/convolution_layer.cpp
#if CV_RVP052
if (isConv2D)
    opt_RVP052::fastDepthwiseConv(wptr, kernel_h, kernel_w, stride_h, stride_w, dilation_h, dilation_w,
                                  pad_t, pad_l, biasptr, multptr, inptr_, height, width, outptr_, out_d, outH, outW, inpZp, outZp);
else
```
mshabunin Dec 20, 2023 • edited
I suggest renaming the files to something like `layers_rvp052.cpp/.hpp` to avoid confusion with `.dispatch` files in other modules, because they usually serve a different purpose.
Disable the whole `.cpp` body if the macro is not defined or is false, and include the `.hpp` file into `layers_common.hpp` under the same macro condition.
Done
```cpp
else
#endif
#if CV_RVP052
if(useRVP052)
```
`useRVP052` is always the same as `CV_RVP052` and does not have an external interface, so I suggest removing the boolean flag completely, here and in other files.
In `fully_connected_layer.cpp` this is absolutely right. But in `convolution_layer.cpp`, `useRVP052` is not always the same as `CV_RVP052`, because line 769, `p.useRVP052 = CV_RVP052 && isConv2D;`, introduces a small difference.
So changing this boolean flag to `isConv2D` might be better.
I suggest moving these changes to the `dnn` module, maybe to `int8layers/layers_common.hpp`?
In `layers_rvp052.cpp`, including `layers_common.hpp` to get `CV_RVP052` could cause a `HAVE_OPENCL` malfunction as follows:

```
In file included from /home/junyan/opencv_rvp/modules/dnn/src/int8layers/./layers_common.hpp:17,
                 from /home/junyan/opencv_rvp/modules/dnn/src/int8layers/layers_rvp052.cpp:5:
/home/junyan/opencv_rvp/modules/dnn/src/int8layers/./../ocl4dnn/include/ocl4dnn.hpp:196:9: error: 'ocl' does not name a type; did you mean 'ogl'?
  196 |         ocl::Program compileKernel();
      |         ^~~
      |         ogl
```

So maybe moving them into `layers_rvp052.hpp` is better.
Modifications in this file will not be necessary.
Done
Junyan721113 commented Feb 28, 2024
Development boards for accuracy and performance tests have been set up; results will come out soon.
Junyan721113 commented Mar 2, 2024 • edited
3rdparty: NDSRVP - A New 3rdparty Library with Optimizations Based on RISC-V P Extension v0.5.2 - Part 1: Basic Functions #25167

# Summary

### Previous context

From PR #24556:

> > * As you wrote, the P-extension differs from RVV thus can not be easily implemented via Universal Intrinsics mechanism, but there is another HAL mechanism for lower-level CPU optimizations which is used by the [Carotene](https://github.com/opencv/opencv/tree/4.x/3rdparty/carotene) library on ARM platforms. I suggest moving all non-dnn code to similar third-party component. For example, FAST algorithm should allow such optimization-shortcut: see https://github.com/opencv/opencv/blob/4.x/modules/features2d/src/hal_replacement.hpp
> >
> > Reference documentation is here:
> >
> > * https://docs.opencv.org/4.x/d1/d1b/group__core__hal__interface.html
> > * https://docs.opencv.org/4.x/dd/d8b/group__imgproc__hal__interface.html
> > * https://docs.opencv.org/4.x/db/d47/group__features2d__hal__interface.html
> > * Carotene library is turned on here: https://github.com/opencv/opencv/blob/8bbf08f0de9c387c12afefdb05af7780d989e4c3/CMakeLists.txt#L906-L911
>
> As a test outside of this PR, A 3rdparty component called ndsrvp is created, containing one of the non-dnn code (integral_SIMD), and it works very well.
> All the non-dnn code in this PR have been removed, currently this PR can be focused on dnn optimizations.
> This HAL mechanism is quite suitable for rvp optimizations, all the non-dnn code is expected to be moved into ndsrvp soon.

### Progress

#### Part 1 (This PR)

- [Core](https://docs.opencv.org/4.x/d1/d1b/group__core__hal__interface.html)
  - [x] Element-wise add and subtract
  - [x] Element-wise minimum or maximum
  - [x] Element-wise absolute difference
  - [x] Bitwise logical operations
  - [x] Element-wise compare
- [ImgProc](https://docs.opencv.org/4.x/dd/d8b/group__imgproc__hal__interface.html)
  - [x] Integral
  - [x] Threshold
  - [x] WarpAffine
  - [x] WarpPerspective
- [Features2D](https://docs.opencv.org/4.x/db/d47/group__features2d__hal__interface.html)

#### Part 2 (Next PR)

**Rough Estimate. Todo List May Change.**

- [Core](https://docs.opencv.org/4.x/d1/d1b/group__core__hal__interface.html)
- [ImgProc](https://docs.opencv.org/4.x/dd/d8b/group__imgproc__hal__interface.html)
  - smaller remap HAL interface
  - AdaptiveThreshold
  - BoxFilter
  - Canny
  - Convert
  - Filter
  - GaussianBlur
  - MedianBlur
  - Morph
  - Pyrdown
  - Resize
  - Scharr
  - SepFilter
  - Sobel
- [Features2D](https://docs.opencv.org/4.x/db/d47/group__features2d__hal__interface.html)
  - FAST

### Performance Tests

The optimization does not contain floating point operations.

**Absolute Difference**

Geometric mean (ms)

|Name of Test|opencv perf core Absdiff|opencv perf core Absdiff|opencv perf core Absdiff vs opencv perf core Absdiff (x-factor)|
|---|:-:|:-:|:-:|
|Absdiff::OCL_AbsDiffFixture::(640x480, 8UC1)|23.104|5.972|3.87|
|Absdiff::OCL_AbsDiffFixture::(640x480, 32FC1)|39.500|40.830|0.97|
|Absdiff::OCL_AbsDiffFixture::(640x480, 8UC3)|69.155|15.051|4.59|
|Absdiff::OCL_AbsDiffFixture::(640x480, 32FC3)|118.715|120.509|0.99|
|Absdiff::OCL_AbsDiffFixture::(640x480, 8UC4)|93.001|19.770|4.70|
|Absdiff::OCL_AbsDiffFixture::(640x480, 32FC4)|161.136|160.791|1.00|
|Absdiff::OCL_AbsDiffFixture::(1280x720, 8UC1)|69.211|15.140|4.57|
|Absdiff::OCL_AbsDiffFixture::(1280x720, 32FC1)|118.762|119.263|1.00|
|Absdiff::OCL_AbsDiffFixture::(1280x720, 8UC3)|212.414|44.692|4.75|
|Absdiff::OCL_AbsDiffFixture::(1280x720, 32FC3)|367.512|366.569|1.00|
|Absdiff::OCL_AbsDiffFixture::(1280x720, 8UC4)|285.337|59.708|4.78|
|Absdiff::OCL_AbsDiffFixture::(1280x720, 32FC4)|490.395|491.118|1.00|
|Absdiff::OCL_AbsDiffFixture::(1920x1080, 8UC1)|158.827|33.462|4.75|
|Absdiff::OCL_AbsDiffFixture::(1920x1080, 32FC1)|273.503|273.668|1.00|
|Absdiff::OCL_AbsDiffFixture::(1920x1080, 8UC3)|484.175|100.520|4.82|
|Absdiff::OCL_AbsDiffFixture::(1920x1080, 32FC3)|828.758|829.689|1.00|
|Absdiff::OCL_AbsDiffFixture::(1920x1080, 8UC4)|648.592|137.195|4.73|
|Absdiff::OCL_AbsDiffFixture::(1920x1080, 32FC4)|1116.755|1109.587|1.01|
|Absdiff::OCL_AbsDiffFixture::(3840x2160, 8UC1)|648.715|134.875|4.81|
|Absdiff::OCL_AbsDiffFixture::(3840x2160, 32FC1)|1115.939|1113.818|1.00|
|Absdiff::OCL_AbsDiffFixture::(3840x2160, 8UC3)|1944.791|413.420|4.70|
|Absdiff::OCL_AbsDiffFixture::(3840x2160, 32FC3)|3354.193|3324.672|1.01|
|Absdiff::OCL_AbsDiffFixture::(3840x2160, 8UC4)|2594.585|553.486|4.69|
|Absdiff::OCL_AbsDiffFixture::(3840x2160, 32FC4)|4473.543|4438.453|1.01|

**Bitwise Operation**

Geometric mean (ms)

|Name of Test|opencv perf core Bit|opencv perf core Bit|opencv perf core Bit vs opencv perf core Bit (x-factor)|
|---|:-:|:-:|:-:|
|Bitwise_and::OCL_BitwiseAndFixture::(640x480, 8UC1)|22.542|4.971|4.53|
|Bitwise_and::OCL_BitwiseAndFixture::(640x480, 32FC1)|90.210|19.917|4.53|
|Bitwise_and::OCL_BitwiseAndFixture::(640x480, 8UC3)|68.429|15.037|4.55|
|Bitwise_and::OCL_BitwiseAndFixture::(640x480, 32FC3)|280.168|59.239|4.73|
|Bitwise_and::OCL_BitwiseAndFixture::(640x480, 8UC4)|90.565|19.735|4.59|
|Bitwise_and::OCL_BitwiseAndFixture::(640x480, 32FC4)|374.695|79.257|4.73|
|Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 8UC1)|67.824|14.873|4.56|
|Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 32FC1)|279.514|59.232|4.72|
|Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 8UC3)|208.337|44.234|4.71|
|Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 32FC3)|851.211|182.522|4.66|
|Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 8UC4)|279.529|59.095|4.73|
|Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 32FC4)|1132.065|244.877|4.62|
|Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 8UC1)|155.685|33.078|4.71|
|Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 32FC1)|635.253|137.482|4.62|
|Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 8UC3)|474.494|100.166|4.74|
|Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 32FC3)|1907.340|412.841|4.62|
|Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 8UC4)|635.538|134.544|4.72|
|Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 32FC4)|2552.666|556.397|4.59|
|Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 8UC1)|634.736|136.355|4.66|
|Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 32FC1)|2548.283|561.827|4.54|
|Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 8UC3)|1911.454|421.571|4.53|
|Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 32FC3)|7663.803|1677.289|4.57|
|Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 8UC4)|2543.983|562.780|4.52|
|Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 32FC4)|10211.693|2237.393|4.56|
|Bitwise_not::OCL_BitwiseNotFixture::(640x480, 8UC1)|22.341|4.811|4.64|
|Bitwise_not::OCL_BitwiseNotFixture::(640x480, 32FC1)|89.975|19.288|4.66|
|Bitwise_not::OCL_BitwiseNotFixture::(640x480, 8UC3)|67.237|14.643|4.59|
|Bitwise_not::OCL_BitwiseNotFixture::(640x480, 32FC3)|276.324|58.609|4.71|
|Bitwise_not::OCL_BitwiseNotFixture::(640x480, 8UC4)|89.587|19.554|4.58|
|Bitwise_not::OCL_BitwiseNotFixture::(640x480, 32FC4)|370.986|77.136|4.81|
|Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 8UC1)|67.227|14.541|4.62|
|Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 32FC1)|276.357|58.076|4.76|
|Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 8UC3)|206.752|43.376|4.77|
|Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 32FC3)|841.638|177.787|4.73|
|Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 8UC4)|276.773|57.784|4.79|
|Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 32FC4)|1127.740|237.472|4.75|
|Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 8UC1)|153.808|32.531|4.73|
|Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 32FC1)|627.765|129.990|4.83|
|Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 8UC3)|469.799|98.249|4.78|
|Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 32FC3)|1893.591|403.694|4.69|
|Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 8UC4)|627.724|129.962|4.83|
|Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 32FC4)|2529.967|540.744|4.68|
|Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 8UC1)|628.089|130.277|4.82|
|Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 32FC1)|2521.817|540.146|4.67|
|Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 8UC3)|1905.004|404.704|4.71|
|Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 32FC3)|7567.971|1627.898|4.65|
|Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 8UC4)|2531.476|540.181|4.69|
|Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 32FC4)|10075.594|2181.654|4.62|
|Bitwise_or::OCL_BitwiseOrFixture::(640x480, 8UC1)|22.566|5.076|4.45|
|Bitwise_or::OCL_BitwiseOrFixture::(640x480, 32FC1)|90.391|19.928|4.54|
|Bitwise_or::OCL_BitwiseOrFixture::(640x480, 8UC3)|67.758|14.740|4.60|
|Bitwise_or::OCL_BitwiseOrFixture::(640x480, 32FC3)|279.253|59.844|4.67|
|Bitwise_or::OCL_BitwiseOrFixture::(640x480, 8UC4)|90.296|19.802|4.56|
|Bitwise_or::OCL_BitwiseOrFixture::(640x480, 32FC4)|373.972|79.815|4.69|
|Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 8UC1)|67.815|14.865|4.56|
|Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 32FC1)|279.398|60.054|4.65|
|Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 8UC3)|208.643|45.043|4.63|
|Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 32FC3)|850.042|180.985|4.70|
|Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 8UC4)|279.363|60.385|4.63|
|Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 32FC4)|1134.858|243.062|4.67|
|Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 8UC1)|155.212|33.155|4.68|
|Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 32FC1)|634.985|134.911|4.71|
|Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 8UC3)|474.648|100.407|4.73|
|Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 32FC3)|1912.049|414.184|4.62|
|Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 8UC4)|635.252|132.587|4.79|
|Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 32FC4)|2544.471|560.737|4.54|
|Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 8UC1)|634.574|134.966|4.70|
|Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 32FC1)|2545.129|561.498|4.53|
|Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 8UC3)|1910.900|419.365|4.56|
|Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 32FC3)|7662.603|1685.812|4.55|
|Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 8UC4)|2548.971|560.787|4.55|
|Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 32FC4)|10201.407|2237.552|4.56|
|Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 8UC1)|22.718|4.961|4.58|
|Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 32FC1)|91.496|19.831|4.61|
|Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 8UC3)|67.910|15.151|4.48|
|Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 32FC3)|279.612|59.792|4.68|
|Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 8UC4)|91.073|19.853|4.59|
|Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 32FC4)|374.641|79.155|4.73|
|Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 8UC1)|67.704|15.008|4.51|
|Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 32FC1)|279.229|60.088|4.65|
|Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 8UC3)|208.156|44.426|4.69|
|Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 32FC3)|849.501|180.848|4.70|
|Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 8UC4)|279.642|59.728|4.68|
|Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 32FC4)|1129.826|242.880|4.65|
|Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 8UC1)|155.585|33.354|4.66|
|Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 32FC1)|634.090|134.995|4.70|
|Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 8UC3)|474.931|99.598|4.77|
|Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 32FC3)|1910.519|413.138|4.62|
|Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 8UC4)|635.026|135.155|4.70|
|Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 32FC4)|2560.167|560.838|4.56|
|Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 8UC1)|634.893|134.883|4.71|
|Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 32FC1)|2548.166|560.831|4.54|
|Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 8UC3)|1911.392|419.816|4.55|
|Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 32FC3)|7646.634|1677.988|4.56|
|Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 8UC4)|2560.637|560.805|4.57|
|Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 32FC4)|10227.044|2249.458|4.55|

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
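For readers checking the performance tables above: the x-factor values are consistent with dividing the first timing column by the second. The sketch below reproduces two Absdiff rows under that reading. It is an assumption that the first column is the baseline run and the second the NDSRVP run, since the table headers do not say which is which, and `x_factor` and `round2` are hypothetical helper names.

```cpp
#include <cmath>

// x-factor as it appears in the tables: first timing column divided by the second.
// Assumption: the first column is the baseline run, the second the NDSRVP run.
static double x_factor(double first_ms, double second_ms) {
    return first_ms / second_ms;
}

// Round to two decimals for comparison with the printed table values.
static double round2(double v) {
    return std::round(v * 100.0) / 100.0;
}
```

With the 640x480 Absdiff rows, `round2(x_factor(23.104, 5.972))` gives 3.87 for 8UC1 and `round2(x_factor(39.500, 40.830))` gives 0.97 for 32FC1, matching the table. The 32F rows staying near 1.00 fits the note that the optimization contains no floating-point operations.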

Summary
Provides OpenCV optimizations for the RISC-V P extension (v0.5.2).
The writer of the code and the author of the PR is an intern at ISCAS (Institute of Software, Chinese Academy of Sciences).
List of RVP optimizations
Correctness validation (QEMU)
opencv_test_dnn_rvp Consistent with control (before adding RVP optimization)
opencv_test_imgproc_rvp Consistent with controls
opencv_test_features2d_rvp Consistent with controls
Q&A
Why RVP ?
As a lightweight extension, there is some potential for P extensions to be used in the embedded domain.
Why v0.5.2 ?
Although RVP is not frozen, Andes has massive plans based on version 0.5.2, just like T-Head and RVV071.
Why not Universal Intrinsics ?
RVP052 has no floating-point arithmetic and only supports parallel arithmetic up to 64 bits, which makes it less capable of implementing Universal Intrinsics, and thus most of its optimizations refer to existing function-specific optimizations.
How to perform tests ?
The correctness tests are as follows. (Due to hardware issues, performance test results are not available at this time)
Environment
Toolchain
nds-gnu-toolchain
build_linux_toolchain.sh
TARGET=riscv64-linux
PREFIX=/opt/andes
ARCH=rv64imafdcxandes
ABI=lp64d
CPU=andes-25-series
XLEN=64
BUILD=`pwd`/build-nds64le-linux-glibc-v5d
Qemu
qemu
Build
Related Tests
dnn module test
imgproc module test
features2d module test
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.