Use intrinsics for cvRound on x86_64 __GNUC__ (clang/gcc linux) too. #24001


Merged

Conversation

@legrosbuffle
Contributor

There is no reason to enable this only on Windows. We've measured a 7x improvement in speed for cvRound using the intrinsic.
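
For context, the kind of intrinsic-based rounding this enables on GCC/clang looks roughly like the sketch below: scalar SSE2 convert instructions instead of a library call. This is a minimal illustration of the technique, not the exact code in core/fast_math.hpp.

```cpp
// Minimal sketch of intrinsic-based rounding (assumes SSE2 is available);
// illustrative only, not the exact OpenCV implementation.
#include <emmintrin.h>  // SSE2 intrinsics (also provides the SSE float variants)

static inline int round_sse2(double value)
{
    __m128d t = _mm_set_sd(value);  // move the scalar into an SSE register
    return _mm_cvtsd_si32(t);       // cvtsd2si: convert using the current rounding mode
}

static inline int round_sse2(float value)
{
    __m128 t = _mm_set_ss(value);
    return _mm_cvtss_si32(t);       // cvtss2si for the float overload
}
```

The convert instructions round according to the current MXCSR rounding mode (round-to-nearest-even by default), which matches the behaviour of the lrint-based fallback.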

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • [x] I agree to contribute to the project under Apache 2 License.
  • [x] To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • [x] The PR is proposed to the proper branch
  • [ ] There is a reference to the original bug report and related work
  • [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • [x] The feature is well documented and sample code can be built with the project CMake

@asmorkalov
Contributor

@legrosbuffle Thanks a lot for the contribution. I made a quick benchmark of the imgproc module with and without the patch, and the outcome is the following:

  • OpenCV uses __builtin_lrintf as the default solution. The generated assembly contains a function call rather than an inlined instruction; see https://godbolt.org/z/hvb46xx3h and the sketch after this list.
  • I tried an old Core i5-2500K and a modern AMD Ryzen 7 2700X. The overall result is very close for the two hosts. There is a speedup in several cases for cvtColor conversions (very lightweight) and the Hough transform; other cases are flat. The best example (AMD):
    Geometric mean (ms)

    Name of Test                                                                            imgproc 4.x  imgproc patch round  x-factor
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 0.5)        15.444                4.715      3.28
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 0.5)          1.775                0.463      3.83
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 1.1)        15.557                4.697      3.31
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 1.1)          1.771                0.458      3.87
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 0.5)       16.753                4.433      3.78
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 0.5)         1.746                0.420      4.16
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 1.1)       16.694                4.394      3.80
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 1.1)         1.740                0.428      4.06
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 0.5)         144.136              132.952      1.08
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 0.5)           14.148                3.270      4.33
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 1.1)         141.822               32.338      4.39
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 1.1)           14.132                3.250      4.35
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 0.5)        136.346               30.763      4.43
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 0.5)          14.062                3.157      4.45
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 1.1)        136.306               30.152      4.52
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 1.1)          14.072                3.159      4.45
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 0.5)          17.115               16.516      1.04
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 0.5)            1.780                0.456      3.91
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 1.1)          16.586               16.874      0.98
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 1.1)            1.782                0.454      3.93
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 0.5)         16.688                4.429      3.77
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 0.5)           1.748                0.432      4.05
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 1.1)         16.667                3.610      4.62
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 1.1)           1.739                0.424      4.10
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 0.5)           144.735              130.719      1.11
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 0.5)             14.305                3.742      3.82
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 1.1)           139.298               32.419      4.30
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 1.1)             14.284                3.697      3.86
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 0.5)          140.510              121.088      1.16
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 0.5)            14.101                3.142      4.49
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 1.1)          139.432               30.010      4.65
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 1.1)            14.097                3.164      4.46
  • There is no speedup or regression in features2d and objdetect.
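
To make the first bullet concrete (the sketch mentioned there), here is a standalone comparison of the two approaches with hypothetical helper names; it is not OpenCV code. With a plain optimized build, GCC and clang typically keep the lrintf-based variant as a call into libm, while the SSE2 intrinsic lowers to a single convert instruction.

```cpp
// Illustrative comparison (hypothetical helpers, not OpenCV code).
// Inspect the generated assembly with:  g++ -O2 -S round_demo.cpp
#include <emmintrin.h>

int round_lrintf(float v)
{
    return (int)__builtin_lrintf(v);       // typically emitted as a call to lrintf
}

int round_intrin(float v)
{
    return _mm_cvtss_si32(_mm_set_ss(v));  // typically a single cvtss2si instruction
}
```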

@asmorkalov
Contributor

@legrosbuffle could you describe your use case?

@legrosbuffle
Contributor (Author)

We're mostly interested in Resize and HoughLines. Here's a benchmark that roughly represents what we're measuring in profiles for Resize (using the public google/benchmark library):

    #include "testing/base/public/benchmark.h"
    #include "third_party/OpenCV/core/fast_math.hpp"
    #include "third_party/OpenCV/core/mat.hpp"
    #include "third_party/OpenCV/imgproc.hpp"

    namespace {

    template <typename FloatT>
    void BM_CvRound(benchmark::State& state) {
      FloatT input = 1.23456789;
      for (const auto s : state) {
        testing::DoNotOptimize(input);
        int rounded = cvRound(input);
        testing::DoNotOptimize(rounded);
      }
    }
    BENCHMARK(BM_CvRound<float>);
    BENCHMARK(BM_CvRound<double>);

    // `cv::resize` makes intensive use of `cvRound`.
    void BM_CvResize(benchmark::State& state) {
      constexpr int kSrcWidth = 2048;
      constexpr int kSrcHeight = 1536;
      constexpr int kDstWidth = 512;
      constexpr int kDstHeight = 384;
      const int interpolation = state.range(0);
      cv::Mat src(kSrcWidth, kSrcHeight, CV_8UC1, cv::Scalar(0));
      cv::Mat dst(kDstWidth, kDstHeight, CV_8UC1, cv::Scalar(0));
      for (const auto s : state) {
        testing::DoNotOptimize(src);
        cv::resize(src, dst, dst.size(), 0, 0, interpolation);
        testing::DoNotOptimize(dst);
      }
    }
    BENCHMARK(BM_CvResize)->Arg(cv::INTER_LINEAR)->Arg(cv::INTER_AREA);

    }  // namespace

Diff between base and patch on machines we care about:

    BM_CvRound<float>     -85.15%   (p=0.000 n=8+10)
    BM_CvRound<double>    -86.63%   (p=0.000 n=8+10)
    BM_CvResize/1         -17.29%   (p=0.000 n=9+10)
    BM_CvResize/3         -28.96%   (p=0.000 n=10+10)

@vpisarev
Contributor

@legrosbuffle, thanks for the PR.

We just want to give the compiler more chances to vectorize code. I believe that with higher-level intrinsics like __builtin_lrintf() the chances are higher than with manually embedded scalar SSE2 intrinsics. Another reason is that the code may be compiled with AVX2 or AVX512 and then those SSE2 intrinsics would look weird.

May we get more detailed information about your test environment:

  1. operating system, including version.
  2. GCC version
  3. hardware used
  4. version of OpenCV used
  5. are there any extra compiler flags used to build OpenCV?

@legrosbuffle
Contributor (Author)

We just want to give the compiler more chances to vectorize code. I believe that with higher-level intrinsics like __builtin_lrintf() the chances are higher than with manually embedded scalar SSE2 intrinsics.

Do you have an example where the builtin version vectorizes? Neither GCC nor clang seems to vectorize any version of cvRound without fast-math: https://godbolt.org/z/fjoqv5jxo

Another reason is that the code may be compiled with AVX2 or AVX512 and then those SSE2 intrinsics would look weird.

When building with AVX2, the compiler will select the proper encoding (see godbolt link).
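
As an illustrative check (not taken from the PR): building the same scalar intrinsic with AVX enabled makes the compiler pick the VEX encoding automatically.

```cpp
// Hypothetical snippet (not from the PR). Built with AVX2 enabled, e.g.
//   g++ -O2 -mavx2 -S round_avx.cpp
// the compiler emits vcvtsd2si (the VEX encoding) instead of cvtsd2si.
#include <emmintrin.h>

int round_intrin(double v)
{
    return _mm_cvtsd_si32(_mm_set_sd(v));
}
```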

May we get more detailed information about your test environment:

  1. operating system, including version.

Google's Linux distribution. Results are similar on gLinux and prod.

  2. GCC version

We're using clang, but results look similar on GCC and clang (see my godbolt link above).

  3. hardware used

Skylake

  4. version of OpenCV used

HEAD

  5. are there any extra compiler flags used to build OpenCV?

The only relevant one is -DCV_ENABLE_INTRINSICS.

…linux) too. We've measured a 7x improvement in speed for `cvRound` using the intrinsic.
@asmorkalov force-pushed the legrosbuffle-cvround-intrinsic branch from 23118c9 to 3cce299 on July 21, 2023 08:35
@asmorkalov
Contributor

@legrosbuffle Thanks a lot for the patch. We discussed the solution at the OpenCV core team meeting and decided to merge it. I removed the x86_64 check to cover the x86 32-bit configuration too. I'll merge it as soon as CI passes.
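
Roughly, the resulting guard then has the following shape: the MSVC x64 case is kept, and the GNU-compiler case only requires SSE2 rather than x86_64, so 32-bit x86 builds with SSE2 enabled also take the intrinsic path. This is a sketch reconstructed from the discussion above (with a hypothetical wrapper name), not the merged diff.

```cpp
// Sketch of the guard shape described above (illustrative; not the merged diff).
#include <cmath>
#if (defined _MSC_VER && defined _M_X64) || (defined __GNUC__ && defined __SSE2__)
#include <emmintrin.h>
#endif

static inline int my_cvRound(double value)  // hypothetical stand-in for cvRound
{
#if ((defined _MSC_VER && defined _M_X64) || \
     (defined __GNUC__ && defined __SSE2__)) && !defined(__CUDACC__)
    __m128d t = _mm_set_sd(value);  // intrinsic path: MSVC x64, or GCC/clang with SSE2
    return _mm_cvtsd_si32(t);       // valid for 32-bit x86 too, as long as SSE2 is on
#else
    return (int)std::lrint(value);  // portable fallback
#endif
}
```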

@asmorkalov left a comment
Contributor

👍


Reviewers

@asmorkalov approved these changes

@opencv-alalek approved these changes

@vpisarev (awaiting requested review)


4 participants

@legrosbuffle @asmorkalov @vpisarev @opencv-alalek
