Use intrinsics for cvRound on x86_64 __GNUC__ (clang/gcc linux) too. #24001


Merged

Conversation

@legrosbuffle
Contributor

There is no reason to enable this only on Windows. We've measured a 7x improvement in speed for cvRound using the intrinsic.
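
For context, the kind of intrinsic-based rounding this enables on GCC/clang looks roughly like the sketch below: scalar SSE2 convert instructions instead of a library call. This is a minimal illustration of the technique, not the exact code in core/fast_math.hpp.

```cpp
// Minimal sketch of intrinsic-based rounding (assumes SSE2 is available);
// illustrative only, not the exact OpenCV implementation.
#include <emmintrin.h>  // SSE2 intrinsics (also provides the SSE float variants)

static inline int round_sse2(double value)
{
    __m128d t = _mm_set_sd(value);  // move the scalar into an SSE register
    return _mm_cvtsd_si32(t);       // cvtsd2si: convert using the current rounding mode
}

static inline int round_sse2(float value)
{
    __m128 t = _mm_set_ss(value);
    return _mm_cvtss_si32(t);       // cvtss2si for the float overload
}
```

The convert instructions round according to the current MXCSR rounding mode (round-to-nearest-even by default), which matches the behaviour of the lrint-based fallback.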

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • [x] I agree to contribute to the project under Apache 2 License.
  • [x] To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • [x] The PR is proposed to the proper branch
  • [ ] There is a reference to the original bug report and related work
  • [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • [x] The feature is well documented and sample code can be built with the project CMake

@asmorkalov
Contributor

@legrosbuffle Thanks a lot for the contribution. I made a quick benchmark of the imgproc module with and without the patch, and the outcome is the following:

  • OpenCV uses __builtin_lrintf as the default solution. The generated assembly contains a function call rather than an inlined instruction; see https://godbolt.org/z/hvb46xx3h and the sketch after this list.
  • I tried an old Core i5-2500K and a modern AMD Ryzen 7 2700X. The overall result is very close for the two hosts. There is a speedup in several cases for cvtColor conversions (very lightweight) and the Hough transform; other cases are flat. The best example (AMD):
    Geometric mean (ms)

    Name of Test                                                                            imgproc 4.x  imgproc patch round  x-factor
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 0.5)        15.444                4.715      3.28
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 0.5)          1.775                0.463      3.83
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 1.1)        15.557                4.697      3.31
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 1.1)          1.771                0.458      3.87
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 0.5)       16.753                4.433      3.78
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 0.5)         1.746                0.420      4.16
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 1.1)       16.694                4.394      3.80
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 1.1)         1.740                0.428      4.06
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 0.5)         144.136              132.952      1.08
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 0.5)           14.148                3.270      4.33
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 1.1)         141.822               32.338      4.39
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 1.1)           14.132                3.250      4.35
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 0.5)        136.346               30.763      4.43
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 0.5)          14.062                3.157      4.45
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 1.1)        136.306               30.152      4.52
    HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 1.1)          14.072                3.159      4.45
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 0.5)          17.115               16.516      1.04
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 0.5)            1.780                0.456      3.91
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 1.1)          16.586               16.874      0.98
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 1.1)            1.782                0.454      3.93
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 0.5)         16.688                4.429      3.77
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 0.5)           1.748                0.432      4.05
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 1.1)         16.667                3.610      4.62
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 1.1)           1.739                0.424      4.10
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 0.5)           144.735              130.719      1.11
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 0.5)             14.305                3.742      3.82
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 1.1)           139.298               32.419      4.30
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 1.1)             14.284                3.697      3.86
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 0.5)          140.510              121.088      1.16
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 0.5)            14.101                3.142      4.49
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 1.1)          139.432               30.010      4.65
    HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 1.1)            14.097                3.164      4.46
  • There is no speedup or regression in features2d and objdetect.
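
To make the first bullet concrete (the sketch mentioned there), here is a standalone comparison of the two approaches with hypothetical helper names; it is not OpenCV code. With a plain optimized build, GCC and clang typically keep the lrintf-based variant as a call into libm, while the SSE2 intrinsic lowers to a single convert instruction.

```cpp
// Illustrative comparison (hypothetical helpers, not OpenCV code).
// Inspect the generated assembly with:  g++ -O2 -S round_demo.cpp
#include <emmintrin.h>

int round_lrintf(float v)
{
    return (int)__builtin_lrintf(v);       // typically emitted as a call to lrintf
}

int round_intrin(float v)
{
    return _mm_cvtss_si32(_mm_set_ss(v));  // typically a single cvtss2si instruction
}
```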

@asmorkalov
Contributor

@legrosbuffle could you describe your use case?

@legrosbuffle
Contributor (Author)

We're mostly interested in Resize and HoughLines. Here's a benchmark that roughly represents what we're measuring in profiles for Resize (using the public google/benchmark library):

    #include "testing/base/public/benchmark.h"
    #include "third_party/OpenCV/core/fast_math.hpp"
    #include "third_party/OpenCV/core/mat.hpp"
    #include "third_party/OpenCV/imgproc.hpp"

    namespace {

    template <typename FloatT>
    void BM_CvRound(benchmark::State& state) {
      FloatT input = 1.23456789;
      for (const auto s : state) {
        testing::DoNotOptimize(input);
        int rounded = cvRound(input);
        testing::DoNotOptimize(rounded);
      }
    }
    BENCHMARK(BM_CvRound<float>);
    BENCHMARK(BM_CvRound<double>);

    // `cv::resize` makes intensive use of `cvRound`.
    void BM_CvResize(benchmark::State& state) {
      constexpr int kSrcWidth = 2048;
      constexpr int kSrcHeight = 1536;
      constexpr int kDstWidth = 512;
      constexpr int kDstHeight = 384;
      const int interpolation = state.range(0);
      cv::Mat src(kSrcWidth, kSrcHeight, CV_8UC1, cv::Scalar(0));
      cv::Mat dst(kDstWidth, kDstHeight, CV_8UC1, cv::Scalar(0));
      for (const auto s : state) {
        testing::DoNotOptimize(src);
        cv::resize(src, dst, dst.size(), 0, 0, interpolation);
        testing::DoNotOptimize(dst);
      }
    }
    BENCHMARK(BM_CvResize)->Arg(cv::INTER_LINEAR)->Arg(cv::INTER_AREA);

    }  // namespace

Diff between base and patch on machines we care about:

    BM_CvRound<float>     -85.15%   (p=0.000 n=8+10)
    BM_CvRound<double>    -86.63%   (p=0.000 n=8+10)
    BM_CvResize/1         -17.29%   (p=0.000 n=9+10)
    BM_CvResize/3         -28.96%   (p=0.000 n=10+10)

@vpisarev
Contributor

@legrosbuffle, thanks for the PR.

We just want to give the compiler more chances to vectorize code. I believe that with higher-level intrinsics like __builtin_lrintf() the chances are higher than with manually embedded scalar SSE2 intrinsics. Another reason is that the code may be compiled with AVX2 or AVX512 and then those SSE2 intrinsics would look weird.

May we get more detailed information about your test environment:

  1. operating system, including version.
  2. GCC version
  3. hardware used
  4. version of OpenCV used
  5. are there any extra compiler flags used to build OpenCV?

@legrosbuffle
Contributor (Author)

We just want to give the compiler more chances to vectorize code. I believe that with higher-level intrinsics like __builtin_lrintf() the chances are higher than with manually embedded scalar SSE2 intrinsics.

Do you have an example where the builtin version vectorizes? Neither GCC nor clang seems to vectorize any version of cvRound without fast-math: https://godbolt.org/z/fjoqv5jxo

Another reason is that the code may be compiled with AVX2 or AVX512 and then those SSE2 intrinsics would look weird.

When building with AVX2, the compiler will select the proper encoding (see godbolt link).
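
As an illustrative check (not taken from the PR): building the same scalar intrinsic with AVX enabled makes the compiler pick the VEX encoding automatically.

```cpp
// Hypothetical snippet (not from the PR). Built with AVX2 enabled, e.g.
//   g++ -O2 -mavx2 -S round_avx.cpp
// the compiler emits vcvtsd2si (the VEX encoding) instead of cvtsd2si.
#include <emmintrin.h>

int round_intrin(double v)
{
    return _mm_cvtsd_si32(_mm_set_sd(v));
}
```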

May we get more detailed information about your test environment:

  1. operating system, including version.

Google's Linux distribution. Results are similar on gLinux and prod.

  2. GCC version

We're using clang, but results look similar on GCC and clang (see my godbolt link above).

  3. hardware used

Skylake

  4. version of OpenCV used

HEAD

  5. are there any extra compiler flags used to build OpenCV?

The only relevant one is -DCV_ENABLE_INTRINSICS.

…linux) too. We've measured a 7x improvement in speed for `cvRound` using the intrinsic.
@asmorkalov force-pushed the legrosbuffle-cvround-intrinsic branch from 23118c9 to 3cce299 on July 21, 2023 08:35
@asmorkalov
Contributor

@legrosbuffle Thanks a lot for the patch. We discussed the solution at the OpenCV core team meeting and decided to merge it. I removed the x86_64 check to cover the x86 32-bit configuration too. I'll merge it as soon as CI passes.
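
Roughly, the resulting guard then has the following shape: the MSVC x64 case is kept, and the GNU-compiler case only requires SSE2 rather than x86_64, so 32-bit x86 builds with SSE2 enabled also take the intrinsic path. This is a sketch reconstructed from the discussion above (with a hypothetical wrapper name), not the merged diff.

```cpp
// Sketch of the guard shape described above (illustrative; not the merged diff).
#include <cmath>
#if (defined _MSC_VER && defined _M_X64) || (defined __GNUC__ && defined __SSE2__)
#include <emmintrin.h>
#endif

static inline int my_cvRound(double value)  // hypothetical stand-in for cvRound
{
#if ((defined _MSC_VER && defined _M_X64) || \
     (defined __GNUC__ && defined __SSE2__)) && !defined(__CUDACC__)
    __m128d t = _mm_set_sd(value);  // intrinsic path: MSVC x64, or GCC/clang with SSE2
    return _mm_cvtsd_si32(t);       // valid for 32-bit x86 too, as long as SSE2 is on
#else
    return (int)std::lrint(value);  // portable fallback
#endif
}
```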

@asmorkalov left a comment
Contributor

👍


Reviewers

@asmorkalov approved these changes

@opencv-alalek approved these changes

@vpisarev (awaiting requested review)


4 participants

@legrosbuffle @asmorkalov @vpisarev @opencv-alalek
