Find contours speedup #26834


Merged
asmorkalov merged 19 commits into opencv:4.x from chacha21:findContours_speedup on Mar 12, 2025

Conversation

@chacha21 (Contributor) commented Jan 23, 2025 (edited)

This pull request is still incomplete.
It is an attempt, as suggested by #26775, to restore the speed lost when migrating the findContours() implementation from C to C++.

The patch adds an "Arena" (a pool) of pre-allocated memory so that contour points (and TreeNodes) can be picked from the Arena.
The code of findContours() is mostly unchanged, the arena usage being implicit through a utility class Arena::Item that provides C++ overloaded operators and construct/destruct logic.

As mentioned in #26775, the contour points are allocated and released in order, and can be represented by ranges of indices in their arena. No range subset will be released and drill a hole, which is why the internal representation as a range of indices makes sense.

The TreeNodes use another Arena class that does not comply with that range logic.

Currently, there is a significant improvement in the run time of the test mentioned in #26775, but it is still far from the findContours_legacy() performance.
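The range-of-indices idea described above can be sketched as follows. This is a minimal, hypothetical illustration (the names `PointArena`, `push`, `beginContour`, and `releaseTail` are not the PR's actual identifiers): points go into one flat pool, each contour is a half-open `[first, last)` index range, and because contours are released in order, only the tail range ever needs freeing, so no hole can appear.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Point2i { int x, y; };

class PointArena {
public:
    explicit PointArena(std::size_t capacityHint) { pool.reserve(capacityHint); }

    // Append one point; returns its index in the pool.
    std::size_t push(Point2i p) { pool.push_back(p); return pool.size() - 1; }

    // A contour starts as the empty range [size, size).
    std::pair<std::size_t, std::size_t> beginContour() const {
        return { pool.size(), pool.size() };
    }

    // Releasing only works for the tail range, matching the in-order discipline.
    void releaseTail(std::pair<std::size_t, std::size_t> range) {
        if (range.second == pool.size())
            pool.resize(range.first);
    }

    const Point2i& operator[](std::size_t i) const { return pool[i]; }
    std::size_t size() const { return pool.size(); }

private:
    std::vector<Point2i> pool;  // one allocation up front, reused by all contours
};
```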

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@asmorkalov changed the title from "Find contours speedup" to "WIP: Find contours speedup" on Jan 24, 2025
@mshabunin (Contributor) commented Jan 25, 2025 (edited)

I've tested "updated code" (after) vs "current 4.x" (new) vs "4.9" (old). I also profiled with callgrind to check the number of memory allocations. Each measurement took 50 iterations using the reproducer from the original issue.

| Name of Test | before | new | after | new vs before (x-factor) | after vs before (x-factor) |
|---|---|---|---|---|---|
| findContours_big::TestFindContours | 41.837 | 75.655 | 61.264 | 0.55 | 0.68 |
| malloc calls | 200K | 23M | 13M | | |

Profiling with callgrind shows that the updated code still has numerous malloc calls. I've been able to reduce this number even more (by ~3M or by ~6M, I'm not sure) by disabling the index_mapping access in the cv::contourTreeToResults function, and got performance of ~45 ms, comparable with the old variant.

Perhaps we should use some storage with block allocation in your arena, something similar to the old C-API MemStorage. Thus we will avoid the vector reallocations which happen currently.

@chacha21 (Contributor, Author)

Perhaps we should use some storage with block allocation in your arena, something similar to the old C-API MemStorage. Thus we will avoid the vector reallocations which happen currently.

contourTreeToResults() is called once, so there is certainly no improvement from pre-allocating the memory needed for index_mapping. A stack allocation would help, but the current test case has a tree.size() of 127384 elements, which is too large for that.

  • the stack does not use a growing capacity, a vector should perform fewer allocations
  • Avoid allocation in TreeIterator by using a specific storage backed by an arena. No visible improvement on the test case.
  • The values stored in the `levels` member of TreeIterator seem to be mostly contiguous indices. Instead of allocating each index, we can just use ranges. It saves a lot of allocations. Still no visible performance improvement on the test case.
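The "ranges instead of individual indices" idea can be sketched like this (hypothetical class and member names; the PR's actual `vectorOfRanges` may differ): when a pushed value is contiguous with the previous one, extend the last stored run instead of storing a new element, so a mostly-contiguous sequence costs memory proportional to the number of runs, not the number of elements.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class VectorOfRanges {
public:
    void push_back(int v) {
        if (!ranges.empty() && v == ranges.back().second + 1)
            ranges.back().second = v;          // extend the current run
        else
            ranges.push_back({v, v});          // start a new run
        count++;
    }
    std::size_t size() const { return count; }
    std::size_t rangeCount() const { return ranges.size(); }
    int back() const { return ranges.back().second; }
    void pop_back() {
        auto& r = ranges.back();
        if (r.first == r.second) ranges.pop_back();  // run exhausted
        else r.second--;                             // shrink the run
        count--;
    }
private:
    std::vector<std::pair<int, int>> ranges;   // inclusive [first, last] runs
    std::size_t count = 0;
};
```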
@mshabunin (Contributor)

contourTreeToResults() is called once, so there is certainly no improvement from pre-allocating the memory needed for index_mapping

Did you measure the performance difference with the index_mapping update on your platform? For me it gave a 60 ms -> 45 ms improvement on Linux/GCC.

Perhaps we should use some storage with block allocation in your arena, something similar to the old C-API MemStorage. Thus we will avoid the vector reallocations which happen currently.

This comment was related to points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have a big impact on performance.

@chacha21 (Contributor, Author)

On my machine (Win7, VS2019)

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms

Perhaps we should use some storage with block allocation in your arena, something similar to the old C-API MemStorage. Thus we will avoid the vector reallocations which happen currently.
This comment was related to points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have a big impact on performance.

OK, I misunderstood as "preallocate index_mapping memory".

On my host machine, it is still not very close to the legacy performance.

@mshabunin (Contributor)

Please take a look at this commit: 5f35ea3
This is the block storage which I was writing about, though I'm not sure how to make it more concise.
On my machine it runs in 42 ms vs 40 ms for the legacy function. The number of malloc calls is 275K vs 185K. There is still some room for improvement (for example, perhaps we don't need to store a reference to the storage in each contour, but could add it as an argument in the corresponding methods), but overall it looks much better.

@chacha21 (Contributor, Author) commented Jan 26, 2025 (edited)

Please take a look at this commit: 5f35ea3

[EDIT]: test_imgproc fails with this commit
OK, it fails because the modifications are not backported to approximateChainTC89

What I understand:

  • this commit is about replacing the Arena (which has a policy of "fixed size, then dynamic allocation when static capacity is reached") with a structure performing incremental dynamic allocations by blocks.
  • I can see that you tested a static size of 600000 bytes for the arena, because of the current test case generating a high number of points. Of course, it would not be a good idea to allocate that much memory for the standard case. Hence the arena seems doomed, and the block storage might be better in general.
  • the block storage is not conceptually very different from a vector. It performs dynamic allocations as it grows.
  • the difference is that a vector does not assume that realloc() might work, and will always copy all its previous content when resized, while the block storage will just add new blocks
  • your BlockStorage is similar to the vectorWithArena that I tried previously for TreeIterator::levels (but vectorWithArena allocated blocks from an arena instead of the standard new)
  • for the current test case, requiring a high amount of memory, a small arena with a fixed stack-allocated buffer won't help, but it could still be useful for the general case.

I will test the performance with your commit on my host machine ASAP.

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms

I think that I will try to evolve it into a BlockStorageWithArena for the general case.
Another idea: actually, I am not sure that contour points should be stored as int2 (x, y), but rather as int (offset). The width information that allows offset -> (x, y) can be stored once in the context. I am curious whether the saved memory will result in a speed-up. There will be an overhead when converting to contour results, a mere memcpy being impossible after that.

@mshabunin (Contributor) commented Jan 27, 2025 (edited)

As I understand it, this algorithm requires memory storage with:

  1. O(1) push_back
  2. O(1) random access
  3. elements for each contour added sequentially
  4. a minimum number of allocations and deallocations

  • std::vector satisfies 1-3, but not 4, due to reallocations.
  • std::deque should work like this (1-3), but it seems that it uses quite small blocks (especially in MSVC: 16 bytes or 1 element), thus 4 is not satisfied either

I thought about using std::deque with a custom allocator and larger blocks, but it seems it would make things too complicated.
Do we really need an arena in this case? Wouldn't it add extra overhead for keeping items?
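A block storage meeting requirements 1-4 above can be sketched in a few lines. This is a minimal illustration, not the PR's actual `BlockStorage`: fixed-size blocks give O(1) push_back and O(1) random access, and growing only allocates a new block, never copying or invalidating existing elements the way std::vector does on reallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t BlockSize = 1024>
class BlockStorage {
public:
    void push_back(const T& v) {
        if (count == blocks.size() * BlockSize)             // all blocks full
            blocks.push_back(std::make_unique<T[]>(BlockSize));
        blocks[count / BlockSize][count % BlockSize] = v;   // O(1): div + mod
        count++;
    }
    const T& operator[](std::size_t i) const {
        return blocks[i / BlockSize][i % BlockSize];
    }
    std::size_t size() const { return count; }
private:
    std::vector<std::unique_ptr<T[]>> blocks;  // only the block table grows
    std::size_t count = 0;
};
```

With a large BlockSize, the number of allocations is size/BlockSize plus occasional block-table growth, which addresses requirement 4.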

I am not sure that contour points should be stored as int2 (x, y), but rather as int (offset)

Not sure if I understand. Wouldn't the offset need to be 2D?

@chacha21 (Contributor, Author) commented Jan 27, 2025 (edited)

As I understand it, this algorithm requires memory storage with: [...]
* std::vector satisfies 1-3, but not 4, due to reallocations.
* std::deque should work like this (1-3), but it seems that it uses quite small blocks (especially in MSVC: 16 bytes or 1 element), thus 4 is not satisfied either
I thought about using std::deque with a custom allocator and larger blocks, but it seems to make things too complicated.

I think you are right. We should not rely on a structure whose inner tuning depends heavily on the toolchain. Your BlockStorage lets us control the amount of memory reliably.

Do we really need an arena in this case? Wouldn't it add extra overhead for keeping items?

The idea of the arena is just to provide fast memory on demand; I don't think that my ArenaStackBuffer has much overhead.
However, even better, I think that you could just tune your BlockStorage with one static buffer holding the first blocks (which will be allocated on the stack if the BlockStorage itself is allocated on the stack before creating the ContourScanner). When more blocks are needed, dynamic allocation will occur.
No need to delegate that to an arena.

I am not sure that contour points should be stored as int2 (x, y), but rather as int (offset)
Not sure if I understand. Wouldn't the offset need to be 2D?

Yes, but as long as we keep track of width, offset = y*width + x, so (x, y) = (offset % width, offset / width). This is just a trick to save half the memory used to store one point (at the cost of the offset <-> (x, y) conversion overhead here and there).
(And we ignore int overflow for very large images.)
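A minimal sketch of this encoding (hypothetical helper names, not the PR's code), including the overflow caveat noted above:

```cpp
#include <cassert>

struct Pt { int x, y; };

// offset = y*width + x packs a point into one int instead of two,
// halving per-point memory at the cost of a div/mod on the way out.
// Note: y*width may overflow int for very large images, as noted above.
inline int toOffset(Pt p, int width)        { return p.y * width + p.x; }
inline Pt  fromOffset(int offset, int width) { return { offset % width, offset / width }; }
```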

  • removed all arenas in favor of BlockStorage
  • added static blocks to BlockStorage
  • added range iteration for the block storage to optimize copying data out of it
  • made BlockStorage allocated on the stack before ContourScanner instantiation
  • tested storage of points as int (offset), but it was slower on test cases
@chacha21 (Contributor, Author) commented Jan 27, 2025 (edited)

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges : ~139ms


  • a std::vector<char> might not be optimal to store the codes (for instance it always allocates 16 bytes, even when empty), so I tried different things. Nothing was better so far.
  • even unused and empty, std::vector<schar> codes allocates memory
  • using a specific ContourCodesStorage similar to ContourPointsStorage helps save that allocation
@chacha21 (Contributor, Author)

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges : ~139ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges + codes also stored in a BlockStorage: ~133ms


  • removed harmless debugging code that raised warnings with some compilers
@chacha21 (Contributor, Author) commented Jan 28, 2025 (edited)

I ran the standard perf tests to see if there are any regressions.
It's hard to compare old vs new perf stats for perf_imgproc (--gtest_filter=*indContours*), since the number of samples per test varies from run to run.
What's the good practice? (I can't run Python.)

@mshabunin (Contributor)

The number of samples is chosen from the predefined range (10-100 by default) taking into account the relative performance measurement deviation (median vs mean) and the overall test execution time (which should not exceed 3 sec by default). Maybe something else.

Use the following options to tune the parameters:

  • --perf_min_samples / --perf_force_samples to modify the range
  • --perf_max_deviation to change the deviation threshold
  • --perf_time_limit to change the execution time limit

The simplest variant would be to set a fixed number of samples by using the same value for min_samples and max_samples.
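For example, fixing the sample count for comparable runs could look like this (the binary path is an assumption and depends on your build layout; the flags are the ones listed above):

```shell
# Run only the findContours perf tests with a fixed number of samples,
# so timings from different builds are directly comparable.
./bin/opencv_perf_imgproc --gtest_filter=*indContours* --perf_force_samples=50
```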


BTW, I used this basic test to reproduce the issue from the original ticket and compare the legacy and new implementations:

```cpp
#include "perf_precomp.hpp"
#include "opencv2/imgproc/detail/legacy.hpp"

namespace opencv_test { namespace {

PERF_TEST(TestFindContours, findContours_big)
{
    Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
    cv::Mat mask;
    inRange(img, 110, 120, mask);
    std::vector<std::vector<cv::Point>> v_cont;
    TEST_CYCLE() cv::findContours(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    SANITY_CHECK_NOTHING();
}

PERF_TEST(TestFindContours, findContours_big_old)
{
    Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
    cv::Mat mask;
    inRange(img, 110, 120, mask);
    std::vector<std::vector<cv::Point>> v_cont;
    TEST_CYCLE() cv::findContours_legacy(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    SANITY_CHECK_NOTHING();
}

}} // namespace
```

@chacha21 (Contributor, Author)

OK, I ran the perf test for *indContours* with current 4.x vs the current PR.
There is probably no significant difference, but the PR timings are almost always slightly better.
Sometimes (it differs from run to run), some tests might have a <= 2 ms regression, but it's probably noise.
In my opinion, this is an overall improvement.

@chacha21 (Contributor, Author) commented Jan 31, 2025 (edited)

There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case. So I wonder if I should commit it or if it is just noise at this point.
The idea is that a referenced part of a block storage (a pair<first, last>) could be asked to be "released", giving a chance to really free it if it happens to be the tail of the block storage.

  • in that case, the block storage can just do size -= (last - first)
  • but it is unclear whether the block storage must free such released memory (if a block can be freed) or keep it as remaining capacity (certainly the best choice)
  • I assume that adding size_t capacity() const, void resize() {dangerous if areas are referenced}, and void shrink_to_fit() are standards that one could expect
  • the "release part" operation would be if (last == storage.size()) storage.resize(storage.size() - (last - first))

There is also the temptation (bait?) to mimic an allocator by:

  • adding an std::pair<size_t, size_t> alloc(size) that would usually append memory like the current push_back()
  • adding a bool release(const std::pair<size_t, size_t>&) that could use the "release tail if possible" logic, and be a no-op otherwise
  • adding the ability to also mark released areas at the head (like a double-ended buffer), so that future allocations that would fit could reuse that memory: (storage start)...(head)(data)...(data)(tail)...(storage end)

And also:

  • add a writeTo(size_t first, size_t size, const T* data) that would handle the RangeIterator logic instead of making it public.
  • add a readFrom(size_t first, size_t size, T* data)

That's not a big piece of work, but does it belong in this PR?

@mshabunin (Contributor)

I'm observing a slight performance degradation in the original scenario with the current version: PR ~48 ms, legacy ~42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?

There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case.

I believe we don't need to add functionality we don't use right now. It can be useful to move the block memory storage to separate file(s). Though eventually we might need to move it to the core module (private interface, like BufferArea), extend its functionality, and cover it with tests.

@chacha21 (Contributor, Author)

I'm observing a slight performance degradation in the original scenario with the current version: PR ~48 ms, legacy ~42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?

I just performed the test:

On the machine where I developed the PR, Windows 7, vs2019:

  • cv::findContours_legacy(): ~133ms
  • current PR cv::findContours(): ~137ms

On another Windows 10 machine, vs2022:

  • cv::findContours_legacy(): ~66ms
  • current PR cv::findContours(): ~68ms

```diff
 private:
-    std::stack<int> levels;
     Tree<T>& tree;
+    vectorOfRanges<int> levels;
```
@mshabunin (Contributor) commented on the diff:
I'm observing better performance with std::stack (though less stable):

On x86_64 (Core i5-11600, Linux):

  • stack: samples=100 mean=44.01 median=42.88 min=40.59 stddev=2.44 (5.6%)
  • vectorOfRanges: samples=100 mean=46.97 median=46.97 min=46.65 stddev=0.16 (0.3%)

On AArch64 (Rockchip RK3588):

  • stack: samples=100 mean=79.23 median=79.02 min=78.35 stddev=1.27 (1.6%)
  • vectorOfRanges: samples=100 mean=89.51 median=89.52 min=88.90 stddev=0.21 (0.2%)

I propose to keep the stack version. It is also better for simplicity.

  • class BlockStorage has its own file
  • TreeIterator::levels reverted to std::stack (got rid of vectorOfRanges)
  • ContourPointsStorage and ContourCodesStorage are template specializations of the same class ContourDataStorage
  • ContourPointsStorage and ContourCodesStorage use a static size of 0 for their inner BlockStorage

I have run the standard findContours perf_tests to check that all those changes were not a slowdown for the average case (because the current use case of the PR must not take precedence). It's OK.
@mshabunin (Contributor) left a review:

Looks good to me. With minor cosmetic comments.

@asmorkalov added this to the 4.12.0 milestone on Mar 12, 2025
@asmorkalov changed the title from "WIP: Find contours speedup" to "Find contours speedup" on Mar 12, 2025
@asmorkalov merged commit d83df66 into opencv:4.x on Mar 12, 2025
27 of 28 checks passed
@asmorkalov mentioned this pull request on Apr 29, 2025

Reviewers

@mshabunin approved these changes

Assignees

@mshabunin

Projects

None yet

Milestone

4.12.0

Development

Successfully merging this pull request may close these issues.

3 participants

@chacha21, @mshabunin, @asmorkalov
