Find contours speedup #26834


Merged
asmorkalov merged 19 commits into opencv:4.x from chacha21:findContours_speedup on Mar 12, 2025

Conversation

@chacha21 (Contributor) commented Jan 23, 2025 (edited)

This pull request is still incomplete.
It is an attempt, as suggested by #26775, to restore the speed lost when migrating the findContours() implementation from C to C++.

The patch adds an "Arena" (a pool) of pre-allocated memory so that contour points (and TreeNodes) can be picked from the Arena.
The code of findContours() is mostly unchanged, the arena usage being implicit through a utility class Arena::Item that provides C++ overloaded operators and construct/destruct logic.

As mentioned in #26775, the contour points are allocated and released in order, and can be represented by ranges of indices in their arena. No range subset will be released and drill a hole, which is why the internal representation as a range of indices makes sense.

The TreeNodes use another Arena class that does not comply with that range logic.

Currently, there is a significant improvement in the run time of the test mentioned in #26775, but it is still far from the findContours_legacy() performance.
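The range-of-indices idea described above can be sketched as follows. This is a minimal, hypothetical illustration (the names `PointArena`, `push`, `beginContour`, and `releaseTail` are not the PR's actual identifiers): points go into one flat pool, each contour is a half-open `[first, last)` index range, and because contours are released in order, only the tail range ever needs freeing, so no hole can appear.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Point2i { int x, y; };

class PointArena {
public:
    explicit PointArena(std::size_t capacityHint) { pool.reserve(capacityHint); }

    // Append one point; returns its index in the pool.
    std::size_t push(Point2i p) { pool.push_back(p); return pool.size() - 1; }

    // A contour starts as the empty range [size, size).
    std::pair<std::size_t, std::size_t> beginContour() const {
        return { pool.size(), pool.size() };
    }

    // Releasing only works for the tail range, matching the in-order discipline.
    void releaseTail(std::pair<std::size_t, std::size_t> range) {
        if (range.second == pool.size())
            pool.resize(range.first);
    }

    const Point2i& operator[](std::size_t i) const { return pool[i]; }
    std::size_t size() const { return pool.size(); }

private:
    std::vector<Point2i> pool;  // one allocation up front, reused by all contours
};
```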

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@asmorkalov changed the title from "Find contours speedup" to "WIP: Find contours speedup" on Jan 24, 2025
@mshabunin (Contributor) commented Jan 25, 2025 (edited)

I've tested "updated code" (after) vs "current 4.x" (new) vs "4.9" (old). I also profiled with callgrind to check the number of memory allocations. Each measurement took 50 iterations using the reproducer from the original issue.

| Name of Test | before | new | after | new vs before (x-factor) | after vs before (x-factor) |
|---|---|---|---|---|---|
| findContours_big::TestFindContours | 41.837 | 75.655 | 61.264 | 0.55 | 0.68 |
| malloc calls | 200K | 23M | 13M | | |

Profiling with callgrind shows that the updated code still has numerous malloc calls. I've been able to reduce this number even more (by ~3M or by ~6M, I'm not sure) by disabling the index_mapping access in the cv::contourTreeToResults function, and got performance of ~45 ms, comparable with the old variant.

Perhaps we should use some storage with block allocation in your arena, something similar to the old C-API MemStorage. Thus we will avoid the vector reallocations which happen currently.

@chacha21 (Contributor, Author)

Perhaps we should use some storage with block allocation in your arena, something similar to the old C-API MemStorage. Thus we will avoid the vector reallocations which happen currently.

contourTreeToResults() is called once, so there is certainly no improvement from pre-allocating the memory needed for index_mapping. A stack allocation would help, but the current test case has a tree.size() of 127384 elements, which is too large for that.

  • the stack does not use a growing capacity, a vector should perform fewer allocations
  • Avoid allocation in TreeIterator by using a specific storage backed by an arena. No visible improvement on the test case.
  • The values stored in the `levels` member of TreeIterator seem to be mostly contiguous indices. Instead of allocating each index, we can just use ranges. It saves a lot of allocations. Still no visible performance improvement on the test case.
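The "ranges instead of individual indices" idea can be sketched like this (hypothetical class and member names; the PR's actual `vectorOfRanges` may differ): when a pushed value is contiguous with the previous one, extend the last stored run instead of storing a new element, so a mostly-contiguous sequence costs memory proportional to the number of runs, not the number of elements.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class VectorOfRanges {
public:
    void push_back(int v) {
        if (!ranges.empty() && v == ranges.back().second + 1)
            ranges.back().second = v;          // extend the current run
        else
            ranges.push_back({v, v});          // start a new run
        count++;
    }
    std::size_t size() const { return count; }
    std::size_t rangeCount() const { return ranges.size(); }
    int back() const { return ranges.back().second; }
    void pop_back() {
        auto& r = ranges.back();
        if (r.first == r.second) ranges.pop_back();  // run exhausted
        else r.second--;                             // shrink the run
        count--;
    }
private:
    std::vector<std::pair<int, int>> ranges;   // inclusive [first, last] runs
    std::size_t count = 0;
};
```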
@mshabunin (Contributor)

contourTreeToResults() is called once, so there is certainly no improvement from pre-allocating the memory needed for index_mapping

Did you measure the performance difference with the index_mapping update on your platform? For me it gave a 60 ms -> 45 ms improvement on Linux/GCC.

Perhaps we should use some storage with block allocation in your arena, something similar to the old C-API MemStorage. Thus we will avoid the vector reallocations which happen currently.

This comment was related to points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have a big impact on performance.

@chacha21 (Contributor, Author)

On my machine (Win7, VS2019)

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms

Perhaps we should use some storage with block allocation in your arena, something similar to the old C-API MemStorage. Thus we will avoid the vector reallocations which happen currently.
This comment was related to points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have a big impact on performance.

OK, I misunderstood as "preallocate index_mapping memory".

On my host machine, it is still not very close to the legacy performance.

@mshabunin (Contributor)

Please take a look at this commit: 5f35ea3
This is the block storage which I was writing about, though I'm not sure how to make it more concise.
On my machine it runs in 42 ms vs 40 ms for the legacy function. The number of malloc calls is 275K vs 185K. There is still some room for improvement (for example, perhaps we don't need to store a reference to the storage in each contour, but could add it as an argument in the corresponding methods), but overall it looks much better.

@chacha21 (Contributor, Author) commented Jan 26, 2025 (edited)

Please take a look at this commit: 5f35ea3

[EDIT]: test_imgproc fails with this commit
OK, it fails because the modifications are not backported to approximateChainTC89

What I understand:

  • this commit is about replacing the Arena (which has a policy of "fixed size, then dynamic allocation when static capacity is reached") with a structure performing incremental dynamic allocations by blocks.
  • I can see that you tested a static size of 600000 bytes for the arena, because of the current test case generating a high number of points. Of course, it would not be a good idea to allocate that much memory for the standard case. Hence the arena seems doomed, and the block storage might be better in general.
  • the block storage is not conceptually very different from a vector. It performs dynamic allocations as it grows.
  • the difference is that a vector does not assume that realloc() might work, and will always copy all its previous content when resized, while the block storage will just add new blocks
  • your BlockStorage is similar to the vectorWithArena that I tried previously for TreeIterator::levels (but vectorWithArena allocated blocks from an arena instead of the standard new)
  • for the current test case, requiring a high amount of memory, a small arena with a fixed stack-allocated buffer won't help, but it could still be useful for the general case.

I will test the performance with your commit on my host machine ASAP.

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms

I think that I will try to evolve it into a BlockStorageWithArena for the general case.
Another idea: actually, I am not sure that contour points should be stored as int2 (x, y), but rather as int (offset). The width information that allows offset -> (x, y) can be stored once in the context. I am curious whether the saved memory will result in a speed-up. There will be an overhead when converting to contour results, a mere memcpy being impossible after that.

@mshabunin (Contributor) commented Jan 27, 2025 (edited)

As I understand it, this algorithm requires memory storage with:

  1. O(1) push_back
  2. O(1) random access
  3. elements for each contour added sequentially
  4. a minimum number of allocations and deallocations

  • std::vector satisfies 1-3, but not 4, due to reallocations.
  • std::deque should work like this (1-3), but it seems that it uses quite small blocks (especially in MSVC: 16 bytes or 1 element), thus 4 is not satisfied either

I thought about using std::deque with a custom allocator and larger blocks, but it seems it would make things too complicated.
Do we really need an arena in this case? Wouldn't it add extra overhead for keeping items?
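A block storage meeting requirements 1-4 above can be sketched in a few lines. This is a minimal illustration, not the PR's actual `BlockStorage`: fixed-size blocks give O(1) push_back and O(1) random access, and growing only allocates a new block, never copying or invalidating existing elements the way std::vector does on reallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t BlockSize = 1024>
class BlockStorage {
public:
    void push_back(const T& v) {
        if (count == blocks.size() * BlockSize)             // all blocks full
            blocks.push_back(std::make_unique<T[]>(BlockSize));
        blocks[count / BlockSize][count % BlockSize] = v;   // O(1): div + mod
        count++;
    }
    const T& operator[](std::size_t i) const {
        return blocks[i / BlockSize][i % BlockSize];
    }
    std::size_t size() const { return count; }
private:
    std::vector<std::unique_ptr<T[]>> blocks;  // only the block table grows
    std::size_t count = 0;
};
```

With a large BlockSize, the number of allocations is size/BlockSize plus occasional block-table growth, which addresses requirement 4.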

I am not sure that contour points should be stored as int2 (x, y), but rather as int (offset)

Not sure if I understand. Wouldn't the offset need to be 2D?

@chacha21 (Contributor, Author) commented Jan 27, 2025 (edited)

As I understand it, this algorithm requires memory storage with: [...]
* std::vector satisfies 1-3, but not 4, due to reallocations.
* std::deque should work like this (1-3), but it seems that it uses quite small blocks (especially in MSVC: 16 bytes or 1 element), thus 4 is not satisfied either
I thought about using std::deque with a custom allocator and larger blocks, but it seems to make things too complicated.

I think you are right. We should not rely on a structure whose inner tuning depends heavily on the toolchain. Your BlockStorage lets us control the amount of memory reliably.

Do we really need an arena in this case? Wouldn't it add extra overhead for keeping items?

The idea of the arena is just to provide fast memory on demand; I don't think that my ArenaStackBuffer has much overhead.
However, even better, I think that you could just tune your BlockStorage with one static buffer holding the first blocks (which will be allocated on the stack if the BlockStorage itself is allocated on the stack before creating the ContourScanner). When more blocks are needed, dynamic allocation will occur.
No need to delegate that to an arena.

I am not sure that contour points should be stored as int2 (x, y), but rather as int (offset)
Not sure if I understand. Wouldn't the offset need to be 2D?

Yes, but as long as we keep track of width, offset = y*width + x, so (x, y) = (offset % width, offset / width). This is just a trick to save half the memory used to store one point (at the cost of the offset <-> (x, y) conversion overhead here and there).
(And we ignore int overflow for very large images.)
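A minimal sketch of this encoding (hypothetical helper names, not the PR's code), including the overflow caveat noted above:

```cpp
#include <cassert>

struct Pt { int x, y; };

// offset = y*width + x packs a point into one int instead of two,
// halving per-point memory at the cost of a div/mod on the way out.
// Note: y*width may overflow int for very large images, as noted above.
inline int toOffset(Pt p, int width)        { return p.y * width + p.x; }
inline Pt  fromOffset(int offset, int width) { return { offset % width, offset / width }; }
```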

  • removed all arenas in favor of BlockStorage
  • added static blocks to BlockStorage
  • added range iteration for the block storage to optimize copying data out of it
  • made BlockStorage allocated on the stack before ContourScanner instantiation
  • tested storage of points as int (offset), but it was slower on test cases
@chacha21 (Contributor, Author) commented Jan 27, 2025 (edited)

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges : ~139ms


  • a std::vector<char> might not be optimal to store the codes (for instance it always allocates 16 bytes, even when empty), so I tried different things. Nothing was better so far.
  • even unused and empty, std::vector<schar> codes allocates memory
  • using a specific ContourCodesStorage similar to ContourPointsStorage helps save that allocation
@chacha21 (Contributor, Author)

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges : ~139ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges + codes also stored in a BlockStorage: ~133ms


  • removed harmless debugging code that raised warnings with some compilers
@chacha21 (Contributor, Author) commented Jan 28, 2025 (edited)

I ran the standard perf tests to see if there are any regressions.
It's hard to compare old vs new perf stats for perf_imgproc (--gtest_filter=*indContours*), since the number of samples per test varies from run to run.
What's the good practice? (I can't run Python.)

@mshabunin (Contributor)

The number of samples is chosen from the predefined range (10-100 by default) taking into account the relative performance measurement deviation (median vs mean) and the overall test execution time (which should not exceed 3 sec by default). Maybe something else.

Use the following options to tune the parameters:

  • --perf_min_samples / --perf_force_samples to modify the range
  • --perf_max_deviation to change the deviation threshold
  • --perf_time_limit to change the execution time limit

The simplest variant would be to set a fixed number of samples by using the same value for min_samples and max_samples.
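For example, fixing the sample count for comparable runs could look like this (the binary path is an assumption and depends on your build layout; the flags are the ones listed above):

```shell
# Run only the findContours perf tests with a fixed number of samples,
# so timings from different builds are directly comparable.
./bin/opencv_perf_imgproc --gtest_filter=*indContours* --perf_force_samples=50
```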


BTW, I used this basic test to reproduce the issue from the original ticket and compare the legacy and new implementations:

```cpp
#include "perf_precomp.hpp"
#include "opencv2/imgproc/detail/legacy.hpp"

namespace opencv_test { namespace {

PERF_TEST(TestFindContours, findContours_big)
{
    Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
    cv::Mat mask;
    inRange(img, 110, 120, mask);
    std::vector<std::vector<cv::Point>> v_cont;
    TEST_CYCLE() cv::findContours(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    SANITY_CHECK_NOTHING();
}

PERF_TEST(TestFindContours, findContours_big_old)
{
    Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
    cv::Mat mask;
    inRange(img, 110, 120, mask);
    std::vector<std::vector<cv::Point>> v_cont;
    TEST_CYCLE() cv::findContours_legacy(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    SANITY_CHECK_NOTHING();
}

}} // namespace
```

@chacha21 (Contributor, Author)

OK, I ran the perf test for *indContours* with current 4.x vs the current PR.
There is probably no significant difference, but the PR timings are almost always slightly better.
Sometimes (it differs from run to run), some tests might have a <= 2 ms regression, but it's probably noise.
In my opinion, this is an overall improvement.

@chacha21 (Contributor, Author) commented Jan 31, 2025 (edited)

There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case. So I wonder if I should commit it or if it is just noise at this point.
The idea is that a referenced part of a block storage (a pair<first, last>) could be asked to be "released", giving a chance to really free it if it happens to be the tail of the block storage.

  • in that case, the block storage can just do size -= (last - first)
  • but it is unclear whether the block storage must free such released memory (if a block can be freed) or keep it as remaining capacity (certainly the best choice)
  • I assume that adding size_t capacity() const, void resize() {dangerous if areas are referenced}, and void shrink_to_fit() are standards that one could expect
  • the "release part" operation would be if (last == storage.size()) storage.resize(storage.size() - (last - first))

There is also the temptation (bait?) to mimic an allocator by:

  • adding an std::pair<size_t, size_t> alloc(size) that would usually append memory like the current push_back()
  • adding a bool release(const std::pair<size_t, size_t>&) that could use the "release tail if possible" logic, and be a no-op otherwise
  • adding the ability to also mark released areas at the head (like a double-ended buffer), so that future allocations that would fit could reuse that memory: (storage start)...(head)(data)...(data)(tail)...(storage end)

And also:

  • add a writeTo(size_t first, size_t size, const T* data) that would handle the RangeIterator logic instead of making it public.
  • add a readFrom(size_t first, size_t size, T* data)

That's not a big piece of work, but does it belong in this PR?

@mshabunin (Contributor)

I'm observing a slight performance degradation in the original scenario with the current version: PR ~48 ms, legacy ~42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?

There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case.

I believe we don't need to add functionality we don't use right now. It can be useful to move the block memory storage to separate file(s). Though eventually we might need to move it to the core module (private interface, like BufferArea), extend its functionality, and cover it with tests.

@chacha21 (Contributor, Author)

I'm observing a slight performance degradation in the original scenario with the current version: PR ~48 ms, legacy ~42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?

I just performed the test:

On the machine where I developed the PR, Windows 7, vs2019:

  • cv::findContours_legacy(): ~133ms
  • current PR cv::findContours(): ~137ms

On another Windows 10 machine, vs2022:

  • cv::findContours_legacy(): ~66ms
  • current PR cv::findContours(): ~68ms

```diff
 private:
-    std::stack<int> levels;
     Tree<T>& tree;
+    vectorOfRanges<int> levels;
```
@mshabunin (Contributor) commented on the diff:
I'm observing better performance with std::stack (though less stable):

On x86_64 (Core i5-11600, Linux):

  • stack: samples=100 mean=44.01 median=42.88 min=40.59 stddev=2.44 (5.6%)
  • vectorOfRanges: samples=100 mean=46.97 median=46.97 min=46.65 stddev=0.16 (0.3%)

On AArch64 (Rockchip RK3588):

  • stack: samples=100 mean=79.23 median=79.02 min=78.35 stddev=1.27 (1.6%)
  • vectorOfRanges: samples=100 mean=89.51 median=89.52 min=88.90 stddev=0.21 (0.2%)

I propose to keep the stack version. It is also better for simplicity.

  • class BlockStorage has its own file
  • TreeIterator::levels reverted to std::stack (got rid of vectorOfRanges)
  • ContourPointsStorage and ContourCodesStorage are template specializations of the same class ContourDataStorage
  • ContourPointsStorage and ContourCodesStorage use a static size of 0 for their inner BlockStorage

I have run the standard findContours perf_tests to check that all those changes were not a slowdown for the average case (because the current use case of the PR must not take precedence). It's OK.
@mshabunin (Contributor) left a review:

Looks good to me. With minor cosmetic comments.

@asmorkalov added this to the 4.12.0 milestone on Mar 12, 2025
@asmorkalov changed the title from "WIP: Find contours speedup" to "Find contours speedup" on Mar 12, 2025
@asmorkalov merged commit d83df66 into opencv:4.x on Mar 12, 2025
27 of 28 checks passed
@asmorkalov mentioned this pull request on Apr 29, 2025

Reviewers

@mshabunin approved these changes

Assignees

@mshabunin

Projects

None yet

Milestone

4.12.0

Development

Successfully merging this pull request may close these issues.

3 participants

@chacha21, @mshabunin, @asmorkalov
