Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork56.4k
Find contours speedup#26834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Find contours speedup#26834
Uh oh!
There was an error while loading.Please reload this page.
Conversation
bb242cc to085e8e3Comparemshabunin commentedJan 25, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I've tested "updated code" (after) VS "current 4.x" (new) VS "4.9" (old). Also profiled with callgrind to check number of memory allocations. Each measurment took 50 iterations using reproducer from the original issue.
Profiling with callgrind shows that updated code still have numerous malloc calls. I've been able to reduce this number even more (by ~3M or by ~6M, I'm not sure) by disabling Perhaps we should use some storage with blocks allocation in your arena, something similar to old C-API |
chacha21 commentedJan 25, 2025
|
the stack does not use a growing capacity, vector should perform less allocations
Avoid allocation in TreeIterator by using a specific storage using an arenaNo visible improvements on the test case
The values stored in the `levels` member of TreeIterator seems to be mostly contiguous indices. Instead of allocating each index, we can just use ranges. It saves a lot of allocations.Still no visible performance improvement on test case.
mshabunin commentedJan 26, 2025
Did you measure performance difference with
This comment was related to points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have big impact on performance. |
chacha21 commentedJan 26, 2025
On my machine (Win7, VS2019) original 4.11 legacy : ~133ms
OK, I misunderstood as "preallocate index_mapping memory". On my host machine, it is still not so close to legacy |
mshabunin commentedJan 26, 2025
Please take a look at this commit:5f35ea3 |
chacha21 commentedJan 26, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
What I understand :
original 4.11 legacy : ~133ms I think that I will try to evolve to a BlockStorageWithArena for the general case |
mshabunin commentedJan 27, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
As I understand this algorithm requires memory storage with :
I thought about using std::deque with custom allocator and larger blocks, but it seems it'll make things too complicated.
Not sure if I understand. Wouldn't offset need to be 2D? |
chacha21 commentedJan 27, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I think you are right. We should not rely on a structure which inner tuning depends heavily on the toolchain. Your BlockStorage let us control the amount of memory reliably.
The idea of the arena is just to provide fast memory on demand, I don't think that my ArenaStackBuffer has much overhead.
Yes, but as long as we keep track of width, offset = y*width+x, so (x, y) = (offset%width, offset/width). This is just a trick to save half the memory used to store one point (at the cost of the offset <-> (x,y) conversion overhead here and there) |
removed all arenas in favor of BlockStorageadded static blocks to BlockStorageadded range iteration for block storage to optimize copy data out of itmake BlockStorage allocated on stack before ContourScanner instanciationtested storage of points as int(offset), but it was slower on test cases
chacha21 commentedJan 27, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
original 4.11 legacy : ~133ms |
a std::vector<char> might not be optimal to store the codes (for instance uit always allocate 16 bytes, even empty), so I tried different things.Nothing was better so far.
even unused and empty, std::vector<schar> codes allocates memoryusing a specific ContourCodesStorage similar to ContourPointsStorage help saving that allocation
chacha21 commentedJan 27, 2025
original 4.11 legacy : ~133ms |
removed harmless debugging code, raising warning with some compilers
chacha21 commentedJan 28, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I ran standard perf tests to see if there are any regression. |
mshabunin commentedJan 29, 2025
Number of samples is chosen from the predefinedrange (10-100 by default) taking into account relative performance measurmentdeviation (median VS mean) and overall testexecution time (should not exceed 3 sec by default). Maybe something else. Use following options to tune the parameters:
The simplest variant would be to set fixed number of samples using the same value of BTW, I used this basic test to reproduce issue from the original ticket and compare legacy and new implementations: Details#include"perf_precomp.hpp"#include"opencv2/imgproc/detail/legacy.hpp"namespaceopencv_test {namespace {PERF_TEST(TestFindContours, findContours_big){ Mat img =imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE); cv::Mat mask;inRange(img,110,120, mask); std::vector<std::vector<cv::Point>> v_cont;TEST_CYCLE()cv::findContours(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);SANITY_CHECK_NOTHING();}PERF_TEST(TestFindContours, findContours_big_old){ Mat img =imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE); cv::Mat mask;inRange(img,110,120, mask); std::vector<std::vector<cv::Point>> v_cont;TEST_CYCLE()cv::findContours_legacy(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);SANITY_CHECK_NOTHING();}} }// namespace |
chacha21 commentedJan 29, 2025
Ok, I ran perf_test for |
chacha21 commentedJan 31, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case. So I wonder if I should commit it or if it is just noise at this point.
There is also the temptation (bait ?) to mimic an allocator by
And also
That's not a big work, but does it belong to that PR ? |
mshabunin commentedFeb 2, 2025
I'm observing slight performance degradation in the orignal scenario with current version: PR ~ 48 ms, Legacy ~ 42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?
I believe we don't need to add functionality we don't use right now. It can be useful to move block memory storage to separate file(s). Though eventually we might need to move it to thecore module (private interface, likeBufferArea), extend functionality and cover with tests. |
chacha21 commentedFeb 4, 2025
I just performed the test : On the machine where I developed the PR, Windows 7, vs2019
On another Windows 10, vs2022
|
| private: | ||
| std::stack<int> levels; | ||
| Tree<T>& tree; | ||
| vectorOfRanges<int> levels; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'm observing better performance withstd::stack (though less stable):
On x86_64 (Core i5-11600, Linux):
- stack:
samples=100 mean=44.01 median=42.88 min=40.59 stddev=2.44 (5.6%) - vectorOfRanges:
samples=100 mean=46.97 median=46.97 min=46.65 stddev=0.16 (0.3%)
On AArch64 (Rockchip RK3588):
- stack:
samples=100 mean=79.23 median=79.02 min=78.35 stddev=1.27 (1.6%) - vectorOfRanges:
samples=100 mean=89.51 median=89.52 min=88.90 stddev=0.21 (0.2%)
I propose to leave the stack version. Also it is better for simplicity.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
-class BlockStorgae has its own file-TreeIterator::levels reverted to std::stack (get rid of vectorOfRanges)-ContourPointsStorage and ContourCodesStorage are template specialization of the same class ContourDataStorage-ContourPointsStorage and ContourCodesStorage us a static size of 0 for their inner BlockStorageI have run the tstandard findContours perf_tests to check that all those changes were not a slowdown for the average case (because the current use case of the PR must not take precedence). It's ok.
mshabunin left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Looks good to me. With minor cosmetic comments.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
d83df66 intoopencv:4.xUh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
This Pull request is still incomplete.
It is an attempt, as suggested by#26775, to restore lost speed when migrating
findContours()implementation from C to C++The patch adds an "Arena" (a pool) of pre-allocated memory so that contours points (and TreeNodes) can be picked from the Arena.
The code of
findContours()is mostly unchanged, the arena usage being implicit through a utility class Arena::Item that provides C++ overloaded operators and construct/destruct logic.As mentioned in#26775, the contour points are allocated and released in order, and can be represented by ranges of indices in their arena. No range subset will be released and drill a hole, that's why the internal representation as a range of indices makes sense.
The TreeNodes use another Arena class that does not comply to that range logic.
Currently, there is a significant improvement of the run-time on the test mentioned in#26775, but it is still far from the
findContours_legacy()performance.Patch to opencv_extra has the same branch name.