cuda4dnn(conv): autotuning for convolution #16900
Conversation
tompollok commented Mar 24, 2020
Great job @YashasSamaga
alalek left a comment
Thank you for the contribution!
| runTensorFlowNet("fp16_eltwise_add_mul",false, l1, lInf); | ||
| runTensorFlowNet("fp16_pad_and_concat",false, l1, lInf); | ||
| runTensorFlowNet("fp16_padding_valid",false, l1, lInf); | ||
| runTensorFlowNet("fp16_padding_valid",false, l1, lInf);std::cout <<"2"; |
Perhaps we should split this test into smaller parts (for accurate detection of failed parts). Maybe in a separate PR, to keep changes atomic.
YashasSamaga commented Mar 26, 2020 • edited
The CUDA backend now performs autotuning on its own, which appears to have resolved the performance degradation problems. It also includes fused convolutions while autotuning, which has improved performance further.

Problem: the initialization time has gone up significantly. It is several seconds for many models, and a few of them take double-digit seconds. Mask RCNN takes 26s to initialize! There are eight algorithms available for convolution, and combinations of these are tried while autotuning.

Not all algorithms are always tried; some fail due to insufficient memory or are not supported in the required configuration. (Detailed summary of autotuning results for many models.) In the worst case, 20 combinations are tried for every convolution layer.

A few algorithms, particularly the FFT-based ones, request enormous amounts of workspace. For example, the FFT tiling algorithm requests an 8GB workspace for convolving a 200KB image. These allocations take seconds: the workspace is cached during profiling to avoid repeated allocations, but even a single allocation of several GBs takes a second or two. The FFT-based algorithms are rarely selected (only once in my testbed of 26 models) as they are very inefficient for small filters and small batches.

Why try unfused cuDNN combinations when fused cuDNN combinations are available?
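For context, the core of such autotuning with the cuDNN 7 API boils down to something like the sketch below. This is a simplified illustration, not the PR's actual code; the helper name, the workspace cap, and the fallback algorithm are my assumptions.

```cpp
#include <cudnn.h>
#include <vector>

// Measure every available forward algorithm on real data and return the
// fastest one that succeeds within the given workspace budget.
cudnnConvolutionFwdAlgo_t autotuneForward(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc, const void* x,
    cudnnFilterDescriptor_t wDesc, const void* w,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnTensorDescriptor_t yDesc, void* y,
    void* workspace, size_t maxWorkspaceBytes)
{
    const int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT; // eight algorithms
    int returned = 0;
    std::vector<cudnnConvolutionFwdAlgoPerf_t> results(requested);

    // Executes and times each algorithm on the actual tensors; algorithms
    // that cannot run (e.g. insufficient workspace) report a failure status.
    cudnnFindConvolutionForwardAlgorithmEx(
        handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
        requested, &returned, results.data(),
        workspace, maxWorkspaceBytes);

    // Results come back sorted by execution time; pick the first usable one.
    for (int i = 0; i < returned; i++)
        if (results[i].status == CUDNN_STATUS_SUCCESS &&
            results[i].memory <= maxWorkspaceBytes)
            return results[i].algo;

    // Fall back to the universally supported implicit GEMM algorithm.
    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
}
```

Capping `maxWorkspaceBytes` is also a natural lever for the workspace problem described above: algorithms that request gigabytes of workspace would simply fail the budget check instead of triggering huge allocations.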
tompollok commented Mar 26, 2020
Would it make sense to add a file next to the model file that persists the tuned configuration for a given piece of hardware? That would save the expensive autotuning when just a single image has to be processed on request; for some tasks, several models have to be applied to a single image.
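One minimal way such a sidecar cache could look is sketched below. Everything here (the file format, the key scheme, the function names) is hypothetical and not part of this PR or of OpenCV.

```cpp
#include <fstream>
#include <map>
#include <string>

// Hypothetical cache: maps "device|conv-shape" keys to an algorithm id,
// e.g. "GTX 1050|N1C3H224W224|K64R3S3" -> 1. Keying on the device name
// ensures tuned results are only reused on the same hardware.
using TuningCache = std::map<std::string, int>;

// Write the cache as tab-separated lines next to the model file.
void saveCache(const TuningCache& cache, const std::string& path)
{
    std::ofstream out(path);
    for (const auto& entry : cache)
        out << entry.first << '\t' << entry.second << '\n';
}

// Read the cache back; a missing file simply yields an empty cache,
// in which case autotuning runs as usual and repopulates it.
TuningCache loadCache(const std::string& path)
{
    TuningCache cache;
    std::ifstream in(path);
    std::string key;
    int algo;
    while (std::getline(in, key, '\t') && (in >> algo) && in.ignore())
        cache[key] = algo;
    return cache;
}
```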
YashasSamaga commented Mar 26, 2020 • edited
For reference, initialization times are around a few hundred milliseconds to a second without this PR. This depends on the device; the numbers I report are for a GTX 1050.

Persisting the tuned configuration would require discussions on the storage format, the API and a lot of other things (and quite a bit of work). An easier temporary solution might be to make autotuning optional (an opt-in feature, i.e. disabled by default). This again would require new API (or we could use a layer param of the conv layer).
alalek commented Mar 26, 2020
A similar approach exists in the OpenCL (ocl4dnn) backend.
YashasSamaga commented May 11, 2020 • edited
What about having an API for loading, saving and optimizing model configurations? Both OpenCL and CUDA could use the same API instead of having individual backend-specific APIs. I think only three functions are required (and maybe some user-friendly overloads).

The configuration could be stored in protobuf format. The format can be kept internal. It can also carry an internal version, which can be used to track the format across releases. This will allow printing user-friendly messages when a stored configuration is incompatible, or when there is an update and the optimizer must be run again for additional benefits.
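A hypothetical sketch of what such a backend-agnostic surface might look like; all names here are invented for illustration, since the comment does not spell out the three functions:

```cpp
#include <opencv2/dnn.hpp>

namespace cv { namespace dnn {

// Run autotuning for the currently selected backend/target and record
// the chosen per-layer configurations inside the network.
void optimizeNet(Net& net);

// Serialize the recorded configurations (e.g. as a versioned, internal
// protobuf blob) so that a later run can skip autotuning entirely.
void saveNetConfiguration(const Net& net, const String& path);

// Restore previously tuned configurations; should fail gracefully with a
// user-friendly message if the stored version is incompatible.
void loadNetConfiguration(Net& net, const String& path);

}} // namespace cv::dnn
```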
tompollok commented May 12, 2020
The API proposal sounds reasonable to me, as it makes optimized vs. non-optimized versions dynamically selectable at runtime and transparent to users. I would also vote against the use of environment variables.
YashasSamaga commented Jul 28, 2020 • edited
cuDNN 8 has a brand-new API which offers better autotuning capabilities. I will make a PR supporting cuDNN 8's new backend API soon, and then a PR for autotuning.
lorenzolightsgdwarf commented Dec 23, 2021
@YashasSamaga thanks for the work! On my net (a custom version of SSD MobileNet v2) I haven't seen any improvement. I'm using CUDA 10.2 and cuDNN 7.6.2. Is there any flag/environment variable that enables the autotuning? Or is there something that disables it? Best!
YashasSamaga commented Dec 24, 2021
It's always enabled in this PR. However, this PR hasn't been merged into master, so you need to build this PR's branch to use the autotuning facility. Related: #20966
Statistics as of May 14th, 2020:
- Devices:
- benchmarks without this PR
- benchmarks with this PR
- autotuning algorithm selections
- convolution configurations
NOTE: Not up-to-date.
Device: GTX 1050
Every model which uses depthwise convolutions has seen an improvement.
Most of the Mask RCNN improvement comes from a single convolution layer where the algorithm chosen by heuristics took 40ms, whereas the best algorithm took just 20ms.
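For reference, the heuristic path corresponds roughly to cuDNN's "get" API, which ranks algorithms from an internal performance model without executing anything, which is why it can misjudge cases like the one above. The sketch below mirrors the Find-based one earlier and shares the same assumptions (illustrative only, not the backend's exact code):

```cpp
#include <cudnn.h>
#include <vector>

// Pick a forward algorithm heuristically: no kernels are launched, so
// selection is instant but may not match the measured fastest algorithm.
cudnnConvolutionFwdAlgo_t heuristicForward(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc,
    cudnnFilterDescriptor_t wDesc,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnTensorDescriptor_t yDesc,
    size_t maxWorkspaceBytes)
{
    const int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
    int returned = 0;
    std::vector<cudnnConvolutionFwdAlgoPerf_t> results(requested);

    // Ranks algorithms using cuDNN's built-in performance model.
    cudnnGetConvolutionForwardAlgorithm_v7(
        handle, xDesc, wDesc, convDesc, yDesc,
        requested, &returned, results.data());

    for (int i = 0; i < returned; i++)
        if (results[i].status == CUDNN_STATUS_SUCCESS &&
            results[i].memory <= maxWorkspaceBytes)
            return results[i].algo;

    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
}
```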
Pending:
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.