cuda4dnn(conv): autotuning for convolution #16900
Conversation
tompollok commented Mar 24, 2020
Great job @YashasSamaga
alalek left a comment
Thank you for the contribution!
| runTensorFlowNet("fp16_eltwise_add_mul",false, l1, lInf); | ||
| runTensorFlowNet("fp16_pad_and_concat",false, l1, lInf); | ||
| runTensorFlowNet("fp16_padding_valid",false, l1, lInf); | ||
| runTensorFlowNet("fp16_padding_valid",false, l1, lInf);std::cout <<"2"; |
Perhaps we should split this test into smaller parts (for accurate detection of failed parts). Maybe in a separate PR, to keep changes atomic.
YashasSamaga commented Mar 26, 2020 • edited
The CUDA backend now performs autotuning on its own, which appears to have resolved the performance degradation problems. It also includes fused convolutions while autotuning, which has improved performance further.

Problem: the initialization time has gone up significantly. It is several seconds for many models, and a few of them take double-digit seconds. Mask RCNN takes 26s to initialize! There are eight algorithms available for convolution, and combinations of these are tried while autotuning.

Not all algorithms are always tried; some fail due to insufficient memory or are not supported in the required configuration. (Detailed summary of autotuning results for many models.) In the worst case, 20 combinations are tried for every convolution layer.

A few algorithms, particularly the FFT-based ones, request enormous amounts of workspace. For example, the FFT tiling algorithm requests an 8GB workspace for convolving a 200KB image. These allocations take seconds: the workspace is cached during profiling to avoid repeated allocations, but even a single allocation of several GBs takes a second or two. The FFT-based algorithms are rarely selected (only once in my testbed of 26 models) as they are very inefficient for small filters and small batches.

Why try unfused cuDNN combinations when fused cuDNN combinations are available?
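For context, the core of such autotuning with the cuDNN 7 API boils down to something like the sketch below. This is a simplified illustration, not the PR's actual code; the helper name, the workspace cap, and the fallback algorithm are my assumptions.

```cpp
#include <cudnn.h>
#include <vector>

// Measure every available forward algorithm on real data and return the
// fastest one that succeeds within the given workspace budget.
cudnnConvolutionFwdAlgo_t autotuneForward(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc, const void* x,
    cudnnFilterDescriptor_t wDesc, const void* w,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnTensorDescriptor_t yDesc, void* y,
    void* workspace, size_t maxWorkspaceBytes)
{
    const int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT; // eight algorithms
    int returned = 0;
    std::vector<cudnnConvolutionFwdAlgoPerf_t> results(requested);

    // Executes and times each algorithm on the actual tensors; algorithms
    // that cannot run (e.g. insufficient workspace) report a failure status.
    cudnnFindConvolutionForwardAlgorithmEx(
        handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
        requested, &returned, results.data(),
        workspace, maxWorkspaceBytes);

    // Results come back sorted by execution time; pick the first usable one.
    for (int i = 0; i < returned; i++)
        if (results[i].status == CUDNN_STATUS_SUCCESS &&
            results[i].memory <= maxWorkspaceBytes)
            return results[i].algo;

    // Fall back to the universally supported implicit GEMM algorithm.
    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
}
```

Capping `maxWorkspaceBytes` is also a natural lever for the workspace problem described above: algorithms that request gigabytes of workspace would simply fail the budget check instead of triggering huge allocations.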
tompollok commented Mar 26, 2020
Would it make sense to add a file next to the model file that persists the tuned configuration for a given piece of hardware? That would save the expensive autotuning when just a single image has to be processed on request; for some tasks, several models have to be applied to a single image.
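One minimal way such a sidecar cache could look is sketched below. Everything here (the file format, the key scheme, the function names) is hypothetical and not part of this PR or of OpenCV.

```cpp
#include <fstream>
#include <map>
#include <string>

// Hypothetical cache: maps "device|conv-shape" keys to an algorithm id,
// e.g. "GTX 1050|N1C3H224W224|K64R3S3" -> 1. Keying on the device name
// ensures tuned results are only reused on the same hardware.
using TuningCache = std::map<std::string, int>;

// Write the cache as tab-separated lines next to the model file.
void saveCache(const TuningCache& cache, const std::string& path)
{
    std::ofstream out(path);
    for (const auto& entry : cache)
        out << entry.first << '\t' << entry.second << '\n';
}

// Read the cache back; a missing file simply yields an empty cache,
// in which case autotuning runs as usual and repopulates it.
TuningCache loadCache(const std::string& path)
{
    TuningCache cache;
    std::ifstream in(path);
    std::string key;
    int algo;
    while (std::getline(in, key, '\t') && (in >> algo) && in.ignore())
        cache[key] = algo;
    return cache;
}
```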
YashasSamaga commented Mar 26, 2020 • edited
For reference, initialization times are around a few hundred milliseconds to a second without this PR. This depends on the device; the numbers I report are for a GTX 1050.

Persisting the tuned configuration would require discussions on the storage format, the API and a lot of other things (and quite a bit of work). An easier temporary solution might be to make autotuning optional (an opt-in feature, i.e. disabled by default). This again would require new API (or we could use a layer param of the conv layer).
alalek commented Mar 26, 2020
A similar approach exists in the OpenCL (ocl4dnn) backend.
YashasSamaga commented May 11, 2020 • edited
What about having an API for loading, saving and optimizing model configurations? Both OpenCL and CUDA could use the same API instead of having individual backend-specific APIs. I think only three functions are required (and maybe some user-friendly overloads).

The configuration could be stored in protobuf format. The format can be kept internal. It can also carry an internal version, which can be used to track the format across releases. This will allow printing user-friendly messages when a stored configuration is incompatible, or when there is an update and the optimizer must be run again for additional benefits.
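A hypothetical sketch of what such a backend-agnostic surface might look like; all names here are invented for illustration, since the comment does not spell out the three functions:

```cpp
#include <opencv2/dnn.hpp>

namespace cv { namespace dnn {

// Run autotuning for the currently selected backend/target and record
// the chosen per-layer configurations inside the network.
void optimizeNet(Net& net);

// Serialize the recorded configurations (e.g. as a versioned, internal
// protobuf blob) so that a later run can skip autotuning entirely.
void saveNetConfiguration(const Net& net, const String& path);

// Restore previously tuned configurations; should fail gracefully with a
// user-friendly message if the stored version is incompatible.
void loadNetConfiguration(Net& net, const String& path);

}} // namespace cv::dnn
```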
tompollok commented May 12, 2020
The API proposal sounds reasonable to me, as it makes optimized vs. non-optimized versions dynamically selectable at runtime and transparent to users. I would also vote against the use of environment variables.
YashasSamaga commented Jul 28, 2020 • edited
cuDNN 8 has a brand-new API which offers better autotuning capabilities. I will make a PR supporting cuDNN 8's new backend API soon, and then a PR for autotuning.
lorenzolightsgdwarf commented Dec 23, 2021
@YashasSamaga thanks for the work! On my net (a custom version of SSD MobileNet v2) I haven't seen any improvement. I'm using CUDA 10.2 and cuDNN 7.6.2. Is there any flag/environment variable that enables the autotuning? Or is there something that disables it? Best!
YashasSamaga commented Dec 24, 2021
It's always enabled in this PR. However, this PR hasn't been merged into master, so you need to build this PR's branch to use the autotuning facility. Related: #20966
Statistics as of May 14th, 2020:
- Devices:
- benchmarks without this PR
- benchmarks with this PR
- autotuning algorithm selections
- convolution configurations
NOTE: Not up-to-date.
Device: GTX 1050
Every model which uses depthwise convolutions has seen an improvement.
Most of the Mask RCNN improvement comes from a single convolution layer where the algorithm chosen by heuristics took 40ms, whereas the best algorithm took just 20ms.
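For reference, the heuristic path corresponds roughly to cuDNN's "get" API, which ranks algorithms from an internal performance model without executing anything, which is why it can misjudge cases like the one above. The sketch below mirrors the Find-based one earlier and shares the same assumptions (illustrative only, not the backend's exact code):

```cpp
#include <cudnn.h>
#include <vector>

// Pick a forward algorithm heuristically: no kernels are launched, so
// selection is instant but may not match the measured fastest algorithm.
cudnnConvolutionFwdAlgo_t heuristicForward(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc,
    cudnnFilterDescriptor_t wDesc,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnTensorDescriptor_t yDesc,
    size_t maxWorkspaceBytes)
{
    const int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
    int returned = 0;
    std::vector<cudnnConvolutionFwdAlgoPerf_t> results(requested);

    // Ranks algorithms using cuDNN's built-in performance model.
    cudnnGetConvolutionForwardAlgorithm_v7(
        handle, xDesc, wDesc, convDesc, yDesc,
        requested, &returned, results.data());

    for (int i = 0; i < returned; i++)
        if (results[i].status == CUDNN_STATUS_SUCCESS &&
            results[i].memory <= maxWorkspaceBytes)
            return results[i].algo;

    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
}
```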
Pending:
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.