dnn: add layer normalization for vision transformers#23047


Merged

alalek merged 12 commits into opencv:4.x from fengyuentau:layer_norm on Jan 27, 2023

Conversation

@fengyuentau (Member) commented on Dec 28, 2022 (edited):

Merge with opencv/opencv_extra#1032

  • add layer norm onnx parser
  • add layer norm impl
  • add layer norm onnx simplifier for both cases of constants being Constant and Initializer
  • add test model generation code for layer_norm_expanded and layer_norm_expanded_initializer
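For reference, the computation the new layer performs can be sketched as a standalone function. This is a minimal sketch of the LayerNormalization math only; the function name, buffer layout, and loop structure are illustrative, not the actual OpenCV implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalize each row of a (rows x normSize) buffer over its last axis:
// y = (x - mean) / sqrt(var + epsilon) * scale + bias
void layerNorm(const std::vector<float>& src, std::vector<float>& dst,
               const std::vector<float>& scale, const std::vector<float>& bias,
               size_t rows, size_t normSize, float epsilon)
{
    dst.resize(src.size());
    for (size_t r = 0; r < rows; ++r)
    {
        const float* x = src.data() + r * normSize;
        float* y = dst.data() + r * normSize;

        // mean over the normalized axis
        float mean = 0.f;
        for (size_t i = 0; i < normSize; ++i) mean += x[i];
        mean /= normSize;

        // biased variance over the normalized axis
        float var = 0.f;
        for (size_t i = 0; i < normSize; ++i) var += (x[i] - mean) * (x[i] - mean);
        var /= normSize;

        // std::max guards against a tiny negative value before std::sqrt
        const float invStdDev = 1.f / std::sqrt(std::max(var + epsilon, 0.f));
        for (size_t i = 0; i < normSize; ++i)
            y[i] = (x[i] - mean) * invStdDev * scale[i] + bias[i];
    }
}
```

For the benchmark input of size 1x50x768 normalized over the last axis, `rows` would be 50 and `normSize` 768.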

Benchmark:

| Layer | Mean (ms) | Median (ms) | Min (ms) |
|---|---|---|---|
| layer norm expanded | 0.43 | 0.42 | 0.40 |
| layer norm (this PR) | 0.02 | 0.02 | 0.01 |

*: tested with size 1x50x768 on Apple M1.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Linux OpenCL

@fengyuentau (Member, Author) commented:

@rogday Could you review this pull request if possible?

@rogday (Member) left a comment (edited):

Thank you for the contribution! LGTM 👍

@alalek (Member) left a comment:

Left some optimization and minor comments.

```cpp
std::vector<Mat> inputs, outputs;
inputs_arr.getMatVector(inputs);
outputs_arr.getMatVector(outputs);
const int nstripes = getNumThreads();
```

@alalek (Member) commented:

> const int nstripes = getNumThreads();

This doesn't look like a reliable design.
This scheme assumes that all threads have the same speed and are never interrupted.
That is not true for OSes with preemptive execution (all widely used OSes of the last 25+ years); some cores may be handling background tasks or interrupts.

It is also not true at all for CPUs with a big+little design (modern ARM, Intel CPUs with P+E cores).

nstripes should be based on the subtask's "grain size" (subtask time >>> scheduling overhead) instead of the number of available threads.

Some information is available here: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Controlling_Chunking_os.html
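To illustrate the suggestion, here is a minimal sketch of deriving a stripe count from the amount of work rather than from the thread count. The helper name and the choice of grain size are hypothetical, not OpenCV API:

```cpp
#include <cassert>
#include <cstddef>

// Pick a stripe count so that each stripe covers roughly grainSize elements.
// With many small stripes, fast (big) cores naturally pick up more stripes
// than slow (little) cores, so no assumption about thread speed is needed.
size_t stripesForWork(size_t totalElems, size_t grainSize)
{
    if (grainSize == 0) grainSize = 1;   // guard against a degenerate grain
    size_t nstripes = totalElems / grainSize;
    return nstripes > 0 ? nstripes : 1;  // always schedule at least one stripe
}
```

The grain size should be chosen so that one stripe's work dominates the scheduling overhead; too small a grain wastes time on scheduling, too large a grain loses load balancing.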

@fengyuentau (Member, Author) commented on Jan 17, 2023 (edited):

> It is also not true at all for CPUs with a big+little design (modern ARM, Intel CPUs with P+E cores).

And this is why I happened to fix the OpenMP issue on macOS. I was benchmarking vision transformers on ONNX Runtime, MNN and OpenCV dnn the other day, and found that both ONNX Runtime and MNN run with 4 threads on my Apple M1 by default, while OpenCV dnn uses all 8 threads instead. Setting numThreads is not possible with GCD, and when I tried OpenMP I ran into build issues.

I think we should somehow improve the multi-threading functionality of OpenCV so that it detects big+little designs and returns the numThreads of the big cores if possible. cv::Range would also need to support a step in order to work with a "grain size"...

@alalek (Member) commented:

> I think we should somehow improve the multi-threading functionality of OpenCV so that it detects big+little designs and returns the numThreads of the big cores if possible

It doesn't really look like an improvement.

E.g., in the case of ARM-based phones we already have configurations with 2 big + 6 little cores.

Again, we should not assume or rely on threads having the same performance, or on subtasks having equal complexity.

@fengyuentau (Member, Author) commented:

I am not sure what exactly you are asking for here. Basically every other `**Invoker` in other layers uses the same strategy. If we want to take care of big+little-core CPUs, I think that should be another pull request.

@alalek (Member) left a comment:

Thank you for the update!


…square division outside of loop; use std::max to ensure positive value before std::sqrt
@alalek (Member) left a comment:

Refactored references and parallel_for usage.


```cpp
TEST_P(Test_ONNX_layers, LayerNorm)
{
    testONNXModels("test_layer_normalization_2d_axis0", pb, 0, 0, false, true, 3);
```
@alalek (Member) commented:

There are many error messages during the model import:

```
[ RUN      ] Test_ONNX_layers.LayerNorm/0, where GetParam() = OCV/CPU
[ INFO:0@132.156] global onnx_importer.cpp:831 populateNet DNN/ONNX: loading ONNX v8 model produced by 'backend-test'. Number of nodes = 1, initializers = 0, inputs = 3, outputs = 3
[ INFO:0@132.156] global onnx_importer.cpp:725 parseOperatorSet DNN/ONNX: ONNX opset version = 17
[ INFO:0@132.156] global onnx_importer.cpp:997 handleNode DNN/ONNX: processing node with 3 inputs and 3 outputs: [LayerNormalization]:(onnx_node_output_0!Y) from domain='ai.onnx'
[ERROR:0@132.156] global onnx_importer.cpp:924 populateNet DNN/ONNX: can't find layer for output name: 'Mean'. Does model imported properly?
[ERROR:0@132.156] global onnx_importer.cpp:924 populateNet DNN/ONNX: can't find layer for output name: 'InvStdDev'. Does model imported properly?
```

We should not have them.

@fengyuentau (Member, Author) commented on Jan 21, 2023 (edited):

The reasons why they exist:

  1. These ONNX models are taken from the ONNX conformance tests. I think it is better not to modify them.
  2. I took a look at the opencv-onnx functionality and did not find a way to remove them completely in our ONNX importer:

```cpp
// Remove additional outputs (Mean, InvStdDev)
if (node_proto.output_size() > 1)
{
    auto outputName = node_proto.output(0);
    opencv_onnx::NodeProto node_proto_ = node_proto;
    node_proto_.clear_output();
    node_proto_.add_output(outputName);
    addLayer(layerParams, node_proto_);
}
```

@rogday Do you happen to know how to remove optional (node & graph) output in ONNX importer?

@fengyuentau (Member, Author) commented:

I removed the optional outputs from those ONNX models in the end. It turned out not to be straightforward to modify the outputs of an ONNX graph proto.

```cpp
CV_CheckTypeEQ(src.type(), dst.type(), "");
CV_Assert(scale.isContinuous());

CV_CheckGE(epsilon, 0.0f, "");
```
@alalek (Member) commented:

Added this extra check

Comment on lines 107 to 108:

```cpp
double nstripes = ((size_t)p.total * p.normSize) * (1 / 1024.0);
parallel_for_(Range(0, p.total), p, nstripes);
```
@fengyuentau (Member, Author) commented on Jan 21, 2023 (edited):

Thanks for the change! I learned a lot!

So if I understand correctly, you make the grain size small enough that both big and little cores can be used, and the big cores naturally take more jobs. What about the "magic number" 1024? Was it taken from the "bathtub curve" in this link?

@alalek (Member) commented:

The parallel_for() strategy should rely on the subtask size and the scheduling overhead. 1024 is an empirical number here which specifies the size of a subtask.
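As a worked example of the quoted formula (assuming, as the surrounding code suggests, that `p.total` is the number of rows to normalize and `p.normSize` the per-row length; the helper name here is illustrative):

```cpp
#include <cassert>
#include <cstddef>

// The quoted snippet computes nstripes = total * normSize / 1024, i.e. one
// stripe per ~1024 elements of work. 1 / 1024.0 is exact in binary floating
// point, so for the benchmark input (1x50x768) this yields 37.5 stripes.
double nstripesFor(size_t total, size_t normSize)
{
    return (double)(total * normSize) * (1 / 1024.0);
}
```

With 37.5 stripes over Range(0, 50), each stripe covers one or two rows, which is small enough for the scheduler to balance work across cores of different speeds.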

@fengyuentau (Member, Author) commented:

Okay, thanks again; I benefited a lot from this. Another question: why multiply by p.normSize? Another empirical choice? I tried without it and the result is about twice as slow.

By the way, I found that all the other `**Invoker`s use the same strategy, assuming all threads have the same performance. I think they need to be upgraded as well.

@fengyuentau (Member, Author) commented:

@alalek What is the status of this pull request? Could we take a step forward?

@alalek merged commit 4d918ba into opencv:4.x on Jan 27, 2023
@alalek mentioned this pull request on Jan 28, 2023
@fengyuentau mentioned this pull request on Mar 14, 2023 (48 tasks)
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request on Mar 30, 2023:

dnn: add layer normalization for vision transformers
  • add layer norm onnx parser, impl and tests
  • add onnx graph simplifier for layer norm expanded
  • handle the case when constants are of type Initializer
  • add test case for layer norm expanded with initializers
  • use CV_Assert & CV_CheckType in place of CV_Assert_N; use forward_fallback for OCL_FP16
  • use const ref / ref in parameters of invoker::run; extract inner const if from nested loop; use size_t in place of ull
  • template hasBias
  • remove trailing whitespace
  • use pointer parameter with null check; move normSize division & mean_square division outside of loop; use std::max to ensure positive value before std::sqrt
  • refactor implementation, optimize parallel_for
  • disable layer norm expanded
  • remove the removal of layer norm optional outputs
@asmorkalov mentioned this pull request on May 31, 2023
geversonsto pushed a commit to stodev-com-br/opencv that referenced this pull request on Jun 3, 2023: dnn: add layer normalization for vision transformers
@dkurt mentioned this pull request on Aug 4, 2023 (9 tasks)
@fengyuentau deleted the layer_norm branch on October 19, 2023 07:34

Reviewers

@alalek approved these changes
@rogday approved these changes

Assignees

@rogday

Projects

None yet

Milestone

4.8.0

Development

Successfully merging this pull request may close these issues.

3 participants

@fengyuentau @alalek @rogday
