ENH: enable multi-platform SIMD compiler optimizations #13516


Merged

mattip merged 8 commits into numpy:master from seiko2plus:core_improve_infa_build on Jun 17, 2020

Conversation

@seiko2plus (Member) commented May 9, 2019 (edited)

This pull request changes:


New distutils class `CCompilerOpt`

Used for handling CPU/hardware optimizations: it parses the command arguments, manages the relation between the CPU baseline and the dispatch-able features, generates the required C headers, and finally compiles the sources with the proper compiler flags.

`CCompilerOpt` doesn't provide runtime detection of CPU features; it focuses only on the compiler side, but it creates abstract C headers that can be used later for the final runtime dispatching process.


New build arguments

  • `--cpu-baseline`: minimal set of required optimizations. The default value is `min`, which provides the minimum CPU features that can safely run on a wide range of user platforms.
  • `--cpu-dispatch`: dispatched set of additional optimizations. The default value is `max -xop -fma4`, which enables all CPU features except for the AMD legacy ones.

The new arguments can be reached through `build`, `build_clib`, and `build_ext`. If `build_clib` or `build_ext` is not specified by the user, the arguments of `build` will be used instead; `build` also holds the default values.

Both `--cpu-baseline` and `--cpu-dispatch` accept the following CPU features:

X86: SSE, SSE2, SSE3, SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP, FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX, AVX512_CLX, AVX512_CNL, AVX512_ICL

IBM/Power64: VSX, VSX2, VSX3

ARM7/8: NEON, NEON_FP16, NEON_VFPV4, ASIMD, ASIMDHP, ASIMDDP, ASIMDFHM

Other acceptable options:

  • MIN: enables the minimum CPU features that can safely run on a wide range of user platforms:
    X86: SSE, SSE2
    X86_64: SSE, SSE2, SSE3
    IBM/Power64 (big-endian): nothing
    IBM/Power64 (little-endian): VSX, VSX2
    ARM7: nothing
    ARM8 (64-bit): NEON, NEON_FP16, NEON_VFPV4, ASIMD

  • MAX: enables all CPU features supported by the compiler and platform.

  • NATIVE: enables all CPU features supported by the current machine; this option relies on the compiler's native flags (-march=native, -xHost, /QxHost).

  • NONE: enables nothing.

  • Operands +/-: add or remove features; useful with the options MAX, MIN, and NATIVE.
    NOTE: the operand + is only accepted for nominal reasons. For example:

    • --cpu-baseline="min avx2" is equivalent to --cpu-baseline="min + avx2"
    • --cpu-baseline="min,avx2" is equivalent to --cpu-baseline="min,+avx2"
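As a rough illustration of how such option strings can be tokenized (this is a hypothetical sketch, not the actual `CCompilerOpt` parser), the separator and operand rules above could look like:

```python
import re

def parse_opt_string(s):
    # Hypothetical sketch: tokens are case-insensitive, separated by commas
    # or whitespace, with optional +/- operands; a bare "+" or "-" attaches
    # to nothing and is ignored, so "min + avx2" equals "min avx2".
    add, remove = [], []
    for tok in re.split(r"[,\s]+", s.strip()):
        if not tok:
            continue
        op, name = (tok[0], tok[1:]) if tok[0] in "+-" else ("+", tok)
        name = name.upper()
        if not name:
            continue
        (add if op == "+" else remove).append(name)
    return add, remove
```

For example, `parse_opt_string("max -xop -fma4")` separates the MAX request from the two removed AMD legacy features.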

NOTES:

  • All CPU features and other options are case-insensitive.
  • Comma or space can be used as a separator, e.g.
    --cpu-dispatch="avx2 avx512f" and --cpu-dispatch="avx2, avx512f" are both applicable.
  • If a CPU feature is not supported by the user's platform or compiler, it will be skipped rather than
    raising a fatal error.
  • Any CPU feature specified in --cpu-dispatch will be skipped if it is part of the CPU baseline features.
  • The argument --cpu-baseline force-enables implied features,
    e.g. --cpu-baseline="sse42" is equivalent to --cpu-baseline="sse sse2 sse3 ssse3 sse41 popcnt sse42".
  • The value of --cpu-baseline will be treated as "native" if a compiler native flag (-march=native, -xHost, or /QxHost) is enabled through the environment variable CFLAGS.
  • The user should always check the final report in the build log to verify the enabled features.
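The force-enabling of implied features mentioned above can be sketched as a transitive closure over an implication table. The table below is a hypothetical excerpt covering only the x86 chain from the sse42 example; the real table in `CCompilerOpt` covers every architecture:

```python
# Hypothetical partial implication table (illustrative only).
IMPLIES = {
    "SSE":    [],
    "SSE2":   ["SSE"],
    "SSE3":   ["SSE2"],
    "SSSE3":  ["SSE3"],
    "SSE41":  ["SSSE3"],
    "POPCNT": ["SSE41"],
    "SSE42":  ["POPCNT"],
}

def expand(feature):
    """Return the feature together with every feature it transitively implies."""
    result, stack = set(), [feature]
    while stack:
        f = stack.pop()
        if f not in result:
            result.add(f)
            stack.extend(IMPLIES.get(f, []))
    return result
```

With this sketch, `expand("SSE42")` yields the same seven features as the `--cpu-baseline="sse42"` example.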

Explicitly disable the new infrastructure

Add a new command argument --disable-optimization to explicitly disable the whole new infrastructure; it also adds a new compiler definition called NPY_DISABLE_OPTIMIZATION.

When --disable-optimization is enabled, the dispatch-able sources (*.dispatch.c, explained in this comment) will be treated as normal C sources. Because of this, any C header generated by CCompilerOpt must be guarded with NPY_DISABLE_OPTIMIZATION; otherwise, it will definitely break the build.
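The guard requirement above amounts to wrapping any generated C content in a preprocessor conditional on NPY_DISABLE_OPTIMIZATION. A hypothetical helper (the macro name is from this PR; the function itself is illustrative, not part of `CCompilerOpt`) could emit it like this:

```python
def guard_generated_header(body, macro="NPY_DISABLE_OPTIMIZATION"):
    # Hypothetical sketch: wrap generated C content so that builds made with
    # --disable-optimization (which defines the macro) skip the SIMD
    # definitions entirely instead of breaking the build.
    return "#ifndef {m}\n{b}\n#endif /* !{m} */\n".format(m=macro, b=body)
```

A header body produced this way compiles to nothing when the macro is defined, which is exactly the fallback behavior the flag requires.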


New auto-generated C header (core/src/common/_cpu_dispatch.h)

The new header contains all the definitions and headers of the instruction sets for the CPU baseline and dispatch-able features that have been enabled through the command arguments --cpu-baseline and --cpu-dispatch.

NPY_HAVE_ is the prefix of the CPU feature definitions, e.g. NPY_HAVE_SSE2, NPY_HAVE_VSX, NPY_HAVE_NEON.
The new header can be included inside normal C files or dispatch-able C sources (explained below);
however, if it is included inside normal C files, it will only provide the definitions and headers of the instruction sets for the CPU baseline features. For example:

```c
#include "_cpu_dispatch.h"
#ifdef NPY_HAVE_AVX2
// this branch will only be enabled if AVX2 is enabled through --cpu-baseline="avx2"
#endif
```

NOTE: It should not be used directly, but through another new header called npy_cpu_dispatch.h.


New C header (core/src/common/npy_cpu_dispatch.h)

This header contains all the utilities required for the whole CPU dispatching process; it can also be considered a bridge linking the new infrastructure with NumPy's CPU runtime detection (#13421).


New CPU dispatcher solution (dispatch-able sources)

Explained in this comment.


Add new attributes to the umath module

  • __cpu_baseline__: a list of the CPU baseline feature names configured by --cpu-baseline

  • __cpu_dispatch__: a list of the CPU dispatch feature names configured by --cpu-dispatch

The new attributes contain the finally enabled CPU features that are supported by the platform.


Print the supported CPU features during the run of PytestTester

This could be one of the easiest ways to trace the enabled features during the run of the unit tests.

An example on GCC 8.3.0 (64-bit), i7-8550U:

NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CLX? AVX512_CNL? AVX512_ICL?

The format is explained as follows:

  • Dispatched features supported by the running machine end with *.

  • Dispatched features that are not supported by the running machine end with ?.

  • The remaining features represent the baseline.

  • If any features are missing, it is because they are:

    • not supported by the compiler or platform, or
    • not configured through the command arguments --cpu-baseline and --cpu-dispatch.
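The star/question-mark notation above is simple enough to split mechanically; the following hypothetical parser (not part of NumPy) illustrates the three categories:

```python
def classify(report):
    # Hypothetical parser for the "NumPy CPU features: ..." line:
    # "*" marks a dispatched feature supported by the running machine,
    # "?" marks a dispatched feature the machine lacks, and unmarked
    # tokens are the baseline.
    baseline, dispatched_ok, dispatched_missing = [], [], []
    for tok in report.split(":", 1)[1].split():
        if tok.endswith("*"):
            dispatched_ok.append(tok[:-1])
        elif tok.endswith("?"):
            dispatched_missing.append(tok[:-1])
        else:
            baseline.append(tok)
    return baseline, dispatched_ok, dispatched_missing
```

Feeding it the GCC 8.3.0 example line would place SSE/SSE2/SSE3 in the baseline, SSSE3 through AVX2 in the supported dispatched set, and the AVX512 variants in the unsupported set.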

@seiko2plus force-pushed the core_improve_infa_build branch 2 times, most recently from 026a1b2 to 1c24bd0 on May 11, 2019
@seiko2plus force-pushed the core_improve_infa_build branch 5 times, most recently from 2561efa to 660f621 on May 18, 2019
@seiko2plus force-pushed the core_improve_infa_build branch 2 times, most recently from e3f7603 to 95ffca3 on May 26, 2019
@seiko2plus force-pushed the core_improve_infa_build branch from 95ffca3 to 9e05278 on June 4, 2019
@seiko2plus force-pushed the core_improve_infa_build branch 6 times, most recently from 531f6a0 to 582cb46 on July 1, 2019
@mattip (Member)

I am concerned all this will become a maintenance burden moving forward. Is there a way to use this as a library or a subrepo?

@edelsohn

What does this comment about library or subrepo mean? This sounds very hostile to NumPy supporting other architectures.

@rgommers (Member)

> I am concerned all this will become a maintenance burden moving forward. Is there a way to use this as a library or a subrepo?

@mattip this is quite clearly addressed in the discussion that spurred this PR: gh-13393. And we also discussed it with the Steering Council before saying yes to posting the bounty mentioned in gh-13393. So yes, it's a concern, but if things are done the right way there's no reason that this should substantially increase the maintenance burden. Please review that issue and comment on code or ideas that are potentially problematic; no need to start a new abstract discussion.

> This sounds very hostile to NumPy supporting other architectures.

@edelsohn that's certainly not our intent as a project. We'd really like to have complete SIMD support for PowerPC as well as other architectures, with the compiler doing most of the heavy lifting.

@rgommers (Member)

That said, this is quite a bit of code. I believe a lot of it is well-tested though, taken over from OpenCV (correct me if I'm wrong @seiko2plus). Some more context in the issue description would be useful:

  • there's a TODO for an example
  • this contains both build-time and run-time features. Does that change the situation we have today? IIRC we wanted to move more to run-time; is that achieved?
  • what parts are and aren't covered by CI? (we have PPC on TravisCI, were thinking about ARM via drone.io I believe, and conda-forge has ARM too)

@charris (Member)

> What does this comment about library or subrepo mean?

It is simply another way to organize the code. NumPy already has two submodules that provide tools to build the documentation; putting the code in a submodule allows NumPy to use the code while letting development proceed at its own pace. Likewise, NumPy builds several libraries for the sorting and math code, probably not strictly necessary these days, but routines in the libraries are linked in when needed and, strictly speaking, could be used by downstream projects. You have a number of new files dedicated to this work, so it might be nice to put them together in subdirectories, maybe organized by architecture. Note that the libraries have their own setup and include files and configure the system. So the question here is how we should organize the code, not whether we should have it.


@rkern (Member)

It seems to me, after a cursory inspection, that how you want to do the distutils integration might be dispositive. It would be nice to package this up into a separate library with its own distutils integration tooling, so that I could write other packages that use this code without needing to use numpy.distutils. numpy.distutils works well for numpy and scipy, but we try not to make it necessary for packages that just want to build against numpy's functionality.

On the other hand, I recognize that the distutils integration required is fairly significant, and welding together two significant distutils augmentations is often fraught, so I can see an argument for avoiding it and just making this one more feature that numpy.distutils alone provides, like f2py. If that's the case, then numpy/numpy is the right place. There isn't much else that would be profitable to break out into a separate package.

One thing to note is that we ought to be able to use this infrastructure to build another package that uses CPU features that the numpy binary itself was not compiled to use. I don't see why that would not be the case in this PR, but I just wanted it stated explicitly. Organizing this as a separate library may help enforce that when future modifications are made.

@seiko2plus (Member, Author)

@mattip,

> I am concerned all this will become a maintenance burden moving forward.

The maintenance burden is already moving forward; I think what I'm doing is just trying to ease it.

> Is there a way to use this as a library or a subrepo?

I'm afraid not, at least for now, because the universal intrinsics (mapped intrinsics) are going to be heavily expanded depending on need, and I think they should be part of NumPy's core. Later we should forbid, or at least reduce, the direct use of raw SIMD intrinsics and rely only on universal intrinsics for CPU optimizations.


@edelsohn,

> This sounds very hostile to NumPy supporting other architectures.

Actually, most of the work here is for x86, due to the need to deal with many SIMD extensions, compilers, and platforms, which makes it somewhat complicated. I don't see VSX or even NEON creating any obstacles here. Supporting big-endian mode for VSX may be a little challenging, but I'm planning to implement it in a separate pull request after I finish re-implementing the current optimized kernels using the new SIMD interface.


@rgommers,

> That said, this is quite a bit of code.

Yes; I had to add some fundamental universal intrinsics just for testing purposes, but I will certainly add a lot more depending on need.

> I believe a lot of it is well-tested though

I believe we shouldn't put our trust in any intrinsic, even if it maps directly to a native instruction, since compilers aren't developed by angels; that may be why we should try to cover all the used intrinsics with unit tests.

> taken over from OpenCV (correct me if I'm wrong @seiko2plus)

As I mentioned before in #13393, the whole roadmap is inspired by OpenCV, though not exactly the same design or code; still, the similarity is very high, especially in the infrastructure.

> there's a TODO for an example

Of course, I won't be able to cover all areas in this PR, but I think it's necessary to provide a decent example for illustrative and testing purposes. Please feel free to add any extra tasks.

> what parts are and aren't covered by CI (we have PPC on TravisCI, were thinking about ARM via drone.io I believe, and conda-forge has ARM too)?

If necessary, having a custom CI using Buildbot could be an alternative solution too.


@charris,

> You have a number of new files dedicated to this work, so it might be nice to put them together in subdirectories, maybe organized by architecture. Note that the libraries have their own setup and include files and configure the system.

Do you think the current work isn't organized well enough?

> So the question here is how we should organize the code, not whether we should have it

Although this work isn't completed yet, I would truly appreciate it if you could provide a proposal or any kind of notes.


@rkern,

> It seems to me, after a cursory inspection, that how you want to do the distutils integration might be dispositive. It would be nice to package this up into a separate library with its own distutils integration tooling so that I could write other packages that use this code without needing to use numpy.distutils. numpy.distutils works well for numpy and scipy, but we try not to make it necessary for packages that just want to build against numpy's functionality.
>
> On the other hand, I recognize that the distutils integration required is fairly significant, and welding together two significant distutils augmentations is often fraught, so I can see an argument for avoiding it and just making this one more feature that numpy.distutils alone provides, like f2py. If that's the case, then numpy/numpy is the right place. There isn't much else that would be profitable to break out into a separate package.
>
> One thing to note is that we ought to be able to use this infrastructure to build another package that uses CPU features that the numpy binary itself was not compiled to use. I don't see why that would not be the case in this PR, but I just wanted it stated explicitly. Organizing this as a separate library may help enforce that when future modifications are made.

Sure, it's a great idea, and it seems the community prefers this way (a separate package). But every project has its own needs, which means you have to find compatibility and a flexible design, which means more time and more issues. I think we should focus on the main purpose, at least for now, and have some fun optimizing some kernels :).


@rgommers (Member)

> if it's necessary having a custom CI using "buildbot" could be an alternative solution too.

No, not necessary, and in fact very much undesired. We used to have those in the pre-TravisCI age and they're a maintenance nightmare. We'll just cover what we can with the hosted CI platforms we have, and the rest we test manually at PR develop/merge time, relying on community input to keep things working. That's the way it is now for AIX, Sparc, etc. We'll just slowly expand coverage as we go.


@seiko2plus force-pushed the core_improve_infa_build branch 3 times, most recently from c1c7d9e to 8802a35 on August 22, 2019
  • Add new C header located at `core/src/common/npy_cpu_dispatch.h`. The new header works as a bridge linking the headers and macros generated by `CCompilerOpt` with the NumPy runtime CPU features detection API, by providing several C macros that can be used to dispatch the objects generated from dispatch-able sources.
  • Add new attributes to the umath module:
    * __cpu_baseline__: a list containing the minimal set of required optimizations supported by the compiler and platform, according to the values specified for the command argument '--cpu-baseline'.
    * __cpu_dispatch__: a list containing the dispatched set of additional optimizations supported by the compiler and platform, according to the values specified for the command argument '--cpu-dispatch'.
  • Print required and additional optimizations during the run of PytestTester.
  • Add a testing unit for the utilities of the CPU dispatcher.
@seiko2plus force-pushed the core_improve_infa_build branch from ccfad27 to 93fee4e on June 16, 2020 16:42
@seiko2plus force-pushed the core_improve_infa_build branch from 93fee4e to e726538 on June 16, 2020 16:44
@mattip merged commit 8245b39 into numpy:master on Jun 17, 2020
@mattip (Member)

Thanks @seiko2plus. Step 1 is done!


@seiko2plus (Member, Author)

I started to love the purple color, thank you!

@charris (Member) commented Jul 4, 2020 (edited)

@seiko2plus This PR raises compiler warnings with fedora gcc 10 when AVX is missing.

```
/home/charris/Workspace/numpy.git/numpy/distutils/checks/cpu_avx512f.c:5:13: error: AVX512F vector return without AVX512F enabled changes the ABI [-Werror=psabi]
    5 |     __m512i a = _mm512_abs_epi32(_mm512_setzero_si512());
      |             ^
In file included from /usr/lib/gcc/x86_64-redhat-linux/10/include/immintrin.h:55,
                 from /home/charris/Workspace/numpy.git/numpy/distutils/checks/cpu_avx512f.c:1:
/usr/lib/gcc/x86_64-redhat-linux/10/include/avx512fintrin.h:15594:1: error: inlining failed in call to ‘always_inline’ ‘_mm512_castsi512_si128’: target specific option mismatch
15594 | _mm512_castsi512_si128 (__m512i __A)
      | ^~~~~~~~~~~~~~~~~~~~~~
```

EDIT: The same happens for the other AVX checks. It seems the checks should not be compiled without AVX.

@seiko2plus (Member, Author)

@charris, class CCompilerOpt performs two tests for each required feature to determine compiler support:

  • check the flags without any intrinsics involved, to determine the availability of the flags.
  • compile the test file with or without the flags, depending on the availability of the flags (friendly to MSVC).

If both tests fail, the feature is skipped; but since we're dealing with GCC 10, both tests should pass.
According to the provided error, CCompilerOpt was testing the check file cpu_avx512f.c without providing the AVX512F flag -mavx512f, which means the flags test failed.
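The decision logic of the two tests can be sketched as follows (a hypothetical simplification of what `CCompilerOpt` does; the callables stand in for the actual compiler invocations):

```python
def check_feature(compiler_accepts_flags, source_compiles):
    # Hypothetical sketch of the two-step check:
    # 1) probe whether the compiler accepts the feature flags alone;
    # 2) compile the check file, passing the flags only when step 1
    #    succeeded (which keeps MSVC, lacking per-feature flags for some
    #    features, on the no-flags path).
    flags_ok = compiler_accepts_flags()
    compiled = source_compiles(use_flags=flags_ok)
    # The feature is skipped (False) only when the compile step fails too.
    return compiled
```

In the reported case, step 1 failed for -mavx512f, so cpu_avx512f.c was compiled without the flag and step 2 failed as well, producing the errors above.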

However, I tested with GCC 10.1.0 and it looks fine to me.
Could you provide the build log, or at least the part of the failure related to testing the flag -mavx512f?

@charris (Member)

@seiko2plus I don't see that warning anymore; it may have changed with a recent kernel update. Let me try with an older kernel.

@charris (Member)

No, they are still there, but the tests run fine. I think the errors are configuration output that runtests.py should not be showing, so this is probably some sort of build-organization problem. Are those the sort of errors you would expect during configuration? @mattip IIRC, you were the one who suppressed the configuration output; thoughts?

@mattip (Member)

Could you put the complete build log up somewhere like https://paste.ubuntu.com/ ?

@charris (Member)

@mattip Here: https://paste.ubuntu.com/p/VH7x8qJkVS/ . Search for `error:`.

@seiko2plus (Member, Author) commented Jul 8, 2020 (edited)

@charris, hmm, it seems '-march=native' is provided through the environment variable CFLAGS, or it is part of the distutils config vars. See the final build log:

```
CPU baseline:
  Requested: 'native'
  Enabled: SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags: -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
CPU dispatch:
  Requested: 'max -xop -fma4'
  Enabled: AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated: none
CCompilerOpt._cache_write[786]: write cache to path -> /home/charris/Workspace/numpy.git/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_ext.py
########### CLIB COMPILER OPTIMIZATION ###########
CPU baseline:
  Requested: 'native'
  Enabled: SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags: -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
CPU dispatch:
  Requested: 'max -xop -fma4'
  Enabled: AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated: none
CCompilerOpt._cache_write[786]: write cache to path -> /home/charris/Workspace/numpy.git/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_clib.py
```

The failure messages are normal in this case, since CCompilerOpt was testing all features with the native flag to determine the features supported by the running machine at build time.

@mattip (Member)

This is part of "build_clib" (line 627). The configuration work is done during "build_src" (line 135). Maybe we should apply the same tricks during CCompilerOpt to ignore warnings that we do during build_src.

@seiko2plus (Member, Author)

@mattip, why should we suppress the error messages?

@seiko2plus (Member, Author)

@charris,

> I think the errors are configuration output that runtests.py should not be showing, so this is probably some sort of build organization problem.

Yes, it should be shown if --show-build-log is enabled.

> Are those the sort of errors you would expect during configuration?

Yes, when the compiler doesn't support a certain feature.

@seiko2plus (Member, Author) commented on the diff:

```python
    ").*"
)
@staticmethod
def _dist_test_spawn(cmd, display=None):
```

@mattip, @charris, we can suppress errors from here. If it's necessary, I can add support for verbose levels.

@mattip (Member)

I am not sure whether we should suppress the warnings when someone specifies -march=native or not.

Reviewers

@rgommers left review comments
@mattip left review comments
@Qiyu8 left review comments
@r-devulap requested changes

Assignees

@mattip

Labels

01 - Enhancement · component: numpy.distutils · component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)

Projects

None yet

Milestone

No milestone

12 participants

@seiko2plus @mattip @edelsohn @rgommers @charris @rkern @r-devulap @seberg @Guobing-Chen @h-vetinari @isuruf @Qiyu8
