ENH: enable multi-platform SIMD compiler optimizations #13516


Merged

mattip merged 8 commits into numpy:master from seiko2plus:core_improve_infa_build on Jun 17, 2020

Conversation

@seiko2plus (Member) commented May 9, 2019 (edited)

This pull request changes:


New distutils class `CCompilerOpt`

Used for handling CPU/hardware optimizations: it parses the command arguments, manages the relation between the CPU baseline and the dispatch-able features, generates the required C headers, and finally compiles the sources with the proper compiler flags.

`CCompilerOpt` doesn't provide runtime detection of CPU features; it focuses only on the compiler side, but it creates abstract C headers that can be used later for the final runtime dispatching process.


New build arguments

  • `--cpu-baseline`: minimal set of required optimizations. The default value is `min`, which provides the minimum CPU features that can safely run on a wide range of user platforms.
  • `--cpu-dispatch`: dispatched set of additional optimizations. The default value is `max -xop -fma4`, which enables all CPU features except for the AMD legacy ones.

The new arguments can be reached through `build`, `build_clib`, and `build_ext`. If `build_clib` or `build_ext` is not specified by the user, the arguments of `build` will be used instead; `build` also holds the default values.

Both `--cpu-baseline` and `--cpu-dispatch` accept the following CPU features:

X86: SSE, SSE2, SSE3, SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP, FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX, AVX512_CLX, AVX512_CNL, AVX512_ICL

IBM/Power64: VSX, VSX2, VSX3

ARM7/8: NEON, NEON_FP16, NEON_VFPV4, ASIMD, ASIMDHP, ASIMDDP, ASIMDFHM

Other acceptable options:

  • MIN: enables the minimum CPU features that can safely run on a wide range of user platforms:
    X86: SSE, SSE2
    X86_64: SSE, SSE2, SSE3
    IBM/Power64 (big-endian): nothing
    IBM/Power64 (little-endian): VSX, VSX2
    ARM7: nothing
    ARM8 (64-bit): NEON, NEON_FP16, NEON_VFPV4, ASIMD

  • MAX: enables all CPU features supported by the compiler and platform.

  • NATIVE: enables all CPU features supported by the current machine; this option relies on the compiler's native flags (-march=native, -xHost, /QxHost).

  • NONE: enables nothing.

  • Operands +/-: add or remove features; useful with the options MAX, MIN, and NATIVE.
    NOTE: the operand + is only accepted for nominal reasons. For example:

    • --cpu-baseline="min avx2" is equivalent to --cpu-baseline="min + avx2"
    • --cpu-baseline="min,avx2" is equivalent to --cpu-baseline="min,+avx2"
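As a rough illustration of how such option strings can be tokenized (this is a hypothetical sketch, not the actual `CCompilerOpt` parser), the separator and operand rules above could look like:

```python
import re

def parse_opt_string(s):
    # Hypothetical sketch: tokens are case-insensitive, separated by commas
    # or whitespace, with optional +/- operands; a bare "+" or "-" attaches
    # to nothing and is ignored, so "min + avx2" equals "min avx2".
    add, remove = [], []
    for tok in re.split(r"[,\s]+", s.strip()):
        if not tok:
            continue
        op, name = (tok[0], tok[1:]) if tok[0] in "+-" else ("+", tok)
        name = name.upper()
        if not name:
            continue
        (add if op == "+" else remove).append(name)
    return add, remove
```

For example, `parse_opt_string("max -xop -fma4")` separates the MAX request from the two removed AMD legacy features.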

NOTES:

  • All CPU features and other options are case-insensitive.
  • Comma or space can be used as a separator, e.g.
    --cpu-dispatch="avx2 avx512f" and --cpu-dispatch="avx2, avx512f" are both applicable.
  • If a CPU feature is not supported by the user's platform or compiler, it will be skipped rather than
    raising a fatal error.
  • Any CPU feature specified in --cpu-dispatch will be skipped if it is part of the CPU baseline features.
  • The argument --cpu-baseline force-enables implied features,
    e.g. --cpu-baseline="sse42" is equivalent to --cpu-baseline="sse sse2 sse3 ssse3 sse41 popcnt sse42".
  • The value of --cpu-baseline will be treated as "native" if a compiler native flag (-march=native, -xHost, or /QxHost) is enabled through the environment variable CFLAGS.
  • The user should always check the final report in the build log to verify the enabled features.
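The force-enabling of implied features mentioned above can be sketched as a transitive closure over an implication table. The table below is a hypothetical excerpt covering only the x86 chain from the sse42 example; the real table in `CCompilerOpt` covers every architecture:

```python
# Hypothetical partial implication table (illustrative only).
IMPLIES = {
    "SSE":    [],
    "SSE2":   ["SSE"],
    "SSE3":   ["SSE2"],
    "SSSE3":  ["SSE3"],
    "SSE41":  ["SSSE3"],
    "POPCNT": ["SSE41"],
    "SSE42":  ["POPCNT"],
}

def expand(feature):
    """Return the feature together with every feature it transitively implies."""
    result, stack = set(), [feature]
    while stack:
        f = stack.pop()
        if f not in result:
            result.add(f)
            stack.extend(IMPLIES.get(f, []))
    return result
```

With this sketch, `expand("SSE42")` yields the same seven features as the `--cpu-baseline="sse42"` example.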

Explicitly disable the new infrastructure

Add a new command argument --disable-optimization to explicitly disable the whole new infrastructure; it also adds a new compiler definition called NPY_DISABLE_OPTIMIZATION.

When --disable-optimization is enabled, the dispatch-able sources (*.dispatch.c, explained in this comment) will be treated as normal C sources. Because of this, any C header generated by CCompilerOpt must be guarded with NPY_DISABLE_OPTIMIZATION; otherwise, it will definitely break the build.
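The guard requirement above amounts to wrapping any generated C content in a preprocessor conditional on NPY_DISABLE_OPTIMIZATION. A hypothetical helper (the macro name is from this PR; the function itself is illustrative, not part of `CCompilerOpt`) could emit it like this:

```python
def guard_generated_header(body, macro="NPY_DISABLE_OPTIMIZATION"):
    # Hypothetical sketch: wrap generated C content so that builds made with
    # --disable-optimization (which defines the macro) skip the SIMD
    # definitions entirely instead of breaking the build.
    return "#ifndef {m}\n{b}\n#endif /* !{m} */\n".format(m=macro, b=body)
```

A header body produced this way compiles to nothing when the macro is defined, which is exactly the fallback behavior the flag requires.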


New auto-generated C header (core/src/common/_cpu_dispatch.h)

The new header contains all the definitions and headers of the instruction sets for the CPU baseline and dispatch-able features that have been enabled through the command arguments --cpu-baseline and --cpu-dispatch.

NPY_HAVE_ is the prefix of the CPU feature definitions, e.g. NPY_HAVE_SSE2, NPY_HAVE_VSX, NPY_HAVE_NEON.
The new header can be included inside normal C files or dispatch-able C sources (explained below);
however, if it is included inside normal C files, it will only provide the definitions and headers of the instruction sets for the CPU baseline features. For example:

```c
#include "_cpu_dispatch.h"
#ifdef NPY_HAVE_AVX2
// this branch will only be enabled if AVX2 is enabled through --cpu-baseline="avx2"
#endif
```

NOTE: It should not be used directly, but through another new header called npy_cpu_dispatch.h.


New C header (core/src/common/npy_cpu_dispatch.h)

This header contains all the utilities required for the whole CPU dispatching process; it can also be considered a bridge linking the new infrastructure with NumPy's CPU runtime detection (#13421).


New CPU dispatcher solution (dispatch-able sources)

Explained in this comment.


Add new attributes to the umath module

  • __cpu_baseline__: a list of the CPU baseline feature names configured by --cpu-baseline

  • __cpu_dispatch__: a list of the CPU dispatch feature names configured by --cpu-dispatch

The new attributes contain the finally enabled CPU features that are supported by the platform.


Print the supported CPU features during the run of PytestTester

This could be one of the easiest ways to trace the enabled features during the run of the unit tests.

An example on GCC 8.3.0 (64-bit), i7-8550U:

NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CLX? AVX512_CNL? AVX512_ICL?

The format is explained as follows:

  • Dispatched features supported by the running machine end with *.

  • Dispatched features that are not supported by the running machine end with ?.

  • The remaining features represent the baseline.

  • If any features are missing, it is because they are:

    • not supported by the compiler or platform, or
    • not configured through the command arguments --cpu-baseline and --cpu-dispatch.
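The star/question-mark notation above is simple enough to split mechanically; the following hypothetical parser (not part of NumPy) illustrates the three categories:

```python
def classify(report):
    # Hypothetical parser for the "NumPy CPU features: ..." line:
    # "*" marks a dispatched feature supported by the running machine,
    # "?" marks a dispatched feature the machine lacks, and unmarked
    # tokens are the baseline.
    baseline, dispatched_ok, dispatched_missing = [], [], []
    for tok in report.split(":", 1)[1].split():
        if tok.endswith("*"):
            dispatched_ok.append(tok[:-1])
        elif tok.endswith("?"):
            dispatched_missing.append(tok[:-1])
        else:
            baseline.append(tok)
    return baseline, dispatched_ok, dispatched_missing
```

Feeding it the GCC 8.3.0 example line would place SSE/SSE2/SSE3 in the baseline, SSSE3 through AVX2 in the supported dispatched set, and the AVX512 variants in the unsupported set.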

@seiko2plus force-pushed the core_improve_infa_build branch 2 times, most recently from 026a1b2 to 1c24bd0 on May 11, 2019
@seiko2plus force-pushed the core_improve_infa_build branch 5 times, most recently from 2561efa to 660f621 on May 18, 2019
@seiko2plus force-pushed the core_improve_infa_build branch 2 times, most recently from e3f7603 to 95ffca3 on May 26, 2019
@seiko2plus force-pushed the core_improve_infa_build branch from 95ffca3 to 9e05278 on June 4, 2019
@seiko2plus force-pushed the core_improve_infa_build branch 6 times, most recently from 531f6a0 to 582cb46 on July 1, 2019
@mattip (Member)

I am concerned all this will become a maintenance burden moving forward. Is there a way to use this as a library or a subrepo?

@edelsohn

What does this comment about library or subrepo mean? This sounds very hostile to NumPy supporting other architectures.

@rgommers (Member)

> I am concerned all this will become a maintenance burden moving forward. Is there a way to use this as a library or a subrepo?

@mattip this is quite clearly addressed in the discussion that spurred this PR: gh-13393. And we also discussed it with the Steering Council before saying yes to posting the bounty mentioned in gh-13393. So yes, it's a concern, but if things are done the right way there's no reason that this should substantially increase the maintenance burden. Please review that issue and comment on code or ideas that are potentially problematic; no need to start a new abstract discussion.

> This sounds very hostile to NumPy supporting other architectures.

@edelsohn that's certainly not our intent as a project. We'd really like to have complete SIMD support for PowerPC as well as other architectures, with the compiler doing most of the heavy lifting.

@rgommers (Member)

That said, this is quite a bit of code. I believe a lot of it is well-tested though, taken over from OpenCV (correct me if I'm wrong @seiko2plus). Some more context in the issue description would be useful:

  • there's a TODO for an example
  • this contains both build-time and run-time features. Does that change the situation we have today? IIRC we wanted to move more to run-time; is that achieved?
  • what parts are and aren't covered by CI? (we have PPC on TravisCI, were thinking about ARM via drone.io I believe, and conda-forge has ARM too)

@charris (Member)

> What does this comment about library or subrepo mean?

It is simply another way to organize the code. NumPy already has two submodules that provide tools to build the documentation; putting the code in a submodule allows NumPy to use the code while letting development proceed at its own pace. Likewise, NumPy builds several libraries for the sorting and math code, probably not strictly necessary these days, but routines in the libraries are linked in when needed and, strictly speaking, could be used by downstream projects. You have a number of new files dedicated to this work, so it might be nice to put them together in subdirectories, maybe organized by architecture. Note that the libraries have their own setup and include files and configure the system. So the question here is how we should organize the code, not whether we should have it.


@rkern (Member)

It seems to me, after a cursory inspection, that how you want to do the distutils integration might be dispositive. It would be nice to package this up into a separate library with its own distutils integration tooling, so that I could write other packages that use this code without needing to use numpy.distutils. numpy.distutils works well for numpy and scipy, but we try not to make it necessary for packages that just want to build against numpy's functionality.

On the other hand, I recognize that the distutils integration required is fairly significant, and welding together two significant distutils augmentations is often fraught, so I can see an argument for avoiding it and just making this one more feature that numpy.distutils alone provides, like f2py. If that's the case, then numpy/numpy is the right place. There isn't much else that would be profitable to break out into a separate package.

One thing to note is that we ought to be able to use this infrastructure to build another package that uses CPU features that the numpy binary itself was not compiled to use. I don't see why that would not be the case in this PR, but I just wanted it stated explicitly. Organizing this as a separate library may help enforce that when future modifications are made.

@seiko2plus (Member, Author)

@mattip,

> I am concerned all this will become a maintenance burden moving forward.

The maintenance burden is already moving forward; I think what I'm doing is just trying to ease it.

> Is there a way to use this as a library or a subrepo?

I'm afraid not, at least for now, because the universal intrinsics (mapped intrinsics) are going to be heavily expanded depending on need, and I think they should be part of NumPy's core. Later we should forbid, or at least reduce, the direct use of raw SIMD intrinsics and rely only on universal intrinsics for CPU optimizations.


@edelsohn,

> This sounds very hostile to NumPy supporting other architectures.

Actually, most of the work here is for x86, due to the need to deal with many SIMD extensions, compilers, and platforms, which makes it somewhat complicated. I don't see VSX or even NEON creating any obstacles here. Supporting big-endian mode for VSX may be a little challenging, but I'm planning to implement it in a separate pull request after I finish re-implementing the current optimized kernels using the new SIMD interface.


@rgommers,

> That said, this is quite a bit of code.

Yes; I had to add some fundamental universal intrinsics just for testing purposes, but I will certainly add a lot more depending on need.

> I believe a lot of it is well-tested though

I believe we shouldn't put our trust in any intrinsic, even if it maps directly to a native instruction, since compilers aren't developed by angels; that may be why we should try to cover all the used intrinsics with unit tests.

> taken over from OpenCV (correct me if I'm wrong @seiko2plus)

As I mentioned before in #13393, the whole roadmap is inspired by OpenCV, though not exactly the same design or code; still, the similarity is very high, especially in the infrastructure.

> there's a TODO for an example

Of course, I won't be able to cover all areas in this PR, but I think it's necessary to provide a decent example for illustrative and testing purposes. Please feel free to add any extra tasks.

> what parts are and aren't covered by CI (we have PPC on TravisCI, were thinking about ARM via drone.io I believe, and conda-forge has ARM too)?

If necessary, having a custom CI using Buildbot could be an alternative solution too.


@charris,

> You have a number of new files dedicated to this work, so it might be nice to put them together in subdirectories, maybe organized by architecture. Note that the libraries have their own setup and include files and configure the system.

Do you think the current work isn't organized well enough?

> So the question here is how we should organize the code, not whether we should have it

Although this work isn't completed yet, I would truly appreciate it if you could provide a proposal or any kind of notes.


@rkern,

> It seems to me, after a cursory inspection, that how you want to do the distutils integration might be dispositive. It would be nice to package this up into a separate library with its own distutils integration tooling so that I could write other packages that use this code without needing to use numpy.distutils. numpy.distutils works well for numpy and scipy, but we try not to make it necessary for packages that just want to build against numpy's functionality.
>
> On the other hand, I recognize that the distutils integration required is fairly significant, and welding together two significant distutils augmentations is often fraught, so I can see an argument for avoiding it and just making this one more feature that numpy.distutils alone provides, like f2py. If that's the case, then numpy/numpy is the right place. There isn't much else that would be profitable to break out into a separate package.
>
> One thing to note is that we ought to be able to use this infrastructure to build another package that uses CPU features that the numpy binary itself was not compiled to use. I don't see why that would not be the case in this PR, but I just wanted it stated explicitly. Organizing this as a separate library may help enforce that when future modifications are made.

Sure, it's a great idea, and it seems the community prefers this way (a separate package). But every project has its own needs, which means you have to find compatibility and a flexible design, which means more time and more issues. I think we should focus on the main purpose, at least for now, and have some fun optimizing some kernels :).


@rgommers (Member)

> if it's necessary having a custom CI using "buildbot" could be an alternative solution too.

No, not necessary, and in fact very much undesired. We used to have those in the pre-TravisCI age and they're a maintenance nightmare. We'll just cover what we can with the hosted CI platforms we have, and the rest we test manually at PR develop/merge time, relying on community input to keep things working. That's the way it is now for AIX, Sparc, etc. We'll just slowly expand coverage as we go.


@seiko2plus force-pushed the core_improve_infa_build branch 3 times, most recently from c1c7d9e to 8802a35 on August 22, 2019
  • Add new C header located at `core/src/common/npy_cpu_dispatch.h`. The new header works as a bridge linking the headers and macros generated by `CCompilerOpt` with the NumPy runtime CPU features detection API, by providing several C macros that can be used to dispatch the objects generated from dispatch-able sources.
  • Add new attributes to the umath module:
    * __cpu_baseline__: a list containing the minimal set of required optimizations supported by the compiler and platform, according to the values specified for the command argument '--cpu-baseline'.
    * __cpu_dispatch__: a list containing the dispatched set of additional optimizations supported by the compiler and platform, according to the values specified for the command argument '--cpu-dispatch'.
  • Print required and additional optimizations during the run of PytestTester.
  • Add a testing unit for the utilities of the CPU dispatcher.
@seiko2plus force-pushed the core_improve_infa_build branch from ccfad27 to 93fee4e on June 16, 2020 16:42
@seiko2plus force-pushed the core_improve_infa_build branch from 93fee4e to e726538 on June 16, 2020 16:44
@mattip merged commit 8245b39 into numpy:master on Jun 17, 2020
@mattip (Member)

Thanks @seiko2plus. Step 1 is done!


@seiko2plus (Member, Author)

I started to love the purple color, thank you!

@charris (Member) commented Jul 4, 2020 (edited)

@seiko2plus This PR raises compiler warnings with fedora gcc 10 when AVX is missing.

```
/home/charris/Workspace/numpy.git/numpy/distutils/checks/cpu_avx512f.c:5:13: error: AVX512F vector return without AVX512F enabled changes the ABI [-Werror=psabi]
    5 |     __m512i a = _mm512_abs_epi32(_mm512_setzero_si512());
      |             ^
In file included from /usr/lib/gcc/x86_64-redhat-linux/10/include/immintrin.h:55,
                 from /home/charris/Workspace/numpy.git/numpy/distutils/checks/cpu_avx512f.c:1:
/usr/lib/gcc/x86_64-redhat-linux/10/include/avx512fintrin.h:15594:1: error: inlining failed in call to ‘always_inline’ ‘_mm512_castsi512_si128’: target specific option mismatch
15594 | _mm512_castsi512_si128 (__m512i __A)
      | ^~~~~~~~~~~~~~~~~~~~~~
```

EDIT: The same happens for the other AVX checks. It seems the checks should not be compiled without AVX.

@seiko2plus (Member, Author)

@charris, class CCompilerOpt performs two tests for each required feature to determine compiler support:

  • check the flags without any intrinsics involved, to determine the availability of the flags.
  • compile the test file with or without the flags, depending on the availability of the flags (friendly to MSVC).

If both tests fail, the feature is skipped; but since we're dealing with GCC 10, both tests should pass.
According to the provided error, CCompilerOpt was testing the check file cpu_avx512f.c without providing the AVX512F flag -mavx512f, which means the flags test failed.
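The decision logic of the two tests can be sketched as follows (a hypothetical simplification of what `CCompilerOpt` does; the callables stand in for the actual compiler invocations):

```python
def check_feature(compiler_accepts_flags, source_compiles):
    # Hypothetical sketch of the two-step check:
    # 1) probe whether the compiler accepts the feature flags alone;
    # 2) compile the check file, passing the flags only when step 1
    #    succeeded (which keeps MSVC, lacking per-feature flags for some
    #    features, on the no-flags path).
    flags_ok = compiler_accepts_flags()
    compiled = source_compiles(use_flags=flags_ok)
    # The feature is skipped (False) only when the compile step fails too.
    return compiled
```

In the reported case, step 1 failed for -mavx512f, so cpu_avx512f.c was compiled without the flag and step 2 failed as well, producing the errors above.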

However, I tested with GCC 10.1.0 and it looks fine to me.
Could you provide the build log, or at least the part of the failure related to testing the flag -mavx512f?

@charris (Member)

@seiko2plus I don't see that warning anymore; it may have changed with a recent kernel update. Let me try with an older kernel.

@charris (Member)

No, they are still there, but the tests run fine. I think the errors are configuration output that runtests.py should not be showing, so this is probably some sort of build-organization problem. Are those the sort of errors you would expect during configuration? @mattip IIRC, you were the one who suppressed the configuration output; thoughts?

@mattip (Member)

Could you put the complete build log up somewhere like https://paste.ubuntu.com/ ?

@charris (Member)

@mattip Here: https://paste.ubuntu.com/p/VH7x8qJkVS/ . Search for `error:`.

@seiko2plus (Member, Author) commented Jul 8, 2020 (edited)

@charris, hmm, it seems '-march=native' is provided through the environment variable CFLAGS, or it is part of the distutils config vars. See the final build log:

```
CPU baseline:
  Requested: 'native'
  Enabled: SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags: -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
CPU dispatch:
  Requested: 'max -xop -fma4'
  Enabled: AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated: none
CCompilerOpt._cache_write[786]: write cache to path -> /home/charris/Workspace/numpy.git/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_ext.py
########### CLIB COMPILER OPTIMIZATION ###########
CPU baseline:
  Requested: 'native'
  Enabled: SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags: -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
CPU dispatch:
  Requested: 'max -xop -fma4'
  Enabled: AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated: none
CCompilerOpt._cache_write[786]: write cache to path -> /home/charris/Workspace/numpy.git/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_clib.py
```

The failure messages are normal in this case, since CCompilerOpt was testing all features with the native flag to determine the features supported by the running machine at build time.

@mattip (Member)

This is part of "build_clib" (line 627). The configuration work is done during "build_src" (line 135). Maybe we should apply the same tricks during CCompilerOpt to ignore warnings that we do during build_src.

@seiko2plus (Member, Author)

@mattip, why should we suppress the error messages?

@seiko2plus (Member, Author)

@charris,

> I think the errors are configuration output that runtests.py should not be showing, so this is probably some sort of build organization problem.

Yes, it should be shown if --show-build-log is enabled.

> Are those the sort of errors you would expect during configuration?

Yes, when the compiler doesn't support a certain feature.

@seiko2plus (Member, Author) commented on the diff:

```python
    ").*"
)
@staticmethod
def _dist_test_spawn(cmd, display=None):
```

@mattip, @charris, we can suppress errors from here. If it's necessary, I can add support for verbose levels.

@mattip (Member)

I am not sure whether we should suppress the warnings when someone specifies -march=native or not.

Reviewers

@rgommers left review comments
@mattip left review comments
@Qiyu8 left review comments
@r-devulap requested changes

Assignees

@mattip

Labels

01 - Enhancement · component: numpy.distutils · component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)

Projects

None yet

Milestone

No milestone

12 participants

@seiko2plus @mattip @edelsohn @rgommers @charris @rkern @r-devulap @seberg @Guobing-Chen @h-vetinari @isuruf @Qiyu8
