CPU build options#

Description#

The following options are mainly used to change the default behavior of optimizations that target certain CPU features:

  • cpu-baseline: minimal set of required CPU features.

    The default value is min, which provides the minimum CPU features that can safely run on a wide range of platforms within the processor family.

    Note

    At runtime, NumPy modules will fail to load if any of the specified features are not supported by the target CPU (a Python runtime error is raised).

  • cpu-dispatch: dispatched set of additional CPU features.

    The default value is max -xop -fma4, which enables all CPU features except for AMD legacy features (in the case of x86).

    Note

    At runtime, NumPy modules will skip any specified features that are not available on the target CPU.

These options are accessible at build time by passing setup arguments to meson-python via the build frontend (e.g., pip or build). They accept a set of CPU features, groups of features that gather several features, or special options that perform a series of procedures.

To customize CPU/build options:

pip install . -Csetup-args=-Dcpu-baseline="avx2 fma3" -Csetup-args=-Dcpu-dispatch="max"
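When builds are scripted, the same arguments can be assembled programmatically. The following is a minimal sketch, not part of NumPy or pip: the helper name `setup_args` is hypothetical, and it only builds the argument list used in the command above without running the build.

```python
import shlex


def setup_args(baseline=None, dispatch=None):
    """Return the -Csetup-args options for the given feature strings.

    `setup_args` is a hypothetical helper used for illustration only.
    """
    args = []
    if baseline is not None:
        args.append(f"-Csetup-args=-Dcpu-baseline={baseline}")
    if dispatch is not None:
        args.append(f"-Csetup-args=-Dcpu-dispatch={dispatch}")
    return args


# Each option is a single list element, so no shell quoting is needed
# when the list is handed to subprocess.run().
cmd = ["pip", "install", ".", *setup_args("avx2 fma3", "max")]

# shlex.join adds quotes around elements containing spaces, for display only.
print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run` avoids the quoting concerns that apply when the command is typed into a shell.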

Quick start#

In general, the default settings tend to not impose certain CPU features that may not be available on some older processors. Raising the ceiling of the baseline features will often improve performance and may also reduce binary size.

The following are the most common scenarios that may require changing the default settings:

I am building NumPy for my local use#

And I do not intend to export the build to other users or target a different CPU than what the host has.

Set native for the baseline, or manually specify the CPU features if the native option isn’t supported by your platform:

python -m build --wheel -Csetup-args=-Dcpu-baseline="native"

Building NumPy with extra CPU features isn’t necessary in this case, since all supported features are already defined within the baseline features:

python -m build --wheel -Csetup-args=-Dcpu-baseline="native" \
    -Csetup-args=-Dcpu-dispatch="none"

Note

A fatal error will be raised if native isn’t supported by the host platform.

I do not want to support the old processors of the x86 architecture#

Since most CPUs nowadays support at least the AVX and F16C features, you can use:

python -m build --wheel -Csetup-args=-Dcpu-baseline="avx f16c"

Note

cpu-baseline force-combines all implied features, so there’s no need to add the SSE features.

I’m facing the same case above but with ppc64 architecture#

Then raise the ceiling of the baseline features to Power8:

python -m build --wheel -Csetup-args=-Dcpu-baseline="vsx2"

Having issues with AVX512 features?#

You may have reservations about including AVX512 or any other CPU feature and want to exclude it from the dispatched features:

python -m build --wheel -Csetup-args=-Dcpu-dispatch="max -avx512f -avx512cd \
-avx512_knl -avx512_knm -avx512_skx -avx512_clx -avx512_cnl -avx512_icl -avx512_spr"

Supported features#

The names of the features can express one feature or a group of features, as shown in the following tables, which are ordered from the lowest feature level to the highest:

Note

The following features may not be supported by all compilers; also, some compilers may produce a different set of implied features when it comes to features like AVX512, AVX2, and FMA3. See Platform differences for more details.

On x86#

| Name | Implies | Gathers |
|------|---------|---------|
| SSE | SSE2 | |
| SSE2 | SSE | |
| SSE3 | SSE SSE2 | |
| SSSE3 | SSE SSE2 SSE3 | |
| SSE41 | SSE SSE2 SSE3 SSSE3 | |
| POPCNT | SSE SSE2 SSE3 SSSE3 SSE41 | |
| SSE42 | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT | |
| AVX | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 | |
| XOP | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX | |
| FMA4 | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX | |
| F16C | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX | |
| FMA3 | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C | |
| AVX2 | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C | |
| AVX512F | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 | |
| AVX512CD | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F | |
| AVX512_KNL | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD | AVX512ER AVX512PF |
| AVX512_KNM | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL | AVX5124FMAPS AVX5124VNNIW AVX512VPOPCNTDQ |
| AVX512_SKX | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD | AVX512VL AVX512BW AVX512DQ |
| AVX512_CLX | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX | AVX512VNNI |
| AVX512_CNL | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX | AVX512IFMA AVX512VBMI |
| AVX512_ICL | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX AVX512_CLX AVX512_CNL | AVX512VBMI2 AVX512BITALG AVX512VPOPCNTDQ |
| AVX512_SPR | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL | AVX512FP16 |
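The Implies column can be read as a lookup table. As a minimal sketch (the dictionary copies only part of the x86 table above, and `with_implied` is a hypothetical helper, not a NumPy API), the following shows how a requested name expands into itself plus the features it implies:

```python
# Partial copy of the x86 "Implies" column; dict order follows the table.
X86_IMPLIES = {
    "SSE":    "SSE2",
    "SSE2":   "SSE",
    "SSE3":   "SSE SSE2",
    "SSSE3":  "SSE SSE2 SSE3",
    "SSE41":  "SSE SSE2 SSE3 SSSE3",
    "POPCNT": "SSE SSE2 SSE3 SSSE3 SSE41",
    "SSE42":  "SSE SSE2 SSE3 SSSE3 SSE41 POPCNT",
    "AVX":    "SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42",
    "F16C":   "SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX",
    "FMA3":   "SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C",
    "AVX2":   "SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C",
}


def with_implied(requested):
    """Return the requested features plus everything they imply, in table order."""
    wanted = set()
    for name in requested:
        name = name.upper()  # feature names are case-insensitive
        wanted.add(name)
        wanted.update(X86_IMPLIES[name].split())
    return [name for name in X86_IMPLIES if name in wanted]


print(" ".join(with_implied(["sse42"])))
```

Under these assumptions, requesting `sse42` yields the same list that the Behaviors section gives for `cpu-baseline=sse42`.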

On IBM/POWER big-endian#

| Name | Implies |
|------|---------|
| VSX | |
| VSX2 | VSX |
| VSX3 | VSX VSX2 |
| VSX4 | VSX VSX2 VSX3 |

On IBM/POWER little-endian#

| Name | Implies |
|------|---------|
| VSX | VSX2 |
| VSX2 | VSX |
| VSX3 | VSX VSX2 |
| VSX4 | VSX VSX2 VSX3 |

On ARMv7/A32#

| Name | Implies |
|------|---------|
| NEON | |
| NEON_FP16 | NEON |
| NEON_VFPV4 | NEON NEON_FP16 |
| ASIMD | NEON NEON_FP16 NEON_VFPV4 |
| ASIMDHP | NEON NEON_FP16 NEON_VFPV4 ASIMD |
| ASIMDDP | NEON NEON_FP16 NEON_VFPV4 ASIMD |
| ASIMDFHM | NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP |

On ARMv8/A64#

| Name | Implies |
|------|---------|
| NEON | NEON_FP16 NEON_VFPV4 ASIMD |
| NEON_FP16 | NEON NEON_VFPV4 ASIMD |
| NEON_VFPV4 | NEON NEON_FP16 ASIMD |
| ASIMD | NEON NEON_FP16 NEON_VFPV4 |
| ASIMDHP | NEON NEON_FP16 NEON_VFPV4 ASIMD |
| ASIMDDP | NEON NEON_FP16 NEON_VFPV4 ASIMD |
| ASIMDFHM | NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP |

On IBM/ZSYSTEM(S390X)#

| Name | Implies |
|------|---------|
| VX | |
| VXE | VX |
| VXE2 | VX VXE |

Special options#

  • NONE: enable no features.

  • NATIVE: Enables all CPU features that are supported by the host CPU; this operation is based on the compiler flags (-march=native, -xHost, /QxHost).

  • MIN: Enables the minimum CPU features that can safely run on a wide range of platforms:

    | For Arch | Implies |
    |----------|---------|
    | x86 (32-bit mode) | SSE SSE2 |
    | x86_64 | SSE SSE2 SSE3 |
    | IBM/POWER (big-endian mode) | NONE |
    | IBM/POWER (little-endian mode) | VSX VSX2 |
    | ARMHF | NONE |
    | ARM64 a.k.a. AARCH64 | NEON NEON_FP16 NEON_VFPV4 ASIMD |
    | IBM/ZSYSTEM(S390X) | NONE |

  • MAX: Enables all supported CPU features by the compiler and platform.

  • Operators -/+: remove or add features; useful with the options MAX, MIN and NATIVE.
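How the special options and the -/+ operators combine can be illustrated with a simplified model. This sketch is not NumPy's build logic: `resolve` is a hypothetical function, `supported` stands in for whatever the compiler and platform support, `minimum` stands in for the MIN row of the architecture, and applying removals after all additions is an assumption of the sketch. NATIVE is left out because it depends on probing the host compiler.

```python
def resolve(option, supported, minimum):
    """Resolve an option string such as "max -xop -fma4" to a feature list.

    Hypothetical model: additions are collected first, removals applied last.
    """
    added, removed = set(), set()
    for token in option.upper().split():
        if token == "NONE":
            continue                        # NONE enables nothing
        if token == "MAX":
            added.update(supported)         # everything that is supported
        elif token == "MIN":
            added.update(minimum)           # the architecture's minimum set
        elif token.startswith("-"):
            removed.add(token[1:])          # "-feature" removes a feature
        else:
            added.add(token.lstrip("+"))    # "+feature" or a bare name adds one
    enabled = added - removed
    # Report in the order of `supported`; unsupported names drop out here.
    return [name for name in supported if name in enabled]


SUPPORTED = ["SSE", "SSE2", "SSE3", "SSSE3", "SSE41", "POPCNT", "SSE42",
             "AVX", "XOP", "FMA4", "F16C", "FMA3", "AVX2"]
X86_64_MIN = ["SSE", "SSE2", "SSE3"]

print(resolve("max -xop -fma4", SUPPORTED, X86_64_MIN))
print(resolve("min +ssse3", SUPPORTED, X86_64_MIN))
```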

Behaviors#

  • CPU features and other options are case-insensitive, for example:

    python -m build --wheel -Csetup-args=-Dcpu-dispatch="SSE41 avx2 FMA3"

  • The order of the requested optimizations doesn’t matter:

    python -m build --wheel -Csetup-args=-Dcpu-dispatch="SSE41 AVX2 FMA3"
    # equivalent to
    python -m build --wheel -Csetup-args=-Dcpu-dispatch="FMA3 AVX2 SSE41"

  • Commas, spaces, or ‘+’ can be used as separators, for example:

    python -m build --wheel -Csetup-args=-Dcpu-dispatch="avx2 avx512f"
    # or
    python -m build --wheel -Csetup-args=-Dcpu-dispatch=avx2,avx512f
    # or
    python -m build --wheel -Csetup-args=-Dcpu-dispatch="avx2+avx512f"

    All of these work, but arguments should be enclosed in quotes or escaped with a backslash if any spaces are used.

  • cpu-baseline combines all implied CPU features, for example:

    python -m build --wheel -Csetup-args=-Dcpu-baseline=sse42
    # equivalent to
    python -m build --wheel -Csetup-args=-Dcpu-baseline="sse sse2 sse3 ssse3 sse41 popcnt sse42"

  • cpu-baseline will be treated as “native” if the compiler native flag -march=native, -xHost, or /QxHost is enabled through the environment variable CFLAGS:

    export CFLAGS="-march=native"
    pip install .
    # is equivalent to
    pip install . -Csetup-args=-Dcpu-baseline=native

  • cpu-baseline skips any specified features that aren’t supported by the target platform or compiler rather than raising fatal errors.

    Note

    Since cpu-baseline combines all implied features, the largest supported subset of the implied features will be enabled rather than skipping all of them. For example:

    # Requesting `AVX2,FMA3` but the compiler only supports **SSE** features
    python -m build --wheel -Csetup-args=-Dcpu-baseline="avx2 fma3"
    # is equivalent to
    python -m build --wheel -Csetup-args=-Dcpu-baseline="sse sse2 sse3 ssse3 sse41 popcnt sse42"

  • cpu-dispatch does not combine any implied CPU features, so you must add them unless you want to disable one or all of them:

    # Only dispatches AVX2 and FMA3
    python -m build --wheel -Csetup-args=-Dcpu-dispatch=avx2,fma3
    # Dispatches AVX and SSE features
    python -m build --wheel -Csetup-args=-Dcpu-dispatch=ssse3,sse41,sse42,avx,avx2,fma3

  • cpu-dispatch skips any specified baseline features and also skips any features not supported by the target platform or compiler, without raising fatal errors.

Finally, always check the final report in the build log to verify the enabled features. See Build report for more details.
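Several of the behaviors listed in this section (case-insensitivity, interchangeable separators, and cpu-dispatch skipping baseline or unsupported features) can be modeled in a few lines. This is only an illustrative sketch: `split_features` and `effective_dispatch` are hypothetical names, plain feature names are assumed (no special options or `-` operators), and the code is not taken from NumPy's build system.

```python
import re


def split_features(value):
    """Split on commas, whitespace or '+', and normalise to upper case."""
    return [tok.upper() for tok in re.split(r"[,\s+]+", value) if tok]


def effective_dispatch(dispatch, baseline_enabled, supported):
    """Drop dispatch features that are in the baseline or are unsupported."""
    baseline = set(baseline_enabled)
    return [
        name for name in split_features(dispatch)
        if name in supported and name not in baseline
    ]


# The three separator styles produce the same feature list.
print(split_features("avx2 avx512f"))
print(split_features("avx2,avx512f"))
print(split_features("avx2+avx512f"))

# SSE3 is already in the baseline and AVX512F is assumed unsupported here,
# so neither is dispatched.
print(effective_dispatch(
    "sse3 ssse3 avx2 avx512f",
    baseline_enabled=["SSE", "SSE2", "SSE3"],
    supported={"SSE", "SSE2", "SSE3", "SSSE3", "AVX2"},
))
```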

Platform differences#

Some exceptional conditions force us to link some features together when it comes to certain compilers or architectures, making it impossible to build them separately.

These conditions can be divided into two parts, as follows:

Architectural compatibility

Certain CPU features need to be aligned because they are assured to be supported by successive generations of the same architecture. Some cases:

  • On ppc64le, VSX (ISA 2.06) and VSX2 (ISA 2.07) imply one another, since the first generation that supports little-endian mode is Power8 (ISA 2.07).

  • On AArch64, NEON, NEON_FP16, NEON_VFPV4, and ASIMD imply each other, since they are part of the hardware baseline.

For example:

# On ARMv8/A64, specifying NEON is going to enable Advanced SIMD
# and all predecessor extensions
python -m build --wheel -Csetup-args=-Dcpu-baseline=neon
# which is equivalent to
python -m build --wheel -Csetup-args=-Dcpu-baseline="neon neon_fp16 neon_vfpv4 asimd"

Note

Please take a close look at Supported features in order to determine the features that imply one another.

Compilation compatibility

Some compilers don’t provide independent support for all CPU features. For instance, Intel’s compiler doesn’t provide separate flags for AVX2 and FMA3. This makes sense, since all Intel CPUs that come with AVX2 also support FMA3, but this approach is incompatible with other x86 CPUs from AMD or VIA.

For example:

# Specifying AVX2 will force-enable FMA3 on Intel compilers
python -m build --wheel -Csetup-args=-Dcpu-baseline=avx2
# which is equivalent to
python -m build --wheel -Csetup-args=-Dcpu-baseline="avx2 fma3"

The following tables only show the differences that some compilers impose relative to the general context shown in the Supported features tables:

Note

Feature names with strikeout represent unsupported CPU features.

On x86::Intel Compiler#

| Name | Implies | Gathers |
|------|---------|---------|
| FMA3 | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C AVX2 | |
| AVX2 | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 | |
| AVX512F | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512CD | |
| XOP (unsupported) | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX | |
| FMA4 (unsupported) | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX | |
| AVX512_SPR (unsupported) | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL | AVX512FP16 |

On x86::Microsoft Visual C/C++#

| Name | Implies | Gathers |
|------|---------|---------|
| FMA3 | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C AVX2 | |
| AVX2 | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 | |
| AVX512F | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512CD AVX512_SKX | |
| AVX512CD | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512_SKX | |
| AVX512_KNL (unsupported) | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD | AVX512ER AVX512PF |
| AVX512_KNM (unsupported) | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL | AVX5124FMAPS AVX5124VNNIW AVX512VPOPCNTDQ |
| AVX512_SPR (unsupported) | SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL | AVX512FP16 |

Build report#

In most cases, the CPU build options do not produce any fatal errors that halt the build. Most of the errors that may appear in the build log serve as heavy warnings due to the lack of some expected CPU features in the compiler.

So we strongly recommend checking the final report log to be aware of which CPU features are enabled and which are not.

You can find the final report of CPU optimizations at the end of the build log. Here is how it looks on x86_64/gcc:

########### EXT COMPILER OPTIMIZATION ###########
Platform      :
  Architecture: x64
  Compiler    : gcc

CPU baseline  :
  Requested   : 'min'
  Enabled     : SSE SSE2 SSE3
  Flags       : -msse -msse2 -msse3
  Extra checks: none

CPU dispatch  :
  Requested   : 'max -xop -fma4'
  Enabled     : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated   :
              :
  SSE41       : SSE SSE2 SSE3 SSSE3
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1
  Extra checks: none
  Detect      : SSE SSE2 SSE3 SSSE3 SSE41
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_arithmetic.dispatch.c
              : numpy/_core/src/umath/_umath_tests.dispatch.c
              :
  SSE42       : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2
  Extra checks: none
  Detect      : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42
              : build/src.linux-x86_64-3.9/numpy/_core/src/_simd/_simd.dispatch.c
              :
  AVX2        : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mavx2
  Extra checks: none
  Detect      : AVX F16C AVX2
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_arithm_fp.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_arithmetic.dispatch.c
              : numpy/_core/src/umath/_umath_tests.dispatch.c
              :
  (FMA3 AVX2) : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
  Extra checks: none
  Detect      : AVX F16C FMA3 AVX2
              : build/src.linux-x86_64-3.9/numpy/_core/src/_simd/_simd.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_exponent_log.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_trigonometric.dispatch.c
              :
  AVX512F     : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f
  Extra checks: AVX512F_REDUCE
  Detect      : AVX512F
              : build/src.linux-x86_64-3.9/numpy/_core/src/_simd/_simd.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_arithm_fp.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_arithmetic.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_exponent_log.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_trigonometric.dispatch.c
              :
  AVX512_SKX  : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq
  Extra checks: AVX512BW_MASK AVX512DQ_MASK
  Detect      : AVX512_SKX
              : build/src.linux-x86_64-3.9/numpy/_core/src/_simd/_simd.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_arithmetic.dispatch.c
              : build/src.linux-x86_64-3.9/numpy/_core/src/umath/loops_exponent_log.dispatch.c
CCompilerOpt.cache_flush[804] : write cache to path -> /home/seiko/work/repos/numpy/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_ext.py

########### CLIB COMPILER OPTIMIZATION ###########
Platform      :
  Architecture: x64
  Compiler    : gcc

CPU baseline  :
  Requested   : 'min'
  Enabled     : SSE SSE2 SSE3
  Flags       : -msse -msse2 -msse3
  Extra checks: none

CPU dispatch  :
  Requested   : 'max -xop -fma4'
  Enabled     : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated   : none

There is a separate report for each of build_ext and build_clib that includes several sections, and each section has several values, representing the following:

Platform:

  • Architecture: The architecture name of the target CPU. It should be one of x86, x64, ppc64, ppc64le, armhf, aarch64, s390x or unknown.

  • Compiler: The compiler name. It should be one of gcc, clang, msvc, icc, iccw or unix-like.

CPU baseline:

  • Requested: The specific features and options passed to cpu-baseline, as-is.

  • Enabled: The final set of enabled CPU features.

  • Flags: The compiler flags that were used for all NumPy C/C++ sources during compilation, except for temporary sources that were used for generating the binary objects of dispatched features.

  • Extra checks: A list of internal checks that activate certain functionality or intrinsics related to the enabled features; useful for debugging when developing SIMD kernels.

CPU dispatch:

  • Requested: The specific features and options passed to cpu-dispatch, as-is.

  • Enabled: The final set of enabled CPU features.

  • Generated: Starting on the row after this property, the features for which optimizations have been generated are shown as several sections with similar properties, explained as follows:

    • One or multiple dispatched features: The implied CPU features.

    • Flags: The compiler flags that were used for these features.

    • Extra checks: Similar to the baseline but for these dispatched features.

    • Detect: The set of CPU features that need to be detected at runtime in order to execute the generated optimizations.

    • The lines that come after the above property and end with a ‘:’ on a separate line represent the paths of the C/C++ sources that define the generated optimizations.
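Since checking the report is recommended after every build, it can be convenient to extract the Enabled lines automatically. The sketch below is an assumption-laden illustration rather than a NumPy tool: `enabled_features` is a hypothetical helper, it assumes one `key : value` pair per line as in the report layout described above, and the sample text is an abbreviated excerpt of that report.

```python
def enabled_features(report):
    """Map 'CPU baseline' / 'CPU dispatch' to their 'Enabled' feature lists."""
    result, section = {}, None
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # banner lines and blank lines carry no "key : value"
        key, value = key.strip(), value.strip()
        if key in ("CPU baseline", "CPU dispatch"):
            section = key  # remember which section the next rows belong to
        elif key == "Enabled" and section is not None:
            result[section] = value.split()
    return result


# Abbreviated excerpt of the report format, for demonstration only.
sample = """\
########### EXT COMPILER OPTIMIZATION ###########
CPU baseline  :
  Requested   : 'min'
  Enabled     : SSE SSE2 SSE3
CPU dispatch  :
  Requested   : 'max -xop -fma4'
  Enabled     : SSSE3 SSE41 POPCNT SSE42
"""

print(enabled_features(sample))
```

If a log contains both the EXT and CLIB reports, this sketch keeps only the values of the last one it reads.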

Runtime dispatch#

Importing NumPy triggers a scan of the available CPU features from the set of dispatchable features. This can be further restricted by setting the environment variable NPY_DISABLE_CPU_FEATURES to a comma-, tab-, or space-separated list of features to disable. This will raise an error if parsing fails or if the feature was not enabled. For instance, on x86_64 this will disable AVX2 and FMA3:

NPY_DISABLE_CPU_FEATURES="AVX2,FMA3"

If the feature is not available, a warning will be emitted.
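Because the scan happens when NumPy is imported, the variable has to be in the environment before the import runs. One way to guarantee that from Python is to start a fresh interpreter with the variable already set. The following is a minimal sketch: `run_with_disabled` is a hypothetical helper, and the child process here only echoes the variable so that the example does not depend on NumPy being installed.

```python
import os
import subprocess
import sys


def run_with_disabled(features, code):
    """Run `code` in a fresh interpreter with NPY_DISABLE_CPU_FEATURES set."""
    env = dict(os.environ, NPY_DISABLE_CPU_FEATURES=",".join(features))
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )


# The child only prints the variable; a real check would `import numpy` instead.
result = run_with_disabled(
    ["AVX2", "FMA3"],
    "import os; print(os.environ['NPY_DISABLE_CPU_FEATURES'])",
)
print(result.stdout.strip())
```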

Tracking dispatched functions#

Discovering which CPU targets are enabled for different optimized functions is achievable through the Python function numpy.lib.introspect.opt_func_info. This function offers the flexibility of applying filters using two optional arguments: one for refining function names and the other for specifying data types in the signatures.

For example:

>>> import json
>>> import numpy
>>> func_info = numpy.lib.introspect.opt_func_info(func_name='add|abs', signature='float64|complex64')
>>> print(json.dumps(func_info, indent=2))
{
  "absolute": {
    "dd": {
      "current": "SSE41",
      "available": "SSE41 baseline(SSE SSE2 SSE3)"
    },
    "Ff": {
      "current": "FMA3__AVX2",
      "available": "AVX512F FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    },
    "Dd": {
      "current": "FMA3__AVX2",
      "available": "AVX512F FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    }
  },
  "add": {
    "ddd": {
      "current": "FMA3__AVX2",
      "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    },
    "FFF": {
      "current": "FMA3__AVX2",
      "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    }
  }
}
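The nested dictionary can be flattened when only the currently selected target per function and signature is of interest. This is a small sketch that assumes the structure shown in the example output (function name, then signature, then the "current" and "available" keys); `current_targets` is a hypothetical helper, and the sample data is copied from that output so the snippet runs without NumPy.

```python
def current_targets(func_info):
    """Flatten {function: {signature: {...}}} to {(function, signature): current}."""
    return {
        (func, sig): entry["current"]
        for func, sigs in func_info.items()
        for sig, entry in sigs.items()
    }


# Sample data copied from part of the example output.
sample = {
    "absolute": {
        "dd": {"current": "SSE41", "available": "SSE41 baseline(SSE SSE2 SSE3)"},
    },
    "add": {
        "ddd": {"current": "FMA3__AVX2", "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"},
    },
}

for (func, sig), target in current_targets(sample).items():
    print(f"{func}[{sig}] -> {target}")
```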