# NumPy SIMD Wrapper for Highway

This directory contains a lightweight C++ wrapper over Google's [Highway](https://github.com/google/highway) SIMD library, designed specifically for NumPy's needs.

> **Note**: This directory also contains the C interface of universal intrinsics (under `simd.h`), which is no longer supported. The Highway wrapper described in this document should be used instead for all new SIMD code.
|
## Overview

The wrapper simplifies Highway's SIMD interface by eliminating class tags and using lane types directly, which can be deduced from the arguments in most cases. This design makes SIMD code more intuitive and easier to maintain while still leveraging Highway's generic intrinsics.

## Architecture

The wrapper consists of two main headers:

1. `simd.hpp`: The main header that defines the namespaces and includes the configuration macros
2. `simd.inc.hpp`: Implementation details, included by `simd.hpp` multiple times for different namespaces

Additionally, this directory contains legacy C interface files for universal intrinsics (`simd.h` and related files), which are deprecated and should not be used for new code. All new SIMD code should use the Highway wrapper.

## Usage

### Basic Usage

```cpp
#include "simd/simd.hpp"

float *data = /* ... */;

// Keep each using-directive in its own scope: both namespaces declare
// Vec, LoadU, etc., so opening both in one scope makes those names ambiguous.
{
    // Use np::simd for maximum-width SIMD operations
    using namespace np::simd;
    Vec<float> v = LoadU(data);
    v = Add(v, v);
    StoreU(v, data);
}

{
    // Use np::simd128 for fixed 128-bit SIMD operations
    using namespace np::simd128;
    Vec<float> v128 = LoadU(data);
    v128 = Add(v128, v128);
    StoreU(v128, data);
}
```
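
The snippet above processes a single vector. A kernel over an array of arbitrary length usually advances by the lane count on each iteration and finishes the remainder with scalar code. The sketch below shows that loop shape only. Because the real headers need Highway, it uses a hypothetical scalar stand-in namespace, `toy`, whose `Vec` always has 4 lanes; the `Lanes<float>()` call form is an assumption about the wrapper, not a confirmed signature.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar stand-in for np::simd: a 4-lane "vector" backed by
// std::array. It only mimics the call shapes of LoadU, Add, StoreU and Lanes.
namespace toy {
template <typename TLane>
struct Vec {
    std::array<TLane, 4> lanes;
};

template <typename TLane>
constexpr std::size_t Lanes() { return 4; }

template <typename TLane>
Vec<TLane> LoadU(const TLane *ptr) {
    Vec<TLane> v{};
    for (std::size_t i = 0; i < 4; ++i) v.lanes[i] = ptr[i];
    return v;
}

template <typename TLane>
Vec<TLane> Add(const Vec<TLane> &a, const Vec<TLane> &b) {
    Vec<TLane> r{};
    for (std::size_t i = 0; i < 4; ++i) r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

template <typename TLane>
void StoreU(const Vec<TLane> &v, TLane *ptr) {
    for (std::size_t i = 0; i < 4; ++i) ptr[i] = v.lanes[i];
}
}  // namespace toy

// Doubles every element of data: whole vectors first, then a scalar tail.
void DoubleInPlace(float *data, std::size_t len) {
    using namespace toy;
    const std::size_t step = Lanes<float>();
    std::size_t i = 0;
    for (; i + step <= len; i += step) {
        Vec<float> v = LoadU(data + i);
        StoreU(Add(v, v), data + i);
    }
    for (; i < len; ++i) {  // the last len % step elements
        data[i] += data[i];
    }
}
```

With 10 elements, the vector loop covers the first 8 and the scalar tail handles the last 2.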
|
### Checking for SIMD Support

```cpp
#include "simd/simd.hpp"

// Check if SIMD is enabled
#if NPY_SIMDX
    // SIMD code
#else
    // Scalar fallback code
#endif

// Check for float64 support
#if NPY_SIMDX_F64
    // Use float64 SIMD operations
#endif

// Check for FMA support
#if NPY_SIMDX_FMA
    // Use FMA operations
#endif
```

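
These guards can be nested inside one kernel. The sketch below picks a multiply-add strategy at compile time: a fused path when FMA is available, an unfused path otherwise, and a scalar fallback when SIMD is disabled. To compile without NumPy, it uses hypothetical macros `DEMO_SIMDX` and `DEMO_SIMDX_FMA` in place of `NPY_SIMDX` and `NPY_SIMDX_FMA`, and plain scalar loops as placeholders where vector intrinsics would go.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-ins for NPY_SIMDX and NPY_SIMDX_FMA so this compiles
// without NumPy; override with -DDEMO_SIMDX=0 or -DDEMO_SIMDX_FMA=0.
#ifndef DEMO_SIMDX
#define DEMO_SIMDX 1
#endif
#ifndef DEMO_SIMDX_FMA
#define DEMO_SIMDX_FMA 1
#endif

// Computes out[i] = a[i] * b[i] + c[i].
void MulAddArrays(const double *a, const double *b, const double *c,
                  double *out, std::size_t len) {
#if DEMO_SIMDX
  #if DEMO_SIMDX_FMA
    // Placeholder for a vector FMA path (fused multiply-add).
    for (std::size_t i = 0; i < len; ++i) out[i] = std::fma(a[i], b[i], c[i]);
  #else
    // Placeholder for a vector path using a separate multiply and add.
    for (std::size_t i = 0; i < len; ++i) out[i] = a[i] * b[i] + c[i];
  #endif
#else
    // Scalar fallback, compiled when SIMD is disabled.
    for (std::size_t i = 0; i < len; ++i) out[i] = a[i] * b[i] + c[i];
#endif
}
```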
## Type Support and Constraints

The wrapper provides type constraints to help with SFINAE (Substitution Failure Is Not An Error) and compile-time type checking:

- `kSupportLane<TLane>`: Determines whether the specified lane type is supported by the SIMD extension.
  ```cpp
  // Base template - always defined, even when SIMD is not enabled (for SFINAE)
  template <typename TLane>
  constexpr bool kSupportLane = NPY_SIMDX != 0;
  template <>
  constexpr bool kSupportLane<double> = NPY_SIMDX_F64 != 0;
  ```

For example, inside a function:

```cpp
#include "simd/simd.hpp"

void Kernel()  // illustrative function name
{
    // Check if float64 operations are supported
    if constexpr (np::simd::kSupportLane<double>) {
        // Use float64 operations
    }
}
```

These constraints allow compile-time checks of which lane types are supported; in SFINAE contexts they can enable or disable functions based on type support.

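
To illustrate the SFINAE use described above, here is a minimal sketch that selects between two overloads depending on lane support. The variable template `kDemoSupportLane` is a hypothetical stand-in for `np::simd::kSupportLane`, hard-coded as if float64 SIMD were unavailable; the `Path` enum and `SelectPath` function are likewise illustrative names.

```cpp
#include <type_traits>

// Hypothetical stand-in for np::simd::kSupportLane, hard-coded as if the
// target had SIMD for most lane types but not for double (NPY_SIMDX_F64 == 0).
template <typename TLane>
constexpr bool kDemoSupportLane = true;
template <>
constexpr bool kDemoSupportLane<double> = false;

enum class Path { kSimd, kScalar };

// Participates in overload resolution only for supported lane types.
template <typename TLane, std::enable_if_t<kDemoSupportLane<TLane>, int> = 0>
Path SelectPath() {
    return Path::kSimd;  // a real kernel would use np::simd here
}

// Participates only for unsupported lane types.
template <typename TLane, std::enable_if_t<!kDemoSupportLane<TLane>, int> = 0>
Path SelectPath() {
    return Path::kScalar;  // scalar fallback
}
```

Exactly one overload survives substitution for any lane type, so callers write `SelectPath<T>()` without any preprocessor checks of their own.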
## Available Operations

The wrapper provides the following common operations used in NumPy:

- Vector creation operations:
  - `Zero`: Returns a vector with all lanes set to zero
  - `Set`: Returns a vector with all lanes set to the given value
  - `Undefined`: Returns an uninitialized vector

- Memory operations:
  - `LoadU`: Unaligned load of a vector from memory
  - `StoreU`: Unaligned store of a vector to memory

- Vector information:
  - `Lanes`: Returns the number of vector lanes based on the lane type

- Type conversion:
  - `BitCast`: Reinterprets a vector as a different type without modifying the underlying data
  - `VecFromMask`: Converts a mask to a vector

- Comparison operations (these return masks):
  - `Eq`: Element-wise equality comparison
  - `Le`: Element-wise less-than-or-equal comparison
  - `Lt`: Element-wise less-than comparison
  - `Gt`: Element-wise greater-than comparison
  - `Ge`: Element-wise greater-than-or-equal comparison

- Arithmetic operations:
  - `Add`: Element-wise addition
  - `Sub`: Element-wise subtraction
  - `Mul`: Element-wise multiplication
  - `Div`: Element-wise division
  - `Min`: Element-wise minimum
  - `Max`: Element-wise maximum
  - `Abs`: Element-wise absolute value
  - `Sqrt`: Element-wise square root

- Logical operations:
  - `And`: Bitwise AND
  - `Or`: Bitwise OR
  - `Xor`: Bitwise XOR
  - `AndNot`: Bitwise AND NOT (`~a & b`, following Highway's operand order)

Additional Highway operations can be accessed via the `hn` namespace alias inside the `simd` or `simd128` namespaces.

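
Comparisons, `VecFromMask`, and the bitwise operations are often combined to select lanes without branching. The sketch below emulates that pattern on four `int32_t` lanes using hypothetical stand-ins named after the operations above (`mask_demo::Gt`, `mask_demo::VecFromMask`, `mask_demo::And`, `mask_demo::AndNot`). Real Highway masks are opaque types, and the wrapper's exact signatures for `Zero` and `VecFromMask` are not reproduced here. `AndNot` follows Highway's documented order, `~a & b`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 4-lane int32 emulation of a few operations listed above.
// A lane of a mask vector is either all bits set (-1) or all bits clear (0).
namespace mask_demo {
using Vec = std::array<std::int32_t, 4>;
using Mask = std::array<bool, 4>;

inline Vec Zero() { return Vec{}; }

inline Mask Gt(const Vec &a, const Vec &b) {
    Mask m{};
    for (std::size_t i = 0; i < 4; ++i) m[i] = a[i] > b[i];
    return m;
}

inline Vec VecFromMask(const Mask &m) {
    Vec r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = m[i] ? -1 : 0;
    return r;
}

inline Vec And(const Vec &a, const Vec &b) {
    Vec r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] & b[i];
    return r;
}

// Highway's operand order: returns ~a & b.
inline Vec AndNot(const Vec &a, const Vec &b) {
    Vec r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = ~a[i] & b[i];
    return r;
}
}  // namespace mask_demo

// Keeps lanes greater than zero and zeroes the rest (a ReLU) via mask and AND.
inline mask_demo::Vec KeepPositive(const mask_demo::Vec &v) {
    using namespace mask_demo;
    const Vec mask = VecFromMask(Gt(v, Zero()));
    return And(mask, v);
}

// The complement: keeps lanes that are not greater than zero.
inline mask_demo::Vec KeepNonPositive(const mask_demo::Vec &v) {
    using namespace mask_demo;
    const Vec mask = VecFromMask(Gt(v, Zero()));
    return AndNot(mask, v);
}
```

Passing the same mask as the first argument of `And` and `AndNot` splits a vector into two disjoint halves, which is why the operand order of `AndNot` matters.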
## Extending

To add more operations from Highway:

1. Import them in the `simd.inc.hpp` file with a `using` declaration if they don't require a tag:
   ```cpp
   // For operations that don't require a tag
   using hn::FunctionName;
   ```

2. Define wrapper functions for intrinsics that require a class tag:
   ```cpp
   // For operations that require a tag: the wrapper builds the tag that
   // matches the enclosing namespace from the lane type and forwards the rest
   template <typename TLane, typename... Args>
   HWY_API auto FunctionName(Args... args) {
       return hn::FunctionName(_Tag<TLane>(), args...);
   }
   ```

3. Add appropriate documentation and SFINAE constraints if needed
|
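The two extension steps can be seen side by side in a miniature model. In the sketch below, the hypothetical `raw` namespace plays the role of Highway (a tag-taking `Zero` and a tag-free `Add`), and the hypothetical `wrap` namespace plays the role of `simd.inc.hpp`: it re-exports `Add` with a using-declaration (step 1) and forwards `Zero` with a tag built from the lane type (step 2). `DemoTag` stands in for the wrapper's `_Tag`; none of these names are real Highway APIs.

```cpp
#include <array>
#include <cstddef>

// Hypothetical miniature library in the role of Highway: Zero needs a tag
// object (nothing else tells it the lane type); Add does not.
namespace raw {
template <typename T>
struct Tag128 {};  // empty tag: carries only the lane type and a 128-bit width

template <typename T>
struct Vec128 {
    std::array<T, 16 / sizeof(T)> lanes;
};

template <typename T>
Vec128<T> Zero(Tag128<T>) {
    return Vec128<T>{};  // value-initialized, so every lane is 0
}

template <typename T>
Vec128<T> Add(const Vec128<T> &a, const Vec128<T> &b) {
    Vec128<T> r{};
    for (std::size_t i = 0; i < r.lanes.size(); ++i) {
        r.lanes[i] = a.lanes[i] + b.lanes[i];
    }
    return r;
}
}  // namespace raw

// Hypothetical wrapper namespace in the role of simd.inc.hpp.
namespace wrap {
template <typename TLane>
using DemoTag = raw::Tag128<TLane>;  // the namespace fixes the tag type

template <typename TLane>
using Vec = raw::Vec128<TLane>;

// Step 1: no tag required, so a using-declaration is enough.
using raw::Add;

// Step 2: tag required, so forward with a tag built from TLane.
template <typename TLane>
Vec<TLane> Zero() {
    return raw::Zero(DemoTag<TLane>());
}
}  // namespace wrap
```

Callers then write `wrap::Zero<float>()` and `wrap::Add(a, b)` without ever naming a tag.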
|
## Build Configuration

The SIMD wrapper automatically disables SIMD operations when optimizations are disabled or no SIMD target is available:

- When `NPY_DISABLE_OPTIMIZATION` is defined, SIMD operations are disabled
- SIMD is enabled only when the Highway target is not scalar (`HWY_TARGET != HWY_SCALAR`)
|
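The rules above amount to a small chain of preprocessor conditions. The sketch below is an assumption-based reconstruction, not the actual contents of `simd.hpp`: it derives a hypothetical `DEMO_SIMDX` from `DEMO_DISABLE_OPTIMIZATION` and a scalar-target flag, then gates the feature macros on it, in the same style as the `NPY_SIMDX_F16` definition in the design notes. All `DEMO_*` names are stand-ins for the NumPy and Highway macros.

```cpp
// Hypothetical inputs standing in for NPY_DISABLE_OPTIMIZATION (from the
// build) and for Highway's target macros; override any of them with -D.
#ifndef DEMO_TARGET_SCALAR
#define DEMO_TARGET_SCALAR 0  // assume a non-scalar Highway target
#endif
#ifndef DEMO_HAVE_FLOAT64
#define DEMO_HAVE_FLOAT64 1   // assume the target has float64 vectors
#endif
#ifndef DEMO_HAVE_FMA
#define DEMO_HAVE_FMA 0       // assume the target lacks native FMA
#endif

// SIMD is on only if optimization is not disabled and the target is not scalar.
#if !defined(DEMO_DISABLE_OPTIMIZATION) && !DEMO_TARGET_SCALAR
#define DEMO_SIMDX 1
#else
#define DEMO_SIMDX 0
#endif

// Feature macros are gated on DEMO_SIMDX as well as on the target capability.
#define DEMO_SIMDX_F64 (DEMO_SIMDX && DEMO_HAVE_FLOAT64)
#define DEMO_SIMDX_FMA (DEMO_SIMDX && DEMO_HAVE_FMA)

constexpr bool kDemoSimd = DEMO_SIMDX != 0;
constexpr bool kDemoF64 = DEMO_SIMDX_F64 != 0;
constexpr bool kDemoFma = DEMO_SIMDX_FMA != 0;
```

Defining `DEMO_DISABLE_OPTIMIZATION` (or setting `DEMO_TARGET_SCALAR` to 1) turns all three constants to `false`, whatever the capability flags say.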
## Design Notes
|
1. **Why avoid Highway scalar operations?**
   - NumPy already provides kernels for scalar operations
   - Compilers can better optimize standard library implementations
   - Not all Highway intrinsics are fully supported in scalar mode
|
2. **Legacy Universal Intrinsics**
   - The older universal intrinsics C interface (in `simd.h` and accessible via `NPY_SIMD` macros) is deprecated
   - All new SIMD code should use this Highway-based wrapper (accessible via `NPY_SIMDX` macros)
   - The legacy code is maintained for compatibility but will eventually be removed
|
3. **Feature Detection Constants vs. Highway Constants**
   - NumPy-specific constants (`NPY_SIMDX_F16`, `NPY_SIMDX_F64`, `NPY_SIMDX_FMA`) provide additional safety beyond raw Highway constants
   - Highway constants (e.g., `HWY_HAVE_FLOAT16`) only reflect the target's hardware capabilities and don't consider NumPy's build configuration
   - Our constants combine both checks:
     ```cpp
     #define NPY_SIMDX_F16 (NPY_SIMDX && HWY_HAVE_FLOAT16)
     ```
   - This ensures SIMD features won't be used when:
     - The hardware supports them but NumPy optimization is disabled via the meson option:
       ```
       option('disable-optimization', type: 'boolean', value: false,
              description: 'Disable CPU optimized code (dispatch,simd,unroll...)')
       ```
     - The Highway target is scalar (`HWY_TARGET == HWY_SCALAR`)
   - Using these constants ensures consistent behavior across different compilation settings
   - Without this additional layer, code might incorrectly try to use SIMD paths in scalar mode
|
4. **Namespace Design**
   - `np::simd`: Maximum-width SIMD operations (scalable)
   - `np::simd128`: Fixed 128-bit SIMD operations
   - `hn`: Highway namespace alias (available within the SIMD namespaces)
|
5. **Why Namespaces and Why Not Just Use Highway Directly?**
   - Highway's design uses class tag types as template parameters (e.g., `Vec<ScalableTag<float>>`) when defining vector types
   - Many Highway functions require explicitly passing a tag instance as an argument (usually the first)
   - This class tag-based approach increases verbosity and complexity in user code
   - Our wrapper eliminates this by managing tags internally through namespaces, letting users use lane types directly, e.g. `Vec<float>`
   - Simple example with raw Highway:
     ```cpp
     // Highway's approach
     float *data = /* ... */;

     namespace hn = hwy::HWY_NAMESPACE;
     using namespace hn;

     // Full-width operations
     ScalableTag<float> df;                  // Create a tag instance
     Vec<decltype(df)> v = LoadU(df, data);  // LoadU requires a tag instance
     StoreU(v, df, data);                    // StoreU requires a tag instance

     // 128-bit operations
     Full128<float> df128;                            // Create a 128-bit tag instance
     Vec<decltype(df128)> v128 = LoadU(df128, data);  // LoadU requires a tag instance
     StoreU(v128, df128, data);                       // StoreU requires a tag instance
     ```
|
   - Simple example with our wrapper:
     ```cpp
     // Our wrapper approach
     float *data = /* ... */;

     // Full-width operations
     {
         using namespace np::simd;
         Vec<float> v = LoadU(data);  // Full-width vector load
         StoreU(v, data);
     }

     // 128-bit operations (in a separate scope to avoid ambiguous names)
     {
         using namespace np::simd128;
         Vec<float> v128 = LoadU(data);  // 128-bit vector load
         StoreU(v128, data);
     }
     ```

   - The namespaced approach simplifies code, reduces errors, and provides a more intuitive interface
   - It preserves the benefits of Highway's operations while reducing cognitive overhead
|
6. **Why Are Namespaces Essential for This Design?**
   - Namespaces allow us to define different internal tag types (`hn::ScalableTag<TLane>` in `np::simd` vs `hn::Full128<TLane>` in `np::simd128`)
   - This provides a consistent type-based interface (`Vec<float>`) without requiring users to manually create tags
   - They enable using the same function names (like `LoadU`) with different implementations based on SIMD width
   - Without namespaces, we'd have to either reintroduce tags (defeating the purpose of the wrapper) or create different function names for each variant (e.g., `LoadU` vs `LoadU128`)

7. **Template Type Parameters**
   - `TLane`: The scalar type for each vector lane (e.g., `uint8_t`, `float`, `double`)
|
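The namespace mechanism itself can be modeled in a few lines. In the sketch below, the hypothetical `DemoTag`/`DemoVec` templates stand in for Highway's tag and vector types, and the full-width namespace is fixed at 256 bits purely for illustration (Highway's `ScalableTag` width depends on the target). Each namespace pins its tag width once, so callers write `Vec<float>` and `LoadU(ptr)` and the width follows from the namespace in use.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for Highway's tag and vector types.
template <typename T, std::size_t kBytes>
struct DemoTag {
    static constexpr std::size_t kLanes = kBytes / sizeof(T);
};

template <typename T, std::size_t kBytes>
struct DemoVec {
    std::array<T, kBytes / sizeof(T)> lanes;
};

// Tag-taking load, shaped like Highway's LoadU(d, ptr).
template <typename T, std::size_t kBytes>
DemoVec<T, kBytes> RawLoadU(DemoTag<T, kBytes>, const T *ptr) {
    DemoVec<T, kBytes> v{};
    for (std::size_t i = 0; i < v.lanes.size(); ++i) v.lanes[i] = ptr[i];
    return v;
}

// Each namespace pins its tag width once. In NumPy the per-namespace body comes
// from including simd.inc.hpp repeatedly; here it is simply written twice.
namespace demo_simd {  // "full width", fixed at 256 bits for illustration
template <typename TLane> using Tag = DemoTag<TLane, 32>;
template <typename TLane> using Vec = DemoVec<TLane, 32>;
template <typename TLane> constexpr std::size_t Lanes() { return Tag<TLane>::kLanes; }
template <typename TLane> Vec<TLane> LoadU(const TLane *p) { return RawLoadU(Tag<TLane>(), p); }
}  // namespace demo_simd

namespace demo_simd128 {  // fixed 128 bits
template <typename TLane> using Tag = DemoTag<TLane, 16>;
template <typename TLane> using Vec = DemoVec<TLane, 16>;
template <typename TLane> constexpr std::size_t Lanes() { return Tag<TLane>::kLanes; }
template <typename TLane> Vec<TLane> LoadU(const TLane *p) { return RawLoadU(Tag<TLane>(), p); }
}  // namespace demo_simd128
```

Switching a kernel between full width and 128 bits then only changes which namespace it opens; the call sites stay the same.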
|
## Requirements

- C++17 or later
- Google Highway library
|
## License

Same as NumPy's license.