Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork10.9k
ENH: Convert comparison from C universal intrinsics to C++ using Highway#28490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Hello@r-devulap , I apologize for the interruption, as I know you must be very busy. I submitted PR#28490 a while ago and was wondering if you might have time to review it when your schedule permits. I understand you're likely juggling many priorities, but I'd greatly appreciate any feedback you could provide on my contribution. Thank you for all your work maintaining NumPy! Best regards! |
@ixgbe thanks for your contribution.@seiko2plus had a few comments about using a wrapper around Highway tags to make it easier to develop template functions. I have tagged him to add his comments as well. |
From#21057: If I understood correctly, it was along the lines of numpy/numpy/_core/src/umath/loops_comparison.dispatch.cpp Lines 12 to 82 inb70cda0
|
Hi@seiko2plus, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
During the last optimization meeting, I proposed a thin wrapper over Google's Highway
SIMD library to simplify its interface. The wrapper would eliminate the need for
class tags and use lane types directly, which can be deduced from the arguments in
most cases. We can also leverage namespaces for low-level register access and still
rely on lane type only.
While your last commits try to follow the stracture of the purposed C++ universal
intrinsics example still the code count on tags.
So, I suggest to use the following snippet as a reference to implement the wrapper:
We going to need two headers to implement the wrapper, the first one is thesimd.hpp
header which will provide the SIMD interface and the second one issimd.inc.hpp
which is a template header will be called multiple times with different namespaces to
provide the implementation of the SIMD interface.
Thesimd.hpp
header will look like this:
Click to expand
/** * This header provides a thin wrapper over Google's Highway SIMD library. * * The wrapper aims to simplify the SIMD interface of Google's Highway by * get ride of its class tags and use lane types directly which can be deduced * from the args in most cases.*/#ifndef NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_#defineNUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_#ifndef NPY_DISABLE_OPTIMIZATION// Highway SIMD is only available when optimization is enabled#include<hwy/highway.h>#defineNPY_SIMDX1// since NPY_SIMD indicates the C universal intrinsics interface#else#defineNPY_SIMDX0#endifnamespacenp {/// Represents the max SIMD width supported by the platform.namespacesimd {#if NPY_SIMDX/// The highway namespace alias./// We can not import all the symbols from the HWY_NAMESPACE because it will/// conflict with the existing symbols in the numpy namespace.namespacehn= hwy::HWY_NAMESPACE;// internaly used by the template headertemplate<typename TLane>using _Tag = hn::ScalableTag<TLane>;#endif#include"simd.inc.hpp"}// namespace simd/// Represents the 128-bit SIMD width.namespacesimd128 {#if NPY_SIMDXnamespacehn= hwy::HWY_NAMESPACE;template<typename TLane>using _Tag = hn::Full128<TLane>;#endif#include"simd.inc.hpp"}// namespace simd128}// namespace np#endif// NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_
and thesimd.inc.hpp
header will look like this:
Click to expand
#ifndef NPY_SIMDX#error "This is not defined, this is not a standalone header use simd.hpp instead"#endif// NOTE: This file is included by simd.hpp multiple times with different namespaces// so avoid include any headers here// #define NPY_SIMDX 1 // uncomment to enable Highlighting/** * Determines whether the specified lane type is supported by the SIMD extension. * Alaways defined as false when SIMD is not enabled so it can be used in SFINAE. * * @tparam TLane The lane type to check for support.*/template<typename TLane>constexprboolkSupportLane = NPY_SIMDX !=0;#if !NPY_SIMDX// defined as void when SIMD is not enabled, for SFINAE purposestemplate<typename TLane>using Vec =void;// defined as void when SIMD is not enabled, for SFINAE purposestemplate<typename TLane>using Mask =void;#endif#if NPY_SIMDX#if HWY_HAVE_FLOAT16template<>constexprboolkSupportLane<Half> =true;#endif#if HWY_HAVE_FLOAT64template<>constexprboolkSupportLane<double> =true;#endif/// Represents an N-lane vector based on the specified lane type.template<typename TLane>using Vec = hn::Vec<_Tag<TLane>>;/// Represents a mask vector with boolean values or as a bitmask.template<typename TLane>using Mask = hn::Mask<_Tag<TLane>>;/// Unaligned load of a vector from memory.template<typename TLane>HWY_API Vec<TLane>LoadU(const TLane *ptr){returnhn::LoadU(_Tag<TLane>(), ptr);}/// Unaligned store of a vector to memory.template<typename TLane>HWY_APIvoidStoreU(const Vec<TLane> &a, TLane *ptr){hn::StoreU(a, _Tag<TLane>(), ptr);}/// Returns the number of vector lanes based on the lane type.template<typename TLane>HWY_APIconstexprsize_tLanes(TLane tag =0/*optional tag*/){returnhn::Lanes(_Tag<TLane>());}/// Returns an uninitialized N-lane vector.template<typename TLane>HWY_API Vec<TLane>Undefined(TLane tag =0/*optional tag*/){returnhn::Undefined(_Tag<TLane>());}// Import common Highway intrinsics:using hn::Add;using hn::And;using hn::Mul;using hn::Not;using hn::Or;using hn::Sub;using hn::Xor;#endif// NPY_SIMDX
"@ixgbe, please hold off on applying this suggestion, even if you agree with it, until we reach consensus on the final interface design.@r-devulap,@Mousius, and@jan-wassenberg - your input on this matter would be valuable."
Note this not my final review, just add it some examples:
#include "simd/simd.h" | ||
#include "loops_utils.h" | ||
#include "loops.h" | ||
#include <hwy/highway.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
#include<hwy/highway.h> | |
#include"simd/simd.hpp" |
will be used instead, note that Highway SIMD is only available when optimization is enabled (default through meson flag)
namespace { | ||
namespace hn = hwy::HWY_NAMESPACE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
namespacehn= hwy::HWY_NAMESPACE; | |
usingnamespacenp::simd; |
example 0
const hn::ScalableTag<uint8_t> u8; | ||
const hn::ScalableTag<int8_t> s8; | ||
const hn::ScalableTag<uint16_t> u16; | ||
const hn::ScalableTag<int16_t> s16; | ||
const hn::ScalableTag<uint32_t> u32; | ||
const hn::ScalableTag<int32_t> s32; | ||
const hn::ScalableTag<uint64_t> u64; | ||
const hn::ScalableTag<int64_t> s64; | ||
const hn::ScalableTag<float> f32; | ||
const hn::ScalableTag<double> f64; | ||
using vec_u8 = hn::Vec<decltype(u8)>; | ||
using vec_s8 = hn::Vec<decltype(s8)>; | ||
using vec_u16 = hn::Vec<decltype(u16)>; | ||
using vec_s16 = hn::Vec<decltype(s16)>; | ||
using vec_u32 = hn::Vec<decltype(u32)>; | ||
using vec_s32 = hn::Vec<decltype(s32)>; | ||
using vec_u64 = hn::Vec<decltype(u64)>; | ||
using vec_s64 = hn::Vec<decltype(s64)>; | ||
using vec_f32 = hn::Vec<decltype(f32)>; | ||
using vec_f64 = hn::Vec<decltype(f64)>; | ||
template<typename T> | ||
struct TagSelector; | ||
template<> struct TagSelector<uint8_t> { static const auto& value() { return u8; } }; | ||
template<> struct TagSelector<int8_t> { static const auto& value() { return s8; } }; | ||
template<> struct TagSelector<uint16_t> { static const auto& value() { return u16; } }; | ||
template<> struct TagSelector<int16_t> { static const auto& value() { return s16; } }; | ||
template<> struct TagSelector<uint32_t> { static const auto& value() { return u32; } }; | ||
template<> struct TagSelector<int32_t> { static const auto& value() { return s32; } }; | ||
template<> struct TagSelector<uint64_t> { static const auto& value() { return u64; } }; | ||
template<> struct TagSelector<int64_t> { static const auto& value() { return s64; } }; | ||
template<> struct TagSelector<float> { static const auto& value() { return f32; } }; | ||
template<> struct TagSelector<double> { static const auto& value() { return f64; } }; | ||
template<typename T> | ||
constexpr const auto& GetTag() { | ||
return TagSelector<T>::value(); | ||
} | ||
template <typename T> | ||
constexpr bool kSupportLane = false; | ||
template <> constexpr bool kSupportLane<uint8_t> = true; | ||
template <> constexpr bool kSupportLane<int8_t> = true; | ||
template <> constexpr bool kSupportLane<uint16_t> = true; | ||
template <> constexpr bool kSupportLane<int16_t> = true; | ||
template <> constexpr bool kSupportLane<uint32_t> = true; | ||
template <> constexpr bool kSupportLane<int32_t> = true; | ||
template <> constexpr bool kSupportLane<uint64_t> = true; | ||
template <> constexpr bool kSupportLane<int64_t> = true; | ||
template <> constexpr bool kSupportLane<float> = true; | ||
template <> constexpr bool kSupportLane<double> = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
const hn::ScalableTag<uint8_t> u8; | |
const hn::ScalableTag<int8_t> s8; | |
const hn::ScalableTag<uint16_t> u16; | |
const hn::ScalableTag<int16_t> s16; | |
const hn::ScalableTag<uint32_t> u32; | |
const hn::ScalableTag<int32_t> s32; | |
const hn::ScalableTag<uint64_t> u64; | |
const hn::ScalableTag<int64_t> s64; | |
const hn::ScalableTag<float> f32; | |
const hn::ScalableTag<double> f64; | |
using vec_u8 = hn::Vec<decltype(u8)>; | |
using vec_s8 = hn::Vec<decltype(s8)>; | |
using vec_u16 = hn::Vec<decltype(u16)>; | |
using vec_s16 = hn::Vec<decltype(s16)>; | |
using vec_u32 = hn::Vec<decltype(u32)>; | |
using vec_s32 = hn::Vec<decltype(s32)>; | |
using vec_u64 = hn::Vec<decltype(u64)>; | |
using vec_s64 = hn::Vec<decltype(s64)>; | |
using vec_f32 = hn::Vec<decltype(f32)>; | |
using vec_f64 = hn::Vec<decltype(f64)>; | |
template<typename T> | |
structTagSelector; | |
template<>structTagSelector<uint8_t> {staticconstauto&value() {return u8; } }; | |
template<>structTagSelector<int8_t> {staticconstauto&value() {return s8; } }; | |
template<>structTagSelector<uint16_t> {staticconstauto&value() {return u16; } }; | |
template<>structTagSelector<int16_t> {staticconstauto&value() {return s16; } }; | |
template<>structTagSelector<uint32_t> {staticconstauto&value() {return u32; } }; | |
template<>structTagSelector<int32_t> {staticconstauto&value() {return s32; } }; | |
template<>structTagSelector<uint64_t> {staticconstauto&value() {return u64; } }; | |
template<>structTagSelector<int64_t> {staticconstauto&value() {return s64; } }; | |
template<>structTagSelector<float> {staticconstauto&value() {return f32; } }; | |
template<>structTagSelector<double> {staticconstauto&value() {return f64; } }; | |
template<typename T> | |
constexprconstauto&GetTag() { | |
return TagSelector<T>::value(); | |
} | |
template<typename T> | |
constexprboolkSupportLane =false; | |
template<>constexprboolkSupportLane<uint8_t> =true; | |
template<>constexprboolkSupportLane<int8_t> =true; | |
template<>constexprboolkSupportLane<uint16_t> =true; | |
template<>constexprboolkSupportLane<int16_t> =true; | |
template<>constexprboolkSupportLane<uint32_t> =true; | |
template<>constexprboolkSupportLane<int32_t> =true; | |
template<>constexprboolkSupportLane<uint64_t> =true; | |
template<>constexprboolkSupportLane<int64_t> =true; | |
template<>constexprboolkSupportLane<float> =true; | |
template<>constexprboolkSupportLane<double> =true; |
there will be no need for it.
struct OpEq { | ||
#if NPY_SIMD | ||
template <typename V, typename = std::enable_if_t<kSupportLane<T>>> | ||
HWY_INLINE HWY_ATTR auto operator()(const V &v) |
seiko2plusMar 28, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
HWY_INLINEHWY_ATTRautooperator()(const V &v) | |
HWY_INLINEautooperator()(const V &v) |
No need for it I suppose, and even if we decided to move Highway CPU dispatcherstart/end
macros can be used instead.
const int vstep = hn::Lanes(u8); | ||
const size_t nlanes = hn::Lanes(GetTag<T>()); | ||
const vec_u8 truemask = hn::Set(u8, 0x1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
constint vstep =hn::Lanes(u8); | |
constsize_t nlanes =hn::Lanes(GetTag<T>()); | |
constvec_u8 truemask =hn::Set(u8,0x1); | |
constint vstep = Lanes<uint8_t>(); | |
constsize_t nlanes = Lanes<T>(); | |
constVec<uint8_t> truemask = Set(uint8_t(0x1)); |
example 1
auto a1 = op(hn::LoadU(GetTag<T>(), src1 + nlanes * 0)); | ||
auto b1 = op(hn::LoadU(GetTag<T>(), src2 + nlanes * 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
auto a1 = op(hn::LoadU(GetTag<T>(),src1 + nlanes *0)); | |
auto b1 = op(hn::LoadU(GetTag<T>(),src2 + nlanes *0)); | |
auto a1 = op(LoadU(src1 + nlanes *0)); | |
auto b1 = op(LoadU(src2 + nlanes *0)); |
example 2
auto m3_vec = hn::VecFromMask(GetTag<T>(), m3); | ||
auto m4_vec = hn::VecFromMask(GetTag<T>(), m4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
auto m3_vec =hn::VecFromMask(GetTag<T>(),m3); | |
auto m4_vec =hn::VecFromMask(GetTag<T>(),m4); | |
auto m3_vec = VecFromMask<T>(m3); | |
auto m4_vec = VecFromMask<T>(m4); |
example 3
} | ||
} | ||
else { | ||
ret = hn::BitCast(u8, m1_vec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
ret =hn::BitCast(u8,m1_vec); | |
ret = BitCast<uint8_t>(m1_vec); |
example 4
else { | ||
ret = hn::BitCast(u8, m1_vec); | ||
} | ||
hn::StoreU(hn::And(ret, truemask), u8, dst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
hn::StoreU(hn::And(ret, truemask), u8, dst); | |
StoreU(And(ret, truemask), dst); |
example 5
template <typename T=uint8_t> | ||
struct OpGeBool {}; | ||
#if !defined(__s390x__) && !defined(__arm__) && !defined(__loongarch64) && !defined(__loongarch64__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
#if !defined(__s390x__) && !defined(__arm__) && !defined(__loongarch64) && !defined(__loongarch64__) |
Snap for the build error, please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
seiko2plusMar 28, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
OrderedTruncate2To
is only available whenHWY_TARGET != HWY_SCALAR
. Highway appears to be falling back to scalar implementation, which makes sense for architectures likez13
andlongarch
that Highway doesn't directly support. Forarmhf
andrisc-v
, further investigation is needed.
To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8-bits. This mask convervisons is zero-cost only for SIMD extensions that don't provide native mask operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
OrderedTruncate2To
is only available whenHWY_TARGET != HWY_SCALAR
. Highway appears to be falling back to scalar implementation, which makes sense for architectures likez13
andlongarch
that Highway doesn't directly support. Forarmhf
andrisc-v
, further investigation is needed.To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8-bits. This mask convervisons is zero-cost only for SIMD extensions that don't provide native mask operations.
Thanks for your advice! I am not familar to new Highway intrinsics to pack mask data type , could you please tell me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
OrderedTruncate2To
is only available whenHWY_TARGET != HWY_SCALAR
. Highway appears to be falling back to scalar implementation, which makes sense for architectures likez13
andlongarch
that Highway doesn't directly support. Forarmhf
andrisc-v
, further investigation is needed.
Don't we already have scalar implementations and therefore we should never be building withHWY_SCALAR
for fear of polluting the dispatch sources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
OrderedTruncate2To
is only available whenHWY_TARGET != HWY_SCALAR
. Highway appears to be falling back to scalar implementation, which makes sense for architectures likez13
andlongarch
that Highway doesn't directly support. Forarmhf
andrisc-v
, further investigation is needed.To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8-bits. This mask convervisons is zero-cost only for SIMD extensions that don't provide native mask operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks for your advice! I am not familar to new Highway intrinsics to pack mask data type , could you please tell me?
As far as I know, there's no available intrinsic that performs mask packing. To implement such functionality, you would need to submit a PR to the upstream repository. Once it gets merged, we can update the Highway submodule accordingly. Alternatively, you could temporarily implement it directly in the source code itself to speed up the process.
We can also postpone this step until we run benchmarks. Modern compilers may optimize out these extra unnecessary operations. In my last benchmark of universal intrinsics in C++, I observed performance gains only for 8-bit types, while higher bit-width types performed similarly to the C code. We expect the new Highway code to show similar performance improvements primarily for 8-bit data types.
Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?
I agree, meson.build need to be updates and removes z13'sVX
:
Lines 916 to 922 ine07bd6e
[ | |
AVX512_SKX, AVX512F, AVX2, SSE42, SSE2, | |
VSX3, VSX2, | |
NEON, | |
VXE, VX, | |
LSX, | |
RVV, |
We should removeavx512f
and keep onlyAVX512_SKX
since Highway doesn't support Xeon Phi processors (which Intel no longer supports).
LSX
should remain as-is (HWY WIP), it's part of the LoongArch CPU baseline and is statically dispatched anyway.
However, we should not use Highway when scalar mode is enabled.
For risc-v, OrderedTruncate2To is supported.
Could you provide the build log (build/meson-logs/meson-log.txt)? I need to see which flags are being provided to the CPU dispatcher. Another thing that caught my attention: how were you able to enable the SIMD path for RVV under#if NPY_SIMD
?NPY_SIMD
is part of the C universal intrinsics and should return0
on RVV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
build/meson-logs/meson-log.txt :meson-log.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks for your advice! I am not familar to new Highway intrinsics to pack mask data type , could you please tell me?
As far as I know, there's no available intrinsic that performs mask packing. To implement such functionality, you would need to submit a PR to the upstream repository. Once it gets merged, we can update the Highway submodule accordingly. Alternatively, you could temporarily implement it directly in the source code itself to speed up the process.
We can also postpone this step until we run benchmarks. Modern compilers may optimize out these extra unnecessary operations. In my last benchmark of universal intrinsics in C++, I observed performance gains only for 8-bit types, while higher bit-width types performed similarly to the C code. We expect the new Highway code to show similar performance improvements primarily for 8-bit data types.
Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?
I agree, meson.build need to be updates and removes z13's
VX
:Lines 916 to 922 ine07bd6e
[ AVX512_SKX, AVX512F, AVX2, SSE42, SSE2, VSX3, VSX2, NEON, VXE, VX, LSX, RVV, We should remove
avx512f
and keep onlyAVX512_SKX
since Highway doesn't support Xeon Phi processors (which Intel no longer supports).
LSX
should remain as-is (HWY WIP), it's part of the LoongArch CPU baseline and is statically dispatched anyway.However, we should not use Highway when scalar mode is enabled.
For risc-v, OrderedTruncate2To is supported.
Could you provide the build log (build/meson-logs/meson-log.txt)? I need to see which flags are being provided to the CPU dispatcher. Another thing that caught my attention: how were you able to enable the SIMD path for RVV under
#if NPY_SIMD
?NPY_SIMD
is part of the C universal intrinsics and should return0
on RVV.
I am mistaken, you are right. It only compiled successfully on RISC-V.
Can we have Highway own this wrapper which can simplify its usage across other projects too?@jan-wassenberg |
During the meeting, I think Sayed preferred it be in numpy. I don't have a strong opinion. On OrderedTruncate2To: yes, that's supported on all targets except scalar. I think the issue is that compiler flags are missing the flags that explicitly ask for RVV, and the runtime dispatch is not yet workable due to compiler bugs e.g.llvm/llvm-project#56592. |
I would prefer to keep it in NumPy for a while until it matures enough, and then pass it to the Highway main repository. I initiated this via PR#28622. |
No description provided.