ENH: Convert comparison from C universal intrinsics to C++ using Highway #28490


Open
ixgbe wants to merge 13 commits into numpy:main from ixgbe:comparison_hwy

Conversation

ixgbe
Contributor

No description provided.

@ixgbe
Contributor Author

Hello @r-devulap,

I apologize for the interruption, as I know you must be very busy. I submitted PR #28490 a while ago and was wondering if you might have time to review it when your schedule permits. I understand you're likely juggling many priorities, but I'd greatly appreciate any feedback you could provide on my contribution.

Thank you for all your work maintaining NumPy!

Best regards!

@r-devulap
Member

@ixgbe thanks for your contribution. @seiko2plus had a few comments about using a wrapper around Highway tags to make it easier to develop template functions. I have tagged him to add his comments as well.


@r-devulap
Member

From #21057: If I understood correctly, it was along the lines of

template <typename T>
struct OpEq {
#if NPY_SIMD
    template <typename T_ = std::enable_if<kSupportLane<T>, T>>
    NPY_FINLINE Vec<T> operator()(const Vec<T_> &a)
    { return a; }
    template <typename T_ = std::enable_if<kSupportLane<T>, T>>
    NPY_FINLINE auto operator()(const Vec<T_> &a, const Vec<T_> &b)
    { return Eq(a, b); }
#endif
    NPY_FINLINE T operator()(T a)
    { return a; }
    NPY_FINLINE npy_bool operator()(T a, T b)
    { return a == b; }
};
template <typename T>
struct OpNe {
#if NPY_SIMD
    template <typename T_ = std::enable_if<kSupportLane<T>, T>>
    NPY_FINLINE Vec<T> operator()(const Vec<T_> &a)
    { return a; }
    template <typename T_ = std::enable_if<kSupportLane<T>, T>>
    NPY_FINLINE auto operator()(const Vec<T_> &a, const Vec<T_> &b)
    { return Ne(a, b); }
#endif
    NPY_FINLINE T operator()(T a)
    { return a; }
    NPY_FINLINE npy_bool operator()(T a, T b)
    { return a != b; }
};
template <typename T>
struct OpLt {
#if NPY_SIMD
    template <typename T_ = std::enable_if<kSupportLane<T>, T>>
    NPY_FINLINE Vec<T> operator()(const Vec<T_> &a)
    { return a; }
    template <typename T_ = std::enable_if<kSupportLane<T>, T>>
    NPY_FINLINE auto operator()(const Vec<T_> &a, const Vec<T_> &b)
    { return Lt(a, b); }
#endif
    NPY_FINLINE T operator()(T a)
    { return a; }
    NPY_FINLINE npy_bool operator()(T a, T b)
    { return a < b; }
};
template <typename T>
struct OpLe {
#if NPY_SIMD
    template <typename T_ = std::enable_if<kSupportLane<T>, T>>
    NPY_FINLINE Vec<T> operator()(const Vec<T_> &a)
    { return a; }
    template <typename T_ = std::enable_if<kSupportLane<T>, T>>
    NPY_FINLINE auto operator()(const Vec<T_> &a, const Vec<T_> &b)
    { return Le(a, b); }
#endif
    NPY_FINLINE T operator()(T a)
    { return a; }
    NPY_FINLINE npy_bool operator()(T a, T b)
    { return a <= b; }
};
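For context, a minimal standalone sketch (hypothetical names, scalar path only) of how such an operator functor drives a single templated loop; with NPY_SIMD enabled, the vector overloads above are invoked with the same call shape:

#include <cstddef>

using npy_bool = unsigned char;  // stand-in for NumPy's npy_bool

template <typename T>
struct OpLtScalar {
    // mirrors the scalar overload of OpLt above
    npy_bool operator()(T a, T b) const { return a < b; }
};

// One loop body serves every comparison: the functor selects the operation.
template <typename T, typename Op>
void compare_loop(const T *src1, const T *src2, npy_bool *dst, std::size_t len, Op op)
{
    for (std::size_t i = 0; i < len; ++i) {
        dst[i] = op(src1[i], src2[i]);
    }
}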

@ixgbe
Contributor Author

Hi @seiko2plus,
I've updated the code to wrap Highway tags as suggested (#28490 (comment)). Could you please review the implementation when you have time?
Let me know if any further adjustments are needed. Thanks!

@seiko2plus (Member) left a comment


During the last optimization meeting, I proposed a thin wrapper over Google's Highway
SIMD library to simplify its interface. The wrapper would eliminate the need for
class tags and use lane types directly, which can be deduced from the arguments in
most cases. We can also leverage namespaces for low-level register access and still
rely on lane type only.

While your last commits try to follow the structure of the proposed C++ universal
intrinsics example, the code still relies on tags.

So, I suggest using the following snippets as a reference to implement the wrapper:

We are going to need two headers to implement the wrapper: the first one is the simd.hpp
header, which will provide the SIMD interface, and the second one is simd.inc.hpp,
a template header that will be included multiple times with different namespaces to
provide the implementation of the SIMD interface.

The simd.hpp header will look like this:

/**
 * This header provides a thin wrapper over Google's Highway SIMD library.
 *
 * The wrapper aims to simplify the SIMD interface of Google's Highway by
 * getting rid of its class tags and using lane types directly, which can be
 * deduced from the args in most cases.
 */
#ifndef NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_
#define NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_

#ifndef NPY_DISABLE_OPTIMIZATION
// Highway SIMD is only available when optimization is enabled
#include <hwy/highway.h>
#define NPY_SIMDX 1  // since NPY_SIMD indicates the C universal intrinsics interface
#else
#define NPY_SIMDX 0
#endif

namespace np {
/// Represents the max SIMD width supported by the platform.
namespace simd {
#if NPY_SIMDX
/// The highway namespace alias.
/// We can not import all the symbols from the HWY_NAMESPACE because it will
/// conflict with the existing symbols in the numpy namespace.
namespace hn = hwy::HWY_NAMESPACE;
// internally used by the template header
template <typename TLane>
using _Tag = hn::ScalableTag<TLane>;
#endif
#include "simd.inc.hpp"
} // namespace simd

/// Represents the 128-bit SIMD width.
namespace simd128 {
#if NPY_SIMDX
namespace hn = hwy::HWY_NAMESPACE;
template <typename TLane>
using _Tag = hn::Full128<TLane>;
#endif
#include "simd.inc.hpp"
} // namespace simd128
} // namespace np

#endif // NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_

and the simd.inc.hpp header will look like this:

#ifndef NPY_SIMDX
#error "This is not defined, this is not a standalone header use simd.hpp instead"
#endif
// NOTE: This file is included by simd.hpp multiple times with different namespaces
// so avoid including any headers here
// #define NPY_SIMDX 1  // uncomment to enable Highlighting

/**
 * Determines whether the specified lane type is supported by the SIMD extension.
 * Always defined as false when SIMD is not enabled so it can be used in SFINAE.
 *
 * @tparam TLane The lane type to check for support.
 */
template <typename TLane>
constexpr bool kSupportLane = NPY_SIMDX != 0;

#if !NPY_SIMDX
// defined as void when SIMD is not enabled, for SFINAE purposes
template <typename TLane>
using Vec = void;
// defined as void when SIMD is not enabled, for SFINAE purposes
template <typename TLane>
using Mask = void;
#endif

#if NPY_SIMDX
#if HWY_HAVE_FLOAT16
template <>
constexpr bool kSupportLane<Half> = true;
#endif
#if HWY_HAVE_FLOAT64
template <>
constexpr bool kSupportLane<double> = true;
#endif

/// Represents an N-lane vector based on the specified lane type.
template <typename TLane>
using Vec = hn::Vec<_Tag<TLane>>;
/// Represents a mask vector with boolean values or as a bitmask.
template <typename TLane>
using Mask = hn::Mask<_Tag<TLane>>;

/// Unaligned load of a vector from memory.
template <typename TLane>
HWY_API Vec<TLane> LoadU(const TLane *ptr)
{
    return hn::LoadU(_Tag<TLane>(), ptr);
}
/// Unaligned store of a vector to memory.
template <typename TLane>
HWY_API void StoreU(const Vec<TLane> &a, TLane *ptr)
{
    hn::StoreU(a, _Tag<TLane>(), ptr);
}
/// Returns the number of vector lanes based on the lane type.
template <typename TLane>
HWY_API constexpr size_t Lanes(TLane tag = 0 /*optional tag*/)
{
    return hn::Lanes(_Tag<TLane>());
}
/// Returns an uninitialized N-lane vector.
template <typename TLane>
HWY_API Vec<TLane> Undefined(TLane tag = 0 /*optional tag*/)
{
    return hn::Undefined(_Tag<TLane>());
}

// Import common Highway intrinsics:
using hn::Add;
using hn::And;
using hn::Mul;
using hn::Not;
using hn::Or;
using hn::Sub;
using hn::Xor;
#endif // NPY_SIMDX
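As a usage sketch (assuming the two headers above compile as shown and the file is built with optimization enabled, so NPY_SIMDX is 1; the function name and the Add example are illustrative), a call site would no longer need any tag objects:

#include <cstddef>
#include "simd/simd.hpp"  // the proposed wrapper above

void add_f32(const float *a, const float *b, float *out, std::size_t len)
{
#if NPY_SIMDX
    using namespace np::simd;
    const std::size_t nlanes = Lanes<float>();
    std::size_t i = 0;
    for (; i + nlanes <= len; i += nlanes) {
        // lane type is deduced from the pointer/vector arguments;
        // no hn::ScalableTag appears at the call site
        StoreU(Add(LoadU(a + i), LoadU(b + i)), out + i);
    }
    for (; i < len; ++i) {
        out[i] = a[i] + b[i];
    }
#else
    for (std::size_t i = 0; i < len; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}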

"@ixgbe, please hold off on applying this suggestion, even if you agree with it, until we reach consensus on the final interface design.@r-devulap,@Mousius, and@jan-wassenberg - your input on this matter would be valuable."

Note: this is not my final review; I've just added some examples:

#include "simd/simd.h"
#include "loops_utils.h"
#include "loops.h"
#include <hwy/highway.h>
Member

Suggested change
#include <hwy/highway.h>
#include "simd/simd.hpp"

will be used instead; note that Highway SIMD is only available when optimization is enabled (the default, via a meson flag)


namespace {

namespace hn = hwy::HWY_NAMESPACE;
Member

Suggested change
namespace hn = hwy::HWY_NAMESPACE;
using namespace np::simd;

example 0

Comment on lines +12 to +65
const hn::ScalableTag<uint8_t> u8;
const hn::ScalableTag<int8_t> s8;
const hn::ScalableTag<uint16_t> u16;
const hn::ScalableTag<int16_t> s16;
const hn::ScalableTag<uint32_t> u32;
const hn::ScalableTag<int32_t> s32;
const hn::ScalableTag<uint64_t> u64;
const hn::ScalableTag<int64_t> s64;
const hn::ScalableTag<float> f32;
const hn::ScalableTag<double> f64;

using vec_u8 = hn::Vec<decltype(u8)>;
using vec_s8 = hn::Vec<decltype(s8)>;
using vec_u16 = hn::Vec<decltype(u16)>;
using vec_s16 = hn::Vec<decltype(s16)>;
using vec_u32 = hn::Vec<decltype(u32)>;
using vec_s32 = hn::Vec<decltype(s32)>;
using vec_u64 = hn::Vec<decltype(u64)>;
using vec_s64 = hn::Vec<decltype(s64)>;
using vec_f32 = hn::Vec<decltype(f32)>;
using vec_f64 = hn::Vec<decltype(f64)>;

template<typename T>
struct TagSelector;

template<> struct TagSelector<uint8_t> { static const auto& value() { return u8; } };
template<> struct TagSelector<int8_t> { static const auto& value() { return s8; } };
template<> struct TagSelector<uint16_t> { static const auto& value() { return u16; } };
template<> struct TagSelector<int16_t> { static const auto& value() { return s16; } };
template<> struct TagSelector<uint32_t> { static const auto& value() { return u32; } };
template<> struct TagSelector<int32_t> { static const auto& value() { return s32; } };
template<> struct TagSelector<uint64_t> { static const auto& value() { return u64; } };
template<> struct TagSelector<int64_t> { static const auto& value() { return s64; } };
template<> struct TagSelector<float> { static const auto& value() { return f32; } };
template<> struct TagSelector<double> { static const auto& value() { return f64; } };

template<typename T>
constexpr const auto& GetTag() {
return TagSelector<T>::value();
}

template <typename T>
constexpr bool kSupportLane = false;

template <> constexpr bool kSupportLane<uint8_t> = true;
template <> constexpr bool kSupportLane<int8_t> = true;
template <> constexpr bool kSupportLane<uint16_t> = true;
template <> constexpr bool kSupportLane<int16_t> = true;
template <> constexpr bool kSupportLane<uint32_t> = true;
template <> constexpr bool kSupportLane<int32_t> = true;
template <> constexpr bool kSupportLane<uint64_t> = true;
template <> constexpr bool kSupportLane<int64_t> = true;
template <> constexpr bool kSupportLane<float> = true;
template <> constexpr bool kSupportLane<double> = true;
Member

Suggested change
const hn::ScalableTag<uint8_t> u8;
const hn::ScalableTag<int8_t> s8;
const hn::ScalableTag<uint16_t> u16;
const hn::ScalableTag<int16_t> s16;
const hn::ScalableTag<uint32_t> u32;
const hn::ScalableTag<int32_t> s32;
const hn::ScalableTag<uint64_t> u64;
const hn::ScalableTag<int64_t> s64;
const hn::ScalableTag<float> f32;
const hn::ScalableTag<double> f64;
using vec_u8 = hn::Vec<decltype(u8)>;
using vec_s8 = hn::Vec<decltype(s8)>;
using vec_u16 = hn::Vec<decltype(u16)>;
using vec_s16 = hn::Vec<decltype(s16)>;
using vec_u32 = hn::Vec<decltype(u32)>;
using vec_s32 = hn::Vec<decltype(s32)>;
using vec_u64 = hn::Vec<decltype(u64)>;
using vec_s64 = hn::Vec<decltype(s64)>;
using vec_f32 = hn::Vec<decltype(f32)>;
using vec_f64 = hn::Vec<decltype(f64)>;
template<typename T>
struct TagSelector;
template<> struct TagSelector<uint8_t> { static const auto& value() { return u8; } };
template<> struct TagSelector<int8_t> { static const auto& value() { return s8; } };
template<> struct TagSelector<uint16_t> { static const auto& value() { return u16; } };
template<> struct TagSelector<int16_t> { static const auto& value() { return s16; } };
template<> struct TagSelector<uint32_t> { static const auto& value() { return u32; } };
template<> struct TagSelector<int32_t> { static const auto& value() { return s32; } };
template<> struct TagSelector<uint64_t> { static const auto& value() { return u64; } };
template<> struct TagSelector<int64_t> { static const auto& value() { return s64; } };
template<> struct TagSelector<float> { static const auto& value() { return f32; } };
template<> struct TagSelector<double> { static const auto& value() { return f64; } };
template<typename T>
constexpr const auto& GetTag() {
    return TagSelector<T>::value();
}
template <typename T>
constexpr bool kSupportLane = false;
template <> constexpr bool kSupportLane<uint8_t> = true;
template <> constexpr bool kSupportLane<int8_t> = true;
template <> constexpr bool kSupportLane<uint16_t> = true;
template <> constexpr bool kSupportLane<int16_t> = true;
template <> constexpr bool kSupportLane<uint32_t> = true;
template <> constexpr bool kSupportLane<int32_t> = true;
template <> constexpr bool kSupportLane<uint64_t> = true;
template <> constexpr bool kSupportLane<int64_t> = true;
template <> constexpr bool kSupportLane<float> = true;
template <> constexpr bool kSupportLane<double> = true;

There will be no need for it.

struct OpEq {
#if NPY_SIMD
template <typename V, typename = std::enable_if_t<kSupportLane<T>>>
HWY_INLINE HWY_ATTR auto operator()(const V &v)
Member

@seiko2plus Mar 28, 2025 (edited)

Suggested change
HWY_INLINE HWY_ATTR auto operator()(const V &v)
HWY_INLINE auto operator()(const V &v)

No need for it, I suppose; and even if we decide to move to the Highway CPU dispatcher, its start/end macros can be used instead.
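For reference, a hedged sketch of that alternative: Highway's HWY_BEFORE_NAMESPACE()/HWY_AFTER_NAMESPACE() macros apply the target attributes to everything declared between them, so a per-function HWY_ATTR is not required (the namespace and function names below are purely illustrative):

#include <hwy/highway.h>

HWY_BEFORE_NAMESPACE();  // begin target-attributed region
namespace example {      // hypothetical namespace, for illustration only
namespace hn = hwy::HWY_NAMESPACE;

template <typename V>
HWY_INLINE auto EqualLanes(const V &a, const V &b)  // no HWY_ATTR needed here
{
    return hn::Eq(a, b);
}

}  // namespace example
HWY_AFTER_NAMESPACE();   // end target-attributed region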

Comment on lines +268 to +270
const int vstep = hn::Lanes(u8);
const size_t nlanes = hn::Lanes(GetTag<T>());
const vec_u8 truemask = hn::Set(u8, 0x1);
Member

Suggested change
const int vstep = hn::Lanes(u8);
const size_t nlanes = hn::Lanes(GetTag<T>());
const vec_u8 truemask = hn::Set(u8, 0x1);
const int vstep = Lanes<uint8_t>();
const size_t nlanes = Lanes<T>();
const Vec<uint8_t> truemask = Set(uint8_t(0x1));

example 1

Comment on lines +274 to +275
auto a1 = op(hn::LoadU(GetTag<T>(), src1 + nlanes * 0));
auto b1 = op(hn::LoadU(GetTag<T>(), src2 + nlanes * 0));
Member

Suggested change
auto a1 = op(hn::LoadU(GetTag<T>(), src1 + nlanes * 0));
auto b1 = op(hn::LoadU(GetTag<T>(), src2 + nlanes * 0));
auto a1 = op(LoadU(src1 + nlanes * 0));
auto b1 = op(LoadU(src2 + nlanes * 0));

example 2

Comment on lines +290 to +291
auto m3_vec = hn::VecFromMask(GetTag<T>(), m3);
auto m4_vec = hn::VecFromMask(GetTag<T>(), m4);
Member

Suggested change
auto m3_vec = hn::VecFromMask(GetTag<T>(), m3);
auto m4_vec = hn::VecFromMask(GetTag<T>(), m4);
auto m3_vec = VecFromMask<T>(m3);
auto m4_vec = VecFromMask<T>(m4);

example 3

}
}
else {
ret = hn::BitCast(u8, m1_vec);
Member

Suggested change
ret = hn::BitCast(u8, m1_vec);
ret = BitCast<uint8_t>(m1_vec);

example 4

else {
ret = hn::BitCast(u8, m1_vec);
}
hn::StoreU(hn::And(ret, truemask), u8, dst);
Member

Suggested change
hn::StoreU(hn::And(ret, truemask), u8, dst);
StoreU(And(ret, truemask), dst);

example 5

template <typename T=uint8_t>
struct OpGeBool {};

#if !defined(__s390x__) && !defined(__arm__) && !defined(__loongarch64) && !defined(__loongarch64__)
Member

Suggested change
#if !defined(__s390x__) && !defined(__arm__) && !defined(__loongarch64) && !defined(__loongarch64__)

A snapshot of the build error, please.

Contributor Author

A snapshot of the build error, please.

build error: OrderedTruncate2To is not a member of hn

I use hn::OrderedTruncate2To; however, it seems that hn::OrderedTruncate2To is not supported on the S390, ARM32, and loongarch64 architectures.

Member

@seiko2plus Mar 28, 2025 (edited)

OrderedTruncate2To is only available when HWY_TARGET != HWY_SCALAR. Highway appears to be falling back to the scalar implementation, which makes sense for architectures like z13 and loongarch that Highway doesn't directly support. For armhf and risc-v, further investigation is needed.

To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8 bits. This mask conversion is zero-cost only for SIMD extensions that don't provide native mask operations.
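For concreteness, a rough sketch (plain Highway, outside NumPy's dispatch framework; the 32-bit lane type and the function name are chosen only for illustration) of the conversion-then-pack path described above; a dedicated mask-packing intrinsic would avoid the VecFromMask materialization step:

#include <cstdint>
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

#if HWY_TARGET != HWY_SCALAR  // OrderedTruncate2To is unavailable on the scalar target
using D32 = hn::ScalableTag<uint32_t>;
using D16 = hn::ScalableTag<uint16_t>;
using D8  = hn::ScalableTag<uint8_t>;

// Pack four 32-bit comparison masks into one vector of bytes (0xFF/0x00 per lane).
HWY_INLINE hn::Vec<D8> PackMasks32(hn::Mask<D32> m0, hn::Mask<D32> m1,
                                   hn::Mask<D32> m2, hn::Mask<D32> m3)
{
    const D32 d32;
    const D16 d16;
    const D8  d8;
    // Step 1: materialize each mask as an all-ones/all-zeros vector
    // (free on extensions without native mask registers, an extra op elsewhere).
    const auto v0 = hn::VecFromMask(d32, m0);
    const auto v1 = hn::VecFromMask(d32, m1);
    const auto v2 = hn::VecFromMask(d32, m2);
    const auto v3 = hn::VecFromMask(d32, m3);
    // Step 2: narrow 32 -> 16 -> 8 bits while preserving lane order.
    const auto p01 = hn::OrderedTruncate2To(d16, v0, v1);
    const auto p23 = hn::OrderedTruncate2To(d16, v2, v3);
    return hn::OrderedTruncate2To(d8, p01, p23);
}
#endif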

Contributor Author

OrderedTruncate2To is only available when HWY_TARGET != HWY_SCALAR. Highway appears to be falling back to the scalar implementation, which makes sense for architectures like z13 and loongarch that Highway doesn't directly support. For armhf and risc-v, further investigation is needed.

To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8 bits. This mask conversion is zero-cost only for SIMD extensions that don't provide native mask operations.

Thanks for your advice! I am not familiar with the new Highway intrinsics for packing mask data types; could you please tell me more?

Member

OrderedTruncate2To is only available when HWY_TARGET != HWY_SCALAR. Highway appears to be falling back to the scalar implementation, which makes sense for architectures like z13 and loongarch that Highway doesn't directly support. For armhf and risc-v, further investigation is needed.

Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?

Contributor Author

OrderedTruncate2To is only available when HWY_TARGET != HWY_SCALAR. Highway appears to be falling back to the scalar implementation, which makes sense for architectures like z13 and loongarch that Highway doesn't directly support. For armhf and risc-v, further investigation is needed.

To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8 bits. This mask conversion is zero-cost only for SIMD extensions that don't provide native mask operations.

For risc-v, OrderedTruncate2To is supported.

Member

Thanks for your advice! I am not familiar with the new Highway intrinsics for packing mask data types; could you please tell me more?

As far as I know, there's no available intrinsic that performs mask packing. To implement such functionality, you would need to submit a PR to the upstream repository. Once it gets merged, we can update the Highway submodule accordingly. Alternatively, you could temporarily implement it directly in the source code itself to speed up the process.

We can also postpone this step until we run benchmarks. Modern compilers may optimize out these extra unnecessary operations. In my last benchmark of universal intrinsics in C++, I observed performance gains only for 8-bit types, while higher bit-width types performed similarly to the C code. We expect the new Highway code to show similar performance improvements primarily for 8-bit data types.

Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?

I agree, meson.build needs to be updated to remove z13's VX:

[
AVX512_SKX, AVX512F, AVX2, SSE42, SSE2,
VSX3, VSX2,
NEON,
VXE, VX,
LSX,
RVV,

We should remove avx512f and keep only AVX512_SKX since Highway doesn't support Xeon Phi processors (which Intel no longer supports).

LSX should remain as-is (HWY WIP); it's part of the LoongArch CPU baseline and is statically dispatched anyway.

However, we should not use Highway when scalar mode is enabled.

For risc-v, OrderedTruncate2To is supported.

Could you provide the build log (build/meson-logs/meson-log.txt)? I need to see which flags are being provided to the CPU dispatcher. Another thing that caught my attention: how were you able to enable the SIMD path for RVV under #if NPY_SIMD? NPY_SIMD is part of the C universal intrinsics and should return 0 on RVV.

Contributor Author

build/meson-logs/meson-log.txt: meson-log.txt

Contributor Author

Thanks for your advice! I am not familiar with the new Highway intrinsics for packing mask data types; could you please tell me more?

As far as I know, there's no available intrinsic that performs mask packing. To implement such functionality, you would need to submit a PR to the upstream repository. Once it gets merged, we can update the Highway submodule accordingly. Alternatively, you could temporarily implement it directly in the source code itself to speed up the process.

We can also postpone this step until we run benchmarks. Modern compilers may optimize out these extra unnecessary operations. In my last benchmark of universal intrinsics in C++, I observed performance gains only for 8-bit types, while higher bit-width types performed similarly to the C code. We expect the new Highway code to show similar performance improvements primarily for 8-bit data types.

Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?

I agree, meson.build needs to be updated to remove z13's VX:

[
AVX512_SKX, AVX512F, AVX2, SSE42, SSE2,
VSX3, VSX2,
NEON,
VXE, VX,
LSX,
RVV,

We should remove avx512f and keep only AVX512_SKX since Highway doesn't support Xeon Phi processors (which Intel no longer supports).

LSX should remain as-is (HWY WIP); it's part of the LoongArch CPU baseline and is statically dispatched anyway.

However, we should not use Highway when scalar mode is enabled.

For risc-v, OrderedTruncate2To is supported.

Could you provide the build log (build/meson-logs/meson-log.txt)? I need to see which flags are being provided to the CPU dispatcher. Another thing that caught my attention: how were you able to enable the SIMD path for RVV under #if NPY_SIMD? NPY_SIMD is part of the C universal intrinsics and should return 0 on RVV.

I am mistaken, you are right. It only compiled successfully on RISC-V.

@ixgbe requested a review from seiko2plus on March 28, 2025 03:07
@r-devulap
Member

During the last optimization meeting, I proposed a thin wrapper over Google's Highway
SIMD library to simplify its interface. The wrapper would eliminate the need for
class tags and use lane types directly, which can be deduced from the arguments in
most cases. We can also leverage namespaces for low-level register access and still
rely on lane type only.

Can we have Highway own this wrapper, which could simplify its usage across other projects too? @jan-wassenberg

@jan-wassenberg
Contributor

During the meeting, I think Sayed preferred it be in numpy. I don't have a strong opinion.

On OrderedTruncate2To: yes, that's supported on all targets except scalar. I think the issue is that compiler flags are missing the flags that explicitly ask for RVV, and the runtime dispatch is not yet workable due to compiler bugs, e.g. llvm/llvm-project#56592.

@seiko2plus
Member

During the meeting, I think Sayed preferred it be in numpy. I don't have a strong opinion.

I would prefer to keep it in NumPy for a while until it matures enough, and then pass it to the Highway main repository. I initiated this via PR #28622.


Reviewers

@seiko2plus: Awaiting requested review from seiko2plus

@Mousius: Awaiting requested review from Mousius

Assignees
No one assigned
Projects
None yet
Milestone
No milestone

5 participants
@ixgbe @r-devulap @jan-wassenberg @seiko2plus @Mousius
