NotificationsYou must be signed in to change notification settings
Fork11k
Star29.8k

ENH: Convert comparison from C universal intrinsics to C++ using Highway#28490

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Open

ixgbe wants to merge14 commits intonumpy:main

base:main

Choose a base branch

fromixgbe:comparison_hwy

Open

ENH: Convert comparison from C universal intrinsics to C++ using Highway#28490

ixgbe wants to merge14 commits intonumpy:mainfromixgbe:comparison_hwy

Conversation

Copy link

Contributor

ixgbe commentedMar 13, 2025

No description provided.

ENH: Convert comparison from C universal intrinsics to C++ using Highway

98c0cc2

github-actionsbot added the 01 - Enhancement label

Mar 13, 2025

ixgbe added10 commits

March 13, 2025 11:06

Replace OrderedTruncate2To with OrderedDemote2To

1066b49

MAINT: Enable building loop_comparision on vector length agnostic arc…

faf1bdf

…hitectures

fix compile error[1]

46792b4

fix compile error[2]: cannot convert long unsigned int to unsigned int

62cea2a

fix test failed[1]: Non-bitwise-equal booleans test failed

2f6aae9

fix test failed[2]: Non-bitwise-equal booleans test failed

fb47971

fix test failed[3]: Non-bitwise-equal booleans test failed

365ca32

fix internal compiler error: in sign_mask, at wide-int.h:855 on ARM S…

30dcc0a

…IMD-native

fix compile error[3]:OrderedTruncate2To is not a member of hn

e2a3cc6

fix compile error[4]:OrderedTruncate2To is not a member of hn

8008e40

Copy link

ContributorAuthor

I apologize for the interruption, as I know you must be very busy. I submitted PR#28490 a while ago and was wondering if you might have time to review it when your schedule permits. I understand you're likely juggling many priorities, but I'd greatly appreciate any feedback you could provide on my contribution.

Thank you for all your work maintaining NumPy!

Best regards！

r-devulap requested a review fromseiko2plus

March 24, 2025 21:52

Copy link

Member

r-devulap commentedMar 24, 2025

@ixgbe thanks for your contribution.@seiko2plus had a few comments about using a wrapper around Highway tags to make it easier to develop template functions. I have tagged him to add his comments as well.

Copy link

Member

r-devulap commentedMar 24, 2025

From#21057: If I understood correctly, it was along the lines of

numpy/numpy/_core/src/umath/loops_comparison.dispatch.cpp

Lines 12 to 82 inb70cda0

	template<typename T>
	structOpEq {
	#if NPY_SIMD
	template<typename T_ = std::enable_if<kSupportLane<T>, T>>
	NPY_FINLINE Vec<T>operator()(const Vec<T_> &a)
	{return a; }

	template<typename T_ = std::enable_if<kSupportLane<T>, T>>
	NPY_FINLINEautooperator()(const Vec<T_> &a,const Vec<T_> &b)
	{returnEq(a, b); }
	#endif
	NPY_FINLINE Toperator()(T a)
	{return a; }

	NPY_FINLINE npy_booloperator()(T a, T b)
	{return a == b; }
	};

	template<typename T>
	structOpNe {
	#if NPY_SIMD
	template<typename T_ = std::enable_if<kSupportLane<T>, T>>
	NPY_FINLINE Vec<T>operator()(const Vec<T_> &a)
	{return a; }

	template<typename T_ = std::enable_if<kSupportLane<T>, T>>
	NPY_FINLINEautooperator()(const Vec<T_> &a,const Vec<T_> &b)
	{returnNe(a, b); }
	#endif
	NPY_FINLINE Toperator()(T a)
	{return a; }

	NPY_FINLINE npy_booloperator()(T a, T b)
	{return a != b; }
	};

	template<typename T>
	structOpLt {
	#if NPY_SIMD
	template<typename T_ = std::enable_if<kSupportLane<T>, T>>
	NPY_FINLINE Vec<T>operator()(const Vec<T_> &a)
	{return a; }

	template<typename T_ = std::enable_if<kSupportLane<T>, T>>
	NPY_FINLINEautooperator()(const Vec<T_> &a,const Vec<T_> &b)
	{returnLt(a, b); }
	#endif
	NPY_FINLINE Toperator()(T a)
	{return a; }

	NPY_FINLINE npy_booloperator()(T a, T b)
	{return a < b; }
	};

	template<typename T>
	structOpLe {
	#if NPY_SIMD
	template<typename T_ = std::enable_if<kSupportLane<T>, T>>
	NPY_FINLINE Vec<T>operator()(const Vec<T_> &a)
	{return a; }

	template<typename T_ = std::enable_if<kSupportLane<T>, T>>
	NPY_FINLINEautooperator()(const Vec<T_> &a,const Vec<T_> &b)
	{returnLe(a, b); }
	#endif
	NPY_FINLINE Toperator()(T a)
	{return a; }

	NPY_FINLINE npy_booloperator()(T a, T b)
	{return a <= b; }
	};

ixgbe added2 commits

March 26, 2025 16:38

Add Highway tag wrapper to simplify template function development

e1c3730

Fix compile error for RISC-V architecture

e07bd6e

Copy link

ContributorAuthor

ixgbe commentedMar 27, 2025

Hi@seiko2plus,
I've updated the code to wrap Highway tags as suggested (#28490 (comment)). Could you please review the implementation when you have time?
Let me know if any further adjustments are needed. Thanks!

seiko2plus reviewed

Mar 28, 2025

View reviewed changes

Copy link

Member

seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

During the last optimization meeting, I proposed a thin wrapper over Google's Highway
SIMD library to simplify its interface. The wrapper would eliminate the need for
class tags and use lane types directly, which can be deduced from the arguments in
most cases. We can also leverage namespaces for low-level register access and still
rely on lane type only.

While your last commits try to follow the stracture of the purposed C++ universal
intrinsics example still the code count on tags.

So, I suggest to use the following snippet as a reference to implement the wrapper:

We going to need two headers to implement the wrapper, the first one is thesimd.hpp
header which will provide the SIMD interface and the second one issimd.inc.hpp
which is a template header will be called multiple times with different namespaces to
provide the implementation of the SIMD interface.

Thesimd.hpp header will look like this:

Click to expand

/** * This header provides a thin wrapper over Google's Highway SIMD library. * * The wrapper aims to simplify the SIMD interface of Google's Highway by * get ride of its class tags and use lane types directly which can be deduced * from the args in most cases.*/#ifndef NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_#defineNUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_#ifndef NPY_DISABLE_OPTIMIZATION// Highway SIMD is only available when optimization is enabled#include<hwy/highway.h>#defineNPY_SIMDX1// since NPY_SIMD indicates the C universal intrinsics interface#else#defineNPY_SIMDX0#endifnamespacenp {/// Represents the max SIMD width supported by the platform.namespacesimd {#if NPY_SIMDX/// The highway namespace alias./// We can not import all the symbols from the HWY_NAMESPACE because it will/// conflict with the existing symbols in the numpy namespace.namespacehn= hwy::HWY_NAMESPACE;// internaly used by the template headertemplate<typename TLane>using _Tag = hn::ScalableTag<TLane>;#endif#include"simd.inc.hpp"}// namespace simd/// Represents the 128-bit SIMD width.namespacesimd128 {#if NPY_SIMDXnamespacehn= hwy::HWY_NAMESPACE;template<typename TLane>using _Tag = hn::Full128<TLane>;#endif#include"simd.inc.hpp"}// namespace simd128}// namespace np#endif// NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_

and thesimd.inc.hpp header will look like this:

Click to expand

#ifndef NPY_SIMDX#error "This is not defined, this is not a standalone header use simd.hpp instead"#endif// NOTE: This file is included by simd.hpp multiple times with different namespaces// so avoid include any headers here// #define NPY_SIMDX 1  // uncomment to enable Highlighting/** * Determines whether the specified lane type is supported by the SIMD extension. * Alaways defined as false when SIMD is not enabled so it can be used in SFINAE. * * @tparam TLane The lane type to check for support.*/template<typename TLane>constexprboolkSupportLane = NPY_SIMDX !=0;#if !NPY_SIMDX// defined as void when SIMD is not enabled, for SFINAE purposestemplate<typename TLane>using Vec =void;// defined as void when SIMD is not enabled, for SFINAE purposestemplate<typename TLane>using Mask =void;#endif#if NPY_SIMDX#if HWY_HAVE_FLOAT16template<>constexprboolkSupportLane<Half> =true;#endif#if HWY_HAVE_FLOAT64template<>constexprboolkSupportLane<double> =true;#endif/// Represents an N-lane vector based on the specified lane type.template<typename TLane>using Vec = hn::Vec<_Tag<TLane>>;/// Represents a mask vector with boolean values or as a bitmask.template<typename TLane>using Mask = hn::Mask<_Tag<TLane>>;/// Unaligned load of a vector from memory.template<typename TLane>HWY_API Vec<TLane>LoadU(const TLane *ptr){returnhn::LoadU(_Tag<TLane>(), ptr);}/// Unaligned store of a vector to memory.template<typename TLane>HWY_APIvoidStoreU(const Vec<TLane> &a, TLane *ptr){hn::StoreU(a, _Tag<TLane>(), ptr);}/// Returns the number of vector lanes based on the lane type.template<typename TLane>HWY_APIconstexprsize_tLanes(TLane tag =0/*optional tag*/){returnhn::Lanes(_Tag<TLane>());}/// Returns an uninitialized N-lane vector.template<typename TLane>HWY_API Vec<TLane>Undefined(TLane tag =0/*optional tag*/){returnhn::Undefined(_Tag<TLane>());}// Import common Highway intrinsics:using hn::Add;using hn::And;using hn::Mul;using hn::Not;using hn::Or;using hn::Sub;using hn::Xor;#endif// NPY_SIMDX

"@ixgbe, please hold off on applying this suggestion, even if you agree with it, until we reach consensus on the final interface design.@r-devulap,@Mousius, and@jan-wassenberg - your input on this matter would be valuable."

Note this not my final review, just add it some examples:

numpy/_core/src/umath/loops_comparison.dispatch.cpp

		#include "simd/simd.h"
		#include "loops_utils.h"
		#include "loops.h"
		#include <hwy/highway.h>

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	#include<hwy/highway.h>
	#include"simd/simd.hpp"

will be used instead, note that Highway SIMD is only available when optimization is enabled (default through meson flag)

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated


		namespace {

		namespace hn = hwy::HWY_NAMESPACE;

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	namespacehn= hwy::HWY_NAMESPACE;
	usingnamespacenp::simd;

example 0

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated

Comment on lines 12 to 65

		const hn::ScalableTag<uint8_t> u8;
		const hn::ScalableTag<int8_t> s8;
		const hn::ScalableTag<uint16_t> u16;
		const hn::ScalableTag<int16_t> s16;
		const hn::ScalableTag<uint32_t> u32;
		const hn::ScalableTag<int32_t> s32;
		const hn::ScalableTag<uint64_t> u64;
		const hn::ScalableTag<int64_t> s64;
		const hn::ScalableTag<float> f32;
		const hn::ScalableTag<double> f64;

		using vec_u8 = hn::Vec<decltype(u8)>;
		using vec_s8 = hn::Vec<decltype(s8)>;
		using vec_u16 = hn::Vec<decltype(u16)>;
		using vec_s16 = hn::Vec<decltype(s16)>;
		using vec_u32 = hn::Vec<decltype(u32)>;
		using vec_s32 = hn::Vec<decltype(s32)>;
		using vec_u64 = hn::Vec<decltype(u64)>;
		using vec_s64 = hn::Vec<decltype(s64)>;
		using vec_f32 = hn::Vec<decltype(f32)>;
		using vec_f64 = hn::Vec<decltype(f64)>;

		template<typename T>
		struct TagSelector;

		template<> struct TagSelector<uint8_t> { static const auto& value() { return u8; } };
		template<> struct TagSelector<int8_t> { static const auto& value() { return s8; } };
		template<> struct TagSelector<uint16_t> { static const auto& value() { return u16; } };
		template<> struct TagSelector<int16_t> { static const auto& value() { return s16; } };
		template<> struct TagSelector<uint32_t> { static const auto& value() { return u32; } };
		template<> struct TagSelector<int32_t> { static const auto& value() { return s32; } };
		template<> struct TagSelector<uint64_t> { static const auto& value() { return u64; } };
		template<> struct TagSelector<int64_t> { static const auto& value() { return s64; } };
		template<> struct TagSelector<float> { static const auto& value() { return f32; } };
		template<> struct TagSelector<double> { static const auto& value() { return f64; } };

		template<typename T>
		constexpr const auto& GetTag() {
		return TagSelector<T>::value();
		}

		template <typename T>
		constexpr bool kSupportLane = false;

		template <> constexpr bool kSupportLane<uint8_t> = true;
		template <> constexpr bool kSupportLane<int8_t> = true;
		template <> constexpr bool kSupportLane<uint16_t> = true;
		template <> constexpr bool kSupportLane<int16_t> = true;
		template <> constexpr bool kSupportLane<uint32_t> = true;
		template <> constexpr bool kSupportLane<int32_t> = true;
		template <> constexpr bool kSupportLane<uint64_t> = true;
		template <> constexpr bool kSupportLane<int64_t> = true;
		template <> constexpr bool kSupportLane<float> = true;
		template <> constexpr bool kSupportLane<double> = true;

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	const hn::ScalableTag<uint8_t>u8;
	const hn::ScalableTag<int8_t> s8;
	const hn::ScalableTag<uint16_t>u16;
	const hn::ScalableTag<int16_t> s16;
	const hn::ScalableTag<uint32_t>u32;
	const hn::ScalableTag<int32_t> s32;
	const hn::ScalableTag<uint64_t>u64;
	const hn::ScalableTag<int64_t> s64;
	const hn::ScalableTag<float>f32;
	const hn::ScalableTag<double>f64;

	using vec_u8 = hn::Vec<decltype(u8)>;
	using vec_s8 = hn::Vec<decltype(s8)>;
	using vec_u16 = hn::Vec<decltype(u16)>;
	using vec_s16 = hn::Vec<decltype(s16)>;
	using vec_u32 = hn::Vec<decltype(u32)>;
	using vec_s32 = hn::Vec<decltype(s32)>;
	using vec_u64 = hn::Vec<decltype(u64)>;
	using vec_s64 = hn::Vec<decltype(s64)>;
	using vec_f32 = hn::Vec<decltype(f32)>;
	using vec_f64 = hn::Vec<decltype(f64)>;

	template<typename T>
	structTagSelector;

	template<>structTagSelector<uint8_t> {staticconstauto&value() {returnu8; } };
	template<>structTagSelector<int8_t> {staticconstauto&value() {return s8; } };
	template<>structTagSelector<uint16_t> {staticconstauto&value() {returnu16; } };
	template<>structTagSelector<int16_t> {staticconstauto&value() {return s16; } };
	template<>structTagSelector<uint32_t> {staticconstauto&value() {returnu32; } };
	template<>structTagSelector<int32_t> {staticconstauto&value() {return s32; } };
	template<>structTagSelector<uint64_t> {staticconstauto&value() {returnu64; } };
	template<>structTagSelector<int64_t> {staticconstauto&value() {return s64; } };
	template<>structTagSelector<float> {staticconstauto&value() {returnf32; } };
	template<>structTagSelector<double> {staticconstauto&value() {returnf64; } };

	template<typename T>
	constexprconstauto&GetTag() {
	return TagSelector<T>::value();
	}

	template<typename T>
	constexprboolkSupportLane =false;

	template<>constexprboolkSupportLane<uint8_t> =true;
	template<>constexprboolkSupportLane<int8_t> =true;
	template<>constexprboolkSupportLane<uint16_t> =true;
	template<>constexprboolkSupportLane<int16_t> =true;
	template<>constexprboolkSupportLane<uint32_t> =true;
	template<>constexprboolkSupportLane<int32_t> =true;
	template<>constexprboolkSupportLane<uint64_t> =true;
	template<>constexprboolkSupportLane<int64_t> =true;
	template<>constexprboolkSupportLane<float> =true;
	template<>constexprboolkSupportLane<double> =true;

there will be no need for it.

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated

		struct OpEq {
		#if NPY_SIMD
		template <typename V, typename = std::enable_if_t<kSupportLane<T>>>
		HWY_INLINE HWY_ATTR auto operator()(const V &v)

Copy link

Member

seiko2plusMar 28, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	HWY_INLINEHWY_ATTRautooperator()(const V &v)
	HWY_INLINEautooperator()(const V &v)

No need for it I suppose, and even if we decided to move Highway CPU dispatcherstart/end macros can be used instead.

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated

Comment on lines 268 to 270

		const int vstep = hn::Lanes(u8);
		const size_t nlanes = hn::Lanes(GetTag<T>());
		const vec_u8 truemask = hn::Set(u8, 0x1);

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	constint vstep =hn::Lanes(u8);
	constsize_t nlanes =hn::Lanes(GetTag<T>());
	constvec_u8 truemask =hn::Set(u8,0x1);
	constint vstep = Lanes<uint8_t>();
	constsize_t nlanes = Lanes<T>();
	constVec<uint8_t> truemask = Set(uint8_t(0x1));

example 1

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated

Comment on lines 274 to 275

		auto a1 = op(hn::LoadU(GetTag<T>(), src1 + nlanes * 0));
		auto b1 = op(hn::LoadU(GetTag<T>(), src2 + nlanes * 0));

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	auto a1 = op(hn::LoadU(GetTag<T>(),src1 + nlanes *0));
	auto b1 = op(hn::LoadU(GetTag<T>(),src2 + nlanes *0));
	auto a1 = op(LoadU(src1 + nlanes *0));
	auto b1 = op(LoadU(src2 + nlanes *0));

example 2

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated

Comment on lines 290 to 291

		auto m3_vec = hn::VecFromMask(GetTag<T>(), m3);
		auto m4_vec = hn::VecFromMask(GetTag<T>(), m4);

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	auto m3_vec =hn::VecFromMask(GetTag<T>(),m3);
	auto m4_vec =hn::VecFromMask(GetTag<T>(),m4);
	auto m3_vec = VecFromMask<T>(m3);
	auto m4_vec = VecFromMask<T>(m4);

example 3

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated

		}
		}
		else {
		ret = hn::BitCast(u8, m1_vec);

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	ret =hn::BitCast(u8,m1_vec);
	ret = BitCast<uint8_t>(m1_vec);

example 4

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated

		else {
		ret = hn::BitCast(u8, m1_vec);
		}
		hn::StoreU(hn::And(ret, truemask), u8, dst);

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	hn::StoreU(hn::And(ret, truemask), u8, dst);
	StoreU(And(ret, truemask), dst);

example 5

numpy/_core/src/umath/loops_comparison.dispatch.cpp Outdated

		template <typename T=uint8_t>
		struct OpGeBool {};

		#if !defined(__s390x__) && !defined(__arm__) && !defined(__loongarch64) && !defined(__loongarch64__)

Copy link

Member

seiko2plusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

#if !defined(__s390x__) && !defined(__arm__) && !defined(__loongarch64) && !defined(__loongarch64__)

Snap for the build error, please

Copy link

ContributorAuthor

ixgbeMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Snap for the build error, please

build error: OrderedTruncate2To is not a member of hn

I use hn::OrderedTruncate2To ,however, It seems that there is no support hn::OrderedTruncate2To on S390、ARM32、loongarch64 architecture.

Copy link

Member

seiko2plusMar 28, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

OrderedTruncate2To is only available whenHWY_TARGET != HWY_SCALAR. Highway appears to be falling back to scalar implementation, which makes sense for architectures likez13 andlongarch that Highway doesn't directly support. Forarmhf andrisc-v, further investigation is needed.

To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8-bits. This mask convervisons is zero-cost only for SIMD extensions that don't provide native mask operations.

Copy link

ContributorAuthor

ixgbeMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

OrderedTruncate2To is only available whenHWY_TARGET != HWY_SCALAR. Highway appears to be falling back to scalar implementation, which makes sense for architectures likez13 andlongarch that Highway doesn't directly support. Forarmhf andrisc-v, further investigation is needed.
To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8-bits. This mask convervisons is zero-cost only for SIMD extensions that don't provide native mask operations.

Thanks for your advice! I am not familar to new Highway intrinsics to pack mask data type , could you please tell me?

Copy link

Member

MousiusMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

OrderedTruncate2To is only available whenHWY_TARGET != HWY_SCALAR. Highway appears to be falling back to scalar implementation, which makes sense for architectures likez13 andlongarch that Highway doesn't directly support. Forarmhf andrisc-v, further investigation is needed.

Don't we already have scalar implementations and therefore we should never be building withHWY_SCALAR for fear of polluting the dispatch sources?

Copy link

ContributorAuthor

ixgbeApr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

OrderedTruncate2To is only available whenHWY_TARGET != HWY_SCALAR. Highway appears to be falling back to scalar implementation, which makes sense for architectures likez13 andlongarch that Highway doesn't directly support. Forarmhf andrisc-v, further investigation is needed.
To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8-bits. This mask convervisons is zero-cost only for SIMD extensions that don't provide native mask operations.

Forrisc-v,OrderedTruncate2To is supported.

Copy link

Member

seiko2plusApr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks for your advice! I am not familar to new Highway intrinsics to pack mask data type , could you please tell me?

As far as I know, there's no available intrinsic that performs mask packing. To implement such functionality, you would need to submit a PR to the upstream repository. Once it gets merged, we can update the Highway submodule accordingly. Alternatively, you could temporarily implement it directly in the source code itself to speed up the process.

We can also postpone this step until we run benchmarks. Modern compilers may optimize out these extra unnecessary operations. In my last benchmark of universal intrinsics in C++, I observed performance gains only for 8-bit types, while higher bit-width types performed similarly to the C code. We expect the new Highway code to show similar performance improvements primarily for 8-bit data types.

Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?

I agree, meson.build need to be updates and removes z13'sVX:

numpy/numpy/_core/meson.build

Lines 916 to 922 ine07bd6e

	[
	AVX512_SKX, AVX512F, AVX2, SSE42, SSE2,
	VSX3, VSX2,
	NEON,
	VXE, VX,
	LSX,
	RVV,

We should removeavx512f and keep onlyAVX512_SKX since Highway doesn't support Xeon Phi processors (which Intel no longer supports).

LSX should remain as-is (HWY WIP), it's part of the LoongArch CPU baseline and is statically dispatched anyway.

However, we should not use Highway when scalar mode is enabled.

For risc-v, OrderedTruncate2To is supported.

Could you provide the build log (build/meson-logs/meson-log.txt)? I need to see which flags are being provided to the CPU dispatcher. Another thing that caught my attention: how were you able to enable the SIMD path for RVV under#if NPY_SIMD?NPY_SIMD is part of the C universal intrinsics and should return0 on RVV.

Copy link

ContributorAuthor

ixgbeApr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

build/meson-logs/meson-log.txt :meson-log.txt

Copy link

ContributorAuthor

ixgbeApr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks for your advice! I am not familar to new Highway intrinsics to pack mask data type , could you please tell me?
As far as I know, there's no available intrinsic that performs mask packing. To implement such functionality, you would need to submit a PR to the upstream repository. Once it gets merged, we can update the Highway submodule accordingly. Alternatively, you could temporarily implement it directly in the source code itself to speed up the process.
We can also postpone this step until we run benchmarks. Modern compilers may optimize out these extra unnecessary operations. In my last benchmark of universal intrinsics in C++, I observed performance gains only for 8-bit types, while higher bit-width types performed similarly to the C code. We expect the new Highway code to show similar performance improvements primarily for 8-bit data types.
Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?
I agree, meson.build need to be updates and removes z13'sVX:
numpy/numpy/_core/meson.build
Lines 916 to 922 ine07bd6e
[
AVX512_SKX, AVX512F, AVX2, SSE42, SSE2,
VSX3, VSX2,
NEON,
VXE, VX,
LSX,
RVV,
We should removeavx512f and keep onlyAVX512_SKX since Highway doesn't support Xeon Phi processors (which Intel no longer supports).
LSX should remain as-is (HWY WIP), it's part of the LoongArch CPU baseline and is statically dispatched anyway.
However, we should not use Highway when scalar mode is enabled.
For risc-v, OrderedTruncate2To is supported.
Could you provide the build log (build/meson-logs/meson-log.txt)? I need to see which flags are being provided to the CPU dispatcher. Another thing that caught my attention: how were you able to enable the SIMD path for RVV under#if NPY_SIMD?NPY_SIMD is part of the C universal intrinsics and should return0 on RVV.
I am mistaken, you are right. It only compiled successfully on RISC-V.

ixgbe requested a review fromseiko2plus

March 28, 2025 03:07

Copy link

Member

r-devulap commentedMar 28, 2025

During the last optimization meeting, I proposed a thin wrapper over Google's Highway
SIMD library to simplify its interface. The wrapper would eliminate the need for
class tags and use lane types directly, which can be deduced from the arguments in
most cases. We can also leverage namespaces for low-level register access and still
rely on lane type only.

Can we have Highway own this wrapper which can simplify its usage across other projects too?@jan-wassenberg

Copy link

Contributor

jan-wassenberg commentedApr 1, 2025

During the meeting, I think Sayed preferred it be in numpy. I don't have a strong opinion.

On OrderedTruncate2To: yes, that's supported on all targets except scalar. I think the issue is that compiler flags are missing the flags that explicitly ask for RVV, and the runtime dispatch is not yet workable due to compiler bugs e.g.llvm/llvm-project#56592.

Copy link

Member

seiko2plus commentedApr 1, 2025

During the meeting, I think Sayed preferred it be in numpy. I don't have a strong opinion.

I would prefer to keep it in NumPy for a while until it matures enough, and then pass it to the Highway main repository. I initiated this via PR#28622.