NotificationsYou must be signed in to change notification settings
Fork11.1k
Star29.9k

ENH, SIMD: Initial implementation of Highway wrapper#28622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

Mousius merged 7 commits intonumpy:mainfromseiko2plus:hwy_wrapper

Jun 5, 2025

Merged

ENH, SIMD: Initial implementation of Highway wrapper#28622

Mousius merged 7 commits intonumpy:mainfromseiko2plus:hwy_wrapper

Jun 5, 2025

Conversation

Copy link

Member

seiko2plus commentedApr 1, 2025

A thin wrapper over Google's Highway SIMD library to simplify its interface. This commit provides the implementation of that wrapper, consisting of:

simd.hpp: Main header defining the SIMD namespaces and configuration
simd.inc.hpp: Template header included multiple times with different namespaces

The wrapper eliminates Highway's class tags by:

Using lane types directly which can be deduced from arguments
Leveraging namespaces (np::simd and np::simd128) for different register widths

A README is included to guide usage and document design decisions.

github-actionsbot added the 01 - Enhancement label

Apr 1, 2025

seiko2plus added the component: SIMDIssues in SIMD (fast instruction sets) code or machinery label

Apr 1, 2025

seiko2plus mentioned this pull request

Apr 1, 2025

ENH: Convert comparison from C universal intrinsics to C++ using Highway#28490

Open

tylerjereddy reviewed

Apr 1, 2025

View reviewed changes

Copy link

Contributor

tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I was morbidly curious and read theREADME so added minor copyedits.

numpy/_core/src/common/simd/README.md OutdatedShow resolvedHide resolved

Copy link

Contributor

ixgbe commentedApr 7, 2025

Hi, Could you share the expected timeline or next steps for merging this PR? Thanks!

Mousius requested changes

Apr 7, 2025

View reviewed changes

Copy link

Member

Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

First read, will process more over time 😸

numpy/_core/src/common/simd/README.md OutdatedShow resolvedHide resolved

numpy/_core/src/common/simd/simd.hpp Outdated

		*
		* Therefore, we only enable SIMD when the Highway target is not scalar.
		*/
		#defineNPY_SIMDX (HWY_TARGET != HWY_SCALAR)

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Doesn't this mean we'll have dispatch sources with scalar implementations of our own?

Potentially, this should raise a compile error.

Copy link

MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

When HWY_TARGET == HWY_SCALAR, there are two possible scenarios:

The SIMD extension or architecture isn't supported by Highway
Highway considers it a broken platform or compiler (which is fine if the feature is part of the baseline)

For the second case, we have two options:

Extend meson to detect this behavior
Consider it normal behavior since the compiler still has a chance to optimize scalar operations based on the compiler flags defined by meson CPU dispatcher.

numpy/_core/src/common/simd/simd.hpp Outdated

Comment on lines 34 to 43

		// Indicates if the SIMD operations are available for float16.
		#defineNPY_SIMDX_F16 (NPY_SIMDX && HWY_HAVE_FLOAT16)
		// Note: Highway requires SIMD extentions with native float32 support, so we don't need
		// to check for it.

		// Indicates if the SIMD operations are available for float64.
		#defineNPY_SIMDX_F64 (NPY_SIMDX && HWY_HAVE_FLOAT64)

		// Indicates if the SIMD floating operations are natively supports fma.
		#defineNPY_SIMDX_FMA (NPY_SIMDX && HWY_NATIVE_FMA)

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

These seem to just wrap theHWY constants withNPY_SIMDX, so if we removeNPY_SIMDX we can use theHWY variants?

Copy link

MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We can use HWY variants, but under#if NPY_SIMDX, please check design notes3. Feature Detection Constants vs. Highway Constants

numpy/_core/src/common/simd/README.mdShow resolvedHide resolved

seiko2plus force-pushed thehwy_wrapper branch fromefd5ef2 to9759ddcCompare

April 7, 2025 15:49

Copy link

MemberAuthor

seiko2plus commentedApr 7, 2025

I was morbidly curious and read the README so added minor copyedits.

@tylerjereddy, Thanks, I have addressed them. Feel free to give it a second round.

Hi, Could you share the expected timeline or next steps for merging this PR? Thanks!

@ixgbe, I'm was aiming to get feedback on the approach first that outlined in your PR#28490 but so far no strong opinions against it, then we can move forward with merging. Once this PR gets merged we can follow on your PR as fast as possible. Please, let me know if you have any specific suggestions or any concerns against this approach.

First read, will process more over time

@Mousius, Thank you! I've updated the README to cover your questions. Please give it a second read.

seiko2plus commented

Apr 7, 2025

View reviewed changes

numpy/_core/src/common/simd/README.md OutdatedShow resolvedHide resolved

seiko2plus marked this pull request as ready for review

April 7, 2025 16:20

Copy link

Contributor

ixgbe commentedApr 8, 2025

I was morbidly curious and read the README so added minor copyedits.
@tylerjereddy, Thanks, I have addressed them. Feel free to give it a second round.
Hi, Could you share the expected timeline or next steps for merging this PR? Thanks!
@ixgbe, I'm was aiming to get feedback on the approach first that outlined in your PR#28490 but so far no strong opinions against it, then we can move forward with merging. Once this PR gets merged we can follow on your PR as fast as possible. Please, let me know if you have any specific suggestions or any concerns against this approach.
First read, will process more over time
@Mousius, Thank you! I've updated the README to cover your questions. Please give it a second read.
LGTM (Looks Good To Me)! Thanks for coordinating this.

r-devulap self-assigned this

Apr 9, 2025

ixgbe reviewed

Apr 11, 2025

View reviewed changes

numpy/_core/src/common/simd/simd.inc.hppShow resolvedHide resolved

Copy link

Contributor

ixgbe commentedApr 11, 2025

when I compile numpy source code(#28608) on RVV1.0 platform, We’ve identified two pain points that require further investigation: 1、constexpr int UNROLL = (NPY_SIMD == 128 ? 4 : 2); Since we have migrated from NPY_SIMD to NPY_SIMDX, the NPY_SIMD macro is now undefined for RVV (RISC-V Vector), which will trigger compile-time errors."
2、NPY_SIMD、NPY_SIMD_F32、NPY_SIMD_F64 for RISC-V Vector are undefined. The compiler error messages are shown in the figure below. How should we resolve this? Any insights would be greatly appreciated!

Copy link

MemberAuthor

seiko2plus commentedApr 12, 2025•
edited
Loading

@ixgbe,
I have added kMaxLanes which can be used in the first case:

constexprint UNROLL = (NPY_SIMD ==128 ?4 :2);

to:

constexprintkUnrollSize =kMaxLanes<uint8_t> ==16 ?4 :2;

Forloadable_* andstoreable_*, you should avoid using them since Highway's scatter/gather intrinsics use scalar operations. However, you will need to implement another utility function to check if the stride is divisible by the size of the type.

amane-ame mentioned this pull request

Apr 15, 2025

ENH: Convert unary_fp from C universal intrinsics to C++ using Highway#28718

Open

amane-ame reviewed

Apr 16, 2025

View reviewed changes

numpy/_core/src/common/simd/simd.inc.hpp OutdatedShow resolvedHide resolved

amane-ame reviewed

Apr 17, 2025

View reviewed changes

numpy/_core/src/common/simd/simd.inc.hpp OutdatedShow resolvedHide resolved

seiko2plus force-pushed thehwy_wrapper branch 2 times, most recently from18c6085 toc7884f0Compare

April 17, 2025 14:41

r-devulap requested changes

Apr 23, 2025

View reviewed changes

Copy link

Member

r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Minor comments but mostly LGTM. I still need to wrap my head around it and will give it a second read tomorrow.

numpy/_core/src/common/simd/simd.hpp OutdatedShow resolvedHide resolved

numpy/_core/src/common/simd/simd.hppShow resolvedHide resolved

numpy/_core/src/common/simd/simd.inc.hppShow resolvedHide resolved

amane-ame mentioned this pull request

Apr 24, 2025

ENH: Convert unary_complex from C universal intrinsics to C++ using Highway#28815

Open

r-devulap reviewed

Apr 28, 2025

View reviewed changes

Copy link

Member

r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I have had some time to sit on this a little bit and here are my thoughts. As far as I understand, the main goal of having a wrapper around highway intrinsics is to get rid of the tags that highway ops require. For example, this piece of code

namespace hn = hwy::HWY_NAMESPACE;template<typename T>HWY_ATTR void add(T* src, T* dst, size_t N) {    const hn::ScalableTag<T> ftype;    hn::Vec<decltype(ftype)> v = hn::LoadU(ftype, src);    v = hn::Add(v, v);    hn::StoreU(v, ftype, dst);}

can be written without tags:

using namespace np::simd;template<typename T>void add(T* src, T* dst, size_t N) {    Vec<T> v = LoadU(src);    v = Add(v, v);    StoreU(v, dst);}

I do agree that the latter way looks cleaner and avoids having to redefine tags and highway vector for every kernel (like here:#28490 (comment)), I do have to point out couple of downsides of having this wrapper:

This adds an extra layer of abstraction for the intrinsics that needs to be maintained and potentially add an extra layer to debug.
If we want all our highway kernels to be tag free, then this wrapper needs to be expanded with time for all highway ops used which is extra work.

For example, if we want the kernel defined inloops_hyperbolic.dispatch.cpp.src to be tags free, the wrapper needs to expanded to include all these highway ops:

hn::InterleaveLowerhn::TwoTablesLookupLaneshn::RebindMaskhn::GatherIndexhn::ConcatLowerLowerhn::BitCasthn::Subhn::Mul

While the approach proposed here avoids explicit tags in Highway operations, the added abstraction layer introduces maintenance overhead without significant benefits. Directly using Highway's native tag syntax (e.g.,hn::LoadU(ftype, src)) appears sufficiently readable and not hard to write.

Here is my proposed alternative:

Centralize type definitions in simd.hpp with predefined tags/vectors
Reuse these definitions across all Highway kernels
Avoid op-specific expansions through generic implementations

This way we avoid introducing an extra abstraction layer and eliminate the need to expand wrappers for every Highway operation.

@seiko2plus Do you think this would be a simpler and more maintainable solution? ping@jan-wassenberg and@Mousius for their opinions.

Copy link

Contributor

jan-wassenberg commentedApr 28, 2025

That makes sense to me. Centralized variable definitions for the tags would avoid the (minor) problem of people choosing different names for them, and shorten the code a bit.

There was also a concern that tags complicate python bindings because many more wrapper functions would have to be generated there - one for half, quarter vectors etc. I suppose we can for purposes of that python binding only export the full-vector version that the currently proposed wrapper functions here would provide.

Copy link

MemberAuthor

seiko2plus commentedApr 29, 2025•
edited
Loading

@r-devulap,

I appreciate your thoughtful analysis. You make valid points about the trade-offs involved with this wrapper approach.
The primary motivation for this wrapper is indeed to make SIMD code more approachable by eliminating the tag-based syntax, because the current way of Highway seems to encourages users to produce more complicated meta-programming code in their final kernels.

Also, modern C++17 features allow us to introduce simpler constants to deal with SFINAE with cleaner code: e.g.,kMaxLanes<float> rather thanHWY_MAX_LANES_D<ScalableTag<float>> orkSupportLane<double> rather than#ifdef HWY_HAVE_FLOAT64.

This adds an extra layer of abstraction for the intrinsics that needs to be maintained and potentially add an extra layer to debug.

This wrapper is exceptionally thin. All functions are simple one-liners that just forward to Highway with the appropriate tag. There's very little to maintain or debug in the wrapper itself. If an issue occurs, the debugging process is straightforward - the problem will either be in the calling code or in Highway itself, with the wrapper simply passing values through. The mapping between wrapper functions and Highway functions is direct and transparent. See:

numpy/numpy/_core/src/common/simd/simd.inc.hpp

Lines 47 to 52 inc7884f0

	template<typename TLane>
	HWY_API Vec<TLane>
	LoadU(const TLane *ptr)
	{
	returnhn::LoadU(_Tag<TLane>(), ptr);
	}

If we want all our highway kernels to be tag free, then this wrapper needs to be expanded with time for all highway ops used which is extra work.

Not all Highway operations need to be functionally wrapped. Most of Highway operations are already tag-free; extra work is nothing compared to the benefit of having cleaner kernels. Developers can still access the full Highway API through thehn namespace alias that's already included in the wrapper before deciding to add a wrapper for a specific operation. Do you think this expansion process is a burden?

numpy/numpy/_core/src/common/simd/README.md

Lines 142 to 157 inc7884f0

	To add more operations from Highway:

	1. Import them in the`simd.inc.hpp` file using the`using` directive if they don't require a tag:
	```cpp
	// For operations that don't require a tag
	using hn::FunctionName;
	```

	2. Define wrapper functions for intrinsics that require a class tag:
	```cpp
	// For operations that require a tag
	template<typename TLane>
	HWY_API ReturnTypeFunctionName(Args... args) {
	return hn::FunctionName(_Tag<TLane>(), args...);
	}
	```

While the approach proposed here avoids explicit tags in Highway operations, the added abstraction layer introduces maintenance overhead without significant benefits.

The benefits are clear: code becomes more concise, more readable, less prone to errors and encourages users to produce simpler code. As we decided to replace C template sources.c.src with C++ code for simplicity, would it be wise to choose a cleaner approach even if that requires a thin wrapper to Highway that potentially will be moved to Highway itself in the end?

Here is my proposed alternative:
Centralize type definitions in simd.hpp with predefined tags/vectors
Reuse these definitions across all Highway kernels
Avoid op-specific expansions through generic implementations
This way we avoid introducing an extra abstraction layer and eliminate the need to expand wrappers for every Highway operation.

Not sure what the point of predefining tags is? Why not justhn::ScalableTag<float>{}? ortemplate<typename T> constexpr auto kScalable = hn::ScalableTag<T>{}; withtemplate <typename T> using ScalableVec = hn::Vec<hn::ScalableTag<T>>;?

Using your previous example, your suggestion would be like this:

template<typename T>voidadd(T* src, T* dst,size_t N) {// ScalableVec<T> v = hn::LoadU(kScalable<T>, src);// bad practice to use auto everywhere but can we tolerate it?auto v =hn::LoadU(kScalable<T>, src);    v =hn::Add(v, v);hn::StoreU(v,kScalable<T>, dst);}

How aboutkMaxLanes,kSupportLane, and constantsNPY_SIMDX_*? Based on your alternative, are we going to define them in simd.hpp too?

Do you think this would be a simpler and more maintainable solution?

I'm more concerned about maintaining the kernels than the wrapper itself.
With your suggestion, we still create some form of abstraction - just to avoid simple one-liner wrappers where needed.
All code involves abstraction; the question is at what level. At NumPy's scale, the thin wrapper approach is absolutely worth it.

There was also a concern that tags complicate python bindings because many more wrapper functions would have to be generated there - one for half, quarter vectors etc.
I suppose we can for purposes of that python binding only export the full-vector version that the currently proposed wrapper functions here would provide.

@jan-wassenberg, Thank you for your input. If we moved the wrapper to Highway to replacegoogle-tests withpytest, it would still be possible to usehwy::scalable::<HWY_TARGETS>,hwy::fixed128::<TARGETS>, etc. in the bindings.
pytest is dynamic enough to handle this.

Copy link

Contributor

jan-wassenberg commentedApr 29, 2025

One quick comment:

// bad practice to use auto everywhere but can we tolerate it?

I'd advise against using auto everywhere. Kernels that involve range reduction may involve multiple types (at least int32 and f32, possibly more), and I think it's helpful to see the type mentioned in the code.

Copy link

Member

Mousius commentedApr 29, 2025

This wrapper is exceptionally thin. All functions are simple one-liners that just forward to Highway with the appropriate tag. There's very little to maintain or debug in the wrapper itself.

This is potentially my most pressing question: Will it continue to be thin? It seems we'll have a lot of thin wrappers.

@jan-wassenberg do you see issues with these being contributed directly to Highway? I'd imagine a lot of the C++17 stuff is a welcome addition to the macro variants?

I'm not seeing anything that is NumPy-centric or needs to be NumPy-centric just yet, and it appears we already know roughly what we'd like to contribute.

Not all Highway operations need to be functionally wrapped. Most of the Highway operations are already tag-free; extra work is nothing compared to the benefit of having cleaner kernels.

I'd also suggest that the extra work may not be necessary if it's this minimal for users. I prefer@r-devulap's ideas of centralised tags for NumPy types and explicit typing in the SIMD kernels.

The benefits are clear: code becomes more concise, more readable, less prone to errors and encourages users to produce simpler code.

This could be debated. As@jan-wassenberg has mentioned above, it is valuable to see the types and conversions explicitly in SIMD kernels to avoid errors. The code itself may contain fewer characters, but that doesn't necessarily correlate with being easier to read. Personally, I've found the verbosity of the SIMD kernels justified due to their details.

Copy link

MemberAuthor

seiko2plus commentedApr 29, 2025

@jan-wassenberg,

I'd advise against using auto everywhere. Kernels that involve range reduction may involve multiple types (at least int32 and f32, possibly more), and I think it's helpful to see the type mentioned in the code.

Totally agree with you as mentioned "bad practice". Based on@r-devulap's idea of using centralized tags, the most simplified example would be like:

template<typename T>voidadd(T* src, T* dst,size_t N) {    ScalableVec<T> v =hn::LoadU(kScalable<T>, src);    v =hn::Add(v, v);hn::StoreU(v,kScalable<T>, dst);}

And that's why I believe simplifying Highway will encourages users to produce more clearer and simple code.

@Mousius,

Will it continue to be thin? It seems we'll have a lot of thin wrappers.

Yes definitely. We're just erasing tags and using modern C++ features. We're in 2025 - if we have a chance to make our meta-programming much simpler by leveraging modern C++ features, we should take it. Highway had legacy reasons to support C++11, but we don't.

I'm not seeing anything that is NumPy-centric or needs to be NumPy-centric just yet, and it appears we already know roughly what we'd like to contribute.

I agree that most of this wrapper isn't NumPy-specific and could benefit the broader Highway ecosystem. That's why it would be great to move it to Highway as we agreed before.

This could be debated. As@jan-wassenberg has mentioned above, it is valuable to see the types and conversions explicitly in SIMD kernels to avoid errors. The code itself may contain fewer characters, but that doesn't necessarily correlate with being easier to read. Personally, I've found the verbosity of the SIMD kernels justified due to their details.

I think you misunderstood my comment#28622 (comment). Let me explain with more details:

@r-devulap suggested using centralized tags, and my response was "there's no need for them, what's the point of having them?" Instead, we can use template variables (introduced in C++14) rather than explicitly defining each tag (if we decide to tolerate tags).

So instead of explicitly defining each tag:

const hn::ScalableTag<float>f32;const hn::ScalableTag<int32_t> s32;

we can define a template variable like this:

template<typename T>constexprautokScalable = hn::ScalableTag<T>{};

And by relying on the lane type, you can define any tag simply like this:

kScalable<float>;// equivalent to f32kScalable<int32_t>;// equivalent to s32

Copy link

Member

r-devulap commentedApr 29, 2025•
edited
Loading

Instead, we can use template variables (introduced in C++14) rather than explicitly defining each tag (if we decide to tolerate tags).

template<typename T> constexpr auto kScalable = hn::ScalableTag<T>{};template<typename T>void add(T* src, T* dst, size_t N) {    ScalableVec<T> v = hn::LoadU(kScalable<T>, src);    v = hn::Add(v, v);    hn::StoreU(v, kScalable<T>, dst);}

@seiko2plus this sounds good to me. My preference would be to tolerate tags in highway ops and avoid having a wrapper around them. Thin or not, it is still additional code we have to maintain and expand over time and I personally think doesn't add much value. As for the tags and vector themselves, a central definition that kernels can draw from sounds perfect.

How about kMaxLanes, kSupportLane, and constants NPY_SIMDX_*? Based on your alternative, are we going to define them in simd.hpp too?

I think so. These are useful to be defined in a central place that all the highway kernels can use.

Copy link

Member

Mousius commentedApr 30, 2025

I agree that most of this wrapper isn't NumPy-specific and could benefit the broader Highway ecosystem. That's why it would be great to move it to Highway as we agreed before.

I'm sorry, my question was probably misformed. What prevents us from contributing this directly to Highway now?

Instead, we can use template variables (introduced in C++14) rather than explicitly defining each tag (if we decide to tolerate tags).

Couldn't we also useScalableTag<float> everywhere, which is two characters more thankScalable<float>?

Copy link

MemberAuthor

seiko2plus commentedMay 2, 2025

I personally think doesn't add much value

It's different preferences, I guess. To me, counting on ADL and being able to force change SIMD width with just a single line when needed is a great advantage.

Thin or not, it is still additional code

But we are going to transferring the additional code from the kernels to the wrapper.

What prevents us from contributing this directly to Highway now?

Well, once it becomes mature enough we can pass it. New ideas may emerge anytime - why should we worry about backward compatibility at this stage when other projects aren't yet depending on it?

Couldn't we also use ScalableTag everywhere, which is two characters more than kScalable?

ScalableTag is the type, whilekScalable would be the tag instance. Yes, we could use the type directly with extra characters() or{}.

Copy link

Member

r-devulap commentedMay 2, 2025

It's different preferences, I guess. To me, counting on ADL and being able to force change SIMD width with just a single line when needed is a great advantage.

@seiko2plus if you find it advantageous and feel strongly about it, I don't mind putting this in. As you said in an earlier comment: developers can still access the full Highway API through thehn namespace alias.

Copy link

Member

r-devulap commentedMay 2, 2025

@seiko2plus Can we make this change before merging?https://github.com/numpy/numpy/pull/28622/files#r2057043403

r-devulap requested a review fromCopilot

May 5, 2025 03:47

CopilotAI reviewed

May 5, 2025

View reviewed changes

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

This PR introduces a thin Highway SIMD wrapper for NumPy by implementing a simplified interface through new headers and documentation.

Updated loops_trigonometric.dispatch.cpp to include the new SIMD wrapper header and adjust array allocation.
Added simd.hpp and simd.inc.hpp to provide lightweight abstractions over Highway.
Included a README to document design decisions and usage.

Reviewed Changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated no comments.

File	Description
numpy/_core/src/umath/loops_trigonometric.dispatch.cpp	Updated SIMD function with new header inclusion and buffer size adjustments.
numpy/_core/src/common/simd/simd.inc.hpp	Introduces template implementations and exposes Highway operations in an anonymous namespace.
numpy/_core/src/common/simd/simd.hpp	Provides configuration macros and namespace-based wrappers for Highway SIMD functionality.
numpy/_core/src/common/simd/README.md	Documents the design, usage, and configuration details of the new SIMD wrapper.

Files not reviewed (2)

numpy/_core/src/highway: Language not supported
numpy/_core/src/umath/loops_hyperbolic.dispatch.cpp.src: Language not supported

Comments suppressed due to low confidence (2)

numpy/_core/src/umath/loops_trigonometric.dispatch.cpp:6

The new header simd/simd.hpp is included alongside simd/simd.h. Verify whether both are necessary or if consolidation is possible to avoid potential conflicts or redundancy.

#include "simd/simd.hpp"

numpy/_core/src/umath/loops_trigonometric.dispatch.cpp:189

Switching from hn::Lanes(f32) to hn::MaxLanes(f32) alters the allocated buffer size. Confirm that this change is intentional and that the larger buffer size matches the expected SIMD lane count without introducing unnecessary overhead.

float NPY_DECL_ALIGNED(NPY_SIMD_WIDTH) ip_fback[hn::MaxLanes(f32)];

Copy link

Member

Mousius commentedMay 5, 2025

Discussing in the Optimisation team meeting:

#define NPY_SIMDX (HWY_TARGET != HWY_SCALAR) will always beTrue as Highway is going to defaultEMU128 which will allow for compiler auto-vectorisation by default. ThereforeNPY_SIMDX is likely unnecessary?
Highway is happy to host the wrapper to remove lane tags under contrib given it's fairly clear why it exists and why it's useful, which would save us hosting it in NumPy. We should contribute it directly undercontrib for Highway which equally allows for it to evolve further if necessary?

This was referencedMay 7, 2025

ENH: Convert unary_fp_le from C universal intrinsics to C++ using Highway#28919

Open

ENH: Initial implementation of Highway wrapperplctlab/numpy#12

Merged

Copy link

MemberAuthor

seiko2plus commentedMay 17, 2025•
edited
Loading

@seiko2plus Can we make this change before merging?https://github.com/numpy/numpy/pull/28622/files#r2057043403

Done.

#define NPY_SIMDX (HWY_TARGET != HWY_SCALAR) will always be True as Highway is going to default EMU128 which will allow for compiler auto-vectorisation by default. Therefore NPY_SIMDX is likely unnecessary?

NPY_HWY (NPY_SIMDX) is needed for two important reasons: to completely disable optimization via disable-optimization and to clear the baseline features viacpu-baseline=none. This is important because Highway will still enableSSE3 by default even ifSSE flags aren't provided, sinceSSE3 is enabled by default ongcc/clang. By the way,Highway seems doesn't correctly emulate severalSSE41 elementary intrinsics, but this isn't a concern since PR#28896 will raise the baseline to x86-64-v2 (level 2).

Highway is happy to host the wrapper to remove lane tags under contrib given it's fairly clear why it exists and why it's useful, which would save us hosting it in NumPy. We should contribute it directly under contrib for Highway which equally allows for it to evolve further if necessary?

As I mentioned before, I'd prefer to use it within NumPy first rather than moving it to Highway until it's mature enough. Hosting in NumPy will make it easier to make changes as we refine the implementation.

amane-ame mentioned this pull request

May 18, 2025

ENH: Convert minmax from C universal intrinsics to C++ using Highway#29001

Open

Copy link

Member

Mousius commentedMay 19, 2025•
edited
Loading

NPY_HWY (NPY_SIMDX) is needed for two important reasons: to completely disable optimization via disable-optimization and to clear the baseline features viacpu-baseline=none. This is important because Highway will still enableSSE3 by default even ifSSE flags aren't provided, sinceSSE3 is enabled by default ongcc/clang. By the way,Highway seems doesn't correctly emulate severalSSE41 elementary intrinsics, but this isn't a concern since PR#28896 will raise the baseline to x86-64-v2 (level 2).

I believe this is already possible by only enablingEMU128? I don't see why we needNPY_SIMDX for this.

As I mentioned before, I'd prefer to use it within NumPy first rather than moving it to Highway until it's mature enough. Hosting in NumPy will make it easier to make changes as we refine the implementation.

I think this is unnecessary extra work, as it's simple to contribute this to Highway directly without creating our own mini-framework straight away. In my mind, this is setting a precedent that we don't try to work with Highway in the first instance.@jan-wassenberg has already said he has no issues and it's a small change. However, if@r-devulap is happy to merge this, then I will get out of the way.

seiko2plus added7 commits

May 19, 2025 17:58

ENH, SIMD: Initial implementation of Highway wrapper

92f2622

A thin wrapper over Google's Highway SIMD library to simplify its interface.This commit provides the implementation of that wrapper, consisting of:- simd.hpp: Main header defining the SIMD namespaces and configuration- simd.inc.hpp: Template header included multiple times with different namespacesThe wrapper eliminates Highway's class tags by:- Using lane types directly which can be deduced from arguments- Leveraging namespaces (np::simd and np::simd128) for different register widthsA README is included to guide usage and document design decisions.

SIMD: Update wrapper with improved docs and type support

5d48ec3

  - Fix hardware/platform terminology in documentation for clarity  - Add support for long double in template specializations  - Add kMaxLanes constant to expose maximum vector width information  - Follows clang formatting style for consistency with NumPy codebase.

SIMD: Improve isolation and constexpr handling in wrapper

e099bba

  - Add anonymous namespace around implementation to ensure each    translation unit gets its own constants based on local flags  - Use HWY_LANES_CONSTEXPR for Lanes function to ensure proper    constexpr evaluation across platforms

Update Highway submodule to latest master

1fd2076

SIMD: Fix compile error by using MaxLanes instead of Lanes for array …

80e6a30

…size  Replace hn::Lanes(f64) with hn::MaxLanes(f64) when defining the index array  size to fix error C2131: "expression did not evaluate to a constant". This error  occurs because Lanes() isn't always constexpr compatible, especially with  scalable vector extensions. MaxLanes() provides a compile-time constant value  suitable for static array allocation and should be used with non-scalable SIMD  extensions when defining fixed-size arrays.

SIMD: Rename NPY_SIMDX to NPY_HWY

6519b28

Rename Highway wrapper macros for clarity:- NPY_SIMDX → NPY_HWY- NPY_SIMDX_F16 → NPY_HWY_F16- NPY_SIMDX_F64 → NPY_HWY_F64- NPY_SIMDX_FMA → NPY_HWY_FMATo avoids confusion with legacy SIMD macros.

Disable Highway EMU128 scalar emulation

275e45c

Skip Highway's EMU128 in favor of NumPy's scalar implementations fordue to strict IEEE 754 floating-point compliance requirements

seiko2plus force-pushed thehwy_wrapper branch from0f719d4 to275e45cCompare

May 19, 2025 15:29

Copy link

MemberAuthor

seiko2plus commentedMay 19, 2025

I believe this is already possible by only enabling EMU128? I don't see why we need NPY_SIMDX for this.

When the optiondisable-optimization is set to true, we should disable all CPU optimized code including dispatch, SIMD, and loop unrolling.
After a quick look at theEMU128 implementation, I realized that we should disable it too to avoid any kind of fast math operations. Also, compiling withfp-strict may not lead to auto-vectorization and may end up with register spilling. Lets stay safe for now, till we get done from moving to Highway.

Copy link

Member

Mousius commentedMay 20, 2025

When the optiondisable-optimization is set to true, we should disable all CPU optimized code including dispatch, SIMD, and loop unrolling.
After a quick look at theEMU128 implementation, I realized that we should disable it too to avoid any kind of fast math operations. Also, compiling withfp-strict may not lead to auto-vectorization and may end up with register spilling. Lets stay safe for now, till we get done from moving to Highway.

We already haveNPY_DISABLE_OPTIMIZATION to completely disable the optimised kernels?

I'm not sure whether a lack of optimisation inEMU128 is hugely worrying given it's the fallback.