google/highway

Performance-portable, length-agnostic SIMD with runtime dispatch

Efficient and performance-portable vector software

Highway is a C++ library that provides portable SIMD/vector intrinsics.

Documentation

Previously licensed under Apache 2, now dual-licensed as Apache 2 / BSD-3.

Why

We are passionate about high-performance software. We see major untapped potential in CPUs (servers, mobile, desktops). Highway is for engineers who want to reliably and economically push the boundaries of what is possible in software.

How

CPUs provide SIMD/vector instructions that apply the same operation to multiple data items. This can reduce energy usage e.g. fivefold because fewer instructions are executed. We also often see 5-10x speedups.

Highway makes SIMD/vector programming practical and workable according to these guiding principles:

Does what you expect: Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. The resulting code is more predictable and robust to code changes/compiler updates than autovectorization.

Works on widely-used platforms: Highway supports five architectures; the same application code can target various instruction sets, including those with 'scalable' vectors (size unknown at compile time). Highway only requires C++11 and supports four families of compilers. If you would like to use Highway on other platforms, please raise an issue.

Flexible to deploy: Applications using Highway can run on heterogeneous clouds or client devices, choosing the best available instruction set at runtime. Alternatively, developers may choose to target a single instruction set without any runtime overhead. In both cases, the application code is the same except for swapping HWY_STATIC_DISPATCH with HWY_DYNAMIC_DISPATCH plus one line of code. See also @kfjahnke's introduction to dispatching.

Suitable for a variety of domains: Highway provides an extensive set of operations, used for image processing (floating-point), compression, video analysis, linear algebra, cryptography, sorting and random generation. We recognise that new use-cases may require additional ops and are happy to add them where it makes sense (e.g. no performance cliffs on some architectures). If you would like to discuss, please file an issue.

Rewards data-parallel design: Highway provides tools such as Gather, MaskedLoad, and FixedTag to enable speedups for legacy data structures. However, the biggest gains are unlocked by designing algorithms and data structures for scalable vectors. Helpful techniques include batching, structure-of-array layouts, and aligned/padded allocations.

We recommend these resources for getting started:

Examples

Online demos using Compiler Explorer:

multiple targets with dynamic dispatch (more complicated, but flexible and uses best available SIMD)
single target using -m flags (simpler, but requires/only uses the instruction set enabled by compiler flags)

We observe that Highway is referenced in the following open source projects, found via sourcegraph.com. Most are GitHub repositories. If you would like to add your project or link to it directly, feel free to raise an issue or contact us via the below email.

Audio: Zimtohrli perceptual metric
Browsers: Chromium (+Vivaldi), Firefox (+floorp / foxhound / librewolf / Waterfox)
Computational biology: RNA analysis
Computer graphics: Sparse voxel renderer
Cryptography: google/distributed_point_functions, google/shell-encryption
Data structures: bkille/BitLib
Image codecs: eustas/2im, Grok JPEG 2000, JPEG XL, JPEGenc, Jpegli, OpenHTJ2K
Image processing: cloudinary/ssimulacra2, m-ab-s/media-autobuild_suite, libvips
Image viewers: AlienCowEatCake/ImageViewer, diffractor/diffractor, mirillis/jpegxl-wic, Lux panorama/image viewer
Information retrieval: iresearch database index, michaeljclark/zvec, nebula interactive analytics / OLAP, ScaNN Scalable Nearest Neighbors, vectorlite vector search
Machine learning: gemma.cpp, TensorFlow, NumPy, zpye/SimpleInfer
Robotics: MIT Model-Based Design and Verification

Other

Evaluation of C++ SIMD Libraries: "Highway excelled with a strong performance across multiple SIMD extensions [..]. Thus, Highway may currently be the most suitable SIMD library for many software projects."
zimt: C++11 template library to process n-dimensional arrays with multi-threaded SIMD code
vectorized Quicksort (paper)

If you'd like to get Highway, in addition to cloning from this GitHub repository or using it as a Git submodule, you can also find it in the following package managers or repositories:

alpinelinux
conan-io
conda-forge
DragonFlyBSD
fd00/yacp
freebsd
getsolus/packages
ghostbsd
microsoft/vcpkg
MidnightBSD
MSYS2
NetBSD
openSUSE
opnsense
Xilinx/Vitis_Libraries
xmake-io/xmake-repo

Current status

Targets

Highway supports 24 targets, listed in alphabetical order of platform:

Any: EMU128, SCALAR;
Armv7+: NEON_WITHOUT_AES, NEON, NEON_BF16, SVE, SVE2, SVE_256, SVE2_128;
IBM Z: Z14, Z15;
POWER: PPC8 (v2.07), PPC9 (v3.0), PPC10 (v3.1B, not yet supported due to compiler bugs, see #1207; also requires QEMU 7.2);
RISC-V: RVV (1.0);
WebAssembly: WASM, WASM_EMU256 (a 2x unrolled version of wasm128, enabled if HWY_WANT_WASM2 is defined. This will remain supported until it is potentially superseded by a future version of WASM.);
x86:
- SSE2
- SSSE3 (~Intel Core)
- SSE4 (~Nehalem, also includes AES + CLMUL)
- AVX2 (~Haswell, also includes BMI2 + F16 + FMA)
- AVX3 (~Skylake, AVX-512F/BW/CD/DQ/VL)
- AVX3_DL (~Icelake, includes BitAlg + CLMUL + GFNI + VAES + VBMI + VBMI2 + VNNI + VPOPCNT)
- AVX3_ZEN4 (AVX3_DL plus BF16, optimized for AMD Zen4; requires opt-in by defining HWY_WANT_AVX3_ZEN4 if compiling for static dispatch, but enabled by default for runtime dispatch)
- AVX3_SPR (~Sapphire Rapids, includes AVX-512FP16)

Our policy is that unless otherwise specified, targets will remain supported as long as they can be (cross-)compiled with currently supported Clang or GCC, and tested using QEMU. If the target can be compiled with LLVM trunk and tested using our version of QEMU without extra flags, then it is eligible for inclusion in our continuous testing infrastructure. Otherwise, the target will be manually tested before releases with selected versions/configurations of Clang and GCC.

SVE was initially tested using farm_sve (see acknowledgments).

Versioning

Highway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH), incrementing MINOR after backward-compatible additions and PATCH after backward-compatible fixes. We recommend using releases (rather than the Git tip) because they are tested more extensively, see below.

The current version 1.0 signals an increased focus on backwards compatibility. Applications using documented functionality will remain compatible with future updates that have the same major version number.

Testing

Continuous integration tests build with a recent version of Clang (running on native x86, or QEMU for RISC-V and Arm) and MSVC 2019 (v19.28, running on native x86).

Before releases, we also test on x86 with Clang and GCC, and Armv7/8 via GCC cross-compile. See the testing process for details.

Related modules

The contrib directory contains SIMD-related utilities: an image class with aligned rows, a math library (16 functions already implemented, mostly trigonometry), and functions for computing dot products and sorting.

Other libraries

If you only require x86 support, you may also use Agner Fog's VCL vector class library. It includes many functions including a complete math library.

If you have existing code using x86/NEON intrinsics, you may be interested in SIMDe, which emulates those intrinsics using other platforms' intrinsics or autovectorization.

Installation

This project uses CMake to generate and build. On a Debian-based system you can install it via:

sudo apt install cmake

Highway's unit tests use googletest. By default, Highway's CMake downloads this dependency at configuration time. You can avoid this by setting the HWY_SYSTEM_GTEST CMake variable to ON and installing gtest separately:

sudo apt install libgtest-dev

Alternatively, you can define HWY_TEST_STANDALONE=1 and remove all occurrences of gtest_main in each BUILD file; the tests then avoid the dependency on GUnit.

Running cross-compiled tests requires support from the OS, which on Debian is provided by the qemu-user-binfmt package.

To build Highway as a shared or static library (depending on BUILD_SHARED_LIBS), the standard CMake workflow can be used:

mkdir -p build && cd build
cmake ..
make -j && make test

Or you can run run_tests.sh (run_tests.bat on Windows).

Bazel is also supported for building, but it is not as widely used/tested.

When building for Armv7, a limitation of current compilers requires you to add -DHWY_CMAKE_ARM7:BOOL=ON to the CMake command line; see #834 and #1032. We understand that work is underway to remove this limitation.

Building on 32-bit x86 is not officially supported, and AVX2/3 are disabled by default there. Note that johnplatts has successfully built and run the Highway tests on 32-bit x86, including AVX2/3, on GCC 7/8 and Clang 8/11/12. On Ubuntu 22.04, Clang 11 and 12, but not later versions, require extra compiler flags -m32 -isystem /usr/i686-linux-gnu/include. Clang 10 and earlier require the above plus -isystem /usr/i686-linux-gnu/include/c++/12/i686-linux-gnu. See #1279.

Building highway - Using vcpkg

highway is now available in vcpkg

vcpkg install highway

The highway port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Quick start

You can use the benchmark inside examples/ as a starting point.

A quick-reference page briefly lists all operations and their parameters, and the instruction_matrix indicates the number of instructions per operation.

The FAQ answers questions about portability, API design and where to find more information.

We recommend using full SIMD vectors whenever possible for maximum performance portability. To obtain them, pass a ScalableTag<float> (or equivalently HWY_FULL(float)) tag to functions such as Zero/Set/Load. There are two alternatives for use-cases requiring an upper bound on the lanes:

For up to N lanes, specify CappedTag<T, N> or the equivalent HWY_CAPPED(T, N). The actual number of lanes will be N rounded down to the nearest power of two, such as 4 if N is 5, or 8 if N is 8. This is useful for data structures such as a narrow matrix. A loop is still required because vectors may actually have fewer than N lanes.
For exactly a power of two N lanes, specify FixedTag<T, N>. The largest supported N depends on the target, but is guaranteed to be at least 16/sizeof(T).
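The rounding rule for capped tags can be modeled in plain C++. The sketch below does not use the Highway API. RoundDownToPowerOfTwo and CappedLanes are hypothetical helpers introduced only for illustration, and the assumption that the result is also limited by the full vector's lane count is inferred from the remark above that vectors may have fewer than N lanes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Highway): largest power of two <= n.
inline std::size_t RoundDownToPowerOfTwo(std::size_t n) {
  assert(n >= 1);
  std::size_t p = 1;
  while (p <= n / 2) p *= 2;  // n / 2 avoids overflow of p * 2
  return p;
}

// Hypothetical model of the lane count for a cap of max_lanes when a full
// vector on the current target holds full_lanes lanes (assumed power of two).
inline std::size_t CappedLanes(std::size_t max_lanes, std::size_t full_lanes) {
  assert(full_lanes >= 1);
  const std::size_t capped = RoundDownToPowerOfTwo(max_lanes);
  // The vector can never be larger than what the target provides, which is
  // why a loop over the data is still required.
  return capped < full_lanes ? capped : full_lanes;
}
```

For example, a cap of 5 on a target with 16 lanes yields 4 lanes under this model, whereas a cap of 8 on a target with only 4 lanes yields 4, so covering 8 elements would take two iterations.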

Due to ADL restrictions, user code calling Highway ops must either:

Reside inside namespace hwy { namespace HWY_NAMESPACE {; or
prefix each op with an alias such as namespace hn = hwy::HWY_NAMESPACE; hn::Add(); or
add using-declarations for each op used: using hwy::HWY_NAMESPACE::Add;.

Additionally, each function that calls Highway ops (such as Load) must either be prefixed with HWY_ATTR, or reside between HWY_BEFORE_NAMESPACE() and HWY_AFTER_NAMESPACE(). Lambda functions currently require HWY_ATTR before their opening brace.

Do not use namespace-scope or static initializers for SIMD vectors because this can cause SIGILL when using runtime dispatch and the compiler chooses an initializer compiled for a target not supported by the current CPU. Instead, constants initialized via Set should generally be local (const) variables.

The entry points into code using Highway differ slightly depending on whether they use static or dynamic dispatch. In both cases, we recommend that the top-level function receives one or more pointers to arrays, rather than target-specific vector types.

For static dispatch, HWY_TARGET will be the best available target among HWY_BASELINE_TARGETS, i.e. those allowed for use by the compiler (see quick-reference). Functions inside HWY_NAMESPACE can be called using HWY_STATIC_DISPATCH(func)(args) within the same module they are defined in. You can call the function from other modules by wrapping it in a regular function and declaring the regular function in a header.
For dynamic dispatch, a table of function pointers is generated via the HWY_EXPORT macro that is used by HWY_DYNAMIC_DISPATCH(func)(args) to call the best function pointer for the current CPU's supported targets. A module is automatically compiled for each target in HWY_TARGETS (see quick-reference) if HWY_TARGET_INCLUDE is defined and foreach_target.h is included. Note that the first invocation of HWY_DYNAMIC_DISPATCH, or each call to the pointer returned by the first invocation of HWY_DYNAMIC_POINTER, involves some CPU detection overhead. You can prevent this by calling the following before any invocation of HWY_DYNAMIC_*: hwy::GetChosenTarget().Update(hwy::SupportedTargets());.
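To make the idea of a function-pointer table concrete, here is a minimal sketch in plain C++ that does not use Highway at all. The target bits, the kTable array, ChooseIndex, CachedChoice and DispatchSum are all hypothetical names invented for this illustration; they only loosely mirror what HWY_EXPORT and HWY_DYNAMIC_DISPATCH are described as doing above, and the three per-target functions are identical scalar stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical target bits for this sketch; bit 0 is the most capable target.
constexpr std::uint32_t kBest = 1u << 0;
constexpr std::uint32_t kMid = 1u << 1;
constexpr std::uint32_t kBaseline = 1u << 2;
constexpr std::size_t kNumTargets = 3;
constexpr std::size_t kNotChosen = kNumTargets;

using SumFunc = float (*)(const float*, std::size_t);

// Stand-ins for per-target builds of the same function.
inline float SumScalar(const float* p, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) sum += p[i];
  return sum;
}
inline float SumBest(const float* p, std::size_t n) { return SumScalar(p, n); }
inline float SumMid(const float* p, std::size_t n) { return SumScalar(p, n); }
inline float SumBaseline(const float* p, std::size_t n) { return SumScalar(p, n); }

// One entry per compiled target, best first.
constexpr SumFunc kTable[kNumTargets] = {SumBest, SumMid, SumBaseline};

// Returns the index of the best table entry whose bit is set in `supported`.
inline std::size_t ChooseIndex(std::uint32_t supported) {
  for (std::size_t i = 0; i < kNumTargets; ++i) {
    if (supported & (1u << i)) return i;
  }
  return kNumTargets - 1;  // assume the baseline always works
}

// Remembers the choice so that detection happens at most once.
struct CachedChoice {
  std::size_t index = kNotChosen;
};

inline float DispatchSum(CachedChoice& choice, std::uint32_t supported,
                         const float* p, std::size_t n) {
  if (choice.index == kNotChosen) {
    choice.index = ChooseIndex(supported);  // only the first call pays this
  }
  return kTable[choice.index](p, n);
}
```

Setting choice.index ahead of time plays the same role in this sketch as updating the chosen target before the first dispatch: later calls go straight to the table lookup.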

See also a separate introduction to dynamic dispatch by @kfjahnke.

When using dynamic dispatch, foreach_target.h is included from translation units (.cc files), not headers. Headers containing vector code shared between several translation units require a special include guard, for example the following taken from examples/skeleton-inl.h:

#if defined(HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_) == defined(HWY_TARGET_TOGGLE)
#ifdef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#undef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#else
#define HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#endif

#include "hwy/highway.h"
// Your vector code
#endif

By convention, we name such headers -inl.h because their contents (often function templates) are usually inlined.

Compiler flags

Applications should be compiled with optimizations enabled. Without inlining, SIMD code may slow down by factors of 10 to 100. For clang and GCC, -O2 is generally sufficient.

For MSVC, we recommend compiling with /Gv to allow non-inlined functions to pass vector arguments in registers. If intending to use the AVX2 target together with half-width vectors (e.g. for PromoteTo), it is also important to compile with /arch:AVX2. This seems to be the only way to reliably generate VEX-encoded SSE instructions on MSVC. Sometimes MSVC generates VEX-encoded SSE instructions, if they are mixed with AVX, but not always, see DevCom-10618264. Otherwise, mixing VEX-encoded AVX2 instructions and non-VEX SSE may cause severe performance degradation. Unfortunately, with the /arch:AVX2 option, the resulting binary will then require AVX2. Note that no such flag is needed for clang and GCC because they support target-specific attributes, which we use to ensure proper VEX code generation for AVX2 targets.

Strip-mining loops

When vectorizing a loop, an important question is whether and how to deal with a number of iterations ('trip count', denoted count) that is not a multiple of the vector size N = Lanes(d). For example, it may be necessary to avoid writing past the end of an array.

In this section, let T denote the element type and d = ScalableTag<T>. Assume the loop body is given as a function template<bool partial, class D> void LoopBody(D d, size_t index, size_t max_n).

"Strip-mining" is a technique for vectorizing a loop by transforming it into an outer loop and inner loop, such that the number of iterations in the inner loop matches the vector width. Then, the inner loop is replaced with vector operations.

Highway offers several strategies for loop vectorization:

Ensure all inputs/outputs are padded. Then the (outer) loop is simply
```
for (size_t i = 0; i < count; i += N) LoopBody<false>(d, i, 0);
```
Here, the template parameter and third function argument are not needed.
This is the preferred option, unless N is in the thousands and vector operations are pipelined with long latencies. This was the case for supercomputers in the 90s, but nowadays ALUs are cheap and we see most implementations split vectors into 1, 2 or 4 parts, so there is little cost to processing entire vectors even if we do not need all their lanes. Indeed this avoids the (potentially large) cost of predication or partial loads/stores on older targets, and does not duplicate code.
Process whole vectors and include previously processed elements in the last vector:
```
for (size_t i = 0; i < count; i += N) LoopBody<false>(d, HWY_MIN(i, count - N), 0);
```
This is the second preferred option provided that count >= N and LoopBody is idempotent. Some elements might be processed twice, but a single code path and full vectorization is usually worth it. Even if count < N, it usually makes sense to pad inputs/outputs up to N.
Use the Transform* functions in hwy/contrib/algo/transform-inl.h. This takes care of the loop and remainder handling and you simply define a generic lambda function (C++14) or functor which receives the current vector from the input/output array, plus optionally vectors from up to two extra input arrays, and returns the value to write to the input/output array.
Here is an example implementing the BLAS function SAXPY (alpha * x + y):
```
// Assumes alpha is a local float, so the lambda must capture it.
Transform1(d, x, n, y, [alpha](auto d, const auto v, const auto v1) HWY_ATTR {
  return MulAdd(Set(d, alpha), v, v1);
});
```
Process whole vectors as above, followed by a scalar loop:
```
size_t i = 0;
for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
for (; i < count; ++i) LoopBody<false>(CappedTag<T, 1>(), i, 0);
```
The template parameter and third function argument are again not needed.
This avoids duplicating code, and is reasonable if count is large. If count is small, the second loop may be slower than the next option.
Process whole vectors as above, followed by a single call to a modified LoopBody with masking:
```
size_t i = 0;
for (; i + N <= count; i += N) {
  LoopBody<false>(d, i, 0);
}
if (i < count) {
  LoopBody<true>(d, i, count - i);
}
```
Now the template parameter and third function argument can be used inside LoopBody to non-atomically 'blend' the first num_remaining lanes of v (num_remaining being the value passed as max_n) with the previous contents of memory at subsequent locations: BlendedStore(v, FirstN(d, num_remaining), d, pointer);. Similarly, MaskedLoad(FirstN(d, num_remaining), d, pointer) loads the first num_remaining elements and returns zero in other lanes.
This is a good default when it is infeasible to ensure vectors are padded, but is only safe if HWY_MEM_OPS_MIGHT_FAULT is zero (#if !HWY_MEM_OPS_MIGHT_FAULT)! In contrast to the scalar loop, only a single final iteration is needed. The increased code size from two loop bodies is expected to be worthwhile because it avoids the cost of masking in all but the final iteration.
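The control flow of two of these strategies can be tried out without any SIMD library. The following sketch emulates a vector of kN lanes with an inner scalar loop, which is the strip-mined form before the inner loop is replaced with vector operations. kN, DoubleLanes, DoubleOverlapping and DoubleWithRemainder are hypothetical names for this illustration, not Highway functions, and the partial call stands in for a masked LoopBody rather than using real masked loads/stores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kN = 4;  // stand-in for N = Lanes(d)

// Stand-in for LoopBody: writes out[i] = 2 * in[i] for `num` lanes starting
// at `index`. It is idempotent because it reads `in` and only writes `out`.
inline void DoubleLanes(const float* in, float* out, std::size_t index,
                        std::size_t num) {
  for (std::size_t lane = 0; lane < num; ++lane) {
    out[index + lane] = 2.0f * in[index + lane];
  }
}

// Whole vectors only; the last vector is moved back so that it ends exactly
// at `count`, reprocessing some elements. Requires count >= kN.
inline void DoubleOverlapping(const float* in, float* out, std::size_t count) {
  assert(count >= kN);
  for (std::size_t i = 0; i < count; i += kN) {
    DoubleLanes(in, out, std::min(i, count - kN), kN);
  }
}

// Whole vectors, followed by a single partial call for the remainder.
inline void DoubleWithRemainder(const float* in, float* out,
                                std::size_t count) {
  std::size_t i = 0;
  for (; i + kN <= count; i += kN) DoubleLanes(in, out, i, kN);
  if (i < count) DoubleLanes(in, out, i, count - i);
}
```

With seven elements and kN = 4, the overlapping variant processes lanes 0 to 3 and then 3 to 6, so element 3 is computed twice, while the remainder variant processes lanes 0 to 3 and then a partial group of three.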