Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 662 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
examples		examples
lib		lib
misc		misc
scripts		scripts
src		src
tests		tests
utils		utils
.clang-format		.clang-format
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE.md		LICENSE.md
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
example.c		example.c
meson.build		meson.build
meson_options.txt		meson_options.txt
run-bench.py		run-bench.py

Repository files navigation

x86-simd-sort

C++ template library for high performance SIMD based sorting routines forbuilt-in integers and floats (16-bit, 32-bit and 64-bit data types) and customdefined C++ objects. The sorting routines are accelerated using AVX-512/AVX2when available. The library auto picks the best version depending on theprocessor it is run on. If you are looking for the AVX-512 or AVX2 specificimplementations, please seeREADME fileundersrc/ directory. The following routines are currently supported:

Sort an array of custom defined class objects (uses`O(N)` space)

template<typename T,typename U,typename Func>voidx86simdsort::object_qsort(T *arr, U arrsize, Func key_func)

T is any user defined struct or class andarr is a pointer to the firstelement in the array of objects of typeT. Thearrsize parameter can be any32-bit or 64-bit integer type.Func is a lambda function that computes thekey value for each object which is the metric used to sort the objects.Func needs to have the following signature:

[] (T obj) ->key_t {key_t key;/* compute key for obj*/return key; }

Note that the return type of the keykey_t needs to be one of the following :[float, uint32_t, int32_t, double, uint64_t, int64_t].object_qsort has aspace complexity ofO(N). Specifically, it requiresarrsize * sizeof(key_t)bytes to store a vector with all the keys and an additionalarrsize * sizeof(uint32_t) bytes to store the indexes of the object array. Forperformance reasons, we recommend usingobject_qsort when the array sizeis less than or equal toUINT32_MAX. An example usage ofobject_qsort isprovided in theexamplessection. Refer tosection to get a sense ofhow fast this is relative tostd::sort.

Sort an array of built-in integers and floats

voidx86simdsort::qsort(T* arr,size_t size,bool hasnan,bool descending);voidx86simdsort::qselect(T* arr,size_t k,size_t size,bool hasnan,bool descending);voidx86simdsort::partial_qsort(T* arr,size_t k,size_t size,bool hasnan,bool descending);

Supported datatypes:T$\in$[_Float16, uint16_t, int16_t, float, uint32_t, int32_t, double, uint64_t, int64_t]

Key-value sort routines on pairs of arrays

voidx86simdsort::keyvalue_qsort(T1* key, T2* val,size_t size,bool hasnan,bool descending);voidx86simdsort::keyvalue_select(T1* key, T2* val,size_t k,size_t size,bool hasnan,bool descending);voidx86simdsort::keyvalue_partial_sort(T1* key, T2* val,size_t k,size_t size,bool hasnan,bool descending);

Supported datatypes:T1,T2$\in$[float, uint32_t, int32_t, double, uint64_t, int64_t] Note that keyvalue sort is not yet supported for 16-bitdata types.

Arg sort routines on arrays

std::vector<size_t> arg = x86simdsort::argsort(T* arr,size_t size,bool hasnan,bool descending);std::vector<size_t> arg = x86simdsort::argselect(T* arr,size_t k,size_t size,bool hasnan);

Supported datatypes:T$\in$[_Float16, uint16_t, int16_t, float, uint32_t, int32_t, double, uint64_t, int64_t] Note that argsort and argselect are not accelerated with SIMD when using 16-bitdata types.

Build/Install

meson is the used build system. Commandto build and install the library:

meson setup --buildtype release builddir && cd builddirmeson compilesudo meson install

Once installed, you can usepkg-config --cflags --libs x86simdsortcpp topopulate the right cflags and ldflags to compile and link your C++ program.This repository also contains a test suite and benchmarking suite which arewritten usinggoogletest andgooglebenchmark (>= v1.9.2) frameworksrespectively. You can configure meson to build them both by using-Dbuild_tests=true and-Dbuild_benchmarks=true.

Build using OpenMP

qsort,argsort, andkeyvalue_qsort can achieve even greater performance(up-to 3x speedup) through parallelization withOpenMP. By default, OpenMP support is disabled; toenable it, set the-Duse_openmp=true flag when configuring Meson. If you areusing only the static SIMD implementations, compile with-fopenmp -DXSS_USE_OPENMP.

OpenMP-based parallel sorting routines are used for arrays larger than aspecific threshold where threading makes sense. The number of threads islimited to a maximum of 16. You can control the number of threads by settingtheOMP_NUM_THREADS environment variable.

Using x86-simd-sort as a Meson subproject

If you would like to use this as a Meson subproject, then createsubprojectsdirectory and copyx86-simd-sort into it. Add these two linesin your meson.build.

xss = subproject('x86-simd-sort')xss_dep = xss.get_variable('x86simdsortcpp_dep')

For more detailed instructions please refer to Mesondocumentation.

Example usage

Sort an array of floats

#include"x86simdsort.h"intmain() {    std::vector<float> arr{1000};x86simdsort::qsort(arr.data(),1000,true);return0;}

Sort an array of Points using object_qsort

#include"x86simdsort.h"#include<cmath>structPoint {double x, y, z;};intmain() {    std::vector<Point> arr{1000};// Sort an array of Points by its x value:x86simdsort::object_qsort(arr.data(),1000, [](Point p) {return p.x; });// Sort an array of Points by its distance from origin:x86simdsort::object_qsort(arr.data(),1000, [](Point p) {returnsqrt(p.x*p.x+p.y*p.y+p.z*p.z);        });return0;}

Details

x86simdsort::qsort is equivalent toqsort inCorstd::sort inC++.
x86simdsort::qselect is equivalent tostd::nth_element inC++ ornp.partition inNumPy.
x86simdsort::partial_qsort is equivalent tostd::partial_sort inC++.
x86simdsort::argsort is equivalent tonp.argsort inNumPy.
x86simdsort::argselect is equivalent tonp.argpartition inNumPy.

Supported datatypes:uint16_t, int16_t, _Float16, uint32_t, int32_t, float, uint64_t, int64_t, double. Note that_Float16 will require building thislibrary with g++ >= 12.x. All the functions have an optional argumentbool hasnan set tofalse by default (these are relevant to floating point datatypes only). If your array has NAN's, the the behaviour of the sorting routineis undefined. Ifhasnan is set to true, NAN's are always sorted to the end ofthe array. In addition to that, qsort will replace all your NAN's withstd::numeric_limits<T>::quiet_NaN. The original bit-exact NaNs inthe input are not preserved. Also note that the arg methods (argsort andargselect) will not use the SIMD based algorithms if they detect NAN's in thearray. You can read details of all the implementationshere.

Performance comparison on AVX-512:`object_qsort` v/s`std::sort`

Performance ofobject_qsort can vary significantly depending on the defintionof the custom class and we highly recommend benchmarking before using it. Forthe sake of illustration, we provide a few examples in./benchmarks/bench-objsort.hpp which measuresperformance ofobject_qsort relative tostd::sort when sorting an array of3D points represented by the class:struct Point {double x, y, z;} andstruct Point {float x, y, x;}. We sort these points based on severaldifferent metrics:

sort by coordinatex
sort by manhanttan distance (relative to origin):abs(x) + abx(y) + abs(z)
sort by Euclidean distance (relative to origin):sqrt(x*x + y*y + z*z)
sort by Chebyshev distance (relative to origin):max(abs(x), abs(y), abs(z))

The performance data (shown in the plot below) can be collected by building thebenchmarks suite and running./builddir/benchexe --benchmark_filter==*obj*.The data plot shown below was collected on a processor with AVX-512. For thesimplest of cases where we want to sort an array of struct by one of itsmembers,object_qsort can be up-to 5x faster for 32-bit data type and about4x for 64-bit data type. It tends to do even better when the metric to sort bygets more complicated. Sorting by Euclidean distance can be up-to 10x faster.

Downstream projects using x86-simd-sort

NumPy uses this as asubmodule to acceleratenp.sort, np.argsort, np.partition and np.argpartition.
PyTorch uses this as asubmodule to acceleratetorch.sort, torch.argsort.
A slightly modifed version this library has been integrated intoopenJDK.
GRAPE: C++ library for parallel graph processing.
AVX-512 version of the key-value sort has been submitted toOceanbase.