- Notifications
You must be signed in to change notification settings - Fork39
🚀 Fast C/C++ bit population count library
License
kimwalisch/libpopcnt
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
libpopcnt.h
is a header-only C/C++ library for counting thenumber of 1 bits (bit population count) in an array as quickly aspossible using specialized CPU instructions i.e.POPCNT,AVX2,AVX512,NEON,SVE.libpopcnt.h
has been tested successfully using the GCC,Clang and MSVC compilers.
#include"libpopcnt.h"/* * Count the number of 1 bits in the data array * @data: An array * @size: Size of data in bytes */uint64_tpopcnt(constvoid*data,uint64_tsize);
libpopcnt.h
does not require any special compiler flags like-mavx2
!To get the best performance we only recommend to compile withoptimizations enabled e.g.-O3
or-O2
.
cc -O3 program.cc++ -O3 program.cpp
libpopcnt.h
has hardware accelerated popcount algorithms forthe following CPU architectures:
x86 | POPCNT ,AVX2 ,AVX512 |
x86-64 | POPCNT ,AVX2 ,AVX512 |
ARM | NEON ,SVE |
PPC64 | POPCNTD |
For other CPU architectures a fast integer popcount algorithm is used.
On x86 CPUs,libpopcnt.h
first queries your CPU's supportedinstruction sets using theCPUID
instruction (this is done only once).Thenlibpopcnt.h
chooses the fastest bit population count algorithmsupported by your CPU:
- If the CPU supports
AVX512
theAVX512 VPOPCNT
algorithm is used. - Else if the CPU supports
AVX2
theAVX2 Harley Seal
algorithm is used. - Else if the CPU supports
POPCNT
thePOPCNT
algorithm is used. - For CPUs without
POPCNT
instruction a portable integer algorithm is used.
Note thatlibpopcnt.h
works on all CPUs (x86, ARM, PPC, WebAssembly, ...).It is portable by default and hardware acceleration is only enabled if the CPUsupports it.libpopcnt.h
it is also thread-safe.
We take performance seriously, if you compile using e.g.-march=native
on an x86 CPU with AVX512 support then all runtimeCPUID
checks are removed!
ARM SVE is a new vector instruction set for ARM CPUs that was first released in2020. ARM SVE supports a variable vector length from 128 to 2048 bits. HenceARM SVE algorithms can be much faster than ARM NEON algorithms which are limitedto 128 bits vector length.
libpopcnt's new ARM SVE popcount algorithm is up to 3x faster than its ARM NEONpopcount algorithm (on AWS Graviton3 CPUs). Unfortunately runtime dispatching toARM SVE is not yet well supported by the GCC and Clang compilers and libc's.Therefore, by default only the (portable) ARM NEON popcount algorithm is enabledwhen using libpopcnt on ARM CPUs.
To enable libpopcnt's ARM SVE popcount algorithm you need to compile your programusing your compiler's ARM SVE option e.g.:
gcc -O3 -march=armv8-a+sve program.cg++ -O3 -march=armv8-a+sve program.cpp
cmake.make -jmaketest
The above commands also build thebenchmark
program which isuseful for benchmarkinglibpopcnt.h
. Below is ausage example run on an AMD EPYC 9R14 CPU from 2023:
# Usage: ./benchmark [array bytes] [iters]./benchmarkIters: 10000000Array size: 16.00 KBAlgorithm: AVX512Status: 100%Seconds: 1.23133.5 GB/s
Some of the algorithms used inlibpopcnt.h
are described in the paperFaster Population Counts using AVX2 Instructionsby Daniel Lemire, Nathan Kurz and Wojciech Mula (23 Nov 2016). The AVX2 Harley Sealpopcount algorithm used inlibpopcnt.h
has been copied from Wojciech Muła'ssse-popcount GitHub repo.