Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up

🚀 Fast C/C++ bit population count library

License

NotificationsYou must be signed in to change notification settings

kimwalisch/libpopcnt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build statusGithub Releases

libpopcnt.h is a header-only C/C++ library for counting thenumber of 1 bits (bit population count) in an array as quickly aspossible using specialized CPU instructions i.e.POPCNT,AVX2,AVX512,NEON,SVE.libpopcnt.h has been tested successfully using the GCC,Clang and MSVC compilers.

C/C++ API

#include"libpopcnt.h"/* * Count the number of 1 bits in the data array * @data: An array * @size: Size of data in bytes */uint64_tpopcnt(constvoid*data,uint64_tsize);

How to compile

libpopcnt.h does not require any special compiler flags like-mavx2!To get the best performance we only recommend to compile withoptimizations enabled e.g.-O3 or-O2.

cc  -O3 program.cc++ -O3 program.cpp

CPU architectures

libpopcnt.h has hardware accelerated popcount algorithms forthe following CPU architectures:

x86POPCNT,AVX2,AVX512
x86-64POPCNT,AVX2,AVX512
ARMNEON,SVE
PPC64POPCNTD

For other CPU architectures a fast integer popcount algorithm is used.

How it works

On x86 CPUs,libpopcnt.h first queries your CPU's supportedinstruction sets using theCPUID instruction (this is done only once).Thenlibpopcnt.h chooses the fastest bit population count algorithmsupported by your CPU:

  • If the CPU supportsAVX512 theAVX512 VPOPCNT algorithm is used.
  • Else if the CPU supportsAVX2 theAVX2 Harley Seal algorithm is used.
  • Else if the CPU supportsPOPCNT thePOPCNT algorithm is used.
  • For CPUs withoutPOPCNT instruction a portable integer algorithm is used.

Note thatlibpopcnt.h works on all CPUs (x86, ARM, PPC, WebAssembly, ...).It is portable by default and hardware acceleration is only enabled if the CPUsupports it.libpopcnt.h it is also thread-safe.

We take performance seriously, if you compile using e.g.-march=nativeon an x86 CPU with AVX512 support then all runtimeCPUID checks are removed!

ARM SVE (Scalable Vector Extension)

ARM SVE is a new vector instruction set for ARM CPUs that was first released in2020. ARM SVE supports a variable vector length from 128 to 2048 bits. HenceARM SVE algorithms can be much faster than ARM NEON algorithms which are limitedto 128 bits vector length.

libpopcnt's new ARM SVE popcount algorithm is up to 3x faster than its ARM NEONpopcount algorithm (on AWS Graviton3 CPUs). Unfortunately runtime dispatching toARM SVE is not yet well supported by the GCC and Clang compilers and libc's.Therefore, by default only the (portable) ARM NEON popcount algorithm is enabledwhen using libpopcnt on ARM CPUs.

To enable libpopcnt's ARM SVE popcount algorithm you need to compile your programusing your compiler's ARM SVE option e.g.:

gcc -O3 -march=armv8-a+sve program.cg++ -O3 -march=armv8-a+sve program.cpp

Development

cmake.make -jmaketest

The above commands also build thebenchmark program which isuseful for benchmarkinglibpopcnt.h. Below is ausage example run on an AMD EPYC 9R14 CPU from 2023:

# Usage: ./benchmark [array bytes] [iters]./benchmarkIters: 10000000Array size: 16.00 KBAlgorithm: AVX512Status: 100%Seconds: 1.23133.5 GB/s

Acknowledgments

Some of the algorithms used inlibpopcnt.h are described in the paperFaster Population Counts using AVX2 Instructionsby Daniel Lemire, Nathan Kurz and Wojciech Mula (23 Nov 2016). The AVX2 Harley Sealpopcount algorithm used inlibpopcnt.h has been copied from Wojciech Muła'ssse-popcount GitHub repo.


[8]ページ先頭

©2009-2025 Movatter.jp