Expand description
SIMD and vendor intrinsics module.
This module is intended to be the gateway to architecture-specificintrinsic functions, typically related to SIMD (but not always!). Eacharchitecture that Rust compiles to may contain a submodule here, whichmeans that this is not a portable module! If you’re writing a portablelibrary take care when using these APIs!
Under this module you’ll find an architecture-named module, such asx86_64. Each#[cfg(target_arch)] that Rust can compile to may have amodule entry here, only present on that particular target. For example thei686-pc-windows-msvc target will have anx86 module here, whereasx86_64-pc-windows-msvc hasx86_64.
§Overview
This module exposes vendor-specific intrinsics that typically correspond toa single machine instruction. These intrinsics are not portable: theiravailability is architecture-dependent, and not all machines of thatarchitecture might provide the intrinsic.
Thearch module is intended to be a low-level implementation detail forhigher-level APIs. Using it correctly can be quite tricky as you need toensure at least a few guarantees are upheld:
- The correct architecture’s module is used. For example the
armmoduleisn’t available on thex86_64-unknown-linux-gnutarget. This istypically done by ensuring that#[cfg]is used appropriately when usingthis module. - The CPU the program is currently running on supports the function beingcalled. For example it is unsafe to call an AVX2 function on a CPU thatdoesn’t actually support AVX2.
As a result of the latter of these guarantees all intrinsics in this moduleareunsafe and extra care needs to be taken when calling them!
§CPU Feature Detection
In order to call these APIs in a safe fashion there’s a number ofmechanisms available to ensure that the correct CPU feature is availableto call an intrinsic. Let’s consider, for example, the_mm256_add_epi64intrinsics on thex86 andx86_64 architectures. This function requiresthe AVX2 feature asdocumented by Intel so to correctly callthis function we need to (a) guarantee we only call it onx86/x86_64and (b) ensure that the CPU feature is available
§Static CPU Feature Detection
The first option available to us is to conditionally compile code via the#[cfg] attribute. CPU features correspond to thetarget_feature cfgavailable, and can be used like so:
#[cfg( all( any(target_arch ="x86", target_arch ="x86_64"), target_feature ="avx2"))]fnfoo() {#[cfg(target_arch ="x86")]usestd::arch::x86::_mm256_add_epi64;#[cfg(target_arch ="x86_64")]usestd::arch::x86_64::_mm256_add_epi64;unsafe{ _mm256_add_epi64(...); }}Here we’re using#[cfg(target_feature = "avx2")] to conditionally compilethis function into our module. This means that if theavx2 feature isenabled statically then we’ll use the_mm256_add_epi64 function atruntime. Theunsafe block here can be justified through the usage of#[cfg] to only compile the code in situations where the safety guaranteesare upheld.
Statically enabling a feature is typically done with the-C target-feature or-C target-cpu flags to the compiler. For example ifyour local CPU supports AVX2 then you can compile the above function with:
$ RUSTFLAGS='-C target-cpu=native' cargo buildOr otherwise you can specifically enable just the AVX2 feature:
$ RUSTFLAGS='-C target-feature=+avx2' cargo buildNote that when you compile a binary with a particular feature enabled it’simportant to ensure that you only run the binary on systems which satisfythe required feature set.
§Dynamic CPU Feature Detection
Sometimes statically dispatching isn’t quite what you want. Instead youmight want to build a portable binary that runs across a variety of CPUs,but at runtime it selects the most optimized implementation available. Thisallows you to build a “least common denominator” binary which has certainsections more optimized for different CPUs.
Taking our previous example from before, we’re going to compile our binarywithout AVX2 support, but we’d like to enable it for just one function.We can do that in a manner like:
fnfoo() {#[cfg(any(target_arch ="x86", target_arch ="x86_64"))]{ifis_x86_feature_detected!("avx2") {return unsafe{ foo_avx2() }; } }// fallback implementation without using AVX2}#[cfg(any(target_arch ="x86", target_arch ="x86_64"))]#[target_feature(enable ="avx2")]unsafe fnfoo_avx2() {#[cfg(target_arch ="x86")]usestd::arch::x86::_mm256_add_epi64;#[cfg(target_arch ="x86_64")]usestd::arch::x86_64::_mm256_add_epi64;unsafe{ _mm256_add_epi64(...); }}There’s a couple of components in play here, so let’s go through them indetail!
First up we notice the
is_x86_feature_detected!macro. Provided bythe standard library, this macro will perform necessary runtime detectionto determine whether the CPU the program is running on supports thespecified feature. In this case the macro will expand to a booleanexpression evaluating to whether the local CPU has the AVX2 feature ornot.Note that this macro, like the
archmodule, is platform-specific. Forexample callingis_x86_feature_detected!("avx2")on ARM will be acompile time error. To ensure we don’t hit this error a statement level#[cfg]is used to only compile usage of the macro onx86/x86_64.Next up we see our AVX2-enabled function,
foo_avx2. This function isdecorated with the#[target_feature]attribute which enables a CPUfeature for just this one function. Using a compiler flag like-C target-feature=+avx2will enable AVX2 for the entire program, but usingan attribute will only enable it for the one function. Usage of the#[target_feature]attribute currently requires the function to also beunsafe, as we see here. This is because the function can only becorrectly called on systems which have the AVX2 (like the intrinsicsthemselves).
And with all that we should have a working program! This program will runacross all machines and it’ll use the optimized AVX2 implementation onmachines where support is detected.
§Ergonomics
It’s important to note that using thearch module is not the easiestthing in the world, so if you’re curious to try it out you may want tobrace yourself for some wordiness!
The primary purpose of this module is to enable stable crates on crates.ioto build up much more ergonomic abstractions which end up using SIMD underthe hood. Over time these abstractions may also move into the standardlibrary itself, but for now this module is tasked with providing the bareminimum necessary to use vendor intrinsics on stable Rust.
§Other architectures
This documentation is only for one particular architecture, you can findothers at:
§Examples
First let’s take a look at not actually using any intrinsics but insteadusing LLVM’s auto-vectorization to produce optimized vectorized code forAVX2 and also for the default platform.
fnmain() {letmutdst = [0]; add_quickly(&[1],&[2],&mutdst);assert_eq!(dst[0],3);}fnadd_quickly(a:&[u8], b:&[u8], c:&mut[u8]) {#[cfg(any(target_arch ="x86", target_arch ="x86_64"))]{// Note that this `unsafe` block is safe because we're testing // that the `avx2` feature is indeed available on our CPU.ifis_x86_feature_detected!("avx2") {return unsafe{ add_quickly_avx2(a, b, c) }; } } add_quickly_fallback(a, b, c)}#[cfg(any(target_arch ="x86", target_arch ="x86_64"))]#[target_feature(enable ="avx2")]unsafe fnadd_quickly_avx2(a:&[u8], b:&[u8], c:&mut[u8]) { add_quickly_fallback(a, b, c)// the function below is inlined here}fnadd_quickly_fallback(a:&[u8], b:&[u8], c:&mut[u8]) {for((a, b), c)ina.iter().zip(b).zip(c) {*c =*a +*b; }}Next up let’s take a look at an example of manually using intrinsics. Herewe’ll be using SSE4.1 features to implement hex encoding.
fnmain() {letmutdst = [0;32]; hex_encode(b"\x01\x02\x03",&mutdst);assert_eq!(&dst[..6],b"010203");letmutsrc = [0;16];foriin0..16{ src[i] = (i +1)asu8; } hex_encode(&src,&mutdst);assert_eq!(&dst,b"0102030405060708090a0b0c0d0e0f10");}pub fnhex_encode(src:&[u8], dst:&mut[u8]) {letlen = src.len().checked_mul(2).unwrap();assert!(dst.len() >= len);#[cfg(any(target_arch ="x86", target_arch ="x86_64"))]{ifis_x86_feature_detected!("sse4.1") {return unsafe{ hex_encode_sse41(src, dst) }; } } hex_encode_fallback(src, dst)}// translated from// <https://github.com/Matherunner/bin2hex-sse/blob/master/base16_sse4.cpp>#[target_feature(enable ="sse4.1")]#[cfg(any(target_arch ="x86", target_arch ="x86_64"))]unsafe fnhex_encode_sse41(mutsrc:&[u8], dst:&mut[u8]) {#[cfg(target_arch ="x86")]usestd::arch::x86::*;#[cfg(target_arch ="x86_64")]usestd::arch::x86_64::*;unsafe{letascii_zero = _mm_set1_epi8(b'0'asi8);letnines = _mm_set1_epi8(9);letascii_a = _mm_set1_epi8((b'a'-9-1)asi8);letand4bits = _mm_set1_epi8(0xf);letmuti =0_isize;whilesrc.len() >=16{letinvec = _mm_loadu_si128(src.as_ptr()as*const_);letmasked1 = _mm_and_si128(invec, and4bits);letmasked2 = _mm_and_si128(_mm_srli_epi64(invec,4), and4bits);// return 0xff corresponding to the elements > 9, or 0x00 otherwiseletcmpmask1 = _mm_cmpgt_epi8(masked1, nines);letcmpmask2 = _mm_cmpgt_epi8(masked2, nines);// add '0' or the offset depending on the masksletmasked1 = _mm_add_epi8( masked1, _mm_blendv_epi8(ascii_zero, ascii_a, cmpmask1), );letmasked2 = _mm_add_epi8( masked2, _mm_blendv_epi8(ascii_zero, ascii_a, cmpmask2), );// interleave masked1 and masked2 bytesletres1 = _mm_unpacklo_epi8(masked2, masked1);letres2 = _mm_unpackhi_epi8(masked2, masked1); _mm_storeu_si128(dst.as_mut_ptr().offset(i *2)as*mut_, res1); _mm_storeu_si128( dst.as_mut_ptr().offset(i *2+16)as*mut_, res2, ); src =&src[16..]; i +=16; }leti = iasusize; hex_encode_fallback(src,&mutdst[i *2..]); }}fnhex_encode_fallback(src:&[u8], dst:&mut[u8]) {fnhex(byte: u8) -> u8 {staticTABLE:&[u8] =b"0123456789abcdef"; TABLE[byteasusize] }for(byte, slots)insrc.iter().zip(dst.chunks_mut(2)) { slots[0] = hex((*byte >>4) &0xf); slots[1] = hex(*byte &0xf); }}Re-exports§
pub use core::arch::*;
Macros§
- is_
aarch64_ feature_ detected - This macro tests, at runtime, whether an
aarch64feature is enabled on aarch64 platforms.Currently most features are only supported on linux-based platforms. - is_
loongarch_ feature_ detected - Checks if
loongarchfeature is enabled.Supported arguments are: - is_
riscv_ feature_ detected - A macro to test atruntime whether instruction sets are available onRISC-V platforms.
- is_
x86_ feature_ detected - A macro to test atruntime whether a CPU feature is available onx86/x86-64 platforms.
- is_
arm_ feature_ detected Experimental - Checks if
armfeature is enabled. - is_
mips64_ feature_ detected Experimental - Checks if
mips64feature is enabled. - is_
mips_ feature_ detected Experimental - Checks if
mipsfeature is enabled. - is_
powerpc64_ feature_ detected Experimental - Checks if
powerpcfeature is enabled. - is_
powerpc_ feature_ detected Experimental - Checks if
powerpcfeature is enabled. - is_
s390x_ feature_ detected Experimental - Checks if
s390xfeature is enabled.