This first version is roughly 10x faster than the reference C++ implementation, but still slower than the asmjit version (which is x86 only). See the list below for ways to close this gap.

Room for further enhancements (some of the surrounding kernels already implement them):

- This is only specialized for `bit_rate` right now. Specializing for common block sizes typically nets good improvements.
- Directly operates on the `out` buffer. This is good for unknown block sizes, but if we specialize for fixed small block sizes then a separate buffer is better, as it can be promoted completely to vector registers (for fixed vector register sizes anyway; this doesn't work for variable-size AArch64 SVE registers).
- No prefetching logic yet.
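
To illustrate the first two items, here is a minimal sketch of the difference between accumulating directly into `out` with a run-time block size and accumulating into a local buffer with a compile-time block size. It is not the FBGEMM code: `QuantRow`, `sum_rows_generic`, and `sum_rows_fixed` are hypothetical names, and the row layout (float scale and bias, values packed low bits first) is a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified row: packed BIT_RATE-bit values plus a float
// scale and bias. This is an assumption, not the real FBGEMM row layout.
struct QuantRow {
  const std::uint8_t* data;
  float scale;
  float bias;
};

// Extracts element j from a packed row (assumes low bits are stored first).
template <int BIT_RATE>
inline float dequant(const QuantRow& row, std::size_t j) {
  static_assert(8 % BIT_RATE == 0, "BIT_RATE must divide 8");
  constexpr std::size_t kElemsPerByte = 8 / BIT_RATE;
  constexpr unsigned kMask = (1u << BIT_RATE) - 1u;
  const unsigned byte = row.data[j / kElemsPerByte];
  const unsigned q = (byte >> ((j % kElemsPerByte) * BIT_RATE)) & kMask;
  return row.scale * static_cast<float>(q) + row.bias;
}

// Specialized for BIT_RATE only: block_size is a run-time value, so the
// partial sums have to live in the `out` buffer.
template <int BIT_RATE>
void sum_rows_generic(const std::vector<QuantRow>& rows,
                      std::size_t block_size, float* out) {
  assert(out != nullptr);
  for (std::size_t j = 0; j < block_size; ++j) out[j] = 0.0f;
  for (const QuantRow& row : rows) {
    for (std::size_t j = 0; j < block_size; ++j) {
      out[j] += dequant<BIT_RATE>(row, j);
    }
  }
}

// Additionally specialized for BLOCK_SIZE: the accumulator is a small
// fixed-size local array, which gives the compiler the chance to keep it
// entirely in vector registers and write `out` only once at the end.
template <int BIT_RATE, std::size_t BLOCK_SIZE>
void sum_rows_fixed(const std::vector<QuantRow>& rows, float* out) {
  assert(out != nullptr);
  float acc[BLOCK_SIZE] = {};
  for (const QuantRow& row : rows) {
    for (std::size_t j = 0; j < BLOCK_SIZE; ++j) {
      acc[j] += dequant<BIT_RATE>(row, j);
    }
  }
  for (std::size_t j = 0; j < BLOCK_SIZE; ++j) out[j] = acc[j];
}
```

A caller would typically dispatch to `sum_rows_fixed` for a handful of common block sizes and fall back to `sum_rows_generic` otherwise. Whether the local accumulator actually ends up in registers depends on the compiler, the target, and the block size, so it is worth checking the generated code.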

Differential Revision: D89086019

Add EmbeddingSpMDMNBitRowWiseSparse autovectorized variant

403af1e

Summary:
X-link: facebookresearch/FBGEMM#2235

This adds an autovectorized implementation of the `GenerateEmbeddingSpMDMNBitRowWiseSparse` kernels.

This first version is roughly 10x faster than the reference C++ implementation, but still slower than the asmjit version (which is x86 only). See the list below for ways to close this gap.

Room for further enhancements (some of the surrounding kernels already implement them):
- This is only specialized for `bit_rate` right now. Specializing for common block sizes typically nets good improvements.
- Directly operates on the `out` buffer. This is good for unknown block sizes, but if we specialize for fixed small block sizes then a separate buffer is better, as it can be promoted completely to vector registers (for fixed vector register sizes anyway; this doesn't work for variable-size AArch64 SVE registers).
- No prefetching logic yet.

Differential Revision: D89086019
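
For the missing prefetching logic, one common pattern is to request the row that will be needed a few iterations ahead while the current row is being accumulated. The sketch below shows that pattern on plain `float` rows. `sum_rows_with_prefetch`, `SKETCH_PREFETCH`, and the `prefetch_distance` parameter are hypothetical names chosen for illustration and are not taken from FBGEMM; `__builtin_prefetch` is the GCC/Clang builtin, and other compilers fall back to a no-op here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) || defined(__clang__)
// Read prefetch (rw = 0) with high temporal locality (locality = 3).
#define SKETCH_PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
// Assumption: no prefetch hint available, so do nothing.
#define SKETCH_PREFETCH(addr) ((void)(addr))
#endif

// Sums the rows of `table` selected by `indices` into `out`.
// Each row holds `block_size` floats. While row i is accumulated, the start
// of row i + prefetch_distance is prefetched. Only the first cache line of
// that row is requested; long rows would need one hint per cache line.
inline void sum_rows_with_prefetch(const float* table,
                                   std::size_t block_size,
                                   const std::int64_t* indices,
                                   std::size_t num_indices,
                                   std::size_t prefetch_distance,
                                   float* out) {
  assert(table != nullptr && indices != nullptr && out != nullptr);
  for (std::size_t j = 0; j < block_size; ++j) out[j] = 0.0f;
  for (std::size_t i = 0; i < num_indices; ++i) {
    if (i + prefetch_distance < num_indices) {
      const std::size_t ahead =
          static_cast<std::size_t>(indices[i + prefetch_distance]);
      SKETCH_PREFETCH(table + ahead * block_size);
    }
    const float* row =
        table + static_cast<std::size_t>(indices[i]) * block_size;
    for (std::size_t j = 0; j < block_size; ++j) out[j] += row[j];
  }
}
```

The prefetch is only a hint and does not change the result. A useful distance depends on row size and memory latency, so it would have to be tuned by measurement.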