GatherBlockQuantized: Fix 4 bit uint8 case #26506

Draft

jambayk wants to merge 3 commits into main from jambayk/gbq


When uint8 packing is used for 4-bit quantization, the packing happens along the quantization axis.
- When the number of blocks is odd, the zero-point tensor has one additional padding block per row. The zero-point indexing is updated to handle this.
- For the data tensor, the code appears to assume that the quantization dim is divisible by the block size. (This packing is supported so that weights can be shared with the LM head, which uses MatMulNBits, but that only works if the quantization dim is divisible by the block size; otherwise there is extra padding in the final block of each data row.) Under this assumption, and since the block size is a power of 2, the data rows have no padding. Without this assumption, the data indexing logic would need to be updated as well.
  - Even if the above assumption does not hold, there is still an assumption that the quantization dim is even.

Also fixed the default zero-point value for the uint8 case in the CUDA implementation.
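
To make the zero-point padding issue concrete, here is a minimal Python sketch. It is not the code from this PR. The helper names are hypothetical, and it assumes that the first value of each pair sits in the low nibble of the byte. It contrasts a lookup that ignores the per-row padding with one that accounts for it.

```python
# Illustrative sketch only; names and nibble order are assumptions, not the PR's code.

def packed_row_bytes(n_blocks: int) -> int:
    """Bytes per zero-point row when two 4-bit values share one uint8."""
    return (n_blocks + 1) // 2


def pack_zero_points(zp_rows):
    """Pack each row separately, padding odd-length rows with one zero nibble."""
    packed = []
    for row in zp_rows:
        padded = list(row) + [0] * (len(row) % 2)
        for i in range(0, len(padded), 2):
            # Assumption: even index in the low nibble, odd index in the high nibble.
            packed.append((padded[i] & 0xF) | ((padded[i + 1] & 0xF) << 4))
    return packed


def zero_point_flat(packed, row, block, n_blocks):
    """Lookup that treats the tensor as densely packed (ignores row padding)."""
    flat = row * n_blocks + block
    return (packed[flat // 2] >> (4 * (flat % 2))) & 0xF


def zero_point_padded(packed, row, block, n_blocks):
    """Lookup that steps over the padding nibble at the end of each row."""
    byte = packed[row * packed_row_bytes(n_blocks) + block // 2]
    return (byte >> (4 * (block % 2))) & 0xF


zp = [[1, 2, 3], [4, 5, 6]]  # 2 rows, 3 blocks per row (odd)
packed = pack_zero_points(zp)
flat_value = zero_point_flat(packed, 1, 0, 3)      # lands on row 0's padding nibble
padded_value = zero_point_padded(packed, 1, 0, 3)  # lands on the intended value
```

With an even number of blocks the two lookups agree. They diverge only from the second row onward when the block count is odd, which matches the case described above.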

jambayk added 3 commits

cuda gatherblockquantized fix

3deeb31

fix cpu gatherblockquantized

45a8c10

comment

9eb215c

tianleiwu requested a review from xiaomsft
