add support for 64 block size on 32 warp size supported amd gpus #1748
Conversation
matthewdouglas commented Sep 8, 2025
Thanks for the PR! I don't have the bandwidth to test this personally at the moment, so I will defer to the AMD team. Also, I do not have any RDNA GPUs on hand. cc: @pnunna93
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
pnunna93 left a comment
Thanks for the PR! It's good to go once the warp size change is made.
matthewdouglas commented Oct 3, 2025
Hi @electron271
electron271 commented Oct 20, 2025
Will look through all this soon, sorry, have been somewhat busy.
Reuse BNB_WARP_SIZE macro
matthewdouglas commented Oct 27, 2025
Hi, apart from that, just a few linting issues to fix.
pnunna93 commented Oct 28, 2025
I agree, we can deprecate 6.1 compatibility.
matthewdouglas commented Oct 28, 2025
I've opened #1788, which removes the ROCm 6.1 build.
sstamenk commented Nov 5, 2025 • edited
Did some regression testing against the main branch on W7900 (gfx1100), R9700 (gfx1201), and MI300x (gfx942) using the rocm/vllm:latest Docker image. There don't seem to be any regressions. Of the 804 newly enabled tests on gfx1100 and gfx1201, 156 fail due to accuracy issues while the other 648 pass. Attaching some logs:
matthewdouglas commented Nov 5, 2025
Thanks @sstamenk - that's quite useful! The failing tests seem to be mostly gemv with fp32. I think that's OK for now and can be addressed separately. @electron271 If we fix the lint issues and the merge conflict, I'm happy to merge this in!
ROCM_GPU_ARCH = get_rocm_gpu_arch()
ROCM_WARP_SIZE_64 = True if get_rocm_warpsize() == 64 else False
Should we rename ROCM_WARP_SIZE_64 and get_rocm_warpsize() to something generic like WARP_SIZE_64 and get_warpsize(), since it technically covers both the HIP and CUDA cases? It would also make more sense for the unit test skip conditions. @matthewdouglas
I understand the point, but practically speaking, warp size is always 32 on CUDA, so I'm OK with the naming as it is.
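For reference, the helper under discussion might look roughly like the sketch below. The names get_rocm_warpsize() and ROCM_WARP_SIZE_64 come from the diff above, but the body is an assumption, not the actual bitsandbytes implementation; it falls back to 32, the universal CUDA (and RDNA) warp size, when no device is available.

```python
# Hypothetical sketch of the warp-size helper discussed above. The function
# name matches the PR diff; this implementation is an illustrative assumption.
def get_rocm_warpsize() -> int:
    """Return the active device's warp/wavefront size, defaulting to 32."""
    try:
        import torch

        if torch.cuda.is_available():
            # torch exposes the device warp size in its device properties
            # (on ROCm builds this reflects the hardware wavefront size)
            return torch.cuda.get_device_properties(0).warp_size
    except ImportError:
        pass
    # Every current NVIDIA GPU, and AMD RDNA GPUs, use a 32-wide warp
    return 32


ROCM_WARP_SIZE_64 = get_rocm_warpsize() == 64
```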
Merged 3f9f6f3 into bitsandbytes-foundation:main
Per https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html, most non-Instinct GPUs support a 32 warp size.
Tested on an RX 9070 XT; looking into getting this tested on AMD Instinct accelerators to ensure GPUs with a 64 warp size still work.
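The architecture split mentioned above can be summarized in a small sketch: GCN/CDNA parts (gfx9xx, i.e. the Instinct line) use a 64-wide wavefront, while RDNA parts (gfx10xx and newer, such as the GPUs tested in this PR) use 32. The helper name is illustrative, not from bitsandbytes.

```python
# Hypothetical helper mapping a ROCm gfx architecture string to its hardware
# wavefront size, per the AMD GPU architecture specs linked above.
def wavefront_size(gfx_arch: str) -> int:
    """Return the wavefront size for a ROCm gfx architecture string."""
    if gfx_arch.startswith("gfx9"):
        return 64  # GCN / CDNA (Instinct) GPUs
    return 32      # RDNA consumer GPUs (gfx10xx, gfx11xx, gfx12xx)


assert wavefront_size("gfx942") == 64   # MI300x
assert wavefront_size("gfx1100") == 32  # W7900
assert wavefront_size("gfx1201") == 32  # R9700 / RX 9070 XT
```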