Enable FP8 row-wise scaled-mm for sm12x #155991


Closed
gau-nernst wants to merge 8 commits into pytorch:main from gau-nernst:fp8_sm120

Conversation

@gau-nernst (Contributor) commented Jun 14, 2025 (edited by pytorch-bot)

Update using Cutlass 3.x (2025/06/15)

Following @alexsamardzic's advice, I tried out the Cutlass 3.x API and it's impressive (the rated spec is 419 TFLOPS):

| M    | N    | K    | TFLOPS |
|------|------|------|--------|
| 16   | 4096 | 4096 | 17.56  |
| 64   | 4096 | 4096 | 69.63  |
| 256  | 4096 | 4096 | 266.57 |
| 1024 | 4096 | 4096 | 339.28 |
| 4096 | 4096 | 4096 | 388.91 |

This uses the same SM100 template. The only differences are:

  • Cluster size is fixed to <1,1,1>, since sm120 does not have the multicast feature.
  • Tile size is fixed to <128,128,128>, because the default kernel schedule does not support <64,128,128>. I will work a bit on improving perf for small M. Fixed: use KernelTmaWarpSpecializedPingpong when TileShape.M == 64.

Perf for small M is still bad, since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape <64,64,128>.

Original using SM89

I tried using the cutlass FP8 row-wise scaled-mm kernel for sm89 on sm120 (5090) and it works. I guess it makes sense, because sm120 matmul uses the standard sm80 PTX instructions (cp.async + mma and friends).

Simple benchmark script

```python
import torch
from torch._inductor.utils import do_bench_using_profiling

N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()

    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)

    latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    tflops = (2 * M * N * K) / latency_us / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```
| M    | N    | K    | TFLOPS |
|------|------|------|--------|
| 16   | 4096 | 4096 | 25.73  |
| 64   | 4096 | 4096 | 71.84  |
| 256  | 4096 | 4096 | 86.40  |
| 1024 | 4096 | 4096 | 112.12 |
| 4096 | 4096 | 4096 | 121.24 |

According to the RTX Blackwell Whitepaper, FP8 MMA with FP32 accumulate is rated at 419 TFLOPS, so the results here are quite bad...
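For a rough sense of scale, here is a minimal sketch of the utilization math, using only the 419 TFLOPS peak and the measured numbers from the table above (no new data, just the ratio):

```python
# Rough utilization of the default SM89-kernel configs on sm120, relative to the
# 419 TFLOPS FP8 (FP32 accumulate) peak from the RTX Blackwell whitepaper.
# The measured values are copied from the table above.
peak_tflops = 419.0
measured_tflops = {16: 25.73, 64: 71.84, 256: 86.40, 1024: 112.12, 4096: 121.24}

for M, tflops in measured_tflops.items():
    print(f"M={M:<5d} {tflops:6.2f} TFLOPS  ({tflops / peak_tflops:5.1%} of peak)")
# Even the largest, compute-bound shape (M=4096) sits below ~30% of peak.
```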

However, if I change ThreadblockSwizzle to cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>:

| M    | N    | K    | TFLOPS |
|------|------|------|--------|
| 16   | 4096 | 4096 | 27.13  |
| 64   | 4096 | 4096 | 84.84  |
| 256  | 4096 | 4096 | 96.75  |
| 1024 | 4096 | 4096 | 110.21 |
| 4096 | 4096 | 4096 | 122.98 |

Small M slightly improves, but large M is still bad.

If I further change ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3 for M > 256, which is taken from cutlass example 58, I get the following results:

| M    | N    | K    | TFLOPS |
|------|------|------|--------|
| 1024 | 4096 | 4096 | 313.28 |
| 4096 | 4096 | 4096 | 376.73 |

This is much closer to the hardware limit. It also means this kernel is sufficient to get the most perf out of sm120; it only needs better-tuned configs.

To make sure this high perf is only obtainable with GemmIdentityThreadblockSwizzle<1> + ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3, I also tried ThreadblockSwizzleStreamK + ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3:

| M    | N    | K    | TFLOPS |
|------|------|------|--------|
| 1024 | 4096 | 4096 | 144.03 |
| 4096 | 4096 | 4096 | 156.86 |

A bit better than the current configs, but still very far from the hardware limit.

@alexsamardzic I noticed you chose these configs in #149978. Do you have any numbers on how the current configs perform on sm89?

Update: using the Triton kernel codegen-ed by inductor, via compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs"):

| M    | N    | K    | TFLOPS |
|------|------|------|--------|
| 16   | 4096 | 4096 | 25.60  |
| 64   | 4096 | 4096 | 71.74  |
| 256  | 4096 | 4096 | 161.64 |
| 1024 | 4096 | 4096 | 185.89 |
| 4096 | 4096 | 4096 | 215.53 |

Better than the default configs, but still far from the config above for compute-bound shapes.
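For reference, a minimal sketch of reproducing the Triton/inductor measurement, reusing the same benchmark loop and do_bench_using_profiling helper as the script above:

```python
import torch
from torch._inductor.utils import do_bench_using_profiling

# Let inductor autotune a Triton scaled-mm kernel instead of calling the
# CUTLASS kernel behind torch._scaled_mm directly.
compiled_scaled_mm = torch.compile(
    torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs"
)

N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1, device="cuda")
    scale_B = torch.ones(1, N, device="cuda")

    # First call triggers compilation/autotuning; the timing below reflects the tuned kernel.
    compiled_scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)

    latency = do_bench_using_profiling(
        lambda: compiled_scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    )
    tflops = (2 * M * N * K) / latency / 1e9  # same conversion as the script above
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```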

cc @ptrblck @msaroufim @eqy @jerryzh168

@pytorch-bot (bot) commented Jun 14, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155991

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 46e1250 with merge base 517d299:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@alexsamardzic (Collaborator) commented:

> @alexsamardzic I noticed you chose these configs in #149978. Do you have any numbers on how the current configs perform on sm89?

I remember doing a lot of benchmarking, but I'm afraid I don't have the results saved... The SM89 configs eventually put there were meant to improve performance for smaller M; but choosing these configs is guesswork anyway - this is why auto-tuning is so important.

Anyway, is there any particular reason to use the SM89 kernel for SM120? There are SM90 and SM100 kernels in the same source file, written using the CUTLASS 3.x API (while the SM89 kernel uses the 2.x API), and these may be a better match.


@gau-nernst (Contributor, Author) commented:

> Anyway, is there any particular reason to use the SM89 kernel for SM120?

Not exactly. I haven't tried using the cutlass 3.x API for this (I can try later). But if the sm89 kernel works and we can get good perf with tuned configs, it's not strictly necessary to use the cutlass 3.x API, is it?

@alexsamardzic (Collaborator) commented:

> Not exactly. I haven't tried using the cutlass 3.x API for this (I can try later). But if the sm89 kernel works and we can get good perf with tuned configs, it's not strictly necessary to use the cutlass 3.x API, is it?

I may be wrong, but I would expect better performance with a 3.x kernel, as it targets archs that are closer to sm120 (TMA etc.); sm89 is really just a weird corner case regarding fp8 support.

@gau-nernst (Contributor, Author) commented Jun 14, 2025 (edited):

I don't think sm120 has TMA? Performant gemm in sm120 is still just cp.async+mma I think. No TMA or tcgen05 like sm90/sm100. Hence I'm expecting fp8 gemm in sm120 to be similar to that in sm89 (if we don't count new dtypes like mxfp8)

Edit: my mistake, seems like sm120 has TMA: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/collective/sm120_blockscaled_mma_tma.hpp


@eqy (Collaborator) left a comment:

This works without build changes to e.g. "${CMAKE_CURRENT_LIST_DIR}/../aten/src/ATen/native/cuda/RowwiseScaledMM.cu"?

@gau-nernst (Contributor, Author) commented Jun 15, 2025 (edited):

@eqy I probably need to add 120a to it. Thanks for the check. Is there anywhere else I should update? And how should I test locally (without building everything like flash attention) that the build is working as expected?

Locally, I compile pytorch with this command

DEBUG=1 USE_DISTRIBUTED=0 USE_MKLDNN=0 USE_CUDA=1 BUILD_TEST=0 USE_FBGEMM=0 USE_NNPACK=0 USE_QNNPACK=0 USE_XNNPACK=0 USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 CMAKE_LINKER_TYPE=MOLD TORCH_CUDA_ARCH_LIST="12.0 12.0a" python setup.py develop

(maybe the 12.0 was unnecessary; I was messing around with some compile problems)

Should I add an entry for sm120 like below as well?

if("${_arch}"STREQUAL"100a")
if(_existing_arch_flagsMATCHES".*compute_100.*")
list(APPEND _file_compile_flags"-gencode;arch=compute_100a,code=sm_100a")
endif()
endif()


@eqy (Collaborator) commented Jun 15, 2025:

> @eqy I probably need to add 120a to it. Thanks for the check. Is there anywhere else I should update? And how should I test locally (without building everything like flash attention) that the build is working as expected?
>
> Locally, I compile pytorch with this command
>
> DEBUG=1 USE_DISTRIBUTED=0 USE_MKLDNN=0 USE_CUDA=1 BUILD_TEST=0 USE_FBGEMM=0 USE_NNPACK=0 USE_QNNPACK=0 USE_XNNPACK=0 USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 CMAKE_LINKER_TYPE=MOLD TORCH_CUDA_ARCH_LIST="12.0 12.0a" python setup.py develop
>
> (maybe the 12.0 was unnecessary; I was messing around with some compile problems)

My guess is that you should be able to test locally with TORCH_CUDA_ARCH_LIST=12.0, as the 12.0a should be added for just that compilation unit if the CMake config is updated correctly.
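For example, a minimal sketch of such a local sanity check (plain PyTorch APIs only; the exact strings in the arch list depend on how TORCH_CUDA_ARCH_LIST was set for the build):

```python
import torch

# Confirm which SM architectures this build targets and what the local GPU reports.
print("compiled arch list:", torch.cuda.get_arch_list())         # expect an sm_120 entry for this setup
print("device capability:", torch.cuda.get_device_capability())  # (12, 0) on a 5090

# Tiny FP8 row-wise scaled mm: this should hit the CUTLASS path rather than
# raising a "not supported on this architecture" style error.
A = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)
B = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn).T
scale_A = torch.ones(16, 1, device="cuda")
scale_B = torch.ones(1, 16, device="cuda")
out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
print("output:", tuple(out.shape), out.dtype)
```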



using MainloopScheduleType = cutlass::gemm::collective::KernelScheduleAuto;
// on sm120, KernelScheduleAuto resolves to KernelTmaWarpSpecializedCooperativeSm120<2>>,
// which does not support TileShape.M < 128
Review comment from a Collaborator on the snippet above:

As of CUTLASS 3.9.2 right? May want to specify that.

@drisspg added the module: cuda and release notes: cuda labels Jun 16, 2025
@drisspg requested a review from eqy Jun 16, 2025 17:39
@drisspg added the ciflow/binaries label Jun 16, 2025
@gau-nernst (Contributor, Author) commented:

Can we merge this? Thank you!

@drisspg (Contributor) commented:

@pytorchbot merge


@pytorch-bot added the ciflow/trunk label Jun 17, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.



Reviewers

@Skylion007 left review comments
@eqy approved these changes
@drisspg approved these changes
@syed-ahmed: awaiting requested review (code owner)
@nWEIdia: awaiting requested review
@ngimel: awaiting requested review


Labels

ciflow/binaries, ciflow/trunk, Merged, module: cuda, open source, release notes: cuda


7 participants

@gau-nernst, @alexsamardzic, @eqy, @drisspg, @pytorchmergebot, @Skylion007, @pytorchbot
