Add cuda kernel support for GGUF inference #11869


Draft

Isotr0py wants to merge 3 commits into huggingface:main from Isotr0py:gguf-kernel

Conversation

@Isotr0py (Contributor) commented on Jul 5, 2025 (edited)

What does this PR do?

  • Add GGUF CUDA kernel support for the GGUF forward pass
  • Extend GGUF support to i-matrix quantization
  • Only the dequantize ops are used currently, because the MMQ/MMVQ implementation is inefficient with diffusers' 3-dimensional batching (it was originally designed for vLLM's contiguous batching); a toy illustration of the resulting dequantize-then-matmul path is sketched below.
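
As a toy illustration of the point above (this is not code from the PR): once the GGUF-packed weight has been dequantized to a dense tensor, a plain F.linear handles diffusers' 3-dimensional (batch, sequence, hidden) activations directly, which is why the dequantize op alone is enough here.

import torch
import torch.nn.functional as F

# Toy shapes only; the dense weight below stands in for a dequantized GGUF weight.
batch, seq, hidden, out_features = 2, 16, 64, 128
x = torch.randn(batch, seq, hidden)        # diffusers-style 3-D activations
w = torch.randn(out_features, hidden)      # stand-in for the dequantized weight
y = F.linear(x, w)                         # broadcasts over the leading (batch, seq) dims
assert y.shape == (batch, seq, out_features)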

Test Code

import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.float16),
    torch_dtype=torch.float16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0), num_inference_steps=50).images[0]
image.save("flux-gguf.png")

Speed comparison
Native (6.39 s/it) vs CUDA kernel (5.32 s/it), roughly a 17% reduction in per-step time (~20% speed-up)

# Native implementation
Loading pipeline components...:  43%| 3/7 [00:00<00:00,  4.62it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%| 2/2 [00:30<00:00, 15.29s/it]
Loading pipeline components...: 100%| 7/7 [00:32<00:00,  4.58s/it]
100%| 50/50 [05:19<00:00,  6.39s/it]

# CUDA kernel
Loading checkpoint shards: 100%| 2/2 [00:36<00:00, 18.50s/it]
Loading pipeline components...: 100%| 7/7 [00:38<00:00,  5.51s/it]
100%| 50/50 [04:25<00:00,  5.32s/it]

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w requested a review from DN6 on Jul 5, 2025 21:22
Comment on lines +528 to +531
def forward(self, inputs: torch.Tensor):
if ops is not None and self.weight.is_cuda and inputs.is_cuda:
return self.forward_cuda(inputs)
return self.forward_native(inputs)
Member


This should be fairly safe as long as we get the same values (up to some tolerance) from both the native and the kernels variants.
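
A minimal sketch of such a tolerance check (a hypothetical helper, not code from this PR), assuming the layer exposes the forward_native / forward_cuda pair shown in the diff above:

import torch

def assert_forward_close(layer, dummy_input, atol=1e-2, rtol=1e-2):
    # Run the same dummy input through both paths and compare within a tolerance.
    ref = layer.forward_native(dummy_input)
    out = layer.forward_cuda(dummy_input)
    torch.testing.assert_close(out, ref, atol=atol, rtol=rtol)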

Isotr0py (Contributor, Author)

Yes, the kernels variants have been tested in vLLM's kernel-test CI against a NumPy-based dequantize implementation (https://github.com/vllm-project/vllm/blob/110df74332785ee749af47c5a3eb634d216b8f3b/tests/kernels/quantization/test_gguf.py#L67-L83).

@Isotr0py (Contributor, Author)

A comparison with torch.compile

Test code

import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()

prompt = "A cat holding a sign that says hello world"
for _ in range(2):
    image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")

Outputs

## main branch
28/28 [05:07<00:00, 10.98s/it]
28/28 [03:03<00:00,  6.54s/it]

## Kernels PR
28/28 [03:27<00:00,  7.40s/it]
28/28 [02:57<00:00,  6.33s/it]

Regarding warmup time, the kernels PR has a significantly lower compilation time, but after compilation both methods reach similar generation speed.

@DN6 (Collaborator) left a comment


Changes look good to me 👍🏽. Thank you @Isotr0py.

For testing purposes, would it be possible to configure whether to use the CUDA kernels via an env variable?

And could we add a standalone test to
https://github.com/huggingface/diffusers/blob/15d50f16f2320b669c77eae2034b6612c22bd2ef/tests/quantization/gguf/test_gguf.py
that just compares native forward vs kernel forward using a dummy tensor, similar to vLLM?
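
A rough sketch of what that toggle could look like (the env-var name DIFFUSERS_GGUF_CUDA_KERNELS is an assumption for illustration, not something this PR defines); the standalone test could then flip the variable and reuse a tolerance check like the one sketched earlier.

import os

# Hypothetical helper; the actual mechanism and variable name may differ.
def gguf_cuda_kernels_enabled() -> bool:
    return os.environ.get("DIFFUSERS_GGUF_CUDA_KERNELS", "1") == "1"

# The forward dispatch from the diff could then consult it, e.g.:
#     if gguf_cuda_kernels_enabled() and ops is not None and self.weight.is_cuda and inputs.is_cuda:
#         return self.forward_cuda(inputs)
#     return self.forward_native(inputs)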

Reviewers

@DN6 left review comments

@sayakpaul left review comments

At least 1 approving review is required to merge this pull request.

4 participants: @Isotr0py, @HuggingFaceDocBuilderDev, @DN6, @sayakpaul
