Add cuda kernel support for GGUF inference #11869


Draft

Isotr0py wants to merge 3 commits into huggingface:main from Isotr0py:gguf-kernel

Conversation

@Isotr0py (Contributor) commented on Jul 5, 2025 (edited)

What does this PR do?

  • Add GGUF CUDA kernel support for the GGUF forward pass
  • Extend GGUF support to i-matrix quantization
  • Only the dequantize ops are used currently, because the MMQ/MMVQ implementation is inefficient with diffusers' 3-dimensional batching (it was originally designed for vLLM's contiguous batching); a toy illustration of the resulting dequantize-then-matmul path is sketched below.
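
As a toy illustration of the point above (this is not code from the PR): once the GGUF-packed weight has been dequantized to a dense tensor, a plain F.linear handles diffusers' 3-dimensional (batch, sequence, hidden) activations directly, which is why the dequantize op alone is enough here.

import torch
import torch.nn.functional as F

# Toy shapes only; the dense weight below stands in for a dequantized GGUF weight.
batch, seq, hidden, out_features = 2, 16, 64, 128
x = torch.randn(batch, seq, hidden)        # diffusers-style 3-D activations
w = torch.randn(out_features, hidden)      # stand-in for the dequantized weight
y = F.linear(x, w)                         # broadcasts over the leading (batch, seq) dims
assert y.shape == (batch, seq, out_features)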

Test Code

import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.float16),
    torch_dtype=torch.float16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0), num_inference_steps=50).images[0]
image.save("flux-gguf.png")

Speed comparison
Native (6.39 s/it) vs CUDA kernel (5.32 s/it), roughly a 17% reduction in per-step time (~20% speed-up)

# Native implementation
Loading pipeline components...:  43%| 3/7 [00:00<00:00,  4.62it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%| 2/2 [00:30<00:00, 15.29s/it]
Loading pipeline components...: 100%| 7/7 [00:32<00:00,  4.58s/it]
100%| 50/50 [05:19<00:00,  6.39s/it]

# CUDA kernel
Loading checkpoint shards: 100%| 2/2 [00:36<00:00, 18.50s/it]
Loading pipeline components...: 100%| 7/7 [00:38<00:00,  5.51s/it]
100%| 50/50 [04:25<00:00,  5.32s/it]

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w requested a review from DN6 on Jul 5, 2025 21:22
Comment on lines +528 to +531
def forward(self, inputs: torch.Tensor):
if ops is not None and self.weight.is_cuda and inputs.is_cuda:
return self.forward_cuda(inputs)
return self.forward_native(inputs)
Member


This should be fairly safe as long as we get the same values (up to some tolerance) from both the native and the kernels variants.
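
A minimal sketch of such a tolerance check (a hypothetical helper, not code from this PR), assuming the layer exposes the forward_native / forward_cuda pair shown in the diff above:

import torch

def assert_forward_close(layer, dummy_input, atol=1e-2, rtol=1e-2):
    # Run the same dummy input through both paths and compare within a tolerance.
    ref = layer.forward_native(dummy_input)
    out = layer.forward_cuda(dummy_input)
    torch.testing.assert_close(out, ref, atol=atol, rtol=rtol)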

Isotr0py (Contributor, Author)

Yes, the kernels variants have been tested in vLLM's kernel-test CI against a NumPy-based dequantize implementation (https://github.com/vllm-project/vllm/blob/110df74332785ee749af47c5a3eb634d216b8f3b/tests/kernels/quantization/test_gguf.py#L67-L83).

@Isotr0py (Contributor, Author)

A comparison with torch.compile

Test code

import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()

prompt = "A cat holding a sign that says hello world"
for _ in range(2):
    image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")

Outputs

## main branch
28/28 [05:07<00:00, 10.98s/it]
28/28 [03:03<00:00,  6.54s/it]

## Kernels PR
28/28 [03:27<00:00,  7.40s/it]
28/28 [02:57<00:00,  6.33s/it]

Regarding warmup time, the kernels PR has a significantly lower compilation time, but after compilation both methods reach similar generation speed.

@DN6 (Collaborator) left a comment


Changes look good to me 👍🏽. Thank you @Isotr0py.

For testing purposes, would it be possible to configure whether to use the CUDA kernels via an env variable?

And could we add a standalone test to
https://github.com/huggingface/diffusers/blob/15d50f16f2320b669c77eae2034b6612c22bd2ef/tests/quantization/gguf/test_gguf.py
that just compares native forward vs kernel forward using a dummy tensor, similar to vLLM?
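
A rough sketch of what that toggle could look like (the env-var name DIFFUSERS_GGUF_CUDA_KERNELS is an assumption for illustration, not something this PR defines); the standalone test could then flip the variable and reuse a tolerance check like the one sketched earlier.

import os

# Hypothetical helper; the actual mechanism and variable name may differ.
def gguf_cuda_kernels_enabled() -> bool:
    return os.environ.get("DIFFUSERS_GGUF_CUDA_KERNELS", "1") == "1"

# The forward dispatch from the diff could then consult it, e.g.:
#     if gguf_cuda_kernels_enabled() and ops is not None and self.weight.is_cuda and inputs.is_cuda:
#         return self.forward_cuda(inputs)
#     return self.forward_native(inputs)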

Reviewers

@DN6 left review comments

@sayakpaul left review comments

At least 1 approving review is required to merge this pull request.

4 participants: @Isotr0py, @HuggingFaceDocBuilderDev, @DN6, @sayakpaul
