
Context Parallel w/ Ring & Ulysses & Unified Attention #11941


Draft

a-r-r-o-w wants to merge 30 commits into main from attn-dispatcher-cp-and-training

+1,118 −168

a-r-r-o-w (Member) commented Jul 16, 2025 (edited)

Adds native support for ring, Ulysses, and unified attention. To keep the PoC minimal, I've limited the changes to Flux.

Supported attention backends with CP: cuDNN, FA2, Sage.
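
For readers new to ring attention, here is a minimal single-process sketch of the numerical idea behind it, not the code in this PR. In a real ring each rank keeps its queries and receives one key/value block per step from a neighboring rank; here a plain loop stands in for that communication, and all function names are made up for illustration. The point is that merging blocks with a running (online) softmax gives the same result as attending over the full sequence at once.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) @ V for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def ring_attention(q, kv_blocks):
    """Visit one K/V block per step and merge it with a running softmax.

    The loop simulates the ring: in a distributed run, each iteration would
    use the block received from the neighboring rank (assumption for this sketch).
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(kv_blocks[0][1][0])
    for keys, values in kv_blocks:
        scores = [dot(q, k) * scale for k in keys]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[i] for w, v in zip(weights, values))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Two "ranks", each owning half of the key/value sequence.
kv_blocks = [(keys[:2], values[:2]), (keys[2:], values[2:])]

reference = full_attention(q, keys, values)
ringed = ring_attention(q, kv_blocks)
max_abs_diff = max(abs(a - b) for a, b in zip(reference, ringed))
```

The two results agree up to floating-point rounding, which is why the sequence can be split across devices without changing the attention output.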

Requires #11916 to be merged first.

Minimal example

```python
import torch
from diffusers import FluxPipeline

try:
    torch.distributed.init_process_group("nccl")
    rank = torch.distributed.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
    pipe.to(device)
    # pipe.transformer.parallelize(ring_degree=2)
    pipe.transformer.parallelize(ulysses_degree=2)
    pipe.transformer.set_attention_backend("_native_cudnn")

    prompt = "A cat holding a sign that says 'hello world'"

    # Must specify generator so all ranks start with same latents (or pass your own)
    generator = torch.Generator().manual_seed(42)
    image = pipe(prompt, num_inference_steps=28, guidance_scale=4.0, generator=generator).images[0]

    if rank == 0:
        image.save("output.png")

except Exception as e:
    print(f"An error occurred: {e}")
    torch.distributed.breakpoint()
    raise

finally:
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```
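
The comment about the generator in the example is worth a small illustration. Each rank runs the pipeline in its own process, so each rank samples its own initial noise; the ranks only agree if they seed identically. The sketch below uses Python's stdlib `random` as a dependency-free stand-in for `torch.Generator` (an assumption made for illustration), and `initial_latents` is a hypothetical helper, not part of diffusers.

```python
import random


def initial_latents(seed, size=4):
    """Stand-in for latent sampling: draw Gaussian values from a private generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# Two simulated ranks that seed their generators identically.
rank0_latents = initial_latents(seed=42)
rank1_latents = initial_latents(seed=42)
same_seed_match = rank0_latents == rank1_latents

# A rank that seeds differently starts from different noise.
rank1_other = initial_latents(seed=43)
different_seed_match = rank0_latents == rank1_other
```

With matching seeds the ranks start from identical latents, which is what the sharded forward pass needs; with mismatched seeds each rank would be denoising a different sample.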

Benchmarks

TODO

Explanation

Each model should define a _cp_plan attribute that contains information on how to shard/gather tensors at different stages of the forward.
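
To make the shard/gather idea concrete, here is a rough sketch under stated assumptions. The plan format, its keys, and the helper names below are hypothetical and only illustrate the concept; the actual _cp_plan schema in this PR may look different. Tensors are modeled as nested lists shaped [batch][sequence][hidden], and the plan says to split along the sequence dimension when entering the model and to gather along it at the output.

```python
# Hypothetical plan: stage name -> what to do with the tensor and on which dim.
HYPOTHETICAL_CP_PLAN = {
    "model_input": {"action": "shard", "dim": 1},
    "model_output": {"action": "gather", "dim": 1},
}


def shard_seq(hidden_states, rank, world_size):
    """Return this rank's contiguous slice along the sequence dimension (dim 1)."""
    seq_len = len(hidden_states[0])
    if seq_len % world_size != 0:
        raise ValueError("sequence length must be divisible by world_size")
    chunk = seq_len // world_size
    return [sample[rank * chunk:(rank + 1) * chunk] for sample in hidden_states]


def gather_seq(shards):
    """Concatenate per-rank shards back along the sequence dimension (dim 1)."""
    batch_size = len(shards[0])
    return [[token for shard in shards for token in shard[b]]
            for b in range(batch_size)]


def apply_plan_entry(entry, data, world_size):
    """Apply one plan entry; only dim=1 is handled in this sketch."""
    if entry["dim"] != 1:
        raise NotImplementedError("sketch only handles the sequence dimension")
    if entry["action"] == "shard":
        return [shard_seq(data, rank, world_size) for rank in range(world_size)]
    if entry["action"] == "gather":
        return gather_seq(data)
    raise ValueError(f"unknown action: {entry['action']}")


# Batch of 1, sequence length 8, hidden size 1.
hidden_states = [[[float(t)] for t in range(8)]]

shards = apply_plan_entry(HYPOTHETICAL_CP_PLAN["model_input"], hidden_states, world_size=2)
restored = apply_plan_entry(HYPOTHETICAL_CP_PLAN["model_output"], shards, world_size=2)
```

Here each of the two simulated ranks ends up with four of the eight tokens, and gathering at the output restores the original sequence order.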

TODO

Note: There were some merge conflicts that I'm not sure I resolved correctly, so some things may be broken. For this reason, I've removed training support and only tested inference. I'll update some of the TODOs tomorrow.

a-r-r-o-w and others added 21 commits

July 14, 2025 04:47

update

d7b9e42

update

7e97e43

add coauthor

ecabd2a

Co-Authored-By: Dhruv Nair <dhruv.nair@gmail.com>

improve test

ff21b7f

handle ip adapter params correctly

b8f7fe6

Merge branch 'main' into to-single-file/flux

17b678f

fix chroma qkv fusion test

0cda91d

fix fastercache implementation

bc64f12

fix more tests

a0b276d

fight more tests

c141520

add back set_attention_backend

4dcd672

update

576da52

update

e909b73

make style

1e7217f

make fix-copies

4f52e34

make ip adapter processor compatible with attention dispatcher

d9c1683

refactor chroma as well

a73cb39

remove rmsnorm assert

1e6b1c5

minify and deprecate npu/xla processors

251bb61

Merge branch 'main' into to-single-file/flux

84d2c84

update

51fed50

a-r-r-o-w requested review from DN6, yiyixuxu, sayakpaul and SunMarc

July 16, 2025 15:25

HuggingFaceDocBuilderDev commented Jul 16, 2025

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.