I found that the sequence_parallel_size in the provided Ulysses example (test_ulysses_sp_hf.py) is equal to the world_size (total number of GPUs). If sequence_parallel_size is less than world_size, training fails during the backward pass with the error below:

```
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 65, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/nn/functional.py", line 343, in backward
    gx = torch.empty_like(grad_outputs[rank])
IndexError: tuple index out of range
```
The error is likely caused by the following: in the backward of torch._AllGather, grad_outputs only has sp_world_size entries, but torch.distributed.get_rank() makes each GPU pick its gradient using its global rank, which produces this index-out-of-range error. So is it required that sequence_parallel_size be equal to world_size?
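For what it's worth, here is a minimal standalone sketch of the indexing mismatch I suspect (world_size = 8, sp_world_size = 4, and the contiguous rank-to-group mapping are made-up values for illustration, not taken from the Ulysses example): grad_outputs only has one entry per rank in the sequence-parallel group, so indexing it with the global rank fails on ranks >= sp_world_size, whereas a group-local rank (e.g. torch.distributed.get_rank(group=ctx.group)) would stay in range.

```python
import torch

world_size = 8      # hypothetical total number of GPUs
sp_world_size = 4   # hypothetical sequence_parallel_size (< world_size)

# _AllGather.backward receives one gradient per rank in the SP group.
grad_outputs = tuple(torch.randn(2, 3) for _ in range(sp_world_size))

for global_rank in range(world_size):
    # Group-local rank, assuming SP groups are built from contiguous global ranks.
    group_rank = global_rank % sp_world_size
    try:
        # What the traceback shows: indexing grad_outputs with the *global* rank.
        gx = torch.empty_like(grad_outputs[global_rank])
        print(f"global rank {global_rank}: global-rank indexing happens to work")
    except IndexError:
        print(f"global rank {global_rank}: IndexError with global-rank indexing; "
              f"group-local rank {group_rank} would stay in range")
        gx = torch.empty_like(grad_outputs[group_rank])
```

When sequence_parallel_size == world_size the two ranks coincide, which would explain why only the provided example configuration works.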