I found that the sequence_parallel_size in the provided Ulysses example (test_ulysses_sp_hf.py) is equal to the world_size (total number of GPUs). If sequence_parallel_size is less than world_size, training fails during the backward pass with the error below:

```
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 65, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/nn/functional.py", line 343, in backward
    gx = torch.empty_like(grad_outputs[rank])
IndexError: tuple index out of range
```
The error is likely caused by the following: in the backward of torch._AllGather, grad_outputs only has sp_world_size entries, but torch.distributed.get_rank() makes each GPU pick its gradient using its global rank, which produces this index-out-of-range error. So is it required that sequence_parallel_size be equal to world_size?
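For what it's worth, here is a minimal standalone sketch of the indexing mismatch I suspect (world_size = 8, sp_world_size = 4, and the contiguous rank-to-group mapping are made-up values for illustration, not taken from the Ulysses example): grad_outputs only has one entry per rank in the sequence-parallel group, so indexing it with the global rank fails on ranks >= sp_world_size, whereas a group-local rank (e.g. torch.distributed.get_rank(group=ctx.group)) would stay in range.

```python
import torch

world_size = 8      # hypothetical total number of GPUs
sp_world_size = 4   # hypothetical sequence_parallel_size (< world_size)

# _AllGather.backward receives one gradient per rank in the SP group.
grad_outputs = tuple(torch.randn(2, 3) for _ in range(sp_world_size))

for global_rank in range(world_size):
    # Group-local rank, assuming SP groups are built from contiguous global ranks.
    group_rank = global_rank % sp_world_size
    try:
        # What the traceback shows: indexing grad_outputs with the *global* rank.
        gx = torch.empty_like(grad_outputs[global_rank])
        print(f"global rank {global_rank}: global-rank indexing happens to work")
    except IndexError:
        print(f"global rank {global_rank}: IndexError with global-rank indexing; "
              f"group-local rank {group_rank} would stay in range")
        gx = torch.empty_like(grad_outputs[group_rank])
```

When sequence_parallel_size == world_size the two ranks coincide, which would explain why only the provided example configuration works.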