Open
Description
When I run ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=2 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth, it fails with the error: RuntimeError: get_group_info: no group info associated with the group name.
Detailed error information:
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] *****************************************
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] *****************************************
Using device=cuda
Loading model ...
Applying tensor parallel to model ...
Time to load model: 10.60 seconds
/root/serve/gpt-fast/tp.py:139: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead.
  attn.register_forward_hook(lambda _module, _input, output: funcol.all_reduce(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 480, in <module>
[rank0]:     main(
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 401, in main
[rank0]:     y, metrics = generate(
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 194, in generate
[rank0]:     next_token = prefill(model, prompt.view(batch_size, -1), input_pos, **sampling_kwargs).clone()
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 71, in prefill
[rank0]:     logits = model(mask, x, input_pos)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/model.py", line 156, in forward
[rank0]:     x = layer(x, input_pos, freqs_cis, mask)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/model.py", line 175, in forward
[rank0]:     h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in inner
[rank0]:     hook_result = hook(self, args, result)
[rank0]:   File "/root/serve/gpt-fast/tp.py", line 139, in <lambda>
[rank0]:     attn.register_forward_hook(lambda _module, _input, output: funcol.all_reduce(
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 176, in all_reduce
[rank0]:     tensor = torch.ops._c10d_functional.all_reduce(self, reduceOp.lower(), group_name)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: get_group_info: no group info associated with the group name

Environment: uv venv, torch version: 2.7.1+cu126