Error in TP: RuntimeError: get_group_info: no group info associated with the group name #228

Open

Description

zy-ning opened on Jun 9, 2025

When I run

ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=2 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth

it fails with the error: RuntimeError: get_group_info: no group info associated with the group name.

Detailed error information:

W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] *****************************************
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] *****************************************
Using device=cuda
Loading model ...
Applying tensor parallel to model ...
Time to load model: 10.60 seconds
/root/serve/gpt-fast/tp.py:139: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead.
  attn.register_forward_hook(lambda _module, _input, output: funcol.all_reduce(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 480, in <module>
[rank0]:     main(
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 401, in main
[rank0]:     y, metrics = generate(
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 194, in generate
[rank0]:     next_token = prefill(model, prompt.view(batch_size, -1), input_pos, **sampling_kwargs).clone()
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 71, in prefill
[rank0]:     logits = model(mask, x, input_pos)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/model.py", line 156, in forward
[rank0]:     x = layer(x, input_pos, freqs_cis, mask)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/model.py", line 175, in forward
[rank0]:     h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in inner
[rank0]:     hook_result = hook(self, args, result)
[rank0]:   File "/root/serve/gpt-fast/tp.py", line 139, in <lambda>
[rank0]:     attn.register_forward_hook(lambda _module, _input, output: funcol.all_reduce(
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 176, in all_reduce
[rank0]:     tensor = torch.ops._c10d_functional.all_reduce(self, reduceOp.lower(), group_name)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: get_group_info: no group info associated with the group name

Environment: uv virtual environment, torch version 2.7.1+cu126.
