ProcessGroupGloo: fix CUDA tensor stream handling with futures #170812


Open
d4l3k wants to merge 1 commit into main from d4l3k/gloo_stream_fix

Conversation

@d4l3k (Member) commented Dec 18, 2025 (edited)

Fixes #155714

There's a very subtle bug in Gloo where CUDA future streams aren't preserved correctly, leading to silent corruption when Gloo is used with a CUDA model and the DDP reducer.
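For context, the kind of hazard described above comes from CUDA's general ordering rule: work queued on one stream is not ordered against work on another stream unless something synchronizes them. The following is only a minimal sketch of that rule, not the code changed by this PR. The function name `read_last_on_side_stream` and the `producer`/`consumer` stream names are hypothetical, and the sketch returns `None` when PyTorch or a CUDA device is unavailable.

```python
try:
    import torch
except ImportError:  # PyTorch not installed; the sketch degrades to a no-op
    torch = None


def read_last_on_side_stream(n=1_000_000, fill=3.0):
    """Write a tensor on one stream, then read it on another with explicit ordering."""
    if torch is None or not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    producer = torch.cuda.Stream()
    consumer = torch.cuda.Stream(priority=-1)  # high priority, as in the test plan

    # Queue the writes on the producer stream.
    with torch.cuda.stream(producer):
        t = torch.full((n,), fill, device="cuda")
        t.mul_(2)

    # Order the consumer after the producer. Without this call the read below
    # could be scheduled before the writes above have finished.
    consumer.wait_stream(producer)

    # Read on the consumer stream; .item() copies to the host and blocks.
    with torch.cuda.stream(consumer):
        return t[-1].item()


if __name__ == "__main__":
    print(read_last_on_side_stream())
```

A caller that waits on a future from a different stream relies on that same ordering being established on its behalf, which is why a lost stream association can surface as silently wrong values rather than an error.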

Test plan:

import os

RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]

import torch
import torch.distributed as dist

torch.manual_seed(0)

dist.init_process_group("gloo")

N = 10
expected = torch.sum(torch.arange(0, WORLD_SIZE, dtype=torch.float)).item()

t = torch.full((1000000,), RANK, device="cuda", dtype=torch.float)
tensors = [t.clone() for _ in range(N)]

futs = []
for tensor in tensors:
    work = dist.all_reduce(tensor, async_op=True)
    futs.append(work.get_future())

# create high priority stream to do the CPU copy and preempt the default stream
stream = torch.cuda.Stream(priority=-1)

for fut, tensor in zip(futs, tensors):
    with torch.cuda.stream(stream):
        fut.wait()
        val = tensor[-1].item()
        assert val == expected, f"Expected {expected}, got {val}"
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/gloo_future_stream.py
BACKEND=gloo WORLD_SIZE=4 TEMP_DIR=/tmp/foo pytest test/distributed/test_distributed_spawn.py -v -s -x  -k 'test_ddp_apply_optim_in_backward'
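
The `expected` value in the script above follows from simple arithmetic: each rank fills its tensor with its own rank, and `all_reduce` sums by default, so every element should end up as 0 + 1 + ... + (WORLD_SIZE - 1). The pure-Python sketch below works through that without any GPUs. The helper names `expected_allreduce_value` and `simulate_sum_allreduce` are hypothetical and only model the arithmetic, not the collective itself.

```python
def expected_allreduce_value(world_size: int) -> float:
    """Value every element should hold after a sum all-reduce of per-rank fills."""
    return float(sum(range(world_size)))


def simulate_sum_allreduce(per_rank_values):
    """Model a sum all-reduce: every rank ends up with the total of all inputs."""
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]


if __name__ == "__main__":
    world_size = 4
    inputs = [float(rank) for rank in range(world_size)]
    print(expected_allreduce_value(world_size))
    print(simulate_sum_allreduce(inputs))
```

Any rank that reads a different value after `fut.wait()` has observed the tensor before the reduced result was visible on its stream, which is the corruption the assertion is meant to catch.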

@pytorch-bot (bot) commented Dec 18, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/170812

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 41bc8d9 with merge base 1984725:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeffdaily (Collaborator) left a comment

Confirmed, fixes #155714.

@d4l3k (Member, Author)

@pytorchbot merge


@pytorch-bot (bot) added the ciflow/trunk (Trigger trunk jobs on your pull request) label Dec 18, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2022-cpu-py3 / build

Details for Dev Infra team: Raised by workflow job


Reviewers

@jeffdaily approved these changes

@fduwjj approved these changes

@tushar00jain: awaiting requested review

Assignees

No one assigned

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
release notes: distributed (c10d) (release notes category)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

DISABLED test_ddp_apply_optim_in_backward (__main__.TestDistBackendWithSpawn)

5 participants

@d4l3k @pytorchmergebot @jeffdaily @fduwjj
