Commit 8deb4fe
Fix flaky NCCL error handling tests. (#42149)
Summary:
Pull Request resolved: #42149

Some of these tests were flaky since we could kill the process in some way without cleaning up the ProcessGroup. This resulted in issues where the FileStore didn't clean up appropriately, causing other processes in the group to crash.

Fixed this by explicitly deleting the process_group before we bring a process down forcibly.

ghstack-source-id: 108629057

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D22785042

fbshipit-source-id: c31d0f723badbc23b7258e322f75b57e0a1a42cf
1 parent b6a9f42, commit 8deb4fe
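
The idea in the summary (release a file-backed resource explicitly before a call that may bring the process down) can be sketched in isolation. This is a minimal analogy, not the c10d implementation: `FileBackedStore` and `cleanup_then_run` are hypothetical names, and the sketch assumes CPython, where dropping the last reference with `del` runs `__del__` immediately.

```python
import os
import tempfile


class FileBackedStore:
    """Hypothetical stand-in for a file-backed store; NOT the c10d FileStore."""

    def __init__(self, path):
        self.path = path
        # Create the backing file if it does not exist yet.
        with open(path, "a"):
            pass

    def __del__(self):
        # Remove the backing file when the object is destroyed.
        try:
            os.remove(self.path)
        except FileNotFoundError:
            pass


def cleanup_then_run(path, func):
    """Create a store, drop it explicitly, then call func.

    func stands in for an operation that may terminate the process
    abruptly (for example via a signal or os._exit), in which case
    destructors that have not run yet would be skipped.
    """
    store = FileBackedStore(path)
    existed_before = os.path.exists(path)
    # Assumption: this is the only reference, so CPython runs __del__ here,
    # before func() gets a chance to kill the process.
    del store
    existed_after = os.path.exists(path)
    func()
    return existed_before, existed_after


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "store_file")
    print(cleanup_then_run(demo_path, lambda: None))
```

If any other object still held a reference to the store, `del` would only remove the local name and the cleanup would not run at that point, so the pattern depends on the caller owning the last reference.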

1 file changed (+3, -1 lines changed)

test/distributed/test_c10d.py

Lines changed: 3 additions & 1 deletion

@@ -3413,6 +3413,8 @@ def _test_nccl_errors_blocking(self, func):
             # aborting nccl communicators before throwing Operation timed out
             a = torch.rand(10).cuda(self.rank)
         elif self.rank == 1:
+            # Clean up structures (ex: files for FileStore before going down)
+            del process_group
             func()
         else:
             # Wait for timeout
@@ -3494,7 +3496,7 @@ def _wait_for_comm_abort(self, process_group):
                     return
                 else:
                     raise e
-            time.sleep(1)
+            time.sleep(0.1)
 
     @requires_nccl()
     @skip_if_lt_x_gpu(3)
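
The second hunk shortens the sleep in a retry loop that keeps issuing an operation until it fails with an expected error. A rough, torch-free sketch of that polling pattern follows. The function name `wait_until_raises`, its parameters, and the overall timeout are assumptions added for illustration; the loop visible in the diff has no timeout of its own.

```python
import time


def wait_until_raises(op, expected_substring, poll_interval=0.1, timeout=5.0):
    """Call op() repeatedly until it raises an error containing expected_substring.

    Returns the number of attempts made. Any other exception is re-raised.
    The timeout is an addition for this sketch, not part of the original loop.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        try:
            op()
        except Exception as e:
            if expected_substring in str(e):
                # The expected failure was observed; stop polling.
                return attempts
            # Unexpected failure: propagate it to the caller.
            raise
        # A shorter interval means the expected error is noticed sooner.
        time.sleep(poll_interval)
    raise TimeoutError("expected error was not observed before the timeout")


if __name__ == "__main__":
    state = {"calls": 0}

    def op():
        # Succeeds twice, then fails with the expected message.
        state["calls"] += 1
        if state["calls"] >= 3:
            raise RuntimeError("communicator was aborted")

    print(wait_until_raises(op, "aborted", poll_interval=0.01))
```

With a one-second interval the caller can wait up to a full second after the error first becomes observable; a 0.1-second interval reduces that worst-case delay at the cost of more calls to `op`.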