(WIP) [core][compiled graphs] Unify code paths for NCCL P2P and collectives scheduling #48649
base: master
Conversation
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Looks great. Some more TODOs before an initial review as we discussed offline:
cc @dengwxn
@anyscalesam Could you help add a go badge to run more CI tests? Thanks!
This reverts commit 941cb73. Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
After your attempt and on second thought, I think this might not be the best way to separate NCCL and non-NCCL ops by introducing another …
As we discussed offline, we should remove all the …
First pass. Structure seems right. Will look into details later.
I think this can be made simpler. Try to think about how you can achieve the following:
- `_NCCLSendNode`/`_NCCLRecvNode` should have the same interface as `_CollectiveOperation`.
- If the above is done properly, I believe we can get rid of most of the parts that need to differentiate between send/recv/collective. I.e., there should be only one `requires_nccl` flag instead of three, and there should be only one kind of DAG op node, a `COMPUTE` node.
python/ray/dag/nccl_operation.py Outdated
```python
def __init__(self):
    # Task indices in a compiled DAG. The indices are appended
    # in topological order if there are dependencies among the tasks.
```
Relying on this can be error-prone.
TODO: If an actor sends to itself, the current schedule is correct because `task_idxs` contains `SEND` and `RECV` in order. The order is provided by the topological sort.
```python
def send(self, buf: "torch.Tensor", peer_rank: int) -> None:
    ...
    # TODO(rui): find a better approach
    self._send_stream.synchronize()
    import torch
    buf.record_stream(torch.cuda.ExternalStream(self._send_stream.ptr))
```
Maybe use something better than `record_stream`.
```python
    "Sending the result of an asynchronous NCCL operation across actors. "
    "This will block the CPU while waiting for the NCCL operation to finish."
)
value = value.wait(blocking=True)
```
To be improved: Sending GPU future across actors forces a blocking wait.
Related: Currently, overlapping is done by immediately launching NCCL operations when their inputs are ready. The downstream task that reads the NCCL operation's output will get a GPU future and wait on it.
Limitations:
- Futures can't be sent across actors, in which case waiting can result in worse performance than executing the DAG with `overlap_gpu_communication=False`.
- Even when futures are only read locally by the same actor, if the actor does multiple NCCL recv operations, they will be scheduled as the first operations in the execution schedule. The `recv_stream.synchronize()` call in `_NcclGroup.recv` blocks the CPU on the second recv operation. (TODO: determine whether the synchronization is still needed.)
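To make the blocking-wait limitation concrete, here is a minimal, hypothetical future abstraction; `threading.Event` stands in for the CUDA event a real GPU future would wrap, and none of these names are Ray's actual API:

```python
import threading

class GpuFuture:
    """Minimal sketch of a GPU future; illustrative only."""

    def __init__(self):
        self._done = threading.Event()  # stands in for a CUDA event
        self._value = None

    def set_result(self, value):
        self._value = value
        self._done.set()

    def wait(self, blocking: bool = True):
        if blocking:
            # The CPU-blocking wait that is forced whenever the future
            # must be materialized before crossing an actor boundary.
            self._done.wait()
        elif not self._done.is_set():
            return None  # non-blocking poll: result not ready yet
        return self._value
```

A downstream task on the same actor could poll with `blocking=False` and keep overlapping work, but an actor boundary forces `blocking=True`, serializing communication and compute.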
We observe that when futures are sent across actors, overlap is slightly slower than the non-overlap version. This is because: (1) there is no benefit from overlapping when futures are sent across actors; (2) the exact stream synchronization differs between the overlap and non-overlap paths.
Open questions: Do we need different schedules for overlap vs. non-overlap? We currently change the schedule and move P2P earlier in the overlap case. Do we actually need to sync before launching another NCCL read/write? When an actor has multiple NCCL reads, they are not perfectly overlapped, since there is a sync between each read.
Just reviewed compiled_dag_node for now, will review the rest soon.
```python
self.input_reader: Optional[ReaderInterface] = None
# NCCL P2P recv uses the NCCL channel instead of the input reader.
if not self.requires_nccl_read:
    self.input_reader = SynchronousReader(self.input_channels)
```
Instead of adding this extra logic here, it would be better if we can make sure `self.input_channels` is always empty for `self.requires_nccl_read` when reaching this line. Same for `self.output_channels` and `self.requires_nccl_write`.
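One way to enforce the suggested invariant is an explicit check before the reader/writer is constructed. `Task` below is a hypothetical stand-in for the compiled DAG task, not the real class:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Stand-in for a compiled DAG task (illustrative only)."""
    requires_nccl_read: bool = False
    requires_nccl_write: bool = False
    input_channels: list = field(default_factory=list)
    output_channels: list = field(default_factory=list)

def check_channel_invariants(task: Task) -> None:
    # An NCCL P2P read replaces the input reader entirely, so no
    # ordinary input channels should remain; likewise for writes.
    if task.requires_nccl_read:
        assert not task.input_channels, "NCCL-read task has input channels"
    if task.requires_nccl_write:
        assert not task.output_channels, "NCCL-write task has output channels"
```

With the invariant upheld at construction time, the `if not self.requires_nccl_read` branch becomes unnecessary: the reader would simply be built over an empty channel list.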
```python
"""
from ray.experimental.channel.common import ChannelContext

ctx = ChannelContext.get_current().serialization_context
```
Nice fix here! Can you separate out the fix for CUDA event destruction into a separate PR? This will be faster to merge and less likely to get reverted.
```python
def _read(self, overlap_gpu_communication: bool) -> bool:
    def exec_operation_with_contexts(
```
This naming is slightly confusing because it sounds like a `context` should be passed as an argument to this function. Maybe just `exec_operation`, and `_exec_operation_without_context` for the inner method.
```python
with _device_context_manager():
    with self._send_stream:
        return self._write()
return False
```
This looks much better!
```python
other_args_to_resolve={
    PARENT_CLASS_NODE_KEY: recv_actor_handle,
    P2P_OPERATION_KEY: send_node.nccl_op,
    BIND_INDEX_KEY: node._get_bind_index(),
```
Does this mean that the send and recv nodes will get the same bind index as the original DAG node? Will it be a problem that multiple nodes have the same bind index?
```python
executable_tasks.sort(
    # If the bind index is the same, there are P2P send/recv tasks.
    # The order is determined as follows:
    # 1. P2P recv tasks.
```
I think this is quite brittle and difficult to understand. It would be better if we can think of a way to make all of the bind indices for a given actor unique.
Either we need to compute the new bind indices when creating the CompiledDAG or we need a way to create the NCCL nodes while the user is creating the initial DAG.
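A sketch of the first option, recomputing unique per-actor bind indices when building the `CompiledDAG`. The data shapes and the recv-before-compute-before-send tie-break are illustrative assumptions, not the real implementation:

```python
def reassign_bind_indices(tasks):
    """Give every task on an actor a unique bind index.

    `tasks` is a list of (actor_id, original_bind_index, kind) tuples
    produced by DAG construction plus the inserted NCCL nodes. For equal
    original indices, recv sorts before compute, which sorts before send
    (mirroring the sort comment in the quoted snippet above).
    """
    kind_order = {"recv": 0, "compute": 1, "send": 2}
    by_actor = {}
    result = []
    for actor, idx, kind in sorted(
        tasks, key=lambda t: (t[0], t[1], kind_order[t[2]])
    ):
        new_idx = by_actor.get(actor, 0)  # next free index on this actor
        by_actor[actor] = new_idx + 1
        result.append((actor, new_idx, kind))
    return result
```

After this pass, a plain sort on the new bind index suffices, and the brittle multi-way tie-break disappears.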
Why are these changes needed?
This PR unifies the code paths for NCCL P2P and collectives. Previously, scheduling for NCCL operations was done by splitting each DAG node into three operations: `READ`, `COMPUTE`, and `WRITE`. This PR simplifies the logic by keeping only the compute node. To ensure scheduling still works, NCCL operations are converted into special types of system-created compute nodes. This PR also allows overlapping NCCL collectives with computation.
NCCL P2P Refactoring
Before this PR, compiling this DAG resulted in a `TorchTensorNcclChannel` from `foo` to `bar`.
This PR adds a `NcclSendNode` after `foo` and a `NcclRecvNode` before `bar`. The `TorchTensorNcclChannel` now connects the two added nodes. Since `foo` and the send node are on the same actor, the channel from `foo` to the send node is an `IntraProcessChannel`; the same holds on the recv side.
Multiple Receivers
In this case, the sender sends to two different receivers.


Only one `NcclSendNode` is created. One `NcclRecvNode` is created per receiver. Like before, there is only one `TorchTensorNcclChannel`.
Multiple Senders
The receiver receives from two senders.


One `NcclSendNode` is created per sender. One `NcclRecvNode` is created per argument on the receiver. There are two different `TorchTensorNcclChannel`s.
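The node-insertion rules above (one send node per sender, one recv node per receiving argument) can be sketched on a toy edge list. The representation and names are illustrative assumptions, not Ray's internals:

```python
def insert_p2p_nodes(edges, node_actor):
    """Reroute every cross-actor edge through send/recv nodes.

    `edges` is a list of (src, dst) node names; `node_actor` maps a
    node name to its actor. Returns the rewritten edges and the
    extended node->actor map.
    """
    actors = dict(node_actor)
    send_nodes = {}
    new_edges = []
    for src, dst in edges:
        if actors[src] == actors[dst]:
            new_edges.append((src, dst))  # same actor: leave as-is
            continue
        send = send_nodes.get(src)
        if send is None:
            # Only one send node per sender, shared by all receivers.
            send = send_nodes[src] = f"send({src})"
            actors[send] = actors[src]
            new_edges.append((src, send))  # IntraProcessChannel
        recv = f"recv({src})->{dst}"       # one recv node per argument
        actors[recv] = actors[dst]
        new_edges.append((send, recv))     # NCCL channel edge
        new_edges.append((recv, dst))      # IntraProcessChannel
    return new_edges, actors
```

For `foo -> bar1` and `foo -> bar2` on three actors, this yields one `send(foo)` node and two recv nodes, matching the multiple-receivers case described above.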
Overlap NCCL Collectives
This is done by prioritizing NCCL operations over non-NCCL operations during scheduling: if both NCCL and non-NCCL operations are ready to be added to the actors' execution schedules, the NCCL operations are always added first.
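The prioritization rule can be sketched with a small ready queue; this is an illustrative model only, with NCCL-ness reduced to a boolean per op:

```python
import heapq

def order_ready_ops(ready_ops):
    """Order simultaneously-ready ops: NCCL ops first, then non-NCCL,
    preserving arrival order within each class.

    `ready_ops` is a list of (name, is_nccl) pairs.
    """
    heap = [
        # (priority class, arrival index, name): NCCL ops get class 0.
        (0 if is_nccl else 1, i, name)
        for i, (name, is_nccl) in enumerate(ready_ops)
    ]
    heapq.heapify(heap)
    out = []
    while heap:
        _, _, name = heapq.heappop(heap)
        out.append(name)
    return out
```

Scheduling communication as early as possible gives the compute ops the best chance to overlap with in-flight NCCL transfers.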
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.