Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

(WIP) [ADAG] Support dag.experimental_compile(_custom_nccl_group= nccl_group) in aDAG#47987

Open
AndyUB wants to merge9 commits intoray-project:master
base:master
Choose a base branch
Loading
fromAndyUB:eccg-0925

Conversation

AndyUB
Copy link
Contributor

Why are these changes needed?

This supports the API to specify a custom NCCL group for P2P communication when compiling an accelerated DAG. Additionally, this PR allows the user to specify more than 1 custom NCCL group for the same compiled DAG. Specifically, the following code is supported.

withInputNode()asinp:send1=actor1.send.bind(inp)send1=send1.with_type_hint(TorchTensorType(transport="nccl"))recv1=actor2.recv.bind(send1)send2=actor2.send.bind(recv1)send2=send2.with_type_hint(TorchTensorType(transport=custom_nccl_group1))recv2=actor1.recv.bind(send2)dag=recv2compiled_dag=dag.experimental_compile(_custom_nccl_group=custom_nccl_group2)

The custom NCCL group provided fordag.experimental_compile will be used as the default P2P communicator forcompiled_dag. If a dag node's type hint specifiestransport="nccl", the default communicator will be used for its P2P NCCL communication. If a dag node's type hint specifiestransport=another_custom_nccl_group, that custom NCCL group will be used instead.

Requirements:

  1. If a type hint'stransport=custom_nccl_group, the sender and receiver involved must be members in that custom group, otherwise an error is raised.
  2. If a type hint'stransport="nccl" and a default custom NCCL group is specified indag.experimental_compile, the default group must contain the sender and receiver, otherwise an error is raised.

In the future, instead of raising an error for Requirement 2, we can consider constructing a_NcclGroup for the P2P actors that are not members of the group specified inexperimental_compile.

Related issue number

Closes#47540.

Checks

  • I've signed off every commit(by using the -s flag, i.e.,git commit -s) in this PR.
  • I've runscripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed forhttps://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it indoc/source/tune/api/ under the
      corresponding.rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures athttps://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@anyscalesamanyscalesam added triageNeeds triage (eg: priority, bug/not-bug, and owning component) coreIssues that should be addressed in Ray Core compiled-graphs labelsOct 16, 2024
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Copy link
Contributor

@dengwxndengwxn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Initial pass. Left some comments.

@@ -322,8 +322,6 @@ def test_torch_tensor_custom_comm(ray_start_regular):
sender = actor_cls.remote()
receiver = actor_cls.remote()

from cupy.cuda import nccl

class TestNcclGroup(GPUCommunicator):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@stephanie-wang Rui movesTestNcclGroup back inside each test in the patch PR. This is not necessary though.

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Copy link
Contributor

@dengwxndengwxn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@stephanie-wang Do we still need to create a default NCCL group if the specified custom NCCL group does not contain all the P2P actors?

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
@jjyaojjyao removed the triageNeeds triage (eg: priority, bug/not-bug, and owning component) labelOct 28, 2024
@stephanie-wang
Copy link
Contributor

@stephanie-wang Do we still need to create a default NCCL group if the specified custom NCCL group does not contain all the P2P actors?

I think it's better to throw an error. Better to do as little automagically as possible.

# The NCCL group ID for P2P send/recv operations.
self._nccl_group_id_p2p: Optional[str] = None
# The `custom_nccl_group` specified in `experimental_compile`.
self._custom_nccl_group_p2p: Optional[GPUCommunicator] = custom_nccl_group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change
self._custom_nccl_group_p2p:Optional[GPUCommunicator]=custom_nccl_group
self._default_nccl_group_p2p:Optional[GPUCommunicator]=custom_nccl_group

Is this ever different from the group ID in the next line?

AndyUB reacted with eyes emoji
Copy link
ContributorAuthor

@AndyUBAndyUBNov 1, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

There were 2 cases where they may be different.
  • If a default NCCL group is specified in `experimental_compile`, but it is never used in the DAG (e.g., there is no P2P NCCL send/recvs in the DAG), the `default_nccl_group_id_p2p` will be None.
  • If no default NCCL group is specified in `experimental_compile`, but there are some NCCL send/recv operations in the DAG whose `transport="nccl"`, either a NCCL group specified somewhere else in the DAG is reused for these P2P operations, or a new `_NcclGroup` is created. In this case, `default_nccl_group_id_p2p` is set to the id of that (reused/created) NCCL group.

To make it simpler, I've set it to thedefault_nccl_group_p2p's ID, if one was provided and used, and None otherwise.

In the case where a default NCCL group is provided but never used (and thus never initialized) in the DAG, the ID is still set to None.

@@ -765,7 +765,18 @@ def _preprocess(self) -> None:
self.input_task_idx, self.output_task_idx = None, None
self.actor_task_count.clear()

nccl_actors_p2p: Set["ray.actor.ActorHandle"] = set()
# Actors and DAG nodes involved in NCCL P2P send/recv with `transport=nccl`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The logic in this block of code is getting quite complicated. Let's try to clean it up a bit:

  • Separate it out into a helper function.
  • Ideally, we do it in these steps and none of them should be interleaved:
  1. Collect mapping of set(actors) <> NCCL group specified (None for "default").
  2. Validate all groups. I.e. if NCCL group was specified, check that actors are actually in that group
  3. Initialize groups if needed.
  4. Set type hints on DAG nodes.

AndyUB reacted with thumbs up emoji
Copy link
ContributorAuthor

@AndyUBAndyUBNov 1, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Factored steps 2-4 out into a helper function. Kept step 1 in_preprocess.

For step 1, I created:

  • A mapping of each NCCL group specified (None for "default") -> sender DAG nodes that specify it as theirtransport.
  • A mapping of each NCCL group specified (None for "default") -> sender and receiver DAG nodes that use it as theirtransport, where the receivers are downstream DAG nodes of the senders.

The first mapping is used to set the NCCL group ids in the type hints of the senders. The second mapping corresponds to the set of all actors that use the NCCL group and is used for validation. Maybe there is a simpler way to do this?

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
@AndyUB
Copy link
ContributorAuthor

@anyscalesam This PR is close to a final review. Could you add a go badge for running all tests? Thanks!

Copy link
Contributor

@tfsinghtfsingh left a comment
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looks great! Did a cursory pass for now (a question and a comment you can ignore), will do a more in-depth one when we meet later today

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
@ruisearch42
Copy link
Contributor

Hi@AndyUB@tfsingh , any plans to wrap up this work?

@staleStale
Copy link

stalebot commentedFeb 25, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stalestalebot added the staleThe issue is stale. It will be closed within 7 days unless there are further conversation labelFeb 25, 2025
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@dengwxndengwxndengwxn left review comments

@tfsinghtfsinghtfsingh left review comments

@ruisearch42ruisearch42ruisearch42 left review comments

@stephanie-wangstephanie-wangstephanie-wang requested changes

Requested changes must be addressed to merge this pull request.

Labels
compiled-graphscoreIssues that should be addressed in Ray CorestaleThe issue is stale. It will be closed within 7 days unless there are further conversation
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

[aDAG] Support experimental_compile(_custom_nccl_group= nccl_group) API
7 participants
@AndyUB@stephanie-wang@ruisearch42@dengwxn@tfsingh@jjyao@anyscalesam

[8]ページ先頭

©2009-2025 Movatter.jp