[Compiled Graph] Enhance Compile Graph with Multi-Device Support #51032
Conversation
Force-pushed from f63223c to 27215f3
@ruisearch42 Good day. Could you please review this PR? Thanks.
""" | ||
raise NotImplementedError | ||
return None |
If the method returns None, is it still necessary to mark it as an abstract method?
Keep this class abstract.
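A small illustration of the point being discussed, using made-up names rather than the actual Ray interface: a method can stay marked as `@abstractmethod` even when its default body just returns `None`, which keeps the base class abstract while still giving subclasses a default to fall back on.

```python
from abc import ABC, abstractmethod
from typing import Optional

class BaseTransport(ABC):  # hypothetical name, not a Ray class
    @abstractmethod
    def get_stream(self) -> Optional[object]:
        # Default behavior: no device stream. Subclasses must still override
        # (or explicitly call super()) because the method is abstract.
        return None

class CpuTransport(BaseTransport):
    def get_stream(self) -> Optional[object]:
        return super().get_stream()  # reuse the default None

# BaseTransport() would raise TypeError: can't instantiate abstract class
print(CpuTransport().get_stream())  # -> None
```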
Force-pushed from a4d06c9 to 705a10d
Hi @hipudding, is this ready for review? BTW, aDAG was renamed to Compiled Graph: https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html
hipudding commented Mar 5, 2025 (edited)
Yes, I think the main functionality of this PR is ready; I may make some minor adjustments later. Thanks for your review.
Thanks. I will review tomorrow!
Trying to better understand the PR
python/ray/dag/compiled_dag_node.py Outdated
with self._send_stream:
    return self._write()
Why don't we want to use context manager?
This stream context manager is provided by cupy and only supports CUDA devices, so I removed the context manager and specified the stream explicitly in read and write.
But I think I could implement a stream context manager in Ray so that cupy is not needed. Which is better?
Done. Used torch StreamContext instead to avoid too many changes.
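A minimal sketch of that approach (the helper name and usage here are assumptions, not the PR's actual code): pick the `torch.<device>.StreamContext` that matches the accelerator instead of `cupy.cuda.ExternalStream`, which is CUDA-only.

```python
import torch

def stream_ctx(stream, device_type: str = "cuda"):
    # Hypothetical helper: resolve torch.cuda / torch.npu (the latter is
    # provided by torch_npu) and return its StreamContext.
    device_mod = getattr(torch, device_type)
    return device_mod.StreamContext(stream)

# Usage sketch, assuming a CUDA device is available:
if torch.cuda.is_available():
    send_stream = torch.cuda.Stream()
    with stream_ctx(send_stream, "cuda"):
        x = torch.ones(4, device="cuda") * 2  # enqueued on send_stream
    torch.cuda.current_stream().wait_stream(send_stream)
```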
python/ray/dag/compiled_dag_node.py Outdated
else:
    future = ResolvedFuture(val)
self._intermediate_future = future
def reset_and_wait_intermediate_future(self) -> Any:
def reset_and_wait_intermediate_future(self, stream: Any = None) -> Any:
update doc for arg
Done. Reverted this part.
python/ray/dag/compiled_dag_node.py Outdated
self,
val: Any,
wrap_in_gpu_future: bool,
stream: Any = None,
update doc for stream
Done. Reverted this part.
class _DriverGroupHolder(Communicator):
    """
    Communicator placeholder for the driver. Since the driver may have no
    accelerators, TorchDeviceManager cannot be used.
    """
Can you elaborate a bit why this is needed?
I replaced all torch.cuda.xxx calls with TorchDeviceManager, including the creation of a _NcclGroup. But the driver keeps a communicator instance itself: it does not join the group, but it needs to record some information such as the world size and the actor handles.
However, the driver may have no GPU (or other accelerator), so it would choose the wrong TorchDeviceManager (CPUTorchDeviceManager), and get_communicator would return None.
Since the driver does not join the group and only uses the Communicator to store some information, I created a hardware-independent class for the driver.
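A rough sketch of what such a hardware-independent holder can look like (illustrative names and fields only, not the PR's actual class): it records group metadata on the driver without touching any device API.

```python
from typing import Any, List, Optional

class DriverGroupInfo:  # illustrative stand-in for the placeholder described above
    """Stores group metadata on the driver without joining the collective group."""

    def __init__(
        self,
        comm_id: Any,
        actor_handles: List[Any],
        rank: Optional[int] = None,  # the driver itself has no rank
    ):
        self._comm_id = comm_id
        self._actor_handles = actor_handles
        self._rank = rank

    def get_world_size(self) -> int:
        return len(self._actor_handles)

    def get_actor_handles(self) -> List[Any]:
        return self._actor_handles

    def get_rank(self, actor: Any) -> int:
        # Maps an actor handle to its rank; never needed for the driver itself.
        return self._actor_handles.index(actor)
```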
Comment updated.
comm_id: int,
rank: Optional[int],
actor_handles: List["ray.actor.ActorHandle"],
acl_stream: Optional[Any],
What is acl_stream? Please add docs for the args.
acl_stream is the raw stream on Ascend NPU, just like cuda_stream on GPU.
What about naming this stream torch_stream, since acl_stream and cuda_stream are both torch streams now?
Done. I changed it to npu_stream for better understanding.
def get_world_size(self) -> int:
    return self._world_size
def send(self, buf: "torch.Tensor", peer_rank: int) -> None:
Many methods in this class look like code duplicated from _NcclGroup with minor changes. Should they be unified?
Yes, we should indeed reduce duplicate code. I will think about how to design it.
Indeed, these two files share logical similarities and contain a lot of duplicate code. However, there are several differences in the details. For example, torch.cuda.xxx needs to be replaced with torch.npu.xxx, hccl_group requires an additional StreamContext, and the set device operation must be performed before get_unique_id and creating the HCCL group. Therefore, I haven’t found a good way to merge these two files. Do you have any suggestions?
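One possible direction, sketched here with made-up names rather than as a concrete proposal for the Ray code: push the device-specific pieces (torch module, the set_device-before-init ordering, the stream context) behind small hooks in a shared base class, so the NCCL and HCCL groups only fill in those hooks.

```python
from abc import ABC, abstractmethod
import contextlib

class _CommGroupBase(ABC):
    """Shared skeleton; device-specific subclasses fill in the hooks."""

    def __init__(self, world_size: int, rank: int, device_index: int):
        self._world_size = world_size
        self._rank = rank
        # Ascend needs set_device *before* get_unique_id / group creation,
        # so the base class always calls the hook first.
        self._set_device(device_index)
        self._comm = self._create_communicator()

    # Device-specific hooks:
    @abstractmethod
    def _set_device(self, index: int) -> None: ...

    @abstractmethod
    def _create_communicator(self): ...

    def _stream_ctx(self, stream):
        # Default: no extra stream context (plain NCCL path); an HCCL subclass overrides this.
        return contextlib.nullcontext()

    # Shared logic:
    def get_world_size(self) -> int:
        return self._world_size

    def send(self, buf, peer_rank: int) -> None:
        with self._stream_ctx(None):
            self._comm.send(buf, peer_rank)

# Toy subclass just to show the shape of the hooks:
class _FakeGroup(_CommGroupBase):
    def _set_device(self, index: int) -> None:
        print(f"set device {index} before creating the communicator")

    def _create_communicator(self):
        class _Comm:
            def send(self, buf, peer_rank):
                print(f"send {buf!r} to rank {peer_rank}")
        return _Comm()

_FakeGroup(world_size=2, rank=0, device_index=0).send("tensor", peer_rank=1)
```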
Force-pushed from 1e80d9f to d682702
hipudding commented Mar 7, 2025 (edited)
Hi @ruisearch42, could you please review this PR again? BTW, I fixed the test case failure in doc_test_cgraph_nccl, which was blocking this PR.
…-project#51032)

This PR improves multi-device support in Compiled Graph, which significantly reduces tensor transmission latency by utilizing out-of-band communication. Currently, this feature only supports CUDA's NCCL. Since Ray already supports multiple accelerators, it is necessary to extend Compiled Graph to support multi-device as well.

This PR mainly introduces two key changes:
1. Removed dependency on cupy.cuda.ExternalStream - since this library only supports CUDA devices, we replaced it with a more general stream context manager to accommodate various accelerators. The new implementation uses torch.{device}.StreamContext.
2. Replaced hardcoded torch.cuda.xxx calls with AcceleratorRuntime - this allows automatic detection of the accelerator type and invokes the appropriate device-specific functions.

```python
import ray
import torch
import torch_npu
from ray.dag import InputNode
from ray.experimental.channel.hccl_group import _HcclGroup
from ray.experimental.channel.accelerator_context import register_accelerator_context

@ray.remote
class TorchTensorWorker:
    def __init__(self):
        self.device = torch.device('npu:0')
        torch.npu.set_device(self.device)

    def send(self, shape, dtype, value: int):
        return torch.ones(shape, dtype=dtype, device=self.device) * value

    def recv(self, tensor):
        return (tensor[0].item(), tensor.shape, tensor.dtype)

register_accelerator_context('npu', _HcclGroup)

actor_cls = TorchTensorWorker.options(num_cpus=0, resources={'NPU': 1})
sender = actor_cls.remote()
receiver = actor_cls.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp.shape, inp.dtype, inp[0])
    dag = dag.with_tensor_transport(transport='nccl')
    dag = receiver.recv.bind(dag)

shape = (10,)
dtype = torch.float16
compiled_dag = dag.experimental_compile()
for i in range(3):
    ref = compiled_dag.execute(i, shape=shape, dtype=dtype)
    assert ray.get(ref) == (i, shape, dtype)
print("Success")
```

This PR is the main part of Task 2 in ray-project#51574.
It would be better to make the function names more general, such as changing requires_nccl to require_communicator. This is implemented in ray-project#51061.

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: noemotiovon <757486878@qq.com>
This commit fixes two issues:
1. Fixed the issue where with_tensor_transport would automatically select the device based on the environment, even when CUDA was explicitly specified.
2. Fixed an NCCL "invalid memory access" error that occurred when Ray Serve was used with vLLM PP > 1.

Signed-off-by: hipudding <huafengchun@gmail.com>
…ort (ray-project#51032)" (ray-project#53263)

This reverts commit 2c7f6d4.

`test_torch_tensor_transport_gpu` is [failing on postmerge](https://buildkite.com/ray-project/postmerge/builds/10332#0196fbb9-7c80-4513-96f7-0250e53fd671/177-959). It appears this test does not run on premerge.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com>
This commit is a follow-up to ray-project#51032, which introduced multi-device support in the Compiled Graph by leveraging CUDA's NCCL backend for efficient out-of-band tensor communication. While the current implementation is NCCL-specific, the Compiled Graph runtime is now ready to support a broader range of device types and collective communication libraries.

To prepare for this generalization, this commit introduces the following changes:
1. Refactored NCCL-specific naming and interfaces
2. Established a pluggable communication backend interface

This refactor does not change the behavior of existing NCCL-based Compiled Graph execution, but lays the foundation for enabling collective communication across diverse hardware accelerators and runtime environments.

Signed-off-by: noemotiovon <757486878@qq.com>
## Why are these changes needed?

### Background

This PR is a follow-up to [#51032](#51032), which introduced multi-device support in the Compiled Graph by leveraging CUDA's NCCL backend for efficient out-of-band tensor communication. While the current implementation is tightly coupled with NCCL and CUDA, the Compiled Graph runtime is now ready to support a broader spectrum of device types and collective communication backends (e.g., HCCL, RCCL).

### What This PR Does

To enable extensibility and backend-agnostic design, this PR introduces the following core changes:

- Refactored NCCL-specific naming and APIs: NCCL-related modules, classes, and function names have been generalized to eliminate hardcoded CUDA/NCCL assumptions.
- Introduced a pluggable communication backend interface: a unified abstraction layer is added to decouple collective communication logic from any specific implementation. This makes it easier to support alternative collective libraries and device types in the future.

This refactor does not alter the existing behavior of NCCL-based Compiled Graph execution. All current workflows using CUDA+NCCL continue to function as before.

## Related issue number

#51574

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Why are these changes needed?
This PR improves multi-device support in Compiled Graph, which significantly reduces tensor transmission latency by utilizing out-of-band communication. Currently, this feature only supports CUDA's NCCL. Since Ray already supports multiple accelerators, it is necessary to extend Compiled Graph to support multi-device as well.
This PR mainly introduces two key changes:
1. Removed dependency on cupy.cuda.ExternalStream – since this library only supports CUDA devices, we replaced it with a more general stream context manager to accommodate various accelerators. The new implementation uses torch.{device}.StreamContext.
2. Replaced hardcoded torch.cuda.xxx calls with AcceleratorRuntime – this allows automatic detection of the accelerator type and invokes the appropriate device-specific functions (a minimal sketch of this idea follows the list).
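A minimal sketch of the detection idea in item 2 (helper names here are assumptions, not the PR's actual AcceleratorRuntime API): resolve the device-specific torch module once, instead of hardcoding torch.cuda calls everywhere.

```python
import torch

def detect_torch_device_module():
    # Hypothetical helper illustrating the idea; not the PR's real interface.
    if torch.cuda.is_available():
        return "cuda", torch.cuda
    if hasattr(torch, "npu") and torch.npu.is_available():  # patched in by torch_npu
        return "npu", torch.npu
    return "cpu", None

device_type, device_mod = detect_torch_device_module()
if device_mod is not None:
    device_mod.set_device(0)         # torch.cuda.set_device / torch.npu.set_device
    stream = device_mod.Stream()     # backend-specific stream type
    with device_mod.stream(stream):  # backend-specific StreamContext wrapper
        pass                         # device work would be enqueued here
print(f"Using device backend: {device_type}")
```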
How to add a new backend for Compiled Graph? Here's an example for Ascend NPU:
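The example below is reproduced from the PR's commit message; it assumes a node with Ascend NPUs and the torch_npu package installed. Note that it keeps transport='nccl' and relies on the registered accelerator context to supply the _HcclGroup communicator.

```python
import ray
import torch
import torch_npu
from ray.dag import InputNode
# a custom Communicator implementation for HCCL
from ray.experimental.channel.hccl_group import _HcclGroup
from ray.experimental.channel.accelerator_context import register_accelerator_context

@ray.remote
class TorchTensorWorker:
    def __init__(self):
        self.device = torch.device('npu:0')
        torch.npu.set_device(self.device)

    def send(self, shape, dtype, value: int):
        return torch.ones(shape, dtype=dtype, device=self.device) * value

    def recv(self, tensor):
        return (tensor[0].item(), tensor.shape, tensor.dtype)

# globally register the accelerator context
register_accelerator_context('npu', _HcclGroup)

actor_cls = TorchTensorWorker.options(num_cpus=0, resources={'NPU': 1})
sender = actor_cls.remote()
receiver = actor_cls.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp.shape, inp.dtype, inp[0])
    dag = dag.with_tensor_transport(transport='nccl')
    dag = receiver.recv.bind(dag)

shape = (10,)
dtype = torch.float16
compiled_dag = dag.experimental_compile()
for i in range(3):
    ref = compiled_dag.execute(i, shape=shape, dtype=dtype)
    assert ray.get(ref) == (i, shape, dtype)
print("Success")
```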
Related issue number
This PR is the main part of Task 2 in #51574.
It would be better to make the function names more general, such as changing requires_nccl to require_communicator. This is implemented in #51061.
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.