Dag bind order execution fix #48603

Closed

cadedillon wants to merge 5 commits into ray-project:master from cadedillon:dag_bind_order_execution_fix

cadedillon commented Nov 6, 2024

Why are these changes needed?

Tasks bound to a non-compiled DAG can currently execute in any order, which is inconsistent with the behavior of ray core and compiled DAGs. This can make debugging harder because the same DAG may behave differently after compilation.

This change enforces an execution order of tasks based on the order in which they were bound to the DAG. In apply_recursive we add a conditional that checks whether any child of the current node has a bind index. If any child has one, _bound_args is sorted by bind index. Mixed child node types are handled by sorting nodes without a bind index after those that have one.

    if any(
        hasattr(child, "get_other_args_to_resolve")
        and "bind_index" in child.get_other_args_to_resolve()
        for child in self._bound_args
    ):
        self._bound_args = sorted(
            self._bound_args,
            key=lambda child: child.get_other_args_to_resolve().get(
                "bind_index", float("inf")
            )
            if hasattr(child, "get_other_args_to_resolve")
            else float("inf"),
        )

Related issue number

Closes #47159

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

cadedillon added 3 commits

November 6, 2024 09:11

Implemented sorting of output node children by to maintain consistent…

1af2464

… execution order. Added unit tests for fix. See issue ray-project#47159
Signed-off-by: Cade Dillon <42156426+cadedillon@users.noreply.github.com>

Ran linter on changed files.

9a94073

Signed-off-by: Cade Dillon <42156426+cadedillon@users.noreply.github.com>

Merge remote-tracking branch 'upstream/master' into dag_bind_order_ex…

887ffaf

…ecution_fix

AndyUB reviewed

Nov 7, 2024

Contributor

AndyUB left a comment

Couldn't think of a way that doesn't require substantial code change. I guess you can try:

1. Modifying DAGNode.execute to dispatch actor tasks in bind index order.
2. Changing ClassMethodNode._execute_impl to respect bind index order when invoking the actor tasks.

1 seems easier to me. 2 might not be a good idea.

python/ray/dag/dag_node.py Outdated

		"bind_index", float("inf")
		)
		if hasattr(child, "get_other_args_to_resolve")
		else float("inf"),

Contributor

AndyUB Nov 7, 2024

This would reorder all bound args. I think the order of bound args should not be changed. For example,

    def test_keep_original_order(shared_ray_instance):
        @ray.remote
        class Actor:
            def foo(self, input_data):
                return input_data * 2

            def bar(self, input_data):
                return input_data / 2

            def baz(self, *inputs):
                print(inputs)
                return sum(inputs)

        a = Actor.remote()
        with InputNode() as inp:
            x = a.foo.bind(inp)
            y = a.bar.bind(inp)
            z = a.baz.bind(1, 2, 3, x, y, inp)
            dag = MultiOutputNode([y, x, z])

        ray.get(dag.execute(4))

z should print 1, 2, 3, x, y, inp. But the args are reordered as: x, y, 1, 2, 3, inp.
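
The reordering can be reproduced without Ray. The sketch below is only a stand-in: FakeNode is a hypothetical class (not part of Ray) that mimics nothing but get_other_args_to_resolve(), and the bind indices 0 and 1 for x and y are assumed for illustration. It applies the same sort key as the snippet in the PR description to the arguments of z.

```python
class FakeNode:
    """Hypothetical stand-in for a DAG node; not part of Ray."""

    def __init__(self, name, bind_index=None):
        self.name = name
        self._other_args = {} if bind_index is None else {"bind_index": bind_index}

    def get_other_args_to_resolve(self):
        return self._other_args

    def __repr__(self):
        return self.name


def sort_key(child):
    # Same key as the sort in the PR description:
    # anything without a bind index sorts last.
    if hasattr(child, "get_other_args_to_resolve"):
        return child.get_other_args_to_resolve().get("bind_index", float("inf"))
    return float("inf")


inp = FakeNode("inp")            # input node, no bind index
x = FakeNode("x", bind_index=0)  # assumed bind index
y = FakeNode("y", bind_index=1)  # assumed bind index

bound_args = (1, 2, 3, x, y, inp)
reordered = sorted(bound_args, key=sort_key)
print(reordered)  # [x, y, 1, 2, 3, inp]
```

Because Python's sort is stable, the arguments without a bind index keep their relative order but are all pushed behind x and y, which changes the positional arguments that baz receives.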

python/ray/dag/tests/test_output_node.py Outdated


		dag = MultiOutputNode([y, x])

		assert ray.get(dag.execute(4)) == [8, 2]

Contributor

AndyUB Nov 7, 2024

x would get 8 and y would get 2. I would expect executing this DAG to give me [2, 8] as the result, in the same order as specified in MultiOutputNode.

Author

cadedillon Nov 20, 2024 (edited)

Hi Andy, I had a follow-up question about this. I've reworked my solution and gotten the functions to execute in bind order without reordering the arguments to a node. However, I've run up against a design decision about the order in which results should be returned to the user.

    with InputNode() as inp:
        x = a.foo.bind(inp)
        y = a.bar.bind(inp)
        z = a.baz.bind(1, 2, 3, y, inp, x)
        dag = MultiOutputNode([y, x, z])

    # Execute the DAG with a sample input and observe the order
    print("Starting DAG execution to test task order...")
    result = ray.get(dag.execute(4))
    print("DAG execution result:", result)

In my first attempt, when this executes, result is [8, 2, 20] because foo returns 8, bar returns 2, and baz returns 20. Is this fundamentally the wrong approach? Should we always return these values in the position in which they were passed in? That makes the most intuitive sense to me from a user perspective, but if that's the case then I'm not sure how to write unit tests for the ordering of functions that execute remotely. Thanks for your help!

Contributor

stephanie-wang Nov 20, 2024

Yes, we should keep the same order that the values are passed in.

For unit tests, you can check that the order is as expected by modifying state internal to the actor, like keeping a list of tasks that were called.
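
One way to read this suggestion is sketched below without a Ray cluster. A plain Python class stands in for the actor and appends each method name to a list as it runs. The names RecordingActor, call_log, and get_call_log are assumptions for illustration, not code from the PR; in a real test the class would presumably be decorated with @ray.remote and the methods invoked through the DAG.

```python
class RecordingActor:
    """Plain-Python stand-in for an actor; hypothetical, not from the PR."""

    def __init__(self):
        self.call_log = []  # names of tasks in the order they actually ran

    def foo(self, input_data):
        self.call_log.append("foo")
        return input_data * 2

    def bar(self, input_data):
        self.call_log.append("bar")
        return input_data / 2

    def get_call_log(self):
        return list(self.call_log)


a = RecordingActor()

# Simulate bind order: foo was bound first, then bar.
x = a.foo(4)
y = a.bar(4)

# Results are returned in the order requested for the output ([y, x]),
# while the log records the order in which the tasks ran.
result = [y, x]
assert result == [2.0, 8]
assert a.get_call_log() == ["foo", "bar"]
```

This keeps the two concerns separate: the returned list checks output position, and the actor-side log checks execution order.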

stephanie-wang self-assigned this

Nov 12, 2024

jcotant1 added the core (Issues that should be addressed in Ray Core) label

Nov 15, 2024

cadedillon added 2 commits

November 22, 2024 12:32

Updated fix to identify and topologically sort control dependencies b…

203fb58

…etween nodes to avoid reordering bound args.
Signed-off-by: Cade Dillon <42156426+cadedillon@users.noreply.github.com>

Merge remote-tracking branch 'upstream' into dag_bind_order_execution…

d402613

…_fix

Author

cadedillon commented Nov 22, 2024

Hi @stephanie-wang and @AndyUB, I have pushed an update to the fix that alters the flow of execution according to a topological sort rather than reordering the arguments that are bound to a node. Hopefully this update behaves as expected. Thank you for your help with this!

stephanie-wang reviewed

Nov 28, 2024

python/ray/dag/dag_node.py

		dep_graph.add_dependency(other_node, node)

		# Topologically sort the control dependencies for this layer of the dag
		ctrl_dependencies = dep_graph.topological_sort()

Contributor

stephanie-wang Nov 28, 2024

I believe the nodes here are only for the current DAG node's arguments, not the entire DAG? In that case, I'm not sure that topological sort of just the current args will necessarily work.

    with InputNode() as inp:
        x1 = a.foo.bind(inp)
        x2 = a.bar.bind(inp)
        y1 = b.baz.bind(x1)
        y2 = c.baz.bind(x2)
        dag = MultiOutputNode([y2, y1])

In this case, y2 and y1 execute on different actors and they have the same bind index, so we may visit y2 then y1. Then since it does DFS, it should execute x2, y2, x1, y1. But we want to make sure to execute x1 before x2.
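
The traversal order described here can be illustrated with a small simulation. This is only a sketch: the dictionary-based graph and the dfs_execute helper are hypothetical and do not reflect Ray's actual traversal code. They model nothing more than a depth-first visit that runs a node's upstream arguments before the node itself.

```python
# Each node maps to the upstream nodes it takes as arguments.
# "inp" is omitted because it is not an actor task.
upstream = {
    "x1": [],
    "x2": [],
    "y1": ["x1"],
    "y2": ["x2"],
    "out": ["y2", "y1"],  # MultiOutputNode([y2, y1])
}


def dfs_execute(node, executed):
    """Run all upstream nodes first, then the node itself."""
    for dep in upstream[node]:
        if dep not in executed:
            dfs_execute(dep, executed)
    executed.append(node)


order = []
dfs_execute("out", order)
tasks = [n for n in order if n != "out"]
print(tasks)  # ['x2', 'y2', 'x1', 'y1']
```

x1 and x2 run on the same actor and x1 was bound first, yet the depth-first visit reaches x2 first because y2 is listed before y1 in the output node.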

Topological sort can work, but I believe you would need to modify how the recursion is done so that it does not do DFS. The simpler way is probably to add a control dependency during DAG declaration instead of at execution time, by adding the previous DAG node submitted to the same actor to the nodes found by scanner.find_nodes.
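
A rough sketch of the declaration-time idea follows, under stated assumptions. The bind helper, the last_bound table, and the string node names are all hypothetical; they stand in for recording, at bind time, the previous node bound to the same actor as an extra upstream dependency. This is not Ray's implementation and does not use scanner.find_nodes itself.

```python
last_bound = {}  # actor name -> most recently bound node on that actor
upstream = {}    # node -> data dependencies plus control dependencies


def bind(node, actor, args):
    """Hypothetical bind step: record the data dependencies and add a
    control dependency on the previous node bound to the same actor."""
    deps = list(args)
    prev = last_bound.get(actor)
    if prev is not None and prev not in deps:
        deps.insert(0, prev)  # control dependency is visited first
    upstream[node] = deps
    last_bound[actor] = node


bind("x1", "a", [])
bind("x2", "a", [])      # gains a control dependency on x1
bind("y1", "b", ["x1"])
bind("y2", "c", ["x2"])
upstream["out"] = ["y2", "y1"]  # MultiOutputNode([y2, y1])


def dfs_execute(node, executed):
    """Depth-first visit: run upstream nodes, then the node itself."""
    for dep in upstream[node]:
        if dep not in executed:
            dfs_execute(dep, executed)
    executed.append(node)


order = []
dfs_execute("out", order)
tasks = [n for n in order if n != "out"]
print(tasks)  # ['x1', 'x2', 'y2', 'y1']
```

With the control dependency recorded when the node is declared, an unchanged depth-first traversal already visits x1 before x2, so no separate sorting pass is needed at execution time in this toy model.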

stalebot commented Jan 31, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stalebot added the stale (The issue is stale. It will be closed within 7 days unless there are further conversation) label

Jan 31, 2025

hainesmichaelc added the community-contribution (Contributed by the community) label

Apr 4, 2025

stalebot removed the stale (The issue is stale. It will be closed within 7 days unless there are further conversation) label

Apr 4, 2025

Collaborator

jjyao commented Apr 29, 2025

@stephanie-wang @cadedillon do you want to continue working on this PR?

hainesmichaelc added community-backlog and removed community-backlog labels

May 22, 2025

github-actions bot commented Jun 6, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.