[Autoscaler][Placement Group] Skip placed bundle when requesting resource #48924


Merged
jjyao merged 12 commits into ray-project:master from mimiliaogo:pg-overprovision on May 14, 2025

Conversation

@mimiliaogo (Contributor) commented Nov 25, 2024 (edited)

Why are these changes needed?

Before this PR, when a node in a placement group (PG) goes down, the autoscaler attempts to reschedule the entire PG (all bundles), which leads to overprovisioning. Details: #40212

This PR solves this by skipping already placed bundles (i.e., bundles with an associated node_id) when demanding resources in the autoscaler.
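
Conceptually, the change filters out bundles that already carry a node_id before their resource shapes are added to the autoscaler's demand vector. The following is a minimal sketch of the idea only, not the exact diff; the helper name is hypothetical, and the empty-byte-string check reflects the review discussion below:

def unplaced_bundle_shapes(placement_group):
    # Hypothetical helper illustrating the idea: only bundles that have not
    # yet been placed (node_id is still an empty byte string) contribute
    # resource demands for scaling.
    shapes = []
    for bundle in placement_group.bundles:
        if bundle.node_id != b"":
            # Already placed on a live node: no new capacity is needed for it.
            continue
        shapes.append(dict(bundle.unit_resources))
    return shapes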

Before: Every bundle gets rescheduled


After: Only one node will be scaled up


Related issue number

Closes #40212

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jcotant1 added the core (Issues that should be addressed in Ray Core) label on Nov 25, 2024
@kevin85421 self-assigned this on Nov 26, 2024
@kevin85421 (Member) left a comment:

add a test

shapes = [dict(bundle.unit_resources) for bundle in placement_group.bundles]
# Skip **placed** bundle (which has node id associated with it).
for bundle in placement_group.bundles:
    if bundle.node_id:
Member:

Is it an empty string or None? If it is None, use is instead.

Suggested change:
if bundle.node_id:
if bundle.node_id is not None:

mimiliaogo (Contributor, Author):

Fixed in 7a44207; it should be an empty byte string.

break

# TODO(mimi): kill_raylet won't trigger reschedule in autoscaler v1
# kill_raylet(node["NodeManagerAddress"], node["NodeManagerPort"])
mimiliaogo (Contributor, Author):

I found that when using kill_raylet, rescheduling isn't triggered in autoscaler v1, even when the cluster status shows the node is killed. In this case, v1 fails and v2 passes. Both v1 and v2 pass when using kill_node.


from ray.autoscaler.v2.sdk import get_cluster_status

def verify_nodes(active=3, idle=1):
Member:

Suggested change:
def verify_nodes(active=3, idle=1):
def verify_nodes(active, idle):
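
For reference, a minimal sketch of what such a helper could look like, assuming get_cluster_status() can be called without arguments here and that the returned ClusterStatus exposes active_nodes and idle_nodes lists (these names are assumptions, not taken verbatim from this PR):

from ray._private.test_utils import wait_for_condition
from ray.autoscaler.v2.sdk import get_cluster_status


def verify_nodes(active, idle):
    # Compare the autoscaler's current view of the cluster with the
    # expected number of active and idle nodes.
    status = get_cluster_status()
    assert len(status.active_nodes) == active
    assert len(status.idle_nodes) == idle
    return True


# Block until the cluster reaches the expected shape, e.g. 3 active + 1 idle.
wait_for_condition(lambda: verify_nodes(3, 1))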


def kill_node(node_id):
    # kill -9
    import subprocess
Member:

Move the import to top level. Typically, Ray uses deferred imports only to avoid circular dependencies.

wait_for_condition(lambda: verify_nodes(3, 1))

# Kill a node
def kill_raylet(ip, port, graceful=True):
Member:

Remove this function because it is not used for now.

# Wait for the node to be removed
wait_for_condition(lambda: verify_nodes(2, 1), 20)

# Check that the placement group is rescheduled
Member:

Where is the logic to check that the placement group is rescheduled?

mimiliaogo (Contributor, Author):

wait_for_condition(lambda: verify_nodes(3, 1)) checks the autoscaler rescheduling. However, this comment was redundant, so I've removed it.


ray.get(pg.ready())

from ray.autoscaler.v2.sdk import get_cluster_status
Member:

Do we need to import this? It seems to have already been imported at the top level.

mimiliaogo (Contributor, Author):

The above suggestions are fixed in 415dcf8.

@kevin85421 added the go (add ONLY when ready to merge, run all tests) label on Dec 17, 2024
@kevin85421 (Member):

CI fails. Can you fix the CI errors?

Signed-off-by: Mimi Liao <mimiliao2000@gmail.com> (5 commits)
@@ -986,7 +986,13 @@ def placement_groups_to_resource_demands(
    resource_demand_vector = []
    unconverted = []
    for placement_group in pending_placement_groups:
        shapes = [dict(bundle.unit_resources) for bundle in placement_group.bundles]
        # Skip **placed** bundle (which has node id associated with it).

Is this behavior correct with STRICT_PACK? If already placed bundles are removed, will the new bundles be placed on different nodes?

Contributor:

Yes, the behavior is still correct with STRICT_PACK because we only calculate shapes here, and the shapes do not directly instruct scheduling. The scheduler will not schedule a STRICT_PACK group across different nodes.

Back to the duty of this placement_groups_to_resource_demands function: it is correct to present the remaining demand of a STRICT_PACK group to the later get_bin_pack_residual. Say we have 3 bundles in a STRICT_PACK group but 2 of them already have a node_id; we still need to present the remaining 1 bundle to get_bin_pack_residual, which should then consume it.
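
To make that concrete, here is a small self-contained sketch of the shape computation described above; the Bundle stand-in class and the resource numbers are hypothetical and not taken from the PR's test code:

from dataclasses import dataclass


# Stand-in for the GCS Bundle data: node_id stays an empty byte string
# until the bundle has been placed on a node.
@dataclass
class Bundle:
    unit_resources: dict
    node_id: bytes = b""


# A STRICT_PACK group with 3 bundles, 2 of which are already placed.
bundles = [
    Bundle({"GPU": 2}, node_id=b"node-1"),
    Bundle({"GPU": 2}, node_id=b"node-1"),
    Bundle({"GPU": 2}),  # not yet placed
]

# Only unplaced bundles contribute demand shapes; placed ones are skipped.
shapes = [dict(b.unit_resources) for b in bundles if b.node_id == b""]
assert shapes == [{"GPU": 2}]  # one remaining bundle is fed to bin packing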

@kevin85421 (Member):

cc @rueian, would you mind taking this PR for another pass?

@rueian (Contributor) commented Apr 1, 2025 (edited):

The changes still look good to me, but in my personal experience, we will be asked for a unit test in test_resource_demand_scheduler.py. cc @mimiliaogo.

@hainesmichaelc added the community-contribution (Contributed by the community) label on Apr 4, 2025
@rueian (Contributor):

A new unit test has been added to test_resource_demand_scheduler.py. I think this PR is now ready to merge. cc @kevin85421 for review again.


# Only provision nodes for unplaced bundles;
# avoid rescheduling the whole placement group.
wait_for_condition(lambda: verify_nodes(3, 1))
Member:

Do we need to verify whether the new node is R1?

@rueian (Contributor) commented May 13, 2025 (edited):

Verified that in the new commit.

# fully idle.
nodes = provider.non_terminated_nodes({})

resource_demands = [{"GPU": 1}] * 4
Member:

Remove this. It implies that 4 GPUs on the p2.8xlarge are occupied by these resource demands, which isn't easy to understand at first glance.

In addition, we also need to check whether the 4 GPUs on the p2.8xlarge are actually occupied. If they aren't, and the bundles require 8 GPUs in total, the test may still pass even though the underlying behavior is incorrect.

  1. Remove resource_demands.
  2. Increase each bundle from 2 GPUs to 4 GPUs.

provider.create_node({}, {TAG_RAY_USER_NODE_TYPE: "p2.8xlarge"}, 1)
# At this point our cluster has 1 p2.8xlarge instance (8 GPUs) and is
# fully idle.
nodes = provider.non_terminated_nodes({})
Member:

Does this simulate the case that would happen at runtime (node_1 doesn't exist in nodes)?

Contributor:

I replaced node_1 with the existing node in the new commit.

Signed-off-by: Rueian <rueiancsie@gmail.com> (2 commits)
@kevin85421 (Member) left a comment:

@rueian can you open an issue in the KubeRay repo to track the progress of adding KubeRay e2e tests for this PR?

PlacementGroupTableData(
    state=PlacementGroupTableData.PENDING,
    strategy=PlacementStrategy.PACK,
    bundles=[
Member:

I am also not sure whether this is possible to happen at runtime.

@rueian (Contributor) commented May 14, 2025 (edited):

Imagine that the placement group was originally spread across 2 nodes (which is possible with the best-effort PACK strategy) but later the second node disappeared. Now we have 1 node left alive, and if it has enough resources available this time for the bundle that was originally on the disappeared node, then we should not launch a new node.
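
As a concrete illustration of that scenario (the numbers are made up for illustration, not taken from the PR's test):

# Surviving node: 8 GPUs total; the 2 bundles still placed on it use 4 GPUs.
surviving_node_free = {"GPU": 8 - 2 * 2}

# The bundle that lived on the disappeared node still needs 2 GPUs.
unplaced_bundle_demand = {"GPU": 2}

# The remaining demand fits into the surviving node's free resources,
# so the autoscaler should not launch a new node for this PACK group.
fits = all(
    surviving_node_free.get(resource, 0) >= amount
    for resource, amount in unplaced_bundle_demand.items()
)
assert fits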

@kevin85421 (Member):

@rueian please ping me when all CI tests pass. Thanks!

@rueian (Contributor):

Hi @kevin85421, all CI tests have passed.

@kevin85421 (Member):

cc @jjyao @edoakes for merge. @rueian would you mind adding this PR to the autoscaler umbrella issue and opening an issue to track the progress of e2e tests in the KubeRay repo?

@rueian (Contributor):

Sure.

@jjyao merged commit ab03e3b into ray-project:master on May 14, 2025
5 checks passed
zhaoch23 pushed a commit to Bye-legumes/ray that referenced this pull request on May 14, 2025
iamjustinhsu pushed a commit to iamjustinhsu/ray that referenced this pull request on May 15, 2025
lk-chen pushed a commit to lk-chen/ray that referenced this pull request on May 17, 2025
vickytsang pushed a commit to ROCm/ray that referenced this pull request on Jun 3, 2025
rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request on Jun 21, 2025
Reviewers

rueian approved these changes
chmeyers left review comments
kevin85421 approved these changes
hongchaodeng: awaiting requested review

Labels

community-backlog, community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)
Development

Successfully merging this pull request may close these issues.

[core][autoscaler][v1] Autoscaler overprovisions nodes when strict placement group is rescheduling
7 participants
@mimiliaogo @kevin85421 @rueian @chmeyers @jjyao @hainesmichaelc @jcotant1
