[Autoscaler][Placement Group] Skip placed bundle when requesting resource #48924
Conversation
add a test
shapes = [dict(bundle.unit_resources) for bundle in placement_group.bundles]
# Skip **placed** bundle (which has node id associated with it).
for bundle in placement_group.bundles:
    if bundle.node_id:
Is it an empty string or `None`? If it is `None`, use `is` instead.
- if bundle.node_id:
+ if bundle.node_id is not None:
Fixed in 7a44207; it should be an empty byte string.
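For reference, a minimal sketch of why the truthiness check suffices; I'm assuming the `Bundle` protobuf is importable from `ray.core.generated.common_pb2` and that `node_id` is a `bytes` field, which defaults to `b""` when unset:

```python
# Sketch only; illustrates the empty-byte-string default, not the PR's code.
from ray.core.generated.common_pb2 import Bundle  # assumed import path

unplaced = Bundle(unit_resources={"GPU": 2})                      # node_id unset -> b""
placed = Bundle(unit_resources={"GPU": 2}, node_id=b"\xab" * 28)  # placed on a node

assert not unplaced.node_id  # falsy, so it stays in the demand shapes
assert placed.node_id        # truthy, so `if bundle.node_id:` skips it
```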
Force-pushed from 5818b44 to 6353847.
break
# TODO(mimi): kill_raylet won't trigger reschedule in autoscaler v1
# kill_raylet(node["NodeManagerAddress"], node["NodeManagerPort"])
I found that when using `kill_raylet`, rescheduling won't be triggered in autoscaler v1, even when the cluster status shows the node is killed. In this case, v1 fails and v2 passes. Both v1 and v2 pass when using `kill_node`.
from ray.autoscaler.v2.sdk import get_cluster_status
def verify_nodes(active=3, idle=1):
- def verify_nodes(active=3, idle=1):
+ def verify_nodes(active, idle):
def kill_node(node_id):
    # kill -9
    import subprocess
Move the `import` to top level. Typically, Ray uses deferred imports only to avoid circular dependencies.
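To illustrate the suggestion only (the helper body below is a placeholder, not the PR's actual implementation), the import would move to module scope:

```python
import subprocess  # hoisted to the top level; deferred imports are reserved for circular deps


def _raylet_pid_for(node_id):
    # Hypothetical lookup; the real test resolves the target process differently.
    raise NotImplementedError(node_id)


def kill_node(node_id):
    """Simulate abrupt node loss by SIGKILL-ing the node's raylet (kill -9)."""
    subprocess.run(["kill", "-9", str(_raylet_pid_for(node_id))], check=True)
```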
wait_for_condition(lambda: verify_nodes(3, 1))
# Kill a node
def kill_raylet(ip, port, graceful=True):
Remove this function because it is not used for now.
# Wait for the node to be removed
wait_for_condition(lambda: verify_nodes(2, 1), 20)
# Check that the placement group is rescheduled
Where is the logic to check that the placement group is rescheduled?
`wait_for_condition(lambda: verify_nodes(3, 1))` checks the autoscaler rescheduling. However, this comment was redundant, so I've removed it.
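For context, the check looks roughly like this; I'm assuming `get_cluster_status` takes the GCS address and returns a status object with `active_nodes` and `idle_nodes` lists (field names inferred from the `verify_nodes(active, idle)` signature, not verified against the exact API):

```python
# Sketch, not the PR's test verbatim; assumes a running Ray cluster.
import ray
from ray._private.test_utils import wait_for_condition
from ray.autoscaler.v2.sdk import get_cluster_status


def verify_nodes(active, idle):
    status = get_cluster_status(ray.get_runtime_context().gcs_address)
    return len(status.active_nodes) == active and len(status.idle_nodes) == idle


# After one node is killed, the autoscaler should restore 3 active + 1 idle
# nodes by launching a single replacement node for the displaced bundle.
wait_for_condition(lambda: verify_nodes(3, 1))
```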
ray.get(pg.ready())
from ray.autoscaler.v2.sdk import get_cluster_status
Do we need to import this? It seems to have already been imported at the top level.
The above suggestions are fixed in 415dcf8.
CI fails. Can you fix the CI errors?
Signed-off-by: Mimi Liao <mimiliao2000@gmail.com>
Force-pushed from 415dcf8 to d9f0cd9.
@@ -986,7 +986,13 @@ def placement_groups_to_resource_demands(
    resource_demand_vector = []
    unconverted = []
    for placement_group in pending_placement_groups:
        shapes = [dict(bundle.unit_resources) for bundle in placement_group.bundles]
        # Skip **placed** bundle (which has node id associated with it).
Is this behavior correct with `STRICT_PACK`? If already placed bundles are removed, will the new bundles be placed on different nodes?
Yes, the behavior is still correct with `STRICT_PACK`, because we only calculate shapes here, and these shapes do not directly instruct scheduling. The scheduler will not schedule a `STRICT_PACK` group across different nodes.
Coming back to the duty of this `placement_groups_to_resource_demands` function: it is correct to present the remaining demand of a `STRICT_PACK` group to the later `get_bin_pack_residual`. Say we have 3 bundles in a `STRICT_PACK` group but 2 of the bundles have a node_id on them; we still need to present the remaining bundle to `get_bin_pack_residual`, which should then consume it.
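In code, the shape computation described above amounts to something like the following sketch (a paraphrase of the diff, not the exact merged code; only the protobuf import path is assumed):

```python
from typing import Dict, List

from ray.core.generated.gcs_pb2 import PlacementGroupTableData


def unplaced_bundle_shapes(pg: PlacementGroupTableData) -> List[Dict[str, float]]:
    """Resource shapes for bundles that are not yet placed on any node."""
    return [
        dict(bundle.unit_resources)
        for bundle in pg.bundles
        if not bundle.node_id  # placed bundles carry a non-empty node_id (bytes)
    ]
```

Only these remaining shapes are handed to `get_bin_pack_residual`, so a `STRICT_PACK` group with two placed bundles contributes just its one unplaced bundle to the demand.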
cc @rueian, would you mind taking this PR for another pass?
rueian commented Apr 1, 2025 (edited):
The changes still look good to me, but in my personal experience, we will be asked for a unit test in
Signed-off-by: Rueian <rueiancsie@gmail.com>
A new unit test is added to
# Only provision nodes for unplaced bundles;
# avoid rescheduling the whole placement group.
wait_for_condition(lambda: verify_nodes(3, 1))
Do we need to verify whether the new node is `R1`?
Verified that in the new commit.
# fully idle.
nodes = provider.non_terminated_nodes({})
resource_demands = [{"GPU": 1}] * 4
Remove this. It implies that 4 GPUs on the p2.8xlarge are occupied by these resource demands, which isn't easy to understand at first glance.
In addition, we also need to check whether the 4 GPUs on the p2.8xlarge are actually occupied. If they aren't, and the bundles require 8 GPUs in total, the test may still pass even though the underlying behavior is incorrect.
- Remove `resource_demands`.
- Increase each bundle from 2 GPUs to 4 GPUs.
provider.create_node({}, {TAG_RAY_USER_NODE_TYPE: "p2.8xlarge"}, 1)
# At this point our cluster has 1 p2.8xlarge instances (8 GPUs) and is
# fully idle.
nodes = provider.non_terminated_nodes({})
Does this simulate the case that would happen at runtime (`node_1` doesn't exist in `nodes`)?
I replaced `node_1` with the existing node in the new commit.
Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian, can you open an issue in the KubeRay repo to track the progress of adding KubeRay e2e tests for this PR?
PlacementGroupTableData(
    state=PlacementGroupTableData.PENDING,
    strategy=PlacementStrategy.PACK,
    bundles=[
I am also not sure whether this can happen at runtime.
Imagine that the placement group was originally spread across 2 nodes (that is possible with the best-effort
PACK strategy) but later the second node disappeared. Now, we have 1 node left alive, and if it has enough resources available this time for the bundle that was originally on the disappeared node, then we should not launch a new node.
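A sketch of that scenario expressed as unit-test input (field values are illustrative; the import paths and the `RESCHEDULING` state follow the existing autoscaler tests to the best of my knowledge):

```python
from ray.core.generated.common_pb2 import Bundle, PlacementStrategy
from ray.core.generated.gcs_pb2 import PlacementGroupTableData

# A PACK group originally spread over two nodes; the second node has died.
# The first bundle keeps its node_id, so only the second bundle should show up
# as demand, and the surviving node may absorb it without launching a new node.
pending_pg = PlacementGroupTableData(
    state=PlacementGroupTableData.RESCHEDULING,
    strategy=PlacementStrategy.PACK,
    bundles=[
        Bundle(unit_resources={"GPU": 4}, node_id=b"\xaa" * 28),  # still placed
        Bundle(unit_resources={"GPU": 4}),                        # node_id unset -> demanded
    ],
)
```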
@rueian please ping me when all CI tests pass. Thanks!
Hi @kevin85421, all CI tests have passed.
Sure.
Merged commit ab03e3b into ray-project:master.
…urce (ray-project#48924) Signed-off-by: Mimi Liao <mimiliao2000@gmail.com> Signed-off-by: zhaoch23 <c233zhao@uwaterloo.ca>
…urce (ray-project#48924) Signed-off-by: Mimi Liao <mimiliao2000@gmail.com> Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…urce (ray-project#48924) Signed-off-by: Mimi Liao <mimiliao2000@gmail.com>
…urce (ray-project#48924) Signed-off-by: Mimi Liao <mimiliao2000@gmail.com> Signed-off-by: Vicky Tsang <vtsang@amd.com>
…urce (ray-project#48924) Signed-off-by: Mimi Liao <mimiliao2000@gmail.com> Signed-off-by: Scott Lee <scott.lee@rebellions.ai>
Why are these changes needed?
Before this PR, when a node in a placement group (PG) went down, the autoscaler attempted to reschedule the entire PG (all bundles), which leads to overprovisioning. Details: #40212
This PR solves this by skipping already placed bundles (i.e., bundles with an associated node_id) when demanding resources in the autoscaler.
Before: every bundle gets rescheduled.
After: only one node is scaled up.
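For illustration, the user-visible scenario looks roughly like this (a sketch, not the PR's test; it assumes an autoscaling cluster is already running):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")  # connect to an existing autoscaling cluster (assumption)

# Spread three 1-CPU bundles across the cluster.
pg = placement_group([{"CPU": 1}] * 3, strategy="SPREAD")
ray.get(pg.ready())

# If a node hosting one bundle dies, the PG goes into rescheduling.
# Before this PR: the autoscaler saw demand for all 3 bundles and could
# over-provision several nodes. After this PR: only the bundle without a
# node_id is reported as demand, so at most one replacement node is launched.
```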
Related issue number
Closes #40212
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.