[DO NOT REVIEW, LONG TERM PR FOR CI] Pinterest main branch 2.9.1 #42672
Open
lee1258561 wants to merge 63 commits into master from pinterest/main-2.9.1
Conversation
…d failures (#41285) (#41609)

Users expect different failure types to be handled differently in step 4 above:
* The current behavior is that the count decrements regardless of the error type. For example, if 3 preemptions happen with `max_failures=3`, the run ends without continuing to recover through preemptions.
* With `max_failures=-1` or some large value, there is an unlimited number of retries, but this can crash-loop on an application error (e.g., a bug in the user code), which can be very expensive.

This PR changes the failure counting of Ray Train/Tune to ignore spot instance preemption failures by default. This behavior is enabled by the new `RayActorError.preempted` flag introduced in #41102, which is set if the underlying cluster setup handles the cloud preemption signals properly and sets the preempting node to the `DRAINING` status.

---------
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
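A minimal sketch of the counting rule described above, assuming a per-failure boolean derived from `RayActorError.preempted`; this is illustrative only, not Ray's internal implementation:

```python
def should_attempt_restart(failures_are_preemptions, max_failures):
    """Decide whether to keep retrying after the latest failure.

    failures_are_preemptions: one bool per failure so far, True if the
    failure carried RayActorError.preempted (spot preemption).
    max_failures: -1 means retry indefinitely, even on application errors.
    """
    # Preemption failures no longer count against the budget by default.
    app_failures = sum(1 for preempted in failures_are_preemptions if not preempted)
    if max_failures < 0:
        return True
    return app_failures <= max_failures


# Three preemptions with max_failures=3 no longer exhaust the budget.
assert should_attempt_restart([True, True, True], max_failures=3)
```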
in various places. First step for release.

Signed-off-by: Archit Kulkarni <archit@anyscale.com>
…(#41666)

- Get the updated stats after execution for read-only plans from the input LazyBlockList, not from the resulting BlockList after fully executing, which does not contain stats for the read operation.
- Also fix a bug in the Union operator when initializing stats.

Signed-off-by: Scott Lee <sjl@anyscale.com>
… used by training (#41603) (#41674)

Fixes a resource contention issue in training ingest workloads. The default ingest resource limits should exclude the resources used by Ray Train.

This bug surfaced after [streaming output backpressure was enabled](#41327) and caused ray-data-resnet50-ingest-file-size-benchmark.aws to fail (#41496). Here is what happens:
* The dataset has 2 stages: `read` and `preprocess`.
* The cluster has 16 CPUs. 2 of them are used by Ray Train: 1 for the trainer actor and 1 for the training worker.
* The Data StreamExecutor incorrectly thinks 16 CPUs are available, while there are actually only 14. It submits 15 read tasks and 1 preprocess task, but the preprocess task is not actually running.
* The streaming output backpressure throttles the output of the read tasks when the output queue is full, and the preprocess task cannot run and consume the output data. This makes the execution hang forever.

---------
Signed-off-by: Hao Chen <chenh1024@gmail.com>
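A rough sketch of the resource accounting described above, using the numbers from the failing benchmark; the function and argument names are illustrative, not Ray's:

```python
def data_cpu_limit(cluster_cpus: int, train_cpus: int) -> int:
    """CPUs the Data stream executor should plan with.

    In the failing benchmark: 16 cluster CPUs, 2 reserved by Ray Train
    (1 for the trainer actor + 1 for the training worker), leaving 14
    for read/preprocess tasks.
    """
    return cluster_cpus - train_cpus


assert data_cpu_limit(cluster_cpus=16, train_cpus=2) == 14
```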
…41634) (#41665)

#41466 enables the Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in the `test_client_data_get` test now runs through the streaming executor, which is known to have many incompatibilities with Ray Client since Ray 2.7. So we skip the test, which checks compatibility between Ray Client and Ray Data, until a future Ray Client implementation can better support Ray Data usage.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Pick of #41698. De-noise release test runs. Purely a test infra change.

Signed-off-by: can <can@anyscale.com>
…(#41637) (#41720)

When there is non-Data code running in the same cluster, the Data StreamExecutor considers all submitted tasks as active, while they may not actually have the resources to run. #41603 is an attempt to fix the data+train workload by excluding training resources.

This PR is a more general fix for other workloads, with two main changes:
1. Besides detecting active tasks, we also detect whether the downstream has made no progress for a specific interval.
2. Introduce a new `reserved_resources` option to allow specifying non-Data resources.

This PR alone can also fix #41496.

---------
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
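A hedged sketch of the stall-detection idea in change 1 above; the threshold, names, and structure are illustrative only and do not reflect the actual StreamExecutor code:

```python
import time

NO_PROGRESS_INTERVAL_S = 10.0  # illustrative threshold, not Ray's real default


def downstream_looks_stalled(last_output_time: float, num_active_tasks: int) -> bool:
    """Consider the pipeline stalled when tasks appear active but no operator
    has produced output for NO_PROGRESS_INTERVAL_S seconds; the executor can
    then stop treating those tasks' resources as actually available."""
    return num_active_tasks > 0 and (
        time.monotonic() - last_output_time > NO_PROGRESS_INTERVAL_S
    )
```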
…41725)

#41118 added an include_paths parameter to ParquetDatasource. As part of that PR, we pass a self._include_paths attribute to Parquet read tasks. As a result, the datasource (self) gets serialized with each read task. Normally this isn't an issue, but if you're working with a large dataset (like in the failing release test), the datasource is slow to serialize.

This PR fixes the issue by removing the reference to self.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
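The underlying pitfall is generic Python: a closure that references `self` drags the whole object along when serialized. A simplified illustration (not the actual ParquetDatasource code), using cloudpickle, which is what Ray uses to ship task closures:

```python
import cloudpickle


class Datasource:
    def __init__(self, include_paths: bool):
        self._include_paths = include_paths
        self._big_metadata = list(range(100_000))  # stand-in for expensive-to-serialize state

    def make_read_task_slow(self):
        # BAD: the closure captures `self`, so serializing the task also
        # serializes _big_metadata.
        return lambda path: (path, self._include_paths)

    def make_read_task_fast(self):
        # GOOD: copy the flag into a local variable; only the bool is captured.
        include_paths = self._include_paths
        return lambda path: (path, include_paths)


ds = Datasource(include_paths=True)
assert len(cloudpickle.dumps(ds.make_read_task_fast())) < len(
    cloudpickle.dumps(ds.make_read_task_slow())
)
```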
… (#41751)

This PR passes a grpc_context to deployments that use it, so they can get gRPC request-related info and set the code, details, trailing metadata, and compression. The original grpc._cython.cygrpc._ServicerContext type is not serializable, so we created a RayServegRPCContext that can be passed to the deployment. Will follow up with a doc change.

Why are these changes needed?
Pick of #41667

---------
Signed-off-by: Gene Su <e870252314@gmail.com>
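A hedged sketch of how a Serve deployment handler might use the context after this change; the parameter name and the injected type follow the description above and are assumptions, not verified against the released 2.9 Serve API:

```python
import grpc
from ray import serve


@serve.deployment
class Greeter:
    def __call__(self, request, grpc_context):
        # `grpc_context` stands in for the RayServegRPCContext described above,
        # which mirrors the ServicerContext setters (code, details, trailing
        # metadata, compression).
        grpc_context.set_code(grpc.StatusCode.OK)
        grpc_context.set_details("handled by Greeter")
        return request
```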
Pick of https://github.com/ray-project/ray/pull/41754/files. Required to unblock CI on the release branch.

Signed-off-by: can <can@anyscale.com>
Updates the replica log filename format from:

deployment_{deployment_name}_{app_name}#{deployment_name}#{replica_id}.log

to:

replica_{app_name}_{deployment_name}_{replica_id}.log

Also adjusts the replica log format to be:

... {app_name}_{deployment_name} {replica_id} ...

instead of:

... {deployment_name} {app_name}#{deployment_name}#{replica_id} {app_name} ...

---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
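Expressed as Python format strings, with placeholder values filled in purely for illustration:

```python
old = "deployment_{deployment_name}_{app_name}#{deployment_name}#{replica_id}.log"
new = "replica_{app_name}_{deployment_name}_{replica_id}.log"

kwargs = dict(app_name="default", deployment_name="Greeter", replica_id="abc123")
print(old.format(**kwargs))  # deployment_Greeter_default#Greeter#abc123.log
print(new.format(**kwargs))  # replica_default_Greeter_abc123.log
```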
#41744) (#41755)

This is a minified version of https://github.com/ray-project/ray/pull/41722/files, specifically put together to be cherry-picked into 2.9. Addresses #41726.

Addresses the following gaps:
- Patches the ActorProxyWrapper.is_drained method to handle the RPC response properly
- Cherry-picks a test from "[Serve] Revisiting ProxyState to fix draining sequence" #41722 to validate that the draining sequence is correct

---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Cherry pick of #40419.

This is a feature we want to get out for 2.9; quite a few OSS users are interested in trying it out (see thread). We want to reach out to users to try it out and gather feedback once it's released in 2.9 (otherwise we will have to wait for 2.10). It's also pretty low risk: the changes are confined to the container part of runtime_env, and most of the PR is actually adding tests.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
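A hedged illustration of the container runtime_env field this feature lives in; the image name is a placeholder, and the exact supported keys and whether the field applies per task, per actor, or per job should be checked against the Ray 2.9 docs:

```python
import ray

# Placeholder image; the keys under "container" are an assumption based on
# the experimental container runtime_env described above.
runtime_env = {"container": {"image": "rayproject/ray:2.9.1"}}


@ray.remote(runtime_env=runtime_env)
def hello():
    return "ran inside the container image"
```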
Currently, IntelGPUAcceleratorManager.get_current_node_num_accelerators() is called every time we try to get the accelerator manager for the GPU resource. This is expensive and makes single_client_tasks_sync slower. This PR changes it to only call it once.

Related issue number: #41695

---------
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
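A generic sketch of the call-once caching pattern, not the actual accelerator-manager code; the probe function here is a stand-in:

```python
import functools


@functools.lru_cache(maxsize=1)
def cached_num_accelerators() -> int:
    """Probe the node's Intel GPU count once per process and reuse the result."""
    return _probe_intel_gpu_count()


def _probe_intel_gpu_count() -> int:
    # Placeholder for the expensive device query the real manager performs.
    return 0
```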
Implements retryable exceptions for actor methods. Also plumbs direct_actor_task_submitter quite a bit.

Behavior:
- Added a new method-level annotation max_retries that overrides the actor's max_task_retries, if any.
- If user_exceptions=True | List[exception class types] and (max_retries or max_task_retries) > 0, we may retry the method by issuing another task invocation to the actor.
- Both exception retries and actor-death retries count toward max_retries. For example, for a max_retries=3 call with 2 actor deaths, 1 exception, and 1 return, we can return the value.
- For a streaming generator call that yielded 4 values and then raised an exception, we retry by calling the method again, ignoring the first 4 values, and start yielding from the 5th value in the second run.
- Java and C++: they still have max_task_retries on actor deaths, but I did not add max_retries or retry_exceptions.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
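A hedged usage sketch of the method-level retry annotation; the keyword spellings below are taken from the description above and may differ from the released 2.9 API, so treat them as assumptions:

```python
import ray


@ray.remote(max_task_retries=1)
class Worker:
    # `max_retries` / `retry_exceptions` spellings are assumptions from the
    # commit message; verify against the released ray.method signature.
    @ray.method(max_retries=3, retry_exceptions=[ValueError])
    def flaky(self, x):
        # Retried on ValueError or actor death, up to 3 attempts total.
        return x * 2
```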
If a deployment is autoscaling and replicas take a long time to start, there is a bug that makes the state transition to (UPDATING, AUTOSCALING) which is a combination that should never occur. Instead, we should just update the message but not the status.---------Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.