[DO NOT REVIEW, LONG TERM PR FOR CI] Pinterest main branch 2.9.1 #42672

Open

lee1258561 wants to merge 63 commits into master from pinterest/main-2.9.1

Conversation

lee1258561

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

shrekris-anyscale and others added 30 commits December 5, 2023 12:54
… from `None` to a number (#41592) (#41617)
This change makes the Serve controller's internal target_capacity_direction become DOWN instead of UP when a Serve config's target_capacity goes from None to a number.
Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
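
A minimal sketch of the described direction change (illustrative only, not the actual Serve controller code; the helper and enum names are assumptions):

```python
from enum import Enum
from typing import Optional


class TargetCapacityDirection(Enum):
    UP = "UP"
    DOWN = "DOWN"


def infer_direction(
    old: Optional[float], new: Optional[float]
) -> Optional[TargetCapacityDirection]:
    # Hypothetical helper mirroring the commit message: going from None
    # (uncapped) to a number now counts as scaling DOWN, since any finite
    # target_capacity caps the previously uncapped state.
    if old is None and new is not None:
        return TargetCapacityDirection.DOWN  # previously UP; changed here
    ...  # other transitions are unchanged and omitted from this sketch
```
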
…d failures (#41285) (#41609)
Users expect different failure types to be handled differently in step 4 above:
* The current behavior is that the count decrements regardless of the error type. For example, if 3 preemptions happen with `max_failures=3`, then the run will end without continuing to recover through preemptions.
* With `max_failures=-1` or some large value, there will be an infinite number of retries, but this could crash-loop on an application error (e.g., a bug in the user code). This can be very expensive.
This PR changes the failure counting of Ray Train/Tune to ignore spot instance preemption failures by default. This behavior is enabled by the new `RayActorError.preempted` flag introduced in #41102, which is set if the underlying cluster setup handles the cloud preemption signals properly and sets the preempting node to the `DRAINING` status.
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
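
A rough sketch of the new default counting rule, relying only on the `RayActorError.preempted` flag mentioned above (`should_count_failure` is a hypothetical helper, not Train/Tune's actual code):

```python
from ray.exceptions import RayActorError


def should_count_failure(error: Exception) -> bool:
    # Spot preemptions no longer consume the max_failures budget by
    # default; the flag is only set when the cluster handles preemption
    # signals and marks the draining node as preempted.
    if isinstance(error, RayActorError) and getattr(error, "preempted", False):
        return False  # retry without decrementing the failure count
    return True  # application errors still count toward max_failures
```
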
Signed-off-by: rickyyx <rickyx@anyscale.com>
in various places. First step for release.
Signed-off-by: Archit Kulkarni <archit@anyscale.com>
…#41666)
- Get the updated stats after execution for read-only plans from the input LazyBlockList, not the resulting BlockList after fully executing, which does not contain stats for the read operation.
- Also fix a bug in the Union operator with initializing stats.
Signed-off-by: Scott Lee <sjl@anyscale.com>
… used by training (#41603) (#41674)
Fixes a resource contention issue in training ingest workloads. The default ingest resource limits should exclude the resources used by Ray Train.
This bug surfaced after streaming output backpressure was enabled (#41327), and caused ray-data-resnet50-ingest-file-size-benchmark.aws to fail (#41496). Here is what happens:
* The dataset has 2 stages: `read` and `preprocess`.
* The cluster has 16 CPUs. 2 of them will be used by Ray Train: 1 for the trainer actor and 1 for the training worker.
* The Data StreamExecutor incorrectly thinks 16 CPUs are available, while there are actually only 14. It submits 15 read tasks and 1 preprocess task, but the preprocess task is not actually running.
* The streaming output backpressure throttles the output of the read tasks when the output queue is full, and the preprocess task cannot run and consume the output data. This makes the execution hang forever.
Signed-off-by: Hao Chen <chenh1024@gmail.com>
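
A sketch of the idea, under the assumption that it is surfaced through Ray Data's `ExecutionOptions.exclude_resources` knob (the exact option name may differ from what shipped):

```python
from ray.data import ExecutionOptions, ExecutionResources
from ray.data.context import DataContext

# Reserve the 2 CPUs used by the trainer actor and training worker so the
# streaming executor plans against 14 CPUs instead of 16 on this cluster.
ctx = DataContext.get_current()
ctx.execution_options = ExecutionOptions(
    exclude_resources=ExecutionResources(cpu=2),
)
```
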
This closes #40959, a flaky test, and it fixes a real bug in the async actor log recording.
Signed-off-by: rickyyx <rickyx@anyscale.com>
Removes test_stats from flaky tests.
Signed-off-by: Andrew Xue <andewzxue@gmail.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
…41634) (#41665)
#41466 enables the Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in the `test_client_data_get` test is now executed through the streaming executor, which is known to have many incompatibilities since Ray 2.7. So we skip the test, which checks compatibility between Ray Client and Ray Data, until we have a future Ray Client implementation that can better support Ray Data usage.
Signed-off-by: Scott Lee <sjl@anyscale.com>
Pick of #41698. De-noise release test runs. Purely a test infra change.
Signed-off-by: can <can@anyscale.com>
Cherry-pick #41688.
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
…#41637) (#41720)
When there is non-Data code running in the same cluster, the Data StreamExecutor will consider all submitted tasks as active, while they may not actually have resources to run. #41603 is an attempt to fix the data+train workload by excluding training resources.
This PR is a more general fix for other workloads, with two main changes:
1. Besides detecting active tasks, we also detect if the downstream is not making any progress for a specific interval.
2. Introduce a new `reserved_resources` option to allow specifying non-Data resources.
This PR alone can also fix #41496.
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
…41725)
#41118 added an include_paths parameter to ParquetDatasource. As part of the PR, we pass a self._include_paths attribute to Parquet read tasks.
As a result, the datasource (self) gets serialized with each read task. Normally, this isn't an issue, but if you're working with a large dataset (like in the failing release test), then the datasource is slow to serialize.
This PR fixes the issue by removing the reference to self.
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
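
The fix pattern is generic Python: capture only the needed attribute in a local variable so task closures don't drag the whole object along. A minimal sketch (class and helper names are illustrative, not the actual ParquetDatasource code):

```python
def read(path: str, include_paths: bool):
    ...  # placeholder for the actual block-reading logic


class ParquetLikeDatasource:
    """Illustrative only; not the real ParquetDatasource."""

    def __init__(self, paths, include_paths: bool):
        self._paths = paths
        self._include_paths = include_paths
        self._expensive_state = object()  # stand-in for slow-to-serialize state

    def read_tasks_buggy(self):
        # Bug: each closure captures `self`, so the whole datasource
        # (including _expensive_state) ships with every read task.
        return [lambda p=p: read(p, self._include_paths) for p in self._paths]

    def read_tasks_fixed(self):
        # Fix: hoist the flag into a local so only a bool is captured.
        include_paths = self._include_paths
        return [lambda p=p: read(p, include_paths) for p in self._paths]
```
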
… (#41751)
This PR passes a grpc_context to deployments, so a deployment that uses it can read gRPC request-related info and set the code, details, trailing metadata, and compression. The original grpc._cython.cygrpc._ServicerContext type is not serializable, so we created a RayServegRPCContext that can be passed to the deployment. Will follow up with a doc change.
Pick of #41667
Signed-off-by: Gene Su <e870252314@gmail.com>
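
A hedged sketch of how a deployment might use the wrapper, assuming the context type is importable from `ray.serve.grpc_util` and exposes the setters named above; the request shape (`request.name`) and reply are illustrative:

```python
import grpc
from ray import serve
from ray.serve.grpc_util import RayServegRPCContext  # assumed import path


@serve.deployment
class Greeter:
    def __call__(self, request, grpc_context: RayServegRPCContext):
        # The serializable context lets the replica set gRPC status that
        # the proxy can propagate back to the client.
        if not request.name:
            grpc_context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            grpc_context.set_details("name is required")
            return None  # placeholder; return your protobuf reply type here
        grpc_context.set_trailing_metadata([("served-by", "greeter")])
        return None  # placeholder; build the real reply from `request`
```
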
Pick of https://github.com/ray-project/ray/pull/41754/files. Required to unblock CI on the release branch.
Signed-off-by: can <can@anyscale.com>
Updates the replica log filename format from:
deployment_{deployment_name}_{app_name}#{deployment_name}#{replica_id}.log
to:
replica_{app_name}_{deployment_name}_{replica_id}.log
Also adjusts the replica log format to be:
... {app_name}_{deployment_name} {replica_id} ...
instead of:
... {deployment_name} {app_name}#{deployment_name}#{replica_id} {app_name} ...
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
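
The new name can be assembled with a trivial helper (an illustrative function, not the actual Serve code):

```python
def replica_log_filename(app_name: str, deployment_name: str, replica_id: str) -> str:
    # New format described above: replica_{app}_{deployment}_{replica}.log
    return f"replica_{app_name}_{deployment_name}_{replica_id}.log"


assert (
    replica_log_filename("app", "Greeter", "abc123")
    == "replica_app_Greeter_abc123.log"
)
```
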
#41744) (#41755)
This is a minified version of https://github.com/ray-project/ray/pull/41722/files, specifically put together to be cherry-picked into 2.9. Addresses #41726.
Addresses the following gaps:
- Patches the ActorProxyWrapper.is_drained method to handle the RPC response properly
- Cherry-picks a test from #41722 ([Serve] Revisiting ProxyState to fix draining sequence) to validate that the draining sequence is correct
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Cherry-pick of #40419.
This is a feature we want to get out for 2.9; quite a few OSS users are interested in trying it out (see thread). We want to reach out to users to try it out and gather feedback once it's released in 2.9 (otherwise we will have to wait for 2.10). It's also pretty low risk: the changes are confined to the container part of runtime_env, and most of the PR is actually adding tests.
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Currently, IntelGPUAcceleratorManager.get_current_node_num_accelerators() is called every time we try to get the accelerator manager for the GPU resource. This is expensive and makes single_client_tasks_sync slower. This PR changes it to only be called once.
Related issue number: #41695
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
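
The fix amounts to memoizing the probe; a minimal sketch with functools (the probe function below is a hypothetical stand-in for the real device query):

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_current_node_num_accelerators() -> int:
    # The expensive device enumeration now runs once per process instead
    # of on every accelerator-manager lookup.
    return _probe_intel_gpu_count()


def _probe_intel_gpu_count() -> int:
    ...  # placeholder for the real, slow device query
```
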
…#41811)
#40900 refactored the read_parquet_bulk implementation and introduced a regression where the columns parameter wasn't used. This PR fixes the bug.
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
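
With the fix, column pruning should apply again; a usage sketch (the path uses Ray's bundled example data, and the column names are from that file):

```python
import ray

ds = ray.data.read_parquet_bulk(
    ["example://iris.parquet"],
    columns=["sepal.length", "variety"],
)
# Only the requested columns should come back after the fix.
print(ds.schema())
```
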
#41821)
#40127 removed the "Implementing a Custom Datasource" example because it used deprecated APIs. This PR introduces a new example that uses up-to-date APIs.
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Cherry-pick #41748 to the 2.9.0 release. This is a simple and safe perf fix, confirmed to be a bottleneck for training ingestion workloads.
Signed-off-by: Hao Chen <chenh1024@gmail.com>
This PR removes the test file from the pip wheel, which led to `ray start` failing.
Implements retryable exceptions for actor methods. Also plumbs direct_actor_task_submitter quite a bit. Behavior:
- Added a new method-level annotation max_retries that overrides the actor's max_task_retries, if any.
- If retry_exceptions=True | List[exception class types] and (max_retries or max_task_retries) > 0, we may retry the method by issuing another task invocation to the actor.
- Both exception retries and actor-death retries count toward max_retries. For example, for a max_retries=3 call with 2 actor deaths, 1 exception, and 1 return, we can return the value.
- For a streaming generator call that yielded 4 values and then raised an exception, we retry by calling the method again, ignoring the first 4 values and starting to yield from the 5th value in the second run.
- Java and C++: they still have max_task_retries on actor deaths, but I did not add max_retries or retry_exceptions.
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
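
A sketch of the described behavior, assuming the method-level annotation is exposed via `ray.method` under the name the commit gives (the shipped parameter name may differ):

```python
import ray


@ray.remote(max_task_retries=3)
class Worker:
    # max_retries would override the actor's max_task_retries for this
    # method; per the commit, exception retries and actor-death retries
    # draw from the same budget of 5.
    @ray.method(max_retries=5, retry_exceptions=[ValueError])
    def flaky(self, x: int) -> int:
        ...  # may raise ValueError or die with the actor; both retried
```
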
Cherry-pick #41749. We have an API inconsistency in the logging config.
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Adding example code and descriptions for using gRPC context in Serve.
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Der Su <gdsu@ucdavis.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
If a deployment is autoscaling and replicas take a long time to start, there is a bug that makes the state transition to (UPDATING, AUTOSCALING), a combination that should never occur. Instead, we should just update the message but not the status.
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Cherry-pick #41869 to fix Serve release test failures.
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
- Include docs for accelerators other than GPU
- Replace RuntimeContext.get_resource_ids() with RuntimeContext.get_accelerator_ids()
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
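
Usage sketch of the replacement API:

```python
import ray


@ray.remote(num_gpus=1)
def which_accelerators():
    # Returns a mapping from accelerator resource name to assigned IDs,
    # e.g. {"GPU": ["0"]}.
    return ray.get_runtime_context().get_accelerator_ids()


print(ray.get(which_accelerators.remote()))
```
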
@aslonnie removed the request for review from a team on January 24, 2024 20:24
@lee1258561 changed the title from "Pinterest/main 2.9.1" to "[DO NOT REVIEW, LONG TERM PR FOR CI] Pinterest main branch 2.9.1" on Jan 24, 2024
Collaborator @aslonnie left a comment:


(no review)

@anyscalesam added the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) and Devprod labels on Apr 29, 2024
@dayshah removed the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) label on Mar 12, 2025
@hainesmichaelc added the community-contribution (Contributed by the community) label on Apr 4, 2025
Reviewers

@aslonnie left review comments

Awaiting requested review from: @ericl, @richardliaw (code owner), @rkooo567, @jjyao (code owner), @edoakes (code owner), @shrekris-anyscale, @sihanwang41, @zcin (code owner), @architkulkarni, @krfricke, @xwjiang2010, @amogkam, @matthewdeng (code owner), @Yard1, @maxpumperla (code owner), @justinvyu (code owner), @woshiyyya (code owner), @sven1977 (code owner), @avnishn, @ArturNiederfahrenhorst, @smorad, @kouroshHakha, @scv119, @c21, @scottjlee, @bveeramani, @raulchen (code owner), @stephanie-wang, @Zandew, @kfstorm (code owner), @fishbone, @WangTaoTheTonic (code owner), @SongGuyang (code owner), @pcmoritz (code owner), @thomasdesr (code owner), @sofianhnaide, @GeneDer (code owner), @akshay-anyscale (code owner), @hongpeng-guo (code owner), @simonsays1980 (code owner)

At least 1 approving review is required to merge this pull request.

Assignees

@lee1258561

Labels
community-contribution (Contributed by the community), Devprod
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

29 participants
@lee1258561, @aslonnie, @hainesmichaelc, @dayshah, @anyscalesam, @shrekris-anyscale, @justinvyu, @rickyyx, @architkulkarni, @scottjlee, @raulchen, @Zandew, @can-anyscale, @sihanwang41, @bveeramani, @GeneDer, @edoakes, @alexeykudinkin, @zcin, @jjyao, @fishbone, @rynewang, @angelinalg, @stephanie-wang, @c21, @jonathan-anyscale, @sven1977, @rkooo567, @brycehuang30
