[DO NOT REVIEW, LONG TERM PR FOR CI] Pinterest main branch 2.9.1 #42672
Open
lee1258561 wants to merge 63 commits into master from pinterest/main-2.9.1
Conversation
…d failures (#41285) (#41609)

Users expect different failure types to be handled differently in step 4 above:
* The current behavior is that the count decrements regardless of the error type. For example, if 3 preemptions happen with `max_failures=3`, the run ends without continuing to recover through preemptions.
* With `max_failures=-1` or some large value, there is an unlimited number of retries, but this can crash-loop on an application error (e.g., a bug in the user code), which can be very expensive.

This PR changes the failure counting of Ray Train/Tune to ignore spot instance preemption failures by default. This behavior is enabled by the new `RayActorError.preempted` flag introduced in #41102, which is set if the underlying cluster setup handles the cloud preemption signals properly and sets the preempting node to the `DRAINING` status.

---------
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
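A minimal sketch of the counting rule described above, assuming a per-failure boolean derived from `RayActorError.preempted`; this is illustrative only, not Ray's internal implementation:

```python
def should_attempt_restart(failures_are_preemptions, max_failures):
    """Decide whether to keep retrying after the latest failure.

    failures_are_preemptions: one bool per failure so far, True if the
    failure carried RayActorError.preempted (spot preemption).
    max_failures: -1 means retry indefinitely, even on application errors.
    """
    # Preemption failures no longer count against the budget by default.
    app_failures = sum(1 for preempted in failures_are_preemptions if not preempted)
    if max_failures < 0:
        return True
    return app_failures <= max_failures


# Three preemptions with max_failures=3 no longer exhaust the budget.
assert should_attempt_restart([True, True, True], max_failures=3)
```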
in various places. First step for release.

Signed-off-by: Archit Kulkarni <archit@anyscale.com>
…(#41666)

- Get the updated stats after execution for read-only plans from the input LazyBlockList, not from the resulting BlockList after fully executing, which does not contain stats for the read operation.
- Also fix a bug in the Union operator when initializing stats.

Signed-off-by: Scott Lee <sjl@anyscale.com>
… used by training (#41603) (#41674)

Fixes a resource contention issue in training ingest workloads. The default ingest resource limits should exclude the resources used by Ray Train.

This bug surfaced after [streaming output backpressure was enabled](#41327) and caused ray-data-resnet50-ingest-file-size-benchmark.aws to fail (#41496). Here is what happens:
* The dataset has 2 stages: `read` and `preprocess`.
* The cluster has 16 CPUs. 2 of them are used by Ray Train: 1 for the trainer actor and 1 for the training worker.
* The Data StreamExecutor incorrectly thinks 16 CPUs are available, while there are actually only 14. It submits 15 read tasks and 1 preprocess task, but the preprocess task is not actually running.
* The streaming output backpressure throttles the output of the read tasks when the output queue is full, and the preprocess task cannot run and consume the output data. This makes the execution hang forever.

---------
Signed-off-by: Hao Chen <chenh1024@gmail.com>
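A rough sketch of the resource accounting described above, using the numbers from the failing benchmark; the function and argument names are illustrative, not Ray's:

```python
def data_cpu_limit(cluster_cpus: int, train_cpus: int) -> int:
    """CPUs the Data stream executor should plan with.

    In the failing benchmark: 16 cluster CPUs, 2 reserved by Ray Train
    (1 for the trainer actor + 1 for the training worker), leaving 14
    for read/preprocess tasks.
    """
    return cluster_cpus - train_cpus


assert data_cpu_limit(cluster_cpus=16, train_cpus=2) == 14
```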
…41634) (#41665)

#41466 enables the Ray Data streaming executor by default for all datasets. As a result, the Ray Data execution in the `test_client_data_get` test now runs through the streaming executor, which is known to have many incompatibilities with Ray Client since Ray 2.7. So we skip the test, which checks compatibility between Ray Client and Ray Data, until a future Ray Client implementation can better support Ray Data usage.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Pick of #41698. De-noise release test runs. Purely a test infra change.

Signed-off-by: can <can@anyscale.com>
…(#41637) (#41720)

When there is non-Data code running in the same cluster, the Data StreamExecutor considers all submitted tasks as active, while they may not actually have the resources to run. #41603 is an attempt to fix the data+train workload by excluding training resources.

This PR is a more general fix for other workloads, with two main changes:
1. Besides detecting active tasks, we also detect whether the downstream has made no progress for a specific interval.
2. Introduce a new `reserved_resources` option to allow specifying non-Data resources.

This PR alone can also fix #41496.

---------
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
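A hedged sketch of the stall-detection idea in change 1 above; the threshold, names, and structure are illustrative only and do not reflect the actual StreamExecutor code:

```python
import time

NO_PROGRESS_INTERVAL_S = 10.0  # illustrative threshold, not Ray's real default


def downstream_looks_stalled(last_output_time: float, num_active_tasks: int) -> bool:
    """Consider the pipeline stalled when tasks appear active but no operator
    has produced output for NO_PROGRESS_INTERVAL_S seconds; the executor can
    then stop treating those tasks' resources as actually available."""
    return num_active_tasks > 0 and (
        time.monotonic() - last_output_time > NO_PROGRESS_INTERVAL_S
    )
```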
…41725)

#41118 added an include_paths parameter to ParquetDatasource. As part of that PR, we pass a self._include_paths attribute to Parquet read tasks. As a result, the datasource (self) gets serialized with each read task. Normally this isn't an issue, but if you're working with a large dataset (like in the failing release test), the datasource is slow to serialize.

This PR fixes the issue by removing the reference to self.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
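The underlying pitfall is generic Python: a closure that references `self` drags the whole object along when serialized. A simplified illustration (not the actual ParquetDatasource code), using cloudpickle, which is what Ray uses to ship task closures:

```python
import cloudpickle


class Datasource:
    def __init__(self, include_paths: bool):
        self._include_paths = include_paths
        self._big_metadata = list(range(100_000))  # stand-in for expensive-to-serialize state

    def make_read_task_slow(self):
        # BAD: the closure captures `self`, so serializing the task also
        # serializes _big_metadata.
        return lambda path: (path, self._include_paths)

    def make_read_task_fast(self):
        # GOOD: copy the flag into a local variable; only the bool is captured.
        include_paths = self._include_paths
        return lambda path: (path, include_paths)


ds = Datasource(include_paths=True)
assert len(cloudpickle.dumps(ds.make_read_task_fast())) < len(
    cloudpickle.dumps(ds.make_read_task_slow())
)
```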
… (#41751)

This PR passes a grpc_context to deployments that use it, so they can get gRPC request-related info and set the code, details, trailing metadata, and compression. The original grpc._cython.cygrpc._ServicerContext type is not serializable, so we created a RayServegRPCContext that can be passed to the deployment. Will follow up with a doc change.

Why are these changes needed?
Pick of #41667

---------
Signed-off-by: Gene Su <e870252314@gmail.com>
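A hedged sketch of how a Serve deployment handler might use the context after this change; the parameter name and the injected type follow the description above and are assumptions, not verified against the released 2.9 Serve API:

```python
import grpc
from ray import serve


@serve.deployment
class Greeter:
    def __call__(self, request, grpc_context):
        # `grpc_context` stands in for the RayServegRPCContext described above,
        # which mirrors the ServicerContext setters (code, details, trailing
        # metadata, compression).
        grpc_context.set_code(grpc.StatusCode.OK)
        grpc_context.set_details("handled by Greeter")
        return request
```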
Pick of https://github.com/ray-project/ray/pull/41754/files. Required to unblock CI on the release branch.

Signed-off-by: can <can@anyscale.com>
Updates the replica log filename format from:

deployment_{deployment_name}_{app_name}#{deployment_name}#{replica_id}.log

to:

replica_{app_name}_{deployment_name}_{replica_id}.log

Also adjusts the replica log format to be:

... {app_name}_{deployment_name} {replica_id} ...

instead of:

... {deployment_name} {app_name}#{deployment_name}#{replica_id} {app_name} ...

---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
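Expressed as Python format strings, with placeholder values filled in purely for illustration:

```python
old = "deployment_{deployment_name}_{app_name}#{deployment_name}#{replica_id}.log"
new = "replica_{app_name}_{deployment_name}_{replica_id}.log"

kwargs = dict(app_name="default", deployment_name="Greeter", replica_id="abc123")
print(old.format(**kwargs))  # deployment_Greeter_default#Greeter#abc123.log
print(new.format(**kwargs))  # replica_default_Greeter_abc123.log
```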
#41744) (#41755)

This is a minified version of https://github.com/ray-project/ray/pull/41722/files, specifically put together to be cherry-picked into 2.9. Addresses #41726.

Addresses the following gaps:
- Patches the ActorProxyWrapper.is_drained method to handle the RPC response properly
- Cherry-picks a test from "[Serve] Revisiting ProxyState to fix draining sequence" #41722 to validate that the draining sequence is correct

---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Cherry pick of #40419.

This is a feature we want to get out for 2.9; quite a few OSS users are interested in trying it out (see thread). We want to reach out to users to try it out and gather feedback once it's released in 2.9 (otherwise we will have to wait for 2.10). It's also pretty low risk: the changes are confined to the container part of runtime_env, and most of the PR is actually adding tests.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
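A hedged illustration of the container runtime_env field this feature lives in; the image name is a placeholder, and the exact supported keys and whether the field applies per task, per actor, or per job should be checked against the Ray 2.9 docs:

```python
import ray

# Placeholder image; the keys under "container" are an assumption based on
# the experimental container runtime_env described above.
runtime_env = {"container": {"image": "rayproject/ray:2.9.1"}}


@ray.remote(runtime_env=runtime_env)
def hello():
    return "ran inside the container image"
```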
Currently, IntelGPUAcceleratorManager.get_current_node_num_accelerators() is called every time we try to get the accelerator manager for the GPU resource. This is expensive and makes single_client_tasks_sync slower. This PR changes it to only call it once.

Related issue number: #41695

---------
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
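A generic sketch of the call-once caching pattern, not the actual accelerator-manager code; the probe function here is a stand-in:

```python
import functools


@functools.lru_cache(maxsize=1)
def cached_num_accelerators() -> int:
    """Probe the node's Intel GPU count once per process and reuse the result."""
    return _probe_intel_gpu_count()


def _probe_intel_gpu_count() -> int:
    # Placeholder for the expensive device query the real manager performs.
    return 0
```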
Implements retryable exceptions for actor methods. Also plumbs direct_actor_task_submitter quite a bit.

Behavior:
- Added a new method-level annotation max_retries that overrides the actor's max_task_retries, if any.
- If user_exceptions=True | List[exception class types] and (max_retries or max_task_retries) > 0, we may retry the method by issuing another task invocation to the actor.
- Both exception retries and actor-death retries count toward max_retries. For example, for a max_retries=3 call with 2 actor deaths, 1 exception, and 1 return, we can return the value.
- For a streaming generator call that yielded 4 values and then raised an exception, we retry by calling the method again, ignoring the first 4 values, and start yielding from the 5th value in the second run.
- Java and C++: they still have max_task_retries on actor deaths, but I did not add max_retries or retry_exceptions.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
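A hedged usage sketch of the method-level retry annotation; the keyword spellings below are taken from the description above and may differ from the released 2.9 API, so treat them as assumptions:

```python
import ray


@ray.remote(max_task_retries=1)
class Worker:
    # `max_retries` / `retry_exceptions` spellings are assumptions from the
    # commit message; verify against the released ray.method signature.
    @ray.method(max_retries=3, retry_exceptions=[ValueError])
    def flaky(self, x):
        # Retried on ValueError or actor death, up to 3 attempts total.
        return x * 2
```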
If a deployment is autoscaling and replicas take a long time to start, there is a bug that makes the state transition to (UPDATING, AUTOSCALING) which is a combination that should never occur. Instead, we should just update the message but not the status.---------Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.