[core] Threaded actors get stuck forever if they receive two exit signals #51582
Conversation
This is a blocker of #51058.
src/ray/core_worker/core_worker.cc (outdated)
@@ -1167,6 +1160,13 @@ void CoreWorker::Exit(
    const rpc::WorkerExitType exit_type,
    const std::string &detail,
    const std::shared_ptr<LocalMemoryBuffer> &creation_task_exception_pb_bytes) {
  if (is_shutdown_) {
Since `Shutdown` is only called in the `shutdown` callback within `Exit`, promising to execute `Exit` at most once also ensures that `Shutdown` is executed at most once.
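For intuition, here is a minimal, self-contained sketch of that argument, using a hypothetical `Worker` class rather than the real `CoreWorker`: if `Exit` bails out on repeated calls and its shutdown callback is the only caller of `Shutdown`, then `Shutdown` also runs at most once. (The next comment points out that in the real codebase `Shutdown` has another caller, so the premise does not fully hold.)

```cpp
// Illustration only: a hypothetical Worker class, not Ray's CoreWorker.
#include <atomic>
#include <iostream>

class Worker {
 public:
  void Exit() {
    // Run the exit sequence at most once, even if Exit() is invoked again
    // by a second exit signal.
    if (already_exited_.exchange(true)) {
      std::cout << "second exit signal ignored\n";
      return;
    }
    std::cout << "draining tasks...\n";
    // In this sketch the shutdown callback is the only caller of Shutdown(),
    // so executing Exit() at most once also executes Shutdown() at most once.
    ShutdownCallback();
  }

 private:
  void ShutdownCallback() { Shutdown(); }
  void Shutdown() { std::cout << "Shutdown executed\n"; }

  std::atomic<bool> already_exited_{false};
};

int main() {
  Worker w;
  w.Exit();  // runs the full exit + shutdown sequence
  w.Exit();  // ignored
}
```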
MortalHappiness, Mar 22, 2025 (edited)
I don't think this is correct. `CoreWorker::Shutdown` is also called by `CoreWorkerProcessImpl::ShutdownDriver`, which is called in `CoreWorker.shutdown_driver` in `_raylet.pyx`.
ray/src/ray/core_worker/core_worker_process.cc, lines 337 to 349 in c6639d2:
void CoreWorkerProcessImpl::ShutdownDriver() {
  RAY_CHECK(options_.worker_type == WorkerType::DRIVER)
      << "The `Shutdown` interface is for driver only.";
  auto global_worker = GetCoreWorker();
  RAY_CHECK(global_worker);
  global_worker->Disconnect(/*exit_type*/ rpc::WorkerExitType::INTENDED_USER_EXIT,
                            /*exit_detail*/ "Shutdown by ray.shutdown().");
  global_worker->Shutdown();
  {
    auto write_locked = core_worker_.LockForWrite();
    write_locked.Get().reset();
  }
}
_raylet.pyx, lines 3047 to 3055 in c6639d2:
def shutdown_driver(self):
    # If it's a worker, the core worker process should have been
    # shutdown. So we can't call
    # `CCoreWorkerProcess.GetCoreWorker().GetWorkerType()` here.
    # Instead, we use the cached `is_driver` flag to test if it's a
    # driver.
    assert self.is_driver
    with nogil:
        CCoreWorkerProcess.Shutdown()
MortalHappiness, Mar 22, 2025 (edited)
Should also add a test to check the case that `ray.shutdown` is called.
kevin85421, Mar 22, 2025 (edited)
Good catch! I only checked the call in `core_worker.cc`. I'm surprised there's another call to `Shutdown` outside of `core_worker.cc`. Maybe we should consider unifying the code paths for both the core worker driver and the worker shutdown process.
I will use another flag instead of reusing `is_shutdown_`.
In the future, is there a way that we can better unify this shutdown sequence or is there a reason why we need separate public interfaces?
I haven't actually delved deeply into it, but I fully agree that we should have a single public function to terminate the core worker, whether it is a driver or a worker, as I said in #51582 (comment).
We can leave `Exit` as the only public interface to gracefully terminate a core worker process, and only `Exit` can call `Shutdown`.
Can you take it as a follow-up to investigate and fix it if it isn't too hard?
> Can you take it as a follow-up to investigate and fix it if it isn't too hard?

Sure! Created an issue to track the progress: #51642.
ping when ready for review
what's the testing plan for this change?
Sorry, I initially thought it would be tested in #51058. Let me add a test in this PR.
Added a test.
please help review @MortalHappiness @dayshah
MortalHappiness commented Mar 22, 2025 (edited)
Not sure whether the following reproduction script addresses the same issue as this one.

    import time
    import ray

    ray.init(address="auto")

    @ray.remote(concurrency_groups={"io": 1})
    class TestActor:
        def exit(self):
            ray.actor.exit_actor()

    actors = [TestActor.remote() for _ in range(50)]
    ray.get([a.__ray_ready__.remote() for a in actors])
    ray.wait([a.exit.remote() for a in actors], timeout=10.0)
    print("Sleeping")
    time.sleep(3600)

Run this script until the … I'm also wondering where the actor received the exit signal twice?
LGTM pending comment
src/ray/core_worker/core_worker.h (outdated)
/// Whether the `Exit` function has been called, to avoid executing the exit
/// process multiple times.
std::atomic<bool> is_exit_ = false;
Drop a comment about the relationship between this and `is_shutdown_`. In general, in cases where we have confusing duplicate behaviors, we should document them as clearly as possible to help future readers.
Added
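For readers outside this thread, here is a sketch of what such a relationship comment could look like. The member names follow this PR, but the comment text and the surrounding class are illustrative assumptions, not the merged code.

```cpp
// Illustration only: not the merged core_worker.h.
#include <atomic>

class CoreWorkerSketch {
  /// Whether `Exit` has started. Guards the graceful-exit path so that a
  /// threaded actor receiving two exit signals does not run the exit
  /// sequence (e.g. `task_receiver_->Stop()`) twice.
  std::atomic<bool> is_exit_{false};

  /// Whether `Shutdown` has been called. `Exit` reaches `Shutdown` through
  /// its shutdown callback, but `Shutdown` is also invoked directly by
  /// `CoreWorkerProcessImpl::ShutdownDriver`, which is why `is_exit_` does
  /// not simply reuse this flag.
  std::atomic<bool> is_shutdown_{false};
};

int main() { CoreWorkerSketch sketch; (void)sketch; }
```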
CI passes. cc @edoakes, would you mind merging this PR? The auto-merge was canceled when I synced with the master branch.
6b805b5 merged into master
"signal and is shutting down."; | ||
return; | ||
} | ||
is_exit_ = true; |
(Not familiar with the code, just a question out of curiosity.)
Is it possible that multiple threads check `is_exit_` and all of them pass through? If it's not supposed to be called in a multi-threaded environment, we don't need an atomic variable here. In a word, reading through the code, I think you should use CAS.
Ref: https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange
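To make the concern concrete: with a separate check and store (`if (is_exit_) return; is_exit_ = true;`), two threads can both read `false` before either writes `true`, and both proceed. A compare-and-swap performs the check and the update as one atomic step, so exactly one caller wins. Below is a minimal standalone sketch of that pattern; the `ExitOnce` and `RunExitSequence` names are made up for illustration and are not CoreWorker code.

```cpp
// Minimal sketch of the CAS-guarded "run at most once" pattern.
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> is_exit{false};

void RunExitSequence(int caller) {
  std::cout << "exit sequence executed by thread " << caller << "\n";
}

void ExitOnce(int caller) {
  // Racy version (two separate operations):
  //   if (is_exit) return;
  //   is_exit = true;
  // Both threads can read `false` before either stores `true`.
  //
  // CAS version: the check and the update happen as one atomic step,
  // so exactly one caller proceeds.
  bool expected = false;
  if (!is_exit.compare_exchange_strong(expected, true)) {
    return;  // another thread already started the exit sequence
  }
  RunExitSequence(caller);
}

int main() {
  std::thread t1(ExitOnce, 1);
  std::thread t2(ExitOnce, 2);
  t1.join();
  t2.join();
}
```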
…nals (ray-project#51582)

If a threaded actor receives exit signals twice as shown in the above screenshot, it will execute [task_receiver_->Stop()](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/core_worker.cc#L1224) twice. However, the second call to `task_receiver_->Stop()` will get stuck forever when executing the [releaser](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/transport/concurrency_group_manager.cc#L135).

### Reproduction

```sh
ray start --head --include-dashboard=True --num-cpus=1
# https://gist.github.com/kevin85421/7a42ac3693537c2148fa554065bb5223
python3 test.py
# Some actors are still ALIVE. If all actors are DEAD, increase the number of actors.
ray list actors
```

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Why are these changes needed?

If a threaded actor receives exit signals twice, as shown in the above screenshot, it will execute `task_receiver_->Stop()` twice. However, the second call to `task_receiver_->Stop()` will get stuck forever when executing the releaser.

Reproduction

Start a head node with `ray start --head --include-dashboard=True --num-cpus=1`, run the reproduction script from the gist linked in the commit message above, and then check `ray list actors`; some actors remain ALIVE (if all actors are DEAD, increase the number of actors).