
[core] Make GCS InternalKV workload configurable to the Policy. #47736

Open
rynewang wants to merge 18 commits into ray-project:master from rynewang:dedicated-kv-ioctx

Conversation

@rynewang (Contributor) commented Sep 18, 2024 (edited)

Since #48231 we can define a policy for which IoContext a workload runs on. To make a workload configurable, one needs to fix thread safety for the involved classes and then define the policy for the class.

Makes PeriodicalRunner thread-safe with an atomic bool.

Adds a policy to assign InternalKVManager to the default IO Context for:

  1. InternalKV gRPC service
  2. InMemoryStoreClient callbacks
  3. RedisStoreClient callbacks

Notably, gcs_table_storage_ is not controlled by this policy, because it is used by all the other non-KV services and we want it to be isolated from InternalKVManager's policy.
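The first change above, making PeriodicalRunner safe to stop from another thread, can be sketched with a minimal stand-in. This is not the real Ray class; it is a hypothetical simplification showing why the stop flag must be a std::atomic<bool> (a plain bool written by one thread and read by another is a data race).

```cpp
#include <atomic>
#include <functional>

// Hypothetical sketch, not the real PeriodicalRunner: the real class re-arms
// an asio timer; here we invoke the function once if not stopped.
class PeriodicalRunnerSketch {
 public:
  void RunFnPeriodically(const std::function<void()> &fn) {
    // The timer thread reads the flag; Stop() may be called from any thread.
    // std::atomic makes this read/write pair well-defined without a mutex.
    if (!stopped_.load(std::memory_order_acquire)) {
      fn();
    }
  }

  void Stop() { stopped_.store(true, std::memory_order_release); }

  bool IsStopped() const { return stopped_.load(std::memory_order_acquire); }

 private:
  std::atomic<bool> stopped_{false};
};
```

The acquire/release pairing ensures that anything done before Stop() is visible to a thread that observes the flag as set.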

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang requested a review from a team as a code owner September 18, 2024 21:20
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang force-pushed the dedicated-kv-ioctx branch 2 times, most recently from aadcd14 to c84fd80 September 20, 2024 00:19
@rynewang (Contributor, Author)

Note on the force push:

  1. c84fd80 is the original commit of the internal KV work.
  2. aadcd14 was that plus moving GcsNodeManager to its own dedicated thread. This did not work and caused a segfault, since the new thread and the main thread both read and write GcsNodeManager's internal maps.
  3. Now, reverting to c84fd80 and merging master.

@rynewang added the go (add ONLY when ready to merge, run all tests) label Sep 23, 2024
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang (Contributor, Author)

Note: TSAN caught a new data race: GetAllJobInfo sends RPCs to workers (on the main thread) and calls KVMultiGet (on the new io context thread). Both modify a counter that implements a "gather all these tasks, then run the callback" logic. I don't want a new mutex just for this, so I used atomics.

So now both threads access the same std::atomic<size_t>. They perform an atomic fetch_add to increment the counter and use the value it returns to decide whether the callback can run. This also simplifies the code from an int plus a bool to a single size_t.
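The gather-then-callback pattern described above can be sketched as follows. This is a hypothetical simplification, not the actual Ray code: each finishing task does one fetch_add, and whichever thread observes the final increment runs the callback exactly once, with no mutex.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical sketch of "gather N results, then fire the callback":
// a single atomic counter replaces an int counter plus a bool flag.
class GatherCounter {
 public:
  GatherCounter(size_t expected, std::function<void()> on_done)
      : expected_(expected), on_done_(std::move(on_done)) {}

  // Called once per completed task, possibly from different threads.
  void TaskDone() {
    // fetch_add returns the previous value, so previous + 1 == expected_
    // identifies the thread that completed the last outstanding task.
    if (finished_.fetch_add(1, std::memory_order_acq_rel) + 1 == expected_) {
      on_done_();
    }
  }

 private:
  const size_t expected_;
  std::atomic<size_t> finished_{0};
  std::function<void()> on_done_;
};
```

Because the decision is based on the value returned by fetch_add rather than a separate load, two threads can never both conclude they were the last one, so the callback cannot fire twice.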


Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Comment on lines 844 to 854
RAY_LOG(INFO) << "main_service_ Event stats:\n\n"
<< main_service_.stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "pubsub_io_context_ Event stats:\n\n"
<< pubsub_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "kv_io_context_ Event stats:\n\n"
<< kv_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "task_io_context_ Event stats:\n\n"
<< task_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "ray_syncer_io_context_ Event stats:\n\n"
<< ray_syncer_io_context_.GetIoService().stats().StatsString()
<< "\n\n";
Contributor:

This is user-facing, let's make this user-friendly -- instead of using field name describe what they are

Contributor (Author):

I placed it this way for ease of searching logs - I don't think Ray users (other than Core developers) will ever need to read this...

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Collaborator @jjyao left a comment:

There is refactoring in this PR. Can we have the refactoring in its own PR first?

* The constructor takes a thread name and starts the thread.
* The destructor stops the io_service and joins the thread.
*/
class InstrumentedIoContextWithThread {
Collaborator:
should be IO instead of Io? lol
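The contract stated in the doc comment above (the constructor takes a thread name and starts the thread; the destructor stops the loop and joins the thread) can be sketched without asio. This is a hypothetical stand-in, not the real InstrumentedIoContextWithThread: a condition-variable-driven task queue plays the role of the instrumented boost::asio::io_context.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical sketch of the wrapper's RAII contract.
class IoContextWithThreadSketch {
 public:
  // Constructor takes a thread name and starts the worker thread.
  explicit IoContextWithThreadSketch(std::string thread_name)
      : name_(std::move(thread_name)), worker_([this] { Run(); }) {}

  // Destructor stops the loop (after draining queued tasks) and joins.
  ~IoContextWithThreadSketch() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stopped_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Analogous to posting work onto the io_context.
  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  const std::string &name() const { return name_; }

 private:
  void Run() {
    std::unique_lock<std::mutex> lk(mu_);
    while (true) {
      cv_.wait(lk, [this] { return stopped_ || !tasks_.empty(); });
      if (!tasks_.empty()) {
        auto task = std::move(tasks_.front());
        tasks_.pop();
        lk.unlock();  // Run the task without holding the lock.
        task();
        lk.lock();
      } else if (stopped_) {
        return;
      }
    }
  }

  std::string name_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopped_ = false;
  std::thread worker_;
};
```

The member declaration order matters: the queue, mutex, and flag are declared before worker_ so they are fully constructed before the thread starts running.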


Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
rynewang added a commit that referenced this pull request Sep 26, 2024

Every time GetInternalConfig reads from table_storage, but it's never mutated. Moves to internal kv as a simple in-mem get (no more read from redis). This itself should slightly improve performance. But with #47736 it should improve start up latency a lot in thousand-node clusters. In theory we can remove it all for good, instead just put it as an InternalKV entry, but let's do things step by step. Signed-off-by: Ruiyang Wang <rywang014@gmail.com> Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024

Every time GetInternalConfig reads from table_storage, but it's never mutated. Moves to internal kv as a simple in-mem get (no more read from redis). This itself should slightly improve performance. But with ray-project#47736 it should improve start up latency a lot in thousand-node clusters. In theory we can remove it all for good, instead just put it as an InternalKV entry, but let's do things step by step. Signed-off-by: Ruiyang Wang <rywang014@gmail.com> Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang changed the title from "[core] Move GCS InternalKV workload to dedicated thread." to "[core] Make GCS InternalKV workload configurable to the Policy." Oct 29, 2024
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang (Contributor, Author)

@jjyao ready to review

@@ -104,7 +103,7 @@ void PeriodicalRunner::DoRunFnPeriodicallyInstrumented(
// event loop.
auto stats_handle = io_service_.stats().RecordStart(name, period.total_nanoseconds());
timer->async_wait([this,
fn = std::move(fn),
Collaborator:
why removing move?

Contributor (Author):

Because that fn move did not work: fn is a const&, so std::move on it is a no-op and the copy happens anyway.
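The point made above, that std::move on a const lvalue silently copies, can be demonstrated with a tiny hypothetical type (this Tracker struct is illustrative, not from the Ray codebase). std::move(t) on a const t yields const Tracker&&, which cannot bind to the move constructor, so overload resolution falls back to the copy constructor.

```cpp
#include <string>
#include <utility>

// Hypothetical type that records whether it was moved from.
struct Tracker {
  std::string tag = "orig";
  Tracker() = default;
  Tracker(const Tracker &) {}                    // chosen for const sources
  Tracker(Tracker &&other) { other.tag = "moved-from"; }
};

// std::move(t) on a const object selects the copy ctor, so t is untouched.
std::string MoveFromConst() {
  const Tracker t;
  Tracker sink(std::move(t));  // const Tracker&& -> copy constructor
  (void)sink;
  return t.tag;  // still "orig": t was never moved from
}
```

This is why keeping `fn = std::move(fn)` in the lambda capture was misleading: it reads like a move but compiles to a copy when fn is a const reference.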

Comment on lines +72 to +73
// Init GCS table storage. Note this is on the default io context, not the one with
// GcsInternalKVManager, to avoid congestion on the latter.
Collaborator:
what does this mean?

Contributor (Author) @rynewang commented Oct 30, 2024 (edited)

Meaning gcs_table_storage_ should always go with GetDefaultIOContext, not with GetIOContext<GcsInternalKVManager>().

Collaborator:

to avoid congestion on the latter.

What does this mean?

Contributor (Author):

If gcs_table_storage_ operations ran on GcsInternalKVManager's io context, they could slow down GcsInternalKVManager.
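The policy dispatch discussed in this thread can be sketched as a compile-time mapping from a handler type to a named io context. This is a hypothetical simplification of the GetIOContext<T>() idea, not Ray's actual implementation: types without an explicit specialization fall back to the default context, which is exactly how gcs_table_storage_ stays isolated from the KV context.

```cpp
#include <string>

// Hypothetical stand-in types for illustration only.
struct GcsInternalKVManager {};
struct GcsTableStorage {};

// Primary template: everything runs on the default io context...
template <typename Handler>
struct IoContextPolicy {
  static constexpr const char *context_name = "default";
};

// ...except types explicitly given a dedicated context.
template <>
struct IoContextPolicy<GcsInternalKVManager> {
  static constexpr const char *context_name = "kv_io_context";
};

template <typename Handler>
std::string GetIoContextName() {
  return IoContextPolicy<Handler>::context_name;
}
```

Under this scheme, adding GcsTableStorage to the KV context would require an explicit specialization; its absence is what the code comment on lines +72 to +73 documents.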

stale bot commented Feb 1, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale bot added the stale (The issue is stale. It will be closed within 7 days unless there is further conversation) label Feb 1, 2025
Reviewers

@alexeykudinkin left review comments

@jjyaojjyaojjyao left review comments

At least 1 approving review is required to merge this pull request.

Assignees

@jjyao

Labels
go (add ONLY when ready to merge, run all tests), stale (The issue is stale. It will be closed within 7 days unless there is further conversation)
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

3 participants
@rynewang, @alexeykudinkin, @jjyao
