[core] Make GCS InternalKV workload configurable to the Policy. #47736
Conversation
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Note: TSAN captured a new data race: GetAllJobInfo sends RPCs to workers (main thread) and KVMultiGet (the new ioctx thread). Both modify a counter that implements the "gather all these tasks, then run the callback" logic. I don't want a new mutex just for this, so I used atomics. So now both threads access the same atomic<size_t> and perform atomic operations on it.
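To make the fix concrete, here is a minimal sketch of that "gather N async results, then fire the callback" pattern with the remaining-task counter as a `std::atomic<size_t>`. The class name and shape are illustrative, not Ray's actual code; the point is that `fetch_sub` lets completions arriving on different threads coordinate without a mutex.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Sketch: count down outstanding tasks; exactly one completer fires the callback.
class AsyncGather {
 public:
  AsyncGather(size_t num_tasks, std::function<void()> on_all_done)
      : remaining_(num_tasks), on_all_done_(std::move(on_all_done)) {}

  // Called from any thread when one task finishes. fetch_sub returns the
  // previous value, so exactly one caller observes the transition to zero
  // and runs the callback -- no mutex needed.
  void TaskDone() {
    if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      on_all_done_();
    }
  }

 private:
  std::atomic<size_t> remaining_;
  std::function<void()> on_all_done_;
};
```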
src/ray/gcs/gcs_server/gcs_server.cc Outdated
RAY_LOG(INFO) << "main_service_ Event stats:\n\n"
              << main_service_.stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "pubsub_io_context_ Event stats:\n\n"
              << pubsub_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "kv_io_context_ Event stats:\n\n"
              << kv_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "task_io_context_ Event stats:\n\n"
              << task_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "ray_syncer_io_context_ Event stats:\n\n"
              << ray_syncer_io_context_.GetIoService().stats().StatsString()
              << "\n\n";
This is user-facing, let's make it user-friendly -- instead of using the field names, describe what they are.
I placed it this way for ease of searching logs - I don't think Ray users (other than Core developers) will ever need to read this...
There is refactoring in this PR. Can we have the refactoring in its own PR first?
src/ray/common/asio/asio_util.h Outdated
 * The constructor takes a thread name and starts the thread.
 * The destructor stops the io_service and joins the thread.
 */
class InstrumentedIoContextWithThread {
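The RAII shape the doc comment describes can be sketched as follows. This is a hedged approximation using a plain task queue in place of boost::asio (so it stays self-contained); the class and member names are illustrative, not Ray's actual implementation. The essential contract matches: the constructor takes a thread name and starts the thread, the destructor stops the loop and joins.

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Sketch: an io-context-like worker owned by a single dedicated thread.
class IoContextWithThread {
 public:
  explicit IoContextWithThread(std::string thread_name)
      : name_(std::move(thread_name)), thread_([this] { Run(); }) {}

  ~IoContextWithThread() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopped_ = true;
    }
    cv_.notify_all();
    thread_.join();  // destructor stops the loop and joins the thread
  }

  // Schedule a task onto this context's thread (analogous to posting
  // work to the io_service).
  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mu_);
    for (;;) {
      cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
      if (stopped_ && tasks_.empty()) return;  // drain, then exit
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      lock.unlock();
      task();  // run user work without holding the queue lock
      lock.lock();
    }
  }

  std::string name_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> tasks_;
  bool stopped_ = false;
  std::thread thread_;  // declared last so Run() sees initialized members
};
```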
should be IO instead of Io? lol
There is refactoring in this PR. Can we have the refactoring in its own PR first?
Every time, GetInternalConfig reads from table_storage, but the config is never mutated. This moves it to internal kv as a simple in-mem get (no more reads from redis). This by itself should slightly improve performance, but with #47736 it should improve start-up latency a lot in thousand-node clusters. In theory we can remove it for good and just put it as an InternalKV entry, but let's do things step by step.
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@jjyao ready to review
@@ -104,7 +103,7 @@ void PeriodicalRunner::DoRunFnPeriodicallyInstrumented(
  // event loop.
  auto stats_handle = io_service_.stats().RecordStart(name, period.total_nanoseconds());
  timer->async_wait([this,
                     fn = std::move(fn),
why removing move?
Because that std::move on fn did not work: fn is a const&, so the move is a no-op.
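That reviewer exchange is worth a tiny illustration. A minimal sketch (the function name is made up for the example): `std::move` applied to a const reference yields a const rvalue, which cannot bind to the move constructor, so the "move" silently degrades to a copy and the source is left untouched.

```cpp
#include <cassert>
#include <string>
#include <utility>

// std::move on a const& does not move -- it copies.
std::string MoveFromConstRef(const std::string &fn) {
  std::string target = std::move(fn);  // actually a copy: fn is const&
  assert(!fn.empty());                 // source untouched: the move was a no-op
  return target;
}
```

This is why removing the `std::move(fn)` in the lambda capture changes nothing observable: the copy was happening either way.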
// Init GCS table storage. Note this is on the default io context, not the one with
// GcsInternalKVManager, to avoid congestion on the latter.
what does this mean?
Meaning gcs_table_storage_ should always go with GetDefaultIOContext, not with GetIOContext<GcsInternalKVManager>().
to avoid congestion on the latter.
What does this mean?
If gcs_table_storage_ operations go with GcsInternalKVManager's, they may slow down GcsInternalKVManager.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Since #48231 we can define a policy for IoContext runs. To make a workload configurable, one needs to fix thread safety for the involved classes and then define the policy for the class.

Makes PeriodicalRunner thread-safe with an atomic bool. Adds a policy to assign InternalKVManager to the default IO Context.

Notably, gcs_table_storage_ is not controlled by this policy, because it is used by all the other non-KV services and we want it isolated from InternalKVManager's policy.
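A hedged sketch of the "thread-safe with an atomic bool" part (class and method names are illustrative, not Ray's actual PeriodicalRunner API): an atomic flag lets Stop() be called from any thread while the timer callback, running on the io-context thread, checks the flag before executing and rescheduling, with no mutex involved.

```cpp
#include <atomic>

// Sketch: cross-thread stop signalling for a periodic runner.
class StoppablePeriodicalRunner {
 public:
  // May be called from any thread.
  void Stop() { stopped_.store(true, std::memory_order_release); }

  // Checked by the timer callback on the io-context thread before it
  // runs the periodic function and reschedules the next tick.
  bool ShouldContinue() const {
    return !stopped_.load(std::memory_order_acquire);
  }

 private:
  std::atomic<bool> stopped_{false};
};
```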