[RFC][dashboard] Use aiohttp client for inter dependencies. #49932
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue!
This PR is a request for comment on a design that uses HTTP requests for inter-Head dependencies. It removes TrainHead's usage of DataSource.
Background
TrainHead uses `DataOrganizer.get_actor_infos` to get facts about Actors. This can't easily be reduced to simple, singular GcsClient calls, because the result comes from a merge of Actor infos and Worker infos (e.g. `actor["gpus"][0]["processesPids"]` comes from `DataSource.node_physical_stats`, which traces back to the GCS `GetAllResourceUsage` RPC).
Proposal
Let TrainHead depend on ActorHead by issuing HTTP requests to it directly. The overhead should be small since both are guaranteed to live on the same node.
Scope?
We will read directly from GCS as much as possible. For cases like this one, where adapting is not trivial, the call frequency is low, and the data is (maybe) non-critical, we can use the HTTP client.
Changes
`Cache-Control: no-cache`.
`actor_ids` to API `GET /logical/actors`
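As a rough illustration of the proposal, a TrainHead-side call to ActorHead could look like the sketch below. It is not the code in this PR. The base URL, the function names `build_actors_url` and `fetch_actor_infos`, and the comma-separated encoding of `actor_ids` are assumptions made for the example. Only the `GET /logical/actors` path, the `actor_ids` parameter name, and the `Cache-Control: no-cache` header come from the description above.

```python
import asyncio
from urllib.parse import urlencode

# Hypothetical address of the dashboard HTTP server on the head node.
DASHBOARD_URL = "http://127.0.0.1:8265"

# Ask for fresh data rather than a cached response.
NO_CACHE_HEADERS = {"Cache-Control": "no-cache"}


def build_actors_url(base_url, actor_ids):
    """Build the URL for GET /logical/actors, filtered by actor IDs.

    Assumption: the IDs travel as a single comma-separated `actor_ids`
    query value. The real API may encode them differently.
    """
    query = urlencode({"actor_ids": ",".join(actor_ids)})
    return f"{base_url.rstrip('/')}/logical/actors?{query}"


async def fetch_actor_infos(actor_ids, base_url=DASHBOARD_URL):
    """Fetch actor infos over HTTP and return the decoded JSON body."""
    # Third-party dependency, imported here so the URL helper above
    # can be used and tested without aiohttp installed.
    import aiohttp

    url = build_actors_url(base_url, actor_ids)
    async with aiohttp.ClientSession(headers=NO_CACHE_HEADERS) as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()
```

With a dashboard running, a caller would use `await fetch_actor_infos([...])` from inside its own event loop, or `asyncio.run(fetch_actor_infos([...]))` from a script. A long-lived Head would more likely reuse one `ClientSession` than create one per request, as this sketch does.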
Alternative
Currently, ResourceUsage updates are subscribed to by ReportHead and written to `DataSource.node_physical_stats` (moving to NodeHead in #49878). To achieve "true isolation" we would need to define a way to get a snapshot of ResourceUsage for a given Node, which could be a much larger change.