add more execution and iteration metrics to prometheus #44971

Draft

votrou wants to merge 2 commits into ray-project:master from pinterest:cvotroubek/add-more-execution-and-iteration-metrics-to-prometheus

Conversation

votrou commented:

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

anyscalesam added the tune (Tune-related issues) and data (Ray Data-related issues) labels and removed the tune label on Apr 26, 2024
anyscalesam assigned c21 and unassigned c21 on Apr 26, 2024
scottjlee self-assigned this on Jun 25, 2024
scottjlee (Contributor) left a comment:

So far looking good! Left a few comments.

A few other items to take care of to get the PR ready for merging:

  • merge latest master
  • handle the test failures in CI (example)
  • run scripts/format.sh to run the linter
  • add short descriptions of the new metrics to the docs page: doc/source/data/monitoring-your-workload.rst

Comment on lines +514 to +516:

```python
self.task_cpu_time += meta.exec_stats.cpu_time_s
self.task_udf_time += meta.exec_stats.udf_time_s
self.task_idle_time += meta.exec_stats.wall_time_s - meta.exec_stats.cpu_time_s  # how does UDF time fit into this?
```
scottjlee (Contributor):

I think this is the right place for task_cpu_time and task_udf_time. If I understand your intent correctly, for task_idle_time I think we need some additional information: we need to get the difference between (time at which the current task finished) and (time at which the next task starts).
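
For concreteness, that gap-based definition could look like the sketch below. The (start_s, end_s) pairs are hypothetical stand-ins for per-task exec stats; this is not Ray's actual metadata structure:

```python
def idle_time_between_tasks(task_intervals):
    """Sum the gaps between one task finishing and the next starting.

    `task_intervals` is a list of hypothetical (start_s, end_s) pairs.
    """
    intervals = sorted(task_intervals)  # sort by start time
    idle = 0.0
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        if next_start > prev_end:  # only count true gaps, not overlaps
            idle += next_start - prev_end
    return idle

# Task A runs 0-2s, task B runs 3-5s -> 1s of idle time between them.
assert idle_time_between_tasks([(0.0, 2.0), (3.0, 5.0)]) == 1.0
```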

votrou (Author):

I think what I was aiming for here is how long we are idle within a task; do you think that is possible here? Some situations I would hope to measure with it would be time spent between yields of a generator task, or from streaming backpressure due to hitting the max number of streaming blocks. Is that possible here?
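
One hypothetical way to make "idle within a task" concrete is to wrap the task's generator and accumulate the time it sits suspended between yields, e.g. while the consumer is backpressured. A sketch, not Ray's actual instrumentation; `stats` is just a plain dict used as an accumulator:

```python
import time

def track_suspended_time(gen, stats):
    """Accumulate how long `gen` sits suspended between yields."""
    for item in gen:
        yielded_at = time.perf_counter()
        yield item  # suspended here until the consumer asks for more
        stats["idle_within_task_s"] = (
            stats.get("idle_within_task_s", 0.0)
            + time.perf_counter()
            - yielded_at
        )

stats = {}
for block in track_suspended_time(iter(range(3)), stats):
    time.sleep(0.01)  # consumer-side work counts as "idle" for the task
print(stats)  # ~0.03s accumulated across the three yields
```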

scottjlee (Contributor) commented on Jul 3, 2024 (edited):

> time spent between yields of a generator task

Would this be the same as the time it takes for the generator to produce the next output? Or more like the timing around here, when we yield batches from the iterator? https://github.com/ray-project/ray/blob/master/python/ray/data/iterator.py#L185-L186

> from streaming backpressure due to hitting the max number of streaming blocks

We have this existing metric task_submission_backpressure_time, which keeps track of the cumulative time spent in backpressure to submit the next task. Is this what you had in mind for your use case?
https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L204-L210
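
Roughly, a cumulative timer like that can be maintained by timestamping transitions into and out of backpressure. The class and method names below are illustrative, not the real op_runtime_metrics code:

```python
import time

class BackpressureTimerSketch:
    """Sketch: accumulate total seconds spent in backpressure by
    recording transitions, the way a cumulative metric like
    task_submission_backpressure_time behaves."""

    def __init__(self):
        self.task_submission_backpressure_time = 0.0
        self._entered_at = None  # None means "not currently backpressured"

    def on_backpressure_changed(self, in_backpressure: bool) -> None:
        now = time.perf_counter()
        if in_backpressure and self._entered_at is None:
            self._entered_at = now  # entering backpressure
        elif not in_backpressure and self._entered_at is not None:
            # Leaving backpressure: fold the interval into the running total.
            self.task_submission_backpressure_time += now - self._entered_at
            self._entered_at = None
```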

votrou (Author):

More like the timing around when we yield batches from the iterator.

As for task_submission_backpressure_time, I'm still a bit confused about how to interpret the metric. It seems like this metric will increase as long as we have enough tasks submitted to our pool of workers. But that would always seem to be the case, given that the scheduling loop runs every millisecond or so. For a one-operator pipeline, what does an increasing or stagnant task_submission_backpressure_time indicate? I don't think it would necessarily mean that we're producing blocks faster than consuming them, because the running tasks themselves could just be slow, right?

I think Raul kind of mentions in this comment what I'd like to capture: when do we apply backpressure to running tasks, and how can we measure that?

scottjlee (Contributor):

> For a one-operator pipeline, what does an increasing or stagnant task_submission_backpressure_time indicate? I don't think it would necessarily mean that we're producing blocks faster than consuming them, because the running tasks themselves could just be slow, right?

Yeah, the reason could be one of several things, but the underlying cause is that there are too many tasks queued for the operator, so we don't want to add any more.

This is where we evaluate whether the op should be backpressured. I think older Ray versions may look slightly different, but essentially backpressure_policy.can_add_input() returns False if backpressure is to be applied. When backpressure is applied, the operator will no longer accept new inputs.
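
Reduced to a sketch, that gate looks something like the function below; `op`, `policies`, and `metrics` (and the `next_pending_input()` accessor) are hypothetical stand-ins for the streaming executor's internals, not Ray's actual scheduling loop:

```python
def try_submit_input(op, policies, metrics):
    """The operator only takes a new input if every backpressure
    policy's can_add_input() agrees. Illustrative names throughout."""
    backpressured = any(not policy.can_add_input(op) for policy in policies)
    metrics.on_backpressure_changed(backpressured)  # feed the timer sketch above
    if backpressured:
        return False  # too much queued work; submit nothing this round
    op.add_input(op.next_pending_input())  # hypothetical accessor
    return True
```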

```python
)
self.iter_next_batch_s = Gauge(
    "data_iter_next_batch_seconds",
    description="Seconds spent getting next batch",  # Need a better description for this?
```
scottjlee (Contributor):

A slightly more descriptive line could be something like:

Suggested change:

```diff
-    description="Seconds spent getting next batch",  # Need a better description for this?
+    description="Seconds spent getting next batch from internal batch builder",
```

Comment on lines +265 to +269:

```python
self.iter_total_s = Gauge(
    "data_iter_total_seconds",
    description="Total time spent in iteration",
    tag_keys=iter_tag_keys,
)
```
scottjlee (Contributor):

Did you want to also add a chart for this, or just track the metric via the class and access it elsewhere? I didn't see one; I may have missed it.

Comment on lines +307 to +311:

```python
self.streaming_exec_schedule_s = Gauge(
    "data_streaming_exec_schedule_seconds",
    description="Seconds spent streaming executor scheduling",
    tag_keys=iter_tag_keys,
)
```
scottjlee (Contributor):

Same for this: did you want to add a chart, or just track the metric via the class and access it elsewhere?

```python
self.iter_collate_batch_s.set(stats.iter_collate_batch_s.get(), tags)
self.iter_finalize_batch_s.set(stats.iter_finalize_batch_s.get(), tags)
self.streaming_split_coordinator_s.set(stats.streaming_split_coordinator_s.get(), tags)
self.streaming_exec_schedule_s.set(stats.streaming_exec_schedule_s.get(), tags)  # Where should this live?
```
scottjlee (Contributor):

Since this is a metric which is tracked for the overall execution of the entire dataset, and is not related to iteration, update_execution_metrics() would be a more appropriate place to update it.
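
In other words, the split would look roughly like the sketch below. It reuses the gauge names from the snippets above, but the class, the stub Gauge, and the update_iteration_metrics() counterpart are all assumptions for illustration:

```python
class _Gauge:
    """Stub standing in for ray.util.metrics.Gauge, for this sketch only."""
    def set(self, value, tags=None):
        self.value = value

class StatsSketch:
    def __init__(self):
        self.streaming_exec_schedule_s = _Gauge()
        self.iter_collate_batch_s = _Gauge()

    def update_execution_metrics(self, stats, tags):
        # Execution-wide, per-dataset timing belongs here...
        self.streaming_exec_schedule_s.set(stats.streaming_exec_schedule_s.get(), tags)

    def update_iteration_metrics(self, stats, tags):
        # ...while per-iteration timings stay with the other iter_* gauges.
        self.iter_collate_batch_s.set(stats.iter_collate_batch_s.get(), tags)
```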

Comment on lines +466 to +473:

```python
self.iter_total_s.set(0, tags)
self.iter_wait_s.set(0, tags)
self.iter_get_s.set(0, tags)
self.iter_next_batch_s.set(0, tags)
self.iter_format_batch_s.set(0, tags)
self.iter_collate_batch_s.set(0, tags)
self.iter_finalize_batch_s.set(0, tags)
self.streaming_split_coordinator_s.set(0, tags)
```
scottjlee (Contributor):

For the iteration-related metrics, let's move them to clear_iteration_metrics().

```python
self.iter_collate_batch_s.set(0, tags)
self.iter_finalize_batch_s.set(0, tags)
self.streaming_split_coordinator_s.set(0, tags)
self.streaming_exec_schedule_s.set(0, tags)  # Where should this live?
```
scottjlee (Contributor):

This LGTM.

stalebot commented on Feb 25, 2025:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stalebot added the stale label ("The issue is stale. It will be closed within 7 days unless there is further conversation") on Feb 25, 2025
hainesmichaelc added the community-contribution label ("Contributed by the community") on Apr 4, 2025
stalebot removed the stale label on Apr 4, 2025
Reviewers

scottjlee (left review comments)

At least 1 approving review is required to merge this pull request.

Assignees

scottjlee

Labels

community-contribution (Contributed by the community), data (Ray Data-related issues)

Projects

None yet

Milestone

No milestone

5 participants

@votrou, @scottjlee, @c21, @hainesmichaelc, @anyscalesam
