add more execution and iteration metrics to prometheus #44971
base: master
Conversation
So far looking good! Left a few comments.

A few other items to take care of to get the PR ready for merging:
- merge latest master
- handle the test failures in CI (example)
- run `scripts/format.sh` to run the linter
- add short descriptions of the new metrics to the docs page: `doc/source/data/monitoring-your-workload.rst`
self.task_cpu_time += meta.exec_stats.cpu_time_s
self.task_udf_time += meta.exec_stats.udf_time_s
self.task_idle_time += meta.exec_stats.wall_time_s - meta.exec_stats.cpu_time_s  # how does UDF time fit into this?
I think this is the right place for `task_cpu_time` and `task_udf_time`. If I understand your intent correctly, for `task_idle_time` I think we need some additional information: we need the difference between (the time at which the current task finished) and (the time at which the next task starts).
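A rough sketch of the bookkeeping that gap would need, purely for illustration (all names here are hypothetical, not fields from this PR):

```python
import time


class OpIdleTimeSketch:
    """Hypothetical bookkeeping for the gap between one task finishing
    and the next task starting on the same operator."""

    def __init__(self) -> None:
        self.task_idle_time_s = 0.0
        self._last_task_end = None  # wall-clock time the previous task finished

    def on_task_finished(self) -> None:
        # Remember when this task finished so the next submission can measure the gap.
        self._last_task_end = time.perf_counter()

    def on_task_submitted(self) -> None:
        # Idle time = gap between the previous task finishing and this one starting.
        if self._last_task_end is not None:
            self.task_idle_time_s += time.perf_counter() - self._last_task_end
            self._last_task_end = None
```

The operator's task-completion and task-submission hooks would call `on_task_finished()` and `on_task_submitted()` respectively.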
I think what I was aiming for here is how long we are idle within a task. Some situations I would hope to measure with it are the time spent between yields of a generator task, or time lost to streaming backpressure from hitting the max number of streaming blocks. Is that possible here?
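For concreteness, a minimal sketch of what "idle within a task" could mean for a generator, assuming we can wrap the task's output generator (hypothetical helper, not code from this PR):

```python
import time


def track_yield_gaps(gen, stats):
    """Hypothetical wrapper: accumulate the wall-clock time the task spends
    suspended between yielding one item and being asked for the next one."""
    for item in gen:
        handed_off_at = time.perf_counter()
        yield item
        # We only resume here once the consumer requests the next item,
        # so this gap approximates "idle within the task" between yields.
        stats["idle_between_yields_s"] += time.perf_counter() - handed_off_at


# Example usage:
# stats = {"idle_between_yields_s": 0.0}
# for block in track_yield_gaps(produce_blocks(), stats):
#     consume(block)
```

In this toy version the accumulated gap mostly reflects a slow consumer; in the real streaming executor the same idea would have to hook into wherever the task yields its blocks.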
> time spent between yields of a generator task

Would this be the same as the time it takes for the generator to produce the next output? Or more like the timing around here, where we yield batches from the iterator? https://github.com/ray-project/ray/blob/master/python/ray/data/iterator.py#L185-L186

> from streaming backpressure due to hitting the max number of streaming blocks

We have an existing metric, `task_submission_backpressure_time`, which keeps track of the cumulative time spent in backpressure before submitting the next task. Is this what you had in mind for your use case?
https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L204-L210
More like the timing around when we yield batches from the iterator.

As for `task_submission_backpressure_time`, I'm still a bit confused about how to interpret the metric. It seems like it will increase as long as we have enough tasks submitted to our pool of workers, but that would almost always be the case given that the scheduling loop runs every millisecond or so. For a one-operator pipeline, what does an increasing or stagnant `task_submission_backpressure_time` indicate? I don't think it necessarily means we're producing blocks faster than we consume them, because the running tasks themselves could just be slow, right?

I think Raul kind of mentions in this comment what I'd like to capture: when do we apply backpressure to running tasks, and how can we measure that?
> For a one-operator pipeline, what does an increasing or stagnant task_submission_backpressure_time indicate? I don't think it would mean that we're necessarily producing blocks faster than consuming them, because the running tasks themselves could just be slow, right?

Yeah, the reason could be several things, but the underlying cause is that there are too many tasks queued for the operator, so we don't want to add any more.

This is where we evaluate whether the op should be backpressured. Older Ray versions may look slightly different, but essentially `backpressure_policy.can_add_input()` returns False if backpressure is to be applied. When backpressure is applied, the operator will no longer accept new inputs.
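A condensed, self-contained sketch of how that on/off decision could feed a cumulative backpressure timer (illustrative names only, not the actual implementation in `op_runtime_metrics.py`):

```python
import time
from typing import Optional


class BackpressureClock:
    """Hypothetical bookkeeping for cumulative time an operator is refused new inputs."""

    def __init__(self) -> None:
        self.task_submission_backpressure_time = 0.0
        self._backpressured_since: Optional[float] = None

    def record(self, can_add_input: bool) -> None:
        now = time.perf_counter()
        if not can_add_input and self._backpressured_since is None:
            self._backpressured_since = now  # backpressure just started
        elif can_add_input and self._backpressured_since is not None:
            self.task_submission_backpressure_time += now - self._backpressured_since
            self._backpressured_since = None  # backpressure just ended
```

Each scheduling tick would then call something like `clock.record(all(p.can_add_input(op) for p in backpressure_policies))`, so the timer only grows while every dispatch attempt is being refused.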
)
self.iter_next_batch_s = Gauge(
    "data_iter_next_batch_seconds",
    description="Seconds spent getting next batch",  # Need a better description for this?
A slightly more descriptive line could be something like:

-    description="Seconds spent getting next batch",  # Need a better description for this?
+    description="Seconds spent getting next batch from internal batch builder",
self.iter_total_s = Gauge(
    "data_iter_total_seconds",
    description="Total time spent in iteration",
    tag_keys=iter_tag_keys,
)
Did you want to also add a chart for this, or just track the metric via the class and access it elsewhere? I didn't see one; I may have missed it.
self.streaming_exec_schedule_s = Gauge(
    "data_streaming_exec_schedule_seconds",
    description="Seconds spent streaming executor scheduling",
    tag_keys=iter_tag_keys,
)
Same for this: did you want to add a chart, or just track the metric via the class and access it elsewhere?
self.iter_collate_batch_s.set(stats.iter_collate_batch_s.get(), tags)
self.iter_finalize_batch_s.set(stats.iter_finalize_batch_s.get(), tags)
self.streaming_split_coordinator_s.set(stats.streaming_split_coordinator_s.get(), tags)
self.streaming_exec_schedule_s.set(stats.streaming_exec_schedule_s.get(), tags)  # Where should this live?
Since this metric is tracked for the overall execution of the entire dataset, and is not related to iteration, `update_execution_metrics()` would be a more appropriate place to update it.
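Roughly what that move could look like, as a trimmed-down, hypothetical stand-in for the stats actor (the real method takes more arguments):

```python
from ray.util.metrics import Gauge


class StatsActorSketch:
    """Trimmed-down, hypothetical stand-in for the Data stats actor."""

    def __init__(self) -> None:
        self.streaming_exec_schedule_s = Gauge(
            "data_streaming_exec_schedule_seconds",
            description="Seconds spent streaming executor scheduling",
            tag_keys=("dataset",),
        )

    def update_execution_metrics(self, stats, tags):
        # Execution-scoped: set it here alongside the other per-execution
        # metrics, not in the iteration update path.
        self.streaming_exec_schedule_s.set(stats.streaming_exec_schedule_s.get(), tags)
```

The gauge itself is unchanged; only the place where it gets `set()` moves.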
self.iter_total_s.set(0, tags)
self.iter_wait_s.set(0, tags)
self.iter_get_s.set(0, tags)
self.iter_next_batch_s.set(0, tags)
self.iter_format_batch_s.set(0, tags)
self.iter_collate_batch_s.set(0, tags)
self.iter_finalize_batch_s.set(0, tags)
self.streaming_split_coordinator_s.set(0, tags)
For the iteration-related metrics, let's move them to `clear_iteration_metrics()`.
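For example, the grouped reset could look roughly like this (simplified, assumed signature; `tags` built the same way as in the other update methods):

```python
def clear_iteration_metrics(self, tags):
    """Sketch of a grouped reset on the stats actor: zero every
    iteration-scoped gauge in one place when iteration finishes."""
    for gauge in (
        self.iter_total_s,
        self.iter_wait_s,
        self.iter_get_s,
        self.iter_next_batch_s,
        self.iter_format_batch_s,
        self.iter_collate_batch_s,
        self.iter_finalize_batch_s,
        self.streaming_split_coordinator_s,
    ):
        gauge.set(0, tags)
```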
self.iter_collate_batch_s.set(0, tags)
self.iter_finalize_batch_s.set(0, tags)
self.streaming_split_coordinator_s.set(0, tags)
self.streaming_exec_schedule_s.set(0, tags)  # Where should this live?
this LGTM
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.