add more execution and iteration metrics to prometheus #44971
base: master
Conversation
So far looking good! Left a few comments.

A few other items to take care of to get the PR ready for merging:
- merge latest master
- handle the test failures in CI (example)
- run `scripts/format.sh` to run the linter
- add short descriptions of the new metrics to the docs page: `doc/source/data/monitoring-your-workload.rst`
self.task_cpu_time += meta.exec_stats.cpu_time_s
self.task_udf_time += meta.exec_stats.udf_time_s
self.task_idle_time += meta.exec_stats.wall_time_s - meta.exec_stats.cpu_time_s  # how does UDF time fit into this?
I think this is the right place for `task_cpu_time` and `task_udf_time`. If I understand your intent correctly, for `task_idle_time` I think we need some additional information: we need the difference between (the time at which the current task finished) and (the time at which the next task starts).
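A rough sketch of the bookkeeping that gap would need, purely for illustration (all names here are hypothetical, not fields from this PR):

```python
import time


class OpIdleTimeSketch:
    """Hypothetical bookkeeping for the gap between one task finishing
    and the next task starting on the same operator."""

    def __init__(self) -> None:
        self.task_idle_time_s = 0.0
        self._last_task_end = None  # wall-clock time the previous task finished

    def on_task_finished(self) -> None:
        # Remember when this task finished so the next submission can measure the gap.
        self._last_task_end = time.perf_counter()

    def on_task_submitted(self) -> None:
        # Idle time = gap between the previous task finishing and this one starting.
        if self._last_task_end is not None:
            self.task_idle_time_s += time.perf_counter() - self._last_task_end
            self._last_task_end = None
```

The operator's task-completion and task-submission hooks would call `on_task_finished()` and `on_task_submitted()` respectively.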
I think what I was aiming for here is how long we are idle within a task. Some situations I would hope to measure with it are the time spent between yields of a generator task, or time lost to streaming backpressure from hitting the max number of streaming blocks. Is that possible here?
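For concreteness, a minimal sketch of what "idle within a task" could mean for a generator, assuming we can wrap the task's output generator (hypothetical helper, not code from this PR):

```python
import time


def track_yield_gaps(gen, stats):
    """Hypothetical wrapper: accumulate the wall-clock time the task spends
    suspended between yielding one item and being asked for the next one."""
    for item in gen:
        handed_off_at = time.perf_counter()
        yield item
        # We only resume here once the consumer requests the next item,
        # so this gap approximates "idle within the task" between yields.
        stats["idle_between_yields_s"] += time.perf_counter() - handed_off_at


# Example usage:
# stats = {"idle_between_yields_s": 0.0}
# for block in track_yield_gaps(produce_blocks(), stats):
#     consume(block)
```

In this toy version the accumulated gap mostly reflects a slow consumer; in the real streaming executor the same idea would have to hook into wherever the task yields its blocks.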
> time spent between yields of a generator task

Would this be the same as the time it takes for the generator to produce the next output? Or more like the timing around here, where we yield batches from the iterator? https://github.com/ray-project/ray/blob/master/python/ray/data/iterator.py#L185-L186

> from streaming backpressure due to hitting the max number of streaming blocks

We have an existing metric, `task_submission_backpressure_time`, which keeps track of the cumulative time spent in backpressure before submitting the next task. Is this what you had in mind for your use case?
https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L204-L210
More like the timing around when we yield batches from the iterator.

As for `task_submission_backpressure_time`, I'm still a bit confused about how to interpret the metric. It seems like it will increase as long as we have enough tasks submitted to our pool of workers, but that would almost always be the case given that the scheduling loop runs every millisecond or so. For a one-operator pipeline, what does an increasing or stagnant `task_submission_backpressure_time` indicate? I don't think it necessarily means we're producing blocks faster than we consume them, because the running tasks themselves could just be slow, right?

I think Raul kind of mentions in this comment what I'd like to capture: when do we apply backpressure to running tasks, and how can we measure that?
> For a one-operator pipeline, what does an increasing or stagnant task_submission_backpressure_time indicate? I don't think it would mean that we're necessarily producing blocks faster than consuming them, because the running tasks themselves could just be slow, right?

Yeah, the reason could be several things, but the underlying cause is that there are too many tasks queued for the operator, so we don't want to add any more.

This is where we evaluate whether the op should be backpressured. Older Ray versions may look slightly different, but essentially `backpressure_policy.can_add_input()` returns False if backpressure is to be applied. When backpressure is applied, the operator will no longer accept new inputs.
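A condensed, self-contained sketch of how that on/off decision could feed a cumulative backpressure timer (illustrative names only, not the actual implementation in `op_runtime_metrics.py`):

```python
import time
from typing import Optional


class BackpressureClock:
    """Hypothetical bookkeeping for cumulative time an operator is refused new inputs."""

    def __init__(self) -> None:
        self.task_submission_backpressure_time = 0.0
        self._backpressured_since: Optional[float] = None

    def record(self, can_add_input: bool) -> None:
        now = time.perf_counter()
        if not can_add_input and self._backpressured_since is None:
            self._backpressured_since = now  # backpressure just started
        elif can_add_input and self._backpressured_since is not None:
            self.task_submission_backpressure_time += now - self._backpressured_since
            self._backpressured_since = None  # backpressure just ended
```

Each scheduling tick would then call something like `clock.record(all(p.can_add_input(op) for p in backpressure_policies))`, so the timer only grows while every dispatch attempt is being refused.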
)
self.iter_next_batch_s = Gauge(
    "data_iter_next_batch_seconds",
    description="Seconds spent getting next batch",  # Need a better description for this?
A slightly more descriptive line could be something like:

-    description="Seconds spent getting next batch",  # Need a better description for this?
+    description="Seconds spent getting next batch from internal batch builder",
self.iter_total_s = Gauge(
    "data_iter_total_seconds",
    description="Total time spent in iteration",
    tag_keys=iter_tag_keys,
)
Did you want to also add a chart for this, or just track the metric via the class and access it elsewhere? I didn't see one; I may have missed it.
self.streaming_exec_schedule_s = Gauge(
    "data_streaming_exec_schedule_seconds",
    description="Seconds spent streaming executor scheduling",
    tag_keys=iter_tag_keys,
)
Same for this: did you want to add a chart, or just track the metric via the class and access it elsewhere?
self.iter_collate_batch_s.set(stats.iter_collate_batch_s.get(), tags)
self.iter_finalize_batch_s.set(stats.iter_finalize_batch_s.get(), tags)
self.streaming_split_coordinator_s.set(stats.streaming_split_coordinator_s.get(), tags)
self.streaming_exec_schedule_s.set(stats.streaming_exec_schedule_s.get(), tags)  # Where should this live?
Since this metric is tracked for the overall execution of the entire dataset, and is not related to iteration, `update_execution_metrics()` would be a more appropriate place to update it.
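Roughly what that move could look like, as a trimmed-down, hypothetical stand-in for the stats actor (the real method takes more arguments):

```python
from ray.util.metrics import Gauge


class StatsActorSketch:
    """Trimmed-down, hypothetical stand-in for the Data stats actor."""

    def __init__(self) -> None:
        self.streaming_exec_schedule_s = Gauge(
            "data_streaming_exec_schedule_seconds",
            description="Seconds spent streaming executor scheduling",
            tag_keys=("dataset",),
        )

    def update_execution_metrics(self, stats, tags):
        # Execution-scoped: set it here alongside the other per-execution
        # metrics, not in the iteration update path.
        self.streaming_exec_schedule_s.set(stats.streaming_exec_schedule_s.get(), tags)
```

The gauge itself is unchanged; only the place where it gets `set()` moves.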
self.iter_total_s.set(0, tags)
self.iter_wait_s.set(0, tags)
self.iter_get_s.set(0, tags)
self.iter_next_batch_s.set(0, tags)
self.iter_format_batch_s.set(0, tags)
self.iter_collate_batch_s.set(0, tags)
self.iter_finalize_batch_s.set(0, tags)
self.streaming_split_coordinator_s.set(0, tags)
For the iteration-related metrics, let's move them to `clear_iteration_metrics()`.
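For example, the grouped reset could look roughly like this (simplified, assumed signature; `tags` built the same way as in the other update methods):

```python
def clear_iteration_metrics(self, tags):
    """Sketch of a grouped reset on the stats actor: zero every
    iteration-scoped gauge in one place when iteration finishes."""
    for gauge in (
        self.iter_total_s,
        self.iter_wait_s,
        self.iter_get_s,
        self.iter_next_batch_s,
        self.iter_format_batch_s,
        self.iter_collate_batch_s,
        self.iter_finalize_batch_s,
        self.streaming_split_coordinator_s,
    ):
        gauge.set(0, tags)
```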
self.iter_collate_batch_s.set(0, tags)
self.iter_finalize_batch_s.set(0, tags)
self.streaming_split_coordinator_s.set(0, tags)
self.streaming_exec_schedule_s.set(0, tags)  # Where should this live?
this LGTM
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.