Metrics#
Created On: May 04, 2021 | Last Updated On: May 04, 2021
Metrics API.
Overview:
The metrics API in torchelastic is used to publish telemetry metrics. It is designed to be used by torchelastic’s internal modules to publish metrics for the end user with the goal of increasing visibility and helping with debugging. However, you may use the same API in your jobs to publish metrics to the same metrics sink.
A metric can be thought of as timeseries data and is uniquely identified by the string-valued tuple (metric_group, metric_name).
torchelastic makes no assumptions about what a metric_group is and what relationship it has with metric_name. It is totally up to the user to use these two fields to uniquely identify a metric.
Note
The metric group torchelastic is reserved by torchelastic for platform level metrics that it produces. For instance, torchelastic may output the latency (in milliseconds) of a re-rendezvous operation from the agent as (torchelastic, agent.rendezvous.duration.ms).
A sensible way to use metric groups is to map them to a stage or module in your job. You may also encode certain high level properties of the job, such as the region or stage (dev vs prod).
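As a minimal sketch of the naming idea above, the snippet below encodes a deployment stage and a module name into the metric group. The `group_for` helper, the `RecordingHandler` class, and the group and metric names are hypothetical and exist only for illustration; the handler simply keeps emitted metrics in memory so the result can be inspected.

```python
import torch.distributed.elastic.metrics as metrics


class RecordingHandler(metrics.MetricHandler):
    """Hypothetical handler that keeps emitted metrics in memory."""

    def __init__(self):
        self.records = []

    def emit(self, metric_data):
        self.records.append(
            (metric_data.group_name, metric_data.name, metric_data.value)
        )


def group_for(stage, module):
    # Hypothetical convention: "<stage>.<module>", e.g. "dev.trainer".
    return f"{stage}.{module}"


handler = RecordingHandler()
metrics.configure(handler, group=group_for("dev", "trainer"))
metrics.configure(handler, group=group_for("prod", "trainer"))

# Same metric name, different groups: two distinct metrics.
metrics.put_metric("epoch.count", 1, group_for("dev", "trainer"))
metrics.put_metric("epoch.count", 5, group_for("prod", "trainer"))
```

Because a metric is identified by the (metric_group, metric_name) pair, the two `epoch.count` values stay separate even though they share a name.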
Publish Metrics:
Using torchelastic’s metrics API is similar to using Python’s logging framework. You first have to configure a metrics handler before trying to add metric data.
The example below measures the latency for the calculate() function.
    import time
    import torch.distributed.elastic.metrics as metrics

    # send all metrics to /dev/null, except those in the "my_module" group,
    # which are printed to the console
    metrics.configure(metrics.NullMetricHandler())
    metrics.configure(metrics.ConsoleMetricHandler(), "my_module")

    def my_method():
        start = time.time()
        calculate()
        end = time.time()
        metrics.put_metric("calculate_latency", int(end - start), "my_module")
You may also use the torch.distributed.elastic.metrics.prof decorator to conveniently and succinctly profile functions:
    # -- in module examples.foobar --

    import torch.distributed.elastic.metrics as metrics

    metrics.configure(metrics.ConsoleMetricHandler(), "foobar")
    metrics.configure(metrics.ConsoleMetricHandler(), "Bar")

    @metrics.prof
    def foo():
        pass

    class Bar:
        @metrics.prof
        def baz(self):
            pass
@metrics.prof will publish the following metrics:

    <leaf_module or classname>.success - 1 if the function finished successfully
    <leaf_module or classname>.failure - 1 if the function threw an exception
    <leaf_module or classname>.duration.ms - function duration in milliseconds
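To see these metric names in practice, the sketch below profiles one method that succeeds and one function that raises, and collects the emitted names with a small in-memory handler. `RecordingHandler`, `Bar`, `failing`, and the `my_app` group are hypothetical names chosen for illustration. The exact prefix of a name depends on where the function is defined (class name or leaf module name), so the comments only describe name suffixes.

```python
import torch.distributed.elastic.metrics as metrics


class RecordingHandler(metrics.MetricHandler):
    """Hypothetical handler that records the names of emitted metrics."""

    def __init__(self):
        self.names = []

    def emit(self, metric_data):
        self.names.append(metric_data.name)


handler = RecordingHandler()
metrics.configure(handler, group="my_app")


class Bar:
    @metrics.prof(group="my_app")
    def baz(self):
        return 42


@metrics.prof(group="my_app")
def failing():
    raise ValueError("boom")


result = Bar().baz()  # emits names ending in Bar.baz.success and Bar.baz.duration.ms

try:
    failing()  # emits names ending in failing.failure and failing.duration.ms
except ValueError:
    pass  # the decorator re-raises the original exception

for name in handler.names:
    print(name)
```

Note that the decorated function's return value and exceptions pass through unchanged; the decorator only adds the metric side effects.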
Configuring Metrics Handler:
torch.distributed.elastic.metrics.MetricHandler is responsible for emitting the added metric values to a particular destination. Metric groups can be configured with different metric handlers.
By default torchelastic emits all metrics to /dev/null. With the following configuration, metrics in the torchelastic and my_app metric groups are printed to the console.
    import torch.distributed.elastic.metrics as metrics

    metrics.configure(metrics.ConsoleMetricHandler(), group="torchelastic")
    metrics.configure(metrics.ConsoleMetricHandler(), group="my_app")
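The routing between handlers can be sketched as follows: a handler configured without a group acts as the fallback for every group that has no handler of its own. `RecordingHandler` and the group and metric names here are hypothetical, used only so the routing can be checked in memory instead of on the console.

```python
import torch.distributed.elastic.metrics as metrics


class RecordingHandler(metrics.MetricHandler):
    """Hypothetical handler that remembers (group, name) pairs it receives."""

    def __init__(self):
        self.seen = []

    def emit(self, metric_data):
        self.seen.append((metric_data.group_name, metric_data.name))


default_handler = RecordingHandler()
app_handler = RecordingHandler()

# No group given: this handler is used for all groups without their own handler.
metrics.configure(default_handler)
# Group given: only metrics in "my_app" go to this handler.
metrics.configure(app_handler, group="my_app")

metrics.put_metric("requests", 1, "my_app")
metrics.put_metric("requests", 1, "some_other_group")

print("my_app handler:", app_handler.seen)
print("default handler:", default_handler.seen)
```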
Writing a Custom Metric Handler:
If you want your metrics to be emitted to a custom location, implement the torch.distributed.elastic.metrics.MetricHandler interface and configure your job to use your custom metric handler.
Below is a toy example that prints the metrics to stdout:
    import torch.distributed.elastic.metrics as metrics

    class StdoutMetricHandler(metrics.MetricHandler):
        def emit(self, metric_data):
            ts = metric_data.timestamp
            group = metric_data.group_name
            name = metric_data.name
            value = metric_data.value
            print(f"[{ts}][{group}]:{name}={value}")

    metrics.configure(StdoutMetricHandler(), group="my_app")
Now all metrics in the group my_app will be printed to stdout as:

    [1574213883.4182858][my_app]:my_metric=<value>
    [1574213940.5237644][my_app]:my_metric=<value>
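The same pattern extends to other destinations. As one possible variation, the sketch below writes each metric as a JSON line to any file-like object. `JsonLinesMetricHandler`, its `stream` argument, and the JSON field names are assumptions made for this example and are not part of torchelastic; an in-memory buffer stands in for a real file.

```python
import io
import json

import torch.distributed.elastic.metrics as metrics


class JsonLinesMetricHandler(metrics.MetricHandler):
    """Hypothetical handler: writes one JSON object per metric to a stream."""

    def __init__(self, stream):
        self._stream = stream

    def emit(self, metric_data):
        record = {
            "ts": metric_data.timestamp,
            "group": metric_data.group_name,
            "name": metric_data.name,
            "value": metric_data.value,
        }
        self._stream.write(json.dumps(record) + "\n")


buffer = io.StringIO()  # stand-in for an open file or socket wrapper
metrics.configure(JsonLinesMetricHandler(buffer), group="my_app")

metrics.put_metric("my_metric", 7, "my_app")

# Parse the first line back to inspect what was written.
first_record = json.loads(buffer.getvalue().splitlines()[0])
print(first_record["group"], first_record["name"], first_record["value"])
```

Keeping the destination behind a plain file-like object makes the handler easy to exercise without touching the filesystem.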
Metric Handlers#
Below are the metric handlers that come included with torchelastic.
Methods#
- torch.distributed.elastic.metrics.prof(fn=None, group='torchelastic')
@profile decorator publishes duration.ms, count, success, failure metrics for the function that it decorates.
The metric name defaults to the qualified name (class_name.def_name) of the function. If the function does not belong to a class, it uses the leaf module name instead.
Usage

    @metrics.prof
    def x():
        pass

    @metrics.prof(group="agent")
    def y():
        pass