Customization#
Created On: May 04, 2021 | Last Updated On: May 04, 2021
This section describes how to customize TorchElastic to fit your needs.
Launcher#
The launcher program that ships with TorchElastic should be sufficient for most use-cases (see torchrun (Elastic Launch)). You can implement a custom launcher by programmatically creating an agent and passing it specs for your workers as shown below.
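    # my_launcher.py
    import sys

    from torch.distributed.elastic.agent.server.api import WorkerSpec
    from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent
    from torch.distributed.elastic.rendezvous import RendezvousHandler

    # parse_args and trainer_entrypoint_fn are user-defined.

    if __name__ == "__main__":
        args = parse_args(sys.argv[1:])
        rdzv_handler = RendezvousHandler(...)
        spec = WorkerSpec(
            local_world_size=args.nproc_per_node,
            fn=trainer_entrypoint_fn,
            args=(args.fn_args, ...),  # arguments passed to trainer_entrypoint_fn
            rdzv_handler=rdzv_handler,
            max_restarts=args.max_restarts,
            monitor_interval=args.monitor_interval,
        )
        agent = LocalElasticAgent(spec, start_method="spawn")
        try:
            run_result = agent.run()
            if run_result.is_failed():
                print(f"worker 0 failed with: {run_result.failures[0]}")
            else:
                print(f"worker 0 return value is: {run_result.return_values[0]}")
        except Exception as ex:
            # handle exception
            raise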
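The launcher above calls a `parse_args` helper that is not part of TorchElastic. A minimal sketch of one possible implementation follows, assuming the flag names mirror the attributes the launcher reads (`nproc_per_node`, `max_restarts`, `monitor_interval`, `fn_args`). The default values and help strings are illustrative assumptions, not TorchElastic defaults.

```python
# Hypothetical parse_args helper for my_launcher.py (not a TorchElastic API).
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Custom TorchElastic launcher")
    # Defaults below are placeholders chosen for this sketch.
    parser.add_argument("--nproc_per_node", type=int, default=1,
                        help="number of workers to start on this node")
    parser.add_argument("--max_restarts", type=int, default=3,
                        help="maximum number of worker group restarts")
    parser.add_argument("--monitor_interval", type=float, default=5.0,
                        help="seconds between worker health checks")
    # Remaining positional values are forwarded to the trainer entrypoint.
    parser.add_argument("fn_args", nargs="*",
                        help="arguments passed through to the trainer function")
    return parser.parse_args(argv)
```

For example, `parse_args(["--nproc_per_node", "4", "config.yaml"])` yields a namespace whose `nproc_per_node` is `4` and whose `fn_args` is `["config.yaml"]`.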
Rendezvous Handler#
To implement your own rendezvous, extend torch.distributed.elastic.rendezvous.RendezvousHandler and implement its methods.
Warning
Rendezvous handlers are tricky to implement. Before you begin, make sure you completely understand the properties of rendezvous. Please refer to Rendezvous for more information.
Once implemented, you can pass your custom rendezvous handler to the worker spec when creating the agent.
    spec = WorkerSpec(
        rdzv_handler=MyRendezvousHandler(params),
        ...
    )
    elastic_agent = LocalElasticAgent(spec, start_method=start_method)
    elastic_agent.run(spec.role)
Metric Handler#
TorchElastic emits platform-level metrics (see Metrics). By default, metrics are emitted to /dev/null, so you will not see them. To have the metrics pushed to a metric handling service in your infrastructure, implement a torch.distributed.elastic.metrics.MetricHandler and configure it in your custom launcher.
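    # my_launcher.py
    import torch.distributed.elastic.metrics as metrics
    from torch.distributed.elastic.agent.server.api import WorkerSpec
    from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent

    class MyMetricHandler(metrics.MetricHandler):
        def emit(self, metric_data: metrics.MetricData):
            # push metric_data to your metric sink
            pass

    def main():
        metrics.configure(MyMetricHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()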
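What the body of `emit` does depends entirely on your metric sink. As one illustration, the sketch below writes each metric as a JSON line to a text stream. To stay runnable without torch installed, it uses a stand-in `MetricData` namedtuple whose field names (`timestamp`, `group_name`, `name`, `value`) are assumed to match `torch.distributed.elastic.metrics.MetricData`. The class name `JsonLinesMetricSink` and the JSON keys are hypothetical.

```python
import json
import sys
from collections import namedtuple
from typing import TextIO

# Stand-in for torch.distributed.elastic.metrics.MetricData; the field names
# are an assumption made so this sketch runs without torch.
MetricData = namedtuple("MetricData", ["timestamp", "group_name", "name", "value"])


class JsonLinesMetricSink:
    """Hypothetical sink that writes each metric as one JSON line."""

    def __init__(self, stream: TextIO = sys.stderr) -> None:
        self._stream = stream

    def emit(self, metric_data: MetricData) -> None:
        record = {
            "ts": metric_data.timestamp,
            "metric": f"{metric_data.group_name}.{metric_data.name}",
            "value": metric_data.value,
        }
        # One JSON object per line keeps the output easy to tail and parse.
        self._stream.write(json.dumps(record, sort_keys=True) + "\n")
```

In a real launcher, the same `emit` body would live in a subclass of `metrics.MetricHandler`, registered with `metrics.configure(...)` before the agent runs.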
Events Handler#
TorchElastic supports events recording (see Events). The events module defines an API that allows you to record events and implement a custom EventHandler. An EventHandler publishes events produced during TorchElastic execution to different destinations, e.g. AWS CloudWatch. By default, TorchElastic uses torch.distributed.elastic.events.NullEventHandler, which ignores events. To configure a custom events handler, implement the torch.distributed.elastic.events.EventHandler interface and configure it in your custom launcher.
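    # my_launcher.py
    import torch.distributed.elastic.events as events
    from torch.distributed.elastic.agent.server.api import WorkerSpec
    from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent

    class MyEventHandler(events.EventHandler):
        def record(self, event: events.Event):
            # process event
            pass

    def main():
        events.configure(MyEventHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()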