
Quickstart


To launch a fault-tolerant job, run the following on all nodes.

torchrun \
    --nnodes=NUM_NODES \
    --nproc-per-node=TRAINERS_PER_NODE \
    --max-restarts=NUM_ALLOWED_FAILURES \
    --rdzv-id=JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
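
As a concrete sketch, a fault-tolerant job on four nodes with eight trainers per node could be launched as follows; the node count, trainer count, job id, endpoint host, script name, and script arguments below are placeholder assumptions, not values prescribed by this page:

# hypothetical values for illustration: node/trainer counts, job id,
# endpoint host, and train.py are assumptions
torchrun \
    --nnodes=4 \
    --nproc-per-node=8 \
    --max-restarts=3 \
    --rdzv-id=job_42 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=node1.example.com:29400 \
    train.py --batch-size 32

The same command is run on all four nodes; the rendezvous backend on node1.example.com coordinates them into one job.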

To launch an elastic job, run the following on at least MIN_SIZE nodes and at most MAX_SIZE nodes.

torchrun \
    --nnodes=MIN_SIZE:MAX_SIZE \
    --nproc-per-node=TRAINERS_PER_NODE \
    --max-restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES \
    --rdzv-id=JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ...train script args...)
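
Similarly, a hypothetical elastic job that can keep running with anywhere from two to four nodes could be launched on each participating node like this; all concrete values are again assumptions for illustration:

# hypothetical values for illustration: size range, job id,
# endpoint host, and train.py are assumptions
torchrun \
    --nnodes=2:4 \
    --nproc-per-node=8 \
    --max-restarts=3 \
    --rdzv-id=job_42 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=node1.example.com:29400 \
    train.py --batch-size 32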

Note

TorchElastic models failures as membership changes. When a node fails, this is treated as a “scale down” event. When the failed node is replaced by the scheduler, it is a “scale up” event. Hence for both fault-tolerant and elastic jobs, --max-restarts is used to control the total number of restarts before giving up, regardless of whether the restart was caused by a failure or a scaling event.

HOST_NODE_ADDR, in the form <host>[:<port>] (e.g. node1.example.com:29400), specifies the node and the port on which the C10d rendezvous backend should be instantiated and hosted. It can be any node in your training cluster, but ideally you should pick a node with high bandwidth.

Note

If no port number is specified, HOST_NODE_ADDR defaults to port 29400.
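
For example, the following two settings are equivalent, since the port defaults to 29400 when omitted:

--rdzv-endpoint=node1.example.com:29400
--rdzv-endpoint=node1.example.com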

Note

The --standalone option can be passed to launch a single-node job with a sidecar rendezvous backend. You don’t have to pass --rdzv-id, --rdzv-endpoint, and --rdzv-backend when the --standalone option is used.
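
For example, a single-node job with four workers can be launched with just:

# --standalone runs a local sidecar rendezvous backend, so no --rdzv-* flags are needed
torchrun --standalone --nnodes=1 --nproc-per-node=4 YOUR_TRAINING_SCRIPT.py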

Note

Learn more about writing your distributed training script here.

If torchrun does not meet your requirements, you may use our APIs directly for more powerful customization. Start by taking a look at the elastic agent API.