
Quickstart


To launch a fault-tolerant job, run the following on all nodes.

torchrun \
    --nnodes=NUM_NODES \
    --nproc-per-node=TRAINERS_PER_NODE \
    --max-restarts=NUM_ALLOWED_FAILURES \
    --rdzv-id=JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
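
As a concrete sketch, a fault-tolerant job on four nodes with eight trainers per node could be launched as follows; the node count, trainer count, job id, endpoint host, script name, and script arguments below are placeholder assumptions, not values prescribed by this page:

# hypothetical values for illustration: node/trainer counts, job id,
# endpoint host, and train.py are assumptions
torchrun \
    --nnodes=4 \
    --nproc-per-node=8 \
    --max-restarts=3 \
    --rdzv-id=job_42 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=node1.example.com:29400 \
    train.py --batch-size 32

The same command is run on all four nodes; the rendezvous backend on node1.example.com coordinates them into one job.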

To launch an elastic job, run the following on at least MIN_SIZE nodes and at most MAX_SIZE nodes.

torchrun \
    --nnodes=MIN_SIZE:MAX_SIZE \
    --nproc-per-node=TRAINERS_PER_NODE \
    --max-restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES \
    --rdzv-id=JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ...train script args...)
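
Similarly, a hypothetical elastic job that can keep running with anywhere from two to four nodes could be launched on each participating node like this; all concrete values are again assumptions for illustration:

# hypothetical values for illustration: size range, job id,
# endpoint host, and train.py are assumptions
torchrun \
    --nnodes=2:4 \
    --nproc-per-node=8 \
    --max-restarts=3 \
    --rdzv-id=job_42 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=node1.example.com:29400 \
    train.py --batch-size 32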

Note

TorchElastic models failures as membership changes. When a node fails, this is treated as a “scale down” event. When the failed node is replaced by the scheduler, it is a “scale up” event. Hence for both fault-tolerant and elastic jobs, --max-restarts is used to control the total number of restarts before giving up, regardless of whether the restart was caused by a failure or a scaling event.

HOST_NODE_ADDR, in the form <host>[:<port>] (e.g. node1.example.com:29400), specifies the node and the port on which the C10d rendezvous backend should be instantiated and hosted. It can be any node in your training cluster, but ideally you should pick a node with high bandwidth.

Note

If no port number is specified, HOST_NODE_ADDR defaults to port 29400.
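
For example, the following two settings are equivalent, since the port defaults to 29400 when omitted:

--rdzv-endpoint=node1.example.com:29400
--rdzv-endpoint=node1.example.com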

Note

The --standalone option can be passed to launch a single-node job with a sidecar rendezvous backend. You don’t have to pass --rdzv-id, --rdzv-endpoint, and --rdzv-backend when the --standalone option is used.
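
For example, a single-node job with four workers can be launched with just:

# --standalone runs a local sidecar rendezvous backend, so no --rdzv-* flags are needed
torchrun --standalone --nnodes=1 --nproc-per-node=4 YOUR_TRAINING_SCRIPT.py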

Note

Learn more about writing your distributed training script here.

If torchrun does not meet your requirements, you may use our APIs directly for more powerful customization. Start by taking a look at the elastic agent API.