PyTorch Distributed Overview#
Created On: Jul 28, 2020 | Last Updated: Jul 20, 2025 | Last Verified: Nov 05, 2024
Author: Will Constable, Wei Feng
This is the overview page for the torch.distributed package. The goal of this page is to categorize documents into different topics and briefly describe each of them. If this is your first time building distributed training applications using PyTorch, it is recommended to use this document to navigate to the technology that can best serve your use case.
Introduction#
The PyTorch Distributed library includes a collection of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.
Parallelism APIs#
These Parallelism Modules offer high-level functionality and compose with existing models:
Sharding primitives#
DTensor and DeviceMesh are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.
DTensor represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
DeviceMesh abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ProcessGroup instances for collective communications in multi-dimensional parallelisms. Try out our Device Mesh Recipe to learn more.
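As a minimal sketch of these primitives (not from this tutorial), assuming a script launched with torchrun on four processes arranged as a 2 x 2 mesh and the public torch.distributed.tensor API available in recent PyTorch releases:

    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import distribute_tensor, Replicate, Shard

    # Arrange 4 processes as a 2 x 2 mesh with named dimensions.
    mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

    # Replicate across the "dp" dimension and shard rows across the "tp"
    # dimension; DTensor inserts the required communication automatically.
    big_tensor = torch.randn(8, 16)
    dtensor = distribute_tensor(big_tensor, mesh_2d, placements=[Replicate(), Shard(0)])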
Communications APIs#
The PyTorch distributed communication layer (C10D) offers both collective communication APIs (e.g., all_reduce and all_gather) and P2P communication APIs (e.g., send and isend), which are used under the hood in all of the parallelism implementations. Writing Distributed Applications with PyTorch shows examples of using c10d communication APIs.
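A minimal sketch of these APIs, assuming the script is launched with torchrun so that init_process_group can read its configuration from the environment:

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")  # use "nccl" for GPU training
    rank = dist.get_rank()

    # Collective: every rank contributes its tensor and receives the sum.
    t = torch.ones(2) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # Point-to-point: rank 0 sends to rank 1.
    if rank == 0:
        dist.send(torch.zeros(2), dst=1)
    elif rank == 1:
        buf = torch.empty(2)
        dist.recv(buf, src=0)

    dist.destroy_process_group()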
Launcher#
torchrun is a widely used launcher script that spawns processes on the local and remote machines for running distributed PyTorch programs.
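A typical usage pattern, sketched under the assumption of one GPU per process (train.py is a placeholder for your training script, and the exact flags depend on your cluster setup):

    # Single node, 4 GPUs:
    #   torchrun --nproc_per_node=4 train.py
    # Two nodes, 8 GPUs each (run the same command on every node):
    #   torchrun --nnodes=2 --nproc_per_node=8 \
    #       --rdzv_backend=c10d --rdzv_endpoint=<host>:29500 train.py

    # Inside train.py, torchrun provides RANK, LOCAL_RANK, WORLD_SIZE,
    # MASTER_ADDR, and MASTER_PORT, so env-based initialization just works:
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))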
Applying Parallelism To Scale Your Model#
Data Parallelism is a widely adopted single-program multiple-data training paradigm where the model is replicated on every process, every model replica computes local gradients for a different set of input data samples, and gradients are averaged within the data-parallel communicator group before each optimizer step.
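A minimal DDP training step illustrating this paradigm, sketched assuming a launch via torchrun with one GPU per process (the model, data, and hyperparameters are placeholders):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda()
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # Each rank consumes a different shard of the data; DDP averages the
    # gradients across ranks during backward().
    inputs = torch.randn(32, 10, device="cuda")
    labels = torch.randn(32, 10, device="cuda")
    loss = torch.nn.functional.mse_loss(ddp_model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()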
Model Parallelism techniques (or Sharded Data Parallelism) are required when a model does not fit on a single GPU, and they can be combined to form multi-dimensional (N-D) parallelism techniques.
When deciding what parallelism techniques to choose for your model, use these common guidelines:
Use DistributedDataParallel (DDP) if your model fits on a single GPU but you want to easily scale up training using multiple GPUs.
Use torchrun to launch multiple PyTorch processes if you are using more than one node.
Use FullyShardedDataParallel (FSDP2) when your model cannot fit on one GPU (see the sketch after these guidelines).
See also: Getting Started with FSDP2
Use Tensor Parallel (TP) and/or Pipeline Parallel (PP) if you reach scaling limitations with FSDP2.
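For reference, a minimal FSDP2 sketch, assuming the fully_shard API exported from torch.distributed.fsdp in recent PyTorch releases (the import path may differ in older versions) and a launch via torchrun:

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import fully_shard

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.Linear(1024, 1024),
    ).cuda()

    # Shard each submodule, then the root, so parameters are partitioned
    # across the data-parallel group instead of being fully replicated.
    for layer in model:
        fully_shard(layer)
    fully_shard(model)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)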
Note
Data-parallel training also works with Automatic Mixed Precision (AMP).
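For example, a training step that continues the DDP sketch above using the torch.amp autocast/GradScaler APIs (ddp_model, optimizer, inputs, and labels are the placeholders defined there):

    # Mixed-precision forward/backward inside a DDP step.
    scaler = torch.amp.GradScaler("cuda")
    optimizer.zero_grad()
    with torch.amp.autocast("cuda"):
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), labels)
    scaler.scale(loss).backward()  # DDP still averages gradients across ranks
    scaler.step(optimizer)
    scaler.update()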
PyTorch Distributed Developers#
If you’d like to contribute to PyTorch Distributed, refer to our Developer Guide.