
torchft

| Documentation | Poster | Design Doc |

PyPI - Version


This repository implements techniques for per-step fault tolerance so you can keep training if errors occur, without interrupting the entire training job.

This is based on the large scale training techniques presented at PyTorch Conference 2024.

Overview

torchft is designed to provide the primitives required to implement fault tolerance in any application or training script, as well as the primitives needed to implement custom fault tolerance strategies.

Out of the box, torchft provides the following algorithms:

  • Fault Tolerant DDP
  • Fault Tolerant HSDP: fault tolerance across the replicated dimension with any mix of FSDP/TP/etc across the other dimensions.
  • LocalSGD
  • DiLoCo

To implement these, torchft provides some key reusable components (a conceptual sketch of how they fit together follows the list):

  1. Coordination primitives that can determine which workers are healthy via heartbeating on a per-step basis.
  2. Fault tolerant ProcessGroup implementations that report errors sanely and can be reinitialized gracefully.
  3. Checkpoint transports that can be used to do live recovery from a healthy peer during scale-up operations.
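To make the relationship concrete, the rough shape of a per-step fault tolerant loop built from these pieces is sketched below. This is a conceptual sketch only: the objects and method names (get_quorum, reconfigure, recover_from_peer) are hypothetical placeholders standing in for the coordination, process group, and checkpoint transport primitives, not the torchft API (see Usage below for a real example).

# Conceptual sketch only: `coordinator`, `pg`, and `transport` stand in for the
# three primitives above, and their methods are hypothetical placeholder names,
# not the torchft API.
def fault_tolerant_train(model, optimizer, data_loader, coordinator, pg, transport):
    for step, batch in enumerate(data_loader):
        # 1. Coordination: heartbeats decide which replicas are healthy and
        #    form the quorum for this step.
        quorum = coordinator.get_quorum(step)

        # 2. If membership changed, reconfigure the fault tolerant
        #    ProcessGroup instead of tearing down the whole job.
        if quorum.changed:
            pg.reconfigure(quorum.members)

        # 3. Replicas that just joined (or are healing) pull weights live
        #    from a healthy peer over the checkpoint transport.
        if quorum.recovering:
            transport.recover_from_peer(model, optimizer)

        # 4. Normal training step; gradients are only averaged across the
        #    replicas in the current quorum.
        optimizer.zero_grad()
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()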

The following component diagram shows the high level components and how they relate to each other:

Component Diagram

See torchft's documentation for more details.

Examples

torchtitan (Fault Tolerant HSDP)

torchtitan provides an out-of-the-box fault tolerant HSDP training loop built on top of torchft that can be used to train models such as Llama 3 70B.

It also serves as a good example of how you can integrate torchft into your own training script for use with HSDP.

See torchtitan's documentation for end-to-end usage.

Fault Tolerant DDP

We have a minimal DDP train loop that highlights all of the key components in torchft.

See train_ddp.py for more info.

DiLoCo

LocalSGD and DiLoCo are currently experimental.

See the diloco_train_loop/local_sgd_train_loop tests for an example of how to integrate these algorithms into your training loop.
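For background on the algorithms themselves (independent of torchft's implementation): LocalSGD lets each replica take many purely local optimizer steps and only periodically synchronizes weights across replicas, while DiLoCo additionally applies an outer optimizer to the deltas accumulated between synchronizations. A minimal illustration of the LocalSGD synchronization step using plain torch.distributed, not the torchft API, is:

# Illustration of the LocalSGD idea using plain torch.distributed;
# this is not torchft's implementation (see the linked tests for that).
import torch
import torch.distributed as dist
from torch import nn


def maybe_sync_local_sgd(model: nn.Module, step: int, sync_every: int = 32) -> None:
    """Average parameters across all replicas every `sync_every` steps;
    in between, each replica trains purely locally."""
    if step % sync_every != 0:
        return
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in model.parameters():
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
            param.data /= world_size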

Design

torchft is designed to allow for fault tolerance when training with replicated weights, such as in DDP or HSDP (FSDP combined with DDP).

See the design doc for the most detailed explanation.

Lighthouse

torchft implements a lighthouse server that coordinates across the different replica groups, plus a per-replica-group manager and fault tolerance library that can be used in a standard PyTorch training loop.

This allows for membership changes at training step granularity, which can greatly improve efficiency by avoiding stop-the-world pauses in training on errors.

Lighthouse Diagram

Fault Tolerant HSDP Algorithm

torchft provides an implementation of a fault tolerant HSDP/DDP algorithm. The following diagram shows the high level operations that need to happen in the train loop to ensure everything stays consistent during a healing operation.

HSDP Diagram

See the design doc linked above for more details.
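To summarize the intent of the diagram in code: the key constraints are roughly that the quorum is established before gradients are exchanged, healing replicas first recover weights from a peer, and the optimizer update is only applied when the whole quorum completed the step. The sketch below is illustrative only; the method names are placeholders, not the torchft API.

# Illustrative only; method names are placeholders, not the torchft API.
def hsdp_step(manager, model, optimizer, batch):
    # Establish the quorum first so a replica that just joined can be sent
    # the current weights before any gradients are exchanged.
    manager.start_quorum()

    optimizer.zero_grad()
    loss = model(batch).sum()
    loss.backward()  # replicated-dimension allreduce covers only healthy replicas

    # Apply the update only if every replica in the quorum finished the step;
    # otherwise skip it so all replicas stay consistent.
    if manager.step_succeeded():
        optimizer.step()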

Installing from PyPI

We have nightly builds available at https://pypi.org/project/torchft-nightly/.

To install torchft with minimal dependencies you can run:

pip install torchft-nightly

If you want all development dependencies you can install:

pip install torchft-nightly[dev]

Installing from Source

Prerequisites

Before proceeding, ensure you have the following installed:

  • Rust (with necessary dependencies)
  • protobuf-compiler and the corresponding development package for Protobuf.
  • PyTorch 2.7 RC+ or Nightly

Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website, as shown in the command below:

curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
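After rustup finishes, cargo may not yet be on the PATH of your current shell; sourcing the environment file that rustup writes (or opening a new shell) fixes this:

source "$HOME/.cargo/env"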

To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:

sudo apt install protobuf-compiler libprotobuf-dev

or for a Red Hat-based system, run:

sudo dnf install protobuf-compiler protobuf-devel

Installation

pip install .

This uses pyo3+maturin to build the package; you'll need maturin installed.
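If maturin is not already available in your environment, it can be installed from PyPI first:

pip install maturin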

If the installation command fails to invoke cargo update due to an inability to fetch the manifest, it may be caused by the proxy, proxySSLCert, and proxySSLKey settings in your .gitconfig file affecting the cargo command. To resolve this issue, try temporarily removing these fields from your .gitconfig before running the installation command.

To install in editable mode with the Rust extensions and development dependencies, you can use the normal pip install command:

pip install -e '.[dev]'

Usage

Lighthouse

The lighthouse is used for fault tolerance across replicated workers (DDP/FSDP) when using synchronous training.

You can start a lighthouse server by running:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

Example Training Loop (DDP)

See train_ddp.py for the full example.

Invoke with:

TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py

train.py:

import torch
from torch import nn, optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

device = "cuda" if torch.cuda.is_available() else "cpu"  # device selection assumed for this snippet

manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,  # callable that loads a state dict on recovery
    state_dict=...,       # callable that returns the current state dict
)

m = nn.Linear(2, 3)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

for i in range(1000):
    batch = torch.rand(2, 2, device=device)

    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()

    loss.backward()

    optimizer.step()

Running DDP

After starting the lighthouse server by running:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

A test DDP script can be launched with torchX with:

torchx run

Or DiLoCo with:

USE_STREAMING=True torchx run ./torchft/torchx.py:hsdp --script='train_diloco.py'

See .torchxconfig, torchx.py, and the torchX documentation to understand how DDP is being run.

torchx.py can also launch HSDP jobs when workers_per_replica is set > 1, if the training script supports it. For an example HSDP training implementation with torchft enabled, see torchtitan.
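As a purely hypothetical example (the script name and argument passing here are assumptions, not a documented invocation), an HSDP launch might look like:

torchx run ./torchft/torchx.py:hsdp --script='your_hsdp_train.py' --workers_per_replica=2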

Alternatively, to test on a node with two GPUs, you can launch two replica groups running train_ddp.py as follows:

On shell 1 (the first replica group starts initial training):

export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

On shell 2 (a second replica group joins):

export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

By watching the outputs from both shells, you should observe process group reconfiguration and live checkpoint recovery.

Example Parameter Server

torchft has a fault tolerant parameter server implementation built on its reconfigurable ProcessGroups. This does not require or use a Lighthouse server.

See parameter_server_test.py for an example.

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

torchft is BSD 3-Clause licensed. See LICENSE for more details.
