JYWa/Overlap_Local_SGD

Implementation of (Overlap) Local SGD in PyTorch.

Code to reproduce the experiments reported in this paper:

Jianyu Wang, Hao Liang, Gauri Joshi, "Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD," ICASSP 2020. (arXiv)

This repo contains implementations of the following algorithms:

- Local SGD (including BMUF)
- Overlap-Local-SGD
- EASGD (Elastic Averaging SGD)
- CoCoD-SGD

Please cite this paper if you use this code for your research/projects.

Dependencies and Setup

The code runs on Python 3.5 with PyTorch 1.0.0 and torchvision 0.2.1. The non-blocking communication is implemented using the Python threading package.
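To illustrate what "non-blocking communication via threading" means here, below is a minimal sketch (not the repo's distoptim code) of running an all-reduce of a parameter snapshot in a background thread while the main thread keeps computing; the function name and return values are illustrative only.

```python
# Illustrative sketch only, assuming torch.distributed is already initialized.
# A background thread averages a snapshot of the parameters while the main
# thread continues local training; the caller later joins the thread and
# mixes the averaged snapshot back into the model.
import threading

import torch
import torch.distributed as dist


def start_async_average(model):
    """Launch an all-reduce of a parameter snapshot in a background thread."""
    snapshot = [p.detach().clone() for p in model.parameters()]

    def _communicate():
        world_size = dist.get_world_size()
        for t in snapshot:
            dist.all_reduce(t, op=dist.ReduceOp.SUM)  # blocks only this thread
            t.div_(world_size)                        # average across workers

    thread = threading.Thread(target=_communicate)
    thread.start()
    return thread, snapshot  # later: thread.join(), then use the averaged snapshot
```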

Training examples

We implement all of the above-mentioned algorithms as subclasses of torch.optim.Optimizer. A typical usage is shown below:

```python
import distoptim

# Before training: define the optimizer.
# One can use: 1) LocalSGD (including BMUF); 2) OverlapLocalSGD;
#              3) EASGD; 4) CoCoDSGD
# tau is the number of local updates / communication period
optimizer = distoptim.SELECTED_OPTIMIZER(tau)

...  # define model, criterion, logging, etc.

# Start training
for batch_id, (data, label) in enumerate(data_loader):
    # same as serial training
    output = model(data)             # forward
    loss = criterion(output, label)
    loss.backward()                  # backward
    optimizer.step()                 # gradient step
    optimizer.zero_grad()

    # additional line to average local models at workers
    # communication happens after every tau iterations
    # optimizer has its own iteration counter inside
    optimizer.average()
```

In addition, one needs to initialize the process group as described in the PyTorch distributed documentation. In our private cluster, each machine has one GPU.

```python
# backend = gloo or nccl
# rank: 0, 1, 2, 3, ...
# size: number of workers
# h0 is the host name of worker0, you need to change it
torch.distributed.init_process_group(backend=args.backend,
                                     init_method='tcp://h0:22000',
                                     rank=args.rank,
                                     world_size=args.size)
```
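The snippet above assumes args comes from a command-line parser. A minimal, hypothetical sketch of how those values could be supplied is shown below; the flag names are illustrative and not necessarily the repo's actual arguments.

```python
# Hypothetical argument parsing; the actual training scripts in this repo
# may use different flag names.
import argparse

import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--backend', type=str, default='gloo')  # gloo or nccl
parser.add_argument('--rank', type=int, required=True)      # 0, 1, 2, ...
parser.add_argument('--size', type=int, required=True)      # number of workers
args = parser.parse_args()

dist.init_process_group(backend=args.backend,
                        init_method='tcp://h0:22000',  # replace h0 with worker0's host name
                        rank=args.rank,
                        world_size=args.size)
```

Each worker would then be launched with its own rank, for example something like `python train.py --backend gloo --rank 0 --size 4` on worker 0 (command shown for illustration only).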

Citation

```
@article{wang2020overlap,
  title={Overlap Local-{SGD}: An Algorithmic Approach to Hide Communication Delays in Distributed {SGD}},
  author={Wang, Jianyu and Liang, Hao and Joshi, Gauri},
  journal={arXiv preprint arXiv:2002.09539},
  year={2020}
}
```
