DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

This repository contains the source code for Draco, a scalable framework for robust distributed training that uses ideas from coding theory. Please check https://arxiv.org/abs/1803.09877 for detailed information about this project.

Overview:

Draco is a scalable framework for robust distributed training that uses ideas from coding theory. In Draco, compute nodes evaluate redundant gradients that are then used by the parameter server (PS) to eliminate the effects of adversarial updates.

In Draco, each compute node processes rB/P gradients and sends a linear combination of those to the PS, which means that Draco incurs a computational redundancy ratio of r. Upon receiving the P gradient sums, the PS uses a "decoding" function to remove the effect of the adversarial nodes and reconstruct the original desired sum of the B gradients. With redundancy ratio r, we show that Draco can tolerate up to (r − 1)/2 adversaries, which is information theoretically tight. For example, with r = 3 each gradient is computed by three nodes, and Draco can tolerate a single adversarial node.
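To make the encode/decode step concrete, below is a minimal sketch of the simplest instance of this idea: a repetition code decoded by majority vote at the PS. It is illustrative only; the function name, the NumPy usage, and the consecutive-group layout are assumptions made for the example, and the repository also implements a cyclic-code variant.

```python
import numpy as np

def repetition_decode(worker_sums, group_size):
    """Illustrative majority-vote decoder for Draco's repetition scheme.

    worker_sums: list of P vectors; consecutive groups of r = group_size
    workers each report the same partial gradient sum (possibly corrupted
    by an adversary). Returns the reconstructed sum of all B gradients.
    """
    P = len(worker_sums)
    assert P % group_size == 0, "P must be divisible by the redundancy ratio r"
    total = np.zeros_like(worker_sums[0])
    for g in range(P // group_size):
        group = worker_sums[g * group_size:(g + 1) * group_size]
        # Majority vote: the value reported by most workers in the group wins.
        # With at most (r - 1)/2 adversaries in a group, the honest value wins.
        counts = {}
        for vec in group:
            key = vec.tobytes()
            count, _ = counts.get(key, (0, vec))
            counts[key] = (count + 1, vec)
        total += max(counts.values(), key=lambda cv: cv[0])[1]
    return total

# Toy run: P = 6 workers, r = 3, one adversarial worker in the first group.
sums = [np.ones(4)] * 3 + [2 * np.ones(4)] * 3
sums[1] = -100 * np.ones(4)                  # adversarial update
print(repetition_decode(sums, group_size=3)) # -> [3. 3. 3. 3.]
```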

Dependencies:

Tested stable dependencies:

  • python 2.7 (Anaconda)
  • PyTorch 0.3.0 (please note that we are moving to PyTorch 0.4.0 and 1.0.x)
  • torchvision 0.1.18
  • MPI4Py 0.3.0
  • python-blosc 1.5.0
  • hdmedians

We highly recommend installing an Anaconda environment. You will get a high-quality BLAS library (MKL) and a controlled compiler version regardless of your Linux distro.

We provide this script to help you build all dependencies. To do that, you can run:

bash ./tools/pre_run.sh

Cluster Setup:

To run on a distributed cluster, the first thing you need to do is launch AWS EC2 instances.

Launching Instances:

This script helps you launch EC2 instances automatically, but before running it you should follow the instructions to set up the AWS CLI on your local machine. After that, please edit this part in ./tools/pytorch_ec2.py:

cfg = Cfg({
    "name": "PS_PYTORCH",                       # Unique name for this specific configuration
    "key_name": "NameOfKeyFile",                # Necessary to ssh into created instances
    # Cluster topology
    "n_masters": 1,                             # Should always be 1
    "n_workers": 8,
    "num_replicas_to_aggregate": "8",           # deprecated, not necessary
    "method": "spot",
    # Region specification
    "region": "us-west-2",
    "availability_zone": "us-west-2b",
    # Machine type - instance type configuration.
    "master_type": "m4.2xlarge",
    "worker_type": "m4.2xlarge",
    # please only use this AMI for pytorch
    "image_id": "ami-xxxxxxxx",                 # id of AMI
    # Launch specifications
    "spot_price": "0.15",                       # Has to be a string
    # SSH configuration
    "ssh_username": "ubuntu",                   # For sshing. E.g.: ssh ssh_username@hostname
    "path_to_keyfile": "/dir/to/NameOfKeyFile.pem",
    # NFS configuration
    # To set up these values, go to Services > ElasticFileSystem > Create new filesystem, and follow the directions.
    # "nfs_ip_address" : "172.31.3.173",        # us-west-2c
    # "nfs_ip_address" : "172.31.35.0",         # us-west-2a
    "nfs_ip_address": "172.31.14.225",          # us-west-2b
    "nfs_mount_point": "/home/ubuntu/shared",   # NFS base dir

The easiest way to set everything up on the EC2 cluster is to set up one machine and create an AMI from it. Then use the AMI id for image_id in pytorch_ec2.py, and launch the EC2 instances by running

python ./tools/pytorch_ec2.py launch

After all launched instances are ready (this may take a while), get the private IPs of the instances by running

python ./tools/pytorch_ec2.py get_hosts

This will write the IPs into a file named hosts_address, which looks like

172.31.16.226 (${PS_IP})
172.31.27.245
172.31.29.131
172.31.18.108
172.31.18.174
172.31.17.228
172.31.16.25
172.31.30.61
172.31.29.30

After generating the hosts_address file for all EC2 instances, running the following command will copy your keyfile to the parameter server (PS) instance, whose address is always the first one in hosts_address. local_script.sh will also do some basic configuration, e.g. cloning this git repo:

bash ./tools/local_script.sh ${PS_IP}
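If you want to pull ${PS_IP} and the worker addresses out of hosts_address programmatically rather than by hand, a minimal sketch follows; the helper name and the assumption that the file sits in the working directory are illustrative only.

```python
# Hypothetical helper: read hosts_address (one address per line, PS first)
# and split it into the PS IP and the worker IPs.
def read_hosts(path="hosts_address"):
    with open(path) as f:
        ips = [line.split()[0] for line in f if line.strip()]
    return ips[0], ips[1:]

ps_ip, worker_ips = read_hosts()
print("PS_IP =", ps_ip)
print("workers:", worker_ips)
```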

SSH related:

At this stage, you should ssh into the PS instance; all subsequent operations happen on the PS. In the PS setting, the PS must be able to ssh into any compute node, and this part does that job for you. After sshing into the PS, run

bash ./tools/remote_script.sh

Prepare Datasets

We currently support the MNIST and Cifar10 datasets. Download, split, and transform the datasets by running (./tools/remote_script.sh does this for you)

bash ./src/data_prepare.sh

Job Launching

Since this project is built on MPI, tasks must be launched from the PS (or master) instance. run_pytorch.sh wraps up the job-launching process. Commonly used options (arguments) are listed below:

Argument | Comments
--- | ---
n | Number of processes (size of the cluster), e.g. if we have P compute nodes and 1 PS, then n = P + 1.
hostfile | Path to the file containing the private IPs of every node in the cluster; we use hosts_address here, as mentioned before.
lr | Initial learning rate to be used.
momentum | Momentum value to be used.
network | Type of deep neural net; LeNet, ResNet-18/32/50/110/152, and VGGs are currently supported.
dataset | Dataset used for training.
batch-size | Batch size for the optimization algorithms.
comm-type | A placeholder parameter; please always set it to Bcast.
mode | Update mode used on the PS, e.g. geometric median, Krum, majority vote, etc.
approach | Approach used in the experiments, e.g. a baseline method or Draco (repetition code or cyclic code).
err-mode | Mode of the simulated adversaries; reverse-gradient and constant adversaries are currently supported.
adversarial | Magnitude of the adversarial updates.
worker-fail | Number of adversarial nodes simulated in the cluster.
group-size | Used specifically for the repetition code; sets the group size of workers.
max-steps | Maximum number of iterations to train.
epochs | Maximum number of epochs to train (somewhat redundant given max-steps).
eval-freq | Frequency (in iterations) at which to evaluate the model.
enable-gpu | Train on CPU/GPU; if using CPU, leave this argument empty.
train-dir | Directory in which to save model checkpoints for evaluation.
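To tie the options above back to the Overview, the sketch below checks one hypothetical configuration for internal consistency (n = P + 1, and worker-fail within the (r − 1)/2 bound for the repetition code). The values, the equal-group assumption, and the check itself are illustrative; run_pytorch.sh remains the authoritative place to set these arguments.

```python
# Illustrative sanity check for a hypothetical Draco run configuration; the
# option names mirror the table above, the logic is an assumption for clarity.
def check_config(n, worker_fail, group_size):
    P = n - 1                                   # n = P compute nodes + 1 PS
    assert P % group_size == 0, "repetition code assumes equal-sized groups"
    tolerated = (group_size - 1) // 2           # (r - 1)/2 bound from the Overview
    assert worker_fail <= tolerated, (
        f"worker-fail={worker_fail} exceeds the tolerable {tolerated} adversaries")
    return P, tolerated

P, t = check_config(n=10, worker_fail=1, group_size=3)
print(f"{P} compute nodes, 1 PS, tolerates up to {t} adversarial node(s)")
```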

Model Evaluation

The distributed evaluator fetches model checkpoints from the shared directory and evaluates the model on the validation set. To evaluate a model, you can run

bash ./src/evaluate_pytorch.sh

with specified arguments.

Evaluation arguments are listed below:

Argument | Comments
--- | ---
eval-batch-size | Batch size (on the validation set) used during model evaluation.
eval-freq | Frequency (in iterations) at which to evaluate the model; should be set to the same value as in run_pytorch.sh.
network | Type of deep neural net; should be set to the same value as in run_pytorch.sh.
dataset | Dataset used for training; should be set to the same value as in run_pytorch.sh.
model-dir | Directory from which model checkpoints are loaded for evaluation; should be set to the same value as in run_pytorch.sh.
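For intuition about what the evaluator does with these arguments, here is a minimal sketch of the kind of loop it runs: walk the checkpoint directory, load each snapshot, and report validation accuracy. The model class, the checkpoint naming pattern, the data loader, and the use of a recent PyTorch/Python 3 API are assumptions; ./src/evaluate_pytorch.sh and the script it wraps are the authoritative implementation.

```python
import glob
import torch

def evaluate_checkpoints(model, val_loader, model_dir, device="cpu"):
    """Illustrative sketch: evaluate every checkpoint saved under model_dir."""
    for ckpt_path in sorted(glob.glob(f"{model_dir}/*.pth")):  # naming pattern is an assumption
        model.load_state_dict(torch.load(ckpt_path, map_location=device))
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1)
                correct += (preds == labels.to(device)).sum().item()
                total += labels.size(0)
        print(f"{ckpt_path}: validation accuracy = {correct / total:.4f}")
```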

Future Work

These are potential directions we are actively working on; stay tuned!

  • Reduce the computational cost of Draco by only approximately recovering the desired gradient summation.
  • Explore other coding methods that achieve the same redundancy and computation lower bounds.
  • Move Draco to state-of-the-art PS (or distributed) frameworks, e.g. Ray or TensorFlow.

Citation

@inproceedings{Draco,
  author    = {Lingjiao Chen and Hongyi Wang and Zachary Charles and Dimitris Papailiopoulos},
  title     = {DRACO: Byzantine-resilient Distributed Training via Redundant Gradients},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning, {ICML} 2018},
  year      = {2018},
  month     = jul,
  url       = {https://arxiv.org/abs/1803.09877},
}
