Mesh TensorFlow: Model Parallelism Made Easier
Mesh TensorFlow (`mtf`) is a language for distributed deep learning, capable of specifying a broad class of distributed tensor computations. The purpose of Mesh TensorFlow is to formalize and implement distribution strategies for your computation graph over your hardware/processors. For example: "Split the batch over rows of processors and split the units in the hidden layer across columns of processors." Mesh TensorFlow is implemented as a layer over TensorFlow.
Watch our YouTube video.
If you just want data-parallel training (batch-splitting), then you do not need Mesh TensorFlow, though Mesh TensorFlow can do this. The most common reasons for more sophisticated parallel computation are:
- The parameters of the model do not fit on one device, e.g. a 5-billion-parameter language model.
- An example is so large that the activations do not fit on one device, e.g. a large 3D image model (`experimental/unet.py`).
- Lower-latency parallel inference (at batch size 1).
A "Mesh" is an n-dimensional array of processors, connected by a network.
Each tensor is distributed (split and/or replicated) across all processors in a mesh.
Tensor dimensions and mesh dimensions are named. The layouts of all tensors follow from a set of user-defined layout rules which specify which tensor-dimensions are split across which mesh-dimensions. This ensures that the corresponding dimensions in different tensors are split in the same manner.
Layouts do not affect results - only performance.
The implementation of an operation involves parallel computation on all processors in the mesh, and sometimes also collective communication. A processor usually just manipulates the slices of the input tensors already resident on that processor, and produces the slice of the output that goes on that processor.
To install the latest stable version, run

```sh
pip install mesh-tensorflow
```

To install the latest development version, run

```sh
pip install -e "git+https://github.com/tensorflow/mesh.git#egg=mesh-tensorflow"
```

Installing `mesh-tensorflow` does not automatically install or update TensorFlow. We recommend installing it via `pip install tensorflow` or `pip install tensorflow-gpu`. See TensorFlow's installation instructions for details. If you're using a development version of Mesh TensorFlow, you may need to use TensorFlow's nightly package (`tf-nightly`).
To illustrate, let us consider a simple model for the MNIST image-classification task. Our network has one hidden layer with 1024 units, and an output layer with 10 units (corresponding to the 10 digit classes).
The code consists of two parts, the first describing the mathematical operations, and the second describing the devices and tensor/computation layout. For the full example, see `examples/mnist.py`.
TODO(noam): verify that this code works.

```python
import mesh_tensorflow as mtf

# tf_images is a tf.Tensor with shape [100, 28, 28] and dtype tf.float32
# tf_labels is a tf.Tensor with shape [100] and dtype tf.int32
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")
batch_dim = mtf.Dimension("batch", 100)
rows_dim = mtf.Dimension("rows", 28)
cols_dim = mtf.Dimension("cols", 28)
hidden_dim = mtf.Dimension("hidden", 1024)
classes_dim = mtf.Dimension("classes", 10)
images = mtf.import_tf_tensor(
    mesh, tf_images, shape=[batch_dim, rows_dim, cols_dim])
labels = mtf.import_tf_tensor(mesh, tf_labels, [batch_dim])
w1 = mtf.get_variable(mesh, "w1", [rows_dim, cols_dim, hidden_dim])
w2 = mtf.get_variable(mesh, "w2", [hidden_dim, classes_dim])
# einsum is a generalization of matrix multiplication (see numpy.einsum)
hidden = mtf.relu(
    mtf.einsum([images, w1], output_shape=[batch_dim, hidden_dim]))
logits = mtf.einsum([hidden, w2], output_shape=[batch_dim, classes_dim])
loss = mtf.reduce_mean(mtf.layers.softmax_cross_entropy_with_logits(
    logits, mtf.one_hot(labels, classes_dim), classes_dim))
w1_grad, w2_grad = mtf.gradients([loss], [w1, w2])
update_w1_op = mtf.assign(w1, w1 - w1_grad * 0.001)
update_w2_op = mtf.assign(w2, w2 - w2_grad * 0.001)
```
In the code above, we have built a Mesh TensorFlow graph, which is simply a Python structure. We have completely defined the mathematical operations. In the code below, we specify the mesh of processors and the layout of the computation.

```python
devices = ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]
mesh_shape = [("all_processors", 4)]
layout_rules = [("batch", "all_processors")]
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
tf_update_ops = [lowering.lowered_operation(update_w1_op),
                 lowering.lowered_operation(update_w2_op)]
```
The particular layout above implements data-parallelism, splitting the batch of examples evenly across all four processors. Any tensor with a "batch" dimension (e.g. `images`, `hidden`, `logits`, and their gradients) is split in that dimension across all processors, while any tensor without a "batch" dimension (e.g. the model parameters) is replicated identically on every processor.
Alternatively, for model-parallelism, we can set `layout_rules=[("hidden", "all_processors")]`. In this case, any tensor with a "hidden" dimension (e.g. `hidden`, `w1`, `w2`) is split, while any other tensor (e.g. `images`, `logits`) is fully replicated.
We can even combine data-parallelism and model-parallelism on a 2-dimensional mesh of processors. We split the batch along one dimension of the mesh, and the units in the hidden layer along the other dimension of the mesh, as below. In this case, the hidden layer is actually tiled between the four processors, being split in both the "batch" and "hidden" dimensions.

```python
mesh_shape = [("processor_rows", 2), ("processor_cols", 2)]
layout_rules = [("batch", "processor_rows"), ("hidden", "processor_cols")]
```
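To make the three layouts above concrete, the following pure-Python sketch computes the per-processor slice shape of each tensor in the MNIST example. It does not call Mesh TensorFlow; `local_shape`, `tensors`, and `layouts` are hypothetical names introduced only for this illustration, and the sketch assumes every split is legal and even.

```python
def local_shape(tensor_shape, mesh_shape, layout_rules):
    """Return the per-processor slice shape of a tensor.

    tensor_shape: list of (dimension_name, size) pairs.
    mesh_shape: list of (mesh_dimension_name, size) pairs.
    layout_rules: list of (tensor_dimension_name, mesh_dimension_name) pairs.
    """
    mesh_sizes = dict(mesh_shape)
    rules = dict(layout_rules)
    result = []
    for name, size in tensor_shape:
        if name in rules:
            # A split dimension is divided evenly across the mesh dimension.
            result.append(size // mesh_sizes[rules[name]])
        else:
            # A dimension without a matching rule is replicated.
            result.append(size)
    return result


# Shapes taken from the MNIST example above.
tensors = {
    "images": [("batch", 100), ("rows", 28), ("cols", 28)],
    "w1": [("rows", 28), ("cols", 28), ("hidden", 1024)],
    "hidden": [("batch", 100), ("hidden", 1024)],
    "w2": [("hidden", 1024), ("classes", 10)],
    "logits": [("batch", 100), ("classes", 10)],
}

layouts = {
    "data-parallel": ([("all_processors", 4)],
                      [("batch", "all_processors")]),
    "model-parallel": ([("all_processors", 4)],
                       [("hidden", "all_processors")]),
    "combined": ([("processor_rows", 2), ("processor_cols", 2)],
                 [("batch", "processor_rows"), ("hidden", "processor_cols")]),
}

for layout_name, (mesh_shape, layout_rules) in layouts.items():
    print(layout_name)
    for tensor_name, shape in tensors.items():
        print("  ", tensor_name, local_shape(shape, mesh_shape, layout_rules))
```

In the combined layout, `hidden` comes out as `[50, 512]` per processor, which is the tiling in both dimensions described above.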
Some Mesh TensorFlow operations cause network communication. For example, an einsum (generalized matrix multiplication) is computed as follows:
- On each processor, compute the einsum of the slices of the two operands that are local to that processor.
- If no reduced-out dimensions are split, then we are done.
- If reduced-out dimensions are split, then perform an "allreduce" operation on the resulting slices, summing across any mesh dimensions over which the reduced-out dimensions are split.
Where the allreduces happen depends on the computation layout. For example, in a data-parallel layout where the "batch" dimension is split, allreduces will happen when computing the parameter gradients, since this involves matrix multiplications which reduce out the "batch" dimension.
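The einsum steps above can be simulated in plain Python. The sketch below is an illustration under assumptions, not Mesh TensorFlow's implementation: it models two processors as loop iterations, splits the reduced-out "hidden" dimension between them, and uses a hypothetical `allreduce_sum` helper to stand in for the collective.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def allreduce_sum(partials):
    """Sum matrices element-wise, standing in for an allreduce.

    In a real allreduce every participating processor receives this sum.
    """
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


# hidden has shape [batch=2, hidden=4]; w2 has shape [hidden=4, classes=3].
hidden = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
w2 = [[1, 0, 2],
      [0, 1, 2],
      [1, 1, 0],
      [2, 0, 1]]

num_processors = 2
chunk = len(w2) // num_processors  # size of each "hidden" slice

partials = []
for p in range(num_processors):
    lo, hi = p * chunk, (p + 1) * chunk
    hidden_slice = [row[lo:hi] for row in hidden]  # split columns of hidden
    w2_slice = w2[lo:hi]                           # split rows of w2
    # Each processor computes the einsum of its local slices.
    partials.append(matmul(hidden_slice, w2_slice))

# "hidden" is reduced out and split, so the partial results are summed.
logits = allreduce_sum(partials)
print(logits == matmul(hidden, w2))
```

If the "batch" dimension were split instead, each processor would already hold complete rows of the output and no summation would be needed for this einsum.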
While results do not depend on layout (except in the realm of roundoff errors and random seeds), performance and memory consumption depend heavily on layout. Fortunately, the `auto_mtf` subpackage provides a method for automatically choosing a layout. For more information about what `auto_mtf` is doing to choose a layout, see its README file.

```python
import mesh_tensorflow as mtf
import mesh_tensorflow.auto_mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")
# Insert model code here.
outputs = [logits, loss]  # iterable of mtf.Tensor, the outputs you're computing
mesh_shape = [("processor_rows", 2), ("processor_cols", 2)]
layout_rules = mtf.auto_mtf.layout(graph, mesh_shape, outputs)
```
It is possible for advanced users to eke out additional performance by tuning the layout (and model) further. Mesh TensorFlow helps by accumulating and printing counters of computation/communication. To start, here are some tricks/guidelines.
- It is illegal for two dimensions of the same tensor to be split across the same mesh dimension.
- For any compute-intense operation (e.g. einsum), make sure that all mesh-dimensions are used to split dimensions of the inputs or outputs. Otherwise, computation is duplicated.
- To keep the ratio of compute/communication high (i.e. not be bandwidth-bound), split dimensions into large chunks. This should be familiar in the data-parallelism case, where we want a large batch size per processor to avoid spending most of our time communicating.
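The second guideline can be checked mechanically. The sketch below, with the hypothetical helper `duplication_factor`, estimates how many times an operation's work is repeated: any mesh dimension that splits none of the operation's tensor dimensions contributes its full size. It is a simplified model for intuition only and is unrelated to the counters that Mesh TensorFlow prints.

```python
def duplication_factor(op_dims, mesh_shape, layout_rules):
    """Estimate how many times an operation's work is repeated.

    op_dims: names of all tensor dimensions in the operation's inputs/outputs.
    mesh_shape: list of (mesh_dimension_name, size) pairs.
    layout_rules: list of (tensor_dimension_name, mesh_dimension_name) pairs.
    """
    used = {mesh_dim for tensor_dim, mesh_dim in layout_rules
            if tensor_dim in op_dims}
    factor = 1
    for mesh_dim, size in mesh_shape:
        if mesh_dim not in used:
            # Processors along this mesh dimension all do identical work.
            factor *= size
    return factor


mesh_shape = [("processor_rows", 2), ("processor_cols", 2)]
# Dimensions involved in the first einsum of the MNIST example.
first_einsum_dims = {"batch", "rows", "cols", "hidden"}

only_batch = [("batch", "processor_rows")]
batch_and_hidden = [("batch", "processor_rows"), ("hidden", "processor_cols")]

print(duplication_factor(first_einsum_dims, mesh_shape, only_batch))
print(duplication_factor(first_einsum_dims, mesh_shape, batch_and_hidden))
```

With only the "batch" rule, the `processor_cols` mesh dimension is unused and the einsum is computed twice; adding the "hidden" rule removes the duplication.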
Mesh TensorFlow (v0.0) is implemented as a Python library which can generate part of a TensorFlow graph. The user first builds a `mtf.Graph` (the analog of a TensorFlow graph) made up of `mtf.Tensor`s and `mtf.Operation`s. As in TensorFlow, this graph consists of simple Python objects. The user then creates a `mtf.Lowering` object, which lowers the `mtf.Graph` into TensorFlow, adding to the default TensorFlow graph.
The Mesh TensorFlow language is nearly identical to TensorFlow, with the familiar notion of a Graph, Tensors, Operations, and automatic gradient computation. The principal differences are as follows:
A `Mesh` is an n-dimensional array of processors with named dimensions. Each `Tensor` is assigned to a `Mesh`, instead of a device.
Each `Tensor` has a static `Shape`, which is a tuple of different "Dimensions". A `Dimension` is a `(name, size)` pair. For example, the shape of a `Tensor` representing a batch of images might be:
`[("batch", 100), ("rows", 28), ("cols", 28), ("channels", 3)]`.
A `Tensor` is laid out on its mesh with one slice on each processor. A `Tensor` "layout" is an injective partial map specifying which dimensions of the tensor are (evenly) split across which dimensions of the mesh. No dimension of a tensor may be split across two dimensions of its mesh and no two dimensions of a tensor may be split across the same dimension of its mesh. The user defines a global set of layout rules in the form of (tensor-dimension-name, mesh-dimension-name) pairs. A dimension of a tensor is split across a dimension of its mesh if there is a matching rule.
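The constraints in this definition can be written as a small validity check. The following sketch is illustrative only: `check_layout` is a hypothetical helper rather than part of the library, and the 2x4 mesh shape is an assumption chosen for the demonstration.

```python
def check_layout(tensor_shape, mesh_shape, layout_rules):
    """Return {tensor_dimension: mesh_dimension} or raise ValueError.

    tensor_shape and mesh_shape are lists of (name, size) pairs;
    layout_rules is a list of (tensor_dimension, mesh_dimension) pairs.
    """
    tensor_sizes = dict(tensor_shape)
    mesh_sizes = dict(mesh_shape)
    assignment = {}
    for tensor_dim, mesh_dim in layout_rules:
        if tensor_dim not in tensor_sizes:
            continue  # the rule matches no dimension of this tensor
        if tensor_dim in assignment:
            raise ValueError(
                f"{tensor_dim!r} would be split across two mesh dimensions")
        if mesh_dim in assignment.values():
            raise ValueError(
                f"two tensor dimensions would be split across {mesh_dim!r}")
        if tensor_sizes[tensor_dim] % mesh_sizes[mesh_dim] != 0:
            raise ValueError(
                f"{tensor_dim!r} is not evenly divisible by {mesh_dim!r}")
        assignment[tensor_dim] = mesh_dim
    return assignment


image_shape = [("batch", 100), ("rows", 28), ("cols", 28), ("channels", 3)]
mesh_shape = [("processor_rows", 2), ("processor_cols", 4)]  # assumed mesh

legal = [("rows", "processor_rows"), ("cols", "processor_cols")]
print(check_layout(image_shape, mesh_shape, legal))

illegal_rule_sets = [
    [("batch", "processor_rows"), ("batch", "processor_cols")],
    [("batch", "processor_rows"), ("rows", "processor_rows")],
    [("channels", "processor_rows")],
]
for rules in illegal_rule_sets:
    try:
        check_layout(image_shape, mesh_shape, rules)
    except ValueError as error:
        print("illegal:", error)
```

The returned dictionary is the "injective partial map": partial because unmatched tensor dimensions are absent, injective because no mesh dimension appears twice.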
Take our example `Tensor` `image_batch` with shape: `[("batch", 100), ("rows", 28), ("cols", 28), ("channels", 3)]`
Assume that this `Tensor` is assigned to a mesh of 8 processors with shape: `[("processor_rows", 2), ("processor_cols", 4)]`
- If we use an empty set of layout rules `[]`, we get no splitting. Each processor contains the whole `Tensor`.
- If we use the layout rules `"batch:processor_cols"`, then the `"batch"` dimension of the `Tensor` is split across the `"processor_cols"` dimension of the mesh. This means that each processor contains a `Tensor` slice with shape `[25, 28, 28, 3]`. For example, processors (0, 3) and (1, 3) contain identical slices: `image_batch[75:100, :, :, :]`.
- If we use the layout rules `"rows:processor_rows;cols:processor_cols"`, then the image is split in two dimensions, with each processor containing one spatial tile with shape `[100, 14, 7, 3]`. For example, processor (0, 1) contains the slice `image_batch[:, 0:14, 7:14, :]`.
Some layout rules would lead to illegal layouts:
- `"batch:processor_rows;rows:processor_rows"` is illegal because two tensor dimensions cannot be split across the same mesh dimension.
- `"channels:processor_rows"` is illegal because the size of the tensor dimension is not evenly divisible by the size of the mesh dimension.
Mesh TensorFlow uses Einstein-summation notation, `mtf.einsum(inputs, output_shape)`, using the (named) `Dimension`s as the symbols. Matrix multiplication, broadcast, sum-reduction, and transposition can all be expressed as special cases of `mtf.einsum`, though the familiar interfaces are also supported. The operation is lowered to slice-wise `tf.einsum`s, followed by allreduce across any mesh-dimensions corresponding to the summed-out Tensor dimensions.
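To show how named dimensions can play the role of einsum symbols, here is a small pure-Python sketch. It is not `mtf.einsum`: the function `named_einsum`, the helper `from_matrix`, and the tensor representation (a triple of names, sizes, and a dictionary of values) are all assumptions made for this example, and broadcasting to new output dimensions is left out.

```python
import itertools


def from_matrix(names, rows):
    """Build a 2-D named tensor from a list of rows."""
    values = {(i, j): v
              for i, row in enumerate(rows) for j, v in enumerate(row)}
    return (names, (len(rows), len(rows[0])), values)


def named_einsum(inputs, output_names):
    """Multiply the inputs and sum over every dimension not in output_names.

    Each input is a (names, sizes, values) triple, where values maps an
    index tuple (ordered like names) to a number.
    """
    sizes = {}
    for names, dims, _ in inputs:
        sizes.update(zip(names, dims))
    all_names = list(sizes)
    result = {}
    for index in itertools.product(*(range(sizes[n]) for n in all_names)):
        position = dict(zip(all_names, index))
        term = 1
        for names, _, values in inputs:
            term *= values[tuple(position[n] for n in names)]
        key = tuple(position[n] for n in output_names)
        # Dimensions missing from output_names are summed out here.
        result[key] = result.get(key, 0) + term
    return result


x = from_matrix(("batch", "hidden"), [[1, 2], [3, 4]])
w = from_matrix(("hidden", "classes"), [[5, 6], [7, 8]])

product = named_einsum([x, w], ("batch", "classes"))   # matrix multiplication
transposed = named_einsum([x], ("hidden", "batch"))    # transposition
row_sums = named_einsum([x], ("batch",))               # sum-reduction
print(product, transposed, row_sums)
```

Because the dimensions are matched by name, the three operations differ only in which names are requested in the output.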
`mtf.reshape(x, new_shape)` is used to change a `Tensor`'s shape, potentially leading to a new tensor layout and hence network communication.
Mesh TensorFlow works on CPU, GPU and TPU. The TPU implementation is very different from the CPU/GPU implementation.
Multi-CPU/GPU meshes are implemented with `PlacementMeshImpl`. In this case Mesh TensorFlow emits separate TensorFlow operations placed on the different devices, all in one big TensorFlow graph.
TPU meshes are implemented with `SimdMeshImpl`. In this case, Mesh TensorFlow emits TensorFlow operations (and communication collectives) from the perspective of one core, and this same program runs on every core, relying on the fact that each core actually performs the same operations. This piggy-backs on the TPU data-parallelism infrastructure, which operates the same way. This "SIMD" approach keeps the TensorFlow and XLA graphs from growing with the number of cores. The differences between cores are as follows:
- different slices of the variables (this works now)
- different positions in the collective communication (this works now)
- different slices of the infed and outfed tensors. We currently work around this by requiring that all imported/exported tensors be fully-replicated. In the future, we should handle this correctly.
The input pipeline of Mesh TensorFlow models might become a bottleneck when training with large input (e.g., high resolution images). We provide new APIs and a new input pipeline for you to run Mesh TensorFlow models. You can find them under the `experimental/` folder. We suggest that you give them a try when your input is so large that running Mesh TensorFlow models with the default APIs is almost infeasible. To be more specific:
- The BROADCAST mode in TPUEstimator does not scale up to large inputs (images of tens of millions of pixels). We provide a new input pipeline: `experimental/input_reader.py`. See `experimental/model_executor.py` on how to use it.
- If your model takes images as input and has convolution layers, you cannot directly map image height and width dimensions to mesh dimensions, due to the sliding-window nature of convolution. Instead, you should use spatial partitioning. We provide examples in `experimental/unet.py`.
- If you want more control over the training and evaluation loop, instead of using the default API (TPUEstimator) to run your model, you can use the low-level APIs in `experimental/model_executor.py`.
Note that we did not test the experimental code on GPUs; we ran it on TPUs. We believe that some debugging would be required for it to work on GPUs.
Note: It requires `tensorflow>=1.11.0`.
Please go through the Transformer tutorial.
TODO(trandustin,ylc): update given mtf pypi package
```sh
ctpu up -name=ylc-mtf-donut -tf-version=nightly -tpu-size=v2-8 -zone=us-central1-b
```

```sh
git clone https://github.com/tensorflow/mesh.git
cd mesh/
pip install --user .
```
```sh
pip install tensorflow_datasets

cd mesh/
DATA_DIR=gs://noam-mtf/data
MODEL_DIR=gs://noam-mtf/transformer_standalone
TPU=noam-mtf-donut

# MODEL HPARAMS AND DIRECTORY (uncomment one)
# base model
MODEL=./transformer/gin/model_base.gin
# 5B parameters (too big for this dataset, only trains with model-parallelism)
# MODEL=./transformer/gin/model_5b.gin

# UNCOMMENT ONE OF THESE
# Data-parallelism
LAYOUT=./transformer/gin/layout_data_parallel.gin
# Model-parallelism
# LAYOUT=./transformer/gin/layout_model_parallel.gin
# Data-parallelism and Model-Parallelism
# LAYOUT=./transformer/gin/layout_data_and_model_parallel.gin

# TRAIN
python examples/transformer_standalone.py \
  --tpu=$TPU --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --gin_file=$MODEL \
  --gin_file=$LAYOUT --gin_param="run.mode='train'"

# EVAL
python examples/transformer_standalone.py \
  --tpu=$TPU --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --gin_file=$MODEL \
  --gin_file=$LAYOUT --gin_param="run.mode='evaluate'"
```
The above code will train on the LM1B language modeling benchmark, as specified in `examples/transformer_standalone_defaults.gin`. To train a sequence-to-sequence model on WMT14 en-de, change `utils.run.dataset` to `wmt_translate_ende/ende_subwords8k_t2t` and set `utils.run.mode` to `True`. Note that the `wmt_translate_ende/ende_subwords8k_t2t` dataset was removed from TensorFlow Datasets in commit 211cb6f, so in order to train a model using this dataset you need to install a version of TFDS before this commit. Then, you can decode the WMT en-de development set and evaluate it using SacreBLEU like so:
```sh
# INFER
pip3 install sacrebleu
mkdir ~/input ~/output
DECODE_INPUT=/home/$USER/input/ende.dev
DECODE_OUTPUT=/home/$USER/output/ende.dev.out
~/.local/bin/sacrebleu -t wmt13 -l en-de --echo src > $DECODE_INPUT
python examples/transformer_standalone.py \
  --tpu=$TPU --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --gin_file=$MODEL \
  --gin_file=$LAYOUT \
  --gin_param="decode_from_file.input_filename='$DECODE_INPUT'" \
  --gin_param="decode_from_file.output_filename='$DECODE_OUTPUT'" \
  --gin_param="run.mode='infer'"

# Compute BLEU score for dev set
cat $DECODE_OUTPUT | ~/.local/bin/sacrebleu -t wmt13 -l en-de -tok intl
```

```sh
git clone https://github.com/tensorflow/tensor2tensor.git
cd tensor2tensor/
pip install --user .
```
Before running the model, you need to prepare the training data and a bucket for storing checkpoints. Refer to the Transformer tutorial to learn how to generate the training data and create buckets.
```sh
CONF=mtf_transformer_paper_tr_0_mesh_8
NAME=ende_$CONF\_0828
MODEL=mtf_transformer
PROBLEM=translate_ende_wmt32k_packed

DATA_DIR=gs://xxxx
OUT_DIR=gs://xxxx
TPU_NAME=ylc-mtf-donut

tensor2tensor/bin/t2t-trainer \
  --model=$MODEL \
  --hparams_set=$CONF \
  --problem=$PROBLEM \
  --train_steps=10000 \
  --eval_steps=200 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --use_tpu=True \
  --cloud_tpu_name=$TPU_NAME
```
This toy model contains two fully-connected layers which aim to learn an identity function: f(x) = x. Since there are 8 TPU cores, we can arbitrarily change `FLAGS.mesh_shape` and `FLAGS.layout` to achieve different data-parallelism and model-parallelism strategies.
```sh
MODEL_DIR=gs://xxxx
TPU_NAME=ylc-mtf-donut

# 2 ways data-parallelism and 4 ways model-parallelism.
# In this configuration, we split the batch dimension into 2 cores and the
# hidden dimension into 4 cores.
python examples/toy_model_tpu.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --io_size=8 \
  --hidden_size=8 \
  --mesh_shape='x:2;y:4' \
  --layout='batch:x;hidden:y'

# 8 ways model-parallelism.
# In this configuration, we split the hidden dimension into 8 cores.
python examples/toy_model_tpu.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --io_size=8 \
  --hidden_size=8 \
  --mesh_shape='all:8' \
  --layout='hidden:all'
```
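As a rough guide to reading the `--mesh_shape` and `--layout` strings used above, the sketch below parses them and reports the number of cores and the per-core size of each dimension. It is not the library's own parser; `parse_pairs` and `per_core_sizes` are hypothetical helpers, and the dimension names `"io"` and `"hidden"` are assumed to correspond to the `--io_size` and `--hidden_size` flags.

```python
def parse_pairs(spec, convert=str):
    """Parse a string like 'a:1;b:2' into [('a', 1), ('b', 2)]."""
    pairs = []
    for item in spec.split(";"):
        name, value = item.split(":")
        pairs.append((name, convert(value)))
    return pairs


def per_core_sizes(mesh_shape_flag, layout_flag, dim_sizes):
    """Return (number of cores, {dimension: size held by each core})."""
    mesh_sizes = dict(parse_pairs(mesh_shape_flag, int))
    layout = dict(parse_pairs(layout_flag))
    num_cores = 1
    for size in mesh_sizes.values():
        num_cores *= size
    sizes = {}
    for name, size in dim_sizes.items():
        # A dimension with no layout rule is replicated on every core.
        parts = mesh_sizes[layout[name]] if name in layout else 1
        sizes[name] = size // parts
    return num_cores, sizes


# Sizes taken from the --io_size and --hidden_size flags above.
dim_sizes = {"io": 8, "hidden": 8}

print(per_core_sizes("x:2;y:4", "batch:x;hidden:y", dim_sizes))
print(per_core_sizes("all:8", "hidden:all", dim_sizes))
```

Both configurations use all 8 cores; the first leaves each core with 2 hidden units and the second with 1.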
N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, R. Sepassi, and B. Hechtman. Mesh-TensorFlow: Deep learning for supercomputers. In Neural Information Processing Systems, 2018.
```bibtex
@inproceedings{shazeer2018mesh,
  author = {Noam Shazeer and Youlong Cheng and Niki Parmar and Dustin Tran and Ashish Vaswani and Penporn Koanantakool and Peter Hawkins and HyoukJoong Lee and Mingsheng Hong and Cliff Young and Ryan Sepassi and Blake Hechtman},
  title = {{Mesh-TensorFlow}: Deep Learning for Supercomputers},
  booktitle = {Neural Information Processing Systems},
  year = {2018},
}
```