Distributed Linear Algebra#

Overview#

The distributed Linear Algebra module nvmath.distributed.linalg.advanced in nvmath-python leverages the NVIDIA cuBLASMp library and provides a powerful suite of APIs that can be directly called from the host to efficiently perform matrix multiplications on multi-node multi-GPU systems at scale. Both stateless function-form APIs and stateful class-form APIs are provided.

The distributed matrix multiplication APIs are similar to their non-distributed host API counterparts, with some key differences:

  • The operands to the API on each process are the local partition of the global operands, and the user specifies the distribution (how the data is partitioned across processes). The APIs natively support the block-cyclic distribution (see Block distributions).

  • The APIs optionally support GPU operands on symmetric memory. Refer to Distributed API Utilities for examples and details of how to manage symmetric memory GPU operands.
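As a quick orientation, here is a minimal sketch of the two API forms, assuming operands `a`, `b` and a `distributions` list prepared as described in the sections below; the stateful form is shown following the plan/execute pattern of nvmath-python's stateful APIs:

```python
from nvmath.distributed.linalg.advanced import Matmul, matmul

# Stateless function-form API: a one-shot call.
result = matmul(a, b, distributions=distributions)

# Stateful class-form API: plan once, then execute (possibly repeatedly).
with Matmul(a, b, distributions=distributions) as mm:
    mm.plan()
    result = mm.execute()
```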

Operand distribution#

To perform a distributed operation, first you have to specify how the operand is distributed across processes. Distributed matrix multiply natively supports the block-cyclic distribution (see Block distributions), therefore you must provide a distribution compatible with block-cyclic. Compatible distributions include BlockCyclic, BlockNonCyclic and Slab (with uniform partition sizes).
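As a small sketch using the Slab distribution (the same helper used in the example at the end of this page), each process can query the shape of its local partition from the global shape and its rank:

```python
from nvmath.distributed.distribution import Slab

# rank is this process's rank in the mpi4py communicator
# obtained from the initialized distributed runtime.
rank = communicator.Get_rank()

# Global matrix shape.
m, k = 128, 1024

# Slab.X partitions the first dimension across processes;
# Slab.Y partitions the second.
local_shape = Slab.X.shape(rank, (m, k))
```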

Memory layout#

cuBLASMp requires operands to use Fortran-order memory layout, while Python libraries such as NumPy and PyTorch use C-order by default. See Distribution, memory layout and transpose for guidelines on memory layout conversion for distributed operands and potential implications on distribution.
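Because transposing a 2D C-order array yields a Fortran-order view without copying data, a common pattern (used in the example below) is to allocate the transposed shape and take a `.T` view. A minimal NumPy illustration:

```python
import numpy as np

# Allocate the transposed shape in C order, then take a transposed
# view: the view has Fortran-order memory layout without a copy.
a_c = np.empty((1024, 128))  # C-order, shape (k, m)
a_f = a_c.T                  # Fortran-order view, shape (m, k)
assert a_f.flags.f_contiguous
```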

Matrix qualifiers#

Matrix qualifiers are used to indicate whether an input matrix is transposed or not.

For example, for A.T @ B you have to specify:

```python
import numpy as np

from nvmath.distributed.linalg.advanced import matrix_qualifiers_dtype, matmul

qualifiers = np.zeros((3,), dtype=matrix_qualifiers_dtype)
qualifiers[0]["is_transpose"] = True   # a is transposed
qualifiers[1]["is_transpose"] = False  # b is not transposed (optional)

...

result = matmul(a, b, distributions=distributions, qualifiers=qualifiers)
```

Caution

A common strategy to convert memory layout to Fortran-order (required by cuBLASMp) is to transpose the input matrices, as explained in Distribution, memory layout and transpose. Remember to set the matrix qualifiers accordingly.

Distributed algorithm#

cuBLASMp implements efficient communication-overlap algorithms that are suited for distributed machine learning scenarios with tensor parallelism. Algorithms include AllGather+GEMM and GEMM+ReduceScatter. These algorithms have special requirements in terms of how each of the operands is distributed and their transpose qualifiers.

Currently, to be able to use these algorithms the matrices must be distributed using a 1D partitioning scheme without the cyclic distribution, and the partition sizes must be uniform (BlockNonCyclic and Slab are valid distributions for this use case).
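For illustration, here is a sketch of checking that a Slab partitioning meets the uniformity requirement (the exact per-algorithm operand and qualifier requirements are listed in the cuBLASMp documentation):

```python
from nvmath.distributed.distribution import Slab

# Slab is a 1D, non-cyclic distribution; its partitions are uniform
# when the partitioned dimension is divisible by the number of processes.
nprocs = communicator.Get_size()  # mpi4py communicator
m, k = 128, 1024
assert m % nprocs == 0, "partition sizes must be uniform"
local_a_shape = Slab.X.shape(rank, (m, k))
```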

Please refer to the cuBLASMp documentation for full details.

Symmetric memory#

Operands may be allocated on the symmetric heap. If so, the result will also be allocated on the symmetric heap.

Tip

Certain distributed matrix multiplication algorithms may perform better when the operands are on symmetric memory.

Important

Any memory on the symmetric heap that is owned by the user (including the distributed Matmul result) must be deleted explicitly using free_symmetric_memory(). Refer to Distributed API Utilities for more information.

See example.
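A hedged sketch of the allocate/compute/free flow, assuming the allocate_symmetric_memory() helper described in Distributed API Utilities (the local shapes m_local, k, n and the distributions list are placeholders):

```python
import cupy as cp

import nvmath.distributed
from nvmath.distributed.linalg.advanced import matmul

# Allocate operands on the symmetric heap (helper described in
# Distributed API Utilities; shapes here are placeholders).
a = nvmath.distributed.allocate_symmetric_memory((m_local, k), cp)
b = nvmath.distributed.allocate_symmetric_memory((k, n), cp)
# ... initialize a and b, then perform the distributed matmul ...
result = matmul(a, b, distributions=distributions)

# Operands and result live on the symmetric heap and must be
# freed explicitly by the user.
nvmath.distributed.free_symmetric_memory(a, b, result)
```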

Example#

The following example performs \(\alpha A @ B + \beta C\) with inputs distributed according to a Slab distribution (partitioning on a single dimension):

Tip

Reminder to initialize the distributed context first as per Initializing the distributed runtime and to select both NVSHMEM and NCCL as communication backends.

```python
import cupy as cp
import numpy as np

import nvmath.distributed.linalg.advanced
from nvmath.distributed.distribution import Slab
from nvmath.distributed.linalg.advanced import matrix_qualifiers_dtype

# Get my process rank from mpi4py communicator.
rank = communicator.Get_rank()

# The global problem size m, n, k.
m, n, k = 128, 512, 1024

# Prepare sample input data.
with cp.cuda.Device(device_id):
    a = cp.random.rand(*Slab.X.shape(rank, (m, k)))
    b = cp.random.rand(*Slab.X.shape(rank, (n, k)))
    c = cp.random.rand(*Slab.Y.shape(rank, (n, m)))

# Get transposed views with Fortran-order memory layout.
a = a.T  # a is now (k, m) with Slab.Y
b = b.T  # b is now (k, n) with Slab.Y
c = c.T  # c is now (m, n) with Slab.X

distributions = [Slab.Y, Slab.Y, Slab.X]

qualifiers = np.zeros((3,), dtype=matrix_qualifiers_dtype)
qualifiers[0]["is_transpose"] = True  # a is transposed

alpha = 0.45
beta = 0.67

# Perform the distributed GEMM.
result = nvmath.distributed.linalg.advanced.matmul(
    a,
    b,
    c=c,
    alpha=alpha,
    beta=beta,
    distributions=distributions,
    qualifiers=qualifiers,
)

# Synchronize the default stream, since by default the execution
# is non-blocking for GPU operands.
cp.cuda.get_current_stream().synchronize()

# result is distributed row-wise.
assert result.shape == Slab.X.shape(rank, (m, n))
```

You can find many more examples here.

API Reference#

Distributed Linear Algebra APIs (nvmath.distributed.linalg.advanced)#

matmul(a, b, /[, c, alpha, beta, epilog, ...])

Perform the specified distributed matrix multiplication computation \(F(\alpha a @ b + \beta c)\), where \(F\) is the epilog.

matrix_qualifiers_dtype

NumPy dtype object that encapsulates the matrix qualifiers in distributed.linalg.advanced.

Matmul(a, b, /[, c, alpha, beta, ...])

Create a stateful object encapsulating the specified distributed matrix multiplication computation \(\alpha a @ b + \beta c\) and the required resources to perform the operation.

MatmulComputeType

alias of ComputeType

MatmulEpilog

alias of MatmulEpilogue

MatmulAlgoType(value[, names, module, ...])

See cublasMpMatmulAlgoType_t.

MatmulOptions([compute_type, scale_type, ...])

A data class for providing options to the Matmul object and the wrapper function matmul().