Operand distribution#
To perform distributed math operations with nvmath.distributed, you must first specify how the operands are distributed across processes. nvmath-python supports multiple distribution types (Slab, Box, BlockCyclic, etc.), which we explain in this section.
You can use any distribution type for any distributed operation as long as nvmath-python implements an implicit conversion to the native distribution type supported by the operation. For example, the distributed dense linear algebra library (cuBLASMp) supports the PBLAS 2D block-cyclic distribution, and your input matrices must be distributed in a way that conforms to this distribution type. Slab is compatible with the 2D block distribution for uniform partition sizes, so you can use Slab for distributed matrix multiplication in such cases (see examples).
It’s also important to consider the memory layout requirements of the distributed operation that you’re performing, and the potential implications on the distribution of the global array. See Distribution, memory layout and transpose for more information.
In the following we describe the available distribution types.
Slab#
Slab specifies the distribution of an N-D array that is partitioned across processes along a single dimension. More formally:
The shape of the slab on the first \(s_p \mathbin{\%} P\) processes is \((s_0, \ldots, \lfloor \frac{s_p}{P} \rfloor + 1, \ldots, s_{n-1})\)
The shape of the slab on the remaining processes is \((s_0, \ldots, \lfloor \frac{s_p}{P} \rfloor, \ldots, s_{n-1})\)
Process 0 owns the first slab according to the global index order, process 1 owns the second slab, and so on.
where:
\(s_i\) is the size of dimension \(i\) of the global array
\(p\) is the partition dimension
\(n\) is the number of dimensions of the array
\(P\) is the number of processes
Let’s look at an example with a 2D array and four processes:

Here we see an MxN 2D array partitioned on the X axis, where each number (and color) denotes the slab of the global array owned by that process.
If \((M, N) = (40, 64)\), the shape of the slab on every process will be \((10, 64)\). For \((M, N) = (39, 64)\), the shape of the slab on the first three processes is \((10, 64)\) and the shape on the last process is \((9, 64)\).
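To make the slab rule above concrete, here is a minimal pure-Python sketch (illustrative only, not part of the nvmath-python API) that computes the slab shape owned by a given process. With \((M, N) = (39, 64)\) and four processes it reproduces the shapes above.

def slab_shape(global_shape, partition_dim, process_id, num_processes):
    # Illustrative helper: size of the partitioned dimension owned by this process.
    base, remainder = divmod(global_shape[partition_dim], num_processes)
    local_size = base + 1 if process_id < remainder else base
    shape = list(global_shape)
    shape[partition_dim] = local_size
    return tuple(shape)

# (39, 64) partitioned along axis 0 across 4 processes:
# processes 0-2 own (10, 64) and process 3 owns (9, 64).
print([slab_shape((39, 64), 0, p, 4) for p in range(4)])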
Using the nvmath.distributed APIs, you can specify the above distribution like this:
from nvmath.distributed.distribution import Slab

distribution = Slab(partition_dim=0)
# or
distribution = Slab.X
Tip
We offer convenience aliases to use with 1D/2D/3D arrays: Slab.X, Slab.Y and Slab.Z (which partition on axis 0, 1 and 2, respectively).
Note
Slab is natively supported by cuFFTMp (distributed FFT API). A Slab (or compatible) distribution is recommended for best performance in cuFFTMp.
Box#
Given a global N-D array, an N-D box can be used to describe a subsection of the global array by indicating the lower and upper corner of the subsection. By associating boxes with processes, we can describe a data distribution where every process owns a contiguous rectangular subsection of the global array.
For example, consider an 8x8 2D array distributed across 3 processes using the following boxes:

where each number (and color) denotes the subsection of the global N-D array owned by that process.
Using the nvmath.distributed APIs, you can specify the above distribution like this:
from nvmath.distributed.distribution import Box

if process_id == 0:
    distribution = Box((0, 0), (4, 4))
elif process_id == 1:
    distribution = Box((4, 0), (8, 4))
elif process_id == 2:
    distribution = Box((0, 4), (8, 8))
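In this example the upper corner is exclusive, so the local shape on each process is simply the elementwise difference between the upper and lower corner. A quick illustrative check (plain Python, not an nvmath-python API call):

lower, upper = (0, 4), (8, 8)  # box owned by process 2 in the example above
local_shape = tuple(u - l for l, u in zip(lower, upper))
print(local_shape)  # (8, 4)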
Note
Box is natively supported by cuFFTMp (distributed FFT and Reshape APIs). For further information, refer to the cuFFTMp documentation.
Block distributions#
In the block-cyclic distribution, a global N-D array is split into blocks of a specified shape and these blocks are distributed to a grid of processes in a cyclic pattern. As a result, each process owns a set of typically non-contiguous blocks of the global N-D array.
PBLAS uses the block-cyclic distribution to distribute dense matrices in a way that evenly balances the computational load across processes, while still exploiting memory locality for performance (reference).
nvmath-python provides two distribution types based on block-cyclic, described below.
BlockCyclic#
BlockCyclic is specified with a process grid and a block size. The blocks assigned to a process are typically non-contiguous owing to the cyclic distribution pattern. The blocks can partition the array along any number of dimensions.
Consider the following example:

Here we see an NxN matrix distributed across 4 processes using a 2D block-cyclic scheme. Each number (and color) denotes the blocks of the global matrix belonging to that process. Each block has BxB elements and each process has 16 blocks.
Using the nvmath.distributed APIs, you can specify the above distribution like this:
from nvmath.distributed.distribution import ProcessGrid, BlockCyclic

process_grid = ProcessGrid(shape=(2, 2), layout=ProcessGrid.Layout.ROW_MAJOR)
distribution = BlockCyclic(process_grid, (B, B))
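To illustrate the cyclic assignment, the process that owns a block is obtained by taking the block coordinates modulo the process-grid shape and linearizing them according to the grid layout. A conceptual sketch of this rule (plain Python, not an nvmath-python API), assuming the row-major 2x2 grid above:

grid_shape = (2, 2)

def block_owner(block_coords, grid_shape):
    # Cyclic assignment: block coordinates wrap around the process grid.
    pr = block_coords[0] % grid_shape[0]
    pc = block_coords[1] % grid_shape[1]
    # Row-major linearization of the grid coordinate into a process id.
    return pr * grid_shape[1] + pc

# Blocks (0, 0), (0, 1), (1, 0), (1, 1) map to processes 0, 1, 2, 3,
# and the pattern repeats: block (2, 3) is again owned by process 1.
print(block_owner((2, 3), grid_shape))  # 1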
Note how the partition dimensions are determined by the process grid and block shape. Here is an example of a 1D block-cyclic distribution:

The above distribution can be specified like this:
from nvmath.distributed.distribution import ProcessGrid, BlockCyclic

# layout is irrelevant in this case and can be omitted
process_grid = ProcessGrid(shape=(1, 4))
distribution = BlockCyclic(process_grid, (N, B))
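With this specification the matrix is partitioned along the columns only: process p owns column blocks p, p + 4, p + 8, and so on, each B columns wide and spanning all N rows.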
Note
Block distributions are natively supported by cuBLASMp (distributed matrix multiplication APIs).
BlockNonCyclic#
BlockNonCyclic is a special case of BlockCyclic, where the block size and process grid are such that no cycling occurs. For this distribution there is no need to specify block sizes, as nvmath-python can infer them automatically.
Tip
BlockNonCyclic is a convenience type; you can represent the same distribution with BlockCyclic and explicit block sizes (see the sketch after the example below).
Example 1D block non-cyclic:

The above distribution can be specified like this:
from nvmath.distributed.distribution import ProcessGrid, BlockNonCyclic

# layout is irrelevant in this case and can be omitted
process_grid = ProcessGrid(shape=(1, 4))
distribution = BlockNonCyclic(process_grid)
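For reference, the same distribution can be written with an explicit BlockCyclic, as mentioned in the tip above. A sketch, assuming the global matrix is NxN as in the figure: giving each process a single column block wide enough to cover its share leaves no room for cycling.

import math
from nvmath.distributed.distribution import ProcessGrid, BlockCyclic

# One column block per process (width ceil(N / 4)) means no cycling occurs.
process_grid = ProcessGrid(shape=(1, 4))
distribution = BlockCyclic(process_grid, (N, math.ceil(N / 4)))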
Note
Block distributions are natively supported by cuBLASMp (distributed matrix multiplication APIs).
Tip
Slab and BlockNonCyclic are equivalent for uniform partition sizes.
Utilities#
You can get the local shape of an operand according to a distribution by querying the distribution:
from nvmath.distributed.distribution import Slab

global_shape = (64, 48, 32)

# Get the local shape on this process according to Slab.Y
shape = Slab.Y.shape(process_id, global_shape)
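For example, with four processes, Slab.Y partitions dimension 1 of the (64, 48, 32) array, so each process gets a local shape of (64, 12, 32).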
If desired, you can explicitly convert between distribution types. For example:
from nvmath.distributed.distribution import ProcessGrid, BlockNonCyclic, Slab

# layout is irrelevant in this case and can be omitted
process_grid = ProcessGrid(shape=(1, 4))
distribution = BlockNonCyclic(process_grid)

slab_distribution = distribution.to(Slab)
print(slab_distribution)  # prints "Slab(partition_dim=1, ndim=2)"
Distribution, memory layout and transpose#
Memory layout refers to the way that N-D arrays are stored in memory on each process. The two primary layouts are C-order (row-major) and Fortran-order (column-major). Memory layout is independent of the distribution of the global array, i.e. you can have any combination of distribution and local memory layout. In practice, however, math libraries have specific requirements on memory layout. For example, cuFFTMp requires C-order while cuBLASMp requires Fortran-order. As such, you may find that you have to convert the layout of your inputs. Two common ways to convert the layout are:
Copy the array to a buffer with the new layout (expensive, preserves the distribution).
For example:
import numpy as np
from nvmath.distributed.distribution import Slab

# Get the local shape according to Slab.X
a_shape = Slab.X.shape(process_id, (m, n))

# Allocate operand on this process (NumPy uses C-order by default).
a = np.random.rand(*a_shape)

# Convert layout to F-order by copying to a new array (distribution is preserved)
a = np.asfortranarray(a)
Get a transposed view (efficient, modifies the distribution).
Transposing a global array transposes the distribution, and so always results in a different distribution. For example:
import numpy as np
from nvmath.distributed.distribution import Slab

# Get the local shape according to Slab.X
a_shape = Slab.X.shape(process_id, (m, n))

# Allocate operand on this process (NumPy uses C-order by default).
a = np.random.rand(*a_shape)

# Transpose the global array (transposing on each process)
a = a.T  # the distribution of a is now Slab.Y
Note
Transposing changes the global shape of the operand and will accordingly impact the distributed operation. For example, if the global shape of the input matrix A of a distributed matrix multiplication is \((k, m)\), you have to set the is_transpose qualifier to True for A. Similarly if B is transposed. See Distributed Linear Algebra for more information.
Hint
For matrices, transpose(Slab.X) == Slab.Y and transpose(Slab.Y) == Slab.X.
See also
See the distributed matmul examples for more on the interaction between memory layout, transpose, and distribution.
API Reference#
| Slab | Slab distribution |
| Box | Box distribution |
| ProcessGrid | N-dimensional grid of processes used by some distributions like the PBLAS block-cyclic distribution. |
| BlockCyclic | Block-cyclic distribution |
| BlockNonCyclic | Block distribution without cycles |