Device-Initiated Communication

Starting with version 2.28, NCCL provides a device-side communication API, making it possible to use communication primitives directly from user CUDA kernels.

Device API

The device API consists of the following modules:

  • LSA (Load/Store Accessible) – for communication between devices accessible via memory load/store operations, using CUDA P2P. This includes devices connected over NVLink and some devices connected over PCIe, so long as they have P2P connectivity with each other (as indicated by nvidia-smi topo -p2p p). Up to NCCL 2.28.3, the availability of LSA was also subject to the NCCL_P2P_LEVEL distance check, but that is no longer the case with newer versions.
  • Multimem – for communication between devices using the hardware multicast feature provided by NVLink SHARP (available on some datacenter GPUs since the Hopper generation).
  • GIN (GPU-Initiated Networking) – for communication over the network (since NCCL 2.28.7).

The device API relies on symmetric memory (see Window Registration), which in turn depends on GPU virtual memory management (see NCCL_CUMEM_ENABLE) and optionally – for multimem support – on NVLink SHARP (see NCCL_NVLS_ENABLE).
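Both prerequisites are controlled by environment variables. As an illustration only (in practice these are normally set in the job environment rather than in code), they could be forced on before NCCL initialization; the helper name below is hypothetical:

#include <stdlib.h>

/* Illustrative sketch: call before ncclCommInitRank(). NCCL_CUMEM_ENABLE enables the
 * CUDA VMM-based allocations that symmetric memory depends on; NCCL_NVLS_ENABLE
 * enables NVLink SHARP, which multimem additionally requires. */
static void enableDeviceApiPrereqs(void) {
    setenv("NCCL_CUMEM_ENABLE", "1", 1);  /* GPU virtual memory management */
    setenv("NCCL_NVLS_ENABLE", "1", 1);   /* NVLink SHARP (multimem) */
}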

Host-Side Setup

To perform communication from the device kernel, a device communicator needs to be created first, using ncclDevCommCreate(). Data transfer operations on buffers require symmetric memory windows (see Window Registration). A custom communication kernel can then be launched using the standard CUDA syntax. The code excerpt below demonstrates these steps:

int main() {
    [...]
    NCCLCHECK(ncclCommInitRank(&comm, nranks, id, rank));

    /* Buffer initialization and window creation */
    char* buffer;
    size_t size = 256*1048576;
    NCCLCHECK(ncclMemAlloc((void**)&buffer, size));
    ncclWindow_t win;
    NCCLCHECK(ncclCommWindowRegister(comm, buffer, size, &win, NCCL_WIN_COLL_SYMMETRIC));

    /* Get device communicator */
    ncclDevComm devComm;
    ncclDevCommRequirements reqs;
    memset(&reqs, 0, sizeof(ncclDevCommRequirements));
    int nCTAs = 16;
    reqs.lsaBarrierCount = nCTAs;
    NCCLCHECK(ncclDevCommCreate(comm, &reqs, &devComm));

    /* Launch user kernel */
    customKernel<<<nCTAs, 512>>>(devComm, win);
    [...]
}

Depending on the kernel and application requirements, the same window can be used for input and output, or multiple windows may be needed. When creating a device communicator, the resources that the kernel will need should be specified via the requirements list (see ncclDevCommRequirements). In the above example we specify just the number of barriers that our LSA kernel will need, in this case one for each CTA the kernel is to be launched on (16, each CTA running 512 threads).
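For completeness, a minimal teardown sketch for the above example, assuming the kernel was launched on the default stream; ncclCommWindowDeregister() is part of the window registration API and ncclMemFree() pairs with ncclMemAlloc(). Any cleanup specific to the device communicator is omitted here.

    /* Teardown sketch (default stream assumed) */
    cudaDeviceSynchronize();                          /* wait for customKernel to finish */
    NCCLCHECK(ncclCommWindowDeregister(comm, win));   /* release the symmetric window */
    NCCLCHECK(ncclMemFree(buffer));                   /* free the ncclMemAlloc'd buffer */
    NCCLCHECK(ncclCommDestroy(comm));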

Simple LSA Kernel

template <typename T>
__global__ void inPlaceAllReduceKernel(ncclDevComm devComm, ncclWindow_t win, size_t offset, size_t count) {
    ncclLsaBarrierSession<ncclCoopCta> bar{ncclCoopCta(), devComm, ncclTeamTagLsa(), blockIdx.x};
    bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
    const int rank = devComm.lsaRank, nRanks = devComm.lsaSize;
    const int globalTid = threadIdx.x + blockDim.x * (rank + blockIdx.x * nRanks);
    const int globalNthreads = blockDim.x * gridDim.x * nRanks;
    for (size_t o = globalTid; o < count; o += globalNthreads) {
        T v = 0;
        for (int peer = 0; peer < nRanks; peer++) {
            T* inputPtr = (T*)ncclGetLsaPointer(win, offset, peer);
            v += inputPtr[o];
        }
        for (int peer = 0; peer < nRanks; peer++) {
            T* outputPtr = (T*)ncclGetLsaPointer(win, offset, peer);
            outputPtr[o] = v;
        }
    }
    bar.sync(ncclCoopCta(), cuda::memory_order_release);
}

The above code excerpt shows a simple device kernel – an in-place variant (the input buffer is reused for the output) of AllReduce, utilizing LSA support (data is transferred via memory load/store instructions).

The start of the buffer is specified as a (byte-based) offset within the previously registered window win (see Window Registration); the buffer consists of count elements of type T.

Before the kernel can start processing data, it needs to ensure that all participants are ready. It creates a memory barrier session bar (see ncclLsaBarrierSession) and uses it to synchronize across all the threads of the CTA (ncclCoopCta(); see Thread Groups) and the ranks of the communicator (devComm). ncclTeamTagLsa indicates the subset of ranks the barrier will apply to (see Teams) – this kernel assumes that all ranks are LSA-connected. blockIdx.x is the CTA’s local index, used to select the barrier.

The kernel then calculates a globally unique index for each thread as well as the overall thread count, and can finally start processing data, using an all-to-all communication pattern. In each iteration of the outer loop, every participating thread loads a single input element from each communicator rank (the first inner loop). ncclGetLsaPointer() is used to calculate the locally accessible address of the start of the buffer within each rank (remote device memory was previously mapped into the local address space – see Window Registration). The extracted input data is accumulated and the result is stored back at each rank (the second inner loop). Before the kernel terminates, another memory synchronization needs to take place to ensure that all participants have finished processing their data.

Note that this simple implementation would likely fall short of achieving peak bandwidth, as it utilizes neither vectorization nor loop unrolling.
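As an illustration of the kind of change this implies (not part of the NCCL documentation), below is a minimal vectorized sketch for the special case T = float, assuming offset is 16-byte aligned and count is divisible by 4; the kernel name and the count4 parameter (count/4) are hypothetical:

__global__ void inPlaceAllReduceKernelVec4(ncclDevComm devComm, ncclWindow_t win,
                                           size_t offset, size_t count4 /* = count/4 */) {
    ncclLsaBarrierSession<ncclCoopCta> bar{ncclCoopCta(), devComm, ncclTeamTagLsa(), blockIdx.x};
    bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
    const int rank = devComm.lsaRank, nRanks = devComm.lsaSize;
    const int globalTid = threadIdx.x + blockDim.x * (rank + blockIdx.x * nRanks);
    const int globalNthreads = blockDim.x * gridDim.x * nRanks;
    for (size_t o = globalTid; o < count4; o += globalNthreads) {
        float4 v = make_float4(0.f, 0.f, 0.f, 0.f);
        for (int peer = 0; peer < nRanks; peer++) {
            /* 16-byte vector load instead of a scalar load */
            const float4* inputPtr = (const float4*)ncclGetLsaPointer(win, offset, peer);
            float4 x = inputPtr[o];
            v.x += x.x; v.y += x.y; v.z += x.z; v.w += x.w;
        }
        for (int peer = 0; peer < nRanks; peer++) {
            float4* outputPtr = (float4*)ncclGetLsaPointer(win, offset, peer);
            outputPtr[o] = v;  /* 16-byte vector store */
        }
    }
    bar.sync(ncclCoopCta(), cuda::memory_order_release);
}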

Multimem Device Kernel

int main() {
    [...]
    memset(&reqs, 0, sizeof(ncclDevCommRequirements));
    int nCTAs = 16;
    reqs.lsaBarrierCount = nCTAs;
    reqs.lsaMultimem = true;      /* critical change: request multimem support */
    NCCLCHECK(ncclDevCommCreate(comm, &reqs, &devComm));
    [...]
}

template <typename T>
__global__ void inPlaceAllReduceKernel(ncclDevComm devComm, ncclWindow_t win, size_t offset, size_t count) {
    /* critical change: the extra constructor argument selects the multimem-optimized barrier */
    ncclLsaBarrierSession<ncclCoopCta> bar{ncclCoopCta(), devComm, ncclTeamTagLsa(), blockIdx.x,
                                           /*multimem*/true};
    [...]
    /* critical change: a single multicast pointer replaces the per-peer pointers */
    T* mmPtr = (T*)ncclGetLsaMultimemPointer(win, offset, devComm);
    for (size_t o = globalTid; o < count; o += globalNthreads) {
        T v = multimem_sum(mmPtr + o);
        multimem_st(mmPtr + o, v);
    }
    [...]
}

The above code excerpt demonstrates the modifications needed to the earlier code segments to enable multimem support (the critical changes are marked with comments). On the host side, lsaMultimem needs to be set in the requirements prior to creating the device communicator (ncclDevCommCreate() will fail if the necessary hardware support is unavailable).

Within the device kernel, we can switch the memory barrier to a multimem-optimized variant by adding an extra argument to the constructor. The processing loop is actually simpler with multimem: ncclGetLsaMultimemPointer() needs to be invoked just once per kernel. The returned multicast memory pointer enables access to the device memory of all the ranks of the communicator without having to iterate over them, and the data can be reduced in hardware. To keep this example simple, the implementations of multimem_sum and multimem_st are not included; they need to be implemented using PTX, e.g., multimem.ld_reduce.global.add and multimem.st.global.
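For reference, one possible way to implement these helpers for T = float with inline PTX is sketched below (a sketch only, not taken from NCCL; it assumes an sm_90+ GPU and that mmPtr is a multicast address):

__device__ inline float multimem_sum(const float* mmPtr) {
    float v;
    /* Reduce (add) across all devices backing the multicast address */
    asm volatile("multimem.ld_reduce.global.add.f32 %0, [%1];"
                 : "=f"(v) : "l"(mmPtr) : "memory");
    return v;
}

__device__ inline void multimem_st(float* mmPtr, float v) {
    /* Broadcast the value to all devices backing the multicast address */
    asm volatile("multimem.st.global.f32 [%0], %1;"
                 :: "l"(mmPtr), "f"(v) : "memory");
}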

Thread Groups

Many functions in the device API take a thread cooperative group as input to indicate which threads within the CTA will take part in the operation. NCCL provides three predefined ones: ncclCoopThread(), ncclCoopWarp(), and (the most commonly used) ncclCoopCta().

Users may also pass CUDA cooperative groups, or any class which provides thread_rank(), size(), and sync() methods.
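For instance, a minimal hand-written group covering a full (one-dimensional) CTA could look as follows; the class name is arbitrary:

struct CtaGroup {
    /* Interface expected by the device API: thread_rank(), size(), sync() */
    __device__ unsigned int thread_rank() const { return threadIdx.x; }
    __device__ unsigned int size()        const { return blockDim.x; }
    __device__ void         sync()        const { __syncthreads(); }
};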

Teams

To address remote ranks or perform barriers, NCCL refers to subsets of ranks within a communicator as “teams”. NCCL provides three predefined ones:

  • ncclTeamWorld() – the “world” team, encompassing all the ranks of a given communicator.
  • ncclTeamLsa() – all the peers accessible from the local rank using load/store operations.
  • ncclTeamRail() – the set of peers directly accessible from the local rank over the network, assuming that the network fabric is rail-optimized (see NCCL_CROSS_NIC).

The ncclTeam structure contains fairly self-explanatory elements nRanks, rank, and stride. The device API contains functions to verify team membership, convert rank numbers between teams, etc. The world and LSA teams are always contiguous (stride 1), whereas the rail team is typically not – its stride equals the size of the LSA team (the assumption is thus that each rank n within the local LSA team has direct network connectivity with the corresponding ranks n of all remote LSA teams).
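To illustrate the stride relationship (using only the devComm fields shown in the earlier kernels; the helper itself is hypothetical), the world rank of the i-th member of the local rank’s rail team can be computed as:

__device__ inline int railPeerWorldRank(const ncclDevComm& devComm, int i) {
    /* Consecutive rail-team members are lsaSize world ranks apart (the rail
       team's stride), and the local rank sits at position lsaRank within its
       contiguous LSA team. */
    return i * devComm.lsaSize + devComm.lsaRank;
}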

GIN Device Kernel

int main() {
    [...]
    memset(&reqs, 0, sizeof(ncclDevCommRequirements));
    int nCTAs = 1;                        /* critical change: a single CTA is enough here */
    reqs.railGinBarrierCount = nCTAs;     /* critical change: request GIN barriers */
    reqs.ginSignalCount = 1;              /* critical change: request one GIN signal */
    NCCLCHECK(ncclDevCommCreate(comm, &reqs, &devComm));
    [...]
}

template <typename T>
__global__ void ginAlltoAllKernel(ncclDevComm devComm, ncclWindow_t win, size_t inputOffset,
                                  size_t outputOffset, size_t count) {
    int ginContext = 0;
    ncclGinSignal_t signalIndex = 0;
    ncclGin gin{devComm, ginContext};
    /* Read the signal's initial value before the synchronizing barrier */
    uint64_t signalValue = gin.readSignal(signalIndex);
    ncclGinBarrierSession<ncclCoopCta> bar{ncclCoopCta(), gin, ncclTeamWorld(devComm),
                                           devComm.railGinBarrier, blockIdx.x};
    bar.sync(ncclCoopCta(), cuda::memory_order_relaxed, ncclGinFenceLevel::Relaxed);
    const int rank = devComm.rank, nRanks = devComm.nRanks;
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const int nThreads = blockDim.x * gridDim.x;
    const size_t size = count * sizeof(T);
    for (int peer = tid; peer < nRanks; peer += nThreads) {
        /* One-sided put: deposit this rank's block into the peer's output region
           and increment the peer's signal once the payload has settled */
        gin.put(ncclTeamWorld(devComm), peer, win, outputOffset + rank * size,
                win, inputOffset + peer * size, size, ncclGin_SignalInc{signalIndex});
    }
    /* Wait until every peer has deposited its data here, then flush outgoing puts */
    gin.waitSignal(ncclCoopCta(), signalIndex, signalValue + nRanks);
    gin.flush(ncclCoopCta());
}

The above code excerpt demonstrates the modifications needed to the earlier host code to enable GIN support, available since NCCL 2.28.7 (the critical changes are marked with comments), and also includes a GIN AlltoAll kernel. On the host side, compared to the LSA kernels, we request a launch on just a single CTA (because our kernel doesn’t have much to do) and we set railGinBarrierCount and ginSignalCount to request GIN-specific barriers and signals (ncclDevCommCreate() will fail if GIN support is unavailable). As with LSA barriers, we need as many of them as CTAs, but signals (used for completion notifications) can be shared between CTAs, so for this simple example we’ll use just one per rank (for performance-oriented kernels, keeping signals exclusive to each CTA can improve performance).

On the device side, the GIN API centers around the ncclGin object, initialized using the device communicator and a GIN context index (0 will do for this simple example but, for performance-oriented kernels, using multiple contexts can provide a performance boost). To avoid race conditions, the initial value of the signal must be read prior to the synchronizing barrier. GIN-specific barriers look much like their LSA counterparts, being local to each CTA, but communicating over the network rather than memory. ncclTeamWorld indicates all the ranks of a communicator (this kernel assumes that all the ranks can reach one another over the network, which in general need not be the case – see NCCL_CROSS_NIC).

Unlike with the AllReduce kernels, for AlltoAll the calculated thread index needs to be unique only locally within each rank. It is then used to determine the destination peer. The main GIN data transfer operation is the one-sided put(), here launched in parallel on all participating threads, one per destination peer (the loop is needed only if the total rank count exceeds the local thread count – this is why we launched on just a single CTA). put() takes the usual arguments such as the destination rank and buffer address, the source buffer, and the transfer size. It also accepts several optional arguments; the above example takes advantage of the remoteAction argument, requesting that the destination peer increment the value of its local signal once the payload has settled.

Once the local signal has been incremented by nRanks, we know that every peer has deposited their data in this rank’s output buffer and thus that the buffer is ready; waitSignal() can be used to block until that happens. Before terminating, the kernel still needs to flush() all the previously initiated outgoing put() operations – while that does not guarantee remote completion, it does ensure that the local input buffer is safe to reuse. We can skip an explicit barrier at the end, since waitSignal() and flush() together ensure that nobody else is using this rank’s buffers.
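To tie the pieces together, a hedged sketch of the corresponding host-side launch, assuming the input and output regions both live in the registered window win and that inputOffset, outputOffset, and count are chosen by the application (the block size is arbitrary):

/* Hypothetical launch of the GIN AlltoAll kernel above (float payload); nCTAs is 1,
   matching the railGinBarrierCount requested when creating the device communicator. */
ginAlltoAllKernel<float><<<nCTAs, 128>>>(devComm, win, inputOffset, outputOffset, count);
cudaDeviceSynchronize();   /* wait until the kernel (and thus the AlltoAll) completes */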