Device API¶
Host-Side Setup¶
ncclDevComm¶
ncclDevComm¶
A structure describing a device communicator, as created on the host side usingncclDevCommCreate(). Thestructure is used primarily on the device side; elements that could be of particular interest include:
ncclDevCommCreate¶
- ncclResult_t
ncclDevCommCreate(ncclComm_t comm, structncclDevCommRequirements const* reqs, structncclDevComm* outDevComm)¶
Creates a new device communicator (seencclDevComm) corresponding to the supplied host-side communicatorcomm. The result is returned in theoutDevComm buffer (which needs to be supplied by the caller). The caller needsto also provide a filled-in list of requirements via thereqs argument (seencclDevCommRequirements); thefunction will allocate any necessary resources to meet them. The function can fail and return an error code if thecommunicator does not support symmetric memory or if the list of requirements cannot be met (e.g., if the multimemcapability is requested on a system lacking the necessary hardware support).
Note that this is ahost-side function.
ncclDevCommDestroy¶
- ncclResult_t
ncclDevCommDestroy(ncclComm_t comm, structncclDevComm const* devComm)¶
Destroys a device communicator (seencclDevComm) previously created usingncclDevCommCreate() andreleases any allocated resources. The caller must ensure that no device kernel that uses this device communicator couldbe running at the time this function is invoked.
Note that this is ahost-side function.
ncclDevCommRequirements¶
ncclDevCommRequirements¶
A host-side structure specifying the list of requirements when creating device communicators (seencclDevComm).
lsaMultimem¶Specifies whether multimem support is required for all LSA ranks.
lsaBarrierCount¶Specifies the number of memory barriers to allocate (see
ncclLsaBarrierSession).
railGinBarrierCount¶Specifies the number of network barriers to allocate (see
ncclGinBarrierSession; available since NCCL2.28.7).
barrierCount¶Specifies the minimum number for both the memory and network barriers (see above; available since NCCL 2.28.7).
ginSignalCount¶Specifies the number of network signals to allocate (see
ncclGinSignal_t; available since NCCL 2.28.7).
ginCounterCount¶Specifies the number of network counters to allocate (see
ncclGinCounter_t; available since NCCL 2.28.7).
resourceRequirementsList¶Specifies a list of resource requirements. This is best set to NULL for now.
teamRequirementsList¶Specifies a list of requirements for particular teams. This is best set to NULL for now.
LSA¶
All functionality described from this point on is available on the device side only.
ncclLsaBarrierSession¶
ncclLsaBarrierSession¶
A class representing a memory barrier session.
ncclLsaBarrierSession(Coop coop,ncclDevComm const& comm, ncclTeamTagLsa, uint32_t index, bool multimem=false)Initializes a new memory barrier session.coop represents a cooperative group (seeTeams).comm is the device communicator created using
ncclDevCommCreate().ncclTeamTagLsa is here to indicate which subset of ranks the barrier will apply to. The identifier of the underlyingbarrier to use is provided byindex (it should be different for eachcoop; typically set toblockIdx.xtoensure uniqueness between CTAs).multimem requests a hardware-accelerated implementation using memory multicast.
- void
arrive(Coop, cuda::memory_order order)¶Signals the arrival of the thread at the barrier session.
- void
wait(Coop, cuda::memory_order order)¶Blocks until all threads of all team members arrive at the barrier session.
- void
sync(Coop, cuda::memory_order order)¶Synchronizes all threads of all team members that participate in the barrier session (combines
arrive()andwait()).
ncclGetPeerPointer¶
- void*
ncclGetPeerPointer(ncclWindow_t w, size_t offset, int peer)¶
Returns a load/store accessible pointer to the memory buffer of devicepeer within the windoww.offset isbyte-based.peer is a rank index within the world team (seeTeams). This function will return NULL ifthepeer is not within the LSA team.
ncclGetLsaPointer¶
- void*
ncclGetLsaPointer(ncclWindow_t w, size_t offset, int lsaPeer)¶
Returns a load/store accessible pointer to the memory buffer of devicelsaPeer within the windoww.offset isbyte-based. This is similar toncclGetPeerPointer(), but herelsaPeer is a rank index with the LSA team (seeTeams).
ncclGetLocalPointer¶
- void*
ncclGetLocalPointer(ncclWindow_t w, size_t offset)¶
Returns a load-store accessible pointer to the memory buffer of the current device within the windoww.offset isbyte-based. This is just a shortcut version ofncclGetPeerPointer() withdevComm.rank aspeer, orncclGetLsaPointer() withdevComm.lsaRank aslsaPeer.
Multimem¶
ncclGetLsaMultimemPointer¶
- void*
ncclGetLsaMultimemPointer(ncclWindow_t w, size_t offset,ncclDevComm const& devComm)¶
Returns a multicast memory pointer associated with the windoww and device communicatordevComm.offsetis byte-based. Availability of multicast memory is hardware-dependent.
GIN¶
GIN is supported since NCCL 2.28.7.
ncclGin¶
ncclGin¶
A class encompassing major elements of the GIN support.
ncclGin(ncclDevComm const& comm, int contextIndex)Initializes a new
ncclGinobject.comm is the device communicator created usingncclDevCommCreate().contextIndex is the index of the GIN context – a network communication channel. Using multiple GIN contexts allowsthe implementation to spread traffic onto multiple connections, avoiding locking and bottlenecks. Therefore,performance-oriented kernels should cycle among the available contexts to improve resource utilization (the number ofavailable contexts is available viaginContextCount).
- void
put(ncclTeam team, int peer,ncclWindow_t dstWnd, size_t dstOffset,ncclWindow_t srcWnd, size_t srcOffset, size_t bytes, [...])¶Schedules a device-initiated, one-sided data transfer operation from a local buffer to a remote buffer on a peer.
peer is a rank withinteam (seeTeams); it may refer to the local rank (a loopback). The destinationand source buffers are each specified using the window (dstWnd,srcWnd) and a byte-based offset (dstOffset,srcOffset).size specifies the data transfer count in bytes.
Arguments beyond that are optional; we focus here on the first three.remoteAction andlocalAction specify actionsto undertake on the destination peer and on the local rank when the payload has been settled and the input has beenconsumed (respectively). They default to
ncclGin_None(no action); other options includencclGin_Signal{Inc|Add}(forremoteAction) andncclGin_CounterInc(forlocalAction); seeSignals and Counters below for more details.coop indicates the set of threads participating in this operation (seeThread Groups); it defaults toncclCoopThread(a single device thread), which is the recommended model.The visibility of the signal on the destination peer implies the visibility of the put data it is attached toand allthe preceding puts to the same peer, provided that they were issued using the same GIN context.
The API also defines an alternative, “convenience” variant of this method that uses
ncclSymPtrtypes to specify thebuffers and expects size to be conveyed in terms of the number of elements instead of the byte count. There are alsotwoputValuevariants that take a single element at a time (no greater than eight bytes), passed by value.
- void
flush(Coop coop, cuda::memory_order ord = cuda::memory_order_acquire)¶Ensures that all the pending transfer operations scheduled by any threads ofcoop are locally consumed, meaning thattheir source buffers are safe to reuse. Makes no claims regarding the completion status on the remote peer(s).
Signals and Counters¶
ncclGinSignal_t¶
Signals are used to trigger actions on remote peers, most commonly on the completion of aput operation. They eachhave a 64-bit integer value associated with them that can be manipulated atomically.
ncclGin_SignalAdd { ncclGinSignal_t signal; uint64_t value; }
ncclGin_SignalInc { ncclGinSignal_t signal; }These objects can be passed as theremoteAction arguments of methods such as
putandsignalto describe theactions to perform on the peer on receipt – in this case, increase the value of asignal specified byindex.ncclGin_SignalInc{signalIdx}is functionally equivalent toncclGin_SignalAdd{signalIdx,1}; however, itmay not be mixed with other signal-modifying operations without an intervening signal reset (see below). Signal valuesuse “rolling” comparison logic to ensure that an unsigned overflow maintains the property ofx<x+1.
- void
signal(ncclTeam team, int peer, RemoteAction remoteAction, Coop coop = ncclCoopThread(), [...])¶
- uint64_t
readSignal(ncclGinSignal_t signal, int bits=64, cuda::memory_order ord = cuda::memory_order_acquire)¶
- void
waitSignal(Coop coop,ncclGinSignal_t signal, uint64_t least, int bits=64, cuda::memory_order ord = cuda::memory_order_acquire)¶
- void
resetSignal(ncclGinSignal_t signal)¶These are signal-specific methods of
ncclGin.signalimplements an explicit signal notification withoutan accompanying data transfer operation; it takes a subset of arguments ofput().readSignalreturns thebottombits of the value of thesignal.waitSignalwaits for the bottombits of thesignal value to meetor exceedleast. Finally,resetSignalresets thesignal value to0(this method may not race withconcurrent modifications to the signal).
ncclGinCounter_t¶
Counters are used to trigger actions on the local rank; as such, they are complementary to signals, which are meant forremote actions. Like signals, they use “rolling” comparison logic, but they are limited to storing values of at most 56bits.
ncclGin_CounterInc { ncclGinCounter_t counter; }This object can be passed as thelocalAction argument of methods such as
put(). It is the only actiondefined for counters.
- uint64_t
readCounter(ncclGinCounter_t counter, int bits=56, cuda::memory_order ord = cuda::memory_order_acquire)¶
- void
waitCounter(Coop coop,ncclGinCounter_t counter, uint64_t least, int bits=56, cuda::memory_order ord = cuda::memory_order_acquire)¶
- void
resetCounter(ncclGinCounter_t counter)¶These are counter-specific methods of
ncclGinand they are functionally equivalent to their signalcounterparts discussed above.
ncclGinBarrierSession¶
ncclGinBarrierSession¶
A class representing a network barrier session.
ncclGinBarrierSession(Coop coop,ncclGin gin, ncclTeamTagRail, uint32_t index)Initializes a new network barrier session.coop represents a cooperative group (seeThread Groups).gin isa previously initialized
ncclGinobject.ncclTeamTagRail indicates that the barrier will apply to allpeers on the same rail as the local rank (seeTeams).index identifies the underlying barrier to use(it should be different for eachcoop; typically set toblockIdx.xto ensure uniqueness between CTAs).
ncclGinBarrierSession(Coop coop,ncclGin gin, ncclTeam team, ncclGinBarrierHandle handle, uint32_t index)Initializes a new network barrier session. This is the general-purpose variant to be used, e.g., when communicatingwith ranks from the world team (seeTeams), whereas the previous variant was specific to the rail team.This variant expectsteam to be passed as an argument, and also takes an extrahandle argument indicating thelocation of the underlying barriers (typically set to the
railGinBarrierfield of the device communicator).
- void
sync(Coop coop, cuda::memory_order order, ncclGinFenceLevel fence)Synchronizes all threads of all team members that participate in the barrier session.
ncclGinFenceLevel::Relaxedisthe only defined value forfence for now.