CUDA Array Interface (Version 3)

CUDA Built-in Target deprecation notice

The CUDA target built-in to Numba is deprecated, with further development moved to the NVIDIA numba-cuda package. Please see Built-in CUDA target deprecation and maintenance status.

The CUDA Array Interface (or CAI) is created for interoperability between different implementations of CUDA array-like objects in various projects. The idea is borrowed from the NumPy array interface.

Note

Currently, we only define the Python-side interface. In the future, we may add a C-side interface for efficient exchange of the information in compiled code.

Python Interface Specification

Note

Experimental feature. Specification may change.

The __cuda_array_interface__ attribute returns a dictionary (dict) that must contain the following entries:

  • shape: (integer, ...)

    A tuple of int (or long) representing the size of each dimension.

  • typestr: str

    The type string. This has the same definition as typestr in the NumPy array interface.

  • data: (integer, boolean)

    The data is a 2-tuple. The first element is the data pointer as a Python int (or long). The data must be device-accessible. For zero-size arrays, use 0 here. The second element is the read-only flag as a Python bool.

    Because the user of the interface may or may not be in the same context, the most common case is to use cuPointerGetAttribute with CU_POINTER_ATTRIBUTE_DEVICE_POINTER in the CUDA driver API (or the equivalent CUDA Runtime API) to retrieve a device pointer that is usable in the currently active context.

  • version: integer

    An integer for the version of the interface being exported. The current version is 3.
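As a concrete illustration, the required entries can be sketched with a minimal, hypothetical Producer class. The class name and pointer value below are made up for illustration; a real Producer would report the address of an actual device allocation:

```python
class MinimalProducer:
    """Sketch of an object exporting only the required CAI v3 entries.

    The pointer value passed in is a made-up placeholder; a real
    Producer reports the address of a live device allocation.
    """

    def __init__(self, device_ptr, nelems):
        self._ptr = device_ptr
        self._nelems = nelems

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": (self._nelems,),
            "typestr": "<f4",  # little-endian float32
            # (pointer, read-only flag); the pointer must be 0 for
            # zero-size arrays
            "data": (self._ptr if self._nelems else 0, False),
            "version": 3,
        }


iface = MinimalProducer(0x7F0000000000, 16).__cuda_array_interface__
```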

The following are optional entries:

  • strides: None or (integer, ...)

    If strides is not given, or it is None, the array is in C-contiguous layout. Otherwise, a tuple of int (or long) explicitly gives, for each dimension, the number of bytes to skip to access the next element.

  • descr

    This is for describing more complicated types. This follows the same specification as in the NumPy array interface.

  • mask: None or object exposing the __cuda_array_interface__

    If None then all values in data are valid. All elements of the mask array should be interpreted only as true or not true, indicating which elements of this array are valid. This has the same definition as mask in the NumPy array interface.

    Note

    Numba does not currently support working with masked CUDA arrays and will raise a NotImplementedError exception if one is passed to a GPU function.

  • stream: None or integer

    An optional stream upon which synchronization must take place at the point of consumption, either by synchronizing on the stream or enqueuing operations on the data on the given stream. Integer values in this entry are as follows:

    • 0: This is disallowed as it would be ambiguous between None and the default stream, and also between the legacy and per-thread default streams. Any use case where 0 might be given should either use None, 1, or 2 instead for clarity.

    • 1: The legacy default stream.

    • 2: The per-thread default stream.

    • Any other integer: a cudaStream_t represented as a Python integer.

    When None, no synchronization is required. See the Synchronization section below for further details.

    In a future revision of the interface, this entry may be expanded (or another entry added) so that an event to synchronize on can be specified instead of a stream.
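The rules for interpreting the stream entry can be summarized in a short sketch. classify_stream is a hypothetical helper name, not part of any library:

```python
def classify_stream(interface):
    """Interpret the 'stream' entry of a CAI dictionary.

    Returns None when no synchronization is required; otherwise returns
    the stream the Consumer must synchronize on (or enqueue work in):
    1 is the legacy default stream, 2 is the per-thread default stream,
    and any other integer is a cudaStream_t handle.
    """
    stream = interface.get("stream")  # an absent entry is equivalent to None
    if stream is None:
        return None  # no synchronization required
    if stream == 0:
        # 0 is ambiguous between None and the default stream, and between
        # the legacy and per-thread default streams, so it is disallowed.
        raise ValueError("a stream entry of 0 is disallowed by CAI v3")
    return stream
```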

Synchronization

Definitions

When discussing synchronization, the following definitions are used:

  • Producer: The library / object on which __cuda_array_interface__ is accessed.

  • Consumer: The library / function that accesses the __cuda_array_interface__ of the Producer.

  • User Code: Code that induces a Producer and Consumer to share data through the CAI.

  • User: The person writing or maintaining the User Code. The User may implement User Code without knowledge of the CAI, since the CAI accesses can be hidden from their view.

In the following example:

import cupy
from numba import cuda


@cuda.jit
def add(x, y, out):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]


a = cupy.arange(10)
b = a * 2
out = cupy.zeros_like(a)

add[1, 32](a, b, out)

When the add kernel is launched:

  • a, b and out are Producers.

  • The add kernel is the Consumer.

  • The User Code is specifically add[1, 32](a, b, out).

  • The author of the code is the User.

Design Motivations

Elements of the CAI design related to synchronization seek to fulfill these requirements:

  1. Producers and Consumers that exchange data through the CAI must be able to do so without data races.

  2. Requirement 1 should be met without requiring the user to be aware of any particulars of the CAI - in other words, exchanging data between Producers and Consumers that operate on data asynchronously should be correct by default.

    • An exception to this requirement is made for Producers and Consumers that explicitly document that the User is required to take additional steps to ensure correctness with respect to synchronization. In this case, Users are required to understand the details of the CUDA Array Interface, and the Producer/Consumer library documentation must specify the steps that Users are required to take.

      Use of this exception should be avoided where possible, as it is provided for libraries that cannot implement the synchronization semantics without the involvement of the User - for example, those interfacing with third-party libraries oblivious to the CUDA Array Interface.

  3. Where the User is aware of the particulars of the CAI and implementation details of the Producer and Consumer, they should be able to, at their discretion, override some of the synchronization semantics of the interface to reduce the synchronization overhead. Overriding synchronization semantics implies that:

    • The CAI design, and the design and implementation of the Producer and Consumer do not specify or guarantee correctness with respect to data races.

    • Instead, the User is responsible for ensuring correctness with respect to data races.

Interface Requirements

The stream entry enables Producers and Consumers to avoid hazards when exchanging data. Expected behaviour of the Consumer is as follows:

  • When stream is not present or is None:

    • No synchronization is required on the part of the Consumer.

    • The Consumer may enqueue operations on the underlying data immediately on any stream.

  • When stream is an integer, its value indicates the stream on which the Producer may have in-progress operations on the data, and which the Consumer is expected to either:

    • Synchronize on before accessing the data, or

    • Enqueue operations in when accessing the data.

    The Consumer can choose which mechanism to use, with the following considerations:

    • If the Consumer synchronizes on the provided stream prior to accessing the data, then it must ensure that no computation can take place in the provided stream until its operations in its own choice of stream have taken place. This could be achieved by either:

      • Placing a wait on an event in the provided stream that occurs once all of the Consumer's operations on the data are completed, or

      • Avoiding returning control to the user code until after its operations on its own stream have completed.

    • If the Consumer chooses to only enqueue operations on the data in the provided stream, then it may return control to the User code immediately after enqueueing its work, as the work will all be serialized on the exported array's stream. This is sufficient to ensure correctness even if the User code were to induce the Producer to subsequently start enqueueing more work on the same stream.

  • If the User has set the Consumer to ignore CAI synchronization semantics, the Consumer may assume it can operate on the data immediately in any stream with no further synchronization, even if the stream member has an integer value.

When exporting an array through the CAI, Producers must ensure that:

  • If there is work on the data enqueued in one or more streams, then synchronization on the provided stream is sufficient to ensure synchronization with all pending work.

    • If the Producer has no enqueued work, or work only enqueued on the stream identified by stream, then this condition is met.

    • If the Producer has enqueued work on the data on multiple streams, then it must enqueue events on those streams that follow the enqueued work, and then wait on those events in the provided stream. For example:

      1. Work is enqueued by the Producer on streams 7, 9, and 15.

      2. Events are then enqueued on each of streams 7, 9, and 15.

      3. The Producer then tells stream 3 to wait on the events from Step 2, and the stream entry is set to 3.

  • If there is no work enqueued on the data, then the stream entry may be either None, or not provided.

Optionally, to facilitate the User relaxing conformance to synchronization semantics:

  • Producers may provide a configuration option to always set stream to None.

  • Consumers may provide a configuration option to ignore the value of stream and act as if it were None or not provided. This elides synchronization on the Producer-provided streams, and allows enqueuing work on streams other than that provided by the Producer.

These options should not be set by default in either a Producer or a Consumer. The CAI specification does not prescribe the exact mechanism by which these options are set, or related options that Producers or Consumers might provide to allow the user further control over synchronization behavior.
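A Producer-side option of this kind might look like the following sketch; the class, the flag name, the pointer, and the stream handle are all hypothetical:

```python
class ConfigurableProducer:
    """Sketch of a Producer with an option to always export stream=None.

    All values here are placeholders: a real Producer would export a
    live device pointer and the handle of the stream on which its
    pending work on the data is enqueued.
    """

    def __init__(self, export_stream=True):
        self._export_stream = export_stream  # hypothetical config option
        self._stream_handle = 7              # made-up cudaStream_t handle

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": (8,),
            "typestr": "<f4",
            "data": (0x7F0000001000, False),  # placeholder pointer
            "version": 3,
            # When the option is disabled, the User takes over
            # responsibility for synchronizing with in-flight work.
            "stream": self._stream_handle if self._export_stream else None,
        }
```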

Synchronization in Numba

Numba is neither strictly a Producer nor a Consumer - it may be used to implement either by a User. In order to facilitate the correct implementation of synchronization semantics, Numba exhibits the following behaviors related to synchronization of the interface:

  • When Numba acts as a Consumer (for example when an array-like object is passed to a kernel launch): If stream is an integer, then Numba will immediately synchronize on the provided stream. A Numba Device Array created from an array-like object has its default stream set to the provided stream.

  • When Numba acts as a Producer (when the __cuda_array_interface__ property of a Numba CUDA Array is accessed): If the exported CUDA Array has a default stream, then it is given as the stream entry. Otherwise, stream is set to None.

Note

In Numba's terminology, an array's default stream is a property specifying the stream that Numba will enqueue asynchronous transfers in if no other stream is provided as an argument to the function invoking the transfer. It is not the same as the Default Stream in normal CUDA terminology.

Numba's synchronization behavior results in the following intended consequences:

  • Exchanging data either as a Producer or a Consumer will be correct without the need for any further action from the User, provided that the other side of the interaction also follows the CAI synchronization semantics.

  • The User is expected to either:

    • Avoid launching kernels or other operations on streams that are not the default stream for their parameters, or

    • When launching operations on a stream that is not the default stream for a given parameter, they should then insert an event into the stream that they are operating in, and wait on that event in the default stream for the parameter. For an example of this, see below.

The User may override Numba's synchronization behavior by setting the environment variable NUMBA_CUDA_ARRAY_INTERFACE_SYNC or the config variable CUDA_ARRAY_INTERFACE_SYNC to 0 (see GPU Support Environment Variables). When set, Numba will not synchronize on the streams of imported arrays, and it is the responsibility of the user to ensure correctness with respect to stream synchronization. Synchronization when creating a Numba CUDA Array from an object exporting the CUDA Array Interface may also be elided by passing sync=False when creating the Numba CUDA Array with numba.cuda.as_cuda_array() or numba.cuda.from_cuda_array_interface().
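For example, the environment variable route can be taken before starting the Python process; this is a sketch of a shell session, and what the process then does with imported arrays is up to the User:

```shell
# Tell Numba not to synchronize on the streams of imported arrays;
# the User then owns correctness with respect to stream synchronization.
export NUMBA_CUDA_ARRAY_INTERFACE_SYNC=0
```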

There is scope for Numba's synchronization implementation to be optimized in the future, by eliding synchronizations when a kernel or driver API operation (e.g. a memcopy or memset) is launched on the same stream as an imported array.

An example launching on an array’s non-default stream

This example shows how to ensure that a Consumer can safely consume an array with a default stream when it is passed to a kernel launched in a different stream.

First we need to import Numba and a consumer library (a fictitious library named other_cai_library for this example):

from numba import cuda, int32, void

import other_cai_library

Now we'll define a kernel - this initializes the elements of the array, setting each entry to its index:

@cuda.jit(void, int32[::1])
def initialize_array(x):
    i = cuda.grid(1)
    if i < len(x):
        x[i] = i

Next we will create two streams:

array_stream = cuda.stream()
kernel_stream = cuda.stream()

Then create an array with one of the streams as its default stream:

N = 16384
x = cuda.device_array(N, stream=array_stream)

Now we launch the kernel in the other stream:

nthreads = 256
nblocks = N // nthreads
initialize_array[nblocks, nthreads, kernel_stream](x)

If we were to pass x to a Consumer now, there is a risk that it may operate on it in array_stream whilst the kernel is still running in kernel_stream. To prevent operations in array_stream starting before the kernel launch is finished, we create an event and wait on it:

# Create the event
evt = cuda.event()
# Record the event after the kernel launch in kernel_stream
evt.record(kernel_stream)
# Wait for the event in array_stream
evt.wait(array_stream)

It is now safe for other_cai_library to consume x:

other_cai_library.consume(x)

Lifetime management

Data

Obtaining the value of the __cuda_array_interface__ property of any object has no effect on the lifetime of the object from which it was created. In particular, note that the interface has no slot for the owner of the data.

The User code must preserve the lifetime of the object owning the data for as long as the Consumer might use it.
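The consequence of the missing owner slot can be seen in a small sketch: the exported dictionary holds only a raw integer, not a reference to the producing object, so dropping the last reference to the producer leaves that integer dangling. The producer class and pointer value below are hypothetical:

```python
import weakref


class HypotheticalProducer:
    """Stand-in Producer; the pointer value is a made-up placeholder."""

    @property
    def __cuda_array_interface__(self):
        return {"shape": (4,), "typestr": "<f4",
                "data": (0x7F0000002000, False), "version": 3}


arr = HypotheticalProducer()
ref = weakref.ref(arr)
iface = arr.__cuda_array_interface__
del arr  # drop the last strong reference to the producer

# The interface dict keeps no reference back to its producer, so the
# producer can be collected; in a real program the raw pointer in
# iface["data"] could now refer to freed device memory.
```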

Streams

Like data, CUDA streams also have a finite lifetime. It is therefore required that a Producer exporting data on the interface with an associated stream ensures that the exported stream's lifetime is equal to or surpasses the lifetime of the object from which the interface was exported.

Lifetime management in Numba

Producing Arrays

Numba takes no steps to maintain the lifetime of an object from which the interface is exported - it is the user's responsibility to ensure that the underlying object is kept alive for the duration that the exported interface might be used.

The lifetime of any Numba-managed stream exported on the interface is guaranteed to equal or surpass the lifetime of the underlying object, because the underlying object holds a reference to the stream.

Note

Numba-managed streams are those created with cuda.default_stream(), cuda.legacy_default_stream(), or cuda.per_thread_default_stream(). Streams not managed by Numba are created from an external stream with cuda.external_stream().

Consuming Arrays

Numba provides two mechanisms for creating device arrays from objects exporting the CUDA Array Interface. Which to use depends on whether the created device array should maintain the lifetime of the object from which it is created:

  • as_cuda_array: This creates a device array that holds a reference to the owning object. As long as a reference to the device array is held, its underlying data will also be kept alive, even if all other references to the original owning object have been dropped.

  • from_cuda_array_interface: This creates a device array with no reference to the owning object by default. The owning object, or some other object to be considered the owner, can be passed in the owner parameter.

The interfaces of these functions are:

cuda.as_cuda_array(obj, sync=True)

Create a DeviceNDArray from any object that implements the cuda array interface.

A view of the underlying GPU buffer is created. No copying of the data is done. The resulting DeviceNDArray will acquire a reference from obj.

If sync is True, then the imported stream (if present) will be synchronized.

cuda.from_cuda_array_interface(desc, owner=None, sync=True)

Create a DeviceNDArray from a cuda-array-interface description. The owner is the owner of the underlying memory. The resulting DeviceNDArray will acquire a reference from it.

If sync is True, then the imported stream (if present) will be synchronized.

Pointer Attributes

Additional information about the data pointer can be retrieved using cuPointerGetAttribute or cudaPointerGetAttributes. Such information includes:

  • the CUDA context that owns the pointer;

  • whether the pointer is host-accessible;

  • whether the pointer refers to managed memory.

Differences with CUDA Array Interface (Version 0)

Version 0 of the CUDA Array Interface did not have the optional mask attribute to support masked arrays.

Differences with CUDA Array Interface (Version 1)

Versions 0 and 1 of the CUDA Array Interface neither clarified the strides attribute for C-contiguous arrays nor specified the treatment for zero-size arrays.

Differences with CUDA Array Interface (Version 2)

Prior versions of the CUDA Array Interface made no statement about synchronization.

Interoperability

The following Python libraries have adopted the CUDA Array Interface:

If your project is not on this list, please feel free to report it on the Numba issue tracker.