Memory management
CUDA Built-in Target deprecation notice
The CUDA target built-in to Numba is deprecated, with further development moved to the NVIDIA numba-cuda package. Please see Built-in CUDA target deprecation and maintenance status.
Data transfer
Even though Numba can automatically transfer NumPy arrays to the device, it can only do so conservatively by always transferring device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, you can use the following APIs to manually control the transfer:
- numba.cuda.device_array(shape, dtype=np.float64, strides=None, order='C', stream=0)
Allocate an empty device ndarray. Similar to numpy.empty(). (A combined usage sketch appears at the end of this list.)
- numba.cuda.device_array_like(ary, stream=0)
Call device_array() with information from the array.
- numba.cuda.to_device(obj, stream=0, copy=True, to=None)
Allocate and transfer a numpy ndarray or structured scalar to the device.
To copy host->device a numpy array:
ary = np.arange(10)
d_ary = cuda.to_device(ary)
To enqueue the transfer to a stream:
stream = cuda.stream()
d_ary = cuda.to_device(ary, stream=stream)
The resulting d_ary is a DeviceNDArray.
To copy device->host:
hary = d_ary.copy_to_host()
To copy device->host to an existing array:
ary = np.empty(shape=d_ary.shape, dtype=d_ary.dtype)
d_ary.copy_to_host(ary)
To enqueue the transfer to a stream:
hary = d_ary.copy_to_host(stream=stream)
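A combined sketch of the allocation APIs above; my_kernel is a hypothetical kernel, not part of the API:

import numpy as np
from numba import cuda

ary = np.arange(10, dtype=np.float64)

# Device-only buffers: uninitialized, nothing is copied to or from the host
d_out = cuda.device_array(10, dtype=np.float64)   # like np.empty, but on the GPU
d_tmp = cuda.device_array_like(ary)               # same shape/dtype as ary

d_in = cuda.to_device(ary)          # explicit host->device copy
# my_kernel[1, 10](d_in, d_out)     # hypothetical kernel filling d_out
result = d_out.copy_to_host()       # copy back only the array we need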
In addition to the device arrays, Numba can consume any object that implements the cuda array interface. These objects can also be manually converted into a Numba device array by creating a view of the GPU buffer using the following APIs:
- numba.cuda.as_cuda_array(obj, sync=True)
Create a DeviceNDArray from any object that implements the cuda array interface.
A view of the underlying GPU buffer is created. No copying of the data is done. The resulting DeviceNDArray will acquire a reference from obj.
If sync is True, then the imported stream (if present) will be synchronized.
- numba.cuda.is_cuda_array(obj)
Test if the object has defined the __cuda_array_interface__ attribute.
Does not verify the validity of the interface.
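For example, a zero-copy interop sketch, assuming CuPy is installed (CuPy arrays expose __cuda_array_interface__):

import cupy as cp
from numba import cuda

cp_arr = cp.arange(10)              # a GPU array owned by CuPy
assert cuda.is_cuda_array(cp_arr)   # the interface attribute is present
d_arr = cuda.as_cuda_array(cp_arr)  # Numba view of the same GPU buffer; no copy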
Device arrays
Device array references have the following methods. These methods are to be called in host code, not within CUDA-jitted functions.
- class numba.cuda.cudadrv.devicearray.DeviceNDArray(shape, strides, dtype, stream=0, gpu_data=None)
An on-GPU array type.
- copy_to_host(ary=None, stream=0)
Copy self to ary or create a new NumPy ndarray if ary is None.
If a CUDA stream is given, then the transfer will be made asynchronously as part of the given stream. Otherwise, the transfer is synchronous: the function returns after the copy is finished. Always returns the host array.
Example:
import numpy as np
from numba import cuda

arr = np.arange(1000)
d_arr = cuda.to_device(arr)

my_kernel[100, 100](d_arr)

result_array = d_arr.copy_to_host()
- is_c_contiguous()
Return true if the array is C-contiguous.
- is_f_contiguous()
Return true if the array is Fortran-contiguous.
- ravel(order='C', stream=0)
Flattens a contiguous array without changing its contents, similar to numpy.ndarray.ravel(). If the array is not contiguous, raises an exception.
- reshape(*newshape, **kws)
Reshape the array without changing its contents, similarly to numpy.ndarray.reshape(). Example: d_arr = d_arr.reshape(20, 50, order='F'). (A combined usage sketch for ravel() and reshape() appears after the note below.)
Note
DeviceNDArray defines the cuda array interface.
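A minimal usage sketch for ravel() and reshape() on a device array (shapes are arbitrary):

import numpy as np
from numba import cuda

d_arr = cuda.to_device(np.arange(100).reshape(10, 10))

d_flat = d_arr.ravel()          # 1-D view; raises if the array is not contiguous
d_resh = d_arr.reshape(5, 20)   # same contents, new shape; no host round-trip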
Pinned memory
- numba.cuda.pinned(*arylist)
A context manager for temporarily pinning a sequence of host ndarrays.
- numba.cuda.pinned_array(shape, dtype=np.float64, strides=None, order='C')
Allocate an ndarray with a buffer that is pinned (pagelocked). Similar to np.empty().
- numba.cuda.pinned_array_like(ary)
Call pinned_array() with the information from the array.
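For example, a sketch that temporarily pins an existing host array so a transfer can be enqueued asynchronously from it:

import numpy as np
from numba import cuda

arr = np.arange(10_000, dtype=np.float32)

with cuda.pinned(arr):                          # page-lock arr inside this block
    stream = cuda.stream()
    d_arr = cuda.to_device(arr, stream=stream)  # async copy from pinned memory
    stream.synchronize()                        # wait before arr is unpinned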
Mapped memory
- numba.cuda.mapped(*arylist, **kws)
A context manager for temporarily mapping a sequence of host ndarrays.
- numba.cuda.mapped_array(shape, dtype=np.float64, strides=None, order='C', stream=0, portable=False, wc=False)
Allocate a mapped ndarray with a buffer that is pinned and mapped onto the device. Similar to np.empty().
- Parameters
portable – a boolean flag to allow the allocated device memory to be usable in multiple devices.
wc – a boolean flag to enable writecombined allocation, which is faster for the host to write and for the device to read, but slower for the host to read and for the device to write.
- numba.cuda.mapped_array_like(ary, stream=0, portable=False, wc=False)
Call mapped_array() with the information from the array.
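A sketch of mapped memory use; my_kernel is a placeholder for any kernel that accesses the buffer:

import numpy as np
from numba import cuda

# Host allocation that is pinned and mapped into the device address space,
# so kernels can access it directly (zero-copy) without an explicit transfer.
m_arr = cuda.mapped_array(1_000_000, dtype=np.float32)
m_arr[:] = 0.0                  # writable from the host like any ndarray
# my_kernel[256, 256](m_arr)    # hypothetical kernel reading/writing m_arr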
Managed memory
- numba.cuda.managed_array(shape, dtype=np.float64, strides=None, order='C', stream=0, attach_global=True)
Allocate a np.ndarray with a buffer that is managed. Similar to np.empty().
Managed memory is supported on Linux / x86 and PowerPC, and is considered experimental on Windows and Linux / AArch64.
- Parameters
attach_global – A flag indicating whether to attach globally. Global attachment implies that the memory is accessible from any stream on any device. If False, attachment is host, and memory is only accessible by devices with Compute Capability 6.0 and later.
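A sketch of managed (unified) memory use; my_kernel is a hypothetical kernel, and the explicit synchronize ensures device work finishes before the host touches the data again:

import numpy as np
from numba import cuda

man_arr = cuda.managed_array(1_000_000, dtype=np.float32)
man_arr[:] = 1.0                # initialized from the host

# my_kernel[256, 256](man_arr)  # hypothetical kernel using the same buffer
cuda.synchronize()              # ensure device work is done before host access
print(man_arr[0])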
Streams
Streams can be passed to functions that accept them (e.g. copies between the host and device) and into kernel launch configurations so that the operations are executed asynchronously.
- numba.cuda.stream()
Create a CUDA stream that represents a command queue for the device.
- numba.cuda.default_stream()
Get the default CUDA stream. CUDA semantics in general are that the default stream is either the legacy default stream or the per-thread default stream depending on which CUDA APIs are in use. In Numba, the APIs for the legacy default stream are always the ones in use, but an option to use APIs for the per-thread default stream may be provided in future.
- numba.cuda.legacy_default_stream()
Get the legacy default CUDA stream.
- numba.cuda.per_thread_default_stream()
Get the per-thread default CUDA stream.
- numba.cuda.external_stream(ptr)
Create a Numba stream object for a stream allocated outside Numba.
- Parameters
ptr (int) – Pointer to the external stream to wrap in a Numba Stream
CUDA streams have the following methods:
- class numba.cuda.cudadrv.driver.Stream(context, handle, finalizer, external=False)
- auto_synchronize()
A context manager that waits for all commands in this stream to execute and commits any pending memory transfers upon exiting the context.
- synchronize()
Wait for all commands in this stream to execute. This will commit any pending memory transfers.
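For example, a sketch combining a stream with auto_synchronize() to enqueue transfers and then wait for them:

import numpy as np
from numba import cuda

stream = cuda.stream()
ary = np.arange(10)

with stream.auto_synchronize():                 # blocks on exit until work completes
    d_ary = cuda.to_device(ary, stream=stream)  # async host->device copy
    hary = d_ary.copy_to_host(stream=stream)    # async device->host copy
# here both transfers have finished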
Shared memory and thread synchronization
A limited amount of shared memory can be allocated on the device to speed up access to data, when necessary. That memory will be shared (i.e. both readable and writable) amongst all threads belonging to a given block and has faster access times than regular device memory. It also allows threads to cooperate on a given solution. You can think of it as a manually-managed data cache.
The memory is allocated once for the duration of the kernel, unliketraditional dynamic memory management.
- numba.cuda.shared.array(shape, type)
Allocate a shared array of the given shape and type on the device. This function must be called on the device (i.e. from a kernel or device function). shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. A “simple constant expression” includes, but is not limited to:
- A literal (e.g. 10)
- A local variable whose right-hand side is a literal or a simple constant expression (e.g. shape, where shape is defined earlier in the function as shape = 10)
- A global variable that is defined in the jitted function’s globals by the time of compilation (e.g. shape, where shape is defined using any expression at global scope).
The definition must result in a Python int (i.e. not a NumPy scalar or other scalar / integer-like type).
type is a Numba type of the elements needing to be stored in the array. The returned array-like object can be read and written to like any normal device array (e.g. through indexing).
A common pattern is to have each thread populate one element in the shared array and then wait for all threads to finish using syncthreads().
- numba.cuda.syncthreads()
Synchronize all threads in the same thread block. This function implements the same pattern as barriers in traditional multi-threaded programming: this function waits until all threads in the block call it, at which point it returns control to all its callers.
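As an illustration of that pattern, here is a sketch (not from the library docs) in which each thread writes one element to a shared array, the block synchronizes, and each thread then reads an element written by a different thread:

import numpy as np
from numba import cuda

@cuda.jit
def reverse_block(x):
    # Static shared array; 128 is a simple constant expression
    sm = cuda.shared.array(128, dtype=np.float32)
    i = cuda.threadIdx.x
    sm[i] = x[i]                          # each thread populates one element
    cuda.syncthreads()                    # wait until every thread has written
    x[i] = sm[cuda.blockDim.x - 1 - i]    # safe to read other threads' elements

arr = np.arange(128, dtype=np.float32)
reverse_block[1, 128](arr)                # arr is reversed in place (auto-transferred)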
Dynamic Shared Memory
In order to use dynamic shared memory in kernel code declare a shared array ofsize 0:
@cuda.jit
def kernel_func(x):
    dyn_arr = cuda.shared.array(0, dtype=np.float32)
    ...
and specify the size of dynamic shared memory in bytes during kernel invocation:
kernel_func[32, 32, 0, 128](x)
In the above code the kernel launch is configured with 4 parameters:
kernel_func[grid_dim, block_dim, stream, dyn_shared_mem_size]
Note: all dynamic shared memory arrays alias, so if you want to have multiple dynamic shared arrays, you need to take disjoint views of the arrays. For example, consider:
from numba import cuda
import numpy as np

@cuda.jit
def f():
    f32_arr = cuda.shared.array(0, dtype=np.float32)
    i32_arr = cuda.shared.array(0, dtype=np.int32)
    f32_arr[0] = 3.14
    print(f32_arr[0])
    print(i32_arr[0])

f[1, 1, 0, 4]()
cuda.synchronize()
This allocates 4 bytes of shared memory (large enough for one int32 or one float32) and declares dynamic shared memory arrays of type int32 and of type float32. When f32_arr[0] is set, this also sets the value of i32_arr[0], because they’re pointing at the same memory. So we see as output:
3.1400001
1078523331
because 1078523331 is the int32 represented by the bits of the float32 value 3.14.
If we take disjoint views of the dynamic shared memory:
from numba import cuda
import numpy as np

@cuda.jit
def f_with_view():
    f32_arr = cuda.shared.array(0, dtype=np.float32)
    i32_arr = cuda.shared.array(0, dtype=np.int32)[1:]  # 1 int32 = 4 bytes
    f32_arr[0] = 3.14
    i32_arr[0] = 1
    print(f32_arr[0])
    print(i32_arr[0])

f_with_view[1, 1, 0, 8]()
cuda.synchronize()
This time we declare 8 dynamic shared memory bytes, using the first 4 for a float32 value and the next 4 for an int32 value. Now we can set both the int32 and float32 values without them aliasing:
3.1400001
1
Local memory
Local memory is an area of memory private to each thread. Using local memory helps allocate some scratchpad area when scalar local variables are not enough. The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.
- numba.cuda.local.array(shape, type)
Allocate a local array of the given shape and type on the device. shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. A “simple constant expression” includes, but is not limited to:
- A literal (e.g. 10)
- A local variable whose right-hand side is a literal or a simple constant expression (e.g. shape, where shape is defined earlier in the function as shape = 10)
- A global variable that is defined in the jitted function’s globals by the time of compilation (e.g. shape, where shape is defined using any expression at global scope).
The definition must result in a Python int (i.e. not a NumPy scalar or other scalar / integer-like type).
type is a Numba type of the elements needing to be stored in the array. The array is private to the current thread. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).
See also
The Local Memory section of Device Memory Accesses in the CUDA programming guide.
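For instance, a minimal sketch (the computation itself is illustrative only) of a per-thread scratchpad:

import numpy as np
from numba import cuda

@cuda.jit
def row_sums(x, out):
    # Each thread gets its own private scratch array in local memory;
    # the shape (16) must be a simple constant expression.
    scratch = cuda.local.array(16, dtype=np.float32)
    i = cuda.grid(1)
    if i < out.size:
        for j in range(16):
            scratch[j] = x[i, j] * 2.0
        s = 0.0
        for j in range(16):
            s += scratch[j]
        out[i] = s

x = np.ones((32, 16), dtype=np.float32)
out = np.zeros(32, dtype=np.float32)
row_sums[1, 32](x, out)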
Constant memory
Constant memory is an area of memory that is read only, cached and off-chip; it is accessible by all threads and is host allocated. A method of creating an array in constant memory is through the use of:
- numba.cuda.const.array_like(arr)
Allocate and make accessible an array in constant memory based on the array-like arr.
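A sketch of the usual pattern, assuming the lookup table LUT (a name chosen for illustration) is a host array available in the kernel’s globals at compile time:

import numpy as np
from numba import cuda

LUT = np.arange(16, dtype=np.float32)    # host array captured at compile time

@cuda.jit
def apply_lut(out):
    c = cuda.const.array_like(LUT)       # read-only copy in constant memory
    i = cuda.grid(1)
    if i < out.size:
        out[i] = c[i] * 2.0

out = np.zeros(16, dtype=np.float32)
apply_lut[1, 16](out)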
Deallocation Behavior
This section describes the deallocation behaviour of Numba’s internal memory management. If an External Memory Management Plugin is in use (see External Memory Management (EMM) Plugin interface), then deallocation behaviour may differ; you may refer to the documentation for the EMM Plugin to understand its deallocation behaviour.
Deallocation of all CUDA resources is tracked on a per-context basis. When the last reference to a device memory allocation is dropped, the underlying memory is scheduled to be deallocated. The deallocation does not occur immediately; it is added to a queue of pending deallocations. This design has two benefits:
The resource deallocation API may cause the device to synchronize, breaking any asynchronous execution. Deferring the deallocation can avoid latency in performance-critical code sections.
Some deallocation errors may cause all remaining deallocations to fail. Continued deallocation errors can cause critical errors at the CUDA driver level. In some cases, this could mean a segmentation fault in the CUDA driver. In the worst case, it could cause the system GUI to freeze, recoverable only with a system reset. When an error occurs during a deallocation, the remaining pending deallocations are cancelled. Any deallocation error will be reported. When the process is terminated, the CUDA driver is able to release all resources allocated by the terminated process.
The deallocation queue is flushed automatically as soon as the following events occur:
An allocation failed due to an out-of-memory error. The allocation is retried after flushing all pending deallocations.
The deallocation queue has reached its maximum size, which defaults to 10. Users can override this by setting the environment variable NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT. For example, NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=20 increases the limit to 20.
The maximum accumulated byte size of resources pending deallocation is reached. This defaults to 20% of the device memory capacity. Users can override this by setting the environment variable NUMBA_CUDA_MAX_PENDING_DEALLOCS_RATIO. For example, NUMBA_CUDA_MAX_PENDING_DEALLOCS_RATIO=0.5 sets the limit to 50% of the capacity.
Sometimes, it is desirable to defer resource deallocation until a code section ends. Most often, users want to avoid any implicit synchronization due to deallocation. This can be done by using the following context manager:
- numba.cuda.defer_cleanup()
Temporarily disable memory deallocation. Use this to prevent resource deallocation from breaking asynchronous execution.
For example:
with defer_cleanup():
    # all cleanup is deferred in here
    do_speed_critical_code()
# cleanup can occur here
Note: this context manager can be nested.