This section describes the memory management functions of the CUDA runtime application programming interface.
Some functions have overloaded C++ API template versions documented separately in theC++ API Routines module.
Returns in*desc,*extent and*flags respectively, the type, shape and flags ofarray.
Any of*desc,*extent and*flags may be specified as NULL.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
Returns the memory requirements of a CUDA array inmemoryRequirements If the CUDA array is not allocated with flagcudaArrayDeferredMappingcudaErrorInvalidValue will be returned.
The returned value incudaArrayMemoryRequirements::size represents the total size of the CUDA array. The returned value incudaArrayMemoryRequirements::alignment represents the alignment necessary for mapping the CUDA array.
See also:
Returns inpPlaneArray a CUDA array that represents a single format plane of the CUDA arrayhArray.
IfplaneIdx is greater than the maximum number of planes in this array or if the array does not have a multi-planar format e.g:cudaChannelFormatKindNV12, thencudaErrorInvalidValue is returned.
Note that if thehArray has formatcudaChannelFormatKindNV12, then passing in 0 forplaneIdx returns a CUDA array of the same size ashArray but with one 8-bit channel andcudaChannelFormatKindUnsigned as its format kind. If 1 is passed forplaneIdx, then the returned CUDA array has half the height and width ofhArray with two 8-bit channels andcudaChannelFormatKindUnsigned as its format kind.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
Returns the layout properties of a sparse CUDA array insparseProperties. If the CUDA array is not allocated with flagcudaArraySparsecudaErrorInvalidValue will be returned.
If the returned value incudaArraySparseProperties::flags containscudaArraySparsePropertiesSingleMipTail, thencudaArraySparseProperties::miptailSize represents the total size of the array. Otherwise, it will be zero. Also, the returned value incudaArraySparseProperties::miptailFirstLevel is always zero. Note that thearray must have been allocated usingcudaMallocArray orcudaMalloc3DArray. For CUDA arrays obtained using cudaMipmappedArrayGetLevel,cudaErrorInvalidValue will be returned. Instead,cudaMipmappedArrayGetSparseProperties must be used to obtain the sparse properties of the entire CUDA mipmapped array to whicharray belongs to.
See also:
Frees the memory space pointed to bydevPtr, which must have been returned by a previous call to one of the following memory allocation APIs -cudaMalloc(),cudaMallocPitch(),cudaMallocManaged(),cudaMallocAsync(),cudaMallocFromPoolAsync().
Note - This API will not perform any implicit synchronization when the pointer was allocated withcudaMallocAsync orcudaMallocFromPoolAsync. Callers must ensure that all accesses to these pointer have completed before invokingcudaFree. For best performance and memory reuse, users should usecudaFreeAsync to free memory allocated via the stream ordered memory allocator. For all other pointers, this API may perform implicit synchronization.
IfcudaFree(devPtr) has already been called before, an error is returned. IfdevPtr is 0, no operation is performed.cudaFree() returns cudaErrorValue in case of failure.
The device version ofcudaFree cannot be used with a*devPtr allocated using the host API, and vice versa.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc,cudaMallocPitch,cudaMallocManaged,cudaMallocArray,cudaFreeArray,cudaMallocAsync,cudaMallocFromPoolAsynccudaMallocHost ( C API),cudaFreeHost,cudaMalloc3D,cudaMalloc3DArray,cudaFreeAsynccudaHostAlloc,cuMemFree
Frees the CUDA arrayarray, which must have been returned by a previous call tocudaMallocArray(). IfdevPtr is 0, no operation is performed.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc,cudaMallocPitch,cudaFree,cudaMallocArray,cudaMallocHost ( C API),cudaFreeHost,cudaHostAlloc,cuArrayDestroy
Frees the memory space pointed to byhostPtr, which must have been returned by a previous call tocudaMallocHost() orcudaHostAlloc().
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc,cudaMallocPitch,cudaFree,cudaMallocArray,cudaFreeArray,cudaMallocHost ( C API),cudaMalloc3D,cudaMalloc3DArray,cudaHostAlloc,cuMemFreeHost
Frees the CUDA mipmapped arraymipmappedArray, which must have been returned by a previous call tocudaMallocMipmappedArray(). IfdevPtr is 0, no operation is performed.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc,cudaMallocPitch,cudaFree,cudaMallocArray,cudaMallocHost ( C API),cudaFreeHost,cudaHostAlloc,cuMipmappedArrayDestroy
Returns in*levelArray a CUDA array that represents a single mipmap level of the CUDA mipmapped arraymipmappedArray.
Iflevel is greater than the maximum number of levels in this mipmapped array,cudaErrorInvalidValue is returned.
IfmipmappedArray is NULL,cudaErrorInvalidResourceHandle is returned.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc3D,cudaMalloc,cudaMallocPitch,cudaFree,cudaFreeArray,cudaMallocHost ( C API),cudaFreeHost,cudaHostAlloc,make_cudaExtent,cuMipmappedArrayGetLevel
Returns in*devPtr the address of symbolsymbol on the device.symbol is a variable that resides in global or constant memory space. Ifsymbol cannot be found, or ifsymbol is not declared in the global or constant memory space,*devPtr is unchanged and the errorcudaErrorInvalidSymbol is returned.
Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a variable as thesymbol parameter was deprecated in CUDA 4.1 and removed in CUDA 5.0.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaGetSymbolAddress ( C++ API),cudaGetSymbolSize ( C API),cuModuleGetGlobal
Returns in*size the size of symbolsymbol.symbol is a variable that resides in global or constant memory space. Ifsymbol cannot be found, or ifsymbol is not declared in global or constant memory space,*size is unchanged and the errorcudaErrorInvalidSymbol is returned.
Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a variable as thesymbol parameter was deprecated in CUDA 4.1 and removed in CUDA 5.0.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaGetSymbolAddress ( C API),cudaGetSymbolSize ( C++ API),cuModuleGetGlobal
Allocatessize bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such ascudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.
Theflags parameter enables different options to be specified that affect the allocation, as follows.
cudaHostAllocDefault: This flag's value is defined to be 0 and causescudaHostAlloc() to emulatecudaMallocHost().
cudaHostAllocPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by callingcudaHostGetDevicePointer().
cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
All of these flags are orthogonal to one another: a developer may allocate memory that is portable, mapped and/or write-combined with no restrictions.
In order for thecudaHostAllocMapped flag to have any effect, the CUDA context must support thecudaDeviceMapHost flag, which can be checked viacudaGetDeviceFlags(). ThecudaDeviceMapHost flag is implicitly set for contexts created via the runtime API.
ThecudaHostAllocMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred tocudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via thecudaHostAllocPortable flag.
Memory allocated by this function must be freed withcudaFreeHost().
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaSetDeviceFlags,cudaMallocHost ( C API),cudaFreeHost,cudaGetDeviceFlags,cuMemHostAlloc
Passes back the device pointer corresponding to the mapped, pinned host buffer allocated bycudaHostAlloc() or registered bycudaHostRegister().
cudaHostGetDevicePointer() will fail if thecudaDeviceMapHost flag was not specified before deferred context creation occurred, or if called on a device that does not support mapped, pinned memory.
For devices that have a non-zero value for the device attributecudaDevAttrCanUseHostPointerForRegisteredMem, the memory can also be accessed from the device using the host pointerpHost. The device pointer returned bycudaHostGetDevicePointer() may or may not match the original host pointerpHost and depends on the devices visible to the application. If all devices visible to the application have a non-zero value for the device attribute, the device pointer returned bycudaHostGetDevicePointer() will match the original pointerpHost. If any device visible to the application has a zero value for the device attribute, the device pointer returned bycudaHostGetDevicePointer() will not match the original host pointerpHost, but it will be suitable for use on all devices provided Unified Virtual Addressing is enabled. In such systems, it is valid to access the memory using either pointer on devices that have a non-zero value for the device attribute. Note however that such devices should access the memory using only of the two pointers and not both.
flags provides for future releases. For now, it must be set to 0.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaHostGetFlags() will fail if the input pointer does not reside in an address range allocated bycudaHostAlloc().
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaSuccess,cudaErrorInvalidValue,cudaErrorMemoryAllocation,cudaErrorHostMemoryAlreadyRegistered,cudaErrorNotSupported
Page-locks the memory range specified byptr andsize and maps it for the device(s) as specified byflags. This memory range also is added to the same tracking mechanism ascudaHostAlloc() to automatically accelerate calls to functions such ascudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory that has not been registered. Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.
On systems where pageableMemoryAccessUsesHostPageTables is true,cudaHostRegister will not page-lock the memory range specified byptr but only populate unpopulated pages.
cudaHostRegister is supported only on I/O coherent devices that have a non-zero value for the device attributecudaDevAttrHostRegisterSupported.
Theflags parameter enables different options to be specified that affect the allocation, as follows.
cudaHostRegisterDefault: On a system with unified virtual addressing, the memory will be both mapped and portable. On a system with no unified virtual addressing, the memory will be neither mapped nor portable.
cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by callingcudaHostGetDevicePointer().
cudaHostRegisterIoMemory: The passed memory pointer is treated as pointing to some memory-mapped I/O space, e.g. belonging to a third-party PCIe device, and it will marked as non cache-coherent and contiguous.
cudaHostRegisterReadOnly: The passed memory pointer is treated as pointing to memory that is considered read-only by the device. On platforms withoutcudaDevAttrPageableMemoryAccessUsesHostPageTables, this flag is required in order to register memory mapped to the CPU as read-only. Support for the use of this flag can be queried from the device attributecudaDevAttrHostRegisterReadOnlySupported. Using this flag with a current context associated with a device that does not have this attribute set will causecudaHostRegister to error with cudaErrorNotSupported.
All of these flags are orthogonal to one another: a developer may page-lock memory that is portable or mapped with no restrictions.
The CUDA context must have been created with the cudaMapHost flag in order for thecudaHostRegisterMapped flag to have any effect.
ThecudaHostRegisterMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred tocudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via thecudaHostRegisterPortable flag.
For devices that have a non-zero value for the device attributecudaDevAttrCanUseHostPointerForRegisteredMem, the memory can also be accessed from the device using the host pointerptr. The device pointer returned bycudaHostGetDevicePointer() may or may not match the original host pointerptr and depends on the devices visible to the application. If all devices visible to the application have a non-zero value for the device attribute, the device pointer returned bycudaHostGetDevicePointer() will match the original pointerptr. If any device visible to the application has a zero value for the device attribute, the device pointer returned bycudaHostGetDevicePointer() will not match the original host pointerptr, but it will be suitable for use on all devices provided Unified Virtual Addressing is enabled. In such systems, it is valid to access the memory using either pointer on devices that have a non-zero value for the device attribute. Note however that such devices should access the memory using only of the two pointers and not both.
The memory page-locked by this function must be unregistered withcudaHostUnregister().
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaHostUnregister,cudaHostGetFlags,cudaHostGetDevicePointer,cuMemHostRegister
Unmaps the memory range whose base address is specified byptr, and makes it pageable again.
The base address must be the same one specified tocudaHostRegister().
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
Allocatessize bytes of linear memory on the device and returns in*devPtr a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared.cudaMalloc() returnscudaErrorMemoryAllocation in case of failure.
The device version ofcudaFree cannot be used with a*devPtr allocated using the host API, and vice versa.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMallocPitch,cudaFree,cudaMallocArray,cudaFreeArray,cudaMalloc3D,cudaMalloc3DArray,cudaMallocHost ( C API),cudaFreeHost,cudaHostAlloc,cuMemAlloc
Allocates at leastwidth *height *depth bytes of linear memory on the device and returns acudaPitchedPtr in whichptr is a pointer to the allocated memory. The function may pad the allocation to ensure hardware alignment requirements are met. The pitch returned in thepitch field ofpitchedDevPtr is the width in bytes of the allocation.
The returnedcudaPitchedPtr contains additional fieldsxsize andysize, the logical width and height of the allocation, which are equivalent to thewidth andheightextent parameters provided by the programmer during allocation.
For allocations of 2D and 3D objects, it is highly recommended that programmers perform allocations usingcudaMalloc3D() orcudaMallocPitch(). Due to alignment restrictions in the hardware, this is especially true if the application will be performing memory copies involving 2D or 3D objects (whether linear memory or CUDA arrays).
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMallocPitch,cudaFree,cudaMemcpy3D,cudaMemset3D,cudaMalloc3DArray,cudaMallocArray,cudaFreeArray,cudaMallocHost ( C API),cudaFreeHost,cudaHostAlloc,make_cudaPitchedPtr,make_cudaExtent,cuMemAllocPitch
Allocates a CUDA array according to thecudaChannelFormatDesc structuredesc and returns a handle to the new CUDA array in*array.
ThecudaChannelFormatDesc is defined as:
structcudaChannelFormatDesc { intx,y,z,w; enumcudaChannelFormatKindf; };wherecudaChannelFormatKind is one ofcudaChannelFormatKindSigned,cudaChannelFormatKindUnsigned, orcudaChannelFormatKindFloat.
cudaMalloc3DArray() can allocate the following:
A 1D array is allocated if the height and depth extents are both zero.
A 2D array is allocated if only the depth extent is zero.
A 3D array is allocated if all three extents are non-zero.
A 1D layered CUDA array is allocated if only the height extent is zero and the cudaArrayLayered flag is set. Each layer is a 1D array. The number of layers is determined by the depth extent.
A 2D layered CUDA array is allocated if all three extents are non-zero and the cudaArrayLayered flag is set. Each layer is a 2D array. The number of layers is determined by the depth extent.
A cubemap CUDA array is allocated if all three extents are non-zero and the cudaArrayCubemap flag is set. Width must be equal to height, and depth must be six. A cubemap is a special type of 2D layered CUDA array, where the six layers represent the six faces of a cube. The order of the six layers in memory is the same as that listed incudaGraphicsCubeFace.
A cubemap layered CUDA array is allocated if all three extents are non-zero, and both, cudaArrayCubemap and cudaArrayLayered flags are set. Width must be equal to height, and depth must be a multiple of six. A cubemap layered CUDA array is a special type of 2D layered CUDA array that consists of a collection of cubemaps. The first six layers represent the first cubemap, the next six layers form the second cubemap, and so on.
Theflags parameter enables different options to be specified that affect the allocation, as follows.
cudaArrayDefault: This flag's value is defined to be 0 and provides default array allocation
cudaArrayLayered: Allocates a layered CUDA array, with the depth extent indicating the number of layers
cudaArrayCubemap: Allocates a cubemap CUDA array. Width must be equal to height, and depth must be six. If the cudaArrayLayered flag is also set, depth must be a multiple of six.
cudaArraySurfaceLoadStore: Allocates a CUDA array that could be read from or written to using a surface reference.
cudaArrayTextureGather: This flag indicates that texture gather operations will be performed on the CUDA array. Texture gather can only be performed on 2D CUDA arrays.
cudaArraySparse: Allocates a CUDA array without physical backing memory. The subregions within this sparse array can later be mapped onto a physical memory allocation by callingcuMemMapArrayAsync. This flag can only be used for creating 2D, 3D or 2D layered sparse CUDA arrays. The physical backing memory must be allocated viacuMemCreate.
cudaArrayDeferredMapping: Allocates a CUDA array without physical backing memory. The entire array can later be mapped onto a physical memory allocation by callingcuMemMapArrayAsync. The physical backing memory must be allocated viacuMemCreate.
The width, height and depth extents must meet certain size requirements as listed in the following table. All values are specified in elements.
Note that 2D CUDA arrays have different size requirements if thecudaArrayTextureGather flag is set. In that case, the valid range for (width, height, depth) is ((1,maxTexture2DGather[0]), (1,maxTexture2DGather[1]), 0).
| CUDA array type | Valid extents that must always be met {(width range in elements), (height range), (depth range)} | Valid extents with cudaArraySurfaceLoadStore set {(width range in elements), (height range), (depth range)} |
|---|---|---|
| 1D | { (1,maxTexture1D), 0, 0 } | { (1,maxSurface1D), 0, 0 } |
| 2D | { (1,maxTexture2D[0]), (1,maxTexture2D[1]), 0 } | { (1,maxSurface2D[0]), (1,maxSurface2D[1]), 0 } |
| 3D | { (1,maxTexture3D[0]), (1,maxTexture3D[1]), (1,maxTexture3D[2]) } OR { (1,maxTexture3DAlt[0]), (1,maxTexture3DAlt[1]), (1,maxTexture3DAlt[2]) } | { (1,maxSurface3D[0]), (1,maxSurface3D[1]), (1,maxSurface3D[2]) } |
| 1D Layered | { (1,maxTexture1DLayered[0]), 0, (1,maxTexture1DLayered[1]) } | { (1,maxSurface1DLayered[0]), 0, (1,maxSurface1DLayered[1]) } |
| 2D Layered | { (1,maxTexture2DLayered[0]), (1,maxTexture2DLayered[1]), (1,maxTexture2DLayered[2]) } | { (1,maxSurface2DLayered[0]), (1,maxSurface2DLayered[1]), (1,maxSurface2DLayered[2]) } |
| Cubemap | { (1,maxTextureCubemap), (1,maxTextureCubemap), 6 } | { (1,maxSurfaceCubemap), (1,maxSurfaceCubemap), 6 } |
| Cubemap Layered | { (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[1]) } | { (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[1]) } |
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc3D,cudaMalloc,cudaMallocPitch,cudaFree,cudaFreeArray,cudaMallocHost ( C API),cudaFreeHost,cudaHostAlloc,make_cudaExtent,cuArray3DCreate
Allocates a CUDA array according to thecudaChannelFormatDesc structuredesc and returns a handle to the new CUDA array in*array.
ThecudaChannelFormatDesc is defined as:
structcudaChannelFormatDesc { intx,y,z,w; enumcudaChannelFormatKindf; };wherecudaChannelFormatKind is one ofcudaChannelFormatKindSigned,cudaChannelFormatKindUnsigned, orcudaChannelFormatKindFloat.
Theflags parameter enables different options to be specified that affect the allocation, as follows.
cudaArrayDefault: This flag's value is defined to be 0 and provides default array allocation
cudaArraySurfaceLoadStore: Allocates an array that can be read from or written to using a surface reference
cudaArrayTextureGather: This flag indicates that texture gather operations will be performed on the array.
cudaArraySparse: Allocates a CUDA array without physical backing memory. The subregions within this sparse array can later be mapped onto a physical memory allocation by callingcuMemMapArrayAsync. The physical backing memory must be allocated viacuMemCreate.
cudaArrayDeferredMapping: Allocates a CUDA array without physical backing memory. The entire array can later be mapped onto a physical memory allocation by callingcuMemMapArrayAsync. The physical backing memory must be allocated viacuMemCreate.
width andheight must meet certain size requirements. SeecudaMalloc3DArray() for more details.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc,cudaMallocPitch,cudaFree,cudaFreeArray,cudaMallocHost ( C API),cudaFreeHost,cudaMalloc3D,cudaMalloc3DArray,cudaHostAlloc,cuArrayCreate
Allocatessize bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such ascudaMemcpy*(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().
On systems where pageableMemoryAccessUsesHostPageTables is true,cudaMallocHost may not page-lock the allocated memory.
Page-locking excessive amounts of memory withcudaMallocHost() may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc,cudaMallocPitch,cudaMallocArray,cudaMalloc3D,cudaMalloc3DArray,cudaHostAlloc,cudaFree,cudaFreeArray,cudaMallocHost ( C++ API),cudaFreeHost,cudaHostAlloc,cuMemAllocHost
Allocatessize bytes of managed memory on the device and returns in*devPtr a pointer to the allocated memory. If the device doesn't support allocating managed memory,cudaErrorNotSupported is returned. Support for managed memory can be queried using the device attributecudaDevAttrManagedMemory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. Ifsize is 0,cudaMallocManaged returnscudaErrorInvalidValue. The pointer is valid on the CPU and on all GPUs in the system that support managed memory. All accesses to this pointer must obey the Unified Memory programming model.
flags specifies the default stream association for this allocation.flags must be one ofcudaMemAttachGlobal orcudaMemAttachHost. The default value forflags iscudaMemAttachGlobal. IfcudaMemAttachGlobal is specified, then this memory is accessible from any stream on any device. IfcudaMemAttachHost is specified, then the allocation should not be accessed from devices that have a zero value for the device attributecudaDevAttrConcurrentManagedAccess; an explicit call tocudaStreamAttachMemAsync will be required to enable access on such devices.
If the association is later changed viacudaStreamAttachMemAsync to a single stream, the default association, as specifed duringcudaMallocManaged, is restored when that stream is destroyed. For __managed__ variables, the default association is alwayscudaMemAttachGlobal. Note that destroying a stream is an asynchronous operation, and as a result, the change to default association won't happen until all work in the stream has completed.
Memory allocated withcudaMallocManaged should be released withcudaFree.
Device memory oversubscription is possible for GPUs that have a non-zero value for the device attributecudaDevAttrConcurrentManagedAccess. Managed memory on such GPUs may be evicted from device memory to host memory at any time by the Unified Memory driver in order to make room for other allocations.
In a system where all GPUs have a non-zero value for the device attributecudaDevAttrConcurrentManagedAccess, managed memory may not be populated when this API returns and instead may be populated on access. In such systems, managed memory can migrate to any processor's memory at any time. The Unified Memory driver will employ heuristics to maintain data locality and prevent excessive page faults to the extent possible. The application can also guide the driver about memory usage patterns viacudaMemAdvise. The application can also explicitly migrate memory to a desired processor's memory viacudaMemPrefetchAsync.
In a multi-GPU system where all of the GPUs have a zero value for the device attributecudaDevAttrConcurrentManagedAccess and all the GPUs have peer-to-peer support with each other, the physical storage for managed memory is created on the GPU which is active at the timecudaMallocManaged is called. All other GPUs will reference the data at reduced bandwidth via peer mappings over the PCIe bus. The Unified Memory driver does not migrate memory among such GPUs.
In a multi-GPU system where not all GPUs have peer-to-peer support with each other and where the value of the device attributecudaDevAttrConcurrentManagedAccess is zero for at least one of those GPUs, the location chosen for physical storage of managed memory is system-dependent.
On Linux, the location chosen will be device memory as long as the current set of active contexts are on devices that either have peer-to-peer support with each other or have a non-zero value for the device attributecudaDevAttrConcurrentManagedAccess. If there is an active context on a GPU that does not have a non-zero value for that device attribute and it does not have peer-to-peer support with the other devices that have active contexts on them, then the location for physical storage will be 'zero-copy' or host memory. Note that this means that managed memory that is located in device memory is migrated to host memory if a new context is created on a GPU that doesn't have a non-zero value for the device attribute and does not support peer-to-peer with at least one of the other devices that has an active context. This in turn implies that context creation may fail if there is insufficient host memory to migrate all managed allocations.
On Windows, the physical storage is always created in 'zero-copy' or host memory. All GPUs will reference the data at reduced bandwidth over the PCIe bus. In these circumstances, use of the environment variable CUDA_VISIBLE_DEVICES is recommended to restrict CUDA to only use those GPUs that have peer-to-peer support. Alternatively, users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-zero value to force the driver to always use device memory for physical storage. When this environment variable is set to a non-zero value, all devices used in that process that support managed memory have to be peer-to-peer compatible with each other. The errorcudaErrorInvalidDevice will be returned if a device that supports managed memory is used and it is not peer-to-peer compatible with any of the other managed memory supporting devices that were previously used in that process, even ifcudaDeviceReset has been called on those devices. These environment variables are described in the CUDA programming guide under the "CUDA environment variables" section.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMallocPitch,cudaFree,cudaMallocArray,cudaFreeArray,cudaMalloc3D,cudaMalloc3DArray,cudaMallocHost ( C API),cudaFreeHost,cudaHostAlloc,cudaDeviceGetAttribute,cudaStreamAttachMemAsync,cuMemAllocManaged
Allocates a CUDA mipmapped array according to thecudaChannelFormatDesc structuredesc and returns a handle to the new CUDA mipmapped array in*mipmappedArray.numLevels specifies the number of mipmap levels to be allocated. This value is clamped to the range [1, 1 + floor(log2(max(width, height, depth)))].
ThecudaChannelFormatDesc is defined as:
structcudaChannelFormatDesc { intx,y,z,w; enumcudaChannelFormatKindf; };wherecudaChannelFormatKind is one ofcudaChannelFormatKindSigned,cudaChannelFormatKindUnsigned, orcudaChannelFormatKindFloat.
cudaMallocMipmappedArray() can allocate the following:
A 1D mipmapped array is allocated if the height and depth extents are both zero.
A 2D mipmapped array is allocated if only the depth extent is zero.
A 3D mipmapped array is allocated if all three extents are non-zero.
A 1D layered CUDA mipmapped array is allocated if only the height extent is zero and the cudaArrayLayered flag is set. Each layer is a 1D mipmapped array. The number of layers is determined by the depth extent.
A 2D layered CUDA mipmapped array is allocated if all three extents are non-zero and the cudaArrayLayered flag is set. Each layer is a 2D mipmapped array. The number of layers is determined by the depth extent.
A cubemap CUDA mipmapped array is allocated if all three extents are non-zero and the cudaArrayCubemap flag is set. Width must be equal to height, and depth must be six. The order of the six layers in memory is the same as that listed incudaGraphicsCubeFace.
A cubemap layered CUDA mipmapped array is allocated if all three extents are non-zero, and both, cudaArrayCubemap and cudaArrayLayered flags are set. Width must be equal to height, and depth must be a multiple of six. A cubemap layered CUDA mipmapped array is a special type of 2D layered CUDA mipmapped array that consists of a collection of cubemap mipmapped arrays. The first six layers represent the first cubemap mipmapped array, the next six layers form the second cubemap mipmapped array, and so on.
Theflags parameter enables different options to be specified that affect the allocation, as follows.
cudaArrayDefault: This flag's value is defined to be 0 and provides default mipmapped array allocation
cudaArrayLayered: Allocates a layered CUDA mipmapped array, with the depth extent indicating the number of layers
cudaArrayCubemap: Allocates a cubemap CUDA mipmapped array. Width must be equal to height, and depth must be six. If the cudaArrayLayered flag is also set, depth must be a multiple of six.
cudaArraySurfaceLoadStore: This flag indicates that individual mipmap levels of the CUDA mipmapped array will be read from or written to using a surface reference.
cudaArrayTextureGather: This flag indicates that texture gather operations will be performed on the CUDA array. Texture gather can only be performed on 2D CUDA mipmapped arrays, and the gather operations are performed only on the most detailed mipmap level.
cudaArraySparse: Allocates a CUDA mipmapped array without physical backing memory. The subregions within this sparse array can later be mapped onto a physical memory allocation by callingcuMemMapArrayAsync. This flag can only be used for creating 2D, 3D or 2D layered sparse CUDA mipmapped arrays. The physical backing memory must be allocated viacuMemCreate.
cudaArrayDeferredMapping: Allocates a CUDA mipmapped array without physical backing memory. The entire array can later be mapped onto a physical memory allocation by callingcuMemMapArrayAsync. The physical backing memory must be allocated viacuMemCreate.
The width, height and depth extents must meet certain size requirements as listed in the following table. All values are specified in elements.
| CUDA array type | Valid extents that must always be met {(width range in elements), (height range), (depth range)} | Valid extents with cudaArraySurfaceLoadStore set {(width range in elements), (height range), (depth range)} |
|---|---|---|
| 1D | { (1,maxTexture1DMipmap), 0, 0 } | { (1,maxSurface1D), 0, 0 } |
| 2D | { (1,maxTexture2DMipmap[0]), (1,maxTexture2DMipmap[1]), 0 } | { (1,maxSurface2D[0]), (1,maxSurface2D[1]), 0 } |
| 3D | { (1,maxTexture3D[0]), (1,maxTexture3D[1]), (1,maxTexture3D[2]) } OR { (1,maxTexture3DAlt[0]), (1,maxTexture3DAlt[1]), (1,maxTexture3DAlt[2]) } | { (1,maxSurface3D[0]), (1,maxSurface3D[1]), (1,maxSurface3D[2]) } |
| 1D Layered | { (1,maxTexture1DLayered[0]), 0, (1,maxTexture1DLayered[1]) } | { (1,maxSurface1DLayered[0]), 0, (1,maxSurface1DLayered[1]) } |
| 2D Layered | { (1,maxTexture2DLayered[0]), (1,maxTexture2DLayered[1]), (1,maxTexture2DLayered[2]) } | { (1,maxSurface2DLayered[0]), (1,maxSurface2DLayered[1]), (1,maxSurface2DLayered[2]) } |
| Cubemap | { (1,maxTextureCubemap), (1,maxTextureCubemap), 6 } | { (1,maxSurfaceCubemap), (1,maxSurfaceCubemap), 6 } |
| Cubemap Layered | { (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[0]), (1,maxTextureCubemapLayered[1]) } | { (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[0]), (1,maxSurfaceCubemapLayered[1]) } |
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc3D,cudaMalloc,cudaMallocPitch,cudaFree,cudaFreeArray,cudaMallocHost ( C API),cudaFreeHost,cudaHostAlloc,make_cudaExtent,cuMipmappedArrayCreate
Allocates at leastwidth (in bytes) *height bytes of linear memory on the device and returns in*devPtr a pointer to the allocated memory. The function may pad the allocation to ensure that corresponding pointers in any given row will continue to meet the alignment requirements for coalescing as the address is updated from row to row. The pitch returned in*pitch bycudaMallocPitch() is the width in bytes of the allocation. The intended usage ofpitch is as a separate parameter of the allocation, used to compute addresses within the 2D array. Given the row and column of an array element of typeT, the address is computed as:
T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
For allocations of 2D arrays, it is recommended that programmers consider performing pitch allocations usingcudaMallocPitch(). Due to pitch alignment restrictions in the hardware, this is especially true if the application will be performing 2D memory copies between different regions of device memory (whether linear memory or CUDA arrays).
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc,cudaFree,cudaMallocArray,cudaFreeArray,cudaMallocHost ( C API),cudaFreeHost,cudaMalloc3D,cudaMalloc3DArray,cudaHostAlloc,cuMemAllocPitch
Advise the Unified Memory subsystem about the usage pattern for the memory range starting atdevPtr with a size ofcount bytes. The start address and end address of the memory range will be rounded down and rounded up respectively to be aligned to CPU page size before the advice is applied. The memory range must refer to managed memory allocated viacudaMallocManaged or declared via __managed__ variables. The memory range could also refer to system-allocated pageable memory provided it represents a valid, host-accessible region of memory and all additional constraints imposed byadvice as outlined below are also satisfied. Specifying an invalid system-allocated pageable memory range results in an error being returned.
Theadvice parameter can take the following values:
cudaMemAdviseSetReadMostly: This implies that the data is mostly going to be read from and only occasionally written to. Any read accesses from any processor to this region will create a read-only copy of at least the accessed pages in that processor's memory. Additionally, ifcudaMemPrefetchAsync orcudaMemPrefetchAsync is called on this region, it will create a read-only copy of the data on the destination processor. If the target location forcudaMemPrefetchAsync is a host NUMA node and a read-only copy already exists on another host NUMA node, that copy will be migrated to the targeted host NUMA node. If any processor writes to this region, all copies of the corresponding page will be invalidated except for the one where the write occurred. If the writing processor is the CPU and the preferred location of the page is a host NUMA node, then the page will also be migrated to that host NUMA node. Thelocation argument is ignored for this advice. Note that for a page to be read-duplicated, the accessing processor must either be the CPU or a GPU that has a non-zero value for the device attributecudaDevAttrConcurrentManagedAccess. Also, if a context is created on a device that does not have the device attributecudaDevAttrConcurrentManagedAccess set, then read-duplication will not occur until all such contexts are destroyed. If the memory region refers to valid system-allocated pageable memory, then the accessing device must have a non-zero value for the device attributecudaDevAttrPageableMemoryAccess for a read-only copy to be created on that device. Note however that if the accessing device also has a non-zero value for the device attributecudaDevAttrPageableMemoryAccessUsesHostPageTables, then setting this advice will not create a read-only copy when that device accesses this memory region.
cudaMemAdviceUnsetReadMostly: Undoes the effect ofcudaMemAdviseSetReadMostly and also prevents the Unified Memory driver from attempting heuristic read-duplication on the memory range. Any read-duplicated copies of the data will be collapsed into a single copy. The location for the collapsed copy will be the preferred location if the page has a preferred location and one of the read-duplicated copies was resident at that location. Otherwise, the location chosen is arbitrary. Note: Thelocation argument is ignored for this advice.
cudaMemAdviseSetPreferredLocation: This advice sets the preferred location for the data to be the memory belonging tolocation. WhencudaMemLocation::type iscudaMemLocationTypeHost,cudaMemLocation::id is ignored and the preferred location is set to be host memory. To set the preferred location to a specific host NUMA node, applications must setcudaMemLocation::type tocudaMemLocationTypeHostNuma andcudaMemLocation::id must specify the NUMA ID of the host NUMA node. IfcudaMemLocation::type is set tocudaMemLocationTypeHostNumaCurrent,cudaMemLocation::id will be ignored and the host NUMA node closest to the calling thread's CPU will be used as the preferred location. IfcudaMemLocation::type is acudaMemLocationTypeDevice, thencudaMemLocation::id must be a valid device ordinal and the device must have a non-zero value for the device attributecudaDevAttrConcurrentManagedAccess. Setting the preferred location does not cause data to migrate to that location immediately. Instead, it guides the migration policy when a fault occurs on that memory region. If the data is already in its preferred location and the faulting processor can establish a mapping without requiring the data to be migrated, then data migration will be avoided. On the other hand, if the data is not in its preferred location or if a direct mapping cannot be established, then it will be migrated to the processor accessing it. It is important to note that setting the preferred location does not prevent data prefetching done usingcudaMemPrefetchAsync. Having a preferred location can override the page thrash detection and resolution logic in the Unified Memory driver. Normally, if a page is detected to be constantly thrashing between for example host and device memory, the page may eventually be pinned to host memory by the Unified Memory driver. But if the preferred location is set as device memory, then the page will continue to thrash indefinitely. IfcudaMemAdviseSetReadMostly is also set on this memory region or any subset of it, then the policies associated with that advice will override the policies of this advice, unless read accesses fromlocation will not result in a read-only copy being created on that procesor as outlined in description for the advicecudaMemAdviseSetReadMostly. If the memory region refers to valid system-allocated pageable memory, andcudaMemLocation::type iscudaMemLocationTypeDevice thencudaMemLocation::id must be a valid device that has a non-zero alue for the device attributecudaDevAttrPageableMemoryAccess.
cudaMemAdviseUnsetPreferredLocation: Undoes the effect ofcudaMemAdviseSetPreferredLocation and changes the preferred location to none. Thelocation argument is ignored for this advice.
cudaMemAdviseSetAccessedBy: This advice implies that the data will be accessed by processorlocation. ThecudaMemLocation::type must be eithercudaMemLocationTypeDevice withcudaMemLocation::id representing a valid device ordinal orcudaMemLocationTypeHost andcudaMemLocation::id will be ignored. All other location types are invalid. IfcudaMemLocation::id is a GPU, then the device attributecudaDevAttrConcurrentManagedAccess must be non-zero. This advice does not cause data migration and has no impact on the location of the data per se. Instead, it causes the data to always be mapped in the specified processor's page tables, as long as the location of the data permits a mapping to be established. If the data gets migrated for any reason, the mappings are updated accordingly. This advice is recommended in scenarios where data locality is not important, but avoiding faults is. Consider for example a system containing multiple GPUs with peer-to-peer access enabled, where the data located on one GPU is occasionally accessed by peer GPUs. In such scenarios, migrating data over to the other GPUs is not as important because the accesses are infrequent and the overhead of migration may be too high. But preventing faults can still help improve performance, and so having a mapping set up in advance is useful. Note that on CPU access of this data, the data may be migrated to host memory because the CPU typically cannot access device memory directly. Any GPU that had thecudaMemAdviseSetAccessedBy flag set for this data will now have its mapping updated to point to the page in host memory. IfcudaMemAdviseSetReadMostly is also set on this memory region or any subset of it, then the policies associated with that advice will override the policies of this advice. Additionally, if the preferred location of this memory region or any subset of it is alsolocation, then the policies associated withCU_MEM_ADVISE_SET_PREFERRED_LOCATION will override the policies of this advice. If the memory region refers to valid system-allocated pageable memory, andcudaMemLocation::type iscudaMemLocationTypeDevice then device incudaMemLocation::id must have a non-zero value for the device attributecudaDevAttrPageableMemoryAccess. Additionally, ifcudaMemLocation::id has a non-zero value for the device attributecudaDevAttrPageableMemoryAccessUsesHostPageTables, then this call has no effect.
CU_MEM_ADVISE_UNSET_ACCESSED_BY: Undoes the effect ofcudaMemAdviseSetAccessedBy. Any mappings to the data fromlocation may be removed at any time causing accesses to result in non-fatal page faults. If the memory region refers to valid system-allocated pageable memory, andcudaMemLocation::type iscudaMemLocationTypeDevice then device incudaMemLocation::id must have a non-zero value for the device attributecudaDevAttrPageableMemoryAccess. Additionally, ifcudaMemLocation::id has a non-zero value for the device attributecudaDevAttrPageableMemoryAccessUsesHostPageTables, then this call has no effect.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpyPeer,cudaMemcpyAsync,cudaMemcpy3DPeerAsync,cudaMemPrefetchAsync,cuMemAdvise
Performs a batch of memory discards followed by prefetches. The batch as a whole executes in stream order but operations within a batch are not guaranteed to execute in any specific order. All devices in the system must have a non-zero value for the device attributecudaDevAttrConcurrentManagedAccess otherwise the API will return an error.
CallingcudaMemDiscardAndPrefetchBatchAsync is semantically equivalent to callingcudaMemDiscardBatchAsync followed bycudaMemPrefetchBatchAsync, but is more optimal. For more details on what discarding and prefetching imply, please refer tocudaMemDiscardBatchAsync andcudaMemPrefetchBatchAsync respectively. Note that any reads, writes or prefetches to any part of the memory range that occur simultaneously with this combined discard+prefetch operation result in undefined behavior.
Performs memory discard and prefetch on address ranges specified indptrs andsizes. Both arrays must be of the same length as specified bycount. Each memory range specified must refer to managed memory allocated viacudaMallocManaged or declared via __managed__ variables or it may also refer to system-allocated memory when all devices have a non-zero value forcudaDevAttrPageableMemoryAccess. Every operation in the batch has to be associated with a valid location to prefetch the address range to and specified in theprefetchLocs array. Each entry in this array can apply to more than one operation. This can be done by specifying in theprefetchLocIdxs array, the index of the first operation that the corresponding entry in theprefetchLocs array applies to. BothprefetchLocs andprefetchLocIdxs must be of the same length as specified bynumPrefetchLocs. For example, if a batch has 10 operations listed in dptrs/sizes, the first 6 of which are to be prefetched to one location and the remaining 4 are to be prefetched to another, thennumPrefetchLocs will be 2,prefetchLocIdxs will be {0, 6} andprefetchLocs will contain the two set of locations. Note the first entry inprefetchLocIdxs must always be 0. Also, each entry must be greater than the previous entry and the last entry should be less thancount. Furthermore,numPrefetchLocs must be lesser than or equal tocount.
Performs a batch of memory discards. The batch as a whole executes in stream order but operations within a batch are not guaranteed to execute in any specific order. All devices in the system must have a non-zero value for the device attributecudaDevAttrConcurrentManagedAccess otherwise the API will return an error.
Discarding a memory range informs the driver that the contents of that range are no longer useful. Discarding memory ranges allows the driver to optimize certain data migrations and can also help reduce memory pressure. This operation can be undone on any part of the range by either writing to it or prefetching it viacudaMemPrefetchAsync orcudaMemPrefetchBatchAsync. Reading from a discarded range, without a subsequent write or prefetch to that part of the range, will return an indeterminate value. Note that any reads, writes or prefetches to any part of the memory range that occur simultaneously with the discard operation result in undefined behavior.
Performs memory discard on address ranges specified indptrs andsizes. Both arrays must be of the same length as specified bycount. Each memory range specified must refer to managed memory allocated viacudaMallocManaged or declared via __managed__ variables or it may also refer to system-allocated memory when all devices have a non-zero value forcudaDevAttrPageableMemoryAccess.
Returns in*total the total amount of memory available to the the current context. Returns in*free the amount of memory on the device that is free according to the OS. CUDA is not guaranteed to be able to allocate all of the memory that the OS reports as free. In a multi-tenet situation, free estimate returned is prone to race condition where a new allocation/free done by a different process or a different thread in the same process between the time when free memory was estimated and reported, will result in deviation in free value reported and actual free memory.
The integrated GPU on Tegra shares memory with CPU and other component of the SoC. The free and total values returned by the API excludes the SWAP memory space maintained by the OS on some platforms. The OS may move some of the memory pages into swap area as the GPU or CPU allocate or access memory. See Tegra app note on how to calculate total and free memory on Tegra.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
Prefetches memory to the specified destination location.devPtr is the base device pointer of the memory to be prefetched andlocation specifies the destination location.count specifies the number of bytes to copy.stream is the stream in which the operation is enqueued. The memory range must refer to managed memory allocated viacudaMallocManaged or declared via __managed__ variables, or it may also refer to system-allocated memory on systems with non-zero cudaDevAttrPageableMemoryAccess.
SpecifyingcudaMemLocationTypeDevice forcudaMemLocation::type will prefetch memory to GPU specified by device ordinalcudaMemLocation::id which must have non-zero value for the device attribute concurrentManagedAccess. Additionally,stream must be associated with a device that has a non-zero value for the device attribute concurrentManagedAccess. SpecifyingcudaMemLocationTypeHost ascudaMemLocation::type will prefetch data to host memory. Applications can request prefetching memory to a specific host NUMA node by specifyingcudaMemLocationTypeHostNuma forcudaMemLocation::type and a valid host NUMA node id incudaMemLocation::id Users can also request prefetching memory to the host NUMA node closest to the current thread's CPU by specifyingcudaMemLocationTypeHostNumaCurrent forcudaMemLocation::type. Note whencudaMemLocation::type is etihercudaMemLocationTypeHost ORcudaMemLocationTypeHostNumaCurrent,cudaMemLocation::id will be ignored.
The start address and end address of the memory range will be rounded down and rounded up respectively to be aligned to CPU page size before the prefetch operation is enqueued in the stream.
If no physical memory has been allocated for this region, then this memory region will be populated and mapped on the destination device. If there's insufficient memory to prefetch the desired region, the Unified Memory driver may evict pages from othercudaMallocManaged allocations to host memory in order to make room. Device memory allocated usingcudaMalloc orcudaMallocArray will not be evicted.
By default, any mappings to the previous location of the migrated pages are removed and mappings for the new location are only setup on the destination location. The exact behavior however also depends on the settings applied to this memory range viacuMemAdvise as described below:
IfcudaMemAdviseSetReadMostly was set on any subset of this memory range, then that subset will create a read-only copy of the pages on destination location. If however the destination location is a host NUMA node, then any pages of that subset that are already in another host NUMA node will be transferred to the destination.
IfcudaMemAdviseSetPreferredLocation was called on any subset of this memory range, then the pages will be migrated tolocation even iflocation is not the preferred location of any pages in the memory range.
IfcudaMemAdviseSetAccessedBy was called on any subset of this memory range, then mappings to those pages from all the appropriate processors are updated to refer to the new location if establishing such a mapping is possible. Otherwise, those mappings are cleared.
Note that this API is not required for functionality and only serves to improve performance by allowing the application to migrate data to a suitable location before it is accessed. Memory accesses to this range are always coherent and are allowed even when the data is actively being migrated.
Note that this function is asynchronous with respect to the host and all work on other devices.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpyPeer,cudaMemcpyAsync,cudaMemcpy3DPeerAsync,cudaMemAdvise,cuMemPrefetchAsync
Performs a batch of memory prefetches. The batch as a whole executes in stream order but operations within a batch are not guaranteed to execute in any specific order. All devices in the system must have a non-zero value for the device attributecudaDevAttrConcurrentManagedAccess otherwise the API will return an error.
The semantics of the individual prefetch operations are as described incudaMemPrefetchAsync.
Performs memory prefetch on address ranges specified indptrs andsizes. Both arrays must be of the same length as specified bycount. Each memory range specified must refer to managed memory allocated viacudaMallocManaged or declared via __managed__ variables or it may also refer to system-allocated memory when all devices have a non-zero value forcudaDevAttrPageableMemoryAccess. The prefetch location for every operation in the batch is specified in theprefetchLocs array. Each entry in this array can apply to more than one operation. This can be done by specifying in theprefetchLocIdxs array, the index of the first prefetch operation that the corresponding entry in theprefetchLocs array applies to. BothprefetchLocs andprefetchLocIdxs must be of the same length as specified bynumPrefetchLocs. For example, if a batch has 10 prefetches listed in dptrs/sizes, the first 4 of which are to be prefetched to one location and the remaining 6 are to be prefetched to another, thennumPrefetchLocs will be 2,prefetchLocIdxs will be {0, 4} andprefetchLocs will contain the two locations. Note the first entry inprefetchLocIdxs must always be 0. Also, each entry must be greater than the previous entry and the last entry should be less thancount. Furthermore,numPrefetchLocs must be lesser than or equal tocount.
Query an attribute about the memory range starting atdevPtr with a size ofcount bytes. The memory range must refer to managed memory allocated viacudaMallocManaged or declared via __managed__ variables.
Theattribute parameter can take the following values:
cudaMemRangeAttributeReadMostly: If this attribute is specified,data will be interpreted as a 32-bit integer, anddataSize must be 4. The result returned will be 1 if all pages in the given memory range have read-duplication enabled, or 0 otherwise.
cudaMemRangeAttributePreferredLocation: If this attribute is specified,data will be interpreted as a 32-bit integer, anddataSize must be 4. The result returned will be a GPU device id if all pages in the memory range have that GPU as their preferred location, or it will be cudaCpuDeviceId if all pages in the memory range have the CPU as their preferred location, or it will be cudaInvalidDeviceId if either all the pages don't have the same preferred location or some of the pages don't have a preferred location at all. Note that the actual location of the pages in the memory range at the time of the query may be different from the preferred location.
cudaMemRangeAttributeAccessedBy: If this attribute is specified,data will be interpreted as an array of 32-bit integers, anddataSize must be a non-zero multiple of 4. The result returned will be a list of device ids that had cudaMemAdviceSetAccessedBy set for that entire memory range. If any device does not have that advice set for the entire memory range, that device will not be included. Ifdata is larger than the number of devices that have that advice set for that memory range, cudaInvalidDeviceId will be returned in all the extra space provided. For ex., ifdataSize is 12 (i.e.data has 3 elements) and only device 0 has the advice set, then the result returned will be { 0, cudaInvalidDeviceId, cudaInvalidDeviceId }. Ifdata is smaller than the number of devices that have that advice set, then only as many devices will be returned as can fit in the array. There is no guarantee on which specific devices will be returned, however.
cudaMemRangeAttributeLastPrefetchLocation: If this attribute is specified,data will be interpreted as a 32-bit integer, anddataSize must be 4. The result returned will be the last location to which all pages in the memory range were prefetched explicitly viacudaMemPrefetchAsync. This will either be a GPU id or cudaCpuDeviceId depending on whether the last location for prefetch was a GPU or the CPU respectively. If any page in the memory range was never explicitly prefetched or if all pages were not prefetched to the same location, cudaInvalidDeviceId will be returned. Note that this simply returns the last location that the applicaton requested to prefetch the memory range to. It gives no indication as to whether the prefetch operation to that location has completed or even begun.
cudaMemRangeAttributePreferredLocationId: If this attribute is specified,data will be interpreted as a 32-bit integer, anddataSize must be 4. If thecudaMemRangeAttributePreferredLocationType query for the same address range returnscudaMemLocationTypeDevice, it will be a valid device ordinal or if it returnscudaMemLocationTypeHostNuma, it will be a valid host NUMA node ID or if it returns any other location type, the id should be ignored.
cudaMemRangeAttributeLastPrefetchLocationId: If this attribute is specified,data will be interpreted as a 32-bit integer, anddataSize must be 4. If thecudaMemRangeAttributeLastPrefetchLocationType query for the same address range returnscudaMemLocationTypeDevice, it will be a valid device ordinal or if it returnscudaMemLocationTypeHostNuma, it will be a valid host NUMA node ID or if it returns any other location type, the id should be ignored.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemRangeGetAttributes,cudaMemPrefetchAsync,cudaMemAdvise,cuMemRangeGetAttribute
Query attributes of the memory range starting atdevPtr with a size ofcount bytes. The memory range must refer to managed memory allocated viacudaMallocManaged or declared via __managed__ variables. Theattributes array will be interpreted to havenumAttributes entries. ThedataSizes array will also be interpreted to havenumAttributes entries. The results of the query will be stored indata.
The list of supported attributes are given below. Please refer tocudaMemRangeGetAttribute for attribute descriptions and restrictions.
:: cudaMemRangeAttributePreferredLocationType
:: cudaMemRangeAttributePreferredLocationId
:: cudaMemRangeAttributeLastPrefetchLocationType
:: cudaMemRangeAttributeLastPrefetchLocationId
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemRangeGetAttribute,cudaMemAdvise,cudaMemPrefetchAsync,cuMemRangeGetAttributes
Copiescount bytes from the memory area pointed to bysrc to the memory area pointed to bydst, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. CallingcudaMemcpy() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
This function exhibitssynchronous behavior for most use cases.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
See also:
cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpyDtoH,cuMemcpyHtoD,cuMemcpyDtoD,cuMemcpy
Copies a matrix (height rows ofwidth bytes each) from the memory area pointed to bysrc to the memory area pointed to bydst, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.dpitch andspitch are the widths in memory in bytes of the 2D arrays pointed to bydst andsrc, including any padding added to the end of each row. The memory areas may not overlap.width must not exceed eitherdpitch orspitch. CallingcudaMemcpy2D() withdst andsrc pointers that do not match the direction of the copy results in an undefined behavior.cudaMemcpy2D() returns an error ifdpitch orspitch exceeds the maximum allowed.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
See also:
cudaMemcpy,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy2D,cuMemcpy2DUnaligned
Copies a matrix (height rows ofwidth bytes each) from the CUDA arraysrc starting athOffsetSrc rows andwOffsetSrc bytes from the upper left corner to the CUDA arraydst starting athOffsetDst rows andwOffsetDst bytes from the upper left corner, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.wOffsetDst +width must not exceed the width of the CUDA arraydst.wOffsetSrc +width must not exceed the width of the CUDA arraysrc.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitssynchronous behavior for most use cases.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy2D,cuMemcpy2DUnaligned
Copies a matrix (height rows ofwidth bytes each) from the memory area pointed to bysrc to the memory area pointed to bydst, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.dpitch andspitch are the widths in memory in bytes of the 2D arrays pointed to bydst andsrc, including any padding added to the end of each row. The memory areas may not overlap.width must not exceed eitherdpitch orspitch.
CallingcudaMemcpy2DAsync() withdst andsrc pointers that do not match the direction of the copy results in an undefined behavior.cudaMemcpy2DAsync() returns an error ifdpitch orspitch is greater than the maximum allowed.
cudaMemcpy2DAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zerostream argument. Ifkind iscudaMemcpyHostToDevice orcudaMemcpyDeviceToHost andstream is non-zero, the copy may overlap with operations in other streams.
The device version of this function only handles device to device copies and cannot be given local or shared pointers.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy2DAsync
Copies a matrix (height rows ofwidth bytes each) from the CUDA arraysrc starting athOffset rows andwOffset bytes from the upper left corner to the memory area pointed to bydst, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.dpitch is the width in memory in bytes of the 2D array pointed to bydst, including any padding added to the end of each row.wOffset +width must not exceed the width of the CUDA arraysrc.width must not exceeddpitch.cudaMemcpy2DFromArray() returns an error ifdpitch exceeds the maximum allowed.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitssynchronous behavior for most use cases.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy2D,cuMemcpy2DUnaligned
Copies a matrix (height rows ofwidth bytes each) from the CUDA arraysrc starting athOffset rows andwOffset bytes from the upper left corner to the memory area pointed to bydst, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.dpitch is the width in memory in bytes of the 2D array pointed to bydst, including any padding added to the end of each row.wOffset +width must not exceed the width of the CUDA arraysrc.width must not exceeddpitch.cudaMemcpy2DFromArrayAsync() returns an error ifdpitch exceeds the maximum allowed.
cudaMemcpy2DFromArrayAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zerostream argument. Ifkind iscudaMemcpyHostToDevice orcudaMemcpyDeviceToHost andstream is non-zero, the copy may overlap with operations in other streams.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,
cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy2DAsync
Copies a matrix (height rows ofwidth bytes each) from the memory area pointed to bysrc to the CUDA arraydst starting athOffset rows andwOffset bytes from the upper left corner, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.spitch is the width in memory in bytes of the 2D array pointed to bysrc, including any padding added to the end of each row.wOffset +width must not exceed the width of the CUDA arraydst.width must not exceedspitch.cudaMemcpy2DToArray() returns an error ifspitch exceeds the maximum allowed.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitssynchronous behavior for most use cases.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy2D,cuMemcpy2DUnaligned
Copies a matrix (height rows ofwidth bytes each) from the memory area pointed to bysrc to the CUDA arraydst starting athOffset rows andwOffset bytes from the upper left corner, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.spitch is the width in memory in bytes of the 2D array pointed to bysrc, including any padding added to the end of each row.wOffset +width must not exceed the width of the CUDA arraydst.width must not exceedspitch.cudaMemcpy2DToArrayAsync() returns an error ifspitch exceeds the maximum allowed.
cudaMemcpy2DToArrayAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zerostream argument. Ifkind iscudaMemcpyHostToDevice orcudaMemcpyDeviceToHost andstream is non-zero, the copy may overlap with operations in other streams.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,
cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy2DAsync
structcudaExtent { size_twidth; size_theight; size_tdepth; }; structcudaExtentmake_cudaExtent(size_t w, size_t h, size_t d); structcudaPos { size_tx; size_ty; size_tz; }; structcudaPosmake_cudaPos(size_tx, size_ty, size_tz); structcudaMemcpy3DParms {cudaArray_tsrcArray; structcudaPossrcPos; structcudaPitchedPtrsrcPtr;cudaArray_tdstArray; structcudaPosdstPos; structcudaPitchedPtrdstPtr; structcudaExtentextent; enumcudaMemcpyKindkind; };
cudaMemcpy3D() copies data betwen two 3D objects. The source and destination objects may be in either host memory, device memory, or a CUDA array. The source, destination, extent, and kind of copy performed is specified by thecudaMemcpy3DParms struct which should be initialized to zero before use:
cudaMemcpy3DParms myParms = {0};
The struct passed tocudaMemcpy3D() must specify one ofsrcArray orsrcPtr and one ofdstArray ordstPtr. Passing more than one non-zero source or destination will causecudaMemcpy3D() to return an error.
ThesrcPos anddstPos fields are optional offsets into the source and destination objects and are defined in units of each object's elements. The element for a host or device pointer is assumed to beunsigned char.
Theextent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements ofunsigned char.
Thekind field defines the direction of the copy. It must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. ForcudaMemcpyHostToHost orcudaMemcpyHostToDevice orcudaMemcpyDeviceToHost passed as kind and cudaArray type passed as source or destination, if the kind implies cudaArray type to be present on the host,cudaMemcpy3D() will disregard that implication and silently correct the kind based on the fact that cudaArray type can only be present on the device.
If the source and destination are both arrays,cudaMemcpy3D() will return an error if they do not have the same element size.
The source and destination object may not overlap. If overlapping source and destination objects are specified, undefined behavior will result.
The source object must entirely contain the region defined bysrcPos andextent. The destination object must entirely contain the region defined bydstPos andextent.
cudaMemcpy3D() returns an error if the pitch ofsrcPtr ordstPtr exceeds the maximum allowed. The pitch of acudaPitchedPtr allocated withcudaMalloc3D() will always be valid.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitssynchronous behavior for most use cases.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc3D,cudaMalloc3DArray,cudaMemset3D,cudaMemcpy3DAsync,cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,make_cudaExtent,make_cudaPos,cuMemcpy3D
structcudaExtent { size_twidth; size_theight; size_tdepth; }; structcudaExtentmake_cudaExtent(size_t w, size_t h, size_t d); structcudaPos { size_tx; size_ty; size_tz; }; structcudaPosmake_cudaPos(size_tx, size_ty, size_tz); structcudaMemcpy3DParms {cudaArray_tsrcArray; structcudaPossrcPos; structcudaPitchedPtrsrcPtr;cudaArray_tdstArray; structcudaPosdstPos; structcudaPitchedPtrdstPtr; structcudaExtentextent; enumcudaMemcpyKindkind; };
cudaMemcpy3DAsync() copies data betwen two 3D objects. The source and destination objects may be in either host memory, device memory, or a CUDA array. The source, destination, extent, and kind of copy performed is specified by thecudaMemcpy3DParms struct which should be initialized to zero before use:
cudaMemcpy3DParms myParms = {0};
The struct passed tocudaMemcpy3DAsync() must specify one ofsrcArray orsrcPtr and one ofdstArray ordstPtr. Passing more than one non-zero source or destination will causecudaMemcpy3DAsync() to return an error.
ThesrcPos anddstPos fields are optional offsets into the source and destination objects and are defined in units of each object's elements. The element for a host or device pointer is assumed to beunsigned char. For CUDA arrays, positions must be in the range [0, 2048) for any dimension.
Theextent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements ofunsigned char.
Thekind field defines the direction of the copy. It must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. ForcudaMemcpyHostToHost orcudaMemcpyHostToDevice orcudaMemcpyDeviceToHost passed as kind and cudaArray type passed as source or destination, if the kind implies cudaArray type to be present on the host,cudaMemcpy3DAsync() will disregard that implication and silently correct the kind based on the fact that cudaArray type can only be present on the device.
If the source and destination are both arrays,cudaMemcpy3DAsync() will return an error if they do not have the same element size.
The source and destination object may not overlap. If overlapping source and destination objects are specified, undefined behavior will result.
The source object must lie entirely within the region defined bysrcPos andextent. The destination object must lie entirely within the region defined bydstPos andextent.
cudaMemcpy3DAsync() returns an error if the pitch ofsrcPtr ordstPtr exceeds the maximum allowed. The pitch of acudaPitchedPtr allocated withcudaMalloc3D() will always be valid.
cudaMemcpy3DAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zerostream argument. Ifkind iscudaMemcpyHostToDevice orcudaMemcpyDeviceToHost andstream is non-zero, the copy may overlap with operations in other streams.
The device version of this function only handles device to device copies and cannot be given local or shared pointers.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMalloc3D,cudaMalloc3DArray,cudaMemset3D,cudaMemcpy3D,cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray, :cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,make_cudaExtent,make_cudaPos,cuMemcpy3DAsync
Performs a batch of memory copies. The batch as a whole executes in stream order but copies within a batch are not guaranteed to execute in any specific order. Note that this means specifying any dependent copies within a batch will result in undefined behavior.
Performs memory copies as specified in theopList array. The length of this array is specified innumOps. Each entry in this array describes a copy operation. This includes among other things, the source and destination operands for the copy as specified in cudaMemcpy3DBatchOp::src and cudaMemcpy3DBatchOp::dst respectively. The source and destination operands of a copy can either be a pointer or a CUDA array. The width, height and depth of a copy is specified in cudaMemcpy3DBatchOp::extent. The width, height and depth of a copy are specified in elements and must not be zero. For pointer-to-pointer copies, the element size is considered to be 1. For pointer to CUDA array or vice versa copies, the element size is determined by the CUDA array. For CUDA array to CUDA array copies, the element size of the two CUDA arrays must match.
For a given operand, if cudaMemcpy3DOperand::type is specified ascudaMemcpyOperandTypePointer, then cudaMemcpy3DOperand::op::ptr will be used. The cudaMemcpy3DOperand::op::ptr::ptr field must contain the pointer where the copy should begin. The cudaMemcpy3DOperand::op::ptr::rowLength field specifies the length of each row in elements and must either be zero or be greater than or equal to the width of the copy specified in cudaMemcpy3DBatchOp::extent::width. The cudaMemcpy3DOperand::op::ptr::layerHeight field specifies the height of each layer and must either be zero or be greater than or equal to the height of the copy specified in cudaMemcpy3DBatchOp::extent::height. When either of these values is zero, that aspect of the operand is considered to be tightly packed according to the copy extent. For managed memory pointers on devices wherecudaDevAttrConcurrentManagedAccess is true or system-allocated pageable memory on devices wherecudaDevAttrPageableMemoryAccess is true, the cudaMemcpy3DOperand::op::ptr::locHint field can be used to hint the location of the operand.
If an operand's type is specified ascudaMemcpyOperandTypeArray, then cudaMemcpy3DOperand::op::array will be used. The cudaMemcpy3DOperand::op::array::array field specifies the CUDA array and cudaMemcpy3DOperand::op::array::offset specifies the 3D offset into that array where the copy begins.
ThecudaMemcpyAttributes::srcAccessOrder indicates the source access ordering to be observed for copies associated with the attribute. If the source access order is set tocudaMemcpySrcAccessOrderStream, then the source will be accessed in stream order. If the source access order is set tocudaMemcpySrcAccessOrderDuringApiCall then it indicates that access to the source pointer can be out of stream order and all accesses must be complete before the API call returns. This flag is suited for ephemeral sources (ex., stack variables) when it's known that no prior operations in the stream can be accessing the memory and also that the lifetime of the memory is limited to the scope that the source variable was declared in. Specifying this flag allows the driver to optimize the copy and removes the need for the user to synchronize the stream after the API call. If the source access order is set tocudaMemcpySrcAccessOrderAny then it indicates that access to the source pointer can be out of stream order and the accesses can happen even after the API call returns. This flag is suited for host pointers allocated outside CUDA (ex., via malloc) when it's known that no prior operations in the stream can be accessing the memory. Specifying this flag allows the driver to optimize the copy on certain platforms. Each memcopy operation inopList must have a valid srcAccessOrder setting, otherwise this API will returncudaErrorInvalidValue.
ThecudaMemcpyAttributes::flags field can be used to specify certain flags for copies. Setting thecudaMemcpyFlagPreferOverlapWithCompute flag indicates that the associated copies should preferably overlap with any compute work. Note that this flag is a hint and can be ignored depending on the platform and other parameters of the copy.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
Perform a 3D memory copy according to the parameters specified inp. See the definition of thecudaMemcpy3DPeerParms structure for documentation of its parameters.
Note that this function is synchronous with respect to the host only if the source or destination of the transfer is host memory. Note also that this copy is serialized with respect to all pending and future asynchronous work in to the current device, the copy's source device, and the copy's destination device (usecudaMemcpy3DPeerAsync to avoid this synchronization).
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitssynchronous behavior for most use cases.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpyPeer,cudaMemcpyAsync,cudaMemcpyPeerAsync,cudaMemcpy3DPeerAsync,cuMemcpy3DPeer
Perform a 3D memory copy according to the parameters specified inp. See the definition of thecudaMemcpy3DPeerParms structure for documentation of its parameters.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpyPeer,cudaMemcpyAsync,cudaMemcpyPeerAsync,cudaMemcpy3DPeerAsync,cuMemcpy3DPeerAsync
Copiescount bytes from the memory area pointed to bysrc to the memory area pointed to bydst, wherekind specifies the direction of the copy, and must be one ofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.
The memory areas may not overlap. CallingcudaMemcpyAsync() withdst andsrc pointers that do not match the direction of the copy results in an undefined behavior.
cudaMemcpyAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zerostream argument. Ifkind iscudaMemcpyHostToDevice orcudaMemcpyDeviceToHost and thestream is non-zero, the copy may overlap with operations in other streams.
The device version of this function only handles device to device copies and cannot be given local or shared pointers.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpyAsync,cuMemcpyDtoHAsync,cuMemcpyHtoDAsync,cuMemcpyDtoDAsync
Performs a batch of memory copies. The batch as a whole executes in stream order but copies within a batch are not guaranteed to execute in any specific order. This API only supports pointer-to-pointer copies. For copies involving CUDA arrays, please seecudaMemcpy3DBatchAsync.
Performs memory copies from source buffers specified insrcs to destination buffers specified indsts. The size of each copy is specified insizes. All three arrays must be of the same length as specified bycount. Since there are no ordering guarantees for copies within a batch, specifying any dependent copies within a batch will result in undefined behavior.
Every copy in the batch has to be associated with a set of attributes specified in theattrs array. Each entry in this array can apply to more than one copy. This can be done by specifying in theattrsIdxs array, the index of the first copy that the corresponding entry in theattrs array applies to. Bothattrs andattrsIdxs must be of the same length as specified bynumAttrs. For example, if a batch has 10 copies listed in dst/src/sizes, the first 6 of which have one set of attributes and the remaining 4 another, thennumAttrs will be 2,attrsIdxs will be {0, 6} andattrs will contains the two sets of attributes. Note that the first entry inattrsIdxs must always be 0. Also, each entry must be greater than the previous entry and the last entry should be less thancount. Furthermore,numAttrs must be lesser than or equal tocount.
ThecudaMemcpyAttributes::srcAccessOrder indicates the source access ordering to be observed for copies associated with the attribute. If the source access order is set tocudaMemcpySrcAccessOrderStream, then the source will be accessed in stream order. If the source access order is set tocudaMemcpySrcAccessOrderDuringApiCall then it indicates that access to the source pointer can be out of stream order and all accesses must be complete before the API call returns. This flag is suited for ephemeral sources (ex., stack variables) when it's known that no prior operations in the stream can be accessing the memory and also that the lifetime of the memory is limited to the scope that the source variable was declared in. Specifying this flag allows the driver to optimize the copy and removes the need for the user to synchronize the stream after the API call. If the source access order is set tocudaMemcpySrcAccessOrderAny then it indicates that access to the source pointer can be out of stream order and the accesses can happen even after the API call returns. This flag is suited for host pointers allocated outside CUDA (ex., via malloc) when it's known that no prior operations in the stream can be accessing the memory. Specifying this flag allows the driver to optimize the copy on certain platforms. Each memcpy operation in the batch must have a validcudaMemcpyAttributes corresponding to it including the appropriate srcAccessOrder setting, otherwise the API will returncudaErrorInvalidValue.
ThecudaMemcpyAttributes::srcLocHint andcudaMemcpyAttributes::dstLocHint allows applications to specify hint locations for operands of a copy when the operand doesn't have a fixed location. That is, these hints are only applicable for managed memory pointers on devices wherecudaDevAttrConcurrentManagedAccess is true or system-allocated pageable memory on devices wherecudaDevAttrPageableMemoryAccess is true. For other cases, these hints are ignored.
ThecudaMemcpyAttributes::flags field can be used to specify certain flags for copies. Setting thecudaMemcpyFlagPreferOverlapWithCompute flag indicates that the associated copies should preferably overlap with any compute work. Note that this flag is a hint and can be ignored depending on the platform and other parameters of the copy.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
Memory regions requested must be either entirely registered with CUDA, or in the case of host pageable transfers, not registered at all. Memory regions spanning over allocations that are both registered and not registered with CUDA are not supported and will return CUDA_ERROR_INVALID_VALUE.
cudaSuccess,cudaErrorInvalidValue,cudaErrorInvalidSymbol,cudaErrorInvalidMemcpyDirection,cudaErrorNoKernelImageForDevice
Copiescount bytes from the memory area pointed to byoffset bytes from the start of symbolsymbol to the memory area pointed to bydst. The memory areas may not overlap.symbol is a variable that resides in global or constant memory space.kind can be eithercudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitssynchronous behavior for most use cases.
Use of a string naming a variable as thesymbol parameter was deprecated in CUDA 4.1 and removed in CUDA 5.0.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy,cuMemcpyDtoH,cuMemcpyDtoD
cudaSuccess,cudaErrorInvalidValue,cudaErrorInvalidSymbol,cudaErrorInvalidMemcpyDirection,cudaErrorNoKernelImageForDevice
Copiescount bytes from the memory area pointed to byoffset bytes from the start of symbolsymbol to the memory area pointed to bydst. The memory areas may not overlap.symbol is a variable that resides in global or constant memory space.kind can be eithercudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.
cudaMemcpyFromSymbolAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zerostream argument. Ifkind iscudaMemcpyDeviceToHost andstream is non-zero, the copy may overlap with operations in other streams.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Use of a string naming a variable as thesymbol parameter was deprecated in CUDA 4.1 and removed in CUDA 5.0.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cuMemcpyAsync,cuMemcpyDtoHAsync,cuMemcpyDtoDAsync
Copies memory from one device to memory on another device.dst is the base device pointer of the destination memory anddstDevice is the destination device.src is the base device pointer of the source memory andsrcDevice is the source device.count specifies the number of bytes to copy.
Note that this function is asynchronous with respect to the host, but serialized with respect all pending and future asynchronous work in to the current device,srcDevice, anddstDevice (usecudaMemcpyPeerAsync to avoid this synchronization).
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitssynchronous behavior for most use cases.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpyAsync,cudaMemcpyPeerAsync,cudaMemcpy3DPeerAsync,cuMemcpyPeer
Copies memory from one device to memory on another device.dst is the base device pointer of the destination memory anddstDevice is the destination device.src is the base device pointer of the source memory andsrcDevice is the source device.count specifies the number of bytes to copy.
Note that this function is asynchronous with respect to the host and all work on other devices.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpyPeer,cudaMemcpyAsync,cudaMemcpy3DPeerAsync,cuMemcpyPeerAsync
cudaSuccess,cudaErrorInvalidValue,cudaErrorInvalidSymbol,cudaErrorInvalidMemcpyDirection,cudaErrorNoKernelImageForDevice
Copiescount bytes from the memory area pointed to bysrc to the memory area pointed to byoffset bytes from the start of symbolsymbol. The memory areas may not overlap.symbol is a variable that resides in global or constant memory space.kind can be eithercudaMemcpyHostToDevice,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitssynchronous behavior for most use cases.
Use of a string naming a variable as thesymbol parameter was deprecated in CUDA 4.1 and removed in CUDA 5.0.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyToSymbolAsync,cudaMemcpyFromSymbolAsync,cuMemcpy,cuMemcpyHtoD,cuMemcpyDtoD
cudaSuccess,cudaErrorInvalidValue,cudaErrorInvalidSymbol,cudaErrorInvalidMemcpyDirection,cudaErrorNoKernelImageForDevice
Copiescount bytes from the memory area pointed to bysrc to the memory area pointed to byoffset bytes from the start of symbolsymbol. The memory areas may not overlap.symbol is a variable that resides in global or constant memory space.kind can be eithercudaMemcpyHostToDevice,cudaMemcpyDeviceToDevice, orcudaMemcpyDefault. PassingcudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However,cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.
cudaMemcpyToSymbolAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. The copy can optionally be associated to a stream by passing a non-zerostream argument. Ifkind iscudaMemcpyHostToDevice andstream is non-zero, the copy may overlap with operations in other streams.
Note that this function may also return error codes from previous, asynchronous launches.
This function exhibitsasynchronous behavior for most use cases.
This function uses standarddefault stream semantics.
Use of a string naming a variable as thesymbol parameter was deprecated in CUDA 4.1 and removed in CUDA 5.0.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemcpy,cudaMemcpy2D,cudaMemcpy2DToArray,cudaMemcpy2DFromArray,cudaMemcpy2DArrayToArray,cudaMemcpyToSymbol,cudaMemcpyFromSymbol,cudaMemcpyAsync,cudaMemcpy2DAsync,cudaMemcpy2DToArrayAsync,cudaMemcpy2DFromArrayAsync,cudaMemcpyFromSymbolAsync,cuMemcpyAsync,cuMemcpyHtoDAsync,cuMemcpyDtoDAsync
Fills the firstcount bytes of the memory area pointed to bydevPtr with the constant byte valuevalue.
Note that this function is asynchronous with respect to the host unlessdevPtr refers to pinned host memory.
Note that this function may also return error codes from previous, asynchronous launches.
See alsomemset synchronization details.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
Sets to the specified valuevalue a matrix (height rows ofwidth bytes each) pointed to bydstPtr.pitch is the width in bytes of the 2D array pointed to bydstPtr, including any padding added to the end of each row. This function performs fastest when the pitch is one that has been passed back bycudaMallocPitch().
Note that this function is asynchronous with respect to the host unlessdevPtr refers to pinned host memory.
Note that this function may also return error codes from previous, asynchronous launches.
See alsomemset synchronization details.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemset,cudaMemset3D,cudaMemsetAsync,cudaMemset2DAsync,cudaMemset3DAsync,cuMemsetD2D8,cuMemsetD2D16,cuMemsetD2D32
Sets to the specified valuevalue a matrix (height rows ofwidth bytes each) pointed to bydstPtr.pitch is the width in bytes of the 2D array pointed to bydstPtr, including any padding added to the end of each row. This function performs fastest when the pitch is one that has been passed back bycudaMallocPitch().
cudaMemset2DAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zerostream argument. Ifstream is non-zero, the operation may overlap with operations in other streams.
The device version of this function only handles device to device copies and cannot be given local or shared pointers.
Note that this function may also return error codes from previous, asynchronous launches.
See alsomemset synchronization details.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemset,cudaMemset2D,cudaMemset3D,cudaMemsetAsync,cudaMemset3DAsync,cuMemsetD2D8Async,cuMemsetD2D16Async,cuMemsetD2D32Async
Initializes each element of a 3D array to the specified valuevalue. The object to initialize is defined bypitchedDevPtr. Thepitch field ofpitchedDevPtr is the width in memory in bytes of the 3D array pointed to bypitchedDevPtr, including any padding added to the end of each row. Thexsize field specifies the logical width of each row in bytes, while theysize field specifies the height of each 2D slice in rows. Thepitch field ofpitchedDevPtr is ignored whenheight anddepth are both equal to 1.
The extents of the initialized region are specified as awidth in bytes, aheight in rows, and adepth in slices.
Extents withwidth greater than or equal to thexsize ofpitchedDevPtr may perform significantly faster than extents narrower than thexsize. Secondarily, extents withheight equal to theysize ofpitchedDevPtr will perform faster than when theheight is shorter than theysize.
This function performs fastest when thepitchedDevPtr has been allocated bycudaMalloc3D().
Note that this function is asynchronous with respect to the host unlesspitchedDevPtr refers to pinned host memory.
Note that this function may also return error codes from previous, asynchronous launches.
See alsomemset synchronization details.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemset,cudaMemset2D,cudaMemsetAsync,cudaMemset2DAsync,cudaMemset3DAsync,cudaMalloc3D,make_cudaPitchedPtr,make_cudaExtent
Initializes each element of a 3D array to the specified valuevalue. The object to initialize is defined bypitchedDevPtr. Thepitch field ofpitchedDevPtr is the width in memory in bytes of the 3D array pointed to bypitchedDevPtr, including any padding added to the end of each row. Thexsize field specifies the logical width of each row in bytes, while theysize field specifies the height of each 2D slice in rows. Thepitch field ofpitchedDevPtr is ignored whenheight anddepth are both equal to 1.
The extents of the initialized region are specified as awidth in bytes, aheight in rows, and adepth in slices.
Extents withwidth greater than or equal to thexsize ofpitchedDevPtr may perform significantly faster than extents narrower than thexsize. Secondarily, extents withheight equal to theysize ofpitchedDevPtr will perform faster than when theheight is shorter than theysize.
This function performs fastest when thepitchedDevPtr has been allocated bycudaMalloc3D().
cudaMemset3DAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zerostream argument. Ifstream is non-zero, the operation may overlap with operations in other streams.
The device version of this function only handles device to device copies and cannot be given local or shared pointers.
Note that this function may also return error codes from previous, asynchronous launches.
See alsomemset synchronization details.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemset,cudaMemset2D,cudaMemset3D,cudaMemsetAsync,cudaMemset2DAsync,cudaMalloc3D,make_cudaPitchedPtr,make_cudaExtent
Fills the firstcount bytes of the memory area pointed to bydevPtr with the constant byte valuevalue.
cudaMemsetAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zerostream argument. Ifstream is non-zero, the operation may overlap with operations in other streams.
The device version of this function only handles device to device copies and cannot be given local or shared pointers.
Note that this function may also return error codes from previous, asynchronous launches.
See alsomemset synchronization details.
This function uses standarddefault stream semantics.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaMemset,cudaMemset2D,cudaMemset3D,cudaMemset2DAsync,cudaMemset3DAsync,cuMemsetD8Async,cuMemsetD16Async,cuMemsetD32Async
Returns the memory requirements of a CUDA mipmapped array inmemoryRequirements If the CUDA mipmapped array is not allocated with flagcudaArrayDeferredMappingcudaErrorInvalidValue will be returned.
The returned value incudaArrayMemoryRequirements::size represents the total size of the CUDA mipmapped array. The returned value incudaArrayMemoryRequirements::alignment represents the alignment necessary for mapping the CUDA mipmapped array.
See also:
Returns the sparse array layout properties insparseProperties. If the CUDA mipmapped array is not allocated with flagcudaArraySparsecudaErrorInvalidValue will be returned.
For non-layered CUDA mipmapped arrays,cudaArraySparseProperties::miptailSize returns the size of the mip tail region. The mip tail region includes all mip levels whose width, height or depth is less than that of the tile. For layered CUDA mipmapped arrays, ifcudaArraySparseProperties::flags containscudaArraySparsePropertiesSingleMipTail, thencudaArraySparseProperties::miptailSize specifies the size of the mip tail of all layers combined. Otherwise,cudaArraySparseProperties::miptailSize specifies mip tail size per layer. The returned value ofcudaArraySparseProperties::miptailFirstLevel is valid only ifcudaArraySparseProperties::miptailSize is non-zero.
See also:
cudaExtent specified byw,h, andd
cudaPitchedPtr specified byd,p,xsz, andysz
cudaPos specified byx,y, andz