CUDA support#

Contexts#

class CudaDeviceManager#

Public Functions

Result<std::shared_ptr<CudaDevice>> GetDevice(int device_number)#

Get a CudaDevice instance for a particular device.

Parameters:

device_number – [in] the CUDA device number

Result<std::shared_ptr<CudaContext>> GetContext(int device_number)#

Get the CUDA driver context for a particular device.

Parameters:

device_number – [in] the CUDA device number

Returns:

cached context

Result<std::shared_ptr<CudaContext>> GetSharedContext(int device_number, void* handle)#

Get the shared CUDA driver context for a particular device.

Parameters:
  • device_number – [in] the CUDA device number

  • handle – [in] CUDA context handle created by another library

Returns:

shared context

Result<std::shared_ptr<CudaHostBuffer>> AllocateHost(int device_number, int64_t nbytes)#

Allocate host memory with fast access to the given GPU device.

Parameters:
  • device_number – [in] the CUDA device number

  • nbytes – [in] number of bytes

Returns:

Host buffer or Status

Status FreeHost(void* data, int64_t nbytes)#

Free host memory.

The given memory pointer must have been allocated with AllocateHost.
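For orientation, a minimal sketch of how these calls compose. The singleton accessor CudaDeviceManager::Instance() is an assumption here, since it is not part of the listing above; error handling uses Arrow's Result/Status macros.

#include <arrow/gpu/cuda_api.h>

// Minimal sketch: resolve a device, its cached context, and pinned host
// memory through the device manager. CudaDeviceManager::Instance() is an
// assumed accessor, not documented in this section.
arrow::Status UseDeviceManager() {
  ARROW_ASSIGN_OR_RAISE(auto* manager, arrow::cuda::CudaDeviceManager::Instance());
  ARROW_ASSIGN_OR_RAISE(auto device, manager->GetDevice(0));
  ARROW_ASSIGN_OR_RAISE(auto context, manager->GetContext(0));
  // Host memory with fast access to device 0; released when the buffer
  // is destructed.
  ARROW_ASSIGN_OR_RAISE(auto host_buffer, manager->AllocateHost(0, 1024));
  return arrow::Status::OK();
}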

class CudaContext : public std::enable_shared_from_this<CudaContext>#

Object-oriented interface to the low-level CUDA driver API.

Public Functions

Result<std::unique_ptr<CudaBuffer>> Allocate(int64_t nbytes)#

Allocate CUDA memory on the GPU device for this context.

Parameters:

nbytes – [in] number of bytes

Returns:

the allocated buffer

Status Free(void* device_ptr, int64_t nbytes)#

Release CUDA memory on the GPU device for this context.

Parameters:
  • device_ptr – [in] the buffer address

  • nbytes – [in] number of bytes

Returns:

Status

Result<std::shared_ptr<CudaBuffer>> View(uint8_t* data, int64_t nbytes)#

Create a view of CUDA memory on the GPU device of this context.

Note

The caller is responsible for allocating and freeing the memory, as well as for ensuring that the memory belongs to the CUDA context that this CudaContext instance holds.

Parameters:
  • data – [in] the starting device address

  • nbytes – [in] number of bytes

Returns:

the view buffer
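A short sketch of how Allocate and View relate; the context is assumed to come from CudaDeviceManager::GetContext or CudaDevice::GetContext.

#include <arrow/gpu/cuda_api.h>

// Sketch: allocate device memory, then create a non-owning view of part
// of it. The view must not outlive the owning buffer.
arrow::Status UseContext(const std::shared_ptr<arrow::cuda::CudaContext>& context) {
  ARROW_ASSIGN_OR_RAISE(auto buffer, context->Allocate(4096));
  ARROW_ASSIGN_OR_RAISE(auto view,
                        context->View(buffer->mutable_data() + 2048, 2048));
  return arrow::Status::OK();
}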

Result<std::shared_ptr<CudaBuffer>> OpenIpcBuffer(const CudaIpcMemHandle& ipc_handle)#

Open an existing CUDA IPC memory handle.

Parameters:

ipc_handle – [in] opaque pointer to CUipcMemHandle (driver API)

Returns:

a CudaBuffer referencing the IPC segment

Status CloseIpcBuffer(CudaBuffer* buffer)#

Close memory mapped with an IPC buffer.

Parameters:

buffer – [in] a CudaBuffer referencing the IPC segment

Returns:

Status
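Together with CudaIpcMemHandle::FromBuffer (documented under IPC below), these two methods form the consumer side of CUDA IPC. A hedged sketch, where handle_bytes is a hypothetical buffer carrying the serialized handle received from another process:

#include <arrow/gpu/cuda_api.h>

// Sketch: map a device buffer exported by another process, then unmap it.
arrow::Status OpenRemoteBuffer(
    const std::shared_ptr<arrow::cuda::CudaContext>& context,
    const std::shared_ptr<arrow::Buffer>& handle_bytes) {
  ARROW_ASSIGN_OR_RAISE(
      auto ipc_handle,
      arrow::cuda::CudaIpcMemHandle::FromBuffer(handle_bytes->data()));
  ARROW_ASSIGN_OR_RAISE(auto buffer, context->OpenIpcBuffer(*ipc_handle));
  // ... read from `buffer` ...
  return context->CloseIpcBuffer(buffer.get());
}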

Status Synchronize(void)#

Block until all device tasks are completed.

void* handle() const#

Expose CUDA context handle to other libraries.

std::shared_ptr<CudaMemoryManager> memory_manager() const#

Return the default memory manager tied to this context’s device.

std::shared_ptr<CudaDevice> device() const#

Return the device instance associated with this context.

int device_number() const#

Return the logical device number.

Result<uintptr_t> GetDeviceAddress(uint8_t* addr)#

Return the device address that is reachable from kernels running in the context.

The device address is defined as a memory address accessible by the device. While it is often a device memory address, it can also be a host memory address: for instance, when the memory is allocated as host memory (using cudaMallocHost or cudaHostAlloc) or as managed memory (using cudaMallocManaged), or when the host memory is page-locked (using cudaHostRegister).

Parameters:

addr – [in] device or host memory address

Returns:

the device address
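For example, a sketch resolving the device-visible address of pinned host memory (host_buffer is assumed to come from CudaDeviceManager::AllocateHost):

#include <arrow/gpu/cuda_api.h>

// Sketch: obtain an address for pinned host memory that kernels running
// in this context can dereference.
arrow::Status ResolveAddress(
    const std::shared_ptr<arrow::cuda::CudaContext>& context,
    const std::shared_ptr<arrow::cuda::CudaHostBuffer>& host_buffer) {
  ARROW_ASSIGN_OR_RAISE(uintptr_t device_addr,
                        context->GetDeviceAddress(host_buffer->mutable_data()));
  // `device_addr` can now be passed to a kernel launched in this context.
  return arrow::Status::OK();
}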

Devices#

class CudaDevice : public arrow::Device#

Device implementation for CUDA.

Each CudaDevice instance is tied to a particular CUDA device (identified by its logical device number).

Public Functions

virtual const char* type_name() const override#

A shorthand for this device’s type.

The returned value is different for each device class, but is the same for all instances of a given class. It can be used as a replacement for RTTI.

virtual std::string ToString() const override#

A human-readable description of the device.

The returned value should be detailed enough to distinguish between different instances, where necessary.

virtual bool Equals(const Device&) const override#

Whether this instance points to the same device as another one.

virtual std::shared_ptr<MemoryManager> default_memory_manager() override#

Return a MemoryManager instance tied to this device.

The returned instance uses default parameters for this device type’s MemoryManager implementation. Some devices also allow constructing MemoryManager instances with non-default parameters.

inline virtual DeviceAllocationType device_type() const override#

Return the DeviceAllocationType of this device.

inline virtual int64_t device_id() const override#

A device ID to identify this device if there are multiple of this type.

If there is no “device_id” equivalent (such as for the main CPU device on non-NUMA systems), returns -1.

int device_number() const#

Return the logical device number.

std::string device_name() const#

Return the GPU model name.

int64_t total_memory() const#

Return total memory on this device.

int handle() const#

Return a raw CUDA device handle.

The returned value can be used to expose this device to other libraries. It should be interpreted as CUdevice.

Result<std::shared_ptr<CudaContext>> GetContext()#

Get a CUDA driver context for this device.

The returned context is associated with the primary CUDA context for the device. This is the recommended way of getting a context for a device, as it allows interoperating transparently with any library using the primary CUDA context API.

Result<std::shared_ptr<CudaContext>> GetSharedContext(void* handle)#

Get a CUDA driver context for this device, using an existing handle.

The handle is not owned: it will not be released when the CudaContext is destroyed. This function should only be used if you need interoperation with a library that uses a non-primary context.

Parameters:

handle – [in] CUDA context handle created by another library

Result<std::shared_ptr<CudaHostBuffer>> AllocateHostBuffer(int64_t size)#

Allocate a host-residing, GPU-accessible buffer.

The buffer is allocated using this device’s primary context.

Parameters:

size – [in] the buffer size in bytes

virtual Result<std::shared_ptr<Device::Stream>> MakeStream(unsigned int flags) override#

Create a CUstream wrapper in the current context.

virtual Result<std::shared_ptr<Device::Stream>> WrapStream(void* device_stream, Stream::release_fn_t release_fn) override#

Wrap a pointer to an existing stream.

Parameters:
  • device_stream – the passed-in stream (should be a CUstream*)

  • release_fn – destructor to free the stream. nullptr may be passed to indicate there is no destruction/freeing necessary.

Public Static Functions

static Result<std::shared_ptr<CudaDevice>> Make(int device_number)#

Return a CudaDevice instance for a particular device.

Parameters:

device_number – [in] the CUDA device number
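A sketch combining Make with the accessors documented above:

#include <arrow/gpu/cuda_api.h>

// Sketch: the per-device entry point, followed by the primary context
// and a pinned host allocation made through it.
arrow::Status UseDevice() {
  ARROW_ASSIGN_OR_RAISE(auto device, arrow::cuda::CudaDevice::Make(0));
  ARROW_ASSIGN_OR_RAISE(auto context, device->GetContext());
  ARROW_ASSIGN_OR_RAISE(auto host_buffer, device->AllocateHostBuffer(1 << 20));
  return arrow::Status::OK();
}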

class Stream : public arrow::Device::Stream#

EXPERIMENTAL: Wrapper for CUstreams.

Does not own the CUstream object, which must be separately constructed and freed using cuStreamCreate and cuStreamDestroy (or equivalent). Default construction uses the CUDA default stream, and does not allow construction from literal 0 or nullptr.

Public Functions

virtual Status WaitEvent(const Device::SyncEvent&) override#

Make the stream wait on the provided event.

Tells the stream that it should wait until the synchronization event is completed without blocking the CPU.

virtual Status Synchronize() const override#

Blocks the current thread until a stream’s remaining tasks are completed.

class SyncEvent : public arrow::Device::SyncEvent#

Public Functions

virtual Status Wait() override#

Block until the sync event is marked completed.

virtual Status Record(const Device::Stream&) override#

Record the wrapped event on the stream.

Once the stream completes the tasks previously added to it, it will trigger the event.
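A sketch of how a stream and a sync event cooperate; the device is assumed to come from CudaDevice::Make, and the event is created through the memory manager's MakeDeviceSyncEvent (documented below).

#include <arrow/gpu/cuda_api.h>

// Sketch: record an event on a stream, then block the CPU until the
// stream reaches it.
arrow::Status OrderWork(const std::shared_ptr<arrow::cuda::CudaDevice>& device) {
  ARROW_ASSIGN_OR_RAISE(auto stream, device->MakeStream(/*flags=*/0));
  ARROW_ASSIGN_OR_RAISE(auto event,
                        device->default_memory_manager()->MakeDeviceSyncEvent());
  // The stream will trigger the event once previously enqueued tasks finish.
  ARROW_RETURN_NOT_OK(event->Record(*stream));
  return event->Wait();
}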

class CudaMemoryManager : public arrow::MemoryManager#

MemoryManager implementation for CUDA.

Public Functions

virtual Result<std::shared_ptr<io::RandomAccessFile>> GetBufferReader(std::shared_ptr<Buffer> buf) override#

Create a RandomAccessFile to read a particular buffer.

The given buffer must be tied to this MemoryManager.

See also the Buffer::GetReader shorthand.

virtual Result<std::shared_ptr<io::OutputStream>> GetBufferWriter(std::shared_ptr<Buffer> buf) override#

Create an OutputStream to write to a particular buffer.

The given buffer must be mutable and tied to this MemoryManager. The returned stream object writes into the buffer’s underlying memory (but it won’t resize it).

See also the Buffer::GetWriter shorthand.

virtual Result<std::unique_ptr<Buffer>> AllocateBuffer(int64_t size) override#

Allocate a (mutable) Buffer.

The buffer will be allocated in the device’s memory.
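A sketch of device allocation through this generic interface, copied back to CPU memory; the static MemoryManager::CopyBuffer helper and default_cpu_memory_manager() are assumptions here, as they are not part of this listing.

#include <arrow/gpu/cuda_api.h>

// Sketch: allocate on the device, then copy to host memory using the
// generic MemoryManager::CopyBuffer helper (assumed, not listed above).
arrow::Status RoundTrip(const std::shared_ptr<arrow::cuda::CudaMemoryManager>& mm) {
  ARROW_ASSIGN_OR_RAISE(auto device_buffer, mm->AllocateBuffer(1024));
  ARROW_ASSIGN_OR_RAISE(
      auto cpu_buffer,
      arrow::MemoryManager::CopyBuffer(std::move(device_buffer),
                                       arrow::default_cpu_memory_manager()));
  return arrow::Status::OK();
}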

std::shared_ptr<CudaDevice> cuda_device() const#

The CudaDevice instance tied to this MemoryManager.

This is a useful shorthand returning a concrete-typed pointer, avoiding having to cast the device() result.

virtual Result<std::shared_ptr<Device::SyncEvent>> MakeDeviceSyncEvent() override#

Creates a wrapped CUevent.

Calls cuEventCreate; cuEventDestroy is called internally when the event is destructed.

virtual Result<std::shared_ptr<Device::SyncEvent>> WrapDeviceSyncEvent(void* sync_event, Device::SyncEvent::release_fn_t release_sync_event) override#

Wraps an existing event into a sync event.

Parameters:
  • sync_event – the event to wrap; must be a CUevent*

  • release_sync_event – a function to call during destruction; nullptr or a no-op function can be passed to indicate ownership is maintained externally

Buffers#

class CudaBuffer : public arrow::Buffer#

An Arrow buffer located on a GPU device.

Be careful when using this in any Arrow code that may not be GPU-aware.

Public Functions

Status CopyToHost(const int64_t position, const int64_t nbytes, void* out) const#

Copy memory from GPU device to CPU host.

Parameters:
  • position – [in] start position inside buffer to copy bytes from

  • nbytes – [in] number of bytes to copy

  • out – [out] start address of the host memory area to copy to

Returns:

Status

Status CopyFromHost(const int64_t position, const void* data, int64_t nbytes)#

Copy memory to device at position.

Parameters:
  • position – [in] start position to copy bytes to

  • data – [in] the host data to copy

  • nbytes – [in] number of bytes to copy

Returns:

Status
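A sketch of a host-to-device-to-host round trip using CopyFromHost and CopyToHost:

#include <arrow/gpu/cuda_api.h>
#include <string>

// Sketch: upload host bytes to a device buffer, then download them back.
arrow::Status RoundTripBytes(const std::shared_ptr<arrow::cuda::CudaContext>& context) {
  const std::string payload = "hello device";
  ARROW_ASSIGN_OR_RAISE(auto device_buffer, context->Allocate(payload.size()));
  ARROW_RETURN_NOT_OK(
      device_buffer->CopyFromHost(0, payload.data(), payload.size()));
  std::string result(payload.size(), '\0');
  ARROW_RETURN_NOT_OK(device_buffer->CopyToHost(0, payload.size(), result.data()));
  return arrow::Status::OK();
}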

Status CopyFromDevice(const int64_t position, const void* data, int64_t nbytes)#

Copy memory from device to device at position.

Note

It is assumed that both source and destination device memories have been allocated within the same context.

Parameters:
  • position – [in] start position inside buffer to copy bytes to

  • data – [in] start address of the device memory area to copy from

  • nbytes – [in] number of bytes to copy

Returns:

Status

Status CopyFromAnotherDevice(const std::shared_ptr<CudaContext>& src_ctx, const int64_t position, const void* data, int64_t nbytes)#

Copy memory from another device to this device at position.

Parameters:
  • src_ctx – [in] context of the source device memory

  • position – [in] start position inside buffer to copy bytes to

  • data – [in] start address of the other device's memory area to copy from

  • nbytes – [in] number of bytes to copy

Returns:

Status

virtual Result<std::shared_ptr<CudaIpcMemHandle>> ExportForIpc()#

Expose this device buffer as IPC memory which can be used in other processes.

Note

After calling this function, this device memory will not be freed when the CudaBuffer is destructed.

Returns:

Handle or Status
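A sketch of the producer side of CUDA IPC, exporting a buffer and serializing the handle for transport (the counterpart of OpenIpcBuffer above):

#include <arrow/gpu/cuda_api.h>

// Sketch: export a device buffer and serialize the IPC handle so it can
// be sent to another process over any transport.
arrow::Status ShareBuffer(const std::shared_ptr<arrow::cuda::CudaBuffer>& buffer) {
  ARROW_ASSIGN_OR_RAISE(auto ipc_handle, buffer->ExportForIpc());
  ARROW_ASSIGN_OR_RAISE(auto handle_bytes, ipc_handle->Serialize());
  // ... send handle_bytes to the consumer process ...
  return arrow::Status::OK();
}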

Public Static Functions

static Result<std::shared_ptr<CudaBuffer>> FromBuffer(std::shared_ptr<Buffer> buffer)#

Convert a generic buffer back into a CudaBuffer.

Note

This function returns an error if the buffer isn’t backed by GPU memory.

Parameters:

buffer – [in] buffer to convert

Returns:

CudaBuffer or Status

class CudaHostBuffer : public arrow::MutableBuffer#

Device-accessible CPU memory created using cudaHostAlloc.

Public Functions

Result<uintptr_t> GetDeviceAddress(const std::shared_ptr<CudaContext>& ctx)#

Return a device address the GPU can read this memory from.

Memory Input / Output#

class CudaBufferReader : public arrow::io::internal::RandomAccessFileConcurrencyWrapper<CudaBufferReader>#

File interface for zero-copy read from CUDA buffers.

CAUTION: reading to a Buffer returns a Buffer pointing to device memory. It will generally not be compatible with Arrow code expecting a buffer pointing to CPU memory. Reading to a raw pointer, though, copies device memory into the host memory pointed to.

Public Functions

virtual bool closed() const override#

Return whether the stream is closed.

virtual bool supports_zero_copy() const override#

Return true if InputStream is capable of zero copy Buffer reads.

Zero copy reads imply the use of Buffer-returning Read() overloads.
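A sketch contrasting the two read modes:

#include <arrow/gpu/cuda_api.h>
#include <cstdint>

// Sketch: Buffer-returning reads are zero-copy and stay on the device;
// reads into a raw pointer copy device memory to the host.
arrow::Status ReadModes(const std::shared_ptr<arrow::cuda::CudaBuffer>& buffer) {
  arrow::cuda::CudaBufferReader reader(buffer);
  // Zero-copy: `slice` still points to device memory.
  ARROW_ASSIGN_OR_RAISE(auto slice, reader.Read(64));
  // Copying: the next 64 bytes land in host memory.
  uint8_t host_bytes[64];
  ARROW_ASSIGN_OR_RAISE(int64_t n_read, reader.Read(64, host_bytes));
  return arrow::Status::OK();
}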

class CudaBufferWriter : public arrow::io::WritableFile#

File interface for writing to CUDA buffers, with optional buffering.

Public Functions

virtual Status Close() override#

Close writer and flush buffered bytes to GPU.

virtual bool closed() const override#

Return whether the stream is closed.

virtual Status Flush() override#

Flush buffered bytes to GPU.

virtual Status Write(const void* data, int64_t nbytes) override#

Write the given data to the stream.

This method always processes the bytes in full. Depending on the semantics of the stream, the data may be written out immediately, held in a buffer, or written asynchronously. In the case where the stream buffers the data, it will be copied. To avoid potentially large copies, use the Write variant that takes an owned Buffer.

virtual Result<int64_t> Tell() const override#

Return the position in this stream.

Status SetBufferSize(const int64_t buffer_size)#

Set the CPU buffer size to limit calls to cudaMemcpy.

By default, writes are unbuffered.

Parameters:

buffer_size – [in] the size of the CPU buffer to allocate

Returns:

Status

int64_t buffer_size() const#

Returns the size of the host (CPU) buffer, 0 if unbuffered.

int64_t num_bytes_buffered() const#

Returns the number of bytes buffered on the host.
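A sketch of buffered writing, assuming the writer wraps a pre-allocated device buffer:

#include <arrow/gpu/cuda_api.h>

// Sketch: coalesce small writes through a 4 KiB CPU buffer; writes are
// unbuffered by default. Close() flushes any remaining buffered bytes.
arrow::Status WriteToDevice(const std::shared_ptr<arrow::cuda::CudaBuffer>& buffer) {
  arrow::cuda::CudaBufferWriter writer(buffer);
  ARROW_RETURN_NOT_OK(writer.SetBufferSize(4096));
  ARROW_RETURN_NOT_OK(writer.Write("abc", 3));
  return writer.Close();
}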

IPC#

class CudaIpcMemHandle#

Public Functions

Result<std::shared_ptr<Buffer>> Serialize(MemoryPool* pool = default_memory_pool()) const#

Write CudaIpcMemHandle to a Buffer.

Parameters:

pool – [in] a MemoryPool to allocate memory from

Returns:

Buffer or Status

Public Static Functions

static Result<std::shared_ptr<CudaIpcMemHandle>> FromBuffer(const void* opaque_handle)#

Create CudaIpcMemHandle from an opaque buffer (e.g. from another process).

Parameters:

opaque_handle – [in] a CUipcMemHandle as a const void*

Returns:

Handle or Status

ARROW_CUDA_EXPORT Result<std::shared_ptr<CudaBuffer>> SerializeRecordBatch(const RecordBatch& batch, CudaContext* ctx)

Write record batch message to GPU device memory.

Parameters:
  • batch – [in] record batch to write

  • ctx – [in] CudaContext to allocate device memory from

Returns:

CudaBuffer or Status

ARROW_CUDA_EXPORT Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(const std::shared_ptr<Schema>& schema, const ipc::DictionaryMemo* dictionary_memo, const std::shared_ptr<CudaBuffer>& buffer, MemoryPool* pool = default_memory_pool())

ReadRecordBatch specialized to handle metadata on CUDA device.

Parameters:
  • schema – [in] the Schema for the record batch

  • dictionary_memo – [in] DictionaryMemo which has any dictionaries. Can be nullptr if you are sure there are no dictionary-encoded fields

  • buffer – [in] a CudaBuffer containing the complete IPC message

  • pool – [in] a MemoryPool to use for allocating space for the metadata

Returns:

RecordBatch or Status
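A sketch of a device-side IPC round trip with these two functions; batch is assumed to be a prepared arrow::RecordBatch.

#include <arrow/gpu/cuda_api.h>
#include <arrow/record_batch.h>

// Sketch: serialize a record batch into device memory, then read it back.
// Metadata is handled on the host; body buffers stay on the device.
arrow::Status RoundTripBatch(
    const std::shared_ptr<arrow::RecordBatch>& batch,
    const std::shared_ptr<arrow::cuda::CudaContext>& context) {
  ARROW_ASSIGN_OR_RAISE(auto device_buffer,
                        arrow::cuda::SerializeRecordBatch(*batch, context.get()));
  ARROW_ASSIGN_OR_RAISE(
      auto read_batch,
      arrow::cuda::ReadRecordBatch(batch->schema(), /*dictionary_memo=*/nullptr,
                                   device_buffer));
  return arrow::Status::OK();
}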
