 RMM: RAPIDS Memory Manager

NOTE: For the latest stable README.md ensure you are on the main branch.

Resources

Overview

Achieving optimal performance in GPU-centric workflows frequently requires customizing how host and device memory are allocated. For example, using "pinned" host memory for asynchronous host <-> device memory transfers, or using a device memory pool sub-allocator to reduce the cost of dynamic device memory allocation.

The goal of the RAPIDS Memory Manager (RMM) is to provide:

  • A common interface that allows customizing device and host memory allocation
  • A collection of implementations of the interface
  • A collection of data structures that use the interface for memory allocation

For information on the interface RMM provides and how to use RMM in your C++ code, see below.

For a walkthrough about the design of the RAPIDS Memory Manager, read Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager on the NVIDIA Developer Blog.

Installation

Conda

RMM can be installed with conda. You can get a minimal conda installation with miniforge.

Install RMM with:

conda install -c rapidsai -c conda-forge -c nvidia rmm cuda-version=12.8

We also provide nightly conda packages built from the HEAD of our latest development branch.

Note: RMM is supported only on Linux, and only tested with Python versions 3.10, 3.11, and 3.12.

Note: The RMM package from conda requires building with GCC 9 or later. Otherwise, your application may fail to build.

See the RAPIDS Installation Guide for system requirements.

Building from Source

Get RMM Dependencies

Compiler requirements:

  • gcc version 9.3+
  • nvcc version 11.4+
  • cmake version 3.30.4+

CUDA/GPU requirements:

GPU Support:

  • RMM is tested and supported only on Volta architecture and newer (Compute Capability 7.0+). It may work on earlier architectures.

Python requirements:

  • rapids-build-backend (available from PyPI or the rapidsai conda channel)
  • scikit-build-core
  • cuda-python
  • cython

For more details, see pyproject.toml

Script to build RMM from source

To install RMM from source, ensure the dependencies are met and follow the steps below:

  • Clone the repository
$ git clone https://github.com/rapidsai/rmm.git
$ cd rmm
  • Create the conda development environment rmm_dev
# create the conda environment (assuming in base `rmm` directory)
$ conda env create --name rmm_dev --file conda/environments/all_cuda-128_arch-x86_64.yaml
# activate the environment
$ conda activate rmm_dev
  • Build and install librmm using cmake & make. CMake depends on the nvcc executable being on your path or defined in the CUDACXX environment variable.
$ mkdir build                                     # make a build directory
$ cd build                                        # enter the build directory
$ cmake .. -DCMAKE_INSTALL_PREFIX=/install/path   # configure cmake ... use $CONDA_PREFIX if you're using Anaconda
$ make -j                                         # compile the library librmm.so ... '-j' will start a parallel job using the number of physical cores available on your system
$ make install                                    # install the library librmm.so to '/install/path'
  • Build and install librmm and rmm using build.sh. build.sh creates the build dir at the root of the git repository. build.sh depends on the nvcc executable being on your path or defined in the CUDACXX environment variable.
$ ./build.sh -h                   # Display help and exit
$ ./build.sh -n librmm            # Build librmm without installing
$ ./build.sh -n rmm               # Build rmm without installing
$ ./build.sh -n librmm rmm        # Build librmm and rmm without installing
$ ./build.sh librmm rmm           # Build and install librmm and rmm
  • To run tests (Optional):
$ cd build        # if you are not already in the build directory
$ make test
  • Build, install, and test the rmm python package, in the python folder:
# In the root rmm directory
$ python -m pip install -e ./python/rmm
$ pytest -v

Done! You are ready to develop for the RMM OSS project.

Caching third-party dependencies

RMM uses CPM.cmake to handle third-party dependencies like spdlog, Thrust, GoogleTest, GoogleBenchmark. In general you won't have to worry about it. If CMake finds an appropriate version on your system, it uses it (you can help it along by setting CMAKE_PREFIX_PATH to point to the installed location). Otherwise those dependencies will be downloaded as part of the build.

If you frequently start new builds from scratch, consider setting the environment variable CPM_SOURCE_CACHE to an external download directory to avoid repeated downloads of the third-party dependencies.

Using RMM in a downstream CMake project

The installed RMM library provides a set of config files that makes it easy to integrate RMM into your own CMake project. Add the following to CMakeLists.txt:

find_package(rmm [VERSION])
# ...
target_link_libraries(<your-target> (PRIVATE|PUBLIC|INTERFACE) rmm::rmm)

Since RMM is a header-only library, this does not actually link RMM, but it makes the headers available and pulls in transitive dependencies. If RMM is not installed in a default location, use CMAKE_PREFIX_PATH or rmm_ROOT to point to its location.

One of RMM's dependencies is the Thrust library, so the above automatically pulls in Thrust by means of a dependency on the rmm::Thrust target. By default it uses the standard configuration of Thrust. If you want to customize it, you can set the variables THRUST_HOST_SYSTEM and THRUST_DEVICE_SYSTEM; see Thrust's CMake documentation.

Using CPM to manage RMM

RMM uses CPM.cmake to manage its dependencies, including CCCL, and you can use CPM for your project's dependency on RMM.

There is an issue with using CPM's single-argument compact syntax for RMM/CCCL as it transitively marks targets as SYSTEM dependencies. This causes the CCCL headers pulled in through CPM to be of lower priority to the preprocessor than the (potentially outdated) CCCL headers provided by the CUDA SDK. To avoid this issue, use CPM's multi-argument syntax instead:

CPMAddPackage(NAME rmm [VERSION]
              GITHUB_REPOSITORY rapidsai/rmm
              SYSTEM Off)
# ...
target_link_libraries(<your-target> (PRIVATE|PUBLIC|INTERFACE) rmm::rmm)

Using RMM in C++

The first goal of RMM is to provide a common interface for device and host memory allocation. This allows both users and implementers of custom allocation logic to program to a single interface.

To this end, RMM defines two abstract interface classes:

  • rmm::mr::device_memory_resource for device memory allocation
  • rmm::mr::host_memory_resource for host memory allocation

These classes are based on the std::pmr::memory_resource interface class introduced in C++17 for polymorphic memory allocation.

device_memory_resource

rmm::mr::device_memory_resource is the base class that defines the interface for allocating and freeing device memory.

It has two key functions:

  1. void* device_memory_resource::allocate(std::size_t bytes, cuda_stream_view s)

    • Returns a pointer to an allocation of at least bytes bytes.
  2. void device_memory_resource::deallocate(void* p, std::size_t bytes, cuda_stream_view s)

    • Reclaims a previous allocation of size bytes pointed to by p.
    • p must have been returned by a previous call to allocate(bytes), otherwise behavior is undefined

It is up to a derived class to provide implementations of these functions. See available resources for example device_memory_resource derived classes.

Unlike std::pmr::memory_resource, rmm::mr::device_memory_resource does not allow specifying an alignment argument. All allocations are required to be aligned to at least 256B. Furthermore, device_memory_resource adds an additional cuda_stream_view argument to allow specifying the stream on which to perform the (de)allocation.
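
For illustration, here is a minimal sketch of a derived resource (not part of RMM; the class name my_cuda_resource is hypothetical, and the private do_allocate/do_deallocate override points are stated here as assumptions about the current interface) that simply forwards to cudaMalloc/cudaFree and ignores the stream, much as cuda_memory_resource does:

#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <new>

class my_cuda_resource final : public rmm::mr::device_memory_resource {
 private:
  // Return at least `bytes` of device memory; this simple resource ignores `stream`.
  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    void* ptr{nullptr};
    if (cudaMalloc(&ptr, bytes) != cudaSuccess) { throw std::bad_alloc{}; }
    return ptr;
  }

  // Free memory previously returned by do_allocate; `stream` is ignored as well.
  void do_deallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    cudaFree(ptr);
  }
};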

Stream-ordered Memory Allocation

rmm::mr::device_memory_resource is a base class that provides stream-ordered memory allocation. This allows optimizations such as re-using memory deallocated on the same stream without the overhead of synchronization.

A call to device_memory_resource::allocate(bytes, stream_a) returns a pointer that is valid to use on stream_a. Using the memory on a different stream (say stream_b) is Undefined Behavior unless the two streams are first synchronized, for example by using cudaStreamSynchronize(stream_a) or by recording a CUDA event on stream_a and then calling cudaStreamWaitEvent(stream_b, event).
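
As a concrete sketch of the event-based synchronization described above (mr, stream_a, and stream_b are assumed to already exist; the work that uses the memory is only indicated by comments):

// p is valid to use on stream_a immediately after this call
void* p = mr.allocate(100, stream_a);
// ... enqueue work that uses p on stream_a ...

// Make stream_b wait for all work submitted so far to stream_a
cudaEvent_t event;
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
cudaEventRecord(event, stream_a.value());
cudaStreamWaitEvent(stream_b.value(), event, 0);

// ... now it is safe to enqueue work that uses p on stream_b ...

// Deallocate on the stream where p was last used
mr.deallocate(p, 100, stream_b);
cudaEventDestroy(event);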

The stream specified to device_memory_resource::deallocate should be a stream on which it is valid to use the deallocated memory immediately for another allocation. Typically this is the stream on which the allocation was last used before the call to deallocate. The passed stream may be used internally by a device_memory_resource for managing available memory with minimal synchronization, and it may also be synchronized at a later time, for example using a call to cudaStreamSynchronize().

For this reason, it is Undefined Behavior to destroy a CUDA stream that is passed to device_memory_resource::deallocate. If the stream on which the allocation was last used has been destroyed before calling deallocate or it is known that it will be destroyed, it is likely better to synchronize the stream (before destroying it) and then pass a different stream to deallocate (e.g. the default stream).

Note that device memory data structures such as rmm::device_buffer and rmm::device_uvector follow these stream-ordered memory allocation semantics and rules.

For further information about stream-ordered memory allocation semantics, read Using the NVIDIA CUDA Stream-Ordered Memory Allocator on the NVIDIA Developer Blog.

Available Device Resources

RMM provides several device_memory_resource derived classes to satisfy various user requirements. For more detailed information about these resources, see their respective documentation.

cuda_memory_resource

Allocates and frees device memory using cudaMalloc and cudaFree.

managed_memory_resource

Allocates and frees device memory using cudaMallocManaged and cudaFree.

Note that managed_memory_resource cannot be used with NVIDIA Virtual GPU Software (vGPU, for use with virtual machines or hypervisors) because NVIDIA CUDA Unified Memory is not supported by NVIDIA vGPU.

pool_memory_resource

A coalescing, best-fit pool sub-allocator.

fixed_size_memory_resource

A memory resource that can only allocate a single fixed size. Average allocation and deallocation cost is constant.

binning_memory_resource

Configurable to use multiple upstream memory resources for allocations that fall within different bin sizes. Often configured with multiple bins backed by fixed_size_memory_resources and a single pool_memory_resource for allocations larger than the largest bin size.
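
For example, one possible composition (a sketch; the bin exponents and pool size are illustrative choices) backs bins from 256B up to 4MiB with fixed-size resources and sends anything larger to a pool:

#include <rmm/cuda_device.hpp>
#include <rmm/mr/device/binning_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

rmm::mr::cuda_memory_resource cuda_mr;
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
  &cuda_mr, rmm::percent_of_free_device_memory(50)};

// Creates a fixed-size bin for each power of two from 2^8 (256B) to 2^22 (4MiB);
// allocations larger than the largest bin fall through to pool_mr.
rmm::mr::binning_memory_resource<decltype(pool_mr)> binned_mr{&pool_mr, 8, 22};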

Default Resources and Per-device Resources

RMM users commonly need to configure a device_memory_resource object to use for all allocations where another resource has not explicitly been provided. A common example is configuring a pool_memory_resource to use for all allocations to get fast dynamic allocation.

To enable this use case, RMM provides the concept of a "default" device_memory_resource. This resource is used when another is not explicitly provided.

Accessing and modifying the default resource is done through two functions:

  • device_memory_resource* get_current_device_resource()

    • Returns a pointer to the default resource for the current CUDA device.
    • The initial default memory resource is an instance of cuda_memory_resource.
    • This function is thread safe with respect to concurrent calls to it and set_current_device_resource().
    • For more explicit control, you can use get_per_device_resource(), which takes a device ID.
  • device_memory_resource* set_current_device_resource(device_memory_resource* new_mr)

    • Updates the default memory resource pointer for the current CUDA device to new_mr
    • Returns the previous default resource pointer
    • If new_mr is nullptr, then resets the default resource to cuda_memory_resource
    • This function is thread safe with respect to concurrent calls to it and get_current_device_resource()
    • For more explicit control, you can use set_per_device_resource(), which takes a device ID.

Example

rmm::mr::cuda_memory_resource cuda_mr;
// Construct a resource that uses a coalescing best-fit pool allocator
// With the pool initially half of available device memory
auto initial_size = rmm::percent_of_free_device_memory(50);
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr, initial_size};
rmm::mr::set_current_device_resource(&pool_mr); // Updates the current device resource pointer to `pool_mr`
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(); // Points to `pool_mr`

Multiple Devices

A device_memory_resource should only be used when the active CUDA device is the same device that was active when the device_memory_resource was created. Otherwise behavior is undefined.

If a device_memory_resource is used with a stream associated with a different CUDA device than the device for which the memory resource was created, behavior is undefined.

Creating a device_memory_resource for each device requires care to set the current device before creating each resource, and to maintain the lifetime of the resources as long as they are set as per-device resources. Here is an example loop that creates unique_ptrs to pool_memory_resource objects for each device and sets them as the per-device resource for that device.

using pool_mr = rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>;
std::vector<std::unique_ptr<pool_mr>> per_device_pools;
for (int i = 0; i < N; ++i) {
  cudaSetDevice(i); // set device i before creating MR
  // Use a vector of unique_ptr to maintain the lifetime of the MRs
  // Note: for brevity, omitting creation of upstream and computing initial_size
  per_device_pools.push_back(std::make_unique<pool_mr>(upstream, initial_size));
  // Set the per-device resource for device i
  set_per_device_resource(cuda_device_id{i}, per_device_pools.back().get());
}

Note that the CUDA device that is current when creating a device_memory_resource must also be current any time that device_memory_resource is used to deallocate memory, including in a destructor. The RAII class rmm::device_buffer and classes that use it as a backing store (rmm::device_scalar and rmm::device_uvector) handle this by storing the active device when the constructor is called, and then ensuring that the stored device is active whenever an allocation or deallocation is performed (including in the destructor). The user must therefore only ensure that the device active during creation of an rmm::device_buffer matches the active device of the memory resource being used.

Here is an incorrect example that creates a memory resource on device 0 and then uses it to allocate a device_buffer on device 1:

{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  {
    RMM_CUDA_TRY(cudaSetDevice(1));
    // Invalid, current device is 1, but MR is only valid for device 0
    rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  }
}

A correct example creates the device buffer with device 0 active. After that it is safe to switch devices and let the buffer go out of scope and destruct with a different device active. For example, this code is correct:

{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // No need to switch back to device 0 before ~buf runs
}

Use of rmm::device_vector with multiple devices

rmm::device_vector uses an rmm::mr::thrust_allocator to enable thrust::device_vector to allocate and deallocate memory using RMM. As such, the usual rules for usage of the backing memory resource apply: the active device must match the active device at resource construction time. To facilitate use in an RAII setting, rmm::mr::thrust_allocator records the active device at construction time and ensures that device is active whenever it allocates or deallocates memory. Usage of rmm::device_vector with multiple devices is therefore the same as rmm::device_buffer. One must create device_vectors with the correct device active, but it is safe to destroy them with a different active device.

For example, recapitulating the previous example using rmm::device_vector:

{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // No need to switch back to device 0 before ~vec runs
}

Note

Although allocation and deallocation in the thrust_allocator run with the correct active device, modification of rmm::device_vector might necessitate a kernel launch, and this must run with the correct device active. For example, .resize() might both allocate and launch a kernel to initialize new elements: the user must arrange for this kernel launch to occur with the correct device for the memory resource active.

cuda_stream_view and cuda_stream

rmm::cuda_stream_view is a simple non-owning wrapper around a CUDA cudaStream_t. This wrapper's purpose is to provide strong type safety for stream types. (cudaStream_t is an alias for a pointer, which can lead to ambiguity in APIs when it is assigned 0.) All RMM stream-ordered APIs take a rmm::cuda_stream_view argument.

rmm::cuda_stream is a simple owning wrapper around a CUDA cudaStream_t. This class provides RAII semantics (constructor creates the CUDA stream, destructor destroys it). An rmm::cuda_stream can never represent the CUDA default stream or per-thread default stream; it only ever represents a single non-default stream. rmm::cuda_stream cannot be copied, but can be moved.

cuda_stream_pool

rmm::cuda_stream_pool provides fast access to a pool of CUDA streams. This class can be used to create a set of cuda_stream objects whose lifetime is equal to the cuda_stream_pool. Using the stream pool can be faster than creating the streams on the fly. The size of the pool is configurable. Depending on this size, multiple calls to cuda_stream_pool::get_stream() may return instances of rmm::cuda_stream_view that represent identical CUDA streams.
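
A short sketch of typical usage (the pool size of 16 is an illustrative choice):

#include <rmm/cuda_stream_pool.hpp>

rmm::cuda_stream_pool stream_pool{16};                  // create 16 non-default streams up front

rmm::cuda_stream_view s = stream_pool.get_stream();     // cheap: returns a view of a pooled stream
// ... perform stream-ordered allocations or launch work on `s` ...

rmm::cuda_stream_view s0 = stream_pool.get_stream(0);   // or request the stream at a specific index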

Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically, calls to memory resource allocate() and deallocate() methods are safe with respect to calls to either of these functions from other threads. They are not thread safe with respect to construction and destruction of the memory resource object.

Note that a class thread_safe_resource_adaptor is provided which can be used to adapt a memory resource that is not thread safe to be thread safe (as described above). This adaptor is not needed with any current RMM device memory resources.

Allocators

C++ interfaces commonly allow customizable memory allocation through an Allocator object. RMM provides several Allocator and Allocator-like classes.

polymorphic_allocator

A stream-ordered allocator similar to std::pmr::polymorphic_allocator. Unlike the standard C++ Allocator interface, the allocate and deallocate functions take a cuda_stream_view indicating the stream on which the (de)allocation occurs.
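
A brief sketch of direct use (the element count is arbitrary); note the extra stream argument relative to a standard Allocator:

#include <rmm/cuda_stream.hpp>
#include <rmm/mr/device/polymorphic_allocator.hpp>

rmm::cuda_stream stream;
rmm::mr::polymorphic_allocator<int> alloc;  // defaults to the current device resource

int* p = alloc.allocate(100, stream);       // space for 100 ints, allocated on `stream`
// ... use p in work submitted to `stream` ...
alloc.deallocate(p, 100, stream);           // deallocated on `stream`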

stream_allocator_adaptor

stream_allocator_adaptor can be used to adapt a stream-ordered allocator to present a standard Allocator interface to consumers that may not be designed to work with a stream-ordered interface.

Example:

rmm::cuda_stream stream;
rmm::mr::polymorphic_allocator<int> stream_alloc;

// Constructs an adaptor that forwards all (de)allocations to `stream_alloc` on `stream`.
auto adapted = rmm::mr::stream_allocator_adaptor(stream_alloc, stream);

// Allocates 100 bytes using `stream_alloc` on `stream`
auto p = adapted.allocate(100);
...
// Deallocates using `stream_alloc` on `stream`
adapted.deallocate(p, 100);

thrust_allocator

thrust_allocator is a device memory allocator that uses the strongly typed thrust::device_ptr, making it usable with containers like thrust::device_vector.

See below for more information on using RMM with Thrust.

Device Data Structures

device_buffer

An untyped, uninitialized RAII class for stream ordered device memory allocation.

Example

cuda_stream_view s{...};

// Allocates at least 100 bytes on stream `s` using the *default* resource
rmm::device_buffer b{100, s};
void* p = b.data();                   // Raw, untyped pointer to underlying device memory

kernel<<<..., s.value()>>>(b.data()); // `b` is only safe to use on `s`

rmm::mr::device_memory_resource* mr = new my_custom_resource{...};
// Allocates at least 100 bytes on stream `s` using the resource `mr`
rmm::device_buffer b2{100, s, mr};

device_uvector<T>

A typed, uninitialized RAII class for allocation of a contiguous set of elements in device memory. Similar to a thrust::device_vector, but as an optimization, does not default initialize the contained elements. This optimization restricts the types T to trivially copyable types.

Example

cuda_stream_view s{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the
// default resource
rmm::device_uvector<int32_t> v(100, s);
// Initializes the elements to 0
thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0});

rmm::mr::device_memory_resource* mr = new my_custom_resource{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the resource `mr`
rmm::device_uvector<int32_t> v2{100, s, mr};

device_scalar

A typed, RAII class for allocation of a single element in device memory. This is similar to a device_uvector with a single element, but provides convenience functions like modifying the value in device memory from the host, or retrieving the value from device to host.

Example

cuda_stream_view s{...};
// Allocates uninitialized storage for a single `int32_t` in device memory
rmm::device_scalar<int32_t> a{s};
a.set_value(42, s); // Updates the value in device memory to `42` on stream `s`

kernel<<<..., s.value()>>>(a.data()); // Pass raw pointer to underlying element in device memory

int32_t v = a.value(s); // Retrieves the value from device to host on stream `s`

host_memory_resource

rmm::mr::host_memory_resource is the base class that defines the interface for allocating and freeing host memory.

Similar to device_memory_resource, it has two key functions for (de)allocation:

  1. void* host_memory_resource::allocate(std::size_t bytes, std::size_t alignment)

    • Returns a pointer to an allocation of at least bytes bytes aligned to the specified alignment
  2. void host_memory_resource::deallocate(void* p, std::size_t bytes, std::size_t alignment)

    • Reclaims a previous allocation of size bytes pointed to by p.

Unlike device_memory_resource, the host_memory_resource interface and behavior is identical to std::pmr::memory_resource.

Available Host Resources

new_delete_resource

Uses the global operator new and operator delete to allocate host memory.

pinned_memory_resource

Allocates "pinned" host memory using cuda(Malloc/Free)Host.
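
A minimal sketch using both host resources (the sizes are illustrative); pinned memory is typically what you want as the host side of an asynchronous host <-> device copy:

#include <rmm/mr/host/new_delete_resource.hpp>
#include <rmm/mr/host/pinned_memory_resource.hpp>

rmm::mr::new_delete_resource heap_mr;
rmm::mr::pinned_memory_resource pinned_mr;

void* h1 = heap_mr.allocate(256);    // pageable host memory via operator new
void* h2 = pinned_mr.allocate(256);  // page-locked ("pinned") host memory

// ... e.g. use h2 as the host buffer of a cudaMemcpyAsync ...

heap_mr.deallocate(h1, 256);
pinned_mr.deallocate(h2, 256);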

Host Data Structures

RMM does not currently provide any data structures that interface with host_memory_resource. In the future, RMM will provide a similar host-side structure like device_buffer and an allocator that can be used with STL containers.

Using RMM with Thrust

RAPIDS and other CUDA libraries make heavy use of Thrust. Thrust uses CUDA device memory in two situations:

  1. As the backing store for thrust::device_vector, and
  2. As temporary storage inside some algorithms, such as thrust::sort.

RMM provides rmm::mr::thrust_allocator as a conforming Thrust allocator that uses device_memory_resources.

Thrust Algorithms

To instruct a Thrust algorithm to use rmm::mr::thrust_allocator to allocate temporary storage, you can use the custom Thrust CUDA device execution policy: rmm::exec_policy(stream).

thrust::sort(rmm::exec_policy(stream), ...);

The stream argument passed to rmm::exec_policy is used both by rmm::mr::thrust_allocator for the algorithm's temporary allocations and as the CUDA stream on which the algorithm executes.
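
For example, a sketch of sorting a device_uvector with its temporary storage allocated through RMM (the element count is arbitrary):

#include <rmm/cuda_stream.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>

#include <thrust/sort.h>

rmm::cuda_stream stream;
rmm::device_uvector<int> values(1000, stream);
// ... fill `values` on `stream` ...

// Temporary storage for thrust::sort is allocated via rmm::mr::thrust_allocator on
// `stream`, and the algorithm executes on `stream`.
thrust::sort(rmm::exec_policy(stream), values.begin(), values.end());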

Logging

RMM includes two forms of logging: memory event logging and debug logging.

Memory Event Logging and logging_resource_adaptor

Memory event logging writes details of every allocation or deallocation to a CSV (comma-separated value) file. In C++, Memory Event Logging is enabled by using the logging_resource_adaptor as a wrapper around any other device_memory_resource object.

Each row in the log represents either an allocation or a deallocation. The columns of the file are "Thread, Time, Action, Pointer, Size, Stream".

The CSV output files of the logging_resource_adaptor can be used as input to REPLAY_BENCHMARK, which is available when building RMM from source, in the gbenchmarks folder in the build directory. This log replayer can be useful for profiling and debugging allocator issues.

The following C++ example creates a logging version of a cuda_memory_resource that outputs the log to the file "logs/test1.csv".

std::string filename{"logs/test1.csv"};
rmm::mr::cuda_memory_resource upstream;
rmm::mr::logging_resource_adaptor<rmm::mr::cuda_memory_resource> log_mr{&upstream, filename};

If a file name is not specified, the environment variable RMM_LOG_FILE is queried for the file name. If RMM_LOG_FILE is not set, then an exception is thrown by the logging_resource_adaptor constructor.

In Python, memory event logging is enabled when the logging parameter of rmm.reinitialize() is set to True. The log file name can be set using the log_file_name parameter. See help(rmm.reinitialize) for full details.
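
For example, a minimal sketch (the log file name is an illustrative choice):

>>> import rmm
>>> rmm.reinitialize(logging=True, log_file_name="rmm_log.csv")
>>> buf = rmm.DeviceBuffer(size=100)  # this allocation and its deallocation are logged
>>> del buf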

Debug Logging

RMM leverages rapids-logger to log trace and debug information to a file. This information can show when errors occur, when additional memory is allocated from upstream resources, etc. By default output is logged to stderr, but the environment variable RMM_DEBUG_LOG_FILE can be set to specify a path and file name to dump the logs to instead.

There is a CMake configuration variable RMM_LOGGING_LEVEL, which can be set to enable compilation of more detailed logging. The default is INFO. Available levels are TRACE, DEBUG, INFO, WARN, ERROR, CRITICAL and OFF.

Note that to see logging below the INFO level, the application must also set the logging level at run time. C++ applications must call rmm::default_logger().set_level(); for example, to enable all levels of logging down to TRACE, call rmm::default_logger().set_level(spdlog::level::trace) (and compile librmm with -DRMM_LOGGING_LEVEL=TRACE). Python applications must call rmm.set_logging_level(); for example, to enable all levels of logging down to TRACE, call rmm.set_logging_level("trace") (and compile the RMM Python module with -DRMM_LOGGING_LEVEL=TRACE).

Note that debug logging is different from the CSV memory allocation logging provided by rmm::mr::logging_resource_adaptor. The latter is for logging a history of allocation / deallocation actions which can be useful for replay with RMM's replay benchmark.

RMM and CUDA Memory Bounds Checking

Memory allocations taken from a memory resource that allocates a pool of memory (such as pool_memory_resource and arena_memory_resource) are part of the same low-level CUDA memory allocation. Therefore, out-of-bounds or misaligned accesses to these allocations are not likely to be detected by CUDA tools such as CUDA Compute Sanitizer memcheck.

Exceptions to this are cuda_memory_resource, which wraps cudaMalloc, and cuda_async_memory_resource, which uses cudaMallocAsync with CUDA's built-in memory pool functionality (CUDA 11.2 or later required). Illegal memory accesses to memory allocated by these resources are detectable with Compute Sanitizer Memcheck.

It may be possible in the future to add support for memory bounds checking with other memory resources using NVTX APIs.

Using RMM in Python

There are two ways to use RMM in Python code:

  1. Using the rmm.DeviceBuffer API to explicitly create and manage device memory allocations
  2. Transparently via external libraries such as CuPy and Numba

RMM provides a MemoryResource abstraction to control how device memory is allocated in both the above uses.

DeviceBuffer

A DeviceBuffer represents an untyped, uninitialized device memory allocation. DeviceBuffers can be created by providing the size of the allocation in bytes:

>>> import rmm
>>> buf = rmm.DeviceBuffer(size=100)

The size of the allocation and the memory address associated with it can be accessed via the .size and .ptr attributes respectively:

>>> buf.size
100
>>> buf.ptr
140202544726016

DeviceBuffers can also be created by copying data from host memory:

>>> import rmm
>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='float64')
>>> buf = rmm.DeviceBuffer.to_device(a.tobytes())
>>> buf.size
24

Conversely, the data underlying a DeviceBuffer can be copied to the host:

>>> np.frombuffer(buf.tobytes())
array([1., 2., 3.])

MemoryResource objects

MemoryResource objects are used to configure how device memory allocations are made by RMM.

By default, if a MemoryResource is not set explicitly, RMM uses the CudaMemoryResource, which uses cudaMalloc for allocating device memory.

rmm.reinitialize() provides an easy way to initialize RMM with specific memory resource options across multiple devices. See help(rmm.reinitialize) for full details.
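
For example, a sketch that configures a pool resource on devices 0 and 1 (the pool size and device list are illustrative):

>>> import rmm
>>> rmm.reinitialize(
...     pool_allocator=True,      # use a pool resource rather than plain cudaMalloc
...     initial_pool_size=2**30,  # 1 GiB
...     devices=[0, 1],           # set the per-device resource on each listed device
... )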

For lower-level control, the rmm.mr.set_current_device_resource() function can be used to set a different MemoryResource for the current CUDA device. For example, enabling the ManagedMemoryResource tells RMM to use cudaMallocManaged instead of cudaMalloc for allocating memory:

>>> import rmm
>>> rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

⚠️ The default resource must be set for any device before allocating any device memory on that device. Setting or changing the resource after device allocations have been made can lead to unexpected behaviour or crashes. See Multiple Devices.

As another example, PoolMemoryResource allows you to allocate a large "pool" of device memory up-front. Subsequent allocations will draw from this pool of already allocated memory. The example below shows how to construct a PoolMemoryResource with an initial size of 1 GiB and a maximum size of 4 GiB. The pool uses CudaMemoryResource as its underlying ("upstream") memory resource:

>>> import rmm
>>> pool = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
...     initial_pool_size="1GiB", # equivalent to initial_pool_size=2**30
...     maximum_pool_size="4GiB"
... )
>>> rmm.mr.set_current_device_resource(pool)

Other MemoryResources include:

  • FixedSizeMemoryResource for allocating fixed blocks of memory
  • BinningMemoryResource for allocating blocks within specified "bin" sizes from different memory resources

MemoryResources are highly configurable and can be composed together in different ways. See help(rmm.mr) for more information.
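
For example, one possible composition (a sketch; the bin exponents, pool size, and keyword arguments shown are illustrative of the idea rather than a recommended configuration) uses fixed-size bins for small allocations, backed by a pool for everything larger:

>>> import rmm
>>> upstream = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
...     initial_pool_size=2**30,
... )
>>> # bins of 256B, 512B, ..., 4MiB (2**8 through 2**22); larger requests go to `upstream`
>>> binned = rmm.mr.BinningMemoryResource(upstream, min_size_exponent=8, max_size_exponent=22)
>>> rmm.mr.set_current_device_resource(binned)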

Using RMM with third-party libraries

Using RMM with CuPy

You can configure CuPy to use RMM for memory allocations by setting the CuPy CUDA allocator to rmm_cupy_allocator:

>>> from rmm.allocators.cupy import rmm_cupy_allocator
>>> import cupy
>>> cupy.cuda.set_allocator(rmm_cupy_allocator)

Note: This only configures CuPy to use the current RMM resource for allocations. It does not initialize nor change the current resource, e.g., enabling a memory pool. See here for more information on changing the current memory resource.

Using RMM with Numba

You can configure Numba to use RMM for memory allocations using the Numba EMM Plugin.

This can be done in two ways:

  1. Setting the environment variable NUMBA_CUDA_MEMORY_MANAGER:
$ NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python (args)
  2. Using the set_memory_manager() function provided by Numba:
>>> from numba import cuda
>>> from rmm.allocators.numba import RMMNumbaManager
>>> cuda.set_memory_manager(RMMNumbaManager)

Note: This only configures Numba to use the current RMM resource for allocations. It does not initialize nor change the current resource, e.g., enabling a memory pool. See here for more information on changing the current memory resource.

Using RMM with PyTorch

PyTorch can use RMM for memory allocation. For example, to configure PyTorch to use an RMM-managed pool:

import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch

rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

PyTorch and RMM will now share the same memory pool.

You can, of course, use a custom memory resource with PyTorch as well:

import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch

# note that you can configure PyTorch to use RMM either before or
# after changing RMM's memory resource.  PyTorch will use whatever
# memory resource is configured to be the "current" memory resource at
# the time of allocation.
torch.cuda.change_current_allocator(rmm_torch_allocator)

# configure RMM to use a managed memory resource, wrapped with a
# statistics resource adaptor that can report information about the
# amount of memory allocated:
mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(mr)

x = torch.tensor([1, 2]).cuda()

# the memory resource reports information about PyTorch allocations:
mr.allocation_counts
Out[6]:
{'current_bytes': 16,
 'current_count': 1,
 'peak_bytes': 16,
 'peak_count': 1,
 'total_bytes': 16,
 'total_count': 1}

Taking ownership of C++ objects from Python

When interacting with a C++ library that uses RMM from Python, one must be careful when taking ownership of rmm::device_buffer objects on the Python side. The rmm::device_buffer does not contain an owning reference to the memory resource used for its allocation (only a device_async_resource_ref), and the allocating user is expected to keep this memory resource alive for at least the lifetime of the buffer. When taking ownership of such a buffer in Python, we have no way (in the general case) of ensuring that the memory resource will outlive the buffer we are now holding.

To avoid any issues, we need two things:

  1. The C++ library we are interfacing with should accept a memory resource that is used for allocations that are returned to the user.
  2. When calling into the library from Python, we should provide a memory resource whose lifetime we control. This memory resource should then be provided when we take ownership of any allocated rmm::device_buffers.

For example, suppose we have a C++ function that allocates device_buffers, which has a utility overload that defaults the memory resource to the current device resource:

std::unique_ptr<rmm::device_buffer> allocate(
  std::size_t size,
  rmm::mr::device_async_resource_ref mr = get_current_device_resource())
{
  return std::make_unique<rmm::device_buffer>(size, rmm::cuda_stream_default, mr);
}

The Python DeviceBuffer class has a convenience Cython function, c_from_unique_ptr, to construct a DeviceBuffer from a unique_ptr<rmm::device_buffer>, taking ownership of it. To do this safely, we must ensure that the allocation that was done on the C++ side uses a memory resource we control. So:

# Bad, doesn't control lifetime
buffer_bad = DeviceBuffer.c_from_unique_ptr(allocate(10))

# Good, allocation happens with a memory resource we control
# mr is a DeviceMemoryResource
buffer_good = DeviceBuffer.c_from_unique_ptr(
    allocate(10, mr.get_mr()),
    mr=mr,
)

Note two differences between the bad and good cases:

  1. In the good case we pass the memory resource to the allocation function.
  2. In the good case, we pass the same memory resource to the DeviceBuffer constructor so that its lifetime is tied to the lifetime of the buffer.

Potential pitfalls of relying on get_current_device_resource

Functions in both the C++ and Python APIs that perform allocation typically default the memory resource argument to the value of get_current_device_resource. This is to simplify the interface for callers. When using a C++ library from Python, this defaulting is safe, as long as it is only the Python process that ever calls set_current_device_resource.

This is because the current device resource on the C++ side has a lifetime which is expected to be managed by the user. The resources set by rmm::mr::set_current_device_resource are stored in a static std::map whose keys are device ids and values are raw pointers to the memory resources. Consequently, rmm::mr::get_current_device_resource returns an object with no lifetime provenance. This is, for the reasons discussed above, not usable from Python. To handle this on the Python side, the Python-level set_current_device_resource sets the C++ resource and stores the Python object in a static global dictionary. The Python get_current_device_resource then does not use rmm::mr::get_current_device_resource and instead looks up the current device resource in this global dictionary.

Hence, if the C++ library we are interfacing with calls rmm::mr::set_current_device_resource, the C++ and Python sides of the program can disagree on what get_current_device_resource returns. The only safe thing to do if using the simplified interfaces is therefore to ensure that set_current_device_resource is only ever called on the Python side.

