Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features

Dec 16, 2020

AI-Generated Summary

Dislike

CUDA 11.2 introduces a stream-ordered CUDA memory allocator, enabling applications to order memory allocation and deallocation with other work launched into a CUDA stream, improving application performance by reusing memory allocations.
The CUDA 11.2 release allows cooperative kernels launched into separate streams to execute concurrently on a GPU if they fit within the GPU resources, enhancing the functionality of cooperative groups.
CUDA 11.2 includes various compiler upgrades, including an LLVM upgrade to 7.0 and link-time optimization for device kernel code, which can improve compiler code generation for NVIDIA GPUs and application performance.

AI-generated content may summarize information incompletely. Verify important information.Learn more

CUDA is the software development platform for building GPU-accelerated applications, providing all the components needed to develop applications targeting every NVIDIA GPU platformfor general purpose compute acceleration. The latest CUDA release,CUDA 11.2, is focused on improving the user experience and application performance for CUDA developers.

CUDA 11.2 has several important features including programming model updates, new compiler features, and enhanced compatibility across CUDA releases. This post offers an overview of the key CUDA 11.2 software features and highlights:

Stream-ordered CUDA memory suballocator:cudaMallocAsync andcudaFreeAsync
Updates to CUDA graphs and cooperative groups
Compiler upgrade to LLVM 7 and CUDA kernel link-time optimization
Enhanced CUDA compatibility support

CUDA 11.2 is available to download now.

CUDA programming model enhancements

With every CUDA release, we continue to enhance the CUDA programming model to enable you to get the most out of NVIDIA GPUs, while maintaining the programming flexibility of the higher-level APIs. In this release, we added an exciting new feature for stream-ordered memory allocation and extended some of the APIs for improving the functionality of cooperative groups and CUDA graphs.

Stream-ordered memory allocator

One of the highlights of CUDA 11.2 is the new stream-ordered CUDA memory allocator. This feature enables applications to order memory allocation and deallocation with other work launched into a CUDA stream such as kernel launches and asynchronous copies. This improves application performance by taking advantage of stream-ordering semantics to reuse memory allocations, using and managing memory pools to avoid expensive calls into the OS. The new asynchronous memory allocation and free API actions allow you to manage memory use as part of your application’s CUDA workflow. For many applications, this reduces the need for custom memory management abstractions, and makes it easier to create high-performance custom memory management for applications that need it. Moreover, this feature makes it easier to share memory pools across entities within an application.

cudaMallocAsync(&ptr, size, stream); // Allocates physical memorykernel<<<...,stream>>>(ptr);cudaFreeAsync(ptr, stream);          // releases memory back into a poolcudaMallocAsync(&ptr, size, stream); // Reuses previously freed pointerkernel<<<...,stream>>>(ptr);cudaFreeAsync(ptr, stream);          // releases memory back into a pool....                                 // Executes other work in the stream

As shown in this example, CUDA 11.2 introduces new stream-ordered versions ofcudaMalloc andcudaFree—calledcudaMallocAsync andcudaFreeAsync—which take a stream as an additional argument. The first call tocudaMallocAsync in the example allocates memory from the OS, but the subsequent call tocudaFreeAsync does not free it back to the OS. Instead, the memory is stored in a pool maintained by the CUDA driver, which allows the second call tocudaMallocAsync to reuse the memory previously freed, if it is of sufficient size.

Figure illustrating reuse of memory within a stream using the new cudaMallocAsync and cudaFreeAsync API actions. — *Figure 1. Diagram shows memory allocation and deallocation in order with other tasks within a stream, as described in the earlier code example.*

For more information, seecudaMallocAsync in the C++ API Routines topic in the CUDA Toolkit documentation. For more information about how stream-ordered allocation works, the performance benefits, and how to port your application to use the new APIs, seeUsing the NVIDIA CUDA Stream-Ordered Memory Allocator posts.

Cooperative groups

Cooperative groups, introduced in CUDA 9, provides device code API actions to define groups of communicating threads and to express the granularity at which threads synchronize for more efficient parallel decompositions. For more information, see Cooperative Groups: Flexible CUDA Thread Programming.

When you are using cooperative groups to launch kernels into separate streams withcuLaunchCooperativeKernel, these kernels can now execute concurrently on a GPU. Prior to CUDA 11.2, cooperative kernels were always serialized as if launched into the same stream. Kernels A and B launched into separate streams would execute sequentially on the GPU, with B waiting for A to finish before it could start. With CUDA 11.2, cooperative kernels now run concurrently if they can fit together within the GPU resources.

You can take advantage of this functionality with the existingcuLaunchCooperativeKernel API action. If you were already using multiple streams in your application, you may not even need to modify your application code to benefit from this feature.

CUDA graphs

CUDA graphs were introduced in CUDA 10.0 and have seen a steady progression of new features with every CUDA release. For more information about the performance enhancement, seeGetting Started with CUDA Graphs.

CUDA 11.2 introduces a new mechanism for synchronization between graph workloads and non-graph workloads. CUDA graphs now support two pairs of graph node types for external synchronization: signal and wait for CUDA events (available since CUDA 11.1), and external semaphore signal and wait (new in CUDA 11.2). These enhance existing graph functionality allowing internal graph operations to depend upon external work. Allowing graphs to inter-operate with the existing external semaphore infrastructure in CUDA enables new types of synchronization between graph workloads and non-CUDA workloads.

cudaGraphCreate(&graph, 0);                                                        // Create the graphcudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);                           // create the nodescudaGraphAddKernelNode(&b, graph, NULL, 0, &nodeParams);..cudaGraphAddExternalSemaphoresSignalNode( &ext_sem, graph, NULL, 0, &nodeParams);  // New node for external semaphore signal..                                                                                   // Now set up dependencies on each nodecudaGraphAddDependencies(graph, &a, &b, 1);                                        // A->B..

CUDA 11.2 now also allows graph update to change the kernel function launched by a kernel node, using either explicit node update withcudaGraphExecKernelNodeSetParams or whole graph update withcudaGraphExecUpdate. This is an enhancement when compared to prior releases, where the kernel function could not be modified and had to match the original value.

CUDA compiler

In CUDA 11.2, the compiler tool chain gets multiple feature and performance upgrades that are aimed at accelerating the GPU performance of applications and enhancing your overall productivity.

The compiler toolchain has an LLVM upgrade to 7.0, which enables new features and can help improve compiler code generation for NVIDIA GPUs. The CUDA C++ compiler, libNVVM, and NVRTC shared library have all been upgraded to the LLVM 7.0 code base. The libNVVM library provides GPU extensions to LLVM in support of the wider community of developers of compilers, DSL translators, and parallel applications targeting computational workloads on NVIDIA GPUs. The NVRTC shared library helps compile dynamically generated CUDA source code at runtime.

Link-time optimization for device kernel code (Device LTO), introduced as a preview feature in the CUDA 11.0 toolkit release, is now available as a full-featured optimization capability in CUDA 11.2. Device LTO enables you to enjoy the productivity benefits of separate compilation of device code without incurring an undue runtime performance overhead relative to whole-program device compilation.

The 11.2 CUDA C++ compiler can optionally generate a diagnostic report on inline functions which can provide insights into the compiler’s function inlining decisions. These diagnostic reports can aid in advanced application performance analysis and tuning efforts.

The CUDA C++ compiler aggressively inlines device functions into call sites by default. This can make assembly-level debugging of optimized device code a difficult task. For source code compiled using the 11.2 CUDA C++ compiler toolchain, the cuda-gdb and NVIDIA Nsight Compute debugger can display names of inlined device functions in call-stack backtraces, thereby improving the debugging experience. You can single step through inline functions just like any other device function.

Nsight Developer Tools

NVIDIA Developer Tools are a collection of applications, spanning desktop and mobile targets, which enable you to build, debug, profile, and develop CUDA applications that use the latest visual computing hardware. TheNVIDIA Nsight tools have introduced some new functionality as well in CUDA 11.2.

Nsight Systems is a system-wide performance analysis tool, designed to help developers tune and scale software across CPUs and GPUs. The new 2020.5 update enhances Vulkan ray tracing, and profile tracing for NVIDIA Collectives Communication Library (NCCL) and CUDA memory allocation. It also delivers performance and UX improvements.

NVIDIA Nsight Systems 2020.5 is now available for download.

The 2020.3 release of NVIDIANsight Compute included in the 11.2 CUDA Toolkit introduces several new features that simplify the process of CUDA kernel profiling and optimization. The update for Nsight Compute introduces a new Profile Series feature enabling you to configure ranges for multiple kernel parameters, and a source file import functionality.

NVIDIA Nsight Compute 2020.3 is now available for download.

CUDA enhanced compatibility

Here’s a review of the enhanced CUDA compatibility support that was enabled in CUDA 11.1 and what it means for CUDA developers. By leveraging semantic versioning across components in the CUDA Toolkit, these components remain binary-compatible across all minor versions of a toolkit release. This means that CUDA has relaxed the minimum driver version check for the CUDA Toolkit and no longer requires a driver upgrade with minor releases. This is especially important for users who don’t have root privileges on their system.

For enterprise users, upgrading to the newer version of the CUDA driver was particularly cumbersome as it required quite a bit of planning and execution to ensure that all components in the production stack dependent on the driver were accounted for and validated. With enhanced compatibility, you can upgrade to a newer version of the CUDA Toolkit while still using an older version of the CUDA driver.

Enhanced CUDA compatibility also gives you the general flexibility to move to newer toolkits and features, only excepting the ones that have new APIs or which depend on the kernel mode driver. You get the compatibility of the CUDA Toolkit with the CUDA driver across all minor versions. An application can be built for one CUDA minor release (for example, 11.1) and work across all future minor releases within the major family (for example, 11.x), as shown in Figure 2.

CUDA 11.0 now runs CUDA 11.1 applications and Future CUDA 11.x versions will also run on an 11.0 system. — *Figure 2. Diagram showing both backward compatibility and enhanced compatibility for CUDA 11.x toolkits with the corresponding CUDA drivers.*

For more information about the enhanced compatibility feature and the overall CUDA compatibility model in the toolkit documentation, see theCUDA Compatibility guide.

Summary

To learn more about the CUDA 11 generation toolkit capabilities and introductions, seeCUDA 11 Features Revealed and followfuture CUDA posts. For more

Discuss (0)

About the Authors

About Ram Cherukuri
Ram Cherukuri is a senior product manager for PhysicsNeMo, the Physics-ML platform for AI in science and engineering . He is also the product manager for DLA software, working with embedded AI developers and was part of the CUDA product management team. Prior to NVIDIA, Ram was a product manager at MathWorks for code generation and verification products for embedded software development, working with automotive and aero-def customers. He holds a master’s degree in aerospace engineering from Purdue University and a bachelor’s degree in the same discipline from IIT Bombay.

View all posts by Ram Cherukuri