CUDA.jl 3.4


Tim Besard

The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl 3.0 is now the default fallback, and support for memory pools other than the CUDA stream-ordered one has been removed.

Streamlined atomic operations

In preparation for integrating with the new standard @atomic macro introduced in Julia 1.7, we have streamlined the capabilities of atomic operations in CUDA.jl. The API is now split into two levels: low-level atomic_ methods for atomic functionality that is directly supported by the hardware, and a high-level @atomic macro that tries to perform operations natively or falls back to a loop with compare-and-swap. This fallback implementation makes it possible to use more complex operations that do not map onto a single hardware atomic instruction:

julia> a = CuArray([1]);

julia> function kernel(a)
           CUDA.@atomic a[] <<= 1
           return
       end

julia> @cuda threads=16 kernel(a)

julia> a
1-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 65536

julia> 1<<16
65536

The only requirement is that the types being used are supported by CUDA.atomic_cas!. This includes common types like 32 and 64-bit integers and floating-point numbers, as well as 16-bit numbers on devices with compute capability 7.0 or higher.
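To make the two levels more concrete, here is a minimal sketch: a directly-supported operation expressed with the low-level interface, and a hand-written compare-and-swap loop of the kind the @atomic fallback generates. The kernel names are ours, and the loop is illustrative only, not the actual fallback implementation in CUDA.jl:

using CUDA

# Low level: atomic_add! maps directly onto a hardware atomic instruction.
function lowlevel_kernel(a)
    CUDA.atomic_add!(pointer(a, 1), Int32(1))
    return
end

# Roughly what the high-level macro falls back to for operations without a
# native instruction: retry a compare-and-swap until no other thread interfered.
function cas_kernel(a)
    ptr = pointer(a, 1)
    old = a[1]
    while true
        newval = old << 1
        seen = CUDA.atomic_cas!(ptr, old, newval)
        seen == old && break   # our update went through
        old = seen             # another thread won; retry with its value
    end
    return
end

a = CuArray(Int32[1])
@cuda threads=32 lowlevel_kernel(a)
@cuda cas_kernel(a)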

Note that on Julia 1.7 and higher, CUDA.jl no longer exports the @atomic macro, to avoid conflicts with the version in Base. It is therefore recommended to always fully qualify uses of the macro, i.e., use CUDA.@atomic as in the example above.

Arrays with unified memory

You may have noticed that the CuArray type in the example above included an additional parameter, Mem.DeviceBuffer. This has been introduced to support arrays backed by different kinds of buffers. By default, we will use an ordinary device buffer, but it's now possible to allocate arrays backed by unified buffers that can be used on multiple devices:

julia> a = cu([0]; unified=true)
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 0

julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1

julia> device!(1)

julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 2

Although all operations should work equally well with arrays backed by unified memory, they have not yet been optimized for it. For example, copying memory to the device could be avoided entirely, as the driver can automatically page in unified memory on demand.
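One nice consequence of the new buffer type parameter is that generic code can specialize on the kind of memory backing an array. A minimal sketch (the is_unified helper is ours, not part of the CUDA.jl API):

using CUDA

# Dispatch on the third CuArray type parameter, which encodes the backing buffer.
is_unified(::CuArray{<:Any,<:Any,CUDA.Mem.UnifiedBuffer}) = true
is_unified(::CuArray) = false

a = cu([0]; unified=true)
b = CuArray([0])
is_unified(a), is_unified(b)   # (true, false)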

New default random number generator

CUDA.jl 3.0 introduced a new native random number generator, and starting with CUDA.jl 3.2 its performance and quality were improved to the point where it could be used by applications. A couple of features were still missing, though, such as generating normally-distributed random numbers and support for complex numbers. These features have been added in CUDA.jl 3.3, and the generator is now used as the default fallback when CURAND does not support the requested element type.
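For example, something along these lines should now work out of the box with the native generator (a sketch; exact element-type support may depend on your GPU and CUDA.jl version):

using CUDA, Random

rng = CUDA.RNG()

# Normally-distributed numbers, generated on-device by the native RNG.
a = CUDA.zeros(Float32, 1024)
randn!(rng, a)

# Complex element types are not covered by CURAND, so they fall back to this RNG.
b = CUDA.zeros(ComplexF32, 1024)
rand!(rng, b)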

Both the performance and the quality of this generator are much better than those of the previous, GPUArrays.jl-based one:

julia> using BenchmarkTools

julia> cuda_rng = CUDA.RNG();

julia> gpuarrays_rng = GPUArrays.default_rng(CuArray);

julia> a = CUDA.zeros(1024,1024);

julia> @benchmark CUDA.@sync rand!($cuda_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.040 μs …  2.430 ms  ┊ GC (min … max): 0.00% … 99.04%
 Time  (median):     18.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.604 μs ± 34.734 μs  ┊ GC (mean ± σ):  1.17% ±  0.99%

         ▃▆█▇▇▅▄▂▁
  ▂▂▂▃▄▆███████████▇▆▆▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▄
  17 μs           Histogram: frequency by time        24.1 μs <

julia> @benchmark CUDA.@sync rand!($gpuarrays_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  72.489 μs …  2.790 ms  ┊ GC (min … max): 0.00% … 98.44%
 Time  (median):     74.479 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.211 μs ± 61.598 μs  ┊ GC (mean ± σ):  0.67% ±  1.40%

  █                                                           ▁
  █▆▃▁▃▃▅▆▅▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▄▆▁▁▁▁▁▁▁▁▄▄▃▄▃▁▁▁▁▁▁▁▁▁▃▃▄▆▄▁▄▃▆ █
  72.5 μs      Histogram: log(frequency) by time       443 μs <

julia> using RNGTest

julia> test_cuda_rng = RNGTest.wrap(cuda_rng, UInt32);

julia> test_gpuarrays_rng = RNGTest.wrap(gpuarrays_rng, UInt32);

julia> RNGTest.smallcrushTestU01(test_cuda_rng)
 All tests were passed

julia> RNGTest.smallcrushTestU01(test_gpuarrays_rng)
 The following tests gave p-values outside [0.001, 0.9990]:
       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps
  2  Collision                        eps
  3  Gap                              eps
  4  SimpPoker                       1.0e-4
  5  CouponCollector                  eps
  6  MaxOft                           eps
  7  WeightDistrib                    eps
 10  RandomWalk1 M                   6.0e-4
 ----------------------------------------------
 (eps  means a value < 1.0e-300)

Removal of old memory pools

With the new stream-ordered allocator, which caches memory allocations at the CUDA library level, much of the need for Julia-side memory pools has disappeared. To simplify the allocation code, we have removed support for those Julia-managed memory pools (i.e., binned, split and simple). You can now only use the cuda memory pool, or use no pool at all by setting the JULIA_CUDA_MEMORY_POOL environment variable to none.
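For example, to opt out of pooling entirely you could do something like the following, assuming the environment variable is set before CUDA.jl initializes the device (setting it in the shell before starting Julia works as well):

# Disable the stream-ordered memory pool; allocations then go straight to the
# CUDA driver without any pooling.
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA
CUDA.rand(2, 2)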

Not using a memory pool degrades performance, so if you are stuck on an NVIDIA driver that does not support CUDA 11.2, it is advised to remain on CUDA.jl 3.3 until you can upgrade.

Also note that the new stream-ordered allocator has turned out to be incompatible with the legacy cuIpc APIs as used by OpenMPI. If that applies to you, consider disabling the memory pool, or revert to CUDA.jl 3.3 if your application's allocation pattern benefits from a memory pool.

Because of this, we will be maintaining CUDA.jl 3.3 longer than usual. All bug fixes in CUDA.jl 3.4 have already been backported to the previous release, which is currently at version 3.3.6.

Device capability-dependent kernel code

Some of the improvements in this release depend on the ability to write generic code that only uses certain hardware features when they are available. To facilitate writing such code, the compiler now embeds metadata in the generated code that kernel code can branch on.

Currently, the device compute capability and PTX ISA version are embedded and made available using the compute_capability and ptx_isa_version functions, respectively. A simplified version number type, constructible using the sv"..." string macro, can be used to test against these properties. For example:

julia> function kernel(a)
           a[] = compute_capability() >= sv"6.0" ? 1 : 2
           return
       end
kernel (generic function with 1 method)

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_1({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
top:
  %1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
  %2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
  store float 1.000000e+00, float addrspace(1)* %2, align 4
  ret void
}

julia> capability(device!(1))
v"3.5.0"

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_2({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
top:
  %1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
  %2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
  store float 2.000000e+00, float addrspace(1)* %2, align 4
  ret void
}

The branch on the compute capability is completely optimized away. At the same time, this does not require re-inferring the function as the optimization happens at the LLVM level.
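The same pattern works for the PTX ISA version. A small sketch, using an arbitrary version threshold for illustration:

using CUDA

# Branch on the PTX ISA the kernel is compiled for; like compute_capability(),
# ptx_isa_version() folds to a constant during LLVM optimization.
function isa_kernel(a)
    a[1] = ptx_isa_version() >= sv"7.0" ? 1f0 : 2f0
    return
end

a = CUDA.zeros(Float32, 1)
@cuda isa_kernel(a)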

Other changes

