User Guide for NVPTX Back-end

Introduction

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR along with a defined set of conventions used to represent GPU programming concepts. This document provides an overview of the general usage of the back-end, including a description of the conventions used and the set of accepted LLVM IR.

Note

This document assumes a basic familiarity with CUDA and the PTX assembly language. Information about the CUDA Driver API and the PTX assembly language can be found in the CUDA documentation.

Conventions

Marking Functions as Kernels

In PTX, there are two types of functions: device functions, which are only callable by device code, and kernel functions, which are callable by host code. By default, the back-end will emit device functions. The ptx_kernel calling convention is used to declare a function as a kernel function.

The following example shows a kernel function calling a device function in LLVM IR. The function @my_kernel is callable from host code, but @my_fmad is not.

  define float @my_fmad(float %x, float %y, float %z) {
    %mul = fmul float %x, %y
    %add = fadd float %mul, %z
    ret float %add
  }

  define ptx_kernel void @my_kernel(ptr %ptr) {
    %val = load float, ptr %ptr
    %ret = call float @my_fmad(float %val, float %val, float %val)
    store float %ret, ptr %ptr
    ret void
  }

When compiled, the PTX kernel functions are callable by host-side code.

Parameter Attributes

"nvvm.grid_constant"

This attribute may be attached to a byval parameter of a kernel function to indicate that the parameter should be lowered as a direct reference to the grid-constant memory of the parameter, as opposed to a copy of the parameter in local memory. Writing to a grid-constant parameter is undefined behavior. Unlike a normal byval parameter, the address of a grid-constant parameter is not unique to a given function invocation but instead is shared by all threads in the grid.
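
As an illustration, the attribute is written directly on the byval kernel parameter. A minimal sketch (the struct type, alignment, and names are illustrative):

  %struct.Params = type { i32, float }

  define ptx_kernel void @with_grid_const(ptr byval(%struct.Params) align 8 "nvvm.grid_constant" %params, ptr %out) {
    ; Reads resolve directly to the grid-constant parameter memory;
    ; writing through %params would be undefined behavior.
    %v = load i32, ptr %params
    store i32 %v, ptr %out
    ret void
  }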

Function Attributes

"nvvm.maxclusterrank"="<n>"

This attribute specifies the maximum number of blocks per cluster. Must be non-zero. Only supported for Hopper+.

"nvvm.minctasm"="<n>"

This is a hint/directive to the compiler/driver, asking it to put at least this many CTAs on an SM.

"nvvm.maxnreg"="<n>"

This attribute indicates the maximum number of registers to be used for the kernel function.

"nvvm.maxntid"="<x>[,<y>[,<z>]]"

This attribute declares the maximum number of threads in the thread block (CTA). The maximum number of threads is the product of the maximum extent in each dimension. Exceeding the maximum number of threads results in a runtime error or kernel launch failure.

"nvvm.reqntid"="<x>[,<y>[,<z>]]"

This attribute declares the exact number of threads in the thread block (CTA). The number of threads is the product of the value in each dimension. Specifying a different CTA dimension at launch will result in a runtime error or kernel launch failure.

"nvvm.cluster_dim"="<x>[,<y>[,<z>]]"

This attribute declares the number of thread blocks (CTAs) in the cluster. The total number of CTAs is the product of the number of CTAs in each dimension. Specifying a different cluster dimension at launch will result in a runtime error or kernel launch failure. Only supported for Hopper+.

"nvvm.blocksareclusters"

This attribute indicates that the grid launch configuration for the corresponding kernel function specifies the number of clusters instead of the number of thread blocks. This attribute is only allowed on kernel functions and requires the nvvm.reqntid and nvvm.cluster_dim attributes.
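
As an illustration, these launch-bound attributes are attached to a kernel as string function attributes. A minimal sketch (the attribute values are illustrative):

  define ptx_kernel void @bounded_kernel(ptr %in) "nvvm.maxntid"="256" "nvvm.minctasm"="2" {
    ret void
  }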

Address Spaces

The NVPTX back-end uses the following address space mapping:

Address Space   Memory Space
0               Generic
1               Global
2               Internal Use
3               Shared
4               Constant
5               Local
7               Shared Cluster

Every global variable and pointer type is assigned to one of these address spaces, with 0 being the default address space. Intrinsics are provided which can be used to convert pointers between the generic and non-generic address spaces.

As an example, the following IR will define an array @g that resides in global device memory.

  @g = internal addrspace(1) global [4 x i32] [i32 0, i32 1, i32 2, i32 3]

LLVM IR functions can read and write to this array, and host-side code can copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have global variables in address space 0. Address space 0 is the default address space in LLVM, so the addrspace(N) annotation is required for global variables.
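
Pointers can also be converted between the generic and non-generic address spaces with the standard addrspacecast instruction. A minimal sketch reading the first element of the @g array through a generic pointer (the array definition is repeated so the snippet is self-contained):

  @g = internal addrspace(1) global [4 x i32] [i32 0, i32 1, i32 2, i32 3]

  define i32 @read_first() {
    ; Convert the global-space pointer to a generic pointer, then load through it.
    %gen = addrspacecast ptr addrspace(1) @g to ptr
    %elt = getelementptr inbounds [4 x i32], ptr %gen, i32 0, i32 0
    %val = load i32, ptr %elt
    ret i32 %val
  }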

Triples

The NVPTX target uses the module triple to select between 32/64-bit code generation and the driver-compiler interface to use. The triple architecture can be one of nvptx (32-bit PTX) or nvptx64 (64-bit PTX). The operating system should be one of cuda or nvcl, which determines the interface used by the generated code to communicate with the driver. Most users will want to use cuda as the operating system, which makes the generated PTX compatible with the CUDA Driver API.

Example: 32-bit PTX for CUDA Driver API: nvptx-nvidia-cuda

Example: 64-bit PTX for CUDA Driver API: nvptx64-nvidia-cuda

NVPTX Architecture Hierarchy and Ordering

GPU architectures: sm_2Y/sm_3Y/sm_5Y/sm_6Y/sm_7Y/sm_8Y/sm_9Y/sm_10Y/sm_12Y (‘Y’ represents the version within the architecture). The architectures have names of the form sm_XYz, where X represents the generation number, Y represents the version within the architecture, and z represents the optional feature suffix. If X1Y1 <= X2Y2, then the GPU capabilities of sm_X1Y1 are included in sm_X2Y2. For example, take sm_90 (9 represents X, 0 represents Y, and there is no feature suffix) and sm_103 (10 represents X, 3 represents Y, and there is no feature suffix). Since 90 <= 103, sm_90 is compatible with sm_103.

The family-specific variants have the f feature suffix and follow this order:
sm_X{Y2}f > sm_X{Y1}f iff Y2 > Y1
sm_XY{f} > sm_{XY}{}

For example, take the sm_100f (10 represents X, 0 represents Y, and f represents z) and sm_103f (10 represents X, 3 represents Y, and f represents z) architecture variants. Since Y1 < Y2, sm_100f is compatible with sm_103f. Similarly, based on the second rule, sm_90 is compatible with sm_103f.

As a counter-example, take the sm_100f and sm_120f (12 represents X, 0 represents Y, and f represents z) architecture variants. Since they belong to different families, i.e. X1 != X2, sm_100f is not compatible with sm_120f.

The architecture-specific variants have the a feature suffix and follow this order:
sm_XY{a} > sm_XY{f} > sm_{XY}{}

For example, take the sm_103a (10 represents X, 3 represents Y, and a represents z), sm_103f, and sm_103 architecture variants. sm_103 is compatible with sm_103a and sm_103f, and sm_103f is compatible with sm_103a.

Encoding := Arch * 10 + 2 (for ‘f’) + 1 (for ‘a’)
Arch := X * 10 + Y

For example, sm_103f is encoded as 1032 (103 * 10 + 2) and sm_103a is encoded as 1033 (103 * 10 + 2 + 1).

This encoding allows simple partial ordering of the architectures.

  • Compare the Family and the Arch by dividing FullSMVersion by 100 and 10, respectively, before the comparison.

  • Compare within the family by comparing FullSMVersion directly, given both belong to the same family.

  • Detect a variants by checking FullSMVersion & 1.

NVPTX Intrinsics

Reading PTX Special Registers

llvm.nvvm.read.ptx.sreg.*

Syntax:
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
Overview:

The ‘@llvm.nvvm.read.ptx.sreg.*’ intrinsics provide access to the PTX special registers, in particular the kernel launch bounds. These registers map in the following way to CUDA builtins:

CUDA Builtin    PTX Special Register Intrinsic
threadIdx       @llvm.nvvm.read.ptx.sreg.tid.*
blockIdx        @llvm.nvvm.read.ptx.sreg.ctaid.*
blockDim        @llvm.nvvm.read.ptx.sreg.ntid.*
gridDim         @llvm.nvvm.read.ptx.sreg.nctaid.*
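
For example, these intrinsics can be combined to compute a global thread index, mirroring the common blockIdx.x * blockDim.x + threadIdx.x pattern in CUDA. A minimal sketch (the kernel name and the scaling operation are illustrative):

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()

  define ptx_kernel void @scale(ptr %data, float %k) {
    %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    ; Global index = blockIdx.x * blockDim.x + threadIdx.x
    %base  = mul i32 %ctaid, %ntid
    %gid   = add i32 %base, %tid
    %addr  = getelementptr float, ptr %data, i32 %gid
    %val   = load float, ptr %addr
    %new   = fmul float %val, %k
    store float %new, ptr %addr
    ret void
  }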

Barriers

llvm.nvvm.barrier.cta.*

Syntax:
  declare void @llvm.nvvm.barrier.cta.sync.count(i32 %id, i32 %n)
  declare void @llvm.nvvm.barrier.cta.sync.all(i32 %id)
  declare void @llvm.nvvm.barrier.cta.arrive.count(i32 %id, i32 %n)
  declare void @llvm.nvvm.barrier.cta.sync.aligned.count(i32 %id, i32 %n)
  declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 %id)
  declare void @llvm.nvvm.barrier.cta.arrive.aligned.count(i32 %id, i32 %n)
Overview:

The ‘@llvm.nvvm.barrier.cta.*’ family of intrinsics performs barrier synchronization and communication among the threads within a CTA.

Semantics:

Operand %id specifies a logical barrier resource and must fall within the range 0 through 15. When present, operand %n specifies the number of threads participating in the barrier. When specifying a thread count, the value must be a multiple of the warp size. With the ‘@llvm.nvvm.barrier.cta.sync.*’ variants, the ‘.all’ suffix indicates that all threads in the CTA should participate in the barrier, while the ‘.count’ suffix indicates that only the threads specified by the %n operand should participate in the barrier.

All forms of the ‘@llvm.nvvm.barrier.cta.*’ intrinsic cause the executing thread to wait for all non-exited threads from its warp and then mark the warp’s arrival at the barrier. In addition to signaling its arrival at the barrier, the ‘@llvm.nvvm.barrier.cta.sync.*’ intrinsics cause the executing thread to wait for non-exited threads of all other warps participating in the barrier to arrive. On the other hand, the ‘@llvm.nvvm.barrier.cta.arrive.*’ intrinsics do not cause the executing thread to wait for threads of other participating warps.

When a barrier completes, the waiting threads are restarted without delay, and the barrier is reinitialized so that it can be immediately reused.

The ‘@llvm.nvvm.barrier.cta.*’ intrinsics have an optional ‘.aligned’ modifier to indicate textual alignment of the barrier. When specified, it indicates that all threads in the CTA will execute the same ‘@llvm.nvvm.barrier.cta.*’ instruction. In conditionally executed code, an aligned ‘@llvm.nvvm.barrier.cta.*’ instruction should only be used if it is known that all threads in the CTA evaluate the condition identically; otherwise the behavior is undefined.
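
For example, a producer/consumer handoff within a CTA can use the .sync.aligned.all form on logical barrier 0. A minimal sketch (the shared buffer and kernel name are illustrative):

  declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32)

  define ptx_kernel void @stage(ptr addrspace(3) %buf, ptr %out) {
    ; ... each thread writes its portion of %buf here ...
    call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0)  ; all threads in the CTA wait here
    ; ... now every thread may safely read data produced by the others ...
    ret void
  }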

MBarrier family of Intrinsics

Overview:

An mbarrier is a barrier created in shared memory that supports:

  • Synchronizing any subset of threads within a CTA.

  • One-way synchronization of threads across CTAs of a cluster. Threads can perform only arrive operations but not *_wait on an mbarrier located in shared::cluster space.

  • Waiting for completion of asynchronous memory operations initiated by a thread and making them visible to other threads.

Unlike bar{.cta}/barrier{.cta} instructions, which can access a limited number of barriers per CTA, mbarrier objects are user-defined and are only limited by the total shared memory size available.

An mbarrier object is an opaque object in shared memory with an alignment of 8 bytes. It keeps track of:

  • Current phase of the mbarrier object

  • Count of pending arrivals for the current phase of the mbarrier object

  • Count of expected arrivals for the next phase of the mbarrier object

  • Count of pending asynchronous memory operations (or transactions) tracked by the current phase of the mbarrier object. This is also referred to as tx-count. The unit of tx-count is specified by the asynchronous memory operation (for example, llvm.nvvm.cp.async.bulk.tensor.g2s.*).

The phase of an mbarrier object is the number of times the mbarrier object has been used to synchronize threads or track async operations. In each phase, threads perform:

  • arrive/expect-tx/complete-tx operations to progress the current phase.

  • test_wait/try_wait operations to check for completion of the current phase.

An mbarrier object completes the current phase when:

  • The count of the pending arrivals has reached zero AND

  • The tx-count has reached zero.

When an mbarrier object completes the current phase, the following actions are performed atomically:

  • The mbarrier object transitions to the next phase.

  • The pending arrival count is reinitialized to the expected arrival count.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier.

llvm.nvvm.mbarrier.init

Syntax:
  declare void @llvm.nvvm.mbarrier.init(ptr %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) %addr, i32 %count)
Overview:

The ‘@llvm.nvvm.mbarrier.init.*’ intrinsics are used to initialize an mbarrier object located at addr with the value count. count is a 32-bit unsigned integer value and must be within the range [1…2^20-1]. During initialization:

  • The tx-count and the current phase of the mbarrier object are set to 0.

  • The expected and pending arrival counts are set to count.

Semantics:

The .shared variant explicitly uses the shared memory address space for the addr operand. If addr does not fall within the shared::cta space, then the behavior of this intrinsic is undefined. Performing mbarrier.init on a valid mbarrier object is undefined; use mbarrier.inval before reusing the memory for another mbarrier or any other purpose.
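
A minimal sketch of declaring and initializing an mbarrier in shared::cta memory (the variable name and the participant count of 128 are illustrative; typically a single thread performs the initialization):

  @mbar = internal addrspace(3) global i64 0, align 8

  declare void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3), i32)

  define ptx_kernel void @init_barrier() {
    ; Initialize the mbarrier for 128 participating threads.
    call void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) @mbar, i32 128)
    ret void
  }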

llvm.nvvm.mbarrier.inval

Syntax:
  declare void @llvm.nvvm.mbarrier.inval(ptr %addr)
  declare void @llvm.nvvm.mbarrier.inval.shared(ptr addrspace(3) %addr)
Overview:

The ‘@llvm.nvvm.mbarrier.inval.*’ intrinsics invalidate the mbarrier object at the address specified by addr.

Semantics:

The .shared variant explicitly uses the shared memory address space for the addr operand. If addr does not fall within the shared::cta space, then the behavior of this intrinsic is undefined. It is expected that addr was previously initialized using mbarrier.init; otherwise, the behavior is undefined.

llvm.nvvm.mbarrier.expect.tx

Syntax:
  declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
Overview:

The ‘@llvm.nvvm.mbarrier.expect.tx.*’ intrinsics increase the transaction count of the mbarrier object at %addr by %tx_count. The %tx_count is a 32-bit unsigned integer value.

Semantics:

The .space.{cta/cluster} suffix indicates the address space in which the mbarrier object resides.

The .scope.{cta/cluster} suffix denotes the set of threads that can directly observe the synchronizing effect of the mbarrier operation. When the scope is “cta”, all threads executing in the same CTA (as the current thread) can directly observe the effect of the expect.tx operation. Similarly, when the scope is “cluster”, all threads executing in the same cluster (as the current thread) can directly observe the effect of the operation.

If addr does not fall within the shared::cta or shared::cluster space, then the behavior of this intrinsic is undefined. These intrinsics have relaxed semantics and hence do not provide any memory ordering or visibility guarantees.

llvm.nvvm.mbarrier.complete.tx

Syntax:
  declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
Overview:

The ‘@llvm.nvvm.mbarrier.complete.tx.*’ intrinsics decrease the transaction count of the mbarrier object at %addr by %tx_count. The %tx_count is a 32-bit unsigned integer value. As a result of this decrement, the mbarrier can potentially complete its current phase and transition to the next phase.

Semantics:

The semantics of these intrinsics are identical to those of the llvm.nvvm.mbarrier.expect.tx.* intrinsics described above.

llvm.nvvm.mbarrier.arrive

Syntax:
  declare i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
  declare i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
  declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
  declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
Overview:

The @llvm.nvvm.mbarrier.arrive.* intrinsics signal the arrival of the executing thread or the completion of an asynchronous instruction associated with an arrive operation on the mbarrier object at %addr. This operation decrements the pending arrival count by %count, a 32-bit unsigned integer, potentially completing the current phase and triggering a transition to the next phase.

Semantics:

The .space.{cta/cluster} suffix indicates the address space in which the mbarrier object resides. When the mbarrier is in shared::cta space, the intrinsics return an opaque 64-bit value capturing the phase of the mbarrier object prior to this arrive operation. This value can be used with a try_wait or test_wait operation to check for the completion of the mbarrier.

The .scope.{cta/cluster} suffix denotes the set of threads that can directly observe the synchronizing effect of the mbarrier operation. When the scope is “cta”, all threads executing in the same CTA (as the current thread) can directly observe the effect of the arrive operation. Similarly, when the scope is “cluster”, all threads executing in the same cluster (as the current thread) can directly observe the effect of the operation.

If addr does not fall within the shared::cta or shared::cluster space, then the behavior of this intrinsic is undefined.

These intrinsics have release semantics by default. The release semantics ensure ordering of operations that occur in program order before this arrive instruction, making their effects visible to subsequent operations in other threads of the CTA (or cluster, depending on the scope). Threads performing corresponding acquire operations (such as mbarrier.test.wait) synchronize with this release. The relaxed variants of these intrinsics do not provide any memory ordering or visibility guarantees.

llvm.nvvm.mbarrier.arrive.expect.tx

Syntax:
  declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
  declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
Overview:

The @llvm.nvvm.mbarrier.arrive.expect.tx.* intrinsics are similar to the @llvm.nvvm.mbarrier.arrive intrinsics except that they also perform an expect-tx operation prior to the arrive operation. The %tx_count specifies the transaction count for the expect-tx operation, and the count for the arrive operation is assumed to be 1.

Semantics:

The semantics of these intrinsics are identical to those of the llvm.nvvm.mbarrier.arrive.* intrinsics described above.

llvm.nvvm.mbarrier.arrive.drop

Syntax:
  declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
  declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
  declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
  declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
  declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
Overview:

The @llvm.nvvm.mbarrier.arrive.drop.* intrinsics decrement the expected arrival count of the mbarrier object at %addr by %count and then perform an arrive operation with %count. The %count is a 32-bit integer.

Semantics:

The semantics of these intrinsics are identical to those of the llvm.nvvm.mbarrier.arrive.* intrinsics described above.

llvm.nvvm.mbarrier.arrive.drop.expect.tx

Syntax:
  declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
  declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
  declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
Overview:

The @llvm.nvvm.mbarrier.arrive.drop.expect.tx.* intrinsics perform the following operations on the mbarrier located at %addr:

  • Perform an expect-tx operation, i.e. increase the transaction count of the mbarrier by %tx_count, a 32-bit unsigned integer value.

  • Decrement the expected arrival count of the mbarrier by 1.

  • Perform an arrive operation on the mbarrier with a value of 1.

Semantics:

The semantics of these intrinsics are identical to those of the llvm.nvvm.mbarrier.arrive.* intrinsics described above.

llvm.nvvm.mbarrier.test.wait

Syntax:
  declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
  declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
  declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
  declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
  declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
  declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
  declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
  declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
Overview:

The @llvm.nvvm.mbarrier.test.wait.* intrinsics test for the completion of the current or the immediately preceding phase of an mbarrier object at %addr. The test for completion can be done with either the state or the phase-parity of the mbarrier object.

  • When done through the i64 %state operand, the state must be returned by an llvm.nvvm.mbarrier.arrive.* on the same mbarrier object.

  • The .parity variants of these intrinsics test for completion of the phase indicated by the operand i32 %phase, which is the integer parity of either the current phase or the immediately preceding phase of the mbarrier object. An even phase has integer parity 0 and an odd phase has integer parity 1. So the valid values for phase-parity are 0 and 1.

Semantics:

The .scope.{cta/cluster} suffix denotes the set of threads that the test_wait operation can directly synchronize with.

If addr does not fall within the shared::cta space, then the behavior of this intrinsic is undefined.

These intrinsics have acquire semantics by default. This acquire pattern establishes memory ordering for operations occurring in program order after this test_wait instruction by making operations from other threads in the CTA (or cluster, depending on the scope) visible to subsequent operations in the current thread. When this wait completes, it synchronizes with the corresponding release pattern from the mbarrier.arrive operation. The relaxed variants of these intrinsics do not provide any memory ordering or visibility guarantees.

This test.wait intrinsic is non-blocking and immediately returns the completion status without suspending the executing thread.

The boolean return value indicates:

  • True: The immediately preceding phase has completed

  • False: The current phase is still incomplete

When this wait returns true, the following ordering guarantees hold:

  • All memory accesses (except async operations) requested prior to an mbarrier.arrive having release semantics by participating threads of a CTA (or cluster, depending on the scope) are visible to the executing thread.

  • All cp.async operations requested prior to cp.async.mbarrier.arrive by participating threads of a CTA are visible to the executing thread.

  • All cp.async.bulk operations using the same mbarrier object requested prior to an mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.

  • Memory accesses requested after this wait are not visible to memory accesses performed prior to mbarrier.arrive by other participating threads.

  • No ordering guarantee exists for memory accesses by the same thread between an mbarrier.arrive and this wait.

llvm.nvvm.mbarrier.try.wait

Syntax:
  declare i1 @llvm.nvvm.mbarrier.try.wait{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
  declare i1 @llvm.nvvm.mbarrier.try.wait{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
  declare i1 @llvm.nvvm.mbarrier.try.wait.parity{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
  declare i1 @llvm.nvvm.mbarrier.try.wait.parity{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
  declare i1 @llvm.nvvm.mbarrier.try.wait.tl{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state, i32 %timelimit)
  declare i1 @llvm.nvvm.mbarrier.try.wait.tl{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state, i32 %timelimit)
  declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase, i32 %timelimit)
  declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase, i32 %timelimit)
Overview:

The @llvm.nvvm.mbarrier.try.wait.* intrinsics test for the completion of the current or immediately preceding phase of an mbarrier object at %addr. Unlike the test.wait intrinsics, which perform a non-blocking test, these intrinsics may block the executing thread until the specified phase completes or a system-dependent time limit expires. Suspended threads resume execution when the phase completes or the time limit elapses. This time limit is configurable through the .tl variants of these intrinsics, where the %timelimit operand (an unsigned integer) specifies the limit in nanoseconds. Other semantics are identical to those of the test.wait intrinsics described above.
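
Putting the arrive and wait pieces together, a typical CTA-scope sequence captures the state returned by the arrive and then polls test.wait until the phase completes. A minimal sketch (the mbarrier is assumed to have been initialized elsewhere; names are illustrative):

  @mbar = internal addrspace(3) global i64 0, align 8

  declare i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3), i32)
  declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3), i64)

  define ptx_kernel void @sync_phase() {
  entry:
    ; Signal this thread's arrival and capture the pre-arrive phase state.
    %state = call i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) @mbar, i32 1)
    br label %poll

  poll:
    ; Non-blocking test; loop until the phase captured in %state completes.
    %done = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) @mbar, i64 %state)
    br i1 %done, label %exit, label %poll

  exit:
    ret void
  }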

Electing a thread

llvm.nvvm.elect.sync

Syntax:
  declare {i32, i1} @llvm.nvvm.elect.sync(i32 %membermask)
Overview:

The ‘@llvm.nvvm.elect.sync’ intrinsic generates the elect.sync PTX instruction, which elects one predicated active leader thread from a set of threads specified by membermask. The behavior is undefined if the executing thread is not in membermask. The laneid of the elected thread is captured in the i32 return value. The i1 return value is set to True for the leader thread and False for all the other threads. Election of a leader thread happens deterministically, i.e. the same leader thread is elected for the same membermask every time. For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-elect-sync.
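
For example, electing a single leader from a fully active warp and branching on the returned predicate. A minimal sketch (the kernel and flag names are illustrative):

  declare {i32, i1} @llvm.nvvm.elect.sync(i32)

  define ptx_kernel void @leader_only(ptr %flag) {
    ; Elect one leader among all 32 lanes of the warp (membermask = 0xffffffff).
    %res    = call {i32, i1} @llvm.nvvm.elect.sync(i32 -1)
    %leader = extractvalue {i32, i1} %res, 1
    br i1 %leader, label %is_leader, label %done

  is_leader:
    ; Only the elected leader executes this block.
    store i32 1, ptr %flag
    br label %done

  done:
    ret void
  }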

Membar/Fences

llvm.nvvm.fence.proxy.tensormap_generic.*

Syntax:
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.cta()
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.cluster()
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.gpu()
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.release.sys()
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cta(ptr %addr, i32 %size)
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cluster(ptr %addr, i32 %size)
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.gpu(ptr %addr, i32 %size)
  declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.sys(ptr %addr, i32 %size)
Overview:

The ‘@llvm.nvvm.fence.proxy.tensormap_generic.*’ intrinsics are uni-directional fences used to establish ordering between a prior memory access performed via the generic proxy (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies) and a subsequent memory access performed via the tensormap proxy. nvvm.fence.proxy.tensormap_generic.release can form a release sequence that synchronizes with an acquire sequence that contains the nvvm.fence.proxy.tensormap_generic.acquire proxy fence. The following table describes the mapping between the LLVM intrinsics and the PTX instructions:

NVVM Intrinsic                                        PTX Instruction
@llvm.nvvm.fence.proxy.tensormap_generic.release.*    fence.proxy.tensormap::generic.release.*
@llvm.nvvm.fence.proxy.tensormap_generic.acquire.*    fence.proxy.tensormap::generic.acquire.* [addr], size

The address operand addr and the operand size together specify the memory range [addr, addr+size) on which the ordering guarantees for the memory accesses across the proxies are provided. The only supported value for the size operand is 128, and it must be an immediate. Generic addressing is used unconditionally, and the address specified by the operand addr must fall within the .global state space. Otherwise, the behavior is undefined. For more information, see the PTX ISA.

Address Space Intrinsics

‘llvm.nvvm.isspacep.*’ Intrinsics

Syntax:
  declare i1 @llvm.nvvm.isspacep.const(ptr %p)
  declare i1 @llvm.nvvm.isspacep.global(ptr %p)
  declare i1 @llvm.nvvm.isspacep.local(ptr %p)
  declare i1 @llvm.nvvm.isspacep.shared(ptr %p)
  declare i1 @llvm.nvvm.isspacep.shared.cluster(ptr %p)
Overview:

The ‘llvm.nvvm.isspacep.*’ intrinsics determine whether the provided generic pointer references memory which falls within a particular address space.

Semantics:

If the given pointer in the generic address space refers to memory which falls within the state space of the intrinsic (and therefore could be safely address-space-cast to this space), 1 is returned; otherwise 0 is returned.

‘llvm.nvvm.mapa.*’ Intrinsics

Syntax:
  declare ptr @llvm.nvvm.mapa(ptr %p, i32 %rank)
  declare ptr addrspace(7) @llvm.nvvm.mapa.shared.cluster(ptr addrspace(3) %p, i32 %rank)
Overview:

The ‘llvm.nvvm.mapa.*’ intrinsics map a shared memory pointer %p to the corresponding address in the shared memory of the CTA identified by %rank within the cluster. The llvm.nvvm.mapa form expects a generic pointer to shared memory and returns a generic pointer to shared cluster memory. The llvm.nvvm.mapa.shared.cluster form expects a pointer to shared memory and returns a pointer to shared cluster memory. They correspond directly to the mapa and mapa.shared.cluster PTX instructions.

Semantics:

The intrinsics return the address of the shared memory location in the CTA identified by %rank that corresponds to the given pointer %p, either as a generic pointer (llvm.nvvm.mapa) or as a shared::cluster pointer (llvm.nvvm.mapa.shared.cluster).

Arithmetic Intrinsics

‘llvm.nvvm.fabs.*’ Intrinsic

Syntax:
  declare float @llvm.nvvm.fabs.f32(float %a)
  declare double @llvm.nvvm.fabs.f64(double %a)
  declare half @llvm.nvvm.fabs.f16(half %a)
  declare <2 x half> @llvm.nvvm.fabs.v2f16(<2 x half> %a)
  declare bfloat @llvm.nvvm.fabs.bf16(bfloat %a)
  declare <2 x bfloat> @llvm.nvvm.fabs.v2bf16(<2 x bfloat> %a)
Overview:

The ‘llvm.nvvm.fabs.*’ intrinsics return the absolute value of the operand.

Semantics:

Unlike ‘llvm.fabs.*’, these intrinsics do not perfectly preserve NaN values. Instead, a NaN input yields an unspecified NaN output.

‘llvm.nvvm.fabs.ftz.*’ Intrinsic

Syntax:
  declare float @llvm.nvvm.fabs.ftz.f32(float %a)
  declare half @llvm.nvvm.fabs.ftz.f16(half %a)
  declare <2 x half> @llvm.nvvm.fabs.ftz.v2f16(<2 x half> %a)
Overview:

The ‘llvm.nvvm.fabs.ftz.*’ intrinsics return the absolute value of the operand, flushing subnormals to sign-preserving zero.

Semantics:

Before the absolute value is taken, the input is flushed to sign-preserving zero if it is a subnormal. In addition, unlike ‘llvm.fabs.*’, a NaN input yields an unspecified NaN output.

‘llvm.nvvm.idp2a.[us].[us]’ Intrinsics

Syntax:
  declare i32 @llvm.nvvm.idp2a.s.s(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
  declare i32 @llvm.nvvm.idp2a.s.u(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
  declare i32 @llvm.nvvm.idp2a.u.s(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
  declare i32 @llvm.nvvm.idp2a.u.u(i32 %a, i32 %b, i1 immarg %is.hi, i32 %c)
Overview:

The ‘llvm.nvvm.idp2a.[us].[us]’ intrinsics perform a 2-element vector dot product followed by addition. They correspond directly to the dp2a PTX instruction.

Semantics:

The 32-bit value in %a is broken into two 16-bit values which are extended to 32 bits. For the ‘llvm.nvvm.idp2a.u.[us]’ variants zero-extension is used, while for the ‘llvm.nvvm.idp2a.s.[us]’ variants sign-extension is used. Two bytes are selected from %b: if %is.hi is true, the most significant bytes are selected, otherwise the least significant bytes are selected. These bytes are then extended to 32 bits. For the ‘llvm.nvvm.idp2a.[us].u’ variants zero-extension is used, while for the ‘llvm.nvvm.idp2a.[us].s’ variants sign-extension is used. The dot product of these 2-element vectors is added to %c to produce the return value.

‘llvm.nvvm.idp4a.[us].[us]’ Intrinsics

Syntax:
  declare i32 @llvm.nvvm.idp4a.s.s(i32 %a, i32 %b, i32 %c)
  declare i32 @llvm.nvvm.idp4a.s.u(i32 %a, i32 %b, i32 %c)
  declare i32 @llvm.nvvm.idp4a.u.s(i32 %a, i32 %b, i32 %c)
  declare i32 @llvm.nvvm.idp4a.u.u(i32 %a, i32 %b, i32 %c)
Overview:

The ‘llvm.nvvm.idp4a.[us].[us]’ intrinsics perform a 4-element vector dot product followed by addition. They correspond directly to the dp4a PTX instruction.

Semantics:

Each of the 4 bytes in both %a and %b is extended to a 32-bit integer, forming two <4 x i32> vectors. For %a, zero-extension is used in the ‘llvm.nvvm.idp4a.u.[us]’ variants, while sign-extension is used with the ‘llvm.nvvm.idp4a.s.[us]’ variants. Similarly, for %b, zero-extension is used in the ‘llvm.nvvm.idp4a.[us].u’ variants, while sign-extension is used with the ‘llvm.nvvm.idp4a.[us].s’ variants. The dot product of these 4-element vectors is added to %c to produce the return value.

Bit Manipulation Intrinsics

‘llvm.nvvm.fshl.clamp.*’ Intrinsic

Syntax:
  declare i32 @llvm.nvvm.fshl.clamp.i32(i32 %hi, i32 %lo, i32 %n)
Overview:

The ‘llvm.nvvm.fshl.clamp’ family of intrinsics performs a clamped funnel shift left. These intrinsics are very similar to ‘llvm.fshl’, except the shift amount is clamped at the integer width (instead of taken modulo it). Currently, only i32 is supported.

Semantics:

The ‘llvm.nvvm.fshl.clamp’ family of intrinsic functions performs a clamped funnel shift left: the first two values are concatenated as { %hi : %lo } (%hi is the most significant bits of the wide value), the combined value is shifted left, and the most significant bits are extracted to produce a result that is the same size as the original arguments. The shift amount is the minimum of the value of %n and the bit width of the integer type.
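
As an illustration of the clamping behavior (a sketch; the shift amount of 40 is chosen only to show the difference from llvm.fshl, which would reduce it modulo 32):

  declare i32 @llvm.nvvm.fshl.clamp.i32(i32, i32, i32)

  define i32 @fshl_clamp_demo(i32 %hi, i32 %lo) {
    ; The shift amount 40 exceeds the 32-bit width, so it is clamped to 32:
    ; {%hi:%lo} is shifted left by a full 32 bits and the high half is returned,
    ; which is simply %lo. (llvm.fshl would instead shift by 40 mod 32 = 8.)
    %r = call i32 @llvm.nvvm.fshl.clamp.i32(i32 %hi, i32 %lo, i32 40)
    ret i32 %r
  }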

‘llvm.nvvm.fshr.clamp.*’ Intrinsic

Syntax:
  declare i32 @llvm.nvvm.fshr.clamp.i32(i32 %hi, i32 %lo, i32 %n)
Overview:

The ‘llvm.nvvm.fshr.clamp’ family of intrinsics performs a clamped funnel shift right. These intrinsics are very similar to ‘llvm.fshr’, except the shift amount is clamped at the integer width (instead of taken modulo it). Currently, only i32 is supported.

Semantics:

The ‘llvm.nvvm.fshr.clamp’ family of intrinsic functions performs a clamped funnel shift right: the first two values are concatenated as { %hi : %lo } (%hi is the most significant bits of the wide value), the combined value is shifted right, and the least significant bits are extracted to produce a result that is the same size as the original arguments. The shift amount is the minimum of the value of %n and the bit width of the integer type.

‘llvm.nvvm.flo.u.*’ Intrinsic

Syntax:
  declare i32 @llvm.nvvm.flo.u.i32(i32 %a, i1 %shiftamt)
  declare i32 @llvm.nvvm.flo.u.i64(i64 %a, i1 %shiftamt)
Overview:

The ‘llvm.nvvm.flo.u’ family of intrinsics identifies the bit position of the leading one, returning either its offset from the most or the least significant bit.

Semantics:

The ‘llvm.nvvm.flo.u’ family of intrinsics returns the bit position of the most significant 1. If %shiftamt is true, the result is the shift amount needed to left-shift the found bit into the most significant bit position; otherwise, the result is the shift amount needed to right-shift the found bit into the least significant bit position. 0xffffffff is returned if no 1 bit is found.
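
For example, for an input whose only set bit is bit 20 (a sketch; the constant and names are illustrative):

  declare i32 @llvm.nvvm.flo.u.i32(i32, i1)

  define void @flo_demo(ptr %out) {
    ; 1048576 == 0x00100000, whose most significant set bit is bit 20.
    %pos   = call i32 @llvm.nvvm.flo.u.i32(i32 1048576, i1 false)  ; 20: right-shift amount to reach bit 0
    %shift = call i32 @llvm.nvvm.flo.u.i32(i32 1048576, i1 true)   ; 11: left-shift amount to reach bit 31
    store i32 %pos, ptr %out
    ret void
  }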

‘llvm.nvvm.flo.s.*’ Intrinsic

Syntax:
  declare i32 @llvm.nvvm.flo.s.i32(i32 %a, i1 %shiftamt)
  declare i32 @llvm.nvvm.flo.s.i64(i64 %a, i1 %shiftamt)
Overview:

The ‘llvm.nvvm.flo.s’ family of intrinsics identifies the bit position of the leading non-sign bit, returning either its offset from the most or the least significant bit.

Semantics:

The ‘llvm.nvvm.flo.s’ family of intrinsics returns the bit position of the most significant 0 for negative inputs and the most significant 1 for non-negative inputs. If %shiftamt is true, the result is the shift amount needed to left-shift the found bit into the most significant bit position; otherwise, the result is the shift amount needed to right-shift the found bit into the least significant bit position. 0xffffffff is returned if no such bit is found.

‘llvm.nvvm.{zext,sext}.{wrap,clamp}’ Intrinsics

Syntax:
  declare i32 @llvm.nvvm.zext.wrap(i32 %a, i32 %b)
  declare i32 @llvm.nvvm.zext.clamp(i32 %a, i32 %b)
  declare i32 @llvm.nvvm.sext.wrap(i32 %a, i32 %b)
  declare i32 @llvm.nvvm.sext.clamp(i32 %a, i32 %b)
Overview:

The ‘llvm.nvvm.{zext,sext}.{wrap,clamp}’ family of intrinsics extracts the low bits of the input value and zero- or sign-extends them back to the original width.

Semantics:

The ‘llvm.nvvm.{zext,sext}.{wrap,clamp}’ family of intrinsics returns the extension of the N lowest bits of operand %a. For the ‘wrap’ variants, N is the value of operand %b modulo 32. For the ‘clamp’ variants, N is the value of operand %b clamped to the range [0, 32]. The N lowest bits are then zero-extended in the case of the ‘zext’ variants, or sign-extended in the case of the ‘sext’ variants. If N is 0, the result is 0.

‘llvm.nvvm.bmsk.{wrap,clamp}’ Intrinsic

Syntax:
  declare i32 @llvm.nvvm.bmsk.wrap(i32 %a, i32 %b)
  declare i32 @llvm.nvvm.bmsk.clamp(i32 %a, i32 %b)
Overview:

The ‘llvm.nvvm.bmsk.{wrap,clamp}’ family of intrinsics creates a bit mask given a starting bit position and a bit width.

Semantics:

The ‘llvm.nvvm.bmsk.{wrap,clamp}’ family of intrinsics returns a value with all bits set to 0 except for %b bits starting at bit position %a. For the ‘wrap’ variants, the values of %a and %b modulo 32 are used. For the ‘clamp’ variants, the values of %a and %b are clamped to the range [0, 32], which in practice is equivalent to using them as-is.
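
For example, a starting position of 8 and a width of 4 produce the mask 0x00000f00 (a minimal sketch):

  declare i32 @llvm.nvvm.bmsk.clamp(i32, i32)

  define i32 @bmsk_demo() {
    ; Four bits set, starting at bit position 8: 0b1111 << 8 == 0x00000f00.
    %mask = call i32 @llvm.nvvm.bmsk.clamp(i32 8, i32 4)
    ret i32 %mask
  }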

‘llvm.nvvm.prmt’ Intrinsic

Syntax:
  declare i32 @llvm.nvvm.prmt(i32 %lo, i32 %hi, i32 %selector)
Overview:

The ‘llvm.nvvm.prmt’ intrinsic constructs a permutation of the bytes of the first two operands, selecting based on the third operand.

Semantics:

The bytes in the first two source operands are numbered from 0 to 7: {%hi, %lo} = {{b7, b6, b5, b4}, {b3, b2, b1, b0}}. For each byte in the target register, a 4-bit selection value is defined.

The 3 LSBs of the selection value specify which of the 8 source bytes should be moved into the target position. The MSB defines whether the byte value should be copied, or whether the sign (MSB of the byte) should be replicated over all 8 bits of the target position (sign extension of the byte value): msb=0 means copy the literal value; msb=1 means replicate the sign.

These 4-bit selection values are pulled from the lower 16 bits of the %selector operand, with the least significant selection value corresponding to the least significant byte of the destination.
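
For example, a selector of 0x7531 picks source bytes 1, 3, 5, and 7, from the least to the most significant destination byte, with no sign-replication bits set (a sketch; the constants are written in decimal with their hexadecimal values in comments):

  declare i32 @llvm.nvvm.prmt(i32, i32, i32)

  define i32 @prmt_demo() {
    ; %lo = 0x33221100 -> bytes {b3,b2,b1,b0} = {0x33,0x22,0x11,0x00}
    ; %hi = 0x77665544 -> bytes {b7,b6,b5,b4} = {0x77,0x66,0x55,0x44}
    ; Selector 0x7531 selects bytes 1, 3, 5, 7, so the result is 0x77553311.
    %r = call i32 @llvm.nvvm.prmt(i32 857870592, i32 2003195204, i32 30001)
    ret i32 %r
  }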

‘llvm.nvvm.prmt.*’ Intrinsics

Syntax:
  declare i32 @llvm.nvvm.prmt.f4e(i32 %lo, i32 %hi, i32 %selector)
  declare i32 @llvm.nvvm.prmt.b4e(i32 %lo, i32 %hi, i32 %selector)
  declare i32 @llvm.nvvm.prmt.rc8(i32 %lo, i32 %selector)
  declare i32 @llvm.nvvm.prmt.ecl(i32 %lo, i32 %selector)
  declare i32 @llvm.nvvm.prmt.ecr(i32 %lo, i32 %selector)
  declare i32 @llvm.nvvm.prmt.rc16(i32 %lo, i32 %selector)
Overview:

The ‘llvm.nvvm.prmt.*’ family of intrinsics constructs a permutation of the bytes of the first one or two operands, selecting based on the 2 least significant bits of the final operand.

Semantics:

As with the generic ‘llvm.nvvm.prmt’ intrinsic, the bytes in the first one or two source operands are numbered. The first source operand (%lo) is numbered {b3, b2, b1, b0}; in the case of the ‘f4e’ and ‘b4e’ variants, the second source operand (%hi) is numbered {b7, b6, b5, b4}.

Depending on the 2 least significant bits of the %selector operand, the result of the permutation is defined as follows:

Mode    %selector[1:0]    Output
f4e     0                 {3, 2, 1, 0}
        1                 {4, 3, 2, 1}
        2                 {5, 4, 3, 2}
        3                 {6, 5, 4, 3}
b4e     0                 {5, 6, 7, 0}
        1                 {6, 7, 0, 1}
        2                 {7, 0, 1, 2}
        3                 {0, 1, 2, 3}
rc8     0                 {0, 0, 0, 0}
        1                 {1, 1, 1, 1}
        2                 {2, 2, 2, 2}
        3                 {3, 3, 3, 3}
ecl     0                 {3, 2, 1, 0}
        1                 {3, 2, 1, 1}
        2                 {3, 2, 2, 2}
        3                 {3, 3, 3, 3}
ecr     0                 {0, 0, 0, 0}
        1                 {1, 1, 1, 0}
        2                 {2, 2, 1, 0}
        3                 {3, 2, 1, 0}
rc16    0                 {1, 0, 1, 0}
        1                 {3, 2, 3, 2}
        2                 {1, 0, 1, 0}
        3                 {3, 2, 3, 2}

TMA family of Intrinsics

llvm.nvvm.cp.async.bulk.global.to.shared.cluster

Syntax:
  declare void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(ptr addrspace(7) %dst, ptr addrspace(3) %mbar, ptr addrspace(1) %src, i32 %size, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.global.to.shared.cluster’ intrinsic corresponds to the cp.async.bulk.shared::cluster.global.* family of PTX instructions. These instructions initiate an asynchronous copy of bulk data from global memory to shared::cluster memory. The 32-bit operand %size specifies the amount of memory to be copied and it must be a multiple of 16.

  • The last two arguments to these intrinsics are boolean flags indicating support for the cache_hint and/or multicast modifiers. These flag arguments must be compile-time constants. The backend looks through these flags and lowers the intrinsics appropriately.

  • The Nth argument (denoted by i1 %flag_ch), when set, indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

  • The [N-1]th argument (denoted by i1 %flag_mc), when set, indicates the presence of a multicast mask (i16 %mc) and generates the PTX instruction with the .multicast::cluster modifier.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk.
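
A sketch of a call with both optional modifiers disabled, so the %mc and %ch operands are ignored (the 256-byte size and all names are illustrative):

  declare void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(ptr addrspace(7), ptr addrspace(3), ptr addrspace(1), i32, i16, i64, i1, i1)

  define ptx_kernel void @bulk_copy(ptr addrspace(7) %dst, ptr addrspace(3) %mbar, ptr addrspace(1) %src) {
    ; Copy 256 bytes (a multiple of 16) with no multicast mask and no cache hint.
    call void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(ptr addrspace(7) %dst, ptr addrspace(3) %mbar, ptr addrspace(1) %src, i32 256, i16 0, i64 0, i1 false, i1 false)
    ret void
  }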

llvm.nvvm.cp.async.bulk.shared.cta.to.global

Syntax:
  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i64 %ch, i1 %flag_ch)
  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(..., i32 %size, i64 %ch, i1 %flag_ch, i16 %mask)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.shared.cta.to.global’ intrinsic corresponds to the cp.async.bulk.global.shared::cta.* set of PTX instructions. These instructions initiate an asynchronous copy from shared::cta to global memory. The 32-bit operand %size specifies the amount of memory to be copied (in bytes) and it must be a multiple of 16. For the .bytemask variant, the 16-bit wide mask operand specifies whether the i-th byte of each 16-byte wide chunk of source data is copied to the destination.

  • The i1 %flag_ch argument to these intrinsics is a boolean flag indicating support for cache_hint. This flag argument must be a compile-time constant. When set, it indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk.

llvm.nvvm.cp.async.bulk.shared.cta.to.cluster

Syntax:
  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.cluster(ptr addrspace(7) %dst, ptr addrspace(3) %mbar, ptr addrspace(3) %src, i32 %size)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.shared.cta.to.cluster’ intrinsic corresponds to the cp.async.bulk.shared::cluster.shared::cta.* PTX instruction. This instruction initiates an asynchronous copy from shared::cta to shared::cluster memory. The destination has to be in the shared memory of a different CTA within the cluster. The 32-bit operand %size specifies the amount of memory to be copied and it must be a multiple of 16.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk.

llvm.nvvm.cp.async.bulk.prefetch.L2

Syntax:
  declare void @llvm.nvvm.cp.async.bulk.prefetch.L2(ptr addrspace(1) %src, i32 %size, i64 %ch, i1 %flag_ch)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.prefetch.L2’ intrinsic corresponds to the cp.async.bulk.prefetch.L2.* family of PTX instructions. These instructions initiate an asynchronous prefetch of bulk data from global memory to the L2 cache. The 32-bit operand %size specifies the amount of memory to be prefetched, in bytes, and it must be a multiple of 16.

  • The last argument to these intrinsics is a boolean flag indicating support for cache_hint. This flag argument must be a compile-time constant. When set, it indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch.

llvm.nvvm.prefetch.*

Syntax:
  declare void @llvm.nvvm.prefetch.global.L1(ptr addrspace(1) %global_ptr)
  declare void @llvm.nvvm.prefetch.global.L2(ptr addrspace(1) %global_ptr)
  declare void @llvm.nvvm.prefetch.local.L1(ptr addrspace(5) %local_ptr)
  declare void @llvm.nvvm.prefetch.local.L2(ptr addrspace(5) %local_ptr)
  declare void @llvm.nvvm.prefetch.L1(ptr %ptr)
  declare void @llvm.nvvm.prefetch.L2(ptr %ptr)
  declare void @llvm.nvvm.prefetch.tensormap.p0(ptr %ptr)
  declare void @llvm.nvvm.prefetch.tensormap.p4(ptr addrspace(4) %const_ptr)
  declare void @llvm.nvvm.prefetch.tensormap.p101(ptr addrspace(101) %param_ptr)
  declare void @llvm.nvvm.prefetch.global.L2.evict.normal(ptr addrspace(1) %global_ptr)
  declare void @llvm.nvvm.prefetch.global.L2.evict.last(ptr addrspace(1) %global_ptr)
  declare void @llvm.nvvm.prefetchu.L1(ptr %ptr)
Overview:

The ‘@llvm.nvvm.prefetch.*’ and ‘@llvm.nvvm.prefetchu.*’ intrinsics correspond to the ‘prefetch.*’ and ‘prefetchu.*’ families of PTX instructions. The ‘prefetch.*’ instructions bring the cache line containing the specified address in the global or local memory address space into the specified cache level (L1 or L2). If the ‘.tensormap’ qualifier is specified, then the prefetch instruction brings the cache line containing the specified address in the ‘.const’ or ‘.param’ memory state space for subsequent use by the ‘cp.async.bulk.tensor’ instruction. The ‘prefetchu.*’ instruction brings the cache line containing the specified generic address into the specified uniform cache level. If no address space is specified, the address is assumed to be generic. The intrinsics use an eviction priority, which can be specified with the ‘.level::eviction_priority’ modifier.

  • A prefetch to a shared memory location performs no operation.

  • A prefetch into the uniform cache requires a generic address, and no operation occurs if the address maps to a const, local, or shared memory location.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu.

llvm.nvvm.applypriority.*

Syntax:
  declare void @llvm.nvvm.applypriority.global.L2.evict.normal(ptr addrspace(1) %global_ptr, i64 %size)
  declare void @llvm.nvvm.applypriority.L2.evict.normal(ptr %ptr, i64 %size)
Overview:

The ‘@llvm.nvvm.applypriority.*’ intrinsics apply the cache eviction priority specified by the .level::eviction_priority qualifier to the address range [a..a+size) in the specified cache level. If no state space is specified, then generic addressing is used. If the specified address does not fall within the address window of the .global state space, then the behavior is undefined. The operand size is an integer constant that specifies the amount of data, in bytes, in the specified cache level on which the priority is to be applied. The only supported value for the size operand is 128.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-applypriority.

llvm.nvvm.discard.*

Syntax:
  declare void @llvm.nvvm.discard.global.L2(ptr addrspace(1) %global_ptr, i64 immarg)
  declare void @llvm.nvvm.discard.L2(ptr %ptr, i64 immarg)
Overview:

The effects of the @llvm.nvvm.discard.L2* intrinsics are those of a non-atomic, non-volatile llvm.memset that writes undef to the destination address range [%ptr, %ptr+immarg). The %ptr must be aligned by 128 bytes. Subsequent reads from the address range may read undef until the memory is overwritten with a different value. These operations hint to the implementation that data in the L2 cache can be destructively discarded without writing it back to memory. The operand immarg is an integer constant that specifies the length in bytes of the address range [%ptr, %ptr+immarg) to write undef into. The only supported value for the immarg operand is 128. If generic addressing is used and the specified address does not fall within the address window of global memory (addrspace(1)), the behavior is undefined.

  call void @llvm.nvvm.discard.L2(ptr %p, i64 128)  ; writes `undef` to [p, p+128)
  %a = load i64, ptr %p                             ; loads 8 bytes containing undef
  %b = load i64, ptr %p                             ; loads 8 bytes containing undef
  ; comparing %a and %b compares `undef` values!
  %fa = freeze i64 %a                               ; freezes undef to a stable bit-pattern
  %fb = freeze i64 %b                               ; freezes undef to a stable bit-pattern
  ; %fa may compare different to %fb!

For more information, refer to the CUDA C++ discard documentation and to the PTX ISA discard documentation.

llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d

Syntax:
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.1d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.2d(..., i32 %d0, i32 %d1, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.gather4.2d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d’ intrinsics correspond to the cp.async.bulk.tensor.[1-5]d.* set of PTX instructions. These instructions initiate an asynchronous copy of tensor data from global memory to shared::cluster memory (indicated by the g2s prefix) in tile mode. In tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. The dimension of the tensor data ranges from 1d to 5d with the coordinates specified by the i32 %d0 ... i32 %d4 arguments. In tile.gather4 mode, four rows in a 2D tensor are combined to form a single 2D destination tensor. The first coordinate i32 %x0 denotes the column index, followed by four coordinates indicating the four row indices. So, this mode takes a total of 5 coordinates as input arguments. For more information on gather4 mode, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes.

  • The last three arguments to these intrinsics are flags indicating support for the multicast, cache_hint, and cta_group::1/2 modifiers. These flag arguments must be compile-time constants. The backend looks through these flags and lowers the intrinsics appropriately.

  • The argument denoted by i1 %flag_ch, when set, indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

  • The argument denoted by i1 %flag_mc, when set, indicates the presence of a multicast mask (i16 %mc) and generates the PTX instruction with the .multicast::cluster modifier.

  • The argument denoted by i32 %flag_cta_group takes values within the range [0, 3), i.e. {0, 1, 2}. When the value of %flag_cta_group is not within the range, it may raise an error from the Verifier. The default value is ‘0’, with no cta_group modifier in the instruction. The values of ‘1’ and ‘2’ lower to the cta_group::1 and cta_group::2 variants of the PTX instruction, respectively.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor.

llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d

Syntax:
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d’ intrinsics correspond to the cp.async.bulk.tensor.[1-5]d.* set of PTX instructions. These instructions initiate an asynchronous copy of tensor data from global memory to shared::cluster memory (indicated by the g2s prefix) in im2col mode. In im2col mode, some dimensions of the source tensor are unrolled into a single dimensional column at the destination. In this mode, the tensor has to be at least three-dimensional. Along with the tensor coordinates, im2col offsets are also specified (denoted by i16 %im2col0 ... i16 %im2col2). For the im2col mode, the number of offsets is two less than the number of dimensions of the tensor operation. For the im2col.w and im2col.w.128 modes, the number of offsets is always 2, denoted by the i16 %wHalo and i16 %wOffset arguments. For more information on the im2col.w and im2col.w.128 modes, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes.

The last three arguments to these intrinsics are flags with the same functionality as described for the tile mode intrinsics above.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor.

llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.[1-5]d

Syntax:
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.1d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.2d(..., i32 %d0, i32 %d1, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.gather4.2d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.[1-5]d’ intrinsics correspond to the cp.async.bulk.tensor.[1-5]d.shared::cta.global.* set of PTX instructions. These instructions initiate an asynchronous copy of tensor data from global memory to shared::cta memory in tile mode. In tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. The dimension of the tensor data ranges from 1d to 5d with the coordinates specified by the i32 %d0 ... i32 %d4 arguments. In tile.gather4 mode, four rows in a 2D tensor are combined to form a single 2D destination tensor. The first coordinate i32 %x0 denotes the column index, followed by four coordinates indicating the four row indices. So, this mode takes a total of 5 coordinates as input arguments. For more information on gather4 mode, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes.

  • The last argument to these intrinsics is a boolean flag indicating support for cache_hint. This flag argument must be a compile-time constant. When set, it indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor.
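As a minimal sketch (the function, its operands, and the disabled cache hint are assumptions of this illustration), a 1-D copy into shared::cta memory might be written as:

define void @g2s_cta_tile_1d_example(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tmap, i32 %x0) {
  ; %flag_ch = 0, so the %ch operand is ignored.
  call void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.1d(
      ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tmap, i32 %x0, i64 0, i1 0)
  ret void
}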

llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.[3-5]d

Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.[3-5]d’ intrinsics correspond to the cp.async.bulk.tensor.[1-5]d.shared::cta.global.* set of PTX instructions. These instructions initiate an asynchronous copy of tensor data from global memory to shared::cta memory in im2col mode. In im2col mode, some dimensions of the source tensor are unrolled into a single dimensional column at the destination. In this mode, the tensor has to be at least three-dimensional. Along with the tensor coordinates, im2col offsets are also specified (denoted by i16 %im2col0 ... i16 %im2col2). For the im2col mode, the number of offsets is two less than the number of dimensions of the tensor operation. For the im2col.w and im2col.w.128 modes, the number of offsets is always 2, denoted by the i16 %wHalo and i16 %wOffset arguments. For more information on the im2col.w and im2col.w.128 modes, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes.

  • The last argument to these intrinsics is a boolean flag indicating support for cache_hint. This flag argument must be a compile-time constant. When set, it indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor.

llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d

Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.2d(..., i32 %d0, i32 %d1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.scatter4.2d(ptr addrspace(3) %src, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d’ intrinsics correspond to the cp.async.bulk.tensor.[1-5]d.* set of PTX instructions. These instructions initiate an asynchronous copy of tensor data from shared::cta to global memory (indicated by the s2g prefix) in tile mode. The dimension of the tensor data ranges from 1d to 5d with the coordinates specified by the i32 %d0 ... i32 %d4 arguments. In tile.scatter4 mode, a single 2D source tensor is divided into four rows in the 2D destination tensor. The first coordinate i32 %x0 denotes the column index, followed by four coordinates indicating the four row indices. So, this mode takes a total of 5 coordinates as input arguments. For more information on scatter4 mode, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes.

  • The last argument to these intrinsics is a boolean flag indicating support for cache_hint. This flag argument must be a compile-time constant. When set, it indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor.
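As a hedged illustration (the names and the decision to pass a cache hint are assumptions of this sketch), a 1-D shared-to-global copy that does use the .L2::cache_hint variant could be expressed as:

define void @s2g_tile_1d_example(ptr addrspace(3) %src, ptr %tmap, i32 %x0, i64 %ch) {
  ; %flag_ch = 1, so %ch is treated as a valid cache hint and the
  ; .L2::cache_hint form of the PTX instruction is generated.
  call void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.1d(
      ptr addrspace(3) %src, ptr %tmap, i32 %x0, i64 %ch, i1 1)
  ret void
}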

llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[3-5]d

Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[1-5]d’ intrinsics correspond to the cp.async.bulk.tensor.[1-5]d.* set of PTX instructions. These instructions initiate an asynchronous copy of tensor data from shared::cta to global memory (indicated by the s2g prefix) in im2col mode. In this mode, the tensor has to be at least three-dimensional. Unlike the g2s variants, there are no im2col_offsets for these intrinsics. The last argument to these intrinsics is a boolean flag, with the same functionality as described in the s2g.tile mode intrinsics above.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor.

llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.[1-5]d

Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.1d(ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.2d(..., i32 %d0, i32 %d1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.gather4.2d(ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.[1-5]d’ intrinsics correspond to the cp.async.bulk.prefetch.tensor.[1-5]d.L2.global* set of PTX instructions. These instructions initiate an asynchronous prefetch of tensor data from global memory to the L2 cache. In tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. The dimension of the tensor data ranges from 1d to 5d with the coordinates specified by the i32 %d0 ... i32 %d4 arguments.

In tile.gather4 mode, four rows in the two-dimensional source tensor are fetched to the L2 cache. The first coordinate i32 %x0 denotes the column index, followed by four coordinates indicating the four row indices. So, this mode takes a total of 5 coordinates as input arguments. For more information on gather4 mode, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes.

  • The last argument to these intrinsics is a boolean flag indicating support for cache_hint. This flag argument must be a compile-time constant. When set, it indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor.
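As a small illustrative sketch (the function name is assumed and the two zero coordinates are placeholders), a 2-D tile prefetch into the L2 cache without a cache hint could be written as:

define void @prefetch_tile_2d_example(ptr %tmap) {
  ; %flag_ch = 0, so the %ch operand is ignored.
  call void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.2d(
      ptr %tmap, i32 0, i32 0, i64 0, i1 0)
  ret void
}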

llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d

Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d’ intrinsics correspond to the cp.async.bulk.prefetch.tensor.[1-5]d.L2.global* set of PTX instructions. These instructions initiate an asynchronous prefetch of tensor data from global memory to the L2 cache. In im2col mode, some dimensions of the source tensor are unrolled into a single dimensional column at the destination. In this mode, the tensor has to be at least three-dimensional. Along with the tensor coordinates, im2col offsets are also specified (denoted by i16 %im2col0 ... i16 %im2col2). For im2col mode, the number of offsets is two less than the number of dimensions of the tensor operation. For the im2col.w and im2col.w.128 modes, the number of offsets is always 2, denoted by the i16 %wHalo and i16 %wOffset arguments. For more information on the im2col.w and im2col.w.128 modes, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes.

The last argument to these intrinsics is a boolean flag, with the same functionality as described in the tile mode intrinsics above.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor.

llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].tile.[1-5]d

Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.add.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.min.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.max.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.inc.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.dec.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.and.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.or.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.xor.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.2d(..., i32 %d0, i32 %d1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.[1-5]d’ intrinsics correspond to the cp.reduce.async.bulk.tensor.[1-5]d.* set of PTX instructions. These instructions initiate an asynchronous reduction operation of tensor data in global memory with the tensor data in shared{::cta} memory, using tile mode. The dimension of the tensor data ranges from 1d to 5d with the coordinates specified by the i32 %d0 ... i32 %d4 arguments. The supported reduction operations are {add, min, max, inc, dec, and, or, xor}, as shown in the tile.1d intrinsics above.

  • The last argument to these intrinsics is a boolean flag indicating support for cache_hint. This flag argument must be a compile-time constant. When set, it indicates a valid cache_hint (i64 %ch) and generates the .L2::cache_hint variant of the PTX instruction.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor.
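For illustration only (the names and the coordinate are placeholders), a 1-D add reduction of shared::cta tensor data into global memory, without a cache hint, might look like:

define void @reduce_add_tile_1d_example(ptr addrspace(3) %src, ptr %tmap, i32 %x0) {
  ; red_op = add; %flag_ch = 0, so the %ch operand is ignored.
  call void @llvm.nvvm.cp.async.bulk.tensor.reduce.add.tile.1d(
      ptr addrspace(3) %src, ptr %tmap, i32 %x0, i64 0, i1 0)
  ret void
}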

llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].im2col.[3-5]d

Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
Overview:

The ‘@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.[3-5]d’ intrinsics correspond to the cp.reduce.async.bulk.tensor.[3-5]d.* set of PTX instructions. These instructions initiate an asynchronous reduction operation of tensor data in global memory with the tensor data in shared{::cta} memory, using im2col mode. In this mode, the tensor has to be at least three-dimensional. The supported reduction operations are the same as the ones in the tile mode. The last argument to these intrinsics is a boolean flag, with the same functionality as described in the tile mode intrinsics above.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor.

Warp Group Intrinsics

llvm.nvvm.wgmma.fence.sync.aligned

Syntax:
declare void @llvm.nvvm.wgmma.fence.sync.aligned()
Overview:

The ‘@llvm.nvvm.wgmma.fence.sync.aligned’ intrinsic generates the wgmma.fence.sync.aligned PTX instruction, which establishes an ordering between prior accesses to any warpgroup registers and subsequent accesses to the same registers by a wgmma.mma_async instruction.

The wgmma.fence instruction must be issued by all warps of the warpgroup in the following locations:

  • Before the first wgmma.mma_async operation in a warpgroup.

  • Between a register access by a thread in the warpgroup and any wgmma.mma_async instruction that accesses the same registers, except when these are accumulator register accesses across multiple wgmma.mma_async instructions of the same shape, in which case an ordering guarantee is provided by default.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence.

llvm.nvvm.wgmma.commit_group.sync.aligned

Syntax:
declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()
Overview:

The ‘@llvm.nvvm.wgmma.commit_group.sync.aligned’ intrinsic generates the wgmma.commit_group.sync.aligned PTX instruction, which creates a new wgmma-group per warpgroup and batches all prior wgmma.mma_async instructions initiated by the executing warp but not committed to any wgmma-group into the new wgmma-group. If there are no uncommitted wgmma.mma_async instructions, then wgmma.commit_group results in an empty wgmma-group.

An executing thread can wait for the completion of all wgmma.mma_async operations in a wgmma-group by using wgmma.wait_group.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-commit-group.

llvm.nvvm.wgmma.wait_group.sync.aligned

Syntax:
declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 immarg N)
Overview:

The ‘@llvm.nvvm.wgmma.wait_group.sync.aligned’ intrinsic generates the wgmma.wait_group.sync.aligned N PTX instruction, which will cause the executing thread to wait until only N or fewer of the most recent wgmma-groups are pending and all the prior wgmma-groups committed by the executing threads are complete. For example, when N is 0, the executing thread waits on all the prior wgmma-groups to complete. Operand N is an integer constant.

Accessing the accumulator register or the input register containing the fragments of matrix A of a wgmma.mma_async instruction without first performing a wgmma.wait_group instruction that waits on a wgmma-group including that wgmma.mma_async instruction is undefined behavior.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-wait-group.
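A hedged sketch of the usual fence/commit/wait pattern follows (the surrounding function is an assumption, and the wgmma.mma_async computation itself, which is not modeled by these three intrinsics, is only indicated by a comment):

define void @wgmma_ordering_example() {
  ; Order prior register accesses before the upcoming wgmma.mma_async.
  call void @llvm.nvvm.wgmma.fence.sync.aligned()
  ; ... wgmma.mma_async operations would be issued here ...
  ; Batch the uncommitted wgmma.mma_async operations into a new wgmma-group.
  call void @llvm.nvvm.wgmma.commit_group.sync.aligned()
  ; Wait until no wgmma-groups are pending (N = 0).
  call void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 0)
  ret void
}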

llvm.nvvm.griddepcontrol.*

Syntax:
declare void @llvm.nvvm.griddepcontrol.launch_dependents()
declare void @llvm.nvvm.griddepcontrol.wait()
Overview:

The griddepcontrol intrinsics allow dependent grids and prerequisite grids, as defined by the runtime, to control execution in the following way:

The griddepcontrol.launch_dependents intrinsic signals that the dependent grids can be scheduled before the current grid completes. The intrinsic can be invoked by multiple threads in the current CTA, and repeated invocations of the intrinsic have no additional side effects past that of the first invocation.

The griddepcontrol.wait intrinsic causes the executing thread to wait until all prerequisite grids in flight have completed and all the memory operations from the prerequisite grids are performed and made visible to the current grid.

For more information, refer to the PTX ISA.
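A minimal sketch of how a kernel might use the pair (the kernel name and the placement of the calls are illustrative assumptions):

define ptx_kernel void @griddep_example() {
  ; Block until all prerequisite grids have completed and their memory
  ; effects are visible to this grid.
  call void @llvm.nvvm.griddepcontrol.wait()
  ; ... work that consumes results of the prerequisite grids ...
  ; Allow dependent grids to be scheduled before this grid finishes.
  call void @llvm.nvvm.griddepcontrol.launch_dependents()
  ret void
}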

TCGEN05 family of Intrinsics

The llvm.nvvm.tcgen05.* intrinsics model the TCGEN05 family of instructions exposed by PTX. These intrinsics use ‘Tensor Memory’ (henceforth tmem). NVPTX represents this memory using addrspace(6), and pointers into it are always 32 bits wide.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory.

The tensor-memory pointers may only be used with the tcgen05 intrinsics. There are specialized load/store instructions provided (tcgen05.ld/st) to work with tensor-memory.

For more information on tensor-memory load/store instructions, see the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-and-register-load-store-instructions.

llvm.nvvm.tcgen05.alloc

Syntax:
declare void @llvm.nvvm.tcgen05.alloc.cg1(ptr %dst, i32 %ncols)
declare void @llvm.nvvm.tcgen05.alloc.cg2(ptr %dst, i32 %ncols)
declare void @llvm.nvvm.tcgen05.alloc.shared.cg1(ptr addrspace(3) %dst, i32 %ncols)
declare void @llvm.nvvm.tcgen05.alloc.shared.cg2(ptr addrspace(3) %dst, i32 %ncols)
Overview:

The ‘@llvm.nvvm.tcgen05.alloc.*’ intrinsics correspond to the tcgen05.alloc.cta_group*.sync.aligned.b32 family of PTX instructions. The tcgen05.alloc is a potentially blocking instruction which dynamically allocates the specified number of columns in the Tensor Memory and writes the address of the allocated Tensor Memory into shared memory at the location specified by %dst. The 32-bit operand %ncols specifies the number of columns to be allocated and it must be a power-of-two. The .shared variant explicitly uses the shared memory address space for the %dst operand. The .cg1 and .cg2 variants generate cta_group::1 and cta_group::2 variants of the instruction respectively.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions.

llvm.nvvm.tcgen05.dealloc

Syntax:
declare void @llvm.nvvm.tcgen05.dealloc.cg1(ptr addrspace(6) %tmem_addr, i32 %ncols)
declare void @llvm.nvvm.tcgen05.dealloc.cg2(ptr addrspace(6) %tmem_addr, i32 %ncols)
Overview:

The ‘@llvm.nvvm.tcgen05.dealloc.*’ intrinsics correspond to the tcgen05.dealloc.* set of PTX instructions. The tcgen05.dealloc instruction deallocates the Tensor Memory specified by the Tensor Memory address %tmem_addr. The operand %tmem_addr must point to a previous Tensor Memory allocation. The 32-bit operand %ncols specifies the number of columns to be de-allocated. The .cg1 and .cg2 variants generate cta_group::1 and cta_group::2 variants of the instruction respectively.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions.
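A hedged end-to-end sketch follows. The shared-memory slot @tmem_slot, the column count of 32, and the use of inttoptr to materialize the 32-bit tmem address are illustrative assumptions, and any synchronization required by the PTX memory model between the allocation and the load of the returned address is omitted:

@tmem_slot = internal addrspace(3) global i32 0, align 4

define ptx_kernel void @tmem_alloc_example() {
  ; Allocate 32 tmem columns; the tmem address is written to @tmem_slot.
  call void @llvm.nvvm.tcgen05.alloc.shared.cg1(ptr addrspace(3) @tmem_slot, i32 32)
  %addr = load i32, ptr addrspace(3) @tmem_slot
  %tmem = inttoptr i32 %addr to ptr addrspace(6)
  ; ... tcgen05 operations using %tmem ...
  call void @llvm.nvvm.tcgen05.dealloc.cg1(ptr addrspace(6) %tmem, i32 32)
  ret void
}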

llvm.nvvm.tcgen05.relinq.alloc.permit

Syntax:
declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg1()
declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg2()
Overview:

The ‘@llvm.nvvm.tcgen05.relinq.alloc.permit.*’ intrinsics correspond to the tcgen05.relinquish_alloc_permit.* set of PTX instructions. This instruction specifies that the CTA of the executing thread is relinquishing the right to allocate Tensor Memory. So, it is illegal for a CTA to perform tcgen05.alloc after any of its constituent threads execute tcgen05.relinquish_alloc_permit. The .cg1 and .cg2 variants generate cta_group::1 and cta_group::2 flavors of the instruction respectively.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions.

llvm.nvvm.tcgen05.commit

Syntax:
declare void @llvm.nvvm.tcgen05.commit.{cg1,cg2}(ptr %mbar)
declare void @llvm.nvvm.tcgen05.commit.shared.{cg1,cg2}(ptr addrspace(3) %mbar)
declare void @llvm.nvvm.tcgen05.commit.mc.{cg1,cg2}(ptr %mbar, i16 %mc)
declare void @llvm.nvvm.tcgen05.commit.mc.shared.{cg1,cg2}(ptr addrspace(3) %mbar, i16 %mc)
Overview:

The ‘@llvm.nvvm.tcgen05.commit.*’ intrinsics correspond to the tcgen05.commit.{cg1/cg2}.mbarrier::arrive::one.* set of PTX instructions. The tcgen05.commit is an asynchronous instruction which makes the mbarrier object (%mbar) track the completion of all prior asynchronous tcgen05 operations. The .mc variants allow signaling on the mbarrier objects of multiple CTAs (specified by %mc) in the cluster. The .cg1 and .cg2 variants generate cta_group::1 and cta_group::2 flavors of the instruction respectively.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen-async-sync-operations-commit.
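A small sketch of a non-multicast commit (the mbarrier object is assumed to have been created and initialized in shared memory elsewhere; only the commit call itself is shown):

define void @tcgen05_commit_example(ptr addrspace(3) %mbar) {
  ; Make the mbarrier track completion of all prior asynchronous tcgen05 ops.
  call void @llvm.nvvm.tcgen05.commit.shared.cg1(ptr addrspace(3) %mbar)
  ret void
}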

llvm.nvvm.tcgen05.wait

Syntax:
declare void @llvm.nvvm.tcgen05.wait.ld()
declare void @llvm.nvvm.tcgen05.wait.st()
Overview:

The ‘@llvm.nvvm.tcgen05.wait.ld/st’ intrinsics correspond to the tcgen05.wait::{ld/st}.sync.aligned pair of PTX instructions. The tcgen05.wait::ld causes the executing thread to block until all prior tcgen05.ld operations issued by the executing thread have completed. The tcgen05.wait::st causes the executing thread to block until all prior tcgen05.st operations issued by the executing thread have completed.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instructions-tcgen05-wait.

llvm.nvvm.tcgen05.fence

Syntax:
declare void @llvm.nvvm.tcgen05.fence.before.thread.sync()
declare void @llvm.nvvm.tcgen05.fence.after.thread.sync()
Overview:

The ‘@llvm.nvvm.tcgen05.fence.*’ intrinsics correspond to the tcgen05.fence::{before/after}_thread_sync pair of PTX instructions. These instructions act as code motion fences for asynchronous tcgen05 operations.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensorcore-5th-generation-instructions-tcgen05-fence.

llvm.nvvm.tcgen05.shift

Syntax:
declare void @llvm.nvvm.tcgen05.shift.down.cg1(ptr addrspace(6) %tmem_addr)
declare void @llvm.nvvm.tcgen05.shift.down.cg2(ptr addrspace(6) %tmem_addr)
Overview:

The ‘@llvm.nvvm.tcgen05.shift.{cg1/cg2}’ intrinsics correspond to the tcgen05.shift.{cg1/cg2} PTX instructions. The tcgen05.shift is an asynchronous instruction which initiates the shifting of 32-byte elements downwards across all the rows, except the last, by one row. The address operand %tmem_addr specifies the base address of the matrix in the Tensor Memory whose rows must be down shifted.

For more information, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instructions-tcgen05-shift.

llvm.nvvm.tcgen05.cp

Syntax:
declare void @llvm.nvvm.tcgen05.cp.4x256b.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.128x256b.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.128x128b.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.32x128b_warpx4.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.64x128b_warpx2_02_13.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.64x128b_warpx2_01_23.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.4x256b.b6x16_p32.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.128x256b.b6x16_p32.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.128x128b.b6x16_p32.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.32x128b_warpx4.b6x16_p32.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.64x128b_warpx2_02_13.b6x16_p32.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.64x128b_warpx2_01_23.b6x16_p32.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.4x256b.b4x16_p64.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.128x256b.b4x16_p64.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.128x128b.b4x16_p64.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.32x128b_warpx4.b4x16_p64.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.64x128b_warpx2_02_13.b4x16_p64.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
declare void @llvm.nvvm.tcgen05.cp.64x128b_warpx2_01_23.b4x16_p64.{cg1,cg2}(ptr addrspace(6) %tmem_addr, i64 %sdesc)
Overview:

The ‘@llvm.nvvm.tcgen05.cp.{shape}.{src_fmt}.{cg1/cg2}’ intrinsics correspond to the tcgen05.cp.* family of PTX instructions. The tcgen05.cp instruction initiates an asynchronous copy operation from shared memory to the location specified by %tmem_addr in Tensor Memory. The 64-bit register operand %sdesc is the matrix descriptor representing the source matrix in shared memory that needs to be copied.

The valid shapes for the copy operation are: {128x256b, 4x256b, 128x128b, 64x128b_warpx2_02_13, 64x128b_warpx2_01_23, 32x128b_warpx4}.

Shapes 64x128b and 32x128b require dedicated multicast qualifiers, which are appended to the corresponding intrinsic names.

Optionally, the data can be decompressed from the source format in the shared memory to the destination format in Tensor Memory during the copy operation. Currently, only .b8x16 is supported as the destination format. The valid source formats are .b6x16_p32 and .b4x16_p64.

When the source format is .b6x16_p32, a contiguous set of 16 elements of 6 bits each, followed by four bytes of padding (_p32), in shared memory is decompressed into 16 elements of 8 bits (.b8x16) each in the Tensor Memory.

When the source format is .b4x16_p64, a contiguous set of 16 elements of 4 bits each, followed by eight bytes of padding (_p64), in shared memory is decompressed into 16 elements of 8 bits (.b8x16) each in the Tensor Memory.

For more information on the decompression schemes, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#optional-decompression.

For more information on the tcgen05.cp instruction, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instructions-tcgen05-cp.
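For illustration (the function, the tmem pointer, and the shared-memory descriptor value are assumed inputs of this sketch):

define void @tcgen05_cp_example(ptr addrspace(6) %tmem, i64 %sdesc) {
  ; Copy a 128x256b matrix described by %sdesc from shared memory into
  ; Tensor Memory at %tmem, using the cta_group::1 form.
  call void @llvm.nvvm.tcgen05.cp.128x256b.cg1(ptr addrspace(6) %tmem, i64 %sdesc)
  ret void
}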

llvm.nvvm.tcgen05.ld.*

Syntax:
declare <n x i32> @llvm.nvvm.tcgen05.ld.<shape>.<num>(ptr addrspace(6) %tmem_addr, i1 %pack)
declare <n x i32> @llvm.nvvm.tcgen05.ld.16x32bx2.<num>(ptr addrspace(6) %tmem_addr, i64 %offset, i1 %pack)
Overview:

This group of intrinsics asynchronously loads data from the Tensor Memory at the location specified by the 32-bit address operand tmem_addr into the destination registers, collectively across all threads of the warps.

All the threads in the warp must specify the same value of tmem_addr, which must be the base address of the collective load operation. Otherwise, the behavior is undefined.

The shape qualifier and the num qualifier together determine the total dimension of the data (‘n’) which is loaded from the Tensor Memory. The shape qualifier indicates the base dimension of data. The num qualifier indicates the repeat factor on the base dimension, resulting in the total dimension of the data that is accessed.

Allowed values for the ‘num’ are x1, x2, x4, x8, x16, x32, x64, x128.

Allowed values for the ‘shape’ in the first intrinsic are 16x64b, 16x128b, 16x256b, 32x32b.

Allowed value for the ‘shape’ in the second intrinsic is 16x32bx2.

The result of the intrinsic is a vector consisting of one or more 32-bit registers, derived from shape and num as shown below.

num/shape   16x32bx2/16x64b/32x32b   16x128b   16x256b
x1          1                        2         4
x2          2                        4         8
x4          4                        8         16
x8          8                        16        32
x16         16                       32        64
x32         32                       64        128
x64         64                       128       NA
x128        128                      NA        NA

The last argument i1 %pack is a compile-time constant which, when set, indicates that adjacent columns are packed into a single 32-bit element during the load.

For more information, refer to the PTX ISA.
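A hedged example of one specific instantiation (the shape/num choice and the tmem pointer are assumptions; per the table above, 16x128b with x1 yields two 32-bit registers):

define <2 x i32> @tcgen05_ld_example(ptr addrspace(6) %tmem) {
  ; Load a 16x128b x1 chunk from Tensor Memory without packing (%pack = 0).
  %v = call <2 x i32> @llvm.nvvm.tcgen05.ld.16x128b.x1(ptr addrspace(6) %tmem, i1 0)
  ; Block until all prior tcgen05.ld operations of this thread have completed.
  call void @llvm.nvvm.tcgen05.wait.ld()
  ret <2 x i32> %v
}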

llvm.nvvm.tcgen05.st.*

Syntax:
declare void @llvm.nvvm.tcgen05.st.<shape>.<num>(ptr addrspace(6) %tmem_addr, <n x i32> %args, i1 %unpack)
declare void @llvm.nvvm.tcgen05.st.16x32bx2.<num>(ptr addrspace(6) %tmem_addr, <n x i32> %args, i64 %offset, i1 %unpack)
Overview:

This group of intrinsics asynchronously stores data from the source vector into the Tensor Memory at the location specified by the 32-bit address operand tmem_addr, collectively across all threads of the warps.

All the threads in the warp must specify the same value of tmem_addr, which must be the base address of the collective store operation. Otherwise, the behavior is undefined.

The shape qualifier and the num qualifier together determine the total dimension of the data (‘n’) which is stored to the Tensor Memory. The shape qualifier indicates the base dimension of data. The num qualifier indicates the repeat factor on the base dimension, resulting in the total dimension of the data that is accessed.

Allowed values for the ‘num’ are x1, x2, x4, x8, x16, x32, x64, x128.

Allowed values for the ‘shape’ in the first intrinsic are 16x64b, 16x128b, 16x256b, 32x32b.

Allowed value for the ‘shape’ in the second intrinsic is 16x32bx2.

The args argument is a vector consisting of one or more 32-bit registers, derived from shape and num as listed in the table in the tcgen05.ld section.

Each shape supports an unpack mode, which allows a 32-bit element in the register to be unpacked into two 16-bit elements and stored in adjacent columns. unpack mode can be enabled by setting the %unpack operand to 1 and can be disabled by setting it to 0.

The last argument i1 %unpack is a compile-time constant which, when set, indicates that each 32-bit element in the register is unpacked into two 16-bit elements and stored in adjacent columns.

For more information, refer to the PTX ISA.
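As a sketch (names and the shape/num choice are assumptions; per the tcgen05.ld table, 32x32b with x2 corresponds to two 32-bit registers):

define void @tcgen05_st_example(ptr addrspace(6) %tmem, <2 x i32> %vals) {
  ; Store a 32x32b x2 chunk into Tensor Memory without unpacking (%unpack = 0).
  call void @llvm.nvvm.tcgen05.st.32x32b.x2(ptr addrspace(6) %tmem, <2 x i32> %vals, i1 0)
  ; Block until all prior tcgen05.st operations of this thread have completed.
  call void @llvm.nvvm.tcgen05.wait.st()
  ret void
}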

tcgen05.mma Intrinsics

Overview

The tcgen05.mma operation of shape M x N x K performs matrix multiplication and accumulation of the form D = A * B + D where:

  • the A matrix has shape M x K, in either Tensor Memory or Shared Memory

  • the B matrix has shape K x N, in Shared Memory of the current CTA and, optionally, in the peer CTA

  • the D matrix is of the shape M x N, in Tensor Memory

Optionally, an input predicate (%enable_inp_d) can be used to disable the input from the accumulator matrix, in which case the operation performed is D = A * B.

The matrix multiplication and accumulation operations are categorized into various kinds based on input types and the throughput of the multiplication operation. The following table shows the different kinds of MMA operations that are supported:

.kind       Supported Input Types
f16         F16 and BF16
tf32        TF32
f8f6f4      All combinations of F8, F6, and F4
i8          Signed and Unsigned 8-bit Integers
mxf8f6f4    MX-floating point formats
mxf4        MX-floating point formats (FP4)
mxf4nvf4    MXF4 + custom NVIDIA 4-bit floating point (with common scaling factor)

tcgen05.mma.sp supports a sparse variant of A with shape M x K stored in packed form as M x (K / 2) in memory. The %spmetadata specifies the mapping of the K / 2 non-zero elements to the K elements before performing the MMA operation.

tcgen05.mma.block_scale performs matrix multiplication with block scaling, D = (A * scale_A) * (B * scale_B) + D, where the input matrices read from memory are scaled to form matrix A and matrix B before performing the MMA operation. Scale factors for the A and B matrices need to be duplicated to all 32 lane partitions of Tensor Memory. The shapes of the %scale_a and %scale_b matrices depend on the .scale_vectorsize described here.

The sparsity metadata (%spmetadata) as well as the block-scale inputs for the A / B matrices (%scale_a and %scale_b) reside in Tensor Memory.

To facilitate opportunistic re-use of A / B matrix data across a sequence of MMA operations, the A / B matrices are loaded into a collector buffer (%collector_usage_a_op_flag, %collector_usage_b_buffer_flag, and %collector_usage_b_op_flag). The value of the collector_usage flag in the intrinsic specifies the nature of the re-use.

There are three kinds of matrix descriptors used by the tcgen05 family of instructions:

Descriptor                    Description                                                Size (bits)
Shared Memory Descriptor      Describes properties of the multiplicand matrix in         64
                              shared memory, including its location within the
                              CTA's shared memory. PTX ISA
Instruction Descriptor        Describes shapes, types, and details of all matrices       32
                              and the MMA operation. PTX ISA
Zero-Column Mask Descriptor   Generates a mask specifying which columns of the B         64
                              matrix are zeroed in the MMA operation, regardless of
                              values in shared memory. Total mask size = N bits.
                              PTX ISA

tcgen05.mma can be used for general matrix multiplication or for convolution operations. In case of convolutions, the activations can be stored in either matrix A or matrix B, while the weights will be stored in the other matrix.

tcgen05.mma has an optional collector qualifier to specify when an A or B matrix is new to the sequence and should be loaded, is unchanged within the sequence and should be reused, or is the last use in the sequence and should be discarded. The collector qualifier is used to give the TensorCore permission to reuse a previously loaded A or B matrix; however, reuse is opportunistic in that the TensorCore may reload a matrix even when it has permission to reuse that matrix. Thus, the source memory of an A or B matrix must not be modified while the MMA instruction using those matrices has not completed, regardless of collector qualifier permissions.

The cta_group::1 qualifier specifies that the operation is performed on the Tensor Memory of the executing thread’s CTA only. The cta_group::2 qualifier specifies that the MMA operation is performed on the Tensor Memory of the executing thread’s CTA and its peer CTA.

The vector operand %disable_output_lane specifies the lane(s) in the Tensor Memory that should not be updated with the resultant matrix D. The elements of the vector operand disable-output-lane form a mask where each bit corresponds to a lane of the Tensor Memory, with the least significant bit of the first element of the vector (leftmost in syntax) corresponding to lane 0 of the Tensor Memory. If a bit in the mask is 1, then the corresponding lane in the Tensor Memory for the resultant matrix D will not be updated.

Intrinsic Design:

Given the broad feature set of the tcgen05.mma instruction, modeling it through intrinsics is highly complex; the following table outlines the large number of intrinsics required to fully support the tcgen05.mma instruction set.

variant                              Configuration                                                                                     Total Variants
tcgen05.mma.shared                   2 (space) x 2 (sp) x 4 (kind) x 2 (cta_group) x 4 (collector_usage)                               128
tcgen05.mma.tensor.ashift            2 (sp) x 4 (kind) x 2 (cta_group) x 2 (collector_usage)                                           32
tcgen05.mma.scale_d                  2 (space) x 2 (sp) x 2 (kind) x 2 (cta_group) x 4 (collector_usage)                               128
tcgen05.mma.scale_d.tensor.ashift    2 (sp) x 2 (kind) x 2 (cta_group) x 2 (collector_usage)                                           16
tcgen05.mma.disable_output_lane      2 (space) x 2 (sp) x 4 (kind) x 2 (cta_group) x 4 (collector_usage)                               128
tcgen05.mma.disable_output_lane…     2 (sp) x 4 (kind) x 2 (cta_group) x 2 (collector_usage)                                           32
tcgen05.mma.block_scale              2 (space) x 1 (mxf4nvf4) x 2 (cta_group) x 2 (scale_vec_size) x 4 (collector_usage)               32
tcgen05.mma.block_scale              2 (space) x 1 (mxf4) x 2 (cta_group) x 2 (scale_vec_size) x 4 (collector_usage)                   32
tcgen05.mma.block_scale              2 (space) x 1 (mxf8f6f4) x 2 (cta_group) x 2 (scale_vec_size) x 4 (collector_usage)               32
tcgen05.mma.ws                       2 (space) x 2 (sp) x 4 (kind) x 2 (zero_col_mask) x 4 (collector_usage_op) x 4 (collector_buffer)  256
Total                                                                                                                                  816

To reduce the number of possible intrinsic variations, we’ve modeled the tcgen05.mma instructions using flag operands. We’ve added range checks to these flags to prevent invalid values. We also expanded some flags back into intrinsic modifiers to avoid supporting invalid combinations of features.

llvm.nvvm.tcgen05.mma.*

Syntax:
declare void @llvm.nvvm.tcgen05.mma.shared(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i32 %kind_flag, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i32 %kind_flag, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)

; .sp variants
declare void @llvm.nvvm.tcgen05.mma.sp.shared(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, i32 %kind_flag, i32 %cta_group_flag, i32 %collector_usage_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, i32 %kind_flag, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)

; .scale_d variants
declare void @llvm.nvvm.tcgen05.mma.shared.scale_d(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %scale_d_imm, i32 %cta_group_flag, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.scale_d<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %scale_d_imm, i32 %cta_group_flag, i32 %kind_flag, i32 %collector_usage_a_op_flag)

; sp.scale_d variants
declare void @llvm.nvvm.tcgen05.mma.sp.shared.scale_d(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, i64 %scale_d_imm, i32 %cta_group_flag, i32 %collector_usage_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.scale_d<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, i64 %scale_d_imm, i32 %cta_group, i32 %collector_usage_a_op_flag)
Overview:

nvvm.tcgen05.mma is an asynchronous intrinsic which initiates an M x N x K matrix multiply and accumulate operation, D = A * B + D, where the A matrix is M x K, the B matrix is K x N, and the D matrix is M x N. The operation of the form D = A * B is issued when the input predicate argument %enable_inp_d is false. The optional immediate argument %scale_d_imm can be specified to scale the input matrix D as follows: D = A * B + D * (2 ^ - %scale_d_imm). The valid range of values for argument %scale_d_imm is [0, 15]. The 32-bit register operand idesc is the instruction descriptor as described in Instruction descriptor.

nvvm.tcgen05.mma has single-thread semantics, unlike the collective instructions nvvm.mma.sync or the PTX wgmma.mma_async instruction. So, a single thread issuing nvvm.tcgen05.mma will result in the initiation of the whole matrix multiply and accumulate operation.

When .sp is specified, the dimension of the A matrix is M x (K/2) and an additional %spmetadata argument is required.

.ashift shifts the rows of the A matrix down by one row, except for the last row in the Tensor Memory. .ashift is only allowed with M = 128 or M = 256.

The %collector_usage_a_op_flag flag specifies the usage of the collector buffer for matrix A. It is illegal to specify either of USE or FILL for %collector_usage_a_op_flag along with .ashift.

For more information, refer to the PTX ISA.

The following tables describe the possible values of the flag arguments:

%kind_flag flag:

kind_flag   value
F16         0
TF32        1
F8F6F4      2
I8          3

%cta_group_flag flag:

cta_group_flag   value
CG1              1
CG2              2

%collector_usage_a_op_flag flag:

collector_usage_a_op_flag   value
DISCARD                     0
LASTUSE                     1
USE                         2
FILL                        3
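A hedged sketch of a single non-sparse call follows (the descriptor values, the flag choices, and the surrounding function are illustrative assumptions; per the tables above, kind F16 = 0, CG1 = 1, and DISCARD = 0):

define void @tcgen05_mma_example(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc) {
  ; D = A * B + D with the f16 kind, cta_group::1, and the A collector
  ; buffer discarded after use.
  call void @llvm.nvvm.tcgen05.mma.shared(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc,
                                          i32 %idesc, i1 1, i32 0, i32 1, i32 0)
  ret void
}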

llvm.nvvm.tcgen05.mma.block_scale*

Syntax:
; mxf8f6f4
declare void @llvm.nvvm.tcgen05.mma.shared.mxf8f6f4.block_scale(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.mxf8f6f4.block_scale(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.shared.mxf8f6f4.block_scale.block32(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.mxf8f6f4.block_scale.block32(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.shared.mxf8f6f4.block_scale(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.mxf8f6f4.block_scale(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.shared.mxf8f6f4.block_scale.block32(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.mxf8f6f4.block_scale.block32(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)

; mxf4
declare void @llvm.nvvm.tcgen05.mma.shared.mxf4.block_scale(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.mxf4.block_scale(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.shared.mxf4.block_scale.block32(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.mxf4.block_scale.block32(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.shared.mxf4.block_scale(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.mxf4.block_scale(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.shared.mxf4.block_scale.block32(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.mxf4.block_scale.block32(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)

; mxf4nvf4
declare void @llvm.nvvm.tcgen05.mma.shared.mxf4nvf4.block_scale.block16(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.mxf4nvf4.block_scale.block16(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.shared.mxf4nvf4.block_scale.block32(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.mxf4nvf4.block_scale.block32(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.shared.mxf4nvf4.block_scale.block16(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.mxf4nvf4.block_scale.block16(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.shared.mxf4nvf4.block_scale.block32(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.mxf4nvf4.block_scale.block32(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, ptr addrspace(6) %scale_a, ptr addrspace(6) %scale_b, i32 %cta_group_flag, i32 %collector_usage_a_op_flag)
Overview:

nvvm.tcgen05.mma.block_scale is an asynchronous intrinsic which initiates an M x N x K matrix multiply and accumulate operation, D = (A * scale_a) * (B * scale_b) + D, where the A matrix is M x K, the B matrix is K x N, and the D matrix is M x N. The matrices A and B are scaled with the %scale_a and %scale_b matrices respectively before performing the matrix multiply and accumulate operation. The operation of the form D = A * B is issued when the input predicate argument %enable_inp_d is false. The 32-bit register operand idesc is the instruction descriptor as described in Instruction descriptor.

nvvm.tcgen05.mma.block_scale has single-thread semantics, unlike the collective instructions nvvm.mma.sync or the PTX wgmma.mma_async instruction. So, a single thread issuing nvvm.tcgen05.mma.block_scale will result in the initiation of the whole matrix multiply and accumulate operation.

When .sp is specified, the dimension of the A matrix is M x (K / 2) and an additional %spmetadata argument is required.

The %collector_usage_a_op_flag flag specifies the usage of the collector buffer for matrix A.

For more information, refer to the PTX ISA.

The following tables describe the possible values of the flag arguments:

%cta_group:

cta_group   value
CG1         1
CG2         2

%collector_usage_a_op_flag:

collector_usage_a_op_flag   value
DISCARD                     0
LASTUSE                     1
USE                         2
FILL                        3

llvm.nvvm.tcgen05.mma.disable_output_lane*

Syntax:
declare void @llvm.nvvm.tcgen05.mma.shared.disable_output_lane.cg1(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, <4 x i32> %disable_output_lane_v4, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.shared.disable_output_lane.cg2(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, <8 x i32> %disable_output_lane_v8, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.disable_output_lane.cg1<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, <4 x i32> %disable_output_lane_v4, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.disable_output_lane.cg2<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, <8 x i32> %disable_output_lane_v8, i32 %kind_flag, i32 %collector_usage_a_op_flag)

; .sp variants
declare void @llvm.nvvm.tcgen05.mma.sp.shared.disable_output_lane.cg1(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, <4 x i32> %disable_output_lane_v4, i32 %kind_flag, i32 %collector_usage_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.shared.disable_output_lane.cg2(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, <8 x i32> %disable_output_lane_v8, i32 %kind_flag, i32 %collector_usage_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.disable_output_lane.cg1<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, <4 x i32> %disable_output_lane_v4, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.disable_output_lane.cg2<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, <8 x i32> %disable_output_lane_v8, i32 %kind_flag, i32 %collector_usage_a_op_flag)

; .scale_d variants
declare void @llvm.nvvm.tcgen05.mma.shared.scale_d.disable_output_lane.cg1(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %scale_d_imm, <4 x i32> %disable_output_lane_v4, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.shared.scale_d.disable_output_lane.cg2(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %scale_d_imm, <8 x i32> %disable_output_lane_v8, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.scale_d.disable_output_lane.cg1<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %scale_d_imm, <4 x i32> %disable_output_lane_v4, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.tensor.scale_d.disable_output_lane.cg2<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %scale_d_imm, <8 x i32> %disable_output_lane_v8, i32 %kind_flag, i32 %collector_usage_a_op_flag)

; .sp.scale_d variants
declare void @llvm.nvvm.tcgen05.mma.sp.shared.scale_d.disable_output_lane.cg1(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, i64 %scale_d_imm, <4 x i32> %disable_output_lane_v4, i32 %kind_flag, i32 %collector_usage_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.shared.scale_d.disable_output_lane.cg2(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, i64 %scale_d_imm, <8 x i32> %disable_output_lane_v8, i32 %kind_flag, i32 %collector_usage_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.scale_d.disable_output_lane.cg1<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, i64 %scale_d_imm, <4 x i32> %disable_output_lane_v4, i32 %kind_flag, i32 %collector_usage_a_op_flag)
declare void @llvm.nvvm.tcgen05.mma.sp.tensor.scale_d.disable_output_lane.cg2<.ashift>(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, ptr addrspace(6) %spmetadata, i1 %enable_inp_d, i64 %scale_d_imm, <8 x i32> %disable_output_lane_v8, i32 %kind_flag, i32 %collector_usage_a_op_flag)
Overview:

nvvm.tcgen05.mma.disable_output_lane is an asynchronous intrinsic which initiates an M x N x K matrix multiply and accumulate operation, D = A * B + D, where the A matrix is M x K, the B matrix is K x N, and the D matrix is M x N. The operation of the form D = A * B is issued when the input predicate argument %enable_inp_d is false. The optional immediate argument %scale_d_imm can be specified to scale the input matrix D as follows: D = A * B + D * (2 ^ - %scale_d_imm). The valid range of values for argument %scale_d_imm is [0, 15]. The 32-bit register operand idesc is the instruction descriptor as described in Instruction descriptor.

The vector operand %disable_output_lane specifies the lane(s) in the Tensor Memory that should not be updated with the resultant matrix D. The elements of the vector operand %disable_output_lane form a mask where each bit corresponds to a lane of the Tensor Memory, with the least significant bit of the first element of the vector corresponding to lane 0 of the Tensor Memory. If a bit in the mask is 1, then the corresponding lane in the Tensor Memory for the resultant matrix D will not be updated.

nvvm.tcgen05.mma.disable_output_lane has single-thread semantics, unlike the collective instructions nvvm.mma.sync or the PTX wgmma.mma_async instruction. So, a single thread issuing nvvm.tcgen05.mma.disable_output_lane will result in the initiation of the whole matrix multiply and accumulate operation.

When .sp is specified, the dimension of the A matrix is M x (K / 2) and an additional %spmetadata argument is required.

.ashift shifts the rows of the A matrix down by one row, except for the last row in the Tensor Memory. .ashift is only allowed with M = 128 or M = 256.

The %collector_usage_a_op_flag flag specifies the usage of the collector buffer for matrix A. It is illegal to specify either of USE or FILL for %collector_usage_a_op_flag along with .ashift.

For more information, refer to the PTX ISA.

The following tables describe the possible values of the flag arguments:

%kind_flag:

kind_flag   value
F16         0
TF32        1
F8F6F4      2
I8          3

%cta_group_flag:

cta_group_flag   value
CG1              1
CG2              2

%collector_usage_a_op_flag:

collector_usage_a_op_flag   value
DISCARD                     0
LASTUSE                     1
USE                         2
FILL                        3

llvm.nvvm.tcgen05.mma.ws*

Syntax:
; tcgen05.mma.ws
declare void @llvm.nvvm.tcgen05.mma.ws.shared(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)
declare void @llvm.nvvm.tcgen05.mma.ws.tensor(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)
declare void @llvm.nvvm.tcgen05.mma.ws.shared.zero_col_mask(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %zero_col_mask, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)
declare void @llvm.nvvm.tcgen05.mma.ws.shared.zero_col_mask(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %zero_col_mask, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)
declare void @llvm.nvvm.tcgen05.mma.ws.tensor.zero_col_mask(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, i64 %zero_col_mask, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)

; .sp variants
declare void @llvm.nvvm.tcgen05.mma.ws.sp.shared(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)
declare void @llvm.nvvm.tcgen05.mma.ws.sp.tensor(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)
declare void @llvm.nvvm.tcgen05.mma.ws.sp.shared.zero_col_mask(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, i64 %zero_col_mask, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)
declare void @llvm.nvvm.tcgen05.mma.ws.sp.tensor.zero_col_mask(ptr addrspace(6) %d, ptr addrspace(6) %atensor, i64 %bdesc, i32 %idesc, i1 %enable_inp_d, ptr addrspace(6) %spmetadata, i64 %zero_col_mask, i32 %kind_flag, i32 %collector_usage_b_buffer_flag, i32 %collector_usage_b_op_flag)
Overview:

nvvm.tcgen05.mma.ws is an asynchronous intrinsic which initiates an M x N x K weight stationary convolution matrix multiply and accumulate operation, D = A * B + D, where the A matrix is M x K, the B matrix is K x N, and the D matrix is M x N. The operation of the form D = A * B is issued when the input predicate argument %enable_inp_d is false. The optional immediate argument %scale_d_imm can be specified to scale the input matrix D as follows: D = A*B + D * (2 ^ - %scale_d_imm). The valid range of values for the argument %scale_d_imm is [0, 15]. The 32-bit register operand idesc is the instruction descriptor as described in Instruction descriptor.

nvvm.tcgen05.mma.ws has single thread semantics, unlike the collective instructions nvvm.mma.sync or the PTX wgmma.mma_async instruction. So, a single thread issuing nvvm.tcgen05.mma.ws will result in the initiation of the whole matrix multiply and accumulate operation.

When .sp is specified, the dimension of the A matrix is M x (K / 2) and an additional %spmetadata argument is required.

The operand %zero_col_mask is a 64-bit register which specifies the Zero-Column Mask Descriptor. The zero-column mask descriptor is used to generate a mask that specifies which columns of the B matrix will have zero value for the matrix multiply and accumulate operation regardless of the values present in the shared memory.

The %collector_usage_b_buffer_flag and %collector_usage_b_op_flag flags together specify the usage of the collector buffer for matrix B.

For more information, refer to the PTX ISA.

The following tables describe the possible values of the flag arguments:

%kind_flag:

kind_flag    value
F16          0
TF32         1
F8F6F4       2
I8           3

%collector_usage_b_buffer_flag:

collector_usage_b_buffer_flag    value
B0                               0
B1                               1
B2                               2
B3                               3

%collector_usage_b_op_flag:

collector_usage_b_op_flag    value
DISCARD                      0
LASTUSE                      1
USE                          2
FILL                         3
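As an illustration (a sketch, not taken from the PTX ISA), a call to the shared-memory variant with F16 inputs, accumulation into D enabled, and collector buffer B0 used with DISCARD semantics could look like this; the operands %d, %adesc, %bdesc, and %idesc are assumed to have been set up earlier:

; A minimal sketch: kind_flag = 0 (F16), collector_usage_b_buffer_flag = 0 (B0),
; collector_usage_b_op_flag = 0 (DISCARD), %enable_inp_d = true (D = A*B + D).
call void @llvm.nvvm.tcgen05.mma.ws.shared(ptr addrspace(6) %d, i64 %adesc, i64 %bdesc, i32 %idesc, i1 true, i32 0, i32 0, i32 0)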

Store Intrinsics

llvm.nvvm.st.bulk.*

Syntax:
declare void @llvm.nvvm.st.bulk(ptr addrspace(1) %dst, i64 %size, i64 immarg %initval)
declare void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3) %dst, i64 %size, i64 immarg %initval)
Overview:

The ‘@llvm.nvvm.st.bulk.*’ intrinsics initialize a region of shared memory starting from the location specified by the destination address operand %dst.

The integer operand %size specifies the amount of memory to be initialized in terms of number of bytes and must be a multiple of 8. Otherwise, the behavior is undefined.

The integer immediate operand %initval specifies the initialization value for the memory locations. The only numeric value allowed is 0.

The @llvm.nvvm.st.bulk.shared.cta and @llvm.nvvm.st.bulk intrinsics are similar, but the latter uses generic addressing (see Generic Addressing).
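For example, the following sketch (illustrative only; %buf is assumed to be a pointer into CTA shared memory) zero-initializes 256 bytes:

; A minimal sketch: zero-fill 256 bytes of shared::cta memory starting at %buf.
; %size must be a multiple of 8, and the only allowed %initval is 0.
call void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3) %buf, i64 256, i64 0)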

For more information, refer to the PTX ISA.

clusterlaunchcontrol Intrinsics

‘llvm.nvvm.clusterlaunchcontrol.try_cancel*’ Intrinsics

Syntax:
declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.shared(ptr addrspace(3) %addr, ptr addrspace(3) %mbar)
declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast.shared(ptr addrspace(3) %addr, ptr addrspace(3) %mbar)
Overview:

The clusterlaunchcontrol.try_cancel intrinsics request atomically cancelling the launch of a cluster that has not started running yet. They asynchronously and non-atomically write a 16-byte opaque response to shared memory, pointed to by the 16-byte-aligned addr, indicating whether the operation succeeded or failed. addr and the 8-byte-aligned mbar must refer to shared::cta memory; otherwise the behavior is undefined. The completion of the asynchronous operation is tracked using the mbarrier completion mechanism at .cluster scope referenced by the shared memory pointer mbar. On success, the opaque response contains the CTA id of the first CTA of the canceled cluster; no other successful response from other clusterlaunchcontrol.try_cancel operations from the same grid will contain that id.

The multicast variant specifies that the response is asynchronously and non-atomically written to the corresponding shared memory location of each CTA in the requesting cluster. The completion of the write of each local response is tracked by independent mbarriers at the corresponding shared memory location of each CTA in the cluster.
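A request might be issued roughly as follows (a sketch only; %response and %mbar are assumed to be suitably aligned shared::cta allocations):

; A minimal sketch: ask to cancel a cluster that has not started running yet.
; %response must be 16-byte aligned and %mbar 8-byte aligned, both in shared::cta.
call void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.shared(ptr addrspace(3) %response, ptr addrspace(3) %mbar)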

For more information, refer to the PTX ISA.

‘llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled’ Intrinsic

Syntax:
declare i1 @llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled(i128 %try_cancel_response)
Overview:

The llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled intrinsic decodes the opaque response written by the llvm.nvvm.clusterlaunchcontrol.try_cancel operation.

The intrinsic returns 0 (false) if the request failed. If the request succeeded, it returns 1 (true). A true result indicates that:

  • the thread block cluster whose first CTA id matches that of the response handle will not run, and

  • no other successful response of another try_cancel request in the grid will contain the first CTA id of that cluster

For more information, refer to the PTX ISA.

‘llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.*’ Intrinsics

Syntax:
declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.x(i128 %try_cancel_response)
declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.y(i128 %try_cancel_response)
declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.z(i128 %try_cancel_response)
Overview:

The clusterlaunchcontrol.query_cancel.get_first_ctaid.* intrinsics can be used to decode the successful opaque response written by the llvm.nvvm.clusterlaunchcontrol.try_cancel operation.

If the request succeeded:

  • llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.{x,y,z} returns the coordinate of the first CTA in the canceled cluster, either x, y, or z.

If the request failed, the behavior of these intrinsics is undefined.
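Putting the query intrinsics together, a decode sequence might look roughly like the following sketch (illustrative only; %resp is assumed to hold the 16-byte response loaded from shared memory after the mbarrier completes):

; A minimal sketch: decode a try_cancel response held in %resp (i128).
%ok = call i1 @llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled(i128 %resp)
br i1 %ok, label %canceled, label %done

canceled:
  ; Only meaningful when the request succeeded.
  %first.x = call i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.x(i128 %resp)
  br label %done

done:
  ret void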

For more information, refer to the PTX ISA.

Perf Monitor Event Intrinsics

‘llvm.nvvm.pm.event.mask’ Intrinsic

Syntax:
declare void @llvm.nvvm.pm.event.mask(i16 immarg %mask_val)
Overview:

The ‘llvm.nvvm.pm.event.mask’ intrinsic triggers one or more performance monitor events. Each bit in the 16-bit immediate operand %mask_val controls an event.
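For instance, the following sketch (illustrative only) triggers performance monitor events 0 and 3 by setting bits 0 and 3 of the mask:

; A minimal sketch: mask 0x9 (binary 1001) has bits 0 and 3 set.
call void @llvm.nvvm.pm.event.mask(i16 9)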

For more information on the pmevent instructions, refer to the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-pmevent.

Other Intrinsics

For the full set of NVPTX intrinsics, please see the include/llvm/IR/IntrinsicsNVVM.td file in the LLVM source tree.

Linking with Libdevice

The CUDA Toolkit comes with an LLVM bitcode library called libdevice that implements many common mathematical functions. This library can be used as a high-performance math library for any compiler using the LLVM NVPTX target. The library can be found under nvvm/libdevice/ in the CUDA Toolkit, and there is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, see the libdevice Users Guide.

To accommodate various math-related compiler flags that can affect code generation of libdevice code, the library code depends on a special LLVM IR pass (NVVMReflect) to handle conditional compilation within LLVM IR. This pass looks for calls to the @__nvvm_reflect function and replaces them with constants based on the defined reflection parameters. Such conditional code often follows a pattern:

float my_function(float a) {
  if (__nvvm_reflect("FASTMATH"))
    return my_function_fast(a);
  else
    return my_function_precise(a);
}
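At the LLVM IR level, the same pattern looks roughly like the following sketch (illustrative only; the exact declaration of @__nvvm_reflect and the helper functions shown here are assumptions rather than an excerpt from libdevice):

; A minimal sketch of the pattern NVVMReflect looks for. Once the pass replaces
; the call with a constant, the untaken branch becomes trivially dead.
@.str.fastmath = private unnamed_addr constant [9 x i8] c"FASTMATH\00"

declare i32 @__nvvm_reflect(ptr)
declare float @my_function_fast(float)
declare float @my_function_precise(float)

define float @my_function(float %a) {
entry:
  %r = call i32 @__nvvm_reflect(ptr @.str.fastmath)
  %fastmath = icmp ne i32 %r, 0
  br i1 %fastmath, label %fast, label %precise

fast:
  %vf = call float @my_function_fast(float %a)
  ret float %vf

precise:
  %vp = call float @my_function_precise(float %a)
  ret float %vp
}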

The default value for all unspecified reflection parameters is zero.

The NVVMReflect pass should be executed early in the optimization pipeline, immediately after the link stage. The internalize pass is also recommended to remove unused math functions from the resulting PTX. For an input IR module module.bc, the following compilation flow is recommended:

The NVVMReflect pass will attempt to remove dead code even without optimizations. This allows potentially incompatible instructions to be avoided at all optimization levels by using the __CUDA_ARCH argument.

  1. Save list of external functions in module.bc

  2. Link module.bc with libdevice.compute_XX.YY.bc

  3. Internalize all functions not in list from (1)

  4. Eliminate all unused internal functions

  5. Run NVVMReflect pass

  6. Run standard optimization pipeline

Note

linkonce and linkonce_odr linkage types are not suitable for the libdevice functions. It is possible to link two IR modules that have been linked against libdevice using different reflection variables.

Since the NVVMReflect pass replaces conditionals with constants, it will often leave behind dead code of the form:

entry:
  ..
  br i1 true, label %foo, label %bar

foo:
  ..

bar:
  ; Dead code
  ..

Therefore, it is recommended that NVVMReflect is executed early in the optimization pipeline before dead-code elimination.

The NVPTX TargetMachine knows how to schedule NVVMReflect at the beginning of your pass manager; just use the following code when setting up your pass manager, and the PassBuilder will invoke NVPTXTargetMachine::registerPassBuilderCallbacks, which adds the pass to the pass manager:

std::unique_ptr<TargetMachine> TM = ...;
PassBuilder PB(TM);
ModulePassManager MPM;
PB.parsePassPipeline(MPM, ...);

Reflection Parameters

The libdevice library currently uses the following reflection parameters to control code generation:

Flag                Description
__CUDA_FTZ=[0,1]    Use optimized code paths that flush subnormals to zero

The value of this flag is determined by the “nvvm-reflect-ftz” module flag. The following sets the ftz flag to 1.

!llvm.module.flags = !{!0}
!0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}

(i32 4 indicates that the value set here overrides the value in another module we link with. See the LangRef <LangRef.html#module-flags-metadata> for details.)

Executing PTX

The most common way to execute PTX assembly on a GPU device is to use the CUDA Driver API. This API is a low-level interface to the GPU driver and allows for JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

CUdevice device;
CUcontext context;

// Initialize the driver API
cuInit(0);

// Get a handle to the first compute device
cuDeviceGet(&device, 0);

// Create a compute device context
cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

CUmodule module;
CUfunction function;

// JIT compile a null-terminated PTX string
cuModuleLoadData(&module, (void*)PTXString);

// Get a handle to the "myfunction" kernel function
cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the CUDA Samples distribution.

Common Issues

ptxas complains of undefined function: __nvvm_reflect

When linking with libdevice, the NVVMReflect pass must be used. See Linking with Libdevice for more information.

Tutorial: A Simple Compute Kernel

To start, let us take a look at a simple compute kernel written directly in LLVM IR. The kernel implements vector addition, where each thread computes one element of the output vector C from the input vectors A and B. To make this easier, we also assume that only a single CTA (thread block) will be launched, and that it will be one dimensional.

The Kernel

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

; Intrinsic to read X component of thread ID
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

define ptx_kernel void @kernel(ptr addrspace(1) %A,
                               ptr addrspace(1) %B,
                               ptr addrspace(1) %C) {
entry:
  ; What is my ID?
  %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

  ; Compute pointers into A, B, and C
  %ptrA = getelementptr float, ptr addrspace(1) %A, i32 %id
  %ptrB = getelementptr float, ptr addrspace(1) %B, i32 %id
  %ptrC = getelementptr float, ptr addrspace(1) %C, i32 %id

  ; Read A, B
  %valA = load float, ptr addrspace(1) %ptrA, align 4
  %valB = load float, ptr addrspace(1) %ptrB, align 4

  ; Compute C = A + B
  %valC = fadd float %valA, %valB

  ; Store back to C
  store float %valC, ptr addrspace(1) %ptrC, align 4

  ret void
}

We can use the LLVM llc tool to directly run the NVPTX code generator:

# llc -mcpu=sm_20 kernel.ll -o kernel.ptx

Note

If you want to generate 32-bit code, change p:64:64:64 to p:32:32:32 in the module data layout string and use nvptx-nvidia-cuda as the target triple.

The output we get from llc (as of LLVM 3.4):

//
// Generated by LLVM NVPTX Back-End
//

.version 3.1
.target sm_20
.address_size 64

  // .globl kernel
                                        // @kernel
.visible .entry kernel(
  .param .u64 kernel_param_0,
  .param .u64 kernel_param_1,
  .param .u64 kernel_param_2
)
{
  .reg .f32   %f<4>;
  .reg .s32   %r<2>;
  .reg .s64   %rl<8>;

// %bb.0:                                // %entry
  ld.param.u64    %rl1, [kernel_param_0];
  mov.u32         %r1, %tid.x;
  mul.wide.s32    %rl2, %r1, 4;
  add.s64         %rl3, %rl1, %rl2;
  ld.param.u64    %rl4, [kernel_param_1];
  add.s64         %rl5, %rl4, %rl2;
  ld.param.u64    %rl6, [kernel_param_2];
  add.s64         %rl7, %rl6, %rl2;
  ld.global.f32   %f1, [%rl3];
  ld.global.f32   %f2, [%rl5];
  add.f32         %f3, %f1, %f2;
  st.global.f32   [%rl7], %f3;
  ret;
}

Dissecting the Kernel

Now let us dissect the LLVM IR that makes up this kernel.

Data Layout

The data layout string determines the size in bits of common data types, their ABI alignment, and their storage size. For NVPTX, you should use one of the following:

32-bit PTX:

target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

64-bit PTX:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

Target Intrinsics

In this example, we use the @llvm.nvvm.read.ptx.sreg.tid.x intrinsic to read the X component of the current thread’s ID, which corresponds to a read of register %tid.x in PTX. The NVPTX back-end supports a large set of intrinsics. A short list is shown below; please see include/llvm/IR/IntrinsicsNVVM.td for the full list.

Intrinsic                                       CUDA Equivalent
i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}        threadIdx.{x,y,z}
i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}      blockIdx.{x,y,z}
i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}       blockDim.{x,y,z}
i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}     gridDim.{x,y,z}
void @llvm.nvvm.barrier0()                      __syncthreads()
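For example, combining three of these intrinsics gives a per-thread global index for a one-dimensional launch (a small sketch, not part of the original tutorial):

; A minimal sketch: global 1-D index = blockIdx.x * blockDim.x + threadIdx.x.
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()

define i32 @global_thread_id() {
  %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %base  = mul i32 %ctaid, %ntid
  %gid   = add i32 %base, %tid
  ret i32 %gid
}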

Address Spaces

You may have noticed that all of the pointer types in the LLVM IR example had an explicit address space specifier. What is address space 1? NVIDIA GPU devices (generally) have four types of memory:

  • Global: Large, off-chip memory

  • Shared: Small, on-chip memory shared among all threads in a CTA

  • Local: Per-thread, private memory

  • Constant: Read-only memory shared across all threads

These different types of memory are represented in LLVM IR as address spaces. There is also a fifth address space used by the NVPTX code generator that corresponds to the “generic” address space. This address space can represent addresses in any other address space (with a few exceptions). This allows users to write IR functions that can load/store memory using the same instructions. Intrinsics are provided to convert pointers between the generic and non-generic address spaces.
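In current LLVM IR the conversion between a specific address space and the generic one is typically written with the addrspacecast instruction; a minimal sketch (not from the original tutorial):

; A minimal sketch: cast a global-memory pointer (address space 1) to the
; generic address space (0) and load through the generic pointer.
define float @load_via_generic(ptr addrspace(1) %g) {
  %p = addrspacecast ptr addrspace(1) %g to ptr
  %v = load float, ptr %p, align 4
  ret float %v
}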

See Address Spaces and NVPTX Intrinsics for more information.

Running the Kernel

Generating PTX from LLVM IR is all well and good, but how do we execute it on a real GPU device? The CUDA Driver API provides a convenient mechanism for loading and JIT compiling PTX to a native GPU device, and launching a kernel. The API is similar to OpenCL. A simple example showing how to load and execute our vector addition code is shown below. Note that for brevity this code does not perform much error checking!

Note

You can also use the ptxas tool provided by the CUDA Toolkit to offline compile PTX to machine code (SASS) for a specific GPU architecture. Such binaries can be loaded by the CUDA Driver API in the same way as PTX. This can be useful for reducing startup time by precompiling the PTX kernels.

#include <iostream>
#include <fstream>
#include <cassert>
#include "cuda.h"

void checkCudaErrors(CUresult err) {
  assert(err == CUDA_SUCCESS);
}

/// main - Program entry point
int main(int argc, char **argv) {
  CUdevice    device;
  CUmodule    cudaModule;
  CUcontext   context;
  CUfunction  function;
  CUlinkState linker;
  int         devCount;

  // CUDA initialization
  checkCudaErrors(cuInit(0));
  checkCudaErrors(cuDeviceGetCount(&devCount));
  checkCudaErrors(cuDeviceGet(&device, 0));

  char name[128];
  checkCudaErrors(cuDeviceGetName(name, 128, device));
  std::cout << "Using CUDA Device [0]: " << name << "\n";

  int devMajor, devMinor;
  checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
  std::cout << "Device Compute Capability: "
            << devMajor << "." << devMinor << "\n";
  if (devMajor < 2) {
    std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
    return 1;
  }

  std::ifstream t("kernel.ptx");
  if (!t.is_open()) {
    std::cerr << "kernel.ptx not found\n";
    return 1;
  }
  std::string str((std::istreambuf_iterator<char>(t)),
                  std::istreambuf_iterator<char>());

  // Create driver context
  checkCudaErrors(cuCtxCreate(&context, 0, device));

  // Create module for object
  checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

  // Get kernel function
  checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

  // Device data
  CUdeviceptr devBufferA;
  CUdeviceptr devBufferB;
  CUdeviceptr devBufferC;

  checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
  checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
  checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

  float *hostA = new float[16];
  float *hostB = new float[16];
  float *hostC = new float[16];

  // Populate input
  for (unsigned i = 0; i != 16; ++i) {
    hostA[i] = (float)i;
    hostB[i] = (float)(2*i);
    hostC[i] = 0.0f;
  }

  checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
  checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));

  unsigned blockSizeX = 16;
  unsigned blockSizeY = 1;
  unsigned blockSizeZ = 1;
  unsigned gridSizeX  = 1;
  unsigned gridSizeY  = 1;
  unsigned gridSizeZ  = 1;

  // Kernel parameters
  void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

  std::cout << "Launching kernel\n";

  // Kernel launch
  checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                 blockSizeX, blockSizeY, blockSizeZ,
                                 0, NULL, KernelParams, NULL));

  // Retrieve device data
  checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));

  std::cout << "Results:\n";
  for (unsigned i = 0; i != 16; ++i) {
    std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
  }

  // Clean up after ourselves
  delete [] hostA;
  delete [] hostB;
  delete [] hostC;

  // Clean-up
  checkCudaErrors(cuMemFree(devBufferA));
  checkCudaErrors(cuMemFree(devBufferB));
  checkCudaErrors(cuMemFree(devBufferC));
  checkCudaErrors(cuModuleUnload(cudaModule));
  checkCudaErrors(cuCtxDestroy(context));

  return 0;
}

You will need to link with the CUDA driver and specify the path to cuda.h.

# clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don’t need to specify a path to libcuda.so since this is installed in a system location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output when running the compiled program:

Using CUDA Device [0]: GeForce GTX 680
Device Compute Capability: 3.0
Launching kernel
Results:
0 + 0 = 0
1 + 2 = 3
2 + 4 = 6
3 + 6 = 9
4 + 8 = 12
5 + 10 = 15
6 + 12 = 18
7 + 14 = 21
8 + 16 = 24
9 + 18 = 27
10 + 20 = 30
11 + 22 = 33
12 + 24 = 36
13 + 26 = 39
14 + 28 = 42
15 + 30 = 45

Note

You will likely see a different device identifier based on your hardware

Tutorial: Linking with Libdevice

In this tutorial, we show a simple example of linking LLVM IR with the libdevice library. We will use the same kernel as the previous tutorial, except that we will compute C = pow(A, B) instead of C = A + B. Libdevice provides an __nv_powf function that we will use.

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

; Intrinsic to read X component of thread ID
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
; libdevice function
declare float @__nv_powf(float, float)

define ptx_kernel void @kernel(ptr addrspace(1) %A,
                               ptr addrspace(1) %B,
                               ptr addrspace(1) %C) {
entry:
  ; What is my ID?
  %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

  ; Compute pointers into A, B, and C
  %ptrA = getelementptr float, ptr addrspace(1) %A, i32 %id
  %ptrB = getelementptr float, ptr addrspace(1) %B, i32 %id
  %ptrC = getelementptr float, ptr addrspace(1) %C, i32 %id

  ; Read A, B
  %valA = load float, ptr addrspace(1) %ptrA, align 4
  %valB = load float, ptr addrspace(1) %ptrB, align 4

  ; Compute C = pow(A, B)
  %valC = call float @__nv_powf(float %valA, float %valB)

  ; Store back to C
  store float %valC, ptr addrspace(1) %ptrC, align 4

  ret void
}

To compile this kernel, we perform the following steps:

  1. Link with libdevice

  2. Internalize all but the public kernel function

  3. Run NVVMReflect and set __CUDA_FTZ to 0

  4. Optimize the linked module

  5. Codegen the module

These steps can be performed by the LLVM llvm-link, opt, and llc tools. In a complete compiler, these steps can also be performed entirely programmatically by setting up an appropriate pass configuration (see Linking with Libdevice).

# llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
# opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
# llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

Note

The -nvvm-reflect-list=__CUDA_FTZ=0 option is not strictly required, as any undefined variables will default to zero. It is shown here for evaluation purposes.

This gives us the following PTX (excerpt):

//// Generated by LLVM NVPTX Back-End//.version 3.1.target sm_20.address_size 64  // .globl kernel                                        // @kernel.visible .entry kernel(  .param .u64 kernel_param_0,  .param .u64 kernel_param_1,  .param .u64 kernel_param_2){  .reg .pred  %p<30>;  .reg .f32   %f<111>;  .reg .s32   %r<21>;  .reg .s64   %rl<8>;// %bb.0:                                // %entry  ld.param.u64  %rl2, [kernel_param_0];  mov.u32   %r3, %tid.x;  ld.param.u64  %rl3, [kernel_param_1];  mul.wide.s32  %rl4, %r3, 4;  add.s64   %rl5, %rl2, %rl4;  ld.param.u64  %rl6, [kernel_param_2];  add.s64   %rl7, %rl3, %rl4;  add.s64   %rl1, %rl6, %rl4;  ld.global.f32   %f1, [%rl5];  ld.global.f32   %f2, [%rl7];  setp.eq.f32 %p1, %f1, 0f3F800000;  setp.eq.f32 %p2, %f2, 0f00000000;  or.pred   %p3, %p1, %p2;  @%p3 bra  BB0_1;  bra.uni   BB0_2;BB0_1:  mov.f32   %f110, 0f3F800000;  st.global.f32   [%rl1], %f110;  ret;BB0_2:                                  // %__nv_isnanf.exit.i  abs.f32   %f4, %f1;  setp.gtu.f32  %p4, %f4, 0f7F800000;  @%p4 bra  BB0_4;// %bb.3:                                // %__nv_isnanf.exit5.i  abs.f32   %f5, %f2;  setp.le.f32 %p5, %f5, 0f7F800000;  @%p5 bra  BB0_5;BB0_4:                                  // %.critedge1.i  add.f32   %f110, %f1, %f2;  st.global.f32   [%rl1], %f110;  ret;BB0_5:                                  // %__nv_isinff.exit.i  ...BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i  mul.f32   %f90, %f107, 0f3FB8AA3B;  cvt.rzi.f32.f32 %f91, %f90;  mov.f32   %f92, 0fBF317200;  fma.rn.f32  %f93, %f91, %f92, %f107;  mov.f32   %f94, 0fB5BFBE8E;  fma.rn.f32  %f95, %f91, %f94, %f93;  mul.f32   %f89, %f95, 0f3FB8AA3B;  // inline asm  ex2.approx.ftz.f32 %f88,%f89;  // inline asm  add.f32   %f96, %f91, 0f00000000;  ex2.approx.f32  %f97, %f96;  mul.f32   %f98, %f88, %f97;  setp.lt.f32 %p15, %f107, 0fC2D20000;  selp.f32  %f99, 0f00000000, %f98, %p15;  setp.gt.f32 %p16, %f107, 0f42D20000;  selp.f32  %f110, 0f7F800000, %f99, %p16;  setp.eq.f32 %p17, %f110, 0f7F800000;  @%p17 bra   BB0_28;// %bb.27:  fma.rn.f32  %f110, %f110, %f108, %f110;BB0_28:                                 // %__internal_accurate_powf.exit.i  setp.lt.f32 %p18, %f1, 0f00000000;  setp.eq.f32 %p19, %f3, 0f3F800000;  and.pred    %p20, %p18, %p19;  @!%p20 bra  BB0_30;  bra.uni   BB0_29;BB0_29:  mov.b32    %r9, %f110;  xor.b32   %r10, %r9, -2147483648;  mov.b32    %f110, %r10;BB0_30:                                 // %__nv_powf.exit  st.global.f32   [%rl1], %f110;  ret;}