Parallel Thread Execution ISA Version 9.0

The programming guide to PTX (Parallel Thread Execution), a low-level virtual machine and instruction set architecture (ISA).

1. Introduction

This document describes PTX, a low-level parallel thread execution virtual machine and instruction set architecture (ISA). PTX exposes the GPU as a data-parallel computing device.

1.1. Scalable Data-Parallel Computing using GPUs

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable GPU has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth. The GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.

1.2. Goals of PTX

PTX provides a stable programming model and instruction set for general purpose parallel programming. It is designed to be efficient on NVIDIA GPUs supporting the computation features defined by the NVIDIA Tesla architecture. High level language compilers for languages such as CUDA and C/C++ generate PTX instructions, which are optimized for and translated to native target-architecture instructions.

The goals for PTX include the following:

  • Provide a stable ISA that spans multiple GPU generations.

  • Achieve performance in compiled applications comparable to native GPU performance.

  • Provide a machine-independent ISA for C/C++ and other compilers to target.

  • Provide a code distribution ISA for application and middleware developers.

  • Provide a common source-level ISA for optimizing code generators and translators, which map PTX to specific target machines.

  • Facilitate hand-coding of libraries, performance kernels, and architecture tests.

  • Provide a scalable programming model that spans GPU sizes from a single unit to many parallel units.

1.3. PTX ISA Version 9.0

PTX ISA version 9.0 introduces the following new features:

  • Adds support for sm_88 target architecture.

  • Adds support for sm_110 target architecture.

  • Adds support for target sm_110f that supports family-specific features.

  • Adds support for target sm_110a that supports architecture-specific features.

  • Adds support for pragma enable_smem_spilling that is used to enable shared memory spilling for a function.

  • Adds support for pragma frequency that is used to specify the execution frequency of a basic block.

  • Adds support for directive .blocksareclusters that is used to specify that CUDA thread blocks are mapped to clusters.

  • Extends the size operand of the st.bulk instruction to support 32-bit length.

  • Adds support for performance-tuning directives .abi_preserve and .abi_preserve_control that are used to specify the number of data and control registers that should be preserved by the callers of a function.

1.4. Document Structure

The information in this document is organized into the following chapters:

References

2. Programming Model

2.1. A Highly Multithreaded Coprocessor

The GPU is a compute device capable of executing a very large number of threads in parallel. It operates as a coprocessor to the main CPU, or host: in other words, data-parallel, compute-intensive portions of applications running on the host are off-loaded onto the device.

More precisely, a portion of an application that is executed many times, but independently on different data, can be isolated into a kernel function that is executed on the GPU as many different threads. To that effect, such a function is compiled to the PTX instruction set and the resulting kernel is translated at install time to the target GPU instruction set.

2.2. Thread Hierarchy

The batch of threads that executes a kernel is organized as a grid. A grid consists of either cooperative thread arrays or clusters of cooperative thread arrays, as described in this section and illustrated in Figure 1 and Figure 2. Cooperative thread arrays (CTAs) implement CUDA thread blocks, and clusters implement CUDA thread block clusters.

2.2.1. Cooperative Thread Arrays

The Parallel Thread Execution (PTX) programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array, or CTA, is an array of threads that execute a kernel concurrently or in parallel.

Threads within a CTA can communicate with each other. To coordinate the communication of the threads within the CTA, one can specify synchronization points where threads wait until all threads in the CTA have arrived.

Each thread has a unique thread identifier within the CTA. Programs use a data parallel decomposition to partition inputs, work, and results across the threads of the CTA. Each CTA thread uses its thread identifier to determine its assigned role, assign specific input and output positions, compute addresses, and select work to perform. The thread identifier is a three-element vector tid (with elements tid.x, tid.y, and tid.z) that specifies the thread's position within a 1D, 2D, or 3D CTA. Each thread identifier component ranges from zero up to the number of thread ids in that CTA dimension.

Each CTA has a 1D, 2D, or 3D shape specified by a three-element vector ntid (with elements ntid.x, ntid.y, and ntid.z). The vector ntid specifies the number of threads in each CTA dimension.
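As a sketch of how these identifiers are used (the register names here are illustrative, not part of the ISA), a thread might flatten its 2D position within a CTA as follows:

        .reg .b32 %r<4>;
        mov.u32    %r0, %tid.x;        // thread's x position in the CTA
        mov.u32    %r1, %tid.y;        // thread's y position in the CTA
        mov.u32    %r2, %ntid.x;       // CTA width in threads
        mad.lo.u32 %r3, %r1, %r2, %r0; // flat index = tid.y * ntid.x + tid.x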

Threads within a CTA execute in SIMT (single-instruction, multiple-thread) fashion in groups called warps. A warp is a maximal subset of threads from a single CTA, such that the threads execute the same instructions at the same time. Threads within a warp are sequentially numbered. The warp size is a machine-dependent constant. Typically, a warp has 32 threads. Some applications may be able to maximize performance with knowledge of the warp size, so PTX includes a run-time immediate constant, WARP_SZ, which may be used in any instruction where an immediate operand is allowed.
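For example, a kernel with a 1D CTA could derive its lane within the warp using WARP_SZ as an immediate operand (a sketch; register names are illustrative):

        .reg .b32 %r<2>;
        mov.u32 %r0, %tid.x;
        rem.u32 %r1, %r0, WARP_SZ;     // lane index within the warp, 0..WARP_SZ-1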

2.2.2. Cluster of Cooperative Thread Arrays

A cluster is a group of CTAs that run concurrently or in parallel and can synchronize and communicate with each other via shared memory. The executing CTA has to make sure that the shared memory of the peer CTA exists before communicating with it via shared memory, and that the peer CTA hasn't exited before completing the shared memory operation.

Threads within the different CTAs in a cluster can synchronize and communicate with each other via shared memory. Cluster-wide barriers can be used to synchronize all the threads within the cluster. Each CTA in a cluster has a unique CTA identifier within its cluster (cluster_ctaid). Each cluster of CTAs has a 1D, 2D, or 3D shape specified by the parameter cluster_nctaid. Each CTA in the cluster also has a unique CTA identifier (cluster_ctarank) across all dimensions. The total number of CTAs across all the dimensions in the cluster is specified by cluster_nctarank. Threads may read and use these values through predefined, read-only special registers %cluster_ctaid, %cluster_nctaid, %cluster_ctarank, and %cluster_nctarank.

The cluster level is applicable only on target architecture sm_90 or higher. Specifying the cluster level at launch time is optional. If the user specifies the cluster dimensions at launch time, it is treated as an explicit cluster launch; otherwise it is treated as an implicit cluster launch with default dimension 1x1x1. PTX provides the read-only special register %is_explicit_cluster to differentiate between explicit and implicit cluster launches.
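The cluster registers above can be read like any other special register. The following sketch (register names are illustrative) assumes a kernel compiled for sm_90 or higher:

        .reg .b32  %r<2>;
        .reg .pred %p;
        mov.u32  %r0, %cluster_ctarank;     // this CTA's rank within its cluster
        mov.u32  %r1, %cluster_nctarank;    // total CTAs in the cluster
        mov.pred %p,  %is_explicit_cluster; // true only for an explicit cluster launch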

2.2.3. Grid of Clusters

There is a maximum number of threads that a CTA can contain and a maximum number of CTAs that a cluster can contain. However, clusters with CTAs that execute the same kernel can be batched together into a grid of clusters, so that the total number of threads that can be launched in a single kernel invocation is very large. This comes at the expense of reduced thread communication and synchronization, because threads in different clusters cannot communicate and synchronize with each other.

Each cluster has a unique cluster identifier (clusterid) within a grid of clusters. Each grid of clusters has a 1D, 2D, or 3D shape specified by the parameter nclusterid. Each grid also has a unique temporal grid identifier (gridid). Threads may read and use these values through predefined, read-only special registers %tid, %ntid, %clusterid, %nclusterid, and %gridid.

Each CTA has a unique identifier (ctaid) within a grid. Each grid of CTAs has a 1D, 2D, or 3D shape specified by the parameter nctaid. Threads may read and use these values through predefined, read-only special registers %ctaid and %nctaid.
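Combining these registers, a thread can compute a grid-wide 1D index in the usual way (a sketch; register names are illustrative):

        .reg .b32 %r<4>;
        mov.u32    %r0, %ctaid.x;      // CTA's position within the grid
        mov.u32    %r1, %ntid.x;       // threads per CTA in x
        mov.u32    %r2, %tid.x;        // thread's position within the CTA
        mad.lo.u32 %r3, %r0, %r1, %r2; // global index = ctaid.x * ntid.x + tid.x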

Each kernel is executed as a batch of threads organized as a grid of clusters consisting of CTAs, where the cluster is an optional level and is applicable only for target architectures sm_90 and higher. Figure 1 shows a grid consisting of CTAs, and Figure 2 shows a grid consisting of clusters.

Grids may be launched with dependencies between one another - a grid may be a dependent grid and/or a prerequisite grid. To understand how grid dependencies may be defined, refer to the section on CUDA Graphs in the CUDA Programming Guide.

Figure 1: Grid with CTAs

Figure 2: Grid with clusters

A cluster is a set of cooperative thread arrays (CTAs), where a CTA is a set of concurrent threads that execute the same kernel program. A grid is a set of clusters consisting of CTAs that execute independently.

2.3. Memory Hierarchy

PTX threads may access data from multiple state spaces during their execution, as illustrated by Figure 3, where the cluster level is introduced from target architecture sm_90 onwards. Each thread has a private local memory. Each thread block (CTA) has a shared memory visible to all threads of the block and to all active blocks in the cluster, with the same lifetime as the block. Finally, all threads have access to the same global memory.

There are additional state spaces accessible by all threads: the constant, param, texture, and surface state spaces. Constant and texture memory are read-only; surface memory is readable and writable. The global, constant, param, texture, and surface state spaces are optimized for different memory usages. For example, texture memory offers different addressing modes as well as data filtering for specific data formats. Note that texture and surface memory is cached, and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.

The global, constant, and texture state spaces are persistent across kernel launches by the same application.

Both the host and the device maintain their own local memory, referred to as host memory and device memory, respectively. The device memory may be mapped and read or written by the host, or, for more efficient transfer, copied from the host memory through optimized API calls that utilize the device's high-performance Direct Memory Access (DMA) engine.

Figure 3: Memory Hierarchy

3. PTX Machine Model

3.1. A Set of SIMT Multiprocessors

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a host program invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor consists of multiple Scalar Processor (SP) cores, a multithreaded instruction unit, and on-chip shared memory. The multiprocessor creates, manages, and executes concurrent threads in hardware with zero scheduling overhead. It implements a single-instruction barrier synchronization. Fast barrier synchronization together with lightweight thread creation and zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing, for example, a low granularity decomposition of problems by assigning one thread to each data element (such as a pixel in an image, a voxel in a volume, a cell in a grid-based computation).

To manage hundreds of threads running several different programs, the multiprocessor employs an architecture we call SIMT (single-instruction, multiple-thread). The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state. The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of parallel threads called warps. (This term originates from weaving, the first parallel thread technology.) Individual threads composing a SIMT warp start together at the same program address but are otherwise free to branch and execute independently.

When a multiprocessor is given one or more thread blocks to execute, it splits them into warps that get scheduled by the SIMT unit. The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs, with the first warp containing thread 0.

At every instruction issue time, the SIMT unit selects a warp that is ready to execute and issues the next instruction to the active threads of the warp. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.

How many blocks a multiprocessor can process at once depends on how many registers per thread and how much shared memory per block are required for a given kernel, since the multiprocessor's registers and shared memory are split among all the threads of the batch of blocks. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.

Figure 4: Hardware Model

A set of SIMT multiprocessors with on-chip shared memory.

3.2. Independent Thread Scheduling

On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp, together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.

Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.

Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures. In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with Volta and beyond. See the section on Compute Capability 7.x in the CUDA Programming Guide for further details.

3.3. On-chip Shared Memory

As illustrated by Figure 4, each multiprocessor has on-chip memory of the four following types:

  • One set of local 32-bit registers per processor,

  • A parallel data cache or shared memory that is shared by all scalar processor cores and is where the shared memory space resides,

  • A read-only constant cache that is shared by all scalar processor cores and speeds up reads from the constant memory space, which is a read-only region of device memory,

  • A read-only texture cache that is shared by all scalar processor cores and speeds up reads from the texture memory space, which is a read-only region of device memory; each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering.

The local and global memory spaces are read-write regions of device memory.

4. Syntax

PTX programs are a collection of text source modules (files). PTX source modules have an assembly-language style syntax with instruction operation codes and operands. Pseudo-operations specify symbol and addressing management. The ptxas optimizing backend compiler optimizes and assembles PTX source modules to produce corresponding binary object files.

4.1. Source Format

Source modules are ASCII text. Lines are separated by the newline character (\n).

All whitespace characters are equivalent; whitespace is ignored except for its use in separating tokens in the language.

The C preprocessor cpp may be used to process PTX source modules. Lines beginning with # are preprocessor directives. The following are common preprocessor directives:

#include, #define, #if, #ifdef, #else, #endif, #line, #file

C: A Reference Manual by Harbison and Steele provides a good description of the C preprocessor.

PTX is case sensitive and uses lowercase for keywords.

Each PTX module must begin with a .version directive specifying the PTX language version, followed by a .target directive specifying the target architecture assumed. See PTX Module Directives for more information on these directives.
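A minimal module prologue might therefore look as follows (the version, target, and kernel name are illustrative):

        .version 9.0
        .target sm_90
        .address_size 64

        .visible .entry hello_kernel ()   // hypothetical empty kernel
        {
            ret;
        }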

4.2. Comments

Comments in PTX follow C/C++ syntax, using non-nested /* and */ for comments that may span multiple lines, and using // to begin a comment that extends up to the next newline character, which terminates the current line. Comments cannot occur within character constants, string literals, or within other comments.

Comments in PTX are treated as whitespace.

4.3. Statements

A PTX statement is either a directive or an instruction. Statements begin with an optional label andend with a semicolon.

Examples

        .reg     .b32 r1, r2;
        .global  .f32  array[N];

start:  mov.b32   r1, %tid.x;
        shl.b32   r1, r1, 2;          // shift thread id by 2 bits
        ld.global.b32 r2, array[r1];  // thread[tid] gets array[tid]
        add.f32   r2, r2, 0.5;        // add 1/2

4.3.1. Directive Statements

Directive keywords begin with a dot, so no conflict is possible with user-defined identifiers. The directives in PTX are listed in Table 1 and described in State Spaces, Types, and Variables and Directives.

Table 1: PTX Directives

.address_size, .alias, .align, .branchtargets, .callprototype, .calltargets,
.common, .const, .entry, .explicitcluster, .extern, .file, .func, .global,
.loc, .local, .maxclusterrank, .maxnctapersm, .maxnreg, .maxntid,
.minnctapersm, .noreturn, .param, .pragma, .reg, .reqnctapercluster,
.reqntid, .section, .shared, .sreg, .target, .tex, .version, .visible, .weak

4.3.2. Instruction Statements

Instructions are formed from an instruction opcode followed by a comma-separated list of zero or more operands, and terminated with a semicolon. Operands may be register variables, constant expressions, address expressions, or label names. Instructions have an optional guard predicate which controls conditional execution. The guard predicate follows the optional label and precedes the opcode, and is written as @p, where p is a predicate register. The guard predicate may be optionally negated, written as @!p.

The destination operand is first, followed by source operands.
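The following sketch (register names are illustrative) shows a guard predicate selecting between two assignments:

        .reg .pred %p;
        .reg .b32  %r<2>;
        setp.eq.u32  %p, %r0, 0;      // p = (r0 == 0)
        @%p  mov.u32 %r1, 1;          // executed only where p is true
        @!%p mov.u32 %r1, 2;          // executed only where p is false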

Instruction keywords are listed in Table 2. All instruction keywords are reserved tokens in PTX.

Table 2: Reserved Instruction Keywords

abs, activemask, add, addc, alloca, and, applypriority, atom, bar, barrier,
bfe, bfi, bfind, bmsk, bra, brev, brkpt, brx, call, clusterlaunchcontrol,
clz, cnot, copysign, cos, cp, createpolicy, cvt, cvta, discard, div, dp2a,
dp4a, elect, ex2, exit, fence, fma, fns, getctarank, griddepcontrol,
isspacep, istypep, ld, ldmatrix, ldu, lg2, lop3, mad, mad24, madc, mapa,
match, max, mbarrier, membar, min, mma, mov, movmatrix, mul, mul24,
multimem, nanosleep, neg, not, or, pmevent, popc, prefetch, prefetchu,
prmt, rcp, red, redux, rem, ret, rsqrt, sad, selp, set, setmaxnreg, setp,
shf, shfl, shl, shr, sin, slct, sqrt, st, stackrestore, stacksave,
stmatrix, sub, subc, suld, suq, sured, sust, szext, tanh, tcgen05,
tensormap, testp, tex, tld4, trap, txq, vabsdiff, vabsdiff2, vabsdiff4,
vadd, vadd2, vadd4, vavrg2, vavrg4, vmad, vmax, vmax2, vmax4, vmin, vmin2,
vmin4, vote, vset, vset2, vset4, vshl, vshr, vsub, vsub2, vsub4, wgmma,
wmma, xor

4.4. Identifiers

User-defined identifiers follow extended C++ rules: they either start with a letter followed by zeroor more letters, digits, underscore, or dollar characters; or they start with an underscore, dollar,or percentage character followed by one or more letters, digits, underscore, or dollar characters:

followsym:   [a-zA-Z0-9_$]
identifier:  [a-zA-Z]{followsym}* | [_$%]{followsym}+

PTX does not specify a maximum length for identifiers and suggests that all implementations support a minimum length of at least 1024 characters.

Many high-level languages such as C and C++ follow similar rules for identifier names, except that the percentage sign is not allowed. PTX allows the percentage sign as the first character of an identifier. The percentage sign can be used to avoid name conflicts, e.g., between user-defined variable names and compiler-generated names.

PTX predefines one constant and a small number of special registers that begin with the percentage sign, listed in Table 3.

Table 3: Predefined Identifiers

%aggr_smem_size, %clock, %clock64, %cluster_ctaid, %cluster_ctarank,
%cluster_nctaid, %cluster_nctarank, %clusterid, %ctaid, %current_graph_exec,
%dynamic_smem_size, %envreg<32>, %globaltimer, %globaltimer_hi,
%globaltimer_lo, %gridid, %is_explicit_cluster, %laneid, %lanemask_eq,
%lanemask_ge, %lanemask_gt, %lanemask_le, %lanemask_lt, %nclusterid,
%nctaid, %nsmid, %ntid, %nwarpid, %pm0,...,%pm7, %reserved_smem_offset_<2>,
%reserved_smem_offset_begin, %reserved_smem_offset_cap,
%reserved_smem_offset_end, %smid, %tid, %total_smem_size, %warpid, WARP_SZ

4.5. Constants

PTX supports integer and floating-point constants and constant expressions. These constants may be used in data initialization and as operands to instructions. Type checking rules remain the same for integer, floating-point, and bit-size types. For predicate-type data and instructions, integer constants are allowed and are interpreted as in C, i.e., zero values are False and non-zero values are True.

4.5.1. Integer Constants

Integer constants are 64 bits in size and are either signed or unsigned, i.e., every integer constant has type .s64 or .u64. The signed/unsigned nature of an integer constant is needed to correctly evaluate constant expressions containing operations such as division and ordered comparisons, where the behavior of the operation depends on the operand types. When used in an instruction or data initialization, each integer constant is converted to the appropriate size based on the data or instruction type at its use.

Integer literals may be written in decimal, hexadecimal, octal, or binary notation. The syntax follows that of C. Integer literals may be followed immediately by the letter U to indicate that the literal is unsigned.

hexadecimal literal:  0[xX]{hexdigit}+U?
octal literal:        0{octal digit}+U?
binary literal:       0[bB]{bit}+U?
decimal literal:      {nonzero-digit}{digit}*U?

Integer literals are non-negative and have a type determined by their magnitude and optional type suffix as follows: literals are signed (.s64) unless the value cannot be fully represented in .s64 or the unsigned suffix is specified, in which case the literal is unsigned (.u64).
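For example, the following variable initializers illustrate each notation and the resulting literal types (variable names are illustrative):

        .global .u32 mask  = 0xFF;                // hexadecimal
        .global .u32 perm  = 0755;                // octal
        .global .u32 flags = 0b1010;              // binary
        .global .u64 big   = 0xfabc123400000000;  // too large for .s64, so .u64
        .global .u64 count = 42U;                 // explicit unsigned suffix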

The predefined integer constant WARP_SZ specifies the number of threads per warp for the target platform; to date, all target architectures have a WARP_SZ value of 32.

4.5.2. Floating-Point Constants

Floating-point constants are represented as 64-bit double-precision values, and all floating-point constant expressions are evaluated using 64-bit double-precision arithmetic. The only exception is the 32-bit hex notation for expressing an exact single-precision floating-point value; such values retain their exact 32-bit single-precision value and may not be used in constant expressions. Each 64-bit floating-point constant is converted to the appropriate floating-point size based on the data or instruction type at its use.

Floating-point literals may be written with an optional decimal point and an optional signedexponent. Unlike C and C++, there is no suffix letter to specify size; literals are alwaysrepresented in 64-bit double-precision format.

PTX includes a second representation of floating-point constants for specifying the exact machine representation using a hexadecimal constant. To specify IEEE 754 double-precision floating point values, the constant begins with 0d or 0D followed by 16 hex digits. To specify IEEE 754 single-precision floating point values, the constant begins with 0f or 0F followed by 8 hex digits.

0[fF]{hexdigit}{8}      // single-precision floating point
0[dD]{hexdigit}{16}     // double-precision floating point

Example

mov.f32  $f3, 0F3f800000;       //  1.0

4.5.3. Predicate Constants

In PTX, integer constants may be used as predicates. For predicate-type data initializers and instruction operands, integer constants are interpreted as in C, i.e., zero values are False and non-zero values are True.

4.5.4. Constant Expressions

In PTX, constant expressions are formed using operators as in C and are evaluated using rules similar to those in C, but simplified by restricting types and sizes, removing most casts, and defining full semantics to eliminate cases where expression evaluation in C is implementation dependent.

Constant expressions are formed from constant literals, unary plus and minus, basic arithmetic operators (addition, subtraction, multiplication, division), comparison operators, the conditional ternary operator (?:), and parentheses. Integer constant expressions also allow unary logical negation (!), bitwise complement (~), remainder (%), shift operators (<< and >>), bit-type operators (&, |, and ^), and logical operators (&&, ||).

Constant expressions in PTX do not support casts between integer and floating-point.
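As a sketch, constant expressions commonly appear in declarations and initializers (variable names are illustrative):

        .global .u32 buf[(1024 + 31) / 32];       // array size from a constant expression
        .global .s32 limit = (10 > 4) ? 10 : 4;   // conditional operator in an initializer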

Constant expressions are evaluated using the same operator precedence as in C. Table 4 gives operator precedence and associativity. Operator precedence is highest for unary operators and decreases with each line in the chart. Operators on the same line have the same precedence and are evaluated right-to-left for unary operators and left-to-right for binary operators.

Table 4: Operator Precedence

Kind      Operator Symbols    Operator Names                        Associates
Primary   ()                  parenthesis                           n/a
Unary     + - ! ~             plus, minus, negation, complement     right
          (.s64) (.u64)       casts                                 right
Binary    * / %               multiplication, division, remainder   left
          + -                 addition, subtraction                 left
          >> <<               shifts                                left
          < > <= >=           ordered comparisons                   left
          == !=               equal, not equal                      left
          &                   bitwise AND                           left
          ^                   bitwise XOR                           left
          |                   bitwise OR                            left
          &&                  logical AND                           left
          ||                  logical OR                            left
Ternary   ?:                  conditional                           right

4.5.5. Integer Constant Expression Evaluation

Integer constant expressions are evaluated at compile time according to a set of rules that determine the type (signed .s64 versus unsigned .u64) of each sub-expression. These rules are based on the rules in C, but they've been simplified to apply only to 64-bit integers, and behavior is fully defined in all cases (specifically, for remainder and shift operators).

  • Literals are signed unless unsigned is needed to prevent overflow, or unless the literal uses a U suffix. For example:

    • 42, 0x1234, 0123 are signed.

    • 0xfabc123400000000, 42U, 0x1234U are unsigned.

  • Unary plus and minus preserve the type of the input operand. For example:

    • +123, -1, -(-42) are signed.

    • -1U, -0xfabc123400000000 are unsigned.

  • Unary logical negation (!) produces a signed result with value 0 or 1.

  • Unary bitwise complement (~) interprets the source operand as unsigned and produces an unsigned result.

  • Some binary operators require normalization of source operands. This normalization is known as the usual arithmetic conversions and simply converts both operands to unsigned type if either operand is unsigned.

  • Addition, subtraction, multiplication, and division perform the usual arithmetic conversions and produce a result with the same type as the converted operands. That is, the operands and result are unsigned if either source operand is unsigned, and signed otherwise.

  • Remainder (%) interprets the operands as unsigned. Note that this differs from C, which allowsa negative divisor but defines the behavior to be implementation dependent.

  • Left and right shift interpret the second operand as unsigned and produce a result with the sametype as the first operand. Note that the behavior of right-shift is determined by the type of thefirst operand: right shift of a signed value is arithmetic and preserves the sign, and right shiftof an unsigned value is logical and shifts in a zero bit.

  • AND (&), OR (|), and XOR (^) perform the usual arithmetic conversions and produce aresult with the same type as the converted operands.

  • AND_OP (&&), OR_OP (||), Equal (==), and Not_Equal (!=) produce a signedresult. The result value is 0 or 1.

  • Ordered comparisons (<,<=,>,>=) perform the usual arithmetic conversions onsource operands and produce a signed result. The result value is0 or1.

  • Casting of expressions to signed or unsigned is supported using (.s64) and (.u64) casts.

  • For the conditional operator (?: ) , the first operand must be an integer, and the secondand third operands are either both integers or both floating-point. The usual arithmeticconversions are performed on the second and third operands, and the result type is the same as theconverted type.
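These rules apply, for example, to constant expressions used in variable initializers and array sizes. The following module-scope declarations are a hypothetical sketch (the variable names are invented for illustration):

.global .u64 mask = ~0 >> 16;    // ~0 is unsigned, so the shift is logical: 0x0000ffffffffffff
.global .s64 half = (-8) >> 1;   // signed first operand, so the shift is arithmetic: -4
.global .u32 buf[2 * (1 << 4)];  // constant expression as an array size: 32 elements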

4.5.6. Summary of Constant Expression Evaluation Rules

Table 5 contains a summary of the constant expression evaluation rules.

Table 5 Constant Expression Evaluation Rules

Kind      Operator          Operand Types        Operand Interpretation       Result Type
Primary   ()                any type             same as source               same as source
          constant literal  n/a                  n/a                          .u64, .s64, or .f64
Unary     + -               any type             same as source               same as source
          !                 integer              zero or non-zero             .s64
          ~                 integer              .u64                         .u64
Cast      (.u64)            integer              .u64                         .u64
          (.s64)            integer              .s64                         .s64
Binary    + - * /           .f64                 .f64                         .f64
                            integer              use usual conversions        converted type
          < > <= >=         .f64                 .f64                         .s64
                            integer              use usual conversions        .s64
          == !=             .f64                 .f64                         .s64
                            integer              use usual conversions        .s64
          %                 integer              .u64                         .s64
          >> <<             integer              1st unchanged, 2nd is .u64   same as 1st operand
          & | ^             integer              .u64                         .u64
          && ||             integer              zero or non-zero             .s64
Ternary   ?:                int ? .f64 : .f64    same as sources              .f64
                            int ? int : int      use usual conversions        converted type

5. State Spaces, Types, and Variables

While the specific resources available in a given target GPU will vary, the kinds of resources will be common across platforms, and these resources are abstracted in PTX through state spaces and data types.

5.1. State Spaces

A state space is a storage area with particular characteristics. All variables reside in some state space. The characteristics of a state space include its size, addressability, access speed, access rights, and level of sharing between threads.

The state spaces defined in PTX are a byproduct of parallel programming and graphics programming. The list of state spaces is shown in Table 6, and properties of state spaces are shown in Table 7.

Table 6 State Spaces

Name      Description
.reg      Registers, fast.
.sreg     Special registers. Read-only; pre-defined; platform-specific.
.const    Shared, read-only memory.
.global   Global memory, shared by all threads.
.local    Local memory, private to each thread.
.param    Kernel parameters, defined per-grid; or function or local parameters, defined per-thread.
.shared   Addressable memory, defined per CTA, accessible to all threads in the cluster throughout the lifetime of the CTA that defines it.
.tex      Global texture memory (deprecated).

Table 7 Properties of State Spaces

Name                          Addressable      Initializable     Access   Sharing
.reg                          No               No                R/W      per-thread
.sreg                         No               No                RO       per-CTA
.const                        Yes              Yes (1)           RO       per-grid
.global                       Yes              Yes (1)           R/W      Context
.local                        Yes              No                R/W      per-thread
.param (as input to kernel)   Yes (2)          No                RO       per-grid
.param (used in functions)    Restricted (3)   No                R/W      per-thread
.shared                       Yes              No                R/W      per-cluster (5)
.tex                          No (4)           Yes, via driver   RO       Context

Notes:

1. Variables in .const and .global state spaces are initialized to zero by default.

2. Accessible only via the ld.param{::entry} instruction. Address may be taken via the mov instruction.

3. Accessible via ld.param{::func} and st.param{::func} instructions. Device function input and return parameters may have their address taken via mov; the parameter is then located on the stack frame and its address is in the .local state space.

4. Accessible only via the tex instruction.

5. Visible to the owning CTA and other active CTAs in the cluster.

5.1.1. Register State Space

Registers (.reg state space) are fast storage locations. The number of registers is limited, and will vary from platform to platform. When the limit is exceeded, register variables will be spilled to memory, causing changes in performance. For each architecture, there is a recommended maximum number of registers to use (see the CUDA Programming Guide for details).

Registers may be typed (signed integer, unsigned integer, floating point, predicate) or untyped. Register size is restricted; aside from predicate registers, which are 1-bit, scalar registers have a width of 8, 16, 32, 64, or 128 bits, and vector registers have a width of 16, 32, 64, or 128 bits. The most common use of 8-bit registers is with ld, st, and cvt instructions, or as elements of vector tuples.

Registers differ from the other state spaces in that they are not fully addressable, i.e., it is not possible to refer to the address of a register. When compiling to use the Application Binary Interface (ABI), register variables are restricted to function scope and may not be declared at module scope. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg variables, the compiler silently disables use of the ABI. Registers may have alignment boundaries required by multi-word loads and stores.
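As a brief sketch, register variables are declared with the .reg directive; the declarations below illustrate typed, untyped, predicate, and vector registers (the names are invented for illustration):

.reg .b32  r1, r2;    // untyped 32-bit scalar registers
.reg .f64  d;         // typed double-precision register
.reg .pred p;         // 1-bit predicate register
.reg .v4 .f32 V;      // 128-bit vector register of four .f32 elements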

5.1.2. Special Register State Space

The special register (.sreg) state space holds predefined, platform-specific registers, such as grid, cluster, CTA, and thread parameters, clock counters, and performance monitoring registers. All special registers are predefined.
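Special registers are read with the mov instruction; a minimal sketch (the destination registers %r1 through %r3 are illustrative):

mov.u32 %r1, %tid.x;     // thread index within the CTA
mov.u32 %r2, %ctaid.x;   // CTA index within the grid
mov.u32 %r3, %ntid.x;    // number of threads per CTA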

5.1.3. Constant State Space

The constant (.const) state space is a read-only memory initialized by the host. Constant memory is accessed with the ld.const instruction. Constant memory is restricted in size, currently limited to 64 KB, which can be used to hold statically-sized constant variables. There is an additional 640 KB of constant memory, organized as ten independent 64 KB regions. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters. Since the ten regions are not contiguous, the driver must ensure that constant buffers are allocated so that each buffer fits entirely within a 64 KB region and does not span a region boundary.

Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are initialized to zero by default. Constant buffers allocated by the driver are initialized by the host, and pointers to such buffers are passed to the kernel as parameters. See the description of kernel parameter attributes in Kernel Function Parameter Attributes for more details on passing pointers to constant buffers as kernel parameters.
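For example, a statically-sized constant variable might be declared and read as follows; this is a hypothetical sketch (the variable name table is invented):

.const .align 4 .b32 table[4] = { 1, 2, 3, 4 };
...
ld.const.b32 %r1, [table+8];   // load the third element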

5.1.3.1. Banked Constant State Space (deprecated)

Previous versions of PTX exposed constant memory as a set of eleven 64 KB banks, with explicit bank numbers required for variable declaration and during access.

Prior to PTX ISA version 2.2, the constant memory was organized into fixed size banks. There were eleven 64 KB banks, and banks were specified using the .const[bank] modifier, where bank ranged from 0 to 10. If no bank number was given, bank zero was assumed.

By convention, bank zero was used for all statically-sized constant variables. The remaining banks were used to declare incomplete constant arrays (as in C, for example), where the size is not known at compile time. For example, the declaration

.extern .const[2] .b32 const_buffer[];

resulted in const_buffer pointing to the start of constant bank two. This pointer could then be used to access the entire 64 KB constant bank. Multiple incomplete array variables declared in the same bank were aliased, with each pointing to the start address of the specified constant bank.

To access data in constant banks 1 through 10, the bank number was required in the state space of the load instruction. For example, an incomplete array in bank 2 was accessed as follows:

.extern .const[2] .b32 const_buffer[];
ld.const[2].b32  %r1, [const_buffer+4]; // load second word

In PTX ISA version 2.2, we eliminated explicit banks and replaced the incomplete array representation of driver-allocated constant buffers with kernel parameter attributes that allow pointers to constant buffers to be passed as kernel parameters.

5.1.4. Global State Space

The global (.global) state space is memory that is accessible by all threads in a context. It is the mechanism by which threads in different CTAs, clusters, and grids can communicate. Use ld.global, st.global, and atom.global to access global variables.

Global variables have an optional variable initializer; global variables with no explicit initializer are initialized to zero by default.
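A short sketch of declaring and accessing a global variable (the name counter is invented for illustration):

.global .u32 counter = 0;
...
ld.global.u32       %r1, [counter];      // read the global variable
atom.global.add.u32 %r2, [counter], 1;   // atomically increment it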

5.1.5. Local State Space

The local state space (.local) is private memory for each thread to keep its own data. It is typically standard memory with cache. The size is limited, as it must be allocated on a per-thread basis. Use ld.local and st.local to access local variables.

When compiling to use the Application Binary Interface (ABI), .local state-space variables must be declared within function scope and are allocated on the stack. In implementations that do not support a stack, all local memory variables are stored at fixed addresses, recursive function calls are not supported, and .local variables may be declared at module scope. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .local variables, the compiler silently disables use of the ABI.
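A minimal sketch of a function-scoped local variable under the ABI (the name scratch is invented for illustration):

.local .align 8 .b8 scratch[16];   // per-thread scratch buffer
...
st.local.u32 [scratch],   %r1;
ld.local.u32 %r2, [scratch+4];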

5.1.6. Parameter State Space

The parameter (.param) state space is used (1) to pass input arguments from the host to the kernel, (2a) to declare formal input and return parameters for device functions called from within kernel execution, and (2b) to declare locally-scoped byte array variables that serve as function call arguments, typically for passing large structures by value to a function. Kernel function parameters differ from device function parameters in terms of access and sharing (read-only versus read-write, per-kernel versus per-thread). Note that PTX ISA version 1.x supports only kernel function parameters in .param space; device function parameters were previously restricted to the register state space. The use of parameter state space for device function parameters was introduced in PTX ISA version 2.0 and requires target architecture sm_20 or higher. Additional sub-qualifiers ::entry or ::func can be specified on instructions with .param state space to indicate whether the address refers to a kernel function parameter or a device function parameter. If no sub-qualifier is specified with the .param state space, then the default sub-qualifier is specific to and dependent on the exact instruction. For example, st.param is equivalent to st.param::func whereas isspacep.param is equivalent to isspacep.param::entry. Refer to the instruction description for more details on default sub-qualifier assumption.

Note

The location of parameter space is implementation specific. For example, in some implementations kernel parameters reside in global memory. No access protection is provided between parameter and global space in this case. Though the exact location of the kernel parameter space is implementation specific, the kernel parameter space window is always contained within the global space window. Similarly, function parameters are mapped to parameter passing registers and/or stack locations based on the function calling conventions of the Application Binary Interface (ABI). Therefore, PTX code should make no assumptions about the relative locations or ordering of .param space variables.

5.1.6.1. Kernel Function Parameters

Each kernel function definition includes an optional list of parameters. These parameters are addressable, read-only variables declared in the .param state space. Values passed from the host to the kernel are accessed through these parameter variables using ld.param instructions. The kernel parameter variables are shared across all CTAs from all clusters within a grid.

The address of a kernel parameter may be moved into a register using the mov instruction. The resulting address is in the .param state space and is accessed using ld.param instructions.

Example

.entry foo ( .param .b32 N, .param .align 8 .b8 buffer[64] )
{
    .reg .u32 %n;
    .reg .f64 %d;

    ld.param.u32 %n, [N];
    ld.param.f64 %d, [buffer];
    ...

Example

.entry bar ( .param .b32 len )
{
    .reg .u32 %ptr, %n;

    mov.u32      %ptr, len;
    ld.param.u32 %n, [%ptr];
    ...

Kernel function parameters may represent normal data values, or they may hold addresses to objects in constant, global, local, or shared state spaces. In the case of pointers, the compiler and runtime system need information about which parameters are pointers, and to which state space they point. Kernel parameter attribute directives are used to provide this information at the PTX level. See Kernel Function Parameter Attributes for a description of kernel parameter attribute directives.

Note

The current implementation does not allow creation of generic pointers to constant variables (cvta.const) in programs that have pointers to constant buffers passed as kernel parameters.

5.1.6.2. Kernel Function Parameter Attributes

Kernel function parameters may be declared with an optional .ptr attribute to indicate that a parameter is a pointer to memory, and also indicate the state space and alignment of the memory being pointed to. Kernel Parameter Attribute: .ptr describes the .ptr kernel parameter attribute.

5.1.6.3. Kernel Parameter Attribute: .ptr

.ptr

Kernel parameter alignment attribute.

Syntax

.param .type .ptr .space .align N  varname
.param .type .ptr        .align N  varname

.space = { .const, .global, .local, .shared };

Description

Used to specify the state space and, optionally, the alignment of memory pointed to by a pointer type kernel parameter. The alignment value N, if present, must be a power of two. If no state space is specified, the pointer is assumed to be a generic address pointing to one of const, global, local, or shared memory. If no alignment is specified, the memory pointed to is assumed to be aligned to a 4 byte boundary.

Spaces between .ptr, .space, and .align may be eliminated to improve readability.

PTX ISA Notes

  • Introduced in PTX ISA version 2.2.

  • Support for generic addressing of .const space added in PTX ISA version 3.1.

Target ISA Notes

  • Supported on all target architectures.

Examples

.entry foo ( .param .u32 param1,
             .param .u32 .ptr.global.align 16 param2,
             .param .u32 .ptr.const.align 8 param3,
             .param .u32 .ptr.align 16 param4  // generic address
                                               // pointer
) { .. }

5.1.6.4. Device Function Parameters

PTX ISA version 2.0 extended the use of parameter space to device function parameters. The most common use is for passing objects by value that do not fit within a PTX register, such as C structures larger than 8 bytes. In this case, a byte array in parameter space is used. Typically, the caller will declare a locally-scoped .param byte array variable that represents a flattened C structure or union. This will be passed by value to a callee, which declares a .param formal parameter having the same size and alignment as the passed argument.

Example

// pass object of type struct { double d; int y; };
.func foo ( .reg .b32 N, .param .align 8 .b8 buffer[12] )
{
    .reg .f64 %d;
    .reg .s32 %y;

    ld.param.f64 %d, [buffer];
    ld.param.s32 %y, [buffer+8];
    ...
}

// code snippet from the caller
// struct { double d; int y; } mystruct; is flattened, passed to foo
    ...
    .reg .f64 dbl;
    .reg .s32 x;
    .param .align 8 .b8 mystruct;
    ...
    st.param.f64 [mystruct+0], dbl;
    st.param.s32 [mystruct+8], x;
    call foo, (4, mystruct);
    ...

See the section on function call syntax for more details.

Function input parameters may be read via ld.param and function return parameters may be written using st.param; it is illegal to write to an input parameter or read from a return parameter.

Aside from passing structures by value, .param space is also required whenever a formal parameter has its address taken within the called function. In PTX, the address of a function input parameter may be moved into a register using the mov instruction. Note that the parameter will be copied to the stack if necessary, and so the address will be in the .local state space and is accessed via ld.local and st.local instructions. It is not possible to use mov to get the address of a return parameter or a locally-scoped .param space variable. Starting with PTX ISA version 6.0, it is possible to use the mov instruction to get the address of the return parameter of a device function.

Example

// pass array of up to eight floating-point values in buffer
.func foo ( .param .b32 N, .param .b32 buffer[32] )
{
    .reg .u32  %n, %r;
    .reg .f32  %f;
    .reg .pred %p;

    ld.param.u32 %n, [N];
    mov.u32      %r, buffer;  // forces buffer to .local state space
Loop:
    setp.eq.u32  %p, %n, 0;
@%p bra          Done;
    ld.local.f32 %f, [%r];
    ...
    add.u32      %r, %r, 4;
    sub.u32      %n, %n, 1;
    bra          Loop;
Done:
    ...
}

5.1.7. Shared State Space

The shared (.shared) state space is a memory that is owned by an executing CTA and is accessible to the threads of all the CTAs within a cluster. An address in shared memory can be read and written by any thread in a CTA cluster.

Additional sub-qualifiers ::cta or ::cluster can be specified on instructions with .shared state space to indicate whether the address belongs to the shared memory window of the executing CTA or of any CTA in the cluster, respectively. The addresses in the .shared::cta window also fall within the .shared::cluster window. If no sub-qualifier is specified with the .shared state space, then it defaults to ::cta. For example, ld.shared is equivalent to ld.shared::cta.

Variables declared in the .shared state space refer to the memory addresses in the current CTA. The mapa instruction gives the .shared::cluster address of the corresponding variable in another CTA in the cluster.
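A hypothetical sketch of accessing a shared variable in the local CTA and in a peer CTA of the cluster (the names sbuf, %raddr, and %ctarank are invented for illustration; %ctarank holds the target CTA's rank):

.shared .align 4 .b32 sbuf[128];
...
st.shared.u32 [sbuf], %r1;                        // same as st.shared::cta
mapa.shared::cluster.u32 %raddr, sbuf, %ctarank;  // address of sbuf in CTA %ctarank
ld.shared::cluster.u32   %r2, [%raddr];           // read the peer CTA's copy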

Shared memory typically has some optimizations to support the sharing. One example is broadcast, where all threads read from the same address. Another is sequential access from sequential threads.

5.1.8. Texture State Space (deprecated)

The texture (.tex) state space is global memory accessed via the texture instruction. It is shared by all threads in a context. Texture memory is read-only and cached, so accesses to texture memory are not coherent with global memory stores to the texture image.

The GPU hardware has a fixed number of texture bindings that can be accessed within a single kernel (typically 128). The .tex directive will bind the named texture memory variable to a hardware texture identifier, where texture identifiers are allocated sequentially beginning with zero. Multiple names may be bound to the same physical texture identifier. An error is generated if the maximum number of physical resources is exceeded. The texture name must be of type .u32 or .u64.

Physical texture resources are allocated on a per-kernel granularity, and .tex variables are required to be defined in the global scope.

Texture memory is read-only. A texture's base address is assumed to be aligned to a 16 byte boundary.

Example

.tex .u32 tex_a;         // bound to physical texture 0
.tex .u32 tex_c, tex_d;  // both bound to physical texture 1
.tex .u32 tex_d;         // bound to physical texture 2
.tex .u32 tex_f;         // bound to physical texture 3

Note

Explicit declarations of variables in the texture state space are deprecated, and programs should instead reference texture memory through variables of type .texref. The .tex directive is retained for backward compatibility, and variables declared in the .tex state space are equivalent to module-scoped .texref variables in the .global state space.

For example, a legacy PTX definition such as

.tex .u32 tex_a;

is equivalent to:

.global .texref tex_a;

See Texture Sampler and Surface Types for the description of the .texref type and Texture Instructions for its use in texture instructions.

5.2. Types

5.2.1. Fundamental Types

In PTX, the fundamental types reflect the native data types supported by the target architectures. A fundamental type specifies both a basic type and a size. Register variables are always of a fundamental type, and instructions operate on these types. The same type-size specifiers are used for both variable definitions and for typing instructions, so their names are intentionally short.

Table 8 lists the fundamental type specifiers for each basic type:

Table 8 Fundamental Type Specifiers

Basic Type         Fundamental Type Specifiers
Signed integer     .s8, .s16, .s32, .s64
Unsigned integer   .u8, .u16, .u32, .u64
Floating-point     .f16, .f16x2, .f32, .f64
Bits (untyped)     .b8, .b16, .b32, .b64, .b128
Predicate          .pred

Most instructions have one or more type specifiers, needed to fully specify instruction behavior. Operand types and sizes are checked against instruction types for compatibility.

Two fundamental types are compatible if they have the same basic type and are the same size. Signed and unsigned integer types are compatible if they have the same size. The bit-size type is compatible with any fundamental type having the same size.

In principle, all variables (aside from predicates) could be declared using only bit-size types, but typed variables enhance program readability and allow for better operand type checking.

5.2.2. Restricted Use of Sub-Word Sizes

The .u8, .s8, and .b8 instruction types are restricted to ld, st, and cvt instructions. The .f16 floating-point type is allowed only in conversions to and from .f32 and .f64 types, in half precision floating point instructions, and in texture fetch instructions. The .f16x2 floating point type is allowed only in half precision floating point arithmetic instructions and texture fetch instructions.

For convenience, ld, st, and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or 16-bit values may be held directly in 32-bit or 64-bit registers when being loaded, stored, or converted to other types and sizes.
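A minimal sketch of this convention, using 32-bit registers to hold 8-bit values (the pointer register %p is invented for illustration):

.reg .u32 %r1, %r2;
ld.global.u8 %r1, [%p];    // 8-bit value loaded into a 32-bit register
cvt.s32.s8   %r2, %r1;     // sign-extend the low 8 bits to 32 bits
st.global.u8 [%p], %r2;    // store only the low 8 bits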

5.2.3. Alternate Floating-Point Data Formats

The fundamental floating-point types supported in PTX have implicit bit representations that indicate the number of bits used to store exponent and mantissa. For example, the .f16 type indicates 5 bits reserved for exponent and 10 bits reserved for mantissa. In addition to the floating-point representations assumed by the fundamental types, PTX allows the following alternate floating-point data formats:

bf16 data format:

This data format is a 16-bit floating point format with 8 bits for exponent and 7 bits for mantissa. A register variable containing bf16 data must be declared with .b16 type.

e4m3 data format:

This data format is an 8-bit floating point format with 4 bits for exponent and 3 bits for mantissa. The e4m3 encoding does not support infinity, and NaN values are limited to 0x7f and 0xff. A register variable containing an e4m3 value must be declared using a bit-size type.

e5m2 data format:

This data format is an 8-bit floating point format with 5 bits for exponent and 2 bits for mantissa. A register variable containing an e5m2 value must be declared using a bit-size type.

tf32 data format:

This data format is a special 32-bit floating point format supported by the matrix multiply-and-accumulate instructions, with the same range as .f32 and reduced precision (>= 10 bits). The internal layout of tf32 format is implementation defined. PTX facilitates conversion from single precision .f32 type to tf32 format. A register variable containing tf32 data must be declared with .b32 type.

e2m1 data format:

This data format is a 4-bit floating point format with 2 bits for exponent and 1 bit for mantissa. The e2m1 encoding does not support infinity and NaN. e2m1 values must be used in a packed format specified as e2m1x2. A register variable containing two e2m1 values must be declared with .b8 type.

e2m3 data format:

This data format is a 6-bit floating point format with 2 bits for exponent and 3 bits for mantissa. The e2m3 encoding does not support infinity and NaN. e2m3 values must be used in a packed format specified as e2m3x2. A register variable containing two e2m3 values must be declared with .b16 type, where each .b8 element holds a 6-bit floating point value with the 2 MSB bits padded with zeros.

e3m2 data format:

This data format is a 6-bit floating point format with 3 bits for exponent and 2 bits for mantissa. The e3m2 encoding does not support infinity and NaN. e3m2 values must be used in a packed format specified as e3m2x2. A register variable containing two e3m2 values must be declared with .b16 type, where each .b8 element holds a 6-bit floating point value with the 2 MSB bits padded with zeros.

ue8m0 data format:

This data format is an 8-bit unsigned floating-point format with 8 bits for exponent and 0 bits for mantissa. The ue8m0 encoding does not support infinity. The NaN value is limited to 0xff. ue8m0 values must be used in a packed format specified as ue8m0x2. A register variable containing two ue8m0 values must be declared with .b16 type.

ue4m3 data format:

This data format is a 7-bit unsigned floating-point format with 4 bits for exponent and 3 bits for mantissa. The ue4m3 encoding does not support infinity. The NaN value is limited to 0x7f. A register variable containing a single ue4m3 value must be declared with .b8 type, with the MSB bit padded with zero.

Alternate data formats cannot be used as fundamental types. They are supported as source or destination formats by certain instructions.
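As an illustrative sketch, alternate formats are typically produced from .f32 values via the cvt instruction, with the result held in a bit-size register; this assumes a target architecture that supports these conversions (sm_80 or later):

.reg .f32 f;
.reg .b16 h;               // holds a bf16 value
.reg .b32 t;               // holds a tf32 value
cvt.rn.bf16.f32  h, f;     // round .f32 to bf16
cvt.rna.tf32.f32 t, f;     // round .f32 to tf32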

5.2.4. Packed Data Types

Certain PTX instructions operate on two or more sets of inputs in parallel, and produce two or more outputs. Such instructions can use data stored in a packed format. PTX supports packing two or four values of the same scalar data type into a single, larger value. The packed value is considered as a value of a packed data type. In this section we describe the packed data types supported in PTX.

5.2.4.1. Packed Floating Point Data Types

PTX supports various variants of packed floating point data types. Of these, only .f16x2 is supported as a fundamental type; the others cannot be used as fundamental types - they are supported as instruction types on certain instructions. When using an instruction with such non-fundamental types, the operand data variables must be of a bit type of appropriate size. For example, all of the operand variables must be of type .b32 for an instruction with instruction type .bf16x2. Table 9 describes the variants of packed floating point data types in PTX.

Table 9 Operand types for packed floating point instruction types

Packed floating   Number of elements    Type of each   Register variable type
point type        in a packed format    element        to be used in the declaration
.f16x2            Two                   .f16           .f16x2 or .b32
.f32x2            Two                   .f32           .b64
.bf16x2           Two                   .bf16          .b32
.e4m3x2           Two                   .e4m3          .b16
.e5m2x2           Two                   .e5m2          .b16
.e2m3x2           Two                   .e2m3          .b16
.e3m2x2           Two                   .e3m2          .b16
.ue8m0x2          Two                   .ue8m0         .b16
.e2m1x2           Two                   .e2m1          .b8
.e4m3x4           Four                  .e4m3          .b32
.e5m2x4           Four                  .e5m2          .b32
.e2m3x4           Four                  .e2m3          .b32
.e3m2x4           Four                  .e3m2          .b32
.e2m1x4           Four                  .e2m1          .b16

5.2.4.2. Packed Integer Data Types

PTX supports two variants of packed integer data types: .u16x2 and .s16x2. The packed data type consists of two .u16 or .s16 values. A register variable containing .u16x2 or .s16x2 data must be declared with .b32 type. Packed integer data types cannot be used as fundamental types. They are supported as instruction types on certain instructions.
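A brief sketch of packed arithmetic using .b32 operand registers; the register names are invented, and the packed-integer form assumes a target and PTX ISA version that support it:

.reg .b32 pa, pb, pc;      // each .b32 holds two packed 16-bit values
add.rn.f16x2 pc, pa, pb;   // two .f16 additions in one instruction
add.s16x2    pc, pa, pb;   // two .s16 additions, on supporting targets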

5.3. Texture Sampler and Surface Types

PTX includes built-in opaque types for defining texture, sampler, and surface descriptor variables. These types have named fields similar to structures, but all information about layout, field ordering, base address, and overall size is hidden to a PTX program, hence the term opaque. The use of these opaque types is limited to:

  • Variable definition within global (module) scope and in kernel entry parameter lists.

  • Static initialization of module-scope variables using comma-delimited static assignment expressions for the named members of the type.

  • Referencing textures, samplers, or surfaces via texture and surface load/store instructions (tex, suld, sust, sured).

  • Retrieving the value of a named member via query instructions (txq, suq).

  • Creating pointers to opaque variables using mov, e.g., mov.u64 reg, opaque_var;. The resulting pointer may be stored to and loaded from memory, passed as a parameter to functions, and de-referenced by texture and surface load, store, and query instructions, but the pointer cannot otherwise be treated as an address, i.e., accessing the pointer with ld and st instructions, or performing pointer arithmetic, will result in undefined results.

  • Opaque variables may not appear in initializers, e.g., to initialize a pointer to an opaque variable.

Note

Indirect access to textures and surfaces using pointers to opaque variables is supported beginning with PTX ISA version 3.1 and requires target sm_20 or later.

Indirect access to textures is supported only in unified texture mode (see below).

The three built-in types are .texref, .samplerref, and .surfref. For working with textures and samplers, PTX has two modes of operation. In the unified mode, texture and sampler information is accessed through a single .texref handle. In the independent mode, texture and sampler information each have their own handle, allowing them to be defined separately and combined at the site of usage in the program. In independent mode, the fields of the .texref type that describe sampler properties are ignored, since these properties are defined by .samplerref variables.

Table 10 and Table 11 list the named members of each type for unified and independent texture modes. These members and their values have precise mappings to methods and values defined in the texture HW class as well as values exposed via the API.

Table 10 Opaque Type Fields in Unified Texture Mode

Member              .texref values                                   .surfref values
width               in elements                                      in elements
height              in elements                                      in elements
depth               in elements                                      in elements
channel_data_type   enum type corresponding to source language API   same as .texref
channel_order       enum type corresponding to source language API   same as .texref
normalized_coords   0, 1                                             N/A
filter_mode         nearest, linear                                  N/A
addr_mode_0,        wrap, mirror, clamp_ogl, clamp_to_edge,          N/A
addr_mode_1,        clamp_to_border
addr_mode_2
array_size          as number of textures in a texture array         as number of surfaces in a surface array
num_mipmap_levels   as number of levels in a mipmapped texture       N/A
num_samples         as number of samples in a multi-sample texture   N/A
memory_layout       N/A                                              1 for linear memory layout; 0 otherwise

5.3.1. Texture and Surface Properties

Fields width, height, and depth specify the size of the texture or surface in number of elements in each dimension.

The channel_data_type and channel_order fields specify these properties of the texture or surface using enumeration types corresponding to the source language API. For example, see Channel Data Type and Channel Order Fields for the OpenCL enumeration types currently supported in PTX.

5.3.2. Sampler Properties

The normalized_coords field indicates whether the texture or surface uses normalized coordinates in the range [0.0, 1.0) instead of unnormalized coordinates in the range [0, N). If no value is specified, the default is set by the runtime system based on the source language.

The filter_mode field specifies how the values returned by texture reads are computed based on the input texture coordinates.

The addr_mode_{0,1,2} fields define the addressing mode in each dimension, which determine how out-of-range coordinates are handled.

See theCUDA C++ Programming Guide for more details of these properties.

Table 11 Opaque Type Fields in Independent Texture Mode

| Member | .samplerref values | .texref values | .surfref values |
|---|---|---|---|
| width | N/A | in elements | in elements |
| height | N/A | in elements | in elements |
| depth | N/A | in elements | in elements |
| channel_data_type | N/A | enum type corresponding to source language API | enum type corresponding to source language API |
| channel_order | N/A | enum type corresponding to source language API | enum type corresponding to source language API |
| normalized_coords | N/A | 0, 1 | N/A |
| force_unnormalized_coords | 0, 1 | N/A | N/A |
| filter_mode | nearest, linear | ignored | N/A |
| addr_mode_0, addr_mode_1, addr_mode_2 | wrap, mirror, clamp_ogl, clamp_to_edge, clamp_to_border | N/A | N/A |
| array_size | N/A | as number of textures in a texture array | as number of surfaces in a surface array |
| num_mipmap_levels | N/A | as number of levels in a mipmapped texture | N/A |
| num_samples | N/A | as number of samples in a multi-sample texture | N/A |
| memory_layout | N/A | N/A | 1 for linear memory layout; 0 otherwise |

In independent texture mode, the sampler properties are carried in an independent .samplerref variable, and these fields are disabled in the .texref variables. One additional sampler property, force_unnormalized_coords, is available in independent texture mode.

The force_unnormalized_coords field is a property of .samplerref variables that allows the sampler to override the texture header normalized_coords property. This field is defined only in independent texture mode. When True, the texture header setting is overridden and unnormalized coordinates are used; when False, the texture header setting is used.

The force_unnormalized_coords property is used in compiling OpenCL; in OpenCL, the property of normalized coordinates is carried in sampler headers. To compile OpenCL to PTX, texture headers are always initialized with normalized_coords set to True, and the OpenCL sampler-based normalized_coords flag maps (negated) to the PTX-level force_unnormalized_coords flag.
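The negated mapping above can be sketched in a few lines of Python. This is an illustrative model only; the function name and dictionary keys are hypothetical and do not correspond to any real compiler API.

```python
def map_opencl_sampler_to_ptx(sampler_normalized_coords: bool) -> dict:
    """Model the OpenCL-to-PTX mapping: texture headers are always
    initialized with normalized_coords = True, and the OpenCL sampler
    flag maps (negated) to force_unnormalized_coords."""
    return {
        "texref.normalized_coords": True,
        "samplerref.force_unnormalized_coords": not sampler_normalized_coords,
    }

# An OpenCL sampler using unnormalized coordinates forces the override:
assert map_opencl_sampler_to_ptx(False)["samplerref.force_unnormalized_coords"] is True
# A sampler using normalized coordinates leaves the texture header setting in effect:
assert map_opencl_sampler_to_ptx(True)["samplerref.force_unnormalized_coords"] is False
```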

Variables using these types may be declared at module scope or within kernel entry parameter lists. At module scope, these variables must be in the .global state space. As kernel parameters, these variables are declared in the .param state space.

Example

.global .texref     my_texture_name;
.global .samplerref my_sampler_name;
.global .surfref    my_surface_name;

When declared at module scope, the types may be initialized using a list of static expressions assigning values to the named members.

Example

.global .texref tex1;
.global .samplerref tsamp1 = { addr_mode_0 = clamp_to_border,
                               filter_mode = nearest
                             };

5.3.3. Channel Data Type and Channel Order Fields

The channel_data_type and channel_order fields have enumeration types corresponding to the source language API. Currently, OpenCL is the only source language that defines these fields. Table 12 and Table 13 show the enumeration values defined in OpenCL version 1.0 for channel data type and channel order.

Table 12 OpenCL 1.0 Channel Data Type Definition

| Name | Value |
|---|---|
| CL_SNORM_INT8 | 0x10D0 |
| CL_SNORM_INT16 | 0x10D1 |
| CL_UNORM_INT8 | 0x10D2 |
| CL_UNORM_INT16 | 0x10D3 |
| CL_UNORM_SHORT_565 | 0x10D4 |
| CL_UNORM_SHORT_555 | 0x10D5 |
| CL_UNORM_INT_101010 | 0x10D6 |
| CL_SIGNED_INT8 | 0x10D7 |
| CL_SIGNED_INT16 | 0x10D8 |
| CL_SIGNED_INT32 | 0x10D9 |
| CL_UNSIGNED_INT8 | 0x10DA |
| CL_UNSIGNED_INT16 | 0x10DB |
| CL_UNSIGNED_INT32 | 0x10DC |
| CL_HALF_FLOAT | 0x10DD |
| CL_FLOAT | 0x10DE |

Table 13 OpenCL 1.0 Channel Order Definition

| Name | Value |
|---|---|
| CL_R | 0x10B0 |
| CL_A | 0x10B1 |
| CL_RG | 0x10B2 |
| CL_RA | 0x10B3 |
| CL_RGB | 0x10B4 |
| CL_RGBA | 0x10B5 |
| CL_BGRA | 0x10B6 |
| CL_ARGB | 0x10B7 |
| CL_INTENSITY | 0x10B8 |
| CL_LUMINANCE | 0x10B9 |

5.4. Variables

In PTX, a variable declaration describes both the variable’s type and its state space. In addition to fundamental types, PTX supports types for simple aggregate objects such as vectors and arrays.

5.4.1. Variable Declarations

All storage for data is specified with variable declarations. Every variable must reside in one of the state spaces enumerated in the previous section.

A variable declaration names the space in which the variable resides, its type and size, its name, an optional array size, an optional initializer, and an optional fixed address for the variable.

Predicate variables may only be declared in the register state space.

Examples

.global .u32 loc;
.reg    .s32 i;
.const  .f32 bias[] = {-1.0, 1.0};
.global .u8  bg[4] = {0, 0, 0, 0};
.reg    .v4 .f32 accel;
.reg    .pred p, q, r;

5.4.2. Vectors

Limited-length vector types are supported. Vectors of length 2 and 4 of any non-predicate fundamental type can be declared by prefixing the type with .v2 or .v4. Vectors must be based on a fundamental type, and they may reside in the register space. Vectors cannot exceed 128 bits in length; for example, .v4 .f64 is not allowed. Three-element vectors may be handled by using a .v4 vector, where the fourth element provides padding. This is a common case for three-dimensional grids, textures, etc.

Examples

.global .v4 .f32 V;   // a length-4 vector of floats
.shared .v2 .u16 uv;  // a length-2 vector of unsigned ints
.global .v4 .b8  v;   // a length-4 vector of bytes

By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions which require addresses aligned to a multiple of the access size.

5.4.3. Array Declarations

Array declarations are provided to allow the programmer to reserve space. To declare an array, the variable name is followed with dimensional declarations similar to fixed-size array declarations in C. The size of each dimension is a constant expression.

Examples

.local  .u16 kernel[19][19];
.shared .u8  mailbox[128];

The size of the array specifies how many elements should be reserved. For the declaration of array kernel above, 19*19 = 361 halfwords are reserved, for a total of 722 bytes.
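The storage calculation above generalizes to any declaration: multiply the dimension sizes together and scale by the element size. A small Python sketch (the function name is illustrative, not part of any PTX tool):

```python
def reserved_bytes(elem_bytes: int, dims: list) -> int:
    """Bytes reserved for a PTX array declaration: the product of the
    dimension sizes times the size of one element."""
    total = 1
    for d in dims:
        total *= d
    return total * elem_bytes

# .local .u16 kernel[19][19]; -> 361 halfwords, 722 bytes
assert reserved_bytes(2, [19, 19]) == 722
# .shared .u8 mailbox[128];   -> 128 bytes
assert reserved_bytes(1, [128]) == 128
```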

When declared with an initializer, the first dimension of the array may be omitted. The size of the first array dimension is determined by the number of elements in the array initializer.

Examples

.global .u32 index[] = { 0, 1, 2, 3, 4, 5, 6, 7 };
.global .s32 offset[][2] = { {-1, 0}, {0, -1}, {1, 0}, {0, 1} };

Array index has eight elements, and array offset is a 4x2 array.

5.4.4. Initializers

Declared variables may specify an initial value using a syntax similar to C/C++, where the variable name is followed by an equals sign and the initial value or values for the variable. A scalar takes a single value, while vectors and arrays take nested lists of values inside of curly braces (the nesting matches the dimensionality of the declaration).

As in C, array initializers may be incomplete, i.e., the number of initializer elements may be less than the extent of the corresponding array dimension, with remaining array locations initialized to the default value for the specified array type.

Examples

.const  .f32 vals[8] = { 0.33, 0.25, 0.125 };
.global .s32 x[3][2] = { {1,2}, {3} };

is equivalent to

.const  .f32 vals[8] = { 0.33, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0, 0.0 };
.global .s32 x[3][2] = { {1,2}, {3,0}, {0,0} };

Currently, variable initialization is supported only for constant and global state spaces. Variables in constant and global state spaces with no explicit initializer are initialized to zero by default. Initializers are not allowed in external variable declarations.
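The zero-padding rule for incomplete initializers can be modeled directly. The following Python sketch (an illustration, not any assembler's actual implementation) pads a nested initializer list out to the declared extents:

```python
def complete_initializer(init, dims, default=0):
    """Pad a (possibly nested) initializer list so each dimension
    reaches its declared extent, filling with the default value,
    as PTX does for incomplete array initializers."""
    if not dims:
        return init
    head, rest = dims[0], dims[1:]
    empty = default if not rest else []
    padded = list(init) + [empty] * (head - len(init))
    return [complete_initializer(v, rest, default) for v in padded] if rest else padded

# .global .s32 x[3][2] = { {1,2}, {3} };  is equivalent to { {1,2}, {3,0}, {0,0} }
assert complete_initializer([[1, 2], [3]], [3, 2]) == [[1, 2], [3, 0], [0, 0]]
```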

Variable names appearing in initializers represent the address of the variable; this can be used to statically initialize a pointer to a variable. Initializers may also contain var+offset expressions, where offset is a byte offset added to the address of var. Only variables in .global or .const state spaces may be used in initializers. By default, the resulting address is the offset in the variable’s state space (as is the case when taking the address of a variable with a mov instruction). An operator, generic(), is provided to create a generic address for variables used in initializers.

Starting PTX ISA version 7.1, an operator mask() is provided, where mask is an integer immediate. The only allowed expressions in the mask() operator are integer constant expressions and symbol expressions representing the address of a variable. The mask() operator extracts n consecutive bits from the expression used in initializers and inserts these bits at the lowest position of the initialized variable. The number n and the starting position of the bits to be extracted are specified by the integer immediate mask. PTX ISA version 7.1 only supports extracting a single byte starting at a byte boundary from the address of the variable. PTX ISA version 7.3 supports an integer constant expression as an operand in the mask() operator.

Supported values for mask are: 0xFF, 0xFF00, 0xFF0000, 0xFF000000, 0xFF00000000, 0xFF0000000000, 0xFF000000000000, 0xFF00000000000000.
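For the single-byte masks listed above, the effect of mask() can be sketched as a masked shift. This Python model assumes a mask of exactly one contiguous byte, matching the supported values; the function name is illustrative:

```python
def apply_mask(mask: int, value: int) -> int:
    """Model the initializer mask() operator for single-byte masks:
    extract the byte selected by the mask and place it at the lowest
    position of the result."""
    shift = mask.bit_length() - 8  # bit position of the selected byte
    return (value & mask) >> shift

# 0xFF00(...) selects the second-lowest byte; 0xFF(...) the lowest byte.
assert apply_mask(0xFF00, 0x12345678) == 0x56
assert apply_mask(0xFF, 0x12345678) == 0x78
```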

Examples

.const  .u32 foo = 42;
.global .u32 bar[] = { 2, 3, 5 };
.global .u32 p1 = foo;          // offset of foo in .const space
.global .u32 p2 = generic(foo); // generic address of foo

// array of generic-address pointers to elements of bar
.global .u32 parr[] = { generic(bar), generic(bar)+4, generic(bar)+8 };

// examples using the mask() operator
.global .u8 addr[]  = {0xff(foo), 0xff00(foo), 0xff0000(foo), ...};
.global .u8 addr2[] = {0xff(foo+4), 0xff00(foo+4), 0xff0000(foo+4), ...};
.global .u8 addr3[] = {0xff(generic(foo)), 0xff00(generic(foo)), ...};
.global .u8 addr4[] = {0xff(generic(foo)+4), 0xff00(generic(foo)+4), ...};

// mask() operator with integer const expression
.global .u8 addr5[] = { 0xFF(1000 + 546), 0xFF00(131187), ...};

Note

PTX 3.1 redefines the default addressing for global variables in initializers, from generic addresses to offsets in the global state space. Legacy PTX code is treated as having an implicit generic() operator for each global variable used in an initializer. PTX 3.1 code should either include explicit generic() operators in initializers, use cvta.global to form generic addresses at runtime, or load from the non-generic address using ld.global.

Device function names appearing in initializers represent the address of the first instruction in the function; this can be used to initialize a table of function pointers to be used with indirect calls. Beginning in PTX ISA version 3.1, kernel function names can be used as initializers, e.g., to initialize a table of kernel function pointers to be used with CUDA Dynamic Parallelism to launch kernels from the GPU. See the CUDA Dynamic Parallelism Programming Guide for details.

Labels cannot be used in initializers.

Variables that hold addresses of variables or functions should be of type .u8, .u32, or .u64.

Type .u8 is allowed only if the mask() operator is used.

Initializers are allowed for all types except .f16, .f16x2, and .pred.

Examples

.global .s32 n = 10;
.global .f32 blur_kernel[][3]
               = {{.05,.1,.05},{.1,.4,.1},{.05,.1,.05}};
.global .u32 foo[] = { 2, 3, 5, 7, 9, 11 };
.global .u64 ptr = generic(foo);   // generic address of foo[0]
.global .u64 ptr = generic(foo)+8; // generic address of foo[2]

5.4.5. Alignment

Byte alignment of storage for all addressable variables can be specified in the variable declaration. Alignment is specified using an optional .align byte-count specifier immediately following the state-space specifier. The variable will be aligned to an address which is an integer multiple of byte-count. The alignment value byte-count must be a power of two. For arrays, alignment specifies the address alignment for the starting address of the entire array, not for individual elements.

The default alignment for scalar and array variables is to a multiple of the base-type size. The default alignment for vector variables is to a multiple of the overall vector size.

Examples

// allocate array at 4-byte aligned address.  Elements are bytes.
.const .align 4 .b8 bar[8] = {0,0,0,0,2,0,0,0};

Note that all PTX instructions that access memory require that the address be aligned to a multiple of the access size. The access size of a memory instruction is the total number of bytes accessed in memory. For example, the access size of ld.v4.b32 is 16 bytes, while the access size of atom.f16x2 is 4 bytes.
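The default-alignment rules can be summarized in a one-line helper. This Python sketch is purely illustrative (PTX itself computes these values; the function name is made up):

```python
def default_alignment(base_type_bytes: int, vector_length: int = 1) -> int:
    """Default PTX alignment: scalars and arrays align to the base-type
    size; vectors align to the overall vector size (length * base size)."""
    return base_type_bytes * vector_length

assert default_alignment(4) == 4        # .f32 scalar or array of .f32
assert default_alignment(4, 4) == 16    # .v4 .f32 vector
assert default_alignment(2, 2) == 4     # .v2 .u16 vector
```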

5.4.6. Parameterized Variable Names

Since PTX supports virtual registers, it is quite common for a compiler frontend to generate a large number of register names. Rather than require explicit declaration of every name, PTX supports a syntax for creating a set of variables having a common prefix string appended with integer suffixes.

For example, suppose a program uses a large number, say one hundred, of .b32 variables, named %r0, %r1, …, %r99. These 100 register variables can be declared as follows:

.reg .b32 %r<100>;    // declare %r0, %r1, ..., %r99

This shorthand syntax may be used with any of the fundamental types and with any state space, and may be preceded by an alignment specifier. Array variables cannot be declared this way, nor are initializers permitted.
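The expansion performed by this shorthand can be sketched in Python (an illustration of the naming rule, not assembler code):

```python
def expand_param_regs(prefix: str, count: int) -> list:
    """Expand the PTX shorthand prefix<N> into the N names it declares:
    prefix0, prefix1, ..., prefix(N-1)."""
    return [f"{prefix}{i}" for i in range(count)]

# .reg .b32 %r<100>; declares %r0 through %r99
names = expand_param_regs("%r", 100)
assert names[0] == "%r0" and names[-1] == "%r99" and len(names) == 100
```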

5.4.7. Variable Attributes

Variables may be declared with an optional .attribute directive which allows specifying special attributes of variables. Keyword .attribute is followed by an attribute specification in parentheses. Multiple attributes are separated by commas.

Variable and Function Attribute Directive: .attribute describes the .attribute directive.

5.4.8. Variable and Function Attribute Directive: .attribute

.attribute

Variable and function attributes

Description

Used to specify special attributes of a variable or a function.

The following attributes are supported.

.managed

The .managed attribute specifies that the variable will be allocated at a location in the unified virtual memory environment where the host and other devices in the system can reference the variable directly. This attribute can only be used with variables in the .global state space. See the CUDA UVM-Lite Programming Guide for details.

.unified

The .unified attribute specifies that the function has the same memory address on the host and on other devices in the system. Integer constants uuid1 and uuid2 respectively specify the upper and lower 64 bits of the unique identifier associated with the function or the variable. This attribute can only be used on device functions or on variables in the .global state space. Variables with the .unified attribute are read-only and must be loaded by specifying the .unified qualifier on the address operand of the ld instruction, otherwise the behavior is undefined.

PTX ISA Notes

  • Introduced in PTX ISA version 4.0.

  • Support for function attributes introduced in PTX ISA version 8.0.

Target ISA Notes

  • .managed attribute requires sm_30 or higher.

  • .unified attribute requires sm_90 or higher.

Examples

.global .attribute(.managed) .s32 g;
.global .attribute(.managed) .u64 x;
.global .attribute(.unified(19,95)) .f32 f;
.func .attribute(.unified(0xAB, 0xCD)) bar() { ... }

5.5. Tensors

A tensor is a multi-dimensional matrix structure in memory. A tensor is defined by the following properties:

  • Dimensionality

  • Dimension sizes across each dimension

  • Individual element types

  • Tensor stride across each dimension

PTX supports instructions which can operate on the tensor data. PTX Tensor instructions include:

  • Copying data between global and shared memories

  • Reducing the destination tensor data with the source.

The Tensor data can be operated on by various wmma.mma, mma and wgmma.mma_async instructions.

PTX Tensor instructions treat the tensor data in the global memory as a multi-dimensional structure and treat the data in the shared memory as linear data.

5.5.1. Tensor Dimension, Size and Format

Tensors can have dimensions: 1D, 2D, 3D, 4D or 5D.

Each dimension has a size which represents the number of elements along the dimension. The elements can have one of the following types:

  • Bit-sized type: .b32, .b64

  • Sub-byte types: .b4x16, .b4x16_p64, .b6x16_p32, .b6p2x16

  • Integer: .u8, .u16, .u32, .s32, .u64, .s64

  • Floating point and alternate floating point: .f16, .bf16, .tf32, .f32, .f64 (rounded to nearest even).

A tensor can have padding at the end in each of the dimensions to provide alignment for the data in the subsequent dimensions. Tensor stride can be used to specify the amount of padding in each dimension.
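A padded tensor is simply a strided layout: the stride of a dimension can be larger than its size, with the gap acting as padding. The following Python sketch models how an element's byte offset follows from per-dimension strides; the names and conventions here are illustrative, not the TensorMap encoding itself:

```python
def element_offset(coords, strides_elems, elem_bytes):
    """Byte offset of a tensor element in a strided (padded) layout:
    the dot product of the coordinates with the per-dimension strides
    (given in elements), scaled by the element size."""
    return sum(c * s for c, s in zip(coords, strides_elems)) * elem_bytes

# A 2D tensor 10 elements wide, padded so each row occupies a stride of
# 12 elements; 4-byte elements. Element (x=2, y=3):
assert element_offset((2, 3), (1, 12), 4) == 152   # (2 + 3*12) * 4
```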

5.5.1.1. Sub-byte Types

5.5.1.1.1. Padding and alignment of the sub-byte types

The sub-byte types are expected to be packed contiguously in the global memory, and the Tensor copy instruction will expand them by appending empty spaces as shown below:

  1. Type .b4x16: With this type, there is no padding involved, and the packed sixteen .b4 elements in a 64-bit container are copied as is between the shared memory and the global memory.

  2. Type .b4x16_p64: With this type, sixteen contiguous 4-bit elements are copied from global memory to the shared memory with 64 bits of padding appended, as shown in Figure 5.

    _images/tensor-dimension-size-format-sub-bytes-padding-align-b4-16-p64.png

    Figure 5 Layout for .b4x16_p64

    The padded region that gets added is un-initialized.

  3. Type .b6x16_p32: With this type, sixteen 6-bit elements are copied from global memory to the shared memory with 32 bits of padding appended, as shown in Figure 6.

    _images/tensor-dimension-size-format-sub-bytes-padding-align-b6-16-p32.png

    Figure 6 Layout for .b6x16_p32

    The padded region that gets added is un-initialized.

  4. Type .b6p2x16: With this type, sixteen elements, each containing 6 bits of data at the LSB and 2 bits of padding at the MSB, are copied from shared memory into the global memory by discarding the 2 bits of padding and packing the 6-bit data contiguously, as shown in Figure 7.

    _images/tensor-dimension-size-format-sub-bytes-padding-align-b6-p2-16.png

    Figure 7 Layout for .b6p2x16

In case of .b6x16_p32 and .b4x16_p64, the padded region that gets added is un-initialized.

The types .b6x16_p32 and .b6p2x16 share the same encoding value in the descriptor (value 15), as the two types are applicable for different tensor copy directions:

| Type | Valid Tensor Copy Direction |
|---|---|
| .b6x16_p32 | .shared::cluster.global, .shared::cta.global |
| .b6p2x16 | .global.shared::cta |

5.5.2. Tensor Access Modes

Tensor data can be accessed in two modes:

  • Tiled mode:

    In tiled mode, the source multi-dimensional tensor layout is preserved at the destination.

  • Im2col mode:

    In im2col mode, the elements in the Bounding Box of the source tensor are rearranged into columns at the destination. Refer to the im2col mode section below for more details.

5.5.3. Tiled Mode

This section describes how tensors and tensor accesses work in tiled mode.

5.5.3.1. Bounding Box

A tensor can be accessed in chunks known as Bounding Boxes. A Bounding Box has the same dimensionality as the tensor it accesses. The size of each Bounding Box must be a multiple of 16 bytes, and the address of the Bounding Box must also be aligned to 16 bytes.

A Bounding Box has the following access properties:

  • Bounding Box dimension sizes

  • Out of boundary access mode

  • Traversal strides

The tensor coordinates, specified in the PTX tensor instructions, give the starting offset of the Bounding Box. The starting offset of the Bounding Box, together with the rest of the Bounding Box information, determines the elements to be accessed.

5.5.3.2. Traversal-Stride

While the Bounding Box is iterating the tensor across a dimension, the traversal stride specifies the number of elements to be skipped. If no elements are to be skipped, the default value of 1 must be specified.

The traversal stride in dimension 0 can be used for the Interleave layout. For a non-interleaved layout, the traversal stride in dimension 0 must always be 1.
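One plausible reading of the traversal stride, sketched in Python: a stride of 1 visits consecutive elements, while a stride of k touches every k-th element along the dimension. This is an illustrative model under that assumption, not the hardware's exact iteration order:

```python
def accessed_indices(start: int, count: int, stride: int) -> list:
    """Indices touched along one dimension when `count` elements are
    traversed from `start` with the given traversal stride."""
    return [start + i * stride for i in range(count)]

assert accessed_indices(0, 4, 1) == [0, 1, 2, 3]   # default stride: contiguous
assert accessed_indices(2, 3, 2) == [2, 4, 6]      # stride 2: skip every other element
```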

Figure 8 illustrates tensor, tensor size, tensor stride, Bounding Box size and traversal stride.

_images/tensor-tiled-mode-bounding-box-example.png

Figure 8 Tiled mode bounding box, tensor size and traversal stride

5.5.3.3. Out of Boundary Access

The PTX Tensor operation can detect and handle the case when the Bounding Box crosses the tensor boundary in any dimension. There are two modes:

  • Zero fill mode:

    Elements in the Bounding Box which fall outside of the tensor boundary are set to 0.

  • OOB-NaN fill mode:

    Elements in the Bounding Box which fall outside of the tensor boundary are set to a special NaN called OOB-NaN.
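A minimal Python sketch of the two fill modes, using an ordinary float NaN to stand in for the special OOB-NaN value (the function and mode names are illustrative):

```python
import math

def read_with_fill(tensor, idx, mode="zero"):
    """Read one element along a single dimension; an out-of-range index
    is filled according to the mode: 'zero' fill or 'nan' fill."""
    if 0 <= idx < len(tensor):
        return tensor[idx]
    return 0.0 if mode == "zero" else math.nan

data = [1.0, 2.0, 3.0]
assert read_with_fill(data, 1) == 2.0               # in bounds
assert read_with_fill(data, 5) == 0.0               # zero fill mode
assert math.isnan(read_with_fill(data, -1, "nan"))  # OOB-NaN fill mode
```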

Figure 9 shows an example of the out of boundary access.

_images/tensor-oob-access.png

Figure 9 Out of boundary access

5.5.3.4. .tile::scatter4 and .tile::gather4 modes

These modes are similar to the tiled mode, with the restriction that they work only on 2D tensor data. Tile::scatter4 and Tile::gather4 modes are used to access multiple non-contiguous rows of tensor data.

In Tile::scatter4 mode, a single 2D source tensor is divided into four rows in the 2D destination tensor. In Tile::gather4 mode, four rows in the source 2D tensor are combined to form a single 2D destination tensor.

These modes work on four rows and hence the instruction will take:

  1. four tensor coordinates across the dimension 0

  2. one tensor coordinate across the dimension 1

The interleave layout is not supported for .tile::scatter4 and .tile::gather4 modes.

All other constraints and rules of the tile mode apply to these modes as well.

5.5.3.4.1. Bounding Box

For Tile::scatter4 and Tile::gather4 modes, the four request coordinates form four Bounding Boxes in the tensor space.

Figure 10 shows an example of the same with start coordinates (1, 2), (1, 5), (1, 0) and (1, 9).

The size of the bounding box in dimension 0 represents the length of the rows. The size of the bounding box in dimension 1 must be one.

_images/tiled-scatter4-gather4-bounding-box.png

Figure 10 tiled::scatter4/tiled::gather4 mode bounding box example

5.5.4. im2col mode

Im2col mode supports the following tensor dimensions: 3D, 4D and 5D. In this mode, the tensor data is treated as a batch of images with the following properties:

  • N : number of images in the batch

  • D, H, W : size of a 3D image (depth, height and width)

  • C : channels per image element

The above properties are associated with 3D, 4D and 5D tensors as follows:

| Dimension | N/D/H/W/C applicability |
|---|---|
| 3D | NWC |
| 4D | NHWC |
| 5D | NDHWC |

5.5.4.1. Bounding Box

In im2col mode, the Bounding Box is defined in DHW space. Boundaries along the other dimensions are specified by the Pixels-per-Column and Channels-per-Pixel parameters as described below.

The dimensionality of the Bounding Box is two less than the tensor dimensionality.

The following properties describe how the elements are accessed in im2col mode:

  • Bounding-Box Lower-Corner

  • Bounding-Box Upper-Corner

  • Pixels-per-Column

  • Channels-per-Pixel

Bounding-box Lower-Corner and Bounding-box Upper-Corner specify the two opposite corners of the Bounding Box in the DHW space. Bounding-box Lower-Corner specifies the corner with the smallest coordinate and Bounding-box Upper-Corner specifies the corner with the largest coordinate.

Bounding-box Upper- and Lower-Corners are 16-bit signed values whose limits vary across the dimensions as shown below:

|  | 3D | 4D | 5D |
|---|---|---|---|
| Upper-/Lower-Corner sizes | [-2^15, 2^15-1] | [-2^7, 2^7-1] | [-2^4, 2^4-1] |

Figure 11 and Figure 12 show the Upper-Corners and Lower-Corners.

_images/tensor-im2col-mode-bounding-box1.png

Figure 11 im2col mode bounding box example 1

_images/tensor-im2col-mode-bounding-box2.png

Figure 12 im2col mode bounding box example 2

The Bounding-box Upper- and Lower-Corners specify only the boundaries and not the number of elements to be accessed. Pixels-per-Column specifies the number of elements to be accessed in the NDHW space.

Channels-per-Pixel specifies the number of elements to access across the C dimension.

The tensor coordinates, specified in the PTX tensor instructions, behave differently in different dimensions:

  • Across N and C dimensions: they specify the starting offsets along the dimension, similar to the tiled mode.

  • Across DHW dimensions: they specify the location of the convolution filter base in the tensor space. The filter corner location must be within the bounding box.

The im2col offsets, specified in the PTX tensor instructions in im2col mode, are added to the filterbase coordinates to determine the starting location in the tensor space from where the elements areaccessed.
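The start-location rule above is a simple coordinate-wise addition. The following Python sketch illustrates it under the stated assumption that both tuples are DHW-space coordinates; the function name and the example values are hypothetical:

```python
def im2col_start(filter_base, im2col_offsets):
    """Starting location of an im2col access: the filter base coordinates
    from the instruction plus the im2col offsets, component-wise."""
    return tuple(b + o for b, o in zip(filter_base, im2col_offsets))

# e.g. a filter base at (H, W) = (7, 4) with im2col offsets (2, 2)
assert im2col_start((7, 4), (2, 2)) == (9, 6)
# zero offsets leave the filter base unchanged
assert im2col_start((7, 4), (0, 0)) == (7, 4)
```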

The size of the im2col offsets varies across the dimensions, and their valid ranges are as shown below:

|  | 3D | 4D | 5D |
|---|---|---|---|
| im2col offsets range | [0, 2^16-1] | [0, 2^8-1] | [0, 2^5-1] |

Following are some examples of the im2col mode accesses:

  • Example 1 (Figure 13):

    Tensor Size[0] = 64
    Tensor Size[1] = 9
    Tensor Size[2] = 14
    Tensor Size[3] = 64
    Pixels-per-Column = 64
    channels-per-pixel = 8
    Bounding-Box Lower-Corner W = -1
    Bounding-Box Lower-Corner H = -1
    Bounding-Box Upper-Corner W = -1
    Bounding-Box Upper-Corner H = -1
    tensor coordinates = (7, 7, 4, 0)
    im2col offsets : (0, 0)

    _images/tensor-im2col-mode-example1.png

    Figure 13 im2col mode example 1

  • Example 2 (Figure 14):

    Tensor Size[0] = 64
    Tensor Size[1] = 9
    Tensor Size[2] = 14
    Tensor Size[3] = 64
    Pixels-per-Column = 64
    channels-per-pixel = 8
    Bounding-Box Lower-Corner W = 0
    Bounding-Box Lower-Corner H = 0
    Bounding-Box Upper-Corner W = -2
    Bounding-Box Upper-Corner H = -2
    tensor coordinates = (7, 7, 4, 0)
    im2col offsets : (2, 2)

    _images/tensor-im2col-mode-example2.png

    Figure 14 im2col mode example 2

5.5.4.2. Traversal Stride

The traversal stride, in im2col mode, does not impact the total number of elements (or pixels) being accessed, unlike the tiled mode. Pixels-per-Column determines the total number of elements being accessed in im2col mode.

The number of elements traversed along the D, H and W dimensions is strided by the traversal stride for that dimension.

The following example with Figure 15 illustrates accesses with traversal strides:

Tensor Size[0] = 64
Tensor Size[1] = 8
Tensor Size[2] = 14
Tensor Size[3] = 64
Traversal Stride = 2
Pixels-per-Column = 32
channels-per-pixel = 16
Bounding-Box Lower-Corner W = -1
Bounding-Box Lower-Corner H = -1
Bounding-Box Upper-Corner W = -1
Bounding-Box Upper-Corner H = -1
Tensor coordinates in the instruction = (7, 7, 5, 0)
Im2col offsets in the instruction : (1, 1)
_images/tensor-im2col-mode-example3.png

Figure 15 im2col mode traversal stride example

5.5.4.3. Out of Boundary Access

In im2col mode, when the number of requested pixels in NDHW space specified by Pixels-per-Column exceeds the number of available pixels in the image batch, an out-of-bounds access is performed.

Similar to tiled mode, zero fill or OOB-NaN fill is performed based on the Fill-Mode specified.

5.5.5. im2col::w and im2col::w::128 modes

These modes are similar to the im2col mode, with the restriction that elements are accessed across the W dimension only, while keeping the H and D dimensions constant.

All the constraints and rules of the im2col mode apply to these modes as well.

The number of elements accessed in the im2col::w::128 mode is fixed and is equal to 128. The number of elements accessed in the im2col::w mode depends on the Pixels-per-Column field in the TensorMap.

5.5.5.1. Bounding Box

In these modes, the size of the bounding box in the D and H dimensions is 1.

The D and H dimensions in the tensor coordinates argument in the PTX instruction specify the position of the bounding box in the tensor space.

The Bounding-Box Lower-Corner-W and Bounding-Box Upper-Corner-W specify the two opposite corners of the Bounding Box in the W dimension.

The W dimension in the tensor coordinates argument in the PTX instruction specifies the location of the first element that is to be accessed in the bounding box.

The number of pixels loaded in im2col::w mode is as specified by Pixels-per-Column in the TensorMap. The number of pixels loaded in im2col::w::128 mode is always 128, so Pixels-per-Column is ignored in im2col::w::128 mode.

Figure 16 shows an example of the im2col::w and im2col::w::128 modes.

_images/tensor-im2col-w-w128-modes-example.png

Figure 16 im2col::w and im2col::w::128 modes example

The first element can lie outside of the Bounding Box in the W dimension only, and only on the left side of the Bounding Box. Figure 17 shows an example of this.

_images/tensor-im2col-w-w128-modes-example2.png

Figure 17 im2col::w and im2col::w::128 modes first element outside Bounding Box example

5.5.5.2. Traversal Stride

This is similar to im2col mode, except that the number of elements traversed along only the W dimension is strided by the traversal stride as specified in the TensorMap.

5.5.5.3. wHalo

In im2col::w mode, the wHalo argument in the PTX instruction specifies how many filter halo elements must be loaded at the end of the image.

In im2col::w::128 mode, the halo elements are loaded after every 32 elements in the bounding box along the W dimension. The wHalo argument in the PTX instruction specifies how many halo elements must be loaded after every 32 elements.

Following is an example of .im2col::w mode access:

Tensor Size [0] = 128
Tensor Size [1] = 9
Tensor Size [2] = 7
Tensor Size [3] = 64
Pixels-per-column = 128
Channels-per-pixel = 64
Bounding Box Lower Corner W = 0
Bounding Box Upper Corner W = 0
Tensor Coordinates in the instruction = (7, 2, 3, 0)
wHalo in the instruction = 2 (as 3x3 convolution filter is used)

A tensor copy operation with the above parameters loads 128 pixels and the two halo pixels as shown in Figure 18.

_images/tensor-im2col-w-w128-modes-example3.png

Figure 18 tensor copy operation with im2col::w mode example

The halo pixels are always loaded in the shared memory next to the main row pixels as shown in Figure 18.

Following is an example of .im2col::w::128 mode access:

Tensor Size [0] = 128
Tensor Size [1] = 9
Tensor Size [2] = 7
Tensor Size [3] = 64
Channels-per-pixel = 64
Bounding Box Lower Corner W = 0
Bounding Box Upper Corner W = 0
Tensor Coordinates in the instruction = (7, 2, 3, 0)
wHalo in the instruction = 2 (as 3x3 convolution filter is used)

A tensor copy operation with the above parameters loads 128 elements such that after every 32 elements, wHalo number of elements are loaded, as shown in Figure 19.

_images/tensor-im2col-w-w128-modes-example4.png

Figure 19 tensor copy operation with im2col::w::128 mode example

5.5.5.4. wOffset

In the convolution calculations, the same elements along the W dimension are reused for different locations within the convolution filter footprint. Based on the number of times a pixel is used, the pixels may be loaded into different shared memory buffers. Each buffer can be loaded by a separate tensor copy operation.

The wOffset argument in the tensor copy and prefetch instructions adjusts the source pixel location for each buffer. The exact position of the buffer is adjusted along the W dimension using the following formula:

Bounding Box Lower Corner W += wOffset
Bounding Box Upper Corner W += wOffset
W += wOffset
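The adjustment formula can be sketched directly in Python (a literal transcription of the three assignments above; the function name is illustrative):

```python
def apply_w_offset(lower_w, upper_w, w, w_offset):
    """Shift the W-dimension bounding-box corners and the W start
    coordinate by wOffset, per the formula above."""
    return lower_w + w_offset, upper_w + w_offset, w + w_offset

# A buffer with wOffset = 1 shifts the whole W window right by one:
assert apply_w_offset(-1, 0, -1, 1) == (0, 1, 0)
# wOffset = 0 leaves the window unchanged:
assert apply_w_offset(-1, 0, -1, 0) == (-1, 0, -1)
```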

Following are examples of tensor copy to multiple buffers with various wHalo and wOffset values:

Example 1:

Tensor Size [0] = 128
Tensor Size [1] = 9
Tensor Size [2] = 67
Tensor Size [3] = 64
Pixels-per-Column = 128
Channels-per-pixel = 64
Bounding Box Lower Corner W = -1
Bounding Box Upper Corner W = 0
Traversal Stride = 2
Tensor Coordinates in the instruction = (7, 2, -1, 0)

Shared memory buffer 1:
   wHalo = 1
   wOffset = 0
Shared memory buffer 2:
   wHalo = 0
   wOffset = 1
_images/tensor-im2col-w-w128-modes-example5.png

Figure 20 tensor copy operation to buffer 1 of Example 1

_images/tensor-im2col-w-w128-modes-example6.png

Figure 21 tensor copy operation to buffer 2 of Example 1

Example 2:

Tensor Size [0] = 128
Tensor Size [1] = 7
Tensor Size [2] = 7
Tensor Size [3] = 64
Pixels-per-Column = 128
Channels-per-pixel = 64
Bounding Box Lower Corner W = -1
Bounding Box Upper Corner W = -1
Traversal Stride = 3
Tensor Coordinates in the instruction = (7, 2, -1, 0)

Shared memory buffer 1:
   wHalo = 0
   wOffset = 0
Shared memory buffer 2:
   wHalo = 0
   wOffset = 1
Shared memory buffer 3:
   wHalo = 0
   wOffset = 2
_images/tensor-im2col-w-w128-modes-example7.png

Figure 22 tensor copy operation to buffer 1 of Example 2

_images/tensor-im2col-w-w128-modes-example8.png

Figure 23 tensor copy operation to buffer 2 of Example 2

_images/tensor-im2col-w-w128-modes-example9.png

Figure 24tensor copy operation to buffer 3 of Example 2

5.5.6.Interleave layout

Tensors can be interleaved, and the following interleave layouts are supported:

  • No interleave (NDHWC)

  • 8 byte interleave (NC/8DHWC8): C8 utilizes 16 bytes in memory assuming 2B per channel.

  • 16 byte interleave (NC/16HWC16): C16 utilizes 32 bytes in memory assuming 4B per channel.

The C information is organized in slices where sequential C elements are grouped in 16 byte or 32 byte quantities.

If the total number of channels is not a multiple of the number of channels per slice, then the last slice must be padded with zeros to make it a complete 16B or 32B slice.
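The padding rule above can be sketched in Python (an illustrative helper, not part of PTX; the function name is made up):

```python
# Hypothetical helper: how many interleave slices a channel count needs,
# and how many zero channels pad the last slice to a complete 16B/32B slice.
def interleave_slices(total_channels: int, channels_per_slice: int):
    """Return (slice_count, zero_padded_channels_in_last_slice)."""
    full, rem = divmod(total_channels, channels_per_slice)
    if rem == 0:
        return full, 0
    return full + 1, channels_per_slice - rem

# NC/8DHWC8 groups 8 channels per 16B slice: 100 channels need 13 slices,
# with the last slice zero-padded by 4 channels.
print(interleave_slices(100, 8))
```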

Interleaved layouts are supported only for the dimensionalities: 3D, 4D and 5D.

The interleave layout is not supported for .im2col::w and .im2col::w::128 modes.

5.5.7.Swizzling Modes

The layout of the data in shared memory can be different from that of global memory, for access performance reasons. The following describes the various swizzling modes:

  • No swizzle mode:

    There is no swizzling in this mode and the destination data layout is exactly the same as the source data layout.

    0  1  2  3  4  5  6  7
    0  1  2  3  4  5  6  7

    … Pattern repeats …

  • 32 byte swizzle mode:

    The following table, where each element (numbered cell) is 16 bytes and the starting address is 256-byte aligned, shows the pattern of the destination data layout:

    0  1  2  3  4  5  6  7
    1  0  3  2  5  4  7  6

    … Pattern repeats …

    An example of the 32 byte swizzle mode for an NC/(32B)HWC(32B) tensor of 1x2x10x10xC16 dimension, with the innermost dimension holding a slice of 16 channels with 2 bytes/channel, is shown in Figure 25.

    _images/tensor-32B-swizzle.png

    Figure 25 32-byte swizzle mode example

    Figure 26 shows the two fragments of the tensor: one for C/(32B) = 0 and another for C/(32B) = 1.

    _images/tensor-32B-swizzle-frag.png

    Figure 26 32-byte swizzle mode fragments

    Figure 27 shows the destination data layout with 32-byte swizzling.

    _images/tensor-32B-swizzle-dst.png

    Figure 27 32-byte swizzle mode destination data layout

  • 64 byte swizzle mode:

    The following table, where each element (numbered cell) is 16 bytes and the starting address is 512-byte aligned, shows the pattern of the destination data layout:

    0  1  2  3  4  5  6  7
    1  0  3  2  5  4  7  6
    2  3  0  1  6  7  4  5
    3  2  1  0  7  6  5  4

    … Pattern repeats …

    An example of the 64 byte swizzle mode for an NHWC tensor of 1x10x10x64 dimension, with 2 bytes/channel and 32 channels, is shown in Figure 28.

    _images/tensor-64B-swizzle.png

    Figure 28 64-byte swizzle mode example

    Each colored cell represents 8 channels. Figure 29 shows the source data layout.

    _images/tensor-64B-swizzle-src.png

    Figure 29 64-byte swizzle mode source data layout

    Figure 30 shows the destination data layout with 64-byte swizzling.

    _images/tensor-64B-swizzle-dst.png

    Figure 30 64-byte swizzle mode destination data layout

  • 96 byte swizzle mode:

    The following table, where each element (numbered cell) is 16 bytes, shows the swizzling pattern at the destination data layout:

    0  1  2  3  4  5  6  7
    1  0  3  2  5  4  7  6

    … Pattern repeats …

    An example of the data layout in global memory and its swizzled data layout in shared memory, where each element (colored cell) is 16 bytes and the starting address is 256-byte aligned, is shown in Figure 31.

    _images/tensor-96B-swizzle.png

    Figure 31 96-byte swizzle mode example

  • 128 byte swizzle mode:

    The 128-byte swizzling mode supports the following sub-modes:

    • 16-byte atomicity sub-mode:

      In this sub-mode, each 16-byte chunk of data is kept intact while swizzling.

    The following table, where each element (numbered cell) is 16 bytes and the starting address is 1024-byte aligned, shows the pattern of the destination data layout:

    0  1  2  3  4  5  6  7
    1  0  3  2  5  4  7  6
    2  3  0  1  6  7  4  5
    3  2  1  0  7  6  5  4
    4  5  6  7  0  1  2  3
    5  4  7  6  1  0  3  2
    6  7  4  5  2  3  0  1
    7  6  5  4  3  2  1  0

    … Pattern repeats …

    An example of the 128 byte swizzle mode for an NHWC tensor of 1x10x10x64 dimension, with 2 bytes/channel and 64 channels, is shown in Figure 32.

    _images/tensor-128B-swizzle.png

    Figure 32 128-byte swizzle mode example

    Each colored cell represents 8 channels. Figure 33 shows the source data layout.

    _images/tensor-128B-swizzle-src.png

    Figure 33 128-byte swizzle mode source data layout

    Figure 34 shows the destination data layout with 128-byte swizzling.

    _images/tensor-128B-swizzle-dst.png

    Figure 34 128-byte swizzle mode destination data layout
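The numbered patterns in the swizzle tables of this section follow a simple XOR rule (an observation drawn from the tables, not normative wording from the spec): in row i of the repeating pattern, column j holds the 16-byte element j ^ (i % k), where k is 2 for the 32-byte and 96-byte modes, 4 for the 64-byte mode, and 8 for the 128-byte mode with 16-byte atomicity. A Python sketch:

```python
# Regenerate the destination-layout rows for a swizzle mode; k is the
# number of distinct rows before the pattern repeats (1 = no swizzle,
# 2 = 32B/96B, 4 = 64B, 8 = 128B with 16-byte atomicity).
def swizzle_rows(k: int):
    return [[j ^ (i % k) for j in range(8)] for i in range(k)]

# The 64-byte swizzle table, reproduced by the XOR rule.
for row in swizzle_rows(4):
    print(row)
```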

    • 32-byte atomicity sub-mode:

      In this sub-mode, each 32-byte chunk of data is kept intact while swizzling.

      The following table, where each element (numbered cell) is 16 bytes, shows the swizzling pattern at the destination data layout:

      0 1   2 3   4 5   6 7
      2 3   0 1   6 7   4 5
      4 5   6 7   0 1   2 3
      6 7   4 5   2 3   0 1

      … Pattern repeats …

      This sub-mode requires 32-byte alignment at shared memory.

      An example of the data layout in global memory and its swizzled data layout in shared memory, where each element (colored cell) is 16 bytes, is shown in Figure 35.

      _images/tensor-128B-swizzle-32B-atom.png

      Figure 35 128-byte swizzle mode example with 32-byte atomicity

    • 32-byte atomicity with 8-byte flip sub-mode:

      The swizzling pattern for this sub-mode is similar to the 32-byte atomicity sub-mode, except that there is a flip of adjacent 8-bytes within the 16-byte data at every alternate shared memory line.

      An example of the data layout in global memory and its swizzled data layout in shared memory, where each element (colored cell) is 16 bytes (the two 8-byte sub-elements of each 16-byte colored cell are shown to illustrate the flip), is shown in Figure 36.

      _images/tensor-128B-swizzle-32B-atom-8B-flip.png

      Figure 36 128-byte swizzle mode example with 32-byte atomicity with 8-byte flip

    • 64-byte atomicity sub-mode:

      In this sub-mode, each 64-byte chunk of data is kept intact while swizzling.

      The following table, where each element (numbered cell) is 16 bytes, shows the swizzling pattern at the destination data layout:

      0 1 2 3   4 5 6 7
      4 5 6 7   0 1 2 3

      … Pattern repeats …

      This sub-mode requires 64-byte alignment at shared memory.

      An example of the data layout in global memory and its swizzled data layoutin shared memory where each element (colored cell) is 16 bytes is showninFigure 37

      _images/tensor-128B-swizzle-64B-atom.png

      Figure 37128-byte swizzle mode example with 64-byte atomicity
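The three atomicity sub-mode patterns can be described by one rule (again an observation from the tables, not spec wording): elements are grouped into units of atomicity/16 elements, and in row i the group at position p is drawn from group p ^ i. A sketch:

```python
# Rows of the 128-byte swizzle pattern for a given atomicity (16, 32, or
# 64 bytes); each row lists eight 16-byte elements.
def swizzle_128b(atomicity_bytes: int):
    group = atomicity_bytes // 16      # elements that stay intact together
    ngroups = 8 // group               # groups per 128-byte row
    rows = []
    for i in range(ngroups):
        row = []
        for p in range(ngroups):
            src = p ^ i                # swizzled group index
            row.extend(range(src * group, (src + 1) * group))
        rows.append(row)
    return rows

# 64-byte atomicity: [0..7], then the two 64-byte halves exchanged.
for row in swizzle_128b(64):
    print(row)
```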

Table 14 lists the valid combinations of swizzle-atomicity with the swizzling mode.

Table 14 Valid combination of swizzle-atomicity with swizzling-mode

Swizzling Mode         Swizzle-Atomicity
No Swizzling           -
32B Swizzling Mode     16B
64B Swizzling Mode     16B
96B Swizzling Mode     16B
128B Swizzling Mode    16B, 32B, 32B + 8B-flip, 64B

The value of the swizzle base offset is 0 when the dstMem shared memory address is located at the following boundary:

Swizzling Mode      Starting address of the repeating pattern
128-Byte swizzle    1024-Byte boundary
96-Byte swizzle     256-Byte boundary
64-Byte swizzle     512-Byte boundary
32-Byte swizzle     256-Byte boundary

Otherwise, the swizzle base offset is a non-zero value, computed using the following formula:

Swizzling Mode      Formula
128-Byte swizzle    base offset = (dstMem / 128) % 8
96-Byte swizzle     base offset = (dstMem / 128) % 2
64-Byte swizzle     base offset = (dstMem / 128) % 4
32-Byte swizzle     base offset = (dstMem / 128) % 2
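The base-offset formulas above transcribe directly into a small sketch (illustrative Python; dstMem is a shared memory byte address):

```python
# Swizzle base offset for a given swizzling mode and dstMem address,
# following the formula table above.
def swizzle_base_offset(mode_bytes: int, dst_mem: int) -> int:
    modulus = {128: 8, 96: 2, 64: 4, 32: 2}[mode_bytes]
    return (dst_mem // 128) % modulus

# At the boundaries listed in the previous table the base offset is 0:
assert swizzle_base_offset(128, 1024) == 0
assert swizzle_base_offset(64, 512) == 0
# Elsewhere it is non-zero, e.g. a 128-byte-swizzle destination at 0x1180:
print(swizzle_base_offset(128, 0x1180))
```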

5.5.8.Tensor-map

The tensor-map is a 128-byte opaque object, either in .const space, .param (kernel function parameter) space, or .global space, which describes the tensor properties and the access properties of the tensor data described in the previous sections.

Tensor-maps can be created using CUDA APIs. Refer to the CUDA Programming Guide for more details.

6.Instruction Operands

6.1.Operand Type Information

All operands in instructions have a known type from their declarations. Each operand type must be compatible with the type determined by the instruction template and instruction type. There is no automatic conversion between types.

The bit-size type is compatible with every type having the same size. Integer types of a common size are compatible with each other. Operands having a type different from but compatible with the instruction type are silently cast to the instruction type.

6.2.Source Operands

The source operands are denoted in the instruction descriptions by the names a, b, and c. PTX describes a load-store machine, so operands for ALU instructions must all be in variables declared in the .reg register state space. For most operations, the sizes of the operands must be consistent.

The cvt (convert) instruction takes a variety of operand types and sizes, as its job is to convert from nearly any data type to any other data type (and size).

The ld, st, mov, and cvt instructions copy data from one location to another. The ld and st instructions move data from/to addressable state spaces to/from registers. The mov instruction copies data between registers.

Most instructions have an optional predicate guard that controls conditional execution, and a few instructions have additional predicate source operands. Predicate operands are denoted by the names p, q, r, s.

6.3.Destination Operands

PTX instructions that produce a single result store the result in the field denoted by d (for destination) in the instruction descriptions. The result operand is a scalar or vector variable in the register state space.

6.4.Using Addresses, Arrays, and Vectors

Using scalar variables as operands is straightforward. The interesting capabilities begin with addresses, arrays, and vectors.

6.4.1.Addresses as Operands

All the memory instructions take an address operand that specifies the memory location being accessed. This addressable operand is one of:

[var]

the name of an addressable variable var.

[reg]

an integer or bit-size type register reg containing a byte address.

[reg+immOff]

a sum of register reg containing a byte address plus a constant integer byte offset (signed, 32-bit).

[var+immOff]

a sum of the address of addressable variable var plus a constant integer byte offset (signed, 32-bit).

[immAddr]

an immediate absolute byte address (unsigned, 32-bit).

var[immOff]

an array element as described in Arrays as Operands.

The register containing an address may be declared as a bit-size type or integer type.

The access size of a memory instruction is the total number of bytes accessed in memory. For example, the access size of ld.v4.b32 is 16 bytes, while the access size of atom.f16x2 is 4 bytes.

The address must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined. For example, among other things, the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault.
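As a sketch of the natural-alignment rule (illustrative Python):

```python
# An address is naturally aligned when it is a multiple of the access size.
def is_naturally_aligned(addr: int, access_size: int) -> bool:
    return addr % access_size == 0

# ld.v4.b32 accesses 16 bytes, so its address must be 16-byte aligned;
# atom.f16x2 accesses 4 bytes, so 4-byte alignment suffices.
assert is_naturally_aligned(0x1000, 16)
assert not is_naturally_aligned(0x1008, 16)
assert is_naturally_aligned(0x1008, 4)
```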

The address size may be either 32-bit or 64-bit. 128-bit addresses are not supported. Addresses are zero-extended to the specified width as needed, and truncated if the register width exceeds the state space address width for the target architecture.

Address arithmetic is performed using integer arithmetic and logical instructions. Examples include pointer arithmetic and pointer comparisons. All addresses and address computations are byte-based; there is no support for C-style pointer arithmetic.

The mov instruction can be used to move the address of a variable into a pointer. The address is an offset in the state space in which the variable is declared. Load and store operations move data between registers and locations in addressable state spaces. The syntax is similar to that used in many assembly languages, where scalar variables are simply named and addresses are de-referenced by enclosing the address expression in square brackets. Address expressions include variable names, address registers, address register plus byte offset, and immediate address expressions which evaluate at compile-time to a constant address.

Here are a few examples:

.shared .u16 x;
.reg    .u16 r0;
.global .v4 .f32 V;
.reg    .v4 .f32 W;
.const  .s32 tbl[256];
.reg    .b32 p;
.reg    .s32 q;

ld.shared.u16    r0, [x];
ld.global.v4.f32 W, [V];
ld.const.s32     q, [tbl+12];
mov.u32          p, tbl;

6.4.1.1.Generic Addressing

If a memory instruction does not specify a state space, the operation is performed using generic addressing. The state spaces .const, Kernel Function Parameters (.param), .local and .shared are modeled as windows within the generic address space. Each window is defined by a window base and a window size that is equal to the size of the corresponding state space. A generic address maps to global memory unless it falls within the window for const, local, or shared memory. The Kernel Function Parameters (.param) window is contained within the .global window. Within each window, a generic address maps to an address in the underlying state space by subtracting the window base from the generic address.
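A minimal sketch of window-based generic-address resolution (the window bases and sizes below are made-up illustration values; the real values are chosen by the implementation):

```python
# Hypothetical window table: state space -> (window base, window size).
WINDOWS = {
    "const":  (0x10000, 0x10000),
    "local":  (0x30000, 0x10000),
    "shared": (0x50000, 0x10000),
}

def resolve_generic(addr: int):
    """Map a generic address to (state space, address in that space)."""
    for space, (base, size) in WINDOWS.items():
        if base <= addr < base + size:
            return space, addr - base      # subtract the window base
    return "global", addr                  # outside every window: global

assert resolve_generic(0x50010) == ("shared", 0x10)
assert resolve_generic(0x90000) == ("global", 0x90000)
```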

6.4.2.Arrays as Operands

Arrays of all types can be declared, and the identifier becomes an address constant in the space where the array is declared. The size of the array is a constant in the program.

Array elements can be accessed using an explicitly calculated byte address, or by indexing into the array using square-bracket notation. The expression within square brackets is either a constant integer, a register variable, or a simple register with constant offset expression, where the offset is a constant expression that is either added to or subtracted from a register variable. If more complicated indexing is desired, it must be written as an address calculation prior to use. Examples are:

ld.global.u32  s, a[0];
ld.global.u32  s, a[N-1];
mov.u32        s, a[1];  // move address of a[1] into s

6.4.3.Vectors as Operands

Vector operands can be specified as source and destination operands for instructions. However, when specified as a destination operand, all elements in the vector expression must be unique, otherwise the behavior is undefined. Vectors may also be passed as arguments to called functions.

Vector elements can be extracted from the vector with the suffixes .x, .y, .z and .w, as well as the typical color fields .r, .g, .b and .a.

A brace-enclosed list is used for pattern matching to pull apart vectors.

.reg .v4 .f32 V;
.reg .f32     a, b, c, d;
mov.v4.f32 {a,b,c,d}, V;

Vector loads and stores can be used to implement wide loads and stores, which may improve memory performance. The registers in the load/store operations can be a vector, or a brace-enclosed list of similarly typed scalars. Here are examples:

ld.global.v4.f32  {a,b,c,d}, [addr+16];
ld.global.v2.u32  V2, [addr+8];

Elements in a brace-enclosed vector, say {Ra, Rb, Rc, Rd}, correspond to extracted elements as follows:

Ra = V.x = V.r
Rb = V.y = V.g
Rc = V.z = V.b
Rd = V.w = V.a

6.4.4.Labels and Function Names as Operands

Labels and function names can be used only in bra/brx.idx and call instructions respectively. Function names can be used in the mov instruction to get the address of the function into a register, for use in an indirect call.

Beginning in PTX ISA version 3.1, the mov instruction may be used to take the address of kernel functions, to be passed to a system call that initiates a kernel launch from the GPU. This feature is part of the support for CUDA Dynamic Parallelism. See the CUDA Dynamic Parallelism Programming Guide for details.

6.5.Type Conversion

All operands to all arithmetic, logic, and data movement instructions must be of the same type and size, except for operations where changing the size and/or type is part of the definition of the instruction. Operands of different sizes or types must be converted prior to the operation.

6.5.1.Scalar Conversions

Table 15 and Table 16 show what precision and format the cvt instruction uses given operands of differing types. For example, if a cvt.s32.u16 instruction is given a u16 source operand and s32 as a destination operand, the u16 is zero-extended to s32.

Conversions to floating-point that are beyond the range of floating-point numbers are represented with the maximum floating-point value (IEEE 754 Inf for f32 and f64, and ~131,000 for f16).

Table 15 Convert Instruction Precision and Format Table 1

Source                                    Destination Format
Format   s8     s16    s32    s64    u8     u16    u32    u64    f16    f32    f64    bf16   tf32
s8       --     sext   sext   sext   --     sext   sext   sext   s2f    s2f    s2f    s2f    --
s16      chop1  --     sext   sext   chop1  --     sext   sext   s2f    s2f    s2f    s2f    --
s32      chop1  chop1  --     sext   chop1  chop1  --     sext   s2f    s2f    s2f    s2f    --
s64      chop1  chop1  chop1  --     chop1  chop1  chop1  --     s2f    s2f    s2f    s2f    --
u8       --     zext   zext   zext   --     zext   zext   zext   u2f    u2f    u2f    u2f    --
u16      chop1  --     zext   zext   chop1  --     zext   zext   u2f    u2f    u2f    u2f    --
u32      chop1  chop1  --     zext   chop1  chop1  --     zext   u2f    u2f    u2f    u2f    --
u64      chop1  chop1  chop1  --     chop1  chop1  chop1  --     u2f    u2f    u2f    u2f    --
f16      f2s    f2s    f2s    f2s    f2u    f2u    f2u    f2u    --     f2f    f2f    f2f    --
f32      f2s    f2s    f2s    f2s    f2u    f2u    f2u    f2u    f2f    --     f2f    f2f    f2f
f64      f2s    f2s    f2s    f2s    f2u    f2u    f2u    f2u    f2f    f2f    --     f2f    --
bf16     f2s    f2s    f2s    f2s    f2u    f2u    f2u    f2u    f2f    f2f    f2f    --     f2f
tf32     --     --     --     --     --     --     --     --     --     --     --     --     --

Table 16 Convert Instruction Precision and Format Table 2

Destination Format

f16

f32

bf16

e4m3

e5m2

e2m3

e3m2

e2m1

ue8m0

SourceFormat

f16

f2f

f2f

f2f

f2f

f32

f2f

f2f

f2f

f2f

f2f

f2f

f2f

f2f

bf16

f2f

f2f

f2f

f2f

e4m3

f2f

e5m2

f2f

e2m3

f2f

e3m2

f2f

e2m1

f2f

ue8m0

f2f

Notes

sext = sign-extend; zext = zero-extend; chop = keep only low bits that fit;

s2f = signed-to-float; f2s = float-to-signed; u2f = unsigned-to-float;

f2u = float-to-unsigned; f2f = float-to-float.

1 If the destination register is wider than the destination format, the result is extended to the destination register width after chopping. The type of extension (sign or zero) is based on the destination format. For example, cvt.s16.u32 targeting a 32-bit register first chops to 16-bit, then sign-extends to 32-bit.
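The chop-then-extend behavior of the footnote can be sketched as (illustrative Python):

```python
# cvt.s16.u32 into a 32-bit register: chop the u32 source to 16 bits,
# then sign-extend (the destination format s16 is signed) to register width.
def cvt_s16_u32(src: int) -> int:
    lo = src & 0xFFFF                    # chop: keep only low bits that fit
    if lo & 0x8000:                      # destination format s16 is signed,
        lo -= 0x10000                    # so sign-extend
    return lo & 0xFFFFFFFF               # bits held in the 32-bit register

assert cvt_s16_u32(0x00018123) == 0xFFFF8123   # 0x8123 sign-extends
assert cvt_s16_u32(0x00010123) == 0x00000123
```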

6.5.2.Rounding Modifiers

Conversion instructions may specify a rounding modifier. In PTX, there are four integer rounding modifiers and six floating-point rounding modifiers. Table 17 and Table 18 summarize the rounding modifiers.

Table 17 Floating-Point Rounding Modifiers

Modifier   Description
.rn        rounds to nearest even
.rna       rounds to nearest, ties away from zero
.rz        rounds towards zero
.rm        rounds towards negative infinity
.rp        rounds towards positive infinity
.rs        rounds either towards zero or away from zero based on the carry out
           of the integer addition of random bits and the discarded bits of
           mantissa

Table 18 Integer Rounding Modifiers

Modifier   Description
.rni       round to nearest integer, choosing even integer if source is
           equidistant between two integers
.rzi       round to nearest integer in the direction of zero
.rmi       round to nearest integer in direction of negative infinity
.rpi       round to nearest integer in direction of positive infinity
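The four integer rounding modifiers map onto familiar rounding functions; a sketch in Python:

```python
import math

def round_to_int(x: float, mode: str) -> int:
    if mode == ".rni":          # nearest, ties choose the even integer
        return round(x)         # Python's round() is ties-to-even
    if mode == ".rzi":          # toward zero
        return math.trunc(x)
    if mode == ".rmi":          # toward negative infinity
        return math.floor(x)
    if mode == ".rpi":          # toward positive infinity
        return math.ceil(x)
    raise ValueError(mode)

assert round_to_int(2.5, ".rni") == 2    # equidistant: even integer wins
assert round_to_int(-1.7, ".rzi") == -1
assert round_to_int(-1.2, ".rmi") == -2
assert round_to_int(1.2, ".rpi") == 2
```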

6.6.Operand Costs

Operands from different state spaces affect the speed of an operation. Registers are fastest, while global memory is slowest. Much of the delay to memory can be hidden in a number of ways. The first is to have multiple threads of execution so that the hardware can issue a memory operation and then switch to other execution. Another way to hide latency is to issue the load instructions as early as possible, as execution is not blocked until the desired result is used in a subsequent (in time) instruction. The register in a store operation is available much more quickly. Table 19 gives estimates of the costs of using different kinds of memory.

Table 19 Cost Estimates for Accessing State-Spaces

Space       Time           Notes
Register    0
Shared      0
Constant    0              Amortized cost is low, first access is high
Local       > 100 clocks
Parameter   0
Immediate   0
Global      > 100 clocks
Texture     > 100 clocks
Surface     > 100 clocks

7.Abstracting the ABI

Rather than expose details of a particular calling convention, stack layout, and Application Binary Interface (ABI), PTX provides a slightly higher-level abstraction and supports multiple ABI implementations. In this section, we describe the features of PTX needed to achieve this hiding of the ABI. These include syntax for function definitions, function calls, parameter passing, and memory allocated on the stack (alloca).

Refer to the PTX Writers Guide to Interoperability for details on generating PTX compliant with the Application Binary Interface (ABI) for the CUDA® architecture.

7.1.Function Declarations and Definitions

In PTX, functions are declared and defined using the .func directive. A function declaration specifies an optional list of return parameters, the function name, and an optional list of input parameters; together these specify the function’s interface, or prototype. A function definition specifies both the interface and the body of the function. A function must be declared or defined prior to being called.

The simplest function has no parameters or return values, and is represented in PTX as follows:

.func foo
{
    ...
    ret;
}
    ...
    call foo;
    ...

Here, execution of the call instruction transfers control to foo, implicitly saving the return address. Execution of the ret instruction within foo transfers control to the instruction following the call.

Scalar and vector base-type input and return parameters may be represented simply as register variables. At the call, arguments may be register variables or constants, and return values may be placed directly into register variables. The arguments and return variables at the call must have type and size that match the callee’s corresponding formal parameters.

Example

.func (.reg .u32 %res) inc_ptr ( .reg .u32 %ptr, .reg .u32 %inc )
{
    add.u32 %res, %ptr, %inc;
    ret;
}
    ...
    call (%r1), inc_ptr, (%r1,4);
    ...

When using the ABI, .reg state space parameters must be at least 32-bits in size. Subword scalar objects in the source language should be promoted to 32-bit registers in PTX, or use .param state space byte arrays described next.

Objects such as C structures and unions are flattened into registers or byte arrays in PTX and are represented using .param space memory. For example, consider the following C structure, passed by value to a function:

struct {
    double dbl;
    char   c[4];
};

In PTX, this structure will be flattened into a byte array. Since memory accesses are required to be aligned to a multiple of the access size, the structure in this example will be a 12 byte array with 8 byte alignment so that accesses to the .f64 field are aligned. The .param state space is used to pass the structure by value:

Example

.func (.reg .s32 out) bar (.reg .s32 x, .param .align 8 .b8 y[12])
{
    .reg .f64 f1;
    .reg .b32 c1, c2, c3, c4;
    ...
    ld.param.f64 f1, [y+0];
    ld.param.b8  c1, [y+8];
    ld.param.b8  c2, [y+9];
    ld.param.b8  c3, [y+10];
    ld.param.b8  c4, [y+11];
    ...
    ... // computation using x,f1,c1,c2,c3,c4;
}

{
     .param .b8 .align 8 py[12];
     ...
     st.param.b64 [py+ 0], %rd;
     st.param.b8  [py+ 8], %rc1;
     st.param.b8  [py+ 9], %rc2;
     st.param.b8  [py+10], %rc1;
     st.param.b8  [py+11], %rc2;
     // scalar args in .reg space, byte array in .param space
     call (%out), bar, (%x, py);
     ...
}

In this example, note that .param space variables are used in two ways. First, a .param variable y is used in the function definition bar to represent a formal parameter. Second, a .param variable py is declared in the body of the calling function and used to set up the structure being passed to bar.
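The flattening of the structure into a 12-byte, 8-byte-aligned byte array can be sketched on the host side (illustrative Python; the helper name is made up):

```python
import struct

# Pack struct { double dbl; char c[4]; } into the 12-byte layout that the
# .param byte array y[12] uses: the double at offset 0, the chars at 8..11.
def flatten(dbl: float, c: bytes) -> bytes:
    assert len(c) == 4
    return struct.pack("<d4s", dbl, c)   # 8-byte double, then 4 chars

buf = flatten(3.5, b"abcd")
assert len(buf) == 12                               # 12-byte array
assert struct.unpack_from("<d", buf, 0)[0] == 3.5   # ld.param.f64 [y+0]
assert buf[8:12] == b"abcd"                         # ld.param.b8 [y+8..11]
```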

The following is a conceptual way to think about the .param state space use in device functions.

For a caller,

  • The .param state space is used to set values that will be passed to a called function and/or to receive return values from a called function. Typically, a .param byte array is used to collect together fields of a structure being passed by value.

For a callee,

  • The .param state space is used to receive parameter values and/or pass return values back to the caller.

The following restrictions apply to parameter passing.

For a caller,

  • Arguments may be .param variables, .reg variables, or constants.

  • In the case of .param space formal parameters that are byte arrays, the argument must also be a .param space byte array with matching type, size, and alignment. A .param argument must be declared within the local scope of the caller.

  • In the case of .param space formal parameters that are base-type scalar or vector variables, the corresponding argument may be either a .param or .reg space variable with matching type and size, or a constant that can be represented in the type of the formal parameter.

  • In the case of .reg space formal parameters, the corresponding argument may be either a .param or .reg space variable of matching type and size, or a constant that can be represented in the type of the formal parameter.

  • In the case of .reg space formal parameters, the register must be at least 32-bits in size.

  • All st.param instructions used for passing arguments to a function call must immediately precede the corresponding call instruction, and the ld.param instruction used for collecting the return value must immediately follow the call instruction without any control flow alteration. st.param and ld.param instructions used for argument passing cannot be predicated. This enables compiler optimization and ensures that the .param variable does not consume extra space in the caller’s frame beyond that needed by the ABI. The .param variable simply allows a mapping to be made at the call site between data that may be in multiple locations (e.g., a structure being manipulated by the caller is located in registers and memory) to something that can be passed as a parameter or return value to the callee.

For a callee,

  • Input and return parameters may be .param variables or .reg variables.

  • Parameters in .param memory must be aligned to a multiple of 1, 2, 4, 8, or 16 bytes.

  • Parameters in the .reg state space must be at least 32-bits in size.

  • The .reg state space can be used to receive and return base-type scalar and vector values, including sub-word size objects when compiling in non-ABI mode. Supporting the .reg state space provides legacy support.

Note that the choice of .reg or .param state space for parameter passing has no impact on whether the parameter is ultimately passed in physical registers or on the stack. The mapping of parameters to physical registers and stack locations depends on the ABI definition and the order, size, and alignment of parameters.

7.1.1.Changes from PTX ISA Version 1.x

In PTX ISA version 1.x, formal parameters were restricted to the .reg state space, and there was no support for array parameters. Objects such as C structures were flattened and passed or returned using multiple registers. PTX ISA version 1.x supported multiple return values for this purpose.

Beginning with PTX ISA version 2.0, formal parameters may be in either the .reg or .param state space, and .param space parameters support arrays. For targets sm_20 or higher, PTX restricts functions to a single return value, and a .param byte array should be used to return objects that do not fit into a register. PTX continues to support multiple return registers for sm_1x targets.

Note

PTX implements a stack-based ABI only for targets sm_20 or higher.

PTX ISA versions prior to 3.0 permitted variables in the .reg and .local state spaces to be defined at module scope. When compiling to use the ABI, PTX ISA version 3.0 and later disallows module-scoped .reg and .local variables and restricts their use to within function scope. When compiling without use of the ABI, module-scoped .reg and .local variables are supported as before. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg or .local variables, the compiler silently disables use of the ABI.

7.2.Variadic Functions

Note

Support for variadic functions, which was unimplemented, has been removed from the spec.

PTX version 6.0 supports passing an unsized array parameter to a function, which can be used to implement variadic functions.

Refer to Kernel and Function Directives: .func for details.

7.3.Alloca

PTX provides the alloca instruction for allocating storage at runtime on the per-thread local memory stack. The allocated stack memory can be accessed with ld.local and st.local instructions using the pointer returned by alloca.

In order to facilitate deallocation of memory allocated with alloca, PTX provides two additional instructions: stacksave, which allows reading the value of the stack pointer into a local variable, and stackrestore, which can restore the stack pointer with the saved value.

The alloca, stacksave, and stackrestore instructions are described in Stack Manipulation Instructions.

Preview Feature

Stack manipulation instructions alloca, stacksave and stackrestore are preview features in PTX ISA version 7.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.

8.Memory Consistency Model

In multi-threaded executions, the side-effects of memory operations performed by each thread become visible to other threads in a partial and non-identical order. This means that any two operations may appear to happen in no order, or in different orders, to different threads. The axioms introduced by the memory consistency model specify exactly which contradictions are forbidden between the orders observed by different threads.

In the absence of any constraint, each read operation returns the value committed by some write operation to the same memory location, including the initial write to that memory location. The memory consistency model effectively constrains the set of such candidate writes from which a read operation can return a value.

8.1.Scope and applicability of the model

The constraints specified under this model apply to PTX programs with any PTX ISA version number, running on sm_70 or later architectures.

The memory consistency model does not apply to texture (including ld.global.nc) and surface accesses.

8.1.1.Limitations on atomicity at system scope

When communicating with the host CPU, certain strong operations with system scope may not be performed atomically on some systems. For more details on atomicity guarantees to host memory, see the CUDA Atomicity Requirements.

8.2.Memory operations

The fundamental storage unit in the PTX memory model is a byte, consisting of 8 bits. Each statespace available to a PTX program is a sequence of contiguous bytes in memory. Every byte in a PTXstate space has a unique address relative to all threads that have access to the same state space.

Each PTX memory instruction specifies an address operand and a data type. The address operandcontains a virtual address that gets converted to a physical address during memory access. Thephysical address and the size of the data type together define a physical memory location, which isthe range of bytes starting from the physical address and extending up to the size of the data typein bytes.

The memory consistency model specification uses the terms “address” or “memory address” to indicatea virtual address, and the term “memory location” to indicate a physical memory location.

Each PTX memory instruction also specifies the operation — either a read, a write or an atomic read-modify-write — to be performed on all the bytes in the corresponding memory location.

8.2.1.Overlap

Two memory locations are said to overlap when the starting address of one location is within the range of bytes constituting the other location. Two memory operations are said to overlap when they specify the same virtual address and the corresponding memory locations overlap. The overlap is said to be complete when both memory locations are identical, and it is said to be partial otherwise.
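As an illustration, the following sketch (using a hypothetical address a) shows a complete and a partial overlap:

```
ld.global.u32  %r0, [a];   // reads bytes a .. a+3
ld.global.u32  %r1, [a];   // same 4-byte location: complete overlap with the first load
ld.global.u64  %rd0, [a];  // reads bytes a .. a+7: overlaps the u32 location only partially
```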

8.2.2.Aliases

Two distinct virtual addresses are said to be aliases if they map to the same memory location.

8.2.3.Multimem Addresses

A multimem address is a virtual address which points to multiple distinct memory locations across devices.

Only multimem.* operations are valid on multimem addresses. That is, the behavior of accessing a multimem address in any other memory operation is undefined.

8.2.4.Memory Operations on Vector Data Types

The memory consistency model relates operations executed on memory locations with scalar data types, which have a maximum size and alignment of 64 bits. Memory operations with a vector data type are modelled as a set of equivalent memory operations with a scalar data type, executed in an unspecified order on the elements in the vector.
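For example, a v2 vector load may be thought of as two scalar loads executed in an unspecified order (a sketch of the model, not a prescribed lowering; a is a hypothetical address):

```
ld.global.v2.u32  {%r0, %r1}, [a];
// is modelled as the equivalent scalar operations, in an unspecified order:
//   ld.global.u32  %r0, [a];
//   ld.global.u32  %r1, [a+4];
```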

8.2.5.Memory Operations on Packed Data Types

A packed data type consists of two values of the same scalar data type, as described in Packed Data Types. These values are accessed in adjacent memory locations. A memory operation on a packed data type is modelled as a pair of equivalent memory operations on the scalar data type, executed in an unspecified order on each element of the packed data.

8.2.6.Initialization

Each byte in memory is initialized by a hypothetical write W0 executed before starting any thread in the program. If the byte is included in a program variable, and that variable has an initial value, then W0 writes the corresponding initial value for that byte; else W0 is assumed to have written an unknown but constant value to the byte.

8.3.State spaces

The relations defined in the memory consistency model are independent of state spaces. In particular, causality order closes over all memory operations across all the state spaces. But the side-effect of a memory operation in one state space can be observed directly only by operations that also have access to the same state space. This further constrains the synchronizing effect of a memory operation in addition to scope. For example, the synchronizing effect of the PTX instruction ld.relaxed.shared.sys is identical to that of ld.relaxed.shared.cluster, since no thread outside the same cluster can execute an operation that accesses the same memory location.

8.4.Operation types

For simplicity, the rest of the document refers to the following operation types, instead of mentioning specific instructions that give rise to them.

Table 20 Operation Types

Operation Type            Instruction/Operation
atomic operation          atom or red instruction.
read operation            All variants of the ld instruction and the atom instruction (but not the red instruction).
write operation           All variants of the st instruction, and atomic operations if they result in a write.
memory operation          A read or write operation.
volatile operation        An instruction with the .volatile qualifier.
acquire operation         A memory operation with the .acquire or .acq_rel qualifier.
release operation         A memory operation with the .release or .acq_rel qualifier.
mmio operation            An ld or st instruction with the .mmio qualifier.
memory fence operation    A membar, fence.sc or fence.acq_rel instruction.
proxy fence operation     A fence.proxy or membar.proxy instruction.
strong operation          A memory fence operation, or a memory operation with a .relaxed, .acquire, .release, .acq_rel, .volatile, or .mmio qualifier.
weak operation            An ld or st instruction with a .weak qualifier.
synchronizing operation   A barrier instruction, fence operation, release operation or acquire operation.

8.4.1.mmio Operation

An mmio operation is a memory operation with the .mmio qualifier specified. It is usually performed on a memory location which is mapped to the control registers of peer I/O devices. It can also be used for communication between threads but has poor performance relative to non-mmio operations.

The semantic meaning of mmio operations cannot be defined precisely as it is defined by the underlying I/O device. For a formal specification of the semantics of an mmio operation from the memory consistency model perspective, it is equivalent to the semantics of a strong operation. But it follows a few implementation-specific properties, if it meets the CUDA atomicity requirements at the specified scope:

  • Writes are always performed and are never combined within the scope specified.

  • Reads are always performed, and are not forwarded, prefetched, combined, or allowed to hit any cache within the scope specified.

    • As an exception, in some implementations, the surrounding locations may also be loaded. In such cases the amount of data loaded is implementation specific and varies between 32 and 128 bytes in size.

8.4.2.volatile Operation

A volatile operation is a memory operation with the .volatile qualifier specified. The semantics of volatile operations are equivalent to a relaxed memory operation with system scope, but with the following extra implementation-specific constraints:

  • The number of volatile instructions (not operations) executed by a program is preserved. Hardware may combine and merge volatile operations issued by multiple different volatile instructions; that is, the number of volatile operations in the program is not preserved.

  • Volatile instructions are not re-ordered around other volatile instructions, but the memory operations performed by those instructions may be re-ordered around each other.

Note

PTX volatile operations are intended for compilers to lower volatile read and write operations from CUDA C++, and other programming languages sharing CUDA C++ volatile semantics, to PTX.

Since volatile operations are relaxed at system scope with extra constraints, prefer using other strong read or write operations (e.g. ld.relaxed.sys or st.relaxed.sys) for Inter-Thread Synchronization instead, which may deliver better performance.

PTX volatile operations are not suited for Memory Mapped IO (MMIO) because volatile operations do not preserve the number of memory operations performed, and may perform more or fewer operations than requested in a non-deterministic way. Use .mmio operations instead, which strictly preserve the number of operations performed.
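For instance, a flag-polling loop written with a volatile load can typically be expressed with a relaxed system-scope load instead (a sketch; flag is a hypothetical variable, and the qualifier order follows the litmus tests in this chapter):

```
// Volatile form:
SPIN1:  ld.volatile.global.u32     %r0, [flag];
        setp.eq.u32                p, %r0, 0;
@p      bra                        SPIN1;

// Relaxed system-scope form, which may perform better:
SPIN2:  ld.global.relaxed.sys.u32  %r0, [flag];
        setp.eq.u32                p, %r0, 0;
@p      bra                        SPIN2;
```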

8.5.Scope

Each strong operation must specify a scope, which is the set of threads that may interact directly with that operation and establish any of the relations described in the memory consistency model. There are four scopes:

Table 21 Scopes

Scope      Description
.cta       The set of all threads executing in the same CTA as the current thread.
.cluster   The set of all threads executing in the same cluster as the current thread.
.gpu       The set of all threads in the current program executing on the same compute device as the current thread. This also includes other kernel grids invoked by the host program on the same compute device.
.sys       The set of all threads in the current program, including all kernel grids invoked by the host program on all compute devices, and all threads constituting the host program itself.

Note that the warp is not a scope; the CTA is the smallest collection of threads that qualifies as a scope in the memory consistency model.
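For example, in the style of the litmus tests below, the scope qualifier on an atomic operation selects the set of threads with which it can be morally strong (ctr is a hypothetical global counter):

```
atom.cta.add.u32  %r0, [ctr], 1;  // atomic with respect to threads in the same CTA
atom.gpu.add.u32  %r0, [ctr], 1;  // atomic with respect to all threads on the device
atom.sys.add.u32  %r0, [ctr], 1;  // atomic with respect to all threads, including the host program
```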

8.6.Proxies

A memory proxy, or a proxy, is an abstract label applied to a method of memory access. When two memory operations use distinct methods of memory access, they are said to be different proxies.

Memory operations as defined in Operation types use the generic method of memory access, i.e. a generic proxy. Other operations such as textures and surfaces all use distinct methods of memory access, also distinct from the generic method.

A proxy fence is required to synchronize memory operations across different proxies. Although virtual aliases use the generic method of memory access, since using distinct virtual addresses behaves as if using different proxies, they require a proxy fence to establish memory ordering.

8.7.Morally strong operations

Two operations are said to be morally strong relative to each other if they satisfy all of the following conditions:

  1. The operations are related in program order (i.e., they are both executed by the same thread), or each operation is strong and specifies a scope that includes the thread executing the other operation.

  2. Both operations are performed via the same proxy.

  3. If both are memory operations, then they overlap completely.

Most (but not all) of the axioms in the memory consistency model depend on relations between morally strong operations.

8.7.1.Conflict and Data-races

Two overlapping memory operations are said to conflict when at least one of them is a write.

Two conflicting memory operations are said to be in a data-race if they are not related in causality order and they are not morally strong.

8.7.2.Limitations on Mixed-size Data-races

A data-race between operations that overlap completely is called a uniform-size data-race, while a data-race between operations that overlap partially is called a mixed-size data-race.

The axioms in the memory consistency model do not apply if a PTX program contains one or more mixed-size data-races. But these axioms are sufficient to describe the behavior of a PTX program with only uniform-size data-races.

Atomicity of mixed-size RMW operations

In any program with or without mixed-size data-races, the following property holds for every pair of overlapping atomic operations A1 and A2 such that each specifies a scope that includes the other: either the read-modify-write operation specified by A1 is performed completely before A2 is initiated, or vice versa. This property holds irrespective of whether the two operations A1 and A2 overlap partially or completely.

8.8.Release and Acquire Patterns

Some sequences of instructions give rise to patterns that participate in memory synchronization as described later. The release pattern makes prior operations from the current thread¹ visible to some operations from other threads. The acquire pattern makes some operations from other threads visible to later operations from the current thread.

A release pattern on a location M consists of one of the following:

  1. A release operation on M

    E.g.: st.release [M]; or atom.release [M]; or mbarrier.arrive.release [M];

  2. Or a release or acquire-release operation on M followed by a strong write on M in program order

    E.g.: st.release [M]; st.relaxed [M];

  3. Or a release or acquire-release memory fence followed by a strong write on M in program order

    E.g.: fence.release; st.relaxed [M]; or fence.release; atom.relaxed [M];

Any memory synchronization established by a release pattern only affects operations occurring in program order before the first instruction in that pattern.

An acquire pattern on a location M consists of one of the following:

  1. An acquire operation on M

    E.g.: ld.acquire [M]; or atom.acquire [M]; or mbarrier.test_wait.acquire [M];

  2. Or a strong read on M followed by an acquire operation on M in program order

    E.g.: ld.relaxed [M]; ld.acquire [M];

  3. Or a strong read on M followed by an acquire memory fence in program order

    E.g.: ld.relaxed [M]; fence.acquire; or atom.relaxed [M]; fence.acquire;

Any memory synchronization established by an acquire pattern only affects operations occurring in program order after the last instruction in that pattern.

Note that while an atomic reduction conceptually performs a strong read as part of its read-modify-write sequence, this strong read does not form an acquire pattern.

E.g.: red.add [M], 1; fence.acquire; is not an acquire pattern.

¹ For both release and acquire patterns, this effect is further extended to operations in other threads through the transitive nature of causality order.

8.9.Ordering of memory operations

The sequence of operations performed by each thread is captured as program order, while memory synchronization across threads is captured as causality order. The visibility of the side-effects of memory operations to other memory operations is captured as communication order. The memory consistency model defines the contradictions that are disallowed between communication order on the one hand, and causality order and program order on the other.

8.9.1.Program Order

The program order relates all operations performed by a thread to the order in which a sequential processor would execute instructions in the corresponding PTX source. It is a transitive relation that forms a total order over the operations performed by the thread, but does not relate operations from different threads.

8.9.1.1.Asynchronous Operations

Some PTX instructions (all variants of cp.async, cp.async.bulk, cp.reduce.async.bulk, and wgmma.mma_async) perform operations that are asynchronous to the thread that executed the instruction. These asynchronous operations are ordered after prior instructions in the same thread (except in the case of wgmma.mma_async), but they are not part of the program order for that thread. Instead, they provide weaker ordering guarantees, as documented in the instruction description.

For example, the loads and stores performed as part of a cp.async are ordered with respect to each other, but not to those of any other cp.async instructions initiated by the same thread, nor any other instruction subsequently issued by the thread with the exception of cp.async.commit_group or cp.async.mbarrier.arrive. The asynchronous mbarrier arrive-on operation performed by a cp.async.mbarrier.arrive instruction is ordered with respect to the memory operations performed by all prior cp.async operations initiated by the same thread, but not to those of any other instruction issued by the thread. The implicit mbarrier complete-tx operation that is part of all variants of cp.async.bulk and cp.reduce.async.bulk instructions is ordered only with respect to the memory operations performed by the same asynchronous instruction, and in particular it does not transitively establish ordering with respect to prior instructions from the issuing thread.

8.9.2.Observation Order

Observation order relates a write W to a read R through an optional sequence of atomic read-modify-write operations.

A write W precedes a read R in observation order if:

  1. R and W are morally strong and R reads the value written by W, or

  2. For some atomic operation Z, W precedes Z and Z precedes R in observation order.

8.9.3.Fence-SC Order

The Fence-SC order is an acyclic partial order, determined at runtime, that relates every pair of morally strong fence.sc operations.

8.9.4.Memory synchronization

Synchronizing operations performed by different threads synchronize with each other at runtime as described here. The effect of such synchronization is to establish causality order across threads.

  1. A fence.sc operation X synchronizes with a fence.sc operation Y if X precedes Y in the Fence-SC order.

  2. A bar{.cta}.sync, bar{.cta}.red or bar{.cta}.arrive operation synchronizes with a bar{.cta}.sync or bar{.cta}.red operation executed on the same barrier.

  3. A barrier.cluster.arrive operation synchronizes with a barrier.cluster.wait operation.

  4. A release pattern X synchronizes with an acquire pattern Y, if a write operation in X precedes a read operation in Y in observation order, and the first operation in X and the last operation in Y are morally strong.

API synchronization

A synchronizes relation can also be established by certain CUDA APIs.

  1. Completion of a task enqueued in a CUDA stream synchronizes with the start of the following task in the same stream, if any.

  2. For purposes of the above, recording or waiting on a CUDA event in a stream, or causing a cross-stream barrier to be inserted due to cudaStreamLegacy, enqueues tasks in the associated streams even if there are no direct side effects. An event record task synchronizes with matching event wait tasks, and a barrier arrival task synchronizes with matching barrier wait tasks.

  3. Start of a CUDA kernel synchronizes with start of all threads in the kernel. End of all threads in a kernel synchronizes with end of the kernel.

  4. Start of a CUDA graph synchronizes with start of all source nodes in the graph. Completion of all sink nodes in a CUDA graph synchronizes with completion of the graph. Completion of a graph node synchronizes with start of all nodes with a direct dependency.

  5. Start of a CUDA API call to enqueue a task synchronizes with start of the task.

  6. Completion of the last task queued to a stream, if any, synchronizes with return from cudaStreamSynchronize. Completion of the most recently queued matching event record task, if any, synchronizes with return from cudaEventSynchronize. Synchronizing a CUDA device or context behaves as if synchronizing all streams in the context, including ones that have been destroyed.

  7. Returning cudaSuccess from an API to query a CUDA handle, such as a stream or event, behaves the same as return from the matching synchronization API.

In addition to establishing a synchronizes relation, the CUDA API synchronization mechanisms above also participate in proxy-preserved base causality order.

8.9.5.Causality Order

Causality order captures how memory operations become visible across threads through synchronizing operations. The axiom “Causality” uses this order to constrain the set of write operations from which a read operation may read a value.

Relations in the causality order primarily consist of relations in Base causality order¹, which is a transitive order, determined at runtime.

Base causality order

An operation X precedes an operation Y in base causality order if:

  1. X precedes Y in program order, or

  2. X synchronizes with Y, or

  3. For some operation Z,

    1. X precedes Z in program order and Z precedes Y in base causality order, or

    2. X precedes Z in base causality order and Z precedes Y in program order, or

    3. X precedes Z in base causality order and Z precedes Y in base causality order.

Proxy-preserved base causality order

A memory operation X precedes a memory operation Y in proxy-preserved base causality order if X precedes Y in base causality order, and:

  1. X and Y are performed to the same address, using the generic proxy, or

  2. X and Y are performed to the same address, using the same proxy, and by the same thread block, or

  3. X and Y are aliases and there is an alias proxy fence along the base causality path from X to Y.

Causality order

Causality order combines base causality order with some non-transitive relations as follows:

An operation X precedes an operation Y in causality order if:

  1. X precedes Y in proxy-preserved base causality order, or

  2. For some operation Z, X precedes Z in observation order, and Z precedes Y in proxy-preserved base causality order.

¹ The transitivity of base causality order accounts for the “cumulativity” of synchronizing operations.

8.9.6.Coherence Order

There exists a partial transitive order that relates overlapping write operations, determined at runtime, called the coherence order¹. Two overlapping write operations are related in coherence order if they are morally strong or if they are related in causality order. Two overlapping writes are unrelated in coherence order if they are in a data-race, which gives rise to the partial nature of coherence order.

¹ Coherence order cannot be observed directly since it consists entirely of write operations. It may be observed indirectly by its use in constraining the set of candidate writes that a read operation may read from.

8.9.7.Communication Order

The communication order is a non-transitive order, determined at runtime, that relates write operations to other overlapping memory operations.

  1. A write W precedes an overlapping read R in communication order if R returns the value of any byte that was written by W.

  2. A write W precedes a write W’ in communication order if W precedes W’ in coherence order.

  3. A read R precedes an overlapping write W in communication order if, for any byte accessed by both R and W, R returns the value written by a write W’ that precedes W in coherence order.

Communication order captures the visibility of memory operations: when a memory operation X1 precedes a memory operation X2 in communication order, X1 is said to be visible to X2.

8.10.Axioms

8.10.1.Coherence

If a write W precedes an overlapping write W’ in causality order, then W must precede W’ in coherence order.

8.10.2.Fence-SC

Fence-SC order cannot contradict causality order. For a pair of morally strong fence.sc operations F1 and F2, if F1 precedes F2 in causality order, then F1 must precede F2 in Fence-SC order.

8.10.3.Atomicity

Single-Copy Atomicity

Conflicting morally strong operations are performed with single-copy atomicity. When a read R and a write W are morally strong, then the following two communications cannot both exist in the same execution, for the set of bytes accessed by both R and W:

  1. R reads any byte from W.

  2. R reads any byte from any write W’ which precedes W in coherence order.

Atomicity of read-modify-write (RMW) operations

When an atomic operation A and a write W overlap and are morally strong, then the following two communications cannot both exist in the same execution, for the set of bytes accessed by both A and W:

  1. A reads any byte from a write W’ that precedes W in coherence order.

  2. A follows W in coherence order.

Litmus Test 1

.global .u32 x = 0;

T1: A1: atom.sys.inc.u32 %r0, [x];
T2: A2: atom.sys.inc.u32 %r0, [x];

FINAL STATE: x == 2

Atomicity is guaranteed when the operations are morally strong.

Litmus Test 2

.global .u32 x = 0;

T1: A1: atom.cta.inc.u32 %r0, [x];
T2 (in a different CTA): A2: atom.gpu.inc.u32 %r0, [x];

FINAL STATE: x == 1 OR x == 2

Atomicity is not guaranteed if the operations are not morally strong.

8.10.4.No Thin Air

Values may not appear “out of thin air”: an execution cannot speculatively produce a value in such a way that the speculation becomes self-satisfying through chains of instruction dependencies and inter-thread communication. This matches both programmer intuition and hardware reality, but it is necessary to state explicitly when performing formal analysis.

Litmus Test: Load Buffering

.global .u32 x = 0; .global .u32 y = 0;

T1: A1: ld.global.u32 %r0, [x]; B1: st.global.u32 [y], %r0;
T2: A2: ld.global.u32 %r1, [y]; B2: st.global.u32 [x], %r1;

FINAL STATE: x == 0 AND y == 0

The litmus test known as “LB” (Load Buffering) checks for such forbidden values that may arise out of thin air. Two threads T1 and T2 each read from a first variable and copy the observed result into a second variable, with the first and second variable exchanged between the threads. If each variable is initially zero, the final result shall also be zero. If A1 reads from B2 and A2 reads from B1, then the values passing through the memory operations in this example form a cycle: A1->B1->A2->B2->A1. Only the values x == 0 and y == 0 are allowed to satisfy this cycle. If any of the memory operations in this example were to speculatively associate a different value with the corresponding memory location, then such a speculation would become self-fulfilling, and hence forbidden.

8.10.5.Sequential Consistency Per Location

Within any set of overlapping memory operations that are pairwise morally strong, communication order cannot contradict program order; i.e., a concatenation of program order between overlapping operations and morally strong relations in communication order cannot result in a cycle. This ensures that each program slice of overlapping pairwise morally strong operations is strictly sequentially consistent.

Litmus Test: CoRR

.global .u32 x = 0;

T1: W1: st.global.relaxed.sys.u32 [x], 1;
T2: R1: ld.global.relaxed.sys.u32 %r0, [x]; R2: ld.global.relaxed.sys.u32 %r1, [x];

IF %r0 == 1 THEN %r1 == 1

The litmus test “CoRR” (Coherent Read-Read) demonstrates one consequence of this guarantee. A thread T1 executes a write W1 on a location x, and a thread T2 executes two (or an infinite sequence of) reads R1 and R2 on the same location x. No other writes are executed on x, except the one modelling the initial value. The operations W1, R1 and R2 are pairwise morally strong. If R1 reads from W1, then the subsequent read R2 must also observe the same value. If R2 observed the initial value of x instead, then this would form a sequence of morally-strong relations R2->W1->R1 in communication order that contradicts the program order R1->R2 in thread T2. Hence R2 cannot read the initial value of x in such an execution.

8.10.6.Causality

Relations in communication order cannot contradict causality order. This constrains the set of candidate write operations that a read operation may read from:

  1. If a read R precedes an overlapping write W in causality order, then R cannot read from W.

  2. If a write W precedes an overlapping read R in causality order, then for any byte accessed by both R and W, R cannot read from any write W’ that precedes W in coherence order.

Litmus Test: Message Passing

.global .u32 data = 0; .global .u32 flag = 0;

T1: W1: st.global.u32 [data], 1; F1: fence.sys; W2: st.global.relaxed.sys.u32 [flag], 1;
T2: R1: ld.global.relaxed.sys.u32 %r0, [flag]; F2: fence.sys; R2: ld.global.u32 %r1, [data];

IF %r0 == 1 THEN %r1 == 1

The litmus test known as “MP” (Message Passing) represents the essence of typical synchronization algorithms. A vast majority of useful programs can be reduced to sequenced applications of this pattern.

Thread T1 first writes to a data variable and then to a flag variable, while a second thread T2 first reads from the flag variable and then from the data variable. The operations on the flag are morally strong, the memory operations in each thread are separated by a fence, and these fences are morally strong.

If R1 observes W2, then the release pattern “F1; W2” synchronizes with the acquire pattern “R1; F2”. This establishes the causality order W1 -> F1 -> W2 -> R1 -> F2 -> R2. Then the axiom Causality guarantees that R2 cannot read from any write that precedes W1 in coherence order. In the absence of any other writes in this example, R2 must read from W1.

Litmus Test: CoWR

// These addresses are aliases
.global .u32 data_alias_1; .global .u32 data_alias_2;

T1: W1: st.global.u32 [data_alias_1], 1; F1: fence.proxy.alias; R1: ld.global.u32 %r1, [data_alias_2];

%r1 == 1

Virtual aliases require an alias proxy fence along the synchronization path.

Litmus Test: Store Buffering

The litmus test known as “SB” (Store Buffering) demonstrates the sequential consistency enforced by fence.sc. A thread T1 writes to a first variable and then reads the value of a second variable, while a second thread T2 writes to the second variable and then reads the value of the first variable. The memory operations in each thread are separated by fence.sc instructions, and these fences are morally strong.

.global .u32 x = 0; .global .u32 y = 0;

T1: W1: st.global.u32 [x], 1; F1: fence.sc.sys; R1: ld.global.u32 %r0, [y];
T2: W2: st.global.u32 [y], 1; F2: fence.sc.sys; R2: ld.global.u32 %r1, [x];

%r0 == 1 OR %r1 == 1

In any execution, either F1 precedes F2 in Fence-SC order, or vice versa. If F1 precedes F2 in Fence-SC order, then F1 synchronizes with F2. This establishes the causality order W1 -> F1 -> F2 -> R2. The axiom Causality ensures that R2 cannot read from any write that precedes W1 in coherence order. In the absence of any other write to that variable, R2 must read from W1. Similarly, in the case where F2 precedes F1 in Fence-SC order, R1 must read from W2. If each fence.sc in this example were replaced by a fence.acq_rel instruction, then this outcome would not be guaranteed. There may be an execution where the write from each thread remains unobserved from the other thread, i.e., an execution is possible where both R1 and R2 return the initial value “0” for variables y and x respectively.

8.11.Special Cases

8.11.1.Reductions do not form Acquire Patterns

Atomic reduction operations like red do not form acquire patterns with acquire fences.

Litmus Test: Message Passing with a Red Instruction

.global .u32 x = 0; .global .u32 flag = 0;

T1: W1: st.u32 [x], 42; W2: st.release.gpu.u32 [flag], 1;
T2: RMW1: red.sys.global.add.u32 [flag], 1; F2: fence.acquire.gpu; R2: ld.weak.u32 %r1, [x];

%r1 == 0 AND flag == 2

The litmus test known as “MP” (Message Passing) demonstrates the consequence of reductions being excluded from acquire patterns. It is possible to observe the outcome where R2 reads the value 0 from x and flag has the final value 2. This outcome is possible since the release pattern in T1 does not synchronize with any acquire pattern in T2. Using the atom instruction instead of red forbids this outcome.

9.Instruction Set

9.1.Format and Semantics of Instruction Descriptions

This section describes each PTX instruction. In addition to the name and the format of the instruction, the semantics are described, followed by some examples that attempt to show several possible instantiations of the instruction.

9.2.PTX Instructions

PTX instructions generally have from zero to four operands, plus an optional guard predicate appearing after an @ symbol to the left of the opcode:

  • @p   opcode;

  • @p   opcode a;

  • @p   opcode d, a;

  • @p   opcode d, a, b;

  • @p   opcode d, a, b, c;

For instructions that create a result value, the d operand is the destination operand, while a, b, and c are source operands.

The setp instruction writes two destination registers. We use a | symbol to separate multiple destination registers.

setp.lt.s32  p|q, a, b;  // p = (a < b); q = !(a < b);

For some instructions the destination operand is optional. A bit bucket operand denoted with an underscore (_) may be used in place of a destination register.

9.3.Predicated Execution

In PTX, predicate registers are virtual and have .pred as the type specifier. So, predicate registers can be declared as

.reg .pred p, q, r;

All instructions have an optional guard predicate which controls conditional execution of the instruction. The syntax to specify conditional execution is to prefix an instruction with @{!}p, where p is a predicate variable, optionally negated. Instructions without a guard predicate are executed unconditionally.

Predicates are most commonly set as the result of a comparison performed by the setp instruction.

As an example, consider the high-level code

if (i < n)
    j = j + 1;

This can be written in PTX as

      setp.lt.s32  p, i, n;    // p = (i < n)
@p    add.s32      j, j, 1;    // if i < n, add 1 to j

To get a conditional branch or conditional function call, use a predicate to control the executionof the branch or call instructions. To implement the above example as a true conditional branch, thefollowing PTX instruction sequence might be used:

      setp.lt.s32  p, i, n;    // compare i to n
@!p   bra  L1;                 // if False, branch over
      add.s32      j, j, 1;
L1:   ...

9.3.1.Comparisons

9.3.1.1.Integer and Bit-Size Comparisons

The signed integer comparisons are the traditional eq (equal), ne (not-equal), lt (less-than), le (less-than-or-equal), gt (greater-than), and ge (greater-than-or-equal). The unsigned comparisons are eq, ne, lo (lower), ls (lower-or-same), hi (higher), and hs (higher-or-same). The bit-size comparisons are eq and ne; ordering comparisons are not defined for bit-size types.
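The choice of operator determines how the same bit pattern is interpreted. For example (a sketch), 0xFFFFFFFF compares as -1 under a signed operator and as 4294967295 under an unsigned operator:

```
mov.u32      %r1, 0xFFFFFFFF;
mov.u32      %r2, 1;
setp.lt.s32  p, %r1, %r2;   // p = True:  -1 < 1 (signed lt)
setp.lo.u32  q, %r1, %r2;   // q = False: 4294967295 is not lower than 1 (unsigned lo)
```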

Table 22 shows the operators for signed integer, unsigned integer, and bit-size types.

Table 22 Operators for Signed Integer, Unsigned Integer, and Bit-Size Types

Meaning   Signed Operator   Unsigned Operator   Bit-Size Operator
a == b    eq                eq                  eq
a != b    ne                ne                  ne
a <  b    lt                lo                  n/a
a <= b    le                ls                  n/a
a >  b    gt                hi                  n/a
a >= b    ge                hs                  n/a

9.3.1.2.Floating Point Comparisons

The ordered floating-point comparisons are eq, ne, lt, le, gt, and ge. If either operand is NaN, the result is False. Table 23 lists the floating-point comparison operators.

Table 23 Floating-Point Comparison Operators

Meaning                               Floating-Point Operator
a == b && !isNaN(a) && !isNaN(b)      eq
a != b && !isNaN(a) && !isNaN(b)      ne
a <  b && !isNaN(a) && !isNaN(b)      lt
a <= b && !isNaN(a) && !isNaN(b)      le
a >  b && !isNaN(a) && !isNaN(b)      gt
a >= b && !isNaN(a) && !isNaN(b)      ge

To aid comparison operations in the presence of NaN values, unordered floating-point comparisons are provided: equ, neu, ltu, leu, gtu, and geu. If both operands are numeric values (not NaN), then the comparison has the same result as its ordered counterpart. If either operand is NaN, then the result of the comparison is True.

Table 24 lists the floating-point comparison operators accepting NaN values.

Table 24 Floating-Point Comparison Operators Accepting NaN

Meaning                             Floating-Point Operator
a == b || isNaN(a) || isNaN(b)      equ
a != b || isNaN(a) || isNaN(b)      neu
a <  b || isNaN(a) || isNaN(b)      ltu
a <= b || isNaN(a) || isNaN(b)      leu
a >  b || isNaN(a) || isNaN(b)      gtu
a >= b || isNaN(a) || isNaN(b)      geu

To test for NaN values, two operators num (numeric) and nan (isNaN) are provided. num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN. Table 25 lists the floating-point comparison operators testing for NaN values.

Table 25 Floating-Point Comparison Operators Testing for NaN

Meaning                     Floating-Point Operator
!isNaN(a) && !isNaN(b)      num
isNaN(a) || isNaN(b)        nan
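The ordered/unordered distinction above can be sketched in host-side C (the language this document uses for semantics). This is an illustration only, not PTX; the function names are ours:

```c
#include <math.h>
#include <stdbool.h>

/* Ordered "lt" vs. unordered "ltu": they differ only when at least one
   operand is NaN. The ordered form yields False for NaN inputs, the
   unordered form yields True. */
bool lt_ordered(float a, float b)    { return !isnan(a) && !isnan(b) && a < b; }
bool ltu_unordered(float a, float b) { return isnan(a) || isnan(b) || a < b; }
```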

9.3.2. Manipulating Predicates

Predicate values may be computed and manipulated using the following instructions: and, or, xor, not, and mov.

There is no direct conversion between predicates and integer values, and no direct way to load or store predicate register values. However, setp can be used to generate a predicate from an integer, and the predicate-based select (selp) instruction can be used to generate an integer value based on the value of a predicate; for example:

selp.u32 %r1,1,0,%p;    // convert predicate to 32-bit value

9.4. Type Information for Instructions and Operands

Typed instructions must have a type-size modifier. For example, the add instruction requires type and size information to properly perform the addition operation (signed, unsigned, float, different sizes), and this information must be specified as a suffix to the opcode.

Example

.reg .u16 d, a, b;
add.u16 d, a, b;    // perform a 16-bit unsigned add

Some instructions require multiple type-size modifiers, most notably the data conversion instruction cvt. It requires separate type-size modifiers for the result and source, and these are placed in the same order as the operands. For example:

.reg .u16 a;
.reg .f32 d;
cvt.f32.u16 d, a;   // convert 16-bit unsigned to 32-bit float

In general, an operand’s type must agree with the corresponding instruction-type modifier. The rules for operand and instruction type conformance are as follows:

  • Bit-size types agree with any type of the same size.

  • Signed and unsigned integer types agree provided they have the same size, and integer operands are silently cast to the instruction type if needed. For example, an unsigned integer operand used in a signed integer instruction will be treated as a signed integer by the instruction.

  • Floating-point types agree only if they have the same size; i.e., they must match exactly.

Table 26 summarizes these type-checking rules.

Table 26 Type Checking Rules

                           Operand Type
Instruction Type    .bX      .sX        .uX        .fX
.bX                 okay     okay       okay       okay
.sX                 okay     okay       okay       invalid
.uX                 okay     okay       okay       invalid
.fX                 okay     invalid    invalid    okay

Note

Some operands have their type and size defined independently from the instruction type-size. For example, the shift amount operand for left and right shift instructions always has type .u32, while the remaining operands have their type and size determined by the instruction type.

Example

// 64-bit arithmetic right shift; shift amount 'b' is .u32
shr.s64 d,a,b;

9.4.1. Operand Size Exceeding Instruction-Type Size

For convenience, ld, st, and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or 16-bit values may be held directly in 32-bit or 64-bit registers when being loaded, stored, or converted to other types and sizes. The operand type-checking rules are relaxed for bit-size and integer (signed and unsigned) instruction types; floating-point instruction types still require that the operand type-size matches exactly, unless the operand is of bit-size type.

When a source operand has a size that exceeds the instruction-type size, the source data is truncated (chopped) to the appropriate number of bits specified by the instruction type-size.

Table 27 summarizes the relaxed type-checking rules for source operands. Note that some combinations may still be invalid for a particular instruction; for example, the cvt instruction does not support .bX instruction types, so those rows are invalid for cvt.

Table 27 Relaxed Type-checking Rules for Source Operands

                                       Source Operand Type
Instruction  b8   b16  b32  b64  b128 s8   s16  s32  s64  u8   u16  u32  u64  f16  f32  f64
Type
b8           –    chop chop chop chop –    chop chop chop –    chop chop chop chop chop chop
b16          inv  –    chop chop chop inv  –    chop chop inv  –    chop chop –    chop chop
b32          inv  inv  –    chop chop inv  inv  –    chop inv  inv  –    chop inv  –    chop
b64          inv  inv  inv  –    chop inv  inv  inv  –    inv  inv  inv  –    inv  inv  –
b128         inv  inv  inv  inv  –    inv  inv  inv  inv  inv  inv  inv  inv  inv  inv  inv
s8           –    chop chop chop chop –    chop chop chop –    chop chop chop inv  inv  inv
s16          inv  –    chop chop chop inv  –    chop chop inv  –    chop chop inv  inv  inv
s32          inv  inv  –    chop chop inv  inv  –    chop inv  inv  –    chop inv  inv  inv
s64          inv  inv  inv  –    chop inv  inv  inv  –    inv  inv  inv  –    inv  inv  inv
u8           –    chop chop chop chop –    chop chop chop –    chop chop chop inv  inv  inv
u16          inv  –    chop chop chop inv  –    chop chop inv  –    chop chop inv  inv  inv
u32          inv  inv  –    chop chop inv  inv  –    chop inv  inv  –    chop inv  inv  inv
u64          inv  inv  inv  –    chop inv  inv  inv  –    inv  inv  inv  –    inv  inv  inv
f16          inv  –    chop chop chop inv  inv  inv  inv  inv  inv  inv  inv  –    inv  inv
f32          inv  inv  –    chop chop inv  inv  inv  inv  inv  inv  inv  inv  inv  –    inv
f64          inv  inv  inv  –    chop inv  inv  inv  inv  inv  inv  inv  inv  inv  inv  –

Notes

chop = keep only low bits that fit; “–” = allowed, but no conversion needed; inv = invalid, parse error.

  1. Source register size must be of equal or greater size than the instruction-type size.

  2. Bit-size source registers may be used with any appropriately-sized instruction type. The data are truncated (“chopped”) to the instruction-type size and interpreted according to the instruction type.

  3. Integer source registers may be used with any appropriately-sized bit-size or integer instruction type. The data are truncated to the instruction-type size and interpreted according to the instruction type.

  4. Floating-point source registers can only be used with bit-size or floating-point instruction types. When used with a narrower bit-size instruction type, the data are truncated. When used with a floating-point instruction type, the size must match exactly.

When a destination operand has a size that exceeds the instruction-type size, the destination data is zero- or sign-extended to the size of the destination register. If the corresponding instruction type is signed integer, the data is sign-extended; otherwise, the data is zero-extended.

Table 28 summarizes the relaxed type-checking rules for destination operands.

Table 28 Relaxed Type-checking Rules for Destination Operands

                                       Destination Operand Type
Instruction  b8   b16  b32  b64  b128 s8   s16  s32  s64  u8   u16  u32  u64  f16  f32  f64
Type
b8           –    zext zext zext zext –    zext zext zext –    zext zext zext zext zext zext
b16          inv  –    zext zext zext inv  –    zext zext inv  –    zext zext –    zext zext
b32          inv  inv  –    zext zext inv  inv  –    zext inv  inv  –    zext inv  –    zext
b64          inv  inv  inv  –    zext inv  inv  inv  –    inv  inv  inv  –    inv  inv  –
b128         inv  inv  inv  inv  –    inv  inv  inv  inv  inv  inv  inv  inv  inv  inv  inv
s8           –    sext sext sext sext –    sext sext sext –    sext sext sext inv  inv  inv
s16          inv  –    sext sext sext inv  –    sext sext inv  –    sext sext inv  inv  inv
s32          inv  inv  –    sext sext inv  inv  –    sext inv  inv  –    sext inv  inv  inv
s64          inv  inv  inv  –    sext inv  inv  inv  –    inv  inv  inv  –    inv  inv  inv
u8           –    zext zext zext zext –    zext zext zext –    zext zext zext inv  inv  inv
u16          inv  –    zext zext zext inv  –    zext zext inv  –    zext zext inv  inv  inv
u32          inv  inv  –    zext zext inv  inv  –    zext inv  inv  –    zext inv  inv  inv
u64          inv  inv  inv  –    zext inv  inv  inv  –    inv  inv  inv  –    inv  inv  inv
f16          inv  –    zext zext zext inv  inv  inv  inv  inv  inv  inv  inv  –    inv  inv
f32          inv  inv  –    zext zext inv  inv  inv  inv  inv  inv  inv  inv  inv  –    inv
f64          inv  inv  inv  –    zext inv  inv  inv  inv  inv  inv  inv  inv  inv  inv  –

Notes

sext = sign-extend; zext = zero-extend; “–” = allowed, but no conversion needed; inv = invalid, parse error.

  1. Destination register size must be of equal or greater size than the instruction-type size.

  2. Bit-size destination registers may be used with any appropriately-sized instruction type. The data are sign-extended to the destination register width for signed integer instruction types, and are zero-extended to the destination register width otherwise.

  3. Integer destination registers may be used with any appropriately-sized bit-size or integer instruction type. The data are sign-extended to the destination register width for signed integer instruction types, and are zero-extended to the destination register width for bit-size and unsigned integer instruction types.

  4. Floating-point destination registers can only be used with bit-size or floating-point instruction types. When used with a narrower bit-size instruction type, the data are zero-extended. When used with a floating-point instruction type, the size must match exactly.

9.5. Divergence of Threads in Control Constructs

Threads in a CTA execute together, at least in appearance, until they come to a conditional control construct such as a conditional branch, conditional function call, or conditional return. If threads execute down different control flow paths, the threads are called divergent. If all of the threads act in unison and follow a single control flow path, the threads are called uniform. Both situations occur often in programs.

A CTA with divergent threads may have lower performance than a CTA with uniformly executing threads, so it is important to have divergent threads re-converge as soon as possible. All control constructs are assumed to be divergent points unless the control-flow instruction is marked as uniform, using the .uni suffix. For divergent control flow, the optimizing code generator automatically determines points of re-convergence. Therefore, a compiler or code author targeting PTX can ignore the issue of divergent threads, but has the opportunity to improve performance by marking branch points as uniform when the compiler or author can guarantee that the branch point is non-divergent.

9.6. Semantics

The goal of the semantic description of an instruction is to describe the results in all cases in as simple a language as possible. The semantics are described using C, except where C is not expressive enough.

9.6.1. Machine-Specific Semantics of 16-bit Code

A PTX program may execute on a GPU with either a 16-bit or a 32-bit data path. When executing on a 32-bit data path, 16-bit registers in PTX are mapped to 32-bit physical registers, and 16-bit computations are promoted to 32-bit computations. This can lead to computational differences between code run on a 16-bit machine versus the same code run on a 32-bit machine, since the promoted computation may have bits in the high-order half-word of registers that are not present in 16-bit physical registers. These extra precision bits can become visible at the application level, for example, by a right-shift instruction.
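How the extra precision bits become visible can be sketched in host-side C. This is an illustration, not PTX: the function names are ours, and it models only the add-then-shift case mentioned above.

```c
#include <stdint.h>

/* Adding two 16-bit values and then shifting right gives different
   results in a true 16-bit register versus a promoted 32-bit register
   that retains the carry out of bit 15. */
uint16_t shr_after_add_16bit(uint16_t a, uint16_t b) {
    uint16_t sum = (uint16_t)(a + b);  /* carry out of bit 15 is lost */
    return (uint16_t)(sum >> 1);
}
uint16_t shr_after_add_promoted(uint16_t a, uint16_t b) {
    uint32_t sum = (uint32_t)a + b;    /* carry kept in bit 16 */
    return (uint16_t)(sum >> 1);       /* extra precision becomes visible */
}
```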

At the PTX language level, one solution would be to define semantics for 16-bit code that is consistent with execution on a 16-bit data path. This approach introduces a performance penalty for 16-bit code executing on a 32-bit data path, since the translated code would require many additional masking instructions to suppress extra precision bits in the high-order half-word of 32-bit registers.

Rather than introduce a performance penalty for 16-bit code running on 32-bit GPUs, the semantics of 16-bit instructions in PTX is machine-specific. A compiler or programmer may choose to enforce portable, machine-independent 16-bit semantics by adding explicit conversions to 16-bit values at appropriate points in the program to guarantee portability of the code. However, for many performance-critical applications, this is not desirable, and for many applications the difference in execution is preferable to limiting performance.

9.7. Instructions

All PTX instructions may be predicated. In the following descriptions, the optional guard predicate is omitted from the syntax.

9.7.1. Integer Arithmetic Instructions

Integer arithmetic instructions operate on the integer types in register and constant immediate forms. The integer arithmetic instructions are:

  • add

  • sub

  • mul

  • mad

  • mul24

  • mad24

  • sad

  • div

  • rem

  • abs

  • neg

  • min

  • max

  • popc

  • clz

  • bfind

  • fns

  • brev

  • bfe

  • bfi

  • bmsk

  • szext

  • dp4a

  • dp2a

9.7.1.1. Integer Arithmetic Instructions: add

add

Add two values.

Syntax

add.type       d, a, b;
add{.sat}.s32  d, a, b;     // .sat applies only to .s32

.type = { .u16, .u32, .u64,
          .s16, .s32, .s64,
          .u16x2, .s16x2 };

Description

Performs addition and writes the resulting value into a destination register.

For the .u16x2, .s16x2 instruction types, input vectors are formed from the half-word values of the source operands. The half-word operands are then added in parallel to produce a .u16x2 or .s16x2 result in the destination.

Operands d, a and b have type .type. For instruction types .u16x2, .s16x2, operands d, a and b have type .b32.

Semantics

if (type == u16x2 || type == s16x2) {
    iA[0] = a[0:15];
    iA[1] = a[16:31];
    iB[0] = b[0:15];
    iB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = iA[i] + iB[i];
    }
} else {
    d = a + b;
}
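The paired half-word behavior can be sketched in host-side C (illustrative helper name, not a PTX or CUDA API):

```c
#include <stdint.h>

/* add.u16x2 sketch: the two half-words of each 32-bit operand are added
   independently, with no carry propagating from the low half into the
   high half. */
uint32_t add_u16x2(uint32_t a, uint32_t b) {
    uint16_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));  /* wraps mod 2^16 */
    uint16_t hi = (uint16_t)((a >> 16) + (b >> 16));
    return ((uint32_t)hi << 16) | lo;
}
```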

Notes

Saturation modifier:

.sat

limits result to MININT..MAXINT (no overflow) for the size of the operation. Applies only to .s32 type.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

add.u16x2 and add.s16x2 introduced in PTX ISA version 8.0.

Target ISA Notes

Supported on all target architectures.

add.u16x2 and add.s16x2 require sm_90 or higher.

Examples

@p  add.u32     x,y,z;
    add.sat.s32 c,c,1;
    add.u16x2   u,v,w;

9.7.1.2. Integer Arithmetic Instructions: sub

sub

Subtract one value from another.

Syntax

sub.type       d, a, b;
sub{.sat}.s32  d, a, b;     // .sat applies only to .s32

.type = { .u16, .u32, .u64,
          .s16, .s32, .s64 };

Description

Performs subtraction and writes the resulting value into a destination register.

Semantics

d = a - b;

Notes

Saturation modifier:

.sat

limits result to MININT..MAXINT (no overflow) for the size of the operation. Applies only to .s32 type.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

sub.s32 c,a,b;

9.7.1.3. Integer Arithmetic Instructions: mul

mul

Multiply two values.

Syntax

mul.mode.type  d, a, b;

.mode = { .hi, .lo, .wide };
.type = { .u16, .u32, .u64,
          .s16, .s32, .s64 };

Description

Compute the product of two values.

Semantics

t = a * b;
n = bitwidth of type;
d = t;            // for .wide
d = t<2n-1..n>;   // for .hi variant
d = t<n-1..0>;    // for .lo variant

Notes

The type of the operation represents the types of the a and b operands. If .hi or .lo is specified, then d is the same size as a and b, and either the upper or lower half of the result is written to the destination register. If .wide is specified, then d is twice as wide as a and b to receive the full result of the multiplication.

The .wide suffix is supported only for 16- and 32-bit integer types.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

mul.wide.s16 fa,fxs,fys;   // 16*16 bits yields 32 bits
mul.lo.s16 fa,fxs,fys;     // 16*16 bits, save only the low 16 bits
mul.wide.s32 z,x,y;        // 32*32 bits, creates 64 bit result
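The .lo/.hi/.wide selection can be sketched in host-side C for the .s32 case (illustrative function names; the full 64-bit product is formed first, then the requested part is taken):

```c
#include <stdint.h>

int32_t mul_lo_s32(int32_t a, int32_t b) {
    return (int32_t)((int64_t)a * b);          /* low 32 bits of product */
}
int32_t mul_hi_s32(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);  /* high 32 bits of product */
}
int64_t mul_wide_s32(int32_t a, int32_t b) {
    return (int64_t)a * b;                     /* full 64-bit result */
}
```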

9.7.1.4. Integer Arithmetic Instructions: mad

mad

Multiply two values, optionally extract the high or low half of the intermediate result, and add a third value.

Syntax

mad.mode.type  d, a, b, c;
mad.hi.sat.s32 d, a, b, c;

.mode = { .hi, .lo, .wide };
.type = { .u16, .u32, .u64,
          .s16, .s32, .s64 };

Description

Multiplies two values, optionally extracts the high or low half of the intermediate result, and adds a third value. Writes the result into a destination register.

Semantics

t = a * b;
n = bitwidth of type;
d = t + c;           // for .wide
d = t<2n-1..n> + c;  // for .hi variant
d = t<n-1..0> + c;   // for .lo variant

Notes

The type of the operation represents the types of the a and b operands. If .hi or .lo is specified, then d and c are the same size as a and b, and either the upper or lower half of the result is written to the destination register. If .wide is specified, then d and c are twice as wide as a and b to receive the result of the multiplication.

The .wide suffix is supported only for 16-bit and 32-bit integer types.

Saturation modifier:

.sat

limits result to MININT..MAXINT (no overflow) for the size of the operation.

Applies only to .s32 type in .hi mode.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

@p  mad.lo.s32 d,a,b,c;
    mad.lo.s32 r,p,q,r;

9.7.1.5. Integer Arithmetic Instructions: mul24

mul24

Multiply two 24-bit integer values.

Syntax

mul24.mode.type  d, a, b;

.mode = { .hi, .lo };
.type = { .u32, .s32 };

Description

Compute the product of two 24-bit integer values held in 32-bit source registers, and return either the high or low 32-bits of the 48-bit result.

Semantics

t = a * b;
d = t<47..16>;    // for .hi variant
d = t<31..0>;     // for .lo variant

Notes

Integer multiplication yields a result that is twice the size of the input operands, i.e., 48-bits.

mul24.hi performs a 24x24-bit multiply and returns the high 32 bits of the 48-bit result.

mul24.lo performs a 24x24-bit multiply and returns the low 32 bits of the 48-bit result.

All operands are of the same type and size.

mul24.hi may be less efficient on machines without hardware support for 24-bit multiply.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

mul24.lo.s32 d,a,b;   // low 32-bits of 24x24-bit signed multiply.
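A host-side C sketch of the unsigned case, assuming (as the description states) that only the low 24 bits of each 32-bit source participate in the multiply; the function names are illustrative:

```c
#include <stdint.h>

uint32_t mul24_lo_u32(uint32_t a, uint32_t b) {
    uint64_t t = (uint64_t)(a & 0xFFFFFFu) * (b & 0xFFFFFFu);  /* 48-bit product */
    return (uint32_t)t;                       /* bits 31..0 */
}
uint32_t mul24_hi_u32(uint32_t a, uint32_t b) {
    uint64_t t = (uint64_t)(a & 0xFFFFFFu) * (b & 0xFFFFFFu);
    return (uint32_t)(t >> 16);               /* bits 47..16 */
}
```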

9.7.1.6. Integer Arithmetic Instructions: mad24

mad24

Multiply two 24-bit integer values and add a third value.

Syntax

mad24.mode.type  d, a, b, c;
mad24.hi.sat.s32 d, a, b, c;

.mode = { .hi, .lo };
.type = { .u32, .s32 };

Description

Compute the product of two 24-bit integer values held in 32-bit source registers, and add a third, 32-bit value to either the high or low 32-bits of the 48-bit result. Return either the high or low 32-bits of the 48-bit result.

Semantics

t = a * b;
d = t<47..16> + c;   // for .hi variant
d = t<31..0> + c;    // for .lo variant

Notes

Integer multiplication yields a result that is twice the size of the input operands, i.e., 48-bits.

mad24.hi performs a 24x24-bit multiply and adds the high 32 bits of the 48-bit result to a third value.

mad24.lo performs a 24x24-bit multiply and adds the low 32 bits of the 48-bit result to a third value.

All operands are of the same type and size.

Saturation modifier:

.sat

limits result of 32-bit signed addition to MININT..MAXINT (no overflow). Applies only to .s32 type in .hi mode.

mad24.hi may be less efficient on machines without hardware support for 24-bit multiply.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

mad24.lo.s32 d,a,b,c;   // low 32-bits of 24x24-bit signed multiply.

9.7.1.7. Integer Arithmetic Instructions: sad

sad

Sum of absolute differences.

Syntax

sad.type  d, a, b, c;

.type = { .u16, .u32, .u64,
          .s16, .s32, .s64 };

Description

Adds the absolute value of a-b to c and writes the resulting value into d.

Semantics

d = c + ((a<b) ? b-a : a-b);

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

sad.s32  d,a,b,c;
sad.u32  d,a,b,d;  // running sum

9.7.1.8. Integer Arithmetic Instructions: div

div

Divide one value by another.

Syntax

div.type  d, a, b;

.type = { .u16, .u32, .u64,
          .s16, .s32, .s64 };

Description

Divides a by b, stores result in d.

Semantics

d = a / b;

Notes

Division by zero yields an unspecified, machine-specific value.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

div.s32  b,n,i;

9.7.1.9. Integer Arithmetic Instructions: rem

rem

The remainder of integer division.

Syntax

rem.type  d, a, b;

.type = { .u16, .u32, .u64,
          .s16, .s32, .s64 };

Description

Divides a by b, stores the remainder in d.

Semantics

d = a % b;

Notes

The behavior for negative numbers is machine-dependent and depends on whether divide rounds towards zero or negative infinity.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

rem.s32  x,x,8;    // x = x%8;

9.7.1.10. Integer Arithmetic Instructions: abs

abs

Absolute value.

Syntax

abs.type  d, a;

.type = { .s16, .s32, .s64 };

Description

Take the absolute value of a and store it in d.

Semantics

d = |a|;

Notes

Only for signed integers.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

abs.s32  r0,a;

9.7.1.11. Integer Arithmetic Instructions: neg

neg

Arithmetic negate.

Syntax

neg.type  d, a;

.type = { .s16, .s32, .s64 };

Description

Negate the sign of a and store the result in d.

Semantics

d = -a;

Notes

Only for signed integers.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

neg.s32  r0,a;

9.7.1.12. Integer Arithmetic Instructions: min

min

Find the minimum of two values.

Syntax

min.atype         d, a, b;
min{.relu}.btype  d, a, b;

.atype = { .u16, .u32, .u64,
           .u16x2, .s16, .s64 };
.btype = { .s16x2, .s32 };

Description

Store the minimum of a and b in d.

For the .u16x2, .s16x2 instruction types, input vectors are formed from the half-word values of the source operands. The half-word operands are then processed in parallel to produce a .u16x2 or .s16x2 result in the destination.

Operands d, a and b have the same type as the instruction type. For instruction types .u16x2, .s16x2, operands d, a and b have type .b32.

Semantics

if (type == u16x2 || type == s16x2) {
    iA[0] = a[0:15];
    iA[1] = a[16:31];
    iB[0] = b[0:15];
    iB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = (iA[i] < iB[i]) ? iA[i] : iB[i];
    }
} else {
    d = (a < b) ? a : b; // Integer (signed and unsigned)
}

Notes

Signed and unsigned differ.

Saturation modifier:

min.relu.{s16x2,s32} clamps the result to 0 if negative.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

min.u16x2, min{.relu}.s16x2 and min.relu.s32 introduced in PTX ISA version 8.0.

Target ISA Notes

Supported on all target architectures.

min.u16x2, min{.relu}.s16x2 and min.relu.s32 require sm_90 or higher.

Examples

    min.s32  r0,a,b;
@p  min.u16  h,i,j;
    min.s16x2.relu u,v,w;

9.7.1.13. Integer Arithmetic Instructions: max

max

Find the maximum of two values.

Syntax

max.atype         d, a, b;
max{.relu}.btype  d, a, b;

.atype = { .u16, .u32, .u64,
           .u16x2, .s16, .s64 };
.btype = { .s16x2, .s32 };

Description

Store the maximum of a and b in d.

For the .u16x2, .s16x2 instruction types, input vectors are formed from the half-word values of the source operands. The half-word operands are then processed in parallel to produce a .u16x2 or .s16x2 result in the destination.

Operands d, a and b have the same type as the instruction type. For instruction types .u16x2, .s16x2, operands d, a and b have type .b32.

Semantics

if (type == u16x2 || type == s16x2) {
    iA[0] = a[0:15];
    iA[1] = a[16:31];
    iB[0] = b[0:15];
    iB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = (iA[i] > iB[i]) ? iA[i] : iB[i];
    }
} else {
    d = (a > b) ? a : b; // Integer (signed and unsigned)
}

Notes

Signed and unsigned differ.

Saturation modifier:

max.relu.{s16x2,s32} clamps the result to 0 if negative.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

max.u16x2, max{.relu}.s16x2 and max.relu.s32 introduced in PTX ISA version 8.0.

Target ISA Notes

Supported on all target architectures.

max.u16x2, max{.relu}.s16x2 and max.relu.s32 require sm_90 or higher.

Examples

max.u32  d,a,b;
max.s32  q,q,0;
max.relu.s16x2 t,t,u;

9.7.1.14. Integer Arithmetic Instructions: popc

popc

Population count.

Syntax

popc.type  d, a;

.type = { .b32, .b64 };

Description

Count the number of one bits in a and place the resulting population count in 32-bit destination register d. Operand a has the instruction type and destination d has type .u32.

Semantics

.u32  d = 0;
while (a != 0) {
   if (a & 0x1)  d++;
   a = a >> 1;
}

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

popc requires sm_20 or higher.

Examples

popc.b32  d, a;
popc.b64  cnt, X;  // cnt is .u32

9.7.1.15. Integer Arithmetic Instructions: clz

clz

Count leading zeros.

Syntax

clz.type  d, a;

.type = { .b32, .b64 };

Description

Count the number of leading zeros in a starting with the most-significant bit and place the result in 32-bit destination register d. Operand a has the instruction type, and destination d has type .u32. For .b32 type, the number of leading zeros is between 0 and 32, inclusively. For .b64 type, the number of leading zeros is between 0 and 64, inclusively.

Semantics

.u32  d = 0;
if (.type == .b32)   { max = 32; mask = 0x80000000; }
else                 { max = 64; mask = 0x8000000000000000; }

while (d < max && ((a & mask) == 0) ) {
    d++;
    a = a << 1;
}

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

clz requires sm_20 or higher.

Examples

clz.b32  d, a;
clz.b64  cnt, X;  // cnt is .u32

9.7.1.16. Integer Arithmetic Instructions: bfind

bfind

Find most significant non-sign bit.

Syntax

bfind.type           d, a;
bfind.shiftamt.type  d, a;

.type = { .u32, .u64,
          .s32, .s64 };

Description

Find the bit position of the most significant non-sign bit in a and place the result in d. Operand a has the instruction type, and destination d has type .u32. For unsigned integers, bfind returns the bit position of the most significant 1. For signed integers, bfind returns the bit position of the most significant 0 for negative inputs and the most significant 1 for non-negative inputs.

If .shiftamt is specified, bfind returns the shift amount needed to left-shift the found bit into the most-significant bit position.

bfind returns 0xffffffff if no non-sign bit is found.

Semantics

msb = (.type==.u32 || .type==.s32) ? 31 : 63;

// negate negative signed inputs
if ( (.type==.s32 || .type==.s64) && (a & (1<<msb)) ) {
    a = ~a;
}

.u32  d = 0xffffffff;
for (.s32 i=msb; i>=0; i--) {
    if (a & (1<<i))  { d = i; break; }
}
if (.shiftamt && d != 0xffffffff)  { d = msb - d; }
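A host-side C sketch of the unsigned .u32 case, following the semantics above (illustrative function names):

```c
#include <stdint.h>

/* Return the bit position of the most significant set bit, or
   0xffffffff if none is set. */
uint32_t bfind_u32(uint32_t a) {
    for (int i = 31; i >= 0; i--)
        if (a & (1u << i)) return (uint32_t)i;
    return 0xffffffffu;
}

/* .shiftamt variant: shift amount that moves the found bit into the
   most-significant position. */
uint32_t bfind_shiftamt_u32(uint32_t a) {
    uint32_t d = bfind_u32(a);
    return d == 0xffffffffu ? d : 31u - d;
}
```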

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

bfind requires sm_20 or higher.

Examples

bfind.u32  d, a;
bfind.shiftamt.s64  cnt, X;  // cnt is .u32

9.7.1.17. Integer Arithmetic Instructions: fns

fns

Find the n-th set bit.

Syntax

fns.b32 d, mask, base, offset;

Description

Given a 32-bit value mask and an integer value base (between 0 and 31), find the n-th (given by offset) set bit in mask from the base bit, and store the bit position in d. If not found, store 0xffffffff in d.

Operand mask has a 32-bit type. Operand base has .b32, .u32 or .s32 type. Operand offset has .s32 type. Destination d has type .b32.

Operand base must be <= 31, otherwise behavior is undefined.

Semantics

d = 0xffffffff;
if (offset == 0) {
    if (mask[base] == 1) {
        d = base;
    }
} else {
    pos = base;
    count = |offset| - 1;
    inc = (offset > 0) ? 1 : -1;
    while ((pos >= 0) && (pos < 32)) {
        if (mask[pos] == 1) {
            if (count == 0) {
                d = pos;
                break;
            } else {
                count = count - 1;
            }
        }
        pos = pos + inc;
    }
}
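The search can be sketched in host-side C, following the semantics above (illustrative function name; the out-of-range base is guarded here even though PTX leaves it undefined):

```c
#include <stdint.h>
#include <stdlib.h>

uint32_t fns_b32(uint32_t mask, uint32_t base, int32_t offset) {
    if (base > 31) return 0xffffffffu;  /* undefined in PTX; guarded here */
    if (offset == 0)
        return ((mask >> base) & 1u) ? base : 0xffffffffu;
    int pos = (int)base;
    int count = abs(offset) - 1;        /* skip count-1 set bits first */
    int inc = offset > 0 ? 1 : -1;      /* scan up or down from base */
    while (pos >= 0 && pos < 32) {
        if ((mask >> pos) & 1u) {
            if (count == 0) return (uint32_t)pos;
            count--;
        }
        pos += inc;
    }
    return 0xffffffffu;
}
```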

PTX ISA Notes

Introduced in PTX ISA version 6.0.

Target ISA Notes

fns requires sm_30 or higher.

Examples

fns.b32 d, 0xaaaaaaaa, 3, 1;   // d = 3
fns.b32 d, 0xaaaaaaaa, 3, -1;  // d = 3
fns.b32 d, 0xaaaaaaaa, 2, 1;   // d = 3
fns.b32 d, 0xaaaaaaaa, 2, -1;  // d = 1

9.7.1.18. Integer Arithmetic Instructions: brev

brev

Bit reverse.

Syntax

brev.type  d, a;

.type = { .b32, .b64 };

Description

Perform bitwise reversal of input.

Semantics

msb = (.type==.b32) ? 31 : 63;
for (i=0; i<=msb; i++) {
    d[i] = a[msb-i];
}

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

brev requires sm_20 or higher.

Examples

brev.b32  d, a;

9.7.1.19. Integer Arithmetic Instructions: bfe

bfe

Bit Field Extract.

Syntax

bfe.type  d, a, b, c;

.type = { .u32, .u64,
          .s32, .s64 };

Description

Extract bit field from a and place the zero or sign-extended result in d. Source b gives the bit field starting bit position, and source c gives the bit field length in bits.

Operands a and d have the same type as the instruction type. Operands b and c are type .u32, but are restricted to the 8-bit value range 0..255.

The sign bit of the extracted field is defined as:

.u32, .u64:
    zero

.s32, .s64:
    msb of input a if the extracted field extends beyond the msb of a; msb of the extracted field, otherwise

If the bit field length is zero, the result is zero.

The destination d is padded with the sign bit of the extracted field. If the start position is beyond the msb of the input, the destination d is filled with the replicated sign bit of the extracted field.

Semantics

msb = (.type==.u32 || .type==.s32) ? 31 : 63;
pos = b & 0xff;  // pos restricted to 0..255 range
len = c & 0xff;  // len restricted to 0..255 range

if (.type==.u32 || .type==.u64 || len==0)
    sbit = 0;
else
    sbit = a[min(pos+len-1,msb)];

d = 0;
for (i=0; i<=msb; i++) {
    d[i] = (i<len && pos+i<=msb) ? a[pos+i] : sbit;
}
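A host-side C sketch of the .s32 case, transcribing the semantics above bit by bit (illustrative function name):

```c
#include <stdint.h>

int32_t bfe_s32(int32_t a, uint32_t b, uint32_t c) {
    uint32_t pos = b & 0xff, len = c & 0xff;
    uint32_t sbit = 0;
    if (len != 0) {                       /* sign bit of the extracted field */
        uint32_t sb_pos = pos + len - 1;
        if (sb_pos > 31) sb_pos = 31;     /* min(pos+len-1, msb) */
        sbit = ((uint32_t)a >> sb_pos) & 1u;
    }
    uint32_t d = 0;
    for (uint32_t i = 0; i <= 31; i++) {  /* field bits, padded with sbit */
        uint32_t bit = (i < len && pos + i <= 31)
                     ? ((uint32_t)a >> (pos + i)) & 1u
                     : sbit;
        d |= bit << i;
    }
    return (int32_t)d;
}
```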

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

bfe requires sm_20 or higher.

Examples

bfe.b32  d,a,start,len;

9.7.1.20. Integer Arithmetic Instructions: bfi

bfi

Bit Field Insert.

Syntax

bfi.type  f, a, b, c, d;

.type = { .b32, .b64 };

Description

Align and insert a bit field from a into b, and place the result in f. Source c gives the starting bit position for the insertion, and source d gives the bit field length in bits.

Operands a, b, and f have the same type as the instruction type. Operands c and d are type .u32, but are restricted to the 8-bit value range 0..255.

If the bit field length is zero, the result is b.

If the start position is beyond the msb of the input, the result is b.

Semantics

msb = (.type==.b32) ? 31 : 63;
pos = c & 0xff;  // pos restricted to 0..255 range
len = d & 0xff;  // len restricted to 0..255 range

f = b;
for (i=0; i<len && pos+i<=msb; i++) {
    f[pos+i] = a[i];
}

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

bfi requires sm_20 or higher.

Examples

bfi.b32  d,a,b,start,len;

9.7.1.21. Integer Arithmetic Instructions: szext

szext

Sign-extend or Zero-extend.

Syntax

szext.mode.type  d, a, b;

.mode = { .clamp, .wrap };
.type = { .u32, .s32 };

Description

Sign-extends or zero-extends an N-bit value from operand a, where N is specified in operand b. The resulting value is stored in the destination operand d.

For the .s32 instruction type, the value in a is treated as an N-bit signed value and the most significant bit of this N-bit value is replicated up to bit 31. For the .u32 instruction type, the value in a is treated as an N-bit unsigned number and is zero-extended to 32 bits. Operand b is an unsigned 32-bit value.

If the value of N is 0, then the result of szext is 0. If the value of N is 32 or higher, then the result of szext depends upon the value of the .mode qualifier as follows:

  • If .mode is .clamp, then the result is the same as the source operand a.

  • If .mode is .wrap, then the result is computed using the wrapped value of N.

Semantics

b1        = b & 0x1f;
too_large = (b >= 32 && .mode == .clamp) ? true : false;
mask      = too_large ? 0 : (~0) << b1;
sign_pos  = (b1 - 1) & 0x1f;

if (b1 == 0 || too_large || .type != .s32) {
    sign_bit = false;
} else {
    sign_bit = (a >> sign_pos) & 1;
}

d = (a & ~mask) | (sign_bit ? mask : 0);

PTX ISA Notes

Introduced in PTX ISA version 7.6.

Target ISA Notes

szext requires sm_70 or higher.

Examples

szext.clamp.s32 rd, ra, rb;
szext.wrap.u32  rd, 0xffffffff, 0; // Result is 0.
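The semantics above, including the wrap and clamp cases, can be checked with a small emulation; this Python helper is an illustrative sketch (results truncated to 32 bits):

```python
def szext(a, b, mode, ty):
    # Emulates szext.{clamp,wrap}.{u32,s32} (sketch of the semantics above).
    M = 0xFFFFFFFF
    b1 = b & 0x1F
    too_large = b >= 32 and mode == "clamp"
    mask = 0 if too_large else ((~0 << b1) & M)
    sign_pos = (b1 - 1) & 0x1F
    if b1 == 0 or too_large or ty != "s32":
        sign_bit = 0
    else:
        sign_bit = (a >> sign_pos) & 1
    return ((a & ~mask) | (mask if sign_bit else 0)) & M
```

The second documented example, `szext.wrap.u32 rd, 0xffffffff, 0`, wraps N to 0 and therefore yields 0.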

9.7.1.22. Integer Arithmetic Instructions: bmsk

bmsk

Bit Field Mask.

Syntax

bmsk.mode.b32  d, a, b;

.mode = { .clamp, .wrap };

Description

Generates a 32-bit mask starting from the bit position specified in operand a, and of the width specified in operand b. The generated bitmask is stored in the destination operand d.

The resulting bitmask is 0 in the following cases:

  • When the value of a is 32 or higher and .mode is .clamp.

  • When either the specified value of b or the wrapped value of b (when .mode is specified as .wrap) is 0.

Semantics

a1    = a & 0x1f;
mask0 = (~0) << a1;
b1    = b & 0x1f;
sum   = a1 + b1;
mask1 = (~0) << sum;

sum-overflow          = sum >= 32 ? true : false;
bit-position-overflow = false;
bit-width-overflow    = false;

if (.mode == .clamp) {
    if (a >= 32) {
        bit-position-overflow = true;
        mask0 = 0;
    }
    if (b >= 32) {
        bit-width-overflow = true;
    }
}

if (sum-overflow || bit-position-overflow || bit-width-overflow) {
    mask1 = 0;
} else if (b1 == 0) {
    mask1 = ~0;
}

d = mask0 & ~mask1;

Notes

The bitmask width specified by operand b is limited to the range 0..32 in .clamp mode and to the range 0..31 in .wrap mode.

PTX ISA Notes

Introduced in PTX ISA version 7.6.

Target ISA Notes

bmsk requires sm_70 or higher.

Examples

bmsk.clamp.b32  rd, ra, rb;
bmsk.wrap.b32   rd, 1, 2; // Creates a bitmask of 0x00000006.
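The mask construction can be traced with an emulation of the semantics above; the sketch below reproduces the documented 0x00000006 result for position 1, width 2:

```python
def bmsk(a, b, mode):
    # Emulates bmsk.{clamp,wrap}.b32: build a 32-bit mask of width b
    # starting at bit position a (sketch of the semantics above).
    M = 0xFFFFFFFF
    a1, b1 = a & 0x1F, b & 0x1F
    mask0 = (~0 << a1) & M
    s = a1 + b1
    mask1 = (~0 << s) & M if s < 32 else 0
    pos_overflow = width_overflow = False
    if mode == "clamp":
        if a >= 32:
            pos_overflow, mask0 = True, 0
        if b >= 32:
            width_overflow = True
    if s >= 32 or pos_overflow or width_overflow:
        mask1 = 0
    elif b1 == 0:
        mask1 = M          # wrapped width 0 -> final mask is 0
    return mask0 & ~mask1 & M
```

In .clamp mode a width of 32 at position 0 yields a full mask, matching the 0..32 width range stated in the Notes.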

9.7.1.23. Integer Arithmetic Instructions: dp4a

dp4a

Four-way byte dot product-accumulate.

Syntax

dp4a.atype.btype  d, a, b, c;

.atype = .btype = { .u32, .s32 };

Description

Four-way byte dot product, accumulated into a 32-bit result.

Operands a and b are 32-bit inputs, each holding four byte values in packed form for the dot product.

Operand c has type .u32 if both .atype and .btype are .u32; otherwise operand c has type .s32.

Semantics

d = c;

// Extract 4 bytes from a 32-bit input and sign or zero extend
// based on input type.
Va = extractAndSignOrZeroExt_4(a, .atype);
Vb = extractAndSignOrZeroExt_4(b, .btype);

for (i = 0; i < 4; ++i) {
    d += Va[i] * Vb[i];
}

PTX ISA Notes

Introduced in PTX ISA version 5.0.

Target ISA Notes

Requires sm_61 or higher.

Examples

dp4a.u32.u32           d0, a0, b0, c0;
dp4a.u32.s32           d1, a1, b1, c1;
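The byte extraction and accumulation can be modeled with plain integers; this Python sketch follows the semantics above (result shown truncated to 32 bits):

```python
def dp4a(a, b, c, atype="u32", btype="u32"):
    # Emulates dp4a: four-way byte dot product accumulated into a
    # 32-bit result (sketch of the semantics above).
    def ext4(x, ty):
        vals = []
        for i in range(4):
            v = (x >> (8 * i)) & 0xFF
            if ty == "s32" and v & 0x80:   # sign-extend signed bytes
                v -= 0x100
            vals.append(v)
        return vals
    d = c
    for va, vb in zip(ext4(a, atype), ext4(b, btype)):
        d += va * vb
    return d & 0xFFFFFFFF
```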

9.7.1.24. Integer Arithmetic Instructions: dp2a

dp2a

Two-way dot product-accumulate.

Syntax

dp2a.mode.atype.btype  d, a, b, c;

.atype = .btype = { .u32, .s32 };
.mode = { .lo, .hi };

Description

Two-way 16-bit-to-8-bit dot product, accumulated into a 32-bit result.

Operands a and b are 32-bit inputs. Operand a holds two 16-bit values in packed form, and operand b holds four byte values in packed form for the dot product.

Depending on the .mode specified, either the lower half or the upper half of operand b is used for the dot product.

Operand c has type .u32 if both .atype and .btype are .u32; otherwise operand c has type .s32.

Semantics

d = c;

// Extract two 16-bit values from a 32-bit input and sign or zero extend
// based on input type.
Va = extractAndSignOrZeroExt_2(a, .atype);

// Extract four 8-bit values from a 32-bit input and sign or zero extend
// based on input type.
Vb = extractAndSignOrZeroExt_4(b, .btype);

b_select = (.mode == .lo) ? 0 : 2;

for (i = 0; i < 2; ++i) {
    d += Va[i] * Vb[b_select + i];
}

PTX ISA Notes

Introduced in PTX ISA version 5.0.

Target ISA Notes

Requires sm_61 or higher.

Examples

dp2a.lo.u32.u32           d0, a0, b0, c0;
dp2a.hi.u32.s32           d1, a1, b1, c1;
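The .lo/.hi byte-pair selection is the only part that differs from dp4a, and can be checked the same way; this Python helper is an illustrative sketch of the semantics above:

```python
def dp2a(mode, a, b, c, atype="u32", btype="u32"):
    # Emulates dp2a: two 16-bit values from a multiplied against the
    # lower or upper byte pair of b, accumulated into c (sketch).
    va = []
    for i in range(2):
        h = (a >> (16 * i)) & 0xFFFF
        if atype == "s32" and h & 0x8000:   # sign-extend 16-bit halves
            h -= 0x10000
        va.append(h)
    vb = []
    for i in range(4):
        v = (b >> (8 * i)) & 0xFF
        if btype == "s32" and v & 0x80:     # sign-extend bytes
            v -= 0x100
        vb.append(v)
    sel = 0 if mode == "lo" else 2          # .lo uses bytes 0..1, .hi uses 2..3
    d = c
    for i in range(2):
        d += va[i] * vb[sel + i]
    return d & 0xFFFFFFFF
```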

9.7.2. Extended-Precision Integer Arithmetic Instructions

Instructions add.cc, addc, sub.cc, subc, mad.cc and madc reference an implicitly specified condition code register (CC) having a single carry flag bit (CC.CF) holding carry-in/carry-out or borrow-in/borrow-out. These instructions support extended-precision integer addition, subtraction, and multiplication. No other instructions access the condition code, and there is no support for setting, clearing, or testing the condition code. The condition code register is not preserved across calls and is mainly intended for use in straight-line code sequences for computing extended-precision integer addition, subtraction, and multiplication.

The extended-precision arithmetic instructions are:

  • add.cc, addc

  • sub.cc, subc

  • mad.cc, madc

9.7.2.1. Extended-Precision Arithmetic Instructions: add.cc

add.cc

Add two values with carry-out.

Syntax

add.cc.type  d, a, b;

.type = { .u32, .s32, .u64, .s64 };

Description

Performs integer addition and writes the carry-out value into the condition code register.

Semantics

d = a + b;

carry-out written to CC.CF

Notes

No integer rounding modifiers.

No saturation.

Behavior is the same for unsigned and signed integers.

PTX ISA Notes

32-bit add.cc introduced in PTX ISA version 1.2.

64-bit add.cc introduced in PTX ISA version 4.3.

Target ISA Notes

32-bit add.cc is supported on all target architectures.

64-bit add.cc requires sm_20 or higher.

Examples

@p  add.cc.u32   x1,y1,z1;   // extended-precision addition of
@p  addc.cc.u32  x2,y2,z2;   // two 128-bit values
@p  addc.cc.u32  x3,y3,z3;
@p  addc.u32     x4,y4,z4;

9.7.2.2. Extended-Precision Arithmetic Instructions: addc

addc

Add two values with carry-in and optional carry-out.

Syntax

addc{.cc}.type  d, a, b;

.type = { .u32, .s32, .u64, .s64 };

Description

Performs integer addition with carry-in and optionally writes the carry-out value into the conditioncode register.

Semantics

d = a + b + CC.CF;

if .cc is specified, carry-out written to CC.CF

Notes

No integer rounding modifiers.

No saturation.

Behavior is the same for unsigned and signed integers.

PTX ISA Notes

32-bit addc introduced in PTX ISA version 1.2.

64-bit addc introduced in PTX ISA version 4.3.

Target ISA Notes

32-bit addc is supported on all target architectures.

64-bit addc requires sm_20 or higher.

Examples

@p  add.cc.u32   x1,y1,z1;   // extended-precision addition of
@p  addc.cc.u32  x2,y2,z2;   // two 128-bit values
@p  addc.cc.u32  x3,y3,z3;
@p  addc.u32     x4,y4,z4;

9.7.2.3. Extended-Precision Arithmetic Instructions: sub.cc

sub.cc

Subtract one value from another, with borrow-out.

Syntax

sub.cc.type  d, a, b;

.type = { .u32, .s32, .u64, .s64 };

Description

Performs integer subtraction and writes the borrow-out value into the condition code register.

Semantics

d = a - b;

borrow-out written to CC.CF

Notes

No integer rounding modifiers.

No saturation.

Behavior is the same for unsigned and signed integers.

PTX ISA Notes

32-bit sub.cc introduced in PTX ISA version 1.2.

64-bit sub.cc introduced in PTX ISA version 4.3.

Target ISA Notes

32-bit sub.cc is supported on all target architectures.

64-bit sub.cc requires sm_20 or higher.

Examples

@p  sub.cc.u32   x1,y1,z1;   // extended-precision subtraction
@p  subc.cc.u32  x2,y2,z2;   // of two 128-bit values
@p  subc.cc.u32  x3,y3,z3;
@p  subc.u32     x4,y4,z4;

9.7.2.4. Extended-Precision Arithmetic Instructions: subc

subc

Subtract one value from another, with borrow-in and optional borrow-out.

Syntax

subc{.cc}.type  d, a, b;

.type = { .u32, .s32, .u64, .s64 };

Description

Performs integer subtraction with borrow-in and optionally writes the borrow-out value into the condition code register.

Semantics

d = a  - (b + CC.CF);

if .cc is specified, borrow-out written to CC.CF

Notes

No integer rounding modifiers.

No saturation.

Behavior is the same for unsigned and signed integers.

PTX ISA Notes

32-bit subc introduced in PTX ISA version 1.2.

64-bit subc introduced in PTX ISA version 4.3.

Target ISA Notes

32-bit subc is supported on all target architectures.

64-bit subc requires sm_20 or higher.

Examples

@p  sub.cc.u32   x1,y1,z1;   // extended-precision subtraction
@p  subc.cc.u32  x2,y2,z2;   // of two 128-bit values
@p  subc.cc.u32  x3,y3,z3;
@p  subc.u32     x4,y4,z4;

9.7.2.5. Extended-Precision Arithmetic Instructions: mad.cc

mad.cc

Multiply two values, extract high or low half of result, and add a third value with carry-out.

Syntax

mad{.hi,.lo}.cc.type  d, a, b, c;

.type = { .u32, .s32, .u64, .s64 };

Description

Multiplies two values, extracts either the high or low part of the result, and adds a third value. Writes the result to the destination register and the carry-out from the addition into the condition code register.

Semantics

t = a * b;
d = t<63..32> + c;    // for .hi variant
d = t<31..0> + c;     // for .lo variant

carry-out from addition is written to CC.CF

Notes

Generally used in combination with madc and addc to implement extended-precision multi-word multiplication. See madc for an example.

PTX ISA Notes

32-bit mad.cc introduced in PTX ISA version 3.0.

64-bit mad.cc introduced in PTX ISA version 4.3.

Target ISA Notes

Requires target sm_20 or higher.

Examples

@p  mad.lo.cc.u32 d,a,b,c;
    mad.lo.cc.u32 r,p,q,r;

9.7.2.6. Extended-Precision Arithmetic Instructions: madc

madc

Multiply two values, extract high or low half of result, and add a third value with carry-in and optional carry-out.

Syntax

madc{.hi,.lo}{.cc}.type  d, a, b, c;

.type = { .u32, .s32, .u64, .s64 };

Description

Multiplies two values, extracts either the high or low part of the result, and adds a third value along with carry-in. Writes the result to the destination register and optionally writes the carry-out from the addition into the condition code register.

Semantics

t = a * b;
d = t<63..32> + c + CC.CF;     // for .hi variant
d = t<31..0> + c + CC.CF;      // for .lo variant

if .cc is specified, carry-out from addition is written to CC.CF

Notes

Generally used in combination with mad.cc and addc to implement extended-precision multi-word multiplication. See example below.

PTX ISA Notes

32-bit madc introduced in PTX ISA version 3.0.

64-bit madc introduced in PTX ISA version 4.3.

Target ISA Notes

Requires target sm_20 or higher.

Examples

// extended-precision multiply:  [r3,r2,r1,r0] = [r5,r4] * [r7,r6]
mul.lo.u32     r0,r4,r6;      // r0=(r4*r6).[31:0], no carry-out
mul.hi.u32     r1,r4,r6;      // r1=(r4*r6).[63:32], no carry-out
mad.lo.cc.u32  r1,r5,r6,r1;   // r1+=(r5*r6).[31:0], may carry-out
madc.hi.u32    r2,r5,r6,0;    // r2 =(r5*r6).[63:32]+carry-in,
                              // no carry-out
mad.lo.cc.u32   r1,r4,r7,r1;  // r1+=(r4*r7).[31:0], may carry-out
madc.hi.cc.u32  r2,r4,r7,r2;  // r2+=(r4*r7).[63:32]+carry-in,
                              // may carry-out
addc.u32        r3,0,0;       // r3 = carry-in, no carry-out
mad.lo.cc.u32   r2,r5,r7,r2;  // r2+=(r5*r7).[31:0], may carry-out
madc.hi.u32     r3,r5,r7,r3;  // r3+=(r5*r7).[63:32]+carry-in
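The carry chain in this sequence can be verified against Python's arbitrary-precision integers; the helper below mirrors the instruction sequence word by word, with CC.CF modeled explicitly (an illustrative sketch):

```python
def mul64x64_to_128(x, y):
    # Mirror of the madc example: [r3,r2,r1,r0] = [r5,r4] * [r7,r6],
    # computed with 32-bit words and an explicit carry flag (cf).
    M = 0xFFFFFFFF
    r4, r5 = x & M, x >> 32
    r6, r7 = y & M, y >> 32
    lo = lambda p: p & M          # .lo half of a 64-bit product
    hi = lambda p: p >> 32        # .hi half of a 64-bit product
    r0 = lo(r4 * r6)                                    # mul.lo.u32
    r1 = hi(r4 * r6)                                    # mul.hi.u32
    t = lo(r5 * r6) + r1;      r1, cf = t & M, t >> 32  # mad.lo.cc.u32
    t = hi(r5 * r6) + 0 + cf;  r2 = t & M               # madc.hi.u32
    t = lo(r4 * r7) + r1;      r1, cf = t & M, t >> 32  # mad.lo.cc.u32
    t = hi(r4 * r7) + r2 + cf; r2, cf = t & M, t >> 32  # madc.hi.cc.u32
    r3 = cf                                             # addc.u32 r3,0,0
    t = lo(r5 * r7) + r2;      r2, cf = t & M, t >> 32  # mad.lo.cc.u32
    t = hi(r5 * r7) + r3 + cf; r3 = t & M               # madc.hi.u32
    return (r3 << 96) | (r2 << 64) | (r1 << 32) | r0
```

For any pair of 64-bit inputs the reassembled 128-bit value equals the full product, which confirms that the carry placement in the sequence is exact.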

9.7.3. Floating-Point Instructions

Floating-point instructions operate on .f32 and .f64 register operands and constant immediate values. The floating-point instructions are:

  • testp

  • copysign

  • add

  • sub

  • mul

  • fma

  • mad

  • div

  • abs

  • neg

  • min

  • max

  • rcp

  • sqrt

  • rsqrt

  • sin

  • cos

  • lg2

  • ex2

  • tanh

Instructions that support rounding modifiers are IEEE-754 compliant. Double-precision instructions support subnormal inputs and results. Single-precision instructions support subnormal inputs and results by default for sm_20 and subsequent targets, and flush subnormal inputs and results to sign-preserving zero for sm_1x targets. The optional .ftz modifier on single-precision instructions provides backward compatibility with sm_1x targets by flushing subnormal inputs and results to sign-preserving zero regardless of the target architecture.

Single-precision add, sub, mul, and mad support saturation of results to the range [0.0, 1.0], with NaNs being flushed to positive zero. NaN payloads are supported for double-precision instructions (except for rcp.approx.ftz.f64 and rsqrt.approx.ftz.f64, which map input NaNs to a canonical NaN). Single-precision instructions return an unspecified NaN. Note that future implementations may support NaN payloads for single-precision instructions, so PTX programs should not rely on specific single-precision NaNs being generated.

Table 29 summarizes floating-point instructions in PTX.

Table 29 Summary of Floating-Point Instructions

Instruction                    .rn  .rz  .rm  .rp  .ftz  .sat  Notes
{add,sub,mul}.rnd.f32          x    x    x    x    x     x     If no rounding modifier is specified,
                                                               default is .rn and instructions may be
                                                               folded into a multiply-add.
{add,sub,mul}.rnd.f64          x    x    x    x    n/a   n/a   If no rounding modifier is specified,
                                                               default is .rn and instructions may be
                                                               folded into a multiply-add.
mad.f32                        n/a  n/a  n/a  n/a  x     x     .target sm_1x. No rounding modifier.
{mad,fma}.rnd.f32              x    x    x    x    x     x     .target sm_20 or higher.
                                                               mad.f32 and fma.f32 are the same.
{mad,fma}.rnd.f64              x    x    x    x    n/a   n/a   mad.f64 and fma.f64 are the same.
div.full.f32                   n/a  n/a  n/a  n/a  x     n/a   No rounding modifier.
{div,rcp,sqrt}.approx.f32      n/a  n/a  n/a  n/a  x     n/a
rcp.approx.ftz.f64             n/a  n/a  n/a  n/a  x     n/a   .target sm_20 or higher.
{div,rcp,sqrt}.rnd.f32         x    x    x    x    x     n/a   .target sm_20 or higher.
{div,rcp,sqrt}.rnd.f64         x    x    x    x    n/a   n/a   .target sm_20 or higher.
{abs,neg,min,max}.f32          n/a  n/a  n/a  n/a  x     n/a
{abs,neg,min,max}.f64          n/a  n/a  n/a  n/a  n/a   n/a
rsqrt.approx.f32               n/a  n/a  n/a  n/a  x     n/a
rsqrt.approx.f64               n/a  n/a  n/a  n/a  n/a   n/a
rsqrt.approx.ftz.f64           n/a  n/a  n/a  n/a  x     n/a   .target sm_20 or higher.
{sin,cos,lg2,ex2}.approx.f32   n/a  n/a  n/a  n/a  x     n/a
tanh.approx.f32                n/a  n/a  n/a  n/a  n/a   n/a   .target sm_75 or higher.

9.7.3.1. Floating Point Instructions: testp

testp

Test floating-point property.

Syntax

testp.op.type  p, a;  // result is .pred

.op   = { .finite, .infinite,
          .number, .notanumber,
          .normal, .subnormal };
.type = { .f32, .f64 };

Description

testp tests common properties of floating-point numbers and returns a predicate value of 1 if True and 0 if False.

testp.finite

True if the input is not infinite or NaN

testp.infinite

True if the input is positive or negative infinity

testp.number

True if the input is not NaN

testp.notanumber

True if the input is NaN

testp.normal

True if the input is a normal number (not NaN, not infinity)

testp.subnormal

True if the input is a subnormal number (not NaN, not infinity)

As a special case, positive and negative zero are considered normal numbers.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

Requires sm_20 or higher.

Examples

testp.notanumber.f32  isnan, f0;
testp.infinite.f64    p, X;

9.7.3.2. Floating Point Instructions: copysign

copysign

Copy sign of one input to another.

Syntax

copysign.type  d, a, b;

.type = { .f32, .f64 };

Description

Copy sign bit of a into value of b, and return the result as d.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

Requires sm_20 or higher.

Examples

copysign.f32  x, y, z;
copysign.f64  A, B, C;
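Because the operation is a pure bit transfer, it can be modeled on the raw IEEE 754 encoding; this Python sketch reproduces the .f64 behavior (illustrative only):

```python
import struct

def ptx_copysign(a, b):
    # Emulates copysign.f64: result is the value of b with the
    # sign bit taken from a (sketch working on raw IEEE 754 bits).
    abits = struct.unpack("<Q", struct.pack("<d", a))[0]
    bbits = struct.unpack("<Q", struct.pack("<d", b))[0]
    sign = abits & (1 << 63)
    out = sign | (bbits & ((1 << 63) - 1))
    return struct.unpack("<d", struct.pack("<Q", out))[0]
```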

9.7.3.3. Floating Point Instructions: add

add

Add two values.

Syntax

add{.rnd}{.ftz}{.sat}.f32  d, a, b;
add{.rnd}{.ftz}.f32x2      d, a, b;
add{.rnd}.f64              d, a, b;

.rnd = { .rn, .rz, .rm, .rp };

Description

Performs addition and writes the resulting value into a destination register.

For the .f32x2 instruction type, input vectors of single-precision (.f32) values are formed from the source operands. The single-precision (.f32) operands are then added in parallel to produce the .f32x2 result in the destination.

For the .f32x2 instruction type, operands d, a and b have .b64 type.

Semantics

if (type == f32 || type == f64) {
    d = a + b;
} else if (type == f32x2) {
    fA[0] = a[0:31];
    fA[1] = a[32:63];
    fB[0] = b[0:31];
    fB[1] = b[32:63];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] + fB[i];
    }
}

Notes

Rounding modifiers:

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

The default value of the rounding modifier is .rn. Note that an add instruction with an explicit rounding modifier is treated conservatively by the code optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

add.ftz.f32 and add.ftz.f32x2 flush subnormal inputs and results to sign-preserving zero.

sm_1x

add.f64 supports subnormal numbers.

add.f32 flushes subnormal inputs and results to sign-preserving zero.

Saturation modifier:

add.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

add.f32x2 introduced in PTX ISA version 8.6.

Target ISA Notes

add.f32 supported on all target architectures.

add.f64 requires sm_13 or higher.

Rounding modifiers have the following target requirements:

.rn,.rz

available for all targets

.rm,.rp

for add.f64, requires sm_13 or higher.

for add.f32, requires sm_20 or higher.

add.f32x2 requires sm_100 or higher.

Examples

@p  add.rz.ftz.f32  f1,f2,f3;
add.rp.ftz.f32x2    d, a, b;

9.7.3.4. Floating Point Instructions: sub

sub

Subtract one value from another.

Syntax

sub{.rnd}{.ftz}{.sat}.f32  d, a, b;
sub{.rnd}{.ftz}.f32x2      d, a, b;
sub{.rnd}.f64              d, a, b;

.rnd = { .rn, .rz, .rm, .rp };

Description

Performs subtraction and writes the resulting value into a destination register.

For the .f32x2 instruction type, input vectors of single-precision (.f32) values are formed from the source operands. The single-precision (.f32) operands are then subtracted in parallel to produce the .f32x2 result in the destination.

For the .f32x2 instruction type, operands d, a and b have .b64 type.

Semantics

if (type == f32 || type == f64) {
    d = a - b;
} else if (type == f32x2) {
    fA[0] = a[0:31];
    fA[1] = a[32:63];
    fB[0] = b[0:31];
    fB[1] = b[32:63];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] - fB[i];
    }
}

Notes

Rounding modifiers:

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

The default value of the rounding modifier is .rn. Note that a sub instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A sub instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

sub.ftz.f32 and sub.ftz.f32x2 flush subnormal inputs and results to sign-preserving zero.

sm_1x

sub.f64 supports subnormal numbers.

sub.f32 flushes subnormal inputs and results to sign-preserving zero.

Saturation modifier:

sub.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

sub.f32x2 introduced in PTX ISA version 8.6.

Target ISA Notes

sub.f32 supported on all target architectures.

sub.f64 requires sm_13 or higher.

Rounding modifiers have the following target requirements:

.rn,.rz

available for all targets

.rm,.rp

for sub.f64, requires sm_13 or higher.

for sub.f32, requires sm_20 or higher.

sub.f32x2 requires sm_100 or higher.

Examples

sub.f32 c,a,b;
sub.rn.ftz.f32  f1,f2,f3;

9.7.3.5. Floating Point Instructions: mul

mul

Multiply two values.

Syntax

mul{.rnd}{.ftz}{.sat}.f32  d, a, b;
mul{.rnd}{.ftz}.f32x2      d, a, b;
mul{.rnd}.f64              d, a, b;

.rnd = { .rn, .rz, .rm, .rp };

Description

Compute the product of two values.

For the .f32x2 instruction type, input vectors of single-precision (.f32) values are formed from the source operands. The single-precision (.f32) operands are then multiplied in parallel to produce the .f32x2 result in the destination.

For the .f32x2 instruction type, operands d, a and b have .b64 type.

Semantics

if (type == f32 || type == f64) {
    d = a * b;
} else if (type == f32x2) {
    fA[0] = a[0:31];
    fA[1] = a[32:63];
    fB[0] = b[0:31];
    fB[1] = b[32:63];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] * fB[i];
    }
}

Notes

For floating-point multiplication, all operands must be the same size.

Rounding modifiers:

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

The default value of the rounding modifier is .rn. Note that a mul instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A mul instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add and mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

mul.ftz.f32 and mul.ftz.f32x2 flush subnormal inputs and results to sign-preserving zero.

sm_1x

mul.f64 supports subnormal numbers.

mul.f32 flushes subnormal inputs and results to sign-preserving zero.

Saturation modifier:

mul.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

mul.f32x2 introduced in PTX ISA version 8.6.

Target ISA Notes

mul.f32 supported on all target architectures.

mul.f64 requires sm_13 or higher.

Rounding modifiers have the following target requirements:

.rn,.rz

available for all targets

.rm,.rp

for mul.f64, requires sm_13 or higher.

for mul.f32, requires sm_20 or higher.

mul.f32x2 requires sm_100 or higher.

Examples

mul.ftz.f32 circumf,radius,pi  // a single-precision multiply

9.7.3.6. Floating Point Instructions: fma

fma

Fused multiply-add.

Syntax

fma.rnd{.ftz}{.sat}.f32  d, a, b, c;
fma.rnd{.ftz}.f32x2      d, a, b, c;
fma.rnd.f64              d, a, b, c;

.rnd = { .rn, .rz, .rm, .rp };

Description

Performs a fused multiply-add with no loss of precision in the intermediate product and addition.

For the .f32x2 instruction type, input vectors of single-precision (.f32) values are formed from the source operands. The single-precision (.f32) operands are then operated on in parallel to produce the .f32x2 result in the destination.

For the .f32x2 instruction type, operands d, a, b and c have .b64 type.

Semantics

if (type == f32 || type == f64) {
    d = a * b + c;
} else if (type == f32x2) {
    fA[0] = a[0:31];
    fA[1] = a[32:63];
    fB[0] = b[0:31];
    fB[1] = b[32:63];
    fC[0] = c[0:31];
    fC[1] = c[32:63];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] * fB[i] + fC[i];
    }
}

Notes

fma.f32 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to single precision using the rounding mode specified by .rnd.

fma.f64 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to double precision using the rounding mode specified by .rnd.

fma.f64 is the same as mad.f64.

Rounding modifiers (no default):

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

fma.ftz.f32 and fma.ftz.f32x2 flush subnormal inputs and results to sign-preserving zero.

sm_1x

fma.f64 supports subnormal numbers.

fma.f32 is unimplemented for sm_1x targets.

Saturation:

fma.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

fma.f64 introduced in PTX ISA version 1.4.

fma.f32 introduced in PTX ISA version 2.0.

fma.f32x2 introduced in PTX ISA version 8.6.

Target ISA Notes

fma.f32 requires sm_20 or higher.

fma.f64 requires sm_13 or higher.

fma.f32x2 requires sm_100 or higher.

Examples

    fma.rn.ftz.f32   w,x,y,z;
@p  fma.rn.f64       d,a,b,c;
    fma.rp.ftz.f32x2 p,q,r,s;

9.7.3.7. Floating Point Instructions: mad

mad

Multiply two values and add a third value.

Syntax

mad{.ftz}{.sat}.f32      d, a, b, c;    // .target sm_1x
mad.rnd{.ftz}{.sat}.f32  d, a, b, c;    // .target sm_20
mad.rnd.f64              d, a, b, c;    // .target sm_13 and higher

.rnd = { .rn, .rz, .rm, .rp };

Description

Multiplies two values and adds a third, and then writes the resulting value into a destination register.

Semantics

d = a*b + c;

Notes

For .target sm_20 and higher:

  • mad.f32 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to single precision using the rounding mode specified by .rnd.

  • mad.f64 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to double precision using the rounding mode specified by .rnd.

  • mad.{f32,f64} is the same as fma.{f32,f64}.

For .target sm_1x:

  • mad.f32 computes the product of a and b at double precision, and then the mantissa is truncated to 23 bits, but the exponent is preserved. Note that this is different from computing the product with mul, where the mantissa can be rounded and the exponent will be clamped. The exception for mad.f32 is when c = +/-0.0: in that case mad.f32 is identical to the result computed using separate mul and add instructions. When JIT-compiled for SM 2.0 devices, mad.f32 is implemented as a fused multiply-add (i.e., fma.rn.ftz.f32); it can then produce slightly different numeric results, and backward compatibility is not guaranteed in this case.

  • mad.f64 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to double precision using the rounding mode specified by .rnd. Unlike mad.f32, the treatment of subnormal inputs and output follows the IEEE 754 standard.

  • mad.f64 is the same as fma.f64.

Rounding modifiers (no default):

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

mad.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

mad.f64 supports subnormal numbers.

mad.f32 flushes subnormal inputs and results to sign-preserving zero.

Saturation modifier:

mad.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

In PTX ISA versions 1.4 and later, a rounding modifier is required for mad.f64.

Legacy mad.f64 instructions having no rounding modifier will map to mad.rn.f64.

In PTX ISA versions 2.0 and later, a rounding modifier is required for mad.f32 for sm_20 and higher targets.

Errata

mad.f32 requires a rounding modifier for sm_20 and higher targets. However, for PTX ISA version 3.0 and earlier, ptxas does not enforce this requirement and mad.f32 silently defaults to mad.rn.f32. For PTX ISA version 3.1, ptxas generates a warning and defaults to mad.rn.f32, and in subsequent releases ptxas will enforce the requirement for PTX ISA version 3.2 and later.

Target ISA Notes

mad.f32 supported on all target architectures.

mad.f64 requires sm_13 or higher.

Rounding modifiers have the following target requirements:

  • .rn, .rz, .rm, .rp for mad.f64, requires sm_13 or higher.

  • .rn, .rz, .rm, .rp for mad.f32, requires sm_20 or higher.

Examples

@p  mad.f32  d,a,b,c;

9.7.3.8. Floating Point Instructions: div

div

Divide one value by another.

Syntax

div.approx{.ftz}.f32  d, a, b;  // fast, approximate divide
div.full{.ftz}.f32    d, a, b;  // full-range approximate divide
div.rnd{.ftz}.f32     d, a, b;  // IEEE 754 compliant rounding
div.rnd.f64           d, a, b;  // IEEE 754 compliant rounding

.rnd = { .rn, .rz, .rm, .rp };

Description

Divides a by b, stores result in d.

Semantics

d = a / b;

Notes

Fast, approximate single-precision divides:

  • div.approx.f32 implements a fast approximation to divide, computed as d = a * (1/b). For |b| in [2^-126, 2^126], the maximum ulp error is 2. For 2^126 < |b| < 2^128, if a is infinity, div.approx.f32 returns NaN, otherwise it returns a sign-preserving zero.

  • div.full.f32 implements a relatively fast, full-range approximation that scales operands to achieve better accuracy, but is not fully IEEE 754 compliant and does not support rounding modifiers. The maximum ulp error is 2 across the full range of inputs.

Divide with IEEE 754 compliant rounding:

Rounding modifiers (no default):

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

div.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

div.f64 supports subnormal numbers.

div.f32 flushes subnormal inputs and results to sign-preserving zero.

PTX ISA Notes

div.f32 and div.f64 introduced in PTX ISA version 1.0.

Explicit modifiers .approx, .full, .ftz, and rounding introduced in PTX ISA version 1.4.

For PTX ISA version 1.4 and later, one of .approx, .full, or .rnd is required.

For PTX ISA versions 1.0 through 1.3, div.f32 defaults to div.approx.ftz.f32, and div.f64 defaults to div.rn.f64.

Target ISA Notes

div.approx.f32 and div.full.f32 supported on all target architectures.

div.rnd.f32 requires sm_20 or higher.

div.rn.f64 requires sm_13 or higher, or .target map_f64_to_f32.

div.{rz,rm,rp}.f64 requires sm_20 or higher.

Examples

div.approx.ftz.f32  diam,circum,3.14159;
div.full.ftz.f32    x, y, z;
div.rn.f64          xd, yd, zd;

9.7.3.9. Floating Point Instructions: abs

abs

Absolute value.

Syntax

abs{.ftz}.f32  d, a;
abs.f64        d, a;

Description

Take the absolute value of a and store the result in d.

Semantics

d = |a|;

Notes

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

abs.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

abs.f64 supports subnormal numbers.

abs.f32 flushes subnormal inputs and results to sign-preserving zero.

For abs.f32, NaN input yields an unspecified NaN. For abs.f64, NaN input is passed through unchanged. Future implementations may comply with the IEEE 754 standard by preserving payload and modifying only the sign bit.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

abs.f32 supported on all target architectures.

abs.f64 requires sm_13 or higher.

Examples

abs.ftz.f32  x,f0;

9.7.3.10. Floating Point Instructions: neg

neg

Arithmetic negate.

Syntax

neg{.ftz}.f32  d, a;
neg.f64        d, a;

Description

Negate the sign of a and store the result in d.

Semantics

d = -a;

Notes

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

neg.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

neg.f64 supports subnormal numbers.

neg.f32 flushes subnormal inputs and results to sign-preserving zero.

NaN inputs yield an unspecified NaN. Future implementations may comply with the IEEE 754 standard by preserving payload and modifying only the sign bit.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

neg.f32 supported on all target architectures.

neg.f64 requires sm_13 or higher.

Examples

neg.ftz.f32  x,f0;

9.7.3.11. Floating Point Instructions: min

min

Find the minimum of given values.

Syntax

min{.ftz}{.NaN}{.xorsign.abs}.f32  d, a, b;
min{.ftz}{.NaN}{.abs}.f32          d, a, b, c;
min.f64                            d, a, b;

Description

Store the minimum of a, b, and optionally c in d.

If the .NaN modifier is specified, then the result is canonical NaN if any of the inputs is NaN.

If the .abs modifier is specified, the magnitude of destination operand d is the minimum of the absolute values of the input arguments.

If the .xorsign modifier is specified, the sign bit of destination d is equal to the XOR of the sign bits of inputs a and b. The .xorsign qualifier cannot be specified for the three-input operation.

Qualifier .xorsign requires qualifier .abs to be specified. In such cases, .xorsign considers the sign bits of both inputs before the .abs operation is applied.

If the result of min is NaN, then the .xorsign and .abs modifiers will be ignored.

Semantics

def min_num (z, x, y) {
    if (isNaN(x) && isNaN(y))
        z = NaN;
    else if (isNaN(x))
        z = y;
    else if (isNaN(y))
        z = x;
    else
        // note: -0.0 < +0.0 here
        z = (x < y) ? x : y;
    return z;
}

def min_nan (z, x, y) {
    if (isNaN(x) || isNaN(y))
        z = NaN;
    else
        // note: -0.0 < +0.0 here
        z = (x < y) ? x : y;
    return z;
}

def two_inputs_min (z, x, y) {
    if (.NaN)
        z = min_nan(z, x, y);
    else
        z = min_num(z, x, y);
    return z;
}

if (.xorsign && !isPresent(c)) {
    xorsign = getSignBit(a) ^ getSignBit(b);
}
if (.abs) {
    a = |a|;
    b = |b|;
    if (isPresent(c)) {
        c = |c|;
    }
}
d = two_inputs_min(d, a, b);
if (isPresent(c)) {
    d = two_inputs_min(d, d, c);
}
if (.xorsign && !isPresent(c) && !isNaN(d)) {
    setSignBit(d, xorsign);
}
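The two-input selection above can be mirrored in plain Python. The sketch below is an illustrative model (not the hardware path): it implements the min_num / min_nan choice, the -0.0 < +0.0 ordering, and the .xorsign.abs post-processing, with nan_mode standing in for the .NaN qualifier.

```python
import math

def ptx_min_f32(a, b, nan_mode=False, xorsign_abs=False):
    """Illustrative model of PTX two-input min semantics (not NVIDIA code)."""
    if xorsign_abs:
        # XOR of the input sign bits, captured before taking |a| and |b|
        sign = (math.copysign(1.0, a) < 0) ^ (math.copysign(1.0, b) < 0)
        a, b = abs(a), abs(b)
    if math.isnan(a) or math.isnan(b):
        if nan_mode or (math.isnan(a) and math.isnan(b)):
            return math.nan                   # canonical NaN
        d = b if math.isnan(a) else a         # min_num: ignore the NaN input
    else:
        # note: -0.0 < +0.0 here
        if a == b == 0.0:
            d = a if math.copysign(1.0, a) < 0 else b
        else:
            d = min(a, b)
    if xorsign_abs and not math.isnan(d):
        d = math.copysign(d, -1.0 if sign else 1.0)
    return d
```

For example, ptx_min_f32(-3.0, 2.0, xorsign_abs=True) picks min(|-3|, |2|) = 2 and applies the XORed (negative) sign, giving -2.0.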

Notes

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

min.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

min.f64 supports subnormal numbers.

min.f32 flushes subnormal inputs and results to sign-preserving zero.

If values of both inputs are 0.0, then +0.0 > -0.0.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

min.NaN introduced in PTX ISA version 7.0.

min.xorsign.abs introduced in PTX ISA version 7.2.

min with three input arguments introduced in PTX ISA version 8.8.

Target ISA Notes

min.f32 supported on all target architectures.

min.f64 requires sm_13 or higher.

min.NaN requires sm_80 or higher.

min.xorsign.abs requires sm_86 or higher.

min with three input arguments requires sm_100 or higher.

Examples

@p  min.ftz.f32  z, z, x;
    min.f64      a, b, c;
    // fp32 min with .NaN
    min.NaN.f32  f0, f1, f2;
    // fp32 min with .xorsign.abs
    min.xorsign.abs.f32 Rd, Ra, Rb;

9.7.3.12. Floating Point Instructions: max

max

Find the maximum of given values.

Syntax

max{.ftz}{.NaN}{.xorsign.abs}.f32  d, a, b;
max{.ftz}{.NaN}{.abs}.f32          d, a, b, c;
max.f64                            d, a, b;

Description

Store the maximum of a, b, and optionally c in d.

If the .NaN modifier is specified, the result is canonical NaN if any of the inputs is NaN.

If the .abs modifier is specified, the magnitude of destination operand d is the maximum of the absolute values of the input arguments.

If the .xorsign modifier is specified, the sign bit of destination d is equal to the XOR of the sign bits of inputs a and b. The .xorsign qualifier cannot be specified for the three-input operation.

Qualifier .xorsign requires qualifier .abs to be specified. In such cases, .xorsign considers the sign bits of both inputs before the .abs operation is applied.

If the result of max is NaN, then the .xorsign and .abs modifiers will be ignored.

Semantics

def max_num (z, x, y) {
    if (isNaN(x) && isNaN(y))
        z = NaN;
    else if (isNaN(x))
        z = y;
    else if (isNaN(y))
        z = x;
    else
        // note: +0.0 > -0.0 here
        z = (x > y) ? x : y;
    return z;
}

def max_nan (z, x, y) {
    if (isNaN(x) || isNaN(y))
        z = NaN;
    else
        // note: +0.0 > -0.0 here
        z = (x > y) ? x : y;
    return z;
}

def two_inputs_max (z, x, y) {
    if (.NaN)
        z = max_nan(z, x, y);
    else
        z = max_num(z, x, y);
    return z;
}

if (.xorsign && !isPresent(c)) {
    xorsign = getSignBit(a) ^ getSignBit(b);
}
if (.abs) {
    a = |a|;
    b = |b|;
    if (isPresent(c)) {
        c = |c|;
    }
}
d = two_inputs_max(d, a, b);
if (isPresent(c)) {
    d = two_inputs_max(d, d, c);
}
if (.xorsign && !isPresent(c) && !isNaN(d)) {
    setSignBit(d, xorsign);
}

Notes

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

max.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

max.f64 supports subnormal numbers.

max.f32 flushes subnormal inputs and results to sign-preserving zero.

If values of both inputs are 0.0, then +0.0 > -0.0.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

max.NaN introduced in PTX ISA version 7.0.

max.xorsign.abs introduced in PTX ISA version 7.2.

max with three input arguments introduced in PTX ISA version 8.8.

Target ISA Notes

max.f32 supported on all target architectures.

max.f64 requires sm_13 or higher.

max.NaN requires sm_80 or higher.

max.xorsign.abs requires sm_86 or higher.

max with three input arguments requires sm_100 or higher.

Examples

max.ftz.f32  f0, f1, f2;
max.f64      a, b, c;
// fp32 max with .NaN
max.NaN.f32  f0, f1, f2;
// fp32 max with .xorsign.abs
max.xorsign.abs.f32 Rd, Ra, Rb;

9.7.3.13. Floating Point Instructions: rcp

rcp

Take the reciprocal of a value.

Syntax

rcp.approx{.ftz}.f32  d, a;  // fast, approximate reciprocal
rcp.rnd{.ftz}.f32     d, a;  // IEEE 754 compliant rounding
rcp.rnd.f64           d, a;  // IEEE 754 compliant rounding

.rnd = { .rn, .rz, .rm, .rp };

Description

Compute 1/a and store the result in d.

Semantics

d = 1 / a;

Notes

Fast, approximate single-precision reciprocal:

rcp.approx.f32 implements a fast approximation to reciprocal. The maximum ulp error is 1 across the full range of inputs.

Input    Result
-Inf     -0.0
-0.0     -Inf
+0.0     +Inf
+Inf     +0.0
NaN      NaN

Reciprocal with IEEE 754 compliant rounding:

Rounding modifiers (no default):

.rn  mantissa LSB rounds to nearest even
.rz  mantissa LSB rounds towards zero
.rm  mantissa LSB rounds towards negative infinity
.rp  mantissa LSB rounds towards positive infinity

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

rcp.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

rcp.f64 supports subnormal numbers.

rcp.f32 flushes subnormal inputs and results to sign-preserving zero.

PTX ISA Notes

rcp.f32 and rcp.f64 introduced in PTX ISA version 1.0. rcp.rn.f64 and explicit modifiers .approx and .ftz were introduced in PTX ISA version 1.4. General rounding modifiers were added in PTX ISA version 2.0.

For PTX ISA version 1.4 and later, one of .approx or .rnd is required.

For PTX ISA versions 1.0 through 1.3, rcp.f32 defaults to rcp.approx.ftz.f32, and rcp.f64 defaults to rcp.rn.f64.

Target ISA Notes

rcp.approx.f32 supported on all target architectures.

rcp.rnd.f32 requires sm_20 or higher.

rcp.rn.f64 requires sm_13 or higher, or .target map_f64_to_f32.

rcp.{rz,rm,rp}.f64 requires sm_20 or higher.

Examples

rcp.approx.ftz.f32  ri, r;
rcp.rn.ftz.f32      xi, x;
rcp.rn.f64          xi, x;

9.7.3.14. Floating Point Instructions: rcp.approx.ftz.f64

rcp.approx.ftz.f64

Compute a fast, gross approximation to the reciprocal of a value.

Syntax

rcp.approx.ftz.f64  d, a;

Description

Compute a fast, gross approximation to the reciprocal as follows:

  1. extract the most-significant 32 bits of .f64 operand a in 1.11.20 IEEE floating-point format (i.e., ignore the least-significant 32 bits of a),

  2. compute an approximate .f64 reciprocal of this value using the most-significant 20 bits of the mantissa of operand a,

  3. place the resulting 32 bits in 1.11.20 IEEE floating-point format in the most-significant 32 bits of destination d, and

  4. zero the least-significant 32 mantissa bits of the .f64 destination d.

Semantics

tmp = a[63:32]; // upper word of a, 1.11.20 format
d[63:32] = 1.0 / tmp;
d[31:0]  = 0x00000000;
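The operand handling can be modeled in Python at the bit level. This is an illustrative sketch only: it zeroes the low 32 bits of the input, computes a reciprocal, and zeroes the low 32 bits of the result; an exact 1/x stands in for the hardware's 20-bit approximation, which is not modeled, and corner cases such as zero inputs are not handled.

```python
import struct

def rcp_approx_ftz_f64_model(a: float) -> float:
    """Sketch of rcp.approx.ftz.f64 operand handling (illustrative only)."""
    # Keep only the upper 32 bits of the input encoding (1.11.20 format).
    hi = struct.unpack("<Q", struct.pack("<d", a))[0] & 0xFFFFFFFF00000000
    tmp = struct.unpack("<d", struct.pack("<Q", hi))[0]
    r = 1.0 / tmp  # stand-in for the 20-bit hardware approximation
    # Zero the least-significant 32 mantissa bits of the result.
    rbits = struct.unpack("<Q", struct.pack("<d", r))[0] & 0xFFFFFFFF00000000
    return struct.unpack("<d", struct.pack("<Q", rbits))[0]
```

Powers of two survive the truncation exactly (e.g. an input of 2.0 yields 0.5); other inputs see a relative error of at most about 2^-20 from the discarded mantissa bits.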

Notes

rcp.approx.ftz.f64 implements a fast, gross approximation to reciprocal.

Input a[63:32]    Result d[63:32]
-Inf              -0.0
-subnormal        -Inf
-0.0              -Inf
+0.0              +Inf
+subnormal        +Inf
+Inf              +0.0
NaN               NaN

Input NaNs map to a canonical NaN with encoding 0x7fffffff00000000.

Subnormal inputs and results are flushed to sign-preserving zero.

PTX ISA Notes

rcp.approx.ftz.f64 introduced in PTX ISA version 2.1.

Target ISA Notes

rcp.approx.ftz.f64 requires sm_20 or higher.

Examples

rcp.approx.ftz.f64  xi, x;

9.7.3.15. Floating Point Instructions: sqrt

sqrt

Take the square root of a value.

Syntax

sqrt.approx{.ftz}.f32  d, a; // fast, approximate square root
sqrt.rnd{.ftz}.f32     d, a; // IEEE 754 compliant rounding
sqrt.rnd.f64           d, a; // IEEE 754 compliant rounding

.rnd = { .rn, .rz, .rm, .rp };

Description

Compute sqrt(a) and store the result in d.

Semantics

d = sqrt(a);

Notes

sqrt.approx.f32 implements a fast approximation to square root. The maximum relative error over the entire positive finite floating-point range is 2^-23.

For various corner-case inputs, results of the sqrt instruction are shown in the table below:

Input     Result
-Inf      NaN
-normal   NaN
-0.0      -0.0
+0.0      +0.0
+Inf      +Inf
NaN       NaN

Square root with IEEE 754 compliant rounding:

Rounding modifiers (no default):

.rn  mantissa LSB rounds to nearest even
.rz  mantissa LSB rounds towards zero
.rm  mantissa LSB rounds towards negative infinity
.rp  mantissa LSB rounds towards positive infinity

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

sqrt.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

sqrt.f64 supports subnormal numbers.

sqrt.f32 flushes subnormal inputs and results to sign-preserving zero.

PTX ISA Notes

sqrt.f32 and sqrt.f64 introduced in PTX ISA version 1.0. sqrt.rn.f64 and explicit modifiers .approx and .ftz were introduced in PTX ISA version 1.4. General rounding modifiers were added in PTX ISA version 2.0.

For PTX ISA version 1.4 and later, one of .approx or .rnd is required.

For PTX ISA versions 1.0 through 1.3, sqrt.f32 defaults to sqrt.approx.ftz.f32, and sqrt.f64 defaults to sqrt.rn.f64.

Target ISA Notes

sqrt.approx.f32 supported on all target architectures.

sqrt.rnd.f32 requires sm_20 or higher.

sqrt.rn.f64 requires sm_13 or higher, or .target map_f64_to_f32.

sqrt.{rz,rm,rp}.f64 requires sm_20 or higher.

Examples

sqrt.approx.ftz.f32  r, x;
sqrt.rn.ftz.f32      r, x;
sqrt.rn.f64          r, x;

9.7.3.16. Floating Point Instructions: rsqrt

rsqrt

Take the reciprocal of the square root of a value.

Syntax

rsqrt.approx{.ftz}.f32  d, a;
rsqrt.approx.f64        d, a;

Description

Compute 1/sqrt(a) and store the result in d.

Semantics

d = 1/sqrt(a);

Notes

rsqrt.approx implements an approximation to the reciprocal square root.

Input     Result
-Inf      NaN
-normal   NaN
-0.0      -Inf
+0.0      +Inf
+Inf      +0.0
NaN       NaN

The maximum relative error for rsqrt.f32 over the entire positive finite floating-point range is 2^-22.9.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

rsqrt.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

rsqrt.f64 supports subnormal numbers.

rsqrt.f32 flushes subnormal inputs and results to sign-preserving zero.

Note that rsqrt.approx.f64 is emulated in software and is relatively slow.

PTX ISA Notes

rsqrt.f32 and rsqrt.f64 were introduced in PTX ISA version 1.0. Explicit modifiers .approx and .ftz were introduced in PTX ISA version 1.4.

For PTX ISA version 1.4 and later, the .approx modifier is required.

For PTX ISA versions 1.0 through 1.3, rsqrt.f32 defaults to rsqrt.approx.ftz.f32, and rsqrt.f64 defaults to rsqrt.approx.f64.

Target ISA Notes

rsqrt.f32 supported on all target architectures.

rsqrt.f64 requires sm_13 or higher.

Examples

rsqrt.approx.ftz.f32  isr, x;
rsqrt.approx.f64      ISR, X;

9.7.3.17. Floating Point Instructions: rsqrt.approx.ftz.f64

rsqrt.approx.ftz.f64

Compute an approximation of the square root reciprocal of a value.

Syntax

rsqrt.approx.ftz.f64 d, a;

Description

Compute a double-precision (.f64) approximation of the square root reciprocal of a value. The least significant 32 bits of the double-precision (.f64) destination d are all zeros.

Semantics

tmp = a[63:32]; // upper word of a, 1.11.20 format
d[63:32] = 1.0 / sqrt(tmp);
d[31:0]  = 0x00000000;

Notes

rsqrt.approx.ftz.f64 implements a fast approximation of the square root reciprocal of a value.

Input        Result
-Inf         NaN
-subnormal   -Inf
-0.0         -Inf
+0.0         +Inf
+subnormal   +Inf
+Inf         +0.0
NaN          NaN

Input NaNs map to a canonical NaN with encoding 0x7fffffff00000000.

Subnormal inputs and results are flushed to sign-preserving zero.

PTX ISA Notes

rsqrt.approx.ftz.f64 introduced in PTX ISA version 4.0.

Target ISA Notes

rsqrt.approx.ftz.f64 requires sm_20 or higher.

Examples

rsqrt.approx.ftz.f64 xi, x;

9.7.3.18. Floating Point Instructions: sin

sin

Find the sine of a value.

Syntax

sin.approx{.ftz}.f32  d, a;

Description

Find the sine of the angle a (in radians).

Semantics

d = sin(a);

Notes

sin.approx.f32 implements a fast approximation to sine.

Input    Result
-Inf     NaN
-0.0     -0.0
+0.0     +0.0
+Inf     NaN
NaN      NaN

The maximum absolute error over the input range is as follows:

Range                Maximum absolute error
[-2pi .. +2pi]       2^-20.5
[-100pi .. +100pi]   2^-14.7

Outside of the range [-100pi .. +100pi], only best effort is provided; there are no defined error guarantees.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

sin.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

Subnormal inputs and results are flushed to sign-preserving zero.

PTX ISA Notes

sin.f32 introduced in PTX ISA version 1.0. Explicit modifiers .approx and .ftz introduced in PTX ISA version 1.4.

For PTX ISA version 1.4 and later, the .approx modifier is required.

For PTX ISA versions 1.0 through 1.3, sin.f32 defaults to sin.approx.ftz.f32.

Target ISA Notes

Supported on all target architectures.

Examples

sin.approx.ftz.f32  sa, a;

9.7.3.19. Floating Point Instructions: cos

cos

Find the cosine of a value.

Syntax

cos.approx{.ftz}.f32  d, a;

Description

Find the cosine of the angle a (in radians).

Semantics

d = cos(a);

Notes

cos.approx.f32 implements a fast approximation to cosine.

Input    Result
-Inf     NaN
-0.0     +1.0
+0.0     +1.0
+Inf     NaN
NaN      NaN

The maximum absolute error over the input range is as follows:

Range                Maximum absolute error
[-2pi .. +2pi]       2^-20.5
[-100pi .. +100pi]   2^-14.7

Outside of the range [-100pi .. +100pi], only best effort is provided; there are no defined error guarantees.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

cos.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

Subnormal inputs and results are flushed to sign-preserving zero.

PTX ISA Notes

cos.f32 introduced in PTX ISA version 1.0. Explicit modifiers .approx and .ftz introduced in PTX ISA version 1.4.

For PTX ISA version 1.4 and later, the .approx modifier is required.

For PTX ISA versions 1.0 through 1.3, cos.f32 defaults to cos.approx.ftz.f32.

Target ISA Notes

Supported on all target architectures.

Examples

cos.approx.ftz.f32  ca, a;

9.7.3.20. Floating Point Instructions: lg2

lg2

Find the base-2 logarithm of a value.

Syntax

lg2.approx{.ftz}.f32  d, a;

Description

Determine the log2 of a.

Semantics

d = log(a) / log(2);

Notes

lg2.approx.f32 implements a fast approximation to log2(a).

Input     Result
-Inf      NaN
-normal   NaN
-0.0      -Inf
+0.0      -Inf
+Inf      +Inf
NaN       NaN

The maximum absolute error is 2^-22 when the input operand is in the range (0.5, 2). For positive finite inputs outside of this interval, the maximum relative error is 2^-22.
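The corner cases in the table above can be captured as a small reference model. This is a hedged sketch: Python's math.log2 stands in for the finite positive path, and the hardware's approximation error is not modeled.

```python
import math

def lg2_model(a: float) -> float:
    """Reference semantics for lg2 corner cases (illustrative only)."""
    if math.isnan(a):
        return math.nan
    if a == 0.0:
        return -math.inf          # both +0.0 and -0.0 map to -Inf
    if a < 0:
        return math.nan           # -Inf and -normal inputs yield NaN
    if a == math.inf:
        return math.inf
    return math.log2(a)
```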

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

lg2.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

Subnormal inputs and results are flushed to sign-preserving zero.

PTX ISA Notes

lg2.f32 introduced in PTX ISA version 1.0. Explicit modifiers .approx and .ftz introduced in PTX ISA version 1.4.

For PTX ISA version 1.4 and later, the .approx modifier is required.

For PTX ISA versions 1.0 through 1.3, lg2.f32 defaults to lg2.approx.ftz.f32.

Target ISA Notes

Supported on all target architectures.

Examples

lg2.approx.ftz.f32  la, a;

9.7.3.21. Floating Point Instructions: ex2

ex2

Find the base-2 exponential of a value.

Syntax

ex2.approx{.ftz}.f32  d, a;

Description

Raise 2 to the power a.

Semantics

d = 2 ^ a;

Notes

ex2.approx.f32 implements a fast approximation to 2^a.

Input    Result
-Inf     +0.0
-0.0     +1.0
+0.0     +1.0
+Inf     +Inf
NaN      NaN

The maximum error is 2 ulp from the correctly rounded result across the full range of inputs.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

ex2.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.

sm_1x

Subnormal inputs and results are flushed to sign-preserving zero.

PTX ISA Notes

ex2.f32 introduced in PTX ISA version 1.0. Explicit modifiers .approx and .ftz introduced in PTX ISA version 1.4.

For PTX ISA version 1.4 and later, the .approx modifier is required.

For PTX ISA versions 1.0 through 1.3, ex2.f32 defaults to ex2.approx.ftz.f32.

Target ISA Notes

Supported on all target architectures.

Examples

ex2.approx.ftz.f32  xa, a;

9.7.3.22. Floating Point Instructions: tanh

tanh

Find the hyperbolic tangent of a value (in radians).

Syntax

tanh.approx.f32 d, a;

Description

Take the hyperbolic tangent of a.

The operands d and a are of type .f32.

Semantics

d = tanh(a);

Notes

tanh.approx.f32 implements a fast approximation to the FP32 hyperbolic tangent.

Results of tanh for various corner-case inputs are as follows:

Input    Result
-Inf     -1.0
-0.0     -0.0
+0.0     +0.0
+Inf     +1.0
NaN      NaN

The maximum relative error over the entire floating-point range is 2^-11. Subnormal numbers are supported.

Note

Subnormal inputs are passed through to the output, since the value of tanh(x) for small values of x is approximately the same as x.
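The pass-through behavior follows from the series tanh(x) = x - x^3/3 + ...: for tiny x the cubic correction is far below one ulp of x, so the result rounds to x itself. A quick double-precision check of the same principle (illustrative; this is Python's libm, not the PTX instruction):

```python
import math

# tanh(x) = x - x**3/3 + ... ; for x = 1e-9 the correction term is about
# 3.3e-28, far below one ulp of x (~2e-25), so tanh(x) rounds to x.
x = 1e-9
assert abs(math.tanh(x) - x) <= 1e-24
```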

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Target ISA Notes

Requires sm_75 or higher.

Examples

tanh.approx.f32 ta, a;

9.7.4. Half Precision Floating-Point Instructions

Half precision floating-point instructions operate on .f16 and .f16x2 register operands. The half precision floating-point instructions are:

  • add

  • sub

  • mul

  • fma

  • neg

  • abs

  • min

  • max

  • tanh

  • ex2

Half-precision add, sub, mul, and fma support saturation of results to the range [0.0, 1.0], with NaNs being flushed to positive zero. Half-precision instructions return an unspecified NaN.

9.7.4.1. Half Precision Floating Point Instructions: add

add

Add two values.

Syntax

add{.rnd}{.ftz}{.sat}.f16   d, a, b;
add{.rnd}{.ftz}{.sat}.f16x2 d, a, b;
add{.rnd}.bf16   d, a, b;
add{.rnd}.bf16x2 d, a, b;

.rnd = { .rn };

Description

Performs addition and writes the resulting value into a destination register.

For .f16x2 and .bf16x2 instruction types, forms input vectors by extracting half-word values from the source operands. Half-word operands are then added in parallel to produce the .f16x2 or .bf16x2 result in the destination.

For the .f16 instruction type, operands d, a and b have .f16 or .b16 type. For the .f16x2 instruction type, operands d, a and b have .b32 type. For the .bf16 instruction type, operands d, a, b have .b16 type. For the .bf16x2 instruction type, operands d, a, b have .b32 type.

Semantics

if (type == f16 || type == bf16) {
    d = a + b;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = fA[i] + fB[i];
    }
}
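The packed-lane arithmetic above can be emulated with Python's struct "e" (binary16) format code. This is a hedged illustration of the unpack-add-repack flow for a 32-bit f16x2 word, not NVIDIA code; Python computes each lane sum in double precision and rounds back to binary16 on repacking.

```python
import struct

def f16x2_add(a: int, b: int) -> int:
    """Illustrative model of add.f16x2: unpack two binary16 lanes from each
    32-bit word (lane 0 = bits [0:15]), add lane-wise, and repack."""
    a0, a1 = struct.unpack("<2e", struct.pack("<I", a))
    b0, b1 = struct.unpack("<2e", struct.pack("<I", b))
    return struct.unpack("<I", struct.pack("<2e", a0 + b0, a1 + b1))[0]

# lanes (1.0, 2.0) + lanes (0.5, 0.25) -> lanes (1.5, 2.25)
print(hex(f16x2_add(0x40003C00, 0x34003800)))  # 0x40803e00
```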

Notes

Rounding modifiers:

.rn

mantissa LSB rounds to nearest even

The default value of the rounding modifier is .rn. Note that an add instruction with an explicit rounding modifier is treated conservatively by the code optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers:

By default, subnormal numbers are supported. add.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.

Saturation modifier:

add.sat.{f16,f16x2} clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

Introduced in PTX ISA version 4.2.

add{.rnd}.bf16 and add{.rnd}.bf16x2 introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_53 or higher.

add{.rnd}.bf16 and add{.rnd}.bf16x2 require sm_90 or higher.

Examples

// scalar f16 additions
add.f16        d0, a0, b0;
add.rn.f16     d1, a1, b1;
add.bf16       bd0, ba0, bb0;
add.rn.bf16    bd1, ba1, bb1;

// SIMD f16 addition
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;
mov.b32  p1, {h0, h1};   // pack two f16 to 32bit f16x2
mov.b32  p2, {h2, h3};   // pack two f16 to 32bit f16x2
add.f16x2  p3, p1, p2;   // SIMD f16x2 addition

// SIMD bf16 addition
cvt.rn.bf16x2.f32 p4, f4, f5; // Convert two f32 into packed bf16x2
cvt.rn.bf16x2.f32 p5, f6, f7; // Convert two f32 into packed bf16x2
add.bf16x2  p6, p4, p5;       // SIMD bf16x2 addition

// SIMD fp16 addition
ld.global.b32   f0, [addr];     // load 32 bit which hold packed f16x2
ld.global.b32   f1, [addr + 4]; // load 32 bit which hold packed f16x2
add.f16x2       f2, f0, f1;     // SIMD f16x2 addition
ld.global.b32   f3, [addr + 8];  // load 32 bit which hold packed bf16x2
ld.global.b32   f4, [addr + 12]; // load 32 bit which hold packed bf16x2
add.bf16x2      f5, f3, f4;      // SIMD bf16x2 addition

9.7.4.2. Half Precision Floating Point Instructions: sub

sub

Subtract two values.

Syntax

sub{.rnd}{.ftz}{.sat}.f16   d, a, b;
sub{.rnd}{.ftz}{.sat}.f16x2 d, a, b;
sub{.rnd}.bf16   d, a, b;
sub{.rnd}.bf16x2 d, a, b;

.rnd = { .rn };

Description

Performs subtraction and writes the resulting value into a destination register.

For .f16x2 and .bf16x2 instruction types, forms input vectors by extracting half-word values from the source operands. Half-word operands are then subtracted in parallel to produce the .f16x2 or .bf16x2 result in the destination.

For the .f16 instruction type, operands d, a and b have .f16 or .b16 type. For the .f16x2 instruction type, operands d, a and b have .b32 type. For the .bf16 instruction type, operands d, a, b have .b16 type. For the .bf16x2 instruction type, operands d, a, b have .b32 type.

Semantics

if (type == f16 || type == bf16) {
    d = a - b;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = fA[i] - fB[i];
    }
}

Notes

Rounding modifiers:

.rn

mantissa LSB rounds to nearest even

The default value of the rounding modifier is .rn. Note that a sub instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A sub instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers:

By default, subnormal numbers are supported. sub.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.

Saturation modifier:

sub.sat.{f16,f16x2} clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

Introduced in PTX ISA version 4.2.

sub{.rnd}.bf16 and sub{.rnd}.bf16x2 introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_53 or higher.

sub{.rnd}.bf16 and sub{.rnd}.bf16x2 require sm_90 or higher.

Examples

// scalar f16 subtractions
sub.f16        d0, a0, b0;
sub.rn.f16     d1, a1, b1;
sub.bf16       bd0, ba0, bb0;
sub.rn.bf16    bd1, ba1, bb1;

// SIMD f16 subtraction
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;
mov.b32  p1, {h0, h1};   // pack two f16 to 32bit f16x2
mov.b32  p2, {h2, h3};   // pack two f16 to 32bit f16x2
sub.f16x2  p3, p1, p2;   // SIMD f16x2 subtraction

// SIMD bf16 subtraction
cvt.rn.bf16x2.f32 p4, f4, f5; // Convert two f32 into packed bf16x2
cvt.rn.bf16x2.f32 p5, f6, f7; // Convert two f32 into packed bf16x2
sub.bf16x2  p6, p4, p5;       // SIMD bf16x2 subtraction

// SIMD fp16 subtraction
ld.global.b32   f0, [addr];     // load 32 bit which hold packed f16x2
ld.global.b32   f1, [addr + 4]; // load 32 bit which hold packed f16x2
sub.f16x2       f2, f0, f1;     // SIMD f16x2 subtraction

// SIMD bf16 subtraction
ld.global.b32   f3, [addr + 8];  // load 32 bit which hold packed bf16x2
ld.global.b32   f4, [addr + 12]; // load 32 bit which hold packed bf16x2
sub.bf16x2      f5, f3, f4;      // SIMD bf16x2 subtraction

9.7.4.3. Half Precision Floating Point Instructions: mul

mul

Multiply two values.

Syntax

mul{.rnd}{.ftz}{.sat}.f16   d, a, b;
mul{.rnd}{.ftz}{.sat}.f16x2 d, a, b;
mul{.rnd}.bf16   d, a, b;
mul{.rnd}.bf16x2 d, a, b;

.rnd = { .rn };

Description

Performs multiplication and writes the resulting value into a destination register.

For .f16x2 and .bf16x2 instruction types, forms input vectors by extracting half-word values from the source operands. Half-word operands are then multiplied in parallel to produce the .f16x2 or .bf16x2 result in the destination.

For the .f16 instruction type, operands d, a and b have .f16 or .b16 type. For the .f16x2 instruction type, operands d, a and b have .b32 type. For the .bf16 instruction type, operands d, a, b have .b16 type. For the .bf16x2 instruction type, operands d, a, b have .b32 type.

Semantics

if (type == f16 || type == bf16) {
    d = a * b;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = fA[i] * fB[i];
    }
}

Notes

Rounding modifiers:

.rn

mantissa LSB rounds to nearest even

The default value of the rounding modifier is .rn. Note that a mul instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A mul instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add and mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers:

By default, subnormal numbers are supported. mul.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.

Saturation modifier:

mul.sat.{f16,f16x2} clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

Introduced in PTX ISA version 4.2.

mul{.rnd}.bf16 and mul{.rnd}.bf16x2 introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_53 or higher.

mul{.rnd}.bf16 and mul{.rnd}.bf16x2 require sm_90 or higher.

Examples

// scalar f16 multiplications
mul.f16        d0, a0, b0;
mul.rn.f16     d1, a1, b1;
mul.bf16       bd0, ba0, bb0;
mul.rn.bf16    bd1, ba1, bb1;

// SIMD f16 multiplication
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;
mov.b32  p1, {h0, h1};   // pack two f16 to 32bit f16x2
mov.b32  p2, {h2, h3};   // pack two f16 to 32bit f16x2
mul.f16x2  p3, p1, p2;   // SIMD f16x2 multiplication

// SIMD bf16 multiplication
cvt.rn.bf16x2.f32 p4, f4, f5; // Convert two f32 into packed bf16x2
cvt.rn.bf16x2.f32 p5, f6, f7; // Convert two f32 into packed bf16x2
mul.bf16x2  p6, p4, p5;       // SIMD bf16x2 multiplication

// SIMD fp16 multiplication
ld.global.b32   f0, [addr];     // load 32 bit which hold packed f16x2
ld.global.b32   f1, [addr + 4]; // load 32 bit which hold packed f16x2
mul.f16x2       f2, f0, f1;     // SIMD f16x2 multiplication

// SIMD bf16 multiplication
ld.global.b32   f3, [addr + 8];  // load 32 bit which hold packed bf16x2
ld.global.b32   f4, [addr + 12]; // load 32 bit which hold packed bf16x2
mul.bf16x2      f5, f3, f4;      // SIMD bf16x2 multiplication

9.7.4.4. Half Precision Floating Point Instructions: fma

fma

Fused multiply-add

Syntax

fma.rnd{.ftz}{.sat}.f16     d, a, b, c;
fma.rnd{.ftz}{.sat}.f16x2   d, a, b, c;
fma.rnd{.ftz}.relu.f16      d, a, b, c;
fma.rnd{.ftz}.relu.f16x2    d, a, b, c;
fma.rnd{.relu}.bf16         d, a, b, c;
fma.rnd{.relu}.bf16x2       d, a, b, c;
fma.rnd.oob{.relu}.type     d, a, b, c;

.rnd = { .rn };

Description

Performs a fused multiply-add with no loss of precision in the intermediate product and addition.

For .f16x2 and .bf16x2 instruction types, forms input vectors by extracting half-word values from the source operands. Half-word operands are then operated on in parallel to produce the .f16x2 or .bf16x2 result in the destination.

For the .f16 instruction type, operands d, a, b and c have .f16 or .b16 type. For the .f16x2 instruction type, operands d, a, b and c have .b32 type. For the .bf16 instruction type, operands d, a, b and c have .b16 type. For the .bf16x2 instruction type, operands d, a, b and c have .b32 type.

Semantics

if (type == f16 || type == bf16) {
    d = a * b + c;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    fC[0] = c[0:15];
    fC[1] = c[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = fA[i] * fB[i] + fC[i];
    }
}

Notes

Rounding modifiers (default is .rn):

.rn

mantissa LSB rounds to nearest even

Subnormal numbers:

By default, subnormal numbers are supported. fma.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.

Saturation modifier:

fma.sat.{f16,f16x2} clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

fma.relu.{f16,f16x2,bf16,bf16x2} clamps the result to 0 if negative. A NaN result is converted to canonical NaN.
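The .sat and .relu post-operations can be described as small scalar helpers. This is a hedged sketch of only the final clamping step described above, not the fused arithmetic or the packed-lane handling.

```python
import math

def sat(x: float) -> float:
    """.sat behavior: clamp to [0.0, 1.0]; a NaN result flushes to +0.0."""
    if math.isnan(x):
        return 0.0
    return min(max(x, 0.0), 1.0)

def relu(x: float) -> float:
    """.relu behavior: clamp negative results to 0; NaN stays canonical NaN."""
    if math.isnan(x):
        return math.nan
    return x if x > 0.0 else 0.0
```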

Out Of Bounds modifier:

fma.oob.{f16,f16x2,bf16,bf16x2} clamps the result to 0 if either of the operands is the OOB NaN (defined under Tensors) value. The test for the special NaN value and the resultant forcing of the result to +0.0 is performed independently for each of the two SIMD operations.

PTX ISA Notes

Introduced in PTX ISA version 4.2.

fma.relu.{f16,f16x2} and fma{.relu}.{bf16,bf16x2} introduced in PTX ISA version 7.0.

Support for the modifier .oob introduced in PTX ISA version 8.1.

Target ISA Notes

Requires sm_53 or higher.

fma.relu.{f16,f16x2} and fma{.relu}.{bf16,bf16x2} require sm_80 or higher.

fma{.oob}.{f16,f16x2,bf16,bf16x2} requires sm_90 or higher.

Examples

// scalar f16 fused multiply-add
fma.rn.f16          d0, a0, b0, c0;
fma.rn.f16          d1, a1, b1, c1;
fma.rn.relu.f16     d1, a1, b1, c1;
fma.rn.oob.f16      d1, a1, b1, c1;
fma.rn.oob.relu.f16 d1, a1, b1, c1;

// scalar bf16 fused multiply-add
fma.rn.bf16           d1, a1, b1, c1;
fma.rn.relu.bf16      d1, a1, b1, c1;
fma.rn.oob.bf16       d1, a1, b1, c1;
fma.rn.oob.relu.bf16  d1, a1, b1, c1;

// SIMD f16 fused multiply-add
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;
mov.b32  p1, {h0, h1}; // pack two f16 to 32bit f16x2
mov.b32  p2, {h2, h3}; // pack two f16 to 32bit f16x2
fma.rn.f16x2  p3, p1, p2, p2;         // SIMD f16x2 fused multiply-add
fma.rn.relu.f16x2  p3, p1, p2, p2;    // SIMD f16x2 fused multiply-add with relu saturation mode
fma.rn.oob.f16x2  p3, p1, p2, p2;     // SIMD f16x2 fused multiply-add with oob modifier
fma.rn.oob.relu.f16x2 p3, p1, p2, p2; // SIMD f16x2 fused multiply-add with oob modifier and relu saturation mode

// SIMD fp16 fused multiply-add
ld.global.b32   f0, [addr];     // load 32 bit which hold packed f16x2
ld.global.b32   f1, [addr + 4]; // load 32 bit which hold packed f16x2
fma.rn.f16x2    f2, f0, f1, f1; // SIMD f16x2 fused multiply-add

// SIMD bf16 fused multiply-add
fma.rn.bf16x2           f2, f0, f1, f1; // SIMD bf16x2 fused multiply-add
fma.rn.relu.bf16x2      f2, f0, f1, f1; // SIMD bf16x2 fused multiply-add with relu saturation mode
fma.rn.oob.bf16x2       f2, f0, f1, f1; // SIMD bf16x2 fused multiply-add with oob modifier
fma.rn.oob.relu.bf16x2  f2, f0, f1, f1; // SIMD bf16x2 fused multiply-add with oob modifier and relu saturation mode

9.7.4.5. Half Precision Floating Point Instructions: neg

neg

Arithmetic negate.

Syntax

neg{.ftz}.f16    d, a;
neg{.ftz}.f16x2  d, a;
neg.bf16         d, a;
neg.bf16x2       d, a;

Description

Negate the sign of a and store the result in d.

For .f16x2 and .bf16x2 instruction types, forms the input vector by extracting half-word values from the source operand. Half-word operands are then negated in parallel to produce the .f16x2 or .bf16x2 result in the destination.

For the .f16 instruction type, operands d and a have .f16 or .b16 type. For the .f16x2 instruction type, operands d and a have .b32 type. For the .bf16 instruction type, operands d and a have .b16 type. For the .bf16x2 instruction type, operands d and a have .b32 type.

Semantics

if (type == f16 || type == bf16) {
    d = -a;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = -fA[i];
    }
}

Notes

Subnormal numbers:

By default, subnormal numbers are supported. neg.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.

NaN inputs yield an unspecified NaN. Future implementations may comply with the IEEE 754 standard by preserving payload and modifying only the sign bit.

PTX ISA Notes

Introduced in PTX ISA version 6.0.

neg.bf16 and neg.bf16x2 introduced in PTX ISA version 7.0.

Target ISA Notes

Requires sm_53 or higher.

neg.bf16 and neg.bf16x2 require sm_80 or higher.

Examples

neg.ftz.f16  x,f0;
neg.bf16     x,b0;
neg.bf16x2   x1,b1;

9.7.4.6. Half Precision Floating Point Instructions: abs

abs

Absolute value.

Syntax

abs{.ftz}.f16    d, a;
abs{.ftz}.f16x2  d, a;
abs.bf16         d, a;
abs.bf16x2       d, a;

Description

Take the absolute value of a and store the result in d.

For .f16x2 and .bf16x2 instruction types, the input vector is formed by extracting half-word values from the source operand. Absolute values of the half-word operands are then computed in parallel to produce the .f16x2 or .bf16x2 result in the destination.

For .f16 instruction type, operands d and a have .f16 or .b16 type. For .f16x2 instruction type, operands d and a have .f16x2 or .b32 type. For .bf16 instruction type, operands d and a have .b16 type. For .bf16x2 instruction type, operands d and a have .b32 type.

Semantics

if (type == f16 || type == bf16) {
    d = |a|;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = |fA[i]|;
    }
}

Notes

Subnormal numbers:

By default, subnormal numbers are supported. abs.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.

NaN inputs yield an unspecified NaN. Future implementations may comply with the IEEE 754 standard by preserving payload and modifying only the sign bit.

PTX ISA Notes

Introduced in PTX ISA version 6.5.

abs.bf16 and abs.bf16x2 introduced in PTX ISA version 7.0.

Target ISA Notes

Requires sm_53 or higher.

abs.bf16 and abs.bf16x2 require sm_80 or higher.

Examples

abs.ftz.f16  x,f0;
abs.bf16     x,b0;
abs.bf16x2   x1,b1;

9.7.4.7. Half Precision Floating Point Instructions: min

min

Find the minimum of two values.

Syntax

min{.ftz}{.NaN}{.xorsign.abs}.f16      d, a, b;
min{.ftz}{.NaN}{.xorsign.abs}.f16x2    d, a, b;
min{.NaN}{.xorsign.abs}.bf16           d, a, b;
min{.NaN}{.xorsign.abs}.bf16x2         d, a, b;

Description

Store the minimum of a and b in d.

For .f16x2 and .bf16x2 instruction types, input vectors are formed with half-word values from the source operands. The half-word operands are then processed in parallel to store the .f16x2 or .bf16x2 result in the destination.

For .f16 instruction type, operands d, a and b have .f16 or .b16 type. For .f16x2 instruction type, operands d, a and b have .f16x2 or .b32 type. For .bf16 instruction type, operands d, a and b have .b16 type. For .bf16x2 instruction type, operands d, a and b have .b32 type.

If the .NaN modifier is specified, then the result is canonical NaN if either of the inputs is NaN.

If the .abs modifier is specified, the magnitude of destination operand d is the minimum of the absolute values of both the input arguments.

If the .xorsign modifier is specified, the sign bit of destination d is equal to the XOR of the sign bits of both the inputs.

Modifiers .abs and .xorsign must be specified together; .xorsign considers the sign bits of both inputs before the .abs operation is applied.

If the result of min is NaN then the .xorsign and .abs modifiers are ignored.

Semantics

if (type == f16 || type == bf16) {
    if (.xorsign) {
        xorsign = getSignBit(a) ^ getSignBit(b);
        if (.abs) {
            a = |a|;
            b = |b|;
        }
    }
    if (isNaN(a) && isNaN(b))              d = NaN;
    if (.NaN && (isNaN(a) || isNaN(b)))    d = NaN;
    else if (isNaN(a))                     d = b;
    else if (isNaN(b))                     d = a;
    else                                   d = (a < b) ? a : b;
    if (.xorsign && !isNaN(d)) {
        setSignBit(d, xorsign);
    }
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
        if (.xorsign) {
            xorsign = getSignBit(fA[i]) ^ getSignBit(fB[i]);
            if (.abs) {
                fA[i] = |fA[i]|;
                fB[i] = |fB[i]|;
            }
        }
        if (isNaN(fA[i]) && isNaN(fB[i]))              d[i] = NaN;
        if (.NaN && (isNaN(fA[i]) || isNaN(fB[i])))    d[i] = NaN;
        else if (isNaN(fA[i]))                         d[i] = fB[i];
        else if (isNaN(fB[i]))                         d[i] = fA[i];
        else                                           d[i] = (fA[i] < fB[i]) ? fA[i] : fB[i];
        if (.xorsign && !isNaN(d[i])) {
            setSignBit(d[i], xorsign);
        }
    }
}
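The scalar NaN-handling rules above can be mirrored in plain Python. This is a host-side sketch, not the hardware path; ptx_min_f16 is a hypothetical helper name, and nan_mode stands in for the .NaN modifier:

```python
import math

def ptx_min_f16(a: float, b: float, nan_mode: bool = False) -> float:
    """Host-side sketch of the scalar min semantics above."""
    if math.isnan(a) and math.isnan(b):
        return math.nan
    if nan_mode and (math.isnan(a) or math.isnan(b)):
        return math.nan   # .NaN: any NaN input gives canonical NaN
    if math.isnan(a):
        return b          # default: a NaN input is ignored
    if math.isnan(b):
        return a
    return a if a < b else b

print(ptx_min_f16(math.nan, 2.0))                  # 2.0
print(ptx_min_f16(math.nan, 2.0, nan_mode=True))   # nan
```

Note the contrast with Python's builtin min, which is order-dependent in the presence of NaN; the PTX default instead always prefers the numeric operand.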

Notes

Subnormal numbers:

By default, subnormal numbers are supported. min.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.

If values of both inputs are 0.0, then +0.0 > -0.0.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

min.xorsign.abs introduced in PTX ISA version 7.2.

Target ISA Notes

Requires sm_80 or higher.

min.xorsign.abs support requires sm_86 or higher.

Examples

min.ftz.f16       h0,h1,h2;
min.f16x2         b0,b1,b2;
// SIMD fp16 min with .NaN
min.NaN.f16x2     b0,b1,b2;
min.bf16          h0, h1, h2;
// SIMD bf16 min with NaN
min.NaN.bf16x2    b0, b1, b2;
// scalar bf16 min with xorsign.abs
min.xorsign.abs.bf16 Rd, Ra, Rb;

9.7.4.8. Half Precision Floating Point Instructions: max

max

Find the maximum of two values.

Syntax

max{.ftz}{.NaN}{.xorsign.abs}.f16      d, a, b;
max{.ftz}{.NaN}{.xorsign.abs}.f16x2    d, a, b;
max{.NaN}{.xorsign.abs}.bf16           d, a, b;
max{.NaN}{.xorsign.abs}.bf16x2         d, a, b;

Description

Store the maximum of a and b in d.

For .f16x2 and .bf16x2 instruction types, input vectors are formed with half-word values from the source operands. The half-word operands are then processed in parallel to store the .f16x2 or .bf16x2 result in the destination.

For .f16 instruction type, operands d, a and b have .f16 or .b16 type. For .f16x2 instruction type, operands d, a and b have .f16x2 or .b32 type. For .bf16 instruction type, operands d, a and b have .b16 type. For .bf16x2 instruction type, operands d, a and b have .b32 type.

If the .NaN modifier is specified, the result is canonical NaN if either of the inputs is NaN.

If the .abs modifier is specified, the magnitude of destination operand d is the maximum of the absolute values of both the input arguments.

If the .xorsign modifier is specified, the sign bit of destination d is equal to the XOR of the sign bits of both the inputs.

Modifiers .abs and .xorsign must be specified together; .xorsign considers the sign bits of both inputs before the .abs operation is applied.

If the result of max is NaN then the .xorsign and .abs modifiers are ignored.

Semantics

if (type == f16 || type == bf16) {
    if (.xorsign) {
        xorsign = getSignBit(a) ^ getSignBit(b);
        if (.abs) {
            a = |a|;
            b = |b|;
        }
    }
    if (isNaN(a) && isNaN(b))              d = NaN;
    if (.NaN && (isNaN(a) || isNaN(b)))    d = NaN;
    else if (isNaN(a))                     d = b;
    else if (isNaN(b))                     d = a;
    else                                   d = (a > b) ? a : b;
    if (.xorsign && !isNaN(d)) {
        setSignBit(d, xorsign);
    }
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
        if (.xorsign) {
            xorsign = getSignBit(fA[i]) ^ getSignBit(fB[i]);
            if (.abs) {
                fA[i] = |fA[i]|;
                fB[i] = |fB[i]|;
            }
        }
        if (isNaN(fA[i]) && isNaN(fB[i]))              d[i] = NaN;
        if (.NaN && (isNaN(fA[i]) || isNaN(fB[i])))    d[i] = NaN;
        else if (isNaN(fA[i]))                         d[i] = fB[i];
        else if (isNaN(fB[i]))                         d[i] = fA[i];
        else                                           d[i] = (fA[i] > fB[i]) ? fA[i] : fB[i];
        if (.xorsign && !isNaN(d[i])) {
            setSignBit(d[i], xorsign);
        }
    }
}
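The .xorsign.abs combination above can likewise be sketched on the host. max_xorsign_abs is a hypothetical helper name; sign bits are read with math.copysign so that -0.0 is handled like the hardware sign bit:

```python
import math

def max_xorsign_abs(a: float, b: float) -> float:
    """Sketch of max.xorsign.abs: the magnitude is max(|a|, |b|),
    the sign is the XOR of the two input sign bits."""
    sign_a = math.copysign(1.0, a) < 0.0   # True if sign bit set
    sign_b = math.copysign(1.0, b) < 0.0
    xorsign = sign_a != sign_b
    m = max(abs(a), abs(b))
    if math.isnan(m):
        return m                           # NaN result ignores .xorsign
    return -m if xorsign else m

print(max_xorsign_abs(-3.0, 2.0))    # -3.0: |.|-max is 3.0, signs differ
print(max_xorsign_abs(-3.0, -2.0))   # 3.0: signs match, XOR clears the sign
```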

Notes

Subnormal numbers:

By default, subnormal numbers are supported. max.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.

If values of both inputs are 0.0, then +0.0 > -0.0.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

max.xorsign.abs introduced in PTX ISA version 7.2.

Target ISA Notes

Requires sm_80 or higher.

max.xorsign.abs support requires sm_86 or higher.

Examples

max.ftz.f16       h0,h1,h2;
max.f16x2         b0,b1,b2;
// SIMD fp16 max with NaN
max.NaN.f16x2     b0,b1,b2;
// scalar f16 max with xorsign.abs
max.xorsign.abs.f16 Rd, Ra, Rb;
max.bf16          h0, h1, h2;
// SIMD bf16 max with NaN
max.NaN.bf16x2    b0, b1, b2;
// SIMD bf16 max with xorsign.abs
max.xorsign.abs.bf16x2 Rd, Ra, Rb;

9.7.4.9. Half Precision Floating Point Instructions: tanh

tanh

Find the hyperbolic tangent of a value (in radians).

Syntax

tanh.approx.type d, a;

.type = {.f16, .f16x2, .bf16, .bf16x2}

Description

Take the hyperbolic tangent of a.

The types of operands d and a are as specified by .type.

For .f16x2 or .bf16x2 instruction types, each half-word operand is operated on in parallel and the results are packed appropriately into a .f16x2 or .bf16x2.

Semantics

if (.type == .f16 || .type == .bf16) {
  d = tanh(a)
} else if (.type == .f16x2 || .type == .bf16x2) {
  fA[0] = a[0:15];
  fA[1] = a[16:31];
  d[0] = tanh(fA[0])
  d[1] = tanh(fA[1])
}

Notes

tanh.approx.{f16,f16x2,bf16,bf16x2} implements an approximate hyperbolic tangent in the target format.

Results oftanh for various corner-case inputs are as follows:

Input      Result
-Inf       -1.0
-0.0       -0.0
+0.0       +0.0
+Inf       1.0
NaN        NaN

The maximum absolute error for the .f16 type is 2^-10.987. The maximum absolute error for the .bf16 type is 2^-8.

Subnormal numbers are supported.
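The corner cases in the table match the IEEE behavior of tanh, which can be sanity-checked against Python's math.tanh (host double precision, not the approximate hardware path):

```python
import math

# corner cases from the tanh table above, as (input, expected) pairs;
# a list of tuples is used because -0.0 and +0.0 collide as dict keys
cases = [(float('-inf'), -1.0), (-0.0, -0.0), (0.0, 0.0), (float('inf'), 1.0)]
for x, expected in cases:
    y = math.tanh(x)
    # compare value and sign bit (so -0.0 vs +0.0 is checked too)
    assert y == expected
    assert math.copysign(1.0, y) == math.copysign(1.0, expected)

assert math.isnan(math.tanh(float('nan')))   # NaN in, NaN out
print("tanh corner cases match")
```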

PTX ISA Notes

Introduced in PTX ISA version 7.0.

tanh.approx.{bf16/bf16x2} introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_75 or higher.

tanh.approx.{bf16/bf16x2} requires sm_90 or higher.

Examples

tanh.approx.f16    h1, h0;
tanh.approx.f16x2  hd1, hd0;
tanh.approx.bf16   b1, b0;
tanh.approx.bf16x2 hb1, hb0;

9.7.4.10. Half Precision Floating Point Instructions: ex2

ex2

Find the base-2 exponential of the input.

Syntax

ex2.approx.atype     d, a;
ex2.approx.ftz.btype d, a;

.atype = { .f16,  .f16x2}
.btype = { .bf16, .bf16x2}

Description

Raise 2 to the power a.

The types of operands d and a are as specified by .type.

For .f16x2 or .bf16x2 instruction types, each half-word operand is operated on in parallel and the results are packed appropriately into a .f16x2 or .bf16x2.

Semantics

if (.type == .f16 || .type == .bf16) {
  d = 2 ^ a
} else if (.type == .f16x2 || .type == .bf16x2) {
  fA[0] = a[0:15];
  fA[1] = a[16:31];
  d[0] = 2 ^ fA[0]
  d[1] = 2 ^ fA[1]
}

Notes

ex2.approx.{f16,f16x2,bf16,bf16x2} implements a fast approximation to 2^a.

For the .f16 type, subnormal inputs are supported. ex2.approx.ftz.bf16 flushes subnormal inputs and results to sign-preserving zero.

Results of ex2.approx.ftz.bf16 for various corner-case inputs are as follows:

Input        Result
-Inf         +0.0
-subnormal   +1.0
-0.0         +1.0
+0.0         +1.0
+subnormal   +1.0
+Inf         +Inf
NaN          NaN

Results of ex2.approx.f16 for various corner-case inputs are as follows:

Input    Result
-Inf     +0.0
-0.0     +1.0
+0.0     +1.0
+Inf     +Inf
NaN      NaN

The maximum relative error for the .f16 type is 2^-9.9. The maximum relative error for the .bf16 type is 2^-7.
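The ex2 corner cases in the tables likewise follow the IEEE rules for 2^x, which can be checked with host doubles in Python (again, this is not the approximate hardware path):

```python
import math

# corner cases from the ex2 tables, checked with Python's ** on doubles
cases = [(float('-inf'), 0.0), (-0.0, 1.0), (0.0, 1.0)]
for x, expected in cases:
    assert 2.0 ** x == expected

assert 2.0 ** float('inf') == float('inf')   # +Inf -> +Inf
assert math.isnan(2.0 ** float('nan'))       # NaN  -> NaN
print("ex2 corner cases match")
```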

PTX ISA Notes

Introduced in PTX ISA version 7.0.

ex2.approx.ftz.{bf16/bf16x2} introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_75 or higher.

ex2.approx.ftz.{bf16/bf16x2} requires sm_90 or higher.

Examples

ex2.approx.f16         h1, h0;
ex2.approx.f16x2       hd1, hd0;
ex2.approx.ftz.bf16    b1, b2;
ex2.approx.ftz.bf16x2  hb1, hb2;

9.7.5. Mixed Precision Floating-Point Instructions

Mixed precision floating-point instructions operate on data with varied floating-point precision. Before executing the specified operation, operands with different precisions need to be converted so that all the instruction operands are represented with a consistent floating-point precision. The register variable to be used for holding a particular operand depends upon the combination of the instruction types. Refer to Fundamental Types and Alternate Floating-Point Data Formats for more details on the exact register operand to be used for a given data type.

The mixed precision floating point instructions are:

  • add

  • sub

  • fma

Mixed precision add, sub, and fma support saturation of results to the range [0.0, 1.0], with NaN being flushed to positive zero.

9.7.5.1. Mixed Precision Floating Point Instructions: add

add

Add 2 values.

Syntax

add{.rnd}{.sat}.f32.atype  d, a, c;

.atype = { .f16, .bf16};
.rnd   = { .rn, .rz, .rm, .rp };

Description

Converts input operand a from .atype into .f32 type. The converted value is then used for the addition. The resulting value is stored in the destination operand d.

Semantics

d = convert(a) + c;

Notes

Rounding modifiers:

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

The default value of the rounding modifier is .rn. Note that an add instruction with an explicit rounding modifier is treated conservatively by the code optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers:

By default, subnormal numbers are supported.

Saturation modifier:

add.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
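The convert-then-add-then-saturate behavior can be sketched on the host. This is a minimal illustration using Python doubles in place of .f32, with struct's 'e' format standing in for the .f16 operand; the helper names are hypothetical:

```python
import struct

def f16_bits_to_float(h: int) -> float:
    """Interpret a 16-bit pattern as IEEE binary16 (the .atype operand)."""
    return struct.unpack('<e', struct.pack('<H', h))[0]

def add_sat_f32_f16(a_bits: int, c: float) -> float:
    """Sketch of add.sat.f32.f16: widen a, add c, clamp to [0.0, 1.0]."""
    d = f16_bits_to_float(a_bits) + c
    if d != d:                      # NaN results are flushed to +0.0
        return 0.0
    return min(max(d, 0.0), 1.0)

print(add_sat_f32_f16(0x3C00, 0.25))   # 0x3C00 is f16 1.0; 1.25 clamps to 1.0
```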

PTX ISA Notes

add.f32.{f16/bf16} introduced in PTX ISA version 8.6.

Target ISA Notes

add.f32.{f16/bf16} requires sm_100 or higher.

Examples

.reg .f32 fc, fd;
.reg .b16 ba;
add.rz.sat.f32.bf16   fd, ba, fc;

9.7.5.2. Mixed Precision Floating Point Instructions: sub

sub

Subtract one value from another.

Syntax

sub{.rnd}{.sat}.f32.atype  d, a, c;

.atype = { .f16, .bf16};
.rnd   = { .rn, .rz, .rm, .rp };

Description

Converts input operand a from .atype into .f32 type. The converted value is then used for the subtraction. The resulting value is stored in the destination operand d.

Semantics

d = convert(a) - c;

Notes

Rounding modifiers:

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

The default value of the rounding modifier is .rn. Note that a sub instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A sub instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers:

By default, subnormal numbers are supported.

Saturation modifier:

sub.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

sub.f32.{f16/bf16} introduced in PTX ISA version 8.6.

Target ISA Notes

sub.f32.{f16/bf16} requires sm_100 or higher.

Examples

.reg .f32 fc, fd;
.reg .f16 ha;
sub.rz.sat.f32.f16   fd, ha, fc;

9.7.5.3. Mixed Precision Floating Point Instructions: fma

fma

Fused multiply-add.

Syntax

fma.rnd{.sat}.f32.abtype  d, a, b, c;

.abtype = { .f16, .bf16};
.rnd    = { .rn, .rz, .rm, .rp };

Description

Converts input operands a and b from .abtype into .f32 type. The converted values are then used to perform the fused multiply-add operation with no loss of precision in the intermediate product and addition. The resulting value is stored in the destination operand d.

Semantics

d = convert(a) * convert(b) + c;

Notes

fma.f32.{f16/bf16} computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to single precision using the rounding mode specified by .rnd.
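Why the single rounding matters can be shown with exact rational arithmetic. A sketch using Python's fractions module (host doubles in place of .f32): the exact product-plus-addend is formed with Fractions, and float() rounds it once, round-to-nearest-even, like a fused operation would; a separate mul then add rounds twice and can lose the entire result:

```python
from fractions import Fraction

# values chosen so the product's low bits fall below the rounding cutoff
a = b = 1.0 + 2.0 ** -27
c = -(1.0 + 2.0 ** -26)

separate = a * b + c                       # product rounds first, then add
exact = Fraction(a) * Fraction(b) + Fraction(c)
fused = float(exact)                       # one rounding, like a fused mul-add

print(separate, fused)   # the separately rounded result collapses to 0.0
```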

Rounding modifiers (no default):

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

Subnormal numbers:

By default, subnormal numbers are supported.

Saturation modifier:

fma.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

fma.f32.{f16/bf16} introduced in PTX ISA version 8.6.

Target ISA Notes

fma.f32.{f16/bf16} requires sm_100 or higher.

Examples

.reg .f32 fc, fd;
.reg .f16 ha, hb;
fma.rz.sat.f32.f16   fd, ha, hb, fc;

9.7.6. Comparison and Selection Instructions

The comparison and selection instructions are:

  • set

  • setp

  • selp

  • slct

As with single-precision floating-point instructions, the set, setp, and slct instructions support subnormal numbers for sm_20 and higher targets and flush single-precision subnormal inputs to sign-preserving zero for sm_1x targets. The optional .ftz modifier provides backward compatibility with sm_1x targets by flushing subnormal inputs and results to sign-preserving zero regardless of the target architecture.

9.7.6.1. Comparison and Selection Instructions: set

set

Compare two numeric values with a relational operator, and optionally combine this result with apredicate value by applying a Boolean operator.

Syntax

set.CmpOp{.ftz}.dtype.stype         d, a, b;
set.CmpOp.BoolOp{.ftz}.dtype.stype  d, a, b, {!}c;

.CmpOp  = { eq, ne, lt, le, gt, ge, lo, ls, hi, hs,
            equ, neu, ltu, leu, gtu, geu, num, nan };
.BoolOp = { and, or, xor };
.dtype  = { .u32, .s32, .f32 };
.stype  = { .b16, .b32, .b64,
            .u16, .u32, .u64,
            .s16, .s32, .s64,
                  .f32, .f64 };

Description

Compares two numeric values and optionally combines the result with another predicate value by applying a Boolean operator. If this result is True, 1.0f is written for floating-point destination types, and 0xffffffff is written for integer destination types. Otherwise, 0x00000000 is written.

Operand d has type .dtype; operands a and b have type .stype; operand c has type .pred.

Semantics

t = (a CmpOp b) ? 1 : 0;
if (isFloat(dtype))
    d = BoolOp(t, c) ? 1.0f : 0x00000000;
else
    d = BoolOp(t, c) ? 0xffffffff : 0x00000000;
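The compare-then-combine scheme above can be mirrored in Python for reference. This is a host-side sketch with hypothetical names; the comparison and Boolean operators are passed as callables:

```python
import operator

def ptx_set(cmp_op, bool_op, a, b, c, float_dest=False):
    """Host-side sketch of set.CmpOp.BoolOp semantics above."""
    t = 1 if cmp_op(a, b) else 0
    r = bool_op(bool(t), c)
    if float_dest:
        return 1.0 if r else 0.0        # float destination writes 1.0f
    return 0xFFFFFFFF if r else 0x0     # integer destination writes all-ones

and_op = lambda t, c: t and c
print(hex(ptx_set(operator.lt, and_op, 1, 2, True)))   # 0xffffffff
print(ptx_set(operator.ge, and_op, 1, 2, True))        # 0
```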

Integer Notes

The signed and unsigned comparison operators are eq, ne, lt, le, gt, ge.

For unsigned values, the comparison operators lo, ls, hi, and hs for lower, lower-or-same, higher, and higher-or-same may be used instead of lt, le, gt, ge, respectively.

The untyped, bit-size comparisons are eq and ne.

Floating Point Notes

The ordered comparisons are eq, ne, lt, le, gt, ge. If either operand is NaN, the result is False.

To aid comparison operations in the presence of NaN values, unordered versions are included: equ, neu, ltu, leu, gtu, geu. If both operands are numeric values (not NaN), then these comparisons have the same result as their ordered counterparts. If either operand is NaN, then the result of these comparisons is True.

num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

set.ftz.dtype.f32 flushes subnormal inputs to sign-preserving zero.

sm_1x

set.dtype.f64 supports subnormal numbers.

set.dtype.f32 flushes subnormal inputs to sign-preserving zero.

Modifier .ftz applies only to .f32 comparisons.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

set with .f64 source type requires sm_13 or higher.

Examples

@p  set.lt.and.f32.s32  d,a,b,r;
    set.eq.u32.u32      d,i,n;

9.7.6.2. Comparison and Selection Instructions: setp

setp

Compare two numeric values with a relational operator, and (optionally) combine this result with apredicate value by applying a Boolean operator.

Syntax

setp.CmpOp{.ftz}.type         p[|q], a, b;
setp.CmpOp.BoolOp{.ftz}.type  p[|q], a, b, {!}c;

.CmpOp  = { eq, ne, lt, le, gt, ge, lo, ls, hi, hs,
            equ, neu, ltu, leu, gtu, geu, num, nan };
.BoolOp = { and, or, xor };
.type   = { .b16, .b32, .b64,
            .u16, .u32, .u64,
            .s16, .s32, .s64,
                  .f32, .f64 };

Description

Compares two values and combines the result with another predicate value by applying a Booleanoperator. This result is written to the first destination operand. A related value computed usingthe complement of the compare result is written to the second destination operand.

Applies to all numeric types. Operands a and b have type .type; operands p, q, and c have type .pred. The sink symbol ‘_’ may be used in place of any one of the destination operands.

Semantics

t = (a CmpOp b) ? 1 : 0;
p = BoolOp(t, c);
q = BoolOp(!t, c);
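The paired destination predicates can be sketched directly from the three statements above (ptx_setp is a hypothetical helper; q combines the complemented compare result with c):

```python
def ptx_setp(a, b, c, cmp_op, bool_op):
    """Sketch of setp's paired predicates: p uses the compare result,
    q uses its complement, both combined with c via BoolOp."""
    t = cmp_op(a, b)
    p = bool_op(t, c)
    q = bool_op(not t, c)
    return p, q

and_op = lambda x, y: x and y
print(ptx_setp(1, 2, True, lambda a, b: a < b, and_op))   # (True, False)
```

Note that p and q are complementary only when c is True under .and; with other BoolOp/predicate combinations both may be False.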

Integer Notes

The signed and unsigned comparison operators are eq, ne, lt, le, gt, ge.

For unsigned values, the comparison operators lo, ls, hi, and hs for lower, lower-or-same, higher, and higher-or-same may be used instead of lt, le, gt, ge, respectively.

The untyped, bit-size comparisons are eq and ne.

Floating Point Notes

The ordered comparisons are eq, ne, lt, le, gt, ge. If either operand is NaN, the result is False.

To aid comparison operations in the presence of NaN values, unordered versions are included: equ, neu, ltu, leu, gtu, geu. If both operands are numeric values (not NaN), then these comparisons have the same result as their ordered counterparts. If either operand is NaN, then the result of these comparisons is True.

num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

setp.ftz.dtype.f32 flushes subnormal inputs to sign-preserving zero.

sm_1x

setp.dtype.f64 supports subnormal numbers.

setp.dtype.f32 flushes subnormal inputs to sign-preserving zero.

Modifier .ftz applies only to .f32 comparisons.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

setp with .f64 source type requires sm_13 or higher.

Examples

    setp.lt.and.s32  p|q,a,b,r;
@q  setp.eq.u32      p,i,n;

9.7.6.3. Comparison and Selection Instructions: selp

selp

Select between source operands, based on the value of the predicate source operand.

Syntax

selp.type d, a, b, c;

.type = { .b16, .b32, .b64,
          .u16, .u32, .u64,
          .s16, .s32, .s64,
                .f32, .f64 };

Description

Conditional selection. If c is True, a is stored in d, otherwise b is stored in d. Operands d, a, and b must be of the same type. Operand c is a predicate.

Semantics

d = (c == 1) ? a : b;

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

selp.f64 requires sm_13 or higher.

Examples

    selp.s32  r0,r,g,p;
@q  selp.f32  f0,t,x,xp;

9.7.6.4. Comparison and Selection Instructions: slct

slct

Select one source operand, based on the sign of the third operand.

Syntax

slct.dtype.s32        d, a, b, c;
slct{.ftz}.dtype.f32  d, a, b, c;

.dtype = { .b16, .b32, .b64,
           .u16, .u32, .u64,
           .s16, .s32, .s64,
                 .f32, .f64 };

Description

Conditional selection. If c >= 0, a is stored in d, otherwise b is stored in d. Operands d, a, and b are treated as a bitsize type of the same width as the first instruction type; operand c must match the second instruction type (.s32 or .f32). The selected input is copied to the output without modification.

Semantics

d = (c >= 0) ? a : b;

Floating Point Notes

For.f32 comparisons, negative zero equals zero.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

slct.ftz.dtype.f32 flushes subnormal values of operand c to sign-preserving zero, and operand a is selected.

sm_1x

slct.dtype.f32 flushes subnormal values of operand c to sign-preserving zero, and operand a is selected.

Modifier .ftz applies only to .f32 comparisons.

If operand c is NaN, the comparison is unordered and operand b is selected.
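The selection rule, including the NaN and negative-zero cases, falls out of an ordinary `c >= 0` test, as this host-side sketch shows (slct here is a hypothetical Python helper, not the instruction itself):

```python
def slct(a, b, c):
    """Sketch of slct: a if c >= 0, else b. An unordered (NaN) compare
    fails, so b is selected; -0.0 compares equal to zero, so a is."""
    return a if c >= 0 else b

print(slct('a', 'b', 0.0))             # 'a'
print(slct('a', 'b', -0.0))            # 'a': negative zero equals zero
print(slct('a', 'b', float('nan')))    # 'b': NaN >= 0 is False
```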

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

slct.f64 requires sm_13 or higher.

Examples

slct.u32.s32      x, y, z, val;
slct.ftz.u64.f32  A, B, C, fval;

9.7.7. Half Precision Comparison Instructions

The comparison instructions are:

  • set

  • setp

9.7.7.1. Half Precision Comparison Instructions: set

set

Compare two numeric values with a relational operator, and optionally combine this result with apredicate value by applying a Boolean operator.

Syntax

set.CmpOp{.ftz}.f16.stype            d, a, b;
set.CmpOp.BoolOp{.ftz}.f16.stype     d, a, b, {!}c;

set.CmpOp.bf16.stype                 d, a, b;
set.CmpOp.BoolOp.bf16.stype          d, a, b, {!}c;

set.CmpOp{.ftz}.dtype.f16            d, a, b;
set.CmpOp.BoolOp{.ftz}.dtype.f16     d, a, b, {!}c;
.dtype  = { .u16, .s16, .u32, .s32}

set.CmpOp.dtype.bf16                 d, a, b;
set.CmpOp.BoolOp.dtype.bf16          d, a, b, {!}c;
.dtype  = { .u16, .s16, .u32, .s32}

set.CmpOp{.ftz}.dtype.f16x2          d, a, b;
set.CmpOp.BoolOp{.ftz}.dtype.f16x2   d, a, b, {!}c;
.dtype  = { .f16x2, .u32, .s32}

set.CmpOp.dtype.bf16x2               d, a, b;
set.CmpOp.BoolOp.dtype.bf16x2        d, a, b, {!}c;
.dtype  = { .bf16x2, .u32, .s32}

.CmpOp  = { eq, ne, lt, le, gt, ge,
            equ, neu, ltu, leu, gtu, geu, num, nan };
.BoolOp = { and, or, xor };
.stype  = { .b16, .b32, .b64,
            .u16, .u32, .u64,
            .s16, .s32, .s64,
            .f16, .f32, .f64};

Description

Compares two numeric values and optionally combines the result with another predicate value byapplying a Boolean operator.

Result of this computation is written in destination register in the following way:

  • If the result is True,

    • 0xffffffff is written for destination types .u32/.s32.

    • 0xffff is written for destination types .u16/.s16.

    • 1.0 in the target precision floating point format is written for destination types .f16, .bf16.

  • If the result is False,

    • 0x0 is written for all integer destination types.

    • 0.0 in the target precision floating point format is written for destination types .f16, .bf16.

If the source type is .f16x2 or .bf16x2 then the results of the individual operations are packed in the 32-bit destination operand.

Operand c has type .pred.

Semantics

if (stype == .f16x2 || stype == .bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    t[0] = (fA[0] CmpOp fB[0]) ? 1 : 0;
    t[1] = (fA[1] CmpOp fB[1]) ? 1 : 0;
    if (dtype == .f16x2 || dtype == .bf16x2) {
        for (i = 0; i < 2; i++) {
            d[i] = BoolOp(t[i], c) ? 1.0 : 0.0;
        }
    } else {
        for (i = 0; i < 2; i++) {
            d[i] = BoolOp(t[i], c) ? 0xffff : 0;
        }
    }
} else if (dtype == .f16 || dtype == .bf16) {
    t = (a CmpOp b) ? 1 : 0;
    d = BoolOp(t, c) ? 1.0 : 0.0;
} else { // Integer destination type
    trueVal = (isU16(dtype) || isS16(dtype)) ? 0xffff : 0xffffffff;
    t = (a CmpOp b) ? 1 : 0;
    d = BoolOp(t, c) ? trueVal : 0;
}

Floating Point Notes

The ordered comparisons are eq, ne, lt, le, gt, ge. If either operand is NaN, the result is False.

To aid comparison operations in the presence of NaN values, unordered versions are included: equ, neu, ltu, leu, gtu, geu. If both operands are numeric values (not NaN), then these comparisons have the same result as their ordered counterparts. If either operand is NaN, then the result of these comparisons is True.

num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN.

Subnormal numbers:

By default, subnormal numbers are supported.

When the .ftz modifier is specified, subnormal inputs and results are flushed to sign-preserving zero.

PTX ISA Notes

Introduced in PTX ISA version 4.2.

set.{u16,u32,s16,s32}.f16 and set.{u32,s32}.f16x2 are introduced in PTX ISA version 6.5.

set.{u16,u32,s16,s32}.bf16, set.{u32,s32,bf16x2}.bf16x2, set.bf16.{s16,u16,f16,b16,s32,u32,f32,b32,s64,u64,f64,b64} are introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_53 or higher.

set.{u16,u32,s16,s32}.bf16, set.{u32,s32,bf16x2}.bf16x2, set.bf16.{s16,u16,f16,b16,s32,u32,f32,b32,s64,u64,f64,b64} require sm_90 or higher.

Examples

set.lt.and.f16.f16     d,a,b,r;
set.eq.f16x2.f16x2     d,i,n;
set.eq.u32.f16x2       d,i,n;
set.lt.and.u16.f16     d,a,b,r;
set.ltu.or.bf16.f16    d,u,v,s;
set.equ.bf16x2.bf16x2  d,j,m;
set.geu.s32.bf16x2     d,j,m;
set.num.xor.s32.bf16   d,u,v,s;

9.7.7.2. Half Precision Comparison Instructions: setp

setp

Compare two numeric values with a relational operator, and optionally combine this result with apredicate value by applying a Boolean operator.

Syntax

setp.CmpOp{.ftz}.f16           p, a, b;
setp.CmpOp.BoolOp{.ftz}.f16    p, a, b, {!}c;

setp.CmpOp{.ftz}.f16x2         p|q, a, b;
setp.CmpOp.BoolOp{.ftz}.f16x2  p|q, a, b, {!}c;

setp.CmpOp.bf16                p, a, b;
setp.CmpOp.BoolOp.bf16         p, a, b, {!}c;

setp.CmpOp.bf16x2              p|q, a, b;
setp.CmpOp.BoolOp.bf16x2       p|q, a, b, {!}c;

.CmpOp  = { eq, ne, lt, le, gt, ge,
            equ, neu, ltu, leu, gtu, geu, num, nan };
.BoolOp = { and, or, xor };

Description

Compares two values and combines the result with another predicate value by applying a Booleanoperator. This result is written to the destination operand.

Operands c, p, and q have type .pred.

For instruction type .f16, operands a and b have type .b16 or .f16.

For instruction type .f16x2, operands a and b have type .b32.

For instruction type .bf16, operands a and b have type .b16.

For instruction type .bf16x2, operands a and b have type .b32.

Semantics

if (type == .f16 || type == .bf16) {
    t = (a CmpOp b) ? 1 : 0;
    p = BoolOp(t, c);
} else if (type == .f16x2 || type == .bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    t[0] = (fA[0] CmpOp fB[0]) ? 1 : 0;
    t[1] = (fA[1] CmpOp fB[1]) ? 1 : 0;
    p = BoolOp(t[0], c);
    q = BoolOp(t[1], c);
}

Floating Point Notes

The ordered comparisons are eq, ne, lt, le, gt, ge. If either operand is NaN, the result is False.

To aid comparison operations in the presence of NaN values, unordered versions are included: equ, neu, ltu, leu, gtu, geu. If both operands are numeric values (not NaN), then these comparisons have the same result as their ordered counterparts. If either operand is NaN, then the result of these comparisons is True.

num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN.

Subnormal numbers:

By default, subnormal numbers are supported.

setp.ftz.{f16,f16x2} flushes subnormal inputs to sign-preserving zero.

PTX ISA Notes

Introduced in PTX ISA version 4.2.

setp.{bf16/bf16x2} introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_53 or higher.

setp.{bf16/bf16x2} requires sm_90 or higher.

Examples

    setp.lt.and.f16x2  p|q,a,b,r;
@q  setp.eq.f16        p,i,n;
    setp.gt.or.bf16x2  u|v,c,d,s;
@q  setp.eq.bf16       u,j,m;

9.7.8. Logic and Shift Instructions

The logic and shift instructions are fundamentally untyped, performing bit-wise operations on operands of any type, provided the operands are of the same size. This permits bit-wise operations on floating point values without having to define a union to access the bits. Instructions and, or, xor, and not also operate on predicates.

The logic and shift instructions are:

  • and

  • or

  • xor

  • not

  • cnot

  • lop3

  • shf

  • shl

  • shr

9.7.8.1. Logic and Shift Instructions: and

and

Bitwise AND.

Syntax

and.type d, a, b;

.type = { .pred, .b16, .b32, .b64 };

Description

Compute the bit-wise and operation for the bits in a and b.

Semantics

d = a & b;

Notes

The size of the operands must match, but not necessarily the type.

Allowed types include predicate registers.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

and.b32  x,q,r;
and.b32  sign,fpvalue,0x80000000;

9.7.8.2. Logic and Shift Instructions: or

or

Bitwise OR.

Syntax

or.type d, a, b;

.type = { .pred, .b16, .b32, .b64 };

Description

Compute the bit-wise or operation for the bits in a and b.

Semantics

d = a | b;

Notes

The size of the operands must match, but not necessarily the type.

Allowed types include predicate registers.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

or.b32   mask, mask, 0x00010001;
or.pred  p, q, r;

9.7.8.3. Logic and Shift Instructions: xor

xor

Bitwise exclusive-OR (inequality).

Syntax

xor.type d, a, b;

.type = { .pred, .b16, .b32, .b64 };

Description

Compute the bit-wise exclusive-or operation for the bits in a and b.

Semantics

d = a ^ b;

Notes

The size of the operands must match, but not necessarily the type.

Allowed types include predicate registers.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

xor.b32  d,q,r;
xor.b16  d,x,0x0001;

9.7.8.4. Logic and Shift Instructions: not

not

Bitwise negation; one’s complement.

Syntax

not.type d, a;

.type = { .pred, .b16, .b32, .b64 };

Description

Invert the bits in a.

Semantics

d = ~a;

Notes

The size of the operands must match, but not necessarily the type.

Allowed types include predicates.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

not.b32  mask,mask;
not.pred  p,q;

9.7.8.5. Logic and Shift Instructions: cnot

cnot

C/C++ style logical negation.

Syntax

cnot.type d, a;

.type = { .b16, .b32, .b64 };

Description

Compute the logical negation using C/C++ semantics.

Semantics

d = (a==0) ? 1 : 0;

Notes

The size of the operands must match, but not necessarily the type.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

cnot.b32 d,a;

9.7.8.6. Logic and Shift Instructions: lop3

lop3

Arbitrary logical operation on 3 inputs.

Syntax

lop3.b32 d, a, b, c, immLut;
lop3.BoolOp.b32 d|p, a, b, c, immLut, q;

.BoolOp   = { .or , .and };

Description

Compute a bitwise logical operation on inputs a, b, c and store the result in destination d.

Optionally, .BoolOp can be specified to compute the predicate result p by performing a Boolean operation on the destination operand d with the predicate q in the following manner:

p = (d != 0) BoolOp q;

The sink symbol ‘_’ may be used in place of the destination operand d when the .BoolOp qualifier is specified.

The logical operation is defined by a look-up table which, for 3 inputs, can be represented as an 8-bit value specified by operand immLut as described below. immLut is an integer constant that can take values from 0 to 255, thereby allowing up to 256 distinct logical operations on inputs a, b, c.

For a logical operation F(a,b,c) the value of immLut can be computed by applying the same operation to three predefined constant values as follows:

ta = 0xF0;
tb = 0xCC;
tc = 0xAA;

immLut = F(ta, tb, tc);

Examples:

If F = (a & b & c);
immLut = 0xF0 & 0xCC & 0xAA = 0x80

If F = (a | b | c);
immLut = 0xF0 | 0xCC | 0xAA = 0xFE

If F = (a & b & ~c);
immLut = 0xF0 & 0xCC & (~0xAA) = 0x40

If F = ((a & b | c) ^ a);
immLut = (0xF0 & 0xCC | 0xAA) ^ 0xF0 = 0x1A
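The computation above is easy to reproduce in Python (a sketch, not PTX; imm_lut is a hypothetical helper name, not part of any API):

```python
# Python sketch (not PTX): reproducing the immLut computation above.
TA, TB, TC = 0xF0, 0xCC, 0xAA   # the predefined constants ta, tb, tc

def imm_lut(f):
    # Apply the 3-input Boolean function to the constants; the low
    # 8 bits of the result are the lop3 look-up-table immediate.
    return f(TA, TB, TC) & 0xFF

print(hex(imm_lut(lambda a, b, c: a & b & c)))           # 0x80
print(hex(imm_lut(lambda a, b, c: a | b | c)))           # 0xfe
print(hex(imm_lut(lambda a, b, c: a & b & ~c)))          # 0x40
print(hex(imm_lut(lambda a, b, c: ((a & b) | c) ^ a)))   # 0x1a
```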

The following table illustrates computation of immLut for various logical operations:

ta   tb   tc   | Oper 0   | Oper 1          | Oper 2           | Oper 254        | Oper 255
               | (False)  | (ta & tb & tc)  | (ta & tb & ~tc)  | (ta | tb | tc)  | (True)
0    0    0    | 0        | 0               | 0                | 0               | 1
0    0    1    | 0        | 0               | 0                | 1               | 1
0    1    0    | 0        | 0               | 0                | 1               | 1
0    1    1    | 0        | 0               | 0                | 1               | 1
1    0    0    | 0        | 0               | 0                | 1               | 1
1    0    1    | 0        | 0               | 0                | 1               | 1
1    1    0    | 0        | 0               | 1                | 1               | 1
1    1    1    | 0        | 1               | 0                | 1               | 1
immLut         | 0x0      | 0x80            | 0x40             | 0xFE            | 0xFF

Semantics

F = GetFunctionFromTable(immLut); // returns the function corresponding to immLut value
d = F(a, b, c);

if (BoolOp specified) {
    p = (d != 0) BoolOp q;
}
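The look-up can also be modeled directly by indexing immLut with the bit triple (a,b,c) at each bit position — a Python sketch (the function name lop3 mirrors the instruction; this is not an actual API):

```python
# Python sketch (not PTX): evaluate a 3-input LUT operation bit by bit.
# The LUT bit at index (abit<<2)|(bbit<<1)|cbit gives the result bit,
# consistent with the constants ta=0xF0, tb=0xCC, tc=0xAA.
def lop3(a, b, c, imm_lut):
    d = 0
    for i in range(32):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        d |= ((imm_lut >> idx) & 1) << i
    return d

# immLut 0x80 is F = a & b & c
print(hex(lop3(0xF0F0, 0xFF00, 0xCCCC, 0x80)))  # 0xc000
```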

PTX ISA Notes

Introduced in PTX ISA version 4.3.

Support for the .BoolOp qualifier introduced in PTX ISA version 8.2.

Target ISA Notes

Requires sm_50 or higher.

Qualifier .BoolOp requires sm_70 or higher.

Examples

lop3.b32       d, a, b, c, 0x40;
lop3.or.b32  d|p, a, b, c, 0x3f, q;
lop3.and.b32 _|p, a, b, c, 0x3f, q;

9.7.8.7. Logic and Shift Instructions: shf

shf

Funnel shift.

Syntax

shf.l.mode.b32  d, a, b, c;  // left shift
shf.r.mode.b32  d, a, b, c;  // right shift

.mode = { .clamp, .wrap };

Description

Shift the 64-bit value formed by concatenating operands a and b left or right by the amount specified by the unsigned 32-bit value in c. Operand b holds bits 63:32 and operand a holds bits 31:0 of the 64-bit source value. The source is shifted left or right by the clamped or wrapped value in c. For shf.l, the most-significant 32-bits of the result are written into d; for shf.r, the least-significant 32-bits of the result are written into d.

Semantics

u32  n = (.mode == .clamp) ? min(c, 32) : c & 0x1f;
switch (shf.dir) {  // shift concatenation of [b, a]
    case shf.l:     // extract 32 msbs
           u32  d = (b << n)      | (a >> (32-n));
    case shf.r:     // extract 32 lsbs
           u32  d = (b << (32-n)) | (a >> n);
}

Notes

Use funnel shift for multi-word shift operations and for rotate operations. The shift amount is limited to the range 0..32 in clamp mode and 0..31 in wrap mode, so shifting multi-word values by distances greater than 32 requires first moving 32-bit words, then using shf to shift the remaining 0..31 distance.

To shift data sizes greater than 64 bits to the right, use repeated shf.r instructions applied to adjacent words, operating from least-significant word towards most-significant word. At each step, a single word of the shifted result is computed. The most-significant word of the result is computed using a shr.{u32,s32} instruction, which zero or sign fills based on the instruction type.

To shift data sizes greater than 64 bits to the left, use repeated shf.l instructions applied to adjacent words, operating from most-significant word towards least-significant word. At each step, a single word of the shifted result is computed. The least-significant word of the result is computed using a shl instruction.

Use funnel shift to perform a 32-bit left or right rotate by supplying the same value for source arguments a and b.
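The .clamp semantics can be modeled in Python (a sketch, not hardware-exact PTX; the helper names shf_l_clamp, shf_r_clamp, and rotr32 are ours):

```python
# Python sketch (not PTX) of shf in .clamp mode, operating on the
# 64-bit concatenation {b, a}.
MASK32 = 0xFFFFFFFF

def shf_l_clamp(a, b, n):
    # 32 msbs of ({b,a} << n), with n clamped to 0..32
    n = min(n, 32)
    src = ((b & MASK32) << 32) | (a & MASK32)
    return ((src << n) >> 32) & MASK32

def shf_r_clamp(a, b, n):
    # 32 lsbs of ({b,a} >> n), with n clamped to 0..32
    n = min(n, 32)
    src = ((b & MASK32) << 32) | (a & MASK32)
    return (src >> n) & MASK32

def rotr32(x, n):
    # 32-bit rotate right: pass the same value for both source words
    return shf_r_clamp(x, x, n & 31)

print(hex(rotr32(0x12345678, 8)))  # 0x78123456
```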

PTX ISA Notes

Introduced in PTX ISA version 3.1.

Target ISA Notes

Requires sm_32 or higher.

Example

shf.l.clamp.b32  r3,r1,r0,16;

// 128-bit left shift; n < 32
// [r7,r6,r5,r4] = [r3,r2,r1,r0] << n
shf.l.clamp.b32  r7,r2,r3,n;
shf.l.clamp.b32  r6,r1,r2,n;
shf.l.clamp.b32  r5,r0,r1,n;
shl.b32          r4,r0,n;

// 128-bit right shift, arithmetic; n < 32
// [r7,r6,r5,r4] = [r3,r2,r1,r0] >> n
shf.r.clamp.b32  r4,r0,r1,n;
shf.r.clamp.b32  r5,r1,r2,n;
shf.r.clamp.b32  r6,r2,r3,n;
shr.s32          r7,r3,n;     // result is sign-extended

shf.r.clamp.b32  r1,r0,r0,n;  // rotate right by n; n < 32
shf.l.clamp.b32  r1,r0,r0,n;  // rotate left by n; n < 32

// extract 32-bits from [r1,r0] starting at position n < 32
shf.r.clamp.b32  r0,r0,r1,n;

9.7.8.8. Logic and Shift Instructions: shl

shl

Shift bits left, zero-fill on right.

Syntax

shl.type d, a, b;

.type = { .b16, .b32, .b64 };

Description

Shift a left by the amount specified by the unsigned 32-bit value in b.

Semantics

d = a << b;

Notes

Shift amounts greater than the register width N are clamped to N.

The sizes of the destination and first source operand must match, but not necessarily the type. The b operand must be a 32-bit value, regardless of the instruction type.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Example

shl.b32  q,a,2;

9.7.8.9. Logic and Shift Instructions: shr

shr

Shift bits right, sign or zero-fill on left.

Syntax

shr.type d, a, b;

.type = { .b16, .b32, .b64,
          .u16, .u32, .u64,
          .s16, .s32, .s64 };

Description

Shift a right by the amount specified by the unsigned 32-bit value in b. Signed shifts fill with the sign bit, unsigned and untyped shifts fill with 0.

Semantics

d = a >> b;

Notes

Shift amounts greater than the register width N are clamped to N.

The sizes of the destination and first source operand must match, but not necessarily the type. The b operand must be a 32-bit value, regardless of the instruction type.

Bit-size types are included for symmetry with shl.
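The clamping behavior of shl and shr differs from C, where shifting by at least the operand width is undefined. A Python sketch of the 32-bit case (helper names are ours):

```python
# Python sketch (not PTX) of the clamping behavior for 32-bit shifts:
# shift amounts greater than 32 are clamped to 32.
def shl_b32(a, b):
    return ((a & 0xFFFFFFFF) << min(b, 32)) & 0xFFFFFFFF

def shr_u32(a, b):
    # unsigned/untyped shift: zero fills from the left
    return (a & 0xFFFFFFFF) >> min(b, 32)

def shr_s32(a, b):
    # signed shift: the sign bit fills from the left
    s = a - (1 << 32) if a & 0x80000000 else a & 0xFFFFFFFF
    return (s >> min(b, 32)) & 0xFFFFFFFF

print(hex(shr_s32(0x80000000, 100)))  # 0xffffffff (clamped, sign-filled)
```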

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Example

shr.u16  c,a,2;
shr.s32  i,i,1;
shr.b16  k,i,j;

9.7.9. Data Movement and Conversion Instructions

These instructions copy data from place to place, and from state space to state space, possibly converting it from one format to another. mov, ld, ldu, and st operate on both scalar and vector types. The isspacep instruction is provided to query whether a generic address falls within a particular state space window. The cvta instruction converts addresses between generic and const, global, local, or shared state spaces.

Instructions ld, st, suld, and sust support optional cache operations.

The Data Movement and Conversion Instructions are:

  • mov

  • shfl.sync

  • prmt

  • ld

  • ldu

  • st

  • st.async

  • st.bulk

  • multimem.ld_reduce, multimem.st, multimem.red

  • prefetch, prefetchu

  • isspacep

  • cvta

  • cvt

  • cvt.pack

  • cp.async

  • cp.async.commit_group

  • cp.async.wait_group, cp.async.wait_all

  • cp.async.bulk

  • cp.reduce.async.bulk

  • cp.async.bulk.prefetch

  • cp.async.bulk.tensor

  • cp.reduce.async.bulk.tensor

  • cp.async.bulk.prefetch.tensor

  • cp.async.bulk.commit_group

  • cp.async.bulk.wait_group

  • tensormap.replace

9.7.9.1. Cache Operators

PTX ISA version 2.0 introduced optional cache operators on load and store instructions. The cache operators require a target architecture of sm_20 or higher.

Cache operators on load or store instructions are treated as performance hints only. The use of a cache operator on an ld or st instruction does not change the memory consistency behavior of the program.

For sm_20 and higher, the cache operators have the following definitions and behavior.

Table 30 Cache Operators for Memory Load Instructions

Operator

Meaning

.ca

Cache at all levels, likely to be accessed again.

The default load instruction cache operation is ld.ca, which allocates cache lines in all levels (L1 and L2) with normal eviction policy. Global data is coherent at the L2 level, but multiple L1 caches are not coherent for global data. If one thread stores to global memory via one L1 cache, and a second thread loads that address via a second L1 cache with ld.ca, the second thread may get stale L1 cache data, rather than the data stored by the first thread. The driver must invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first grid program are then correctly fetched by the second grid program issuing default ld.ca loads cached in L1.

.cg

Cache at global level (cache in L2 and below, not L1).

Use ld.cg to cache loads only globally, bypassing the L1 cache, and cache only in the L2 cache.

.cs

Cache streaming, likely to be accessed once.

The ld.cs load cached streaming operation allocates global lines with evict-first policy in L1 and L2 to limit cache pollution by temporary streaming data that may be accessed once or twice. When ld.cs is applied to a Local window address, it performs the ld.lu operation.

.lu

Last use.

The compiler/programmer may use ld.lu when restoring spilled registers and popping function stack frames to avoid needless write-backs of lines that will not be used again. The ld.lu instruction performs a load cached streaming operation (ld.cs) on global addresses.

.cv

Don’t cache and fetch again (consider cached system memory lines stale, fetch again).

The ld.cv load operation applied to a global System Memory address invalidates (discards) a matching L2 line and re-fetches the line on each new load.

Table 31 Cache Operators for Memory Store Instructions

Operator

Meaning

.wb

Cache write-back all coherent levels.

The default store instruction cache operation is st.wb, which writes back cache lines of coherent cache levels with normal eviction policy.

If one thread stores to global memory, bypassing its L1 cache, and a second thread in a different SM later loads from that address via a different L1 cache with ld.ca, the second thread may get a hit on stale L1 cache data, rather than get the data from L2 or memory stored by the first thread.

The driver must invalidate global L1 cache lines between dependent grids of thread arrays. Stores by the first grid program are then correctly missed in L1 and fetched by the second grid program issuing default ld.ca loads.

.cg

Cache at global level (cache in L2 and below, not L1).

Use st.cg to cache global store data only globally, bypassing the L1 cache, and cache only in the L2 cache.

.cs

Cache streaming, likely to be accessed once.

The st.cs store cached-streaming operation allocates cache lines with evict-first policy to limit cache pollution by streaming output data.

.wt

Cache write-through (to system memory).

The st.wt store write-through operation applied to a global System Memory address writes through the L2 cache.

9.7.9.2. Cache Eviction Priority Hints

PTX ISA version 7.4 adds optional cache eviction priority hints on load and store instructions. Cache eviction priority requires target architecture sm_70 or higher.

Cache eviction priority on load or store instructions is treated as a performance hint. It is supported for the .global state space and generic addresses where the address points to the .global state space.

Table 32 Cache Eviction Priority Hints for Memory Load and Store Instructions

Cache Eviction Priority

Meaning

evict_normal

Cache data with normal eviction priority. This is the default eviction priority.

evict_first

Data cached with this priority will be first in the eviction priority order and will likely be evicted when cache eviction is required. This priority is suitable for streaming data.

evict_last

Data cached with this priority will be last in the eviction priority order and will likely be evicted only after other data with evict_normal or evict_first eviction priority is already evicted. This priority is suitable for data that should remain persistent in cache.

evict_unchanged

Do not change eviction priority order as part of this operation.

no_allocate

Do not allocate data to cache. This priority is suitable for streaming data.

9.7.9.3. Data Movement and Conversion Instructions: mov

mov

Set a register variable with the value of a register variable or an immediate value. Take the non-generic address of a variable in global, local, or shared state space.

Syntax

mov.type  d, a;
mov.type  d, sreg;
mov.type  d, avar;       // get address of variable
mov.type  d, avar+imm;   // get address of variable with offset
mov.u32   d, fname;      // get address of device function
mov.u64   d, fname;      // get address of device function
mov.u32   d, kernel;     // get address of entry function
mov.u64   d, kernel;     // get address of entry function

.type = { .pred,
          .b16, .b32, .b64,
          .u16, .u32, .u64,
          .s16, .s32, .s64,
          .f32, .f64 };

Description

Write register d with the value of a.

Operand a may be a register, special register, variable with optional offset in an addressable memory space, or function name.

For variables declared in .const, .global, .local, and .shared state spaces, mov places the non-generic address of the variable (i.e., the address of the variable in its state space) into the destination register. The generic address of a variable in const, global, local, or shared state space may be generated by first taking the address within the state space with mov and then converting it to a generic address using the cvta instruction; alternately, the generic address of a variable declared in const, global, local, or shared state space may be taken directly using the cvta instruction.

Note that if the address of a device function parameter is moved to a register, the parameter will be copied onto the stack and the address will be in the local state space.

Semantics

d = a;
d = sreg;
d = &avar;        // address is non-generic; i.e., within the variable's declared state space
d = &avar+imm;

Notes

  • Although only predicate and bit-size types are required, we include the arithmetic types for the programmer’s convenience: their use enhances program readability and allows additional type checking.

  • When moving the address of a kernel or a device function, only .u32 or .u64 instruction types are allowed. However, if a signed type is used, it is not treated as a compilation error. The compiler issues a warning in this case.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Taking the address of kernel entry functions requires PTX ISA version 3.1 or later. Kernel function addresses should only be used in the context of CUDA Dynamic Parallelism system calls. See the CUDA Dynamic Parallelism Programming Guide for details.

Target ISA Notes

mov.f64 requires sm_13 or higher.

Taking the address of kernel entry functions requires sm_35 or higher.

Examples

mov.f32  d,a;
mov.u16  u,v;
mov.f32  k,0.1;
mov.u32  ptr, A;        // move address of A into ptr
mov.u32  ptr, A[5];     // move address of A[5] into ptr
mov.u32  ptr, A+20;     // move address with offset into ptr
mov.u32  addr, myFunc;  // get address of device function 'myFunc'
mov.u64  kptr, main;    // get address of entry function 'main'

9.7.9.4. Data Movement and Conversion Instructions: mov

mov

Move vector-to-scalar (pack) or scalar-to-vector (unpack).

Syntax

mov.type  d, a;

.type = { .b16, .b32, .b64, .b128 };

Description

Write scalar register d with the packed value of vector register a, or write vector register d with the unpacked values from scalar register a.

When destination operand d is a vector register, the sink symbol '_' may be used for one or more elements provided that at least one element is a scalar register.

For bit-size types, mov may be used to pack vector elements into a scalar register or unpack sub-fields of a scalar register into a vector. Both the overall size of the vector and the size of the scalar must match the size of the instruction type.

Semantics

// pack two 8-bit elements into .b16
d = a.x | (a.y << 8)

// pack four 8-bit elements into .b32
d = a.x | (a.y << 8)  | (a.z << 16) | (a.w << 24)

// pack two 16-bit elements into .b32
d = a.x | (a.y << 16)

// pack four 16-bit elements into .b64
d = a.x | (a.y << 16)  | (a.z << 32) | (a.w << 48)

// pack two 32-bit elements into .b64
d = a.x | (a.y << 32)

// pack four 32-bit elements into .b128
d = a.x | (a.y << 32)  | (a.z << 64) | (a.w << 96)

// pack two 64-bit elements into .b128
d = a.x | (a.y << 64)

// unpack 8-bit elements from .b16
{ d.x, d.y } = { a[0..7], a[8..15] }

// unpack 8-bit elements from .b32
{ d.x, d.y, d.z, d.w } =
        { a[0..7], a[8..15], a[16..23], a[24..31] }

// unpack 16-bit elements from .b32
{ d.x, d.y } = { a[0..15], a[16..31] }

// unpack 16-bit elements from .b64
{ d.x, d.y, d.z, d.w } =
        { a[0..15], a[16..31], a[32..47], a[48..63] }

// unpack 32-bit elements from .b64
{ d.x, d.y } = { a[0..31], a[32..63] }

// unpack 32-bit elements from .b128
{ d.x, d.y, d.z, d.w } =
        { a[0..31], a[32..63], a[64..95], a[96..127] }

// unpack 64-bit elements from .b128
{ d.x, d.y } = { a[0..63], a[64..127] }
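One representative case, the .b32 pack/unpack of two 16-bit elements, can be sketched in Python (illustrative helper names, not PTX):

```python
# Python sketch (not PTX) of the .b32 pack/unpack case with two
# 16-bit elements, matching the semantics above.
def pack_b32(x, y):
    # models: mov.b32 d, {x, y};
    return (x & 0xFFFF) | ((y & 0xFFFF) << 16)

def unpack_b32(d):
    # models: mov.b32 {x, y}, d;
    return d & 0xFFFF, (d >> 16) & 0xFFFF

print(hex(pack_b32(0x1234, 0xABCD)))  # 0xabcd1234
```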

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Support for the .b128 type introduced in PTX ISA version 8.3.

Target ISA Notes

Supported on all target architectures.

Support for the .b128 type requires sm_70 or higher.

Examples

mov.b32 %r1,{a,b};      // a,b have type .u16
mov.b64 {lo,hi}, %x;    // %x is a double; lo,hi are .u32
mov.b32 %r1,{x,y,z,w};  // x,y,z,w have type .b8
mov.b32 {r,g,b,a},%r1;  // r,g,b,a have type .u8
mov.b64 {%r1, _}, %x;   // %x is .b64, %r1 is .b32
mov.b128 {%b1, %b2}, %y;   // %y is .b128, %b1 and %b2 are .b64
mov.b128 %y, {%b1, %b2};   // %y is .b128, %b1 and %b2 are .b64

9.7.9.5. Data Movement and Conversion Instructions: shfl (deprecated)

shfl (deprecated)

Register data shuffle within threads of a warp.

Syntax

shfl.mode.b32  d[|p], a, b, c;

.mode = { .up, .down, .bfly, .idx };

Deprecation Note

The shfl instruction without a .sync qualifier is deprecated in PTX ISA version 6.0.

  • Support for this instruction with .target lower than sm_70 may be removed in a future PTX ISA version.

Removal Note

Support for the shfl instruction without a .sync qualifier is removed in PTX ISA version 6.4 for .target sm_70 or higher.

Description

Exchange register data between threads of a warp.

Each thread in the currently executing warp will compute a source lane index j based on input operands b and c and the mode. If the computed source lane index j is in range, the thread will copy the input operand a from lane j into its own destination register d; otherwise, the thread will simply copy its own input a to destination d. The optional destination predicate p is set to True if the computed source lane is in range, and otherwise set to False.

Note that an out of range value of b may still result in a valid computed source lane index j. In this case, a data transfer occurs and the destination predicate p is True.

Note that results are undefined in divergent control flow within a warp, if an active thread sources a register from an inactive thread.

Operand b specifies a source lane or source lane offset, depending on the mode.

Operand c contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.

Semantics

lane[4:0]  = [Thread].laneid;  // position of thread in warp
bval[4:0] = b[4:0];            // source lane or lane offset (0..31)
cval[4:0] = c[4:0];            // clamp value
mask[4:0] = c[12:8];

// get value of source register a if thread is active and
// guard predicate true, else unpredictable
if (isActive(Thread) && isGuardPredicateTrue(Thread)) {
    SourceA[lane] = a;
} else {
    // Value of SourceA[lane] is unpredictable for
    // inactive/predicated-off threads in warp
}
maxLane = (lane[4:0] & mask[4:0]) | (cval[4:0] & ~mask[4:0]);
minLane = (lane[4:0] & mask[4:0]);

switch (.mode) {
    case .up:    j = lane - bval; pval = (j >= maxLane); break;
    case .down:  j = lane + bval; pval = (j <= maxLane); break;
    case .bfly:  j = lane ^ bval; pval = (j <= maxLane); break;
    case .idx:   j = minLane  | (bval[4:0] & ~mask[4:0]);
                                 pval = (j <= maxLane); break;
}
if (!pval) j = lane;  // copy from own lane
d = SourceA[j];       // copy input a from lane j
if (dest predicate selected)
    p = pval;

PTX ISA Notes

Introduced in PTX ISA version 3.0.

Deprecated in PTX ISA version 6.0 in favor of shfl.sync.

Not supported in PTX ISA version 6.4 for .target sm_70 or higher.

Target ISA Notes

shfl requires sm_30 or higher.

shfl is not supported on sm_70 or higher starting PTX ISA version 6.4.

Examples

    // Warp-level INCLUSIVE PLUS SCAN:
    //
    // Assumes input in following registers:
    //     - Rx  = sequence value for this thread
    //
    shfl.up.b32  Ry|p, Rx, 0x1,  0x0;
@p  add.f32      Rx, Ry, Rx;
    shfl.up.b32  Ry|p, Rx, 0x2,  0x0;
@p  add.f32      Rx, Ry, Rx;
    shfl.up.b32  Ry|p, Rx, 0x4,  0x0;
@p  add.f32      Rx, Ry, Rx;
    shfl.up.b32  Ry|p, Rx, 0x8,  0x0;
@p  add.f32      Rx, Ry, Rx;
    shfl.up.b32  Ry|p, Rx, 0x10, 0x0;
@p  add.f32      Rx, Ry, Rx;

    // Warp-level INCLUSIVE PLUS REVERSE-SCAN:
    //
    // Assumes input in following registers:
    //     - Rx  = sequence value for this thread
    //
    shfl.down.b32  Ry|p, Rx, 0x1,  0x1f;
@p  add.f32        Rx, Ry, Rx;
    shfl.down.b32  Ry|p, Rx, 0x2,  0x1f;
@p  add.f32        Rx, Ry, Rx;
    shfl.down.b32  Ry|p, Rx, 0x4,  0x1f;
@p  add.f32        Rx, Ry, Rx;
    shfl.down.b32  Ry|p, Rx, 0x8,  0x1f;
@p  add.f32        Rx, Ry, Rx;
    shfl.down.b32  Ry|p, Rx, 0x10, 0x1f;
@p  add.f32        Rx, Ry, Rx;

    // BUTTERFLY REDUCTION:
    //
    // Assumes input in following registers:
    //     - Rx  = sequence value for this thread
    //
    shfl.bfly.b32  Ry, Rx, 0x10, 0x1f;   // no predicate dest
    add.f32        Rx, Ry, Rx;
    shfl.bfly.b32  Ry, Rx, 0x8,  0x1f;
    add.f32        Rx, Ry, Rx;
    shfl.bfly.b32  Ry, Rx, 0x4,  0x1f;
    add.f32        Rx, Ry, Rx;
    shfl.bfly.b32  Ry, Rx, 0x2,  0x1f;
    add.f32        Rx, Ry, Rx;
    shfl.bfly.b32  Ry, Rx, 0x1,  0x1f;
    add.f32        Rx, Ry, Rx;
    //
    // All threads now hold sum in Rx

9.7.9.6. Data Movement and Conversion Instructions: shfl.sync

shfl.sync

Register data shuffle within threads of a warp.

Syntax

shfl.sync.mode.b32  d[|p], a, b, c, membermask;

.mode = { .up, .down, .bfly, .idx };

Description

Exchange register data between threads of a warp.

shfl.sync will cause the executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.

Operand membermask specifies a 32-bit integer which is a mask indicating threads participating in the barrier where the bit position corresponds to the thread’s laneid.

shfl.sync exchanges register data between threads in membermask.

Each thread in the currently executing warp will compute a source lane index j based on input operands b and c and the mode. If the computed source lane index j is in range, the thread will copy the input operand a from lane j into its own destination register d; otherwise, the thread will simply copy its own input a to destination d. The optional destination predicate p is set to True if the computed source lane is in range, and otherwise set to False.

Note that an out of range value of b may still result in a valid computed source lane index j. In this case, a data transfer occurs and the destination predicate p is True.

Note that results are undefined if a thread sources a register from an inactive thread or a thread that is not in membermask.

Operand b specifies a source lane or source lane offset, depending on the mode.

Operand c contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.

The behavior of shfl.sync is undefined if the executing thread is not in the membermask.

Note

For .target sm_6x or below, all threads in membermask must execute the same shfl.sync instruction in convergence, and only threads belonging to some membermask can be active when the shfl.sync instruction is executed. Otherwise, the behavior is undefined.

Semantics

// wait for all threads in membermask to arrive
wait_for_specified_threads(membermask);

lane[4:0]  = [Thread].laneid;  // position of thread in warp
bval[4:0] = b[4:0];            // source lane or lane offset (0..31)
cval[4:0] = c[4:0];            // clamp value
segmask[4:0] = c[12:8];

// get value of source register a if thread is active and
// guard predicate true, else unpredictable
if (isActive(Thread) && isGuardPredicateTrue(Thread)) {
    SourceA[lane] = a;
} else {
    // Value of SourceA[lane] is unpredictable for
    // inactive/predicated-off threads in warp
}
maxLane = (lane[4:0] & segmask[4:0]) | (cval[4:0] & ~segmask[4:0]);
minLane = (lane[4:0] & segmask[4:0]);

switch (.mode) {
    case .up:    j = lane - bval; pval = (j >= maxLane); break;
    case .down:  j = lane + bval; pval = (j <= maxLane); break;
    case .bfly:  j = lane ^ bval; pval = (j <= maxLane); break;
    case .idx:   j = minLane  | (bval[4:0] & ~segmask[4:0]);
                                 pval = (j <= maxLane); break;
}
if (!pval) j = lane;  // copy from own lane
d = SourceA[j];       // copy input a from lane j
if (dest predicate selected)
    p = pval;
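The per-thread source-lane computation can be sketched in Python (an illustrative model for one thread; the helper name shfl_lane is ours):

```python
# Python sketch (not PTX) of the shfl.sync source-lane computation;
# returns (j, pval) for a single thread at position `lane`.
def shfl_lane(mode, lane, b, c):
    bval = b & 0x1F                       # source lane or lane offset
    cval = c & 0x1F                       # clamp value
    segmask = (c >> 8) & 0x1F             # sub-segment mask
    max_lane = (lane & segmask) | (cval & ~segmask & 0x1F)
    min_lane = lane & segmask
    if mode == "up":
        j = lane - bval
        pval = j >= max_lane
    elif mode == "down":
        j = lane + bval
        pval = j <= max_lane
    elif mode == "bfly":
        j = lane ^ bval
        pval = j <= max_lane
    else:                                 # "idx"
        j = min_lane | (bval & ~segmask & 0x1F)
        pval = j <= max_lane
    if not pval:
        j = lane                          # out of range: copy from own lane
    return j, pval

# Butterfly with full clamp (c = 0x1f): lane 5 pairs with lane 4
print(shfl_lane("bfly", 5, 1, 0x1F))  # (4, True)
```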

PTX ISA Notes

Introduced in PTX ISA version 6.0.

Target ISA Notes

Requires sm_30 or higher.

Examples

shfl.sync.up.b32  Ry|p, Rx, 0x1,  0x0, 0xffffffff;

9.7.9.7. Data Movement and Conversion Instructions: prmt

prmt

Permute bytes from register pair.

Syntax

prmt.b32{.mode}  d, a, b, c;

.mode = { .f4e, .b4e, .rc8, .ecl, .ecr, .rc16 };

Description

Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register.

In the generic form (no mode specified), the permute control consists of four 4-bit selection values. The bytes in the two source registers are numbered from 0 to 7: {b,a} = {{b7,b6,b5,b4}, {b3,b2,b1,b0}}. For each byte in the target register, a 4-bit selection value is defined.

The 3 lsbs of the selection value specify which of the 8 source bytes should be moved into the target position. The msb defines if the byte value should be copied, or if the sign (msb of the byte) should be replicated over all 8 bits of the target position (sign extend of the byte value); msb=0 means copy the literal value; msb=1 means replicate the sign. Note that the sign extension is only performed as part of the generic form.

Thus, the four 4-bit values fully specify an arbitrary byte permute, as a 16b permute code.

default mode   | d.b3 source select | d.b2 source select | d.b1 source select | d.b0 source select
index          | c[15:12]           | c[11:8]            | c[7:4]             | c[3:0]

The more specialized form of the permute control uses the two lsbs of operand c (which is typically an address pointer) to control the byte extraction.

mode                      | selector c[1:0] | d.b3 source | d.b2 source | d.b1 source | d.b0 source
f4e (forward 4 extract)   | 0               | 3           | 2           | 1           | 0
                          | 1               | 4           | 3           | 2           | 1
                          | 2               | 5           | 4           | 3           | 2
                          | 3               | 6           | 5           | 4           | 3
b4e (backward 4 extract)  | 0               | 5           | 6           | 7           | 0
                          | 1               | 6           | 7           | 0           | 1
                          | 2               | 7           | 0           | 1           | 2
                          | 3               | 0           | 1           | 2           | 3
rc8 (replicate 8)         | 0               | 0           | 0           | 0           | 0
                          | 1               | 1           | 1           | 1           | 1
                          | 2               | 2           | 2           | 2           | 2
                          | 3               | 3           | 3           | 3           | 3
ecl (edge clamp left)     | 0               | 3           | 2           | 1           | 0
                          | 1               | 3           | 2           | 1           | 1
                          | 2               | 3           | 2           | 2           | 2
                          | 3               | 3           | 3           | 3           | 3
ecr (edge clamp right)    | 0               | 0           | 0           | 0           | 0
                          | 1               | 1           | 1           | 1           | 0
                          | 2               | 2           | 2           | 1           | 0
                          | 3               | 3           | 2           | 1           | 0
rc16 (replicate 16)       | 0               | 1           | 0           | 1           | 0
                          | 1               | 3           | 2           | 3           | 2
                          | 2               | 1           | 0           | 1           | 0
                          | 3               | 3           | 2           | 3           | 2

Semantics

tmp64 = (b<<32) | a;  // create 8 byte source

if ( ! mode ) {
   ctl[0] = (c >>  0) & 0xf;
   ctl[1] = (c >>  4) & 0xf;
   ctl[2] = (c >>  8) & 0xf;
   ctl[3] = (c >> 12) & 0xf;
} else {
   ctl[0] = ctl[1] = ctl[2] = ctl[3] = (c >>  0) & 0x3;
}

tmp[07:00] = ReadByte( mode, ctl[0], tmp64 );
tmp[15:08] = ReadByte( mode, ctl[1], tmp64 );
tmp[23:16] = ReadByte( mode, ctl[2], tmp64 );
tmp[31:24] = ReadByte( mode, ctl[3], tmp64 );
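The generic-form selection, including the msb-driven sign replication described above, can be modeled in Python (an illustrative sketch; the function name prmt mirrors the instruction):

```python
# Python sketch (not PTX) of prmt.b32 in generic (no-mode) form.
def prmt(a, b, c):
    src = ((b & 0xFFFFFFFF) << 32) | (a & 0xFFFFFFFF)  # bytes b7..b0
    d = 0
    for i in range(4):                     # one 4-bit selector per result byte
        sel = (c >> (4 * i)) & 0xF
        byte = (src >> (8 * (sel & 7))) & 0xFF
        if sel & 8:                        # msb set: replicate the byte's sign bit
            byte = 0xFF if byte & 0x80 else 0x00
        d |= byte << (8 * i)
    return d

# Selector 0x3210 picks bytes 3,2,1,0, i.e. returns a unchanged
print(hex(prmt(0x03020100, 0x07060504, 0x3210)))  # 0x3020100
```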

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

prmt requires sm_20 or higher.

Examples

prmt.b32      r1, r2, r3, r4;
prmt.b32.f4e  r1, r2, r3, r4;

9.7.9.8. Data Movement and Conversion Instructions: ld

ld

Load a register variable from an addressable state space variable.

Syntax

ld{.weak}{.ss}{.cop}{.level::cache_hint}{.level::prefetch_size}{.vec}.type  d, [a]{.unified}{, cache-policy};

ld{.weak}{.ss}{.level1::eviction_priority}{.level2::eviction_priority}{.level::cache_hint}{.level::prefetch_size}{.vec}.type  d, [a]{.unified}{, cache-policy};

ld.volatile{.ss}{.level::prefetch_size}{.vec}.type  d, [a];

ld.relaxed.scope{.ss}{.level1::eviction_priority}{.level2::eviction_priority}{.level::cache_hint}{.level::prefetch_size}{.vec}.type  d, [a]{, cache-policy};

ld.acquire.scope{.ss}{.level1::eviction_priority}{.level2::eviction_priority}{.level::cache_hint}{.level::prefetch_size}{.vec}.type  d, [a]{, cache-policy};

ld.mmio.relaxed.sys{.global}.type  d, [a];

.ss =                        { .const, .global, .local, .param{::entry, ::func}, .shared{::cta, ::cluster} };
.cop =                       { .ca, .cg, .cs, .lu, .cv };
.level1::eviction_priority = { .L1::evict_normal, .L1::evict_unchanged,
                               .L1::evict_first, .L1::evict_last, .L1::no_allocate };
.level2::eviction_priority = { .L2::evict_normal, .L2::evict_first, .L2::evict_last };
.level::cache_hint =         { .L2::cache_hint };
.level::prefetch_size =      { .L2::64B, .L2::128B, .L2::256B }
.scope =                     { .cta, .cluster, .gpu, .sys };
.vec =                       { .v2, .v4, .v8 };
.type =                      { .b8, .b16, .b32, .b64, .b128,
                               .u8, .u16, .u32, .u64,
                               .s8, .s16, .s32, .s64,
                               .f32, .f64 };

Description

Load register variable d from the location specified by the source address operand a in the specified state space. If no state space is given, perform the load using Generic Addressing.

If no sub-qualifier is specified with the .shared state space, then ::cta is assumed by default.

Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.

If no sub-qualifier is specified with the .param state space, then:

  • ::func is assumed when access is inside a device function.

  • ::entry is assumed when accessing kernel function parameters from an entry function. Otherwise, when accessing device function parameters or any other .param variables from an entry function, ::func is assumed by default.

For the ld.param::entry instruction, operand a must be a kernel parameter address; otherwise, behavior is undefined. For the ld.param::func instruction, operand a must be a device function parameter address; otherwise, behavior is undefined.

The ld.param{::func} instruction used for reading the value returned from a device function call cannot be predicated. See Parameter State Space and Function Declarations and Definitions for descriptions of the proper use of ld.param.

The .relaxed and .acquire qualifiers indicate memory synchronization as described in the Memory Consistency Model. The .scope qualifier indicates the set of threads with which an ld.relaxed or ld.acquire instruction can directly synchronize1. The .weak qualifier indicates a memory instruction with no synchronization. The effects of this instruction become visible to other threads only when synchronization is established by other means.

The semantic details of the .mmio qualifier are described in the Memory Consistency Model. Only .sys thread scope is valid for the ld.mmio operation. The qualifiers .mmio and .relaxed must be specified together.

The semantic details of the .volatile qualifier are described in the Memory Consistency Model.

The .weak, .volatile, .relaxed and .acquire qualifiers are mutually exclusive. When none of these is specified, the .weak qualifier is assumed by default.

The qualifiers .volatile, .relaxed and .acquire may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space. Cache operations are not permitted with these qualifiers. The qualifier .mmio may be used only with .global space and with generic addressing, where the address points to .global space.

The optional qualifier .unified must be specified on operand a if a is the address of a variable declared with the .unified attribute as described in Variable and Function Attribute Directive: .attribute.

The .v8 (.vec) qualifier is supported if:

  • .type is .b32, .s32, .u32, or .f32 AND

  • State space is .global, or generic addressing is used where the address points to .global state space

The .v4 (.vec) qualifier with type .b64, .s64, .u64, or .f64 is supported if:

  • State space is .global, or generic addressing is used where the address points to .global state space

Qualifiers .level1::eviction_priority and .level2::eviction_priority specify the eviction policy for the L1 and L2 cache respectively, which may be applied during the memory access.

Qualifier .level2::eviction_priority is supported if:

  • .vec is .v8 and .type is .b32, .s32, .u32, or .f32

    • AND operand d is a vector of 8 registers of the type specified with .type

  • OR .vec is .v4 and .type is .b64, .s64, .u64, or .f64

    • AND operand d is a vector of 4 registers of the type specified with .type

Optionally, the sink symbol ‘_’ can be used in vector expression d when:

  • .vec is .v8 and .type is .b32, .s32, .u32, or .f32 OR

  • .vec is .v4 and .type is .b64, .s64, .u64, or .f64

which indicates that the data from the corresponding memory location is not read.

The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to either 64B, 128B, or 256B, thereby allowing the prefetch size to be 64 bytes, 128 bytes, or 256 bytes respectively.

The qualifier .level::prefetch_size may only be used with the .global state space and with generic addressing where the address points to the .global state space. If the generic address does not fall within the address window of the global memory, then the prefetching behavior is undefined.

The .level::prefetch_size qualifier is treated as a performance hint only.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

The qualifiers .unified and .level::cache_hint are only supported for the .global state space and for generic addressing where the address points to the .global state space.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.

1 This synchronization is further extended to other threads through the transitive nature of causality order, as described in the memory consistency model.

Semantics

d = a;             // named variable a
d = *(&a+immOff);  // variable-plus-offset
d = *a;            // register
d = *(a+immOff);   // register-plus-offset
d = *(immAddr);    // immediate address

Notes

Destination d must be in the .reg state space.

A destination register wider than the specified type may be used. The value loaded is sign-extended to the destination register width for signed integers, and is zero-extended to the destination register width for unsigned and bit-size types. See Table 28 for a description of these relaxed type-checking rules.
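For illustration, a hypothetical sketch of these extension rules (p is assumed to hold a valid, aligned global address):

.reg .b32 %r0, %r1;
ld.global.s8 %r0, [p];   // signed byte: sign-extended to 32 bits
ld.global.u8 %r1, [p];   // unsigned byte: zero-extended to 32 bits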

.f16 data may be loaded using ld.b16, and then converted to .f32 or .f64 using cvt, or can be used in half-precision floating-point instructions.

.f16x2 data may be loaded using ld.b32 and then used in half-precision floating-point instructions.

PTX ISA Notes

ld introduced in PTX ISA version 1.0. ld.volatile introduced in PTX ISA version 1.1.

Generic addressing and cache operations introduced in PTX ISA version 2.0.

Support for scope qualifier, .relaxed, .acquire, .weak qualifiers introduced in PTX ISA version 6.0.

Support for generic addressing of .const space added in PTX ISA version 3.1.

Support for .level1::eviction_priority, .level::prefetch_size and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.

Support for .cluster scope qualifier introduced in PTX ISA version 7.8.

Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.

Support for .unified qualifier introduced in PTX ISA version 8.0.

Support for .mmio qualifier introduced in PTX ISA version 8.2.

Support for ::entry and ::func sub-qualifiers on .param space introduced in PTX ISA version 8.3.

Support for .b128 type introduced in PTX ISA version 8.3.

Support for .sys scope with .b128 type introduced in PTX ISA version 8.4.

Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 introduced in PTX ISA version 8.8.

Target ISA Notes

ld.f64 requires sm_13 or higher.

Support for scope qualifier, .relaxed, .acquire, .weak qualifiers requires sm_70 or higher.

Generic addressing requires sm_20 or higher.

Cache operations require sm_20 or higher.

Support for .level1::eviction_priority qualifier requires sm_70 or higher.

Support for .level::prefetch_size qualifier requires sm_75 or higher.

Support for .L2::256B and .L2::cache_hint qualifiers requires sm_80 or higher.

Support for .cluster scope qualifier requires sm_90 or higher.

Sub-qualifier ::cta requires sm_30 or higher.

Sub-qualifier ::cluster requires sm_90 or higher.

Support for .unified qualifier requires sm_90 or higher.

Support for .mmio qualifier requires sm_70 or higher.

Support for .b128 type requires sm_70 or higher.

Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 requires sm_100 or higher.

Examples

ld.global.f32    d,[a];
ld.shared.v4.b32 Q,[p];
ld.const.s32     d,[p+4];
ld.local.b32     x,[p+-8]; // negative offset
ld.local.b64     x,[240];  // immediate address

ld.global.b16    %r,[fs];  // load .f16 data into 32-bit reg
cvt.f32.f16      %r,%r;    // up-convert f16 data to f32

ld.global.b32    %r0, [fs];     // load .f16x2 data in 32-bit reg
ld.global.b32    %r1, [fs + 4]; // load .f16x2 data in 32-bit reg
add.rn.f16x2     %d0, %r0, %r1; // addition of f16x2 data

ld.global.relaxed.gpu.u32 %r0, [gbl];
ld.shared.acquire.gpu.u32 %r1, [sh];
ld.global.relaxed.cluster.u32 %r2, [gbl];
ld.shared::cta.acquire.gpu.u32 %r2, [sh + 4];
ld.shared::cluster.u32 %r3, [sh + 8];
ld.global.mmio.relaxed.sys.u32 %r3, [gbl];

ld.global.f32    d,[ugbl].unified;
ld.b32           %r0, [%r1].unified;

ld.global.L1::evict_last.u32  d, [p];

ld.global.L2::64B.b32   %r0, [gbl]; // Prefetch 64B to L2
ld.L2::128B.f64         %r1, [gbl]; // Prefetch 128B to L2
ld.global.L2::256B.f64  %r2, [gbl]; // Prefetch 256B to L2

createpolicy.fractional.L2::evict_last.L2::evict_unchanged.b64 cache-policy, 1;
ld.global.L2::cache_hint.b64  x, [p], cache-policy;

ld.param::entry.b32 %rp1, [kparam1];

ld.global.b128   %r0, [gbl];   // 128-bit load

// 256-bit load
ld.global.L2::evict_last.v8.f32 { %reg0, _, %reg2, %reg3, %reg4, %reg5, %reg6, %reg7}, [addr];
ld.global.L2::evict_last.L1::evict_last.v4.u64 { %reg0, %reg1, %reg2, %reg3}, [addr];

9.7.9.9. Data Movement and Conversion Instructions: ld.global.nc

ld.global.nc

Load a register variable from global state space via non-coherent cache.

Syntax

ld.global{.cop}.nc{.level::cache_hint}{.level::prefetch_size}.type                 d, [a]{, cache-policy};
ld.global{.cop}.nc{.level::cache_hint}{.level::prefetch_size}.vec.type             d, [a]{, cache-policy};

ld.global.nc{.level1::eviction_priority}{.level2::eviction_priority}{.level::cache_hint}
  {.level::prefetch_size}.type      d, [a]{, cache-policy};
ld.global.nc{.level1::eviction_priority}{.level2::eviction_priority}{.level::cache_hint}
  {.level::prefetch_size}.vec.type  d, [a]{, cache-policy};

.cop  =                      { .ca, .cg, .cs };     // cache operation
.level1::eviction_priority = { .L1::evict_normal, .L1::evict_unchanged,
                               .L1::evict_first, .L1::evict_last, .L1::no_allocate };
.level2::eviction_priority = { .L2::evict_normal, .L2::evict_first, .L2::evict_last };
.level::cache_hint =         { .L2::cache_hint };
.level::prefetch_size =      { .L2::64B, .L2::128B, .L2::256B };
.vec  =                      { .v2, .v4, .v8 };
.type =                      { .b8, .b16, .b32, .b64, .b128,
                               .u8, .u16, .u32, .u64,
                               .s8, .s16, .s32, .s64,
                               .f32, .f64 };

Description

Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in the non-coherent read-only cache.

Note

On some architectures, the texture cache is larger and has higher bandwidth but longer latency than the global memory cache. For applications with sufficient parallelism to cover the longer latency, ld.global.nc should offer better performance than ld.global on such architectures.

The address operand a shall contain a global address. Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.

The .v8 (.vec) qualifier is supported if:

  • .type is .b32, .s32, .u32, or .f32 AND

  • State space is .global, or generic addressing is used where the address points to .global state space

The .v4 (.vec) qualifier with type .b64, .s64, .u64, or .f64 is supported if:

  • State space is .global, or generic addressing is used where the address points to .global state space

Qualifiers .level1::eviction_priority and .level2::eviction_priority specify the eviction policy for the L1 and L2 cache respectively, which may be applied during the memory access.

Qualifier .level2::eviction_priority is supported if:

  • .vec is .v8 and .type is .b32, .s32, .u32, or .f32

    • AND operand d is a vector of 8 registers of the type specified with .type

  • OR .vec is .v4 and .type is .b64, .s64, .u64, or .f64

    • AND operand d is a vector of 4 registers of the type specified with .type

Optionally, the sink symbol ‘_’ can be used in vector expression d when:

  • .vec is .v8 and .type is .b32, .s32, .u32, or .f32 OR

  • .vec is .v4 and .type is .b64, .s64, .u64, or .f64

which indicates that the data from the corresponding memory location is not read.
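For illustration, a hypothetical sketch of the sink symbol in a wide non-coherent load (addr is assumed to hold a suitably aligned global address):

// 256-bit load that skips the second and fourth elements
ld.global.nc.v8.f32 {%f0, _, %f2, _, %f4, %f5, %f6, %f7}, [addr]; // '_' lanes are not read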

The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to either 64B, 128B, or 256B, thereby allowing the prefetch size to be 64 bytes, 128 bytes, or 256 bytes respectively.

The .level::prefetch_size qualifier is treated as a performance hint only.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.

Semantics

d = a;             // named variable a
d = *(&a+immOff);  // variable-plus-offset
d = *a;            // register
d = *(a+immOff);   // register-plus-offset
d = *(immAddr);    // immediate address

Notes

Destination d must be in the .reg state space.

A destination register wider than the specified type may be used. The value loaded is sign-extended to the destination register width for signed integers, and is zero-extended to the destination register width for unsigned and bit-size types.

.f16 data may be loaded using ld.b16, and then converted to .f32 or .f64 using cvt.

PTX ISA Notes

Introduced in PTX ISA version 3.1.

Support for .level1::eviction_priority, .level::prefetch_size and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.

Support for .b128 type introduced in PTX ISA version 8.3.

Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 introduced in PTX ISA version 8.8.

Target ISA Notes

Requires sm_32 or higher.

Support for .level1::eviction_priority qualifier requires sm_70 or higher.

Support for .level::prefetch_size qualifier requires sm_75 or higher.

Support for .level::cache_hint qualifier requires sm_80 or higher.

Support for .b128 type requires sm_70 or higher.

Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 requires sm_100 or higher.

Examples

ld.global.nc.f32           d, [a];
ld.global.nc.L1::evict_last.u32 d, [a];

createpolicy.fractional.L2::evict_last.b64 cache-policy, 0.5;
ld.global.nc.L2::cache_hint.f32  d, [a], cache-policy;

ld.global.nc.L2::64B.b32      d,  [a];     // Prefetch 64B to L2
ld.global.nc.L2::256B.f64     d,  [a];     // Prefetch 256B to L2

ld.global.nc.b128             d,  [a];

ld.global.nc.L2::evict_first.v4.f64 {%reg0, %reg1, %reg2, %reg3}, [a]; // 256-bit load

9.7.9.10. Data Movement and Conversion Instructions: ldu

ldu

Load read-only data from an address that is common across threads in the warp.

Syntax

ldu{.ss}.type      d, [a];       // load from address
ldu{.ss}.vec.type  d, [a];       // vec load from address

.ss   = { .global };             // state space
.vec  = { .v2, .v4 };
.type = { .b8, .b16, .b32, .b64, .b128,
          .u8, .u16, .u32, .u64,
          .s8, .s16, .s32, .s64,
          .f32, .f64 };

Description

Load read-only data into register variable d from the location specified by the source address operand a in the global state space, where the address is guaranteed to be the same across all threads in the warp. If no state space is given, perform the load using Generic Addressing.

Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.

Semantics

d = a;             // named variable a
d = *(&a+immOff);  // variable-plus-offset
d = *a;            // register
d = *(a+immOff);   // register-plus-offset
d = *(immAddr);    // immediate address

Notes

Destination d must be in the .reg state space.

A destination register wider than the specified type may be used. The value loaded is sign-extended to the destination register width for signed integers, and is zero-extended to the destination register width for unsigned and bit-size types. See Table 28 for a description of these relaxed type-checking rules.

.f16 data may be loaded using ldu.b16, and then converted to .f32 or .f64 using cvt, or can be used in half-precision floating-point instructions.

.f16x2 data may be loaded using ldu.b32 and then used in half-precision floating-point instructions.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Support for .b128 type introduced in PTX ISA version 8.3.

Target ISA Notes

ldu.f64 requires sm_13 or higher.

Support for .b128 type requires sm_70 or higher.

Examples

ldu.global.f32    d,[a];
ldu.global.b32    d,[p+4];
ldu.global.v4.f32 Q,[p];
ldu.global.b128   d,[a];

9.7.9.11. Data Movement and Conversion Instructions: st

st

Store data to an addressable state space variable.

Syntax

st{.weak}{.ss}{.cop}{.level::cache_hint}{.vec}.type   [a], b{, cache-policy};

st{.weak}{.ss}{.level1::eviction_priority}{.level2::eviction_priority}{.level::cache_hint}{.vec}.type
                                                      [a], b{, cache-policy};

st.volatile{.ss}{.vec}.type                           [a], b;

st.relaxed.scope{.ss}{.level1::eviction_priority}{.level2::eviction_priority}{.level::cache_hint}{.vec}.type
                                                      [a], b{, cache-policy};

st.release.scope{.ss}{.level1::eviction_priority}{.level2::eviction_priority}{.level::cache_hint}{.vec}.type
                                                      [a], b{, cache-policy};

st.mmio.relaxed.sys{.global}.type                     [a], b;

.ss =                        { .global, .local, .param{::func}, .shared{::cta, ::cluster} };
.level1::eviction_priority = { .L1::evict_normal, .L1::evict_unchanged,
                               .L1::evict_first, .L1::evict_last, .L1::no_allocate };
.level2::eviction_priority = { .L2::evict_normal, .L2::evict_first, .L2::evict_last };
.level::cache_hint =         { .L2::cache_hint };
.cop =                       { .wb, .cg, .cs, .wt };
.sem =                       { .relaxed, .release };
.scope =                     { .cta, .cluster, .gpu, .sys };
.vec =                       { .v2, .v4, .v8 };
.type =                      { .b8, .b16, .b32, .b64, .b128,
                               .u8, .u16, .u32, .u64,
                               .s8, .s16, .s32, .s64,
                               .f32, .f64 };

Description

Store the value of operand b in the location specified by the destination address operand a in the specified state space. If no state space is given, perform the store using Generic Addressing. Stores to const memory are illegal.

If no sub-qualifier is specified with .shared state space, then ::cta is assumed by default.

Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.

If .param is specified without any sub-qualifiers, then it defaults to .param::func.

The st.param{::func} instruction used for passing arguments to a device function cannot be predicated. See Parameter State Space and Function Declarations and Definitions for descriptions of the proper use of st.param.

The qualifiers .relaxed and .release indicate memory synchronization as described in the Memory Consistency Model. The .scope qualifier indicates the set of threads with which an st.relaxed or st.release instruction can directly synchronize1. The .weak qualifier indicates a memory instruction with no synchronization. The effects of this instruction become visible to other threads only when synchronization is established by other means.

The semantic details of the .mmio qualifier are described in the Memory Consistency Model. Only .sys thread scope is valid for the st.mmio operation. The qualifiers .mmio and .relaxed must be specified together.

The semantic details of the .volatile qualifier are described in the Memory Consistency Model.

The .weak, .volatile, .relaxed and .release qualifiers are mutually exclusive. When none of these is specified, the .weak qualifier is assumed by default.

The qualifiers .volatile, .relaxed and .release may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space. Cache operations are not permitted with these qualifiers. The qualifier .mmio may be used only with .global space and with generic addressing, where the address points to .global space.

The .v8 (.vec) qualifier is supported if:

  • .type is .b32, .s32, .u32, or .f32 AND

  • State space is .global, or generic addressing is used where the address points to .global state space

The .v4 (.vec) qualifier with type .b64, .s64, .u64, or .f64 is supported if:

  • State space is .global, or generic addressing is used where the address points to .global state space

Qualifiers .level1::eviction_priority and .level2::eviction_priority specify the eviction policy for the L1 and L2 cache respectively, which may be applied during the memory access.

Qualifier .level2::eviction_priority is supported if:

  • .vec is .v8 and .type is .b32, .s32, .u32, or .f32

    • AND operand b is a vector of 8 registers of the type specified with .type

  • OR .vec is .v4 and .type is .b64, .s64, .u64, or .f64

    • AND operand b is a vector of 4 registers of the type specified with .type

Optionally, the sink symbol ‘_’ can be used in vector expression b when:

  • .vec is .v8 and .type is .b32, .s32, .u32, or .f32 OR

  • .vec is .v4 and .type is .b64, .s64, .u64, or .f64

which indicates that no data is written at the corresponding destination address.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

The qualifier .level::cache_hint is only supported for the .global state space and for generic addressing where the address points to the .global state space.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.

1 This synchronization is further extended to other threads through the transitive nature of causality order, as described in the memory consistency model.

Semantics

a = b;                // named variable a
*(&a+immOffset) = b;  // variable-plus-offset
*a = b;               // register
*(a+immOffset) = b;   // register-plus-offset
*(immAddr) = b;       // immediate address

Notes

Operand b must be in the .reg state space.

A source register wider than the specified type may be used. The lower n bits corresponding to the instruction-type width are stored to memory. See Table 27 for a description of these relaxed type-checking rules.

.f16 data resulting from a cvt instruction may be stored using st.b16.

.f16x2 data may be stored using st.b32.

PTX ISA Notes

st introduced in PTX ISA version 1.0. st.volatile introduced in PTX ISA version 1.1.

Generic addressing and cache operations introduced in PTX ISA version 2.0.

Support for scope qualifier, .relaxed, .release, .weak qualifiers introduced in PTX ISA version 6.0.

Support for .level1::eviction_priority and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.

Support for .cluster scope qualifier introduced in PTX ISA version 7.8.

Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.

Support for .mmio qualifier introduced in PTX ISA version 8.2.

Support for ::func sub-qualifier on .param space introduced in PTX ISA version 8.3.

Support for .b128 type introduced in PTX ISA version 8.3.

Support for .sys scope with .b128 type introduced in PTX ISA version 8.4.

Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 introduced in PTX ISA version 8.8.

Target ISA Notes

st.f64 requires sm_13 or higher.

Support for scope qualifier, .relaxed, .release, .weak qualifiers requires sm_70 or higher.

Generic addressing requires sm_20 or higher.

Cache operations require sm_20 or higher.

Support for .level1::eviction_priority qualifier requires sm_70 or higher.

Support for .level::cache_hint qualifier requires sm_80 or higher.

Support for .cluster scope qualifier requires sm_90 or higher.

Sub-qualifier ::cta requires sm_30 or higher.

Sub-qualifier ::cluster requires sm_90 or higher.

Support for .mmio qualifier requires sm_70 or higher.

Support for .b128 type requires sm_70 or higher.

Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 requires sm_100 or higher.

Examples

st.global.f32    [a],b;
st.local.b32     [q+4],a;
st.global.v4.s32 [p],Q;
st.local.b32     [q+-8],a; // negative offset
st.local.s32     [100],r7; // immediate address

cvt.f16.f32      %r,%r;    // %r is 32-bit register
st.b16           [fs],%r;  // store lower

st.global.relaxed.sys.u32 [gbl], %r0;
st.shared.release.cta.u32 [sh], %r1;
st.global.relaxed.cluster.u32 [gbl], %r2;
st.shared::cta.release.cta.u32 [sh + 4], %r1;
st.shared::cluster.u32 [sh + 8], %r1;
st.global.mmio.relaxed.sys.u32 [gbl], %r1;

st.global.L1::no_allocate.f32 [p], a;

createpolicy.fractional.L2::evict_last.b64 cache-policy, 0.25;
st.global.L2::cache_hint.b32  [a], b, cache-policy;

st.param::func.b64 [param1], %rp1;

st.global.b128  [a], b;  // 128-bit store

// 256-bit store
st.global.L2::evict_last.v8.f32 [addr], { %reg0, _, %reg2, %reg3, %reg4, %reg5, %reg6, %reg7};

9.7.9.12. Data Movement and Conversion Instructions: st.async

st.async

Asynchronous store operation.

Syntax

st.async{.sem}{.scope}{.ss}{.completion_mechanism}{.vec}.type [a], b, [mbar];

.sem  =                 { .weak };
.scope =                { .cluster };
.ss   =                 { .shared::cluster };
.type =                 { .b32, .b64,
                          .u32, .u64,
                          .s32, .s64,
                          .f32, .f64 };
.vec  =                 { .v2, .v4 };
.completion_mechanism = { .mbarrier::complete_tx::bytes };

st.async{.mmio}.sem.scope{.ss}{.completion_mechanism}.type [a], b;

.sem =                  { .release };
.scope =                { .gpu, .sys };
.ss =                   { .global };
.completion_mechanism = { };
.type =                 { .b8, .b16, .b32, .b64,
                          .u8, .u16, .u32, .u64,
                          .s8, .s16, .s32, .s64,
                          .f32, .f64 };

Description

st.async is a non-blocking instruction which initiates an asynchronous store operation that stores the value specified by source operand b to the destination memory location specified by operand a.

Operands

  • a is the destination address, and must be either a register or of the form register+immOff, as described in Addresses as Operands.

  • b is the source value, of the type indicated by qualifier .type.

  • mbar is an mbarrier object address.

Qualifiers

  • .mmio indicates whether this is an mmio operation.

  • .sem specifies the memory ordering semantics as described in the Memory Consistency Model.

    • If .sem is not specified, it defaults to .weak.

  • .scope specifies the set of threads with which this instruction can directly synchronize.

  • .ss specifies the state space of the destination operand a and the mbarrier operand mbar.

  • .completion_mechanism specifies the mechanism for observing the completion of the asynchronous operation.

    • When .completion_mechanism is .mbarrier::complete_tx::bytes: upon completion of the asynchronous operation, a complete-tx operation will be performed on the mbarrier object specified by the operand mbar, with the completeCount argument equal to the amount of data stored in bytes.

    • When .completion_mechanism is not specified: the completion of the store synchronizes with the end of the CTA.

  • .type specifies the type of the source operand b.

Conditions

When .sem is .weak:

  • This is a weak store to shared memory, which signals its completion through an mbarrier object.

  • The store operation is treated as a weak memory operation.

  • The complete-tx operation on the mbarrier has .release semantics at .cluster scope.

  • Requires:

    • The shared memory addresses of destination operand a and the mbarrier object mbar belong to the same CTA within the same cluster as the executing thread.

    • The number of CTAs within the cluster is strictly greater than one; %cluster_nctarank > 1 is true.

    Otherwise, the behavior is undefined.

  • .mmio must not be specified.

  • If .ss is specified, it must be .shared::cluster.

  • If .ss is not specified, generic addressing is used for operands a and mbar. If the generic addresses specified do not fall within the address window of the .shared::cluster state space, the behavior is undefined.

  • If .completion_mechanism is specified, it must be .mbarrier::complete_tx::bytes.

  • If .completion_mechanism is not specified, it defaults to .mbarrier::complete_tx::bytes.

When .sem is .release:

  • This is a release store to global memory.

  • The store operation is a strong memory operation with .release semantics at the scope specified by .scope.

  • If .mmio is specified, .scope must be .sys.

  • If .ss is specified, it must be .global.

  • If .ss is not specified, generic addressing is used for operand a. If the generic address specified does not fall within the address window of the .global state space, the behavior is undefined.

  • .completion_mechanism must not be specified.

PTX ISA Notes

Introduced in PTX ISA version 8.1.

Support for .mmio qualifier, .release semantics, .global state space, and .scope qualifier introduced in PTX ISA version 8.7.

Target ISA Notes

Requires sm_90 or higher.

.mmio qualifier, .release semantics, .global state space, and .scope qualifier require sm_100 or higher.

Examples

st.async.shared::cluster.mbarrier::complete_tx::bytes.u32 [addr], b, [mbar_addr];
st.async.release.global.u32 [addr], b;

9.7.9.13. Data Movement and Conversion Instructions: st.bulk

st.bulk

Initializes a region of memory as specified by state space.

Syntax

st.bulk{.weak}{.shared::cta}  [a], size, initval; // initval must be zero

Description

The st.bulk instruction initializes a region of shared memory starting from the location specified by the destination address operand a.

The 32-bit or 64-bit integer operand size specifies the amount of memory to be initialized, in number of bytes. size must be a multiple of 8; if the value is not a multiple of 8, then the behavior is undefined. The maximum value of the size operand is 16777216.

The integer immediate operand initval specifies the initialization value for the memory locations. The only numeric value allowed for operand initval is 0.

If no state space is specified, then Generic Addressing is used. If the address specified by a does not fall within the address window of the .shared state space, then the behavior is undefined.

The optional qualifier .weak specifies the memory synchronizing effect of the st.bulk instruction as described in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Support for the size operand with 32-bit length introduced in PTX ISA version 9.0.

Target ISA Notes

Requires sm_100 or higher.

Examples

st.bulk.weak.shared::cta  [dst], n, 0;
st.bulk                   [gdst], 4096, 0;

9.7.9.14. Data Movement and Conversion Instructions: multimem.ld_reduce, multimem.st, multimem.red

The multimem.* operations operate on multimem addresses and access all of the multiple memory locations which the multimem address points to.

Multimem addresses can only be accessed by multimem.* operations. Accessing a multimem address with ld, st or any other memory operation results in undefined behavior.

Refer to the CUDA programming guide for creation and management of multimem addresses.

multimem.ld_reduce, multimem.st, multimem.red

Perform memory operations on the multimem address.

Syntax

// Integer type:
multimem.ld_reduce{.ldsem}{.scope}{.ss}.op.type      d, [a];
multimem.ld_reduce.weak{.ss}.op.type                 d, [a];
multimem.st{.stsem}{.scope}{.ss}.type                [a], b;
multimem.st.weak{.ss}.type                           [a], b;
multimem.red{.redsem}{.scope}{.ss}.op.type           [a], b;

.ss =       { .global }
.ldsem =    { .relaxed, .acquire }
.stsem =    { .relaxed, .release }
.redsem =   { .relaxed, .release }
.scope =    { .cta, .cluster, .gpu, .sys }
.op  =      { .min, .max, .add, .and, .or, .xor }
.type =     { .b32, .b64, .u32, .u64, .s32, .s64 }

// Floating point type:
multimem.ld_reduce{.ldsem}{.scope}{.ss}.op{.acc_prec}{.vec}.type    d, [a];
multimem.ld_reduce.weak{.ss}.op{.acc_prec}{.vec}.type               d, [a];
multimem.st{.stsem}{.scope}{.ss}{.vec}.type                         [a], b;
multimem.st.weak{.ss}{.vec}.type                                    [a], b;
multimem.red{.redsem}{.scope}{.ss}.redop{.vec}.redtype              [a], b;

.ss =       { .global }
.ldsem =    { .relaxed, .acquire }
.stsem =    { .relaxed, .release }
.redsem =   { .relaxed, .release }
.scope =    { .cta, .cluster, .gpu, .sys }
.op  =      { .min, .max, .add }
.redop  =   { .add }
.acc_prec = { .acc::f32, .acc::f16 }
.vec =      { .v2, .v4, .v8 }
.type =     { .f16, .f16x2, .bf16, .bf16x2, .f32, .f64,
              .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4 }
.redtype =  { .f16, .f16x2, .bf16, .bf16x2, .f32, .f64 }

Description

Instruction multimem.ld_reduce performs the following operations:

  • a load operation on the multimem address a, which involves loading data from all of the multiple memory locations pointed to by the multimem address a,

  • the reduction operation specified by .op on the multiple data values loaded from the multimem address a.

The result of the reduction operation is returned in register d.

Instruction multimem.st performs a store operation of the input operand b to all the memory locations pointed to by the multimem address a.

Instruction multimem.red performs a reduction operation on all the memory locations pointed to by the multimem address a, with operand b.

Instruction multimem.ld_reduce performs the reduction on the values loaded from all the memory locations that the multimem address points to. In contrast, multimem.red performs the reduction on all the memory locations that the multimem address points to.
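The contrast between ld_reduce, st, and red can be illustrated with a small Python sketch. This is a hypothetical model, not PTX: a multimem address is represented as the list of memory locations it maps to, and the helper names (multimem_ld_reduce, multimem_st, multimem_red) are invented for illustration.

```python
# Hypothetical model of multimem semantics: 'locs' stands in for the set of
# memory locations a multimem address points to.
from functools import reduce

def multimem_ld_reduce(mem, locs, op):
    # Load from every location, reduce the loaded values,
    # and return the result (register d). Memory is unchanged.
    return reduce(op, (mem[l] for l in locs))

def multimem_st(mem, locs, b):
    # Store operand b to every location.
    for l in locs:
        mem[l] = b

def multimem_red(mem, locs, op, b):
    # Reduce operand b into every location; memory is updated in place.
    for l in locs:
        mem[l] = op(mem[l], b)

mem = {0x100: 3, 0x200: 5, 0x300: 7}   # e.g. one copy per GPU
locs = [0x100, 0x200, 0x300]

d = multimem_ld_reduce(mem, locs, lambda x, y: x + y)   # d is 3+5+7 = 15
multimem_red(mem, locs, lambda x, y: x + y, 10)         # every copy += 10
```

Note that ld_reduce leaves memory untouched and produces a register value, while red modifies every location and produces nothing.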

Address operand a must be a multimem address. Otherwise, the behavior is undefined. Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.

If no state space is specified then Generic Addressing is used. If the address specified by a does not fall within the address window of the .global state space then the behavior is undefined.

For floating-point multimem operations, the size of the specified type together with .vec must equal 32 bits, 64 bits, or 128 bits. No other combinations of .vec and type are allowed. Type .f64 cannot be used with the .vec qualifier. The following table describes the valid usage of .vec and base floating-point type:

.vec                Base float-type supported

No .vec specified   .f16x2, .bf16x2, .f32, .f64, .e5m2x4, .e4m3x4

.v2                 .f16, .f16x2, .bf16, .bf16x2, .f32, .e5m2x2, .e5m2x4, .e4m3x2, .e4m3x4

.v4                 .f16, .f16x2, .bf16, .bf16x2, .f32, .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4

.v8                 .f16, .bf16, .e5m2, .e4m3, .e5m2x2, .e4m3x2

The following table describes the valid combinations of .op and base type:

.op               Base type

.add              .u32, .u64, .s32, .f16, .f16x2, .bf16, .bf16x2, .f32, .f64, .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4

.and, .or, .xor   .b32, .b64

.min, .max        .u32, .s32, .u64, .s64, .f16, .f16x2, .bf16, .bf16x2, .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4

For multimem.ld_reduce, the default precision of the intermediate accumulation is the same as the specified type.

Optionally, the .acc_prec qualifier can be specified to change the precision of the intermediate accumulation as follows:

.type                                               .acc_prec    Changes precision to

.f16, .f16x2, .bf16, .bf16x2                        .acc::f32    .f32

.e5m2, .e4m3, .e5m2x2, .e4m3x2, .e4m3x4, .e5m2x4    .acc::f16    .f16

Optional qualifiers .ldsem, .stsem and .redsem specify the memory synchronizing effect of multimem.ld_reduce, multimem.st and multimem.red respectively, as described in Memory Consistency Model. If explicit semantics qualifiers are not specified, then multimem.ld_reduce and multimem.st default to .weak and multimem.red defaults to .relaxed.

The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in Memory Consistency Model. If the .scope qualifier is not specified for multimem.red then .sys scope is assumed by default.

PTX ISA Notes

Introduced in PTX ISA version 8.1.

Support for the .acc::f32 qualifier introduced in PTX ISA version 8.2.

Support for types .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4 introduced in PTX ISA version 8.6.

Support for the .acc::f16 qualifier introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_90 or higher.

Types .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4 are supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • sm_121a

  • And are supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_110f or higher in the same family

Qualifier .acc::f16 is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • sm_121a

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_110f or higher in the same family

Examples

multimem.ld_reduce.and.b32                    val1_b32, [addr1];
multimem.ld_reduce.acquire.gpu.global.add.u32 val2_u32, [addr2];

multimem.st.relaxed.gpu.b32                [addr3], val3_b32;
multimem.st.release.cta.global.u32         [addr4], val4_u32;

multimem.red.relaxed.gpu.max.f64           [addr5], val5_f64;
multimem.red.release.cta.global.add.v4.f32 [addr6], {val6, val7, val8, val9};

multimem.ld_reduce.add.acc::f32.v2.f16x2   {val_10, val_11}, [addr7];

multimem.ld_reduce.relaxed.cta.min.v2.e4m3x2 {val_12, val_13}, [addr8];
multimem.ld_reduce.relaxed.cta.add.v4.e4m3   {val_14, val_15, val_16, val_17}, [addr9];
multimem.ld_reduce.add.acc::f16.v4.e5m2      {val_18, val_19, val_20, val_21}, [addr10];

9.7.9.15. Data Movement and Conversion Instructions: prefetch, prefetchu

prefetch, prefetchu

Prefetch a line containing a generic address at a specified level of the memory hierarchy, in the specified state space.

Syntax

prefetch{.space}.level                    [a];   // prefetch to data cache
prefetch.global.level::eviction_priority  [a];   // prefetch to data cache

prefetchu.L1  [a];             // prefetch to uniform cache

prefetch{.tensormap_space}.tensormap [a];  // prefetch the tensormap

.space =                    { .global, .local };
.level =                    { .L1, .L2 };
.level::eviction_priority = { .L2::evict_last, .L2::evict_normal };
.tensormap_space =          { .const, .param };

Description

The prefetch instruction brings the cache line containing the specified address in the global or local memory state space into the specified cache level.

If the .tensormap qualifier is specified then the prefetch instruction brings the cache line containing the specified address in the .const or .param memory state space for subsequent use by the cp.async.bulk.tensor instruction.

If no state space is given, prefetch uses Generic Addressing.

Optionally, the eviction priority to be applied to the prefetched cache line can be specified by the modifier .level::eviction_priority.

Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.

The prefetchu instruction brings the cache line containing the specified generic address into the specified uniform cache level.

A prefetch to a shared memory location performs no operation.

A prefetch into the uniform cache requires a generic address, and no operation occurs if the address maps to a const, local, or shared memory location.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Support for the .level::eviction_priority qualifier introduced in PTX ISA version 7.4.

Support for the .tensormap qualifier introduced in PTX ISA version 8.0.

Target ISA Notes

prefetch and prefetchu require sm_20 or higher.

Support for the .level::eviction_priority qualifier requires sm_80 or higher.

Support for the .tensormap qualifier requires sm_90 or higher.

Examples

prefetch.global.L1             [ptr];
prefetch.global.L2::evict_last [ptr];
prefetchu.L1  [addr];
prefetch.const.tensormap       [ptr];

9.7.9.16. Data Movement and Conversion Instructions: applypriority

applypriority

Apply the cache eviction priority to the specified address in the specified cache level.

Syntax

applypriority{.global}.level::eviction_priority  [a], size;

.level::eviction_priority = { .L2::evict_normal };

Description

The applypriority instruction applies the cache eviction priority specified by the .level::eviction_priority qualifier to the address range [a..a+size) in the specified cache level.

If no state space is specified then Generic Addressing is used. If the specified address does not fall within the address window of the .global state space then the behavior is undefined.

The operand size is an integer constant that specifies the amount of data, in bytes, in the specified cache level on which the priority is to be applied. The only supported value for the size operand is 128.

Supported addressing modes for operand a are described in Addresses as Operands. a must be aligned to 128 bytes.

PTX ISA Notes

Introduced in PTX ISA version 7.4.

Target ISA Notes

Requires sm_80 or higher.

Examples

applypriority.global.L2::evict_normal [ptr], 128;

9.7.9.17. Data Movement and Conversion Instructions: discard

discard

Discard the data at the specified address range and cache level.

Syntax

discard{.global}.level  [a], size;

.level = { .L2 };

Description

Semantically, this behaves like a weak write of an unstable indeterminate value: reads of memory locations with unstable indeterminate values may return different bit patterns each time until the memory is overwritten. This operation hints to the implementation that data in the specified cache .level can be destructively discarded without writing it back to memory.

The operand size is an integer constant that specifies the length in bytes of the address range [a, a+size) to write unstable indeterminate values into. The only supported value for the size operand is 128.

If no state space is specified then Generic Addressing is used. If the specified address does not fall within the address window of the .global state space then the behavior is undefined.

Supported addressing modes for address operand a are described in Addresses as Operands. a must be aligned to 128 bytes.

PTX ISA Notes

Introduced in PTX ISA version 7.4.

Target ISA Notes

Requires sm_80 or higher.

Examples

discard.global.L2 [ptr], 128;
ld.weak.u32 r0, [ptr];
ld.weak.u32 r1, [ptr];
// The values in r0 and r1 may differ!

9.7.9.18. Data Movement and Conversion Instructions: createpolicy

createpolicy

Create a cache eviction policy for the specified cache level.

Syntax

// Range-based policy
createpolicy.range{.global}.level::primary_priority{.level::secondary_priority}.b64
                                   cache-policy, [a], primary-size, total-size;

// Fraction-based policy
createpolicy.fractional.level::primary_priority{.level::secondary_priority}.b64
                                   cache-policy{, fraction};

// Converting the access property from CUDA APIs
createpolicy.cvt.L2.b64            cache-policy, access-property;

.level::primary_priority =   { .L2::evict_last, .L2::evict_normal,
                               .L2::evict_first, .L2::evict_unchanged };
.level::secondary_priority = { .L2::evict_first, .L2::evict_unchanged };

Description

The createpolicy instruction creates a cache eviction policy for the specified cache level in an opaque 64-bit register specified by the destination operand cache-policy. The cache eviction policy specifies how cache eviction priorities are applied to global memory addresses used in memory operations with the .level::cache_hint qualifier.

There are two types of cache eviction policies:

  • Range-based policy

    The cache eviction policy created using createpolicy.range specifies the cache eviction behaviors for the following three address ranges:

    • [a..a+(primary-size-1)], referred to as the primary range.

    • [a+primary-size..a+(total-size-1)], referred to as the trailing secondary range.

    • [a-(total-size-primary-size)..(a-1)], referred to as the preceding secondary range.

    When a range-based cache eviction policy is used in a memory operation with the .level::cache_hint qualifier, the eviction priorities are applied as follows:

    • If the memory address falls in the primary range, the eviction priority specified by .L2::primary_priority is applied.

    • If the memory address falls in either of the secondary ranges, the eviction priority specified by .L2::secondary_priority is applied.

    • If the memory address does not fall in any of the above ranges, then the applied eviction priority is unspecified.

    The 32-bit operand primary-size specifies the size, in bytes, of the primary range. The 32-bit operand total-size specifies the combined size, in bytes, of the address range including the primary and secondary ranges. The value of primary-size must be less than or equal to the value of total-size. The maximum allowed value of total-size is 4 GB.

    If .L2::secondary_priority is not specified, then it defaults to .L2::evict_unchanged.

    If no state space is specified then Generic Addressing is used. If the specified address does not fall within the address window of the .global state space then the behavior is undefined.

  • Fraction-based policy

    A memory operation with the .level::cache_hint qualifier can use the fraction-based cache eviction policy to request that the cache eviction priority specified by .L2::primary_priority be applied to a fraction of cache accesses specified by the 32-bit floating point operand fraction. The remainder of the cache accesses get the eviction priority specified by .L2::secondary_priority. This implies that in a memory operation that uses a fraction-based cache policy, the memory access has a probability specified by the operand fraction of getting the cache eviction priority specified by .L2::primary_priority.

    The valid range of values for the operand fraction is (0.0, 1.0]. If the operand fraction is not specified, it defaults to 1.0.

    If .L2::secondary_priority is not specified, then it defaults to .L2::evict_unchanged.

The access property created using the CUDA APIs can be converted into a cache eviction policy by the instruction createpolicy.cvt. The source operand access-property is a 64-bit opaque register. Refer to the CUDA programming guide for more details.
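The range arithmetic behind a range-based policy can be sketched in Python. This is a hypothetical illustration, not PTX or CUDA API code; the helper name classify is invented, and it only models which of the three ranges an address falls into.

```python
# Hypothetical model: given base address a, primary-size and total-size,
# report which createpolicy.range region an address belongs to.
def classify(addr, a, primary_size, total_size):
    secondary_size = total_size - primary_size
    if a <= addr <= a + primary_size - 1:
        return "primary"              # gets .L2::primary_priority
    if a + primary_size <= addr <= a + total_size - 1:
        return "trailing-secondary"   # gets .L2::secondary_priority
    if a - secondary_size <= addr <= a - 1:
        return "preceding-secondary"  # gets .L2::secondary_priority
    return "unspecified"              # eviction priority is unspecified

# With primary-size 0x100000 and total-size 0x200000:
base = 0x10000000
r1 = classify(base, base, 0x100000, 0x200000)             # primary
r2 = classify(base + 0x180000, base, 0x100000, 0x200000)  # trailing-secondary
r3 = classify(base - 1, base, 0x100000, 0x200000)         # preceding-secondary
```

The preceding secondary range sits immediately below a and has the same size as the trailing one, total-size minus primary-size.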

PTX ISA Notes

Introduced in PTX ISA version 7.4.

Target ISA Notes

Requiressm_80 or higher.

Examples

createpolicy.fractional.L2::evict_last.b64                      policy, 1.0;
createpolicy.fractional.L2::evict_last.L2::evict_unchanged.b64  policy, 0.5;

createpolicy.range.L2::evict_last.L2::evict_first.b64
                                  policy, [ptr], 0x100000, 0x200000;

// access-prop is created by CUDA APIs.
createpolicy.cvt.L2.b64 policy, access-prop;

9.7.9.19. Data Movement and Conversion Instructions: isspacep

isspacep

Query whether a generic address falls within a specified state space window.

Syntax

isspacep.space  p, a;    // result is .pred

.space = { .const, .global, .local, .shared{::cta, ::cluster}, .param{::entry} };

Description

Write predicate register p with 1 if the generic address a falls within the specified state space window and with 0 otherwise. Destination p has type .pred; the source address operand must be of type .u32 or .u64.

isspacep.param{::entry} returns 1 if the generic address falls within the window of Kernel Function Parameters, otherwise returns 0. If .param is specified without any sub-qualifiers then it defaults to .param::entry.

isspacep.global returns 1 for Kernel Function Parameters as the .param window is contained within the .global window.

If no sub-qualifier is specified with the .shared state space, then ::cta is assumed by default.

Note

isspacep.shared::cluster will return 1 for every shared memory address that is accessible to the threads in the cluster, whereas isspacep.shared::cta will return 1 only if the address is of a variable declared in the executing CTA.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

isspacep.const introduced in PTX ISA version 3.1.

isspacep.param introduced in PTX ISA version 7.7.

Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.

Support for sub-qualifier ::entry on the .param space introduced in PTX ISA version 8.3.

Target ISA Notes

isspacep requires sm_20 or higher.

isspacep.param{::entry} requires sm_70 or higher.

Sub-qualifier ::cta requires sm_30 or higher.

Sub-qualifier ::cluster requires sm_90 or higher.

Examples

isspacep.const           iscnst, cptr;
isspacep.global          isglbl, gptr;
isspacep.local           islcl,  lptr;
isspacep.shared          isshrd, sptr;
isspacep.param::entry    isparam, pptr;
isspacep.shared::cta     isshrdcta, sptr;
isspacep.shared::cluster isshrdany, sptr;

9.7.9.20. Data Movement and Conversion Instructions: cvta

cvta

Convert an address from the .const, Kernel Function Parameters (.param), .global, .local, or .shared state space to generic, or vice versa. Take the generic address of a variable declared in the .const, Kernel Function Parameters (.param), .global, .local, or .shared state space.

Syntax

// convert const, global, local, or shared address to generic address
cvta.space.size  p, a;        // source address in register a
cvta.space.size  p, var;      // get generic address of var
cvta.space.size  p, var+imm;  // generic address of var+offset

// convert generic address to const, global, local, or shared address
cvta.to.space.size  p, a;

.space = { .const, .global, .local, .shared{::cta, ::cluster}, .param{::entry} };
.size  = { .u32, .u64 };

Description

Convert a const, Kernel Function Parameters (.param), global, local, or shared address to a generic address, or vice versa. The source and destination addresses must be the same size. Use cvt.u32.u64 or cvt.u64.u32 to truncate or zero-extend addresses.

For variables declared in the .const, Kernel Function Parameters (.param), .global, .local, or .shared state space, the generic address of the variable may be taken using cvta. The source is either a register or a variable defined in const, Kernel Function Parameters (.param), global, local, or shared memory with an optional offset.

When converting a generic address into a const, Kernel Function Parameters (.param), global, local, or shared address, the resulting address is undefined in cases where the generic address does not fall within the address window of the specified state space. A program may use isspacep to guard against such incorrect behavior.

For cvta with the .shared state space, the address must belong to the space specified by the ::cta or ::cluster sub-qualifier, otherwise the behavior is undefined. If no sub-qualifier is specified with the .shared state space, then ::cta is assumed by default.

If .param is specified without any sub-qualifiers then it defaults to .param::entry. For the .param{::entry} state space, operand a must be a kernel parameter address, otherwise the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

cvta.const and cvta.to.const introduced in PTX ISA version 3.1.

cvta.param and cvta.to.param introduced in PTX ISA version 7.7.

Note: The current implementation does not allow generic pointers to const space variables in programs that contain pointers to constant buffers passed as kernel parameters.

Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.

Support for sub-qualifier ::entry on the .param space introduced in PTX ISA version 8.3.

Target ISA Notes

cvta requires sm_20 or higher.

cvta.param{::entry} and cvta.to.param{::entry} require sm_70 or higher.

Sub-qualifier ::cta requires sm_30 or higher.

Sub-qualifier ::cluster requires sm_90 or higher.

Examples

cvta.const.u32   ptr, cvar;
cvta.local.u32   ptr, lptr;
cvta.shared::cta.u32  p, As+4;
cvta.shared::cluster.u32 ptr, As;
cvta.to.global.u32  p, gptr;
cvta.param.u64   ptr, pvar;
cvta.to.param::entry.u64  epptr, ptr;

9.7.9.21. Data Movement and Conversion Instructions: cvt

cvt

Convert a value from one type to another.

Syntax

cvt{.irnd}{.ftz}{.sat}.dtype.atype         d, a;  // integer rounding
cvt{.frnd}{.ftz}{.sat}.dtype.atype         d, a;  // fp rounding
cvt.frnd2{.relu}{.satfinite}.f16.f32       d, a;
cvt.frnd2{.relu}{.satfinite}.f16x2.f32     d, a, b;
cvt.rs{.relu}{.satfinite}.f16x2.f32        d, a, b, rbits;
cvt.frnd2{.relu}{.satfinite}.bf16.f32      d, a;
cvt.frnd2{.relu}{.satfinite}.bf16x2.f32    d, a, b;
cvt.rs{.relu}{.satfinite}.bf16x2.f32       d, a, b, rbits;
cvt.rna{.satfinite}.tf32.f32               d, a;
cvt.frnd2{.satfinite}{.relu}.tf32.f32      d, a;
cvt.rn.satfinite{.relu}.f8x2type.f32       d, a, b;
cvt.rn.satfinite{.relu}.f8x2type.f16x2     d, a;
cvt.rn{.relu}.f16x2.f8x2type               d, a;
cvt.rs{.relu}.satfinite.f8x4type.f32       d, {a, b, e, f}, rbits;
cvt.rn.satfinite{.relu}.f4x2type.f32       d, a, b;
cvt.rn{.relu}.f16x2.f4x2type               d, a;
cvt.rs{.relu}.satfinite.f4x4type.f32       d, {a, b, e, f}, rbits;
cvt.rn.satfinite{.relu}.f6x2type.f32       d, a, b;
cvt.rn{.relu}.f16x2.f6x2type               d, a;
cvt.rs{.relu}.satfinite.f6x4type.f32       d, {a, b, e, f}, rbits;
cvt.frnd3{.satfinite}.ue8m0x2.f32          d, a, b;
cvt.frnd3{.satfinite}.ue8m0x2.bf16x2       d, a;
cvt.rn.bf16x2.ue8m0x2                      d, a;

.irnd   = { .rni, .rzi, .rmi, .rpi };
.frnd   = { .rn,  .rz,  .rm,  .rp  };
.frnd2  = { .rn,  .rz };
.frnd3  = { .rz,  .rp };
.dtype = .atype = { .u8,   .u16, .u32, .u64,
                    .s8,   .s16, .s32, .s64,
                    .bf16, .f16, .f32, .f64 };
.f8x2type = { .e4m3x2, .e5m2x2 };
.f4x2type = { .e2m1x2 };
.f6x2type = { .e2m3x2, .e3m2x2 };
.f4x4type = { .e2m1x4 };
.f8x4type = { .e4m3x4, .e5m2x4 };
.f6x4type = { .e2m3x4, .e3m2x4 };

Description

Convert between different types and sizes.

For the .f16x2 and .bf16x2 instruction types, two inputs a and b of .f32 type are converted into .f16 or .bf16 type and the converted values are packed in the destination register d, such that the value converted from input a is stored in the upper half of d and the value converted from input b is stored in the lower half of d.

For the .f16x2 instruction type, destination operand d has .f16x2 or .b32 type. For the .bf16 instruction type, operand d has .b16 type. For the .bf16x2 instruction type, operand d has .b32 type. For the .tf32 instruction type, operand d has .b32 type.

When converting to the .e4m3x2/.e5m2x2 data formats, the destination operand d has .b16 type. When converting two .f32 inputs to .e4m3x2/.e5m2x2, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the value converted from input a is stored in the upper 8 bits of d and the value converted from input b is stored in the lower 8 bits of d. When converting an .f16x2 input to .e4m3x2/.e5m2x2, each .f16 input from operand a is converted to the specified format. The converted values are packed in the destination operand d such that the value converted from the upper 16 bits of input a is stored in the upper 8 bits of d and the value converted from the lower 16 bits of input a is stored in the lower 8 bits of d.

When converting from .e4m3x2/.e5m2x2 to .f16x2, source operand a has .b16 type. Each 8-bit input value in operand a is converted to .f16 type. The converted values are packed in the destination operand d such that the value converted from the upper 8 bits of a is stored in the upper 16 bits of d and the value converted from the lower 8 bits of a is stored in the lower 16 bits of d.

When converting to the .e2m1x2 data format, the destination operand d has .b8 type. When converting two .f32 inputs to .e2m1x2, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the value converted from input a is stored in the upper 4 bits of d and the value converted from input b is stored in the lower 4 bits of d.

When converting from .e2m1x2 to .f16x2, source operand a has .b8 type. Each 4-bit input value in operand a is converted to .f16 type. The converted values are packed in the destination operand d such that the value converted from the upper 4 bits of a is stored in the upper 16 bits of d and the value converted from the lower 4 bits of a is stored in the lower 16 bits of d.

When converting to the .e2m1x4 data format, the destination operand d has .b16 type. When converting four .f32 inputs to .e2m1x4, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the values converted from inputs a, b, e, f are stored in successive 4-bit fields starting from the upper bits of d.

When converting to the .e2m3x2/.e3m2x2 data formats, the destination operand d has .b16 type. When converting two .f32 inputs to .e2m3x2/.e3m2x2, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the value converted from input a is stored in the upper 8 bits of d, with the 2 MSB bits padded with zeros, and the value converted from input b is stored in the lower 8 bits of d, with the 2 MSB bits padded with zeros.

When converting from .e2m3x2/.e3m2x2 to .f16x2, source operand a has .b16 type. Each 8-bit input value, whose 2 MSB bits are 0, in operand a is converted to .f16 type. The converted values are packed in the destination operand d such that the value converted from the upper 8 bits of a is stored in the upper 16 bits of d and the value converted from the lower 8 bits of a is stored in the lower 16 bits of d.

When converting to the .e5m2x4/.e4m3x4/.e3m2x4/.e2m3x4 data formats, the destination operand d has .b32 type. When converting four .f32 inputs to .e5m2x4/.e4m3x4/.e3m2x4/.e2m3x4, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the values converted from inputs a, b, e, f are stored in successive 8-bit fields starting from the upper bits of d. For .e3m2x4/.e2m3x4, each 8-bit output has its 2 MSB bits padded with zeros.

When converting to the .ue8m0x2 data format, the destination operand d has .b16 type. When converting two .f32 or two packed .bf16 inputs to .ue8m0x2, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the value converted from input a is stored in the upper 8 bits of d and the value converted from input b is stored in the lower 8 bits of d.

When converting from .ue8m0x2 to .bf16x2, source operand a has .b16 type. Each 8-bit input value in operand a is converted to .bf16 type. The converted values are packed in the destination operand d such that the value converted from the upper 8 bits of a is stored in the upper 16 bits of d and the value converted from the lower 8 bits of a is stored in the lower 16 bits of d.

rbits is a .b32 type register operand used to provide random bits for the .rs rounding mode.

When converting to .f16x2, two 16-bit values are provided from rbits: the 13 LSBs of the upper 16 bits (with the 3 MSBs being 0) are used as random bits for operand a, and the 13 LSBs of the lower 16 bits (with the 3 MSBs being 0) are used as random bits for operand b.

When converting to .bf16x2, two 16-bit values are provided from rbits: the upper 16 bits are used as random bits for operand a and the lower 16 bits are used as random bits for operand b.

When converting to .e4m3x4/.e5m2x4/.e2m3x4/.e3m2x4, two 16-bit values are provided from rbits: the lower 16 bits are used for operands e, f and the upper 16 bits are used for operands a, b.

When converting to .e2m1x4, two 16-bit values are provided from rbits: the lower 8 bits of each 16-bit half of rbits are used for operands e, f and the upper 8 bits of each 16-bit half of rbits are used for operands a, b.

A rounding modifier is mandatory in all of the following cases:

  • float-to-float conversions, when the destination type is smaller than the source type

  • All float-to-int conversions

  • All int-to-float conversions

  • All conversions involving the .f16x2, .e4m3x2, .e5m2x2, .bf16x2, .tf32, .e2m1x2, .e2m3x2, .e3m2x2, .e4m3x4, .e5m2x4, .e2m1x4, .e2m3x4, .e3m2x4 and .ue8m0x2 instruction types.

The .satfinite modifier is only supported for conversions involving the following types:

  • .e4m3x2, .e5m2x2, .e2m1x2, .e2m3x2, .e3m2x2, .e4m3x4, .e5m2x4, .e2m1x4, .e2m3x4, .e3m2x4 destination types. The .satfinite modifier is mandatory for such conversions.

  • .f16, .bf16, .f16x2, .bf16x2, .tf32, .ue8m0x2 as destination types.

Semantics

if (/* inst type is .f16x2 or .bf16x2 */) {
    d[31:16] = convert(a);
    d[15:0]  = convert(b);
} else if (/* inst destination type is .e5m2x2 or .e4m3x2 or .ue8m0x2 */) {
    d[15:8] = convert(a);
    d[7:0]  = convert(b);
} else if (/* inst destination type is .e2m1x2 */) {
    d[7:4] = convert(a);
    d[3:0] = convert(b);
} else if (/* inst destination type is .e2m3x2 or .e3m2x2 */) {
    d[15:14] = 0;
    d[13:8]  = convert(a);
    d[7:6]   = 0;
    d[5:0]   = convert(b);
} else if (/* inst destination type is .e2m1x4 */) {
    d[15:12] = convert(a);
    d[11:8]  = convert(b);
    d[7:4]   = convert(e);
    d[3:0]   = convert(f);
} else if (/* inst destination type is .e4m3x4 or .e5m2x4 */) {
    d[31:24] = convert(a);
    d[23:16] = convert(b);
    d[15:8]  = convert(e);
    d[7:0]   = convert(f);
} else if (/* inst destination type is .e2m3x4 or .e3m2x4 */) {
    d[31:30] = 0;
    d[29:24] = convert(a);
    d[23:22] = 0;
    d[21:16] = convert(b);
    d[15:14] = 0;
    d[13:8]  = convert(e);
    d[7:6]   = 0;
    d[5:0]   = convert(f);
} else {
    d = convert(a);
}
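The two-element packing branches can be checked with a small Python sketch. This is a hypothetical illustration, not PTX: the helper pack2 is invented, and convert() is assumed to have already produced the raw 16-bit or 8-bit payloads.

```python
# Hypothetical model of the packing step for a two-element destination
# such as .f16x2 (16-bit halves) or .e4m3x2 (8-bit halves):
# d[2w-1:w] = convert(a), d[w-1:0] = convert(b).
def pack2(a_bits, b_bits, width):
    mask = (1 << width) - 1
    return ((a_bits & mask) << width) | (b_bits & mask)

# .f16x2: two 16-bit payloads packed into a 32-bit d
d32 = pack2(0x3C00, 0xC000, 16)   # a in upper half, b in lower half

# .e4m3x2: two 8-bit payloads packed into a 16-bit d
d16 = pack2(0x40, 0xC0, 8)
```

The same pattern extends to the four-element forms, which pack a, b, e, f into successive fields starting from the upper bits of d.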

Random bits (rbits) semantics for .rs rounding:

  1. Destination type .f16: Refer to Figure 38 for random bits layout details.

     Figure 38: Random bits layout for .rs rounding with .f16 destination type

  2. Destination type .bf16: Refer to Figure 39 for random bits layout details.

     Figure 39: Random bits layout for .rs rounding with .bf16 destination type

  3. Destination type .e2m1x4: Refer to Figure 40 for random bits layout details.

     Figure 40: Random bits layout for .rs rounding with .e2m1x4 destination type

  4. Destination types .e5m2x4, .e4m3x4, .e3m2x4, .e2m3x4: Refer to Figure 41 for random bits layout details.

     Figure 41: Random bits layout for .rs rounding with .e5m2x4/.e4m3x4/.e3m2x4/.e2m3x4 destination type

Integer Notes

Integer rounding is required for float-to-integer conversions, and for same-size float-to-float conversions where the value is rounded to an integer. Integer rounding is illegal in all other instances.

Integer rounding modifiers:

.rni

round to nearest integer, choosing even integer if source is equidistant between two integers

.rzi

round to nearest integer in the direction of zero

.rmi

round to nearest integer in direction of negative infinity

.rpi

round to nearest integer in direction of positive infinity

In float-to-integer conversions, depending upon the conversion types, a NaN input results in the following value:

  1. Zero, if the source is not .f64 and the destination is not .s64 or .u64.

  2. Otherwise, 1 << (BitWidth(dst) - 1), corresponding to the value (MAXINT >> 1) + 1 for unsigned types or MININT for signed types.
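These two NaN rules can be spelled out numerically in a short Python sketch. This is a hypothetical illustration of the arithmetic only, not PTX; the helper name nan_result is invented.

```python
# Hypothetical model of the NaN result value for float-to-integer cvt.
def nan_result(dst_bits, signed, wide_case):
    # wide_case: the source is .f64, or the destination is .s64/.u64
    if not wide_case:
        return 0
    v = 1 << (dst_bits - 1)       # the bit pattern 1 << (BitWidth(dst) - 1)
    # Interpreted unsigned this is (MAXINT >> 1) + 1; interpreted signed, MININT.
    return -v if signed else v

r_s32 = nan_result(32, signed=True,  wide_case=False)  # 0 (rule 1)
r_u64 = nan_result(64, signed=False, wide_case=True)   # 2**63 (rule 2)
r_s64 = nan_result(64, signed=True,  wide_case=True)   # -(2**63), i.e. MININT
```

For example, cvt.rzi.s64.f32 of a NaN yields the .s64 MININT, while cvt.rzi.s32.f32 of a NaN yields zero.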

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported.

For cvt.ftz.dtype.f32 float-to-integer conversions and cvt.ftz.f32.f32 float-to-float conversions with integer rounding, subnormal inputs are flushed to sign-preserving zero. Modifier .ftz can only be specified when either .dtype or .atype is .f32 and applies only to single precision (.f32) inputs and results.

sm_1x

For cvt.ftz.dtype.f32 float-to-integer conversions and cvt.ftz.f32.f32 float-to-float conversions with integer rounding, subnormal inputs are flushed to sign-preserving zero. The optional .ftz modifier may be specified in these cases for clarity.

Note: In PTX ISA versions 1.4 and earlier, the cvt instruction did not flush single-precision subnormal inputs or results to zero if the destination type size was 64 bits. The compiler will preserve this behavior for legacy PTX code.

Saturation modifier:

.sat

For integer destination types, .sat limits the result to MININT..MAXINT for the size of the operation. Note that saturation applies to both signed and unsigned integer types.

The saturation modifier is allowed only in cases where the destination type's value range is not a superset of the source type's value range; i.e., the .sat modifier is illegal in cases where saturation is not possible based on the source and destination types.

For float-to-integer conversions, the result is clamped to the destination range by default; i.e., .sat is redundant.

Floating Point Notes

Floating-point rounding is required for float-to-float conversions that result in loss of precision, and for integer-to-float conversions. Floating-point rounding is illegal in all other instances.

Floating-point rounding modifiers:

.rn

rounding to nearest, with ties to even

.rna

rounding to nearest, with ties away from zero

.rz

rounding toward zero

.rm

rounding toward negative infinity

.rp

rounding toward positive infinity

.rs

Stochastic rounding using the supplied random bits. The operation's result is rounded toward zero or away from zero based on the carry out of the integer addition of the supplied random bits (rbits) to the truncated (discarded) mantissa bits of the input.
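The carry-based rounding decision can be sketched as follows. `rs_round` is a hypothetical helper operating on a raw unsigned mantissa, shown only to illustrate how the carry out of adding the random bits to the discarded bits selects between truncation and rounding away from zero; taking only the low discarded-width bits of rbits is a modeling assumption:

```python
def rs_round(mantissa_bits: int, keep: int, total: int, rbits: int) -> int:
    """Round an unsigned total-bit mantissa down to keep bits using
    .rs-style stochastic rounding (illustrative sketch)."""
    discard = total - keep
    truncated = mantissa_bits >> discard                  # truncate toward zero
    discarded_bits = mantissa_bits & ((1 << discard) - 1) # the bits thrown away
    # Carry out of (discarded bits + random bits) decides rounding away from zero.
    carry = (discarded_bits + (rbits & ((1 << discard) - 1))) >> discard
    return truncated + carry
```

With large discarded bits, a small random value already produces a carry, so inputs closer to the next representable value round up more often, which is the intent of stochastic rounding.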

A floating-point value may be rounded to an integral value using the integer rounding modifiers (see Integer Notes). The operands must be of the same size. The result is an integral value, stored in floating-point format.

Subnormal numbers:

sm_20+

By default, subnormal numbers are supported. Modifier .ftz may be specified to flush single-precision subnormal inputs and results to sign-preserving zero. Modifier .ftz can only be specified when either .dtype or .atype is .f32 and applies only to single-precision (.f32) inputs and results.

sm_1x

Single-precision subnormal inputs and results are flushed to sign-preserving zero. The optional .ftz modifier may be specified in these cases for clarity.

Note: In PTX ISA versions 1.4 and earlier, the cvt instruction did not flush single-precision subnormal inputs or results to zero if either the source or destination type was .f64. The compiler will preserve this behavior for legacy PTX code. Specifically, if the PTX ISA version is 1.4 or earlier, single-precision subnormal inputs and results are flushed to sign-preserving zero only for cvt.f32.f16, cvt.f16.f32, and cvt.f32.f32 instructions.

Saturation modifier:

.sat:

For floating-point destination types, .sat limits the result to the range [0.0, 1.0]. NaN results are flushed to positive zero. Applies to .f16, .f32, and .f64 types.

.relu:

For .f16, .f16x2, .bf16, .bf16x2, .e4m3x2, .e5m2x2, .e2m1x2, .e2m3x2, .e3m2x2, .e4m3x4, .e5m2x4, .e2m1x4, .e2m3x4, .e3m2x4, and .tf32 destination types, .relu clamps the result to 0 if negative. NaN results are converted to canonical NaN.

.satfinite:

For .f16, .f16x2, .bf16, .bf16x2, .e4m3x2, .e5m2x2, .ue8m0x2, .e4m3x4, .e5m2x4, and .tf32 destination formats, if the input value is NaN, then the result is NaN in the specified destination format. For .e2m1x2, .e2m3x2, .e3m2x2, .e2m1x4, .e2m3x4, and .e3m2x4 destination formats, NaN results are converted to positive MAX_NORM. If the absolute value of the input (ignoring sign) is greater than MAX_NORM of the specified destination format, then the result is the sign-preserved MAX_NORM of the destination format, or a positive MAX_NORM in .ue8m0x2, for which a destination sign is not supported.
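A minimal Python sketch of the three floating-point saturation modes, assuming a destination format with a representable NaN and a known MAX_NORM passed in as a parameter (helper names are hypothetical; formats whose NaN maps to MAX_NORM are not modeled):

```python
import math

def sat(x: float) -> float:
    """.sat: clamp to [0.0, 1.0]; NaN flushes to positive zero."""
    if math.isnan(x):
        return 0.0
    return min(max(x, 0.0), 1.0)

def relu(x: float) -> float:
    """.relu: clamp negative results to 0; NaN becomes canonical NaN."""
    if math.isnan(x):
        return math.nan
    return max(x, 0.0)

def satfinite(x: float, max_norm: float) -> float:
    """.satfinite: clamp |x| to MAX_NORM while preserving the sign
    (for destination formats where NaN stays NaN)."""
    if math.isnan(x):
        return math.nan
    return math.copysign(min(abs(x), max_norm), x)
```

For instance, with the e4m3 MAX_NORM of 448, `satfinite(1e10, 448.0)` yields 448.0 and `satfinite(-1e10, 448.0)` yields -448.0.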

Notes

A source register wider than the specified type may be used, except when the source operand has .bf16 or .bf16x2 format. The lower n bits corresponding to the instruction-type width are used in the conversion. See Operand Size Exceeding Instruction-Type Size for a description of these relaxed type-checking rules.

A destination register wider than the specified type may be used, except when the destination operand has .bf16, .bf16x2, or .tf32 format. The result of conversion is sign-extended to the destination register width for signed integers, and is zero-extended to the destination register width for unsigned, bit-size, and floating-point types. See Operand Size Exceeding Instruction-Type Size for a description of these relaxed type-checking rules.

For cvt.f32.bf16, NaN input yields an unspecified NaN.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

.relu modifier and {.f16x2, .bf16, .bf16x2, .tf32} destination formats introduced in PTX ISA version 7.0.

cvt.f32.bf16 introduced in PTX ISA version 7.1.

cvt.bf16.{u8/s8/u16/s16/u32/s32/u64/s64/f16/f64/bf16}, cvt.{u8/s8/u16/s16/u32/s32/u64/s64/f16/f64}.bf16, and cvt.tf32.f32.{relu}.{rn/rz} introduced in PTX ISA version 7.8.

.ftz qualifier for cvt.f32.bf16 introduced in PTX ISA version 7.8.

cvt with .e4m3x2/.e5m2x2 for sm_90 or higher introduced in PTX ISA version 7.8.

cvt.satfinite.{e4m3x2,e5m2x2}.{f32,f16x2} for sm_90 or higher introduced in PTX ISA version 7.8.

cvt with .e4m3x2/.e5m2x2 for sm_89 introduced in PTX ISA version 8.1.

cvt.satfinite.{e4m3x2,e5m2x2}.{f32,f16x2} for sm_89 introduced in PTX ISA version 8.1.

cvt.satfinite.{f16,bf16,f16x2,bf16x2,tf32}.f32 introduced in PTX ISA version 8.1.

cvt.{rn/rz}.satfinite.tf32.f32 introduced in PTX ISA version 8.6.

cvt.rn.satfinite{.relu}.{e2m1x2/e2m3x2/e3m2x2/ue8m0x2}.f32 introduced in PTX ISA version 8.6.

cvt.rn{.relu}.f16x2.{e2m1x2/e2m3x2/e3m2x2} introduced in PTX ISA version 8.6.

cvt.{rp/rz}{.satfinite}{.relu}.ue8m0x2.bf16x2 introduced in PTX ISA version 8.6.

cvt.{rz/rp}.satfinite.ue8m0x2.f32 introduced in PTX ISA version 8.6.

cvt.rn.bf16x2.ue8m0x2 introduced in PTX ISA version 8.6.

.rs rounding mode introduced in PTX ISA version 8.7.

cvt.rs{.e2m1x4/.e4m3x4/.e5m2x4/.e3m2x4/.e2m3x4}.f32 introduced in PTX ISA version 8.7.

Target ISA Notes

cvt to or from .f64 requires sm_13 or higher.

.relu modifier and {.f16x2, .bf16, .bf16x2, .tf32} destination formats require sm_80 or higher.

cvt.f32.bf16 requires sm_80 or higher.

cvt.bf16.{u8/s8/u16/s16/u32/s32/u64/s64/f16/f64/bf16}, cvt.{u8/s8/u16/s16/u32/s32/u64/s64/f16/f64}.bf16, and cvt.tf32.f32.{relu}.{rn/rz} require sm_90 or higher.

.ftz qualifier for cvt.f32.bf16 requires sm_90 or higher.

cvt with .e4m3x2/.e5m2x2 requires sm_89 or higher.

cvt.satfinite.{e4m3x2,e5m2x2}.{f32,f16x2} requires sm_89 or higher.

cvt.{rn/rz}.satfinite.tf32.f32 requires sm_100 or higher.

cvt.rn.satfinite{.relu}.{e2m1x2/e2m3x2/e3m2x2/ue8m0x2}.f32 is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

cvt.rn{.relu}.f16x2.{e2m1x2/e2m3x2/e3m2x2} is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

cvt.{rz/rp}{.satfinite}{.relu}.ue8m0x2.bf16x2 is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

cvt.{rz/rp}.satfinite.ue8m0x2.f32 is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

cvt.rn.bf16x2.ue8m0x2 is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

.rs rounding mode is supported on the following architectures:

  • sm_100a

  • sm_103a

cvt.rs{.e2m1x4/.e4m3x4/.e5m2x4/.e3m2x4/.e2m3x4}.f32 is supported on the following architectures:

  • sm_100a

  • sm_103a

Examples

cvt.f32.s32 f,i;
cvt.s32.f64 j,r;     // float-to-int saturates by default
cvt.rni.f32.f32 x,y; // round to nearest int, result is fp
cvt.f32.f32 x,y;     // note .ftz behavior for sm_1x targets
cvt.rn.relu.f16.f32      b, f;        // result is saturated with .relu saturation mode
cvt.rz.f16x2.f32         b1, f, f1;   // convert two fp32 values to packed fp16 outputs
cvt.rn.relu.satfinite.f16x2.f32    b1, f, f1;   // convert two fp32 values to packed fp16 outputs with .relu saturation on each output
cvt.rn.bf16.f32          b, f;        // convert fp32 to bf16
cvt.rz.relu.satfinite.bf16.f32     b, f;        // convert fp32 to bf16 with .relu and .satfinite saturation
cvt.rz.satfinite.bf16x2.f32        b1, f, f1;   // convert two fp32 values to packed bf16 outputs
cvt.rn.relu.bf16x2.f32   b1, f, f1;   // convert two fp32 values to packed bf16 outputs with .relu saturation on each output
cvt.rna.satfinite.tf32.f32         b1, f;       // convert fp32 to tf32 format
cvt.rn.relu.tf32.f32     d, a;        // convert fp32 to tf32 format
cvt.f64.bf16.rp          f, b;        // convert bf16 to f64 format
cvt.bf16.f16.rz          b, f;        // convert f16 to bf16 format
cvt.bf16.u64.rz          b, u;        // convert u64 to bf16 format
cvt.s8.bf16.rpi          s, b;        // convert bf16 to s8 format
cvt.bf16.bf16.rpi        b1, b2;      // convert bf16 to corresponding int represented in bf16 format
cvt.rn.satfinite.e4m3x2.f32 d, a, b;  // convert a, b to .e4m3 and pack as .e4m3x2 output
cvt.rn.relu.satfinite.e5m2x2.f16x2 d, a; // unpack a and convert the values to .e5m2 outputs with .relu
                                         // saturation on each output and pack as .e5m2x2
cvt.rn.f16x2.e4m3x2 d, a;             // unpack a, convert two .e4m3 values to packed f16x2 output
cvt.rn.satfinite.tf32.f32 d, a;       // convert fp32 to tf32 format
cvt.rn.relu.f16x2.e2m1x2 d, a;        // unpack a, convert two .e2m1 values to packed f16x2 output
cvt.rn.satfinite.e2m3x2.f32 d, a, b;  // convert a, b to .e2m3 and pack as .e2m3x2 output
cvt.rn.relu.f16x2.e3m2x2 d, a;        // unpack a, convert two .e3m2 values to packed f16x2 output
cvt.rs.f16x2.f32    d, a, b, rbits;   // convert 2 fp32 values to packed fp16 applying .rs rounding
cvt.rs.satfinite.e2m1x4.f32  d, {a, b, e, f}, rbits; // convert 4 fp32 values to 4 packed e2m1 values applying .rs rounding

9.7.9.22. Data Movement and Conversion Instructions: cvt.pack

cvt.pack

Convert two integer values from one integer type to another and pack the results.

Syntax

cvt.pack.sat.convertType.abType  d, a, b;

    .convertType  = { .u16, .s16 }
    .abType       = { .s32 }

cvt.pack.sat.convertType.abType.cType  d, a, b, c;

    .convertType  = { .u2, .s2, .u4, .s4, .u8, .s8 }
    .abType       = { .s32 }
    .cType        = { .b32 }

Description

Convert two 32-bit integers a and b into the specified type and pack the results into d.

Destination d is an unsigned 32-bit integer. Source operands a and b are integers of type .abType and the source operand c is an integer of type .cType.

The inputs a and b are converted to values of the type specified by .convertType with saturation, and the results after conversion are packed into the lower bits of d.

If operand c is specified, then the remaining bits of d are copied from the lower bits of c.

Semantics

ta = a < MIN(convertType) ? MIN(convertType) : a;
ta = ta > MAX(convertType) ? MAX(convertType) : ta;
tb = b < MIN(convertType) ? MIN(convertType) : b;
tb = tb > MAX(convertType) ? MAX(convertType) : tb;

size = sizeInBits(convertType);
td = tb;
for (i = size; i <= 2 * size - 1; i++) {
    td[i] = ta[i - size];
}

if (isU16(convertType) || isS16(convertType)) {
    d = td;
} else {
    for (i = 0; i < 2 * size; i++) {
        d[i] = td[i];
    }
    for (i = 2 * size; i <= 31; i++) {
        d[i] = c[i - 2 * size];
    }
}

The .sat modifier limits the converted values to MIN(convertType)..MAX(convertType) (no overflow) if the corresponding inputs are not in the range of the datatype specified as .convertType.
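The semantics above can be exercised with a small Python model. `cvt_pack_sat` is a hypothetical helper that takes an explicit width and signedness flag in place of the PTX type suffixes:

```python
def cvt_pack_sat(a: int, b: int, width: int, signed: bool, c: int = 0) -> int:
    """Model of cvt.pack.sat: clamp a and b to the destination type,
    pack b into the low bits of d with a above it, and fill the
    remaining bits of d from the low bits of c (sketch, not an NVIDIA API)."""
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    mask = (1 << width) - 1
    ta = min(max(a, lo), hi) & mask        # converted a (two's-complement bits)
    tb = min(max(b, lo), hi) & mask        # converted b
    td = (ta << width) | tb                # a lands above b
    if width == 16:
        return td                          # .u16/.s16: d fully defined by a and b
    rest = c & ((1 << (32 - 2 * width)) - 1)
    return (rest << (2 * width)) | td      # remaining bits come from c
```

For example, packing a = 300 and b = -5 as .u8 saturates them to 255 and 0, giving d = 0xFF00; with c = 0xABCD the upper half of d is filled from c's low bits.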

PTX ISA Notes

Introduced in PTX ISA version 6.5.

Target ISA Notes

Requires sm_72 or higher.

Sub-byte types (.u4/.s4 and .u2/.s2) require sm_75 or higher.

Examples

cvt.pack.sat.s16.s32      %r1, %r2, %r3;           // 32-bit to 16-bit conversion
cvt.pack.sat.u8.s32.b32   %r4, %r5, %r6, 0;        // 32-bit to 8-bit conversion
cvt.pack.sat.u8.s32.b32   %r7, %r8, %r9, %r4;      // %r7 = { %r5, %r6, %r8, %r9 }
cvt.pack.sat.u4.s32.b32   %r10, %r12, %r13, %r14;  // 32-bit to 4-bit conversion
cvt.pack.sat.s2.s32.b32   %r15, %r16, %r17, %r18;  // 32-bit to 2-bit conversion

9.7.9.23. Data Movement and Conversion Instructions: mapa

mapa

Map the address of the shared variable in the target CTA.

Syntax

mapa{.space}.type          d, a, b;

// Maps shared memory address in register a into CTA b.
mapa.shared::cluster.type  d, a, b;

// Maps shared memory variable into CTA b.
mapa.shared::cluster.type  d, sh, b;

// Maps shared memory variable into CTA b.
mapa.shared::cluster.type  d, sh + imm, b;

// Maps generic address in register a into CTA b.
mapa.type                  d, a, b;

.space = { .shared::cluster }
.type  = { .u32, .u64 }

Description

Get the address in the CTA specified by operand b which corresponds to the address specified by operand a.

Instruction type .type indicates the type of the destination operand d and the source operand a.

When space is .shared::cluster, source a is either a shared memory variable or a register containing a valid shared memory address, and register d contains a shared memory address. When the optional qualifier .space is not specified, both a and d are registers containing generic addresses pointing to shared memory.

b is a 32-bit integer operand representing the rank of the target CTA.

Destination register d will hold an address in CTA b corresponding to operand a.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

mapa.shared::cluster.u64 d1, %reg1, cta;
mapa.shared::cluster.u32 d2, sh, 3;
mapa.u64                 d3, %reg2, cta;

9.7.9.24. Data Movement and Conversion Instructions: getctarank

getctarank

Generate the CTA rank of the address.

Syntax

getctarank{.space}.type d, a;

// Get cta rank from source shared memory address in register a.
getctarank.shared::cluster.type d, a;

// Get cta rank from shared memory variable.
getctarank.shared::cluster.type d, var;

// Get cta rank from shared memory variable + offset.
getctarank.shared::cluster.type d, var + imm;

// Get cta rank from generic address of shared memory variable in register a.
getctarank.type d, a;

.space = { .shared::cluster }
.type  = { .u32, .u64 }

Description

Write the destination register d with the rank of the CTA which contains the address specified in operand a.

Instruction type .type indicates the type of source operand a.

When space is .shared::cluster, source a is either a shared memory variable or a register containing a valid shared memory address. When the optional qualifier .space is not specified, a is a register containing a generic address pointing to shared memory. Destination d is always a 32-bit register which holds the rank of the CTA.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

getctarank.shared::cluster.u32 d1, addr;
getctarank.shared::cluster.u64 d2, sh + 4;
getctarank.u64                 d3, src;

9.7.9.25. Data Movement and Conversion Instructions: Asynchronous copy

An asynchronous copy operation performs the underlying operation asynchronously in the background, thus allowing the issuing threads to perform subsequent tasks.

An asynchronous copy operation can be a bulk operation that operates on a large amount of data, or a non-bulk operation that operates on smaller-sized data. The amount of data handled by a bulk asynchronous operation must be a multiple of 16 bytes.

An asynchronous copy operation typically includes the following sequence:

  • Optionally, reading from the tensormap.

  • Reading data from the source location(s).

  • Writing data to the destination location(s).

  • Writes being made visible to the executing thread or other threads.

9.7.9.25.1. Completion Mechanisms for Asynchronous Copy Operations

A thread must explicitly wait for the completion of an asynchronous copy operation in order to access the result of the operation. Once an asynchronous copy operation is initiated, modifying the source memory location or tensor descriptor, or reading from the destination memory location before the asynchronous operation completes, results in undefined behavior.

This section describes the two asynchronous copy operation completion mechanisms supported in PTX: the async-group mechanism and the mbarrier-based mechanism.

Asynchronous operations may be tracked by either of the completion mechanisms, or by both. The tracking mechanism is specific to the instruction or instruction variant.

9.7.9.25.1.1. Async-group mechanism

When using the async-group completion mechanism, the issuing thread specifies a group of asynchronous operations, called an async-group, using a commit operation, and tracks the completion of this group using a wait operation. The thread issuing the asynchronous operation must create separate async-groups for bulk and non-bulk asynchronous operations.

A commit operation creates a per-thread async-group containing all prior asynchronous operations tracked by async-group completion and initiated by the executing thread, but none of the asynchronous operations following the commit operation. A committed asynchronous operation belongs to a single async-group.

When an async-group completes, all the asynchronous operations belonging to that group are complete, and the executing thread that initiated the asynchronous operations can read their results. All async-groups committed by an executing thread always complete in the order in which they were committed. There is no ordering between asynchronous operations within an async-group.

A typical pattern of using async-group as the completion mechanism is as follows:

  • Initiate the asynchronous operations.

  • Group the asynchronous operations into an async-group using a commit operation.

  • Wait for the completion of the async-group using the wait operation.

  • Once the async-group completes, access the results of all asynchronous operations in that async-group.

9.7.9.25.1.2. Mbarrier-based mechanism

A thread can track the completion of one or more asynchronous operations using the current phase of an mbarrier object. When the current phase of the mbarrier object is complete, it implies that all asynchronous operations tracked by this phase are complete, and all threads participating in that mbarrier object can access the result of the asynchronous operations.

The mbarrier object to be used for tracking the completion of an asynchronous operation can be either specified along with the asynchronous operation as part of its syntax, or as a separate operation. For a bulk asynchronous operation, the mbarrier object must be specified in the asynchronous operation, whereas for non-bulk operations, it can be specified after the asynchronous operation.

A typical pattern of using the mbarrier-based completion mechanism is as follows:

  • Initiate the asynchronous operations.

  • Set up an mbarrier object to track the asynchronous operations in its current phase, either as part of the asynchronous operation or as a separate operation.

  • Wait for the mbarrier object to complete its current phase using mbarrier.test_wait or mbarrier.try_wait.

  • Once the mbarrier.test_wait or mbarrier.try_wait operation returns True, access the results of the asynchronous operations tracked by the mbarrier object.

9.7.9.25.2. Async Proxy

The cp{.reduce}.async.bulk operations are performed in the asynchronous proxy (or async proxy).

Accessing the same memory location across multiple proxies needs a cross-proxy fence. For the async proxy, fence.proxy.async should be used to synchronize memory between the generic proxy and the async proxy.

The completion of a cp{.reduce}.async.bulk operation is followed by an implicit generic-async proxy fence. So the result of the asynchronous operation is made visible to the generic proxy as soon as its completion is observed. Either the async-group or the mbarrier-based completion mechanism must be used to wait for the completion of the cp{.reduce}.async.bulk instructions.

9.7.9.25.3. Data Movement and Conversion Instructions: Non-bulk copy
9.7.9.25.3.1. Data Movement and Conversion Instructions: cp.async

cp.async

Initiates an asynchronous copy operation from one state space to another.

Syntax

cp.async.ca.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}
                         [dst], [src], cp-size{, src-size}{, cache-policy} ;
cp.async.cg.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}
                         [dst], [src], 16{, src-size}{, cache-policy} ;
cp.async.ca.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}
                         [dst], [src], cp-size{, ignore-src}{, cache-policy} ;
cp.async.cg.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}
                         [dst], [src], 16{, ignore-src}{, cache-policy} ;

.level::cache_hint =     { .L2::cache_hint }
.level::prefetch_size =  { .L2::64B, .L2::128B, .L2::256B }
cp-size =                { 4, 8, 16 }

Description

cp.async is a non-blocking instruction which initiates an asynchronous copy operation of data from the location specified by source address operand src to the location specified by destination address operand dst. Operand src specifies a location in the global state space and dst specifies a location in the shared state space.

Operand cp-size is an integer constant which specifies the size of data in bytes to be copied to the destination dst. cp-size can only be 4, 8, or 16.

Instruction cp.async allows optionally specifying a 32-bit integer operand src-size. Operand src-size represents the size of the data in bytes to be copied from src to dst and must be less than cp-size. In that case, the remaining bytes in destination dst are filled with zeros. Specifying src-size larger than cp-size results in undefined behavior.

The optional and non-immediate predicate argument ignore-src specifies whether the data from the source location src should be ignored completely. If the source data is ignored, then zeros will be copied to destination dst. If the argument ignore-src is not specified, then it defaults to False.
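The src-size zero-fill and ignore-src behavior can be modeled at the byte level. The following Python sketch (`cp_async_model` is a hypothetical helper, not an NVIDIA API) mirrors the rules above:

```python
def cp_async_model(src: bytes, cp_size: int, src_size=None, ignore_src=False) -> bytes:
    """Byte-level model of cp.async's copy and zero-fill rules:
    copy src_size bytes (or none when ignore-src is true) and pad
    the destination to cp-size with zeros. Illustration only."""
    assert cp_size in (4, 8, 16)            # cp-size can only be 4, 8, or 16
    if ignore_src:
        return bytes(cp_size)               # source ignored: destination is all zeros
    n = cp_size if src_size is None else src_size
    assert n <= cp_size                     # larger src-size is undefined behavior
    return src[:n] + bytes(cp_size - n)     # remaining bytes are zero-filled
```

For example, copying 4 source bytes with cp-size 8 yields the 4 bytes followed by 4 zeros.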

Supported alignment requirements and addressing modes for operands src and dst are described in Addresses as Operands.

The mandatory .async qualifier indicates that the cp instruction will initiate the memory copy operation asynchronously and that control will return to the executing thread before the copy operation is complete. The executing thread can then use the async-group based completion mechanism or the mbarrier-based completion mechanism to wait for completion of the asynchronous copy operation. No other synchronization mechanism guarantees the completion of the asynchronous copy operations.

There is no ordering guarantee between two cp.async operations if they are not explicitly synchronized using cp.async.wait_all, cp.async.wait_group, or mbarrier instructions.

As described in Cache Operators, the .cg qualifier indicates caching of data only at the global-level cache L2 and not at L1, whereas the .ca qualifier indicates caching of data at all levels, including L1 cache. Cache operators are treated as performance hints only.

cp.async is treated as a weak memory operation in the Memory Consistency Model.

The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to 64B, 128B, or 256B, thereby allowing the prefetch size to be 64 bytes, 128 bytes, or 256 bytes respectively.

The qualifier .level::prefetch_size may only be used with the .global state space and with generic addressing where the address points to the .global state space. If the generic address does not fall within the address window of the global memory, then the prefetching behavior is undefined.

The.level::prefetch_size qualifier is treated as a performance hint only.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

The qualifier .level::cache_hint is only supported for the .global state space and for generic addressing where the address points to the .global state space.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated asa performance hint only, and does not change the memory consistency behavior of the program.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Support for .level::cache_hint and .level::prefetch_size qualifiers introduced in PTX ISA version 7.4.

Support for the ignore-src operand introduced in PTX ISA version 7.5.

Support for sub-qualifier ::cta introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_80 or higher.

Sub-qualifier ::cta requires sm_30 or higher.

Examples

cp.async.ca.shared.global                [shrd],     [gbl + 4], 4;
cp.async.ca.shared::cta.global           [%r0 + 8],  [%r1],     8;
cp.async.cg.shared.global                [%r2],      [%r3],     16;
cp.async.cg.shared.global.L2::64B        [%r2],      [%r3],     16;
cp.async.cg.shared.global.L2::128B       [%r0 + 16], [%r1],     16;
cp.async.cg.shared.global.L2::256B       [%r2 + 32], [%r3],     16;

createpolicy.fractional.L2::evict_last.L2::evict_unchanged.b64 cache-policy, 0.25;
cp.async.ca.shared.global.L2::cache_hint [%r2], [%r1], 4, cache-policy;

cp.async.ca.shared.global                [shrd], [gbl], 4, p;
cp.async.cg.shared.global.L2::cache_hint [%r0], [%r2], 16, q, cache-policy;
9.7.9.25.3.2. Data Movement and Conversion Instructions: cp.async.commit_group

cp.async.commit_group

Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group.

Syntax

cp.async.commit_group ;

Description

The cp.async.commit_group instruction creates a new cp.async-group per thread and batches all prior cp.async instructions initiated by the executing thread but not yet committed to any cp.async-group into the new cp.async-group. If there are no uncommitted cp.async instructions, then cp.async.commit_group results in an empty cp.async-group.

An executing thread can wait for the completion of all cp.async operations in a cp.async-group using cp.async.wait_group.

There is no memory ordering guarantee provided between any two cp.async operations within the same cp.async-group. So two or more cp.async operations within a cp.async-group copying data to the same location results in undefined behavior.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Target ISA Notes

Requires sm_80 or higher.

Examples

// Example 1:
cp.async.ca.shared.global [shrd], [gbl], 4;
cp.async.commit_group ; // Marks the end of a cp.async group

// Example 2:
cp.async.ca.shared.global [shrd1],   [gbl1],   8;
cp.async.ca.shared.global [shrd1+8], [gbl1+8], 8;
cp.async.commit_group ; // Marks the end of cp.async group 1

cp.async.ca.shared.global [shrd2],    [gbl2],    16;
cp.async.cg.shared.global [shrd2+16], [gbl2+16], 16;
cp.async.commit_group ; // Marks the end of cp.async group 2
9.7.9.25.3.3. Data Movement and Conversion Instructions: cp.async.wait_group / cp.async.wait_all

cp.async.wait_group, cp.async.wait_all

Wait for completion of prior asynchronous copy operations.

Syntax

cp.async.wait_group N;
cp.async.wait_all ;

Description

The cp.async.wait_group instruction will cause the executing thread to wait until only N or fewer of the most recent cp.async-groups are pending and all prior cp.async-groups committed by the executing thread are complete. For example, when N is 0, the executing thread waits on all prior cp.async-groups to complete. Operand N is an integer constant.

cp.async.wait_all is equivalent to:

cp.async.commit_group;
cp.async.wait_group 0;

An empty cp.async-group is considered to be trivially complete.

Writes performed by cp.async operations are made visible to the executing thread only after:

  1. The completion of cp.async.wait_all, or

  2. The completion of cp.async.wait_group on the cp.async-group to which the cp.async belongs, or

  3. mbarrier.test_wait returns True on an mbarrier object which is tracking the completion of the cp.async operation.

There is no ordering between two cp.async operations that are not synchronized with cp.async.wait_all, cp.async.wait_group, or mbarrier objects.

cp.async.wait_group and cp.async.wait_all do not provide any ordering and visibility guarantees for any memory operation other than cp.async.
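The per-thread commit/wait semantics can be sketched as a toy host-side model. Python is used here only as executable pseudocode; the class and method names are hypothetical and do not correspond to any CUDA API:

```python
from collections import deque

class AsyncGroupTracker:
    """Toy model of per-thread cp.async-group commit/wait semantics.
    Groups complete in commit order; wait_group(n) drains until at most
    n of the most recent groups remain pending."""

    def __init__(self):
        self.pending_ops = []   # cp.async ops initiated but not yet committed
        self.groups = deque()   # committed groups, oldest first

    def initiate(self, op):
        self.pending_ops.append(op)

    def commit_group(self):
        # Batches all uncommitted ops; an empty group is trivially complete.
        self.groups.append(list(self.pending_ops))
        self.pending_ops.clear()

    def wait_group(self, n):
        completed = []
        while len(self.groups) > n:          # oldest groups finish first
            completed.extend(self.groups.popleft())
        return completed                     # ops whose writes are now visible

    def wait_all(self):
        # Equivalent to commit_group followed by wait_group 0.
        self.commit_group()
        return self.wait_group(0)
```

In this model, committing three groups and then calling `wait_group(1)` drains the two oldest groups, matching the in-order completion guarantee of async-groups.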

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Target ISA Notes

Requiressm_80 or higher.

Examples

// Example of .wait_all:
cp.async.ca.shared.global [shrd1], [gbl1], 4;
cp.async.cg.shared.global [shrd2], [gbl2], 16;
cp.async.wait_all;      // waits for all prior cp.async to complete

// Example of .wait_group:
cp.async.ca.shared.global [shrd3], [gbl3], 8;
cp.async.commit_group;  // End of group 1
cp.async.cg.shared.global [shrd4], [gbl4], 16;
cp.async.commit_group;  // End of group 2
cp.async.cg.shared.global [shrd5], [gbl5], 16;
cp.async.commit_group;  // End of group 3
cp.async.wait_group 1;  // waits for group 1 and group 2 to complete
9.7.9.25.4. Data Movement and Conversion Instructions: Bulk copy
9.7.9.25.4.1. Data Movement and Conversion Instructions: cp.async.bulk

cp.async.bulk

Initiates an asynchronous copy operation from one state space to another.

Syntax

// global -> shared::cta
cp.async.bulk.dst.src.completion_mechanism{.level::cache_hint}
                      [dstMem], [srcMem], size, [mbar] {, cache-policy}

.dst =                  { .shared::cta }
.src =                  { .global }
.completion_mechanism = { .mbarrier::complete_tx::bytes }
.level::cache_hint =    { .L2::cache_hint }

// global -> shared::cluster
cp.async.bulk.dst.src.completion_mechanism{.multicast}{.level::cache_hint}
                      [dstMem], [srcMem], size, [mbar] {, ctaMask} {, cache-policy}

.dst =                  { .shared::cluster }
.src =                  { .global }
.completion_mechanism = { .mbarrier::complete_tx::bytes }
.level::cache_hint =    { .L2::cache_hint }
.multicast =            { .multicast::cluster }

// shared::cta -> shared::cluster
cp.async.bulk.dst.src.completion_mechanism [dstMem], [srcMem], size, [mbar]

.dst =                  { .shared::cluster }
.src =                  { .shared::cta }
.completion_mechanism = { .mbarrier::complete_tx::bytes }

// shared::cta -> global
cp.async.bulk.dst.src.completion_mechanism{.level::cache_hint}{.cp_mask}
                      [dstMem], [srcMem], size {, cache-policy} {, byteMask}

.dst =                  { .global }
.src =                  { .shared::cta }
.completion_mechanism = { .bulk_group }
.level::cache_hint =    { .L2::cache_hint }

Description

cp.async.bulk is a non-blocking instruction which initiates an asynchronous bulk-copy operation from the location specified by source address operand srcMem to the location specified by destination address operand dstMem.

The direction of the bulk-copy is from the state space specified by the .src modifier to the state space specified by the .dst modifier.

The 32-bit operand size specifies the amount of memory to be copied, in terms of number of bytes. size must be a multiple of 16. If the value is not a multiple of 16, then the behavior is undefined. The memory range [dstMem, dstMem + size - 1] must not overflow the destination memory space and the memory range [srcMem, srcMem + size - 1] must not overflow the source memory space. Otherwise, the behavior is undefined. The addresses dstMem and srcMem must be aligned to 16 bytes.

When the destination of the copy is .shared::cta, the destination address has to be in the shared memory of the executing CTA within the cluster; otherwise the behavior is undefined.

When the source of the copy is .shared::cta and the destination is .shared::cluster, the destination has to be in the shared memory of a different CTA within the cluster.

The modifier .completion_mechanism specifies the completion mechanism that is supported on the instruction variant. The completion mechanisms that are supported for different variants are summarized in the following table:

.completion-mechanism   .dst              .src          Completion mechanism        Optionally usable for completion
                                                        needed for completion of    of reading data from the source
                                                        the entire async operation  and, if applicable, the tensormap

.mbarrier::...          .shared::cta      .global       mbarrier based              Bulk async-group based
                        .shared::cluster  .global
                        .shared::cluster  .shared::cta

.bulk_group             .global           .shared::cta  Bulk async-group based

The modifier .mbarrier::complete_tx::bytes specifies that the cp.async.bulk variant uses the mbarrier based completion mechanism. The complete-tx operation, with the completeCount argument equal to the amount of data copied in bytes, will be performed on the mbarrier object specified by the operand mbar.

The modifier .bulk_group specifies that the cp.async.bulk variant uses the bulk async-group based completion mechanism.

The optional modifier .multicast::cluster allows copying of data from global memory to the shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same CTA-relative offset as dstMem in the shared memory of each destination CTA. The mbarrier signal is also multicast to the same CTA-relative offset as mbar in the shared memory of the destination CTA.
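As a host-side illustration of how ctaMask encodes the destination CTAs, the Python sketch below sets one bit per destination %ctaid; the helper name is hypothetical.

```python
def make_cta_mask(cta_ids) -> int:
    """Build the 16-bit ctaMask: bit i selects the CTA with %ctaid == i."""
    mask = 0
    for cta_id in cta_ids:
        if not 0 <= cta_id < 16:
            raise ValueError("ctaMask can address only CTA ids 0..15")
        mask |= 1 << cta_id
    return mask
```

For example, multicasting to CTAs 0, 2, and 5 yields the mask 0b100101.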

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program. The qualifier .level::cache_hint is only supported when at least one of the .src or .dst state spaces is the .global state space.

When the optional qualifier .cp_mask is specified, the argument byteMask is required. The i-th bit in the 16-bit wide byteMask operand specifies whether the i-th byte of each 16-byte wide chunk of source data is copied to the destination. If the bit is set, the byte is copied.
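The byteMask semantics can be modeled on the host as follows. This Python sketch (function name illustrative) applies bit i of the mask to byte i of every 16-byte chunk, leaving unselected destination bytes unchanged.

```python
def apply_cp_mask(src: bytes, dst: bytearray, byte_mask: int) -> None:
    """Copy byte i of each 16-byte chunk only when bit i of byte_mask is set."""
    assert len(src) == len(dst) and len(src) % 16 == 0
    for offset, value in enumerate(src):
        if (byte_mask >> (offset % 16)) & 1:
            dst[offset] = value
```

Note that the same 16-bit mask is reapplied to every 16-byte chunk of the transfer, which is what makes .cp_mask useful for strided byte patterns.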

The copy operation in cp.async.bulk is treated as a weak memory operation and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope as described in the Memory Consistency Model.

Notes

The .multicast::cluster qualifier is optimized for the target architectures sm_90a/sm_100f/sm_100a/sm_103f/sm_103a/sm_110f/sm_110a and may have substantially reduced performance on other targets; hence, .multicast::cluster is advised to be used with .target sm_90a/sm_100f/sm_100a/sm_103f/sm_103a/sm_110f/sm_110a.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Support for .shared::cta as the destination state space is introduced in PTX ISA version 8.6.

Support for the .cp_mask qualifier is introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_90 or higher.

The .multicast::cluster qualifier is advised to be used with .target sm_90a or sm_100f or sm_100a or sm_103f or sm_103a or sm_110f or sm_110a.

Support for the .cp_mask qualifier requires sm_100 or higher.

Examples

// .global -> .shared::cta (strictly non-remote):
cp.async.bulk.shared::cta.global.mbarrier::complete_tx::bytes [dstMem], [srcMem], size, [mbar];

cp.async.bulk.shared::cta.global.mbarrier::complete_tx::bytes.L2::cache_hint
                                            [dstMem], [srcMem], size, [mbar], cache-policy;

// .global -> .shared::cluster:
cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [dstMem], [srcMem], size, [mbar];

cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster
                                            [dstMem], [srcMem], size, [mbar], ctaMask;

cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint
                                            [dstMem], [srcMem], size, [mbar], cache-policy;

// .shared::cta -> .shared::cluster (strictly remote):
cp.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes [dstMem], [srcMem], size, [mbar];

// .shared::cta -> .global:
cp.async.bulk.global.shared::cta.bulk_group [dstMem], [srcMem], size;

cp.async.bulk.global.shared::cta.bulk_group.L2::cache_hint [dstMem], [srcMem], size, cache-policy;

// .shared::cta -> .global with .cp_mask:
cp.async.bulk.global.shared::cta.bulk_group.L2::cache_hint.cp_mask [dstMem], [srcMem], size, cache-policy, byteMask;
9.7.9.25.4.2. Data Movement and Conversion Instructions: cp.reduce.async.bulk

cp.reduce.async.bulk

Initiates an asynchronous reduction operation.

Syntax

cp.reduce.async.bulk.dst.src.completion_mechanism.redOp.type
              [dstMem], [srcMem], size, [mbar]

.dst =                  { .shared::cluster }
.src =                  { .shared::cta }
.completion_mechanism = { .mbarrier::complete_tx::bytes }
.redOp =                { .and, .or, .xor,
                          .add, .inc, .dec,
                          .min, .max }
.type =                 { .b32, .u32, .s32, .b64, .u64 }

cp.reduce.async.bulk.dst.src.completion_mechanism{.level::cache_hint}.redOp.type
              [dstMem], [srcMem], size {, cache-policy}

.dst =                  { .global }
.src =                  { .shared::cta }
.completion_mechanism = { .bulk_group }
.level::cache_hint =    { .L2::cache_hint }
.redOp =                { .and, .or, .xor,
                          .add, .inc, .dec,
                          .min, .max }
.type =                 { .f16, .bf16, .b32, .u32, .s32, .b64, .u64, .s64, .f32, .f64 }

cp.reduce.async.bulk.dst.src.completion_mechanism{.level::cache_hint}.add.noftz.type
              [dstMem], [srcMem], size {, cache-policy}

.dst =                  { .global }
.src =                  { .shared::cta }
.completion_mechanism = { .bulk_group }
.type =                 { .f16, .bf16 }

Description

cp.reduce.async.bulk is a non-blocking instruction which initiates an asynchronous reduction operation on an array of memory locations specified by the destination address operand dstMem with the source array whose location is specified by the source address operand srcMem. The size of the source and the destination array must be the same and is specified by the operand size.

Each data element in the destination array is reduced inline with the corresponding data element in the source array with the reduction operation specified by the modifier .redOp. The type of each data element in the source and the destination array is specified by the modifier .type.

The source address operand srcMem is located in the state space specified by .src and the destination address operand dstMem is located in the state space specified by .dst.

The 32-bit operand size specifies the amount of memory to be copied from the source location and used in the reduction operation, in terms of number of bytes. size must be a multiple of 16; if the value is not a multiple of 16, then the behavior is undefined. The memory range [dstMem, dstMem+size-1] must not overflow the destination memory space and the memory range [srcMem, srcMem+size-1] must not overflow the source memory space; otherwise, the behavior is undefined. The addresses dstMem and srcMem must be aligned to 16 bytes.

The operations supported by .redOp are classified as follows:

  • The bit-size operations are .and, .or, and .xor.

  • The integer operations are .add, .inc, .dec, .min, and .max. The .inc and .dec operations return a result in the range [0..x] where x is the value at the source state space.

  • The floating point operation .add rounds to the nearest even. The current implementation of cp.reduce.async.bulk.add.f32 flushes subnormal inputs and results to sign-preserving zero. The cp.reduce.async.bulk.add.f16 and cp.reduce.async.bulk.add.bf16 operations require the .noftz qualifier, which preserves input and result subnormals and does not flush them to zero.
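The [0..x] wrap-around behavior of .inc and .dec corresponds to PTX's atom.inc/atom.dec saturating-wrap semantics. The Python sketch below models one element-wise step of each; this is an illustrative model of the assumed semantics, not hardware behavior.

```python
def red_inc(dst: int, x: int) -> int:
    """.inc step: wrap to 0 once dst reaches x, so the result stays in [0..x]."""
    return 0 if dst >= x else dst + 1

def red_dec(dst: int, x: int) -> int:
    """.dec step: wrap to x when dst is 0 or already above x."""
    return x if dst == 0 or dst > x else dst - 1
```

Repeatedly applying red_inc with a fixed x cycles a counter through 0, 1, ..., x, 0, ..., which is the useful property for circular-buffer style counters.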

The following table describes the valid combinations of .redOp and element type:

.dst               .redOp             Element type

.shared::cluster   .add               .u32, .s32, .u64
                   .min, .max         .u32, .s32
                   .inc, .dec         .u32
                   .and, .or, .xor    .b32

.global            .add               .u32, .s32, .u64, .f32, .f64, .f16, .bf16
                   .min, .max         .u32, .s32, .u64, .s64, .f16, .bf16
                   .inc, .dec         .u32
                   .and, .or, .xor    .b32, .b64

The modifier .completion_mechanism specifies the completion mechanism that is supported on the instruction variant. The completion mechanisms that are supported for the different variants are summarized in the following table:

.completion-mechanism   .dst              .src          Completion mechanism        Optionally usable for completion
                                                        needed for completion of    of reading data from the source
                                                        the entire async operation  and, if applicable, the tensormap

.mbarrier::...          .shared::cluster  .global       mbarrier based              Bulk async-group based
                        .shared::cluster  .shared::cta

.bulk_group             .global           .shared::cta  Bulk async-group based

The modifier .mbarrier::complete_tx::bytes specifies that the cp.reduce.async.bulk variant uses the mbarrier based completion mechanism. The complete-tx operation, with the completeCount argument equal to the amount of data copied in bytes, will be performed on the mbarrier object specified by the operand mbar.

The modifier .bulk_group specifies that the cp.reduce.async.bulk variant uses the bulk async-group based completion mechanism.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program. The qualifier .level::cache_hint is only supported when at least one of the .src or .dst state spaces is the .global state space.

Each reduction operation performed by cp.reduce.async.bulk has individually .relaxed.gpu memory ordering semantics. The load operations in cp.reduce.async.bulk are treated as weak memory operations and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope as described in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

cp.reduce.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes.add.u64
                                                                  [dstMem], [srcMem], size, [mbar];

cp.reduce.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes.min.s32
                                                                  [dstMem], [srcMem], size, [mbar];

cp.reduce.async.bulk.global.shared::cta.bulk_group.min.f16 [dstMem], [srcMem], size;

cp.reduce.async.bulk.global.shared::cta.bulk_group.L2::cache_hint.xor.s32 [dstMem], [srcMem], size, policy;

cp.reduce.async.bulk.global.shared::cta.bulk_group.add.noftz.f16 [dstMem], [srcMem], size;
9.7.9.25.4.3. Data Movement and Conversion Instructions: cp.async.bulk.prefetch

cp.async.bulk.prefetch

Provides a hint to the system to initiate the asynchronous prefetch of data to the cache.

Syntax

cp.async.bulk.prefetch.L2.src{.level::cache_hint}   [srcMem], size {, cache-policy}

.src =                { .global }
.level::cache_hint =  { .L2::cache_hint }

Description

cp.async.bulk.prefetch is a non-blocking instruction which may initiate an asynchronous prefetch of data from the location specified by the source address operand srcMem, in the .src state space, to the L2 cache.

The 32-bit operand size specifies the amount of memory to be prefetched, in terms of number of bytes. size must be a multiple of 16; if the value is not a multiple of 16, then the behavior is undefined. The address srcMem must be aligned to 16 bytes.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.

cp.async.bulk.prefetch is treated as a weak memory operation in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

cp.async.bulk.prefetch.L2.global                 [srcMem], size;
cp.async.bulk.prefetch.L2.global.L2::cache_hint  [srcMem], size, policy;
9.7.9.25.5. Data Movement and Conversion Instructions: Tensor copy
9.7.9.25.5.1. Restriction on Tensor Copy instructions

The following are the restrictions on the types .b4x16, .b4x16_p64, .b6x16_p32 and .b6p2x16:

  1. cp.reduce.async.bulk doesn’t support the types .b4x16, .b4x16_p64, .b6x16_p32 and .b6p2x16.

  2. cp.async.bulk.tensor with the direction .global.shared::cta doesn’t support the type .b4x16_p64.

  3. cp.async.bulk.tensor with the direction .shared::cluster.global doesn’t support the sub-byte types on sm_120a.

  4. OOB-NaN fill mode doesn’t support the types .b4x16, .b4x16_p64, .b6x16_p32 and .b6p2x16.

  5. Box-Size[0] must be exactly:

    1. 96B for .b6x16_p32 and .b6p2x16.

    2. 64B for .b4x16_p64.

  6. Tensor-Size[0] must be a multiple of:

    1. 96B for .b6x16_p32 and .b6p2x16.

    2. 64B for .b4x16_p64.

  7. For .b4x16_p64, .b6x16_p32 and .b6p2x16, the first coordinate in the tensorCoords argument vector must be a multiple of 128.

  8. For .b4x16_p64, .b6x16_p32 and .b6p2x16, the global memory address must be 32B aligned. Additionally, the tensor stride in every dimension must be 32B aligned.

  9. .b4x16_p64, .b6x16_p32 and .b6p2x16 support the following swizzling modes:

    1. None.

    2. 128B (with all potential swizzle atomicity values except: 32B with 8B flip)

The following are the restrictions on the 96B swizzle mode:

  1. The .swizzle_atomicity must be 16B.

  2. The .interleave_layout must not be set.

  3. Box-Size[0] must be less than or equal to 96B.

  4. The type must not be among the following: .b4x16_p64, .b6x16_p32 and .b6p2x16.

  5. The .load_mode must not be set to .im2col::w::128.

The following are the restrictions on the .global.shared::cta direction:

  1. Starting co-ordinates for the Bounding Box (tensorCoords) must be non-negative.

  2. The bounding box along the D, W and H dimensions must stay within the tensor boundaries. This implies:

    1. The Bounding-Box Lower-Corner must be non-negative.

    2. The Bounding-Box Upper-Corner must be non-positive.

The following are the restrictions for sm_120a:

  1. cp.async.bulk.tensor with the direction .shared::cluster.global doesn’t support:

    1. the sub-byte types

    2. the qualifier .swizzle_atomicity

The following are the restrictions for sm_103a while using the type .b6p2x16 on cp.async.bulk.tensor with the direction .global.shared::cta:

  1. Box-Size[0] must be exactly either 48B or 96B.

  2. The global memory address must be 16B aligned.

  3. The tensor stride in every dimension must be 16B aligned.

  4. The first coordinate in the tensorCoords argument vector must be a multiple of 64.

  5. Tensor-Size[0] must be a multiple of 48B.

  6. The following swizzle modes are supported:

    1. None.

    2. 128B (with all potential swizzle atomicity values except: 32B with 8B flip)

    3. 64B swizzle with 16B swizzle atomicity

9.7.9.25.5.2. Data Movement and Conversion Instructions: cp.async.bulk.tensor

cp.async.bulk.tensor

Initiates an asynchronous copy operation on the tensor data from one state space to another.

Syntax

// global -> shared::cta
cp.async.bulk.tensor.dim.dst.src{.load_mode}.completion_mechanism{.cta_group}{.level::cache_hint}
                                   [dstMem], [tensorMap, tensorCoords], [mbar]{, im2colInfo} {, cache-policy}

.dst =                  { .shared::cta }
.src =                  { .global }
.dim =                  { .1d, .2d, .3d, .4d, .5d }
.completion_mechanism = { .mbarrier::complete_tx::bytes }
.cta_group =            { .cta_group::1, .cta_group::2 }
.load_mode =            { .tile, .tile::gather4, .im2col, .im2col::w, .im2col::w::128 }
.level::cache_hint =    { .L2::cache_hint }

// global -> shared::cluster
cp.async.bulk.tensor.dim.dst.src{.load_mode}.completion_mechanism{.multicast}{.cta_group}{.level::cache_hint}
                                   [dstMem], [tensorMap, tensorCoords], [mbar]{, im2colInfo}
                                   {, ctaMask} {, cache-policy}

.dst =                  { .shared::cluster }
.src =                  { .global }
.dim =                  { .1d, .2d, .3d, .4d, .5d }
.completion_mechanism = { .mbarrier::complete_tx::bytes }
.cta_group =            { .cta_group::1, .cta_group::2 }
.load_mode =            { .tile, .tile::gather4, .im2col, .im2col::w, .im2col::w::128 }
.level::cache_hint =    { .L2::cache_hint }
.multicast =            { .multicast::cluster }

// shared::cta -> global
cp.async.bulk.tensor.dim.dst.src{.load_mode}.completion_mechanism{.level::cache_hint}
                                   [tensorMap, tensorCoords], [srcMem] {, cache-policy}

.dst =                  { .global }
.src =                  { .shared::cta }
.dim =                  { .1d, .2d, .3d, .4d, .5d }
.completion_mechanism = { .bulk_group }
.load_mode =            { .tile, .tile::scatter4, .im2col_no_offs }
.level::cache_hint =    { .L2::cache_hint }

Description

cp.async.bulk.tensor is a non-blocking instruction which initiates an asynchronous copy operation of tensor data from the location in the .src state space to the location in the .dst state space.

The operand dstMem specifies the location in the .dst state space into which the tensor data has to be copied and srcMem specifies the location in the .src state space from which the tensor data has to be copied.

When .dst is specified as .shared::cta, the address dstMem must be in the shared memory of the executing CTA within the cluster; otherwise the behavior is undefined.

When .dst is specified as .shared::cluster, the address dstMem can be in the shared memory of any of the CTAs within the current cluster.

The operand tensorMap is the generic address of the opaque tensor-map object which resides in .param space or .const space or .global space. The operand tensorMap specifies the properties of the tensor copy operation, as described in Tensor-map. The tensorMap is accessed in the tensormap proxy. Refer to the CUDA programming guide for creating the tensor-map objects on the host side.

The dimension of the tensor data is specified by the .dim modifier.

The vector operand tensorCoords specifies the starting coordinates in the tensor data in the global memory from or to which the copy operation has to be performed. The individual tensor coordinates in tensorCoords are of type .s32. The format of the vector argument tensorCoords depends on the .load_mode specified and is as follows:

.load_mode                         tensorCoords                                        Semantics

.tile::scatter4, .tile::gather4    {col_idx, row_idx0, row_idx1, row_idx2, row_idx3}   Fixed-length vector of size 5. The five
                                                                                       elements together specify the start
                                                                                       co-ordinates of the four rows.

Rest all                           {d0, .., dn} for n = .dim                           Vector of n elements where n = .dim. The
                                                                                       elements indicate the offset in each of
                                                                                       the dimensions.

The modifier .completion_mechanism specifies the completion mechanism that is supported on the instruction variant. The completion mechanisms that are supported for the different variants are summarized in the following table:

.completion-mechanism   .dst              .src          Completion mechanism        Optionally usable for completion
                                                        needed for completion of    of reading data from the source
                                                        the entire async operation  and, if applicable, the tensormap

.mbarrier::...          .shared::cta      .global       mbarrier based              Bulk async-group based
                        .shared::cluster  .global

.bulk_group             .global           .shared::cta  Bulk async-group based

The modifier .mbarrier::complete_tx::bytes specifies that the cp.async.bulk.tensor variant uses the mbarrier based completion mechanism. Upon the completion of the asynchronous copy operation, the complete-tx operation, with the completeCount argument equal to the amount of data copied in bytes, will be performed on the mbarrier object specified by the operand mbar.

The modifier .cta_group can only be specified with the mbarrier based completion mechanism. The modifier .cta_group is used to signal either the odd numbered CTA or the even numbered CTA among the CTA-Pair. When .cta_group::1 is specified, the mbarrier object mbar that is specified must be in the shared memory of the same CTA as the shared memory destination dstMem. When .cta_group::2 is specified, the mbarrier object mbar can be in the shared memory of either the same CTA as the shared memory destination dstMem or in its peer-CTA. If .cta_group is not specified, then it defaults to .cta_group::1.

The modifier .bulk_group specifies that the cp.async.bulk.tensor variant uses the bulk async-group based completion mechanism.

The qualifier .load_mode specifies how the data in the source location is copied into the destination location. If .load_mode is not specified, it defaults to .tile.

In .tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. In .tile::gather4 mode, four rows in a 2-dimensional source tensor are combined to form a single 2-dimensional destination tensor. In .tile::scatter4 mode, a single 2-dimensional source tensor is divided into four rows in the 2-dimensional destination tensor. Details of the .tile::scatter4/.tile::gather4 modes are described in .tile::scatter4 and .tile::gather4 modes.
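As an illustrative host-side model of the gather4 idea (not the exact hardware addressing), the Python sketch below picks four source rows, each starting at the same column, to form a 4-row destination tile; the function name and the box-width parameter are assumptions for the sketch.

```python
def tile_gather4(tensor, col_idx, row_idxs, box_w):
    """Gather four source rows (each box_w wide, starting at col_idx) into a 4-row tile."""
    assert len(row_idxs) == 4
    return [tensor[r][col_idx:col_idx + box_w] for r in row_idxs]
```

The five values (col_idx, row_idx0..row_idx3) mirror the five-element tensorCoords vector that .tile::gather4 expects.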

In .im2col and .im2col::* modes, some dimensions of the source tensors are unrolled into a single dimensional column at the destination. Details of the .im2col and .im2col::* modes are described in im2col mode and in im2col::w and im2col::w::128 modes, respectively. In .im2col and .im2col::* modes, the tensor has to be at least 3-dimensional. The vector operand im2colInfo can be specified only when .load_mode is .im2col or .im2col::w or .im2col::w::128. The format of the vector argument im2colInfo depends on the exact im2col mode and is as follows:

Exact im2col mode             im2colInfo argument                            Semantics

.im2col                       { i2cOffW, i2cOffH, i2cOffD } for .dim = .5d   A vector of im2col offsets whose vector size
                                                                             is two less than the number of dimensions
                                                                             .dim.

.im2col::w, .im2col::w::128   { wHalo, wOffset }                             A vector of 2 arguments containing the wHalo
                                                                             and wOffset arguments.

.im2col_no_offs               im2colInfo is not applicable.                  im2colInfo is not applicable.

Argument wHalo is a 16-bit unsigned integer whose valid set of values depends on the load mode, as follows:

  • .im2col::w mode: the valid range is [0, 512).

  • .im2col::w::128 mode: the valid range is [0, 32).

Argument wOffset is a 16-bit unsigned integer whose valid range of values is [0, 32).
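These per-mode ranges can be checked up front on the host. The Python sketch below (helper name and lookup table are illustrative) rejects out-of-range wHalo/wOffset values before they reach the instruction.

```python
# Valid wHalo ranges per load mode, as stated above.
W_HALO_RANGE = {".im2col::w": range(0, 512), ".im2col::w::128": range(0, 32)}

def check_im2col_w_info(load_mode: str, w_halo: int, w_offset: int) -> None:
    """Validate the {wHalo, wOffset} vector against the per-mode limits."""
    if w_halo not in W_HALO_RANGE[load_mode]:
        raise ValueError("wHalo out of range for " + load_mode)
    if w_offset not in range(0, 32):
        raise ValueError("wOffset must be in [0, 32)")
```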

The optional modifier .multicast::cluster allows copying of data from global memory to the shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same offset as dstMem in the shared memory of each destination CTA. When .cta_group is specified as:

  • .cta_group::1 : The mbarrier signal is also multicast to the same offset as mbar in the shared memory of the destination CTA.

  • .cta_group::2 : The mbarrier signal is multicast either to all the odd numbered CTAs or the even numbered CTAs within the corresponding CTA-Pair. For each destination CTA specified in the ctaMask, the mbarrier signal is sent either to the destination CTA or its peer-CTA, based on the %cluster_ctarank parity of the CTA in whose shared memory the mbarrier object mbar resides.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.

The copy operation in cp.async.bulk.tensor is treated as a weak memory operation and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope as described in the Memory Consistency Model.

Notes

The .multicast::cluster qualifier is optimized for the target architectures sm_90a/sm_100f/sm_100a/sm_103f/sm_103a/sm_110f/sm_110a and may have substantially reduced performance on other targets; hence, .multicast::cluster is advised to be used with .target sm_90a/sm_100f/sm_100a/sm_103f/sm_103a/sm_110f/sm_110a.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Support for .shared::cta as the destination state space is introduced in PTX ISA version 8.6.

Support for the qualifiers .tile::gather4 and .tile::scatter4 is introduced in PTX ISA version 8.6.

Support for the qualifiers .im2col::w and .im2col::w::128 is introduced in PTX ISA version 8.6.

Support for the qualifier .cta_group is introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_90 or higher.

The .multicast::cluster qualifier is advised to be used with .target sm_90a or sm_100f or sm_100a or sm_103f or sm_103a or sm_110f or sm_110a.

Qualifiers .tile::gather4 and .im2col::w require:

  • sm_100a when the destination state space is .shared::cluster; supported on sm_100f from PTX ISA version 8.8.

  • sm_100 or higher when the destination state space is .shared::cta.

Qualifier .tile::scatter4 is supported on the following architectures:

  • sm_100a

  • sm_101a (renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (renamed to sm_110f from PTX ISA version 9.0)

    • sm_110f or higher in the same family

Qualifier .im2col::w::128 is supported on the following architectures:

  • sm_100a

  • sm_101a (renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (renamed to sm_110f from PTX ISA version 9.0)

    • sm_110f or higher in the same family

Qualifier .cta_group is supported on the following architectures:

  • sm_100a

  • sm_101a (renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (renamed to sm_110f from PTX ISA version 9.0)

    • sm_110f or higher in the same family

Examples

.reg .b16 ctaMask;
.reg .u16 i2cOffW, i2cOffH, i2cOffD;
.reg .b64 l2CachePolicy;

cp.async.bulk.tensor.1d.shared::cta.global.mbarrier::complete_tx::bytes.tile  [sMem0], [tensorMap0, {tc0}], [mbar0];

@p cp.async.bulk.tensor.5d.shared::cta.global.im2col.mbarrier::complete_tx::bytes
                     [sMem2], [tensorMap2, {tc0, tc1, tc2, tc3, tc4}], [mbar2], {i2cOffW, i2cOffH, i2cOffD};

cp.async.bulk.tensor.1d.shared::cluster.global.mbarrier::complete_tx::bytes.tile  [sMem0], [tensorMap0, {tc0}], [mbar0];

@p cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster
                     [sMem1], [tensorMap1, {tc0, tc1}], [mbar2], ctaMask;

@p cp.async.bulk.tensor.5d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes
                     [sMem2], [tensorMap2, {tc0, tc1, tc2, tc3, tc4}], [mbar2], {i2cOffW, i2cOffH, i2cOffD};

@p cp.async.bulk.tensor.3d.im2col.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint
                     [sMem3], [tensorMap3, {tc0, tc1, tc2}], [mbar3], {i2cOffW}, policy;

@p cp.async.bulk.tensor.1d.global.shared::cta.bulk_group  [tensorMap3, {tc0}], [sMem3];

cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes
                     [sMem5], [tensorMap6, {x0, y0, y1, y2, y3}], [mbar5];

cp.async.bulk.tensor.3d.im2col::w.shared::cluster.global.mbarrier::complete_tx::bytes
                     [sMem4], [tensorMap5, {t0, t1, t2}], [mbar4], {im2colwHalo, im2colOff};

cp.async.bulk.tensor.1d.shared::cluster.global.tile.cta_group::2
                     [sMem6], [tensorMap7, {tc0}], [peerMbar];
9.7.9.25.5.3. Data Movement and Conversion Instructions: cp.reduce.async.bulk.tensor

cp.reduce.async.bulk.tensor

Initiates an asynchronous reduction operation on the tensor data.

Syntax

// shared::cta -> global:
cp.reduce.async.bulk.tensor.dim.dst.src.redOp{.load_mode}.completion_mechanism{.level::cache_hint}
                                          [tensorMap, tensorCoords], [srcMem] {, cache-policy}

.dst =                  { .global }
.src =                  { .shared::cta }
.dim =                  { .1d, .2d, .3d, .4d, .5d }
.completion_mechanism = { .bulk_group }
.load_mode =            { .tile, .im2col_no_offs }
.redOp =                { .add, .min, .max, .inc, .dec, .and, .or, .xor }

Description

cp.reduce.async.bulk.tensor is a non-blocking instruction which initiates an asynchronous reduction operation of tensor data in the .dst state space with tensor data in the .src state space.

The operand srcMem specifies the location of the tensor data in the .src state space using which the reduction operation has to be performed.

The operand tensorMap is the generic address of the opaque tensor-map object which resides in .param space or .const space or .global space. The operand tensorMap specifies the properties of the tensor copy operation, as described in Tensor-map. The tensorMap is accessed in the tensormap proxy. Refer to the CUDA programming guide for creating the tensor-map objects on the host side.

Each element of the tensor data in the .dst state space is reduced inline with the corresponding element from the tensor data in the .src state space. The modifier .redOp specifies the reduction operation used for the inline reduction. The type of each tensor data element in the source and the destination tensor is specified in Tensor-map.

The dimension of the tensor is specified by the .dim modifier.

The vector operand tensorCoords specifies the starting coordinates of the tensor data in the global memory on which the reduce operation is to be performed. The number of tensor coordinates in the vector argument tensorCoords should be equal to the dimension specified by the modifier .dim. The individual tensor coordinates are of the type .s32.

The following table describes the valid combinations of .redOp and element type:

.redOp             Element type

.add               .u32, .s32, .u64, .f32, .f16, .bf16

.min, .max         .u32, .s32, .u64, .s64, .f16, .bf16

.inc, .dec         .u32

.and, .or, .xor    .b32, .b64

The modifier .completion_mechanism specifies the completion mechanism that is supported on the instruction variant. The value .bulk_group of the modifier .completion_mechanism specifies that the cp.reduce.async.bulk.tensor instruction uses the bulk async-group based completion mechanism.

The qualifier .load_mode specifies how the data in the source location is copied into the destination location. If .load_mode is not specified, it defaults to .tile. In .tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. In .im2col_no_offs mode, some dimensions of the source tensors are unrolled into a single dimensional column at the destination. Details of the im2col mode are described in im2col mode. In .im2col mode, the tensor has to be at least 3-dimensional.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program. The qualifier .level::cache_hint is only supported when at least one of the .src or .dst state spaces is the .global state space.

Each reduction operation performed by cp.reduce.async.bulk.tensor has individually .relaxed.gpu memory ordering semantics. The load operations in cp.reduce.async.bulk.tensor are treated as weak memory operations and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope as described in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

cp.reduce.async.bulk.tensor.1d.global.shared::cta.add.tile.bulk_group
                                             [tensorMap0, {tc0}], [sMem0];

cp.reduce.async.bulk.tensor.2d.global.shared::cta.and.bulk_group.L2::cache_hint
                                             [tensorMap1, {tc0, tc1}], [sMem1], policy;

cp.reduce.async.bulk.tensor.3d.global.shared::cta.xor.im2col.bulk_group
                                             [tensorMap2, {tc0, tc1, tc2}], [sMem2];
9.7.9.25.5.4. Data Movement and Conversion Instructions: cp.async.bulk.prefetch.tensor

cp.async.bulk.prefetch.tensor

Provides a hint to the system to initiate the asynchronous prefetch of tensor data to the cache.

Syntax

// global -> L2:
cp.async.bulk.prefetch.tensor.dim.L2.src{.load_mode}{.level::cache_hint} [tensorMap, tensorCoords]
                                                            {, im2colInfo } {, cache-policy}

.src =                { .global }
.dim =                { .1d, .2d, .3d, .4d, .5d }
.load_mode =          { .tile, .tile::gather4, .im2col, .im2col::w, .im2col::w::128 }
.level::cache_hint =  { .L2::cache_hint }

Description

cp.async.bulk.prefetch.tensor is a non-blocking instruction which may initiate an asynchronous prefetch of tensor data from the location in the .src state space to the L2 cache.

The operand tensorMap is the generic address of the opaque tensor-map object which resides in .param space or .const space or .global space. The operand tensorMap specifies the properties of the tensor copy operation, as described in Tensor-map. The tensorMap is accessed in tensormap proxy. Refer to the CUDA programming guide for creating the tensor-map objects on the host side.

The dimension of the tensor data is specified by the .dim modifier.

The vector operand tensorCoords specifies the starting coordinates in the tensor data in the global memory from which the copy operation has to be performed. The individual tensor coordinates in tensorCoords are of type .s32. The format of the vector argument tensorCoords depends on the .load_mode specified, as follows:

.load_mode       tensorCoords                                        Semantics

.tile::gather4   {col_idx, row_idx0, row_idx1, row_idx2, row_idx3}   Fixed-length vector of size 5. The five elements
                                                                     together specify the start coordinates of the
                                                                     four rows.

All others       {d0, .., dn} for n = .dim                           Vector of n elements where n = .dim. The elements
                                                                     indicate the offset in each of the dimensions.

The qualifier .load_mode specifies how the data in the source location is copied into the destination location. If .load_mode is not specified, it defaults to .tile.

In .tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. In .tile::gather4 mode, four rows in the 2-dimensional source tensor are fetched to the L2 cache. Details of the .tile::gather4 mode are described in .tile::scatter4 and .tile::gather4 modes.

In .im2col and .im2col::* modes, some dimensions of the source tensors are unrolled into a single dimensional column at the destination. Details of the im2col and .im2col::* modes are described in im2col mode and im2col::w and im2col::w::128 modes respectively. In .im2col and .im2col::* modes, the tensor has to be at least 3-dimensional. The vector operand im2colInfo can be specified only when .load_mode is .im2col, .im2col::w or .im2col::w::128. The format of the vector argument im2colInfo depends on the exact im2col mode, as follows:

Exact im2col mode             im2colInfo argument                           Semantics

.im2col                       {i2cOffW, i2cOffH, i2cOffD} for .dim = .5d    A vector of im2col offsets whose vector size
                                                                            is two less than the number of dimensions .dim.

.im2col::w, .im2col::w::128   {wHalo, wOffset}                              A vector of 2 arguments containing the wHalo
                                                                            and wOffset arguments.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.

cp.async.bulk.prefetch.tensor is treated as a weak memory operation in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Support for qualifier .tile::gather4 introduced in PTX ISA version 8.6.

Support for qualifiers .im2col::w and .im2col::w::128 introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_90 or higher.

Qualifier .tile::gather4 is supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Qualifiers .im2col::w and .im2col::w::128 are supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And are supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Examples

.reg .b16 ctaMask, im2colwHalo, im2colOff;
.reg .u16 i2cOffW, i2cOffH, i2cOffD;
.reg .b64 l2CachePolicy;

cp.async.bulk.prefetch.tensor.1d.L2.global.tile  [tensorMap0, {tc0}];

@p cp.async.bulk.prefetch.tensor.2d.L2.global    [tensorMap1, {tc0, tc1}];

@p cp.async.bulk.prefetch.tensor.5d.L2.global.im2col
                     [tensorMap2, {tc0, tc1, tc2, tc3, tc4}], {i2cOffW, i2cOffH, i2cOffD};

@p cp.async.bulk.prefetch.tensor.3d.L2.global.im2col.L2::cache_hint
                     [tensorMap3, {tc0, tc1, tc2}], {i2cOffW}, policy;

cp.async.bulk.prefetch.tensor.2d.L2.global.tile::gather4
                     [tensorMap5, {col_idx, row_idx0, row_idx1, row_idx2, row_idx3}];

cp.async.bulk.prefetch.tensor.4d.L2.global.im2col::w::128
                     [tensorMap4, {t0, t1, t2, t3}], {im2colwHalo, im2colOff};
9.7.9.25.6. Data Movement and Conversion Instructions: Bulk and Tensor copy completion instructions
9.7.9.25.6.1. Data Movement and Conversion Instructions: cp.async.bulk.commit_group

cp.async.bulk.commit_group

Commits all prior initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group.

Syntax

cp.async.bulk.commit_group;

Description

The cp.async.bulk.commit_group instruction creates a new per-thread bulk async-group and batches all prior cp{.reduce}.async.bulk{.prefetch}{.tensor} instructions satisfying the following conditions into the new bulk async-group:

  • The prior cp{.reduce}.async.bulk{.prefetch}{.tensor} instructions use the bulk_group based completion mechanism, and

  • They are initiated by the executing thread but not committed to any bulk async-group.

If there are no uncommitted cp{.reduce}.async.bulk{.prefetch}{.tensor} instructions then cp.async.bulk.commit_group results in an empty bulk async-group.

An executing thread can wait for the completion of all cp{.reduce}.async.bulk{.prefetch}{.tensor} operations in a bulk async-group using cp.async.bulk.wait_group.

There is no memory ordering guarantee provided between any two cp{.reduce}.async.bulk{.prefetch}{.tensor} operations within the same bulk async-group.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

cp.async.bulk.commit_group;
9.7.9.25.6.2. Data Movement and Conversion Instructions: cp.async.bulk.wait_group

cp.async.bulk.wait_group

Wait for completion of bulk async-groups.

Syntax

cp.async.bulk.wait_group{.read} N;

Description

The cp.async.bulk.wait_group instruction will cause the executing thread to wait until only N or fewer of the most recent bulk async-groups are pending and all the prior bulk async-groups committed by the executing thread are complete. For example, when N is 0, the executing thread waits on all the prior bulk async-groups to complete. Operand N is an integer constant.

By default, the cp.async.bulk.wait_group instruction will cause the executing thread to wait until completion of all the bulk async operations in the specified bulk async-group. A bulk async operation includes the following:

  • Optionally, reading from the tensormap.

  • Reading from the source locations.

  • Writing to their respective destination locations.

  • Writes being made visible to the executing thread.

The optional .read modifier indicates that the waiting has to be done only until the following stages of all the bulk async operations in the specified bulk async-group have completed:

  1. reading from the tensormap, and

  2. reading from their source locations.
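The grouping and counting behavior described above can be sketched with a small host-side model (plain Python with invented names; it only illustrates how commit_group batches operations into per-thread groups and how wait_group(N) retires them oldest-first, not the actual asynchronous hardware behavior):

```python
from collections import deque

class BulkAsyncGroups:
    """Toy model of one thread's bulk async-group bookkeeping."""

    def __init__(self):
        self.uncommitted = []   # initiated but not yet committed ops
        self.pending = deque()  # committed groups, oldest first

    def initiate(self, op):
        self.uncommitted.append(op)

    def commit_group(self):
        # Batches all uncommitted ops into a new group; the group may
        # be empty if nothing was initiated since the last commit.
        self.pending.append(list(self.uncommitted))
        self.uncommitted.clear()

    def wait_group(self, n):
        # Blocks until at most n of the most recent groups are pending.
        # Completion is modeled by simply retiring the oldest groups.
        while len(self.pending) > n:
            self.pending.popleft()

g = BulkAsyncGroups()
g.initiate("copy0"); g.initiate("copy1")
g.commit_group()           # group 0: [copy0, copy1]
g.initiate("copy2")
g.commit_group()           # group 1: [copy2]
g.wait_group(1)            # waits until <= 1 group is pending
print(len(g.pending))      # -> 1
```

Note that, as in the description, committing with no uncommitted operations still produces an (empty) group, and wait_group(0) waits for all prior groups.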

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

cp.async.bulk.wait_group.read   0;cp.async.bulk.wait_group        2;

9.7.9.26. Data Movement and Conversion Instructions: tensormap.replace

tensormap.replace

Modifies the field of a tensor-map object.

Syntax

tensormap.replace.mode.field1{.ss}.b1024.type  [addr], new_val;
tensormap.replace.mode.field2{.ss}.b1024.type  [addr], ord, new_val;
tensormap.replace.mode.field3{.ss}.b1024.type  [addr], new_val;

.mode    = { .tile }
.field1  = { .global_address, .rank }
.field2  = { .box_dim, .global_dim, .global_stride, .element_stride }
.field3  = { .elemtype, .interleave_layout, .swizzle_mode, .swizzle_atomicity, .fill_mode }
.ss      = { .global, .shared::cta }
.type    = { .b32, .b64 }

Description

The tensormap.replace instruction replaces the field, specified by the .field qualifier, of the tensor-map object at the location specified by the address operand addr with a new value. The new value is specified by the argument new_val.

Qualifier .mode specifies the mode of the tensor-map object located at the address operand addr.

Instruction type .b1024 indicates the size of the tensor-map object, which is 1024 bits.

Operand new_val has the type .type. When .field is specified as .global_address or .global_stride, .type must be .b64. Otherwise, .type must be .b32.

The immediate integer operand ord specifies the ordinal of the field across the rank of the tensor which needs to be replaced in the tensor-map object.

For field .rank, the operand new_val must be one less than the desired tensor rank, as this field uses zero-based numbering.

When .field3 is specified, the operand new_val must be an immediate, and Table 33 shows the mapping of the operand new_val across various fields.

Table 33 Tensormap new_val validity

new_val   .elemtype                .interleave_layout   .swizzle_mode    .swizzle_atomicity   .fill_mode

0         .u8                      No interleave        No swizzling     16B                  Zero fill
1         .u16                     16B interleave       32B swizzling    32B                  OOB-NaN fill
2         .u32                     32B interleave       64B swizzling    32B + 8B flip        x
3         .s32                     x                    128B swizzling   64B                  x
4         .u64                     x                    96B swizzling    x                    x
5         .s64                     x                    x                x                    x
6         .f16                     x                    x                x                    x
7         .f32                     x                    x                x                    x
8         .f32.ftz                 x                    x                x                    x
9         .f64                     x                    x                x                    x
10        .bf16                    x                    x                x                    x
11        .tf32                    x                    x                x                    x
12        .tf32.ftz                x                    x                x                    x
13        .b4x16                   x                    x                x                    x
14        .b4x16_p64               x                    x                x                    x
15        .b6x16_p32 or .b6p2x16   x                    x                x                    x

Note

The values of .elemtype do not correspond to the values of the CUtensorMapDataType enum used in the driver API.

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .global or .shared::cta state space then the behavior is undefined.

tensormap.replace is treated as a weak memory operation, on the entire 1024-bit opaque tensor-map object, in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 8.3.

Qualifier .swizzle_atomicity introduced in PTX ISA version 8.6.

Qualifier .elemtype with values from 13 to 15, both inclusive, is supported in PTX ISA version 8.7 onwards.

Qualifier .swizzle_mode with value 4 is supported from PTX ISA version 8.8 onwards.

Target ISA Notes

Supported on following architectures:

  • sm_90a

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

Qualifier .swizzle_atomicity is supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a (refer to section for restrictions on sm_120a)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

.field3 variant .elemtype corresponding to new_val values 13, 14 and 15 is supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a (refer to section for restrictions on sm_120a)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

.field3 variant .swizzle_mode corresponding to new_val value 4 is supported on following architectures:

  • sm_103a (refer to section for restrictions on sm_103a)

Examples

tensormap.replace.tile.global_address.shared::cta.b1024.b64   [sMem], new_val;

9.7.10. Texture Instructions

This section describes PTX instructions for accessing textures and samplers. PTX supports the following operations on texture and sampler descriptors:

  • Static initialization of texture and sampler descriptors.

  • Module-scope and per-entry scope definitions of texture and sampler descriptors.

  • Ability to query fields within texture and sampler descriptors.

9.7.10.1. Texturing Modes

For working with textures and samplers, PTX has two modes of operation. In the unified mode, texture and sampler information is accessed through a single .texref handle. In the independent mode, texture and sampler information each have their own handle, allowing them to be defined separately and combined at the site of usage in the program.

The advantage of unified mode is that it allows 256 samplers per kernel (128 for architectures prior to sm_3x), with the restriction that they correspond 1-to-1 with the 256 possible textures per kernel (128 for architectures prior to sm_3x). The advantage of independent mode is that textures and samplers can be mixed and matched, but the number of samplers is greatly restricted to 32 per kernel (16 for architectures prior to sm_3x).

Table 34 summarizes the number of textures, samplers and surfaces available in different texturing modes.

Table 34 Texture, sampler and surface limits

Texturing mode     Resource   sm_1x, sm_2x   sm_3x+

Unified mode       Textures   128            256
                   Samplers   128            256
                   Surfaces   8              16

Independent mode   Textures   128            256
                   Samplers   16             32
                   Surfaces   8              16

The texturing mode is selected using the .target options texmode_unified and texmode_independent. A PTX module may declare only one texturing mode. If no texturing mode is declared, the module is assumed to use unified mode.

Example: calculate an element's power contribution as the element's power divided by the total number of elements.

.target texmode_independent
.global .samplerref tsamp1 = { addr_mode_0 = clamp_to_border,
                               filter_mode = nearest
                             };
...
.entry compute_power ( .param .texref tex1 )
{
  txq.width.b32  r6, [tex1]; // get tex1's width
  txq.height.b32 r5, [tex1]; // get tex1's height
  tex.2d.v4.f32.f32  {r1,r2,r3,r4}, [tex1, tsamp1, {f1,f2}];
  mul.u32 r5, r5, r6;
  add.f32 r1, r1, r2;
  add.f32 r3, r3, r4;
  add.f32 r1, r1, r3;
  cvt.f32.u32 r5, r5;
  div.f32 r1, r1, r5;
}

9.7.10.2. Mipmaps

A mipmap is a sequence of textures, each of which is a progressively lower resolution representation of the same image. The height and width of each image, or level of detail (LOD), in the mipmap is a power of two smaller than the previous level. Mipmaps are used in graphics applications to improve rendering speed and reduce aliasing artifacts. For example, a high-resolution mipmap image is used for objects that are close to the user; lower-resolution images are used as the object appears farther away. Mipmap filtering modes are provided when switching between two levels of detail (LODs) in order to avoid abrupt changes in visual fidelity.

Example: If the texture has a basic size of 256 by 256 pixels, then the associated mipmap set may contain a series of eight images, each one-fourth the total area of the previous one: 128x128 pixels, 64x64, 32x32, 16x16, 8x8, 4x4, 2x2, 1x1 (a single pixel). If, for example, a scene is rendering this texture in a space of 40x40 pixels, then either a scaled up version of the 32x32 (without trilinear interpolation) or an interpolation of the 64x64 and the 32x32 mipmaps (with trilinear interpolation) would be used.

The total number of LODs in a complete mipmap pyramid is calculated through the following equation:

numLODs = 1 + floor(log2(max(w, h, d)))

The finest LOD is called the base level and is the 0th level. The next (coarser) level is the 1st level, and so on. The coarsest level is the level of size (1 x 1 x 1). Each successively smaller mipmap level has half the {width, height, depth} of the previous level, but if this half value is a fractional value, it's rounded down to the next smaller integer. Essentially, the size of a mipmap level can be specified as:

max(1, floor(w_b / 2^i)) x max(1, floor(h_b / 2^i)) x max(1, floor(d_b / 2^i))

where i is the ith level beyond the 0th level (the base level), and w_b, h_b and d_b are the width, height and depth of the base level respectively.
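The two formulas above can be checked with a short sketch (Python; mipmap_levels is a hypothetical helper name, and w >> i equals floor(w / 2^i) for non-negative integers):

```python
import math

def mipmap_levels(w, h, d=1):
    """Sizes of a complete mipmap pyramid, per the formulas above."""
    # numLODs = 1 + floor(log2(max(w, h, d)))
    num_lods = 1 + int(math.floor(math.log2(max(w, h, d))))
    # Level i has size max(1, floor(x_b / 2^i)) in each dimension.
    return [(max(1, w >> i), max(1, h >> i), max(1, d >> i))
            for i in range(num_lods)]

# A 256x256 base texture yields 9 LODs: 256x256 down to 1x1.
for size in mipmap_levels(256, 256):
    print(size)
```

For the 256x256 example in the text, this produces the base level plus the eight smaller images 128x128 through 1x1.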

PTX support for mipmaps

The PTX tex instruction supports three modes for specifying the LOD: base, level, and gradient. In base mode, the instruction always picks level 0. In level mode, an additional argument is provided to specify the LOD to fetch from. In grad mode, two floating-point vector arguments provide partials (e.g., {ds/dx, dt/dx} and {ds/dy, dt/dy} for a 2d texture), which the tex instruction uses to compute the LOD.
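As an illustration of grad mode, a common way to derive an LOD from such partials is the rho = max(|dPdx|, |dPdy|), lod = log2(rho) rule; this rule is an assumption for illustration only, as the exact formula used by the tex instruction is implementation-defined:

```python
import math

def lod_from_gradients(dPdx, dPdy):
    """Illustrative LOD estimate from screen-space partials.

    Assumes the conventional rho = max(|dPdx|, |dPdy|), lod = log2(rho)
    rule; the actual hardware computation may differ.
    """
    rho = max(math.hypot(*dPdx), math.hypot(*dPdy))
    return math.log2(rho)

# Partials spanning two texels per pixel select roughly LOD 1.
print(lod_from_gradients((2.0, 0.0), (0.0, 2.0)))  # -> 1.0
```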

These instructions provide access to texture memory.

  • tex

  • tld4

  • txq

9.7.10.3. Texture Instructions: tex

tex

Perform a texture memory lookup.

Syntax

tex.geom.v4.dtype.ctype  d, [a, c] {, e} {, f};
tex.geom.v4.dtype.ctype  d[|p], [a, b, c] {, e} {, f};  // explicit sampler

tex.geom.v2.f16x2.ctype  d[|p], [a, c] {, e} {, f};
tex.geom.v2.f16x2.ctype  d[|p], [a, b, c] {, e} {, f};  // explicit sampler

// mipmaps
tex.base.geom.v4.dtype.ctype   d[|p], [a, {b,} c] {, e} {, f};
tex.level.geom.v4.dtype.ctype  d[|p], [a, {b,} c], lod {, e} {, f};
tex.grad.geom.v4.dtype.ctype   d[|p], [a, {b,} c], dPdx, dPdy {, e} {, f};

tex.base.geom.v2.f16x2.ctype   d[|p], [a, {b,} c] {, e} {, f};
tex.level.geom.v2.f16x2.ctype  d[|p], [a, {b,} c], lod {, e} {, f};
tex.grad.geom.v2.f16x2.ctype   d[|p], [a, {b,} c], dPdx, dPdy {, e} {, f};

.geom  = { .1d, .2d, .3d, .a1d, .a2d, .cube, .acube, .2dms, .a2dms };
.dtype = { .u32, .s32, .f16, .f32 };
.ctype = {       .s32, .f32 };          // .cube, .acube require .f32
                                        // .2dms, .a2dms require .s32

Description

tex.{1d,2d,3d}

Texture lookup using a texture coordinate vector. The instruction loads data from the texture named by operand a at coordinates given by operand c into destination d. Operand c is a scalar or singleton tuple for 1d textures; is a two-element vector for 2d textures; and is a four-element vector for 3d textures, where the fourth element is ignored. An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture. The optional destination predicate p is set to True if data from the texture at the specified coordinates is resident in memory, False otherwise. When the optional destination predicate p is set to False, the data loaded will be all zeros. Memory residency of texture data at the specified coordinates is dependent on the execution environment setup using Driver API calls, prior to kernel launch. Refer to the Driver API documentation for more details including any system/implementation specific behavior.

An optional operand e may be specified. Operand e is a vector of .s32 values that specifies the coordinate offset. The offset is applied to coordinates before doing the texture lookup. The offset value is in the range of -8 to +7. Operand e is a singleton tuple for 1d textures; is a two-element vector for 2d textures; and is a four-element vector for 3d textures, where the fourth element is ignored.

An optional operand f may be specified for depth textures. Depth textures are a special type of texture which hold data from the depth buffer. The depth buffer contains depth information of each pixel. Operand f is an .f32 scalar value that specifies the depth compare value for depth textures. Each element fetched from the texture is compared against the value given in the f operand. If the comparison passes, the result is 1.0; otherwise the result is 0.0. These per-element comparison results are used for the filtering. When using the depth compare operand, the elements in the texture coordinate vector c have .f32 type.

The depth compare operand is not supported for 3d textures.
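The per-element comparison described above can be sketched as follows (Python; the comparison function used here, >=, is an assumed example, since the actual compare function and the subsequent filtering are sampler properties that this sketch does not model):

```python
def depth_compare(fetched_texels, f, passes=lambda texel, ref: texel >= ref):
    """Map each fetched depth value to 1.0/0.0 against reference f.

    Each texel is compared against the .f32 reference value f; a pass
    yields 1.0 and a fail yields 0.0. The 0.0/1.0 results would then
    feed the texture filter.
    """
    return [1.0 if passes(t, f) else 0.0 for t in fetched_texels]

print(depth_compare([0.25, 0.5, 0.75, 0.9], 0.5))  # -> [0.0, 1.0, 1.0, 1.0]
```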

The instruction returns a two-element vector for destination type .f16x2. For all other destination types, the instruction returns a four-element vector. Coordinates may be given in either signed 32-bit integer or 32-bit floating point form.

A texture base address is assumed to be aligned to a 16-byte boundary, and the address given by the coordinate vector must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined; i.e., the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault.

tex.{a1d,a2d}

Texture array selection, followed by texture lookup. The instruction first selects a texture from the texture array named by operand a using the index given by the first element of the array coordinate vector c. The instruction then loads data from the selected texture at coordinates given by the remaining elements of operand c into destination d. Operand c is a bit-size type vector or tuple containing an index into the array of textures followed by coordinates within the selected texture, as follows:

  • For 1d texture arrays, operand c has type .v2.b32. The first element is interpreted as an unsigned integer index (.u32) into the texture array, and the second element is interpreted as a 1d texture coordinate of type .ctype.

  • For 2d texture arrays, operand c has type .v4.b32. The first element is interpreted as an unsigned integer index (.u32) into the texture array, and the next two elements are interpreted as 2d texture coordinates of type .ctype. The fourth element is ignored.

An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.

An optional operand e may be specified. Operand e is a vector of .s32 values that specifies the coordinate offset. The offset is applied to coordinates before doing the texture lookup. The offset value is in the range of -8 to +7. Operand e is a singleton tuple for 1d texture arrays; and is a two-element vector for 2d texture arrays.

An optional operand f may be specified for depth texture arrays. Operand f is an .f32 scalar value that specifies the depth compare value for depth textures. When using the depth compare operand, the coordinates in the texture coordinate vector c have .f32 type.

The instruction returns a two-element vector for destination type .f16x2. For all other destination types, the instruction returns a four-element vector. The texture array index is a 32-bit unsigned integer, and texture coordinate elements are 32-bit signed integer or floating point values.

The optional destination predicate p is set to True if data from the texture at the specified coordinates is resident in memory, False otherwise. When the optional destination predicate p is set to False, the data loaded will be all zeros. Memory residency of texture data at the specified coordinates is dependent on the execution environment setup using Driver API calls, prior to kernel launch. Refer to the Driver API documentation for more details including any system/implementation specific behavior.

tex.cube

Cubemap texture lookup. The instruction loads data from the cubemap texture named by operand a at coordinates given by operand c into destination d. Cubemap textures are special two-dimensional layered textures consisting of six layers that represent the faces of a cube. All layers in a cubemap are of the same size and are square (i.e., width equals height).

When accessing a cubemap, the texture coordinate vector c has type .v4.f32, and comprises three floating-point coordinates (s, t, r) and a fourth padding argument which is ignored. Coordinates (s, t, r) are projected onto one of the six cube faces. The (s, t, r) coordinates can be thought of as a direction vector emanating from the center of the cube. Of the three coordinates (s, t, r), the coordinate of the largest magnitude (the major axis) selects the cube face. Then, the other two coordinates (the minor axes) are divided by the absolute value of the major axis to produce a new (s, t) coordinate pair to lookup into the selected cube face.
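The face-selection rule can be sketched as follows (Python; the face numbering and the choice of which minor axes map to u and v are illustrative assumptions, not the exact hardware convention, and the final remapping into [0, 1] face coordinates is omitted):

```python
def cube_face_coords(s, t, r):
    """Select a cube face from direction (s, t, r) and derive 2D coords.

    The major axis (largest absolute component) picks the face; the two
    minor axes are divided by the major axis magnitude, as described in
    the text.
    """
    axes = [abs(s), abs(t), abs(r)]
    major = axes.index(max(axes))      # 0 -> +/-x, 1 -> +/-y, 2 -> +/-z
    m = axes[major]
    if major == 0:
        face, u, v = (0 if s > 0 else 1), t / m, r / m
    elif major == 1:
        face, u, v = (2 if t > 0 else 3), s / m, r / m
    else:
        face, u, v = (4 if r > 0 else 5), s / m, t / m
    return face, u, v

print(cube_face_coords(1.0, 0.25, -0.5))  # -> (0, 0.25, -0.5)
```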

An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.

Offset vector operand e is not supported for cubemap textures.

An optional operand f may be specified for cubemap depth textures. Operand f is an .f32 scalar value that specifies the depth compare value for cubemap depth textures.

The optional destination predicate p is set to True if data from the texture at the specified coordinates is resident in memory, False otherwise. When the optional destination predicate p is set to False, the data loaded will be all zeros. Memory residency of texture data at the specified coordinates is dependent on the execution environment setup using Driver API calls, prior to kernel launch. Refer to the Driver API documentation for more details including any system/implementation specific behavior.

tex.acube

Cubemap array selection, followed by cubemap lookup. The instruction first selects a cubemap texture from the cubemap array named by operand a using the index given by the first element of the array coordinate vector c. The instruction then loads data from the selected cubemap texture at coordinates given by the remaining elements of operand c into destination d.

Cubemap array textures consist of an array of cubemaps, i.e., the total number of layers is a multiple of six. When accessing a cubemap array texture, the coordinate vector c has type .v4.b32. The first element is interpreted as an unsigned integer index (.u32) into the cubemap array, and the remaining three elements are interpreted as floating-point cubemap coordinates (s, t, r), used to lookup in the selected cubemap as described above.

An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.

Offset vector operand e is not supported for cubemap texture arrays.

An optional operand f may be specified for cubemap depth texture arrays. Operand f is an .f32 scalar value that specifies the depth compare value for cubemap depth textures.

The optional destination predicate p is set to True if data from the texture at the specified coordinates is resident in memory, False otherwise. When the optional destination predicate p is set to False, the data loaded will be all zeros. Memory residency of texture data at the specified coordinates is dependent on the execution environment setup using Driver API calls, prior to kernel launch. Refer to the Driver API documentation for more details including any system/implementation specific behavior.

tex.2dms

Multi-sample texture lookup using a texture coordinate vector. Multi-sample textures consist of multiple samples per data element. The instruction loads data from the texture named by operand a from the sample number given by the first element of the operand c, at coordinates given by the remaining elements of operand c, into destination d. When accessing a multi-sample texture, the texture coordinate vector c has type .v4.b32. The first element in operand c is interpreted as an unsigned integer sample number (.u32), and the next two elements are interpreted as signed integer (.s32) 2d texture coordinates. The fourth element is ignored. An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.

An optional operand e may be specified. Operand e is a vector of type .v2.s32 that specifies the coordinate offset. The offset is applied to coordinates before doing the texture lookup. The offset value is in the range of -8 to +7.

The depth compare operand f is not supported for multi-sample textures.

The optional destination predicate p is set to True if data from the texture at the specified coordinates is resident in memory, False otherwise. When the optional destination predicate p is set to False, the data loaded will be all zeros. Memory residency of texture data at the specified coordinates is dependent on the execution environment setup using Driver API calls, prior to kernel launch. Refer to the Driver API documentation for more details including any system/implementation specific behavior.

tex.a2dms

Multi-sample texture array selection, followed by multi-sample texture lookup. The instruction first selects a multi-sample texture from the multi-sample texture array named by operand a using the index given by the first element of the array coordinate vector c. The instruction then loads data from the selected multi-sample texture from the sample number given by the second element of the operand c, at coordinates given by the remaining elements of operand c, into destination d. When accessing a multi-sample texture array, the texture coordinate vector c has type .v4.b32. The first element in operand c is interpreted as an unsigned integer sample number, the second element is interpreted as an unsigned integer index (.u32) into the multi-sample texture array, and the next two elements are interpreted as signed integer (.s32) 2d texture coordinates. An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.

An optional operand e may be specified. Operand e is a vector of type .v2.s32 values that specifies the coordinate offset. The offset is applied to coordinates before doing the texture lookup. The offset value is in the range of -8 to +7.

The depth compare operand f is not supported for multi-sample texture arrays.

The optional destination predicate p is set to True if data from the texture at the specified coordinates is resident in memory, False otherwise. When the optional destination predicate p is set to False, the data loaded will be all zeros. Memory residency of texture data at the specified coordinates is dependent on the execution environment setup using Driver API calls, prior to kernel launch. Refer to the Driver API documentation for more details including any system/implementation specific behavior.

Mipmaps

.base (lod zero)

Pick level 0 (base level). This is the default if no mipmap mode is specified. No additional arguments.

.level (lod explicit)

Requires an additional 32-bit scalar argument,lod, which contains the LOD to fetch from. Thetype oflod follows.ctype (either.s32 or.f32). Geometries.2dms and.a2dms are not supported in this mode.

.grad (lod gradient)

Requires two.f32 vectors,dPdx anddPdy, that specify the partials. The vectors aresingletons for 1d and a1d textures; are two-element vectors for 2d and a2d textures; and arefour-element vectors for 3d, cube and acube textures, where the fourth element is ignored for 3dand cube geometries. Geometries.2dms and.a2dms are not supported in this mode.

For mipmap texture lookup, an optional operande may be specified. Operande is a vector of.s32 that specifies coordinate offset. Offset is applied to coordinates before doing texturelookup. Offset value is in the range of -8 to +7. Offset vector operand is not supported for cubeand cubemap geometries.

An optional operandf may be specified for mipmap textures. Operandf is.f32 scalarvalue that specifies depth compare value for depth textures. When using depth compare operand, thecoordinates in texture coordinate vectorc have.f32 type.

The optional destination predicatep is set toTrue if data from texture at specifiedcoordinates is resident in memory,False otherwise. When optional destination predicatep isset toFalse, data loaded will be all zeros. Memory residency of Texture Data at specifiedcoordinates is dependent on execution environment setup using Driver API calls, prior to kernellaunch. Refer to Driver API documentation for more details including any system/implementationspecific behavior.

Depth compare operand is not supported for 3d textures.

Indirect texture access

Beginning with PTX ISA version 3.1, indirect texture access is supported in unified mode for target architectures sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .texref variable.

Notes

For compatibility with prior versions of PTX, the square brackets are not required and .v4 coordinate vectors are allowed for any geometry, with the extra elements being ignored.

PTX ISA Notes

Unified mode texturing introduced in PTX ISA version 1.0. Extension using opaque .texref and .samplerref types and independent mode texturing introduced in PTX ISA version 1.5.

Texture arrays tex.{a1d,a2d} introduced in PTX ISA version 2.3.

Cubemaps and cubemap arrays introduced in PTX ISA version 3.0.

Support for mipmaps introduced in PTX ISA version 3.1.

Indirect texture access introduced in PTX ISA version 3.1.

Multi-sample textures and multi-sample texture arrays introduced in PTX ISA version 3.2.

Support for textures returning .f16 and .f16x2 data introduced in PTX ISA version 4.2.

Support for tex.grad.{cube,acube} introduced in PTX ISA version 4.3.

Offset vector operand introduced in PTX ISA version 4.3.

Depth compare operand introduced in PTX ISA version 4.3.

Support for optional destination predicate introduced in PTX ISA version 7.1.

Target ISA Notes

Supported on all target architectures.

The cubemap array geometry (.acube) requires sm_20 or higher.

Mipmaps require sm_20 or higher.

Indirect texture access requires sm_20 or higher.

Multi-sample textures and multi-sample texture arrays require sm_30 or higher.

Texture fetch returning .f16 and .f16x2 data requires sm_53 or higher.

tex.grad.{cube,acube} requires sm_20 or higher.

Offset vector operand requires sm_30 or higher.

Depth compare operand requires sm_30 or higher.

Support for the optional destination predicate requires sm_60 or higher.

Examples

// Example of unified mode texturing
// - f4 is required to pad four-element tuple and is ignored
tex.3d.v4.s32.s32  {r1,r2,r3,r4}, [tex_a,{f1,f2,f3,f4}];

// Example of independent mode texturing
tex.1d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,smpl_x,{f1}];

// Example of 1D texture array, independent texturing mode
tex.a1d.v4.s32.s32 {r1,r2,r3,r4}, [tex_a,smpl_x,{idx,s1}];

// Example of 2D texture array, unified texturing mode
// - f3 is required to pad four-element tuple and is ignored
tex.a2d.v4.s32.f32 {r1,r2,r3,r4}, [tex_a,{idx,f1,f2,f3}];

// Example of cubemap array, unified texturing mode
tex.acube.v4.f32.f32 {r0,r1,r2,r3}, [tex_cuarray,{idx,f1,f2,f3}];

// Example of multi-sample texture, unified texturing mode
tex.2dms.v4.s32.s32 {r0,r1,r2,r3}, [tex_ms,{sample,r6,r7,r8}];

// Example of multi-sample texture, independent texturing mode
tex.2dms.v4.s32.s32 {r0,r1,r2,r3}, [tex_ms, smpl_x,{sample,r6,r7,r8}];

// Example of multi-sample texture array, unified texturing mode
tex.a2dms.v4.s32.s32 {r0,r1,r2,r3}, [tex_ams,{idx,sample,r6,r7}];

// Example of texture returning .f16 data
tex.1d.v4.f16.f32  {h1,h2,h3,h4}, [tex_a,smpl_x,{f1}];

// Example of texture returning .f16x2 data
tex.1d.v2.f16x2.f32  {h1,h2}, [tex_a,smpl_x,{f1}];

// Example of 3d texture access with tex.grad, unified texturing mode
tex.grad.3d.v4.f32.f32 {%f4,%f5,%f6,%f7}, [tex_3d,{%f0,%f0,%f0,%f0}],
                       {fl0,fl1,fl2,fl3}, {fl0,fl1,fl2,fl3};

// Example of cube texture access with tex.grad, unified texturing mode
tex.grad.cube.v4.f32.f32 {%f4,%f5,%f6,%f7}, [tex_cube,{%f0,%f0,%f0,%f0}],
                         {fl0,fl1,fl2,fl3}, {fl0,fl1,fl2,fl3};

// Example of 1d texture lookup with offset, unified texturing mode
tex.1d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a, {f1}], {r5};

// Example of 2d texture array lookup with offset, unified texturing mode
tex.a2d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,{idx,f1,f2}], {f5,f6};

// Example of 2d mipmap texture lookup with offset, unified texturing mode
tex.level.2d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,{f1,f2}],
                         flvl, {r7, r8};

// Example of 1d depth texture lookup with compare, unified texturing mode
tex.1d.v4.f32.f32  {f1,f2,f3,f4}, [tex_a, {f1}], f0;

// Example of depth 2d texture array lookup with offset, compare
tex.a2d.v4.s32.f32  {f0,f1,f2,f3}, [tex_a,{idx,f4,f5}], {r5,r6}, f6;

// Example of destination predicate use
tex.3d.v4.s32.s32 {r1,r2,r3,r4}|p, [tex_a,{f1,f2,f3,f4}];

9.7.10.4. Texture Instructions: tld4

tld4

Perform a texture fetch of the 4-texel bilerp footprint.

Syntax

tld4.comp.2d.v4.dtype.f32    d[|p], [a, c] {, e} {, f};
tld4.comp.geom.v4.dtype.f32  d[|p], [a, b, c] {, e} {, f};  // explicit sampler

.comp  = { .r, .g, .b, .a };
.geom  = { .2d, .a2d, .cube, .acube };
.dtype = { .u32, .s32, .f32 };

Description

Texture fetch of the 4-texel bilerp footprint using a texture coordinate vector. The instruction loads the bilerp footprint from the texture named by operand a at coordinates given by operand c into vector destination d. The texture component fetched for each texel sample is specified by .comp. The four texel samples are placed into destination vector d in counter-clockwise order starting at lower left.
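As a rough illustration of the footprint ordering, consider the following sketch (Python, for illustration only; the integer texel addressing and the y-up orientation of the grid are assumptions, not part of the PTX specification):

```python
def tld4_footprint(i, j):
    # 2x2 bilerp footprint whose lower-left texel is (i, j), listed in the
    # counter-clockwise order described above: lower-left, lower-right,
    # upper-right, upper-left (assuming a y-up texel grid).
    return [(i, j), (i + 1, j), (i + 1, j + 1), (i, j + 1)]
```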

An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.

The optional destination predicate p is set to True if data from the texture at the specified coordinates is resident in memory, and False otherwise. When the optional destination predicate p is set to False, the data loaded will be all zeros. Memory residency of texture data at the specified coordinates depends on the execution environment set up using Driver API calls prior to kernel launch. Refer to the Driver API documentation for more details, including any system/implementation-specific behavior.

An optional operand f may be specified for depth textures. Depth textures are a special type of texture that holds data from the depth buffer, which contains the depth information of each pixel. Operand f is an .f32 scalar value that specifies the depth compare value for depth textures. Each element fetched from the texture is compared against the value given in operand f. If the comparison passes, the result is 1.0; otherwise the result is 0.0. These per-element comparison results are used for the filtering.
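A minimal sketch of this depth-compare path, assuming a less-or-equal comparison function (the actual comparison function is a sampler property and is not specified by this instruction):

```python
def depth_compare_filter(texels, weights, ref):
    # Each fetched texel is compared against the reference value in operand f;
    # a passing comparison yields 1.0, a failing one 0.0 (<= assumed here).
    results = [1.0 if ref <= t else 0.0 for t in texels]
    # The per-element results, not the raw depth values, are then filtered.
    return sum(w * r for w, r in zip(weights, results))
```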

A texture base address is assumed to be aligned to a 16 byte boundary, and the address given by thecoordinate vector must be naturally aligned to a multiple of the access size. If an address is notproperly aligned, the resulting behavior is undefined; i.e., the access may proceed by silentlymasking off low-order address bits to achieve proper rounding, or the instruction may fault.

tld4.2d

For 2D textures, operand c specifies coordinates as a two-element, 32-bit floating-point vector.

An optional operand e may be specified. Operand e is a vector of type .v2.s32 that specifies a coordinate offset. The offset is applied to the coordinates before doing the texture fetch. The offset value is in the range -8 to +7.

tld4.a2d

Texture array selection, followed by tld4 texture fetch of a 2d texture. For 2d texture arrays, operand c is a four-element, 32-bit vector. The first element in operand c is interpreted as an unsigned integer index (.u32) into the texture array, and the next two elements are interpreted as 32-bit floating-point coordinates of the 2d texture. The fourth element is ignored.

An optional operand e may be specified. Operand e is a vector of type .v2.s32 that specifies a coordinate offset. The offset is applied to the coordinates before doing the texture fetch. The offset value is in the range -8 to +7.

tld4.cube

For cubemap textures, operand c specifies a four-element vector which comprises three floating-point coordinates (s, t, r) and a fourth padding argument which is ignored.

Cubemap textures are special two-dimensional layered textures consisting of six layers thatrepresent the faces of a cube. All layers in a cubemap are of the same size and are square (i.e.,width equals height).

Coordinates (s, t, r) are projected onto one of the six cube faces. The (s, t, r) coordinates can be thought of as a direction vector emanating from the center of the cube. Of the three coordinates (s, t, r), the coordinate of the largest magnitude (the major axis) selects the cube face. Then, the other two coordinates (the minor axes) are divided by the absolute value of the major axis to produce a new (s, t) coordinate pair to lookup into the selected cube face.
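The face-selection rule above can be sketched as follows (Python; the face labels and the ordering of the two minor axes are illustrative assumptions — the actual face numbering and in-face orientation are defined by the texture hardware, not by this sketch):

```python
def cube_face_project(s, t, r):
    # The coordinate of largest magnitude (the major axis) selects the face;
    # its sign distinguishes the two faces along that axis.
    coords = {'s': s, 't': t, 'r': r}
    major = max(coords, key=lambda k: abs(coords[k]))
    face = (major, coords[major] >= 0)
    # The two minor-axis coordinates are divided by |major| to produce the
    # in-face lookup pair; both results land in [-1, 1].
    m = abs(coords[major])
    minors = [coords[k] for k in 'str' if k != major]
    return face, (minors[0] / m, minors[1] / m)
```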

Offset vector operand e is not supported for cubemap textures.

tld4.acube

Cubemap array selection, followed by tld4 texture fetch of a cubemap texture. The first element in operand c is interpreted as an unsigned integer index (.u32) into the cubemap texture array, and the remaining three elements are interpreted as floating-point cubemap coordinates (s, t, r), used to lookup in the selected cubemap.

Offset vector operand e is not supported for cubemap texture arrays.

Indirect texture access

Beginning with PTX ISA version 3.1, indirect texture access is supported in unified mode for target architectures sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .texref variable.

PTX ISA Notes

Introduced in PTX ISA version 2.2.

Indirect texture access introduced in PTX ISA version 3.1.

tld4.{a2d,cube,acube} introduced in PTX ISA version 4.3.

Offset vector operand introduced in PTX ISA version 4.3.

Depth compare operand introduced in PTX ISA version 4.3.

Support for optional destination predicate introduced in PTX ISA version 7.1.

Target ISA Notes

tld4 requires sm_20 or higher.

Indirect texture access requires sm_20 or higher.

tld4.{a2d,cube,acube} requires sm_30 or higher.

Offset vector operand requires sm_30 or higher.

Depth compare operand requires sm_30 or higher.

Support for the optional destination predicate requires sm_60 or higher.

Examples

// Example of unified mode texturing
tld4.r.2d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,{f1,f2}];

// Example of independent mode texturing
tld4.r.2d.v4.u32.f32  {u1,u2,u3,u4}, [tex_a,smpl_x,{f1,f2}];

// Example of unified mode texturing using offset
tld4.r.2d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,{f1,f2}], {r5, r6};

// Example of unified mode texturing using compare
tld4.r.2d.v4.f32.f32  {f1,f2,f3,f4}, [tex_a,{f5,f6}], f7;

// Example of optional destination predicate
tld4.r.2d.v4.f32.f32 {f1,f2,f3,f4}|p, [tex_a,{f5,f6}], f7;

9.7.10.5. Texture Instructions: txq

txq

Query texture and sampler attributes.

Syntax

txq.tquery.b32         d, [a];       // texture attributes
txq.level.tlquery.b32  d, [a], lod;  // texture attributes
txq.squery.b32         d, [a];       // sampler attributes

.tquery  = { .width, .height, .depth,
             .channel_data_type, .channel_order,
             .normalized_coords, .array_size,
             .num_mipmap_levels, .num_samples };

.tlquery = { .width, .height, .depth };

.squery  = { .force_unnormalized_coords, .filter_mode,
             .addr_mode_0, .addr_mode_1, .addr_mode_2 };

Description

Query an attribute of a texture or sampler. Operand a is either a .texref or .samplerref variable, or a .u64 register.

Query

Returns

.width

.height

.depth

value in elements

.channel_data_type

Unsigned integer corresponding to the source language’s channel data type enumeration. If the source language combines channel data type and channel order into a single enumeration type, that value is returned for both channel_data_type and channel_order queries.

.channel_order

Unsigned integer corresponding to the source language’s channel order enumeration. If the source language combines channel data type and channel order into a single enumeration type, that value is returned for both channel_data_type and channel_order queries.

.normalized_coords

1 (True) or 0 (False).

.force_unnormalized_coords

1 (True) or 0 (False). Defined only for .samplerref variables in independent texture mode. Overrides the normalized_coords field of a .texref variable used with a .samplerref in a tex instruction.

.filter_mode

Integer from enum { nearest, linear }

.addr_mode_0

.addr_mode_1

.addr_mode_2

Integer from enum { wrap, mirror, clamp_ogl, clamp_to_edge, clamp_to_border }

.array_size

For a texture array, number of textures in array, 0 otherwise.

.num_mipmap_levels

For a mipmapped texture, number of levels of detail (LOD), 0 otherwise.

.num_samples

For a multi-sample texture, number of samples, 0 otherwise.

Texture attributes are queried by supplying a .texref argument to txq. In unified mode, sampler attributes are also accessed via a .texref argument, and in independent mode sampler attributes are accessed via a separate .samplerref argument.

txq.level

txq.level requires an additional 32-bit integer argument, lod, which specifies the LOD, and queries the requested attribute for the specified LOD.

Indirect texture access

Beginning with PTX ISA version 3.1, indirect texture access is supported in unified mode for target architectures sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .texref variable.

PTX ISA Notes

Introduced in PTX ISA version 1.5.

Channel data type and channel order queries were added in PTX ISA version 2.1.

The .force_unnormalized_coords query was added in PTX ISA version 2.2.

Indirect texture access introduced in PTX ISA version 3.1.

.array_size, .num_mipmap_levels, and .num_samples queries were added in PTX ISA version 4.1.

txq.level introduced in PTX ISA version 4.3.

Target ISA Notes

Supported on all target architectures.

Indirect texture access requires sm_20 or higher.

Querying the number of mipmap levels requires sm_20 or higher.

Querying the number of samples requires sm_30 or higher.

txq.level requires sm_30 or higher.

Examples

txq.width.b32       %r1, [tex_A];
txq.filter_mode.b32 %r1, [tex_A];   // unified mode
txq.addr_mode_0.b32 %r1, [smpl_B];  // independent mode
txq.level.width.b32 %r1, [tex_A], %r_lod;

9.7.10.6. Texture Instructions: istypep

istypep

Query whether a register points to an opaque variable of a specified type.

Syntax

istypep.type   p, a;  // result is .pred

.type = { .texref, .samplerref, .surfref };

Description

Write predicate register p with 1 if register a points to an opaque variable of the specified type, and with 0 otherwise. Destination p has type .pred; the source address operand must be of type .u64.

PTX ISA Notes

Introduced in PTX ISA version 4.0.

Target ISA Notes

istypep requires sm_30 or higher.

Examples

istypep.texref     istex,     tptr;
istypep.samplerref issampler, sptr;
istypep.surfref    issurface, surfptr;

9.7.11. Surface Instructions

This section describes PTX instructions for accessing surfaces. PTX supports the followingoperations on surface descriptors:

  • Static initialization of surface descriptors.

  • Module-scope and per-entry scope definitions of surface descriptors.

  • Ability to query fields within surface descriptors.

These instructions provide access to surface memory.

  • suld

  • sust

  • sured

  • suq

9.7.11.1. Surface Instructions: suld

suld

Load from surface memory.

Syntax

suld.b.geom{.cop}.vec.dtype.clamp  d, [a, b];  // unformatted

.geom  = { .1d, .2d, .3d, .a1d, .a2d };
.cop   = { .ca, .cg, .cs, .cv };               // cache operation
.vec   = { none, .v2, .v4 };
.dtype = { .b8 , .b16, .b32, .b64 };
.clamp = { .trap, .clamp, .zero };

Description

suld.b.{1d,2d,3d}

Load from surface memory using a surface coordinate vector. The instruction loads data from the surface named by operand a at coordinates given by operand b into destination d. Operand a is a .surfref variable or .u64 register. Operand b is a scalar or singleton tuple for 1d surfaces; is a two-element vector for 2d surfaces; and is a four-element vector for 3d surfaces, where the fourth element is ignored. Coordinate elements are of type .s32.

suld.b performs an unformatted load of binary data. The lowest dimension coordinate represents a byte offset into the surface and is not scaled, and the size of the data transfer matches the size of destination operand d.
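Because the lowest-dimension coordinate is an unscaled byte offset, the caller must scale the element index by the element size itself; a small sketch of this addressing convention:

```python
def suld_b_x_coord(elem_index, elem_size_bytes):
    # suld.b treats the x coordinate as a byte offset into the surface,
    # not an element index: reading element N of a surface with
    # 4-byte elements needs x = 4 * N. (Higher dimensions, where present,
    # remain ordinary row/layer indices.)
    return elem_index * elem_size_bytes
```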

suld.b.{a1d,a2d}

Surface layer selection, followed by a load from the selected surface. The instruction first selects a surface layer from the surface array named by operand a using the index given by the first element of the array coordinate vector b. The instruction then loads data from the selected surface at coordinates given by the remaining elements of operand b into destination d. Operand a is a .surfref variable or .u64 register. Operand b is a bit-size type vector or tuple containing an index into the array of surfaces followed by coordinates within the selected surface, as follows:

For 1d surface arrays, operand b has type .v2.b32. The first element is interpreted as an unsigned integer index (.u32) into the surface array, and the second element is interpreted as a 1d surface coordinate of type .s32.

For 2d surface arrays, operand b has type .v4.b32. The first element is interpreted as an unsigned integer index (.u32) into the surface array, and the next two elements are interpreted as 2d surface coordinates of type .s32. The fourth element is ignored.

A surface base address is assumed to be aligned to a 16 byte boundary, and the address given by thecoordinate vector must be naturally aligned to a multiple of the access size. If an address is notproperly aligned, the resulting behavior is undefined; i.e., the access may proceed by silentlymasking off low-order address bits to achieve proper rounding, or the instruction may fault.

The .clamp field specifies how to handle out-of-bounds addresses:

.trap

causes an execution trap on out-of-bounds addresses

.clamp

loads data at the nearest surface location (sized appropriately)

.zero

loads zero for out-of-bounds addresses

Indirect surface access

Beginning with PTX ISA version 3.1, indirect surface access is supported for target architectures sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .surfref variable.

PTX ISA Notes

suld.b.trap introduced in PTX ISA version 1.5.

Additional clamp modifiers and cache operations introduced in PTX ISA version 2.0.

suld.b.3d and suld.b.{a1d,a2d} introduced in PTX ISA version 3.0.

Indirect surface access introduced in PTX ISA version 3.1.

Target ISA Notes

suld.b supported on all target architectures.

sm_1x targets support only the .trap clamping modifier.

suld.3d and suld.{a1d,a2d} require sm_20 or higher.

Indirect surface access requires sm_20 or higher.

Cache operations require sm_20 or higher.

Examples

suld.b.1d.v4.b32.trap  {s1,s2,s3,s4}, [surf_B, {x}];
suld.b.3d.v2.b64.trap  {r1,r2}, [surf_A, {x,y,z,w}];
suld.b.a1d.v2.b32      {r0,r1}, [surf_C, {idx,x}];
suld.b.a2d.b32         r0, [surf_D, {idx,x,y,z}];  // z ignored

9.7.11.2. Surface Instructions: sust

sust

Store to surface memory.

Syntax

sust.b.{1d,2d,3d}{.cop}.vec.ctype.clamp  [a, b], c;  // unformatted
sust.p.{1d,2d,3d}.vec.b32.clamp          [a, b], c;  // formatted
sust.b.{a1d,a2d}{.cop}.vec.ctype.clamp   [a, b], c;  // unformatted

.cop   = { .wb, .cg, .cs, .wt };                     // cache operation
.vec   = { none, .v2, .v4 };
.ctype = { .b8 , .b16, .b32, .b64 };
.clamp = { .trap, .clamp, .zero };

Description

sust.{1d,2d,3d}

Store to surface memory using a surface coordinate vector. The instruction stores data from operand c to the surface named by operand a at coordinates given by operand b. Operand a is a .surfref variable or .u64 register. Operand b is a scalar or singleton tuple for 1d surfaces; is a two-element vector for 2d surfaces; and is a four-element vector for 3d surfaces, where the fourth element is ignored. Coordinate elements are of type .s32.

sust.b performs an unformatted store of binary data. The lowest dimension coordinate represents a byte offset into the surface and is not scaled. The size of the data transfer matches the size of source operand c.

sust.p performs a formatted store of a vector of 32-bit data values to a surface sample. The source vector elements are interpreted left-to-right as R, G, B, and A surface components. These elements are written to the corresponding surface sample components. Source elements that do not occur in the surface sample are ignored. Surface sample components that do not occur in the source vector will be written with an unpredictable value. The lowest dimension coordinate represents a sample offset rather than a byte offset.

The source data interpretation is based on the surface sample format as follows: if the surface format contains UNORM, SNORM, or FLOAT data, then .f32 is assumed; if the surface format contains UINT data, then .u32 is assumed; if the surface format contains SINT data, then .s32 is assumed. The source data is then converted from this type to the surface sample format.
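The format-based source-type selection above amounts to a small lookup table; a sketch (the format-class strings are labels for the classes named in the text):

```python
def sust_p_assumed_type(format_class):
    # Maps the surface sample format class to the source register type that
    # sust.p assumes before converting the data to the sample format.
    table = {'UNORM': '.f32', 'SNORM': '.f32', 'FLOAT': '.f32',
             'UINT': '.u32', 'SINT': '.s32'}
    return table[format_class]
```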

sust.b.{a1d,a2d}

Surface layer selection, followed by an unformatted store to the selected surface. The instruction first selects a surface layer from the surface array named by operand a using the index given by the first element of the array coordinate vector b. The instruction then stores the data in operand c to the selected surface at coordinates given by the remaining elements of operand b. Operand a is a .surfref variable or .u64 register. Operand b is a bit-size type vector or tuple containing an index into the array of surfaces followed by coordinates within the selected surface, as follows:

  • For 1d surface arrays, operand b has type .v2.b32. The first element is interpreted as an unsigned integer index (.u32) into the surface array, and the second element is interpreted as a 1d surface coordinate of type .s32.

  • For 2d surface arrays, operand b has type .v4.b32. The first element is interpreted as an unsigned integer index (.u32) into the surface array, and the next two elements are interpreted as 2d surface coordinates of type .s32. The fourth element is ignored.

A surface base address is assumed to be aligned to a 16 byte boundary, and the address given by thecoordinate vector must be naturally aligned to a multiple of the access size. If an address is notproperly aligned, the resulting behavior is undefined; i.e., the access may proceed by silentlymasking off low-order address bits to achieve proper rounding, or the instruction may fault.

The .clamp field specifies how to handle out-of-bounds addresses:

.trap

causes an execution trap on out-of-bounds addresses

.clamp

stores data at the nearest surface location (sized appropriately)

.zero

drops stores to out-of-bounds addresses

Indirect surface access

Beginning with PTX ISA version 3.1, indirect surface access is supported for target architectures sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .surfref variable.

PTX ISA Notes

sust.b.trap introduced in PTX ISA version 1.5.

sust.p, additional clamp modifiers, and cache operations introduced in PTX ISA version 2.0.

sust.b.3d and sust.b.{a1d,a2d} introduced in PTX ISA version 3.0.

Indirect surface access introduced in PTX ISA version 3.1.

Target ISA Notes

sust.b supported on all target architectures.

sm_1x targets support only the .trap clamping modifier.

sust.3d and sust.{a1d,a2d} require sm_20 or higher.

sust.p requires sm_20 or higher.

Indirect surface access requires sm_20 or higher.

Cache operations require sm_20 or higher.

Examples

sust.p.1d.v4.b32.trap  [surf_B, {x}], {f1,f2,f3,f4};
sust.b.3d.v2.b64.trap  [surf_A, {x,y,z,w}], {r1,r2};
sust.b.a1d.v2.b64      [surf_C, {idx,x}], {r1,r2};
sust.b.a2d.b32         [surf_D, {idx,x,y,z}], r0;  // z ignored

9.7.11.3. Surface Instructions: sured

sured

Reduce surface memory.

Syntax

sured.b.op.geom.ctype.clamp  [a,b],c;  // byte addressing
sured.p.op.geom.ctype.clamp  [a,b],c;  // sample addressing

.op    = { .add, .min, .max, .and, .or };
.geom  = { .1d, .2d, .3d };
.ctype = { .u32, .u64, .s32, .b32, .s64 };  // for sured.b
.ctype = { .b32, .b64 };                    // for sured.p
.clamp = { .trap, .clamp, .zero };

Description

Reduction to surface memory using a surface coordinate vector. The instruction performs a reduction operation with data from operand c to the surface named by operand a at coordinates given by operand b. Operand a is a .surfref variable or .u64 register. Operand b is a scalar or singleton tuple for 1d surfaces; is a two-element vector for 2d surfaces; and is a four-element vector for 3d surfaces, where the fourth element is ignored. Coordinate elements are of type .s32.

sured.b performs an unformatted reduction on .u32, .s32, .b32, .u64, or .s64 data. The lowest dimension coordinate represents a byte offset into the surface and is not scaled. Operation add applies to .u32, .u64, and .s32 types; min and max apply to .u32, .s32, .u64 and .s64 types; operations and and or apply to .b32 type.

sured.p performs a reduction on sample-addressed data. The lowest dimension coordinate represents a sample offset rather than a byte offset. The instruction type .b64 is restricted to min and max operations. For type .b32, the data is interpreted as .u32 or .s32 based on the surface sample format as follows: if the surface format contains UINT data, then .u32 is assumed; if the surface format contains SINT data, then .s32 is assumed. For type .b64, if the surface format contains UINT data, then .u64 is assumed; if the surface format contains SINT data, then .s64 is assumed.
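The op/type legality rules for the two addressing forms can be summarized in a small checker sketch (the function and its labels are illustrative; the rules restate the two preceding paragraphs):

```python
def sured_combo_ok(variant, op, ctype):
    # variant 'b': byte addressing (sured.b); variant 'p': sample addressing.
    if variant == 'b':
        legal = {'add': {'.u32', '.u64', '.s32'},
                 'min': {'.u32', '.s32', '.u64', '.s64'},
                 'max': {'.u32', '.s32', '.u64', '.s64'},
                 'and': {'.b32'},
                 'or':  {'.b32'}}
        return ctype in legal.get(op, set())
    # sured.p: .b64 is restricted to min/max; .b32 works with the listed ops.
    if ctype == '.b64':
        return op in ('min', 'max')
    return ctype == '.b32' and op in ('add', 'min', 'max', 'and', 'or')
```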

A surface base address is assumed to be aligned to a 16 byte boundary, and the address given by thecoordinate vector must be naturally aligned to a multiple of the access size. If an address is notproperly aligned, the resulting behavior is undefined; i.e., the access may proceed by silentlymasking off low-order address bits to achieve proper rounding, or the instruction may fault.

The .clamp field specifies how to handle out-of-bounds addresses:

.trap

causes an execution trap on out-of-bounds addresses

.clamp

stores data at the nearest surface location (sized appropriately)

.zero

drops stores to out-of-bounds addresses

Indirect surface access

Beginning with PTX ISA version 3.1, indirect surface access is supported for target architecturesm_20 or higher. In indirect access, operanda is a.u64 register holding the address ofa.surfref variable.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Indirect surface access introduced in PTX ISA version 3.1.

.u64/.s64/.b64 types with .min/.max operations introduced in PTX ISA version 8.1.

Target ISA Notes

sured requires sm_20 or higher.

Indirect surface access requires sm_20 or higher.

.u64/.s64/.b64 types with .min/.max operations require sm_50 or higher.

Examples

sured.b.add.2d.u32.trap  [surf_A, {x,y}], r1;
sured.p.min.1d.u32.trap  [surf_B, {x}], r1;
sured.b.max.1d.u64.trap  [surf_C, {x}], r1;
sured.p.min.1d.b64.trap  [surf_D, {x}], r1;

9.7.11.4. Surface Instructions: suq

suq

Query a surface attribute.

Syntax

suq.query.b32   d, [a];

.query = { .width, .height, .depth,
           .channel_data_type, .channel_order,
           .array_size, .memory_layout };

Description

Query an attribute of a surface. Operand a is a .surfref variable or a .u64 register.

Query

Returns

.width

.height

.depth

value in elements

.channel_data_type

Unsigned integer corresponding to the source language’s channel data type enumeration. If the source language combines channel data type and channel order into a single enumeration type, that value is returned for both channel_data_type and channel_order queries.

.channel_order

Unsigned integer corresponding to the source language’s channel order enumeration. If the source language combines channel data type and channel order into a single enumeration type, that value is returned for both channel_data_type and channel_order queries.

.array_size

For a surface array, number of surfaces in array, 0 otherwise.

.memory_layout

1 for a surface with linear memory layout; 0 otherwise.

Indirect surface access

Beginning with PTX ISA version 3.1, indirect surface access is supported for target architectures sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .surfref variable.

PTX ISA Notes

Introduced in PTX ISA version 1.5.

Channel data type and channel order queries added in PTX ISA version 2.1.

Indirect surface access introduced in PTX ISA version 3.1.

The.array_size query was added in PTX ISA version 4.1.

The.memory_layout query was added in PTX ISA version 4.2.

Target ISA Notes

Supported on all target architectures.

Indirect surface access requires sm_20 or higher.

Examples

suq.width.b32       %r1, [surf_A];

9.7.12. Control Flow Instructions

The following PTX instructions and syntax are for controlling execution in a PTX program:

  • {}

  • @

  • bra

  • call

  • ret

  • exit

9.7.12.1. Control Flow Instructions: {}

{}

Instruction grouping.

Syntax

{ instructionList }

Description

The curly braces create a group of instructions, used primarily for defining a function body. Thecurly braces also provide a mechanism for determining the scope of a variable: any variable declaredwithin a scope is not available outside the scope.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

{ add.s32  a,b,c; mov.s32  d,a; }

9.7.12.2. Control Flow Instructions: @

@

Predicated execution.

Syntax

@{!}p    instruction;

Description

Execute an instruction or instruction block for threads that have the guard predicate True. Threads with a False guard predicate do nothing.

Semantics

If {!}p then instruction

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

    setp.eq.f32  p,y,0;     // is y zero?
@!p div.f32      ratio,x,y; // avoid division by zero
@q  bra L23;                // conditional branch

9.7.12.3. Control Flow Instructions: bra

bra

Branch to a target and continue execution there.

Syntax

@p   bra{.uni}  tgt;           // tgt is a label
     bra{.uni}  tgt;           // unconditional branch

Description

Continue execution at the target. Conditional branches are specified by using a guard predicate. Thebranch target must be a label.

bra.uni is guaranteed to be non-divergent, i.e. all active threads in a warp that are currentlyexecuting this instruction have identical values for the guard predicate and branch target.

Semantics

if (p) {
    pc = tgt;
}

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Unimplemented indirect branch introduced in PTX ISA version 2.1 has been removed from the spec.

Target ISA Notes

Supported on all target architectures.

Examples

    bra.uni  L_exit;  // uniform unconditional jump
@q  bra      L23;     // conditional branch

9.7.12.4. Control Flow Instructions: brx.idx

brx.idx

Branch to a label indexed from a list of potential branch targets.

Syntax

@p    brx.idx{.uni} index, tlist;
      brx.idx{.uni} index, tlist;

Description

Index into a list of possible destination labels, and continue execution from the chosenlabel. Conditional branches are specified by using a guard predicate.

brx.idx.uni guarantees that the branch is non-divergent, i.e. all active threads in a warp that are currently executing this instruction have identical values for the guard predicate and the index argument.

The index operand is a .u32 register. The tlist operand must be the label of a .branchtargets directive. It is accessed as a zero-based sequence using index. Behavior is undefined if the value of index is greater than or equal to the length of tlist.

The .branchtargets directive must be defined in the local function scope before it is used. It must refer to labels within the current function.

Semantics

if (p) {
    if (index < length(tlist)) {
      pc = tlist[index];
    } else {
      pc = undefined;
    }
}

PTX ISA Notes

Introduced in PTX ISA version 6.0.

Target ISA Notes

Requiressm_30 or higher.

Examples

.function foo () {
    .reg .u32 %r0;
    ...
    L1:
    ...
    L2:
    ...
    L3:
    ...
    ts: .branchtargets L1, L2, L3;
    @p brx.idx %r0, ts;
    ...
}

9.7.12.5. Control Flow Instructions: call

call

Call a function, recording the return location.

Syntax

// direct call to named function, func is a symbol
call{.uni} (ret-param), func, (param-list);
call{.uni} func, (param-list);
call{.uni} func;

// indirect call via pointer, with full list of call targets
call{.uni} (ret-param), fptr, (param-list), flist;
call{.uni} fptr, (param-list), flist;
call{.uni} fptr, flist;

// indirect call via pointer, with no knowledge of call targets
call{.uni} (ret-param), fptr, (param-list), fproto;
call{.uni} fptr, (param-list), fproto;
call{.uni} fptr, fproto;

Description

The call instruction stores the address of the next instruction, so execution can resume at that point after executing a ret instruction. A call is assumed to be divergent unless the .uni suffix is present. The .uni suffix indicates that the call is guaranteed to be non-divergent, i.e. all active threads in a warp that are currently executing this instruction have identical values for the guard predicate and call target.

For direct calls, the called location func must be a symbolic function name; for indirect calls, the called location fptr must be an address of a function held in a register. Input arguments and return values are optional. Arguments may be registers, immediate constants, or variables in .param space. Arguments are pass-by-value.

Indirect calls require an additional operand, flist or fproto, to communicate the list of potential call targets or the common function prototype of all call targets, respectively. In the first case, flist gives a complete list of potential call targets and the optimizing backend is free to optimize the calling convention. In the second case, where the complete list of potential call targets may not be known, the common function prototype is given and the call must obey the ABI's calling convention.

The flist operand is either the name of an array (call table) initialized to a list of function names, or a label associated with a .calltargets directive, which declares a list of potential call targets. In both cases the fptr register holds the address of a function listed in the call table or .calltargets list, and the call operands are type-checked against the type signature of the functions indicated by flist.

The fproto operand is the name of a label associated with a .callprototype directive. This operand is used when a complete list of potential targets is not known. The call operands are type-checked against the prototype, and code generation will follow the ABI calling convention. If a function that doesn't match the prototype is called, the behavior is undefined.
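The call-table dispatch described above can be sketched with a small Python model. This is illustrative only, not PTX: the names `foo`, `bar`, `baz`, `jmptbl`, and `indirect_call` model the PTX call table and the register-held function pointer.

```python
# Illustrative model (not PTX) of an indirect call through a call table:
# the pointer must hold the address of one of the listed targets, and the
# call dispatches through it with type-checked (here, positional) operands.
def foo(a, b): return a + b
def bar(a, b): return a - b
def baz(a, b): return a * b

jmptbl = [foo, bar, baz]   # models .global .u32 jmptbl[] = { foo, bar, baz };

def indirect_call(fptr, a, b):
    # Calling through a pointer that is not one of the declared targets
    # is undefined in PTX; the model rejects it explicitly.
    assert fptr in jmptbl
    return fptr(a, b)
```

In real PTX the analogous type check happens at compile time against the signatures of the functions named by flist.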

Call tables may be declared at module scope or local scope, in either the constant or global state space. The .calltargets and .callprototype directives must be declared within a function body. All functions must be declared prior to being referenced in a call table initializer or .calltargets directive.

PTX ISA Notes

Direct call introduced in PTX ISA version 1.0. Indirect call introduced in PTX ISA version 2.1.

Target ISA Notes

Direct call supported on all target architectures. Indirect call requires sm_20 or higher.

Examples

// examples of direct call
    call     init;    // call function 'init'
    call.uni g, (a);  // call function 'g' with parameter 'a'
@p  call     (d), h, (a, b);  // return value into register d

// call-via-pointer using jump table
.func (.reg .u32 rv) foo (.reg .u32 a, .reg .u32 b) ...
.func (.reg .u32 rv) bar (.reg .u32 a, .reg .u32 b) ...
.func (.reg .u32 rv) baz (.reg .u32 a, .reg .u32 b) ...
.global .u32 jmptbl[5] = { foo, bar, baz };
      ...
@p    ld.global.u32  %r0, [jmptbl+4];
@q    ld.global.u32  %r0, [jmptbl+8];
      call  (retval), %r0, (x, y), jmptbl;

// call-via-pointer using .calltargets directive
.func (.reg .u32 rv) foo (.reg .u32 a, .reg .u32 b) ...
.func (.reg .u32 rv) bar (.reg .u32 a, .reg .u32 b) ...
.func (.reg .u32 rv) baz (.reg .u32 a, .reg .u32 b) ...
      ...
@p    mov.u32  %r0, foo;
@q    mov.u32  %r0, baz;
Ftgt: .calltargets foo, bar, baz;
      call  (retval), %r0, (x, y), Ftgt;

// call-via-pointer using .callprototype directive
.func dispatch (.reg .u32 fptr, .reg .u32 idx)
{
...
Fproto: .callprototype _ (.param .u32 _, .param .u32 _);
      call  fptr, (x, y), Fproto;
...

9.7.12.6. Control Flow Instructions: ret

ret

Return from function to instruction after call.

Syntax

ret{.uni};

Description

Return execution to the caller's environment. A divergent return suspends threads until all threads are ready to return to the caller. This allows multiple divergent ret instructions.

A ret is assumed to be divergent unless the .uni suffix is present, indicating that the return is guaranteed to be non-divergent.

Any values returned from a function should be moved into the return parameter variables prior to executing the ret instruction.

A return instruction executed in a top-level entry routine will terminate thread execution.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

    ret;
@p  ret;

9.7.12.7. Control Flow Instructions: exit

exit

Terminate a thread.

Syntax

exit;

Description

Ends execution of a thread.

As threads exit, barriers waiting on all threads are checked to see if the exiting threads are the only threads that have not yet made it to a barrier{.cta} for all threads in the CTA, or to a barrier.cluster for all threads in the cluster. If the exiting threads are holding up the barrier, the barrier is released.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

    exit;
@p  exit;

9.7.13. Parallel Synchronization and Communication Instructions

These instructions are:

  • bar{.cta}, barrier{.cta}

  • bar.warp.sync

  • barrier.cluster

  • membar

  • atom

  • red

  • red.async

  • vote

  • match.sync

  • activemask

  • redux.sync

  • griddepcontrol

  • elect.sync

  • mbarrier.init

  • mbarrier.inval

  • mbarrier.arrive

  • mbarrier.arrive_drop

  • mbarrier.test_wait

  • mbarrier.try_wait

  • mbarrier.pending_count

  • cp.async.mbarrier.arrive

  • tensormap.cp_fenceproxy

  • clusterlaunchcontrol.try_cancel

  • clusterlaunchcontrol.query_cancel

9.7.13.1. Parallel Synchronization and Communication Instructions: bar, barrier

bar, bar.cta, barrier, barrier.cta

Barrier synchronization.

Syntax

barrier{.cta}.sync{.aligned}      a{, b};
barrier{.cta}.arrive{.aligned}    a, b;
barrier{.cta}.red.popc{.aligned}.u32  d, a{, b}, {!}c;
barrier{.cta}.red.op{.aligned}.pred   p, a{, b}, {!}c;

bar{.cta}.sync      a{, b};
bar{.cta}.arrive    a, b;
bar{.cta}.red.popc.u32  d, a{, b}, {!}c;
bar{.cta}.red.op.pred   p, a{, b}, {!}c;

.op = { .and, .or };

Description

Performs barrier synchronization and communication within a CTA. Each CTA instance has sixteen barriers numbered 0..15.

barrier{.cta} instructions can be used by the threads within the CTA for synchronization and communication.

Operands a, b, and d have type .u32; operands p and c are predicates. Source operand a specifies a logical barrier resource as an immediate constant or register with value 0 through 15. Operand b specifies the number of threads participating in the barrier. If no thread count is specified, all threads in the CTA participate in the barrier. When specifying a thread count, the value must be a multiple of the warp size. Note that a non-zero thread count is required for barrier{.cta}.arrive.

Depending on operand b, either the specified number of threads (in multiples of the warp size) or all threads in the CTA participate in the barrier{.cta} instruction. The barrier{.cta} instructions signal the arrival of the executing threads at the named barrier.

The barrier{.cta} instruction causes the executing thread to wait for all non-exited threads from its warp and marks the warp's arrival at the barrier. In addition to signaling its arrival at the barrier, the barrier{.cta}.red and barrier{.cta}.sync instructions cause the executing thread to wait for the non-exited threads of all other warps participating in the barrier to arrive. barrier{.cta}.arrive does not cause the executing thread to wait for threads of other participating warps.

When a barrier completes, the waiting threads are restarted without delay, and the barrier isreinitialized so that it can be immediately reused.

The barrier{.cta}.sync, barrier{.cta}.red, or barrier{.cta}.arrive instruction guarantees that when the barrier completes, prior memory accesses requested by this thread are performed relative to all threads participating in the barrier. The barrier{.cta}.sync and barrier{.cta}.red instructions further guarantee that no new memory access is requested by this thread before the barrier completes.

A memory read (e.g., by ld or atom) has been performed when the value read has been transmitted from memory and cannot be modified by another thread participating in the barrier. A memory write (e.g., by st, red or atom) has been performed when the value written has become visible to other threads participating in the barrier, that is, when the previous value can no longer be read.

barrier{.cta}.red performs a reduction operation across threads. The c predicates (or their complements) from all threads in the CTA are combined using the specified reduction operator. Once the barrier count is reached, the final value is written to the destination register in all threads waiting at the barrier.

The reduction operations for barrier{.cta}.red are population-count (.popc), all-threads-True (.and), and any-thread-True (.or). The result of .popc is the number of threads with a True predicate, while .and and .or indicate whether all the threads had a True predicate or whether any of the threads had a True predicate.
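The three reduction operators can be summarized with a small Python model. This is illustrative only, not PTX: `barrier_red` is a hypothetical name, and the list of predicates stands in for the per-thread c operands contributed at the barrier.

```python
# Illustrative model (not PTX) of the barrier.red reduction operators:
# each thread contributes a predicate, and every waiting thread
# receives the combined result.
def barrier_red(op, predicates):
    if op == "popc":
        return sum(predicates)     # number of threads with a True predicate
    if op == "and":
        return all(predicates)     # True iff every predicate is True
    if op == "or":
        return any(predicates)     # True iff at least one predicate is True
    raise ValueError(op)
```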

The barrier{.cta} instruction has an optional .aligned modifier. When specified, it indicates that all threads in the CTA will execute the same barrier{.cta} instruction. In conditionally executed code, an aligned barrier{.cta} instruction should only be used if it is known that all threads in the CTA evaluate the condition identically, otherwise behavior is undefined.

Different warps may execute different forms of the barrier{.cta} instruction using the same barrier name and thread count. One example mixes barrier{.cta}.sync and barrier{.cta}.arrive to implement producer/consumer models. The producer threads execute barrier{.cta}.arrive to announce their arrival at the barrier and continue execution without delay to produce the next value, while the consumer threads execute barrier{.cta}.sync to wait for a resource to be produced. The roles are then reversed, using a different barrier, where the producer threads execute a barrier{.cta}.sync to wait for a resource to be consumed, while the consumer threads announce that the resource has been consumed with barrier{.cta}.arrive. Care must be taken to keep a warp from executing more barrier{.cta} instructions than intended (barrier{.cta}.arrive followed by any other barrier{.cta} instruction to the same barrier) prior to the reset of the barrier. barrier{.cta}.red should not be intermixed with barrier{.cta}.sync or barrier{.cta}.arrive using the same active barrier. Execution in this case is unpredictable.
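The two-barrier producer/consumer handoff can be sketched with Python threads. This is illustrative only, not PTX, and it is approximate: Python's threading.Barrier blocks on arrival, so it models bar.sync but cannot model the non-blocking bar.arrive; all names here (`shared`, `b0`, `b1`) are hypothetical.

```python
import threading

shared = {}
b0 = threading.Barrier(2)   # barrier 0: "value has been produced"
b1 = threading.Barrier(2)   # barrier 1: "value has been consumed"
result = []

def producer():
    shared["slot"] = 42     # models st.shared [r0], r1
    b0.wait()               # models bar.arrive 0 (but blocks, unlike PTX)
    b1.wait()               # models bar.sync 1: wait until consumed

def consumer():
    b0.wait()               # models bar.sync 0: wait for the producer
    result.append(shared["slot"])   # models ld.shared r1, [r0]
    b1.wait()               # models bar.arrive 1: signal consumption

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

The essential point carried over from the PTX pattern is the use of two distinct barriers so that the "produced" and "consumed" events cannot be confused.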

The optional .cta qualifier simply indicates CTA-level applicability of the barrier and it doesn't change the semantics of the instruction.

bar{.cta}.sync is equivalent to barrier{.cta}.sync.aligned. bar{.cta}.arrive is equivalent to barrier{.cta}.arrive.aligned. bar{.cta}.red is equivalent to barrier{.cta}.red.aligned.

Note

For .target sm_6x or below,

  1. barrier{.cta} instruction without the .aligned modifier is equivalent to the .aligned variant and has the same restrictions as the .aligned variant.

  2. All threads in a warp (except for those that have exited) must execute the barrier{.cta} instruction in convergence.

PTX ISA Notes

bar.sync without a thread count introduced in PTX ISA version 1.0.

Register operands, thread count, and bar.{arrive,red} introduced in PTX ISA version 2.0.

barrier instruction introduced in PTX ISA version 6.0.

.cta qualifier introduced in PTX ISA version 7.8.

Target ISA Notes

Register operands, thread count, and bar{.cta}.{arrive,red} require sm_20 or higher.

Only bar{.cta}.sync with an immediate barrier number is supported for sm_1x targets.

barrier{.cta} instruction requires sm_30 or higher.

Examples

// Use bar.sync to arrive at a pre-computed barrier number and
// wait for all threads in CTA to also arrive:
    st.shared [r0],r1;  // write my result to shared memory
    bar.cta.sync  1;    // arrive, wait for others to arrive
    ld.shared r2,[r3];  // use shared results from other threads

// Use bar.sync to arrive at a pre-computed barrier number and
// wait for fixed number of cooperating threads to arrive:
    #define CNT1 (8*12) // Number of cooperating threads
    st.shared [r0],r1;     // write my result to shared memory
    bar.cta.sync  1, CNT1; // arrive, wait for others to arrive
    ld.shared r2,[r3];     // use shared results from other threads

// Use bar.red.and to compare results across the entire CTA:
    setp.eq.u32 p,r1,r2;         // p is True if r1==r2
    bar.cta.red.and.pred r3,1,p; // r3=AND(p) forall threads in CTA

// Use bar.red.popc to compute the size of a group of threads
// that have a specific condition True:
    setp.eq.u32 p,r1,r2;         // p is True if r1==r2
    bar.cta.red.popc.u32 r3,1,p; // r3=SUM(p) forall threads in CTA

// Examples of barrier.cta.sync
    st.shared         [r0],r1;
    barrier.cta.sync  0;
    ld.shared         r1, [r0];

/* Producer/consumer model. The producer deposits a value in
 * shared memory, signals that it is complete but does not wait
 * using bar.arrive, and begins fetching more data from memory.
 * Once the data returns from memory, the producer must wait
 * until the consumer signals that it has read the value from
 * the shared memory location. In the meantime, a consumer
 * thread waits until the data is stored by the producer, reads
 * it, and then signals that it is done (without waiting).
 */
    // Producer code places produced value in shared memory.
    st.shared   [r0],r1;
    bar.arrive  0,64;
    ld.global   r1,[r2];
    bar.sync    1,64;
    ...
    // Consumer code, reads value from shared memory
    bar.sync   0,64;
    ld.shared  r1,[r0];
    bar.arrive 1,64;
    ...

9.7.13.2. Parallel Synchronization and Communication Instructions: bar.warp.sync

bar.warp.sync

Barrier synchronization for threads in a warp.

Syntax

bar.warp.sync      membermask;

Description

bar.warp.sync will cause the executing thread to wait until all threads corresponding to membermask have executed a bar.warp.sync with the same membermask value before resuming execution.

Operand membermask specifies a 32-bit integer which is a mask indicating the threads participating in the barrier, where the bit position corresponds to the thread's laneid.

The behavior of bar.warp.sync is undefined if the executing thread is not in the membermask.

bar.warp.sync also guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute bar.warp.sync, and then safely read values stored by other threads in the warp.
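The store / synchronize / load pattern can be sketched with Python threads standing in for the lanes of a warp. This is illustrative only, not PTX; `NUM_LANES`, `shared`, and `lane` are hypothetical names, and threading.Barrier plays the role of bar.warp.sync.

```python
import threading

NUM_LANES = 4                          # a small stand-in for a 32-lane warp
shared = [None] * NUM_LANES            # models a shared-memory region
sync = threading.Barrier(NUM_LANES)    # models bar.warp.sync membermask
results = {}

def lane(laneid):
    shared[laneid] = laneid * 10       # st.shared: publish my value
    sync.wait()                        # bar.warp.sync: all lanes arrive
    # After the barrier it is safe to read values stored by other lanes.
    results[laneid] = sum(shared)

threads = [threading.Thread(target=lane, args=(i,)) for i in range(NUM_LANES)]
for t in threads: t.start()
for t in threads: t.join()
```

Every lane observes the same combined value, because the barrier orders all the stores before all the loads.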

Note

For .target sm_6x or below, all threads in membermask must execute the same bar.warp.sync instruction in convergence, and only threads belonging to some membermask can be active when the bar.warp.sync instruction is executed. Otherwise, the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 6.0.

Target ISA Notes

Requires sm_30 or higher.

Examples

st.shared.u32 [r0],r1;         // write my result to shared memory
bar.warp.sync  0xffffffff;     // arrive, wait for others to arrive
ld.shared.u32 r2,[r3];         // read results written by other threads

9.7.13.3. Parallel Synchronization and Communication Instructions: barrier.cluster

barrier.cluster

Barrier synchronization within a cluster.

Syntax

barrier.cluster.arrive{.sem}{.aligned};
barrier.cluster.wait{.acquire}{.aligned};

.sem = { .release, .relaxed }

Description

Performs barrier synchronization and communication within a cluster.

barrier.cluster instructions can be used by the threads within the cluster for synchronizationand communication.

The barrier.cluster.arrive instruction marks the warp's arrival at the barrier without causing the executing thread to wait for threads of other participating warps.

The barrier.cluster.wait instruction causes the executing thread to wait for all non-exited threads of the cluster to perform barrier.cluster.arrive.

In addition, barrier.cluster instructions cause the executing thread to wait for all non-exited threads from its warp.

When all non-exited threads in the cluster have executed barrier.cluster.arrive, the barrier completes and is automatically reinitialized. After using barrier.cluster.wait to detect completion of the barrier, a thread may immediately arrive at the barrier once again. Each thread must arrive at the barrier only once before the barrier completes.
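A split arrive/wait barrier with automatic reinitialization can be sketched in Python. This is illustrative only, not PTX: real cluster barriers are hardware-managed, and the class `ArriveWaitBarrier` and its generation token are hypothetical constructions used to model the "arrive once, then wait" discipline.

```python
import threading

# Illustrative model (not PTX) of a split arrive/wait barrier that
# auto-reinitializes, in the spirit of barrier.cluster.arrive/wait.
class ArriveWaitBarrier:
    def __init__(self, n):
        self.n = n                    # number of participating threads
        self.count = 0
        self.generation = 0
        self.cv = threading.Condition()

    def arrive(self):
        # Non-blocking arrival; returns a token identifying this phase.
        with self.cv:
            token = self.generation
            self.count += 1
            if self.count == self.n:  # last arrival completes the barrier
                self.count = 0        # ... and reinitializes it
                self.generation += 1
                self.cv.notify_all()
            return token

    def wait(self, token):
        # Block until the phase identified by `token` has completed.
        with self.cv:
            self.cv.wait_for(lambda: self.generation != token)
```

The token captures the phase at arrival time, so a thread whose barrier completed between its arrive and wait returns immediately, matching the rule that each thread arrives exactly once per phase.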

The barrier.cluster.wait instruction guarantees that when it completes execution, memory accesses (except asynchronous operations) requested, in program order, prior to the preceding barrier.cluster.arrive by all threads in the cluster are complete and visible to the executing thread.

There is no memory ordering and visibility guarantee for memory accesses requested by the executing thread, in program order, after barrier.cluster.arrive and prior to barrier.cluster.wait.

The optional .relaxed qualifier on barrier.cluster.arrive specifies that there are no memory ordering and visibility guarantees provided for the memory accesses performed prior to barrier.cluster.arrive.

The optional .sem and .acquire qualifiers on instructions barrier.cluster.arrive and barrier.cluster.wait specify the memory synchronization as described in the Memory Consistency Model. If the optional .sem qualifier is absent for barrier.cluster.arrive, .release is assumed by default. If the optional .acquire qualifier is absent for barrier.cluster.wait, .acquire is assumed by default.

The optional .aligned qualifier indicates that all threads in the warp must execute the same barrier.cluster instruction. In conditionally executed code, an aligned barrier.cluster instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Support for .acquire, .relaxed, .release qualifiers introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

// use of arrive followed by wait
ld.shared::cluster.u32 r0, [addr];
barrier.cluster.arrive.aligned;
...
barrier.cluster.wait.aligned;
st.shared::cluster.u32 [addr], r1;

// use memory fence prior to arrive for relaxed barrier
@cta0 ld.shared::cluster.u32 r0, [addr];
fence.cluster.acq_rel;
barrier.cluster.arrive.relaxed.aligned;
...
barrier.cluster.wait.aligned;
@cta1 st.shared::cluster.u32 [addr], r1;

9.7.13.4. Parallel Synchronization and Communication Instructions: membar / fence

membar, fence

Enforce an ordering of memory operations.

Syntax

// Thread fence:
fence{.sem}.scope;

// Thread fence (uni-directional):
fence.acquire.sync_restrict::shared::cluster.cluster;
fence.release.sync_restrict::shared::cta.cluster;

// Operation fence (uni-directional):
fence.op_restrict.release.cluster;

// Proxy fence (bi-directional):
fence.proxy.proxykind;

// Proxy fence (uni-directional):
fence.proxy.to_proxykind::from_proxykind.release.scope;
fence.proxy.to_proxykind::from_proxykind.acquire.scope  [addr], size;
fence.proxy.async::generic.acquire.sync_restrict::shared::cluster.cluster;
fence.proxy.async::generic.release.sync_restrict::shared::cta.cluster;

// Old style membar:
membar.level;
membar.proxy.proxykind;

.sem       = { .sc, .acq_rel, .acquire, .release };
.scope     = { .cta, .cluster, .gpu, .sys };
.level     = { .cta, .gl, .sys };
.proxykind = { .alias, .async, .async.global, .async.shared::{cta, cluster} };
.op_restrict = { .mbarrier_init };
.to_proxykind::from_proxykind = { .tensormap::generic };

Description

The membar instruction guarantees that prior memory accesses requested by this thread (ld, st, atom and red instructions) are performed at the specified level, before later memory operations requested by this thread following the membar instruction. The level qualifier specifies the set of threads that may observe the ordering effect of this operation.

A memory read (e.g., by ld or atom) has been performed when the value read has been transmitted from memory and cannot be modified by another thread at the indicated level. A memory write (e.g., by st, red or atom) has been performed when the value written has become visible to other threads at the specified level, that is, when the previous value can no longer be read.

The fence instruction establishes an ordering between memory accesses requested by this thread (ld, st, atom and red instructions) as described in the Memory Consistency Model. The scope qualifier specifies the set of threads that may observe the ordering effect of this operation.

fence.acq_rel is a light-weight fence that is sufficient for memory synchronization in most programs. Instances of fence.acq_rel synchronize when combined with additional memory operations as described in acquire and release patterns in the Memory Consistency Model. If the optional .sem qualifier is absent, .acq_rel is assumed by default.

fence.sc is a slower fence that can restore sequential consistency when used in sufficient places, at the cost of performance. Instances of fence.sc with sufficient scope always synchronize by forming a total order per scope, determined at runtime. This total order can be constrained further by other synchronization in the program.

Qualifiers .op_restrict and .sync_restrict restrict the class of memory operations for which the fence instruction provides the memory ordering guarantees. When .op_restrict is .mbarrier_init, the synchronizing effect of the fence only applies to the prior mbarrier.init operations executed by the same thread on mbarrier objects in the .shared::cta state space. When .sync_restrict is .sync_restrict::shared::cta, .sem must be .release, and the effect of the fence only applies to operations performed on objects in the .shared::cta state space. Likewise, when .sync_restrict is .sync_restrict::shared::cluster, .sem must be .acquire, and the effect of the fence only applies to operations performed on objects in the .shared::cluster state space. When either .sync_restrict::shared::cta or .sync_restrict::shared::cluster is present, the .scope must be specified as .cluster.

The address operand addr and the operand size together specify the memory range [addr, addr+size-1] on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the size operand is 128, which must be a constant integer literal. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the .global state space. Otherwise, the behavior is undefined.

On sm_70 and higher, membar is a synonym for fence.sc [1], and the membar levels cta, gl and sys are synonymous with the fence scopes cta, gpu and sys respectively.

membar.proxy and fence.proxy instructions establish an ordering between memory accesses that may happen through different proxies.

A uni-directional proxy ordering from the from-proxykind to the to-proxykind establishes ordering between a prior memory access performed via the from-proxykind and a subsequent memory access performed via the to-proxykind.

A bi-directional proxy ordering between two proxykinds establishes two uni-directional proxy orderings: one from the first proxykind to the second proxykind and the other from the second proxykind to the first proxykind.

The .proxykind qualifier indicates the bi-directional proxy ordering that is established between the memory accesses done between the generic proxy and the proxy specified by .proxykind.

Value .alias of the .proxykind qualifier refers to memory accesses performed using virtually aliased addresses to the same memory location. Value .async of the .proxykind qualifier specifies that the memory ordering is established between the async proxy and the generic proxy. The memory ordering is limited only to operations performed on objects in the state space specified. If no state space is specified, then the memory ordering applies on all state spaces.

A .release proxy fence can form a release sequence that synchronizes with an acquire sequence that contains a .acquire proxy fence. The .to_proxykind and .from_proxykind qualifiers indicate the uni-directional proxy ordering that is established.

On sm_70 and higher, membar.proxy is a synonym for fence.proxy.

[1] The semantics of fence.sc introduced with sm_70 is a superset of the semantics of membar and the two are compatible; when executing on sm_70 or later architectures, membar acquires the full semantics of fence.sc.

PTX ISA Notes

membar.{cta,gl} introduced in PTX ISA version 1.4.

membar.sys introduced in PTX ISA version 2.0.

fence introduced in PTX ISA version 6.0.

membar.proxy and fence.proxy introduced in PTX ISA version 7.5.

.cluster scope qualifier introduced in PTX ISA version 7.8.

.op_restrict qualifier introduced in PTX ISA version 8.0.

fence.proxy.async is introduced in PTX ISA version 8.0.

.to_proxykind::from_proxykind qualifier introduced in PTX ISA version 8.3.

.acquire and .release qualifiers for fence instruction introduced in PTX ISA version 8.6.

.sync_restrict qualifier introduced in PTX ISA version 8.6.

Target ISA Notes

membar.{cta,gl} supported on all target architectures.

membar.sys requires sm_20 or higher.

fence requires sm_70 or higher.

membar.proxy requires sm_60 or higher.

fence.proxy requires sm_70 or higher.

.cluster scope qualifier requires sm_90 or higher.

.op_restrict qualifier requires sm_90 or higher.

fence.proxy.async requires sm_90 or higher.

.to_proxykind::from_proxykind qualifier requires sm_90 or higher.

.acquire and .release qualifiers for fence instruction require sm_90 or higher.

.sync_restrict qualifier requires sm_90 or higher.

Examples

membar.gl;
membar.cta;
membar.sys;
fence.sc.cta;
fence.sc.cluster;
fence.proxy.alias;
membar.proxy.alias;
fence.mbarrier_init.release.cluster;
fence.proxy.async;
fence.proxy.async.shared::cta;
fence.proxy.async.shared::cluster;
fence.proxy.async.global;

tensormap.replace.tile.global_address.global.b1024.b64   [gbl], new_addr;
fence.proxy.tensormap::generic.release.gpu;
cvta.global.u64  tmap, gbl;
fence.proxy.tensormap::generic.acquire.gpu [tmap], 128;
cp.async.bulk.tensor.1d.shared::cluster.global.tile  [addr0], [tmap, {tc0}], [mbar0];

// Acquire remote barrier state via async proxy.
barrier.cluster.wait.acquire;
fence.proxy.async::generic.acquire.sync_restrict::shared::cluster.cluster;

// Release local barrier state via async proxy.
mbarrier.init [bar];
fence.mbarrier_init.release.cluster;
fence.proxy.async::generic.release.sync_restrict::shared::cta.cluster;
barrier.cluster.arrive.relaxed;

// Acquire local shared memory via generic proxy.
mbarrier.try_wait.relaxed.cluster.shared::cta.b64 complete, [addr], parity;
fence.acquire.sync_restrict::shared::cluster.cluster;

// Release local shared memory via generic proxy.
fence.release.sync_restrict::shared::cta.cluster;
mbarrier.arrive.relaxed.cluster.shared::cluster.b64 state, [bar];

9.7.13.5. Parallel Synchronization and Communication Instructions: atom

atom

Atomic reduction operations for thread-to-thread communication.

Syntax

Atomic operation with scalar type:

atom{.sem}{.scope}{.space}.op{.level::cache_hint}.type d, [a], b{, cache-policy};
atom{.sem}{.scope}{.space}.op.type d, [a], b, c;
atom{.sem}{.scope}{.space}.cas.b16 d, [a], b, c;
atom{.sem}{.scope}{.space}.cas.b128 d, [a], b, c;
atom{.sem}{.scope}{.space}.exch{.level::cache_hint}.b128 d, [a], b {, cache-policy};
atom{.sem}{.scope}{.space}.add.noftz{.level::cache_hint}.f16     d, [a], b{, cache-policy};
atom{.sem}{.scope}{.space}.add.noftz{.level::cache_hint}.f16x2   d, [a], b{, cache-policy};
atom{.sem}{.scope}{.space}.add.noftz{.level::cache_hint}.bf16    d, [a], b{, cache-policy};
atom{.sem}{.scope}{.space}.add.noftz{.level::cache_hint}.bf16x2  d, [a], b{, cache-policy};

.space =              { .global, .shared{::cta, ::cluster} };
.sem =                { .relaxed, .acquire, .release, .acq_rel };
.scope =              { .cta, .cluster, .gpu, .sys };
.op =                 { .and, .or, .xor,
                        .cas, .exch,
                        .add, .inc, .dec,
                        .min, .max };
.level::cache_hint =  { .L2::cache_hint };
.type =               { .b32, .b64, .u32, .u64, .s32, .s64, .f32, .f64 };

Atomic operation with vector type:

atom{.sem}{.scope}{.global}.add{.level::cache_hint}.vec_32_bit.f32                  d, [a], b{, cache-policy};
atom{.sem}{.scope}{.global}.op.noftz{.level::cache_hint}.vec_16_bit.half_word_type  d, [a], b{, cache-policy};
atom{.sem}{.scope}{.global}.op.noftz{.level::cache_hint}.vec_32_bit.packed_type     d, [a], b{, cache-policy};

.sem =               { .relaxed, .acquire, .release, .acq_rel };
.scope =             { .cta, .cluster, .gpu, .sys };
.op =                { .add, .min, .max };
.half_word_type =    { .f16, .bf16 };
.packed_type =       { .f16x2, .bf16x2 };
.vec_16_bit =        { .v2, .v4, .v8 };
.vec_32_bit =        { .v2, .v4 };
.level::cache_hint = { .L2::cache_hint };

Description

Atomically loads the original value at location a into destination register d, performs a reduction operation with operand b and the value in location a, and stores the result of the specified operation at location a, overwriting the original value. Operand a specifies a location in the specified state space. If no state space is given, the memory access is performed using Generic Addressing. atom with scalar type may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space. atom with vector type may be used only with .global space and with generic addressing where the address points to .global space.

For atom with vector type, operands d and b are brace-enclosed vector expressions, the size of which is equal to the size of the vector qualifier.

If no sub-qualifier is specified with the .shared state space, then ::cta is assumed by default.

The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .relaxed is assumed by default.

The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model. If the .scope qualifier is absent, .gpu scope is assumed by default.

For atom with vector type, the supported combinations of vector qualifier and types, and the atomic operations supported on these combinations, are shown in the following table:

Vector qualifier   .f16/.bf16          .f16x2/.bf16x2      .f32
.v2                .add, .min, .max    .add, .min, .max    .add
.v4                .add, .min, .max    .add, .min, .max    .add
.v8                .add, .min, .max    Not supported       Not supported

Two atomic operations (atom orred) are performed atomically with respect to each other onlyif each operation specifies a scope that includes the other. When this condition is not met, eachoperation observes the other operation being performed as if it were split into a read followed by adependent write.

atom instruction on packed type or vector type, accesses adjacent scalar elements in memory. Insuch cases, the atomicity is guaranteed separately for each of the individual scalar elements; theentireatom is not guaranteed to be atomic as a single access.

Forsm_6x and earlier architectures,atom operations on.shared state space do notguarantee atomicity with respect to normal store instructions to the same address. It is theprogrammer’s responsibility to guarantee correctness of programs that use shared memory atomicinstructions, e.g., by inserting barriers between normal stores and atomic operations to a commonaddress, or by using atom.exch to store to locations accessed by other atomic operations.

Supported addressing modes for operanda and alignment requirements are described inAddresses as Operands

The bit-size operations are.and,.or,.xor,.cas (compare-and-swap), and.exch(exchange).

The integer operations are.add,.inc,.dec,.min,.max. The.inc and.dec operations return a result in the range[0..b].

The floating-point operation.add operation rounds to nearest even. Current implementation ofatom.add.f32 on global memory flushes subnormal inputs and results to sign-preserving zero;whereasatom.add.f32 on shared memory supports subnormal inputs and results and doesn’t flushthem to zero.

atom.add.f16,atom.add.f16x2,atom.add.bf16 andatom.add.bf16x2 operation requiresthe.noftz qualifier; it preserves subnormal inputs and results, and does not flush them tozero.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

The qualifier .level::cache_hint is only supported for the .global state space and for generic addressing where the address points to the .global state space.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.

Semantics

atomic {
    d = *a;
    *a = (operation == cas) ? operation(*a, b, c)
                            : operation(*a, b);
}

where
    inc(r, s)  = (r >= s) ? 0 : r+1;
    dec(r, s)  = (r==0 || r > s) ? s : r-1;
    exch(r, s) = s;
    cas(r,s,t) = (r == s) ? t : r;
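The update rules in the Semantics block can be exercised directly on the CPU. The following C sketch (function names are illustrative, not part of PTX) models the inc, dec, exch, and cas update functions and the old-value return of atom:

```c
#include <stdint.h>

/* CPU models of the update rules from the Semantics block above.
   Each function computes the new value to be stored at *a. */
static uint32_t op_inc(uint32_t r, uint32_t s)  { return (r >= s) ? 0 : r + 1; }
static uint32_t op_dec(uint32_t r, uint32_t s)  { return (r == 0 || r > s) ? s : r - 1; }
static uint32_t op_exch(uint32_t r, uint32_t s) { (void)r; return s; }
static uint32_t op_cas(uint32_t r, uint32_t s, uint32_t t) { return (r == s) ? t : r; }

/* atom-style wrapper for two-operand ops: store the new value,
   return the old one (the value written to destination d). */
static uint32_t atom_apply(uint32_t *a,
                           uint32_t (*op)(uint32_t, uint32_t),
                           uint32_t b) {
    uint32_t d = *a;
    *a = op(*a, b);
    return d;
}
```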

Notes

Simple reductions may be specified by using the bit bucket destination operand _.

PTX ISA Notes

32-bit atom.global introduced in PTX ISA version 1.1.

atom.shared and 64-bit atom.global.{add,cas,exch} introduced in PTX ISA version 1.2.

atom.add.f32 and 64-bit atom.shared.{add,cas,exch} introduced in PTX ISA version 2.0.

64-bit atom.{and,or,xor,min,max} introduced in PTX ISA version 3.1.

atom.add.f64 introduced in PTX ISA version 5.0.

.scope qualifier introduced in PTX ISA version 5.0.

.sem qualifier introduced in PTX ISA version 6.0.

atom.add.noftz.f16x2 introduced in PTX ISA version 6.2.

atom.add.noftz.f16 and atom.cas.b16 introduced in PTX ISA version 6.3.

Per-element atomicity of atom.f16x2 clarified in PTX ISA version 6.3, with retrospective effect from PTX ISA version 6.2.

Support for .level::cache_hint qualifier introduced in PTX ISA version 7.4.

atom.add.noftz.bf16 and atom.add.noftz.bf16x2 introduced in PTX ISA version 7.8.

Support for .cluster scope qualifier introduced in PTX ISA version 7.8.

Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.

Support for vector types introduced in PTX ISA version 8.1.

Support for .b128 type introduced in PTX ISA version 8.3.

Support for .sys scope with .b128 type introduced in PTX ISA version 8.4.

Target ISA Notes

atom.global requires sm_11 or higher.

atom.shared requires sm_12 or higher.

64-bit atom.global.{add,cas,exch} requires sm_12 or higher.

64-bit atom.shared.{add,cas,exch} requires sm_20 or higher.

64-bit atom.{and,or,xor,min,max} requires sm_32 or higher.

atom.add.f32 requires sm_20 or higher.

atom.add.f64 requires sm_60 or higher.

.scope qualifier requires sm_60 or higher.

.sem qualifier requires sm_70 or higher.

Use of generic addressing requires sm_20 or higher.

atom.add.noftz.f16x2 requires sm_60 or higher.

atom.add.noftz.f16 and atom.cas.b16 require sm_70 or higher.

Support for .level::cache_hint qualifier requires sm_80 or higher.

atom.add.noftz.bf16 and atom.add.noftz.bf16x2 require sm_90 or higher.

Support for .cluster scope qualifier requires sm_90 or higher.

Sub-qualifier ::cta requires sm_30 or higher.

Sub-qualifier ::cluster requires sm_90 or higher.

Support for vector types requires sm_90 or higher.

Support for .b128 type requires sm_90 or higher.

Examples

atom.global.add.s32  d,[a],1;
atom.shared::cta.max.u32  d,[x+4],0;
@p  atom.global.cas.b32  d,[p],my_val,my_new_val;
atom.global.sys.add.u32 d, [a], 1;
atom.global.acquire.sys.inc.u32 ans, [gbl], %r0;
atom.add.noftz.f16x2 d, [a], b;
atom.add.noftz.f16   hd, [ha], hb;
atom.global.cas.b16  hd, [ha], hb, hc;
atom.add.noftz.bf16   hd, [a], hb;
atom.add.noftz.bf16x2 bd, [b], bb;
atom.add.shared::cluster.noftz.f16   hd, [ha], hb;
atom.shared.b128.cas d, [a], b, c; // 128-bit atom
atom.global.b128.exch d, [a], b;   // 128-bit atom
atom.global.cluster.relaxed.add.u32 d, [a], 1;
createpolicy.fractional.L2::evict_last.b64 cache-policy, 0.25;
atom.global.add.L2::cache_hint.s32  d, [a], 1, cache-policy;
atom.global.v8.f16.max.noftz  {%hd0, %hd1, %hd2, %hd3, %hd4, %hd5, %hd6, %hd7}, [gbl],
                              {%h0, %h1, %h2, %h3, %h4, %h5, %h6, %h7};
atom.global.v8.bf16.add.noftz {%hd0, %hd1, %hd2, %hd3, %hd4, %hd5, %hd6, %hd7}, [gbl],
                              {%h0, %h1, %h2, %h3, %h4, %h5, %h6, %h7};
atom.global.v2.f16.add.noftz  {%hd0, %hd1}, [gbl], {%h0, %h1};
atom.global.v2.bf16.add.noftz {%hd0, %hd1}, [gbl], {%h0, %h1};
atom.global.v4.f16x2.min.noftz {%hd0, %hd1, %hd2, %hd3}, [gbl], {%h0, %h1, %h2, %h3};
atom.global.v4.f32.add  {%f0, %f1, %f2, %f3}, [gbl], {%f0, %f1, %f2, %f3};
atom.global.v2.f16x2.min.noftz  {%bd0, %bd1}, [g], {%b0, %b1};
atom.global.v2.bf16x2.max.noftz {%bd0, %bd1}, [g], {%b0, %b1};
atom.global.v2.f32.add  {%f0, %f1}, [g], {%f0, %f1};

9.7.13.6. Parallel Synchronization and Communication Instructions: red

red

Reduction operations on global and shared memory.

Syntax

Reduction operation with scalar type:

red{.sem}{.scope}{.space}.op{.level::cache_hint}.type          [a], b{, cache-policy};
red{.sem}{.scope}{.space}.add.noftz{.level::cache_hint}.f16    [a], b{, cache-policy};
red{.sem}{.scope}{.space}.add.noftz{.level::cache_hint}.f16x2  [a], b{, cache-policy};
red{.sem}{.scope}{.space}.add.noftz{.level::cache_hint}.bf16   [a], b{, cache-policy};
red{.sem}{.scope}{.space}.add.noftz{.level::cache_hint}.bf16x2 [a], b{, cache-policy};

.space =              { .global, .shared{::cta, ::cluster} };
.sem =                { .relaxed, .release };
.scope =              { .cta, .cluster, .gpu, .sys };
.op =                 { .and, .or, .xor,
                        .add, .inc, .dec,
                        .min, .max };
.level::cache_hint =  { .L2::cache_hint };
.type =               { .b32, .b64, .u32, .u64, .s32, .s64, .f32, .f64 };

Reduction operation with vector type:

red{.sem}{.scope}{.global}.add{.level::cache_hint}.vec_32_bit.f32 [a], b{, cache-policy};
red{.sem}{.scope}{.global}.op.noftz{.level::cache_hint}.vec_16_bit.half_word_type [a], b{, cache-policy};
red{.sem}{.scope}{.global}.op.noftz{.level::cache_hint}.vec_32_bit.packed_type [a], b{, cache-policy};

.sem =                { .relaxed, .release };
.scope =              { .cta, .cluster, .gpu, .sys };
.op =                 { .add, .min, .max };
.half_word_type =     { .f16, .bf16 };
.packed_type =        { .f16x2, .bf16x2 };
.vec_16_bit =         { .v2, .v4, .v8 };
.vec_32_bit =         { .v2, .v4 };
.level::cache_hint =  { .L2::cache_hint };

Description

Performs a reduction operation with operand b and the value in location a, and stores the result of the specified operation at location a, overwriting the original value. Operand a specifies a location in the specified state space. If no state space is given, the memory accesses are performed using Generic Addressing. red with scalar type may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space. red with vector type may be used only with .global space and with generic addressing where the address points to .global space.

For red with vector type, operand b is a brace-enclosed vector expression whose size equals the size of the vector qualifier.

If no sub-qualifier is specified with the .shared state space, then ::cta is assumed by default.

The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .relaxed is assumed by default.

The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model. If the .scope qualifier is absent, .gpu scope is assumed by default.

For red with vector type, the supported combinations of vector qualifier and type, and the reduction operations supported on these combinations, are shown in the following table:

Vector qualifier    .f16/.bf16          .f16x2/.bf16x2      .f32
.v2                 .add, .min, .max    .add, .min, .max    .add
.v4                 .add, .min, .max    .add, .min, .max    .add
.v8                 .add, .min, .max    Not supported       Not supported

Two atomic operations (atom or red) are performed atomically with respect to each other only if each operation specifies a scope that includes the other. When this condition is not met, each operation observes the other operation as if it were split into a read followed by a dependent write.

A red instruction on a packed or vector type accesses adjacent scalar elements in memory. In such cases, atomicity is guaranteed separately for each of the individual scalar elements; the entire red is not guaranteed to be atomic as a single access.

For sm_6x and earlier architectures, red operations on the .shared state space do not guarantee atomicity with respect to normal store instructions to the same address. It is the programmer's responsibility to guarantee correctness of programs that use shared memory reduction instructions, e.g., by inserting barriers between normal stores and reduction operations to a common address, or by using atom.exch to store to locations accessed by other reduction operations.

Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.

The bit-size operations are .and, .or, and .xor.

The integer operations are .add, .inc, .dec, .min, .max. The .inc and .dec operations return a result in the range [0..b].

The floating-point .add operation rounds to nearest even. The current implementation of red.add.f32 on global memory flushes subnormal inputs and results to sign-preserving zero, whereas red.add.f32 on shared memory supports subnormal inputs and results and does not flush them to zero.

The red.add.f16, red.add.f16x2, red.add.bf16 and red.add.bf16x2 operations require the .noftz qualifier; they preserve subnormal inputs and results, and do not flush them to zero.

When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.

The qualifier .level::cache_hint is only supported for the .global state space and for generic addressing where the address points to the .global state space.

cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.

Semantics

*a = operation(*a, b);

where
    inc(r, s) = (r >= s) ? 0 : r+1;
    dec(r, s) = (r==0 || r > s) ? s : r-1;

PTX ISA Notes

Introduced in PTX ISA version 1.2.

red.add.f32 and red.shared.add.u64 introduced in PTX ISA version 2.0.

64-bit red.{and,or,xor,min,max} introduced in PTX ISA version 3.1.

red.add.f64 introduced in PTX ISA version 5.0.

.scope qualifier introduced in PTX ISA version 5.0.

.sem qualifier introduced in PTX ISA version 6.0.

red.add.noftz.f16x2 introduced in PTX ISA version 6.2.

red.add.noftz.f16 introduced in PTX ISA version 6.3.

Per-element atomicity of red.f16x2 clarified in PTX ISA version 6.3, with retrospective effect from PTX ISA version 6.2.

Support for .level::cache_hint qualifier introduced in PTX ISA version 7.4.

red.add.noftz.bf16 and red.add.noftz.bf16x2 introduced in PTX ISA version 7.8.

Support for .cluster scope qualifier introduced in PTX ISA version 7.8.

Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.

Support for vector types introduced in PTX ISA version 8.1.

Target ISA Notes

red.global requires sm_11 or higher.

red.shared requires sm_12 or higher.

red.global.add.u64 requires sm_12 or higher.

red.shared.add.u64 requires sm_20 or higher.

64-bit red.{and,or,xor,min,max} requires sm_32 or higher.

red.add.f32 requires sm_20 or higher.

red.add.f64 requires sm_60 or higher.

.scope qualifier requires sm_60 or higher.

.sem qualifier requires sm_70 or higher.

Use of generic addressing requires sm_20 or higher.

red.add.noftz.f16x2 requires sm_60 or higher.

red.add.noftz.f16 requires sm_70 or higher.

Support for .level::cache_hint qualifier requires sm_80 or higher.

red.add.noftz.bf16 and red.add.noftz.bf16x2 require sm_90 or higher.

Support for .cluster scope qualifier requires sm_90 or higher.

Sub-qualifier ::cta requires sm_30 or higher.

Sub-qualifier ::cluster requires sm_90 or higher.

Support for vector types requires sm_90 or higher.

Examples

red.global.add.s32  [a],1;
red.shared::cluster.max.u32  [x+4],0;
@p  red.global.and.b32  [p],my_val;
red.global.sys.add.u32 [a], 1;
red.global.acquire.sys.add.u32 [gbl], 1;
red.add.noftz.f16x2 [a], b;
red.add.noftz.bf16   [a], hb;
red.add.noftz.bf16x2 [b], bb;
red.global.cluster.relaxed.add.u32 [a], 1;
red.shared::cta.min.u32  [x+4],0;
createpolicy.fractional.L2::evict_last.b64 cache-policy, 0.25;
red.global.and.L2::cache_hint.b32 [a], 1, cache-policy;
red.global.v8.f16.add.noftz  [gbl], {%h0, %h1, %h2, %h3, %h4, %h5, %h6, %h7};
red.global.v8.bf16.min.noftz [gbl], {%h0, %h1, %h2, %h3, %h4, %h5, %h6, %h7};
red.global.v2.f16.add.noftz [gbl], {%h0, %h1};
red.global.v2.bf16.add.noftz [gbl], {%h0, %h1};
red.global.v4.f16x2.max.noftz [gbl], {%h0, %h1, %h2, %h3};
red.global.v4.f32.add  [gbl], {%f0, %f1, %f2, %f3};
red.global.v2.f16x2.max.noftz [g], {%b0, %b1};
red.global.v2.bf16x2.add.noftz [g], {%b0, %b1};
red.global.v2.f32.add  [g], {%f0, %f1};

9.7.13.7. Parallel Synchronization and Communication Instructions: red.async

red.async

Asynchronous reduction operation.

Syntax

// Increment and Decrement reductions
red.async.sem.scope{.ss}.completion_mechanism.op.type [a], b, [mbar];

.sem  =                 { .relaxed };
.scope =                { .cluster };
.ss   =                 { .shared::cluster };
.op   =                 { .inc, .dec };
.type =                 { .u32 };
.completion_mechanism = { .mbarrier::complete_tx::bytes };

// MIN and MAX reductions
red.async.sem.scope{.ss}.completion_mechanism.op.type [a], b, [mbar];

.sem  = { .relaxed };
.scope = { .cluster };
.ss   = { .shared::cluster };
.op   = { .min, .max };
.type = { .u32, .s32 };
.completion_mechanism = { .mbarrier::complete_tx::bytes };

// Bitwise AND, OR and XOR reductions
red.async.sem.scope{.ss}.completion_mechanism.op.type [a], b, [mbar];

.sem  = { .relaxed };
.scope = { .cluster };
.ss   = { .shared::cluster };
.op   = { .and, .or, .xor };
.type = { .b32 };
.completion_mechanism = { .mbarrier::complete_tx::bytes };

// ADD reductions
red.async.sem.scope{.ss}.completion_mechanism.add.type [a], b, [mbar];

.sem  = { .relaxed };
.scope = { .cluster };
.ss   = { .shared::cluster };
.type = { .u32, .s32, .u64 };
.completion_mechanism = { .mbarrier::complete_tx::bytes };

red.async{.mmio}.sem.scope{.ss}.add.type [a], b;

.sem  = { .release };
.scope = { .gpu, .sys };
.ss   = { .global };
.type = { .u32, .s32, .u64, .s64 };

Description

red.async is a non-blocking instruction which initiates an asynchronous reduction operation, specified by .op, with the operand b and the value at the destination shared memory location specified by operand a.

Operands

  • a is a destination address, and must be either a register or of the form register+immOff, as described in Addresses as Operands.

  • b is a source value, of the type indicated by qualifier .type.

  • mbar is an mbarrier object address.

Qualifiers

  • .mmio indicates whether this is an mmio operation.

  • .sem specifies the memory ordering semantics as described in the Memory Consistency Model.

  • .scope specifies the set of threads with which this instruction can directly synchronize.

  • .ss specifies the state space of the destination operand a and the mbarrier operand mbar.

  • .completion_mechanism specifies the mechanism for observing the completion of the asynchronous operation.

    • When .completion_mechanism is .mbarrier::complete_tx::bytes: upon completion of the asynchronous operation, a complete-tx operation is performed on the mbarrier object specified by the operand mbar, with the completeCount argument equal to the amount of data stored in bytes.

    • When .completion_mechanism is not specified: the completion of the store synchronizes with the end of the CTA.

  • .op specifies the reduction operation.

    • The .inc and .dec operations return a result in the range [0..b].

  • .type specifies the type of the source operand b.

Conditions

When .sem is .relaxed:

  • The reduce operation is a relaxed memory operation.

  • The complete-tx operation on the mbarrier has .release semantics at .cluster scope.

  • The shared-memory addresses of the destination operand a and the mbarrier operand mbar must meet all of the following conditions:

    • They belong to the same CTA.

    • The CTA to which they belong is different from the CTA of the executing thread, but must be within the same cluster.

    Otherwise, the behavior is undefined.

  • .mmio must not be specified.

  • If .ss is specified, it must be .shared::cluster.

  • If .ss is not specified, generic addressing is used for operands a and mbar. If the generic addresses specified do not fall within the address window of the .shared::cluster state space, the behavior is undefined.

  • If .completion_mechanism is specified, it must be .mbarrier::complete_tx::bytes.

  • If .completion_mechanism is not specified, it defaults to .mbarrier::complete_tx::bytes.

When .sem is .release:

  • The reduce operation is a strong memory operation with .release semantics at the scope specified by .scope.

  • If .mmio is specified, .scope must be .sys.

  • If .ss is specified, it must be .global.

  • If .ss is not specified, generic addressing is used for operand a. If the generic address specified does not fall within the address window of the .global state space, the behavior is undefined.

  • .completion_mechanism must not be specified.
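As a rough mental model of the relaxed form, the following C sketch shows a .min reduction followed by a complete-tx operation that accounts for the stored bytes. All names here (mbar_model, complete_tx, red_async_min_u32) are hypothetical stand-ins; the real mbarrier object is opaque and the real operation is asynchronous.

```c
#include <stdint.h>

/* Toy stand-in for the mbarrier's transaction bookkeeping. */
struct mbar_model { int64_t tx_count; };

/* complete-tx: progress the mbarrier by the number of bytes stored. */
static void complete_tx(struct mbar_model *m, int64_t bytes) {
    m->tx_count -= bytes;
}

/* Model of red.async..min.u32: apply the reduction at the destination,
   then signal completion on the mbarrier with the byte count. */
static void red_async_min_u32(uint32_t *a, uint32_t b, struct mbar_model *m) {
    if (b < *a) *a = b;                     /* .min reduction   */
    complete_tx(m, (int64_t)sizeof(uint32_t)); /* 4 bytes stored */
}
```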

PTX ISA Notes

Introduced in PTX ISA version 8.1.

Support for the .mmio qualifier, .release semantics, .global state space, and .gpu and .sys scopes introduced in PTX ISA version 8.7.

Target ISA Notes

Requires sm_90 or higher.

The .mmio qualifier, .release semantics, .global state space, and .gpu and .sys scopes require sm_100 or higher.

Examples

red.async.relaxed.cluster.shared::cluster.mbarrier::complete_tx::bytes.min.u32 [addr], b, [mbar_addr];

red.async.release.sys.global.add.u32 [addr], b;

9.7.13.8. Parallel Synchronization and Communication Instructions: vote (deprecated)

vote (deprecated)

Vote across thread group.

Syntax

vote.mode.pred  d, {!}a;
vote.ballot.b32 d, {!}a;  // 'ballot' form, returns bitmask

.mode = { .all, .any, .uni };

Deprecation Note

The vote instruction without a .sync qualifier is deprecated in PTX ISA version 6.0.

  • Support for this instruction with .target lower than sm_70 may be removed in a future PTX ISA version.

Removal Note

Support for the vote instruction without a .sync qualifier is removed in PTX ISA version 6.4 for .target sm_70 or higher.

Description

Performs a reduction of the source predicate across all active threads in a warp. The destination predicate value is the same across all threads in the warp.

The reduction modes are:

.all

True if source predicate is True for all active threads in warp. Negate the source predicate to compute .none.

.any

True if source predicate is True for some active thread in warp. Negate the source predicate to compute .not_all.

.uni

True if source predicate has the same value in all active threads in warp. Negating the source predicate also computes .uni.

In the ballot form, vote.ballot.b32 simply copies the predicate from each thread in a warp into the corresponding bit position of destination register d, where the bit position corresponds to the thread's lane id.

An inactive thread in the warp will contribute a 0 for its entry when participating in vote.ballot.b32.
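The ballot semantics can be modeled on the CPU as follows. The per-lane arrays are an illustrative stand-in for the warp's active and predicate state; the function name is not part of PTX.

```c
#include <stdint.h>

/* CPU model of vote.ballot.b32: each active thread contributes its
   predicate at the bit position given by its lane id; inactive threads
   contribute 0 regardless of their predicate. */
static uint32_t vote_ballot(const int active[32], const int pred[32]) {
    uint32_t d = 0;
    for (int lane = 0; lane < 32; ++lane)
        if (active[lane] && pred[lane])
            d |= (1u << lane);
    return d;
}
```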

PTX ISA Notes

Introduced in PTX ISA version 1.2.

Deprecated in PTX ISA version 6.0 in favor of vote.sync.

Not supported in PTX ISA version 6.4 for .target sm_70 or higher.

Target ISA Notes

vote requires sm_12 or higher.

vote.ballot.b32 requires sm_20 or higher.

vote is not supported on sm_70 or higher starting PTX ISA version 6.4.

Release Notes

Note that vote applies to threads in a single warp, not across an entire CTA.

Examples

vote.all.pred    p,q;
vote.uni.pred    p,q;
vote.ballot.b32  r1,p;  // get 'ballot' across warp

9.7.13.9. Parallel Synchronization and Communication Instructions: vote.sync

vote.sync

Vote across thread group.

Syntax

vote.sync.mode.pred  d, {!}a, membermask;
vote.sync.ballot.b32 d, {!}a, membermask;  // 'ballot' form, returns bitmask

.mode = { .all, .any, .uni };

Description

vote.sync will cause the executing thread to wait until all non-exited threads corresponding to membermask have executed vote.sync with the same qualifiers and the same membermask value before resuming execution.

Operand membermask specifies a 32-bit integer which is a mask indicating the threads participating in this instruction, where the bit position corresponds to the thread's laneid. Operand a is a predicate register.

In the mode form, vote.sync performs a reduction of the source predicate across all non-exited threads in membermask. The destination operand d is a predicate register and its value is the same across all threads in membermask.

The reduction modes are:

.all

True if source predicate is True for all non-exited threads in membermask. Negate the source predicate to compute .none.

.any

True if source predicate is True for some thread in membermask. Negate the source predicate to compute .not_all.

.uni

True if source predicate has the same value in all non-exited threads in membermask. Negating the source predicate also computes .uni.

In the ballot form, the destination operand d is a .b32 register. In this form, vote.sync.ballot.b32 simply copies the predicate from each thread in membermask into the corresponding bit position of destination register d, where the bit position corresponds to the thread's lane id.

A thread not specified in membermask will contribute a 0 for its entry in vote.sync.ballot.b32.

The behavior of vote.sync is undefined if the executing thread is not in the membermask.

Note

For .target sm_6x or below, all threads in membermask must execute the same vote.sync instruction in convergence, and only threads belonging to some membermask can be active when the vote.sync instruction is executed. Otherwise, the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 6.0.

Target ISA Notes

Requires sm_30 or higher.

Examples

vote.sync.all.pred    p,q,0xffffffff;
vote.sync.ballot.b32  r1,p,0xffffffff;  // get 'ballot' across warp

9.7.13.10. Parallel Synchronization and Communication Instructions: match.sync

match.sync

Broadcast and compare a value across threads in a warp.

Syntax

match.any.sync.type  d, a, membermask;
match.all.sync.type  d[|p], a, membermask;

.type = { .b32, .b64 };

Description

match.sync will cause the executing thread to wait until all non-exited threads from membermask have executed match.sync with the same qualifiers and the same membermask value before resuming execution.

Operand membermask specifies a 32-bit integer which is a mask indicating the threads participating in this instruction, where the bit position corresponds to the thread's laneid.

match.sync performs a broadcast and compare of operand a across all non-exited threads in membermask and sets destination d and the optional predicate p based on the mode.

Operand a has the instruction type and d has the .b32 type.

Destination d is a 32-bit mask where a bit position in the mask corresponds to a thread's laneid.

The matching operation modes are:

.all

d is set to the mask corresponding to the non-exited threads in membermask if all non-exited threads in membermask have the same value of operand a; otherwise d is set to 0. Optionally, predicate p is set to true if all non-exited threads in membermask have the same value of operand a; otherwise p is set to false. The sink symbol '_' may be used in place of any one of the destination operands.

.any

d is set to the mask of non-exited threads in membermask that have the same value of operand a.

The behavior of match.sync is undefined if the executing thread is not in the membermask.
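Both modes can be modeled on the CPU. In this illustrative C sketch (function names are not part of PTX), exited lanes are represented simply by clearing their bits in membermask, and `a` holds each lane's operand value.

```c
#include <stdint.h>

/* match.any model: result for 'lane' is the mask of participating lanes
   whose value of a equals this lane's value. */
static uint32_t match_any(uint32_t membermask, const uint32_t a[32], int lane) {
    uint32_t d = 0;
    for (int i = 0; i < 32; ++i)
        if (((membermask >> i) & 1u) && a[i] == a[lane])
            d |= (1u << i);
    return d;
}

/* match.all model: d is the full membermask if every participating lane
   holds the same value of a, else 0; *p mirrors that condition. */
static uint32_t match_all(uint32_t membermask, const uint32_t a[32], int *p) {
    int first = -1;
    uint32_t d = membermask;
    for (int i = 0; i < 32; ++i)
        if ((membermask >> i) & 1u) {
            if (first < 0) first = i;
            else if (a[i] != a[first]) d = 0;
        }
    *p = (d != 0);
    return d;
}
```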

PTX ISA Notes

Introduced in PTX ISA version 6.0.

Target ISA Notes

Requires sm_70 or higher.

Release Notes

Note that match.sync applies to threads in a single warp, not across an entire CTA.

Examples

match.any.sync.b32    d, a, 0xffffffff;
match.all.sync.b64    d|p, a, mask;

9.7.13.11. Parallel Synchronization and Communication Instructions: activemask

activemask

Queries the active threads within a warp.

Syntax

activemask.b32 d;

Description

activemask queries the predicated-on active threads from the executing warp and sets the destination d with a 32-bit integer mask, where the bit position in the mask corresponds to the thread's laneid.

Destination d is a 32-bit destination register.

An active thread will contribute 1 for its entry in the result, and an exited, inactive, or predicated-off thread will contribute 0 for its entry in the result.

PTX ISA Notes

Introduced in PTX ISA version 6.2.

Target ISA Notes

Requires sm_30 or higher.

Examples

activemask.b32  %r1;

9.7.13.12. Parallel Synchronization and Communication Instructions: redux.sync

redux.sync

Perform reduction operation on the data from each predicated active thread in the thread group.

Syntax

redux.sync.op.type dst, src, membermask;

.op   = { .add, .min, .max };
.type = { .u32, .s32 };

redux.sync.op.b32 dst, src, membermask;

.op   = { .and, .or, .xor };

redux.sync.op{.abs}{.NaN}.f32 dst, src, membermask;

.op   = { .min, .max };

Description

redux.sync will cause the executing thread to wait until all non-exited threads corresponding to membermask have executed redux.sync with the same qualifiers and the same membermask value before resuming execution.

Operand membermask specifies a 32-bit integer which is a mask indicating the threads participating in this instruction, where the bit position corresponds to the thread's laneid.

redux.sync performs a reduction operation .op of the 32-bit source register src across all non-exited threads in the membermask. The result of the reduction operation is written to the 32-bit destination register dst.

The reduction operation can be one of the bitwise operations .and, .or, .xor, or one of the arithmetic operations .add, .min, .max.

For the .add operation, the result is truncated to 32 bits.

For the .f32 instruction type, if the input value is 0.0, then +0.0 is treated as greater than -0.0.

If the .abs qualifier is specified, then the absolute value of the input is considered for the reduction operation.

If the .NaN qualifier is specified, then the result of the reduction operation is canonical NaN if the input to the reduction operation from any participating thread is NaN.

In the absence of the .NaN qualifier, only non-NaN values are considered for the reduction operation, and the result will be canonical NaN when all inputs are NaN.

The behavior of redux.sync is undefined if the executing thread is not in the membermask.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Support for the .f32 type is introduced in PTX ISA version 8.6.

Support for the .abs and .NaN qualifiers is introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_80 or higher.

The .f32 type requires sm_100a, and is supported on sm_100f from PTX ISA version 8.8.

Qualifiers .abs and .NaN require sm_100a, and are supported on sm_100f or higher in the same family from PTX ISA version 8.8.

Release Notes

Note that redux.sync applies to threads in a single warp, not across an entire CTA.

Examples

.reg .b32 dst, src, init, mask;
redux.sync.add.s32 dst, src, 0xff;
redux.sync.xor.b32 dst, src, mask;
redux.sync.min.abs.NaN.f32 dst, src, mask;

9.7.13.13. Parallel Synchronization and Communication Instructions: griddepcontrol

griddepcontrol

Control execution of dependent grids.

Syntax

griddepcontrol.action;

.action = { .launch_dependents, .wait };

Description

The griddepcontrol instruction allows the dependent grids and prerequisite grids, as defined by the runtime, to control execution in the following ways:

The .launch_dependents modifier signals that the specific dependents the runtime system designated to react to this instruction can be scheduled as soon as all other CTAs in the grid issue the same instruction or have completed. The dependent may launch before the completion of the current grid, but there is no guarantee that it will. Repeated invocations of this instruction by threads in the current CTA have no additional side effects beyond those of the first invocation.

The .wait modifier causes the executing thread to wait until all prerequisite grids in flight have completed and all the memory operations from the prerequisite grids are performed and made visible to the current grid.

Note

If the prerequisite grid is using griddepcontrol.launch_dependents, then the dependent grid must use griddepcontrol.wait to ensure correct functional execution.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

griddepcontrol.launch_dependents;
griddepcontrol.wait;

9.7.13.14. Parallel Synchronization and Communication Instructions: elect.sync

elect.sync

Elect a leader thread from a set of threads.

Syntax

elect.sync d|p, membermask;

Description

elect.sync elects one predicated active leader thread from among the set of threads specified by membermask. The laneid of the elected thread is returned in the 32-bit destination operand d. The sink symbol '_' can be used for destination operand d. The predicate destination p is set to True for the leader thread, and False for all other threads.

Operand membermask specifies a 32-bit integer indicating the set of threads from which a leader is to be elected. The behavior is undefined if the executing thread is not in membermask.

Election of a leader thread happens deterministically, i.e., the same leader thread is elected for the same membermask every time.

The mandatory .sync qualifier indicates that elect causes the executing thread to wait until all threads in membermask execute the elect instruction before resuming execution.
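The ISA only promises that the election is deterministic for a given membermask; which lane is chosen is unspecified. The sketch below assumes, purely for illustration, that the lowest set lane wins; the function name and that policy are assumptions of this model, not documented behavior.

```c
#include <stdint.h>

/* Toy model of elect.sync for one executing lane: returns the predicate
   p (1 for the leader, 0 otherwise) and writes the leader's laneid to *d.
   Leader choice here (lowest set bit of membermask) is an assumption;
   the ISA only guarantees a deterministic choice. */
static int elect_model(uint32_t membermask, int lane, uint32_t *d) {
    if (membermask == 0) return 0;   /* no candidate lanes */
    int leader = 0;
    while (!((membermask >> leader) & 1u))
        ++leader;
    *d = (uint32_t)leader;
    return lane == leader;
}
```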

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

elect.sync    %r0|%p0, 0xffffffff;

9.7.13.15. Parallel Synchronization and Communication Instructions: mbarrier

mbarrier is a barrier created in shared memory that supports:

  • Synchronizing any subset of threads within a CTA

  • One-way synchronization of threads across CTAs of a cluster. As noted in mbarrier support with shared memory, threads can perform only arrive operations but not *_wait on an mbarrier located in shared::cluster space.

  • Waiting for completion of asynchronous memory operations initiated by a thread and making them visible to other threads.

An mbarrier object is an opaque object in memory which can be initialized and invalidated using:

  • mbarrier.init

  • mbarrier.inval

Operations supported on mbarrier objects are:

  • mbarrier.expect_tx

  • mbarrier.complete_tx

  • mbarrier.arrive

  • mbarrier.arrive_drop

  • mbarrier.test_wait

  • mbarrier.try_wait

  • mbarrier.pending_count

  • cp.async.mbarrier.arrive

Performing any mbarrier operation except mbarrier.init on an uninitialized mbarrier object results in undefined behavior. Performing any non-mbarrier operation or an mbarrier.init operation on an initialized mbarrier object also results in undefined behavior.

Unlike bar{.cta}/barrier{.cta} instructions, which can access a limited number of barriers per CTA, mbarrier objects are user-defined and are limited only by the total shared memory size available.

mbarrier operations enable threads to perform useful work after arrival at the mbarrier and before waiting for the mbarrier to complete.

9.7.13.15.1. Size and alignment of the mbarrier object

An mbarrier object is an opaque object with the following type and alignment requirements:

Type    Alignment (bytes)    Memory space
.b64    8                    .shared

9.7.13.15.2. Contents of the mbarrier object

An opaque mbarrier object keeps track of the following information:

  • Current phase of the mbarrier object

  • Count of pending arrivals for the current phase of the mbarrier object

  • Count of expected arrivals for the next phase of the mbarrier object

  • Count of pending asynchronous memory operations (or transactions) tracked by the current phase of the mbarrier object. This is also referred to as tx-count.

An mbarrier object progresses through a sequence of phases, where each phase is defined by threads performing an expected number of arrive-on operations.

The valid range of each of the counts is as shown below:

Count name                Minimum value    Maximum value
Expected arrival count    1                2^20 - 1
Pending arrival count     0                2^20 - 1
tx-count                  -(2^20 - 1)      2^20 - 1

9.7.13.15.3. Lifecycle of the mbarrier object

The mbarrier object must be initialized prior to use.

An mbarrier object is used to synchronize threads and asynchronous memory operations.

An mbarrier object may be used to perform a sequence of such synchronizations.

An mbarrier object must be invalidated before its memory is repurposed for any other purpose, including creating another mbarrier object.

9.7.13.15.4. Phase of the mbarrier object

The phase of an mbarrier object is the number of times the mbarrier object has been used to synchronize threads and asynchronous operations. In each phase {0, 1, 2, …}, threads perform, in program order:

  • arrive-on operations to complete the current phase, and

  • test_wait / try_wait operations to check for completion of the current phase.

An mbarrier object is automatically reinitialized upon completion of the current phase for immediate use in the next phase. The current phase is incomplete and all prior phases are complete.

For each phase of the mbarrier object, at least one test_wait or try_wait operation that returns True for waitComplete must be performed before an arrive-on operation in the subsequent phase.

9.7.13.15.5. Tracking asynchronous operations by the mbarrier object

Starting with the Hopper architecture (sm_9x), the mbarrier object supports a new count, called tx-count, which is used for tracking the completion of asynchronous memory operations or transactions. tx-count tracks the number of asynchronous transactions, in units specified by the asynchronous memory operation, that are outstanding and yet to be completed.

The tx-count of an mbarrier object must be set to the total number of asynchronous transactions, in units as specified by the asynchronous operations, to be tracked by the current phase. Upon completion of each asynchronous operation, the complete-tx operation is performed on the mbarrier object, progressing the mbarrier towards the completion of the current phase.

9.7.13.15.5.1. expect-tx operation

The expect-tx operation, with an expectCount argument, increases the tx-count of an mbarrier object by the value specified by expectCount. This sets the current phase of the mbarrier object to expect and track the completion of additional asynchronous transactions.

9.7.13.15.5.2. complete-tx operation

The complete-tx operation, with a completeCount argument, on an mbarrier object consists of the following:

mbarrier signaling

Signals the completion of asynchronous transactions that were tracked by the current phase. As a result, tx-count is decremented by completeCount.

mbarrier potentially completing the current phase

If the current phase has been completed, then the mbarrier transitions to the next phase. Refer to Phase Completion of the mbarrier object for details on the phase completion requirements and the phase transition process.

9.7.13.15.6. Phase Completion of the mbarrier object

The requirements for completion of the current phase are described below, followed by the phase transition that occurs upon completion.

Current phase completion requirements

An mbarrier object completes the current phase when all of the following conditions are met:

  • The count of pending arrivals has reached zero.

  • The tx-count has reached zero.

Phase transition

When an mbarrier object completes the current phase, the following actions are performed atomically:

  • The mbarrier object transitions to the next phase.

  • The pending arrival count is reinitialized to the expected arrival count.
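The counting rules above (arrive-on decrementing the pending arrival count, expect-tx and complete-tx adjusting tx-count, and the atomic phase transition once both reach zero) can be sketched as a small software model. This is an illustrative Python stand-in only; all names are invented, and it models none of the hardware's atomicity or memory ordering semantics.

```python
class MBarrierModel:
    """Software model of one mbarrier's per-phase bookkeeping (not real PTX)."""

    MAX_COUNT = (1 << 20) - 1  # 2^20 - 1, per the valid-range table above

    def __init__(self, expected_count):
        assert 1 <= expected_count <= self.MAX_COUNT
        self.phase = 0
        self.expected = expected_count   # expected arrivals per phase
        self.pending = expected_count    # pending arrivals, current phase
        self.tx_count = 0                # outstanding async transactions

    def expect_tx(self, expect_count):
        # expect-tx: the current phase now also tracks these transactions.
        self.tx_count += expect_count

    def complete_tx(self, complete_count):
        # complete-tx: signal completion of tracked transactions.
        self.tx_count -= complete_count
        self._maybe_complete_phase()

    def arrive(self, count=1):
        # arrive-on: decrement pending arrivals; may complete the phase.
        prior_phase = self.phase         # stands in for the opaque 'state'
        self.pending -= count
        self._maybe_complete_phase()
        return prior_phase

    def _maybe_complete_phase(self):
        # The phase completes only when BOTH counts are zero; the hardware
        # performs the transition and the pending-count reinit atomically.
        if self.pending == 0 and self.tx_count == 0:
            self.phase += 1
            self.pending = self.expected
```

In this model, two arrivals alone do not complete a phase whose tx-count is still nonzero; the final complete-tx does.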

9.7.13.15.7. Arrive-on operation on mbarrier object

An arrive-on operation, with an optional count argument, on an mbarrier object consists of the following two steps:

  • mbarrier signaling:

    Signals the arrival of the executing thread, OR completion of the asynchronous instruction which signals the arrive-on operation initiated by the executing thread, on the mbarrier object. As a result, the pending arrival count is decremented by count. If the count argument is not specified, then it defaults to 1.

  • mbarrier potentially completing the current phase:

    If the current phase has been completed, then the mbarrier transitions to the next phase. Refer to Phase Completion of the mbarrier object for details on the phase completion requirements and the phase transition process.

9.7.13.15.8. mbarrier support with shared memory

The following table summarizes the support of various mbarrier operations on mbarrier objects located at different shared memory locations:

mbarrier operations          .shared::cta    .shared::cluster
mbarrier.arrive              Supported       Supported, cannot return result
mbarrier.expect_tx           Supported       Supported
mbarrier.complete_tx         Supported       Supported
Other mbarrier operations    Supported       Not supported

9.7.13.15.9. Parallel Synchronization and Communication Instructions: mbarrier.init

mbarrier.init

Initialize the mbarrier object.

Syntax

mbarrier.init{.shared{::cta}}.b64 [addr], count;

Description

mbarrier.init initializes the mbarrier object at the location specified by the address operand addr with the unsigned 32-bit integer count. The value of the operand count must be in the range as specified in Contents of the mbarrier object.

Initialization of the mbarrier object involves:

  • Initializing the current phase to 0.

  • Initializing the expected arrival count to count.

  • Initializing the pending arrival count to count.

  • Initializing the tx-count to 0.

The valid range of values for the operand count is [1, …, 2^20 - 1]. Refer to Contents of the mbarrier object for the valid range of values for the various constituents of the mbarrier.

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .shared::cta state space then the behavior is undefined.

Supported addressing modes for the operand addr are as described in Addresses as Operands. Alignment for the operand addr is as described in Size and alignment of mbarrier object.

Performing an mbarrier.init operation on a memory location containing a valid mbarrier object results in undefined behavior; invalidate the mbarrier object using mbarrier.inval first, before repurposing the memory location for any other purpose, including another mbarrier object.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_80 or higher.

Examples

.shared .b64 shMem, shMem2;
.reg    .b64 addr;
.reg    .b32 %r1;

cvta.shared.u64          addr, shMem2;
mbarrier.init.b64        [addr],   %r1;
bar.cta.sync             0;
// ... other mbarrier operations on addr

mbarrier.init.shared::cta.b64 [shMem], 12;
bar.sync                 0;
// ... other mbarrier operations on shMem
9.7.13.15.10. Parallel Synchronization and Communication Instructions: mbarrier.inval

mbarrier.inval

Invalidates the mbarrier object.

Syntax

mbarrier.inval{.shared{::cta}}.b64 [addr];

Description

mbarrier.inval invalidates the mbarrier object at the location specified by the address operand addr.

An mbarrier object must be invalidated before using its memory location for any other purpose.

Performing any mbarrier operation except mbarrier.init on a memory location that does not contain a valid mbarrier object results in undefined behavior.

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .shared::cta state space then the behavior is undefined.

Supported addressing modes for the operand addr are as described in Addresses as Operands. Alignment for the operand addr is as described in Size and alignment of mbarrier object.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_80 or higher.

Examples

.shared .b64 shmem;
.reg    .b64 addr;
.reg    .b32 %r1;
.reg    .pred t0;

// Example 1 :
bar.sync                      0;
@t0 mbarrier.init.b64     [addr], %r1;
// ... other mbarrier operations on addr
bar.sync                      0;
@t0 mbarrier.inval.b64    [addr];

// Example 2 :
bar.cta.sync                  0;
mbarrier.init.shared.b64           [shmem], 12;
// ... other mbarrier operations on shmem
bar.cta.sync                  0;
@t0 mbarrier.inval.shared.b64      [shmem];

// shmem can be reused here for unrelated use :
bar.cta.sync                  0;
st.shared.b64                      [shmem], ...;

// shmem can be re-initialized as mbarrier object :
bar.cta.sync                  0;
@t0 mbarrier.init.shared.b64       [shmem], 24;
// ... other mbarrier operations on shmem
bar.cta.sync                  0;
@t0 mbarrier.inval.shared::cta.b64 [shmem];
9.7.13.15.11. Parallel Synchronization and Communication Instructions: mbarrier.expect_tx

mbarrier.expect_tx

Performs expect-tx operation on the mbarrier object.

Syntax

mbarrier.expect_tx{.sem}{.scope}{.space}.b64 [addr], txCount;

.sem   = { .relaxed }
.scope = { .cta, .cluster }
.space = { .shared{::cta}, .shared::cluster }

Description

A thread executing mbarrier.expect_tx performs an expect-tx operation on the mbarrier object at the location specified by the address operand addr. The 32-bit unsigned integer operand txCount specifies the expectCount argument to the expect-tx operation.

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .shared::cta or .shared::cluster state space then the behavior is undefined.

Supported addressing modes for the operand addr are as described in Addresses as Operands. Alignment for the operand addr is as described in Size and alignment of mbarrier object.

This operation does not provide any memory ordering semantics and thus is a relaxed operation.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

mbarrier.expect_tx.b64                       [addr], 32;
mbarrier.expect_tx.relaxed.cta.shared.b64    [mbarObj1], 512;
mbarrier.expect_tx.relaxed.cta.shared.b64    [mbarObj2], 512;
9.7.13.15.12. Parallel Synchronization and Communication Instructions: mbarrier.complete_tx

mbarrier.complete_tx

Performs complete-tx operation on the mbarrier object.

Syntax

mbarrier.complete_tx{.sem}{.scope}{.space}.b64 [addr], txCount;

.sem   = { .relaxed }
.scope = { .cta, .cluster }
.space = { .shared{::cta}, .shared::cluster }

Description

A thread executing mbarrier.complete_tx performs a complete-tx operation on the mbarrier object at the location specified by the address operand addr. The 32-bit unsigned integer operand txCount specifies the completeCount argument to the complete-tx operation.

mbarrier.complete_tx does not involve any asynchronous memory operation and only simulates the completion of such an operation and its side effect of signaling the mbarrier object.

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .shared::cta or .shared::cluster state space then the behavior is undefined.

Supported addressing modes for the operand addr are as described in Addresses as Operands. Alignment for the operand addr is as described in Size and alignment of mbarrier object.

This operation does not provide any memory ordering semantics and thus is a relaxed operation.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

mbarrier.complete_tx.b64             [addr],     32;
mbarrier.complete_tx.shared.b64      [mbarObj1], 512;
mbarrier.complete_tx.relaxed.cta.b64 [addr2],    32;
9.7.13.15.13. Parallel Synchronization and Communication Instructions: mbarrier.arrive

mbarrier.arrive

Performs arrive-on operation on the mbarrier object.

Syntax

mbarrier.arrive{.sem}{.scope}{.shared{::cta}}.b64               state, [addr]{, count};
mbarrier.arrive{.sem}{.scope}{.shared::cluster}.b64                 _, [addr]{, count};
mbarrier.arrive.expect_tx{.sem}{.scope}{.shared{::cta}}.b64     state, [addr], txCount;
mbarrier.arrive.expect_tx{.sem}{.scope}{.shared::cluster}.b64       _, [addr], txCount;
mbarrier.arrive.noComplete{.release}{.cta}{.shared{::cta}}.b64  state, [addr], count;

.sem   = { .release, .relaxed }
.scope = { .cta, .cluster }

Description

A thread executing mbarrier.arrive performs an arrive-on operation on the mbarrier object at the location specified by the address operand addr. The 32-bit unsigned integer operand count specifies the count argument to the arrive-on operation.

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .shared::cta state space then the behavior is undefined.

Supported addressing modes for the operand addr are as described in Addresses as Operands. Alignment for the operand addr is as described in Size and alignment of mbarrier object.

The optional qualifier .expect_tx specifies that an expect-tx operation is performed prior to the arrive-on operation. The 32-bit unsigned integer operand txCount specifies the expectCount argument to the expect-tx operation. When both qualifiers .arrive and .expect_tx are specified, the count argument of the arrive-on operation is assumed to be 1.

An mbarrier.arrive operation with the .noComplete qualifier must not cause the mbarrier to complete its current phase, otherwise the behavior is undefined.

The value of the operand count must be in the range as specified in Contents of the mbarrier object.

Note: for sm_8x, when the argument count is specified, the modifier .noComplete is required.

The mbarrier.arrive operation on an mbarrier object located in .shared::cta returns an opaque 64-bit register capturing the phase of the mbarrier object prior to the arrive-on operation in the destination operand state. Contents of the state operand are implementation specific. Optionally, the sink symbol '_' can be used for the state argument.

The mbarrier.arrive operation on an mbarrier object located in .shared::cluster but not in .shared::cta cannot return a value. The sink symbol '_' is mandatory for the destination operand in such cases.

The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .release is assumed by default. The .relaxed qualifier does not provide any memory ordering semantics or visibility guarantees.

The optional .scope qualifier indicates the set of threads that directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model. If the .scope qualifier is not specified then it defaults to .cta. In contrast, the .shared::<scope> qualifier indicates the state space where the mbarrier resides.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Support for sink symbol ‘_’ as the destination operand is introduced in PTX ISA version 7.1.

Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.

Support for the count argument without the modifier .noComplete introduced in PTX ISA version 7.8.

Support for sub-qualifier ::cluster introduced in PTX ISA version 8.0.

Support for qualifier .expect_tx introduced in PTX ISA version 8.0.

Support for .scope and .sem qualifiers introduced in PTX ISA version 8.0.

Support for .relaxed qualifier introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_80 or higher.

Support for the count argument without the modifier .noComplete requires sm_90 or higher.

Qualifier .expect_tx requires sm_90 or higher.

Sub-qualifier ::cluster requires sm_90 or higher.

Support for .cluster scope requires sm_90 or higher.

Examples

.reg .b32 cnt, remoteAddr32, remoteCTAId, addr32;
.reg .b64 %r<5>, addr, remoteAddr64;
.shared .b64 shMem, shMem2;

cvta.shared.u64            addr, shMem2;
mov.b32                    addr32, shMem2;
mapa.shared::cluster.u32   remoteAddr32, addr32, remoteCTAId;
mapa.u64                   remoteAddr64, addr,   remoteCTAId;

mbarrier.arrive.shared.b64                       %r0, [shMem];
mbarrier.arrive.shared::cta.b64                  %r0, [shMem2];
mbarrier.arrive.release.cta.shared::cluster.b64  _, [remoteAddr32];
mbarrier.arrive.release.cluster.b64              _, [remoteAddr64], cnt;
mbarrier.arrive.expect_tx.release.cluster.b64    _, [remoteAddr64], tx_count;
mbarrier.arrive.noComplete.b64                   %r1, [addr], 2;
mbarrier.arrive.relaxed.cta.b64                  %r2, [addr], 4;
mbarrier.arrive.b64                              %r2, [addr], cnt;
9.7.13.15.14. Parallel Synchronization and Communication Instructions: mbarrier.arrive_drop

mbarrier.arrive_drop

Decrements the expected count of the mbarrier object and performs an arrive-on operation.

Syntax

mbarrier.arrive_drop{.sem}{.scope}{.shared{::cta}}.b64              state, [addr]{, count};
mbarrier.arrive_drop{.sem}{.scope}{.shared::cluster}.b64                _, [addr]{, count};
mbarrier.arrive_drop.expect_tx{.shared{::cta}}{.sem}{.scope}.b64    state, [addr], tx_count;
mbarrier.arrive_drop.expect_tx{.shared::cluster}{.sem}{.scope}.b64      _, [addr], tx_count;
mbarrier.arrive_drop.noComplete{.release}{.cta}{.shared{::cta}}.b64 state, [addr], count;

.sem   = { .release, .relaxed }
.scope = { .cta, .cluster }

Description

A thread executing mbarrier.arrive_drop on the mbarrier object at the location specified by the address operand addr performs the following steps:

  • Decrements the expected arrival count of the mbarrier object by the value specified by the 32-bit integer operand count. If the count operand is not specified, it defaults to 1.

  • Performs an arrive-on operation on the mbarrier object. The operand count specifies the count argument to the arrive-on operation.

The decrement of the expected arrival count of the mbarrier object applies to all subsequent phases of the mbarrier object.
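The persistent effect described above can be sketched in Python (an invented stand-in, not PTX): arrive_drop permanently lowers the expected arrival count, so every later phase reinitializes its pending count to the reduced value.

```python
class ArriveDropModel:
    """Toy model of arrive vs. arrive_drop on one mbarrier (not real PTX)."""

    def __init__(self, expected_count):
        self.expected = expected_count  # expected arrivals for future phases
        self.pending = expected_count   # pending arrivals, current phase
        self.phase = 0

    def arrive(self, count=1):
        self.pending -= count
        self._maybe_complete_phase()

    def arrive_drop(self, count=1):
        # Step 1: lower the expected count for ALL subsequent phases.
        self.expected -= count
        # Step 2: perform a regular arrive-on with the same count.
        self.arrive(count)

    def _maybe_complete_phase(self):
        if self.pending == 0:
            self.phase += 1
            self.pending = self.expected  # reinit uses the dropped value
```

In this model, a thread that drops itself once is no longer counted in any later phase.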

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .shared::cta or .shared::cluster state space then the behavior is undefined.

Supported addressing modes for the operand addr are as described in Addresses as Operands. Alignment for the operand addr is as described in Size and alignment of mbarrier object.

The optional qualifier .expect_tx specifies that an expect-tx operation is performed prior to the arrive-on operation. The 32-bit unsigned integer operand tx_count specifies the expectCount argument to the expect-tx operation. When the qualifier .expect_tx is specified, the count argument of the arrive-on operation is assumed to be 1.

An mbarrier.arrive_drop operation with the .release qualifier forms the release pattern as described in the Memory Consistency Model and synchronizes with the acquire patterns.

The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .release is assumed by default. The .relaxed qualifier does not provide any memory ordering semantics or visibility guarantees.

The optional .scope qualifier indicates the set of threads that an mbarrier.arrive_drop instruction can directly synchronize. If the .scope qualifier is not specified then it defaults to .cta. In contrast, the .shared::<scope> qualifier indicates the state space where the mbarrier resides.

An mbarrier.arrive_drop operation with the .noComplete qualifier must not complete the mbarrier, otherwise the behavior is undefined.

The value of the operand count must be in the range as specified in Contents of the mbarrier object.

Note: for sm_8x, when the argument count is specified, the modifier .noComplete is required.

A thread that wants to either exit or opt out of participating in the arrive-on operation can use mbarrier.arrive_drop to drop itself from the mbarrier.

The mbarrier.arrive_drop operation on an mbarrier object located in .shared::cta returns an opaque 64-bit register capturing the phase of the mbarrier object prior to the arrive-on operation in the destination operand state. Contents of the returned state are implementation specific. Optionally, the sink symbol '_' can be used for the state argument.

The mbarrier.arrive_drop operation on an mbarrier object located in .shared::cluster but not in .shared::cta cannot return a value. The sink symbol '_' is mandatory for the destination operand in such cases.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.

Support for the count argument without the modifier .noComplete introduced in PTX ISA version 7.8.

Support for qualifier .expect_tx introduced in PTX ISA version 8.0.

Support for sub-qualifier ::cluster introduced in PTX ISA version 8.0.

Support for .scope and .sem qualifiers introduced in PTX ISA version 8.0.

Support for .relaxed qualifier introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_80 or higher.

Support for the count argument without the modifier .noComplete requires sm_90 or higher.

Qualifier .expect_tx requires sm_90 or higher.

Sub-qualifier ::cluster requires sm_90 or higher.

Support for .cluster scope requires sm_90 or higher.

Examples

.reg .b32 cnt;
.reg .b64 %r1;
.shared .b64 shMem;

// Example 1
@p mbarrier.arrive_drop.shared.b64 _, [shMem];
@p exit;
@p2 mbarrier.arrive_drop.noComplete.shared.b64 _, [shMem], %a;
@p2 exit;
..
@!p mbarrier.arrive.shared.b64   %r1, [shMem];
@!p mbarrier.test_wait.shared.b64  q, [shMem], %r1;

// Example 2
mbarrier.arrive_drop.shared::cluster.b64 _, [addr];
mbarrier.arrive_drop.shared::cta.release.cluster.b64     _, [addr], cnt;

// Example 3
mbarrier.arrive_drop.expect_tx.shared::cta.relaxed.cluster.b64 state, [addr], tx_count;
9.7.13.15.15. Parallel Synchronization and Communication Instructions: cp.async.mbarrier.arrive

cp.async.mbarrier.arrive

Makes the mbarrier object track all prior cp.async operations initiated by the executing thread.

Syntax

cp.async.mbarrier.arrive{.noinc}{.shared{::cta}}.b64 [addr];

Description

Causes an arrive-on operation to be triggered by the system on the mbarrier object upon the completion of all prior cp.async operations initiated by the executing thread. The mbarrier object is at the location specified by the operand addr. The arrive-on operation is asynchronous to the execution of cp.async.mbarrier.arrive.

When the .noinc modifier is not specified, the pending count of the mbarrier object is incremented by 1 prior to the asynchronous arrive-on operation. This results in a net-zero change to the pending count from the asynchronous arrive-on operation during the current phase. The pending count of the mbarrier object after the increment should not exceed the limit as mentioned in Contents of the mbarrier object. Otherwise, the behavior is undefined.

When the .noinc modifier is specified, the increment to the pending count of the mbarrier object is not performed. Hence the decrement of the pending count done by the asynchronous arrive-on operation must be accounted for in the initialization of the mbarrier object.
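The accounting in the two paragraphs above reduces to simple arithmetic. The following Python helper is an illustrative sketch (the function name and parameters are invented): with .noinc, mbarrier.init must pre-pay one arrival for every cp.async.mbarrier.arrive.noinc a participating thread will issue, on top of the thread's own mbarrier.arrive.

```python
def init_count_with_noinc(thread_count, noinc_arrives_per_thread):
    """Arrival count to pass to mbarrier.init when each of thread_count
    threads issues noinc_arrives_per_thread cp.async.mbarrier.arrive.noinc
    operations plus one mbarrier.arrive of its own (illustrative only)."""
    # Each .noinc arrive-on decrements the pending count without the
    # balancing +1 increment, so initialization must account for it.
    return thread_count + thread_count * noinc_arrives_per_thread
```

For instance, 32 threads each issuing 3 .noinc arrives would need an initialization count of 32 + 96 = 128. Without .noinc, no extra accounting is needed because the implicit +1 increment cancels the later decrement within the same phase.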

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .shared::cta state space then the behavior is undefined.

Supported addressing modes for the operand addr are as described in Addresses as Operands. Alignment for the operand addr is as described in Size and alignment of mbarrier object.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_80 or higher.

Examples

// Example 1: no .noinc
mbarrier.init.shared.b64 [shMem], threadCount;
....
cp.async.ca.shared.global [shard1], [gbl1], 4;
cp.async.cg.shared.global [shard2], [gbl2], 16;
....
// Absence of .noinc accounts for arrive-on from completion of prior cp.async operations.
// So mbarrier.init must only account for arrive-on from mbarrier.arrive.
cp.async.mbarrier.arrive.shared.b64 [shMem];
....
mbarrier.arrive.shared.b64 state, [shMem];

waitLoop:
mbarrier.test_wait.shared.b64 p, [shMem], state;
@!p bra waitLoop;

// Example 2: with .noinc
// Tracks arrive-on from mbarrier.arrive and cp.async.mbarrier.arrive.

// All threads participating in the mbarrier perform cp.async
mov.b32 copyOperationCnt, threadCount;

// 3 arrive-on operations will be triggered per-thread
mul.lo.u32 copyArrivalCnt, copyOperationCnt, 3;

add.u32 totalCount, threadCount, copyArrivalCnt;

mbarrier.init.shared.b64 [shMem], totalCount;
....
cp.async.ca.shared.global [shard1], [gbl1], 4;
cp.async.cg.shared.global [shard2], [gbl2], 16;
...
// Presence of .noinc requires mbarrier initialization to have accounted for arrive-on from cp.async
cp.async.mbarrier.arrive.noinc.shared.b64 [shMem]; // 1st instance
....
cp.async.ca.shared.global [shard3], [gbl3], 4;
cp.async.ca.shared.global [shard4], [gbl4], 16;
cp.async.mbarrier.arrive.noinc.shared::cta.b64 [shMem]; // 2nd instance
....
cp.async.ca.shared.global [shard5], [gbl5], 4;
cp.async.cg.shared.global [shard6], [gbl6], 16;
cp.async.mbarrier.arrive.noinc.shared.b64 [shMem]; // 3rd and last instance
....
mbarrier.arrive.shared.b64 state, [shMem];

waitLoop:
mbarrier.test_wait.shared.b64 p, [shMem], state;
@!p bra waitLoop;
9.7.13.15.16. Parallel Synchronization and Communication Instructions: mbarrier.test_wait / mbarrier.try_wait

mbarrier.test_wait, mbarrier.try_wait

Checks whether the mbarrier object has completed the phase.

Syntax

mbarrier.test_wait{.sem}{.scope}{.shared{::cta}}.b64        waitComplete, [addr], state;
mbarrier.test_wait.parity{.sem}{.scope}{.shared{::cta}}.b64 waitComplete, [addr], phaseParity;
mbarrier.try_wait{.sem}{.scope}{.shared{::cta}}.b64         waitComplete, [addr], state
                                                               {, suspendTimeHint};
mbarrier.try_wait.parity{.sem}{.scope}{.shared{::cta}}.b64  waitComplete, [addr], phaseParity
                                                               {, suspendTimeHint};

.sem   = { .acquire, .relaxed }
.scope = { .cta, .cluster }

Description

The test_wait and try_wait operations test for the completion of the current or the immediately preceding phase of an mbarrier object at the location specified by the operand addr.

mbarrier.test_wait is a non-blocking instruction which tests for the completion of the phase.

mbarrier.try_wait is a potentially blocking instruction which tests for the completion of the phase. If the phase is not complete, the executing thread may be suspended. A suspended thread resumes execution when the specified phase completes, OR before the phase completes following a system-dependent time limit. The optional 32-bit unsigned integer operand suspendTimeHint specifies a time limit, in nanoseconds, that may be used instead of the system-dependent limit.

mbarrier.test_wait and mbarrier.try_wait test for completion of the phase:

  • Specified by the operand state, which was returned by an mbarrier.arrive instruction on the same mbarrier object during the current or the immediately preceding phase. Or

  • Indicated by the operand phaseParity, which is the integer parity of either the current phase or the immediately preceding phase of the mbarrier object.

The .parity variants of the instructions test for the completion of the phase indicated by the operand phaseParity. An even phase has an integer parity of 0 and an odd phase has an integer parity of 1. So the valid values of the phaseParity operand are 0 and 1.

Note: the use of the .parity variants of the instructions requires tracking the phase of an mbarrier object throughout its lifetime.

The test_wait and try_wait operations are valid only for:

  • the current incomplete phase, for which waitComplete returns False.

  • the immediately preceding phase, for which waitComplete returns True.

If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of the .shared::cta state space then the behavior is undefined.

Supported addressing modes for the operand addr are as described in Addresses as Operands. Alignment for the operand addr is as described in Size and alignment of mbarrier object.

When mbarrier.test_wait and mbarrier.try_wait operations with the .acquire qualifier return True, they form the acquire pattern as described in the Memory Consistency Model.

The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .acquire is assumed by default. The .relaxed qualifier does not provide any memory ordering semantics or visibility guarantees.

The optional .scope qualifier indicates the set of threads that the mbarrier.test_wait and mbarrier.try_wait instructions can directly synchronize. If the .scope qualifier is not specified then it defaults to .cta. In contrast, the .shared::<scope> qualifier indicates the state space where the mbarrier resides.

The following ordering of memory operations holds for the executing thread when mbarrier.test_wait or mbarrier.try_wait having acquire semantics returns True:

  1. All memory accesses (except async operations) requested prior, in program order, to mbarrier.arrive having release semantics during the completed phase by the participating threads of the CTA are performed and are visible to the executing thread.

  2. All cp.async operations requested prior, in program order, to cp.async.mbarrier.arrive during the completed phase by the participating threads of the CTA are performed and made visible to the executing thread.

  3. All cp.async.bulk asynchronous operations using the same mbarrier object requested prior, in program order, to mbarrier.arrive having release semantics during the completed phase by the participating threads of the CTA are performed and made visible to the executing thread.

  4. All memory accesses requested after the mbarrier.test_wait or mbarrier.try_wait, in program order, are not performed and not visible to memory accesses performed prior to mbarrier.arrive having release semantics, in program order, by other threads participating in the mbarrier.

  5. There is no ordering and visibility guarantee for memory accesses requested by the thread after mbarrier.arrive having release semantics and prior to mbarrier.test_wait, in program order.

PTX ISA Notes

mbarrier.test_wait introduced in PTX ISA version 7.0.

Modifier .parity is introduced in PTX ISA version 7.1.

mbarrier.try_wait introduced in PTX ISA version 7.8.

Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.

Support for .scope and .sem qualifiers introduced in PTX ISA version 8.0.

Support for .relaxed qualifier introduced in PTX ISA version 8.6.

Target ISA Notes

mbarrier.test_wait requires sm_80 or higher.

mbarrier.try_wait requires sm_90 or higher.

Support for .cluster scope requires sm_90 or higher.

Examples

// Example 1a, thread synchronization with test_wait:
.reg .b64 %r1;
.shared .b64 shMem;

mbarrier.init.shared.b64 [shMem], N;  // N threads participating in the mbarrier.
...
mbarrier.arrive.shared.b64  %r1, [shMem]; // N threads executing mbarrier.arrive

// computation not requiring mbarrier synchronization ...

waitLoop:
mbarrier.test_wait.shared.b64    complete, [shMem], %r1;
@!complete nanosleep.u32 20;
@!complete bra waitLoop;

// Example 1b, thread synchronization with try_wait :
.reg .b64 %r1;
.shared .b64 shMem;

mbarrier.init.shared.b64 [shMem], N;  // N threads participating in the mbarrier.
...
mbarrier.arrive.shared.b64  %r1, [shMem]; // N threads executing mbarrier.arrive

// computation not requiring mbarrier synchronization ...

waitLoop:
mbarrier.try_wait.relaxed.cluster.shared.b64    complete, [shMem], %r1;
@!complete bra waitLoop;

// Example 2, thread synchronization using phase parity :
.reg .b32 i, parArg;
.reg .b64 %r1;
.shared .b64 shMem;

mov.b32 i, 0;
mbarrier.init.shared.b64 [shMem], N;  // N threads participating in the mbarrier.
...
loopStart :                           // One phase per loop iteration
    ...
    mbarrier.arrive.shared.b64  %r1, [shMem]; // N threads
    ...
    and.b32 parArg, i, 1;
    waitLoop:
    mbarrier.test_wait.parity.shared.b64  complete, [shMem], parArg;
    @!complete nanosleep.u32 20;
    @!complete bra waitLoop;
    ...
    add.u32 i, i, 1;
    setp.lt.u32 p, i, IterMax;
@p bra loopStart;

// Example 3, Asynchronous copy completion waiting :
.reg .b64 state;
.shared .b64 shMem2;
.shared .b64 shard1, shard2;
.global .b64 gbl1, gbl2;

mbarrier.init.shared.b64 [shMem2], threadCount;
...
cp.async.ca.shared.global [shard1], [gbl1], 4;
cp.async.cg.shared.global [shard2], [gbl2], 16;

// Absence of .noinc accounts for arrive-on from prior cp.async operation
cp.async.mbarrier.arrive.shared.b64 [shMem2];
...
mbarrier.arrive.shared.b64 state, [shMem2];

waitLoop:
mbarrier.test_wait.shared::cta.b64 p, [shMem2], state;
@!p bra waitLoop;

// Example 4, Synchronizing the CTA0 threads with cluster threads
.reg .b64 %r1, addr, remAddr;
.shared .b64 shMem;

cvta.shared.u64          addr, shMem;
mapa.u64                 remAddr, addr, 0;     // CTA0's shMem instance

// One thread from CTA0 executing the below initialization operation
@p0 mbarrier.init.shared::cta.b64 [shMem], N;  // N = no of cluster threads

barrier.cluster.arrive;
barrier.cluster.wait;

// Entire cluster executing the below arrive operation
mbarrier.arrive.release.cluster.b64              _, [remAddr];

// computation not requiring mbarrier synchronization ...

// Only CTA0 threads executing the below wait operation
waitLoop:
mbarrier.try_wait.parity.acquire.cluster.shared::cta.b64  complete, [shMem], 0;
@!complete bra waitLoop;
9.7.13.15.17. Parallel Synchronization and Communication Instructions: mbarrier.pending_count

mbarrier.pending_count

Query the pending arrival count from the opaque mbarrier state.

Syntax

mbarrier.pending_count.b64 count, state;

Description

The pending count can be queried from the opaque mbarrier state using mbarrier.pending_count.

The state operand is a 64-bit register that must be the result of a prior mbarrier.arrive.noComplete or mbarrier.arrive_drop.noComplete instruction. Otherwise, the behavior is undefined.

The destination register count is a 32-bit unsigned integer representing the pending count of the mbarrier object prior to the arrive-on operation from which the state register was obtained.

PTX ISA Notes

Introduced in PTX ISA version 7.0.

Target ISA Notes

Requires sm_80 or higher.

Examples

.reg .b32 %r1;
.reg .b64 state;
.shared .b64 shMem;

mbarrier.arrive.noComplete.b64 state, [shMem], 1;
mbarrier.pending_count.b64 %r1, state;

9.7.13.16. Parallel Synchronization and Communication Instructions: tensormap.cp_fenceproxy

tensormap.cp_fenceproxy

A fused copy and fence operation.

Syntax

tensormap.cp_fenceproxy.cp_qualifiers.fence_qualifiers.sync.aligned  [dst], [src], size;

.cp_qualifiers         = { .global.shared::cta }
.fence_qualifiers      = { .to_proxy::from_proxy.release.scope }
.to_proxy::from_proxy  = { .tensormap::generic }
.scope                 = { .cta, .cluster, .gpu, .sys }

Description

The tensormap.cp_fenceproxy instruction performs the following operations in order:

  • Copies data of the size specified by the size argument, in bytes, from the location specified by the address operand src in shared memory to the location specified by the address operand dst in global memory, in the generic proxy.

  • Establishes a uni-directional proxy release pattern on the ordering from the copy operation to the subsequent access performed in the tensormap proxy on the address dst.

The only valid value of the immediate operand size is 128.

The operands src and dst specify non-generic addresses in the shared::cta and global state spaces, respectively.

The .scope qualifier specifies the set of threads that can directly observe the proxy synchronizing effect of this operation, as described in Memory Consistency Model.

The mandatory .sync qualifier indicates that tensormap.cp_fenceproxy causes the executing thread to wait until all threads in the warp execute the same tensormap.cp_fenceproxy instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same tensormap.cp_fenceproxy instruction. In conditionally executed code, an aligned tensormap.cp_fenceproxy instruction should only be used if it is known that all threads in the warp evaluate the condition identically; otherwise the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.3.

Target ISA Notes

Requires sm_90 or higher.

Examples

// Example: manipulate a tensor-map object and then consume it in cp.async.bulk.tensor

.reg .b64 new_addr;
.global .align 128 .b8 gbl[128];
.shared .align 128 .b8 sMem[128];

cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [sMem], [gMem], 128, [mbar];
...
try_wait_loop:
mbarrier.try_wait.shared.b64 p, [mbar], state;
@!p bra try_wait_loop;

tensormap.replace.tile.global_address.shared.b1024.b64   [sMem], new_addr;
tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned
                                                         [gbl], [sMem], 128;
fence.proxy.tensormap::generic.acquire.gpu [gbl], 128;
cp.async.bulk.tensor.1d.shared::cluster.global.tile  [addr0], [gbl, {tc0}], [mbar0];

9.7.13.17. Parallel Synchronization and Communication Instructions: clusterlaunchcontrol.try_cancel

clusterlaunchcontrol.try_cancel

Requests cancellation of a cluster that has not been launched yet.

Syntax

clusterlaunchcontrol.try_cancel.async{.space}.completion_mechanism{.multicast::cluster::all}.b128 [addr], [mbar];

.completion_mechanism = { .mbarrier::complete_tx::bytes };
.space                = { .shared::cta };

Description

The clusterlaunchcontrol.try_cancel instruction requests atomically canceling the launch of a cluster that has not started running yet. It asynchronously writes an opaque response to shared memory indicating whether the operation succeeded or failed. The completion of the asynchronous operation is tracked using the mbarrier completion mechanism at .cluster scope.

On success, the opaque response contains the ctaid of the first CTA of the canceled cluster; no other successful response from other clusterlaunchcontrol.try_cancel operations from the same grid will contain that id.

The mandatory .async qualifier indicates that the instruction will initiate the cancellation operation asynchronously and control will return to the executing thread before the requested operation is complete.

If the .space qualifier is specified, both operands addr and mbar must be in the .shared::cta state space. Otherwise, generic addressing is assumed for both. The result is undefined if either address operand does not fall within the address window of .shared::cta.

The qualifier .completion_mechanism specifies that upon completion of the asynchronous operation, a complete-tx operation, with the completeCount argument equal to the amount of data stored in bytes, will be performed on the mbarrier object specified by the operand mbar.

The executing thread can then use mbarrier instructions to wait for completion of the asynchronous operation. No other synchronization mechanism described in Memory Consistency Model can be used to guarantee the completion of the asynchronous operation.

The .multicast::cluster::all qualifier indicates that the response is asynchronously written using weak async-proxy writes to the corresponding local shared memory addr of each CTA in the requesting cluster. The completion of the writes to addr of a particular CTA is signaled via a complete-tx operation on the mbarrier object in the shared memory of that CTA.

The behavior of the instruction with the .multicast::cluster::all qualifier is undefined if any CTA in the cluster has exited.

Operand addr specifies the naturally aligned address of the 16-byte wide shared memory location where the request's response is written.

The response of the clusterlaunchcontrol.try_cancel instruction is a 16-byte opaque value made available at the location specified by operand addr. After loading this response into a 16-byte register, the instruction clusterlaunchcontrol.query_cancel can be used to check whether the request was successful and to retrieve the ctaid of the first CTA of the canceled cluster.

If the executing CTA has already observed the completion of a clusterlaunchcontrol.try_cancel instruction as failed, then the behavior of issuing a subsequent clusterlaunchcontrol.try_cancel instruction is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_100 or higher.

Qualifier .multicast::cluster::all is supported on the following architectures:

  • sm_100a

  • sm_101a (renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

Examples

// Assumption: 1D cluster (cluster_ctaid.y/.z == 1)
// with 1 thread per CTA.

// Current Cluster to be processed, initially the
// currently launched cluster:
mov.b32 xctaid, %ctaid.x;
barrier.cluster.arrive.relaxed;
processCluster:
// Wait on all cluster CTAs completing initialization or processing of previous cluster:
barrier.cluster.wait.acquire;
mov.u32  %r0, %tid.x;
setp.u32.eq p0, %r0, 0x0;
@!p0 bra asyncWork;
// All CTAs in the cluster arrive at their local
// SMEM barrier and set 16B handle tx count:
mbarrier.arrive.expect_tx.cluster.relaxed.shared::cta.b64 state, [mbar], 16;
// first CTA in Cluster attempts to cancel a
// not-yet-started cluster:
mov.u32  %r0, %cluster_ctaid.x;
setp.u32.eq p0, %r0, 0x0;
@p0 clusterlaunchcontrol.try_cancel.async.mbarrier::complete_tx::bytes.multicast::cluster::all.b128 [addr], [mbar];
asyncWork:
// ...process xctaid while cancellation request completes
// asynchronously...
// All CTAs in Cluster wait on cancellation responses on their local SMEM:
waitLoop:
// .acquire prevents the load of the handle from overtaking this read:
mbarrier.try_wait.cluster.acquire.shared::cta.b64   complete, [mbar], state;
@!complete bra waitLoop;
// Load response into 16-byte wide register after unblocking
// from mbarrier:
ld.shared.b128 handle, [addr];
// Check whether cancellation succeeded:
clusterlaunchcontrol.query_cancel.is_canceled.pred.b128 p, handle;
@!p ret; // If failed, we are done and exit
// Otherwise, read ctaid of first CTA of cancelled Cluster for next iteration...
@p clusterlaunchcontrol.query_cancel.get_first_ctaid.v4.b32.b128 {xctaid, _, _, _},  handle;
// ...and signal CTA0 that we are done reading from handle:
// Fence generic->async
fence.proxy.async.shared::cta;
barrier.cluster.arrive.relaxed;
bra processCluster;

9.7.13.18. Parallel Synchronization and Communication Instructions: clusterlaunchcontrol.query_cancel

clusterlaunchcontrol.query_cancel

Queries the response of a clusterlaunchcontrol.try_cancel operation.

Syntax

clusterlaunchcontrol.query_cancel.is_canceled.pred.b128 pred, try_cancel_response;
clusterlaunchcontrol.query_cancel.get_first_ctaid.v4.b32.b128 {xdim, ydim, zdim, _},  try_cancel_response;
clusterlaunchcontrol.query_cancel.get_first_ctaid{::dimension}.b32.b128 reg, try_cancel_response;

::dimension = { ::x, ::y, ::z };

Description

The clusterlaunchcontrol.query_cancel instruction can be used to decode the opaque response written by the clusterlaunchcontrol.try_cancel instruction.

After loading the response of a clusterlaunchcontrol.try_cancel instruction into a 16-byte register, it can be further queried using the clusterlaunchcontrol.query_cancel instruction as follows:

clusterlaunchcontrol.query_cancel.is_canceled.pred.b128: If the cluster is canceled successfully, the destination predicate is set to true; otherwise, it is set to false.

If the request succeeded, the instruction clusterlaunchcontrol.query_cancel.get_first_ctaid extracts the CTA id of the first CTA in the canceled cluster. By default, the instruction returns a .v4 vector whose first three elements are the x, y and z coordinates of the first CTA in the canceled cluster. The contents of the 4th element are unspecified. The explicit .get_first_ctaid::x, .get_first_ctaid::y, or .get_first_ctaid::z qualifiers can be used to extract the individual x, y or z coordinate into a 32-bit register.

If the request fails, the behavior of clusterlaunchcontrol.query_cancel.get_first_ctaid is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_100 or higher.

Examples

clusterlaunchcontrol.query_cancel.is_canceled.pred.b128 p, handle;
@p clusterlaunchcontrol.query_cancel.get_first_ctaid.v4.b32.b128 {xdim, ydim, zdim, ignr}, handle;
clusterlaunchcontrol.query_cancel.get_first_ctaid::x.b32.b128 reg0, handle;
clusterlaunchcontrol.query_cancel.get_first_ctaid::y.b32.b128 reg1, handle;
clusterlaunchcontrol.query_cancel.get_first_ctaid::z.b32.b128 reg2, handle;

9.7.14. Warp Level Matrix Multiply-Accumulate Instructions

The matrix multiply and accumulate operation has the following form:

D = A * B + C

where D and C are called accumulators and may refer to the same matrix.

PTX provides two ways to perform matrix multiply-and-accumulate computation:

  • Using wmma instructions:

    • This warp-level computation is performed collectively by all threads in the warp as follows:

      • Load matrices A, B and C from memory into registers using the wmma.load operation. When the operation completes, the destination registers in each thread hold a fragment of the loaded matrix.

      • Perform the matrix multiply and accumulate operation using the wmma.mma operation on the loaded matrices. When the operation completes, the destination registers in each thread hold a fragment of the result matrix returned by the wmma.mma operation.

      • Store result matrix D back to memory using the wmma.store operation. Alternately, result matrix D can also be used as argument C for a subsequent wmma.mma operation.

      The wmma.load and wmma.store instructions implicitly handle the organization of matrix elements when loading the input matrices from memory for the wmma.mma operation and when storing the result back to memory.
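The load/compute/store sequence above can be sketched as follows. This is an illustrative sketch only: the pointers pA, pB, pC, pD, the register names, and the choice of .m16n16k16 with .f16 multiplicands and .f32 accumulators are hypothetical, not prescribed by this section.

```ptx
// Load the A, B and C operands; each thread receives an opaque fragment.
wmma.load.a.sync.aligned.row.m16n16k16.f16 {a0,a1,a2,a3,a4,a5,a6,a7}, [pA];
wmma.load.b.sync.aligned.col.m16n16k16.f16 {b0,b1,b2,b3,b4,b5,b6,b7}, [pB];
wmma.load.c.sync.aligned.row.m16n16k16.f32 {c0,c1,c2,c3,c4,c5,c6,c7}, [pC];

// D = A * B + C, computed collectively by all threads in the warp.
wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32
    {d0,d1,d2,d3,d4,d5,d6,d7},
    {a0,a1,a2,a3,a4,a5,a6,a7},
    {b0,b1,b2,b3,b4,b5,b6,b7},
    {c0,c1,c2,c3,c4,c5,c6,c7};

// Store D; alternately {d0..d7} may serve as the C argument of a further wmma.mma.
wmma.store.d.sync.aligned.row.m16n16k16.f32 [pD], {d0,d1,d2,d3,d4,d5,d6,d7};
```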

  • Using mma instruction:

    • Similar to wmma, mma also requires the computation to be performed collectively by all threads in the warp; however, the distribution of matrix elements across the threads in the warp needs to be done explicitly before invoking the mma operation. The mma instruction supports both dense and sparse matrix A. The sparse variant can be used when A is a structured sparse matrix as described in Sparse matrix storage.

9.7.14.1. Matrix Shape

The matrix multiply and accumulate operations support a limited set of shapes for the operand matrices A, B and C. The shapes of all three matrix operands are collectively described by the tuple MxNxK, where A is an MxK matrix, B is a KxN matrix, while C and D are MxN matrices.

The following matrix shapes are supported for the specified types:

Instruction | Scale | Sparsity | Multiplicand Data-type | Shape | PTX ISA version
----------- | ----- | -------- | ---------------------- | ----- | ---------------
wmma | NA | Dense | Floating-point - .f16 | .m16n16k16, .m8n32k16, and .m32n8k16 | PTX ISA version 6.0
wmma | NA | Dense | Alternate floating-point format - .bf16 | .m16n16k16, .m8n32k16, and .m32n8k16 | PTX ISA version 7.0
wmma | NA | Dense | Alternate floating-point format - .tf32 | .m16n16k8 | PTX ISA version 7.0
wmma | NA | Dense | Integer - .u8/.s8 | .m16n16k16, .m8n32k16, and .m32n8k16 | PTX ISA version 6.3
wmma | NA | Dense | Sub-byte integer - .u4/.s4 | .m8n8k32 | PTX ISA version 6.3 (preview feature)
wmma | NA | Dense | Single-bit - .b1 | .m8n8k128 | PTX ISA version 6.3 (preview feature)
mma | NA | Dense | Floating-point - .f64 | .m8n8k4 | PTX ISA version 7.0
mma | NA | Dense | Floating-point - .f64 | .m16n8k4, .m16n8k8, and .m16n8k16 | PTX ISA version 7.8
mma | NA | Dense | Floating-point - .f16 | .m8n8k4 | PTX ISA version 6.4
mma | NA | Dense | Floating-point - .f16 | .m16n8k8 | PTX ISA version 6.5
mma | NA | Dense | Floating-point - .f16 | .m16n8k16 | PTX ISA version 7.0
mma | NA | Dense | Alternate floating-point format - .bf16 | .m16n8k8 and .m16n8k16 | PTX ISA version 7.0
mma | NA | Dense | Alternate floating-point format - .tf32 | .m16n8k4 and .m16n8k8 | PTX ISA version 7.0
mma | NA | Dense | Integer - .u8/.s8 | .m8n8k16 | PTX ISA version 6.5
mma | NA | Dense | Integer - .u8/.s8 | .m16n8k16 and .m16n8k32 | PTX ISA version 7.0
mma | NA | Dense | Sub-byte integer - .u4/.s4 | .m8n8k32 | PTX ISA version 6.5
mma | NA | Dense | Sub-byte integer - .u4/.s4 | .m16n8k32 and .m16n8k64 | PTX ISA version 7.0
mma | NA | Dense | Single-bit - .b1 | .m8n8k128, .m16n8k128, and .m16n8k256 | PTX ISA version 7.0
mma | NA | Dense | Alternate floating-point format - .e4m3/.e5m2 | .m16n8k32 | PTX ISA version 8.4
mma | NA | Dense | Alternate floating-point format - .e4m3/.e5m2 | .m16n8k16 | PTX ISA version 8.7
mma | NA | Dense | Alternate floating-point format - .e3m2/.e2m3/.e2m1 | .m16n8k32 | PTX ISA version 8.7
mma | Yes | Dense | Alternate floating-point format - .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 X (Scale) .ue8m0 | .m16n8k32 | PTX ISA version 8.7
mma | Yes | Dense | Alternate floating-point format - .e2m1 X (Scale) .ue8m0/.ue4m3 | .m16n8k64 | PTX ISA version 8.7
mma | NA | Sparse | Floating-point - .f16 | .m16n8k16 and .m16n8k32 | PTX ISA version 7.1
mma | NA | Sparse | Alternate floating-point format - .bf16 | .m16n8k16 and .m16n8k32 | PTX ISA version 7.1
mma | NA | Sparse | Alternate floating-point format - .tf32 | .m16n8k8 and .m16n8k16 | PTX ISA version 7.1
mma | NA | Sparse | Integer - .u8/.s8 | .m16n8k32 and .m16n8k64 | PTX ISA version 7.1
mma | NA | Sparse | Sub-byte integer - .u4/.s4 | .m16n8k64 and .m16n8k128 | PTX ISA version 7.1
mma | NA | Sparse | Alternate floating-point format - .e4m3/.e5m2 | .m16n8k64 | PTX ISA version 8.4
mma | NA | Sparse with ordered metadata | Floating-point - .f16 | .m16n8k16 and .m16n8k32 | PTX ISA version 8.5
mma | NA | Sparse with ordered metadata | Alternate floating-point format - .bf16 | .m16n8k16 and .m16n8k32 | PTX ISA version 8.5
mma | NA | Sparse with ordered metadata | Alternate floating-point format - .tf32 | .m16n8k8 and .m16n8k16 | PTX ISA version 8.5
mma | NA | Sparse with ordered metadata | Integer - .u8/.s8 | .m16n8k32 and .m16n8k64 | PTX ISA version 8.5
mma | NA | Sparse with ordered metadata | Sub-byte integer - .u4/.s4 | .m16n8k64 and .m16n8k128 | PTX ISA version 8.5
mma | NA | Sparse with ordered metadata | Alternate floating-point format - .e4m3/.e5m2 | .m16n8k64 | PTX ISA version 8.5
mma | NA | Sparse with ordered metadata | Alternate floating-point format - .e3m2/.e2m3/.e2m1 | .m16n8k64 | PTX ISA version 8.7
mma | Yes | Sparse with ordered metadata | Alternate floating-point format - .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 X (Scale) .ue8m0 | .m16n8k64 | PTX ISA version 8.7
mma | Yes | Sparse with ordered metadata | Alternate floating-point format - .e2m1 X (Scale) .ue8m0/.ue4m3 | .m16n8k128 | PTX ISA version 8.7

9.7.14.2. Matrix Data-types

The matrix multiply and accumulate operation is supported separately on integer, floating-point, sub-byte integer and single-bit data-types. All operands must contain the same basic type kind, i.e., integer or floating-point.

For the floating-point matrix multiply and accumulate operation, different matrix operands may have different precision, as described later.

Data-type | Multiplicands (A or B) | Accumulators (C or D)
--------- | ---------------------- | ---------------------
Integer | .u8, .s8 | .s32
Floating Point | .f16 | .f16, .f32
Alternate floating point | .bf16 | .f32
Alternate floating point | .tf32 | .f32
Alternate floating point | .e4m3 or .e5m2 or .e3m2 or .e2m3 or .e2m1 | .f16, .f32
Alternate floating point with scale | .e4m3 or .e5m2 or .e3m2 or .e2m3 or .e2m1 X (Scale) .ue8m0 | .f32
Alternate floating point with scale | .e2m1 X (Scale) .ue8m0 or .ue4m3 | .f32
Floating Point | .f64 | .f64
Sub-byte integer | both .u4 or both .s4 | .s32
Single-bit integer | .b1 | .s32

9.7.14.3. Block Scaling

The mma instruction with one of the following .kind qualifiers:

  • .kind::mxf8f6f4

  • .kind::mxf4

  • .kind::mxf4nvf4

performs matrix multiplication with block scaling. This operation has the following form: D = (A * scale_A) * (B * scale_B) + C.

For a scale_A matrix of shape M x SFA_N, each row of matrix A is divided into SFA_N chunks, and each chunk of a row is multiplied by the corresponding element (henceforth referred to as SF_A) from the same row of scale_A.

Similarly, for a scale_B matrix of shape SFB_M x N, each column of matrix B is divided into SFB_M chunks, and each chunk of a column is multiplied by the corresponding element (henceforth referred to as SF_B) from the same column of scale_B.
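Element-wise, the block-scaled operation can be written as follows (a sketch derived from the chunking description above; chunkA(k) and chunkB(k) denote the scale-chunk indices into which position k of a row of A and of a column of B fall, respectively):

```latex
D_{ij} \;=\; \sum_{k=0}^{K-1}
  \bigl(A_{ik}\cdot SF\_A[i,\ \mathrm{chunkA}(k)]\bigr)\,
  \bigl(B_{kj}\cdot SF\_B[\mathrm{chunkB}(k),\ j]\bigr)
  \;+\; C_{ij}
```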

Figure 42 shows an example of mma with block scaling of .scale_vec::2X.

Figure 42: mma with block scaling of .scale_vec::2X

The shapes of the scale_A and scale_B matrices depend upon the qualifier .scale_vec_size, as shown in Table 35.

Table 35: Shapes for scale matrices depending upon the .scale_vec_size qualifier

.scale_vec_size | Shape of scale_A | Shape of scale_B
--------------- | ---------------- | ----------------
.scale_vec::1X | M x 1 | 1 x N
.scale_vec::2X | M x 2 | 2 x N
.scale_vec::4X | M x 4 | 4 x N

The valid combinations of the exact element types and the .scale_vec_size are listed in Table 36.

Table 36: Valid combinations of the .scale_vec_size and .kind qualifiers

.kind::* | Element Data Type (.atype and .btype) | Scale Data Type (.stype) | .scale_vec_size
-------- | ------------------------------------- | ------------------------ | ---------------
.kind::mxf8f6f4 | .e4m3, .e5m2, .e3m2, .e2m3, .e2m1 | .ue8m0 | .scale_vec::1X
.kind::mxf4 | .e2m1 | .ue8m0 | .scale_vec::2X
.kind::mxf4nvf4 | .e2m1 | .ue8m0 | .scale_vec::2X
.kind::mxf4nvf4 | .e2m1 | .ue4m3 | .scale_vec::4X

The scale-a-data and scale-b-data arguments provide metadata for the scale_A and scale_B matrices respectively. The tuples {byte-id-a, thread-id-a} and {byte-id-b, thread-id-b} provide the selector information to choose the elements SF_A and SF_B from the corresponding metadata arguments scale-a-data and scale-b-data. The tuple {byte-id-a, thread-id-a} selects the scale matrix element SF_A from scale-a-data. Similarly, the tuple {byte-id-b, thread-id-b} selects the scale matrix element SF_B from scale-b-data.

The components thread-id-a and thread-id-b decide which threads among the quad contribute the SF_A and SF_B values. The following listing describes the impact of the thread selector components thread-id-a and thread-id-b:

  • One thread-pair within the quad, determined by thread-id-a, contributes the SF_A values. The value 0 selects the lower two threads whereas the value 1 selects the upper two threads of the quad. In other words, when thread-id-a is set to 0, the thread-pair satisfying %laneid % 4 == 0 or 1 provides the SF_A. In contrast, when thread-id-a is set to 1, the thread-pair satisfying %laneid % 4 == 2 or 3 provides the SF_A. Refer to Figure 43 for more details.

    Figure 43: Selection of the set of values for SF_A based on thread-id-a

  • One thread within the quad, determined by thread-id-b, contributes the SF_B value. In other words, each thread satisfying %laneid % 4 == thread-id-b provides the SF_B. Refer to Figure 44 for more details.

    Figure 44: Selection of the set of values for SF_B based on thread-id-b

The arguments byte-id-a and byte-id-b select which bytes from scale-a-data and scale-b-data contribute the SF_A and SF_B values. The following listing describes the implications of the .scale_vec_size qualifier on the byte selector components byte-id-a and byte-id-b:

  • When .scale_vec_size is .scale_vec::1X

    • One byte each within scale-a-data and scale-b-data, determined by byte-id-a and byte-id-b respectively, contributes the SF_A and SF_B values.

  • When .scale_vec_size is .scale_vec::2X

    • One byte-pair (two bytes) within scale-a-data and scale-b-data, determined by byte-id-a and byte-id-b, contributes the SF_A and SF_B values. The value 0 selects the lower two bytes whereas the value 2 selects the upper two bytes of the corresponding metadata value.

  • When .scale_vec_size is .scale_vec::4X

    • All four bytes within scale-a-data and scale-b-data contribute the values. Hence, byte-id-a and byte-id-b must be zero.

Refer to Figure 45 for more details.

Figure 45: Selection of the set of values for SF_A or SF_B based on byte-id-a or byte-id-b

Table 37 enumerates the valid values for the various selector components. Any other value results in undefined behavior.

Table 37: Valid values for the various selector components

.scale_vec_size | byte-id-a | thread-id-a | byte-id-b | thread-id-b
--------------- | --------- | ----------- | --------- | -----------
.scale_vec::1X | [0, 1, 2, 3] | [0, 1] | [0, 1, 2, 3] | [0, 1, 2, 3]
.scale_vec::2X | [0, 2] | [0, 1] | [0, 2] | [0, 1, 2, 3]
.scale_vec::4X | 0 | [0, 1] | 0 | [0, 1, 2, 3]

9.7.14.4. Matrix multiply-accumulate operation using wmma instructions

This section describes the warp-level wmma.load, wmma.mma and wmma.store instructions and the organization of the various matrices involved in these instructions.

9.7.14.4.1. Matrix Fragments for WMMA

Each thread in the warp holds a fragment of the matrix. The distribution of fragments loaded by the threads in a warp is unspecified and is target architecture dependent, and hence the identity of the fragment within the matrix is also unspecified and is target architecture dependent. The fragment returned by a wmma operation can be used as an operand for another wmma operation if the shape, layout and element type of the underlying matrix match. Since the fragment layout is architecture dependent, using the fragment returned by a wmma operation in one function as an operand for a wmma operation in a different function may not work as expected if the two functions are linked together but were compiled for different link-compatible SM architectures. Note that passing a wmma fragment to a function having .weak linkage is unsafe, since at link time references to such a function may get resolved to a function in a different compilation module.

Each fragment is a vector expression whose contents are determined as follows. The identity of individual matrix elements in the fragment is unspecified.

Integer fragments

Multiplicands (A or B):

Data-type | Shape | Matrix | Fragment
--------- | ----- | ------ | --------
.u8 or .s8 | .m16n16k16 | A | A vector expression of two .b32 registers, with each register containing four elements from the matrix.
.u8 or .s8 | .m16n16k16 | B | A vector expression of two .b32 registers, with each register containing four elements from the matrix.
.u8 or .s8 | .m8n32k16 | A | A vector expression containing a single .b32 register containing four elements from the matrix.
.u8 or .s8 | .m8n32k16 | B | A vector expression of four .b32 registers, with each register containing four elements from the matrix.
.u8 or .s8 | .m32n8k16 | A | A vector expression of four .b32 registers, with each register containing four elements from the matrix.
.u8 or .s8 | .m32n8k16 | B | A vector expression containing a single .b32 register containing four elements from the matrix.

Accumulators (C or D):

Data-type | Shape | Fragment
--------- | ----- | --------
.s32 | .m16n16k16 | A vector expression of eight .s32 registers.
.s32 | .m8n32k16 | A vector expression of eight .s32 registers.
.s32 | .m32n8k16 | A vector expression of eight .s32 registers.

Floating point fragments

Data-type | Matrix | Fragment
--------- | ------ | --------
.f16 | A or B | A vector expression of eight .f16x2 registers.
.f16 | C or D | A vector expression of four .f16x2 registers.
.f32 | C or D | A vector expression of eight .f32 registers.

Floating point fragments for the .bf16 data format

Multiplicands (A or B):

Data-type | Shape | Matrix | Fragment
--------- | ----- | ------ | --------
.bf16 | .m16n16k16 | A | A vector expression of four .b32 registers, with each register containing two elements from the matrix.
.bf16 | .m16n16k16 | B | A vector expression of four .b32 registers, with each register containing two elements from the matrix.
.bf16 | .m8n32k16 | A | A vector expression of two .b32 registers, with each register containing two elements from the matrix.
.bf16 | .m8n32k16 | B | A vector expression of eight .b32 registers, with each register containing two elements from the matrix.
.bf16 | .m32n8k16 | A | A vector expression of eight .b32 registers, with each register containing two elements from the matrix.
.bf16 | .m32n8k16 | B | A vector expression of two .b32 registers, with each register containing two elements from the matrix.

Accumulators (C or D):

Data-type | Matrix | Fragment
--------- | ------ | --------
.f32 | C or D | A vector expression containing eight .f32 registers.

Floating point fragments for the .tf32 data format

Multiplicands (A or B):

Data-type | Shape | Matrix | Fragment
--------- | ----- | ------ | --------
.tf32 | .m16n16k8 | A | A vector expression of four .b32 registers.
.tf32 | .m16n16k8 | B | A vector expression of four .b32 registers.

Accumulators (C or D):

Data-type | Shape | Matrix | Fragment
--------- | ----- | ------ | --------
.f32 | .m16n16k8 | C or D | A vector expression containing eight .f32 registers.

Double precision floating point fragments

Multiplicands (A or B):

Data-type | Shape | Matrix | Fragment
--------- | ----- | ------ | --------
.f64 | .m8n8k4 | A or B | A vector expression of a single .f64 register.

Accumulators (C or D):

Data-type | Shape | Matrix | Fragment
--------- | ----- | ------ | --------
.f64 | .m8n8k4 | C or D | A vector expression containing a single .f64 register.

Sub-byte integer and single-bit fragments

Multiplicands (A or B):

Data-type | Shape | Fragment
--------- | ----- | --------
.u4 or .s4 | .m8n8k32 | A vector expression containing a single .b32 register, containing eight elements from the matrix.
.b1 | .m8n8k128 | A vector expression containing a single .b32 register, containing 32 elements from the matrix.

Accumulators (C or D):

Data-type | Shape | Fragment
--------- | ----- | --------
.s32 | .m8n8k32 | A vector expression of two .s32 registers.
.s32 | .m8n8k128 | A vector expression of two .s32 registers.

Manipulating fragment contents

The contents of a matrix fragment can be manipulated by reading and writing to individual registers in the fragment, provided the following conditions are satisfied:

  • All matrix elements in the fragment are operated on uniformly across threads, using the same parameters.

  • The order of the matrix elements is not changed.

For example, if each register corresponding to a given matrix is multiplied by a uniform constant value, then the resulting matrix is simply the scaled version of the original matrix.
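Such a uniform scaling can be sketched as follows, assuming (hypothetically) an .f32 accumulator fragment held in registers %f0 through %f7:

```ptx
// Multiply every register of the fragment by the same constant (2.0 here);
// the element order is unchanged, so the result is the uniformly scaled matrix.
mul.f32 %f0, %f0, 0f40000000;
mul.f32 %f1, %f1, 0f40000000;
// ... likewise for %f2 through %f7.
```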

Note that type conversion between .f16 and .f32 accumulator fragments is not supported in either direction. The result is undefined even if the order of elements in the fragment remains unchanged.

9.7.14.4.2. Matrix Storage for WMMA

Each matrix can be stored in memory with a row-major or column-major layout. In a row-major format, consecutive elements of each row are stored in contiguous memory locations, and the row is called the leading dimension of the matrix. In a column-major format, consecutive elements of each column are stored in contiguous memory locations, and the column is called the leading dimension of the matrix.

Consecutive instances of the leading dimension (rows or columns) need not be stored contiguously in memory. The wmma.load and wmma.store operations accept an optional argument stride that specifies the offset from the beginning of each row (or column) to the next, in terms of matrix elements (and not bytes). For example, the matrix being accessed by a wmma operation may be a submatrix of a larger matrix stored in memory. This allows the programmer to compose a multiply-and-accumulate operation on matrices that are larger than the shapes supported by the wmma operation.
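For instance, loading a 16x16x16 A operand from a row-major .f16 matrix that has 64 elements per row can be sketched as follows (the pointer pA and the register names are hypothetical; pA points at the tile origin within the larger matrix):

```ptx
// stride is given in matrix elements (64 per row here), not in bytes.
wmma.load.a.sync.aligned.row.m16n16k16.f16 {a0,a1,a2,a3,a4,a5,a6,a7}, [pA], 64;
```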

Address Alignment

The starting address of each instance of the leading dimension (row or column) must be aligned with the size of the corresponding fragment in bytes. Note that the starting address is determined by the base pointer and the optional stride.

Consider the following instruction as an example:

wmma.load.a.sync.aligned.row.m16n16k16.f16 {x0,...,x7}, [p], s;

  • Fragment size in bytes = 32 (eight elements of type .f16x2)

  • Actual stride in bytes = 2 * s (since the stride is specified in terms of .f16 elements, not bytes)

  • For each row of this matrix to be aligned at the fragment size, the following must be true:

    1. p is a multiple of 32.

    2. 2 * s is a multiple of 32.

Default value for stride

The default value of the stride is the size of the leading dimension of the matrix. For example, for an MxK matrix, the stride is K for a row-major layout and M for a column-major layout. In particular, the default strides for the supported matrix shapes are as follows:

Shape | A (row) | A (column) | B (row) | B (column) | Accumulator (row) | Accumulator (column)
----- | ------- | ---------- | ------- | ---------- | ----------------- | --------------------
16x16x16 | 16 | 16 | 16 | 16 | 16 | 16
8x32x16 | 16 | 8 | 32 | 16 | 32 | 8
32x8x16 | 16 | 32 | 8 | 16 | 8 | 32
8x8x32 | 32 | 8 | 8 | 32 | 8 | 8
8x8x128 | 128 | 8 | 8 | 128 | 8 | 8
16x16x8 | 8 | 16 | 16 | 8 | 16 | 16
8x8x4 | 4 | 8 | 8 | 4 | 8 | 8
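Since the default stride is the size of the leading dimension, omitting the stride argument is equivalent to passing it explicitly. As a sketch (pA and the register names are hypothetical), for a row-major 16x16x16 A operand the default stride is 16, so the following two loads are equivalent:

```ptx
wmma.load.a.sync.aligned.row.m16n16k16.f16 {a0,a1,a2,a3,a4,a5,a6,a7}, [pA];
wmma.load.a.sync.aligned.row.m16n16k16.f16 {a0,a1,a2,a3,a4,a5,a6,a7}, [pA], 16;
```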

9.7.14.4.3. Warp-level Matrix Load Instruction: wmma.load

wmma.load

Collectively load a matrix from memory for WMMA.

Syntax

Floating point format .f16 loads:

wmma.load.a.sync.aligned.layout.shape{.ss}.atype r, [p] {, stride};
wmma.load.b.sync.aligned.layout.shape{.ss}.btype r, [p] {, stride};
wmma.load.c.sync.aligned.layout.shape{.ss}.ctype r, [p] {, stride};

.layout = {.row, .col};
.shape  = {.m16n16k16, .m8n32k16, .m32n8k16};
.ss     = {.global, .shared{::cta}};
.atype  = {.f16, .s8, .u8};
.btype  = {.f16, .s8, .u8};
.ctype  = {.f16, .f32, .s32};

Alternate floating point format .bf16 loads:

wmma.load.a.sync.aligned.layout.shape{.ss}.atype r, [p] {, stride}
wmma.load.b.sync.aligned.layout.shape{.ss}.btype r, [p] {, stride}
wmma.load.c.sync.aligned.layout.shape{.ss}.ctype r, [p] {, stride}

.layout = {.row, .col};
.shape  = {.m16n16k16, .m8n32k16, .m32n8k16};
.ss     = {.global, .shared{::cta}};
.atype  = {.bf16};
.btype  = {.bf16};
.ctype  = {.f32};

Alternate floating point format .tf32 loads:

wmma.load.a.sync.aligned.layout.shape{.ss}.atype r, [p] {, stride}
wmma.load.b.sync.aligned.layout.shape{.ss}.btype r, [p] {, stride}
wmma.load.c.sync.aligned.layout.shape{.ss}.ctype r, [p] {, stride}

.layout = {.row, .col};
.shape  = {.m16n16k8};
.ss     = {.global, .shared{::cta}};
.atype  = {.tf32};
.btype  = {.tf32};
.ctype  = {.f32};

Double precision floating point .f64 loads:

wmma.load.a.sync.aligned.layout.shape{.ss}.atype r, [p] {, stride}
wmma.load.b.sync.aligned.layout.shape{.ss}.btype r, [p] {, stride}
wmma.load.c.sync.aligned.layout.shape{.ss}.ctype r, [p] {, stride}

.layout = {.row, .col};
.shape  = {.m8n8k4};
.ss     = {.global, .shared{::cta}};
.atype  = {.f64};
.btype  = {.f64};
.ctype  = {.f64};

Sub-byte loads:

wmma.load.a.sync.aligned.row.shape{.ss}.atype r, [p] {, stride}
wmma.load.b.sync.aligned.col.shape{.ss}.btype r, [p] {, stride}
wmma.load.c.sync.aligned.layout.shape{.ss}.ctype r, [p] {, stride}

.layout = {.row, .col};
.shape  = {.m8n8k32};
.ss     = {.global, .shared{::cta}};
.atype  = {.s4, .u4};
.btype  = {.s4, .u4};
.ctype  = {.s32};

Single-bit loads:

wmma.load.a.sync.aligned.row.shape{.ss}.atype r, [p] {, stride}
wmma.load.b.sync.aligned.col.shape{.ss}.btype r, [p] {, stride}
wmma.load.c.sync.aligned.layout.shape{.ss}.ctype r, [p] {, stride}

.layout = {.row, .col};
.shape  = {.m8n8k128};
.ss     = {.global, .shared{::cta}};
.atype  = {.b1};
.btype  = {.b1};
.ctype  = {.s32};

Description

Collectively load a matrix across all threads in a warp from the location indicated by address operand p in the specified state space into destination register r.

If no state space is given, perform the memory accesses using Generic Addressing. The wmma.load operation may be used only with the .global and .shared spaces and with generic addressing, where the address points to the .global or .shared space.

The mutually exclusive qualifiers .a, .b and .c indicate whether matrix A, B or C is being loaded for the wmma computation.

The destination operand r is a brace-enclosed vector expression that can hold the fragment returned by the load operation, as described in Matrix Fragments for WMMA.

The .shape qualifier indicates the dimensions of all the matrix arguments involved in the intended wmma computation.

The .layout qualifier indicates whether the matrix to be loaded is stored in row-major or column-major format.

stride is an optional 32-bit integer operand that provides an offset in terms of matrix elements between the start of consecutive instances of the leading dimension (rows or columns). The default value of stride is described in Matrix Storage for WMMA and must be specified if the actual value is larger than the default. For example, if the matrix is a sub-matrix of a larger matrix, then the value of stride is the leading dimension of the larger matrix. Specifying a value lower than the default results in undefined behavior.
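To make the stride semantics concrete, here is a hedged sketch of the element addressing it implies for a row-major sub-matrix; the helper and the byte arithmetic are illustrative, not PTX semantics:

```python
def element_addr(p, row, col, stride, elem_bytes):
    """Byte address of element (row, col) of a row-major matrix whose
    consecutive rows start `stride` elements apart -- the leading
    dimension of the enclosing matrix when loading a sub-matrix."""
    return p + (row * stride + col) * elem_bytes

# A 16x16 .f16 tile embedded in a larger 64-column matrix: stride = 64.
print(element_addr(p=0, row=1, col=0, stride=64, elem_bytes=2))   # 128
```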

The required alignment for address p and stride is described in Matrix Storage for WMMA.

The mandatory .sync qualifier indicates that wmma.load causes the executing thread to wait until all threads in the warp execute the same wmma.load instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same wmma.load instruction. In conditionally executed code, a wmma.load instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.

The behavior of wmma.load is undefined if all threads do not use the same qualifiers and the same values of p and stride, or if any thread in the warp has exited.

wmma.load is treated as a weak memory operation in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 6.0.

.m8n32k16 and .m32n8k16 introduced in PTX ISA version 6.1.

Integer, sub-byte integer and single-bit wmma introduced in PTX ISA version 6.3.

.m8n8k4 and .m16n16k8 on wmma introduced in PTX ISA version 7.0.

Double precision and alternate floating point precision wmma introduced in PTX ISA version 7.0.

Modifier .aligned is required from PTX ISA version 6.3 onwards, and considered implicit in PTX ISA versions less than 6.3.

Support for the ::cta sub-qualifier introduced in PTX ISA version 7.8.

Preview Feature:

Sub-byte wmma and single-bit wmma are preview features in PTX ISA version 6.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.

Target ISA Notes

Floating point wmma requires sm_70 or higher.

Integer wmma requires sm_72 or higher.

Sub-byte and single-bit wmma require sm_75 or higher.

Double precision and alternate floating point precision wmma require sm_80 or higher.

Examples

// Load elements from f16 row-major matrix B
.reg .b32 x<8>;
wmma.load.b.sync.aligned.m16n16k16.row.f16 {x0,x1,x2,x3,x4,x5,x6,x7}, [ptr];
// Now use {x0, ..., x7} for the actual wmma.mma

// Load elements from f32 column-major matrix C and scale the values:
.reg .b32 x<8>;
wmma.load.c.sync.aligned.m16n16k16.col.f32
        {x0,x1,x2,x3,x4,x5,x6,x7}, [ptr];
mul.f32 x0, x0, 0.1;
// repeat for all registers x<8>;
...
mul.f32 x7, x7, 0.1;
// Now use {x0, ..., x7} for the actual wmma.mma

// Load elements from integer matrix A:
.reg .b32 x<4>;
// destination registers x<4> contain four packed .u8 values each
wmma.load.a.sync.aligned.m32n8k16.row.u8 {x0,x1,x2,x3}, [ptr];

// Load elements from sub-byte integer matrix A:
.reg .b32 x0;
// destination register x0 contains eight packed .s4 values
wmma.load.a.sync.aligned.m8n8k32.row.s4 {x0}, [ptr];

// Load elements from .bf16 matrix A:
.reg .b32 x<4>;
wmma.load.a.sync.aligned.m16n16k16.row.bf16
        {x0,x1,x2,x3}, [ptr];

// Load elements from .tf32 matrix A:
.reg .b32 x<4>;
wmma.load.a.sync.aligned.m16n16k8.row.tf32
        {x0,x1,x2,x3}, [ptr];

// Load elements from .f64 matrix A:
.reg .f64 x<4>;
wmma.load.a.sync.aligned.m8n8k4.row.f64
        {x0}, [ptr];
9.7.14.4.4. Warp-level Matrix Store Instruction: wmma.store

wmma.store

Collectively store a matrix into memory for WMMA

Syntax

wmma.store.d.sync.aligned.layout.shape{.ss}.type [p], r {, stride};

.layout = {.row, .col};
.shape  = {.m16n16k16, .m8n32k16, .m32n8k16};
.ss     = {.global, .shared{::cta}};
.type   = {.f16, .f32, .s32};

wmma.store.d.sync.aligned.layout.shape{.ss}.type [p], r {, stride}

.layout = {.row, .col};
.shape  = {.m8n8k32, .m8n8k128};
.ss     = {.global, .shared{::cta}};
.type   = {.s32};

wmma.store.d.sync.aligned.layout.shape{.ss}.type [p], r {, stride}

.layout = {.row, .col};
.shape  = {.m16n16k8};
.ss     = {.global, .shared{::cta}};
.type   = {.f32};

wmma.store.d.sync.aligned.layout.shape{.ss}.type [p], r {, stride}

.layout = {.row, .col};
.shape  = {.m8n8k4};
.ss     = {.global, .shared{::cta}};
.type   = {.f64};

Description

Collectively store a matrix across all threads in a warp at the location indicated by address operand p in the specified state space from source register r.

If no state space is given, perform the memory accesses using Generic Addressing. The wmma.store operation may be used only with the .global and .shared spaces and with generic addressing, where the address points to the .global or .shared space.

The source operand r is a brace-enclosed vector expression that matches the shape of the fragment expected by the store operation, as described in Matrix Fragments for WMMA.

The .shape qualifier indicates the dimensions of all the matrix arguments involved in the intended wmma computation. It must match the .shape qualifier specified on the wmma.mma instruction that produced the D matrix being stored.

The .layout qualifier indicates whether the matrix to be stored is in row-major or column-major format.

stride is an optional 32-bit integer operand that provides an offset in terms of matrix elements between the start of consecutive instances of the leading dimension (rows or columns). The default value of stride is described in Matrix Storage for WMMA and must be specified if the actual value is larger than the default. For example, if the matrix is a sub-matrix of a larger matrix, then the value of stride is the leading dimension of the larger matrix. Specifying a value lower than the default results in undefined behavior.

The required alignment for address p and stride is described in Matrix Storage for WMMA.

The mandatory .sync qualifier indicates that wmma.store causes the executing thread to wait until all threads in the warp execute the same wmma.store instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same wmma.store instruction. In conditionally executed code, a wmma.store instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.

The behavior of wmma.store is undefined if all threads do not use the same qualifiers and the same values of p and stride, or if any thread in the warp has exited.

wmma.store is treated as a weak memory operation in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 6.0.

.m8n32k16 and .m32n8k16 introduced in PTX ISA version 6.1.

Integer, sub-byte integer and single-bit wmma introduced in PTX ISA version 6.3.

.m16n16k8 introduced in PTX ISA version 7.0.

Double precision wmma introduced in PTX ISA version 7.0.

Modifier .aligned is required from PTX ISA version 6.3 onwards, and considered implicit in PTX ISA versions less than 6.3.

Support for the ::cta sub-qualifier introduced in PTX ISA version 7.8.

Preview Feature:

Sub-byte wmma and single-bit wmma are preview features in PTX ISA version 6.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.

Target ISA Notes

Floating point wmma requires sm_70 or higher.

Integer wmma requires sm_72 or higher.

Sub-byte and single-bit wmma require sm_75 or higher.

Double precision wmma and shape .m16n16k8 require sm_80 or higher.

Examples

// Storing f32 elements computed by a wmma.mma
.reg .b32 d<8>;
wmma.mma.sync.m16n16k16.row.col.f32.f32
        {d0, d1, d2, d3, d4, d5, d6, d7}, ...;
wmma.store.d.sync.m16n16k16.row.f32
        [ptr], {d0, d1, d2, d3, d4, d5, d6, d7};

// Store s32 accumulator for m16n16k16 shape:
.reg .b32 d<8>;
wmma.store.d.sync.aligned.m16n16k16.row.s32
        [ptr], {d0, d1, d2, d3, d4, d5, d6, d7};

// Store s32 accumulator for m8n8k128 shape:
.reg .b32 d<2>;
wmma.store.d.sync.aligned.m8n8k128.row.s32
        [ptr], {d0, d1};

// Store f64 accumulator for m8n8k4 shape:
.reg .f64 d<2>;
wmma.store.d.sync.aligned.m8n8k4.row.f64
        [ptr], {d0, d1};
9.7.14.4.5. Warp-level Matrix Multiply-and-Accumulate Instruction: wmma.mma

wmma.mma

Perform a single matrix multiply-and-accumulate operation across a warp

Syntax

// Floating point (.f16 multiplicands) wmma.mma
wmma.mma.sync.aligned.alayout.blayout.shape.dtype.ctype d, a, b, c;

// Integer (.u8/.s8 multiplicands) wmma.mma
wmma.mma.sync.aligned.alayout.blayout.shape.s32.atype.btype.s32{.satfinite} d, a, b, c;

.alayout = {.row, .col};
.blayout = {.row, .col};
.shape   = {.m16n16k16, .m8n32k16, .m32n8k16};
.dtype   = {.f16, .f32};
.atype   = {.s8, .u8};
.btype   = {.s8, .u8};
.ctype   = {.f16, .f32};

Floating point format .bf16 wmma.mma:

wmma.mma.sync.aligned.alayout.blayout.shape.f32.atype.btype.f32 d, a, b, c;

.alayout = {.row, .col};
.blayout = {.row, .col};
.shape   = {.m16n16k16, .m8n32k16, .m32n8k16};
.atype   = {.bf16};
.btype   = {.bf16};

Floating point format .tf32 wmma.mma:

wmma.mma.sync.aligned.alayout.blayout.shape.f32.atype.btype.f32 d, a, b, c;

.alayout = {.row, .col};
.blayout = {.row, .col};
.shape   = {.m16n16k8};
.atype   = {.tf32};
.btype   = {.tf32};

Double precision floating point wmma.mma:

wmma.mma.sync.aligned.alayout.blayout.shape{.rnd}.f64.f64.f64.f64 d, a, b, c;

.alayout = {.row, .col};
.blayout = {.row, .col};
.shape   = {.m8n8k4};
.rnd     = {.rn, .rz, .rm, .rp};

Sub-byte (.u4/.s4 multiplicands) wmma.mma:

wmma.mma.sync.aligned.row.col.shape.s32.atype.btype.s32{.satfinite} d, a, b, c;

.shape  = {.m8n8k32};
.atype  = {.s4, .u4};
.btype  = {.s4, .u4};

Single-bit (.b1 multiplicands) wmma.mma:

wmma.mma.op.popc.sync.aligned.row.col.shape.s32.atype.btype.s32 d, a, b, c;

.shape  = {.m8n8k128};
.atype  = {.b1};
.btype  = {.b1};
.op     = {.xor, .and};

Description

Perform a warp-level matrix multiply-and-accumulate computation D = A * B + C using matrices A, B and C loaded in registers a, b and c respectively, and store the result matrix in register d. The register arguments a, b, c and d hold unspecified fragments of the corresponding matrices as described in Matrix Fragments for WMMA.

The qualifiers .dtype, .atype, .btype and .ctype indicate the data-type of the elements in the matrices D, A, B and C respectively.

For wmma.mma without explicit .atype and .btype, .atype and .btype are implicitly set to .f16.

For integer wmma, .ctype and .dtype must be specified as .s32. Also, the values for .atype and .btype must be the same, i.e., either both are .s8 or both are .u8.

For sub-byte and single-bit wmma, .ctype and .dtype must be specified as .s32. Also, the values for .atype and .btype must be the same; i.e., either both are .s4, both are .u4, or both are .b1.

For single-bit wmma, multiplication is replaced by a sequence of logical operations; specifically, wmma.xor.popc and wmma.and.popc compute the XOR and AND, respectively, of a 128-bit row of A with a 128-bit column of B, then count the number of set bits in the result (popc). This result is added to the corresponding element of C and written into D.
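A scalar sketch of the single-bit dot product just described, assuming each 128-bit row and column is packed into a Python integer (the helper name is illustrative):

```python
def b1_mma_element(a_row, b_col, c, op="xor"):
    """One element of single-bit wmma: popcount of (a_row OP b_col)
    added to the accumulator c. a_row and b_col are bit-packed ints."""
    combined = (a_row ^ b_col) if op == "xor" else (a_row & b_col)
    return c + bin(combined).count("1")

a = 0b1011
b = 0b0110
print(b1_mma_element(a, b, c=10, op="xor"))   # 10 + popc(0b1101) = 13
print(b1_mma_element(a, b, c=10, op="and"))   # 10 + popc(0b0010) = 11
```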

The qualifiers .alayout and .blayout must match the layout specified on the wmma.load instructions that produce the contents of operands a and b respectively. Similarly, the qualifiers .atype, .btype and .ctype must match the corresponding qualifiers on the wmma.load instructions that produce the contents of operands a, b and c respectively.

The .shape qualifier must match the .shape qualifier used on the wmma.load instructions that produce the contents of all three input operands a, b and c respectively.

The destination operand d is a brace-enclosed vector expression that matches the .shape of the fragment computed by the wmma.mma instruction.

Saturation at the output:

The optional qualifier .satfinite indicates that the final values in the destination register are saturated as follows:

  • The output is clamped to the minimum or maximum 32-bit signed integer value. Otherwise, if the accumulation would overflow, the value wraps.
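A sketch of the two behaviors for a 32-bit signed accumulator, with the int32 bounds spelled out (plain Python; the function is illustrative):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def accumulate(c, x, satfinite):
    """Add x to accumulator c, either clamping (.satfinite) or
    wrapping like two's-complement int32 arithmetic."""
    total = c + x
    if satfinite:
        return max(INT32_MIN, min(INT32_MAX, total))
    return (total + 2**31) % 2**32 - 2**31   # wrap into [-2^31, 2^31)

print(accumulate(INT32_MAX, 1, satfinite=True))    # 2147483647 (clamped)
print(accumulate(INT32_MAX, 1, satfinite=False))   # -2147483648 (wrapped)
```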

Precision and rounding for .f16 floating point operations:

Element-wise multiplication of matrices A and B is performed with at least single precision. When .ctype or .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When both .ctype and .dtype are specified as .f16, the accumulation is performed with at least half precision.

The accumulation order, rounding and handling of subnormal inputs are unspecified.

Precision and rounding for .bf16, .tf32 floating point operations:

Element-wise multiplication of matrices A and B is performed with the specified precision. Accumulation of the intermediate values is performed with at least single precision.

The accumulation order, rounding and handling of subnormal inputs are unspecified.

Rounding modifiers on double precision wmma.mma (default is .rn):

.rn

mantissa LSB rounds to nearest even

.rz

mantissa LSB rounds towards zero

.rm

mantissa LSB rounds towards negative infinity

.rp

mantissa LSB rounds towards positive infinity

The mandatory .sync qualifier indicates that wmma.mma causes the executing thread to wait until all threads in the warp execute the same wmma.mma instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same wmma.mma instruction. In conditionally executed code, a wmma.mma instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.

The behavior of wmma.mma is undefined if all threads in the same warp do not use the same qualifiers, or if any thread in the warp has exited.

PTX ISA Notes

Introduced in PTX ISA version 6.0.

.m8n32k16 and .m32n8k16 introduced in PTX ISA version 6.1.

Integer, sub-byte integer and single-bit wmma introduced in PTX ISA version 6.3.

Double precision and alternate floating point precision wmma introduced in PTX ISA version 7.0.

Support for the .and operation in single-bit wmma introduced in PTX ISA version 7.1.

Modifier .aligned is required from PTX ISA version 6.3 onwards, and considered implicit in PTX ISA versions less than 6.3.

Support for .satfinite on floating point wmma.mma is deprecated in PTX ISA version 6.4 and is removed from PTX ISA version 6.5.

Preview Feature:

Sub-byte wmma and single-bit wmma are preview features in PTX ISA. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.

Target ISA Notes

Floating point wmma requires sm_70 or higher.

Integer wmma requires sm_72 or higher.

Sub-byte and single-bit wmma require sm_75 or higher.

Double precision and alternate floating point precision wmma require sm_80 or higher.

The .and operation in single-bit wmma requires sm_80 or higher.

Examples

.global .align 32 .f16 A[256], B[256];
.global .align 32 .f32 C[256], D[256];
.reg .b32 a<8>, b<8>, c<8>, d<8>;

wmma.load.a.sync.aligned.m16n16k16.global.row.f16
        {a0, a1, a2, a3, a4, a5, a6, a7}, [A];
wmma.load.b.sync.aligned.m16n16k16.global.col.f16
        {b0, b1, b2, b3, b4, b5, b6, b7}, [B];
wmma.load.c.sync.aligned.m16n16k16.global.row.f32
        {c0, c1, c2, c3, c4, c5, c6, c7}, [C];
wmma.mma.sync.aligned.m16n16k16.row.col.f32.f32
        {d0, d1, d2, d3, d4, d5, d6, d7},
        {a0, a1, a2, a3, a4, a5, a6, a7},
        {b0, b1, b2, b3, b4, b5, b6, b7},
        {c0, c1, c2, c3, c4, c5, c6, c7};
wmma.store.d.sync.aligned.m16n16k16.global.col.f32
        [D], {d0, d1, d2, d3, d4, d5, d6, d7};

// Compute an integer WMMA:
.reg .b32 a, b<4>;
.reg .b32 c<8>, d<8>;
wmma.mma.sync.aligned.m8n32k16.row.col.s32.s8.s8.s32
        {d0, d1, d2, d3, d4, d5, d6, d7},
        {a}, {b0, b1, b2, b3},
        {c0, c1, c2, c3, c4, c5, c6, c7};

// Compute sub-byte WMMA:
.reg .b32 a, b, c<2>, d<2>;
wmma.mma.sync.aligned.m8n8k32.row.col.s32.s4.s4.s32
        {d0, d1}, {a}, {b}, {c0, c1};

// Compute single-bit type WMMA:
.reg .b32 a, b, c<2>, d<2>;
wmma.mma.xor.popc.sync.aligned.m8n8k128.row.col.s32.b1.b1.s32
        {d0, d1}, {a}, {b}, {c0, c1};

// Compute double precision wmma:
.reg .f64 a, b, c<2>, d<2>;
wmma.mma.sync.aligned.m8n8k4.row.col.f64.f64.f64.f64
        {d0, d1}, {a}, {b}, {c0, c1};

// Compute alternate floating point precision wmma:
.reg .b32 a<4>, b<4>, c<8>, d<8>;
wmma.mma.sync.aligned.m16n16k8.row.col.f32.tf32.tf32.f32
        {d0, d1, d2, d3, d4, d5, d6, d7},
        {a0, a1, a2, a3}, {b0, b1, b2, b3},
        {c0, c1, c2, c3, c4, c5, c6, c7};

9.7.14.5. Matrix multiply-accumulate operation using mma instruction

This section describes the warp-level mma, ldmatrix, stmatrix, and movmatrix instructions and the organization of the various matrices involved in these instructions.

9.7.14.5.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type

A warp executing mma.m8n8k4 with the .f16 floating point type will compute 4 MMA operations of shape .m8n8k4.

Elements of the 4 matrices need to be distributed across the threads in a warp. The following table shows the distribution of matrices for the MMA operations.

MMA Computation     Threads participating in MMA computation
MMA computation 1   Threads with %laneid 0-3   (low group) and 16-19 (high group)
MMA computation 2   Threads with %laneid 4-7   (low group) and 20-23 (high group)
MMA computation 3   Threads with %laneid 8-11  (low group) and 24-27 (high group)
MMA computation 4   Threads with %laneid 12-15 (low group) and 28-31 (high group)
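The grouping above is regular enough to compute directly from the lane ID; a hedged sketch (the function is illustrative, not part of PTX):

```python
def mma_computation(laneid):
    """Which of the four .m8n8k4 MMA computations a lane participates in
    (1-based), following the low-group/high-group table above."""
    assert 0 <= laneid < 32
    return (laneid % 16) // 4 + 1

print([mma_computation(l) for l in (0, 5, 11, 15, 16, 21, 27, 31)])
# [1, 2, 3, 4, 1, 2, 3, 4]
```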

For each of the individual MMA computations shown above, each participating thread holds a fragment of the matrix for performing the mma operation, as follows:

  • Multiplicand A:

    .atype:                  .f16
    Fragment:                A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix A.
    Elements (low to high):  a0, a1, a2, a3

    The layout of the fragments held by different threads is shown below:

    • Fragment layout for Row Major matrix A is shown in Figure 46.

      _images/mma-884-A-row-f16.png

      Figure 46 MMA .m8n8k4 fragment layout for row-major matrix A with .f16 type

      The row and column of a matrix fragment can be computed as:

      row = %laneid % 4          if %laneid < 16
            (%laneid % 4) + 4    otherwise

      col = i    for ai   where i = {0,..,3}
    • Fragment layout for Column Major matrix A is shown in Figure 47.

      _images/mma-884-A-col-f16.png

      Figure 47 MMA .m8n8k4 fragment layout for column-major matrix A with .f16 type

      The row and column of a matrix fragment can be computed as:

      row = i % 4          for ai   where i = {0,..,3}   if %laneid < 16
            (i % 4) + 4    for ai   where i = {0,..,3}   otherwise

      col = %laneid % 4
  • Multiplicand B:

    .btype:                  .f16
    Fragment:                A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix B.
    Elements (low to high):  b0, b1, b2, b3

    The layout of the fragments held by different threads is shown below:

    • Fragment layout for Row Major matrix B is shown in Figure 48.

      _images/mma-884-B-row-f16.png

      Figure 48 MMA .m8n8k4 fragment layout for row-major matrix B with .f16 type

      The row and column of a matrix fragment can be computed as:

      row = %laneid % 4

      col = i        for bi   where i = {0,..,3}   if %laneid < 16
            i + 4    for bi   where i = {0,..,3}   otherwise
    • Fragment layout for Column Major matrix B is shown in Figure 49.

      _images/mma-884-B-col-f16.png

      Figure 49 MMA .m8n8k4 fragment layout for column-major matrix B with .f16 type

      The row and column of a matrix fragment can be computed as:

      row = i    for bi   where i = {0,..,3}

      col = %laneid % 4          if %laneid < 16
            (%laneid % 4) + 4    otherwise
  • Accumulators C (or D):

    .ctype / .dtype:         .f16
    Fragment:                A vector expression containing four .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D).
    Elements (low to high):  c0, c1, c2, c3, c4, c5, c6, c7

    .ctype / .dtype:         .f32
    Fragment:                A vector expression of eight .f32 registers.
    Elements (low to high):  c0, c1, c2, c3, c4, c5, c6, c7

    The layout of the fragments held by different threads is shown below:

    • Fragment layout for accumulator matrix when .ctype is .f16 is shown in Figure 50.

      _images/mma-884-C-f16.png

      Figure 50 MMA .m8n8k4 fragment layout for matrix C/D with .ctype = .f16

      The row and column of a matrix fragment can be computed as:

      row = %laneid % 4          if %laneid < 16
            (%laneid % 4) + 4    otherwise

      col = i    for ci   where i = {0,..,7}
    • Fragment layout for accumulator matrix when .ctype is .f32 is shown in Figure 51 and Figure 52.

      _images/mma-884-C-f32-1.png

      Figure 51 MMA .m8n8k4 computation 1 and 2 fragment layout for matrix C/D with .ctype = .f32

      _images/mma-884-C-f32-2.png

      Figure 52 MMA .m8n8k4 computation 3 and 4 fragment layout for matrix C/D with .ctype = .f32

      The row and column of a matrix fragment can be computed as:

      row = X        if %laneid < 16
            X + 4    otherwise

            where X = (%laneid & 0b1) + (i & 0b10)    for ci   where i = {0,..,7}

      col = (i & 0b100) + (%laneid & 0b10) + (i & 0b1)    for ci   where i = {0,..,7}
9.7.14.5.2. Matrix Fragments for mma.m8n8k4 with .f64 floating point type

A warp executing mma.m8n8k4 with the .f64 floating point type will compute an MMA operation of shape .m8n8k4.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype:                  .f64
    Fragment:                A vector expression containing a single .f64 register, containing a single .f64 element from the matrix A.
    Elements (low to high):  a0

    The layout of the fragments held by different threads is shown in Figure 53.

    _images/mma-884-A-f64.png

    Figure 53 MMA .m8n8k4 fragment layout for matrix A with .f64 type

    The row and column of a matrix fragment can be computed as:

    row = %laneid >> 2
    col = %laneid % 4
  • Multiplicand B:

    .btype:                  .f64
    Fragment:                A vector expression containing a single .f64 register, containing a single .f64 element from the matrix B.
    Elements (low to high):  b0

    The layout of the fragments held by different threads is shown in Figure 54.

    _images/mma-884-B-f64.png

    Figure 54 MMA .m8n8k4 fragment layout for matrix B with .f64 type

    The row and column of a matrix fragment can be computed as:

    row = %laneid % 4
    col = %laneid >> 2
  • Accumulators (C or D):

    .ctype / .dtype:         .f64
    Fragment:                A vector expression containing two .f64 registers, containing two .f64 elements from the matrix C (or D).
    Elements (low to high):  c0, c1

    The layout of the fragments held by different threads is shown in Figure 55.

    _images/mma-884-C-f64.png

    Figure 55 MMA .m8n8k4 fragment layout for accumulator matrix C/D with .f64 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID
    col = (threadID_in_group * 2) + (i & 0x1)    for ci   where i = {0, 1}
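The accumulator mapping can be written out directly; a small sketch mirroring the formulas above (the helper name is illustrative):

```python
def f64_acc_cell(laneid, i):
    """(row, col) of accumulator element ci for the m8n8k4 .f64 layout."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    return group_id, thread_id_in_group * 2 + (i & 0x1)

# 32 lanes x 2 elements tile the 8x8 accumulator exactly once:
cells = {f64_acc_cell(lane, i) for lane in range(32) for i in range(2)}
print(len(cells))   # 64
```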
9.7.14.5.3. Matrix Fragments for mma.m8n8k16

A warp executing mma.m8n8k16 will compute an MMA operation of shape .m8n8k16.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype:                  .s8 / .u8
    Fragment:                A vector expression containing a single .b32 register, containing four .s8 or .u8 elements from the matrix A.
    Elements (low to high):  a0, a1, a2, a3

    The layout of the fragments held by different threads is shown in Figure 56.

    _images/mma-8816-A-i8.png

    Figure 56 MMA .m8n8k16 fragment layout for matrix A with .u8/.s8 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID
    col = (threadID_in_group * 4) + i    for ai   where i = {0,..,3}
  • Multiplicand B:

    .btype:                  .s8 / .u8
    Fragment:                A vector expression containing a single .b32 register, containing four .s8 or .u8 elements from the matrix B.
    Elements (low to high):  b0, b1, b2, b3

    The layout of the fragments held by different threads is shown in Figure 57.

    _images/mma-8816-B-i8.png

    Figure 57 MMA .m8n8k16 fragment layout for matrix B with .u8/.s8 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = (threadID_in_group * 4) + i    for bi   where i = {0,..,3}
    col = groupID
  • Accumulators (C or D):

    .ctype / .dtype:         .s32
    Fragment:                A vector expression of two .s32 registers.
    Elements (low to high):  c0, c1

    The layout of the fragments held by different threads is shown in Figure 58.

    _images/mma-8816-C-i8.png

    Figure 58 MMA .m8n8k16 fragment layout for accumulator matrix C/D with .s32 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID
    col = (threadID_in_group * 2) + i    for ci   where i = {0, 1}
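As with the other shapes, the formulas can be checked exhaustively in a few lines (an illustrative sketch, not part of PTX):

```python
def s32_acc_cell(laneid, i):
    """(row, col) of accumulator element ci for the m8n8k16 .s32 layout."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    return group_id, thread_id_in_group * 2 + i

cells = {s32_acc_cell(lane, i) for lane in range(32) for i in range(2)}
print(sorted(cells) == [(r, c) for r in range(8) for c in range(8)])   # True
```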
9.7.14.5.4. Matrix Fragments for mma.m8n8k32

A warp executing mma.m8n8k32 will compute an MMA operation of shape .m8n8k32.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype:                  .s4 / .u4
    Fragment:                A vector expression containing a single .b32 register, containing eight .s4 or .u4 elements from the matrix A.
    Elements (low to high):  a0, a1, a2, a3, a4, a5, a6, a7

    The layout of the fragments held by different threads is shown in Figure 59.

    _images/mma-8832-A-i4.png

    Figure 59 MMA .m8n8k32 fragment layout for matrix A with .u4/.s4 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID
    col = (threadID_in_group * 8) + i    for ai   where i = {0,..,7}
  • Multiplicand B:

    .btype:                  .s4 / .u4
    Fragment:                A vector expression containing a single .b32 register, containing eight .s4 or .u4 elements from the matrix B.
    Elements (low to high):  b0, b1, b2, b3, b4, b5, b6, b7

    The layout of the fragments held by different threads is shown in Figure 60.

    _images/mma-8832-B-i4.png

    Figure 60 MMA .m8n8k32 fragment layout for matrix B with .u4/.s4 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = (threadID_in_group * 8) + i    for bi   where i = {0,..,7}
    col = groupID
  • Accumulators (C or D):

    .ctype / .dtype:         .s32
    Fragment:                A vector expression of two .s32 registers.
    Elements (low to high):  c0, c1

    The layout of the fragments held by different threads is shown in Figure 61:

    _images/mma-8832-C-i4.png

    Figure 61 MMA .m8n8k32 fragment layout for accumulator matrix C/D with .s32 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID
    col = (threadID_in_group * 2) + i    for ci   where i = {0, 1}
9.7.14.5.5. Matrix Fragments for mma.m8n8k128

A warp executing mma.m8n8k128 will compute an MMA operation of shape .m8n8k128.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype:                  .b1
    Fragment:                A vector expression containing a single .b32 register, containing thirty-two .b1 elements from the matrix A.
    Elements (low to high):  a0, a1, ..., a30, a31

    The layout of the fragments held by different threads is shown in Figure 62.

    _images/mma-88128-A.png

    Figure 62 MMA .m8n8k128 fragment layout for matrix A with .b1 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID
    col = (threadID_in_group * 32) + i    for ai   where i = {0,..,31}
  • Multiplicand B:

    .btype:                  .b1
    Fragment:                A vector expression containing a single .b32 register, containing thirty-two .b1 elements from the matrix B.
    Elements (low to high):  b0, b1, ..., b30, b31

    The layout of the fragments held by different threads is shown in Figure 63.

    _images/mma-88128-B.png

    Figure 63 MMA .m8n8k128 fragment layout for matrix B with .b1 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = (threadID_in_group * 32) + i    for bi   where i = {0,..,31}
    col = groupID
  • Accumulators (C or D):

    .ctype / .dtype:         .s32
    Fragment:                A vector expression containing two .s32 registers, containing two .s32 elements from the matrix C (or D).
    Elements (low to high):  c0, c1

    The layout of the fragments held by different threads is shown in Figure 64.

    _images/mma-88128-C.png

    Figure 64 MMA .m8n8k128 fragment layout for accumulator matrix C/D with .s32 type

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID
    col = (threadID_in_group * 2) + i    for ci   where i = {0, 1}
9.7.14.5.6. Matrix Fragments for mma.m16n8k4

A warp executing mma.m16n8k4 will compute an MMA operation of shape .m16n8k4.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    • .tf32:

      .atype:                  .tf32
      Fragment:                A vector expression containing two .b32 registers, containing two .tf32 elements from the matrix A.
      Elements (low to high):  a0, a1

      The layout of the fragments held by different threads is shown in Figure 65.

      _images/mma-1684-A.png

      Figure 65 MMA .m16n8k4 fragment layout for matrix A with .tf32 type

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row = groupID        for a0
            groupID + 8    for a1

      col = threadID_in_group
    • .f64:

      .atype:                  .f64
      Fragment:                A vector expression containing two .f64 registers, containing two .f64 elements from the matrix A.
      Elements (low to high):  a0, a1

      The layout of the fragments held by different threads is shown in Figure 66.

      _images/mma-1684-A.png

      Figure 66 MMA .m16n8k4 fragment layout for matrix A with .f64 type

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row = groupID        for a0
            groupID + 8    for a1

      col = threadID_in_group
  • Multiplicand B:

    • .tf32:

      .btype

      Fragment

      Elements (low to high)

      .tf32

      A vector expression of a single .b32 register, containing a single .tf32 element from the matrix B.

      b0

      The layout of the fragments held by different threads is shown in Figure 67.

      Figure 67: MMA .m16n8k4 fragment layout for matrix B with .tf32 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  threadID_in_group

      col =  groupID
    • .f64:

      .btype

      Fragment

      Elements (low to high)

      .f64

      A vector expression of a single .f64 register, containing a single .f64 element from the matrix B.

      b0

      The layout of the fragments held by different threads is shown in Figure 68.

      Figure 68: MMA .m16n8k4 fragment layout for matrix B with .f64 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  threadID_in_group

      col =  groupID
  • Accumulators (C or D):

    • .tf32:

      .ctype / .dtype

      Fragment

      Elements (low to high)

      .f32

      A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D).

      c0, c1, c2, c3

      The layout of the fragments held by different threads is shown in Figure 69.

      Figure 69: MMA .m16n8k4 fragment layout for accumulator matrix C/D with .f32 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for c0 and c1
             groupID + 8    for c2 and c3

      col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
    • .f64:

      .ctype / .dtype

      Fragment

      Elements (low to high)

      .f64

      A vector expression containing four .f64 registers, containing four .f64 elements from the matrix C (or D).

      c0, c1, c2, c3

      The layout of the fragments held by different threads is shown in Figure 70.

      Figure 70: MMA .m16n8k4 fragment layout for accumulator matrix C/D with .f64 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for c0 and c1
             groupID + 8    for c2 and c3

      col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
9.7.14.5.7. Matrix Fragments for mma.m16n8k8

A warp executing mma.m16n8k8 will compute an MMA operation of shape .m16n8k8.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    • .f16 and .bf16:

      .atype

      Fragment

      Elements (low to high)

      .f16 / .bf16

      A vector expression containing two .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix A.

      a0, a1, a2, a3

      The layout of the fragments held by different threads is shown in Figure 71.

      Figure 71: MMA .m16n8k8 fragment layout for matrix A with .f16 / .bf16 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for a0 and a1
             groupID + 8    for a2 and a3

      col =  (threadID_in_group * 2) + (i & 0x1)    for ai where i = {0,..,3}
    • .tf32:

      .atype

      Fragment

      Elements (low to high)

      .tf32

      A vector expression containing four .b32 registers, containing four .tf32 elements from the matrix A.

      a0, a1, a2, a3

      The layout of the fragments held by different threads is shown in Figure 72.

      Figure 72: MMA .m16n8k8 fragment layout for matrix A with .tf32 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for a0 and a2
             groupID + 8    for a1 and a3

      col =  threadID_in_group        for a0 and a1
             threadID_in_group + 4    for a2 and a3
    • .f64:

      .atype

      Fragment

      Elements (low to high)

      .f64

      A vector expression containing four .f64 registers, containing four .f64 elements from the matrix A.

      a0, a1, a2, a3

      The layout of the fragments held by different threads is shown in Figure 73.

      Figure 73: MMA .m16n8k8 fragment layout for matrix A with .f64 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for a0 and a2
             groupID + 8    for a1 and a3

      col =  threadID_in_group        for a0 and a1
             threadID_in_group + 4    for a2 and a3
  • Multiplicand B:

    • .f16 and .bf16:

      .btype

      Fragment

      Elements (low to high)

      .f16 / .bf16

      A vector expression containing a single .f16x2 register, containing two .f16 / .bf16 elements from the matrix B.

      b0, b1

      The layout of the fragments held by different threads is shown in Figure 74.

      Figure 74: MMA .m16n8k8 fragment layout for matrix B with .f16 / .bf16 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  (threadID_in_group * 2) + i    for bi where i = {0,1}

      col =  groupID
    • .tf32:

      .btype

      Fragment

      Elements (low to high)

      .tf32

      A vector expression containing two .b32 registers, containing two .tf32 elements from the matrix B.

      b0, b1

      The layout of the fragments held by different threads is shown in Figure 75.

      Figure 75: MMA .m16n8k8 fragment layout for matrix B with .tf32 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  threadID_in_group        for b0
             threadID_in_group + 4    for b1

      col =  groupID
    • .f64:

      .btype

      Fragment

      Elements (low to high)

      .f64

      A vector expression containing two .f64 registers, containing two .f64 elements from the matrix B.

      b0, b1

      The layout of the fragments held by different threads is shown in Figure 76.

      Figure 76: MMA .m16n8k8 fragment layout for matrix B with .f64 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  threadID_in_group        for b0
             threadID_in_group + 4    for b1

      col =  groupID
  • Accumulators (C or D):

    • .f16, .bf16 and .tf32:

      .ctype / .dtype

      Fragment

      Elements (low to high)

      .f16

      A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D).

      c0, c1, c2, c3

      .f32

      A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D).

      c0, c1, c2, c3

      The layout of the fragments held by different threads is shown in Figure 77.

      Figure 77: MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f16x2 / .f32 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for c0 and c1
             groupID + 8    for c2 and c3

      col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
    • .f64:

      .ctype / .dtype

      Fragment

      Elements (low to high)

      .f64

      A vector expression containing four .f64 registers, containing four .f64 elements from the matrix C (or D).

      c0, c1, c2, c3

      The layout of the fragments held by different threads is shown in Figure 78.

      Figure 78: MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f64 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for c0 and c1
             groupID + 8    for c2 and c3

      col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
9.7.14.5.8. Matrix Fragments for mma.m16n8k16 with floating point type

A warp executing mma.m16n8k16 with floating point types will compute an MMA operation of shape .m16n8k16.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    • .f16 and .bf16:

      .atype

      Fragment

      Elements (low to high)

      .f16 / .bf16

      A vector expression containing four .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix A.

      a0, a1, a2, a3, a4, a5, a6, a7

      The layout of the fragments held by different threads is shown in Figure 79.

      Figure 79: MMA .m16n8k16 fragment layout for matrix A with .f16 / .bf16 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for ai where 0 <= i < 2 || 4 <= i < 6
             groupID + 8    otherwise

      col =  (threadID_in_group * 2) + (i & 0x1)        for ai where i < 4
             (threadID_in_group * 2) + (i & 0x1) + 8    for ai where i >= 4
    • .f64:

      .atype

      Fragment

      Elements (low to high)

      .f64

      A vector expression containing eight .f64 registers, with each register containing one .f64 element from the matrix A.

      a0, a1, a2, a3, a4, a5, a6, a7

      The layout of the fragments held by different threads is shown in Figure 80.

      Figure 80: MMA .m16n8k16 fragment layout for matrix A with .f64 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for ai where i % 2 = 0
             groupID + 8    otherwise

      col =  (i * 2) + threadID_in_group        for ai where i % 2 = 0
             (i * 2) - 2 + threadID_in_group    otherwise
  • Multiplicand B:

    • .f16 and .bf16:

      .btype

      Fragment

      Elements (low to high)

      .f16 / .bf16

      A vector expression containing two .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix B.

      b0, b1, b2, b3

      The layout of the fragments held by different threads is shown in Figure 81.

      Figure 81: MMA .m16n8k16 fragment layout for matrix B with .f16 / .bf16 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  (threadID_in_group * 2) + (i & 0x1)        for bi where i < 2
             (threadID_in_group * 2) + (i & 0x1) + 8    for bi where i >= 2

      col =  groupID
    • .f64:

      .btype

      Fragment

      Elements (low to high)

      .f64

      A vector expression containing four .f64 registers, with each register containing one .f64 element from the matrix B.

      b0, b1, b2, b3

      The layout of the fragments held by different threads is shown in Figure 82.

      Figure 82: MMA .m16n8k16 fragment layout for matrix B with .f64 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  threadID_in_group + (i * 4)    for bi where i = {0,..,3}

      col =  groupID
  • Accumulators (C or D):

    .ctype / .dtype

    Fragment

    Elements (low to high)

    .f64

    A vector expression containing four .f64 registers, containing four .f64 elements from the matrix C (or D).

    c0, c1, c2, c3

    .f32

    A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D).

    c0, c1, c2, c3

    .f16

    A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D).

    c0, c1, c2, c3

    The layout of the fragments held by different threads is shown in Figure 83.

    Figure 83: MMA .m16n8k16 fragment layout for accumulator matrix C/D.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ci where i <  2
           groupID + 8    for ci where i >= 2

    col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
9.7.14.5.9. Matrix Fragments for mma.m16n8k16 with integer type

A warp executing mma.m16n8k16 will compute an MMA operation of shape .m16n8k16.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements (low to high)

    .u8 / .s8

    A vector expression containing two .b32 registers, with each register containing four .u8 / .s8 elements from the matrix A.

    a0, a1, a2, a3, a4, a5, a6, a7

    .e4m3 / .e5m2

    A vector expression containing two .b32 registers, with each register containing four .e4m3 / .e5m2 elements from the matrix A.

    a0, a1, a2, a3, a4, a5, a6, a7

    The layout of the fragments held by different threads is shown in Figure 84.

    Figure 84: MMA .m16n8k16 fragment layout for matrix A with .u8 / .s8 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where i <  4
           groupID + 8    for ai where i >= 4

    col =  (threadID_in_group * 4) + (i & 0x3)    for ai where i = {0,..,7}
  • Multiplicand B:

    .btype

    Fragment

    Elements (low to high)

    .u8 / .s8

    A vector expression containing a single .b32 register, containing four .u8 / .s8 elements from the matrix B.

    b0, b1, b2, b3

    .e4m3 / .e5m2

    A vector expression containing a single .b32 register, containing four .e4m3 / .e5m2 elements from the matrix B.

    b0, b1, b2, b3

    The layout of the fragments held by different threads is shown in Figure 85.

    Figure 85: MMA .m16n8k16 fragment layout for matrix B with .u8 / .s8 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  (threadID_in_group * 4) + i    for bi where i = {0,..,3}

    col =  groupID
  • Accumulators (C or D):

    .ctype / .dtype

    Fragment

    Elements (low to high)

    .s32

    A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D).

    c0, c1, c2, c3

    .f32

    A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D).

    c0, c1, c2, c3

    .f16

    A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D).

    c0, c1, c2, c3

    The layout of the fragments held by different threads is shown in Figure 86.

    Figure 86: MMA .m16n8k16 fragment layout for accumulator matrix C/D with .s32 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ci where i <  2
           groupID + 8    for ci where i >= 2

    col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
9.7.14.5.10. Matrix Fragments for mma.m16n8k32

A warp executing mma.m16n8k32 will compute an MMA operation of shape .m16n8k32.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    • .s4 or .u4:

      .atype

      Fragment

      Elements (low to high)

      .s4 / .u4

      A vector expression containing two .b32 registers, with each register containing eight .u4 / .s4 elements from the matrix A.

      a0, a1, …, a14, a15

      The layout of the fragments held by different threads is shown in Figure 87.

      Figure 87: MMA .m16n8k32 fragment layout for matrix A with .u4 / .s4 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for ai where i <  8
             groupID + 8    for ai where i >= 8

      col =  (threadID_in_group * 8) + (i & 0x7)    for ai where i = {0,..,15}
    • .s8 or .u8 or .e4m3 or .e5m2 or .e3m2 or .e2m3 or .e2m1:

      .atype

      Fragment

      Elements (low to high)

      .s8 / .u8

      A vector expression containing four .b32 registers, with each register containing four .s8 / .u8 elements from the matrix A.

      a0, a1, …, a14, a15

      .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1

      A vector expression containing four .b32 registers, with each register containing four .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 elements from the matrix A.

      a0, a1, …, a14, a15

      The layout of the fragments held by different threads is shown in Figure 88.

      Figure 88: MMA .m16n8k32 fragment layout for matrix A with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  groupID        for ai where 0 <= i < 4 || 8 <= i < 12
             groupID + 8    otherwise

      col =  (threadID_in_group * 4) + (i & 0x3)         for ai where i < 8
             (threadID_in_group * 4) + (i & 0x3) + 16    for ai where i >= 8
  • Multiplicand B:

    • .s4 or .u4:

      .btype

      Fragment

      Elements (low to high)

      .s4 / .u4

      A vector expression containing a single .b32 register, containing eight .s4 / .u4 elements from the matrix B.

      b0, b1, b2, b3, b4, b5, b6, b7

      The layout of the fragments held by different threads is shown in Figure 89.

      Figure 89: MMA .m16n8k32 fragment layout for matrix B with .u4 / .s4 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  (threadID_in_group * 8) + (i & 0x7)    for bi where i = {0,..,7}

      col =  groupID
    • .s8 or .u8 or .e4m3 or .e5m2 or .e3m2 or .e2m3 or .e2m1:

      .btype

      Fragment

      Elements (low to high)

      .s8 / .u8

      A vector expression containing two .b32 registers, with each register containing four .s8 / .u8 elements from the matrix B.

      b0, b1, b2, b3, b4, b5, b6, b7

      .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1

      A vector expression containing two .b32 registers, with each register containing four .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 elements from the matrix B.

      b0, b1, b2, b3, b4, b5, b6, b7

      The layout of the fragments held by different threads is shown in Figure 90 and Figure 91.

      Figure 90: MMA .m16n8k32 fragment layout for rows 0–15 of matrix B with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type.

      Figure 91: MMA .m16n8k32 fragment layout for rows 16–31 of matrix B with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type.

      The row and column of a matrix fragment can be computed as:

      groupID           = %laneid >> 2
      threadID_in_group = %laneid % 4

      row =  (threadID_in_group * 4) + (i & 0x3)         for bi where i < 4
             (threadID_in_group * 4) + (i & 0x3) + 16    for bi where i >= 4

      col =  groupID
  • Accumulators (C or D):

    .ctype / .dtype

    Fragment

    Elements (low to high)

    .s32

    A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D).

    c0, c1, c2, c3

    .f32

    A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D).

    c0, c1, c2, c3

    .f16

    A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D).

    c0, c1, c2, c3

    The layout of the fragments held by different threads is shown in Figure 92.

    Figure 92: MMA .m16n8k32 fragment layout for accumulator matrix C/D with .s32 / .f32 / .f16 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ci where i <  2
           groupID + 8    for ci where i >= 2

    col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
9.7.14.5.11. Matrix Fragments for mma.m16n8k64

A warp executing mma.m16n8k64 will compute an MMA operation of shape .m16n8k64.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements (low to high)

    .s4 / .u4

    A vector expression containing four .b32 registers, with each register containing eight .s4 / .u4 elements from the matrix A.

    a0, a1, …, a30, a31

    .e2m1

    A vector expression containing four .b32 registers, with each register containing eight .e2m1 elements from the matrix A.

    a0, a1, …, a30, a31

    The layout of the fragments held by different threads is shown in Figure 93.

    Figure 93: MMA .m16n8k64 fragment layout for matrix A with .u4 / .s4 / .e2m1 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where 0 <= i < 8 || 16 <= i < 24
           groupID + 8    otherwise

    col =  (threadID_in_group * 8) + (i & 0x7)         for ai where i < 16
           (threadID_in_group * 8) + (i & 0x7) + 32    for ai where i >= 16
  • Multiplicand B:

    .btype

    Fragment

    Elements (low to high)

    .s4 / .u4

    A vector expression containing two .b32 registers, with each register containing eight .s4 / .u4 elements from the matrix B.

    b0, b1, …, b14, b15

    .e2m1

    A vector expression containing two .b32 registers, with each register containing eight .e2m1 elements from the matrix B.

    b0, b1, …, b14, b15

    The layout of the fragments held by different threads is shown in Figure 94 and Figure 95.

    Figure 94: MMA .m16n8k64 fragment layout for rows 0–31 of matrix B with .u4 / .s4 / .e2m1 type.

    Figure 95: MMA .m16n8k64 fragment layout for rows 32–63 of matrix B with .u4 / .s4 / .e2m1 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  (threadID_in_group * 8) + (i & 0x7)         for bi where i < 8
           (threadID_in_group * 8) + (i & 0x7) + 32    for bi where i >= 8

    col =  groupID
  • Accumulators (C or D):

    .ctype / .dtype

    Fragment

    Elements (low to high)

    .s32

    A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D).

    c0, c1, c2, c3

    .f32

    A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D).

    c0, c1, c2, c3

    The layout of the fragments held by different threads is shown in Figure 96.

    Figure 96: MMA .m16n8k64 fragment layout for accumulator matrix C/D with .s32 / .f32 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ci where i <  2
           groupID + 8    for ci where i >= 2

    col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
9.7.14.5.12. Matrix Fragments for mma.m16n8k128

A warp executing mma.m16n8k128 will compute an MMA operation of shape .m16n8k128.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements (low to high)

    .b1

    A vector expression containing two .b32 registers, with each register containing thirty-two .b1 elements from the matrix A.

    a0, a1, …, a62, a63

    The layout of the fragments held by different threads is shown in Figure 97.

    Figure 97: MMA .m16n8k128 fragment layout for matrix A with .b1 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where i <  32
           groupID + 8    for ai where i >= 32

    col =  (threadID_in_group * 32) + (i & 0x1F)    for ai where i = {0,..,63}
  • Multiplicand B:

    .btype

    Fragment

    Elements (low to high)

    .b1

    A vector expression containing a single .b32 register containing thirty-two .b1 elements from the matrix B.

    b0, b1, …, b30, b31

    The layout of the fragments held by different threads is shown in Figure 98.

    Figure 98: MMA .m16n8k128 fragment layout for matrix B with .b1 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  (threadID_in_group * 32) + i    for bi where i = {0,..,31}

    col =  groupID
  • Accumulators (C or D):

    .ctype / .dtype

    Fragment

    Elements (low to high)

    .s32

    A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D).

    c0, c1, c2, c3

    The layout of the fragments held by different threads is shown in Figure 99.

    Figure 99: MMA .m16n8k128 fragment layout for accumulator matrix C/D with .s32 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ci where i <  2
           groupID + 8    for ci where i >= 2

    col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
9.7.14.5.13. Matrix Fragments for mma.m16n8k256

A warp executing mma.m16n8k256 will compute an MMA operation of shape .m16n8k256.

Elements of the matrix are distributed across the threads in a warp so that each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements (low to high)

    .b1

    A vector expression containing four .b32 registers, with each register containing thirty-two .b1 elements from the matrix A.

    a0, a1, …, a126, a127

    The layout of the fragments held by different threads is shown in Figure 100.

    Figure 100: MMA .m16n8k256 fragment layout for matrix A with .b1 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where 0 <= i < 32 || 64 <= i < 96
           groupID + 8    otherwise

    col =  (threadID_in_group * 32) + (i & 0x1F)          for ai where i < 64
           (threadID_in_group * 32) + (i & 0x1F) + 128    for ai where i >= 64
  • Multiplicand B:

    .btype

    Fragment

    Elements (low to high)

    .b1

    A vector expression containing two .b32 registers, with each register containing thirty-two .b1 elements from the matrix B.

    b0, b1, …, b62, b63

    The layout of the fragments held by different threads is shown in Figure 101 and Figure 102.

    Figure 101: MMA .m16n8k256 fragment layout for rows 0–127 of matrix B with .b1 type.

    Figure 102: MMA .m16n8k256 fragment layout for rows 128–255 of matrix B with .b1 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  (threadID_in_group * 32) + (i & 0x1F)          for bi where i < 32
           (threadID_in_group * 32) + (i & 0x1F) + 128    for bi where i >= 32

    col =  groupID
  • Accumulators (C or D):

    .ctype / .dtype

    Fragment

    Elements (low to high)

    .s32

    A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D).

    c0, c1, c2, c3

    The layout of the fragments held by different threads is shown in Figure 103.

    Figure 103: MMA .m16n8k256 fragment layout for accumulator matrix C/D with .s32 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ci where i <  2
           groupID + 8    for ci where i >= 2

    col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
9.7.14.5.14. Multiply-and-Accumulate Instruction: mma

mma

Perform matrix multiply-and-accumulate operation

Syntax

Half precision floating point type:

mma.sync.aligned.m8n8k4.alayout.blayout.dtype.f16.f16.ctype  d, a, b, c;
mma.sync.aligned.m16n8k8.row.col.dtype.f16.f16.ctype         d, a, b, c;
mma.sync.aligned.m16n8k16.row.col.dtype.f16.f16.ctype        d, a, b, c;

.alayout = {.row, .col};
.blayout = {.row, .col};
.ctype   = {.f16, .f32};
.dtype   = {.f16, .f32};

Alternate floating point type:

mma.sync.aligned.m16n8k4.row.col.f32.tf32.tf32.f32        d, a, b, c;
mma.sync.aligned.m16n8k8.row.col.f32.atype.btype.f32      d, a, b, c;
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32       d, a, b, c;
mma.sync.aligned.shape.row.col.dtype.f8type.f8type.ctype  d, a, b, c;
mma.sync.aligned.m16n8k32.row.col.kind.dtype.f8f6f4type.f8f6f4type.ctype d, a, b, c;

.atype      = {.bf16, .tf32};
.btype      = {.bf16, .tf32};
.f8type     = {.e4m3, .e5m2};
.f8f6f4type = {.e4m3, .e5m2, .e3m2, .e2m3, .e2m1};
.ctype      = {.f16, .f32};
.dtype      = {.f16, .f32};
.shape      = {.m16n8k16, .m16n8k32};
.kind       = {.kind::f8f6f4};

Alternate floating point type with block scaling:

mma.sync.aligned.m16n8k64.row.col.kind.block_scale{.scale_vec_size}.f32.e2m1.e2m1.f32.stype d, a, b, c, scale-a-data, {byte-id-a, thread-id-a}, scale-b-data, {byte-id-b, thread-id-b};

.kind           = {.kind::mxf4};
.scale_vec_size = {.scale_vec::2X};
.stype          = {.ue8m0};

mma.sync.aligned.m16n8k64.row.col.kind.block_scale.scale_vec_size.f32.e2m1.e2m1.f32.stype d, a, b, c, scale-a-data, {byte-id-a, thread-id-a}, scale-b-data, {byte-id-b, thread-id-b};

.kind           = {.kind::mxf4nvf4};
.scale_vec_size = {.scale_vec::2X, .scale_vec::4X};
.stype          = {.ue8m0, .ue4m3};

mma.sync.aligned.m16n8k32.row.col.kind.block_scale{.scale_vec_size}.f32.f8f6f4type.f8f6f4type.f32.stype d, a, b, c, scale-a-data, {byte-id-a, thread-id-a}, scale-b-data, {byte-id-b, thread-id-b};

.kind           = {.kind::mxf8f6f4};
.scale_vec_size = {.scale_vec::1X};
.f8f6f4type     = {.e4m3, .e5m2, .e3m2, .e2m3, .e2m1};
.stype          = {.ue8m0};

Double precision floating point type:

mma.sync.aligned.shape.row.col.f64.f64.f64.f64 d, a, b, c;

.shape   = {.m8n8k4, .m16n8k4, .m16n8k8, .m16n8k16};

Integer type:

mma.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c;

.shape   = {.m8n8k16, .m16n8k16, .m16n8k32};
.atype   = {.u8, .s8};
.btype   = {.u8, .s8};

mma.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c;

.shape   = {.m8n8k32, .m16n8k32, .m16n8k64};
.atype   = {.u4, .s4};
.btype   = {.u4, .s4};

Single bit:

mma.sync.aligned.shape.row.col.s32.b1.b1.s32.bitOp.popc d, a, b, c;

.bitOp = {.xor, .and};
.shape = {.m8n8k128, .m16n8k128, .m16n8k256};

Description

Perform an MxNxK matrix multiply and accumulate operation, D = A*B+C, where the A matrix is MxK, the B matrix is KxN, and the C and D matrices are MxN.

Qualifier .block_scale specifies that the matrices A and B are scaled with scale_A and scale_B matrices respectively before performing the matrix multiply and accumulate operation as specified in the section Block Scaling. The data type corresponding to each of the elements within the scale_A and scale_B matrices is specified by .stype. Qualifier .scale_vec_size specifies the number of columns of the scale_A matrix and the number of rows of the scale_B matrix.

The valid combinations of .kind, .stype and .scale_vec_size are described in Table 36. For mma with .kind::mxf4, when the qualifier .scale_vec_size is not specified, it defaults to 2X. In contrast, when .kind is specified as .kind::mxf8f6f4, the qualifier .scale_vec_size defaults to 1X. However, for .kind::mxf4nvf4, a valid .scale_vec_size must be provided.

A warp executing the mma.sync.m8n8k4 instruction computes 4 matrix multiply and accumulate operations. The remaining mma.sync operations compute a single matrix multiply and accumulate operation per warp.

For single-bit mma.sync, multiplication is replaced by a sequence of logical operations; specifically, mma.xor.popc and mma.and.popc compute the XOR and AND, respectively, of a k-bit row of A with a k-bit column of B, then count the number of set bits in the result (popc). This result is added to the corresponding element of C and written into D.

Operands a and b represent the two multiplicand matrices A and B, while c and d represent the accumulator and destination matrices, distributed across the threads in a warp. When the .block_scale qualifier is specified, operands scale-a-data and scale-b-data represent the scale matrix metadata corresponding to the scale_A and scale_B matrices respectively. The tuples {byte-id-a, thread-id-a} and {byte-id-b, thread-id-b} represent selectors for matrices scale_A and scale_B respectively from their corresponding metadata arguments scale-a-data and scale-b-data. The operands scale-a-data and scale-b-data are of type .b32. The operands byte-id-a, thread-id-a, byte-id-b and thread-id-b are unsigned 16-bit integer values. For more details on selector arguments, refer to the Block Scaling section.

The registers in each thread hold a fragment of the matrix as described in Matrix multiply-accumulate operation using mma instruction.

The qualifiers .dtype, .atype, .btype and .ctype indicate the data-type of the elements in the matrices D, A, B and C respectively. The qualifier .stype indicates the data-type of the elements in the matrices scale_A and scale_B. Specific shapes have type restrictions:

  • .m8n8k4: When .ctype is .f32, .dtype must also be .f32.

  • .m16n8k8:

    • .dtype must be the same as .ctype.

    • .atype must be the same as .btype.

The qualifiers .alayout and .blayout indicate the row-major or column-major layouts of matrices A and B respectively.

When .kind is either .kind::mxf8f6f4 or .kind::f8f6f4, the individual 4-bit and 6-bit floating point type elements must be packed in an 8-bit container. A matrix element of type .e2m1 resides in the central 4 bits of the 8-bit container, with padding in the upper 2 bits and lower 2 bits of the container. When the matrix element is of type .e3m2 or .e2m3, the matrix element resides in the lower 6 bits of the 8-bit container, with padding in the upper 2 bits of the container. In contrast, note that when using mma with .kind::mxf4 or .kind::mxf4nvf4, no explicit padding is necessary even though matrix elements are of type .e2m1.
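The container layouts described above can be sketched with simple bit operations. This is an illustrative model of the bit positions only (the helper names are ours), assuming zero padding bits as described.

```python
def pack_e2m1(nibble):
    """Place a 4-bit .e2m1 value in the central bits [5:2] of an 8-bit container."""
    assert 0 <= nibble < 16
    # Upper 2 bits and lower 2 bits of the container are padding (zero here).
    return (nibble & 0xF) << 2

def pack_6bit(val):
    """Place a 6-bit .e3m2/.e2m3 value in the lower bits [5:0] of an 8-bit container."""
    assert 0 <= val < 64
    # Upper 2 bits of the container are padding (zero here).
    return val & 0x3F
```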

Precision and rounding:

  • .f16 floating point operations:

    Element-wise multiplication of matrix A and B is performed with at least single precision. When .ctype or .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When both .ctype and .dtype are specified as .f16, the accumulation is performed with at least half precision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • .e4m3, .e5m2, .e3m2, .e2m3, .e2m1 floating point operations:

    Element-wise multiplication of matrix A and B is performed with specified precision. Accumulation of the intermediate values is performed with at least single precision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • .bf16 and .tf32 floating point operations:

    Element-wise multiplication of matrix A and B is performed with specified precision. Accumulation of the intermediate values is performed with at least single precision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • .f64 floating point operations:

    Precision of the element-wise multiplication and addition operation is identical to that of .f64 precision fused multiply-add. Supported rounding modifiers are:

    • .rn : mantissa LSB rounds to nearest even. This is the default.

    • .rz : mantissa LSB rounds towards zero.

    • .rm : mantissa LSB rounds towards negative infinity.

    • .rp : mantissa LSB rounds towards positive infinity.

  • Integer operations:

    The integer mma operation is performed with .s32 accumulators. The .satfinite qualifier indicates that on overflow, the accumulated value is limited to the range MIN_INT32..MAX_INT32 (where the bounds are defined as the minimum negative signed 32-bit integer and the maximum positive signed 32-bit integer respectively).

    If .satfinite is not specified, the accumulated value is wrapped instead.
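The contrast between .satfinite clamping and default wrapping can be modeled on Python integers. This is an illustrative sketch of the two overflow policies, not of the hardware datapath.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def accumulate_s32(value, satfinite):
    """Map an unbounded accumulation result into the .s32 range (illustrative)."""
    if satfinite:
        # .satfinite: clamp to MIN_INT32..MAX_INT32 on overflow.
        return max(INT32_MIN, min(INT32_MAX, value))
    # Default: wrap modulo 2^32 into the signed 32-bit range.
    return (value + 2**31) % 2**32 - 2**31
```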

The mandatory .sync qualifier indicates that the mma instruction causes the executing thread to wait until all threads in the warp execute the same mma instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same mma instruction. In conditionally executed code, an mma instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.

The behavior of the mma instruction is undefined if all threads in the same warp do not use the same qualifiers, or if any thread in the warp has exited.

Notes

Programs using the double precision floating point mma instruction with shapes .m16n8k4, .m16n8k8, and .m16n8k16 require at least 64 registers for compilation.

PTX ISA Notes

Introduced in PTX ISA version 6.4.

.f16 floating point type mma operation with .m8n8k4 shape introduced in PTX ISA version 6.4.

.f16 floating point type mma operation with .m16n8k8 shape introduced in PTX ISA version 6.5.

.u8/.s8 integer type mma operation with .m8n8k16 shape introduced in PTX ISA version 6.5.

.u4/.s4 integer type mma operation with .m8n8k32 shape introduced in PTX ISA version 6.5.

.f64 floating point type mma operation with .m8n8k4 shape introduced in PTX ISA version 7.0.

.f16 floating point type mma operation with .m16n8k16 shape introduced in PTX ISA version 7.0.

.bf16 alternate floating point type mma operation with .m16n8k8 and .m16n8k16 shapes introduced in PTX ISA version 7.0.

.tf32 alternate floating point type mma operation with .m16n8k4 and .m16n8k8 shapes introduced in PTX ISA version 7.0.

.u8/.s8 integer type mma operation with .m16n8k16 and .m16n8k32 shapes introduced in PTX ISA version 7.0.

.u4/.s4 integer type mma operation with .m16n8k32 and .m16n8k64 shapes introduced in PTX ISA version 7.0.

.b1 single-bit integer type mma operation with .m8n8k128, .m16n8k128 and .m16n8k256 shapes introduced in PTX ISA version 7.0.

Support for .and operation in single-bit mma introduced in PTX ISA version 7.1.

.f64 floating point type mma operation with .m16n8k4, .m16n8k8, and .m16n8k16 shapes introduced in PTX ISA version 7.8.

Support for .e4m3 and .e5m2 alternate floating point type mma operation introduced in PTX ISA version 8.4.

Support for shape .m16n8k16 and .f16 dtype/ctype with .e4m3/.e5m2 alternate floating point type mma operation introduced in PTX ISA version 8.7.

Support for .e3m2, .e2m3, .e2m1 alternate floating point type mma operation introduced in PTX ISA version 8.7.

Support for .kind, .block_scale, .scale_vec_size qualifiers introduced in PTX ISA version 8.7.

Target ISA Notes

Requires sm_70 or higher.

.f16 floating point type mma operation with .m8n8k4 shape requires sm_70 or higher.

Note

mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures.

.f16 floating point type mma operation with .m16n8k8 shape requires sm_75 or higher.

.u8/.s8 integer type mma operation with .m8n8k16 shape requires sm_75 or higher.

.u4/.s4 integer type mma operation with .m8n8k32 shape requires sm_75 or higher.

.b1 single-bit integer type mma operation with .m8n8k128 shape requires sm_75 or higher.

.f64 floating point type mma operation with .m8n8k4 shape requires sm_80 or higher.

.f16 floating point type mma operation with .m16n8k16 shape requires sm_80 or higher.

.bf16 alternate floating point type mma operation with .m16n8k8 and .m16n8k16 shapes requires sm_80 or higher.

.tf32 alternate floating point type mma operation with .m16n8k4 and .m16n8k8 shapes requires sm_80 or higher.

.u8/.s8 integer type mma operation with .m16n8k16 and .m16n8k32 shapes requires sm_80 or higher.

.u4/.s4 integer type mma operation with .m16n8k32 and .m16n8k64 shapes requires sm_80 or higher.

.b1 single-bit integer type mma operation with .m16n8k128 and .m16n8k256 shapes requires sm_80 or higher.

.and operation in single-bit mma requires sm_80 or higher.

.f64 floating point type mma operation with .m16n8k4, .m16n8k8, and .m16n8k16 shapes requires sm_90 or higher.

.e4m3 and .e5m2 alternate floating point type mma operation requires sm_89 or higher.

.e3m2, .e2m3 and .e2m1 alternate floating point type mma operation requires sm_120a and is supported on sm_120f from PTX ISA version 8.8.

Support for .kind, .block_scale, .scale_vec_size qualifiers requires sm_120a and is supported on sm_120f or higher in the same family from PTX ISA version 8.8.

Examples of half precision floating point type

// f16 elements in C and D matrix
.reg .f16x2 %Ra<2>, %Rb<2>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m8n8k4.row.col.f16.f16.f16.f16
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

// f16 elements in C and f32 elements in D
.reg .f16x2 %Ra<2>, %Rb<2>, %Rc<4>;
.reg .f32 %Rd<8>;
mma.sync.aligned.m8n8k4.row.col.f32.f16.f16.f16
  {%Rd0, %Rd1, %Rd2, %Rd3, %Rd4, %Rd5, %Rd6, %Rd7},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

// f32 elements in C and D
.reg .f16x2 %Ra<2>, %Rb<1>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .f16x2 %Ra<4>, %Rb<2>, %Rc<2>, %Rd<2>;
mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
  {%Rd0, %Rd1},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1};

.reg .f16x2 %Ra<4>, %Rb<2>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

Examples of alternate floating point type

.reg .b32 %Ra<2>, %Rb<1>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k4.row.col.f32.tf32.tf32.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .f16x2 %Ra<2>, %Rb<1>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .b32 %Ra<4>, %Rb<2>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .f16x2 %Ra<4>, %Rb<2>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e5m2.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.f32.e5m2.e4m3.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .b32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k32.row.col.f16.e4m3.e5m2.f16
  {%Rd0, %Rd1},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .b32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.f16.e5m2.e5m2.f16
  {%Rd0, %Rd1},
  {%Ra0, %Ra1},
  {%Rb0},
  {%Rc0, %Rc1};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k32.row.col.kind::f8f6f4.f32.e3m2.e2m3.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .b32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k32.row.col.kind::f8f6f4.f16.e2m3.e2m1.f16
  {%Rd0, %Rd1},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1};

Examples of integer type

.reg .b32 %Ra, %Rb, %Rc<2>, %Rd<2>;

// s8 elements in A and u8 elements in B
mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.u8.s32
  {%Rd0, %Rd1},
  {%Ra},
  {%Rb},
  {%Rc0, %Rc1};

// u4 elements in A and B matrix
mma.sync.aligned.m8n8k32.row.col.satfinite.s32.u4.u4.s32
  {%Rd0, %Rd1},
  {%Ra},
  {%Rb},
  {%Rc0, %Rc1};

// s8 elements in A and u8 elements in B
.reg .b32 %Ra<2>, %Rb, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.satfinite.s32.s8.u8.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb},
  {%Rc0, %Rc1, %Rc2, %Rc3};

// u4 elements in A and s4 elements in B
.reg .b32 %Ra<2>, %Rb, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k32.row.col.satfinite.s32.u4.s4.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb},
  {%Rc0, %Rc1, %Rc2, %Rc3};

// s8 elements in A and s8 elements in B
.reg .b32 %Ra<4>, %Rb<2>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k32.row.col.satfinite.s32.s8.s8.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

// u4 elements in A and B
.reg .b32 %Ra<4>, %Rb<2>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k64.row.col.satfinite.s32.u4.u4.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

Examples of single bit type

// b1 elements in A and B
.reg .b32 %Ra, %Rb, %Rc<2>, %Rd<2>;
mma.sync.aligned.m8n8k128.row.col.s32.b1.b1.s32.and.popc
  {%Rd0, %Rd1},
  {%Ra},
  {%Rb},
  {%Rc0, %Rc1};

// b1 elements in A and B
.reg .b32 %Ra, %Rb, %Rc<2>, %Rd<2>;
mma.sync.aligned.m8n8k128.row.col.s32.b1.b1.s32.xor.popc
  {%Rd0, %Rd1},
  {%Ra},
  {%Rb},
  {%Rc0, %Rc1};

.reg .b32 %Ra<2>, %Rb, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k128.row.col.s32.b1.b1.s32.xor.popc
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .b32 %Ra<2>, %Rb, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k128.row.col.s32.b1.b1.s32.and.popc
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .b32 %Ra<4>, %Rb<2>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

.reg .b32 %Ra<4>, %Rb<2>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.and.popc
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

Examples of .f64 floating point type

.reg .f64 %Ra, %Rb, %Rc<2>, %Rd<2>;
mma.sync.aligned.m8n8k4.row.col.f64.f64.f64.f64
  {%Rd0, %Rd1},
  {%Ra},
  {%Rb},
  {%Rc0, %Rc1};

.reg .f64 %Ra<8>, %Rb<4>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k4.row.col.f64.f64.f64.f64.rn
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0},
  {%Rc0, %Rc1, %Rc2, %Rc3};

mma.sync.aligned.m16n8k8.row.col.f64.f64.f64.f64.rn
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};

mma.sync.aligned.m16n8k16.row.col.f64.f64.f64.f64.rn
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3, %Ra4, %Ra5, %Ra6, %Ra7},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3};

Examples of mma with block scale

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 scaleAData, scaleBData;
mma.sync.aligned.m16n8k64.row.col.kind::mxf4.block_scale.f32.e2m1.e2m1.f32.ue8m0
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  scaleAData, {2, 1}, scaleBData, {2, 3};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 scaleAData, scaleBData;
.reg .u16 bidA, bidB, tidA, tidB;
mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  scaleAData, {bidA, tidA}, scaleBData, {bidB, tidB};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 scaleAData, scaleBData;
mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.scale_vec::1X.f32.e3m2.e2m1.f32.ue8m0
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  scaleAData, {0, 1}, scaleBData, {0, 1};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 scaleAData, scaleBData;
mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.scale_vec::1X.f32.e4m3.e5m2.f32.ue8m0
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  scaleAData, {0, 1}, scaleBData, {0, 0};
9.7.14.5.15. Warp-level matrix load instruction: ldmatrix

ldmatrix

Collectively load one or more matrices from shared memory formma instruction

Syntax

ldmatrix.sync.aligned.shape.num{.trans}{.ss}.type r, [p];

ldmatrix.sync.aligned.m8n16.num{.ss}.dst_fmt.src_fmt        r, [p];

ldmatrix.sync.aligned.m16n16.num.trans{.ss}.dst_fmt.src_fmt r, [p];

.shape   = {.m8n8, .m16n16};
.num     = {.x1, .x2, .x4};
.ss      = {.shared{::cta}};
.type    = {.b16, .b8};
.dst_fmt = { .b8x16 };
.src_fmt = { .b6x16_p32, .b4x16_p64 };

Description

Collectively load one or more matrices across all threads in a warp from the location indicated by the address operand p, from the .shared state space into destination register r. If no state space is provided, generic addressing is used, such that the address in p points into .shared space. If the generic address doesn't fall in the .shared state space, the behavior is undefined.

The .shape qualifier indicates the dimensions of the matrices being loaded. Each matrix element holds 16-bit, 8-bit, 6-bit, or 4-bit data.

The following table shows the matrix load case for each .shape.

.shape     Matrix shape   Element size
.m8n8      8x8            16-bit
.m16n16    16x16          8-bit or 6-bit or 4-bit
.m8n16     8x16           6-bit or 4-bit

The following table shows the valid uses of 6-bit and 4-bit data loads.

.src_fmt      .shape             Source data          Padding   .dst_fmt
.b6x16_p32    .m8n16, .m16n16    16 6-bit elements    32 bits   .b8x16 (16 8-bit elements)
.b4x16_p64    .m8n16, .m16n16    16 4-bit elements    64 bits   .b8x16 (16 8-bit elements)

For the .b6x16_p32 format, the source data is 16 unsigned 6-bit elements with 32 bits of padding. For the .b4x16_p64 format, the source data is 16 unsigned 4-bit elements with 64 bits of padding.
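The widening to the .b8x16 destination format can be sketched as zero-extending each 6-bit or 4-bit element into a full byte. This is an illustrative model of the element widths only (the helper name is ours); the exact bit-level storage is described in Optional Decompression.

```python
def widen_to_b8x16(elements, src_bits):
    """Zero-extend 16 unsigned 4-bit or 6-bit elements to 16 bytes (illustrative)."""
    assert src_bits in (4, 6) and len(elements) == 16
    assert all(0 <= e < (1 << src_bits) for e in elements)
    # Each source element now occupies a full byte of the .b8x16 destination.
    return bytes(elements)
```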

The values .x1, .x2 and .x4 for .num indicate one, two or four matrices respectively. When .shape is .m16n16, only .x1 and .x2 are valid values for .num.

The mandatory .sync qualifier indicates that ldmatrix causes the executing thread to wait until all threads in the warp execute the same ldmatrix instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same ldmatrix instruction. In conditionally executed code, an ldmatrix instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.

The behavior of ldmatrix is undefined if all threads do not use the same qualifiers, or if any thread in the warp has exited.

The destination operand r is a brace-enclosed vector expression consisting of 1, 2, or 4 32-bit registers as per the value of .num. Each component of the vector expression holds a fragment from the corresponding matrix.

Supported addressing modes for p are described in Addresses as Operands.

Consecutive instances of a row need not be stored contiguously in memory. The eight addresses required for each matrix are provided by eight threads, depending upon the value of .num as shown in the following table. Each address corresponds to the start of a matrix row. Addresses addr0–addr7 correspond to the rows of the first matrix, addresses addr8–addr15 correspond to the rows of the second matrix, and so on.

.num   Threads 0–7    Threads 8–15    Threads 16–23    Threads 24–31
.x1    addr0–addr7    –               –                –
.x2    addr0–addr7    addr8–addr15    –                –
.x4    addr0–addr7    addr8–addr15    addr16–addr23    addr24–addr31

Note

For .target sm_75 or below, all threads must contain valid addresses. Otherwise, the behavior is undefined. For .num = .x1 and .num = .x2, addresses contained in lower threads can be copied to higher threads to achieve the expected behavior.
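The address-provider mapping in the table above can be summarized in a small helper. This is an illustrative restatement of the table (the helper name is ours), not part of any PTX API.

```python
def address_provider_threads(num):
    """Return the warp lane IDs that supply matrix row addresses for each .num."""
    # .x1 needs 8 addresses, .x2 needs 16, .x4 needs 32; lane i supplies addr(i).
    count = {"x1": 8, "x2": 16, "x4": 32}[num]
    return list(range(count))
```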

When reading 8x8 matrices, a group of four consecutive threads loads 16 bytes. The matrix addressesmust be naturally aligned accordingly.

Each thread in a warp loads fragments of a row, with thread 0 receiving the first fragment in its register r, and so on. A group of four threads loads an entire row of the matrix as shown in Figure 104.

_images/mma-ldmatrix-fragments.png

Figure 104 ldmatrix fragment layout for one 8x8 matrix with 16-bit elements

When .num = .x2, the elements of the second matrix are loaded into the next destination register in each thread as per the layout in the above table. Similarly, when .num = .x4, elements of the third and fourth matrices are loaded into the subsequent destination registers in each thread.

For matrix shape 16x16, two destination registers r0 and r1 of type .b32 must be specified, and four 8-bit elements are loaded into each register. For 4-bit or 6-bit data, each 8-bit element will have 4 bits or 2 bits of padding respectively. Refer to Optional Decompression for more details on these formats.

An entire row of the matrix can be loaded by a group of four consecutive and aligned threads. Each thread in a warp loads 4 consecutive columns across 2 rows as shown in Figure 105.

_images/mma-ldmatrix-fragments-1616.png

Figure 105 ldmatrix fragment layout for one 16x16 matrix with 8-bit elements

For matrix shape 8x16, one destination register r0 of type .b32 must be specified, where four 8-bit elements are loaded into the register. For 4-bit or 6-bit data, each 8-bit element will have 4 bits or 2 bits of padding respectively.

An entire row of the matrix can be loaded by a group of four consecutive and aligned threads. Each thread in a warp loads 4 consecutive columns as shown in Figure 106.

_images/mma-ldmatrix-fragments-816.png

Figure 106 ldmatrix fragment layout for one 8x16 matrix with 8-bit elements containing 4-bit/6-bit data

The optional qualifier .trans indicates that the matrix is loaded in column-major format. However, for 16x16 matrices, .trans is mandatory.

The ldmatrix instruction is treated as a weak memory operation in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 6.5.

Support for the ::cta sub-qualifier introduced in PTX ISA version 7.8.

Support for .m16n16, .m8n16 shapes introduced in PTX ISA version 8.6.

Support for .b8 type with ldmatrix introduced in PTX ISA version 8.6.

Support for .src_fmt, .dst_fmt qualifiers introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_75 or higher.

Shapes .m16n16, .m8n16 are supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And are supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

Type .b8 with ldmatrix is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

Qualifiers .src_fmt, .dst_fmt are supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And are supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

Examples

// Load a single 8x8 matrix using 64-bit addressing
.reg .b64 addr;
.reg .b32 d;
ldmatrix.sync.aligned.m8n8.x1.shared::cta.b16 {d}, [addr];

// Load two 8x8 matrices in column-major format
.reg .b64 addr;
.reg .b32 d<2>;
ldmatrix.sync.aligned.m8n8.x2.trans.shared.b16 {d0, d1}, [addr];

// Load four 8x8 matrices
.reg .b64 addr;
.reg .b32 d<4>;
ldmatrix.sync.aligned.m8n8.x4.b16 {d0, d1, d2, d3}, [addr];

// Load one 16x16 matrix of 8-bit elements and transpose it
.reg .b64 addr;
.reg .b32 d<2>;
ldmatrix.sync.aligned.m16n16.x1.trans.shared.b8 {d0, d1}, [addr];

// Load two 16x16 matrices of 8-bit elements and transpose them
.reg .b64 addr;
.reg .b32 d<4>;
ldmatrix.sync.aligned.m16n16.x2.trans.shared::cta.b8 {d0, d1, d2, d3}, [addr];

// Load two 16x16 matrices of 6-bit elements and transpose them
.reg .b64 addr;
.reg .b32 d<4>;
ldmatrix.sync.aligned.m16n16.x2.trans.shared::cta.b8x16.b6x16_p32 {d0, d1, d2, d3}, [addr];
9.7.14.5.16. Warp-level matrix store instruction: stmatrix

stmatrix

Collectively store one or more matrices to shared memory.

Syntax

stmatrix.sync.aligned.shape.num{.trans}{.ss}.type [p], r;

.shape  = {.m8n8, .m16n8};
.num    = {.x1, .x2, .x4};
.ss     = {.shared{::cta}};
.type   = {.b16, .b8};

Description

Collectively store one or more matrices across all threads in a warp to the location indicated by the address operand p, in the .shared state space. If no state space is provided, generic addressing is used, such that the address in p points into .shared space. If the generic address doesn't fall in the .shared state space, the behavior is undefined.

The .shape qualifier indicates the dimensions of the matrices being stored. Each matrix element holds 16-bit or 8-bit data as indicated by the .type qualifier.

The .m16n8 shape is valid only for the .b8 type.

The values .x1, .x2 and .x4 for .num indicate one, two or four matrices respectively.

The mandatory .sync qualifier indicates that stmatrix causes the executing thread to wait until all threads in the warp execute the same stmatrix instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same stmatrix instruction. In conditionally executed code, an stmatrix instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.

The behavior of stmatrix is undefined if all threads do not use the same qualifiers, or if any thread in the warp has exited.

The source operand r is a brace-enclosed vector expression consisting of 1, 2, or 4 32-bit registers as per the value of .num. Each component of the vector expression holds a fragment from the corresponding matrix.

Supported addressing modes for p are described in Addresses as Operands.

Consecutive instances of a row need not be stored contiguously in memory. The eight addresses required for each matrix are provided by eight threads, depending upon the value of .num as shown in the following table. Each address corresponds to the start of a matrix row. Addresses addr0–addr7 correspond to the rows of the first matrix, addresses addr8–addr15 correspond to the rows of the second matrix, and so on.

.num   Threads 0–7    Threads 8–15    Threads 16–23    Threads 24–31
.x1    addr0–addr7    –               –                –
.x2    addr0–addr7    addr8–addr15    –                –
.x4    addr0–addr7    addr8–addr15    addr16–addr23    addr24–addr31

When storing 8x8 matrices, a group of four consecutive threads stores 16 bytes. The matrix addressesmust be naturally aligned accordingly.

Each thread in a warp stores fragments of a row, with thread 0 storing the first fragment from its register r, and so on. A group of four threads stores an entire row of the matrix as shown in Figure 107.

_images/mma-stmatrix-fragments.png

Figure 107 stmatrix fragment layout for one 8x8 matrix with 16-bit elements

When .num = .x2, the elements of the second matrix are stored from the next source register in each thread as per the layout in the above table. Similarly, when .num = .x4, elements of the third and fourth matrices are stored from the subsequent source registers in each thread.

For 16x8 matrix shape, each of the 32 threads in the warp provides four elements of data per matrix.

Each element in the source operand r is of type .b32 and contains four 8-bit elements e0, e1, e2, e3, with e0 and e3 containing the LSB and MSB respectively of register r.

_images/mma-stmatrix-fragments-168.png

Figure 108 stmatrix fragment layout for one 16x8 matrix with 8-bit elements
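The byte ordering described above (e0 in the least significant byte, e3 in the most significant byte of a .b32 register) can be sketched with integer arithmetic. This is an illustrative model; the helper names are ours.

```python
def pack_b32(e0, e1, e2, e3):
    """Pack four 8-bit elements into one .b32 value, e0 = LSB, e3 = MSB."""
    assert all(0 <= e < 256 for e in (e0, e1, e2, e3))
    return e0 | (e1 << 8) | (e2 << 16) | (e3 << 24)

def unpack_b32(r):
    """Recover the four 8-bit elements e0..e3 from a .b32 value."""
    return [(r >> shift) & 0xFF for shift in (0, 8, 16, 24)]
```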

The optional qualifier .trans indicates that the matrix is stored in column-major format. However, for 16x8 matrices, .trans is mandatory.

The stmatrix instruction is treated as a weak memory operation in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Support for the .m16n8 shape introduced in PTX ISA version 8.6.

Support for the .b8 type with stmatrix introduced in PTX ISA version 8.6.

Target ISA Notes

Requires sm_90 or higher.

Shape .m16n8 is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

Type .b8 with stmatrix is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

Examples

// Store a single 8x8 matrix using 64-bit addressing
.reg .b64 addr;
.reg .b32 r;
stmatrix.sync.aligned.m8n8.x1.shared.b16 [addr], {r};

// Store two 8x8 matrices in column-major format
.reg .b64 addr;
.reg .b32 r<2>;
stmatrix.sync.aligned.m8n8.x2.trans.shared::cta.b16 [addr], {r0, r1};

// Store four 8x8 matrices
.reg .b64 addr;
.reg .b32 r<4>;
stmatrix.sync.aligned.m8n8.x4.b16 [addr], {r0, r1, r2, r3};

// Store a single 16x8 matrix
.reg .b64 addr;
.reg .b32 r;
stmatrix.sync.aligned.m16n8.x1.trans.shared.b8 [addr], {r};

// Store two 16x8 matrices
.reg .b64 addr;
.reg .b32 r<2>;
stmatrix.sync.aligned.m16n8.x2.trans.shared::cta.b8 [addr], {r0, r1};

// Store four 16x8 matrices
.reg .b64 addr;
.reg .b32 r<4>;
stmatrix.sync.aligned.m16n8.x4.b8 [addr], {r0, r1, r2, r3};
9.7.14.5.17. Warp-level matrix transpose instruction: movmatrix

movmatrix

Transpose a matrix in registers across the warp.

Syntax

movmatrix.sync.aligned.shape.trans.type d, a;

.shape  = {.m8n8};
.type   = {.b16};

Description

Move a row-major matrix across all threads in a warp, reading elements from source a, and writing the transposed elements to destination d.

The .shape qualifier indicates the dimensions of the matrix being transposed. Each matrix element holds 16-bit data as indicated by the .type qualifier.

The mandatory .sync qualifier indicates that movmatrix causes the executing thread to wait until all threads in the warp execute the same movmatrix instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same movmatrix instruction. In conditionally executed code, a movmatrix instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.

Operands a and d are 32-bit registers containing fragments of the input matrix and the resulting matrix respectively. The mandatory qualifier .trans indicates that the resulting matrix in d is a transpose of the input matrix specified by a.

Each thread in a warp holds a fragment of a row of the input matrix, with thread 0 holding the first fragment in register a, and so on. A group of four threads holds an entire row of the input matrix as shown in Figure 109.

_images/mma-movmatrix-fragments-src.png

Figure 109 movmatrix source matrix fragment layout

Each thread in a warp holds a fragment of a column of the result matrix, with thread 0 holding the first fragment in register d, and so on. A group of four threads holds an entire column of the result matrix as shown in Figure 110.

_images/mma-movmatrix-fragments-dst.png

Figure 110 movmatrix result matrix fragment layout

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_75 or higher.

Examples

.reg .b32 d, a;
movmatrix.sync.aligned.m8n8.trans.b16 d, a;

9.7.14.6. Matrix multiply-accumulate operation using mma.sp instruction with sparse matrix A

This section describes the warp-level mma.sp{::ordered_metadata} instruction with sparse matrix A. This variant of the mma operation can be used when A is a structured sparse matrix with 50% zeros in each row distributed in a shape-specific granularity. For an MxNxK sparse mma.sp{::ordered_metadata} operation, the MxK matrix A is packed into MxK/2 elements. For each K-wide row of matrix A, 50% of the elements are zeros and the remaining K/2 non-zero elements are packed in the operand representing matrix A. The mapping of these K/2 elements to the corresponding K-wide row is provided explicitly as metadata.

9.7.14.6.1. Sparse matrix storage

The granularity of sparse matrix A is defined as the ratio of the number of non-zero elements in a sub-chunk of the matrix row to the total number of elements in that sub-chunk, where the size of the sub-chunk is shape-specific. For example, in a 16x16 matrix A, sparsity is expected to be at 2:4 granularity, i.e. each 4-element vector (i.e. a sub-chunk of 4 consecutive elements) of a matrix row contains 2 zeros. The index of each non-zero element in a sub-chunk is stored in the metadata operand. Values 0b0000, 0b0101, 0b1010, 0b1111 are invalid values for metadata and will result in undefined behavior. In a group of four consecutive threads, one or more threads store the metadata for the whole group depending upon the matrix shape. These threads are specified using an additional sparsity selector operand.

Figure 111 shows an example of a 16x16 matrix A represented in sparse format and the sparsity selector indicating which thread in a group of four consecutive threads stores the metadata.

_images/sparse-mma-storage-example.png

Figure 111 Sparse MMA storage example
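As an illustration, the packing and index-metadata encoding described above can be modeled in a few lines of Python (a hypothetical helper for exposition only; it is not part of PTX, and the metadata word here covers just one row at 2:4 granularity):

```python
def pack_2_4(row):
    """Model of 2:4 structured-sparse packing: for every 4-element
    sub-chunk of a row, keep the two non-zero values and record their
    2-bit positions in a metadata word, lowest chunk at the LSB end."""
    assert len(row) % 4 == 0
    packed, metadata, bit = [], 0, 0
    for base in range(0, len(row), 4):
        chunk = row[base:base + 4]
        idx = [i for i, v in enumerate(chunk) if v != 0]
        assert len(idx) == 2, "each 4-wide chunk needs exactly two non-zeros"
        for i in idx:
            packed.append(chunk[i])
            metadata |= i << bit
            bit += 2
    return packed, metadata

# Chunks [1, 0, 2, 0] and [0, 3, 0, 4] keep indices (0, 2) and (1, 3).
vals, meta = pack_2_4([1, 0, 2, 0, 0, 3, 0, 4])
```

In this sketch vals is [1, 2, 3, 4] and meta is 0b11011000; the two 4-bit groups, 0b1000 and 0b1101, are among the meaningful index encodings listed below for the 2:4 shapes.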

Granularities for different matrix shapes and data types are described below.

Sparse mma.sp{::ordered_metadata} with half-precision and .bf16 type

For the .m16n8k16 and .m16n8k32 mma.sp{::ordered_metadata} operations, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in the operand representing matrix A, and their positions in the four-wide chunk in matrix A are indicated by two 2-bit indices in the metadata operand. For mma.sp::ordered_metadata, 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other value results in undefined behavior.

_images/f16-metadata-example.png

Figure 112 Sparse MMA metadata example for .f16/.bf16 type.

The sparsity selector indicates the threads which contribute metadata as listed below:

  • m16n8k16: One thread within a group of four consecutive threads contributes the metadata for the entire group. This thread is indicated by a value in {0, 1, 2, 3}.

  • m16n8k32: A thread-pair within a group of four consecutive threads contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in undefined behavior.

Sparse mma.sp{::ordered_metadata} with .tf32 type

When matrix A has .tf32 elements, matrix A is structured sparse at a granularity of 1:2. In other words, each chunk of two adjacent elements in a row of matrix A has one zero and one non-zero element. Only the non-zero elements are stored in the operand for matrix A, and their positions in a two-wide chunk in matrix A are indicated by the 4-bit index in the metadata. 0b1110 and 0b0100 are the only meaningful index values; any other value results in undefined behavior.

_images/tf32-metadata-example.png

Figure 113 Sparse MMA metadata example for .tf32 type.

The sparsity selector indicates the threads which contribute metadata as listed below:

  • m16n8k8: One thread within a group of four consecutive threads contributes the metadata for the entire group. This thread is indicated by a value in {0, 1, 2, 3}.

  • m16n8k16: A thread-pair within a group of four consecutive threads contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in undefined behavior.
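The 1:2 metadata constraint above can be checked with a short sketch (illustrative Python, not a PTX API; the mapping of each 4-bit index value to a position is shown in Figure 113):

```python
MEANINGFUL_TF32_INDICES = {0b0100, 0b1110}

def tf32_metadata_is_valid(meta):
    """Check that every 4-bit index field of a 32-bit .tf32 sparse-MMA
    metadata word holds one of the two meaningful values; any other
    field value results in undefined behavior."""
    return all(
        (meta >> (4 * field)) & 0xF in MEANINGFUL_TF32_INDICES
        for field in range(8)
    )

ok = tf32_metadata_is_valid(0x4E4E4E4E)    # alternating 0b0100 / 0b1110
bad = tf32_metadata_is_valid(0x4E4E4E4F)   # 0b1111 is not meaningful
```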

Sparse mma.sp{::ordered_metadata} with integer type

When matrices A and B have .u8/.s8 elements, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in the sparse matrix, and their positions in the four-wide chunk are indicated by two 2-bit indices in the metadata. For mma.sp::ordered_metadata, 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other value results in undefined behavior.

_images/u8s8-metadata-example.png

Figure 114 Sparse MMA metadata example for .u8/.s8 type.

When matrices A and B have .u4/.s4 elements, matrix A is pair-wise structured sparse at a granularity of 4:8. In other words, each chunk of eight adjacent elements in a row of matrix A has four zeros and four non-zero values. Further, the zero and non-zero values are clustered in sub-chunks of two elements each within the eight-wide chunk, i.e., each two-wide sub-chunk within the eight-wide chunk must be all zeros or all non-zeros. Only the four non-zero values are stored in the sparse matrix, and the positions of the two two-wide sub-chunks with non-zero values in the eight-wide chunk of a row of matrix A are indicated by two 2-bit indices in the metadata. For mma.sp::ordered_metadata, 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other value results in undefined behavior.

_images/u4s4-metadata-example.png

Figure 115 Sparse MMA metadata example for .u4/.s4 type.

The sparsity selector indicates the threads which contribute metadata as listed below:

  • m16n8k32 with .u8/.s8 type and m16n8k64 with .u4/.s4 type: A thread-pair within a group of four consecutive threads contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in undefined behavior.

  • m16n8k64 with .u8/.s8 type and m16n8k128 with .u4/.s4 type: All threads within a group of four consecutive threads contribute the sparsity metadata. Hence, the sparsity selector in this case must be 0. Any other value of the sparsity selector results in undefined behavior.

Sparse mma.sp{::ordered_metadata} operating on .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type with .kind::f8f6f4 or .kind::mxf8f6f4

When matrices A and B have .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 elements, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in the sparse matrix, and their positions in the four-wide chunk are indicated by two 2-bit indices in the metadata. 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other value results in undefined behavior.

_images/fp8-metadata-example.png

Figure 116 Sparse MMA metadata example for .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

The sparsity selector indicates the threads which contribute metadata as listed below:

  • m16n8k64: All threads within a group of four consecutive threads contribute the sparsity metadata. Hence, the sparsity selector in this case must be 0. Any other value of the sparsity selector results in undefined behavior.

Sparse mma.sp::ordered_metadata operating on .e2m1 type with .kind::mxf4 or .kind::mxf4nvf4

When matrices A and B have .e2m1 elements, matrix A is pair-wise structured sparse at a granularity of 4:8. In other words, each chunk of eight adjacent elements in a row of matrix A has four zeros and four non-zero values. Further, the zero and non-zero values are clustered in sub-chunks of two elements each within the eight-wide chunk, i.e., each two-wide sub-chunk within the eight-wide chunk must be all zeros or all non-zeros. Only the four non-zero values are stored in the sparse matrix, and the positions of the two two-wide sub-chunks with non-zero values in the eight-wide chunk of a row of matrix A are indicated by two 2-bit indices in the metadata. 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other value results in undefined behavior.

_images/fp4-metadata-example.png

Figure 117 Sparse MMA metadata example for .e2m1 type with .kind::mxf4 or .kind::mxf4nvf4

The sparsity selector indicates the threads which contribute metadata as listed below:

  • m16n8k128: All threads within a group of four consecutive threads contribute the sparsity metadata. Hence, the sparsity selector in this case must be 0. Any other value of the sparsity selector results in undefined behavior.

9.7.14.6.2. Matrix fragments for multiply-accumulate operation with sparse matrix A

In this section we describe how the contents of thread registers are associated with fragments of various matrices and the sparsity metadata. The following conventions are used throughout this section:

  • For matrix A, only the layout of a fragment is described in terms of register vector sizes and their association with the matrix data.

  • For matrix B, when the combination of matrix dimension and the supported data type is not already covered in Matrix multiply-accumulate operation using mma instruction, a pictorial representation of matrix fragments is provided.

  • For matrices C and D, since the matrix dimension - data type combination is the same for all supported shapes, and is already covered in Matrix multiply-accumulate operation using mma instruction, the pictorial representations of matrix fragments are not included in this section.

  • For the metadata operand, pictorial representations of the association between indices of the elements of matrix A and the contents of the metadata operand are included. Tk:[m..n] present in cell [x][y..z] indicates that bits m through n (with m being higher) in the metadata operand of the thread with %laneid=k contain the indices of the non-zero elements from the chunk [x][y]..[x][z] of matrix A.

9.7.14.6.2.1. Matrix Fragments for sparse mma.m16n8k16 with .f16 and .bf16 types

A warp executing sparse mma.m16n8k16 with .f16 / .bf16 floating point type will compute an MMA operation of shape .m16n8k16.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements

    .f16 / .bf16

    A vector expression containing two .b32 registers, with each register containing two non-zero .f16 / .bf16 elements out of 4 consecutive elements from matrix A.

    Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 118.

    _images/sparse-mma-16816-f16-bf16-A.png

    Figure 118 Sparse MMA .m16n8k16 fragment layout for matrix A with .f16/.bf16 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for a0 and a1
           groupID + 8    for a2 and a3

    col = [firstcol ... lastcol]  // As per the mapping of non-zero elements
                                  // as described in Sparse matrix storage

    Where firstcol = threadID_in_group * 4
          lastcol  = firstcol + 3
  • Matrix fragments for multiplicand B and accumulators C and D are the same as in case of Matrix Fragments for mma.m16n8k16 with floating point type for .f16/.bf16 formats.

  • Metadata: A .b32 register containing 16 2-bit vectors, each storing the index of a non-zero element of a 4-wide chunk of matrix A as shown in Figure 119.

    _images/sparse-mma-metadata-16816-f16bf16.png

    Figure 119 Sparse MMA .m16n8k16 metadata layout for .f16/.bf16 type.
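The row/column computation above can be written out as a small Python model (for exposition only; the function name is ours, and i indexes the four .f16/.bf16 elements a0..a3 of the A fragment):

```python
def m16n8k16_a_chunk_coords(laneid, i):
    """Return (row, cols) of the 4-wide source chunk of matrix A that
    element a_i of the sparse mma.m16n8k16 .f16/.bf16 A fragment maps
    to, following the formula above (0 <= laneid < 32, 0 <= i < 4)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8   # a0, a1 vs. a2, a3
    firstcol = thread_id_in_group * 4
    return row, list(range(firstcol, firstcol + 4))

# Lane 5 is thread 1 of group 1: a2 comes from row 9, columns 4..7.
coords = m16n8k16_a_chunk_coords(5, 2)
```

Which two of the four columns actually hold the non-zero values is determined by the metadata, as described in Sparse matrix storage.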

9.7.14.6.2.2. Matrix Fragments for sparse mma.m16n8k32 with .f16 and .bf16 types

A warp executing sparse mma.m16n8k32 with .f16 / .bf16 floating point type will compute an MMA operation of shape .m16n8k32.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements

    .f16 / .bf16

    A vector expression containing four .b32 registers, with each register containing two non-zero .f16 / .bf16 elements out of 4 consecutive elements from matrix A.

    Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 120.

    _images/sparse-mma-16832-f16-bf16-A.png

    Figure 120 Sparse MMA .m16n8k32 fragment layout for matrix A with .f16/.bf16 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where 0 <= i < 2 || 4 <= i < 6
           groupID + 8    otherwise

    col = [firstcol ... lastcol]  // As per the mapping of non-zero elements
                                  // as described in Sparse matrix storage

    Where firstcol = threadID_in_group * 4         for ai where i < 4
                     (threadID_in_group * 4) + 16  for ai where i >= 4
          lastcol  = firstcol + 3
  • Multiplicand B:

    .btype

    Fragment

    Elements (low to high)

    .f16 / .bf16

    A vector expression containing four .b32 registers, each containing two .f16 / .bf16 elements from matrix B.

    b0, b1, b2, b3

    The layout of the fragments held by different threads is shown in Figure 121.

    _images/sparse-mma-16832-f16bf16-B.png

    Figure 121 Sparse MMA .m16n8k32 fragment layout for matrix B with .f16/.bf16 type.

  • Matrix fragments for accumulators C and D are the same as in case of Matrix Fragments for mma.m16n8k16 with floating point type for .f16/.bf16 formats.

  • Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of two non-zero elements from a 4-wide chunk of matrix A as shown in Figure 122.

    _images/sparse-mma-metadata-16832-f16bf16.png

    Figure 122 Sparse MMA .m16n8k32 metadata layout for .f16/.bf16 type.

9.7.14.6.2.3. Matrix Fragments for sparse mma.m16n8k16 with .tf32 floating point type

A warp executing sparse mma.m16n8k16 with .tf32 floating point type will compute an MMA operation of shape .m16n8k16.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements

    .tf32

    A vector expression containing four .b32 registers, with each register containing one non-zero .tf32 element out of 2 consecutive elements from matrix A.

    Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 123.

    _images/sparse-mma-16816-tf32-A.png

    Figure 123 Sparse MMA .m16n8k16 fragment layout for matrix A with .tf32 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for a0 and a2
           groupID + 8    for a1 and a3

    col = [firstcol ... lastcol]  // As per the mapping of non-zero elements
                                  // as described in Sparse matrix storage

    Where firstcol = threadID_in_group * 2        for a0 and a1
                     (threadID_in_group * 2) + 8  for a2 and a3
          lastcol  = firstcol + 1
  • Multiplicand B:

    .btype

    Fragment

    Elements (low to high)

    .tf32

    A vector expression containing four .b32 registers, each containing one .tf32 element from matrix B.

    b0, b1, b2, b3

    The layout of the fragments held by different threads is shown in Figure 124.

    _images/sparse-mma-16816-tf32-B.png

    Figure 124 Sparse MMA .m16n8k16 fragment layout for matrix B with .tf32 type.

  • Matrix fragments for accumulators C and D are the same as in case of Matrix Fragments for mma.m16n8k16 with floating point type.

  • Metadata: A .b32 register containing 8 4-bit vectors, each storing the index of a non-zero element of a 2-wide chunk of matrix A as shown in Figure 125.

    _images/sparse-mma-metadata-16816-tf32.png

    Figure 125 Sparse MMA .m16n8k16 metadata layout for .tf32 type.

9.7.14.6.2.4. Matrix Fragments for sparse mma.m16n8k8 with .tf32 floating point type

A warp executing sparse mma.m16n8k8 with .tf32 floating point type will compute an MMA operation of shape .m16n8k8.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements

    .tf32

    A vector expression containing two .b32 registers, each containing one non-zero .tf32 element out of 2 consecutive elements from matrix A.

    Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 126.

    _images/sparse-mma-1688-tf32-A.png

    Figure 126 Sparse MMA .m16n8k8 fragment layout for matrix A with .tf32 type.

    The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for a0
           groupID + 8    for a1

    col = [firstcol ... lastcol]  // As per the mapping of non-zero elements
                                  // as described in Sparse matrix storage

    Where firstcol = threadID_in_group * 2
          lastcol  = firstcol + 1
  • Matrix fragments for multiplicand B and accumulators C and D are the same as in case of Matrix Fragments for mma.m16n8k8 for the .tf32 format.

  • Metadata: A .b32 register containing 8 4-bit vectors, each storing the index of a non-zero element of a 2-wide chunk of matrix A as shown in Figure 127.

    _images/sparse-mma-metadata-1688-tf32.png

    Figure 127 Sparse MMA .m16n8k8 metadata layout for .tf32 type.

9.7.14.6.2.5. Matrix Fragments for sparse mma.m16n8k32 with .u8 / .s8 integer type

A warp executing sparse mma.m16n8k32 with .u8 / .s8 integer type will compute an MMA operation of shape .m16n8k32.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements

    .u8 / .s8

    A vector expression containing two .b32 registers, with each register containing four non-zero .u8 / .s8 elements out of 8 consecutive elements from matrix A.

    Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 128.

    _images/sparse-mma-16832-u8s8-A.png

    Figure 128 Sparse MMA .m16n8k32 fragment layout for matrix A with .u8/.s8 type.

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where 0 <= i < 4
           groupID + 8    otherwise

    col = [firstcol ... lastcol]  // As per the mapping of non-zero elements
                                  // as described in Sparse matrix storage

    Where firstcol = threadID_in_group * 8
          lastcol  = firstcol + 7
  • Matrix fragments for multiplicand B and accumulators C and D are the same as in case of Matrix Fragments for mma.m16n8k32.

  • Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of two non-zero elements from a 4-wide chunk of matrix A as shown in Figure 129.

    _images/sparse-mma-metadata-16832-u8s8.png

    Figure 129 Sparse MMA .m16n8k32 metadata layout for .u8/.s8 type.

9.7.14.6.2.6. Matrix Fragments for sparse mma.m16n8k64 with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type

A warp executing sparse mma.m16n8k64 with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type will compute an MMA operation of shape .m16n8k64.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements

    .u8 / .s8

    A vector expression containing four .b32 registers, with each register containing four non-zero .u8 / .s8 elements out of 8 consecutive elements from matrix A.

    Mapping of the non-zero elements is as described in Sparse matrix storage.

    .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1

    A vector expression containing four .b32 registers, with each register containing four non-zero .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 elements out of 8 consecutive elements from matrix A.

    The layout of the fragments held by different threads is shown in Figure 130 and Figure 131.

    _images/sparse-mma-16864-u8s8-A-first32col.png

    Figure 130 Sparse MMA .m16n8k64 fragment layout for columns 0–31 of matrix A with .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

    _images/sparse-mma-16864-u8s8-A-last32col.png

    Figure 131 Sparse MMA .m16n8k64 fragment layout for columns 32–63 of matrix A with .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where 0 <= i < 4 || 8 <= i < 12
           groupID + 8    otherwise

    col = [firstcol ... lastcol]  // As per the mapping of non-zero elements
                                  // as described in Sparse matrix storage

    Where firstcol = threadID_in_group * 8         for ai where i < 8
                     (threadID_in_group * 8) + 32  for ai where i >= 8
          lastcol  = firstcol + 7
  • Multiplicand B:

    .btype

    Fragment

    Elements (low to high)

    .u8 / .s8

    A vector expression containing four .b32 registers, each containing four .u8 / .s8 elements from matrix B.

    b0, b1, b2, b3, …, b15

    .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1

    A vector expression containing four .b32 registers, each containing four .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 elements from matrix B.

    The layout of the fragments held by different threads is shown in Figure 132, Figure 133, Figure 134 and Figure 135.

    _images/sparse-mma-16864-u8s8-B1.png

    Figure 132 Sparse MMA .m16n8k64 fragment layout for rows 0–15 of matrix B with .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

    _images/sparse-mma-16864-u8s8-B2.png

    Figure 133 Sparse MMA .m16n8k64 fragment layout for rows 16–31 of matrix B with .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

    _images/sparse-mma-16864-u8s8-B3.png

    Figure 134 Sparse MMA .m16n8k64 fragment layout for rows 32–47 of matrix B with .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

    _images/sparse-mma-16864-u8s8-B4.png

    Figure 135 Sparse MMA .m16n8k64 fragment layout for rows 48–63 of matrix B with .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

  • Matrix fragments for accumulators C and D are the same as in case of Matrix Fragments for mma.m16n8k16 with integer type.

  • Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of two non-zero elements from a 4-wide chunk of matrix A as shown in Figure 136 and Figure 137.

    _images/sparse-mma-metadata-16864-u8s8-first32col.png

    Figure 136 Sparse MMA .m16n8k64 metadata layout for columns 0–31 for .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

    _images/sparse-mma-metadata-16864-u8s8-last32col.png

    Figure 137 Sparse MMA .m16n8k64 metadata layout for columns 32–63 for .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

9.7.14.6.2.7. Matrix Fragments for sparse mma.m16n8k64 with .u4 / .s4 integer type

A warp executing sparse mma.m16n8k64 with .u4 / .s4 integer type will compute an MMA operation of shape .m16n8k64.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements

    .u4 / .s4

    A vector expression containing two .b32 registers, with each register containing eight non-zero .u4 / .s4 elements out of 16 consecutive elements from matrix A.

    Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 138.

    _images/sparse-mma-16864-u4s4-A.png

    Figure 138 Sparse MMA .m16n8k64 fragment layout for matrix A with .u4/.s4 type.

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where 0 <= i < 8
           groupID + 8    otherwise

    col = [firstcol ... lastcol]  // As per the mapping of non-zero elements
                                  // as described in Sparse matrix storage

    Where firstcol = threadID_in_group * 16
          lastcol  = firstcol + 15
  • Matrix fragments for multiplicand B and accumulators C and D are the same as in case of Matrix Fragments for mma.m16n8k64.

  • Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of four non-zero elements from an 8-wide chunk of matrix A as shown in Figure 139.

    _images/sparse-mma-metadata-16864-u4s4.png

    Figure 139 Sparse MMA .m16n8k64 metadata layout for .u4/.s4 type.

9.7.14.6.2.8. Matrix Fragments for sparse mma.m16n8k128 with .u4 / .s4 / .e2m1 type

A warp executing sparse mma.m16n8k128 with .u4 / .s4 / .e2m1 type will compute an MMA operation of shape .m16n8k128.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

  • Multiplicand A:

    .atype

    Fragment

    Elements

    .u4 / .s4

    A vector expression containing four .b32 registers, with each register containing eight non-zero .u4 / .s4 elements out of 16 consecutive elements from matrix A.

    Mapping of the non-zero elements is as described in Sparse matrix storage.

    .e2m1

    A vector expression containing four .b32 registers, with each register containing eight non-zero .e2m1 elements out of 16 consecutive elements from matrix A.

    The layout of the fragments held by different threads is shown in Figure 140 and Figure 141.

    _images/sparse-mma-168128-u4s4-A-first64col.png

    Figure 140 Sparse MMA .m16n8k128 fragment layout for columns 0–63 of matrix A with .u4/.s4/.e2m1 type.

    _images/sparse-mma-168128-u4s4-A-last64col.png

    Figure 141 Sparse MMA .m16n8k128 fragment layout for columns 64–127 of matrix A with .u4/.s4/.e2m1 type.

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row =  groupID        for ai where 0 <= i < 8 || 16 <= i < 24
           groupID + 8    otherwise

    col = [firstcol ... lastcol]  // As per the mapping of non-zero elements
                                  // as described in Sparse matrix storage

    Where firstcol = threadID_in_group * 16         for ai where i < 16
                     (threadID_in_group * 16) + 64  for ai where i >= 16
          lastcol  = firstcol + 15
  • Multiplicand B:

    .btype

    Fragment

    Elements (low to high)

    .u4 / .s4

    A vector expression containing four .b32 registers, each containing eight .u4 / .s4 elements from matrix B.

    b0, b1, b2, b3, …, b31

    .e2m1

    A vector expression containing four .b32 registers, each containing eight .e2m1 elements from matrix B.

    The layout of the fragments held by different threads is shown in Figure 142, Figure 143, Figure 144 and Figure 145.

    _images/sparse-mma-168128-u4s4-B1.png

    Figure 142 Sparse MMA .m16n8k128 fragment layout for rows 0–31 of matrix B with .u4/.s4/.e2m1 type.

    _images/sparse-mma-168128-u4s4-B2.png

    Figure 143 Sparse MMA .m16n8k128 fragment layout for rows 32–63 of matrix B with .u4/.s4/.e2m1 type.

    _images/sparse-mma-168128-u4s4-B3.png

    Figure 144 Sparse MMA .m16n8k128 fragment layout for rows 64–95 of matrix B with .u4/.s4/.e2m1 type.

    _images/sparse-mma-168128-u4s4-B4.png

    Figure 145 Sparse MMA .m16n8k128 fragment layout for rows 96–127 of matrix B with .u4/.s4/.e2m1 type.

  • Matrix fragments for accumulators C and D are the same as in case of Matrix Fragments for mma.m16n8k64.

  • Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of four non-zero elements from an 8-wide chunk of matrix A as shown in Figure 146 and Figure 147.

    _images/sparse-mma-metadata-168128-u4s4-first64col.png

    Figure 146 Sparse MMA .m16n8k128 metadata layout for columns 0–63 for .u4/.s4/.e2m1 type.

    _images/sparse-mma-metadata-168128-u4s4-last64col.png

    Figure 147 Sparse MMA .m16n8k128 metadata layout for columns 64–127 for .u4/.s4/.e2m1 type.

9.7.14.6.3. Multiply-and-Accumulate Instruction: mma.sp / mma.sp::ordered_metadata

mma.sp, mma.sp::ordered_metadata

Perform matrix multiply-and-accumulate operation with sparse matrix A

Syntax

Half precision floating point type:

mma.spvariant.sync.aligned.m16n8k16.row.col.dtype.f16.f16.ctype  d, a, b, c, e, f;
mma.spvariant.sync.aligned.m16n8k32.row.col.dtype.f16.f16.ctype  d, a, b, c, e, f;

.ctype     = {.f16, .f32};
.dtype     = {.f16, .f32};
.spvariant = {.sp, .sp::ordered_metadata};

Alternate floating point type:

mma.spvariant.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32     d, a, b, c, e, f;
mma.spvariant.sync.aligned.m16n8k32.row.col.f32.bf16.bf16.f32     d, a, b, c, e, f;
mma.spvariant.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32      d, a, b, c, e, f;
mma.spvariant.sync.aligned.m16n8k16.row.col.f32.tf32.tf32.f32     d, a, b, c, e, f;
mma.spvariant.sync.aligned.m16n8k64.row.col.f32.f8type.f8type.f32 d, a, b, c, e, f;
mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.kind.dtype.f8f6f4type.f8f6f4type.ctype d, a, b, c, e, f;

.f8type     = {.e4m3, .e5m2};
.spvariant  = {.sp, .sp::ordered_metadata};
.f8f6f4type = {.e4m3, .e5m2, .e3m2, .e2m3, .e2m1};
.kind       = {.kind::f8f6f4};
.ctype      = {.f16, .f32};
.dtype      = {.f16, .f32};

Alternate floating point type with block scaling:

mma.spvariant.sync.aligned.m16n8k128.row.col.kind.block_scale{.scale_vec_size}.f32.e2m1.e2m1.f32.stype d, a, b, c, e, f, scale-a-data, {byte-id-a, thread-id-a}, scale-b-data, {byte-id-b, thread-id-b};

.spvariant      = {.sp::ordered_metadata};
.kind           = {.kind::mxf4};
.scale_vec_size = {.scale_vec::2X};
.stype          = {.ue8m0};

mma.spvariant.sync.aligned.m16n8k128.row.col.kind.block_scale.scale_vec_size.f32.e2m1.e2m1.f32.stype d, a, b, c, e, f, scale-a-data, {byte-id-a, thread-id-a}, scale-b-data, {byte-id-b, thread-id-b};

.spvariant      = {.sp::ordered_metadata};
.kind           = {.kind::mxf4nvf4};
.scale_vec_size = {.scale_vec::2X, .scale_vec::4X};
.stype          = {.ue8m0, .ue4m3};

mma.spvariant.sync.aligned.m16n8k64.row.col.kind.block_scale{.scale_vec_size}.f32.f8f6f4type.f8f6f4type.f32.stype d, a, b, c, e, f, scale-a-data, {byte-id-a, thread-id-a}, scale-b-data, {byte-id-b, thread-id-b};

.spvariant      = {.sp::ordered_metadata};
.kind           = {.kind::mxf8f6f4};
.scale_vec_size = {.scale_vec::1X};
.f8f6f4type     = {.e4m3, .e5m2, .e3m2, .e2m3, .e2m1};
.stype          = {.ue8m0};

Integer type:

mma.spvariant.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c, e, f;

.shape     = {.m16n8k32, .m16n8k64};
.atype     = {.u8, .s8};
.btype     = {.u8, .s8};
.spvariant = {.sp, .sp::ordered_metadata};

mma.spvariant.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c, e, f;

.shape     = {.m16n8k64, .m16n8k128};
.atype     = {.u4, .s4};
.btype     = {.u4, .s4};
.spvariant = {.sp, .sp::ordered_metadata};

Description

Perform an MxNxK matrix multiply and accumulate operation, D = A*B+C, where the A matrix is MxK, the B matrix is KxN, and the C and D matrices are MxN.

A warp executing an mma.sp.sync/mma.sp::ordered_metadata.sync instruction computes a single matrix multiply and accumulate operation.

The qualifier .block_scale specifies that the matrices A and B are scaled with the scale_A and scale_B matrices respectively before performing the matrix multiply and accumulate operation as specified in the section Block Scaling. The data type corresponding to each of the elements within the scale_A and scale_B matrices is specified by .stype. The qualifier .scale_vec_size specifies the number of columns of the scale_A matrix and the number of rows in the matrix scale_B.

The valid combinations of .kind, .stype and .scale_vec_size are described in Table 36. For mma with .kind::mxf4, when the qualifier .scale_vec_size is not specified, it defaults to 2X. In contrast, when .kind is specified as .kind::mxf8f6f4, the qualifier .scale_vec_size defaults to 1X. However, for .kind::mxf4nvf4, it is mandatory to provide a valid .scale_vec_size.

Operands a and b represent the two multiplicand matrices A and B, while c and d represent the accumulator and destination matrices, distributed across the threads in the warp. Matrix A is structured sparse as described in Sparse matrix storage. Operands e and f represent the sparsity metadata and the sparsity selector respectively. Operand e is a 32-bit integer and operand f is a 32-bit integer constant with values in the range 0..3. When the .block_scale qualifier is specified, operands scale-a-data and scale-b-data represent the scale matrix metadata corresponding to the scale_A and scale_B matrices respectively. The tuples {byte-id-a, thread-id-a} and {byte-id-b, thread-id-b} represent selectors for matrices scale_A and scale_B respectively from their corresponding metadata arguments scale-a-data and scale-b-data. The operands scale-a-data and scale-b-data are of type .b32. The operands byte-id-a, thread-id-a, byte-id-b and thread-id-b are unsigned 16-bit integer values. For more details on the selector arguments, refer to the Block Scaling section.

The instruction mma.sp::ordered_metadata requires the indices in the sparsity metadata to be sorted in increasing order starting from the LSB; otherwise, behavior is undefined.
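The ordering requirement can be expressed as a simple check (an illustrative Python sketch, not a PTX API; it assumes the 2-bit index fields of the .f16/.bf16 shapes, grouped two indices per chunk):

```python
def metadata_indices_ordered(meta, fields=16, per_chunk=2):
    """Return True if, within each chunk's group of 2-bit index fields,
    the indices are strictly increasing starting from the LSB — the
    condition mma.sp::ordered_metadata imposes on operand e."""
    vals = [(meta >> (2 * i)) & 0b11 for i in range(fields)]
    return all(
        vals[base + j] < vals[base + j + 1]
        for base in range(0, fields, per_chunk)
        for j in range(per_chunk - 1)
    )

ordered = metadata_indices_ordered(0b1000, fields=2)     # indices 0, 2
unordered = metadata_indices_ordered(0b0010, fields=2)   # indices 2, 0
```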

The registers in each thread hold a fragment of the matrix as described in Matrix fragments for multiply-accumulate operation with sparse matrix A.

The qualifiers .dtype, .atype, .btype and .ctype indicate the data type of the elements in the matrices D, A, B and C respectively. The qualifier .stype indicates the data type of the elements in the matrices scale_A and scale_B. In case of shapes .m16n8k16 and .m16n8k32, .dtype must be the same as .ctype.

When .kind is either of .kind::mxf8f6f4 or .kind::f8f6f4, the individual 4-bit and 6-bit floating point type elements must be packed in an 8-bit container. A matrix element of type .e2m1 resides in the central 4 bits of the 8-bit container, with padding in the upper 2 bits and lower 2 bits of the container. When the matrix element is of type .e3m2 or .e2m3, the matrix element resides in the lower 6 bits of the 8-bit container, with padding in the upper 2 bits of the container. In contrast, note that when using mma with .kind::mxf4 or .kind::mxf4nvf4, no explicit padding is necessary even though matrix elements are of type .e2m1.
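The container layout can be illustrated with a small packing helper (hypothetical Python, not part of PTX; the bit positions follow the description above):

```python
def pack_into_container(value, elem_type):
    """Place a sub-byte element into its 8-bit container for
    .kind::f8f6f4 / .kind::mxf8f6f4: a 4-bit .e2m1 value occupies the
    central bits [5:2]; 6-bit .e3m2/.e2m3 values occupy bits [5:0].
    All padding bits are zero."""
    if elem_type == ".e2m1":
        assert 0 <= value < 16
        return value << 2          # pad 2 bits below and 2 bits above
    if elem_type in (".e3m2", ".e2m3"):
        assert 0 <= value < 64
        return value               # pad the upper 2 bits
    raise ValueError("8-bit types occupy the full container")

e2m1_byte = pack_into_container(0b1011, ".e2m1")
e3m2_byte = pack_into_container(0b101011, ".e3m2")
```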

Precision and rounding :
  • .f16 floating point operations :

    Element-wise multiplication of matrices A and B is performed with at least single precision. When .ctype or .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When both .ctype and .dtype are specified as .f16, the accumulation is performed with at least half precision.

    The accumulation order, rounding and handling of subnormal inputs are unspecified.

  • .e4m3,.e5m2,.e3m2,.e2m3,.e2m1 floating point operations :

    Element-wise multiplication of matrix A and B is performed with specified precision. Accumulationof the intermediate values is performed with at least single precision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • .bf16 and.tf32 floating point operations :

    Element-wise multiplication of matrix A and B is performed with specifiedprecision. Accumulation of the intermediate values is performed with at least singleprecision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • Integer operations :

    The integermma.sp/mma.sp::ordered_metadata operation is performed with.s32 accumulators.The.satfinite qualifier indicates that on overflow, the accumulated value is limited to the rangeMIN_INT32..MAX_INT32 (where the bounds are defined as the minimum negative signed 32-bitinteger and the maximum positive signed 32-bit integer respectively).

    If.satfinite is not specified, the accumulated value is wrapped instead.
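The two integer overflow behaviors can be contrasted with a minimal numeric model (this is not the hardware datapath; accumulate_s32 is an illustrative name):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def accumulate_s32(acc: int, addend: int, satfinite: bool) -> int:
    """Model .s32 accumulation: clamp to [MIN_INT32, MAX_INT32] when
    .satfinite is specified, otherwise wrap in two's complement."""
    total = acc + addend
    if satfinite:
        return max(INT32_MIN, min(INT32_MAX, total))
    # wraparound semantics (reduce modulo 2^32 into the signed range)
    return (total - INT32_MIN) % 2**32 + INT32_MIN

print(accumulate_s32(INT32_MAX, 1, satfinite=True))   # -> 2147483647
print(accumulate_s32(INT32_MAX, 1, satfinite=False))  # -> -2147483648
```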

The mandatory .sync qualifier indicates that the mma.sp/mma.sp::ordered_metadata instruction causes the executing thread to wait until all threads in the warp execute the same mma.sp/mma.sp::ordered_metadata instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same mma.sp/mma.sp::ordered_metadata instruction. In conditionally executed code, an mma.sp/mma.sp::ordered_metadata instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.

The behavior of the mma.sp/mma.sp::ordered_metadata instruction is undefined if all threads in the same warp do not use the same qualifiers, or if any thread in the warp has exited.

Notes

The mma.sp instruction may have substantially reduced performance on some target architectures. Hence, it is advised to use the mma.sp::ordered_metadata instruction instead.

PTX ISA Notes

Introduced in PTX ISA version 7.1.

Support for .e4m3 and .e5m2 alternate floating point type mma operation introduced in PTX ISA version 8.4.

mma.sp::ordered_metadata introduced in PTX ISA version 8.5.

Support for shape .m16n8k32 and .f16 dtype/ctype with .e4m3/.e5m2 alternate floating point type mma operation introduced in PTX ISA version 8.7.

Support for .e3m2, .e2m3, .e2m1 alternate floating point type mma operation introduced in PTX ISA version 8.7.

Support for .kind, .block_scale, .scale_vec_size qualifiers introduced in PTX ISA version 8.7.

Target ISA Notes

Requires sm_80 or higher.

.e4m3 and .e5m2 alternate floating point type mma operation requires sm_89 or higher.

mma.sp::ordered_metadata requires sm_80 or higher.

Support for shape .m16n8k32 and .f16 dtype/ctype with .e4m3/.e5m2 alternate floating point type mma operation requires sm_120.

.e3m2, .e2m3 and .e2m1 alternate floating point type mma operations require sm_120a, and are supported on sm_120f or higher in the same family from PTX ISA version 8.8.

Support for the .kind, .block_scale, .scale_vec_size qualifiers requires sm_120a, and they are supported on sm_120f and later generation targets in the same family from PTX ISA version 8.8, except for .kind::mxf4nvf4/.kind::mxf4.

Qualifiers .kind::mxf4nvf4 and .kind::mxf4 are supported on the following architectures:

  • sm_120a

  • sm_121a

Examples of half precision floating point type

// f16 elements in C and D matrix
.reg .f16x2 %Ra<2>, %Rb<2>, %Rc<2>, %Rd<2>;
.reg .b32 %Re;
mma.sp.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
  {%Rd0, %Rd1},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1}, %Re, 0x1;

.reg .f16x2 %Ra<2>, %Rb<2>, %Rc<2>, %Rd<2>;
.reg .b32 %Re;
mma.sp::ordered_metadata.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
  {%Rd0, %Rd1},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1}, %Re, 0x1;

Examples of alternate floating point type

.reg .b32 %Ra<2>, %Rb<2>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 %Re;
mma.sp.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x1;

.reg .b32 %Ra<2>, %Rb<2>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 %Re;
mma.sp.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x1;

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 %Re;
mma.sp.sync.aligned.m16n8k32.row.col.f32.bf16.bf16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x1;

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 %Re;
mma.sp.sync.aligned.m16n8k64.row.col.f32.e5m2.e4m3.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0;

.reg .b32 %Ra<2>, %Rb<2>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 %Re;
mma.sp::ordered_metadata.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x1;

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 %Re;
mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.kind::f8f6f4.f32.e3m2.e2m3.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0;

.reg .b32 %Ra<4>, %Rb<4>;
.reg .b32 %Rc<4>, %Rd<4>;
.reg .b32 %Re;
mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.kind::f8f6f4.f16.e2m3.e2m1.f16
  {%Rd0, %Rd1},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1}, %Re, 0;

Examples of integer type

.reg .b32 %Ra<4>, %Rb<4>, %Rc<4>, %Rd<4>;
.reg .u32 %Re;

// u8 elements in A and B matrix
mma.sp.sync.aligned.m16n8k32.row.col.satfinite.s32.u8.u8.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x1;

// s8 elements in A and B matrix
mma.sp.sync.aligned.m16n8k64.row.col.satfinite.s32.s8.s8.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x0;

// s8 elements in A and B matrix with ordered metadata
mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.satfinite.s32.s8.s8.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x0;

// s4 elements in A and B matrix
mma.sp.sync.aligned.m16n8k64.row.col.s32.s4.s4.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x1;

// u4 elements in A and B matrix
mma.sp.sync.aligned.m16n8k128.row.col.satfinite.s32.u4.u4.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x0;

Examples of mma with block scale

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 scaleAData, scaleBData;
.reg .b32 %Re;
mma.sp::ordered_metadata.sync.aligned.m16n8k128.row.col.kind::mxf4.block_scale.f32.e2m1.e2m1.f32.ue8m0
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  %Re, 0,
  scaleAData, {2, 1}, scaleBData, {2, 3};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 scaleAData, scaleBData;
.reg .u16 bidA, bidB, tidA, tidB;
.reg .b32 %Re;
mma.sp::ordered_metadata.sync.aligned.m16n8k128.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  %Re, 0,
  scaleAData, {bidA, tidA}, scaleBData, {bidB, tidB};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 scaleAData, scaleBData;
.reg .b32 %Re;
mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.kind::mxf8f6f4.block_scale.scale_vec::1X.f32.e3m2.e2m1.f32.ue8m0
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  %Re, 0,
  scaleAData, {0, 1}, scaleBData, {0, 1};

.reg .b32 %Ra<4>, %Rb<4>;
.reg .f32 %Rc<4>, %Rd<4>;
.reg .b32 scaleAData, scaleBData;
.reg .b32 %Re;
mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.kind::mxf8f6f4.block_scale.scale_vec::1X.f32.e4m3.e5m2.f32.ue8m0
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1, %Rb2, %Rb3},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  %Re, 0,
  scaleAData, {0, 1}, scaleBData, {0, 0};

9.7.15. Asynchronous Warpgroup Level Matrix Multiply-Accumulate Instructions

The warpgroup level matrix multiply and accumulate operation has either of the following forms, where matrix D is called the accumulator:

  • D = A*B + D

  • D = A*B, where the input from accumulator D is disabled.

The wgmma instructions perform warpgroup level matrix multiply-and-accumulate operation by having all threads in a warpgroup collectively perform the following actions:

  1. Load matrices A, B and D into registers or into shared memory.

  2. Perform the following fence operations:

    • wgmma.fence operations to indicate that the registers/shared-memory across the warpgroup have been written into.

    • fence.proxy.async operation to make the generic proxy operations visible to the async proxy.

  3. Issue the asynchronous matrix multiply and accumulate operations using the wgmma.mma_async operation on the input matrices. The wgmma.mma_async operation is performed in the async proxy.

  4. Create a wgmma-group and commit all the prior outstanding wgmma.mma_async operations into the group, by using the wgmma.commit_group operation.

  5. Wait for the completion of the required wgmma-group.

  6. Once the wgmma-group completes, all the wgmma.mma_async operations have been performed and completed.

9.7.15.1. Warpgroup

A warpgroup is a set of four contiguous warps such that the warp-rank of the first warp is a multiple of 4.

The warp-rank of a warp is defined as:

(%tid.x + %tid.y * %ntid.x  + %tid.z * %ntid.x * %ntid.y) / 32
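The warp-rank and warpgroup derivation can be sketched in Python (illustrative helpers mirroring %tid and %ntid; these function names are not a PTX or CUDA API):

```python
def warp_rank(tid, ntid):
    """Linear thread id within the CTA divided by the warp size (32).
    tid and ntid are (x, y, z) tuples mirroring %tid and %ntid."""
    linear = tid[0] + tid[1] * ntid[0] + tid[2] * ntid[0] * ntid[1]
    return linear // 32

def warpgroup_of(tid, ntid):
    """A warpgroup is four contiguous warps whose first warp has a
    warp-rank that is a multiple of 4."""
    return warp_rank(tid, ntid) // 4

# In a 256x1x1 CTA, thread 130 is in warp 4, i.e. warpgroup 1.
print(warp_rank((130, 0, 0), (256, 1, 1)))     # -> 4
print(warpgroup_of((130, 0, 0), (256, 1, 1)))  # -> 1
```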

9.7.15.2. Matrix Shape

The matrix multiply and accumulate operations support a limited set of shapes for the operand matrices A, B and D. The shapes of all three matrix operands are collectively described by the tuple MxNxK, where A is an MxK matrix, B is a KxN matrix, and D is an MxN matrix.

The following matrix shapes are supported for the specified types for the wgmma.mma_async operation:

Multiplicand Data type

Sparsity

Shape

Floating-point - .f16

Dense

.m64n8k16, .m64n16k16, .m64n24k16, .m64n32k16, .m64n40k16, .m64n48k16, .m64n56k16, .m64n64k16, .m64n72k16, .m64n80k16, .m64n88k16, .m64n96k16, .m64n104k16, .m64n112k16, .m64n120k16, .m64n128k16, .m64n136k16, .m64n144k16, .m64n152k16, .m64n160k16, .m64n168k16, .m64n176k16, .m64n184k16, .m64n192k16, .m64n200k16, .m64n208k16, .m64n216k16, .m64n224k16, .m64n232k16, .m64n240k16, .m64n248k16, .m64n256k16

Alternate floating-point format - .bf16

Alternate floating-point format - .tf32

Sparse

Alternate floating-point format - .tf32

Dense

.m64n8k8, .m64n16k8, .m64n24k8, .m64n32k8, .m64n40k8, .m64n48k8, .m64n56k8, .m64n64k8, .m64n72k8, .m64n80k8, .m64n88k8, .m64n96k8, .m64n104k8, .m64n112k8, .m64n120k8, .m64n128k8, .m64n136k8, .m64n144k8, .m64n152k8, .m64n160k8, .m64n168k8, .m64n176k8, .m64n184k8, .m64n192k8, .m64n200k8, .m64n208k8, .m64n216k8, .m64n224k8, .m64n232k8, .m64n240k8, .m64n248k8, .m64n256k8

Alternate floating-point format - .e4m3/.e5m2

Dense

.m64n8k32, .m64n16k32, .m64n24k32, .m64n32k32, .m64n40k32, .m64n48k32, .m64n56k32, .m64n64k32, .m64n72k32, .m64n80k32, .m64n88k32, .m64n96k32, .m64n104k32, .m64n112k32, .m64n120k32, .m64n128k32, .m64n136k32, .m64n144k32, .m64n152k32, .m64n160k32, .m64n168k32, .m64n176k32, .m64n184k32, .m64n192k32, .m64n200k32, .m64n208k32, .m64n216k32, .m64n224k32, .m64n232k32, .m64n240k32, .m64n248k32, .m64n256k32

Floating point - .f16

Sparse

Alternate floating-point format - .bf16

Integer - .u8/.s8

Dense

.m64n8k32, .m64n16k32, .m64n24k32, .m64n32k32, .m64n48k32, .m64n64k32, .m64n80k32, .m64n96k32, .m64n112k32, .m64n128k32, .m64n144k32, .m64n160k32, .m64n176k32, .m64n192k32, .m64n208k32, .m64n224k32, .m64n240k32, .m64n256k32

Alternate floating-point format - .e4m3/.e5m2

Sparse

.m64n8k64, .m64n16k64, .m64n24k64, .m64n32k64, .m64n40k64, .m64n48k64, .m64n56k64, .m64n64k64, .m64n72k64, .m64n80k64, .m64n88k64, .m64n96k64, .m64n104k64, .m64n112k64, .m64n120k64, .m64n128k64, .m64n136k64, .m64n144k64, .m64n152k64, .m64n160k64, .m64n168k64, .m64n176k64, .m64n184k64, .m64n192k64, .m64n200k64, .m64n208k64, .m64n216k64, .m64n224k64, .m64n232k64, .m64n240k64, .m64n248k64, .m64n256k64

Integer - .u8/.s8

Sparse

.m64n8k64, .m64n16k64, .m64n24k64, .m64n32k64, .m64n48k64, .m64n64k64, .m64n80k64, .m64n96k64, .m64n112k64, .m64n128k64, .m64n144k64, .m64n160k64, .m64n176k64, .m64n192k64, .m64n208k64, .m64n224k64, .m64n240k64, .m64n256k64

Single-bit - .b1

Dense

.m64n8k256, .m64n16k256, .m64n24k256, .m64n32k256, .m64n48k256, .m64n64k256, .m64n80k256, .m64n96k256, .m64n112k256, .m64n128k256, .m64n144k256, .m64n160k256, .m64n176k256, .m64n192k256, .m64n208k256, .m64n224k256, .m64n240k256, .m64n256k256

9.7.15.3. Matrix Data-types

The matrix multiply and accumulate operation is supported separately on integer, floating-point, sub-byte integer and single-bit data-types. All operands must contain the same basic type kind, i.e., integer or floating-point.

For floating-point matrix multiply and accumulate operations, different matrix operands may have different precision, as described later.

For integer matrix multiply and accumulate operations, both multiplicand matrices (A and B) must have elements of the same data-type, e.g. both signed integer or both unsigned integer.

    Data-type                  Multiplicands (A or B)   Accumulator (D)
    Integer                    both .u8 or both .s8     .s32
    Floating Point             .f16                     .f16, .f32
    Alternate floating Point   .bf16                    .f32
    Alternate floating Point   .tf32                    .f32
    Alternate floating Point   .e4m3, .e5m2             .f16, .f32
    Single-bit integer         .b1                      .s32

9.7.15.4. Async Proxy

The wgmma.mma_async operations are performed in the asynchronous proxy (or async proxy).

Accessing the same memory location across multiple proxies needs a cross-proxy fence. For the async proxy, fence.proxy.async should be used to synchronize memory between the generic proxy and the async proxy.

The completion of a wgmma.mma_async operation is followed by an implicit generic-async proxy fence. So the result of the asynchronous operation is made visible to the generic proxy as soon as its completion is observed. wgmma.commit_group and wgmma.wait_group operations must be used to wait for the completion of the wgmma.mma_async instructions.

9.7.15.5. Asynchronous Warpgroup Level Matrix Multiply-Accumulate Operation using wgmma.mma_async instruction

This section describes the warpgroup level wgmma.mma_async instruction and the organization of the various matrices involved in this instruction.

9.7.15.5.1. Register Fragments and Shared Memory Matrix Layouts

The input matrix A of the warpgroup wide MMA operations can be either in registers or in the shared memory. The input matrix B of the warpgroup wide MMA operations must be in the shared memory. This section describes the layouts of register fragments and shared memory expected by the warpgroup MMA instructions.

When the matrices are in shared memory, their starting addresses must be aligned to 16 bytes.

9.7.15.5.1.1. Register Fragments

This section describes the organization of the various matrices located in register operands of the wgmma.mma_async instruction.

9.7.15.5.1.1.1. Matrix Fragments for wgmma.mma_async.m64nNk16

A warpgroup executing wgmma.mma_async.m64nNk16 will compute an MMA operation of shape .m64nNk16 where N is a valid n dimension as listed in Matrix Shape.

Elements of the matrix are distributed across the threads in a warpgroup so that each thread of the warpgroup holds a fragment of the matrix.

  • Multiplicand A in registers:

    .atype

    Fragment

    Elements (low to high)

    .f16/.bf16

    A vector expression containing four .f16x2 registers, with each register containing two .f16/.bf16 elements from matrix A.

    a0, a1, a2, a3, a4, a5, a6, a7

    The layout of the fragments held by different threads is shown in Figure 148.

    _images/wgmma-64N16-A.png

    Figure 148 WGMMA .m64nNk16 register fragment layout for matrix A.

  • Accumulator D:

    .dtype

    Fragment

    Elements (low to high)

    .f16

    A vector expression containing N/4 number of .f16x2 registers, with each register containing two .f16 elements from matrix D.

    d0, d1, d2, d3, …, dX, dY, dZ, dW

    where X = N/2 - 4

    Y = N/2 - 3

    Z = N/2 - 2

    W = N/2 - 1

    N = 8*i where i = {1, 2, ..., 32}

    .f32

    A vector expression containing N/2 number of .f32 registers.

    The layout of the fragments held by different threads is shown in Figure 149.

    _images/wgmma-64N16-D.png

    Figure 149 WGMMA .m64nNk16 register fragment layout for accumulator matrix D.

9.7.15.5.1.1.2. Matrix Fragments for wgmma.mma_async.m64nNk8

A warpgroup executing wgmma.mma_async.m64nNk8 will compute an MMA operation of shape .m64nNk8 where N is a valid n dimension as listed in Matrix Shape.

Elements of the matrix are distributed across the threads in a warpgroup so that each thread of the warpgroup holds a fragment of the matrix.

  • Multiplicand A in registers:

    .atype

    Fragment

    Elements (low to high)

    .tf32

    A vector expression containing four .b32 registers, containing four .tf32 elements from matrix A.

    a0, a1, a2, a3

    The layout of the fragments held by different threads is shown in Figure 150.

    _images/wgmma-64N8-A.png

    Figure 150 WGMMA .m64nNk8 register fragment layout for matrix A.

  • Accumulator D:

    .dtype

    Fragment

    Elements (low to high)

    .f32

    A vector expression containing N/2 number of .f32 registers.

    d0, d1, d2, d3, …, dX, dY, dZ, dW

    where X = N/2 - 4

    Y = N/2 - 3

    Z = N/2 - 2

    W = N/2 - 1

    N = 8*i where i = {1, 2, ..., 32}

    The layout of the fragments held by different threads is shown in Figure 151.

    _images/wgmma-64N8-D.png

    Figure 151 WGMMA .m64nNk8 register fragment layout for accumulator matrix D.

9.7.15.5.1.1.3. Matrix Fragments for wgmma.mma_async.m64nNk32

A warpgroup executing wgmma.mma_async.m64nNk32 will compute an MMA operation of shape .m64nNk32 where N is a valid n dimension as listed in Matrix Shape.

Elements of the matrix are distributed across the threads in a warpgroup so that each thread of the warpgroup holds a fragment of the matrix.

  • Multiplicand A in registers:

    .atype

    Fragment

    Elements (low to high)

    .s8/.u8

    A vector expression containing four .b32 registers, with each register containing four .u8/.s8 elements from matrix A.

    a0, a1, a2, a3, …, a14, a15

    .e4m3/.e5m2

    A vector expression containing four .b32 registers, with each register containing four .e4m3/.e5m2 elements from matrix A.

    The layout of the fragments held by different threads is shown in Figure 152.

    _images/wgmma-64N32-A.png

    Figure 152 WGMMA .m64nNk32 register fragment layout for matrix A.

  • Accumulator D:

    .dtype

    Fragment

    Elements (low to high)

    Miscellaneous Information

    .s32

    A vector expression containing N/2 number of .s32 registers.

    d0, d1, d2, d3, …, dX, dY, dZ, dW

    where X = N/2 - 4

    Y = N/2 - 3

    Z = N/2 - 2

    W = N/2 - 1

    N depends on .dtype, as described in the next column.

    N = 8*i where i = {1, 2, 3, 4}

    or N = 16*i where i = {3, 4, ..., 15, 16}

    .f32

    A vector expression containing N/2 number of .f32 registers.

    N = 8*i where i = {1, 2, ..., 32}

    .f16

    A vector expression containing N/4 number of .f16x2 registers, with each register containing two .f16 elements from matrix D.

    The layout of the fragments held by different threads is shown in Figure 153.

    _images/wgmma-64N32-D.png

    Figure 153 WGMMA .m64nNk32 register fragment layout for accumulator matrix D.

9.7.15.5.1.1.4. Matrix Fragments for wgmma.mma_async.m64nNk256

A warpgroup executing wgmma.mma_async.m64nNk256 will compute an MMA operation of shape .m64nNk256 where N is a valid n dimension as listed in Matrix Shape.

Elements of the matrix are distributed across the threads in a warpgroup so that each thread of the warpgroup holds a fragment of the matrix.

  • Multiplicand A in registers:

    .atype

    Fragment

    Elements (low to high)

    .b1

    A vector expression containing four .b32 registers, with each register containing thirty-two .b1 elements from matrix A.

    a0, a1, a2, …, a127

    The layout of the fragments held by different threads is shown in Figure 154.

    _images/wgmma-64N256-A.png

    Figure 154 WGMMA .m64nNk256 register fragment layout for matrix A.

  • Accumulator D:

    .dtype

    Fragment

    Elements (low to high)

    .s32

    A vector expression containing N/2 number of .s32 registers.

    d0, d1, d2, d3, …, dX, dY, dZ, dW

    where X = N/2 - 4

    Y = N/2 - 3

    Z = N/2 - 2

    W = N/2 - 1

    N = 8*i where i = {1, 2, 3, 4}

    or N = 16*i where i = {3, 4, ..., 15, 16}

    The layout of the fragments held by different threads is shown in Figure 155.

    _images/wgmma-64N256-D.png

    Figure 155 WGMMA .m64nNk256 register fragment layout for accumulator matrix D.

9.7.15.5.1.2. Shared Memory Matrix Layout

If the argument imm-trans-a / imm-trans-b of the instruction wgmma.mma_async{.sp} is 0, then K-major is used for matrix A / B respectively. If the value of the argument imm-trans-a is 1, then M-major is used for matrix A. If the value of the argument imm-trans-b is 1, then N-major is used for matrix B.

In a column-major default BLAS library such as cuBLAS, the matrices A and B with and without transpose can be classified as either K-Major or M-or-N-Major as shown in the following table:

        Non-Transposed   Transposed
    A   K-major          M-major
    B   K-major          N-major
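The classification above can be expressed as a small lookup; a sketch with an illustrative function name (not part of PTX or cuBLAS):

```python
def majorness(matrix: str, transposed: bool) -> str:
    """Major-ness of wgmma shared-memory operands under a column-major
    (BLAS-style) convention: non-transposed A and B are K-major."""
    if not transposed:
        return "K-major"
    return "M-major" if matrix == "A" else "N-major"

print(majorness("A", True))   # -> M-major
print(majorness("B", False))  # -> K-major
```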

To avoid confusion with A, B, row-major, col-major, transpose, and non-transpose, we will use MN-Major and K-Major throughout this section.

The matrices in the shared memory are made up of one or more “swizzle layout atoms”. The exact layout of these swizzle atoms depends on the swizzling mode, swizzle-atomicity, and the leading dimension. The layouts of the swizzle atoms are shown in Table 38.

Table 38 Various combinations of swizzling mode, leading dimension and swizzle-atom layout

    Swizzling mode        Leading Dimension / Major-ness   Swizzle atom layout (128b element)
    128B Swizzling Mode   M/N                              8x8
    128B Swizzling Mode   K                                8x8
    64B Swizzling Mode    M/N                              4x8
    64B Swizzling Mode    K                                8x4
    32B Swizzling Mode    M/N                              2x8
    32B Swizzling Mode    K                                8x2
    None                  M/N                              1x8
    None                  K                                8x1

The above shapes are for elements of size 128 bits. For smaller element sizes, the same shapes get multiplied along the leading dimension by a factor of 128/sizeof_bits(Element). For example, the 128B MN major swizzle atom has a shape of (8*(128/32))x8 = 32x8 for tf32 tensor core inputs.
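The scaling rule can be sketched as follows (illustrative helper, assuming the base atom shape is given in 128-bit elements):

```python
def swizzle_atom_shape(base_rows: int, base_cols: int, element_bits: int):
    """Scale a swizzle atom shape, given in 128-bit elements, to a
    smaller element type by widening the leading dimension by
    128/element_bits."""
    t = 128 // element_bits
    return (base_rows * t, base_cols)

# 128B MN-major atom (8x8 in 128b units) for tf32 (32-bit) inputs:
print(swizzle_atom_shape(8, 8, 32))  # -> (32, 8)
```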

Examples

The following are some example layouts of MxK or KxN matrices with various swizzling modes, in units of 128b elements (one per colored cell), as shown in Figure 156, Figure 157, Figure 158, Figure 159, Figure 160, Figure 161, Figure 162, Figure 163.

_images/async-warpgroup-smem-layout-128B-mn.png

Figure 156 MN major 128B swizzling

_images/async-warpgroup-smem-layout-128B-k.png

Figure 157 K major 128B swizzling

_images/async-warpgroup-smem-layout-64B-mn.png

Figure 158 MN major 64B swizzling

_images/async-warpgroup-smem-layout-64B-k.png

Figure 159 K major 64B swizzling

_images/async-warpgroup-smem-layout-32B-mn.png

Figure 160 MN major 32B swizzling

_images/async-warpgroup-smem-layout-32B-k.png

Figure 161 K major 32B swizzling

_images/async-warpgroup-smem-layout-mn-interleaved.png

Figure 162 MN major interleaved

_images/async-warpgroup-smem-layout-k-interleaved.png

Figure 163 K major interleaved

The following are some examples of the 128B swizzling layout for the tf32 element type.

9.7.15.5.1.2.1. Major-ness supported by Strides

There are two strides involved while accessing a matrix from shared memory:

  1. Leading dimension byte offset

  2. Stride dimension byte offset

9.7.15.5.1.2.1.1. Leading Dimension Byte Offset

The leading dimension byte offset is defined differently for transposed and non-transposed matrices. The leading dimension byte offset is defined as follows for matrices whose element types are normalized to 128 bits:

Major-ness

Definition

K-Major

  • No-Swizzling: the offset from the first column to the second column of the 8x2 tile in the 128-bit element type normalized matrix.

  • Swizzled layouts: not used, assumed to be 1.

MN-Major

  • Interleave: offset from the first 8 columns to the next 8 columns.

  • Swizzled layouts: offset from the first (swizzle-byte-size/16) rows to the next (swizzle-byte-size/16) rows.

9.7.15.5.1.2.1.2. Stride Dimension Byte Offset

The stride dimension byte offset is defined differently for transposed and non-transposed matrices. The stride dimension byte offset is defined as follows for matrices whose element types are normalized to 128 bits:

Major-ness

Definition

K-Major

The offset from the first 8 rows to the next 8 rows.

MN-Major

  • Interleave: offset from the first row to the next row.

  • Swizzled layout: offset from the first 8 columns to the next 8 columns.

9.7.15.5.1.2.1.3. Canonical Layouts

In terms of CuTe layouts, the canonical layout can be expressed as follows:

    Major-ness   Swizzling mode                Canonical Layout without swizzling     Swizzling on the previous column
    MN-major     No-swizzling or Interleaved   ((T,1,m),(8,k)):((1,T,SBO),(1T,LBO))   Swizzle<0, 4, 3>
    MN-major     32B Swizzling                 ((T,2,m),(8,k)):((1,T,LBO),(2T,SBO))   Swizzle<1, 4, 3>
    MN-major     64B Swizzling                 ((T,4,m),(8,k)):((1,T,LBO),(4T,SBO))   Swizzle<2, 4, 3>
    MN-major     128B Swizzling                ((T,8,m),(8,k)):((1,T,LBO),(8T,SBO))   Swizzle<3, 4, 3>
    K-major      No-swizzling or Interleaved   ((8,m),(T,2k)):((1T,SBO),(1,LBO))      Swizzle<0, 4, 3>
    K-major      32B Swizzling                 ((8,m),(T,2k)):((2T,SBO),(1,T))        Swizzle<1, 4, 3>
    K-major      64B Swizzling                 ((8,m),(T,2k)):((4T,SBO),(1,T))        Swizzle<2, 4, 3>
    K-major      128B Swizzling                ((8,m),(T,2k)):((8T,SBO),(1,T))        Swizzle<3, 4, 3>

where

  • T = 128 / sizeof-elements-in-bits. T represents the scale factor which normalizes matrix element types to 128 bits.

  • m represents the number of repeating patterns across rows.

  • k represents the number of repeating patterns across columns.

Examples

  • K-Major, no-swizzling and tf32 type: Figure 166

    _images/async-warpgroup-k-no-swizzle-tf32.png

    Figure 166 K major, no-swizzling and tf32 type

    The strides and related details are as follows:

    Exact layout: Swizzle<0,4,3> o ((8,2),(4,4)):((4,32),(1,64))

    Canonical Layout: Swizzle<0,4,3> o ((8,m),(T,2k)):((1T,SBO),(1,LBO))

    Parameters                      Value
    T                               4
    m                               2
    k                               2
    LBO                             64*sizeof(tf32)
    SBO                             32*sizeof(tf32)
    Encoding of LBO in descriptor   (LBO) >> 4 = 16
    Encoding of SBO in descriptor   (SBO) >> 4 = 8

  • K-Major, 32B swizzling and tf32 type: Figure 167

    _images/async-warpgroup-k-32B-swizzle-tf32.png

    Figure 167 K major, 32B swizzling and tf32 type

    The strides and related details are as follows:

    Exact layout: Swizzle<1,4,3> o ((8,2),(4,4)):((8,64),(1,4))

    Canonical Layout: Swizzle<1,4,3> o ((8,m),(T,2k)):((2T,SBO),(1,T))

    Parameters                      Value
    T                               4
    m                               2
    k                               2
    LBO                             NA
    SBO                             64*sizeof(tf32)
    Encoding of LBO in descriptor   1 (assumed)
    Encoding of SBO in descriptor   (SBO) >> 4 = 16

  • MN-Major, no-swizzling and bf16 type: Figure 168

    _images/async-warpgroup-mn-no-swizzle-bf16.png

    Figure 168 MN major, no-swizzling and bf16 type

    The strides and related details are as follows:

    Exact layout: Swizzle<0,4,3> o ((8,1,2),(8,2)):((1,8,64),(8,128))

    Canonical Layout: Swizzle<0,4,3> o ((T,1,m),(8,k)):((1,T,SBO),(1T,LBO))

    Parameters                      Value
    T                               8
    m                               2
    k                               2
    LBO                             128*sizeof(bf16)
    SBO                             64*sizeof(bf16)
    Encoding of LBO in descriptor   (LBO) >> 4 = 16
    Encoding of SBO in descriptor   (SBO) >> 4 = 8

  • MN-Major, 32B swizzling and bf16 type: Figure 169

    _images/async-warpgroup-mn-32B-swizzle-bf16.png

    Figure 169 MN major, 32B swizzling and bf16 type

    The strides and related details are as follows:

    Exact layout: Swizzle<1,4,3> o ((8,2,2),(8,2)):((1,8,128),(16,256))

    Canonical Layout: Swizzle<1,4,3> o ((T,2,m),(8,k)):((1,T,LBO),(2T,SBO))

    Parameters                      Value
    T                               8
    m                               2
    k                               2
    LBO                             128*sizeof(bf16)
    SBO                             256*sizeof(bf16)
    Encoding of LBO in descriptor   (LBO) >> 4 = 16
    Encoding of SBO in descriptor   (SBO) >> 4 = 32

  • MN-Major, 64B swizzling and bf16 type: Figure 170

    _images/async-warpgroup-mn-64B-swizzle-bf16.png

    Figure 170 MN major, 64B swizzling and bf16 type

    The strides and related details are as follows:

    Exact layout: Swizzle<2,4,3> o ((8,4,2),(8,2)):((1,8,256),(32,512))

    Canonical Layout: Swizzle<2,4,3> o ((T,4,m),(8,k)):((1,T,LBO),(4T,SBO))

    Parameters                      Value
    T                               8
    m                               2
    k                               2
    LBO                             256*sizeof(bf16)
    SBO                             512*sizeof(bf16)
    Encoding of LBO in descriptor   (LBO) >> 4 = 32
    Encoding of SBO in descriptor   (SBO) >> 4 = 64
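The LBO/SBO encodings tabulated in the examples above follow the matrix-descriptor-encode formula defined in the Matrix Descriptor Format section; a sketch verifying a few of the tabulated values (encode_offset is an illustrative name):

```python
def encode_offset(byte_offset: int) -> int:
    """matrix-descriptor-encode: keep bits 17..4 of the byte offset,
    i.e. (x & 0x3FFFF) >> 4."""
    return (byte_offset & 0x3FFFF) >> 4

SIZEOF_TF32 = 4  # bytes
SIZEOF_BF16 = 2  # bytes

# K-Major, no-swizzling, tf32: LBO = 64 elements, SBO = 32 elements
print(encode_offset(64 * SIZEOF_TF32))   # -> 16
print(encode_offset(32 * SIZEOF_TF32))   # -> 8

# MN-Major, 64B swizzling, bf16: SBO = 512 elements
print(encode_offset(512 * SIZEOF_BF16))  # -> 64
```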

9.7.15.5.1.2.2. Matrix Descriptor Format

The matrix descriptor specifies the properties of the matrix in shared memory that is a multiplicand in the matrix multiply and accumulate operation. It is a 64-bit value contained in a register with the following layout:

    Bit-field   Size in bits   Description
    13–0        14             matrix-descriptor-encode(Matrix start address)
    29–16       14             matrix-descriptor-encode(Leading dimension byte offset)
    45–32       14             matrix-descriptor-encode(Stride dimension byte offset)
    51–49       3              Matrix base offset. This is valid for all swizzling modes except the no-swizzle mode.
    63–62       2              Swizzling mode: 0 = no swizzle, 1 = 128-Byte swizzle, 2 = 64-Byte swizzle, 3 = 32-Byte swizzle

where

matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4

The value of the base offset is 0 when the repeating pattern of the specified swizzling mode starts as per the below table:

    Swizzling mode     Starting address of the repeating pattern
    128-Byte swizzle   1024-Byte boundary
    64-Byte swizzle    512-Byte boundary
    32-Byte swizzle    256-Byte boundary

Otherwise, the base offset must be a non-zero value, computed using the following formula:

base offset = (pattern start addr >> 0x7) & 0x7
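Assembling the 64-bit descriptor from the bit-fields above can be sketched as follows (illustrative Python, not an NVIDIA API; field packing follows the layout table in this section):

```python
def base_offset(pattern_start_addr: int) -> int:
    """Matrix base offset field (bits 51..49): bits 9..7 of the address
    at which the swizzle repeating pattern starts."""
    return (pattern_start_addr >> 0x7) & 0x7

def matrix_descriptor(start_addr: int, lbo: int, sbo: int,
                      base_off: int, swizzle_mode: int) -> int:
    """Pack the 64-bit shared-memory matrix descriptor from its
    bit-fields (see the layout table above)."""
    enc = lambda x: (x & 0x3FFFF) >> 4  # matrix-descriptor-encode
    return (enc(start_addr)
            | enc(lbo) << 16
            | enc(sbo) << 32
            | (base_off & 0x7) << 49
            | (swizzle_mode & 0x3) << 62)

# A pattern starting on a 1024-Byte boundary has base offset 0.
print(base_offset(0x4000))                          # -> 0
print(hex(matrix_descriptor(0x0, 256, 128, 0, 1)))  # -> 0x4000000800100000
```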

9.7.15.5.2. Asynchronous Multiply-and-Accumulate Instruction: wgmma.mma_async

wgmma.mma_async

Perform matrix multiply-and-accumulate operation across a warpgroup

Syntax

Half precision floating point type:

wgmma.mma_async.sync.aligned.shape.dtype.f16.f16
  d, a-desc, b-desc, scale-d, imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b;

wgmma.mma_async.sync.aligned.shape.dtype.f16.f16
  d, a, b-desc, scale-d, imm-scale-a, imm-scale-b, imm-trans-b;

.shape   = {.m64n8k16, .m64n16k16, .m64n24k16, .m64n32k16,
            .m64n40k16, .m64n48k16, .m64n56k16, .m64n64k16,
            .m64n72k16, .m64n80k16, .m64n88k16, .m64n96k16,
            .m64n104k16, .m64n112k16, .m64n120k16, .m64n128k16,
            .m64n136k16, .m64n144k16, .m64n152k16, .m64n160k16,
            .m64n168k16, .m64n176k16, .m64n184k16, .m64n192k16,
            .m64n200k16, .m64n208k16, .m64n216k16, .m64n224k16,
            .m64n232k16, .m64n240k16, .m64n248k16, .m64n256k16};
.dtype   = {.f16, .f32};

Alternate floating point type:

.bf16 floating point type:

wgmma.mma_async.sync.aligned.shape.dtype.bf16.bf16
  d, a-desc, b-desc, scale-d, imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b;

wgmma.mma_async.sync.aligned.shape.dtype.bf16.bf16
  d, a, b-desc, scale-d, imm-scale-a, imm-scale-b, imm-trans-b;

.shape   = {.m64n8k16, .m64n16k16, .m64n24k16, .m64n32k16,
            .m64n40k16, .m64n48k16, .m64n56k16, .m64n64k16,
            .m64n72k16, .m64n80k16, .m64n88k16, .m64n96k16,
            .m64n104k16, .m64n112k16, .m64n120k16, .m64n128k16,
            .m64n136k16, .m64n144k16, .m64n152k16, .m64n160k16,
            .m64n168k16, .m64n176k16, .m64n184k16, .m64n192k16,
            .m64n200k16, .m64n208k16, .m64n216k16, .m64n224k16,
            .m64n232k16, .m64n240k16, .m64n248k16, .m64n256k16};
.dtype  = {.f32};

.tf32 floating point type:

wgmma.mma_async.sync.aligned.shape.dtype.tf32.tf32
  d, a-desc, b-desc, scale-d, imm-scale-a, imm-scale-b;

wgmma.mma_async.sync.aligned.shape.dtype.tf32.tf32
  d, a, b-desc, scale-d, imm-scale-a, imm-scale-b;

.shape   = {.m64n8k8, .m64n16k8, .m64n24k8, .m64n32k8,
            .m64n40k8, .m64n48k8, .m64n56k8, .m64n64k8,
            .m64n72k8, .m64n80k8, .m64n88k8, .m64n96k8,
            .m64n104k8, .m64n112k8, .m64n120k8, .m64n128k8,
            .m64n136k8, .m64n144k8, .m64n152k8, .m64n160k8,
            .m64n168k8, .m64n176k8, .m64n184k8, .m64n192k8,
            .m64n200k8, .m64n208k8, .m64n216k8, .m64n224k8,
            .m64n232k8, .m64n240k8, .m64n248k8, .m64n256k8};
.dtype  = {.f32};

FP8 floating point type:

wgmma.mma_async.sync.aligned.shape.dtype.atype.btype
  d, a-desc, b-desc, scale-d, imm-scale-a, imm-scale-b;

wgmma.mma_async.sync.aligned.shape.dtype.atype.btype
  d, a, b-desc, scale-d, imm-scale-a, imm-scale-b;

.shape   = {.m64n8k32, .m64n16k32, .m64n24k32, .m64n32k32,
            .m64n40k32, .m64n48k32, .m64n56k32, .m64n64k32,
            .m64n72k32, .m64n80k32, .m64n88k32, .m64n96k32,
            .m64n104k32, .m64n112k32, .m64n120k32, .m64n128k32,
            .m64n136k32, .m64n144k32, .m64n152k32, .m64n160k32,
            .m64n168k32, .m64n176k32, .m64n184k32, .m64n192k32,
            .m64n200k32, .m64n208k32, .m64n216k32, .m64n224k32,
            .m64n232k32, .m64n240k32, .m64n248k32, .m64n256k32};
.atype  = {.e4m3, .e5m2};
.btype  = {.e4m3, .e5m2};
.dtype  = {.f16, .f32};

Integer type:

wgmma.mma_async.sync.aligned.shape{.satfinite}.s32.atype.btype  d, a-desc, b-desc, scale-d;
wgmma.mma_async.sync.aligned.shape{.satfinite}.s32.atype.btype  d, a, b-desc, scale-d;

.shape  = {.m64n8k32, .m64n16k32, .m64n24k32, .m64n32k32,
           .m64n48k32, .m64n64k32, .m64n80k32, .m64n96k32,
           .m64n112k32, .m64n128k32, .m64n144k32, .m64n160k32,
           .m64n176k32, .m64n192k32, .m64n208k32, .m64n224k32};
.atype  = {.s8, .u8};
.btype  = {.s8, .u8};

Single bit:

wgmma.mma_async.sync.aligned.shape.s32.b1.b1.op.popc  d, a-desc, b-desc, scale-d;
wgmma.mma_async.sync.aligned.shape.s32.b1.b1.op.popc  d, a, b-desc, scale-d;

.shape  = {.m64n8k256, .m64n16k256, .m64n24k256, .m64n32k256,
           .m64n48k256, .m64n64k256, .m64n80k256, .m64n96k256,
           .m64n112k256, .m64n128k256, .m64n144k256, .m64n160k256,
           .m64n176k256, .m64n192k256, .m64n208k256, .m64n224k256,
           .m64n240k256, .m64n256k256};
.op  = {.and};

Description

Instruction wgmma.mma_async issues an MxNxK matrix multiply and accumulate operation, D = A*B+D, where the A matrix is MxK, the B matrix is KxN, and the D matrix is MxN.

The operation of the form D = A*B is issued when the input predicate argument scale-d is false.

The wgmma.fence instruction must be used to fence the register accesses of the wgmma.mma_async instruction from their prior accesses. Otherwise, the behavior is undefined.

The wgmma.commit_group and wgmma.wait_group operations must be used to wait for the completion of the asynchronous matrix multiply and accumulate operations before the results are accessed.

Register operand d represents the accumulator matrix as well as the destination matrix, distributed across the participating threads. Register operand a represents the multiplicand matrix A in registers, distributed across the participating threads. The 64-bit register operands a-desc and b-desc are the matrix descriptors which represent the multiplicand matrices A and B in shared memory respectively. The contents of a matrix descriptor must be the same across all the warps in the warpgroup. The format of the matrix descriptor is described in Matrix Descriptor Format.

Matrices A and B are stored in row-major and column-major format respectively. For certain floating point variants, the input matrices A and B can be transposed by specifying the value 1 for the immediate integer arguments imm-trans-a and imm-trans-b respectively. A value of 0 can be used to avoid the transpose operation. The valid values of imm-trans-a and imm-trans-b are 0 and 1. The transpose operation is only supported for the wgmma.mma_async variants with .f16/.bf16 types on matrices accessed from shared memory using matrix descriptors.

For the floating point variants of the wgmma.mma_async operation, each element of the input matrices A and B can be negated by specifying the value -1 for operands imm-scale-a and imm-scale-b respectively. A value of 1 can be used to avoid the negate operation. The valid values of imm-scale-a and imm-scale-b are -1 and 1.

The qualifiers .dtype, .atype and .btype indicate the data type of the elements in matrices D, A and B respectively. .atype and .btype must be the same for all floating point wgmma.mma_async variants except for the FP8 floating point variants. The sizes of individual data elements of matrices A and B in alternate floating point variants of the wgmma.mma_async operation are as follows:

  • Matrices A and B have 8-bit data elements when .atype/.btype is .e4m3/.e5m2.

  • Matrices A and B have 16-bit data elements when .atype/.btype is .bf16.

  • Matrices A and B have 32-bit data elements when .atype/.btype is .tf32.

Precision and rounding:

  • Floating point operations:

    Element-wise multiplication of matrices A and B is performed with at least single precision. When .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When .dtype is .f16, the accumulation is performed with at least half precision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • .bf16 and .tf32 floating point operations:

    Element-wise multiplication of matrices A and B is performed with the specified precision. A wgmma.mma_async operation involving type .tf32 truncates the lower 13 bits of the 32-bit input data before the multiplication is issued. Accumulation of the intermediate values is performed with at least single precision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • Integer operations:

    The integer wgmma.mma_async operation is performed with .s32 accumulators. The .satfinite qualifier indicates that on overflow, the accumulated value is limited to the range MIN_INT32..MAX_INT32 (where the bounds are defined as the minimum negative signed 32-bit integer and the maximum positive signed 32-bit integer respectively).

    If .satfinite is not specified, the accumulated value is wrapped instead.
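The difference between .satfinite and the default wrapping behavior can be illustrated on the host. The following is an illustrative Python sketch of the saturate-vs-wrap rule applied to a single accumulator update (the helper names are ours, not part of PTX), not an emulation of the instruction:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def acc_satfinite(d, product):
    # .satfinite: clamp the accumulated value to [MIN_INT32, MAX_INT32]
    return max(INT32_MIN, min(INT32_MAX, d + product))

def acc_wrap(d, product):
    # default: wrap around in two's-complement 32-bit arithmetic
    return (d + product + 2**31) % 2**32 - 2**31

print(acc_satfinite(INT32_MAX, 1))  # stays at 2147483647
print(acc_wrap(INT32_MAX, 1))       # wraps to -2147483648
```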

The mandatory .sync qualifier indicates that the wgmma.mma_async instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.mma_async instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.mma_async instruction. In conditionally executed code, a wgmma.mma_async instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically; otherwise the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Support for .u8.s8 and .s8.u8 as .atype.btype introduced in PTX ISA version 8.4.

Target ISA Notes

Requires sm_90a.

Examples of half precision floating point type

.reg .f16x2 f16a<40>, f16d<40>;
.reg .f32   f32d<40>;
.reg .b64   descA, descB;
.reg .pred  scaleD;

wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16
  {f32d0, f32d1, f32d2, f32d3},
  {f16a0, f16a1, f16a2, f16a3},
  descB,
  1, -1, -1, 1;

wgmma.mma_async.sync.aligned.m64n72k16.f16.f16.f16
  {f16d0, f16d1,  f16d2,  f16d3,  f16d4,  f16d5,  f16d6,  f16d7,  f16d8,
   f16d9, f16d10, f16d11, f16d12, f16d13, f16d14, f16d15, f16d16, f16d17},
  descA,
  descB,
  scaleD, -1, 1, 1, 0;

Examples of alternate floating point type

.reg .f32   f32d<60>;
.reg .b32   bf16a<40>;
.reg .b64   descA, descB;
.reg .pred  scaleD;

wgmma.mma_async.sync.aligned.m64n120k16.f32.bf16.bf16
  {f32d0,  f32d1,  f32d2,  f32d3,  f32d4,  f32d5,  f32d6,  f32d7,  f32d8,  f32d9,
   f32d10, f32d11, f32d12, f32d13, f32d14, f32d15, f32d16, f32d17, f32d18, f32d19,
   f32d20, f32d21, f32d22, f32d23, f32d24, f32d25, f32d26, f32d27, f32d28, f32d29,
   f32d30, f32d31, f32d32, f32d33, f32d34, f32d35, f32d36, f32d37, f32d38, f32d39,
   f32d40, f32d41, f32d42, f32d43, f32d44, f32d45, f32d46, f32d47, f32d48, f32d49,
   f32d50, f32d51, f32d52, f32d53, f32d54, f32d55, f32d56, f32d57, f32d58, f32d59},
  {bf16a0, bf16a1, bf16a2, bf16a3},
  descB,
  scaleD, -1, -1, 0;

.reg .f32   f32d<40>;
.reg .b64   descA, descB;

wgmma.mma_async.sync.aligned.m64n16k8.f32.tf32.tf32
  {f32d0, f32d1, f32d2, f32d3, f32d4, f32d5, f32d6, f32d7},
  descA,
  descB,
  0, -1, -1;

.reg .b32   f16d<8>, f16a<8>;
.reg .f32   f32d<8>;
.reg .b64   descA, descB;

wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e5m2
  {f16d0, f16d1},
  descA,
  descB,
  scaleD, -1, 1;

wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e4m3
  {f32d0, f32d1, f32d2, f32d3},
  {f16a0, f16a1, f16a2, f16a3},
  descB,
  1, -1, -1;

Examples of integer type

.reg .s32   s32d<8>, s32a<8>;
.reg .u32   u32a<8>;
.reg .pred  scaleD;
.reg .b64   descA, descB;

wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8.satfinite
  {s32d0, s32d1, s32d2, s32d3},
  {s32a0, s32a1, s32a2, s32a3},
  descB,
  1;

wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8
  {s32d0, s32d1, s32d2, s32d3},
  descA,
  descB,
  scaleD;

wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8.satfinite
  {s32d0, s32d1, s32d2, s32d3},
  {s32a0, s32a1, s32a2, s32a3},
  descB,
  scaleD;

wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8
  {s32d0, s32d1, s32d2, s32d3},
  descA,
  descB,
  scaleD;

Examples of single bit type

.reg .s32   s32d<4>;
.reg .b32   b32a<4>;
.reg .pred  scaleD;
.reg .b64   descA, descB;

wgmma.mma_async.sync.aligned.m64n8k256.s32.b1.b1.and.popc
  {s32d0, s32d1, s32d2, s32d3},
  {b32a0, b32a1, b32a2, b32a3},
  descB,
  scaleD;

9.7.15.6. Asynchronous Warpgroup Level Multiply-and-Accumulate Operation using wgmma.mma_async.sp instruction

This section describes the warpgroup-level wgmma.mma_async.sp instruction with sparse matrix A. This variant of the wgmma.mma_async operation can be used when A is a structured sparse matrix with 50% zeros in each row distributed in a shape-specific granularity. For an MxNxK sparse wgmma.mma_async.sp operation, the MxK matrix A is packed into MxK/2 elements. For each K-wide row of matrix A, 50% of the elements are zeros and the remaining K/2 non-zero elements are packed in the operand representing matrix A. The mapping of these K/2 elements to the corresponding K-wide row is provided explicitly as metadata.

9.7.15.6.1. Sparse matrix storage

Granularity of sparse matrix A is defined as the ratio of the number of non-zero elements in a sub-chunk of the matrix row to the total number of elements in that sub-chunk, where the size of the sub-chunk is shape-specific. For example, in a 64x32 matrix A used in floating point wgmma.mma_async operations, sparsity is expected to be at 2:4 granularity, i.e., each 4-element vector (a sub-chunk of 4 consecutive elements) of a matrix row contains 2 zeros. The index of each non-zero element in a sub-chunk is stored in the metadata operand. Values 0b0000, 0b0101, 0b1010, 0b1111 are invalid values for metadata and will result in undefined behavior. In a group of four consecutive threads, one or more threads store the metadata for the whole group depending upon the matrix shape. These threads are specified using an additional sparsity selector operand.
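The 2:4 storage scheme can be sketched on the host. The following Python fragment is an illustrative sketch (the helper names are ours, not part of PTX, and the exact bit order within the real metadata operand is shape-specific; see the metadata figures below): it packs one 4-wide sub-chunk into its two non-zero elements plus a nibble of two 2-bit indices, and shows why the four listed metadata values are invalid — they are exactly the nibbles whose two indices are equal.

```python
def pack_2to4(chunk):
    """Pack a 4-element sub-chunk with exactly two non-zeros into
    (non-zero values, nibble holding two 2-bit position indices)."""
    idx = [i for i, v in enumerate(chunk) if v != 0]
    assert len(idx) == 2, "2:4 sparsity requires exactly two non-zeros"
    meta = idx[0] | (idx[1] << 2)   # two 2-bit indices in one nibble (illustrative order)
    return [chunk[idx[0]], chunk[idx[1]]], meta

# The invalid metadata values 0b0000, 0b0101, 0b1010, 0b1111 are exactly
# the nibbles whose two 2-bit indices are equal: they cannot arise from
# two distinct non-zero positions.
INVALID_META = {m for m in range(16) if (m & 0b11) == (m >> 2)}

vals, meta = pack_2to4([0, 5, 0, 7])   # non-zeros at positions 1 and 3
print(vals, bin(meta))                  # [5, 7] 0b1101
```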

Matrix A and its corresponding input operand to the sparse wgmma is similar to the diagram shown in Figure 111, with an appropriate matrix size.

Granularities for different matrix shapes and data types are described below.

Sparse wgmma.mma_async.sp with half precision and .bf16 type

For .f16 and .bf16 types, for all supported 64xNx32 shapes, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in matrix A and their positions in the four-wide chunk in matrix A are indicated by two 2-bit indices in the metadata operand.

_images/f16-metadata-example.png

Figure 171 Sparse WGMMA metadata example for .f16/.bf16 type.

The sparsity selector indicates a thread-pair within a group of four consecutive threads which contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in undefined behavior.

Sparse wgmma.mma_async.sp with .tf32 type

For the .tf32 type, for all supported 64xNx16 shapes, matrix A is structured sparse at a granularity of 1:2. In other words, each chunk of two adjacent elements in a row of matrix A has one zero and one non-zero element. Only the non-zero element is stored in the operand for matrix A, and the 4-bit index in the metadata indicates the position of the non-zero element in the two-wide chunk. 0b1110 and 0b0100 are the only meaningful values of the index; the remaining values result in undefined behavior.

_images/tf32-metadata-example.png

Figure 172 Sparse WGMMA metadata example for .tf32 type.

The sparsity selector indicates a thread-pair within a group of four consecutive threads which contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in undefined behavior.

Sparse wgmma.mma_async.sp with .e4m3 and .e5m2 floating point types

For .e4m3 and .e5m2 types, for all supported 64xNx64 shapes, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in matrix A and their positions in the four-wide chunk in matrix A are indicated by two 2-bit indices in the metadata operand.

_images/u8s8-metadata-example.png

Figure 173 Sparse WGMMA metadata example for .e4m3/.e5m2 type.

All threads contribute the sparsity metadata and the sparsity selector must be 0; any other value results in undefined behavior.

Sparse wgmma.mma_async.sp with integer type

For the integer types, for all supported 64xNx64 shapes, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in matrix A and two 2-bit indices in the metadata indicate the position of these two non-zero elements in the four-wide chunk.

_images/u8s8-metadata-example.png

Figure 174 Sparse WGMMA metadata example for .u8/.s8 type.

All threads contribute the sparsity metadata and the sparsity selector must be 0; any other value results in undefined behavior.

9.7.15.6.2. Matrix fragments for warpgroup-level multiply-accumulate operation with sparse matrix A

In this section we describe how the contents of thread registers are associated with fragments of the A matrix and the sparsity metadata.

Each warp in the warpgroup provides sparsity information for 16 rows of matrix A. The following table shows the assignment of warps to rows of matrix A:

Warp             Sparsity information for rows of matrix A
%warpid % 4 = 3  48-63
%warpid % 4 = 2  32-47
%warpid % 4 = 1  16-31
%warpid % 4 = 0  0-15
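The warp-to-row assignment in the table reduces to a simple band formula; a minimal Python sketch (the helper name is ours):

```python
def sparsity_rows(warpid):
    """Rows of matrix A whose sparsity information a warp provides,
    per the assignment table above."""
    base = (warpid % 4) * 16   # %warpid % 4 selects one 16-row band
    return range(base, base + 16)

for w in range(4):
    print(w, list(sparsity_rows(w))[0], list(sparsity_rows(w))[-1])
```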

The following conventions are used throughout this section:

  • For matrix A, only the layout of a fragment is described in terms of register vector sizes and their association with the matrix data.

  • For matrix D, since the matrix dimension - data type combination is the same for all supported shapes, and is already covered in Asynchronous Warpgroup Level Matrix Multiply-Accumulate Operation using wgmma.mma_async instruction, the pictorial representations of matrix fragments are not included in this section.

  • For the metadata operand, pictorial representations of the association between indices of the elements of matrix A and the contents of the metadata operand are included. Tk:[m..n] present in cell [x][y..z] indicates that bits m through n (with m being higher) in the metadata operand of the thread with %laneid=k contain the indices of the non-zero elements from the chunk [x][y]..[x][z] of matrix A.

9.7.15.6.2.1. Matrix Fragments for sparse wgmma.mma_async.m64nNk32

A warpgroup executing sparse wgmma.mma_async.m64nNk32 will compute an MMA operation of shape .m64nNk32 where N is a valid n dimension as listed in Matrix Shape.

Elements of the matrix are distributed across the threads in a warpgroup so that each thread of the warpgroup holds a fragment of the matrix.

  • Multiplicand A, from shared memory, is documented in Shared Memory Matrix Layout.

  • Multiplicand A, from registers:

    .atype: .f16 / .bf16

    Fragments: A vector expression containing four .b32 registers, with each register containing two non-zero .f16 / .bf16 elements out of 4 consecutive elements from matrix A.

    Elements: Non-zero elements a0, a1, a2, a3, a4, a5, a6, a7. Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 175.

    _images/sparse-wgmma-64N32-f16-bf16-A.png

    Figure 175 Sparse WGMMA .m64nNk32 fragment layout for matrix A with .f16/.bf16 type.

  • Accumulator D:

    Matrix fragments for accumulator D are the same as in the case of Matrix Fragments for wgmma.mma_async.m64nNk32 for the same .dtype format.

  • Multiplicand B:

    Shared memory layout for matrix B is documented in Shared Memory Matrix Layout.

  • Metadata operand is a .b32 register containing 16 2-bit vectors, each storing the index of a non-zero element of a 4-wide chunk of matrix A.

    Figure 176 shows the mapping of the metadata bits to the elements of matrix A for a warp. In this figure, variable i represents the value of the sparsity selector operand.

    _images/sparse-mma-metadata-16832-f16bf16.png

    Figure 176 Sparse WGMMA .m64nNk32 metadata layout for .f16/.bf16 type.

9.7.15.6.2.2. Matrix Fragments for sparse wgmma.mma_async.m64nNk16

A warpgroup executing sparse wgmma.mma_async.m64nNk16 will compute an MMA operation of shape .m64nNk16 where N is a valid n dimension as listed in Matrix Shape.

Elements of the matrix are distributed across the threads in a warpgroup so that each thread of the warpgroup holds a fragment of the matrix.

  • Multiplicand A, from shared memory, is documented in Shared Memory Matrix Layout.

  • Multiplicand A, from registers:

    .atype: .tf32

    Fragments: A vector expression containing four .b32 registers, containing four non-zero .tf32 elements out of eight consecutive elements from matrix A.

    Elements: Non-zero elements a0, a1, a2, a3. Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 177.

    _images/sparse-wgmma-64N16-tf32-A.png

    Figure 177 Sparse WGMMA .m64nNk16 fragment layout for matrix A with .tf32 type.

  • Accumulator D:

    Matrix fragments for accumulator D are the same as in the case of Matrix Fragments for wgmma.mma_async.m64nNk8 for the same .dtype format.

  • Multiplicand B:

    Shared memory layout for matrix B is documented in Shared Memory Matrix Layout.

  • Metadata operand is a .b32 register containing eight 4-bit vectors, each storing the index of a non-zero element of a 2-wide chunk of matrix A.

    Figure 178 shows the mapping of the metadata bits to the elements of matrix A for a warp. In this figure, variable i represents the value of the sparsity selector operand.

    _images/sparse-mma-metadata-16816-tf32.png

    Figure 178 Sparse WGMMA .m64nNk16 metadata layout for .tf32 type.

9.7.15.6.2.3. Matrix Fragments for sparse wgmma.mma_async.m64nNk64

A warpgroup executing sparse wgmma.mma_async.m64nNk64 will compute an MMA operation of shape .m64nNk64 where N is a valid n dimension as listed in Matrix Shape.

Elements of the matrix are distributed across the threads in a warpgroup so that each thread of the warpgroup holds a fragment of the matrix.

  • Multiplicand A, from shared memory, is documented in Matrix Fragments for sparse wgmma.mma_async.m64nNk64.

  • Multiplicand A, from registers:

    .atype: .e4m3 / .e5m2

    Fragments: A vector expression containing four .b32 registers, with each register containing four non-zero .e4m3 / .e5m2 elements out of eight consecutive elements from matrix A.

    Elements: Non-zero elements a0, a1, a2, …, a15. Mapping of the non-zero elements is as described in Sparse matrix storage.

    .atype: .s8 / .u8

    Fragments: A vector expression containing four .b32 registers, with each register containing four non-zero .s8 / .u8 elements out of eight consecutive elements from matrix A.

    Elements: Non-zero elements a0, a1, a2, …, a15. Mapping of the non-zero elements is as described in Sparse matrix storage.

    The layout of the fragments held by different threads is shown in Figure 179.

    _images/sparse-wgmma-64N64-e4m3-e5m2-s8-u8-A.png

    Figure 179 Sparse WGMMA .m64nNk64 fragment layout for matrix A with .e4m3/.e5m2/.s8/.u8 type.

  • Accumulator D:

    Matrix fragments for accumulator D are the same as in the case of Matrix Fragments for wgmma.mma_async.m64nNk32 for the same .dtype format.

  • Multiplicand B:

    Shared memory layout for matrix B is documented in Matrix Fragments for sparse wgmma.mma_async.m64nNk64.

  • Metadata operand is a .b32 register containing 16 4-bit vectors, each storing the indices of two non-zero elements of a 4-wide chunk of matrix A.

    Figure 180 shows the mapping of the metadata bits to the elements of columns 0–31 of matrix A.

    _images/sparse-mma-metadata-16864-u8s8-first32col.png

    Figure 180 Sparse WGMMA .m64nNk64 metadata layout for .e4m3/.e5m2/.s8/.u8 type for columns 0–31.

    Figure 181 shows the mapping of the metadata bits to the elements of columns 32–63 of matrix A.

    _images/sparse-mma-metadata-16864-u8s8-last32col.png

    Figure 181 Sparse WGMMA .m64nNk64 metadata layout for .e4m3/.e5m2/.s8/.u8 type for columns 32–63.

9.7.15.6.3. Asynchronous Multiply-and-Accumulate Instruction: wgmma.mma_async.sp

wgmma.mma_async.sp

Perform matrix multiply-and-accumulate operation with sparse matrix A across warpgroup

Syntax

Half precision floating point type:

wgmma.mma_async.sp.sync.aligned.shape.dtype.f16.f16  d, a-desc, b-desc, sp-meta, sp-sel, scale-d, imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b;
wgmma.mma_async.sp.sync.aligned.shape.dtype.f16.f16  d, a, b-desc, sp-meta, sp-sel, scale-d, imm-scale-a, imm-scale-b, imm-trans-b;

.shape  = {.m64n8k32, .m64n16k32, .m64n24k32, .m64n32k32,
           .m64n40k32, .m64n48k32, .m64n56k32, .m64n64k32,
           .m64n72k32, .m64n80k32, .m64n88k32, .m64n96k32,
           .m64n104k32, .m64n112k32, .m64n120k32, .m64n128k32,
           .m64n136k32, .m64n144k32, .m64n152k32, .m64n160k32,
           .m64n168k32, .m64n176k32, .m64n184k32, .m64n192k32,
           .m64n200k32, .m64n208k32, .m64n216k32, .m64n224k32,
           .m64n232k32, .m64n240k32, .m64n248k32, .m64n256k32};
.dtype  = {.f16, .f32};

Alternate floating point type:

.bf16 floating point type:

wgmma.mma_async.sp.sync.aligned.shape.dtype.bf16.bf16  d, a-desc, b-desc, sp-meta, sp-sel, scale-d, imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b;
wgmma.mma_async.sp.sync.aligned.shape.dtype.bf16.bf16  d, a, b-desc, sp-meta, sp-sel, scale-d, imm-scale-a, imm-scale-b, imm-trans-b;

.shape  = {.m64n8k32, .m64n16k32, .m64n24k32, .m64n32k32,
           .m64n40k32, .m64n48k32, .m64n56k32, .m64n64k32,
           .m64n72k32, .m64n80k32, .m64n88k32, .m64n96k32,
           .m64n104k32, .m64n112k32, .m64n120k32, .m64n128k32,
           .m64n136k32, .m64n144k32, .m64n152k32, .m64n160k32,
           .m64n168k32, .m64n176k32, .m64n184k32, .m64n192k32,
           .m64n200k32, .m64n208k32, .m64n216k32, .m64n224k32,
           .m64n232k32, .m64n240k32, .m64n248k32, .m64n256k32};
.dtype  = {.f32};

.tf32 floating point type:

wgmma.mma_async.sp.sync.aligned.shape.dtype.tf32.tf32  d, a-desc, b-desc, sp-meta, sp-sel, scale-d, imm-scale-a, imm-scale-b;
wgmma.mma_async.sp.sync.aligned.shape.dtype.tf32.tf32  d, a, b-desc, sp-meta, sp-sel, scale-d, imm-scale-a, imm-scale-b;

.shape  = {.m64n8k16, .m64n16k16, .m64n24k16, .m64n32k16,
           .m64n40k16, .m64n48k16, .m64n56k16, .m64n64k16,
           .m64n72k16, .m64n80k16, .m64n88k16, .m64n96k16,
           .m64n104k16, .m64n112k16, .m64n120k16, .m64n128k16,
           .m64n136k16, .m64n144k16, .m64n152k16, .m64n160k16,
           .m64n168k16, .m64n176k16, .m64n184k16, .m64n192k16,
           .m64n200k16, .m64n208k16, .m64n216k16, .m64n224k16,
           .m64n232k16, .m64n240k16, .m64n248k16, .m64n256k16};
.dtype  = {.f32};

FP8 floating point type:

wgmma.mma_async.sp.sync.aligned.shape.dtype.atype.btype  d, a-desc, b-desc, sp-meta, sp-sel, scale-d, imm-scale-a, imm-scale-b;
wgmma.mma_async.sp.sync.aligned.shape.dtype.atype.btype  d, a, b-desc, sp-meta, sp-sel, scale-d, imm-scale-a, imm-scale-b;

.shape  = {.m64n8k64, .m64n16k64, .m64n24k64, .m64n32k64,
           .m64n40k64, .m64n48k64, .m64n56k64, .m64n64k64,
           .m64n72k64, .m64n80k64, .m64n88k64, .m64n96k64,
           .m64n104k64, .m64n112k64, .m64n120k64, .m64n128k64,
           .m64n136k64, .m64n144k64, .m64n152k64, .m64n160k64,
           .m64n168k64, .m64n176k64, .m64n184k64, .m64n192k64,
           .m64n200k64, .m64n208k64, .m64n216k64, .m64n224k64,
           .m64n232k64, .m64n240k64, .m64n248k64, .m64n256k64};
.atype  = {.e4m3, .e5m2};
.btype  = {.e4m3, .e5m2};
.dtype  = {.f16, .f32};

Integer type:

wgmma.mma_async.sp.sync.aligned.shape{.satfinite}.s32.atype.btype  d, a-desc, b-desc, sp-meta, sp-sel, scale-d;
wgmma.mma_async.sp.sync.aligned.shape{.satfinite}.s32.atype.btype  d, a, b-desc, sp-meta, sp-sel, scale-d;

.shape  = {.m64n8k64, .m64n16k64, .m64n24k64, .m64n32k64,
           .m64n48k64, .m64n64k64, .m64n80k64, .m64n96k64,
           .m64n112k64, .m64n128k64, .m64n144k64, .m64n160k64,
           .m64n176k64, .m64n192k64, .m64n208k64, .m64n224k64,
           .m64n240k64, .m64n256k64};
.atype  = {.s8, .u8};
.btype  = {.s8, .u8};

Description

Instruction wgmma.mma_async issues an MxNxK matrix multiply and accumulate operation, D = A*B+D, where the A matrix is MxK, the B matrix is KxN, and the D matrix is MxN.

The matrix A is stored in the packed format Mx(K/2) as described in Sparse matrix storage.

The operation of the form D = A*B is issued when the input predicate argument scale-d is false.

The wgmma.fence instruction must be used to fence the register accesses of the wgmma.mma_async instruction from their prior accesses. Otherwise, the behavior is undefined.

The wgmma.commit_group and wgmma.wait_group operations must be used to wait for the completion of the asynchronous matrix multiply and accumulate operations before the results are accessed.

Register operand d represents the accumulator matrix as well as the destination matrix, distributed across the participating threads. Register operand a represents the multiplicand matrix A in registers, distributed across the participating threads. The 64-bit register operands a-desc and b-desc are the matrix descriptors which represent the multiplicand matrices A and B in shared memory respectively. The contents of a matrix descriptor must be the same across all the warps in the warpgroup. The format of the matrix descriptor is described in Matrix Descriptor Format. Matrix A is structured sparse as described in Sparse matrix storage. Operands sp-meta and sp-sel represent the sparsity metadata and the sparsity selector respectively. Operand sp-meta is a 32-bit integer and operand sp-sel is a 32-bit integer constant with values in the range 0..3.

The valid values of sp-meta and sp-sel for each shape are specified in Sparse matrix storage and are summarized here:

Matrix shape   .atype                      Valid values of sp-meta   Valid values of sp-sel
.m64nNk16      .tf32                       0b1110, 0b0100            0 (threads T0, T1) or 1 (threads T2, T3)
.m64nNk32      .f16 / .bf16                0b00, 0b01, 0b10, 0b11    0 (threads T0, T1) or 1 (threads T2, T3)
.m64nNk64      .e4m3 / .e5m2 / .s8 / .u8   0b00, 0b01, 0b10, 0b11    0 (all threads contribute)
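The summary table above can be captured as a small validity check. This is an illustrative Python sketch (the names are ours, not part of PTX) that mirrors the table for the per-index-field sp-meta values and the sp-sel operand, rather than defining any new behavior:

```python
# shape -> (valid per-index sp-meta field values, valid sp-sel values)
SP_VALIDITY = {
    "m64nNk16": ({0b1110, 0b0100}, {0, 1}),          # .tf32
    "m64nNk32": ({0b00, 0b01, 0b10, 0b11}, {0, 1}),  # .f16/.bf16
    "m64nNk64": ({0b00, 0b01, 0b10, 0b11}, {0}),     # .e4m3/.e5m2/.s8/.u8
}

def sp_operands_valid(shape, sp_meta_field, sp_sel):
    """Check one metadata index field and the sparsity selector
    against the summary table."""
    metas, sels = SP_VALIDITY[shape]
    return sp_meta_field in metas and sp_sel in sels

print(sp_operands_valid("m64nNk16", 0b0100, 1))   # valid tf32 combination
print(sp_operands_valid("m64nNk64", 0b01, 1))     # invalid: sp-sel must be 0
```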

Matrices A and B are stored in row-major and column-major format respectively. For certain floating point variants, the input matrices A and B can be transposed by specifying the value 1 for the immediate integer arguments imm-trans-a and imm-trans-b respectively. A value of 0 can be used to avoid the transpose operation. The valid values of imm-trans-a and imm-trans-b are 0 and 1. The transpose operation is only supported for the wgmma.mma_async variants with .f16/.bf16 types on matrices accessed from shared memory using matrix descriptors.

For the floating point variants of the wgmma.mma_async operation, each element of the input matrices A and B can be negated by specifying the value -1 for operands imm-scale-a and imm-scale-b respectively. A value of 1 can be used to avoid the negate operation. The valid values of imm-scale-a and imm-scale-b are -1 and 1.

The qualifiers .dtype, .atype and .btype indicate the data type of the elements in matrices D, A and B respectively. .atype and .btype must be the same for all floating point wgmma.mma_async variants except for the FP8 floating point variants. The sizes of individual data elements of matrices A and B in alternate floating point variants of the wgmma.mma_async operation are as follows:

  • Matrices A and B have 8-bit data elements when .atype/.btype is .e4m3/.e5m2.

  • Matrices A and B have 16-bit data elements when .atype/.btype is .bf16.

  • Matrices A and B have 32-bit data elements when .atype/.btype is .tf32.

Precision and rounding:

  • Floating point operations:

    Element-wise multiplication of matrices A and B is performed with at least single precision. When .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When .dtype is .f16, the accumulation is performed with at least half precision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • .bf16 and .tf32 floating point operations:

    Element-wise multiplication of matrices A and B is performed with the specified precision. A wgmma.mma_async operation involving type .tf32 truncates the lower 13 bits of the 32-bit input data before the multiplication is issued. Accumulation of the intermediate values is performed with at least single precision.

    The accumulation order, rounding, and handling of subnormal inputs are unspecified.

  • Integer operations:

    The integer wgmma.mma_async operation is performed with .s32 accumulators. The .satfinite qualifier indicates that on overflow, the accumulated value is limited to the range MIN_INT32..MAX_INT32 (where the bounds are defined as the minimum negative signed 32-bit integer and the maximum positive signed 32-bit integer respectively).

    If .satfinite is not specified, the accumulated value is wrapped instead.
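The .tf32 truncation of the lower 13 bits can be illustrated on the host. This is a hedged Python sketch of the rounding effect on a single binary32 input (the helper name is ours), not an emulation of the instruction:

```python
import struct

def tf32_truncate(x):
    """Zero the low 13 mantissa bits of a binary32 value, mirroring the
    truncation the .tf32 wgmma variant applies before multiplication."""
    bits, = struct.unpack("<I", struct.pack("<f", x))
    bits &= ~0x1FFF            # clear the lower 13 of the 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_truncate(1.0))        # 1.0 is exactly representable
print(tf32_truncate(3.14159))    # truncated toward zero, ~10 mantissa bits kept
```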

The mandatory .sync qualifier indicates that the wgmma.mma_async instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.mma_async instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.mma_async instruction. In conditionally executed code, a wgmma.mma_async instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically; otherwise the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.2.

Support for .u8.s8 and .s8.u8 as .atype.btype introduced in PTX ISA version 8.4.

Target ISA Notes

Requires sm_90a.

Examples of integer type

wgmma.fence.sync.aligned;

wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.u8
  {s32d0, s32d1, s32d2, s32d3},
  descA, descB, spMeta, 0, scaleD;

wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.u8
  {s32d0, s32d1, s32d2, s32d3},
  descA, descB, spMeta, 0, scaleD;

wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;

9.7.15.7. Asynchronous wgmma Proxy Operations

This section describes the warpgroup level wgmma.fence, wgmma.commit_group and wgmma.wait_group instructions.

9.7.15.7.1. Asynchronous Multiply-and-Accumulate Instruction: wgmma.fence

wgmma.fence

Enforce an ordering of register accesses between wgmma.mma_async and other operations.

Syntax

wgmma.fence.sync.aligned;

Description

wgmma.fence instruction establishes an ordering between prior accesses to any warpgroupregisters and subsequent accesses to the same registers by awgmma.mma_async instruction. Onlythe accumulator register and the input registers containing the fragments of matrix A require thisordering.

Thewgmma.fence instruction must be issued by all warps of the warpgroup at the followinglocations:

  • Before the firstwgmma.mma_async operation in a warpgroup.

  • Between a register access by a thread in the warpgroup and anywgmma.mma_async instructionthat accesses the same registers, either as accumulator or input register containing fragments ofmatrix A, except when these are accumulator register accesses across multiplewgmma.mma_asyncinstructions of the same shape. In the latter case, an ordering guarantee is provided by default.

Otherwise, the behavior is undefined.

An async proxy fence must be used to establish an ordering between prior writes to shared memorymatrices and subsequent reads of the same matrices in awgmma.mma_async instruction.

The mandatory.sync qualifier indicates thatwgmma.fence instruction causes the executingthread to wait until all threads in the warp execute the samewgmma.fence instruction beforeresuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.fence instruction. In conditionally executed code, a wgmma.fence instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically; otherwise, the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90a.

Examples

// Example 1, first use example:
wgmma.fence.sync.aligned;    // Establishes an ordering w.r.t. prior accesses to the registers s32d<0-3>
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8
    {s32d0, s32d1, s32d2, s32d3},
    descA, descB, scaleD;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;

// Example 2, use-case with the input value updated in between:
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8
    {s32d0, s32d1, s32d2, s32d3},
    descA, descB, scaleD;
...
mov.b32 s32d0, new_val;
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8
    {s32d4, s32d5, s32d6, s32d7},
    {s32d0, s32d1, s32d2, s32d3},
    descB, scaleD;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;
9.7.15.7.2.Asynchronous Multiply-and-Accumulate Instruction: wgmma.commit_group

wgmma.commit_group

Commits all prior uncommitted wgmma.mma_async operations into a wgmma-group.

Syntax

wgmma.commit_group.sync.aligned;

Description

The wgmma.commit_group instruction creates a new wgmma-group per warpgroup and batches all prior wgmma.mma_async instructions initiated by the executing warp but not committed to any wgmma-group into the new wgmma-group. If there are no uncommitted wgmma.mma_async instructions then wgmma.commit_group results in an empty wgmma-group.

An executing thread can wait for the completion of all wgmma.mma_async operations in a wgmma-group by using wgmma.wait_group.

The mandatory .sync qualifier indicates that the wgmma.commit_group instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.commit_group instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.commit_group instruction. In conditionally executed code, a wgmma.commit_group instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically; otherwise, the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90a.

Examples

wgmma.commit_group.sync.aligned;
9.7.15.7.3.Asynchronous Multiply-and-Accumulate Instruction: wgmma.wait_group

wgmma.wait_group

Signal the completion of a preceding warpgroup operation.

Syntax

wgmma.wait_group.sync.aligned N;

Description

The wgmma.wait_group instruction will cause the executing thread to wait until only N or fewer of the most recent wgmma-groups are pending and all the prior wgmma-groups committed by the executing thread are complete. For example, when N is 0, the executing thread waits on all prior wgmma-groups to complete. Operand N is an integer constant.

Accessing the accumulator register or the input register containing the fragments of matrix A of a wgmma.mma_async instruction without first performing a wgmma.wait_group instruction that waits on a wgmma-group including that wgmma.mma_async instruction is undefined behavior.

The mandatory .sync qualifier indicates that the wgmma.wait_group instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.wait_group instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.wait_group instruction. In conditionally executed code, a wgmma.wait_group instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically; otherwise, the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90a.

Examples

wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8
    {s32d0, s32d1, s32d2, s32d3},
    descA, descB, scaleD;
wgmma.commit_group.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16
    {f32d0, f32d1, f32d2, f32d3},
    {f16a0, f16a1, f16a2, f16a3},
    descB, 1, -1, -1, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;

9.7.16.TensorCore 5th Generation Family Instructions

9.7.16.1.Tensor Memory

The 5th generation TensorCore has dedicated on-chip memory that is specialized for use by TensorCore operations. This Tensor Memory is organized as a two-dimensional matrix where the horizontal rows are called lanes and the vertical columns are called columns.

On architecture sm_100a/sm_100f, the 5th generation TensorCore's Tensor Memory has a two-dimensional structure of 512 columns and 128 lanes per CTA, with each cell being 32 bits in size.

Restrictions on threads accessing the Tensor Memory via the load and store operations are specified in Access restrictions.

9.7.16.1.1.Tensor Memory Addressing

Tensor Memory addresses are 32-bit wide and specify two components.

  1. Lane index

  2. Column index

The layout is as follows:

Bits 31-16 : Lane index

Bits 15-0  : Column index

Figure 182 shows the view of the Tensor Memory layout within a CTA.

_images/tensor-memory-layout.png

Figure 182 Tensor Memory Layout and Addressing
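As an illustration only (not part of the PTX specification), the address layout above can be sketched in Python; the helper names are hypothetical:

```python
def tmem_addr(lane, column):
    """Pack a Tensor Memory address: lane index in bits 31-16,
    column index in bits 15-0 (per the layout above)."""
    assert 0 <= lane < (1 << 16) and 0 <= column < (1 << 16)
    return (lane << 16) | column

def tmem_lane(addr):
    """Extract the lane index (bits 31-16)."""
    return (addr >> 16) & 0xFFFF

def tmem_column(addr):
    """Extract the column index (bits 15-0)."""
    return addr & 0xFFFF
```

For example, lane 3 of column 42 is encoded as `(3 << 16) | 42`.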

9.7.16.1.2.Tensor Memory Allocation

The Tensor Memory is dynamically allocated. The Tensor Memory must be allocated by a single warp in a CTA using the Tensor Memory Allocation and Management Instructions.

The allocation and deallocation of Tensor Memory is performed in terms of columns. The unit of allocation is 32 columns and the number of columns being allocated must be a power of 2. When a column is allocated, all 128 lanes of the column are allocated.

All of the Tensor Memory that was allocated in a kernel must be explicitly deallocated before the kernel exits.
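The allocation rules above (unit of 32 columns, power-of-2 count, 512 columns per CTA on sm_100a/sm_100f) can be sketched as a small validity check; this helper is illustrative, not part of the ISA:

```python
def valid_tmem_alloc(ncols):
    """Check a Tensor Memory column-allocation request against the stated
    rules: the unit of allocation is 32 columns, the count must be a power
    of 2, and a CTA has 512 columns in total (sm_100a/sm_100f)."""
    return 32 <= ncols <= 512 and (ncols & (ncols - 1)) == 0
```

So 32, 64, ..., 512 columns are valid requests, while 16 (below the unit) or 48 (not a power of 2) are not.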

9.7.16.2.Matrix and Data Movement Shape

There are two kinds of shapes involved.

  1. Shapes in the data movement operations

  2. Shapes in the MMA operations

9.7.16.2.1.Matrix Shape

The matrix multiply and accumulate operations support a limited set of shapes for the operand matrices A, B and D. The shapes of all three matrix operands are collectively described by the tuple MxNxK, where A is an MxK matrix, B is a KxN matrix, and D is an MxN matrix.

Table 39 shows the matrix shapes that are supported for the specified types for the tcgen05.mma operation.

Table 39 Various combinations of .kind and shapes

.kind::f16 (dtype .f16 with atype/btype .f16, or dtype .f32 with atype/btype .f16, .bf16)
  • No .ws, CTA group 1:
      Dense:  shapes 64xNxK, 128xNxK; N = {8, 16, 24, ... 256} in steps of 8; K = 16
      Sparse: same shapes and N; K = 32
  • No .ws, CTA group 2:
      Dense:  shapes 128xNxK, 256xNxK; N = {16, 32, ... 256} in steps of 16; K = 16
      Sparse: same shapes and N; K = 32
  • .ws, CTA group 1:
      Dense:  shapes 32xNxK, 64xNxK, 128xNxK; N = {64, 128, 256}; K = 16
      Sparse: same shapes; N = {64, 128}; K = 32
  • .ws, CTA group 2: Invalid (Dense and Sparse)

.kind::tf32 (dtype .f32; atype/btype .tf32)
  • No .ws, CTA group 1:
      Dense:  shapes 64xNxK, 128xNxK; N = {8, 16, 24, ... 256} in steps of 8; K = 8
      Sparse: same shapes and N; K = 16
  • No .ws, CTA group 2:
      Dense:  shapes 128xNxK, 256xNxK; N = {16, 32, ... 256} in steps of 16; K = 8
      Sparse: same shapes and N; K = 16
  • .ws, CTA group 1:
      Dense:  shapes 32xNxK, 64xNxK, 128xNxK; N = {64, 128, 256}; K = 8
      Sparse: same shapes; N = {64, 128}; K = 16
  • .ws, CTA group 2: Invalid (Dense and Sparse)

.kind::f8f6f4 (dtype .f32, .f16; atype/btype .e4m3, .e5m2, .e2m3, .e3m2, .e2m1)
  • No .ws, CTA group 1:
      Dense:  shapes 64xNxK, 128xNxK; N = {8, 16, ... 256} in steps of 8; K = 32
      Sparse: same shapes and N; K = 64
  • No .ws, CTA group 2:
      Dense:  shapes 128xNxK, 256xNxK; N = {16, 32, ... 256} in steps of 16; K = 32
      Sparse: same shapes and N; K = 64
  • .ws, CTA group 1:
      Dense:  shapes 32xNxK, 64xNxK, 128xNxK; N = {64, 128, 256}; K = 32
      Sparse: same shapes; N = {64, 128}; K = 64
  • .ws, CTA group 2: Invalid (Dense and Sparse)

.kind::mxf8f6f4 (dtype .f32; atype/btype .e4m3, .e5m2, .e2m3, .e3m2, .e2m1; scale type .ue8m0)
  • No .ws, CTA group 1:
      Dense:  shape 128xNxK; N = {8, 16, ... 256} in steps of 8; K = 32
      Sparse: same shape and N; K = 64
  • No .ws, CTA group 2:
      Dense:  shapes 128xNxK, 256xNxK; N = {16, 32, ... 256} in steps of 16; K = 32
      Sparse: shape 256xNxK; K = 64
  • .ws, CTA group 1 or 2: Invalid (Dense and Sparse)

.kind::i8 (dtype .s32; atype/btype .s8, .u8)
  • No .ws, CTA group 1:
      Dense:  shapes 64xNxK, 128xNxK; N = {8, 16, 24, 32, 48, ... 256}, in steps of 16 after N > 32; K = 32
      Sparse: same shapes and N; K = 64
  • No .ws, CTA group 2:
      Dense:  shapes 128xNxK, 256xNxK; N = {32, 64, ... 256} in steps of 32; K = 32
      Sparse: same shapes and N; K = 64
  • .ws, CTA group 1:
      Dense:  shapes 32xNxK, 64xNxK, 128xNxK; N = {64, 128, 256}; K = 32
      Sparse: same shapes; N = {64, 128}; K = 64
  • .ws, CTA group 2: Invalid (Dense and Sparse)

.kind::mxf4 (dtype .f32; atype/btype .e2m1; scale type .ue8m0)
  • No .ws, CTA group 1:
      Dense:  shape 128xNxK; N = {8, 16, ... 256} in steps of 8; K = 64
      Sparse: same shape and N; K = 128
  • No .ws, CTA group 2:
      Dense:  shapes 128xNxK, 256xNxK (K = 64) and 256xNxK1 (K1 = 96); N = {16, 32, ... 256} in steps of 16
      Sparse: shape 256xNxK; K = 128
  • .ws, CTA group 1 or 2: Invalid (Dense and Sparse)

.kind::mxf4nvf4 (dtype .f32; atype/btype .e2m1; scale type .ue8m0, .ue4m3)
  • No .ws, CTA group 1:
      Dense:  shape 128xNxK; N = {8, 16, ... 256} in steps of 8; K = 64
      Sparse: same shape and N; K = 128
  • No .ws, CTA group 2:
      Dense:  shapes 128xNxK, 256xNxK (K = 64) and 256xNxK1 (K1 = 96); N = {16, 32, ... 256} in steps of 16
      Sparse: shape 256xNxK; K = 128
  • .ws, CTA group 1 or 2: Invalid (Dense and Sparse)

9.7.16.2.1.1.Target ISA Note
  • K = 96 is only supported for target architecture sm_103a.

9.7.16.2.2.Specifying Matrix Shape

M and N can be specified in the Instruction descriptor.

K cannot be explicitly specified; it is implicitly determined by the MMA-kind and the sparsity, as shown in Table 39.
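The implicit K values from Table 39 can be collected in a small lookup, keyed by MMA-kind and sparsity; this table is an illustrative summary, not part of the ISA (the additional K = 96 case for .kind::mxf4 / .kind::mxf4nvf4 on sm_103a is omitted):

```python
# K for each MMA-kind, keyed by (kind, sparse); values taken from Table 39.
K_DIM = {
    ("f16", False): 16,      ("f16", True): 32,
    ("tf32", False): 8,      ("tf32", True): 16,
    ("f8f6f4", False): 32,   ("f8f6f4", True): 64,
    ("mxf8f6f4", False): 32, ("mxf8f6f4", True): 64,
    ("i8", False): 32,       ("i8", True): 64,
    ("mxf4", False): 64,     ("mxf4", True): 128,
    ("mxf4nvf4", False): 64, ("mxf4nvf4", True): 128,
}
```

For every kind, the sparse K is twice the dense K.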

9.7.16.2.3.Data Movement Shape

The data movement shape indicates the dimension of the data to be moved to or from the Tensor Memory. These shapes are described as a tuple lane x size where:

  • lane indicates the number of rows in the Tensor Memory; and

  • size indicates the amount of data, in units of bits (b), across the columns in the Tensor Memory.

The following shapes are supported by various tcgen05 operations:

Shape                                                 tcgen05.<op>
.16x64b, .16x128b, .16x256b, .16x32bx2, .32x32b       .ld / .st
.4x256b, .32x128b, .64x128b, .128x256b, .128x128b     .cp
.31x256b (implicit)                                   .shift

9.7.16.2.3.1.Memory Layout

The following shows the layout of the matrix fragments across threads of the warp.

9.7.16.2.3.1.1.Matrix fragments for shape .32x32b

A tcgen05{.ld,.st}.32x32b instruction has the following data vector register.

Fragment: a vector expression containing .num number of .b32 registers, as mentioned in Table 47.

Elements (low to high): r0, r1, ...

A warp executing tcgen05{.ld,.st}.32x32b will access 32 lanes of the Tensor Memory. It loads from or stores to (32 * .num) bits of data in each lane, as shown in Figure 183.

_images/tcgen05-mma-fragment-3232b.png

Figure 183 Matrix Fragment for shape .32x32b

9.7.16.2.3.1.2.Matrix fragments for shape .16x64b

A tcgen05{.ld,.st}.16x64b instruction has the following data vector register.

Fragment: a vector expression containing .num number of .b32 registers, as mentioned in Table 47.

Elements (low to high): r0, r1, ...

A warp executingtcgen05{.ld,.st}.16x64b will access 16 lanes of the Tensor Memory.It loads from or stores to each of the lane (64 * .num)-bits of data as shown inFigure 184.

_images/tcgen05-mma-fragment-1664b.png

Figure 184 Matrix Fragment for shape .16x64b

9.7.16.2.3.1.3.Matrix fragments for shape .16x128b

A tcgen05{.ld,.st}.16x128b instruction has the following data vector register.

Fragment: a vector expression containing .num number of .b32 registers, as mentioned in Table 47.

Elements (low to high): r0, r1, ...

A warp executing tcgen05{.ld,.st}.16x128b will access 16 lanes of the Tensor Memory. It loads from or stores to (128 * .num) bits of data in each lane, as shown in Figure 185.

_images/tcgen05-mma-fragment-16128b.png

Figure 185 Matrix Fragment for shape .16x128b

9.7.16.2.3.1.4.Matrix fragments for shape .16x256b

A tcgen05{.ld,.st}.16x256b instruction has the following data vector register.

Fragment: a vector expression containing .num number of .b32 registers, as mentioned in Table 47.

Elements (low to high): r0, r1, r2, r3, ...

A warp executing tcgen05{.ld,.st}.16x256b will access 16 lanes of the Tensor Memory. It loads from or stores to (256 * .num) bits of data in each lane, as shown in Figure 186.

_images/tcgen05-mma-fragment-16256b.png

Figure 186 Matrix Fragment for shape .16x256b

9.7.16.2.3.1.5.Matrix fragments for shape .16x32bx2

A tcgen05{.ld,.st}.16x32bx2 instruction has the following data vector register.

Fragment: a vector expression containing .num number of .b32 registers, as mentioned in Table 47.

Elements (low to high): r0, r1, ...

A warp executing tcgen05{.ld,.st}.16x32bx2 will access 16 lanes of the Tensor Memory. It loads from or stores to (32 * .num) bits of data in each lane, as shown in Figure 187.

_images/tcgen05-mma-fragment-1632b2.png

Figure 187 Matrix Fragment for shape .16x32bx2

9.7.16.3.Major-ness supported by Strides

There are two strides involved while accessing a matrix from shared memory:

  1. Leading dimension stride (byte offset or absolute address)

  2. Stride dimension byte offset

9.7.16.3.1.Leading Dimension Stride: relative offset or absolute address

There are two modes of leading dimension strides, as described below. Bit #52 in the Shared memory descriptor is used to distinguish between the two modes.

9.7.16.3.1.1.Relative offset mode

In this mode, the leading dimension stride is specified as a relative byte offset between the columns. The leading dimension stride is defined differently for transposed and non-transposed matrices. For matrices whose element types are normalized to 128 bits, it is defined as follows:

K-Major:

  • No-Swizzling: the stride from the first column to the second column of the 8x2 tile in the 128-bit element type normalized matrix.

  • Swizzled layouts: not used, assumed to be 1.

MN-Major:

  • Interleave: stride from the first 8 columns to the next 8 columns.

  • Swizzled layouts: stride from the first (swizzle-byte-size/16) rows to the next (swizzle-byte-size/16) rows.

9.7.16.3.1.2.Absolute address mode for K dimension being 48B

The tcgen05.mma instruction with a K dimension of 48B would overflow the 128B shared memory boundary if the data is packed contiguously.

In this case, the absolute address mode can be used to break up the data in the shared memory into two chunks such that both chunks are laid out within the aligned 128-byte address boundary. The leading dimension absolute address can point to the second data chunk in the shared memory.

9.7.16.3.1.2.1.Restrictions on the Leading Dimension Absolute Address Stride

Following are the restrictions on the absolute address stride mode:

  1. Only 128B swizzle (with 16B atomicity) is supported.

  2. Only K-Major mode is supported. That is, the transpose bits (bits #15 and #16) in the Instruction descriptor must be 0.

  3. The matrix base offset must be 0.

9.7.16.3.2.Stride Dimension Byte Offset

The stride dimension byte offset is defined differently for transposed and non-transposed matrices. For matrices whose element types are normalized to 128 bits, it is defined as follows:

K-Major: the offset from the first 8 rows to the next 8 rows.

MN-Major:

  • Interleave: offset from the first row to the next row.

  • Swizzled layout: offset from the first 8 columns to the next 8 columns.

9.7.16.3.3.Canonical Layouts

In terms of CuTe layouts, the canonical layout can be expressed as follows:

Major-ness  Swizzling mode               Canonical Layout without swizzling     Swizzling on the previous column
MN-major    No-swizzling or Interleaved  ((T,1,m),(8,k)):((1,T,SBO),(1T,LBO))   Swizzle<0, 4, 3>
MN-major    32B Swizzling                ((T,2,m),(8,k)):((1,T,LBO),(2T,SBO))   Swizzle<1, 4, 3>
MN-major    64B Swizzling                ((T,4,m),(8,k)):((1,T,LBO),(4T,SBO))   Swizzle<2, 4, 3>
MN-major    128B Swizzling               ((T,8,m),(8,k)):((1,T,LBO),(8T,SBO))   Swizzle<3, 4, 3>
K-major     No-swizzling or Interleaved  ((8,m),(T,2k)):((1T,SBO),(1,LBO))      Swizzle<0, 4, 3>
K-major     32B Swizzling                ((8,m),(T,2k)):((2T,SBO),(1,T))        Swizzle<1, 4, 3>
K-major     64B Swizzling                ((8,m),(T,2k)):((4T,SBO),(1,T))        Swizzle<2, 4, 3>
K-major     128B Swizzling               ((8,m),(T,2k)):((8T,SBO),(1,T))        Swizzle<3, 4, 3>

where

  • T = 128 / sizeof-element-in-bits. T represents the scale factor that normalizes matrix element types to 128 bits.

  • m represents the number of repeating patterns across rows.

  • k represents the number of repeating patterns across columns.

Examples

  • K-Major, no-swizzling and tf32 type: Figure 188

    _images/async-warpgroup-k-no-swizzle-tf32.png

    Figure 188 K major, no-swizzling and tf32 type

    The strides and related details are as follows:

    Exact layout: Swizzle<0,4,3> o ((8,2),(4,4)):((4,32),(1,64))

    Canonical layout: Swizzle<0,4,3> o ((8,m),(T,2k)):((1T,SBO),(1,LBO))

    Parameters and values:
      T = 4
      m = 2
      k = 2
      LBO (relative offset) = 64*sizeof(tf32)
      SBO = 32*sizeof(tf32)
      Encoding of LBO in descriptor: (LBO) >> 4 = 16
      Encoding of SBO in descriptor: (SBO) >> 4 = 8

  • K-Major, 32B swizzling and tf32 type: Figure 189

    _images/async-warpgroup-k-32B-swizzle-tf32.png

    Figure 189 K major, 32B swizzling and tf32 type

    The strides and related details are as follows:

    Exact layout: Swizzle<1,4,3> o ((8,2),(4,4)):((8,64),(1,4))

    Canonical layout: Swizzle<1,4,3> o ((8,m),(T,2k)):((2T,SBO),(1,T))

    Parameters and values:
      T = 4
      m = 2
      k = 2
      LBO (relative offset) = NA
      SBO = 64*sizeof(tf32)
      Encoding of LBO in descriptor: 1 (assumed)
      Encoding of SBO in descriptor: (SBO) >> 4 = 16

  • MN-Major, no-swizzling and bf16 type: Figure 190

    _images/async-warpgroup-mn-no-swizzle-bf16.png

    Figure 190 MN major, no-swizzling and bf16 type

    The strides and related details are as follows:

    Exact layout: Swizzle<0,4,3> o ((8,1,2),(8,2)):((1,8,64),(8,128))

    Canonical layout: Swizzle<0,4,3> o ((T,1,m),(8,k)):((1,T,SBO),(1T,LBO))

    Parameters and values:
      T = 8
      m = 2
      k = 2
      LBO (relative offset) = 128*sizeof(bf16)
      SBO = 64*sizeof(bf16)
      Encoding of LBO in descriptor: (LBO) >> 4 = 16
      Encoding of SBO in descriptor: (SBO) >> 4 = 8

  • MN-Major, 32B swizzling and bf16 type: Figure 191

    _images/async-warpgroup-mn-32B-swizzle-bf16.png

    Figure 191 MN major, 32B swizzling and bf16 type

    The strides and related details are as follows:

    Exact layout: Swizzle<1,4,3> o ((8,2,2),(8,2)):((1,8,128),(16,256))

    Canonical layout: Swizzle<1,4,3> o ((T,2,m),(8,k)):((1,T,LBO),(2T,SBO))

    Parameters and values:
      T = 8
      m = 2
      k = 2
      LBO (relative offset) = 128*sizeof(bf16)
      SBO = 256*sizeof(bf16)
      Encoding of LBO in descriptor: (LBO) >> 4 = 16
      Encoding of SBO in descriptor: (SBO) >> 4 = 32

  • MN-Major, 64B swizzling and bf16 type: Figure 192

    _images/async-warpgroup-mn-64B-swizzle-bf16.png

    Figure 192 MN major, 64B swizzling and bf16 type

    The strides and related details are as follows:

    Exact layout: Swizzle<2,4,3> o ((8,4,2),(8,2)):((1,8,256),(32,512))

    Canonical layout: Swizzle<2,4,3> o ((T,4,m),(8,k)):((1,T,LBO),(4T,SBO))

    Parameters and values:
      T = 8
      m = 2
      k = 2
      LBO (relative offset) = 256*sizeof(bf16)
      SBO = 512*sizeof(bf16)
      Encoding of LBO in descriptor: (LBO) >> 4 = 32
      Encoding of SBO in descriptor: (SBO) >> 4 = 64

9.7.16.4.Matrix Descriptors

There are three kinds of matrix descriptors used by the tcgen05 family of instructions.

9.7.16.4.1.Shared memory descriptor

The shared memory descriptor describes the properties of the multiplicand matrix in shared memory, including its location in the shared memory of the current CTA. It is a 64-bit value contained in a register with the following layout:

Table 40 Shared memory descriptor layout

Bit-field  Size (bits)  Description
0-13       14           matrix-descriptor-encode (Matrix start address)
16-29      14           matrix-descriptor-encode (Leading dimension byte offset, relative) OR
                        matrix-descriptor-encode (Leading dimension byte address, absolute)
32-45      14           matrix-descriptor-encode (Stride dimension byte offset)
46-48      3            Fixed constant value of 0b001
49-51      3            Matrix base offset
52         1            Leading dimension stride mode: 0 = byte offset relative, 1 = byte address absolute
53-60      8            Fixed constant value of 0b00000000
61-63      3            Swizzling mode: 0 = no swizzling, 1 = 128-Byte with 32B atomic swizzling,
                        2 = 128-Byte swizzling, 4 = 64-Byte swizzling, 6 = 32-Byte swizzling
                        (values 3, 5 and 7 are invalid)

where matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4

The value of the base offset is 0 when the repeating pattern of the specified swizzling mode starts at the boundary shown in Table 41.

Table 41 Starting address of repeating pattern for various swizzling modes

Swizzling mode    Starting address of the repeating pattern
128-Byte swizzle  1024-Byte boundary
64-Byte swizzle   512-Byte boundary
32-Byte swizzle   256-Byte boundary

Otherwise, the base offset must be a non-zero value, computed using the following formula:

base_offset = (pattern_start_addr >> 0x7) & 0x7

The following must be 16-byte aligned:

  1. Matrix start address

  2. Leading dimension byte offset

  3. Stride dimension byte offset
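As an illustrative (non-normative) sketch, the descriptor layout of Table 40 can be assembled in Python; the helper names are hypothetical:

```python
def encode(x):
    """matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4"""
    return (x & 0x3FFFF) >> 4

def smem_descriptor(start_addr, lead_off, stride_off, base_offset=0,
                    lead_mode_absolute=False, swizzle=0):
    """Assemble the 64-bit shared memory descriptor per Table 40.
    start_addr, lead_off and stride_off must be 16-byte aligned."""
    assert start_addr % 16 == 0 and lead_off % 16 == 0 and stride_off % 16 == 0
    d = 0
    d |= encode(start_addr)                       # bits 0-13
    d |= encode(lead_off) << 16                   # bits 16-29
    d |= encode(stride_off) << 32                 # bits 32-45
    d |= 0b001 << 46                              # bits 46-48: fixed constant
    d |= (base_offset & 0x7) << 49                # bits 49-51
    d |= (1 if lead_mode_absolute else 0) << 52   # bit 52: stride mode
    d |= (swizzle & 0x7) << 61                    # bits 61-63: swizzling mode
    return d
```

For example, a 128-Byte-swizzled (mode 2) matrix at shared memory address 0x4000 yields a descriptor whose bits 46-48 are 0b001 and bits 61-63 are 2.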

9.7.16.4.1.1.Target ISA Note
  • The byte address mode for the leading dimension stride is supported on sm_103a.

9.7.16.4.2.Instruction descriptor

The instruction descriptor describes the shapes, types and other details of all the matrices and the matrix multiply-and-accumulate operation. It is a 32-bit value in registers and the exact layout depends on the MMA-kind:

Table 42 Instruction descriptor format for .kind::tf32, .kind::f16, .kind::f8f6f4 and .kind::i8

Bits   Size (bits)  Description                                Values
0-1    2            Sparsity selector, if sparsity is enabled  0-3
2      1            Sparsity                                   Dense = 0, Sparse = 1
3      1            Saturate for integer types                 0 (NA) for .kind::tf32/.kind::f16/.kind::f8f6f4;
                                                               No Saturate = 0, Saturate = 1 for .kind::i8
4-5    2            dtype (Matrix D type)                      .kind::tf32: F32 = 1
                                                               .kind::f16, .kind::f8f6f4: F16 = 0, F32 = 1
                                                               .kind::i8: S32 = 2
6      1            Reserved                                   0
7-9    3            atype (Matrix A type)                      .kind::tf32: TF32 = 2
                                                               .kind::f16: F16 = 0, BF16 = 1
                                                               .kind::f8f6f4: E4M3 = 0, E5M2 = 1, E2M3 = 3,
                                                               E3M2 = 4, E2M1 = 5
                                                               .kind::i8: Unsigned 8b = 0, Signed 8b = 1
10-12  3            btype (Matrix B type)                      Same encoding as atype
13     1            Negate A Matrix                            No Negate = 0, Negate = 1
                                                               (.kind::i8: No Negate = 0 only)
14     1            Negate B Matrix                            Same as Negate A Matrix
15     1            Transpose A Matrix                         No Transpose = 0, Transpose = 1
16     1            Transpose B Matrix                         No Transpose = 0, Transpose = 1
17-22  6            N, dimension of Matrix B                   N >> 3
                    (3 LSBs not included)
23     1            Reserved                                   0
24-28  5            M, dimension of Matrix A                   M >> 4
                    (4 LSBs not included)
29     1            Reserved                                   0
30-31  2            Maximum shift while attempting             no shift = 0, maximum shift of 8 = 1,
                    B matrix reuse in .ws                      maximum shift of 16 = 2, maximum shift of 32 = 3

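As an illustration (not part of the ISA), packing a .kind::f16 instruction descriptor per Table 42 can be sketched as follows; the function name and argument conventions are hypothetical:

```python
def f16_instr_desc(dtype_f32, atype, btype, trans_a, trans_b, n, m,
                   sparse=False, sparsity_sel=0):
    """Pack a 32-bit .kind::f16 instruction descriptor per Table 42.
    atype/btype encoding for .kind::f16: F16 = 0, BF16 = 1.
    n and m are the actual N and M matrix dimensions."""
    d = 0
    d |= (sparsity_sel & 0x3)           # bits 0-1: sparsity selector
    d |= (1 if sparse else 0) << 2      # bit 2: sparsity
    d |= (1 if dtype_f32 else 0) << 4   # bits 4-5: dtype, F16 = 0 / F32 = 1
    d |= (atype & 0x7) << 7             # bits 7-9: atype
    d |= (btype & 0x7) << 10            # bits 10-12: btype
    d |= (trans_a & 1) << 15            # bit 15: transpose A
    d |= (trans_b & 1) << 16            # bit 16: transpose B
    d |= ((n >> 3) & 0x3F) << 17        # bits 17-22: N >> 3
    d |= ((m >> 4) & 0x1F) << 24        # bits 24-28: M >> 4
    return d
```

For example, a dense BF16 MMA with F32 accumulation, N = 256 and M = 128 stores 32 in the N field and 8 in the M field.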
Table 43 Instruction descriptor format for .kind::mxf8f6f4

Bits   Size (bits)  Description                                      Values
0-1    2            Reserved                                         0
2      1            Sparsity                                         Dense = 0, Sparse = 1
3      1            Reserved                                         0
4-5    2            Matrix B Scale Factor Data ID                    0-3
6      1            Reserved                                         0
7-9    3            atype (Matrix A type)                            E4M3 = 0, E5M2 = 1, E2M3 = 3,
                                                                     E3M2 = 4, E2M1 = 5
10-12  3            btype (Matrix B type)                            Same encoding as atype
13     1            Negate A Matrix                                  No Negate = 0, Negate = 1
14     1            Negate B Matrix                                  No Negate = 0, Negate = 1
15     1            Transpose A Matrix                               No Transpose = 0, Transpose = 1
16     1            Transpose B Matrix                               No Transpose = 0, Transpose = 1
17-22  6            N, dimension of Matrix B (3 LSBs not included)   N >> 3
23     1            Scale Matrix Type, for both scale_A / scale_B    UE8M0 = 1
24-26  3            Reserved                                         0
27-28  2            M, dimension of Matrix A (7 LSBs not included)   M >> 7
29-30  2            Matrix A Scale Factor Data ID                    0-3
31     1            Reserved                                         0

Table 44 Instruction descriptor format for .kind::mxf4 and .kind::mxf4nvf4

Bits   Size (bits)  Description                                      Values
0-1    2            Reserved                                         0
2      1            Sparsity                                         Dense = 0, Sparse = 1
3      1            Reserved                                         0
4-5    2            Matrix B Scale Factor Data ID                    0 or 2
6      1            Reserved                                         0
7-9    3            atype (Matrix A type)                            E2M1 = 1
10-11  2            btype (Matrix B type)                            E2M1 = 1
12     1            Reserved                                         0
13     1            Negate A Matrix                                  No Negate = 0, Negate = 1
14     1            Negate B Matrix                                  No Negate = 0, Negate = 1
15     1            Transpose A Matrix                               No Transpose = 0
16     1            Transpose B Matrix                               No Transpose = 0
17-22  6            N, dimension of Matrix B (3 LSBs not included)   N >> 3
23     1            Scale Matrix Type, for both scale_A / scale_B    .kind::mxf4: UE8M0 = 1
                                                                     .kind::mxf4nvf4: UE8M0 = 1, UE4M3 = 0
24-26  3            Reserved                                         0
27-28  2            M, dimension of Matrix A (7 LSBs not included)   M >> 7
29-30  2            Matrix A Scale Factor Data ID                    0 or 2
31     1            K Dimension                                      (Dense K=64 / Sparse K=128) = 0,
                                                                     (Dense K=96) = 1

9.7.16.4.3.Zero-Column Mask Descriptor

The zero-column mask descriptor is used to generate a mask that specifies which columns of the B matrix will have zero value for the MMA operation regardless of the values present in the shared memory. The total size of the generated mask is N bits.

A 0-bit in the mask specifies that the values of the corresponding column in matrix B should be used for the MMA operation. A 1-bit in the mask specifies that 0s must be used for the entire column for the MMA operation.

The zero-column mask descriptor is a 64-bit value in registers with the following layout:

Table 45 Zero-Column Mask descriptor layout

Bits   Size (bits)  Field Name            Description
0-7    8            Start Count 0 (sc0)   Specifies the LSBs that must be skipped
8-15   8            Start Count 1 (sc1)   for sub-mask mask-i
16-23  8            Start Count 2 (sc2)
24-31  8            Start Count 3 (sc3)
32     1            First Span 0 (fs0)    Specifies the starting value for
33     1            First Span 1 (fs1)    sub-mask mask-i
34     1            First Span 2 (fs2)
35     1            First Span 3 (fs3)
36-38  3            Reserved
39     1            Non-Zero Mask         Value 0 indicates the generated mask will have all 0s;
                                          value 1 indicates the mask has to be generated
40-47  8            Skip Span             (Count of consecutive columns where 0s are used) - 1
48-55  8            Use Span              (Count of consecutive columns where B matrix is used) - 1
56-61  6            Column Shift          Shifts the columns by the specified amount, thus
                                          allowing MMA on a non-0 starting column.
                                          Max shift amount = 16 for M=32;
                                          max shift amount = 32 otherwise

The zero-column mask is made up of one or more sub-masks depending on M, as shown in the table:

M     Zero-Column Mask breakup                  Sub-masks                   First Span used     Start Count used
128   Single sub-mask of size N bits            mask0                       fs0                 sc0
64    Two sub-masks, each of size N/2 bits      mask0, mask1                fs0, fs1            sc0, sc1
32    Four sub-masks, each of size N/4 bits     mask0, mask1, mask2, mask3  fs0, fs1, fs2, fs3  sc0, sc1, sc2, sc3

The following table shows the coverage of the sub-masks across the N dimension:

Sub-mask   M = 128            M = 64               M = 32
mask0      Columns [0, N-1]   Columns [0, N/2-1]   Columns [0, N/4-1]
mask1                         Columns [N/2, N-1]   Columns [N/4, N/2-1]
mask2                                              Columns [N/2, (N/4*3)-1]
mask3                                              Columns [(N/4*3), N-1]

The following examples show zero-column mask descriptors and their corresponding generated masks:

  1. Example 1: M = 128

    Input zero-column mask descriptor:

      Start count   = {0, 0, 0, 0}
      First span    = {0, 0, 0, 0}
      Non-Zero Mask = 0
      Skip Span     = 4
      Use Span      = 3
      Shift         = 0

    Output zero-column mask: 0x0.

    As the Non-Zero Mask field is 0, the mask is 0x0. All the columns of the matrix B will be used for the MMA operation.

  2. Example 2: M = 128

    Input zero-column mask descriptor:

      Start count   = {-, -, -, 0}
      First span    = {-, -, -, 0}
      Non-Zero Mask = 1
      Skip Span     = 2
      Use Span      = 3
      Shift         = 0

    Output mask0: 0b … 111 0000 111 0000 (size = N)

  3. Example 3: M = 64

    Input zero-column mask descriptor:

      Start count {.., sc1, sc0} = {-, -, 0, 0}
      First span  {.., fs1, fs0} = {-, -, 0, 1}
      Non-Zero Mask = 1
      Skip Span     = 2
      Use Span      = 3
      Shift         = 0

    Output mask0: 0b … 111 0000 111 0000 111

    Output mask1: 0b … 0000 111 0000 111 0000

  4. Example 4: M = 32

    Input zero-column mask descriptor:

      Start count {sc3, sc2, sc1, sc0} = {1, 2, 1, 0}
      First span  {fs3, fs2, fs1, fs0} = {0, 0, 1, 1}
      Non-Zero Mask = 1
      Skip Span     = 2
      Use Span      = 3
      Shift         = 2

    Output mask0: 0b … 0000 111 0000 111

    Output mask1: 0b … 0000 111 0000 11

    Output mask2: 0b … 111 0000 111 00

    Output mask3: 0b … 111 0000 111 000

    If N = 128 then the B matrix columns from 2 to 129 will be used for the MMA operation, due to the shift of 2.
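The sub-mask generation implied by the examples above can be sketched in Python. This is an illustrative model, not the hardware algorithm; the function name and parameter conventions are hypothetical, and the column-shift and non-zero-mask fields are not modeled:

```python
def gen_submask(n_bits, first_span, start_count, skip_span, use_span):
    """Generate one sub-mask (bit 0 = lowest column) from zero-column mask
    descriptor fields. skip_span/use_span are the raw descriptor fields,
    i.e. (span length - 1). A 1-bit feeds 0s for that column; a 0-bit uses
    the B matrix column. first_span selects which span the repeating
    pattern starts with, and start_count skips that many LSBs of it."""
    ones_run = skip_span + 1    # consecutive columns where 0s are used
    zeros_run = use_span + 1    # consecutive columns where B matrix is used
    bits, cur = [], first_span
    while len(bits) < n_bits + start_count:
        bits.extend([cur] * (ones_run if cur == 1 else zeros_run))
        cur ^= 1
    bits = bits[start_count:start_count + n_bits]
    return sum(b << i for i, b in enumerate(bits))
```

With the descriptor of Example 2 (fs = 0, sc = 0, skip = 2, use = 3), the low 14 bits come out as 0b11100001110000, matching the listed output.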

9.7.16.5.Issue Granularity

Each of the tcgen05 operations has different requirements for the number of threads/warps that need to issue them.

The following table lists the execution granularity requirements of each tcgen05 operation:

Table 46 Execution granularity requirements for tcgen05 operations

.mma, .cp, .shift, .commit
  ::1  An issue from a single thread in the current CTA initiates the base operation.
  ::2  An issue from a single thread from the CTA-Pair initiates the base operation.
       When the current CTA issues the operation, the peer CTA should be active and
       should not have exited.

.alloc, .dealloc, .relinquish_alloc_permit
  ::1  An issue from a single warp in the current CTA initiates the allocation
       management instruction.
  ::2  Two warps, one in each of the current CTA and its Peer CTA, collectively
       need to perform the operation. When the current CTA issues the operation,
       the peer CTA should be active and should not have exited.

.ld, .st, .wait::{ld,st}
  N/A  An issue from a warp in the current CTA can access only 1/4 of the Tensor
       Memory of the current CTA, so a warpgroup is needed to access the entire
       Tensor Memory of the current CTA.

.fence::*
  N/A  A thread needs to fence all its accesses to the Tensor Memory that it wants
       to order with other accesses to the Tensor Memory from other threads.

9.7.16.5.1.CTA Pair

Any 2 CTAs within the cluster whose %cluster_ctarank differs only in the last bit are said to form a CTA pair.

Within a CTA pair, the CTA whose last bit in %cluster_ctarank is:

  • 0 is termed the even numbered CTA within the CTA pair.

  • 1 is termed the odd numbered CTA within the CTA pair.

Most of the tcgen05 operations can execute either at single-CTA granularity or at CTA-pair granularity. When a tcgen05 operation is performed at CTA-pair granularity, the Tensor Memory of both CTAs within the CTA pair is accessed. The set of threads that need to issue the tcgen05 operation is listed in Issue Granularity.

9.7.16.5.2.Peer CTA

The peer CTA of the odd CTA within the CTA pair is the even CTA in the same pair.Similarly, the peer CTA of the even CTA within the CTA pair is the odd CTA in the same pair.
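Since the two CTAs of a pair differ only in the last bit of %cluster_ctarank, the peer relationship above reduces to flipping that bit; a trivial illustrative helper:

```python
def peer_cta_rank(cluster_ctarank):
    """Return the %cluster_ctarank of the peer CTA within the CTA pair:
    the peer differs from the current CTA only in the last bit."""
    return cluster_ctarank ^ 1
```

So the peer of rank 4 (even) is rank 5 (odd), and vice versa.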

9.7.16.6.Memory Consistency Model for 5th generation of TensorCore operations

Ordering of tcgen05 instructions is described in terms of two key concepts:

  1. Pipelined tcgen05 instructions

  2. Specialized tcgen05-specific inter-thread synchronization mechanisms.

These concepts combine to form four canonical synchronization patterns, as described further below.

9.7.16.6.1.Asynchronous Operations

The tcgen05 family of instructions is divided into 2 categories:

  1. Asynchronous instructions:

    These tcgen05 operations are not inherently ordered with respect to other tcgen05 operations in the same thread (unless pipelined, as described below).

  2. Synchronous instructions:

    These tcgen05 operations are inherently ordered with respect to other tcgen05 operations in the same thread.

    The Tensor Memory allocation related instructions that access shared memory maintain same-address ordering with respect to non-tcgen05 instructions.

The following table lists the category of each tcgen05 instruction:

tcgen05.* operation                               Category
.alloc, .dealloc, .relinquish_alloc_permit,       Synchronous instructions
.fence::*, .wait::*, .commit
.mma, .cp, .shift, .ld, .st                       Asynchronous instructions

9.7.16.6.2. Pipelined tcgen05 Instructions

The asynchronous tcgen05 operations may execute and complete in a different order than they were issued. However, some specific pairs of asynchronous tcgen05 instructions form tcgen05 pipelines, wherein the two asynchronous operations are guaranteed to execute in the same order as the instructions that issued them. The specific pairings are as follows:

  1. tcgen05.mma.cta_group::N -> tcgen05.mma.cta_group::N (same N, accumulator and shape)

  2. tcgen05.cp.cta_group::N -> tcgen05.mma.cta_group::N (same N)

  3. tcgen05.shift.cta_group::N -> tcgen05.mma.cta_group::N (same N)

  4. tcgen05.shift.cta_group::N -> tcgen05.cp.4x256b.cta_group::N (same N)

  5. tcgen05.mma.cta_group::N -> tcgen05.shift.cta_group::N (same N)
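The pairing rules above can be modeled as a small lookup (an illustrative sketch in Python; the operation names are shorthand, not PTX syntax, and the extra mma -> mma conditions on accumulator and shape are noted but not modeled):

```python
# Directional pairs (first issued -> second issued) that form a tcgen05
# pipeline; both instructions must use the same .cta_group::N. The
# mma -> mma pairing additionally requires the same accumulator and shape.

PIPELINED_PAIRS = {
    ("mma", "mma"),
    ("cp", "mma"),
    ("shift", "mma"),
    ("shift", "cp.4x256b"),
    ("mma", "shift"),
}

def forms_pipeline(first: str, second: str, n_first: int, n_second: int) -> bool:
    """True when the two operations are guaranteed to execute in issue order."""
    return n_first == n_second and (first, second) in PIPELINED_PAIRS
```

Note that the pairing is directional: tcgen05.cp followed by tcgen05.mma is pipelined, but the reverse order is not.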

9.7.16.6.2.1. Implicitly pipelined tcgen05 Instructions

Instructions tcgen05.commit and tcgen05.wait are implicitly pipelined with respect to the previously issued tcgen05.{mma,cp,shift} and tcgen05.{ld,st} instructions, respectively, that they track from the same thread.

9.7.16.6.2.1.1. mbarrier based completion mechanism

Completion of the following instructions' asynchronous operations is observed through the mbarrier based waiting mechanism:

  1. tcgen05.mma

  2. tcgen05.cp

  3. tcgen05.shift

tcgen05.commit is used to track the completion of the above asynchronous instructions.

The following are the implicitly pipelined tcgen05 instruction pairings that use the mbarrier based completion mechanism:

  • tcgen05.mma.cta_group::N -> tcgen05.commit.cta_group::N (same N)

  • tcgen05.cp.cta_group::N -> tcgen05.commit.cta_group::N (same N)

  • tcgen05.shift.cta_group::N -> tcgen05.commit.cta_group::N (same N)

9.7.16.6.2.1.2. tcgen05.wait instruction based completion mechanism

Completion of the following instructions' asynchronous operations is observed through the tcgen05.wait based waiting mechanism:

  1. tcgen05.ld

  2. tcgen05.st

tcgen05.wait::ld and tcgen05.wait::st are used to track the completion of the tcgen05.ld and tcgen05.st asynchronous instructions, respectively.

The following are the implicitly pipelined tcgen05 instruction pairings that use the tcgen05.wait based completion mechanism:

  • tcgen05.ld -> tcgen05.wait::ld

  • tcgen05.st -> tcgen05.wait::st

9.7.16.6.3. Specialized Inter-thread Synchronization for tcgen05 instructions

The tcgen05 instructions support specialized inter-thread synchronization mechanisms that are optimized for the tcgen05 family of instructions. The standard memory consistency model synchronization mechanisms also apply to the tcgen05 family of instructions.

The TensorCore 5th Generation Specialized Synchronization Operations section describes the specialized inter-thread synchronization for tcgen05 instructions.

The tcgen05.fence::before_thread_sync and tcgen05.fence::after_thread_sync instructions compose with execution ordering instructions, like morally strong ld/st/atom instructions, mbarrier instructions, barrier instructions and so on, to establish an ordering between the tcgen05 operations across threads. The asynchronous tcgen05 instructions that are ordered across threads also form a tcgen05 pipeline.

An asynchronous tcgen05 operation prior to a tcgen05.fence::before_thread_sync is ordered before all subsequent tcgen05 and execution ordering operations.

An asynchronous tcgen05 operation subsequent to a tcgen05.fence::after_thread_sync is ordered after all prior tcgen05 and execution ordering operations.

9.7.16.6.4. Canonical synchronization patterns

Using the above rules, the following are the five canonical synchronization patterns:

9.7.16.6.4.1. Pipelined instructions, same thread

In this pattern, no explicit ordering mechanism is needed; the ordering guarantee is provided by the pipelined instruction pairing.

Example:

tcgen05.mma
tcgen05.mma   // same shape and accumulator

The two instructions will be executed in program order.

9.7.16.6.4.2. Non-pipelined instructions, same thread

In this pattern, explicit waiting mechanisms are used to wait for the completion of the asynchronous tcgen05 operations.

Example 1:

tcgen05.st
tcgen05.wait::st
tcgen05.ld

tcgen05.wait::st is used to wait for the completion of the prior asynchronous instruction tcgen05.st.

Example 2:

tcgen05.mma [d], ...
tcgen05.commit.mbarrier::arrive::one
mbarrier.try_wait.relaxed.cluster   // loop until successful
tcgen05.fence::after_thread_sync
tcgen05.ld [d], ...

For the completion of the asynchronous tcgen05.mma, tcgen05.commit is used.

As tcgen05.ld is an asynchronous operation, the instruction tcgen05.fence::after_thread_sync is needed.

No explicit tcgen05.fence::before_thread_sync is needed, as this is implicitly performed by tcgen05.commit. The combination of tcgen05.mma and tcgen05.commit forms a conceptual asynchronous pipeline and establishes execution ordering.

Without the implicit fence from tcgen05.commit, an explicit fence would be required:

tcgen05.mma [d], ...
tcgen05.fence::before_thread_sync
mbarrier.arrive

9.7.16.6.4.3. Pipelined instructions, different thread

In this pattern, no explicit waiting mechanism is needed, but proper synchronization between threads is needed.

Example:

Thread 0:

tcgen05.cp
tcgen05.fence::before_thread_sync
mbarrier.arrive.relaxed.cluster

Thread 1:

mbarrier.try_wait.relaxed.cluster   // loop till success
tcgen05.fence::after_thread_sync
tcgen05.mma

9.7.16.6.4.4. Non-pipelined instructions, different thread

In this pattern, the producer threads that issue the asynchronous tcgen05 instructions must explicitly wait for the instructions' completion before synchronizing with the consumer threads.

Example 1:

Thread 0:

tcgen05.ld
tcgen05.wait::ld
tcgen05.fence::before_thread_sync
mbarrier.arrive.relaxed.cluster

Thread 1:

mbarrier.try_wait.relaxed.cluster   // loop till success
tcgen05.fence::after_thread_sync
tcgen05.mma

Example 2:

Thread 0:

tcgen05.mma
tcgen05.commit.mbarrier::arrive::one [mbar]

Thread 1:

mbarrier.try_wait.relaxed.cluster [mbar]   // loop till success
tcgen05.fence::after_thread_sync
tcgen05.ld

The synchronization mechanisms can also be composed with each other. For example:

Thread 0:

tcgen05.mma
tcgen05.commit.mbarrier::arrive::one [bar1]
mbarrier.try_wait.relaxed.cluster [bar1]   // loop till success
tcgen05.fence::after_thread_sync
...   // completion is guaranteed
tcgen05.fence::before_thread_sync
mbarrier.arrive.relaxed.cluster [bar2]

Thread 1:

mbarrier.try_wait.relaxed.cluster [bar2]   // loop till success
tcgen05.fence::after_thread_sync
tcgen05.ld

9.7.16.6.4.5. Register dependencies, same thread

For tcgen05.ld, an intra-thread ordering through a true register dependency will be respected regardless of the presence or absence of other forms of synchronization. This form of register dependency does not imply any other form of ordering. For example, a register dependency does not imply that a dependee instruction's memory accesses will be performed before a dependent instruction's memory accesses. To enforce such memory orderings and to avoid anti-dependency hazards around tcgen05.ld, tcgen05.wait::ld must be used.

Example:

tcgen05.ld  %r1, ...;
tcgen05.mma ..., %r1, ...;
9.7.16.6.5. Shared Memory Accesses

The shared memory accesses by the tcgen05.mma and tcgen05.cp operations are performed in the asynchronous proxy (async proxy).

Accessing the same memory location across multiple proxies needs a cross-proxy fence. For the async proxy, fence.proxy.async should be used to synchronize memory between the generic proxy and the async proxy.

9.7.16.7. Tensor Memory Allocation and Management Instructions

9.7.16.7.1. Tensorcore 5th Generation Instructions: tcgen05.alloc, tcgen05.dealloc, tcgen05.relinquish_alloc_permit

tcgen05.alloc, tcgen05.dealloc, tcgen05.relinquish_alloc_permit

Dynamic Tensor Memory allocation management instructions.

Syntax

tcgen05.alloc.cta_group.sync.aligned{.shared::cta}.b32  [dst], nCols;
tcgen05.dealloc.cta_group.sync.aligned.b32              taddr, nCols;
tcgen05.relinquish_alloc_permit.cta_group.sync.aligned;

.cta_group = { .cta_group::1, .cta_group::2 }

Description

tcgen05.alloc is a potentially blocking instruction which dynamically allocates the specified number of columns in the Tensor Memory and writes the address of the allocated Tensor Memory into shared memory at the location specified by the address operand dst. tcgen05.alloc blocks if the requested amount of Tensor Memory is not available, and unblocks as soon as the requested amount of Tensor Memory becomes available for allocation.

Instruction tcgen05.dealloc deallocates the Tensor Memory specified by the Tensor Memory address taddr. The operand taddr must point to a previous Tensor Memory allocation.

All of the Tensor Memory that was allocated using the tcgen05.alloc instruction in a kernel must be explicitly deallocated using tcgen05.dealloc before the kernel exits.

The unsigned 32-bit operand nCols specifies the number of columns to be allocated or de-allocated. The unit of allocation and de-allocation is 32 columns and all of the lanes per column. The operand nCols must be a power of 2 within the range [32, 512]. The number of columns allocated must not increase between any two allocations in the execution order within the CTA.
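The nCols constraints above can be summarized in a small validity check (a hedged sketch in Python; the function names are ours, not part of PTX):

```python
# nCols must be a power of two in [32, 512]; successive allocations within
# a CTA must not request more columns than the preceding allocation.

def valid_ncols(n_cols: int) -> bool:
    """Power of two within [32, 512]."""
    return 32 <= n_cols <= 512 and (n_cols & (n_cols - 1)) == 0

def valid_alloc_sequence(sizes) -> bool:
    """Each allocation is valid and the sizes are non-increasing."""
    return (all(valid_ncols(n) for n in sizes)
            and all(a >= b for a, b in zip(sizes, sizes[1:])))
```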

Instruction tcgen05.relinquish_alloc_permit specifies that the CTA of the executing thread is relinquishing the right to allocate Tensor Memory. So, it is illegal for a CTA to perform tcgen05.alloc after any of its constituent threads execute tcgen05.relinquish_alloc_permit.

If no state space is specified then Generic Addressing is used. If the address specified by dst does not fall within the address window of the .shared::cta state space then the behavior is undefined.

Qualifier .cta_group specifies the number of CTAs involved in the allocation and de-allocation operation. When .cta_group::1 is specified, one warp from the CTA must perform the allocation and de-allocation. When .cta_group::2 is specified, one warp from each of the peer CTAs must collectively perform the allocation and de-allocation. Refer to the Issue Granularity section. When .cta_group::2 is specified, the issuing warp must make sure that the peer CTA is launched and is still active.

All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.

The mandatory .sync qualifier indicates that the instruction causes the executing thread to wait until all threads in the warp execute the same instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same instruction. In conditionally executed code, the instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.

The behavior of the instruction is undefined if all the threads in the warp do not use the same value of nCols, or if any thread in the warp has exited.

The store operation in tcgen05.alloc is treated as a weak memory operation in the Memory Consistency Model.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Examples

// Example 1:
tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [sMemAddr1], 32;
ld.shared.b32 taddr, [sMemAddr1];
// use taddr ...
// more allocations and its usages ...
tcgen05.dealloc.cta_group::1.sync.aligned.b32  taddr, 32;
// more deallocations ...
tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;

// Example 2:
// Following instructions are performed by current warp and the warp in the peer-CTA:
tcgen05.alloc.cta_group::2.sync.aligned.shared::cta.b32 [sMemAddr2], 32;
ld.shared.b32 taddr, [sMemAddr2];
// use taddr ...
// more allocations and its usages ...
tcgen05.dealloc.cta_group::2.sync.aligned.b32  taddr, 32;
// more deallocations ...
tcgen05.relinquish_alloc_permit.cta_group::2.sync.aligned;

9.7.16.8. Tensor Memory and Register Load/Store Instructions

The threads of the CTA can perform loads and stores to the Tensor Memory of the CTA to move data between registers and Tensor Memory. The loads and stores can be performed in certain shapes, as specified in the Matrix and Data Movement Shape section.

9.7.16.8.1. Access restrictions

Not all threads of the CTA can access the entire Tensor Memory via the tcgen05.ld and tcgen05.st operations.

The Tensor Memory of a CTA is divided into 4 equal chunks such that each warp of a warpgroup in the CTA can access one chunk of the Tensor Memory. All the columns of the Tensor Memory can be accessed by all four warps of a warpgroup. A lane of the Tensor Memory can be accessed by a single warp in the warpgroup. The following table describes the access restriction.

ID of the warp within the warpgroup    Accessible Lanes

0                                      0-31

1                                      32-63

2                                      64-95

3                                      96-127
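The table is a fixed mapping: warp w of the warpgroup may access lanes 32*w through 32*w + 31. A one-line sketch (Python for illustration; the helper name is ours):

```python
def accessible_lanes(warp_id_in_warpgroup: int) -> range:
    """Tensor Memory lanes accessible to a warp (0..3) of the warpgroup."""
    assert 0 <= warp_id_in_warpgroup < 4
    lo = 32 * warp_id_in_warpgroup
    return range(lo, lo + 32)  # e.g. warp 2 -> lanes 64..95
```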

9.7.16.8.2. Packing and Unpacking

Optionally, the following pack and unpack operations can be performed during the load and store:

  1. Packing: two 16-bit chunks can be packed into a single 32-bit chunk in the register in tcgen05.ld

  2. Unpacking: a single 32-bit chunk in the register can be unpacked into two 16-bit chunks in tcgen05.st

as shown in Figure 193.

_images/tcgen05-ld-st-pack-unpack.png

Figure 193 Pack/Unpack operations for tcgen05 ld/st

9.7.16.8.3. Tensorcore 5th Generation Instructions: tcgen05.ld

tcgen05.ld

Asynchronous collective load from tensor memory into registers.

Syntax

// Base load instruction:
tcgen05.ld.sync.aligned.shape1.num{.pack}.b32    r, [taddr];
tcgen05.ld.sync.aligned.shape2.num{.pack}.b32    r, [taddr], immHalfSplitoff;

.shape1 = { .16x64b, .16x128b, .16x256b, .32x32b }
.shape2 = { .16x32bx2 }
.num    = { .x1, .x2, .x4, .x8, .x16, .x32, .x64, .x128 }
.pack   = { .pack::16b }

// Floating point type load along with reduction:
tcgen05.ld.red.sync.aligned.shape3.num.redOp{.abs}{.NaN}.f32 r, redval, [taddr];
tcgen05.ld.red.sync.aligned.shape4.num.redOp{.abs}{.NaN}.f32 r, redval, [taddr], immHalfSplitoff;

// Integer type load along with reduction:
tcgen05.ld.red.sync.aligned.shape3.num.redOp.type r, redval, [taddr];
tcgen05.ld.red.sync.aligned.shape4.num.redOp.type r, redval, [taddr], immHalfSplitoff;

.shape3 = { .32x32b   }
.shape4 = { .16x32bx2 }
.redOp  = { .min, .max }
.type   = { .u32, .s32 }

Description

Instruction tcgen05.ld asynchronously loads data from the Tensor Memory at the location specified by the 32-bit address operand taddr into the destination register r, collectively across all threads of the warp.

All the threads in the warp must specify the same value of taddr, which must be the base address of the collective load operation. Otherwise, the behavior is undefined.

The .shape qualifier and the .num qualifier together determine the total dimension of the data which is loaded from the Tensor Memory. The .shape qualifier indicates the base dimension of data to be accessed, as described in the Data Movement Shape. The .num qualifier indicates the repeat factor on the base dimension, resulting in the total dimension of the data that is accessed.

The shape .16x32bx2 performs two accesses into Tensor Memory of the shape .16x32b. The base address of the first access is specified by taddr and the base address of the second access is specified by taddr+immHalfSplitoff, where immHalfSplitoff is an immediate argument.

The destination operand r is a brace-enclosed vector expression consisting of one or more 32-bit registers, as per the value of .shape and .num. The size of the vector for various combinations of .num and .shape is shown in Table 47.

Table 47 Various combinations of .num and .shape

.num      .shape
          .16x32bx2 / .16x64b / .32x32b    .16x128b    .16x256b

.x1       1                                2           4

.x2       2                                4           8

.x4       4                                8           16

.x8       8                                16          32

.x16      16                               32          64

.x32      32                               64          128

.x64      64                               128         NA

.x128     128                              NA          NA
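Table 47 follows a simple pattern: the vector length is the .num repeat factor times a base of 1, 2 or 4 registers depending on .shape, capped at 128 registers (NA beyond that). A sketch (Python for illustration; the names are ours):

```python
# Registers per thread in operand r for one tcgen05.ld/st, per Table 47.
BASE_REGS = {"16x32bx2": 1, "16x64b": 1, "32x32b": 1,
             "16x128b": 2, "16x256b": 4}

def vector_size(shape: str, num: int):
    """Number of 32-bit registers in r, or None where the table says NA."""
    size = BASE_REGS[shape] * num
    return size if size <= 128 else None
```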

The qualifier .red specifies that the reduction operation specified by .redOp is performed on the data that is loaded across columns in each lane. The result of the reduction operation is written into the corresponding thread's 32-bit destination register operand redVal. When the .red qualifier is specified, the .num modifier must be at least .x2.

The optional qualifier .pack::16b can be used to pack two 16-bit elements from adjacent columns into a single 32-bit element during the load, as shown in the section Packing and Unpacking.

The mandatory .sync qualifier indicates that tcgen05.ld causes the executing thread to wait until all threads in the warp execute the same tcgen05.ld instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same tcgen05.ld instruction. In conditionally executed code, a tcgen05.ld instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.

The behavior of tcgen05.ld is undefined if all threads do not use the same value of taddr, or if any thread in the warp has exited.

The instruction tcgen05.ld is performed asynchronously; more details are specified in the section Memory Consistency Model for 5th generation of TensorCore operations.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

tcgen05.ld.red is introduced in PTX ISA version 8.8.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

tcgen05.ld.red is supported on following architectures:

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_103f or higher in the same family

  • sm_110f or higher in the same family

Examples

tcgen05.ld.sync.aligned.32x32b.x2.b32     {r0, r1}, [taddr1];
tcgen05.ld.sync.aligned.16x128b.x4.b32    {r0, r1, r2, r3, r4, r5, r6, r7}, [taddr2];
tcgen05.ld.red.sync.aligned.16x32bx2.x8.u32.max {r0, r1, r2, r3, r4, r5, r6, r7},
                                                redVal, [taddr3], 16;

9.7.16.8.4. Tensorcore 5th Generation Instructions: tcgen05.st

tcgen05.st

Asynchronous collective store to tensor memory from registers.

Syntax

tcgen05.st.sync.aligned.shape1.num{.unpack}.b32    [taddr], r;
tcgen05.st.sync.aligned.shape2.num{.unpack}.b32    [taddr], immHalfSplitoff, r;

.shape1 = { .16x64b, .16x128b, .16x256b, .32x32b }
.shape2 = { .16x32bx2 }
.num    = { .x1, .x2, .x4, .x8, .x16, .x32, .x64, .x128 }
.unpack = { .unpack::16b }

Description

Instruction tcgen05.st asynchronously stores data from the source register r into the Tensor Memory at the location specified by the 32-bit address operand taddr, collectively across all threads of the warp.

All the threads in the warp must specify the same value of taddr, which must be the base address of the collective store operation. Otherwise, the behavior is undefined.

The .shape qualifier and the .num qualifier together determine the total dimension of the data which is stored to the Tensor Memory. The .shape qualifier indicates the base dimension of data to be accessed, as described in the Data Movement Shape. The .num qualifier indicates the repeat factor on the base dimension, resulting in the total dimension of the data that is accessed.

The shape .16x32bx2 performs two accesses into Tensor Memory of the shape .16x32b. The base address of the first access is specified by taddr and the base address of the second access is specified by taddr+immHalfSplitoff, where immHalfSplitoff is an immediate argument.

The source operand r is a brace-enclosed vector expression consisting of one or more 32-bit registers, as per the value of .shape and .num. The size of the vector for various combinations of .num and .shape is shown in Table 48.

Table 48 Various combinations of .num and .shape

.num      .shape
          .16x32bx2 / .16x64b / .32x32b    .16x128b    .16x256b

.x1       1                                2           4

.x2       2                                4           8

.x4       4                                8           16

.x8       8                                16          32

.x16      16                               32          64

.x32      32                               64          128

.x64      64                               128         NA

.x128     128                              NA          NA

The optional qualifier .unpack::16b can be used to unpack a 32-bit element in the register into two 16-bit elements and store them in adjacent columns, as shown in the section Packing and Unpacking.

The mandatory .sync qualifier indicates that tcgen05.st causes the executing thread to wait until all threads in the warp execute the same tcgen05.st instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same tcgen05.st instruction. In conditionally executed code, a tcgen05.st instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.

The behavior of tcgen05.st is undefined if all threads do not use the same value of taddr, or if any thread in the warp has exited.

The instruction tcgen05.st is performed asynchronously; more details are specified in the section Memory Consistency Model for 5th generation of TensorCore operations.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Examples

tcgen05.st.sync.aligned.16x64b.x4.b32               [taddr0], {r0,  r1,  r2,  r3};
tcgen05.st.sync.aligned.16x128b.x1.unpack::16b.b32  [taddr1], {r0,  r1};

9.7.16.8.5. Tensorcore 5th Generation Instructions: tcgen05.wait

tcgen05.wait

Waits for the completion of all prior asynchronous tcgen05.ld / tcgen05.st instructions.

Syntax

tcgen05.wait_operation.sync.aligned;

.wait_operation = { .wait::ld, .wait::st }

Description

Instruction tcgen05.wait::st causes the executing thread to block until all prior tcgen05.st operations issued by the executing thread have completed.

Instruction tcgen05.wait::ld causes the executing thread to block until all prior tcgen05.ld operations issued by the executing thread have completed.

The mandatory .sync qualifier indicates that tcgen05.wait_operation causes the executing thread to wait until all threads in the warp execute the same tcgen05.wait_operation instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warp must execute the same tcgen05.wait_operation instruction.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Examples

// Example 1:
tcgen05.ld.sync.aligned.32x32b.x2.b32     {r0, r1}, [taddr0];
// Prevents subsequent tcgen05.mma from racing ahead of the tcgen05.ld
tcgen05.wait::ld.sync.aligned;
tcgen05.mma.cta_group::1.kind::f16   [taddr0],  a-desc,  b-desc, idesc, p;

// Example 2:
tcgen05.st.sync.aligned.32x32b.x2.b32     [taddr0], {r0, r1};
// Prevents the write to taddr0 in tcgen05.mma from racing ahead of the tcgen05.st
tcgen05.wait::st.sync.aligned;
tcgen05.mma.cta_group::1.kind::f16   [taddr0],  a-desc,  b-desc, idesc, p;

9.7.16.9. Tensor Memory Data Movement Instructions

Data from the shared memory can be copied asynchronously to the Tensor Memory using the Tensorcore 5th Generation Instructions: tcgen05.cp operation.

9.7.16.9.1. Optional Decompression

Optionally, during the copy, a vector of 4-bit and 6-bit custom floating point types can be decompressed into 8-bit types.

9.7.16.9.1.1. Decompression of 4-bit floating point to 8-bit type

A contiguous set of 16 elements of 4 bits each, followed by 8 bytes of padding, can be converted into 16 elements of 8 bits each, as shown in Figure 194.

_images/tcgen05-decompression-4b8b.png

Figure 194 Decompression from 4-bit to 8-bit

The individual 4-bit to 8-bit decompression is shown in Figure 195.

_images/tcgen05-decompression-4b8b-individual.png

Figure 195 Individual decompression from 4-bit to 8-bit

9.7.16.9.1.2. Decompression of 6-bit floating point to 8-bit type

A contiguous set of 16 elements of 6 bits each, followed by 4 bytes of padding, is decompressed into 16 elements of 8 bits each, as shown in Figure 196.

_images/tcgen05-decompression-6b8b.png

Figure 196 Decompression from 6-bit to 8-bit

The individual 6-bit to 8-bit decompression for types E3M2 and E2M3 is shown in Figure 197 and Figure 198, respectively.

_images/tcgen05-decompression-6b8b-individual1.png

Figure 197 Individual decompression from 6-bit to 8-bit for E3M2 type

_images/tcgen05-decompression-6b8b-individual2.png

Figure 198 Individual decompression from 6-bit to 8-bit for E2M3 type
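In both layouts the padded source chunk and the decompressed destination are 16 bytes: 16 x 4-bit elements (8 bytes) plus 8 bytes of padding, or 16 x 6-bit elements (12 bytes) plus 4 bytes of padding, each expanding to 16 x 8-bit elements. A size-bookkeeping sketch (Python; the helper is ours):

```python
def source_chunk_bytes(elem_bits: int, padding_bytes: int, n_elems: int = 16) -> int:
    """Bytes occupied by n_elems packed elements plus their trailing padding."""
    return n_elems * elem_bits // 8 + padding_bytes
```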

9.7.16.9.2. Tensorcore 5th Generation Instructions: tcgen05.cp

tcgen05.cp

Initiates an asynchronous copy operation from shared memory to the Tensor Memory.

Syntax

tcgen05.cp.cta_group.shape{.multicast}{.dst_fmt.src_fmt} [taddr], s-desc;

.cta_group = { .cta_group::1, .cta_group::2 }
.src_fmt   = { .b6x16_p32, .b4x16_p64 }
.dst_fmt   = { .b8x16 }
.shape     = { .128x256b, .4x256b, .128x128b, .64x128b**, .32x128b*** }
.multicast = { .warpx2::02_13**, .warpx2::01_23**, .warpx4*** }

Description

Instruction tcgen05.cp initiates an asynchronous copy operation from shared memory to the location specified by the address operand taddr in the Tensor Memory.

The 64-bit register operand s-desc is the matrix descriptor which represents the source matrix in the shared memory that needs to be copied. The format of the matrix descriptor is described in Matrix Descriptors.

The .shape qualifier indicates the dimension of data to be copied, as described in the Data Movement Shape.

Qualifier .cta_group specifies the number of CTAs whose Tensor Memory is accessed when a single thread of a single CTA executes the tcgen05.cp instruction. When .cta_group::1 is specified, the data is copied into the Tensor Memory of the current CTA. When .cta_group::2 is specified, the data is copied into the Tensor Memory of both the current and the peer CTAs.

All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.

When the qualifiers .dst_fmt and .src_fmt are specified, the data is decompressed from the source format .src_fmt in the shared memory to the destination format .dst_fmt in Tensor Memory by the copy operation. The details of the source and the destination formats are specified in the section Optional Decompression.

Some of the .shape qualifiers require certain .multicast qualifiers:

  1. .64x128b requires .warpx2::02_13 or .warpx2::01_23

  2. .32x128b requires .warpx4

When the .multicast qualifier is specified as either .warpx2::02_13 or .warpx2::01_23, the data being copied is multicast into warp pairs and each warp in the warp pair receives half of the data. Warp pairs are formed as follows:

  1. .warpx2::02_13 : warps 0 and 2 form a pair; warps 1 and 3 form a pair.

  2. .warpx2::01_23 : warps 0 and 1 form a pair; warps 2 and 3 form a pair.

When the .multicast modifier is specified as .warpx4, the data being copied is multicast into all 4 warps.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Examples

tcgen05.cp.cta_group::1.128x256b                 [taddr0], sdesc0;
tcgen05.cp.cta_group::2.128x128b.b8x16.b6x16_p32 [taddr1], sdesc1;
tcgen05.cp.cta_group::1.64x128b.warpx2::02_13    [taddr2], sdesc2;

9.7.16.9.3. Tensorcore 5th Generation Instructions: tcgen05.shift

tcgen05.shift

Asynchronously shift down the rows of the matrix in the Tensor Memory for a warp.

Syntax

tcgen05.shift.cta_group.down  [taddr];

.cta_group = { .cta_group::1, .cta_group::2 }

Description

Instruction tcgen05.shift is an asynchronous instruction which initiates the shifting of 32-byte elements downwards across all the rows, except the last, by one row. The address operand taddr specifies the base address of the matrix in the Tensor Memory whose rows must be down-shifted.

The lane of the address operand taddr must be aligned to 32.

Qualifier .cta_group specifies the number of CTAs whose Tensor Memory is touched when a single thread of a single CTA executes the tcgen05.shift instruction. When .cta_group::1 is specified, the shift operation is performed in the Tensor Memory of the current CTA. When .cta_group::2 is specified, the shift operation is performed in the Tensor Memory of both the current and the peer CTAs.

All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_103a

  • sm_110a

Examples

tcgen05.shift.down.cta_group::1 [taddr0];
tcgen05.shift.down.cta_group::2 [taddr1];

9.7.16.10. TensorCore 5th Generation Matrix Multiply and Accumulate Operations

The 5th generation of TensorCore operations of shape MxNxK perform matrix multiplication and accumulation of the form:

D = A * B + D

where:

  • the A matrix has shape MxK, in either Tensor Memory or Shared Memory

  • the B matrix has shape KxN, in Shared Memory of the current CTA and optionally of the peer CTA

  • the D matrix has shape MxN, in Tensor Memory

Optionally, an input predicate can be used to disable the input from the accumulator matrix, in which case the operation performed is:

D = A * B
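The two forms above have the usual reference semantics; a plain-Python sketch (ours, ignoring element types, layouts and rounding) over nested lists:

```python
# D = A*B + D, with an input predicate that drops the accumulator input
# (enable_c=False gives D = A*B). A is MxK, B is KxN, D is MxN.

def mma(A, B, D, enable_c: bool = True):
    M, K, N = len(A), len(B), len(B[0])
    return [[(D[i][j] if enable_c else 0)
             + sum(A[i][k] * B[k][j] for k in range(K))
             for j in range(N)] for i in range(M)]
```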

The matrix multiplication and accumulation operations are categorized into various kinds based on input types and the throughput of the multiplication operation. The following are the different kinds of MMA operations that are supported:

  1. f16 : supports f16 and bf16 input types.

  2. tf32 : supports tf32 input types.

  3. f8f6f4 : supports all input combinations of f8, f6 and f4 types.

  4. i8 : supports signed and unsigned 8-bit integer input types.

  5. mxf8f6f4 / mxf4 : supports mx-floating point input types.

  6. mxf4nvf4 : supports the mxf4 type and a custom NVIDIA floating-point type for inputs where the type of the vector elements is 4 bits and requires a common scaling factor to form the complete floating-point type, similar to other mx-types.

Optionally, the 5th generation of TensorCore MMAs supports dense and sparse matrix A. Sparse Matrices describes the details of the sparse matrices.

Some of the MMA-kinds require scaling of input matrices from memory to form the matrix A and matrix B before performing the MMA operation. Block Scaling describes the details of the scaling of matrices.

The following table shows the various matrices involved in the MMA operations and the memory in which they can reside:

Matrix Type          Memory

A                    Tensor Memory OR Shared Memory

B                    Shared Memory

D                    Tensor Memory

SparseMetaData       Tensor Memory

A-Scale / B-Scale    Tensor Memory

A sequence of MMA instructions may reuse the same A matrix with a sequence of B matrices, or may reuse the same B matrix with a sequence of A matrices. In these patterns, the TensorCore may be able to load the unchanged matrix once and reuse it through the sequence without multiple reloads. The A or B matrices are loaded into a TensorCore collector buffer (i.e., a special cache).

An MMA instruction has an optional collector qualifier to specify when an A or B matrix is new to the sequence and should be loaded, is unchanged within the sequence and should be reused, or is at its last use in the sequence and should be discarded. The collector qualifier gives the TensorCore permission to reuse a previously loaded A or B matrix; however, reuse is opportunistic, in that the TensorCore may reload a matrix even when it has permission to reuse that matrix. Thus, the source memory of an A or B matrix must not be modified while the MMA instruction using those matrices has not completed, regardless of collector qualifier permissions.

The 5th generation of TensorCore MMAs can be used for general matrix multiplication or for convolution operations. In case of convolutions, the activations can be stored in either matrix A or matrix B, while the weights will be stored in the other matrix.

Activation Matrix   Weights Matrix   Name of the op          Instruction Name        Collector Buffer Applicability
A                   B                Activation Stationary   (default tcgen05.mma)   Collector buffer is applicable on matrix A
B                   A                Weights Stationary      .ws                     Collector buffer is applicable on matrix B

9.7.16.10.1. Transpose and Negate operations

The matrices A and B can be transposed by specifying the Transpose A Matrix and Transpose B Matrix bits in the instruction descriptor respectively.

The elements of the matrices A and B can be negated by specifying the Negate A Matrix and Negate B Matrix bits in the instruction descriptor respectively.

The support for the Transpose and Negate operations for the various MMA-Kinds is shown in Table 49.

Table 49 Transpose and Negate operation for various MMA-Kind

MMA-Kind          Is Transpose A/B supported   Is Negate A/B supported
.kind::tf32       Yes                          Yes
.kind::f16        Yes                          Yes
.kind::f8f6f4     Yes                          Yes
.kind::mxf8f6f4   Yes                          Yes
.kind::i8         Yes                          No
.kind::mxf4       No                           Yes
.kind::mxf4nvf4   No                           Yes

For .kind::tf32, the transpose operations on matrices A and B are supported only with the 128B swizzling mode with 32B swizzle-atomicity.

For all other MMA-Kinds, the transpose operations on matrices A and B are not supported with the 128B swizzling mode with 32B swizzle-atomicity.

Table 50 shows the valid combinations of the N shape with the .cta_group qualifier for 8-bit transpose B.

Table 50 Various combinations of N shape with .cta_group qualifier for 8-bit transpose B

.cta_group   N shape
1            16 <= N <= 256, step 16
2            32 <= N <= 256, step 32
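The N-shape constraints above can be expressed as a small validity check; this is an illustrative sketch (the function name is ours, not part of PTX):

```python
def valid_n_shape_8bit_transpose_b(cta_group: int, n: int) -> bool:
    """Check N against Table 50 for 8-bit transpose B:
    .cta_group::1 allows 16 <= N <= 256 in steps of 16,
    .cta_group::2 allows 32 <= N <= 256 in steps of 32."""
    step = 16 if cta_group == 1 else 32
    return step <= n <= 256 and n % step == 0
```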

9.7.16.10.2. Matrix Layout Organization

Table 51 describes the major-ness used for the different matrices.

Table 51 Major-ness for different matrices

Matrix   Residing in Memory   Default Major-ness
D        Tensor Memory        Row-Major
A        Tensor Memory        Row-Major
A        Shared Memory        Depends on swizzling mode. Refer Shared Memory Layout and Swizzling
B        Shared Memory        Depends on swizzling mode. Refer Shared Memory Layout and Swizzling

9.7.16.10.3. Valid Combinations of Type-Size, Major-ness and Swizzling

Table 52 Valid Combinations of Type-Size, Major-ness and Swizzling

Type-Size              Major-ness           Matrix   Supported Swizzle
4-bit, 6-bit, 8-bit,   Row                  A        All swizzling modes
16-bit, 32-bit         Column               B
8-bit, 16-bit          Column (transpose)   A        All except 128B swizzling with 32B atomicity
                       Row (transpose)      B
32-bit                 Column (transpose)   A        Only 128B swizzling with 32B atomicity
                       Row (transpose)      B

9.7.16.10.4. Packing formats of elements in Tensor and Shared memory

9.7.16.10.4.1. Packing format for matrix D in Tensor Memory

The sub-word elements of matrix D are expected not to be packed within a 32-bit Tensor Memory word. For example, if the type of the elements of matrix D is 16 bits, then a Tensor Memory word would contain a single 16-bit element in its lower 16 bits.
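The unpacked rule above can be modeled as one 32-bit word per element; a minimal sketch (the helper name is ours):

```python
def d_tmem_words(elements, elem_bits=16):
    """Model the Tensor Memory layout of sub-word matrix D elements:
    each 32-bit word holds exactly one element in its lower elem_bits
    bits; sub-word elements are NOT packed together in one word."""
    mask = (1 << elem_bits) - 1
    return [e & mask for e in elements]  # one 32-bit word per element
```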

9.7.16.10.4.2. Packing format for matrix A and B

The 6-bit and 4-bit floating point types have different packing format requirements for different MMA kinds in both Tensor memory and Shared memory. The requirements are as follows.

9.7.16.10.4.3. Packing format used for matrix A by .kind::mxf8f6f4 in Tensor Memory

The individual 4-bit and 6-bit floating point type elements must be packed in an 8-bit container in Tensor memory as shown below. The 8-bit containers must be contiguously packed in a 32-bit Tensor Memory word. For example, if the type of the elements of matrix A is 6 bits, then 4 consecutive A elements should be packed in one 32-bit Tensor Memory word.

  • 4-bit packing format as shown in Figure 199

    _images/tcgen05-packing-formats-mxf8f6f4-tmem-dig1.png

    Figure 199 4-bit packing format with type E2M1

  • 6-bit packing format

    • Type E3M2 as shown in Figure 200

      _images/tcgen05-packing-formats-mxf8f6f4-tmem-dig2.png

      Figure 200 6-bit packing format with type E3M2

    • Type E2M3 as shown in Figure 201

      _images/tcgen05-packing-formats-mxf8f6f4-tmem-dig3.png

      Figure 201 6-bit packing format with type E2M3
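The container packing described above can be sketched as follows; this is a hedged model that assumes each element sits in the low bits of its 8-bit container (the exact in-container bit positions for E2M1/E3M2/E2M3 are given by the figures):

```python
def pack_tmem_mxf8f6f4(elements, elem_bits):
    """Pack 4-bit or 6-bit elements for matrix A of .kind::mxf8f6f4 in
    Tensor Memory: each element occupies its own 8-bit container, and
    four containers are packed contiguously into one 32-bit word."""
    assert elem_bits in (4, 6)
    mask = (1 << elem_bits) - 1
    words = []
    for i in range(0, len(elements), 4):
        word = 0
        for j, e in enumerate(elements[i:i + 4]):
            word |= (e & mask) << (8 * j)  # one 8-bit container per element
        words.append(word)
    return words
```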

9.7.16.10.4.4. Packing format used for matrix A and B by .kind::mxf8f6f4 in Shared Memory

The 4-bit and 6-bit floating point elements in shared memory must be contiguously packed along with padding as follows.

  • 4-bit packing format as shown in Figure 202

    _images/tcgen05-packing-formats-mxf8f6f4-smem-dig1.png

    Figure 202 4-bit packing format

  • 6-bit packing format as shown in Figure 203

    _images/tcgen05-packing-formats-mxf8f6f4-smem-dig2.png

    Figure 203 6-bit packing format

9.7.16.10.4.5. Packing format used for matrix A by .kind::mxf4 and .kind::mxf4nvf4 in Tensor Memory

Two 4-bit floating point type elements must be packed in an 8-bit container in Tensor memory, as shown in Figure 204 for mxf4.

_images/tcgen05-packing-formats-mxf4-tmem-dig1.png

Figure 204 4-bit packing format with type E2M1

9.7.16.10.4.6. Packing format used for matrix A and B by .kind::mxf4 and .kind::mxf4nvf4 in Shared Memory

The packing format for 4-bit floating point elements in shared memory is to pack two 4-bit elements in an 8-bit container, with no padding.
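This two-per-byte packing can be sketched as below; the nibble order (first element in the low nibble) and zero-padding of a trailing odd element are assumptions for illustration:

```python
def pack_smem_mxf4(elements):
    """Pack 4-bit elements for .kind::mxf4 / .kind::mxf4nvf4 in shared
    memory: two elements per 8-bit container, no padding bytes between
    containers."""
    out = []
    for i in range(0, len(elements), 2):
        lo = elements[i] & 0xF
        # Assumed: a trailing odd element is padded with zero.
        hi = (elements[i + 1] & 0xF) if i + 1 < len(elements) else 0
        out.append(lo | (hi << 4))  # first element in the low nibble (assumed)
    return out
```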

9.7.16.10.5. Data Path Layout Organization

Different MMA variants access the tensor memory with different layout organizations. The following table lists the various layouts:

M     cta_group   A-Sparsity   Is .ws mode   Datapath organization         Layout ID   Tensor Memory Datapath Lane Alignment
32    ::1         Either       Yes           1x4                           Layout G    0
64    ::1         Either       Yes           2x3                           Layout E    0
64    ::1         Either       No            4x1 (1/2 datapath utilized)   Layout F    0 or 16
128   ::1         Either       Either        4x1                           Layout D    0
128   ::2         Dense        N/A           2x2                           Layout B    0
128   ::2         Sparse       N/A           4x1 (1/2 datapath utilized)   Layout C    0 or 16
256   ::2         Either       N/A           4x1                           Layout A    0

The layouts which utilize only half the datapath lanes, i.e., Layout F and Layout C, must use the same Tensor Memory lane alignment across matrices A, D and the sparsity metadata matrix.

The following shows the warps that can access the Tensor Memory regions via tcgen05.ld / tcgen05.st, along with the addresses for the various Tensor Memory Layouts.

9.7.16.10.5.1. Layout A (M = 256)

Layout organization for M = 256 is shown in Figure 205.

_images/tcgen05-data-path-layout-a1.png

Figure 205 Layout organization for M = 256

Addresses for the above region to be used in tcgen05.ld / tcgen05.st are shown in Figure 206.

_images/tcgen05-data-path-layout-a2.png

Figure 206 Addresses to use in tcgen05.ld / tcgen05.st

9.7.16.10.5.2. Layout B (M = 128 + cta-group::2 + Dense A matrix)

Layout organization for M = 128 + .cta_group::2 + Dense A matrix is shown in Figure 207.

_images/tcgen05-data-path-layout-b1.png

Figure 207 Layout organization for M = 128 + .cta_group::2 + Dense A matrix

Addresses for the above region to be used in tcgen05.ld / tcgen05.st are shown in Figure 208.

_images/tcgen05-data-path-layout-b2.png

Figure 208 Addresses to use in tcgen05.ld / tcgen05.st

9.7.16.10.5.3. Layout C (M = 128 + cta-group::2 + Sparse A matrix)

Layout organization for M = 128 + .cta_group::2 + Sparse A matrix is shown in Figure 209.

_images/tcgen05-data-path-layout-c1.png

Figure 209 Layout organization for M = 128 + .cta_group::2 + Sparse A matrix

Addresses for the above region to be used in tcgen05.ld / tcgen05.st are shown in Figure 210.

_images/tcgen05-data-path-layout-c2.png

Figure 210 Addresses to use in tcgen05.ld / tcgen05.st

9.7.16.10.5.4. Layout D (M = 128 + cta-group::1)

Layout organization for M = 128 + .cta_group::1 is shown in Figure 211.

_images/tcgen05-data-path-layout-d1.png

Figure 211 Layout organization for M = 128 + .cta_group::1

Addresses for the above region to be used in tcgen05.ld / tcgen05.st are shown in Figure 212.

_images/tcgen05-data-path-layout-d2.png

Figure 212 Addresses to use in tcgen05.ld / tcgen05.st

9.7.16.10.5.5. Layout E (M = 64 + .ws mode)

Layout organization for M = 64 + .ws mode is shown in Figure 213.

_images/tcgen05-data-path-layout-e1.png

Figure 213 Layout organization for M = 64 + .ws mode

Addresses for the above region to be used in tcgen05.ld / tcgen05.st are shown in Figure 214.

_images/tcgen05-data-path-layout-e2.png

Figure 214 Addresses to use in tcgen05.ld / tcgen05.st

9.7.16.10.5.6. Layout F (M = 64 + non .ws mode)

Layout organization for M = 64 + non .ws mode is shown in Figure 215.

_images/tcgen05-data-path-layout-f1.png

Figure 215 Layout organization for M = 64 + non .ws mode

Addresses for the above region to be used in tcgen05.ld / tcgen05.st are shown in Figure 216.

_images/tcgen05-data-path-layout-f2.png

Figure 216 Addresses to use in tcgen05.ld / tcgen05.st

9.7.16.10.5.7. Layout G (M = 32)

Layout organization for M = 32 is shown in Figure 217.

_images/tcgen05-data-path-layout-g1.png

Figure 217 Layout organization for M = 32

Addresses for the above region to be used in tcgen05.ld / tcgen05.st are shown in Figure 218.

_images/tcgen05-data-path-layout-g2.png

Figure 218 Addresses to use in tcgen05.ld / tcgen05.st

9.7.16.10.6. Shared Memory Layout and Swizzling

If the bit Transpose A Matrix / Transpose B Matrix in the Instruction descriptor is 0, then K-major is used for matrix A / B respectively. If the bit Transpose A Matrix in the Instruction descriptor is 1, then M-major is used for matrix A. If the bit Transpose B Matrix in the Instruction descriptor is 1, then N-major is used for matrix B.

In a column-major default BLAS library such as cuBLAS, the matrices A and B with and without transpose can be classified as either K-Major or M-or-N-Major as shown in the following table:

     Non-Transposed   Transposed
A    K-major          M-major
B    K-major          N-major

To avoid confusion with A, B, row-major, col-major, transpose, and non-transpose, we will use MN-Major and K-Major throughout this section.

The matrices in the shared memory are made up of one or more "swizzle layout atoms". The exact layout of these swizzle atoms depends on the swizzling mode, the swizzle-atomicity, and the leading dimension. The layouts of the swizzle atoms are shown in Table 53.

Table 53 Layout for swizzle atoms

Swizzling mode and Swizzle-Atomicity   Leading Dimension   Swizzle atom layout (128b element)
128B Swizzling with 32B atomicity      M/N                 8x4
128B Swizzling with 16B atomicity      M/N                 8x8
                                       K                   8x8
64B Swizzling Mode                     M/N                 4x8
                                       K                   8x4
32B Swizzling Mode                     M/N                 2x8
                                       K                   8x2
None                                   M/N                 1x8
                                       K                   8x1

The above shapes are for elements of size 128 bits. For smaller element sizes, the same shapes get multiplied along the leading dimension by a factor of 128/sizeof_bits(Element). For example, a 128B MN-major swizzle atom would have a shape of (8*(128/32))x8 = 32x8 for tf32 tensor core inputs.
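The scaling rule can be written as a one-liner; a sketch reproducing the worked example (base shapes come from Table 53):

```python
def swizzle_atom_shape(base_shape, elem_bits):
    """Scale a swizzle-atom shape given in 128-bit elements (Table 53)
    to a smaller element size: the leading dimension is multiplied by
    a factor of 128 / sizeof_bits(Element)."""
    lead, other = base_shape
    return (lead * (128 // elem_bits), other)
```

For instance, the 128B MN-major atom 8x8 becomes 32x8 for 32-bit (tf32) inputs, matching the example in the text.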

Some example layouts of MxK or KxN matrices with various swizzling modes, in units of 128b elements (one colored cell each), are shown in Figure 219, Figure 220, Figure 221, Figure 222, Figure 223, Figure 224, Figure 225, Figure 226, and Figure 227.

_images/tcgen05-smem-layout-128B-32B-atom-mn.png

Figure 219 MN major 128B swizzling with 32B atomicity

_images/tcgen05-smem-layout-128B-mn.png

Figure 220 MN major 128B swizzling

_images/tcgen05-smem-layout-128B-k.png

Figure 221 K major 128B swizzling

_images/tcgen05-smem-layout-64B-mn.png

Figure 222 MN major 64B swizzling

_images/tcgen05-smem-layout-64B-k.png

Figure 223 K major 64B swizzling

_images/tcgen05-smem-layout-32B-mn.png

Figure 224 MN major 32B swizzling

_images/tcgen05-smem-layout-32B-k.png

Figure 225 K major 32B swizzling

_images/tcgen05-smem-layout-no-swizzle-mn.png

Figure 226 MN major no-swizzling mode

_images/tcgen05-smem-layout-no-swizzle-k.png

Figure 227 K major no-swizzling mode

Following are some examples of the 128B swizzling layout for the tf32 element type.

9.7.16.10.7. Block Scaling

The tcgen05.mma instructions with the following .kind qualifiers:

  • .kind::mxf8f6f4

  • .kind::mxf4

  • .kind::mxf4nvf4

perform matrix multiplication with block scaling. This operation has the following form:

(A * scale_A) * (B * scale_B) + D

where scale_A and scale_B are matrices residing in Tensor Memory.

For a scale_A matrix of shape M x SFA_N, each row of matrix A is divided into SFA_N chunks and each chunk of a row is multiplied with the corresponding element of scale_A in the same row.

Similarly, for a scale_B matrix of shape SFB_M x N, each column of matrix B is divided into SFB_M chunks and each chunk of a column is multiplied with the corresponding element of scale_B in the same column.

Scale factors for the A and B matrices need to be duplicated to all 32 lane partitions of tensor memory.
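The chunk-wise scaling semantics can be captured in a short reference model; a plain-Python sketch of the arithmetic only (not how the hardware iterates, and the function name is ours):

```python
def block_scaled_mma(A, B, D, scale_A, scale_B):
    """Reference semantics of (A * scale_A) * (B * scale_B) + D.
    A is M x K, B is K x N, D is M x N, all plain nested lists.
    scale_A is M x SFA_N: row m of A is split into SFA_N equal chunks
    along K, and chunk c is scaled by scale_A[m][c].
    scale_B is SFB_M x N: column n of B is split into SFB_M equal chunks
    along K, and chunk c is scaled by scale_B[c][n]."""
    M, K, N = len(A), len(A[0]), len(B[0])
    ca = K // len(scale_A[0])  # chunk size along K for A
    cb = K // len(scale_B)     # chunk size along K for B
    out = [row[:] for row in D]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                a = A[m][k] * scale_A[m][k // ca]
                b = B[k][n] * scale_B[k // cb][n]
                acc += a * b
            out[m][n] += acc
    return out
```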

Figure 230 shows an example of tcgen05.mma with block scaling of scale_vec::2X.

_images/tcgen05-mma-block-scaling.png

Figure 230 tcgen05.mma with block scaling of scale_vec::2X

9.7.16.10.7.1. Valid combinations of scale_vectorsize with types and MMA-Kind

The shapes of the scale_A and scale_B matrices depend on .scale_vectorsize as shown in Table 54.

Table 54 Valid combinations of scale_vectorsize and shapes

.scale_vectorsize   .kind::*                       K                                     Shape of scale_A   Shape of scale_B
.scale_vec::1X      .kind::mxf8f6f4                All supported values of K             M x 1              1 x N
.scale_vec::2X      .kind::mxf4, .kind::mxf4nvf4   All supported values of K             M x 2              2 x N
.scale_vec::4X      .kind::mxf4nvf4                All supported values of K             M x 4              4 x N
.block16            .kind::mxf4nvf4                K = 96                                M x 6              6 x N
                                                   All supported values of K except 96   M x 4              4 x N
.block32            .kind::mxf4, .kind::mxf4nvf4   K = 96                                M x 3              3 x N
                                                   All supported values of K except 96   M x 2              2 x N
                    .kind::mxf8f6f4                All supported values of K             M x 1              1 x N

The valid combinations of the exact element types and .scale_vectorsize are listed in Table 55.

Table 55 Valid combinations of scale_vectorsize with types and MMA-Kind

.kind::*          Element Data Type              Scale Data Type   .scale_vectorsize
.kind::mxf8f6f4   E4M3, E5M2, E2M3, E3M2, E2M1   UE8M0             .scale_vec::1X / .block32
.kind::mxf4       E2M1                           UE8M0             .scale_vec::2X / .block32
.kind::mxf4nvf4   E2M1                           UE8M0             .scale_vec::2X / .block32, .scale_vec::4X / .block16
                  E2M1                           UE4M3             .scale_vec::4X / .block16

The new .blockN qualifiers are aliases for the .scale_vec::NX qualifiers as follows:

  • .block32 is an alias for .scale_vec::1X or .scale_vec::2X based on .kind and the K dimension

  • .block16 is an alias for .scale_vec::4X
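The alias rules can be sketched as a small lookup; the "3X-equivalent"/"6X-equivalent" strings are our labels for the K = 96 cases that have no .scale_vec spelling (see the section headings below):

```python
def resolve_scale_vectorsize(qualifier, kind, k):
    """Resolve a .blockN alias to its .scale_vec::NX equivalent,
    following the alias rules and Table 54."""
    if qualifier == ".block16":
        return "6X-equivalent" if k == 96 else ".scale_vec::4X"
    if qualifier == ".block32":
        if k == 96:
            return "3X-equivalent"
        # .block32 means 1X for mxf8f6f4 and 2X for mxf4 / mxf4nvf4.
        return ".scale_vec::1X" if kind == ".kind::mxf8f6f4" else ".scale_vec::2X"
    return qualifier  # already a .scale_vec::NX qualifier
```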

9.7.16.10.7.2. Scale Factor A ID

The value of the scale factor A ID selects the sub-columns in the Tensor Memory to form the scale factor A matrix, which is used to scale the matrix A.

The following shows the scale factor matrix layout for the various scale vector sizes:

9.7.16.10.7.2.1. Layout of the Scale Factor A Matrix for scale_vec::1X/block32 with K=32/K=64

There is one scale factor per row of the A matrix, with a block size of 32, and the scale factor must be provided in a 1-byte aligned sub-column of the Tensor Memory. SFA_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 231 shows which sub-columns get selected for different values of SFA_ID.

_images/tcgen05-mma-scale-factor-a-1x-dig.png

Figure 231 Layout of scale factor A matrix with scale_vec::1X/block32 with K=32/K=64

For example, if SFA_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, SFA_ID values of 1, 2 and 3 would select the blue, yellow, and red columns, respectively.

9.7.16.10.7.2.2. Layout of the Scale Factor A Matrix for scale_vec::2X/block32 with K=64/K=128

There are two scale factors per row of the A matrix, with a block size of 32, and the scale factors must be provided in a 2-byte aligned sub-column of the Tensor Memory. SFA_ID specifies the half-word offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 232 shows which sub-columns get selected for different values of SFA_ID.

_images/tcgen05-mma-scale-factor-a-2x-dig.png

Figure 232 Layout of scale factor A matrix with scale_vec::2X/block32 with K=64/K=128

For example, if SFA_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, if SFA_ID is 2, then all of the blue columns are selected to form the scale factor matrix.

9.7.16.10.7.2.3. Layout of the Scale Factor A Matrix for scale_vec::4X/block16 with K=64/K=128

There are four scale factors per row of the A matrix, with a block size of 16, and the scale factors must be provided in a 4-byte aligned sub-column of the Tensor Memory. The SFA_ID value must be 0, and this specifies that all of the columns (in green) will be used for the scale factor matrix, as shown in Figure 233.

_images/tcgen05-mma-scale-factor-a-4x-dig.png

Figure 233 Layout of scale factor A matrix with scale_vec::4X/block16 with K=64/K=128

9.7.16.10.7.2.4. Layout of the Scale Factor A Matrix for block32 with K=96 (Semantically equivalent to scale_vec::3X)

There are three scale factors per row of the A matrix, with a block size of 32, and the scale factors must be provided in a 4-byte aligned sub-column of the Tensor Memory. SFA_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 234, Figure 235, Figure 236 and Figure 237 show which sub-columns get selected for different values of SFA_ID.

_images/tcgen05-mma-scale-factor-a-block32-k96-dig1.png

Figure 234 Layout of scale factor A matrix with block32 with K=96 with SFA_ID=00

_images/tcgen05-mma-scale-factor-a-block32-k96-dig2.png

Figure 235 Layout of scale factor A matrix with block32 with K=96 with SFA_ID=01

_images/tcgen05-mma-scale-factor-a-block32-k96-dig3.png

Figure 236 Layout of scale factor A matrix with block32 with K=96 with SFA_ID=10

_images/tcgen05-mma-scale-factor-a-block32-k96-dig4.png

Figure 237 Layout of scale factor A matrix with block32 with K=96 with SFA_ID=11

For example, if SFA_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, SFA_ID values of 1, 2 and 3 would select the blue, yellow, and red columns, respectively.

9.7.16.10.7.2.5. Layout of the Scale Factor A Matrix for block16 with K=96 (Semantically equivalent to scale_vec::6X)

There are six scale factors per row of the A matrix, with a block size of 16, and the scale factors must be provided in a 4-byte aligned sub-column of the Tensor Memory. SFA_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 238 and Figure 239 show which sub-columns get selected for different values of SFA_ID.

_images/tcgen05-mma-scale-factor-a-block16-k96-dig1.png

Figure 238 Layout of scale factor A matrix with block16 with K=96 with SFA_ID=00

_images/tcgen05-mma-scale-factor-a-block16-k96-dig2.png

Figure 239 Layout of scale factor A matrix with block16 with K=96 with SFA_ID=10

For example, if SFA_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, if SFA_ID is 2, then all of the blue columns are selected to form the scale factor matrix.

9.7.16.10.7.3. Scale Factor B ID

The value of the scale factor B ID selects the sub-columns in the Tensor Memory to form the scale factor B matrix, which is used to scale the matrix B.

The following shows the scale factor matrix layout for the various scale vector sizes:

9.7.16.10.7.3.1. Layout of the Scale Factor B Matrix for scale_vec::1X/block32 with K=32/K=64

There is one scale factor per row of the B matrix, with a block size of 32, and the scale factor must be provided in a 1-byte aligned sub-column of the Tensor Memory. SFB_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 240 shows which sub-columns get selected for different values of SFB_ID.

_images/tcgen05-mma-scale-factor-b-1x-dig.png

Figure 240 Layout of scale factor B matrix with scale_vec::1X/block32 with K=32/K=64

For example, if SFB_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, SFB_ID values of 1, 2 and 3 would select the blue, yellow, and red columns, respectively.

9.7.16.10.7.3.2. Layout of the Scale Factor B Matrix for scale_vec::2X/block32 with K=64/K=128

There are two scale factors per row of the B matrix, with a block size of 32, and the scale factors must be provided in a 2-byte aligned sub-column of the Tensor Memory. SFB_ID specifies the half-word offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 241 shows which sub-columns get selected for different values of SFB_ID.

_images/tcgen05-mma-scale-factor-b-2x-dig.png

Figure 241 Layout of scale factor B matrix with scale_vec::2X/block32 with K=64/K=128

For example, if SFB_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, if SFB_ID is 2, then all of the blue columns are selected to form the scale factor matrix.

9.7.16.10.7.3.3. Layout of the Scale Factor B Matrix for scale_vec::4X/block16 with K=64/K=128

There are four scale factors per row of the B matrix, with a block size of 16, and the scale factors must be provided in a 4-byte aligned sub-column of the Tensor Memory. The SFB_ID value must be 0, and this specifies that all of the columns (in green) will be used for the scale factor matrix, as shown in Figure 242.

_images/tcgen05-mma-scale-factor-b-4x-dig.png

Figure 242 Layout of scale factor B matrix with scale_vec::4X/block16 with K=64/K=128

9.7.16.10.7.3.4. Layout of the Scale Factor B Matrix for block32 with K=96 (Semantically equivalent to scale_vec::3X)

There are three scale factors per row of the B matrix, with a block size of 32, and the scale factors must be provided in a 4-byte aligned sub-column of the Tensor Memory. SFB_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix.

For N<=128, Figure 243, Figure 244, Figure 245 and Figure 246 show which sub-columns get selected for different values of SFB_ID.

_images/tcgen05-mma-scale-factor-b-block32-k96-nlt128-dig1.png

Figure 243 Layout of scale factor B matrix with block32 with K=96 and N<=128 with SFB_ID=00

_images/tcgen05-mma-scale-factor-b-block32-k96-nlt128-dig2.png

Figure 244 Layout of scale factor B matrix with block32 with K=96 and N<=128 with SFB_ID=01

_images/tcgen05-mma-scale-factor-b-block32-k96-nlt128-dig3.png

Figure 245 Layout of scale factor B matrix with block32 with K=96 and N<=128 with SFB_ID=10

_images/tcgen05-mma-scale-factor-b-block32-k96-nlt128-dig4.png

Figure 246 Layout of scale factor B matrix with block32 with K=96 and N<=128 with SFB_ID=11

For N>128, Figure 247, Figure 248, Figure 249, Figure 250, Figure 251 and Figure 252 show which sub-columns get selected for different values of SFB_ID.

_images/tcgen05-mma-scale-factor-b-block32-k96-ngt128-dig1.png

Figure 247 Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=00

_images/tcgen05-mma-scale-factor-b-block32-k96-ngt128-dig2.png

Figure 248 Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=01

_images/tcgen05-mma-scale-factor-b-block32-k96-ngt128-dig3.png

Figure 249 Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=10

_images/tcgen05-mma-scale-factor-b-block32-k96-ngt128-dig4.png

Figure 250 Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=10

_images/tcgen05-mma-scale-factor-b-block32-k96-ngt128-dig5.png

Figure 251 Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=11

_images/tcgen05-mma-scale-factor-b-block32-k96-ngt128-dig6.png

Figure 252 Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=11

For example, if SFB_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, SFB_ID values of 1, 2 and 3 would select the blue, yellow, and red columns, respectively.

9.7.16.10.7.3.5. Layout of the Scale Factor B Matrix for block16 with K=96 (Semantically equivalent to scale_vec::6X)

There are six scale factors per row of the B matrix, with a block size of 16, and the scale factors must be provided in a 4-byte aligned sub-column of the Tensor Memory. SFB_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix.

For N<=128, Figure 253 and Figure 254 show which sub-columns get selected for different values of SFB_ID.

_images/tcgen05-mma-scale-factor-b-block16-k96-nlt128-dig1.png

Figure 253 Layout of scale factor B matrix with block16 with K=96 and N<=128 with SFB_ID=00

_images/tcgen05-mma-scale-factor-b-block16-k96-nlt128-dig2.png

Figure 254 Layout of scale factor B matrix with block16 with K=96 and N<=128 with SFB_ID=10

For N>128, Figure 255, Figure 256, Figure 257 and Figure 258 show which sub-columns get selected for different values of SFB_ID.

_images/tcgen05-mma-scale-factor-b-block16-k96-ngt128-dig1.png

Figure 255 Layout of scale factor B matrix with block16 with K=96 and N>128 with SFB_ID=00

_images/tcgen05-mma-scale-factor-b-block16-k96-ngt128-dig2.png

Figure 256 Layout of scale factor B matrix with block16 with K=96 and N>128 with SFB_ID=00

_images/tcgen05-mma-scale-factor-b-block16-k96-ngt128-dig3.png

Figure 257 Layout of scale factor B matrix with block16 with K=96 and N>128 with SFB_ID=10

_images/tcgen05-mma-scale-factor-b-block16-k96-ngt128-dig4.png

Figure 258 Layout of scale factor B matrix with block16 with K=96 and N>128 with SFB_ID=10

For example, if SFB_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, if SFB_ID is 2, then all of the blue columns are selected to form the scale factor matrix.

9.7.16.10.8. Sparse Matrices

The instruction tcgen05.mma.sp can be used when the matrix A is a structured sparse matrix with 50% zeros in each row, distributed as per its sparse granularity.

In an MxNxK sparse tcgen05.mma.sp operation, the matrix A of shape MxK is stored in a packed form as Mx(K/2) in memory. For each K-wide row of matrix A, 50% of the elements are zeros and the remaining K/2 non-zero elements are stored in memory. The metadata specifies the mapping of the K/2 non-zero elements to the K elements before performing the MMA operation.
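The metadata-driven expansion can be modeled as follows; a sketch that takes already-decoded per-chunk non-zero positions (the bit-level metadata encoding is kind-specific, as described in the following subsections):

```python
def expand_sparse_row(packed, positions, m):
    """Expand one packed row of a structured-sparse matrix A (K/2 stored
    values) back to its K-wide form. positions[c] holds the decoded
    metadata for chunk c: the indices, within that m-wide chunk, of its
    non-zero elements. Stored values are consumed from `packed` in order."""
    row, it = [], iter(packed)
    for idxs in positions:
        chunk = [0] * m
        for i in sorted(idxs):
            chunk[i] = next(it)  # place the next stored non-zero value
        row.extend(chunk)
    return row
```

For a 2:4 granularity chunk, positions like [0, 3] place the two stored values at the first and last slots of a four-wide chunk; for tf32 1:2 granularity, each chunk carries a single position within a two-wide chunk.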

The granularity of sparse matrix A is defined as the ratio of the number of non-zero elements in a sub-chunk of the matrix row to the total number of elements in that sub-chunk, where the size of the sub-chunk is shape-specific. The following table lists the granularity of the different tcgen05.mma.sp variants:

.kind of tcgen05.mma   Sparse Granularity
.kind::tf32            1:2
.kind::f16             2:4
.kind::f8f6f4          2:4
.kind::mxf8f6f4        2:4
.kind::i8              2:4
.kind::mxf4            4:8 (in pairs)

9.7.16.10.8.1. Sparse tcgen05.mma.sp with .kind::tf32

For .kind::tf32, matrix A is structured sparse at a granularity of 1:2. In other words, each chunk of two adjacent elements in a row of matrix A has one zero and one non-zero element. Only the non-zero element is stored in memory, and the 4-bit index in the metadata indicates the position of the non-zero element in the two-wide chunk. The only meaningful values of the index are:

  • 0b1110

  • 0b0100

The rest of the values result in undefined behavior.

_images/tcgen05-sparse-mma-metadata-tf32.png

Figure 259 Sparse tcgen05.mma metadata example for tf32 kind

9.7.16.10.8.2. Sparse tcgen05.mma.sp with .kind::f16, .kind::f8f6f4, .kind::mxf8f6f4, .kind::i8

For the following .kind variants of tcgen05.mma:

  • .kind::f16

  • .kind::f8f6f4

  • .kind::mxf8f6f4

  • .kind::i8

matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zero and two non-zero elements. Only the non-zero elements are stored in memory, and the two 2-bit indices in the metadata indicate the positions of the two non-zero elements in the four-wide chunk. The only meaningful values of the index are:

  • 0b0100

  • 0b1000

  • 0b1100

  • 0b1001

  • 0b1101

  • 0b0110

  • 0b1110

_images/tcgen05-sparse-mma-metadata-f16-f8f6f4-mxf8f6f4.png

Figure 260 Sparse tcgen05.mma metadata example for f16/f8f6f4/mxf8f6f4 kind

9.7.16.10.8.3. Sparse tcgen05.mma.sp with .kind::mxf4 and .kind::mxf4nvf4

For .kind::mxf4 and .kind::mxf4nvf4, matrix A is pair-wise structured sparse at a granularity of 4:8. In other words, each chunk of eight adjacent elements in a row of matrix A has four zero and four non-zero elements. The zero and non-zero elements are clustered in sub-chunks of two elements each within the eight-wide chunk, so each two-wide sub-chunk within the eight-wide chunk must be all zeros or all non-zeros. Only the four non-zero elements are stored in memory, and the two 2-bit indices in the metadata indicate the positions of the two two-wide sub-chunks with non-zero values in the eight-wide chunk of a row of matrix A. The only meaningful values of the index are:

  • 0b0100

  • 0b1000

  • 0b1100

  • 0b1001

  • 0b1101

  • 0b0110

  • 0b1110

The rest of the values result in undefined behavior.

_images/tcgen05-sparse-mma-metadata-mxf4.png

Figure 261 Sparse tcgen05.mma metadata example for mxf4 kind

9.7.16.10.8.4. Sparsity selector

The value of the sparsity selector selects the sub-columns in the Tensor Memory to form the sparsity metadata matrix, which is used with matrix A to form the multiplicand matrix.

The following shows the sparse metadata matrix layout in Tensor Memory for the various MMA variants:

9.7.16.10.8.4.1. Layout of the Sparsity Metadata Matrix for M = 64 for .kind::f16

Figure 262 shows which sub-columns get selected for different values of the Sparsity Selector.

_images/tcgen05-sparse-matrices-sparsity-selector-kind-f16-m64.png

Figure 262 Sparsity Metadata Layout for M = 64 for .kind::f16

9.7.16.10.8.4.2. Layout of the Sparsity Metadata Matrix for M = 128 / M = 256 for .kind::f16

Figure 263 shows which sub-columns get selected for different values of the Sparsity Selector.

_images/tcgen05-sparse-matrices-sparsity-selector-kind-f16-m128-256.png

Figure 263 Sparsity Metadata Layout for M = 128 / M = 256 for .kind::f16

9.7.16.10.8.4.3. Layout of the Sparsity Metadata Matrix for M = 64 for .kind::tf32

Figure 264 shows which sub-columns get selected for different values of the Sparsity Selector.

_images/tcgen05-sparse-matrices-sparsity-selector-kind-tf32-m64.png

Figure 264 Sparsity Metadata Layout for M = 64 for .kind::tf32

9.7.16.10.8.4.4. Layout of the Sparsity Metadata Matrix for M = 128 / M = 256 for .kind::tf32

Figure 265 shows which sub-columns get selected for different values of the Sparsity Selector.

_images/tcgen05-sparse-matrices-sparsity-selector-kind-tf32-m128-256.png

Figure 265 Sparsity Metadata Layout for M = 128 / M = 256 for .kind::tf32

9.7.16.10.8.4.5. Layout of the Sparsity Metadata Matrix for M = 64 for .kind::f8f6f4, .kind::mxf8f6f4, .kind::i8, .kind::mxf4, .kind::mxf4nvf4

The value of the sparsity selector:

  • must be 0 for .kind::i8 and .kind::f8f6f4

  • is assumed to be 0 for .kind::mxf8f6f4, .kind::mxf4 and .kind::mxf4nvf4

and all of the columns are selected, as shown in Figure 266.

_images/tcgen05-sparse-matrices-sparsity-selector-kind-f8f6f4-mxf8f6f4-m64.png

Figure 266 Sparsity Metadata Layout for M = 64 for .kind::f8f6f4, .kind::mxf8f6f4, .kind::i8, .kind::mxf4, .kind::mxf4nvf4

9.7.16.10.8.4.6. Layout of the Sparsity Metadata Matrix for M = 128 / M = 256 for .kind::f8f6f4, .kind::mxf8f6f4, .kind::i8, .kind::mxf4, .kind::mxf4nvf4

The value of the sparsity selector:

  • must be 0 for .kind::i8 and .kind::f8f6f4

  • is assumed to be 0 for .kind::mxf8f6f4, .kind::mxf4 and .kind::mxf4nvf4

and all of the columns are selected, as shown in Figure 267.

_images/tcgen05-sparse-matrices-sparsity-selector-kind-f8f6f4-mxf8f6f4-m128-256.png

Figure 267 Sparsity Metadata Layout for M = 128 / M = 256 for .kind::f8f6f4, .kind::mxf8f6f4, .kind::i8, .kind::mxf4, .kind::mxf4nvf4

9.7.16.10.8.5. Alignment restriction

The layouts which utilize only half the datapath lanes, as specified in Data Path Layout Organization, i.e. Layout F and Layout C, must use the same alignment across matrices A, D and the sparsity metadata matrix.

9.7.16.10.9. TensorCore 5th Generation of MMA Instructions

9.7.16.10.9.1. TensorCore 5th Generation Instructions: tcgen05.mma

tcgen05.mma

Perform the 5th generation matrix multiply and accumulate operation.

Syntax

// 1. Floating-point type without block scaling:
tcgen05.mma.cta_group.kind   [d-tmem],  a-desc,  b-desc, idesc,
                             { disable-output-lane }, enable-input-d {, scale-input-d};

tcgen05.mma.cta_group.kind   [d-tmem], [a-tmem], b-desc, idesc,
                             { disable-output-lane }, enable-input-d {, scale-input-d};

.kind      = { .kind::f16, .kind::tf32, .kind::f8f6f4 }
.cta_group = { .cta_group::1, .cta_group::2 }

----------------------------------------------------------------------------------

// 2. Floating-point type with block scaling:
tcgen05.mma.cta_group.kind.block_scale{.scale_vectorsize}
                                        [d-tmem],  a-desc,  b-desc, idesc,
                                        [scale-A-tmem], [scale-B-tmem], enable-input-d;

tcgen05.mma.cta_group.kind.block_scale{.scale_vectorsize}
                                        [d-tmem], [a-tmem], b-desc, idesc,
                                        [scale-A-tmem], [scale-B-tmem], enable-input-d;

.kind             = { .kind::mxf8f6f4, .kind::mxf4, .kind::mxf4nvf4 }
.cta_group        = { .cta_group::1,   .cta_group::2 }
.scale_vectorsize = { .scale_vec::1X, .scale_vec::2X, .scale_vec::4X, .block16, .block32 }

----------------------------------------------------------------------------------

// 3. Convolution MMA for floating-point type without block scaling:
tcgen05.mma.cta_group.kind.collector_usage [d-tmem],  a-desc,  b-desc, idesc,
                                           { disable-output-lane }, enable-input-d {, scale-input-d};

tcgen05.mma.cta_group.kind{.ashift}.collector_usage [d-tmem], [a-tmem], b-desc, idesc,
                                                    { disable-output-lane }, enable-input-d {, scale-input-d};

tcgen05.mma.cta_group.kind.ashift{.collector_usage} [d-tmem], [a-tmem], b-desc, idesc,
                                                    { disable-output-lane }, enable-input-d {, scale-input-d};

.kind            = { .kind::f16, .kind::tf32, .kind::f8f6f4 }
.cta_group       = { .cta_group::1,   .cta_group::2 }
.collector_usage = { .collector::buffer::op }
::buffer         = { ::a }
::op             = { ::fill, ::use, ::lastuse, ::discard* }

----------------------------------------------------------------------------------

// 4. Activation Stationary MMA for floating-point type with block scaling:
tcgen05.mma.cta_group.kind.block_scale{.scale_vectorsize}.collector_usage
                                            [d-tmem],  a-desc,  b-desc, idesc,
                                            [scale-A-tmem], [scale-B-tmem], enable-input-d;

tcgen05.mma.cta_group.kind.block_scale{.scale_vectorsize}.collector_usage
                                            [d-tmem], [a-tmem], b-desc, idesc,
                                            [scale-A-tmem], [scale-B-tmem], enable-input-d;

.cta_group        = { .cta_group::1,   .cta_group::2 }
.scale_vectorsize = { .scale_vec::1X, .scale_vec::2X, .scale_vec::4X, .block16, .block32 }
.kind             = { .kind::mxf8f6f4, .kind::mxf4, .kind::mxf4nvf4 }
.collector_usage  = { .collector::buffer::op }
::buffer          = { ::a }
::op              = { ::fill, ::use, ::lastuse, ::discard* }

----------------------------------------------------------------------------------

// 5. Integer type:
tcgen05.mma.cta_group.kind::i8  [d-tmem],  a-desc,  b-desc, idesc,
                                { disable-output-lane }, enable-input-d;

tcgen05.mma.cta_group.kind::i8  [d-tmem], [a-tmem], b-desc, idesc,
                                { disable-output-lane }, enable-input-d;

.cta_group = { .cta_group::1,   .cta_group::2 }

----------------------------------------------------------------------------------

// 6. Convolution MMA for integer type:
tcgen05.mma.cta_group.kind::i8.collector_usage          [d-tmem],  a-desc,  b-desc, idesc,
                                                        { disable-output-lane }, enable-input-d;

tcgen05.mma.cta_group.kind::i8.ashift{.collector_usage} [d-tmem], [a-tmem], b-desc, idesc,
                                                        { disable-output-lane }, enable-input-d;

tcgen05.mma.cta_group.kind::i8{.ashift}.collector_usage [d-tmem], [a-tmem], b-desc, idesc,
                                                        { disable-output-lane }, enable-input-d;

.cta_group       = { .cta_group::1,   .cta_group::2 }
.collector_usage = { .collector::buffer::op }
::buffer         = { ::a }
::op             = { ::fill, ::use, ::lastuse, ::discard* }

Description

Instructiontcgen05.mma is an asynchronous instruction which initiates anMxNxK matrixmultiply and accumulate operation,D=A*B+Dwhere theA matrix isMxK, theB matrix isKxN, and theD matrix isMxN.

The operation of the formD=A*Bis issued when the input predicate argumentenable-input-d is false.

The optional immediate argumentscale-input-d can be specified to scale the inputmatrixD as follows:D=A*B+D*(2^-scale-input-d)

The valid range of values for argumentscale-input-d is [0, 15]. The argumentscale-input-d is only valid for.kind::tf32 and.kind::f16.

The 32-bit register operandidesc is the instruction descriptor as describedinInstruction descriptor, specifiesthe shapes, exact types, sparsity and other details of the input matrices,output matrix and the matrix multiply and accumulate operation.

The qualifier.cta_group::1 specifies that the matrix multiply andaccumulate operation is performed on theTensor Memory of theexecuting thread’s CTA only. The qualifier.cta_group::2 specifies that the matrixmultiply and accumulate operation is performed on theTensor Memoryof the executing thread’s CTA and itspeer CTA.

Alltcgen05 instructions within a kernel must specify the same value for the.cta_groupqualifier.

The instructiontcgen05.mma has single thread semantics, unlike the collectiveinstructionsmma.sync orwgmma.mma_async. So, a single thread issuing thetcgen05.mma will result in the initiation of the whole matrix multiply andaccumulate operation. Refer to the sectionIssue Granularity.

The qualifier.kind specifies the general kind of the element types of the multiplicandmatrices. The exact types of the elements of the input and output matrices for each MMA-kindare specified in theInstruction descriptor.

The address operandd-tmem specifies the address of the destination and the accumulationmatrixD in theTensor Memory. The address operanda-tmemspecifies the address of the matrixA in theTensor Memory.The 64-bit register operanda-desc andb-desc are the matrix descriptors whichrepresent the matricesA andB in shared memory respectively. The format of thematrix descriptor is described inMatrix Descriptors.

The vector operanddisable-output-lane specifies the lane(s) in theTensor Memory that should be not be updated with the resultantmatrixD. Elements of the vector operanddisable-output-lane forms a mask whereeach bit corresponds to a lane of theTensor Memory, with leastsignificant bit of the first element of the vector (leftmost in syntax) correspondingto the lane 0 of theTensor Memory. If a bit in the mask is 1,then the corresponding lane in the Tensor Memory for the resultant matrixD will notbe updated. The size of the vector is as follows:

.cta_group

Size of the vector disable-output-lane

::1

4

::2

8

Qualifier.block_scale specifies that the matricesA andB are scaled withscale_A andscale_B matrices respectively before performing the matrix multiplyand accumulate operation as specified in the sectionBlock Scaling.The address operandscale-A-tmem andscale-B-tmem specify the base address thematricesscale_A andscale_B respectively in theTensor Memory.

For qualifier.scale_vectorsize,

  • If.scale_vec::NX is specified: N specifies the number of columns inscale_Amatrix and number of rows inscale_B matrix.

  • If.blockN is specified: N specifies the block size for which single scale factorwill be applied. In this form, value of N is same as the K-dimension / (N of.scale_vec::NX).

Aliased.scale_vectorsize variants:

  1. .block16 is aliased with:

    1. .scale_vec::4X when.kind=.kind::mxf4nvf4 and K = 64 or 128

  2. .block32 is aliased with:

    1. .scale_vec::1X when.kind=.kind::mxf8f6f4 for all supported values of K

    2. .scale_vec::2X when.kind=.kind::mxf4 or.kind::mxf4nvf4 and K = 64 or 128

The valid combinations of MMA-kind and.scale_vectorsize aredescribed inTable 54. For.kind::mxf4 when the qualifier.scale_vectorsize is not specified, then it defaults to.block32. For.kind::mxf4nvf4,the qualifier.scale_vectorsize must be explicitly specified.

The qualifier.ashift shifts the rows of theA matrix down by one row, except forthe last row in theTensor Memory. Qualifier.ashift is only allowedwithM = 128 orM = 256.

The qualifier.collector_usage specifies the usage of collector buffer for matrixA.Following collector buffer operations can be specified:

.collector_usage

Semantics

.collector::a::fill

Specifies that theA matrix read from the memoryshould be filled in collector buffer.

.collector::a::use

Specifies that theA matrix can be read from thecollector buffer. This requires a previous fill tothe collector buffer to be still valid.

.collector::a::lastuse

Specifies that theA matrix can be read from thecollector buffer and the contents of the collectorbuffer can be discarded. This requires a previousfill to the collector buffer to be valid till thecollector buffer is read.

.collector::a::discard

Specifies that the contents of the collector bufferforA can be discarded.

If no.collector_usage qualifier is specified, then it defaults to.collector::a::discard.It is illegal to specify either of.collector::a::use or.collector::a::fill along with.ashift.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Qualifier.kind::mxf4nvf4 introduced in PTX ISA version 8.7.

Qualifiers.block16 and.block32 introduced in PTX ISA version 8.8.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed tosm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8 except.kind::i8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed tosm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Qualifier.kind::i8 is supported on following architectures:

  • sm_100a

  • sm_101a (Renamed tosm_110a from PTX ISA version 9.0)

  • sm_110a

Argumentscale-input-d requiressm_100a and is supported onsm_100f or higher in the same family from PTX ISA version 8.8.

For.scale_vectorsize,

  • .scale_vec::1X,.scale_vec::2X,.scale_vec::4X requiressm_100a.

  • .block16,.block32 requiressm_100f orsm_110f.

For Target ISA details on matrix shape, checkTarget ISA Note.

For Target ISA details on shared memory descriptor, checkTarget ISA Note.

Examples

tcgen05.mma.cta_group::1.kind::tf32      [taddr0],  adesc,  bdesc, idesc, {m0, m1, m2, m3}, p;tcgen05.mma.cta_group::1.kind::mxf8f6f4  [taddr2],  [taddr1],  bdesc, idesc,                                         [tmem_scaleA], [tmem_scaleB], p;tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [mbarObj0];loop:mbarrier.try_wait.parity.b64 p, [mbarObj0], 0;@!p bra loop;
9.7.16.10.9.2.TensorCore 5th Generation Instructions:tcgen05.mma.sp

tcgen05.mma.sp

Perform the 5th generation of matrix multiply and accumulate operation with sparseA matrix.

Syntax

// 1. Floating-point type without block scaling:tcgen05.mma.sp.cta_group.kind  [d-tmem],  a-desc,  b-desc, [sp-meta-tmem] ,  idesc,                               { disable-output-lane }, enable-input-d{, scale-input-d};tcgen05.mma.sp.cta_group.kind  [d-tmem], [a-tmem], b-desc, [sp-meta-tmem] , idesc,                               { disable-output-lane }, enable-input-d{, scale-input-d};.kind       = { .kind::f16, , .kind::tf32, .kind::f8f6f4 }.cta_group  = { .cta_group::1,  .cta_group::2 }----------------------------------------------------------------------------------// 2. Floating-point type with block scaling:tcgen05.mma.sp.cta_group.kind.block_scale{.scale_vectorsize}                                         [d-tmem],  a-desc,  b-desc , [sp-meta-tmem] , idesc,                                         [scale-A-tmem], [scale-B-tmem], enable-input-d;tcgen05.mma.sp.cta_group.kind.block_scale{.scale_vectorsize}                                         [d-tmem], [a-tmem], b-desc , [sp-meta-tmem] , idesc,                                         [scale-A-tmem], [scale-B-tmem], enable-input-d;.scale_vectorsize = { .scale_vec::1X, .scale_vec::2X, .scale_vec::4X, .block16, .block32 }.cta_group      = { .cta_group::1,  .cta_group::2 }.kind = { .kind::mxf8f6f4, .kind::mxf4, .kind::mxf4nvf4 }----------------------------------------------------------------------------------// 3. 
Convolution MMA with floating-point type without block scaling:tcgen05.mma.sp.cta_group.kind.collector_usage           [d-tmem],  a-desc,  b-desc,                                                        [sp-meta-tmem] ,  idesc,                                                        { disable-output-lane }, enable-input-d                                                        {, scale-input-d};tcgen05.mma.sp.cta_group.kind.ashift{.collector_usage}  [d-tmem], [a-tmem], b-desc,                                                        [sp-meta-tmem] , idesc,                                                        { disable-output-lane }, enable-input-d                                                        {, scale-input-d};tcgen05.mma.sp.cta_group.kind{.ashift}.collector_usage  [d-tmem], [a-tmem], b-desc,                                                        [sp-meta-tmem] , idesc,                                                        { disable-output-lane }, enable-input-d                                                        {, scale-input-d};.kind            = { .kind::f16, .kind::tf32, .kind::f8f6f4 }.collector_usage = { .collector::buffer::op }::buffer         = { ::a }::op             = { ::fill, ::use, ::lastuse, ::discard* }----------------------------------------------------------------------------------// 4. 
Activation Stationary MMA with floating-point type with block scaling:tcgen05.mma.sp.cta_group.kind.block_scale{.scale_vectorsize}.collector_usage                                         [d-tmem],  a-desc,  b-desc , [sp-meta-tmem] , idesc,                                         [scale-A-tmem], [scale-B-tmem], enable-input-d;tcgen05.mma.sp.cta_group.kind.block_scale{.scale_vectorsize}.collector_usage                                         [d-tmem], [a-tmem], b-desc , [sp-meta-tmem] , idesc,                                         [scale-A-tmem], [scale-B-tmem], enable-input-d;.kind = { .kind::mxf8f6f4, .kind::mxf4, .kind::mxf4nvf4 }.scale_vectorsize = { .scale_vec::1X, .scale_vec::2X, .scale_vec::4X, .block16, .block32 }.collector_usage = { .collector::buffer::op }::buffer         = { ::a }::op             = { ::fill, ::use, ::lastuse, ::discard* }----------------------------------------------------------------------------------// 5. Integer type:tcgen05.mma.sp.cta_group.kind::i8 [d-tmem],  a-desc,  b-desc, [sp-meta-tmem] , idesc,                                  { disable-output-lane }, enable-input-d;tcgen05.mma.sp.cta_group.kind::i8 [d-tmem], [a-tmem], b-desc, [sp-meta-tmem] , idesc,                                  { disable-output-lane }, enable-input-d;.cta_group      = { .cta_group::1,  .cta_group::2 }----------------------------------------------------------------------------------// 6. 
Convolution MMA with Integer type:tcgen05.mma.sp.cta_group.kind::i8.collector_usage          [d-tmem],  a-desc,  b-desc,                                                           [sp-meta-tmem] , idesc,                                                           { disable-output-lane }, enable-input-d;tcgen05.mma.sp.cta_group.kind::i8.ashift{.collector_usage} [d-tmem], [a-tmem], b-desc,                                                           [sp-meta-tmem], idesc ,                                                           { disable-output-lane }, enable-input-d;tcgen05.mma.sp.cta_group.kind::i8{.ashift}.collector_usage [d-tmem], [a-tmem], b-desc,                                                           [sp-meta-tmem], idesc ,                                                           { disable-output-lane }, enable-input-d;.collector_usage = { .collector::buffer::op }::buffer         = { ::a }::op             = { ::fill, ::use, ::lastuse, ::discard* }

Description

Instructiontcgen05.mma.sp is an asynchronous instruction which initiates anMxNxK matrix multiply and accumulate operation of the formD=A*B+Dwhere theA matrix isMx(K/2), theB matrix isKxN, and theD matrix isMxN.Sparse Matrices describes the details of the sparsity.

The operation of the formD=A*Bis issued when the input predicate argumentenable-input-d is false.

The optional immediate argumentscale-input-d can be specified to scale theinput matrixD as follows:D=A*B+D*(2^-scale-input-d)

The valid range of values for argumentscale-input-d is [0, 15]. The argumentscale-input-d is only valid for.kind::tf32 and.kind::f16.

The 32-bit register operandidesc is the instruction descriptor as described inInstruction descriptor, specifies the shapes,exact types, sparsity and other details of the input matrices, output matrix and thematrix multiply and accumulate operation.

The qualifier.cta_group::1 specifies that the matrix multiply and accumulateoperation is performed on theTensor Memory of the executingthread’s CTA only. The qualifier.cta_group::2 specifies that the matrixmultiply and accumulate operation is performed on theTensor Memoryof the executing thread’s CTA and itspeer CTA.

Alltcgen05 instructions within a kernel must specify the same value for the.cta_groupqualifier.

The instructiontcgen05.mma.sp has single thread semantics, unlike the collectiveinstructionsmma.sync orwgmma.mma_async. So, a single thread issuing thetcgen05.mma.sp will result in the initiation of the whole matrix multiply andaccumulate operation. Refer to the sectionIssue Granularity.

The qualifier.kind specifies the general kind of the element types of the multiplicandmatrices. The exact types of the elements of the input and output matrices for each MMA-kindare specified in theInstruction descriptor.

The address operandd-tmem specifies the address of the destination and the accumulationmatrixD in theTensor Memory. The address operanda-tmemspecifies the address of the matrixA in theTensor Memory. The64-bit register operanda-desc andb-desc are the matrix descriptors which representthe matricesA andB in shared memory respectively. The format of the matrix descriptoris described inMatrix Descriptors.

The vector operanddisable-output-lane specifies the lane(s) in theTensor Memorythat should be not be updated with the resultant matrixD. Elements of the vector operanddisable-output-lane forms a mask where each bit corresponds to a lane of theTensor Memory. with least significant bit of the first element ofthe vector (leftmost in syntax) corresponding to the lane 0 of the Tensor Memory. If a bit inthe mask is 1, then the corresponding lane in the Tensor Memory for the resultant matrixDwill not be updated. The size of the vector is as follows:

.cta_group

Size of the vector disable-output-lane

::1

4

::2

8

Qualifier.block_scale specifies that the matricesA andB are scaled withscale_A andscale_B matrices respectively before performing the matrix multiplyand accumulate operation as specified in the sectionBlock Scaling.The address operandscale-A-tmem andscale-B-tmem specify the base address thematricesscale_A andscale_B respectively in theTensor Memory.

For qualifier.scale_vectorsize,

  • If.scale_vec::NX is specified: N specifies the number of columns inscale_Amatrix and number of rows inscale_B matrix.

  • If.blockN is specified: N specifies the block size for which single scale factorwill be applied. In this form, value of N is same as the K-dimension / (N of.scale_vec::NX).

Aliased.scale_vectorsize variants:

  1. .block16 is aliased with:

    1. .scale_vec::4X when.kind=.kind::mxf4nvf4 and K = 64 or 128

  2. .block32 is aliased with:

    1. .scale_vec::1X when.kind=.kind::mxf8f6f4 for all supported values of K

    2. .scale_vec::2X when.kind=.kind::mxf4 or.kind::mxf4nvf4 and K = 64 or 128

The valid combinations of MMA-kind and.scale_vectorsize aredescribed inTable 54. For.kind::mxf4 when the qualifier.scale_vectorsize is not specified, then it defaults to.block32. For.kind::mxf4nvf4,the qualifier.scale_vectorsize must be explicitly specified.

The qualifier.ashift shifts the rows of theA matrix down by one row, except forthe last row in theTensor Memory. Qualifier.ashift is only allowedwithM = 128 orM = 256.

The qualifier.collector_usage specifies the usage of collector buffer for matrixA.Following collector buffer operations can be specified:

.collector_usage

Semantics

.collector::a::fill

Specifies that theA matrix read from the memoryshould be filled in collector buffer.

.collector::a::use

Specifies that theA matrix can be read from thecollector buffer. This requires a previous fill tothe collector buffer to be still valid.

.collector::a::lastuse

Specifies that theA matrix can be read from thecollector buffer and the contents of the collectorbuffer can be discarded. This requires a previousfill to the collector buffer to be valid till thecollector buffer is read.

.collector::a::discard

Specifies that the contents of the collector bufferforA can be discarded.

If no.collector_usage qualifier is specified, then it defaults to.collector::a::discard.It is illegal to specify either of.collector::a::use or.collector::a::fill along with.ashift.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Qualifier.kind::mxf4nvf4 introduced in PTX ISA version 8.7.

Qualifiers.block16 and.block32 introduced in PTX ISA version 8.8.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed tosm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8 except.kind::i8/.kind::mxf4nvf4/.kind::mxf4:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed tosm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Qualifier.kind::i8 is supported on following architectures:

  • sm_100a

  • sm_101a (Renamed tosm_110a from PTX ISA version 9.0)

  • sm_110a

Qualifiers.kind::mxf4nvf4 and.kind::mxf4 are supported on following architectures:

  • sm_100a

  • sm_101a (Renamed tosm_110a from PTX ISA version 9.0)

  • sm_103a

  • sm_110a

Argumentscale-input-d requiressm_100a and is supported onsm_100f or higher in the same family from PTX ISA version 8.8.

For.scale_vectorsize,

  • .scale_vec::1X,.scale_vec::2X,.scale_vec::4X requiressm_100a.

  • .block16,.block32 requiressm_100f orsm_110f.

For Target ISA details on matrix shape, checkTarget ISA Note.

For Target ISA details on shared memory descriptor, checkTarget ISA Note.

Examples

tcgen05.mma.sp.cta_group::1.kind::f16      [taddr0],  adesc,  bdesc, [tmem_spmeta0], idesc, p;tcgen05.mma.sp.cta_group::1.kind::mxf8f6f4.collector::a:fill                                           [taddr2],  [taddr1],  bdesc, [tmem_spmeta1], idesc,                                           [tmem_scaleA], [tmem_scaleB], p;tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [mbarObj0];loop:mbarrier.try_wait.parity.b64 p, [mbarObj0], 0;@!p bra loop;
9.7.16.10.9.3.TensorCore 5th Generation Instructions:tcgen05.mma.ws

tcgen05.mma.ws

Perform the 5th generation of weight stationary convolution matrix multiply and accumulateoperation.

Syntax

// 1. Floating-point type without block scaling:tcgen05.mma.ws.cta_group::1.kind{.collector_usage}    [d-tmem],  a-desc,  b-desc,  idesc,                                                      enable-input-d {, zero-column-mask-desc };tcgen05.mma.ws.cta_group::1.kind{.collector_usage}    [d-tmem], [a-tmem], b-desc, idesc,                                                      enable-input-d {, zero-column-mask-desc };.kind = { .kind::f16, .kind::tf32, .kind::f8f6f4 }----------------------------------------------------------------------------------// 2. Integer type:tcgen05.mma.ws.cta_group::1.kind::i8{.collector_usage} [d-tmem],  a-desc,  b-desc, idesc,                                                       enable-input-d {, zero-column-mask-desc};tcgen05.mma.ws.cta_group::1.kind::i8{.collector_usage} [d-tmem], [a-tmem], b-desc, idesc,                                                       enable-input-d {, zero-column-mask-desc};.collector_usage = { .collector::buffer::op }::buffer = { ::b0, ::b1, ::b2, ::b3 }::op   = { ::fill, ::use, ::lastuse, ::discard}

Description

Instructiontcgen05.mma.ws is an asynchronous instruction which initiates anMxNxKmatrix multiply and accumulate operation,D=A*B+Dwhere theA matrix isMxK, theB matrix isKxN, and theD matrix isMxN.

The operation of the formD=A*Bis issued when the input predicate argumentenable-input-d is false.

The 32-bit register operandidesc is the instruction descriptor as described inInstruction descriptor, specifies the shapes, exacttypes, sparsity and other details of the input matrices, output matrix and the matrixmultiply and accumulate operation.

The qualifier.cta_group::1 specifies that the matrix multiply and accumulate operationis performed on theTensor Memory of the executing thread’s CTA only.

Alltcgen05 instructions within a kernel must specify the same value for the.cta_groupqualifier.

The instructiontcgen05.mma.ws has single thread semantics, unlike the collectiveinstructionsmma.sync orwgmma.mma_async. So, a single thread issuing thetcgen05.mma.ws will result in the initiation of the whole matrix multiply and accumulateoperation. Refer to the sectionIssue Granularity.

The qualifier.kind specifies the general kind of the element types of the multiplicandmatrices. The exact types of the elements of the input and output matrices for each MMA-kindare specified in theInstruction descriptor.

The address operandd-tmem specifies the address of the destination and the accumulationmatrixD in theTensor Memory. The address operanda-tmemspecifies the address of the matrixA in theTensor Memory. The64-bit register operanda-desc andb-desc are the matrix descriptors which representthe matricesA andB in shared memory respectively. The format of the matrix descriptoris described inMatrix Descriptors.

The optional operandzero-column-mask-desc is a 64-bit register which specifies theZero-Column Mask Descriptor. The zero-columnmask descriptor is used to generate a mask that specifies which columns ofB matrixwill have zero value for the matrix multiply and accumulate operation regardless of thevalues present in the shared memory.

The qualifier.collector_usage specifies the usage of collector buffer for MatrixB.Following collector buffer operations can be specified:

.collector_usage

Semantics

.collector::bN::fill

Specifies that theB matrix read from the memoryshould be filled in collector buffer #N.

.collector::bN::use

Specifies that theB matrix can be read from thecollector buffer #N. This requires a previous fillto the collector buffer #N to be still valid.

.collector::bN::lastuse

Specifies that theB matrix can be read from thecollector buffer #N after which the contents of thecollector buffer #N can be discarded. This requiresa previous fill to the collector buffer #N to bevalid till the collector buffer #N is read.

.collector::bN::discard

Specifies that the contents of the collector buffer#N can be discarded.

If no.collector_usage qualifier is specified, then it defaults to.collector::b0::discard.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on following architectures:

  • sm_100a

  • sm_101a (Renamed tosm_110a from PTX ISA version 9.0)

  • And is supported on following family-specific architectures from PTX ISA version 8.8 except.kind::i8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed tosm_110f from PTX ISA version 9.0)

  • sm_110f or higher in the same family

Qualifier.kind::i8 is supported on following architectures:

  • sm_100a

  • sm_101a (Renamed tosm_110a from PTX ISA version 9.0)

  • sm_110a

Examples

tcgen05.mma.ws.cta_group::1.kind::i8.collector::b2:use [taddr2], [taddr1], bdesc, idesc, p;tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [mbarObj0];loop:mbarrier.try_wait.parity.b64 p, [mbarObj0], 0;@!p bra loop;
9.7.16.10.9.4.TensorCore 5th Generation Instructions:tcgen05.mma.ws.sp

tcgen05.mma.ws.sp

Perform the 5th generation of weight stationary convolution matrix multiply and accumulateoperation with sparseA matrix.

Syntax

// 1. Floating-point type without block scaling:tcgen05.mma.ws.sp.cta_group::1.kind{.collector_usage} [d-tmem],  a-desc,  b-desc,                                                      [sp-meta-tmem] ,  idesc,                                                      enable-input-d {, zero-column-mask-desc};tcgen05.mma.ws.sp.cta_group::1.kind{.collector_usage} [d-tmem], [a-tmem], b-desc,                                                      [sp-meta-tmem] , idesc,                                                      enable-input-d {, zero-column-mask-desc};.kind = { .kind::f16, .kind::tf32, .kind::f8f6f4 }----------------------------------------------------------------------------------// 2. Integer type:tcgen05.mma.ws.sp.cta_group::1.kind::i8{.collector_usage} [d-tmem], a-desc, b-desc,                                                          [sp-meta-tmem] , idesc,                                                          enable-input-d {, zero-column-mask-desc};tcgen05.mma.ws.sp.cta_group::1.kind::i8{.collector_usage} [d-tmem], [a-tmem], b-desc,                                                          [sp-meta-tmem] , idesc,                                                          enable-input-d {, zero-column-mask-desc};.collector_usage = { .collector::buffer::op }::buffer = { ::b0, ::b1, ::b2, ::b3 }::op   = { ::fill, ::use, ::lastuse, ::discard}

Description

Instructiontcgen05.mma.ws.sp is an asynchronous instruction which initiatesanMxNxK matrix multiply and accumulate operation,D=A*B+Dwhere theA matrix isMx(K/2), theB matrix isKxN, and theD matrixisMxN.Sparse Matrices describes the details of thesparsity.

The operation of the formD=A*Bis issued when the input predicate argumentenable-input-d is false.

The 32-bit register operandidesc is the instruction descriptor as described inInstruction descriptor, specifies the shapes, exacttypes, sparsity and other details of the input matrices, output matrix and the matrixmultiply and accumulate operation.

The qualifier.cta_group::1 specifies that the matrix multiply and accumulateoperation is performed on the Tensor Memory of the executing thread’s CTA only.

Alltcgen05 instructions within a kernel must specify the same value for the.cta_groupqualifier.

The instructiontcgen05.mma.ws.sp has single thread semantics, unlike the collectiveinstructionsmma.sync orwgmma.mma_async. So, a single thread issuing thetcgen05.mma.ws.sp will result in the initiation of the whole matrix multiply andaccumulate operation. Refer to the sectionIssue Granularity.

The qualifier.kind specifies the general kind of the element types of the multiplicandmatrices. The exact types of the elements of the input and output matrices for each MMA-kind arespecified in theInstruction descriptor.

The address operandd-tmem specifies the address of the destination and the accumulationmatrixD in theTensor Memory. The address operanda-tmem specifiesthe address of the matrixA in theTensor Memory. The 64-bit registeroperanda-desc andb-desc are the matrix descriptors which represent the matricesAandB in shared memory respectively. The format of the matrix descriptor is described inMatrix Descriptors.

The optional operandzero-column-mask-desc is a 64-bit register which specifies theZero-Column Mask Descriptor. The zero-columnmask descriptor is used to generate a mask that specifies which columns ofB matrixwill have zero value for the matrix multiply and accumulate operation regardless of thevalues present in the shared memory.

The qualifier.collector_usage specifies the usage of collector buffer for MatrixB.Following collector buffer operations can be specified:

.collector_usage

Semantics

.collector::bN::fill

Specifies that theB matrix read from the memoryshould be filled in collector buffer #N.

.collector::bN::use

Specifies that theB matrix can be read from thecollector buffer #N. This requires a previous fillto the collector buffer #N to be still valid.

.collector::bN::lastuse

Specifies that theB matrix can be read from thecollector buffer #N after which the contents of thecollector buffer #N can be discarded. This requiresa previous fill to the collector buffer #N to bevalid till the collector buffer #N is read.

.collector::bN::discard

Specifies that the contents of the collector buffer#N can be discarded.

If no.collector_usage qualifier is specified, then it defaults to.collector::b0::discard.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on the following family-specific architectures from PTX ISA version 8.8, except .kind::i8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_110f or higher in the same family

Qualifier .kind::i8 is supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_110a

Examples

tcgen05.mma.ws.sp.cta_group::1.kind::tf32.collector::b1::fill  [taddr1], [taddr0], bdesc,
                                                               [tmem_spmeta0], idesc, p;
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [mbarObj0];

loop:
mbarrier.try_wait.parity.b64 p, [mbarObj0], 0;
@!p bra loop;

9.7.16.11. TensorCore 5th Generation Specialized Synchronization Operations

9.7.16.11.1. TensorCore 5th Generation Instructions: tcgen05.fence

tcgen05.fence

Specialized fence for the asynchronous tcgen05 operations.

Syntax

tcgen05.fence::before_thread_sync;
tcgen05.fence::after_thread_sync;

Description

The instruction tcgen05.fence::before_thread_sync orders all the prior asynchronous tcgen05 operations with respect to the subsequent tcgen05 and the execution ordering operations.

The instruction tcgen05.fence::after_thread_sync orders all the subsequent asynchronous tcgen05 operations with respect to the prior tcgen05 and the execution ordering operations.

The tcgen05.fence::* instructions compose with execution ordering instructions across a thread scope and provide ordering between tcgen05 instructions across the same scope.

The tcgen05.fence::before_thread_sync instruction behaves as a code motion fence for prior tcgen05 instructions: they cannot be hoisted across it. The tcgen05.fence::after_thread_sync instruction behaves as a code motion fence for subsequent tcgen05 instructions: they cannot be hoisted across it.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_110f or higher in the same family

Examples

// Producer thread:
tcgen05.cp.cta_group::1.128x256b  [taddr0], sdesc0;
tcgen05.fence::before_thread_sync;
st.relaxed.b32 [flag], 1;

// Consumer thread:
loop:
ld.relaxed.b32 r, [flag];
setp.eq.u32 p, r, 1;
@!p bra loop;
tcgen05.fence::after_thread_sync;
tcgen05.mma.cta_group.kind   [taddr0], adesc, bdesc, idesc, p;

9.7.16.12. TensorCore 5th Generation Async Synchronization Operations

9.7.16.12.1. TensorCore 5th Generation Instructions: tcgen05.commit

tcgen05.commit

Makes the mbarrier object track the completion of all prior async-tcgen05 operations initiated by the executing thread.

Syntax

tcgen05.commit.cta_group.completion_mechanism{.shared::cluster}{.multicast}.b64
                                                           [mbar] {, ctaMask};

.completion_mechanism = { .mbarrier::arrive::one }
.cta_group            = { .cta_group::1, .cta_group::2 }
.multicast            = { .multicast::cluster }

Description

The instruction tcgen05.commit is an asynchronous instruction which makes the mbarrier object, specified by the address operand mbar, track the completion of all the prior asynchronous tcgen05 operations, as listed in mbarrier based completion mechanism, initiated by the executing thread. Upon the completion of the tracked asynchronous tcgen05 operations, the signal specified by the .completion_mechanism is triggered by the system on the mbarrier object.

The instruction tcgen05.commit.cta_group::1 tracks the completion of all prior asynchronous tcgen05 operations with .cta_group::1 issued by the current thread. Similarly, the instruction tcgen05.commit.cta_group::2 tracks the completion of all prior asynchronous tcgen05 operations with .cta_group::2 issued by the current thread.

All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.

The qualifier .mbarrier::arrive::one indicates that upon the completion of the prior asynchronous tcgen05 operation issued by the current thread, an arrive-on operation, with the count argument of 1, is signaled on the mbarrier object. The scope of the arrive-on operation is the cluster scope.

The optional qualifier .multicast::cluster allows signaling on the mbarrier objects of multiple CTAs in the cluster. Operand ctaMask specifies the CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %cluster_ctarank of the destination CTA. The mbarrier signal is multicast to the same offset as mbar in the shared memory of each destination CTA.
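The bit-to-rank mapping of ctaMask can be illustrated with a small Python sketch (the helper name is ours, not part of PTX):

```python
# Hypothetical helper (not part of PTX): decode a 16-bit ctaMask into the
# %cluster_ctarank values of the CTAs whose mbarrier objects are signaled.
def multicast_targets(cta_mask):
    assert 0 <= cta_mask < (1 << 16), "ctaMask is a 16-bit operand"
    # Bit position i selects the CTA with %cluster_ctarank == i.
    return [rank for rank in range(16) if (cta_mask >> rank) & 1]
```

For example, a mask of 0b1011 selects the mbarrier objects of the CTAs with ranks 0, 1, and 3.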

If no state space is specified then Generic Addressing is used. If the address specified by mbar does not fall within the address window of the .shared::cluster state space then the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.6.

Target ISA Notes

Supported on the following architectures:

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_110f or higher in the same family

Examples

Example 1:

tcgen05.cp.cta_group::1.128x256b                      [taddr0], sdesc0;
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [mbarObj1];

loop:
mbarrier.try_wait.parity.b64 p, [mbarObj1], 0;
@!p bra loop;

Example 2:

tcgen05.mma.cta_group::2.kind::tf32    [taddr0],  adesc,  bdesc, idesc, p;
tcgen05.commit.cta_group::2.mbarrier::arrive::one.b64 [mbarObj2];

loop:
mbarrier.try_wait.parity.b64 p, [mbarObj2], 0;
@!p bra loop;

9.7.17. Stack Manipulation Instructions

The stack manipulation instructions can be used to dynamically allocate and deallocate memory on the stack frame of the current function.

The stack manipulation instructions are:

  • stacksave

  • stackrestore

  • alloca

9.7.17.1. Stack Manipulation Instructions: stacksave

stacksave

Save the value of the stack pointer into a register.

Syntax

stacksave.type  d;

.type = { .u32, .u64 };

Description

Copies the current value of the stack pointer into the destination register d. The pointer returned by stacksave can be used in a subsequent stackrestore instruction to restore the stack pointer. If d is modified prior to its use in a stackrestore instruction, it may corrupt data in the stack.

Destination operand d has the same type as the instruction type.

Semantics

d = stackptr;

PTX ISA Notes

Introduced in PTX ISA version 7.3.

Preview Feature:

stacksave is a preview feature in PTX ISA version 7.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.

Target ISA Notes

stacksave requires sm_52 or higher.

Examples

.reg .u32 rd;
stacksave.u32 rd;

.reg .u64 rd1;
stacksave.u64 rd1;

9.7.17.2. Stack Manipulation Instructions: stackrestore

stackrestore

Update the stack pointer with a new value.

Syntax

stackrestore.type  a;

.type = { .u32, .u64 };

Description

Sets the current stack pointer to source register a.

When stackrestore is used with an operand a written by a prior stacksave instruction, it will effectively restore the state of the stack as it was before stacksave was executed. Note that if stackrestore is used with an arbitrary value of a, it may cause corruption of the stack pointer. This implies that correct use of this feature requires that stackrestore.type a is used after stacksave.type a without redefining the value of a between them.

Operand a has the same type as the instruction type.

Semantics

stackptr = a;

PTX ISA Notes

Introduced in PTX ISA version 7.3.

Preview Feature:

stackrestore is a preview feature in PTX ISA version 7.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.

Target ISA Notes

stackrestore requires sm_52 or higher.

Examples

.reg .u32 ra;
stacksave.u32 ra;

// Code that may modify stack pointer
...

stackrestore.u32 ra;

9.7.17.3. Stack Manipulation Instructions: alloca

alloca

Dynamically allocate memory on stack.

Syntax

alloca.type  ptr, size{, immAlign};

.type = { .u32, .u64 };

Description

The alloca instruction dynamically allocates memory on the stack frame of the current function and updates the stack pointer accordingly. The returned pointer ptr points to local memory and can be used in the address operand of ld.local and st.local instructions.

If sufficient memory is unavailable for allocation on the stack, then execution of alloca may result in stack overflow. In such cases, attempting to access the allocated memory with ptr will result in undefined program behavior.

The memory allocated by alloca is deallocated in the following ways:

  • It is automatically deallocated when the function exits.

  • It can be explicitly deallocated using stacksave and stackrestore instructions: stacksave can be used to save the value of the stack pointer before executing alloca, and stackrestore can be used after alloca to restore the stack pointer to the original value which was previously saved with stacksave. Note that accessing deallocated memory after executing stackrestore results in undefined behavior.

size is an unsigned value which specifies the amount of memory in number of bytes to be allocated on the stack. size=0 may not lead to a valid memory allocation.

Both ptr and size have the same type as the instruction type.

immAlign is a 32-bit value which specifies the alignment requirement in number of bytes for the memory allocated by alloca. It is an integer constant, must be a power of 2, and must not exceed 2^23. immAlign is an optional argument with a default value of 8, which is the minimum guaranteed alignment.

Semantics

alloca.type ptr, size, immAlign:

a = max(immAlign, frame_align); // frame_align is the minimum guaranteed alignment

// Allocate size bytes of stack memory with alignment a and update the stack pointer.
// Since the stack grows down, the updated stack pointer contains a lower address.
stackptr = alloc_stack_mem(size, a);

// Return the new value of stack pointer as ptr. Since ptr is the lowest address of the memory
// allocated by alloca, the memory can be accessed using ptr up to (ptr + size of allocated memory).
stacksave ptr;
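The semantics above can be modeled with a short Python sketch, assuming a downward-growing stack whose pointer is a plain integer address (alloc_stack_mem and FRAME_ALIGN are our stand-ins for the pseudocode names):

```python
FRAME_ALIGN = 8  # minimum guaranteed alignment, per the description above

# Model of the pseudocode: subtract the size from the stack pointer, then
# round down to a multiple of the alignment. The stack grows down, so the
# result is both the new stack pointer and the returned ptr.
def alloc_stack_mem(stackptr, size, imm_align=8):
    assert imm_align & (imm_align - 1) == 0, "immAlign must be a power of 2"
    assert imm_align <= 2**23, "immAlign must not exceed 2^23"
    a = max(imm_align, FRAME_ALIGN)
    return (stackptr - size) & ~(a - 1)
```

The returned address is aligned to a, and the allocated bytes [ptr, ptr + size) lie below the old stack pointer.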

PTX ISA Notes

Introduced in PTX ISA version 7.3.

Preview Feature:

alloca is a preview feature in PTX ISA version 7.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.

Target ISA Notes

alloca requires sm_52 or higher.

Examples

.reg .u32 ra, stackptr, ptr, size;

stacksave.u32 stackptr;     // Save the current stack pointer

alloca ptr, size, 8;        // Allocate stack memory
st.local.u32 [ptr], ra;     // Use the allocated stack memory

stackrestore.u32 stackptr;  // Deallocate memory by restoring the stack pointer

9.7.18. Video Instructions

All video instructions operate on 32-bit register operands. However, the video instructions may be classified as either scalar or SIMD based on whether their core operation applies to one or multiple values.

The video instructions are:

  • vadd, vadd2, vadd4

  • vsub, vsub2, vsub4

  • vmad

  • vavrg2, vavrg4

  • vabsdiff, vabsdiff2, vabsdiff4

  • vmin, vmin2, vmin4

  • vmax, vmax2, vmax4

  • vshl

  • vshr

  • vset, vset2, vset4

9.7.18.1. Scalar Video Instructions

All scalar video instructions operate on 32-bit register operands. The scalar video instructions are:

  • vadd

  • vsub

  • vabsdiff

  • vmin

  • vmax

  • vshl

  • vshr

  • vmad

  • vset

The scalar video instructions execute the following stages:

  1. Extract and sign- or zero-extend byte, half-word, or word values from the source operands, to produce signed 33-bit input values.

  2. Perform a scalar arithmetic operation to produce a signed 34-bit result.

  3. Optionally clamp the result to the range of the destination type.

  4. Optionally perform one of the following:

    • apply a second operation to the intermediate result and a third operand, or

    • truncate the intermediate result to a byte or half-word value and merge into a specified position in the third operand to produce the final result.

The general format of scalar video instructions is as follows:

// 32-bit scalar operation, with optional secondary operation
vop.dtype.atype.btype{.sat}        d, a{.asel}, b{.bsel};
vop.dtype.atype.btype{.sat}.secop  d, a{.asel}, b{.bsel}, c;

// 32-bit scalar operation, with optional data merge
vop.dtype.atype.btype{.sat}   d.dsel, a{.asel}, b{.bsel}, c;

.dtype = .atype = .btype = { .u32, .s32 };
.dsel  = .asel  = .bsel  = { .b0, .b1, .b2, .b3, .h0, .h1 };
.secop = { .add, .min, .max };

The source and destination operands are all 32-bit registers. The type of each operand (.u32 or .s32) is specified in the instruction type; all combinations of dtype, atype, and btype are valid. Using the atype/btype and asel/bsel specifiers, the input values are extracted and sign- or zero-extended internally to .s33 values. The primary operation is then performed to produce an .s34 intermediate result. The sign of the intermediate result depends on dtype.

The intermediate result is optionally clamped to the range of the destination type (signed or unsigned), taking into account the subword destination size in the case of optional data merging.

.s33 optSaturate( .s34 tmp, Bool sat, Bool sign, Modifier dsel ) {
    if ( !sat )  return tmp;

    switch ( dsel ) {
        case .b0, .b1, .b2, .b3:
            if ( sign )  return CLAMP( tmp, S8_MAX, S8_MIN );
            else         return CLAMP( tmp, U8_MAX, U8_MIN );
        case .h0, .h1:
            if ( sign )  return CLAMP( tmp, S16_MAX, S16_MIN );
            else         return CLAMP( tmp, U16_MAX, U16_MIN );
        default:
            if ( sign )  return CLAMP( tmp, S32_MAX, S32_MIN );
            else         return CLAMP( tmp, U32_MAX, U32_MIN );
    }
}
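A Python transliteration of the clamping logic may make the subword cases easier to follow (a non-normative sketch; dsel is passed as a plain string such as "b0" or "h1", empty for the word case):

```python
# Clamp bounds per subword size: byte (.b*), half-word (.h*), or full word.
BOUNDS = {
    "b": (-128, 127, 0, 255),
    "h": (-32768, 32767, 0, 65535),
    "w": (-2**31, 2**31 - 1, 0, 2**32 - 1),
}

def opt_saturate(tmp, sat, signed, dsel=""):
    if not sat:
        return tmp
    kind = dsel[0] if dsel and dsel[0] in "bh" else "w"
    smin, smax, umin, umax = BOUNDS[kind]
    lo, hi = (smin, smax) if signed else (umin, umax)
    return max(lo, min(hi, tmp))
```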

This intermediate result is then optionally combined with the third source operand using a secondary arithmetic operation or subword data merge, as shown in the following pseudocode. The sign of the third operand is based on dtype.

.s33 optSecOp(Modifier secop, .s33 tmp, .s33 c) {
    switch ( secop ) {
        .add:     return tmp + c;
        .min:     return MIN(tmp, c);
        .max:     return MAX(tmp, c);
        default:  return tmp;
    }
}
.s33 optMerge( Modifier dsel, .s33 tmp, .s33 c ) {
    switch ( dsel ) {
        case .h0:  return (tmp & 0xffff)           | (0xffff0000 & c);
        case .h1:  return ((tmp & 0xffff) << 16)   | (0x0000ffff & c);
        case .b0:  return (tmp & 0xff)             | (0xffffff00 & c);
        case .b1:  return ((tmp & 0xff) <<  8)     | (0xffff00ff & c);
        case .b2:  return ((tmp & 0xff) << 16)     | (0xff00ffff & c);
        case .b3:  return ((tmp & 0xff) << 24)     | (0x00ffffff & c);
        default:   return tmp;
    }
}

The lower 32-bits are then written to the destination operand.
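The merge path can likewise be sketched in Python (a non-normative model of optMerge; the mask table mirrors the pseudocode above):

```python
# Field (mask, shift) for each destination subword selector.
MERGE = {
    "h0": (0xffff, 0),  "h1": (0xffff, 16),
    "b0": (0xff, 0), "b1": (0xff, 8), "b2": (0xff, 16), "b3": (0xff, 24),
}

def opt_merge(dsel, tmp, c):
    if dsel not in MERGE:
        return tmp
    m, shift = MERGE[dsel]
    field = m << shift
    # Truncate tmp to the subword, place it at the selected position,
    # and keep the remaining bits of c.
    return ((tmp & m) << shift) | (c & ~field & 0xffffffff)
```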

9.7.18.1.1. Scalar Video Instructions: vadd, vsub, vabsdiff, vmin, vmax

vadd, vsub

Integer byte/half-word/word addition/subtraction.

vabsdiff

Integer byte/half-word/word absolute value of difference.

vmin, vmax

Integer byte/half-word/word minimum/maximum.

Syntax

// 32-bit scalar operation, with optional secondary operation
vop.dtype.atype.btype{.sat}       d, a{.asel}, b{.bsel};
vop.dtype.atype.btype{.sat}.op2   d, a{.asel}, b{.bsel}, c;

// 32-bit scalar operation, with optional data merge
vop.dtype.atype.btype{.sat}  d.dsel, a{.asel}, b{.bsel}, c;

 vop   = { vadd, vsub, vabsdiff, vmin, vmax };
.dtype = .atype = .btype = { .u32, .s32 };
.dsel  = .asel  = .bsel  = { .b0, .b1, .b2, .b3, .h0, .h1 };
.op2   = { .add, .min, .max };

Description

Perform scalar arithmetic operation with optional saturate, and optional secondary arithmetic operation or subword data merge.

Semantics

// extract byte/half-word/word and sign- or zero-extend
// based on source operand type
ta = partSelectSignExtend( a, atype, asel );
tb = partSelectSignExtend( b, btype, bsel );

switch ( vop ) {
    case vadd:     tmp = ta + tb;
    case vsub:     tmp = ta - tb;
    case vabsdiff: tmp = | ta - tb |;
    case vmin:     tmp = MIN( ta, tb );
    case vmax:     tmp = MAX( ta, tb );
}

// saturate, taking into account destination type and merge operations
tmp = optSaturate( tmp, sat, isSigned(dtype), dsel );
d = optSecondaryOp( op2, tmp, c );  // optional secondary operation
d = optMerge( dsel, tmp, c );       // optional merge with c operand
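The extraction step and one of the primary operations can be sketched in Python (our model of partSelectSignExtend, not the hardware definition; lane selectors are strings such as "b0" or "h1", "w" for the full word):

```python
# Select a byte ("b0".."b3"), half-word ("h0"/"h1"), or the whole word ("w")
# from a 32-bit register value, then sign- or zero-extend it.
def part_select_sign_extend(x, signed, sel="w"):
    if sel == "w":
        width, shift = 32, 0
    elif sel[0] == "b":
        width, shift = 8, 8 * int(sel[1])
    else:
        width, shift = 16, 16 * int(sel[1])
    v = (x >> shift) & ((1 << width) - 1)
    if signed and v >= 1 << (width - 1):
        v -= 1 << width          # sign-extend into a Python integer
    return v

def vabsdiff(a, b, a_signed, b_signed, asel="w", bsel="w"):
    ta = part_select_sign_extend(a, a_signed, asel)
    tb = part_select_sign_extend(b, b_signed, bsel)
    return abs(ta - tb)          # saturation and secondary op omitted here
```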

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

vadd, vsub, vabsdiff, vmin, vmax require sm_20 or higher.

Examples

vadd.s32.u32.s32.sat      r1, r2.b0, r3.h0;
vsub.s32.s32.u32.sat      r1, r2.h1, r3.h1;
vabsdiff.s32.s32.s32.sat  r1.h0, r2.b0, r3.b2, c;
vmin.s32.s32.s32.sat.add  r1, r2, r3, c;

9.7.18.1.2. Scalar Video Instructions: vshl, vshr

vshl, vshr

Integer byte/half-word/word left/right shift.

Syntax

// 32-bit scalar operation, with optional secondary operation
vop.dtype.atype.u32{.sat}.mode       d, a{.asel}, b{.bsel};
vop.dtype.atype.u32{.sat}.mode.op2   d, a{.asel}, b{.bsel}, c;

// 32-bit scalar operation, with optional data merge
vop.dtype.atype.u32{.sat}.mode  d.dsel, a{.asel}, b{.bsel}, c;

 vop   = { vshl, vshr };
.dtype = .atype = { .u32, .s32 };
.mode  = { .clamp, .wrap };
.dsel  = .asel  = .bsel  = { .b0, .b1, .b2, .b3, .h0, .h1 };
.op2   = { .add, .min, .max };

Description

vshl

Shift a left by the unsigned amount in b with optional saturate, and optional secondary arithmetic operation or subword data merge. Left shift fills with zero.

vshr

Shift a right by the unsigned amount in b with optional saturate, and optional secondary arithmetic operation or subword data merge. Signed shift fills with the sign bit, unsigned shift fills with zero.

Semantics

// extract byte/half-word/word and sign- or zero-extend
// based on source operand type
ta = partSelectSignExtend( a, atype, asel );
tb = partSelectSignExtend( b, .u32, bsel );

if ( mode == .clamp && tb > 32 )  tb = 32;
if ( mode == .wrap )              tb = tb & 0x1f;

switch ( vop ) {
   case vshl:  tmp = ta << tb;
   case vshr:  tmp = ta >> tb;
}

// saturate, taking into account destination type and merge operations
tmp = optSaturate( tmp, sat, isSigned(dtype), dsel );
d = optSecondaryOp( op2, tmp, c );  // optional secondary operation
d = optMerge( dsel, tmp, c );       // optional merge with c operand
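The difference between the two shift-amount modes can be shown with a short sketch (hypothetical helper names; an unsigned 32-bit left shift is modeled by masking the result):

```python
def shift_amount(tb, mode):
    # .clamp limits the shift amount to 32; .wrap keeps only the low 5 bits.
    if mode == "clamp":
        return 32 if tb > 32 else tb
    return tb & 0x1f

def vshl_u32(a, tb, mode):
    return (a << shift_amount(tb, mode)) & 0xffffffff
```

With a shift amount of 33, .clamp shifts everything out (amount 32), while .wrap shifts by 1.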

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

vshl, vshr require sm_20 or higher.

Examples

vshl.s32.u32.u32.clamp  r1, r2, r3;
vshr.u32.u32.u32.wrap   r1, r2, r3.h1;

9.7.18.1.3. Scalar Video Instructions: vmad

vmad

Integer byte/half-word/word multiply-accumulate.

Syntax

// 32-bit scalar operation
vmad.dtype.atype.btype{.sat}{.scale}     d, {-}a{.asel}, {-}b{.bsel},
                                         {-}c;
vmad.dtype.atype.btype.po{.sat}{.scale}  d, a{.asel}, b{.bsel}, c;

.dtype = .atype = .btype = { .u32, .s32 };
.asel  = .bsel  = { .b0, .b1, .b2, .b3, .h0, .h1 };
.scale = { .shr7, .shr15 };

Description

Calculate (a*b)+c, with optional operand negates, plus one mode, and scaling.

The source operands support optional negation with some restrictions. Although PTX syntax allows separate negation of the a and b operands, internally this is represented as negation of the product (a*b). That is, (a*b) is negated if and only if exactly one of a or b is negated. PTX allows negation of either (a*b) or c.

The plus one mode (.po) computes (a*b)+c+1, which is used in computing averages. Source operands may not be negated in .po mode.

The intermediate result of (a*b) is unsigned if atype and btype are unsigned and the product (a*b) is not negated; otherwise, the intermediate result is signed. Input c has the same sign as the intermediate result.

The final result is unsigned if the intermediate result is unsigned and c is not negated.

Depending on the sign of the a and b operands, and the operand negates, the following combinations of operands are supported for VMAD:

 (u32 * u32) + u32  // intermediate unsigned; final unsigned
-(u32 * u32) + s32  // intermediate   signed; final   signed
 (u32 * u32) - u32  // intermediate unsigned; final   signed
 (u32 * s32) + s32  // intermediate   signed; final   signed
-(u32 * s32) + s32  // intermediate   signed; final   signed
 (u32 * s32) - s32  // intermediate   signed; final   signed
 (s32 * u32) + s32  // intermediate   signed; final   signed
-(s32 * u32) + s32  // intermediate   signed; final   signed
 (s32 * u32) - s32  // intermediate   signed; final   signed
 (s32 * s32) + s32  // intermediate   signed; final   signed
-(s32 * s32) + s32  // intermediate   signed; final   signed
 (s32 * s32) - s32  // intermediate   signed; final   signed

The intermediate result is optionally scaled via right-shift; this result is sign-extended if the final result is signed, and zero-extended otherwise.

The final result is optionally saturated to the appropriate 32-bit range based on the type (signedor unsigned) of the final result.

Semantics

// extract byte/half-word/word and sign- or zero-extend
// based on source operand type
ta = partSelectSignExtend( a, atype, asel );
tb = partSelectSignExtend( b, btype, bsel );

signedFinal = isSigned(atype) || isSigned(btype) ||
              (a.negate ^ b.negate) || c.negate;

tmp[127:0] = ta * tb;

lsb = 0;
if ( .po )                  {              lsb = 1; } else
if ( a.negate ^ b.negate )  { tmp = ~tmp;  lsb = 1; } else
if ( c.negate )             { c   = ~c;    lsb = 1; }

c128[127:0] = (signedFinal) ? sext32( c ) : zext ( c );
tmp = tmp + c128 + lsb;

switch( scale ) {
   case .shr7:   result = (tmp >>  7) & 0xffffffffffffffff;
   case .shr15:  result = (tmp >> 15) & 0xffffffffffffffff;
}

if ( .sat ) {
     if (signedFinal) result = CLAMP(result, S32_MAX, S32_MIN);
     else             result = CLAMP(result, U32_MAX, U32_MIN);
}
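The negation handling above relies on two's complement: -x equals ~x + 1, with the "+ 1" folded into the final add as lsb. A Python sketch of that core (our simplification; subword extraction, the 128-bit intermediate width, and saturation are omitted):

```python
def vmad(ta, tb, c, negate_product=False, negate_c=False, po=False, scale=0):
    assert not (po and (negate_product or negate_c)), ".po forbids negation"
    tmp = ta * tb
    lsb = 0
    if po:
        lsb = 1                 # plus-one mode: (a*b) + c + 1
    elif negate_product:
        tmp, lsb = ~tmp, 1      # ~tmp + 1 == -(a*b)
    elif negate_c:
        c, lsb = ~c, 1          # ~c + 1 == -c
    return (tmp + c + lsb) >> scale
```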

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

vmad requires sm_20 or higher.

Examples

vmad.s32.s32.u32.sat    r0, r1, r2, -r3;
vmad.u32.u32.u32.shr15  r0, r1.h0, r2.h0, r3;

9.7.18.1.4. Scalar Video Instructions: vset

vset

Integer byte/half-word/word comparison.

Syntax

// 32-bit scalar operation, with optional secondary operation
vset.atype.btype.cmp       d, a{.asel}, b{.bsel};
vset.atype.btype.cmp.op2   d, a{.asel}, b{.bsel}, c;

// 32-bit scalar operation, with optional data merge
vset.atype.btype.cmp  d.dsel, a{.asel}, b{.bsel}, c;

.atype = .btype = { .u32, .s32 };
.cmp   = { .eq, .ne, .lt, .le, .gt, .ge };
.dsel  = .asel  = .bsel  = { .b0, .b1, .b2, .b3, .h0, .h1 };
.op2   = { .add, .min, .max };

Description

Compare input values using the specified comparison, with optional secondary arithmetic operation or subword data merge.

The intermediate result of the comparison is always unsigned, and therefore destination d and operand c are also unsigned.

Semantics

// extract byte/half-word/word and sign- or zero-extend
// based on source operand type
ta = partSelectSignExtend( a, atype, asel );
tb = partSelectSignExtend( b, btype, bsel );

tmp = compare( ta, tb, cmp ) ? 1 : 0;

d = optSecondaryOp( op2, tmp, c );    // optional secondary operation
d = optMerge( dsel, tmp, c );         // optional merge with c operand

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

vset requires sm_20 or higher.

Examples

vset.s32.u32.lt    r1, r2, r3;
vset.u32.u32.ne    r1, r2, r3.h1;

9.7.18.2. SIMD Video Instructions

The SIMD video instructions operate on pairs of 16-bit values and quads of 8-bit values.

The SIMD video instructions are:

  • vadd2, vadd4

  • vsub2, vsub4

  • vavrg2, vavrg4

  • vabsdiff2, vabsdiff4

  • vmin2, vmin4

  • vmax2, vmax4

  • vset2, vset4

PTX includes SIMD video instructions for operation on pairs of 16-bit values and quads of 8-bit values. The SIMD video instructions execute the following stages:

  1. Form input vectors by extracting and sign- or zero-extending byte or half-word values from the source operands, to form pairs of signed 17-bit values.

  2. Perform a SIMD arithmetic operation on the input pairs.

  3. Optionally clamp the result to the appropriate signed or unsigned range, as determined by the destination type.

  4. Optionally perform one of the following:

    1. perform a second SIMD merge operation, or

    2. apply a scalar accumulate operation to reduce the intermediate SIMD results to a single scalar.

The general format of dual half-word SIMD video instructions is as follows:

// 2-way SIMD operation, with second SIMD merge or accumulate
vop2.dtype.atype.btype{.sat}{.add}  d{.mask}, a{.asel}, b{.bsel}, c;

.dtype = .atype = .btype = { .u32, .s32 };
.mask  = { .h0, .h1, .h10 };
.asel  = .bsel = { .hxy, where x,y are from { 0, 1, 2, 3 } };

The general format of quad byte SIMD video instructions is as follows:

// 4-way SIMD operation, with second SIMD merge or accumulate
vop4.dtype.atype.btype{.sat}{.add}  d{.mask}, a{.asel}, b{.bsel}, c;

.dtype = .atype = .btype = { .u32, .s32 };
.mask  = { .b0,
           .b1, .b10,
           .b2, .b20, .b21, .b210,
           .b3, .b30, .b31, .b310, .b32, .b320, .b321, .b3210 };
.asel = .bsel = .bxyzw, where x,y,z,w are from { 0, ..., 7 };

The source and destination operands are all 32-bit registers. The type of each operand (.u32 or .s32) is specified in the instruction type; all combinations of dtype, atype, and btype are valid. Using the atype/btype and asel/bsel specifiers, the input values are extracted and sign- or zero-extended internally to .s33 values. The primary operation is then performed to produce an .s34 intermediate result. The sign of the intermediate result depends on dtype.

The intermediate result is optionally clamped to the range of the destination type (signed or unsigned), taking into account the subword destination size in the case of optional data merging.

9.7.18.2.1. SIMD Video Instructions: vadd2, vsub2, vavrg2, vabsdiff2, vmin2, vmax2

vadd2, vsub2

Integer dual half-word SIMD addition/subtraction.

vavrg2

Integer dual half-word SIMD average.

vabsdiff2

Integer dual half-word SIMD absolute value of difference.

vmin2, vmax2

Integer dual half-word SIMD minimum/maximum.

Syntax

// SIMD instruction with secondary SIMD merge operation
vop2.dtype.atype.btype{.sat}  d{.mask}, a{.asel}, b{.bsel}, c;

// SIMD instruction with secondary accumulate operation
vop2.dtype.atype.btype.add  d{.mask}, a{.asel}, b{.bsel}, c;

 vop2  = { vadd2, vsub2, vavrg2, vabsdiff2, vmin2, vmax2 };
.dtype = .atype = .btype = { .u32, .s32 };
.mask  = { .h0, .h1, .h10 };  // defaults to .h10
.asel  = .bsel  = { .hxy, where x,y are from { 0, 1, 2, 3 } };
   .asel defaults to .h10
   .bsel defaults to .h32

Description

Two-way SIMD parallel arithmetic operation with secondary operation.

Elements of each dual half-word source to the operation are selected from any of the four half-words in the two source operands a and b using the asel and bsel modifiers.

The selected half-words are then operated on in parallel.

The results are optionally clamped to the appropriate range determined by the destination type (signed or unsigned). Saturation cannot be used with the secondary accumulate operation.

For instructions with a secondary SIMD merge operation:

  • For half-word positions indicated in mask, the selected half-word results are copied into destination d. For all other positions, the corresponding half-word from source operand c is copied to d.

For instructions with a secondary accumulate operation:

  • For half-word positions indicated in mask, the selected half-word results are added to operand c, producing a result in d.

Semantics

// extract pairs of half-words and sign- or zero-extend
// based on operand type
Va = extractAndSignExt_2( a, b, .asel, .atype );
Vb = extractAndSignExt_2( a, b, .bsel, .btype );
Vc = extractAndSignExt_2( c );

for (i=0; i<2; i++) {
    switch ( vop2 ) {
       case vadd2:             t[i] = Va[i] + Vb[i];
       case vsub2:             t[i] = Va[i] - Vb[i];
       case vavrg2:            if ( ( Va[i] + Vb[i] ) >= 0 ) {
                                   t[i] = ( Va[i] + Vb[i] + 1 ) >> 1;
                               } else {
                                   t[i] = ( Va[i] + Vb[i] ) >> 1;
                               }
       case vabsdiff2:         t[i] = | Va[i] - Vb[i] |;
       case vmin2:             t[i] = MIN( Va[i], Vb[i] );
       case vmax2:             t[i] = MAX( Va[i], Vb[i] );
    }
    if (.sat) {
        if ( .dtype == .s32 )  t[i] = CLAMP( t[i], S16_MAX, S16_MIN );
        else                   t[i] = CLAMP( t[i], U16_MAX, U16_MIN );
    }
}

// secondary accumulate or SIMD merge
mask = extractMaskBits( .mask );
if (.add) {
    d = c;
    for (i=0; i<2; i++) {  d += mask[i] ? t[i] : 0;  }
} else {
    d = 0;
    for (i=0; i<2; i++)  {  d |= mask[i] ? t[i] : Vc[i];  }
}
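The vavrg2 rounding rule above (add 1 before the shift for non-negative sums, plain arithmetic shift otherwise) can be checked with a tiny one-lane sketch (non-normative):

```python
def vavrg_lane(va, vb):
    s = va + vb
    # Non-negative sums round half up; negative sums use the arithmetic
    # shift directly, which rounds toward minus infinity.
    return (s + 1) >> 1 if s >= 0 else s >> 1
```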

PTX ISA Notes

Introduced in PTX ISA version 3.0.

Target ISA Notes

vadd2, vsub2, vavrg2, vabsdiff2, vmin2, vmax2 require sm_30 or higher.

Examples

vadd2.s32.s32.u32.sat  r1, r2, r3, r1;
vsub2.s32.s32.s32.sat  r1.h0, r2.h10, r3.h32, r1;
vmin2.s32.u32.u32.add  r1.h10, r2.h00, r3.h22, r1;

9.7.18.2.2. SIMD Video Instructions: vset2

vset2

Integer dual half-word SIMD comparison.

Syntax

// SIMD instruction with secondary SIMD merge operation
vset2.atype.btype.cmp  d{.mask}, a{.asel}, b{.bsel}, c;

// SIMD instruction with secondary accumulate operation
vset2.atype.btype.cmp.add  d{.mask}, a{.asel}, b{.bsel}, c;

.atype = .btype = { .u32, .s32 };
.cmp   = { .eq, .ne, .lt, .le, .gt, .ge };
.mask  = { .h0, .h1, .h10 };  // defaults to .h10
.asel  = .bsel  = { .hxy, where x,y are from { 0, 1, 2, 3 } };
   .asel defaults to .h10
   .bsel defaults to .h32

Description

Two-way SIMD parallel comparison with secondary operation.

Elements of each dual half-word source to the operation are selected from any of the four half-words in the two source operands a and b using the asel and bsel modifiers.

The selected half-words are then compared in parallel.

The intermediate result of the comparison is always unsigned, and therefore the half-words of destination d and operand c are also unsigned.

For instructions with a secondary SIMD merge operation:

  • For half-word positions indicated in mask, the selected half-word results are copied into destination d. For all other positions, the corresponding half-word from source operand c is copied to d.

For instructions with a secondary accumulate operation:

  • For half-word positions indicated in mask, the selected half-word results are added to operand c, producing a result in d.

Semantics

// extract pairs of half-words and sign- or zero-extend
// based on operand type
Va = extractAndSignExt_2( a, b, .asel, .atype );
Vb = extractAndSignExt_2( a, b, .bsel, .btype );
Vc = extractAndSignExt_2( c );

for (i=0; i<2; i++) {
    t[i] = compare( Va[i], Vb[i], .cmp ) ? 1 : 0;
}

// secondary accumulate or SIMD merge
mask = extractMaskBits( .mask );
if (.add) {
    d = c;
    for (i=0; i<2; i++) {  d += mask[i] ? t[i] : 0;  }
} else {
    d = 0;
    for (i=0; i<2; i++)  {  d |= mask[i] ? t[i] : Vc[i];  }
}

PTX ISA Notes

Introduced in PTX ISA version 3.0.

Target ISA Notes

vset2 requires sm_30 or higher.

Examples

vset2.s32.u32.lt      r1, r2, r3, r0;
vset2.u32.u32.ne.add  r1, r2, r3, r0;

9.7.18.2.3. SIMD Video Instructions: vadd4, vsub4, vavrg4, vabsdiff4, vmin4, vmax4

vadd4, vsub4

Integer quad byte SIMD addition/subtraction.

vavrg4

Integer quad byte SIMD average.

vabsdiff4

Integer quad byte SIMD absolute value of difference.

vmin4, vmax4

Integer quad byte SIMD minimum/maximum.

Syntax

// SIMD instruction with secondary SIMD merge operation
vop4.dtype.atype.btype{.sat}  d{.mask}, a{.asel}, b{.bsel}, c;

// SIMD instruction with secondary accumulate operation
vop4.dtype.atype.btype.add  d{.mask}, a{.asel}, b{.bsel}, c;

vop4  = { vadd4, vsub4, vavrg4, vabsdiff4, vmin4, vmax4 };
.dtype = .atype = .btype = { .u32, .s32 };
.mask  = { .b0,
           .b1, .b10,
           .b2, .b20, .b21, .b210,
           .b3, .b30, .b31, .b310, .b32, .b320, .b321, .b3210 };
    defaults to .b3210
.asel = .bsel = .bxyzw, where x,y,z,w are from { 0, ..., 7 };
   .asel defaults to .b3210
   .bsel defaults to .b7654

Description

Four-way SIMD parallel arithmetic operation with secondary operation.

Elements of each quad byte source to the operation are selected from any of the eight bytes in the two source operands a and b using the asel and bsel modifiers.

The selected bytes are then operated on in parallel.

The results are optionally clamped to the appropriate range determined by the destination type (signed or unsigned). Saturation cannot be used with the secondary accumulate operation.

For instructions with a secondary SIMD merge operation:

  • For byte positions indicated in mask, the selected byte results are copied into destination d. For all other positions, the corresponding byte from source operand c is copied to d.

For instructions with a secondary accumulate operation:

  • For byte positions indicated in mask, the selected byte results are added to operand c, producing a result in d.

Semantics

// extract quads of bytes and sign- or zero-extend
// based on operand type
Va = extractAndSignExt_4( a, b, .asel, .atype );
Vb = extractAndSignExt_4( a, b, .bsel, .btype );
Vc = extractAndSignExt_4( c );
for (i=0; i<4; i++) {
    switch ( vop4 ) {
        case vadd4:            t[i] = Va[i] + Vb[i];
        case vsub4:            t[i] = Va[i] - Vb[i];
        case vavrg4:           if ( ( Va[i] + Vb[i] ) >= 0 ) {
                                   t[i] = ( Va[i] + Vb[i] + 1 ) >> 1;
                               } else {
                                   t[i] = ( Va[i] + Vb[i] ) >> 1;
                               }
        case vabsdiff4:        t[i] = | Va[i] - Vb[i] |;
        case vmin4:            t[i] = MIN( Va[i], Vb[i] );
        case vmax4:            t[i] = MAX( Va[i], Vb[i] );
    }
    if (.sat) {
        if ( .dtype == .s32 )  t[i] = CLAMP( t[i], S8_MAX, S8_MIN );
        else                   t[i] = CLAMP( t[i], U8_MAX, U8_MIN );
    }
}
// secondary accumulate or SIMD merge
mask = extractMaskBits( .mask );
if (.add) {
    d = c;
    for (i=0; i<4; i++) {  d += mask[i] ? t[i] : 0;  }
} else {
    d = 0;
    for (i=0; i<4; i++) {  d |= mask[i] ? t[i] : Vc[i];  }
}
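The lane semantics above can be exercised on the host. The following Python sketch simulates the vadd4 case, including byte extraction, optional saturation, and the secondary merge/accumulate step; the helper names (vadd4, sext8, bytes_of) are illustrative and not part of PTX.

```python
def sext8(b, signed):
    """Sign- or zero-extend one byte to a Python int."""
    return b - 256 if signed and b >= 128 else b

def bytes_of(word):
    """Split a 32-bit word into bytes b0..b3 (least significant first)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def vadd4(a, b, c, signed=True, sat=False, add=False, mask=(1, 1, 1, 1)):
    va = [sext8(x, signed) for x in bytes_of(a)]
    vb = [sext8(x, signed) for x in bytes_of(b)]
    t = [x + y for x, y in zip(va, vb)]            # per-lane add
    if sat:
        t = [clamp(x, -128, 127) if signed else clamp(x, 0, 255) for x in t]
    if add:   # secondary accumulate: d = c plus the masked lane results
        return c + sum(ti for ti, m in zip(t, mask) if m)
    cb = bytes_of(c)
    d = 0     # secondary SIMD merge: masked lanes from t, the rest from c
    for i in range(4):
        d |= ((t[i] if mask[i] else cb[i]) & 0xFF) << (8 * i)
    return d
```

For example, vadd4(0x01020304, 0x01010101, 0, signed=False) adds 1 to each byte lane, and the add=True form sums the masked lane results into c, matching the accumulate path of the pseudocode.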

PTX ISA Notes

Introduced in PTX ISA version 3.0.

Target ISA Notes

vadd4, vsub4, vavrg4, vabsdiff4, vmin4, vmax4 require sm_30 or higher.

Examples

vadd4.s32.s32.u32.sat  r1, r2, r3, r1;
vsub4.s32.s32.s32.sat  r1.b0, r2.b3210, r3.b7654, r1;
vmin4.s32.u32.u32.add  r1.b00, r2.b0000, r3.b2222, r1;

9.7.18.2.4. SIMD Video Instructions: vset4

vset4

Integer quad byte SIMD comparison.

Syntax

// SIMD instruction with secondary SIMD merge operation
vset4.atype.btype.cmp  d{.mask}, a{.asel}, b{.bsel}, c;
// SIMD instruction with secondary accumulate operation
vset4.atype.btype.cmp.add  d{.mask}, a{.asel}, b{.bsel}, c;

.atype = .btype = { .u32, .s32 };
.cmp   = { .eq, .ne, .lt, .le, .gt, .ge };
.mask  = { .b0,
           .b1, .b10,
           .b2, .b20, .b21, .b210,
           .b3, .b30, .b31, .b310, .b32, .b320, .b321, .b3210 };
    defaults to .b3210
.asel = .bsel = .bxyzw, where x,y,z,w are from { 0, ..., 7 };
   .asel defaults to .b3210
   .bsel defaults to .b7654

Description

Four-way SIMD parallel comparison with secondary operation.

Elements of each quad byte source to the operation are selected from any of the eight bytes in the two source operands a and b using the asel and bsel modifiers.

The selected bytes are then compared in parallel.

The intermediate result of the comparison is always unsigned, and therefore the bytes of destination d and operand c are also unsigned.

For instructions with a secondary SIMD merge operation:

  • For byte positions indicated in mask, the selected byte results are copied into destination d. For all other positions, the corresponding byte from source operand c is copied to d.

For instructions with a secondary accumulate operation:

  • For byte positions indicated in mask, the selected byte results are added to operand c, producing a result in d.

Semantics

// extract quads of bytes and sign- or zero-extend
// based on operand type
Va = extractAndSignExt_4( a, b, .asel, .atype );
Vb = extractAndSignExt_4( a, b, .bsel, .btype );
Vc = extractAndSignExt_4( c );
for (i=0; i<4; i++) {
    t[i] = compare( Va[i], Vb[i], .cmp ) ? 1 : 0;
}
// secondary accumulate or SIMD merge
mask = extractMaskBits( .mask );
if (.add) {
    d = c;
    for (i=0; i<4; i++) {  d += mask[i] ? t[i] : 0;  }
} else {
    d = 0;
    for (i=0; i<4; i++) {  d |= mask[i] ? t[i] : Vc[i];  }
}
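A useful consequence of the .add form is that vset4 acts as a per-byte match counter. This Python sketch mirrors the compare-then-merge/accumulate steps above (it compares raw unsigned bytes only, for brevity; the helper names are illustrative, not PTX):

```python
def bytes_of(word):
    """Split a 32-bit word into bytes b0..b3 (least significant first)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

# the six .cmp comparison operators
CMP = {
    "eq": lambda x, y: x == y, "ne": lambda x, y: x != y,
    "lt": lambda x, y: x <  y, "le": lambda x, y: x <= y,
    "gt": lambda x, y: x >  y, "ge": lambda x, y: x >= y,
}

def vset4(a, b, c, cmp, add=False, mask=(1, 1, 1, 1)):
    # t[i] = compare(Va[i], Vb[i], .cmp) ? 1 : 0
    t = [1 if CMP[cmp](x, y) else 0 for x, y in zip(bytes_of(a), bytes_of(b))]
    if add:   # accumulate: number of matching masked lanes added to c
        return c + sum(ti for ti, m in zip(t, mask) if m)
    cb = bytes_of(c)
    d = 0     # merge: 0/1 per masked lane, other lanes taken from c
    for i in range(4):
        d |= (t[i] if mask[i] else cb[i]) << (8 * i)
    return d
```

For instance, vset4(a, b, 0, "eq", add=True) counts how many bytes of a equal the corresponding bytes of b.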

PTX ISA Notes

Introduced in PTX ISA version 3.0.

Target ISA Notes

vset4 requires sm_30 or higher.

Examples

vset4.s32.u32.lt      r1, r2, r3, r0;
vset4.u32.u32.ne.add  r1, r2, r3, r0;

9.7.19. Miscellaneous Instructions

The Miscellaneous instructions are:

  • brkpt

  • nanosleep

  • pmevent

  • trap

  • setmaxnreg

9.7.19.1. Miscellaneous Instructions: brkpt

brkpt

Breakpoint.

Syntax

brkpt;

Description

Suspends execution.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

brkpt requires sm_11 or higher.

Examples

    brkpt;
@p  brkpt;

9.7.19.2. Miscellaneous Instructions: nanosleep

nanosleep

Suspend the thread for an approximate delay given in nanoseconds.

Syntax

nanosleep.u32 t;

Description

Suspends the thread for a sleep duration approximately close to the delay t, specified in nanoseconds. t may be a register or an immediate value.

The sleep duration is approximated, but guaranteed to be in the interval [0, 2*t]. The maximum sleep duration is 1 millisecond. The implementation may reduce the sleep duration for individual threads within a warp such that all sleeping threads in the warp wake up together.

PTX ISA Notes

nanosleep introduced in PTX ISA 6.3.

Target ISA Notes

nanosleep requires sm_70 or higher.

Examples

.reg .b32 r;
.reg .pred p;
nanosleep.u32 r;
nanosleep.u32 42;
@p nanosleep.u32 r;

9.7.19.3. Miscellaneous Instructions: pmevent

pmevent

Trigger one or more Performance Monitor events.

Syntax

pmevent       a;    // trigger a single performance monitor event
pmevent.mask  a;    // trigger one or more performance monitor events

Description

Triggers one or more of a fixed number of performance monitor events, with event index or mask specified by immediate operand a.

pmevent (without modifier .mask) triggers a single performance monitor event indexed by immediate operand a, in the range 0..15.

pmevent.mask triggers one or more of the performance monitor events. Each bit in the 16-bit immediate operand a controls an event.

Programmatic performance monitor events may be combined with other hardware events using Boolean functions to increment one of the four performance counters. The relationship between events and counters is programmed via API calls from the host.

Notes

Currently, there are sixteen performance monitor events, numbered 0 through 15.

PTX ISA Notes

pmevent introduced in PTX ISA version 1.4.

pmevent.mask introduced in PTX ISA version 3.0.

Target ISA Notes

pmevent supported on all target architectures.

pmevent.mask requires sm_20 or higher.

Examples

    pmevent      1;
@p  pmevent      7;
@q  pmevent.mask 0xff;

9.7.19.4. Miscellaneous Instructions: trap

trap

Perform trap operation.

Syntax

trap;

Description

Abort execution and generate an interrupt to the host CPU.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

    trap;
@p  trap;

9.7.19.5. Miscellaneous Instructions: setmaxnreg

setmaxnreg

Hint to change the number of registers owned by the warp.

Syntax

setmaxnreg.action.sync.aligned.u32 imm-reg-count;

.action = { .inc, .dec };

Description

setmaxnreg provides a hint to the system to update the maximum number of per-thread registers owned by the executing warp to the value specified by the imm-reg-count operand.

Qualifier .dec is used to release extra registers such that the absolute per-thread maximum register count is reduced from its current value to imm-reg-count. Qualifier .inc is used to request additional registers such that the absolute per-thread maximum register count is increased from its current value to imm-reg-count.

A pool of available registers is maintained per-CTA. Register adjustments requested by the setmaxnreg instructions are handled by supplying extra registers from this pool to the requesting warp or by releasing extra registers from the requesting warp to this pool, depending upon the value of the .action qualifier.

The setmaxnreg.inc instruction blocks execution until enough registers are available in the CTA's register pool. After the setmaxnreg.inc instruction obtains new registers from the CTA pool, the initial contents of the new registers are undefined. The new registers must be initialized before they are used.

The same setmaxnreg instruction must be executed by all warps in a warpgroup. After executing a setmaxnreg instruction, all warps in the warpgroup must synchronize explicitly before executing subsequent setmaxnreg instructions. If a setmaxnreg instruction is not executed by all warps in the warpgroup, then the behavior is undefined.

Operand imm-reg-count is an integer constant. The value of imm-reg-count must be in the range 24 to 256 (both inclusive) and must be a multiple of 8.

Changes to the register file of the warp always happen at the tail-end of the register file.

The setmaxnreg instruction requires that the kernel has been launched with a valid value of the maximum number of per-thread registers, specified via the appropriate compile-time option or the appropriate performance tuning directive. Otherwise, the setmaxnreg instruction may have no effect.

When qualifier .dec is specified, the maximum number of per-thread registers owned by the warp prior to the execution of the setmaxnreg instruction should be greater than or equal to imm-reg-count. Otherwise, the behavior is undefined.

When qualifier .inc is specified, the maximum number of per-thread registers owned by the warp prior to the execution of the setmaxnreg instruction should be less than or equal to imm-reg-count. Otherwise, the behavior is undefined.

The mandatory .sync qualifier indicates that the setmaxnreg instruction causes the executing thread to wait until all threads in the warp execute the same setmaxnreg instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same setmaxnreg instruction. In conditionally executed code, the setmaxnreg instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Supported on the following architectures:

  • sm_90a

  • sm_100a

  • sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

  • sm_120a

  • And is supported on the following family-specific architectures from PTX ISA version 8.8:

    • sm_100f or higher in the same family

    • sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

    • sm_120f or higher in the same family

  • sm_110f or higher in the same family

Examples

setmaxnreg.dec.sync.aligned.u32 64;
setmaxnreg.inc.sync.aligned.u32 192;
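The per-CTA pool behavior described above amounts to simple bookkeeping: .dec returns registers to the pool and .inc draws from it, blocking until enough are free. The Python model below is only a sketch of that accounting (class and field names are illustrative; real hardware blocks the warp rather than raising an error):

```python
class CtaRegisterPool:
    """Toy model of the per-CTA register pool used by setmaxnreg."""

    def __init__(self, free_regs):
        self.free = free_regs          # registers currently available in the pool

    def setmaxnreg(self, warp, action, imm_reg_count):
        # imm-reg-count must be in [24, 256] and a multiple of 8
        assert 24 <= imm_reg_count <= 256 and imm_reg_count % 8 == 0
        if action == "dec":
            # releasing below the current count is undefined otherwise
            assert warp.max_regs >= imm_reg_count
            self.free += warp.max_regs - imm_reg_count
        else:  # "inc"
            assert warp.max_regs <= imm_reg_count
            needed = imm_reg_count - warp.max_regs
            if needed > self.free:
                # real hardware blocks here until registers are released
                raise RuntimeError("would block until registers are available")
            self.free -= needed
        warp.max_regs = imm_reg_count

class Warp:
    def __init__(self, max_regs):
        self.max_regs = max_regs       # current per-thread register maximum
```

This mirrors the common pattern in the examples: producer warps .dec to 64 so that consumer warps can .inc to 192 from the same pool.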

10. Special Registers

PTX includes a number of predefined, read-only variables, which are visible as special registers and accessed through mov or cvt instructions.

The special registers are:

  • %tid

  • %ntid

  • %laneid

  • %warpid

  • %nwarpid

  • %ctaid

  • %nctaid

  • %smid

  • %nsmid

  • %gridid

  • %is_explicit_cluster

  • %clusterid

  • %nclusterid

  • %cluster_ctaid

  • %cluster_nctaid

  • %cluster_ctarank

  • %cluster_nctarank

  • %lanemask_eq, %lanemask_le, %lanemask_lt, %lanemask_ge, %lanemask_gt

  • %clock, %clock_hi, %clock64

  • %pm0, ..., %pm7

  • %pm0_64, ..., %pm7_64

  • %envreg0, ..., %envreg31

  • %globaltimer, %globaltimer_lo, %globaltimer_hi

  • %reserved_smem_offset_begin, %reserved_smem_offset_end, %reserved_smem_offset_cap, %reserved_smem_offset<2>

  • %total_smem_size

  • %aggr_smem_size

  • %dynamic_smem_size

  • %current_graph_exec

10.1. Special Registers: %tid

%tid

Thread identifier within a CTA.

Syntax (predefined)

.sreg .v4 .u32 %tid;                  // thread id vector
.sreg .u32 %tid.x, %tid.y, %tid.z;    // thread id components

Description

A predefined, read-only, per-thread special register initialized with the thread identifier within the CTA. The %tid special register contains a 1D, 2D, or 3D vector to match the CTA shape; the %tid value in unused dimensions is 0. The fourth element is unused and always returns zero. The number of threads in each dimension is specified by the predefined special register %ntid.

Every thread in the CTA has a unique %tid.

%tid component values range from 0 through %ntid-1 in each CTA dimension.

%tid.y == %tid.z == 0 in 1D CTAs. %tid.z == 0 in 2D CTAs.

It is guaranteed that:

0  <=  %tid.x <  %ntid.x
0  <=  %tid.y <  %ntid.y
0  <=  %tid.z <  %ntid.z

PTX ISA Notes

Introduced in PTX ISA version 1.0 with type .v4.u16.

Redefined as type .v4.u32 in PTX ISA version 2.0. For compatibility with legacy PTX code, 16-bit mov and cvt instructions may be used to read the lower 16-bits of each component of %tid.

Target ISA Notes

Supported on all target architectures.

Examples

mov.u32      %r1,%tid.x;  // move tid.x to %r1
// legacy code accessing 16-bit components of %tid
mov.u16      %rh,%tid.x;
cvt.u32.u16  %r2,%tid.z;  // zero-extend tid.z to %r2

10.2. Special Registers: %ntid

%ntid

Number of thread IDs per CTA.

Syntax (predefined)

.sreg .v4 .u32 %ntid;                   // CTA shape vector
.sreg .u32 %ntid.x, %ntid.y, %ntid.z;   // CTA dimensions

Description

A predefined, read-only special register initialized with the number of thread ids in each CTA dimension. The %ntid special register contains a 3D CTA shape vector that holds the CTA dimensions. CTA dimensions are non-zero; the fourth element is unused and always returns zero. The total number of threads in a CTA is (%ntid.x * %ntid.y * %ntid.z).

%ntid.y == %ntid.z == 1 in 1D CTAs. %ntid.z == 1 in 2D CTAs.

Maximum values of %ntid.{x,y,z} are as follows:

.target architecture                                              %ntid.x   %ntid.y   %ntid.z
sm_1x                                                             512       512       64
sm_20, sm_3x, sm_5x, sm_6x, sm_7x, sm_8x, sm_9x, sm_10x, sm_12x   1024      1024      64

PTX ISA Notes

Introduced in PTX ISA version 1.0 with type .v4.u16.

Redefined as type .v4.u32 in PTX ISA version 2.0. For compatibility with legacy PTX code, 16-bit mov and cvt instructions may be used to read the lower 16-bits of each component of %ntid.

Target ISA Notes

Supported on all target architectures.

Examples

// compute unified thread id for 2D CTA
mov.u32     %r0,%tid.x;
mov.u32     %h1,%tid.y;
mov.u32     %h2,%ntid.x;
mad.lo.u32  %r0,%h1,%h2,%r0;
mov.u16     %rh,%ntid.x;      // legacy code
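The 2D computation above (tid.y * ntid.x + tid.x) generalizes to 3D. A Python sketch of the conventional flattened index; the formula is the usual CUDA convention for a unified thread id, not something PTX itself mandates:

```python
def flat_tid(tid, ntid):
    """Flatten a 3D thread id (x, y, z) within a CTA of shape ntid.

    Mirrors tid.x + ntid.x * (tid.y + ntid.y * tid.z), the 3D extension
    of the 2D mad example above.
    """
    x, y, z = tid
    nx, ny, _ = ntid
    return x + nx * (y + ny * z)
```

Every thread in an (nx, ny, nz) CTA receives a unique index in the range [0, nx*ny*nz).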

10.3. Special Registers: %laneid

%laneid

Lane Identifier.

Syntax (predefined)

.sreg .u32 %laneid;

Description

A predefined, read-only special register that returns the thread’s lane within the warp. The lane identifier ranges from zero to WARP_SZ-1.

PTX ISA Notes

Introduced in PTX ISA version 1.3.

Target ISA Notes

Supported on all target architectures.

Examples

mov.u32  %r, %laneid;

10.4. Special Registers: %warpid

%warpid

Warp identifier.

Syntax (predefined)

.sreg .u32 %warpid;

Description

A predefined, read-only special register that returns the thread’s warp identifier. The warp identifier provides a unique warp number within a CTA but not across CTAs within a grid. The warp identifier will be the same for all threads within a single warp.

Note that %warpid returns the location of a thread at the moment when read, but its value may change during execution, e.g., due to rescheduling of threads following preemption. For this reason, %ctaid and %tid should be used to compute a virtual warp index if such a value is needed in kernel code; %warpid is intended mainly to enable profiling and diagnostic code to sample and log information such as work place mapping and load distribution.

PTX ISA Notes

Introduced in PTX ISA version 1.3.

Target ISA Notes

Supported on all target architectures.

Examples

mov.u32  %r, %warpid;

10.5. Special Registers: %nwarpid

%nwarpid

Number of warp identifiers.

Syntax (predefined)

.sreg .u32 %nwarpid;

Description

A predefined, read-only special register that returns the maximum number of warp identifiers.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

%nwarpid requires sm_20 or higher.

Examples

mov.u32  %r, %nwarpid;

10.6. Special Registers: %ctaid

%ctaid

CTA identifier within a grid.

Syntax (predefined)

.sreg .v4 .u32 %ctaid;                      // CTA id vector
.sreg .u32 %ctaid.x, %ctaid.y, %ctaid.z;    // CTA id components

Description

A predefined, read-only special register initialized with the CTA identifier within the CTA grid. The %ctaid special register contains a 1D, 2D, or 3D vector, depending on the shape and rank of the CTA grid. The fourth element is unused and always returns zero.

It is guaranteed that:

0  <=  %ctaid.x <  %nctaid.x
0  <=  %ctaid.y <  %nctaid.y
0  <=  %ctaid.z <  %nctaid.z

PTX ISA Notes

Introduced in PTX ISA version 1.0 with type .v4.u16.

Redefined as type .v4.u32 in PTX ISA version 2.0. For compatibility with legacy PTX code, 16-bit mov and cvt instructions may be used to read the lower 16-bits of each component of %ctaid.

Target ISA Notes

Supported on all target architectures.

Examples

mov.u32  %r0,%ctaid.x;
mov.u16  %rh,%ctaid.y;   // legacy code

10.7. Special Registers: %nctaid

%nctaid

Number of CTA ids per grid.

Syntax (predefined)

.sreg .v4 .u32 %nctaid;                     // Grid shape vector
.sreg .u32 %nctaid.x,%nctaid.y,%nctaid.z;   // Grid dimensions

Description

A predefined, read-only special register initialized with the number of CTAs in each grid dimension. The %nctaid special register contains a 3D grid shape vector, with each element having a value of at least 1. The fourth element is unused and always returns zero.

Maximum values of %nctaid.{x,y,z} are as follows:

.target architecture                                       %nctaid.x   %nctaid.y   %nctaid.z
sm_1x, sm_20                                               65535       65535       65535
sm_3x, sm_5x, sm_6x, sm_7x, sm_8x, sm_9x, sm_10x, sm_12x   2^31 - 1    65535       65535

PTX ISA Notes

Introduced in PTX ISA version 1.0 with type .v4.u16.

Redefined as type .v4.u32 in PTX ISA version 2.0. For compatibility with legacy PTX code, 16-bit mov and cvt instructions may be used to read the lower 16-bits of each component of %nctaid.

Target ISA Notes

Supported on all target architectures.

Examples

mov.u32  %r0,%nctaid.x;
mov.u16  %rh,%nctaid.x;     // legacy code

10.8. Special Registers: %smid

%smid

SM identifier.

Syntax (predefined)

.sreg .u32 %smid;

Description

A predefined, read-only special register that returns the processor (SM) identifier on which a particular thread is executing. The SM identifier ranges from 0 to %nsmid-1. The SM identifier numbering is not guaranteed to be contiguous.

Notes

Note that %smid returns the location of a thread at the moment when read, but its value may change during execution, e.g. due to rescheduling of threads following preemption. %smid is intended mainly to enable profiling and diagnostic code to sample and log information such as work place mapping and load distribution.

PTX ISA Notes

Introduced in PTX ISA version 1.3.

Target ISA Notes

Supported on all target architectures.

Examples

mov.u32  %r, %smid;

10.9. Special Registers: %nsmid

%nsmid

Number of SM identifiers.

Syntax (predefined)

.sreg .u32 %nsmid;

Description

A predefined, read-only special register that returns the maximum number of SM identifiers. The SM identifier numbering is not guaranteed to be contiguous, so %nsmid may be larger than the physical number of SMs in the device.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

%nsmid requires sm_20 or higher.

Examples

mov.u32  %r, %nsmid;

10.10. Special Registers: %gridid

%gridid

Grid identifier.

Syntax (predefined)

.sreg .u64 %gridid;

Description

A predefined, read-only special register initialized with the per-grid temporal grid identifier. The %gridid is used by debuggers to distinguish CTAs and clusters within concurrent (small) grids.

During execution, repeated launches of programs may occur, where each launch starts a grid-of-CTAs. This variable provides the temporal grid launch number for this context.

For sm_1x targets, %gridid is limited to the range [0..2^16-1]. For sm_20, %gridid is limited to the range [0..2^32-1]. sm_30 supports the entire 64-bit range.

PTX ISA Notes

Introduced in PTX ISA version 1.0 as type .u16.

Redefined as type .u32 in PTX ISA version 1.3.

Redefined as type .u64 in PTX ISA version 3.0.

For compatibility with legacy PTX code, 16-bit and 32-bit mov and cvt instructions may be used to read the lower 16-bits or 32-bits of %gridid.

Target ISA Notes

Supported on all target architectures.

Examples

mov.u64  %s, %gridid;  // 64-bit read of %gridid
mov.u32  %r, %gridid;  // legacy code with 32-bit %gridid

10.11. Special Registers: %is_explicit_cluster

%is_explicit_cluster

Checks if user has explicitly specified cluster launch.

Syntax (predefined)

.sreg .pred %is_explicit_cluster;

Description

A predefined, read-only special register initialized with the predicate value of whether the cluster launch is explicitly specified by the user.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.reg .pred p;
mov.pred  p, %is_explicit_cluster;

10.12. Special Registers: %clusterid

%clusterid

Cluster identifier within a grid.

Syntax (predefined)

.sreg .v4 .u32 %clusterid;
.sreg .u32 %clusterid.x, %clusterid.y, %clusterid.z;

Description

A predefined, read-only special register initialized with the cluster identifier in a grid in each dimension. Each cluster in a grid has a unique identifier.

The %clusterid special register contains a 1D, 2D, or 3D vector, depending upon the shape and rank of the cluster. The fourth element is unused and always returns zero.

It is guaranteed that:

0  <=  %clusterid.x <  %nclusterid.x
0  <=  %clusterid.y <  %nclusterid.y
0  <=  %clusterid.z <  %nclusterid.z

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.reg .b32 %r<2>;
.reg .v4 .b32 %rx;
mov.u32     %r0, %clusterid.x;
mov.u32     %r1, %clusterid.z;
mov.v4.u32  %rx, %clusterid;

10.13. Special Registers: %nclusterid

%nclusterid

Number of cluster identifiers per grid.

Syntax (predefined)

.sreg .v4 .u32 %nclusterid;
.sreg .u32 %nclusterid.x, %nclusterid.y, %nclusterid.z;

Description

A predefined, read-only special register initialized with the number of clusters in each grid dimension.

The %nclusterid special register contains a 3D grid shape vector that holds the grid dimensions in terms of clusters. The fourth element is unused and always returns zero.

Refer to the CUDA Programming Guide for details on the maximum values of %nclusterid.{x,y,z}.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.reg .b32 %r<2>;
.reg .v4 .b32 %rx;
mov.u32     %r0, %nclusterid.x;
mov.u32     %r1, %nclusterid.z;
mov.v4.u32  %rx, %nclusterid;

10.14. Special Registers: %cluster_ctaid

%cluster_ctaid

CTA identifier within a cluster.

Syntax (predefined)

.sreg .v4 .u32 %cluster_ctaid;
.sreg .u32 %cluster_ctaid.x, %cluster_ctaid.y, %cluster_ctaid.z;

Description

A predefined, read-only special register initialized with the CTA identifier in a cluster in each dimension. Each CTA in a cluster has a unique CTA identifier.

The %cluster_ctaid special register contains a 1D, 2D, or 3D vector, depending upon the shape of the cluster. The fourth element is unused and always returns zero.

It is guaranteed that:

0  <=  %cluster_ctaid.x <  %cluster_nctaid.x
0  <=  %cluster_ctaid.y <  %cluster_nctaid.y
0  <=  %cluster_ctaid.z <  %cluster_nctaid.z

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.reg .b32 %r<2>;
.reg .v4 .b32 %rx;
mov.u32     %r0, %cluster_ctaid.x;
mov.u32     %r1, %cluster_ctaid.z;
mov.v4.u32  %rx, %cluster_ctaid;

10.15. Special Registers: %cluster_nctaid

%cluster_nctaid

Number of CTA identifiers per cluster.

Syntax (predefined)

.sreg .v4 .u32 %cluster_nctaid;
.sreg .u32 %cluster_nctaid.x, %cluster_nctaid.y, %cluster_nctaid.z;

Description

A predefined, read-only special register initialized with the number of CTAs in a cluster in each dimension.

The %cluster_nctaid special register contains a 3D grid shape vector that holds the cluster dimensions in terms of CTAs. The fourth element is unused and always returns zero.

Refer to the CUDA Programming Guide for details on the maximum values of %cluster_nctaid.{x,y,z}.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.reg .b32 %r<2>;
.reg .v4 .b32 %rx;
mov.u32     %r0, %cluster_nctaid.x;
mov.u32     %r1, %cluster_nctaid.z;
mov.v4.u32  %rx, %cluster_nctaid;

10.16. Special Registers: %cluster_ctarank

%cluster_ctarank

CTA identifier in a cluster across all dimensions.

Syntax (predefined)

.sreg .u32 %cluster_ctarank;

Description

A predefined, read-only special register initialized with the CTA rank within a cluster across all dimensions.

It is guaranteed that:

0  <=  %cluster_ctarank <  %cluster_nctarank

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.reg .b32 %r;
mov.u32  %r, %cluster_ctarank;

10.17. Special Registers: %cluster_nctarank

%cluster_nctarank

Number of CTA identifiers in a cluster across all dimensions.

Syntax (predefined)

.sreg .u32 %cluster_nctarank;

Description

A predefined, read-only special register initialized with the number of CTAs within a cluster across all dimensions.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.reg .b32 %r;
mov.u32  %r, %cluster_nctarank;

10.18. Special Registers: %lanemask_eq

%lanemask_eq

32-bit mask with bit set in position equal to the thread’s lane number in the warp.

Syntax (predefined)

.sreg .u32 %lanemask_eq;

Description

A predefined, read-only special register initialized with a 32-bit mask with a bit set in the position equal to the thread’s lane number in the warp.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

%lanemask_eq requires sm_20 or higher.

Examples

mov.u32     %r, %lanemask_eq;

10.19. Special Registers: %lanemask_le

%lanemask_le

32-bit mask with bits set in positions less than or equal to the thread’s lane number in the warp.

Syntax (predefined)

.sreg .u32 %lanemask_le;

Description

A predefined, read-only special register initialized with a 32-bit mask with bits set in positions less than or equal to the thread’s lane number in the warp.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

%lanemask_le requires sm_20 or higher.

Examples

mov.u32     %r, %lanemask_le;

10.20. Special Registers: %lanemask_lt

%lanemask_lt

32-bit mask with bits set in positions less than the thread’s lane number in the warp.

Syntax (predefined)

.sreg .u32 %lanemask_lt;

Description

A predefined, read-only special register initialized with a 32-bit mask with bits set in positions less than the thread’s lane number in the warp.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

%lanemask_lt requires sm_20 or higher.

Examples

mov.u32     %r, %lanemask_lt;

10.21. Special Registers: %lanemask_ge

%lanemask_ge

32-bit mask with bits set in positions greater than or equal to the thread’s lane number in the warp.

Syntax (predefined)

.sreg .u32 %lanemask_ge;

Description

A predefined, read-only special register initialized with a 32-bit mask with bits set in positions greater than or equal to the thread’s lane number in the warp.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

%lanemask_ge requires sm_20 or higher.

Examples

mov.u32     %r, %lanemask_ge;

10.22. Special Registers: %lanemask_gt

%lanemask_gt

32-bit mask with bits set in positions greater than the thread’s lane number in the warp.

Syntax (predefined)

.sreg .u32 %lanemask_gt;

Description

A predefined, read-only special register initialized with a 32-bit mask with bits set in positions greater than the thread’s lane number in the warp.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

%lanemask_gt requires sm_20 or higher.

Examples

mov.u32     %r, %lanemask_gt;
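All five %lanemask_* registers are pure functions of %laneid. A Python sketch of the relationship, assuming the usual WARP_SZ of 32 (the function name is illustrative):

```python
WARP_SZ = 32
FULL = (1 << WARP_SZ) - 1   # all 32 lane bits set

def lanemasks(laneid):
    """Derive the five lane masks from a lane number in [0, WARP_SZ)."""
    eq = 1 << laneid        # only this lane's bit
    lt = eq - 1             # bits strictly below this lane
    le = lt | eq
    ge = FULL & ~lt         # this lane's bit and everything above
    gt = FULL & ~le
    return {"eq": eq, "lt": lt, "le": le, "ge": ge, "gt": gt}
```

A common use of %lanemask_lt in warp-level code is computing a thread's rank among currently active lanes, e.g. popc(activemask & lanemask_lt).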

10.23. Special Registers: %clock, %clock_hi

%clock, %clock_hi

%clock

A predefined, read-only 32-bit unsigned cycle counter.

%clock_hi

The upper 32-bits of the %clock64 special register.

Syntax (predefined)

.sreg .u32 %clock;
.sreg .u32 %clock_hi;

Description

Special registers %clock and %clock_hi are unsigned 32-bit read-only cycle counters that wrap silently.

PTX ISA Notes

%clock introduced in PTX ISA version 1.0.

%clock_hi introduced in PTX ISA version 5.0.

Target ISA Notes

%clock supported on all target architectures.

%clock_hi requires sm_20 or higher.

Examples

mov.u32 r1, %clock;
mov.u32 r2, %clock_hi;

10.24. Special Registers: %clock64

%clock64

A predefined, read-only 64-bit unsigned cycle counter.

Syntax (predefined)

.sreg .u64 %clock64;

Description

Special register %clock64 is an unsigned 64-bit read-only cycle counter that wraps silently.

Notes

The lower 32-bits of %clock64 are identical to %clock.

The upper 32-bits of %clock64 are identical to %clock_hi.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

%clock64 requires sm_20 or higher.

Examples

mov.u64  r1, %clock64;
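Per the notes above, %clock64 is simply %clock_hi and %clock glued together. A sketch of the recombination (note that two separate 32-bit reads are not atomic as a pair, so code reading them individually typically re-reads %clock_hi and retries if it changed between the two reads):

```python
def compose_clock64(clock_hi, clock_lo):
    """Combine a %clock_hi / %clock pair into the %clock64 value."""
    return ((clock_hi & 0xFFFFFFFF) << 32) | (clock_lo & 0xFFFFFFFF)
```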

10.25. Special Registers: %pm0..%pm7

%pm0..%pm7

Performance monitoring counters.

Syntax (predefined)

.sreg .u32 %pm<8>;

Description

Special registers %pm0..%pm7 are unsigned 32-bit read-only performance monitor counters. Their behavior is currently undefined.

PTX ISA Notes

%pm0..%pm3 introduced in PTX ISA version 1.3.

%pm4..%pm7 introduced in PTX ISA version 3.0.

Target ISA Notes

%pm0..%pm3 supported on all target architectures.

%pm4..%pm7 require sm_20 or higher.

Examples

mov.u32  r1,%pm0;
mov.u32  r1,%pm7;

10.26. Special Registers: %pm0_64..%pm7_64

%pm0_64..%pm7_64

64-bit performance monitoring counters.

Syntax (predefined)

.sreg .u64 %pm0_64;
.sreg .u64 %pm1_64;
.sreg .u64 %pm2_64;
.sreg .u64 %pm3_64;
.sreg .u64 %pm4_64;
.sreg .u64 %pm5_64;
.sreg .u64 %pm6_64;
.sreg .u64 %pm7_64;

Description

Special registers %pm0_64..%pm7_64 are unsigned 64-bit read-only performance monitor counters. Their behavior is currently undefined.

Notes

The lower 32 bits of %pm0_64..%pm7_64 are identical to %pm0..%pm7.

PTX ISA Notes

%pm0_64..%pm7_64 introduced in PTX ISA version 4.0.

Target ISA Notes

%pm0_64..%pm7_64 require sm_50 or higher.

Examples

mov.u64  r1,%pm0_64;
mov.u64  r1,%pm7_64;

10.27. Special Registers: %envreg<32>

%envreg<32>

Driver-defined read-only registers.

Syntax (predefined)

.sreg .b32 %envreg<32>;

Description

A set of 32 predefined read-only registers used to capture the execution environment of a PTX program outside of the PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain CTA-wide or grid-wide values.

The precise semantics of these registers are defined in the driver documentation.

PTX ISA Notes

Introduced in PTX ISA version 2.1.

Target ISA Notes

Supported on all target architectures.

Examples

mov.b32      %r1,%envreg0;  // move envreg0 to %r1

10.28. Special Registers: %globaltimer, %globaltimer_lo, %globaltimer_hi

%globaltimer, %globaltimer_lo, %globaltimer_hi

%globaltimer

A predefined, 64-bit global nanosecond timer.

%globaltimer_lo

The lower 32-bits of %globaltimer.

%globaltimer_hi

The upper 32-bits of %globaltimer.

Syntax (predefined)

.sreg .u64 %globaltimer;
.sreg .u32 %globaltimer_lo, %globaltimer_hi;

Description

Special registers intended for use by NVIDIA tools. The behavior is target-specific and may change or be removed in future GPUs. When JIT-compiled to other targets, the value of these registers is unspecified.

PTX ISA Notes

Introduced in PTX ISA version 3.1.

Target ISA Notes

Requires target sm_30 or higher.

Examples

mov.u64  r1, %globaltimer;

10.29. Special Registers: %reserved_smem_offset_begin, %reserved_smem_offset_end, %reserved_smem_offset_cap, %reserved_smem_offset_<2>

%reserved_smem_offset_begin,%reserved_smem_offset_end,%reserved_smem_offset_cap,%reserved_smem_offset_<2>

%reserved_smem_offset_begin

Start of the reserved shared memory region.

%reserved_smem_offset_end

End of the reserved shared memory region.

%reserved_smem_offset_cap

Total size of the reserved shared memory region.

%reserved_smem_offset_<2>

Offsets in the reserved shared memory region.

Syntax (predefined)

.sreg .b32 %reserved_smem_offset_begin;
.sreg .b32 %reserved_smem_offset_end;
.sreg .b32 %reserved_smem_offset_cap;
.sreg .b32 %reserved_smem_offset_<2>;

Description

These are predefined, read-only special registers containing information about the shared memory region which is reserved for NVIDIA system software use. This region of shared memory is not available to users, and accessing this region from user code results in undefined behavior. Refer to the CUDA Programming Guide for details.

PTX ISA Notes

Introduced in PTX ISA version 7.6.

Target ISA Notes

Requires sm_80 or higher.

Examples

.reg .b32 %reg_begin, %reg_end, %reg_cap, %reg_offset0, %reg_offset1;
mov.b32 %reg_begin,   %reserved_smem_offset_begin;
mov.b32 %reg_end,     %reserved_smem_offset_end;
mov.b32 %reg_cap,     %reserved_smem_offset_cap;
mov.b32 %reg_offset0, %reserved_smem_offset_0;
mov.b32 %reg_offset1, %reserved_smem_offset_1;

10.30. Special Registers: %total_smem_size

%total_smem_size

Total size of shared memory used by a CTA of a kernel.

Syntax (predefined)

.sreg .u32 %total_smem_size;

Description

A predefined, read-only special register initialized with the total size of shared memory allocated (statically and dynamically, excluding the shared memory reserved for NVIDIA system software use) for the CTA of a kernel at launch time.

The size is returned in multiples of the shared memory allocation unit size supported by the target architecture.

Allocation unit values are as follows:

Target architecture               Shared memory allocation unit size
sm_2x                             128 bytes
sm_3x, sm_5x, sm_6x, sm_7x        256 bytes
sm_8x, sm_9x, sm_10x, sm_12x      128 bytes

PTX ISA Notes

Introduced in PTX ISA version 4.1.

Target ISA Notes

Requires sm_20 or higher.

Examples

mov.u32  %r, %total_smem_size;
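As a worked illustration of the rounding rule (an inference from the allocation-unit table above, not normative text): on an sm_70 target, which uses 256-byte allocation units, a CTA that statically declares 1000 bytes of shared memory and requests no dynamic shared memory would read 1024 from this register:

```ptx
// Illustrative sketch: 1000 declared bytes round up to the next
// 256-byte allocation unit (1024) on an sm_7x target.
.shared .align 4 .b8 buf[1000];
.reg .u32 %r;
mov.u32  %r, %total_smem_size;   // expected to read 1024, not 1000
```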

10.31. Special Registers: %aggr_smem_size

%aggr_smem_size

Total aggregated size of shared memory used by a CTA of a kernel, including the reserved shared memory region.

Syntax (predefined)

.sreg .u32 %aggr_smem_size;

Description

A predefined, read-only special register initialized with the total aggregated size of shared memory, consisting of the size of user shared memory allocated (statically and dynamically) at launch time and the size of the shared memory region which is reserved for NVIDIA system software use.

PTX ISA Notes

Introduced in PTX ISA version 8.1.

Target ISA Notes

Requires sm_90 or higher.

Examples

mov.u32  %r, %aggr_smem_size;
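Comparing the descriptions of %aggr_smem_size and %total_smem_size suggests that the size of the reserved region can be estimated as their difference. This derivation is an assumption (and %total_smem_size is rounded to allocation units), so treat it only as a sketch:

```ptx
.reg .u32 %aggr, %user, %reserved;
mov.u32  %aggr, %aggr_smem_size;    // user + reserved shared memory
mov.u32  %user, %total_smem_size;   // user shared memory (rounded)
sub.u32  %reserved, %aggr, %user;   // approximate reserved-region size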

10.32. Special Registers: %dynamic_smem_size

%dynamic_smem_size

Size of shared memory allocated dynamically at kernel launch.

Syntax (predefined)

.sreg .u32 %dynamic_smem_size;

Description

Size of shared memory allocated dynamically at kernel launch.

A predefined, read-only special register initialized with size of shared memory allocated dynamically for the CTA of a kernel at launch time.

PTX ISA Notes

Introduced in PTX ISA version 4.1.

Target ISA Notes

Requires sm_20 or higher.

Examples

mov.u32  %r, %dynamic_smem_size;
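A sketch of using the register as a byte bound when walking the dynamically allocated shared memory region (the extern buffer declaration and the loop structure are illustrative, not from the ISA text):

```ptx
// Illustrative sketch: iterate over dynamic shared memory in 4-byte steps.
.extern .shared .align 4 .b8 dynbuf[];
.reg .u32 %size, %off;
.reg .pred %p;
mov.u32  %size, %dynamic_smem_size;  // bytes allocated at launch
mov.u32  %off, 0;
Loop:
setp.ge.u32  %p, %off, %size;        // done when %off >= %size
@%p bra Done;
// ... process dynbuf + %off ...
add.u32  %off, %off, 4;
bra Loop;
Done:
```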

10.33. Special Registers: %current_graph_exec

%current_graph_exec

An identifier for the currently executing CUDA device graph.

Syntax (predefined)

.sreg .u64 %current_graph_exec;

Description

A predefined, read-only special register initialized with the identifier referring to the CUDA device graph being currently executed. This register is 0 if the executing kernel is not part of a CUDA device graph.

Refer to the CUDA Programming Guide for more details on CUDA device graphs.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_50 or higher.

Examples

mov.u64  r1, %current_graph_exec;

11. Directives

11.1. PTX Module Directives

The following directives declare the PTX ISA version of the code in the module, the target architecture for which the code was generated, and the size of addresses within the PTX module.

  • .version

  • .target

  • .address_size

11.1.1. PTX Module Directives: .version

.version

PTX ISA version number.

Syntax

.version  major.minor    // major, minor are integers

Description

Specifies the PTX language version number.

The major number is incremented when there are incompatible changes to the PTX language, such as changes to the syntax or semantics. The version major number is used by the PTX compiler to ensure correct execution of legacy PTX code.

The minor number is incremented when new features are added to PTX.

Semantics

Indicates that this module must be compiled with tools that support an equal or greater version number.

Each PTX module must begin with a .version directive, and no other .version directive is allowed anywhere else within the module.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

.version 3.1
.version 3.0
.version 2.3

11.1.2. PTX Module Directives: .target

.target

Architecture and Platform target.

Syntax

.target stringlist         // comma separated list of target specifiers
string = { sm_120a, sm_120f, sm_120,          // sm_12x target architectures
           sm_121a, sm_121f, sm_121,          // sm_12x target architectures
           sm_110a, sm_110f, sm_110,          // sm_11x target architectures
           sm_100a, sm_100f, sm_100,          // sm_10x target architectures
           sm_101a, sm_101f, sm_101,          // sm_10x target architectures
           sm_103a, sm_103f, sm_103,          // sm_10x target architectures
           sm_90a, sm_90,                     // sm_9x target architectures
           sm_80, sm_86, sm_87, sm_88, sm_89, // sm_8x target architectures
           sm_70, sm_72, sm_75,               // sm_7x target architectures
           sm_60, sm_61, sm_62,               // sm_6x target architectures
           sm_50, sm_52, sm_53,               // sm_5x target architectures
           sm_30, sm_32, sm_35, sm_37,        // sm_3x target architectures
           sm_20,                             // sm_2x target architectures
           sm_10, sm_11, sm_12, sm_13,        // sm_1x target architectures
           texmode_unified, texmode_independent,   // texturing mode
           debug,                                  // platform option
           map_f64_to_f32 };                       // platform option

Description

Specifies the set of features in the target architecture for which the current PTX code was generated. In general, generations of SM architectures follow an onion layer model, where each generation adds new features and retains all features of previous generations. The onion layer model allows the PTX code generated for a given target to be run on later generation devices.

Target architectures with suffix “a”, such as sm_90a, include architecture-specific features that are supported on the specified architecture only, hence such targets do not follow the onion layer model. Therefore, PTX code generated for such targets cannot be run on later generation devices. Architecture-specific features can only be used with targets that support these features.

Target architectures with suffix “f”, such as sm_100f, include family-specific features that are supported only within the same architecture family. Therefore, PTX code generated for such targets can run only on later generation devices in the same family. Family-specific features can be used with f-targets as well as a-targets of later generation devices in the same family.

Table 56 defines the architecture families.

Table 56 Architecture Families

Family           Target SM architectures included
sm_10x family    sm_100f, sm_103f, future targets in sm_10x family
sm_11x family    sm_110f, sm_101f, future targets in sm_11x family
sm_12x family    sm_120f, sm_121f, future targets in sm_12x family

Semantics

Each PTX module must begin with a .version directive, immediately followed by a .target directive containing a target architecture and optional platform options. A .target directive specifies a single target architecture, but subsequent .target directives can be used to change the set of target features allowed during parsing. A program with multiple .target directives will compile and run only on devices that support all features of the highest-numbered architecture listed in the program.

PTX features are checked against the specified target architecture, and an error is generated if an unsupported feature is used. The following table summarizes the features in PTX that vary according to target architecture.

Target     Description
sm_120     Baseline feature set for sm_120 architecture.
sm_120f    Adds support for sm_120f family-specific features.
sm_120a    Adds support for sm_120a architecture-specific features.
sm_121     Baseline feature set for sm_121 architecture.
sm_121f    Adds support for sm_121f family-specific features.
sm_121a    Adds support for sm_121a architecture-specific features.

Target     Description
sm_110     Baseline feature set for sm_110 architecture.
sm_110f    Adds support for sm_110f family-specific features.
sm_110a    Adds support for sm_110a architecture-specific features.

Target     Description
sm_100     Baseline feature set for sm_100 architecture.
sm_100f    Adds support for sm_100f family-specific features.
sm_100a    Adds support for sm_100a architecture-specific features.
sm_101     Baseline feature set for sm_101 architecture. (Renamed to sm_110)
sm_101f    Adds support for sm_101f family-specific features. (Renamed to sm_110f)
sm_101a    Adds support for sm_101a architecture-specific features. (Renamed to sm_110a)
sm_103     Baseline feature set for sm_103 architecture.
sm_103f    Adds support for sm_103f family-specific features.
sm_103a    Adds support for sm_103a architecture-specific features.

Target     Description
sm_90      Baseline feature set for sm_90 architecture.
sm_90a     Adds support for sm_90a architecture-specific features.

Target     Description
sm_80      Baseline feature set for sm_80 architecture.
sm_86      Adds support for .xorsign modifier on min and max instructions.
sm_87      Baseline feature set for sm_87 architecture.
sm_88      Baseline feature set for sm_88 architecture.
sm_89      Baseline feature set for sm_89 architecture.

Target     Description
sm_70      Baseline feature set for sm_70 architecture.
sm_72      Adds support for integer multiplicand and accumulator matrices in wmma instructions.
           Adds support for cvt.pack instruction.
sm_75      Adds support for sub-byte integer and single-bit multiplicand matrices in wmma instructions.
           Adds support for ldmatrix instruction.
           Adds support for movmatrix instruction.
           Adds support for tanh instruction.

Target     Description
sm_60      Baseline feature set for sm_60 architecture.
sm_61      Adds support for dp2a and dp4a instructions.
sm_62      Baseline feature set for sm_61 architecture.

Target     Description
sm_50      Baseline feature set for sm_50 architecture.
sm_52      Baseline feature set for sm_50 architecture.
sm_53      Adds support for arithmetic, comparison and texture instructions for .f16 and .f16x2 types.

Target     Description
sm_30      Baseline feature set for sm_30 architecture.
sm_32      Adds 64-bit {atom,red}.{and,or,xor,min,max} instructions.
           Adds shf instruction.
           Adds ld.global.nc instruction.
sm_35      Adds support for CUDA Dynamic Parallelism.
sm_37      Baseline feature set for sm_35 architecture.

Target     Description
sm_20      Baseline feature set for sm_20 architecture.

Target     Description
sm_10      Baseline feature set for sm_10 architecture.
           Requires map_f64_to_f32 if any .f64 instructions used.
sm_11      Adds 64-bit {atom,red}.{and,or,xor,min,max} instructions.
           Requires map_f64_to_f32 if any .f64 instructions used.
sm_12      Adds {atom,red}.shared, 64-bit {atom,red}.global, vote instructions.
           Requires map_f64_to_f32 if any .f64 instructions used.
sm_13      Adds double-precision support, including expanded rounding modifiers.
           Disallows use of map_f64_to_f32.

The texturing mode is specified for an entire module and cannot be changed within the module.

The .target debug option declares that the PTX file contains DWARF debug information, and subsequent compilation of PTX will retain information needed for source-level debugging. If the debug option is declared, an error message is generated if no DWARF information is found in the file. The debug option requires PTX ISA version 3.0 or later.

map_f64_to_f32 indicates that all double-precision instructions map to single-precision regardless of the target architecture. This enables high-level language compilers to compile programs containing type double to target devices that do not support double-precision operations. Note that .f64 storage remains as 64-bits, with only half being used by instructions converted from .f64 to .f32.

Notes

Targets of the form compute_xx are also accepted as synonyms for sm_xx targets.

Targets sm_{101,101f,101a} are renamed to targets sm_{110,110f,110a} from PTX ISA version 9.0.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target strings sm_10 and sm_11 introduced in PTX ISA version 1.0.

Target strings sm_12 and sm_13 introduced in PTX ISA version 1.2.

Texturing mode introduced in PTX ISA version 1.5.

Target string sm_20 introduced in PTX ISA version 2.0.

Target string sm_30 introduced in PTX ISA version 3.0.

Platform option debug introduced in PTX ISA version 3.0.

Target string sm_35 introduced in PTX ISA version 3.1.

Target strings sm_32 and sm_50 introduced in PTX ISA version 4.0.

Target strings sm_37 and sm_52 introduced in PTX ISA version 4.1.

Target string sm_53 introduced in PTX ISA version 4.2.

Target strings sm_60, sm_61, sm_62 introduced in PTX ISA version 5.0.

Target string sm_70 introduced in PTX ISA version 6.0.

Target string sm_72 introduced in PTX ISA version 6.1.

Target string sm_75 introduced in PTX ISA version 6.3.

Target string sm_80 introduced in PTX ISA version 7.0.

Target string sm_86 introduced in PTX ISA version 7.1.

Target string sm_87 introduced in PTX ISA version 7.4.

Target string sm_88 introduced in PTX ISA version 9.0.

Target string sm_89 introduced in PTX ISA version 7.8.

Target string sm_90 introduced in PTX ISA version 7.8.

Target string sm_90a introduced in PTX ISA version 8.0.

Target string sm_100 introduced in PTX ISA version 8.6.

Target string sm_100f introduced in PTX ISA version 8.8.

Target string sm_100a introduced in PTX ISA version 8.6.

Target string sm_101 introduced in PTX ISA version 8.6. (Renamed to sm_110)

Target string sm_101f introduced in PTX ISA version 8.8. (Renamed to sm_110f)

Target string sm_101a introduced in PTX ISA version 8.6. (Renamed to sm_110a)

Target string sm_103 introduced in PTX ISA version 8.8.

Target string sm_103f introduced in PTX ISA version 8.8.

Target string sm_103a introduced in PTX ISA version 8.8.

Target string sm_110 introduced in PTX ISA version 9.0.

Target string sm_110f introduced in PTX ISA version 9.0.

Target string sm_110a introduced in PTX ISA version 9.0.

Target string sm_120 introduced in PTX ISA version 8.7.

Target string sm_120f introduced in PTX ISA version 8.8.

Target string sm_120a introduced in PTX ISA version 8.7.

Target string sm_121 introduced in PTX ISA version 8.8.

Target string sm_121f introduced in PTX ISA version 8.8.

Target string sm_121a introduced in PTX ISA version 8.8.

Target ISA Notes

The .target directive is supported on all target architectures.

Examples

.target sm_10       // baseline target architecture
.target sm_13       // supports double-precision
.target sm_20, texmode_independent
.target sm_90       // baseline target architecture
.target sm_90a      // PTX using architecture-specific features
.target sm_100f     // PTX using family-specific features

11.1.3. PTX Module Directives: .address_size

.address_size

Address size used throughout PTX module.

Syntax

.address_size  address-size
address-size = { 32, 64 };

Description

Specifies the address size assumed throughout the module by the PTX code and the binary DWARF information in PTX.

Redefinition of this directive within a module is not allowed. In the presence of separate compilation all modules must specify (or default to) the same address size.

The .address_size directive is optional, but if present within a module it must immediately follow the .target directive.

Semantics

If the .address_size directive is omitted, the address size defaults to 32.

PTX ISA Notes

Introduced in PTX ISA version 2.3.

Target ISA Notes

Supported on all target architectures.

Examples

// example directives
   .address_size 32       // addresses are 32 bit
   .address_size 64       // addresses are 64 bit

// example of directive placement within a module
   .version 2.3
   .target sm_20
   .address_size 64
   ...
.entry foo () {
...
}

11.2. Specifying Kernel Entry Points and Functions

The following directives specify kernel entry points and functions.

  • .entry

  • .func

11.2.1. Kernel and Function Directives: .entry

.entry

Kernel entry point and body, with optional parameters.

Syntax

.entry kernel-name ( param-list )  kernel-body
.entry kernel-name  kernel-body

Description

Defines a kernel entry point name, parameters, and body for the kernel function.

Parameters are passed via .param space memory and are listed within an optional parenthesized parameter list. Parameters may be referenced by name within the kernel body and loaded into registers using ld.param{::entry} instructions.

In addition to normal parameters, opaque .texref, .samplerref, and .surfref variables may be passed as parameters. These parameters can only be referenced by name within texture and surface load, store, and query instructions and cannot be accessed via ld.param instructions.

The shape and size of the CTA executing the kernel are available in special registers.

Semantics

Specify the entry point for a kernel program.

At kernel launch, the kernel dimensions and properties are established and made available via special registers, e.g., %ntid, %nctaid, etc.

PTX ISA Notes

For PTX ISA version 1.4 and later, parameter variables are declared in the kernel parameter list. For PTX ISA versions 1.0 through 1.3, parameter variables are declared in the kernel body.

The maximum memory size supported by PTX for normal (non-opaque type) parameters is 32764 bytes. Depending upon the PTX ISA version, the parameter size limit varies. The following table shows the allowed parameter size for a PTX ISA version:

PTX ISA Version                  Maximum parameter size (in bytes)
PTX ISA version 8.1 and above    32764
PTX ISA version 1.5 and above    4352
PTX ISA version 1.4 and above    256

The CUDA and OpenCL drivers support the following limits for parameter memory:

Driver     Parameter memory size
CUDA       256 bytes for sm_1x, 4096 bytes for sm_2x and higher, 32764 bytes for sm_70 and higher
OpenCL     32764 bytes for sm_70 and higher, 4352 bytes on sm_6x and lower

Target ISA Notes

Supported on all target architectures.

Examples

.entry cta_fft

.entry filter ( .param .b32 x, .param .b32 y, .param .b32 z )
{
    .reg .b32 %r<99>;
    ld.param.b32  %r1, [x];
    ld.param.b32  %r2, [y];
    ld.param.b32  %r3, [z];
    ...
}

.entry prefix_sum ( .param .align 4 .s32 pitch[8000] )
{
    .reg .s32 %t;
    ld.param::entry.s32  %t, [pitch];
    ...
}

11.2.2. Kernel and Function Directives: .func

.func

Function definition.

Syntax

.func {.attribute(attr-list)} fname {.noreturn} {.abi_preserve N} {.abi_preserve_control N} function-body
.func {.attribute(attr-list)} fname (param-list) {.noreturn} {.abi_preserve N} {.abi_preserve_control N} function-body
.func {.attribute(attr-list)} (ret-param) fname (param-list) {.abi_preserve N} {.abi_preserve_control N} function-body

Description

Defines a function, including input and return parameters and optional function body.

An optional .noreturn directive indicates that the function does not return to the caller function. The .noreturn directive cannot be specified on functions which have return parameters. See the description of the .noreturn directive in Performance-Tuning Directives: .noreturn.

An optional .attribute directive specifies additional information associated with the function. See the description of Variable and Function Attribute Directive: .attribute for allowed attributes.

Optional .abi_preserve and .abi_preserve_control directives are used to specify the number of general purpose registers and control registers. See the descriptions of Performance-Tuning Directives: .abi_preserve and Performance-Tuning Directives: .abi_preserve_control for more details.

A.func definition with no body provides a function prototype.

The parameter lists define locally-scoped variables in the function body. Parameters must be base types in either the register or parameter state space. Parameters in register state space may be referenced directly within instructions in the function body. Parameters in .param space are accessed using ld.param{::func} and st.param{::func} instructions in the body. Parameter passing is call-by-value.

The last parameter in the parameter list may be a .param array of type .b8 with no size specified. It is used to pass an arbitrary number of parameters to the function packed into a single array object.

When calling a function with such an unsized last argument, the last argument may be omitted from the call instruction if no parameter is passed through it. Accesses to this array parameter must be within the bounds of the array. The result of an access is undefined if no array was passed, or if the access was outside the bounds of the actual array being passed.

Semantics

The PTX syntax hides all details of the underlying calling convention and ABI.

The implementation of parameter passing is left to the optimizing translator, which may use acombination of registers and stack locations to pass parameters.

Release Notes

For PTX ISA version 1.x code, parameters must be in the register state space, there is no stack, andrecursion is illegal.

PTX ISA versions 2.0 and later with target sm_20 or higher allow parameters in the .param state space, implement an ABI with stack, and support recursion.

PTX ISA versions 2.0 and later with target sm_20 or higher support at most one return value.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Support for unsized array parameter introduced in PTX ISA version 6.0.

Support for .noreturn directive introduced in PTX ISA version 6.4.

Support for .attribute directive introduced in PTX ISA version 8.0.

Support for .abi_preserve and .abi_preserve_control directives introduced in PTX ISA version 9.0.

Target ISA Notes

Functions without unsized array parameter supported on all target architectures.

Unsized array parameter requires sm_30 or higher.

.noreturn directive requires sm_30 or higher.

.attribute directive requires sm_90 or higher.

.abi_preserve and .abi_preserve_control directives require sm_80 or higher.

Examples

.func (.reg .b32 rval) foo (.reg .b32 N, .reg .f64 dbl)
{
    .reg .b32 localVar;
    ... use N, dbl;
    other code;
    mov.b32 rval, result;
    ret;
}
...
call (fooval), foo, (val0, val1);  // return value in fooval
...

.func foo (.reg .b32 N, .reg .f64 dbl) .noreturn
{
    .reg .b32 localVar;
    ... use N, dbl;
    other code;
    mov.b32 rval, result;
    ret;
}
...
call foo, (val0, val1);
...

.func (.param .u32 rval) bar(.param .u32 N, .param .align 4 .b8 numbers[])
{
    .reg .b32 input0, input1;
    ld.param.b32   input0, [numbers + 0];
    ld.param.b32   input1, [numbers + 4];
    ...
    other code;
    ret;
}
...
.param .u32 N;
.param .align 4 .b8 numbers[8];
st.param.u32    [N], 2;
st.param.b32    [numbers + 0], 5;
st.param.b32    [numbers + 4], 10;
call (rval), bar, (N, numbers);
...

11.2.3. Kernel and Function Directives: .alias

.alias

Define an alias to an existing function symbol.

Syntax

.alias fAlias, fAliasee;

Description

.alias is a module scope directive that defines identifier fAlias to be an alias to the function specified by fAliasee.

Both fAlias and fAliasee are non-entry function symbols.

Identifier fAlias is a function declaration without body.

Identifier fAliasee is a function symbol which must be defined in the same module as the .alias declaration. Function fAliasee cannot have .weak linkage.

The prototypes of fAlias and fAliasee must match.

A program can use either the fAlias or fAliasee identifier to reference the function defined with fAliasee.

PTX ISA Notes

.alias directive introduced in PTX ISA version 6.3.

Target ISA Notes

.alias directive requires sm_30 or higher.

Examples

.visible .func foo(.param .u32 p) {
   ...
}
.visible .func bar(.param .u32 p);
.alias bar, foo;

.entry test()
{
    .param .u32 p;
    ...
    call foo, (p);       // call foo directly
    ...
    call bar, (p);       // call foo through alias
}

11.3. Control Flow Directives

PTX provides directives for specifying potential targets for brx.idx and call instructions. See the descriptions of brx.idx and call for more information.

  • .branchtargets

  • .calltargets

  • .callprototype

11.3.1. Control Flow Directives: .branchtargets

.branchtargets

Declare a list of potential branch targets.

Syntax

Label:   .branchtargets  list-of-labels ;

Description

Declares a list of potential branch targets for a subsequent brx.idx, and associates the list with the label at the start of the line.

All control flow labels in the list must occur within the same function as the declaration.

The list of labels may use the compact, shorthand syntax for enumerating a range of labels having a common prefix, similar to the syntax described in Parameterized Variable Names.

PTX ISA Notes

Introduced in PTX ISA version 2.1.

Target ISA Notes

Requires sm_20 or higher.

Examples

.func foo () {
    .reg .u32 %r0;
    ...
    L1:
    ...
    L2:
    ...
    L3:
    ...
    ts: .branchtargets L1, L2, L3;
    @p brx.idx %r0, ts;
    ...

.func bar() {
    .reg .u32 %r0;
    ...
    N0:
    ...
    N1:
    ...
    N2:
    ...
    N3:
    ...
    N4:
    ...
    ts: .branchtargets N<5>;
    @p brx.idx %r0, ts;
    ...

11.3.2. Control Flow Directives: .calltargets

.calltargets

Declare a list of potential call targets.

Syntax

Label:   .calltargets  list-of-functions ;

Description

Declares a list of potential call targets for a subsequent indirect call, and associates the list with the label at the start of the line.

All functions named in the list must be declared prior to the .calltargets directive, and all functions must have the same type signature.

PTX ISA Notes

Introduced in PTX ISA version 2.1.

Target ISA Notes

Requires sm_20 or higher.

Examples

calltgt:  .calltargets  fastsin, fastcos;
...
@p   call  (%f1), %r0, (%x), calltgt;
...

11.3.3. Control Flow Directives: .callprototype

.callprototype

Declare a prototype for use in an indirect call.

Syntax

// no input or return parameters
label: .callprototype _ .noreturn {.abi_preserve N} {.abi_preserve_control N};

// input params, no return params
label: .callprototype _ (param-list) .noreturn {.abi_preserve N} {.abi_preserve_control N};

// no input params, return params
label: .callprototype (ret-param) _ {.abi_preserve N} {.abi_preserve_control N};

// input, return parameters
label: .callprototype (ret-param) _ (param-list) {.abi_preserve N} {.abi_preserve_control N};

Description

Defines a prototype with no specific function name, and associates the prototype with a label. The prototype may then be used in indirect call instructions where there is incomplete knowledge of the possible call targets.

Parameters may have either base types in the register or parameter state spaces, or array types in parameter state space. The sink symbol '_' may be used to avoid dummy parameter names.

An optional .noreturn directive indicates that the function does not return to the caller function. The .noreturn directive cannot be specified on functions which have return parameters. See the description of the .noreturn directive in Performance-Tuning Directives: .noreturn.

Optional .abi_preserve and .abi_preserve_control directives are used to specify the number of general purpose registers and control registers. See the descriptions of Performance-Tuning Directives: .abi_preserve and Performance-Tuning Directives: .abi_preserve_control for more details.

PTX ISA Notes

Introduced in PTX ISA version 2.1.

Support for .noreturn directive introduced in PTX ISA version 6.4.

Support for .abi_preserve and .abi_preserve_control directives introduced in PTX ISA version 9.0.

Target ISA Notes

Requires sm_20 or higher.

.noreturn directive requires sm_30 or higher.

.abi_preserve and .abi_preserve_control directives require sm_80 or higher.

Examples

Fproto1: .callprototype  _ ;
Fproto2: .callprototype  _ (.param .f32 _);
Fproto3: .callprototype  (.param .u32 _) _ ;
Fproto4: .callprototype  (.param .u32 _) _ (.param .f32 _);
...
@p   call  (%val), %r0, (%f1), Fproto4;
...

// example of array parameter
Fproto5: .callprototype _ (.param .b8 _[12]);

Fproto6: .callprototype  _ (.param .f32 _) .noreturn;
...
@p   call  %r0, (%f1), Fproto6;
...

// example of .abi_preserve
Fproto7: .callprototype _ (.param .b32 _) .abi_preserve 10;
...
@p   call %r0, (%r1), Fproto7;
...

11.4. Performance-Tuning Directives

To provide a mechanism for low-level performance tuning, PTX supports the following directives, which pass information to the optimizing backend compiler.

  • .maxnreg

  • .maxntid

  • .reqntid

  • .minnctapersm

  • .maxnctapersm (deprecated)

  • .pragma

  • .abi_preserve

  • .abi_preserve_control

The .maxnreg directive specifies the maximum number of registers to be allocated to a single thread; the .maxntid directive specifies the maximum number of threads in a thread block (CTA); the .reqntid directive specifies the required number of threads in a thread block (CTA); and the .minnctapersm directive specifies a minimum number of thread blocks to be scheduled on a single multiprocessor (SM). These can be used, for example, to throttle the resource requirements (e.g., registers) to increase total thread count and provide a greater opportunity to hide memory latency. The .minnctapersm directive can be used together with either the .maxntid or .reqntid directive to trade off registers-per-thread against multiprocessor utilization without needing to directly specify a maximum number of registers. This may achieve better performance when compiling PTX for multiple devices having different numbers of registers per SM.
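The register/occupancy trade-off just described, combining a fixed CTA shape with a minimum occupancy target so the backend derives the register budget itself, can be sketched as follows (the kernel name and the particular values are illustrative only):

```ptx
// Illustrative: require 256 threads per CTA and ask for at least
// 4 CTAs per SM; the backend throttles registers-per-thread to fit.
.entry throttled_kernel .reqntid 256 .minnctapersm 4
{
    ...
}
```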

Device function directives .abi_preserve and .abi_preserve_control specify the number of data and control registers, from the callee-saved registers, that a function must preserve for its caller. This can be considered to be the number of general purpose and control registers live in the caller when the function is called. Control registers refer to the number of divergent program points in the call tree leading to the current function call.

Currently, the .maxnreg, .maxntid, .reqntid, and .minnctapersm directives may be applied per-entry and must appear between an .entry directive and its body. The directives take precedence over any module-level constraints passed to the optimizing backend. A warning message is generated if the directives’ constraints are inconsistent or cannot be met for the specified target device.

A general .pragma directive is supported for passing information to the PTX backend. The directive passes a list of strings to the backend, and the strings have no semantics within the PTX virtual machine model. The interpretation of .pragma values is determined by the backend implementation and is beyond the scope of the PTX ISA. Note that .pragma directives may appear at module (file) scope, at entry-scope, or as statements within a kernel or device function body.

11.4.1. Performance-Tuning Directives: .maxnreg

.maxnreg

Maximum number of registers that can be allocated per thread.

Syntax

.maxnreg n

Description

Declare the maximum number of registers per thread in a CTA.

Semantics

The compiler guarantees that this limit will not be exceeded. The actual number of registers used may be less; for example, the backend may be able to compile to fewer registers, or the maximum number of registers may be further constrained by .maxntid and .maxnctapersm.

PTX ISA Notes

Introduced in PTX ISA version 1.3.

Target ISA Notes

Supported on all target architectures.

Examples

.entry foo .maxnreg 16 { ... }  // max regs per thread = 16

11.4.2. Performance-Tuning Directives: .maxntid

.maxntid

Maximum number of threads in the thread block (CTA).

Syntax

.maxntid nx
.maxntid nx, ny
.maxntid nx, ny, nz

Description

Declare the maximum number of threads in the thread block (CTA). This maximum is specified by giving the maximum extent of each dimension of the 1D, 2D, or 3D CTA. The maximum number of threads is the product of the maximum extent in each dimension.

Semantics

The maximum number of threads in the thread block, computed as the product of the maximum extent specified for each dimension, is guaranteed not to be exceeded in any invocation of the kernel in which this directive appears. Exceeding the maximum number of threads results in a runtime error or kernel launch failure.

Note that this directive guarantees that the total number of threads does not exceed the maximum, but does not guarantee that the limit in any particular dimension is not exceeded.

PTX ISA Notes

Introduced in PTX ISA version 1.3.

Target ISA Notes

Supported on all target architectures.

Examples

.entry foo .maxntid 256       { ... }  // max threads = 256
.entry bar .maxntid 16,16,4   { ... }  // max threads = 1024
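The per-dimension caveat in Semantics can be made concrete with a hedged sketch (the launch shape below is hypothetical): the directive bounds only the product of the extents, so a launch whose total stays within that product may be accepted even if one dimension exceeds its declared extent.

```
// .maxntid 16,16,4 bounds only the total: 16*16*4 = 1024 threads.
// A launch of (64,4,4) = 1024 threads stays within the guaranteed total
// even though nx = 64 exceeds the declared per-dimension extent of 16.
.entry bar .maxntid 16,16,4 { ... }
```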

11.4.3. Performance-Tuning Directives: .reqntid

.reqntid

Number of threads in the thread block (CTA).

Syntax

.reqntid nx
.reqntid nx, ny
.reqntid nx, ny, nz

Description

Declare the number of threads in the thread block (CTA) by specifying the extent of each dimension of the 1D, 2D, or 3D CTA. The total number of threads is the product of the number of threads in each dimension.

Semantics

The size of each CTA dimension specified in any invocation of the kernel is required to be equal to that specified in this directive. Specifying a different CTA dimension at launch will result in a runtime error or kernel launch failure.

Notes

The .reqntid directive cannot be used in conjunction with the .maxntid directive.

PTX ISA Notes

Introduced in PTX ISA version 2.1.

Target ISA Notes

Supported on all target architectures.

Examples

.entry foo .reqntid 256       { ... }  // num threads = 256
.entry bar .reqntid 16,16,4   { ... }  // num threads = 1024

11.4.4. Performance-Tuning Directives: .minnctapersm

.minnctapersm

Minimum number of CTAs per SM.

Syntax

.minnctapersm ncta

Description

Declare the minimum number of CTAs from the kernel’s grid to be mapped to a single multiprocessor (SM).

Notes

Optimizations based on .minnctapersm need either .maxntid or .reqntid to be specified as well.

If the total number of threads on a single SM resulting from .minnctapersm and .maxntid / .reqntid exceeds the maximum number of threads supported by an SM, then the .minnctapersm directive will be ignored.

In PTX ISA version 2.1 or higher, a warning is generated if .minnctapersm is specified without specifying either .maxntid or .reqntid.

PTX ISA Notes

Introduced in PTX ISA version 2.0 as a replacement for.maxnctapersm.

Target ISA Notes

Supported on all target architectures.

Examples

.entry foo .maxntid 256 .minnctapersm 4 { ... }

11.4.5. Performance-Tuning Directives: .maxnctapersm (deprecated)

.maxnctapersm

Maximum number of CTAs per SM.

Syntax

.maxnctapersm ncta

Description

Declare the maximum number of CTAs from the kernel’s grid that may be mapped to a single multiprocessor (SM).

Notes

Optimizations based on .maxnctapersm generally need .maxntid to be specified as well. The optimizing backend compiler uses .maxntid and .maxnctapersm to compute an upper-bound on per-thread register usage so that the specified number of CTAs can be mapped to a single multiprocessor. However, if the number of registers used by the backend is sufficiently lower than this bound, additional CTAs may be mapped to a single multiprocessor. For this reason, .maxnctapersm has been renamed to .minnctapersm in PTX ISA version 2.0.

PTX ISA Notes

Introduced in PTX ISA version 1.3. Deprecated in PTX ISA version 2.0.

Target ISA Notes

Supported on all target architectures.

Examples

.entry foo .maxntid 256 .maxnctapersm 4 { ... }

11.4.6. Performance-Tuning Directives: .noreturn

.noreturn

Indicate that the function does not return to its caller function.

Syntax

.noreturn

Description

Indicate that the function does not return to its caller function.

Semantics

An optional .noreturn directive indicates that the function does not return to its caller function. The .noreturn directive can only be specified on device functions and must appear between a .func directive and its body.

The directive cannot be specified on functions which have return parameters.

If a function with the .noreturn directive returns to the caller function at runtime, then the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 6.4.

Target ISA Notes

Requires sm_30 or higher.

Examples

.func foo .noreturn { ... }

11.4.7. Performance-Tuning Directives: .pragma

.pragma

Pass directives to PTX backend compiler.

Syntax

.pragma list-of-strings ;

Description

Pass module-scoped, entry-scoped, or statement-level directives to the PTX backend compiler.

The .pragma directive may occur at module-scope, at entry-scope, or at statement-level.

Semantics

The interpretation of .pragma directive strings is implementation-specific and has no impact on PTX semantics. See Descriptions of .pragma Strings for descriptions of the pragma strings defined in ptxas.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

Supported on all target architectures.

Examples

.pragma "nounroll";    // disable unrolling in backend

// disable unrolling for current kernel
.entry foo .pragma "nounroll"; { ... }

11.4.8. Performance-Tuning Directives: .abi_preserve

.abi_preserve

Specify the number of general-purpose registers that should be preserved by the callers of this function.

Syntax

.abi_preserve N

Description

This is an architecture-agnostic value specifying the actual number of general-purpose registers. Internally, the ABI defines some general-purpose registers as preserved (callee-saved) registers. The integer N specifies the actual number of general-purpose registers that should be preserved by the function.

The .abi_preserve directive can only be specified on device functions and must appear between a .func directive and its body.

Semantics

When this directive is specified, the compiler backend modifies low-level ABI components to ensure that the number of live data variables in the callers of this function that are stored in the callee-saved registers is less than the specified value.

PTX ISA Notes

Introduced in PTX ISA version 9.0.

Target ISA Notes

Requires sm_80 or higher.

Examples

.func bar() .abi_preserve 8

// Indirect call via call prototype
.func (.param .b32 out[30]) foo (.param .b32 in[30]) .abi_preserve 10 { ... }
...
mov.b64 lpfoo, foo;
prot: .callprototype (.param .b32 out[30]) _ (.param .b32 in[30]) .abi_preserve 10;
call (out), lpfoo, (in), prot;

11.4.9. Performance-Tuning Directives: .abi_preserve_control

.abi_preserve_control

Specify the number of control registers that should be preserved by the callers of this function.

Syntax

.abi_preserve_control N

Description

This is an architecture-agnostic value specifying the number of divergent program points that occur in the call tree leading to the current function call. Internally, the ABI defines some control registers as preserved (callee-saved) registers. The integer N specifies the actual number of control registers that should be preserved by the function.

The .abi_preserve_control directive can only be specified on device functions and must appear between a .func directive and its body.

Semantics

When this directive is specified, the compiler backend modifies low-level ABI components to ensure that the number of live control variables in the callers of this function that are stored in the callee-saved control registers is less than the specified value.

PTX ISA Notes

Introduced in PTX ISA version 9.0.

Target ISA Notes

Requires sm_80 or higher.

Examples

.func foo() .abi_preserve_control 14

// Indirect call via call prototype
.func (.param .b32 out[30]) bar (.param .b32 in[30]) .abi_preserve_control 10 { ... }
...
mov.b64 lpbar, bar;
prot: .callprototype (.param .b32 out[30]) _ (.param .b32 in[30]) .abi_preserve_control 10;
call (out), lpbar, (in), prot;

11.5. Debugging Directives

DWARF-format debug information is passed through PTX modules using the following directives:

  • @@DWARF

  • .section

  • .file

  • .loc

The .section directive was introduced in PTX ISA version 2.0 and replaces the @@DWARF syntax. The @@DWARF syntax was deprecated in PTX ISA version 2.0 but is supported for legacy PTX ISA version 1.x code.

Beginning with PTX ISA version 3.0, PTX files containing DWARF debug information should include the .target debug platform option. This forward declaration directs PTX compilation to retain mappings for source-level debugging.

11.5.1. Debugging Directives: @@dwarf

@@dwarf

DWARF-format information.

Syntax

@@DWARF dwarf-string

dwarf-string may have one of the following formats:

.byte   byte-list   // comma-separated hexadecimal byte values
.4byte  int32-list  // comma-separated hexadecimal integers in range [0..2^32-1]
.quad   int64-list  // comma-separated hexadecimal integers in range [0..2^64-1]
.4byte  label
.quad   label

PTX ISA Notes

Introduced in PTX ISA version 1.2. Deprecated as of PTX ISA version 2.0, replaced by the .section directive.

Target ISA Notes

Supported on all target architectures.

Examples

@@DWARF .section .debug_pubnames, "", @progbits
@@DWARF .byte   0x2b, 0x00, 0x00, 0x00, 0x02, 0x00
@@DWARF .4byte  .debug_info
@@DWARF .4byte  0x000006b5, 0x00000364, 0x61395a5f, 0x5f736f63
@@DWARF .4byte  0x6e69616d, 0x63613031, 0x6150736f, 0x736d6172
@@DWARF .byte   0x00, 0x00, 0x00, 0x00, 0x00

11.5.2. Debugging Directives: .section

.section

PTX section definition.

Syntax

.section section_name { dwarf-lines }

dwarf-lines have the following formats:
  .b8    byte-list       // comma-separated list of integers
                         // in range [-128..255]
  .b16   int16-list      // comma-separated list of integers
                         // in range [-2^15..2^16-1]
  .b32   int32-list      // comma-separated list of integers
                         // in range [-2^31..2^32-1]
  label:                 // Define label inside the debug section
  .b64   int64-list      // comma-separated list of integers
                         // in range [-2^63..2^64-1]
  .b32   label
  .b64   label
  .b32   label+imm       // a sum of label address plus a constant integer byte
                         // offset (signed, 32-bit)
  .b64   label+imm       // a sum of label address plus a constant integer byte
                         // offset (signed, 64-bit)
  .b32   label1-label2   // a difference in label addresses between labels in
                         // the same dwarf section (32-bit)
  .b64   label3-label4   // a difference in label addresses between labels in
                         // the same dwarf section (64-bit)

PTX ISA Notes

Introduced in PTX ISA version 2.0, replaces @@DWARF syntax.

label+imm expression introduced in PTX ISA version 3.2.

Support for .b16 integers in dwarf-lines introduced in PTX ISA version 6.0.

Support for defining label inside the DWARF section is introduced in PTX ISA version 7.2.

label1-label2 expression introduced in PTX ISA version 7.5.

Negative numbers in dwarf lines introduced in PTX ISA version 7.5.

Target ISA Notes

Supported on all target architectures.

Examples

.section .debug_pubnames
{
    .b32    LpubNames_end0-LpubNames_begin0
  LpubNames_begin0:
    .b8     0x2b, 0x00, 0x00, 0x00, 0x02, 0x00
    .b32    .debug_info
  info_label1:
    .b32    0x000006b5, 0x00000364, 0x61395a5f, 0x5f736f63
    .b32    0x6e69616d, 0x63613031, 0x6150736f, 0x736d6172
    .b8     0x00, 0x00, 0x00, 0x00, 0x00
  LpubNames_end0:
}

.section .debug_info
{
    .b32 11430
    .b8 2, 0
    .b32 .debug_abbrev
    .b8 8, 1, 108, 103, 101, 110, 102, 101, 58, 32, 69, 68, 71, 32, 52, 46, 49
    .b8 0
    .b32 3, 37, 176, -99
    .b32 info_label1
    .b32 .debug_loc+0x4
    .b8 -11, 11, 112, 97
    .b32 info_label1+12
    .b64 -1
    .b16 -5, -65535
}

11.5.3. Debugging Directives: .file

.file

Source file name.

Syntax

.file file_index "filename" {, timestamp, file_size}

Description

Associates a source filename with an integer index. .loc directives reference source files by index.

The .file directive optionally allows specifying an unsigned number representing the time of last modification and an unsigned integer representing the size in bytes of the source file. The timestamp and file_size values can be 0 to indicate that this information is not available.

The timestamp value is in the format of the C and C++ data type time_t.

file_size is an unsigned 64-bit integer.

The .file directive is allowed only in the outermost scope, i.e., at the same level as kernel and device function declarations.

Semantics

If timestamp and file size are not specified, they default to 0.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Timestamp and file size introduced in PTX ISA version 3.2.

Target ISA Notes

Supported on all target architectures.

Examples

.file 1 "example.cu"
.file 2 "kernel.cu"
.file 1 "kernel.cu", 1339013327, 64118

11.5.4. Debugging Directives: .loc

.loc

Source file location.

Syntax

.loc file_index line_number column_position
.loc file_index line_number column_position, function_name label {+ immediate }, inlined_at file_index2 line_number2 column_position2

Description

Declares the source file location (source file, line number, and column position) to be associated with lexically subsequent PTX instructions. .loc refers to file_index, which is defined by a .file directive.

To indicate PTX instructions that are generated from a function that got inlined, the additional attribute .inlined_at can be specified as part of the .loc directive. The .inlined_at attribute specifies the source location at which the specified function is inlined. file_index2, line_number2, and column_position2 specify the location at which the function is inlined. The source location specified in the .inlined_at attribute must lexically precede the source location specified in the .loc directive.

The function_name attribute specifies an offset in the DWARF section named .debug_str. The offset is specified as a label expression or label+immediate expression, where label is defined in the .debug_str section. The DWARF section .debug_str contains ASCII null-terminated strings that specify the name of the function that is inlined.

Note that a PTX instruction may have a single associated source location, determined by the nearest lexically preceding .loc directive, or no associated source location if there is no preceding .loc directive. Labels in PTX inherit the location of the closest lexically following instruction. A label with no following PTX instruction has no associated source location.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

function_name and inlined_at attributes are introduced in PTX ISA version 7.2.

Target ISA Notes

Supported on all target architectures.

Examples

    .loc 2 4237 0
L1:                        // line 4237, col 0 of file #2,
                           // inherited from mov
    mov.u32  %r1,%r2;      // line 4237, col 0 of file #2
    add.u32  %r2,%r1,%r3;  // line 4237, col 0 of file #2
...
L2:                        // line 4239, col 5 of file #2,
                           // inherited from sub
    .loc 2 4239 5
    sub.u32  %r2,%r1,%r3;  // line 4239, col 5 of file #2

    .loc 1 21 3
    .loc 1 9 3, function_name info_string0, inlined_at 1 21 3
    ld.global.u32   %r1, [gg]; // Function at line 9
    setp.lt.s32 %p1, %r1, 8;   // inlined at line 21
    .loc 1 27 3
    .loc 1 10 5, function_name info_string1, inlined_at 1 27 3
    .loc 1 15 3, function_name .debug_str+16, inlined_at 1 10 5
    setp.ne.s32 %p2, %r1, 18;
    @%p2 bra    BB2_3;

    .section .debug_str {
    info_string0:
     .b8 95  // _
     .b8 90  // z
     .b8 51  // 3
     .b8 102 // f
     .b8 111 // o
     .b8 111 // o
     .b8 118 // v
     .b8 0
    info_string1:
     .b8 95  // _
     .b8 90  // z
     .b8 51  // 3
     .b8 98  // b
     .b8 97  // a
     .b8 114 // r
     .b8 118 // v
     .b8 0
     .b8 95  // _
     .b8 90  // z
     .b8 51  // 3
     .b8 99  // c
     .b8 97  // a
     .b8 114 // r
     .b8 118 // v
     .b8 0
    }

11.6. Linking Directives

  • .extern

  • .visible

  • .weak

11.6.1. Linking Directives: .extern

.extern

External symbol declaration.

Syntax

.extern identifier

Description

Declares the identifier to be defined externally to the current module. The module defining such an identifier must define it as .weak or .visible exactly once in a single object file. An extern declaration of a symbol may appear multiple times, and references to it are resolved against the single definition of that symbol.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

.extern .global .b32 foo;  // foo is defined in another module
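A hedged two-module sketch of the declaration/definition pairing described above (the module labels, function, and symbol contents are hypothetical):

```
// Module A: references foo without defining it.
.extern .global .b32 foo;
.visible .func use_foo ()
{
    .reg .b32 %r1;
    ld.global.b32 %r1, [foo];
    ret;
}

// Module B: the single .visible definition against which all
// .extern references to foo are resolved.
.visible .global .b32 foo = 0;
```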

11.6.2. Linking Directives: .visible

.visible

Visible (externally) symbol declaration.

Syntax

.visible identifier

Description

Declares the identifier to be globally visible. Unlike C, where identifiers are globally visible unless declared static, PTX identifiers are visible only within the current module unless declared .visible outside the current module.

PTX ISA Notes

Introduced in PTX ISA version 1.0.

Target ISA Notes

Supported on all target architectures.

Examples

.visible .global .b32 foo;  // foo will be externally visible

11.6.3. Linking Directives: .weak

.weak

Visible (externally) symbol declaration.

Syntax

.weak identifier

Description

Declares the identifier to be globally visible but weak. Weak symbols are similar to globally visible symbols, except that during linking, weak symbols are chosen only after globally visible symbols during symbol resolution. Unlike globally visible symbols, multiple object files may declare the same weak symbol, and references to a symbol get resolved against a weak symbol only if no global symbols have the same name.

PTX ISA Notes

Introduced in PTX ISA version 3.1.

Target ISA Notes

Supported on all target architectures.

Examples

.weak .func (.reg .b32 val) foo;  // foo will be externally visible

11.6.4. Linking Directives: .common

.common

Visible (externally) symbol declaration.

Syntax

.common identifier

Description

Declares the identifier to be globally visible but “common”.

Common symbols are similar to globally visible symbols. However, multiple object files may declare the same common symbol, possibly with different types and sizes; references to the symbol are resolved against the common symbol with the largest size.

Only one object file can initialize a common symbol, and that definition must have the largest size among all definitions of that common symbol from different object files.

The .common linking directive can be used only on variables with .global storage. It cannot be used on function symbols or on symbols with opaque type.

PTX ISA Notes

Introduced in PTX ISA version 5.0.

Target ISA Notes

The .common directive requires sm_20 or higher.

Examples

.common .global .u32 gbl;
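A hedged two-module sketch of common-symbol resolution (the module labels, symbol name, and sizes are hypothetical):

```
// Module A: declares a 4-byte common symbol.
.common .global .align 4 .b8 buf[4];

// Module B: declares the same symbol with a larger size; references in
// both modules are resolved against this 8-byte definition.
.common .global .align 8 .b8 buf[8];
```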

11.7. Cluster Dimension Directives

The following directives specify information about clusters:

  • .reqnctapercluster

  • .explicitcluster

  • .maxclusterrank

The .reqnctapercluster directive specifies the number of CTAs in the cluster. The .explicitcluster directive specifies that the kernel should be launched with explicit cluster details. The .maxclusterrank directive specifies the maximum number of CTAs in the cluster.

The cluster dimension directives can be applied only on kernel functions.

11.7.1. Cluster Dimension Directives: .reqnctapercluster

.reqnctapercluster

Declare the number of CTAs in the cluster.

Syntax

.reqnctapercluster nx
.reqnctapercluster nx, ny
.reqnctapercluster nx, ny, nz

Description

Set the number of thread blocks (CTAs) in the cluster by specifying the extent of each dimension of the 1D, 2D, or 3D cluster. The total number of CTAs is the product of the number of CTAs in each dimension. For kernels with the .reqnctapercluster directive specified, the runtime will use the specified values for configuring the launch if the same are not specified at launch time.

Semantics

If the cluster dimension is explicitly specified at launch time, it must be equal to the values specified in this directive. Specifying a different cluster dimension at launch will result in a runtime error or kernel launch failure.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.entry foo .reqnctapercluster 2         { . . . }
.entry bar .reqnctapercluster 2, 2, 1   { . . . }
.entry ker .reqnctapercluster 3, 2      { . . . }

11.7.2. Cluster Dimension Directives: .explicitcluster

.explicitcluster

Declare that the kernel must be launched with cluster dimensions explicitly specified.

Syntax

.explicitcluster

Description

Declares that this kernel should be launched with the cluster dimension explicitly specified.

Semantics

Kernels with the .explicitcluster directive must be launched with the cluster dimension explicitly specified (either at launch time or via .reqnctapercluster); otherwise, the program will fail with a runtime error or kernel launch failure.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.entry foo .explicitcluster         { . . . }

11.7.3. Cluster Dimension Directives: .maxclusterrank

.maxclusterrank

Declare the maximum number of CTAs that can be part of the cluster.

Syntax

.maxclusterrank n

Description

Declare the maximum number of thread blocks (CTAs) allowed to be part of the cluster.

Semantics

The product of the number of CTAs in each cluster dimension specified in any invocation of the kernel is required to be less than or equal to the value specified in this directive. Otherwise, the invocation will result in a runtime error or kernel launch failure.

The .maxclusterrank directive cannot be used in conjunction with the .reqnctapercluster directive.

PTX ISA Notes

Introduced in PTX ISA version 7.8.

Target ISA Notes

Requires sm_90 or higher.

Examples

.entry foo .maxclusterrank 8         { . . . }

11.8. Miscellaneous Directives

PTX provides the following miscellaneous directives:

  • .blocksareclusters

11.8.1. Miscellaneous Directives: .blocksareclusters

.blocksareclusters

Specify that CUDA thread blocks are mapped to clusters.

Syntax

.blocksareclusters

Description

The default behavior of the CUDA API is to specify the grid launch configuration by giving the number of thread blocks and the number of threads per block.

When the .blocksareclusters directive is specified, the grid launch configuration for the corresponding .entry function specifies the number of clusters instead of the number of thread blocks. In this case, the number of thread blocks per cluster is specified by the .reqnctapercluster directive and the thread block size is specified with the .reqntid directive.

The .blocksareclusters directive is only allowed on .entry functions and also requires the .reqntid and .reqnctapercluster directives to be specified.

Refer to the CUDA Programming Guide for more details.

PTX ISA Notes

Introduced in PTX ISA version 9.0.

Target ISA Notes

Requires sm_90 or higher.

Examples

.entry foo .reqntid 32, 32, 1 .reqnctapercluster 32, 32, 1 .blocksareclusters { ... }

12. Descriptions of .pragma Strings

This section describes the .pragma strings defined by ptxas.

12.1. Pragma Strings: "nounroll"

"nounroll"

Disable loop unrolling in the optimizing backend compiler.

Syntax

.pragma "nounroll";

Description

The "nounroll" pragma is a directive to disable loop unrolling in the optimizing backend compiler.

The "nounroll" pragma is allowed at module, entry-function, and statement levels, with the following meanings:

module scope

disables unrolling for all loops in module, including loops preceding the .pragma.

entry-function scope

disables unrolling for all loops in the entry function body.

statement-level pragma

disables unrolling of the loop for which the current block is the loop header.

Note that in order to have the desired effect at statement level, the "nounroll" directive must appear before any instruction statements in the loop header basic block for the desired loop. The loop header block is defined as the block that dominates all blocks in the loop body and is the target of the loop backedge. Statement-level "nounroll" directives appearing outside of loop header blocks are silently ignored.

PTX ISA Notes

Introduced in PTX ISA version 2.0.

Target ISA Notes

Requires sm_20 or higher. Ignored for sm_1x targets.

Examples

.entry foo (...)
.pragma "nounroll";  // do not unroll any loop in this function
{
...
}

.func bar (...)
{
...
L1_head:
     .pragma "nounroll";  // do not unroll this loop
     ...
@p   bra L1_end;
L1_body:
     ...
L1_continue:
     bra L1_head;
L1_end:
     ...
}

12.2. Pragma Strings: "used_bytes_mask"

"used_bytes_mask"

Mask for indicating used bytes in data of ld operation.

Syntax

.pragma "used_bytes_mask mask";

Description

The "used_bytes_mask" pragma is a directive that specifies the used bytes in a load operation based on the mask provided.

The "used_bytes_mask" pragma needs to be specified prior to a load instruction for which information about the bytes used from the load operation is needed. The pragma is ignored if the instruction following it is not a load instruction.

For a load instruction without this pragma, all bytes from the load operation are assumed to be used.

Operand mask is a 32-bit integer with set bits indicating the used bytes in the data of the load operation.

Semantics

Each bit in the mask operand corresponds to a byte of the data, where each set bit represents a used byte. The most-significant bit corresponds to the most-significant byte of the data.

// For a 4-byte load with only the lower 3 bytes used
.pragma "used_bytes_mask 0x7";
ld.global.u32 %r0, [gbl];     // Higher 1 byte from %r0 is unused

// For a vector load of 16 bytes with the lower 12 bytes used
.pragma "used_bytes_mask 0xfff";
ld.global.v4.u32 {%r0, %r1, %r2, %r3}, [gbl];  // %r3 unused

PTX ISA Notes

Introduced in PTX ISA version 8.3.

Target ISA Notes

Requires sm_50 or higher.

Examples

.pragma "used_bytes_mask 0xfff";
ld.global.v4.u32 {%r0, %r1, %r2, %r3}, [gbl]; // Only lower 12 bytes used

12.3. Pragma Strings: "enable_smem_spilling"

"enable_smem_spilling"

Enable shared memory spilling for CUDA kernels.

Syntax

.pragma "enable_smem_spilling";

Description

The "enable_smem_spilling" pragma is a directive that enables register spilling into shared memory. During the spilling process, registers are first spilled into shared memory, and once the allocated shared memory is full, any additional spills are redirected to local memory. This can enhance performance by reducing memory access latency, since shared memory accesses are faster than local memory accesses.

The "enable_smem_spilling" pragma is only allowed within the function scope. When applied, it enables shared memory spilling for the specified function.

The use of this pragma is valid only in certain scenarios and specific compilation modes. Its use is disallowed in the following cases and may result in an error:

  • Per-function compilation mode: e.g., Separate Compilation, Device-debug, Whole program with recursive function calls, Extensible-whole-program

  • Kernels utilizing dynamically allocated shared memory

  • Kernels using the setmaxnreg instruction

Note

If launch bounds are not explicitly specified, the compiler assumes the maximum possible number of threads per CTA to estimate the shared memory allocated per CTA and the corresponding spill size. However, if the kernel is launched with fewer threads per CTA than estimated, the shared memory allocated per CTA may exceed the compiler-estimated size, thereby potentially limiting the number of CTAs that can be launched on an SM. Due to this, using the pragma without launch bounds may lead to performance regressions. Hence it is recommended to use this pragma only when launch bounds are explicitly specified.

PTX ISA Notes

Introduced in PTX ISA version 9.0.

Target ISA Notes

Requires sm_75 or higher.

Examples

.entry foo (...)
{
    ...
    .pragma "enable_smem_spilling";   // Enable shared memory spilling for this function
    ...
}

12.4. Pragma Strings: "frequency"

"frequency"

Specify frequency for basic block execution.

Syntax

.pragma "frequency n";

Description

The "frequency" pragma is a directive that specifies the number of times a basic block is executed by an executing thread. The optimizing compiler backend treats this pragma as a hint to be used for optimizations.

Operand n is a 64-bit non-negative integer constant that specifies the execution frequency.

Note that in order to have the desired effect, this pragma should be specified at the start of the basic block. A basic block is defined as a straight-line sequence of instructions with only one entry point and one exit point.

PTX ISA Notes

Introduced in PTX ISA version 9.0.

Target ISA Notes

Supported on all target architectures.

Examples

.entry foo (...)
{
    .pragma "frequency 32";
    ...
}

13. Release Notes

This section describes the history of change in the PTX ISA and implementation. The first section describes ISA and implementation changes in the current release of PTX ISA version 9.0, and the remaining sections provide a record of changes in previous releases of PTX ISA versions back to PTX ISA version 2.0.

Table 57 shows the PTX release history.

Table 57PTX Release History

PTX ISA Version

CUDA Release

Supported Targets

PTX ISA 1.0

CUDA 1.0

sm_{10,11}

PTX ISA 1.1

CUDA 1.1

sm_{10,11}

PTX ISA 1.2

CUDA 2.0

sm_{10,11,12,13}

PTX ISA 1.3

CUDA 2.1

sm_{10,11,12,13}

PTX ISA 1.4

CUDA 2.2

sm_{10,11,12,13}

PTX ISA 1.5

driver r190

sm_{10,11,12,13}

PTX ISA 2.0

CUDA 3.0, driver r195

sm_{10,11,12,13},sm_20

PTX ISA 2.1

CUDA 3.1, driver r256

sm_{10,11,12,13},sm_20

PTX ISA 2.2

CUDA 3.2, driver r260

sm_{10,11,12,13},sm_20

PTX ISA 2.3

CUDA 4.0, driver r270

sm_{10,11,12,13},sm_20

PTX ISA Version | CUDA Release | Supported Targets

PTX ISA 3.0 | CUDA 4.1, driver r285 | sm_{10,11,12,13}, sm_20
PTX ISA 3.0 | CUDA 4.2, driver r295 | sm_{10,11,12,13}, sm_20, sm_30
PTX ISA 3.1 | CUDA 5.0, driver r302 | sm_{10,11,12,13}, sm_20, sm_{30,35}
PTX ISA 3.2 | CUDA 5.5, driver r319 | sm_{10,11,12,13}, sm_20, sm_{30,35}
PTX ISA 4.0 | CUDA 6.0, driver r331 | sm_{10,11,12,13}, sm_20, sm_{30,32,35}, sm_50
PTX ISA 4.1 | CUDA 6.5, driver r340 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52}
PTX ISA 4.2 | CUDA 7.0, driver r346 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}
PTX ISA 4.3 | CUDA 7.5, driver r352 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}
PTX ISA 5.0 | CUDA 8.0, driver r361 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}
PTX ISA 6.0 | CUDA 9.0, driver r384 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_70
PTX ISA 6.1 | CUDA 9.1, driver r387 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_70, sm_72
PTX ISA 6.2 | CUDA 9.2, driver r396 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_70, sm_72
PTX ISA 6.3 | CUDA 10.0, driver r400 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_70, sm_72, sm_75
PTX ISA 6.4 | CUDA 10.1, driver r418 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_70, sm_72, sm_75
PTX ISA 6.5 | CUDA 10.2, driver r440 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_70, sm_72, sm_75
PTX ISA 7.0 | CUDA 11.0, driver r445 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_80
PTX ISA 7.1 | CUDA 11.1, driver r455 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86}
PTX ISA 7.2 | CUDA 11.2, driver r460 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86}
PTX ISA 7.3 | CUDA 11.3, driver r465 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86}
PTX ISA 7.4 | CUDA 11.4, driver r470 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87}
PTX ISA 7.5 | CUDA 11.5, driver r495 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87}
PTX ISA 7.6 | CUDA 11.6, driver r510 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87}
PTX ISA 7.7 | CUDA 11.7, driver r515 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87}
PTX ISA 7.8 | CUDA 11.8, driver r520 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_90
PTX ISA 8.0 | CUDA 12.0, driver r525 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}
PTX ISA 8.1 | CUDA 12.1, driver r530 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}
PTX ISA 8.2 | CUDA 12.2, driver r535 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}
PTX ISA 8.3 | CUDA 12.3, driver r545 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}
PTX ISA 8.4 | CUDA 12.4, driver r550 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}
PTX ISA 8.5 | CUDA 12.5, driver r555 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}
PTX ISA 8.5 | CUDA 12.6, driver r560 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}
PTX ISA 8.6 | CUDA 12.7, driver r565 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}, sm_{100,100a,101,101a}
PTX ISA 8.7 | CUDA 12.8, driver r570 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}, sm_{100,100a,101,101a}, sm_{120,120a}
PTX ISA 8.8 | CUDA 12.9, driver r575 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}, sm_{100,100f,100a,101,101f,101a,103,103f,103a}, sm_{120,120f,120a,121,121f,121a}
PTX ISA 9.0 | CUDA 13.0, driver r580 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,88,89}, sm_{90,90a}, sm_{100,100f,100a,103,103f,103a}, sm_{110,110f,110a}, sm_{120,120f,120a,121,121f,121a}

13.1. Changes in PTX ISA Version 9.0

New Features

PTX ISA version 9.0 introduces the following new features:

  • Adds support for sm_88 target architecture.

  • Adds support for sm_110 target architecture.

  • Adds support for target sm_110f that supports family-specific features.

  • Adds support for target sm_110a that supports architecture-specific features.

  • Adds support for pragma enable_smem_spilling that is used to enable shared memory spilling for a function.

  • Adds support for pragma frequency that is used to specify the execution frequency of a basic block.

  • Adds support for directive .blocksareclusters that is used to specify that CUDA thread blocks are mapped to clusters.

  • Extends size operand of st.bulk instruction to support 32-bit length.

  • Adds support for performance-tuning directives .abi_preserve and .abi_preserve_control that are used to specify the number of data and control registers that should be preserved by the callers of a function.

Notes

  • Targets sm_{101,101f,101a} are renamed to targets sm_{110,110f,110a} from PTX ISA version 9.0.

Semantic Changes and Clarifications

  • All tcgen05 instructions (tcgen05.alloc, tcgen05.dealloc, tcgen05.relinquish_alloc_permit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.sp, tcgen05.mma.ws, tcgen05.mma.ws.sp, tcgen05.commit) within a kernel must specify the same value for the .cta_group qualifier.


13.2. Changes in PTX ISA Version 8.8

New Features

PTX ISA version 8.8 introduces the following new features:

  • Adds support for sm_103 target architecture.

  • Adds support for target sm_103a that supports architecture-specific features.

  • Adds support for sm_121 target architecture.

  • Adds support for target sm_121a that supports architecture-specific features.

  • Introduces family-specific target architectures that are represented with an “f” suffix. PTX for family-specific targets is compatible with all subsequent targets in the same family. Adds support for sm_100f, sm_101f, sm_103f, sm_120f, sm_121f.

  • Extends min and max instructions to support three input arguments.

  • Extends tcgen05.mma instruction to add support for new scale_vectorsize qualifiers .block16 and .block32 and K dimension 96.

  • Extends .field3 of tensormap.replace instruction to support 96B swizzle mode.

  • Adds support for tcgen05.ld.red instruction.

  • Extends ld, ld.global.nc and st instructions to support 256-bit load/store operations.

  • Table 58 shows the list of features that are supported on family-specific targets:

    Table 58 List of features promoted to family-specific architecture

    Feature | Supported targets
    .m16n8, .m16n16, .m8n16 shapes and .b8 type for ldmatrix/stmatrix | sm_100f, sm_101f, sm_120f
    Shapes for tcgen05: .16x64b, .16x128b, .16x256b, .16x32bx2, .32x32b, .4x256b, .32x128b, .64x128b, .128x256b, .128x128b, .31x256b | sm_100f, sm_101f
    setmaxnreg | sm_100f, sm_101f, sm_120f
    .cta_group modifier | sm_100f, sm_101f
    cvt with .e2m1x2, .e3m2x2, .e2m3x2, .ue8m0x2 | sm_100f, sm_101f, sm_120f
    multimem with .acc::f16 and .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4 types | sm_100f, sm_101f
    tensormap.replace | sm_100f, sm_101f, sm_120f
    tcgen05.ld.red | sm_101f, sm_103f
    tcgen05.ld/st/fence/wait/commit/cp/alloc/dealloc/relinquish_alloc_permit | sm_100f, sm_101f
    tcgen05.mma{.ws}{.sp} (except kind::mxf4/kind::mxf4nvf4 for .sp) | sm_100f, sm_101f
    tcgen05 .kind::mxf4nvf4, .kind::mxf4, .kind::mxf8f6f4, .kind::f16, .kind::tf32, .kind::f8f6f4 | sm_100f, sm_101f
    .ashift, .collector_usage modifiers for tcgen05 | sm_100f, sm_101f
    Modifiers .b8x16, .b6x16_p32, .b4x16_p64 | sm_100f, sm_101f, sm_120f
    .block_scale modifier | sm_100f, sm_101f, sm_120f
    mma{.sp} with .e3m2, .e2m3, .e2m1 types and .kind, .block_scale, .scale_vec_size modifiers (except .sp with mxf4/mxf4nvf4) | sm_120f
    .scale_vec::1X/2X/4X modifiers | sm_120f
    .block16/.block32 modifiers (alias to scale_vec) | sm_100f, sm_101f
    .warpx2::02_13, .warpx2::01_23, .warpx4, .pack::16b, .unpack::16b modifiers for tcgen05 | sm_100f, sm_101f
    clusterlaunchcontrol.try_cancel multicast::cluster::all | sm_100f, sm_101f, sm_120f
    .tile::scatter4, .tile::gather4, .im2col::w, .im2col::w::128 | sm_100f, sm_101f
    redux.f32 | sm_100f
    scale-input-d for tcgen05 | sm_100f
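
As an illustration of the three-input form of min and max listed among the new features above, a minimal PTX sketch (register names are illustrative, not from the specification):

```ptx
// Three-input form introduced in PTX ISA 8.8: one instruction
// reduces three operands instead of two.
.reg .s32 %r<4>;
min.s32 %r3, %r0, %r1, %r2;   // %r3 = min(%r0, %r1, %r2)
max.s32 %r3, %r0, %r1, %r2;   // %r3 = max(%r0, %r1, %r2)
```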

Semantic Changes and Clarifications

  • Clarified the behavior of float-to-integer conversions for NaN input.

13.3. Changes in PTX ISA Version 8.7

New Features

PTX ISA version 8.7 introduces the following new features:

  • Adds support for sm_120 target architecture.

  • Adds support for target sm_120a that supports architecture-specific features.

  • Extends tcgen05.mma instruction to add support for .kind::mxf4nvf4 and .scale_vec::4X qualifiers.

  • Extends mma instructions to support .f16 type accumulator and shape .m16n8k16 with FP8 types .e4m3 and .e5m2.

  • Extends cvt instruction to add support for .rs rounding mode and destination types .e2m1x4, .e4m3x4, .e5m2x4, .e3m2x4, .e2m3x4.

  • Extends support for st.async and red.async instructions to add support for .mmio, .release, .global and .scope qualifiers.

  • Extends tensormap.replace instruction to add support for values 13 to 15 for .elemtype qualifier.

  • Extends mma and mma.sp::ordered_metadata instructions to add support for types .e3m2/.e2m3/.e2m1 and qualifiers .kind, .block_scale, .scale_vec_size.

Semantic Changes and Clarifications

  • Clarified that in .tile::gather4, .tile::scatter4 modes, tensor coordinates need to be specified as {col_idx, row_idx0, row_idx1, row_idx2, row_idx3}, i.e. {x, y0, y1, y2, y3}, instead of {x0, x1, x2, x3, y}.

  • Updated Instruction descriptor of tcgen05.mma instruction to clarify the bits that are reserved for future use.

13.4. Changes in PTX ISA Version 8.6

New Features

PTX ISA version 8.6 introduces the following new features:

  • Adds support for sm_100 target architecture.

  • Adds support for target sm_100a that supports architecture-specific features.

  • Adds support for sm_101 target architecture.

  • Adds support for target sm_101a that supports architecture-specific features.

  • Extends cp.async.bulk and cp.async.bulk.tensor instructions to add .shared::cta as destination state space.

  • Extends fence instruction to add support for .acquire and .release qualifiers.

  • Extends fence and fence.proxy instructions to add support for .sync_restrict qualifier.

  • Extends ldmatrix instruction to support .m16n16, .m8n16 shapes and .b8 type.

  • Extends ldmatrix instruction to support .src_fmt, .dst_fmt qualifiers.

  • Extends stmatrix instruction to support .m16n8 shape and .b8 type.

  • Adds support for clusterlaunchcontrol instruction.

  • Extends add, sub and fma instructions to support mixed precision floating point operations with .f32 as destination operand type and .f16/.bf16 as source operand types.

  • Extends add, sub, mul and fma instructions to support .f32x2 type.

  • Extends cvt instruction with .tf32 type to support .satfinite qualifier for .rn/.rz rounding modes.

  • Extends cp.async.bulk instruction to support .cp_mask qualifier and byteMask operand.

  • Extends multimem.ld_reduce and multimem.st instructions to support .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2 and .e4m3x4 types.

  • Extends cvt instruction to support conversions to/from .e2m1x2, .e3m2x2, .e2m3x2 and .ue8m0x2 types.

  • Extends cp.async.bulk.tensor and cp.async.bulk.prefetch.tensor instructions to support new load_mode qualifiers .tile::scatter4 and .tile::gather4.

  • Extends tensormap.replace instruction to add support for new qualifier .swizzle_atomicity for supporting new swizzle modes.

  • Extends mbarrier.arrive, mbarrier.arrive_drop, mbarrier.test_wait and mbarrier.try_wait instructions to support .relaxed qualifier.

  • Extends cp.async.bulk.tensor and cp.async.bulk.prefetch.tensor instructions to support new load_mode qualifiers .im2col::w and .im2col::w::128.

  • Extends cp.async.bulk.tensor instruction to support new qualifier .cta_group.

  • Adds support for st.bulk instruction.

  • Adds support for tcgen05 features and related instructions: tcgen05.alloc, tcgen05.dealloc, tcgen05.relinquish_alloc_permit, tcgen05.ld, tcgen05.st, tcgen05.wait, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.sp, tcgen05.mma.ws, tcgen05.mma.ws.sp, tcgen05.fence and tcgen05.commit.

  • Extends redux.sync instruction to add support for .f32 type with qualifiers .abs and .NaN.

Semantic Changes and Clarifications

None.

13.5. Changes in PTX ISA Version 8.5

New Features

PTX ISA version 8.5 introduces the following new features:

  • Adds support for mma.sp::ordered_metadata instruction.

Semantic Changes and Clarifications

  • Values 0b0000, 0b0101, 0b1010, 0b1111 for sparsity metadata (operand e) of instruction mma.sp are invalid and their usage results in undefined behavior.

13.6. Changes in PTX ISA Version 8.4

New Features

PTX ISA version 8.4 introduces the following new features:

  • Extends ld, st and atom instructions with .b128 type to support .sys scope.

  • Extends integer wgmma.mma_async instruction to support .u8.s8 and .s8.u8 as .atype and .btype respectively.

  • Extends mma, mma.sp instructions to support FP8 types .e4m3 and .e5m2.

Semantic Changes and Clarifications

None.

13.7. Changes in PTX ISA Version 8.3

New Features

PTX ISA version 8.3 introduces the following new features:

  • Adds support for pragma used_bytes_mask that is used to specify the mask for used bytes for a load operation.

  • Extends isspacep, cvta.to, ld and st instructions to accept ::entry and ::func sub-qualifiers with .param state space qualifier.

  • Adds support for .b128 type on instructions ld, ld.global.nc, ldu, st, mov and atom.

  • Adds support for instructions tensormap.replace, tensormap.cp_fenceproxy and support for qualifier .to_proxykind::from_proxykind on instruction fence.proxy to support modifying tensor-map.

Semantic Changes and Clarifications

None.

13.8. Changes in PTX ISA Version 8.2

New Features

PTX ISA version 8.2 introduces the following new features:

  • Adds support for .mmio qualifier on ld and st instructions.

  • Extends lop3 instruction to allow predicate destination.

  • Extends multimem.ld_reduce instruction to support .acc::f32 qualifier to allow .f32 precision of the intermediate accumulation.

  • Extends the asynchronous warpgroup-level matrix multiply-and-accumulate operation wgmma.mma_async to support .sp modifier that allows matrix multiply-accumulate operation when input matrix A is sparse.

Semantic Changes and Clarifications

The .multicast::cluster qualifier on cp.async.bulk and cp.async.bulk.tensor instructions is optimized for target architecture sm_90a and may have substantially reduced performance on other targets; hence .multicast::cluster is advised to be used with sm_90a.

13.9. Changes in PTX ISA Version 8.1

New Features

PTX ISA version 8.1 introduces the following new features:

  • Adds support for st.async and red.async instructions for asynchronous store and asynchronous reduction operations respectively on shared memory.

  • Adds support for .oob modifier on half-precision fma instruction.

  • Adds support for .satfinite saturation modifier on cvt instruction for .f16, .bf16 and .tf32 formats.

  • Extends support for cvt with .e4m3/.e5m2 to sm_89.

  • Extends atom and red instructions to support vector types.

  • Adds support for special register %aggr_smem_size.

  • Extends sured instruction with 64-bit min/max operations.

  • Adds support for increased kernel parameter size of 32764 bytes.

  • Adds support for multimem addresses in the memory consistency model.

  • Adds support for multimem.ld_reduce, multimem.st and multimem.red instructions to perform memory operations on multimem addresses.

Semantic Changes and Clarifications

None.

13.10. Changes in PTX ISA Version 8.0

New Features

PTX ISA version 8.0 introduces the following new features:

  • Adds support for target sm_90a that supports architecture-specific features.

  • Adds support for asynchronous warpgroup-level matrix multiply-and-accumulate operation wgmma.

  • Extends the asynchronous copy operations with bulk operations that operate on large data, including tensor data.

  • Introduces packed integer types .u16x2 and .s16x2.

  • Extends integer arithmetic instruction add to allow packed integer types .u16x2 and .s16x2.

  • Extends integer arithmetic instructions min and max to allow packed integer types .u16x2 and .s16x2, as well as saturation modifier .relu on .s16x2 and .s32 types.

  • Adds support for special register %current_graph_exec that identifies the currently executing CUDA device graph.

  • Adds support for elect.sync instruction.

  • Adds support for .unified attribute on functions and variables.

  • Adds support for setmaxnreg instruction.

  • Adds support for .sem qualifier on barrier.cluster instruction.

  • Extends the fence instruction to allow opcode-specific synchronization using op_restrict qualifier.

  • Adds support for .cluster scope on mbarrier.arrive, mbarrier.arrive_drop, mbarrier.test_wait and mbarrier.try_wait operations.

  • Adds support for transaction count operations on mbarrier objects, specified with .expect_tx and .complete_tx qualifiers.
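
The elect.sync instruction listed above elects one leader lane among the active threads of a warp; a minimal PTX sketch (the shared variable and register names are illustrative):

```ptx
.shared .u32 flag;          // illustrative shared-memory flag
.reg .b32  %laneid;
.reg .pred %p;
// Elect one leader among the active lanes in the mask:
// %p is true only in the elected lane; %laneid receives the
// elected lane's id in all participating lanes.
elect.sync %laneid|%p, 0xffffffff;
@%p st.shared.u32 [flag], 1;   // only the leader performs the store
```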

Semantic Changes and Clarifications

None.

13.11. Changes in PTX ISA Version 7.8

New Features

PTX ISA version 7.8 introduces the following new features:

  • Adds support for sm_89 target architecture.

  • Adds support for sm_90 target architecture.

  • Extends bar and barrier instructions to accept optional scope qualifier .cta.

  • Extends .shared state space qualifier with optional sub-qualifier ::cta.

  • Adds support for movmatrix instruction which transposes a matrix in registers across a warp.

  • Adds support for stmatrix instruction which stores one or more matrices to shared memory.

  • Extends the .f64 floating point type mma operation with shapes .m16n8k4, .m16n8k8, and .m16n8k16.

  • Extends add, sub, mul, set, setp, cvt, tanh, ex2, atom and red instructions with bf16 alternate floating point data format.

  • Adds support for new alternate floating-point data formats .e4m3 and .e5m2.

  • Extends cvt instruction to convert .e4m3 and .e5m2 alternate floating point data formats.

  • Adds support for griddepcontrol instruction as a communication mechanism to control the execution of dependent grids.

  • Extends mbarrier instruction to allow a new phase completion check operation try_wait.

  • Adds support for new thread scope .cluster which is a set of Cooperative Thread Arrays (CTAs).

  • Extends fence/membar, ld, st, atom, and red instructions to accept .cluster scope.

  • Adds support for extended visibility of shared state space to all threads within a cluster.

  • Extends .shared state space qualifier with ::cluster sub-qualifier for cluster-level visibility of shared memory.

  • Extends isspacep, cvta, ld, st, atom, and red instructions to accept ::cluster sub-qualifier with .shared state space qualifier.

  • Adds support for mapa instruction to map a shared memory address to the corresponding address in a different CTA within the cluster.

  • Adds support for getctarank instruction to query the rank of the CTA that contains a given address.

  • Adds support for new barrier synchronization instruction barrier.cluster.

  • Extends the memory consistency model to include the new cluster scope.

  • Adds support for special registers related to cluster information: %is_explicit_cluster, %clusterid, %nclusterid, %cluster_ctaid, %cluster_nctaid, %cluster_ctarank, %cluster_nctarank.

  • Adds support for cluster dimension directives .reqnctapercluster, .explicitcluster, and .maxclusterrank.

Semantic Changes and Clarifications

None.

13.12. Changes in PTX ISA Version 7.7

New Features

PTX ISA version 7.7 introduces the following new features:

  • Extends isspacep and cvta instructions to include the .param state space for kernel function parameters.

Semantic Changes and Clarifications

None.

13.13. Changes in PTX ISA Version 7.6

New Features

PTX ISA version 7.6 introduces the following new features:

  • Support for szext instruction which performs sign-extension or zero-extension on a specified value.

  • Support for bmsk instruction which creates a bitmask of the specified width starting at the specified bit position.

  • Support for special registers %reserved_smem_offset_begin, %reserved_smem_offset_end, %reserved_smem_offset_cap, %reserved_smem_offset<2>.
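
A minimal PTX sketch of the szext and bmsk instructions described above (register names and constants are illustrative):

```ptx
.reg .b32 %r<4>;
// Sign-extend the low 8 bits of %r0: bits 31..8 are replaced
// by the sign bit at position 7.
szext.clamp.s32 %r1, %r0, 8;
// Zero-extend the low 8 bits of %r0: bits 31..8 are cleared.
szext.clamp.u32 %r2, %r0, 8;
// Build a mask of width 4 starting at bit position 8,
// i.e. %r3 = 0x00000F00.
bmsk.clamp.b32 %r3, 8, 4;
```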

Semantic Changes and Clarifications

None.

13.14. Changes in PTX ISA Version 7.5

New Features

PTX ISA version 7.5 introduces the following new features:

  • Debug information enhancements to support label difference and negative values in the .section debugging directive.

  • Support for ignore-src operand on cp.async instruction.

  • Extensions to the memory consistency model to introduce the following new concepts:

    • A memory proxy as an abstract label for different methods of memory access.

    • Virtual aliases as distinct memory addresses accessing the same physical memory location.

  • Support for new fence.proxy and membar.proxy instructions to allow synchronization of memory accesses performed via virtual aliases.

Semantic Changes and Clarifications

None.

13.15. Changes in PTX ISA Version 7.4

New Features

PTX ISA version 7.4 introduces the following new features:

  • Support for sm_87 target architecture.

  • Support for .level::eviction_priority qualifier which allows specifying cache eviction priority hints on ld, ld.global.nc, st, and prefetch instructions.

  • Support for .level::prefetch_size qualifier which allows specifying data prefetch hints on ld and cp.async instructions.

  • Support for createpolicy instruction which allows construction of different types of cache eviction policies.

  • Support for .level::cache_hint qualifier which allows the use of cache eviction policies with ld, ld.global.nc, st, atom, red and cp.async instructions.

  • Support for applypriority and discard operations on cached data.
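
A minimal PTX sketch combining the createpolicy and .level::cache_hint features above (the fraction and register names are illustrative):

```ptx
.reg .b64 %policy, %rd0;
.reg .f32 %f0;
// Create a fractional L2 eviction policy: evict-last priority
// applied to 100% of the accessed data.
createpolicy.fractional.L2::evict_last.b64 %policy, 1.0;
// Load using that policy as a cache hint.
ld.global.L2::cache_hint.f32 %f0, [%rd0], %policy;
```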

Semantic Changes and Clarifications

None.

13.16. Changes in PTX ISA Version 7.3

New Features

PTX ISA version 7.3 introduces the following new features:

  • Extends mask() operator used in initializers to also support integer constant expressions.

  • Adds support for stack manipulation instructions that allow manipulating the stack using stacksave and stackrestore instructions and allocation of per-thread stack using alloca instruction.

Semantic Changes and Clarifications

The unimplemented version of alloca from the older PTX ISA specification has been replaced with new stack manipulation instructions in PTX ISA version 7.3.

13.17. Changes in PTX ISA Version 7.2

New Features

PTX ISA version 7.2 introduces the following new features:

  • Enhances .loc directive to represent inline function information.

  • Adds support to define labels inside the debug sections.

  • Extends min and max instructions to support .xorsign and .abs modifiers.

Semantic Changes and Clarifications

None.

13.18. Changes in PTX ISA Version 7.1

New Features

PTX ISA version 7.1 introduces the following new features:

  • Support for sm_86 target architecture.

  • Adds a new operator, mask(), to extract a specific byte from a variable's address used in initializers.

  • Extends tex and tld4 instructions to return an optional predicate that indicates whether data at the specified coordinates is resident in memory.

  • Extends single-bit wmma and mma instructions to support .and operation.

  • Extends mma instruction to support .sp modifier that allows matrix multiply-accumulate operation when input matrix A is sparse.

  • Extends mbarrier.test_wait instruction to test the completion of a specific phase parity.

Semantic Changes and Clarifications

None.

13.19. Changes in PTX ISA Version 7.0

New Features

PTX ISA version 7.0 introduces the following new features:

  • Support for sm_80 target architecture.

  • Adds support for asynchronous copy instructions that allow copying of data asynchronously from one state space to another.

  • Adds support for mbarrier instructions that allow creation of mbarrier objects in memory and use of these objects to synchronize threads and asynchronous copy operations initiated by threads.

  • Adds support for redux.sync instruction which allows reduction operation across threads in a warp.

  • Adds support for new alternate floating-point data formats .bf16 and .tf32.

  • Extends wmma instruction to support .f64 type with shape .m8n8k4.

  • Extends wmma instruction to support .bf16 data format.

  • Extends wmma instruction to support .tf32 data format with shape .m16n16k8.

  • Extends mma instruction to support .f64 type with shape .m8n8k4.

  • Extends mma instruction to support .bf16 and .tf32 data formats with shape .m16n8k8.

  • Extends mma instruction to support new shapes .m8n8k128, .m16n8k4, .m16n8k16, .m16n8k32, .m16n8k64, .m16n8k128 and .m16n8k256.

  • Extends abs and neg instructions to support .bf16 and .bf16x2 data formats.

  • Extends min and max instructions to support .NaN modifier and .f16, .f16x2, .bf16 and .bf16x2 data formats.

  • Extends fma instruction to support .relu saturation mode and .bf16 and .bf16x2 data formats.

  • Extends cvt instruction to support .relu saturation mode and .f16, .f16x2, .bf16, .bf16x2 and .tf32 destination formats.

  • Adds support for tanh instruction that computes hyperbolic-tangent.

  • Extends ex2 instruction to support .f16 and .f16x2 types.
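
The redux.sync instruction introduced above performs a reduction across the lanes of a warp; a minimal PTX sketch (register names and mask are illustrative):

```ptx
.reg .b32 %r<2>;
// Sum the value in %r0 across all 32 lanes of the warp;
// every participating lane receives the total in %r1.
redux.sync.add.s32 %r1, %r0, 0xffffffff;
```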

Semantic Changes and Clarifications

None.

13.20. Changes in PTX ISA Version 6.5

New Features

PTX ISA version 6.5 introduces the following new features:

  • Adds support for integer destination types for half precision comparison instruction set.

  • Extends abs instruction to support .f16 and .f16x2 types.

  • Adds support for cvt.pack instruction which allows converting two integer values and packing the results together.

  • Adds new shapes .m16n8k8, .m8n8k16 and .m8n8k32 on the mma instruction.

  • Adds support for ldmatrix instruction which loads one or more matrices from shared memory for mma instruction.

Removed Features

PTX ISA version 6.5 removes the following features:

  • Support for .satfinite qualifier on floating point wmma.mma instruction has been removed. This support was deprecated since PTX ISA version 6.4.

Semantic Changes and Clarifications

None.

13.21. Changes in PTX ISA Version 6.4

New Features

PTX ISA version 6.4 introduces the following new features:

  • Adds support for .noreturn directive which can be used to indicate a function does not return to its caller function.

  • Adds support for mma instruction which allows performing matrix multiply-and-accumulate operation.

Deprecated Features

PTX ISA version 6.4 deprecates the following features:

  • Support for .satfinite qualifier on floating point wmma.mma instruction.

Removed Features

PTX ISA version 6.4 removes the following features:

  • Support for shfl and vote instructions without the .sync qualifier has been removed for .target sm_70 and higher. This support was deprecated since PTX ISA version 6.0 as documented in PTX ISA version 6.2.

Semantic Changes and Clarifications

  • Clarified that resolving references of a .weak symbol considers only .weak or .visible symbols with the same name and does not consider local symbols with the same name.

  • Clarified that in cvt instruction, modifier .ftz can only be specified when either .atype or .dtype is .f32.

13.22. Changes in PTX ISA Version 6.3

New Features

PTX ISA version 6.3 introduces the following new features:

  • Support for sm_75 target architecture.

  • Adds support for a new instruction nanosleep that suspends a thread for a specified duration.

  • Adds support for .alias directive which allows defining an alias to a function symbol.

  • Extends atom instruction to perform .f16 addition operation and .cas.b16 operation.

  • Extends red instruction to perform .f16 addition operation.

  • The wmma instructions are extended to support multiplicand matrices of type .s8, .u8, .s4, .u4, .b1 and accumulator matrices of type .s32.

Semantic Changes and Clarifications

  • Introduced the mandatory .aligned qualifier for all wmma instructions.

  • Specified the alignment required for the base address and stride parameters passed to wmma.load and wmma.store.

  • Clarified that the layout of the fragment returned by a wmma operation is architecture dependent and passing wmma fragments around functions compiled for different link-compatible SM architectures may not work as expected.

  • Clarified that atomicity for {atom/red}.f16x2 operations is guaranteed separately for each of the two .f16 elements but is not guaranteed for the operation as a single 32-bit access.

13.23. Changes in PTX ISA Version 6.2

New Features

PTX ISA version 6.2 introduces the following new features:

  • A new instruction activemask for querying active threads in a warp.

  • Extends atomic and reduction instructions to perform .f16x2 addition operation with mandatory .noftz qualifier.
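
A minimal PTX sketch of the activemask instruction added above (register name is illustrative):

```ptx
.reg .b32 %mask;
// Bit i of %mask is set if lane i of the warp is currently active.
activemask.b32 %mask;
```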

Deprecated Features

PTX ISA version 6.2 deprecates the following features:

  • The use of shfl and vote instructions without the .sync qualifier is deprecated retrospectively from PTX ISA version 6.0, which introduced the sm_70 architecture that implements Independent Thread Scheduling.

Semantic Changes and Clarifications

  • Clarified that wmma instructions can be used in conditionally executed code only if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.

  • In the memory consistency model, the definition of morally strong operations was updated to exclude fences from the requirement of complete overlap since fences do not access memory.

13.24. Changes in PTX ISA Version 6.1

New Features

PTX ISA version 6.1 introduces the following new features:

  • Support for sm_72 target architecture.

  • Support for new matrix shapes 32x8x16 and 8x32x16 in wmma instruction.

Semantic Changes and Clarifications

None.

13.25. Changes in PTX ISA Version 6.0

New Features

PTX ISA version 6.0 introduces the following new features:

  • Support for sm_70 target architecture.

  • Specifies the memory consistency model for programs running on sm_70 and later architectures.

  • Various extensions to memory instructions to specify memory synchronization semantics and scopes at which such synchronization can be observed.

  • New instruction wmma for matrix operations which allows loading matrices from memory, performing multiply-and-accumulate on them and storing the result in memory.

  • Support for new barrier instruction.

  • Extends neg instruction to support .f16 and .f16x2 types.

  • A new instruction fns which allows finding the n-th set bit in an integer.

  • A new instruction bar.warp.sync which allows synchronizing threads in a warp.

  • Extends vote and shfl instructions with .sync modifier which waits for specified threads before executing the vote and shfl operation respectively.

  • A new instruction match.sync which allows broadcasting and comparing a value across threads in a warp.

  • A new instruction brx.idx which allows branching to a label indexed from a list of potential targets.

  • Support for unsized array parameters for .func which can be used to implement variadic functions.

  • Support for .b16 integer type in dwarf-lines.

  • Support for taking the address of device function return parameters using mov instruction.
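
Among the instructions above, match.sync can be sketched as follows (register names and mask are illustrative):

```ptx
.reg .b32  %r<2>;
.reg .pred %p;
// %r1 receives a mask of lanes whose %r0 equals this lane's %r0.
match.any.sync.b32 %r1, %r0, 0xffffffff;
// .all variant: %p is true only if all participating lanes hold
// the same value in %r0.
match.all.sync.b32 %r1|%p, %r0, 0xffffffff;
```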

Semantic Changes and Clarifications

  • Semantics of bar instruction were updated to indicate that the executing thread waits for other non-exited threads from its warp.

  • Support for indirect branch, introduced in PTX 2.1 but never implemented, has been removed from the spec.

  • Support for taking the address of labels and using labels in initializers, which was unimplemented, has been removed from the spec.

  • Support for variadic functions, which was unimplemented, has been removed from the spec.

13.26. Changes in PTX ISA Version 5.0

New Features

PTX ISA version 5.0 introduces the following new features:

  • Support for sm_60, sm_61, sm_62 target architectures.

  • Extends atomic and reduction instructions to perform double-precision add operation.

  • Extends atomic and reduction instructions to specify scope modifier.

  • A new .common directive to permit linking multiple object files containing declarations of the same symbol with different sizes.

  • A new dp4a instruction which allows 4-way dot product with accumulate operation.

  • A new dp2a instruction which allows 2-way dot product with accumulate operation.

  • Support for special register %clock_hi.
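
The dp4a instruction above computes a 4-way byte dot product with accumulation; a minimal PTX sketch (register names are illustrative):

```ptx
.reg .u32 %r<4>;
// Treat %r0 and %r1 as four packed unsigned bytes each:
// %r3 = %r2 + sum over i of (byte_i(%r0) * byte_i(%r1))
dp4a.u32.u32 %r3, %r0, %r1, %r2;
```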

Semantic Changes and Clarifications

Semantics of cache modifiers on ld and st instructions were clarified to reflect that cache operations are treated as performance hints only and do not change the memory consistency behavior of the program.

Semantics of volatile operations on ld and st instructions were clarified to reflect how volatile operations are handled by the optimizing compiler.

13.27. Changes in PTX ISA Version 4.3

New Features

PTX ISA version 4.3 introduces the following new features:

  • A new lop3 instruction which allows arbitrary logical operation on 3 inputs.

  • Adds support for 64-bit computations in extended precision arithmetic instructions.

  • Extends tex.grad instruction to support cube and acube geometries.

  • Extends tld4 instruction to support a2d, cube and acube geometries.

  • Extends tex and tld4 instructions to support optional operands for offset vector and depth compare.

  • Extends txq instruction to support querying texture fields from a specific LOD.
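
The lop3 instruction above takes an 8-bit immediate lookup table that encodes the desired ternary function; the table value is obtained by applying the function to the constants 0xF0, 0xCC, and 0xAA. A minimal PTX sketch (register names are illustrative):

```ptx
.reg .b32 %r<4>;
// d = (a & b) | c  ->  immLut = (0xF0 & 0xCC) | 0xAA = 0xEA
lop3.b32 %r3, %r0, %r1, %r2, 0xEA;
```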

Semantic Changes and Clarifications

None.

13.28.Changes in PTX ISA Version 4.2

New Features

PTX ISA version 4.2 introduces the following new features:

  • Support forsm_53 target architecture.

  • Support for arithmetic, comparsion and texture instructions for.f16 and.f16x2 types.

  • Support formemory_layout field for surfaces andsuq instruction support for querying thisfield.

Semantic Changes and Clarifications

Semantics for parameter passing under the ABI were updated to indicate that the ld.param and st.param instructions used for argument passing cannot be predicated.

Semantics of {atom/red}.add.f32 were updated to indicate that subnormal inputs and results are flushed to sign-preserving zero for atomic operations on global memory, whereas atomic operations on shared memory preserve subnormal inputs and results and do not flush them to zero.

13.29. Changes in PTX ISA Version 4.1

New Features

PTX ISA version 4.1 introduces the following new features:

  • Support for the sm_37 and sm_52 target architectures.

  • Support for the new fields array_size, num_mipmap_levels, and num_samples for textures, and txq instruction support for querying these fields.

  • Support for the new field array_size for surfaces, and suq instruction support for querying this field.

  • Support for the special registers %total_smem_size and %dynamic_smem_size.

Semantic Changes and Clarifications

None.

13.30. Changes in PTX ISA Version 4.0

New Features

PTX ISA version 4.0 introduces the following new features:

  • Support for the sm_32 and sm_50 target architectures.

  • Support for the 64-bit performance counter special registers %pm0_64..%pm7_64.

  • A new istypep instruction.

  • A new instruction, rsqrt.approx.ftz.f64, has been added to compute a fast approximation of the reciprocal square root of a value.

  • Support for a new directive, .attribute, for specifying special attributes of a variable.

  • Support for the .managed variable attribute.

Semantic Changes and Clarifications

The vote instruction semantics were updated to clearly indicate that an inactive thread in a warp contributes a 0 for its entry when participating in vote.ballot.b32.

13.31. Changes in PTX ISA Version 3.2

New Features

PTX ISA version 3.2 introduces the following new features:

  • The texture instruction supports reads from multi-sample and multi-sample array textures.

  • Extends the .section debugging directive to include label + immediate expressions.

  • Extends the .file directive to include timestamp and file size information.

Semantic Changes and Clarifications

The vavrg2 and vavrg4 instruction semantics were updated to indicate that the instruction adds 1 only if Va[i] + Vb[i] is non-negative, and that the addition result is shifted right by 1 (rather than being divided by 2).
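Per 16-bit (or 8-bit) lane, the clarified behavior can be sketched in Python; the lane function name is illustrative:

```python
def vavrg2_lane(a, b):
    """Sketch of one signed 16-bit lane of vavrg2: add 1 only when the
    sum is non-negative, then arithmetic-shift right by 1."""
    s = a + b
    if s >= 0:
        s += 1
    return s >> 1  # arithmetic shift, not division: rounds toward -inf
```

For a negative sum, the arithmetic shift rounds toward negative infinity, whereas dividing by 2 would round toward zero; that distinction is exactly what the clarification captures.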

13.32. Changes in PTX ISA Version 3.1

New Features

PTX ISA version 3.1 introduces the following new features:

  • Support for the sm_35 target architecture.

  • Support for CUDA Dynamic Parallelism, which enables a kernel to create and synchronize new work.

  • ld.global.nc for loading read-only global data through the non-coherent texture cache.

  • A new funnel shift instruction, shf.

  • Extends atomic and reduction instructions to perform 64-bit {and, or, xor} operations, and 64-bit integer {min, max} operations.

  • Adds support for mipmaps.

  • Adds support for indirect access to textures and surfaces.

  • Extends support for generic addressing to include the .const state space, and adds a new operator, generic(), to form a generic address for .global or .const variables used in initializers.

  • A new .weak directive to permit linking multiple object files containing declarations of the same symbol.
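A funnel shift shifts a 64-bit value formed from two 32-bit registers and returns one 32-bit half; a common use is a 32-bit rotate. A hedged Python sketch of shf.l in wrap mode (the function name is illustrative):

```python
def shf_l_wrap(a, b, c):
    """Sketch of shf.l.wrap.b32: left-shift the 64-bit value {b, a}
    (b is the upper half) by c mod 32 and return the upper 32 bits."""
    n = c & 31
    v = ((b & 0xFFFFFFFF) << 32) | (a & 0xFFFFFFFF)
    return ((v << n) >> 32) & 0xFFFFFFFF
```

Passing the same value for both a and b makes the left funnel shift a rotate-left.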

Semantic Changes and Clarifications

PTX 3.1 redefines the default addressing for global variables in initializers, from generic addresses to offsets in the global state space. Legacy PTX code is treated as having an implicit generic() operator for each global variable used in an initializer. PTX 3.1 code should either include explicit generic() operators in initializers, use cvta.global to form generic addresses at runtime, or load from the non-generic address using ld.global.

Instruction mad.f32 requires a rounding modifier for sm_20 and higher targets. However, for PTX ISA version 3.0 and earlier, ptxas does not enforce this requirement and mad.f32 silently defaults to mad.rn.f32. For PTX ISA version 3.1, ptxas generates a warning and defaults to mad.rn.f32, and in subsequent releases ptxas will enforce the requirement for PTX ISA version 3.2 and later.

13.33. Changes in PTX ISA Version 3.0

New Features

PTX ISA version 3.0 introduces the following new features:

  • Support for the sm_30 target architecture.

  • SIMD video instructions.

  • A new warp shuffle instruction.

  • Instructions mad.cc and madc for efficient, extended-precision integer multiplication.

  • Surface instructions with 3D and array geometries.

  • The texture instruction supports reads from cubemap and cubemap array textures.

  • Platform option .target debug to declare that a PTX module contains DWARF debug information.

  • pmevent.mask, for triggering multiple performance monitor events.

  • Performance monitor counter special registers %pm4..%pm7.

Semantic Changes and Clarifications

Special register %gridid has been extended from 32 bits to 64 bits.

PTX ISA version 3.0 deprecates module-scoped .reg and .local variables when compiling to the Application Binary Interface (ABI). When compiling without use of the ABI, module-scoped .reg and .local variables are supported as before. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg or .local variables, the compiler silently disables use of the ABI.

The shfl instruction semantics were updated to clearly indicate that the value of source operand a is unpredictable for inactive and predicated-off threads within the warp.

PTX modules no longer allow duplicate .version directives. This feature was unimplemented, so there is no semantic change.

Unimplemented instructions suld.p and sust.p.{u32,s32,f32} have been removed.

13.34. Changes in PTX ISA Version 2.3

New Features

PTX 2.3 adds support for texture arrays. The texture array feature supports access to an array of 1D or 2D textures, where an integer indexes into the array of textures, and then one or two single-precision floating-point coordinates are used to address within the selected 1D or 2D texture.

PTX 2.3 adds a new directive, .address_size, for specifying the size of addresses.

Variables in the .const and .global state spaces are initialized to zero by default.

Semantic Changes and Clarifications

The semantics of the .maxntid directive have been updated to match the current implementation. Specifically, .maxntid only guarantees that the total number of threads in a thread block does not exceed the maximum. Previously, the semantics indicated that the maximum was enforced separately in each dimension, which is not the case.

Bit field extract and insert instructions BFE and BFI now indicate that the len and pos operands are restricted to the value range 0..255.
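Under this restriction only the low 8 bits of the pos and len operands are significant. A hedged Python sketch of unsigned extraction under that rule (the function name is illustrative):

```python
def bfe_u32(a, b, c):
    """Sketch of bfe.u32: only the low 8 bits of b (pos) and c (len)
    are significant, per the 0..255 restriction."""
    pos, length = b & 0xFF, c & 0xFF
    if length == 0:
        return 0
    return (a >> pos) & ((1 << length) - 1)
```

Note that a pos operand of 0x108 behaves the same as 8, since the upper bits are ignored.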

Unimplemented instructions {atom,red}.{min,max}.f32 have been removed.

13.35. Changes in PTX ISA Version 2.2

New Features

PTX 2.2 adds new directives for specifying kernel parameter attributes; specifically, there are new directives for specifying that a kernel parameter is a pointer, for specifying to which state space the parameter points, and for optionally specifying the alignment of the memory to which the parameter points.

PTX 2.2 adds a new field named force_unnormalized_coords to the .samplerref opaque type. This field is used in the independent texturing mode to override the normalized_coords field in the texture header. This field is needed to support languages such as OpenCL, which represent the property of normalized/unnormalized coordinates in the sampler header rather than in the texture header.

PTX 2.2 deprecates explicit constant banks and supports a large, flat address space for the .const state space. Legacy PTX that uses explicit constant banks is still supported.

PTX 2.2 adds a new tld4 instruction for loading a component (r, g, b, or a) from the four texels comprising the bilinear interpolation footprint of a given texture location. This instruction may be used to compute higher-precision bilerp results in software, or for performing higher-bandwidth texture loads.

Semantic Changes and Clarifications

None.

13.36. Changes in PTX ISA Version 2.1

New Features

The underlying, stack-based ABI is supported in PTX ISA version 2.1 for sm_2x targets.

Support for indirect calls has been implemented for sm_2x targets.

New directives, .branchtargets and .calltargets, have been added for specifying potential targets for indirect branches and indirect function calls. A .callprototype directive has been added for declaring the type signatures for indirect function calls.

The names of .global and .const variables can now be specified in variable initializers to represent their addresses.

A set of thirty-two driver-specific execution environment special registers has been added. These are named %envreg0..%envreg31.

Textures and surfaces have new fields for channel data type and channel order, and the txq and suq instructions support queries for these fields.

Directive .minnctapersm has replaced the .maxnctapersm directive.

Directive .reqntid has been added to allow specification of exact CTA dimensions.

A new instruction, rcp.approx.ftz.f64, has been added to compute a fast, gross approximate reciprocal.

Semantic Changes and Clarifications

A warning is emitted if .minnctapersm is specified without also specifying .maxntid.

13.37. Changes in PTX ISA Version 2.0

New Features

Floating Point Extensions

This section describes the floating-point changes in PTX ISA version 2.0 for sm_20 targets. The goal is to achieve IEEE 754 compliance wherever possible, while maximizing backward compatibility with legacy PTX ISA version 1.x code and sm_1x targets.

The changes from PTX ISA version 1.x are as follows:

  • Single-precision instructions support subnormal numbers by default for sm_20 targets. The .ftz modifier may be used to enforce backward compatibility with sm_1x.

  • Single-precision add, sub, and mul now support .rm and .rp rounding modifiers for sm_20 targets.

  • A single-precision fused multiply-add (fma) instruction has been added, with support for IEEE 754 compliant rounding modifiers and support for subnormal numbers. The fma.f32 instruction also supports .ftz and .sat modifiers. fma.f32 requires sm_20. The mad.f32 instruction has been extended with rounding modifiers so that it is synonymous with fma.f32 for sm_20 targets. Both fma.f32 and mad.f32 require a rounding modifier for sm_20 targets.

  • The mad.f32 instruction without rounding is retained so that compilers can generate code for sm_1x targets. When code compiled for sm_1x is executed on sm_20 devices, mad.f32 maps to fma.rn.f32.

  • Single- and double-precision div, rcp, and sqrt with IEEE 754 compliant rounding have been added. These are indicated by the use of a rounding modifier and require sm_20.

  • Instructions testp and copysign have been added.
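The difference between the unfused multiply-add of sm_1x and the fused fma can be illustrated numerically. Below is a hedged Python sketch that simulates binary32 rounding; the values are chosen so the double-precision products are exact, making a single final rounding an accurate model of the fused result:

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Values chosen so that the double-precision products below are exact.
a = b = f32(1 + 2 ** -12)
c = f32(-(1 + 2 ** -11))

# Unfused model (mad.f32 without rounding): the product is rounded to
# single precision before the add, so the low-order term is lost.
unfused = f32(f32(a * b) + c)

# Fused model (fma.rn.f32): a * b + c with a single final rounding.
fused = f32(a * b + c)
```

Here the unfused result is exactly 0.0, while the fused result recovers the low-order term 2^-24 that the intermediate rounding discards.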

New Instructions

A load uniform instruction, ldu, has been added.

Surface instructions support additional clamp modifiers, .clamp and .zero.

Instruction sust now supports formatted surface stores.

A count leading zeros instruction, clz, has been added.

A find leading non-sign bit instruction, bfind, has been added.

A bit reversal instruction, brev, has been added.

Bit field extract and insert instructions, bfe and bfi, have been added.

A population count instruction, popc, has been added.
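Pure-Python equivalents of several of these bit instructions' semantics (the function names are illustrative, not PTX syntax):

```python
def popc_b32(x):
    """Sketch of popc.b32: count the set bits of a 32-bit value."""
    return bin(x & 0xFFFFFFFF).count('1')

def clz_b32(x):
    """Sketch of clz.b32: leading zeros of a 32-bit value (32 for x == 0)."""
    return 32 - (x & 0xFFFFFFFF).bit_length()

def brev_b32(x):
    """Sketch of brev.b32: reverse the 32 bits of x."""
    return int(format(x & 0xFFFFFFFF, '032b')[::-1], 2)
```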

A vote ballot instruction, vote.ballot.b32, has been added.

Instructions {atom,red}.add.f32 have been implemented.

Instructions {atom,red}.shared have been extended to handle 64-bit data types for sm_20 targets.

A system-level membar instruction, membar.sys, has been added.

The bar instruction has been extended as follows:

  • A bar.arrive instruction has been added.

  • Instructions bar.red.popc.u32 and bar.red.{and,or}.pred have been added.

  • bar now supports optional thread count and register operands.

Scalar video instructions (including prmt) have been added.

Instruction isspacep, for querying whether a generic address falls within a specified state space window, has been added.

Instruction cvta, for converting global, local, and shared addresses to generic addresses and vice versa, has been added.

Other New Features

Instructions ld, ldu, st, prefetch, prefetchu, isspacep, cvta, atom, and red now support generic addressing.

New special registers %nwarpid, %nsmid, %clock64, and %lanemask_{eq,le,lt,ge,gt} have been added.

Cache operations have been added to instructions ld, st, suld, and sust, e.g., for prefetching to a specified level of the memory hierarchy. Instructions prefetch and prefetchu have also been added.

The .maxnctapersm directive was deprecated and replaced with .minnctapersm to better match its behavior and usage.

A new directive, .section, has been added to replace the @@DWARF syntax for passing DWARF-format debugging information through PTX.

A new directive, .pragma nounroll, has been added to allow users to disable loop unrolling.

Semantic Changes and Clarifications

The erratum in cvt.ftz for PTX ISA versions 1.4 and earlier, where single-precision subnormal inputs and results were not flushed to zero if either the source or destination type size was 64 bits, has been fixed. In PTX ISA version 1.5 and later, cvt.ftz (and cvt for .target sm_1x, where .ftz is implied) instructions flush single-precision subnormal inputs and results to sign-preserving zero for all combinations of floating-point instruction types. To maintain compatibility with legacy PTX code, if .version is 1.4 or earlier, single-precision subnormal inputs and results are flushed to sign-preserving zero only when neither the source nor destination type size is 64 bits.
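The flush-to-zero behavior described above can be sketched in Python by inspecting the binary32 encoding; the function name is illustrative, and the real cvt.ftz applies this as part of the conversion:

```python
import struct

def ftz_f32(x):
    """Sketch of .ftz: flush a binary32 subnormal to sign-preserving zero."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 and mantissa != 0:        # subnormal input
        return -0.0 if bits >> 31 else 0.0     # keep the sign bit
    return struct.unpack('<f', struct.pack('<I', bits))[0]
```

A subnormal such as 1e-45 flushes to +0.0, its negation flushes to -0.0, and normal values pass through unchanged.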

Components of special registers %tid, %ntid, %ctaid, and %nctaid have been extended from 16 bits to 32 bits. These registers now have type .v4.u32.

The number of samplers available in independent texturing mode was incorrectly listed as thirty-two in PTX ISA version 1.5; the correct number is sixteen.

14. Notices

14.1. Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

14.2. OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

14.3. Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.