This document describes PTX, a low-level parallel thread execution virtual machine and instruction set architecture (ISA). PTX exposes the GPU as a data-parallel computing device.
Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable GPU has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth. The GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.
PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.
PTX provides a stable programming model and instruction set for general purpose parallel programming. It is designed to be efficient on NVIDIA GPUs supporting the computation features defined by the NVIDIA Tesla architecture. High level language compilers for languages such as CUDA and C/C++ generate PTX instructions, which are optimized for and translated to native target-architecture instructions.
The goals for PTX include the following:
Provide a stable ISA that spans multiple GPU generations.
Achieve performance in compiled applications comparable to native GPU performance.
Provide a machine-independent ISA for C/C++ and other compilers to target.
Provide a code distribution ISA for application and middleware developers.
Provide a common source-level ISA for optimizing code generators and translators, which map PTX to specific target machines.
Facilitate hand-coding of libraries, performance kernels, and architecture tests.
Provide a scalable programming model that spans GPU sizes from a single unit to many parallel units.
PTX ISA version 9.0 introduces the following new features:
Adds support for sm_88 target architecture.
Adds support for sm_110 target architecture.
Adds support for target sm_110f that supports family-specific features.
Adds support for target sm_110a that supports architecture-specific features.
Adds support for pragma enable_smem_spilling that is used to enable shared memory spilling for a function.
Adds support for pragma frequency that is used to specify the execution frequency of a basic block.
Adds support for directive .blocksareclusters that is used to specify that CUDA thread blocks are mapped to clusters.
Extends size operand of st.bulk instruction to support 32-bit length.
Adds support for performance-tuning directives .abi_preserve and .abi_preserve_control that are used to specify the number of data and control registers that should be preserved by the callers of a function.
The GPU is a compute device capable of executing a very large number of threads in parallel. It operates as a coprocessor to the main CPU, or host: In other words, data-parallel, compute-intensive portions of applications running on the host are off-loaded onto the device.
More precisely, a portion of an application that is executed many times, but independently on different data, can be isolated into a kernel function that is executed on the GPU as many different threads. To that effect, such a function is compiled to the PTX instruction set and the resulting kernel is translated at install time to the target GPU instruction set.
The batch of threads that executes a kernel is organized as a grid. A grid consists of either cooperative thread arrays or clusters of cooperative thread arrays as described in this section and illustrated in Figure 1 and Figure 2. Cooperative thread arrays (CTAs) implement CUDA thread blocks and clusters implement CUDA thread block clusters.
The Parallel Thread Execution (PTX) programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array, or CTA, is an array of threads that execute a kernel concurrently or in parallel.
Threads within a CTA can communicate with each other. To coordinate the communication of the threads within the CTA, one can specify synchronization points where threads wait until all threads in the CTA have arrived.
Each thread has a unique thread identifier within the CTA. Programs use a data parallel decomposition to partition inputs, work, and results across the threads of the CTA. Each CTA thread uses its thread identifier to determine its assigned role, assign specific input and output positions, compute addresses, and select work to perform. The thread identifier is a three-element vector tid (with elements tid.x, tid.y, and tid.z) that specifies the thread’s position within a 1D, 2D, or 3D CTA. Each thread identifier component ranges from zero up to, but not including, the number of threads in that CTA dimension.
Each CTA has a 1D, 2D, or 3D shape specified by a three-element vector ntid (with elements ntid.x, ntid.y, and ntid.z). The vector ntid specifies the number of threads in each CTA dimension.
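The relationship between tid and ntid can be illustrated with a small host-side model. The following Python sketch is not PTX and is not part of the specification: the Dim3 type and both function names are hypothetical, and the x-fastest linearization is the convention commonly used in CUDA code, assumed here for illustration.

```python
from typing import NamedTuple


class Dim3(NamedTuple):
    """Hypothetical three-element vector mirroring tid / ntid."""
    x: int
    y: int
    z: int


def threads_per_cta(ntid: Dim3) -> int:
    """Total number of threads in a CTA of shape ntid."""
    return ntid.x * ntid.y * ntid.z


def flat_thread_id(tid: Dim3, ntid: Dim3) -> int:
    """Linearize tid with x varying fastest (assumed convention)."""
    # Each component must satisfy 0 <= tid.c < ntid.c.
    for t, n in zip(tid, ntid):
        if not 0 <= t < n:
            raise ValueError(f"tid component {t} outside [0, {n})")
    return tid.x + tid.y * ntid.x + tid.z * ntid.x * ntid.y
```

For an 8 x 4 x 2 CTA, the first thread maps to 0 and the last thread, tid = (7, 3, 1), maps to 63.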
Threads within a CTA execute in SIMT (single-instruction, multiple-thread) fashion in groups called warps. A warp is a maximal subset of threads from a single CTA, such that the threads execute the same instructions at the same time. Threads within a warp are sequentially numbered. The warp size is a machine-dependent constant. Typically, a warp has 32 threads. Some applications may be able to maximize performance with knowledge of the warp size, so PTX includes a run-time immediate constant, WARP_SZ, which may be used in any instruction where an immediate operand is allowed.
A cluster is a group of CTAs that run concurrently or in parallel and can synchronize and communicate with each other via shared memory. The executing CTA has to make sure that the shared memory of the peer CTA exists before communicating with it via shared memory, and that the peer CTA hasn’t exited before the shared memory operation completes.
Threads within the different CTAs in a cluster can synchronize and communicate with each other via shared memory. Cluster-wide barriers can be used to synchronize all the threads within the cluster. Each CTA in a cluster has a unique CTA identifier within its cluster (cluster_ctaid). Each cluster of CTAs has 1D, 2D or 3D shape specified by the parameter cluster_nctaid. Each CTA in the cluster also has a unique CTA identifier (cluster_ctarank) across all dimensions. The total number of CTAs across all the dimensions in the cluster is specified by cluster_nctarank. Threads may read and use these values through predefined, read-only special registers %cluster_ctaid, %cluster_nctaid, %cluster_ctarank, %cluster_nctarank.
The cluster level is applicable only on target architecture sm_90 or higher. Specifying the cluster level at launch time is optional. If the user specifies the cluster dimensions at launch time, the launch is treated as an explicit cluster launch; otherwise it is treated as an implicit cluster launch with default dimension 1x1x1. PTX provides the read-only special register %is_explicit_cluster to differentiate between explicit and implicit cluster launch.
There is a maximum number of threads that a CTA can contain and a maximum number of CTAs that a cluster can contain. However, clusters with CTAs that execute the same kernel can be batched together into a grid of clusters, so that the total number of threads that can be launched in a single kernel invocation is very large. This comes at the expense of reduced thread communication and synchronization, because threads in different clusters cannot communicate and synchronize with each other.
Each cluster has a unique cluster identifier (clusterid) within a grid of clusters. Each grid of clusters has a 1D, 2D, or 3D shape specified by the parameter nclusterid. Each grid also has a unique temporal grid identifier (gridid). Threads may read and use these values through predefined, read-only special registers %tid, %ntid, %clusterid, %nclusterid, and %gridid.
Each CTA has a unique identifier (ctaid) within a grid. Each grid of CTAs has 1D, 2D, or 3D shape specified by the parameter nctaid. Threads may read and use these values through predefined, read-only special registers %ctaid and %nctaid.
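Kernels typically combine the CTA identifier with the thread identifier to give every thread in the grid its own element of work. The Python sketch below models that arithmetic on the host for the 1D case. The function names are hypothetical, and the formula is the idiom commonly used in CUDA code rather than something this document mandates.

```python
def global_thread_index_1d(ctaid_x: int, ntid_x: int, tid_x: int) -> int:
    """Grid-wide index of a thread in a 1D grid of 1D CTAs (common idiom)."""
    return ctaid_x * ntid_x + tid_x


def grid_thread_count(nctaid, ntid) -> int:
    """Total threads launched: product of all grid and CTA dimensions."""
    total = 1
    for extent in (*nctaid, *ntid):
        total *= extent
    return total
```

With 256 threads per CTA, thread 10 of CTA 3 receives index 778, and a grid of four such CTAs launches 1024 threads.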
Each kernel is executed as a batch of threads organized as a grid of clusters consisting of CTAs, where the cluster is an optional level that is applicable only for target architectures sm_90 and higher. Figure 1 shows a grid consisting of CTAs and Figure 2 shows a grid consisting of clusters.
Grids may be launched with dependencies between one another - a grid may be a dependent grid and/or a prerequisite grid. To understand how grid dependencies may be defined, refer to the section on CUDA Graphs in the CUDA Programming Guide.
A cluster is a set of cooperative thread arrays (CTAs) where a CTA is a set of concurrent threads that execute the same kernel program. A grid is a set of clusters consisting of CTAs that execute independently.
PTX threads may access data from multiple state spaces during their execution as illustrated by Figure 3 where cluster level is introduced from target architecture sm_90 onwards. Each thread has a private local memory. Each thread block (CTA) has a shared memory visible to all threads of the block and to all active blocks in the cluster and with the same lifetime as the block. Finally, all threads have access to the same global memory.
There are additional state spaces accessible by all threads: the constant, param, texture, and surface state spaces. Constant and texture memory are read-only; surface memory is readable and writable. The global, constant, param, texture, and surface state spaces are optimized for different memory usages. For example, texture memory offers different addressing modes as well as data filtering for specific data formats. Note that texture and surface memory is cached, and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.
The global, constant, and texture state spaces are persistent across kernel launches by the same application.
Both the host and the device maintain their own local memory, referred to as host memory and device memory, respectively. The device memory may be mapped and read or written by the host, or, for more efficient transfer, copied from the host memory through optimized API calls that utilize the device’s high-performance Direct Memory Access (DMA) engine.
The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a host program invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
A multiprocessor consists of multiple Scalar Processor (SP) cores, a multithreaded instruction unit, and on-chip shared memory. The multiprocessor creates, manages, and executes concurrent threads in hardware with zero scheduling overhead. It implements a single-instruction barrier synchronization. Fast barrier synchronization together with lightweight thread creation and zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing, for example, a low granularity decomposition of problems by assigning one thread to each data element (such as a pixel in an image, a voxel in a volume, a cell in a grid-based computation).
To manage hundreds of threads running several different programs, the multiprocessor employs an architecture we call SIMT (single-instruction, multiple-thread). The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state. The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of parallel threads called warps. (This term originates from weaving, the first parallel thread technology.) Individual threads composing a SIMT warp start together at the same program address but are otherwise free to branch and execute independently.
When a multiprocessor is given one or more thread blocks to execute, it splits them into warps that get scheduled by the SIMT unit. The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.
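The deterministic split of a block into warps can be modelled in a few lines. The Python sketch below is an illustration only: the function name is hypothetical, thread IDs are assumed to be already linearized within the block, and the warp size defaults to 32, the value reported for all current targets.

```python
WARP_SZ = 32  # value reported for all current target architectures


def split_into_warps(num_threads: int, warp_size: int = WARP_SZ):
    """Group linear thread IDs 0..num_threads-1 into consecutive warps."""
    return [
        list(range(start, min(start + warp_size, num_threads)))
        for start in range(0, num_threads, warp_size)
    ]
```

A block of 70 threads yields three warps under this model: threads 0-31, threads 32-63, and a partially filled warp holding threads 64-69.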
At every instruction issue time, the SIMT unit selects a warp that is ready to execute and issues the next instruction to the active threads of the warp. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
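The cost of divergence can be seen in a simplified model of a single two-way branch. The Python sketch below is hypothetical and is not how hardware is implemented: it assumes one boolean branch outcome per thread and reports, for each path that at least one thread takes, the mask of threads active on that path. The order in which the paths are executed is an arbitrary choice made for the sketch.

```python
def serialize_branch(taken):
    """Return one (path, active_mask) pair per branch path actually taken.

    `taken` holds one boolean per thread of the warp: True if that thread
    takes the branch, False if it falls through.
    """
    passes = []
    for path in (True, False):  # execution order is an assumption
        mask = [outcome == path for outcome in taken]
        if any(mask):  # a path no thread takes costs nothing
            passes.append((path, mask))
    return passes
```

A warp whose threads all agree needs one pass. A warp split evenly between the two paths needs two passes with 16 active threads each, so half of the execution slots are idle in each pass.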
SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.
How many blocks a multiprocessor can process at once depends on how many registers per thread and how much shared memory per block are required for a given kernel since the multiprocessor’s registers and shared memory are split among all the threads of the batch of blocks. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
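The resource arithmetic described above can be sketched as follows. This Python model is deliberately simplified: the function and parameter names are hypothetical, the resource sizes in the usage note are example values rather than the limits of any particular GPU, and real hardware applies further limits (maximum resident blocks and threads, allocation granularity) that are ignored here.

```python
def resident_blocks(regs_per_thread: int, threads_per_block: int,
                    smem_per_block: int, regs_per_sm: int,
                    smem_per_sm: int) -> int:
    """Blocks that fit on one multiprocessor, limited by registers and shared memory.

    A result of 0 corresponds to the case where the kernel cannot launch.
    """
    regs_per_block = regs_per_thread * threads_per_block
    limit_by_regs = regs_per_sm // regs_per_block
    if smem_per_block == 0:
        return limit_by_regs  # shared memory imposes no limit
    limit_by_smem = smem_per_sm // smem_per_block
    return min(limit_by_regs, limit_by_smem)
```

Assuming a multiprocessor with 65536 registers and 49152 bytes of shared memory, a kernel using 32 registers per thread, 256 threads per block, and 16384 bytes of shared memory per block is limited to 3 resident blocks by shared memory, although registers alone would allow 8.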
On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.
Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures. In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with Volta and beyond. See the section on Compute Capability 7.x in the CUDA Programming Guide for further details.
As illustrated by Figure 4, each multiprocessor has on-chip memory of the four following types:
One set of local 32-bit registers per processor,
A parallel data cache or shared memory that is shared by all scalar processor cores and is where the shared memory space resides,
A read-only constant cache that is shared by all scalar processor cores and speeds up reads from the constant memory space, which is a read-only region of device memory,
A read-only texture cache that is shared by all scalar processor cores and speeds up reads from the texture memory space, which is a read-only region of device memory; each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering.
The local and global memory spaces are read-write regions of device memory.
PTX programs are a collection of text source modules (files). PTX source modules have an assembly-language style syntax with instruction operation codes and operands. Pseudo-operations specify symbol and addressing management. The ptxas optimizing backend compiler optimizes and assembles PTX source modules to produce corresponding binary object files.
Source modules are ASCII text. Lines are separated by the newline character (\n).
All whitespace characters are equivalent; whitespace is ignored except for its use in separating tokens in the language.
The C preprocessor cpp may be used to process PTX source modules. Lines beginning with # are preprocessor directives. The following are common preprocessor directives:
C: A Reference Manual by Harbison and Steele provides a good description of the C preprocessor.
PTX is case sensitive and uses lowercase for keywords.
Each PTX module must begin with a .version directive specifying the PTX language version, followed by a .target directive specifying the target architecture assumed. See PTX Module Directives for more information on these directives.
Comments in PTX follow C/C++ syntax, using non-nested /* and */ for comments that may span multiple lines, and using // to begin a comment that extends up to the next newline character, which terminates the current line. Comments cannot occur within character constants, string literals, or within other comments.
Instructions are formed from an instruction opcode followed by a comma-separated list of zero or more operands, and terminated with a semicolon. Operands may be register variables, constant expressions, address expressions, or label names. Instructions have an optional guard predicate which controls conditional execution. The guard predicate follows the optional label and precedes the opcode, and is written as @p, where p is a predicate register. The guard predicate may be optionally negated, written as @!p.
The destination operand is first, followed by source operands.
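The layout of an instruction (optional label, optional guard predicate, opcode, operands, semicolon) can be illustrated with a small tokenizer. The Python sketch below is a deliberately naive illustration, not the PTX grammar: the pattern and the function name are hypothetical, operands are split on every comma, and vector operands in braces or other operands that themselves contain commas are not handled.

```python
import re

# Naive pattern for one instruction: [label:] [@[!]pred] opcode [operands];
_INSTRUCTION = re.compile(
    r"\s*(?:(?P<label>[A-Za-z_$%][\w$]*)\s*:\s*)?"  # optional label
    r"(?:@(?P<negated>!?)(?P<pred>[%\w$]+)\s+)?"    # optional guard predicate
    r"(?P<opcode>[A-Za-z_][\w.:]*)"                 # opcode with its modifiers
    r"\s*(?P<operands>[^;]*);\s*"                   # operands up to the semicolon
)


def parse_instruction(text: str) -> dict:
    """Split one instruction into label, guard, opcode, and operand list."""
    match = _INSTRUCTION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognizable instruction: {text!r}")
    operands = [op.strip() for op in match.group("operands").split(",") if op.strip()]
    return {
        "label": match.group("label"),
        "guard": match.group("pred"),
        "negated": match.group("negated") == "!",
        "opcode": match.group("opcode"),
        "operands": operands,  # destination first, when the instruction has one
    }
```

For the instruction `L1: @!%p1 add.s32 %r1, %r2, %r3;` the sketch reports label L1, a negated guard on %p1, opcode add.s32, and operands %r1 (the destination), %r2, and %r3.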
Instruction keywords are listed in Table 2. All instruction keywords are reserved tokens in PTX.
User-defined identifiers follow extended C++ rules: they either start with a letter followed by zero or more letters, digits, underscore, or dollar characters; or they start with an underscore, dollar, or percentage character followed by one or more letters, digits, underscore, or dollar characters:
followsym: [a-zA-Z0-9_$]
identifier: [a-zA-Z]{followsym}* | [_$%]{followsym}+
PTX does not specify a maximum length for identifiers and suggests that all implementations support identifiers of at least 1024 characters.
Many high-level languages such as C and C++ follow similar rules for identifier names, except that the percentage sign is not allowed. PTX allows the percentage sign as the first character of an identifier. The percentage sign can be used to avoid name conflicts, e.g., between user-defined variable names and compiler-generated names.
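The identifier rule translates directly into a regular expression. The Python sketch below is a transcription of the rule as stated in the text. The names are hypothetical, and the check does not reject reserved instruction keywords, which a real assembler would have to do.

```python
import re

# A letter followed by zero or more follow symbols, or one of _ $ %
# followed by one or more follow symbols.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_$]*|[_$%][A-Za-z0-9_$]+")


def is_identifier(text: str) -> bool:
    """True if text has the shape of a user-defined PTX identifier."""
    return _IDENTIFIER.fullmatch(text) is not None
```

Under this rule %r1, _tmp, and a$b_1 are accepted, while a lone underscore or percent sign is rejected because at least one further character is required, and a%b is rejected because the percentage sign may only appear first.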
PTX predefines one constant and a small number of special registers that begin with the percentage sign, listed in Table 3.
PTX supports integer and floating-point constants and constant expressions. These constants may be used in data initialization and as operands to instructions. Type checking rules remain the same for integer, floating-point, and bit-size types. For predicate-type data and instructions, integer constants are allowed and are interpreted as in C, i.e., zero values are False and non-zero values are True.
Integer constants are 64-bits in size and are either signed or unsigned, i.e., every integer constant has type .s64 or .u64. The signed/unsigned nature of an integer constant is needed to correctly evaluate constant expressions containing operations such as division and ordered comparisons, where the behavior of the operation depends on the operand types. When used in an instruction or data initialization, each integer constant is converted to the appropriate size based on the data or instruction type at its use.
Integer literals may be written in decimal, hexadecimal, octal, or binary notation. The syntax follows that of C. Integer literals may be followed immediately by the letter U to indicate that the literal is unsigned.
Integer literals are non-negative and have a type determined by their magnitude and optional type suffix as follows: literals are signed (.s64) unless the value cannot be fully represented in .s64 or the unsigned suffix is specified, in which case the literal is unsigned (.u64).
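The typing rule for integer literals can be modelled as follows. The Python sketch is an illustration under stated assumptions: the names are hypothetical, the accepted spellings follow C as the text describes (0x for hexadecimal, a leading 0 for octal, 0b for binary), and rejecting values wider than 64 bits is a choice made for the sketch rather than behavior taken from this document.

```python
import re

_LITERAL = re.compile(r"(0[xX][0-9a-fA-F]+|0[bB][01]+|0[0-7]*|[1-9][0-9]*)(U?)")
S64_MAX = 2**63 - 1
U64_MAX = 2**64 - 1


def parse_integer_literal(text: str):
    """Return (value, type) where type is '.s64' or '.u64'."""
    match = _LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    digits, suffix = match.groups()
    prefix = digits[:2].lower()
    if prefix == "0x":
        value = int(digits[2:], 16)
    elif prefix == "0b":
        value = int(digits[2:], 2)
    elif digits.startswith("0"):
        value = int(digits, 8)  # a leading zero selects octal, as in C
    else:
        value = int(digits, 10)
    if value > U64_MAX:
        raise ValueError("literal does not fit in 64 bits")  # assumption
    # Signed unless the U suffix is present or the value overflows .s64.
    unsigned = suffix == "U" or value > S64_MAX
    return value, ".u64" if unsigned else ".s64"
```

The literal 0123 is octal and signed, 42U is unsigned because of its suffix, and 0xfabc123400000000 is unsigned because it does not fit in .s64.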
The predefined integer constant WARP_SZ specifies the number of threads per warp for the target platform; to date, all target architectures have a WARP_SZ value of 32.
Floating-point constants are represented as 64-bit double-precision values, and all floating-point constant expressions are evaluated using 64-bit double precision arithmetic. The only exception is the 32-bit hex notation for expressing an exact single-precision floating-point value; such values retain their exact 32-bit single-precision value and may not be used in constant expressions. Each 64-bit floating-point constant is converted to the appropriate floating-point size based on the data or instruction type at its use.
Floating-point literals may be written with an optional decimal point and an optional signed exponent. Unlike C and C++, there is no suffix letter to specify size; literals are always represented in 64-bit double-precision format.
PTX includes a second representation of floating-point constants for specifying the exact machine representation using a hexadecimal constant. To specify IEEE 754 double-precision floating point values, the constant begins with 0d or 0D followed by 16 hex digits. To specify IEEE 754 single-precision floating point values, the constant begins with 0f or 0F followed by 8 hex digits.
0[fF]{hexdigit}{8} // single-precision floating point
0[dD]{hexdigit}{16} // double-precision floating point
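The hexadecimal form is simply the IEEE 754 bit pattern written most significant digit first, so it can be produced and decoded with Python's standard struct module. The helper names below are hypothetical. The sketch emits upper-case digits, although the notation accepts either case.

```python
import struct


def f32_to_ptx_hex(value: float) -> str:
    """Exact single-precision literal: 0f followed by 8 hex digits."""
    return "0f" + struct.pack(">f", value).hex().upper()


def f64_to_ptx_hex(value: float) -> str:
    """Exact double-precision literal: 0d followed by 16 hex digits."""
    return "0d" + struct.pack(">d", value).hex().upper()


def ptx_hex_to_float(literal: str) -> float:
    """Decode a 0f/0F or 0d/0D hexadecimal floating-point literal."""
    prefix, digits = literal[:2].lower(), literal[2:]
    if prefix == "0f" and len(digits) == 8:
        return struct.unpack(">f", bytes.fromhex(digits))[0]
    if prefix == "0d" and len(digits) == 16:
        return struct.unpack(">d", bytes.fromhex(digits))[0]
    raise ValueError(f"not a hexadecimal floating-point literal: {literal!r}")
```

For example, 1.0 is written 0f3F800000 in single precision and 0d3FF0000000000000 in double precision.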
In PTX, integer constants may be used as predicates. For predicate-type data initializers and instruction operands, integer constants are interpreted as in C, i.e., zero values are False and non-zero values are True.
In PTX, constant expressions are formed using operators as in C and are evaluated using rules similar to those in C, but simplified by restricting types and sizes, removing most casts, and defining full semantics to eliminate cases where expression evaluation in C is implementation dependent.
Constant expressions are formed from constant literals, unary plus and minus, basic arithmetic operators (addition, subtraction, multiplication, division), comparison operators, the conditional ternary operator (?:), and parentheses. Integer constant expressions also allow unary logical negation (!), bitwise complement (~), remainder (%), shift operators (<< and >>), bit-type operators (&, |, and ^), and logical operators (&&, ||).
Constant expressions in PTX do not support casts between integer and floating-point.
Constant expressions are evaluated using the same operator precedence as in C. Table 4 gives operator precedence and associativity. Operator precedence is highest for unary operators and decreases with each line in the chart. Operators on the same line have the same precedence and are evaluated right-to-left for unary operators and left-to-right for binary operators.
Integer constant expressions are evaluated at compile time according to a set of rules that determine the type (signed .s64 versus unsigned .u64) of each sub-expression. These rules are based on the rules in C, but they’ve been simplified to apply only to 64-bit integers, and behavior is fully defined in all cases (specifically, for remainder and shift operators).
Literals are signed unless unsigned is needed to prevent overflow, or unless the literal uses a U suffix. For example:
42, 0x1234, 0123 are signed.
0xfabc123400000000, 42U, 0x1234U are unsigned.
Unary plus and minus preserve the type of the input operand. For example:
+123, -1, -(-42) are signed.
-1U, -0xfabc123400000000 are unsigned.
Unary logical negation (!) produces a signed result with value 0 or 1.
Unary bitwise complement (~) interprets the source operand as unsigned and produces an unsigned result.
Some binary operators require normalization of source operands. This normalization is known as the usual arithmetic conversions and simply converts both operands to unsigned type if either operand is unsigned.
Addition, subtraction, multiplication, and division perform the usual arithmetic conversions and produce a result with the same type as the converted operands. That is, the operands and result are unsigned if either source operand is unsigned, and are otherwise signed.
Remainder (%) interprets the operands as unsigned. Note that this differs from C, which allows a negative divisor but defines the behavior to be implementation dependent.
Left and right shift interpret the second operand as unsigned and produce a result with the same type as the first operand. Note that the behavior of right-shift is determined by the type of the first operand: right shift of a signed value is arithmetic and preserves the sign, and right shift of an unsigned value is logical and shifts in a zero bit.
AND (&), OR (|), and XOR (^) perform the usual arithmetic conversions and produce a result with the same type as the converted operands.
AND_OP (&&), OR_OP (||), Equal (==), and Not_Equal (!=) produce a signed result. The result value is 0 or 1.
Ordered comparisons (<, <=, >, >=) perform the usual arithmetic conversions on source operands and produce a signed result. The result value is 0 or 1.
Casting of expressions to signed or unsigned is supported using (.s64) and (.u64) casts.
For the conditional operator (?:), the first operand must be an integer, and the second and third operands are either both integers or both floating-point. The usual arithmetic conversions are performed on the second and third operands, and the result type is the same as the converted type.
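A few of these typing rules can be exercised with a small model of 64-bit constants. The Python sketch below covers only unary minus, bitwise complement, addition, ordered comparison, and right shift. The Const class and the function names are hypothetical, and shift counts of 64 or more simply follow Python's behavior, which is not claimed to match PTX.

```python
MASK64 = (1 << 64) - 1
SIGN_BIT = 1 << 63


class Const:
    """A 64-bit constant tagged as signed (.s64) or unsigned (.u64)."""

    def __init__(self, value: int, unsigned: bool = False):
        self.bits = value & MASK64  # keep only the 64-bit pattern
        self.unsigned = unsigned

    @property
    def value(self) -> int:
        """Numeric value under the constant's own signedness."""
        if self.unsigned or self.bits < SIGN_BIT:
            return self.bits
        return self.bits - (1 << 64)


def _convert(a: Const, b: Const):
    """Usual arithmetic conversions: both unsigned if either is unsigned."""
    if a.unsigned or b.unsigned:
        return a.bits, b.bits, True
    return a.value, b.value, False


def neg(a: Const) -> Const:
    return Const(-a.value, a.unsigned)  # unary minus preserves the type


def bitnot(a: Const) -> Const:
    return Const(~a.bits, unsigned=True)  # always produces unsigned


def add(a: Const, b: Const) -> Const:
    x, y, unsigned = _convert(a, b)
    return Const(x + y, unsigned)


def less_than(a: Const, b: Const) -> Const:
    x, y, _ = _convert(a, b)
    return Const(int(x < y))  # signed result, 0 or 1


def shift_right(a: Const, b: Const) -> Const:
    # Count is read as unsigned; result type follows the first operand.
    # Python's >> is arithmetic for negative ints and logical otherwise.
    return Const(a.value >> b.bits, a.unsigned)
```

The model shows why signedness matters: -1 < 1 evaluates to 1 when both operands are signed, but to 0 when the right operand is 1U, because -1 is first converted to the largest unsigned value.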
While the specific resources available in a given target GPU will vary, the kinds of resources will be common across platforms, and these resources are abstracted in PTX through state spaces and data types.
A state space is a storage area with particular characteristics. All variables reside in some state space. The characteristics of a state space include its size, addressability, access speed, access rights, and level of sharing between threads.
The state spaces defined in PTX are a byproduct of parallel programming and graphics programming. The list of state spaces is shown in Table 6, and properties of state spaces are shown in Table 7.
1 Variables in .const and .global state spaces are initialized to zero by default.
2 Accessible only via the ld.param{::entry} instruction. Address may be taken via mov instruction.
3 Accessible via ld.param{::func} and st.param{::func} instructions. Device function input and return parameters may have their address taken via mov; the parameter is then located on the stack frame and its address is in the .local state space.
4 Accessible only via the tex instruction.
5 Visible to the owning CTA and other active CTAs in the cluster.
Registers (.reg state space) are fast storage locations. The number of registers is limited, and will vary from platform to platform. When the limit is exceeded, register variables will be spilled to memory, which typically reduces performance. For each architecture, there is a recommended maximum number of registers to use (see the CUDA Programming Guide for details).
Registers may be typed (signed integer, unsigned integer, floating point, predicate) or untyped. Register size is restricted; aside from predicate registers which are 1-bit, scalar registers have a width of 8-, 16-, 32-, 64-, or 128-bits, and vector registers have a width of 16-, 32-, 64-, or 128-bits. The most common use of 8-bit registers is with ld, st, and cvt instructions, or as elements of vector tuples.
Registers differ from the other state spaces in that they are not fully addressable, i.e., it is not possible to refer to the address of a register. When compiling to use the Application Binary Interface (ABI), register variables are restricted to function scope and may not be declared at module scope. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg variables, the compiler silently disables use of the ABI. Registers may have alignment boundaries required by multi-word loads and stores.
The special register (.sreg) state space holds predefined, platform-specific registers, such as grid, cluster, CTA, and thread parameters, clock counters, and performance monitoring registers. All special registers are predefined.
The constant (.const) state space is a read-only memory initialized by the host. Constant memory is accessed with a ld.const instruction. Constant memory is restricted in size, currently limited to 64 KB which can be used to hold statically-sized constant variables. There is an additional 640 KB of constant memory, organized as ten independent 64 KB regions. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters. Since the ten regions are not contiguous, the driver must ensure that constant buffers are allocated so that each buffer fits entirely within a 64 KB region and does not span a region boundary.
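The constraint that a buffer must not span a region boundary can be illustrated with a toy allocator. The Python sketch below is purely hypothetical and does not describe how the driver actually places buffers: it applies a first-fit policy over ten 64 KB regions, ignores alignment, and never frees memory.

```python
REGION_SIZE = 64 * 1024
NUM_REGIONS = 10


class ConstRegionAllocator:
    """Toy first-fit allocator: every buffer lies entirely inside one region."""

    def __init__(self, num_regions: int = NUM_REGIONS,
                 region_size: int = REGION_SIZE):
        self.region_size = region_size
        self.used = [0] * num_regions  # bytes already allocated per region

    def allocate(self, size: int):
        """Return (region index, byte offset within that region)."""
        if size <= 0 or size > self.region_size:
            raise ValueError("buffer cannot fit within a single region")
        for region, used in enumerate(self.used):
            if used + size <= self.region_size:
                self.used[region] = used + size
                return region, used
        raise MemoryError("no region has enough free space")
```

Two 40 KB buffers land in different regions even though 24 KB remains free in the first, because the second buffer would otherwise cross a region boundary. A later 24 KB request still fits in the first region.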
Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are initialized to zero by default. Constant buffers allocated by the driver are initialized by the host, and pointers to such buffers are passed to the kernel as parameters. See the description of kernel parameter attributes in Kernel Function Parameter Attributes for more details on passing pointers to constant buffers as kernel parameters.
Previous versions of PTX exposed constant memory as a set of eleven 64 KB banks, with explicit bank numbers required for variable declaration and during access.
Prior to PTX ISA version 2.2, the constant memory was organized into fixed size banks. There were eleven 64 KB banks, and banks were specified using the .const[bank] modifier, where bank ranged from 0 to 10. If no bank number was given, bank zero was assumed.
By convention, bank zero was used for all statically-sized constant variables. The remaining banks were used to declare incomplete constant arrays (as in C, for example), where the size is not known at compile time. For example, the declaration
.extern .const[2] .b32 const_buffer[];
resulted in const_buffer pointing to the start of constant bank two. This pointer could then be used to access the entire 64 KB constant bank. Multiple incomplete array variables declared in the same bank were aliased, with each pointing to the start address of the specified constant bank.
To access data in constant banks 1 through 10, the bank number was required in the state space of the load instruction. For example, an incomplete array in bank 2 was accessed as follows:
.extern .const[2] .b32 const_buffer[];
ld.const[2].b32 %r1, [const_buffer+4]; // load second word
In PTX ISA version 2.2, we eliminated explicit banks and replaced the incomplete array representation of driver-allocated constant buffers with kernel parameter attributes that allow pointers to constant buffers to be passed as kernel parameters.
The global (.global) state space is memory that is accessible by all threads in a context. It is the mechanism by which threads in different CTAs, clusters, and grids can communicate. Use ld.global, st.global, and atom.global to access global variables.
Global variables have an optional variable initializer; global variables with no explicit initializer are initialized to zero by default.
The local state space (.local) is private memory for each thread to keep its own data. It is typically standard memory with cache. The size is limited, as it must be allocated on a per-thread basis. Use ld.local and st.local to access local variables.
When compiling to use the Application Binary Interface (ABI), .local state-space variables must be declared within function scope and are allocated on the stack. In implementations that do not support a stack, all local memory variables are stored at fixed addresses, recursive function calls are not supported, and .local variables may be declared at module scope. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .local variables, the compiler silently disables use of the ABI.
The parameter (.param) state space is used (1) to pass input arguments from the host to the kernel, (2a) to declare formal input and return parameters for device functions called from within kernel execution, and (2b) to declare locally-scoped byte array variables that serve as function call arguments, typically for passing large structures by value to a function. Kernel function parameters differ from device function parameters in terms of access and sharing (read-only versus read-write, per-kernel versus per-thread). Note that PTX ISA versions 1.x support only kernel function parameters in .param space; device function parameters were previously restricted to the register state space. The use of parameter state space for device function parameters was introduced in PTX ISA version 2.0 and requires target architecture sm_20 or higher. Additional sub-qualifiers ::entry or ::func can be specified on instructions with .param state space to indicate whether the address refers to a kernel function parameter or a device function parameter. If no sub-qualifier is specified with the .param state space, then the default sub-qualifier is specific to and dependent on the exact instruction. For example, st.param is equivalent to st.param::func, whereas isspacep.param is equivalent to isspacep.param::entry. Refer to the instruction description for more details on the default sub-qualifier assumption.
Note
The location of parameter space is implementation specific. For example, in some implementations kernel parameters reside in global memory. No access protection is provided between parameter and global space in this case. Though the exact location of the kernel parameter space is implementation specific, the kernel parameter space window is always contained within the global space window. Similarly, function parameters are mapped to parameter passing registers and/or stack locations based on the function calling conventions of the Application Binary Interface (ABI). Therefore, PTX code should make no assumptions about the relative locations or ordering of .param space variables.
Each kernel function definition includes an optional list of parameters. These parameters are addressable, read-only variables declared in the .param state space. Values passed from the host to the kernel are accessed through these parameter variables using ld.param instructions. The kernel parameter variables are shared across all CTAs from all clusters within a grid.
The address of a kernel parameter may be moved into a register using the mov instruction. The resulting address is in the .param state space and is accessed using ld.param instructions.
.entry bar ( .param .b32 len )
{
    .reg .u32 %ptr, %n;

    mov.u32      %ptr, len;
    ld.param.u32 %n, [%ptr];
    ...
Kernel function parameters may represent normal data values, or they may hold addresses to objects in constant, global, local, or shared state spaces. In the case of pointers, the compiler and runtime system need information about which parameters are pointers, and to which state space they point. Kernel parameter attribute directives are used to provide this information at the PTX level. See Kernel Function Parameter Attributes for a description of kernel parameter attribute directives.
Note
The current implementation does not allow creation of generic pointers to constant variables (cvta.const) in programs that have pointers to constant buffers passed as kernel parameters.
Kernel function parameters may be declared with an optional .ptr attribute to indicate that a parameter is a pointer to memory, and also indicate the state space and alignment of the memory being pointed to. Kernel Parameter Attribute: .ptr describes the .ptr kernel parameter attribute.
.param .type .ptr .space .align N  varname
.param .type .ptr        .align N  varname

.space = { .const, .global, .local, .shared };
Description
Used to specify the state space and, optionally, the alignment of memory pointed to by a pointer type kernel parameter. The alignment value N, if present, must be a power of two. If no state space is specified, the pointer is assumed to be a generic address pointing to one of const, global, local, or shared memory. If no alignment is specified, the memory pointed to is assumed to be aligned to a 4 byte boundary.
Spaces between .ptr, .space, and .align may be eliminated to improve readability.
PTX ISA Notes
Introduced in PTX ISA version 2.2.
Support for generic addressing of .const space added in PTX ISA version 3.1.
PTX ISA version 2.0 extended the use of parameter space to device function parameters. The most common use is for passing objects by value that do not fit within a PTX register, such as C structures larger than 8 bytes. In this case, a byte array in parameter space is used. Typically, the caller will declare a locally-scoped .param byte array variable that represents a flattened C structure or union. This will be passed by value to a callee, which declares a .param formal parameter having the same size and alignment as the passed argument.
Example
// pass object of type struct { double d; int y; };
.func foo ( .reg .b32 N, .param .align 8 .b8 buffer[12] )
{
    .reg .f64 %d;
    .reg .s32 %y;

    ld.param.f64 %d, [buffer];
    ld.param.s32 %y, [buffer+8];
    ...
}

// code snippet from the caller
// struct { double d; int y; } mystruct; is flattened, passed to foo
    ...
    .reg .f64 dbl;
    .reg .s32 x;
    .param .align 8 .b8 mystruct[12];  // same size and alignment as the formal parameter
    ...
    st.param.f64 [mystruct+0], dbl;
    st.param.s32 [mystruct+8], x;
    call foo, (4, mystruct);
    ...
See the section on function call syntax for more details.
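The offsets used in the example (the double at byte 0, the int at byte 8, 12 bytes in total) can be reproduced with a short host-side sketch. It uses Python's standard struct module; the helper name flatten and the little-endian byte order are assumptions made for illustration and are not stated in the text.

```python
import struct

# '<' selects little-endian with no implicit padding (an assumption for this sketch);
# 'd' is an 8-byte double and 'i' is a 4-byte int, matching struct { double d; int y; }.
LAYOUT = "<di"

def flatten(d, y):
    """Flatten the two fields into the byte image of a 12-byte parameter array."""
    return struct.pack(LAYOUT, d, y)

buf = flatten(1.5, 7)
print(len(buf))                            # size of the flattened byte array
print(struct.unpack_from("<d", buf, 0)[0]) # field d, read back from offset 0
print(struct.unpack_from("<i", buf, 8)[0]) # field y, read back from offset 8
```

The reads at offsets 0 and 8 correspond to the ld.param.f64 and ld.param.s32 accesses in the callee.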
Function input parameters may be read via ld.param and function return parameters may be written using st.param; it is illegal to write to an input parameter or read from a return parameter.
Aside from passing structures by value, .param space is also required whenever a formal parameter has its address taken within the called function. In PTX, the address of a function input parameter may be moved into a register using the mov instruction. Note that the parameter will be copied to the stack if necessary, and so the address will be in the .local state space and is accessed via ld.local and st.local instructions. It is not possible to use mov to get the address of a locally-scoped .param space variable. Starting with PTX ISA version 6.0, it is possible to use the mov instruction to get the address of a return parameter of a device function.
Example
// pass array of up to eight floating-point values in buffer
.func foo ( .param .b32 N, .param .b32 buffer[32] )
{
    .reg .u32  %n, %r;
    .reg .f32  %f;
    .reg .pred %p;

    ld.param.u32 %n, [N];
    mov.u32      %r, buffer;  // forces buffer to .local state space
Loop:
    setp.eq.u32  %p, %n, 0;
@%p bra          Done;
    ld.local.f32 %f, [%r];
    ...
    add.u32      %r, %r, 4;
    sub.u32      %n, %n, 1;
    bra          Loop;
Done:
    ...
}
The shared (.shared) state space is a memory that is owned by an executing CTA and is accessible to the threads of all the CTAs within a cluster. An address in shared memory can be read and written by any thread in a CTA cluster.
Additional sub-qualifiers ::cta or ::cluster can be specified on instructions with .shared state space to indicate whether the address belongs to the shared memory window of the executing CTA or of any CTA in the cluster respectively. The addresses in the .shared::cta window also fall within the .shared::cluster window. If no sub-qualifier is specified with the .shared state space, then it defaults to ::cta. For example, ld.shared is equivalent to ld.shared::cta.
Variables declared in .shared state space refer to the memory addresses in the current CTA. Instruction mapa gives the .shared::cluster address of the corresponding variable in another CTA in the cluster.
Shared memory typically has some optimizations to support the sharing. One example is broadcast, where all threads read from the same address. Another is sequential access from sequential threads.
The texture (.tex) state space is global memory accessed via the texture instruction. It is shared by all threads in a context. Texture memory is read-only and cached, so accesses to texture memory are not coherent with global memory stores to the texture image.
The GPU hardware has a fixed number of texture bindings that can be accessed within a single kernel (typically 128). The .tex directive will bind the named texture memory variable to a hardware texture identifier, where texture identifiers are allocated sequentially beginning with zero. Multiple names may be bound to the same physical texture identifier. An error is generated if the maximum number of physical resources is exceeded. The texture name must be of type .u32 or .u64.
Physical texture resources are allocated on a per-kernel granularity, and .tex variables are required to be defined in the global scope.
Texture memory is read-only. A texture’s base address is assumed to be aligned to a 16 byte boundary.
Example
.tex .u32 tex_a;         // bound to physical texture 0
.tex .u32 tex_c, tex_d;  // both bound to physical texture 1
.tex .u32 tex_d;         // bound to physical texture 2
.tex .u32 tex_f;         // bound to physical texture 3
Note
Explicit declarations of variables in the texture state space are deprecated, and programs should instead reference texture memory through variables of type .texref. The .tex directive is retained for backward compatibility, and variables declared in the .tex state space are equivalent to module-scoped .texref variables in the .global state space.
In PTX, the fundamental types reflect the native data types supported by the target architectures. A fundamental type specifies both a basic type and a size. Register variables are always of a fundamental type, and instructions operate on these types. The same type-size specifiers are used for both variable definitions and for typing instructions, so their names are intentionally short.
Table 8 lists the fundamental type specifiers for each basic type:
Most instructions have one or more type specifiers, needed to fully specify instruction behavior. Operand types and sizes are checked against instruction types for compatibility.
Two fundamental types are compatible if they have the same basic type and are the same size. Signed and unsigned integer types are compatible if they have the same size. The bit-size type is compatible with any fundamental type having the same size.
In principle, all variables (aside from predicates) could be declared using only bit-size types, but typed variables enhance program readability and allow for better operand type checking.
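The compatibility rules above can be written down as a small predicate. The following Python sketch is illustrative only: the function names parse_type and compatible are hypothetical, and the sketch covers only scalar specifiers of the form .bN, .sN, .uN, and .fN (it does not model .pred or packed types such as .f16x2).

```python
import re

def parse_type(spec):
    """Split a specifier such as '.u32' into (basic type letter, size in bits)."""
    m = re.fullmatch(r"\.([bsuf])(\d+)", spec)
    if m is None:
        raise ValueError(f"unsupported type specifier: {spec}")
    return m.group(1), int(m.group(2))

def compatible(a, b):
    """Apply the three compatibility rules stated in the text."""
    kind_a, size_a = parse_type(a)
    kind_b, size_b = parse_type(b)
    if size_a != size_b:
        return False              # every rule requires equal size
    if kind_a == kind_b:
        return True               # same basic type, same size
    if "b" in (kind_a, kind_b):
        return True               # bit-size type matches any type of that size
    return {kind_a, kind_b} == {"s", "u"}  # signed/unsigned integers of equal size

print(compatible(".u32", ".s32"))
print(compatible(".b32", ".f32"))
print(compatible(".f32", ".s32"))
```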
The .u8, .s8, and .b8 instruction types are restricted to ld, st, and cvt instructions. The .f16 floating-point type is allowed only in conversions to and from .f32 and .f64 types, in half precision floating point instructions, and in texture fetch instructions. The .f16x2 floating point type is allowed only in half precision floating point arithmetic instructions and texture fetch instructions.
For convenience, ld, st, and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or 16-bit values may be held directly in 32-bit or 64-bit registers when being loaded, stored, or converted to other types and sizes.
The fundamental floating-point types supported in PTX have implicit bit representations that indicate the number of bits used to store exponent and mantissa. For example, the .f16 type indicates 5 bits reserved for exponent and 10 bits reserved for mantissa. In addition to the floating-point representations assumed by the fundamental types, PTX allows the following alternate floating-point data formats:
bf16 data format:
This data format is a 16-bit floating point format with 8 bits for exponent and 7 bits for mantissa. A register variable containing bf16 data must be declared with .b16 type.
e4m3 data format:
This data format is an 8-bit floating point format with 4 bits for exponent and 3 bits for mantissa. The e4m3 encoding does not support infinity, and NaN values are limited to 0x7f and 0xff. A register variable containing an e4m3 value must be declared using a bit-size type.
e5m2 data format:
This data format is an 8-bit floating point format with 5 bits for exponent and 2 bits for mantissa. A register variable containing an e5m2 value must be declared using a bit-size type.
tf32 data format:
This data format is a special 32-bit floating point format supported by the matrix multiply-and-accumulate instructions, with the same range as .f32 and reduced precision (>= 10 bits). The internal layout of the tf32 format is implementation defined. PTX facilitates conversion from single precision .f32 type to tf32 format. A register variable containing tf32 data must be declared with .b32 type.
e2m1 data format:
This data format is a 4-bit floating point format with 2 bits for exponent and 1 bit for mantissa. The e2m1 encoding does not support infinity and NaN. e2m1 values must be used in a packed format specified as e2m1x2. A register variable containing two e2m1 values must be declared with .b8 type.
e2m3 data format:
This data format is a 6-bit floating point format with 2 bits for exponent and 3 bits for mantissa. The e2m3 encoding does not support infinity and NaN. e2m3 values must be used in a packed format specified as e2m3x2. A register variable containing two e2m3 values must be declared with .b16 type, where each .b8 element holds a 6-bit floating point value with its 2 most significant bits padded with zeros.
e3m2 data format:
This data format is a 6-bit floating point format with 3 bits for exponent and 2 bits for mantissa. The e3m2 encoding does not support infinity and NaN. e3m2 values must be used in a packed format specified as e3m2x2. A register variable containing two e3m2 values must be declared with .b16 type, where each .b8 element holds a 6-bit floating point value with its 2 most significant bits padded with zeros.
ue8m0 data format:
This data format is an 8-bit unsigned floating-point format with 8 bits for exponent and 0 bits for mantissa. The ue8m0 encoding does not support infinity. The NaN value is limited to 0xff. ue8m0 values must be used in a packed format specified as ue8m0x2. A register variable containing two ue8m0 values must be declared with .b16 type.
ue4m3 data format:
This data format is a 7-bit unsigned floating-point format with 4 bits for exponent and 3 bits for mantissa. The ue4m3 encoding does not support infinity. The NaN value is limited to 0x7f. A register variable containing a single ue4m3 value must be declared with .b8 type, with the most significant bit padded with zero.
Alternate data formats cannot be used as fundamental types. They are supported as source or destination formats by certain instructions.
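To make the bit layouts concrete, the following Python sketch decodes raw bit patterns of three of the formats on the host. The field widths and NaN encodings come from the descriptions above. The exponent biases (7 for e4m3, 127 for ue8m0), the subnormal handling for e4m3, and the function names are assumptions taken from the commonly used OCP conventions; they are not stated in this section.

```python
import math
import struct

def decode_bf16(bits):
    """bf16 has 1 sign, 8 exponent, and 7 mantissa bits, so it can be widened to
    an IEEE-754 binary32 pattern by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def decode_e4m3(bits):
    """Decode an 8-bit e4m3 pattern (1 sign, 4 exponent, 3 mantissa bits).
    Assumes an exponent bias of 7; NaN is 0x7f / 0xff and there is no infinity."""
    bits &= 0xFF
    if bits & 0x7F == 0x7F:
        return math.nan
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0:                                  # subnormal: no implicit leading one
        return sign * (man / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def decode_ue8m0(bits):
    """Decode an unsigned exponent-only byte; assumes a bias of 127, NaN is 0xff."""
    bits &= 0xFF
    if bits == 0xFF:
        return math.nan
    return 2.0 ** (bits - 127)

print(decode_bf16(0x3F80))   # bit pattern of 1.0 in bf16
print(decode_e4m3(0x7E))     # largest finite e4m3 magnitude under the assumed bias
print(decode_ue8m0(127))     # exponent field equal to the assumed bias
```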
Certain PTX instructions operate on two or more sets of inputs in parallel, and produce two or more outputs. Such instructions can use the data stored in a packed format. PTX supports packing two or four values of the same scalar data type into a single, larger value. The packed value is considered as a value of a packed data type. In this section we describe the packed data types supported in PTX.
PTX supports several variants of packed floating point data types. Of these, only .f16x2 is supported as a fundamental type; the others cannot be used as fundamental types and are supported as instruction types on certain instructions. When using an instruction with such non-fundamental types, the operand data variables must be of a bit type of appropriate size. For example, all of the operand variables must be of type .b32 for an instruction with instruction type .bf16x2. Table 9 describes the variants of packed floating point data types in PTX.
Table 9 Operand types for packed floating point instruction type.
Packed floating point type | Number of elements contained in a packed format | Type of each element | Register variable type to be used in the declaration
PTX supports two variants of packed integer data types: .u16x2 and .s16x2. The packed data type consists of two .u16 or .s16 values. A register variable containing .u16x2 or .s16x2 data must be declared with .b32 type. Packed integer data types cannot be used as fundamental types. They are supported as instruction types on certain instructions.
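As a rough illustration of why .u16x2 data is held in a .b32 variable, the following Python sketch packs two 16-bit values into one 32-bit word and unpacks them again. The function names are hypothetical, and placing the first element in the low half of the word is an assumption made for this sketch; the element ordering is not specified in this section.

```python
def pack_u16x2(elem0, elem1):
    """Pack two unsigned 16-bit values into one 32-bit word.
    Assumption: elem0 occupies bits 0..15 and elem1 occupies bits 16..31."""
    for v in (elem0, elem1):
        if not 0 <= v <= 0xFFFF:
            raise ValueError("each element must fit in 16 bits")
    return (elem1 << 16) | elem0

def unpack_u16x2(word):
    """Recover the two 16-bit elements from a packed 32-bit word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF

word = pack_u16x2(0x1234, 0xABCD)
print(hex(word))
print(unpack_u16x2(word))
```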
PTX includes built-in opaque types for defining texture, sampler, and surface descriptor variables. These types have named fields similar to structures, but all information about layout, field ordering, base address, and overall size is hidden to a PTX program, hence the term opaque. The use of these opaque types is limited to:
Variable definition within global (module) scope and in kernel entry parameter lists.
Static initialization of module-scope variables using comma-delimited static assignment expressions for the named members of the type.
Referencing textures, samplers, or surfaces via texture and surface load/store instructions (tex, suld, sust, sured).
Retrieving the value of a named member via query instructions (txq, suq).
Creating pointers to opaque variables using mov, e.g., mov.u64 reg, opaque_var;. The resulting pointer may be stored to and loaded from memory, passed as a parameter to functions, and de-referenced by texture and surface load, store, and query instructions, but the pointer cannot otherwise be treated as an address, i.e., accessing the pointer with ld and st instructions, or performing pointer arithmetic will result in undefined results.
Opaque variables may not appear in initializers, e.g., to initialize a pointer to an opaque variable.
Note
Indirect access to textures and surfaces using pointers to opaque variables is supported beginning with PTX ISA version 3.1 and requires target sm_20 or later.
Indirect access to textures is supported only in unified texture mode (see below).
The three built-in types are .texref, .samplerref, and .surfref. For working with textures and samplers, PTX has two modes of operation. In the unified mode, texture and sampler information is accessed through a single .texref handle. In the independent mode, texture and sampler information each have their own handle, allowing them to be defined separately and combined at the site of usage in the program. In independent mode, the fields of the .texref type that describe sampler properties are ignored, since these properties are defined by .samplerref variables.
Table 10 and Table 11 list the named members of each type for unified and independent texture modes. These members and their values have precise mappings to methods and values defined in the texture HW class as well as exposed values via the API.
Table 10 Opaque Type Fields in Unified Texture Mode
Fields width, height, and depth specify the size of the texture or surface in number of elements in each dimension.
The channel_data_type and channel_order fields specify these properties of the texture or surface using enumeration types corresponding to the source language API. For example, see Channel Data Type and Channel Order Fields for the OpenCL enumeration types currently supported in PTX.
The normalized_coords field indicates whether the texture or surface uses normalized coordinates in the range [0.0, 1.0) instead of unnormalized coordinates in the range [0, N). If no value is specified, the default is set by the runtime system based on the source language.
The filter_mode field specifies how the values returned by texture reads are computed based on the input texture coordinates.
The addr_mode_{0,1,2} fields define the addressing mode in each dimension, which determines how out-of-range coordinates are handled.
See the CUDA C++ Programming Guide for more details of these properties.
Table 11 Opaque Type Fields in Independent Texture Mode
In independent texture mode, the sampler properties are carried in an independent .samplerref variable, and these fields are disabled in the .texref variables. One additional sampler property, force_unnormalized_coords, is available in independent texture mode.
The force_unnormalized_coords field is a property of .samplerref variables that allows the sampler to override the texture header normalized_coords property. This field is defined only in independent texture mode. When True, the texture header setting is overridden and unnormalized coordinates are used; when False, the texture header setting is used.
The force_unnormalized_coords property is used in compiling OpenCL; in OpenCL, the property of normalized coordinates is carried in sampler headers. To compile OpenCL to PTX, texture headers are always initialized with normalized_coords set to True, and the OpenCL sampler-based normalized_coords flag maps (negated) to the PTX-level force_unnormalized_coords flag.
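The interaction between the two flags can be summarized as a small truth-table sketch. The Python below only restates the rules in the two preceding paragraphs; the function names are hypothetical and do not correspond to any PTX or OpenCL API.

```python
def uses_normalized_coords(texture_normalized_coords, force_unnormalized_coords):
    """Effective coordinate mode in independent texture mode: the sampler flag,
    when True, overrides the texture header setting."""
    if force_unnormalized_coords:
        return False
    return texture_normalized_coords

def opencl_to_ptx_flags(cl_sampler_normalized_coords):
    """Mapping described for compiling OpenCL: the texture header is always
    initialized to True, and the sampler flag is the negated OpenCL flag."""
    texture_normalized_coords = True
    force_unnormalized_coords = not cl_sampler_normalized_coords
    return texture_normalized_coords, force_unnormalized_coords

for cl_flag in (True, False):
    tex_flag, force_flag = opencl_to_ptx_flags(cl_flag)
    print(cl_flag, uses_normalized_coords(tex_flag, force_flag))
```

Under this mapping, the effective mode always equals the OpenCL sampler's own normalized-coordinates setting, which is the point of the negation.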
Variables using these types may be declared at module scope or within kernel entry parameter lists. At module scope, these variables must be in the .global state space. As kernel parameters, these variables are declared in the .param state space.
The channel_data_type and channel_order fields have enumeration types corresponding to the source language API. Currently, OpenCL is the only source language that defines these fields. Table 13 and Table 12 show the enumeration values defined in OpenCL version 1.0 for channel data type and channel order.
In PTX, a variable declaration describes both the variable’s type and its state space. In addition to fundamental types, PTX supports types for simple aggregate objects such as vectors and arrays.
All storage for data is specified with variable declarations. Every variable must reside in one of the state spaces enumerated in the previous section.
A variable declaration names the space in which the variable resides, its type and size, its name, an optional array size, an optional initializer, and an optional fixed address for the variable.
Predicate variables may only be declared in the register state space.
Limited-length vector types are supported. Vectors of length 2 and 4 of any non-predicate fundamental type can be declared by prefixing the type with .v2 or .v4. Vectors must be based on a fundamental type, and they may reside in the register space. Vectors cannot exceed 128 bits in length; for example, .v4 .f64 is not allowed. Three-element vectors may be handled by using a .v4 vector, where the fourth element provides padding. This is a common case for three-dimensional grids, textures, etc.
Examples
.global .v4 .f32 V;   // a length-4 vector of floats
.shared .v2 .u16 uv;  // a length-2 vector of unsigned ints
.global .v4 .b8  v;   // a length-4 vector of bytes
By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions which require addresses aligned to a multiple of the access size.
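The size limit and the default alignment rule for vectors amount to simple arithmetic, sketched below in Python. The function name vector_size_bytes is hypothetical; it returns the overall size in bytes, which by the rule above is also the default alignment, or None when the declaration would not be allowed.

```python
MAX_VECTOR_BITS = 128  # vectors cannot exceed 128 bits in length

def vector_size_bytes(length, elem_bits):
    """Overall size (and default alignment) in bytes of a .v2/.v4 vector,
    or None if the length or total width is not permitted."""
    if length not in (2, 4):
        return None
    total_bits = length * elem_bits
    if total_bits > MAX_VECTOR_BITS:
        return None
    return total_bits // 8

print(vector_size_bytes(4, 32))  # .v4 .f32
print(vector_size_bytes(2, 16))  # .v2 .u16
print(vector_size_bytes(4, 64))  # .v4 .f64 is rejected
```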
Array declarations are provided to allow the programmer to reserve space. To declare an array, the variable name is followed with dimensional declarations similar to fixed-size array declarations in C. The size of each dimension is a constant expression.
The size of the array specifies how many elements should be reserved. For example, for an array named kernel declared with two dimensions of 19 elements each and a 16-bit element type, 19*19 = 361 halfwords are reserved, for a total of 722 bytes.
When declared with an initializer, the first dimension of the array may be omitted. The size of the first array dimension is determined by the number of elements in the array initializer.
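The storage arithmetic for arrays can be checked with a few lines of Python. This is a sketch with hypothetical helper names; it reproduces the 19 by 19 halfword case from the text and shows how an omitted first dimension would be taken from a one-dimensional initializer.

```python
from math import prod

def array_bytes(dims, elem_bits):
    """Total bytes reserved: the product of all dimension sizes times the element size."""
    return prod(dims) * (elem_bits // 8)

def first_dimension(initializer):
    """Size of an omitted first dimension: the number of elements in the initializer."""
    return len(initializer)

print(array_bytes((19, 19), 16))            # 361 halfwords
n = first_dimension([2, 3, 5])              # e.g. a .u32 array initialized with { 2, 3, 5 }
print(n, array_bytes((n,), 32))
```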
Declared variables may specify an initial value using a syntax similar to C/C++, where the variable name is followed by an equals sign and the initial value or values for the variable. A scalar takes a single value, while vectors and arrays take nested lists of values inside of curly braces (the nesting matches the dimensionality of the declaration).
As in C, array initializers may be incomplete, i.e., the number of initializer elements may be less than the extent of the corresponding array dimension, with remaining array locations initialized to the default value for the specified array type.
Currently, variable initialization is supported only for constant and global state spaces. Variables in constant and global state spaces with no explicit initializer are initialized to zero by default. Initializers are not allowed in external variable declarations.
Variable names appearing in initializers represent the address of the variable; this can be used to statically initialize a pointer to a variable. Initializers may also contain var+offset expressions, where offset is a byte offset added to the address of var. Only variables in .global or .const state spaces may be used in initializers. By default, the resulting address is the offset in the variable’s state space (as is the case when taking the address of a variable with a mov instruction). An operator, generic(), is provided to create a generic address for variables used in initializers.
Starting with PTX ISA version 7.1, an operator mask() is provided, where mask is an integer immediate. The only allowed expressions in the mask() operator are an integer constant expression and a symbol expression representing the address of a variable. The mask() operator extracts n consecutive bits from the expression used in initializers and inserts these bits at the lowest position of the initialized variable. The number n and the starting position of the bits to be extracted are specified by the integer immediate mask. PTX ISA version 7.1 only supports extracting a single byte starting at a byte boundary from the address of the variable. PTX ISA version 7.3 supports an integer constant expression as an operand in the mask() operator.
.const  .u32 foo = 42;
.global .u32 bar[] = { 2, 3, 5 };
.global .u32 p1 = foo;          // offset of foo in .const space
.global .u32 p2 = generic(foo); // generic address of foo

// array of generic-address pointers to elements of bar
.global .u32 parr[] = { generic(bar), generic(bar)+4,
generic(bar)+8 };

// examples using mask() operator are pruned for brevity
.global .u8 addr[] = {0xff(foo), 0xff00(foo), 0xff0000(foo), ...};

.global .u8 addr2[] = {0xff(foo+4), 0xff00(foo+4), 0xff0000(foo+4),...}

.global .u8 addr3[] = {0xff(generic(foo)), 0xff00(generic(foo)),...}

.global .u8 addr4[] = {0xff(generic(foo)+4), 0xff00(generic(foo)+4),...}

// mask() operator with integer const expression
.global .u8 addr5[] = { 0xFF(1000 + 546), 0xFF00(131187), ...};
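The effect of the mask() operator on an integer constant expression can be modeled on the host as a mask-and-shift. The Python sketch below follows the description above (extract the consecutive bits selected by the mask and place them at the lowest position); the function name apply_mask is hypothetical, and the sketch does not enforce the single-byte restriction of PTX ISA version 7.1.

```python
def apply_mask(mask, value):
    """Extract the consecutive bits of `value` selected by `mask` and return
    them shifted down to bit position 0."""
    if mask <= 0:
        raise ValueError("mask must be a positive integer")
    shift = (mask & -mask).bit_length() - 1   # position of the lowest set bit
    field = mask >> shift
    if field & (field + 1):                   # the set bits must be contiguous
        raise ValueError("mask bits must be consecutive")
    return (value & mask) >> shift

# The two integer-expression initializers shown for addr5:
print(hex(apply_mask(0xFF, 1000 + 546)))      # low byte of 1546 (0x60A)
print(hex(apply_mask(0xFF00, 131187)))        # second byte of 131187 (0x20073)
```

Applied to a symbol address, the same operation yields one byte of the address per array element, which is how the addr arrays in the listing spell out an address byte by byte.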
Note
PTX 3.1 redefines the default addressing for global variables in initializers, from generic addresses to offsets in the global state space. Legacy PTX code is treated as having an implicit generic() operator for each global variable used in an initializer. PTX 3.1 code should either include explicit generic() operators in initializers, use cvta.global to form generic addresses at runtime, or load from the non-generic address using ld.global.
Device function names appearing in initializers represent the address of the first instruction in the function; this can be used to initialize a table of function pointers to be used with indirect calls. Beginning in PTX ISA version 3.1, kernel function names can be used as initializers, e.g., to initialize a table of kernel function pointers, to be used with CUDA Dynamic Parallelism to launch kernels from the GPU. See the CUDA Dynamic Parallelism Programming Guide for details.
Labels cannot be used in initializers.
Variables that hold addresses of variables or functions should be of type .u8 or .u32 or .u64.
Type .u8 is allowed only if the mask() operator is used.
Initializers are allowed for all types except .f16, .f16x2 and .pred.
Byte alignment of storage for all addressable variables can be specified in the variable declaration. Alignment is specified using an optional .align byte-count specifier immediately following the state-space specifier. The variable will be aligned to an address which is an integer multiple of byte-count. The alignment value byte-count must be a power of two. For arrays, alignment specifies the address alignment for the starting address of the entire array, not for individual elements.
The default alignment for scalar and array variables is to a multiple of the base-type size. The default alignment for vector variables is to a multiple of the overall vector size.
Examples
// allocate array at 4-byte aligned address. Elements are bytes.
.const .align 4 .b8 bar[8] = {0,0,0,0,2,0,0,0};
Note that all PTX instructions that access memory require that the address be aligned to a multiple of the access size. The access size of a memory instruction is the total number of bytes accessed in memory. For example, the access size of ld.v4.b32 is 16 bytes, while the access size of atom.f16x2 is 4 bytes.
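The access-size rule is a straightforward divisibility check, sketched below in Python with hypothetical helper names. The two access sizes computed are the ones quoted in the text: a four-element vector of 32-bit values, and a packed pair of 16-bit values treated as one 32-bit access.

```python
def access_size_bytes(elem_bits, vector_len=1):
    """Total number of bytes touched by one memory instruction."""
    return (elem_bits // 8) * vector_len

def is_aligned(address, access_size):
    """True if `address` is a multiple of the access size."""
    return address % access_size == 0

v4_b32 = access_size_bytes(32, vector_len=4)  # as for ld.v4.b32
f16x2 = access_size_bytes(32)                 # as for atom.f16x2 (two 16-bit halves)
print(v4_b32, f16x2)
print(is_aligned(0x1010, v4_b32))             # multiple of 16
print(is_aligned(0x1008, v4_b32))             # only 8-byte aligned
```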
Since PTX supports virtual registers, it is quite common for a compiler frontend to generate a large number of register names. Rather than require explicit declaration of every name, PTX supports a syntax for creating a set of variables having a common prefix string appended with integer suffixes.
For example, suppose a program uses a large number, say one hundred, of .b32 variables, named %r0, %r1, …, %r99. These 100 register variables can be declared as follows:
.reg .b32 %r<100>; // declare %r0, %r1, ..., %r99
This shorthand syntax may be used with any of the fundamental types and with any state space, and may be preceded by an alignment specifier. Array variables cannot be declared this way, nor are initializers permitted.
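The set of names introduced by a parameterized declaration can be enumerated with a short Python sketch. The helper name expand_names is hypothetical; it only spells out which identifiers a declaration such as .reg .b32 %r<100>; makes available.

```python
def expand_names(prefix, count):
    """List the variable names created by `prefix<count>`: prefix0 .. prefix(count-1)."""
    return [f"{prefix}{i}" for i in range(count)]

names = expand_names("%r", 100)
print(len(names), names[0], names[1], names[-1])
```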
Variables may be declared with an optional .attribute directive which allows specifying special attributes of variables. The keyword .attribute is followed by an attribute specification inside parentheses. Multiple attributes are separated by commas.
Used to specify special attributes of a variable or a function.
The following attributes are supported.
.managed
The .managed attribute specifies that the variable will be allocated at a location in the unified virtual memory environment where the host and other devices in the system can reference the variable directly. This attribute can only be used with variables in the .global state space. See the CUDA UVM-Lite Programming Guide for details.
.unified
The .unified attribute specifies that the function has the same memory address on the host and on other devices in the system. Integer constants uuid1 and uuid2 respectively specify the upper and lower 64 bits of the unique identifier associated with the function or the variable. This attribute can only be used on device functions or on variables in the .global state space. Variables with the .unified attribute are read-only and must be loaded by specifying the .unified qualifier on the address operand of the ld instruction; otherwise the behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 4.0.
Support for function attributes introduced in PTX ISA version 8.0.
A tensor is a multi-dimensional matrix structure in memory. A tensor is defined by the following properties:
Dimensionality
Dimension sizes across each dimension
Individual element types
Tensor stride across each dimension
PTX supports instructions which can operate on the tensor data. PTX Tensor instructions include:
Copying data between global and shared memories
Reducing the destination tensor data with the source.
The tensor data can be operated on by various wmma.mma, mma and wgmma.mma_async instructions.
PTX Tensor instructions treat the tensor data in global memory as a multi-dimensional structure and treat the data in shared memory as linear data.
Floating point and alternate floating point: .f16, .bf16, .tf32, .f32, .f64 (rounded to nearest even).
A tensor can have padding at the end in each of the dimensions to provide alignment for the data in the subsequent dimensions. Tensor stride can be used to specify the amount of padding in each dimension.
The sub-byte types are expected to be packed contiguously in global memory, and the Tensor copy instruction will expand them by appending empty spaces as shown below:
Type .b4x16: With this type, there is no padding involved and the sixteen packed .b4 elements in a 64-bit container are copied as is between shared memory and global memory.
Type .b4x16_p64: With this type, sixteen contiguous 4-bit elements are copied from global memory to shared memory with 64 bits of padding appended, as shown in Figure 5.
The padded region that gets added is un-initialized.
Type .b6x16_p32: With this type, sixteen 6-bit elements are copied from global memory to shared memory with 32 bits of padding appended, as shown in Figure 6.
The padded region that gets added is un-initialized.
Type .b6p2x16: With this type, sixteen elements, each containing 6 bits of data at the LSB and 2 bits of padding at the MSB, are copied from shared memory into global memory by discarding the 2 bits of padding and packing the 6-bit data contiguously, as shown in Figure 7.
In case of .b6x16_p32 and .b4x16_p64, the padded region that gets added is un-initialized.
The types .b6x16_p32 and .b6p2x16 share the same encoding value in the descriptor (value 15) as the two types are applicable for different types of tensor copy operations:
A tensor can be accessed in chunks known as Bounding Boxes. A Bounding Box has the same dimensionality as the tensor it accesses. The size of each Bounding Box must be a multiple of 16 bytes. The address of the Bounding Box must also be aligned to 16 bytes.
Bounding Box has the following access properties:
Bounding Box dimension sizes
Out of boundary access mode
Traversal strides
The tensor coordinates, specified in the PTX tensor instructions, specify the starting offset of the bounding box. The starting offset of the bounding box, together with the rest of the bounding box information, is used to determine the elements which are to be accessed.
While the Bounding Box is iterating the tensor across a dimension, the traversal stride specifies the exact number of elements to be skipped. If no jump over is required, the default value of 1 must be specified.
The traversal stride in dimension 0 can be used for the Interleave layout. For non-interleaved layout, the traversal stride in dimension 0 must always be 1.
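The effect of the traversal stride on a single dimension in tiled mode can be sketched as follows. This is an illustration under a stated assumption: the bounding box size is measured in tensor space and every stride-th element is visited, so ceil(box_size / stride) elements are accessed. The function name tiled_coords is hypothetical.

```python
import math


def tiled_coords(start, box_size, stride=1):
    """Coordinates visited along one dimension of the bounding box.

    Assumption: the box spans `box_size` tensor elements from `start`,
    and only every `stride`-th element is accessed.
    """
    count = math.ceil(box_size / stride)
    return [start + k * stride for k in range(count)]


dense = tiled_coords(4, 8)              # default stride of 1, no elements skipped
strided = tiled_coords(4, 8, stride=2)  # every second element
```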
These modes are similar to the tiled mode with the restriction that these modes work only on 2D tensor data. Tile::scatter4 and Tile::gather4 modes are used to access multiple non-contiguous rows of tensor data.
In Tile::scatter4 mode, a single 2D source tensor is divided into four rows in the 2D destination tensor. In Tile::gather4 mode, four rows in the source 2D tensor are combined to form a single 2D destination tensor.
These modes work on four rows and hence the instruction will take:
four tensor coordinates across the dimension 0
one tensor coordinate across the dimension 1
The interleave layout is not supported for .tile::scatter4 and .tile::gather4 modes.
All other constraints and rules of the tile mode apply to these modes as well.
Im2col mode supports the following tensor dimensions: 3D, 4D and 5D. In this mode, the tensor data is treated as a batch of images with the following properties:
N: number of images in the batch
D, H, W: size of a 3D image (depth, height and width)
C: channels per image element
The above properties are associated with 3D, 4D and 5D tensors as follows:
In im2col mode, the Bounding Box is defined in DHW space. Boundaries along other dimensions are specified by Pixels-per-Column and Channels-per-Pixel parameters as described below.
The dimensionality of the Bounding Box is two less than the tensor dimensionality.
The following properties describe how the elements are accessed in im2col mode:
Bounding-Box Lower-Corner
Bounding-Box Upper-Corner
Pixels-per-Column
Channels-per-Pixel
Bounding-box Lower-Corner and Bounding-box Upper-Corner specify the two opposite corners of the Bounding Box in the DHW space. Bounding-box Lower-Corner specifies the corner with the smallest coordinate and Bounding-box Upper-Corner specifies the corner with the largest coordinate.
Bounding-box Upper- and Lower-Corners are 16-bit signed values whose limits vary across the dimensions and are as shown below:
The Bounding-box Upper- and Lower-Corners specify only the boundaries and not the number of elements to be accessed. Pixels-per-Column specifies the number of elements to be accessed in the NDHW space.
Channels-per-Pixel specifies the number of elements to access across the C dimension.
The tensor coordinates, specified in the PTX tensor instructions, behave differently in different dimensions:
Across N and C dimensions: specify the starting offsets along the dimension, similar to the tiled mode.
Across DHW dimensions: specify the location of the convolution filter base in the tensor space. The filter corner location must be within the bounding box.
The im2col offsets, specified in the PTX tensor instructions in im2col mode, are added to the filter base coordinates to determine the starting location in the tensor space from where the elements are accessed.
The size of the im2col offsets varies across the dimensions and their valid ranges are as shown below:
                        3D             4D            5D
im2col offsets range    [0, 2^16-1]    [0, 2^8-1]    [0, 2^5-1]
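Reading the ranges above as powers of two (16, 8 and 5 bits per offset for 3D, 4D and 5D tensors), a range check can be sketched as follows. The names IM2COL_OFFSET_BITS, im2col_offset_range and offsets_valid are hypothetical helpers for illustration only.

```python
# Bits available for each im2col offset, keyed by tensor dimensionality.
IM2COL_OFFSET_BITS = {3: 16, 4: 8, 5: 5}


def im2col_offset_range(tensor_rank):
    """Inclusive (low, high) range of a single im2col offset."""
    bits = IM2COL_OFFSET_BITS[tensor_rank]
    return 0, 2 ** bits - 1


def offsets_valid(tensor_rank, offsets):
    """True when every offset lies inside the range for this rank."""
    low, high = im2col_offset_range(tensor_rank)
    return all(low <= offset <= high for offset in offsets)
```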
Following are some examples of the im2col mode accesses:
Unlike the tiled mode, the traversal stride in im2col mode does not impact the total number of elements (or pixels) being accessed. Pixels-per-Column determines the total number of elements being accessed in im2col mode.
The number of elements traversed along the D, H and W dimensions is strided by the traversal stride for that dimension.
The following example with Figure 15 illustrates access with traversal strides:
Tensor Size[0] = 64
Tensor Size[1] = 8
Tensor Size[2] = 14
Tensor Size[3] = 64
Traversal Stride = 2
Pixels-per-Column = 32
channels-per-pixel = 16
Bounding-Box Lower-Corner W = -1
Bounding-Box Lower-Corner H = -1
Bounding-Box Upper-Corner W = -1
Bounding-Box Upper-Corner H = -1
Tensor coordinates in the instruction = (7, 7, 5, 0)
Im2col offsets in the instruction : (1, 1)
In im2col mode, when the number of requested pixels in NDHW space specified by Pixels-per-Column exceeds the number of available pixels in the image batch, an out-of-bounds access is performed.
Similar to tiled mode, zero fill or OOB-NaN fill can be performed based on the Fill-Mode specified.
These modes are similar to the im2col mode with the restriction that elements are accessed across the W dimension only while keeping the H and D dimensions constant.
All the constraints and rules of the im2col mode apply to these modes as well.
The number of elements accessed in the im2col::w::128 mode is fixed and is equal to 128. The number of elements accessed in the im2col::w mode depends on the Pixels-per-Column field in the TensorMap.
In these modes, the size of the bounding box in the D and H dimensions is 1.
The D and H dimensions in the tensor coordinates argument in the PTX instruction specify the position of the bounding box in the tensor space.
The Bounding-Box Lower-Corner-W and Bounding-Box Upper-Corner-W specify the two opposite corners of the Bounding Box in the W dimension.
The W dimension in the tensor coordinates argument in the PTX instruction specifies the location of the first element that is to be accessed in the bounding box.
The number of pixels loaded in im2col::w mode is as specified by Pixels-per-Column in the TensorMap. The number of pixels loaded in im2col::w::128 mode is always 128, so Pixels-per-Column is ignored in im2col::w::128 mode.
Figure 16 shows an example of the im2col::w and im2col::w::128 modes.
Figure 16. im2col::w and im2col::w::128 modes example
The first element can lie outside of the Bounding Box in the W dimension only, and only on the left side of the Bounding Box. Figure 17 shows an example of this.
Figure 17. im2col::w and im2col::w::128 modes first element outside Bounding Box example
This is similar to im2col mode, with the exception that the number of elements traversed is strided by the traversal stride, as specified in the TensorMap, along the W dimension only.
In im2col::w mode, the wHalo argument in the PTX instruction specifies how many filter halo elements must be loaded at the end of the image.
In im2col::w::128 mode, the halo elements are loaded after every 32 elements in the bounding box along the W dimension. The wHalo argument in the PTX instruction specifies how many halo elements must be loaded after every 32 elements.
Following is an example of .im2col::w mode access:
Tensor Size [0] = 128
Tensor Size [1] = 9
Tensor Size [2] = 7
Tensor Size [3] = 64
Pixels-per-column = 128
Channels-per-pixel = 64
Bounding Box Lower Corner W = 0
Bounding Box Upper Corner W = 0
Tensor Coordinates in the instruction = (7, 2, 3, 0)
wHalo in the instruction = 2 (as 3x3 convolution filter is used)
A tensor copy operation with the above parameters loads 128 pixels and the two halo pixels as shown in Figure 18.
Figure 18. tensor copy operation with im2col::w mode example
The halo pixels are always loaded in the shared memory next to the main row pixels as shown in Figure 18.
Following is an example of .im2col::w::128 mode access:
Tensor Size [0] = 128
Tensor Size [1] = 9
Tensor Size [2] = 7
Tensor Size [3] = 64
Channels-per-pixel = 64
Bounding Box Lower Corner W = 0
Bounding Box Upper Corner W = 0
Tensor Coordinates in the instruction = (7, 2, 3, 0)
wHalo in the instruction = 2 (as 3x3 convolution filter is used)
A tensor copy operation with the above parameters loads 128 elements such that after every 32 elements, wHalo number of elements are loaded as shown in Figure 19.
Figure 19. tensor copy operation with im2col::w::128 mode example
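The element counts in the two examples above can be tallied with a short Python sketch. The function names are hypothetical. The im2col::w::128 count assumes that halo elements follow each of the four 32-element groups, including the last one; the text does not state that total explicitly.

```python
def elements_loaded_im2col_w(pixels_per_column, w_halo):
    """im2col::w: the main row pixels followed by one run of halo pixels."""
    return pixels_per_column + w_halo


def elements_loaded_im2col_w_128(w_halo, group=32):
    """im2col::w::128: 128 pixels, with w_halo halo pixels after every `group` pixels.

    Assumption: the final group is also followed by halo pixels.
    """
    groups = 128 // group
    return 128 + groups * w_halo


total_w = elements_loaded_im2col_w(128, 2)   # 128 pixels plus two halo pixels
total_w_128 = elements_loaded_im2col_w_128(2)
```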
In the convolution calculations, the same elements along the W dimension are reused for different locations within the convolution filter footprint. Based on the number of times a pixel is used, the pixels may be loaded into different shared memory buffers. Each buffer can be loaded by a separate tensor copy operation.
The wOffset argument in the tensor copy and prefetch instruction adjusts the source pixel location for each buffer. The exact position of the buffer is adjusted along the W dimension using the following formula:
Bounding Box Lower Corner W += wOffset
Bounding Box Upper Corner W += wOffset
W += wOffset
Following are examples of tensor copy to multiple buffers with various wHalo and wOffset values:
Tensor can be interleaved and the following interleave layouts are supported:
No interleave (NDHWC)
8 byte interleave (NC/8DHWC8) : C8 utilizes 16 bytes in memory assuming 2B per channel.
16 byte interleave (NC/16HWC16) : C16 utilizes 32 bytes in memory assuming 4B per channel.
The C information is organized in slices where sequential C elements are grouped in 16 byte or 32 byte quantities.
If the total number of channels is not a multiple of the number of channels per slice, then the last slice must be padded with zeros to make it a complete 16B or 32B slice.
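The padding rule can be worked through numerically. The sketch below is illustrative: interleave_slices is a hypothetical helper, and the slice size (16 or 32 bytes) and bytes per channel are passed in explicitly rather than derived from a layout name.

```python
import math


def interleave_slices(num_channels, bytes_per_channel, slice_bytes):
    """Return (channels per slice, number of slices, zero-padded channels)."""
    per_slice = slice_bytes // bytes_per_channel
    slices = math.ceil(num_channels / per_slice)
    padding = slices * per_slice - num_channels
    return per_slice, slices, padding


# 20 channels of 2 bytes in 16-byte slices: the last slice is only half full.
partial = interleave_slices(20, 2, 16)
# 64 channels of 2 bytes in 32-byte slices: no padding is needed.
exact = interleave_slices(64, 2, 32)
```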
Interleaved layouts are supported only for the dimensionalities: 3D, 4D and 5D.
The interleave layout is not supported for .im2col::w and .im2col::w::128 modes.
The layout of the data in shared memory can be different from that in global memory, for access performance reasons. The following describes various swizzling modes:
No swizzle mode:
There is no swizzling in this mode and the destination data layout is exactly the same as the source data layout.
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
… Pattern repeats …
32 byte swizzle mode:
The following table, where each element (numbered cell) is 16 bytes and the starting address is 256-byte aligned, shows the pattern of the destination data layout:
0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
… Pattern repeats …
An example of the 32 byte swizzle mode for NC/(32B)HWC(32B) tensor of 1x2x10x10xC16 dimension, with the innermost dimension holding slice of 16 channels with 2 byte/channel, is shown in Figure 25.
Figure 27 shows the destination data layout with 32 byte swizzling.
Figure 27. 32-byte swizzle mode destination data layout
64 byte swizzle mode:
The following table, where each element (numbered cell) is 16 bytes and the starting address is 512-byte aligned, shows the pattern of the destination data layout:
0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
2 3 0 1 6 7 4 5
3 2 1 0 7 6 5 4
… Pattern repeats …
An example of the 64 byte swizzle mode for NHWC tensor of 1x10x10x64 dimension, with 2 bytes / channel and 32 channels, is shown in Figure 28.
Figure 30 shows the destination data layout with 64 byte swizzling.
Figure 30. 64-byte swizzle mode destination data layout
96 byte swizzle mode:
The following table, where each element (numbered cell) is 16 bytes, shows the swizzling pattern at the destination data layout:
0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
… Pattern repeats …
An example of the data layout in global memory and its swizzled data layout in shared memory, where each element (colored cell) is 16 bytes and the starting address is 256-byte aligned, is shown in Figure 31.
The 128-byte swizzling mode supports the following sub-modes:
16-byte atomicity sub-mode:
In this sub-mode, each 16 bytes of data is kept intact while swizzling.
The following table, where each element (numbered cell) is 16 bytes and the starting address is 1024-byte aligned, shows the pattern of the destination data layout:
0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
2 3 0 1 6 7 4 5
3 2 1 0 7 6 5 4
4 5 6 7 0 1 2 3
5 4 7 6 1 0 3 2
6 7 4 5 2 3 0 1
7 6 5 4 3 2 1 0
… Pattern repeats …
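The 32-byte, 64-byte and 128-byte (16-byte atomicity) tables above are consistent with a simple rule: the cell at row r and column c holds c XOR r, with the pattern repeating every swizzle_bytes / 16 rows. The Python sketch below regenerates the tables from that observation. It is derived from the printed patterns only and is not a statement of how the hardware computes addresses; swizzle_pattern is a hypothetical helper, and the 96-byte mode is not covered.

```python
def swizzle_pattern(swizzle_bytes):
    """Rows of 16-byte chunk indices for the 32B, 64B and 128B swizzle modes."""
    if swizzle_bytes not in (32, 64, 128):
        raise ValueError("sketch covers only the 32B, 64B and 128B modes")
    rows = swizzle_bytes // 16  # 32B -> 2 rows, 64B -> 4 rows, 128B -> 8 rows
    return [[col ^ row for col in range(8)] for row in range(rows)]


for line in swizzle_pattern(64):
    print(" ".join(str(cell) for cell in line))
```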
An example of the 128 byte swizzle mode for NHWC tensor of 1x10x10x64 dimension, with 2 bytes / channel and 64 channels, is shown in Figure 32.
Each colored cell represents 8 channels. Figure 33 shows the source data layout.
Figure 33. 128-byte swizzle mode source data layout
Figure 34 shows the destination data layout with 128 byte swizzling.
Figure 34. 128-byte swizzle mode destination data layout
32-byte atomicity sub-mode:
In this sub-mode, each 32 bytes of data is kept intact while swizzling.
The following table, where each element (numbered cell) is 16 bytes, shows the swizzling pattern at the destination data layout:
0 1 | 2 3 | 4 5 | 6 7
2 3 | 0 1 | 6 7 | 4 5
4 5 | 6 7 | 0 1 | 2 3
6 7 | 4 5 | 2 3 | 0 1
… Pattern repeats …
This sub-mode requires 32 byte alignment at shared memory.
An example of the data layout in global memory and its swizzled data layout in shared memory, where each element (colored cell) is 16 bytes, is shown in Figure 35.
Figure 35. 128-byte swizzle mode example with 32-byte atomicity
32-byte atomicity with 8-byte flip sub-mode:
The swizzling pattern for this sub-mode is similar to the 32-byte atomicity sub-mode except that there is a flip of adjacent 8-bytes within the 16-byte data at every alternate shared memory line.
An example of the data layout in global memory and its swizzled data layout in shared memory, where each element (colored cell) is 16 bytes (two 8-byte sub-elements for each 16-byte colored cell are shown to show the flip), is shown in Figure 36.
Figure 36. 128-byte swizzle mode example with 32-byte atomicity with 8-byte flip
64-byte atomicity sub-mode:
In this sub-mode, each 64 bytes of data is kept intact while swizzling.
The following table, where each element (numbered cell) is 16 bytes, shows the swizzling pattern at the destination data layout:
0 1 2 3 | 4 5 6 7
4 5 6 7 | 0 1 2 3
… Pattern repeats …
This sub-mode requires 64-byte alignment at shared memory.
An example of the data layout in global memory and its swizzled data layout in shared memory, where each element (colored cell) is 16 bytes, is shown in Figure 37.
Figure 37. 128-byte swizzle mode example with 64-byte atomicity
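The 16-byte, 32-byte and 64-byte atomicity tables for the 128-byte swizzle mode share one structure: 16-byte chunks are grouped into atomic units, and the unit index is XORed with the row index. The Python sketch below reproduces the three tables from that observation. It is inferred from the printed patterns, does not model the 8-byte flip sub-mode, and swizzle_128b is a hypothetical helper.

```python
def swizzle_128b(atomicity_bytes):
    """Rows of atomic groups (lists of 16-byte chunk indices) for 128B swizzle."""
    if atomicity_bytes not in (16, 32, 64):
        raise ValueError("sketch covers 16B, 32B and 64B atomicity only")
    chunks_per_group = atomicity_bytes // 16   # chunks kept together
    groups_per_row = 8 // chunks_per_group     # eight 16-byte chunks per row
    table = []
    for row in range(groups_per_row):          # pattern repeats after this many rows
        line = []
        for group in range(groups_per_row):
            source = group ^ row               # swap whole groups, never split them
            first = source * chunks_per_group
            line.append(list(range(first, first + chunks_per_group)))
        table.append(line)
    return table
```

For example, swizzle_128b(32)[1] yields the second row of the 32-byte atomicity table, with the pairs (2 3), (0 1), (6 7), (4 5).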
Table 14 lists the valid combinations of swizzle-atomicity with the swizzling-mode.
Table 14. Valid combination of swizzle-atomicity with swizzling-mode
Swizzling Mode         Swizzle-Atomicity
No Swizzling           –
32B Swizzling Mode     16B
64B Swizzling Mode     16B
96B Swizzling Mode     16B
128B Swizzling Mode    16B
                       32B
                       32B + 8B-flip
                       64B
The value of swizzle base offset is 0 when the dstMem shared memory address is located at the following boundary:
Swizzling Mode      Starting address of the repeating pattern
128-Byte swizzle    1024-Byte boundary
96-Byte swizzle     256-Byte boundary
64-Byte swizzle     512-Byte boundary
32-Byte swizzle     256-Byte boundary
Otherwise, the swizzle base offset is a non-zero value, computed using the following formula:
The tensor-map is a 128-byte opaque object, residing in .const space, .param (kernel function parameter) space, or .global space, which describes the tensor properties and the access properties of the tensor data described in previous sections.
A tensor-map can be created using CUDA APIs. Refer to the CUDA programming guide for more details.
All operands in instructions have a known type from their declarations. Each operand type must be compatible with the type determined by the instruction template and instruction type. There is no automatic conversion between types.
The bit-size type is compatible with every type having the same size. Integer types of a common size are compatible with each other. Operands having type different from but compatible with the instruction type are silently cast to the instruction type.
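The compatibility rules above can be expressed as a small predicate. This Python sketch handles only plain .bN, .sN, .uN and .fN type names; parse_type and is_compatible are hypothetical helpers, not part of any PTX tool.

```python
import re

_TYPE = re.compile(r"\.([bsuf])(\d+)")


def parse_type(name):
    """Split a fundamental type such as '.u32' into (kind, bits)."""
    match = _TYPE.fullmatch(name)
    if match is None:
        raise ValueError(f"unsupported type in this sketch: {name}")
    return match.group(1), int(match.group(2))


def is_compatible(operand_type, instruction_type):
    """Apply the two compatibility rules stated in the text."""
    operand_kind, operand_bits = parse_type(operand_type)
    instr_kind, instr_bits = parse_type(instruction_type)
    if operand_bits != instr_bits:
        return False                       # sizes must match in every case
    if operand_kind == instr_kind:
        return True                        # identical types
    if "b" in (operand_kind, instr_kind):
        return True                        # bit-size type matches any same-size type
    return {operand_kind, instr_kind} == {"s", "u"}  # same-size integer types
```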
The source operands are denoted in the instruction descriptions by the names a, b, and c. PTX describes a load-store machine, so operands for ALU instructions must all be in variables declared in the .reg register state space. For most operations, the sizes of the operands must be consistent.
The cvt (convert) instruction takes a variety of operand types and sizes, as its job is to convert from nearly any data type to any other data type (and size).
The ld, st, mov, and cvt instructions copy data from one location to another. Instructions ld and st move data from/to addressable state spaces to/from registers. The mov instruction copies data between registers.
Most instructions have an optional predicate guard that controls conditional execution, and a few instructions have additional predicate source operands. Predicate operands are denoted by the names p, q, r, s.
PTX instructions that produce a single result store the result in the field denoted by d (for destination) in the instruction descriptions. The result operand is a scalar or vector variable in the register state space.
The register containing an address may be declared as a bit-size type or integer type.
The access size of a memory instruction is the total number of bytes accessed in memory. For example, the access size of ld.v4.b32 is 16 bytes, while the access size of atom.f16x2 is 4 bytes.
The address must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined. For example, among other things, the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault.
The address size may be either 32-bit or 64-bit. 128-bit addresses are not supported. Addresses are zero-extended to the specified width as needed, and truncated if the register width exceeds the state space address width for the target architecture.
Address arithmetic is performed using integer arithmetic and logical instructions. Examples include pointer arithmetic and pointer comparisons. All addresses and address computations are byte-based; there is no support for C-style pointer arithmetic.
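The two points above, width adjustment and byte-based arithmetic, can be sketched in Python. The helpers effective_address and element_address are hypothetical and only model the stated rules: a register value is treated as unsigned (so widening zero-extends), it is truncated when the register is wider than the state space address, and element offsets must be scaled to bytes explicitly.

```python
def effective_address(register_value, register_bits, space_bits):
    """Model zero-extension and truncation of an address register."""
    unsigned = register_value & ((1 << register_bits) - 1)  # unsigned view: widening zero-extends
    return unsigned & ((1 << space_bits) - 1)               # drop bits beyond the address width


def element_address(base, index, element_bytes):
    """Byte address of element `index`; the scaling is explicit, unlike C pointers."""
    return base + index * element_bytes
```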
The mov instruction can be used to move the address of a variable into a pointer. The address is an offset in the state space in which the variable is declared. Load and store operations move data between registers and locations in addressable state spaces. The syntax is similar to that used in many assembly languages, where scalar variables are simply named and addresses are de-referenced by enclosing the address expression in square brackets. Address expressions include variable names, address registers, address register plus byte offset, and immediate address expressions which evaluate at compile-time to a constant address.
If a memory instruction does not specify a state space, the operation is performed using generic addressing. The state spaces .const, Kernel Function Parameters (.param), .local and .shared are modeled as windows within the generic address space. Each window is defined by a window base and a window size that is equal to the size of the corresponding state space. A generic address maps to global memory unless it falls within the window for const, local, or shared memory. The Kernel Function Parameters (.param) window is contained within the .global window. Within each window, a generic address maps to an address in the underlying state space by subtracting the window base from the generic address.
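The window model described above can be illustrated with a lookup in Python. The window bases and sizes below are invented purely for illustration; real values are target specific and are not given in this document. resolve_generic is a hypothetical helper.

```python
# Hypothetical (base, size) pairs for illustration only; not real hardware values.
WINDOWS = {
    "const": (0x0100_0000, 0x0001_0000),
    "local": (0x0200_0000, 0x0100_0000),
    "shared": (0x0400_0000, 0x0000_C000),
}


def resolve_generic(address):
    """Map a generic address to (state space, address within that space)."""
    for space, (base, size) in WINDOWS.items():
        if base <= address < base + size:
            return space, address - base   # subtract the window base
    return "global", address               # everything else maps to global memory
```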
Arrays of all types can be declared, and the identifier becomes an address constant in the space where the array is declared. The size of the array is a constant in the program.
Array elements can be accessed using an explicitly calculated byte address, or by indexing into the array using square-bracket notation. The expression within square brackets is either a constant integer, a register variable, or a simple register with constant offset expression, where the offset is a constant expression that is either added or subtracted from a register variable. If more complicated indexing is desired, it must be written as an address calculation prior to use. Examples are:
ld.global.u32  s, a[0];
ld.global.u32  s, a[N-1];
mov.u32        s, a[1];  // move address of a[1] into s
Vector operands can be specified as source and destination operands for instructions. However, when specified as a destination operand, all elements in the vector expression must be unique; otherwise behavior is undefined. Vectors may also be passed as arguments to called functions.
Vector elements can be extracted from the vector with the suffixes .x, .y, .z and .w, as well as the typical color fields .r, .g, .b and .a.
A brace-enclosed list is used for pattern matching to pull apart vectors.
Vector loads and stores can be used to implement wide loads and stores, which may improve memory performance. The registers in the load/store operations can be a vector, or a brace-enclosed list of similarly typed scalars. Here are examples:
Labels and function names can be used only in bra/brx.idx and call instructions respectively. Function names can be used in the mov instruction to get the address of the function into a register, for use in an indirect call.
Beginning in PTX ISA version 3.1, the mov instruction may be used to take the address of kernel functions, to be passed to a system call that initiates a kernel launch from the GPU. This feature is part of the support for CUDA Dynamic Parallelism. See the CUDA Dynamic Parallelism Programming Guide for details.
All operands to all arithmetic, logic, and data movement instructions must be of the same type and size, except for operations where changing the size and/or type is part of the definition of the instruction. Operands of different sizes or types must be converted prior to the operation.
Table 15 and Table 16 show what precision and format the cvt instruction uses given operands of differing types. For example, if a cvt.s32.u16 instruction is given a u16 source operand and s32 as a destination operand, the u16 is zero-extended to s32.
Conversions to floating-point that are beyond the range of floating-point numbers are represented with the maximum floating-point value (IEEE 754 Inf for f32 and f64, and ~131,000 for f16).
Table 15. Convert Instruction Precision and Format Table 1
(rows: source format; columns: destination format)
Source   s8     s16    s32    s64    u8     u16    u32    u64    f16    f32    f64    bf16   tf32
s8       –      sext   sext   sext   –      sext   sext   sext   s2f    s2f    s2f    s2f    –
s16      chop¹  –      sext   sext   chop¹  –      sext   sext   s2f    s2f    s2f    s2f    –
s32      chop¹  chop¹  –      sext   chop¹  chop¹  –      sext   s2f    s2f    s2f    s2f    –
s64      chop¹  chop¹  chop¹  –      chop¹  chop¹  chop¹  –      s2f    s2f    s2f    s2f    –
u8       –      zext   zext   zext   –      zext   zext   zext   u2f    u2f    u2f    u2f    –
u16      chop¹  –      zext   zext   chop¹  –      zext   zext   u2f    u2f    u2f    u2f    –
u32      chop¹  chop¹  –      zext   chop¹  chop¹  –      zext   u2f    u2f    u2f    u2f    –
u64      chop¹  chop¹  chop¹  –      chop¹  chop¹  chop¹  –      u2f    u2f    u2f    u2f    –
f16      f2s    f2s    f2s    f2s    f2u    f2u    f2u    f2u    –      f2f    f2f    f2f    –
f32      f2s    f2s    f2s    f2s    f2u    f2u    f2u    f2u    f2f    –      f2f    f2f    f2f
f64      f2s    f2s    f2s    f2s    f2u    f2u    f2u    f2u    f2f    f2f    –      f2f    –
bf16     f2s    f2s    f2s    f2s    f2u    f2u    f2u    f2u    f2f    f2f    f2f    f2f    –
tf32     –      –      –      –      –      –      –      –      –      –      –      –      –
Table 16. Convert Instruction Precision and Format Table 2
(rows: source format; columns: destination format)
Source   f16    f32    bf16   e4m3   e5m2   e2m3   e3m2   e2m1   ue8m0
f16      –      f2f    f2f    f2f    f2f    –      –      –      –
f32      f2f    –      f2f    f2f    f2f    f2f    f2f    f2f    f2f
bf16     f2f    f2f    f2f    –      –      –      –      –      f2f
e4m3     f2f    –      –      –      –      –      –      –      –
e5m2     f2f    –      –      –      –      –      –      –      –
e2m3     f2f    –      –      –      –      –      –      –      –
e3m2     f2f    –      –      –      –      –      –      –      –
e2m1     f2f    –      –      –      –      –      –      –      –
ue8m0    –      –      f2f    –      –      –      –      –      –
Notes
sext = sign-extend; zext = zero-extend; chop = keep only low bits that fit.
¹ If the destination register is wider than the destination format, the result is extended to the destination register width after chopping. The type of extension (sign or zero) is based on the destination format. For example, cvt.s16.u32 targeting a 32-bit register first chops to 16-bit, then sign-extends to 32-bit.
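The integer entries in the conversion tables (sext, zext and chop) can be modeled on raw bit patterns. The Python sketch below uses hypothetical helpers and represents register contents as non-negative integers; it reproduces the two examples given in the text, cvt.s32.u16 and cvt.s16.u32 into a 32-bit register.

```python
def chop(value, bits):
    """Keep only the low `bits` bits."""
    return value & ((1 << bits) - 1)


def zext(value, from_bits, to_bits):
    """Zero-extend a `from_bits` pattern; the upper bits are simply zero."""
    return chop(value, from_bits)


def sext(value, from_bits, to_bits):
    """Sign-extend a `from_bits` pattern to `to_bits` bits."""
    value = chop(value, from_bits)
    if value >> (from_bits - 1):       # sign bit set: interpret as negative
        value -= 1 << from_bits
    return chop(value, to_bits)


# cvt.s32.u16: the u16 source is zero-extended to 32 bits.
cvt_s32_u16 = zext(0xFFFF, 16, 32)
# cvt.s16.u32 into a 32-bit register: chop to 16 bits, then sign-extend to 32.
cvt_s16_u32 = sext(chop(0x0001_8000, 16), 16, 32)
```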
Conversion instructions may specify a rounding modifier. In PTX, there are four integer rounding modifiers and six floating-point rounding modifiers. Table 17 and Table 18 summarize the rounding modifiers.
Operands from different state spaces affect the speed of an operation. Registers are fastest, while global memory is slowest. Much of the delay to memory can be hidden in a number of ways. The first is to have multiple threads of execution so that the hardware can issue a memory operation and then switch to other execution. Another way to hide latency is to issue the load instructions as early as possible, as execution is not blocked until the desired result is used in a subsequent (in time) instruction. The register in a store operation is available much more quickly. Table 19 gives estimates of the costs of using different kinds of memory.
Table 19. Cost Estimates for Accessing State-Spaces
Rather than expose details of a particular calling convention, stack layout, and Application Binary Interface (ABI), PTX provides a slightly higher-level abstraction and supports multiple ABI implementations. In this section, we describe the features of PTX needed to achieve this hiding of the ABI. These include syntax for function definitions, function calls, parameter passing, and memory allocated on the stack (alloca).
Refer to the PTX Writers Guide to Interoperability for details on generating PTX compliant with the Application Binary Interface (ABI) for the CUDA® architecture.
In PTX, functions are declared and defined using the .func directive. A function declaration specifies an optional list of return parameters, the function name, and an optional list of input parameters; together these specify the function’s interface, or prototype. A function definition specifies both the interface and the body of the function. A function must be declared or defined prior to being called.
The simplest function has no parameters or return values, and is represented in PTX as follows:
.func foo
{
    ...
    ret;
}

    ...
    call foo;
    ...
Here, execution of the call instruction transfers control to foo, implicitly saving the return address. Execution of the ret instruction within foo transfers control to the instruction following the call.
Scalar and vector base-type input and return parameters may be represented simply as register variables. At the call, arguments may be register variables or constants, and return values may be placed directly into register variables. The arguments and return variables at the call must have type and size that match the callee’s corresponding formal parameters.
When using the ABI, .reg state space parameters must be at least 32 bits in size. Subword scalar objects in the source language should be promoted to 32-bit registers in PTX, or use .param state space byte arrays described next.
Objects such as C structures and unions are flattened into registers or byte arrays in PTX and are represented using .param space memory. For example, consider the following C structure, passed by value to a function:
struct {
    double dbl;
    char   c[4];
};
In PTX, this structure will be flattened into a byte array. Since memory accesses are required to be aligned to a multiple of the access size, the structure in this example will be a 12 byte array with 8 byte alignment so that accesses to the .f64 field are aligned. The .param state space is used to pass the structure by value:
In this example, note that .param space variables are used in two ways. First, a .param variable y is used in function definition bar to represent a formal parameter. Second, a .param variable py is declared in the body of the calling function and used to set up the structure being passed to bar.
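The 12-byte flattened layout of the structure (an 8-byte double at offset 0 followed by four bytes at offset 8) can be checked with Python's standard struct module. This sketch only models the byte layout; it does not represent PTX parameter passing, and the sample field values are arbitrary.

```python
import struct

# '<' selects standard sizes with no implicit padding: f64 at offset 0, 4 bytes at offset 8.
LAYOUT = struct.Struct("<d4s")

buffer = bytearray(LAYOUT.size)            # stands in for the 12-byte .param array
LAYOUT.pack_into(buffer, 0, 1.5, b"abcd")  # arbitrary sample values

dbl = struct.unpack_from("<d", buffer, 0)[0]  # 8-byte access at an 8-byte-aligned offset
chars = bytes(buffer[8:12])                   # the four char elements
```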
The following is a conceptual way to think about the .param state space use in device functions.
For a caller,
The .param state space is used to set values that will be passed to a called function and/or to receive return values from a called function. Typically, a .param byte array is used to collect together fields of a structure being passed by value.
For a callee,
The .param state space is used to receive parameter values and/or pass return values back to the caller.
The following restrictions apply to parameter passing.
For a caller,
Arguments may be .param variables, .reg variables, or constants.
In the case of .param space formal parameters that are byte arrays, the argument must also be a .param space byte array with matching type, size, and alignment. A .param argument must be declared within the local scope of the caller.
In the case of .param space formal parameters that are base-type scalar or vector variables, the corresponding argument may be either a .param or .reg space variable with matching type and size, or a constant that can be represented in the type of the formal parameter.
In the case of .reg space formal parameters, the corresponding argument may be either a .param or .reg space variable of matching type and size, or a constant that can be represented in the type of the formal parameter.
In the case of .reg space formal parameters, the register must be at least 32 bits in size.
All st.param instructions used for passing arguments to a function call must immediately precede the corresponding call instruction, and the ld.param instruction used for collecting the return value must immediately follow the call instruction without any control flow alteration. st.param and ld.param instructions used for argument passing cannot be predicated. This enables compiler optimization and ensures that the .param variable does not consume extra space in the caller’s frame beyond that needed by the ABI. The .param variable simply allows a mapping to be made at the call site between data that may be in multiple locations (e.g., structure being manipulated by caller is located in registers and memory) to something that can be passed as a parameter or return value to the callee.
For a callee,
Input and return parameters may be .param variables or .reg variables.
Parameters in .param memory must be aligned to a multiple of 1, 2, 4, 8, or 16 bytes.
Parameters in the .reg state space must be at least 32 bits in size.
The .reg state space can be used to receive and return base-type scalar and vector values, including sub-word size objects when compiling in non-ABI mode. Supporting the .reg state space provides legacy support.
Note that the choice of .reg or .param state space for parameter passing has no impact on whether the parameter is ultimately passed in physical registers or on the stack. The mapping of parameters to physical registers and stack locations depends on the ABI definition and the order, size, and alignment of parameters.
In PTX ISA version 1.x, formal parameters were restricted to the .reg state space, and there was no support for array parameters. Objects such as C structures were flattened and passed or returned using multiple registers. PTX ISA version 1.x supports multiple return values for this purpose.
Beginning with PTX ISA version 2.0, formal parameters may be in either the .reg or .param state space, and .param space parameters support arrays. For targets sm_20 or higher, PTX restricts functions to a single return value, and a .param byte array should be used to return objects that do not fit into a register. PTX continues to support multiple return registers for sm_1x targets.
Note
PTX implements a stack-based ABI only for targets sm_20 or higher.
PTX ISA versions prior to 3.0 permitted variables in .reg and .local state spaces to be defined at module scope. When compiling to use the ABI, PTX ISA version 3.0 and later disallows module-scoped .reg and .local variables and restricts their use to within function scope. When compiling without use of the ABI, module-scoped .reg and .local variables are supported as before. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg or .local variables, the compiler silently disables use of the ABI.
PTX provides the alloca instruction for allocating storage at runtime on the per-thread local memory stack. The allocated stack memory can be accessed with ld.local and st.local instructions using the pointer returned by alloca.
To facilitate deallocation of memory allocated with alloca, PTX provides two additional instructions: stacksave, which reads the value of the stack pointer into a local variable, and stackrestore, which restores the stack pointer from the saved value.
The stack manipulation instructions alloca, stacksave, and stackrestore are preview features in PTX ISA version 7.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.
In multi-threaded executions, the side-effects of memory operations performed by each thread become visible to other threads in a partial and non-identical order. This means that any two operations may appear to happen in no order, or in different orders, to different threads. The axioms introduced by the memory consistency model specify exactly which contradictions are forbidden between the orders observed by different threads.
In the absence of any constraint, each read operation returns the value committed by some write operation to the same memory location, including the initial write to that memory location. The memory consistency model effectively constrains the set of such candidate writes from which a read operation can return a value.
When communicating with the host CPU, certain strong operations with system scope may not be performed atomically on some systems. For more details on atomicity guarantees to host memory, see the CUDA Atomicity Requirements.
The fundamental storage unit in the PTX memory model is a byte, consisting of 8 bits. Each state space available to a PTX program is a sequence of contiguous bytes in memory. Every byte in a PTX state space has a unique address relative to all threads that have access to the same state space.
Each PTX memory instruction specifies an address operand and a data type. The address operand contains a virtual address that gets converted to a physical address during memory access. The physical address and the size of the data type together define a physical memory location, which is the range of bytes starting from the physical address and extending up to the size of the data type in bytes.
The memory consistency model specification uses the terms “address” or “memory address” to indicate a virtual address, and the term “memory location” to indicate a physical memory location.
Each PTX memory instruction also specifies the operation (either a read, a write, or an atomic read-modify-write) to be performed on all the bytes in the corresponding memory location.
Two memory locations are said to overlap when the starting address of one location is within the range of bytes constituting the other location. Two memory operations are said to overlap when they specify the same virtual address and the corresponding memory locations overlap. The overlap is said to be complete when both memory locations are identical, and it is said to be partial otherwise.
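The overlap definitions for memory locations can be sketched in a few lines of Python. This is an illustrative model only and not part of PTX: the function name classify_overlap and the representation of a memory location as a (start address, size in bytes) pair are assumptions made for the example.

```python
def classify_overlap(addr_a, size_a, addr_b, size_b):
    """Classify two memory locations, each given as (start address, size in bytes).

    Returns "none", "partial", or "complete".
    """
    # Two locations overlap when the start of one lies within the byte range of the other.
    a_starts_in_b = addr_b <= addr_a < addr_b + size_b
    b_starts_in_a = addr_a <= addr_b < addr_a + size_a
    if not (a_starts_in_b or b_starts_in_a):
        return "none"
    # The overlap is complete only when both locations are identical.
    if addr_a == addr_b and size_a == size_b:
        return "complete"
    return "partial"


print(classify_overlap(0x100, 4, 0x100, 4))  # same 4-byte location
print(classify_overlap(0x100, 4, 0x102, 2))  # 2-byte location inside a 4-byte one
print(classify_overlap(0x100, 4, 0x104, 4))  # adjacent, disjoint locations
```

The sketch classifies locations only. Whether two operations overlap additionally depends on their virtual addresses, which this model does not represent.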
A multimem address is a virtual address which points to multiple distinct memory locations across devices.
Only multimem.* operations are valid on multimem addresses. That is, the behavior of accessing a multimem address in any other memory operation is undefined.
The memory consistency model relates operations executed on memory locations with scalar data types, which have a maximum size and alignment of 64 bits. Memory operations with a vector data type are modelled as a set of equivalent memory operations with a scalar data type, executed in an unspecified order on the elements in the vector.
A packed data type consists of two values of the same scalar data type, as described in Packed Data Types. These values are accessed in adjacent memory locations. A memory operation on a packed data type is modelled as a pair of equivalent memory operations on the scalar data type, executed in an unspecified order on each element of the packed data.
Each byte in memory is initialized by a hypothetical write W0 executed before starting any thread in the program. If the byte is included in a program variable, and that variable has an initial value, then W0 writes the corresponding initial value for that byte; else W0 is assumed to have written an unknown but constant value to the byte.
The relations defined in the memory consistency model are independent of state spaces. In particular, causality order closes over all memory operations across all the state spaces. But the side-effect of a memory operation in one state space can be observed directly only by operations that also have access to the same state space. This further constrains the synchronizing effect of a memory operation in addition to scope. For example, the synchronizing effect of the PTX instruction ld.relaxed.shared.sys is identical to that of ld.relaxed.shared.cluster, since no thread outside the same cluster can execute an operation that accesses the same memory location.
An mmio operation is a memory operation with the .mmio qualifier specified. It is usually performed on a memory location which is mapped to the control registers of peer I/O devices. It can also be used for communication between threads but has poor performance relative to non-mmio operations.
The semantic meaning of mmio operations cannot be defined precisely, as it is defined by the underlying I/O device. For the purposes of the memory consistency model, the formal semantics of an mmio operation are equivalent to those of a strong operation. In addition, an mmio operation has the following implementation-specific properties, provided it meets the CUDA atomicity requirements at the specified scope:
Writes are always performed and are never combined within the scope specified.
Reads are always performed, and are not forwarded, prefetched, combined, or allowed to hit any cache within the scope specified.
As an exception, in some implementations, the surrounding locations may also be loaded. In such cases the amount of data loaded is implementation specific and varies between 32 and 128 bytes in size.
A volatile operation is a memory operation with the .volatile qualifier specified. The semantics of volatile operations are equivalent to a relaxed memory operation with system scope but with the following extra implementation-specific constraints:
The number of volatile instructions (not operations) executed by a program is preserved. Hardware may combine and merge volatile operations issued by multiple different volatile instructions, that is, the number of volatile operations in the program is not preserved.
Volatile instructions are not re-ordered around other volatile instructions, but the memory operations performed by those instructions may be re-ordered around each other.
Note
PTX volatile operations are intended for compilers to lower volatile read and write operations from CUDA C++, and other programming languages sharing CUDA C++ volatile semantics, to PTX.
Since volatile operations are relaxed at system scope with extra constraints, prefer using other strong read or write operations (e.g. ld.relaxed.sys or st.relaxed.sys) for Inter-Thread Synchronization instead, which may deliver better performance.
PTX volatile operations are not suited for Memory Mapped IO (MMIO) because volatile operations do not preserve the number of memory operations performed, and may perform more or fewer operations than requested in a non-deterministic way. Use .mmio operations instead, which strictly preserve the number of operations performed.
Each strong operation must specify a scope, which is the set of threads that may interact directly with that operation and establish any of the relations described in the memory consistency model. There are four scopes:
.cta
The set of all threads executing in the same CTA as the current thread.
.cluster
The set of all threads executing in the same cluster as the current thread.
.gpu
The set of all threads in the current program executing on the same compute device as the current thread. This also includes other kernel grids invoked by the host program on the same compute device.
.sys
The set of all threads in the current program, including all kernel grids invoked by the host program on all compute devices, and all threads constituting the host program itself.
Note that the warp is not a scope; the CTA is the smallest collection of threads that qualifies as a scope in the memory consistency model.
A memory proxy, or a proxy, is an abstract label applied to a method of memory access. When two memory operations use distinct methods of memory access, they are said to be different proxies.
Memory operations as defined in Operation types use the generic method of memory access, i.e. a generic proxy. Other operations such as textures and surfaces all use distinct methods of memory access, also distinct from the generic method.
A proxy fence is required to synchronize memory operations across different proxies. Although virtual aliases use the generic method of memory access, since using distinct virtual addresses behaves as if using different proxies, they require a proxy fence to establish memory ordering.
Two operations are said to be morally strong relative to each other if they satisfy all of the following conditions:
The operations are related in program order (i.e., they are both executed by the same thread), or each operation is strong and specifies a scope that includes the thread executing the other operation.
Both operations are performed via the same proxy.
If both are memory operations, then they overlap completely.
Most (but not all) of the axioms in the memory consistency model depend on relations between morally strong operations.
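The three conditions for morally strong operations can be written as a small Python predicate. This is a simplified sketch under stated assumptions, not a definition from PTX: the Thread and Op classes, their field names, and the helper scope_includes are hypothetical; every operation is assumed to be a memory operation; a weak operation is modelled by scope=None; and CTA and cluster identifiers are assumed to be unique only within their enclosing cluster and device.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thread:
    device: int
    cluster: int
    cta: int
    tid: int


@dataclass(frozen=True)
class Op:
    thread: Thread
    scope: Optional[str]      # ".cta", ".cluster", ".gpu", ".sys", or None for a weak operation
    proxy: str = "generic"
    addr: int = 0
    size: int = 4


def scope_includes(op, other):
    """True if op is strong and its scope includes the thread `other`."""
    if op.scope is None:
        return False
    if op.scope not in (".cta", ".cluster", ".gpu", ".sys"):
        raise ValueError(f"unknown scope: {op.scope}")
    if op.scope == ".sys":
        return True
    same_device = op.thread.device == other.device
    if op.scope == ".gpu":
        return same_device
    same_cluster = same_device and op.thread.cluster == other.cluster
    if op.scope == ".cluster":
        return same_cluster
    return same_cluster and op.thread.cta == other.cta   # ".cta"


def morally_strong(a, b):
    # Condition 1: same thread (program order), or mutually inclusive scopes.
    related = a.thread == b.thread or (
        scope_includes(a, b.thread) and scope_includes(b, a.thread))
    # Condition 2: same proxy.
    same_proxy = a.proxy == b.proxy
    # Condition 3: complete overlap (both are assumed to be memory operations).
    complete_overlap = (a.addr, a.size) == (b.addr, b.size)
    return related and same_proxy and complete_overlap


t1 = Thread(device=0, cluster=0, cta=0, tid=0)
t2 = Thread(device=0, cluster=0, cta=1, tid=0)
print(morally_strong(Op(t1, ".sys"), Op(t2, ".sys")))   # scopes include each other
print(morally_strong(Op(t1, ".cta"), Op(t2, ".gpu")))   # .cta scope of t1 excludes t2
```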
A data-race between operations that overlap completely is called a uniform-size data-race, while a data-race between operations that overlap partially is called a mixed-size data-race.
The axioms in the memory consistency model do not apply if a PTX program contains one or more mixed-size data-races. But these axioms are sufficient to describe the behavior of a PTX program with only uniform-size data-races.
Atomicity of mixed-size RMW operations
In any program with or without mixed-size data-races, the following property holds for every pair of overlapping atomic operations A1 and A2 such that each specifies a scope that includes the other: Either the read-modify-write operation specified by A1 is performed completely before A2 is initiated, or vice versa. This property holds irrespective of whether the two operations A1 and A2 overlap partially or completely.
Some sequences of instructions give rise to patterns that participate in memory synchronization as described later. The release pattern makes prior operations from the current thread [1] visible to some operations from other threads. The acquire pattern makes some operations from other threads visible to later operations from the current thread.
A release pattern on a location M consists of one of the following:
Any memory synchronization established by a release pattern only affects operations occurring in program order before the first instruction in that pattern.
An acquire pattern on a location M consists of one of the following:
Any memory synchronization established by an acquire pattern only affects operations occurring in program order after the last instruction in that pattern.
Note that while atomic reductions conceptually perform a strong read as part of their read-modify-write sequence, this strong read does not form an acquire pattern.
E.g.: red.add [M], 1; fence.acquire; is not an acquire pattern.
[1] For both release and acquire patterns, this effect is further extended to operations in other threads through the transitive nature of causality order.
The sequence of operations performed by each thread is captured as program order while memory synchronization across threads is captured as causality order. The visibility of the side-effects of memory operations to other memory operations is captured as communication order. The memory consistency model defines contradictions that are disallowed between communication order on the one hand, and causality order and program order on the other.
The program order relates all operations performed by a thread to the order in which a sequential processor will execute instructions in the corresponding PTX source. It is a transitive relation that forms a total order over the operations performed by the thread, but does not relate operations from different threads.
Some PTX instructions (all variants of cp.async, cp.async.bulk, cp.reduce.async.bulk, wgmma.mma_async) perform operations that are asynchronous to the thread that executed the instruction. These asynchronous operations are ordered after prior instructions in the same thread (except in the case of wgmma.mma_async), but they are not part of the program order for that thread. Instead, they provide weaker ordering guarantees as documented in the instruction description.
For example, the loads and stores performed as part of a cp.async are ordered with respect to each other, but not to those of any other cp.async instructions initiated by the same thread, nor any other instruction subsequently issued by the thread with the exception of cp.async.commit_group or cp.async.mbarrier.arrive. The asynchronous mbarrier arrive-on operation performed by a cp.async.mbarrier.arrive instruction is ordered with respect to the memory operations performed by all prior cp.async operations initiated by the same thread, but not to those of any other instruction issued by the thread. The implicit mbarrier complete-tx operation that is part of all variants of cp.async.bulk and cp.reduce.async.bulk instructions is ordered only with respect to the memory operations performed by the same asynchronous instruction, and in particular it does not transitively establish ordering with respect to prior instructions from the issuing thread.
Synchronizing operations performed by different threads synchronize with each other at runtime as described here. The effect of such synchronization is to establish causality order across threads.
A fence.sc operation X synchronizes with a fence.sc operation Y if X precedes Y in the Fence-SC order.
A bar{.cta}.sync or bar{.cta}.red or bar{.cta}.arrive operation synchronizes with a bar{.cta}.sync or bar{.cta}.red operation executed on the same barrier.
A barrier.cluster.arrive operation synchronizes with a barrier.cluster.wait operation.
A release pattern X synchronizes with an acquire pattern Y, if a write operation in X precedes a read operation in Y in observation order, and the first operation in X and the last operation in Y are morally strong.
API synchronization
A synchronizes relation can also be established by certain CUDA APIs.
Completion of a task enqueued in a CUDA stream synchronizes with the start of the following task in the same stream, if any.
For purposes of the above, recording or waiting on a CUDA event in a stream, or causing a cross-stream barrier to be inserted due to cudaStreamLegacy, enqueues tasks in the associated streams even if there are no direct side effects. An event record task synchronizes with matching event wait tasks, and a barrier arrival task synchronizes with matching barrier wait tasks.
Start of a CUDA kernel synchronizes with start of all threads in the kernel. End of all threads in a kernel synchronizes with end of the kernel.
Start of a CUDA graph synchronizes with start of all source nodes in the graph. Completion of all sink nodes in a CUDA graph synchronizes with completion of the graph. Completion of a graph node synchronizes with start of all nodes with a direct dependency.
Start of a CUDA API call to enqueue a task synchronizes with start of the task.
Completion of the last task queued to a stream, if any, synchronizes with return from cudaStreamSynchronize. Completion of the most recently queued matching event record task, if any, synchronizes with return from cudaEventSynchronize. Synchronizing a CUDA device or context behaves as if synchronizing all streams in the context, including ones that have been destroyed.
Returning cudaSuccess from an API to query a CUDA handle, such as a stream or event, behaves the same as return from the matching synchronization API.
In addition to establishing a synchronizes relation, the CUDA API synchronization mechanisms above also participate in proxy-preserved base causality order.
Causality order captures how memory operations become visible across threads through synchronizing operations. The axiom “Causality” uses this order to constrain the set of write operations from which a read operation may read a value.
Relations in the causality order primarily consist of relations in base causality order [1], which is a transitive order, determined at runtime.
Base causality order
An operation X precedes an operation Y in base causality order if:
X precedes Y in program order, or
X synchronizes with Y, or
For some operation Z,
X precedes Z in program order and Z precedes Y in base causality order, or
X precedes Z in base causality order and Z precedes Y in program order, or
X precedes Z in base causality order and Z precedes Y in base causality order.
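Read together, the rules for base causality order amount to the transitive closure of program order united with the synchronizes relation. The Python sketch below computes that closure for a small example. It is an illustration only: the function name base_causality, the encoding of each relation as a set of (X, Y) pairs, and the operation labels (borrowed from the Store Buffering litmus test, where F1 and F2 are fence.sc operations and F1 is assumed to precede F2 in Fence-SC order) are choices made for this example.

```python
def base_causality(program_order, synchronizes_with):
    """Return the transitive closure of program order united with synchronizes-with.

    Both arguments are sets of (X, Y) pairs meaning "X precedes Y".
    """
    order = set(program_order) | set(synchronizes_with)
    changed = True
    while changed:
        changed = False
        for (x, z) in list(order):
            for (z2, y) in list(order):
                if z == z2 and (x, y) not in order:
                    order.add((x, y))   # X -> Z and Z -> Y imply X -> Y
                    changed = True
    return order


# T1: W1; F1; R1        T2: W2; F2; R2
program_order = {("W1", "F1"), ("F1", "R1"), ("W1", "R1"),
                 ("W2", "F2"), ("F2", "R2"), ("W2", "R2")}
synchronizes_with = {("F1", "F2")}

order = base_causality(program_order, synchronizes_with)
print(("W1", "R2") in order)   # True: W1 -> F1 -> F2 -> R2
print(("W2", "R1") in order)   # False: nothing orders W2 before R1
```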
Proxy-preserved base causality order
A memory operation X precedes a memory operation Y in proxy-preserved base causality order if X precedes Y in base causality order, and:
X and Y are performed to the same address, using the generic proxy, or
X and Y are performed to the same address, using the same proxy, and by the same thread block, or
X and Y are aliases and there is an alias proxy fence along the base causality path from X to Y.
Causality order
Causality order combines base causality order with some non-transitive relations as follows:
An operation X precedes an operation Y in causality order if:
X precedes Y in proxy-preserved base causality order, or
For some operation Z, X precedes Z in observation order, and Z precedes Y in proxy-preserved base causality order.
[1] The transitivity of base causality order accounts for the “cumulativity” of synchronizing operations.
There exists a partial transitive order that relates overlapping write operations, determined at runtime, called the coherence order [1]. Two overlapping write operations are related in coherence order if they are morally strong or if they are related in causality order. Two overlapping writes are unrelated in coherence order if they are in a data-race, which gives rise to the partial nature of coherence order.
[1] Coherence order cannot be observed directly since it consists entirely of write operations. It may be observed indirectly by its use in constraining the set of candidate writes that a read operation may read from.
The communication order is a non-transitive order, determined at runtime, that relates write operations to other overlapping memory operations.
A write W precedes an overlapping read R in communication order if R returns the value of any byte that was written by W.
A write W precedes a write W’ in communication order if W precedes W’ in coherence order.
A read R precedes an overlapping write W in communication order if, for any byte accessed by both R and W, R returns the value written by a write W’ that precedes W in coherence order.
Communication order captures the visibility of memory operations: when a memory operation X1 precedes a memory operation X2 in communication order, X1 is said to be visible to X2.
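The three rules for communication order can be sketched as a function that derives its edges from a reads-from map and a coherence order. This is a simplified Python model, not a PTX definition: it assumes that all operations overlap completely on a single location (so the per-byte wording collapses to whole operations), that each read returns the value of exactly one write, and that the names communication_order, reads_from, and coherence are chosen for this example only.

```python
def communication_order(reads_from, coherence):
    """Derive communication-order edges as a set of (X1, X2) pairs.

    reads_from: dict mapping each read to the write whose value it returns.
    coherence:  set of (W, W2) pairs meaning W precedes W2 in coherence order.
    """
    edges = set()
    for read, write in reads_from.items():
        edges.add((write, read))             # W -> R: R returns the value written by W
    edges |= set(coherence)                  # W -> W': W precedes W' in coherence order
    for read, write in reads_from.items():
        for earlier, later in coherence:
            if earlier == write:
                edges.add((read, later))     # R -> W: R read from a write that precedes W
    return edges


# One location with initial write W0 and a later write W1.
# R1 returns the value of W1; R2 returns the initial value.
edges = communication_order({"R1": "W1", "R2": "W0"}, {("W0", "W1")})
print(sorted(edges))
```

In this example the read R2 precedes W1 in communication order, because R2 returned the value of W0 and W0 precedes W1 in coherence order.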
Fence-SC order cannot contradict causality order. For a pair of morally strong fence.sc operations F1 and F2, if F1 precedes F2 in causality order, then F1 must precede F2 in Fence-SC order.
Conflicting morally strong operations are performed with single-copy atomicity. When a read R and a write W are morally strong, then the following two communications cannot both exist in the same execution, for the set of bytes accessed by both R and W:
R reads any byte from W.
R reads any byte from any write W’ which precedes W in coherence order.
Atomicity of read-modify-write (RMW) operations
When an atomic operation A and a write W overlap and are morally strong, then the following two communications cannot both exist in the same execution, for the set of bytes accessed by both A and W:
A reads any byte from a write W’ that precedes W in coherence order.
A follows W in coherence order.
Litmus Test 1
.global .u32 x = 0;
T1:
A1: atom.sys.inc.u32 %r0, [x];
T2:
A2: atom.sys.inc.u32 %r0, [x];
FINAL STATE: x == 2
Atomicity is guaranteed when the operations are morally strong.
Litmus Test 2
.global .u32 x = 0;
T1:
A1: atom.cta.inc.u32 %r0, [x];
T2 (in a different CTA):
A2: atom.gpu.inc.u32 %r0, [x];
FINAL STATE: x == 1 OR x == 2
Atomicity is not guaranteed if the operations are not morally strong.
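The two final states can be reproduced with a small Python enumeration. The sketch below is a deliberately coarse model and makes several assumptions that are not in the text above: each increment is split into a read step and a write step, a loss of atomicity is modelled as the other thread's steps landing between them, and the wrap-around bound of the inc operation is ignored. The function name final_values is hypothetical.

```python
from itertools import permutations


def final_values(atomic):
    """Enumerate final values of x after two threads each increment x once.

    Each increment is a read step followed by a write step. When `atomic` is
    True, the two steps of one increment are never separated by the other thread.
    """
    steps = [("T1", "read"), ("T1", "write"), ("T2", "read"), ("T2", "write")]
    results = set()
    for order in permutations(steps):
        # Keep only schedules where each thread reads before it writes.
        if order.index(("T1", "read")) > order.index(("T1", "write")):
            continue
        if order.index(("T2", "read")) > order.index(("T2", "write")):
            continue
        # An atomic read-modify-write must not be split by the other thread.
        if atomic and order[0][0] != order[1][0]:
            continue
        x = 0
        seen = {}
        for thread, step in order:
            if step == "read":
                seen[thread] = x
            else:
                x = seen[thread] + 1
        results.add(x)
    return results


print(final_values(atomic=True))    # {2}, as in Litmus Test 1
print(final_values(atomic=False))   # {1, 2}, as in Litmus Test 2
```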
Values may not appear “out of thin air”: an execution cannot speculatively produce a value in such a way that the speculation becomes self-satisfying through chains of instruction dependencies and inter-thread communication. This matches both programmer intuition and hardware reality, but is necessary to state explicitly when performing formal analysis.
Litmus Test: Load Buffering
.global .u32 x = 0;
.global .u32 y = 0;
T1:
A1: ld.global.u32 %r0, [x];
B1: st.global.u32 [y], %r0;
T2:
A2: ld.global.u32 %r1, [y];
B2: st.global.u32 [x], %r1;
FINAL STATE: x == 0 AND y == 0
The litmus test known as “LB” (Load Buffering) checks such forbidden values that may arise out of thin air. Two threads T1 and T2 each read from a first variable and copy the observed result into a second variable, with the first and second variable exchanged between the threads. If each variable is initially zero, the final result shall also be zero. If A1 reads from B2 and A2 reads from B1, then values passing through the memory operations in this example form a cycle: A1->B1->A2->B2->A1. Only the values x == 0 and y == 0 are allowed to satisfy this cycle. If any of the memory operations in this example were to speculatively associate a different value with the corresponding memory location, then such a speculation would become self-fulfilling, and hence forbidden.
Within any set of overlapping memory operations that are pairwise morally strong, communication order cannot contradict program order, i.e., a concatenation of program order between overlapping operations and morally strong relations in communication order cannot result in a cycle. This ensures that each program slice of overlapping pairwise morally strong operations is strictly sequentially-consistent.
The litmus test “CoRR” (Coherent Read-Read) demonstrates one consequence of this guarantee. A thread T1 executes a write W1 on a location x, and a thread T2 executes two (or an infinite sequence of) reads R1 and R2 on the same location x. No other writes are executed on x, except the one modelling the initial value. The operations W1, R1 and R2 are pairwise morally strong. If R1 reads from W1, then the subsequent read R2 must also observe the same value. If R2 observed the initial value of x instead, then this would form a sequence of morally-strong relations R2->W1->R1 in communication order that contradicts the program order R1->R2 in thread T2. Hence R2 cannot read the initial value of x in such an execution.
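The CoRR argument is a cycle check over the union of program order and communication order. The Python sketch below makes that check explicit. It is an illustration under assumptions: the edges are written out by hand from the description above rather than derived, and the helper name has_cycle is hypothetical.

```python
def has_cycle(edges):
    """Return True if the directed graph given as a set of (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True          # reached a node that is still on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


program_order = {("R1", "R2")}
# R1 reads from W1, but R2 reads the initial value, which precedes W1 in coherence order.
forbidden = {("W1", "R1"), ("R2", "W1")}
# Both reads return the value written by W1.
allowed = {("W1", "R1"), ("W1", "R2")}

print(has_cycle(program_order | forbidden))   # True: R1 -> R2 -> W1 -> R1
print(has_cycle(program_order | allowed))     # False
```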
Relations in communication order cannot contradict causality order. This constrains the set of candidate write operations that a read operation may read from:
If a read R precedes an overlapping write W in causality order, then R cannot read from W.
If a write W precedes an overlapping read R in causality order, then for any byte accessed by both R and W, R cannot read from any write W’ that precedes W in coherence order.
The litmus test known as “MP” (Message Passing) represents the essence of typical synchronization algorithms. A vast majority of useful programs can be reduced to sequenced applications of this pattern.
Thread T1 first writes to a data variable and then to a flag variable while a second thread T2 first reads from the flag variable and then from the data variable. The operations on the flag are morally strong and the memory operations in each thread are separated by a fence, and these fences are morally strong.
If R1 observes W2, then the release pattern “F1; W2” synchronizes with the acquire pattern “R1; F2”. This establishes the causality order W1 -> F1 -> W2 -> R1 -> F2 -> R2. The Causality axiom then guarantees that R2 cannot read from any write that precedes W1 in coherence order. In the absence of any other writes in this example, R2 must read from W1.
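The way the Causality axiom narrows the candidate writes in the MP test can be sketched in Python. The model below is an assumption-laden illustration, not a PTX definition: all writes are assumed to overlap the read completely, the relations are given as sets of (X, Y) pairs, W0 stands for the initial write to the data variable, and the function name candidate_writes is hypothetical.

```python
def candidate_writes(read, writes, causality, coherence):
    """Writes that `read` may still read from under the two Causality rules.

    causality and coherence are sets of (X, Y) pairs meaning "X precedes Y".
    """
    allowed = set()
    for w in writes:
        # Rule 1: the read precedes w in causality order, so it cannot read from w.
        if (read, w) in causality:
            continue
        # Rule 2: some write w2 precedes the read in causality order and
        # w precedes w2 in coherence order, so w is no longer visible.
        if any((w2, read) in causality and (w, w2) in coherence for w2 in writes):
            continue
        allowed.add(w)
    return allowed


data_writes = {"W0", "W1"}            # W0 models the initial value of the data variable
coherence = {("W0", "W1")}

# R1 observed W2, so the patterns synchronize and W1 precedes R2 in causality order.
print(candidate_writes("R2", data_writes, {("W1", "R2")}, coherence))   # {'W1'}
# Without synchronization, either write remains a candidate.
print(sorted(candidate_writes("R2", data_writes, set(), coherence)))    # ['W0', 'W1']
```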
Litmus Test: CoWR
// These addresses are aliases
.global .u32 data_alias_1;
.global .u32 data_alias_2;
Virtual aliases require an alias proxy fence along the synchronization path.
Litmus Test: Store Buffering
The litmus test known as “SB” (Store Buffering) demonstrates the sequential consistency enforced by the fence.sc. A thread T1 writes to a first variable, and then reads the value of a second variable, while a second thread T2 writes to the second variable and then reads the value of the first variable. The memory operations in each thread are separated by fence.sc instructions, and these fences are morally strong.
In any execution, either F1 precedes F2 in Fence-SC order, or vice versa. If F1 precedes F2 in Fence-SC order, then F1 synchronizes with F2. This establishes the causality order W1 -> F1 -> F2 -> R2. The Causality axiom ensures that R2 cannot read from any write that precedes W1 in coherence order. In the absence of any other write to that variable, R2 must read from W1. Similarly, in the case where F2 precedes F1 in Fence-SC order, R1 must read from W2. If each fence.sc in this example were replaced by a fence.acq_rel instruction, then this outcome is not guaranteed. There may be an execution where the write from each thread remains unobserved from the other thread, i.e., an execution is possible where both R1 and R2 return the initial value “0” for variables y and x respectively.
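The outcome that fence.sc rules out can be seen by enumerating every interleaving of the two threads under a single total order. The Python sketch below does this. It is only an approximation of the guarantee discussed above: it assumes that the effect of the fence.sc instructions in this test can be modelled as sequentially consistent interleaving, it assumes T1 writes x and T2 writes y with the value 1, and the names sb_outcomes, r1, and r2 are hypothetical.

```python
def sb_outcomes():
    """Enumerate (r1, r2) results of Store Buffering under a single total order.

    T1: W1: x = 1; R1: r1 = y        T2: W2: y = 1; R2: r2 = x
    """
    t1 = [("write", "x"), ("read", "y", "r1")]
    t2 = [("write", "y"), ("read", "x", "r2")]
    outcomes = set()

    def run(i, j, mem, regs):
        if i == len(t1) and j == len(t2):
            outcomes.add((regs["r1"], regs["r2"]))
            return
        for thread, idx in ((t1, i), (t2, j)):
            if idx == len(thread):
                continue                      # this thread has finished
            op = thread[idx]
            new_mem, new_regs = dict(mem), dict(regs)
            if op[0] == "write":
                new_mem[op[1]] = 1
            else:
                new_regs[op[2]] = new_mem[op[1]]
            if thread is t1:
                run(i + 1, j, new_mem, new_regs)
            else:
                run(i, j + 1, new_mem, new_regs)

    run(0, 0, {"x": 0, "y": 0}, {})
    return outcomes


print(sorted(sb_outcomes()))   # [(0, 1), (1, 0), (1, 1)]
```

The pair (0, 0) never appears in this model. It does not capture the weaker fence.acq_rel case, in which that pair becomes possible.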
The litmus test known as “MP” (Message Passing) demonstrates the consequence of reductions being excluded from acquire patterns. It is possible to observe the outcome where R2 reads the value 0 from x and flag has the final value of 2. This outcome is possible since the release pattern in T1 does not synchronize with any acquire pattern in T2. Using the atom instruction instead of red forbids this outcome.
This section describes each PTX instruction. In addition to the name and the format of the instruction, the semantics are described, followed by some examples that attempt to show several possible instantiations of the instruction.
PTX instructions generally have from zero to four operands, plus an optional guard predicate appearing after an @ symbol to the left of the opcode:
@p opcode;
@p opcode a;
@p opcode d, a;
@p opcode d, a, b;
@p opcode d, a, b, c;
For instructions that create a result value, the d operand is the destination operand, while a, b, and c are source operands.
The setp instruction writes two destination registers. We use a | symbol to separate multiple destination registers.
setp.lt.s32 p|q, a, b; // p = (a < b); q = !(a < b);
For some instructions the destination operand is optional. A bit bucket operand denoted with an underscore (_) may be used in place of a destination register.
In PTX, predicate registers are virtual and have .pred as the type specifier. So, predicate registers can be declared as
.reg .pred p, q, r;
All instructions have an optional guard predicate which controls conditional execution of the instruction. The syntax to specify conditional execution is to prefix an instruction with @{!}p, where p is a predicate variable, optionally negated. Instructions without a guard predicate are executed unconditionally.
Predicates are most commonly set as the result of a comparison performed by the setp instruction.
As an example, consider the high-level code
if (i < n) j = j + 1;
This can be written in PTX as
      setp.lt.s32  p, i, n;    // p = (i < n)
@p    add.s32      j, j, 1;    // if i < n, add 1 to j
To get a conditional branch or conditional function call, use a predicate to control the execution of the branch or call instructions. To implement the above example as a true conditional branch, the following PTX instruction sequence might be used:
      setp.lt.s32  p, i, n;    // compare i to n
@!p   bra  L1;                 // if False, branch over
      add.s32      j, j, 1;
L1:     ...
The signed integer comparisons are the traditional eq (equal), ne (not-equal), lt (less-than), le (less-than-or-equal), gt (greater-than), and ge (greater-than-or-equal). The unsigned comparisons are eq, ne, lo (lower), ls (lower-or-same), hi (higher), and hs (higher-or-same). The bit-size comparisons are eq and ne; ordering comparisons are not defined for bit-size types.
Table 22 shows the operators for signed integer, unsigned integer, and bit-size types.
Table 22. Operators for Signed Integer, Unsigned Integer, and Bit-Size Types
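The difference between the signed and unsigned ordering comparisons shows up when the same bit pattern is interpreted both ways. The Python sketch below models lt on .s32 and lo on .u32 for 32-bit patterns. The function names as_signed, setp_lt_s32, and setp_lo_u32 are hypothetical helpers for this illustration, not PTX instructions or an API.

```python
def as_signed(bits, width=32):
    """Interpret the low `width` bits of an integer as a two's-complement value."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits >> (width - 1) else bits


def setp_lt_s32(a, b):
    """Model of the signed comparison lt on 32-bit patterns."""
    return as_signed(a) < as_signed(b)


def setp_lo_u32(a, b):
    """Model of the unsigned comparison lo on 32-bit patterns."""
    return (a & 0xFFFFFFFF) < (b & 0xFFFFFFFF)


pattern = 0xFFFFFFFF                 # -1 when signed, 4294967295 when unsigned
print(setp_lt_s32(pattern, 1))       # True:  -1 < 1
print(setp_lo_u32(pattern, 1))       # False: 4294967295 is not lower than 1
```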
The ordered floating-point comparisons are eq, ne, lt, le, gt, and ge. If either operand is NaN, the result is False. Table 23 lists the floating-point comparison operators.
To aid comparison operations in the presence of NaN values, unordered floating-point comparisons are provided: equ, neu, ltu, leu, gtu, and geu. If both operands are numeric values (not NaN), then the comparison has the same result as its ordered counterpart. If either operand is NaN, then the result of the comparison is True.
Table 24 lists the floating-point comparison operators accepting NaN values.
To test for NaN values, two operators num (numeric) and nan (isNaN) are provided. num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN. Table 25 lists the floating-point comparison operators testing for NaN values.
Table 25. Floating-Point Comparison Operators Testing for NaN
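The three families of floating-point comparison operators described above (ordered, unordered, and the NaN tests) can be modelled in a few lines of Python. This is a sketch for illustration: the function name float_compare and the table _ORDERED are hypothetical, and Python floats stand in for PTX floating-point values.

```python
import math
import operator

_ORDERED = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
            "le": operator.le, "gt": operator.gt, "ge": operator.ge}


def float_compare(op, a, b):
    """Evaluate a floating-point comparison operator on two Python floats."""
    either_nan = math.isnan(a) or math.isnan(b)
    if op == "num":
        return not either_nan
    if op == "nan":
        return either_nan
    if op in _ORDERED:
        # Ordered comparisons are False whenever an operand is NaN.
        return False if either_nan else _ORDERED[op](a, b)
    if op.endswith("u") and op[:-1] in _ORDERED:
        # Unordered comparisons are True whenever an operand is NaN.
        return True if either_nan else _ORDERED[op[:-1]](a, b)
    raise ValueError(f"unknown comparison operator: {op}")


nan = float("nan")
print(float_compare("ne", nan, 1.0))    # False: ordered ne with a NaN operand
print(float_compare("neu", nan, 1.0))   # True:  unordered neu with a NaN operand
print(float_compare("num", 1.0, 2.0))   # True:  both operands are numeric
```

Note that the ordered ne differs from the != operator of Python, which yields True when an operand is NaN. The explicit NaN check in the sketch accounts for that.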
Predicate values may be computed and manipulated using the following instructions: and, or, xor, not, and mov.
There is no direct conversion between predicates and integer values, and no direct way to load or store predicate register values. However, setp can be used to generate a predicate from an integer, and the predicate-based select (selp) instruction can be used to generate an integer value based on the value of a predicate; for example:
selp.u32 %r1,1,0,%p; // convert predicate to 32-bit value
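The round trip between integers and predicates can be modelled in Python as follows. This is a minimal sketch: the functions setp_ne_u32 and selp_u32 are hypothetical stand-ins for the behavior of setp.ne.u32 and selp.u32, and comparing against zero is one assumed convention for turning an integer into a predicate.

```python
def setp_ne_u32(a, b):
    """Model of setp.ne.u32: produce a predicate from an integer comparison."""
    return (a & 0xFFFFFFFF) != (b & 0xFFFFFFFF)


def selp_u32(a, b, p):
    """Model of selp.u32: select a if predicate p is True, otherwise b."""
    return (a if p else b) & 0xFFFFFFFF


p = setp_ne_u32(7, 0)      # integer -> predicate (nonzero becomes True)
r1 = selp_u32(1, 0, p)     # predicate -> integer, mirroring the selp example
print(p, r1)               # True 1
```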
Typed instructions must have a type-size modifier. For example, the add instruction requires type and size information to properly perform the addition operation (signed, unsigned, float, different sizes), and this information must be specified as a suffix to the opcode.
Example
.reg .u16 d, a, b;
add.u16 d, a, b;    // perform a 16-bit unsigned add
Some instructions require multiple type-size modifiers, most notably the data conversion instructioncvt. It requires separate type-size modifiers for the result and source, and these are placed inthe same order as the operands. For example:
In general, an operand’s type must agree with the corresponding instruction-type modifier. The rulesfor operand and instruction type conformance are as follows:
Bit-size types agree with any type of the same size.
Signed and unsigned integer types agree provided they have the same size, and integer operands are silently cast to the instruction type if needed. For example, an unsigned integer operand used in a signed integer instruction will be treated as a signed integer by the instruction.
Floating-point types agree only if they have the same size; i.e., they must match exactly.
Some operands have their type and size defined independently from the instruction type-size. For example, the shift amount operand for left and right shift instructions always has type .u32, while the remaining operands have their type and size determined by the instruction type.
Example
// 64-bit arithmetic right shift; shift amount 'b' is .u32
    shr.s64 d,a,b;
For convenience, ld, st, and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or 16-bit values may be held directly in 32-bit or 64-bit registers when being loaded, stored, or converted to other types and sizes. The operand type checking rules are relaxed for bit-size and integer (signed and unsigned) instruction types; floating-point instruction types still require that the operand type-size matches exactly, unless the operand is of bit-size type.
When a source operand has a size that exceeds the instruction-type size, the source data is truncated (chopped) to the appropriate number of bits specified by the instruction type-size.
Table 27 summarizes the relaxed type-checking rules for source operands. Note that some combinations may still be invalid for a particular instruction; for example, the cvt instruction does not support .bX instruction types, so those rows are invalid for cvt.
Table 27 Relaxed Type-checking Rules for Source Operands
Rows give the instruction type; columns give the source operand type.
Instr. | b8   | b16  | b32  | b64  | b128 | s8   | s16  | s32  | s64  | u8   | u16  | u32  | u64  | f16  | f32  | f64
b8     | –    | chop | chop | chop | chop | –    | chop | chop | chop | –    | chop | chop | chop | chop | chop | chop
b16    | inv  | –    | chop | chop | chop | inv  | –    | chop | chop | inv  | –    | chop | chop | –    | chop | chop
b32    | inv  | inv  | –    | chop | chop | inv  | inv  | –    | chop | inv  | inv  | –    | chop | inv  | –    | chop
b64    | inv  | inv  | inv  | –    | chop | inv  | inv  | inv  | –    | inv  | inv  | inv  | –    | inv  | inv  | –
b128   | inv  | inv  | inv  | inv  | –    | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv
s8     | –    | chop | chop | chop | chop | –    | chop | chop | chop | –    | chop | chop | chop | inv  | inv  | inv
s16    | inv  | –    | chop | chop | chop | inv  | –    | chop | chop | inv  | –    | chop | chop | inv  | inv  | inv
s32    | inv  | inv  | –    | chop | chop | inv  | inv  | –    | chop | inv  | inv  | –    | chop | inv  | inv  | inv
s64    | inv  | inv  | inv  | –    | chop | inv  | inv  | inv  | –    | inv  | inv  | inv  | –    | inv  | inv  | inv
u8     | –    | chop | chop | chop | chop | –    | chop | chop | chop | –    | chop | chop | chop | inv  | inv  | inv
u16    | inv  | –    | chop | chop | chop | inv  | –    | chop | chop | inv  | –    | chop | chop | inv  | inv  | inv
u32    | inv  | inv  | –    | chop | chop | inv  | inv  | –    | chop | inv  | inv  | –    | chop | inv  | inv  | inv
u64    | inv  | inv  | inv  | –    | chop | inv  | inv  | inv  | –    | inv  | inv  | inv  | –    | inv  | inv  | inv
f16    | inv  | –    | chop | chop | chop | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | –    | inv  | inv
f32    | inv  | inv  | –    | chop | chop | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | –    | inv
f64    | inv  | inv  | inv  | –    | chop | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | –
Notes
chop = keep only low bits that fit; “–” = allowed, but no conversion needed; inv = invalid, parse error.
Source register size must be of equal or greater size than the instruction-type size.
Bit-size source registers may be used with any appropriately-sized instruction type. The data are truncated (“chopped”) to the instruction-type size and interpreted according to the instruction type.
Integer source registers may be used with any appropriately-sized bit-size or integer instruction type. The data are truncated to the instruction-type size and interpreted according to the instruction type.
Floating-point source registers can only be used with bit-size or floating-point instruction types. When used with a narrower bit-size instruction type, the data are truncated. When used with a floating-point instruction type, the size must match exactly.
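The "chop" rule for wide source operands can be sketched in Python as follows. This is a minimal model, assuming register contents are represented as non-negative Python integers; the helper names chop and as_signed are hypothetical.

```python
def chop(value: int, instr_bits: int) -> int:
    """Keep only the low instr_bits bits of a wider source register."""
    return value & ((1 << instr_bits) - 1)

def as_signed(value: int, bits: int) -> int:
    """Interpret a bits-wide pattern as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

# A 32-bit register used as the source of an 8-bit signed instruction:
reg32 = 0x12345680
low8 = chop(reg32, 8)          # only the low byte survives
signed8 = as_signed(low8, 8)   # then interpreted per the instruction type
```

The upper 24 bits of the register play no part in the result; only the chopped value is interpreted according to the instruction type.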
When a destination operand has a size that exceeds the instruction-type size, the destination data is zero- or sign-extended to the size of the destination register. If the corresponding instruction type is signed integer, the data is sign-extended; otherwise, the data is zero-extended.
Table 28 summarizes the relaxed type-checking rules for destination operands.
Table 28 Relaxed Type-checking Rules for Destination Operands
Rows give the instruction type; columns give the destination operand type.
Instr. | b8   | b16  | b32  | b64  | b128 | s8   | s16  | s32  | s64  | u8   | u16  | u32  | u64  | f16  | f32  | f64
b8     | –    | zext | zext | zext | zext | –    | zext | zext | zext | –    | zext | zext | zext | zext | zext | zext
b16    | inv  | –    | zext | zext | zext | inv  | –    | zext | zext | inv  | –    | zext | zext | –    | zext | zext
b32    | inv  | inv  | –    | zext | zext | inv  | inv  | –    | zext | inv  | inv  | –    | zext | inv  | –    | zext
b64    | inv  | inv  | inv  | –    | zext | inv  | inv  | inv  | –    | inv  | inv  | inv  | –    | inv  | inv  | –
b128   | inv  | inv  | inv  | inv  | –    | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv
s8     | –    | sext | sext | sext | sext | –    | sext | sext | sext | –    | sext | sext | sext | inv  | inv  | inv
s16    | inv  | –    | sext | sext | sext | inv  | –    | sext | sext | inv  | –    | sext | sext | inv  | inv  | inv
s32    | inv  | inv  | –    | sext | sext | inv  | inv  | –    | sext | inv  | inv  | –    | sext | inv  | inv  | inv
s64    | inv  | inv  | inv  | –    | sext | inv  | inv  | inv  | –    | inv  | inv  | inv  | –    | inv  | inv  | inv
u8     | –    | zext | zext | zext | zext | –    | zext | zext | zext | –    | zext | zext | zext | inv  | inv  | inv
u16    | inv  | –    | zext | zext | zext | inv  | –    | zext | zext | inv  | –    | zext | zext | inv  | inv  | inv
u32    | inv  | inv  | –    | zext | zext | inv  | inv  | –    | zext | inv  | inv  | –    | zext | inv  | inv  | inv
u64    | inv  | inv  | inv  | –    | zext | inv  | inv  | inv  | –    | inv  | inv  | inv  | –    | inv  | inv  | inv
f16    | inv  | –    | zext | zext | zext | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | –    | inv  | inv
f32    | inv  | inv  | –    | zext | zext | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | –    | inv
f64    | inv  | inv  | inv  | –    | zext | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | inv  | –
Notes
sext = sign-extend; zext = zero-extend; “–” = allowed, but no conversion needed; inv = invalid, parse error.
Destination register size must be of equal or greater size than the instruction-type size.
Bit-size destination registers may be used with any appropriately-sized instruction type. The data are sign-extended to the destination register width for signed integer instruction types, and are zero-extended to the destination register width otherwise.
Integer destination registers may be used with any appropriately-sized bit-size or integer instruction type. The data are sign-extended to the destination register width for signed integer instruction types, and are zero-extended to the destination register width for bit-size and unsigned integer instruction types.
Floating-point destination registers can only be used with bit-size or floating-point instruction types. When used with a narrower bit-size instruction type, the data are zero-extended. When used with a floating-point instruction type, the size must match exactly.
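The destination-side rule (sext for signed integer instruction types, zext otherwise) can be modeled with a short Python sketch. The function name extend and its parameters are hypothetical; register contents are represented as non-negative integers.

```python
def extend(value: int, instr_bits: int, dest_bits: int, signed_instr: bool) -> int:
    """Widen an instr_bits-wide result to a dest_bits-wide destination register.

    Signed integer instruction types sign-extend; bit-size, unsigned, and
    floating-point instruction types zero-extend.
    """
    mask = (1 << instr_bits) - 1
    value &= mask
    if signed_instr and (value >> (instr_bits - 1)) & 1:
        # Replicate the sign bit into every bit above instr_bits.
        value |= ((1 << dest_bits) - 1) & ~mask
    return value
```

For instance, an 8-bit result 0x80 written to a 32-bit register becomes 0xFFFFFF80 under an .s8 instruction type and 0x00000080 under .u8 or .b8.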
Threads in a CTA execute together, at least in appearance, until they come to a conditional control construct such as a conditional branch, conditional function call, or conditional return. If threads execute down different control flow paths, the threads are called divergent. If all of the threads act in unison and follow a single control flow path, the threads are called uniform. Both situations occur often in programs.
A CTA with divergent threads may have lower performance than a CTA with uniformly executing threads, so it is important to have divergent threads re-converge as soon as possible. All control constructs are assumed to be divergent points unless the control-flow instruction is marked as uniform, using the .uni suffix. For divergent control flow, the optimizing code generator automatically determines points of re-convergence. Therefore, a compiler or code author targeting PTX can ignore the issue of divergent threads, but has the opportunity to improve performance by marking branch points as uniform when the compiler or author can guarantee that the branch point is non-divergent.
The goal of the semantic description of an instruction is to describe the results in all cases in language that is as simple as possible. The semantics are described using C, until C is not expressive enough.
A PTX program may execute on a GPU with either a 16-bit or a 32-bit data path. When executing on a 32-bit data path, 16-bit registers in PTX are mapped to 32-bit physical registers, and 16-bit computations are promoted to 32-bit computations. This can lead to computational differences between code run on a 16-bit machine versus the same code run on a 32-bit machine, since the promoted computation may have bits in the high-order half-word of registers that are not present in 16-bit physical registers. These extra precision bits can become visible at the application level, for example, by a right-shift instruction.
At the PTX language level, one solution would be to define semantics for 16-bit code that is consistent with execution on a 16-bit data path. This approach introduces a performance penalty for 16-bit code executing on a 32-bit data path, since the translated code would require many additional masking instructions to suppress extra precision bits in the high-order half-word of 32-bit registers.
Rather than introduce a performance penalty for 16-bit code running on 32-bit GPUs, the semantics of 16-bit instructions in PTX are machine-specific. A compiler or programmer may choose to enforce portable, machine-independent 16-bit semantics by adding explicit conversions to 16-bit values at appropriate points in the program to guarantee portability of the code. However, for many performance-critical applications, this is not desirable, and for many applications the difference in execution is preferable to limiting performance.
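The effect described above can be illustrated with a deliberately simplified Python model. It is a sketch under stated assumptions, not a description of any real GPU: the function shr_after_add and its parameters are hypothetical, and the "explicit conversion" is modeled as masking to 16 bits.

```python
def shr_after_add(a: int, b: int, shift: int, datapath_bits: int,
                  explicit_cvt: bool = False) -> int:
    """Add two 16-bit values, then right-shift, on a 16- or 32-bit data path."""
    # On a 32-bit data path the carry out of bit 15 survives in bit 16.
    total = (a + b) & ((1 << datapath_bits) - 1)
    if explicit_cvt:
        # Explicit conversion to a 16-bit value discards the extra bits.
        total &= 0xFFFF
    # The right shift can move extra high-order bits into the low half-word.
    return (total >> shift) & 0xFFFF

on_16bit_path = shr_after_add(0xFFFF, 0x0001, 1, 16)
on_32bit_path = shr_after_add(0xFFFF, 0x0001, 1, 32)
portable = shr_after_add(0xFFFF, 0x0001, 1, 32, explicit_cvt=True)
```

In this model the 16-bit path yields 0, the promoted 32-bit path yields 0x8000, and inserting the explicit conversion makes the 32-bit path agree with the 16-bit one, at the cost of an extra masking step.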
add.type       d, a, b;
add{.sat}.s32  d, a, b;     // .sat applies only to .s32

.type = { .u16, .u32, .u64,
          .s16, .s32, .s64,
          .u16x2, .s16x2 };
Description
Performs addition and writes the resulting value into a destination register.
For .u16x2, .s16x2 instruction types, forms input vectors by half word values from source operands. Half-word operands are then added in parallel to produce .u16x2, .s16x2 result in destination.
Operands d, a and b have type .type. For instruction types .u16x2, .s16x2, operands d, a and b have type .b32.
Semantics
if (type == u16x2 || type == s16x2) {
    iA[0] = a[0:15];
    iA[1] = a[16:31];
    iB[0] = b[0:15];
    iB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = iA[i] + iB[i];
    }
} else {
    d = a + b;
}
Notes
Saturation modifier:
.sat
limits result to MININT..MAXINT (no overflow) for the size of the operation. Applies only to .s32 type.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
add.u16x2 and add.s16x2 introduced in PTX ISA version 8.0.
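The packed and saturating forms of integer add can be sketched in Python. This is an illustrative model only: the function names are hypothetical, and the sketch assumes that each 16-bit lane of the packed form wraps modulo 2^16 with no carry between lanes.

```python
def add_u16x2(a: int, b: int) -> int:
    """Model of add.u16x2: add the two 16-bit halves of 32-bit a and b independently."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF                   # lane 0, bits 0..15
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF   # lane 1, bits 16..31
    return (hi << 16) | lo

def add_sat_s32(a: int, b: int) -> int:
    """Model of add.sat.s32: clamp the signed sum to MININT..MAXINT."""
    return max(-2**31, min(2**31 - 1, a + b))
```

Note that in add_u16x2 an overflow in the low lane does not propagate into the high lane, which is what distinguishes the packed form from an ordinary 32-bit add.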
t = a * b;
n = bitwidth of type;
d = t;            // for .wide
d = t<2n-1..n>;   // for .hi variant
d = t<n-1..0>;    // for .lo variant
Notes
The type of the operation represents the types of the a and b operands. If .hi or .lo is specified, then d is the same size as a and b, and either the upper or lower half of the result is written to the destination register. If .wide is specified, then d is twice as wide as a and b to receive the full result of the multiplication.
The .wide suffix is supported only for 16- and 32-bit integer types.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Target ISA Notes
Supported on all target architectures.
Examples
mul.wide.s16 fa,fxs,fys;   // 16*16 bits yields 32 bits
mul.lo.s16 fa,fxs,fys;     // 16*16 bits, save only the low 16 bits
mul.wide.s32 z,x,y;        // 32*32 bits, creates 64 bit result
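The three result selections (.wide, .hi, .lo) can be modeled in Python as below. This is a sketch; the function name mul_variant is hypothetical, operands are passed as already-interpreted Python integers (negative values for signed types), and the result is returned as an unsigned bit pattern.

```python
def mul_variant(a: int, b: int, n: int, variant: str) -> int:
    """Model of mul.{wide,hi,lo} for n-bit operands; returns the result bit pattern."""
    t = a * b                              # full 2n-bit product
    if variant == "wide":
        return t & ((1 << (2 * n)) - 1)    # d = t
    if variant == "hi":
        return (t >> n) & ((1 << n) - 1)   # d = t<2n-1..n>
    if variant == "lo":
        return t & ((1 << n) - 1)          # d = t<n-1..0>
    raise ValueError("variant must be 'wide', 'hi', or 'lo'")
```

Python's arbitrary-precision integers and arithmetic right shift make the masking produce the two's-complement pattern for negative products as well.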
Multiplies two values, optionally extracts the high or low half of the intermediate result, and adds a third value. Writes the result into a destination register.
Semantics
t = a * b;
n = bitwidth of type;
d = t + c;           // for .wide
d = t<2n-1..n> + c;  // for .hi variant
d = t<n-1..0> + c;   // for .lo variant
Notes
The type of the operation represents the types of the a and b operands. If .hi or .lo is specified, then d and c are the same size as a and b, and either the upper or lower half of the result is written to the destination register. If .wide is specified, then d and c are twice as wide as a and b to receive the result of the multiplication.
The .wide suffix is supported only for 16-bit and 32-bit integer types.
Saturation modifier:
.sat
limits result to MININT..MAXINT (no overflow) for the size of the operation.
Multiply two 24-bit integer values and add a third value.
Syntax
mad24.mode.type  d, a, b, c;
mad24.hi.sat.s32 d, a, b, c;

.mode = { .hi, .lo };
.type = { .u32, .s32 };
Description
Compute the product of two 24-bit integer values held in 32-bit source registers, and add a third, 32-bit value to either the high or low 32-bits of the 48-bit result. Return either the high or low 32-bits of the 48-bit result.
Semantics
t = a * b;
d = t<47..16> + c;   // for .hi variant
d = t<31..0> + c;    // for .lo variant
Notes
Integer multiplication yields a result that is twice the size of the input operands, i.e., 48-bits.
mad24.hi performs a 24x24-bit multiply and adds the high 32 bits of the 48-bit result to a third value.
mad24.lo performs a 24x24-bit multiply and adds the low 32 bits of the 48-bit result to a third value.
All operands are of the same type and size.
Saturation modifier:
.sat
limits result of 32-bit signed addition to MININT..MAXINT (no overflow). Applies only to .s32 type in .hi mode.
mad24.hi may be less efficient on machines without hardware support for 24-bit multiply.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Target ISA Notes
Supported on all target architectures.
Examples
mad24.lo.s32 d,a,b,c; // low 32-bits of 24x24-bit signed multiply.
For .u16x2, .s16x2 instruction types, forms input vectors by half word values from source operands. Half-word operands are then processed in parallel to produce .u16x2, .s16x2 result in destination.
Operands d, a and b have the same type as the instruction type. For instruction types .u16x2, .s16x2, operands d, a and b have type .b32.
Semantics
if (type == u16x2 || type == s16x2) {
    iA[0] = a[0:15];
    iA[1] = a[16:31];
    iB[0] = b[0:15];
    iB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = (iA[i] < iB[i]) ? iA[i] : iB[i];
    }
} else {
    d = (a < b) ? a : b; // Integer (signed and unsigned)
}
Notes
Signed and unsigned differ.
Saturation modifier:
min.relu.{s16x2,s32} clamps the result to 0 if negative.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
min.u16x2, min{.relu}.s16x2 and min.relu.s32 introduced in PTX ISA version 8.0.
Target ISA Notes
Supported on all target architectures.
min.u16x2, min{.relu}.s16x2 and min.relu.s32 require sm_90 or higher.
For .u16x2, .s16x2 instruction types, forms input vectors by half word values from source operands. Half-word operands are then processed in parallel to produce .u16x2, .s16x2 result in destination.
Operands d, a and b have the same type as the instruction type. For instruction types .u16x2, .s16x2, operands d, a and b have type .b32.
Semantics
if (type == u16x2 || type == s16x2) {
    iA[0] = a[0:15];
    iA[1] = a[16:31];
    iB[0] = b[0:15];
    iB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
         d[i] = (iA[i] > iB[i]) ? iA[i] : iB[i];
    }
} else {
    d = (a > b) ? a : b; // Integer (signed and unsigned)
}
Notes
Signed and unsigned differ.
Saturation modifier:
max.relu.{s16x2,s32} clamps the result to 0 if negative.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
max.u16x2, max{.relu}.s16x2 and max.relu.s32 introduced in PTX ISA version 8.0.
Target ISA Notes
Supported on all target architectures.
max.u16x2, max{.relu}.s16x2 and max.relu.s32 require sm_90 or higher.
Count the number of one bits in a and place the resulting population count in 32-bit destination register d. Operand a has the instruction type and destination d has type .u32.
Semantics
.u32  d = 0;
while (a != 0) {
   if (a & 0x1)  d++;
   a = a >> 1;
}
Count the number of leading zeros in a starting with the most-significant bit and place the result in 32-bit destination register d. Operand a has the instruction type, and destination d has type .u32. For .b32 type, the number of leading zeros is between 0 and 32, inclusively. For .b64 type, the number of leading zeros is between 0 and 64, inclusively.
Semantics
.u32  d = 0;
if (.type == .b32)   { max = 32; mask = 0x80000000; }
else                 { max = 64; mask = 0x8000000000000000; }

while (d < max && ((a & mask) == 0) ) {
    d++;
    a = a << 1;
}
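The leading-zero loop above translates directly into Python. This is a minimal sketch; the function name clz and the bits parameter (32 for .b32, 64 for .b64) are hypothetical.

```python
def clz(a: int, bits: int = 32) -> int:
    """Count leading zeros of a bits-wide value, scanning from the most-significant bit."""
    width_mask = (1 << bits) - 1
    a &= width_mask
    msb_mask = 1 << (bits - 1)
    d = 0
    while d < bits and (a & msb_mask) == 0:
        d += 1
        a = (a << 1) & width_mask   # keep the value within the register width
    return d
```

The explicit width mask after each shift stands in for the fixed register width, since Python integers do not overflow.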
Find the bit position of the most significant non-sign bit in a and place the result in d. Operand a has the instruction type, and destination d has type .u32. For unsigned integers, bfind returns the bit position of the most significant 1. For signed integers, bfind returns the bit position of the most significant 0 for negative inputs and the most significant 1 for non-negative inputs.
If .shiftamt is specified, bfind returns the shift amount needed to left-shift the found bit into the most-significant bit position.
bfind returns 0xffffffff if no non-sign bit is found.
Semantics
msb = (.type==.u32 || .type==.s32) ? 31 : 63;
// negate negative signed inputs
if ( (.type==.s32 || .type==.s64) && (a & (1<<msb)) ) {
    a = ~a;
}
.u32  d = 0xffffffff;
for (.s32 i=msb; i>=0; i--) {
    if (a & (1<<i))  { d = i; break; }
}
if (.shiftamt && d != 0xffffffff)  { d = msb - d; }
PTX ISA Notes
Introduced in PTX ISA version 2.0.
Target ISA Notes
bfind requires sm_20 or higher.
Examples
bfind.u32  d, a;
bfind.shiftamt.s64  cnt, X;  // cnt is .u32
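A Python rendering of the bfind semantics above may help when checking edge cases such as all-zero or all-one inputs. The function name and keyword parameters are hypothetical; a is passed as an unsigned bit pattern.

```python
def bfind(a: int, bits: int = 32, signed: bool = False, shiftamt: bool = False) -> int:
    """Bit position of the most significant non-sign bit, or 0xffffffff if none."""
    msb = bits - 1
    width_mask = (1 << bits) - 1
    a &= width_mask
    if signed and (a >> msb) & 1:
        a = ~a & width_mask          # invert negative inputs, so we look for a 0 bit
    d = 0xFFFFFFFF
    for i in range(msb, -1, -1):
        if a & (1 << i):
            d = i
            break
    if shiftamt and d != 0xFFFFFFFF:
        d = msb - d                  # left-shift amount to bring the bit to the msb
    return d
```

For a signed input of -1 (all ones) there is no non-sign bit, so the model returns 0xffffffff, matching the rule stated above.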
Given a 32-bit value mask and an integer value base (between 0 and 31), find the n-th (given by offset) set bit in mask from the base bit, and store the bit position in d. If not found, store 0xffffffff in d.
Operand mask has a 32-bit type. Operand base has .b32, .u32 or .s32 type. Operand offset has .s32 type. Destination d has type .b32.
Operand base must be <= 31, otherwise behavior is undefined.
Extract bit field from a and place the zero or sign-extended result in d. Source b gives the bit field starting bit position, and source c gives the bit field length in bits.
Operands a and d have the same type as the instruction type. Operands b and c are type .u32, but are restricted to the 8-bit value range 0..255.
The sign bit of the extracted field is defined as:
.u32, .u64:
zero
.s32, .s64:
msb of input a if the extracted field extends beyond the msb of a; msb of extracted field, otherwise
If the bit field length is zero, the result is zero.
The destination d is padded with the sign bit of the extracted field. If the start position is beyond the msb of the input, the destination d is filled with the replicated sign bit of the extracted field.
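The prose rules above can be sketched as a Python model. This is a reconstruction from the description, not a normative definition; the function name bfe and its parameters are hypothetical, and a is passed as an unsigned bit pattern.

```python
def bfe(a: int, b: int, c: int, bits: int = 32, signed: bool = False) -> int:
    """Extract a bit field of length c starting at bit b, then zero- or sign-extend."""
    msb = bits - 1
    a &= (1 << bits) - 1
    pos = b & 0xFF                   # start position restricted to 0..255
    length = c & 0xFF                # field length restricted to 0..255
    if not signed or length == 0:
        sbit = 0
    else:
        # msb of the field, or msb of a if the field extends beyond it
        sbit = (a >> min(pos + length - 1, msb)) & 1
    d = 0
    for i in range(bits):
        if i < length and pos + i <= msb:
            bit = (a >> (pos + i)) & 1
        else:
            bit = sbit               # pad with the field's sign bit
        d |= bit << i
    return d
```

With a start position beyond the msb, every destination bit takes the padding value, so an unsigned extract yields 0 and a signed extract replicates the msb of a.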
Align and insert a bit field from a into b, and place the result in f. Source c gives the starting bit position for the insertion, and source d gives the bit field length in bits.
Operands a, b, and f have the same type as the instruction type. Operands c and d are type .u32, but are restricted to the 8-bit value range 0..255.
If the bit field length is zero, the result is b.
If the start position is beyond the msb of the input, the result is b.
Semantics
msb = (.type==.b32) ? 31 : 63;
pos = c & 0xff;  // pos restricted to 0..255 range
len = d & 0xff;  // len restricted to 0..255 range

f = b;
for (i=0; i<len && pos+i<=msb; i++) {
    f[pos+i] = a[i];
}
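The insertion loop above can be written as a runnable Python model. The function name bfi and the bits parameter are hypothetical; all operands are passed as unsigned bit patterns.

```python
def bfi(a: int, b: int, c: int, d: int, bits: int = 32) -> int:
    """Insert the low d bits of a into b starting at bit position c."""
    msb = bits - 1
    pos = c & 0xFF                   # pos restricted to 0..255
    length = d & 0xFF                # len restricted to 0..255
    f = b & ((1 << bits) - 1)
    for i in range(length):
        if pos + i > msb:
            break                    # bits past the msb are dropped
        bit = (a >> i) & 1
        f = (f & ~(1 << (pos + i))) | (bit << (pos + i))
    return f
```

A zero length or a start position beyond the msb leaves b unchanged, and a field that straddles the msb is truncated rather than wrapped.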
Sign-extends or zero-extends an N-bit value from operand a where N is specified in operand b. The resulting value is stored in the destination operand d.
For the .s32 instruction type, the value in a is treated as an N-bit signed value and the most significant bit of this N-bit value is replicated up to bit 31. For the .u32 instruction type, the value in a is treated as an N-bit unsigned number and is zero-extended to 32 bits. Operand b is an unsigned 32-bit value.
If the value of N is 0, then the result of szext is 0. If the value of N is 32 or higher, then the result of szext depends upon the value of the .mode qualifier as follows:
If .mode is .clamp, then the result is the same as the source operand a.
If .mode is .wrap, then the result is computed using the wrapped value of N.
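The szext rules above can be sketched in Python. This is an illustrative model with one labeled assumption: the "wrapped value of N" is taken to be N modulo 32 (N & 0x1f), which the text above does not spell out. The function name and parameters are hypothetical.

```python
def szext(a: int, b: int, signed: bool, mode: str) -> int:
    """Sign- or zero-extend the low N bits of a to 32 bits, with N given by b."""
    a &= 0xFFFFFFFF
    n = b & 0xFFFFFFFF
    if n >= 32:
        if mode == "clamp":
            return a                 # result is the source operand itself
        n &= 0x1F                    # ASSUMPTION: .wrap means N modulo 32
    if n == 0:
        return 0
    field_mask = (1 << n) - 1
    value = a & field_mask
    if signed and (value >> (n - 1)) & 1:
        value |= 0xFFFFFFFF & ~field_mask   # replicate the sign bit up to bit 31
    return value
```

Under this assumption, N = 40 in .wrap mode behaves like N = 8, while in .clamp mode it simply returns a.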
Generates a 32-bit mask starting from the bit position specified in operand a, and of the width specified in operand b. The generated bitmask is stored in the destination operand d.
The resulting bitmask is 0 in the following cases:
When the value of a is 32 or higher and .mode is .clamp.
When either the specified value of b or the wrapped value of b (when .mode is specified as .wrap) is 0.
Four-way byte dot product which is accumulated in 32-bit result.
Operands a and b are 32-bit inputs which hold 4 byte inputs in packed form for dot product.
Operand c has type .u32 if both .atype and .btype are .u32, else operand c has type .s32.
Semantics
d = c;

// Extract 4 bytes from a 32bit input and sign or zero extend
// based on input type.
Va = extractAndSignOrZeroExt_4(a, .atype);
Vb = extractAndSignOrZeroExt_4(b, .btype);

for (i = 0; i < 4; ++i) {
    d += Va[i] * Vb[i];
}
Two-way 16-bit to 8-bit dot product which is accumulated in 32-bit result.
Operands a and b are 32-bit inputs. Operand a holds two 16-bit inputs in packed form and operand b holds 4 byte inputs in packed form for dot product.
Depending on the .mode specified, either lower half or upper half of operand b will be used for dot product.
Operand c has type .u32 if both .atype and .btype are .u32, else operand c has type .s32.
Semantics
d = c;
// Extract two 16-bit values from a 32-bit input and sign or zero extend
// based on input type.
Va = extractAndSignOrZeroExt_2(a, .atype);

// Extract four 8-bit values from a 32-bit input and sign or zero extend
// based on input type.
Vb = extractAndSignOrZeroExt_4(b, .btype);

b_select = (.mode == .lo) ? 0 : 2;

for (i = 0; i < 2; ++i) {
    d += Va[i] * Vb[b_select + i];
}
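Both dot-product semantics (the four-way byte form and the two-way 16-bit by 8-bit form) can be modeled with one lane-splitting helper in Python. This is a sketch: the names lanes, dp4a, and dp2a are stand-ins for the extractAndSignOrZeroExt helpers and the instructions, the lowest lane is assumed to be index 0, and 32-bit wraparound of the accumulator is not modeled.

```python
def lanes(x: int, width: int, count: int, signed: bool) -> list:
    """Split x into count lanes of width bits, lowest lane first, sign- or zero-extended."""
    out = []
    for i in range(count):
        v = (x >> (width * i)) & ((1 << width) - 1)
        if signed and v >> (width - 1):
            v -= 1 << width
        out.append(v)
    return out

def dp4a(a: int, b: int, c: int, a_signed: bool, b_signed: bool) -> int:
    """Four-way byte dot product accumulated onto c."""
    va = lanes(a, 8, 4, a_signed)
    vb = lanes(b, 8, 4, b_signed)
    return c + sum(x * y for x, y in zip(va, vb))

def dp2a(a: int, b: int, c: int, a_signed: bool, b_signed: bool, mode: str) -> int:
    """Two-way dot product of 16-bit lanes of a with the low or high byte pair of b."""
    va = lanes(a, 16, 2, a_signed)
    vb = lanes(b, 8, 4, b_signed)
    b_select = 0 if mode == "lo" else 2
    return c + sum(va[i] * vb[b_select + i] for i in range(2))
```

Mixed signedness is handled per operand, which is why the signed flags are separate for a and b.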
Instructions add.cc, addc, sub.cc, subc, mad.cc and madc reference an implicitly specified condition code register (CC) having a single carry flag bit (CC.CF) holding carry-in/carry-out or borrow-in/borrow-out. These instructions support extended-precision integer addition, subtraction, and multiplication. No other instructions access the condition code, and there is no support for setting, clearing, or testing the condition code. The condition code register is not preserved across calls and is mainly intended for use in straight-line code sequences for computing extended-precision integer addition, subtraction, and multiplication.
The extended-precision arithmetic instructions are:
Multiplies two values, extracts either the high or low part of the result, and adds a third value. Writes the result to the destination register and the carry-out from the addition into the condition code register.
Semantics
t = a * b;
d = t<63..32> + c;    // for .hi variant
d = t<31..0> + c;     // for .lo variant
carry-out from addition is written to CC.CF
Notes
Generally used in combination with madc and addc to implement extended-precision multi-word multiplication. See madc for an example.
Multiplies two values, extracts either the high or low part of the result, and adds a third value along with carry-in. Writes the result to the destination register and optionally writes the carry-out from the addition into the condition code register.
Semantics
t = a * b;
d = t<63..32> + c + CC.CF;     // for .hi variant
d = t<31..0> + c + CC.CF;      // for .lo variant
if .cc specified, carry-out from addition is written to CC.CF
Notes
Generally used in combination with mad.cc and addc to implement extended-precision multi-word multiplication. See example below.
PTX ISA Notes
32-bit madc introduced in PTX ISA version 3.0.
64-bit madc introduced in PTX ISA version 4.3.
Target ISA Notes
Requires target sm_20 or higher.
Examples
// extended-precision multiply:  [r3,r2,r1,r0] = [r5,r4] * [r7,r6]
mul.lo.u32      r0,r4,r6;      // r0=(r4*r6).[31:0], no carry-out
mul.hi.u32      r1,r4,r6;      // r1=(r4*r6).[63:32], no carry-out
mad.lo.cc.u32   r1,r5,r6,r1;   // r1+=(r5*r6).[31:0], may carry-out
madc.hi.u32     r2,r5,r6,0;    // r2 =(r5*r6).[63:32]+carry-in,
                               // no carry-out
mad.lo.cc.u32   r1,r4,r7,r1;   // r1+=(r4*r7).[31:0], may carry-out
madc.hi.cc.u32  r2,r4,r7,r2;   // r2+=(r4*r7).[63:32]+carry-in,
                               // may carry-out
addc.u32        r3,0,0;        // r3 = carry-in, no carry-out
mad.lo.cc.u32   r2,r5,r7,r2;   // r2+=(r5*r7).[31:0], may carry-out
madc.hi.u32     r3,r5,r7,r3;   // r3+=(r5*r7).[63:32]+carry-in
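To check that the instruction sequence above really produces the full 128-bit product, it can be replayed in a small Python model of the carry flag. This is a sketch under stated assumptions: the class CarryMachine and the function mul_64x64 are hypothetical names, and only the 32-bit unsigned forms used in the example are modeled.

```python
MASK32 = 0xFFFFFFFF

class CarryMachine:
    """Tiny model of the CC.CF carry flag for 32-bit mad/madc/addc."""

    def __init__(self):
        self.cf = 0

    def mad(self, a, b, c, part, carry_in=False, carry_out=False):
        t = a * b
        half = (t >> 32) if part == "hi" else (t & MASK32)
        total = half + c + (self.cf if carry_in else 0)
        if carry_out:                 # the .cc suffix writes CC.CF
            self.cf = total >> 32
        return total & MASK32

    def addc(self, a, b):
        return (a + b + self.cf) & MASK32

def mul_64x64(x: int, y: int) -> int:
    """[r3,r2,r1,r0] = [r5,r4] * [r7,r6], following the sequence above."""
    r4, r5 = x & MASK32, x >> 32
    r6, r7 = y & MASK32, y >> 32
    m = CarryMachine()
    r0 = (r4 * r6) & MASK32                                       # mul.lo.u32
    r1 = (r4 * r6) >> 32                                          # mul.hi.u32
    r1 = m.mad(r5, r6, r1, "lo", carry_out=True)                  # mad.lo.cc.u32
    r2 = m.mad(r5, r6, 0, "hi", carry_in=True)                    # madc.hi.u32
    r1 = m.mad(r4, r7, r1, "lo", carry_out=True)                  # mad.lo.cc.u32
    r2 = m.mad(r4, r7, r2, "hi", carry_in=True, carry_out=True)   # madc.hi.cc.u32
    r3 = m.addc(0, 0)                                             # addc.u32
    r2 = m.mad(r5, r7, r2, "lo", carry_out=True)                  # mad.lo.cc.u32
    r3 = m.mad(r5, r7, r3, "hi", carry_in=True)                   # madc.hi.u32
    return (r3 << 96) | (r2 << 64) | (r1 << 32) | r0
```

Comparing mul_64x64(x, y) against Python's exact x * y for 64-bit inputs is a quick way to confirm the carry chain.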
Floating-point instructions operate on .f32 and .f64 register operands and constant immediate values. The floating-point instructions are:
testp
copysign
add
sub
mul
fma
mad
div
abs
neg
min
max
rcp
sqrt
rsqrt
sin
cos
lg2
ex2
tanh
Instructions that support rounding modifiers are IEEE-754 compliant. Double-precision instructions support subnormal inputs and results. Single-precision instructions support subnormal inputs and results by default for sm_20 and subsequent targets, and flush subnormal inputs and results to sign-preserving zero for sm_1x targets. The optional .ftz modifier on single-precision instructions provides backward compatibility with sm_1x targets by flushing subnormal inputs and results to sign-preserving zero regardless of the target architecture.
Single-precision add, sub, mul, and mad support saturation of results to the range [0.0, 1.0], with NaNs being flushed to positive zero. NaN payloads are supported for double-precision instructions (except for rcp.approx.ftz.f64 and rsqrt.approx.ftz.f64, which map input NaNs to a canonical NaN). Single-precision instructions return an unspecified NaN. Note that future implementations may support NaN payloads for single-precision instructions, so PTX programs should not rely on the specific single-precision NaNs being generated.
Table 29 summarizes floating-point instructions in PTX.
add{.rnd}{.ftz}{.sat}.f32  d, a, b;
add{.rnd}{.ftz}.f32x2      d, a, b;
add{.rnd}.f64              d, a, b;

.rnd = { .rn, .rz, .rm, .rp };
Description
Performs addition and writes the resulting value into a destination register.
For .f32x2 instruction type, forms input vectors of single precision (.f32) values from source operands. Single precision (.f32) operands are then added in parallel to produce .f32x2 result in destination.
Semantics
if (type == f32 || type == f64) {
    d = a + b;
} else if (type == f32x2) {
    fA[0] = a[0:31];
    fA[1] = a[32:63];
    fB[0] = b[0:31];
    fB[1] = b[32:63];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] + fB[i];
    }
}
Notes
Rounding modifiers:
.rn
mantissa LSB rounds to nearest even
.rz
mantissa LSB rounds towards zero
.rm
mantissa LSB rounds towards negative infinity
.rp
mantissa LSB rounds towards positive infinity
The default value of rounding modifier is .rn. Note that an add instruction with an explicit rounding modifier is treated conservatively by the code optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
add.ftz.f32, add.ftz.f32x2 flush subnormal inputs and results to sign-preserving zero.
sm_1x
add.f64 supports subnormal numbers.
add.f32 flushes subnormal inputs and results to sign-preserving zero.
Saturation modifier:
add.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
add.f32x2 introduced in PTX ISA version 8.6.
Target ISA Notes
add.f32 supported on all target architectures.
add.f64 requires sm_13 or higher.
Rounding modifiers have the following target requirements:
.rn, .rz
available for all targets
.rm, .rp
for add.f64, requires sm_13 or higher.
for add.f32, requires sm_20 or higher.
add.f32x2 requires sm_100 or higher.
Examples
@p  add.rz.ftz.f32  f1,f2,f3;
add.rp.ftz.f32x2    d, a, b;
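The .ftz and .sat behaviors described for single-precision add can be sketched in Python. This is an illustrative model only: the helper names are hypothetical, Python floats (doubles) stand in for .f32 values, and the single-precision minimum normal 2^-126 is used as the subnormal threshold.

```python
import math

F32_MIN_NORMAL = 2.0 ** -126   # smallest normal single-precision magnitude

def ftz_f32(x: float) -> float:
    """Flush a subnormal single-precision value to sign-preserving zero."""
    if x != 0.0 and abs(x) < F32_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x

def sat_f32(x: float) -> float:
    """Clamp to [0.0, 1.0]; NaN results are flushed to +0.0."""
    if math.isnan(x):
        return 0.0
    return min(1.0, max(0.0, x))

def add_ftz_sat_f32(a: float, b: float) -> float:
    """Sketch of add.ftz.sat.f32: flush inputs, add, flush the result, then saturate."""
    return sat_f32(ftz_f32(ftz_f32(a) + ftz_f32(b)))
```

The model does not round the sum to single precision, so it illustrates the flushing and clamping rules rather than bit-exact results.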
sub{.rnd}{.ftz}{.sat}.f32  d, a, b;
sub{.rnd}{.ftz}.f32x2      d, a, b;
sub{.rnd}.f64              d, a, b;

.rnd = { .rn, .rz, .rm, .rp };
Description
Performs subtraction and writes the resulting value into a destination register.
For .f32x2 instruction type, forms input vectors of single precision (.f32) values from source operands. Single precision (.f32) operands are then subtracted in parallel to produce .f32x2 result in destination.
Semantics
if (type == f32 || type == f64) {
    d = a - b;
} else if (type == f32x2) {
    fA[0] = a[0:31];
    fA[1] = a[32:63];
    fB[0] = b[0:31];
    fB[1] = b[32:63];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] - fB[i];
    }
}
Notes
Rounding modifiers:
.rn
mantissa LSB rounds to nearest even
.rz
mantissa LSB rounds towards zero
.rm
mantissa LSB rounds towards negative infinity
.rp
mantissa LSB rounds towards positive infinity
The default value of rounding modifier is .rn. Note that a sub instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A sub instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
sub.ftz.f32, sub.ftz.f32x2 flush subnormal inputs and results to sign-preserving zero.
sm_1x
sub.f64 supports subnormal numbers.
sub.f32 flushes subnormal inputs and results to sign-preserving zero.
Saturation modifier:
sub.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
sub.f32x2 introduced in PTX ISA version 8.6.
Target ISA Notes
sub.f32 supported on all target architectures.
sub.f64 requires sm_13 or higher.
Rounding modifiers have the following target requirements:
mul{.rnd}{.ftz}{.sat}.f32  d, a, b;
mul{.rnd}{.ftz}.f32x2      d, a, b;
mul{.rnd}.f64              d, a, b;

.rnd = { .rn, .rz, .rm, .rp };
Description
Compute the product of two values.
For .f32x2 instruction type, forms input vectors of single precision (.f32) values from source operands. Single precision (.f32) operands are then multiplied in parallel to produce .f32x2 result in destination.
Semantics
if (type == f32 || type == f64) {
    d = a * b;
} else if (type == f32x2) {
    fA[0] = a[0:31];
    fA[1] = a[32:63];
    fB[0] = b[0:31];
    fB[1] = b[32:63];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] * fB[i];
    }
}
Notes
For floating-point multiplication, all operands must be the same size.
Rounding modifiers:
.rn
mantissa LSB rounds to nearest even
.rz
mantissa LSB rounds towards zero
.rm
mantissa LSB rounds towards negative infinity
.rp
mantissa LSB rounds towards positive infinity
The default value of rounding modifier is .rn. Note that a mul instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A mul instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add and mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
mul.ftz.f32, mul.ftz.f32x2 flush subnormal inputs and results to sign-preserving zero.
sm_1x
mul.f64 supports subnormal numbers.
mul.f32 flushes subnormal inputs and results to sign-preserving zero.
Saturation modifier:
mul.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
mul.f32x2 introduced in PTX ISA version 8.6.
Target ISA Notes
mul.f32 supported on all target architectures.
mul.f64 requires sm_13 or higher.
Rounding modifiers have the following target requirements:
.rn, .rz
available for all targets
.rm, .rp
for mul.f64, requires sm_13 or higher.
for mul.f32, requires sm_20 or higher.
mul.f32x2 requires sm_100 or higher.
Examples
mul.ftz.f32 circumf,radius,pi;  // a single-precision multiply
fma.rnd{.ftz}{.sat}.f32 d, a, b, c;fma.rnd{.ftz}.f32x2 d, a, b, c;fma.rnd.f64 d, a, b, c;.rnd = { .rn, .rz, .rm, .rp };
Description
Performs a fused multiply-add with no loss of precision in the intermediate product and addition.
For.f32x2 instruction type, forms input vectors of single precision (.f32) values fromsource operands. Single precision (.f32) operands are then operated in parallel to produce.f32x2 result in destination.
if (type == f32 || type == f64) { d = a * b + c;} else if (type == f32x2) { fA[0] = a[0:31]; fA[1] = a[32:63]; fB[0] = b[0:31]; fB[1] = b[32:63]; fC[0] = c[0:31]; fC[1] = c[32:63]; for (i = 0; i < 2; i++) { d[i] = fA[i] * fB[i] + fC[i]; }}
Notes
fma.f32 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to single precision using the rounding mode specified by .rnd.
fma.f64 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to double precision using the rounding mode specified by .rnd.
fma.f64 is the same as mad.f64.
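The difference between a single final rounding and two separately rounded operations can be seen with a small model. The sketch below emulates the double-precision, round-to-nearest-even case with exact rational arithmetic; the name `fma_rn_f64` is hypothetical, and the model assumes finite inputs and a result within double-precision range. It does not reproduce the sign of an exact zero result.

```python
from fractions import Fraction

def fma_rn_f64(a, b, c):
    """Hypothetical model of fma.rn.f64: exact a*b+c, then one rounding."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)   # no intermediate rounding
    return float(exact)                               # single round-to-nearest-even

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fma_rn_f64(a, b, c)   # exact product is 1 - 2**-60, so the sum is -2**-60
separate = a * b + c          # product rounds to 1.0 first, so the sum is 0.0
```

Here the separately rounded sequence loses the entire result, while the fused form keeps it. This is why a mul/add pair that is contracted into a fused multiply-add can produce different numeric results from the uncontracted pair.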
Rounding modifiers (no default):
.rn
mantissa LSB rounds to nearest even
.rz
mantissa LSB rounds towards zero
.rm
mantissa LSB rounds towards negative infinity
.rp
mantissa LSB rounds towards positive infinity
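The four rounding modifiers can be compared side by side with a small reference model that rounds an exact rational value to single precision. This is only a sketch: the names `round_to_f32`, `f32_from_bits`, and `bits_from_f32` are hypothetical, and the model assumes finite values well inside the f32 range (no overflow handling).

```python
import struct
from fractions import Fraction

def f32_from_bits(bits):
    return struct.unpack('<f', struct.pack('<I', bits))[0]

def bits_from_f32(x):
    return struct.unpack('<I', struct.pack('<f', x))[0]

def round_to_f32(value, mode):
    """Round an exact Fraction to f32 under mode 'rn', 'rz', 'rm' or 'rp'."""
    negative = value < 0
    mag = abs(value)
    # Find lo: the bit pattern of the largest f32 magnitude <= mag.
    lo = bits_from_f32(float(mag))
    while Fraction(f32_from_bits(lo)) > mag:
        lo -= 1
    while Fraction(f32_from_bits(lo + 1)) <= mag:
        lo += 1
    lo_val = Fraction(f32_from_bits(lo))
    hi_val = Fraction(f32_from_bits(lo + 1))
    if lo_val == mag:
        pick = lo                                  # exactly representable
    elif mode == 'rz':
        pick = lo                                  # towards zero
    elif mode == 'rm':
        pick = lo + 1 if negative else lo          # towards negative infinity
    elif mode == 'rp':
        pick = lo if negative else lo + 1          # towards positive infinity
    elif mode == 'rn':
        below, above = mag - lo_val, hi_val - mag
        if below != above:
            pick = lo if below < above else lo + 1
        else:
            pick = lo if lo % 2 == 0 else lo + 1   # tie: even mantissa LSB
    else:
        raise ValueError(mode)
    result = f32_from_bits(pick)
    return -result if negative else result
```

For instance, the exact value 1/3 lies between the f32 encodings 0x3EAAAAAA and 0x3EAAAAAB; .rz and .rm select the lower one, while .rn and .rp select the upper one. For -1/3 the roles of .rm and .rp are swapped.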
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
fma.ftz.f32, fma.ftz.f32x2 flush subnormal inputs and results to sign-preserving zero.
sm_1x
fma.f64 supports subnormal numbers.
fma.f32 is unimplemented for sm_1x targets.
Saturation:
fma.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
mad{.ftz}{.sat}.f32 d, a, b, c; // .target sm_1x
mad.rnd{.ftz}{.sat}.f32 d, a, b, c; // .target sm_20
mad.rnd.f64 d, a, b, c; // .target sm_13 and higher

.rnd = { .rn, .rz, .rm, .rp };
Description
Multiplies two values and adds a third, and then writes the resulting value into a destination register.
Semantics
d = a*b + c;
Notes
For .target sm_20 and higher:
mad.f32 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to single precision using the rounding mode specified by .rnd.
mad.f64 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to double precision using the rounding mode specified by .rnd.
mad.{f32,f64} is the same as fma.{f32,f64}.
For .target sm_1x:
mad.f32 computes the product of a and b at double precision, and then the mantissa is truncated to 23 bits, but the exponent is preserved. Note that this is different from computing the product with mul, where the mantissa can be rounded and the exponent will be clamped. The exception for mad.f32 is when c = +/-0.0: in that case mad.f32 is identical to the result computed using separate mul and add instructions. When JIT-compiled for SM 2.0 devices, mad.f32 is implemented as a fused multiply-add (i.e., fma.rn.ftz.f32). In this case, mad.f32 can produce slightly different numeric results and backward compatibility is not guaranteed.
mad.f64 computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to double precision using the rounding mode specified by .rnd. Unlike mad.f32, the treatment of subnormal inputs and output follows the IEEE 754 standard.
mad.f64 is the same as fma.f64.
Rounding modifiers (no default):
.rn
mantissa LSB rounds to nearest even
.rz
mantissa LSB rounds towards zero
.rm
mantissa LSB rounds towards negative infinity
.rp
mantissa LSB rounds towards positive infinity
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
mad.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.
sm_1x
mad.f64 supports subnormal numbers.
mad.f32 flushes subnormal inputs and results to sign-preserving zero.
Saturation modifier:
mad.sat.f32 clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
In PTX ISA versions 1.4 and later, a rounding modifier is required for mad.f64.
Legacy mad.f64 instructions having no rounding modifier will map to mad.rn.f64.
In PTX ISA versions 2.0 and later, a rounding modifier is required for mad.f32 for sm_20 and higher targets.
Errata
mad.f32 requires a rounding modifier for sm_20 and higher targets. However, for PTX ISA version 3.0 and earlier, ptxas does not enforce this requirement and mad.f32 silently defaults to mad.rn.f32. For PTX ISA version 3.1, ptxas generates a warning and defaults to mad.rn.f32, and in subsequent releases ptxas will enforce the requirement for PTX ISA version 3.2 and later.
Target ISA Notes
mad.f32 supported on all target architectures.
mad.f64 requires sm_13 or higher.
Rounding modifiers have the following target requirements:
.rn, .rz, .rm, .rp for mad.f64, requires sm_13 or higher.
.rn, .rz, .rm, .rp for mad.f32, requires sm_20 or higher.
div.approx{.ftz}.f32 d, a, b; // fast, approximate divide
div.full{.ftz}.f32 d, a, b; // full-range approximate divide
div.rnd{.ftz}.f32 d, a, b; // IEEE 754 compliant rounding
div.rnd.f64 d, a, b; // IEEE 754 compliant rounding

.rnd = { .rn, .rz, .rm, .rp };
Description
Divides a by b, stores result in d.
Semantics
d = a / b;
Notes
Fast, approximate single-precision divides:
div.approx.f32 implements a fast approximation to divide, computed as d = a * (1/b). For |b| in [2^-126, 2^126], the maximum ulp error is 2. For 2^126 < |b| < 2^128, if a is infinity, div.approx.f32 returns NaN, otherwise it returns a sign-preserving zero.
div.full.f32 implements a relatively fast, full-range approximation that scales operands to achieve better accuracy, but is not fully IEEE 754 compliant and does not support rounding modifiers. The maximum ulp error is 2 across the full range of inputs.
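One way to see why div.approx.f32 returns zero or NaN for very large |b| is to model d = a * (1/b) with a reciprocal whose subnormal values are flushed to zero. The sketch below is an illustrative model under that assumption only; it is not the hardware algorithm, it uses a correctly rounded reciprocal rather than a 2-ulp approximation, and the names `to_f32`, `flush_f32`, and `div_approx_model` are hypothetical.

```python
import math
import struct

def to_f32(x):
    """Round to the nearest f32 value (overflow becomes infinity)."""
    try:
        return struct.unpack('<f', struct.pack('<f', x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def flush_f32(x):
    """Assumption: subnormal f32 intermediates are flushed to signed zero."""
    if x != 0.0 and abs(x) < 2.0 ** -126:
        return math.copysign(0.0, x)
    return x

def div_approx_model(a, b):
    """Hypothetical model of d = a * (1/b) for nonzero b, evaluated in f32."""
    rcp = flush_f32(to_f32(1.0 / b))   # 1/b is subnormal when |b| > 2**126
    return to_f32(a * rcp)             # finite a gives signed zero; inf * 0 gives NaN
```

With b = 2^127 the reciprocal 2^-127 lies below the smallest normal f32 value, so in this model a finite a yields a zero with the sign of the quotient, and an infinite a yields NaN, matching the behavior stated above.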
Divide with IEEE 754 compliant rounding:
Rounding modifiers (no default):
.rn
mantissa LSB rounds to nearest even
.rz
mantissa LSB rounds towards zero
.rm
mantissa LSB rounds towards negative infinity
.rp
mantissa LSB rounds towards positive infinity
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
div.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.
sm_1x
div.f64 supports subnormal numbers.
div.f32 flushes subnormal inputs and results to sign-preserving zero.
PTX ISA Notes
div.f32 and div.f64 introduced in PTX ISA version 1.0.
Explicit modifiers .approx, .full, .ftz, and rounding introduced in PTX ISA version 1.4.
For PTX ISA version 1.4 and later, one of .approx, .full, or .rnd is required.
For PTX ISA versions 1.0 through 1.3, div.f32 defaults to div.approx.ftz.f32, and div.f64 defaults to div.rn.f64.
Target ISA Notes
div.approx.f32 and div.full.f32 supported on all target architectures.
div.rnd.f32 requires sm_20 or higher.
div.rn.f64 requires sm_13 or higher, or .target map_f64_to_f32.
div.{rz,rm,rp}.f64 requires sm_20 or higher.
Examples
div.approx.ftz.f32 diam,circum,3.14159;
div.full.ftz.f32 x, y, z;
div.rn.f64 xd, yd, zd;
Take the absolute value of a and store the result in d.
Semantics
d = |a|;
Notes
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
abs.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.
sm_1x
abs.f64 supports subnormal numbers.
abs.f32 flushes subnormal inputs and results to sign-preserving zero.
For abs.f32, NaN input yields unspecified NaN. For abs.f64, NaN input is passed through unchanged. Future implementations may comply with the IEEE 754 standard by preserving payload and modifying only the sign bit.
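The IEEE 754 style behavior mentioned above, in which only the sign bit changes and a NaN payload is preserved, amounts to a single bit mask on the encoding. The following sketch shows that operation on raw f32 bit patterns; the names `abs_f32_bits` and `f32_bits` are hypothetical, and this models the possible future behavior described in the note, not what current abs.f32 hardware guarantees.

```python
import struct

def f32_bits(x):
    """Return the 32-bit encoding of x rounded to f32."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def abs_f32_bits(bits):
    """Clear the sign bit (bit 31) and leave exponent and mantissa untouched."""
    return bits & 0x7FFFFFFF
```

Applied to the negative quiet NaN encoding 0xFFC00001, the mask yields 0x7FC00001, so the payload in the low mantissa bits survives.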
min{.ftz}{.NaN}{.xorsign.abs}.f32 d, a, b;
min{.ftz}{.NaN}{.abs}.f32 d, a, b, c;
min.f64 d, a, b;
Description
Store the minimum of a, b and optionally c in d.
If the .NaN modifier is specified, then the result is canonical NaN if any of the inputs is NaN.
If the .abs modifier is specified, the magnitude of destination operand d is the minimum of the absolute values of the input arguments.
If the .xorsign modifier is specified, the sign bit of destination d is equal to the XOR of the sign bits of both inputs a and b. The .xorsign qualifier cannot be specified for the three-input operation.
Qualifier .xorsign requires qualifier .abs to be specified. In such cases, .xorsign considers the sign bit of both inputs before applying the .abs operation.
If the result of min is NaN then the .xorsign and .abs modifiers will be ignored.
Semantics
def min_num (z, x, y) {
    if (isNaN(x) && isNaN(y))
        z = NaN;
    else if (isNaN(x))
        z = y;
    else if (isNaN(y))
        z = x;
    else
        // note: -0.0 < +0.0 here
        z = (x < y) ? x : y;
    return z;
}

def min_nan (z, x, y) {
    if (isNaN(x) || isNaN(y))
        z = NaN;
    else
        // note: -0.0 < +0.0 here
        z = (x < y) ? x : y;
    return z;
}

def two_inputs_min (z, x, y) {
    if (.NaN)
        z = min_nan(z, x, y);
    else
        z = min_num(z, x, y);
    return z;
}

if (.xorsign && !isPresent(c)) {
    xorsign = getSignBit(a) ^ getSignBit(b);
}
if (.abs) {
    a = |a|;
    b = |b|;
    if (isPresent(c)) {
        c = |c|;
    }
}

d = two_inputs_min(d, a, b)
if (isPresent(c)) {
    d = two_inputs_min(d, d, c)
}
if (.xorsign && !isPresent(c) && !isNaN(d)) {
    setSignBit(d, xorsign);
}
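The semantics above can be transcribed into a short executable model, which makes the NaN handling, the signed-zero ordering, and the .xorsign.abs combination easy to experiment with. This is a sketch on Python floats (double precision), not a bit-exact f32 implementation; the function and parameter names (`ptx_min`, `nan_mode`, `abs_mode`, `xorsign`) are hypothetical.

```python
import math

def _sign(x):
    """Sign bit of x as 0 or 1 (distinguishes -0.0 from +0.0)."""
    return 1 if math.copysign(1.0, x) < 0 else 0

def _less(x, y):
    """x < y with -0.0 ordered below +0.0."""
    if x == 0.0 and y == 0.0:
        return _sign(x) > _sign(y)
    return x < y

def _min2(x, y, nan_mode):
    if math.isnan(x) or math.isnan(y):
        if nan_mode or (math.isnan(x) and math.isnan(y)):
            return math.nan                  # min_nan, or both inputs NaN
        return y if math.isnan(x) else x     # min_num returns the non-NaN input
    return x if _less(x, y) else y

def ptx_min(a, b, c=None, nan_mode=False, abs_mode=False, xorsign=False):
    """Hypothetical model of min with optional .NaN, .abs and .xorsign."""
    if xorsign and (c is not None or not abs_mode):
        raise ValueError(".xorsign requires .abs and exactly two inputs")
    sign = (_sign(a) ^ _sign(b)) if xorsign else 0   # taken before .abs is applied
    if abs_mode:
        a, b = abs(a), abs(b)
        c = abs(c) if c is not None else None
    d = _min2(a, b, nan_mode)
    if c is not None:
        d = _min2(d, c, nan_mode)
    if xorsign and not math.isnan(d):
        d = math.copysign(d, -1.0 if sign else 1.0)
    return d
```

For example, `ptx_min(-2.0, 3.0, abs_mode=True, xorsign=True)` takes the smaller magnitude 2.0 and applies the XOR of the input signs, giving -2.0.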
Notes
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
min.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.
sm_1x
min.f64 supports subnormal numbers.
min.f32 flushes subnormal inputs and results to sign-preserving zero.
If values of both inputs are 0.0, then +0.0 > -0.0.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
min.NaN introduced in PTX ISA version 7.0.
min.xorsign.abs introduced in PTX ISA version 7.2.
min with three input arguments introduced in PTX ISA version 8.8.
Target ISA Notes
min.f32 supported on all target architectures.
min.f64 requires sm_13 or higher.
min.NaN requires sm_80 or higher.
min.xorsign.abs requires sm_86 or higher.
min with three input arguments requires sm_100 or higher.
Examples
@p  min.ftz.f32 z,z,x;
    min.f64 a,b,c;
    // fp32 min with .NaN
    min.NaN.f32 f0,f1,f2;
    // fp32 min with .xorsign.abs
    min.xorsign.abs.f32 Rd, Ra, Rb;
max{.ftz}{.NaN}{.xorsign.abs}.f32 d, a, b;
max{.ftz}{.NaN}{.abs}.f32 d, a, b, c;
max.f64 d, a, b;
Description
Store the maximum of a, b and optionally c in d.
If the .NaN modifier is specified, the result is canonical NaN if any of the inputs is NaN.
If the .abs modifier is specified, the magnitude of destination operand d is the maximum of the absolute values of the input arguments.
If the .xorsign modifier is specified, the sign bit of destination d is equal to the XOR of the sign bits of the inputs a and b. The .xorsign qualifier cannot be specified for the three-input operation.
Qualifier .xorsign requires qualifier .abs to be specified. In such cases, .xorsign considers the sign bit of both inputs before applying the .abs operation.
If the result of max is NaN then the .xorsign and .abs modifiers will be ignored.
Semantics
def max_num (z, x, y) {
    if (isNaN(x) && isNaN(y))
        z = NaN;
    else if (isNaN(x))
        z = y;
    else if (isNaN(y))
        z = x;
    else
        // note: +0.0 > -0.0 here
        z = (x > y) ? x : y;
    return z;
}

def max_nan (z, x, y) {
    if (isNaN(x) || isNaN(y))
        z = NaN;
    else
        // note: +0.0 > -0.0 here
        z = (x > y) ? x : y;
    return z;
}

def two_inputs_max (z, x, y) {
    if (.NaN)
        z = max_nan(z, x, y);
    else
        z = max_num(z, x, y);
    return z;
}

if (.xorsign && !isPresent(c)) {
    xorsign = getSignBit(a) ^ getSignBit(b);
}
if (.abs) {
    a = |a|;
    b = |b|;
    if (isPresent(c)) {
        c = |c|;
    }
}

d = two_inputs_max (d, a, b)
if (isPresent(c)) {
    d = two_inputs_max (d, d, c)
}
if (.xorsign && !isPresent(c) && !isNaN(d)) {
    setSignBit(d, xorsign);
}
Notes
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
max.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.
sm_1x
max.f64 supports subnormal numbers.
max.f32 flushes subnormal inputs and results to sign-preserving zero.
If values of both inputs are 0.0, then +0.0 > -0.0.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
max.NaN introduced in PTX ISA version 7.0.
max.xorsign.abs introduced in PTX ISA version 7.2.
max with three input arguments introduced in PTX ISA version 8.8.
Target ISA Notes
max.f32 supported on all target architectures.
max.f64 requires sm_13 or higher.
max.NaN requires sm_80 or higher.
max.xorsign.abs requires sm_86 or higher.
max with three input arguments requires sm_100 or higher.
Examples
max.ftz.f32 f0,f1,f2;
max.f64 a,b,c;
// fp32 max with .NaN
max.NaN.f32 f0,f1,f2;
// fp32 max with .xorsign.abs
max.xorsign.abs.f32 Rd, Ra, Rb;
rcp.approx.f32 implements a fast approximation to reciprocal. The maximum ulp error is 1 across the full range of inputs.
Input       Result
-Inf        -0.0
-0.0        -Inf
+0.0        +Inf
+Inf        +0.0
NaN         NaN
Reciprocal with IEEE 754 compliant rounding:
Rounding modifiers (no default):
.rn
mantissa LSB rounds to nearest even
.rz
mantissa LSB rounds towards zero
.rm
mantissa LSB rounds towards negative infinity
.rp
mantissa LSB rounds towards positive infinity
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
rcp.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.
sm_1x
rcp.f64 supports subnormal numbers.
rcp.f32 flushes subnormal inputs and results to sign-preserving zero.
PTX ISA Notes
rcp.f32 and rcp.f64 introduced in PTX ISA version 1.0. rcp.rn.f64 and explicit modifiers .approx and .ftz were introduced in PTX ISA version 1.4. General rounding modifiers were added in PTX ISA version 2.0.
For PTX ISA version 1.4 and later, one of .approx or .rnd is required.
For PTX ISA versions 1.0 through 1.3, rcp.f32 defaults to rcp.approx.ftz.f32, and rcp.f64 defaults to rcp.rn.f64.
Target ISA Notes
rcp.approx.f32 supported on all target architectures.
rcp.rnd.f32 requires sm_20 or higher.
rcp.rn.f64 requires sm_13 or higher, or .target map_f64_to_f32.
sqrt.approx.f32 implements a fast approximation to square root. The maximum relative error over the entire positive finite floating-point range is 2^-23.
For various corner-case inputs, results of the sqrt instruction are shown in the table below:
Input       Result
-Inf        NaN
-normal     NaN
-0.0        -0.0
+0.0        +0.0
+Inf        +Inf
NaN         NaN
Square root with IEEE 754 compliant rounding:
Rounding modifiers (no default):
.rn
mantissa LSB rounds to nearest even
.rz
mantissa LSB rounds towards zero
.rm
mantissa LSB rounds towards negative infinity
.rp
mantissa LSB rounds towards positive infinity
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
sqrt.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.
sm_1x
sqrt.f64 supports subnormal numbers.
sqrt.f32 flushes subnormal inputs and results to sign-preserving zero.
PTX ISA Notes
sqrt.f32 and sqrt.f64 introduced in PTX ISA version 1.0. sqrt.rn.f64 and explicit modifiers .approx and .ftz were introduced in PTX ISA version 1.4. General rounding modifiers were added in PTX ISA version 2.0.
For PTX ISA version 1.4 and later, one of .approx or .rnd is required.
For PTX ISA versions 1.0 through 1.3, sqrt.f32 defaults to sqrt.approx.ftz.f32, and sqrt.f64 defaults to sqrt.rn.f64.
Target ISA Notes
sqrt.approx.f32 supported on all target architectures.
sqrt.rnd.f32 requires sm_20 or higher.
sqrt.rn.f64 requires sm_13 or higher, or .target map_f64_to_f32.
Compute an approximation of the square root reciprocal of a value.
Syntax
rsqrt.approx.ftz.f64 d, a;
Description
Compute a double-precision (.f64) approximation of the square root reciprocal of a value. The least significant 32 bits of the double-precision (.f64) destination d are all zeros.
Semantics
tmp = a[63:32]; // upper word of a, 1.11.20 format
d[63:32] = 1.0 / sqrt(tmp);
d[31:0] = 0x00000000;
Notes
rsqrt.approx.ftz.f64 implements a fast approximation of the square root reciprocal of a value.
Input         Result
-Inf          NaN
-subnormal    -Inf
-0.0          -Inf
+0.0          +Inf
+subnormal    +Inf
+Inf          +0.0
NaN           NaN
Input NaNs map to a canonical NaN with encoding 0x7fffffff00000000.
Subnormal inputs and results are flushed to sign-preserving zero.
PTX ISA Notes
rsqrt.approx.ftz.f64 introduced in PTX ISA version 4.0.
lg2.approx.f32 implements a fast approximation to log2(a).
Input       Result
-Inf        NaN
-normal     NaN
-0.0        -Inf
+0.0        -Inf
+Inf        +Inf
NaN         NaN
The maximum absolute error is 2^-22 when the input operand is in the range (0.5, 2). For positive finite inputs outside of this interval, the maximum relative error is 2^-22.
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
lg2.ftz.f32 flushes subnormal inputs and results to sign-preserving zero.
sm_1x
Subnormal inputs and results are flushed to sign-preserving zero.
PTX ISA Notes
lg2.f32 introduced in PTX ISA version 1.0. Explicit modifiers .approx and .ftz introduced in PTX ISA version 1.4.
For PTX ISA version 1.4 and later, the .approx modifier is required.
For PTX ISA versions 1.0 through 1.3, lg2.f32 defaults to lg2.approx.ftz.f32.
Half-precision add, sub, mul, and fma support saturation of results to the range [0.0, 1.0], with NaNs being flushed to positive zero. Half-precision instructions return an unspecified NaN.
add{.rnd}{.ftz}{.sat}.f16 d, a, b;
add{.rnd}{.ftz}{.sat}.f16x2 d, a, b;
add{.rnd}.bf16 d, a, b;
add{.rnd}.bf16x2 d, a, b;

.rnd = { .rn };
Description
Performs addition and writes the resulting value into a destination register.
For the .f16x2 and .bf16x2 instruction types, input vectors are formed from the half-word values of the source operands. The half-word operands are then added in parallel to produce the .f16x2 or .bf16x2 result in the destination.
if (type == f16 || type == bf16) {
    d = a + b;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] + fB[i];
    }
}
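The packed .f16x2 case can be modeled by splitting each 32-bit operand into two half-precision lanes, adding lane by lane, and repacking. The sketch below covers only the .f16x2 type with round-to-nearest-even and uses the half-precision `e` format of Python's `struct` module; the names `pack_f16x2`, `unpack_f16x2`, and `add_f16x2` are hypothetical. The sum of two f16 values is exact in double precision, so the single conversion back to f16 is the only rounding step.

```python
import math
import struct

def _f16_to_bits(x):
    """Encode x as f16 (round to nearest even; overflow becomes infinity)."""
    try:
        return struct.unpack('<H', struct.pack('<e', x))[0]
    except OverflowError:
        return 0xFC00 if x < 0 else 0x7C00

def _bits_to_f16(h):
    return struct.unpack('<e', struct.pack('<H', h))[0]

def pack_f16x2(lo, hi):
    """Pack two values into one 32-bit word: lo in bits 0..15, hi in bits 16..31."""
    return (_f16_to_bits(hi) << 16) | _f16_to_bits(lo)

def unpack_f16x2(word):
    return (_bits_to_f16(word & 0xFFFF), _bits_to_f16(word >> 16))

def add_f16x2(a, b):
    """Hypothetical model of add.f16x2 on 32-bit packed operands."""
    fa = unpack_f16x2(a)
    fb = unpack_f16x2(b)
    return pack_f16x2(fa[0] + fb[0], fa[1] + fb[1])
```

For example, adding the packed pairs (1.0, 2.0) and (0.5, 0.25) gives the packed pair (1.5, 2.25), with each lane computed independently of the other.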
Notes
Rounding modifiers:
.rn
mantissa LSB rounds to nearest even
The default value of the rounding modifier is .rn. Note that an add instruction with an explicit rounding modifier is treated conservatively by the code optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers:
By default, subnormal numbers are supported. add.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.
Saturation modifier:
add.sat.{f16,f16x2} clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
Introduced in PTX ISA version 4.2.
add{.rnd}.bf16 and add{.rnd}.bf16x2 introduced in PTX ISA version 7.8.
Target ISA Notes
Requires sm_53 or higher.
add{.rnd}.bf16 and add{.rnd}.bf16x2 require sm_90 or higher.
Examples
// scalar f16 additions
add.f16 d0, a0, b0;
add.rn.f16 d1, a1, b1;
add.bf16 bd0, ba0, bb0;
add.rn.bf16 bd1, ba1, bb1;

// SIMD f16 addition
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;
mov.b32 p1, {h0, h1}; // pack two f16 to 32bit f16x2
mov.b32 p2, {h2, h3}; // pack two f16 to 32bit f16x2
add.f16x2 p3, p1, p2; // SIMD f16x2 addition

// SIMD bf16 addition
cvt.rn.bf16x2.f32 p4, f4, f5; // Convert two f32 into packed bf16x2
cvt.rn.bf16x2.f32 p5, f6, f7; // Convert two f32 into packed bf16x2
add.bf16x2 p6, p4, p5; // SIMD bf16x2 addition

// SIMD fp16 addition
ld.global.b32 f0, [addr]; // load 32 bit which hold packed f16x2
ld.global.b32 f1, [addr + 4]; // load 32 bit which hold packed f16x2
add.f16x2 f2, f0, f1; // SIMD f16x2 addition

ld.global.b32 f3, [addr + 8]; // load 32 bit which hold packed bf16x2
ld.global.b32 f4, [addr + 12]; // load 32 bit which hold packed bf16x2
add.bf16x2 f5, f3, f4; // SIMD bf16x2 addition
sub{.rnd}{.ftz}{.sat}.f16 d, a, b;
sub{.rnd}{.ftz}{.sat}.f16x2 d, a, b;
sub{.rnd}.bf16 d, a, b;
sub{.rnd}.bf16x2 d, a, b;

.rnd = { .rn };
Description
Performs subtraction and writes the resulting value into a destination register.
For the .f16x2 and .bf16x2 instruction types, input vectors are formed from the half-word values of the source operands. The half-word operands are then subtracted in parallel to produce the .f16x2 or .bf16x2 result in the destination.
if (type == f16 || type == bf16) {
    d = a - b;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] - fB[i];
    }
}
Notes
Rounding modifiers:
.rn
mantissa LSB rounds to nearest even
The default value of the rounding modifier is .rn. Note that a sub instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A sub instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers:
By default, subnormal numbers are supported. sub.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.
Saturation modifier:
sub.sat.{f16,f16x2} clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
Introduced in PTX ISA version 4.2.
sub{.rnd}.bf16 and sub{.rnd}.bf16x2 introduced in PTX ISA version 7.8.
Target ISA Notes
Requires sm_53 or higher.
sub{.rnd}.bf16 and sub{.rnd}.bf16x2 require sm_90 or higher.
Examples
// scalar f16 subtractions
sub.f16 d0, a0, b0;
sub.rn.f16 d1, a1, b1;
sub.bf16 bd0, ba0, bb0;
sub.rn.bf16 bd1, ba1, bb1;

// SIMD f16 subtraction
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;
mov.b32 p1, {h0, h1}; // pack two f16 to 32bit f16x2
mov.b32 p2, {h2, h3}; // pack two f16 to 32bit f16x2
sub.f16x2 p3, p1, p2; // SIMD f16x2 subtraction

// SIMD bf16 subtraction
cvt.rn.bf16x2.f32 p4, f4, f5; // Convert two f32 into packed bf16x2
cvt.rn.bf16x2.f32 p5, f6, f7; // Convert two f32 into packed bf16x2
sub.bf16x2 p6, p4, p5; // SIMD bf16x2 subtraction

// SIMD fp16 subtraction
ld.global.b32 f0, [addr]; // load 32 bit which hold packed f16x2
ld.global.b32 f1, [addr + 4]; // load 32 bit which hold packed f16x2
sub.f16x2 f2, f0, f1; // SIMD f16x2 subtraction

// SIMD bf16 subtraction
ld.global.b32 f3, [addr + 8]; // load 32 bit which hold packed bf16x2
ld.global.b32 f4, [addr + 12]; // load 32 bit which hold packed bf16x2
sub.bf16x2 f5, f3, f4; // SIMD bf16x2 subtraction
mul{.rnd}{.ftz}{.sat}.f16 d, a, b;
mul{.rnd}{.ftz}{.sat}.f16x2 d, a, b;
mul{.rnd}.bf16 d, a, b;
mul{.rnd}.bf16x2 d, a, b;

.rnd = { .rn };
Description
Performs multiplication and writes the resulting value into a destination register.
For the .f16x2 and .bf16x2 instruction types, input vectors are formed from the half-word values of the source operands. The half-word operands are then multiplied in parallel to produce the .f16x2 or .bf16x2 result in the destination.
if (type == f16 || type == bf16) {
    d = a * b;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] * fB[i];
    }
}
Notes
Rounding modifiers:
.rn
mantissa LSB rounds to nearest even
The default value of the rounding modifier is .rn. Note that a mul instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A mul instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add and mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers:
By default, subnormal numbers are supported. mul.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.
Saturation modifier:
mul.sat.{f16,f16x2} clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
Introduced in PTX ISA version 4.2.
mul{.rnd}.bf16 and mul{.rnd}.bf16x2 introduced in PTX ISA version 7.8.
Target ISA Notes
Requires sm_53 or higher.
mul{.rnd}.bf16 and mul{.rnd}.bf16x2 require sm_90 or higher.
Examples
// scalar f16 multiplications
mul.f16 d0, a0, b0;
mul.rn.f16 d1, a1, b1;
mul.bf16 bd0, ba0, bb0;
mul.rn.bf16 bd1, ba1, bb1;

// SIMD f16 multiplication
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;
mov.b32 p1, {h0, h1}; // pack two f16 to 32bit f16x2
mov.b32 p2, {h2, h3}; // pack two f16 to 32bit f16x2
mul.f16x2 p3, p1, p2; // SIMD f16x2 multiplication

// SIMD bf16 multiplication
cvt.rn.bf16x2.f32 p4, f4, f5; // Convert two f32 into packed bf16x2
cvt.rn.bf16x2.f32 p5, f6, f7; // Convert two f32 into packed bf16x2
mul.bf16x2 p6, p4, p5; // SIMD bf16x2 multiplication

// SIMD fp16 multiplication
ld.global.b32 f0, [addr]; // load 32 bit which hold packed f16x2
ld.global.b32 f1, [addr + 4]; // load 32 bit which hold packed f16x2
mul.f16x2 f2, f0, f1; // SIMD f16x2 multiplication

// SIMD bf16 multiplication
ld.global.b32 f3, [addr + 8]; // load 32 bit which hold packed bf16x2
ld.global.b32 f4, [addr + 12]; // load 32 bit which hold packed bf16x2
mul.bf16x2 f5, f3, f4; // SIMD bf16x2 multiplication
fma.rnd{.ftz}{.sat}.f16 d, a, b, c;
fma.rnd{.ftz}{.sat}.f16x2 d, a, b, c;
fma.rnd{.ftz}.relu.f16 d, a, b, c;
fma.rnd{.ftz}.relu.f16x2 d, a, b, c;
fma.rnd{.relu}.bf16 d, a, b, c;
fma.rnd{.relu}.bf16x2 d, a, b, c;
fma.rnd.oob{.relu}.type d, a, b, c;

.rnd = { .rn };
Description
Performs a fused multiply-add with no loss of precision in the intermediate product and addition.
For the .f16x2 and .bf16x2 instruction types, input vectors are formed from the half-word values of the source operands. The half-word operands are then operated on in parallel to produce the .f16x2 or .bf16x2 result in the destination.
if (type == f16 || type == bf16) {
    d = a * b + c;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    fC[0] = c[0:15];
    fC[1] = c[16:31];
    for (i = 0; i < 2; i++) {
        d[i] = fA[i] * fB[i] + fC[i];
    }
}
Notes
Rounding modifiers (default is .rn):
.rn
mantissa LSB rounds to nearest even
Subnormal numbers:
By default, subnormal numbers are supported. fma.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.
Saturation modifier:
fma.sat.{f16,f16x2} clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
fma.relu.{f16,f16x2,bf16,bf16x2} clamps the result to 0 if negative. NaN result is converted to canonical NaN.
Out Of Bounds modifier:
fma.oob.{f16,f16x2,bf16,bf16x2} clamps the result to 0 if either of the operands is an OOB NaN (defined under Tensors) value. The test for the special NaN value and resultant forcing of the result to +0.0 is performed independently for each of the two SIMD operations.
PTX ISA Notes
Introduced in PTX ISA version 4.2.
fma.relu.{f16,f16x2} and fma{.relu}.{bf16,bf16x2} introduced in PTX ISA version 7.0.
Support for modifier .oob introduced in PTX ISA version 8.1.
Target ISA Notes
Requires sm_53 or higher.
fma.relu.{f16,f16x2} and fma{.relu}.{bf16,bf16x2} require sm_80 or higher.
fma{.oob}.{f16,f16x2,bf16,bf16x2} requires sm_90 or higher.
For the .f16x2 and .bf16x2 instruction types, the input vector is formed by extracting half-word values from the source operand. The half-word operands are then negated in parallel to produce the .f16x2 or .bf16x2 result in the destination.
For the .f16 instruction type, operands d and a have .f16 or .b16 type. For the .f16x2 instruction type, operands d and a have .b32 type. For the .bf16 instruction type, operands d and a have .b16 type. For the .bf16x2 instruction type, operands d and a have .b32 type.
Semantics
if (type == f16 || type == bf16) {
    d = -a;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    for (i = 0; i < 2; i++) {
        d[i] = -fA[i];
    }
}
Notes
Subnormal numbers:
By default, subnormal numbers are supported. neg.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.
NaN inputs yield an unspecified NaN. Future implementations may comply with the IEEE 754 standard by preserving payload and modifying only the sign bit.
PTX ISA Notes
Introduced in PTX ISA version 6.0.
neg.bf16 and neg.bf16x2 introduced in PTX ISA 7.0.
Target ISA Notes
Requires sm_53 or higher.
neg.bf16 and neg.bf16x2 require architecture sm_80 or higher.
For the .f16x2 and .bf16x2 instruction types, the input vector is formed by extracting half-word values from the source operand. Absolute values of the half-word operands are then computed in parallel to produce the .f16x2 or .bf16x2 result in the destination.
For the .f16 instruction type, operands d and a have .f16 or .b16 type. For the .f16x2 instruction type, operands d and a have .f16x2 or .b32 type. For the .bf16 instruction type, operands d and a have .b16 type. For the .bf16x2 instruction type, operands d and a have .b32 type.
Semantics
if (type == f16 || type == bf16) {
    d = |a|;
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    for (i = 0; i < 2; i++) {
        d[i] = |fA[i]|;
    }
}
Notes
Subnormal numbers:
By default, subnormal numbers are supported. abs.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.
NaN inputs yield an unspecified NaN. Future implementations may comply with the IEEE 754 standard by preserving payload and modifying only the sign bit.
PTX ISA Notes
Introduced in PTX ISA version 6.5.
abs.bf16 and abs.bf16x2 introduced in PTX ISA 7.0.
Target ISA Notes
Requires sm_53 or higher.
abs.bf16 and abs.bf16x2 require architecture sm_80 or higher.
min{.ftz}{.NaN}{.xorsign.abs}.f16 d, a, b;
min{.ftz}{.NaN}{.xorsign.abs}.f16x2 d, a, b;
min{.NaN}{.xorsign.abs}.bf16 d, a, b;
min{.NaN}{.xorsign.abs}.bf16x2 d, a, b;
Description
Store the minimum of a and b in d.
For the .f16x2 and .bf16x2 instruction types, input vectors are formed with half-word values from the source operands. The half-word operands are then processed in parallel to store the .f16x2 or .bf16x2 result in the destination.
For the .f16 instruction type, operands d and a have .f16 or .b16 type. For the .f16x2 instruction type, operands d and a have .f16x2 or .b32 type. For the .bf16 instruction type, operands d and a have .b16 type. For the .bf16x2 instruction type, operands d and a have .b32 type.
If the .NaN modifier is specified, then the result is canonical NaN if either of the inputs is NaN.
If the .abs modifier is specified, the magnitude of destination operand d is the minimum of the absolute values of both input arguments.
If the .xorsign modifier is specified, the sign bit of destination d is equal to the XOR of the sign bits of both inputs.
Modifiers .abs and .xorsign must be specified together, and .xorsign considers the sign bit of both inputs before applying the .abs operation.
If the result of min is NaN then the .xorsign and .abs modifiers will be ignored.
Semantics
if (type == f16 || type == bf16) {
    if (.xorsign) {
        xorsign = getSignBit(a) ^ getSignBit(b);
        if (.abs) {
            a = |a|;
            b = |b|;
        }
    }
    if (isNaN(a) && isNaN(b))
        d = NaN;
    else if (.NaN && (isNaN(a) || isNaN(b)))
        d = NaN;
    else if (isNaN(a))
        d = b;
    else if (isNaN(b))
        d = a;
    else
        d = (a < b) ? a : b;
    if (.xorsign && !isNaN(d)) {
        setSignBit(d, xorsign);
    }
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
        if (.xorsign) {
            xorsign = getSignBit(fA[i]) ^ getSignBit(fB[i]);
            if (.abs) {
                fA[i] = |fA[i]|;
                fB[i] = |fB[i]|;
            }
        }
        if (isNaN(fA[i]) && isNaN(fB[i]))
            d[i] = NaN;
        else if (.NaN && (isNaN(fA[i]) || isNaN(fB[i])))
            d[i] = NaN;
        else if (isNaN(fA[i]))
            d[i] = fB[i];
        else if (isNaN(fB[i]))
            d[i] = fA[i];
        else
            d[i] = (fA[i] < fB[i]) ? fA[i] : fB[i];
        if (.xorsign && !isNaN(d[i])) {
            setSignBit(d[i], xorsign);
        }
    }
}
Notes
Subnormal numbers:
By default, subnormal numbers are supported. min.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.
If values of both inputs are 0.0, then +0.0 > -0.0.
PTX ISA Notes
Introduced in PTX ISA version 7.0.
min.xorsign introduced in PTX ISA version 7.2.
Target ISA Notes
Requires sm_80 or higher.
min.xorsign.abs support requires sm_86 or higher.
Examples
min.ftz.f16 h0,h1,h2;
min.f16x2 b0,b1,b2;
// SIMD fp16 min with .NaN
min.NaN.f16x2 b0,b1,b2;
min.bf16 h0, h1, h2;
// SIMD bf16 min with NaN
min.NaN.bf16x2 b0, b1, b2;
// scalar bf16 min with xorsign.abs
min.xorsign.abs.bf16 Rd, Ra, Rb;
max{.ftz}{.NaN}{.xorsign.abs}.f16 d, a, b;
max{.ftz}{.NaN}{.xorsign.abs}.f16x2 d, a, b;
max{.NaN}{.xorsign.abs}.bf16 d, a, b;
max{.NaN}{.xorsign.abs}.bf16x2 d, a, b;
Description
Store the maximum of a and b in d.
For the .f16x2 and .bf16x2 instruction types, input vectors are formed with half-word values from the source operands. The half-word operands are then processed in parallel to store the .f16x2 or .bf16x2 result in the destination.
For the .f16 instruction type, operands d and a have .f16 or .b16 type. For the .f16x2 instruction type, operands d and a have .f16x2 or .b32 type. For the .bf16 instruction type, operands d and a have .b16 type. For the .bf16x2 instruction type, operands d and a have .b32 type.
If the .NaN modifier is specified, the result is canonical NaN if either of the inputs is NaN.
If the .abs modifier is specified, the magnitude of destination operand d is the maximum of the absolute values of both input arguments.
If the .xorsign modifier is specified, the sign bit of destination d is equal to the XOR of the sign bits of both inputs.
Modifiers .abs and .xorsign must be specified together, and .xorsign considers the sign bit of both inputs before applying the .abs operation.
If the result of max is NaN then the .xorsign and .abs modifiers will be ignored.
Semantics
if (type == f16 || type == bf16) {
    if (.xorsign) {
        xorsign = getSignBit(a) ^ getSignBit(b);
        if (.abs) {
            a = |a|;
            b = |b|;
        }
    }
    if (isNaN(a) && isNaN(b))              d = NaN;
    if (.NaN && (isNaN(a) || isNaN(b)))    d = NaN;
    else if (isNaN(a))                     d = b;
    else if (isNaN(b))                     d = a;
    else                                   d = (a > b) ? a : b;
    if (.xorsign && !isNaN(d)) {
        setSignBit(d, xorsign);
    }
} else if (type == f16x2 || type == bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    for (i = 0; i < 2; i++) {
        if (.xorsign) {
            xorsign = getSignBit(fA[i]) ^ getSignBit(fB[i]);
            if (.abs) {
                fA[i] = |fA[i]|;
                fB[i] = |fB[i]|;
            }
        }
        if (isNaN(fA[i]) && isNaN(fB[i]))              d[i] = NaN;
        if (.NaN && (isNaN(fA[i]) || isNaN(fB[i])))    d[i] = NaN;
        else if (isNaN(fA[i]))                         d[i] = fB[i];
        else if (isNaN(fB[i]))                         d[i] = fA[i];
        else                                           d[i] = (fA[i] > fB[i]) ? fA[i] : fB[i];
        if (.xorsign && !isNaN(d[i])) {
            setSignBit(d[i], xorsign);
        }
    }
}
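The scalar branch of these semantics can be mirrored in a small host-side model. The Python sketch below is illustrative only: Python floats stand in for .f16 values, the names max_model, nan_mod, and xorsign_abs are hypothetical, and neither .ftz nor the canonical NaN bit pattern is modeled.

```python
import math

def max_model(a, b, nan_mod=False, xorsign_abs=False):
    """Illustrative model of scalar max with optional .NaN and .xorsign.abs."""
    sign_is_negative = False
    if xorsign_abs:
        # The sign bits are sampled before the absolute value is taken.
        a_neg = math.copysign(1.0, a) < 0
        b_neg = math.copysign(1.0, b) < 0
        sign_is_negative = a_neg != b_neg
        a, b = abs(a), abs(b)

    if nan_mod and (math.isnan(a) or math.isnan(b)):
        d = math.nan
    elif math.isnan(a):
        d = b
    elif math.isnan(b):
        d = a
    elif a == 0.0 and b == 0.0:
        # Per the notes, +0.0 is treated as greater than -0.0.
        d = a if math.copysign(1.0, a) > 0 else b
    else:
        d = a if a > b else b

    # A NaN result ignores the .xorsign and .abs modifiers.
    if xorsign_abs and not math.isnan(d):
        d = math.copysign(d, -1.0 if sign_is_negative else 1.0)
    return d
```

For example, max_model(-4.0, 2.0, xorsign_abs=True) selects the magnitude 4.0 and applies the XOR of the two sign bits, giving -4.0.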
Notes
Subnormal numbers:
By default, subnormal numbers are supported. max.ftz.{f16,f16x2} flushes subnormal inputs and results to sign-preserving zero.
If values of both inputs are 0.0, then +0.0 > -0.0.
PTX ISA Notes
Introduced in PTX ISA version 7.0.
max.xorsign.abs introduced in PTX ISA version 7.2.
Target ISA Notes
Requires sm_80 or higher.
max.xorsign.abs support requires sm_86 or higher.
Examples
max.ftz.f16  h0,h1,h2;
max.f16x2    b0,b1,b2;
// SIMD fp16 max with NaN
max.NaN.f16x2 b0,b1,b2;
// scalar f16 max with xorsign.abs
max.xorsign.abs.f16 Rd, Ra, Rb;
max.bf16     h0, h1, h2;
// SIMD bf16 max with NaN
max.NaN.bf16x2 b0, b1, b2;
// SIMD bf16 max with xorsign.abs
max.xorsign.abs.bf16x2 Rd, Ra, Rb;
The type of operands d and a is as specified by .type.
For .f16x2 or .bf16x2 instruction type, each of the half-word operands is operated on in parallel and the results are packed appropriately into a .f16x2 or .bf16x2.
Mixed precision floating-point instructions operate on data with varied floating point precision. Before executing the specified operation, operands with different precision need to be converted such that all the instruction operands can be represented with a consistent floating-point precision. The register variable to be used for holding a particular operand depends upon the combination of the instruction types. Refer to Fundamental Types and Alternate Floating-Point Data Formats for more details on the exact register operand to be used for a given data type.
The mixed precision floating point instructions are:
add
sub
fma
Mixed precision add, sub, and fma support saturation of results to the range [0.0, 1.0], with NaN being flushed to positive zero.
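As a rough illustration of the saturation rule, the following Python sketch clamps a result to [0.0, 1.0] and maps NaN to +0.0. The helper names saturate and add_sat_model are hypothetical, and Python floats are used throughout, so the conversion from .f16 or .bf16 and the rounding to .f32 are not modeled.

```python
import math

def saturate(x):
    """Clamp x to [0.0, 1.0]; a NaN input is flushed to +0.0."""
    if math.isnan(x):
        return 0.0
    return min(max(x, 0.0), 1.0)

def add_sat_model(a, c):
    """Illustrative stand-in for d = convert(a) + c followed by .sat."""
    return saturate(a + c)
```

With this model, add_sat_model(0.75, 0.5) yields 1.0 and add_sat_model(0.25, -1.0) yields 0.0.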
Converts input operand a from .atype into .f32 type. The converted value is then used for the addition. The resulting value is stored in the destination operand d.
Semantics
d = convert(a) + c;
Notes
Rounding modifiers:
.rn    mantissa LSB rounds to nearest even
.rz    mantissa LSB rounds towards zero
.rm    mantissa LSB rounds towards negative infinity
.rp    mantissa LSB rounds towards positive infinity
The default value of rounding modifier is .rn. Note that an add instruction with an explicit rounding modifier is treated conservatively by the code optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers:
By default, subnormal numbers are supported.
Saturation modifier:
add.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
add.f32.{f16/bf16} introduced in PTX ISA version 8.6.
Converts input operand a from .atype into .f32 type. The converted value is then used for the subtraction. The resulting value is stored in the destination operand d.
Semantics
d = convert(a) - c;
Notes
Rounding modifiers:
.rn    mantissa LSB rounds to nearest even
.rz    mantissa LSB rounds towards zero
.rm    mantissa LSB rounds towards negative infinity
.rp    mantissa LSB rounds towards positive infinity
The default value of rounding modifier is .rn. Note that a sub instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A sub instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers:
By default, subnormal numbers are supported.
Saturation modifier:
sub.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
sub.f32.{f16/bf16} introduced in PTX ISA version 8.6.
Target ISA Notes
sub.f32.{f16/bf16} requires sm_100 or higher.
Examples
.reg .f32 fc, fd;
.reg .f16 ha;
sub.rz.f32.f16.sat fd, ha, fc;
Converts input operands a and b from .atype into .f32 type. The converted values are then used to perform fused multiply-add operation with no loss of precision in the intermediate product and addition. The resulting value is stored in the destination operand d.
Semantics
d = convert(a) * convert(b) + c;
Notes
fma.f32.{f16/bf16} computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to single precision using the rounding mode specified by .rnd.
Rounding modifiers (no default):
.rn    mantissa LSB rounds to nearest even
.rz    mantissa LSB rounds towards zero
.rm    mantissa LSB rounds towards negative infinity
.rp    mantissa LSB rounds towards positive infinity
Subnormal numbers:
By default, subnormal numbers are supported.
Saturation modifier:
fma.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
fma.f32.{f16/bf16} introduced in PTX ISA version 8.6.
Target ISA Notes
fma.f32.{f16/bf16} requires sm_100 or higher.
Examples
.reg .f32 fc, fd;
.reg .f16 ha, hb;
fma.rz.sat.f32.f16 fd, ha, hb, fc;
As with single-precision floating-point instructions, the set, setp, and slct instructions support subnormal numbers for sm_20 and higher targets and flush single-precision subnormal inputs to sign-preserving zero for sm_1x targets. The optional .ftz modifier provides backward compatibility with sm_1x targets by flushing subnormal inputs and results to sign-preserving zero regardless of the target architecture.
Compares two numeric values and optionally combines the result with another predicate value by applying a Boolean operator. If this result is True, 1.0f is written for floating-point destination types, and 0xffffffff is written for integer destination types. Otherwise, 0x00000000 is written.
Operand d has type .dtype; operands a and b have type .stype; operand c has type .pred.
Semantics
t = (a CmpOp b) ? 1 : 0;
if (isFloat(dtype))
    d = BoolOp(t, c) ? 1.0f : 0x00000000;
else
    d = BoolOp(t, c) ? 0xffffffff : 0x00000000;
Integer Notes
The signed and unsigned comparison operators are eq, ne, lt, le, gt, ge.
For unsigned values, the comparison operators lo, ls, hi, and hs for lower, lower-or-same, higher, and higher-or-same may be used instead of lt, le, gt, ge, respectively.
The untyped, bit-size comparisons are eq and ne.
Floating Point Notes
The ordered comparisons are eq, ne, lt, le, gt, ge. If either operand is NaN, the result is False.
To aid comparison operations in the presence of NaN values, unordered versions are included: equ, neu, ltu, leu, gtu, geu. If both operands are numeric values (not NaN), then these comparisons have the same result as their ordered counterparts. If either operand is NaN, then the result of these comparisons is True.
num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN.
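The ordered and unordered floating-point comparison rules described in these notes can be sketched in Python as follows. This is an illustrative model only: the function name cmp_op and the use of plain strings for operator names are assumptions, not part of PTX.

```python
import math
import operator

# Ordered comparison operators and their Python equivalents.
_ORDERED = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}

def cmp_op(op, a, b):
    """Evaluate a floating-point CmpOp following the NaN rules in the notes."""
    unordered = math.isnan(a) or math.isnan(b)
    if op == "num":
        return not unordered
    if op == "nan":
        return unordered
    if op in _ORDERED:
        # Ordered comparisons are False whenever either operand is NaN.
        return (not unordered) and _ORDERED[op](a, b)
    base = op[:-1]  # e.g. "ltu" -> "lt"
    if op.endswith("u") and base in _ORDERED:
        # Unordered comparisons are True whenever either operand is NaN.
        return unordered or _ORDERED[base](a, b)
    raise ValueError("unknown comparison operator: " + op)
```

Note that the ordered ne is False when an operand is NaN, which differs from the behavior of != in many host languages; neu is the variant that returns True in that case.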
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
set.ftz.dtype.f32 flushes subnormal inputs to sign-preserving zero.
sm_1x
set.dtype.f64 supports subnormal numbers.
set.dtype.f32 flushes subnormal inputs to sign-preserving zero.
Compares two values and combines the result with another predicate value by applying a Boolean operator. This result is written to the first destination operand. A related value computed using the complement of the compare result is written to the second destination operand.
Applies to all numeric types. Operands a and b have type .type; operands p, q, and c have type .pred. The sink symbol ‘_’ may be used in place of any one of the destination operands.
Semantics
t = (a CmpOp b) ? 1 : 0;
p = BoolOp(t, c);
q = BoolOp(!t, c);
Integer Notes
The signed and unsigned comparison operators are eq, ne, lt, le, gt, ge.
For unsigned values, the comparison operators lo, ls, hi, and hs for lower, lower-or-same, higher, and higher-or-same may be used instead of lt, le, gt, ge, respectively.
The untyped, bit-size comparisons are eq and ne.
Floating Point Notes
The ordered comparisons are eq, ne, lt, le, gt, ge. If either operand is NaN, the result is False.
To aid comparison operations in the presence of NaN values, unordered versions are included: equ, neu, ltu, leu, gtu, geu. If both operands are numeric values (not NaN), then these comparisons have the same result as their ordered counterparts. If either operand is NaN, then the result of these comparisons is True.
num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN.
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
setp.ftz.dtype.f32 flushes subnormal inputs to sign-preserving zero.
sm_1x
setp.dtype.f64 supports subnormal numbers.
setp.dtype.f32 flushes subnormal inputs to sign-preserving zero.
Modifier .ftz applies only to .f32 comparisons.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Target ISA Notes
setp with .f64 source type requires sm_13 or higher.
Conditional selection. If c >= 0, a is stored in d, otherwise b is stored in d. Operands d, a, and b are treated as a bitsize type of the same width as the first instruction type; operand c must match the second instruction type (.s32 or .f32). The selected input is copied to the output without modification.
Semantics
d = (c >= 0) ? a : b;
Floating Point Notes
For .f32 comparisons, negative zero equals zero.
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
slct.ftz.dtype.f32 flushes subnormal values of operand c to sign-preserving zero, and operand a is selected.
sm_1x
slct.dtype.f32 flushes subnormal values of operand c to sign-preserving zero, and operand a is selected.
Modifier .ftz applies only to .f32 comparisons.
If operand c is NaN, the comparison is unordered and operand b is selected.
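The selection rule, including the negative-zero and NaN cases noted above, can be checked with a very small Python model. The function name slct_model is hypothetical, and subnormal flushing is not modeled.

```python
def slct_model(a, b, c):
    """Return a when c >= 0, otherwise b (illustrative model of slct)."""
    # A NaN in c makes the comparison False, so b is selected.
    # Negative zero compares equal to zero, so a is selected.
    return a if c >= 0 else b
```

For instance, slct_model("a", "b", -0.0) returns "a", while slct_model("a", "b", float("nan")) returns "b".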
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Target ISA Notes
slct.f64 requires sm_13 or higher.
Examples
slct.u32.s32  x, y, z, val;
slct.ftz.u64.f32  A, B, C, fval;
Compare two numeric values with a relational operator, and optionally combine this result with a predicate value by applying a Boolean operator.
Syntax
set.CmpOp{.ftz}.f16.stype            d, a, b;
set.CmpOp.BoolOp{.ftz}.f16.stype     d, a, b, {!}c;

set.CmpOp.bf16.stype                 d, a, b;
set.CmpOp.BoolOp.bf16.stype          d, a, b, {!}c;

set.CmpOp{.ftz}.dtype.f16            d, a, b;
set.CmpOp.BoolOp{.ftz}.dtype.f16     d, a, b, {!}c;
.dtype  = { .u16, .s16, .u32, .s32}

set.CmpOp.dtype.bf16                 d, a, b;
set.CmpOp.BoolOp.dtype.bf16          d, a, b, {!}c;
.dtype  = { .u16, .s16, .u32, .s32}

set.CmpOp{.ftz}.dtype.f16x2          d, a, b;
set.CmpOp.BoolOp{.ftz}.dtype.f16x2   d, a, b, {!}c;
.dtype  = { .f16x2, .u32, .s32}

set.CmpOp.dtype.bf16x2               d, a, b;
set.CmpOp.BoolOp.dtype.bf16x2        d, a, b, {!}c;
.dtype  = { .bf16x2, .u32, .s32}

.CmpOp  = { eq, ne, lt, le, gt, ge,
            equ, neu, ltu, leu, gtu, geu, num, nan };
.BoolOp = { and, or, xor };
.stype  = { .b16, .b32, .b64,
            .u16, .u32, .u64,
            .s16, .s32, .s64,
            .f16, .f32, .f64};
Description
Compares two numeric values and optionally combines the result with another predicate value by applying a Boolean operator.
Result of this computation is written in destination register in the following way:
If result is True,
0xffffffff is written for destination types .u32/.s32.
0xffff is written for destination types .u16/.s16.
1.0 in target precision floating point format is written for destination type .f16, .bf16.
If result is False,
0x0 is written for all integer destination types.
0.0 in target precision floating point format is written for destination type .f16, .bf16.
If the source type is .f16x2 or .bf16x2 then result of individual operations are packed in the 32-bit destination operand.
Operand c has type .pred.
Semantics
if (stype == .f16x2 || stype == .bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    t[0]   = (fA[0] CmpOp fB[0]) ? 1 : 0;
    t[1]   = (fA[1] CmpOp fB[1]) ? 1 : 0;
    if (dtype == .f16x2 || dtype == .bf16x2) {
        for (i = 0; i < 2; i++) {
            d[i] = BoolOp(t[i], c) ? 1.0 : 0.0;
        }
    } else {
        for (i = 0; i < 2; i++) {
            d[i] = BoolOp(t[i], c) ? 0xffff : 0;
        }
    }
} else if (dtype == .f16 || dtype == .bf16) {
    t = (a CmpOp b) ? 1 : 0;
    d = BoolOp(t, c) ? 1.0 : 0.0;
} else { // Integer destination type
    trueVal = (isU16(dtype) || isS16(dtype)) ? 0xffff : 0xffffffff;
    t = (a CmpOp b) ? 1 : 0;
    d = BoolOp(t, c) ? trueVal : 0;
}
Floating Point Notes
The ordered comparisons are eq, ne, lt, le, gt, ge. If either operand is NaN, the result is False.
To aid comparison operations in the presence of NaN values, unordered versions are included: equ, neu, ltu, leu, gtu, geu. If both operands are numeric values (not NaN), then these comparisons have the same result as their ordered counterparts. If either operand is NaN, then the result of these comparisons is True.
num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN.
Subnormal numbers:
By default, subnormal numbers are supported.
When .ftz modifier is specified then subnormal inputs and results are flushed to sign-preserving zero.
PTX ISA Notes
Introduced in PTX ISA version 4.2.
set.{u16,u32,s16,s32}.f16 and set.{u32,s32}.f16x2 are introduced in PTX ISA version 6.5.
set.{u16,u32,s16,s32}.bf16, set.{u32,s32,bf16x2}.bf16x2, set.bf16.{s16,u16,f16,b16,s32,u32,f32,b32,s64,u64,f64,b64} are introduced in PTX ISA version 7.8.
Target ISA Notes
Requires sm_53 or higher.
set.{u16,u32,s16,s32}.bf16, set.{u32,s32,bf16x2}.bf16x2, set.bf16.{s16,u16,f16,b16,s32,u32,f32,b32,s64,u64,f64,b64} require sm_90 or higher.
Compare two numeric values with a relational operator, and optionally combine this result with a predicate value by applying a Boolean operator.
Syntax
setp.CmpOp{.ftz}.f16           p, a, b;
setp.CmpOp.BoolOp{.ftz}.f16    p, a, b, {!}c;

setp.CmpOp{.ftz}.f16x2         p|q, a, b;
setp.CmpOp.BoolOp{.ftz}.f16x2  p|q, a, b, {!}c;

setp.CmpOp.bf16                p, a, b;
setp.CmpOp.BoolOp.bf16         p, a, b, {!}c;

setp.CmpOp.bf16x2              p|q, a, b;
setp.CmpOp.BoolOp.bf16x2       p|q, a, b, {!}c;

.CmpOp  = { eq, ne, lt, le, gt, ge,
            equ, neu, ltu, leu, gtu, geu, num, nan };
.BoolOp = { and, or, xor };
Description
Compares two values and combines the result with another predicate value by applying a Boolean operator. This result is written to the destination operand.
Operands c, p, and q have type .pred.
For instruction type .f16, operands a and b have type .b16 or .f16.
For instruction type .f16x2, operands a and b have type .b32.
For instruction type .bf16, operands a and b have type .b16.
For instruction type .bf16x2, operands a and b have type .b32.
Semantics
if (type == .f16 || type == .bf16) {
    t = (a CmpOp b) ? 1 : 0;
    p = BoolOp(t, c);
} else if (type == .f16x2 || type == .bf16x2) {
    fA[0] = a[0:15];
    fA[1] = a[16:31];
    fB[0] = b[0:15];
    fB[1] = b[16:31];
    t[0] = (fA[0] CmpOp fB[0]) ? 1 : 0;
    t[1] = (fA[1] CmpOp fB[1]) ? 1 : 0;
    p = BoolOp(t[0], c);
    q = BoolOp(t[1], c);
}
Floating Point Notes
The ordered comparisons are eq, ne, lt, le, gt, ge. If either operand is NaN, the result is False.
To aid comparison operations in the presence of NaN values, unordered versions are included: equ, neu, ltu, leu, gtu, geu. If both operands are numeric values (not NaN), then these comparisons have the same result as their ordered counterparts. If either operand is NaN, then the result of these comparisons is True.
num returns True if both operands are numeric values (not NaN), and nan returns True if either operand is NaN.
Subnormal numbers:
By default, subnormal numbers are supported.
setp.ftz.{f16,f16x2} flushes subnormal inputs to sign-preserving zero.
PTX ISA Notes
Introduced in PTX ISA version 4.2.
setp.{bf16/bf16x2} introduced in PTX ISA version 7.8.
The logic and shift instructions are fundamentally untyped, performing bit-wise operations on operands of any type, provided the operands are of the same size. This permits bit-wise operations on floating point values without having to define a union to access the bits. Instructions and, or, xor, and not also operate on predicates.
lop3.b32 d, a, b, c, immLut;
lop3.BoolOp.b32 d|p, a, b, c, immLut, q;
.BoolOp   = { .or , .and };
Description
Compute bitwise logical operation on inputs a, b, c and store the result in destination d.
Optionally, .BoolOp can be specified to compute the predicate result p by performing a Boolean operation on the destination operand d with the predicate q in the following manner:
p = (d != 0) BoolOp q;
The sink symbol ‘_’ may be used in place of the destination operand d when .BoolOp qualifier is specified.
The logical operation is defined by a look-up table which, for 3 inputs, can be represented as an 8-bit value specified by operand immLut as described below. immLut is an integer constant that can take values from 0 to 255, thereby allowing up to 256 distinct logical operations on inputs a, b, c.
For a logical operation F(a, b, c) the value of immLut can be computed by applying the same operation to three predefined constant values as follows:
If F = (a & b & c);
immLut = 0xF0 & 0xCC & 0xAA = 0x80
If F = (a | b | c);
immLut = 0xF0 | 0xCC | 0xAA = 0xFE
If F = (a & b & ~c);
immLut = 0xF0 & 0xCC & (~0xAA) = 0x40
If F = ((a & b | c) ^ a);
immLut = (0xF0 & 0xCC | 0xAA) ^ 0xF0 = 0x1A
The following table illustrates computation of immLut for various logical operations:
ta      tb  tc  Oper 0   Oper 1          Oper 2           …   Oper 254        Oper 255
                (False)  (ta & tb & tc)  (ta & tb & ~tc)      (ta | tb | tc)  (True)
------  --  --  -------  --------------  ---------------  --  --------------  --------
0       0   0   0        0               0                …   0               1
0       0   1   0        0               0                …   1               1
0       1   0   0        0               0                …   1               1
0       1   1   0        0               0                …   1               1
1       0   0   0        0               0                …   1               1
1       0   1   0        0               0                …   1               1
1       1   0   0        0               1                …   1               1
1       1   1   0        1               0                …   1               1
immLut          0x0      0x80            0x40             …   0xFE            0xFF
Semantics
F = GetFunctionFromTable(immLut); // returns the function corresponding to immLut value
d = F(a, b, c);
if (BoolOp specified) {
    p = (d != 0) BoolOp q;
}
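The immLut derivation and the table lookup can be reproduced with a short Python sketch. The helper names imm_lut and lop3_b32 are hypothetical; the sketch assumes the truth-table index is formed as (a << 2) | (b << 1) | c per bit position, which is the ordering implied by the constants 0xF0, 0xCC, and 0xAA.

```python
TA, TB, TC = 0xF0, 0xCC, 0xAA  # predefined constants for inputs a, b, c

def imm_lut(f):
    """Compute the 8-bit immLut for a three-input bitwise function f."""
    return f(TA, TB, TC) & 0xFF

def lop3_b32(a, b, c, lut):
    """Apply the lookup table bit by bit to three 32-bit inputs."""
    d = 0
    for bit in range(32):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        d |= ((lut >> index) & 1) << bit
    return d
```

For example, imm_lut(lambda a, b, c: (a & b | c) ^ a) evaluates to 0x1A, matching the fourth case listed in the description, and passing that value to lop3_b32 gives the same result as evaluating the expression directly on 32-bit inputs.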
PTX ISA Notes
Introduced in PTX ISA version 4.3.
Support for.BoolOp qualifier introduced in PTX ISA version 8.2.
Target ISA Notes
Requires sm_50 or higher.
Qualifier .BoolOp requires sm_70 or higher.
Examples
lop3.b32       d, a, b, c, 0x40;
lop3.or.b32  d|p, a, b, c, 0x3f, q;
lop3.and.b32 _|p, a, b, c, 0x3f, q;
shf.l.mode.b32  d, a, b, c;  // left shift
shf.r.mode.b32  d, a, b, c;  // right shift
.mode = { .clamp, .wrap };
Description
Shift the 64-bit value formed by concatenating operands a and b left or right by the amount specified by the unsigned 32-bit value in c. Operand b holds bits 63:32 and operand a holds bits 31:0 of the 64-bit source value. The source is shifted left or right by the clamped or wrapped value in c. For shf.l, the most-significant 32-bits of the result are written into d; for shf.r, the least-significant 32-bits of the result are written into d.
Semantics
u32  n = (.mode == .clamp) ? min(c, 32) : c & 0x1f;
switch (shf.dir) {  // shift concatenation of [b, a]
    case shf.l:     // extract 32 msbs
        u32  d = (b << n)      | (a >> (32-n));
    case shf.r:     // extract 32 lsbs
        u32  d = (b << (32-n)) | (a >> n);
}
Notes
Use funnel shift for multi-word shift operations and for rotate operations. The shift amount is limited to the range 0..32 in clamp mode and 0..31 in wrap mode, so shifting multi-word values by distances greater than 32 requires first moving 32-bit words, then using shf to shift the remaining 0..31 distance.
To shift data sizes greater than 64 bits to the right, use repeated shf.r instructions applied to adjacent words, operating from least-significant word towards most-significant word. At each step, a single word of the shifted result is computed. The most-significant word of the result is computed using a shr.{u32,s32} instruction, which zero or sign fills based on the instruction type.
To shift data sizes greater than 64 bits to the left, use repeated shf.l instructions applied to adjacent words, operating from most-significant word towards least-significant word. At each step, a single word of the shifted result is computed. The least-significant word of the result is computed using a shl instruction.
Use funnel shift to perform 32-bit left or right rotate by supplying the same value for source arguments a and b.
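The funnel shift, the rotate idiom, and the multi-word left shift described in these notes can be modeled with Python integers. The function names shf and shl128 and the string arguments for direction and mode are hypothetical conveniences for this sketch.

```python
MASK32 = 0xFFFFFFFF

def shf(direction, mode, a, b, c):
    """Model of shf.{l,r}.{clamp,wrap}.b32 on the 64-bit value [b, a]."""
    n = min(c, 32) if mode == "clamp" else c & 0x1F
    wide = ((b & MASK32) << 32) | (a & MASK32)
    if direction == "l":
        # Keep the most-significant 32 bits of the shifted 64-bit value.
        return ((wide << n) >> 32) & MASK32
    # Keep the least-significant 32 bits of the shifted 64-bit value.
    return (wide >> n) & MASK32

def shl128(words, n):
    """128-bit left shift by n < 32; words = [r0, r1, r2, r3], r0 least significant."""
    r0, r1, r2, r3 = words
    r7 = shf("l", "clamp", r2, r3, n)
    r6 = shf("l", "clamp", r1, r2, n)
    r5 = shf("l", "clamp", r0, r1, n)
    r4 = (r0 << n) & MASK32  # plain shl for the least-significant word
    return [r4, r5, r6, r7]
```

Supplying the same value for a and b gives a rotate: shf("l", "clamp", 0x80000001, 0x80000001, 1) returns 0x00000003.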
PTX ISA Notes
Introduced in PTX ISA version 3.1.
Target ISA Notes
Requires sm_32 or higher.
Example
shf.l.clamp.b32  r3,r1,r0,16;

// 128-bit left shift; n < 32
// [r7,r6,r5,r4] = [r3,r2,r1,r0] << n
shf.l.clamp.b32  r7,r2,r3,n;
shf.l.clamp.b32  r6,r1,r2,n;
shf.l.clamp.b32  r5,r0,r1,n;
shl.b32          r4,r0,n;

// 128-bit right shift, arithmetic; n < 32
// [r7,r6,r5,r4] = [r3,r2,r1,r0] >> n
shf.r.clamp.b32  r4,r0,r1,n;
shf.r.clamp.b32  r5,r1,r2,n;
shf.r.clamp.b32  r6,r2,r3,n;
shr.s32          r7,r3,n;     // result is sign-extended

shf.r.clamp.b32  r1,r0,r0,n;  // rotate right by n; n < 32
shf.l.clamp.b32  r1,r0,r0,n;  // rotate left by n; n < 32

// extract 32-bits from [r1,r0] starting at position n < 32
shf.r.clamp.b32  r0,r0,r1,n;
Shift a left by the amount specified by unsigned 32-bit value in b.
Semantics
d = a << b;
Notes
Shift amounts greater than the register width N are clamped to N.
The sizes of the destination and first source operand must match, but not necessarily the type. The b operand must be a 32-bit value, regardless of the instruction type.
Shift a right by the amount specified by unsigned 32-bit value in b. Signed shifts fill with the sign bit, unsigned and untyped shifts fill with 0.
Semantics
d = a >> b;
Notes
Shift amounts greater than the register width N are clamped to N.
The sizes of the destination and first source operand must match, but not necessarily the type. The b operand must be a 32-bit value, regardless of the instruction type.
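The clamping rule for shl and shr differs from the shift behavior of many host languages, so a small model may help. In the Python sketch below, the function names and the width and signed parameters are hypothetical; values are treated as unsigned bit patterns of the given width.

```python
def shl_model(a, b, width=32):
    """Left shift with the shift amount clamped to the register width."""
    n = min(b, width)
    return (a << n) & ((1 << width) - 1)

def shr_model(a, b, width=32, signed=False):
    """Right shift with clamping; signed shifts fill with the sign bit."""
    mask = (1 << width) - 1
    n = min(b, width)
    a &= mask
    if signed and (a >> (width - 1)):
        a -= 1 << width  # reinterpret the bit pattern as a negative value
    return (a >> n) & mask
```

With a clamped shift amount, shl_model(1, 40) is 0, and shr_model(0x80000000, 40, signed=True) is 0xFFFFFFFF because every bit is filled with the sign bit.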
These instructions copy data from place to place, and from state space to state space, possibly converting it from one format to another. mov, ld, ldu, and st operate on both scalar and vector types. The isspacep instruction is provided to query whether a generic address falls within a particular state space window. The cvta instruction converts addresses between generic and const, global, local, or shared state spaces.
Instructions ld, st, suld, and sust support optional cache operations.
The Data Movement and Conversion Instructions are:
PTX ISA version 2.0 introduced optional cache operators on load and store instructions. The cache operators require a target architecture of sm_20 or higher.
Cache operators on load or store instructions are treated as performance hints only. The use of a cache operator on an ld or st instruction does not change the memory consistency behavior of the program.
For sm_20 and higher, the cache operators have the following definitions and behavior.
Table 30 Cache Operators for Memory Load Instructions
Operator
Meaning
.ca
Cache at all levels, likely to be accessed again.
The default load instruction cache operation is ld.ca, which allocates cache lines in all levels (L1 and L2) with normal eviction policy. Global data is coherent at the L2 level, but multiple L1 caches are not coherent for global data. If one thread stores to global memory via one L1 cache, and a second thread loads that address via a second L1 cache with ld.ca, the second thread may get stale L1 cache data, rather than the data stored by the first thread. The driver must invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first grid program are then correctly fetched by the second grid program issuing default ld.ca loads cached in L1.
.cg
Cache at global level (cache in L2 and below, not L1).
Use ld.cg to cache loads only globally, bypassing the L1 cache, and cache only in the L2 cache.
.cs
Cache streaming, likely to be accessed once.
The ld.cs load cached streaming operation allocates global lines with evict-first policy in L1 and L2 to limit cache pollution by temporary streaming data that may be accessed once or twice. When ld.cs is applied to a Local window address, it performs the ld.lu operation.
.lu
Last use.
The compiler/programmer may use ld.lu when restoring spilled registers and popping function stack frames to avoid needless write-backs of lines that will not be used again. The ld.lu instruction performs a load cached streaming operation (ld.cs) on global addresses.
.cv
Don’t cache and fetch again (consider cached system memory lines stale, fetch again).
The ld.cv load operation applied to a global System Memory address invalidates (discards) a matching L2 line and re-fetches the line on each new load.
Table 31 Cache Operators for Memory Store Instructions
Operator
Meaning
.wb
Cache write-back all coherent levels.
The default store instruction cache operation is st.wb, which writes back cache lines of coherent cache levels with normal eviction policy.
If one thread stores to global memory, bypassing its L1 cache, and a second thread in a different SM later loads from that address via a different L1 cache with ld.ca, the second thread may get a hit on stale L1 cache data, rather than get the data from L2 or memory stored by the first thread.
The driver must invalidate global L1 cache lines between dependent grids of thread arrays. Stores by the first grid program are then correctly missed in L1 and fetched by the second grid program issuing default ld.ca loads.
.cg
Cache at global level (cache in L2 and below, not L1).
Use st.cg to cache global store data only globally, bypassing the L1 cache, and cache only in the L2 cache.
.cs
Cache streaming, likely to be accessed once.
The st.cs store cached-streaming operation allocates cache lines with evict-first policy to limit cache pollution by streaming output data.
.wt
Cache write-through (to system memory).
The st.wt store write-through operation applied to a global System Memory address writes through the L2 cache.
PTX ISA version 7.4 adds optional cache eviction priority hints on load and store instructions. Cache eviction priority requires target architecture sm_70 or higher.
Cache eviction priority on load or store instructions is treated as a performance hint. It is supported for .global state space and generic addresses where the address points to .global state space.
Table 32 Cache Eviction Priority Hints for Memory Load and Store Instructions
Cache Eviction Priority
Meaning
evict_normal
Cache data with normal eviction priority. This is the default eviction priority.
evict_first
Data cached with this priority will be first in the eviction priority order and will likely be evicted when cache eviction is required. This priority is suitable for streaming data.
evict_last
Data cached with this priority will be last in the eviction priority order and will likely be evicted only after other data with evict_normal or evict_first eviction priority is already evicted. This priority is suitable for data that should remain persistent in cache.
evict_unchanged
Do not change eviction priority order as part of this operation.
no_allocate
Do not allocate data to cache. This priority is suitable for streaming data.
Set a register variable with the value of a register variable or an immediate value. Take the non-generic address of a variable in global, local, or shared state space.
Syntax
mov.type  d, a;
mov.type  d, sreg;
mov.type  d, avar;       // get address of variable
mov.type  d, avar+imm;   // get address of variable with offset
mov.u32   d, fname;      // get address of device function
mov.u64   d, fname;      // get address of device function
mov.u32   d, kernel;     // get address of entry function
mov.u64   d, kernel;     // get address of entry function
.type = { .pred,
          .b16, .b32, .b64,
          .u16, .u32, .u64,
          .s16, .s32, .s64,
                .f32, .f64 };
Description
Write register d with the value of a.
Operand a may be a register, special register, variable with optional offset in an addressable memory space, or function name.
For variables declared in .const, .global, .local, and .shared state spaces, mov places the non-generic address of the variable (i.e., the address of the variable in its state space) into the destination register. The generic address of a variable in const, global, local, or shared state space may be generated by first taking the address within the state space with mov and then converting it to a generic address using the cvta instruction; alternately, the generic address of a variable declared in const, global, local, or shared state space may be taken directly using the cvta instruction.
Note that if the address of a device function parameter is moved to a register, the parameter will be copied onto the stack and the address will be in the local state space.
Semantics
d = a;
d = sreg;
d = &avar;        // address is non-generic; i.e., within the variable's declared state space
d = &avar+imm;
Notes
Although only predicate and bit-size types are required, we include the arithmetic types for the programmer’s convenience: their use enhances program readability and allows additional type checking.
When moving the address of a kernel or a device function, only .u32 or .u64 instruction types are allowed. However, if a signed type is used, it is not treated as a compilation error. The compiler issues a warning in this case.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Taking the address of kernel entry functions requires PTX ISA version 3.1 or later. Kernel function addresses should only be used in the context of CUDA Dynamic Parallelism system calls. See the CUDA Dynamic Parallelism Programming Guide for details.
Target ISA Notes
mov.f64 requires sm_13 or higher.
Taking the address of kernel entry functions requires sm_35 or higher.
Examples
mov.f32  d,a;
mov.u16  u,v;
mov.f32  k,0.1;
mov.u32  ptr, A;        // move address of A into ptr
mov.u32  ptr, A[5];     // move address of A[5] into ptr
mov.u32  ptr, A+20;     // move address with offset into ptr
mov.u32  addr, myFunc;  // get address of device function 'myFunc'
mov.u64  kptr, main;    // get address of entry function 'main'
Write scalar register d with the packed value of vector register a, or write vector register d with the unpacked values from scalar register a.
When destination operand d is a vector register, the sink symbol '_' may be used for one or more elements provided that at least one element is a scalar register.
For bit-size types, mov may be used to pack vector elements into a scalar register or unpack sub-fields of a scalar register into a vector. Both the overall size of the vector and the size of the scalar must match the size of the instruction type.
Semantics
// pack two 8-bit elements into .b16
d = a.x | (a.y << 8)
// pack four 8-bit elements into .b32
d = a.x | (a.y << 8)  | (a.z << 16) | (a.w << 24)
// pack two 16-bit elements into .b32
d = a.x | (a.y << 16)
// pack four 16-bit elements into .b64
d = a.x | (a.y << 16) | (a.z << 32) | (a.w << 48)
// pack two 32-bit elements into .b64
d = a.x | (a.y << 32)
// pack four 32-bit elements into .b128
d = a.x | (a.y << 32) | (a.z << 64) | (a.w << 96)
// pack two 64-bit elements into .b128
d = a.x | (a.y << 64)
// unpack 8-bit elements from .b16
{ d.x, d.y } = { a[0..7], a[8..15] }
// unpack 8-bit elements from .b32
{ d.x, d.y, d.z, d.w } = { a[0..7], a[8..15], a[16..23], a[24..31] }
// unpack 16-bit elements from .b32
{ d.x, d.y } = { a[0..15], a[16..31] }
// unpack 16-bit elements from .b64
{ d.x, d.y, d.z, d.w } = { a[0..15], a[16..31], a[32..47], a[48..63] }
// unpack 32-bit elements from .b64
{ d.x, d.y } = { a[0..31], a[32..63] }
// unpack 32-bit elements from .b128
{ d.x, d.y, d.z, d.w } = { a[0..31], a[32..63], a[64..95], a[96..127] }
// unpack 64-bit elements from .b128
{ d.x, d.y } = { a[0..63], a[64..127] }
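All of the pack and unpack cases above follow one pattern: element i occupies bits [i * size, (i + 1) * size) of the scalar. A minimal Python sketch of that pattern, with hypothetical helper names pack and unpack, might look like this:

```python
def pack(elems, elem_bits):
    """Pack elements (x first) into one integer; x lands in the low bits."""
    mask = (1 << elem_bits) - 1
    d = 0
    for i, e in enumerate(elems):
        d |= (e & mask) << (i * elem_bits)
    return d

def unpack(value, elem_bits, count):
    """Split an integer into count elements, lowest bits first."""
    mask = (1 << elem_bits) - 1
    return [(value >> (i * elem_bits)) & mask for i in range(count)]
```

For example, pack([0x11, 0x22, 0x33, 0x44], 8) gives 0x44332211, and unpack(0x44332211, 8, 4) recovers the original four bytes in x, y, z, w order.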
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Support for .b128 type introduced in PTX ISA version 8.3.
Target ISA Notes
Supported on all target architectures.
Support for .b128 type requires sm_70 or higher.
Examples
mov.b32 %r1,{a,b}; // a,b have type .u16mov.b64 {lo,hi}, %x; // %x is a double; lo,hi are .u32mov.b32 %r1,{x,y,z,w}; // x,y,z,w have type .b8mov.b32 {r,g,b,a},%r1; // r,g,b,a have type .u8mov.b64 {%r1, _}, %x; // %x is.b64, %r1 is .b32mov.b128 {%b1, %b2}, %y; // %y is.b128, %b1 and % b2 are .b64mov.b128 %y, {%b1, %b2}; // %y is.b128, %b1 and % b2 are .b64
The shfl instruction without a .sync qualifier is deprecated in PTX ISA version 6.0.
Support for this instruction with .target lower than sm_70 may be removed in a future PTX ISA version.
Removal Note
Support for shfl instruction without a .sync qualifier is removed in PTX ISA version 6.4 for .target sm_70 or higher.
Description
Exchange register data between threads of a warp.
Each thread in the currently executing warp will compute a source lane index j based on input operands b and c and the mode. If the computed source lane index j is in range, the thread will copy the input operand a from lane j into its own destination register d; otherwise, the thread will simply copy its own input a to destination d. The optional destination predicate p is set to True if the computed source lane is in range, and otherwise set to False.
Note that an out of range value of b may still result in a valid computed source lane index j. In this case, a data transfer occurs and the destination predicate p is True.
Note that results are undefined in divergent control flow within a warp, if an active thread sources a register from an inactive thread.
Operand b specifies a source lane or source lane offset, depending on the mode.
Operand c contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.
Semantics
lane[4:0]  = [Thread].laneid;  // position of thread in warp
bval[4:0]  = b[4:0];           // source lane or lane offset (0..31)
cval[4:0]  = c[4:0];           // clamp value
mask[4:0]  = c[12:8];

// get value of source register a if thread is active and
// guard predicate true, else unpredictable
if (isActive(Thread) && isGuardPredicateTrue(Thread)) {
    SourceA[lane] = a;
} else {
    // Value of SourceA[lane] is unpredictable for
    // inactive/predicated-off threads in warp
}
maxLane = (lane[4:0] & mask[4:0]) | (cval[4:0] & ~mask[4:0]);
minLane = (lane[4:0] & mask[4:0]);

switch (.mode) {
    case .up:   j = lane - bval; pval = (j >= maxLane); break;
    case .down: j = lane + bval; pval = (j <= maxLane); break;
    case .bfly: j = lane ^ bval; pval = (j <= maxLane); break;
    case .idx:  j = minLane | (bval[4:0] & ~mask[4:0]);
                pval = (j <= maxLane); break;
}
if (!pval) j = lane;  // copy from own lane
d = SourceA[j];       // copy input a from lane j
if (dest predicate selected)
    p = pval;
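To make the lane arithmetic easier to follow, the sketch below transcribes the source-lane computation from the semantics into Python. It is an illustrative model only: the function name shfl_source_lane and the string mode names are assumptions made for this example, and the model ignores inactive or predicated-off threads.

```python
def shfl_source_lane(mode, lane, b, c):
    """Return (j, pval) for one thread, following the semantics listed above."""
    bval = b & 0x1F                # source lane or lane offset
    cval = c & 0x1F                # clamp value, c[4:0]
    segmask = (c >> 8) & 0x1F      # segment mask, c[12:8]

    max_lane = (lane & segmask) | (cval & ~segmask & 0x1F)
    min_lane = lane & segmask

    if mode == "up":
        j = lane - bval
        pval = j >= max_lane
    elif mode == "down":
        j = lane + bval
        pval = j <= max_lane
    elif mode == "bfly":
        j = lane ^ bval
        pval = j <= max_lane
    elif mode == "idx":
        j = min_lane | (bval & ~segmask & 0x1F)
        pval = j <= max_lane
    else:
        raise ValueError("unknown mode: " + mode)

    if not pval:
        j = lane                   # out of range: read from own lane
    return j, pval


print(shfl_source_lane("up", 0, 1, 0x0))          # (0, False)
print(shfl_source_lane("down", 30, 1, 0x1F))      # (31, True)
print(shfl_source_lane("idx", 13, 2, 0x1807))     # (10, True)
```

The last call uses c = 0x1807, that is a segment mask of 0x18 and a clamp of 0x07, which splits the warp into four 8-lane segments; lane 13 therefore reads index 2 of its own segment, which is lane 10.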
PTX ISA Notes
Introduced in PTX ISA version 3.0.
Deprecated in PTX ISA version 6.0 in favor of shfl.sync.
Not supported in PTX ISA version 6.4 for .target sm_70 or higher.
Target ISA Notes
shfl requires sm_30 or higher.
shfl is not supported on sm_70 or higher starting PTX ISA version 6.4.
Examples
// Warp-level INCLUSIVE PLUS SCAN:
//
// Assumes input in following registers:
//     - Rx  = sequence value for this thread
//
shfl.up.b32  Ry|p, Rx, 0x1,  0x0;
@p  add.f32      Rx, Ry, Rx;
shfl.up.b32  Ry|p, Rx, 0x2,  0x0;
@p  add.f32      Rx, Ry, Rx;
shfl.up.b32  Ry|p, Rx, 0x4,  0x0;
@p  add.f32      Rx, Ry, Rx;
shfl.up.b32  Ry|p, Rx, 0x8,  0x0;
@p  add.f32      Rx, Ry, Rx;
shfl.up.b32  Ry|p, Rx, 0x10, 0x0;
@p  add.f32      Rx, Ry, Rx;

// Warp-level INCLUSIVE PLUS REVERSE-SCAN:
//
// Assumes input in following registers:
//     - Rx  = sequence value for this thread
//
shfl.down.b32  Ry|p, Rx, 0x1,  0x1f;
@p  add.f32        Rx, Ry, Rx;
shfl.down.b32  Ry|p, Rx, 0x2,  0x1f;
@p  add.f32        Rx, Ry, Rx;
shfl.down.b32  Ry|p, Rx, 0x4,  0x1f;
@p  add.f32        Rx, Ry, Rx;
shfl.down.b32  Ry|p, Rx, 0x8,  0x1f;
@p  add.f32        Rx, Ry, Rx;
shfl.down.b32  Ry|p, Rx, 0x10, 0x1f;
@p  add.f32        Rx, Ry, Rx;

// BUTTERFLY REDUCTION:
//
// Assumes input in following registers:
//     - Rx  = sequence value for this thread
//
shfl.bfly.b32  Ry, Rx, 0x10, 0x1f;   // no predicate dest
add.f32        Rx, Ry, Rx;
shfl.bfly.b32  Ry, Rx, 0x8,  0x1f;
add.f32        Rx, Ry, Rx;
shfl.bfly.b32  Ry, Rx, 0x4,  0x1f;
add.f32        Rx, Ry, Rx;
shfl.bfly.b32  Ry, Rx, 0x2,  0x1f;
add.f32        Rx, Ry, Rx;
shfl.bfly.b32  Ry, Rx, 0x1,  0x1f;
add.f32        Rx, Ry, Rx;
//
// All threads now hold sum in Rx
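The inclusive plus scan above can be checked with a small host-side model. The Python sketch below simulates the five shfl.up steps for a fully active 32-lane warp; shfl_up and warp_inclusive_scan are hypothetical helper names, integers stand in for the .f32 values, and divergence is not modelled.

```python
WARP_SIZE = 32


def shfl_up(values, delta, c=0x0):
    """Model shfl.up for a fully active warp; returns a (Ry, p) pair per lane."""
    cval = c & 0x1F
    segmask = (c >> 8) & 0x1F
    results = []
    for lane, own in enumerate(values):
        max_lane = (lane & segmask) | (cval & ~segmask & 0x1F)
        j = lane - delta
        p = j >= max_lane
        results.append((values[j] if p else own, p))
    return results


def warp_inclusive_scan(values):
    """Repeat 'shfl.up Ry|p, Rx, delta, 0x0; @p add Rx, Ry, Rx' for delta = 1..16."""
    assert len(values) == WARP_SIZE
    rx = list(values)
    for delta in (1, 2, 4, 8, 16):
        shuffled = shfl_up(rx, delta)
        # The add is predicated on p, so lanes with no valid source keep Rx.
        rx = [x + ry if p else x for x, (ry, p) in zip(rx, shuffled)]
    return rx


print(warp_inclusive_scan([1] * WARP_SIZE))   # lane i ends up holding i + 1
```

The predicate matters here: without the @p guard, lanes below the offset would add their own value to themselves and corrupt the prefix sum.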
shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.
Operand membermask specifies a 32-bit integer which is a mask indicating threads participating in barrier where the bit position corresponds to thread’s laneid.
shfl.sync exchanges register data between threads in membermask.
Each thread in the currently executing warp will compute a source lane index j based on input operands b and c and the mode. If the computed source lane index j is in range, the thread will copy the input operand a from lane j into its own destination register d; otherwise, the thread will simply copy its own input a to destination d. The optional destination predicate p is set to True if the computed source lane is in range, and otherwise set to False.
Note that an out of range value of b may still result in a valid computed source lane index j. In this case, a data transfer occurs and the destination predicate p is True.
Note that results are undefined if a thread sources a register from an inactive thread or a thread that is not in membermask.
Operand b specifies a source lane or source lane offset, depending on the mode.
Operand c contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.
The behavior of shfl.sync is undefined if the executing thread is not in the membermask.
Note
For .target sm_6x or below, all threads in membermask must execute the same shfl.sync instruction in convergence, and only threads belonging to some membermask can be active when the shfl.sync instruction is executed. Otherwise, the behavior is undefined.
Semantics
// wait for all threads in membermask to arrive
wait_for_specified_threads(membermask);

lane[4:0]     = [Thread].laneid;  // position of thread in warp
bval[4:0]     = b[4:0];           // source lane or lane offset (0..31)
cval[4:0]     = c[4:0];           // clamp value
segmask[4:0]  = c[12:8];

// get value of source register a if thread is active and
// guard predicate true, else unpredictable
if (isActive(Thread) && isGuardPredicateTrue(Thread)) {
    SourceA[lane] = a;
} else {
    // Value of SourceA[lane] is unpredictable for
    // inactive/predicated-off threads in warp
}
maxLane = (lane[4:0] & segmask[4:0]) | (cval[4:0] & ~segmask[4:0]);
minLane = (lane[4:0] & segmask[4:0]);

switch (.mode) {
    case .up:   j = lane - bval; pval = (j >= maxLane); break;
    case .down: j = lane + bval; pval = (j <= maxLane); break;
    case .bfly: j = lane ^ bval; pval = (j <= maxLane); break;
    case .idx:  j = minLane | (bval[4:0] & ~segmask[4:0]);
                pval = (j <= maxLane); break;
}
if (!pval) j = lane;  // copy from own lane
d = SourceA[j];       // copy input a from lane j
if (dest predicate selected)
    p = pval;
Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register.
In the generic form (no mode specified), the permute control consists of four 4-bit selection values. The bytes in the two source registers are numbered from 0 to 7: {b, a} = {{b7, b6, b5, b4}, {b3, b2, b1, b0}}. For each byte in the target register, a 4-bit selection value is defined.
The 3 lsbs of the selection value specify which of the 8 source bytes should be moved into the target position. The msb defines if the byte value should be copied, or if the sign (msb of the byte) should be replicated over all 8 bits of the target position (sign extend of the byte value); msb=0 means copy the literal value; msb=1 means replicate the sign. Note that the sign extension is only performed as part of generic form.
Thus, the four 4-bit values fully specify an arbitrary byte permute, as a 16b permute code.
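| default mode | d.b3 source select | d.b2 source select | d.b1 source select | d.b0 source select |
|--------------|--------------------|--------------------|--------------------|--------------------|
| index        | c[15:12]           | c[11:8]            | c[7:4]             | c[3:0]             |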
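The generic (default mode) permute described above can be modelled in a few lines. The Python sketch below is illustrative only: the function name prmt_generic is an assumption for this example, and it covers only the generic form, not the specialized modes.

```python
def prmt_generic(a, b, c):
    """Model the generic-form byte permute: a holds bytes 0..3, b holds bytes 4..7."""
    source_bytes = [(a >> (8 * i)) & 0xFF for i in range(4)]
    source_bytes += [(b >> (8 * i)) & 0xFF for i in range(4)]

    d = 0
    for position in range(4):
        selector = (c >> (4 * position)) & 0xF   # c[3:0] controls d.b0, and so on
        byte = source_bytes[selector & 0x7]      # 3 lsbs pick the source byte
        if selector & 0x8:                       # msb set: replicate the sign bit
            byte = 0xFF if byte & 0x80 else 0x00
        d |= byte << (8 * position)
    return d


a, b = 0x33221100, 0x77665544
print(hex(prmt_generic(a, b, 0x3210)))   # 0x33221100 (identity on a)
print(hex(prmt_generic(a, b, 0x7654)))   # 0x77665544 (selects b)
print(hex(prmt_generic(a, b, 0x0123)))   # 0x112233   (byte-reverses a)
```

With selector msb set, each destination byte becomes 0x00 or 0xFF depending on the sign bit of the selected source byte, for example prmt_generic(0x80, 0, 0x8888) yields 0xFFFFFFFF.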
The more specialized form of the permute control uses the two lsb’s of operand c (which is typically an address pointer) to control the byte extraction.
Load register variable d from the location specified by the source address operand a in specified state space. If no state space is given, perform the load using Generic Addressing.
If no sub-qualifier is specified with .shared state space, then ::cta is assumed by default.
Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.
If no sub-qualifier is specified with .param state space, then:
::func is assumed when access is inside a device function.
::entry is assumed when accessing kernel function parameters from entry function. Otherwise, when accessing device function parameters or any other .param variables from entry function, ::func is assumed by default.
For ld.param::entry instruction, operand a must be a kernel parameter address, otherwise behavior is undefined. For ld.param::func instruction, operand a must be a device function parameter address, otherwise behavior is undefined.
The .relaxed and .acquire qualifiers indicate memory synchronization as described in the Memory Consistency Model. The .scope qualifier indicates the set of threads with which an ld.relaxed or ld.acquire instruction can directly synchronize¹. The .weak qualifier indicates a memory instruction with no synchronization. The effects of this instruction become visible to other threads only when synchronization is established by other means.
The semantic details of .mmio qualifier are described in the Memory Consistency Model. Only .sys thread scope is valid for ld.mmio operation. The qualifiers .mmio and .relaxed must be specified together.
The .weak, .volatile, .relaxed and .acquire qualifiers are mutually exclusive. When none of these is specified, the .weak qualifier is assumed by default.
The qualifiers .volatile, .relaxed and .acquire may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space. Cache operations are not permitted with these qualifiers. The qualifier .mmio may be used only with .global space and with generic addressing, where the address points to .global space.
The .v8 (.vec) qualifier is supported if:
.type is .b32, .s32, .u32, or .f32 AND
State space is .global or with generic addressing where address points to .global state space
The .v4 (.vec) qualifier with type .b64 or .s64 or .u64 or .f64 is supported if:
State space is .global or with generic addressing where address points to .global state space
Qualifiers .level1::eviction_priority and .level2::eviction_priority specify the eviction policy for L1 and L2 cache respectively which may be applied during memory access.
Qualifier .level2::eviction_priority is supported if:
.vec is .v8 and .type is .b32 or .s32 or .u32 or .f32
AND Operand d is vector of 8 registers with type specified with .type
OR .vec is .v4 and .type is .b64 or .s64 or .u64 or .f64
AND Operand d is vector of 4 registers with type specified with .type
Optionally, sink symbol ‘_’ can be used in vector expression d when:
.vec is .v8 and .type is .b32 or .s32 or .u32 or .f32 OR
.vec is .v4 and .type is .b64 or .s64 or .u64 or .f64
which indicates that data from corresponding memory location is not read.
The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to either of 64B, 128B, 256B thereby allowing the prefetch size to be 64 Bytes, 128 Bytes or 256 Bytes respectively.
The qualifier .level::prefetch_size may only be used with .global state space and with generic addressing where the address points to .global state space. If the generic address does not fall within the address window of the global memory, then the prefetching behavior is undefined.
The .level::prefetch_size qualifier is treated as a performance hint only.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
The qualifiers .unified and .level::cache_hint are only supported for .global state space and for generic addressing where the address points to the .global state space.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
¹ This synchronization is further extended to other threads through the transitive nature of causality order, as described in the memory consistency model.
Semantics
d = a;             // named variable a
d = *(&a+immOff)   // variable-plus-offset
d = *a;            // register
d = *(a+immOff);   // register-plus-offset
d = *(immAddr);    // immediate address
Notes
Destination d must be in the .reg state space.
A destination register wider than the specified type may be used. The value loaded is sign-extended to the destination register width for signed integers, and is zero-extended to the destination register width for unsigned and bit-size types. See Table 28 for a description of these relaxed type-checking rules.
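The widening rule above can be illustrated with a short Python sketch. The helper name extend_loaded_value and its parameters are hypothetical; the function only models how the loaded bits are widened, not the load itself.

```python
def extend_loaded_value(raw, type_bits, reg_bits, signed):
    """Widen a loaded value to the destination register width.

    Signed integer types are sign-extended; unsigned and bit-size
    types are zero-extended. The result is the register bit pattern.
    """
    raw &= (1 << type_bits) - 1
    if signed and raw & (1 << (type_bits - 1)):
        raw -= 1 << type_bits              # reinterpret as a negative value
    return raw & ((1 << reg_bits) - 1)     # two's-complement bits at register width


# An 8-bit value 0x80 loaded into a 32-bit register:
print(hex(extend_loaded_value(0x80, 8, 32, signed=True)))    # 0xffffff80
print(hex(extend_loaded_value(0x80, 8, 32, signed=False)))   # 0x80
```

Under this model, the same memory byte gives different register contents for a signed 8-bit load and for an unsigned or bit-size 8-bit load.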
.f16 data may be loaded using ld.b16, and then converted to .f32 or .f64 using cvt or can be used in half precision floating point instructions.
.f16x2 data may be loaded using ld.b32 and then used in half precision floating point instructions.
PTX ISA Notes
ld introduced in PTX ISA version 1.0. ld.volatile introduced in PTX ISA version 1.1.
Generic addressing and cache operations introduced in PTX ISA version 2.0.
Support for scope qualifier, .relaxed, .acquire, .weak qualifiers introduced in PTX ISA version 6.0.
Support for generic addressing of .const space added in PTX ISA version 3.1.
Support for .level1::eviction_priority, .level::prefetch_size and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.
Support for .cluster scope qualifier introduced in PTX ISA version 7.8.
Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.
Support for .unified qualifier introduced in PTX ISA version 8.0.
Support for .mmio qualifier introduced in PTX ISA version 8.2.
Support for ::entry and ::func sub-qualifiers on .param space introduced in PTX ISA version 8.3.
Support for .b128 type introduced in PTX ISA version 8.3.
Support for .sys scope with .b128 type introduced in PTX ISA version 8.4.
Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 introduced in PTX ISA version 8.8.
Target ISA Notes
ld.f64 requires sm_13 or higher.
Support for scope qualifier, .relaxed, .acquire, .weak qualifiers require sm_70 or higher.
Generic addressing requires sm_20 or higher.
Cache operations require sm_20 or higher.
Support for .level::eviction_priority qualifier requires sm_70 or higher.
Support for .level::prefetch_size qualifier requires sm_75 or higher.
Support for .L2::256B and .L2::cache_hint qualifiers requires sm_80 or higher.
Support for .cluster scope qualifier requires sm_90 or higher.
Sub-qualifier ::cta requires sm_30 or higher.
Sub-qualifier ::cluster requires sm_90 or higher.
Support for .unified qualifier requires sm_90 or higher.
Support for .mmio qualifier requires sm_70 or higher.
Support for .b128 type requires sm_70 or higher.
Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 require sm_100 or higher.
Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in non-coherent read-only cache.
Note
On some architectures, the texture cache is larger, has higher bandwidth, and longer latency than the global memory cache. For applications with sufficient parallelism to cover the longer latency, ld.global.nc should offer better performance than ld.global on such architectures.
The address operand a shall contain a global address. Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.
The .v8 (.vec) qualifier is supported if:
.type is .b32, .s32, .u32, or .f32 AND
State space is .global or with generic addressing where address points to .global state space
The .v4 (.vec) qualifier with type .b64 or .s64 or .u64 or .f64 is supported if:
State space is .global or with generic addressing where address points to .global state space
Qualifiers .level1::eviction_priority and .level2::eviction_priority specify the eviction policy for L1 and L2 cache respectively which may be applied during memory access.
Qualifier .level2::eviction_priority is supported if:
.vec is .v8 and .type is .b32 or .s32 or .u32 or .f32
AND Operand d is vector of 8 registers with type specified with .type
OR .vec is .v4 and .type is .b64 or .s64 or .u64 or .f64
AND Operand d is vector of 4 registers with type specified with .type
Optionally, sink symbol ‘_’ can be used in vector expression d when:
.vec is .v8 and .type is .b32 or .s32 or .u32 or .f32 OR
.vec is .v4 and .type is .b64 or .s64 or .u64 or .f64
which indicates that data from corresponding memory location is not read.
The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to either of 64B, 128B, 256B thereby allowing the prefetch size to be 64 Bytes, 128 Bytes or 256 Bytes respectively.
The .level::prefetch_size qualifier is treated as a performance hint only.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
Semantics
d = a;             // named variable a
d = *(&a+immOff)   // variable-plus-offset
d = *a;            // register
d = *(a+immOff);   // register-plus-offset
d = *(immAddr);    // immediate address
Notes
Destination d must be in the .reg state space.
A destination register wider than the specified type may be used. The value loaded is sign-extended to the destination register width for signed integers, and is zero-extended to the destination register width for unsigned and bit-size types.
.f16 data may be loaded using ld.b16, and then converted to .f32 or .f64 using cvt.
PTX ISA Notes
Introduced in PTX ISA version 3.1.
Support for .level::eviction_priority, .level::prefetch_size and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.
Support for .b128 type introduced in PTX ISA version 8.3.
Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 introduced in PTX ISA version 8.8.
Target ISA Notes
Requires sm_32 or higher.
Support for .level1::eviction_priority qualifier requires sm_70 or higher.
Support for .level::prefetch_size qualifier requires sm_75 or higher.
Support for .level::cache_hint qualifier requires sm_80 or higher.
Support for .b128 type requires sm_70 or higher.
Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 require sm_100 or higher.
Load read-only data into register variable d from the location specified by the source address operand a in the global state space, where the address is guaranteed to be the same across all threads in the warp. If no state space is given, perform the load using Generic Addressing.
Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.
Semantics
d = a;             // named variable a
d = *(&a+immOff)   // variable-plus-offset
d = *a;            // register
d = *(a+immOff);   // register-plus-offset
d = *(immAddr);    // immediate address
Notes
Destination d must be in the .reg state space.
A destination register wider than the specified type may be used. The value loaded is sign-extended to the destination register width for signed integers, and is zero-extended to the destination register width for unsigned and bit-size types. See Table 28 for a description of these relaxed type-checking rules.
.f16 data may be loaded using ldu.b16, and then converted to .f32 or .f64 using cvt or can be used in half precision floating point instructions.
.f16x2 data may be loaded using ldu.b32 and then used in half precision floating point instructions.
PTX ISA Notes
Introduced in PTX ISA version 2.0.
Support for .b128 type introduced in PTX ISA version 8.3.
Store the value of operand b in the location specified by the destination address operand a in specified state space. If no state space is given, perform the store using Generic Addressing. Stores to const memory are illegal.
If no sub-qualifier is specified with .shared state space, then ::cta is assumed by default.
Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.
If .param is specified without any sub-qualifiers then it defaults to .param::func.
The qualifiers .relaxed and .release indicate memory synchronization as described in the Memory Consistency Model. The .scope qualifier indicates the set of threads with which an st.relaxed or st.release instruction can directly synchronize¹. The .weak qualifier indicates a memory instruction with no synchronization. The effects of this instruction become visible to other threads only when synchronization is established by other means.
The semantic details of .mmio qualifier are described in the Memory Consistency Model. Only .sys thread scope is valid for st.mmio operation. The qualifiers .mmio and .relaxed must be specified together.
The .weak, .volatile, .relaxed and .release qualifiers are mutually exclusive. When none of these is specified, the .weak qualifier is assumed by default.
The qualifiers .volatile, .relaxed and .release may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space. Cache operations are not permitted with these qualifiers. The qualifier .mmio may be used only with .global space and with generic addressing, where the address points to .global space.
The .v8 (.vec) qualifier is supported if:
.type is .b32, .s32, .u32, or .f32 AND
State space is .global or with generic addressing where address points to .global state space
The .v4 (.vec) qualifier with type .b64 or .s64 or .u64 or .f64 is supported if:
State space is .global or with generic addressing where address points to .global state space
Qualifiers .level1::eviction_priority and .level2::eviction_priority specify the eviction policy for L1 and L2 cache respectively which may be applied during memory access.
Qualifier .level2::eviction_priority is supported if:
.vec is .v8 and .type is .b32 or .s32 or .u32 or .f32
AND Operand d is vector of 8 registers with type specified with .type
OR .vec is .v4 and .type is .b64 or .s64 or .u64 or .f64
AND Operand d is vector of 4 registers with type specified with .type
Optionally, sink symbol ‘_’ can be used in vector expression b when:
.vec is .v8 and .type is .b32 or .s32 or .u32 or .f32 OR
.vec is .v4 and .type is .b64 or .s64 or .u64 or .f64
which indicates that no data is being written at the corresponding destination address.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
The qualifier .level::cache_hint is only supported for .global state space and for generic addressing where the address points to the .global state space.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
¹ This synchronization is further extended to other threads through the transitive nature of causality order, as described in the memory consistency model.
A source register wider than the specified type may be used. The lower n bits corresponding to the instruction-type width are stored to memory. See Table 27 for a description of these relaxed type-checking rules.
.f16 data resulting from a cvt instruction may be stored using st.b16.
.f16x2 data may be stored using st.b32.
PTX ISA Notes
st introduced in PTX ISA version 1.0. st.volatile introduced in PTX ISA version 1.1.
Generic addressing and cache operations introduced in PTX ISA version 2.0.
Support for scope qualifier, .relaxed, .release, .weak qualifiers introduced in PTX ISA version 6.0.
Support for .level1::eviction_priority and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.
Support for .cluster scope qualifier introduced in PTX ISA version 7.8.
Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.
Support for .mmio qualifier introduced in PTX ISA version 8.2.
Support for ::func sub-qualifier on .param space introduced in PTX ISA version 8.3.
Support for .b128 type introduced in PTX ISA version 8.3.
Support for .sys scope with .b128 type introduced in PTX ISA version 8.4.
Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 introduced in PTX ISA version 8.8.
Target ISA Notes
st.f64 requires sm_13 or higher.
Support for scope qualifier, .relaxed, .release, .weak qualifiers require sm_70 or higher.
Generic addressing requires sm_20 or higher.
Cache operations require sm_20 or higher.
Support for .level1::eviction_priority qualifier requires sm_70 or higher.
Support for .level::cache_hint qualifier requires sm_80 or higher.
Support for .cluster scope qualifier requires sm_90 or higher.
Sub-qualifier ::cta requires sm_30 or higher.
Sub-qualifier ::cluster requires sm_90 or higher.
Support for .mmio qualifier requires sm_70 or higher.
Support for .b128 type requires sm_70 or higher.
Support for .level2::eviction_priority qualifier and .v8.b32/.v4.b64 require sm_100 or higher.
st.async is a non-blocking instruction which initiates an asynchronous store operation that stores the value specified by source operand b to the destination memory location specified by operand a.
Operands
a is a destination address, and must be either a register, or of the form register+immOff, as described in Addresses as Operands.
b is a source value, of the type indicated by qualifier .type.
.completion_mechanism specifies the mechanism for observing the completion of the asynchronous operation.
When .completion_mechanism is .mbarrier::complete_tx::bytes: upon completion of the asynchronous operation, a complete-tx operation will be performed on the mbarrier object specified by the operand mbar, with completeCount argument equal to the amount of data stored in bytes.
When .completion_mechanism is not specified: the completion of the store synchronizes with the end of the CTA.
.type specifies the type of the source operand b.
Conditions
When .sem is .weak:
This is a weak store to shared memory, which signals its completion through an mbarrier object.
The store operation is treated as a weak memory operation.
The complete-tx operation on the mbarrier has .release semantics at .cluster scope.
Requires:
The shared memory addresses of destination operand a and the mbarrier object mbar belong to the same CTA within the same cluster as the executing thread.
The number of CTAs within the cluster is strictly greater than one; %cluster_nctarank > 1 is true.
Otherwise, the behavior is undefined.
.mmio must not be specified.
If .ss is specified, it must be .shared::cluster.
If .ss is not specified, generic addressing is used for operands a and mbar. If the generic addresses specified do not fall within the address window of .shared::cluster state space, the behavior is undefined.
If .completion_mechanism is specified, it must be .mbarrier::complete_tx::bytes.
If .completion_mechanism is not specified, it defaults to .mbarrier::complete_tx::bytes.
When .sem is .release:
This is a release store to global memory.
The store operation is a strong memory operation with .release semantics at the scope specified by .scope.
If .mmio is specified, .scope must be .sys.
If .ss is specified, it must be .global.
If .ss is not specified, generic addressing is used for operand a. If the generic address specified does not fall within the address window of .global state space, the behavior is undefined.
.completion_mechanism must not be specified.
PTX ISA Notes
Introduced in PTX ISA version 8.1.
Support for .mmio qualifier, .release semantics, .global state space, and .scope qualifier introduced in PTX ISA version 8.7.
Target ISA Notes
Requires sm_90 or higher.
.mmio qualifier, .release semantics, .global state space, and .scope qualifier require sm_100 or higher.
Initializes a region of memory as specified by state space.
Syntax
st.bulk{.weak}{.shared::cta} [a], size, initval; // initval must be zero
Description
st.bulk instruction initializes a region of shared memory starting from the location specified by destination address operand a.
The 32-bit or 64-bit integer operand size specifies the amount of memory to be initialized in terms of number of bytes. size must be a multiple of 8. If the value is not a multiple of 8, then the behavior is undefined. The maximum value of size operand can be 16777216.
The integer immediate operand initval specifies the initialization value for the memory locations. The only numeric value allowed for operand initval is 0.
If no state space is specified then Generic Addressing is used. If the address specified by a does not fall within the address window of .shared state space then the behavior is undefined.
The optional qualifier .weak specifies the memory synchronizing effect of the st.bulk instruction as described in the Memory Consistency Model.
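The operand constraints stated above for size and initval are easy to check before emitting an st.bulk instruction. The following Python sketch is a hypothetical host-side helper (check_st_bulk_operands is not part of any NVIDIA tool) that encodes only the rules given in this description.

```python
MAX_BULK_SIZE = 16777216  # maximum value of the size operand stated above (2**24)


def check_st_bulk_operands(size, initval):
    """Return a list of violated st.bulk operand rules (empty if none)."""
    problems = []
    if size % 8 != 0:
        problems.append("size must be a multiple of 8")
    if size > MAX_BULK_SIZE:
        problems.append("size exceeds the maximum of 16777216")
    if initval != 0:
        problems.append("initval must be 0")
    return problems


print(check_st_bulk_operands(4096, 0))   # []
print(check_st_bulk_operands(100, 1))    # two violations reported
```

A generator that emits PTX text could call such a check and refuse to produce the instruction when the list is non-empty, since a size that is not a multiple of 8 leads to undefined behavior.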
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Support for size operand with 32-bit length is introduced in PTX ISA version 9.0.
Target ISA Notes
Requires sm_100 or higher.
Examples
st.bulk.weak.shared::cta [dst], n, 0;
st.bulk [gdst], 4096, 0;
The multimem.* operations operate on multimem addresses and access all of the multiple memory locations which the multimem address points to.
Multimem addresses can be accessed only by multimem.* operations. Accessing a multimem address with ld, st or any other memory operations results in undefined behavior.
Refer to CUDA programming guide for creation and management of the multimem addresses.
multimem.ld_reduce, multimem.st, multimem.red
Perform memory operations on the multimem address.
Instruction multimem.ld_reduce performs the following operations:
load operation on the multimem address a, which involves loading of data from all of the multiple memory locations pointed to by the multimem address a,
reduction operation specified by .op on the multiple data loaded from the multimem address a.
The result of the reduction operation is returned in register d.
Instruction multimem.st performs a store operation of the input operand b to all the memory locations pointed to by the multimem address a.
Instruction multimem.red performs a reduction operation on all the memory locations pointed to by the multimem address a, with operand b.
Instruction multimem.ld_reduce performs reduction on the values loaded from all the memory locations that the multimem address points to. In contrast, multimem.red performs reduction on all the memory locations that the multimem address points to.
Address operand a must be a multimem address. Otherwise, the behavior is undefined. Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.
If no state space is specified then Generic Addressing is used. If the address specified by a does not fall within the address window of .global state space then the behavior is undefined.
For floating-point type multimem operations, the size of the specified type along with .vec must equal either 32-bits or 64-bits or 128-bits. No other combinations of .vec and type are allowed. Type .f64 cannot be used with .vec qualifier. The following table describes the valid usage of .vec and base floating-point type:
For multimem.ld_reduce, the default precision of the intermediate accumulation is same as the specified type.
Optionally, .acc_prec qualifier can be specified to change the precision of intermediate accumulation as follows:
| .type | .acc::prec | Changes precision to |
|-------|------------|----------------------|
| .f16, .f16x2, .bf16, .bf16x2 | .acc::f32 | .f32 |
| .e5m2, .e4m3, .e5m2x2, .e4m3x2, .e4m3x4, .e5m2x4 | .acc::f16 | .f16 |
Optional qualifiers .ldsem, .stsem and .redsem specify the memory synchronizing effect of the multimem.ld_reduce, multimem.st and multimem.red respectively, as described in Memory Consistency Model. If explicit semantics qualifiers are not specified, then multimem.ld_reduce and multimem.st default to .weak and multimem.red defaults to .relaxed.
The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in Memory Consistency Model. If the .scope qualifier is not specified for multimem.red then .sys scope is assumed by default.
PTX ISA Notes
Introduced in PTX ISA version 8.1.
Support for .acc::f32 qualifier introduced in PTX ISA version 8.2.
Support for types .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4 introduced in PTX ISA version 8.6.
Support for .acc::f16 qualifier introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_90 or higher.
Types .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2, .e4m3x4 are supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
sm_121a
And are supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Qualifier .acc::f16 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
sm_121a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
Prefetch line containing a generic address at a specified level of memory hierarchy, in specified state space.
Syntax
prefetch{.space}.level                    [a];   // prefetch to data cache
prefetch.global.level::eviction_priority  [a];   // prefetch to data cache
prefetchu.L1                              [a];   // prefetch to uniform cache
prefetch{.tensormap_space}.tensormap      [a];   // prefetch the tensormap

.space =                    { .global, .local };
.level =                    { .L1, .L2 };
.level::eviction_priority = { .L2::evict_last, .L2::evict_normal };
.tensormap_space =          { .const, .param };
Description
The prefetch instruction brings the cache line containing the specified address in global or local memory state space into the specified cache level.
If the .tensormap qualifier is specified then the prefetch instruction brings the cache line containing the specified address in the .const or .param memory state space for subsequent use by the cp.async.bulk.tensor instruction.
The applypriority instruction applies the cache eviction priority specified by the .level::eviction_priority qualifier to the address range [a..a+size) in the specified cache level.
If no state space is specified then Generic Addressing is used. If the specified address does not fall within the address window of .global state space then the behavior is undefined.
The operand size is an integer constant that specifies the amount of data, in bytes, in the specified cache level on which the priority is to be applied. The only supported value for the size operand is 128.
Supported addressing modes for operand a are described in Addresses as Operands. a must be aligned to 128 bytes.
Semantically, this behaves like a weak write of an unstable indeterminate value: reads of memory locations with unstable indeterminate values may return different bit patterns each time until the memory is overwritten. This operation hints to the implementation that data in the specified cache .level can be destructively discarded without writing it back to memory.
The operand size is an integer constant that specifies the length in bytes of the address range [a, a+size) to write unstable indeterminate values into. The only supported value for the size operand is 128.
If no state space is specified then Generic Addressing is used. If the specified address does not fall within the address window of .global state space then the behavior is undefined.
Supported addressing modes for address operand a are described in Addresses as Operands. a must be aligned to 128 bytes.
PTX ISA Notes
Introduced in PTX ISA version 7.4.
Target ISA Notes
Requires sm_80 or higher.
Examples
discard.global.L2 [ptr], 128;
ld.weak.u32 r0, [ptr];
ld.weak.u32 r1, [ptr];
// The values in r0 and r1 may differ!
The createpolicy instruction creates a cache eviction policy for the specified cache level in an opaque 64-bit register specified by the destination operand cache-policy. The cache eviction policy specifies how cache eviction priorities are applied to global memory addresses used in memory operations with .level::cache_hint qualifier.
There are two types of cache eviction policies:
Range-based policy
The cache eviction policy created using createpolicy.range specifies the cache eviction behaviors for the following three address ranges:
[a .. a+(primary-size-1)] referred to as primary range.
[a+primary-size .. a+(total-size-1)] referred to as trailing secondary range.
[a-(total-size-primary-size) .. (a-1)] referred to as preceding secondary range.
When a range-based cache eviction policy is used in a memory operation with .level::cache_hint qualifier, the eviction priorities are applied as follows:
If the memory address falls in the primary range, the eviction priority specified by .L2::primary_priority is applied.
If the memory address falls in any of the secondary ranges, the eviction priority specified by .L2::secondary_priority is applied.
If the memory address does not fall in either of the above ranges, then the applied eviction priority is unspecified.
The 32-bit operand primary-size specifies the size, in bytes, of the primary range. The 32-bit operand total-size specifies the combined size, in bytes, of the address range including primary and secondary ranges. The value of primary-size must be less than or equal to the value of total-size. Maximum allowed value of total-size is 4GB.
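The three address ranges of a range-based policy can be illustrated with a small host-side model. This is only a sketch of the range arithmetic described above, written in Python; the function name classify_address and its return strings are hypothetical and are not part of PTX or of any CUDA API.

```python
def classify_address(addr, a, primary_size, total_size):
    """Model which range of a createpolicy.range policy an address falls in.

    addr, a, primary_size and total_size are plain integers (byte addresses
    and byte counts). The returned label names the range from the text above.
    """
    if primary_size > total_size:
        raise ValueError("primary-size must be <= total-size")
    secondary_size = total_size - primary_size
    # Primary range: [a .. a+(primary-size-1)]
    if a <= addr <= a + primary_size - 1:
        return "primary"
    # Trailing secondary range: [a+primary-size .. a+(total-size-1)]
    if a + primary_size <= addr <= a + total_size - 1:
        return "trailing secondary"
    # Preceding secondary range: [a-(total-size-primary-size) .. (a-1)]
    if a - secondary_size <= addr <= a - 1:
        return "preceding secondary"
    # Outside all three ranges the applied priority is unspecified.
    return "unspecified"
```

For example, with a = 0x1000, primary-size = 0x100 and total-size = 0x180, address 0x1100 is in the trailing secondary range and address 0x0F80 is in the preceding secondary range.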
If .L2::secondary_priority is not specified, then it defaults to .L2::evict_unchanged.
If no state space is specified then Generic Addressing is used. If the specified address does not fall within the address window of .global state space then the behavior is undefined.
Fraction-based policy
A memory operation with .level::cache_hint qualifier can use the fraction-based cache eviction policy to request the cache eviction priority specified by .L2::primary_priority to be applied to a fraction of cache accesses specified by the 32-bit floating point operand fraction. The remainder of the cache accesses get the eviction priority specified by .L2::secondary_priority. This implies that in a memory operation that uses a fraction-based cache policy, the memory access has a probability specified by the operand fraction of getting the cache eviction priority specified by .L2::primary_priority.
The valid range of values for the operand fraction is (0.0, .., 1.0]. If the operand fraction is not specified, it defaults to 1.0.
If .L2::secondary_priority is not specified, then it defaults to .L2::evict_unchanged.
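The probabilistic behavior of a fraction-based policy can be sketched as follows. This Python model only illustrates the stated probability; how the hardware actually selects accesses is not described here, and the function name pick_priority and the uniform draw u are assumptions made for illustration.

```python
def pick_priority(u, fraction=1.0,
                  primary="evict_last", secondary="evict_unchanged"):
    """Model the priority one access receives under a fractional policy.

    u is a draw from a uniform distribution on [0.0, 1.0), supplied by the
    caller so that the model stays deterministic. fraction defaults to 1.0
    and the secondary priority defaults to evict_unchanged, as in the text.
    """
    if not (0.0 < fraction <= 1.0):
        raise ValueError("fraction must be in the range (0.0, 1.0]")
    # With probability `fraction` the access gets the primary priority.
    return primary if u < fraction else secondary
```

With fraction = 0.5, a draw of 0.25 yields the primary priority and a draw of 0.75 yields the secondary priority; with the default fraction of 1.0 every access gets the primary priority.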
The access property created using the CUDA APIs can be converted into cache eviction policy by the instruction createpolicy.cvt. The source operand access-property is a 64-bit opaque register. Refer to CUDA programming guide for more details.
PTX ISA Notes
Introduced in PTX ISA version 7.4.
Target ISA Notes
Requires sm_80 or higher.
Examples
createpolicy.fractional.L2::evict_last.b64                      policy, 1.0;
createpolicy.fractional.L2::evict_last.L2::evict_unchanged.b64  policy, 0.5;
createpolicy.range.L2::evict_last.L2::evict_first.b64           policy, [ptr], 0x100000, 0x200000;

// access-prop is created by CUDA APIs.
createpolicy.cvt.L2.b64                                         policy, access-prop;
Query whether a generic address falls within a specified state space window.
Syntax
isspacep.space  p, a;    // result is .pred

.space = { .const, .global, .local, .shared{::cta, ::cluster}, .param{::entry} };
Description
Write predicate register p with 1 if generic address a falls within the specified state space window and with 0 otherwise. Destination p has type .pred; the source address operand must be of type .u32 or .u64.
isspacep.param{::entry} returns 1 if the generic address falls within the window of Kernel Function Parameters, otherwise returns 0. If .param is specified without any sub-qualifiers then it defaults to .param::entry.
isspacep.global returns 1 for Kernel Function Parameters as the .param window is contained within the .global window.
If no sub-qualifier is specified with .shared state space, then ::cta is assumed by default.
Note
isspacep.shared::cluster will return 1 for every shared memory address that is accessible to the threads in the cluster, whereas isspacep.shared::cta will return 1 only if the address is of a variable declared in the executing CTA.
PTX ISA Notes
Introduced in PTX ISA version 2.0.
isspacep.const introduced in PTX ISA version 3.1.
isspacep.param introduced in PTX ISA version 7.7.
Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.
Support for sub-qualifier ::entry on .param space introduced in PTX ISA version 8.3.
Convert address from .const, Kernel Function Parameters (.param), .global, .local, or .shared state space to generic, or vice-versa. Take the generic address of a variable declared in .const, Kernel Function Parameters (.param), .global, .local, or .shared state space.
Syntax
// convert const, global, local, or shared address to generic address
cvta.space.size  p, a;        // source address in register a
cvta.space.size  p, var;      // get generic address of var
cvta.space.size  p, var+imm;  // generic address of var+offset

// convert generic address to const, global, local, or shared address
cvta.to.space.size  p, a;

.space = { .const, .global, .local, .shared{::cta, ::cluster}, .param{::entry} };
.size  = { .u32, .u64 };
Description
Convert a const, Kernel Function Parameters (.param), global, local, or shared address to a generic address, or vice-versa. The source and destination addresses must be the same size. Use cvt.u32.u64 or cvt.u64.u32 to truncate or zero-extend addresses.
For variables declared in .const, Kernel Function Parameters (.param), .global, .local, or .shared state space, the generic address of the variable may be taken using cvta. The source is either a register or a variable defined in const, Kernel Function Parameters (.param), global, local, or shared memory with an optional offset.
When converting a generic address into a const, Kernel Function Parameters (.param), global, local, or shared address, the resulting address is undefined in cases where the generic address does not fall within the address window of the specified state space. A program may use isspacep to guard against such incorrect behavior.
For cvta with .shared state space, the address must belong to the space specified by the ::cta or ::cluster sub-qualifier, otherwise the behavior is undefined. If no sub-qualifier is specified with .shared state space, then ::cta is assumed by default.
If .param is specified without any sub-qualifiers then it defaults to .param::entry. For .param{::entry} state space, operand a must be a kernel parameter address, otherwise behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 2.0.
cvta.const and cvta.to.const introduced in PTX ISA version 3.1.
cvta.param and cvta.to.param introduced in PTX ISA version 7.7.
Note: The current implementation does not allow generic pointers to const space variables in programs that contain pointers to constant buffers passed as kernel parameters.
Support for ::cta and ::cluster sub-qualifiers introduced in PTX ISA version 7.8.
Support for sub-qualifier ::entry on .param space introduced in PTX ISA version 8.3.
Target ISA Notes
cvta requires sm_20 or higher.
cvta.param{::entry} and cvta.to.param{::entry} require sm_70 or higher.
For .f16x2 and .bf16x2 instruction type, two inputs a and b of .f32 type are converted into .f16 or .bf16 type and the converted values are packed in the destination register d, such that the value converted from input a is stored in the upper half of d and the value converted from input b is stored in the lower half of d.
When converting to .e4m3x2/.e5m2x2 data formats, the destination operand d has .b16 type. When converting two .f32 inputs to .e4m3x2/.e5m2x2, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the value converted from input a is stored in the upper 8 bits of d and the value converted from input b is stored in the lower 8 bits of d. When converting an .f16x2 input to .e4m3x2/.e5m2x2, each .f16 input from operand a is converted to the specified format. The converted values are packed in the destination operand d such that the value converted from the upper 16 bits of input a is stored in the upper 8 bits of d and the value converted from the lower 16 bits of input a is stored in the lower 8 bits of d.
When converting from .e4m3x2/.e5m2x2 to .f16x2, source operand a has .b16 type. Each 8-bit input value in operand a is converted to .f16 type. The converted values are packed in the destination operand d such that the value converted from the upper 8 bits of a is stored in the upper 16 bits of d and the value converted from the lower 8 bits of a is stored in the lower 16 bits of d.
When converting to .e2m1x2 data formats, the destination operand d has .b8 type. When converting two .f32 inputs to .e2m1x2, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the value converted from input a is stored in the upper 4 bits of d and the value converted from input b is stored in the lower 4 bits of d.
When converting from .e2m1x2 to .f16x2, source operand a has .b8 type. Each 4-bit input value in operand a is converted to .f16 type. The converted values are packed in the destination operand d such that the value converted from the upper 4 bits of a is stored in the upper 16 bits of d and the value converted from the lower 4 bits of a is stored in the lower 16 bits of d.
When converting to .e2m1x4 data format, the destination operand d has .b16 type. When converting four .f32 inputs to .e2m1x4, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the values converted from inputs a, b, e, f are stored in successive 4-bit fields starting from the upper bits of d.
When converting to .e2m3x2/.e3m2x2 data formats, the destination operand d has .b16 type. When converting two .f32 inputs to .e2m3x2/.e3m2x2, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the value converted from input a is stored in the upper 8 bits of d with 2 MSB bits padded with zeros and the value converted from input b is stored in the lower 8 bits of d with 2 MSB bits padded with zeros.
When converting from .e2m3x2/.e3m2x2 to .f16x2, source operand a has .b16 type. Each 8-bit input value with 2 MSB bits 0 in operand a is converted to .f16 type. The converted values are packed in the destination operand d such that the value converted from the upper 8 bits of a is stored in the upper 16 bits of d and the value converted from the lower 8 bits of a is stored in the lower 16 bits of d.
When converting to .e5m2x4/.e4m3x4/.e3m2x4/.e2m3x4 data format, the destination operand d has .b32 type. When converting four .f32 inputs to .e5m2x4/.e4m3x4/.e3m2x4/.e2m3x4, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the values converted from inputs a, b, e, f are stored in successive 8-bit fields starting from the upper bits of d. For .e3m2x4/.e2m3x4, each 8-bit output will have 2 MSB bits padded with zeros.
When converting to .ue8m0x2 data formats, the destination operand d has .b16 type. When converting two .f32 or two packed .bf16 inputs to .ue8m0x2, each input is converted to the specified format, and the converted values are packed in the destination operand d such that the value converted from input a is stored in the upper 8 bits of d and the value converted from input b is stored in the lower 8 bits of d.
When converting from .ue8m0x2 to .bf16x2, source operand a has .b16 type. Each 8-bit input value in operand a is converted to .bf16 type. The converted values are packed in the destination operand d such that the value converted from the upper 8 bits of a is stored in the upper 16 bits of d and the value converted from the lower 8 bits of a is stored in the lower 16 bits of d.
rbits is a .b32 type register operand used for providing random bits for .rs rounding mode.
When converting to .f16x2, two 16-bit values are provided from rbits, where the 13 LSBs of the upper 16 bits are used as random bits for operand a, with the 3 MSBs being 0, and the 13 LSBs of the lower 16 bits are used as random bits for operand b, with the 3 MSBs being 0.
When converting to .bf16x2, two 16-bit values are provided from rbits, where the upper 16 bits are used as random bits for operand a and the lower 16 bits are used as random bits for operand b.
When converting to .e4m3x4/.e5m2x4/.e2m3x4/.e3m2x4, two 16-bit values are provided from rbits, where the lower 16 bits are used for operands e, f and the upper 16 bits are used for operands a, b.
When converting to .e2m1x4, two 16-bit values are provided from rbits, where the lower 8 bits from both 16-bit halves of rbits are used for operands e, f and the upper 8 bits from both 16-bit halves of rbits are used for operands a, b.
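The split of rbits for the two-input destination types can be illustrated with a short Python model. It covers only the .f16x2 and .bf16x2 cases described above; the function names are hypothetical, and masking off the 3 MSBs in the .f16x2 case is one reading of the text, which says only that those bits are 0.

```python
def split_rbits_f16x2(rbits):
    """Return (random bits for a, random bits for b) for a .f16x2 destination.

    Each operand uses the 13 LSBs of its 16-bit half of the 32-bit rbits.
    """
    rbits &= 0xFFFFFFFF
    upper = (rbits >> 16) & 0xFFFF   # half associated with operand a
    lower = rbits & 0xFFFF           # half associated with operand b
    return upper & 0x1FFF, lower & 0x1FFF


def split_rbits_bf16x2(rbits):
    """Return (random bits for a, random bits for b) for a .bf16x2 destination.

    Each operand uses all 16 bits of its half of the 32-bit rbits.
    """
    rbits &= 0xFFFFFFFF
    return (rbits >> 16) & 0xFFFF, rbits & 0xFFFF
```

For rbits = 0x12345678, the .f16x2 split gives 0x1234 for a and 0x1678 for b, while the .bf16x2 split gives 0x1234 for a and 0x5678 for b.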
Rounding modifier is mandatory in all of the following cases:
float-to-float conversions, when destination type is smaller than source type
All float-to-int conversions
All int-to-float conversions
All conversions involving .f16x2, .e4m3x2, .e5m2x2, .bf16x2, .tf32, .e2m1x2, .e2m3x2, .e3m2x2, .e4m3x4, .e5m2x4, .e2m1x4, .e2m3x4, .e3m2x4 and .ue8m0x2 instruction types.
.satfinite modifier is only supported for conversions involving the following types:
.e4m3x2, .e5m2x2, .e2m1x2, .e2m3x2, .e3m2x2, .e4m3x4, .e5m2x4, .e2m1x4, .e2m3x4, .e3m2x4 destination types. The .satfinite modifier is mandatory for such conversions.
.f16, .bf16, .f16x2, .bf16x2, .tf32, .ue8m0x2 as destination types.
Semantics
if (/* inst type is .f16x2 or .bf16x2 */) {
    d[31:16] = convert(a);
    d[15:0]  = convert(b);
} else if (/* inst destination type is .e5m2x2 or .e4m3x2 or .ue8m0x2 */) {
    d[15:8] = convert(a);
    d[7:0]  = convert(b);
} else if (/* inst destination type is .e2m1x2 */) {
    d[7:4] = convert(a);
    d[3:0] = convert(b);
} else if (/* inst destination type is .e2m3x2 or .e3m2x2 */) {
    d[15:14] = 0;
    d[13:8]  = convert(a);
    d[7:6]   = 0;
    d[5:0]   = convert(b);
} else if (/* inst destination type is .e2m1x4 */) {
    d[15:12] = convert(a);
    d[11:8]  = convert(b);
    d[7:4]   = convert(e);
    d[3:0]   = convert(f);
} else if (/* inst destination type is .e4m3x4 or .e5m2x4 */) {
    d[31:24] = convert(a);
    d[23:16] = convert(b);
    d[15:8]  = convert(e);
    d[7:0]   = convert(f);
} else if (/* inst destination type is .e2m3x4 or .e3m2x4 */) {
    d[31:30] = 0;
    d[29:24] = convert(a);
    d[23:22] = 0;
    d[21:16] = convert(b);
    d[15:14] = 0;
    d[13:8]  = convert(e);
    d[7:6]   = 0;
    d[5:0]   = convert(f);
} else {
    d = convert(a);
}
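The bit layouts in the semantics above all follow one pattern: each converted value occupies the low bits of a fixed-width slot, and slots are filled from the most significant end of d. A minimal Python sketch of that pattern follows. It packs values that are assumed to be already converted to their destination bit patterns; the conversion itself is not modelled, and the function name pack_fields is hypothetical.

```python
def pack_fields(values, field_bits, slot_bits):
    """Pack already-converted bit patterns into one integer.

    values     -- bit patterns in operand order (a, b) or (a, b, e, f);
                  the first value lands in the most significant slot
    field_bits -- number of meaningful bits per value (e.g. 6 for .e2m3)
    slot_bits  -- width of each slot in d (e.g. 8); any bits of a slot
                  above field_bits are left as zero padding
    """
    word = 0
    for v in values:
        if v < 0 or v >> field_bits:
            raise ValueError("value does not fit in field_bits")
        word = (word << slot_bits) | v
    return word
```

For example, pack_fields([a, b], 16, 16) models the .f16x2 layout, pack_fields([a, b], 6, 8) models .e2m3x2 with its two zero MSBs per byte, and pack_fields([a, b, e, f], 4, 4) models .e2m1x4.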
// Random bits rbits semantics for .rs rounding:
Destination type .f16: Refer to Figure 38 for random bits layout details.
Integer rounding is required for float-to-integer conversions, and for same-size float-to-float conversions where the value is rounded to an integer. Integer rounding is illegal in all other instances.
Integer rounding modifiers:
.rni
round to nearest integer, choosing even integer if source is equidistant between two integers
.rzi
round to nearest integer in the direction of zero
.rmi
round to nearest integer in direction of negative infinity
.rpi
round to nearest integer in direction of positive infinity
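The four integer rounding modifiers correspond to familiar rounding functions. The following Python sketch shows the mapping on ordinary Python floats; it illustrates the rounding rules only and ignores range clamping, NaN handling and subnormal flushing. The function names simply mirror the modifier names.

```python
import math


def rni(x):
    """Round to nearest integer, ties to even (Python's round() does this)."""
    return round(x)


def rzi(x):
    """Round toward zero."""
    return math.trunc(x)


def rmi(x):
    """Round toward negative infinity."""
    return math.floor(x)


def rpi(x):
    """Round toward positive infinity."""
    return math.ceil(x)
```

For the input 2.5, rni gives 2 (the even neighbor), rzi gives 2, rmi gives 2 and rpi gives 3; for -2.5 the results are -2, -2, -3 and -2 respectively.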
In float-to-integer conversions, depending upon conversion types, NaN input results in the following value:
Zero if source is not .f64 and destination is not .s64, .u64.
Otherwise 1 << (BitWidth(dst) - 1), corresponding to the value of (MAXINT >> 1) + 1 for unsigned type or MININT for signed type.
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported.
For cvt.ftz.dtype.f32 float-to-integer conversions and cvt.ftz.f32.f32 float-to-float conversions with integer rounding, subnormal inputs are flushed to sign-preserving zero. Modifier .ftz can only be specified when either .dtype or .atype is .f32 and applies only to single precision (.f32) inputs and results.
sm_1x
For cvt.ftz.dtype.f32 float-to-integer conversions and cvt.ftz.f32.f32 float-to-float conversions with integer rounding, subnormal inputs are flushed to sign-preserving zero. The optional .ftz modifier may be specified in these cases for clarity.
Note: In PTX ISA versions 1.4 and earlier, the cvt instruction did not flush single-precision subnormal inputs or results to zero if the destination type size was 64-bits. The compiler will preserve this behavior for legacy PTX code.
Saturation modifier:
.sat
For integer destination types, .sat limits the result to MININT..MAXINT for the size of the operation. Note that saturation applies to both signed and unsigned integer types.
The saturation modifier is allowed only in cases where the destination type’s value range is not a superset of the source type’s value range; i.e., the .sat modifier is illegal in cases where saturation is not possible based on the source and destination types.
For float-to-integer conversions, the result is clamped to the destination range by default; i.e., .sat is redundant.
Floating Point Notes
Floating-point rounding is required for float-to-float conversions that result in loss of precision, and for integer-to-float conversions. Floating-point rounding is illegal in all other instances.
Floating-point rounding modifiers:
.rn
rounding to nearest, with ties to even
.rna
rounding to nearest, with ties away from zero
.rz
rounding toward zero
.rm
rounding toward negative infinity
.rp
rounding toward positive infinity
.rs
Stochastic rounding is achieved through the use of the supplied random bits. The operation’s result is rounded toward zero or away from zero based on the carry out of the integer addition of the supplied random bits (rbits) to the truncated (discarded) bits of the mantissa from the input.
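The carry-based rule for .rs can be illustrated on a bare integer mantissa. The Python sketch below is a simplified model under stated assumptions: it works on the unsigned magnitude only, assumes the random bits are aligned with the discarded bits, and ignores exponent adjustment when the rounded mantissa overflows. The function name stochastic_round_magnitude is hypothetical.

```python
def stochastic_round_magnitude(mantissa, discard_bits, rbits):
    """Drop the low discard_bits of mantissa using stochastic rounding.

    The discarded bits are added to the same number of random bits. A carry
    out of that addition rounds the magnitude up (away from zero); no carry
    leaves the truncated magnitude (toward zero).
    """
    mask = (1 << discard_bits) - 1
    kept = mantissa >> discard_bits        # bits that survive the conversion
    discarded = mantissa & mask            # bits that are truncated off
    carry = ((discarded + (rbits & mask)) >> discard_bits) & 1
    return kept + carry
```

With 4 discarded bits holding the value 6, the addition carries for random values 10 through 15, so 6 of the 16 possible random values round away from zero. The probability of rounding up therefore matches the discarded fraction 6/16, which is the property that makes the rounding unbiased on average.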
A floating-point value may be rounded to an integral value using the integer rounding modifiers (see Integer Notes). The operands must be of the same size. The result is an integral value, stored in floating-point format.
Subnormal numbers:
sm_20+
By default, subnormal numbers are supported. Modifier .ftz may be specified to flush single-precision subnormal inputs and results to sign-preserving zero. Modifier .ftz can only be specified when either .dtype or .atype is .f32 and applies only to single precision (.f32) inputs and results.
sm_1x
Single-precision subnormal inputs and results are flushed to sign-preserving zero. The optional .ftz modifier may be specified in these cases for clarity.
Note: In PTX ISA versions 1.4 and earlier, the cvt instruction did not flush single-precision subnormal inputs or results to zero if either source or destination type was .f64. The compiler will preserve this behavior for legacy PTX code. Specifically, if the PTX ISA version is 1.4 or earlier, single-precision subnormal inputs and results are flushed to sign-preserving zero only for cvt.f32.f16, cvt.f16.f32, and cvt.f32.f32 instructions.
Saturation modifier:
.sat:
For floating-point destination types, .sat limits the result to the range [0.0, 1.0]. NaN results are flushed to positive zero. Applies to .f16, .f32, and .f64 types.
.relu:
For .f16, .f16x2, .bf16, .bf16x2, .e4m3x2, .e5m2x2, .e2m1x2, .e2m3x2, .e3m2x2, .e4m3x4, .e5m2x4, .e2m1x4, .e2m3x4, .e3m2x4 and .tf32 destination types, .relu clamps the result to 0 if negative. NaN results are converted to canonical NaN.
.satfinite:
For .f16, .f16x2, .bf16, .bf16x2, .e4m3x2, .e5m2x2, .ue8m0x2, .e4m3x4, .e5m2x4 and .tf32 destination formats, if the input value is NaN, then the result is NaN in the specified destination format. For .e2m1x2, .e2m3x2, .e3m2x2, .e2m1x4, .e2m3x4, .e3m2x4 destination formats, NaN results are converted to positive MAX_NORM. If the absolute value of the input (ignoring sign) is greater than MAX_NORM of the specified destination format, then the result is sign-preserved MAX_NORM of the destination format, or positive MAX_NORM in .ue8m0x2, for which the destination sign is not supported.
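The three floating-point saturation modes can be compared side by side with a small Python model. This is a sketch on Python floats under simplifying assumptions: the rounding of the conversion itself is not modelled, MAX_NORM is passed in by the caller (for example 65504.0 for .f16), and the unsigned .ue8m0x2 case is left out. The function names and the nan_to_max flag are hypothetical.

```python
import math


def sat(x):
    """.sat: clamp to [0.0, 1.0]; NaN is flushed to positive zero."""
    if math.isnan(x):
        return 0.0
    return min(max(x, 0.0), 1.0)


def relu(x):
    """.relu: clamp negative results to 0; NaN stays NaN."""
    if math.isnan(x):
        return math.nan
    return 0.0 if x < 0 else x


def satfinite(x, max_norm, nan_to_max=False):
    """.satfinite: clamp the magnitude to max_norm, preserving the sign.

    nan_to_max selects the behavior of formats whose NaN results are
    converted to positive MAX_NORM; otherwise NaN is returned.
    """
    if math.isnan(x):
        return max_norm if nan_to_max else math.nan
    if abs(x) > max_norm:
        return math.copysign(max_norm, x)
    return x
```

Note the difference in range: .sat maps every value into [0.0, 1.0], .relu only removes negative values, and .satfinite only replaces out-of-range magnitudes (including infinities) with the largest finite value.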
Notes
A source register wider than the specified type may be used, except when the source operand has .bf16 or .bf16x2 format. The lower n bits corresponding to the instruction-type width are used in the conversion. See Operand Size Exceeding Instruction-Type Size for a description of these relaxed type-checking rules.
A destination register wider than the specified type may be used, except when the destination operand has .bf16, .bf16x2 or .tf32 format. The result of conversion is sign-extended to the destination register width for signed integers, and is zero-extended to the destination register width for unsigned, bit-size, and floating-point types. See Operand Size Exceeding Instruction-Type Size for a description of these relaxed type-checking rules.
For cvt.f32.bf16, NaN input yields unspecified NaN.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
.relu modifier and {.f16x2, .bf16, .bf16x2, .tf32} destination formats introduced in PTX ISA version 7.0.
cvt.f32.bf16 introduced in PTX ISA version 7.1.
cvt.bf16.{u8/s8/u16/s16/u32/s32/u64/s64/f16/f64/bf16}, cvt.{u8/s8/u16/s16/u32/s32/u64/s64/f16/f64}.bf16, and cvt.tf32.f32.{relu}.{rn/rz} introduced in PTX ISA version 7.8.
.ftz qualifier for cvt.f32.bf16 introduced in PTX ISA version 7.8.
cvt with .e4m3x2/.e5m2x2 for sm_90 or higher introduced in PTX ISA version 7.8.
cvt.satfinite.{e4m3x2,e5m2x2}.{f32,f16x2} for sm_90 or higher introduced in PTX ISA version 7.8.
cvt with .e4m3x2/.e5m2x2 for sm_89 introduced in PTX ISA version 8.1.
cvt.satfinite.{e4m3x2,e5m2x2}.{f32,f16x2} for sm_89 introduced in PTX ISA version 8.1.
cvt.satfinite.{f16,bf16,f16x2,bf16x2,tf32}.f32 introduced in PTX ISA version 8.1.
cvt.{rn/rz}.satfinite.tf32.f32 introduced in PTX ISA version 8.6.
cvt.rn.satfinite{.relu}.{e2m1x2/e2m3x2/e3m2x2/ue8m0x2}.f32 introduced in PTX ISA version 8.6.
cvt.rn{.relu}.f16x2.{e2m1x2/e2m3x2/e3m2x2} introduced in PTX ISA version 8.6.
cvt.{rp/rz}{.satfinite}{.relu}.ue8m0x2.bf16x2 introduced in PTX ISA version 8.6.
cvt.{rz/rp}.satfinite.ue8m0x2.f32 introduced in PTX ISA version 8.6.
cvt.rn.bf16x2.ue8m0x2 introduced in PTX ISA version 8.6.
.rs rounding mode introduced in PTX ISA version 8.7.
cvt.rs{.e2m1x4/.e4m3x4/.e5m2x4/.e3m2x4/.e2m3x4}.f32 introduced in PTX ISA version 8.7.
Target ISA Notes
cvt to or from .f64 requires sm_13 or higher.
.relu modifier and {.f16x2, .bf16, .bf16x2, .tf32} destination formats require sm_80 or higher.
cvt.f32.bf16 requires sm_80 or higher.
cvt.bf16.{u8/s8/u16/s16/u32/s32/u64/s64/f16/f64/bf16}, cvt.{u8/s8/u16/s16/u32/s32/u64/s64/f16/f64}.bf16, and cvt.tf32.f32.{relu}.{rn/rz} require sm_90 or higher.
.ftz qualifier for cvt.f32.bf16 requires sm_90 or higher.
cvt with .e4m3x2/.e5m2x2 requires sm_89 or higher.
cvt.satfinite.{e4m3x2,e5m2x2}.{f32,f16x2} requires sm_89 or higher.
cvt.{rn/rz}.satfinite.tf32.f32 requires sm_100 or higher.
cvt.rn.satfinite{.relu}.{e2m1x2/e2m3x2/e3m2x2/ue8m0x2}.f32 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
cvt.rn{.relu}.f16x2.{e2m1x2/e2m3x2/e3m2x2} is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
cvt.{rz/rp}{.satfinite}{.relu}.ue8m0x2.bf16x2 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
cvt.{rz/rp}.satfinite.ue8m0x2.f32 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
cvt.rn.bf16x2.ue8m0x2 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
.rs rounding mode is supported on the following architectures:
sm_100a
sm_103a
cvt.rs{.e2m1x4/.e4m3x4/.e5m2x4/.e3m2x4/.e2m3x4}.f32 is supported on the following architectures:
sm_100a
sm_103a
Examples
cvt.f32.s32 f,i;
cvt.s32.f64 j,r;     // float-to-int saturates by default
cvt.rni.f32.f32 x,y; // round to nearest int, result is fp
cvt.f32.f32 x,y;     // note .ftz behavior for sm_1x targets
cvt.rn.relu.f16.f32      b, f;        // result is saturated with .relu saturation mode
cvt.rz.f16x2.f32         b1, f, f1;   // convert two fp32 values to packed fp16 outputs
cvt.rn.relu.satfinite.f16x2.f32 b1, f, f1;   // convert two fp32 values to packed fp16 outputs with .relu saturation on each output
cvt.rn.bf16.f32          b, f;        // convert fp32 to bf16
cvt.rz.relu.satfinite.bf16.f32 b, f;  // convert fp32 to bf16 with .relu and .satfinite saturation
cvt.rz.satfinite.bf16x2.f32 b1, f, f1;   // convert two fp32 values to packed bf16 outputs
cvt.rn.relu.bf16x2.f32   b1, f, f1;   // convert two fp32 values to packed bf16 outputs with .relu saturation on each output
cvt.rna.satfinite.tf32.f32 b1, f;     // convert fp32 to tf32 format
cvt.rn.relu.tf32.f32     d, a;        // convert fp32 to tf32 format
cvt.f64.bf16.rp          f, b;        // convert bf16 to f64 format
cvt.bf16.f16.rz          b, f;        // convert f16 to bf16 format
cvt.bf16.u64.rz          b, u;        // convert u64 to bf16 format
cvt.s8.bf16.rpi          s, b;        // convert bf16 to s8 format
cvt.bf16.bf16.rpi        b1, b2;      // convert bf16 to corresponding int represented in bf16 format
cvt.rn.satfinite.e4m3x2.f32 d, a, b;  // convert a, b to .e4m3 and pack as .e4m3x2 output
cvt.rn.relu.satfinite.e5m2x2.f16x2 d, a; // unpack a and convert the values to .e5m2 outputs with .relu
                                         // saturation on each output and pack as .e5m2x2
cvt.rn.f16x2.e4m3x2 d, a;             // unpack a, convert two .e4m3 values to packed f16x2 output
cvt.rn.satfinite.tf32.f32 d, a;       // convert fp32 to tf32 format
cvt.rn.relu.f16x2.e2m1x2 d, a;        // unpack a, convert two .e2m1 values to packed f16x2 output
cvt.rn.satfinite.e2m3x2.f32 d, a, b;  // convert a, b to .e2m3 and pack as .e2m3x2 output
cvt.rn.relu.f16x2.e3m2x2 d, a;        // unpack a, convert two .e3m2 values to packed f16x2 output
cvt.rs.f16x2.f32    d, a, b, rbits;   // convert 2 fp32 values to packed fp16 with applying .rs rounding
cvt.rs.satfinite.e2m1x4.f32  d, {a, b, e, f}, rbits; // convert 4 fp32 values to packed 4 e2m1 values with applying .rs rounding
Convert two 32-bit integers a and b into specified type and pack the results into d.
Destination d is an unsigned 32-bit integer. Source operands a and b are integers of type .abType and the source operand c is an integer of type .cType.
The inputs a and b are converted to values of type specified by .convertType with saturation and the results after conversion are packed into lower bits of d.
If operand c is specified then remaining bits of d are copied from lower bits of c.
Semantics
ta = a < MIN(convertType) ? MIN(convertType) : a;
ta = ta > MAX(convertType) ? MAX(convertType) : ta;
tb = b < MIN(convertType) ? MIN(convertType) : b;
tb = tb > MAX(convertType) ? MAX(convertType) : tb;

size = sizeInBits(convertType);
td = tb;
for (i = size; i <= 2 * size - 1; i++) {
    td[i] = ta[i - size];
}

if (isU16(convertType) || isS16(convertType)) {
    d = td;
} else {
    for (i = 0; i < 2 * size; i++) {
        d[i] = td[i];
    }
    for (i = 2 * size; i <= 31; i++) {
        d[i] = c[i - 2 * size];
    }
}
.sat modifier limits the converted values to MIN(convertType)..MAX(convertType) (no overflow) if the corresponding inputs are not in the range of the datatype specified as .convertType.
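The saturate-and-pack behavior can be checked with a short Python model of the semantics above. It is a sketch under assumptions: the table of type ranges (RANGES) lists the 2-, 4-, 8- and 16-bit signed and unsigned types as an illustration and is not taken from this section's syntax, inputs are plain Python integers, and the function name cvt_pack_sat is hypothetical.

```python
# Assumed value ranges for the conversion types used in this illustration.
RANGES = {
    "u2": (0, 3),      "s2": (-2, 1),
    "u4": (0, 15),     "s4": (-8, 7),
    "u8": (0, 255),    "s8": (-128, 127),
    "u16": (0, 65535), "s16": (-32768, 32767),
}


def cvt_pack_sat(a, b, convert_type, c=0):
    """Saturate a and b to convert_type and pack them into a 32-bit word.

    The converted a sits directly above the converted b in the low bits.
    For types narrower than 16 bits, the remaining upper bits are copied
    from the low bits of c.
    """
    lo, hi = RANGES[convert_type]
    size = int(convert_type[1:])            # width in bits of one element
    mask = (1 << size) - 1
    ta = min(max(a, lo), hi) & mask         # saturate, keep two's complement bits
    tb = min(max(b, lo), hi) & mask
    td = (ta << size) | tb
    if size == 16:
        return td                           # a and b fill all 32 bits
    remaining = 32 - 2 * size
    upper = c & ((1 << remaining) - 1)      # low bits of c fill the rest
    return (upper << (2 * size)) | td
```

For example, packing a = 300 and b = -1 as u8 with c = 0xABCD saturates the inputs to 255 and 0 and produces 0xABCDFF00.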
PTX ISA Notes
Introduced in PTX ISA version 6.5.
Target ISA Notes
Requires sm_72 or higher.
Sub byte types (.u4/.s4 and .u2/.s2) require sm_75 or higher.
Map the address of the shared variable in the target CTA.
Syntax
mapa{.space}.type          d, a, b;

// Maps shared memory address in register a into CTA b.
mapa.shared::cluster.type  d, a, b;

// Maps shared memory variable into CTA b.
mapa.shared::cluster.type  d, sh, b;

// Maps shared memory variable into CTA b.
mapa.shared::cluster.type  d, sh + imm, b;

// Maps generic address in register a into CTA b.
mapa.type                  d, a, b;

.space = { .shared::cluster }
.type  = { .u32, .u64 }
Description
Get the address in the CTA specified by operand b which corresponds to the address specified by operand a.
Instruction type .type indicates the type of the destination operand d and the source operand a.
When space is .shared::cluster, source a is either a shared memory variable or a register containing a valid shared memory address, and register d contains a shared memory address. When the optional qualifier .space is not specified, both a and d are registers containing generic addresses pointing to shared memory.
b is a 32-bit integer operand representing the rank of the target CTA.
Destination register d will hold an address in CTA b corresponding to operand a.
getctarank{.space}.type          d, a;

// Get cta rank from source shared memory address in register a.
getctarank.shared::cluster.type  d, a;

// Get cta rank from shared memory variable.
getctarank.shared::cluster.type  d, var;

// Get cta rank from shared memory variable+offset.
getctarank.shared::cluster.type  d, var + imm;

// Get cta rank from generic address of shared memory variable in register a.
getctarank.type                  d, a;

.space = { .shared::cluster }
.type  = { .u32, .u64 }
Description
Write the destination register d with the rank of the CTA which contains the address specified in operand a.
Instruction type .type indicates the type of source operand a.
When space is .shared::cluster, source a is either a shared memory variable or a register containing a valid shared memory address. When the optional qualifier .space is not specified, a is a register containing a generic address pointing to shared memory. Destination d is always a 32-bit register which holds the rank of the CTA.
PTX ISA Notes
Introduced in PTX ISA version 7.8.
Target ISA Notes
Requires sm_90 or higher.
Examples
getctarank.shared::cluster.u32 d1, addr;
getctarank.shared::cluster.u64 d2, sh + 4;
getctarank.u64                 d3, src;
An asynchronous copy operation performs the underlying operation asynchronously in the background, thus allowing the issuing threads to perform subsequent tasks.
An asynchronous copy operation can be a bulk operation that operates on a large amount of data, or a non-bulk operation that operates on smaller sized data. The amount of data handled by a bulk asynchronous operation must be a multiple of 16 bytes.
An asynchronous copy operation typically includes the following sequence:
Optionally, reading from the tensormap.
Reading data from the source location(s).
Writing data to the destination location(s).
Writes being made visible to the executing thread or other threads.
A thread must explicitly wait for the completion of an asynchronous copy operation in order to access the result of the operation. Once an asynchronous copy operation is initiated, modifying the source memory location or tensor descriptor, or reading from the destination memory location, before the asynchronous operation completes results in undefined behavior.
This section describes two asynchronous copy operation completion mechanisms supported in PTX: the async-group mechanism and the mbarrier-based mechanism.
Asynchronous operations may be tracked by either of the completion mechanisms or by both mechanisms. The tracking mechanism is instruction/instruction-variant specific.
When using the async-group completion mechanism, the issuing thread specifies a group of asynchronous operations, called an async-group, using a commit operation and tracks the completion of this group using a wait operation. The thread issuing the asynchronous operation must create separate async-groups for bulk and non-bulk asynchronous operations.
A commit operation creates a per-thread async-group containing all prior asynchronous operations tracked by async-group completion and initiated by the executing thread, but none of the asynchronous operations following the commit operation. A committed asynchronous operation belongs to a single async-group.
When an async-group completes, all the asynchronous operations belonging to that group are complete and the executing thread that initiated the asynchronous operations can read the result of the asynchronous operations. All async-groups committed by an executing thread always complete in the order in which they were committed. There is no ordering between asynchronous operations within an async-group.
A typical pattern of using async-group as the completion mechanism is as follows:
Initiate the asynchronous operations.
Group the asynchronous operations into an async-group using a commit operation.
Wait for the completion of the async-group using the wait operation.
Once the async-group completes, access the results of all asynchronous operations in that async-group.
A thread can track the completion of one or more asynchronous operations using the current phase of an mbarrier object. When the current phase of the mbarrier object is complete, it implies that all asynchronous operations tracked by this phase are complete, and all threads participating in that mbarrier object can access the result of the asynchronous operations.
The mbarrier object to be used for tracking the completion of an asynchronous operation can be either specified along with the asynchronous operation as part of its syntax, or as a separate operation. For a bulk asynchronous operation, the mbarrier object must be specified in the asynchronous operation, whereas for non-bulk operations, it can be specified after the asynchronous operation.
A typical pattern of using the mbarrier-based completion mechanism is as follows:
Initiate the asynchronous operations.
Set up an mbarrier object to track the asynchronous operations in its current phase, either as part of the asynchronous operation or as a separate operation.
Wait for the mbarrier object to complete its current phase using mbarrier.test_wait or mbarrier.try_wait.
Once the mbarrier.test_wait or mbarrier.try_wait operation returns True, access the results of the asynchronous operations tracked by the mbarrier object.
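The mbarrier-based pattern above can be illustrated with a toy Python model. It is a sketch under strong simplifying assumptions, not PTX: the class MbarrierModel and its method names are hypothetical, only the byte (transaction) count is tracked, and the thread-arrival counting that a real mbarrier object also performs is left out.

```python
class MbarrierModel:
    """Toy model of one mbarrier object tracking asynchronous copies by byte count."""

    def __init__(self):
        self.phase = 0        # index of the current (incomplete) phase
        self.pending_tx = 0   # bytes still expected in the current phase

    def expect_tx(self, nbytes):
        # A thread announces how many bytes the tracked operations will deliver.
        self.pending_tx += nbytes

    def complete_tx(self, nbytes):
        # Performed on behalf of an asynchronous operation when its data has landed.
        self.pending_tx -= nbytes
        if self.pending_tx == 0:
            self.phase += 1   # the current phase completes, the next one begins

    def test_wait(self, phase):
        # Non-blocking check: has the given phase completed?
        return self.phase > phase


bar = MbarrierModel()
phase = bar.phase
bar.expect_tx(32)               # set up tracking for two 16-byte copies
print(bar.test_wait(phase))     # False: nothing has completed yet
bar.complete_tx(16)
bar.complete_tx(16)
print(bar.test_wait(phase))     # True: results may now be read
```

The waiting thread only touches the destination data after test_wait reports True for the phase it captured before the copies were issued.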
The cp{.reduce}.async.bulk operations are performed in the asynchronous proxy (or async proxy).
Accessing the same memory location across multiple proxies needs a cross-proxy fence. For the async proxy, fence.proxy.async should be used to synchronize memory between the generic proxy and the async proxy.
The completion of a cp{.reduce}.async.bulk operation is followed by an implicit generic-async proxy fence. So the result of the asynchronous operation is made visible to the generic proxy as soon as its completion is observed. Either the async-group or the mbarrier-based completion mechanism must be used to wait for the completion of the cp{.reduce}.async.bulk instructions.
cp.async is a non-blocking instruction which initiates an asynchronous copy operation of data from the location specified by source address operand src to the location specified by destination address operand dst. Operand src specifies a location in the global state space and dst specifies a location in the shared state space.
Operand cp-size is an integer constant which specifies the size of data in bytes to be copied to the destination dst. cp-size can only be 4, 8, or 16.
Instruction cp.async allows optionally specifying a 32-bit integer operand src-size. Operand src-size represents the size of the data in bytes to be copied from src to dst and must be less than cp-size. In such case, the remaining bytes in destination dst are filled with zeros. Specifying src-size larger than cp-size results in undefined behavior.
The optional and non-immediate predicate argument ignore-src specifies whether the data from the source location src should be ignored completely. If the source data is ignored then zeros will be copied to destination dst. If the argument ignore-src is not specified then it defaults to False.
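The interaction of cp-size, src-size and ignore-src can be summarized with a short Python model of what ends up in the destination. The function cp_async_model is a hypothetical name used for illustration only; it raises an exception where PTX leaves the behavior undefined, and it treats a src-size equal to cp-size as a plain full copy, which is an assumption.

```python
def cp_async_model(src, cp_size, src_size=None, ignore_src=False):
    """Return the cp_size bytes written to dst (illustrative model, not an API)."""
    if cp_size not in (4, 8, 16):
        raise ValueError("cp-size can only be 4, 8, or 16")
    if src_size is None:
        src_size = cp_size                 # no src-size operand: copy everything
    if src_size > cp_size:
        raise ValueError("src-size larger than cp-size is undefined behavior")
    if ignore_src:
        return bytes(cp_size)              # source ignored: destination is zero-filled
    # Copy src_size bytes, then pad the rest of the destination with zeros.
    return bytes(src[:src_size]) + bytes(cp_size - src_size)


data = bytes(range(1, 17))                 # 16 source bytes: 01 02 ... 10
print(cp_async_model(data, 16, src_size=4).hex())
print(cp_async_model(data, 8, ignore_src=True).hex())
```

The first call keeps the first four source bytes and zero-fills the remaining twelve; the second writes eight zero bytes regardless of the source contents.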
Supported alignment requirements and addressing modes for operands src and dst are described in Addresses as Operands.
The mandatory .async qualifier indicates that the cp instruction will initiate the memory copy operation asynchronously and control will return to the executing thread before the copy operation is complete. The executing thread can then use the async-group based completion mechanism or the mbarrier based completion mechanism to wait for completion of the asynchronous copy operation. No other synchronization mechanism guarantees the completion of the asynchronous copy operations.
There is no ordering guarantee between two cp.async operations if they are not explicitly synchronized using cp.async.wait_all or cp.async.wait_group or mbarrier instructions.
As described in Cache Operators, the .cg qualifier indicates caching of data only at global level cache L2 and not at L1, whereas the .ca qualifier indicates caching of data at all levels including L1 cache. Cache operators are treated as performance hints only.
The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to either of 64B, 128B, 256B, thereby allowing the prefetch size to be 64 Bytes, 128 Bytes or 256 Bytes respectively.
The qualifier .level::prefetch_size may only be used with the .global state space and with generic addressing where the address points to the .global state space. If the generic address does not fall within the address window of the global memory, then the prefetching behavior is undefined.
The .level::prefetch_size qualifier is treated as a performance hint only.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
The qualifier .level::cache_hint is only supported for the .global state space and for generic addressing where the address points to the .global state space.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
PTX ISA Notes
Introduced in PTX ISA version 7.0.
Support for .level::cache_hint and .level::prefetch_size qualifiers introduced in PTX ISA version 7.4.
Support for ignore-src operand introduced in PTX ISA version 7.5.
Support for sub-qualifier ::cta introduced in PTX ISA version 7.8.
Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group.
Syntax
cp.async.commit_group ;
Description
cp.async.commit_group instruction creates a new cp.async-group per thread and batches all prior cp.async instructions initiated by the executing thread but not committed to any cp.async-group into the new cp.async-group. If there are no uncommitted cp.async instructions then cp.async.commit_group results in an empty cp.async-group.
An executing thread can wait for the completion of all cp.async operations in a cp.async-group using cp.async.wait_group.
There is no memory ordering guarantee provided between any two cp.async operations within the same cp.async-group. So two or more cp.async operations within a cp.async-group copying data to the same location result in undefined behavior.
PTX ISA Notes
Introduced in PTX ISA version 7.0.
Target ISA Notes
Requires sm_80 or higher.
Examples
// Example 1:
cp.async.ca.shared.global [shrd], [gbl], 4;
cp.async.commit_group ; // Marks the end of a cp.async group

// Example 2:
cp.async.ca.shared.global [shrd1],   [gbl1],   8;
cp.async.ca.shared.global [shrd1+8], [gbl1+8], 8;
cp.async.commit_group ; // Marks the end of cp.async group 1

cp.async.ca.shared.global [shrd2],    [gbl2],    16;
cp.async.cg.shared.global [shrd2+16], [gbl2+16], 16;
cp.async.commit_group ; // Marks the end of cp.async group 2
Wait for completion of prior asynchronous copy operations.
Syntax
cp.async.wait_group N;
cp.async.wait_all ;
Description
cp.async.wait_group instruction will cause the executing thread to wait until only N or fewer of the most recent cp.async-groups are pending and all the prior cp.async-groups committed by the executing thread are complete. For example, when N is 0, the executing thread waits on all the prior cp.async-groups to complete. Operand N is an integer constant.
cp.async.wait_all is equivalent to:
cp.async.commit_group;
cp.async.wait_group 0;
An empty cp.async-group is considered to be trivially complete.
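The bookkeeping behind commit_group, wait_group N and wait_all can be sketched as a per-thread queue of groups. The Python class below is a hypothetical model for illustration: AsyncGroups and its method names are not part of PTX, and waiting is modelled simply as retiring the oldest groups.

```python
class AsyncGroups:
    """Per-thread model of cp.async-group bookkeeping (illustrative only)."""

    def __init__(self):
        self.uncommitted = []   # operations issued since the last commit
        self.pending = []       # committed groups, oldest first

    def issue(self, op):
        self.uncommitted.append(op)

    def commit_group(self):
        # Batches all uncommitted operations; the new group may be empty.
        self.pending.append(self.uncommitted)
        self.uncommitted = []

    def wait_group(self, n):
        # Retire the oldest groups until at most n groups remain pending.
        completed = []
        while len(self.pending) > n:
            completed.extend(self.pending.pop(0))
        return completed

    def wait_all(self):
        # Equivalent to commit_group followed by wait_group 0.
        self.commit_group()
        return self.wait_group(0)


t = AsyncGroups()
for name in ("copy1", "copy2", "copy3"):
    t.issue(name)
    t.commit_group()            # one group per copy: groups 1, 2 and 3
print(t.wait_group(1))          # ['copy1', 'copy2']; group 3 may still be pending
t.issue("copy4")
print(t.wait_all())             # ['copy3', 'copy4']
```

Groups retire strictly in commit order, which matches the statement that async-groups committed by a thread complete in the order in which they were committed.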
Writes performed by cp.async operations are made visible to the executing thread only after:
The completion of cp.async.wait_all or
The completion of cp.async.wait_group on the cp.async-group to which the cp.async belongs or
mbarrier.test_wait returns True on an mbarrier object which is tracking the completion of the cp.async operation.
There is no ordering between two cp.async operations that are not synchronized with cp.async.wait_all or cp.async.wait_group or mbarrier objects.
cp.async.wait_group and cp.async.wait_all do not provide any ordering and visibility guarantees for any other memory operation apart from cp.async.
PTX ISA Notes
Introduced in PTX ISA version 7.0.
Target ISA Notes
Requires sm_80 or higher.
Examples
// Example of .wait_all:
cp.async.ca.shared.global [shrd1], [gbl1], 4;
cp.async.cg.shared.global [shrd2], [gbl2], 16;
cp.async.wait_all;      // waits for all prior cp.async to complete

// Example of .wait_group :
cp.async.ca.shared.global [shrd3], [gbl3], 8;
cp.async.commit_group;  // End of group 1

cp.async.cg.shared.global [shrd4], [gbl4], 16;
cp.async.commit_group;  // End of group 2

cp.async.cg.shared.global [shrd5], [gbl5], 16;
cp.async.commit_group;  // End of group 3

cp.async.wait_group 1;  // waits for group 1 and group 2 to complete
cp.async.bulk is a non-blocking instruction which initiates an asynchronous bulk-copy operation from the location specified by source address operand srcMem to the location specified by destination address operand dstMem.
The direction of bulk-copy is from the state space specified by the .src modifier to the state space specified by the .dst modifier.
The 32-bit operand size specifies the amount of memory to be copied, in terms of number of bytes. size must be a multiple of 16. If the value is not a multiple of 16, then the behavior is undefined. The memory range [dstMem, dstMem + size - 1] must not overflow the destination memory space and the memory range [srcMem, srcMem + size - 1] must not overflow the source memory space. Otherwise, the behavior is undefined. The addresses dstMem and srcMem must be aligned to 16 bytes.
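Because violating the size and alignment rules above leads to undefined behavior rather than an error, it can help to validate arguments before issuing the copy. The Python sketch below shows such a check; check_bulk_copy_args is a hypothetical helper, and it covers only the multiple-of-16 and 16-byte alignment rules, not the address-range overflow rule.

```python
def check_bulk_copy_args(dst_addr, src_addr, size):
    """Return a list of rule violations for a bulk copy (empty list means OK)."""
    problems = []
    if size % 16 != 0:
        problems.append("size must be a multiple of 16")
    if dst_addr % 16 != 0:
        problems.append("dstMem must be aligned to 16 bytes")
    if src_addr % 16 != 0:
        problems.append("srcMem must be aligned to 16 bytes")
    return problems


print(check_bulk_copy_args(0x1000, 0x2000, 256))   # []
print(check_bulk_copy_args(0x1008, 0x2000, 100))
```

The second call reports two problems: the size 100 is not a multiple of 16 and the destination address 0x1008 is only 8-byte aligned.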
When the destination of the copy is .shared::cta, the destination address has to be in the shared memory of the executing CTA within the cluster, otherwise the behavior is undefined.
When the source of the copy is .shared::cta and the destination is .shared::cluster, the destination has to be in the shared memory of a different CTA within the cluster.
The modifier .completion_mechanism specifies the completion mechanism that is supported on the instruction variant. The completion mechanisms that are supported for different variants are summarized in the following table:
.completion-mechanism   .dst               .src           Completion mechanism
                                                          Needed for completion of   Optionally can be used for the completion of
                                                          entire async operation     - Reading data from the source
                                                                                     - Reading from the tensormap, if applicable
.mbarrier::...          .shared::cta       .global        mbarrier based             Bulk async-group based
                        .shared::cluster   .global
                        .shared::cluster   .shared::cta
.bulk_group             .global            .shared::cta   Bulk async-group based
The modifier .mbarrier::complete_tx::bytes specifies that the cp.async.bulk variant uses the mbarrier based completion mechanism. The complete-tx operation, with completeCount argument equal to the amount of data copied in bytes, will be performed on the mbarrier object specified by the operand mbar.
The modifier .bulk_group specifies that the cp.async.bulk variant uses the bulk async-group based completion mechanism.
The optional modifier .multicast::cluster allows copying of data from global memory to shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same CTA-relative offset as dstMem in the shared memory of each destination CTA. The mbarrier signal is also multicast to the same CTA-relative offset as mbar in the shared memory of the destination CTA.
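Reading a ctaMask value is a matter of listing its set bit positions. The Python helper below is a small illustration; destination_ctas is a hypothetical name, and the returned integers are simply the bit positions, to be interpreted as the CTA identifier that the description above associates with each bit.

```python
def destination_ctas(cta_mask):
    """List the bit positions selected by a 16-bit ctaMask value."""
    if not 0 <= cta_mask < (1 << 16):
        raise ValueError("ctaMask is a 16-bit operand")
    return [bit for bit in range(16) if (cta_mask >> bit) & 1]


print(destination_ctas(0b0101))   # [0, 2]
print(destination_ctas(0x8001))   # [0, 15]
```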
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program. The qualifier .level::cache_hint is only supported when at least one of the .src or .dst statespaces is the .global state space.
When the optional qualifier .cp_mask is specified, the argument byteMask is required. The i-th bit in the 16-bit wide byteMask operand specifies whether the i-th byte of each 16-byte wide chunk of source data is copied to the destination. If the bit is set, the byte is copied.
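The effect of byteMask can be modelled by applying the same 16-bit mask to every 16-byte chunk of the source. The Python sketch below uses a hypothetical function, apply_byte_mask, and assumes that destination bytes whose mask bit is clear keep their previous contents, which the description above does not state explicitly.

```python
def apply_byte_mask(dst, src, byte_mask):
    """Copy src over dst, keeping only bytes whose bit is set in byte_mask.

    Bit i of byte_mask controls byte i of every 16-byte chunk.
    Assumption: unselected destination bytes are left unchanged.
    """
    if len(src) % 16 != 0 or len(dst) != len(src):
        raise ValueError("sizes must match and be a multiple of 16")
    out = bytearray(dst)
    for index, value in enumerate(src):
        if (byte_mask >> (index % 16)) & 1:
            out[index] = value
    return bytes(out)


src = bytes(range(32))
result = apply_byte_mask(bytes(32), src, 0x0003)   # keep bytes 0 and 1 of each chunk
print(list(result[:4]), list(result[16:20]))       # [0, 1, 0, 0] [16, 17, 0, 0]
```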
The copy operation in cp.async.bulk is treated as a weak memory operation and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope as described in the Memory Consistency Model.
Notes
.multicast::cluster qualifier is optimized for target architecture sm_90a/sm_100f/sm_100a/sm_103f/sm_103a/sm_110f/sm_110a and may have substantially reduced performance on other targets, and hence .multicast::cluster is advised to be used with .target sm_90a/sm_100f/sm_100a/sm_103f/sm_103a/sm_110f/sm_110a.
PTX ISA Notes
Introduced in PTX ISA version 8.0.
Support for .shared::cta as destination state space is introduced in PTX ISA version 8.6.
Support for .cp_mask qualifier introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_90 or higher.
.multicast::cluster qualifier advised to be used with .target sm_90a or sm_100f or sm_100a or sm_103f or sm_103a or sm_110f or sm_110a.
Support for .cp_mask qualifier requires sm_100 or higher.
cp.reduce.async.bulk is a non-blocking instruction which initiates an asynchronous reduction operation on an array of memory locations specified by the destination address operand dstMem with the source array whose location is specified by the source address operand srcMem. The size of the source and the destination array must be the same and is specified by the operand size.
Each data element in the destination array is reduced inline with the corresponding data element in the source array with the reduction operation specified by the modifier .redOp. The type of each data element in the source and the destination array is specified by the modifier .type.
The source address operand srcMem is located in the state space specified by .src and the destination address operand dstMem is located in the state space specified by .dst.
The 32-bit operand size specifies the amount of memory to be copied from the source location and used in the reduction operation, in terms of number of bytes. size must be a multiple of 16. If the value is not a multiple of 16, then the behavior is undefined. The memory range [dstMem, dstMem + size - 1] must not overflow the destination memory space and the memory range [srcMem, srcMem + size - 1] must not overflow the source memory space. Otherwise, the behavior is undefined. The addresses dstMem and srcMem must be aligned to 16 bytes.
The operations supported by .redOp are classified as follows:
The bit-size operations are .and, .or, and .xor.
The integer operations are .add, .inc, .dec, .min, and .max. The .inc and .dec operations return a result in the range [0..x] where x is the value at the source state space.
The floating point operation .add rounds to the nearest even. The current implementation of cp.reduce.async.bulk.add.f32 flushes subnormal inputs and results to sign-preserving zero. The cp.reduce.async.bulk.add.f16 and cp.reduce.async.bulk.add.bf16 operations require the .noftz qualifier, which preserves input and result subnormals and does not flush them to zero.
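The [0..x] range of .inc and .dec follows from their wrap-around behavior. The Python sketch below assumes that these reductions use the same wrap-around rule as the .inc and .dec operations of the atom and red instructions, with the source element acting as the bound x; the function names are hypothetical.

```python
def red_inc(dst, src):
    """.inc: count upward and wrap to 0 once the bound src is reached."""
    return 0 if dst >= src else dst + 1


def red_dec(dst, src):
    """.dec: count downward and wrap to the bound src at 0 or when above the bound."""
    return src if dst == 0 or dst > src else dst - 1


bound = 3
value = 0
trace = []
for _ in range(6):
    value = red_inc(value, bound)
    trace.append(value)
print(trace)                 # [1, 2, 3, 0, 1, 2]
print(red_dec(0, bound))     # 3
```

Every result stays within 0..3, the range given by the bound in the source element.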
The following table describes the valid combinations of .redOp and element type:
.dst               .redOp             Element type
.shared::cluster   .add               .u32, .s32, .u64
                   .min, .max         .u32, .s32
                   .inc, .dec         .u32
                   .and, .or, .xor    .b32
.global            .add               .u32, .s32, .u64, .f32, .f64, .f16, .bf16
                   .min, .max         .u32, .s32, .u64, .s64, .f16, .bf16
                   .inc, .dec         .u32
                   .and, .or, .xor    .b32, .b64
The modifier .completion_mechanism specifies the completion mechanism that is supported on the instruction variant. The completion mechanisms that are supported for different variants are summarized in the following table:
.completion-mechanism   .dst               .src           Completion mechanism
                                                          Needed for completion of   Optionally can be used for the completion of
                                                          entire async operation     - Reading data from the source
                                                                                     - Reading from the tensormap, if applicable
.mbarrier::...          .shared::cluster   .global        mbarrier based             Bulk async-group based
                        .shared::cluster   .shared::cta
.bulk_group             .global            .shared::cta   Bulk async-group based
The modifier .mbarrier::complete_tx::bytes specifies that the cp.reduce.async.bulk variant uses the mbarrier based completion mechanism. The complete-tx operation, with completeCount argument equal to the amount of data copied in bytes, will be performed on the mbarrier object specified by the operand mbar.
The modifier .bulk_group specifies that the cp.reduce.async.bulk variant uses the bulk async-group based completion mechanism.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program. The qualifier .level::cache_hint is only supported when at least one of the .src or .dst statespaces is the .global state space.
Each reduction operation performed by cp.reduce.async.bulk has individually .relaxed.gpu memory ordering semantics. The load operations in cp.reduce.async.bulk are treated as weak memory operations and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope as described in the Memory Consistency Model.
cp.async.bulk.prefetch is a non-blocking instruction which may initiate an asynchronous prefetch of data from the location specified by source address operand srcMem, in the .src statespace, to the L2 cache.
The 32-bit operand size specifies the amount of memory to be prefetched in terms of number of bytes. size must be a multiple of 16. If the value is not a multiple of 16, then the behavior is undefined. The address srcMem must be aligned to 16 bytes.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
Following are the restrictions on the types .b4x16, .b4x16_p64, .b6x16_p32 and .b6p2x16:
cp.reduce.async.bulk doesn't support the types .b4x16, .b4x16_p64, .b6x16_p32 and .b6p2x16.
cp.async.bulk.tensor with the direction .global.shared::cta doesn't support the type .b4x16_p64.
cp.async.bulk.tensor with the direction .shared::cluster.global doesn't support the sub-byte types on sm_120a.
OOB-NaN fill mode doesn't support the types .b4x16, .b4x16_p64, .b6x16_p32 and .b6p2x16.
Box-Size[0] must be exactly:
96B for .b6x16_p32 and .b6p2x16.
64B for .b4x16_p64.
Tensor-Size[0] must be a multiple of:
96B for .b6x16_p32 and .b6p2x16.
64B for .b4x16_p64.
For .b4x16_p64, .b6x16_p32 and .b6p2x16, the first coordinate in the tensorCoords argument vector must be a multiple of 128.
For .b4x16_p64, .b6x16_p32 and .b6p2x16, the global memory address must be 32B aligned. Additionally, tensor stride in every dimension must be 32B aligned.
.b4x16_p64, .b6x16_p32 and .b6p2x16 support the following swizzling modes:
None.
128B (With all potential swizzle atomicity values except: 32B with 8B flip)
Following are the restrictions on the 96B swizzle mode:
The .swizzle_atomicity must be 16B.
The .interleave_layout must not be set.
Box-Size[0] must be less than or equal to 96B.
The type must not be among the following: .b4x16_p64, .b6x16_p32 and .b6p2x16.
The .load_mode must not be set to .im2col::w::128.
Following are the restrictions on the .global.shared::cta direction:
Starting co-ordinates for Bounding Box (tensorCoords) must be non-negative.
The bounding box along the D, W and H dimensions must stay within the tensor boundaries.This implies:
Bounding-Box Lower-Corner must be non-negative.
Bounding-Box Upper-Corner must be non-positive.
Following are the restrictions for sm_120a:
cp.async.bulk.tensor with the direction .shared::cluster.global doesn't support:
the sub-byte types
the qualifier .swizzle_atomicity
Following are the restrictions for sm_103a while using type .b6p2x16 on cp.async.bulk.tensor with the direction .global.shared::cta:
Box-Size[0] must be exactly either of 48B or 96B.
The global memory address must be 16B aligned.
Tensor Stride in every dimension must be 16B aligned.
The first coordinate in the tensorCoords argument vector must be a multiple of 64.
Tensor-Size[0] must be a multiple of 48B.
The following swizzle modes are supported:
None.
128B (With all potential swizzle atomicity values except: 32B with 8B flip)
cp.async.bulk.tensor is a non-blocking instruction which initiates an asynchronous copy operation of tensor data from the location in the .src state space to the location in the .dst state space.
The operand dstMem specifies the location in the .dst state space into which the tensor data has to be copied and srcMem specifies the location in the .src state space from which the tensor data has to be copied.
When .dst is specified as .shared::cta, the address dstMem must be in the shared memory of the executing CTA within the cluster, otherwise the behavior is undefined.
When .dst is specified as .shared::cluster, the address dstMem can be in the shared memory of any of the CTAs within the current cluster.
The operand tensorMap is the generic address of the opaque tensor-map object which resides in .param space or .const space or .global space. The operand tensorMap specifies the properties of the tensor copy operation, as described in Tensor-map. The tensorMap is accessed in the tensormap proxy. Refer to the CUDA programming guide for creating the tensor-map objects on the host side.
The dimension of the tensor data is specified by the .dim modifier.
The vector operand tensorCoords specifies the starting coordinates in the tensor data in the global memory from or to which the copy operation has to be performed. The individual tensor coordinates in tensorCoords are of type .s32. The format of vector argument tensorCoords is dependent on the .load_mode specified and is as follows:
.load_mode          tensorCoords                                         Semantics
.tile::scatter4,    {col_idx, row_idx0, row_idx1, row_idx2, row_idx3}    Fixed length vector of size 5. The five elements together
.tile::gather4                                                           specify the start co-ordinates of the four rows.
Rest all            {d0, .., dn} for n = .dim                            Vector of n elements where n = .dim. The elements indicate
                                                                         the offset in each of the dimensions.
The modifier .completion_mechanism specifies the completion mechanism that is supported on the instruction variant. The completion mechanisms that are supported for different variants are summarized in the following table:
.completion-mechanism   .dst               .src           Completion mechanism
                                                          Needed for completion of   Optionally can be used for the completion of
                                                          entire async operation     - Reading data from the source
                                                                                     - Reading from the tensormap, if applicable
.mbarrier::...          .shared::cta       .global        mbarrier based             Bulk async-group based
                        .shared::cluster   .global
.bulk_group             .global            .shared::cta   Bulk async-group based
The modifier .mbarrier::complete_tx::bytes specifies that the cp.async.bulk.tensor variant uses the mbarrier based completion mechanism. Upon the completion of the asynchronous copy operation, the complete-tx operation, with completeCount argument equal to the amount of data copied in bytes, will be performed on the mbarrier object specified by the operand mbar.
The modifier .cta_group can only be specified with the mbarrier based completion mechanism. The modifier .cta_group is used to signal either the odd numbered CTA or the even numbered CTA among the CTA-Pair. When .cta_group::1 is specified, the mbarrier object mbar that is specified must be in the shared memory of the same CTA as the shared memory destination dstMem. When .cta_group::2 is specified, the mbarrier object mbar can be in shared memory of either the same CTA as the shared memory destination dstMem or in its peer-CTA. If .cta_group is not specified, then it defaults to .cta_group::1.
The modifier .bulk_group specifies that the cp.async.bulk.tensor variant uses the bulk async-group based completion mechanism.
The qualifier .load_mode specifies how the data in the source location is copied into the destination location. If .load_mode is not specified, it defaults to .tile.
In .tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. In .tile::gather4 mode, four rows in a 2-dimensional source tensor are combined to form a single 2-dimensional destination tensor. In .tile::scatter4 mode, a single 2-dimensional source tensor is divided into four rows in a 2-dimensional destination tensor. Details of the .tile::scatter4/.tile::gather4 modes are described in .tile::scatter4 and .tile::gather4 modes.
In .im2col and .im2col::* modes, some dimensions of the source tensors are unrolled in a single dimensional column at the destination. Details of the .im2col and .im2col::* modes are described in im2col mode and im2col::w and im2col::w::128 modes respectively. In .im2col and .im2col::* modes, the tensor has to be at least 3-dimensional. The vector operand im2colInfo can be specified only when .load_mode is .im2col or .im2col::w or .im2col::w::128. The format of the vector argument im2colInfo is dependent on the exact im2col mode and is as follows:
Exact im2col mode    im2colInfo argument                   Semantics
.im2col              { i2cOffW , i2cOffH , i2cOffD }       A vector of im2col offsets whose vector size is two
                     for .dim = .5d                        less than number of dimensions .dim.
.im2col::w,          { wHalo, wOffset }                    A vector of 2 arguments containing wHalo and wOffset
.im2col::w::128                                            arguments.
.im2col_no_offs      im2colInfo is not applicable.         im2colInfo is not applicable.
Argument wHalo is a 16-bit unsigned integer whose valid set of values differs by load mode and is as follows:
- .im2col::w mode: valid range is [0, 512).
- .im2col::w::128 mode: valid range is [0, 32).
Argument wOffset is a 16-bit unsigned integer whose valid range of values is [0, 32).
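The range rules for wHalo and wOffset are easy to get wrong because the wHalo limit depends on the load mode. The Python sketch below encodes the two half-open ranges stated above; check_im2col_w_info and the mode strings used as dictionary keys are hypothetical names chosen for illustration.

```python
# Exclusive upper bound of wHalo per load mode, taken from the ranges above.
W_HALO_LIMIT = {"im2col::w": 512, "im2col::w::128": 32}
W_OFFSET_LIMIT = 32


def check_im2col_w_info(load_mode, w_halo, w_offset):
    """Return True when { wHalo, wOffset } is within range for the given mode."""
    if load_mode not in W_HALO_LIMIT:
        raise ValueError("im2colInfo { wHalo, wOffset } does not apply to this mode")
    return 0 <= w_halo < W_HALO_LIMIT[load_mode] and 0 <= w_offset < W_OFFSET_LIMIT


print(check_im2col_w_info("im2col::w", 100, 5))        # True
print(check_im2col_w_info("im2col::w::128", 100, 5))   # False: wHalo must be below 32
```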
The optional modifier .multicast::cluster allows copying of data from global memory to shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same offset as dstMem in the shared memory of each destination CTA. When .cta_group is specified as:
.cta_group::1 : The mbarrier signal is also multicast to the same offset as mbar in the shared memory of the destination CTA.
.cta_group::2 : The mbarrier signal is multicast either to all the odd numbered CTAs or the even numbered CTAs within the corresponding CTA-Pair. For each destination CTA specified in the ctaMask, the mbarrier signal is sent either to the destination CTA or its peer-CTA, based on the %cluster_ctarank parity of the CTA in whose shared memory the mbarrier object mbar resides.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
The copy operation in cp.async.bulk.tensor is treated as a weak memory operation and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope as described in the Memory Consistency Model.
Notes
.multicast::cluster qualifier is optimized for target architecture sm_90a/sm_100f/sm_100a/sm_103f/sm_103a/sm_110f/sm_110a and may have substantially reduced performance on other targets, and hence .multicast::cluster is advised to be used with .target sm_90a/sm_100f/sm_100a/sm_103f/sm_103a/sm_110f/sm_110a.
PTX ISA Notes
Introduced in PTX ISA version 8.0.
Support for .shared::cta as destination state space is introduced in PTX ISA version 8.6.
Support for qualifiers .tile::gather4 and .tile::scatter4 introduced in PTX ISA version 8.6.
Support for qualifiers .im2col::w and .im2col::w::128 introduced in PTX ISA version 8.6.
Support for qualifier .cta_group introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_90 or higher.
.multicast::cluster qualifier advised to be used with .target sm_90a or sm_100f or sm_100a or sm_103f or sm_103a or sm_110f or sm_110a.
Qualifiers .tile::gather4 and .im2col::w require:
sm_100a when destination state space is .shared::cluster; this is also supported on sm_100f from PTX ISA version 8.8.
sm_100 or higher when destination state space is .shared::cta.
Qualifier .tile::scatter4 is supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Qualifier .im2col::w::128 is supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Qualifier .cta_group is supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
cp.reduce.async.bulk.tensor is a non-blocking instruction which initiates an asynchronous reduction operation of tensor data in the .dst state space with tensor data in the .src state space.
The operand srcMem specifies the location of the tensor data in the .src state space that is used for the reduction operation.
The operand tensorMap is the generic address of the opaque tensor-map object which resides in .param space or .const space or .global space. The operand tensorMap specifies the properties of the tensor copy operation, as described in Tensor-map. The tensorMap is accessed in tensormap proxy. Refer to the CUDA programming guide for creating the tensor-map objects on the host side.
Each element of the tensor data in the .dst state space is reduced inline with the corresponding element from the tensor data in the .src state space. The modifier .redOp specifies the reduction operation used for the inline reduction. The type of each tensor data element in the source and the destination tensor is specified in Tensor-map.
The dimension of the tensor is specified by the .dim modifier.
The vector operand tensorCoords specifies the starting coordinates of the tensor data in the global memory on which the reduce operation is to be performed. The number of tensor coordinates in the vector argument tensorCoords should be equal to the dimension specified by the modifier .dim. The individual tensor coordinates are of the type .s32.
The following table describes the valid combinations of .redOp and element type:
.redOp | Element type
.add | .u32, .s32, .u64, .f32, .f16, .bf16
.min, .max | .u32, .s32, .u64, .s64, .f16, .bf16
.inc, .dec | .u32
.and, .or, .xor | .b32, .b64
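The table of valid .redOp and element type combinations can be captured as a small lookup, which may be handy when generating or checking PTX programmatically. The sketch below is illustrative only: the dictionary and the function is_valid_reduction are hypothetical names, and the contents simply transcribe the table above.

```python
# Valid element types per reduction operation, transcribed from the table.
VALID_ELEMENT_TYPES = {
    ".add": {".u32", ".s32", ".u64", ".f32", ".f16", ".bf16"},
    ".min": {".u32", ".s32", ".u64", ".s64", ".f16", ".bf16"},
    ".max": {".u32", ".s32", ".u64", ".s64", ".f16", ".bf16"},
    ".inc": {".u32"},
    ".dec": {".u32"},
    ".and": {".b32", ".b64"},
    ".or": {".b32", ".b64"},
    ".xor": {".b32", ".b64"},
}


def is_valid_reduction(red_op, elem_type):
    """Return True if the (.redOp, element type) pair appears in the table."""
    return elem_type in VALID_ELEMENT_TYPES.get(red_op, set())


if __name__ == "__main__":
    print(is_valid_reduction(".add", ".f32"))  # listed for .add
    print(is_valid_reduction(".min", ".f32"))  # .f32 is not listed for .min
```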
The modifier .completion_mechanism specifies the completion mechanism that is supported on the instruction variant. Value .bulk_group of the modifier .completion_mechanism specifies that the cp.reduce.async.bulk.tensor instruction uses bulk async-group based completion mechanism.
The qualifier .load_mode specifies how the data in the source location is copied into the destination location. If .load_mode is not specified, it defaults to .tile. In .tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. In .im2col_no_offs mode, some dimensions of the source tensors are unrolled in a single dimensional column at the destination. Details of the im2col mode are described in im2col mode. In .im2col mode, the tensor has to be at least 3-dimensional.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program. The qualifier .level::cache_hint is only supported when at least one of the .src or .dst statespaces is .global state space.
Each reduction operation performed by cp.reduce.async.bulk.tensor has individually .relaxed.gpu memory ordering semantics. The load operations in cp.reduce.async.bulk.tensor are treated as weak memory operations and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope as described in the Memory Consistency Model.
cp.async.bulk.prefetch.tensor is a non-blocking instruction which may initiate an asynchronous prefetch of tensor data from the location in .src statespace to the L2 cache.
The operand tensorMap is the generic address of the opaque tensor-map object which resides in .param space or .const space or .global space. The operand tensorMap specifies the properties of the tensor copy operation, as described in Tensor-map. The tensorMap is accessed in tensormap proxy. Refer to the CUDA programming guide for creating the tensor-map objects on the host side.
The dimension of the tensor data is specified by the .dim modifier.
The vector operand tensorCoords specifies the starting coordinates in the tensor data in the global memory from which the copy operation has to be performed. The individual tensor coordinates in tensorCoords are of type .s32. The format of vector argument tensorCoords is dependent on .load_mode specified and is as follows:
.load_mode | tensorCoords | Semantics
.tile::gather4 | {col_idx, row_idx0, row_idx1, row_idx2, row_idx3} | Fixed length vector of size 5. The five elements together specify the start co-ordinates of the four rows.
Rest all | {d0, .., dn} for n = .dim | Vector of n elements where n = .dim. The elements indicate the offset in each of the dimensions.
The qualifier .load_mode specifies how the data in the source location is copied into the destination location. If .load_mode is not specified, it defaults to .tile.
In .tile mode, the multi-dimensional layout of the source tensor is preserved at the destination. In .tile::gather4 mode, four rows in the 2-dimensional source tensor are fetched to L2 cache. Details of .tile::gather4 modes are described in .tile::scatter4 and .tile::gather4 modes.
In .im2col and .im2col::* modes, some dimensions of the source tensors are unrolled in a single dimensional column at the destination. Details of the im2col and .im2col::* modes are described in im2col mode and im2col::w and im2col::w::128 modes respectively. In .im2col and .im2col::* modes, the tensor has to be at least 3-dimensional. The vector operand im2colInfo can be specified only when .load_mode is .im2col or .im2col::w or .im2col::w::128. The format of the vector argument im2colInfo is dependent on the exact im2col mode and is as follows:
Exact im2col mode | im2colInfo argument | Semantics
.im2col | { i2cOffW, i2cOffH, i2cOffD } for .dim = .5d | A vector of im2col offsets whose vector size is two less than number of dimensions .dim.
.im2col::w, .im2col::w::128 | { wHalo, wOffset } | A vector of 2 arguments containing wHalo and wOffset arguments.
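The operand shapes described in the two tables above can be summarised as a pair of small functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, dim is the integer tensor rank implied by .dim, and .im2col::w::128 is assumed to take the same { wHalo, wOffset } pair as .im2col::w.

```python
IM2COL_W_MODES = (".im2col::w", ".im2col::w::128")


def tensor_coords_len(load_mode, dim):
    """Number of .s32 elements expected in the tensorCoords vector."""
    if load_mode == ".tile::gather4":
        # {col_idx, row_idx0, row_idx1, row_idx2, row_idx3}
        return 5
    # All other modes: one starting offset per tensor dimension.
    return dim


def im2col_info_len(load_mode, dim):
    """Number of elements in im2colInfo, or 0 if the operand is not allowed."""
    if load_mode == ".im2col" or load_mode in IM2COL_W_MODES:
        if dim < 3:
            raise ValueError("im2col modes need a tensor of at least 3 dimensions")
        # .im2col: one offset per dimension except two; w modes: {wHalo, wOffset}.
        return dim - 2 if load_mode == ".im2col" else 2
    return 0


if __name__ == "__main__":
    print(tensor_coords_len(".tile::gather4", 2), im2col_info_len(".im2col", 5))
```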
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
cp.async.bulk.prefetch.tensor is treated as a weak memory operation in the Memory Consistency Model.
PTX ISA Notes
Introduced in PTX ISA version 8.0.
Support for qualifier .tile::gather4 introduced in PTX ISA version 8.6.
Support for qualifiers .im2col::w and .im2col::w::128 introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_90 or higher.
Qualifier .tile::gather4 is supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Qualifiers .im2col::w and .im2col::w::128 are supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And are supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
Commits all prior initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group.
Syntax
cp.async.bulk.commit_group;
Description
cp.async.bulk.commit_group instruction creates a new per-thread bulk async-group and batches all prior cp{.reduce}.async.bulk.{.prefetch}{.tensor} instructions satisfying the following conditions into the new bulk async-group:
The prior cp{.reduce}.async.bulk.{.prefetch}{.tensor} instructions use bulk_group based completion mechanism, and
They are initiated by the executing thread but not committed to any bulk async-group.
If there are no uncommitted cp{.reduce}.async.bulk.{.prefetch}{.tensor} instructions then cp.async.bulk.commit_group results in an empty bulk async-group.
An executing thread can wait for the completion of all cp{.reduce}.async.bulk.{.prefetch}{.tensor} operations in a bulk async-group using cp.async.bulk.wait_group.
There is no memory ordering guarantee provided between any two cp{.reduce}.async.bulk.{.prefetch}{.tensor} operations within the same bulk async-group.
cp.async.bulk.wait_group instruction will cause the executing thread to wait until only N or fewer of the most recent bulk async-groups are pending and all the prior bulk async-groups committed by the executing thread are complete. For example, when N is 0, the executing thread waits on all the prior bulk async-groups to complete. Operand N is an integer constant.
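The commit and wait rules can be pictured with a toy per-thread bookkeeping model. The Python sketch below is not a hardware simulation: the class BulkAsyncThread and its methods are hypothetical, operations are plain strings, and "completion" just means that wait_group(n) returns the groups that are guaranteed complete once the wait finishes.

```python
class BulkAsyncThread:
    """Toy model of one thread's bulk async-groups (illustration only)."""

    def __init__(self):
        self.uncommitted = []  # operations issued but not yet committed
        self.pending = []      # committed groups, oldest first

    def issue(self, op):
        """Record a bulk async operation that uses the bulk_group mechanism."""
        self.uncommitted.append(op)

    def commit_group(self):
        """Batch all uncommitted operations into a new (possibly empty) group."""
        self.pending.append(list(self.uncommitted))
        self.uncommitted.clear()

    def wait_group(self, n):
        """Return the groups that must be complete so that at most n remain pending."""
        if n < 0:
            raise ValueError("N must be a non-negative integer constant")
        completed = []
        while len(self.pending) > n:
            completed.append(self.pending.pop(0))  # oldest groups finish first
        return completed


if __name__ == "__main__":
    t = BulkAsyncThread()
    t.issue("copy A")
    t.issue("copy B")
    t.commit_group()
    t.issue("copy C")
    t.commit_group()
    print(t.wait_group(1))  # the older group must be complete; the newest may be pending
```

In this model wait_group(0) drains every committed group, which mirrors the N = 0 case described above.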
By default, cp.async.bulk.wait_group instruction will cause the executing thread to wait until completion of all the bulk async operations in the specified bulk async-group. A bulk async operation includes the following:
Optionally, reading from the tensormap.
Reading from the source locations.
Writing to their respective destination locations.
Writes being made visible to the executing thread.
The optional .read modifier indicates that the waiting has to be done until all the bulk async operations in the specified bulk async-group have completed:
The tensormap.replace instruction replaces the field, specified by the .field qualifier, of the tensor-map object at the location specified by the address operand addr with a new value. The new value is specified by the argument new_val.
Qualifier .mode specifies the mode of the tensor-map object located at the address operand addr.
Instruction type .b1024 indicates the size of the tensor-map object, which is 1024 bits.
Operand new_val has the type .type. When .field is specified as .global_address or .global_stride, .type must be .b64. Otherwise, .type must be .b32.
The immediate integer operand ord specifies the ordinal of the field across the rank of the tensor which needs to be replaced in the tensor-map object.
For field .rank, the operand new_val must be one less than the desired tensor rank as this field uses zero-based numbering.
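The two rules above (operand type by field, and the zero-based rank encoding) are easy to get wrong when emitting tensormap.replace from a code generator. The following sketch shows them as plain functions; the names new_val_type and encode_rank are hypothetical, and only the field names mentioned in this section are assumed to exist.

```python
def new_val_type(field):
    """Return the required .type of new_val for a given .field qualifier."""
    if field in (".global_address", ".global_stride"):
        return ".b64"
    return ".b32"


def encode_rank(rank):
    """Return the new_val to use with field .rank for a tensor of this rank."""
    if rank < 1:
        raise ValueError("tensor rank must be at least 1")
    return rank - 1  # the .rank field uses zero-based numbering


if __name__ == "__main__":
    print(new_val_type(".global_address"), new_val_type(".rank"), encode_rank(3))
```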
When .field3 is specified, the operand new_val must be an immediate, and Table 33 shows the mapping of the operand new_val across various fields.
The values of .elemtype do not correspond to the values of the CUtensorMapDataType enum used in the driver API.
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .global or .shared::cta state space then the behavior is undefined.
tensormap.replace is treated as a weak memory operation, on the entire 1024-bit opaque tensor-map object, in the Memory Consistency Model.
PTX ISA Notes
Introduced in PTX ISA version 8.3.
Qualifier .swizzle_atomicity introduced in PTX ISA version 8.6.
Qualifier .elemtype with values from 13 to 15, both inclusive, is supported in PTX ISA version 8.7 onwards.
Qualifier .swizzle_mode with value 4 is supported from PTX ISA version 8.8 onwards.
Target ISA Notes
Supported on following architectures:
sm_90a
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
Qualifier .swizzle_atomicity is supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a (refer to section for restrictions on sm_120a)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
.field3 variant .elemtype corresponding to new_val values 13, 14 and 15 is supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a (refer to section for restrictions on sm_120a)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
.field3 variant .swizzle_mode corresponding to new_val value 4 is supported on following architectures:
sm_103a (refer to section for restrictions on sm_103a)
For working with textures and samplers, PTX has two modes of operation. In the unified mode, texture and sampler information is accessed through a single .texref handle. In the independent mode, texture and sampler information each have their own handle, allowing them to be defined separately and combined at the site of usage in the program.
The advantage of unified mode is that it allows 256 samplers per kernel (128 for architectures prior to sm_3x), with the restriction that they correspond 1-to-1 with the 256 possible textures per kernel (128 for architectures prior to sm_3x). The advantage of independent mode is that textures and samplers can be mixed and matched, but the number of samplers is greatly restricted to 32 per kernel (16 for architectures prior to sm_3x).
Table 34 summarizes the number of textures, samplers and surfaces available in different texturing modes.
The texturing mode is selected using .target options texmode_unified and texmode_independent. A PTX module may declare only one texturing mode. If no texturing mode is declared, the module is assumed to use unified mode.
Example: calculate an element’s power contribution as element’s power/total number of elements.
A mipmap is a sequence of textures, each of which is a progressively lower resolution representation of the same image. The height and width of each image, or level of detail (LOD), in the mipmap is a power of two smaller than the previous level. Mipmaps are used in graphics applications to improve rendering speed and reduce aliasing artifacts. For example, a high-resolution mipmap image is used for objects that are close to the user; lower-resolution images are used as the object appears farther away. Mipmap filtering modes are provided when switching between two levels of detail (LODs) in order to avoid abrupt changes in visual fidelity.
Example: If the texture has a basic size of 256 by 256 pixels, then the associated mipmap set may contain a series of eight images, each one-fourth the total area of the previous one: 128x128 pixels, 64x64, 32x32, 16x16, 8x8, 4x4, 2x2, 1x1 (a single pixel). If, for example, a scene is rendering this texture in a space of 40x40 pixels, then either a scaled up version of the 32x32 (without trilinear interpolation) or an interpolation of the 64x64 and the 32x32 mipmaps (with trilinear interpolation) would be used.
The total number of LODs in a complete mipmap pyramid is calculated through the following equation:
numLODs = 1 + floor(log2(max(w, h, d)))
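As a quick check of the equation above, the following Python sketch computes the number of LODs for a given base size. The function name num_lods is hypothetical. It uses the integer identity bit_length(m) = 1 + floor(log2(m)) for m >= 1 rather than a floating-point logarithm, which is an implementation choice of this sketch and not something prescribed by PTX.

```python
def num_lods(w, h=1, d=1):
    """Number of levels in a complete mipmap pyramid for a w x h x d base level."""
    largest = max(w, h, d)
    if largest < 1:
        raise ValueError("texture extents must be at least 1")
    # For a positive integer m, m.bit_length() == 1 + floor(log2(m)).
    return largest.bit_length()


if __name__ == "__main__":
    # The 256x256 example: the base level plus eight smaller images.
    print(num_lods(256, 256))
```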
The finest LOD is called the base level and is the 0th level. The next (coarser) level is the 1st level, and so on. The coarsest level is the level of size (1 x 1 x 1). Each successively smaller mipmap level has half the {width, height, depth} of the previous level, but if this half value is a fractional value, it is rounded down to the next lower integer. Essentially, the size of a mipmap level can be specified as:
max(1, floor(w_b / 2^i)) x max(1, floor(h_b / 2^i)) x max(1, floor(d_b / 2^i))
where i is the ith level beyond the 0th level (the base level), and w_b, h_b and d_b are the width, height and depth of the base level respectively.
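The halving rule can be sketched in a few lines of Python. This assumes that each extent at the next level is max(1, floor(extent / 2)), which follows from the description of halving with rounding down and a coarsest level of 1 x 1 x 1; the function name mip_chain is hypothetical.

```python
def mip_chain(base):
    """Return the extents of every mipmap level, from the base level down.

    base is a tuple of extents such as (width, height) or (width, height, depth).
    Each level halves every extent, rounding down, but never below 1.
    """
    levels = [tuple(base)]
    while any(extent > 1 for extent in levels[-1]):
        levels.append(tuple(max(1, extent // 2) for extent in levels[-1]))
    return levels


if __name__ == "__main__":
    for level, size in enumerate(mip_chain((5, 3))):
        print(level, size)
```

Repeated halving with rounding down gives the same result as dividing the base extent by 2^i and rounding down once, so level i of the returned list matches the per-level size described above.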
PTX support for mipmaps
The PTX tex instruction supports three modes for specifying the LOD: base, level, and gradient. In base mode, the instruction always picks level 0. In level mode, an additional argument is provided to specify the LOD to fetch from. In gradient mode, two floating-point vector arguments provide partials (e.g., {ds/dx, dt/dx} and {ds/dy, dt/dy} for a 2d texture), which the tex instruction uses to compute the LOD.
These instructions provide access to texture memory.
Texture lookup using a texture coordinate vector. The instruction loads data from the texture named by operand a at coordinates given by operand c into destination d. Operand c is a scalar or singleton tuple for 1d textures; is a two-element vector for 2d textures; and is a four-element vector for 3d textures, where the fourth element is ignored. An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture. The optional destination predicate p is set to True if data from texture at specified coordinates is resident in memory, False otherwise. When optional destination predicate p is set to False, data loaded will be all zeros. Memory residency of Texture Data at specified coordinates is dependent on execution environment setup using Driver API calls, prior to kernel launch. Refer to Driver API documentation for more details including any system/implementation specific behavior.
An optional operand e may be specified. Operand e is a vector of .s32 values that specifies coordinate offset. Offset is applied to coordinates before doing texture lookup. Offset value is in the range of -8 to +7. Operand e is a singleton tuple for 1d textures; is a two-element vector for 2d textures; and is a four-element vector for 3d textures, where the fourth element is ignored.
An optional operand f may be specified for depth textures. Depth textures are a special type of texture which hold data from the depth buffer. The depth buffer contains depth information of each pixel. Operand f is a .f32 scalar value that specifies the depth compare value for depth textures. Each element fetched from the texture is compared against the value given in the f operand. If the comparison passes, the result is 1.0; otherwise the result is 0.0. These per-element comparison results are used for the filtering. When using the depth compare operand, the elements in texture coordinate vector c have .f32 type.
Depth compare operand is not supported for 3d textures.
The instruction returns a two-element vector for destination type .f16x2. For all other destination types, the instruction returns a four-element vector. Coordinates may be given in either signed 32-bit integer or 32-bit floating point form.
A texture base address is assumed to be aligned to a 16 byte boundary, and the address given by the coordinate vector must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined; i.e., the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault.
tex.{a1d,a2d}
Texture array selection, followed by texture lookup. The instruction first selects a texture from the texture array named by operand a using the index given by the first element of the array coordinate vector c. The instruction then loads data from the selected texture at coordinates given by the remaining elements of operand c into destination d. Operand c is a bit-size type vector or tuple containing an index into the array of textures followed by coordinates within the selected texture, as follows:
For 1d texture arrays, operand c has type .v2.b32. The first element is interpreted as an unsigned integer index (.u32) into the texture array, and the second element is interpreted as a 1d texture coordinate of type .ctype.
For 2d texture arrays, operand c has type .v4.b32. The first element is interpreted as an unsigned integer index (.u32) into the texture array, and the next two elements are interpreted as 2d texture coordinates of type .ctype. The fourth element is ignored.
An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.
An optional operand e may be specified. Operand e is a vector of .s32 values that specifies coordinate offset. Offset is applied to coordinates before doing texture lookup. Offset value is in the range of -8 to +7. Operand e is a singleton tuple for 1d texture arrays; and is a two-element vector for 2d texture arrays.
An optional operand f may be specified for depth texture arrays. Operand f is a .f32 scalar value that specifies the depth compare value for depth textures. When using the depth compare operand, the coordinates in texture coordinate vector c have .f32 type.
The instruction returns a two-element vector for destination type .f16x2. For all other destination types, the instruction returns a four-element vector. The texture array index is a 32-bit unsigned integer, and texture coordinate elements are 32-bit signed integer or floating point values.
The optional destination predicate p is set to True if data from texture at specified coordinates is resident in memory, False otherwise. When optional destination predicate p is set to False, data loaded will be all zeros. Memory residency of Texture Data at specified coordinates is dependent on execution environment setup using Driver API calls, prior to kernel launch. Refer to Driver API documentation for more details including any system/implementation specific behavior.
tex.cube
Cubemap texture lookup. The instruction loads data from the cubemap texture named by operand a at coordinates given by operand c into destination d. Cubemap textures are special two-dimensional layered textures consisting of six layers that represent the faces of a cube. All layers in a cubemap are of the same size and are square (i.e., width equals height).
When accessing a cubemap, the texture coordinate vector c has type .v4.f32, and comprises three floating-point coordinates (s, t, r) and a fourth padding argument which is ignored. Coordinates (s, t, r) are projected onto one of the six cube faces. The (s, t, r) coordinates can be thought of as a direction vector emanating from the center of the cube. Of the three coordinates (s, t, r), the coordinate of the largest magnitude (the major axis) selects the cube face. Then, the other two coordinates (the minor axes) are divided by the absolute value of the major axis to produce a new (s, t) coordinate pair to lookup into the selected cube face.
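The face selection step can be sketched as follows. This is a minimal illustration of the rule stated above (largest magnitude picks the face, the other two coordinates are divided by its absolute value). The function name is hypothetical, ties are resolved in favour of the first axis as an arbitrary assumption, and the per-face orientation and remapping of the resulting pair into texel space are deliberately left out because they are not described here.

```python
def project_to_cube_face(s, t, r):
    """Select a cube face from direction (s, t, r) and scale the minor axes.

    Returns (major_axis_index, sign, (minor_a, minor_b)), where the index is
    0, 1 or 2 for s, t or r, and sign is '+' or '-' for the facing direction.
    """
    coords = (s, t, r)
    # Index of the coordinate with the largest magnitude (first one on ties).
    major = max(range(3), key=lambda i: abs(coords[i]))
    magnitude = abs(coords[major])
    if magnitude == 0.0:
        raise ValueError("the zero vector does not select a cube face")
    sign = "+" if coords[major] > 0 else "-"
    minors = tuple(coords[i] / magnitude for i in range(3) if i != major)
    return major, sign, minors


if __name__ == "__main__":
    print(project_to_cube_face(0.5, -2.0, 1.0))
```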
An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.
Offset vector operand e is not supported for cubemap textures.
An optional operand f may be specified for cubemap depth textures. Operand f is a .f32 scalar value that specifies the depth compare value for cubemap depth textures.
The optional destination predicate p is set to True if data from texture at specified coordinates is resident in memory, False otherwise. When optional destination predicate p is set to False, data loaded will be all zeros. Memory residency of Texture Data at specified coordinates is dependent on execution environment setup using Driver API calls, prior to kernel launch. Refer to Driver API documentation for more details including any system/implementation specific behavior.
tex.acube
Cubemap array selection, followed by cubemap lookup. The instruction first selects a cubemap texture from the cubemap array named by operand a using the index given by the first element of the array coordinate vector c. The instruction then loads data from the selected cubemap texture at coordinates given by the remaining elements of operand c into destination d.
Cubemap array textures consist of an array of cubemaps, i.e., the total number of layers is a multiple of six. When accessing a cubemap array texture, the coordinate vector c has type .v4.b32. The first element is interpreted as an unsigned integer index (.u32) into the cubemap array, and the remaining three elements are interpreted as floating-point cubemap coordinates (s, t, r), used to lookup in the selected cubemap as described above.
An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.
Offset vector operand e is not supported for cubemap texture arrays.
An optional operand f may be specified for cubemap depth texture arrays. Operand f is a .f32 scalar value that specifies the depth compare value for cubemap depth textures.
The optional destination predicate p is set to True if data from texture at specified coordinates is resident in memory, False otherwise. When optional destination predicate p is set to False, data loaded will be all zeros. Memory residency of Texture Data at specified coordinates is dependent on execution environment setup using Driver API calls, prior to kernel launch. Refer to Driver API documentation for more details including any system/implementation specific behavior.
tex.2dms
Multi-sample texture lookup using a texture coordinate vector. Multi-sample textures consist of multiple samples per data element. The instruction loads data from the texture named by operand a from the sample number given by the first element of the operand c, at coordinates given by the remaining elements of operand c into destination d. When accessing a multi-sample texture, texture coordinate vector c has type .v4.b32. The first element in operand c is interpreted as an unsigned integer sample number (.u32), and the next two elements are interpreted as signed integer (.s32) 2d texture coordinates. The fourth element is ignored. An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.
An optional operand e may be specified. Operand e is a vector of type .v2.s32 that specifies coordinate offset. Offset is applied to coordinates before doing texture lookup. Offset value is in the range of -8 to +7.
Depth compare operand f is not supported for multi-sample textures.
The optional destination predicate p is set to True if data from texture at specified coordinates is resident in memory, False otherwise. When optional destination predicate p is set to False, data loaded will be all zeros. Memory residency of Texture Data at specified coordinates is dependent on execution environment setup using Driver API calls, prior to kernel launch. Refer to Driver API documentation for more details including any system/implementation specific behavior.
tex.a2dms
Multi-sample texture array selection, followed by multi-sample texture lookup. The instruction first selects a multi-sample texture from the multi-sample texture array named by operand a using the index given by the first element of the array coordinate vector c. The instruction then loads data from the selected multi-sample texture from the sample number given by the second element of the operand c, at coordinates given by the remaining elements of operand c into destination d. When accessing a multi-sample texture array, texture coordinate vector c has type .v4.b32. The first element in operand c is interpreted as an unsigned integer index (.u32) into the multi-sample texture array, the second element is interpreted as an unsigned integer sample number (.u32), and the next two elements are interpreted as signed integer (.s32) 2d texture coordinates. An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.
An optional operand e may be specified. Operand e is a vector of type .v2.s32 that specifies coordinate offset. Offset is applied to coordinates before doing texture lookup. Offset value is in the range of -8 to +7.
Depth compare operand f is not supported for multi-sample texture arrays.
The optional destination predicate p is set to True if data from texture at specified coordinates is resident in memory, False otherwise. When optional destination predicate p is set to False, data loaded will be all zeros. Memory residency of Texture Data at specified coordinates is dependent on execution environment setup using Driver API calls, prior to kernel launch. Refer to Driver API documentation for more details including any system/implementation specific behavior.
Mipmaps
.base (lod zero)
Pick level 0 (base level). This is the default if no mipmap mode is specified. No additional arguments.
.level (lod explicit)
Requires an additional 32-bit scalar argument, lod, which contains the LOD to fetch from. The type of lod follows .ctype (either .s32 or .f32). Geometries .2dms and .a2dms are not supported in this mode.
.grad (lod gradient)
Requires two .f32 vectors, dPdx and dPdy, that specify the partials. The vectors are singletons for 1d and a1d textures; are two-element vectors for 2d and a2d textures; and are four-element vectors for 3d, cube and acube textures, where the fourth element is ignored for 3d and cube geometries. Geometries .2dms and .a2dms are not supported in this mode.
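The extra operands required by each mipmap mode can be tabulated in code. The sketch below is illustrative only: mipmap_operands and GRAD_VECTOR_LEN are hypothetical names, geometries are written without the leading dot, and the function merely restates the three mode descriptions above as (operand name, element count) pairs.

```python
# Length of the dPdx / dPdy vectors per geometry in .grad mode.
GRAD_VECTOR_LEN = {
    "1d": 1, "a1d": 1,
    "2d": 2, "a2d": 2,
    "3d": 4, "cube": 4, "acube": 4,
}

MULTISAMPLE_GEOMETRIES = ("2dms", "a2dms")


def mipmap_operands(mode, geom):
    """Return the additional operands as a list of (name, element_count)."""
    if mode == ".base":
        return []  # always level 0, no additional arguments
    if geom in MULTISAMPLE_GEOMETRIES:
        raise ValueError(f"geometry {geom} is not supported in {mode} mode")
    if mode == ".level":
        return [("lod", 1)]  # a single 32-bit scalar
    if mode == ".grad":
        n = GRAD_VECTOR_LEN[geom]
        return [("dPdx", n), ("dPdy", n)]
    raise ValueError(f"unknown mipmap mode {mode}")


if __name__ == "__main__":
    print(mipmap_operands(".grad", "2d"))
```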
For mipmap texture lookup, an optional operand e may be specified. Operand e is a vector of .s32 that specifies coordinate offset. Offset is applied to coordinates before doing texture lookup. Offset value is in the range of -8 to +7. Offset vector operand is not supported for cube and cubemap geometries.
An optional operand f may be specified for mipmap textures. Operand f is a .f32 scalar value that specifies the depth compare value for depth textures. When using the depth compare operand, the coordinates in texture coordinate vector c have .f32 type.
The optional destination predicate p is set to True if data from texture at specified coordinates is resident in memory, False otherwise. When optional destination predicate p is set to False, data loaded will be all zeros. Memory residency of Texture Data at specified coordinates is dependent on execution environment setup using Driver API calls, prior to kernel launch. Refer to Driver API documentation for more details including any system/implementation specific behavior.
Depth compare operand is not supported for 3d textures.
Indirect texture access
Beginning with PTX ISA version 3.1, indirect texture access is supported in unified mode for target architecture sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .texref variable.
Notes
For compatibility with prior versions of PTX, the square brackets are not required and .v4 coordinate vectors are allowed for any geometry, with the extra elements being ignored.
PTX ISA Notes
Unified mode texturing introduced in PTX ISA version 1.0. Extension using opaque .texref and .samplerref types and independent mode texturing introduced in PTX ISA version 1.5.
Texture arrays tex.{a1d,a2d} introduced in PTX ISA version 2.3.
Cubemaps and cubemap arrays introduced in PTX ISA version 3.0.
Support for mipmaps introduced in PTX ISA version 3.1.
Indirect texture access introduced in PTX ISA version 3.1.
Multi-sample textures and multi-sample texture arrays introduced in PTX ISA version 3.2.
Support for textures returning .f16 and .f16x2 data introduced in PTX ISA version 4.2.
Support for tex.grad.{cube,acube} introduced in PTX ISA version 4.3.
Offset vector operand introduced in PTX ISA version 4.3.
Depth compare operand introduced in PTX ISA version 4.3.
Support for optional destination predicate introduced in PTX ISA version 7.1.
Target ISA Notes
Supported on all target architectures.
The cubemap array geometry (.acube) requires sm_20 or higher.
Mipmaps require sm_20 or higher.
Indirect texture access requires sm_20 or higher.
Multi-sample textures and multi-sample texture arrays require sm_30 or higher.
Texture fetch returning .f16 and .f16x2 data requires sm_53 or higher.
tex.grad.{cube,acube} requires sm_20 or higher.
Offset vector operand requires sm_30 or higher.
Depth compare operand requires sm_30 or higher.
Support for optional destination predicate requires sm_60 or higher.
Examples
// Example of unified mode texturing
// - f4 is required to pad four-element tuple and is ignored
tex.3d.v4.s32.s32  {r1,r2,r3,r4}, [tex_a,{f1,f2,f3,f4}];

// Example of independent mode texturing
tex.1d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,smpl_x,{f1}];

// Example of 1D texture array, independent texturing mode
tex.a1d.v4.s32.s32 {r1,r2,r3,r4}, [tex_a,smpl_x,{idx,s1}];

// Example of 2D texture array, unified texturing mode
// - f3 is required to pad four-element tuple and is ignored
tex.a2d.v4.s32.f32 {r1,r2,r3,r4}, [tex_a,{idx,f1,f2,f3}];

// Example of cubemap array, unified texturing mode
tex.acube.v4.f32.f32 {r0,r1,r2,r3}, [tex_cuarray,{idx,f1,f2,f3}];

// Example of multi-sample texture, unified texturing mode
tex.2dms.v4.s32.s32 {r0,r1,r2,r3}, [tex_ms,{sample,r6,r7,r8}];

// Example of multi-sample texture, independent texturing mode
tex.2dms.v4.s32.s32 {r0,r1,r2,r3}, [tex_ms, smpl_x,{sample,r6,r7,r8}];

// Example of multi-sample texture array, unified texturing mode
tex.a2dms.v4.s32.s32 {r0,r1,r2,r3}, [tex_ams,{idx,sample,r6,r7}];

// Example of texture returning .f16 data
tex.1d.v4.f16.f32  {h1,h2,h3,h4}, [tex_a,smpl_x,{f1}];

// Example of texture returning .f16x2 data
tex.1d.v2.f16x2.f32  {h1,h2}, [tex_a,smpl_x,{f1}];

// Example of 3d texture array access with tex.grad, unified texturing mode
tex.grad.3d.v4.f32.f32 {%f4,%f5,%f6,%f7}, [tex_3d,{%f0,%f0,%f0,%f0}],
                       {fl0,fl1,fl2,fl3}, {fl0,fl1,fl2,fl3};

// Example of cube texture array access with tex.grad, unified texturing mode
tex.grad.cube.v4.f32.f32 {%f4,%f5,%f6,%f7}, [tex_cube,{%f0,%f0,%f0,%f0}],
                         {fl0,fl1,fl2,fl3}, {fl0,fl1,fl2,fl3};

// Example of 1d texture lookup with offset, unified texturing mode
tex.1d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a, {f1}], {r5};

// Example of 2d texture array lookup with offset, unified texturing mode
tex.a2d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,{idx,f1,f2}], {f5,f6};

// Example of 2d mipmap texture lookup with offset, unified texturing mode
tex.level.2d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,{f1,f2}],
                         flvl, {r7, r8};

// Example of 2d depth texture lookup with compare, unified texturing mode
tex.1d.v4.f32.f32  {f1,f2,f3,f4}, [tex_a, {f1}], f0;

// Example of depth 2d texture array lookup with offset, compare
tex.a2d.v4.s32.f32  {f0,f1,f2,f3}, [tex_a,{idx,f4,f5}], {r5,r6}, f6;

// Example of destination predicate use
tex.3d.v4.s32.s32 {r1,r2,r3,r4}|p, [tex_a,{f1,f2,f3,f4}];
Texture fetch of the 4-texel bilerp footprint using a texture coordinate vector. The instruction loads the bilerp footprint from the texture named by operand a at coordinates given by operand c into vector destination d. The texture component fetched for each texel sample is specified by .comp. The four texel samples are placed into destination vector d in counter-clockwise order starting at lower left.
An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.
The optional destination predicate p is set to True if data from texture at specified coordinates is resident in memory, False otherwise. When optional destination predicate p is set to False, data loaded will be all zeros. Memory residency of Texture Data at specified coordinates is dependent on execution environment setup using Driver API calls, prior to kernel launch. Refer to Driver API documentation for more details including any system/implementation specific behavior.
An optional operand f may be specified for depth textures. Depth textures are a special type of texture that holds data from the depth buffer. The depth buffer contains depth information for each pixel. Operand f is a .f32 scalar value that specifies the depth compare value for depth textures. Each element fetched from the texture is compared against the value given in the f operand. If the comparison passes, the result is 1.0; otherwise the result is 0.0. These per-element comparison results are used for the filtering.
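The per-element comparison can be pictured with a small Python model. This is only a sketch under stated assumptions: this section does not name the comparison function, so `compare` is a hypothetical parameter (less-than-or-equal is used purely for illustration), the helper name `depth_compare_footprint` is invented here, and the equal-weight average at the end merely stands in for whatever filtering the hardware applies.

```python
import operator


def depth_compare_footprint(texels, ref, compare=operator.le):
    """Model the depth compare step: 1.0 where compare(ref, texel) passes, else 0.0.

    `compare` is an assumption; the actual comparison is not specified here.
    """
    return [1.0 if compare(ref, texel) else 0.0 for texel in texels]


# Four depth values standing in for a fetched footprint (made-up data).
footprint = [0.25, 0.50, 0.75, 1.00]
results = depth_compare_footprint(footprint, ref=0.6)

# Equal weights are assumed here only to show that the 0.0/1.0 results,
# not the raw depth values, are what gets filtered.
filtered = sum(results) / len(results)
```

With the sample data above, the two texels at or beyond the reference depth pass the comparison and the other two fail.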
A texture base address is assumed to be aligned to a 16 byte boundary, and the address given by the coordinate vector must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined; i.e., the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault.
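As a rough illustration of the alignment rule, the following Python sketch checks natural alignment and shows what masking off the low-order bits would produce. The helper names are hypothetical, a power-of-two access size is assumed, and masking is only one of the permitted outcomes (the instruction may fault instead), so code should not rely on it.

```python
def is_naturally_aligned(address, access_size):
    """True if address is a multiple of the access size (in bytes)."""
    return address % access_size == 0


def mask_low_bits(address, access_size):
    """Round address down to a multiple of access_size.

    Assumes access_size is a power of two. This models only the
    "silently masking" outcome; a fault is equally permitted.
    """
    return address & ~(access_size - 1)


aligned = is_naturally_aligned(0x1008, 8)      # 0x1008 is a multiple of 8
misaligned = is_naturally_aligned(0x100C, 8)   # 0x100C is not
rounded = mask_low_bits(0x100C, 8)             # drops the low three bits
```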
tld4.2d
For 2D textures, operand c specifies coordinates as a two-element, 32-bit floating-point vector.
An optional operand e may be specified. Operand e is a vector of type .v2.s32 that specifies coordinate offset. Offset is applied to coordinates before doing texture fetch. Offset value is in the range of -8 to +7.
tld4.a2d
Texture array selection, followed by tld4 texture fetch of 2d texture. For 2d texture arrays operand c is a four element, 32-bit vector. The first element in operand c is interpreted as an unsigned integer index (.u32) into the texture array, and the next two elements are interpreted as 32-bit floating point coordinates of 2d texture. The fourth element is ignored.
An optional operand e may be specified. Operand e is a vector of type .v2.s32 that specifies coordinate offset. Offset is applied to coordinates before doing texture fetch. Offset value is in the range of -8 to +7.
tld4.cube
For cubemap textures, operand c specifies a four-element vector which comprises three floating-point coordinates (s, t, r) and a fourth padding argument which is ignored.
Cubemap textures are special two-dimensional layered textures consisting of six layers that represent the faces of a cube. All layers in a cubemap are of the same size and are square (i.e., width equals height).
Coordinates (s, t, r) are projected onto one of the six cube faces. The (s, t, r) coordinates can be thought of as a direction vector emanating from the center of the cube. Of the three coordinates (s, t, r), the coordinate of the largest magnitude (the major axis) selects the cube face. Then, the other two coordinates (the minor axes) are divided by the absolute value of the major axis to produce a new (s, t) coordinate pair to lookup into the selected cube face.
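The face selection and projection steps can be sketched in Python as follows. This is a minimal model, not the hardware algorithm: the function name is hypothetical, ties between equal magnitudes are broken by taking the first coordinate (an assumption), and the per-face orientation and any remapping of the projected pair into texel space are not described in this section and are therefore left out.

```python
def select_cube_face(s, t, r):
    """Return (major_axis, sign, minor_pair) for a cubemap direction vector.

    major_axis is "s", "t" or "r"; sign is "+" or "-"; minor_pair holds the
    two remaining coordinates divided by the absolute value of the major axis.
    """
    coords = (s, t, r)
    # The coordinate of largest magnitude is the major axis (first one wins ties).
    major = max(range(3), key=lambda i: abs(coords[i]))
    magnitude = abs(coords[major])
    if magnitude == 0.0:
        raise ValueError("zero direction vector selects no face")
    sign = "+" if coords[major] >= 0 else "-"
    # Project the two minor axes onto the selected face.
    minors = tuple(coords[i] / magnitude for i in range(3) if i != major)
    return "str"[major], sign, minors


face = select_cube_face(0.5, -2.0, 1.0)
```

In this example the t coordinate has the largest magnitude and is negative, so the negative-t face is selected and the remaining coordinates are scaled by 1/2.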
Offset vector operand e is not supported for cubemap textures.
tld4.acube
Cubemap array selection, followed by tld4 texture fetch of cubemap texture. The first element in operand c is interpreted as an unsigned integer index (.u32) into the cubemap texture array, and the remaining three elements are interpreted as floating-point cubemap coordinates (s, t, r), used to lookup in the selected cubemap.
Offset vector operand e is not supported for cubemap texture arrays.
Indirect texture access
Beginning with PTX ISA version 3.1, indirect texture access is supported in unified mode for target architecture sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .texref variable.
PTX ISA Notes
Introduced in PTX ISA version 2.2.
Indirect texture access introduced in PTX ISA version 3.1.
tld4.{a2d,cube,acube} introduced in PTX ISA version 4.3.
Offset vector operand introduced in PTX ISA version 4.3.
Depth compare operand introduced in PTX ISA version 4.3.
Support for optional destination predicate introduced in PTX ISA version 7.1.
Target ISA Notes
tld4 requires sm_20 or higher.
Indirect texture access requires sm_20 or higher.
tld4.{a2d,cube,acube} requires sm_30 or higher.
Offset vector operand requires sm_30 or higher.
Depth compare operand requires sm_30 or higher.
Support for optional destination predicate requires sm_60 or higher.
Examples
// Example of unified mode texturing
tld4.r.2d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,{f1,f2}];

// Example of independent mode texturing
tld4.r.2d.v4.u32.f32  {u1,u2,u3,u4}, [tex_a,smpl_x,{f1,f2}];

// Example of unified mode texturing using offset
tld4.r.2d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a,{f1,f2}], {r5, r6};

// Example of unified mode texturing using compare
tld4.r.2d.v4.f32.f32  {f1,f2,f3,f4}, [tex_a,{f5,f6}], f7;

// Example of optional destination predicate
tld4.r.2d.v4.f32.f32  {f1,f2,f3,f4}|p, [tex_a,{f5,f6}], f7;
Query an attribute of a texture or sampler. Operand a is either a .texref or .samplerref variable, or a .u64 register.
Query
Returns
.width, .height, .depth
value in elements
.channel_data_type
Unsigned integer corresponding to source language’s channel data type enumeration. If the source language combines channel data type and channel order into a single enumeration type, that value is returned for both channel_data_type and channel_order queries.
.channel_order
Unsigned integer corresponding to source language’s channel order enumeration. If the source language combines channel data type and channel order into a single enumeration type, that value is returned for both channel_data_type and channel_order queries.
.normalized_coords
1 (True) or 0 (False).
.force_unnormalized_coords
1 (True) or 0 (False). Defined only for .samplerref variables in independent texture mode. Overrides the normalized_coords field of a .texref variable used with a .samplerref in a tex instruction.
.array_size
For a texture array, number of textures in array, 0 otherwise.
.num_mipmap_levels
For a mipmapped texture, number of levels of details (LOD), 0 otherwise.
.num_samples
For a multi-sample texture, number of samples, 0 otherwise.
Texture attributes are queried by supplying a .texref argument to txq. In unified mode, sampler attributes are also accessed via a .texref argument, and in independent mode sampler attributes are accessed via a separate .samplerref argument.
txq.level
txq.level requires an additional 32-bit integer argument, lod, which specifies the LOD; the requested attribute is queried for the specified LOD.
Indirect texture access
Beginning with PTX ISA version 3.1, indirect texture access is supported in unified mode for target architecture sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .texref variable.
PTX ISA Notes
Introduced in PTX ISA version 1.5.
Channel data type and channel order queries were added in PTX ISA version 2.1.
The .force_unnormalized_coords query was added in PTX ISA version 2.2.
Indirect texture access introduced in PTX ISA version 3.1.
.array_size, .num_mipmap_levels, and .num_samples queries were added in PTX ISA version 4.1.
txq.level introduced in PTX ISA version 4.3.
Target ISA Notes
Supported on all target architectures.
Indirect texture access requires sm_20 or higher.
Querying the number of mipmap levels requires sm_20 or higher.
Querying the number of samples requires sm_30 or higher.
Query whether a register points to an opaque variable of a specified type.
Syntax
istypep.type   p, a;  // result is .pred

.type = { .texref, .samplerref, .surfref };
Description
Write predicate register p with 1 if register a points to an opaque variable of the specified type, and with 0 otherwise. Destination p has type .pred; the source address operand must be of type .u64.
Load from surface memory using a surface coordinate vector. The instruction loads data from the surface named by operand a at coordinates given by operand b into destination d. Operand a is a .surfref variable or .u64 register. Operand b is a scalar or singleton tuple for 1d surfaces; is a two-element vector for 2d surfaces; and is a four-element vector for 3d surfaces, where the fourth element is ignored. Coordinate elements are of type .s32.
suld.b performs an unformatted load of binary data. The lowest dimension coordinate represents a byte offset into the surface and is not scaled, and the size of the data transfer matches the size of destination operand d.
suld.b.{a1d,a2d}
Surface layer selection, followed by a load from the selected surface. The instruction first selects a surface layer from the surface array named by operand a using the index given by the first element of the array coordinate vector b. The instruction then loads data from the selected surface at coordinates given by the remaining elements of operand b into destination d. Operand a is a .surfref variable or .u64 register. Operand b is a bit-size type vector or tuple containing an index into the array of surfaces followed by coordinates within the selected surface, as follows:
For 1d surface arrays, operand b has type .v2.b32. The first element is interpreted as an unsigned integer index (.u32) into the surface array, and the second element is interpreted as a 1d surface coordinate of type .s32.
For 2d surface arrays, operand b has type .v4.b32. The first element is interpreted as an unsigned integer index (.u32) into the surface array, and the next two elements are interpreted as 2d surface coordinates of type .s32. The fourth element is ignored.
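The interpretation of the bit-size coordinate vector for surface arrays can be sketched in Python as below. The function name and the use of plain integers to stand for .b32 register contents are assumptions made for illustration only.

```python
import struct


def decode_surface_array_coords(geom, words):
    """Split a .b32 coordinate vector into (layer_index, coordinates).

    geom is "a1d" or "a2d"; words holds the raw 32-bit element values.
    The first element is read as .u32, the coordinates as .s32.
    """
    def as_s32(word):
        # Reinterpret the low 32 bits as a signed integer.
        return struct.unpack("<i", struct.pack("<I", word & 0xFFFFFFFF))[0]

    expected = {"a1d": 2, "a2d": 4}
    if geom not in expected:
        raise ValueError("unsupported geometry: " + geom)
    if len(words) != expected[geom]:
        raise ValueError("wrong vector length for " + geom)

    index = words[0] & 0xFFFFFFFF
    if geom == "a1d":
        return index, (as_s32(words[1]),)
    # For a2d the fourth element is padding and is ignored.
    return index, (as_s32(words[1]), as_s32(words[2]))


layer_1d = decode_surface_array_coords("a1d", [3, 0xFFFFFFFC])
layer_2d = decode_surface_array_coords("a2d", [1, 16, 0xFFFFFFFF, 1234])
```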
A surface base address is assumed to be aligned to a 16 byte boundary, and the address given by the coordinate vector must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined; i.e., the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault.
The .clamp field specifies how to handle out-of-bounds addresses:
.trap
causes an execution trap on out-of-bounds addresses
.clamp
loads data at the nearest surface location (sized appropriately)
.zero
loads zero for out-of-bounds addresses
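A small Python model may help contrast the three modes for a 1d surface. It is a sketch under assumptions: the surface is represented as a `bytes` object, the function name is hypothetical, and "nearest surface location" is modeled by clamping the byte offset into the valid range, which is one plausible reading rather than a statement of the hardware behavior.

```python
def suld_b_1d(surface, x, size, mode):
    """Model an unformatted 1d surface load of `size` bytes at byte offset x.

    mode is "trap", "clamp" or "zero" and only matters when the access
    falls outside the surface.
    """
    if 0 <= x and x + size <= len(surface):
        return bytes(surface[x:x + size])
    if mode == "trap":
        raise RuntimeError("execution trap: out-of-bounds surface access")
    if mode == "clamp":
        # Assumption: clamp the offset to the nearest in-bounds location.
        nearest = min(max(x, 0), len(surface) - size)
        return bytes(surface[nearest:nearest + size])
    if mode == "zero":
        return bytes(size)
    raise ValueError("unknown clamp mode: " + mode)


surface = bytes(range(16))                        # 16-byte toy surface
in_bounds = suld_b_1d(surface, 4, 4, "trap")      # normal load
clamped = suld_b_1d(surface, 20, 4, "clamp")      # reads the last 4 bytes
zeroed = suld_b_1d(surface, -4, 4, "zero")        # all zero bytes
```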
Indirect surface access
Beginning with PTX ISA version 3.1, indirect surface access is supported for target architecture sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .surfref variable.
PTX ISA Notes
suld.b.trap introduced in PTX ISA version 1.5.
Additional clamp modifiers and cache operations introduced in PTX ISA version 2.0.
suld.b.3d and suld.b.{a1d,a2d} introduced in PTX ISA version 3.0.
Indirect surface access introduced in PTX ISA version 3.1.
Target ISA Notes
suld.b supported on all target architectures.
sm_1x targets support only the .trap clamping modifier.
Store to surface memory using a surface coordinate vector. The instruction stores data from operand c to the surface named by operand a at coordinates given by operand b. Operand a is a .surfref variable or .u64 register. Operand b is a scalar or singleton tuple for 1d surfaces; is a two-element vector for 2d surfaces; and is a four-element vector for 3d surfaces, where the fourth element is ignored. Coordinate elements are of type .s32.
sust.b performs an unformatted store of binary data. The lowest dimension coordinate represents a byte offset into the surface and is not scaled. The size of the data transfer matches the size of source operand c.
sust.p performs a formatted store of a vector of 32-bit data values to a surface sample. The source vector elements are interpreted left-to-right as R, G, B, and A surface components. These elements are written to the corresponding surface sample components. Source elements that do not occur in the surface sample are ignored. Surface sample components that do not occur in the source vector will be written with an unpredictable value. The lowest dimension coordinate represents a sample offset rather than a byte offset.
The source data interpretation is based on the surface sample format as follows: If the surface format contains UNORM, SNORM, or FLOAT data, then .f32 is assumed; if the surface format contains UINT data, then .u32 is assumed; if the surface format contains SINT data, then .s32 is assumed. The source data is then converted from this type to the surface sample format.
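The two sust.p rules (how source elements map to components, and which source type is assumed) can be sketched in Python. Everything named here is an assumption for illustration: the helper names are invented, the format strings such as "R32_UINT" are made up and only need to contain one of the keywords from the text, and `None` is used as a placeholder for a component whose written value is unpredictable.

```python
def assumed_source_type(surface_format):
    """Return the PTX type assumed for the source data of a formatted store."""
    rules = (("UNORM", ".f32"), ("SNORM", ".f32"), ("FLOAT", ".f32"),
             ("UINT", ".u32"), ("SINT", ".s32"))
    for keyword, ptx_type in rules:
        if keyword in surface_format:
            return ptx_type
    raise ValueError("unrecognized surface format: " + surface_format)


def components_written(source, surface_components):
    """Map source elements (read left to right as R, G, B, A) to a sample.

    Source elements with no matching component are ignored; components
    with no source element map to None (unpredictable value).
    """
    named = dict(zip("RGBA", source))
    return {component: named.get(component) for component in surface_components}


two_channel = components_written([1.0, 0.5, 0.25, 1.0], "RG")
short_source = components_written([7, 9], "RGBA")
```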
sust.b.{a1d,a2d}
Surface layer selection, followed by an unformatted store to the selected surface. The instruction first selects a surface layer from the surface array named by operand a using the index given by the first element of the array coordinate vector b. The instruction then stores the data in operand c to the selected surface at coordinates given by the remaining elements of operand b. Operand a is a .surfref variable or .u64 register. Operand b is a bit-size type vector or tuple containing an index into the array of surfaces followed by coordinates within the selected surface, as follows:
For 1d surface arrays, operand b has type .v2.b32. The first element is interpreted as an unsigned integer index (.u32) into the surface array, and the second element is interpreted as a 1d surface coordinate of type .s32.
For 2d surface arrays, operand b has type .v4.b32. The first element is interpreted as an unsigned integer index (.u32) into the surface array, and the next two elements are interpreted as 2d surface coordinates of type .s32. The fourth element is ignored.
A surface base address is assumed to be aligned to a 16 byte boundary, and the address given by the coordinate vector must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined; i.e., the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault.
The .clamp field specifies how to handle out-of-bounds addresses:
.trap
causes an execution trap on out-of-bounds addresses
.clamp
stores data at the nearest surface location (sized appropriately)
.zero
drops stores to out-of-bounds addresses
Indirect surface access
Beginning with PTX ISA version 3.1, indirect surface access is supported for target architecture sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .surfref variable.
PTX ISA Notes
sust.b.trap introduced in PTX ISA version 1.5. sust.p, additional clamp modifiers, and cache operations introduced in PTX ISA version 2.0.
sust.b.3d and sust.b.{a1d,a2d} introduced in PTX ISA version 3.0.
Indirect surface access introduced in PTX ISA version 3.1.
Target ISA Notes
sust.b supported on all target architectures.
sm_1x targets support only the .trap clamping modifier.
Reduction to surface memory using a surface coordinate vector. The instruction performs a reduction operation with data from operand c to the surface named by operand a at coordinates given by operand b. Operand a is a .surfref variable or .u64 register. Operand b is a scalar or singleton tuple for 1d surfaces; is a two-element vector for 2d surfaces; and is a four-element vector for 3d surfaces, where the fourth element is ignored. Coordinate elements are of type .s32.
sured.b performs an unformatted reduction on .u32, .s32, .b32, .u64, or .s64 data. The lowest dimension coordinate represents a byte offset into the surface and is not scaled. Operation add applies to .u32, .u64, and .s32 types; min and max apply to .u32, .s32, .u64 and .s64 types; operations and and or apply to .b32 type.
sured.p performs a reduction on sample-addressed data. The lowest dimension coordinate represents a sample offset rather than a byte offset. The instruction type .b64 is restricted to min and max operations. For type .b32, the data is interpreted as .u32 or .s32 based on the surface sample format as follows: if the surface format contains UINT data, then .u32 is assumed; if the surface format contains SINT data, then .s32 is assumed. For type .b64, if the surface format contains UINT data, then .u64 is assumed; if the surface format contains SINT data, then .s64 is assumed.
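The operation and type combinations described above can be written down as a lookup, sketched here in Python. The names `SURED_B_TYPES`, `sured_b_allowed` and `sured_p_effective_type` are hypothetical, and the surface format strings are made-up examples that only need to contain UINT or SINT.

```python
# Types each sured.b operation applies to, as listed in the text above.
SURED_B_TYPES = {
    "add": {".u32", ".u64", ".s32"},
    "min": {".u32", ".s32", ".u64", ".s64"},
    "max": {".u32", ".s32", ".u64", ".s64"},
    "and": {".b32"},
    "or": {".b32"},
}


def sured_b_allowed(op, ptx_type):
    """True if the text lists ptx_type as valid for this sured.b operation."""
    return ptx_type in SURED_B_TYPES.get(op, set())


def sured_p_effective_type(op, ptx_type, surface_format):
    """Resolve .b32/.b64 to a signed or unsigned type from the sample format."""
    if ptx_type not in (".b32", ".b64"):
        raise ValueError("sured.p type must be .b32 or .b64")
    if ptx_type == ".b64" and op not in ("min", "max"):
        raise ValueError(".b64 is restricted to min and max")
    width = ptx_type[2:]
    if "UINT" in surface_format:
        return ".u" + width
    if "SINT" in surface_format:
        return ".s" + width
    raise ValueError("surface format must contain UINT or SINT data")
```

For example, `sured_b_allowed("add", ".s64")` is false because add is not listed for .s64, while `sured_p_effective_type("min", ".b64", "R64_SINT")` resolves to `.s64`.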
A surface base address is assumed to be aligned to a 16 byte boundary, and the address given by the coordinate vector must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined; i.e., the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault.
The .clamp field specifies how to handle out-of-bounds addresses:
.trap
causes an execution trap on out-of-bounds addresses
.clamp
stores data at the nearest surface location (sized appropriately)
.zero
drops stores to out-of-bounds addresses
Indirect surface access
Beginning with PTX ISA version 3.1, indirect surface access is supported for target architecture sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .surfref variable.
PTX ISA Notes
Introduced in PTX ISA version 2.0.
Indirect surface access introduced in PTX ISA version 3.1.
.u64/.s64/.b64 types with .min/.max operations introduced in PTX ISA version 8.1.
Target ISA Notes
sured requires sm_20 or higher.
Indirect surface access requires sm_20 or higher.
.u64/.s64/.b64 types with .min/.max operations require sm_50 or higher.
Query an attribute of a surface. Operand a is a .surfref variable or a .u64 register.
Query
Returns
.width, .height, .depth
value in elements
.channel_data_type
Unsigned integer corresponding to source language’s channel data type enumeration. If the source language combines channel data type and channel order into a single enumeration type, that value is returned for both channel_data_type and channel_order queries.
.channel_order
Unsigned integer corresponding to source language’s channel order enumeration. If the source language combines channel data type and channel order into a single enumeration type, that value is returned for both channel_data_type and channel_order queries.
.array_size
For a surface array, number of surfaces in array, 0 otherwise.
.memory_layout
1 for surface with linear memory layout; 0 otherwise
Indirect surface access
Beginning with PTX ISA version 3.1, indirect surface access is supported for target architecture sm_20 or higher. In indirect access, operand a is a .u64 register holding the address of a .surfref variable.
PTX ISA Notes
Introduced in PTX ISA version 1.5.
Channel data type and channel order queries added in PTX ISA version 2.1.
Indirect surface access introduced in PTX ISA version 3.1.
The .array_size query was added in PTX ISA version 4.1.
The .memory_layout query was added in PTX ISA version 4.2.
The curly braces create a group of instructions, used primarily for defining a function body. The curly braces also provide a mechanism for determining the scope of a variable: any variable declared within a scope is not available outside the scope.
@p   bra{.uni}  tgt;           // tgt is a label
     bra{.uni}  tgt;           // unconditional branch
Description
Continue execution at the target. Conditional branches are specified by using a guard predicate. The branch target must be a label.
bra.uni is guaranteed to be non-divergent, i.e. all active threads in a warp that are currently executing this instruction have identical values for the guard predicate and branch target.
Semantics
if (p) {
    pc = tgt;
}
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Unimplemented indirect branch introduced in PTX ISA version 2.1 has been removed from the spec.
Index into a list of possible destination labels, and continue execution from the chosen label. Conditional branches are specified by using a guard predicate.
brx.idx.uni guarantees that the branch is non-divergent, i.e. all active threads in a warp that are currently executing this instruction have identical values for the guard predicate and the index argument.
The index operand is a .u32 register. The tlist operand must be the label of a .branchtargets directive. It is accessed as a zero-based sequence using index. Behaviour is undefined if the value of index is greater than or equal to the length of tlist.
The .branchtargets directive must be defined in the local function scope before it is used. It must refer to labels within the current function.
Semantics
if (p) {
    if (index < length(tlist)) {
        pc = tlist[index];
    } else {
        pc = undefined;
    }
}
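The same semantics can be modeled in a few lines of Python. This is a sketch only: the function name is hypothetical, labels are represented as strings, `fallthrough_pc` stands for the next instruction when the guard predicate is false, and raising an exception is merely a convenient way to flag the undefined case, not a description of what hardware does.

```python
def brx_idx_next_pc(p, index, tlist, fallthrough_pc):
    """Return the next program counter for a guarded brx.idx.

    index models a .u32 value, so it is assumed to be non-negative.
    """
    if not p:
        # Guard predicate is false: the branch is not taken.
        return fallthrough_pc
    if index < len(tlist):
        return tlist[index]          # zero-based lookup into the target list
    raise RuntimeError("undefined behaviour: index out of range of tlist")


targets = ["L0", "L1", "L2"]         # stands in for a .branchtargets list
taken = brx_idx_next_pc(True, 2, targets, "next")
not_taken = brx_idx_next_pc(False, 2, targets, "next")
```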
// direct call to named function, func is a symbol
call{.uni} (ret-param), func, (param-list);
call{.uni} func, (param-list);
call{.uni} func;

// indirect call via pointer, with full list of call targets
call{.uni} (ret-param), fptr, (param-list), flist;
call{.uni} fptr, (param-list), flist;
call{.uni} fptr, flist;

// indirect call via pointer, with no knowledge of call targets
call{.uni} (ret-param), fptr, (param-list), fproto;
call{.uni} fptr, (param-list), fproto;
call{.uni} fptr, fproto;
Description
The call instruction stores the address of the next instruction, so execution can resume at that point after executing a ret instruction. A call is assumed to be divergent unless the .uni suffix is present. The .uni suffix indicates that the call is guaranteed to be non-divergent, i.e. all active threads in a warp that are currently executing this instruction have identical values for the guard predicate and call target.
For direct calls, the called location func must be a symbolic function name; for indirect calls, the called location fptr must be an address of a function held in a register. Input arguments and return values are optional. Arguments may be registers, immediate constants, or variables in .param space. Arguments are pass-by-value.
Indirect calls require an additional operand, flist or fproto, to communicate the list of potential call targets or the common function prototype of all call targets, respectively. In the first case, flist gives a complete list of potential call targets and the optimizing backend is free to optimize the calling convention. In the second case, where the complete list of potential call targets may not be known, the common function prototype is given and the call must obey the ABI’s calling convention.
The flist operand is either the name of an array (call table) initialized to a list of function names; or a label associated with a .calltargets directive, which declares a list of potential call targets. In both cases the fptr register holds the address of a function listed in the call table or .calltargets list, and the call operands are type-checked against the type signature of the functions indicated by flist.
The fproto operand is the name of a label associated with a .callprototype directive. This operand is used when a complete list of potential targets is not known. The call operands are type-checked against the prototype, and code generation will follow the ABI calling convention. If a function that doesn’t match the prototype is called, the behavior is undefined.
Call tables may be declared at module scope or local scope, in either the constant or global state space. The .calltargets and .callprototype directives must be declared within a function body. All functions must be declared prior to being referenced in a call table initializer or .calltargets directive.
PTX ISA Notes
Direct call introduced in PTX ISA version 1.0. Indirect call introduced in PTX ISA version 2.1.
Target ISA Notes
Direct call supported on all target architectures. Indirect call requires sm_20 or higher.
Examples
// examples of direct call
    call     init;    // call function 'init'
    call.uni g, (a);  // call function 'g' with parameter 'a'
@p  call     (d), h, (a, b);  // return value into register d

// call-via-pointer using jump table
.func (.reg .u32 rv) foo (.reg .u32 a, .reg .u32 b) ...
.func (.reg .u32 rv) bar (.reg .u32 a, .reg .u32 b) ...
.func (.reg .u32 rv) baz (.reg .u32 a, .reg .u32 b) ...

.global .u32 jmptbl[5] = { foo, bar, baz };
      ...
@p    ld.global.u32  %r0, [jmptbl+4];
@p    ld.global.u32  %r0, [jmptbl+8];
      call  (retval), %r0, (x, y), jmptbl;

// call-via-pointer using .calltargets directive
.func (.reg .u32 rv) foo (.reg .u32 a, .reg .u32 b) ...
.func (.reg .u32 rv) bar (.reg .u32 a, .reg .u32 b) ...
.func (.reg .u32 rv) baz (.reg .u32 a, .reg .u32 b) ...
      ...
@p    mov.u32  %r0, foo;
@q    mov.u32  %r0, baz;
Ftgt: .calltargets foo, bar, baz;
      call  (retval), %r0, (x, y), Ftgt;

// call-via-pointer using .callprototype directive
.func dispatch (.reg .u32 fptr, .reg .u32 idx)
{
...
Fproto: .callprototype _ (.param .u32 _, .param .u32 _);
      call  %fptr, (x, y), Fproto;
...
Return execution to caller’s environment. A divergent return suspends threads until all threads are ready to return to the caller. This allows multiple divergent ret instructions.
A ret is assumed to be divergent unless the .uni suffix is present, indicating that the return is guaranteed to be non-divergent.
Any values returned from a function should be moved into the return parameter variables prior to executing the ret instruction.
A return instruction executed in a top-level entry routine will terminate thread execution.
As threads exit, barriers waiting on all threads are checked to see if the exiting threads are the only threads that have not yet made it to a barrier{.cta} for all threads in the CTA or to a barrier.cluster for all threads in the cluster. If the exiting threads are holding up the barrier, the barrier is released.
Performs barrier synchronization and communication within a CTA. Each CTA instance has sixteen barriers numbered 0..15.
barrier{.cta} instructions can be used by the threads within the CTA for synchronization and communication.
Operands a, b, and d have type .u32; operands p and c are predicates. Source operand a specifies a logical barrier resource as an immediate constant or register with value 0 through 15. Operand b specifies the number of threads participating in the barrier. If no thread count is specified, all threads in the CTA participate in the barrier. When specifying a thread count, the value must be a multiple of the warp size. Note that a non-zero thread count is required for barrier{.cta}.arrive.
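The operand constraints above can be collected into a small checker, sketched here in Python. The function name is hypothetical, `b=None` is used to model an omitted thread count, and the default warp size of 32 is an assumption made for the example rather than something stated in this section.

```python
def barrier_operand_problems(form, a, b=None, warp_size=32):
    """List constraint violations for a barrier{.cta}.<form> instruction.

    form is "sync", "arrive" or "red"; a is the barrier number; b is the
    optional thread count (None when omitted).
    """
    problems = []
    if not 0 <= a <= 15:
        problems.append("barrier number must be in the range 0 through 15")
    if b is None:
        if form == "arrive":
            problems.append("arrive requires a non-zero thread count")
    else:
        if b % warp_size != 0:
            problems.append("thread count must be a multiple of the warp size")
        if form == "arrive" and b == 0:
            problems.append("arrive requires a non-zero thread count")
    return problems


ok_sync = barrier_operand_problems("sync", 0)            # all CTA threads
ok_arrive = barrier_operand_problems("arrive", 1, 64)    # two warps of 32
bad_arrive = barrier_operand_problems("arrive", 1)       # count omitted
bad_sync = barrier_operand_problems("sync", 16, 48)      # two violations
```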
Depending on operand b, either the specified number of threads (in multiples of warp size) or all threads in the CTA participate in the barrier{.cta} instruction. The barrier{.cta} instructions signal the arrival of the executing threads at the named barrier.
The barrier{.cta} instruction causes the executing thread to wait for all non-exited threads from its warp and marks the warp’s arrival at the barrier. In addition to signaling its arrival at the barrier, the barrier{.cta}.red and barrier{.cta}.sync instructions cause the executing thread to wait for non-exited threads of all other warps participating in the barrier to arrive. barrier{.cta}.arrive does not cause the executing thread to wait for threads of other participating warps.
When a barrier completes, the waiting threads are restarted without delay, and the barrier is reinitialized so that it can be immediately reused.
The barrier{.cta}.sync or barrier{.cta}.red or barrier{.cta}.arrive instruction guarantees that when the barrier completes, prior memory accesses requested by this thread are performed relative to all threads participating in the barrier. The barrier{.cta}.sync and barrier{.cta}.red instructions further guarantee that no new memory access is requested by this thread before the barrier completes.
A memory read (e.g., by ld or atom) has been performed when the value read has been transmitted from memory and cannot be modified by another thread participating in the barrier. A memory write (e.g., by st, red or atom) has been performed when the value written has become visible to other threads participating in the barrier, that is, when the previous value can no longer be read.
barrier{.cta}.red performs a reduction operation across threads. The c predicate (or its complement) from all threads in the CTA is combined using the specified reduction operator. Once the barrier count is reached, the final value is written to the destination register in all threads waiting at the barrier.
The reduction operations for barrier{.cta}.red are population-count (.popc), all-threads-True (.and), and any-thread-True (.or). The result of .popc is the number of threads with a True predicate, while .and and .or indicate if all the threads had a True predicate or if any of the threads had a True predicate.
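The three reduction operators can be modeled in Python as follows. This is a sketch of the value computed only; it ignores synchronization entirely, the function name is hypothetical, and the list of booleans stands in for the c predicates contributed by the participating threads.

```python
def barrier_red(op, predicates):
    """Combine per-thread predicates the way .popc, .and and .or are described."""
    if op == "popc":
        return sum(1 for p in predicates if p)   # number of True predicates
    if op == "and":
        return all(predicates)                   # True only if every one is True
    if op == "or":
        return any(predicates)                   # True if at least one is True
    raise ValueError("unknown reduction operator: " + op)


thread_predicates = [True, False, True, True]
count = barrier_red("popc", thread_predicates)
everyone = barrier_red("and", thread_predicates)
anyone = barrier_red("or", thread_predicates)
```

In this model every waiting thread would receive the same result, matching the statement that the final value is written to the destination register in all threads waiting at the barrier.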
The barrier{.cta} instruction has an optional .aligned modifier. When specified, it indicates that all threads in the CTA will execute the same barrier{.cta} instruction. In conditionally executed code, an aligned barrier{.cta} instruction should only be used if it is known that all threads in the CTA evaluate the condition identically; otherwise the behavior is undefined.
Different warps may execute different forms of the barrier{.cta} instruction using the same barrier name and thread count. One example mixes barrier{.cta}.sync and barrier{.cta}.arrive to implement producer/consumer models. The producer threads execute barrier{.cta}.arrive to announce their arrival at the barrier and continue execution without delay to produce the next value, while the consumer threads execute barrier{.cta}.sync to wait for a resource to be produced. The roles are then reversed, using a different barrier, where the producer threads execute a barrier{.cta}.sync to wait for a resource to be consumed, while the consumer threads announce that the resource has been consumed with barrier{.cta}.arrive. Care must be taken to keep a warp from executing more barrier{.cta} instructions than intended (barrier{.cta}.arrive followed by any other barrier{.cta} instruction to the same barrier) prior to the reset of the barrier. barrier{.cta}.red should not be intermixed with barrier{.cta}.sync or barrier{.cta}.arrive using the same active barrier. Execution in this case is unpredictable.
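The two-barrier handshake described above can be mimicked on the host with ordinary threads. The following Python sketch is only an analogy: the class NamedBarrier, its arrive and sync methods, and the function run_pipeline are hypothetical names, one Python thread stands in for each group of cooperating GPU threads, and the expected arrival count is 2 rather than a CTA thread count.

```python
import threading


class NamedBarrier:
    """Toy barrier: completes and resets once `expected` arrivals are counted."""

    def __init__(self, expected):
        self.expected = expected
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def _arrive_locked(self):
        self.count += 1
        if self.count == self.expected:
            self.count = 0             # reinitialize so the barrier can be reused
            self.generation += 1
            self.cond.notify_all()     # restart the waiting threads
            return True
        return False

    def arrive(self):
        """Signal arrival and continue without waiting."""
        with self.cond:
            self._arrive_locked()

    def sync(self):
        """Signal arrival and wait until the barrier completes."""
        with self.cond:
            gen = self.generation
            if not self._arrive_locked():
                self.cond.wait_for(lambda: self.generation != gen)


def run_pipeline(items):
    full, empty = NamedBarrier(2), NamedBarrier(2)
    slot = [None]                      # stands in for the shared-memory location
    received = []

    def producer():
        for item in items:
            slot[0] = item             # deposit the value
            full.arrive()              # announce it, do not wait
            empty.sync()               # wait until the value has been consumed

    def consumer():
        for _ in items:
            full.sync()                # wait for a value to be produced
            received.append(slot[0])   # read it
            empty.arrive()             # announce consumption, do not wait

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(run_pipeline(["a", "b", "c"]))   # ['a', 'b', 'c']
```

Each side arrives at a given barrier exactly once per round, which corresponds to the caution above about not executing more barrier instructions than intended before the barrier resets.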
The optional .cta qualifier simply indicates CTA-level applicability of the barrier and it doesn’t change the semantics of the instruction.
bar{.cta}.sync is equivalent to barrier{.cta}.sync.aligned. bar{.cta}.arrive is equivalent to barrier{.cta}.arrive.aligned. bar{.cta}.red is equivalent to barrier{.cta}.red.aligned.
Note
For .target sm_6x or below,
The barrier{.cta} instruction without the .aligned modifier is equivalent to the .aligned variant and has the same restrictions as the .aligned variant.
All threads in the warp (except for those that have exited) must execute the barrier{.cta} instruction in convergence.
PTX ISA Notes
bar.sync without a thread count introduced in PTX ISA version 1.0.
Register operands, thread count, and bar.{arrive,red} introduced in PTX ISA version 2.0.
barrier instruction introduced in PTX ISA version 6.0.
.cta qualifier introduced in PTX ISA version 7.8.
Target ISA Notes
Register operands, thread count, and bar{.cta}.{arrive,red} require sm_20 or higher.
Only bar{.cta}.sync with an immediate barrier number is supported for sm_1x targets.
barrier{.cta} instruction requires sm_30 or higher.
Examples
// Use bar.sync to arrive at a pre-computed barrier number and
// wait for all threads in CTA to also arrive:
    st.shared [r0],r1;  // write my result to shared memory
    bar.cta.sync 1;     // arrive, wait for others to arrive
    ld.shared r2,[r3];  // use shared results from other threads

// Use bar.sync to arrive at a pre-computed barrier number and
// wait for fixed number of cooperating threads to arrive:
    #define CNT1 (8*12) // Number of cooperating threads

    st.shared [r0],r1;     // write my result to shared memory
    bar.cta.sync 1, CNT1;  // arrive, wait for others to arrive
    ld.shared r2,[r3];     // use shared results from other threads

// Use bar.red.and to compare results across the entire CTA:
    setp.eq.u32 p,r1,r2;         // p is True if r1==r2
    bar.cta.red.and.pred r3,1,p; // r3=AND(p) forall threads in CTA

// Use bar.red.popc to compute the size of a group of threads
// that have a specific condition True:
    setp.eq.u32 p,r1,r2;         // p is True if r1==r2
    bar.cta.red.popc.u32 r3,1,p; // r3=SUM(p) forall threads in CTA

// Examples of barrier.cta.sync
    st.shared [r0],r1;
    barrier.cta.sync 0;
    ld.shared r1, [r0];

/* Producer/consumer model. The producer deposits a value in
 * shared memory, signals that it is complete but does not wait
 * using bar.arrive, and begins fetching more data from memory.
 * Once the data returns from memory, the producer must wait
 * until the consumer signals that it has read the value from
 * the shared memory location. In the meantime, a consumer
 * thread waits until the data is stored by the producer, reads
 * it, and then signals that it is done (without waiting).
 */
    // Producer code places produced value in shared memory.
    st.shared [r0],r1;
    bar.arrive 0,64;
    ld.global r1,[r2];
    bar.sync 1,64;
    ...

    // Consumer code, reads value from shared memory
    bar.sync 0,64;
    ld.shared r1,[r0];
    bar.arrive 1,64;
    ...
bar.warp.sync will cause the executing thread to wait until all threads corresponding to membermask have executed a bar.warp.sync with the same membermask value before resuming execution.
Operand membermask specifies a 32-bit integer which is a mask indicating threads participating in the barrier, where the bit position corresponds to the thread’s laneid.
The behavior of bar.warp.sync is undefined if the executing thread is not in the membermask.
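The relationship between membermask bits and lane ids can be made concrete with a short host-side sketch in Python. The helper names lanes_in_mask and is_participant are hypothetical and exist only for this illustration.

```python
WARP_SIZE = 32  # membermask is a 32-bit integer, one bit per lane


def lanes_in_mask(membermask):
    """Lane ids whose bit is set in membermask (bit position == laneid)."""
    return [lane for lane in range(WARP_SIZE) if (membermask >> lane) & 1]


def is_participant(membermask, laneid):
    """True if the thread with this laneid is named in membermask."""
    return bool((membermask >> laneid) & 1)


print(lanes_in_mask(0x0000000F))   # [0, 1, 2, 3]
print(is_participant(0x5, 1))      # False: lane 1 must not execute with mask 0x5
```

A thread for which is_participant is false falls under the undefined-behavior case stated above.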
bar.warp.sync also guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute bar.warp.sync, and then safely read values stored by other threads in the warp.
Note
For .target sm_6x or below, all threads in membermask must execute the same bar.warp.sync instruction in convergence, and only threads belonging to some membermask can be active when the bar.warp.sync instruction is executed. Otherwise, the behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 6.0.
Target ISA Notes
Requires sm_30 or higher.
Examples
st.shared.u32 [r0],r1;         // write my result to shared memory
bar.warp.sync  0xffffffff;     // arrive, wait for others to arrive
ld.shared.u32 r2,[r3];         // read results written by other threads
Performs barrier synchronization and communication within a cluster.
barrier.cluster instructions can be used by the threads within the cluster for synchronization and communication.
The barrier.cluster.arrive instruction marks the warp’s arrival at the barrier without causing the executing thread to wait for threads of other participating warps.
The barrier.cluster.wait instruction causes the executing thread to wait for all non-exited threads of the cluster to perform barrier.cluster.arrive.
In addition, barrier.cluster instructions cause the executing thread to wait for all non-exited threads from its warp.
When all non-exited threads in the cluster have executed barrier.cluster.arrive, the barrier completes and is automatically reinitialized. After using barrier.cluster.wait to detect completion of the barrier, a thread may immediately arrive at the barrier once again. Each thread must arrive at the barrier only once before the barrier completes.
The barrier.cluster.wait instruction guarantees that when it completes execution, memory accesses (except asynchronous operations) requested, in program order, prior to the preceding barrier.cluster.arrive by all threads in the cluster are complete and visible to the executing thread.
There is no memory ordering and visibility guarantee for memory accesses requested by the executing thread, in program order, after barrier.cluster.arrive and prior to barrier.cluster.wait.
The optional .relaxed qualifier on barrier.cluster.arrive specifies that there are no memory ordering and visibility guarantees provided for the memory accesses performed prior to barrier.cluster.arrive.
The optional .sem and .acquire qualifiers on instructions barrier.cluster.arrive and barrier.cluster.wait specify the memory synchronization as described in the Memory Consistency Model. If the optional .sem qualifier is absent for barrier.cluster.arrive, .release is assumed by default. If the optional .acquire qualifier is absent for barrier.cluster.wait, .acquire is assumed by default.
The optional .aligned qualifier indicates that all threads in the warp must execute the same barrier.cluster instruction. In conditionally executed code, an aligned barrier.cluster instruction should only be used if it is known that all threads in the warp evaluate the condition identically; otherwise the behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 7.8.
Support for .acquire, .relaxed, .release qualifiers introduced in PTX ISA version 8.0.
Target ISA Notes
Requires sm_90 or higher.
Examples
// use of arrive followed by wait
ld.shared::cluster.u32 r0, [addr];
barrier.cluster.arrive.aligned;
...
barrier.cluster.wait.aligned;
st.shared::cluster.u32 [addr], r1;

// use memory fence prior to arrive for relaxed barrier
@cta0 ld.shared::cluster.u32 r0, [addr];
fence.cluster.acq_rel;
barrier.cluster.arrive.relaxed.aligned;
...
barrier.cluster.wait.aligned;
@cta1 st.shared::cluster.u32 [addr], r1;
The membar instruction guarantees that prior memory accesses requested by this thread (ld, st, atom and red instructions) are performed at the specified level, before later memory operations requested by this thread following the membar instruction. The level qualifier specifies the set of threads that may observe the ordering effect of this operation.
A memory read (e.g., by ld or atom) has been performed when the value read has been transmitted from memory and cannot be modified by another thread at the indicated level. A memory write (e.g., by st, red or atom) has been performed when the value written has become visible to other threads at the specified level, that is, when the previous value can no longer be read.
The fence instruction establishes an ordering between memory accesses requested by this thread (ld, st, atom and red instructions) as described in the Memory Consistency Model. The scope qualifier specifies the set of threads that may observe the ordering effect of this operation.
fence.acq_rel is a light-weight fence that is sufficient for memory synchronization in most programs. Instances of fence.acq_rel synchronize when combined with additional memory operations as described in acquire and release patterns in the Memory Consistency Model. If the optional .sem qualifier is absent, .acq_rel is assumed by default.
fence.sc is a slower fence that can restore sequential consistency when used in sufficient places, at the cost of performance. Instances of fence.sc with sufficient scope always synchronize by forming a total order per scope, determined at runtime. This total order can be constrained further by other synchronization in the program.
Qualifiers .op_restrict and .sync_restrict restrict the class of memory operations for which the fence instruction provides the memory ordering guarantees. When .op_restrict is .mbarrier_init, the synchronizing effect of the fence only applies to the prior mbarrier.init operations executed by the same thread on mbarrier objects in .shared::cta state space. When .sync_restrict is .sync_restrict::shared::cta, .sem must be .release, and the effect of the fence only applies to operations performed on objects in .shared::cta state space. Likewise, when .sync_restrict is .sync_restrict::shared::cluster, .sem must be .acquire, and the effect of the fence only applies to operations performed on objects in .shared::cluster state space. When either .sync_restrict::shared::cta or .sync_restrict::shared::cluster is present, the .scope must be specified as .cluster.
The address operand addr and the operand size together specify the memory range [addr, addr+size-1] on which the ordering guarantees on the memory accesses across the proxies are to be provided. The only supported value for the size operand is 128, which must be a constant integer literal. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the .global state space. Otherwise, the behavior is undefined.
On sm_70 and higher, membar is a synonym for fence.sc [1], and the membar levels cta, gl and sys are synonymous with the fence scopes cta, gpu and sys respectively.
membar.proxy and fence.proxy instructions establish an ordering between memory accesses that may happen through different proxies.
A uni-directional proxy ordering from the from-proxykind to the to-proxykind establishes ordering between a prior memory access performed via the from-proxykind and a subsequent memory access performed via the to-proxykind.
A bi-directional proxy ordering between two proxykinds establishes two uni-directional proxy orderings: one from the first proxykind to the second proxykind and the other from the second proxykind to the first proxykind.
The .proxykind qualifier indicates the bi-directional proxy ordering that is established between the memory accesses done between the generic proxy and the proxy specified by .proxykind.
Value .alias of the .proxykind qualifier refers to memory accesses performed using virtually aliased addresses to the same memory location. Value .async of the .proxykind qualifier specifies that the memory ordering is established between the async proxy and the generic proxy. The memory ordering is limited only to operations performed on objects in the state space specified. If no state space is specified, then the memory ordering applies to all state spaces.
A .release proxy fence can form a release sequence that synchronizes with an acquire sequence that contains a .acquire proxy fence. The .to_proxykind and .from_proxykind qualifiers indicate the uni-directional proxy ordering that is established.
On sm_70 and higher, membar.proxy is a synonym for fence.proxy.
[1] The semantics of fence.sc introduced with sm_70 is a superset of the semantics of membar and the two are compatible; when executing on sm_70 or later architectures, membar acquires the full semantics of fence.sc.
PTX ISA Notes
membar.{cta,gl} introduced in PTX ISA version 1.4.
membar.sys introduced in PTX ISA version 2.0.
fence introduced in PTX ISA version 6.0.
membar.proxy and fence.proxy introduced in PTX ISA version 7.5.
.cluster scope qualifier introduced in PTX ISA version 7.8.
.op_restrict qualifier introduced in PTX ISA version 8.0.
fence.proxy.async is introduced in PTX ISA version 8.0.
.to_proxykind::from_proxykind qualifier introduced in PTX ISA version 8.3.
.acquire and .release qualifiers for fence instruction introduced in PTX ISA version 8.6.
.sync_restrict qualifier introduced in PTX ISA version 8.6.
Target ISA Notes
membar.{cta,gl} supported on all target architectures.
membar.sys requires sm_20 or higher.
fence requires sm_70 or higher.
membar.proxy requires sm_60 or higher.
fence.proxy requires sm_70 or higher.
.cluster scope qualifier requires sm_90 or higher.
.op_restrict qualifier requires sm_90 or higher.
fence.proxy.async requires sm_90 or higher.
.to_proxykind::from_proxykind qualifier requires sm_90 or higher.
.acquire and .release qualifiers for fence instruction require sm_90 or higher.
.sync_restrict qualifier requires sm_90 or higher.
Examples
membar.gl;
membar.cta;
membar.sys;
fence.sc.cta;
fence.sc.cluster;
fence.proxy.alias;
membar.proxy.alias;
fence.mbarrier_init.release.cluster;
fence.proxy.async;
fence.proxy.async.shared::cta;
fence.proxy.async.shared::cluster;
fence.proxy.async.global;

tensormap.replace.tile.global_address.global.b1024.b64 [gbl], new_addr;
fence.proxy.tensormap::generic.release.gpu;
cvta.global.u64 tmap, gbl;
fence.proxy.tensormap::generic.acquire.gpu [tmap], 128;
cp.async.bulk.tensor.1d.shared::cluster.global.tile [addr0], [tmap, {tc0}], [mbar0];

// Acquire remote barrier state via async proxy.
barrier.cluster.wait.acquire;
fence.proxy.async::generic.acquire.sync_restrict::shared::cluster.cluster;

// Release local barrier state via async proxy.
mbarrier.init [bar];
fence.mbarrier_init.release.cluster;
fence.proxy.async::generic.release.sync_restrict::shared::cta.cluster;
barrier.cluster.arrive.relaxed;

// Acquire local shared memory via generic proxy.
mbarrier.try_wait.relaxed.cluster.shared::cta.b64 complete, [addr], parity;
fence.acquire.sync_restrict::shared::cluster.cluster;

// Release local shared memory via generic proxy.
fence.release.sync_restrict::shared::cta.cluster;
mbarrier.arrive.relaxed.cluster.shared::cluster.b64 state, [bar];
Atomically loads the original value at location a into destination register d, performs a reduction operation with operand b and the value in location a, and stores the result of the specified operation at location a, overwriting the original value. Operand a specifies a location in the specified state space. If no state space is given, perform the memory accesses using Generic Addressing. atom with scalar type may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space. atom with vector type may be used only with .global space and with generic addressing where the address points to .global space.
For atom with vector type, operands d and b are brace-enclosed vector expressions, the size of which is equal to the size of the vector qualifier.
If no sub-qualifier is specified with .shared state space, then ::cta is assumed by default.
The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .relaxed is assumed by default.
The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model. If the .scope qualifier is absent, .gpu scope is assumed by default.
For atom with vector type, the supported combinations of vector qualifier and types, and atomic operations supported on these combinations are depicted in the following table:
                   Types
Vector qualifier   .f16/bf16           .f16x2/bf16x2       .f32
.v2                .add, .min, .max    .add, .min, .max    .add
.v4                .add, .min, .max    .add, .min, .max    .add
.v8                .add, .min, .max    Not supported       Not supported
Two atomic operations (atom or red) are performed atomically with respect to each other only if each operation specifies a scope that includes the other. When this condition is not met, each operation observes the other operation being performed as if it were split into a read followed by a dependent write.
An atom instruction on a packed type or vector type accesses adjacent scalar elements in memory. In such cases, the atomicity is guaranteed separately for each of the individual scalar elements; the entire atom is not guaranteed to be atomic as a single access.
For sm_6x and earlier architectures, atom operations on .shared state space do not guarantee atomicity with respect to normal store instructions to the same address. It is the programmer’s responsibility to guarantee correctness of programs that use shared memory atomic instructions, e.g., by inserting barriers between normal stores and atomic operations to a common address, or by using atom.exch to store to locations accessed by other atomic operations.
Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.
The bit-size operations are .and, .or, .xor, .cas (compare-and-swap), and .exch (exchange).
The integer operations are .add, .inc, .dec, .min, .max. The .inc and .dec operations return a result in the range [0..b].
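The [0..b] range of .inc and .dec follows from their wrap-around behavior. The Python sketch below models only the value written back to location a, using the wrap rules from the instruction's documented semantics (the same rules as CUDA's atomicInc and atomicDec); those rules are not spelled out in this passage, the function names atom_inc and atom_dec are hypothetical, and unsigned operands are assumed.

```python
def atom_inc(old, b):
    """Value stored by .inc: wraps to 0 once the old value reaches b."""
    return 0 if old >= b else old + 1


def atom_dec(old, b):
    """Value stored by .dec: wraps to b when old is 0 or already above b."""
    return b if (old == 0 or old > b) else old - 1


value, seen = 0, []
for _ in range(5):
    value = atom_inc(value, 3)   # b = 3 bounds the result to [0..3]
    seen.append(value)
print(seen)                      # [1, 2, 3, 0, 1]
print(atom_dec(0, 3))            # 3
```

In both cases the stored value never leaves the range [0..b], whatever the previous contents of the location were.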
The floating-point .add operation rounds to nearest even. The current implementation of atom.add.f32 on global memory flushes subnormal inputs and results to sign-preserving zero, whereas atom.add.f32 on shared memory supports subnormal inputs and results and doesn’t flush them to zero.
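What "flushes subnormal values to sign-preserving zero" means for a single-precision value can be shown with a host-side Python sketch. The helper names to_f32 and ftz_f32 are hypothetical; the code models only the flush step, not the addition or its rounding.

```python
import math
import struct

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal single-precision value


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ftz_f32(x):
    """Flush a subnormal single-precision value to zero, keeping its sign."""
    x = to_f32(x)
    if x != 0.0 and abs(x) < F32_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


tiny = 2.0 ** -130                       # subnormal in single precision
print(ftz_f32(tiny))                     # 0.0
print(ftz_f32(-tiny))                    # -0.0
print(ftz_f32(F32_MIN_NORMAL) == F32_MIN_NORMAL)  # True: normals are untouched
```

In this model, the global-memory behavior described above corresponds to applying ftz_f32 to inputs and results, while the shared-memory behavior corresponds to skipping it.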
The atom.add.f16, atom.add.f16x2, atom.add.bf16 and atom.add.bf16x2 operations require the .noftz qualifier; it preserves subnormal inputs and results, and does not flush them to zero.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
The qualifier .level::cache_hint is only supported for .global state space and for generic addressing where the address points to the .global state space.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
Performs a reduction operation with operand b and the value in location a, and stores the result of the specified operation at location a, overwriting the original value. Operand a specifies a location in the specified state space. If no state space is given, perform the memory accesses using Generic Addressing. red with scalar type may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space. red with vector type may be used only with .global space and with generic addressing where the address points to .global space.
For red with vector type, operand b is a brace-enclosed vector expression, the size of which is equal to the size of the vector qualifier.
If no sub-qualifier is specified with .shared state space, then ::cta is assumed by default.
The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .relaxed is assumed by default.
The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model. If the .scope qualifier is absent, .gpu scope is assumed by default.
For red with vector type, the supported combinations of vector qualifier, types and reduction operations supported on these combinations are depicted in the following table:
                   Types
Vector qualifier   .f16/bf16           .f16x2/bf16x2       .f32
.v2                .add, .min, .max    .add, .min, .max    .add
.v4                .add, .min, .max    .add, .min, .max    .add
.v8                .add, .min, .max    Not supported       Not supported
Two atomic operations (atom or red) are performed atomically with respect to each other only if each operation specifies a scope that includes the other. When this condition is not met, each operation observes the other operation being performed as if it were split into a read followed by a dependent write.
A red instruction on a packed type or vector type accesses adjacent scalar elements in memory. In such cases, the atomicity is guaranteed separately for each of the individual scalar elements; the entire red is not guaranteed to be atomic as a single access.
For sm_6x and earlier architectures, red operations on .shared state space do not guarantee atomicity with respect to normal store instructions to the same address. It is the programmer’s responsibility to guarantee correctness of programs that use shared memory reduction instructions, e.g., by inserting barriers between normal stores and reduction operations to a common address, or by using atom.exch to store to locations accessed by other reduction operations.
Supported addressing modes for operand a and alignment requirements are described in Addresses as Operands.
The bit-size operations are .and, .or, and .xor.
The integer operations are .add, .inc, .dec, .min, .max. The .inc and .dec operations return a result in the range [0..b].
The floating-point .add operation rounds to nearest even. The current implementation of red.add.f32 on global memory flushes subnormal inputs and results to sign-preserving zero, whereas red.add.f32 on shared memory supports subnormal inputs and results and doesn’t flush them to zero.
The red.add.f16, red.add.f16x2, red.add.bf16 and red.add.bf16x2 operations require the .noftz qualifier; it preserves subnormal inputs and results, and does not flush them to zero.
When the optional argument cache-policy is specified, the qualifier .level::cache_hint is required. The 64-bit operand cache-policy specifies the cache eviction policy that may be used during the memory access.
The qualifier .level::cache_hint is only supported for .global state space and for generic addressing where the address points to the .global state space.
cache-policy is a hint to the cache subsystem and may not always be respected. It is treated as a performance hint only, and does not change the memory consistency behavior of the program.
red.async is a non-blocking instruction which initiates an asynchronous reduction operation specified by .op, with the operand b and the value at the destination shared memory location specified by operand a.
Operands
a is a destination address, and must be either a register, or of the form register+immOff, as described in Addresses as Operands.
b is a source value, of the type indicated by qualifier .type.
.completion_mechanism specifies the mechanism for observing the completion of the asynchronous operation.
  When .completion_mechanism is .mbarrier::complete_tx::bytes: upon completion of the asynchronous operation, a complete-tx operation will be performed on the mbarrier object specified by the operand mbar, with completeCount argument equal to the amount of data stored in bytes.
  When .completion_mechanism is not specified: the completion of the store synchronizes with the end of the CTA.
.op specifies the reduction operation.
  The .inc and .dec operations return a result in the range [0..b].
.type specifies the type of the source operand b.
Conditions
When .sem is .relaxed:
  The reduce operation is a relaxed memory operation.
  The complete-tx operation on the mbarrier has .release semantics at .cluster scope.
  The shared-memory addresses of the destination operand a and the mbarrier operand mbar must meet all of the following conditions:
    They belong to the same CTA.
    The CTA to which they belong is different from the CTA of the executing thread, but must be within the same cluster.
  Otherwise, the behavior is undefined.
  .mmio must not be specified.
  If .ss is specified, it must be .shared::cluster.
  If .ss is not specified, generic addressing is used for operands a and mbar. If the generic addresses specified do not fall within the address window of .shared::cluster state space, the behavior is undefined.
  If .completion_mechanism is specified, it must be .mbarrier::complete_tx::bytes.
  If .completion_mechanism is not specified, it defaults to .mbarrier::complete_tx::bytes.
When .sem is .release:
  The reduce operation is a strong memory operation with .release semantics at the scope specified by .scope.
  If .mmio is specified, .scope must be .sys.
  If .ss is specified, it must be .global.
  If .ss is not specified, generic addressing is used for operand a. If the generic address specified does not fall within the address window of .global state space, the behavior is undefined.
  .completion_mechanism must not be specified.
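The qualifier restrictions listed under Conditions can be summarized as a small checker. This Python sketch is illustrative only: the function name check_red_async and its keyword parameters are hypothetical, it covers only the statically checkable rules listed above (not the address-window or same-CTA requirements), and it is not part of any NVIDIA tool.

```python
def check_red_async(sem, ss=None, scope=None, mmio=False, completion=None):
    """Return the list of violated qualifier rules (empty if none)."""
    errors = []
    if sem == ".relaxed":
        if mmio:
            errors.append(".mmio must not be specified with .relaxed")
        if ss not in (None, ".shared::cluster"):
            errors.append(".ss must be .shared::cluster with .relaxed")
        if completion not in (None, ".mbarrier::complete_tx::bytes"):
            errors.append(".completion_mechanism must be .mbarrier::complete_tx::bytes")
    elif sem == ".release":
        if mmio and scope != ".sys":
            errors.append(".mmio requires .sys scope")
        if ss not in (None, ".global"):
            errors.append(".ss must be .global with .release")
        if completion is not None:
            errors.append(".completion_mechanism must not be specified with .release")
    else:
        errors.append(f"unhandled .sem value: {sem}")
    return errors


print(check_red_async(".relaxed", ss=".shared::cluster"))           # []
print(check_red_async(".release", ss=".global", scope=".gpu", mmio=True))
# ['.mmio requires .sys scope']
```

An empty list means none of the modeled rules is violated; it does not prove that the instruction is otherwise well formed.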
PTX ISA Notes
Introduced in PTX ISA version 8.1.
Support for .mmio qualifier, .release semantics, .global state space, and .gpu and .sys scopes introduced in PTX ISA version 8.7.
Target ISA Notes
Requires sm_90 or higher.
.mmio qualifier, .release semantics, .global state space, and .gpu and .sys scopes require sm_100 or higher.
The vote instruction without a .sync qualifier is deprecated in PTX ISA version 6.0.
Support for this instruction with .target lower than sm_70 may be removed in a future PTX ISA version.
Removal Note
Support for the vote instruction without a .sync qualifier is removed in PTX ISA version 6.4 for .target sm_70 or higher.
Description
Performs a reduction of the source predicate across all active threads in a warp. The destinationpredicate value is the same across all threads in the warp.
The reduction modes are:
.all
True if source predicate is True for all active threads in warp. Negate the source predicate to compute .none.
.any
True if source predicate is True for some active thread in warp. Negate the source predicate to compute .not_all.
.uni
True if source predicate has the same value in all active threads in warp. Negating the source predicate also computes .uni.
In the ballot form, vote.ballot.b32 simply copies the predicate from each thread in a warp into the corresponding bit position of destination register d, where the bit position corresponds to the thread’s lane id.
An inactive thread in warp will contribute a 0 for its entry when participating in vote.ballot.b32.
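The reduction modes and the ballot form can be modeled on the host in a few lines of Python. This is a sketch under stated assumptions: the functions vote and vote_ballot are hypothetical names, per-lane predicates are passed as a list of 32 booleans, and the set of active threads is passed as a 32-bit mask.

```python
WARP_SIZE = 32


def vote(mode, preds, active):
    """Reduce the predicates of the active lanes (toy model of .all/.any/.uni)."""
    vals = [bool(preds[l]) for l in range(WARP_SIZE) if (active >> l) & 1]
    if mode == "all":
        return all(vals)
    if mode == "any":
        return any(vals)
    if mode == "uni":
        return all(vals) or not any(vals)   # same value in every active lane
    raise ValueError(f"unsupported mode: {mode}")


def vote_ballot(preds, active):
    """Bit l of the result is lane l's predicate; inactive lanes contribute 0."""
    mask = 0
    for lane in range(WARP_SIZE):
        if (active >> lane) & 1 and preds[lane]:
            mask |= 1 << lane
    return mask


preds = [lane % 2 == 0 for lane in range(WARP_SIZE)]   # even lanes True
active = 0x0000FFFF                                    # lanes 0..15 active
negated = [not p for p in preds]
print(vote("all", preds, active))      # False
print(vote("any", negated, active))    # True: this is .not_all of preds
print(hex(vote_ballot(preds, active))) # 0x5555
```

The negated-predicate call shows how .none and .not_all are obtained from .all and .any without dedicated modes.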
PTX ISA Notes
Introduced in PTX ISA version 1.2.
Deprecated in PTX ISA version 6.0 in favor of vote.sync.
Not supported in PTX ISA version 6.4 for .target sm_70 or higher.
Target ISA Notes
vote requires sm_12 or higher.
vote.ballot.b32 requires sm_20 or higher.
vote is not supported on sm_70 or higher starting PTX ISA version 6.4.
Release Notes
Note that vote applies to threads in a single warp, not across an entire CTA.
Examples
vote.all.pred    p,q;
vote.uni.pred    p,q;
vote.ballot.b32  r1,p;  // get 'ballot' across warp
vote.sync will cause the executing thread to wait until all non-exited threads corresponding to membermask have executed vote.sync with the same qualifiers and same membermask value before resuming execution.
Operand membermask specifies a 32-bit integer which is a mask indicating threads participating in this instruction, where the bit position corresponds to the thread’s laneid. Operand a is a predicate register.
In the mode form, vote.sync performs a reduction of the source predicate across all non-exited threads in membermask. The destination operand d is a predicate register and its value is the same across all threads in membermask.
The reduction modes are:
.all
True if source predicate is True for all non-exited threads in membermask. Negate the source predicate to compute .none.
.any
True if source predicate is True for some thread in membermask. Negate the source predicate to compute .not_all.
.uni
True if source predicate has the same value in all non-exited threads in membermask. Negating the source predicate also computes .uni.
In the ballot form, the destination operand d is a .b32 register. In this form, vote.sync.ballot.b32 simply copies the predicate from each thread in membermask into the corresponding bit position of destination register d, where the bit position corresponds to the thread’s lane id.
A thread not specified in membermask will contribute a 0 for its entry in vote.sync.ballot.b32.
The behavior of vote.sync is undefined if the executing thread is not in the membermask.
Note
For .target sm_6x or below, all threads in membermask must execute the same vote.sync instruction in convergence, and only threads belonging to some membermask can be active when the vote.sync instruction is executed. Otherwise, the behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 6.0.
Target ISA Notes
Requires sm_30 or higher.
Examples
vote.sync.all.pred    p,q,0xffffffff;
vote.sync.ballot.b32  r1,p,0xffffffff;  // get 'ballot' across warp
Broadcast and compare a value across threads in warp.
Syntax
match.any.sync.type  d, a, membermask;
match.all.sync.type  d[|p], a, membermask;

.type = { .b32, .b64 };
Description
match.sync will cause the executing thread to wait until all non-exited threads from membermask have executed match.sync with the same qualifiers and same membermask value before resuming execution.
Operand membermask specifies a 32-bit integer which is a mask indicating threads participating in this instruction, where the bit position corresponds to the thread’s laneid.
match.sync performs broadcast and compare of operand a across all non-exited threads in membermask and sets destination d and optional predicate p based on mode.
Operand a has instruction type and d has .b32 type.
Destination d is a 32-bit mask where the bit position in the mask corresponds to the thread’s laneid.
The matching operation modes are:
.all
d is set to the mask corresponding to non-exited threads in membermask if all non-exited threads in membermask have the same value of operand a; otherwise d is set to 0. Optionally, predicate p is set to true if all non-exited threads in membermask have the same value of operand a; otherwise p is set to false. The sink symbol ‘_’ may be used in place of any one of the destination operands.
.any
d is set to the mask of non-exited threads in membermask that have the same value of operand a.
The behavior of match.sync is undefined if the executing thread is not in the membermask.
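The two matching modes can be sketched as a host-side Python model. The function names match_any and match_all are hypothetical, per-lane values of operand a are passed as a list of 32 integers, and every lane named in membermask is assumed to be non-exited.

```python
WARP_SIZE = 32


def match_any(values, membermask, lane):
    """Mask of participating lanes whose value equals that of `lane`."""
    mask = 0
    for other in range(WARP_SIZE):
        if (membermask >> other) & 1 and values[other] == values[lane]:
            mask |= 1 << other
    return mask


def match_all(values, membermask):
    """(d, p): d is membermask if all participating values agree, else 0."""
    distinct = {values[l] for l in range(WARP_SIZE) if (membermask >> l) & 1}
    same = len(distinct) <= 1
    return (membermask if same else 0, same)


values = [lane // 8 for lane in range(WARP_SIZE)]   # four groups of eight lanes
print(hex(match_any(values, 0xFFFFFFFF, 9)))        # 0xff00: lanes 8..15 share value 1
print(match_all(values, 0xFFFFFFFF))                # (0, False)
print(match_all(values, 0x000000FF))                # (255, True)
```

Unlike .all, the .any result differs per thread: each lane receives the mask of its own group.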
PTX ISA Notes
Introduced in PTX ISA version 6.0.
Target ISA Notes
Requires sm_70 or higher.
Release Notes
Note that match.sync applies to threads in a single warp, not across an entire CTA.
Examples
match.any.sync.b32    d, a, 0xffffffff;
match.all.sync.b64    d|p, a, mask;
activemask queries predicated-on active threads from the executing warp and sets the destination d with a 32-bit integer mask where the bit position in the mask corresponds to the thread’s laneid.
Destination d is a 32-bit destination register.
An active thread will contribute 1 for its entry in the result, and an exited, inactive, or predicated-off thread will contribute 0 for its entry in the result.
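The per-lane contribution rule can be written as a short host-side Python model. The function name activemask_model and the string labels for lane states are assumptions chosen for this illustration.

```python
def activemask_model(states):
    """Build the 32-bit mask: only lanes in state 'active' contribute a 1."""
    mask = 0
    for lane, state in enumerate(states):
        if state == "active":
            mask |= 1 << lane
    return mask


states = ["active"] * 32
states[3] = "exited"            # contributes 0
states[5] = "predicated_off"    # contributes 0
print(hex(activemask_model(states)))   # 0xffffffd7
```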
redux.sync will cause the executing thread to wait until all non-exited threads corresponding to membermask have executed redux.sync with the same qualifiers and the same membermask value before resuming execution.
Operand membermask specifies a 32-bit integer which is a mask indicating the threads participating in this instruction, where the bit position corresponds to the thread’s laneid.
redux.sync performs a reduction operation .op of the 32-bit source register src across all non-exited threads in the membermask. The result of the reduction operation is written to the 32-bit destination register dst.
The reduction operation can be one of the bitwise operations .and, .or, .xor or one of the arithmetic operations .add, .min, .max.
For the .add operation, the result is truncated to 32 bits.
For the .f32 instruction type, if the input value is 0.0 then +0.0 > -0.0.
If the .abs qualifier is specified, then the absolute value of the input is considered for the reduction operation.
If the .NaN qualifier is specified, then the result of the reduction operation is canonical NaN if the input to the reduction operation from any participating thread is NaN.
In the absence of the .NaN qualifier, only non-NaN values are considered for the reduction operation and the result will be canonical NaN when all inputs are NaNs.
The behavior of redux.sync is undefined if the executing thread is not in the membermask.
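The rules above (32-bit truncation for .add, signed-zero ordering, .abs and .NaN handling) can be sketched as a host-side reference model. The function names are hypothetical, the integer model treats values as unsigned 32-bit, and the floating-point model assumes that only .min and .max apply to .f32 and uses Python floats rather than true single precision.

```python
import math
from functools import reduce

U32 = 0xFFFFFFFF


def participating(values, membermask):
    """Pick the per-lane inputs of the lanes named in membermask."""
    return [values[l] for l in range(32) if (membermask >> l) & 1]


def redux_sync_u32(op, values, membermask):
    """Unsigned 32-bit reduction across the participating lanes."""
    ops = {
        "and": lambda a, b: a & b,
        "or": lambda a, b: a | b,
        "xor": lambda a, b: a ^ b,
        "add": lambda a, b: (a + b) & U32,  # result truncated to 32 bits
        "min": min,
        "max": max,
    }
    return reduce(ops[op], participating(values, membermask))


def redux_sync_f32(op, values, membermask, use_abs=False, nan=False):
    """Floating-point min/max reduction with the .abs and .NaN rules."""
    xs = participating(values, membermask)
    if use_abs:
        xs = [abs(x) for x in xs]
    non_nan = [x for x in xs if not math.isnan(x)]
    has_nan = len(non_nan) != len(xs)
    if (nan and has_nan) or not non_nan:
        return math.nan
    # Break ties between +0.0 and -0.0 by sign so that +0.0 > -0.0.
    key = lambda x: (x, math.copysign(1.0, x))
    return min(non_nan, key=key) if op == "min" else max(non_nan, key=key)


print(redux_sync_u32("add", [0xFFFFFFFF] + [1] * 31, 0x3))  # 0
```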
PTX ISA Notes
Introduced in PTX ISA version 7.0.
Support for .f32 type is introduced in PTX ISA version 8.6.
Support for .abs and .NaN qualifiers is introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_80 or higher.
.f32 type requires sm_100a and is supported on sm_100f from PTX ISA version 8.8.
Qualifiers .abs and .NaN require sm_100a and are supported on sm_100f or higher in the same family from PTX ISA version 8.8.
Release Notes
Note that redux.sync applies to threads in a single warp, not across an entire CTA.
The griddepcontrol instruction allows the dependent grids and prerequisite grids, as defined by the runtime, to control execution in the following way:
.launch_dependents modifier signals that specific dependents the runtime system designated to react to this instruction can be scheduled as soon as all other CTAs in the grid issue the same instruction or have completed. The dependent may launch before the completion of the current grid. There is no guarantee that the dependent will launch before the completion of the current grid. Repeated invocations of this instruction by threads in the current CTA will have no additional side effects past that of the first invocation.
.wait modifier causes the executing thread to wait until all prerequisite grids in flight have completed and all the memory operations from the prerequisite grids are performed and made visible to the current grid.
Note
If the prerequisite grid is using griddepcontrol.launch_dependents, then the dependent grid must use griddepcontrol.wait to ensure correct functional execution.
elect.sync elects one predicated active leader thread from among a set of threads specified by membermask. The laneid of the elected thread is returned in the 32-bit destination operand d. The sink symbol ‘_’ can be used for destination operand d. The predicate destination p is set to True for the leader thread, and False for all other threads.
Operand membermask specifies a 32-bit integer indicating the set of threads from which a leader is to be elected. The behavior is undefined if the executing thread is not in membermask.
Election of a leader thread happens deterministically, i.e. the same leader thread is elected for the same membermask every time.
The mandatory .sync qualifier indicates that elect causes the executing thread to wait until all threads in the membermask execute the elect instruction before resuming execution.
mbarrier is a barrier created in shared memory that supports:
Synchronizing any subset of threads within a CTA
One-way synchronization of threads across CTAs of a cluster. As noted in mbarrier support with shared memory, threads can perform only arrive operations but not *_wait on an mbarrier located in shared::cluster space.
Waiting for completion of asynchronous memory operations initiated by a thread and making them visible to other threads.
An mbarrier object is an opaque object in memory which can be initialized and invalidated using:
mbarrier.init
mbarrier.inval
Operations supported on mbarrier objects are:
mbarrier.expect_tx
mbarrier.complete_tx
mbarrier.arrive
mbarrier.arrive_drop
mbarrier.test_wait
mbarrier.try_wait
mbarrier.pending_count
cp.async.mbarrier.arrive
Performing any mbarrier operation except mbarrier.init on an uninitialized mbarrier object results in undefined behavior. Performing any non-mbarrier or mbarrier.init operations on an initialized mbarrier object results in undefined behavior.
Unlike bar{.cta}/barrier{.cta} instructions which can access a limited number of barriers per CTA, mbarrier objects are user defined and are only limited by the total shared memory size available.
mbarrier operations enable threads to perform useful work after the arrival at the mbarrier and before waiting for the mbarrier to complete.
An opaque mbarrier object keeps track of the following information:
Current phase of the mbarrier object
Count of pending arrivals for the current phase of the mbarrier object
Count of expected arrivals for the next phase of the mbarrier object
Count of pending asynchronous memory operations (or transactions) tracked by the current phase of the mbarrier object. This is also referred to as tx-count.
An mbarrier object progresses through a sequence of phases where each phase is defined by threads performing an expected number of arrive-on operations.
The valid range of each of the counts is as shown below:
The phase of an mbarrier object is the number of times the mbarrier object has been used to synchronize threads and asynchronous operations. In each phase {0, 1, 2, …}, threads perform in program order:
arrive-on operations to complete the current phase and
test_wait / try_wait operations to check for the completion of the current phase.
An mbarrier object is automatically reinitialized upon completion of the current phase for immediate use in the next phase. The current phase is incomplete and all prior phases are complete.
For each phase of the mbarrier object, at least one test_wait or try_wait operation must be performed which returns True for waitComplete before an arrive-on operation in the subsequent phase.
Starting with the Hopper architecture (sm_9x), the mbarrier object supports a new count, called tx-count, which is used for tracking the completion of asynchronous memory operations or transactions. tx-count tracks the number of asynchronous transactions, in units specified by the asynchronous memory operation, that are outstanding and yet to be complete.
The tx-count of an mbarrier object must be set to the total amount of asynchronous memory operations, in units as specified by the asynchronous operations, to be tracked by the current phase. Upon completion of each of the asynchronous operations, the complete-tx operation will be performed on the mbarrier object and thus progress the mbarrier towards the completion of the current phase.
The expect-tx operation, with an expectCount argument, increases the tx-count of an mbarrier object by the value specified by expectCount. This sets the current phase of the mbarrier object to expect and track the completion of additional asynchronous transactions.
The complete-tx operation, with a completeCount argument, on an mbarrier object consists of the following:
mbarrier signaling
Signals the completion of asynchronous transactions that were tracked by the current phase. As a result of this, tx-count is decremented by completeCount.
mbarrier potentially completing the current phase
If the current phase has been completed then the mbarrier transitions to the next phase. Refer to Phase Completion of the mbarrier object for details on phase completion requirements and phase transition process.
The requirements for completion of the current phase are described below. Upon completion of the current phase, the phase transitions to the subsequent phase as described below.
Current phase completion requirements
An mbarrier object completes the current phase when all of the following conditions are met:
The count of the pending arrivals has reached zero.
The tx-count has reached zero.
Phase transition
When an mbarrier object completes the current phase, the following actions are performed atomically:
The mbarrier object transitions to the next phase.
The pending arrival count is reinitialized to the expected arrival count.
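The bookkeeping described so far (phase, pending and expected arrival counts, tx-count, and the completion rule) can be sketched as a small single-threaded state machine. This is an illustrative model only: the class and method names are hypothetical, the real object is opaque and updated atomically, and the model assumes that only arrive-on and complete-tx operations check for phase completion.

```python
class Mbarrier:
    """Toy model of the counts tracked by an mbarrier object."""

    def __init__(self, count):
        self.phase = 0
        self.expected = count   # expected arrivals for the next phase
        self.pending = count    # pending arrivals for the current phase
        self.tx_count = 0       # outstanding asynchronous transactions

    def _maybe_complete(self):
        # The phase completes only when both counts have reached zero.
        if self.pending == 0 and self.tx_count == 0:
            self.phase += 1
            self.pending = self.expected

    def expect_tx(self, expect_count):
        self.tx_count += expect_count

    def complete_tx(self, complete_count):
        self.tx_count -= complete_count
        self._maybe_complete()

    def arrive(self, count=1):
        phase_before = self.phase
        self.pending -= count
        self._maybe_complete()
        return phase_before


bar = Mbarrier(2)
bar.expect_tx(128)      # e.g. 128 bytes of asynchronous copies to track
bar.arrive()
bar.arrive()            # all arrivals are in, but tx-count is still 128
bar.complete_tx(128)    # tx-count reaches zero, so the phase completes
print(bar.phase, bar.pending)  # 1 2
```

The example shows why both conditions matter: the two arrivals alone do not advance the phase while transactions are still outstanding.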
An arrive-on operation, with an optional count argument, on an mbarrier object consists of the following 2 steps:
mbarrier signalling:
Signals the arrival of the executing thread OR completion of the asynchronous instruction which signals the arrive-on operation initiated by the executing thread on the mbarrier object. As a result of this, the pending arrival count is decremented by count. If the count argument is not specified, then it defaults to 1.
mbarrier potentially completing the current phase:
If the current phase has been completed then the mbarrier transitions to the next phase. Refer to Phase Completion of the mbarrier object for details on phase completion requirements and phase transition process.
mbarrier.init initializes the mbarrier object at the location specified by the address operand addr with the unsigned 32-bit integer count. The value of operand count must be in the range as specified in Contents of the mbarrier object.
Initialization of the mbarrier object involves:
Initializing the current phase to 0.
Initializing the expected arrival count to count.
Initializing the pending arrival count to count.
Initializing the tx-count to 0.
The valid range of values for the operand count is [1, …, 2^20 - 1]. Refer to Contents of the mbarrier object for the valid range of values for the various constituents of the mbarrier.
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .shared::cta state space then the behavior is undefined.
The behavior of performing an mbarrier.init operation on a memory location containing a valid mbarrier object is undefined; invalidate the mbarrier object using mbarrier.inval first, before repurposing the memory location for any other purpose, including another mbarrier object.
PTX ISA Notes
Introduced in PTX ISA version 7.0.
Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.
Target ISA Notes
Requires sm_80 or higher.
Examples
.shared .b64 shMem, shMem2;
.reg .b64 addr;
.reg .b32 %r1;

cvta.shared.u64 addr, shMem2;
mbarrier.init.b64 [addr], %r1;
bar.cta.sync 0;
// ... other mbarrier operations on addr

mbarrier.init.shared::cta.b64 [shMem], 12;
bar.sync 0;
// ... other mbarrier operations on shMem
mbarrier.inval invalidates the mbarrier object at the location specified by the address operand addr.
An mbarrier object must be invalidated before using its memory location for any other purpose.
Performing any mbarrier operation except mbarrier.init on a memory location that does not contain a valid mbarrier object results in undefined behavior.
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .shared::cta state space then the behavior is undefined.
Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.
Target ISA Notes
Requires sm_80 or higher.
Examples
.shared .b64 shmem;
.reg .b64 addr;
.reg .b32 %r1;
.reg .pred t0;

// Example 1 :
bar.sync 0;
@t0 mbarrier.init.b64 [addr], %r1;
// ... other mbarrier operations on addr
bar.sync 0;
@t0 mbarrier.inval.b64 [addr];

// Example 2 :
bar.cta.sync 0;
mbarrier.init.shared.b64 [shmem], 12;
// ... other mbarrier operations on shmem
bar.cta.sync 0;
@t0 mbarrier.inval.shared.b64 [shmem];

// shmem can be reused here for unrelated use :
bar.cta.sync 0;
st.shared.b64 [shmem], ...;

// shmem can be re-initialized as mbarrier object :
bar.cta.sync 0;
@t0 mbarrier.init.shared.b64 [shmem], 24;
// ... other mbarrier operations on shmem
bar.cta.sync 0;
@t0 mbarrier.inval.shared::cta.b64 [shmem];
A thread executing mbarrier.expect_tx performs an expect-tx operation on the mbarrier object at the location specified by the address operand addr. The 32-bit unsigned integer operand txCount specifies the expectCount argument to the expect-tx operation.
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .shared::cta or .shared::cluster state space then the behavior is undefined.
A thread executing mbarrier.complete_tx performs a complete-tx operation on the mbarrier object at the location specified by the address operand addr. The 32-bit unsigned integer operand txCount specifies the completeCount argument to the complete-tx operation.
mbarrier.complete_tx does not involve any asynchronous memory operations and only simulates the completion of an asynchronous memory operation and its side effect of signaling to the mbarrier object.
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .shared::cta or .shared::cluster state space then the behavior is undefined.
A thread executing mbarrier.arrive performs an arrive-on operation on the mbarrier object at the location specified by the address operand addr. The 32-bit unsigned integer operand count specifies the count argument to the arrive-on operation.
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .shared::cta state space then the behavior is undefined.
The optional qualifier .expect_tx specifies that an expect-tx operation is performed prior to the arrive-on operation. The 32-bit unsigned integer operand txCount specifies the expectCount argument to the expect-tx operation. When both qualifiers .arrive and .expect_tx are specified, then the count argument of the arrive-on operation is assumed to be 1.
An mbarrier.arrive operation with the .noComplete qualifier must not cause the mbarrier to complete its current phase, otherwise the behavior is undefined.
Note: for sm_8x, when the argument count is specified, the modifier .noComplete is required.
mbarrier.arrive operation on an mbarrier object located in .shared::cta returns an opaque 64-bit register capturing the phase of the mbarrier object prior to the arrive-on operation in the destination operand state. Contents of the state operand are implementation specific. Optionally, sink symbol '_' can be used for the state argument.
mbarrier.arrive operation on an mbarrier object located in .shared::cluster but not in .shared::cta cannot return a value. Sink symbol ‘_’ is mandatory for the destination operand for such cases.
The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .release is assumed by default.
The .relaxed qualifier does not provide any memory ordering semantics and visibility guarantees.
The optional .scope qualifier indicates the set of threads that directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model. If the .scope qualifier is not specified then it defaults to .cta. In contrast, the .shared::<scope> indicates the state space where the mbarrier resides.
PTX ISA Notes
Introduced in PTX ISA version 7.0.
Support for sink symbol ‘_’ as the destination operand is introduced in PTX ISA version 7.1.
Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.
Support for count argument without the modifier .noComplete introduced in PTX ISA version 7.8.
Support for sub-qualifier ::cluster introduced in PTX ISA version 8.0.
Support for qualifier .expect_tx is introduced in PTX ISA version 8.0.
Support for .scope and .sem qualifiers introduced in PTX ISA version 8.0.
Support for .relaxed qualifier introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_80 or higher.
Support for count argument without the modifier .noComplete requires sm_90 or higher.
Qualifier .expect_tx requires sm_90 or higher.
Sub-qualifier ::cluster requires sm_90 or higher.
Support for .cluster scope requires sm_90 or higher.
A thread executing mbarrier.arrive_drop on the mbarrier object at the location specified by the address operand addr performs the following steps:
Decrements the expected arrival count of the mbarrier object by the value specified by the 32-bit integer operand count. If the count operand is not specified, it defaults to 1.
The decrement done in the expected arrivals count of the mbarrier object will be for all the subsequent phases of the mbarrier object.
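The lasting effect of the decrement can be seen in a small model of the arrival counts. This sketch assumes, as the later references to the arrive-on operation of mbarrier.arrive_drop suggest, that the instruction also performs an arrive-on with the same count after lowering the expected count. The class name is hypothetical and tx-count is left out for brevity.

```python
class ArrivalCounts:
    """Toy model of the arrival counts of an mbarrier (tx-count omitted)."""

    def __init__(self, count):
        self.phase = 0
        self.expected = count
        self.pending = count

    def _arrive_on(self, count):
        self.pending -= count
        if self.pending == 0:
            self.phase += 1
            self.pending = self.expected  # uses the possibly reduced value

    def arrive(self, count=1):
        self._arrive_on(count)

    def arrive_drop(self, count=1):
        # Assumption: drop from all later phases, then arrive in this one.
        self.expected -= count
        self._arrive_on(count)


bar = ArrivalCounts(3)
bar.arrive()
bar.arrive_drop()       # this thread leaves the barrier for good
bar.arrive()            # third arrival completes phase 0
print(bar.phase, bar.pending)  # 1 2
```

After the drop, every later phase needs only two arrivals instead of three.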
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .shared::cta or .shared::cluster state space then the behavior is undefined.
The optional qualifier .expect_tx specifies that an expect-tx operation is performed prior to the arrive-on operation. The 32-bit unsigned integer operand txCount specifies the expectCount argument to the expect-tx operation. When both qualifiers .arrive and .expect_tx are specified, then the count argument of the arrive-on operation is assumed to be 1.
mbarrier.arrive_drop operation with .release qualifier forms the release pattern as described in the Memory Consistency Model and synchronizes with the acquire patterns.
The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .release is assumed by default. The .relaxed qualifier does not provide any memory ordering semantics and visibility guarantees.
The optional .scope qualifier indicates the set of threads that an mbarrier.arrive_drop instruction can directly synchronize. If the .scope qualifier is not specified then it defaults to .cta. In contrast, the .shared::<scope> indicates the state space where the mbarrier resides.
An mbarrier.arrive_drop with the .noComplete qualifier must not complete the mbarrier, otherwise the behavior is undefined.
Note: for sm_8x, when the argument count is specified, the modifier .noComplete is required.
A thread that wants to either exit or opt out of participating in the arrive-on operation can use mbarrier.arrive_drop to drop itself from the mbarrier.
mbarrier.arrive_drop operation on an mbarrier object located in .shared::cta returns an opaque 64-bit register capturing the phase of the mbarrier object prior to the arrive-on operation in the destination operand state. Contents of the returned state are implementation specific. Optionally, sink symbol '_' can be used for the state argument.
mbarrier.arrive_drop operation on an mbarrier object located in .shared::cluster but not in .shared::cta cannot return a value. Sink symbol ‘_’ is mandatory for the destination operand for such cases.
PTX ISA Notes
Introduced in PTX ISA version 7.0.
Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.
Support for count argument without the modifier .noComplete introduced in PTX ISA version 7.8.
Support for qualifier .expect_tx is introduced in PTX ISA version 8.0.
Support for sub-qualifier ::cluster introduced in PTX ISA version 8.0.
Support for .scope and .sem qualifiers introduced in PTX ISA version 8.0.
Support for .relaxed qualifier introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_80 or higher.
Support for count argument without the modifier .noComplete requires sm_90 or higher.
Qualifier .expect_tx requires sm_90 or higher.
Sub-qualifier ::cluster requires sm_90 or higher.
Support for .cluster scope requires sm_90 or higher.
Examples
.reg .b32 cnt;
.reg .b64 %r1;
.shared .b64 shMem;

// Example 1
@p mbarrier.arrive_drop.shared.b64 _, [shMem];
@p exit;
@p2 mbarrier.arrive_drop.noComplete.shared.b64 _, [shMem], %a;
@p2 exit;
..
@!p mbarrier.arrive.shared.b64 %r1, [shMem];
@!p mbarrier.test_wait.shared.b64 q, [shMem], %r1;

// Example 2
mbarrier.arrive_drop.shared::cluster.b64 _, [addr];
mbarrier.arrive_drop.shared::cta.release.cluster.b64 _, [addr], cnt;

// Example 3
mbarrier.arrive_drop.expect_tx.shared::cta.relaxed.cluster.b64 state, [addr], tx_count;
Causes an arrive-on operation to be triggered by the system on the mbarrier object upon the completion of all prior cp.async operations initiated by the executing thread. The mbarrier object is at the location specified by the operand addr. The arrive-on operation is asynchronous to execution of cp.async.mbarrier.arrive.
When the .noinc modifier is not specified, the pending count of the mbarrier object is incremented by 1 prior to the asynchronous arrive-on operation. This results in a zero-net change for the pending count from the asynchronous arrive-on operation during the current phase. The pending count of the mbarrier object after the increment should not exceed the limit as mentioned in Contents of the mbarrier object. Otherwise, the behavior is undefined.
When the .noinc modifier is specified, the increment to the pending count of the mbarrier object is not performed. Hence the decrement of the pending count done by the asynchronous arrive-on operation must be accounted for in the initialization of the mbarrier object.
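The consequence for the count passed to mbarrier.init can be worked through with a little arithmetic. The sketch below assumes that every participating thread issues one mbarrier.arrive plus a fixed number of cp.async.mbarrier.arrive instructions per phase; the function and parameter names are hypothetical.

```python
def pending_after_async_arrive(pending, noinc):
    """Net effect of one cp.async.mbarrier.arrive on the pending count."""
    if not noinc:
        pending += 1    # increment performed when the instruction is issued
    pending -= 1        # decrement from the asynchronous arrive-on
    return pending


def mbarrier_init_count(thread_count, async_arrives_per_thread, noinc):
    """Count to pass to mbarrier.init under the assumptions stated above."""
    explicit = thread_count  # one mbarrier.arrive per participating thread
    if noinc:
        # Each asynchronous arrive-on is a real decrement to account for.
        return explicit + thread_count * async_arrives_per_thread
    return explicit


print(mbarrier_init_count(128, 3, noinc=False))  # 128
print(mbarrier_init_count(128, 3, noinc=True))   # 512
```

With 128 threads and three .noinc instances per thread, the barrier therefore has to be initialized with 128 + 3 * 128 = 512 expected arrivals.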
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .shared::cta state space then the behavior is undefined.
Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.
Target ISA Notes
Requires sm_80 or higher.
Examples
// Example 1: no .noinc
mbarrier.init.shared.b64 [shMem], threadCount;
....
cp.async.ca.shared.global [shard1], [gbl1], 4;
cp.async.cg.shared.global [shard2], [gbl2], 16;
....
// Absence of .noinc accounts for arrive-on from completion of prior cp.async operations.
// So mbarrier.init must only account for arrive-on from mbarrier.arrive.
cp.async.mbarrier.arrive.shared.b64 [shMem];
....
mbarrier.arrive.shared.b64 state, [shMem];

waitLoop:
mbarrier.test_wait.shared.b64 p, [shMem], state;
@!p bra waitLoop;

// Example 2: with .noinc
// Tracks arrive-on from mbarrier.arrive and cp.async.mbarrier.arrive.
// All threads participating in the mbarrier perform cp.async
mov.b32 copyOperationCnt, threadCount;
// 3 arrive-on operations will be triggered per-thread
mul.lo.u32 copyArrivalCnt, copyOperationCnt, 3;
add.u32 totalCount, threadCount, copyArrivalCnt;
mbarrier.init.shared.b64 [shMem], totalCount;
....
cp.async.ca.shared.global [shard1], [gbl1], 4;
cp.async.cg.shared.global [shard2], [gbl2], 16;
...
// Presence of .noinc requires mbarrier initialization to have accounted for arrive-on from cp.async
cp.async.mbarrier.arrive.noinc.shared.b64 [shMem]; // 1st instance
....
cp.async.ca.shared.global [shard3], [gbl3], 4;
cp.async.ca.shared.global [shard4], [gbl4], 16;
cp.async.mbarrier.arrive.noinc.shared::cta.b64 [shMem]; // 2nd instance
....
cp.async.ca.shared.global [shard5], [gbl5], 4;
cp.async.cg.shared.global [shard6], [gbl6], 16;
cp.async.mbarrier.arrive.noinc.shared.b64 [shMem]; // 3rd and last instance
....
mbarrier.arrive.shared.b64 state, [shMem];

waitLoop:
mbarrier.test_wait.shared.b64 p, [shMem], state;
@!p bra waitLoop;
The test_wait and try_wait operations test for the completion of the current or the immediately preceding phase of an mbarrier object at the location specified by the operand addr.
mbarrier.test_wait is a non-blocking instruction which tests for the completion of the phase.
mbarrier.try_wait is a potentially blocking instruction which tests for the completion of the phase. If the phase is not complete, the executing thread may be suspended. A suspended thread resumes execution when the specified phase completes OR before the phase completes following a system-dependent time limit. The optional 32-bit unsigned integer operand suspendTimeHint specifies the time limit, in nanoseconds, that may be used for the time limit instead of the system-dependent limit.
mbarrier.test_wait and mbarrier.try_wait test for completion of the phase:
Specified by the operand state, which was returned by an mbarrier.arrive instruction on the same mbarrier object during the current or the immediately preceding phase. Or
Indicated by the operand phaseParity, which is the integer parity of either the current phase or the immediately preceding phase of the mbarrier object.
The .parity variant of the instructions tests for the completion of the phase indicated by the operand phaseParity, which is the integer parity of either the current phase or the immediately preceding phase of the mbarrier object. An even phase has integer parity 0 and an odd phase has integer parity 1. So the valid values of the phaseParity operand are 0 and 1.
Note: the use of the .parity variants of the instructions requires tracking the phase of an mbarrier object throughout its lifetime.
The test_wait and try_wait operations are valid only for:
the current incomplete phase, for which waitComplete returns False.
the immediately preceding phase, for which waitComplete returns True.
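Because only the current phase and the one before it can be queried, a single parity bit is enough to tell them apart. The following sketch models that rule on the host; the function names are hypothetical and the current phase number, which is hidden inside the opaque object on real hardware, is passed in explicitly.

```python
def phase_parity(phase):
    """Integer parity of a phase: 0 for even phases, 1 for odd phases."""
    return phase & 1


def wait_complete_for_parity(current_phase, phase_parity_arg):
    """Model of waitComplete for the .parity variants (illustrative only).

    The current phase is always incomplete and the preceding one complete,
    so the queried phase is complete exactly when its parity differs from
    the parity of the current phase.
    """
    if phase_parity_arg not in (0, 1):
        raise ValueError("phaseParity must be 0 or 1")
    return phase_parity(current_phase) != phase_parity_arg


# A thread that tracks its own iteration counter i, one phase per iteration.
for i in range(4):
    par = i & 1
    still_waiting = wait_complete_for_parity(i, par)      # phase i in progress
    done = wait_complete_for_parity(i + 1, par)           # phase i has completed
    print(i, still_waiting, done)  # always prints: i False True
```

This is also why a thread using the .parity variants has to track the phase itself: the parity bit alone cannot distinguish phase i from phase i + 2.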
If no state space is specified then Generic Addressing is used. If the address specified by addr does not fall within the address window of .shared::cta state space then the behavior is undefined.
When mbarrier.test_wait and mbarrier.try_wait operations with the .acquire qualifier return True, they form the acquire pattern as described in the Memory Consistency Model.
The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .acquire is assumed by default. The .relaxed qualifier does not provide any memory ordering semantics and visibility guarantees.
The optional .scope qualifier indicates the set of threads that the mbarrier.test_wait and mbarrier.try_wait instructions can directly synchronize. If the .scope qualifier is not specified then it defaults to .cta. In contrast, the .shared::<scope> indicates the state space where the mbarrier resides.
The following ordering of memory operations holds for the executing thread when mbarrier.test_wait or mbarrier.try_wait having acquire semantics returns True:
All memory accesses (except async operations) requested prior, in program order, to mbarrier.arrive having release semantics during the completed phase by the participating threads of the CTA are performed and are visible to the executing thread.
All cp.async operations requested prior, in program order, to cp.async.mbarrier.arrive during the completed phase by the participating threads of the CTA are performed and made visible to the executing thread.
All cp.async.bulk asynchronous operations using the same mbarrier object requested prior, in program order, to mbarrier.arrive having release semantics during the completed phase by the participating threads of the CTA are performed and made visible to the executing thread.
All memory accesses requested after the mbarrier.test_wait or mbarrier.try_wait, in program order, are not performed and not visible to memory accesses performed prior to mbarrier.arrive having release semantics, in program order, by other threads participating in the mbarrier.
There is no ordering and visibility guarantee for memory accesses requested by the thread after mbarrier.arrive having release semantics and prior to mbarrier.test_wait, in program order.
PTX ISA Notes
mbarrier.test_wait introduced in PTX ISA version 7.0.
Modifier .parity is introduced in PTX ISA version 7.1.
mbarrier.try_wait introduced in PTX ISA version 7.8.
Support for sub-qualifier ::cta on .shared introduced in PTX ISA version 7.8.
Support for .scope and .sem qualifiers introduced in PTX ISA version 8.0.
Support for .relaxed qualifier introduced in PTX ISA version 8.6.
Target ISA Notes
mbarrier.test_wait requires sm_80 or higher.
mbarrier.try_wait requires sm_90 or higher.
Support for .cluster scope requires sm_90 or higher.
Examples
// Example 1a, thread synchronization with test_wait:
.reg .b64 %r1;
.shared .b64 shMem;

mbarrier.init.shared.b64 [shMem], N; // N threads participating in the mbarrier.
...
mbarrier.arrive.shared.b64 %r1, [shMem]; // N threads executing mbarrier.arrive

// computation not requiring mbarrier synchronization...

waitLoop:
mbarrier.test_wait.shared.b64 complete, [shMem], %r1;
@!complete nanosleep.u32 20;
@!complete bra waitLoop;

// Example 1b, thread synchronization with try_wait :
.reg .b64 %r1;
.shared .b64 shMem;

mbarrier.init.shared.b64 [shMem], N; // N threads participating in the mbarrier.
...
mbarrier.arrive.shared.b64 %r1, [shMem]; // N threads executing mbarrier.arrive

// computation not requiring mbarrier synchronization...

waitLoop:
mbarrier.try_wait.relaxed.cluster.shared.b64 complete, [shMem], %r1;
@!complete bra waitLoop;

// Example 2, thread synchronization using phase parity :
.reg .b32 i, parArg;
.reg .b64 %r1;
.shared .b64 shMem;

mov.b32 i, 0;
mbarrier.init.shared.b64 [shMem], N; // N threads participating in the mbarrier.
...
loopStart : // One phase per loop iteration
    ...
    mbarrier.arrive.shared.b64 %r1, [shMem]; // N threads
    ...
    and.b32 parArg, i, 1;
    waitLoop:
    mbarrier.test_wait.parity.shared.b64 complete, [shMem], parArg;
    @!complete nanosleep.u32 20;
    @!complete bra waitLoop;
    ...
    add.u32 i, i, 1;
    setp.lt.u32 p, i, IterMax;
@p bra loopStart;

// Example 3, Asynchronous copy completion waiting :
.reg .b64 state;
.shared .b64 shMem2;
.shared .b64 shard1, shard2;
.global .b64 gbl1, gbl2;

mbarrier.init.shared.b64 [shMem2], threadCount;
...
cp.async.ca.shared.global [shard1], [gbl1], 4;
cp.async.cg.shared.global [shard2], [gbl2], 16;

// Absence of .noinc accounts for arrive-on from prior cp.async operation
cp.async.mbarrier.arrive.shared.b64 [shMem2];
...
mbarrier.arrive.shared.b64 state, [shMem2];

waitLoop:
mbarrier.test_wait.shared::cta.b64 p, [shMem2], state;
@!p bra waitLoop;

// Example 4, Synchronizing the CTA0 threads with cluster threads
.reg .b64 %r1, addr, remAddr;
.shared .b64 shMem;

cvta.shared.u64 addr, shMem;
mapa.u64 remAddr, addr, 0; // CTA0's shMem instance

// One thread from CTA0 executing the below initialization operation
@p0 mbarrier.init.shared::cta.b64 [shMem], N; // N = no of cluster threads

barrier.cluster.arrive;
barrier.cluster.wait;

// Entire cluster executing the below arrive operation
mbarrier.arrive.release.cluster.b64 _, [remAddr];

// computation not requiring mbarrier synchronization ...

// Only CTA0 threads executing the below wait operation
waitLoop:
mbarrier.try_wait.parity.acquire.cluster.shared::cta.b64 complete, [shMem], 0;
@!complete bra waitLoop;
Query the pending arrival count from the opaque mbarrier state.
Syntax
mbarrier.pending_count.b64 count, state;
Description
The pending count can be queried from the opaque mbarrier state using mbarrier.pending_count.
The state operand is a 64-bit register that must be the result of a prior mbarrier.arrive.noComplete or mbarrier.arrive_drop.noComplete instruction. Otherwise, the behavior is undefined.
The destination register count is a 32-bit unsigned integer representing the pending count of the mbarrier object prior to the arrive-on operation from which the state register was obtained.
The tensormap.cp_fenceproxy instructions perform the following operations in order:
Copies data of size specified by the size argument, in bytes, from the location specified by the address operand src in shared memory to the location specified by the address operand dst in the global memory, in the generic proxy.
Establishes a uni-directional proxy release pattern on the ordering from the copy operation to the subsequent access performed in the tensormap proxy on the address dst.
The valid value of immediate operand size is 128.
The operands src and dst specify non-generic addresses in shared::cta and global state space respectively.
The .scope qualifier specifies the set of threads that can directly observe the proxy synchronizing effect of this operation, as described in Memory Consistency Model.
The mandatory .sync qualifier indicates that tensormap.cp_fenceproxy causes the executing thread to wait until all threads in the warp execute the same tensormap.cp_fenceproxy instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same tensormap.cp_fenceproxy instruction. In conditionally executed code, an aligned tensormap.cp_fenceproxy instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 8.3.
Target ISA Notes
Requires sm_90 or higher.
Examples
// Example: manipulate a tensor-map object and then consume it in cp.async.bulk.tensor
.reg .b64 new_addr;
.global .align 128 .b8 gbl[128];
.shared .align 128 .b8 sMem[128];

cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [sMem], [gMem], 128, [mbar];
...
try_wait_loop:
mbarrier.try_wait.shared.b64 p, [mbar], state;
@!p bra try_wait_loop;

tensormap.replace.tile.global_address.shared.b1024.b64 [sMem], new_addr;
tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [gbl], [sMem], 128;
fence.proxy.tensormap::generic.acquire.gpu [gbl], 128;
cp.async.bulk.tensor.1d.shared::cluster.global.tile [addr0], [gbl, {tc0}], [mbar0];
The clusterlaunchcontrol.try_cancel instruction requests atomically cancelling the launch of a cluster that has not started running yet. It asynchronously writes an opaque response to shared memory indicating whether the operation succeeded or failed. The completion of the asynchronous operation is tracked using the mbarrier completion mechanism at .cluster scope.
On success, the opaque response contains the ctaid of the first CTA of the canceled cluster; no other successful response from other clusterlaunchcontrol.try_cancel operations from the same grid will contain that id.
The mandatory .async qualifier indicates that the instruction will initiate the cancellation operation asynchronously and control will return to the executing thread before the requested operation is complete.
If the .space qualifier is specified, both operands addr and mbar must be in the .shared::cta state space. Otherwise, generic addressing is assumed for both. The result is undefined if any of the address operands does not fall within the address window of .shared::cta.
The qualifier .completion_mechanism specifies that upon completion of the asynchronous operation, a complete-tx operation, with the completeCount argument equal to the amount of data stored in bytes, will be performed on the mbarrier object specified by the operand mbar.
The executing thread can then use mbarrier instructions to wait for completion of the asynchronous operation. No other synchronization mechanisms described in Memory Consistency Model can be used to guarantee the completion of the asynchronous copy operations.
The .multicast::cluster::all qualifier indicates that the response is asynchronously written using weak async-proxy writes to the corresponding local shared memory addr of each CTA in the requesting cluster. The completion of the writes to addr of a particular CTA is signaled via a complete-tx operation to the mbarrier object on the shared memory of that CTA.
The behavior of an instruction with the .multicast::cluster::all qualifier is undefined if any CTA in the cluster has exited.
Operand addr specifies the naturally aligned address of the 16-byte wide shared memory location where the request's response is written.
The response of the clusterlaunchcontrol.try_cancel instruction is a 16-byte opaque value and is available at the location specified by operand addr. After loading this response into a 16-byte register, the instruction clusterlaunchcontrol.query_cancel can be used to check whether the request was successful and to retrieve the ctaid of the first CTA of the canceled cluster.
If the executing CTA has already observed the completion of a clusterlaunchcontrol.try_cancel instruction as failed, then the behavior of issuing a subsequent clusterlaunchcontrol.try_cancel instruction is undefined.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_100 or higher.
Qualifier .multicast::cluster::all is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
Examples
// Assumption: 1D cluster (cluster_ctaid.y/.z == 1)
// with 1 thread per CTA.

// Current Cluster to be processed, initially the
// currently launched cluster:
mov.b32 xctaid, %ctaid.x;
barrier.cluster.arrive.relaxed;
processCluster:

// Wait on all cluster CTAs completing initialization or processing of previous cluster:
barrier.cluster.wait.acquire;
mov.u32 %r0, %tid.x;
setp.eq.u32 p0, %r0, 0x0;
@!p0 bra asyncWork;

// All CTAs in the cluster arrive at their local
// SMEM barrier and set 16B handle tx count:
mbarrier.arrive.expect_tx.cluster.relaxed.shared::cta.b64 state, [mbar], 16;

// First CTA in Cluster attempts to cancel a
// not-yet-started cluster:
mov.u32 %r0, %cluster_ctaid.x;
setp.eq.u32 p0, %r0, 0x0;
@p0 clusterlaunchcontrol.try_cancel.async.mbarrier::complete_tx::bytes.multicast::cluster::all.b128 [addr], [mbar];

asyncWork:
// ...process xctaid while cancellation request completes
// asynchronously...

// All CTAs in Cluster wait on cancellation responses on their local SMEM:
waitLoop:
// .acquire prevents the load of the handle from overtaking this read:
mbarrier.try_wait.cluster.acquire.shared::cta.b64 complete, [mbar], state;
@!complete bra waitLoop;

// Load response into 16-byte wide register after unblocking
// from mbarrier:
ld.shared.b128 handle, [addr];

// Check whether cancellation succeeded:
clusterlaunchcontrol.query_cancel.is_canceled.pred.b128 p, handle;
@!p ret; // If failed, we are done and exit.

// Otherwise, read ctaid of first CTA of cancelled Cluster for next iteration...
@p clusterlaunchcontrol.query_cancel.get_first_ctaid.v4.b32.b128 {xctaid, _, _, _}, handle;

// ...and signal CTA0 that we are done reading from handle:
// Fence generic->async
fence.proxy.async.shared::cta;
barrier.cluster.arrive.relaxed;

bra processCluster;
The instruction clusterlaunchcontrol.query_cancel can be used to decode the opaque response written by the instruction clusterlaunchcontrol.try_cancel.
After loading the response from the clusterlaunchcontrol.try_cancel instruction into a 16-byte register, it can be further queried using the clusterlaunchcontrol.query_cancel instruction as follows:
clusterlaunchcontrol.query_cancel.is_canceled.pred.b128: If the cluster is canceled successfully, predicate p is set to true; otherwise, it is set to false.
If the request succeeded, the instruction clusterlaunchcontrol.query_cancel.get_first_ctaid extracts the CTA id of the first CTA in the canceled cluster. By default, the instruction returns a .v4 vector whose first three elements are the x, y and z coordinates of the first CTA in the canceled cluster. The contents of the 4th element are unspecified. The explicit .get_first_ctaid::x, .get_first_ctaid::y, or .get_first_ctaid::z qualifiers can be used to extract the individual x, y or z coordinate into a 32-bit register.
If the request fails, the behavior of clusterlaunchcontrol.query_cancel.get_first_ctaid is undefined.
The matrix multiply and accumulate operation has the following form:
D = A * B + C
where D and C are called accumulators and may refer to the same matrix.
PTX provides two ways to perform matrix multiply-and-accumulate computation:
Using wmma instructions:
This warp-level computation is performed collectively by all threads in the warp as follows:
Load matrices A, B and C from memory into registers using the wmma.load operation. When the operation completes, the destination registers in each thread hold a fragment of the loaded matrix.
Perform the matrix multiply and accumulate operation using the wmma.mma operation on the loaded matrices. When the operation completes, the destination registers in each thread hold a fragment of the result matrix returned by the wmma.mma operation.
Store result matrix D back to memory using the wmma.store operation. Alternatively, result matrix D can also be used as argument C for a subsequent wmma.mma operation.
The wmma.load and wmma.store instructions implicitly handle the organization of matrix elements when loading the input matrices from memory for the wmma.mma operation and when storing the result back to memory.
Using mma instruction:
Similar to wmma, mma also requires computation to be performed collectively by all threads in the warp; however, distribution of matrix elements across different threads in the warp needs to be done explicitly before invoking the mma operation. The mma instruction supports both dense as well as sparse matrix A. The sparse variant can be used when A is a structured sparse matrix as described in Sparse matrix storage.
The matrix multiply and accumulate operations support a limited set of shapes for the operand matrices A, B and C. The shapes of all three matrix operands are collectively described by the tuple MxNxK, where A is an MxK matrix, B is a KxN matrix, while C and D are MxN matrices.
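As an illustration of the MxNxK convention, the following Python sketch derives the operand dimensions from a shape qualifier. The helper name operand_dims is hypothetical and not part of PTX; it only restates the rule in the paragraph above.

```python
import re

def operand_dims(shape):
    """Return (rows, cols) of A, B, C and D for a shape qualifier such as '.m16n8k32'.

    Hypothetical helper: A is MxK, B is KxN, C and D are MxN.
    """
    match = re.fullmatch(r"\.?m(\d+)n(\d+)k(\d+)", shape)
    if match is None:
        raise ValueError("not an MxNxK shape qualifier: %r" % shape)
    m, n, k = (int(group) for group in match.groups())
    return {"A": (m, k), "B": (k, n), "C": (m, n), "D": (m, n)}

if __name__ == "__main__":
    print(operand_dims(".m16n8k32"))
```

For .m16n8k32 this gives a 16x32 matrix A, a 32x8 matrix B, and 16x8 matrices C and D.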
The following matrix shapes are supported for the specified types:

| Instruction | Scale | Sparsity | Multiplicand Data-type | Shape | PTX ISA version |
|---|---|---|---|---|---|
| wmma | NA | Dense | Floating-point - .f16 | .m16n16k16, .m8n32k16, and .m32n8k16 | PTX ISA version 6.0 |
| wmma | NA | Dense | Alternate floating-point format - .bf16 | .m16n16k16, .m8n32k16, and .m32n8k16 | PTX ISA version 7.0 |
| wmma | NA | Dense | Alternate floating-point format - .tf32 | .m16n16k8 | PTX ISA version 7.0 |
| wmma | NA | Dense | Integer - .u8/.s8 | .m16n16k16, .m8n32k16, and .m32n8k16 | PTX ISA version 6.3 |
| wmma | NA | Dense | Sub-byte integer - .u4/.s4 | .m8n8k32 | PTX ISA version 6.3 (preview feature) |
| wmma | NA | Dense | Single-bit - .b1 | .m8n8k128 | PTX ISA version 6.3 (preview feature) |
| mma | NA | Dense | Floating-point - .f64 | .m8n8k4 | PTX ISA version 7.0 |
| mma | NA | Dense | Floating-point - .f64 | .m16n8k4, .m16n8k8, and .m16n8k16 | PTX ISA version 7.8 |
| mma | NA | Dense | Floating-point - .f16 | .m8n8k4 | PTX ISA version 6.4 |
| mma | NA | Dense | Floating-point - .f16 | .m16n8k8 | PTX ISA version 6.5 |
| mma | NA | Dense | Floating-point - .f16 | .m16n8k16 | PTX ISA version 7.0 |
| mma | NA | Dense | Alternate floating-point format - .bf16 | .m16n8k8 and .m16n8k16 | PTX ISA version 7.0 |
| mma | NA | Dense | Alternate floating-point format - .tf32 | .m16n8k4 and .m16n8k8 | PTX ISA version 7.0 |
| mma | NA | Dense | Integer - .u8/.s8 | .m8n8k16 | PTX ISA version 6.5 |
| mma | NA | Dense | Integer - .u8/.s8 | .m16n8k16 and .m16n8k32 | PTX ISA version 7.0 |
| mma | NA | Dense | Sub-byte integer - .u4/.s4 | .m8n8k32 | PTX ISA version 6.5 |
| mma | NA | Dense | Sub-byte integer - .u4/.s4 | .m16n8k32 and .m16n8k64 | PTX ISA version 7.0 |
| mma | NA | Dense | Single-bit - .b1 | .m8n8k128, .m16n8k128, and .m16n8k256 | PTX ISA version 7.0 |
| mma | NA | Dense | Alternate floating-point format - .e4m3/.e5m2 | .m16n8k32 | PTX ISA version 8.4 |
| mma | NA | Dense | Alternate floating-point format - .e4m3/.e5m2 | .m16n8k16 | PTX ISA version 8.7 |
| mma | NA | Dense | Alternate floating-point format - .e3m2/.e2m3/.e2m1 | .m16n8k32 | PTX ISA version 8.7 |
| mma | Yes | Dense | Alternate floating-point format - .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 X (Scale) .ue8m0 | .m16n8k32 | PTX ISA version 8.7 |
| mma | Yes | Dense | Alternate floating-point format - .e2m1 X (Scale) .ue8m0/.ue4m3 | .m16n8k64 | PTX ISA version 8.7 |
| mma | NA | Sparse | Floating-point - .f16 | .m16n8k16 and .m16n8k32 | PTX ISA version 7.1 |
| mma | NA | Sparse | Alternate floating-point format - .bf16 | .m16n8k16 and .m16n8k32 | PTX ISA version 7.1 |
| mma | NA | Sparse | Alternate floating-point format - .tf32 | .m16n8k8 and .m16n8k16 | PTX ISA version 7.1 |
| mma | NA | Sparse | Integer - .u8/.s8 | .m16n8k32 and .m16n8k64 | PTX ISA version 7.1 |
| mma | NA | Sparse | Sub-byte integer - .u4/.s4 | .m16n8k64 and .m16n8k128 | PTX ISA version 7.1 |
| mma | NA | Sparse | Alternate floating-point format - .e4m3/.e5m2 | .m16n8k64 | PTX ISA version 8.4 |
| mma | NA | Sparse with ordered metadata | Floating-point - .f16 | .m16n8k16 and .m16n8k32 | PTX ISA version 8.5 |
| mma | NA | Sparse with ordered metadata | Alternate floating-point format - .bf16 | .m16n8k16 and .m16n8k32 | PTX ISA version 8.5 |
| mma | NA | Sparse with ordered metadata | Alternate floating-point format - .tf32 | .m16n8k8 and .m16n8k16 | PTX ISA version 8.5 |
| mma | NA | Sparse with ordered metadata | Integer - .u8/.s8 | .m16n8k32 and .m16n8k64 | PTX ISA version 8.5 |
| mma | NA | Sparse with ordered metadata | Sub-byte integer - .u4/.s4 | .m16n8k64 and .m16n8k128 | PTX ISA version 8.5 |
| mma | NA | Sparse with ordered metadata | Alternate floating-point format - .e4m3/.e5m2 | .m16n8k64 | PTX ISA version 8.5 |
| mma | NA | Sparse with ordered metadata | Alternate floating-point format - .e3m2/.e2m3/.e2m1 | .m16n8k64 | PTX ISA version 8.7 |
| mma | Yes | Sparse with ordered metadata | Alternate floating-point format - .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 X (Scale) .ue8m0 | .m16n8k64 | PTX ISA version 8.7 |
| mma | Yes | Sparse with ordered metadata | Alternate floating-point format - .e2m1 X (Scale) .ue8m0/.ue4m3 | | |

The matrix multiply and accumulate operation is supported separately on integer, floating-point, sub-byte integer and single bit data-types. All operands must contain the same basic type kind, i.e., integer or floating-point.
For floating-point matrix multiply and accumulate operation, different matrix operands may have different precision, as described later.

| Data-type | Multiplicands (A or B) | Accumulators (C or D) |
|---|---|---|
| Integer | .u8, .s8 | .s32 |
| Floating Point | .f16 | .f16, .f32 |
| Alternate floating Point | .bf16 | .f32 |
| Alternate floating Point | .tf32 | .f32 |
| Alternate floating Point | .e4m3 or .e5m2 or .e3m2 or .e2m3 or .e2m1 | .f16, .f32 |
| Alternate floating Point with scale | .e4m3 or .e5m2 or .e3m2 or .e2m3 or .e2m1 X (Scale) .ue8m0 | |

The mma instruction with one of the following .kind qualifiers:
.kind::mxf8f6f4
.kind::mxf4
.kind::mxf4nvf4
performs matrix multiplication with block scaling. This operation has the following form: D = (A * scale_A) * (B * scale_B) + C.
For a scale_A matrix of shape M x SFA_N, each row of matrix A is divided into SFA_N chunks and each chunk of a row is multiplied with the corresponding element (henceforth referred to as SF_A) from the same row of scale_A.
Similarly, for a scale_B matrix of shape SFB_M x N, each column of matrix B is divided into SFB_M chunks and each chunk of a column is multiplied with the corresponding element (henceforth referred to as SF_B) from the same column of scale_B.
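The block-scaling form can be sketched as a plain Python reference model. This is only an arithmetic illustration, not how the hardware computes the result: the function name block_scaled_mma is hypothetical, matrices are nested lists, and the chunks are assumed to be of equal size (K divisible by both SFA_N and SFB_M).

```python
def block_scaled_mma(A, B, C, scale_A, scale_B):
    """Reference model of D = (A * scale_A) * (B * scale_B) + C.

    A is M x K, B is K x N, C is M x N,
    scale_A is M x SFA_N and scale_B is SFB_M x N.
    Assumption: every chunk has the same length.
    """
    M, K, N = len(A), len(B), len(B[0])
    chunk_a = K // len(scale_A[0])   # elements of a row of A per SF_A
    chunk_b = K // len(scale_B)      # elements of a column of B per SF_B
    D = [[C[i][j] for j in range(N)] for i in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                a = A[i][k] * scale_A[i][k // chunk_a]
                b = B[k][j] * scale_B[k // chunk_b][j]
                D[i][j] += a * b
    return D

if __name__ == "__main__":
    A = [[1, 2, 3, 4]]            # M = 1, K = 4
    B = [[1], [1], [1], [1]]      # K = 4, N = 1
    C = [[10]]
    scale_A = [[2, 3]]            # two chunks per row of A
    scale_B = [[1], [10]]         # two chunks per column of B
    print(block_scaled_mma(A, B, C, scale_A, scale_B))
```

In the example, the first two elements of the row of A are scaled by 2 and the last two by 3, while the last two elements of the column of B are scaled by 10.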
Figure 42 shows an example of mma with block scaling of scale_vec::2X.
The scale-a-data and scale-b-data arguments provide metadata for the scale_A and scale_B matrices respectively. The tuples {byte-id-a, thread-id-a} and {byte-id-b, thread-id-b} provide the selector information to choose elements SF_A and SF_B from the corresponding metadata arguments scale-a-data and scale-b-data. The tuple {byte-id-a, thread-id-a} selects the scale matrix element SF_A from scale-a-data. Similarly, the tuple {byte-id-b, thread-id-b} selects the scale matrix element SF_B from scale-b-data.
The components thread-id-a and thread-id-b decide which threads among the quad contribute the SF_A and SF_B values. The following listing describes the impact of the thread selector components thread-id-a and thread-id-b:
One thread-pair within the quad, determined by thread-id-a, contributes the SF_A values. The value of 0 selects the lower two threads whereas the value of 1 selects the upper two threads from the quad. In other words, when thread-id-a is set to 0, the thread-pair satisfying %laneid % 4 == 0 or 1 provides the SF_A. In contrast, when thread-id-a is set to 1, the thread-pair satisfying %laneid % 4 == 2 or 3 provides the SF_A. Refer to Figure 43 for more details.
Figure 43 Selection of set of values for SF_A based on thread-id-a
One thread within the quad, determined by thread-id-b, contributes the SF_B value. In other words, each thread satisfying %laneid % 4 == thread-id-b provides the SF_B. Refer to Figure 44 for more details.
Figure 44 Selection of set of values for SF_B based on thread-id-b
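The thread selection rules above can be enumerated with a short Python sketch. The function names are hypothetical, and a warp of 32 lanes is assumed; the code only lists which lanes satisfy the stated %laneid % 4 conditions.

```python
WARP_SIZE = 32  # assumption: lanes are numbered 0..31

def sf_a_lanes(thread_id_a):
    """Lanes that provide SF_A: the lower (0) or upper (1) thread-pair of each quad."""
    if thread_id_a not in (0, 1):
        raise ValueError("thread-id-a must be 0 or 1")
    wanted = (0, 1) if thread_id_a == 0 else (2, 3)
    return [lane for lane in range(WARP_SIZE) if lane % 4 in wanted]

def sf_b_lanes(thread_id_b):
    """Lanes that provide SF_B: one thread of each quad."""
    if thread_id_b not in (0, 1, 2, 3):
        raise ValueError("thread-id-b must be in 0..3")
    return [lane for lane in range(WARP_SIZE) if lane % 4 == thread_id_b]

if __name__ == "__main__":
    print(sf_a_lanes(1)[:6])   # lanes 2, 3, 6, 7, 10, 11
    print(sf_b_lanes(3)[:4])   # lanes 3, 7, 11, 15
```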
The arguments byte-id-a and byte-id-b select which bytes from scale-a-data and scale-b-data contribute the SF_A and SF_B values. The following listing describes the implications of the .scale_vec_size qualifier on the byte selector components byte-id-a and byte-id-b:
When .scale_vec_size is .scale_vec::1X
One byte each within scale-a-data and scale-b-data, determined by byte-id-a and byte-id-b respectively, contributes the SF_A and SF_B values.
When .scale_vec_size is .scale_vec::2X
One byte-pair (two bytes) within scale-a-data and scale-b-data, determined by byte-id-a and byte-id-b, contributes the SF_A and SF_B values. The value of 0 selects the lower two bytes whereas the value of 2 selects the upper two bytes from the corresponding metadata value.
When .scale_vec_size is .scale_vec::4X
All four bytes within scale-a-data and scale-b-data contribute the values. Hence, byte-id-a and byte-id-b must be zero.
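The byte selection rules can be modelled as follows. This is a sketch under stated assumptions: the metadata value is treated as a 32-bit integer whose byte 0 is the least significant byte, byte ids 0 to 3 are accepted for .scale_vec::1X, and the helper name select_scale_bytes is hypothetical.

```python
def select_scale_bytes(scale_data, scale_vec, byte_id):
    """Return the bytes of a 32-bit metadata value that contribute scale factors.

    Assumption: byte 0 is the least significant byte of scale_data.
    """
    all_bytes = [(scale_data >> (8 * i)) & 0xFF for i in range(4)]
    if scale_vec == "1X":
        if byte_id not in (0, 1, 2, 3):
            raise ValueError("byte id must be in 0..3 for 1X")
        return [all_bytes[byte_id]]
    if scale_vec == "2X":
        if byte_id not in (0, 2):
            raise ValueError("byte id must be 0 or 2 for 2X")
        return all_bytes[byte_id:byte_id + 2]
    if scale_vec == "4X":
        if byte_id != 0:
            raise ValueError("byte id must be 0 for 4X")
        return all_bytes
    raise ValueError("unknown scale vector size: %r" % scale_vec)

if __name__ == "__main__":
    data = 0x44332211
    print([hex(b) for b in select_scale_bytes(data, "2X", 2)])
```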
Each thread in the warp holds a fragment of the matrix. The distribution of fragments loaded by the threads in a warp is unspecified and is target architecture dependent, and hence the identity of the fragment within the matrix is also unspecified and is target architecture dependent. The fragment returned by a wmma operation can be used as an operand for another wmma operation if the shape, layout and element type of the underlying matrix match. Since fragment layout is architecture dependent, using the fragment returned by a wmma operation in one function as an operand for a wmma operation in a different function may not work as expected if the two functions are linked together but were compiled for different link-compatible SM architectures. Note that passing a wmma fragment to a function having .weak linkage is unsafe, since at link time references to such a function may get resolved to a function in a different compilation module.
Each fragment is a vector expression whose contents are determined as follows. The identity of individual matrix elements in the fragment is unspecified.
Integer fragments
Multiplicands (A or B):

| Data-type | Shape | Matrix | Fragment |
|---|---|---|---|
| .u8 or .s8 | .m16n16k16 | A | A vector expression of two .b32 registers, with each register containing four elements from the matrix. |
| .u8 or .s8 | .m16n16k16 | B | A vector expression of two .b32 registers, with each register containing four elements from the matrix. |
| .u8 or .s8 | .m8n32k16 | A | A vector expression containing a single .b32 register containing four elements from the matrix. |
| .u8 or .s8 | .m8n32k16 | B | A vector expression of four .b32 registers, with each register containing four elements from the matrix. |
| .u8 or .s8 | .m32n8k16 | A | A vector expression of four .b32 registers, with each register containing four elements from the matrix. |
| .u8 or .s8 | .m32n8k16 | B | A vector expression containing a single .b32 register containing four elements from the matrix. |

Accumulators (C or D):

| Data-type | Shape | Fragment |
|---|---|---|
| .s32 | .m16n16k16 | A vector expression of eight .s32 registers. |
| .s32 | .m8n32k16 | A vector expression of eight .s32 registers. |
| .s32 | .m32n8k16 | A vector expression of eight .s32 registers. |

Floating point fragments

| Data-type | Matrix | Fragment |
|---|---|---|
| .f16 | A or B | A vector expression of eight .f16x2 registers. |
| .f16 | C or D | A vector expression of four .f16x2 registers. |
| .f32 | C or D | A vector expression of eight .f32 registers. |

Floating point fragments for .bf16 data format
Multiplicands (A or B):

| Data-type | Shape | Matrix | Fragment |
|---|---|---|---|
| .bf16 | .m16n16k16 | A | A vector expression of four .b32 registers, with each register containing two elements from the matrix. |
| .bf16 | .m16n16k16 | B | A vector expression of four .b32 registers, with each register containing two elements from the matrix. |
| .bf16 | .m8n32k16 | A | A vector expression containing two .b32 registers, with each register containing two elements from the matrix. |
| .bf16 | .m8n32k16 | B | A vector expression of eight .b32 registers, with each register containing two elements from the matrix. |
| .bf16 | .m32n8k16 | A | A vector expression of eight .b32 registers, with each register containing two elements from the matrix. |
| .bf16 | .m32n8k16 | B | A vector expression containing two .b32 registers, with each register containing two elements from the matrix. |

Accumulators (C or D):

| Data-type | Matrix | Fragment |
|---|---|---|
| .f32 | C or D | A vector expression containing eight .f32 registers. |

Floating point fragments for .tf32 data format
Multiplicands (A or B):

| Data-type | Shape | Matrix | Fragment |
|---|---|---|---|
| .tf32 | .m16n16k8 | A | A vector expression of four .b32 registers. |
| .tf32 | .m16n16k8 | B | A vector expression of four .b32 registers. |

Accumulators (C or D):

| Data-type | Shape | Matrix | Fragment |
|---|---|---|---|
| .f32 | .m16n16k8 | C or D | A vector expression containing eight .f32 registers. |

Double precision floating point fragments
Multiplicands (A or B):

| Data-type | Shape | Matrix | Fragment |
|---|---|---|---|
| .f64 | .m8n8k4 | A or B | A vector expression of a single .f64 register. |

Accumulators (C or D):

| Data-type | Shape | Matrix | Fragment |
|---|---|---|---|
| .f64 | .m8n8k4 | C or D | A vector expression containing a single .f64 register. |

Sub-byte integer and single-bit fragments
Multiplicands (A or B):

| Data-type | Shape | Fragment |
|---|---|---|
| .u4 or .s4 | .m8n8k32 | A vector expression containing a single .b32 register, containing eight elements from the matrix. |
| .b1 | .m8n8k128 | A vector expression containing a single .b32 register, containing 32 elements from the matrix. |

Accumulators (C or D):

| Data-type | Shape | Fragment |
|---|---|---|
| .s32 | .m8n8k32 | A vector expression of two .s32 registers. |
| .s32 | .m8n8k128 | A vector expression of two .s32 registers. |

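As a sanity check on the fragment sizes listed above, the Python sketch below tests whether a fragment corresponds to an even split of the matrix over the 32 threads of a warp. This relationship is an observation, not something the tables state: it holds for the integer, .bf16, sub-byte, single-bit and .s32 rows sampled here, but not for every row (the .f16 multiplicand fragment of eight .f16x2 registers holds more values than an even split would give). The helper name and the selection of rows are hypothetical.

```python
WARP_SIZE = 32

def matches_even_split(rows, cols, registers, elems_per_register):
    """True if registers * elements-per-register * 32 threads equals rows * cols."""
    return registers * elems_per_register * WARP_SIZE == rows * cols

# (description, matrix rows, matrix cols, registers, elements per register)
CASES = [
    (".u8/.s8 .m16n16k16 A", 16, 16, 2, 4),
    (".u8/.s8 .m8n32k16 A", 8, 16, 1, 4),
    (".u8/.s8 .m8n32k16 B", 16, 32, 4, 4),
    (".bf16 .m16n16k16 A", 16, 16, 4, 2),
    (".bf16 .m32n8k16 B", 16, 8, 2, 2),
    (".u4/.s4 .m8n8k32 A", 8, 32, 1, 8),
    (".b1 .m8n8k128 A", 8, 128, 1, 32),
    (".s32 .m16n16k16 C/D", 16, 16, 8, 1),
]

if __name__ == "__main__":
    for name, rows, cols, regs, per_reg in CASES:
        print(name, matches_even_split(rows, cols, regs, per_reg))
```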
Manipulating fragment contents
The contents of a matrix fragment can be manipulated by reading and writing to individual registers in the fragment, provided the following conditions are satisfied:
All matrix elements in the fragment are operated on uniformly across threads, using the same parameters.
The order of the matrix elements is not changed.
For example, if each register corresponding to a given matrix is multiplied by a uniform constant value, then the resulting matrix is simply the scaled version of the original matrix.
Note that type conversion between .f16 and .f32 accumulator fragments is not supported in either direction. The result is undefined even if the order of elements in the fragment remains unchanged.
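The scaling example can be illustrated with a small Python model. It scatters matrix elements over 32 per-thread fragments in an arbitrary order (standing in for the unspecified, architecture-dependent layout), scales every register by the same constant, and gathers the result. The scatter and gather helpers are hypothetical and exist only for this illustration.

```python
import random

WARP_SIZE = 32

def scatter(elements, rng):
    """Distribute elements over 32 fragments in an arbitrary, hidden order."""
    order = list(range(len(elements)))
    rng.shuffle(order)
    per_thread = len(order) // WARP_SIZE
    fragments = [
        [elements[order[t * per_thread + r]] for r in range(per_thread)]
        for t in range(WARP_SIZE)
    ]
    return fragments, order

def gather(fragments, order):
    """Undo scatter: put every fragment value back at its matrix position."""
    flat = [value for fragment in fragments for value in fragment]
    result = [None] * len(order)
    for position, index in enumerate(order):
        result[index] = flat[position]
    return result

def scale_fragments(fragments, factor):
    """Apply the same operation, with the same parameter, to every register."""
    return [[value * factor for value in fragment] for fragment in fragments]

if __name__ == "__main__":
    matrix = [float(i) for i in range(256)]          # a flattened 16x16 matrix
    fragments, order = scatter(matrix, random.Random(0))
    scaled = gather(scale_fragments(fragments, 0.5), order)
    print(scaled == [value * 0.5 for value in matrix])
```

Because the operation is uniform and the element order is untouched, the outcome does not depend on which layout the shuffle happens to produce.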
Each matrix can be stored in memory with a row-major or column-major layout. In a row-major format, consecutive elements of each row are stored in contiguous memory locations, and the row is called the leading dimension of the matrix. In a column-major format, consecutive elements of each column are stored in contiguous memory locations and the column is called the leading dimension of the matrix.
Consecutive instances of the leading dimension (rows or columns) need not be stored contiguously in memory. The wmma.load and wmma.store operations accept an optional argument stride that specifies the offset from the beginning of each row (or column) to the next, in terms of matrix elements (and not bytes). For example, the matrix being accessed by a wmma operation may be a submatrix from a larger matrix stored in memory. This allows the programmer to compose a multiply-and-accumulate operation on matrices that are larger than the shapes supported by the wmma operation.
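The role of stride when addressing a submatrix can be sketched in Python. The helpers element_offset and load_tile are hypothetical, memory is modelled as a flat list of elements, and offsets are counted in elements, as the stride operand is.

```python
def element_offset(row, col, stride, layout):
    """Offset (in elements) of element (row, col) from the start of the matrix."""
    if layout == "row":
        return row * stride + col
    if layout == "col":
        return col * stride + row
    raise ValueError("layout must be 'row' or 'col'")

def load_tile(memory, base, rows, cols, stride, layout="row"):
    """Read a rows x cols tile whose first element sits at index base."""
    return [
        [memory[base + element_offset(r, c, stride, layout)] for c in range(cols)]
        for r in range(rows)
    ]

if __name__ == "__main__":
    # A 4x6 row-major matrix whose element (i, j) has the value 10*i + j.
    big_cols = 6
    memory = [10 * i + j for i in range(4) for j in range(big_cols)]
    # 2x3 tile starting at (1, 2): the stride is the larger matrix's row length.
    base = 1 * big_cols + 2
    print(load_tile(memory, base, 2, 3, stride=big_cols))
```

The tile's own row length is 3, but the stride is 6, the leading dimension of the larger matrix it is cut from.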
Address Alignment
The starting address of each instance of the leading dimension (row or column) must be aligned with the size of the corresponding fragment in bytes. Note that the starting address is determined by the base pointer and the optional stride.
Fragment size in bytes = 32 (eight elements of type .f16x2)
Actual stride in bytes = 2 * s (since stride is specified in terms of .f16 elements, not bytes)
For each row of this matrix to be aligned at fragment size the following must be true:
p is a multiple of 32.
2*s is a multiple of 32.
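The two conditions can be checked with a few lines of Python. The sketch takes p as a byte address and s as a stride in .f16 elements, as the conditions above use them; the function name rows_are_aligned is hypothetical.

```python
FRAGMENT_BYTES = 32   # eight .f16x2 registers of 4 bytes each
ELEMENT_BYTES = 2     # one .f16 element

def rows_are_aligned(p, s):
    """True if every row start (p + i * 2 * s bytes) is a multiple of 32."""
    return p % FRAGMENT_BYTES == 0 and (ELEMENT_BYTES * s) % FRAGMENT_BYTES == 0

if __name__ == "__main__":
    print(rows_are_aligned(0x1000, 16))   # stride of 16 elements = 32 bytes
    print(rows_are_aligned(0x1000, 24))   # stride of 24 elements = 48 bytes
```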
Default value for stride
The default value of the stride is the size of the leading dimension of the matrix. For example, for an MxK matrix, the stride is K for a row-major layout and M for a column-major layout. In particular, the default strides for the supported matrix shapes are as follows:
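The rule for the default stride can be written as a one-line lookup. The Python sketch below uses a hypothetical helper name and applies the rule to the A matrix of shape .m16n16k8, which is 16x8.

```python
def default_stride(rows, cols, layout):
    """Default stride in elements: the size of the leading dimension."""
    if layout == "row":
        return cols   # a row-major row holds `cols` elements
    if layout == "col":
        return rows   # a column-major column holds `rows` elements
    raise ValueError("layout must be 'row' or 'col'")

if __name__ == "__main__":
    m, k = 16, 8   # matrix A of shape .m16n16k8 is MxK
    print(default_stride(m, k, "row"), default_stride(m, k, "col"))
```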
Collectively load a matrix across all threads in a warp from the location indicated by address operand p in the specified state space into destination register r.
If no state space is given, perform the memory accesses using Generic Addressing. The wmma.load operation may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space.
The mutually exclusive qualifiers .a, .b and .c indicate whether matrix A, B or C is being loaded respectively for the wmma computation.
The destination operand r is a brace-enclosed vector expression that can hold the fragment returned by the load operation, as described in Matrix Fragments for WMMA.
The .shape qualifier indicates the dimensions of all the matrix arguments involved in the intended wmma computation.
The .layout qualifier indicates whether the matrix to be loaded is stored in row-major or column-major format.
stride is an optional 32-bit integer operand that provides an offset in terms of matrix elements between the start of consecutive instances of the leading dimension (rows or columns). The default value of stride is described in Matrix Storage for WMMA and must be specified if the actual value is larger than the default. For example, if the matrix is a sub-matrix of a larger matrix, then the value of stride is the leading dimension of the larger matrix. Specifying a value lower than the default value results in undefined behavior.
The mandatory .sync qualifier indicates that wmma.load causes the executing thread to wait until all threads in the warp execute the same wmma.load instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same wmma.load instruction. In conditionally executed code, a wmma.load instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
The behavior of wmma.load is undefined if all threads do not use the same qualifiers and the same values of p and stride, or if any thread in the warp has exited.
.m8n32k16 and .m32n8k16 introduced in PTX ISA version 6.1.
Integer, sub-byte integer and single-bit wmma introduced in PTX ISA version 6.3.
.m8n8k4 and .m16n16k8 on wmma introduced in PTX ISA version 7.0.
Double precision and alternate floating point precision wmma introduced in PTX ISA version 7.0.
Modifier .aligned is required from PTX ISA version 6.3 onwards, and considered implicit in PTX ISA versions less than 6.3.
Support for ::cta sub-qualifier introduced in PTX ISA version 7.8.
Preview Feature:
Sub-byte wmma and single-bit wmma are preview features in PTX ISA version 6.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.
Target ISA Notes
Floating point wmma requires sm_70 or higher.
Integer wmma requires sm_72 or higher.
Sub-byte and single-bit wmma requires sm_75 or higher.
Double precision and alternate floating point precision wmma requires sm_80 or higher.
Examples
// Load elements from f16 row-major matrix B
.reg .b32 x<8>;
wmma.load.b.sync.aligned.m16n16k16.row.f16 {x0,x1,x2,x3,x4,x5,x6,x7}, [ptr];
// Now use {x0, ..., x7} for the actual wmma.mma

// Load elements from f32 column-major matrix C and scale the values:
.reg .b32 x<8>;
wmma.load.c.sync.aligned.m16n16k16.col.f32 {x0,x1,x2,x3,x4,x5,x6,x7}, [ptr];
mul.f32 x0, x0, 0.1;
// repeat for all registers x<8>;
...
mul.f32 x7, x7, 0.1;
// Now use {x0, ..., x7} for the actual wmma.mma

// Load elements from integer matrix A:
.reg .b32 x<4>;
// destination registers x<4> contain four packed .u8 values each
wmma.load.a.sync.aligned.m32n8k16.row.u8 {x0,x1,x2,x3}, [ptr];

// Load elements from sub-byte integer matrix A:
.reg .b32 x0;
// destination register x0 contains eight packed .s4 values
wmma.load.a.sync.aligned.m8n8k32.row.s4 {x0}, [ptr];

// Load elements from .bf16 matrix A:
.reg .b32 x<4>;
wmma.load.a.sync.aligned.m16n16k16.row.bf16 {x0,x1,x2,x3}, [ptr];

// Load elements from .tf32 matrix A:
.reg .b32 x<4>;
wmma.load.a.sync.aligned.m16n16k8.row.tf32 {x0,x1,x2,x3}, [ptr];

// Load elements from .f64 matrix A:
.reg .f64 x0;
wmma.load.a.sync.aligned.m8n8k4.row.f64 {x0}, [ptr];
Collectively store a matrix across all threads in a warp at the location indicated by address operand p in the specified state space from source register r.
If no state space is given, perform the memory accesses using Generic Addressing. The wmma.store operation may be used only with .global and .shared spaces and with generic addressing, where the address points to .global or .shared space.
The source operand r is a brace-enclosed vector expression that matches the shape of the fragment expected by the store operation, as described in Matrix Fragments for WMMA.
The .shape qualifier indicates the dimensions of all the matrix arguments involved in the intended wmma computation. It must match the .shape qualifier specified on the wmma.mma instruction that produced the D matrix being stored.
The .layout qualifier indicates whether the matrix is to be stored in row-major or column-major format.
stride is an optional 32-bit integer operand that provides an offset in terms of matrix elements between the start of consecutive instances of the leading dimension (rows or columns). The default value of stride is described in Matrix Storage for WMMA and must be specified if the actual value is larger than the default. For example, if the matrix is a sub-matrix of a larger matrix, then the value of stride is the leading dimension of the larger matrix. Specifying a value lower than the default value results in undefined behavior.
The mandatory .sync qualifier indicates that wmma.store causes the executing thread to wait until all threads in the warp execute the same wmma.store instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same wmma.store instruction. In conditionally executed code, a wmma.store instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
The behavior of wmma.store is undefined if all threads do not use the same qualifiers and the same values of p and stride, or if any thread in the warp has exited.
.m8n32k16 and .m32n8k16 introduced in PTX ISA version 6.1.
Integer, sub-byte integer and single-bit wmma introduced in PTX ISA version 6.3.
.m16n16k8 introduced in PTX ISA version 7.0.
Double precision wmma introduced in PTX ISA version 7.0.
Modifier .aligned is required from PTX ISA version 6.3 onwards, and considered implicit in PTX ISA versions less than 6.3.
Support for ::cta sub-qualifier introduced in PTX ISA version 7.8.
Preview Feature:
Sub-byte wmma and single-bit wmma are preview features in PTX ISA version 6.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.
Target ISA Notes
Floating point wmma requires sm_70 or higher.
Integer wmma requires sm_72 or higher.
Sub-byte and single-bit wmma requires sm_75 or higher.
Double precision wmma and shape .m16n16k8 requires sm_80 or higher.
Examples
// Storing f32 elements computed by a wmma.mma
.reg .b32 d<8>;
wmma.mma.sync.m16n16k16.row.col.f32.f32 {d0, d1, d2, d3, d4, d5, d6, d7}, ...;
wmma.store.d.sync.m16n16k16.row.f32 [ptr], {d0, d1, d2, d3, d4, d5, d6, d7};

// Store s32 accumulator for m16n16k16 shape:
.reg .b32 d<8>;
wmma.store.d.sync.aligned.m16n16k16.row.s32 [ptr], {d0, d1, d2, d3, d4, d5, d6, d7};

// Store s32 accumulator for m8n8k128 shape:
.reg .b32 d<2>;
wmma.store.d.sync.aligned.m8n8k128.row.s32 [ptr], {d0, d1};

// Store f64 accumulator for m8n8k4 shape:
.reg .f64 d<2>;
wmma.store.d.sync.aligned.m8n8k4.row.f64 [ptr], {d0, d1};
Perform a warp-level matrix multiply-and-accumulate computation D = A * B + C using matrices A, B and C loaded in registers a, b and c respectively, and store the result matrix in register d. The register arguments a, b, c and d hold unspecified fragments of the corresponding matrices as described in Matrix Fragments for WMMA.
The qualifiers .dtype, .atype, .btype and .ctype indicate the data-type of the elements in the matrices D, A, B and C respectively.
For wmma.mma without explicit .atype and .btype: .atype and .btype are implicitly set to .f16.
For integer wmma, .ctype and .dtype must be specified as .s32. Also, the values for .atype and .btype must be the same, i.e., either both are .s8 or both are .u8.
For sub-byte and single-bit wmma, .ctype and .dtype must be specified as .s32. Also, the values for .atype and .btype must be the same; i.e., either both are .s4, both are .u4, or both are .b1.
For single-bit wmma, multiplication is replaced by a sequence of logical operations; specifically, wmma.xor.popc and wmma.and.popc compute the XOR and AND, respectively, of a 128-bit row of A with a 128-bit column of B, then count the number of set bits in the result (popc). This result is added to the corresponding element of C and written into D.
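The single-bit computation for one element of D can be modelled in Python. In this sketch a 128-bit row of A and a 128-bit column of B are represented as Python integers; the function name single_bit_element is hypothetical.

```python
def single_bit_element(a_row, b_col, c, op):
    """One element of D: popc(a_row OP b_col) + c, with OP being XOR or AND.

    a_row and b_col are 128-bit values held in Python integers.
    """
    mask = (1 << 128) - 1
    if op == "xor":
        bits = (a_row ^ b_col) & mask
    elif op == "and":
        bits = (a_row & b_col) & mask
    else:
        raise ValueError("op must be 'xor' or 'and'")
    return c + bin(bits).count("1")   # population count of the result

if __name__ == "__main__":
    print(single_bit_element(0b1100, 0b1010, 5, "xor"))   # two differing bits
    print(single_bit_element(0b1100, 0b1010, 5, "and"))   # one common bit
```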
The qualifiers .alayout and .blayout must match the layout specified on the wmma.load instructions that produce the contents of operands a and b respectively. Similarly, the qualifiers .atype, .btype and .ctype must match the corresponding qualifiers on the wmma.load instructions that produce the contents of operands a, b and c respectively.
The .shape qualifier must match the .shape qualifier used on the wmma.load instructions that produce the contents of all three input operands a, b and c respectively.
The destination operand d is a brace-enclosed vector expression that matches the .shape of the fragment computed by the wmma.mma instruction.
Saturation at the output:
The optional qualifier .satfinite indicates that the final values in the destination register are saturated as follows:
The output is clamped to the minimum or maximum 32-bit signed integer value. Otherwise, if the accumulation would overflow, the value wraps.
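The difference between the two overflow behaviours can be shown with a small Python model of a 32-bit signed accumulator. The helper name is hypothetical, and wrapping is modelled as two's-complement wraparound, which is an assumption about what "wraps" means here.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def finalize_s32(total, satfinite):
    """Map an exact integer sum to a 32-bit signed result.

    With satfinite the value is clamped; otherwise it wraps
    (assumed to be two's-complement wraparound).
    """
    if satfinite:
        return max(INT32_MIN, min(INT32_MAX, total))
    return (total + 2**31) % 2**32 - 2**31

if __name__ == "__main__":
    overflowed = INT32_MAX + 1
    print(finalize_s32(overflowed, satfinite=True))    # clamps to INT32_MAX
    print(finalize_s32(overflowed, satfinite=False))   # wraps to INT32_MIN
```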
Precision and rounding for .f16 floating point operations:
Element-wise multiplication of matrix A and B is performed with at least single precision. When .ctype or .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When both .ctype and .dtype are specified as .f16, the accumulation is performed with at least half precision.
The accumulation order, rounding and handling of subnormal inputs is unspecified.
Precision and rounding for .bf16, .tf32 floating point operations:
Element-wise multiplication of matrix A and B is performed with specified precision. Accumulation of the intermediate values is performed with at least single precision.
The accumulation order, rounding and handling of subnormal inputs is unspecified.
Rounding modifiers on double precision wmma.mma (default is .rn):
.rn : mantissa LSB rounds to nearest even.
.rz : mantissa LSB rounds towards zero.
.rm : mantissa LSB rounds towards negative infinity.
.rp : mantissa LSB rounds towards positive infinity.
The mandatory .sync qualifier indicates that wmma.mma causes the executing thread to wait until all threads in the warp execute the same wmma.mma instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same wmma.mma instruction. In conditionally executed code, a wmma.mma instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
The behavior of wmma.mma is undefined if all threads in the same warp do not use the same qualifiers, or if any thread in the warp has exited.
PTX ISA Notes
Introduced in PTX ISA version 6.0.
.m8n32k16 and .m32n8k16 introduced in PTX ISA version 6.1.
Integer, sub-byte integer and single-bit wmma introduced in PTX ISA version 6.3.
Double precision and alternate floating point precision wmma introduced in PTX ISA version 7.0.
Support for .and operation in single-bit wmma introduced in PTX ISA version 7.1.
Modifier .aligned is required from PTX ISA version 6.3 onwards, and considered implicit in PTX ISA versions less than 6.3.
Support for .satfinite on floating point wmma.mma is deprecated in PTX ISA version 6.4 and is removed from PTX ISA version 6.5.
Preview Feature:
Sub-byte wmma and single-bit wmma are preview features in PTX ISA. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.
Target ISA Notes
Floating point wmma requires sm_70 or higher.
Integer wmma requires sm_72 or higher.
Sub-byte and single-bit wmma require sm_75 or higher.
Double precision and alternate floating point precision wmma require sm_80 or higher.
.and operation in single-bit wmma requires sm_80 or higher.
This section describes warp-level mma, ldmatrix, stmatrix, and movmatrix instructions and the organization of various matrices involved in these instructions.
A warp executing mma.m8n8k4 with .f16 floating point type will compute 4 MMA operations of shape .m8n8k4.
Elements of 4 matrices need to be distributed across the threads in a warp. The following table shows the distribution of matrices for MMA operations.
MMA computation : Threads participating in MMA computation
MMA computation 1 : Threads with %laneid 0-3 (low group) and 16-19 (high group)
MMA computation 2 : Threads with %laneid 4-7 (low group) and 20-23 (high group)
MMA computation 3 : Threads with %laneid 8-11 (low group) and 24-27 (high group)
MMA computation 4 : Threads with %laneid 12-15 (low group) and 28-31 (high group)
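The lane-to-computation mapping in the table above follows a simple arithmetic pattern. The sketch below restates it as a host-side helper; the function name mma_m8n8k4_group is hypothetical and the code only mirrors the table, it is not part of PTX.

```python
def mma_m8n8k4_group(laneid: int):
    """Return (MMA computation number 1..4, 'low' or 'high') for a lane id.

    Mirrors the table: lanes 0-3 and 16-19 -> computation 1, lanes 4-7 and
    20-23 -> computation 2, and so on.
    """
    if not 0 <= laneid < 32:
        raise ValueError("laneid must be in 0..31")
    computation = (laneid % 16) // 4 + 1
    half = "low" if laneid < 16 else "high"
    return computation, half


if __name__ == "__main__":
    for lane in (0, 19, 20, 31):
        print(lane, mma_m8n8k4_group(lane))
```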
For each of the individual MMA computations shown above, each of the required threads holds a fragment of the matrix for performing the mma operation as follows:
Multiplicand A:
.atype : .f16
Fragment : A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix A.
Elements (low to high) : a0, a1, a2, a3
The layout of the fragments held by different threads is shown below:
Fragment layout for Row Major matrix A is shown in Figure 46.
Figure 46 MMA .m8n8k4 fragment layout for row-major matrix A with .f16 type
The row and column of a matrix fragment can be computed as:
Perform a MxNxK matrix multiply and accumulate operation, D = A*B + C, where the A matrix is MxK, the B matrix is KxN, and the C and D matrices are MxN.
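As a plain reference for the shapes involved, the operation D = A*B + C can be written out on the host as follows. This is a sketch using nested Python lists; the function name mma_reference is hypothetical, and it ignores fragment distribution, element types, and rounding.

```python
def mma_reference(A, B, C):
    """Compute D = A*B + C for an MxK matrix A, KxN matrix B and MxN matrix C."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    if len(B) != K:
        raise ValueError("B must have K rows")
    if len(C) != M or any(len(row) != N for row in C):
        raise ValueError("C must be MxN")
    return [
        [C[i][j] + sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
        for i in range(M)
    ]


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[1, 1], [1, 1]]
    print(mma_reference(A, B, C))
```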
Qualifier .block_scale specifies that the matrices A and B are scaled with scale_A and scale_B matrices respectively before performing the matrix multiply and accumulate operation as specified in the section Block Scaling. The data type of each element within the scale_A and scale_B matrices is specified by .stype. Qualifier .scale_vec_size specifies the number of columns of the scale_A matrix and the number of rows of the scale_B matrix.
The valid combinations of .kind, .stype and .scale_vec_size are described in Table 36. For mma with .kind::mxf4, when the qualifier .scale_vec_size is not specified, it defaults to 2X. In contrast, when .kind is specified as .kind::mxf8f6f4, the qualifier .scale_vec_size defaults to 1X. However, for .kind::mxf4nvf4, it is mandatory to provide a valid .scale_vec_size.
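The defaulting rule for .scale_vec_size can be summarised in a few lines. The helper below is a hypothetical illustration (resolve_scale_vec_size is not a PTX or CUDA API); it only encodes the defaults stated above and does not check a combination against Table 36.

```python
def resolve_scale_vec_size(kind: str, scale_vec_size=None) -> str:
    """Return the effective .scale_vec_size for a given .kind (without the '.kind::' prefix)."""
    defaults = {"mxf4": "2X", "mxf8f6f4": "1X"}
    if scale_vec_size is not None:
        return scale_vec_size
    if kind == "mxf4nvf4":
        # No default exists: the qualifier is mandatory for this kind.
        raise ValueError(".scale_vec_size is mandatory for .kind::mxf4nvf4")
    if kind in defaults:
        return defaults[kind]
    raise ValueError("unknown kind: " + kind)


if __name__ == "__main__":
    print(resolve_scale_vec_size("mxf4"))
    print(resolve_scale_vec_size("mxf8f6f4"))
```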
A warp executing the mma.sync.m8n8k4 instruction computes 4 matrix multiply and accumulate operations. The rest of the mma.sync operations compute a single matrix multiply and accumulate operation per warp.
For single-bit mma.sync, multiplication is replaced by a sequence of logical operations; specifically, mma.xor.popc and mma.and.popc compute the XOR and AND, respectively, of a k-bit row of A with a k-bit column of B, then count the number of set bits in the result (popc). This result is added to the corresponding element of C and written into D.
Operands a and b represent two multiplicand matrices A and B, while c and d represent the accumulator and destination matrices, distributed across the threads in the warp. When the .block_scale qualifier is specified, operands scale-a-data and scale-b-data represent the scale matrix metadata corresponding to the scale_A and scale_B matrices respectively. The tuples {byte-id-a, thread-id-a} and {byte-id-b, thread-id-b} represent selectors for matrices scale_A and scale_B respectively from their corresponding metadata arguments scale-a-data, scale-b-data. The operands scale-a-data, scale-b-data are of type .b32. The operands byte-id-a, thread-id-a, byte-id-b, thread-id-b are unsigned 16-bit integer values. For more details on selector arguments, refer to the Block Scaling section.
The qualifiers .dtype, .atype, .btype and .ctype indicate the data type of the elements in the matrices D, A, B and C respectively. The qualifier .stype indicates the data type of the elements in the matrices scale_A and scale_B. Specific shapes have type restrictions:
.m8n8k4 : When .ctype is .f32, .dtype must also be .f32.
.m16n8k8 :
.dtype must be the same as .ctype.
.atype must be the same as .btype.
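The shape-specific type restrictions listed above can be expressed as a small checker. This is a sketch under the assumption that only the restrictions named in this list are checked; check_mma_types is a hypothetical helper, and type names are passed without the leading dot.

```python
def check_mma_types(shape: str, dtype: str, atype: str, btype: str, ctype: str):
    """Return a list of violated restrictions (empty when none apply)."""
    problems = []
    if shape == "m8n8k4" and ctype == "f32" and dtype != "f32":
        problems.append("m8n8k4: .dtype must be .f32 when .ctype is .f32")
    if shape == "m16n8k8":
        if dtype != ctype:
            problems.append("m16n8k8: .dtype must be the same as .ctype")
        if atype != btype:
            problems.append("m16n8k8: .atype must be the same as .btype")
    return problems


if __name__ == "__main__":
    print(check_mma_types("m8n8k4", "f16", "f16", "f16", "f32"))
    print(check_mma_types("m16n8k8", "f32", "f16", "f16", "f32"))
```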
The qualifiers .alayout and .blayout indicate the row-major or column-major layouts of matrices A and B respectively.
When .kind is either .kind::mxf8f6f4 or .kind::f8f6f4, the individual 4-bit and 6-bit floating point type elements must be packed in an 8-bit container. A matrix element of type .e2m1 resides in the central 4 bits of the 8-bit container, with padding in the upper 2 bits and lower 2 bits of the container. When the matrix element is of type .e3m2 or .e2m3, the matrix element resides in the lower 6 bits of the 8-bit container, with padding in the upper 2 bits of the container. In contrast, note that when using mma with .kind::mxf4 or .kind::mxf4nvf4, no explicit padding is necessary even though matrix elements are of type .e2m1.
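The container layout described above can be illustrated with bit shifts. The helpers below are hypothetical and operate on raw bit patterns; they assume the padding bits are written as zero, which the text does not state.

```python
def pack_e2m1(bits4: int) -> int:
    """Place a 4-bit .e2m1 pattern in the central 4 bits of an 8-bit container."""
    if not 0 <= bits4 < 16:
        raise ValueError("expected a 4-bit pattern")
    return bits4 << 2  # 2 padding bits below, 2 padding bits above (assumed zero)


def pack_6bit(bits6: int) -> int:
    """Place a 6-bit .e3m2 or .e2m3 pattern in the lower 6 bits of an 8-bit container."""
    if not 0 <= bits6 < 64:
        raise ValueError("expected a 6-bit pattern")
    return bits6  # upper 2 bits are padding (assumed zero)


def unpack_e2m1(container: int) -> int:
    """Recover the 4-bit pattern from the central bits of the container."""
    return (container >> 2) & 0xF


if __name__ == "__main__":
    print(format(pack_e2m1(0b1111), "08b"))
    print(format(pack_6bit(0b101010), "08b"))
```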
Precision and rounding :
.f16 floating point operations :
Element-wise multiplication of matrix A and B is performed with at least single precision. When .ctype or .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When both .ctype and .dtype are specified as .f16, the accumulation is performed with at least half precision.
The accumulation order, rounding and handling of subnormal inputs are unspecified.
.e4m3, .e5m2, .e3m2, .e2m3, .e2m1 floating point operations :
Element-wise multiplication of matrix A and B is performed with specified precision. Accumulation of the intermediate values is performed with at least single precision.
The accumulation order, rounding, and handling of subnormal inputs are unspecified.
.bf16 and .tf32 floating point operations :
Element-wise multiplication of matrix A and B is performed with specified precision. Accumulation of the intermediate values is performed with at least single precision.
The accumulation order, rounding, and handling of subnormal inputs are unspecified.
.f64 floating point operations :
Precision of the element-wise multiplication and addition operation is identical to that of .f64 precision fused multiply-add. Supported rounding modifiers are :
.rn : mantissa LSB rounds to nearest even. This is the default.
.rz : mantissa LSB rounds towards zero.
.rm : mantissa LSB rounds towards negative infinity.
.rp : mantissa LSB rounds towards positive infinity.
Integer operations :
The integer mma operation is performed with .s32 accumulators. The .satfinite qualifier indicates that on overflow, the accumulated value is limited to the range MIN_INT32..MAX_INT32 (where the bounds are defined as the minimum negative signed 32-bit integer and the maximum positive signed 32-bit integer respectively).
If .satfinite is not specified, the accumulated value is wrapped instead.
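The difference between saturating and wrapping accumulation can be seen in a short host-side model. This is a sketch only: accumulate_s32 is a hypothetical name, and the model clamps or wraps the exact mathematical sum once at the end, which is a simplifying assumption about when overflow is detected.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def accumulate_s32(acc: int, products, satfinite: bool) -> int:
    """Add integer products to a signed 32-bit accumulator, saturating or wrapping."""
    total = acc + sum(products)
    if satfinite:
        # Clamp to MIN_INT32..MAX_INT32
        return max(INT32_MIN, min(INT32_MAX, total))
    # Two's-complement wrap to 32 bits
    wrapped = total & 0xFFFFFFFF
    return wrapped - (1 << 32) if wrapped >= (1 << 31) else wrapped


if __name__ == "__main__":
    print(accumulate_s32(INT32_MAX, [1], satfinite=True))
    print(accumulate_s32(INT32_MAX, [1], satfinite=False))
```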
The mandatory .sync qualifier indicates that the mma instruction causes the executing thread to wait until all threads in the warp execute the same mma instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same mma instruction. In conditionally executed code, an mma instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
The behavior of the mma instruction is undefined if all threads in the same warp do not use the same qualifiers, or if any thread in the warp has exited.
Notes
Programs using the double precision floating point mma instruction with shapes .m16n8k4, .m16n8k8, and .m16n8k16 require at least 64 registers for compilation.
PTX ISA Notes
Introduced in PTX ISA version 6.4.
.f16 floating point type mma operation with .m8n8k4 shape introduced in PTX ISA version 6.4.
.f16 floating point type mma operation with .m16n8k8 shape introduced in PTX ISA version 6.5.
.u8/.s8 integer type mma operation with .m8n8k16 shape introduced in PTX ISA version 6.5.
.u4/.s4 integer type mma operation with .m8n8k32 shape introduced in PTX ISA version 6.5.
.f64 floating point type mma operation with .m8n8k4 shape introduced in PTX ISA version 7.0.
.f16 floating point type mma operation with .m16n8k16 shape introduced in PTX ISA version 7.0.
.bf16 alternate floating point type mma operation with .m16n8k8 and .m16n8k16 shapes introduced in PTX ISA version 7.0.
.tf32 alternate floating point type mma operation with .m16n8k4 and .m16n8k8 shapes introduced in PTX ISA version 7.0.
.u8/.s8 integer type mma operation with .m16n8k16 and .m16n8k32 shapes introduced in PTX ISA version 7.0.
.u4/.s4 integer type mma operation with .m16n8k32 and .m16n8k64 shapes introduced in PTX ISA version 7.0.
.b1 single-bit integer type mma operation with .m8n8k128, .m16n8k128 and .m16n8k256 shapes introduced in PTX ISA version 7.0.
Support for .and operation in single-bit mma introduced in PTX ISA version 7.1.
.f64 floating point type mma operation with .m16n8k4, .m16n8k8, and .m16n8k16 shapes introduced in PTX ISA version 7.8.
Support for .e4m3 and .e5m2 alternate floating point type mma operation introduced in PTX ISA version 8.4.
Support for shape .m16n8k16 and .f16 dtype/ctype with .e4m3/.e5m2 alternate floating point type mma operation introduced in PTX ISA version 8.7.
Support for .e3m2, .e2m3, .e2m1 alternate floating point type mma operation introduced in PTX ISA version 8.7.
Support for .kind, .block_scale, .scale_vec_size qualifiers introduced in PTX ISA version 8.7.
Target ISA Notes
Requires sm_70 or higher.
.f16 floating point type mma operation with .m8n8k4 shape requires sm_70 or higher.
Note
mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures.
.f16 floating point type mma operation with .m16n8k8 shape requires sm_75 or higher.
.u8/.s8 integer type mma operation with .m8n8k16 shape requires sm_75 or higher.
.u4/.s4 integer type mma operation with .m8n8k32 shape requires sm_75 or higher.
.b1 single-bit integer type mma operation with .m8n8k128 shape requires sm_75 or higher.
.f64 floating point type mma operation with .m8n8k4 shape requires sm_80 or higher.
.f16 floating point type mma operation with .m16n8k16 shape requires sm_80 or higher.
.bf16 alternate floating point type mma operation with .m16n8k8 and .m16n8k16 shapes requires sm_80 or higher.
.tf32 alternate floating point type mma operation with .m16n8k4 and .m16n8k8 shapes requires sm_80 or higher.
.u8/.s8 integer type mma operation with .m16n8k16 and .m16n8k32 shapes requires sm_80 or higher.
.u4/.s4 integer type mma operation with .m16n8k32 and .m16n8k64 shapes requires sm_80 or higher.
.b1 single-bit integer type mma operation with .m16n8k128 and .m16n8k256 shapes requires sm_80 or higher.
.and operation in single-bit mma requires sm_80 or higher.
.f64 floating point type mma operation with .m16n8k4, .m16n8k8, and .m16n8k16 shapes requires sm_90 or higher.
.e4m3 and .e5m2 alternate floating point type mma operation requires sm_89 or higher.
.e3m2, .e2m3 and .e2m1 alternate floating point type mma operation requires sm_120a and is supported on sm_120f from PTX ISA version 8.8.
Support for .kind, .block_scale, .scale_vec_size qualifiers requires sm_120a and is available on sm_120f or higher in the same family from PTX ISA version 8.8.
Examples of half precision floating point type
// f16 elements in C and D matrix
.reg .f16x2 %Ra<2>, %Rb<2>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m8n8k4.row.col.f16.f16.f16.f16
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};
// f16 elements in C and f32 elements in D
.reg .f16x2 %Ra<2>, %Rb<2>, %Rc<4>;
.reg .f32 %Rd<8>;
mma.sync.aligned.m8n8k4.row.col.f32.f16.f16.f16
  {%Rd0, %Rd1, %Rd2, %Rd3, %Rd4, %Rd5, %Rd6, %Rd7},
  {%Ra0, %Ra1},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};
// f32 elements in C and D
.reg .f16x2 %Ra<2>, %Rb<1>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb0},
  {%Rc0, %Rc1, %Rc2, %Rc3};
// .m16n8k16 shape with f16 elements in C and D
.reg .f16x2 %Ra<4>, %Rb<2>, %Rc<2>, %Rd<2>;
mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
  {%Rd0, %Rd1},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1};
// .m16n8k16 shape with f32 elements in C and D
.reg .f16x2 %Ra<4>, %Rb<2>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};
Examples of integer type
.reg .b32 %Ra, %Rb, %Rc<2>, %Rd<2>;
// s8 elements in A and u8 elements in B
mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.u8.s32
  {%Rd0, %Rd1},
  {%Ra},
  {%Rb},
  {%Rc0, %Rc1};
// u4 elements in A and B matrix
mma.sync.aligned.m8n8k32.row.col.satfinite.s32.u4.u4.s32
  {%Rd0, %Rd1},
  {%Ra},
  {%Rb},
  {%Rc0, %Rc1};
// s8 elements in A and u8 elements in B
.reg .b32 %Ra<2>, %Rb, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.satfinite.s32.s8.u8.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb},
  {%Rc0, %Rc1, %Rc2, %Rc3};
// u4 elements in A and s4 elements in B
.reg .b32 %Ra<2>, %Rb, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k32.row.col.satfinite.s32.u4.s4.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1},
  {%Rb},
  {%Rc0, %Rc1, %Rc2, %Rc3};
// s8 elements in A and s8 elements in B
.reg .b32 %Ra<4>, %Rb<2>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k32.row.col.satfinite.s32.s8.s8.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};
// u4 elements in A and u4 elements in B
.reg .b32 %Ra<4>, %Rb<2>, %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k64.row.col.satfinite.s32.u4.u4.s32
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3};
Collectively load one or more matrices across all threads in a warp from the location indicated by the address operand p, from .shared state space into destination register r. If no state space is provided, generic addressing is used, such that the address in p points into .shared space. If the generic address doesn't fall in .shared state space, then the behavior is undefined.
The .shape qualifier indicates the dimensions of the matrices being loaded. Each matrix element holds 16-bit or 8-bit or 6-bit or 4-bit data.
The following table shows the matrix load case for each .shape.
.shape : Matrix shape : Element size
.m8n8 : 8x8 : 16-bit
.m16n16 : 16x16 : 8-bit or 6-bit or 4-bit
.m8n16 : 8x16 : 6-bit or 4-bit
The following table shows the valid use of 6-bit or 4-bit data load.
.src_fmt : .shape : Source data : Padding : .dst_fmt
.b6x16_p32 : .m8n16 or .m16n16 : 16 6-bit elements : 32 bits : .b8x16 (16 8-bit elements)
.b4x16_p64 : .m8n16 or .m16n16 : 16 4-bit elements : 64 bits : .b8x16 (16 8-bit elements)
For the .b6x16_p32 format, source data is 16 unsigned 6-bit elements with 32 bits of padding. For the .b4x16_p64 format, source data is 16 unsigned 4-bit elements with 64 bits of padding.
The values .x1, .x2 and .x4 for .num indicate one, two or four matrices respectively. When .shape is .m16n16, only .x1 and .x2 are valid values for .num.
The mandatory .sync qualifier indicates that ldmatrix causes the executing thread to wait until all threads in the warp execute the same ldmatrix instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same ldmatrix instruction. In conditionally executed code, an ldmatrix instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.
The behavior of ldmatrix is undefined if all threads do not use the same qualifiers, or if any thread in the warp has exited.
The destination operand r is a brace-enclosed vector expression consisting of 1, 2, or 4 32-bit registers as per the value of .num. Each component of the vector expression holds a fragment from the corresponding matrix.
Consecutive rows need not be stored contiguously in memory. The eight addresses required for each matrix are provided by eight threads, depending upon the value of .num as shown in the following table. Each address corresponds to the start of a matrix row. Addresses addr0–addr7 correspond to the rows of the first matrix, addresses addr8–addr15 correspond to the rows of the second matrix, and so on.
.num : Threads 0–7 : Threads 8–15 : Threads 16–23 : Threads 24–31
.x1 : addr0–addr7 : – : – : –
.x2 : addr0–addr7 : addr8–addr15 : – : –
.x4 : addr0–addr7 : addr8–addr15 : addr16–addr23 : addr24–addr31
Note
For .target sm_75 or below, all threads must contain valid addresses. Otherwise, the behavior is undefined. For .num = .x1 and .num = .x2, addresses contained in lower threads can be copied to higher threads to achieve the expected behavior.
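The address table and the note above can be modelled on the host as follows. This is a hedged sketch: address_thread and replicate_addresses are hypothetical helpers, and replicating by taking the thread index modulo the number of required addresses is one way of copying lower-thread addresses to higher threads, not the only one.

```python
MATRICES = {"x1": 1, "x2": 2, "x4": 4}


def address_thread(num: str, matrix: int, row: int) -> int:
    """Return the thread that supplies the address of `row` of matrix `matrix`."""
    count = MATRICES[num]
    if not (0 <= matrix < count and 0 <= row < 8):
        raise ValueError("matrix or row out of range for " + num)
    return 8 * matrix + row


def replicate_addresses(valid, num: str):
    """Fill all 32 threads with valid addresses by repeating the required ones."""
    needed = 8 * MATRICES[num]
    if len(valid) != needed:
        raise ValueError("expected %d addresses" % needed)
    return [valid[t % needed] for t in range(32)]


if __name__ == "__main__":
    print(address_thread("x4", 3, 7))
    print(replicate_addresses([100 + i for i in range(8)], "x1")[8:12])
```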
When reading 8x8 matrices, a group of four consecutive threads loads 16 bytes. The matrix addresses must be naturally aligned accordingly.
Each thread in a warp loads fragments of a row, with thread 0 receiving the first fragment in its register r, and so on. A group of four threads loads an entire row of the matrix as shown in Figure 104.
Figure 104 ldmatrix fragment layout for one 8x8 matrix with 16-bit elements
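Since the figure is not reproduced here, the following sketch spells out one reading of the text: each 32-bit destination register holds two 16-bit elements, and four consecutive threads cover the eight elements (16 bytes) of one row. The helper name ldmatrix_m8n8_fragment is hypothetical, the mapping is inferred from the description rather than taken from the figure, and the .trans case is not modelled.

```python
def ldmatrix_m8n8_fragment(thread: int):
    """Return (row, (col0, col1)) of the 8x8 matrix elements held by `thread`."""
    if not 0 <= thread < 32:
        raise ValueError("thread must be in 0..31")
    row = thread // 4           # four consecutive threads share one row
    col = 2 * (thread % 4)      # each thread holds two adjacent 16-bit elements
    return row, (col, col + 1)


if __name__ == "__main__":
    for t in (0, 5, 31):
        print(t, ldmatrix_m8n8_fragment(t))
```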
When .num = .x2, the elements of the second matrix are loaded in the next destination register in each thread as per the layout in the above table. Similarly, when .num = .x4, elements of the third and fourth matrices are loaded in the subsequent destination registers in each thread.
For matrix shape 16x16, two destination registers r0 and r1 of type .b32 must be specified and in each register four 8-bit elements are loaded. For 4-bit or 6-bit data, each 8-bit element will have 4 bits or 2 bits of padding respectively. Refer to Optional Decompression for more details on these formats.
An entire row of the matrix can be loaded by a group of four consecutive and aligned threads. Each thread in a warp loads 4 consecutive columns across 2 rows as shown in Figure 105.
Figure 105 ldmatrix fragment layout for one 16x16 matrix with 8-bit elements
For matrix shape 8x16, one destination register r0 of type .b32 must be specified where four 8-bit elements are loaded in the register. For 4-bit or 6-bit data, each 8-bit element will have 4 bits or 2 bits of padding respectively.
An entire row of the matrix can be loaded by a group of four consecutive and aligned threads. Each thread in a warp loads 4 consecutive columns as shown in Figure 106.
Figure 106 ldmatrix fragment layout for one 8x16 matrix with 8-bit elements containing 4-bit/6-bit data
Optional qualifier .trans indicates that the matrix is loaded in column-major format. However, for 16x16 matrices, .trans is mandatory.
Collectively store one or more matrices across all threads in a warp to the location indicated by the address operand p, in .shared state space. If no state space is provided, generic addressing is used, such that the address in p points into .shared space. If the generic address doesn't fall in .shared state space, then the behavior is undefined.
The .shape qualifier indicates the dimensions of the matrices being stored. Each matrix element holds 16-bit or 8-bit data as indicated by the .type qualifier.
.m16n8 shape is valid only for .b8 type.
The values .x1, .x2 and .x4 for .num indicate one, two or four matrices respectively.
The mandatory .sync qualifier indicates that stmatrix causes the executing thread to wait until all threads in the warp execute the same stmatrix instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same stmatrix instruction. In conditionally executed code, an stmatrix instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.
The behavior of stmatrix is undefined if all threads do not use the same qualifiers, or if any thread in the warp has exited.
The source operand r is a brace-enclosed vector expression consisting of 1, 2, or 4 32-bit registers as per the value of .num. Each component of the vector expression holds a fragment from the corresponding matrix.
Consecutive rows need not be stored contiguously in memory. The eight addresses required for each matrix are provided by eight threads, depending upon the value of .num as shown in the following table. Each address corresponds to the start of a matrix row. Addresses addr0–addr7 correspond to the rows of the first matrix, addresses addr8–addr15 correspond to the rows of the second matrix, and so on.
.num : Threads 0–7 : Threads 8–15 : Threads 16–23 : Threads 24–31
.x1 : addr0–addr7 : – : – : –
.x2 : addr0–addr7 : addr8–addr15 : – : –
.x4 : addr0–addr7 : addr8–addr15 : addr16–addr23 : addr24–addr31
When storing 8x8 matrices, a group of four consecutive threads stores 16 bytes. The matrix addresses must be naturally aligned accordingly.
Each thread in a warp stores fragments of a row, with thread 0 storing the first fragment from its register r, and so on. A group of four threads stores an entire row of the matrix as shown in Figure 107.
Figure 107 stmatrix fragment layout for one 8x8 matrix with 16-bit elements
When .num = .x2, the elements of the second matrix are stored from the next source register in each thread as per the layout in the above table. Similarly, when .num = .x4, elements of the third and fourth matrices are stored from the subsequent source registers in each thread.
For 16x8 matrix shape, each of the 32 threads in the warp provides four elements of data per matrix.
Each element in the source operand r is of type .b32 and contains four 8-bit elements e0, e1, e2, e3, with e0 and e3 containing the LSB and MSB respectively of register r.
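The byte ordering inside a .b32 source register can be illustrated with a pair of host-side helpers. The names pack_b8x4 and unpack_b8x4 are hypothetical; the sketch assumes e1 and e2 occupy the two middle bytes in order, which follows from e0 being the least significant byte and e3 the most significant.

```python
def pack_b8x4(e0: int, e1: int, e2: int, e3: int) -> int:
    """Pack four 8-bit elements into a 32-bit value with e0 in the least significant byte."""
    for e in (e0, e1, e2, e3):
        if not 0 <= e <= 0xFF:
            raise ValueError("elements must fit in 8 bits")
    return e0 | (e1 << 8) | (e2 << 16) | (e3 << 24)


def unpack_b8x4(r: int):
    """Split a 32-bit value into (e0, e1, e2, e3), least significant byte first."""
    return tuple((r >> (8 * i)) & 0xFF for i in range(4))


if __name__ == "__main__":
    print(hex(pack_b8x4(0x11, 0x22, 0x33, 0x44)))
    print(unpack_b8x4(0x44332211))
```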
Figure 108 stmatrix fragment layout for one 16x8 matrix with 8-bit elements
Optional qualifier .trans indicates that the matrix is stored in column-major format. However, for 16x8 matrices, .trans is mandatory.
Support for .m16n8 shape is introduced in PTX ISA version 8.6.
Support for .b8 type with stmatrix is introduced in PTX ISA version 8.6.
Target ISA Notes
Requires sm_90 or higher.
Shape .m16n8 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
Type .b8 with stmatrix is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_120f or higher in the same family
sm_110f or higher in the same family
Examples
// Store a single 8x8 matrix using 64-bit addressing
.reg .b64 addr;
.reg .b32 r;
stmatrix.sync.aligned.m8n8.x1.shared.b16 [addr], {r};
// Store two 8x8 matrices in column-major format
.reg .b64 addr;
.reg .b32 r<2>;
stmatrix.sync.aligned.m8n8.x2.trans.shared::cta.b16 [addr], {r0, r1};
// Store four 8x8 matrices (no state space given, so generic addressing is used)
.reg .b64 addr;
.reg .b32 r<4>;
stmatrix.sync.aligned.m8n8.x4.b16 [addr], {r0, r1, r2, r3};
// Store a single 16x8 matrix (.trans is mandatory for 16x8 matrices)
.reg .b64 addr;
.reg .b32 r;
stmatrix.sync.aligned.m16n8.x1.trans.shared.b8 [addr], {r};
// Store two 16x8 matrices
.reg .b64 addr;
.reg .b32 r<2>;
stmatrix.sync.aligned.m16n8.x2.trans.shared::cta.b8 [addr], {r0, r1};
// Store four 16x8 matrices using generic addressing
.reg .b64 addr;
.reg .b32 r<4>;
stmatrix.sync.aligned.m16n8.x4.trans.b8 [addr], {r0, r1, r2, r3};
Move a row-major matrix across all threads in a warp, reading elements from source a, and writing the transposed elements to destination d.
The .shape qualifier indicates the dimensions of the matrix being transposed. Each matrix element holds 16-bit data as indicated by the .type qualifier.
The mandatory .sync qualifier indicates that movmatrix causes the executing thread to wait until all threads in the warp execute the same movmatrix instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same movmatrix instruction. In conditionally executed code, a movmatrix instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise the behavior is undefined.
Operands a and d are 32-bit registers containing fragments of the input matrix and the resulting matrix respectively. The mandatory qualifier .trans indicates that the resulting matrix in d is a transpose of the input matrix specified by a.
Each thread in a warp holds a fragment of a row of the input matrix, with thread 0 holding the first fragment in register a, and so on. A group of four threads holds an entire row of the input matrix as shown in Figure 109.
Figure 109 movmatrix source matrix fragment layout
Each thread in a warp holds a fragment of a column of the result matrix, with thread 0 holding the first fragment in register d, and so on. A group of four threads holds an entire column of the result matrix as shown in Figure 110.
Figure 110 movmatrix result matrix fragment layout
This section describes the warp-level mma.sp{::ordered_metadata} instruction with sparse matrix A. This variant of the mma operation can be used when A is a structured sparse matrix with 50% zeros in each row distributed in a shape-specific granularity. For an MxNxK sparse mma.sp{::ordered_metadata} operation, the MxK matrix A is packed into MxK/2 elements. For each K-wide row of matrix A, 50% of the elements are zeros and the remaining K/2 non-zero elements are packed in the operand representing matrix A. The mapping of these K/2 elements to the corresponding K-wide row is provided explicitly as metadata.
Granularity of sparse matrix A is defined as the ratio of the number of non-zero elements in a sub-chunk of the matrix row to the total number of elements in that sub-chunk, where the size of the sub-chunk is shape-specific. For example, in a 16x16 matrix A, sparsity is expected to be at 2:4 granularity, i.e. each 4-element vector (i.e. a sub-chunk of 4 consecutive elements) of a matrix row contains 2 zeros. The index of each non-zero element in a sub-chunk is stored in the metadata operand. Values 0b0000, 0b0101, 0b1010, 0b1111 are invalid values for metadata and will result in undefined behavior. In a group of four consecutive threads, one or more threads store the metadata for the whole group depending upon the matrix shape. These threads are specified using an additional sparsity selector operand.
Figure 111 shows an example of a 16x16 matrix A represented in sparse format and the sparsity selector indicating which thread in a group of four consecutive threads stores the metadata.
Granularities for different matrix shapes and data types are described below.
Sparse mma.sp{::ordered_metadata} with half-precision and .bf16 type
For the .m16n8k16 and .m16n8k32 mma.sp{::ordered_metadata} operations, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in the operand representing matrix A and their positions in the four-wide chunk in matrix A are indicated by two 2-bit indices in the metadata operand. For mma.sp::ordered_metadata, 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other values result in an undefined behavior.
Figure 112 Sparse MMA metadata example for .f16/.bf16 type.
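The 2:4 packing can be sketched on the host as follows. This is an illustration under a stated assumption: the two 2-bit indices are combined as (second index << 2) | first index with the first index strictly smaller, an encoding chosen because it reproduces exactly the six meaningful values listed above. The helper names encode_2of4 and compress_chunk are hypothetical.

```python
from itertools import combinations


def encode_2of4(idx0: int, idx1: int) -> int:
    """Combine two ordered 2-bit positions into a 4-bit metadata value (assumed layout)."""
    if not 0 <= idx0 < idx1 <= 3:
        raise ValueError("indices must satisfy 0 <= idx0 < idx1 <= 3")
    return (idx1 << 2) | idx0


def compress_chunk(chunk):
    """Compress a four-element chunk with at most two non-zeros into (values, metadata)."""
    if len(chunk) != 4:
        raise ValueError("chunk must have four elements")
    kept = [i for i, v in enumerate(chunk) if v != 0]
    if len(kept) > 2:
        raise ValueError("chunk is not 2:4 sparse")
    # If fewer than two non-zeros exist, keep extra zero positions so two are stored.
    for i in range(4):
        if len(kept) == 2:
            break
        if i not in kept:
            kept.append(i)
    idx0, idx1 = sorted(kept)
    return [chunk[idx0], chunk[idx1]], encode_2of4(idx0, idx1)


if __name__ == "__main__":
    print(sorted(encode_2of4(i, j) for i, j in combinations(range(4), 2)))
    print(compress_chunk([0, 7, 0, 9]))
```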
The sparsity selector indicates the threads which contribute metadata as listed below:
m16n8k16 : One thread within a group of four consecutive threads contributes the metadata for the entire group. This thread is indicated by a value in {0, 1, 2, 3}.
m16n8k32 : A thread-pair within a group of four consecutive threads contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in an undefined behavior.
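The selector rules for the half-precision and .bf16 shapes above can be restated as a lookup. The helper metadata_threads is hypothetical; it takes the index of a four-thread group within the warp (0 to 7) and returns the lane ids that contribute metadata, raising an error where the text says the behavior is undefined.

```python
def metadata_threads(shape: str, group: int, selector: int):
    """Return the lanes in four-thread group `group` that supply sparsity metadata."""
    if not 0 <= group < 8:
        raise ValueError("group must be in 0..7")
    base = 4 * group
    if shape == "m16n8k16":
        if selector not in (0, 1, 2, 3):
            raise ValueError("selector must be in {0, 1, 2, 3}")
        return [base + selector]
    if shape == "m16n8k32":
        if selector not in (0, 1):
            raise ValueError("selector must be 0 or 1")
        # selector 0 -> threads T0, T1; selector 1 -> threads T2, T3
        return [base + 2 * selector, base + 2 * selector + 1]
    raise ValueError("shape not covered by this sketch")


if __name__ == "__main__":
    print(metadata_threads("m16n8k16", 0, 3))
    print(metadata_threads("m16n8k32", 1, 1))
```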
Sparse mma.sp{::ordered_metadata} with .tf32 type
When matrix A has .tf32 elements, matrix A is structured sparse at a granularity of 1:2. In other words, each chunk of two adjacent elements in a row of matrix A has one zero and one non-zero element. Only the non-zero elements are stored in the operand for matrix A and their positions in a two-wide chunk in matrix A are indicated by the 4-bit index in the metadata. 0b1110 and 0b0100 are the only meaningful index values; any other values result in an undefined behavior.
Figure 113 Sparse MMA metadata example for .tf32 type.
The sparsity selector indicates the threads which contribute metadata as listed below:
m16n8k8 : One thread within a group of four consecutive threads contributes the metadata for the entire group. This thread is indicated by a value in {0, 1, 2, 3}.
m16n8k16 : A thread-pair within a group of four consecutive threads contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in an undefined behavior.
Sparse mma.sp{::ordered_metadata} with integer type
When matrices A and B have .u8/.s8 elements, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeroes and two non-zero elements. Only the two non-zero elements are stored in the sparse matrix and their positions in the four-wide chunk are indicated by two 2-bit indices in the metadata. For mma.sp::ordered_metadata, 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other values result in an undefined behavior.
Figure 114 Sparse MMA metadata example for .u8/.s8 type.
When matrices A and B have .u4/.s4 elements, matrix A is pair-wise structured sparse at a granularity of 4:8. In other words, each chunk of eight adjacent elements in a row of matrix A has four zeroes and four non-zero values. Further, the zero and non-zero values are clustered in sub-chunks of two elements each within the eight-wide chunk, i.e., each two-wide sub-chunk within the eight-wide chunk must be all zeroes or all non-zeros. Only the four non-zero values are stored in the sparse matrix and the positions of the two two-wide sub-chunks with non-zero values in the eight-wide chunk of a row of matrix A are indicated by two 2-bit indices in the metadata. For mma.sp::ordered_metadata, 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other values result in an undefined behavior.
Figure 115 Sparse MMA metadata example for .u4/.s4 type.
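The pair-wise 4:8 case can be sketched the same way as 2:4, with two-element sub-chunks taking the place of single elements. This is a hedged illustration: compress_4of8_pairwise is a hypothetical helper, the metadata layout (second sub-chunk index << 2) | first sub-chunk index is an assumption consistent with the six meaningful values listed, and a sub-chunk is treated as non-zero when any of its two elements is non-zero.

```python
def compress_4of8_pairwise(chunk):
    """Compress an eight-element chunk into four stored values plus a 4-bit metadata value."""
    if len(chunk) != 8:
        raise ValueError("chunk must have eight elements")
    pairs = [list(chunk[2 * i:2 * i + 2]) for i in range(4)]
    kept = [i for i, p in enumerate(pairs) if any(v != 0 for v in p)]
    if len(kept) > 2:
        raise ValueError("chunk is not pair-wise 4:8 sparse")
    # Keep extra all-zero sub-chunks if fewer than two are non-zero.
    for i in range(4):
        if len(kept) == 2:
            break
        if i not in kept:
            kept.append(i)
    idx0, idx1 = sorted(kept)
    values = pairs[idx0] + pairs[idx1]
    return values, (idx1 << 2) | idx0  # assumed metadata layout


if __name__ == "__main__":
    print(compress_4of8_pairwise([0, 0, 3, 4, 0, 0, 5, 6]))
```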
The sparsity selector indicates the threads which contribute metadata as listed below:
m16n8k32 with .u8/.s8 type and m16n8k64 with .u4/.s4 type: A thread-pair within a group of four consecutive threads contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in undefined behavior.
m16n8k64 with .u8/.s8 type and m16n8k128 with .u4/.s4 type: All threads within a group of four consecutive threads contribute the sparsity metadata. Hence, the sparsity selector in this case must be 0. Any other value of sparsity selector results in undefined behavior.
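The selector rules for the integer types can be summarised as a lookup. This is only a sketch of the two list items above; the table name `ALLOWED_SELECTORS` and the function `contributing_threads` are hypothetical, and the returned numbers are thread positions within a group of four consecutive threads.

```python
# (shape, element type) -> allowed sparsity selector values,
# transcribed from the two list items above.
ALLOWED_SELECTORS = {}
for etype in ("u8", "s8"):
    ALLOWED_SELECTORS[("m16n8k32", etype)] = {0, 1}
    ALLOWED_SELECTORS[("m16n8k64", etype)] = {0}
for etype in ("u4", "s4"):
    ALLOWED_SELECTORS[("m16n8k64", etype)] = {0, 1}
    ALLOWED_SELECTORS[("m16n8k128", etype)] = {0}


def contributing_threads(shape, etype, selector):
    """Return the in-group thread positions that contribute metadata."""
    allowed = ALLOWED_SELECTORS[(shape, etype)]
    if selector not in allowed:
        raise ValueError("selector value leads to undefined behavior")
    if allowed == {0}:
        return [0, 1, 2, 3]          # all four threads contribute
    return [0, 1] if selector == 0 else [2, 3]


print(contributing_threads("m16n8k32", "u8", 1))
```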
When matrices A and B have .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 elements, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeroes and two non-zero elements. Only the two non-zero elements are stored in the sparse matrix, and their positions in the four-wide chunk are indicated by two 2-bit indices in the metadata. 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other values result in undefined behavior.
Figure 116: Sparse MMA metadata example for .e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.
The sparsity selector indicates the threads which contribute metadata as listed below:
m16n8k64: All threads within a group of four consecutive threads contribute the sparsity metadata. Hence, the sparsity selector in this case must be 0. Any other value of sparsity selector results in undefined behavior.
Sparse mma.sp::ordered_metadata operating on .e2m1 type with .kind::mxf4 or .kind::mxf4nvf4
When matrices A and B have .e2m1 elements, matrix A is pair-wise structured sparse at a granularity of 4:8. In other words, each chunk of eight adjacent elements in a row of matrix A has four zeroes and four non-zero values. Further, the zero and non-zero values are clustered in sub-chunks of two elements each within the eight-wide chunk, i.e., each two-wide sub-chunk within the eight-wide chunk must be all zeroes or all non-zeros. Only the four non-zero values are stored in the sparse matrix, and the positions of the two two-wide sub-chunks with non-zero values in the eight-wide chunk of a row of matrix A are indicated by two 2-bit indices in the metadata. 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110 are the meaningful values of indices; any other values result in undefined behavior.
Figure 117: Sparse MMA metadata example for .e2m1 type with .kind::mxf4 or .kind::mxf4nvf4.
The sparsity selector indicates the threads which contribute metadata as listed below:
m16n8k128: All threads within a group of four consecutive threads contribute the sparsity metadata. Hence, the sparsity selector in this case must be 0. Any other value of sparsity selector results in undefined behavior.
In this section we describe how the contents of thread registers are associated with fragments of various matrices and the sparsity metadata. The following conventions are used throughout this section:
For matrix A, only the layout of a fragment is described in terms of register vector sizes and their association with the matrix data.
For matrices C and D, since the matrix dimension - data type combination is the same for all supported shapes, and is already covered in Matrix multiply-accumulate operation using mma instruction, the pictorial representations of matrix fragments are not included in this section.
For the metadata operand, pictorial representations of the association between indices of the elements of matrix A and the contents of the metadata operand are included. Tk: [m..n] present in cell [x][y..z] indicates that bits m through n (with m being higher) in the metadata operand of thread with %laneid=k contain the indices of the non-zero elements from the chunk [x][y]..[x][z] of matrix A.
A warp executing sparse mma.m16n8k16 with .f16 / .bf16 floating point type will compute an MMA operation of shape .m16n8k16.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holdsa fragment of the matrix.
Multiplicand A:
| .atype | Fragment | Elements |
| --- | --- | --- |
| .f16 / .bf16 | A vector expression containing two .b32 registers, with each register containing two non-zero .f16 / .bf16 elements out of 4 consecutive elements from matrix A. | |

The layout of the fragments held by different threads is shown in Figure 118.
Figure 118: Sparse MMA .m16n8k16 fragment layout for matrix A with .f16/.bf16 type.
The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID        for a0 and a1
          groupID + 8    for a2 and a3

    col = [firstcol ... lastcol]   // As per the mapping of non-zero elements
                                   // as described in Sparse matrix storage

    Where
    firstcol = threadID_in_group * 4
    lastcol  = firstcol + 3
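As a worked example of the formulas above, the following sketch evaluates the row and the candidate column range for one element of a thread's A fragment. The function name `a_fragment_m16n8k16_f16` is hypothetical; the exact column inside the range still depends on the sparsity metadata, which this sketch does not model.

```python
def a_fragment_m16n8k16_f16(laneid, i):
    """Row and [firstcol, lastcol] range for element a_i (i in 0..3)
    of the .f16/.bf16 sparse m16n8k16 A fragment held by `laneid`."""
    if not 0 <= laneid < 32:
        raise ValueError("laneid must be in 0..31")
    if not 0 <= i < 4:
        raise ValueError("element index must be in 0..3")
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8   # a0, a1 vs. a2, a3
    firstcol = thread_id_in_group * 4
    lastcol = firstcol + 3
    return row, (firstcol, lastcol)


print(a_fragment_m16n8k16_f16(5, 2))
```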
A warp executing sparse mma.m16n8k32 with .f16 / .bf16 floating point type will compute an MMA operation of shape .m16n8k32.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holdsa fragment of the matrix.
Multiplicand A:
| .atype | Fragment | Elements |
| --- | --- | --- |
| .f16 / .bf16 | A vector expression containing four .b32 registers, with each register containing two non-zero .f16 / .bf16 elements out of 4 consecutive elements from matrix A. | |

The layout of the fragments held by different threads is shown in Figure 120.
Figure 120: Sparse MMA .m16n8k32 fragment layout for matrix A with .f16/.bf16 type.
The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID        for ai where 0 <= i < 2 || 4 <= i < 6
          groupID + 8    otherwise

    col = [firstcol ... lastcol]   // As per the mapping of non-zero elements
                                   // as described in Sparse matrix storage

    Where
    firstcol = threadID_in_group * 4           for ai where i < 4
               (threadID_in_group * 4) + 16    for ai where i >= 4
    lastcol  = firstcol + 3
Multiplicand B:
| .btype | Fragment | Elements (low to high) |
| --- | --- | --- |
| .f16 / .bf16 | A vector expression containing four .b32 registers, each containing two .f16 / .bf16 elements from matrix B. | b0, b1, b2, b3 |

The layout of the fragments held by different threads is shown in Figure 121.
Figure 121: Sparse MMA .m16n8k32 fragment layout for matrix B with .f16/.bf16 type.
Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of two non-zero elements from a 4-wide chunk of matrix A as shown in Figure 122.
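To make the "16 2-bit vectors" description concrete, the sketch below splits a 32-bit metadata word into eight index pairs. The function name `unpack_metadata` is hypothetical, and the field order (pair c in bits [4c+3:4c], lower index in the low two bits) is an assumption; which chunk of matrix A each pair describes is given by the referenced figure and is not modelled here.

```python
def unpack_metadata(e):
    """Split a 32-bit metadata word into eight (low_index, high_index) pairs.

    Assumption: pair c occupies bits [4c+3 : 4c], with the first index in
    the two least significant bits of that field.
    """
    if not 0 <= e < (1 << 32):
        raise ValueError("metadata must fit in 32 bits")
    fields = [(e >> (2 * k)) & 0b11 for k in range(16)]
    return [(fields[2 * c], fields[2 * c + 1]) for c in range(8)]


print(unpack_metadata(0xE4)[:2])
```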
The layout of the fragments held by different threads is shown in Figure 123.
Figure 123: Sparse MMA .m16n8k16 fragment layout for matrix A with .tf32 type.
The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID        for a0 and a2
          groupID + 8    for a1 and a3

    col = [firstcol ... lastcol]   // As per the mapping of non-zero elements
                                   // as described in Sparse matrix storage

    Where
    firstcol = threadID_in_group * 2          for a0 and a1
               (threadID_in_group * 2) + 8    for a2 and a3
    lastcol  = firstcol + 1
Multiplicand B:
| .btype | Fragment | Elements (low to high) |
| --- | --- | --- |
| .tf32 | A vector expression containing four .b32 registers, each containing one .tf32 element from matrix B. | b0, b1, b2, b3 |

The layout of the fragments held by different threads is shown in Figure 124.
Figure 124: Sparse MMA .m16n8k16 fragment layout for matrix B with .tf32 type.
The layout of the fragments held by different threads is shown in Figure 126.
Figure 126: Sparse MMA .m16n8k8 fragment layout for matrix A with .tf32 type.
The row and column of a matrix fragment can be computed as:

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID        for a0
          groupID + 8    for a1

    col = [firstcol ... lastcol]   // As per the mapping of non-zero elements
                                   // as described in Sparse matrix storage

    Where
    firstcol = threadID_in_group * 2
    lastcol  = firstcol + 1

Matrix fragments for multiplicand B and accumulators C and D are the same as in the case of Matrix Fragments for mma.m16n8k8 for .tf32 format.
Metadata: A .b32 register containing 8 4-bit vectors, each storing the index of a non-zero element of a 2-wide chunk of matrix A as shown in Figure 127.
A warp executing sparse mma.m16n8k32 with .u8 / .s8 integer type will compute an MMA operation of shape .m16n8k32.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holdsa fragment of the matrix.
Multiplicand A:
| .atype | Fragment | Elements |
| --- | --- | --- |
| .u8 / .s8 | A vector expression containing two .b32 registers, with each register containing four non-zero .u8 / .s8 elements out of 8 consecutive elements from matrix A. | |

The layout of the fragments held by different threads is shown in Figure 128.
Figure 128: Sparse MMA .m16n8k32 fragment layout for matrix A with .u8/.s8 type.

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID        for ai where 0 <= i < 4
          groupID + 8    otherwise

    col = [firstcol ... lastcol]   // As per the mapping of non-zero elements
                                   // as described in Sparse matrix storage

    Where
    firstcol = threadID_in_group * 8
    lastcol  = firstcol + 7

Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of two non-zero elements from a 4-wide chunk of matrix A as shown in Figure 129.
A warp executing sparse mma.m16n8k64 with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type will compute an MMA operation of shape .m16n8k64.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holdsa fragment of the matrix.
Multiplicand A:
| .atype | Fragment | Elements |
| --- | --- | --- |
| .u8 / .s8 | A vector expression containing four .b32 registers, with each register containing four non-zero .u8 / .s8 elements out of 8 consecutive elements from matrix A. | |
| .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 | A vector expression containing four .b32 registers, with each register containing four non-zero .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 elements out of 8 consecutive elements from matrix A. | |

The layout of the fragments held by different threads is shown in Figure 130 and Figure 131.
Figure 130: Sparse MMA .m16n8k64 fragment layout for columns 0–31 of matrix A with .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.
Figure 131: Sparse MMA .m16n8k64 fragment layout for columns 32–63 of matrix A with .u8/.s8/.e4m3/.e5m2/.e3m2/.e2m3/.e2m1 type.

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID        for ai where 0 <= i < 4 || 8 <= i < 12
          groupID + 8    otherwise

    col = [firstcol ... lastcol]   // As per the mapping of non-zero elements
                                   // as described in Sparse matrix storage

    Where
    firstcol = threadID_in_group * 8           for ai where i < 8
               (threadID_in_group * 8) + 32    for ai where i >= 8
    lastcol  = firstcol + 7
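The m16n8k64 mapping for 8-bit element types can be checked numerically with a short sketch. The function name `a_fragment_m16n8k64_8bit` is hypothetical; it covers the sixteen stored elements a0..a15 per thread (four registers of four elements) and returns only the candidate column range, since the exact column depends on the metadata.

```python
def a_fragment_m16n8k64_8bit(laneid, i):
    """Row and column range for element a_i (i in 0..15) of the sparse
    m16n8k64 A fragment with 8-bit element types."""
    if not 0 <= laneid < 32:
        raise ValueError("laneid must be in 0..31")
    if not 0 <= i < 16:
        raise ValueError("element index must be in 0..15")
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    in_upper_rows = (0 <= i < 4) or (8 <= i < 12)
    row = group_id if in_upper_rows else group_id + 8
    firstcol = thread_id_in_group * 8 + (32 if i >= 8 else 0)
    return row, firstcol, firstcol + 7


print(a_fragment_m16n8k64_8bit(31, 15))
```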
Multiplicand B:
| .btype | Fragment | Elements (low to high) |
| --- | --- | --- |
| .u8 / .s8 | A vector expression containing four .b32 registers, each containing four .u8 / .s8 elements from matrix B. | b0, b1, b2, b3, …, b15 |
| .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 | A vector expression containing four .b32 registers, each containing four .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 elements from matrix B. | b0, b1, b2, b3, …, b15 |

Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of two non-zero elements from a 4-wide chunk of matrix A as shown in Figure 136 and Figure 137.
A warp executing sparse mma.m16n8k64 with .u4 / .s4 integer type will compute an MMA operation of shape .m16n8k64.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holdsa fragment of the matrix.
Multiplicand A:
| .atype | Fragment | Elements |
| --- | --- | --- |
| .u4 / .s4 | A vector expression containing two .b32 registers, with each register containing eight non-zero .u4 / .s4 elements out of 16 consecutive elements from matrix A. | |

The layout of the fragments held by different threads is shown in Figure 138.
Figure 138: Sparse MMA .m16n8k64 fragment layout for matrix A with .u4/.s4 type.

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID        for ai where 0 <= i < 8
          groupID + 8    otherwise

    col = [firstcol ... lastcol]   // As per the mapping of non-zero elements
                                   // as described in Sparse matrix storage

    Where
    firstcol = threadID_in_group * 16
    lastcol  = firstcol + 15

Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of four non-zero elements from an 8-wide chunk of matrix A as shown in Figure 139.
A warp executing sparse mma.m16n8k128 with .u4 / .s4 / .e2m1 type will compute an MMA operation of shape .m16n8k128.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holdsa fragment of the matrix.
Multiplicand A:
| .atype | Fragment | Elements |
| --- | --- | --- |
| .u4 / .s4 | A vector expression containing four .b32 registers, with each register containing eight non-zero .u4 / .s4 elements out of 16 consecutive elements from matrix A. | |
| .e2m1 | A vector expression containing four .b32 registers, with each register containing eight non-zero .e2m1 elements out of 16 consecutive elements from matrix A. | |

The layout of the fragments held by different threads is shown in Figure 140 and Figure 141.
Figure 140: Sparse MMA .m16n8k128 fragment layout for columns 0–63 of matrix A with .u4/.s4/.e2m1 type.
Figure 141: Sparse MMA .m16n8k128 fragment layout for columns 64–127 of matrix A with .u4/.s4/.e2m1 type.

    groupID           = %laneid >> 2
    threadID_in_group = %laneid % 4

    row = groupID        for ai where 0 <= i < 8 || 16 <= i < 24
          groupID + 8    otherwise

    col = [firstcol ... lastcol]   // As per the mapping of non-zero elements
                                   // as described in Sparse matrix storage

    Where
    firstcol = threadID_in_group * 16           for ai where i < 16
               (threadID_in_group * 16) + 64    for ai where i >= 16
    lastcol  = firstcol + 15
Multiplicand B:
| .btype | Fragment | Elements (low to high) |
| --- | --- | --- |
| .u4 / .s4 | A vector expression containing four .b32 registers, each containing eight .u4 / .s4 elements from matrix B. | b0, b1, b2, b3, …, b31 |
| .e2m1 | A vector expression containing four .b32 registers, each containing eight .e2m1 elements from matrix B. | b0, b1, b2, b3, …, b31 |

Metadata: A .b32 register containing 16 2-bit vectors, with each pair of 2-bit vectors storing the indices of four non-zero elements from an 8-wide chunk of matrix A as shown in Figure 146 and Figure 147.
Perform matrix multiply-and-accumulate operation with sparse matrix A
Syntax
Half precision floating point type:
    mma.spvariant.sync.aligned.m16n8k16.row.col.dtype.f16.f16.ctype  d, a, b, c, e, f;
    mma.spvariant.sync.aligned.m16n8k32.row.col.dtype.f16.f16.ctype  d, a, b, c, e, f;

    .ctype     = {.f16, .f32};
    .dtype     = {.f16, .f32};
    .spvariant = {.sp, .sp::ordered_metadata};
Alternate floating point type:
    mma.spvariant.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32     d, a, b, c, e, f;
    mma.spvariant.sync.aligned.m16n8k32.row.col.f32.bf16.bf16.f32     d, a, b, c, e, f;
    mma.spvariant.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32      d, a, b, c, e, f;
    mma.spvariant.sync.aligned.m16n8k16.row.col.f32.tf32.tf32.f32     d, a, b, c, e, f;
    mma.spvariant.sync.aligned.m16n8k64.row.col.f32.f8type.f8type.f32 d, a, b, c, e, f;
    mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.kind.dtype.f8f6f4type.f8f6f4type.ctype d, a, b, c, e, f;

    .f8type     = {.e4m3, .e5m2};
    .spvariant  = {.sp, .sp::ordered_metadata};
    .f8f6f4type = {.e4m3, .e5m2, .e3m2, .e2m3, .e2m1};
    .kind       = {.kind::f8f6f4};
    .ctype      = {.f16, .f32};
    .dtype      = {.f16, .f32};
Integer type:

    mma.spvariant.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32  d, a, b, c, e, f;

    .shape     = {.m16n8k32, .m16n8k64};
    .atype     = {.u8, .s8};
    .btype     = {.u8, .s8};
    .spvariant = {.sp, .sp::ordered_metadata};

    mma.spvariant.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32  d, a, b, c, e, f;

    .shape     = {.m16n8k64, .m16n8k128};
    .atype     = {.u4, .s4};
    .btype     = {.u4, .s4};
    .spvariant = {.sp, .sp::ordered_metadata};
Description
Perform an MxNxK matrix multiply and accumulate operation, D = A*B+C, where the A matrix is MxK, the B matrix is KxN, and the C and D matrices are MxN.
A warp executing an mma.sp.sync/mma.sp::ordered_metadata.sync instruction computes a single matrix multiply and accumulate operation.
Qualifier .block_scale specifies that the matrices A and B are scaled with scale_A and scale_B matrices respectively before performing the matrix multiply and accumulate operation, as specified in the section Block Scaling. The data type of each element within the scale_A and scale_B matrices is specified by .stype. Qualifier .scale_vec_size specifies the number of columns of the scale_A matrix and the number of rows of the scale_B matrix.
The valid combinations of .kind, .stype and .scale_vec_size are described in Table 36. For mma with .kind::mxf4, when the qualifier .scale_vec_size is not specified, it defaults to 2X. In contrast, when .kind is specified as .kind::mxf8f6f4, the qualifier .scale_vec_size defaults to 1X. However, for .kind::mxf4nvf4, it is mandatory to provide a valid .scale_vec_size.
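The defaulting rule for .scale_vec_size can be written as a small helper. The function name `effective_scale_vec_size` is hypothetical, and the sketch does not check a supplied value against Table 36; it only applies the defaults stated above.

```python
def effective_scale_vec_size(kind, scale_vec_size=None):
    """Return the .scale_vec_size in effect for a given .kind.

    Only the defaulting rule from the text is modelled; validity of an
    explicitly supplied value is not checked here.
    """
    if scale_vec_size is not None:
        return scale_vec_size
    if kind == ".kind::mxf4nvf4":
        raise ValueError(".scale_vec_size is mandatory for .kind::mxf4nvf4")
    defaults = {".kind::mxf4": "2X", ".kind::mxf8f6f4": "1X"}
    if kind not in defaults:
        raise ValueError("no default known for " + kind)
    return defaults[kind]


print(effective_scale_vec_size(".kind::mxf4"))
```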
Operands a and b represent two multiplicand matrices A and B, while c and d represent the accumulator and destination matrices, distributed across the threads in the warp. Matrix A is structured sparse as described in Sparse matrix storage. Operands e and f represent sparsity metadata and sparsity selector respectively. Operand e is a 32-bit integer and operand f is a 32-bit integer constant with values in the range 0..3. When the .block_scale qualifier is specified, operands scale-a-data and scale-b-data represent the scale matrix metadata corresponding to the scale_A and scale_B matrices respectively. The tuples {byte-id-a, thread-id-a} and {byte-id-b, thread-id-b} represent selectors for matrices scale_A and scale_B respectively from their corresponding metadata arguments scale-a-data and scale-b-data. The operands scale-a-data and scale-b-data are of type .b32. The operands byte-id-a, thread-id-a, byte-id-b, thread-id-b are unsigned 16-bit integer values. For more details on selector arguments, refer to the Block Scaling section.
Instruction mma.sp::ordered_metadata requires the indices in the sparsity metadata to be sorted in increasing order starting from the LSB; otherwise the behavior is undefined.
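A host-side check of the ordering requirement might look like the sketch below. The function name `is_ordered_metadata` is hypothetical, and it assumes that all eight 4-bit fields of the 32-bit word are in use and that each field packs the lower index in its two least significant bits; under that assumption the sorted pairs are exactly the six meaningful values listed in Sparse matrix storage.

```python
# The six meaningful 4-bit values given for mma.sp::ordered_metadata.
MEANINGFUL = {0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1110}


def is_ordered_metadata(e):
    """True if every 4-bit field of the 32-bit word `e` is a meaningful value.

    Assumption: all eight fields of the word carry index pairs.
    """
    if not 0 <= e < (1 << 32):
        raise ValueError("metadata must fit in 32 bits")
    return all(((e >> (4 * c)) & 0xF) in MEANINGFUL for c in range(8))


print(is_ordered_metadata(0xE4E4E4E4), is_ordered_metadata(0x11111111))
```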
The qualifiers .dtype, .atype, .btype and .ctype indicate the data type of the elements in the matrices D, A, B and C respectively. The qualifier .stype indicates the data type of the elements in the matrices scale_A and scale_B. In case of shapes .m16n8k16 and .m16n8k32, .dtype must be the same as .ctype.
When .kind is either .kind::mxf8f6f4 or .kind::f8f6f4, the individual 4-bit and 6-bit floating point type elements must be packed in an 8-bit container. A matrix element of type .e2m1 resides in the central 4 bits of the 8-bit container, with padding in the upper 2 bits and lower 2 bits of the container. When the matrix element is of type .e3m2 or .e2m3, the matrix element resides in the lower 6 bits of the 8-bit container, with padding in the upper 2 bits of the container. In contrast, note that when using mma with .kind::mxf4 or .kind::mxf4nvf4, no explicit padding is necessary even though matrix elements are of type .e2m1.
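The container layout described above amounts to a shift and a mask. In this sketch the function names are hypothetical and the padding bits are written as zero, which is an assumption; the text only states where the padding is located.

```python
def pack_e2m1_in_byte(bits4):
    """Place a 4-bit .e2m1 encoding in the central 4 bits of a byte."""
    if not 0 <= bits4 < 16:
        raise ValueError("expected a 4-bit encoding")
    return bits4 << 2            # bits [5:2]; bits [7:6] and [1:0] are padding


def pack_6bit_in_byte(bits6):
    """Place a 6-bit .e3m2 / .e2m3 encoding in the lower 6 bits of a byte."""
    if not 0 <= bits6 < 64:
        raise ValueError("expected a 6-bit encoding")
    return bits6                 # bits [5:0]; bits [7:6] are padding


print(format(pack_e2m1_in_byte(0b1111), "08b"),
      format(pack_6bit_in_byte(0b101010), "08b"))
```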
Precision and rounding:
.f16 floating point operations:
Element-wise multiplication of matrix A and B is performed with at least single precision. When .ctype or .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When both .ctype and .dtype are specified as .f16, the accumulation is performed with at least half precision.
The accumulation order, rounding and handling of subnormal inputs are unspecified.
.e4m3, .e5m2, .e3m2, .e2m3, .e2m1 floating point operations:
Element-wise multiplication of matrix A and B is performed with specified precision. Accumulation of the intermediate values is performed with at least single precision.
The accumulation order, rounding, and handling of subnormal inputs are unspecified.
.bf16 and .tf32 floating point operations:
Element-wise multiplication of matrix A and B is performed with specified precision. Accumulation of the intermediate values is performed with at least single precision.
The accumulation order, rounding, and handling of subnormal inputs are unspecified.
Integer operations:
The integer mma.sp/mma.sp::ordered_metadata operation is performed with .s32 accumulators. The .satfinite qualifier indicates that on overflow, the accumulated value is limited to the range MIN_INT32..MAX_INT32 (where the bounds are defined as the minimum negative signed 32-bit integer and the maximum positive signed 32-bit integer respectively).
If .satfinite is not specified, the accumulated value is wrapped instead.
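The difference between saturating and wrapping accumulation can be seen in a short sketch. The function name `accumulate_s32` is hypothetical, and the sketch applies the limit to the exact final sum; whether hardware limits intermediate sums is not stated here, so treat that as an assumption.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def accumulate_s32(c, products, satfinite):
    """Add integer products to accumulator c with .s32 semantics.

    Assumption: saturation or wrapping is applied once to the exact sum.
    """
    total = c + sum(products)
    if satfinite:
        return max(INT32_MIN, min(INT32_MAX, total))
    # Two's-complement wrap to 32 bits.
    return (total + 2**31) % 2**32 - 2**31


print(accumulate_s32(INT32_MAX, [1], True), accumulate_s32(INT32_MAX, [1], False))
```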
The mandatory .sync qualifier indicates that the mma.sp/mma.sp::ordered_metadata instruction causes the executing thread to wait until all threads in the warp execute the same mma.sp/mma.sp::ordered_metadata instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same mma.sp/mma.sp::ordered_metadata instruction. In conditionally executed code, a mma.sp/mma.sp::ordered_metadata instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
The behavior of the mma.sp/mma.sp::ordered_metadata instruction is undefined if all threads in the same warp do not use the same qualifiers, or if any thread in the warp has exited.
Notes
The mma.sp instruction may have substantially reduced performance on some target architectures. Hence, it is advised to use the mma.sp::ordered_metadata instruction.
PTX ISA Notes
Introduced in PTX ISA version 7.1.
Support for .e4m3 and .e5m2 alternate floating point type mma operation introduced in PTX ISA version 8.4.
mma.sp::ordered_metadata introduced in PTX ISA version 8.5.
Support for shape .m16n8k32 and .f16 dtype/ctype with .e4m3/.e5m2 alternate floating point type mma operation introduced in PTX ISA version 8.7.
Support for .e3m2, .e2m3, .e2m1 alternate floating point type mma operation introduced in PTX ISA version 8.7.
Support for .kind, .block_scale, .scale_vec_size qualifiers introduced in PTX ISA version 8.7.
Target ISA Notes
Requires sm_80 or higher.
.e4m3 and .e5m2 alternate floating point type mma operation requires sm_89 or higher.
mma.sp::ordered_metadata requires sm_80 or higher.
Support for shape .m16n8k32 and .f16 dtype/ctype with .e4m3/.e5m2 alternate floating point type mma operation requires sm_120.
.e3m2, .e2m3 and .e2m1 alternate floating point type mma operations require sm_120a and are supported on sm_120f or higher in the same family from PTX ISA version 8.8.
Support for .kind, .block_scale, .scale_vec_size qualifiers requires sm_120a and is available on sm_120f and later generation targets in the same family from PTX ISA version 8.8, except for .kind::mxf4nvf4/.kind::mxf4.
Qualifiers .kind::mxf4nvf4 and .kind::mxf4 are supported on the following architectures:
sm_120a
sm_121a
Examples of half precision floating point type
    // f16 elements in C and D matrix
    .reg .f16x2 %Ra<2>, %Rb<2>, %Rc<2>, %Rd<2>;
    .reg .b32 %Re;
    mma.sp.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
      {%Rd0, %Rd1}, {%Ra0, %Ra1}, {%Rb0, %Rb1}, {%Rc0, %Rc1}, %Re, 0x1;

    // f16 elements in C and D matrix with ordered metadata
    .reg .f16x2 %Ra<2>, %Rb<2>, %Rc<2>, %Rd<2>;
    .reg .b32 %Re;
    mma.sp::ordered_metadata.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
      {%Rd0, %Rd1}, {%Ra0, %Ra1}, {%Rb0, %Rb1}, {%Rc0, %Rc1}, %Re, 0x1;
Examples of integer type

    .reg .b32 %Ra<4>, %Rb<4>, %Rc<4>, %Rd<4>;
    .reg .u32 %Re;

    // u8 elements in A and B matrix
    mma.sp.sync.aligned.m16n8k32.row.col.satfinite.s32.u8.u8.s32
      {%Rd0, %Rd1, %Rd2, %Rd3}, {%Ra0, %Ra1}, {%Rb0, %Rb1}, {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x1;

    // s8 elements in A and B matrix
    mma.sp.sync.aligned.m16n8k64.row.col.satfinite.s32.s8.s8.s32
      {%Rd0, %Rd1, %Rd2, %Rd3}, {%Ra0, %Ra1, %Ra2, %Ra3}, {%Rb0, %Rb1, %Rb2, %Rb3}, {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x0;

    // s8 elements in A and B matrix with ordered metadata
    mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.satfinite.s32.s8.s8.s32
      {%Rd0, %Rd1, %Rd2, %Rd3}, {%Ra0, %Ra1, %Ra2, %Ra3}, {%Rb0, %Rb1, %Rb2, %Rb3}, {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x0;

    // s4 elements in A and B matrix
    mma.sp.sync.aligned.m16n8k64.row.col.s32.s4.s4.s32
      {%Rd0, %Rd1, %Rd2, %Rd3}, {%Ra0, %Ra1}, {%Rb0, %Rb1}, {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x1;

    // u4 elements in A and B matrix
    mma.sp.sync.aligned.m16n8k128.row.col.satfinite.s32.u4.u4.s32
      {%Rd0, %Rd1, %Rd2, %Rd3}, {%Ra0, %Ra1, %Ra2, %Ra3}, {%Rb0, %Rb1, %Rb2, %Rb3}, {%Rc0, %Rc1, %Rc2, %Rc3}, %Re, 0x0;
The warpgroup level matrix multiply and accumulate operation has either of the following forms, where matrix D is called the accumulator:
D = A*B+D
D = A*B, where the input from accumulator D is disabled.
The wgmma instructions perform a warpgroup level matrix multiply-and-accumulate operation by having all threads in a warpgroup collectively perform the following actions:
Load matrices A, B and D into registers or into shared memory.
Perform the following fence operations:
wgmma.fence operations to indicate that the register/shared-memory across the warpgroup have been written into.
fence.proxy.async operation to make the generic proxy operations visible to the async proxy.
Issue the asynchronous matrix multiply and accumulate operations using the wgmma.mma_async operation on the input matrices. The wgmma.mma_async operation is performed in the async proxy.
Create a wgmma-group and commit all the prior outstanding wgmma.mma_async operations into the group, by using the wgmma.commit_group operation.
Wait for the completion of the required wgmma-group.
Once the wgmma-group completes, all the wgmma.mma_async operations have been performed and completed.
The matrix multiply and accumulate operations support a limited set of shapes for the operand matrices A, B and D. The shapes of all three matrix operands are collectively described by the tuple MxNxK, where A is an MxK matrix, B is a KxN matrix, while D is an MxN matrix.
The following matrix shapes are supported for the specified types for the wgmma.mma_async operation:
The matrix multiply and accumulate operation is supported separately on integer, floating-point,sub-byte integer and single bit data-types. All operands must contain the same basic type kind,i.e., integer or floating-point.
For floating-point matrix multiply and accumulate operation, different matrix operands may have different precision, as described later.
For integer matrix multiply and accumulate operation, both multiplicand matrices (A and B) must have elements of the same data-type, e.g. both signed integer or both unsigned integer.
The wgmma.mma_async operations are performed in the asynchronous proxy (or async proxy).
Accessing the same memory location across multiple proxies needs a cross-proxy fence. For the async proxy, fence.proxy.async should be used to synchronize memory between the generic proxy and the async proxy.
The completion of a wgmma.mma_async operation is followed by an implicit generic-async proxy fence. So the result of the asynchronous operation is made visible to the generic proxy as soon as its completion is observed. wgmma.commit_group and wgmma.wait_group operations must be used to wait for the completion of the wgmma.mma_async instructions.
The input matrix A of the warpgroup wide MMA operations can be either in registers or in the shared memory. The input matrix B of the warpgroup wide MMA operations must be in the shared memory. This section describes the layouts of register fragments and shared memory expected by the warpgroup MMA instructions.
When the matrices are in shared memory, their starting addresses must be aligned to 16 bytes.
A warpgroup executing wgmma.mma_async.m64nNk256 will compute an MMA operation of shape .m64nNk256, where N is a valid n dimension as listed in Matrix Shape.
Elements of the matrix are distributed across the threads in a warpgroup so each thread of the warpgroup holds a fragment of the matrix.
Multiplicand A in registers:
| .atype | Fragment | Elements (low to high) |
| --- | --- | --- |
| .b1 | A vector expression containing four .b32 registers, with each register containing thirty-two .b1 elements from matrix A. | a0, a1, a2, …, a127 |

The layout of the fragments held by different threads is shown in Figure 154.
Figure 154: WGMMA .m64nNk256 register fragment layout for matrix A.
Accumulator D:
| .dtype | Fragment | Elements (low to high) |
| --- | --- | --- |
| .s32 | A vector expression containing N/2 number of .s32 registers. | d0, d1, d2, d3, …, dX, dY, dZ, dW |

where X = N/2 - 4
      Y = N/2 - 3
      Z = N/2 - 2
      W = N/2 - 1
      N = 8*i  where i = {1, 2, 3, 4}
        = 16*i where i = {3, 4, ..., 15, 16}

The layout of the fragments held by different threads is shown in Figure 155.
Figure 155: WGMMA .m64nNk256 register fragment layout for accumulator matrix D.
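The accumulator sizing above can be enumerated directly. In this sketch the function names are hypothetical; the valid N values and the N/2 register count are taken from the accumulator table for the .m64nNk256 shape.

```python
def valid_n_values():
    """N = 8*i for i in 1..4, or N = 16*i for i in 3..16."""
    return [8 * i for i in range(1, 5)] + [16 * i for i in range(3, 17)]


def accumulator_register_count(n):
    """Number of .s32 registers per thread in the D fragment (N/2)."""
    if n not in valid_n_values():
        raise ValueError("N is not a valid n dimension for this sketch")
    return n // 2


# Cross-check: 64 x N accumulator elements spread over 128 threads.
assert all(64 * n // 128 == accumulator_register_count(n) for n in valid_n_values())
print(valid_n_values())
```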
If the argument imm-trans-a / imm-trans-b of the instruction wgmma.mma_async{.sp} is 0, then K-major is used for matrix A / B respectively. If the value of argument imm-trans-a is 1, then M-major is used for matrix A. If the value of the argument imm-trans-b is 1, then N-major is used for matrix B.
In a column-major default BLAS library such as cuBLAS, the matrices A and B with and without transpose can be classified as either K-Major or M-or-N-Major as shown in the following table:

| | Non-Transposed | Transposed |
| --- | --- | --- |
| A | K-major | M-major |
| B | K-major | N-major |

To avoid confusion with A, B, row-major, col-major, transpose, and non-transpose, we will use MN-Major and K-Major throughout this section.
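The mapping from the immediate transpose arguments to major-ness is small enough to state as code. The function name `major_ness` is hypothetical; the mapping itself is the one given in the text and table above.

```python
def major_ness(imm_trans_a, imm_trans_b):
    """Map imm-trans-a / imm-trans-b (each 0 or 1) to the major-ness of A and B."""
    for value in (imm_trans_a, imm_trans_b):
        if value not in (0, 1):
            raise ValueError("transpose arguments must be 0 or 1")
    a_major = "M-major" if imm_trans_a == 1 else "K-major"
    b_major = "N-major" if imm_trans_b == 1 else "K-major"
    return a_major, b_major


print(major_ness(0, 1))
```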
The matrices in the shared memory are made up of one or more "swizzle layout atoms". The exact layout of these swizzle atoms depends on the swizzling mode, swizzle-atomicity, and the leading dimension. The layouts of the swizzle atoms are shown in Table 38.
Table 38: Various combinations of swizzling mode, leading dimension and swizzle-atom layout

| Swizzling mode | Leading Dimension / Major-ness | Swizzle atom layout (128b element) |
| --- | --- | --- |
| 128B Swizzling Mode | M/N | 8x8 |
| 128B Swizzling Mode | K | 8x8 |
| 64B Swizzling Mode | M/N | 4x8 |
| 64B Swizzling Mode | K | 8x4 |
| 32B Swizzling Mode | M/N | 2x8 |
| 32B Swizzling Mode | K | 8x2 |
| None | M/N | 1x8 |
| None | K | 8x1 |

The above shapes are for elements of size 128 bits. For smaller element sizes, the same shapes are multiplied along the leading dimension by a factor of 128/sizeof_bits(Element). For example, a 128B MN-major swizzle atom would have a shape of (8*(128/32))x8 = 32x8 for tf32 tensor core inputs.
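The scaling rule can be applied mechanically to Table 38. In this sketch the names `ATOM_SHAPE_128B` and `swizzle_atom_shape` are hypothetical. The MN-major case follows the 32x8 tf32 example in the text; treating the second extent as the leading one for K-major is an inference from the table (it is the extent that varies with the swizzle width) and should be read as an assumption.

```python
# Swizzle atom shapes for 128-bit elements, transcribed from Table 38.
ATOM_SHAPE_128B = {
    ("128B", "MN"): (8, 8), ("128B", "K"): (8, 8),
    ("64B", "MN"): (4, 8),  ("64B", "K"): (8, 4),
    ("32B", "MN"): (2, 8),  ("32B", "K"): (8, 2),
    ("none", "MN"): (1, 8), ("none", "K"): (8, 1),
}


def swizzle_atom_shape(mode, major, elem_bits):
    """Scale the 128-bit-element atom shape to elements of `elem_bits` bits."""
    if elem_bits <= 0 or 128 % elem_bits != 0:
        raise ValueError("element size must divide 128 bits")
    factor = 128 // elem_bits
    first, second = ATOM_SHAPE_128B[(mode, major)]
    if major == "MN":
        return first * factor, second
    # Assumption: for K-major the second extent is the leading dimension.
    return first, second * factor


print(swizzle_atom_shape("128B", "MN", 32))
```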
The leading dimension byte offset is defined differently for transposed and non-transposed matrices. The leading dimension byte offset is defined as follows for matrices whose element types are normalized to 128 bits:

| Major-ness | Definition |
| --- | --- |
| K-Major | No-Swizzling: the offset from the first column to the second column of the 8x2 tile in the 128-bit element type normalized matrix. |
| K-Major | Swizzled layouts: not used, assumed to be 1. |
| MN-Major | Interleave: offset from the first 8 columns to the next 8 columns. |
| MN-Major | Swizzled layouts: offset from the first (swizzle-byte-size/16) rows to the next (swizzle-byte-size/16) rows. |
The stride dimension byte offset is defined differently for transposed and non-transposed matrices. The stride dimension byte offset is defined as follows for matrices whose element types are normalized to 128 bits:

| Major-ness | Definition |
| --- | --- |
| K-Major | The offset from the first 8 rows to the next 8 rows. |
| MN-Major | Interleave: offset from the first row to the next row. |
| MN-Major | Swizzled layout: offset from the first 8 columns to the next 8 columns. |
The matrix descriptor specifies the properties of the matrix in shared memory that is a multiplicand in the matrix multiply and accumulate operation. It is a 64-bit value contained in a register with the following layout:
Instruction wgmma.mma_async issues an MxNxK matrix multiply and accumulate operation, D = A*B+D, where the A matrix is MxK, the B matrix is KxN, and the D matrix is MxN.
The operation of the form D = A*B is issued when the input predicate argument scale-d is false.
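The two forms of the operation can be illustrated with a plain reference computation on nested lists. The function name `wgmma_reference` is hypothetical; it models only the arithmetic and the effect of scale-d, not the asynchronous execution, data types or register layout.

```python
def wgmma_reference(a, b, d, scale_d=True):
    """Reference D = A*B + D (scale_d true) or D = A*B (scale_d false).

    a is M x K, b is K x N, d is M x N, all as lists of rows.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = d[i][j] if scale_d else 0   # accumulator input disabled
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out


a = [[1, 2]]
b = [[3], [4]]
d = [[10]]
print(wgmma_reference(a, b, d, True), wgmma_reference(a, b, d, False))
```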
The wgmma.fence instruction must be used to fence the register accesses of the wgmma.mma_async instruction from their prior accesses. Otherwise, the behavior is undefined.
wgmma.commit_group and wgmma.wait_group operations must be used to wait for the completion of the asynchronous matrix multiply and accumulate operations before the results are accessed.
Register operand d represents the accumulator matrix as well as the destination matrix, distributed across the participating threads. Register operand a represents the multiplicand matrix A in registers, distributed across the participating threads. The 64-bit register operands a-desc and b-desc are the matrix descriptors which represent the multiplicand matrices A and B in shared memory respectively. The contents of a matrix descriptor must be the same across all the warps in the warpgroup. The format of the matrix descriptor is described in Matrix Descriptor Format.
Matrices A and B are stored in row-major and column-major format respectively. For certain floating point variants, the input matrices A and B can be transposed by specifying the value 1 for the immediate integer arguments imm-trans-a and imm-trans-b respectively. A value of 0 can be used to avoid the transpose operation. The valid values of imm-trans-a and imm-trans-b are 0 and 1. The transpose operation is only supported for the wgmma.mma_async variants with .f16/.bf16 types on matrices accessed from shared memory using matrix descriptors.
For the floating point variants of the wgmma.mma_async operation, each element of the input matrices A and B can be negated by specifying the value -1 for operands imm-scale-a and imm-scale-b respectively. A value of 1 can be used to avoid the negate operation. The valid values of imm-scale-a and imm-scale-b are -1 and 1.
The qualifiers .dtype, .atype and .btype indicate the data type of the elements in matrices D, A and B respectively. .atype and .btype must be the same for all floating point wgmma.mma_async variants except for the FP8 floating point variants. The sizes of individual data elements of matrices A and B in alternate floating point variants of the wgmma.mma_async operation are as follows:
Matrices A and B have 8-bit data elements when .atype/.btype is .e4m3/.e5m2.
Matrices A and B have 16-bit data elements when .atype/.btype is .bf16.
Matrices A and B have 32-bit data elements when .atype/.btype is .tf32.
Precision and rounding:
Floating point operations:
Element-wise multiplication of matrix A and B is performed with at least single precision. When .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When .dtype is .f16, the accumulation is performed with at least half precision.
The accumulation order, rounding and handling of subnormal inputs are unspecified.
.bf16 and .tf32 floating point operations:
Element-wise multiplication of matrix A and B is performed with specified precision. A wgmma.mma_async operation involving type .tf32 will truncate the lower 13 bits of the 32-bit input data before multiplication is issued. Accumulation of the intermediate values is performed with at least single precision.
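The .tf32 truncation described above can be modelled on the host. The following is a minimal Python sketch, not hardware-exact behavior: it assumes the input is an IEEE binary32 value and simply clears the 13 least significant bits of its encoding. The function name truncate_tf32 is hypothetical.

```python
import struct


def truncate_tf32(x: float) -> float:
    """Clear the low 13 bits of the binary32 encoding of x (host-side model)."""
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Drop the 13 least significant mantissa bits.
    bits &= 0xFFFFFFFF & ~0x1FFF
    # Reinterpret the masked bits as a float32 value again.
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Mantissa bits below position 13 are lost, so two inputs that differ only in those bits multiply identically.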
The accumulation order, rounding, and handling of subnormal inputs are unspecified.
Integer operations:
The integer wgmma.mma_async operation is performed with .s32 accumulators. The .satfinite qualifier indicates that on overflow, the accumulated value is limited to the range MIN_INT32..MAX_INT32 (where the bounds are defined as the minimum negative signed 32-bit integer and the maximum positive signed 32-bit integer respectively).
If .satfinite is not specified, the accumulated value is wrapped instead.
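The difference between saturating and wrapping accumulation can be illustrated with a small Python model. This is a sketch of a single accumulation step under the assumption of two's complement wrap-around; the point at which the hardware applies saturation within a longer accumulation is not stated here, and the name accumulate_s32 is hypothetical.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def accumulate_s32(acc: int, value: int, satfinite: bool) -> int:
    """Add value to a signed 32-bit accumulator, saturating or wrapping."""
    total = acc + value
    if satfinite:
        # Clamp to the range MIN_INT32..MAX_INT32.
        return max(INT32_MIN, min(INT32_MAX, total))
    # Wrap modulo 2**32 and reinterpret as a signed 32-bit integer.
    return (total + 2 ** 31) % 2 ** 32 - 2 ** 31
```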
The mandatory .sync qualifier indicates that wgmma.mma_async instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.mma_async instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.mma_async instruction. In conditionally executed code, a wgmma.mma_async instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 8.0.
Support for .u8.s8 and .s8.u8 as .atype.btype introduced in PTX ISA version 8.4.
This section describes warp-level wgmma.mma_async.sp instruction with sparse matrix A. This variant of the wgmma.mma_async operation can be used when A is a structured sparse matrix with 50% zeros in each row distributed in a shape-specific granularity. For an MxNxK sparse wgmma.mma_async.sp operation, the MxK matrix A is packed into MxK/2 elements. For each K-wide row of matrix A, 50% of the elements are zeros and the remaining K/2 non-zero elements are packed in the operand representing matrix A. The mapping of these K/2 elements to the corresponding K-wide row is provided explicitly as metadata.
Granularity of sparse matrix A is defined as the ratio of the number of non-zero elements in a sub-chunk of the matrix row to the total number of elements in that sub-chunk where the size of the sub-chunk is shape-specific. For example, in a 64x32 matrix A used in floating point wgmma.mma_async operations, sparsity is expected to be at 2:4 granularity, i.e. each 4-element vector (i.e. a sub-chunk of 4 consecutive elements) of a matrix row contains 2 zeros. Index of each non-zero element in a sub-chunk is stored in the metadata operand. Values 0b0000, 0b0101, 0b1010, 0b1111 are invalid values for metadata and will result in undefined behavior. In a group of four consecutive threads, one or more threads store the metadata for the whole group depending upon the matrix shape. These threads are specified using an additional sparsity selector operand.
Matrix A and its corresponding input operand to the sparse wgmma is similar to the diagram shown in Figure 111, with an appropriate matrix size.
Granularities for different matrix shapes and data types are described below.
Sparse wgmma.mma_async.sp with half-precision and .bf16 type
For .f16 and .bf16 types, for all supported 64xNx32 shapes, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in matrix A and their positions in the four-wide chunk in matrix A are indicated by two 2-bit indices in the metadata operand.
Figure 171 Sparse WGMMA metadata example for .f16/.bf16 type.
The sparsity selector indicates a thread-pair within a group of four consecutive threads which contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in undefined behavior.
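The 2:4 packing described above can be sketched in Python as follows. This is an illustrative host-side model, not the hardware layout: it assumes that the index of the first kept element occupies the low 2 bits of each 4-bit metadata value and the second index the high 2 bits, and it pads chunks with fewer than two non-zero elements using zero-valued positions. The names compress_2_4 and is_valid_nibble are hypothetical.

```python
# Metadata values in which both 2-bit indices are equal are invalid.
INVALID_NIBBLES = {0b0000, 0b0101, 0b1010, 0b1111}


def is_valid_nibble(nibble: int) -> bool:
    """Return True if a 4-bit metadata value names two distinct positions."""
    return 0 <= nibble <= 0b1111 and nibble not in INVALID_NIBBLES


def compress_2_4(row):
    """Pack a 2:4 sparse row into (kept_values, metadata_nibbles)."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    packed, nibbles = [], []
    for start in range(0, len(row), 4):
        chunk = row[start:start + 4]
        keep = [i for i, v in enumerate(chunk) if v != 0]
        if len(keep) > 2:
            raise ValueError("chunk has more than two non-zero elements")
        # Pad with zero-valued positions so exactly two distinct indices are kept.
        for i in range(4):
            if len(keep) == 2:
                break
            if i not in keep:
                keep.append(i)
        keep.sort()
        packed.extend(chunk[i] for i in keep)
        # Assumed bit order: first index in bits 1..0, second index in bits 3..2.
        nibbles.append(keep[0] | (keep[1] << 2))
    return packed, nibbles
```

For example, the row [0, 5, 0, 7] keeps the elements at positions 1 and 3, which this sketch encodes as 0b1101.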
Sparse wgmma.mma_async.sp with .tf32 type
For .tf32 type, for all supported 64xNx16 shapes, matrix A is structured sparse at a granularity of 1:2. In other words, each chunk of two adjacent elements in a row of matrix A has one zero and one non-zero element. Only the non-zero element is stored in the operand for matrix A and the 4-bit index in the metadata indicates the position of the non-zero element in the two-wide chunk. 0b1110 and 0b0100 are the only meaningful values of the index; the remaining values result in undefined behavior.
Figure 172 Sparse WGMMA metadata example for .tf32 type.
The sparsity selector indicates a thread-pair within a group of four consecutive threads which contributes the sparsity metadata. Hence, the sparsity selector must be either 0 (threads T0, T1) or 1 (threads T2, T3); any other value results in undefined behavior.
Sparse wgmma.mma_async.sp with .e4m3 and .e5m2 floating point type
For .e4m3 and .e5m2 types, for all supported 64xNx64 shapes, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in matrix A and their positions in the four-wide chunk in matrix A are indicated by two 2-bit indices in the metadata operand.
Figure 173 Sparse WGMMA metadata example for .e4m3/.e5m2 type.
All threads contribute the sparsity metadata and the sparsity selector must be 0; any other value results in undefined behavior.
Sparse wgmma.mma_async.sp with integer type
For the integer type, for all supported 64xNx64 shapes, matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zeros and two non-zero elements. Only the two non-zero elements are stored in matrix A and two 2-bit indices in the metadata indicate the position of these two non-zero elements in the four-wide chunk.
Figure 174 Sparse WGMMA metadata example for .u8/.s8 type.
All threads contribute the sparsity metadata and the sparsity selector must be 0; any other value results in undefined behavior.
This section describes how the contents of thread registers are associated with fragments of matrix A and the sparsity metadata.
Each warp in the warpgroup provides sparsity information for 16 rows of matrix A. The following table shows the assignment of warps to rows of matrix A:
| Warp | Sparsity information for rows of matrix A |
|---|---|
| %warpid % 4 = 3 | 48-63 |
| %warpid % 4 = 2 | 32-47 |
| %warpid % 4 = 1 | 16-31 |
| %warpid % 4 = 0 | 0-15 |
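The warp-to-row assignment above follows a simple rule: the warp with %warpid % 4 = w covers rows 16*w through 16*w + 15. A minimal Python sketch of that rule, using a hypothetical helper name:

```python
def sparsity_rows_for_warp(warpid: int) -> range:
    """Rows of matrix A whose sparsity information a warp provides."""
    first_row = 16 * (warpid % 4)
    return range(first_row, first_row + 16)
```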
The following conventions are used throughout this section:
For matrix A, only the layout of a fragment is described in terms of register vector sizes and their association with the matrix data.
For the metadata operand, pictorial representations of the association between indices of the elements of matrix A and the contents of the metadata operand are included. Tk:[m..n] present in cell [x][y..z] indicates that bits m through n (with m being higher) in the metadata operand of the thread with %laneid=k contain the indices of the non-zero elements from the chunk [x][y]..[x][z] of matrix A.
A warpgroup executing sparse wgmma.mma_async.m64nNk32 will compute an MMA operation of shape .m64nNk32 where N is a valid n dimension as listed in Matrix Shape.
Elements of the matrix are distributed across the threads in a warpgroup so each thread of the warpgroup holds a fragment of the matrix.
Metadata operand is a .b32 register containing 16 2-bit vectors each storing the index of a non-zero element of a 4-wide chunk of matrix A.
Figure 176 shows the mapping of the metadata bits to the elements of matrix A for a warp. In this figure, variable i represents the value of the sparsity selector operand.
A warpgroup executing sparse wgmma.mma_async.m64nNk16 will compute an MMA operation of shape .m64nNk16 where N is a valid n dimension as listed in Matrix Shape.
Elements of the matrix are distributed across the threads in a warpgroup so each thread of the warpgroup holds a fragment of the matrix.
Metadata operand is a .b32 register containing eight 4-bit vectors each storing the index of a non-zero element of a 2-wide chunk of matrix A.
Figure 178 shows the mapping of the metadata bits to the elements of matrix A for a warp. In this figure, variable i represents the value of the sparsity selector operand.
A warpgroup executing sparse wgmma.mma_async.m64nNk64 will compute an MMA operation of shape .m64nNk64 where N is a valid n dimension as listed in Matrix Shape.
Elements of the matrix are distributed across the threads in a warpgroup so each thread of the warpgroup holds a fragment of the matrix.
Instruction wgmma.mma_async issues an MxNxK matrix multiply and accumulate operation, D = A*B+D, where the A matrix is MxK, the B matrix is KxN, and the D matrix is MxN.
The matrix A is stored in the packed format Mx(K/2) as described in Sparse matrix storage.
The operation of the form D = A*B is issued when the input predicate argument scale-d is false.
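The effect of scale-d can be seen in a small reference model. The Python sketch below is an assumption-laden illustration only: it works on the logical (already expanded) MxK matrix A rather than the packed Mx(K/2) form, uses plain nested lists, and ignores data types, precision and thread distribution. The name mma_reference is hypothetical.

```python
def mma_reference(a, b, d, scale_d=True):
    """Compute D = A*B + D, or D = A*B when scale_d is False."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # The previous accumulator value is dropped when scale_d is False.
            acc = d[i][j] if scale_d else 0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out
```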
The wgmma.fence instruction must be used to fence the register accesses of the wgmma.mma_async instruction from their prior accesses. Otherwise, the behavior is undefined.
The wgmma.commit_group and wgmma.wait_group operations must be used to wait for the completion of the asynchronous matrix multiply and accumulate operations before the results are accessed.
Register operand d represents the accumulator matrix as well as the destination matrix, distributed across the participating threads. Register operand a represents the multiplicand matrix A in registers, distributed across the participating threads. The 64-bit register operands a-desc and b-desc are the matrix descriptors which represent the multiplicand matrices A and B in shared memory respectively. The contents of a matrix descriptor must be the same across all the warps in the warpgroup. The format of the matrix descriptor is described in Matrix Descriptor Format. Matrix A is structured sparse as described in Sparse matrix storage. Operands sp-meta and sp-sel represent sparsity metadata and sparsity selector respectively. Operand sp-meta is a 32-bit integer and operand sp-sel is a 32-bit integer constant with values in the range 0..3.
The valid values of sp-meta and sp-sel for each shape are specified in Sparse matrix storage and are summarized here:
| Matrix shape | .atype | Valid values of sp-meta | Valid values of sp-sel |
|---|---|---|---|
| .m64nNk16 | .tf32 | 0b1110, 0b0100 | 0 (threads T0, T1) or 1 (threads T2, T3) |
| .m64nNk32 | .f16/.bf16 | 0b00, 0b01, 0b10, 0b11 | 0 (threads T0, T1) or 1 (threads T2, T3) |
| .m64nNk64 | .e4m3/.e5m2/.s8/.u8 | 0b00, 0b01, 0b10, 0b11 | 0 (all threads contribute) |
Matrices A and B are stored in row-major and column-major format respectively. For certain floating point variants, the input matrices A and B can be transposed by specifying the value 1 for the immediate integer arguments imm-trans-a and imm-trans-b respectively. A value of 0 can be used to avoid the transpose operation. The valid values of imm-trans-a and imm-trans-b are 0 and 1. The transpose operation is only supported for the wgmma.mma_async variants with .f16/.bf16 types on matrices accessed from shared memory using matrix descriptors.
For the floating point variants of the wgmma.mma_async operation, each element of the input matrices A and B can be negated by specifying the value -1 for operands imm-scale-a and imm-scale-b respectively. A value of 1 can be used to avoid the negate operation. The valid values of imm-scale-a and imm-scale-b are -1 and 1.
The qualifiers .dtype, .atype and .btype indicate the data type of the elements in matrices D, A and B respectively. .atype and .btype must be the same for all floating point wgmma.mma_async variants except for the FP8 floating point variants. The sizes of individual data elements of matrices A and B in alternate floating point variants of the wgmma.mma_async operation are as follows:
Matrices A and B have 8-bit data elements when .atype/.btype is .e4m3/.e5m2.
Matrices A and B have 16-bit data elements when .atype/.btype is .bf16.
Matrices A and B have 32-bit data elements when .atype/.btype is .tf32.
Precision and rounding:
Floating point operations:
Element-wise multiplication of matrix A and B is performed with at least single precision. When .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When .dtype is .f16, the accumulation is performed with at least half precision.
The accumulation order, rounding and handling of subnormal inputs are unspecified.
.bf16 and .tf32 floating point operations:
Element-wise multiplication of matrix A and B is performed with specified precision. A wgmma.mma_async operation involving type .tf32 will truncate the lower 13 bits of the 32-bit input data before multiplication is issued. Accumulation of the intermediate values is performed with at least single precision.
The accumulation order, rounding, and handling of subnormal inputs are unspecified.
Integer operations:
The integer wgmma.mma_async operation is performed with .s32 accumulators. The .satfinite qualifier indicates that on overflow, the accumulated value is limited to the range MIN_INT32..MAX_INT32 (where the bounds are defined as the minimum negative signed 32-bit integer and the maximum positive signed 32-bit integer respectively).
If .satfinite is not specified, the accumulated value is wrapped instead.
The mandatory .sync qualifier indicates that wgmma.mma_async instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.mma_async instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.mma_async instruction. In conditionally executed code, a wgmma.mma_async instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 8.2.
Support for .u8.s8 and .s8.u8 as .atype.btype introduced in PTX ISA version 8.4.
Enforce an ordering of register accesses between wgmma.mma_async and other operations.
Syntax
wgmma.fence.sync.aligned;
Description
The wgmma.fence instruction establishes an ordering between prior accesses to any warpgroup registers and subsequent accesses to the same registers by a wgmma.mma_async instruction. Only the accumulator register and the input registers containing the fragments of matrix A require this ordering.
The wgmma.fence instruction must be issued by all warps of the warpgroup at the following locations:
Before the first wgmma.mma_async operation in a warpgroup.
Between a register access by a thread in the warpgroup and any wgmma.mma_async instruction that accesses the same registers, either as accumulator or input register containing fragments of matrix A, except when these are accumulator register accesses across multiple wgmma.mma_async instructions of the same shape. In the latter case, an ordering guarantee is provided by default.
Otherwise, the behavior is undefined.
An async proxy fence must be used to establish an ordering between prior writes to shared memory matrices and subsequent reads of the same matrices in a wgmma.mma_async instruction.
The mandatory .sync qualifier indicates that wgmma.fence instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.fence instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.fence instruction. In conditionally executed code, a wgmma.fence instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise the behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 8.0.
Target ISA Notes
Requires sm_90a.
Examples
// Example 1, first use example:
wgmma.fence.sync.aligned;    // Establishes an ordering w.r.t. prior accesses to the registers s32d<0-3>
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8  {s32d0, s32d1, s32d2, s32d3},
                                                  descA, descB, scaleD;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;

// Example 2, use-case with the input value updated in between:
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8  {s32d0, s32d1, s32d2, s32d3},
                                                  descA, descB, scaleD;
...
mov.b32 s32d0, new_val;
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8  {s32d4, s32d5, s32d6, s32d7},
                                                 {s32d0, s32d1, s32d2, s32d3},
                                                  descB, scaleD;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;
Commits all prior uncommitted wgmma.mma_async operations into a wgmma-group.
Syntax
wgmma.commit_group.sync.aligned;
Description
The wgmma.commit_group instruction creates a new wgmma-group per warpgroup and batches all prior wgmma.mma_async instructions initiated by the executing warp but not committed to any wgmma-group into the new wgmma-group. If there are no uncommitted wgmma.mma_async instructions then wgmma.commit_group results in an empty wgmma-group.
An executing thread can wait for the completion of all wgmma.mma_async operations in a wgmma-group by using wgmma.wait_group.
The mandatory .sync qualifier indicates that wgmma.commit_group instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.commit_group instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.commit_group instruction. In conditionally executed code, a wgmma.commit_group instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise the behavior is undefined.
Signal the completion of a preceding warpgroup operation.
Syntax
wgmma.wait_group.sync.aligned N;
Description
The wgmma.wait_group instruction will cause the executing thread to wait until only N or fewer of the most recent wgmma-groups are pending and all the prior wgmma-groups committed by the executing threads are complete. For example, when N is 0, the executing thread waits on all the prior wgmma-groups to complete. Operand N is an integer constant.
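The bookkeeping implied by commit_group and wait_group can be sketched as a small Python model. This is a conceptual illustration only: real operations complete asynchronously and the thread blocks until they do, whereas this sketch simply marks the oldest groups as complete. The class and method names are hypothetical.

```python
class WgmmaGroupModel:
    """Toy model of wgmma-group tracking for one warpgroup."""

    def __init__(self):
        self.uncommitted = []  # operations not yet in any wgmma-group
        self.pending = []      # committed groups, oldest first
        self.completed = []    # operations known to be complete

    def mma_async(self, name):
        self.uncommitted.append(name)

    def commit_group(self):
        # Batches all uncommitted operations; the group may be empty.
        self.pending.append(self.uncommitted)
        self.uncommitted = []

    def wait_group(self, n):
        # Returns once at most n of the most recent groups remain pending.
        while len(self.pending) > n:
            self.completed.extend(self.pending.pop(0))
```

With two committed groups, wait_group(1) only guarantees completion of the older group, while wait_group(0) guarantees completion of both.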
Accessing the accumulator register or the input register containing the fragments of matrix A of a wgmma.mma_async instruction without first performing a wgmma.wait_group instruction that waits on a wgmma-group including that wgmma.mma_async instruction is undefined behavior.
The mandatory .sync qualifier indicates that wgmma.wait_group instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.wait_group instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.wait_group instruction. In conditionally executed code, a wgmma.wait_group instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise the behavior is undefined.
The 5th generation TensorCore has dedicated on-chip memory that is specialized for use by TensorCore operations. This Tensor Memory is organized as a two-dimensional matrix where the horizontal rows are called lanes and the vertical columns are called columns.
On architecture sm_100a/sm_100f, the 5th generation TensorCore’s Tensor Memory has a two-dimensional structure of 512 columns and 128 rows per CTA, with each cell being 32 bits in size.
Restrictions on threads accessing the Tensor Memory via the load and store operations are specified in Access restrictions.
The allocation and deallocation of Tensor Memory is performed in terms of columns. The unit of allocation is 32 columns and the number of columns being allocated must be a power of 2. When a column is allocated, all 128 lanes of the column are allocated.
All of the Tensor Memory that was allocated in a kernel must be explicitly deallocated before the kernel exits.
The matrix multiply and accumulate operations support a limited set of shapes for the operand matrices A, B and D. The shapes of all three matrix operands are collectively described by the tuple MxNxK where A is an MxK matrix, B is a KxN matrix, and D is an MxN matrix.
Table 39 shows matrix shapes that are supported for the specified types for the tcgen05.mma operation.
The data movement shape indicates the dimension of the data to be moved to or from the Tensor Memory. These shapes are described as a tuple lanexsize where:
lane indicates the number of rows in the Tensor Memory; and
size indicates the amount of data, in units of bits (b), across the columns in the Tensor Memory.
The following shapes are supported by various tcgen05 operations:
A tcgen05{.ld,.st}.32x32b instruction has the following data vector register.
| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers as mentioned in Table 47. | r0, r1, … |
A warp executing tcgen05{.ld,.st}.32x32b will access 32 lanes of the Tensor Memory. It loads from or stores to each of the lanes (32 * .num) bits of data as shown in Figure 183.
A tcgen05{.ld,.st}.16x64b instruction has the following data vector register.
| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers as mentioned in Table 47. | r0, r1, … |
A warp executing tcgen05{.ld,.st}.16x64b will access 16 lanes of the Tensor Memory. It loads from or stores to each of the lanes (64 * .num) bits of data as shown in Figure 184.
A tcgen05{.ld,.st}.16x128b instruction has the following data vector register.
| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers as mentioned in Table 47. | r0, r1, … |
A warp executing tcgen05{.ld,.st}.16x128b will access 16 lanes of the Tensor Memory. It loads from or stores to each of the lanes (128 * .num) bits of data as shown in Figure 185.
A tcgen05{.ld,.st}.16x256b instruction has the following data vector register.
| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers as mentioned in Table 47. | r0, r1, r2, r3, … |
A warp executing tcgen05{.ld,.st}.16x256b will access 16 lanes of the Tensor Memory. It loads from or stores to each of the lanes (256 * .num) bits of data as shown in Figure 186.
A tcgen05{.ld,.st}.16x32bx2 instruction has the following data vector register.
| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers as mentioned in Table 47. | r0, r1, … |
A warp executing tcgen05{.ld,.st}.16x32bx2 will access 16 lanes of the Tensor Memory. It loads from or stores to each of the lanes (32 * .num) bits of data as shown in Figure 187.
In this mode, the leading dimension stride is specified as a relative byte offset between the columns as explained in the below table.
The leading dimension stride can either be specified as a relative offset between the columns or as an absolute byte address of the next buffer. The leading dimension stride is defined differently for transposed and non-transposed matrices. The leading dimension stride is defined as follows for matrices whose element types are normalized to 128 bits:
| Major-ness | Definition |
|---|---|
| K-Major | No-Swizzling: the stride from the first column to the second column of the 8x2 tile in the 128-bit element type normalized matrix. Swizzled layouts: not used, assumed to be 1. |
| MN-Major | Interleave: stride from the first 8 columns to the next 8 columns. Swizzled layouts: stride from the first (swizzle-byte-size/16) rows to the next (swizzle-byte-size/16) rows. |
The tcgen05.mma instruction with K-dimension of 48B would overflow the 128B shared memory boundary if the data is packed contiguously.
In this case, the absolute address mode can be used to break up the data in the shared memory into two chunks such that both these chunks are laid out within the aligned 128-byte address boundary. The leading dimension absolute address can point to the second data chunk in the shared memory.
The stride dimension byte offset is defined differently for transposed and non-transposed matrices. The stride dimension byte offset is defined as follows for matrices whose element types are normalized to 128 bits:
| Major-ness | Definition |
|---|---|
| K-Major | The offset from the first 8 rows to the next 8 rows. |
| MN-Major | Interleave: offset from the first row to the next row. Swizzled layout: offset from the first 8 columns to the next 8 columns. |
The shared memory descriptor describes the properties of the multiplicand matrix in shared memory including its location in the shared memory of the current CTA. It is a 64-bit value contained in a register with the following layout:
Specifies the swizzling mode to be used:
0. No swizzling
1. 128-Byte with 32B atomic swizzling
2. 128-Byte swizzling
4. 64-Byte swizzling
6. 32-Byte swizzling
Note: Values 3, 5 and 7 are invalid.
where matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4
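The encoding function above keeps the low 18 bits of its argument and discards the 4 least significant bits. A minimal Python sketch of that formula, with a hypothetical function name:

```python
def matrix_descriptor_encode(x: int) -> int:
    """Apply the encoding (x & 0x3FFFF) >> 4."""
    return (x & 0x3FFFF) >> 4
```

For example, the value 0x1230 encodes to 0x123, and bits at position 18 and above do not affect the result.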
The value of base offset is 0 when the repeating pattern of the specified swizzling mode starts as shown in Table 41.
Table 41 Starting address of repeating pattern for various swizzling modes
| Swizzling mode | Starting address of the repeating pattern |
|---|---|
| 128-Byte swizzle | 1024-Byte boundary |
| 64-Byte swizzle | 512-Byte boundary |
| 32-Byte swizzle | 256-Byte boundary |
Otherwise, the base offset must be a non-zero value, computed using the following formula:
base offset = (pattern start addr >> 0x7) & 0x7
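The base offset formula extracts bits 9..7 of the pattern start address. A minimal Python sketch with a hypothetical function name:

```python
def base_offset(pattern_start_addr: int) -> int:
    """Compute (pattern start addr >> 0x7) & 0x7."""
    return (pattern_start_addr >> 0x7) & 0x7
```

A start address on a 1024-byte boundary, such as 0x400, gives 0, while 0x480 (128 bytes past that boundary) gives 1.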
The instruction descriptor describes the shapes, types and other details of all the matrices and the matrix-multiplication-and-accumulation operation. It is a 32-bit value in registers and the exact layout is dependent on the MMA-Kind:
Table 42 Instruction descriptor format for .kind::tf32, .kind::f16, .kind::f8f6f4 and .kind::i8
The zero-column mask descriptor is used to generate a mask that specifies which columns of the B matrix will have zero value for the MMA operation regardless of the values present in the shared memory. The total size of the generated mask is N bits.
A 0-bit in the mask specifies that values of the corresponding column in matrix B should be used for the MMA operation. A 1-bit in the mask specifies that 0s must be used for the entire column for the MMA operation.
The zero-column mask descriptor is a 64-bit value in registers with the following layout:
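The effect of the generated N-bit mask on matrix B can be sketched in Python. This models only the mask semantics stated above, not the descriptor layout or how the mask is generated from it; it assumes that bit j of the mask corresponds to column j of B, which is an assumption made for illustration. The function name is hypothetical.

```python
def apply_zero_column_mask(b, mask):
    """Return a copy of B in which columns with a 1-bit in mask read as zero."""
    n = len(b[0])
    # Assumed mapping: mask bit j controls column j of B.
    return [[0 if (mask >> j) & 1 else row[j] for j in range(n)] for row in b]
```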
Each of the tcgen05 operations has different requirements for the number of threads/warps that need to issue them.
The following table lists the execution granularity requirements of each of the tcgen05 operations:
Table 46 Execution granularity requirements for tcgen05 operations
| tcgen05 operation | .cta_group | Issue Granularity |
|---|---|---|
| .mma, .cp, .shift, .commit | ::1 | An issue from a single thread in the current CTA would initiate the base operation. |
| .mma, .cp, .shift, .commit | ::2 | Issue from a single thread from the CTA-Pair would initiate the base operation. When the current CTA issues the operation, the peer CTA should be active and should not have exited. |
| .alloc, .dealloc, .relinquish_alloc_permit | ::1 | Issue from a single warp in the current CTA would initiate the allocation management instruction. |
| .alloc, .dealloc, .relinquish_alloc_permit | ::2 | Issue from two warps, one in each of the current CTA and its Peer CTA, collectively needs to perform the operation. When the current CTA issues the operation, the peer CTA should be active and should not have exited. |
| .ld, .st, .wait::{ld,st} | N/A | Issue from a warp in the current CTA can access only 1/4 of the Tensor Memory of the current CTA. So, a warpgroup is needed to access the entire Tensor Memory of the current CTA. |
| .fence::* | N/A | A thread needs to fence all its accesses to the tensor memory that it wants to order with other accesses to the tensor memory from other threads. |
Any 2 CTAs within the cluster whose %cluster_ctarank differs by the last bit only are said to form a CTA pair.
Within a CTA pair, the CTA whose last bit in the %cluster_ctarank is:
0 is termed the even numbered CTA within the CTA pair.
1 is termed the odd numbered CTA within the CTA pair.
Most of the tcgen05 operations can either execute at a single CTA level granularity or at a CTA pair level granularity. When a tcgen05 operation is performed at CTA pair granularity, the Tensor Memory of both the CTAs within the CTA pair is accessed. The set of threads that need to issue the tcgen05 operation is listed in the Issue Granularity.
The peer CTA of the odd CTA within the CTA pair is the even CTA in the same pair. Similarly, the peer CTA of the even CTA within the CTA pair is the odd CTA in the same pair.
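Since two CTAs form a pair exactly when their %cluster_ctarank values differ only in the last bit, the peer rank can be obtained by flipping that bit. A minimal Python sketch with hypothetical helper names:

```python
def peer_cta_rank(rank: int) -> int:
    """Rank of the peer CTA: flip the last bit of %cluster_ctarank."""
    return rank ^ 1


def is_even_cta(rank: int) -> bool:
    """True for the even numbered CTA of a pair (last bit is 0)."""
    return rank & 1 == 0


def same_cta_pair(rank_a: int, rank_b: int) -> bool:
    """True if the two ranks differ in the last bit only."""
    return (rank_a ^ rank_b) == 1
```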
The asynchronous tcgen05 operations may execute and complete in a different order than they were issued. However, some specific pairs of the asynchronous tcgen05 instructions form tcgen05 pipelines, wherein the two asynchronous operations are guaranteed to execute in the same order as the instructions that issued them. The specific pairings are as follows:
tcgen05.mma.cta_group::N -> tcgen05.mma.cta_group::N (same N and accumulator and shape)
Instructions tcgen05.commit and tcgen05.wait are implicitly pipelined with respect to previously issued tcgen05.{mma,cp,shift} and tcgen05.{ld,st} instructions respectively that they track from the same thread.
The tcgen05 instructions support specialized inter-thread synchronization mechanisms which are optimized for the tcgen05 family of instructions. The standard memory consistency model synchronization mechanisms also apply to the tcgen05 family of instructions.
The tcgen05.fence::before_thread_sync and tcgen05.fence::after_thread_sync instructions compose with execution ordering instructions, like morally strong ld/st/atom instructions, mbarrier instructions, barrier instructions and so on, to establish an ordering between the tcgen05 operations across threads. The asynchronous tcgen05 instructions that are ordered across threads also form a tcgen05 pipeline.
An asynchronous tcgen05 operation prior to a tcgen05.fence::before_thread_sync is ordered before all subsequent tcgen05 and execution ordering operations.
An asynchronous tcgen05 operation subsequent to a tcgen05.fence::after_thread_sync is ordered after all the prior tcgen05 and execution ordering operations.
In this pattern, explicit waiting mechanisms are used to wait for the completion of the asynchronous tcgen05 operations.
Example 1:
tcgen05.st
tcgen05.wait::st
tcgen05.ld
tcgen05.wait::st is used to wait for the completion of the prior asynchronous instruction tcgen05.st.
Example 2:
tcgen05.mma [d], ...
tcgen05.commit.mbarrier::arrive::one
mbarrier.try_wait.relaxed.cluster (loop until successful)
tcgen05.fence::after_thread_sync
tcgen05.ld [d], ...
For the completion of the asynchronous tcgen05.mma, tcgen05.commit is used.
As tcgen05.ld is an asynchronous operation, the instruction tcgen05.fence::after_thread_sync is needed.
No explicit tcgen05.fence::before_thread_sync is needed as this is implicitly performed by tcgen05.commit. The combination of tcgen05.mma and tcgen05.commit forms a conceptual asynchronous pipeline and establishes execution ordering.
In this pattern, the producer threads that issue the asynchronous tcgen05 instructions must explicitly wait for the instructions’ completion before synchronizing with the consumer threads.
For tcgen05.ld, an intra-thread ordering through true register dependency will be respected regardless of the presence or absence of other forms of synchronization. This form of register dependency does not imply any other form of ordering. For example, a register dependency does not imply that a dependee instruction’s memory accesses will be performed before a dependent instruction’s memory accesses. To enforce such memory orderings and avoid anti-dependency hazards around tcgen05.ld, tcgen05.wait::ld must be used.
The shared memory accesses by tcgen05.mma and tcgen05.cp operations are performed in the asynchronous proxy (async proxy).
Accessing the same memory location across multiple proxies needs a cross-proxy fence. For the async proxy, fence.proxy.async should be used to synchronize memory between the generic proxy and the async proxy.
tcgen05.alloc is a potentially blocking instruction which dynamically allocates the specified number of columns in the Tensor Memory and writes the address of the allocated Tensor Memory into shared memory at the location specified by address operand dst. The tcgen05.alloc blocks if the requested amount of Tensor Memory is not available and unblocks as soon as the requested amount of Tensor Memory becomes available for allocation.
Instruction tcgen05.dealloc deallocates the Tensor Memory specified by the Tensor Memory address taddr. The operand taddr must point to a previous Tensor Memory allocation.
All of the Tensor Memory that was allocated using the tcgen05.alloc instruction in a kernel must be explicitly deallocated using tcgen05.dealloc before the kernel exits.
The unsigned 32-bit operand nCols specifies the number of columns to be allocated or de-allocated. The unit of allocation and de-allocation is 32 columns and all of the lanes per column. The number of columns must be a power of 2, and the operand nCols must be within the range [32, 512]. The number of columns allocated should not increase between any two allocations in the execution order within the CTA.
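The constraints on nCols can be checked on the host before a kernel is generated. The following is a minimal Python sketch, not part of PTX; the function names are hypothetical and only encode the rules stated above (power of 2, range [32, 512], no increase between successive allocations).

```python
def is_valid_ncols(n_cols: int) -> bool:
    """Return True if n_cols is a power of 2 within [32, 512]."""
    is_power_of_two = n_cols > 0 and (n_cols & (n_cols - 1)) == 0
    return is_power_of_two and 32 <= n_cols <= 512


def is_valid_allocation_sequence(sizes: list[int]) -> bool:
    """Check every size, and that sizes never increase in execution order."""
    if not all(is_valid_ncols(n) for n in sizes):
        return False
    # Compare each allocation with the one issued immediately after it.
    return all(later <= earlier for earlier, later in zip(sizes, sizes[1:]))
```

For instance, the sequence 256, 128, 128 passes both checks, while 64 followed by 128 is rejected because the second allocation is larger than the first.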
Instruction tcgen05.relinquish_alloc_permit specifies that the CTA of the executing thread is relinquishing the right to allocate Tensor Memory. So, it is illegal for a CTA to perform tcgen05.alloc after any of its constituent threads execute tcgen05.relinquish_alloc_permit.
If no state space is specified then Generic Addressing is used. If the address specified by dst does not fall within the address window of .shared::cta state space then the behavior is undefined.
Qualifier .cta_group specifies the number of CTAs involved in the allocation and de-allocation operation. When .cta_group::1 is specified, one warp from the CTA must perform the allocation and de-allocation. When .cta_group::2 is specified, one warp from each of the peer CTAs must collectively perform the allocation and de-allocation. Refer to the Issue Granularity section. When .cta_group::2 is specified, the issuing warp must make sure that the peer CTA is launched and is still active.
All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.
The mandatory .sync qualifier indicates that the instruction causes the executing thread to wait until all threads in the warp execute the same instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same instruction. In conditionally executed code, the instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
The behavior of the instruction is undefined if all the threads in the warp do not use the same values of nCols, or if any thread in the warp has exited.
The store operation in tcgen05.alloc is treated as a weak memory operation in the Memory Consistency Model.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Examples
// Example 1:
tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [sMemAddr1], 32;
ld.shared.b32 taddr, [sMemAddr1];
// use taddr ...
// more allocations and its usages ...
tcgen05.dealloc.cta_group::1.sync.aligned.b32 taddr, 32;
// more deallocations ...
tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;

// Example 2:
// Following instructions are performed by current warp and the warp in the peer-CTA:
tcgen05.alloc.cta_group::2.sync.aligned.shared::cta.b32 [sMemAddr2], 32;
ld.shared.b32 taddr, [sMemAddr2];
// use taddr ...
// more allocations and its usages ...
tcgen05.dealloc.cta_group::2.sync.aligned.b32 taddr, 32;
// more deallocations ...
tcgen05.relinquish_alloc_permit.cta_group::2.sync.aligned;
The threads of the CTA can perform the loads and stores to the Tensor Memory of the CTA and move data between registers and Tensor Memory. The loads and stores of data can be performed in certain shapes as specified in the Matrix and Data Movement Shape section.
Not all threads of the CTA can access the entire Tensor Memory via the tcgen05.ld and tcgen05.st operations.
The Tensor Memory of a CTA is divided into 4 equal chunks such that each warp of a warpgroup in the CTA can access a chunk of the Tensor Memory. All the columns of the Tensor Memory can be accessed by all the four warps of a warpgroup. A lane of the Tensor Memory can be accessed by a single warp in the warpgroup. The following table describes the access restriction.
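A minimal Python sketch of this restriction follows. It assumes, beyond what the text above states, that the Tensor Memory has 128 lanes and that the warp with index w within its warpgroup owns the chunk numbered w % 4; the function name and constant are hypothetical.

```python
LANES_PER_CHUNK = 32  # assumption: 128 lanes divided into 4 equal chunks


def accessible_lanes(warp_id: int) -> range:
    """Lanes reachable by a warp via tcgen05.ld / tcgen05.st (illustrative model)."""
    chunk = warp_id % 4  # assumption: chunk chosen by warp index modulo 4
    return range(chunk * LANES_PER_CHUNK, (chunk + 1) * LANES_PER_CHUNK)
```

Under these assumptions, warp 1 of a warpgroup reaches lanes 32 through 63, and no lane is reachable by more than one warp of the same warpgroup.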
Instruction tcgen05.ld asynchronously loads data from the Tensor Memory at the location specified by the 32-bit address operand taddr into the destination register r, collectively across all threads of the warp.
All the threads in the warp must specify the same value of taddr, which must be the base address of the collective load operation. Otherwise, the behavior is undefined.
The .shape qualifier and the .num qualifier together determine the total dimension of the data which is loaded from the Tensor Memory. The .shape qualifier indicates the base dimension of data to be accessed as described in the Data Movement Shape. The .num qualifier indicates the repeat factor on the base dimension resulting in the total dimension of the data that is accessed.
The shape .16x32bx2 performs two accesses into Tensor Memory of the shape .16x32b. The base address of the first access is specified by taddr and the base address of the second access is specified by taddr+immHalfSplitoff, where immHalfSplitoff is an immediate argument.
The destination operand r is a brace-enclosed vector expression consisting of one or more 32-bit registers as per the value of .shape and .num. The size of the vector for various combinations of .num and .shape is shown in Table 47.
The qualifier .red specifies that the reduction operation specified by .redOp is performed on the data that is loaded across columns in each lane. The result of the reduction operation is written into the corresponding thread’s 32-bit destination register operand redVal. When the .red qualifier is specified, the .num modifier must be at least .x2.
The optional qualifier .pack::16b can be used to pack two 16-bit elements from adjacent columns into a single 32-bit element during the load as shown in the section Packing and Unpacking.
The mandatory .sync qualifier indicates that tcgen05.ld causes the executing thread to wait until all threads in the warp execute the same tcgen05.ld instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same tcgen05.ld instruction. In conditionally executed code, a tcgen05.ld instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
The behavior of tcgen05.ld is undefined if all threads do not use the same values of taddr, or if any thread in the warp has exited.
Instruction tcgen05.st asynchronously stores data from the source register r into the Tensor Memory at the location specified by the 32-bit address operand taddr, collectively across all threads of the warp.
All the threads in the warp must specify the same value of taddr, which must be the base address of the collective store operation. Otherwise, the behavior is undefined.
The .shape qualifier and the .num qualifier together determine the total dimension of the data which is stored to the Tensor Memory. The .shape qualifier indicates the base dimension of data to be accessed as described in the Data Movement Shape. The .num qualifier indicates the repeat factor on the base dimension resulting in the total dimension of the data that is accessed.
The shape .16x32bx2 performs two accesses into Tensor Memory of the shape .16x32b. The base address of the first access is specified by taddr and the base address of the second access is specified by taddr+immHalfSplitoff, where immHalfSplitoff is an immediate argument.
The source operand r is a brace-enclosed vector expression consisting of one or more 32-bit registers as per the value of .shape and .num. The size of the vector for various combinations of .num and .shape is shown in Table 48.
The optional qualifier .unpack::16b can be used to unpack a 32-bit element in the register into two 16-bit elements and store them in adjacent columns as shown in the section Packing and Unpacking.
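The effect of .pack::16b and .unpack::16b on a single register can be modelled as ordinary bit packing. The Python sketch below is illustrative only: the helper names are hypothetical, and placing the element from the first (lower-numbered) column in the low 16 bits is an assumption, since the Packing and Unpacking section is the authority on the actual layout.

```python
def pack_16b(first: int, second: int) -> int:
    """Pack two 16-bit values from adjacent columns into one 32-bit word.

    Assumption: the element from the first column lands in the low half.
    """
    return (first & 0xFFFF) | ((second & 0xFFFF) << 16)


def unpack_16b(word: int) -> tuple[int, int]:
    """Split a 32-bit word back into its (low, high) 16-bit halves."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF
```

Whatever the real ordering is, the two operations are inverses of each other, so a value written with .unpack::16b and read back with .pack::16b is expected to round-trip.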
The mandatory .sync qualifier indicates that tcgen05.st causes the executing thread to wait until all threads in the warp execute the same tcgen05.st instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same tcgen05.st instruction. In conditionally executed code, a tcgen05.st instruction should only be used if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
The behavior of tcgen05.st is undefined if all threads do not use the same values of taddr, or if any thread in the warp has exited.
Instruction tcgen05.wait::st causes the executing thread to block until all prior tcgen05.st operations issued by the executing thread have completed.
Instruction tcgen05.wait::ld causes the executing thread to block until all prior tcgen05.ld operations issued by the executing thread have completed.
The mandatory .sync qualifier indicates that tcgen05.wait_operation causes the executing thread to wait until all threads in the warp execute the same tcgen05.wait_operation instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warp must execute the same tcgen05.wait_operation instruction.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Examples
// Example 1:
tcgen05.ld.sync.aligned.32x32b.x2.b32 {r0, r1}, [taddr0];
// Prevents subsequent tcgen05.mma from racing ahead of the tcgen05.ld
tcgen05.wait::ld.sync.aligned;
tcgen05.mma.cta_group::1.kind::f16 [taddr0], a-desc, b-desc, idesc, p;

// Example 2:
tcgen05.st.sync.aligned.32x32b.x2.b32 [taddr0], {r0, r1};
// Prevents the write to taddr0 in tcgen05.mma from racing ahead of the tcgen05.st
tcgen05.wait::st.sync.aligned;
tcgen05.mma.cta_group::1.kind::f16 [taddr0], a-desc, b-desc, idesc, p;
Instruction tcgen05.cp initiates an asynchronous copy operation from shared memory to the location specified by the address operand taddr in the Tensor Memory.
The 64-bit register operand s-desc is the matrix descriptor which represents the source matrix in the shared memory that needs to be copied. The format of the matrix descriptor is described in Matrix Descriptors.
The .shape qualifier indicates the dimension of data to be copied as described in the Data Movement Shape.
Qualifier .cta_group specifies the number of CTAs whose Tensor Memory is accessed when a single thread of a single CTA executes the tcgen05.cp instruction. When .cta_group::1 is specified, the data is copied into the Tensor Memory of the current CTA. When .cta_group::2 is specified, the data is copied into the Tensor Memory of both the current and the peer CTAs.
All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.
When the qualifiers .dst_fmt and .src_fmt are specified, the data is decompressed from the source format .src_fmt in the shared memory to the destination format .dst_fmt in Tensor Memory by the copy operation. The details of the source and destination formats are specified in the section Optional Decompression.
Some of the .shape qualifiers require certain .multicast qualifiers.
.64x128b requires .warpx2::02_13 or .warpx2::01_23
.32x128b requires .warpx4
When the .multicast qualifier is specified as either .warpx2::02_13 or .warpx2::01_23 then the data being copied is multicast into warp pairs and each warp in the warp pair receives half of the data. Warp pairs are formed as follows:
.warpx2::02_13 : warps 0 and 2 form a pair; warps 1 and 3 form a pair.
.warpx2::01_23 : warps 0 and 1 form a pair; warps 2 and 3 form a pair.
When the .multicast modifier is specified as .warpx4 then the data being copied is multicast into all 4 warps.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
Instruction tcgen05.shift is an asynchronous instruction which initiates the shifting of 32-byte elements downwards across all the rows, except the last, by one row. The address operand taddr specifies the base address of the matrix in the Tensor Memory whose rows must be down shifted.
The lane of the address operand taddr must be aligned to 32.
Qualifier .cta_group specifies the number of CTAs whose Tensor Memory is touched when a single thread of a single CTA executes the tcgen05.shift instruction. When .cta_group::1 is specified, the shift operation is performed in the Tensor Memory of the current CTA. When .cta_group::2 is specified, the shift operation is performed in the Tensor Memory of both the current and the peer CTAs.
All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Supported on following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
The 5th generation of TensorCore operations of shape MxNxK perform matrix multiplication and accumulation of the form:
D = A*B + D
where:
the A matrix has shape MxK, in either Tensor Memory or Shared Memory
the B matrix has shape KxN, in Shared Memory of the current CTA and optionally in peer CTA
the D matrix is of the shape MxN, in Tensor Memory
Optionally an input predicate can be used to disable the input from the accumulator matrix, in which case the following operation is performed:
D = A*B
The matrix multiplication and accumulation operations are categorized into various kinds based on input types and the throughput of the multiplication operation. The following shows the different kinds of MMA operations that are supported:
f16 : supports f16 and bf16 input types.
tf32 : supports tf32 input types.
f8f6f4 : supports all input combinations of f8, f6 and f4 types.
i8 : supports signed and unsigned 8-bit integer input types.
mxf4nvf4 : supports mxf4 type and a custom NVIDIA floating-point type for inputs where the type of the vector elements is 4 bits and requires a common scaling factor to form the complete floating-point type, similar to other mx-types.
Optionally, the 5th generation of TensorCore MMAs support dense and sparse matrix A. Sparse Matrices describes the details of the sparse matrices.
Some of the MMA-kinds require scaling of input matrices from memory to form the matrix A and matrix B before performing the MMA operation. Block Scaling describes the details of the scaling of matrices.
The following table shows the various matrices involved in the MMA operations and the memory in which they can reside:
| Matrix Type | Memory |
|---|---|
| A | Tensor Memory OR Shared Memory |
| B | Shared Memory |
| D | Tensor Memory |
| Sparse Meta Data | Tensor Memory |
| A-Scale / B-Scale | Tensor Memory |
A sequence of MMA instructions may reuse the same A matrix with a sequence of B matrices or may reuse the same B matrix with a sequence of A matrices. In these patterns the TensorCore may be able to load the unchanged matrix once and reuse it through the sequence without multiple reloads. The A or B matrices are loaded into a TensorCore collector buffer (i.e., special cache).
An MMA instruction has an optional collector qualifier to specify when an A or B matrix is new to the sequence and should be loaded, unchanged within the sequence and should be reused, or the last use in the sequence and should be discarded. The collector qualifier is used to give the TensorCore permission to reuse a previously loaded A or B matrix; however reuse is opportunistic in that the TensorCore may reload a matrix even when it has permission to reuse that matrix. Thus, the source memory of an A or B matrix must not be modified while the MMA instruction using those matrices has not completed - regardless of collector qualifier permissions.
The 5th generation of TensorCore MMAs can be used for general matrix multiplication OR for convolution operations. In case of convolutions, the activations can be stored in either matrix A or matrix B while the weights will be stored in the other matrix.
The sub-word elements of matrix D are expected not to be packed within a 32-bit Tensor Memory word. For example, if the type of elements of the matrix D is 16 bits then a Tensor Memory word would contain a single 16-bit element in its lower 16 bits.
The 6-bit and 4-bit floating point types have different packing format requirements for different MMA kinds in both Tensor Memory and Shared Memory. The requirements are as follows.
The individual 4-bit and the 6-bit floating point type elements must be packed in an 8-bit container in Tensor Memory as shown below. The 8-bit containers must be contiguously packed in a 32-bit Tensor Memory word. For example, if the type of elements of the matrix A is 6 bits then 4 consecutive A elements should be packed in one 32-bit Tensor Memory word.
The layouts which utilize only half the datapath lanes, i.e., Layout F and Layout C, must use the same Tensor Memory lane alignment across matrices A, D and the sparsity metadata matrix.
The following shows the warps that can access the Tensor Memory regions via tcgen05.ld / tcgen05.st along with the addresses for various Tensor Memory Layouts.
If the bit TransposeAMatrix / TransposeBMatrix in the Instruction descriptor is 0, then K-major is used for matrix A / B respectively. If the bit TransposeAMatrix in the Instruction descriptor is 1, then M-major is used for matrix A. If the bit TransposeBMatrix in the Instruction descriptor is 1, then N-major is used for matrix B.
In a column-major default BLAS library such as cuBLAS, the matrices A and B with and without transpose can be classified as either K-Major or M-or-N-Major as shown in the following table:
| | Non-Transposed | Transposed |
|---|---|---|
| A | K-major | M-major |
| B | K-major | N-major |
To avoid confusion with A, B, row-major, col-major, transpose, and non-transpose, we will use MN-Major and K-Major throughout this section.
The matrices in the shared memory are made up of one or more “swizzle layout atoms”. The exact layout of these swizzle atoms depends on the swizzling mode, swizzle-atomicity, and the leading dimension. The layouts of the swizzle atoms are shown in Table 53.
The above shapes are for elements of size 128 bits. For smaller element sizes, the same shapes would get multiplied along the leading dimension by a factor of 128/sizeof_bits(Element). For example, a 128B MN-major swizzle atom would have a shape of (8*(128/32))x8 = 32x8 for tf32 tensor core inputs.
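The scaling rule can be written as a one-line computation. The Python sketch below is illustrative: the function name is hypothetical, and the 8x8 base shape used in the usage note is taken from the tf32 example above rather than from Table 53 itself.

```python
def scaled_atom_shape(base_shape: tuple[int, int], element_bits: int) -> tuple[int, int]:
    """Scale the leading (first) dimension of a swizzle atom given for 128-bit elements."""
    if element_bits <= 0 or 128 % element_bits != 0:
        raise ValueError("element size must evenly divide 128 bits")
    factor = 128 // element_bits  # 128 / sizeof_bits(Element)
    leading, other = base_shape
    return leading * factor, other
```

With a base shape of (8, 8) and 32-bit tf32 elements the function returns (32, 8), matching the worked example; for 128-bit elements the shape is unchanged.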
The tcgen05.mma instructions with the following .kind qualifier:
.kind::mxf8f6f4
.kind::mxf4
.kind::mxf4nvf4
perform matrix multiplication with block scaling. This operation has the following form:
(A*scale_A)*(B*scale_B)+D
where scale_A and scale_B are matrices residing in Tensor Memory.
For a scale_A matrix of shape M x SFA_N, each row of matrix A is divided into SFA_N number of chunks and each chunk of a row is multiplied with the corresponding element of scale_A in the same row.
Similarly, for a scale_B matrix of shape SFB_M x N, each column of matrix B is divided into SFB_M number of chunks and each chunk of a column is multiplied with the corresponding element of scale_B in the same column.
Scale factors for A and B matrices need to be duplicated to all 32 lane partitions of tensor memory.
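The chunk-wise scaling described above can be expressed as a small reference computation. The following Python sketch is a numerical model only, under stated assumptions: matrices are plain nested lists of floats (the real scale factors and inputs use narrow encoded types), K is evenly divisible by SFA_N and SFB_M, and the function name is hypothetical.

```python
def block_scaled_mma(a, b, d, scale_a, scale_b):
    """Reference for (A*scale_A) * (B*scale_B) + D on nested lists.

    a: M x K, b: K x N, d: M x N, scale_a: M x SFA_N, scale_b: SFB_M x N.
    """
    m, k, n = len(a), len(b), len(b[0])
    a_chunk = k // len(scale_a[0])  # elements of a row of A per scale factor
    b_chunk = k // len(scale_b)     # elements of a column of B per scale factor
    out = [row[:] for row in d]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for kk in range(k):
                a_val = a[i][kk] * scale_a[i][kk // a_chunk]
                b_val = b[kk][j] * scale_b[kk // b_chunk][j]
                acc += a_val * b_val
            out[i][j] += acc
    return out
```

For a 1x4 A row [1, 2, 3, 4] with scale_A [2, 0.5], a 4x1 B column of ones with scale_B [1, 2], and D = 10, the first two products use scales 2 and 1, the last two use 0.5 and 2, giving 2 + 4 + 3 + 4 + 10 = 23.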
Figure 230 shows an example of tcgen05.mma with block scaling of scale_vec::2X.
Figure 230: tcgen05.mma with block scaling of scale_vec::2X
There is one scale factor per row of the A matrix with block size as 32 and the scale factor must be provided in 1-byte aligned sub-column of the Tensor Memory. SFA_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 231 shows which sub-columns get selected for different values of SFA_ID.
Figure 231: Layout of scale factor A matrix with scale_vec::1X/block32 with K=32/K=64
For example, if SFA_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, SFA_ID values of 1, 2 and 3 would select the blue, yellow, and red columns, respectively.
There are two scale factors per row of the A matrix with block size as 32 and the scale factor must be provided in 2-byte aligned sub-column of the Tensor Memory. SFA_ID specifies the half word offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 232 shows which sub-columns get selected for different values of SFA_ID.
Figure 232: Layout of scale factor A matrix with scale_vec::2X/block32 with K=64/K=128
For example, if SFA_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, if SFA_ID is 2, then all of the blue columns are selected to form the scale factor matrix.
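The sub-column selection can be pictured as extracting an aligned byte group from each 32-bit Tensor Memory word. The Python sketch below is a hypothetical model, not a description of the hardware: it assumes that offset 0 refers to the least significant byte of the word, and the function name and parameters are illustrative.

```python
def select_scale_factor_bytes(word: int, sf_id: int, width_bytes: int) -> int:
    """Extract width_bytes bytes starting at byte offset sf_id of a 32-bit word.

    Assumption: byte offset 0 is the least significant byte of the word.
    """
    if width_bytes not in (1, 2, 4):
        raise ValueError("sub-column width must be 1, 2 or 4 bytes")
    if not 0 <= sf_id < 4 or sf_id % width_bytes != 0:
        raise ValueError("offset must be aligned to the sub-column width")
    mask = (1 << (8 * width_bytes)) - 1
    return (word >> (8 * sf_id)) & mask
```

The alignment check mirrors the text: a 1-byte sub-column accepts offsets 0 to 3, a 2-byte sub-column accepts 0 and 2, and a 4-byte sub-column accepts only 0.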
There are four scale factors per row of the A matrix with block size as 16 and the scale factor must be provided in 4-byte aligned sub-column of the Tensor Memory. The SFA_ID value must be 0 and this specifies that all of the columns (in green) will be used for the scale factor matrix. Figure 233 shows which sub-columns get selected for different values of SFA_ID.
Figure 233: Layout of scale factor A matrix with scale_vec::4X/block16 with K=64/K=128
There are three scale factors per row of the A matrix with block size as 32 and the scale factor must be provided in 4-byte aligned sub-column of the Tensor Memory. SFA_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 234, Figure 235, Figure 236 and Figure 237 show which sub-columns get selected for different values of SFA_ID.
Figure 234: Layout of scale factor A matrix with block32 with K=96 with SFA_ID=00
Figure 235: Layout of scale factor A matrix with block32 with K=96 with SFA_ID=01
Figure 236: Layout of scale factor A matrix with block32 with K=96 with SFA_ID=10
Figure 237: Layout of scale factor A matrix with block32 with K=96 with SFA_ID=11
For example, if SFA_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, SFA_ID values of 1, 2 and 3 would select the blue, yellow, and red columns, respectively.
There are six scale factors per row of the A matrix with block size as 16 and the scale factor must be provided in 4-byte aligned sub-column of the Tensor Memory. SFA_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 238 and Figure 239 show which sub-columns get selected for different values of SFA_ID.
Figure 238: Layout of scale factor A matrix with block16 with K=96 with SFA_ID=00
Figure 239: Layout of scale factor A matrix with block16 with K=96 with SFA_ID=10
For example, if SFA_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, if SFA_ID is 2, then all of the blue columns are selected to form the scale factor matrix.
There is one scale factor per row of the B matrix with block size as 32 and the scale factor must be provided in 1-byte aligned sub-column of the Tensor Memory. SFB_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 240 shows which sub-columns get selected for different values of SFB_ID.
Figure 240: Layout of scale factor B matrix with scale_vec::1X/block32 with K=32/K=64
For example, if SFB_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, SFB_ID values of 1, 2 and 3 would select the blue, yellow, and red columns, respectively.
There are two scale factors per row of the B matrix with block size as 32 and the scale factor must be provided in 2-byte aligned sub-column of the Tensor Memory. SFB_ID specifies the half word offset in the Tensor Memory word that must be used for the scale factor matrix. Figure 241 shows which sub-columns get selected for different values of SFB_ID.
Figure 241: Layout of scale factor B matrix with scale_vec::2X/block32 with K=64/K=128
For example, if SFB_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, if SFB_ID is 2, then all of the blue columns are selected to form the scale factor matrix.
There are four scale factors per row of the B matrix with block size as 16 and the scale factor must be provided in 4-byte aligned sub-column of the Tensor Memory. The SFB_ID value must be 0 and this specifies that all of the columns (in green) will be used for the scale factor matrix. Figure 242 shows which sub-columns get selected for different values of SFB_ID.
Figure 242: Layout of scale factor B matrix with scale_vec::4X/block16 with K=64/K=128
There are three scale factors per row of the B matrix with block size as 32 and the scale factor must be provided in 4-byte aligned sub-column of the Tensor Memory. SFB_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix.
Figure 247: Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=00
Figure 248: Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=01
Figure 249: Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=10
Figure 250: Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=10
Figure 251: Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=11
Figure 252: Layout of scale factor B matrix with block32 with K=96 and N>128 with SFB_ID=11
For example, if SFB_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, SFB_ID values of 1, 2 and 3 would select the blue, yellow, and red columns, respectively.
There are six scale factors per row of the B matrix with block size as 16 and the scale factor must be provided in 4-byte aligned sub-column of the Tensor Memory. SFB_ID specifies the byte offset in the Tensor Memory word that must be used for the scale factor matrix.
For N<=128, Figure 253 and Figure 254 show which sub-columns get selected for different values of SFB_ID.
Figure 253: Layout of scale factor B matrix with block16 with K=96 and N<=128 with SFB_ID=00
Figure 254: Layout of scale factor B matrix with block16 with K=96 and N<=128 with SFB_ID=10
Figure 255: Layout of scale factor B matrix with block16 with K=96 and N>128 with SFB_ID=00
Figure 256: Layout of scale factor B matrix with block16 with K=96 and N>128 with SFB_ID=00
Figure 257: Layout of scale factor B matrix with block16 with K=96 and N>128 with SFB_ID=10
Figure 258: Layout of scale factor B matrix with block16 with K=96 and N>128 with SFB_ID=10
For example, if SFB_ID is 0, then all the green columns are selected to form the scale factor matrix. Similarly, if SFB_ID is 2, then all of the blue columns are selected to form the scale factor matrix.
The instruction tcgen05.mma.sp can be used when the matrix A is a structured sparse matrix with 50% zeros in each row distributed as per its sparse granularity.
In an MxNxK sparse tcgen05.mma.sp operation, the matrix A of shape MxK is stored in a packed form as Mx(K/2) in memory. For each K-wide row of matrix A, 50% of elements are zeros and the remaining K/2 non-zero elements are stored in memory. The metadata specifies the mapping of the K/2 non-zero elements to the K elements before performing the MMA operation.
Granularity of sparse matrix A is defined as the ratio of the number of non-zero elements in a sub-chunk of the matrix row to the total number of elements in that sub-chunk where the size of the sub-chunk is shape-specific. The following table lists the granularity of different tcgen05.mma.sp variants:
For .kind::tf32, matrix A is structured sparse at a granularity of 1:2. In other words, each chunk of two adjacent elements in a row of matrix A has one zero and one non-zero element. Only the non-zero element is stored in memory and the 4-bit index in the metadata indicates the position of the non-zero element in the two-wide chunk. The only meaningful values of the index are:
0b1110
0b0100
Rest of the values result in undefined behavior.
Figure 259: Sparse tcgen05.mma metadata example for tf32 kind
For the kinds illustrated in Figure 260 (f16, f8f6f4 and mxf8f6f4), matrix A is structured sparse at a granularity of 2:4. In other words, each chunk of four adjacent elements in a row of matrix A has two zero and two non-zero elements. Only the non-zero elements are stored in memory and the two 2-bit indices in the metadata indicate the position of the two non-zero elements in the four-wide chunk. The only meaningful values of the index are:
0b0100
0b1000
0b1100
0b1001
0b1101
0b0110
0b1110
Figure 260: Sparse tcgen05.mma metadata example for f16/f8f6f4/mxf8f6f4 kind
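To make the 2:4 metadata concrete, the following Python sketch expands two stored values into a four-wide chunk. It is an illustrative model only: the function name is hypothetical, and it assumes that the low 2 bits of the metadata value give the position of the first stored element and the high 2 bits give the position of the second.

```python
MEANINGFUL_2_4 = {0b0100, 0b1000, 0b1100, 0b1001, 0b1101, 0b0110, 0b1110}


def expand_2_4_chunk(stored: list, metadata: int) -> list:
    """Expand two stored values into a four-wide chunk using 4-bit metadata."""
    if metadata not in MEANINGFUL_2_4:
        raise ValueError("metadata value is outside the listed meaningful set")
    first_pos = metadata & 0b11          # assumption: low index -> first stored value
    second_pos = (metadata >> 2) & 0b11  # assumption: high index -> second stored value
    chunk = [0, 0, 0, 0]
    chunk[first_pos] = stored[0]
    chunk[second_pos] = stored[1]
    return chunk
```

Under this reading, metadata 0b0100 places the stored pair in positions 0 and 1, while 0b1110 places it in positions 2 and 3.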
For .kind::mxf4 and .kind::mxf4nvf4, matrix A is pair-wise structured sparse at a granularity of 4:8. In other words, each chunk of eight adjacent elements in a row of matrix A has four zero and four non-zero elements. The zero and non-zero elements are clustered in sub-chunks of two elements each within the eight-wide chunk, so each two-wide sub-chunk within the eight-wide chunk must be all zeros or all non-zeros. Only the four non-zero elements are stored in memory and the two 2-bit indices in the metadata indicate the position of the two two-wide sub-chunks with non-zero values in the eight-wide chunk of a row of matrix A. The only meaningful values of the index are:
0b0100
0b1000
0b1100
0b1001
0b1101
0b0110
0b1110
Rest of the values result in undefined behavior.
Figure 261: Sparse tcgen05.mma metadata example for mxf4 kind
The value of the sparsity selector selects the sub-columns in the Tensor Memory to form the sparsity metadata matrix, which is used with matrix A to form the multiplicand matrix.
The following shows the sparse metadata matrix layout in Tensor Memory for various MMA variants:
The layouts which utilize only half the datapath lanes as specified in Data Path Layout Organization, i.e. Layout F and Layout C, must use the same alignment across matrices A, D and the sparsity metadata matrix.
Instruction tcgen05.mma is an asynchronous instruction which initiates an MxNxK matrix multiply and accumulate operation, D = A*B + D, where the A matrix is MxK, the B matrix is KxN, and the D matrix is MxN.
The operation of the form D = A*B is issued when the input predicate argument enable-input-d is false.
The optional immediate argument scale-input-d can be specified to scale the input matrix D as follows: D = A*B + D * (2 ^ -scale-input-d)
The valid range of values for argument scale-input-d is [0, 15]. The argument scale-input-d is only valid for .kind::tf32 and .kind::f16.
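The arithmetic effect of enable-input-d and scale-input-d can be summarized with a small reference model. The Python sketch below is illustrative only: it works on nested lists of numbers, ignores element types, rounding and asynchrony, and uses a hypothetical function name.

```python
def mma_reference(a, b, d, enable_input_d=True, scale_input_d=0):
    """Reference for D = A*B + D * 2**(-scale_input_d), or D = A*B if D is disabled."""
    if not 0 <= scale_input_d <= 15:
        raise ValueError("scale_input_d must be in [0, 15]")
    m, k, n = len(a), len(b), len(b[0])
    d_scale = 2.0 ** (-scale_input_d)
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = sum(a[i][kk] * b[kk][j] for kk in range(k))
            if enable_input_d:
                acc += d[i][j] * d_scale  # accumulator input, optionally scaled down
            row.append(acc)
        out.append(row)
    return out
```

With A = [1, 2], B = [3, 4] (as a column) and D = 8, the product A*B is 11, so the result is 19 by default, 13 with scale-input-d equal to 2 (8 * 2^-2 = 2), and 11 when enable-input-d is false.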
The 32-bit register operand idesc is the instruction descriptor as described in Instruction descriptor. It specifies the shapes, exact types, sparsity and other details of the input matrices, output matrix and the matrix multiply and accumulate operation.
The qualifier .cta_group::1 specifies that the matrix multiply and accumulate operation is performed on the Tensor Memory of the executing thread’s CTA only. The qualifier .cta_group::2 specifies that the matrix multiply and accumulate operation is performed on the Tensor Memory of the executing thread’s CTA and its peer CTA.
All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.
The instruction tcgen05.mma has single thread semantics, unlike the collective instructions mma.sync or wgmma.mma_async. So, a single thread issuing the tcgen05.mma will result in the initiation of the whole matrix multiply and accumulate operation. Refer to the section Issue Granularity.
The qualifier .kind specifies the general kind of the element types of the multiplicand matrices. The exact types of the elements of the input and output matrices for each MMA-kind are specified in the Instruction descriptor.
The address operand d-tmem specifies the address of the destination and the accumulation matrix D in the Tensor Memory. The address operand a-tmem specifies the address of the matrix A in the Tensor Memory. The 64-bit register operands a-desc and b-desc are the matrix descriptors which represent the matrices A and B in shared memory respectively. The format of the matrix descriptor is described in Matrix Descriptors.
The vector operand disable-output-lane specifies the lane(s) in the Tensor Memory that should not be updated with the resultant matrix D. Elements of the vector operand disable-output-lane form a mask where each bit corresponds to a lane of the Tensor Memory, with the least significant bit of the first element of the vector (leftmost in syntax) corresponding to lane 0 of the Tensor Memory. If a bit in the mask is 1, then the corresponding lane in the Tensor Memory for the resultant matrix D will not be updated. The size of the vector is as follows:
| .cta_group | Size of the vector disable-output-lane |
|---|---|
| ::1 | 4 |
| ::2 | 8 |
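The bit-to-lane mapping of disable-output-lane can be sketched as follows. The helper name disable_output_lane_mask is hypothetical, and the element width of 32 bits is an assumption made for this illustration; the text above only fixes the vector sizes and the position of lane 0.

```python
def disable_output_lane_mask(disabled_lanes, cta_group=1, bits_per_element=32):
    """Hypothetical helper: build the disable-output-lane vector.

    Bit 0 of element 0 corresponds to lane 0. The vector has 4 elements
    for .cta_group::1 and 8 for .cta_group::2. bits_per_element=32 is an
    assumption of this sketch.
    """
    size = {1: 4, 2: 8}[cta_group]
    mask = [0] * size
    for lane in disabled_lanes:
        elem, bit = divmod(lane, bits_per_element)
        if not 0 <= elem < size:
            raise ValueError(f"lane {lane} is out of range")
        mask[elem] |= 1 << bit  # a 1 bit means "do not update this lane"
    return mask
```

For example, disabling lanes 0 and 33 sets bit 0 of the first element and bit 1 of the second element.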
Qualifier .block_scale specifies that the matrices A and B are scaled with scale_A and scale_B matrices respectively before performing the matrix multiply and accumulate operation, as specified in the section Block Scaling. The address operands scale-A-tmem and scale-B-tmem specify the base addresses of the matrices scale_A and scale_B respectively in the Tensor Memory.
For qualifier .scale_vectorsize:
If .scale_vec::NX is specified: N specifies the number of columns in the scale_A matrix and the number of rows in the scale_B matrix.
If .blockN is specified: N specifies the block size for which a single scale factor will be applied. In this form, the value of N is the same as the K-dimension / (N of .scale_vec::NX).
Aliased .scale_vectorsize variants:
.block16 is aliased with:
.scale_vec::4X when .kind = .kind::mxf4nvf4 and K = 64 or 128
.block32 is aliased with:
.scale_vec::1X when .kind = .kind::mxf8f6f4 for all supported values of K
.scale_vec::2X when .kind = .kind::mxf4 or .kind::mxf4nvf4 and K = 64 or 128
The valid combinations of MMA-kind and .scale_vectorsize are described in Table 54. For .kind::mxf4, when the qualifier .scale_vectorsize is not specified, it defaults to .block32. For .kind::mxf4nvf4, the qualifier .scale_vectorsize must be explicitly specified.
The qualifier .ashift shifts the rows of the A matrix down by one row, except for the last row in the Tensor Memory. Qualifier .ashift is only allowed with M = 128 or M = 256.
The qualifier .collector_usage specifies the usage of the collector buffer for matrix A. The following collector buffer operations can be specified:
| .collector_usage | Semantics |
|---|---|
| .collector::a::fill | Specifies that the A matrix read from the memory should be filled in collector buffer. |
| .collector::a::use | Specifies that the A matrix can be read from the collector buffer. This requires a previous fill to the collector buffer to be still valid. |
| .collector::a::lastuse | Specifies that the A matrix can be read from the collector buffer and the contents of the collector buffer can be discarded. This requires a previous fill to the collector buffer to be valid till the collector buffer is read. |
| .collector::a::discard | Specifies that the contents of the collector buffer for A can be discarded. |
If no .collector_usage qualifier is specified, then it defaults to .collector::a::discard. It is illegal to specify either of .collector::a::use or .collector::a::fill along with .ashift.
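The validity rules for the collector buffer can be pictured as a small state machine. The following Python sketch is a conceptual model only: the class CollectorBufferModel and its return values are hypothetical, and the assumption that ::discard reads A from memory is made for illustration and is not stated in the text above.

```python
class CollectorBufferModel:
    """Hypothetical model of the collector buffer for matrix A."""

    def __init__(self):
        self.valid = False  # True while a previous fill is still valid

    def apply(self, usage="discard", ashift=False):
        """Apply one .collector::a::<usage> and return where A is read from."""
        if ashift and usage in ("use", "fill"):
            raise ValueError(".ashift cannot be combined with ::use or ::fill")
        if usage == "fill":
            self.valid = True
            return "memory"
        if usage in ("use", "lastuse"):
            if not self.valid:
                raise RuntimeError("requires a previous, still valid fill")
            if usage == "lastuse":
                self.valid = False  # contents may be discarded after the read
            return "collector"
        if usage == "discard":
            self.valid = False
            return "memory"  # assumption of this sketch
        raise ValueError(f"unknown usage {usage!r}")
```

A typical sequence in this model is one fill, any number of use operations, and a final lastuse.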
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Qualifier .kind::mxf4nvf4 introduced in PTX ISA version 8.7.
Qualifiers .block16 and .block32 introduced in PTX ISA version 8.8.
Target ISA Notes
Supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on the following family-specific architectures from PTX ISA version 8.8 except .kind::i8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Qualifier .kind::i8 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_110a
Argument scale-input-d requires sm_100a and is supported on sm_100f or higher in the same family from PTX ISA version 8.8.
Instruction tcgen05.mma.sp is an asynchronous instruction which initiates an MxNxK matrix multiply and accumulate operation of the form D = A*B+D, where the A matrix is Mx(K/2), the B matrix is KxN, and the D matrix is MxN. Sparse Matrices describes the details of the sparsity.
The operation of the form D = A*B is issued when the input predicate argument enable-input-d is false.
The optional immediate argument scale-input-d can be specified to scale the input matrix D as follows: D = A*B+D*(2^-scale-input-d)
The valid range of values for argument scale-input-d is [0, 15]. The argument scale-input-d is only valid for .kind::tf32 and .kind::f16.
The 32-bit register operand idesc is the instruction descriptor, as described in Instruction descriptor. It specifies the shapes, exact types, sparsity and other details of the input matrices, output matrix and the matrix multiply and accumulate operation.
The qualifier .cta_group::1 specifies that the matrix multiply and accumulate operation is performed on the Tensor Memory of the executing thread’s CTA only. The qualifier .cta_group::2 specifies that the matrix multiply and accumulate operation is performed on the Tensor Memory of the executing thread’s CTA and its peer CTA.
All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.
The instruction tcgen05.mma.sp has single thread semantics, unlike the collective instructions mma.sync or wgmma.mma_async. So, a single thread issuing the tcgen05.mma.sp will result in the initiation of the whole matrix multiply and accumulate operation. Refer to the section Issue Granularity.
The qualifier .kind specifies the general kind of the element types of the multiplicand matrices. The exact types of the elements of the input and output matrices for each MMA-kind are specified in the Instruction descriptor.
The address operand d-tmem specifies the address of the destination and the accumulation matrix D in the Tensor Memory. The address operand a-tmem specifies the address of the matrix A in the Tensor Memory. The 64-bit register operands a-desc and b-desc are the matrix descriptors which represent the matrices A and B in shared memory respectively. The format of the matrix descriptor is described in Matrix Descriptors.
The vector operand disable-output-lane specifies the lane(s) in the Tensor Memory that should not be updated with the resultant matrix D. Elements of the vector operand disable-output-lane form a mask where each bit corresponds to a lane of the Tensor Memory, with the least significant bit of the first element of the vector (leftmost in syntax) corresponding to lane 0 of the Tensor Memory. If a bit in the mask is 1, then the corresponding lane in the Tensor Memory for the resultant matrix D will not be updated. The size of the vector is as follows:
| .cta_group | Size of the vector disable-output-lane |
|---|---|
| ::1 | 4 |
| ::2 | 8 |
Qualifier .block_scale specifies that the matrices A and B are scaled with scale_A and scale_B matrices respectively before performing the matrix multiply and accumulate operation, as specified in the section Block Scaling. The address operands scale-A-tmem and scale-B-tmem specify the base addresses of the matrices scale_A and scale_B respectively in the Tensor Memory.
For qualifier .scale_vectorsize:
If .scale_vec::NX is specified: N specifies the number of columns in the scale_A matrix and the number of rows in the scale_B matrix.
If .blockN is specified: N specifies the block size for which a single scale factor will be applied. In this form, the value of N is the same as the K-dimension / (N of .scale_vec::NX).
Aliased .scale_vectorsize variants:
.block16 is aliased with:
.scale_vec::4X when .kind = .kind::mxf4nvf4 and K = 64 or 128
.block32 is aliased with:
.scale_vec::1X when .kind = .kind::mxf8f6f4 for all supported values of K
.scale_vec::2X when .kind = .kind::mxf4 or .kind::mxf4nvf4 and K = 64 or 128
The valid combinations of MMA-kind and .scale_vectorsize are described in Table 54. For .kind::mxf4, when the qualifier .scale_vectorsize is not specified, it defaults to .block32. For .kind::mxf4nvf4, the qualifier .scale_vectorsize must be explicitly specified.
The qualifier .ashift shifts the rows of the A matrix down by one row, except for the last row in the Tensor Memory. Qualifier .ashift is only allowed with M = 128 or M = 256.
The qualifier .collector_usage specifies the usage of the collector buffer for matrix A. The following collector buffer operations can be specified:
| .collector_usage | Semantics |
|---|---|
| .collector::a::fill | Specifies that the A matrix read from the memory should be filled in collector buffer. |
| .collector::a::use | Specifies that the A matrix can be read from the collector buffer. This requires a previous fill to the collector buffer to be still valid. |
| .collector::a::lastuse | Specifies that the A matrix can be read from the collector buffer and the contents of the collector buffer can be discarded. This requires a previous fill to the collector buffer to be valid till the collector buffer is read. |
| .collector::a::discard | Specifies that the contents of the collector buffer for A can be discarded. |
If no .collector_usage qualifier is specified, then it defaults to .collector::a::discard. It is illegal to specify either of .collector::a::use or .collector::a::fill along with .ashift.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Qualifier .kind::mxf4nvf4 introduced in PTX ISA version 8.7.
Qualifiers .block16 and .block32 introduced in PTX ISA version 8.8.
Target ISA Notes
Supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on the following family-specific architectures from PTX ISA version 8.8 except .kind::i8/.kind::mxf4nvf4/.kind::mxf4:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Qualifier .kind::i8 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_110a
Qualifiers .kind::mxf4nvf4 and .kind::mxf4 are supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_103a
sm_110a
Argument scale-input-d requires sm_100a and is supported on sm_100f or higher in the same family from PTX ISA version 8.8.
Instruction tcgen05.mma.ws is an asynchronous instruction which initiates an MxNxK matrix multiply and accumulate operation, D = A*B+D, where the A matrix is MxK, the B matrix is KxN, and the D matrix is MxN.
The operation of the form D = A*B is issued when the input predicate argument enable-input-d is false.
The 32-bit register operand idesc is the instruction descriptor, as described in Instruction descriptor. It specifies the shapes, exact types, sparsity and other details of the input matrices, output matrix and the matrix multiply and accumulate operation.
The qualifier .cta_group::1 specifies that the matrix multiply and accumulate operation is performed on the Tensor Memory of the executing thread’s CTA only.
All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.
The instruction tcgen05.mma.ws has single thread semantics, unlike the collective instructions mma.sync or wgmma.mma_async. So, a single thread issuing the tcgen05.mma.ws will result in the initiation of the whole matrix multiply and accumulate operation. Refer to the section Issue Granularity.
The qualifier .kind specifies the general kind of the element types of the multiplicand matrices. The exact types of the elements of the input and output matrices for each MMA-kind are specified in the Instruction descriptor.
The address operand d-tmem specifies the address of the destination and the accumulation matrix D in the Tensor Memory. The address operand a-tmem specifies the address of the matrix A in the Tensor Memory. The 64-bit register operands a-desc and b-desc are the matrix descriptors which represent the matrices A and B in shared memory respectively. The format of the matrix descriptor is described in Matrix Descriptors.
The optional operand zero-column-mask-desc is a 64-bit register which specifies the Zero-Column Mask Descriptor. The zero-column mask descriptor is used to generate a mask that specifies which columns of the B matrix will have zero value for the matrix multiply and accumulate operation regardless of the values present in the shared memory.
The qualifier .collector_usage specifies the usage of the collector buffer for matrix B. The following collector buffer operations can be specified:
| .collector_usage | Semantics |
|---|---|
| .collector::bN::fill | Specifies that the B matrix read from the memory should be filled in collector buffer #N. |
| .collector::bN::use | Specifies that the B matrix can be read from the collector buffer #N. This requires a previous fill to the collector buffer #N to be still valid. |
| .collector::bN::lastuse | Specifies that the B matrix can be read from the collector buffer #N after which the contents of the collector buffer #N can be discarded. This requires a previous fill to the collector buffer #N to be valid till the collector buffer #N is read. |
| .collector::bN::discard | Specifies that the contents of the collector buffer #N can be discarded. |
If no .collector_usage qualifier is specified, then it defaults to .collector::b0::discard.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on the following family-specific architectures from PTX ISA version 8.8 except .kind::i8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Qualifier .kind::i8 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
Instruction tcgen05.mma.ws.sp is an asynchronous instruction which initiates an MxNxK matrix multiply and accumulate operation, D = A*B+D, where the A matrix is Mx(K/2), the B matrix is KxN, and the D matrix is MxN. Sparse Matrices describes the details of the sparsity.
The operation of the form D = A*B is issued when the input predicate argument enable-input-d is false.
The 32-bit register operand idesc is the instruction descriptor, as described in Instruction descriptor. It specifies the shapes, exact types, sparsity and other details of the input matrices, output matrix and the matrix multiply and accumulate operation.
The qualifier .cta_group::1 specifies that the matrix multiply and accumulate operation is performed on the Tensor Memory of the executing thread’s CTA only.
All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.
The instruction tcgen05.mma.ws.sp has single thread semantics, unlike the collective instructions mma.sync or wgmma.mma_async. So, a single thread issuing the tcgen05.mma.ws.sp will result in the initiation of the whole matrix multiply and accumulate operation. Refer to the section Issue Granularity.
The qualifier .kind specifies the general kind of the element types of the multiplicand matrices. The exact types of the elements of the input and output matrices for each MMA-kind are specified in the Instruction descriptor.
The address operand d-tmem specifies the address of the destination and the accumulation matrix D in the Tensor Memory. The address operand a-tmem specifies the address of the matrix A in the Tensor Memory. The 64-bit register operands a-desc and b-desc are the matrix descriptors which represent the matrices A and B in shared memory respectively. The format of the matrix descriptor is described in Matrix Descriptors.
The optional operand zero-column-mask-desc is a 64-bit register which specifies the Zero-Column Mask Descriptor. The zero-column mask descriptor is used to generate a mask that specifies which columns of the B matrix will have zero value for the matrix multiply and accumulate operation regardless of the values present in the shared memory.
The qualifier .collector_usage specifies the usage of the collector buffer for matrix B. The following collector buffer operations can be specified:
| .collector_usage | Semantics |
|---|---|
| .collector::bN::fill | Specifies that the B matrix read from the memory should be filled in collector buffer #N. |
| .collector::bN::use | Specifies that the B matrix can be read from the collector buffer #N. This requires a previous fill to the collector buffer #N to be still valid. |
| .collector::bN::lastuse | Specifies that the B matrix can be read from the collector buffer #N after which the contents of the collector buffer #N can be discarded. This requires a previous fill to the collector buffer #N to be valid till the collector buffer #N is read. |
| .collector::bN::discard | Specifies that the contents of the collector buffer #N can be discarded. |
If no .collector_usage qualifier is specified, then it defaults to .collector::b0::discard.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on the following family-specific architectures from PTX ISA version 8.8 except .kind::i8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Qualifier .kind::i8 is supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
The instruction tcgen05.fence::before_thread_sync orders all the prior asynchronous tcgen05 operations with respect to the subsequent tcgen05 and the execution ordering operations.
The instruction tcgen05.fence::after_thread_sync orders all the subsequent asynchronous tcgen05 operations with respect to the prior tcgen05 and the execution ordering operations.
The tcgen05.fence::* instructions compose with execution ordering instructions across a thread scope and provide ordering between tcgen05 instructions across the same scope.
The tcgen05.fence::before_thread_sync instructions behave as a code motion fence for prior tcgen05 instructions, as they cannot be hoisted across. The tcgen05.fence::after_thread_sync instructions behave as a code motion fence for subsequent tcgen05 instructions, as they cannot be hoisted across.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
The instruction tcgen05.commit is an asynchronous instruction which makes the mbarrier object, specified by the address operand mbar, track the completion of all the prior asynchronous tcgen05 operations, as listed in mbarrier based completion mechanism, initiated by the executing thread. Upon the completion of the tracked asynchronous tcgen05 operations, the signal specified by the .completion_mechanism is triggered by the system on the mbarrier object.
The instruction tcgen05.commit.cta_group::1 tracks the completion of all prior asynchronous tcgen05 operations with .cta_group::1 issued by the current thread. Similarly, the instruction tcgen05.commit.cta_group::2 tracks the completion of all prior asynchronous tcgen05 operations with .cta_group::2 issued by the current thread.
All tcgen05 instructions within a kernel must specify the same value for the .cta_group qualifier.
The qualifier .mbarrier::arrive::one indicates that upon the completion of the prior asynchronous tcgen05 operation issued by the current thread, an arrive-on operation, with the count argument of 1, is signaled on the mbarrier object. The scope of the arrive-on operation is the cluster scope.
The optional qualifier .multicast::cluster allows signaling on the mbarrier objects of multiple CTAs in the cluster. Operand ctaMask specifies the CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %cluster_ctarank of the destination CTA. The mbarrier signal is multicast to the same offset as mbar in the shared memory of each destination CTA.
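The encoding of ctaMask can be sketched with a short host-side helper. The function name cta_mask is hypothetical; the only facts taken from the text are that the operand is 16 bits wide and that bit i corresponds to the CTA whose %cluster_ctarank is i.

```python
def cta_mask(dest_ranks):
    """Hypothetical helper: build a 16-bit ctaMask from destination CTA ranks."""
    mask = 0
    for rank in dest_ranks:
        if not 0 <= rank < 16:
            raise ValueError(f"rank {rank} does not fit in a 16-bit mask")
        mask |= 1 << rank  # bit position equals %cluster_ctarank
    return mask
```

For example, signaling the CTAs with ranks 0 and 2 gives the mask 0b101.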
If no state space is specified then Generic Addressing is used. If the address specified by mbar does not fall within the address window of .shared::cluster state space then the behavior is undefined.
PTX ISA Notes
Introduced in PTX ISA version 8.6.
Target ISA Notes
Supported on the following architectures:
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
And is supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
sm_110f or higher in the same family
Examples
Example 1:
tcgen05.cp.cta_group::1.128x256b                      [taddr0], sdesc0;
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [mbarObj1];

loop:
mbarrier.try_wait.parity.b64 p, [mbarObj1], 0;
@!p bra loop;

Example 2:
tcgen05.mma.cta_group::2.kind::tf32                   [taddr0], adesc, bdesc, idesc, p;
tcgen05.commit.cta_group::2.mbarrier::arrive::one.b64 [mbarObj2];

loop:
mbarrier.try_wait.parity.b64 p, [mbarObj2], 0;
@!p bra loop;
Copies the current value of the stack pointer into the destination register d. The pointer returned by stacksave can be used in a subsequent stackrestore instruction to restore the stack pointer. If d is modified prior to use in a stackrestore instruction, it may corrupt data in the stack.
Destination operand d has the same type as the instruction type.
Semantics
d = stackptr;
PTX ISA Notes
Introduced in PTX ISA version 7.3.
Preview Feature:
stacksave is a preview feature in PTX ISA version 7.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.
Sets the current stack pointer to source register a.
When stackrestore is used with operand a written by a prior stacksave instruction, it will effectively restore the state of the stack as it was before stacksave was executed. Note that if stackrestore is used with an arbitrary value of a, it may cause corruption of the stack pointer. This implies that the correct use of this feature requires that stackrestore.type a is used after stacksave.type a without redefining the value of a between them.
Operand a has the same type as the instruction type.
Semantics
stackptr = a;
PTX ISA Notes
Introduced in PTX ISA version 7.3.
Preview Feature:
stackrestore is a preview feature in PTX ISA version 7.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.
Target ISA Notes
stackrestore requires sm_52 or higher.
Examples
.reg .u32 ra;
stacksave.u32 ra;
// Code that may modify stack pointer
...
stackrestore.u32 ra;
The alloca instruction dynamically allocates memory on the stack frame of the current function and updates the stack pointer accordingly. The returned pointer ptr points to local memory and can be used in the address operand of ld.local and st.local instructions.
If sufficient memory is unavailable for allocation on the stack, then execution of alloca may result in stack overflow. In such cases, attempting to access the allocated memory with ptr will result in undefined program behavior.
The memory allocated by alloca is deallocated in the following ways:
It is automatically deallocated when the function exits.
It can be explicitly deallocated using stacksave and stackrestore instructions: stacksave can be used to save the value of the stack pointer before executing alloca, and stackrestore can be used after alloca to restore the stack pointer to the original value which was previously saved with stacksave. Note that accessing deallocated memory after executing stackrestore results in undefined behavior.
size is an unsigned value which specifies the amount of memory, in number of bytes, to be allocated on the stack. size = 0 may not lead to a valid memory allocation.
Both ptr and size have the same type as the instruction type.
immAlign is a 32-bit value which specifies the alignment requirement, in number of bytes, for the memory allocated by alloca. It is an integer constant, must be a power of 2 and must not exceed 2^23. immAlign is an optional argument with a default value of 8, which is the minimum guaranteed alignment.
Semantics
alloca.type ptr, size, immAlign:

a = max(immAlign, frame_align); // frame_align is the minimum guaranteed alignment

// Allocate size bytes of stack memory with alignment a and update the stack pointer.
// Since the stack grows down, the updated stack pointer contains a lower address.
stackptr = alloc_stack_mem(size, a);

// Return the new value of stack pointer as ptr. Since ptr is the lowest address of the memory
// allocated by alloca, the memory can be accessed using ptr up to (ptr + size of allocated memory).
stacksave ptr;
PTX ISA Notes
Introduced in PTX ISA version 7.3.
Preview Feature:
alloca is a preview feature in PTX ISA version 7.3. All details are subject to change with no guarantees of backward compatibility on future PTX ISA versions or SM architectures.
Target ISA Notes
alloca requires sm_52 or higher.
Examples
.reg .u32 ra, stackptr, ptr, size;

stacksave.u32 stackptr;     // Save the current stack pointer
alloca.u32 ptr, size, 8;    // Allocate stack memory
st.local.u32 [ptr], ra;     // Use the allocated stack memory
stackrestore.u32 stackptr;  // Deallocate memory by restoring the stack pointer
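The interaction of stacksave, alloca and stackrestore can be pictured with a toy model of a downward-growing stack. The class StackModel and the initial stack top address are hypothetical; the alignment rule a = max(immAlign, frame_align) and the minimum guaranteed alignment of 8 follow the semantics given above. How the hardware rounds the address is not specified in the text, so rounding down to the alignment boundary is an assumption of this sketch.

```python
class StackModel:
    """Hypothetical model of a stack that grows toward lower addresses."""

    FRAME_ALIGN = 8  # minimum guaranteed alignment, per the text

    def __init__(self, top=0x10000):  # top address is an arbitrary example
        self.stackptr = top

    def stacksave(self):
        return self.stackptr  # d = stackptr

    def stackrestore(self, a):
        self.stackptr = a  # stackptr = a

    def alloca(self, size, imm_align=8):
        # immAlign must be a power of 2 and must not exceed 2^23.
        if imm_align <= 0 or imm_align & (imm_align - 1) or imm_align > 2 ** 23:
            raise ValueError("immAlign must be a power of 2 and at most 2^23")
        align = max(imm_align, self.FRAME_ALIGN)
        # Move down by size bytes, then round down to the alignment.
        self.stackptr = (self.stackptr - size) & ~(align - 1)
        return self.stackptr  # ptr is the lowest address of the allocation
```

Restoring a value saved before the alloca releases the allocation in this model, which mirrors the explicit deallocation pattern of the example above.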
All video instructions operate on 32-bit register operands. However, the video instructions may be classified as either scalar or SIMD based on whether their core operation applies to one or multiple values.
The source and destination operands are all 32-bit registers. The type of each operand (.u32 or .s32) is specified in the instruction type; all combinations of dtype, atype, and btype are valid. Using the atype/btype and asel/bsel specifiers, the input values are extracted and sign- or zero-extended internally to .s33 values. The primary operation is then performed to produce an .s34 intermediate result. The sign of the intermediate result depends on dtype.
The intermediate result is optionally clamped to the range of the destination type (signed or unsigned), taking into account the subword destination size in the case of optional data merging.
This intermediate result is then optionally combined with the third source operand using a secondary arithmetic operation or subword data merge, as shown in the following pseudocode. The sign of the third operand is based on dtype.
Perform scalar arithmetic operation with optional saturate, and optional secondary arithmetic operation or subword data merge.
Semantics
// extract byte/half-word/word and sign- or zero-extend
// based on source operand type
ta = partSelectSignExtend( a, atype, asel );
tb = partSelectSignExtend( b, btype, bsel );

switch ( vop ) {
    case vadd:     tmp = ta + tb;
    case vsub:     tmp = ta - tb;
    case vabsdiff: tmp = | ta - tb |;
    case vmin:     tmp = MIN( ta, tb );
    case vmax:     tmp = MAX( ta, tb );
}
// saturate, taking into account destination type and merge operations
tmp = optSaturate( tmp, sat, isSigned(dtype), dsel );
d = optSecondaryOp( op2, tmp, c );  // optional secondary operation
d = optMerge( dsel, tmp, c );       // optional merge with c operand
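The extraction, primary operation and saturation steps of the pseudocode above can be modeled in Python as follows. This is a sketch under stated assumptions: the function names are hypothetical, selectors are passed as strings such as "b0" or "h1" (None selects the whole word), the secondary operation and the subword merge are omitted, and an unsaturated result is simply truncated to 32 bits.

```python
# selector -> (bit offset, width in bits); None means the whole 32-bit word
SELECTORS = {
    None: (0, 32), "h0": (0, 16), "h1": (16, 16),
    "b0": (0, 8), "b1": (8, 8), "b2": (16, 8), "b3": (24, 8),
}

def part_select_sign_extend(value, signed, sel=None):
    """Extract a byte/half-word/word and sign- or zero-extend it."""
    shift, width = SELECTORS[sel]
    part = (value >> shift) & ((1 << width) - 1)
    if signed and part >> (width - 1):
        part -= 1 << width  # interpret the top bit as a sign bit
    return part

def saturate(value, signed, width=32):
    """Clamp value to the signed or unsigned range of the given width."""
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    return max(lo, min(hi, value))

def video_op(vop, a, b, a_signed, b_signed, d_signed,
             asel=None, bsel=None, sat=False):
    """Model of vadd/vsub/vabsdiff/vmin/vmax without secondary op or merge."""
    ta = part_select_sign_extend(a, a_signed, asel)
    tb = part_select_sign_extend(b, b_signed, bsel)
    tmp = {"vadd": ta + tb, "vsub": ta - tb, "vabsdiff": abs(ta - tb),
           "vmin": min(ta, tb), "vmax": max(ta, tb)}[vop]
    if sat:
        tmp = saturate(tmp, d_signed)
    return tmp & 0xFFFFFFFF  # 32-bit register image of the result
```

Because the inputs are widened before the operation, an unsigned subtraction such as 1 - 2 produces a negative intermediate value, which saturation then clamps according to the destination type.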
PTX ISA Notes
Introduced in PTX ISA version 2.0.
Target ISA Notes
vadd, vsub, vabsdiff, vmin, vmax require sm_20 or higher.
Shift a left by unsigned amount in b with optional saturate, and optional secondary arithmetic operation or subword data merge. Left shift fills with zero.
vshr
Shift a right by unsigned amount in b with optional saturate, and optional secondary arithmetic operation or subword data merge. Signed shift fills with the sign bit, unsigned shift fills with zero.
Semantics
// extract byte/half-word/word and sign- or zero-extend
// based on source operand type
ta = partSelectSignExtend( a, atype, asel );
tb = partSelectSignExtend( b, .u32, bsel );
if ( mode == .clamp  && tb > 32 )  tb = 32;
if ( mode == .wrap )               tb = tb & 0x1f;
switch ( vop ) {
    case vshl:  tmp = ta << tb;
    case vshr:  tmp = ta >> tb;
}
// saturate, taking into account destination type and merge operations
tmp = optSaturate( tmp, sat, isSigned(dtype), dsel );
d = optSecondaryOp( op2, tmp, c );  // optional secondary operation
d = optMerge( dsel, tmp, c );       // optional merge with c operand
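The difference between the .clamp and .wrap handling of the shift amount can be seen in a small Python model. The function video_shift is hypothetical; it operates on whole 32-bit words only and omits selectors, saturation, the secondary operation and the merge.

```python
def video_shift(vop, a, b, a_signed=False, mode="clamp"):
    """Hypothetical model of vshl/vshr on whole 32-bit operands."""
    ta = a & 0xFFFFFFFF
    if a_signed and ta & 0x80000000:
        ta -= 1 << 32  # sign-extend so that >> fills with the sign bit
    tb = b & 0xFFFFFFFF  # the shift amount is always unsigned
    if mode == "clamp":
        tb = min(tb, 32)
    elif mode == "wrap":
        tb &= 0x1F
    else:
        raise ValueError(f"unknown mode {mode!r}")
    tmp = ta << tb if vop == "vshl" else ta >> tb
    return tmp & 0xFFFFFFFF  # 32-bit register image of the result
```

A shift amount of 40 therefore shifts everything out in clamp mode, but behaves like a shift by 8 in wrap mode.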
Calculate (a*b) + c, with optional operand negates, plus one mode, and scaling.
The source operands support optional negation with some restrictions. Although PTX syntax allows separate negation of the a and b operands, internally this is represented as negation of the product (a*b). That is, (a*b) is negated if and only if exactly one of a or b is negated. PTX allows negation of either (a*b) or c.
The plus one mode (.po) computes (a*b) + c + 1, which is used in computing averages. Source operands may not be negated in .po mode.
The intermediate result of (a*b) is unsigned if atype and btype are unsigned and the product (a*b) is not negated; otherwise, the intermediate result is signed. Input c has the same sign as the intermediate result.
The final result is unsigned if the intermediate result is unsigned and c is not negated.
Depending on the sign of the a and b operands, and the operand negates, the following combinations of operands are supported for VMAD:
 (u32 * u32) + u32  // intermediate unsigned; final unsigned
-(u32 * u32) + s32  // intermediate   signed; final   signed
 (u32 * u32) - u32  // intermediate unsigned; final   signed
 (u32 * s32) + s32  // intermediate   signed; final   signed
-(u32 * s32) + s32  // intermediate   signed; final   signed
 (u32 * s32) - s32  // intermediate   signed; final   signed
 (s32 * u32) + s32  // intermediate   signed; final   signed
-(s32 * u32) + s32  // intermediate   signed; final   signed
 (s32 * u32) - s32  // intermediate   signed; final   signed
 (s32 * s32) + s32  // intermediate   signed; final   signed
-(s32 * s32) + s32  // intermediate   signed; final   signed
 (s32 * s32) - s32  // intermediate   signed; final   signed
The intermediate result is optionally scaled via right-shift; this result is sign-extended if the final result is signed, and zero-extended otherwise.
The final result is optionally saturated to the appropriate 32-bit range based on the type (signed or unsigned) of the final result.
Semantics
// extract byte/half-word/word and sign- or zero-extend
// based on source operand type
ta = partSelectSignExtend( a, atype, asel );
tb = partSelectSignExtend( b, btype, bsel );
signedFinal = isSigned(atype) || isSigned(btype) ||
              (a.negate ^ b.negate) || c.negate;
tmp[127:0] = ta * tb;

lsb = 0;
if ( .po )                      { lsb = 1; }
else if ( a.negate ^ b.negate ) { tmp = ~tmp; lsb = 1; }
else if ( c.negate )            { c = ~c; lsb = 1; }

c128[127:0] = (signedFinal) ? sext32( c ) : zext( c );
tmp = tmp + c128 + lsb;
switch( scale ) {
    case .shr7:  result = (tmp >> 7)  & 0xffffffffffffffff;
    case .shr15: result = (tmp >> 15) & 0xffffffffffffffff;
}
if ( .sat ) {
    if (signedFinal) result = CLAMP(result, S32_MAX, S32_MIN);
    else             result = CLAMP(result, U32_MAX, U32_MIN);
}
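The negation trick in the pseudocode (bitwise complement plus a carry-in of 1) can be checked with a Python model. This is a sketch only: the function vmad and its keyword arguments are hypothetical, operands are whole 32-bit words (no asel/bsel), a.negate ^ b.negate is collapsed into a single neg_ab flag, shift stands for the optional .shr7/.shr15 scaling, and the result is returned as a Python integer rather than a register image.

```python
def sext32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def vmad(a, b, c, a_signed=False, b_signed=False,
         neg_ab=False, neg_c=False, po=False, shift=0, sat=False):
    """Hypothetical model of (a*b) + c with negates, .po, scaling and .sat."""
    if po and (neg_ab or neg_c):
        raise ValueError("source operands may not be negated in .po mode")
    if neg_ab and neg_c:
        raise ValueError("only one of (a*b) or c may be negated")
    ta = sext32(a) if a_signed else a & 0xFFFFFFFF
    tb = sext32(b) if b_signed else b & 0xFFFFFFFF
    signed_final = a_signed or b_signed or neg_ab or neg_c
    tmp = ta * tb
    lsb = 0
    if po:
        lsb = 1
    elif neg_ab:
        tmp, lsb = ~tmp, 1  # ~x + 1 == -x
    elif neg_c:
        c, lsb = ~c, 1
    c_ext = sext32(c) if signed_final else c & 0xFFFFFFFF
    result = (tmp + c_ext + lsb) >> shift
    if sat:
        lo, hi = (-(1 << 31), (1 << 31) - 1) if signed_final else (0, (1 << 32) - 1)
        result = max(lo, min(hi, result))
    return result
```

With small inputs the model reproduces the expected arithmetic, for example -(3*4) + 5 = -7 and (3*4) - 5 = 7.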
The SIMD video instructions operate on pairs of 16-bit values and quads of 8-bit values.
The SIMD video instructions are:
vadd2, vadd4
vsub2, vsub4
vavrg2, vavrg4
vabsdiff2, vabsdiff4
vmin2, vmin4
vmax2, vmax4
vset2, vset4
PTX includes SIMD video instructions for operation on pairs of 16-bit values and quads of 8-bit values. The SIMD video instructions execute the following stages:
Form input vectors by extracting and sign- or zero-extending byte or half-word values from the source operands, to form pairs of signed 17-bit values.
Perform a SIMD arithmetic operation on the input pairs.
Optionally clamp the result to the appropriate signed or unsigned range, as determined by the destination type.
Optionally perform one of the following:
perform a second SIMD merge operation, or
apply a scalar accumulate operation to reduce the intermediate SIMD results to a single scalar.
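The stages listed above can be sketched for a two-way SIMD add with the merge variant of the final stage. The function names are hypothetical, and the sketch makes two assumptions that are not spelled out at this point in the text: half-words 0 and 1 come from operand a and half-words 2 and 3 from operand b, and in a selector .hxy the digit y feeds the lower lane and x the upper lane. The accumulate variant is omitted.

```python
def extract_halfwords(a, b, sel, signed):
    """Stage 1: pick two half-words; sel=(x, y) models the selector .hxy."""
    halves = [a & 0xFFFF, (a >> 16) & 0xFFFF, b & 0xFFFF, (b >> 16) & 0xFFFF]
    lanes = []
    for idx in (sel[1], sel[0]):  # lane 0 (low half) first, then lane 1
        v = halves[idx]
        if signed and v & 0x8000:
            v -= 1 << 16  # sign-extend the 16-bit value
        lanes.append(v)
    return lanes

def vadd2(a, b, c, a_signed=False, b_signed=False, d_signed=False,
          asel=(1, 0), bsel=(3, 2), sat=False, mask=(0, 1)):
    """Hypothetical model of vadd2 with the secondary SIMD merge."""
    va = extract_halfwords(a, b, asel, a_signed)
    vb = extract_halfwords(a, b, bsel, b_signed)
    d = 0
    for lane in range(2):
        t = va[lane] + vb[lane]  # stage 2: SIMD arithmetic
        if sat:  # stage 3: optional clamp to the destination range
            lo, hi = (-0x8000, 0x7FFF) if d_signed else (0, 0xFFFF)
            t = max(lo, min(hi, t))
        t &= 0xFFFF
        if lane not in mask:  # stage 4: unmasked lanes are taken from c
            t = (c >> (16 * lane)) & 0xFFFF
        d |= t << (16 * lane)
    return d
```

With the default selectors the model adds the half-words of a and b lane by lane; restricting mask to lane 0 keeps the upper half-word of c in the result.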
The general format of dual half-word SIMD video instructions is as follows:
// 2-way SIMD operation, with second SIMD merge or accumulate
vop2.dtype.atype.btype{.sat}{.add}  d{.mask}, a{.asel}, b{.bsel}, c;

.dtype = .atype = .btype = { .u32, .s32 };
.mask  = { .h0, .h1, .h10 };
.asel  = .bsel = { .hxy, where x,y are from { 0, 1, 2, 3 } };
The general format of quad byte SIMD video instructions is as follows:
// 4-way SIMD operation, with second SIMD merge or accumulatevop4.dtype.atype.btype{.sat}{.add} d{.mask}, a{.asel}, b{.bsel}, c;.dtype = .atype = .btype = { .u32, .s32 };.mask = { .b0, .b1, .b10 .b2, .b20, .b21, .b210, .b3, .b30, .b31, .b310, .b32, .b320, .b321, .b3210 };.asel = .bsel = .bxyzw, where x,y,z,w are from { 0, ..., 7 };
The source and destination operands are all 32-bit registers. The type of each operand (.u32 or .s32) is specified in the instruction type; all combinations of dtype, atype, and btype are valid. Using the atype/btype and asel/bsel specifiers, the input values are extracted and sign- or zero-extended internally to .s33 values. The primary operation is then performed to produce an .s34 intermediate result. The sign of the intermediate result depends on dtype.
The intermediate result is optionally clamped to the range of the destination type (signed or unsigned), taking into account the subword destination size in the case of optional data merging.
Integer dual half-word SIMD absolute value of difference.
vmin2, vmax2
Integer dual half-word SIMD minimum/maximum.
Syntax
// SIMD instruction with secondary SIMD merge operation
vop2.dtype.atype.btype{.sat}  d{.mask}, a{.asel}, b{.bsel}, c;

// SIMD instruction with secondary accumulate operation
vop2.dtype.atype.btype.add  d{.mask}, a{.asel}, b{.bsel}, c;

 vop2  = { vadd2, vsub2, vavrg2, vabsdiff2, vmin2, vmax2 };
.dtype = .atype = .btype = { .u32, .s32 };
.mask  = { .h0, .h1, .h10 };  // defaults to .h10
.asel  = .bsel = { .hxy, where x,y are from { 0, 1, 2, 3 } };
   // .asel defaults to .h10
   // .bsel defaults to .h32
Description
Two-way SIMD parallel arithmetic operation with secondary operation.
Elements of each dual half-word source to the operation are selected from any of the four half-words in the two source operands a and b using the asel and bsel modifiers.
The selected half-words are then operated on in parallel.
The results are optionally clamped to the appropriate range determined by the destination type (signed or unsigned). Saturation cannot be used with the secondary accumulate operation.
For instructions with a secondary SIMD merge operation:
For half-word positions indicated in mask, the selected half-word results are copied into destination d. For all other positions, the corresponding half-word from source operand c is copied to d.
For instructions with a secondary accumulate operation:
For half-word positions indicated in mask, the selected half-word results are added to operand c, producing a result in d.
Semantics
// extract pairs of half-words and sign- or zero-extend
// based on operand type
Va = extractAndSignExt_2( a, b, .asel, .atype );
Vb = extractAndSignExt_2( a, b, .bsel, .btype );
Vc = extractAndSignExt_2( c );

for (i=0; i<2; i++) {
    switch ( vop2 ) {
       case vadd2:             t[i] = Va[i] + Vb[i];
       case vsub2:             t[i] = Va[i] - Vb[i];
       case vavrg2:            if ( ( Va[i] + Vb[i] ) >= 0 ) {
                                   t[i] = ( Va[i] + Vb[i] + 1 ) >> 1;
                               } else {
                                   t[i] = ( Va[i] + Vb[i] ) >> 1;
                               }
       case vabsdiff2:         t[i] = | Va[i] - Vb[i] |;
       case vmin2:             t[i] = MIN( Va[i], Vb[i] );
       case vmax2:             t[i] = MAX( Va[i], Vb[i] );
    }
    if (.sat) {
        if ( .dtype == .s32 )  t[i] = CLAMP( t[i], S16_MAX, S16_MIN );
        else                   t[i] = CLAMP( t[i], U16_MAX, U16_MIN );
    }
}
// secondary accumulate or SIMD merge
mask = extractMaskBits( .mask );
if (.add) {
    d = c;
    for (i=0; i<2; i++) {  d += mask[i] ? t[i] : 0;  }
} else {
    d = 0;
    for (i=0; i<2; i++) {  d |= mask[i] ? t[i] : Vc[i];  }
}
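To make the merge form of the semantics above concrete, the following C sketch models vadd2 on the host. It is an illustration only, not part of the PTX specification: the function names (pick_half, clamp_i32, vadd2_merge, vadd2_s32_default), the encoding of selectors as half-word indices 0..3, and the placement of result lane i in bits 16*i through 16*i+15 of d are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical helper: read half-word idx (0..3) from the pair {b, a}.
   Assumes h0/h1 are the low/high halves of a and h2/h3 those of b. */
static int32_t pick_half(uint32_t a, uint32_t b, int idx, int is_signed) {
    uint64_t pair = ((uint64_t)b << 32) | a;
    uint16_t h = (uint16_t)(pair >> (16 * idx));
    return is_signed ? (int32_t)(int16_t)h : (int32_t)h;
}

static int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Model of vadd2 with secondary SIMD merge. asel[i] and bsel[i] give the
   half-word index feeding lane i; mask bit i picks t[i] over lane i of c. */
uint32_t vadd2_merge(uint32_t a, uint32_t b, uint32_t c,
                     const int asel[2], const int bsel[2],
                     int a_signed, int b_signed, int d_signed,
                     int sat, unsigned mask) {
    uint32_t d = 0;
    for (int i = 0; i < 2; i++) {
        int32_t t = pick_half(a, b, asel[i], a_signed)
                  + pick_half(a, b, bsel[i], b_signed);
        if (sat)
            t = d_signed ? clamp_i32(t, -32768, 32767)
                         : clamp_i32(t, 0, 65535);
        uint32_t lane = ((mask >> i) & 1u) ? ((uint32_t)t & 0xFFFFu)
                                           : ((c >> (16 * i)) & 0xFFFFu);
        d |= lane << (16 * i);
    }
    return d;
}

/* vadd2.s32.s32.s32{.sat} with the default selectors
   (.asel = .h10, .bsel = .h32). */
uint32_t vadd2_s32_default(uint32_t a, uint32_t b, uint32_t c,
                           int sat, unsigned mask) {
    const int asel[2] = {0, 1};
    const int bsel[2] = {2, 3};
    return vadd2_merge(a, b, c, asel, bsel, 1, 1, 1, sat, mask);
}
```

Under these assumptions, a = 0x7FFF0001 and b = 0x00010002 with .sat and mask .h10 give 0x7FFF0003: the upper lane (32767 + 1) saturates at S16_MAX, while the lower lane is simply 1 + 2.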
PTX ISA Notes
Introduced in PTX ISA version 3.0.
Target ISA Notes
vadd2, vsub2, vavrg2, vabsdiff2, vmin2, vmax2 require sm_30 or higher.
// SIMD instruction with secondary SIMD merge operation
vset2.atype.btype.cmp  d{.mask}, a{.asel}, b{.bsel}, c;

// SIMD instruction with secondary accumulate operation
vset2.atype.btype.cmp.add  d{.mask}, a{.asel}, b{.bsel}, c;

.atype = .btype = { .u32, .s32 };
.cmp   = { .eq, .ne, .lt, .le, .gt, .ge };
.mask  = { .h0, .h1, .h10 };  // defaults to .h10
.asel  = .bsel = { .hxy, where x,y are from { 0, 1, 2, 3 } };
   // .asel defaults to .h10
   // .bsel defaults to .h32
Description
Two-way SIMD parallel comparison with secondary operation.
Elements of each dual half-word source to the operation are selected from any of the four half-words in the two source operands a and b using the asel and bsel modifiers.
The selected half-words are then compared in parallel.
The intermediate result of the comparison is always unsigned, and therefore the half-words of destination d and operand c are also unsigned.
For instructions with a secondary SIMD merge operation:
For half-word positions indicated in mask, the selected half-word results are copied into destination d. For all other positions, the corresponding half-word from source operand c is copied to d.
For instructions with a secondary accumulate operation:
For half-word positions indicated in mask, the selected half-word results are added to operand c, producing a result in d.
Semantics
// extract pairs of half-words and sign- or zero-extend
// based on operand type
Va = extractAndSignExt_2( a, b, .asel, .atype );
Vb = extractAndSignExt_2( a, b, .bsel, .btype );
Vc = extractAndSignExt_2( c );
for (i=0; i<2; i++) {
    t[i] = compare( Va[i], Vb[i], .cmp ) ? 1 : 0;
}
// secondary accumulate or SIMD merge
mask = extractMaskBits( .mask );
if (.add) {
    d = c;
    for (i=0; i<2; i++) {  d += mask[i] ? t[i] : 0;  }
} else {
    d = 0;
    for (i=0; i<2; i++) {  d |= mask[i] ? t[i] : Vc[i];  }
}
Four-way SIMD parallel arithmetic operation with secondary operation.
Elements of each quad byte source to the operation are selected from any of the eight bytes in the two source operands a and b using the asel and bsel modifiers.
The selected bytes are then operated on in parallel.
The results are optionally clamped to the appropriate range determined by the destination type (signed or unsigned). Saturation cannot be used with the secondary accumulate operation.
For instructions with a secondary SIMD merge operation:
For byte positions indicated in mask, the selected byte results are copied into destination d. For all other positions, the corresponding byte from source operand c is copied to d.
For instructions with a secondary accumulate operation:
For byte positions indicated in mask, the selected byte results are added to operand c, producing a result in d.
Semantics
// extract quads of bytes and sign- or zero-extend
// based on operand type
Va = extractAndSignExt_4( a, b, .asel, .atype );
Vb = extractAndSignExt_4( a, b, .bsel, .btype );
Vc = extractAndSignExt_4( c );
for (i=0; i<4; i++) {
    switch ( vop4 ) {
        case vadd4:            t[i] = Va[i] + Vb[i];
        case vsub4:            t[i] = Va[i] - Vb[i];
        case vavrg4:           if ( ( Va[i] + Vb[i] ) >= 0 ) {
                                   t[i] = ( Va[i] + Vb[i] + 1 ) >> 1;
                               } else {
                                   t[i] = ( Va[i] + Vb[i] ) >> 1;
                               }
        case vabsdiff4:        t[i] = | Va[i] - Vb[i] |;
        case vmin4:            t[i] = MIN( Va[i], Vb[i] );
        case vmax4:            t[i] = MAX( Va[i], Vb[i] );
    }
    if (.sat) {
        if ( .dtype == .s32 )  t[i] = CLAMP( t[i], S8_MAX, S8_MIN );
        else                   t[i] = CLAMP( t[i], U8_MAX, U8_MIN );
    }
}
// secondary accumulate or SIMD merge
mask = extractMaskBits( .mask );
if (.add) {
    d = c;
    for (i=0; i<4; i++) {  d += mask[i] ? t[i] : 0;  }
} else {
    d = 0;
    for (i=0; i<4; i++) {  d |= mask[i] ? t[i] : Vc[i];  }
}
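The accumulate form is easiest to see with vabsdiff4, which sums per-byte absolute differences into c. The C sketch below models vabsdiff4.u32.u32.u32.add with the default selectors, under the assumption that lane i compares byte i of a with byte i of b. The function name vabsdiff4_u32_add is hypothetical and exists only for this example.

```c
#include <stdint.h>

/* Model of vabsdiff4.u32.u32.u32.add d{.mask}, a, b, c with default
   selectors: lane i compares byte i of a against byte i of b, and the
   lanes enabled in mask are accumulated onto c. */
uint32_t vabsdiff4_u32_add(uint32_t a, uint32_t b, uint32_t c, unsigned mask) {
    uint32_t d = c;
    for (int i = 0; i < 4; i++) {
        int32_t va = (int32_t)((a >> (8 * i)) & 0xFFu);   /* zero-extended */
        int32_t vb = (int32_t)((b >> (8 * i)) & 0xFFu);
        int32_t t = va > vb ? va - vb : vb - va;          /* |va - vb| */
        if ((mask >> i) & 1u)
            d += (uint32_t)t;
    }
    return d;
}
```

For a = 0x01020304, b = 0x04030201 and c = 10, the four byte differences are 3, 1, 1 and 3, so the full mask yields 18 and a mask covering only bytes 0 and 1 yields 14.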
PTX ISA Notes
Introduced in PTX ISA version 3.0.
Target ISA Notes
vadd4, vsub4, vavrg4, vabsdiff4, vmin4, vmax4 require sm_30 or higher.
// SIMD instruction with secondary SIMD merge operation
vset4.atype.btype.cmp  d{.mask}, a{.asel}, b{.bsel}, c;

// SIMD instruction with secondary accumulate operation
vset4.atype.btype.cmp.add  d{.mask}, a{.asel}, b{.bsel}, c;

.atype = .btype = { .u32, .s32 };
.cmp   = { .eq, .ne, .lt, .le, .gt, .ge };
.mask  = { .b0,
           .b1, .b10,
           .b2, .b20, .b21, .b210,
           .b3, .b30, .b31, .b310, .b32, .b320, .b321, .b3210 };
   // defaults to .b3210
.asel  = .bsel = .bxyzw, where x,y,z,w are from { 0, ..., 7 };
   // .asel defaults to .b3210
   // .bsel defaults to .b7654
Description
Four-way SIMD parallel comparison with secondary operation.
Elements of each quad byte source to the operation are selected from any of the eight bytes in the two source operands a and b using the asel and bsel modifiers.
The selected bytes are then compared in parallel.
The intermediate result of the comparison is always unsigned, and therefore the bytes of destination d and operand c are also unsigned.
For instructions with a secondary SIMD merge operation:
For byte positions indicated in mask, the selected byte results are copied into destination d. For all other positions, the corresponding byte from source operand c is copied to d.
For instructions with a secondary accumulate operation:
For byte positions indicated in mask, the selected byte results are added to operand c, producing a result in d.
Semantics
// extract quads of bytes and sign- or zero-extend
// based on operand type
Va = extractAndSignExt_4( a, b, .asel, .atype );
Vb = extractAndSignExt_4( a, b, .bsel, .btype );
Vc = extractAndSignExt_4( c );
for (i=0; i<4; i++) {
    t[i] = compare( Va[i], Vb[i], .cmp ) ? 1 : 0;
}
// secondary accumulate or SIMD merge
mask = extractMaskBits( .mask );
if (.add) {
    d = c;
    for (i=0; i<4; i++) {  d += mask[i] ? t[i] : 0;  }
} else {
    d = 0;
    for (i=0; i<4; i++) {  d |= mask[i] ? t[i] : Vc[i];  }
}
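As a rough illustration of the merge form, the C sketch below models vset4.u32.u32.cmp with the default selectors. The names cmp_op, compare_u8 and vset4_u32_merge are hypothetical, and the example assumes that result lane i occupies byte i of d, which the pseudocode above leaves implicit.

```c
#include <stdint.h>

typedef enum { CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE } cmp_op;

static int compare_u8(int32_t x, int32_t y, cmp_op op) {
    switch (op) {
        case CMP_EQ: return x == y;
        case CMP_NE: return x != y;
        case CMP_LT: return x <  y;
        case CMP_LE: return x <= y;
        case CMP_GT: return x >  y;
        case CMP_GE: return x >= y;
    }
    return 0;
}

/* Model of vset4.u32.u32.cmp d{.mask}, a, b, c (merge form) with default
   selectors: each enabled byte of d becomes 1 or 0, and each disabled
   byte is taken from the matching byte of c. */
uint32_t vset4_u32_merge(uint32_t a, uint32_t b, uint32_t c,
                         cmp_op op, unsigned mask) {
    uint32_t d = 0;
    for (int i = 0; i < 4; i++) {
        int32_t va = (int32_t)((a >> (8 * i)) & 0xFFu);
        int32_t vb = (int32_t)((b >> (8 * i)) & 0xFFu);
        uint32_t t = compare_u8(va, vb, op) ? 1u : 0u;
        uint32_t lane = ((mask >> i) & 1u) ? t : ((c >> (8 * i)) & 0xFFu);
        d |= lane << (8 * i);
    }
    return d;
}
```

With a = 0x11223344 and b = 0x11003344, an .eq comparison over all four bytes gives 0x01000101, because only byte 2 differs.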
Suspend the thread for an approximate delay given in nanoseconds.
Syntax
nanosleep.u32 t;
Description
Suspends the thread for a sleep duration approximately close to the delay t, specified in nanoseconds. t may be a register or an immediate value.
The sleep duration is approximated, but guaranteed to be in the interval [0, 2*t]. The maximum sleep duration is 1 millisecond. The implementation may reduce the sleep duration for individual threads within a warp such that all sleeping threads in the warp wake up together.
pmevent       a;    // trigger a single performance monitor event
pmevent.mask  a;    // trigger one or more performance monitor events
Description
Triggers one or more of a fixed number of performance monitor events, with event index or mask specified by immediate operand a.
pmevent (without modifier .mask) triggers a single performance monitor event indexed by immediate operand a, in the range 0..15.
pmevent.mask triggers one or more of the performance monitor events. Each bit in the 16-bit immediate operand a controls an event.
Programmatic performance monitor events may be combined with other hardware events using Boolean functions to increment one of the four performance counters. The relationship between events and counters is programmed via API calls from the host.
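The relationship between the index form and the mask form can be sketched in C as follows. This assumes that bit i of the 16-bit immediate corresponds to event index i, which is the natural reading of the description but is not spelled out there. The helper names pmevent_bit and pmevent_mask_triggers are hypothetical.

```c
#include <stdint.h>

/* Mask bit for one event index, or 0 if the index is outside 0..15.
   Assumes bit i of the pmevent.mask immediate controls event i. */
uint16_t pmevent_bit(unsigned index) {
    return index < 16u ? (uint16_t)(1u << index) : (uint16_t)0;
}

/* Returns 1 if a pmevent.mask immediate would trigger the given event. */
int pmevent_mask_triggers(uint16_t mask, unsigned index) {
    return index < 16u ? (int)((mask >> index) & 1u) : 0;
}
```

For example, combining events 0, 3 and 15 with bitwise OR produces the immediate 0x8009 for pmevent.mask.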
Notes
Currently, there are sixteen performance monitor events, numbered 0 through 15.
setmaxnreg provides a hint to the system to update the maximum number of per-thread registers owned by the executing warp to the value specified by the imm-reg-count operand.
Qualifier .dec is used to release extra registers such that the absolute per-thread maximum register count is reduced from its current value to imm-reg-count. Qualifier .inc is used to request additional registers such that the absolute per-thread maximum register count is increased from its current value to imm-reg-count.
A pool of available registers is maintained per-CTA. Register adjustments requested by the setmaxnreg instructions are handled by supplying extra registers from this pool to the requesting warp or by releasing extra registers from the requesting warp to this pool, depending upon the value of the .action qualifier.
The setmaxnreg.inc instruction blocks execution until enough registers are available in the CTA’s register pool. After the setmaxnreg.inc instruction obtains new registers from the CTA pool, the initial contents of the new registers are undefined. The new registers must be initialized before they are used.
The same setmaxnreg instruction must be executed by all warps in a warpgroup. After executing a setmaxnreg instruction, all warps in the warpgroup must synchronize explicitly before executing subsequent setmaxnreg instructions. If a setmaxnreg instruction is not executed by all warps in the warpgroup, then the behavior is undefined.
Operand imm-reg-count is an integer constant. The value of imm-reg-count must be in the range 24 to 256 (both inclusive) and must be a multiple of 8.
Changes to the register file of the warp always happen at the tail-end of the register file.
The setmaxnreg instruction requires that the kernel has been launched with a valid value of maximum number of per-thread registers specified via the appropriate compile-time option or the appropriate performance tuning directive. Otherwise, the setmaxnreg instruction may have no effect.
When qualifier .dec is specified, the maximum number of per-thread registers owned by the warp prior to the execution of the setmaxnreg instruction should be greater than or equal to imm-reg-count. Otherwise, the behavior is undefined.
When qualifier .inc is specified, the maximum number of per-thread registers owned by the warp prior to the execution of the setmaxnreg instruction should be less than or equal to imm-reg-count. Otherwise, the behavior is undefined.
The mandatory .sync qualifier indicates that the setmaxnreg instruction causes the executing thread to wait until all threads in the warp execute the same setmaxnreg instruction before resuming execution.
The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same setmaxnreg instruction. In conditionally executed code, a setmaxnreg instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically; otherwise the behavior is undefined.
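The numeric constraints on imm-reg-count and on the .inc and .dec qualifiers can be collected into a small host-side check. The C sketch below is illustrative only: the names setmaxnreg_action, imm_reg_count_valid and setmaxnreg_request_defined are hypothetical, and the check covers only the arithmetic preconditions stated above, not the warpgroup synchronization or launch-configuration requirements.

```c
#include <stdbool.h>

typedef enum { SETMAXNREG_INC, SETMAXNREG_DEC } setmaxnreg_action;

/* imm-reg-count must lie in 24..256 (inclusive) and be a multiple of 8. */
bool imm_reg_count_valid(unsigned n) {
    return n >= 24u && n <= 256u && (n % 8u) == 0u;
}

/* .dec requires current >= requested; .inc requires current <= requested.
   Anything else is described as undefined behavior. */
bool setmaxnreg_request_defined(setmaxnreg_action action,
                                unsigned current, unsigned requested) {
    if (!imm_reg_count_valid(requested))
        return false;
    return action == SETMAXNREG_DEC ? current >= requested
                                    : current <= requested;
}
```

A code generator could run such a check before emitting the instruction, for instance to reject a .dec request to 128 registers from a warp that currently owns only 64.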
PTX ISA Notes
Introduced in PTX ISA version 8.0.
Target ISA Notes
Supported on the following architectures:
sm_90a
sm_100a
sm_101a (Renamed to sm_110a from PTX ISA version 9.0)
sm_120a
Also supported on the following family-specific architectures from PTX ISA version 8.8:
sm_100f or higher in the same family
sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)
.sreg .v4 .u32 %tid;                  // thread id vector
.sreg .u32 %tid.x, %tid.y, %tid.z;    // thread id components
Description
A predefined, read-only, per-thread special register initialized with the thread identifier within the CTA. The %tid special register contains a 1D, 2D, or 3D vector to match the CTA shape; the %tid value in unused dimensions is 0. The fourth element is unused and always returns zero. The number of threads in each dimension is specified by the predefined special register %ntid.
Every thread in the CTA has a unique %tid.
%tid component values range from 0 through %ntid-1 in each CTA dimension.
%tid.y == %tid.z == 0 in 1D CTAs. %tid.z == 0 in 2D CTAs.
Introduced in PTX ISA version 1.0 with type .v4.u16.
Redefined as type .v4.u32 in PTX ISA version 2.0. For compatibility with legacy PTX code, 16-bit mov and cvt instructions may be used to read the lower 16 bits of each component of %tid.
Target ISA Notes
Supported on all target architectures.
Examples
mov.u32      %r1,%tid.x;  // move tid.x to %r1

// legacy code accessing 16-bit components of %tid
mov.u16      %rh,%tid.x;
cvt.u32.u16  %r2,%tid.z;  // zero-extend tid.z to %r2
A predefined, read-only special register initialized with the number of thread ids in each CTA dimension. The %ntid special register contains a 3D CTA shape vector that holds the CTA dimensions. CTA dimensions are non-zero; the fourth element is unused and always returns zero. The total number of threads in a CTA is (%ntid.x * %ntid.y * %ntid.z).
%ntid.y == %ntid.z == 1 in 1D CTAs. %ntid.z == 1 in 2D CTAs.
Introduced in PTX ISA version 1.0 with type .v4.u16.
Redefined as type .v4.u32 in PTX ISA version 2.0. For compatibility with legacy PTX code, 16-bit mov and cvt instructions may be used to read the lower 16 bits of each component of %ntid.
Target ISA Notes
Supported on all target architectures.
Examples
// compute unified thread id for 2D CTA
mov.u32     %r0,%tid.x;
mov.u32     %h1,%tid.y;
mov.u32     %h2,%ntid.x;
mad.lo.u32  %r0,%h1,%h2,%r0;

mov.u16     %rh,%ntid.x;   // legacy code
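The multiply-add in the example above computes tid.y * ntid.x + tid.x. The following C sketch expresses the same arithmetic on plain integers so it can be checked on the host. The function names are hypothetical, and the 3D variant is an extension assumed for this example rather than something stated in the text.

```c
#include <stdint.h>

/* Unified (linear) thread id within a 2D CTA: tid.y * ntid.x + tid.x.
   Unsigned arithmetic keeps the low 32 bits, as a 32-bit multiply-add does. */
uint32_t unified_tid_2d(uint32_t tid_x, uint32_t tid_y, uint32_t ntid_x) {
    return tid_y * ntid_x + tid_x;
}

/* Assumed 3D extension, with x varying fastest and z slowest. */
uint32_t unified_tid_3d(uint32_t tid_x, uint32_t tid_y, uint32_t tid_z,
                        uint32_t ntid_x, uint32_t ntid_y) {
    return (tid_z * ntid_y + tid_y) * ntid_x + tid_x;
}
```

For a CTA with ntid.x = 16, the thread at (tid.x, tid.y) = (3, 2) has unified id 2 * 16 + 3 = 35.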
A predefined, read-only special register that returns the thread’s warp identifier. The warp identifier provides a unique warp number within a CTA but not across CTAs within a grid. The warp identifier will be the same for all threads within a single warp.
Note that %warpid returns the location of a thread at the moment when read, but its value may change during execution, e.g., due to rescheduling of threads following preemption. For this reason, %ctaid and %tid should be used to compute a virtual warp index if such a value is needed in kernel code; %warpid is intended mainly to enable profiling and diagnostic code to sample and log information such as work place mapping and load distribution.
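One way to derive a stable virtual warp index from %tid and %ctaid is sketched below in C. It rests on two assumptions that are not stated in this section: a warp size of 32 threads, and threads being grouped into warps in order of their linear thread id with x varying fastest. All function and parameter names are hypothetical.

```c
#include <stdint.h>

#define WARP_SIZE 32u   /* assumption: 32 threads per warp */

/* Virtual warp index within the CTA, from the linear thread id. */
uint32_t virtual_warp_in_cta(uint32_t tid_x, uint32_t tid_y, uint32_t tid_z,
                             uint32_t ntid_x, uint32_t ntid_y) {
    uint32_t linear = (tid_z * ntid_y + tid_y) * ntid_x + tid_x;
    return linear / WARP_SIZE;
}

/* Virtual warp index across the grid, given a linearized CTA id
   (computed from %ctaid and %nctaid) and the CTA's thread count. */
uint32_t virtual_warp_in_grid(uint32_t cta_linear, uint32_t threads_per_cta,
                              uint32_t warp_in_cta) {
    uint32_t warps_per_cta = (threads_per_cta + WARP_SIZE - 1u) / WARP_SIZE;
    return cta_linear * warps_per_cta + warp_in_cta;
}
```

Unlike %warpid, a value computed this way depends only on the launch configuration and the thread's coordinates, so it does not change if the thread is rescheduled.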
.sreg .v4 .u32 %ctaid;                      // CTA id vector
.sreg .u32 %ctaid.x, %ctaid.y, %ctaid.z;    // CTA id components
Description
A predefined, read-only special register initialized with the CTA identifier within the CTA grid. The %ctaid special register contains a 1D, 2D, or 3D vector, depending on the shape and rank of the CTA grid. The fourth element is unused and always returns zero.
Introduced in PTX ISA version 1.0 with type .v4.u16.
Redefined as type .v4.u32 in PTX ISA version 2.0. For compatibility with legacy PTX code, 16-bit mov and cvt instructions may be used to read the lower 16 bits of each component of %ctaid.
A predefined, read-only special register initialized with the number of CTAs in each grid dimension. The %nctaid special register contains a 3D grid shape vector, with each element having a value of at least 1. The fourth element is unused and always returns zero.
Maximum values of %nctaid.{x,y,z} are as follows:
| .target architecture | %nctaid.x | %nctaid.y | %nctaid.z |
|---|---|---|---|
| sm_1x, sm_20 | 65535 | 65535 | 65535 |
| sm_3x, sm_5x, sm_6x, sm_7x, sm_8x, sm_9x, sm_10x, sm_12x | 2^31 - 1 | 65535 | 65535 |
PTX ISA Notes
Introduced in PTX ISA version 1.0 with type .v4.u16.
Redefined as type .v4.u32 in PTX ISA version 2.0. For compatibility with legacy PTX code, 16-bit mov and cvt instructions may be used to read the lower 16 bits of each component of %nctaid.
A predefined, read-only special register that returns the processor (SM) identifier on which a particular thread is executing. The SM identifier ranges from 0 to %nsmid-1. The SM identifier numbering is not guaranteed to be contiguous.
Notes
Note that %smid returns the location of a thread at the moment when read, but its value may change during execution, e.g., due to rescheduling of threads following preemption. %smid is intended mainly to enable profiling and diagnostic code to sample and log information such as work place mapping and load distribution.
A predefined, read-only special register that returns the maximum number of SM identifiers. The SM identifier numbering is not guaranteed to be contiguous, so %nsmid may be larger than the physical number of SMs in the device.
A predefined, read-only special register initialized with the per-grid temporal grid identifier. The %gridid is used by debuggers to distinguish CTAs and clusters within concurrent (small) grids.
During execution, repeated launches of programs may occur, where each launch starts a grid-of-CTAs. This variable provides the temporal grid launch number for this context.
For sm_1x targets, %gridid is limited to the range [0..2^16-1]. For sm_20, %gridid is limited to the range [0..2^32-1]. sm_30 supports the entire 64-bit range.
PTX ISA Notes
Introduced in PTX ISA version 1.0 as type .u16.
Redefined as type .u32 in PTX ISA version 1.3.
Redefined as type .u64 in PTX ISA version 3.0.
For compatibility with legacy PTX code, 16-bit and 32-bit mov and cvt instructions may be used to read the lower 16 bits or 32 bits of %gridid.
Target ISA Notes
Supported on all target architectures.
Examples
mov.u64  %s, %gridid;  // 64-bit read of %gridid
mov.u32  %r, %gridid;  // legacy code with 32-bit %gridid
A predefined, read-only special register initialized with the cluster identifier in a grid in each dimension. Each cluster in a grid has a unique identifier.
The %clusterid special register contains a 1D, 2D, or 3D vector, depending upon the shape and rank of the cluster. The fourth element is unused and always returns zero.
A predefined, read-only special register initialized with the number of clusters in each grid dimension.
The %nclusterid special register contains a 3D grid shape vector that holds the grid dimensions in terms of clusters. The fourth element is unused and always returns zero.
Refer to the CUDA Programming Guide for details on the maximum values of %nclusterid.{x,y,z}.
A predefined, read-only special register initialized with the CTA identifier in a cluster in each dimension. Each CTA in a cluster has a unique CTA identifier.
The %cluster_ctaid special register contains a 1D, 2D, or 3D vector, depending upon the shape of the cluster. The fourth element is unused and always returns zero.
A predefined, read-only special register initialized with the number of CTAs in a cluster in each dimension.
The %cluster_nctaid special register contains a 3D grid shape vector that holds the cluster dimensions in terms of CTAs. The fourth element is unused and always returns zero.
Refer to the CUDA Programming Guide for details on the maximum values of %cluster_nctaid.{x,y,z}.
32-bit mask with bits set in positions less than or equal to the thread’s lane number in the warp.
Syntax (predefined)
.sreg .u32 %lanemask_le;
Description
A predefined, read-only special register initialized with a 32-bit mask with bits set in positions less than or equal to the thread’s lane number in the warp.
32-bit mask with bits set in positions greater than or equal to the thread’s lane number in the warp.
Syntax (predefined)
.sreg .u32 %lanemask_ge;
Description
A predefined, read-only special register initialized with a 32-bit mask with bits set in positions greater than or equal to the thread’s lane number in the warp.
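The values held by %lanemask_le and %lanemask_ge follow directly from the lane number. The C sketch below computes them on the host for a lane number in the range 0..31; the function names are hypothetical and serve only to illustrate the bit patterns.

```c
#include <stdint.h>

/* Bits set in positions <= lane. Assumes 0 <= lane <= 31. */
uint32_t lanemask_le(unsigned lane) {
    return 0xFFFFFFFFu >> (31u - lane);
}

/* Bits set in positions >= lane. Assumes 0 <= lane <= 31. */
uint32_t lanemask_ge(unsigned lane) {
    return (uint32_t)(0xFFFFFFFFu << lane);
}
```

For lane 4 these give 0x0000001F and 0xFFFFFFF0 respectively, and the two masks for any lane overlap in exactly one bit, the thread's own lane.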
A set of 32 predefined read-only registers used to capture the execution environment of a PTX program outside of the PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain CTA-wide or grid-wide values.
Precise semantics of these registers are defined in the driver documentation.
Special registers intended for use by NVIDIA tools. The behavior is target-specific and may change or be removed in future GPUs. When JIT-compiled to other targets, the value of these registers is unspecified.
These are predefined, read-only special registers containing information about the shared memory region which is reserved for the NVIDIA system software use. This region of shared memory is not available to users, and accessing this region from user code results in undefined behavior. Refer to the CUDA Programming Guide for details.
Total size of shared memory used by a CTA of a kernel.
Syntax (predefined)
.sreg .u32 %total_smem_size;
Description
A predefined, read-only special register initialized with the total size of shared memory allocated (statically and dynamically, excluding the shared memory reserved for the NVIDIA system software use) for the CTA of a kernel at launch time.
Size is returned in multiples of the shared memory allocation unit size supported by the target architecture.
Total size of shared memory used by a CTA of a kernel.
Syntax (predefined)
.sreg .u32 %aggr_smem_size;
Description
A predefined, read-only special register initialized with the total aggregated size of shared memory, consisting of the size of user shared memory allocated (statically and dynamically) at launch time and the size of the shared memory region which is reserved for the NVIDIA system software use.
An identifier for the currently executing CUDA device graph.
Syntax (predefined)
.sreg .u64 %current_graph_exec;
Description
A predefined, read-only special register initialized with the identifier referring to the CUDA device graph being currently executed. This register is 0 if the executing kernel is not part of a CUDA device graph.
Refer to the CUDA Programming Guide for more details on CUDA device graphs.
The following directives declare the PTX ISA version of the code in the module, the target architecture for which the code was generated, and the size of addresses within the PTX module.
The major number is incremented when there are incompatible changes to the PTX language, such as changes to the syntax or semantics. The version major number is used by the PTX compiler to ensure correct execution of legacy PTX code.
The minor number is incremented when new features are added to PTX.
Semantics
Indicates that this module must be compiled with tools that support an equal or greater version number.
Each PTX module must begin with a .version directive, and no other .version directive is allowed anywhere else within the module.
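The "equal or greater version number" rule can be read as an ordered comparison on the major and minor numbers. The C sketch below shows that reading; it is an interpretation for illustration rather than the behavior of any particular tool, and the function name tool_supports_version is hypothetical.

```c
#include <stdbool.h>

/* Returns true if a tool supporting PTX ISA tool_major.tool_minor can
   accept a module declaring .version mod_major.mod_minor, comparing the
   major number first and the minor number second. */
bool tool_supports_version(unsigned tool_major, unsigned tool_minor,
                           unsigned mod_major, unsigned mod_minor) {
    if (tool_major != mod_major)
        return tool_major > mod_major;
    return tool_minor >= mod_minor;
}
```

Under this reading, a tool that supports version 8.0 accepts a module declaring .version 7.8, while a tool that supports only 7.8 rejects a module declaring .version 8.0.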
Specifies the set of features in the target architecture for which the current PTX code was generated. In general, generations of SM architectures follow an onion layer model, where each generation adds new features and retains all features of previous generations. The onion layer model allows the PTX code generated for a given target to be run on later generation devices.
Target architectures with suffix “a”, such as sm_90a, include architecture-specific features that are supported on the specified architecture only, hence such targets do not follow the onion layer model. Therefore, PTX code generated for such targets cannot be run on later generation devices. Architecture-specific features can only be used with targets that support these features.
Target architectures with suffix “f”, such as sm_100f, include family-specific features that are supported only within the same architecture family. Therefore, PTX code generated for such targets can run only on later generation devices in the same family. Family-specific features can be used with f-targets as well as a-targets of later generation devices in the same family.
Each PTX module must begin with a .version directive, immediately followed by a .target directive containing a target architecture and optional platform options. A .target directive specifies a single target architecture, but subsequent .target directives can be used to change the set of target features allowed during parsing. A program with multiple .target directives will compile and run only on devices that support all features of the highest-numbered architecture listed in the program.
PTX features are checked against the specified target architecture, and an error is generated if an unsupported feature is used. The following table summarizes the features in PTX that vary according to target architecture.
| Target | Description |
|---|---|
| sm_120 | Baseline feature set for sm_120 architecture. |
| sm_120f | Adds support for sm_120f family-specific features. |
| sm_120a | Adds support for sm_120a architecture-specific features. |
| sm_121 | Baseline feature set for sm_121 architecture. |
| sm_121f | Adds support for sm_121f family-specific features. |
| sm_121a | Adds support for sm_121a architecture-specific features. |
| sm_110 | Baseline feature set for sm_110 architecture. |
| sm_110f | Adds support for sm_110f family-specific features. |
| sm_110a | Adds support for sm_110a architecture-specific features. |
| sm_100 | Baseline feature set for sm_100 architecture. |
| sm_100f | Adds support for sm_100f family-specific features. |
| sm_100a | Adds support for sm_100a architecture-specific features. |
| sm_101 | Baseline feature set for sm_101 architecture. (Renamed to sm_110) |
| sm_101f | Adds support for sm_101f family-specific features. (Renamed to sm_110f) |
| sm_101a | Adds support for sm_101a architecture-specific features. (Renamed to sm_110a) |
| sm_103 | Baseline feature set for sm_103 architecture. |
| sm_103f | Adds support for sm_103f family-specific features. |
| sm_103a | Adds support for sm_103a architecture-specific features. |
| sm_90 | Baseline feature set for sm_90 architecture. |
| sm_90a | Adds support for sm_90a architecture-specific features. |
| sm_80 | Baseline feature set for sm_80 architecture. |
| sm_86 | Adds support for .xorsign modifier on min and max instructions. |
| sm_87 | Baseline feature set for sm_87 architecture. |
| sm_88 | Baseline feature set for sm_88 architecture. |
| sm_89 | Baseline feature set for sm_89 architecture. |
| sm_70 | Baseline feature set for sm_70 architecture. |
| sm_72 | Adds support for integer multiplicand and accumulator matrices in wmma instructions. Adds support for cvt.pack instruction. |
| sm_75 | Adds support for sub-byte integer and single-bit multiplicand matrices in wmma instructions. Adds support for ldmatrix instruction. Adds support for movmatrix instruction. Adds support for tanh instruction. |
| sm_60 | Baseline feature set for sm_60 architecture. |
| sm_61 | Adds support for dp2a and dp4a instructions. |
| sm_62 | Baseline feature set for sm_61 architecture. |
| sm_50 | Baseline feature set for sm_50 architecture. |
| sm_52 | Baseline feature set for sm_50 architecture. |
| sm_53 | Adds support for arithmetic, comparison and texture instructions for .f16 and .f16x2 types. |
Requires map_f64_to_f32 if any .f64 instructions are used.
sm_13
Adds double-precision support, including expanded rounding modifiers.
Disallows use of map_f64_to_f32.
The texturing mode is specified for an entire module and cannot be changed within the module.
The .target debug option declares that the PTX file contains DWARF debug information, and subsequent compilation of PTX will retain information needed for source-level debugging. If the debug option is declared, an error message is generated if no DWARF information is found in the file. The debug option requires PTX ISA version 3.0 or later.
map_f64_to_f32 indicates that all double-precision instructions map to single-precision regardless of the target architecture. This enables high-level language compilers to compile programs containing type double to target devices that do not support double-precision operations. Note that .f64 storage remains as 64 bits, with only half being used by instructions converted from .f64 to .f32.
Notes
Targets of the form compute_xx are also accepted as synonyms for sm_xx targets.
Targets sm_{101,101f,101a} are renamed to targets sm_{110,110f,110a} from PTX ISA version 9.0.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Target strings sm_10 and sm_11 introduced in PTX ISA version 1.0.
Target strings sm_12 and sm_13 introduced in PTX ISA version 1.2.
Texturing mode introduced in PTX ISA version 1.5.
Target string sm_20 introduced in PTX ISA version 2.0.
Target string sm_30 introduced in PTX ISA version 3.0.
Platform option debug introduced in PTX ISA version 3.0.
Target string sm_35 introduced in PTX ISA version 3.1.
Target strings sm_32 and sm_50 introduced in PTX ISA version 4.0.
Target strings sm_37 and sm_52 introduced in PTX ISA version 4.1.
Target string sm_53 introduced in PTX ISA version 4.2.
Target strings sm_60, sm_61, and sm_62 introduced in PTX ISA version 5.0.
Target string sm_70 introduced in PTX ISA version 6.0.
Target string sm_72 introduced in PTX ISA version 6.1.
Target string sm_75 introduced in PTX ISA version 6.3.
Target string sm_80 introduced in PTX ISA version 7.0.
Target string sm_86 introduced in PTX ISA version 7.1.
Target string sm_87 introduced in PTX ISA version 7.4.
Target string sm_88 introduced in PTX ISA version 9.0.
Target string sm_89 introduced in PTX ISA version 7.8.
Target string sm_90 introduced in PTX ISA version 7.8.
Target string sm_90a introduced in PTX ISA version 8.0.
Target string sm_100 introduced in PTX ISA version 8.6.
Target string sm_100f introduced in PTX ISA version 8.8.
Target string sm_100a introduced in PTX ISA version 8.6.
Target string sm_101 introduced in PTX ISA version 8.6. (Renamed to sm_110)
Target string sm_101f introduced in PTX ISA version 8.8. (Renamed to sm_110f)
Target string sm_101a introduced in PTX ISA version 8.6. (Renamed to sm_110a)
Target string sm_103 introduced in PTX ISA version 8.8.
Target string sm_103f introduced in PTX ISA version 8.8.
Target string sm_103a introduced in PTX ISA version 8.8.
Target string sm_110 introduced in PTX ISA version 9.0.
Target string sm_110f introduced in PTX ISA version 9.0.
Target string sm_110a introduced in PTX ISA version 9.0.
Target string sm_120 introduced in PTX ISA version 8.7.
Target string sm_120f introduced in PTX ISA version 8.8.
Target string sm_120a introduced in PTX ISA version 8.7.
Target string sm_121 introduced in PTX ISA version 8.8.
Target string sm_121f introduced in PTX ISA version 8.8.
Target string sm_121a introduced in PTX ISA version 8.8.
Target ISA Notes
The .target directive is supported on all target architectures.
Examples
.target sm_10       // baseline target architecture
.target sm_13       // supports double-precision
.target sm_20, texmode_independent
.target sm_90       // baseline target architecture
.target sm_90a      // PTX using architecture-specific features
.target sm_100f     // PTX using family-specific features
Specifies the address size assumed throughout the module by the PTX code and the binary DWARF information in PTX.
Redefinition of this directive within a module is not allowed. In the presence of separate compilation, all modules must specify (or default to) the same address size.
The .address_size directive is optional, but it must immediately follow the .target directive if present within a module.
Semantics
If the .address_size directive is omitted, the address size defaults to 32.
PTX ISA Notes
Introduced in PTX ISA version 2.3.
Target ISA Notes
Supported on all target architectures.
Examples
// example directives
   .address_size 32       // addresses are 32 bit
   .address_size 64       // addresses are 64 bit

// example of directive placement within a module
   .version 2.3
   .target sm_20
   .address_size 64
...
.entry foo () {
...
}
Defines a kernel entry point name, parameters, and body for the kernel function.
Parameters are passed via .param space memory and are listed within an optional parenthesized parameter list. Parameters may be referenced by name within the kernel body and loaded into registers using ld.param{::entry} instructions.
In addition to normal parameters, opaque .texref, .samplerref, and .surfref variables may be passed as parameters. These parameters can only be referenced by name within texture and surface load, store, and query instructions and cannot be accessed via ld.param instructions.
The shape and size of the CTA executing the kernel are available in special registers.
Semantics
Specify the entry point for a kernel program.
At kernel launch, the kernel dimensions and properties are established and made available via special registers, e.g., %ntid, %nctaid, etc.
PTX ISA Notes
For PTX ISA version 1.4 and later, parameter variables are declared in the kernel parameter list. For PTX ISA versions 1.0 through 1.3, parameter variables are declared in the kernel body.
The maximum memory size supported by PTX for normal (non-opaque type) parameters is 32764 bytes. The parameter size limit varies with the PTX ISA version. The following table shows the allowed parameter size for each PTX ISA version:

PTX ISA Version                  Maximum parameter size (in bytes)
PTX ISA version 8.1 and above    32764
PTX ISA version 1.5 and above    4352
PTX ISA version 1.4 and above    256

The CUDA and OpenCL drivers support the following limits for parameter memory:

Driver    Parameter memory size
CUDA      256 bytes for sm_1x, 4096 bytes for sm_2x and higher, 32764 bytes for sm_70 and higher
OpenCL    32764 bytes for sm_70 and higher, 4352 bytes on sm_6x and lower
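The kernel entry description above can be illustrated with a minimal sketch of a complete module. The kernel name record_launch, its two parameters, and the .version and .target values are assumptions chosen for this example only. The sketch shows parameters declared in the parenthesized list, read by name with ld.param, and the CTA shape read from the %ntid special register.

```ptx
.version 7.0            // assumed version for this sketch
.target sm_70           // assumed target for this sketch
.address_size 64

// Hypothetical kernel: stores a scalar argument and the CTA width
// into a caller-provided buffer of at least two .u32 elements.
.visible .entry record_launch (
    .param .u64 out_ptr,
    .param .u32 tag
)
{
    .reg .b32 %r<3>;
    .reg .b64 %rd<3>;

    ld.param.u64        %rd1, [out_ptr];   // parameters are referenced by name
    ld.param.u32        %r1, [tag];
    cvta.to.global.u64  %rd2, %rd1;

    mov.u32             %r2, %ntid.x;      // CTA shape comes from special registers
    st.global.u32       [%rd2], %r1;
    st.global.u32       [%rd2+4], %r2;
    ret;
}
```

Both parameters together occupy 12 bytes of .param space, well below every limit listed in the tables above.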
Defines a function, including input and return parameters and optional function body.
An optional .noreturn directive indicates that the function does not return to the caller function. The .noreturn directive cannot be specified on functions which have return parameters. See the description of the .noreturn directive in Performance-Tuning Directives: .noreturn.
A .func definition with no body provides a function prototype.
The parameter lists define locally-scoped variables in the function body. Parameters must be base types in either the register or parameter state space. Parameters in register state space may be referenced directly within instructions in the function body. Parameters in .param space are accessed using ld.param{::func} and st.param{::func} instructions in the body. Parameter passing is call-by-value.
The last parameter in the parameter list may be a .param array of type .b8 with no size specified. It is used to pass an arbitrary number of parameters to the function packed into a single array object.
When calling a function with such an unsized last argument, the last argument may be omitted from the call instruction if no parameter is passed through it. Accesses to this array parameter must be within the bounds of the array. The result of an access is undefined if no array was passed, or if the access was outside the bounds of the actual array being passed.
Semantics
The PTX syntax hides all details of the underlying calling convention and ABI.
The implementation of parameter passing is left to the optimizing translator, which may use a combination of registers and stack locations to pass parameters.
Release Notes
For PTX ISA version 1.x code, parameters must be in the register state space, there is no stack, and recursion is illegal.
PTX ISA versions 2.0 and later with target sm_20 or higher allow parameters in the .param state space, implement an ABI with stack, and support recursion.
PTX ISA versions 2.0 and later with target sm_20 or higher support at most one return value.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Support for unsized array parameter introduced in PTX ISA version 6.0.
Support for .noreturn directive introduced in PTX ISA version 6.4.
Support for .attribute directive introduced in PTX ISA version 8.0.
Support for .abi_preserve and .abi_preserve_control directives introduced in PTX ISA version 9.0.
Target ISA Notes
Functions without unsized array parameter supported on all target architectures.
Unsized array parameter requires sm_30 or higher.
.noreturn directive requires sm_30 or higher.
.attribute directive requires sm_90 or higher.
.abi_preserve and .abi_preserve_control directives require sm_80 or higher.
Examples
.func (.reg .b32 rval) foo (.reg .b32 N, .reg .f64 dbl)
{
.reg .b32 localVar;

... use N, dbl;
other code;

mov.b32 rval,result;
ret;
}

...
call (fooval), foo, (val0, val1);  // return value in fooval
...

.func foo (.reg .b32 N, .reg .f64 dbl) .noreturn
{
.reg .b32 localVar;
... use N, dbl;
other code;
// no return parameter and no ret: control never returns to the caller
}
...
call foo, (val0, val1);
...

.func (.param .u32 rval) bar(.param .u32 N, .param .align 4 .b8 numbers[])
{
    .reg .b32 input0, input1;
    ld.param.b32   input0, [numbers + 0];
    ld.param.b32   input1, [numbers + 4];
    ...
    other code;
    ret;
}
...

.param .u32 N;
.param .align 4 .b8 numbers[8];
st.param.u32    [N], 2;
st.param.b32    [numbers + 0], 5;
st.param.b32    [numbers + 4], 10;
call (rval), bar, (N, numbers);
...
Declares a list of potential branch targets for a subsequent brx.idx, and associates the list with the label at the start of the line.
All control flow labels in the list must occur within the same function as the declaration.
The list of labels may use the compact, shorthand syntax for enumerating a range of labels having a common prefix, similar to the syntax described in Parameterized Variable Names.
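A minimal sketch of a branch-target list used with brx.idx follows. The kernel name dispatch, the label names, and the .version and .target values are assumptions made for illustration. The index passed in sel is assumed to be 0, 1, or 2, since it selects an entry of the three-label list.

```ptx
.version 6.0            // assumed version for this sketch
.target sm_30           // assumed target for this sketch
.address_size 64

.visible .entry dispatch (.param .u32 sel, .param .u64 out_ptr)
{
    .reg .b32 %r<3>;
    .reg .b64 %rd<3>;

    ld.param.u32        %r0, [sel];        // index into the target list
    ld.param.u64        %rd1, [out_ptr];
    cvta.to.global.u64  %rd2, %rd1;

    // The label 'targets' names the list; every listed label lives in
    // this same function.
targets: .branchtargets CASE_A, CASE_B, DONE;
    brx.idx             %r0, targets;

CASE_A:
    mov.u32             %r1, 10;
    bra                 STORE;
CASE_B:
    mov.u32             %r1, 20;
    bra                 STORE;
STORE:
    st.global.u32       [%rd2], %r1;
DONE:
    ret;
}
```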
Defines a prototype with no specific function name, and associates the prototype with a label. The prototype may then be used in indirect call instructions where there is incomplete knowledge of the possible call targets.
Parameters may have either base types in the register or parameter state spaces, or array types in parameter state space. The sink symbol '_' may be used to avoid dummy parameter names.
An optional .noreturn directive indicates that the function does not return to the caller function. The .noreturn directive cannot be specified on functions which have return parameters. See the description of the .noreturn directive in Performance-Tuning Directives: .noreturn.
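The following sketch shows one way a prototype label might be used for an indirect call. The function twice, the kernel caller, the label proto, and the .version and .target values are all assumptions for this example. The sink symbol '_' stands in for the function name and for the parameter names.

```ptx
.version 6.0            // assumed version for this sketch
.target sm_50           // assumed target for this sketch
.address_size 64

// Hypothetical callee: returns 2 * x.
.func (.param .u32 ret) twice (.param .u32 x)
{
    .reg .b32 %r<2>;
    ld.param.u32    %r0, [x];
    add.u32         %r1, %r0, %r0;
    st.param.u32    [ret], %r1;
    ret;
}

.visible .entry caller (.param .u64 out_ptr)
{
    .reg .b32 %r<1>;
    .reg .b64 %rd<4>;
    .param .u32 arg;
    .param .u32 res;

    ld.param.u64        %rd1, [out_ptr];
    cvta.to.global.u64  %rd2, %rd1;
    mov.u64             %rd3, twice;       // call target held in a register

    st.param.u32        [arg], 21;
    // One .u32 return value, one .u32 input, no function name.
proto: .callprototype (.param .u32 _) _ (.param .u32 _);
    call (res), %rd3, (arg), proto;
    ld.param.u32        %r0, [res];
    st.global.u32       [%rd2], %r0;
    ret;
}
```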
To provide a mechanism for low-level performance tuning, PTX supports the following directives, which pass information to the optimizing backend compiler.
.maxnreg
.maxntid
.reqntid
.minnctapersm
.maxnctapersm (deprecated)
.pragma
.abi_preserve
.abi_preserve_control
The .maxnreg directive specifies the maximum number of registers to be allocated to a single thread; the .maxntid directive specifies the maximum number of threads in a thread block (CTA); the .reqntid directive specifies the required number of threads in a thread block (CTA); and the .minnctapersm directive specifies a minimum number of thread blocks to be scheduled on a single multiprocessor (SM). These can be used, for example, to throttle the resource requirements (e.g., registers) to increase total thread count and provide a greater opportunity to hide memory latency. The .minnctapersm directive can be used together with either the .maxntid or .reqntid directive to trade off registers-per-thread against multiprocessor utilization without needing to directly specify a maximum number of registers. This may achieve better performance when compiling PTX for multiple devices having different numbers of registers per SM.
Device function directives .abi_preserve and .abi_preserve_control specify the number of data and control registers from the callee-save registers that a function must preserve for its caller. This can be considered to be the number of general purpose and control registers live in the caller when the function is called. Control registers refer to the number of divergent program points that happen in the call tree leading to the current function call.
Currently, the .maxnreg, .maxntid, .reqntid, and .minnctapersm directives may be applied per-entry and must appear between an .entry directive and its body. The directives take precedence over any module-level constraints passed to the optimizing backend. A warning message is generated if the directives’ constraints are inconsistent or cannot be met for the specified target device.
A general .pragma directive is supported for passing information to the PTX backend. The directive passes a list of strings to the backend, and the strings have no semantics within the PTX virtual machine model. The interpretation of .pragma values is determined by the backend implementation and is beyond the scope of the PTX ISA. Note that .pragma directives may appear at module (file) scope, at entry-scope, or as statements within a kernel or device function body.
Maximum number of registers that can be allocated per thread.
Syntax
.maxnreg n
Description
Declare the maximum number of registers per thread in a CTA.
Semantics
The compiler guarantees that this limit will not be exceeded. The actual number of registers used may be less; for example, the backend may be able to compile to fewer registers, or the maximum number of registers may be further constrained by .maxntid and .maxnctapersm.
PTX ISA Notes
Introduced in PTX ISA version 1.3.
Target ISA Notes
Supported on all target architectures.
Examples
.entry foo .maxnreg 16 { ... } // max regs per thread = 16
Maximum number of threads in the thread block (CTA).
Syntax
.maxntid nx
.maxntid nx, ny
.maxntid nx, ny, nz
Description
Declare the maximum number of threads in the thread block (CTA). This maximum is specified by giving the maximum extent of each dimension of the 1D, 2D, or 3D CTA. The maximum number of threads is the product of the maximum extent in each dimension.
Semantics
The maximum number of threads in the thread block, computed as the product of the maximum extent specified for each dimension, is guaranteed not to be exceeded in any invocation of the kernel in which this directive appears. Exceeding the maximum number of threads results in a runtime error or kernel launch failure.
Note that this directive guarantees that the total number of threads does not exceed the maximum, but does not guarantee that the limit in any particular dimension is not exceeded.
PTX ISA Notes
Introduced in PTX ISA version 1.3.
Target ISA Notes
Supported on all target architectures.
Examples
.entry foo .maxntid 256       { ... }  // max threads = 256
.entry bar .maxntid 16,16,4   { ... }  // max threads = 1024
Declare the number of threads in the thread block (CTA) by specifying the extent of each dimension of the 1D, 2D, or 3D CTA. The total number of threads is the product of the number of threads in each dimension.
Semantics
The size of each CTA dimension specified in any invocation of the kernel is required to be equal to that specified in this directive. Specifying a different CTA dimension at launch will result in a runtime error or kernel launch failure.
Notes
The .reqntid directive cannot be used in conjunction with the .maxntid directive.
PTX ISA Notes
Introduced in PTX ISA version 2.1.
Target ISA Notes
Supported on all target architectures.
Examples
.entry foo .reqntid 256       { ... }  // num threads = 256
.entry bar .reqntid 16,16,4   { ... }  // num threads = 1024
Declare the minimum number of CTAs from the kernel’s grid to be mapped to a single multiprocessor (SM).
Notes
Optimizations based on .minnctapersm need either .maxntid or .reqntid to be specified as well.
If the total number of threads on a single SM resulting from .minnctapersm and .maxntid / .reqntid exceeds the maximum number of threads supported by an SM, then the .minnctapersm directive will be ignored.
In PTX ISA version 2.1 or higher, a warning is generated if .minnctapersm is specified without specifying either .maxntid or .reqntid.
PTX ISA Notes
Introduced in PTX ISA version 2.0 as a replacement for .maxnctapersm.
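As a sketch of how .minnctapersm combines with .maxntid, the module below asks for CTAs of at most 256 threads and at least 4 resident CTAs per SM, which is 256 * 4 = 1024 threads per SM. The kernel name tuned_kernel, its body, and the .version and .target values are assumptions for illustration. Whether 1024 threads fits on a given SM depends on the target device.

```ptx
.version 7.0            // assumed version for this sketch
.target sm_70           // assumed target for this sketch
.address_size 64

// The backend may limit registers per thread so that 4 CTAs of
// up to 256 threads each can be resident on one SM.
.visible .entry tuned_kernel (.param .u64 out_ptr)
.maxntid 256
.minnctapersm 4
{
    .reg .b32 %r<2>;
    .reg .b64 %rd<5>;

    ld.param.u64        %rd1, [out_ptr];
    cvta.to.global.u64  %rd2, %rd1;
    mov.u32             %r1, %tid.x;
    mul.wide.u32        %rd3, %r1, 4;      // byte offset of element tid.x
    add.s64             %rd4, %rd2, %rd3;
    st.global.u32       [%rd4], %r1;       // out[tid.x] = tid.x
    ret;
}
```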
Declare the maximum number of CTAs from the kernel’s grid that may be mapped to a single multiprocessor (SM).
Notes
Optimizations based on .maxnctapersm generally need .maxntid to be specified as well. The optimizing backend compiler uses .maxntid and .maxnctapersm to compute an upper-bound on per-thread register usage so that the specified number of CTAs can be mapped to a single multiprocessor. However, if the number of registers used by the backend is sufficiently lower than this bound, additional CTAs may be mapped to a single multiprocessor. For this reason, .maxnctapersm has been renamed to .minnctapersm in PTX ISA version 2.0.
PTX ISA Notes
Introduced in PTX ISA version 1.3. Deprecated in PTX ISA version 2.0.
Indicate that the function does not return to its caller function.
Syntax
.noreturn
Description
Indicate that the function does not return to its caller function.
Semantics
An optional .noreturn directive indicates that the function does not return to the caller function. The .noreturn directive can only be specified on device functions and must appear between a .func directive and its body.
The directive cannot be specified on functions which have return parameters.
If a function with the .noreturn directive returns to the caller function at runtime, then the behavior is undefined.
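A minimal sketch of a device function marked .noreturn follows. The names abort_thread and check, and the .version and .target values, are assumptions for this example. The function has no return parameters and ends in trap, so control never reaches the caller again.

```ptx
.version 6.4            // assumed version for this sketch
.target sm_50           // assumed target for this sketch
.address_size 64

// Hypothetical helper that aborts instead of returning.
.func abort_thread () .noreturn
{
    trap;
}

.visible .entry check (.param .u32 flag)
{
    .reg .pred %p;
    .reg .b32  %flag_val;

    ld.param.u32    %flag_val, [flag];
    setp.eq.u32     %p, %flag_val, 0;
    @%p bra         ALL_GOOD;
    call            abort_thread;          // does not come back
ALL_GOOD:
    ret;
}
```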
Pass module-scoped, entry-scoped, or statement-level directives to the PTX backend compiler.
The .pragma directive may occur at module-scope, at entry-scope, or at statement-level.
Semantics
The interpretation of .pragma directive strings is implementation-specific and has no impact on PTX semantics. See Descriptions of .pragma Strings for descriptions of the pragma strings defined in ptxas.
PTX ISA Notes
Introduced in PTX ISA version 2.0.
Target ISA Notes
Supported on all target architectures.
Examples
.pragma "nounroll";    // disable unrolling in backend

// disable unrolling for current kernel
.entry foo .pragma "nounroll"; { ... }
Specify the number of general purpose registers that should be preserved by the callers of this function.
Syntax
.abi_preserve N
Description
It is an architecture-agnostic value specifying the actual number of general purpose registers. Internally, the ABI defines some general purpose registers as preserved (callee-save) registers. Integer N specifies the actual number of general purpose registers that should be preserved by the function.
The .abi_preserve directive can only be specified on device functions and must appear between a .func directive and its body.
Semantics
When this directive is specified, the compiler backend modifies low-level ABI components to ensure that the number of live data variables in the callers of this function that are stored in the callee-save registers is less than the specified value.
Specify the number of control registers that should be preserved by the callers of this function.
Syntax
.abi_preserve_control N
Description
It is an architecture-agnostic value specifying the number of divergent program points that happen in the call tree leading to the current function call. Internally, the ABI defines some control registers as preserved (callee-save) registers. Integer N specifies the actual number of control registers that should be preserved by the function.
The .abi_preserve_control directive can only be specified on device functions and must appear between a .func directive and its body.
Semantics
When this directive is specified, the compiler backend modifies low-level ABI components to ensure that the number of live control variables in the callers of this function that are stored in the callee-save control registers is less than the specified value.
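The placement of .abi_preserve and .abi_preserve_control between a .func directive and its body can be sketched as follows. The function names, the values 8 and 2, and the .target choice are assumptions for illustration; suitable values depend on the callers of each function.

```ptx
.version 9.0
.target sm_80           // assumed target for this sketch
.address_size 64

// Hypothetical leaf function: preserve 8 general purpose registers
// for its callers.
.func data_heavy_callee () .abi_preserve 8
{
    ret;
}

// Hypothetical function reached through divergent call sites:
// preserve 2 control registers for its callers.
.func divergent_callee () .abi_preserve_control 2
{
    ret;
}
```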
DWARF-format debug information is passed through PTX modules using the following directives:
@@DWARF
.section
.file
.loc
The .section directive was introduced in PTX ISA version 2.0 and replaces the @@DWARF syntax. The @@DWARF syntax was deprecated in PTX ISA version 2.0 but is supported for legacy PTX ISA version 1.x code.
Beginning with PTX ISA version 3.0, PTX files containing DWARF debug information should include the .target debug platform option. This forward declaration directs PTX compilation to retain mappings for source-level debugging.
@@DWARF dwarf-string

dwarf-string may have one of the following forms:
.byte   byte-list   // comma-separated hexadecimal byte values
.4byte  int32-list  // comma-separated hexadecimal integers in range [0..2^32-1]
.quad   int64-list  // comma-separated hexadecimal integers in range [0..2^64-1]
.4byte  label
.quad   label
PTX ISA Notes
Introduced in PTX ISA version 1.2. Deprecated as of PTX ISA version 2.0, replaced by the .section directive.
.section section_name { dwarf-lines }

dwarf-lines have the following formats:
  .b8    byte-list       // comma-separated list of integers
                         // in range [-128..255]
  .b16   int16-list      // comma-separated list of integers
                         // in range [-2^15..2^16-1]
  .b32   int32-list      // comma-separated list of integers
                         // in range [-2^31..2^32-1]
  label:                 // Define label inside the debug section
  .b64   int64-list      // comma-separated list of integers
                         // in range [-2^63..2^64-1]
  .b32   label
  .b64   label
  .b32   label+imm       // a sum of label address plus a constant integer byte
                         // offset (signed, 32bit)
  .b64   label+imm       // a sum of label address plus a constant integer byte
                         // offset (signed, 64bit)
  .b32   label1-label2   // a difference in label addresses between labels in
                         // the same dwarf section (32bit)
  .b64   label3-label4   // a difference in label addresses between labels in
                         // the same dwarf section (64bit)
PTX ISA Notes
Introduced in PTX ISA version 2.0, replaces @@DWARF syntax.
label+imm expression introduced in PTX ISA version 3.2.
Support for .b16 integers in dwarf-lines introduced in PTX ISA version 6.0.
Support for defining label inside the DWARF section is introduced in PTX ISA version 7.2.
label1-label2 expression introduced in PTX ISA version 7.5.
Negative numbers in dwarf lines introduced in PTX ISA version 7.5.
Associates a source filename with an integer index. .loc directives reference source files by index.
The .file directive allows optionally specifying an unsigned number representing the time of last modification and an unsigned integer representing the size in bytes of the source file. The timestamp and file_size values can be 0 to indicate this information is not available.
The timestamp value is in the format of the C and C++ data type time_t.
file_size is an unsigned 64-bit integer.
The .file directive is allowed only in the outermost scope, i.e., at the same level as kernel and device function declarations.
Semantics
If timestamp and file size are not specified, they default to 0.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Timestamp and file size introduced in PTX ISA version 3.2.
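A short sketch of the two forms described above follows: one .file directive with only an index and a filename, and one that also gives a timestamp and a file size. The filenames and numeric values are made up for illustration.

```ptx
// Index 1: timestamp and file size omitted, so both default to 0.
.file 1 "example.cu"

// Index 2: time_t-style timestamp, then size of the source file in bytes.
.file 2 "kernel.cu", 1339013327, 64118
```

A later .loc directive would refer to these files by the indices 1 and 2.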
Declares the source file location (source file, line number, and column position) to be associated with lexically subsequent PTX instructions. .loc refers to file_index, which is defined by a .file directive.
To indicate PTX instructions that are generated from a function that got inlined, the additional attribute .inlined_at can be specified as part of the .loc directive. The .inlined_at attribute specifies the source location at which the specified function is inlined. file_index2, line_number2, and column_position2 specify the location at which the function is inlined. The source location named in .inlined_at must already appear as the source location of a lexically preceding .loc directive.
The function_name attribute specifies an offset in the DWARF section named .debug_str. The offset is specified as a label expression or a label+immediate expression, where label is defined in the .debug_str section. The DWARF section .debug_str contains ASCII null-terminated strings that specify the name of the function that is inlined.
Note that a PTX instruction may have a single associated source location, determined by the nearest lexically preceding .loc directive, or no associated source location if there is no preceding .loc directive. Labels in PTX inherit the location of the closest lexically following instruction. A label with no following PTX instruction has no associated source location.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
function_name and inlined_at attributes are introduced in PTX ISA version 7.2.
Target ISA Notes
Supported on all target architectures.
Examples
    .loc 2 4237 0
L1:                        // line 4237, col 0 of file #2,
                           // inherited from mov
    mov.u32  %r1,%r2;      // line 4237, col 0 of file #2
    add.u32  %r2,%r1,%r3;  // line 4237, col 0 of file #2
...
L2:                        // line 4239, col 5 of file #2,
                           // inherited from sub
    .loc 2 4239 5
    sub.u32  %r2,%r1,%r3;  // line 4239, col 5 of file #2
    .loc 1 21 3
    .loc 1 9 3, function_name info_string0, inlined_at 1 21 3
    ld.global.u32   %r1, [gg]; // Function at line 9
    setp.lt.s32 %p1, %r1, 8;   // inlined at line 21
    .loc 1 27 3
    .loc 1 10 5, function_name info_string1, inlined_at 1 27 3
    .loc 1 15 3, function_name .debug_str+16, inlined_at 1 10 5
    setp.ne.s32 %p2, %r1, 18;
    @%p2 bra    BB2_3;

    .section .debug_str {
    info_string0:
     .b8 95  // _
     .b8 90  // Z
     .b8 51  // 3
     .b8 102 // f
     .b8 111 // o
     .b8 111 // o
     .b8 118 // v
     .b8 0

    info_string1:
     .b8 95  // _
     .b8 90  // Z
     .b8 51  // 3
     .b8 98  // b
     .b8 97  // a
     .b8 114 // r
     .b8 118 // v
     .b8 0
     .b8 95  // _
     .b8 90  // Z
     .b8 51  // 3
     .b8 99  // c
     .b8 97  // a
     .b8 114 // r
     .b8 118 // v
     .b8 0
    }
Declares identifier to be defined external to the current module. The module defining such an identifier must define it as .weak or .visible only once in a single object file. An extern declaration of a symbol may appear multiple times, and references to it get resolved against the single definition of that symbol.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Target ISA Notes
Supported on all target architectures.
Examples
.extern .global .b32 foo; // foo is defined in another module
Declares identifier to be globally visible. Unlike C, where identifiers are globally visible unless declared static, PTX identifiers are visible only within the current module unless declared .visible.
PTX ISA Notes
Introduced in PTX ISA version 1.0.
Target ISA Notes
Supported on all target architectures.
Examples
.visible .global .b32 foo; // foo will be externally visible
Declares identifier to be globally visible but weak. Weak symbols are similar to globally visible symbols, except during linking, weak symbols are only chosen after globally visible symbols during symbol resolution. Unlike globally visible symbols, multiple object files may declare the same weak symbol, and references to a symbol get resolved against a weak symbol only if no global symbols have the same name.
PTX ISA Notes
Introduced in PTX ISA version 3.1.
Target ISA Notes
Supported on all target architectures.
Examples
.weak .func (.reg .b32 val) foo; // foo will be externally visible
Declares identifier to be globally visible but “common”.
Common symbols are similar to globally visible symbols. However, multiple object files may declare the same common symbol, and they may have different types and sizes; references to a symbol get resolved against the common symbol with the largest size.
Only one object file can initialize a common symbol, and that must have the largest size among all other definitions of that common symbol from different object files.
The .common linking directive can be used only on variables with .global storage. It cannot be used on function symbols or on symbols with opaque type.
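The resolution rule above can be sketched with two declarations of the same common symbol that would live in two separately compiled modules. The symbol name buffer, the module file names in the comments, and the sizes are assumptions for illustration.

```ptx
// In a hypothetical module a.ptx: 16-byte declaration, uninitialized.
.common .global .align 4 .b8 buffer[16];

// In a hypothetical module b.ptx: 64-byte declaration of the same symbol.
// This is the largest one, so references to 'buffer' from either module
// would resolve against it after linking.
.common .global .align 4 .b8 buffer[64];
```

If one of the modules initialized buffer, it would have to be the one holding the 64-byte declaration.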
The following directives specify information about clusters:
.reqnctapercluster
.explicitcluster
.maxclusterrank
The .reqnctapercluster directive specifies the number of CTAs in the cluster. The .explicitcluster directive specifies that the kernel should be launched with explicit cluster details. The .maxclusterrank directive specifies the maximum number of CTAs in the cluster.
The cluster dimension directives can be applied only on kernel functions.
Set the number of thread blocks (CTAs) in the cluster by specifying the extent of each dimension of the 1D, 2D, or 3D cluster. The total number of CTAs is the product of the number of CTAs in each dimension. For kernels with the .reqnctapercluster directive specified, the runtime will use the specified values for configuring the launch if the same are not specified at launch time.
Semantics
If the cluster dimension is explicitly specified at launch time, it should be equal to the values specified in this directive. Specifying a different cluster dimension at launch will result in a runtime error or kernel launch failure.
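A minimal sketch of the 1D and 3D forms of .reqnctapercluster follows. The kernel names and the .version and .target values are assumptions for this example; the kernels are empty so that only the directive placement is shown.

```ptx
.version 7.8            // assumed version for this sketch
.target sm_90           // assumed target for this sketch
.address_size 64

// 2 CTAs per cluster along x.
.visible .entry pair_kernel ()
.reqnctapercluster 2
{
    ret;
}

// 2 x 2 x 1 = 4 CTAs per cluster.
.visible .entry quad_kernel ()
.reqnctapercluster 2, 2, 1
{
    ret;
}
```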
Declare that the kernel must be launched with cluster dimensions explicitly specified.
Syntax
.explicitcluster
Description
Declares that this kernel should be launched with cluster dimension explicitly specified.
Semantics
Kernels with the .explicitcluster directive must be launched with cluster dimension explicitly specified (either at launch time or via .reqnctapercluster); otherwise the program will fail with a runtime error or kernel launch failure.
Declare the maximum number of CTAs that can be part of the cluster.
Syntax
.maxclusterrank n
Description
Declare the maximum number of thread blocks (CTAs) allowed to be part of the cluster.
Semantics
The product of the number of CTAs in each cluster dimension specified in any invocation of the kernel is required to be less than or equal to that specified in this directive. Otherwise the invocation will result in a runtime error or kernel launch failure.
The .maxclusterrank directive cannot be used in conjunction with the .reqnctapercluster directive.
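The two directives above can be sketched as follows. The kernel names and the .version and .target values are assumptions for illustration. The first kernel must be given a cluster dimension at launch; the second accepts any launch whose cluster contains at most 8 CTAs, for example 2 x 2 x 2 or 8 x 1 x 1.

```ptx
.version 7.8            // assumed version for this sketch
.target sm_90           // assumed target for this sketch
.address_size 64

// Launch must specify the cluster dimension explicitly.
.visible .entry needs_cluster ()
.explicitcluster
{
    ret;
}

// Cluster size is chosen at launch, bounded by 8 CTAs in total.
// .reqnctapercluster is not used here because it cannot be combined
// with .maxclusterrank.
.visible .entry bounded_cluster ()
.maxclusterrank 8
{
    ret;
}
```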
Specify that CUDA thread blocks are mapped to clusters.
Syntax
.blocksareclusters
Description
The default behavior of the CUDA API is to specify the grid launch configuration by specifying the number of thread blocks and the number of threads per block.
When the .blocksareclusters directive is specified, it implies that the grid launch configuration for the corresponding .entry function is specifying the number of clusters, i.e. the launch configuration is specifying the number of clusters instead of the number of thread blocks. In this case, the number of thread blocks per cluster is specified by the .reqnctapercluster directive and the thread block size is specified with the .reqntid directive.
The .blocksareclusters directive is only allowed for .entry functions and also needs the .reqntid and .reqnctapercluster directives to be specified.
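A minimal sketch of the required combination follows. The kernel name, the sizes, and the .target value are assumptions for illustration. With these directives, a launch that asks for a grid of 4 would describe 4 clusters, each holding 2 CTAs of 128 threads.

```ptx
.version 9.0
.target sm_90           // assumed target for this sketch
.address_size 64

.visible .entry cluster_grid_kernel ()
.reqntid 128            // thread block size
.reqnctapercluster 2    // thread blocks per cluster
.blocksareclusters      // grid dimensions count clusters, not CTAs
{
    ret;
}
```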
Disable loop unrolling in the optimizing backend compiler.
Syntax
.pragma "nounroll";
Description
The "nounroll" pragma is a directive to disable loop unrolling in the optimizing backend compiler.
The "nounroll" pragma is allowed at module, entry-function, and statement levels, with the following meanings:
module scope
    disables unrolling for all loops in module, including loops preceding the .pragma.
entry-function scope
    disables unrolling for all loops in the entry function body.
statement-level pragma
    disables unrolling of the loop for which the current block is the loop header.
Note that in order to have the desired effect at statement level, the "nounroll" directive must appear before any instruction statements in the loop header basic block for the desired loop. The loop header block is defined as the block that dominates all blocks in the loop body and is the target of the loop backedge. Statement-level "nounroll" directives appearing outside of loop header blocks are silently ignored.
PTX ISA Notes
Introduced in PTX ISA version 2.0.
Target ISA Notes
Requires sm_20 or higher. Ignored for sm_1x targets.
Examples
.entry foo (...)
.pragma "nounroll";  // do not unroll any loop in this function
{
...
}

.func bar (...)
{
...
L1_head:
     .pragma "nounroll";  // do not unroll this loop
     ...
@p   bra L1_end;
L1_body:
     ...
L1_continue:
     bra L1_head;
L1_end:
     ...
}
Mask for indicating used bytes in data of ld operation.
Syntax
.pragma "used_bytes_mask mask";
Description
The "used_bytes_mask" pragma is a directive that specifies used bytes in a load operation based on the mask provided.
The "used_bytes_mask" pragma needs to be specified prior to a load instruction for which information about bytes used from the load operation is needed. The pragma is ignored if the instruction following it is not a load instruction.
For a load instruction without this pragma, all bytes from the load operation are assumed to be used.
Operand mask is a 32-bit integer with set bits indicating the used bytes in the data of the load operation.
Semantics
Each bit in the mask operand corresponds to a byte of data, where each set bit represents a used byte. The most-significant bit corresponds to the most-significant byte of data.
// For 4 bytes load with only lower 3 bytes used
.pragma "used_bytes_mask 0x7";
ld.global.u32 %r0, [gbl];     // Higher 1 byte from %r0 is unused

// For vector load of 16 bytes with lower 12 bytes used
.pragma "used_bytes_mask 0xfff";
ld.global.v4.u32 {%r0, %r1, %r2, %r3}, [gbl];  // %r3 unused
PTX ISA Notes
Introduced in PTX ISA version 8.3.
Target ISA Notes
Requires sm_50 or higher.
Examples
.pragma "used_bytes_mask 0xfff";
ld.global.v4.u32 {%r0, %r1, %r2, %r3}, [gbl]; // Only lower 12 bytes used
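As a further worked case of the bit-to-byte mapping, consider an 8-byte vector load where only the upper element is consumed. Bytes 0 to 3 land in the first register and bytes 4 to 7 in the second, so the used bytes correspond to mask bits 4 to 7, giving 0xf0. The sketch below assumes a hypothetical kernel upper_half_only and a hypothetical global array gbl.

```ptx
.version 8.3
.target sm_50
.address_size 64

.global .align 8 .b8 gbl[8];               // hypothetical 8-byte source

.visible .entry upper_half_only (.param .u64 out_ptr)
{
    .reg .b32 %r<2>;
    .reg .b64 %rd<3>;

    ld.param.u64        %rd1, [out_ptr];
    cvta.to.global.u64  %rd2, %rd1;

    // Bits 4..7 set: only bytes 4..7 (the second element) are used.
    .pragma "used_bytes_mask 0xf0";
    ld.global.v2.u32    {%r0, %r1}, [gbl]; // %r0 is loaded but unused

    st.global.u32       [%rd2], %r1;
    ret;
}
```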
The "enable_smem_spilling" pragma is a directive that enables register spilling into shared memory. During the spilling process, registers are first spilled into shared memory, and once the allocated shared memory is full, any additional spills are redirected to local memory. This can enhance performance by reducing memory access latency, since shared memory accesses are faster than local memory accesses.
The "enable_smem_spilling" pragma is only allowed within the function scope. When applied, it enables shared memory spilling for the specified function.
The usage of the pragma is valid only in certain scenarios and specific compilation modes. The usage of the pragma is disallowed under the following cases and may result in an error:
Per-function compilation mode: e.g., Separate Compilation, Device-debug, Whole program with recursive function calls, Extensible-whole-program
If launch bounds are not explicitly specified, the compiler assumes the maximum possible number of threads per CTA to estimate the shared memory allocated per CTA and the corresponding spill size. However, if the kernel is launched with fewer threads per CTA than estimated, the shared memory allocated per CTA may exceed the compiler-estimated size, thereby potentially limiting the number of CTAs that can be launched on an SM. Due to this, using the pragma without launch bounds may lead to performance regressions. Hence it is recommended to use this pragma only when launch bounds are explicitly specified.
PTX ISA Notes
Introduced in PTX ISA version 9.0.
Target ISA Notes
Requires sm_75 or higher.
Examples
.entry foo (...)
{
    ...
    .pragma "enable_smem_spilling";   // Enable shared memory spilling for this function
    ...
}
The "frequency" pragma is a directive that specifies the number of times a basic block is executed by an executing thread. The optimizing compiler backend treats this pragma as a hint which will be used for optimizations.
Operand n is a 64-bit non-negative integer constant that specifies the execution frequency.
Note that in order to have the desired effect of this pragma, it should be specified at the start of the basic block. A basic block is defined as a straight-line sequence of instructions with only one entry point and one exit point.
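A sketch of the placement rule follows, assuming the pragma string takes the form "frequency n" with n written as a decimal constant. The kernel name counted_loop, the loop itself, and the .target value are assumptions for illustration. The pragma is the first statement of the loop body block, and the value 32 matches the trip count of the loop.

```ptx
.version 9.0
.target sm_75           // assumed target for this sketch
.address_size 64

.visible .entry counted_loop (.param .u64 out_ptr)
{
    .reg .pred %p;
    .reg .b32  %r<1>;
    .reg .b64  %rd<3>;

    ld.param.u64        %rd1, [out_ptr];
    cvta.to.global.u64  %rd2, %rd1;
    mov.u32             %r0, 0;

LOOP:
    .pragma "frequency 32";                // hint: block executes 32 times per thread
    add.u32             %r0, %r0, 1;
    setp.lt.u32         %p, %r0, 32;
    @%p bra             LOOP;

    st.global.u32       [%rd2], %r0;
    ret;
}
```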
This section describes the history of change in the PTX ISA and implementation. The first section describes ISA and implementation changes in the current release of PTX ISA version 9.0, and the remaining sections provide a record of changes in previous releases of PTX ISA versions back to PTX ISA version 2.0.
PTX ISA version 9.0 introduces the following new features:
Adds support for sm_88 target architecture.
Adds support for sm_110 target architecture.
Adds support for target sm_110f that supports family-specific features.
Adds support for target sm_110a that supports architecture-specific features.
Adds support for pragma enable_smem_spilling that is used to enable shared memory spilling for a function.
Adds support for pragma frequency that is used to specify the execution frequency of a basic block.
Adds support for directive .blocksareclusters that is used to specify that CUDA thread blocks are mapped to clusters.
Extends size operand of st.bulk instruction to support 32-bit length.
Adds support for performance-tuning directives .abi_preserve and .abi_preserve_control that are used to specify the number of data and control registers that should be preserved by the callers of a function.
Notes
Targets sm_{101,101f,101a} are renamed to targets sm_{110,110f,110a} from PTX ISA version 9.0.
Semantic Changes and Clarifications
All tcgen05 instructions (tcgen05.alloc, tcgen05.dealloc, tcgen05.relinquish_alloc_permit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.sp, tcgen05.mma.ws, tcgen05.mma.ws.sp, tcgen05.commit) within a kernel must specify the same value for the .cta_group qualifier.
PTX ISA version 8.8 introduces the following new features:
Adds support for sm_103 target architecture.
Adds support for target sm_103a that supports architecture-specific features.
Adds support for sm_121 target architecture.
Adds support for target sm_121a that supports architecture-specific features.
Introduces family-specific target architectures that are represented with an “f” suffix. PTX for family-specific targets is compatible with all subsequent targets in the same family. Adds support for sm_100f, sm_101f, sm_103f, sm_120f, sm_121f.
Extends min and max instructions to support three input arguments.
Extends tcgen05.mma instruction to add support for new scale_vectorsize qualifiers .block16 and .block32 and K dimension 96.
Extends .field3 of tensormap.replace instruction to support 96B swizzle mode.
Adds support for tcgen05.ld.red instruction.
Extends ld, ld.global.nc and st instructions to support 256b load/store operations.
Table 58 shows the list of features that are supported on family-specific targets:
Table 58. List of features promoted to family-specific architecture
PTX ISA version 8.6 introduces the following new features:
Adds support for sm_100 target architecture.
Adds support for target sm_100a that supports architecture-specific features.
Adds support for sm_101 target architecture.
Adds support for target sm_101a that supports architecture-specific features.
Extends cp.async.bulk and cp.async.bulk.tensor instructions to add .shared::cta as destination state space.
Extends fence instruction to add support for .acquire and .release qualifiers.
Extends fence and fence.proxy instructions to add support for .sync_restrict qualifier.
Extends ldmatrix instruction to support .m16n16, .m8n16 shapes and .b8 type.
Extends ldmatrix instruction to support .src_fmt, .dst_fmt qualifiers.
Extends stmatrix instruction to support .m16n8 shape and .b8 type.
Adds support for clusterlaunchcontrol instruction.
Extends add, sub and fma instructions to support mixed precision floating point operations with .f32 as destination operand type and .f16/.bf16 as source operand types.
Extends add, sub, mul and fma instructions to support .f32x2 type.
Extends cvt instruction with .tf32 type to support .satfinite qualifier for .rn/.rz rounding modes.
Extends cp.async.bulk instruction to support .cp_mask qualifier and byteMask operand.
Extends multimem.ld_reduce and multimem.st instructions to support .e5m2, .e5m2x2, .e5m2x4, .e4m3, .e4m3x2 and .e4m3x4 types.
Extends cvt instruction to support conversions to/from .e2m1x2, .e3m2x2, .e2m3x2 and .ue8m0x2 types.
Extends cp.async.bulk.tensor and cp.async.bulk.prefetch.tensor instructions to support new load_mode qualifiers .tile::scatter4 and .tile::gather4.
Extends tensormap.replace instruction to add support for new qualifier .swizzle_atomicity for supporting new swizzle modes.
Extends mbarrier.arrive, mbarrier.arrive_drop, mbarrier.test_wait and mbarrier.try_wait instructions to support .relaxed qualifier.
Extends cp.async.bulk.tensor and cp.async.bulk.prefetch.tensor instructions to support new load_mode qualifiers .im2col::w and .im2col::w::128.
Extends cp.async.bulk.tensor instruction to support new qualifier .cta_group.
Adds support for st.bulk instruction.
Adds support for tcgen05 features and related instructions: tcgen05.alloc, tcgen05.dealloc, tcgen05.relinquish_alloc_permit, tcgen05.ld, tcgen05.st, tcgen05.wait, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.sp, tcgen05.mma.ws, tcgen05.mma.ws.sp, tcgen05.fence and tcgen05.commit.
Extends redux.sync instruction to add support for .f32 type with qualifiers .abs and .NaN.
PTX ISA version 8.3 introduces the following new features:
Adds support for pragma used_bytes_mask that is used to specify mask for used bytes for a load operation.
Extends isspacep, cvta.to, ld and st instructions to accept ::entry and ::func sub-qualifiers with .param state space qualifier.
Adds support for .b128 type on instructions ld, ld.global.nc, ldu, st, mov and atom.
Adds support for instructions tensormap.replace, tensormap.cp_fenceproxy and support for qualifier .to_proxykind::from_proxykind on instruction fence.proxy to support modifying tensor-map.
PTX ISA version 8.2 introduces the following new features:
Adds support for .mmio qualifier on ld and st instructions.
Extends lop3 instruction to allow predicate destination.
Extends multimem.ld_reduce instruction to support .acc::f32 qualifier to allow .f32 precision of the intermediate accumulation.
Extends the asynchronous warpgroup-level matrix multiply-and-accumulate operation wgmma.mma_async to support .sp modifier that allows matrix multiply-accumulate operation when input matrix A is sparse.
Semantic Changes and Clarifications
The .multicast::cluster qualifier on cp.async.bulk and cp.async.bulk.tensor instructions is optimized for target architecture sm_90a and may have substantially reduced performance on other targets. Hence it is advised to use .multicast::cluster with sm_90a.
PTX ISA version 8.0 introduces the following new features:
Adds support for target sm_90a that supports architecture-specific features.
Adds support for asynchronous warpgroup-level matrix multiply-and-accumulate operation wgmma.
Extends the asynchronous copy operations with bulk operations that operate on large data, including tensor data.
Introduces packed integer types .u16x2 and .s16x2.
Extends integer arithmetic instruction add to allow packed integer types .u16x2 and .s16x2.
Extends integer arithmetic instructions min and max to allow packed integer types .u16x2 and .s16x2, as well as saturation modifier .relu on .s16x2 and .s32 types.
Adds support for special register %current_graph_exec that identifies the currently executing CUDA device graph.
Adds support for elect.sync instruction.
Adds support for .unified attribute on functions and variables.
Adds support for setmaxnreg instruction.
Adds support for .sem qualifier on barrier.cluster instruction.
Extends the fence instruction to allow opcode-specific synchronization using op_restrict qualifier.
Adds support for .cluster scope on mbarrier.arrive, mbarrier.arrive_drop, mbarrier.test_wait and mbarrier.try_wait operations.
Adds support for transaction count operations on mbarrier objects, specified with .expect_tx and .complete_tx qualifiers.
PTX ISA version 7.8 introduces the following new features:
Adds support for sm_89 target architecture.
Adds support for sm_90 target architecture.
Extends bar and barrier instructions to accept optional scope qualifier .cta.
Extends .shared state space qualifier with optional sub-qualifier ::cta.
Adds support for movmatrix instruction which transposes a matrix in registers across a warp.
Adds support for stmatrix instruction which stores one or more matrices to shared memory.
Extends the .f64 floating point type mma operation with shapes .m16n8k4, .m16n8k8, and .m16n8k16.
Extends add, sub, mul, set, setp, cvt, tanh, ex2, atom and red instructions with bf16 alternate floating point data format.
Adds support for new alternate floating-point data formats .e4m3 and .e5m2.
Extends cvt instruction to convert .e4m3 and .e5m2 alternate floating point data formats.
Adds support for griddepcontrol instruction as a communication mechanism to control the execution of dependent grids.
Extends mbarrier instruction to allow a new phase completion check operation try_wait.
Adds support for new thread scope .cluster which is a set of Cooperative Thread Arrays (CTAs).
Extends fence/membar, ld, st, atom, and red instructions to accept .cluster scope.
Adds support for extended visibility of shared state space to all threads within a cluster.
Extends .shared state space qualifier with ::cluster sub-qualifier for cluster-level visibility of shared memory.
Extends isspacep, cvta, ld, st, atom, and red instructions to accept ::cluster sub-qualifier with .shared state space qualifier.
Adds support for mapa instruction to map a shared memory address to the corresponding address in a different CTA within the cluster.
Adds support for getctarank instruction to query the rank of the CTA that contains a given address.
Adds support for new barrier synchronization instruction barrier.cluster.
Extends the memory consistency model to include the new cluster scope.
Adds support for special registers related to cluster information: %is_explicit_cluster, %clusterid, %nclusterid, %cluster_ctaid, %cluster_nctaid, %cluster_ctarank, %cluster_nctarank.
Adds support for cluster dimension directives .reqnctapercluster, .explicitcluster, and .maxclusterrank.
PTX ISA version 7.3 introduces the following new features:
Extends mask() operator used in initializers to also support integer constant expression.
Adds support for stack manipulation instructions that allow manipulating stack using stacksave and stackrestore instructions and allocation of per-thread stack using alloca instruction.
Semantic Changes and Clarifications
The unimplemented version of alloca from the older PTX ISA specification has been replaced with new stack manipulation instructions in PTX ISA version 7.3.
PTX ISA version 7.0 introduces the following new features:
Support for sm_80 target architecture.
Adds support for asynchronous copy instructions that allow copying of data asynchronously from one state space to another.
Adds support for mbarrier instructions that allow creation of mbarrier objects in memory and use of these objects to synchronize threads and asynchronous copy operations initiated by threads.
Adds support for redux.sync instruction which allows reduction operation across threads in a warp.
Adds support for new alternate floating-point data formats .bf16 and .tf32.
Extends wmma instruction to support .f64 type with shape .m8n8k4.
Extends wmma instruction to support .bf16 data format.
Extends wmma instruction to support .tf32 data format with shape .m16n16k8.
Extends mma instruction to support .f64 type with shape .m8n8k4.
Extends mma instruction to support .bf16 and .tf32 data formats with shape .m16n8k8.
Extends mma instruction to support new shapes .m8n8k128, .m16n8k4, .m16n8k16, .m16n8k32, .m16n8k64, .m16n8k128 and .m16n8k256.
Extends abs and neg instructions to support .bf16 and .bf16x2 data formats.
Extends min and max instructions to support .NaN modifier and .f16, .f16x2, .bf16 and .bf16x2 data formats.
Extends fma instruction to support .relu saturation mode and .bf16 and .bf16x2 data formats.
Extends cvt instruction to support .relu saturation mode and .f16, .f16x2, .bf16, .bf16x2 and .tf32 destination formats.
Adds support for tanh instruction that computes hyperbolic-tangent.
Extends ex2 instruction to support .f16 and .f16x2 types.
PTX ISA version 6.4 introduces the following new features:
Adds support for .noreturn directive which can be used to indicate a function does not return to its caller function.
Adds support for mma instruction which allows performing matrix multiply-and-accumulate operation.
Deprecated Features
PTX ISA version 6.4 deprecates the following features:
Support for .satfinite qualifier on floating point wmma.mma instruction.
Removed Features
PTX ISA version 6.4 removes the following features:
Support for shfl and vote instructions without the .sync qualifier has been removed for .target sm_70 and higher. This support was deprecated since PTX ISA version 6.0 as documented in PTX ISA version 6.2.
Semantic Changes and Clarifications
Clarified that resolving references of a .weak symbol considers only .weak or .visible symbols with the same name and does not consider local symbols with the same name.
Clarified that in cvt instruction, modifier .ftz can only be specified when either .atype or .dtype is .f32.
PTX ISA version 6.3 introduces the following new features:
Support for sm_75 target architecture.
Adds support for a new instruction nanosleep that suspends a thread for a specified duration.
Adds support for .alias directive which allows defining alias to function symbol.
Extends atom instruction to perform .f16 addition operation and .cas.b16 operation.
Extends red instruction to perform .f16 addition operation.
The wmma instructions are extended to support multiplicand matrices of type .s8, .u8, .s4, .u4, .b1 and accumulator matrices of type .s32.
Semantic Changes and Clarifications
Introduced the mandatory .aligned qualifier for all wmma instructions.
Specified the alignment required for the base address and stride parameters passed to wmma.load and wmma.store.
Clarified that layout of fragment returned by wmma operation is architecture dependent and passing wmma fragments around functions compiled for different link compatible SM architectures may not work as expected.
Clarified that atomicity for {atom/red}.f16x2 operations is guaranteed separately for each of the two .f16 elements but not guaranteed to be atomic as single 32-bit access.
PTX ISA version 6.2 introduces the following new features:
A new instruction activemask for querying active threads in a warp.
Extends atomic and reduction instructions to perform .f16x2 addition operation with mandatory .noftz qualifier.
Deprecated Features
PTX ISA version 6.2 deprecates the following features:
The use of shfl and vote instructions without the .sync is deprecated retrospectively from PTX ISA version 6.0, which introduced the sm_70 architecture that implements Independent Thread Scheduling.
Semantic Changes and Clarifications
Clarified that wmma instructions can be used in conditionally executed code only if it is known that all threads in the warp evaluate the condition identically, otherwise behavior is undefined.
In the memory consistency model, the definition of morally strong operations was updated to exclude fences from the requirement of complete overlap since fences do not access memory.
PTX ISA version 6.0 introduces the following new features:
Support for sm_70 target architecture.
Specifies the memory consistency model for programs running on sm_70 and later architectures.
Various extensions to memory instructions to specify memory synchronization semantics and scopes at which such synchronization can be observed.
New instruction wmma for matrix operations which allows loading matrices from memory, performing multiply-and-accumulate on them and storing result in memory.
Support for new barrier instruction.
Extends neg instruction to support .f16 and .f16x2 types.
A new instruction fns which allows finding n-th set bit in integer.
A new instruction bar.warp.sync which allows synchronizing threads in warp.
Extends vote and shfl instructions with .sync modifier which waits for specified threads before executing the vote and shfl operation respectively.
A new instruction match.sync which allows broadcasting and comparing a value across threads in warp.
A new instruction brx.idx which allows branching to a label indexed from list of potential targets.
Support for unsized array parameter for .func which can be used to implement variadic functions.
Support for .b16 integer type in dwarf-lines.
Support for taking address of device function return parameters using mov instruction.
Semantic Changes and Clarifications
Semantics of bar instruction were updated to indicate that executing thread waits for other non-exited threads from its warp.
Support for indirect branch introduced in PTX 2.1 which was unimplemented has been removed from the spec.
Support for taking address of labels, using labels in initializers which was unimplemented has been removed from the spec.
Support for variadic functions which was unimplemented has been removed from the spec.
PTX ISA version 5.0 introduces the following new features:
Support for sm_60, sm_61, sm_62 target architectures.
Extends atomic and reduction instructions to perform double-precision add operation.
Extends atomic and reduction instructions to specify scope modifier.
A new .common directive to permit linking multiple object files containing declarations of the same symbol with different size.
A new dp4a instruction which allows 4-way dot product with accumulate operation.
A new dp2a instruction which allows 2-way dot product with accumulate operation.
Support for special register %clock_hi.
Semantic Changes and Clarifications
Semantics of cache modifiers on ld and st instructions were clarified to reflect cache operations are treated as performance hint only and do not change memory consistency behavior of the program.
Semantics of volatile operations on ld and st instructions were clarified to reflect how volatile operations are handled by optimizing compiler.
PTX ISA version 4.2 introduces the following new features:
Support for sm_53 target architecture.
Support for arithmetic, comparison and texture instructions for .f16 and .f16x2 types.
Support for memory_layout field for surfaces and suq instruction support for querying this field.
Semantic Changes and Clarifications
Semantics for parameter passing under ABI were updated to indicate ld.param and st.param instructions used for argument passing cannot be predicated.
Semantics of {atom/red}.add.f32 were updated to indicate subnormal inputs and results are flushed to sign-preserving zero for atomic operations on global memory; whereas atomic operations on shared memory preserve subnormal inputs and results and don’t flush them to zero.
PTX ISA version 4.0 introduces the following new features:
Support for sm_32 and sm_50 target architectures.
Support for 64-bit performance counter special registers %pm0_64,..,%pm7_64.
A new istypep instruction.
A new instruction, rsqrt.approx.ftz.f64, has been added to compute a fast approximation of the square root reciprocal of a value.
Support for a new directive .attribute for specifying special attributes of a variable.
Support for .managed variable attribute.
Semantic Changes and Clarifications
The vote instruction semantics were updated to clearly indicate that an inactive thread in a warp contributes a 0 for its entry when participating in vote.ballot.b32.
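The vote.ballot.b32 clarification above can be illustrated with a small host-side model. This is a sketch only, written in Python rather than PTX; the function name ballot_b32, the 32-entry predicate list, and the active_mask argument are assumptions introduced for illustration and are not part of the PTX ISA.

```python
def ballot_b32(predicates, active_mask):
    """Model the result of vote.ballot.b32 for one 32-thread warp.

    predicates  -- hypothetical list of 32 booleans, one per lane
    active_mask -- hypothetical 32-bit mask; bit i set means lane i is active
    """
    result = 0
    for lane in range(32):
        lane_is_active = (active_mask >> lane) & 1
        # An inactive lane contributes 0 for its bit, whatever its predicate is.
        if lane_is_active and predicates[lane]:
            result |= 1 << lane
    return result


# Every predicate is true, but only lanes 0..15 are active,
# so only the low 16 bits of the ballot are set.
print(hex(ballot_b32([True] * 32, 0x0000FFFF)))
```

The model makes the rule explicit: the bit for a lane is set only when the lane is both active and has a true predicate.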
PTX ISA version 3.2 introduces the following new features:
The texture instruction supports reads from multi-sample and multisample array textures.
Extends .section debugging directive to include label + immediate expressions.
Extends .file directive to include timestamp and file size information.
Semantic Changes and Clarifications
The vavrg2 and vavrg4 instruction semantics were updated to indicate that instruction adds 1 only if Va[i] + Vb[i] is non-negative, and that the addition result is shifted by 1 (rather than being divided by 2).
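The difference between shifting and dividing matters only for negative sums. The following Python sketch models a single lane under two assumptions that are not stated in the note above: both inputs are signed, and the shift is an arithmetic right shift. The helper names vavrg_lane and truncating_average are hypothetical.

```python
def vavrg_lane(a, b):
    """Per-lane average as described in the note: add 1 only for a
    non-negative sum, then shift right by 1 (assumed arithmetic shift)."""
    s = a + b
    if s >= 0:
        s += 1
    return s >> 1  # Python's >> on ints is an arithmetic (flooring) shift


def truncating_average(a, b):
    """Contrast case: divide the sum by 2, rounding toward zero."""
    return int((a + b) / 2)


print(vavrg_lane(3, 4), truncating_average(3, 4))
print(vavrg_lane(-3, -4), truncating_average(-3, -4))
```

For the inputs -3 and -4 the shifted form yields -4 while truncating division yields -3, which is the distinction the clarification draws.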
PTX ISA version 3.1 introduces the following new features:
Support for sm_35 target architecture.
Support for CUDA Dynamic Parallelism, which enables a kernel to create and synchronize new work.
ld.global.nc for loading read-only global data through the non-coherent texture cache.
A new funnel shift instruction, shf.
Extends atomic and reduction instructions to perform 64-bit {and,or,xor} operations, and 64-bit integer {min,max} operations.
Adds support for mipmaps.
Adds support for indirect access to textures and surfaces.
Extends support for generic addressing to include the .const state space, and adds a new operator, generic(), to form a generic address for .global or .const variables used in initializers.
A new .weak directive to permit linking multiple object files containing declarations of the same symbol.
Semantic Changes and Clarifications
PTX 3.1 redefines the default addressing for global variables in initializers, from generic addresses to offsets in the global state space. Legacy PTX code is treated as having an implicit generic() operator for each global variable used in an initializer. PTX 3.1 code should either include explicit generic() operators in initializers, use cvta.global to form generic addresses at runtime, or load from the non-generic address using ld.global.
Instruction mad.f32 requires a rounding modifier for sm_20 and higher targets. However for PTX ISA version 3.0 and earlier, ptxas does not enforce this requirement and mad.f32 silently defaults to mad.rn.f32. For PTX ISA version 3.1, ptxas generates a warning and defaults to mad.rn.f32, and in subsequent releases ptxas will enforce the requirement for PTX ISA version 3.2 and later.
PTX ISA version 3.0 introduces the following new features:
Support for sm_30 target architectures.
SIMD video instructions.
A new warp shuffle instruction.
Instructions mad.cc and madc for efficient, extended-precision integer multiplication.
Surface instructions with 3D and array geometries.
The texture instruction supports reads from cubemap and cubemap array textures.
Platform option .target debug to declare that a PTX module contains DWARF debug information.
pmevent.mask, for triggering multiple performance monitor events.
Performance monitor counter special registers %pm4..%pm7.
Semantic Changes and Clarifications
Special register %gridid has been extended from 32-bits to 64-bits.
PTX ISA version 3.0 deprecates module-scoped .reg and .local variables when compiling to the Application Binary Interface (ABI). When compiling without use of the ABI, module-scoped .reg and .local variables are supported as before. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg or .local variables, the compiler silently disables use of the ABI.
The shfl instruction semantics were updated to clearly indicate that value of source operand a is unpredictable for inactive and predicated-off threads within the warp.
PTX modules no longer allow duplicate .version directives. This feature was unimplemented, so there is no semantic change.
Unimplemented instructions suld.p and sust.p.{u32,s32,f32} have been removed.
PTX 2.3 adds support for texture arrays. The texture array feature supports access to an array of 1D or 2D textures, where an integer indexes into the array of textures, and then one or two single-precision floating point coordinates are used to address within the selected 1D or 2D texture.
PTX 2.3 adds a new directive, .address_size, for specifying the size of addresses.
Variables in .const and .global state spaces are initialized to zero by default.
Semantic Changes and Clarifications
The semantics of the .maxntid directive have been updated to match the current implementation. Specifically, .maxntid only guarantees that the total number of threads in a thread block does not exceed the maximum. Previously, the semantics indicated that the maximum was enforced separately in each dimension, which is not the case.
Bit field extract and insert instructions BFE and BFI now indicate that the len and pos operands are restricted to the value range 0..255.
Unimplemented instructions {atom,red}.{min,max}.f32 have been removed.
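The .maxntid clarification above distinguishes a limit on the total thread count from a per-dimension limit. The Python sketch below contrasts the two readings; it assumes the maximum thread count is the product of the extents given to .maxntid, and the function names and tuple arguments are hypothetical helpers rather than any NVIDIA API.

```python
from math import prod


def within_total_limit(block_dim, maxntid):
    """Current semantics: only the total number of threads is bounded."""
    return prod(block_dim) <= prod(maxntid)


def within_per_dimension_limit(block_dim, maxntid):
    """Earlier documented reading: each dimension is bounded separately."""
    return all(b <= m for b, m in zip(block_dim, maxntid))


# A 32x8x1 block against a 16x16x1 bound: both describe 256 threads,
# so the block fits the total limit although x exceeds 16.
print(within_total_limit((32, 8, 1), (16, 16, 1)))
print(within_per_dimension_limit((32, 8, 1), (16, 16, 1)))
```

Under the updated wording only the first check reflects what .maxntid guarantees.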
PTX 2.2 adds a new directive for specifying kernel parameter attributes; specifically, there is a new directive for specifying that a kernel parameter is a pointer, for specifying to which state space the parameter points, and for optionally specifying the alignment of the memory to which the parameter points.
PTX 2.2 adds a new field named force_unnormalized_coords to the .samplerref opaque type. This field is used in the independent texturing mode to override the normalized_coords field in the texture header. This field is needed to support languages such as OpenCL, which represent the property of normalized/unnormalized coordinates in the sampler header rather than in the texture header.
PTX 2.2 deprecates explicit constant banks and supports a large, flat address space for the .const state space. Legacy PTX that uses explicit constant banks is still supported.
PTX 2.2 adds a new tld4 instruction for loading a component (r, g, b, or a) from the four texels comprising the bilinear interpolation footprint of a given texture location. This instruction may be used to compute higher-precision bilerp results in software, or for performing higher-bandwidth texture loads.
The underlying, stack-based ABI is supported in PTX ISA version 2.1 for sm_2x targets.
Support for indirect calls has been implemented for sm_2x targets.
New directives, .branchtargets and .calltargets, have been added for specifying potential targets for indirect branches and indirect function calls. A .callprototype directive has been added for declaring the type signatures for indirect function calls.
The names of .global and .const variables can now be specified in variable initializers to represent their addresses.
A set of thirty-two driver-specific execution environment special registers has been added. These are named %envreg0..%envreg31.
Textures and surfaces have new fields for channel data type and channel order, and the txq and suq instructions support queries for these fields.
Directive .minnctapersm has replaced the .maxnctapersm directive.
Directive .reqntid has been added to allow specification of exact CTA dimensions.
A new instruction, rcp.approx.ftz.f64, has been added to compute a fast, gross approximate reciprocal.
Semantic Changes and Clarifications
A warning is emitted if .minnctapersm is specified without also specifying .maxntid.
This section describes the floating-point changes in PTX ISA version 2.0 for sm_20 targets. The goal is to achieve IEEE 754 compliance wherever possible, while maximizing backward compatibility with legacy PTX ISA version 1.x code and sm_1x targets.
The changes from PTX ISA version 1.x are as follows:
Single-precision instructions support subnormal numbers by default for sm_20 targets. The .ftz modifier may be used to enforce backward compatibility with sm_1x.
Single-precision add, sub, and mul now support .rm and .rp rounding modifiers for sm_20 targets.
A single-precision fused multiply-add (fma) instruction has been added, with support for IEEE 754 compliant rounding modifiers and support for subnormal numbers. The fma.f32 instruction also supports .ftz and .sat modifiers. fma.f32 requires sm_20. The mad.f32 instruction has been extended with rounding modifiers so that it is synonymous with fma.f32 for sm_20 targets. Both fma.f32 and mad.f32 require a rounding modifier for sm_20 targets.
The mad.f32 instruction without rounding is retained so that compilers can generate code for sm_1x targets. When code compiled for sm_1x is executed on sm_20 devices, mad.f32 maps to fma.rn.f32.
Single- and double-precision div, rcp, and sqrt with IEEE 754 compliant rounding have been added. These are indicated by the use of a rounding modifier and require sm_20.
Instructions testp and copysign have been added.
New Instructions
A load uniform instruction, ldu, has been added.
Surface instructions support additional .clamp modifiers, .clamp and .zero.
Instruction sust now supports formatted surface stores.
A count leading zeros instruction, clz, has been added.
A find leading non-sign bit instruction, bfind, has been added.
A bit reversal instruction, brev, has been added.
Bit field extract and insert instructions, bfe and bfi, have been added.
A population count instruction, popc, has been added.
A vote ballot instruction, vote.ballot.b32, has been added.
Instructions {atom,red}.add.f32 have been implemented.
Instructions {atom,red}.shared have been extended to handle 64-bit data types for sm_20 targets.
A system-level membar instruction, membar.sys, has been added.
The bar instruction has been extended as follows:
A bar.arrive instruction has been added.
Instructions bar.red.popc.u32 and bar.red.{and,or}.pred have been added.
bar now supports optional thread count and register operands.
Scalar video instructions (includes prmt) have been added.
Instruction isspacep for querying whether a generic address falls within a specified state space window has been added.
Instruction cvta for converting global, local, and shared addresses to generic address and vice-versa has been added.
Other New Features
Instructions ld, ldu, st, prefetch, prefetchu, isspacep, cvta, atom, and red now support generic addressing.
New special registers %nwarpid, %nsmid, %clock64, %lanemask_{eq,le,lt,ge,gt} have been added.
Cache operations have been added to instructions ld, st, suld, and sust, e.g., for prefetching to specified level of memory hierarchy. Instructions prefetch and prefetchu have also been added.
The .maxnctapersm directive was deprecated and replaced with .minnctapersm to better match its behavior and usage.
A new directive, .section, has been added to replace the @@DWARF syntax for passing DWARF-format debugging information through PTX.
A new directive, .pragma nounroll, has been added to allow users to disable loop unrolling.
Semantic Changes and Clarifications
The erratum in cvt.ftz for PTX ISA versions 1.4 and earlier, where single-precision subnormal inputs and results were not flushed to zero if either source or destination type size was 64-bits, has been fixed. In PTX ISA version 1.5 and later, cvt.ftz (and cvt for .target sm_1x, where .ftz is implied) instructions flush single-precision subnormal inputs and results to sign-preserving zero for all combinations of floating-point instruction types. To maintain compatibility with legacy PTX code, if .version is 1.4 or earlier, single-precision subnormal inputs and results are flushed to sign-preserving zero only when neither source nor destination type size is 64-bits.
Components of special registers %tid, %ntid, %ctaid, and %nctaid have been extended from 16-bits to 32-bits. These registers now have type .v4.u32.
The number of samplers available in independent texturing mode was incorrectly listed as thirty-two in PTX ISA version 1.5; the correct number is sixteen.
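The cvt.ftz clarification above combines two ideas: what flushing to sign-preserving zero means, and when it applies depending on the .version of the module. The Python sketch below is a rough host-side model, not a description of how ptxas or the hardware implements it. The helper names ftz_f32 and flushes_subnormals, the version tuple, and the bit-width arguments are all assumptions made for illustration.

```python
import math
import struct


def ftz_f32(x):
    """Flush a single-precision subnormal to zero, keeping its sign."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 and mantissa != 0:
        bits &= 0x80000000  # subnormal: keep only the sign bit
    return struct.unpack("<I".replace("I", "f"), struct.pack("<I", bits))[0]


def flushes_subnormals(version, src_bits, dst_bits):
    """Whether the rule quoted above flushes for a given module version.

    version is a hypothetical (major, minor) tuple taken from .version.
    """
    if version <= (1, 4):
        # Legacy behaviour: flush only when neither type is 64 bits wide.
        return src_bits != 64 and dst_bits != 64
    return True


flushed = ftz_f32(-1e-40)  # -1e-40 is subnormal in single precision
print(flushed, math.copysign(1.0, flushed))
print(flushes_subnormals((1, 4), 32, 64), flushes_subnormals((1, 5), 32, 64))
```

The sign check uses math.copysign because a flushed negative input becomes negative zero, which compares equal to positive zero.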
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.