Convergent Operation Semantics¶
Overview¶
Some parallel execution environments execute threads in groups that allowefficient communication within the group using special primitives calledconvergent operations. The outcome of a convergent operation is sensitive tothe set of threads that executes it “together”, i.e., convergently. When controlflowdiverges, i.e. threads of the samegroup follow differentpaths through the CFG, not all threads of the group may be available toparticipate in this communication. This is the defining characteristic thatdistinguishes convergent operations from other inter-thread communication:
A convergent operation involves inter-thread communication or synchronizationthat occurs outside of the memory model, where the set of threads whichparticipate in communication is implicitly affected by control flow.
For example, in the following GPU compute kernel, communication during theconvergent operation is expected to occur precisely among those threads of animplementation-defined execution scope (such as workgroup or subgroup) forwhichcondition
is true:
voidexample_kernel(){...if(condition)convergent_operation();...}
In structured programming languages, there is often an intuitive andunambiguous way of determining the threads that are expected to communicate.However, this is not always the case even in structured programming languages,and the intuition breaks down entirely in unstructured control flow. Thisdocument describes the formal semantics in LLVM, i.e. how to determine the setof communicating threads for convergent operations.
The definitions in this document leave many details open, such as how groups ofthreads are formed in the first place. It focuses on the questions that arerelevant for deciding the correctness of generic program transforms andconvergence-related analyses such asuniformity analysis.
Convergent Operations¶
In LLVM IR, the only way to communicate between threads as describedabove is by calling target-defined convergent intrinsics. Hence, onlya call-site in LLVM IR (acall,invoke, orcallbr instruction) can result in aconvergent operation.
A function in LLVM IR is said to beconvergent if it has theconvergent attribute.
A call-site in LLVM IR is said to beconvergent if it is a directcall to a convergent function or it has theconvergent attribute or aconvergencectrl operand bundle.
Informational notes:
A function may have to be treated as convergent if that function, ortransitively, any function called from it, contains a convergent call-site. Afrontend generating the
convergent
attribute should take this into accountwhen emitting functions and function calls. But this is not always the case:A non-convergent function may contain convergent operations; such operationsdo not directly depend on the set of threads that enter the function as asingle communicating group. Instead, these operations depend on animplementation-defined subset of threads within the body of the function, asshown inOpportunistic convergent operations.
Examples of Convergent Operations¶
(This section is informative.)
Texture sampling in a pixel shader¶
The following stylized pixel shader samples a texture at a given set ofcoordinates, using the builtin functiontextureSample. Texture samplingrequires screen-space derivatives of the coordinates to determine the level ofdetail (mipmap level) of the sample. They are commonly approximated by takingthe difference between neighboring pixels, which are computed by differentthreads in the same group:
voidexample_shader(){...color=textureSample(texture,coordinates);if(condition){use(color);}...}
From a purely single-threaded perspective, sinking thetextureSample intothe if-statement appears legal. However, if the condition is false for someneighboring pixels, then their corresponding threads will not execute togetherin the group, making it impossible to take the difference of coordinates as anapproximation of the screen-space derivative. In practice, the outcome will bean undefined value.
That is, thetextureSample operation fits our definition of a convergentoperation:
It communicates with a set of threads that implicitly depends on controlflow.
Correctness depends on this set of threads.
The compiler frontend can emit IR that expresses the convergence constraints asfollows:
definevoid@example_shader()convergent{%entry=calltoken@llvm.experimental.convergence.entry()...%color=callT@textureSample(U%texture,V%coordinates)["convergencectrl"(token%entry)]bri1%condition,label%then,label%endthen:callvoid@use(T%color)brlabel%endend:retvoid}
Thellvm.experimental.convergence.entryintrinsic is itselfconvergent
, and we expect it to communicate at leastamong all threads of the same “quad” – a group of 2x2 pixels that areevaluated together for the purpose of approximating screen-space derivatives.This fact is not part of the generic LLVM IR semantics; it would have to bedefined somewhere else, for example as part of target-specific ABI definitionsand/or in reference to some relevant API specs.
Since the@textureSample
call then uses the token produced by the entryintrinsic in itsconvergencectrl
bundle, and has no additional controldependencies, it must communicate among the same set of threads. This indicatesto generic program transforms that sinking the@textureSample
call isforbidden. (A program transform can still sink the call if it can prove somehow,e.g. by leaning on target-specific callbacks that can analyze the program withadditional knowledge, that%condition
is always uniform across the threadsreferenced by theconvergence token%entry
.)
Reductions inside divergent control flow¶
The following example shows that merging common code of branches can beincorrect in the face of convergent operations:
voidexample_kernel(){delta=...if(delta>0){total_gains=subgroupAdd(delta);...}else{total_losses=subgroupAdd(delta);...}}
ThesubgroupAdd
computing thetotal_gains
will be executed by thesubset of threads with positivedelta
in a subgroup (wave), and so will sumup all thedelta
values of those threads; and similarly for thesubgroupAdd
that computes thetotal_losses
.
If we were to hoist and merge thesubgroupAdd
above the if-statement, itwould sum up thedelta
acrossall threads instead.
The compiler frontend can emit IR that expresses the convergence constraintsas follows:
definevoid@example_kernel()convergent{%entry=calltoken@llvm.experimental.convergence.entry()%delta=...%cc=icmpsgti32%delta,0bri1%cc,label%then,label%elsethen:%total_gains=calli32@subgroupAdd(i32%delta)["convergencectrl"(token%entry)]...brlabel%endelse:%total_losses=calli32@subgroupAdd(i32%delta)["convergencectrl"(token%entry)]...brlabel%endend:...}
The entry intrinsic behaves like in the previous example: assuming that@example_kernel
is an OpenCL kernel (as hinted at by the “subgroup”terminology), we expect it to communicate among all threads within the“subgroup”. This typically maps to a SIMD vector on GPU hardware.
The calls to@subgroupAdd
use the token produced by the entry intrinsic,but they also have an additional control dependency. According to the rulesdefined in this document, they only communicate among the subset of threadsthat actually end up executing the respective (static) call site.
Hoisting them would remove the control dependency and cause them to communicateamong the full set of threads that the entry intrinsic communicated with.Again, hoisting is allowed if it can be proven that%cc
is always uniformamong the relevant set of threads: in that case, the@subgroupAdd
alreadycommunicates among the full set of threads in the original program.
Motivating Examples of Convergence Control¶
(This section is informative.)
Unstructured control flow¶
Consider an example of how jump threading removes structure in a way that canmake semantics non-obvious without the convergence intrinsics described in thisdocument:
voidexample_original(){entry:...bri1%cond1,label%then1,label%midthen1:...%cond2=...brlabel%midmid:%flag=phii1[true,%entry],[%cond2,%then1]bri1%flag,label%then2,label%endthen2:...callvoid@subgroupControlBarrier()...brlabel%endend:}voidexample_jumpthreaded(){entry:...bri1%cond1,label%then1,label%then2then1:...%cond2=...bri1%cond2,label%then2,label%endthen2:...callvoid@subgroupControlBarrier()...brlabel%endend:}
Is the control barrier guaranteed to synchronize among the same set of threadsin both cases? Different implementations in the literature may give differentanswers to this question:
In an implementation that reconverges at post-dominators, threads reconvergeat
mid
in the first version, so that all threads (within a subgroup/wave)that execute the control barrier do so together. In the second version,threads that reach the control barrier via different paths synchronizeseparately: the first (and only) post-dominator isend
, so threads do notreconverge before then.An implementation that sorts basic blocks topologically and ensures maximalreconvergence for each basic block would behave the same way in bothversions.
We generally take the stance that reconvergence in acyclic control flow mustbe maximal. The compiler frontend could augment the original code as follows:
definevoid@example_original()convergent{entry:%entry=calltoken@llvm.experimental.convergence.entry()...bri1%cond1,label%then1,label%midthen1:...%cond2=...brlabel%midmid:%flag=phii1[true,%entry],[%cond2,%then1]bri1%flag,label%then2,label%endthen2:...callvoid@subgroupControlBarrier()["convergencectrl"(token%entry)]...brlabel%endend:}
If S is the set of threads that the entry intrinsic communicated with, thenthe@subgroupControlBarrier
call communicates with the subset of S thatactually reaches the call site. This set of threads doesn’t change afterjump-threading, so the answer to the question posed above remains the same.
Opportunistic convergent operations¶
Some programs have local regions of code that contain a sequence of convergentoperations where the code does not care about the exact set of threads withwhich it is executed, but only that the set of threads is the same for all theoperations within the sequence. (If a subset of the convergent operations in thesequence have additional, non-uniform control dependencies, then this is notpossible. However, the code may still require that the sets of threads arelogically consistent with the conditions of those control dependencies.) In thiscase,llvm.experimental.convergence.anchor can be used to express the desiredsemantics.
The following example function could be part of a hypothetical “append buffer”implementation, where threads conditionally write fixed-sized recordscontiguously into a global buffer. The function@reserveSpaceInBuffer
returns the index into the buffer at which the calling thread should store itsdata.
This could be achieved by using a simple atomic operation in every thread tobump an allocation counter.
However, the following implementation can be more performant on some hardware,because it uses only a single atomic operation for an entire group of threads.To do this, it first determines the total size of the group, which will be theoperand to the atomic operation, and then later broadcasts the result of theatomic operation to all threads of the group, so that each thread can computeits individual position in the buffer:
definei32@reserveSpaceInBuffer(){; NOTE: _not_ a convergent function!entry:%anchor=calltoken@llvm.experimental.convergence.anchor()%ballot=calli64@subgroupBallot(i1true)["convergencectrl"(token%anchor)]%numThreads.p=calli64@llvm.ctpop.i64(i64%ballot)%numThreads=trunci64%numThreads.ptoi32%absoluteThreadIdx=calli32@getSubgroupLocalInvocationId()%absoluteThreadIdx.ext=zexti32%absoluteThreadIdxtoi64%mask.p=shli641,%absoluteThreadIdx.ext%mask=subi64%mask.p,1%maskedBallot=andi64%ballot,%mask%relativeThreadIdx.p=calli64@llvm.ctpop.i64(i64%maskedBallot)%relativeThreadIdx=trunci64%relativeThreadIdx.ptoi32%isFirstThread=icmpeqi32%relativeThreadIdx,0bri1%isFirstThread,label%then,label%endthen:%baseOffset.1=atomicrmwaddptr@bufferAllocationCount,i32%numThreadsmonotonicbrlabel%endend:%baseOffset.2=phii32[undef,%entry],[%baseOffset.1,%then]%baseOffset=calli32@subgroupBroadcastFirst(i32%baseOffset.2)["convergencectrl"(token%anchor)]%offset=addi32%baseOffset,%relativeThreadIdxreti32%offset}
The key here is that the function really doesn’t care which set of threads itis being called with. It takes whatever set of threads it can get. What theimplementation of the function cares about is that the initial@subgroupBallot
– which is used to retrieve the bitmask of threads thatexecuted the anchor together – executes with the same set of threads as thefinal@subgroupBroadcastFirst
. Nothing else is required for correctness asfar as convergence is concerned.
The function@reserveSpaceInBuffer
itself is _not_convergent
: callersare free to move call sites of the function as they see fit. This can changethe behavior in practice, by changing the sets of threads that are groupedtogether for the atomic operation. This can be visible in the output of theprogram, since the order in which outputs appear in the buffer is changed.However, this does not break the overall contract that@reserveSpaceInBuffer
has with its caller – which makes sense: the order of outputs isnon-deterministic anyway because of the atomic operation that is involved.
If the function is inlined, the use of the anchor intrinsic similarly indicatesthat certain transforms which are usually forbidden by the presence ofconvergent operations are in fact allowed, as long as they don’t break up theregion of code that is controlled by the anchor.
Extended Cycles: Divergent Exit from a Loop¶
High-level languages typically provide abreak
statement that transferscontrol out of a loop statement. In most cases, the loop is structured and hencethere is no ambiguity about convergence inside the loop. But an ambiguity ariseswhen abreak
is control dependent on a divergent condition inside the loop.Consider the following example:
voidexample(){// A...for(...){// Bif(condition){// divergent condition// Cconvergent_op();break;}// D...}// E}
In this program, the call to convergent_op() is lexically “inside” thefor
loop. But when translated to LLVM IR, the basic block B is an exiting blockending in a divergent branch, and the basic block C is an exit of the loop.Thus, the call to convergent_op() is outside the loop. This causes a mismatchbetween the programmer’s expectation and the compiled program. The call shouldbe executed convergently on every iteration of the loop, by threads thattogether take the branch to exit the loop. But when compiled, all threads thattake the divergent exit on different iterations first converge at the beginningof basic block C and then together execute the call to convergent_op().
In this case,llvm.experimental.convergence.loop can be used to express the desiredsemantics. A call to this intrinsic is placed in the loop header, which trackseach iteration of the loop. The token produced by this is used as aconvergencectrl
operand to the convergent call. The semantics of theloop
intrinsic ensures that the convergent call is performed convergentlyonly by those threads that convergently exited the loop in a given iteration.
definevoid@example()convergent{%entry=calltoken@llvm.experimental.convergence.entry()brlabel%forfor:%inner=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%entry)]%for.cond=i1...bri1%for.cond,label%B,label%EB:...%condition=i1...bri1%condition,label%C,label%DC:callvoid@convergent_op()["convergencectrl"(token%inner)]brlabel%ED:...brlabel%forE:...retvoid}
The LLVM IR version of the same program shows a cycle consisting of the basicblocks%for
,%B
and%D
, while%C
is an exit reached by thedivergent branch at the end of the exiting block%B
. But the use ofconvergence control tokens makes it clear that block%C
must be executedconvergently only by those threads that convergently take the exit edge from %Bto%C
. In other words, the convergent execution of%C
is governed by thecall to thellvm.experimental.convergence.loop intrinsic inside the cycle. The cycle iseffectively extended to include all uses of this token that lie outside thecycle.
Dynamic Instances and Convergence Tokens¶
Every execution of an LLVM IR instruction occurs in adynamic instance of the instruction. Dynamic instances are theformal objects by which we talk about communicating threads in convergentoperations. Dynamic instances are defined forall operations in an LLVMprogram, whether convergent or not. Convergence control is primarily about thedynamic instances of convergent operations since they affect execution of theprogram through inter-thread communication. The dynamic instances fornon-convergent operations are relevant for determininguniformity of values.
Dynamic instances produced by the execution of the sameconvergent operationby different threads may beconverged. Whenexecuting a convergent operation, the set of threads that execute convergeddynamic instances is the set of threads that communicate with each other.Convergence tokens capture this convergence as described below.
Convergence tokens are values oftoken
type, i.e. they cannot be used inphi
orselect
instructions. A convergence token value represents thedynamic instance of the instruction that produced it.
Convergent operations may have an optionalconvergencectrl
operand bundle witha convergence token operand to define the set of communicating threads relativeto the operation that defined the token.
Let
U
be a convergent operation other than a call to a convergencecontrol intrinsic, andD
be the convergent operation that definesthe token value used as theconvergencectrl
operand toU
. Twothreads execute converged dynamic instances ofU
if and only if thetoken value in both threads was returned by converged dynamicinstances ofD
.
Note
The text defines convergence token values as representing dynamic instances.But if we were to assume that converged dynamic instances produce the sametoken value, then we could almost think of the token value as representing aset of threads instead – specifically, the setS
of threads thatexecuted converged dynamic instances of the defining instructionD
.
In this intuitive picture, when a convergence token valueT
is used by aconvergencectrl
bundle on an instructionI
, then the set of threads thatcommunicates inI
is a subset of the setS
represented by the token value.Specifically, it is the subset of threads that ends up executingI
whileusing the token value.
This by itself wouldn’t quite work as a definition: what ifI
is executedmultiple times by the same threads? Which execution ofI
in thread 1communicates with which execution ofI
in thread 2? Leaning on the notionof dynamic instances gives a robust answer to this question as long asD
andI
are at the same loop (or cycle) nesting level.
The case whereD
andI
are at different loop nesting levels isforbidden by thestatic rules – handlingthat case is the purpose ofllvm.experimental.convergence.loop.
Convergence Control Intrinsics¶
This section describes target-independent intrinsics that can be used toproduce convergence tokens.
Behaviour is undefined if a convergence control intrinsic is calledindirectly.
llvm.experimental.convergence.entry
¶
token@llvm.experimental.convergence.entry()convergentreadnone
This intrinsic is used to tie the dynamic instances inside of a function tothose in the caller.
If the function is called from outside the scope of LLVM, the convergence ofdynamic instances of this intrinsic are environment-defined. For example:
In an OpenCLkernel launch, the maximal set of threads thatcan communicate outside the memory model is aworkgroup.Hence, a suitable choice is to specify that all the threads froma single workgroup in OpenCL execute converged dynamic instancesof this intrinsic.
In a C/C++ program, threads are launched independently and they cancommunicate only through the memory model. Hence the dynamic instances ofthis intrinsic in a C/C++ program are never converged.
If the function is called from a call-site in LLVM IR, then twothreads execute converged dynamic instances of this intrinsic if andonly if both threads entered the function by executing convergeddynamic instances of the call-site.
This intrinsic can occur at most once in a function, and only in the entryblock of the function. If this intrinsic occurs in a basic block, then it mustprecede any other convergent operation in the same basic block.
It is an error if this intrinsic appears in a non-convergent function.
It is an error to specify aconvergencectrl
operand bundle at acall to this intrinsic.
Function inlining substitutes this intrinsic with the token from the operandbundle. For example:
// Before inlining:voidcallee()convergent{%tok=calltoken@llvm.experimental.convergence.entry()convergent_operation(...)["convergencectrl"(token%tok)]}voidmain(){%outer=calltoken@llvm.experimental.convergence.anchor()for(...){%inner=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%outer)]callee()["convergencectrl"(token%inner)]}}// After inlining:voidmain(){%outer=calltoken@llvm.experimental.convergence.anchor()for(...){%inner=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%outer)]convergent_operation(...)["convergencectrl"(token%inner)]}}
llvm.experimental.convergence.loop
¶
token@llvm.experimental.convergence.loop()["convergencectrl"(token)]convergentreadnone
This intrinsic represents the place where an imaginary counter is incrementedfor determining convergence inside a control flow cycle.
LetU
be a call to this intrinsic andD
be the convergent operation thatdefines the token value used as theconvergencectrl
operand toU
. Twothreads execute converged dynamic instances ofU
if and only if:
The token value in both threads was returned by converged dynamicinstances of
D
, and,There is an integern such that both threads execute
U
for then’th timewith that token value.
It is an error to omit theconvergencectrl
operand bundle on acall to this intrinsic.
If this intrinsic occurs in a basic block, then it must precede any otherconvergent operation in the same basic block.
Heart of a Cycle:
If acycle
C
contains an occurrenceH
ofthis intrinsic whose token operand is defined outsideC
, thenH
iscalled the heart ofC
.Note
The static rules for cycles imply that a heart can occur only in the headerof a natural loop. This ensures that the heart closely represents theintuitive notion of a loop iteration. If this restriction is relaxed, theresulting semantics provides a new notion of “cycle iteration” even forirreducible cycles. But this allows a natural loop to have a heart in anode other than its header, which has interesting consequences on themeaning of a loop iteration in terms of convergence. For now, we disallowthis situation since its practical application is very rare.
llvm.experimental.convergence.anchor
¶
token@llvm.experimental.convergence.anchor()convergentreadnone
This intrinsic produces an initial convergence token that is independent fromany “outer scope”. The set of threads executing converged dynamic instances ofthis intrinsic is implementation-defined.
It is an error to pass aconvergencectrl
operand bundle at acall to this intrinsic.
Note
The expectation is that all threads within a group that “happen to be activeat the same time” will execute converged dynamic instances, so that programscan detect the maximal set of threads that can communicate efficiently withinsome local region of the program.
Uncontrolled Convergent Operations¶
Convergent operations with an explicitconvergencectrl
operand bundle arecalledcontrolled convergent operations. All other convergent operations aresaid to beuncontrolled.
An uncontrolled convergent operation is said to haveimplicit convergencecontrol determined by theconvergent
attribute alone. The semantics of theconvergent
attribute as implemented in LLVM differs from the documentedsemantics. The implementation tries to follow common intuition about convergentoperations, which remains under-specified. As such, it is not possible to fullytranslate implicit convergence control into explicit convergence control tokens,and these two modes cannot be mixed in the same function.
If a function contains a controlled convergent operation, then all convergentoperations in that function must either be controlled operations or calls tothe convergence control intrinsics.
Inferring Tokens¶
(This section is informational)
Sometimes, it may be necessary to reinterpret the implicit convergence controlin terms of explicit convergence control tokens. For example, this may happenwhen a function call is inlined, and either the caller or the callee containsuncontrolled convergent operations.
Some uses of uncontrolled convergent operations may need to satisfy thefollowing property:
For an environment-defined group of threads (such as an OpenCL workgroup orsubgroup), if one thread in the group executes a convergent operation, thenall threads in the group do so convergently with that thread.
In terms of explicit convergence control, this means that theconvergencectrl
operand on each convergent operationX
must ultimatelyoriginate from a call to thellvm.experimental.convergence.entry intrinsic. This preserves the possibilitythat the group of threads that converge on reachingX
is the same group thatoriginally started executing the program in convergence. In comparison, thellvm.experimental.convergence.anchor intrinsic captures animplementation-defined group of threads, which is insufficient to support theabove property.
One way to approximate implicit convergence control in terms of explicitconvergence control tokens is the following procedure, which preserves the abovementioned property:
Convert every irreducible cycle into a reducible cycle.
Insert a call tollvm.experimental.convergence.entry at the start of the entry block of thefunction.
Insert a call tollvm.experimental.convergence.loop at the start of every loop header. Ifthis loop is an outermost loop, the
convergencectrl
operand is the calltollvm.experimental.convergence.entry in the entry block of the function.Otherwise, theconvergencectrl
operand is the call tollvm.experimental.convergence.loop in the parent loop’s header.For each uncontrolled convergent operation
X
, add aconvergencectrl
operand bundle using the token defined by a definitionD
that is asibling to this operation.D
always dominatesX
— ifX
is not in any cycle, thenD
is a call tollvm.experimental.convergence.entry; otherwiseD
is the heart of theparent cycle ofX
.
Static Rules¶
Awell-formed program in LLVM IR must satisfy the following staticrules about cycles and convergence regions.
Closed Paths¶
Aclosed path in a CFG is a connected sequence ofnodes and edges in the CFG whose start and end points are the same.
Every closed path in the CFG that contains a use of a convergence token T otherthan a use byllvm.experimental.convergence.loopmust also contain the definition of T.
Every closed path in the CFG that contains two different uses of a convergencetoken T must also contain the definition of T.
Every closed path in the CFG that contains uses of two different convergence tokensT1 and T2 must also contain the definition of at least one of them.
Taken together, these rules imply that for every closed path C, there can be at mostone convergence token T which is used in C but defined outside of it, and thatT can be used only once in C, and only byllvm.experimental.convergence.loop
.
In every closed path that contains a use U of a token T but not thedefinition of T, U must dominate all nodes in the closed path.
This implies thatllvm.experimental.convergence.loop
can appear as a heartonly in the header of a natural loop.
Sufficient Conditions: From theproperties of cycles, it is sufficient to prove the above propertiesfor cycles instead of closed paths. Briefly, any closed path that violatesone or more of the above static rules is contained in a cycle that alsoviolates the same rule(s).
Convergence Regions¶
Theconvergence region of a convergence token T is the minimal region inwhich T is live and used, i.e., the set of program points dominated by thedefinition D of T from which a use of T can be reached.
The following static rule about convergence regions must be satisfied byvalid programs:
If a convergence region R for a token T1 contains a use of a convergencetoken T2, then R must also contain the definition of T2. (In other words,convergence regions must be reasonably nested.)
Note
For brevity, this document uses the term “convergence region of a tokendefinitionD
” to actually refer to the convergence region of the tokenT
defined byD
.
Inferring non-convergence¶
When the target or the environment guarantees that threads do notcommunicate using convergent operations or that threads never diverge,the dynamic instances in the program are irrelevant and an optimizermay remove any occurrence of theconvergent
attribute on acall-site or a function and any explicitconvergencectrl
operandbundle at a call-site.
An optimizer may remove theconvergent
attribute and any explicitconvergencectrl
operand bundle from a call-site if it can provethat the execution of this call-site always results in a call to anon-convergent function.
An optimizer may remove theconvergent
attribute on a function if it canprove that the function does not contain a call tollvm.experimental.convergence.entry, or any uncontrolled convergentoperations.
Memory Model Non-Interaction¶
The fact that an operation is convergent has no effect on how it is treated formemory model purposes. In particular, an operation that isconvergent
andreadnone
does not introduce additional ordering constraints as far as thememory model is concerned. There is no implied barrier, neither in the memorybarrier sense nor in the control barrier sense of synchronizing the executionof threads.
Informational note: Threads that execute converged dynamic instances do notnecessarily do so at the same time.
Other Interactions¶
A function can be bothconvergent
andspeculatable
, indicating that the function does not have undefinedbehavior and has no effects besides calculating its result, but is stillaffected by the set of threads executing this function. This typicallyprevents speculation of calls to the function unless the constraint imposedbyconvergent
is further relaxed by some other means.
Controlled Maximal Convergence¶
Theconverged-with relation over dynamicinstances of each controlled convergent operation is completely defined by thesemantics of convergence tokens. But the implementation-defined convergence at acall tollvm.experimental.convergence.anchor also depends on the cycle hierarchychosen if it occurs inside an irreducible cycle.
When the token defined by a convergent operationD
is used at anotherconvergent operationU
, the implementation must ensure that the threads thatconverge atU
are all the threads that reachedU
after converging atD
. On most implementations, it is reasonable to assume that only thesethreads are converged at every node they reach on any path fromD
toU
.In other words, the converged-with relation atD
produces groups of threadsthat can converge only within each group, while inside the convergence region ofD
.
All this affects themaximal converged-with relation over dynamic instances and in turn them-convergedproperty of static instances in the convergence region ofD
.
Controlled Maximal converged-with Relation
Dynamic instances of aconvergent operation are related in the controlledmaximal converged-with relation according to the semantics of the convergencecontrol tokens.
Dynamic instances
X1
andX2
produced by different threads for thesamenon-convergent operationX
are related in the controlled maximalconverged-with relation if and only if:
Both threads executed converged dynamic instances of every tokendefinition
D
such thatX
is in the convergence region ofD
,and,Either
X
is not contained in any cycle, or, for every cycleC
with headerH
that containsX
:
every dynamic instance
H1
ofH
that precedesX1
in therespective thread is convergence-beforeX2
, and,every dynamic instance
H2
ofH
that precedesX2
in therespective thread is convergence-beforeX1
,without assuming that
X1
is converged withX2
.
Controlled m-converged Static Instances
A node
X
in a given CFG is reported to be m-converged if and only if:
For any token definition
D
such thatX
is inside the convergence regionofD
,D
itself is m-converged, and,Every cycle that contains
X
satisfies the following necessaryconditions:
Every divergent branch inside the cycle satisfies thedivergedentry criterion, and,
There are nodiverged paths reaching thecycle from a divergent branch outside it.
Temporal Divergence at Cycle Exit¶
When a cycle has a divergent exit, maximal convergence assumes that all threadsconverge at the exit block. But if a controlled convergent operation outside thecycle uses a token defined by an operationD
inside the cycle, theconvergence region ofD
now extends outside the cycle. If two threadsexecuted converged dynamic instances ofD
before exiting the cycle, thenthey continue to execute converged dynamic instances of nodes in the convergenceregion ofD
outside the cycle. Thus, for a valueV
defined inside thecycle, any useU
ofV
within the convergence region ofT
uses theoutput of converged dynamic instances ofV
. IfV
is uniform, then itsuse at such aU
is also uniform. In other words, temporal divergence appliesonly to a use ofV
that is outside the convergence region ofD
.
Rationales for Static rules about cycles¶
(This section is informative.)
Note
For convenience, we use the operator==
to represent the relationconverged-with
and the operator!=
to represent its negation.
Consider a loop with (incorrect!) convergence control as in the followingpseudocode:
; WARNING: Example of incorrect convergence control!%anchor=calltoken@llvm.experimental.convergence.anchor()for(;;) {...callvoid@convergent.op()["convergencectrl"(token%anchor)]...}
This code is forbidden by the first static rule about cycles.
A first formal argument why we have to do this is that the dynamic rule fordeciding whether two threads execute converged dynamic instances of@convergent.op
leads to a logical contradiction in this code.Assume two threads execute converged dynamic instances of the anchorfollowed by two iterations of the loop. Thread 1 executes dynamic instancesI1 and I2 of@convergent.op
, thread 2 executes dynamic instances J1 and J2.Using all the rules, we can deduce:
I1!=I2
andJ1!=J2
by the basic rules of dynamic instances.I1==J1
by the first dynamic rule about controlled convergentoperations: both threads execute the same static instruction while usinga convergence token value produced by converged dynamic instances of aninstruction (the anchor).I1==J2
by the same argument. Also,I2==J1
andI2==J2
.The fact that one may beintuitively tempted to think of
I1
andJ2
as being executed in different loop iterations is completely irrelevant fortheformal argument. There is no mechanism in LLVM IR semantics forforming associations between loop iterations in different threads,exceptfor the rules defined in this document – and the rules in this documentrequire a loop heart intrinsic for talking about loop iterations.By transitivity, we have
I1==I2
andJ1==J2
. That is acontradiction.
This problem goes away by inserting a loop heart intrinsic as follows, whichestablishes a relationship between loop iterations across threads.
%anchor=calltoken@llvm.experimental.convergence.anchor()for(;;) {%loop=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%anchor)]...callvoid@convergent.op()["convergencectrl"(token%loop)]...}
In the same scenario of two threads executing converged dynamic instances of theanchor and then two iterations of the loop, the dynamic rule about loop heartintrinsics implies that both threads execute the converged dynamic instances ofthe loop heart intrinsic in their respective first iterations and then again intheir respective second iterations of the loop.
This then implies that they execute converged dynamic instancesI1==J1
ofthe@convergent.op
in their first iterations and thenI2==J2
in their second iterations. The rule is an “if and only if” rule,so it also implies thatI1!=J2
andI2!=J1
, because those executionssee token values of%loop
originating from non-converged dynamicinstances of the loop intrinsic.
One may ask whether we could change the dynamic rule instead of adding thestatic rule about cycles. That is impractical due to deeper difficulties.Consider the following loop, again with incorrect convergence control:
; WARNING: Example of incorrect convergence control!; (A)%anchor=calltoken@llvm.experimental.convergence.anchor()for(;;) {; (B)if(condition1){; (C)callvoid@convergent.op.1()["convergencectrl"(token%anchor)]}; (D)if(condition2){; (E)callvoid@convergent.op.2()["convergencectrl"(token%anchor)]}; (F)}; (G)
Assume two threads execute converged dynamic instances of the anchor followedby this sequence of basic blocks:
Thread 1: A B C D F B D E F GThread 2: A B D E F B C D F G
That is, both threads execute two iterations of the loop, but they executethe different convergent operations in different iterations. Without forming arelation between loop iterations across the threads, there is no reasonable wayof defining which dynamic instances of the convergent operations should be thesame across the threads, if any.
Again, this can be addressed by adding a loop heart intrinsic, most naturallyas:
; (A)%anchor=calltoken@llvm.experimental.convergence.anchor()for(;;) {; (B)%loop=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%anchor)]if(condition1){; (C)callvoid@convergent.op.1()["convergencectrl"(token%loop)]}; (D)if(condition2){; (E)callvoid@convergent.op.2()["convergencectrl"(token%loop)]}; (F)}; (G)
Let%loop(i;j)
be the dynamic instance ofj
-th execution of the loopheart intrinsic by threadi
, and analogously@op.k(i)
and@op.k(i)
the dynamic instances of the execution of@convergent.op.k
by threadi
.Then we have:
%loop(1;j)==%loop(2;j)
forj=1,2
because of the dynamic ruleabout loop heart intrinsics.%loop(i;1)!=%loop(i;2)
fori=1,2
because of the basic rule thatdifferent executions by the same thread happen in different dynamicinstances.@op.1(1)!=@op.1(2)
, since@op.1(1)
uses the token value of%loop
referring to%loop(1;1)
and@op.1(2)
uses thatreferring to%loop(2;2)==%loop(1;2)
, which is different from%loop(1;1)
.Similarly,
@op.2(1)!=@op.2(2)
.
However, loop heart intrinsics could be inserted differently, at the costof also inserting a free-standing anchor:
; (A)%anchor=calltoken@llvm.experimental.convergence.anchor()for(;;) {; (B)if(condition1){; (C)%loop=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%anchor)]callvoid@convergent.op.1()["convergencectrl"(token%loop)]}; (D)if(condition2){; (E)%free=calltoken@llvm.experimental.convergence.anchor()callvoid@convergent.op.2()["convergencectrl"(token%free)]}; (F)}; (G)
This leads to the “unnatural counting of loop iterations” that is also mentionedelsewhere. Let%loop(i)
be the dynamic instance of the execution of theloop heart intrinsic by threadi
(each thread executes it only once), andlet@op.k(i)
be as before. Then:
%loop(1)==%loop(2)
because of the dynamic rule about loop heartintrinsics.@op.1(1)==@op.1(2)
because@op.1(i)
uses the value of%loop
referring to%loop(i)
, and%loop(1)==%loop(2)
.Whether
@op.2(1)==@op.2(2)
is implementation-defined because of theuse of the%free
anchor intrinsic.In practice, they almost certainly have to be non-converged dynamicinstances. Consider that if an implementation strictly follows the order ofinstructions given in the program, the executions of the threads can be“aligned” as follows:
Thread 1: A B C D F B D E F GThread 2: A B D E F B C D F G
So then
@op.2(1)
physically executes later than@op.2(2)
and therecan be no communication between the threads, which means they executenon-converged dynamic instances.That said, it is conceivable that there aren’t actually any data or otherdependencies that would enforce this execution order. In that case, a highlyout-of-order implementation could potentially allow communication. That’swhy the rules defined in this document are silent about whether
@op.2(1)==@op.2(2)
or not.
This type of convergence control seems relatively unlikely to appear in realprograms. Its possibility is simply a logical consequence of the model.
An equivalent issue arises if the convergent operations are replaced by nestedloops with loop heart intrinsics that directly refer to%anchor
, hencethe variants of the static rules about cycles that apply to them:
; WARNING: Example of incorrect convergence control!%anchor=calltoken@llvm.experimental.convergence.anchor()for(;;) {if(condition1){for(;;) {%loop1=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%anchor)]}}if(condition2){for(;;) {%loop2=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%anchor)]}}}
There is a cycle (closed walk in the CFG) that goes through both loop heartintrinsics using%anchor
but not through the definition of%anchor
,so this code is invalid.
Examples for the Correctness of Program Transforms¶
(This section is informative.)
As implied by the rules in the previous sections, program transforms are correctwith respect to convergent operations if they preserve or refine theirsemantics. This means that the set of communicating threads in the transformedprogram must have been possible in the original program.
Program transforms with a single-threaded focus are generally conservativelycorrect if they do not sink or hoist convergent operations across a branch.This applies even to program transforms that change the control flow graph.
For example, unrolling a loop that does not contain convergent operationscannot break any of the guarantees required for convergent operations outsideof the loop.
Loop unrolling examples¶
We consider three kinds of loop unrolling here:
Partial unrolling with no known trip multiple, so a “tail” is required tocollect the remaining elements.
Partial unrolling by a trip multiple, so no “tail” is required.
Full unrolling, which eliminates the loop.
The first kind is forbidden when@llvm.experimental.convergence.loop
isused. We illustrate the reasoning with some examples.
First, an arbitrary loop that contains convergent operationscan be unrolledin all of these ways, even with “tail”, if all convergent operations refer backto an anchor inside the loop. For example (in pseudo-code):
while(counter>0){%tok=calltoken@llvm.experimental.convergence.anchor()callvoid@convergent.operation()["convergencectrl"(token%tok)]counter--;}
This can be unrolled to:
while(counter>=2){%tok=calltoken@llvm.experimental.convergence.anchor()callvoid@convergent.operation()["convergencectrl"(token%tok)]%tok=calltoken@llvm.experimental.convergence.anchor()callvoid@convergent.operation()["convergencectrl"(token%tok)]counter-=2;}while(counter>0){%tok=calltoken@llvm.experimental.convergence.anchor()callvoid@convergent.operation()["convergencectrl"(token%tok)]counter--;}
This is likely to change the behavior of the convergent operation if thereare threads whose initial counter value is not a multiple of 2. In particular,all threads with an odd trip count are now likely to execute the convergentoperation in their respective final iterations together because theunderlying implementation is likely to try to group as many threads togetheras possible for the execution of the “tail”.
This change is allowed because the anchor intrinsic has implementation-definedconvergence behavior and the loop unrolling transform is considered to be partof the implementation. Another way of reasoning is that while thelikelybehavior of the code has changed, theguarantees about its behavior haveremained the same.
If the loop contains uncontrolled convergent operations, this kind of unrollingis forbidden.
Unrolling a loop with convergent operations that refer to tokens producedoutside the loop is forbidden when a “tail” or “remainder” would have tobe introduced. Consider:
; (A)%outer=calltoken@llvm.experimental.convergence.anchor()while(counter>0){%inner=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%outer)]; (B)callvoid@convergent.operation()["convergencectrl"(token%inner)]counter--;}; (C)
To understand why unrolling is forbidden, consider two threads that executeconverged dynamic instances of the anchor and then proceed with 3 and 4 loopiterations, respectively:
Thread 1: A B B B CThread 2: A B B B B C
By the dynamic rule on loop heart intrinsics, these threads execute convergeddynamic instances of the loop intrinsic for the first 3 iterations, and thenthread 2 executes another dynamic instance by itself.
By the dynamic rule on general convergent operations, the threads executeconverged dynamic instances of the@convergent.operation
in the first 3iterations (that is, the dynamic instance executed by thread 1 in iterationn is the same as that executed by thread 2 in iterationn, forn = 1,2,3;the dynamic instance executed in iteration 1 is different from that initeration 2, etc.).
Now assume that the loop is unrolled by a factor of 2, which requires aremainder as follows:
; (A)%outer=calltoken@llvm.experimental.convergence.anchor()while(counter>=2){%inner=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%outer)]; (B)callvoid@convergent.operation()["convergencectrl"(token%inner)]callvoid@convergent.operation()["convergencectrl"(token%inner)]counter-=2;}; (C)if(counter>0){%remainder=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%outer)]; (D)callvoid@convergent.operation()["convergencectrl"(token%remainder)]}; (E)
First of all, note some interesting problems surrounding the loop intrinsic:
It isnot duplicated inside the unrolled loop. This is to comply withtheStatic Rules.
It is unclear whether the loop intrinsic ought to be duplicated in theremainder, or whether the final
@convergent.operation
in D should justrefer to either%inner
(which is possible in SSA form) or directly to%outer
. The decision made here is arbitrary and doesn’t change theargument that follows. Ultimately, it simply doesn’t matter because thetransform is incorrect either way.
The threads now execute the following sequences of blocks:
Thread 1: A B C D EThread 2: A B B C D E
Analogous to the argument above, they execute converged dynamic instances of the%inner
intrinsic and the@convergent.operation
in the first iterationof the unrolled loop, which corresponds to the first 2 iterations of theoriginal loop.
However, they execute different static calls to@convergent.operation
forthe 3rd iteration of the original loop. In thread 1, that iteration correspondsto the call in the remainder, while in thread 2 it corresponds to the firstcall to@convergent.operation
in the unrolled loop. Therefore, they executenon-converged dynamic instances, which means that the set of communicating threadsfor the 3rd iteration of the original loop is different. This is why theunrolling is incorrect.
On the other hand, unrolling without “tail” is allowed. For example, assumingthat the trip counter is known to be a multiple of 2, we can unroll the loopas follows:
%outer=calltoken@llvm.experimental.convergence.anchor()while(counter>0){%inner=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%outer)]callvoid@convergent.operation()["convergencectrl"(token%inner)]callvoid@convergent.operation()["convergencectrl"(token%inner)]counter-=2;}
Note again that the loop intrinsic is not duplicated.
Thellvm.experimental.convergence.loopintrinsic is typically expected to appear in the header of a natural loop.However, it can also appear in non-header blocks of a loop. In that case, theloop can generally not be unrolled.
Hoisting and sinking¶
In general, hoisting and sinking of convergent operations is forbidden. This isbecause moving the operation to a different point in control flow generallychanges the set of threads that reach the operation and therefore, the set ofthreads that execute converged dynamic instances of the operation. Bydefinition, this changes the set of threads that participate in thecommunication of the convergent operation, which will typically change itsresult.
There are a number of exceptions, though most of them require additionalknowledge.
For example, hoisting and sinking acrossuniform conditional branches – i.e.,conditional branches where within every possible relevant set of threads, allthreads will always take the same direction – is generally allowed. See the endof theexample of reductions inside control flow for a brief discussion.
Some convergent operations can be hoisted but not sunk, or vice versa. A simpleexample is thesubgroupShuffle(data,id)
operation. It returns thedata
operand of the thread identified byid
, where thread IDs are fixed andassigned to each thread at launch. The result is undefined (or perhaps there isUB, depending on the language and environment) if threadid
is not in thecommunicating set of threads. So hoisting is allowed in the followingpseudo-code example:
definevoid@example(...)convergent{%entry=calltoken@llvm.experimental.convergence.entry()%data=...%id=...if(condition){%shuffled=calli32@subgroupShuffle(i32%data,i32%id)["convergencectrl"(token%entry)]...}else{%shuffled=calli32@subgroupShuffle(i32%data,i32%id)["convergencectrl"(token%entry)]...}}
After hoisting the calls to@subgroupShuffle
, the communicating set ofthreads is the union of the two sets of threads in the original program, so%id
can only go “out of range” after hoisting if it did so in the originalprogram.
However, speculative execution of@subgroupShuffle
in the following programmay be forbidden:
definevoid@example(...)convergent{%entry=calltoken@llvm.experimental.convergence.entry()%data=...%id=...if(condition){%shuffled=calli32@subgroupShuffle(i32%data,i32%id)["convergencectrl"(token%entry)]...}}
There is no guarantee about the value of%id
in the threads wherecondition
is false. If@subgroupShuffle
is defined to have UB when%id
is outside of the set of communicating threads, then speculating andhoisting@subgroupShuffle
might introduce UB.
On the other hand, if@subgroupShuffle
is defined such that it merelyproduces an undefined value or poison as result when%id
is “out of range”,then speculating is okay.
Even thoughllvm.experimental.convergence.anchoris marked asconvergent
, it can be sunk in some cases. For example, inpseudo-code:
%tok=calltoken@llvm.experimental.convergence.anchor()if(condition){callvoid@convergent.operation()["convergencectrl"(token%tok)]}
Assuming that%tok
is only used inside the conditional block, the anchor canbe sunk. The rationale is two-fold. First, the anchor has implementation-definedbehavior, and the sinking is part of the implementation. Second, already in theoriginal program, the set of threads that communicates in the@convergent.operation
is automatically subset to the threads for whichcondition
is true.
Anchors can be hoisted in acyclic control flow. For example:
if(condition){%tok1=calltoken@llvm.experimental.convergence.anchor()callvoid@convergent.operation()["convergencectrl"(token%tok1)]}else{%tok2=calltoken@llvm.experimental.convergence.anchor()callvoid@convergent.operation()["convergencectrl"(token%tok2)]}
The anchors can be hoisted, resulting in:
%tok=calltoken@llvm.experimental.convergence.anchor()if(condition){callvoid@convergent.operation()["convergencectrl"(token%tok)]}else{callvoid@convergent.operation()["convergencectrl"(token%tok)]}
The behavior is unchanged, since each of the static convergent operations onlyever communicates with threads that have the samecondition
value.By contrast, hoisting the convergent operations themselves is forbidden.
Hoisting and sinking anchors out of and into loops is forbidden. For example:
for(;;) {%tok=calltoken@llvm.experimental.convergence.anchor()callvoid@convergent.operation()["convergencectrl"(token%tok)]}
Hoisting the anchor would make the program invalid according to the staticvalidity rules. Conversely:
%outer=calltoken@llvm.experimental.convergence.anchor()while(counter>0){%inner=calltoken@llvm.experimental.convergence.loop()["convergencectrl"(token%outer)]callvoid@convergent.operation()["convergencectrl"(token%inner)]counter--;}
The program would stay valid if the anchor was sunk into the loop, but itsbehavior could end up being different. If the anchor is inside the loop, theneach loop iteration has a new dynamic instance of the anchor, and the set ofthreads participating in those dynamic instances of the anchor could bedifferent in arbitrary implementation-defined ways. Via the dynamic rules aboutdynamic instances of convergent operations, this then implies that the set ofthreads executing@convergent.operation
could be different in each loopiteration in arbitrary implementation-defined ways.
Convergent operations can be sunk together with their anchor. Again inpseudo-code:
%tok=calltoken@llvm.experimental.convergence.anchor()%a=callT@pure.convergent.operation(...)["convergencectrl"(token%tok)]%b=callT@pure.convergent.operation(...)["convergencectrl"(token%tok)]if(condition){use(%a,%b)}
Assuming that%tok
,%a
, and%b
are only used inside the conditionalblock, all can be sunk together:
if(condition){%tok=calltoken@llvm.experimental.convergence.anchor()%a=callT@pure.convergent.operation(...)["convergencectrl"(token%tok)]%b=callT@pure.convergent.operation(...)["convergencectrl"(token%tok)]use(%a,%b)}
The rationale is that the anchor intrinsic has implementation-defined behavior,and the sinking transform is considered to be part of the implementation:the sinking will restrict the set of communicating threads to those for whichcondition
is true, but that could have happened in the original programanyway for some arbitrary other reason.
However, sinkingonly the convergent operation producing%b
would beincorrect. That would allow threads for whichcondition
is false tocommunicate at%a
, but not at%b
, which the original program doesn’tallow.
Note that the entry intrinsic behaves differently. Sinking the convergentoperations is forbidden in the following snippet:
%tok=calltoken@llvm.experimental.convergence.entry()%a=callT@pure.convergent.operation(...)["convergencectrl"(token%tok)]%b=callT@pure.convergent.operation(...)["convergencectrl"(token%tok)]if(condition){use(%a,%b)}