Support for AArch64 Scalable Matrix Extension in LLVM

1. Introduction

TheAArch64 SME ACLE provides a number ofattributes for users to control PSTATE.SM and PSTATE.ZA.TheAArch64 SME ABI describes the requirements forcalls between functions when at least one of those functions uses PSTATE.SM orPSTATE.ZA.

This document describes how the SME ACLE attributes map to LLVM IRattributes and how LLVM lowers these attributes to implement the rules andrequirements of the ABI.

Below we describe the LLVM IR attributes and their relation to the C/C++level ACLE attributes:

aarch64_pstate_sm_enabled

is used for functions with__arm_streaming

aarch64_pstate_sm_compatible

is used for functions with__arm_streaming_compatible

aarch64_pstate_sm_body

is used for functions with__arm_locally_streaming and isonly valid on function definitions (not declarations)

aarch64_new_za

is used for functions with__arm_new("za")

aarch64_in_za

is used for functions with__arm_in("za")

aarch64_out_za

is used for functions with__arm_out("za")

aarch64_inout_za

is used for functions with__arm_inout("za")

aarch64_preserves_za

is used for functions with__arm_preserves("za")

aarch64_expanded_pstate_za

is used for functions with__arm_new_za

Clang must ensure that the above attributes are added both to thefunction’s declaration/definition as well as to their call-sites. This isimportant for calls to attributed function pointers, where there is nodefinition or declaration available.

2. Handling PSTATE.SM

When changing PSTATE.SM the execution of FP/vector operations may be transferredto another processing element. This has three important implications:

  • The runtime SVE vector length may change.

  • The contents of FP/AdvSIMD/SVE registers are zeroed.

  • The set of allowable instructions changes.

This leads to certain restrictions on IR and optimizations. For example, itis undefined behaviour to share vector-length dependent state between functionsthat may operate with different values for PSTATE.SM. Front-ends must honourthese restrictions when generating LLVM IR.

Even though the runtime SVE vector length may change, for the purpose of LLVM IRand almost all parts of CodeGen we can assume that the runtime value forvscale does not. If we let the compiler insert the appropriatesmstartandsmstop instructions around call boundaries, then the effects on SVEstate can be mitigated. By limiting the state changes to a very brief windowaround the call we can control how the operations are scheduled and how livevalues remain preserved between state transitions.

In order to control PSTATE.SM at this level of granularity, we use function andcallsite attributes rather than intrinsics.

Restrictions on attributes

  • It is undefined behaviour to pass or return (pointers to) scalable vectorobjects to/from functions which may use a different SVE vector length.This includes functions with a non-streaming interface, but marked withaarch64_pstate_sm_body.

  • It is not allowed for a function to be decorated with bothaarch64_pstate_sm_compatible andaarch64_pstate_sm_enabled.

  • It is not allowed for a function to be decorated with more than one of thefollowing attributes:aarch64_new_za,aarch64_in_za,aarch64_out_za,aarch64_inout_za,aarch64_preserves_za.

These restrictions also apply in the higher level SME ACLE, which means we canemit diagnostics in Clang to signal users about incorrect behaviour.

Compiler inserted streaming-mode changes

The table below describes the transitions in PSTATE.SM the compiler has toaccount for when doing calls between functions with different attributes.In this table, we use the following abbreviations:

N

functions with a normal interface (PSTATE.SM=0 on entry, PSTATE.SM=0 onreturn)

S

functions with a Streaming interface (PSTATE.SM=1 on entry, PSTATE.SM=1on return)

SC

functions with a Streaming-Compatible interface (PSTATE.SM can beeither 0 or 1 on entry, and is unchanged on return).

Functions with__attribute__((arm_locally_streaming)) are excluded from thistable because for the caller the attribute is synonymous to ‘streaming’, andfor the callee it is merely an implementation detail that is explicitly notexposed to the caller.

Table 4Combinations of calls for functions with different attributes

From

To

Before call

After call

After exception

N

N

N

S

SMSTART

SMSTOP

N

SC

S

N

SMSTOP

SMSTART

SMSTART

S

S

SMSTART

S

SC

SMSTART

SC

N

If PSTATE.SM before call is 1,then SMSTOP

If PSTATE.SM before call is 1,then SMSTART

If PSTATE.SM before call is 1,then SMSTART

SC

S

If PSTATE.SM before call is 0,then SMSTART

If PSTATE.SM before call is 0,then SMSTOP

If PSTATE.SM before call is 1,then SMSTART

SC

SC

If PSTATE.SM before call is 1,then SMSTART

Because changing PSTATE.SM zeroes the FP/vector registers, it is best to emitthesmstart andsmstop instructions before register allocation, so thatthe register allocator can spill/reload registers around the mode change.

The compiler should also have sufficient information on which operations arepart of the call/function’s arguments/result and which operations are part ofthe function’s body, so that it can place the mode changes in exactly the rightposition. The suitable place to do this seems to be SelectionDAG, where it lowersthe call’s arguments/return values to implement the specified calling convention.SelectionDAG provides Chains and Glue to specify the order of operations and givepreliminary control over the instruction’s scheduling.

Example of preserving state

When passing and returning afloat value to/from a functionthat has a streaming interface from a function that has a normal interface, thecall-site will need to ensure that the argument/result registers are preservedand that no other code is scheduled in between thesmstart/smstop and the call.

definefloat@foo(float%f)nounwind{%res=callfloat@bar(float%f)"aarch64_pstate_sm_enabled"retfloat%res}declarefloat@bar(float)"aarch64_pstate_sm_enabled"

The program needs to preserve the value of the floating point argument andreturn value in registers0:

foo:                                    // @foo// %bb.0:        stp     d15, d14, [sp, #-80]!           // 16-byte Folded Spill        stp     d13, d12, [sp, #16]             // 16-byte Folded Spill        stp     d11, d10, [sp, #32]             // 16-byte Folded Spill        stp     d9, d8, [sp, #48]               // 16-byte Folded Spill        str     x30, [sp, #64]                  // 8-byte Folded Spill        str     s0, [sp, #76]                   // 4-byte Folded Spill        smstart sm        ldr     s0, [sp, #76]                   // 4-byte Folded Reload        bl      bar        str     s0, [sp, #76]                   // 4-byte Folded Spill        smstop  sm        ldp     d9, d8, [sp, #48]               // 16-byte Folded Reload        ldp     d11, d10, [sp, #32]             // 16-byte Folded Reload        ldp     d13, d12, [sp, #16]             // 16-byte Folded Reload        ldr     s0, [sp, #76]                   // 4-byte Folded Reload        ldr     x30, [sp, #64]                  // 8-byte Folded Reload        ldp     d15, d14, [sp], #80             // 16-byte Folded Reload        ret

Setting the correct register masks on the ISD nodes and inserting thesmstart/smstop in the right places should ensure this is done correctly.

Instruction Selection Nodes

AArch64ISD::SMSTART Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]AArch64ISD::SMSTOP  Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]

TheSMSTART/SMSTOP nodes takeCurrentState andExpectedState operand forthe case of a conditional SMSTART/SMSTOP. The instruction will only be executedif CurrentState != ExpectedState.

WhenCurrentState andExpectedState can be evaluated at compile-time(i.e. they are both constants) then an unconditionalsmstart/smstopinstruction is emitted. Otherwise the node is matched to a Pseudo instructionwhich expands to a compare/branch and asmstart/smstop. This is necessary toimplement transitions fromSC->N andSC->S.

Unchained Function calls

When a function with “aarch64_pstate_sm_enabled” calls a function that is notstreaming compatible, the compiler has to insert a SMSTOP before the call andinsert a SMSTOP after the call.

If the function that is called is an intrinsic with no side-effects which inturn is lowered to a function call (e.g.@llvm.cos()), then the call to@llvm.cos() is not part of any Chain; it can be scheduled freely.

Lowering of a Callsite creates a small chain of nodes which:

  • starts a call sequence

  • copies input values from virtual registers to physical registers specified bythe ABI

  • executes a branch-and-link

  • stops the call sequence

  • copies the output values from their physical registers to virtual registers

When the callsite’s Chain is not used, only the result value from the chainedsequence is used, but the Chain itself is discarded.

TheSMSTART andSMSTOP ISD nodes return a Chain, but no realvalues, so when theSMSTART/SMSTOP nodes are part of a Chain that isn’tused, these nodes are not considered for scheduling and areremoved from the DAG. In order to prevent these nodesfrom being removed, we need a way to ensure the results from theCopyFromReg can only beused after theSMSTART/SMSTOP has beenexecuted.

We can use a CopyToReg -> CopyFromReg sequence for this, which moves thevalue to/from a virtual register and chains these nodes with theSMSTART/SMSTOP to make them part of the expression that calculatesthe result value. The resulting COPY nodes are removed by the registerallocator.

The example below shows how this is used in a DAG that does not linktogether the result by a Chain, but rather by a value:

            t0: ch,glue = AArch64ISD::SMSTOP ...          t1: ch,glue = ISD::CALL ....        t2: res,ch,glue = CopyFromReg t1, ...      t3: ch,glue = AArch64ISD::SMSTART t2:1, ....   <- this is now part of the expression that returns the result value.    t4: ch = CopyToReg t3, Register:f64 %vreg, t2  t5: res,ch = CopyFromReg t4, Register:f64 %vregt6: res = FADD t5, t9

We also need this for locally streaming functions, where anSMSTART needs tobe inserted into the DAG at the start of the function.

Functions with __attribute__((arm_locally_streaming))

If a function is marked asarm_locally_streaming, then the runtime SVEvector length in the prologue/epilogue may be different from the vector lengthin the function’s body. This happens because we invoke smstart after setting upthe stack-frame and similarly invoke smstop before deallocating the stack-frame.

To ensure we use the correct SVE vector length to allocate the locals with, wecan use the streaming vector-length to allocate the stack-slots through theADDSVL instruction, even when the CPU is not yet in streaming mode.

This only works for locals and not callee-save slots, since LLVM doesn’t supportmixing two different scalable vector lengths in one stack frame. That means that thecase where a function is markedarm_locally_streaming and needs to spill SVEcallee-saves in the prologue is currently unsupported. However, it is unlikelyfor this to happen without user intervention, becausearm_locally_streamingfunctions cannot take or return vector-length-dependent values. This would otherwiserequire forcing both the SVE PCS using ‘aarch64_sve_pcs’ combined with usingarm_locally_streaming in order to encounter this problem. This combinationcan be prevented in Clang through emitting a diagnostic.

An example of how the prologue/epilogue would look for a function that isattributed witharm_locally_streaming:

#define N 64void__attribute__((arm_streaming_compatible))some_use(svfloat32_t*);// Use a float argument type, to check the value isn't clobbered by smstart.// Use a float return type to check the value isn't clobbered by smstop.float__attribute__((noinline,arm_locally_streaming))foo(floatarg){// Create local for SVE vector to check local is created with correct// size when not yet in streaming mode (ADDSVL).floatarray[N];svfloat32_tvector;some_use(&vector);svst1_f32(svptrue_b32(),&array[0],vector);returnarray[N-1]+arg;}

should use ADDSVL for allocating the stack space and should avoid clobberingthe return/argument values.

_Z3foof:                                // @_Z3foof// %bb.0:                               // %entry        stp     d15, d14, [sp, #-96]!           // 16-byte Folded Spill        stp     d13, d12, [sp, #16]             // 16-byte Folded Spill        stp     d11, d10, [sp, #32]             // 16-byte Folded Spill        stp     d9, d8, [sp, #48]               // 16-byte Folded Spill        stp     x29, x30, [sp, #64]             // 16-byte Folded Spill        add     x29, sp, #64        str     x28, [sp, #80]                  // 8-byte Folded Spill        addsvl  sp, sp, #-1        sub     sp, sp, #256        str     s0, [x29, #28]                  // 4-byte Folded Spill        smstart sm        sub     x0, x29, #64        addsvl  x0, x0, #-1        bl      _Z10some_usePu13__SVFloat32_t        sub     x8, x29, #64        ptrue   p0.s        ld1w    { z0.s }, p0/z, [x8, #-1, mul vl]        ldr     s1, [x29, #28]                  // 4-byte Folded Reload        st1w    { z0.s }, p0, [sp]        ldr     s0, [sp, #252]        fadd    s0, s0, s1        str     s0, [x29, #28]                  // 4-byte Folded Spill        smstop  sm        ldr     s0, [x29, #28]                  // 4-byte Folded Reload        addsvl  sp, sp, #1        add     sp, sp, #256        ldp     x29, x30, [sp, #64]             // 16-byte Folded Reload        ldp     d9, d8, [sp, #48]               // 16-byte Folded Reload        ldp     d11, d10, [sp, #32]             // 16-byte Folded Reload        ldp     d13, d12, [sp, #16]             // 16-byte Folded Reload        ldr     x28, [sp, #80]                  // 8-byte Folded Reload        ldp     d15, d14, [sp], #96             // 16-byte Folded Reload        ret

Preventing the use of illegal instructions in Streaming Mode

  • When executing a program in streaming-mode (PSTATE.SM=1) a subset of SVE/SVE2instructions and most AdvSIMD/NEON instructions are invalid.

  • When executing a program in normal mode (PSTATE.SM=0), a subset of SMEinstructions are invalid.

  • Streaming-compatible functions must only use instructions that are valid wheneither PSTATE.SM=0 or PSTATE.SM=1.

The value of PSTATE.SM is not controlled by the feature flags, but rather by thefunction attributes. This means that we can compile for ‘+sme’ and the compilerwill code-generate any instructions, even if they are not legal under the requestedstreaming mode. The compiler needs to use the function attributes to ensure thecompiler doesn’t do transformations under the assumption that certain operationsare available at runtime.

We made a conscious choice not to model this with feature flags, because westill want to support inline-asm in either mode (with the user placingsmstart/smstop manually), and this became rather complicated to implement at theindividual instruction level (seeD120261andD121208) because of limitations inTableGen.

As a first step, this means we’ll disable vectorization (LoopVectorize/SLP)entirely when the a function has either of theaarch64_pstate_sm_enabled,aarch64_pstate_sm_body oraarch64_pstate_sm_compatible attributes,in order to avoid the use of vector instructions.

Later on we’ll aim to relax these restrictions to enable scalableauto-vectorization with a subset of streaming-compatible instructions, but thatrequires changes to the CostModel, Legalization and SelectionDAG lowering.

We will also emit diagnostics in Clang to prevent the use ofnon-streaming(-compatible) operations, e.g. through ACLE intrinsics, when afunction is decorated with the streaming mode attributes.

Other things to consider

  • Inlining must be disabled when the call-site needs to toggle PSTATE.SM orwhen the callee’s function body is executed in a different streaming mode thanits caller. This is needed because function calls are the boundaries forstreaming mode changes.

  • Tail call optimization must be disabled when the call-site needs to togglePSTATE.SM, such that the caller can restore the original value of PSTATE.SM.

3. Handling PSTATE.ZA

In contrast to PSTATE.SM, enabling PSTATE.ZA does not affect the SVE vectorlength and also doesn’t clobber FP/AdvSIMD/SVE registers. This means it is safeto toggle PSTATE.ZA using intrinsics. This also makes it simpler to setup alazy-save mechanism for calls to private-ZA functions (i.e. functions that mayeither directly or indirectly clobber ZA state).

For the purpose of handling functions marked withaarch64_new_za,we have introduced a new LLVM IR pass (SMEABIPass) that is run just beforeSelectionDAG. Any such functions dealt with by this pass are marked withaarch64_expanded_pstate_za.

Setting up a lazy-save

Committing a lazy-save

Exception handling and ZA

4. Types

AArch64 Predicate-as-Counter Type

Overview:

The predicate-as-counter type represents the type of a predicate-as-countervalue held in a AArch64 SVE predicate register. Such a value containsinformation about the number of active lanes, the element width and a bit thattells whether the generated mask should be inverted. ACLE intrinsics should beused to move the predicate-as-counter value to/from a predicate vector.

There are certain limitations on the type:

  • The type can be used for function parameters and return values.

  • The supported LLVM operations on this type are limited toload,store,phi,select andalloca instructions.

The predicate-as-counter type is a scalable type.

Syntax:

target("aarch64.svcount")

5. References

  1. SME ACLE Pull-request

  2. SME ABI Pull-request