The AMDGPU backend has a pass that transforms kernels so that firmware can preload kernel arguments into SGPRs instead of loading them from the kernel argument segment. This pass can improve kernel latency, but it cannot preload struct-type kernel arguments.

This patch adds a pass to the AMDGPU backend that splits and flattens struct-type kernel arguments so that later passes can preload them into SGPRs.

In short, the pass collects the load and GEP/load instructions whose pointer operand is a struct-type kernel argument and turns each loaded value into a new kernel argument. If all uses of a struct-type kernel argument can be replaced, the pass performs the replacement, creates a new kernel with the new signature, and rewrites all instructions of the old kernel to use the new arguments. It also adds a function attribute that encodes the mapping from each new kernel argument index to the original kernel argument index and offset. The metadata streamer generates kernel argument metadata from this attribute, and the runtime processes the kernel arguments according to that metadata.

The pass is disabled by default and can be enabled with the LLVM option `-amdgpu-enable-split-kernel-args`.
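To make the mapping attribute concrete, here is a minimal standalone sketch of its string format. The patch's own comment describes the format as "NewArgIndex:OriginalArgIndex:Offset,..."; everything else here (the `ArgMapping` struct, the `encodeMappings` and `parseMappings` helpers, and the use of the C++ standard library instead of LLVM's `StringRef`) is an assumption made for illustration and is not code from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one entry of the mapping attribute.
struct ArgMapping {
  unsigned NewArgIndex;  // index of the argument in the new kernel signature
  unsigned OrigArgIndex; // index of the argument it was derived from
  uint64_t Offset;       // byte offset inside the original argument
};

// Encode entries as "New:Orig:Offset" joined by commas.
std::string encodeMappings(const std::vector<ArgMapping> &Mappings) {
  std::string Out;
  for (const ArgMapping &M : Mappings) {
    if (!Out.empty())
      Out += ",";
    Out += std::to_string(M.NewArgIndex) + ":" +
           std::to_string(M.OrigArgIndex) + ":" + std::to_string(M.Offset);
  }
  return Out;
}

// Parse the string back; entries that do not match the format are skipped.
std::vector<ArgMapping> parseMappings(const std::string &Str) {
  std::vector<ArgMapping> Result;
  std::istringstream Entries(Str);
  std::string Entry;
  while (std::getline(Entries, Entry, ',')) {
    std::istringstream Fields(Entry);
    ArgMapping M{};
    char Sep1 = 0, Sep2 = 0;
    if (Fields >> M.NewArgIndex >> Sep1 >> M.OrigArgIndex >> Sep2 >> M.Offset &&
        Sep1 == ':' && Sep2 == ':')
      Result.push_back(M);
  }
  return Result;
}
```

For example, a kernel whose first argument is a struct with two used fields at byte offsets 0 and 8, followed by one unsplit argument, would under these assumptions be described by three entries: new argument 0 and 1 both point at original argument 0, and new argument 2 points at original argument 1.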

yxsamliu requested review from arsenm, kerbowa, shiltian, rampitec, b-sumner and kzhuravl

March 31, 2025 19:57

llvmbot added the backend:AMDGPU label

Mar 31, 2025

llvmbot commented Mar 31, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

Changes

This patch adds a pass to the AMDGPU backend that splits and flattens struct-type kernel arguments so that later passes can preload them into SGPRs.

In short, the pass collects the load and GEP/load instructions whose pointer operand is a struct-type kernel argument and turns each loaded value into a new kernel argument. If all uses of a struct-type kernel argument can be replaced, the pass performs the replacement, creates a new kernel with the new signature, and rewrites all instructions of the old kernel to use the new arguments. It also adds a function attribute that encodes the mapping from each new kernel argument index to the original kernel argument index and offset. The metadata streamer generates kernel argument metadata from this attribute, and the runtime processes the kernel arguments according to that metadata.

The pass is disabled by default and can be enabled with the LLVM option `-amdgpu-enable-split-kernel-args`.

Patch is 27.59 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/133786.diff

9 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+9)
(modified) llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp (+41-2)
(modified) llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h (+2-1)
(modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
(added) llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp (+372)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8-1)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll (+120)
(modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+4)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index a8e4ea9429f50..777390b99c0cc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -125,6 +125,15 @@ struct AMDGPUPromoteKernelArgumentsPass
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
 };
 
+ModulePass *createAMDGPUSplitKernelArgumentsPass();
+void initializeAMDGPUSplitKernelArgumentsPass(PassRegistry &);
+extern char &AMDGPUSplitKernelArgumentsID;
+
+struct AMDGPUSplitKernelArgumentsPass
+    : PassInfoMixin<AMDGPUSplitKernelArgumentsPass> {
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
 ModulePass *createAMDGPULowerKernelAttributesPass();
 void initializeAMDGPULowerKernelAttributesPass(PassRegistry &);
 extern char &AMDGPULowerKernelAttributesID;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
index 2991778a1bbc7..d54828e225e12 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
@@ -357,17 +357,50 @@ void MetadataStreamerMsgPackV4::emitKernelArg(const Argument &Arg,
   Align ArgAlign;
   std::tie(ArgTy, ArgAlign) = getArgumentTypeAlign(Arg, DL);
 
+  // Assuming the argument is not split from struct-type argument by default,
+  // unless we find it in function attribute amdgpu-argument-mapping.
+  unsigned OriginalArgIndex = ~0U;
+  uint64_t OriginalArgOffset = 0;
+  if (Func->hasFnAttribute("amdgpu-argument-mapping")) {
+    StringRef MappingStr = Func->getFnAttribute("amdgpu-argument-mapping").getValueAsString();
+    SmallVector<StringRef, 8> Mappings;
+    MappingStr.split(Mappings, ',');
+    for (const StringRef &Mapping : Mappings) {
+      SmallVector<StringRef, 3> Elements;
+      Mapping.split(Elements, ':');
+      if (Elements.size() != 3)
+        continue;
+
+      unsigned NewArgIndex = 0;
+      unsigned OrigArgIndex = 0;
+      uint64_t OffsetValue = 0;
+      if (Elements[0].getAsInteger(10, NewArgIndex))
+        continue;
+      if (Elements[1].getAsInteger(10, OrigArgIndex))
+        continue;
+      if (Elements[2].getAsInteger(10, OffsetValue))
+        continue;
+
+      if (NewArgIndex == ArgNo) {
+        OriginalArgIndex = OrigArgIndex;
+        OriginalArgOffset = OffsetValue;
+        break;
+      }
+    }
+  }
+
   emitKernelArg(DL, ArgTy, ArgAlign,
                 getValueKind(ArgTy, TypeQual, BaseTypeName), Offset, Args,
                 PointeeAlign, Name, TypeName, BaseTypeName, ActAccQual,
-                AccQual, TypeQual);
+                AccQual, TypeQual, OriginalArgIndex, OriginalArgOffset);
 }
 
 void MetadataStreamerMsgPackV4::emitKernelArg(
     const DataLayout &DL, Type *Ty, Align Alignment, StringRef ValueKind,
     unsigned &Offset, msgpack::ArrayDocNode Args, MaybeAlign PointeeAlign,
     StringRef Name, StringRef TypeName, StringRef BaseTypeName,
-    StringRef ActAccQual, StringRef AccQual, StringRef TypeQual) {
+    StringRef ActAccQual, StringRef AccQual, StringRef TypeQual,
+    unsigned OriginalArgIndex, uint64_t OriginalArgOffset) {
   auto Arg = Args.getDocument()->getMapNode();
 
   if (!Name.empty())
@@ -409,6 +442,12 @@ void MetadataStreamerMsgPackV4::emitKernelArg(
       Arg[".is_pipe"] = Arg.getDocument()->getNode(true);
   }
 
+  // Add original argument index and offset to the metadata
+  if (OriginalArgIndex != ~0U) {
+    Arg[".original_arg_index"] = Arg.getDocument()->getNode(OriginalArgIndex);
+    Arg[".original_arg_offset"] = Arg.getDocument()->getNode(OriginalArgOffset);
+  }
+
   Args.push_back(Arg);
 }
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
index 22dfcb4a4ec1d..312a1747f5c1d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
@@ -116,7 +116,8 @@ class LLVM_EXTERNAL_VISIBILITY MetadataStreamerMsgPackV4
                      MaybeAlign PointeeAlign = std::nullopt,
                      StringRef Name = "", StringRef TypeName = "",
                      StringRef BaseTypeName = "", StringRef ActAccQual = "",
-                     StringRef AccQual = "", StringRef TypeQual = "");
+                     StringRef AccQual = "", StringRef TypeQual = "",
+                     unsigned OriginalArgIndex = ~0U, uint64_t OriginalArgOffset = 0);
 
   void emitHiddenKernelArgs(const MachineFunction &MF, unsigned &Offset,
                             msgpack::ArrayDocNode Args) override;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 6a45392b5f099..094346670811c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -29,6 +29,7 @@ MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-remove-incompatible-functions", AMDGPURemoveIncompatibleFunctionsPass(*this))
 MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass(*this))
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
+MODULE_PASS("amdgpu-split-kernel-arguments", AMDGPUSplitKernelArgumentsPass())
 #undef MODULE_PASS
 
 #ifndef MODULE_PASS_WITH_PARAMS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp
new file mode 100644
index 0000000000000..4a025e1806070
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp
@@ -0,0 +1,372 @@
+//===--- AMDGPUSplitKernelArguments.cpp - Split kernel arguments ----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// \file This pass flats struct-type kernel arguments. It eliminates unused
+// fields and only keeps used fields. The objective is to facilitate preloading
+// of kernel arguments by later passes.
+//
+//===----------------------------------------------------------------------===//
+#include "AMDGPU.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Module.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+
+#define DEBUG_TYPE "amdgpu-split-kernel-arguments"
+
+using namespace llvm;
+
+namespace {
+static llvm::cl::opt<bool> EnableSplitKernelArgs(
+    "amdgpu-enable-split-kernel-args",
+    llvm::cl::desc("Enable splitting of AMDGPU kernel arguments"),
+    llvm::cl::init(false));
+
+class AMDGPUSplitKernelArguments : public ModulePass {
+public:
+  static char ID;
+
+  AMDGPUSplitKernelArguments() : ModulePass(ID) {}
+
+  bool runOnModule(Module &M) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.setPreservesCFG();
+  }
+
+private:
+  bool processFunction(Function &F);
+};
+
+} // end anonymous namespace
+
+bool AMDGPUSplitKernelArguments::processFunction(Function &F) {
+  const DataLayout &DL = F.getParent()->getDataLayout();
+  LLVM_DEBUG(dbgs() << "Entering AMDGPUSplitKernelArguments::processFunction "
+                    << F.getName() << '\n');
+  if (F.isDeclaration()) {
+    LLVM_DEBUG(dbgs() << "Function is a declaration, skipping\n");
+    return false;
+  }
+
+  CallingConv::ID CC = F.getCallingConv();
+  if (CC != CallingConv::AMDGPU_KERNEL || F.arg_empty()) {
+    LLVM_DEBUG(dbgs() << "non-kernel or arg_empty\n");
+    return false;
+  }
+
+  SmallVector<std::tuple<unsigned, unsigned, uint64_t>, 8> NewArgMappings;
+  DenseMap<Argument *, SmallVector<LoadInst *, 8>> ArgToLoadsMap;
+  DenseMap<Argument *, SmallVector<GetElementPtrInst *, 8>> ArgToGEPsMap;
+  SmallVector<Argument *, 8> StructArgs;
+  SmallVector<Type *, 8> NewArgTypes;
+
+  auto convertAddressSpace = [](Type *Ty) -> Type * {
+    if (auto *PtrTy = dyn_cast<PointerType>(Ty)) {
+      if (PtrTy->getAddressSpace() == AMDGPUAS::FLAT_ADDRESS) {
+        return PointerType::get(PtrTy->getContext(), AMDGPUAS::GLOBAL_ADDRESS);
+      }
+    }
+    return Ty;
+  };
+
+  // Collect struct arguments and new argument types
+  unsigned OriginalArgIndex = 0;
+  unsigned NewArgIndex = 0;
+  for (Argument &Arg : F.args()) {
+    LLVM_DEBUG(dbgs() << "Processing argument: " << Arg << "\n");
+    if (Arg.use_empty()) {
+      NewArgTypes.push_back(convertAddressSpace(Arg.getType()));
+      NewArgMappings.push_back(
+          std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      LLVM_DEBUG(dbgs() << "use empty\n");
+      continue;
+    }
+
+    PointerType *PT = dyn_cast<PointerType>(Arg.getType());
+    if (!PT) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not a pointer\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    const bool IsByRef = Arg.hasByRefAttr();
+    if (!IsByRef) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not byref\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    Type *ArgTy = Arg.getParamByRefType();
+    StructType *ST = dyn_cast<StructType>(ArgTy);
+    if (!ST) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not a struct\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    bool AllLoadsOrGEPs = true;
+    SmallVector<LoadInst *, 8> Loads;
+    SmallVector<GetElementPtrInst *, 8> GEPs;
+    for (User *U : Arg.users()) {
+      LLVM_DEBUG(dbgs() << "  User: " << *U << "\n");
+      if (auto *LI = dyn_cast<LoadInst>(U)) {
+        Loads.push_back(LI);
+      } else if (auto *GEP = dyn_cast<GetElementPtrInst>(U)) {
+        GEPs.push_back(GEP);
+        for (User *GEPUser : GEP->users()) {
+          LLVM_DEBUG(dbgs() << "    GEP User: " << *GEPUser << "\n");
+          if (auto *GEPLoad = dyn_cast<LoadInst>(GEPUser)) {
+            Loads.push_back(GEPLoad);
+          } else {
+            AllLoadsOrGEPs = false;
+            break;
+          }
+        }
+      } else {
+        AllLoadsOrGEPs = false;
+        break;
+      }
+      if (!AllLoadsOrGEPs)
+        break;
+    }
+    LLVM_DEBUG(dbgs() << "  AllLoadsOrGEPs: "
+                      << (AllLoadsOrGEPs ? "true" : "false") << "\n");
+
+    if (AllLoadsOrGEPs) {
+      StructArgs.push_back(&Arg);
+      ArgToLoadsMap[&Arg] = Loads;
+      ArgToGEPsMap[&Arg] = GEPs;
+      for (LoadInst *LI : Loads) {
+        Type *NewType = convertAddressSpace(LI->getType());
+        NewArgTypes.push_back(NewType);
+
+        // Compute offset
+        uint64_t Offset = 0;
+        if (auto *GEP = dyn_cast<GetElementPtrInst>(LI->getPointerOperand())) {
+          APInt OffsetAPInt(DL.getPointerSizeInBits(), 0);
+          if (GEP->accumulateConstantOffset(DL, OffsetAPInt))
+            Offset = OffsetAPInt.getZExtValue();
+        }
+
+        // Map each new argument to the original argument index and offset
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, Offset));
+        ++NewArgIndex;
+      }
+    } else {
+      NewArgTypes.push_back(convertAddressSpace(Arg.getType()));
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+    }
+    ++OriginalArgIndex;
+  }
+
+  if (StructArgs.empty())
+    return false;
+
+  // Collect function and return attributes
+  AttributeList OldAttrs = F.getAttributes();
+  AttributeSet FnAttrs = OldAttrs.getFnAttrs();
+  AttributeSet RetAttrs = OldAttrs.getRetAttrs();
+
+  // Create new function type
+  FunctionType *NewFT =
+      FunctionType::get(F.getReturnType(), NewArgTypes, F.isVarArg());
+  Function *NewF =
+      Function::Create(NewFT, F.getLinkage(), F.getAddressSpace(), F.getName());
+  F.getParent()->getFunctionList().insert(F.getIterator(), NewF);
+  NewF->takeName(&F);
+  NewF->setVisibility(F.getVisibility());
+  if (F.hasComdat())
+    NewF->setComdat(F.getComdat());
+  NewF->setDSOLocal(F.isDSOLocal());
+  NewF->setUnnamedAddr(F.getUnnamedAddr());
+  NewF->setCallingConv(F.getCallingConv());
+
+  // Build new parameter attributes
+  SmallVector<AttributeSet, 8> NewArgAttrSets;
+  NewArgIndex = 0;
+  for (Argument &Arg : F.args()) {
+    if (ArgToLoadsMap.count(&Arg)) {
+      for (LoadInst *LI : ArgToLoadsMap[&Arg]) {
+        (void)LI;
+        NewArgAttrSets.push_back(AttributeSet());
+        ++NewArgIndex;
+      }
+    } else {
+      AttributeSet ArgAttrs = OldAttrs.getParamAttrs(Arg.getArgNo());
+      NewArgAttrSets.push_back(ArgAttrs);
+      ++NewArgIndex;
+    }
+  }
+
+  // Build the new AttributeList
+  AttributeList NewAttrList =
+      AttributeList::get(F.getContext(), FnAttrs, RetAttrs, NewArgAttrSets);
+  NewF->setAttributes(NewAttrList);
+
+  // Add the mapping of the new arguments to the old arguments as a function
+  // attribute in the format "NewArgIndex:OriginalArgIndex:Offset,..."
+  std::string MappingStr;
+  for (const auto &Info : NewArgMappings) {
+    unsigned NewArgIdx, OrigArgIdx;
+    uint64_t Offset;
+    std::tie(NewArgIdx, OrigArgIdx, Offset) = Info;
+
+    if (!MappingStr.empty())
+      MappingStr += ",";
+    MappingStr += std::to_string(NewArgIdx) + ":" + std::to_string(OrigArgIdx) +
+                  ":" + std::to_string(Offset);
+  }
+
+  NewF->addFnAttr("amdgpu-argument-mapping", MappingStr);
+
+  LLVM_DEBUG(dbgs() << "New empty function:\n" << *NewF << '\n');
+
+  NewF->splice(NewF->begin(), &F);
+
+  // Map old arguments and loads to new arguments
+  DenseMap<Value *, Value *> VMap;
+  auto NewArgIt = NewF->arg_begin();
+  for (Argument &Arg : F.args()) {
+    if (ArgToLoadsMap.count(&Arg)) {
+      for (LoadInst *LI : ArgToLoadsMap[&Arg]) {
+        std::string OldName = LI->getName().str();
+        LI->setName(OldName + ".old");
+        NewArgIt->setName(OldName);
+        Value *NewArg = &*NewArgIt++;
+        if (isa<PointerType>(NewArg->getType()) &&
+            isa<PointerType>(LI->getType())) {
+          IRBuilder<> Builder(LI);
+          Value *CastedArg = Builder.CreatePointerBitCastOrAddrSpaceCast(
+              NewArg, LI->getType());
+          VMap[LI] = CastedArg;
+        } else {
+          VMap[LI] = NewArg;
+        }
+      }
+      UndefValue *UndefArg = UndefValue::get(Arg.getType());
+      Arg.replaceAllUsesWith(UndefArg);
+    } else {
+      std::string OldName = Arg.getName().str();
+      Arg.setName(OldName + ".old");
+      NewArgIt->setName(OldName);
+      Value *NewArg = &*NewArgIt;
+      if (isa<PointerType>(NewArg->getType()) &&
+          isa<PointerType>(Arg.getType())) {
+        IRBuilder<> Builder(&*NewF->begin()->begin());
+        Value *CastedArg =
+            Builder.CreatePointerBitCastOrAddrSpaceCast(NewArg, Arg.getType());
+        Arg.replaceAllUsesWith(CastedArg);
+      } else {
+        Arg.replaceAllUsesWith(NewArg);
+      }
+      ++NewArgIt;
+    }
+  }
+
+  // Replace LoadInsts with new arguments
+  for (auto &Entry : ArgToLoadsMap) {
+    for (LoadInst *LI : Entry.second) {
+      Value *NewArg = VMap[LI];
+      LI->replaceAllUsesWith(NewArg);
+      LI->eraseFromParent();
+    }
+  }
+
+  // Erase GEPs
+  for (auto &Entry : ArgToGEPsMap) {
+    for (GetElementPtrInst *GEP : Entry.second) {
+      if (GEP->use_empty()) {
+        GEP->eraseFromParent();
+      } else {
+        GEP->replaceAllUsesWith(UndefValue::get(GEP->getType()));
+        GEP->eraseFromParent();
+      }
+    }
+  }
+
+  LLVM_DEBUG(dbgs() << "New function after transformation:\n" << *NewF << '\n');
+
+  F.replaceAllUsesWith(NewF);
+  F.eraseFromParent();
+
+  return true;
+}
+
+bool AMDGPUSplitKernelArguments::runOnModule(Module &M) {
+  if (!EnableSplitKernelArgs)
+    return false;
+  bool Changed = false;
+  SmallVector<Function *, 16> FunctionsToProcess;
+
+  for (Function &F : M) {
+    if (F.isDeclaration())
+      continue;
+    FunctionsToProcess.push_back(&F);
+  }
+
+  for (Function *F : FunctionsToProcess) {
+    if (F->isDeclaration())
+      continue;
+    Changed |= processFunction(*F);
+  }
+
+  return Changed;
+}
+
+INITIALIZE_PASS_BEGIN(AMDGPUSplitKernelArguments, DEBUG_TYPE,
+                      "AMDGPU Split Kernel Arguments", false, false)
+INITIALIZE_PASS_END(AMDGPUSplitKernelArguments, DEBUG_TYPE,
+                    "AMDGPU Split Kernel Arguments", false, false)
+
+char AMDGPUSplitKernelArguments::ID = 0;
+
+ModulePass *llvm::createAMDGPUSplitKernelArgumentsPass() {
+  return new AMDGPUSplitKernelArguments();
+}
+
+PreservedAnalyses AMDGPUSplitKernelArgumentsPass::run(Module &M, ModuleAnalysisManager &AM) {
+  AMDGPUSplitKernelArguments Splitter;
+  bool Changed = Splitter.runOnModule(M);
+
+  if (!Changed)
+    return PreservedAnalyses::all();
+
+  return PreservedAnalyses::none();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 4937b434bc955..f5bb925a95b54 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -517,6 +517,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPUAtomicOptimizerPass(*PR);
   initializeAMDGPULowerKernelArgumentsPass(*PR);
   initializeAMDGPUPromoteKernelArgumentsPass(*PR);
+  initializeAMDGPUSplitKernelArgumentsPass(*PR);
   initializeAMDGPULowerKernelAttributesPass(*PR);
   initializeAMDGPUExportKernelRuntimeHandlesLegacyPass(*PR);
   initializeAMDGPUPostLegalizerCombinerPass(*PR);
@@ -876,8 +877,10 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
                                            OptimizationLevel Level,
                                            ThinOrFullLTOPhase Phase) {
     if (Level != OptimizationLevel::O0) {
-      if (!isLTOPreLink(Phase))
+      if (!isLTOPreLink(Phase)) {
+        MPM.addPass(AMDGPUSplitKernelArgumentsPass());
         MPM.addPass(AMDGPUAttributorPass(*this));
+      }
     }
   });
@@ -896,6 +899,7 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
             PM.addPass(InternalizePass(mustPreserveGV));
             PM.addPass(GlobalDCEPass());
           }
+          PM.addPass(AMDGPUSplitKernelArgumentsPass());
           if (EnableAMDGPUAttributor) {
             AMDGPUAttributorOptions Opt;
             if (HasClosedWorldAssumption)
@@ -1237,6 +1241,9 @@ void AMDGPUPassConfig::addIRPasses() {
     addPass(createAMDGPULowerModuleLDSLegacyPass(&TM));
   }
 
+  if (TM.getOptLevel() > CodeGenOptLevel::None) {
+    addPass(createAMDGPUSplitKernelArgumentsPass());
+  }
   if (TM.getOptLevel() > CodeGenOptLevel::None)
     addPass(createInferAddressSpacesPass());
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 09a3096602fc3..bc30e24d92d2b 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -92,6 +92,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUPrintfRuntimeBinding.cpp
   AMDGPUPromoteAlloca.cpp
   AMDGPUPromoteKernelArguments.cpp
+  AMDGPUSplitKernelArguments.cpp
   AMDGPURegBankCombiner.cpp
   AMDGPURegBankLegalize.cpp
   AMDGPURegBankLegalizeHelper.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll
new file mode 100644
index 0000000000000..99de32f92aa7f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-split-k...[truncated]

github-actions bot commented Mar 31, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

github-actions bot commented Mar 31, 2025

✅ With the latest revision this PR passed the undef deprecator.

yxsamliu force-pushed the split-kernel-arg-trunk branch 2 times, most recently from 11d24a3 to cb4da45

April 1, 2025 00:25

arsenm requested changes

Apr 1, 2025

arsenm left a comment

I'd rather avoid having an ABI breaking pass, and I definitely want to avoid spreading the handling of non-byref kernel arguments. This has overlap with the existing IR expansion of non-byref arguments

yxsamliu commented Apr 4, 2025

> I'd rather avoid having an ABI breaking pass, and I definitely want to avoid spreading the handling of non-byref kernel arguments. This has overlap with the existing IR expansion of non-byref arguments

For HIP programs that launch the kernel, there is no ABI change, since the arguments passed to the triple-chevron kernel launch do not change. This is an internal optimization of the kernel launch procedure: unused fields in struct-type kernel arguments are skipped so that only the used fields are preloaded. In a sense, it is similar to the argument promotion IPO, but it happens between the compiler and the runtime through the newly added kernel argument metadata. Basically, it tells the runtime which chunks of the kernel arguments are to be kept and then preloaded. Since the optimization only happens late in the LLVM pipeline, it won't proliferate non-byref kernel arguments in the FE or middle end.
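As a rough illustration of "which chunks are kept", the sketch below copies only the used fields of one struct argument into a compact buffer. The `Params` struct, the `Chunk` record, and `gatherUsedChunks` are hypothetical names invented for this example; the actual HIP runtime change is not part of this PR and may work differently, and real kernel argument layout also has to respect per-argument alignment, which is ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical kernel parameter struct; assume the kernel only reads N and Scale.
struct Params {
  int32_t N;
  double Unused[4];
  float Scale;
};

// One used field of the original struct argument (illustrative only).
struct Chunk {
  size_t Offset; // byte offset inside the original struct
  size_t Size;   // number of bytes the kernel actually loads
};

// Copy only the used chunks, back to back, into a compact buffer.
std::vector<unsigned char> gatherUsedChunks(const void *Arg,
                                            const std::vector<Chunk> &Chunks) {
  std::vector<unsigned char> Packed;
  const auto *Bytes = static_cast<const unsigned char *>(Arg);
  for (const Chunk &C : Chunks)
    Packed.insert(Packed.end(), Bytes + C.Offset, Bytes + C.Offset + C.Size);
  return Packed;
}
```

With `Params` above, keeping only `N` and `Scale` yields an 8-byte buffer instead of the full struct, which is the kind of saving that makes the remaining values candidates for preloading.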

shiltian reviewed

Apr 7, 2025

yxsamliu force-pushed the split-kernel-arg-trunk branch from cb4da45 to 68318b2

April 8, 2025 17:20

yxsamliu commented Apr 30, 2025

gentle ping

yxsamliu force-pushed the split-kernel-arg-trunk branch from 68318b2 to 0ea419d

May 5, 2025 15:32

yxsamliu commented May 12, 2025

ping

shiltian commented May 13, 2025

> it is similar to the arguments promotion IPO

which we (at least @arsenm and I) want to get rid of and want to do it right in the front end instead. :-D

However, IMO this pass is fine. We are not doing something "less optimal" in the first place and then trying to correct it later, unlike the kernel argument promotion. It's hard to do this kind of reasoning in the front end.

shiltian reviewed

May 13, 2025

shiltian left a comment

Another question is, does it affect explicit kernel launch via hipLaunchKernelGGL or even the lower level use via HSA?

yxsamliu commented May 20, 2025

> Another question is, does it affect explicit kernel launch via hipLaunchKernelGGL or even the lower level use via HSA?

No. The kernel signature does not change from the user's point of view. It only needs a HIP runtime change in how the kernel arg segment is laid out. It does not need any HSA or firmware change.

yxsamliu force-pushed the split-kernel-arg-trunk branch from 0ea419d to 3e205bd

May 26, 2025 17:01

yxsamliu force-pushed the split-kernel-arg-trunk branch from 3e205bd to f07c64f

June 9, 2025 13:20

shiltian reviewed

Jun 9, 2025

llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp

		@@ -409,6 +430,12 @@ void MetadataStreamerMsgPackV4::emitKernelArg(
		Arg[".is_pipe"] = Arg.getDocument()->getNode(true);
		}

		// Add original argument index and offset to the metadata
		if (OriginalArgIndex != ~0U) {
		Arg[".original_arg_index"] = Arg.getDocument()->getNode(OriginalArgIndex);

shiltian, Jun 9, 2025

Honestly, I agree with @arsenm that this is indeed an ABI breaking change, even though the ABI change seems to be transparent to end users.

yxsamliu force-pushed the split-kernel-arg-trunk branch from f07c64f to b90e613

June 14, 2025 20:41

yxsamliu requested a review from arsenm

June 16, 2025 11:42

yxsamliu requested a review from shiltian

June 23, 2025 15:06

shiltian reviewed

Jun 23, 2025

shiltian left a comment

The pass generally looks good to me. Let's wait for @arsenm regarding the ABI break concern.

shiltian commented Jun 23, 2025

> > Another question is, does it affect explicit kernel launch via hipLaunchKernelGGL or even the lower level use via HSA?
>
> No. The kernel signature does not change from the user's point of view. It only needs a HIP runtime change in how the kernel arg segment is laid out. It does not need any HSA or firmware change.

I mean, if users launch a kernel using the HSA runtime instead of HIP (which is what OpenMP is doing), is it gonna break?

yxsamliu commented Jun 23, 2025

> > > Another question is, does it affect explicit kernel launch via hipLaunchKernelGGL or even the lower level use via HSA?
> >
> > No. The kernel signature does not change from the user's point of view. It only needs a HIP runtime change in how the kernel arg segment is laid out. It does not need any HSA or firmware change.
>
> I mean, if users launch a kernel using the HSA runtime instead of HIP (which is what OpenMP is doing), is it gonna break?

OpenMP does not seem to use explicit kernel args, although it uses implicit kernel args (https://github.com/llvm/llvm-project/blob/main/offload/plugins-nextgen/amdgpu/src/rtl.cpp#L3393). Since it does not use explicit kernel args, it is not affected by this patch. Even if it decides to use explicit kernel args, based on the libomptarget code, it needs to lay out kernel arguments from the host into the kernel argument segment as dictated by the code object metadata before passing the pointer to the kernel arg segment to the HSA runtime, in a similar way to the HIP runtime. Therefore, as long as it correctly follows the code object metadata, it will be able to launch a kernel transformed by this pass correctly with the HSA runtime.
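The following is a minimal sketch of what "following the code object metadata" could look like for a host-side launcher, written under stated assumptions. `.offset` and `.size` are existing per-argument metadata keys and `.original_arg_index` / `.original_arg_offset` are the keys added by this patch, but the `KernArgInfo` struct, the `layoutKernArgs` function, and the `NotSplit` sentinel (mirroring the `~0U` default in the patch) are hypothetical and are not taken from the HIP runtime or libomptarget.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Sentinel meaning "the metadata has no .original_arg_index for this argument".
constexpr unsigned NotSplit = ~0U;

// Hypothetical in-memory form of one kernel argument metadata entry.
struct KernArgInfo {
  size_t Offset;          // ".offset": position in the kernarg segment
  size_t Size;            // ".size": number of bytes to copy
  unsigned OrigArgIndex;  // ".original_arg_index", or NotSplit if absent
  uint64_t OrigArgOffset; // ".original_arg_offset" (0 if absent)
};

// Hypothetical struct argument used in the usage example.
struct LaunchParams {
  int32_t N;
  double Unused[4];
  float Scale;
};

// Build the kernarg segment from the user-visible host arguments.
std::vector<unsigned char>
layoutKernArgs(const std::vector<KernArgInfo> &Infos,
               const std::vector<const void *> &HostArgs) {
  size_t Total = 0;
  for (const KernArgInfo &I : Infos)
    Total = std::max(Total, I.Offset + I.Size);
  std::vector<unsigned char> Segment(Total, 0);
  for (size_t Idx = 0; Idx < Infos.size(); ++Idx) {
    const KernArgInfo &I = Infos[Idx];
    // Without the new keys, argument Idx is taken from host argument Idx.
    unsigned Src =
        I.OrigArgIndex == NotSplit ? static_cast<unsigned>(Idx) : I.OrigArgIndex;
    const auto *Bytes = static_cast<const unsigned char *>(HostArgs[Src]);
    std::memcpy(Segment.data() + I.Offset, Bytes + I.OrigArgOffset, I.Size);
  }
  return Segment;
}
```

The design point is that the caller still passes the original, user-visible argument list (`HostArgs`); only the copy into the segment consults the extra metadata, which is why the change is described as transparent at the launch site.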

shiltian reviewed

Jun 23, 2025

[AMDGPU] Split struct kernel arguments

4fe4ed1

AMDGPU backend has a pass which does transformations to allow firmware to preload kernel arguments into sgpr's to avoid loading them from kernel arg segment. This pass can improve kernel latency but it cannot preload struct-type kernel arguments.

This patch adds a pass to AMDGPU backend to split and flatten struct-type kernel arguments so that later passes can preload them into sgpr's.

Basically, the pass collects load or GEP/load instructions with struct-type kernel args as operands and makes them new arguments of the kernel. If all uses of a struct-type kernel arg can be replaced, it will do the replacements and create a new kernel with the new signature, and translate all instructions of the old kernel to use the new arguments in the new kernel. It adds a function attribute to encode the mapping from the new kernel argument index to the old kernel argument index and offset. The streamer will generate kernel argument metadata based on that and runtime will process the kernel arguments based on the metadata.

The pass is disabled by default and can be enabled by LLVM option `-amdgpu-enable-split-kernel-args`.

yxsamliu force-pushed the split-kernel-arg-trunk branch from b90e613 to 4fe4ed1

June 24, 2025 03:55

yxsamliu commented Jul 14, 2025

ping

yxsamliu commented Jul 17, 2025

@arsenm Any further concerns or comments about this PR? Thanks.

Labels

backend:AMDGPU

4 participants

[AMDGPU] Split struct kernel arguments #133786
