[AMDGPU] Split struct kernel arguments#133786


Open. yxsamliu wants to merge 1 commit into llvm:main from yxsamliu:split-kernel-arg-trunk

yxsamliu (Collaborator) commented Mar 31, 2025 (edited):

The AMDGPU backend has a pass that transforms kernels so that firmware can preload kernel arguments into SGPRs instead of loading them from the kernel arg segment. This pass can improve kernel latency, but it cannot preload struct-type kernel arguments.

This patch adds a pass to the AMDGPU backend that splits and flattens struct-type kernel arguments so that later passes can preload them into SGPRs.

Basically, the pass collects load or GEP/load instructions that have struct-type kernel args as operands and turns them into new arguments of the kernel. If all uses of a struct-type kernel arg can be replaced, it performs the replacements, creates a new kernel with the new signature, and translates all instructions of the old kernel to use the new arguments in the new kernel. It adds a function attribute that encodes the mapping from the new kernel argument index to the old kernel argument index and offset. The streamer generates kernel argument metadata based on that, and the runtime processes the kernel arguments based on the metadata.

The pass is disabled by default and can be enabled with the LLVM option -amdgpu-enable-split-kernel-args.
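
The patch stores the mapping in the `amdgpu-argument-mapping` function attribute as a string of the form `NewArgIndex:OriginalArgIndex:Offset,...`. As a rough illustration of that encoding only, here is a standalone sketch in plain C++ with no LLVM dependency. The `ArgMapping` struct and the helpers `encodeMapping` and `parseMapping` are hypothetical names invented for this example; the patch itself uses `StringRef::split` and `getAsInteger`.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: one entry per argument of the new (split) kernel.
struct ArgMapping {
  unsigned NewArgIndex;  // index in the new kernel signature
  unsigned OrigArgIndex; // index in the original kernel signature
  uint64_t Offset;       // byte offset inside the original argument
};

// Builds "New:Orig:Offset,New:Orig:Offset,..." from a list of mappings.
std::string encodeMapping(const std::vector<ArgMapping> &Mappings) {
  std::string Out;
  for (const ArgMapping &M : Mappings) {
    if (!Out.empty())
      Out += ",";
    Out += std::to_string(M.NewArgIndex) + ":" +
           std::to_string(M.OrigArgIndex) + ":" + std::to_string(M.Offset);
  }
  return Out;
}

// Parses the string back. Entries that do not consist of exactly three
// decimal fields are skipped, mirroring the "continue" behavior of the
// streamer code in the patch. Values too large for 64 bits are not handled.
std::vector<ArgMapping> parseMapping(const std::string &Str) {
  std::vector<ArgMapping> Result;
  std::stringstream Entries(Str);
  std::string Entry;
  while (std::getline(Entries, Entry, ',')) {
    std::stringstream Fields(Entry);
    std::string Field;
    std::vector<uint64_t> Values;
    bool Ok = true;
    while (std::getline(Fields, Field, ':')) {
      if (Field.empty() ||
          Field.find_first_not_of("0123456789") != std::string::npos) {
        Ok = false;
        break;
      }
      Values.push_back(std::stoull(Field));
    }
    if (!Ok || Values.size() != 3)
      continue;
    Result.push_back({static_cast<unsigned>(Values[0]),
                      static_cast<unsigned>(Values[1]), Values[2]});
  }
  return Result;
}
```

For example, a kernel whose original argument 0 is split into two fields at offsets 0 and 8, followed by an unchanged original argument 1, would be encoded as `0:0:0,1:0:8,2:1:0`.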

llvmbot (Member) commented:
@llvm/pr-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

Patch is 27.59 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/133786.diff

9 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+9)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp (+41-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h (+2-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp (+372)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8-1)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll (+120)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+4)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index a8e4ea9429f50..777390b99c0cc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -125,6 +125,15 @@ struct AMDGPUPromoteKernelArgumentsPass
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
 };

+ModulePass *createAMDGPUSplitKernelArgumentsPass();
+void initializeAMDGPUSplitKernelArgumentsPass(PassRegistry &);
+extern char &AMDGPUSplitKernelArgumentsID;
+
+struct AMDGPUSplitKernelArgumentsPass
+    : PassInfoMixin<AMDGPUSplitKernelArgumentsPass> {
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
 ModulePass *createAMDGPULowerKernelAttributesPass();
 void initializeAMDGPULowerKernelAttributesPass(PassRegistry &);
 extern char &AMDGPULowerKernelAttributesID;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
index 2991778a1bbc7..d54828e225e12 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
@@ -357,17 +357,50 @@ void MetadataStreamerMsgPackV4::emitKernelArg(const Argument &Arg,
   Align ArgAlign;
   std::tie(ArgTy, ArgAlign) = getArgumentTypeAlign(Arg, DL);

+  // Assuming the argument is not split from struct-type argument by default,
+  // unless we find it in function attribute amdgpu-argument-mapping.
+  unsigned OriginalArgIndex = ~0U;
+  uint64_t OriginalArgOffset = 0;
+  if (Func->hasFnAttribute("amdgpu-argument-mapping")) {
+    StringRef MappingStr = Func->getFnAttribute("amdgpu-argument-mapping").getValueAsString();
+    SmallVector<StringRef, 8> Mappings;
+    MappingStr.split(Mappings, ',');
+    for (const StringRef &Mapping : Mappings) {
+      SmallVector<StringRef, 3> Elements;
+      Mapping.split(Elements, ':');
+      if (Elements.size() != 3)
+        continue;
+
+      unsigned NewArgIndex = 0;
+      unsigned OrigArgIndex = 0;
+      uint64_t OffsetValue = 0;
+      if (Elements[0].getAsInteger(10, NewArgIndex))
+        continue;
+      if (Elements[1].getAsInteger(10, OrigArgIndex))
+        continue;
+      if (Elements[2].getAsInteger(10, OffsetValue))
+        continue;
+
+      if (NewArgIndex == ArgNo) {
+        OriginalArgIndex = OrigArgIndex;
+        OriginalArgOffset = OffsetValue;
+        break;
+      }
+    }
+  }
+
   emitKernelArg(DL, ArgTy, ArgAlign,
                 getValueKind(ArgTy, TypeQual, BaseTypeName), Offset, Args,
                 PointeeAlign, Name, TypeName, BaseTypeName, ActAccQual,
-                AccQual, TypeQual);
+                AccQual, TypeQual, OriginalArgIndex, OriginalArgOffset);
 }

 void MetadataStreamerMsgPackV4::emitKernelArg(
     const DataLayout &DL, Type *Ty, Align Alignment, StringRef ValueKind,
     unsigned &Offset, msgpack::ArrayDocNode Args, MaybeAlign PointeeAlign,
     StringRef Name, StringRef TypeName, StringRef BaseTypeName,
-    StringRef ActAccQual, StringRef AccQual, StringRef TypeQual) {
+    StringRef ActAccQual, StringRef AccQual, StringRef TypeQual,
+    unsigned OriginalArgIndex, uint64_t OriginalArgOffset) {
   auto Arg = Args.getDocument()->getMapNode();

   if (!Name.empty())
@@ -409,6 +442,12 @@ void MetadataStreamerMsgPackV4::emitKernelArg(
       Arg[".is_pipe"] = Arg.getDocument()->getNode(true);
   }

+  // Add original argument index and offset to the metadata
+  if (OriginalArgIndex != ~0U) {
+    Arg[".original_arg_index"] = Arg.getDocument()->getNode(OriginalArgIndex);
+    Arg[".original_arg_offset"] = Arg.getDocument()->getNode(OriginalArgOffset);
+  }
+
   Args.push_back(Arg);
 }

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
index 22dfcb4a4ec1d..312a1747f5c1d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
@@ -116,7 +116,8 @@ class LLVM_EXTERNAL_VISIBILITY MetadataStreamerMsgPackV4
                      MaybeAlign PointeeAlign = std::nullopt,
                      StringRef Name = "", StringRef TypeName = "",
                      StringRef BaseTypeName = "", StringRef ActAccQual = "",
-                     StringRef AccQual = "", StringRef TypeQual = "");
+                     StringRef AccQual = "", StringRef TypeQual = "",
+                     unsigned OriginalArgIndex = ~0U, uint64_t OriginalArgOffset = 0);

   void emitHiddenKernelArgs(const MachineFunction &MF, unsigned &Offset,
                             msgpack::ArrayDocNode Args) override;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 6a45392b5f099..094346670811c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -29,6 +29,7 @@ MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-remove-incompatible-functions", AMDGPURemoveIncompatibleFunctionsPass(*this))
 MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass(*this))
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
+MODULE_PASS("amdgpu-split-kernel-arguments", AMDGPUSplitKernelArgumentsPass())
 #undef MODULE_PASS

 #ifndef MODULE_PASS_WITH_PARAMS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp
new file mode 100644
index 0000000000000..4a025e1806070
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp
@@ -0,0 +1,372 @@
+//===--- AMDGPUSplitKernelArguments.cpp - Split kernel arguments ----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// \file This pass flats struct-type kernel arguments. It eliminates unused
+// fields and only keeps used fields. The objective is to facilitate preloading
+// of kernel arguments by later passes.
+//
+//===----------------------------------------------------------------------===//
+#include "AMDGPU.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Module.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+
+#define DEBUG_TYPE "amdgpu-split-kernel-arguments"
+
+using namespace llvm;
+
+namespace {
+static llvm::cl::opt<bool> EnableSplitKernelArgs(
+    "amdgpu-enable-split-kernel-args",
+    llvm::cl::desc("Enable splitting of AMDGPU kernel arguments"),
+    llvm::cl::init(false));
+
+class AMDGPUSplitKernelArguments : public ModulePass {
+public:
+  static char ID;
+
+  AMDGPUSplitKernelArguments() : ModulePass(ID) {}
+
+  bool runOnModule(Module &M) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.setPreservesCFG();
+  }
+
+private:
+  bool processFunction(Function &F);
+};
+
+} // end anonymous namespace
+
+bool AMDGPUSplitKernelArguments::processFunction(Function &F) {
+  const DataLayout &DL = F.getParent()->getDataLayout();
+  LLVM_DEBUG(dbgs() << "Entering AMDGPUSplitKernelArguments::processFunction "
+                    << F.getName() << '\n');
+  if (F.isDeclaration()) {
+    LLVM_DEBUG(dbgs() << "Function is a declaration, skipping\n");
+    return false;
+  }
+
+  CallingConv::ID CC = F.getCallingConv();
+  if (CC != CallingConv::AMDGPU_KERNEL || F.arg_empty()) {
+    LLVM_DEBUG(dbgs() << "non-kernel or arg_empty\n");
+    return false;
+  }
+
+  SmallVector<std::tuple<unsigned, unsigned, uint64_t>, 8> NewArgMappings;
+  DenseMap<Argument *, SmallVector<LoadInst *, 8>> ArgToLoadsMap;
+  DenseMap<Argument *, SmallVector<GetElementPtrInst *, 8>> ArgToGEPsMap;
+  SmallVector<Argument *, 8> StructArgs;
+  SmallVector<Type *, 8> NewArgTypes;
+
+  auto convertAddressSpace = [](Type *Ty) -> Type * {
+    if (auto *PtrTy = dyn_cast<PointerType>(Ty)) {
+      if (PtrTy->getAddressSpace() == AMDGPUAS::FLAT_ADDRESS) {
+        return PointerType::get(PtrTy->getContext(), AMDGPUAS::GLOBAL_ADDRESS);
+      }
+    }
+    return Ty;
+  };
+
+  // Collect struct arguments and new argument types
+  unsigned OriginalArgIndex = 0;
+  unsigned NewArgIndex = 0;
+  for (Argument &Arg : F.args()) {
+    LLVM_DEBUG(dbgs() << "Processing argument: " << Arg << "\n");
+    if (Arg.use_empty()) {
+      NewArgTypes.push_back(convertAddressSpace(Arg.getType()));
+      NewArgMappings.push_back(
+          std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      LLVM_DEBUG(dbgs() << "use empty\n");
+      continue;
+    }
+
+    PointerType *PT = dyn_cast<PointerType>(Arg.getType());
+    if (!PT) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not a pointer\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    const bool IsByRef = Arg.hasByRefAttr();
+    if (!IsByRef) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not byref\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    Type *ArgTy = Arg.getParamByRefType();
+    StructType *ST = dyn_cast<StructType>(ArgTy);
+    if (!ST) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not a struct\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    bool AllLoadsOrGEPs = true;
+    SmallVector<LoadInst *, 8> Loads;
+    SmallVector<GetElementPtrInst *, 8> GEPs;
+    for (User *U : Arg.users()) {
+      LLVM_DEBUG(dbgs() << "  User: " << *U << "\n");
+      if (auto *LI = dyn_cast<LoadInst>(U)) {
+        Loads.push_back(LI);
+      } else if (auto *GEP = dyn_cast<GetElementPtrInst>(U)) {
+        GEPs.push_back(GEP);
+        for (User *GEPUser : GEP->users()) {
+          LLVM_DEBUG(dbgs() << "    GEP User: " << *GEPUser << "\n");
+          if (auto *GEPLoad = dyn_cast<LoadInst>(GEPUser)) {
+            Loads.push_back(GEPLoad);
+          } else {
+            AllLoadsOrGEPs = false;
+            break;
+          }
+        }
+      } else {
+        AllLoadsOrGEPs = false;
+        break;
+      }
+      if (!AllLoadsOrGEPs)
+        break;
+    }
+    LLVM_DEBUG(dbgs() << "  AllLoadsOrGEPs: "
+                      << (AllLoadsOrGEPs ? "true" : "false") << "\n");
+
+    if (AllLoadsOrGEPs) {
+      StructArgs.push_back(&Arg);
+      ArgToLoadsMap[&Arg] = Loads;
+      ArgToGEPsMap[&Arg] = GEPs;
+      for (LoadInst *LI : Loads) {
+        Type *NewType = convertAddressSpace(LI->getType());
+        NewArgTypes.push_back(NewType);
+
+        // Compute offset
+        uint64_t Offset = 0;
+        if (auto *GEP = dyn_cast<GetElementPtrInst>(LI->getPointerOperand())) {
+          APInt OffsetAPInt(DL.getPointerSizeInBits(), 0);
+          if (GEP->accumulateConstantOffset(DL, OffsetAPInt))
+            Offset = OffsetAPInt.getZExtValue();
+        }
+
+        // Map each new argument to the original argument index and offset
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, Offset));
+        ++NewArgIndex;
+      }
+    } else {
+      NewArgTypes.push_back(convertAddressSpace(Arg.getType()));
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+    }
+    ++OriginalArgIndex;
+  }
+
+  if (StructArgs.empty())
+    return false;
+
+  // Collect function and return attributes
+  AttributeList OldAttrs = F.getAttributes();
+  AttributeSet FnAttrs = OldAttrs.getFnAttrs();
+  AttributeSet RetAttrs = OldAttrs.getRetAttrs();
+
+  // Create new function type
+  FunctionType *NewFT =
+      FunctionType::get(F.getReturnType(), NewArgTypes, F.isVarArg());
+  Function *NewF =
+      Function::Create(NewFT, F.getLinkage(), F.getAddressSpace(), F.getName());
+  F.getParent()->getFunctionList().insert(F.getIterator(), NewF);
+  NewF->takeName(&F);
+  NewF->setVisibility(F.getVisibility());
+  if (F.hasComdat())
+    NewF->setComdat(F.getComdat());
+  NewF->setDSOLocal(F.isDSOLocal());
+  NewF->setUnnamedAddr(F.getUnnamedAddr());
+  NewF->setCallingConv(F.getCallingConv());
+
+  // Build new parameter attributes
+  SmallVector<AttributeSet, 8> NewArgAttrSets;
+  NewArgIndex = 0;
+  for (Argument &Arg : F.args()) {
+    if (ArgToLoadsMap.count(&Arg)) {
+      for (LoadInst *LI : ArgToLoadsMap[&Arg]) {
+        (void)LI;
+        NewArgAttrSets.push_back(AttributeSet());
+        ++NewArgIndex;
+      }
+    } else {
+      AttributeSet ArgAttrs = OldAttrs.getParamAttrs(Arg.getArgNo());
+      NewArgAttrSets.push_back(ArgAttrs);
+      ++NewArgIndex;
+    }
+  }
+
+  // Build the new AttributeList
+  AttributeList NewAttrList =
+      AttributeList::get(F.getContext(), FnAttrs, RetAttrs, NewArgAttrSets);
+  NewF->setAttributes(NewAttrList);
+
+  // Add the mapping of the new arguments to the old arguments as a function
+  // attribute in the format "NewArgIndex:OriginalArgIndex:Offset,..."
+  std::string MappingStr;
+  for (const auto &Info : NewArgMappings) {
+    unsigned NewArgIdx, OrigArgIdx;
+    uint64_t Offset;
+    std::tie(NewArgIdx, OrigArgIdx, Offset) = Info;
+
+    if (!MappingStr.empty())
+      MappingStr += ",";
+    MappingStr += std::to_string(NewArgIdx) + ":" + std::to_string(OrigArgIdx) +
+                  ":" + std::to_string(Offset);
+  }
+
+  NewF->addFnAttr("amdgpu-argument-mapping", MappingStr);
+
+  LLVM_DEBUG(dbgs() << "New empty function:\n" << *NewF << '\n');
+
+  NewF->splice(NewF->begin(), &F);
+
+  // Map old arguments and loads to new arguments
+  DenseMap<Value *, Value *> VMap;
+  auto NewArgIt = NewF->arg_begin();
+  for (Argument &Arg : F.args()) {
+    if (ArgToLoadsMap.count(&Arg)) {
+      for (LoadInst *LI : ArgToLoadsMap[&Arg]) {
+        std::string OldName = LI->getName().str();
+        LI->setName(OldName + ".old");
+        NewArgIt->setName(OldName);
+        Value *NewArg = &*NewArgIt++;
+        if (isa<PointerType>(NewArg->getType()) &&
+            isa<PointerType>(LI->getType())) {
+          IRBuilder<> Builder(LI);
+          Value *CastedArg = Builder.CreatePointerBitCastOrAddrSpaceCast(
+              NewArg, LI->getType());
+          VMap[LI] = CastedArg;
+        } else {
+          VMap[LI] = NewArg;
+        }
+      }
+      UndefValue *UndefArg = UndefValue::get(Arg.getType());
+      Arg.replaceAllUsesWith(UndefArg);
+    } else {
+      std::string OldName = Arg.getName().str();
+      Arg.setName(OldName + ".old");
+      NewArgIt->setName(OldName);
+      Value *NewArg = &*NewArgIt;
+      if (isa<PointerType>(NewArg->getType()) &&
+          isa<PointerType>(Arg.getType())) {
+        IRBuilder<> Builder(&*NewF->begin()->begin());
+        Value *CastedArg =
+            Builder.CreatePointerBitCastOrAddrSpaceCast(NewArg, Arg.getType());
+        Arg.replaceAllUsesWith(CastedArg);
+      } else {
+        Arg.replaceAllUsesWith(NewArg);
+      }
+      ++NewArgIt;
+    }
+  }
+
+  // Replace LoadInsts with new arguments
+  for (auto &Entry : ArgToLoadsMap) {
+    for (LoadInst *LI : Entry.second) {
+      Value *NewArg = VMap[LI];
+      LI->replaceAllUsesWith(NewArg);
+      LI->eraseFromParent();
+    }
+  }
+
+  // Erase GEPs
+  for (auto &Entry : ArgToGEPsMap) {
+    for (GetElementPtrInst *GEP : Entry.second) {
+      if (GEP->use_empty()) {
+        GEP->eraseFromParent();
+      } else {
+        GEP->replaceAllUsesWith(UndefValue::get(GEP->getType()));
+        GEP->eraseFromParent();
+      }
+    }
+  }
+
+  LLVM_DEBUG(dbgs() << "New function after transformation:\n" << *NewF << '\n');
+
+  F.replaceAllUsesWith(NewF);
+  F.eraseFromParent();
+
+  return true;
+}
+
+bool AMDGPUSplitKernelArguments::runOnModule(Module &M) {
+  if (!EnableSplitKernelArgs)
+    return false;
+  bool Changed = false;
+  SmallVector<Function *, 16> FunctionsToProcess;
+
+  for (Function &F : M) {
+    if (F.isDeclaration())
+      continue;
+    FunctionsToProcess.push_back(&F);
+  }
+
+  for (Function *F : FunctionsToProcess) {
+    if (F->isDeclaration())
+      continue;
+    Changed |= processFunction(*F);
+  }
+
+  return Changed;
+}
+
+INITIALIZE_PASS_BEGIN(AMDGPUSplitKernelArguments, DEBUG_TYPE,
+                      "AMDGPU Split Kernel Arguments", false, false)
+INITIALIZE_PASS_END(AMDGPUSplitKernelArguments, DEBUG_TYPE,
+                    "AMDGPU Split Kernel Arguments", false, false)
+
+char AMDGPUSplitKernelArguments::ID = 0;
+
+ModulePass *llvm::createAMDGPUSplitKernelArgumentsPass() {
+  return new AMDGPUSplitKernelArguments();
+}
+
+PreservedAnalyses AMDGPUSplitKernelArgumentsPass::run(Module &M, ModuleAnalysisManager &AM) {
+  AMDGPUSplitKernelArguments Splitter;
+  bool Changed = Splitter.runOnModule(M);
+
+  if (!Changed)
+    return PreservedAnalyses::all();
+
+  return PreservedAnalyses::none();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 4937b434bc955..f5bb925a95b54 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -517,6 +517,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPUAtomicOptimizerPass(*PR);
   initializeAMDGPULowerKernelArgumentsPass(*PR);
   initializeAMDGPUPromoteKernelArgumentsPass(*PR);
+  initializeAMDGPUSplitKernelArgumentsPass(*PR);
   initializeAMDGPULowerKernelAttributesPass(*PR);
   initializeAMDGPUExportKernelRuntimeHandlesLegacyPass(*PR);
   initializeAMDGPUPostLegalizerCombinerPass(*PR);
@@ -876,8 +877,10 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
                                            OptimizationLevel Level,
                                            ThinOrFullLTOPhase Phase) {
     if (Level != OptimizationLevel::O0) {
-      if (!isLTOPreLink(Phase))
+      if (!isLTOPreLink(Phase)) {
+        MPM.addPass(AMDGPUSplitKernelArgumentsPass());
         MPM.addPass(AMDGPUAttributorPass(*this));
+      }
     }
   });

@@ -896,6 +899,7 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
             PM.addPass(InternalizePass(mustPreserveGV));
             PM.addPass(GlobalDCEPass());
           }
+          PM.addPass(AMDGPUSplitKernelArgumentsPass());
           if (EnableAMDGPUAttributor) {
             AMDGPUAttributorOptions Opt;
             if (HasClosedWorldAssumption)
@@ -1237,6 +1241,9 @@ void AMDGPUPassConfig::addIRPasses() {
     addPass(createAMDGPULowerModuleLDSLegacyPass(&TM));
   }

+  if (TM.getOptLevel() > CodeGenOptLevel::None) {
+    addPass(createAMDGPUSplitKernelArgumentsPass());
+  }
   if (TM.getOptLevel() > CodeGenOptLevel::None)
     addPass(createInferAddressSpacesPass());

diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 09a3096602fc3..bc30e24d92d2b 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -92,6 +92,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUPrintfRuntimeBinding.cpp
   AMDGPUPromoteAlloca.cpp
   AMDGPUPromoteKernelArguments.cpp
+  AMDGPUSplitKernelArguments.cpp
   AMDGPURegBankCombiner.cpp
   AMDGPURegBankLegalize.cpp
   AMDGPURegBankLegalizeHelper.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll
new file mode 100644
index 0000000000000..99de32f92aa7f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-split-k...[truncated]

github-actions bot commented Mar 31, 2025 (edited):

✅ With the latest revision this PR passed the C/C++ code formatter.

github-actions bot commented Mar 31, 2025 (edited):

✅ With the latest revision this PR passed the undef deprecator.

yxsamliu force-pushed the split-kernel-arg-trunk branch 2 times, most recently from 11d24a3 to cb4da45 (April 1, 2025 00:25)

arsenm (Contributor) left a comment:

I'd rather avoid having an ABI breaking pass, and I definitely want to avoid spreading the handling of non-byref kernel arguments. This has overlap with the existing IR expansion of non-byref arguments

yxsamliu (Collaborator, Author) commented:

> I'd rather avoid having an ABI breaking pass, and I definitely want to avoid spreading the handling of non-byref kernel arguments. This has overlap with the existing IR expansion of non-byref arguments

For HIP programs that launch the kernel, there is no ABI change, since the arguments passed to the triple-chevron kernel launch do not change. This is an internal optimization of the kernel launch procedure that skips unused fields in struct-type kernel arguments so that only used fields are preloaded. In a sense, it is similar to the argument promotion IPO, but it happens between the compiler and the runtime through the newly added kernel argument metadata. Basically, it tells the runtime which chunks of the kernel arguments are to be kept and then preloaded. Since the optimization only happens late in the LLVM pipeline, it won't proliferate non-byref kernel arguments in the FE or middle end.
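
To make "skipping unused fields" concrete, here is a small host-side sketch in plain C++. Everything in it is invented for illustration: the `Params` struct, the assumption that a kernel with the original signature `kernel(float *Out, Params P)` only loads `P.N` and `P.Scale`, and the `SplitArg` and `describeSplit` names. The `Size` field is included only for readability; the mapping attribute in the patch carries just the two indices and the offset. In the actual pass the order of the new arguments follows the order in which the loads are visited, which need not be field order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical struct passed by value to a kernel.
struct Params {
  int32_t N;       // assumed to be loaded by the kernel
  float Unused[4]; // assumed to be never read by the kernel
  float Scale;     // assumed to be loaded by the kernel
};

// Hypothetical description of one argument of the split kernel.
struct SplitArg {
  unsigned NewArgIndex;  // position in the new signature
  unsigned OrigArgIndex; // position in the original signature
  uint64_t Offset;       // byte offset inside the original argument
  uint64_t Size;         // bytes kept (illustration only)
};

// Original signature (assumed): kernel(float *Out, Params P)
// New signature (assumed):      kernel(float *Out, int32_t N, float Scale)
std::vector<SplitArg> describeSplit() {
  return {
      {0, 0, 0, sizeof(float *)},                       // Out, unchanged
      {1, 1, offsetof(Params, N), sizeof(int32_t)},     // P.N
      {2, 1, offsetof(Params, Scale), sizeof(float)},   // P.Scale
  };
}
```

Under these assumptions only 8 of the bytes of `Params` would need to reach the kernel, which is the portion that later passes could then consider for preloading.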

yxsamliu force-pushed the split-kernel-arg-trunk branch from cb4da45 to 68318b2 (April 8, 2025 17:20)

yxsamliu (Collaborator, Author) commented:

gentle ping

yxsamliu force-pushed the split-kernel-arg-trunk branch from 68318b2 to 0ea419d (May 5, 2025 15:32)

yxsamliu (Collaborator, Author) commented:

ping

shiltian (Contributor) commented:

> it is similar to the arguments promotion IPO

which we (at least @arsenm and I) want to get rid of and want to do it right in the front end instead. :-D

However, IMO this pass is fine. We are not doing something "less optimal" in the first place and then trying to correct it later, unlike the kernel argument promotion. It's hard to do this kind of reasoning in the front end.

shiltian (Contributor) left a comment:

Another question is, does it affect explicit kernel launch via hipLaunchKernelGGL or even the lower-level use via HSA?

yxsamliu (Collaborator, Author) commented:

> Another question is, does it affect explicit kernel launch via hipLaunchKernelGGL or even the lower-level use via HSA?

No. The kernel signature does not change from the user's point of view. It only needs a HIP runtime change in how the kernel arg segment is laid out. It does not need an HSA or firmware change.

yxsamliu force-pushed the split-kernel-arg-trunk branch from 0ea419d to 3e205bd (May 26, 2025 17:01)
yxsamliu force-pushed the split-kernel-arg-trunk branch from 3e205bd to f07c64f (June 9, 2025 13:20)

@@ -409,6 +430,12 @@ void MetadataStreamerMsgPackV4::emitKernelArg(
      Arg[".is_pipe"] = Arg.getDocument()->getNode(true);
  }

  // Add original argument index and offset to the metadata
  if (OriginalArgIndex != ~0U) {
    Arg[".original_arg_index"] = Arg.getDocument()->getNode(OriginalArgIndex);

Review comment (Contributor):

Honestly, I agree with @arsenm that this is indeed an ABI-breaking change, even though the ABI change seems to be transparent to end users.

yxsamliu force-pushed the split-kernel-arg-trunk branch from f07c64f to b90e613 (June 14, 2025 20:41)
yxsamliu requested a review from arsenm (June 16, 2025 11:42)
yxsamliu requested a review from shiltian (June 23, 2025 15:06)

shiltian (Contributor) left a comment:

The pass generally looks good to me. Let's wait for @arsenm regarding the ABI break concern.

shiltian (Contributor) commented:

> > Another question is, does it affect explicit kernel launch via hipLaunchKernelGGL or even the lower-level use via HSA?
>
> No. The kernel signature does not change from the user's point of view. It only needs a HIP runtime change in how the kernel arg segment is laid out. It does not need an HSA or firmware change.

I mean, if users launch a kernel using the HSA runtime instead of HIP (which is what OpenMP is doing), is it gonna break?

yxsamliu (Collaborator, Author) commented:

> > > Another question is, does it affect explicit kernel launch via hipLaunchKernelGGL or even the lower-level use via HSA?
> >
> > No. The kernel signature does not change from the user's point of view. It only needs a HIP runtime change in how the kernel arg segment is laid out. It does not need an HSA or firmware change.
>
> I mean, if users launch a kernel using the HSA runtime instead of HIP (which is what OpenMP is doing), is it gonna break?

OpenMP does not seem to use explicit kernel args, although it uses implicit kernel args (https://github.com/llvm/llvm-project/blob/main/offload/plugins-nextgen/amdgpu/src/rtl.cpp#L3393). Since it does not use explicit kernel args, it is not affected by this patch. Even if it decides to use explicit kernel args, based on the libomptarget code, it needs to lay out kernel arguments from the host into the kernel argument segment as dictated by the code object metadata before passing the pointer to the kernel arg segment to the HSA runtime, in a similar way to the HIP runtime. Therefore, as long as it correctly follows the code object metadata, it will be able to launch the kernel transformed by this pass correctly with the HSA runtime.
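
As a rough sketch of what "laying out kernel arguments as dictated by code object metadata" could look like on the host side, the following plain C++ packs a kernarg buffer from the caller's original arguments. It is not taken from the HIP or OpenMP runtimes. The `KernArgDesc` struct and `packKernArgs` function are hypothetical; the fields loosely correspond to the existing `.offset` and `.size` metadata keys plus the `.original_arg_index` and `.original_arg_offset` keys this patch adds, and the sketch assumes every entry carries the two new keys.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified view of one kernel-argument metadata entry.
struct KernArgDesc {
  uint32_t Offset;            // destination offset in the kernarg segment
  uint32_t Size;              // number of bytes to copy
  uint32_t OriginalArgIndex;  // which caller-side argument to read from
  uint64_t OriginalArgOffset; // byte offset inside that argument
};

// Copies the pieces selected by the metadata out of the caller's original
// arguments (one pointer per original argument) into a fresh kernarg buffer.
std::vector<uint8_t> packKernArgs(const std::vector<KernArgDesc> &Descs,
                                  const std::vector<const void *> &OrigArgs) {
  std::size_t Total = 0;
  for (const KernArgDesc &D : Descs)
    Total = std::max(Total, static_cast<std::size_t>(D.Offset) + D.Size);

  std::vector<uint8_t> Segment(Total, 0);
  for (const KernArgDesc &D : Descs) {
    const uint8_t *Src =
        static_cast<const uint8_t *>(OrigArgs[D.OriginalArgIndex]) +
        D.OriginalArgOffset;
    std::memcpy(Segment.data() + D.Offset, Src, D.Size);
  }
  return Segment;
}
```

A launcher that follows the metadata in this way never needs to know the original struct layout, which is the property the comment above relies on. Alignment, hidden arguments, and error checking are left out of the sketch.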

yxsamliu force-pushed the split-kernel-arg-trunk branch from b90e613 to 4fe4ed1 (June 24, 2025 03:55)

yxsamliu (Collaborator, Author) commented:

ping

yxsamliu (Collaborator, Author) commented:

@arsenm Any further concerns or comments about this PR? Thanks.

Reviewers

shiltian: left review comments
kerbowa, rampitec, b-sumner, kzhuravl, arsenm: awaiting requested review

Participants: yxsamliu, llvmbot, shiltian, arsenm