[AMDGPU] Add DAG mutation to improve scheduling before barriers #142716


Open
perlfu wants to merge 2 commits into llvm:main from perlfu:amdgpu-barrier-latency

Conversation

perlfu
Contributor

Add a scheduler DAG mutation that adds data dependencies between atomic fences and preceding memory reads. This allows some modelling of the impact an atomic fence can have on outstanding memory accesses.

This is beneficial when a fence would cause wait count insertion, as more instructions will be scheduled before the fence, hiding memory latency. It also reduces the risk of a fence causing a premature wait on all active memory operations.
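In scheduling terms, the core trick is to give the fence an incoming edge from each tracked memory read with a very large synthetic latency, so the scheduler tries to fill that window with independent work. A minimal sketch of that step follows; ReadSU, FenceSU, and DstReg are placeholder names for illustration, not identifiers from the patch (the full implementation is in the diff below):

// Sketch only: attach a high-latency data edge from a memory read's SUnit
// (ReadSU) to the fence's SUnit (FenceSU) inside a ScheduleDAGMutation.
// The 2000-cycle figure mirrors the synthetic latency used in the patch.
SDep Edge(ReadSU, SDep::Data, DstReg); // DstReg: the read's destination reg
Edge.setLatency(2000);                 // far larger than any real latency
DAG->addEdge(FenceSU, Edge);           // the fence now "waits" on the read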

@llvmbot
Member

@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes

Add a scheduler DAG mutation that adds data dependencies between atomic fences and preceding memory reads. This allows some modelling of the impact an atomic fence can have on outstanding memory accesses.

This is beneficial when a fence would cause wait count insertion, as more instructions will be scheduled before the fence, hiding memory latency. It also reduces the risk of a fence causing a premature wait on all active memory operations.


Patch is 368.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/142716.diff

14 Files Affected:

  • (added) llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.cpp (+100)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.h (+21)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+4)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll (+24-26)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll (+24-26)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+375-381)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+210-219)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+210-219)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-waitcnts-crash.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.gws.barrier.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll (+19-19)
  • (added) llvm/test/CodeGen/AMDGPU/schedule-barrier-latency.mir (+82)
  • (modified) llvm/test/CodeGen/AMDGPU/waitcnt-vscnt.ll (+51-53)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.cpp b/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.cpp
new file mode 100644
index 0000000000000..b633cbf91b7ff
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.cpp
@@ -0,0 +1,100 @@
+//===--- AMDGPUBarrierLatency.cpp - AMDGPU Barrier Latency ----------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file This file contains a DAG scheduling mutation to add data dependency
+///       edges between ATOMIC_FENCE instructions and preceeding memory
+///       accesses they might be effected by the fence.
+///       This is beneficial when a fence would cause wait count insertion,
+///       as more instructions will be scheduled before the fence hiding
+///       memory latency.
+///       It also reduces the risk of a fence causing a premature wait
+///       on all active memory operations.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPUBarrierLatency.h"
+#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "SIInstrInfo.h"
+#include "llvm/CodeGen/ScheduleDAGInstrs.h"
+
+using namespace llvm;
+
+namespace {
+
+class BarrierLatency : public ScheduleDAGMutation {
+public:
+  BarrierLatency() = default;
+  void apply(ScheduleDAGInstrs *DAG) override;
+};
+
+static bool isMemRead(const MachineInstr *MI) {
+  return (SIInstrInfo::isDS(*MI) || SIInstrInfo::isVMEM(*MI) ||
+          SIInstrInfo::isSMRD(*MI)) &&
+         MI->mayLoad();
+}
+
+static const MachineInstr *getReadInstr(const MachineInstr *MI) {
+  if (MI->isBundle()) {
+    auto I = std::next(MI->getIterator());
+    if (I != MI->getParent()->instr_end() && I->isInsideBundle() &&
+        isMemRead(&*I))
+      return &*I;
+  } else if (isMemRead(MI)) {
+    return MI;
+  }
+
+  return nullptr;
+}
+
+void BarrierLatency::apply(ScheduleDAGInstrs *DAG) {
+  const unsigned SyntheticLatency = 2000;
+  const unsigned MaxTracked = 32;
+  SmallVector<std::pair<SUnit *, const MachineInstr *>, MaxTracked> ReadOps;
+  unsigned NextIdx = 0;
+
+  for (SUnit &SU : DAG->SUnits) {
+    auto *MI = SU.getInstr();
+    auto *ReadMI = getReadInstr(MI);
+
+    // Record read operations.
+    // If SU represents a bundle, then ReadMI is the first instruction in the
+    // bundle.
+    if (ReadMI) {
+      if (ReadOps.size() < MaxTracked) {
+        ReadOps.emplace_back(&SU, ReadMI);
+      } else {
+        ReadOps[NextIdx] = std::pair(&SU, ReadMI);
+        NextIdx = (NextIdx + 1) % MaxTracked;
+      }
+      continue;
+    }
+
+    // Create new edges on ATOMIC_FENCE for recorded reads.
+    // We don't consider the scope of the fence so it is possible there will
+    // be no impact of this fence on the recorded operations.
+    if (MI->getOpcode() == AMDGPU::ATOMIC_FENCE) {
+      for (auto &DSOp : ReadOps) {
+        Register DstReg = DSOp.second->getOperand(0).getReg();
+        SDep Edge = SDep(DSOp.first, SDep::Data, DstReg);
+        Edge.setLatency(SyntheticLatency);
+        DAG->addEdge(&SU, Edge);
+      }
+      // Clear tracked operations
+      ReadOps.clear();
+      NextIdx = 0;
+      continue;
+    }
+  }
+}
+
+} // end namespace
+
+std::unique_ptr<ScheduleDAGMutation>
+llvm::createAMDGPUBarrierLatencyDAGMutation() {
+  return std::make_unique<BarrierLatency>();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.h b/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.h
new file mode 100644
index 0000000000000..c23f0b99fe822
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.h
@@ -0,0 +1,21 @@
+//===- AMDGPUBarrierLatency.h - AMDGPU Export Clustering --------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPUBARRIERLATENCY_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPUBARRIERLATENCY_H
+
+#include "llvm/CodeGen/ScheduleDAGMutation.h"
+#include <memory>
+
+namespace llvm {
+
+std::unique_ptr<ScheduleDAGMutation> createAMDGPUBarrierLatencyDAGMutation();
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPUBARRIERLATENCY_H
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index f0a0c2113bf81..fcf4111ea16de 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -17,6 +17,7 @@
 #include "AMDGPUTargetMachine.h"
 #include "AMDGPU.h"
 #include "AMDGPUAliasAnalysis.h"
+#include "AMDGPUBarrierLatency.h"
 #include "AMDGPUCtorDtorLowering.h"
 #include "AMDGPUExportClustering.h"
 #include "AMDGPUExportKernelRuntimeHandles.h"
@@ -588,6 +589,7 @@ createGCNMaxOccupancyMachineScheduler(MachineSchedContext *C) {
   DAG->addMutation(createIGroupLPDAGMutation(AMDGPU::SchedulingPhase::Initial));
   DAG->addMutation(createAMDGPUMacroFusionDAGMutation());
   DAG->addMutation(createAMDGPUExportClusteringDAGMutation());
+  DAG->addMutation(createAMDGPUBarrierLatencyDAGMutation());
   return DAG;
 }
@@ -608,6 +610,7 @@ createGCNMaxMemoryClauseMachineScheduler(MachineSchedContext *C) {
   if (ST.shouldClusterStores())
     DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
   DAG->addMutation(createAMDGPUExportClusteringDAGMutation());
+  DAG->addMutation(createAMDGPUBarrierLatencyDAGMutation());
   return DAG;
 }
@@ -1156,6 +1159,7 @@ GCNTargetMachine::createPostMachineScheduler(MachineSchedContext *C) const {
       EnableVOPD)
     DAG->addMutation(createVOPDPairingMutation());
   DAG->addMutation(createAMDGPUExportClusteringDAGMutation());
+  DAG->addMutation(createAMDGPUBarrierLatencyDAGMutation());
   return DAG;
 }
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index c6d70ee39202e..d09de7f91d8c5 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -50,6 +50,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUAsmPrinter.cpp
   AMDGPUAtomicOptimizer.cpp
   AMDGPUAttributor.cpp
+  AMDGPUBarrierLatency.cpp
   AMDGPUCallLowering.cpp
   AMDGPUCodeGenPrepare.cpp
   AMDGPUCombinerHelper.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
index 666523c88860c..75039722b141b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
@@ -1528,9 +1528,9 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
 ; GFX942-NEXT:    buffer_wbl2 sc1
 ; GFX942-NEXT:    buffer_atomic_cmpswap v[0:1], v2, s[0:3], 0 offen sc0
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
-; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v5
 ; GFX942-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX942-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX942-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1576,9 +1576,9 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
 ; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[4:5], v[4:5] op_sel:[0,1]
 ; GFX90A-NEXT:    buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
-; GFX90A-NEXT:    buffer_wbinvl1
 ; GFX90A-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v5
 ; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX90A-NEXT:    buffer_wbinvl1
 ; GFX90A-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX90A-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX90A-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1603,9 +1603,9 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
 ; GFX908-NEXT:    v_mov_b32_e32 v1, v5
 ; GFX908-NEXT:    buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v5
 ; GFX908-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX908-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX908-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1630,9 +1630,9 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
 ; GFX8-NEXT:    v_mov_b32_e32 v1, v5
 ; GFX8-NEXT:    buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v5
 ; GFX8-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX8-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX8-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1683,10 +1683,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
 ; GFX942-NEXT:    buffer_wbl2 sc1
 ; GFX942-NEXT:    buffer_atomic_cmpswap v[4:5], v2, s[0:3], 0 offen sc0
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
-; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX942-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
 ; GFX942-NEXT:    v_mov_b32_e32 v1, v4
+; GFX942-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX942-NEXT:    s_cbranch_execnz .LBB13_1
 ; GFX942-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1730,10 +1730,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[0:1], v[0:1] op_sel:[0,1]
 ; GFX90A-NEXT:    buffer_atomic_cmpswap v[4:5], v2, s[16:19], 0 offen glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
-; GFX90A-NEXT:    buffer_wbinvl1
 ; GFX90A-NEXT:    v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
 ; GFX90A-NEXT:    v_mov_b32_e32 v1, v4
+; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX90A-NEXT:    buffer_wbinvl1
 ; GFX90A-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX90A-NEXT:    s_cbranch_execnz .LBB13_1
 ; GFX90A-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1756,10 +1756,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
 ; GFX908-NEXT:    v_mov_b32_e32 v4, v0
 ; GFX908-NEXT:    buffer_atomic_cmpswap v[4:5], v2, s[16:19], 0 offen glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX908-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
 ; GFX908-NEXT:    v_mov_b32_e32 v1, v4
+; GFX908-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX908-NEXT:    s_cbranch_execnz .LBB13_1
 ; GFX908-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1782,10 +1782,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
 ; GFX8-NEXT:    v_mov_b32_e32 v4, v0
 ; GFX8-NEXT:    buffer_atomic_cmpswap v[4:5], v2, s[16:19], 0 offen glc
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX8-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
 ; GFX8-NEXT:    v_mov_b32_e32 v1, v4
+; GFX8-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX8-NEXT:    s_cbranch_execnz .LBB13_1
 ; GFX8-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1830,9 +1830,9 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
 ; GFX12-NEXT:    v_dual_mov_b32 v2, v9 :: v_dual_mov_b32 v3, v10
 ; GFX12-NEXT:    buffer_atomic_cmpswap_b64 v[0:3], v6, s[0:3], null offen th:TH_ATOMIC_RETURN
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
-; GFX12-NEXT:    global_inv scope:SCOPE_DEV
 ; GFX12-NEXT:    v_cmp_eq_u64_e32 vcc_lo, v[0:1], v[9:10]
 ; GFX12-NEXT:    s_or_b32 s4, vcc_lo, s4
+; GFX12-NEXT:    global_inv scope:SCOPE_DEV
 ; GFX12-NEXT:    s_wait_alu 0xfffe
 ; GFX12-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
 ; GFX12-NEXT:    s_cbranch_execnz .LBB14_1
@@ -1872,11 +1872,10 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
 ; GFX11-NEXT:    v_dual_mov_b32 v2, v9 :: v_dual_mov_b32 v3, v10
 ; GFX11-NEXT:    buffer_atomic_cmpswap_b64 v[0:3], v6, s[0:3], 0 offen glc
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    buffer_gl1_inv
-; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_cmp_eq_u64_e32 vcc_lo, v[0:1], v[9:10]
 ; GFX11-NEXT:    s_or_b32 s4, vcc_lo, s4
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    buffer_gl1_inv
+; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
 ; GFX11-NEXT:    s_cbranch_execnz .LBB14_1
 ; GFX11-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1925,9 +1924,9 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
 ; GFX908-NEXT:    v_mov_b32_e32 v3, v10
 ; GFX908-NEXT:    buffer_atomic_cmpswap_x2 v[0:3], v6, s[16:19], 0 offen glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[9:10]
 ; GFX908-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX908-NEXT:    s_cbranch_execnz .LBB14_1
 ; GFX908-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1956,9 +1955,9 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
 ; GFX8-NEXT:    v_mov_b32_e32 v3, v10
 ; GFX8-NEXT:    buffer_atomic_cmpswap_x2 v[0:3], v6, s[16:19], 0 offen glc
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[9:10]
 ; GFX8-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX8-NEXT:    s_cbranch_execnz .LBB14_1
 ; GFX8-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -2000,10 +1999,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_
 ; GFX12-NEXT:    v_dual_mov_b32 v8, v1 :: v_dual_mov_b32 v7, v0
 ; GFX12-NEXT:    buffer_atomic_cmpswap_b64 v[7:10], v6, s[0:3], null offen th:TH_ATOMIC_RETURN
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
-; GFX12-NEXT:    global_inv scope:SCOPE_DEV
 ; GFX12-NEXT:    v_cmp_eq_u64_e32 vcc_lo, v[7:8], v[2:3]
 ; GFX12-NEXT:    v_dual_mov_b32 v2, v7 :: v_dual_mov_b32 v3, v8
 ; GFX12-NEXT:    s_or_b32 s4, vcc_lo, s4
+; GFX12-NEXT:    global_inv scope:SCOPE_DEV
 ; GFX12-NEXT:    s_wait_alu 0xfffe
 ; GFX12-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
 ; GFX12-NEXT:    s_cbranch_execnz .LBB15_1
@@ -2040,12 +2039,11 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_
 ; GFX11-NEXT:    v_dual_mov_b32 v8, v1 :: v_dual_mov_b32 v7, v0
 ; GFX11-NEXT:    buffer_atomic_cmpswap_b64 v[7:10], v6, s[0:3], 0 offen glc
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    buffer_gl1_inv
-; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_cmp_eq_u64_e32 vcc_lo, v[7:8], v[2:3]
 ; GFX11-NEXT:    v_dual_mov_b32 v2, v7 :: v_dual_mov_b32 v3, v8
 ; GFX11-NEXT:    s_or_b32 s4, vcc_lo, s4
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    buffer_gl1_inv
+; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
 ; GFX11-NEXT:    s_cbranch_execnz .LBB15_1
 ; GFX11-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -2090,11 +2088,11 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_
 ; GFX908-NEXT:    v_mov_b32_e32 v7, v0
 ; GFX908-NEXT:    buffer_atomic_cmpswap_x2 v[7:10], v6, s[16:19], 0 offen glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    v_cmp_eq_u64_e32 vcc, v[7:8], v[2:3]
 ; GFX908-NEXT:    v_mov_b32_e32 v2, v7
-; GFX908-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
 ; GFX908-NEXT:    v_mov_b32_e32 v3, v8
+; GFX908-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX908-NEXT:    s_cbranch_execnz .LBB15_1
 ; GFX908-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -2119,11 +2117,11 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_
 ; GFX8-NEXT:    v_mov_b32_e32 v7, v0
 ; GFX8-NEXT:    buffer_atomic_cmpswap_x2 v[7:10], v6, s[16:19], 0 offen glc
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    v_cmp_eq_u64_e32 vcc, v[7:8], v[2:3]
 ; GFX8-NEXT:    v_mov_b32_e32 v2, v7
-; GFX8-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
 ; GFX8-NEXT:    v_mov_b32_e32 v3, v8
+; GFX8-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX8-NEXT:    s_cbranch_execnz .LBB15_1
 ; GFX8-NEXT:  ; %bb.2: ; %atomicrmw.end
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
index 351502816ae6e..8988b2fd5d01b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
@@ -1528,9 +1528,9 @@ define float @buffer_fat_ptr_agent_atomic_fmin_ret_f32__amdgpu_no_fine_grained_m
 ; GFX942-NEXT:    buffer_wbl2 sc1
 ; GFX942-NEXT:    buffer_atomic_cmpswap v[0:1], v2, s[0:3], 0 offen sc0
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
-; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v5
 ; GFX942-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX942-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX942-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1576,9 +1576,9 @@ define float @buffer_fat_ptr_agent_atomic_fmin_ret_f32__amdgpu_no_fine_grained_m
 ; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[4:5], v[4:5] op_sel:[0,1]
 ; GFX90A-NEXT:    buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
-; GFX90A-NEXT:    buffer_wbinvl1
 ; GFX90A-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v5
 ; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX90A-NEXT:    buffer_wbinvl1
 ; GFX90A-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX90A-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX90A-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1603,9 +1603,9 @@ define float @buffer_fat_ptr_agent_atomic_fmin_ret_f32__amdgpu_no_fine_grained_m
 ; GFX908-NEXT:    v_mov_b32_e32 v1, v5
 ; GFX908-NEXT:    buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v5
 ; GFX908-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT:    buffer_wbinvl1
 ; GFX908-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX908-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX908-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1630,9 +1630,9 @@ define float @buffer_fat_ptr_agent_atomic_fmin_ret_f32__amdgpu_no_fine_grained_m
 ; GFX8-NEXT:    v_mov_b32_e32 v1, v5
 ; GFX8-NEXT:    buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v5
 ; GFX8-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT:    buffer_wbinvl1
 ; GFX8-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX8-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX8-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1683,10 +1683,10 @@ define void @buffer_fat_ptr_agent_atomic_fmin_noret_f32__amdgpu_no_fine_grained_
 ; GFX942-NEXT:    buffer_wbl2 sc1
 ; GFX942-NEXT:    buffer_atomic_cmpswap v[4:5], v2, s[0:3], 0 offen sc0
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
-; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX942-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
 ; GFX942-NEXT:    v_mov_b32_e32 v1, v4
+; GFX942-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    s_andn2_b64 exec, exec, s[4:5]
 ; GFX942-NEXT:    s_cbranch_execnz .LBB13_1
 ; GFX942-NEXT:  ; %bb.2: ; %atomicrmw.end
@@ -1730,10 +1730,10 @@ define void @buffer_fat_ptr_agent_atomic_fmin_noret_f32__amdgpu_no_fine_grained_
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[0:1], v[0:1] op_sel:[0,1]
 ; GFX90A-NEXT:    buffer_atomic_cmpswap v[4:5], v2, s[16:19], 0 offen glc
 ; GFX90A-NEXT:    s_waitcnt v...[truncated]

@llvmbot
Member

@llvm/pr-subscribers-llvm-globalisel


@jayfoad
Contributor

I'm trying to understand this. In some cases the load is required to complete before the fence for correctness, so there must already be an edge in the DAG representing that, right? So why do you need to add extra edges? Are you handling cases where the ordering was not required for correctness? Or do you just want to set a different latency on the edge? If so, can't you use adjustSchedDependency for that?
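
For context, adjustSchedDependency is a TargetSubtargetInfo hook that the DAG builder calls as it creates certain register dependencies, letting a target rewrite an edge's latency in place. A hedged sketch of how a target might use it for this purpose; the fence check and latency value are illustrative only, not AMDGPU's actual override (GCNSubtarget already defines one for other reasons), and as perlfu notes below, the fence's barrier edges never reach this hook:

// Illustrative only: bump the latency of edges whose user is an ATOMIC_FENCE.
// Hook signature as declared in llvm/include/llvm/CodeGen/TargetSubtargetInfo.h.
void GCNSubtarget::adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
                                         int UseOpIdx, SDep &Dep,
                                         const TargetSchedModel *SchedModel) const {
  const MachineInstr *UseMI = Use->getInstr();
  if (UseMI && UseMI->getOpcode() == AMDGPU::ATOMIC_FENCE)
    Dep.setLatency(2000); // hypothetical synthetic latency
}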

// Create new edges on ATOMIC_FENCE for recorded reads.
// We don't consider the scope of the fence so it is possible there will
// be no impact of this fence on the recorded operations.
if (MI->getOpcode() == AMDGPU::ATOMIC_FENCE) {

It's confusing that the description talks about "barriers", but the code handles "fence".

@github-actions (bot) commented Jun 6, 2025 (edited)

✅ With the latest revision this PR passed the C/C++ code formatter.

@perlfu
Contributor (Author)

I'm trying to understand this. In some cases the load is required to complete before the fence for correctness, so there must already be an edge in the DAG representing that, right? So why do you need to add extra edges? Are you handling cases where the ordering was not required for correctness? Or do you just want to set a different latency on the edge? If so, can't you use adjustSchedDependency for that?

In an earlier version of this I tried to add latency by introducing artificial edges with the new latency; however, this had no impact on scheduling.
In hindsight this was perhaps because they overlapped (in a src->dst sense) with the existing barrier edges -- although I had confirmed the new artificial edges were present in the DAG.
You are correct that I can achieve approximately the same result by modifying the latency on existing barrier edges.
(Although, as you pointed out, this doesn't handle the case where no edge was required for correctness.)
I have rewritten the patch to use this approach.

I was unaware of adjustSchedDependency, but from an initial inspection and experimentation the barrier edges are never passed to this function, so it cannot be used to modify them.
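
A rough sketch of what "modifying the latency on existing barrier edges" could look like inside the mutation's apply(), under stated assumptions: FenceSU, isMemRead, and the latency constant are placeholders, and the rewritten patch may differ in detail:

// Sketch: raise the latency of the ordering edges the DAG already has from
// earlier memory reads into the fence. An SDep's latency is stored on both
// sides (the fence's pred list and the read's succ list), so update both.
const unsigned SyntheticLatency = 2000; // assumed same magnitude as v1
for (SDep &PredDep : FenceSU.Preds) {
  SUnit *ReadSU = PredDep.getSUnit();
  if (!isMemRead(ReadSU->getInstr()))
    continue;
  PredDep.setLatency(SyntheticLatency);
  for (SDep &SuccDep : ReadSU->Succs)   // mirror the change on the read's side
    if (SuccDep.getSUnit() == &FenceSU)
      SuccDep.setLatency(SyntheticLatency);
}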

perlfu force-pushed the amdgpu-barrier-latency branch from 630bab4 to 0cecdc7 on June 30, 2025 08:38
@perlfu
Contributor (Author)

  • Rebase and squash

@perlfu
Contributor (Author)

Any follow-up comments?

Add scheduler DAG mutation to add data dependencies between atomic fences and preceding memory reads. This allows some modelling of the impact an atomic fence can have on outstanding memory accesses. This is beneficial when a fence would cause wait count insertion, as more instructions will be scheduled before the fence hiding memory latency. It also reduces the risk of a fence causing a premature wait on all active memory operations.
perlfu force-pushed the amdgpu-barrier-latency branch from 0cecdc7 to 53c3b21 on July 14, 2025 04:24
Reviewers

@jayfoad left review comments

@arsenm left review comments

@jmmartinez left review comments

Awaiting requested review from @ruiling

Awaiting requested review from @Pierre-vh

5 participants: @perlfu, @llvmbot, @jayfoad, @arsenm, @jmmartinez
