[MLIR][XeGPU] Add transformation pattern for vector.broadcast in Wg to Sg pass #144417


Open

nbpatel wants to merge 12 commits into llvm:main from nbpatel:xegpu_wg_sg_broadcast

Conversation

nbpatel (Contributor) commented Jun 16, 2025 (edited)

This PR adds a transformation pattern for the vector.broadcast op in the xegpu-wg-to-sg-distribute pass.

nbpatel requested a review from chencha3 Jun 16, 2025 19:41
VectorType::get(sgShape, resultType.getElementType());

SmallVector<Value> newBroadcastOps;
for (size_t i = 0; i < adaptor.getOperands().front().size(); ++i) {
Contributor

How about use range-based for loop?
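For illustration, a minimal sketch of that suggestion, reusing the names from the snippet above (a reviewer-style sketch, not necessarily the code that landed in the PR):

// Hypothetical range-based rewrite of the loop above: iterate the
// distributed source values of the adaptor directly instead of indexing.
SmallVector<Value> newBroadcastOps;
for (Value operand : adaptor.getOperands().front()) {
  auto newBroadcast = rewriter.create<vector::BroadcastOp>(
      op.getLoc(), newResultType, operand);
  xegpu::setLayoutAttr(newBroadcast->getResult(0),
                       layout.dropSgLayoutAndData());
  newBroadcastOps.push_back(newBroadcast.getResult());
}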

xegpu::LayoutAttr layout = xegpu::getLayoutAttr(op.getResult());
if (!layout || !layout.getSgLayout())
return failure();

Contributor

It looks to me that the current implementation assumes the rank of the source is the same as the rank of the result, which is a subset of the supported semantics of vector.broadcast. I believe this is partially because of the limitation of LayoutAttr. It would be better to add a check.

nbpatel reacted with thumbs up emoji
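A minimal sketch of such a guard, placed near the top of matchAndRewrite and using the names from the snippet above (an illustrative sketch only; the exact check added to the PR may differ):

// Sketch of a rank guard: bail out unless the source is a vector whose
// rank equals the result rank, the subset this pattern currently handles.
auto srcType = dyn_cast<VectorType>(op.getSourceType());
if (!srcType || srcType.getRank() != resultType.getRank())
  return rewriter.notifyMatchFailure(
      op, "only same-rank broadcast is currently supported");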
nbpatel marked this pull request as ready for review Jun 16, 2025 20:33
llvmbot (Member) commented Jun 16, 2025 (edited)

@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Nishant Patel (nbpatel)

Changes

This PR adds a transformation pattern for the vector.broadcast op in the xegpu-wg-to-sg-distribute pass.


Full diff: https://github.com/llvm/llvm-project/pull/144417.diff

3 Files Affected:

  • (modified) mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp (+40-1)
  • (modified) mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir (+18-1)
  • (modified) mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir (+17-2)
diff --git a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
index a26c6b52f0ddc..96c7032d6b812 100644
--- a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
+++ b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
@@ -328,6 +328,39 @@ struct WgToSgPrefetchNdOp : public OpConversionPattern<xegpu::PrefetchNdOp> {
   }
 };
 
+/// This pattern transforms vector.broadcast ops to work at subgroup level.
+struct WgToSgVectorBroadcastOp
+    : public OpConversionPattern<vector::BroadcastOp> {
+  using OpConversionPattern<vector::BroadcastOp>::OpConversionPattern;
+
+  LogicalResult
+  matchAndRewrite(vector::BroadcastOp op, OneToNOpAdaptor adaptor,
+                  ConversionPatternRewriter &rewriter) const override {
+    VectorType resultType = op.getResult().getType();
+    ArrayRef<int64_t> wgShape = resultType.getShape();
+
+    xegpu::LayoutAttr layout = xegpu::getLayoutAttr(op.getResult());
+    if (!layout || !layout.getSgLayout())
+      return failure();
+
+    SmallVector<int64_t> sgShape = getSgShapeAndCount(wgShape, layout).first;
+    VectorType newResultType =
+        VectorType::get(sgShape, resultType.getElementType());
+
+    SmallVector<Value> newBroadcastOps;
+    for (size_t i = 0; i < adaptor.getOperands().front().size(); ++i) {
+      auto newBroadcast = rewriter.create<vector::BroadcastOp>(
+          op.getLoc(), newResultType, adaptor.getOperands().front()[i]);
+      xegpu::setLayoutAttr(newBroadcast->getResult(0),
+                           layout.dropSgLayoutAndData());
+      newBroadcastOps.push_back(newBroadcast.getResult());
+    }
+
+    rewriter.replaceOpWithMultiple(op, {newBroadcastOps});
+    return success();
+  }
+};
+
 // Handles UnrealizedConversionCastOp generated during
 // SCFStructuralTypeConversions (step 1). This op may appear as either a
 // target or source materialization for Vector values, e.g.:
@@ -411,7 +444,8 @@ namespace xegpu {
 void populateXeGPUWgToSgDistributePatterns(RewritePatternSet &patterns) {
   patterns.add<WgToSgCreateNdOp, WgToSgLoadNdOp, WgToSgStoreNdOp,
                WgToSgUpdateNdOffsetOp, WgToSgDpasOp, WgToSgPrefetchNdOp,
-               UnrealizedConversionCastOpPattern>(patterns.getContext());
+               WgToSgVectorBroadcastOp, UnrealizedConversionCastOpPattern>(
+      patterns.getContext());
 }
 } // namespace xegpu
 } // namespace mlir
@@ -518,6 +552,11 @@ void XeGPUWgToSgDistributePass::runOnOperation() {
     return isLegal(layout);
   });
 
+  target.addDynamicallyLegalOp<vector::BroadcastOp>(
+      [=](vector::BroadcastOp op) -> bool {
+        return isLegal(xegpu::getLayoutAttr(op.getResult()));
+      });
+
   target.addDynamicallyLegalOp<UnrealizedConversionCastOp>(
       [=](UnrealizedConversionCastOp op) {
         return llvm::is_contained(existingCastOps, op.getOperation());
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
index 35ad16d8cd9a9..60ac266b0f112 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
@@ -103,6 +103,24 @@ gpu.module @test_round_robin_assignment {
     gpu.return
   }
 
+  // CHECK-LABEL: test_broadcast
+  // CHECK-SAME: %[[ARG_0:.*]]: memref<24x1xf32>
+  gpu.func @test_broadcast(%src: memref<24x1xf32>) {
+    %tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<24x1xf32>
+      -> !xegpu.tensor_desc<24x1xf32, #xegpu.layout<sg_layout = [4, 1], sg_data = [2, 1], lane_layout = [2, 1], lane_data = [1, 1]>>
+    %load =  xegpu.load_nd %tdesc
+      : !xegpu.tensor_desc<24x1xf32, #xegpu.layout<sg_layout = [4, 1], sg_data = [2, 1], lane_layout = [2, 1], lane_data = [1, 1]>>
+      -> vector<24x1xf32>
+    // CHECK-COUNT-3: vector.broadcast {{.*}}
+    // CHECK-SAME-COUNT-3: {layout_result_0 = #xegpu.layout<lane_layout = [2, 1], lane_data = [1, 1]>}
+    // CHECK-SAME-COUNT-3: : vector<2x1xf32> to vector<2x4xf32>
+    // CHECK-NOT: vector.broadcast
+    %broadcast = vector.broadcast %load
+      {layout_result_0 = #xegpu.layout<sg_layout = [4, 1], sg_data = [2, 4], lane_layout = [2, 1], lane_data = [1, 1]>}
+      : vector<24x1xf32> to vector<24x8xf32>
+    gpu.return
+  }
+
   gpu.func @test_scf_for(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
     %c1 = arith.constant 1 : index
     %c10 = arith.constant 10 : index
@@ -197,5 +215,4 @@ gpu.module @test_round_robin_assignment {
     xegpu.store_nd %d, %1 : vector<256xf32>, !xegpu.tensor_desc<256xf32, #xegpu.layout<sg_layout = [8], sg_data = [16]>>
     gpu.return
   }
-
 }
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
index 466842c968448..125bab349b4cb 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
@@ -170,6 +170,22 @@ gpu.func @test_dpas_no_sg_data(%a: memref<24x32xf32>, %b: memref<32x24xf32>) {
     gpu.return
   }
 
+  // CHECK-LABEL: test_broadcast
+  // CHECK-SAME: %[[ARG_0:.*]]: memref<24x1xf32>
+  gpu.func @test_broadcast(%src: memref<24x1xf32>) {
+    %tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<24x1xf32>
+      -> !xegpu.tensor_desc<24x1xf32, #xegpu.layout<sg_layout = [2, 1], sg_data = [12, 1], lane_layout = [2, 1], lane_data = [1, 1]>>
+    %load =  xegpu.load_nd %tdesc
+      : !xegpu.tensor_desc<24x1xf32, #xegpu.layout<sg_layout = [2, 1], sg_data = [12, 1], lane_layout = [2, 1], lane_data = [1, 1]>>
+      -> vector<24x1xf32>
+    // CHECK: vector.broadcast {{.*}} {layout_result_0 = #xegpu.layout<lane_layout = [2, 1], lane_data = [1, 1]>}
+    // CHECK-SAME: : vector<12x1xf32> to vector<12x8xf32>
+    %broadcast = vector.broadcast %load
+      {layout_result_0 = #xegpu.layout<sg_layout = [2, 1], sg_data = [12, 8], lane_layout = [2, 1], lane_data = [1, 1]>}
+      : vector<24x1xf32> to vector<24x8xf32>
+    gpu.return
+  }
+
   gpu.func @test_scf_for(%arg0: memref<1024x1024xf16>, %arg1: memref<1024x1024xf16>, %arg2: memref<1024x1024xf32>) {
     //CHECK: [[c0:%.+]] = arith.constant 0 : index
     //CHECK: [[c128:%.+]] = arith.constant 128 : index
@@ -295,6 +311,5 @@ gpu.func @test_dpas_no_sg_data(%a: memref<24x32xf32>, %b: memref<32x24xf32>) {
     xegpu.store_nd %d, %1 : vector<256xf32>, !xegpu.tensor_desc<256xf32, #xegpu.layout<sg_layout = [16], sg_data = [16]>>
     gpu.return
   }
-
-
 }
+

nbpatel requested a review from adam-smnk Jun 20, 2025 17:05
if (!layout || !layout.getSgLayout())
return failure();

// TODO: Currently only supports cases where the source and result ranks
Contributor

What happens if the broadcast is highly N-dimensional?
It's probably unlikely to end up with such IR, but I wonder if the logic here is still safe to execute in such a case.
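One way to make that explicit would be a conservative rank guard; the limit below is only an assumption for illustration, not something the PR itself adds:

// Illustrative guard only: restrict the pattern to the 1-D/2-D vectors that
// xegpu.layout attributes are typically attached to, rejecting higher ranks.
if (resultType.getRank() > 2)
  return rewriter.notifyMatchFailure(op, "high-rank broadcast not supported");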

adam-smnk (Contributor) left a comment

Looks in line with other distributions

// CHECK-SAME: %[[ARG_0:.*]]: memref<24x1xf32>
gpu.func @broadcast(%src: memref<24x1xf32>) {
%tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<24x1xf32>
-> !xegpu.tensor_desc<24x1xf32, #xegpu.layout<sg_layout = [2, 1], sg_data = [12, 1], lane_layout = [2, 1], lane_data = [1, 1]>>
Contributor

please add a test case with broadcast in dim 0 too.

Contributor (Author)

added the test case


xegpu::LayoutAttr layout = xegpu::getLayoutAttr(op.getResult());
if (!layout || !layout.getSgLayout())
return failure();
chencha3 (Contributor) commented Jun 23, 2025 (edited)

We probably also need to check whether the LayoutAttr of the input is broadcastable to the LayoutAttr of the output. In your test example the input LayoutAttr is #xegpu.layout<sg_layout = [2, 1], sg_data = [12, 1], lane_layout = [2, 1], lane_data = [1, 1]> and the output LayoutAttr is #xegpu.layout<sg_layout = [2, 1], sg_data = [12, 8], lane_layout = [2, 1], lane_data = [1, 1]>, but what if the input LayoutAttr is #xegpu.layout<sg_layout = [2, 1], sg_data = [6, 1], lane_layout = [2, 1], lane_data = [1, 1]>? Is the lowering still valid?
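A rough sketch of what such a compatibility check could look like (hypothetical; the check actually added to the PR may differ): require the source to carry a layout that tiles subgroups the same way as the result, so each subgroup already owns the slice it broadcasts from.

// Hypothetical compatibility check between source and result layouts.
xegpu::LayoutAttr srcLayout = xegpu::getLayoutAttr(op.getSource());
if (!srcLayout || !srcLayout.getSgLayout() ||
    srcLayout.getSgLayout() != layout.getSgLayout())
  return rewriter.notifyMatchFailure(
      op, "source layout is not compatible with the result layout");
// A fuller check would also verify that dimensions being broadcast
// (size 1 in the source) keep sg_data == 1 in the source layout.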

Contributor (Author)

added the check

Reviewers

chencha3 left review comments

adam-smnk approved these changes

4 participants
@nbpatel @llvmbot @adam-smnk @chencha3
