
[WebAssembly] Mask undef shuffle lanes #149084


Open

sparker-arm wants to merge 1 commit into llvm:main from sparker-arm:wasm-mask-undef-shuffle

Conversation

sparker-arm
Contributor

In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined. This allows the VM to easily understand which lanes are required.
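
For illustration, here is the effect of the change on the sext_high_v4i8 case from the updated extend-shuffles.ll test, rewritten as a self-contained sketch in WebAssembly text format (the module wrapper and export name are mine; the instruction sequence follows the test's CHECK lines):

  (module
    ;; Only bytes 0..3 of the shuffle result are defined, so the lowering
    ;; now ANDs with (-1, 0, 0, 0) to make the undef upper lanes explicit
    ;; zeros before the extends consume the low half.
    (func (export "sext_high_v4i8") (param $in v128) (result v128)
      (i32x4.extend_low_i16x8_s
        (i16x8.extend_low_i8x16_s
          (v128.and
            (i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
              (local.get $in) (local.get $in))
            (v128.const i32x4 -1 0 0 0))))))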

@llvmbot
Member

@llvm/pr-subscribers-backend-webassembly

Author: Sam Parker (sparker-arm)

Changes

In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined. This allows the VM to easily understand which lanes are required.


Patch is 81.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149084.diff

10 Files Affected:

  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp (+36-2)
  • (modified) llvm/test/CodeGen/WebAssembly/extend-shuffles.ll (+18-10)
  • (modified) llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll (+36)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-concat.ll (+6)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-conversions.ll (+38-24)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-extending-convert.ll (+4)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-extending.ll (+6)
  • (modified) llvm/test/CodeGen/WebAssembly/simd.ll (+12-4)
  • (modified) llvm/test/CodeGen/WebAssembly/vector-reduce.ll (+401-285)
  • (modified) llvm/test/CodeGen/WebAssembly/wide-simd-mul.ll (+40-32)
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
index bf2e04caa0a61..a360c592d3ecc 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
@@ -2719,18 +2719,52 @@ WebAssemblyTargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
   Ops[OpIdx++] = Op.getOperand(0);
   Ops[OpIdx++] = Op.getOperand(1);
+  std::bitset<16> DefinedLaneBytes = 0xFFFF;
   // Expand mask indices to byte indices and materialize them as operands
   for (int M : Mask) {
     for (size_t J = 0; J < LaneBytes; ++J) {
       // Lower undefs (represented by -1 in mask) to {0..J}, which use a
       // whole lane of vector input, to allow further reduction at VM. E.g.
       // match an 8x16 byte shuffle to an equivalent cheaper 32x4 shuffle.
+      if (M == -1) {
+        DefinedLaneBytes[OpIdx - 2] = 0;
+      }
       uint64_t ByteIndex = M == -1 ? J : (uint64_t)M * LaneBytes + J;
       Ops[OpIdx++] = DAG.getConstant(ByteIndex, DL, MVT::i32);
     }
   }
-
-  return DAG.getNode(WebAssemblyISD::SHUFFLE, DL, Op.getValueType(), Ops);
+  EVT VT = Op.getValueType();
+  SDValue Shuffle = DAG.getNode(WebAssemblyISD::SHUFFLE, DL, VT, Ops);
+
+  // If only the lower four or eight bytes are actually defined by the
+  // shuffle, insert an AND so a VM can know that it can ignore the higher,
+  // undef, lanes.
+  if (DefinedLaneBytes == 0xF) {
+    SDValue LowLaneMask[] = {
+        DAG.getConstant(uint32_t(-1), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+    };
+    SDValue UndefMask =
+        DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v4i32, LowLaneMask);
+    SDValue MaskedShuffle =
+        DAG.getNode(ISD::AND, DL, MVT::v4i32,
+                    DAG.getBitcast(MVT::v4i32, Shuffle), UndefMask);
+    return DAG.getBitcast(VT, MaskedShuffle);
+  } else if (DefinedLaneBytes == 0xFF) {
+    SDValue LowLaneMask[] = {
+        DAG.getConstant(uint64_t(-1), DL, MVT::i64),
+        DAG.getConstant(uint32_t(0), DL, MVT::i64),
+    };
+    SDValue UndefMask =
+        DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v2i64, LowLaneMask);
+    SDValue MaskedShuffle =
+        DAG.getNode(ISD::AND, DL, MVT::v2i64,
+                    DAG.getBitcast(MVT::v2i64, Shuffle), UndefMask);
+    return DAG.getBitcast(VT, MaskedShuffle);
+  }
+  return Shuffle;
 }
 
 SDValue WebAssemblyTargetLowering::LowerSETCC(SDValue Op,
diff --git a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
index 7736e78271e55..0085c6cd82797 100644
--- a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
+++ b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
@@ -10,9 +10,11 @@ define <4 x i32> @sext_high_v4i8(<8 x i8> %in) {
 ; SIMD128:         .functype sext_high_v4i8 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT:    i16x8.extend_low_i8x16_s $push1=, $pop0
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push2=, $pop1
-; SIMD128-NEXT:    return $pop2
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i16x8.extend_low_i8x16_s $push3=, $pop2
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push4=, $pop3
+; SIMD128-NEXT:    return $pop4
  %shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %res = sext <4 x i8> %shuffle to <4 x i32>
  ret <4 x i32> %res
@@ -23,9 +25,11 @@ define <4 x i32> @zext_high_v4i8(<8 x i8> %in) {
 ; SIMD128:         .functype zext_high_v4i8 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT:    i16x8.extend_low_i8x16_u $push1=, $pop0
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push2=, $pop1
-; SIMD128-NEXT:    return $pop2
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i16x8.extend_low_i8x16_u $push3=, $pop2
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push4=, $pop3
+; SIMD128-NEXT:    return $pop4
  %shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %res = zext <4 x i8> %shuffle to <4 x i32>
  ret <4 x i32> %res
@@ -58,8 +62,10 @@ define <2 x i32> @sext_high_v2i16(<4 x i16> %in) {
 ; SIMD128:         .functype sext_high_v2i16 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push1=, $pop0
-; SIMD128-NEXT:    return $pop1
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push3=, $pop2
+; SIMD128-NEXT:    return $pop3
  %shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
  %res = sext <2 x i16> %shuffle to <2 x i32>
  ret <2 x i32> %res
@@ -70,8 +76,10 @@ define <2 x i32> @zext_high_v2i16(<4 x i16> %in) {
 ; SIMD128:         .functype zext_high_v2i16 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push1=, $pop0
-; SIMD128-NEXT:    return $pop1
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push3=, $pop2
+; SIMD128-NEXT:    return $pop3
  %shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
  %res = zext <2 x i16> %shuffle to <2 x i32>
  ret <2 x i32> %res
diff --git a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
index 7190e162eb010..27b7e8c6b01cd 100644
--- a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
+++ b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
@@ -32,6 +32,8 @@ define <2 x i32> @stest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -76,6 +78,8 @@ define <2 x i32> @utest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i64>
@@ -112,6 +116,8 @@ define <2 x i32> @ustest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -301,6 +307,8 @@ define <2 x i16> @stest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -328,6 +336,8 @@ define <2 x i16> @utest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i32>
@@ -355,6 +365,8 @@ define <2 x i16> @ustest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -378,6 +390,8 @@ define <4 x i16> @stest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -399,6 +413,8 @@ define <4 x i16> @utest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <4 x float> %x to <4 x i32>
@@ -420,6 +436,8 @@ define <4 x i16> @ustest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -1484,6 +1502,8 @@ define <2 x i32> @stest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -1526,6 +1546,8 @@ define <2 x i32> @utest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i64>
@@ -1561,6 +1583,8 @@ define <2 x i32> @ustest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -1738,6 +1762,8 @@ define <2 x i16> @stest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -1763,6 +1789,8 @@ define <2 x i16> @utest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i32>
@@ -1789,6 +1817,8 @@ define <2 x i16> @ustest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -1810,6 +1840,8 @@ define <4 x i16> @stest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -1829,6 +1861,8 @@ define <4 x i16> @utest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <4 x float> %x to <4 x i32>
@@ -1849,6 +1883,8 @@ define <4 x i16> @ustest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
diff --git a/llvm/test/CodeGen/WebAssembly/simd-concat.ll b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
index 42ded8a47c199..4473f7ffc6a93 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-concat.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
@@ -24,6 +24,8 @@ define <8 x i8> @concat_v4i8(<4 x i8> %a, <4 x i8> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <4 x i8> %a, <4 x i8> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   ret <8 x i8> %v
@@ -48,6 +50,8 @@ define <4 x i8> @concat_v2i8(<2 x i8> %a, <2 x i8> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 16, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <2 x i8> %a, <2 x i8> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   ret <4 x i8> %v
@@ -60,6 +64,8 @@ define <4 x i16> @concat_v2i16(<2 x i16> %a, <2 x i16> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <2 x i16> %a, <2 x i16> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   ret <4 x i16> %v
diff --git a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
index 8459ec8101ff2..c98567eaaf7d6 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
@@ -313,14 +313,16 @@ define <4 x double> @convert_low_s_v4f64(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = sitofp <4 x i32> %v to <4 x double>
@@ -333,14 +335,16 @@ define <4 x double> @convert_low_u_v4f64(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = uitofp <4 x i32> %v to <4 x double>
@@ -354,14 +358,16 @@ define <4 x double> @convert_low_s_v4f64_2(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = sitofp <8 x i32> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -374,14 +380,16 @@ define <4 x double> @convert_low_u_v4f64_2(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = uitofp <8 x i32> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -394,14 +402,16 @@ define <4 x double> @promote_low_v4f64(<8 x float> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x float> %x, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = fpext <4 x float> %v to <4 x double>
@@ -414,14 +424,16 @@ define <4 x double> @promote_low_v4f64_2(<8 x float> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = fpext <8 x float> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -435,6 +447,8 @@ define <2 x double> @promote_mixed_v2f64(<4 x float> %x, <4 x float> %y) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 28, 29, 30, 31, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
 ; CHECK-NEXT:   ...[truncated]

@dschuff
Member

Can you say a little more about what the advantages of this are, i.e. what the VM does differently as a result? (And, which VMs have you tested this with?)

@sparker-arm
Contributor Author

I haven't tested with any VMs yet, as I doubt any of them will be taking advantage of this now.

The main advantage of this change is identifying 'narrow' shuffles that can be mapped to target instructions. Even though Wasm is 128-bit, it doesn't always mean we're operating on that full width. Imagine that we're operating on a 4 x 16-bit vector and we want the result to be the even lanes: 0, 2, 4, 6. But the wasm shuffle will be 0, 2, 4, 6, 0, 0, 0, 0.

I've optimised the AArch64 backend in V8 so that these cases are often handled by splatting lane zero first, but this is still far from optimal.

With the undef mask, during isel and with very little overhead, the backend can recognize this as an 'unzip' operation instead of an arbitrary lane shuffle.
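
A sketch of that case in WebAssembly text format (this module is illustrative, not taken from the patch; it assumes an 8-lane i16 input whose even lanes form a 4 x 16-bit result):

  (module
    ;; The shuffle alone reads i16 lanes 0, 2, 4, 6 and pads the upper
    ;; lanes with lane 0, which looks like an arbitrary permutation. The
    ;; AND marks the upper eight bytes as undef/zero, so a VM can lower
    ;; this to a single unzip (e.g. AArch64 uzp1 with a zeroed second
    ;; operand).
    (func (export "even_lanes") (param $v v128) (result v128)
      (v128.and
        (i8x16.shuffle 0 1 4 5 8 9 12 13 0 1 0 1 0 1 0 1
          (local.get $v) (local.get $v))
        (v128.const i64x2 -1 0))))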

The extend_low operations also provide the same information as this mask but, if the shuffle has multiple users, it's unlikely to be such a simple optimisation during isel. I've created an optimisation in V8 specifically for figuring out undef lanes and it's non-trivial. This undef mask change would make it much simpler for other runtimes to generate good shuffle code.

As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)

@tlively
Collaborator

> As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)

If we did add such an extension to the shuffle instruction, we would still have to specify what value ends up in the lanes of the result. Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?

@sparker-arm
Contributor Author

> which VMs have you tested this with?

With ~20 lines of code added to V8 to notice the AND mask, this change gave me a ~10% speedup on my microbenchmark suite for memory interleaving.

@sparker-arm
Contributor Author

> Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?

This is the only option that I have considered, really. It would then have the same semantics as what I'm proposing here, and I would expect it to be cheap enough on any architecture.

Reviewers

@dschuff (awaiting requested review)

@tlively (awaiting requested review)

Assignees

@sparker-arm

Projects
None yet
Milestone
No milestone

4 participants
@sparker-arm @llvmbot @dschuff @tlively
