
[WebAssembly] Mask undef shuffle lanes #149084


Open

sparker-arm wants to merge 1 commit into llvm:main from sparker-arm:wasm-mask-undef-shuffle

Conversation

sparker-arm
Contributor

In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined. This allows the VM to easily understand which lanes are required.
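
For illustration, here is the effect of the change on the sext_high_v4i8 case from the updated extend-shuffles.ll test, rewritten as a self-contained sketch in WebAssembly text format (the module wrapper and export name are mine; the instruction sequence follows the test's CHECK lines):

  (module
    ;; Only bytes 0..3 of the shuffle result are defined, so the lowering
    ;; now ANDs with (-1, 0, 0, 0) to make the undef upper lanes explicit
    ;; zeros before the extends consume the low half.
    (func (export "sext_high_v4i8") (param $in v128) (result v128)
      (i32x4.extend_low_i16x8_s
        (i16x8.extend_low_i8x16_s
          (v128.and
            (i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
              (local.get $in) (local.get $in))
            (v128.const i32x4 -1 0 0 0))))))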

@llvmbot
Member

@llvm/pr-subscribers-backend-webassembly

Author: Sam Parker (sparker-arm)

Changes

In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined. This allows the VM to easily understand which lanes are required.


Patch is 81.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149084.diff

10 Files Affected:

  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp (+36-2)
  • (modified) llvm/test/CodeGen/WebAssembly/extend-shuffles.ll (+18-10)
  • (modified) llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll (+36)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-concat.ll (+6)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-conversions.ll (+38-24)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-extending-convert.ll (+4)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-extending.ll (+6)
  • (modified) llvm/test/CodeGen/WebAssembly/simd.ll (+12-4)
  • (modified) llvm/test/CodeGen/WebAssembly/vector-reduce.ll (+401-285)
  • (modified) llvm/test/CodeGen/WebAssembly/wide-simd-mul.ll (+40-32)
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
index bf2e04caa0a61..a360c592d3ecc 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
@@ -2719,18 +2719,52 @@ WebAssemblyTargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
   Ops[OpIdx++] = Op.getOperand(0);
   Ops[OpIdx++] = Op.getOperand(1);
+  std::bitset<16> DefinedLaneBytes = 0xFFFF;
   // Expand mask indices to byte indices and materialize them as operands
   for (int M : Mask) {
     for (size_t J = 0; J < LaneBytes; ++J) {
       // Lower undefs (represented by -1 in mask) to {0..J}, which use a
       // whole lane of vector input, to allow further reduction at VM. E.g.
       // match an 8x16 byte shuffle to an equivalent cheaper 32x4 shuffle.
+      if (M == -1) {
+        DefinedLaneBytes[OpIdx - 2] = 0;
+      }
       uint64_t ByteIndex = M == -1 ? J : (uint64_t)M * LaneBytes + J;
       Ops[OpIdx++] = DAG.getConstant(ByteIndex, DL, MVT::i32);
     }
   }
-
-  return DAG.getNode(WebAssemblyISD::SHUFFLE, DL, Op.getValueType(), Ops);
+  EVT VT = Op.getValueType();
+  SDValue Shuffle = DAG.getNode(WebAssemblyISD::SHUFFLE, DL, VT, Ops);
+
+  // If only the lower four or eight bytes are actually defined by the
+  // shuffle, insert an AND so a VM can know that it can ignore the higher,
+  // undef, lanes.
+  if (DefinedLaneBytes == 0xF) {
+    SDValue LowLaneMask[] = {
+        DAG.getConstant(uint32_t(-1), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+    };
+    SDValue UndefMask =
+        DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v4i32, LowLaneMask);
+    SDValue MaskedShuffle =
+        DAG.getNode(ISD::AND, DL, MVT::v4i32,
+                    DAG.getBitcast(MVT::v4i32, Shuffle), UndefMask);
+    return DAG.getBitcast(VT, MaskedShuffle);
+  } else if (DefinedLaneBytes == 0xFF) {
+    SDValue LowLaneMask[] = {
+        DAG.getConstant(uint64_t(-1), DL, MVT::i64),
+        DAG.getConstant(uint32_t(0), DL, MVT::i64),
+    };
+    SDValue UndefMask =
+        DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v2i64, LowLaneMask);
+    SDValue MaskedShuffle =
+        DAG.getNode(ISD::AND, DL, MVT::v2i64,
+                    DAG.getBitcast(MVT::v2i64, Shuffle), UndefMask);
+    return DAG.getBitcast(VT, MaskedShuffle);
+  }
+  return Shuffle;
 }
 
 SDValue WebAssemblyTargetLowering::LowerSETCC(SDValue Op,
diff --git a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
index 7736e78271e55..0085c6cd82797 100644
--- a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
+++ b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
@@ -10,9 +10,11 @@ define <4 x i32> @sext_high_v4i8(<8 x i8> %in) {
 ; SIMD128:         .functype sext_high_v4i8 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT:    i16x8.extend_low_i8x16_s $push1=, $pop0
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push2=, $pop1
-; SIMD128-NEXT:    return $pop2
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i16x8.extend_low_i8x16_s $push3=, $pop2
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push4=, $pop3
+; SIMD128-NEXT:    return $pop4
  %shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %res = sext <4 x i8> %shuffle to <4 x i32>
  ret <4 x i32> %res
@@ -23,9 +25,11 @@ define <4 x i32> @zext_high_v4i8(<8 x i8> %in) {
 ; SIMD128:         .functype zext_high_v4i8 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT:    i16x8.extend_low_i8x16_u $push1=, $pop0
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push2=, $pop1
-; SIMD128-NEXT:    return $pop2
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i16x8.extend_low_i8x16_u $push3=, $pop2
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push4=, $pop3
+; SIMD128-NEXT:    return $pop4
  %shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %res = zext <4 x i8> %shuffle to <4 x i32>
  ret <4 x i32> %res
@@ -58,8 +62,10 @@ define <2 x i32> @sext_high_v2i16(<4 x i16> %in) {
 ; SIMD128:         .functype sext_high_v2i16 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push1=, $pop0
-; SIMD128-NEXT:    return $pop1
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push3=, $pop2
+; SIMD128-NEXT:    return $pop3
  %shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
  %res = sext <2 x i16> %shuffle to <2 x i32>
  ret <2 x i32> %res
@@ -70,8 +76,10 @@ define <2 x i32> @zext_high_v2i16(<4 x i16> %in) {
 ; SIMD128:         .functype zext_high_v2i16 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push1=, $pop0
-; SIMD128-NEXT:    return $pop1
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push3=, $pop2
+; SIMD128-NEXT:    return $pop3
  %shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
  %res = zext <2 x i16> %shuffle to <2 x i32>
  ret <2 x i32> %res
diff --git a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
index 7190e162eb010..27b7e8c6b01cd 100644
--- a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
+++ b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
@@ -32,6 +32,8 @@ define <2 x i32> @stest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -76,6 +78,8 @@ define <2 x i32> @utest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i64>
@@ -112,6 +116,8 @@ define <2 x i32> @ustest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -301,6 +307,8 @@ define <2 x i16> @stest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -328,6 +336,8 @@ define <2 x i16> @utest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i32>
@@ -355,6 +365,8 @@ define <2 x i16> @ustest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -378,6 +390,8 @@ define <4 x i16> @stest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -399,6 +413,8 @@ define <4 x i16> @utest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <4 x float> %x to <4 x i32>
@@ -420,6 +436,8 @@ define <4 x i16> @ustest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -1484,6 +1502,8 @@ define <2 x i32> @stest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -1526,6 +1546,8 @@ define <2 x i32> @utest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i64>
@@ -1561,6 +1583,8 @@ define <2 x i32> @ustest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -1738,6 +1762,8 @@ define <2 x i16> @stest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -1763,6 +1789,8 @@ define <2 x i16> @utest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i32>
@@ -1789,6 +1817,8 @@ define <2 x i16> @ustest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -1810,6 +1840,8 @@ define <4 x i16> @stest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -1829,6 +1861,8 @@ define <4 x i16> @utest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <4 x float> %x to <4 x i32>
@@ -1849,6 +1883,8 @@ define <4 x i16> @ustest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
diff --git a/llvm/test/CodeGen/WebAssembly/simd-concat.ll b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
index 42ded8a47c199..4473f7ffc6a93 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-concat.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
@@ -24,6 +24,8 @@ define <8 x i8> @concat_v4i8(<4 x i8> %a, <4 x i8> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <4 x i8> %a, <4 x i8> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   ret <8 x i8> %v
@@ -48,6 +50,8 @@ define <4 x i8> @concat_v2i8(<2 x i8> %a, <2 x i8> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 16, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <2 x i8> %a, <2 x i8> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   ret <4 x i8> %v
@@ -60,6 +64,8 @@ define <4 x i16> @concat_v2i16(<2 x i16> %a, <2 x i16> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <2 x i16> %a, <2 x i16> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   ret <4 x i16> %v
diff --git a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
index 8459ec8101ff2..c98567eaaf7d6 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
@@ -313,14 +313,16 @@ define <4 x double> @convert_low_s_v4f64(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = sitofp <4 x i32> %v to <4 x double>
@@ -333,14 +335,16 @@ define <4 x double> @convert_low_u_v4f64(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = uitofp <4 x i32> %v to <4 x double>
@@ -354,14 +358,16 @@ define <4 x double> @convert_low_s_v4f64_2(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = sitofp <8 x i32> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -374,14 +380,16 @@ define <4 x double> @convert_low_u_v4f64_2(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = uitofp <8 x i32> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -394,14 +402,16 @@ define <4 x double> @promote_low_v4f64(<8 x float> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x float> %x, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = fpext <4 x float> %v to <4 x double>
@@ -414,14 +424,16 @@ define <4 x double> @promote_low_v4f64_2(<8 x float> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = fpext <8 x float> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -435,6 +447,8 @@ define <2 x double> @promote_mixed_v2f64(<4 x float> %x, <4 x float> %y) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 28, 29, 30, 31, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
 ; CHECK-NEXT:   ...[truncated]

@dschuff
Member

Can you say a little more about what the advantages of this are, i.e. what the VM does differently as a result? (And, which VMs have you tested this with?)

@sparker-arm
Contributor Author

I haven't tested with any VMs yet, as I doubt any of them will be taking advantage of this now.

The main advantage of this change is identifying 'narrow' shuffles that can be mapped to target instructions. Even though Wasm is 128-bit, it doesn't always mean we're operating on that full width. Imagine that we're operating on a 4 x 16-bit vector and we want the result to be the even lanes: 0, 2, 4, 6. But the wasm shuffle will be 0, 2, 4, 6, 0, 0, 0, 0.

I've optimised the AArch64 backend in V8 so that these cases are often handled by splatting lane zero first, but this is still far from optimal.

With the undef mask, during isel and with very little overhead, the backend can recognize this as an 'unzip' operation instead of an arbitrary lane shuffle.
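
A sketch of that case in WebAssembly text format (this module is illustrative, not taken from the patch; it assumes an 8-lane i16 input whose even lanes form a 4 x 16-bit result):

  (module
    ;; The shuffle alone reads i16 lanes 0, 2, 4, 6 and pads the upper
    ;; lanes with lane 0, which looks like an arbitrary permutation. The
    ;; AND marks the upper eight bytes as undef/zero, so a VM can lower
    ;; this to a single unzip (e.g. AArch64 uzp1 with a zeroed second
    ;; operand).
    (func (export "even_lanes") (param $v v128) (result v128)
      (v128.and
        (i8x16.shuffle 0 1 4 5 8 9 12 13 0 1 0 1 0 1 0 1
          (local.get $v) (local.get $v))
        (v128.const i64x2 -1 0))))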

The extend_low operations also provide the same information as this mask but, if the shuffle has multiple users, it's unlikely to be such a simple optimisation during isel. I've created an optimisation in V8 specifically for figuring out undef lanes and it's non-trivial. This undef mask change would make it much simpler for other runtimes to generate good shuffle code.

As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)

@tlively
Collaborator

> As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)

If we did add such an extension to the shuffle instruction, we would still have to specify what value ends up in the lanes of the result. Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?

@sparker-arm
Contributor Author

> which VMs have you tested this with?

With ~20 lines of code added to V8 to notice the AND mask, this change gave me a ~10% speedup on my microbenchmark suite for memory interleaving.

@sparker-arm
Contributor Author

> Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?

This is the only option that I have considered, really. It would then have the same semantics as what I'm proposing here, and I would expect it to be cheap enough on any architecture.

Reviewers

@dschuff (awaiting requested review)

@tlively (awaiting requested review)

Assignees

@sparker-arm

Projects
None yet
Milestone
No milestone

4 participants
@sparker-arm @llvmbot @dschuff @tlively
