Extend SLP permutation optimisations

Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutations
needed or (in the best case) removing the need for the permutations
entirely.  This patch tries to extend it as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permutation.

- Treat the placement of permutations as a two-way dataflow problem:
  after propagating information from leaves to roots (as now),
  propagate information back up the graph.

- Take execution frequency into account when optimising for speed,
  so that (for example) permutations inside loops have a higher
  cost than permutations outside loops.

- Try to reduce the total number of permutations when optimising for
  size, even if that increases the number of permutations on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.

The original motivation for doing this was to add a framework that
would allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

     a[0] += 1;
     a[1] += 1;
     a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this
  by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g. if a widening
  operation produces an even vector and an odd vector, we should try
  to keep operations on the wide elements in that order rather than
  force them to be permuted back "in order".

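The speed-versus-size trade-off in the first list above can be
illustrated with a minimal sketch.  This is not the cost model in
tree-vect-slp.cc: struct placement, speed_cost, size_cost and
pick_placement are hypothetical names, and the frequencies used in the
usage note are made-up numbers for a loop that runs 100 times.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PERMS 4

/* Hypothetical description of one way of placing permutes: how many
   permute statements it needs and how often each one executes.  */
struct placement
{
  size_t n_perms;
  unsigned freq[MAX_PERMS];
};

/* When optimising for speed, weight every permute by its execution
   frequency, so that a permute inside a loop costs more.  */
static unsigned
speed_cost (const struct placement *p)
{
  assert (p->n_perms <= MAX_PERMS);
  unsigned cost = 0;
  for (size_t i = 0; i < p->n_perms; ++i)
    cost += p->freq[i];
  return cost;
}

/* When optimising for size, only the number of permute statements
   matters, regardless of how often they execute.  */
static unsigned
size_cost (const struct placement *p)
{
  assert (p->n_perms <= MAX_PERMS);
  return (unsigned) p->n_perms;
}

/* Return 0 if A is preferred and 1 if B is preferred.  Ties go to A.  */
static int
pick_placement (const struct placement *a, const struct placement *b,
		int optimize_size)
{
  unsigned cost_a = optimize_size ? size_cost (a) : speed_cost (a);
  unsigned cost_b = optimize_size ? size_cost (b) : speed_cost (b);
  return cost_b < cost_a;
}
```

With one permute inside the loop ({ 1, { 100 } }) against two permutes
outside it ({ 2, { 1, 1 } }), the sketch prefers the second placement
for speed and the first for size.
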
To give some examples of what the patch does:

    int f1(int *__restrict a, int *__restrict b, int *__restrict c,
           int *__restrict d)
    {
      a[0] = (b[1] << c[3]) - d[1];
      a[1] = (b[0] << c[2]) - d[0];
      a[2] = (b[3] << c[1]) - d[3];
      a[3] = (b[2] << c[0]) - d[2];
    }

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising
for size we instead permute c into the same order as b+d and then
permute the result of the arithmetic into the same order as a:

        ldr     q1, [x2]
        ldr     q0, [x1]
        ext     v1.16b, v1.16b, v1.16b, #8     // <------
        sshl    v0.4s, v0.4s, v1.4s
        ldr     q1, [x3]
        sub     v0.4s, v0.4s, v1.4s
        rev64   v0.4s, v0.4s                   // <------
        str     q0, [x0]
        ret

The following function:

    int f2(int *__restrict a, int *__restrict b, int *__restrict c,
           int *__restrict d)
    {
      a[0] = (b[3] << c[3]) - d[3];
      a[1] = (b[2] << c[2]) - d[2];
      a[2] = (b[1] << c[1]) - d[1];
      a[3] = (b[0] << c[0]) - d[0];
    }

continues to push the reverse down to just before the store,
like the previous code did.

In:

    int f3(int *__restrict a, int *__restrict b, int *__restrict c,
           int *__restrict d)
    {
      for (int i = 0; i < 100; ++i)
        {
          a[0] = (a[0] + c[3]);
          a[1] = (a[1] + c[2]);
          a[2] = (a[2] + c[1]);
          a[3] = (a[3] + c[0]);
          c += 4;
        }
    }

the loads of a are hoisted and the stores of a are sunk, so that
only the load from c happens in the loop.
When optimising for speed, we prefer to have the loop operate on the
reversed layout, changing on entry and exit from the loop:

        mov     x3, x0
        adrp    x0, .LC0
        add     x1, x2, 1600
        ldr     q2, [x0, #:lo12:.LC0]
        ldr     q0, [x3]
        mov     v1.16b, v0.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        .p2align 3,,7
    .L6:
        ldr     q1, [x2], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x2, x1
        bne     .L6
        mov     v1.16b, v0.16b
        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        str     q0, [x3]
        ret

Similarly, for the very artificial testcase:

    int f4(int *__restrict a, int *__restrict b, int *__restrict c,
           int *__restrict d)
    {
      int a0 = a[0];
      int a1 = a[1];
      int a2 = a[2];
      int a3 = a[3];
      for (int i = 0; i < 100; ++i)
        {
          a0 ^= c[0];
          a1 ^= c[1];
          a2 ^= c[2];
          a3 ^= c[3];
          c += 4;
          for (int j = 0; j < 100; ++j)
            {
              a0 += d[1];
              a1 += d[0];
              a2 += d[3];
              a3 += d[2];
              d += 4;
            }
          b[0] = a0;
          b[1] = a1;
          b[2] = a2;
          b[3] = a3;
          b += 4;
        }
      a[0] = a0;
      a[1] = a1;
      a[2] = a2;
      a[3] = a3;
    }

the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
even though it's part of an SCC that includes the outer loop.
In other words, this is a motivating case for not assigning
permutes at SCC granularity.  The code we get is:

        ldr     q0, [x0]
        mov     x4, x1
        mov     x5, x0
        add     x1, x3, 1600
        add     x3, x4, 1600
        .p2align 3,,7
    .L11:
        ldr     q1, [x2], 16
        sub     x0, x1, #1600
        eor     v0.16b, v1.16b, v0.16b
        rev64   v0.4s, v0.4s              // <---
        .p2align 3,,7
    .L10:
        ldr     q1, [x0], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x0, x1
        bne     .L10
        rev64   v0.4s, v0.4s              // <---
        add     x1, x0, 1600
        str     q0, [x4], 16
        cmp     x3, x4
        bne     .L11
        str     q0, [x5]
        ret

bb-slp-layout-17.c is a collection of compile tests for problems
I hit with earlier versions of the patch.  The same problems might
show up elsewhere, but it seemed worth having the test anyway.

In slp-11b.c we previously pushed the permutation of the in[i*4]
group down from the load to just before the store.  That didn't
reduce the number or frequency of the permutations (or increase
them either).
But separating the permute from the load meant that we could no
longer use load/store lanes.

Whether load/store lanes are a good idea here is another question.
If there were two sets of loads, and if we could use a single
permutation instead of one per load, then avoiding load/store lanes
should be a good thing even under the current abstract cost model.
But I think under the current model we should try to avoid splitting
up potential load/store lanes groups if there is no specific benefit
to the split.

Preferring load/store lanes is still a source of missed optimisations
that we should fix one day...

gcc/
	* params.opt (-param=vect-max-layout-candidates=): New parameter.
	* doc/invoke.texi (vect-max-layout-candidates): Document it.
	* tree-vectorizer.h (auto_lane_permutation_t): New typedef.
	(auto_load_permutation_t): Likewise.
	* tree-vect-slp.cc (vect_slp_node_weight): New function.
	(slpg_layout_cost): New class.
	(slpg_vertex): Replace perm_in and perm_out with partition,
	out_degree, weight and out_weight.
	(slpg_partition_info, slpg_partition_layout_costs): New classes.
	(vect_optimize_slp_pass): Likewise, cannibalizing some part of
	the previous vect_optimize_slp.
	(vect_optimize_slp): Use it.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_vect_var_shift):
	Return true for aarch64.
	* gcc.dg/vect/bb-slp-layout-1.c: New test.
	* gcc.dg/vect/bb-slp-layout-2.c: New test.
	* gcc.dg/vect/bb-slp-layout-3.c: New test.
	* gcc.dg/vect/bb-slp-layout-4.c: New test.
	* gcc.dg/vect/bb-slp-layout-5.c: New test.
	* gcc.dg/vect/bb-slp-layout-6.c: New test.
	* gcc.dg/vect/bb-slp-layout-7.c: New test.
	* gcc.dg/vect/bb-slp-layout-8.c: New test.
	* gcc.dg/vect/bb-slp-layout-9.c: New test.
	* gcc.dg/vect/bb-slp-layout-10.c: New test.
	* gcc.dg/vect/bb-slp-layout-11.c: New test.
	* gcc.dg/vect/bb-slp-layout-13.c: New test.
	* gcc.dg/vect/bb-slp-layout-14.c: New test.
	* gcc.dg/vect/bb-slp-layout-15.c: New test.
	* gcc.dg/vect/bb-slp-layout-16.c: New test.
	* gcc.dg/vect/bb-slp-layout-17.c: New test.
	* gcc.dg/vect/slp-11b.c: XFAIL SLP test for load-lanes targets.
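
As a rough cross-check of the f1 example in the description, the
sketch below models the size-optimised sequence in scalar C: one
permute of c (standing in for the ext by 8 bytes, which swaps the two
halves of the vector) and one permute of the arithmetic result
(standing in for rev64 on 32-bit lanes, which swaps adjacent pairs).
The helper names f1_scalar, permute_lanes and f1_two_permutes are
hypothetical and are not part of the patch; the code only illustrates
that the two placements compute the same values.

```c
#include <assert.h>

enum { LANES = 4 };

/* Scalar reference: f1 as written in the description.  */
static void
f1_scalar (int *a, const int *b, const int *c, const int *d)
{
  a[0] = (b[1] << c[3]) - d[1];
  a[1] = (b[0] << c[2]) - d[0];
  a[2] = (b[3] << c[1]) - d[3];
  a[3] = (b[2] << c[0]) - d[2];
}

/* Set OUT[i] to IN[PERM[i]] for each lane.  OUT must not alias IN.  */
static void
permute_lanes (int *out, const int *in, const unsigned *perm)
{
  for (int i = 0; i < LANES; ++i)
    {
      assert (perm[i] < LANES);
      out[i] = in[perm[i]];
    }
}

/* Model of the -Os code: b and d stay in memory order, c is permuted
   to match them, and the result is permuted into the order of a.  */
static void
f1_two_permutes (int *a, const int *b, const int *c, const int *d)
{
  static const unsigned swap_halves[LANES] = { 2, 3, 0, 1 };
  static const unsigned swap_pairs[LANES] = { 1, 0, 3, 2 };
  int c_perm[LANES], tmp[LANES];

  permute_lanes (c_perm, c, swap_halves);
  for (int i = 0; i < LANES; ++i)
    tmp[i] = (b[i] << c_perm[i]) - d[i];
  permute_lanes (a, tmp, swap_pairs);
}
```

Calling both functions on the same small, non-negative inputs (shift
counts below the width of int) should give identical a vectors.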
2022-08-30 15:43:47 +01:00 · 2022-08-30 15:43:47 +01:00 · 61c4c98903
commit 61c4c98903
parent 050309d15e
23 changed files with 2000 additions and 429 deletions
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -14619,6 +14619,10 @@ Complex expressions slow the analyzer.
 Maximum number of arguments in a PHI supported by TREE if conversion
 unless the loop is marked with simd pragma.

+@item vect-max-layout-candidates
+The maximum number of possible vector layouts (such as permutations)
+to consider when optimizing to-be-vectorized code.
+
 @item vect-max-version-for-alignment-checks
 The maximum number of run-time checks that can be performed when
 doing loop versioning for alignment in the vectorizer.
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1137,6 +1137,10 @@ Whether to use canonical types.
 Common Joined UInteger Var(param_vect_epilogues_nomask) Init(1) IntegerRange(0, 1) Param Optimization
 Enable loop epilogue vectorization using smaller vector size.

+-param=vect-max-layout-candidates=
+Common Joined UInteger Var(param_vect_max_layout_candidates) Init(32) Param Optimization
+Maximum number of possible vector layouts (such as permutations) to consider when optimizing to-be-vectorized code.
+
 -param=vect-max-peeling-for-alignment=
 Common Joined UInteger Var(param_vect_max_peeling_for_alignment) Init(-1) IntegerRange(0, 64) Param Optimization
 Maximum number of loop peels to enhance alignment of data references in a loop.
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-1.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-1.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+
+int a[4], b[4], c[4], d[4];
+
+void f1()
+{
+  a[0] = (b[1] << c[3]) - d[1];
+  a[1] = (b[0] << c[2]) - d[0];
+  a[2] = (b[3] << c[1]) - d[3];
+  a[3] = (b[2] << c[0]) - d[2];
+}
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 3 "slp2" { target { vect_var_shift && vect_perm } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-10.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-10.c
@@ -0,0 +1,6 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Os -fno-tree-loop-vectorize" } */
+
+#include "bb-slp-layout-9.c"
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 1 "slp1" { target { vect_int && { vect_perm && vect_hw_misalign } } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-11.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-11.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-fno-tree-loop-vectorize" } */
+
+int a[4], b[4], c[400], d[400];
+
+void f1()
+{
+  int a0 = a[0] - b[0];
+  int a1 = a[1] + b[1];
+  int a2 = a[2] - b[2];
+  int a3 = a[3] + b[3];
+  int b0 = a0;
+  int b1 = a1;
+  int b2 = a2;
+  int b3 = a3;
+  for (int i = 0; i < 100; ++i)
+    {
+      a0 += c[i * 4 + 1];
+      a1 += c[i * 4 + 0];
+      a2 += c[i * 4 + 3];
+      a3 += c[i * 4 + 2];
+      b0 ^= d[i * 4 + 3];
+      b1 ^= d[i * 4 + 2];
+      b2 ^= d[i * 4 + 1];
+      b3 ^= d[i * 4 + 0];
+    }
+  a[0] = a0 ^ b0;
+  a[1] = a1 ^ b1;
+  a[2] = a2 ^ b2;
+  a[3] = a3 ^ b3;
+}
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 4 "slp1" { target { vect_int && vect_perm } } } } */
+/* { dg-final { scan-tree-dump "duplicating permutation node" "slp1" { target { vect_int && vect_perm } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-12.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-12.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Os -fno-tree-loop-vectorize" } */
+
+#include "bb-slp-layout-11.c"
+
+/* It would be better to keep the original three permutations.  */
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 3 "slp1" { target { vect_int && { vect_perm && vect_hw_misalign } } xfail { *-*-* } } } } */
+/* { dg-final { scan-tree-dump-not "duplicating permutation node" "slp1" { target { vect_int && { vect_perm && vect_hw_misalign } } xfail { *-*-* } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-13.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-13.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+
+int a[4], b[4], c[4], d[4];
+
+void f1()
+{
+  a[0] = (b[1] << c[3]) - (d[1] >> c[3]);
+  a[1] = (b[0] << c[2]) - (d[0] >> c[2]);
+  a[2] = (b[3] << c[1]) - (d[3] >> c[1]);
+  a[3] = (b[2] << c[0]) - (d[2] >> c[0]);
+}
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 3 "slp2" { target { vect_var_shift && vect_perm } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-14.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-14.c
@@ -0,0 +1,6 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Os" } */
+
+#include "bb-slp-layout-13.c"
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 2 "slp2" { target { vect_var_shift && { vect_perm && vect_hw_misalign } } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-15.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-15.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+
+int a[4], b[4], c[4], d[4];
+
+void f1()
+{
+  a[0] = (b[3] << c[3]) - d[0];
+  a[1] = (b[2] << c[2]) - d[2];
+  a[2] = (b[1] << c[1]) - d[4];
+  a[3] = (b[0] << c[0]) - d[6];
+}
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 1 "slp2" { target { vect_var_shift && vect_perm } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-16.c
@@ -0,0 +1,6 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Os" } */
+
+#include "bb-slp-layout-15.c"
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 1 "slp2" { target { vect_var_shift && { vect_perm && vect_hw_misalign } } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-17.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-17.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+
+int a[8], b[8];
+
+int f1()
+{
+  a[0] = b[4] + 1;
+  a[1] = b[5] + 1;
+  a[2] = b[6] + 1;
+  a[3] = b[7] + 1;
+  a[4] = b[0] + 1;
+  a[5] = b[1] + 1;
+  a[6] = b[2] + 1;
+  a[7] = b[3] + 1;
+}
+
+unsigned short c[2], d[2];
+void f2() {
+  c[0] += d[1];
+  c[1] += d[0];
+}
+
+typedef int v4si __attribute__((vector_size(16)));
+void f3(v4si x) {
+  a[0] = b[1] + x[1];
+  a[1] = b[0] + x[3];
+}
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-2.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-2.c
@@ -0,0 +1,6 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Os" } */
+
+#include "bb-slp-layout-1.c"
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 2 "slp2" { target { vect_var_shift && { vect_perm && vect_hw_misalign } } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-3.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-3.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+
+int a[4], b[4], c[4], d[4];
+
+void f1()
+{
+  a[0] = (b[3] << c[3]) - d[3];
+  a[1] = (b[2] << c[2]) - d[2];
+  a[2] = (b[1] << c[1]) - d[1];
+  a[3] = (b[0] << c[0]) - d[0];
+}
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 1 "slp2" { target { vect_var_shift && vect_perm } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-4.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-4.c
@@ -0,0 +1,6 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Os" } */
+
+#include "bb-slp-layout-3.c"
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 1 "slp2" { target { vect_var_shift && { vect_perm && vect_hw_misalign } } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-5.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-5.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+
+int a[4], b[4], c[4];
+
+void f1()
+{
+  a[0] = b[3] - c[3];
+  a[1] = b[2] + c[2];
+  a[2] = b[1] - c[1];
+  a[3] = b[0] + c[0];
+}
+
+/* { dg-final { scan-tree-dump "absorbing input layouts" "slp2" { target { vect_int && vect_perm } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-6.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-6.c
@@ -0,0 +1,6 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Os" } */
+
+#include "bb-slp-layout-5.c"
+
+/* { dg-final { scan-tree-dump "absorbing input layouts" "slp2" { target { vect_int && { vect_perm && vect_hw_misalign } } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-7.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-fno-tree-loop-vectorize" } */
+
+int a[4], b[400];
+
+void f1()
+{
+  for (int i = 0; i < 100; ++i)
+    {
+      a[0] += b[i * 4 + 3];
+      a[1] += b[i * 4 + 2];
+      a[2] += b[i * 4 + 1];
+      a[3] += b[i * 4 + 0];
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 2 "slp1" { target { vect_int && vect_perm } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-8.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-8.c
@@ -0,0 +1,6 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Os -fno-tree-loop-vectorize" } */
+
+#include "bb-slp-layout-7.c"
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 1 "slp1" { target { vect_int && { vect_perm && vect_hw_misalign } } } } } */
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-9.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-9.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-fno-tree-loop-vectorize" } */
+
+int a[4], b[400], c[400], d[40000];
+
+void f1()
+{
+  int a0 = a[0];
+  int a1 = a[1];
+  int a2 = a[2];
+  int a3 = a[3];
+  for (int i = 0; i < 100; ++i)
+    {
+      a0 ^= c[i * 4 + 0];
+      a1 ^= c[i * 4 + 1];
+      a2 ^= c[i * 4 + 2];
+      a3 ^= c[i * 4 + 3];
+      for (int j = 0; j < 100; ++j)
+	{
+	  a0 += d[i * 400 + j * 4 + 1];
+	  a1 += d[i * 400 + j * 4 + 0];
+	  a2 += d[i * 400 + j * 4 + 3];
+	  a3 += d[i * 400 + j * 4 + 2];
+	}
+      b[i * 4 + 0] = a0;
+      b[i * 4 + 1] = a1;
+      b[i * 4 + 2] = a2;
+      b[i * 4 + 3] = a3;
+    }
+  a[0] = a0;
+  a[1] = a1;
+  a[2] = a2;
+  a[3] = a3;
+}
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 2 "slp1" { target { vect_int && vect_perm } } } } */
--- a/gcc/testsuite/gcc.dg/vect/slp-11b.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-11b.c
@@ -44,4 +44,4 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_strided4 || vect_perm } && vect_int_mult } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm && vect_int_mult } xfail vect_load_lanes } } } */
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -6814,6 +6814,7 @@ proc check_effective_target_vect_var_shift { } {
    return [check_cached_effective_target_indexed vect_var_shift {
      expr {(([istarget i?86-*-*] || [istarget x86_64-*-*])
 	     && [check_avx2_available])
+	    || [istarget aarch64*-*-*]
      }}]
 }

--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -154,7 +154,9 @@ struct vect_scalar_ops_slice_hash : typed_noop_remove<vect_scalar_ops_slice>
  SLP
 ************************************************************************/
 typedef vec<std::pair<unsigned, unsigned> > lane_permutation_t;
+typedef auto_vec<std::pair<unsigned, unsigned>, 16> auto_lane_permutation_t;
 typedef vec<unsigned> load_permutation_t;
+typedef auto_vec<unsigned, 16> auto_load_permutation_t;

 /* A computation tree of an SLP instance.  Each node corresponds to a group of
   stmts to be packed in a SIMD stmt.  */