Richard Sandiford 61c4c98903 Extend SLP permutation optimisations
Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutations
needed or (in the best case) removing the need for the permutations
entirely.  This patch extends that process as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permutation.

- Treat the placement of permutations as a two-way dataflow problem:
  after propagating information from leaves to roots (as now), propagate
  information back up the graph.

- Take execution frequency into account when optimising for speed,
  so that (for example) permutations inside loops have a higher
  cost than permutations outside loops.

- Try to reduce the total number of permutations when optimising for
  size, even if that increases the number of permutations on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.

The original motivation for doing this was to add a framework that would
allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

     a[0] += 1;
     a[1] += 1;
     a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this
  by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g. if a widening
  operation produces an even vector and an odd vector, we should try
  to keep operations on the wide elements in that order rather than
  force them to be permuted back "in order".

To give some examples of what the patch does:

void f1(int *__restrict a, int *__restrict b, int *__restrict c,
        int *__restrict d)
{
  a[0] = (b[1] << c[3]) - d[1];
  a[1] = (b[0] << c[2]) - d[0];
  a[2] = (b[3] << c[1]) - d[3];
  a[3] = (b[2] << c[0]) - d[2];
}

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising
for size we instead permute c into the same order as b and d and then
permute the result of the arithmetic into the same order as a:

        ldr     q1, [x2]
        ldr     q0, [x1]
        ext     v1.16b, v1.16b, v1.16b, #8     // <------
        sshl    v0.4s, v0.4s, v1.4s
        ldr     q1, [x3]
        sub     v0.4s, v0.4s, v1.4s
        rev64   v0.4s, v0.4s                   // <------
        str     q0, [x0]
        ret

The following function:

void f2(int *__restrict a, int *__restrict b, int *__restrict c,
        int *__restrict d)
{
  a[0] = (b[3] << c[3]) - d[3];
  a[1] = (b[2] << c[2]) - d[2];
  a[2] = (b[1] << c[1]) - d[1];
  a[3] = (b[0] << c[0]) - d[0];
}

continues to push the reverse down to just before the store,
like the previous code did.

In:

void f3(int *__restrict a, int *__restrict b, int *__restrict c,
        int *__restrict d)
{
  for (int i = 0; i < 100; ++i)
    {
      a[0] = (a[0] + c[3]);
      a[1] = (a[1] + c[2]);
      a[2] = (a[2] + c[1]);
      a[3] = (a[3] + c[0]);
      c += 4;
    }
}

the loads of a are hoisted and the stores of a are sunk, so that
only the load from c happens in the loop.  When optimising for
speed, we prefer to have the loop operate on the reversed layout,
changing on entry and exit from the loop:

        mov     x3, x0
        adrp    x0, .LC0
        add     x1, x2, 1600
        ldr     q2, [x0, #:lo12:.LC0]
        ldr     q0, [x3]
        mov     v1.16b, v0.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        .p2align 3,,7
.L6:
        ldr     q1, [x2], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x2, x1
        bne     .L6
        mov     v1.16b, v0.16b
        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        str     q0, [x3]
        ret

Similarly, for the very artificial testcase:

void f4(int *__restrict a, int *__restrict b, int *__restrict c,
        int *__restrict d)
{
  int a0 = a[0];
  int a1 = a[1];
  int a2 = a[2];
  int a3 = a[3];
  for (int i = 0; i < 100; ++i)
    {
      a0 ^= c[0];
      a1 ^= c[1];
      a2 ^= c[2];
      a3 ^= c[3];
      c += 4;
      for (int j = 0; j < 100; ++j)
	{
	  a0 += d[1];
	  a1 += d[0];
	  a2 += d[3];
	  a3 += d[2];
	  d += 4;
	}
      b[0] = a0;
      b[1] = a1;
      b[2] = a2;
      b[3] = a3;
      b += 4;
    }
  a[0] = a0;
  a[1] = a1;
  a[2] = a2;
  a[3] = a3;
}

the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
even though it's part of an SCC that includes the outer loop.
In other words, this is a motivating case for not assigning
permutes at SCC granularity.  The code we get is:

        ldr     q0, [x0]
        mov     x4, x1
        mov     x5, x0
        add     x1, x3, 1600
        add     x3, x4, 1600
        .p2align 3,,7
.L11:
        ldr     q1, [x2], 16
        sub     x0, x1, #1600
        eor     v0.16b, v1.16b, v0.16b
        rev64   v0.4s, v0.4s              // <---
        .p2align 3,,7
.L10:
        ldr     q1, [x0], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x0, x1
        bne     .L10
        rev64   v0.4s, v0.4s              // <---
        add     x1, x0, 1600
        str     q0, [x4], 16
        cmp     x3, x4
        bne     .L11
        str     q0, [x5]
        ret

bb-slp-layout-17.c is a collection of compile tests for problems
I hit with earlier versions of the patch.  The same problems might
show up elsewhere, but it seemed worth having the test anyway.

In slp-11b.c we previously pushed the permutation of the in[i*4]
group down from the load to just before the store.  That neither
reduced nor increased the number or frequency of the permutations,
but separating the permute from the load meant that we could no
longer use load/store lanes.

Whether load/store lanes are a good idea here is another question.
If there were two sets of loads, and if we could use a single
permutation instead of one per load, then avoiding load/store
lanes should be a good thing even under the current abstract
cost model.  But I think under the current model we should
try to avoid splitting up potential load/store lanes groups
if there is no specific benefit to the split.

Preferring load/store lanes is still a source of missed optimisations
that we should fix one day...

gcc/
	* params.opt (-param=vect-max-layout-candidates=): New parameter.
	* doc/invoke.texi (vect-max-layout-candidates): Document it.
	* tree-vectorizer.h (auto_lane_permutation_t): New typedef.
	(auto_load_permutation_t): Likewise.
	* tree-vect-slp.cc (vect_slp_node_weight): New function.
	(slpg_layout_cost): New class.
	(slpg_vertex): Replace perm_in and perm_out with partition,
	out_degree, weight and out_weight.
	(slpg_partition_info, slpg_partition_layout_costs): New classes.
	(vect_optimize_slp_pass): Likewise, cannibalizing some part of
	the previous vect_optimize_slp.
	(vect_optimize_slp): Use it.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_vect_var_shift):
	Return true for aarch64.
	* gcc.dg/vect/bb-slp-layout-1.c: New test.
	* gcc.dg/vect/bb-slp-layout-2.c: New test.
	* gcc.dg/vect/bb-slp-layout-3.c: New test.
	* gcc.dg/vect/bb-slp-layout-4.c: New test.
	* gcc.dg/vect/bb-slp-layout-5.c: New test.
	* gcc.dg/vect/bb-slp-layout-6.c: New test.
	* gcc.dg/vect/bb-slp-layout-7.c: New test.
	* gcc.dg/vect/bb-slp-layout-8.c: New test.
	* gcc.dg/vect/bb-slp-layout-9.c: New test.
	* gcc.dg/vect/bb-slp-layout-10.c: New test.
	* gcc.dg/vect/bb-slp-layout-11.c: New test.
	* gcc.dg/vect/bb-slp-layout-13.c: New test.
	* gcc.dg/vect/bb-slp-layout-14.c: New test.
	* gcc.dg/vect/bb-slp-layout-15.c: New test.
	* gcc.dg/vect/bb-slp-layout-16.c: New test.
	* gcc.dg/vect/bb-slp-layout-17.c: New test.
	* gcc.dg/vect/slp-11b.c: XFAIL SLP test for load-lanes targets.
2022-08-30 15:43:47 +01:00

This directory contains the GNU Compiler Collection (GCC).

The GNU Compiler Collection is free software.  See the files whose
names start with COPYING for copying permission.  The manuals, and
some of the runtime libraries, are under different terms; see the
individual source files for details.

The directory INSTALL contains copies of the installation information
as HTML and plain text.  The source of this information is
gcc/doc/install.texi.  The installation information includes details
of what is included in the GCC sources and what files GCC installs.

See the file gcc/doc/gcc.texi (together with other files that it
includes) for usage and porting information.  An online readable
version of the manual is in the files gcc/doc/gcc.info*.

See http://gcc.gnu.org/bugs/ for how to report bugs usefully.

Copyright years on GCC source files may be listed using range
notation, e.g., 1987-2012, indicating that every year in the range,
inclusive, is a copyrightable year that could otherwise be listed
individually.