aarch64: Avoid unnecessary use of 2-input TBLs [PR115258]

When using TBL for (say) a V4SI permutation, the aarch64 port first
asks target-independent code to lower to a V16QI permutation.
Then, during code generation, an input like:

  (reg:V4SI R)

gets converted to:

  (subreg:V16QI (reg:V4SI R) 0)
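
As a rough illustration of the selector side of that lowering (a sketch,
not code from the port), the helper below expands a 4-lane selector into
the equivalent 16-byte selector.  The name lower_v4si_sel_to_v16qi is
hypothetical, and little-endian lane order is assumed:

```c
#include <assert.h>

/* Hypothetical helper, for illustration only: expand a selector over
   32-bit lanes into a selector over bytes.  Lane index N covers bytes
   4*N .. 4*N+3, assuming little-endian lane order.  A two-input
   permutation of 4-lane vectors has lane indices in the range 0..7.  */
static void
lower_v4si_sel_to_v16qi (const unsigned sel4[4], unsigned char sel16[16])
{
  for (int lane = 0; lane < 4; lane++)
    {
      assert (sel4[lane] < 8);
      for (int byte = 0; byte < 4; byte++)
        sel16[lane * 4 + byte] = (unsigned char) (sel4[lane] * 4 + byte);
    }
}
```

For a lane-reversing selector { 3, 2, 1, 0 }, this gives the byte
selector { 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3 }.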

aarch64_vectorize_vec_perm_const had:

  d.op0 = op0 ? force_reg (op_mode, op0) : NULL_RTX;
  if (op0 == op1)
    d.op1 = d.op0;
  else
    d.op1 = op1 ? force_reg (op_mode, op1) : NULL_RTX;

But subregs (unlike regs) are not shared, so the op0 == op1 check
always failed for this case.  We'd then force each subreg into a
fresh register, meaning that during the later:

  aarch64_expand_vec_perm_1 (d->target, d->op0, d->op1, sel);

there is no way for aarch64_expand_vec_perm_1 to realise that
d->op0 and d->op1 are the same value.  It would therefore generate
a two-input TBL in the testcase, even though a single-input TBL
is enough.
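
The pointer-equality problem can be sketched with a toy model.  The
types and functions below (toy_rtx, toy_make_subreg, toy_equal_p) are
hypothetical stand-ins, not GCC's real rtx machinery; they only show why
two separately built wrappers around the same register compare unequal
by address even though they are structurally identical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for an rtx; not GCC's real data structure.  */
enum toy_code { TOY_REG, TOY_SUBREG };

struct toy_rtx
{
  enum toy_code code;
  unsigned regno;               /* TOY_REG: register number.  */
  const struct toy_rtx *inner;  /* TOY_SUBREG: wrapped expression.  */
  unsigned offset;              /* TOY_SUBREG: byte offset.  */
};

/* Each call allocates a new wrapper, modelling unshared subregs.
   The caller owns the result and releases it with free.  */
static struct toy_rtx *
toy_make_subreg (const struct toy_rtx *inner, unsigned offset)
{
  struct toy_rtx *x = malloc (sizeof *x);
  assert (x != NULL);
  x->code = TOY_SUBREG;
  x->regno = 0;
  x->inner = inner;
  x->offset = offset;
  return x;
}

/* Structural comparison, loosely in the spirit of rtx_equal_p.  */
static bool
toy_equal_p (const struct toy_rtx *a, const struct toy_rtx *b)
{
  if (a == b)
    return true;
  if (a == NULL || b == NULL || a->code != b->code)
    return false;
  if (a->code == TOY_REG)
    return a->regno == b->regno;
  return a->offset == b->offset && toy_equal_p (a->inner, b->inner);
}
```

In this model, two calls to toy_make_subreg on the same register return
different pointers, so an op0 == op1 style test fails, while toy_equal_p
reports them equal.  Note that the patch itself does not compare the
operands structurally; it relies on d.one_vector_p instead.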

I'm not sure forcing subregs to a fresh register is a good idea --
it caused problems for copysign & co. -- but that's not something
to fiddle with during stage 4.  Using op0 == op1 for rtx equality
is independently wrong, so we might as well just fix that for now.

The patch gets rid of extra MOVs that are a regression from GCC 14.

The testcase is based on one from Kugan, itself based on TSVC.

gcc/
	PR target/115258
	* config/aarch64/aarch64.cc (aarch64_vectorize_vec_perm_const): Use
	d.one_vector_p to decide whether op1 should be a copy of op0.

gcc/testsuite/
	PR target/115258
	* gcc.target/aarch64/pr115258_2.c: New test.

Co-authored-by: Kugan Vivekanandarajah <kvivekananda@nvidia.com>
Richard Sandiford 2025-03-10 20:29:52 +00:00
parent e355fe414a
commit 31dcf941ac
2 changed files with 19 additions and 2 deletions

gcc/config/aarch64/aarch64.cc

@@ -26851,8 +26851,8 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
   d.op_vec_flags = aarch64_classify_vector_mode (d.op_mode);
   d.target = target;
   d.op0 = op0 ? force_reg (op_mode, op0) : NULL_RTX;
-  if (op0 == op1)
-    d.op1 = d.op0;
+  if (op0 && d.one_vector_p)
+    d.op1 = copy_rtx (d.op0);
   else
     d.op1 = op1 ? force_reg (op_mode, op1) : NULL_RTX;
   d.testing_p = !target;

gcc/testsuite/gcc.target/aarch64/pr115258_2.c

@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-v2" } */
+extern __attribute__((aligned(64))) float a[32000], b[32000];
+int dummy(float[32000], float[32000], float);
+void s1112() {
+  for (int nl = 0; nl < 100000 * 3; nl++) {
+    for (int i = 32000 - 1; i >= 0; i--) {
+      a[i] = b[i] + (float)1.;
+    }
+    dummy(a, b, 0.);
+  }
+}
+/* { dg-final { scan-assembler-not {\tmov\tv[0-9]+\.16b,} } } */