vectorizer: Avoid an OOB access from vectorization

Our checks for whether the vectorization of a given loop would make an
out-of-bounds access miss the case where the vector we load is so large
that it spans multiple iterations' worth of data (while only being there
to implement a single iteration).

This patch adds a check for such an access.

Example where this was going wrong (a smaller version of the added testcase):

```
  extern unsigned short multi_array[5][16][16];
  extern void initialise_s(int *);
  extern int get_sval();

  void foo() {
    int s0 = get_sval();
    int s[31];
    int i,j;
    initialise_s(&s[0]);
    s0 = get_sval();
    for (j=0; j < 16; j++)
      for (i=0; i < 16; i++)
	multi_array[1][j][i]=s[j*2];
  }
```

With the above loop we would load the `s[j*2]` integer into a 4-element
vector, which reads 3 more elements than the scalar loop would.
`get_group_load_store_type` identifies that the loop requires a scalar
epilogue due to gaps.  However, we do not identify that the above code
requires *two* scalar iterations to be peeled, because each vector
iteration loads data belonging to the *next* iteration (while not
using it).
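
To make the over-read concrete, here is a minimal sketch (not part of
the patch) that just walks the bounds arithmetic for the example above,
assuming the `s[j*2]` load is widened to a 4-element vector as described:

```c
/* Sketch only (not the vectorizer's own logic): report which iterations
   of the example loop would read past the end of s[] if the s[j*2] load
   were done as a 4-integer vector load.  */
#include <stdio.h>

int main (void)
{
  const int n_elems = 31;	/* s has 31 ints, valid indices 0..30.  */
  const int vec_len = 4;	/* assumed vector width in ints.  */

  for (int j = 0; j < 16; j++)
    {
      int first = j * 2;
      int last = first + vec_len - 1;
      if (last >= n_elems)
	printf ("j=%d: vector load covers s[%d..%d], past s[30]\n",
		j, first, last);
    }
  return 0;
}
```

Only j=14 (s[28..31]) and j=15 (s[30..33]) over-read, which is why
peeling a single scalar iteration into the epilogue is not enough here.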

Bootstrapped and regtested on aarch64-none-linux-gnu.
N.B. out of interest, we came across this while working with Morello.

gcc/ChangeLog:

	* tree-vect-stmts.cc (get_group_load_store_type): Account for
	`gap` when checking whether we need to peel twice.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-multi-peel-gaps.c: New test.
Author: Matthew Malcomson
Date:   2023-07-20 11:43:12 +01:00
Commit: c5bd0e5870 (parent 23ad5ed743)

2 changed files with 65 additions and 2 deletions

gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c (new file):

@@ -0,0 +1,61 @@
/* For some targets we end up vectorizing the below loop such that the `sp`
   single integer is loaded into a 4 integer vector.
   While the writes are all safe, without 2 scalar iterations being peeled into
   the epilogue we would read past the end of the 31 integer array.  This
   happens because we load a 4 integer chunk to only use the first integer and
   increment by 2 integers at a time, hence the last load needs s[30-33] and
   the penultimate load needs s[28-31].
   This testcase ensures that we do not crash due to that behaviour.  */
/* { dg-require-effective-target mmap } */
#include <sys/mman.h>
#include <stdio.h>
#define MMAP_SIZE 0x20000
#define ADDRESS 0x1122000000
#define MB_BLOCK_SIZE 16
#define VERT_PRED_16 0
#define HOR_PRED_16 1
#define DC_PRED_16 2
int *sptr;
extern void intrapred_luma_16x16();
unsigned short mprr_2[5][16][16];
void initialise_s(int *s) { }
int main() {
  void *s_mapping;
  void *end_s;
  s_mapping = mmap ((void *)ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (s_mapping == MAP_FAILED)
    {
      perror ("mmap");
      return 1;
    }
  end_s = (s_mapping + MMAP_SIZE);
  sptr = (int*)(end_s - sizeof(int[31]));
  intrapred_luma_16x16(sptr);
  return 0;
}
void intrapred_luma_16x16(int * restrict sp) {
  for (int j=0; j < MB_BLOCK_SIZE; j++)
    {
      mprr_2[VERT_PRED_16][j][0]=sp[j*2];
      mprr_2[VERT_PRED_16][j][1]=sp[j*2];
      mprr_2[VERT_PRED_16][j][2]=sp[j*2];
      mprr_2[VERT_PRED_16][j][3]=sp[j*2];
      mprr_2[VERT_PRED_16][j][4]=sp[j*2];
      mprr_2[VERT_PRED_16][j][5]=sp[j*2];
      mprr_2[VERT_PRED_16][j][6]=sp[j*2];
      mprr_2[VERT_PRED_16][j][7]=sp[j*2];
      mprr_2[VERT_PRED_16][j][8]=sp[j*2];
      mprr_2[VERT_PRED_16][j][9]=sp[j*2];
      mprr_2[VERT_PRED_16][j][10]=sp[j*2];
      mprr_2[VERT_PRED_16][j][11]=sp[j*2];
      mprr_2[VERT_PRED_16][j][12]=sp[j*2];
      mprr_2[VERT_PRED_16][j][13]=sp[j*2];
      mprr_2[VERT_PRED_16][j][14]=sp[j*2];
      mprr_2[VERT_PRED_16][j][15]=sp[j*2];
    }
}
/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_int } } } */

gcc/tree-vect-stmts.cc:

@@ -2217,7 +2217,9 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	 but the access in the loop doesn't cover the full vector
 	 we can end up with no gap recorded but still excess
 	 elements accessed, see PR103116.  Make sure we peel for
-	 gaps if necessary and sufficient and give up if not.  */
+	 gaps if necessary and sufficient and give up if not.
+	 If there is a combination of the access not covering the full vector and
+	 a gap recorded then we may need to peel twice.  */
       if (loop_vinfo
 	  && *memory_access_type == VMAT_CONTIGUOUS
 	  && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
@@ -2233,7 +2235,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 		 access excess elements.
 		 ??? Enhancements include peeling multiple iterations
 		 or using masked loads with a static mask.  */
-	      || (group_size * cvf) % cnunits + group_size < cnunits)
+	      || (group_size * cvf) % cnunits + group_size - gap < cnunits)
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
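
As a worked illustration of the adjusted condition in
`get_group_load_store_type`, the sketch below evaluates the old and new
checks side by side.  The numbers (group_size = 4, gap = 3, cnunits = 4,
cvf = 4) are made up for the demonstration rather than taken from the
vectorizer's analysis of the testcase; they are chosen so that only the
check that accounts for `gap` notices that a single peeled scalar
iteration is not enough:

```c
/* Illustration only: compare the old and new "is one peeled iteration
   enough" conditions with made-up values.  When the condition is true the
   vectorizer gives up on vectorizing this access.  */
#include <stdio.h>

int main (void)
{
  unsigned group_size = 4;   /* elements per access group (assumed)  */
  unsigned gap = 3;          /* recorded gap: trailing elements the loop never uses (assumed)  */
  unsigned cnunits = 4;      /* elements per vector (assumed)  */
  unsigned cvf = 4;          /* constant vectorization factor (assumed)  */

  unsigned leftover = (group_size * cvf) % cnunits;
  int old_gives_up = leftover + group_size < cnunits;
  int new_gives_up = leftover + group_size - gap < cnunits;

  /* Only group_size - gap elements of the last group are actually used,
     so with a gap the last vector load can still reach past the useful
     data even when the old check considered one peeled iteration enough.  */
  printf ("old check gives up: %d, new check gives up: %d\n",
	  old_gives_up, new_gives_up);
  return 0;
}
```

With these numbers the old check lets the access through while the new
one refuses it: subtracting `gap` counts only the elements the loop
actually uses, which is what determines whether the last vector load
over-reads.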