For the testcase
void __cond_swap(double* __x, double* __y) {
bool __r = (*__x < *__y);
auto __tmp = __r ? *__x : *__y;
*__y = __r ? *__y : *__x;
*__x = __tmp;
}
GCC-14 with -O2 and -march=x86-64 options generates the following code:
__cond_swap(double*, double*):
movsd xmm1, QWORD PTR [rdi]
movsd xmm0, QWORD PTR [rsi]
comisd xmm0, xmm1
jbe .L2
movq rax, xmm1
movapd xmm1, xmm0
movq xmm0, rax
.L2:
movsd QWORD PTR [rsi], xmm1
movsd QWORD PTR [rdi], xmm0
ret
rax is used to save and restore the DFmode value. In RA, both GENERAL_REGS
and SSE_REGS cost zero since we didn't disparage the
alternative in the movdf_internal pattern, so according to the register
allocation order, GENERAL_REGS is allocated. The patch adds '?' to the
(r,v) and (v,r) alternatives, just like we did for the movsf/hf/bf_internal
patterns; after that we get optimal RA.
__cond_swap:
.LFB0:
.cfi_startproc
movsd (%rdi), %xmm1
movsd (%rsi), %xmm0
comisd %xmm1, %xmm0
jbe .L2
movapd %xmm1, %xmm2
movapd %xmm0, %xmm1
movapd %xmm2, %xmm0
.L2:
movsd %xmm1, (%rsi)
movsd %xmm0, (%rdi)
ret
gcc/ChangeLog:
PR target/110170
* config/i386/i386.md (movdf_internal): Disparage slightly for
2 alternatives (r,v) and (v,r) by adding constraint modifier
'?'.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr110170-3.c: New test.
PR106907 has a few warnings spotted by cppcheck, among them a redundant
initialization issue: the initial value of 'new_addr' was overwritten
before it was read. Update the source by removing the unnecessary
initialization of 'new_addr'.
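For illustration, a minimal, self-contained sketch of the pattern being removed
(the name 'new_addr' is reused here purely for illustration; the real code lives
in rs6000_expand_vector_extract):
#include <stdio.h>
/* The cppcheck "redundant initialization" pattern: the initial value is a
   dead store because it is overwritten before it is ever read.  */
int
main (void)
{
  int new_addr = 0;   /* dead store: overwritten before any read */
  new_addr = 42;      /* first value actually used */
  printf ("%d\n", new_addr);
  return 0;
}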
2023-07-06 Jeevitha Palanisamy <jeevitha@linux.ibm.com>
gcc/
PR target/106907
* config/rs6000/rs6000.cc (rs6000_expand_vector_extract): Remove redundant
initialization of new_addr.
If a loop is unrolled during vectorization (i.e. suggested_unroll_factor > 1),
the VFs of both the main and epilog loops are enlarged. The epilog vect loop
is specifically for a loop with small iteration counts, so a large VF may hurt
performance.
This patch unscales the main loop VF by suggested_unroll_factor while selecting
the epilog loop VF, so that it will be the same as for a vectorized loop without
unrolling (i.e. suggested_unroll_factor = 1).
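As a rough numeric sketch of the idea (the values below are assumed for
illustration and are not taken from the patch):
#include <stdio.h>
/* Divide the unroll factor back out of the main loop VF when choosing the
   epilog VF, so the choice matches the non-unrolled case.  */
int
main (void)
{
  unsigned main_vf = 16;                 /* assumed main loop VF with unrolling */
  unsigned suggested_unroll_factor = 4;  /* assumed unroll factor */
  unsigned epilog_base_vf = main_vf / suggested_unroll_factor;
  printf ("epilog VF selection starts from %u\n", epilog_base_vf);  /* 4 */
  return 0;
}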
gcc/ChangeLog:
PR tree-optimization/110474
* tree-vect-loop.cc (vect_analyze_loop_2): Unscale the VF by the
suggested unroll factor while selecting the epilog vect loop VF.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/pr110474.c: New testcase.
Rather than creating long call chains, put the onus for finishing
the evaluation on the caller.
* gimple-range-gori.cc (compute_operand_range): After calling
compute_operand2_range, recursively call self if needed.
(compute_operand2_range): Turn into a leaf function.
(gori_compute::compute_operand1_and_operand2_range): Finish
operand2 calculation.
* gimple-range-gori.h (compute_operand2_range): Remove name param.
Rather than creating long call chains, put the onus for finishing
the evaluation on the caller.
* gimple-range-gori.cc (compute_operand_range): After calling
compute_operand1_range, recursively call self if needed.
(compute_operand1_range): Turn into a leaf function.
(gori_compute::compute_operand1_and_operand2_range): Finish
operand1 calculation.
* gimple-range-gori.h (compute_operand1_range): Remove name param.
Move the check for co-dependency between 2 operands into
compute_operand_range, resulting in a much cleaner
compute_operand1_and_operand2_range routine.
* gimple-range-gori.cc (compute_operand_range): Check for
operand interdependence when both op1 and op2 are computed.
(compute_operand1_and_operand2_range): No checks required now.
compute_operand1_range and compute_operand2_range were both doing
relation discovery between the 2 operands... move it into a common area.
* gimple-range-gori.cc (compute_operand_range): Check for
a relation between op1 and op2 and use that instead.
(compute_operand1_range): Don't look for a relation override.
(compute_operand2_range): Ditto.
This testcase is causing some timeout issues. This patch splits the
testcase up by individual set algorithm.
libstdc++-v3:/ChangeLog:
* testsuite/25_algorithms/pstl/alg_sorting/set.cc: Delete
file.
* testsuite/25_algorithms/pstl/alg_sorting/set_difference.cc:
New file.
* testsuite/25_algorithms/pstl/alg_sorting/set_intersection.cc:
Likewise.
* testsuite/25_algorithms/pstl/alg_sorting/set_symmetric_difference.cc:
Likewise.
* testsuite/25_algorithms/pstl/alg_sorting/set_union.cc:
Likewise.
* testsuite/25_algorithms/pstl/alg_sorting/set_util.h:
Likewise.
The mod-subtract optimization with ncounts==1 produced incorrect edge
probabilities due to incorrect conditional probability calculation. This
patch fixes the calculation.
Signed-off-by: Filip Kastl <filip.kastl@gmail.com>
gcc/ChangeLog:
* value-prof.cc (gimple_mod_subtract_transform): Correct edge
prob calculation.
Also change some internal variables to bool.
gcc/ChangeLog:
* sched-int.h (struct haifa_sched_info): Change the can_schedule_ready_p,
schedule_more_p and contributes_to_priority indirect function
types from int to bool.
(no_real_insns_p): Change return type from int to bool.
(contributes_to_priority): Ditto.
* haifa-sched.cc (no_real_insns_p): Change return type from
int to bool and adjust function body accordingly.
* modulo-sched.cc (try_scheduling_node_in_cycle): Change "success"
variable type from int to bool.
(ps_insn_advance_column): Change return type from int to bool.
(ps_has_conflicts): Ditto. Change "has_conflicts"
variable type from int to bool.
* sched-deps.cc (deps_may_trap_p): Change return type from int to bool.
(conditions_mutex_p): Ditto.
* sched-ebb.cc (schedule_more_p): Ditto.
(ebb_contributes_to_priority): Change return type from
int to bool and adjust function body accordingly.
* sched-rgn.cc (is_cfg_nonregular): Ditto.
(check_live_1): Ditto.
(is_pfree): Ditto.
(find_conditional_protection): Ditto.
(is_conditionally_protected): Ditto.
(is_prisky): Ditto.
(is_exception_free): Ditto.
(haifa_find_rgns): Change "unreachable" and "too_large_failure"
variables from int to bool.
(extend_rgns): Change "rescan" variable from int to bool.
(check_live): Change return type from
int to bool and adjust function body accordingly.
(can_schedule_ready_p): Ditto.
(schedule_more_p): Ditto.
(contributes_to_priority): Ditto.
In gimple-isel we already deduce a vec_set pattern from an
ARRAY_REF(VIEW_CONVERT_EXPR). This patch does the same for a
vec_extract.
The code is largely similar to the vec_set one
including the addition of a can_vec_extract_var_idx_p function
in optabs.cc to check if the backend can handle a register
operand as index. We already have can_vec_extract in
optabs-query but that one checks whether we can extract
specific modes.
With the introduction of an internal function for vec_extract
the expander must not FAIL. For vec_set this has already been
the case so adjust the documentation accordingly.
Additionally, clarify the wording of the vector-vector case for
vec_extract.
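For illustration, the kind of source-level extract with a variable index that
this enables (a sketch, not the actual testcase):
typedef int v4si __attribute__ ((vector_size (16)));
/* The subscript with a non-constant index is represented internally as an
   ARRAY_REF of a VIEW_CONVERT_EXPR, which gimple-isel can now turn into a
   vec_extract internal function call when the target supports it.  */
int
get_elem (v4si v, int idx)
{
  return v[idx];
}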
gcc/ChangeLog:
* doc/md.texi: Document that vec_set and vec_extract must not
fail.
* gimple-isel.cc (gimple_expand_vec_set_expr): Rename this...
(gimple_expand_vec_set_extract_expr): ...to this.
(gimple_expand_vec_exprs): Call renamed function.
* internal-fn.cc (vec_extract_direct): Add.
(expand_vec_extract_optab_fn): New function to expand
vec_extract optab.
(direct_vec_extract_optab_supported_p): Add.
* internal-fn.def (VEC_EXTRACT): Add.
* optabs.cc (can_vec_extract_var_idx_p): New function.
* optabs.h (can_vec_extract_var_idx_p): Declare.
This patch adds a gen_lowpart in the vec_extract expander so it properly
works with a variable index and adds tests.
gcc/ChangeLog:
* config/riscv/autovec.md: Add gen_lowpart.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1.c: Add
tests for variable index.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3.c: Ditto.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4.c: Ditto.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-run.c:
Ditto.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-zvfh-run.c:
Ditto.
This patch would like to take FRM_DYN const rtx as the rounding mode
operand according to the RVV spec, which takes the dyn as the only
rounding mode for floating-point.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins.cc
(function_expander::use_exact_insn): Use FRM_DYN instead of const0.
Signed-off-by: Pan Li <pan2.li@intel.com>
This fixes a bug in the autovec FP narrowing patterns which resulted in
a combine ICE. Combine would try to e.g. simplify a unary operation via
simplify_const_unary_operation, which obviously expects a float_truncate
and not a truncate for a floating-point mode.
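A sketch of the kind of FP narrowing loop those autovec patterns cover
(illustrative only; not the actual reproducer):
/* double -> float conversion over a loop; the RTL for the vectorized
   narrowing must use float_truncate, not truncate.  */
void
narrow (float *restrict dst, double *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (float) src[i];
}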
gcc/ChangeLog:
* config/riscv/autovec.md: Use float_truncate.
RISC-V lowers the TYPE_PRECISION for MODE_VECTOR_BOOL vectors in order
to distinguish between VNx1BI, VNx2BI, VNx4BI and VNx8BI.
This patch adjusts uses of MODE_VECTOR_BOOL to use GET_MODE_PRECISION
instead of GET_MODE_BITSIZE.
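For illustration, a loop whose vectorization produces a MODE_VECTOR_BOOL mask
(a sketch, not one of the new test bodies):
/* The vector comparison result is a boolean mask vector internally, whose
   TYPE_PRECISION RISC-V lowers to distinguish VNx1BI..VNx8BI.  */
void
cmp_mask (int *restrict out, int *restrict a, int *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = a[i] < b[i];
}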
The RISC-V tests are provided by Juzhe.
Co-Authored-By: Juzhe-Zhong <juzhe.zhong@rivai.ai>
gcc/c-family/ChangeLog:
* c-common.cc (c_common_type_for_mode): Use GET_MODE_PRECISION.
gcc/ChangeLog:
* simplify-rtx.cc (native_encode_rtx): Ditto.
(native_decode_vector_rtx): Ditto.
(simplify_const_vector_byte_offset): Ditto.
(simplify_const_vector_subreg): Ditto.
* tree.cc (build_truth_vector_type_for_mode): Ditto.
* varasm.cc (output_constant_pool_2): Ditto.
gcc/fortran/ChangeLog:
* trans-types.cc (gfc_type_for_mode): Ditto.
gcc/go/ChangeLog:
* go-lang.cc (go_langhook_type_for_mode): Ditto.
gcc/lto/ChangeLog:
* lto-lang.cc (lto_type_for_mode): Ditto.
gcc/rust/ChangeLog:
* backend/rust-tree.cc (c_common_type_for_mode): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-10.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-11.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-12.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-13.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-14.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-6.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-7.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-8.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/bitmask-9.c: New test.
MIPSr6 supports unaligned memory access with the normal lh/sh/lw/sw/ld/sd
instructions, and thus lwl/lwr/ldl/ldr and swl/swr/sdl/sdr are removed.
In the microarchitecture, these memory access instructions issue 2
operations if the address is not aligned, which is like what the lwl family
does.
For some situations (such as accessing a page boundary) on some
microarchitectures, the unaligned access may not be good enough,
and the kernel should trap and emulate it; in that case the kernel may need
the -mno-unaligned-access option.
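For reference, the shape of code exercised by the new expand-block-move tests
might look like this (a sketch; the actual tests may differ):
#include <string.h>
/* A small fixed-size copy that mips_expand_block_move can open-code; with
   -mno-unaligned-access on r6 it must not be expanded when src or dest may
   be unaligned.  */
void
copy32 (char *dst, const char *src)
{
  memcpy (dst, src, 32);
}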
gcc/
* config/mips/mips.cc (mips_expand_block_move): Don't expand for
r6 with the -mno-unaligned-access option if one or both of src and
dest are unaligned.  Restructure to return directly if length is not constant.
(mips_block_move_straight): Use emit_move if ISA_HAS_UNALIGNED_ACCESS.
gcc/testsuite/
* gcc.target/mips/expand-block-move-r6-no-unaligned.c: New test.
* gcc.target/mips/expand-block-move-r6.c: New test.
gcc.dg/vect/slp-perm-9.c is reported to FAIL with -march=cascadelake
now, which is because we now vectorize the epilogue with V2HImode
vectors after the recent change to scrap too large vector
epilogues during analysis time rather than during transform.
The following adjusts the testcase to always use the existing alternate
N, which avoids epilogue vectorization.
* gcc.dg/vect/slp-perm-9.c: Always use alternate N.
The test installed by "x86: make VPTERNLOG* usable on less than 512-bit
operands with just AVX512F" won't succeed on 32-bit, as floating point
operations are done there (by default) without using SIMD insns.
gcc/testsuite/
* gcc.target/i386/avx512f-copysign.c: Suppress for 32-bit.
Following the two-operand bitwise operations, add another splitter to also
deal with not followed by broadcast all on its own, which can be
expressed as a simple embedded broadcast instead once a broadcast operand
is actually permitted in the respective insn. While there, also permit
a broadcast operand in the corresponding expander.
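A sketch of the source-level shape the new splitter targets (illustrative
only, not the actual pr100711-6.c body):
typedef long long v8di __attribute__ ((vector_size (64)));
/* Broadcast of ~x across a vector: with the splitter this can become a
   single vpternlog with an embedded broadcast operand.  */
v8di
bcast_not (long long x)
{
  v8di v = { ~x, ~x, ~x, ~x, ~x, ~x, ~x, ~x };
  return v;
}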
gcc/
PR target/100711
* config/i386/sse.md: New splitters to simplify
not;vec_duplicate as a singular vpternlog.
(one_cmpl<mode>2): Allow broadcast for operand 1.
(<mask_codefor>one_cmpl<mode>2<mask_name>): Likewise.
gcc/testsuite/
PR target/100711
* gcc.target/i386/pr100711-6.c: New test.
With the respective two-operand bitwise operations now expressible by a
single VPTERNLOG, add splitters to also deal with the ior and xor
counterparts of the original and-only case. Note that the splitters need
to be separate, as the placement of "not" differs in the final insns
(*iornot<mode>3, *xnor<mode>3) which are intended to pick up one half of
the result.
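For illustration, the ior counterpart at the source level might look like
this (a sketch, not one of the actual new tests):
typedef long long v8di __attribute__ ((vector_size (64)));
/* not of a broadcast scalar combined with a vector via ior; the splitter
   can rewrite this so *iornot<mode>3 picks it up.  */
v8di
orn_bcast (v8di a, long long b)
{
  v8di vb = { b, b, b, b, b, b, b, b };
  return a | ~vb;
}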
gcc/
PR target/100711
* config/i386/sse.md: New splitters to simplify
not;vec_duplicate;{ior,xor} as vec_duplicate;{iornot,xnor}.
gcc/testsuite/
PR target/100711
* gcc.target/i386/pr100711-4.c: New test.
* gcc.target/i386/pr100711-5.c: New test.
The intended broadcast (with AVX512) can very well be done right from
memory.
gcc/
PR target/100711
* config/i386/sse.md: Permit non-immediate operand 1 in AVX2
form of splitter for PR target/100711.
The following adjusts the tree.def documentation about VEC_PERM_EXPR
which wasn't adjusted when the restrictions of permutes with constant
mask were relaxed.
PR middle-end/110541
* tree.def (VEC_PERM_EXPR): Adjust documentation to reflect
reality.
When it's the memory operand which is to be inverted, using VPANDN*
requires a further load instruction. The same can be achieved by a
single VPTERNLOG*. Add two new alternatives (for plain memory and
embedded broadcast), adjusting the predicate for the first operand
accordingly.
Two pre-existing testcases actually end up being affected (improved) by
the change, which is reflected in updated expectations there.
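A sketch of the affected shape, with the operand to be inverted coming from
memory (illustrative only):
typedef unsigned long long v8du __attribute__ ((vector_size (64)));
/* a & ~(*p): with the new alternatives this can be a single vpternlog
   instead of a load followed by vpandn.  */
v8du
andnot_mem (v8du a, const v8du *p)
{
  return a & ~*p;
}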
gcc/
PR target/93768
* config/i386/sse.md (*andnot<mode>3): Add new alternatives
for memory form operand 1.
gcc/testsuite/
PR target/93768
* gcc.target/i386/avx512f-andn-di-zmm-2.c: New test.
* gcc.target/i386/avx512f-andn-si-zmm-2.c: Adjust expectations
towards generated code.
* gcc.target/i386/pr100711-3.c: Adjust expectations for 32-bit
code.
All combinations of and, ior, xor, and not involving two operands can be
expressed that way in a single insn.
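For illustration, two of the two-operand forms now emitted as a single insn
(a sketch along the lines of the new orn tests; the function names are made up):
typedef int v16si __attribute__ ((vector_size (64)));
/* a | ~b and ~(a ^ b) each map onto one vpternlog.  */
v16si ornot (v16si a, v16si b) { return a | ~b; }
v16si xnor  (v16si a, v16si b) { return ~(a ^ b); }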
gcc/
PR target/93768
* config/i386/i386.cc (ix86_rtx_costs): Further special-case
bitwise vector operations.
* config/i386/sse.md (*iornot<mode>3): New insn.
(*xnor<mode>3): Likewise.
(*<nlogic><mode>3): Likewise.
(andor): New code iterator.
(nlogic): New code attribute.
(ternlog_nlogic): Likewise.
gcc/testsuite/
PR target/93768
* gcc.target/i386/avx512-binop-not-1.h: New.
* gcc.target/i386/avx512-binop-not-2.h: New.
* gcc.target/i386/avx512f-orn-si-zmm-1.c: New test.
* gcc.target/i386/avx512f-orn-si-zmm-2.c: New test.
These tests fail with -std=gnu++98/-D_GLIBCXX_DEBUG in the runtest
flags. They should require the c++11 effective target.
libstdc++-v3/ChangeLog:
* testsuite/23_containers/forward_list/debug/iterator1_neg.cc:
Skip as UNSUPPORTED for C++98 mode.
* testsuite/23_containers/forward_list/debug/iterator3_neg.cc:
Likewise.
libstdc++-v3/ChangeLog:
PR libstdc++/110542
* include/bits/stl_uninitialized.h (__uninitialized_default_n):
Do not use std::fill_n during constant evaluation.
Similar to r14-2052-gdd2eb972a5b063, replace the try-block with RAII
types for deallocating storage and destroying elements.
libstdc++-v3/ChangeLog:
* include/bits/vector.tcc (_M_default_append): Replace try-block
with RAII types.
A mips16e2-related test fails after the ifcvt change. The mips16e2
addition also causes a test for an unrelated module to fail.
This patch adjusts branch costs when running the two affected tests.
These tests should not require the -mbranch-cost option, and
this issue needs to be addressed.
gcc/testsuite/ChangeLog:
* gcc.target/mips/mips16e2-cmov.c: Adjust branch cost to
encourage if-conversion.
* gcc.target/mips/movcc-3.c: Same as above.
The problem here is that we might produce values outside the type's
min/max (and/or its valid values, e.g. signed booleans). The fix is to
use an integer type which has the same precision and signedness
as the original type.
Note that two_value_replacement in phiopt had the same issue in previous
versions, though I don't know if a problem will show up there.
OK? Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
PR tree-optimization/110487
* match.pd (a !=/== CST1 ? CST2 : CST3): Always
build a nonstandard integer type and use that.
This fixes the first part of this bug, where `a ? -1 : 0`
would cause a value of 1 to leak into the signed boolean value.
It fixes the problem by casting to an integer type of
the same size/signedness before doing the negation and
then casting to the type of the expression.
OK? Bootstrapped and tested on x86_64.
gcc/ChangeLog:
* match.pd (a?-1:0): Cast to an integer type
rather than the original type before doing the negation.
(a?0:-1): Likewise.
gcc/ChangeLog:
* config/xtensa/xtensa.cc (machine_function, xtensa_expand_prologue):
Change to use HARD_REG_BIT and its macros.
* config/xtensa/xtensa.md
(peephole2: regmove elimination during DFmode input reload):
Likewise.
The following makes sure to not make conditional undefs in PHI arguments
unconditional by folding cond ? arg1 : arg2.
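A rough sketch of the issue (not the actual pr110491.c testcase):
/* 'x' is only defined when c is nonzero, so its PHI argument on the other
   path is undefined; folding the COND_EXPR must not make that maybe-undef
   value flow unconditionally.  */
int
f (int c, int v)
{
  int x;
  if (c)
    x = v;
  return c ? x : 0;
}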
PR tree-optimization/110491
* tree-ssa-phiopt.cc (match_simplify_replacement): Check
whether the PHI args are possibly undefined before folding
the COND_EXPR.
* gcc.dg/torture/pr110491.c: New testcase.
We already extended the machine mode from 8 to 16 bits, but there is
still one place missing in the streamer: it has a hard-coded array of
size 256 for the machine modes.
In the lto pass, we memset the array over MAX_MACHINE_MODE entries, but
the value of MAX_MACHINE_MODE will grow as more and more modes are
added, while the machine mode array in tree-streamer is still left at 256.
Then, when MAX_MACHINE_MODE is greater than 256, the memset in
lto_output_init_mode_table will unexpectedly touch memory out of range.
This patch takes MAX_MACHINE_MODE as the size of the
array in the streamer, to make sure there is no potential unexpected
memory access in the future. Meanwhile, this patch also adjusts some
places which assume MAX_MACHINE_MODE <= 256.
Care is taken that for offload compilation, we interpret the stream-in
data in terms of the host 'MAX_MACHINE_MODE' ('file_data->mode_bits'),
which very likely is different from the offload device
'MAX_MACHINE_MODE'.
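A self-contained sketch of the bug class being fixed (the real code uses
MAX_MACHINE_MODE and streamer_mode_table; the count below is an assumption
for illustration):
#include <string.h>
enum { ASSUMED_MODE_COUNT = 300 };                       /* mode count grown past 256 */
static unsigned char old_mode_table[256];                /* old, hard-coded size */
static unsigned char new_mode_table[ASSUMED_MODE_COUNT]; /* sized by the mode count */
int
main (void)
{
  /* The old code effectively did the equivalent of
     memset (old_mode_table, 0, ASSUMED_MODE_COUNT), writing past the end
     once the mode count exceeded 256; sizing the table by the mode count
     keeps the memset in bounds.  */
  memset (old_mode_table, 0, sizeof old_mode_table);
  memset (new_mode_table, 0, ASSUMED_MODE_COUNT);
  return 0;
}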
gcc/
* lto-streamer-in.cc (lto_input_mode_table): Stream in the mode
bits for machine mode table.
* lto-streamer-out.cc (lto_write_mode_table): Stream out the
HOST machine mode bits.
* lto-streamer.h (struct lto_file_decl_data): New field mode_bits.
* tree-streamer.cc (streamer_mode_table): Take MAX_MACHINE_MODE
as the table size.
* tree-streamer.h (streamer_mode_table): Ditto.
(bp_pack_machine_mode): Take 1 << ceil_log2 (MAX_MACHINE_MODE)
as the packing limit.
(bp_unpack_machine_mode): Ditto with 'file_data->mode_bits'.
gcc/lto/
* lto-common.cc (lto_file_finalize) [!ACCEL_COMPILER]: Initialize
'file_data->mode_bits'.
Signed-off-by: Pan Li <pan2.li@intel.com>
Co-authored-by: Thomas Schwinge <thomas@codesourcery.com>
The following removes gimple_uses_undefined_value_p and instead
uses the conservative mark_ssa_maybe_undefs in PHI-OPT, the last
user of the other API.
* tree-ssa-phiopt.cc (pass_phiopt::execute): Mark SSA undefs.
(empty_bb_or_one_feeding_into_p): Check for them.
* tree-ssa.h (gimple_uses_undefined_value_p): Remove.
* tree-ssa.cc (gimple_uses_undefined_value_p): Likewise.
slp_done_for_suggested_uf is used directly in vect_analyze_loop_2
without initialization, which is undefined behavior. Initialize it to false
according to the discussion.
gcc/ChangeLog:
PR tree-optimization/110531
* tree-vect-loop.cc (vect_analyze_loop_1): Initialize
slp_done_for_suggested_uf to false.
The following replaces the simplistic gimple_uses_undefined_value_p
with the conservative mark_ssa_maybe_undefs approach as already
used by LIM and IVOPTs. This is to avoid exposing an unconditional
uninitialized read on a path from entry by if-combine.
PR tree-optimization/110228
* tree-ssa-ifcombine.cc (pass_tree_ifcombine::execute):
Mark SSA may-undefs.
(bb_no_side_effects_p): Check stmt uses for undefs.
* gcc.dg/torture/pr110228.c: New testcase.
* gcc.dg/uninit-pr101912.c: Un-XFAIL.
When we compute liveness and relevance we have to make sure to
handle live but not relevant stmts in a way we can later vectorize
them. When the stmt uses only operands that do not need vectorization
we can just leave such stmts in place - but not in the case they
are recognized as patterns. Since we don't have a way to cancel
pattern recognition we have to force mark such stmts as relevant.
PR tree-optimization/110436
* tree-vect-stmts.cc (vect_mark_relevant): Expand dumping,
force live but not relevant pattern stmts relevant.
* gcc.dg/pr110436.c: New testcase.