nvptx backend prerequisites for OpenMP offloading

gcc/
	* config/nvptx/mkoffload.c (main): Check that either OpenACC or OpenMP
	is selected.  Pass -mgomp to offload compiler in OpenMP case.
	* config/nvptx/nvptx-protos.h (nvptx_shuffle_kind): Move enum
	declaration from nvptx.c.
	(nvptx_gen_shuffle): Declare.
	(nvptx_output_set_softstack): Declare.
	* config/nvptx/nvptx.c (nvptx_shuffle_kind): Move to nvptx-protos.h.
	(need_softstack_decl): New variable.
	(need_unisimt_decl): New variable.
	(diagnose_openacc_conflict): New.  Use it...
	(nvptx_option_override): ...here.  Handle TARGET_GOMP.
	(nvptx_encode_section_info): Handle "shared" attribute.
	(write_as_kernel): Restrict to OpenACC target regions.
	(init_softstack_frame): New.
	(nvptx_init_unisimt_predicate): New.
	(write_omp_entry): New.  Use it...
	(nvptx_declare_function_name): ...here to emit OpenMP target region
	entrypoints.  Handle TARGET_SOFT_STACK.  Call
	nvptx_init_unisimt_predicate.
	(nvptx_output_set_softstack): New.
	(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
	(nvptx_gen_shuffle): Export.
	(nvptx_output_call_insn): Handle COND_EXEC patterns.  Emit instruction
	predicate.
	(nvptx_print_operand): Fix handling of instruction predicates.
	(nvptx_get_unisimt_master): New helper function.
	(nvptx_get_unisimt_predicate): Ditto.
	(nvptx_call_insn_is_syscall_p): Ditto.
	(nvptx_unisimt_handle_set): Ditto.
	(nvptx_reorg_uniform_simt): New.  Transform code for -muniform-simt.
	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
	(nvptx_handle_shared_attribute): New.  Use it...
	(nvptx_attribute_table): ... here (new entry).
	(nvptx_record_offload_symbol): Handle NULL attributes.
	(nvptx_file_end): Handle need_softstack_decl and need_unisimt_decl.
	(nvptx_simt_vf): New.
	(TARGET_SIMT_VF): Define.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_softstack__ or __nvptx_unisimt__ when the -msoft-stack or,
	respectively, the -muniform-simt option is active.
	(STACK_SIZE_MODE): Define.
	(FIXED_REGISTERS): Adjust.
	(SOFTSTACK_SLOT_REGNUM): New.
	(SOFTSTACK_PREV_REGNUM): New.
	(REGISTER_NAMES): Adjust.
	(struct machine_function): New fields.
	* config/nvptx/nvptx.md (UNSPEC_SET_SOFTSTACK): New.
	(UNSPEC_VOTE_BALLOT): Ditto.
	(UNSPEC_LANEID): Ditto.
	(UNSPECV_NOUNROLL): Ditto.
	(atomic): New attribute.
	(predicable): New attribute.  Generate predicated forms via
	define_cond_exec.
	(br_true): Mark as not predicable.
	(br_false): Ditto.
	(br_true_uni): Ditto.
	(br_false_uni): Ditto.
	(return): Ditto.
	(trap_if_true): Ditto.
	(trap_if_false): Ditto.
	(nvptx_fork): Ditto.
	(nvptx_forked): Ditto.
	(nvptx_joining): Ditto.
	(nvptx_join): Ditto.
	(nvptx_barsync): Ditto.
	(epilogue): Emit stack restore if TARGET_SOFT_STACK.
	(allocate_stack): Implement for TARGET_SOFT_STACK.  Remove unused code.
	(allocate_stack_<mode>): Remove unused pattern.
	(set_softstack_insn): New pattern.
	(restore_stack_block): Handle for TARGET_SOFT_STACK.
	(nvptx_vote_ballot): New pattern.
	(omp_simt_lane): Ditto.
	(omp_simt_last_lane): Ditto.
	(omp_simt_ordered): Ditto.
	(omp_simt_vote_any): Ditto.
	(omp_simt_xchg_bfly): Ditto.
	(omp_simt_xchg_idx): Ditto.
	(nvptx_nounroll): Ditto.
	(atomic_compare_and_swap<mode>_1): Mark with atomic attribute.
	(atomic_exchange<mode>): Ditto.
	(atomic_fetch_add<mode>): Ditto.
	(atomic_fetch_addsf): Ditto.
	(atomic_fetch_<logic><mode>): Ditto.
	* config/nvptx/nvptx.opt (msoft-stack): New option.
	(muniform-simt): Ditto.
	(mgomp): Ditto.
	* config/nvptx/t-nvptx (MULTILIB_OPTIONS): New.
	* doc/extend.texi (Nvidia PTX Variable Attributes): New section.
	* doc/invoke.texi (msoft-stack): Document.
	(muniform-simt): Document.
	(mgomp): Document.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_SIMT_VF): New hook.
	* target.def: Define it.
	* target-insns.def (omp_simt_lane): New.
	(omp_simt_last_lane): New.
	(omp_simt_ordered): New.
	(omp_simt_vote_any): New.
	(omp_simt_xchg_bfly): New.
	(omp_simt_xchg_idx): New.

libgcc/
	* config/nvptx/crt0.c (__main): Setup __nvptx_stacks and __nvptx_uni.
	* config/nvptx/mgomp.c: New file.
	* config/nvptx/t-nvptx: Add mgomp.c.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_alloca): Use a
	compile test.
	* gcc.target/nvptx/softstack.c: New test.
	* gcc.target/nvptx/decl-shared.c: New test.
	* gcc.target/nvptx/decl-shared-init.c: New test.

From-SVN: r242503
Alexander Monakov 2016-11-16 20:17:00 +03:00 committed by Alexander Monakov
parent 2fe2aba3cd
commit 5012919d0b
23 changed files with 870 additions and 68 deletions


@ -1,3 +1,106 @@
2016-11-16 Alexander Monakov <amonakov@ispras.ru>
* config/nvptx/mkoffload.c (main): Check that either OpenACC or OpenMP
is selected. Pass -mgomp to offload compiler in OpenMP case.
* config/nvptx/nvptx-protos.h (nvptx_shuffle_kind): Move enum
declaration from nvptx.c.
(nvptx_gen_shuffle): Declare.
(nvptx_output_set_softstack): Declare.
* config/nvptx/nvptx.c (nvptx_shuffle_kind): Move to nvptx-protos.h.
(need_softstack_decl): New variable.
(need_unisimt_decl): New variable.
(diagnose_openacc_conflict): New. Use it...
(nvptx_option_override): ...here. Handle TARGET_GOMP.
(nvptx_encode_section_info): Handle "shared" attribute.
(write_as_kernel): Restrict to OpenACC target regions.
(init_softstack_frame): New.
(nvptx_init_unisimt_predicate): New.
(write_omp_entry): New. Use it...
(nvptx_declare_function_name): ...here to emit OpenMP target region
entrypoints. Handle TARGET_SOFT_STACK. Call
nvptx_init_unisimt_predicate.
(nvptx_output_set_softstack): New.
(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
(nvptx_gen_shuffle): Export.
(nvptx_output_call_insn): Handle COND_EXEC patterns. Emit instruction
predicate.
(nvptx_print_operand): Fix handling of instruction predicates.
(nvptx_get_unisimt_master): New helper function.
(nvptx_get_unisimt_predicate): Ditto.
(nvptx_call_insn_is_syscall_p): Ditto.
(nvptx_unisimt_handle_set): Ditto.
(nvptx_reorg_uniform_simt): New. Transform code for -muniform-simt.
(nvptx_reorg): Call nvptx_reorg_uniform_simt.
(nvptx_handle_shared_attribute): New. Use it...
(nvptx_attribute_table): ... here (new entry).
(nvptx_record_offload_symbol): Handle NULL attributes.
(nvptx_file_end): Handle need_softstack_decl and need_unisimt_decl.
(nvptx_simt_vf): New.
(TARGET_SIMT_VF): Define.
* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
__nvptx_softstack__ or __nvptx_unisimt__ when the -msoft-stack or,
respectively, the -muniform-simt option is active.
(STACK_SIZE_MODE): Define.
(FIXED_REGISTERS): Adjust.
(SOFTSTACK_SLOT_REGNUM): New.
(SOFTSTACK_PREV_REGNUM): New.
(REGISTER_NAMES): Adjust.
(struct machine_function): New fields.
* config/nvptx/nvptx.md (UNSPEC_SET_SOFTSTACK): New.
(UNSPEC_VOTE_BALLOT): Ditto.
(UNSPEC_LANEID): Ditto.
(UNSPECV_NOUNROLL): Ditto.
(atomic): New attribute.
(predicable): New attribute. Generate predicated forms via
define_cond_exec.
(br_true): Mark as not predicable.
(br_false): Ditto.
(br_true_uni): Ditto.
(br_false_uni): Ditto.
(return): Ditto.
(trap_if_true): Ditto.
(trap_if_false): Ditto.
(nvptx_fork): Ditto.
(nvptx_forked): Ditto.
(nvptx_joining): Ditto.
(nvptx_join): Ditto.
(nvptx_barsync): Ditto.
(epilogue): Emit stack restore if TARGET_SOFT_STACK.
(allocate_stack): Implement for TARGET_SOFT_STACK. Remove unused code.
(allocate_stack_<mode>): Remove unused pattern.
(set_softstack_insn): New pattern.
(restore_stack_block): Handle for TARGET_SOFT_STACK.
(nvptx_vote_ballot): New pattern.
(omp_simt_lane): Ditto.
(omp_simt_last_lane): Ditto.
(omp_simt_ordered): Ditto.
(omp_simt_vote_any): Ditto.
(omp_simt_xchg_bfly): Ditto.
(omp_simt_xchg_idx): Ditto.
(nvptx_nounroll): Ditto.
(atomic_compare_and_swap<mode>_1): Mark with atomic attribute.
(atomic_exchange<mode>): Ditto.
(atomic_fetch_add<mode>): Ditto.
(atomic_fetch_addsf): Ditto.
(atomic_fetch_<logic><mode>): Ditto.
* config/nvptx/nvptx.opt (msoft-stack): New option.
(muniform-simt): Ditto.
(mgomp): Ditto.
* config/nvptx/t-nvptx (MULTILIB_OPTIONS): New.
* doc/extend.texi (Nvidia PTX Variable Attributes): New section.
* doc/invoke.texi (msoft-stack): Document.
(muniform-simt): Document.
(mgomp): Document.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in (TARGET_SIMT_VF): New hook.
* target.def: Define it.
* target-insns.def (omp_simt_lane): New.
(omp_simt_last_lane): New.
(omp_simt_ordered): New.
(omp_simt_vote_any): New.
(omp_simt_xchg_bfly): New.
(omp_simt_xchg_idx): New.
2016-11-16 Maciej W. Rozycki <macro@imgtec.com>
* config/mips/mips-protos.h (mips_set_text_contents_type): New


@ -460,6 +460,7 @@ main (int argc, char **argv)
/* Scan the argument vector. */
bool fopenmp = false;
bool fopenacc = false;
for (int i = 1; i < argc; i++)
{
#define STR "-foffload-abi="
@ -476,11 +477,15 @@ main (int argc, char **argv)
#undef STR
else if (strcmp (argv[i], "-fopenmp") == 0)
fopenmp = true;
else if (strcmp (argv[i], "-fopenacc") == 0)
fopenacc = true;
else if (strcmp (argv[i], "-save-temps") == 0)
save_temps = true;
else if (strcmp (argv[i], "-v") == 0)
verbose = true;
}
if (!(fopenacc ^ fopenmp))
fatal_error (input_location, "either -fopenacc or -fopenmp must be set");
struct obstack argv_obstack;
obstack_init (&argv_obstack);
@ -501,6 +506,8 @@ main (int argc, char **argv)
default:
gcc_unreachable ();
}
if (fopenmp)
obstack_ptr_grow (&argv_obstack, "-mgomp");
for (int ix = 1; ix != argc; ix++)
{
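
The option scan in this hunk accepts exactly one of -fopenacc and -fopenmp. A minimal standalone sketch of that check is given below; `pick_offload_model` and `enum offload_model` are hypothetical names introduced for illustration and are not part of mkoffload.c, which reports the failure through `fatal_error` instead of returning a value.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type, used only in this sketch.  */
enum offload_model { OFFLOAD_INVALID, OFFLOAD_OPENACC, OFFLOAD_OPENMP };

/* Scan ARGV the way the hunk does and require exactly one of
   -fopenacc / -fopenmp.  Other options are ignored here.  */
static enum offload_model
pick_offload_model (int argc, const char *const *argv)
{
  bool fopenmp = false;
  bool fopenacc = false;

  for (int i = 1; i < argc; i++)
    {
      if (strcmp (argv[i], "-fopenmp") == 0)
        fopenmp = true;
      else if (strcmp (argv[i], "-fopenacc") == 0)
        fopenacc = true;
    }

  /* The XOR is nonzero only when exactly one of the two flags is set.  */
  if (!(fopenacc ^ fopenmp))
    return OFFLOAD_INVALID;
  return fopenmp ? OFFLOAD_OPENMP : OFFLOAD_OPENACC;
}
```

In the OpenMP case the real driver additionally appends -mgomp to the offload compiler's argument vector.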


@ -21,6 +21,16 @@
#ifndef GCC_NVPTX_PROTOS_H
#define GCC_NVPTX_PROTOS_H
/* The kind of shuffle instruction. */
enum nvptx_shuffle_kind
{
SHUFFLE_UP,
SHUFFLE_DOWN,
SHUFFLE_BFLY,
SHUFFLE_IDX,
SHUFFLE_MAX
};
extern void nvptx_declare_function_name (FILE *, const char *, const_tree decl);
extern void nvptx_declare_object_name (FILE *file, const char *name,
const_tree decl);
@ -36,10 +46,12 @@ extern void nvptx_register_pragmas (void);
extern void nvptx_expand_oacc_fork (unsigned);
extern void nvptx_expand_oacc_join (unsigned);
extern void nvptx_expand_call (rtx, rtx);
extern rtx nvptx_gen_shuffle (rtx, rtx, rtx, nvptx_shuffle_kind);
extern rtx nvptx_expand_compare (rtx);
extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
extern const char *nvptx_output_mov_insn (rtx, rtx);
extern const char *nvptx_output_call_insn (rtx_insn *, rtx, rtx);
extern const char *nvptx_output_return (void);
extern const char *nvptx_output_set_softstack (unsigned);
#endif
#endif


@ -72,16 +72,6 @@
/* This file should be included last. */
#include "target-def.h"
/* The kind of shuffle instruction. */
enum nvptx_shuffle_kind
{
SHUFFLE_UP,
SHUFFLE_DOWN,
SHUFFLE_BFLY,
SHUFFLE_IDX,
SHUFFLE_MAX
};
/* The various PTX memory areas an object might reside in. */
enum nvptx_data_area
{
@ -141,6 +131,12 @@ static GTY(()) rtx worker_red_sym;
/* Global lock variable, needed for 128bit worker & gang reductions. */
static GTY(()) tree global_lock_var;
/* True if any function references __nvptx_stacks. */
static bool need_softstack_decl;
/* True if any function references __nvptx_uni. */
static bool need_unisimt_decl;
/* Allocate a new, cleared machine_function structure. */
static struct machine_function *
@ -151,6 +147,16 @@ nvptx_init_machine_status (void)
return p;
}
/* Issue a diagnostic when option OPTNAME is enabled (as indicated by OPTVAL)
and -fopenacc is also enabled. */
static void
diagnose_openacc_conflict (bool optval, const char *optname)
{
if (flag_openacc && optval)
error ("option %s is not supported together with -fopenacc", optname);
}
/* Implement TARGET_OPTION_OVERRIDE. */
static void
@ -188,6 +194,13 @@ nvptx_option_override (void)
worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
if (TARGET_GOMP)
target_flags |= MASK_SOFT_STACK | MASK_UNIFORM_SIMT;
}
/* Return a ptx type for MODE. If PROMOTE, then use .u32 for QImode to
@ -238,9 +251,17 @@ nvptx_encode_section_info (tree decl, rtx rtl, int first)
if (TREE_CONSTANT (decl))
area = DATA_AREA_CONST;
else if (TREE_CODE (decl) == VAR_DECL)
/* TODO: This would be a good place to check for a .shared or
other section name. */
area = TREE_READONLY (decl) ? DATA_AREA_CONST : DATA_AREA_GLOBAL;
{
if (lookup_attribute ("shared", DECL_ATTRIBUTES (decl)))
{
area = DATA_AREA_SHARED;
if (DECL_INITIAL (decl))
error ("static initialization of variable %q+D in %<.shared%>"
" memory is not supported", decl);
}
else
area = TREE_READONLY (decl) ? DATA_AREA_CONST : DATA_AREA_GLOBAL;
}
SET_SYMBOL_DATA_AREA (XEXP (rtl, 0), area);
}
@ -718,7 +739,10 @@ static bool
write_as_kernel (tree attrs)
{
return (lookup_attribute ("kernel", attrs) != NULL_TREE
|| lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE);
|| (lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE
&& lookup_attribute ("oacc function", attrs) != NULL_TREE));
/* For OpenMP target regions, the corresponding kernel entry is emitted from
write_omp_entry as a separate function. */
}
/* Emit a linker marker for a function decl or defn. */
@ -973,6 +997,67 @@ init_frame (FILE *file, int regno, unsigned align, unsigned size)
POINTER_SIZE, reg_names[regno], reg_names[regno]);
}
/* Emit soft stack frame setup sequence. */
static void
init_softstack_frame (FILE *file, unsigned alignment, HOST_WIDE_INT size)
{
/* Maintain 64-bit stack alignment. */
unsigned keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
size = ROUND_UP (size, keep_align);
int bits = POINTER_SIZE;
const char *reg_stack = reg_names[STACK_POINTER_REGNUM];
const char *reg_frame = reg_names[FRAME_POINTER_REGNUM];
const char *reg_sspslot = reg_names[SOFTSTACK_SLOT_REGNUM];
const char *reg_sspprev = reg_names[SOFTSTACK_PREV_REGNUM];
fprintf (file, "\t.reg.u%d %s;\n", bits, reg_stack);
fprintf (file, "\t.reg.u%d %s;\n", bits, reg_frame);
fprintf (file, "\t.reg.u%d %s;\n", bits, reg_sspslot);
fprintf (file, "\t.reg.u%d %s;\n", bits, reg_sspprev);
fprintf (file, "\t{\n");
fprintf (file, "\t\t.reg.u32 %%fstmp0;\n");
fprintf (file, "\t\t.reg.u%d %%fstmp1;\n", bits);
fprintf (file, "\t\t.reg.u%d %%fstmp2;\n", bits);
fprintf (file, "\t\tmov.u32 %%fstmp0, %%tid.y;\n");
fprintf (file, "\t\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
bits == 64 ? ".wide" : ".lo", bits / 8);
fprintf (file, "\t\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits);
/* Initialize %sspslot = &__nvptx_stacks[tid.y]. */
fprintf (file, "\t\tadd.u%d %s, %%fstmp2, %%fstmp1;\n", bits, reg_sspslot);
/* Initialize %sspprev = __nvptx_stacks[tid.y]. */
fprintf (file, "\t\tld.shared.u%d %s, [%s];\n",
bits, reg_sspprev, reg_sspslot);
/* Initialize %frame = %sspprev - size. */
fprintf (file, "\t\tsub.u%d %s, %s, " HOST_WIDE_INT_PRINT_DEC ";\n",
bits, reg_frame, reg_sspprev, size);
/* Apply alignment, if larger than 64. */
if (alignment > keep_align)
fprintf (file, "\t\tand.b%d %s, %s, %d;\n",
bits, reg_frame, reg_frame, -alignment);
size = crtl->outgoing_args_size;
gcc_assert (size % keep_align == 0);
/* Initialize %stack. */
fprintf (file, "\t\tsub.u%d %s, %s, " HOST_WIDE_INT_PRINT_DEC ";\n",
bits, reg_stack, reg_frame, size);
/* Usually 'crtl->is_leaf' is computed during register allocator
initialization, which is not done on NVPTX. Compute it now. */
gcc_assert (!crtl->is_leaf);
crtl->is_leaf = leaf_function_p ();
if (!crtl->is_leaf)
fprintf (file, "\t\tst.shared.u%d [%s], %s;\n",
bits, reg_sspslot, reg_stack);
fprintf (file, "\t}\n");
cfun->machine->has_softstack = true;
need_softstack_decl = true;
}
/* Emit code to initialize the REGNO predicate register to indicate
whether we are not lane zero on the NAME axis. */
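
The pointer arithmetic that init_softstack_frame emits as PTX can be modelled in plain C. The sketch below is illustrative only: `model_softstack_frame`, `struct softstack_frame` and `KEEP_ALIGN` are hypothetical names, and `KEEP_ALIGN` is assumed to be 8 bytes, matching the "64-bit stack alignment" comment in the hunk. The requested alignment is assumed to be a power of two.

```c
#include <stdint.h>

/* Assumed value of BIGGEST_ALIGNMENT / BITS_PER_UNIT (64 bits = 8 bytes).  */
#define KEEP_ALIGN 8u

/* Hypothetical holder for the two registers the prologue initializes.  */
struct softstack_frame
{
  uint64_t frame;   /* Models %frame.  */
  uint64_t stack;   /* Models %stack.  */
};

/* SSPPREV models the value loaded from __nvptx_stacks[tid.y].
   All quantities are in bytes.  */
static struct softstack_frame
model_softstack_frame (uint64_t sspprev, uint64_t size,
                       uint64_t alignment, uint64_t outgoing_args_size)
{
  struct softstack_frame f;

  /* Round the frame size up to the alignment that is always kept.  */
  size = (size + KEEP_ALIGN - 1) & ~(uint64_t) (KEEP_ALIGN - 1);

  /* The soft stack grows downward: %frame = %sspprev - size.  */
  f.frame = sspprev - size;

  /* Apply a stricter alignment only when one is requested.  */
  if (alignment > KEEP_ALIGN)
    f.frame &= -alignment;

  /* %stack sits below the outgoing argument area.  */
  f.stack = f.frame - outgoing_args_size;
  return f;
}
```

For example, with a previous stack pointer of 4096, a 20-byte frame and 16 bytes of outgoing arguments, the model yields a frame pointer of 4072 and a stack pointer of 4056; requesting 64-byte alignment moves the frame pointer down to 4032.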
@ -986,6 +1071,97 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
fprintf (file, "\t}\n");
}
/* Emit code to initialize predicate and master lane index registers for
-muniform-simt code generation variant. */
static void
nvptx_init_unisimt_predicate (FILE *file)
{
int bits = POINTER_SIZE;
int master = REGNO (cfun->machine->unisimt_master);
int pred = REGNO (cfun->machine->unisimt_predicate);
fprintf (file, "\t{\n");
fprintf (file, "\t\t.reg.u32 %%ustmp0;\n");
fprintf (file, "\t\t.reg.u%d %%ustmp1;\n", bits);
fprintf (file, "\t\t.reg.u%d %%ustmp2;\n", bits);
fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.y;\n");
fprintf (file, "\t\tmul%s.u32 %%ustmp1, %%ustmp0, 4;\n",
bits == 64 ? ".wide" : ".lo");
fprintf (file, "\t\tmov.u%d %%ustmp2, __nvptx_uni;\n", bits);
fprintf (file, "\t\tadd.u%d %%ustmp2, %%ustmp2, %%ustmp1;\n", bits);
fprintf (file, "\t\tld.shared.u32 %%r%d, [%%ustmp2];\n", master);
fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.x;\n");
/* Compute 'master lane index' as 'tid.x & __nvptx_uni[tid.y]'. */
fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master);
/* Compute predicate as 'tid.x == master'. */
fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, %%ustmp0;\n", pred, master);
fprintf (file, "\t}\n");
need_unisimt_decl = true;
}
/* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
void __attribute__((kernel)) NAME (void *arg, char *stack, size_t stacksize)
{
__nvptx_stacks[tid.y] = stack + stacksize * (ctaid.x * ntid.y + tid.y + 1);
__nvptx_uni[tid.y] = 0;
gomp_nvptx_main (ORIG, arg);
}
ORIG itself should not be emitted as a PTX .entry function. */
static void
write_omp_entry (FILE *file, const char *name, const char *orig)
{
static bool gomp_nvptx_main_declared;
if (!gomp_nvptx_main_declared)
{
gomp_nvptx_main_declared = true;
write_fn_marker (func_decls, false, true, "gomp_nvptx_main");
func_decls << ".extern .func gomp_nvptx_main (.param.u" << POINTER_SIZE
<< " %in_ar1, .param.u" << POINTER_SIZE << " %in_ar2);\n";
}
#define ENTRY_TEMPLATE(PS, PS_BYTES, MAD_PS_32) "\
(.param.u" PS " %arg, .param.u" PS " %stack, .param.u" PS " %sz)\n\
{\n\
.reg.u32 %r<3>;\n\
.reg.u" PS " %R<4>;\n\
mov.u32 %r0, %tid.y;\n\
mov.u32 %r1, %ntid.y;\n\
mov.u32 %r2, %ctaid.x;\n\
cvt.u" PS ".u32 %R1, %r0;\n\
" MAD_PS_32 " %R1, %r1, %r2, %R1;\n\
mov.u" PS " %R0, __nvptx_stacks;\n\
" MAD_PS_32 " %R0, %r0, " PS_BYTES ", %R0;\n\
ld.param.u" PS " %R2, [%stack];\n\
ld.param.u" PS " %R3, [%sz];\n\
add.u" PS " %R2, %R2, %R3;\n\
mad.lo.u" PS " %R2, %R1, %R3, %R2;\n\
st.shared.u" PS " [%R0], %R2;\n\
mov.u" PS " %R0, __nvptx_uni;\n\
" MAD_PS_32 " %R0, %r0, 4, %R0;\n\
mov.u32 %r0, 0;\n\
st.shared.u32 [%R0], %r0;\n\
mov.u" PS " %R0, \0;\n\
ld.param.u" PS " %R1, [%arg];\n\
{\n\
.param.u" PS " %P<2>;\n\
st.param.u" PS " [%P0], %R0;\n\
st.param.u" PS " [%P1], %R1;\n\
call.uni gomp_nvptx_main, (%P0, %P1);\n\
}\n\
ret.uni;\n\
}\n"
static const char entry64[] = ENTRY_TEMPLATE ("64", "8", "mad.wide.u32");
static const char entry32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32 ");
#undef ENTRY_TEMPLATE
const char *entry_1 = TARGET_ABI64 ? entry64 : entry32;
/* Position ENTRY_2 after the embedded nul using strlen of the prefix. */
const char *entry_2 = entry_1 + strlen (entry_1) + 1;
fprintf (file, ".visible .entry %s%s%s%s", name, entry_1, orig, entry_2);
need_softstack_decl = need_unisimt_decl = true;
}
/* Implement ASM_DECLARE_FUNCTION_NAME. Writes the start of a ptx
function, including local var decls and copies from the arguments to
local regs. */
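
The comment on write_omp_entry gives the initial soft stack pointer as `stack + stacksize * (ctaid.x * ntid.y + tid.y + 1)`. A small C model of that formula follows; `model_warp_stack_top` is a hypothetical helper written for this illustration, with the PTX special registers passed in as ordinary parameters.

```c
#include <stdint.h>

/* Return the initial soft stack pointer for one warp.  STACK is the base of
   the whole stack area and STACKSIZE the per-warp slice size, both in bytes.
   CTAID_X, NTID_Y and TID_Y stand in for %ctaid.x, %ntid.y and %tid.y.  */
static uint64_t
model_warp_stack_top (uint64_t stack, uint64_t stacksize,
                      uint32_t ctaid_x, uint32_t ntid_y, uint32_t tid_y)
{
  /* Linear index of this warp across all blocks.  */
  uint64_t warp_index = (uint64_t) ctaid_x * ntid_y + tid_y;

  /* The "+ 1" selects the high end of the slice, because the soft stack
     grows toward lower addresses.  */
  return stack + stacksize * (warp_index + 1);
}
```

With a base of 0x10000 and 0x1000-byte slices, warp 0 of block 0 starts at 0x11000, and warp 3 of block 2 (four warps per block) starts at 0x1C000.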
@ -997,6 +1173,14 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
tree result_type = TREE_TYPE (fntype);
int argno = 0;
if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
&& !lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
{
char *buf = (char *) alloca (strlen (name) + sizeof ("$impl"));
sprintf (buf, "%s$impl", name);
write_omp_entry (file, name, buf);
name = buf;
}
/* We construct the initial part of the function into a string
stream, in order to share the prototype writing code. */
std::stringstream s;
@ -1034,19 +1218,24 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
fprintf (file, "%s", s.str().c_str());
/* Declare a local var for outgoing varargs. */
if (cfun->machine->has_varadic)
init_frame (file, STACK_POINTER_REGNUM,
UNITS_PER_WORD, crtl->outgoing_args_size);
/* Declare a local variable for the frame. Force its size to be
DImode-compatible. */
HOST_WIDE_INT sz = get_frame_size ();
if (sz || cfun->machine->has_chain)
init_frame (file, FRAME_POINTER_REGNUM,
crtl->stack_alignment_needed / BITS_PER_UNIT,
(sz + GET_MODE_SIZE (DImode) - 1)
& ~(HOST_WIDE_INT)(GET_MODE_SIZE (DImode) - 1));
bool need_frameptr = sz || cfun->machine->has_chain;
int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT;
if (!TARGET_SOFT_STACK)
{
/* Declare a local var for outgoing varargs. */
if (cfun->machine->has_varadic)
init_frame (file, STACK_POINTER_REGNUM,
UNITS_PER_WORD, crtl->outgoing_args_size);
/* Declare a local variable for the frame. Force its size to be
DImode-compatible. */
if (need_frameptr)
init_frame (file, FRAME_POINTER_REGNUM, alignment,
ROUND_UP (sz, GET_MODE_SIZE (DImode)));
}
else if (need_frameptr || cfun->machine->has_varadic || cfun->calls_alloca)
init_softstack_frame (file, alignment, sz);
/* Declare the pseudos we have as ptx registers. */
int maxregs = max_reg_num ();
@ -1072,8 +1261,25 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
if (cfun->machine->axis_predicate[1])
nvptx_init_axis_predicate (file,
REGNO (cfun->machine->axis_predicate[1]), "x");
if (cfun->machine->unisimt_predicate)
nvptx_init_unisimt_predicate (file);
}
/* Output instruction that sets soft stack pointer in shared memory to the
value in register given by SRC_REGNO. */
const char *
nvptx_output_set_softstack (unsigned src_regno)
{
if (cfun->machine->has_softstack && !crtl->is_leaf)
{
fprintf (asm_out_file, "\tst.shared.u%d\t[%s], ",
POINTER_SIZE, reg_names[SOFTSTACK_SLOT_REGNUM]);
output_reg (asm_out_file, src_regno, VOIDmode);
fprintf (asm_out_file, ";\n");
}
return "";
}
/* Output a return instruction. Also copy the return value to its outgoing
location. */
@ -1113,6 +1319,8 @@ nvptx_function_ok_for_sibcall (tree, tree)
static rtx
nvptx_get_drap_rtx (void)
{
if (TARGET_SOFT_STACK && stack_realign_drap)
return arg_pointer_rtx;
return NULL_RTX;
}
@ -1311,7 +1519,7 @@ nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
/* Generate an instruction or sequence to broadcast register REG
across the vectors of a single warp. */
static rtx
rtx
nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, nvptx_shuffle_kind kind)
{
rtx res;
@ -1833,6 +2041,8 @@ nvptx_output_mov_insn (rtx dst, rtx src)
return "%.\tcvt%t0%t1\t%0, %1;";
}
static void nvptx_print_operand (FILE *, rtx, int);
/* Output INSN, which is a call to CALLEE with result RESULT. For ptx, this
involves writing .param declarations and in/out copies into them. For
indirect calls, also write the .callprototype. */
@ -1844,6 +2054,8 @@ nvptx_output_call_insn (rtx_insn *insn, rtx result, rtx callee)
static int labelno;
bool needs_tgt = register_operand (callee, Pmode);
rtx pat = PATTERN (insn);
if (GET_CODE (pat) == COND_EXEC)
pat = COND_EXEC_CODE (pat);
int arg_end = XVECLEN (pat, 0);
tree decl = NULL_TREE;
@ -1888,6 +2100,8 @@ nvptx_output_call_insn (rtx_insn *insn, rtx result, rtx callee)
fprintf (asm_out_file, ";\n");
}
/* The '.' stands for the call's predicate, if any. */
nvptx_print_operand (asm_out_file, NULL_RTX, '.');
fprintf (asm_out_file, "\t\tcall ");
if (result != NULL_RTX)
fprintf (asm_out_file, "(%s_in), ", reg_names[NVPTX_RETURN_REGNUM]);
@ -1951,8 +2165,6 @@ nvptx_print_operand_punct_valid_p (unsigned char c)
return c == '.' || c== '#';
}
static void nvptx_print_operand (FILE *, rtx, int);
/* Subroutine of nvptx_print_operand; used to print a memory reference X to FILE. */
static void
@ -2013,12 +2225,10 @@ nvptx_print_operand (FILE *file, rtx x, int code)
x = current_insn_predicate;
if (x)
{
unsigned int regno = REGNO (XEXP (x, 0));
fputs ("[", file);
fputs ("@", file);
if (GET_CODE (x) == EQ)
fputs ("!", file);
fputs (reg_names [regno], file);
fputs ("]", file);
output_reg (file, REGNO (XEXP (x, 0)), VOIDmode);
}
return;
}
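
After this fix the '.' operand prints a PTX guard of the form `@reg` or `@!reg` instead of the bracketed form. The sketch below only illustrates the resulting text; `format_insn_predicate` is a hypothetical helper, and printing the register as `%r<N>` is an assumption about how a pseudo register is named, whereas the real code delegates to `output_reg`.

```c
#include <stdbool.h>
#include <stdio.h>

/* Write the guard for register number REGNO into BUF.  NEGATE corresponds
   to the EQ case in nvptx_print_operand, which emits '!'.  Returns the
   value of snprintf.  */
static int
format_insn_predicate (char *buf, size_t len, unsigned regno, bool negate)
{
  return snprintf (buf, len, "@%s%%r%u", negate ? "!" : "", regno);
}
```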
@ -2313,6 +2523,89 @@ nvptx_reorg_subreg (void)
}
}
/* Return a SImode "master lane index" register for uniform-simt, allocating on
first use. */
static rtx
nvptx_get_unisimt_master ()
{
rtx &master = cfun->machine->unisimt_master;
return master ? master : master = gen_reg_rtx (SImode);
}
/* Return a BImode "predicate" register for uniform-simt, similar to above. */
static rtx
nvptx_get_unisimt_predicate ()
{
rtx &pred = cfun->machine->unisimt_predicate;
return pred ? pred : pred = gen_reg_rtx (BImode);
}
/* Return true if given call insn references one of the functions provided by
the CUDA runtime: malloc, free, vprintf. */
static bool
nvptx_call_insn_is_syscall_p (rtx_insn *insn)
{
rtx pat = PATTERN (insn);
gcc_checking_assert (GET_CODE (pat) == PARALLEL);
pat = XVECEXP (pat, 0, 0);
if (GET_CODE (pat) == SET)
pat = SET_SRC (pat);
gcc_checking_assert (GET_CODE (pat) == CALL
&& GET_CODE (XEXP (pat, 0)) == MEM);
rtx addr = XEXP (XEXP (pat, 0), 0);
if (GET_CODE (addr) != SYMBOL_REF)
return false;
const char *name = XSTR (addr, 0);
/* Ordinary malloc/free are redirected to __nvptx_{malloc,free}, so only the
references with forced assembler name refer to PTX syscalls. For vprintf,
accept both normal and forced-assembler-name references. */
return (!strcmp (name, "vprintf") || !strcmp (name, "*vprintf")
|| !strcmp (name, "*malloc")
|| !strcmp (name, "*free"));
}
/* If SET subexpression of INSN sets a register, emit a shuffle instruction to
propagate its value from lane MASTER to current lane. */
static void
nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
{
rtx reg;
if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
}
/* Adjust code for uniform-simt code generation variant by making atomics and
"syscalls" conditionally executed, and inserting shuffle-based propagation
for registers being set. */
static void
nvptx_reorg_uniform_simt ()
{
rtx_insn *insn, *next;
for (insn = get_insns (); insn; insn = next)
{
next = NEXT_INSN (insn);
if (!(CALL_P (insn) && nvptx_call_insn_is_syscall_p (insn))
&& !(NONJUMP_INSN_P (insn)
&& GET_CODE (PATTERN (insn)) == PARALLEL
&& get_attr_atomic (insn)))
continue;
rtx pat = PATTERN (insn);
rtx master = nvptx_get_unisimt_master ();
for (int i = 0; i < XVECLEN (pat, 0); i++)
nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
rtx pred = nvptx_get_unisimt_predicate ();
pred = gen_rtx_NE (BImode, pred, const0_rtx);
pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
validate_change (insn, &PATTERN (insn), pat, false);
}
}
/* Loop structure of the function. The entire function is described as
a NULL loop. */
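
The effect of the uniform-simt transformation on one atomic can be simulated on the host. This is a rough model, not the backend's code: `model_unisimt_fetch_add` is a hypothetical function, the loop serializes the lanes of a single warp, and `uni` stands in for the value of `__nvptx_uni[tid.y]`. The master lane is computed as `tid.x & uni` and the predicate as `tid.x == master`, as in nvptx_init_unisimt_predicate; the final copy plays the role of the SHUFFLE_IDX instruction inserted after the atomic.

```c
#include <stdint.h>

#define WARP_SIZE 32

/* Simulate 'result = atomic_fetch_add (counter, 1)' for one warp.  */
static void
model_unisimt_fetch_add (uint32_t uni, uint32_t *counter,
                         uint32_t result[WARP_SIZE])
{
  uint32_t master[WARP_SIZE];
  uint32_t reg[WARP_SIZE] = { 0 };

  for (uint32_t lane = 0; lane < WARP_SIZE; lane++)
    {
      master[lane] = lane & uni;

      /* Predicated execution: only lanes equal to their master run the
         atomic operation.  */
      if (lane == master[lane])
        {
          reg[lane] = *counter;
          *counter += 1;
        }
    }

  /* Indexed shuffle: every lane reads the value held by its master lane.  */
  for (uint32_t lane = 0; lane < WARP_SIZE; lane++)
    result[lane] = reg[master[lane]];
}
```

With `uni == 0` (the value stored by the OpenMP entry code) lane 0 is the master of every lane, so the counter advances once and all 32 lanes observe the same old value. With all bits set, each lane is its own master and the atomic executes 32 times.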
@ -3829,6 +4122,9 @@ nvptx_reorg (void)
/* Replace subregs. */
nvptx_reorg_subreg ();
if (TARGET_UNIFORM_SIMT)
nvptx_reorg_uniform_simt ();
regstat_free_n_sets_and_refs ();
df_finish_pass (true);
@ -3857,12 +4153,36 @@ nvptx_handle_kernel_attribute (tree *node, tree name, tree ARG_UNUSED (args),
return NULL_TREE;
}
/* Handle a "shared" attribute; arguments as in
struct attribute_spec.handler. */
static tree
nvptx_handle_shared_attribute (tree *node, tree name, tree ARG_UNUSED (args),
int ARG_UNUSED (flags), bool *no_add_attrs)
{
tree decl = *node;
if (TREE_CODE (decl) != VAR_DECL)
{
error ("%qE attribute only applies to variables", name);
*no_add_attrs = true;
}
else if (!(TREE_PUBLIC (decl) || TREE_STATIC (decl)))
{
error ("%qE attribute not allowed with auto storage class", name);
*no_add_attrs = true;
}
return NULL_TREE;
}
/* Table of valid machine attributes. */
static const struct attribute_spec nvptx_attribute_table[] =
{
/* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
affects_type_identity } */
{ "kernel", 0, 0, true, false, false, nvptx_handle_kernel_attribute, false },
{ "shared", 0, 0, true, false, false, nvptx_handle_shared_attribute, false },
{ NULL, 0, 0, false, false, false, NULL, false }
};
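
A usage sketch for the new "shared" variable attribute follows, based on the checks in nvptx_handle_shared_attribute and nvptx_encode_section_info. `NVPTX_SHARED`, `scratch`, `store_scratch` and `load_scratch` are hypothetical names; the macro expands to nothing outside nvptx so the sketch also builds on a host, where the variable is ordinary static storage.

```c
/* Apply the attribute only when compiling for nvptx.  */
#ifdef __nvptx__
#define NVPTX_SHARED __attribute__ ((shared))
#else
#define NVPTX_SHARED
#endif

/* Accepted: a variable with static storage and no initializer.  */
static int scratch[32] NVPTX_SHARED;

/* Rejected by the checks in this commit (left as comments on purpose):
     static int bad NVPTX_SHARED = 1;       static initialization
     void f (void) { int x NVPTX_SHARED; }  auto storage class
   and the attribute applies to variables only, not functions.  */

static void
store_scratch (unsigned idx, int value)
{
  scratch[idx % 32] = value;
}

static int
load_scratch (unsigned idx)
{
  return scratch[idx % 32];
}
```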
@ -3924,13 +4244,13 @@ nvptx_record_offload_symbol (tree decl)
case FUNCTION_DECL:
{
tree attr = get_oacc_fn_attrib (decl);
tree dims = TREE_VALUE (attr);
unsigned ix;
/* OpenMP offloading does not set this attribute. */
tree dims = attr ? TREE_VALUE (attr) : NULL_TREE;
fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
for (; dims; dims = TREE_CHAIN (dims))
{
int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
@ -3991,6 +4311,21 @@ nvptx_file_end (void)
if (worker_red_size)
write_worker_buffer (asm_out_file, worker_red_sym,
worker_red_align, worker_red_size);
if (need_softstack_decl)
{
write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
/* 32 is the maximum number of warps in a block. Even though it's an
external declaration, emit the array size explicitly; otherwise, it
may fail at PTX JIT time if the definition is later in link order. */
fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n",
POINTER_SIZE);
}
if (need_unisimt_decl)
{
write_var_marker (asm_out_file, false, true, "__nvptx_uni");
fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n");
}
}
/* Expander for the shuffle builtins. */
@ -4176,6 +4511,14 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
#define PTX_WORKER_LENGTH 32
#define PTX_GANG_DEFAULT 0 /* Defer to runtime. */
/* Implement TARGET_SIMT_VF target hook: number of threads in a warp. */
static int
nvptx_simt_vf ()
{
return PTX_VECTOR_LENGTH;
}
/* Validate compute dimensions of an OpenACC offload or routine, fill
in non-unity defaults. FN_LEVEL indicates the level at which a
routine might spawn a loop. It is negative for non-routines. If
@ -4944,6 +5287,9 @@ nvptx_goacc_reduction (gcall *call)
#undef TARGET_BUILTIN_DECL
#define TARGET_BUILTIN_DECL nvptx_builtin_decl
#undef TARGET_SIMT_VF
#define TARGET_SIMT_VF nvptx_simt_vf
#undef TARGET_GOACC_VALIDATE_DIMS
#define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims


@ -31,6 +31,10 @@
builtin_assert ("machine=nvptx"); \
builtin_assert ("cpu=nvptx"); \
builtin_define ("__nvptx__"); \
if (TARGET_SOFT_STACK) \
builtin_define ("__nvptx_softstack__"); \
if (TARGET_UNIFORM_SIMT) \
builtin_define ("__nvptx_unisimt__"); \
} while (0)
/* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@ -79,13 +83,14 @@
#define POINTER_SIZE (TARGET_ABI64 ? 64 : 32)
#define Pmode (TARGET_ABI64 ? DImode : SImode)
#define STACK_SIZE_MODE Pmode
/* Registers. Since ptx is a virtual target, we just define a few
hard registers for special purposes and leave pseudos unallocated.
We have to have some available hard registers, to keep gcc setup
happy. */
#define FIRST_PSEUDO_REGISTER 16
#define FIXED_REGISTERS { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
#define FIXED_REGISTERS { 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
#define CALL_USED_REGISTERS { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }
#define HARD_REGNO_NREGS(REG, MODE) \
@ -133,10 +138,17 @@ enum reg_class { NO_REGS, ALL_REGS, LIM_REG_CLASSES };
#define FRAME_POINTER_REGNUM 2
#define ARG_POINTER_REGNUM 3
#define STATIC_CHAIN_REGNUM 4
/* This register points to the shared memory location with the current warp's
soft stack pointer (__nvptx_stacks[tid.y]). */
#define SOFTSTACK_SLOT_REGNUM 5
/* This register is used to save the previous value of the soft stack pointer
in the prologue and restore it when returning. */
#define SOFTSTACK_PREV_REGNUM 6
#define REGISTER_NAMES \
{ \
"%value", "%stack", "%frame", "%args", "%chain", "%hr5", "%hr6", "%hr7", \
"%value", "%stack", "%frame", "%args", \
"%chain", "%sspslot", "%sspprev", "%hr7", \
"%hr8", "%hr9", "%hr10", "%hr11", "%hr12", "%hr13", "%hr14", "%hr15" \
}
@ -200,10 +212,13 @@ struct GTY(()) machine_function
bool is_varadic; /* This call is variadic. */
bool has_varadic; /* Current function has a variadic call. */
bool has_chain; /* Current function has outgoing static chain. */
bool has_softstack; /* Current function has a soft stack frame. */
int num_args; /* Number of args of current call. */
int return_mode; /* Return mode of current fn.
(machine_mode not defined yet.) */
rtx axis_predicate[2]; /* Neutering predicates. */
rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */
rtx unisimt_predicate; /* Predicate for -muniform-simt. */
};
#endif

@ -36,10 +36,16 @@
UNSPEC_ALLOCA
UNSPEC_SET_SOFTSTACK
UNSPEC_DIM_SIZE
UNSPEC_BIT_CONV
UNSPEC_VOTE_BALLOT
UNSPEC_LANEID
UNSPEC_SHUFFLE
UNSPEC_BR_UNIFIED
])
@ -55,11 +61,16 @@
UNSPECV_FORKED
UNSPECV_JOINING
UNSPECV_JOIN
UNSPECV_NOUNROLL
])
(define_attr "subregs_ok" "false,true"
(const_string "false"))
(define_attr "atomic" "false,true"
(const_string "false"))
;; The nvptx operand predicates, in general, don't permit subregs and
;; only literal constants, which differ from the generic ones, which
;; permit subregs and symbolic constants (as appropriate)
@ -124,6 +135,17 @@
return true;
})
(define_attr "predicable" "false,true"
(const_string "true"))
(define_cond_exec
[(match_operator 0 "predicate_operator"
[(match_operand:BI 1 "nvptx_register_operand" "")
(match_operand:BI 2 "const0_operand" "")])]
""
""
)
(define_constraint "P0"
"An integer with the value 0."
(and (match_code "const_int")
@ -509,7 +531,8 @@
(label_ref (match_operand 1 "" ""))
(pc)))]
""
"%j0\\tbra\\t%l1;")
"%j0\\tbra\\t%l1;"
[(set_attr "predicable" "false")])
(define_insn "br_false"
[(set (pc)
@ -518,7 +541,8 @@
(label_ref (match_operand 1 "" ""))
(pc)))]
""
"%J0\\tbra\\t%l1;")
"%J0\\tbra\\t%l1;"
[(set_attr "predicable" "false")])
;; unified conditional branch
(define_insn "br_true_uni"
@ -527,7 +551,8 @@
UNSPEC_BR_UNIFIED) (const_int 0))
(label_ref (match_operand 1 "" "")) (pc)))]
""
"%j0\\tbra.uni\\t%l1;")
"%j0\\tbra.uni\\t%l1;"
[(set_attr "predicable" "false")])
(define_insn "br_false_uni"
[(set (pc) (if_then_else
@ -535,7 +560,8 @@
UNSPEC_BR_UNIFIED) (const_int 0))
(label_ref (match_operand 1 "" "")) (pc)))]
""
"%J0\\tbra.uni\\t%l1;")
"%J0\\tbra.uni\\t%l1;"
[(set_attr "predicable" "false")])
(define_expand "cbranch<mode>4"
[(set (pc)
@ -938,12 +964,16 @@
""
{
return nvptx_output_return ();
})
}
[(set_attr "predicable" "false")])
(define_expand "epilogue"
[(clobber (const_int 0))]
""
{
if (TARGET_SOFT_STACK)
emit_insn (gen_set_softstack_insn (gen_rtx_REG (Pmode,
SOFTSTACK_PREV_REGNUM)));
emit_jump_insn (gen_return ());
DONE;
})
@ -972,31 +1002,40 @@
(match_operand 1 "nvptx_register_operand")]
""
{
if (TARGET_SOFT_STACK)
{
emit_move_insn (stack_pointer_rtx,
gen_rtx_MINUS (Pmode, stack_pointer_rtx, operands[1]));
emit_insn (gen_set_softstack_insn (stack_pointer_rtx));
emit_move_insn (operands[0], virtual_stack_dynamic_rtx);
DONE;
}
/* The ptx documentation specifies an alloca intrinsic (for 32 bit
only) but notes it is not implemented. The assembler emits a
confused error message. Issue a blunt one now instead. */
sorry ("target cannot support alloca.");
emit_insn (gen_nop ());
DONE;
if (TARGET_ABI64)
emit_insn (gen_allocate_stack_di (operands[0], operands[1]));
else
emit_insn (gen_allocate_stack_si (operands[0], operands[1]));
DONE;
})
(define_insn "allocate_stack_<mode>"
[(set (match_operand:P 0 "nvptx_register_operand" "=R")
(unspec:P [(match_operand:P 1 "nvptx_register_operand" "R")]
UNSPEC_ALLOCA))]
""
"%.\\tcall (%0), %%alloca, (%1);")
(define_insn "set_softstack_insn"
[(unspec [(match_operand 0 "nvptx_register_operand" "R")]
UNSPEC_SET_SOFTSTACK)]
"TARGET_SOFT_STACK"
{
return nvptx_output_set_softstack (REGNO (operands[0]));
})
(define_expand "restore_stack_block"
[(match_operand 0 "register_operand" "")
(match_operand 1 "register_operand" "")]
""
{
if (TARGET_SOFT_STACK)
{
emit_move_insn (operands[0], operands[1]);
emit_insn (gen_set_softstack_insn (operands[0]));
}
DONE;
})
@ -1018,14 +1057,16 @@
(const_int 0))
(const_int 0))]
""
"%j0 trap;")
"%j0 trap;"
[(set_attr "predicable" "false")])
(define_insn "trap_if_false"
[(trap_if (eq (match_operand:BI 0 "nvptx_register_operand" "R")
(const_int 0))
(const_int 0))]
""
"%J0 trap;")
"%J0 trap;"
[(set_attr "predicable" "false")])
(define_expand "ctrap<mode>4"
[(trap_if (match_operator 0 "nvptx_comparison_operator"
@ -1074,28 +1115,28 @@
UNSPECV_FORK)]
""
"// fork %0;"
)
[(set_attr "predicable" "false")])
(define_insn "nvptx_forked"
[(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
UNSPECV_FORKED)]
""
"// forked %0;"
)
[(set_attr "predicable" "false")])
(define_insn "nvptx_joining"
[(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
UNSPECV_JOINING)]
""
"// joining %0;"
)
[(set_attr "predicable" "false")])
(define_insn "nvptx_join"
[(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
UNSPECV_JOIN)]
""
"// join %0;"
)
[(set_attr "predicable" "false")])
(define_expand "oacc_fork"
[(set (match_operand:SI 0 "nvptx_nonmemory_operand" "")
@ -1134,6 +1175,88 @@
""
"%.\\tshfl%S3.b32\\t%0, %1, %2, 31;")
(define_insn "nvptx_vote_ballot"
[(set (match_operand:SI 0 "nvptx_register_operand" "=R")
(unspec:SI [(match_operand:BI 1 "nvptx_register_operand" "R")]
UNSPEC_VOTE_BALLOT))]
""
"%.\\tvote.ballot.b32\\t%0, %1;")
;; Patterns for OpenMP SIMD-via-SIMT lowering
;; Implement IFN_GOMP_SIMT_LANE: set operand 0 to lane index
(define_insn "omp_simt_lane"
[(set (match_operand:SI 0 "nvptx_register_operand" "")
(unspec:SI [(const_int 0)] UNSPEC_LANEID))]
""
"%.\\tmov.u32\\t%0, %%laneid;")
;; Implement IFN_GOMP_SIMT_ORDERED: copy operand 1 to operand 0 and
;; place a compiler barrier to disallow unrolling/peeling the containing loop
(define_expand "omp_simt_ordered"
[(match_operand:SI 0 "nvptx_register_operand" "=R")
(match_operand:SI 1 "nvptx_register_operand" "R")]
""
{
emit_move_insn (operands[0], operands[1]);
emit_insn (gen_nvptx_nounroll ());
DONE;
})
;; Implement IFN_GOMP_SIMT_XCHG_BFLY: perform a "butterfly" exchange
;; across lanes
(define_expand "omp_simt_xchg_bfly"
[(match_operand 0 "nvptx_register_operand" "=R")
(match_operand 1 "nvptx_register_operand" "R")
(match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")]
""
{
emit_insn (nvptx_gen_shuffle (operands[0], operands[1], operands[2],
SHUFFLE_BFLY));
DONE;
})
;; Implement IFN_GOMP_SIMT_XCHG_IDX: broadcast value in operand 1
;; from lane given by index in operand 2 to operand 0 in all lanes
(define_expand "omp_simt_xchg_idx"
[(match_operand 0 "nvptx_register_operand" "=R")
(match_operand 1 "nvptx_register_operand" "R")
(match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")]
""
{
emit_insn (nvptx_gen_shuffle (operands[0], operands[1], operands[2],
SHUFFLE_IDX));
DONE;
})
;; Implement IFN_GOMP_SIMT_VOTE_ANY:
;; set operand 0 to zero iff all lanes supply zero in operand 1
(define_expand "omp_simt_vote_any"
[(match_operand:SI 0 "nvptx_register_operand" "=R")
(match_operand:SI 1 "nvptx_register_operand" "R")]
""
{
rtx pred = gen_reg_rtx (BImode);
emit_move_insn (pred, gen_rtx_NE (BImode, operands[1], const0_rtx));
emit_insn (gen_nvptx_vote_ballot (operands[0], pred));
DONE;
})
;; Implement IFN_GOMP_SIMT_LAST_LANE:
;; set operand 0 to the lowest lane index that passed non-zero in operand 1
(define_expand "omp_simt_last_lane"
[(match_operand:SI 0 "nvptx_register_operand" "=R")
(match_operand:SI 1 "nvptx_register_operand" "R")]
""
{
rtx pred = gen_reg_rtx (BImode);
rtx tmp = gen_reg_rtx (SImode);
emit_move_insn (pred, gen_rtx_NE (BImode, operands[1], const0_rtx));
emit_insn (gen_nvptx_vote_ballot (tmp, pred));
emit_insn (gen_ctzsi2 (operands[0], tmp));
DONE;
})
;; extract parts of a 64 bit object into 2 32-bit ints
(define_insn "unpack<mode>si2"
[(set (match_operand:SI 0 "nvptx_register_operand" "=R")
@ -1186,7 +1309,8 @@
(set (match_dup 1)
(unspec_volatile:SDIM [(const_int 0)] UNSPECV_CAS))]
""
"%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;")
"%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;"
[(set_attr "atomic" "true")])
(define_insn "atomic_exchange<mode>"
[(set (match_operand:SDIM 0 "nvptx_register_operand" "=R") ;; output
@ -1197,7 +1321,8 @@
(set (match_dup 1)
(match_operand:SDIM 2 "nvptx_nonmemory_operand" "Ri"))] ;; input
""
"%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;")
"%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;"
[(set_attr "atomic" "true")])
(define_insn "atomic_fetch_add<mode>"
[(set (match_operand:SDIM 1 "memory_operand" "+m")
@ -1209,7 +1334,8 @@
(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
(match_dup 1))]
""
"%.\\tatom%A1.add%t0\\t%0, %1, %2;")
"%.\\tatom%A1.add%t0\\t%0, %1, %2;"
[(set_attr "atomic" "true")])
(define_insn "atomic_fetch_addsf"
[(set (match_operand:SF 1 "memory_operand" "+m")
@ -1221,7 +1347,8 @@
(set (match_operand:SF 0 "nvptx_register_operand" "=R")
(match_dup 1))]
""
"%.\\tatom%A1.add%t0\\t%0, %1, %2;")
"%.\\tatom%A1.add%t0\\t%0, %1, %2;"
[(set_attr "atomic" "true")])
(define_code_iterator any_logic [and ior xor])
(define_code_attr logic [(and "and") (ior "or") (xor "xor")])
@ -1237,10 +1364,18 @@
(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
(match_dup 1))]
"0"
"%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
"%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;"
[(set_attr "atomic" "true")])
(define_insn "nvptx_barsync"
[(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
UNSPECV_BARSYNC)]
""
"\\tbar.sync\\t%0;")
"\\tbar.sync\\t%0;"
[(set_attr "predicable" "false")])
(define_insn "nvptx_nounroll"
[(unspec_volatile [(const_int 0)] UNSPECV_NOUNROLL)]
""
"\\t.pragma \\\"nounroll\\\";"
[(set_attr "predicable" "false")])
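
The omp_simt_xchg_bfly expander above maps onto a butterfly shuffle across the lanes of a warp. The following is a host-side sketch only, not code from the patch: it models a warp as a 32-element array and shows how repeated butterfly exchanges leave a sum in every lane. The names shuffle_bfly and warp_sum are hypothetical, and a warp size of 32 is assumed.

```c
#include <assert.h>

#define WARP_SIZE 32  /* assumed warp width */

/* Host model of one butterfly shuffle: lane i reads the value held by
   lane (i ^ mask).  Hypothetical helper, not part of the patch.  */
static void shuffle_bfly (const int in[WARP_SIZE], int out[WARP_SIZE], int mask)
{
  assert (mask > 0 && mask < WARP_SIZE);
  for (int lane = 0; lane < WARP_SIZE; lane++)
    out[lane] = in[lane ^ mask];
}

/* Sum reduction built from butterfly exchanges with masks 16, 8, 4, 2, 1.
   After the loop every lane holds the total of all 32 inputs.  */
static void warp_sum (int v[WARP_SIZE])
{
  int peer[WARP_SIZE];
  for (int mask = WARP_SIZE / 2; mask > 0; mask /= 2)
    {
      shuffle_bfly (v, peer, mask);
      for (int lane = 0; lane < WARP_SIZE; lane++)
        v[lane] += peer[lane];
    }
}
```

Because each step pairs lane i with lane i ^ mask, no lane needs to know which lane is the "root"; this is why a butterfly pattern suits reductions whose result is needed in all lanes.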

@ -32,3 +32,15 @@ Link in code for a __main kernel.
moptimize
Target Report Var(nvptx_optimize) Init(-1)
Optimize partition neutering.
msoft-stack
Target Report Mask(SOFT_STACK)
Use custom stacks instead of local memory for automatic storage.
muniform-simt
Target Report Mask(UNIFORM_SIMT)
Generate code that can keep local state uniform across all lanes.
mgomp
Target Report Mask(GOMP)
Generate code for OpenMP offloading: enables -msoft-stack and -muniform-simt.

@ -8,3 +8,5 @@ ALL_HOST_OBJS += mkoffload.o
mkoffload$(exeext): mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBDEPS)
+$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ \
mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBS)
MULTILIB_OPTIONS = mgomp

@ -5576,6 +5576,7 @@ attributes.
* MeP Variable Attributes::
* Microsoft Windows Variable Attributes::
* MSP430 Variable Attributes::
* Nvidia PTX Variable Attributes::
* PowerPC Variable Attributes::
* RL78 Variable Attributes::
* SPU Variable Attributes::
@ -6257,6 +6258,20 @@ same name (@pxref{MSP430 Function Attributes}).
These attributes can be applied to both functions and variables.
@end table
@node Nvidia PTX Variable Attributes
@subsection Nvidia PTX Variable Attributes
These variable attributes are supported by the Nvidia PTX back end:
@table @code
@item shared
@cindex @code{shared} attribute, Nvidia PTX
Use this attribute to place a variable in the @code{.shared} memory space.
This memory space is private to each cooperative thread array; only threads
within one thread block refer to the same instance of the variable.
The runtime does not initialize variables in this memory space.
@end table
@node PowerPC Variable Attributes
@subsection PowerPC Variable Attributes
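
As a rough illustration of the shared attribute documented above, the sketch below declares a variable without an initializer and sets it explicitly before use, since the runtime does not initialize .shared storage. The macro NVPTX_SHARED and the function names are hypothetical; the __nvptx__ guard is only there so that the same file also builds with a host compiler.

```c
/* Hypothetical helper macro: expands to the attribute only when compiling
   for nvptx, so this sketch also builds on a host compiler.  */
#ifdef __nvptx__
# define NVPTX_SHARED __attribute__((shared))
#else
# define NVPTX_SHARED
#endif

/* No initializer here: the variable must be set up at run time.  */
static int block_counter NVPTX_SHARED;

/* Must run before the first read, because .shared storage starts out
   with unspecified contents on the device.  */
void block_counter_init (void)
{
  block_counter = 0;
}

/* Plain (non-atomic) increment; for illustration only.  */
int block_counter_bump (void)
{
  return ++block_counter;
}
```

On the device, one thread per thread block would typically perform the initialization, followed by a barrier before other threads read the variable; that synchronization is outside the scope of this sketch.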

@ -20570,6 +20570,37 @@ offloading execution.
Apply partitioned execution optimizations. This is the default when any
level of optimization is selected.
@item -msoft-stack
@opindex msoft-stack
Generate code that does not use @code{.local} memory
directly for stack storage. Instead, a per-warp stack pointer is
maintained explicitly. This enables variable-length stack allocation (with
variable-length arrays or @code{alloca}), and when global memory is used for
underlying storage, makes it possible to access automatic variables from other
threads, or with atomic instructions. This code generation variant is used
for OpenMP offloading, but the option is exposed on its own for the purpose
of testing the compiler; to generate code suitable for linking into programs
using OpenMP offloading, use option @option{-mgomp}.
@item -muniform-simt
@opindex muniform-simt
Switch to a code generation variant that allows all threads in each warp to
execute, while maintaining memory state and side effects as if only one thread
in each warp were active outside of OpenMP SIMD regions. All atomic operations
and calls to the runtime (malloc, free, vprintf) are conditionally executed (iff
the current lane index equals the master lane index), and the register being
assigned is copied via a shuffle instruction from the master lane. Outside of
SIMD regions lane 0 is the master; inside, each thread sees itself as the
master. The shared memory array @code{int __nvptx_uni[]} stores all-zeros or
all-ones bitmasks for each warp, indicating the current mode (0 outside of SIMD
regions). Each thread can bitwise-and the bitmask at position @code{tid.y}
with the current lane index to compute the master lane index.
@item -mgomp
@opindex mgomp
Generate code for use in OpenMP offloading: enables @option{-msoft-stack} and
@option{-muniform-simt} options, and selects corresponding multilib variant.
@end table
@node PDP-11 Options
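
The master-lane rule in the -muniform-simt description can be checked with a small host-side model. This is a sketch under stated assumptions, not compiler output: uni_mask stands in for the warp's entry in __nvptx_uni, and the function names master_lane and lane_executes are hypothetical.

```c
#include <assert.h>

#define WARP_SIZE 32u  /* assumed warp width */

/* uni_mask models one entry of __nvptx_uni: 0 outside SIMD regions,
   all-ones inside.  The master lane is the bitwise AND of that mask
   with the current lane index.  */
static unsigned master_lane (unsigned uni_mask, unsigned lane)
{
  assert (lane < WARP_SIZE);
  return lane & uni_mask;
}

/* A lane performs an atomic operation or runtime call only when it is
   its own master; other lanes receive the result from the master.  */
static int lane_executes (unsigned uni_mask, unsigned lane)
{
  return master_lane (uni_mask, lane) == lane;
}
```

With a mask of 0 every lane computes master 0, so only lane 0 performs side effects; with an all-ones mask each lane is its own master, which matches the behavior described for SIMD regions.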

@ -5862,6 +5862,10 @@ usable. In that case, the smaller the number is, the more desirable it is
to use it.
@end deftypefn
@deftypefn {Target Hook} int TARGET_SIMT_VF (void)
Return number of threads in SIMT thread group on the target.
@end deftypefn
@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level})
This hook should check the launch dimensions provided for an OpenACC
compute region, or routine. Defaulted values are represented as -1

@ -4295,6 +4295,8 @@ address; but often a machine-dependent strategy can generate better code.
@hook TARGET_SIMD_CLONE_USABLE
@hook TARGET_SIMT_VF
@hook TARGET_GOACC_VALIDATE_DIMS
@hook TARGET_GOACC_DIM_LIMIT

@ -68,6 +68,12 @@ DEF_TARGET_INSN (oacc_dim_pos, (rtx x0, rtx x1))
DEF_TARGET_INSN (oacc_dim_size, (rtx x0, rtx x1))
DEF_TARGET_INSN (oacc_fork, (rtx x0, rtx x1, rtx x2))
DEF_TARGET_INSN (oacc_join, (rtx x0, rtx x1, rtx x2))
DEF_TARGET_INSN (omp_simt_lane, (rtx x0))
DEF_TARGET_INSN (omp_simt_last_lane, (rtx x0, rtx x1))
DEF_TARGET_INSN (omp_simt_ordered, (rtx x0, rtx x1))
DEF_TARGET_INSN (omp_simt_vote_any, (rtx x0, rtx x1))
DEF_TARGET_INSN (omp_simt_xchg_bfly, (rtx x0, rtx x1, rtx x2))
DEF_TARGET_INSN (omp_simt_xchg_idx, (rtx x0, rtx x1, rtx x2))
DEF_TARGET_INSN (prefetch, (rtx x0, rtx x1, rtx x2))
DEF_TARGET_INSN (probe_stack, (rtx x0))
DEF_TARGET_INSN (probe_stack_address, (rtx x0))
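
The omp_simt_vote_any and omp_simt_last_lane patterns registered above are both built on a warp-wide ballot in the nvptx expanders. The host-side sketch below models that behavior for one warp represented as an array; the function names are hypothetical, a warp size of 32 is assumed, and the all-zero input case for simt_last_lane is deliberately left outside the model.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32  /* assumed warp width */

/* Model of vote.ballot: bit i of the result is set iff lane i supplied
   a non-zero condition.  */
static uint32_t ballot (const int cond[WARP_SIZE])
{
  uint32_t bits = 0;
  for (int lane = 0; lane < WARP_SIZE; lane++)
    if (cond[lane] != 0)
      bits |= (uint32_t) 1 << lane;
  return bits;
}

/* Zero iff all lanes supplied zero.  */
static uint32_t simt_vote_any (const int cond[WARP_SIZE])
{
  return ballot (cond);
}

/* Count trailing zeros of the ballot, i.e. the lowest lane index that
   supplied a non-zero condition.  */
static int simt_last_lane (const int cond[WARP_SIZE])
{
  uint32_t bits = ballot (cond);
  assert (bits != 0);  /* this model does not define the all-zero case */
  int lane = 0;
  while ((bits & 1u) == 0)
    {
      bits >>= 1;
      lane++;
    }
  return lane;
}
```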

@ -1648,6 +1648,18 @@ int, (struct cgraph_node *), NULL)
HOOK_VECTOR_END (simd_clone)
/* Functions relating to OpenMP SIMT vectorization transform. */
#undef HOOK_PREFIX
#define HOOK_PREFIX "TARGET_SIMT_"
HOOK_VECTOR (TARGET_SIMT, simt)
DEFHOOK
(vf,
"Return number of threads in SIMT thread group on the target.",
int, (void), NULL)
HOOK_VECTOR_END (simt)
/* Functions relating to openacc. */
#undef HOOK_PREFIX
#define HOOK_PREFIX "TARGET_GOACC_"
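
Since the vf hook defaults to NULL, a caller has to cope with targets that do not provide it. The sketch below models that with a plain struct of function pointers; it is not GCC's hook machinery. The names simt_hooks, example_simt_vf and query_simt_vf are hypothetical, and treating a missing hook as a group size of 1 is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the "simt" hook vector.  */
struct simt_hooks
{
  int (*vf) (void);  /* number of threads in a SIMT thread group */
};

/* Example target implementation: a warp of 32 threads.  */
static int example_simt_vf (void)
{
  return 32;
}

/* Assumption: a target without the hook is treated as having a group
   size of 1, i.e. no SIMT execution.  */
static int query_simt_vf (const struct simt_hooks *hooks)
{
  return hooks->vf ? hooks->vf () : 1;
}
```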

@ -1,3 +1,11 @@
2016-11-16 Alexander Monakov <amonakov@ispras.ru>
* lib/target-supports.exp (check_effective_target_alloca): Use a
compile test.
* gcc.target/nvptx/softstack.c: New test.
* gcc.target/nvptx/decl-shared.c: New test.
* gcc.target/nvptx/decl-shared-init.c: New test.
2016-11-16 Maciej W. Rozycki <macro@imgtec.com>
* gcc.target/mips/data-sym-jump.c: New test case.

@ -0,0 +1 @@
int var __attribute__((shared)) = 0; /* { dg-error "static initialization .* not supported" } */

@ -0,0 +1,14 @@
static int v_internal __attribute__((shared,used));
int v_common __attribute__((shared));
int v_extdef __attribute__((shared,nocommon));
extern int v_extdecl __attribute__((shared));
int use()
{
return v_extdecl;
}
/* { dg-final { scan-assembler "\[\r\n\]\[\t \]*.shared \[^,\r\n\]*v_internal" } } */
/* { dg-final { scan-assembler "\[\r\n\]\[\t \]*.weak .shared \[^,\r\n\]*v_common" } } */
/* { dg-final { scan-assembler "\[\r\n\]\[\t \]*.visible .shared \[^,\r\n\]*v_extdef" } } */
/* { dg-final { scan-assembler "\[\r\n\]\[\t \]*.extern .shared \[^,\r\n\]*v_extdecl" } } */

@ -0,0 +1,23 @@
/* { dg-options "-O2 -msoft-stack" } */
/* { dg-do run } */
static __attribute__((noinline,noclone)) int f(int *p)
{
return __sync_lock_test_and_set(p, 1);
}
static __attribute__((noinline,noclone)) int g(int n)
{
/* Check that variable-length stack allocation works. */
int v[n];
v[0] = 0;
/* Check that atomic operations can be applied to auto data. */
return f(v) == 0 && v[0] == 1;
}
int main()
{
if (!g(1))
__builtin_abort();
return 0;
}

@ -763,7 +763,10 @@ proc check_effective_target_untyped_assembly {} {
proc check_effective_target_alloca {} {
if { [istarget nvptx-*-*] } {
return 0
return [check_no_compiler_messages alloca assembly {
void f (void*);
void g (int n) { f (__builtin_alloca (n)); }
}]
}
return 1
}

@ -1,3 +1,9 @@
2016-11-16 Alexander Monakov <amonakov@ispras.ru>
* config/nvptx/crt0.c (__main): Set up __nvptx_stacks and __nvptx_uni.
* config/nvptx/mgomp.c: New file.
* config/nvptx/t-nvptx: Add mgomp.c.
2016-11-16 Waldemar Brodkorb <wbx@openadk.org>
PR libgcc/68468

@ -24,6 +24,14 @@ int *__exitval_ptr;
extern void __attribute__((noreturn)) exit (int status);
extern int main (int, void **);
/* Always set up soft stacks to allow testing with -msoft-stack but without
-mgomp. 32 is the maximum number of warps in a CTA: the definition here
must match the external declaration emitted by the compiler. */
void *__nvptx_stacks[32] __attribute__((shared,nocommon));
/* Likewise for -muniform-simt. */
unsigned __nvptx_uni[32] __attribute__((shared,nocommon));
void __attribute__((kernel))
__main (int *rval_ptr, int argc, void **argv)
{
@ -33,5 +41,9 @@ __main (int *rval_ptr, int argc, void **argv)
if (rval_ptr)
*rval_ptr = 255;
static char stack[131072] __attribute__((aligned(8)));
__nvptx_stacks[0] = stack + sizeof stack;
__nvptx_uni[0] = 0;
exit (main (argc, argv));
}
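
The crt0.c change above points __nvptx_stacks[0] at the end of a static buffer, which implies a stack that grows downward. The following host-side sketch models that scheme together with the allocate and restore steps of the soft-stack expanders: subtract the size from the stack pointer and publish it, then write a saved value back. All names are hypothetical, the buffer size is arbitrary, and alignment handling is omitted.

```c
#include <assert.h>
#include <stddef.h>

static char storage[1024];  /* backing store; size chosen arbitrarily */
static char *stack_slot;    /* models one warp's entry in __nvptx_stacks */

/* Start with an empty stack: the pointer sits at the end of the buffer.  */
static void softstack_setup (void)
{
  stack_slot = storage + sizeof storage;
}

static char *softstack_save (void)
{
  return stack_slot;
}

/* Models dynamic allocation: move the stack pointer down by SIZE and
   publish the new value.  */
static char *softstack_alloc (size_t size)
{
  assert (size <= (size_t) (stack_slot - storage));
  stack_slot -= size;
  return stack_slot;
}

/* Models restoring a previously saved stack pointer.  */
static void softstack_restore (char *saved)
{
  stack_slot = saved;
}
```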

@ -0,0 +1,32 @@
/* Define shared memory arrays for -msoft-stack and -muniform-simt.
Copyright (C) 2015-2016 Free Software Foundation, Inc.
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 3, or (at your option) any
later version.
This file is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
Under Section 7 of GPL version 3, you are granted additional
permissions described in the GCC Runtime Library Exception, version
3.1, as published by the Free Software Foundation.
You should have received a copy of the GNU General Public License and
a copy of the GCC Runtime Library Exception along with this program;
see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
<http://www.gnu.org/licenses/>. */
/* OpenACC offloading does not use these symbols; thus, they are exposed
only for the -mgomp multilib. The same definitions are also provided
in crt0.c for the case of non-offloading compilation. 32 is the maximum
number of warps in a CTA. */
#if defined(__nvptx_softstack__) && defined(__nvptx_unisimt__)
void *__nvptx_stacks[32] __attribute__((shared,nocommon));
unsigned __nvptx_uni[32] __attribute__((shared,nocommon));
#endif

@ -1,4 +1,5 @@
LIB2ADD=$(srcdir)/config/nvptx/reduction.c
LIB2ADD=$(srcdir)/config/nvptx/reduction.c \
$(srcdir)/config/nvptx/mgomp.c
LIB2ADDEH=
LIB2FUNCS_EXCLUDE=__main