Locality cloning pass: -fipa-reorder-for-locality

Implement partitioning and cloning in the callgraph to help locality.
A new -fipa-reorder-for-locality flag is used to enable this.
The optimization has two components:
* Partitioning the callgraph so as to group callers and callees that frequently
call each other in the same partition
* Cloning functions that straddle multiple callchains and allowing each clone
to be local to the partition of its callchain.
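
The two components above can be illustrated with a toy sketch. This is not the
algorithm in ipa-locality-cloning.cc (its diff is not shown here); ToyEdge,
ToyPlan and plan_locality are hypothetical names, and the greedy
"hottest edge first" policy is an assumption made only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy call edge; not GCC's cgraph_edge.
struct ToyEdge {
  std::string caller;
  std::string callee;
  long frequency;
};

// Hypothetical plan: where each function lives, plus the clones to create.
struct ToyPlan {
  std::map<std::string, int> partition_of;       // function -> partition id
  std::set<std::pair<std::string, int>> clones;  // (callee, partition) pairs
};

static bool hotter(const ToyEdge &a, const ToyEdge &b) {
  return a.frequency > b.frequency;
}

// Assumed greedy policy: visit edges from hottest to coldest.  An unplaced
// callee joins its caller's partition; a callee already placed in another
// partition is recorded as a clone local to the caller's partition.
ToyPlan plan_locality(std::vector<ToyEdge> edges) {
  std::stable_sort(edges.begin(), edges.end(), hotter);
  ToyPlan plan;
  int next_id = 0;
  for (const ToyEdge &e : edges) {
    std::map<std::string, int>::iterator caller_it =
        plan.partition_of.find(e.caller);
    if (caller_it == plan.partition_of.end())
      caller_it = plan.partition_of.emplace(e.caller, next_id++).first;
    const int caller_part = caller_it->second;

    std::map<std::string, int>::iterator callee_it =
        plan.partition_of.find(e.callee);
    if (callee_it == plan.partition_of.end())
      plan.partition_of.emplace(e.callee, caller_part);
    else if (callee_it->second != caller_part)
      plan.clones.emplace(e.callee, caller_part);
  }
  return plan;
}
```

In this sketch a helper called from two unrelated call chains stays with its
hottest caller, and the colder chain gets a clone recorded for its own
partition.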

The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc.
It creates a partitioning plan and does the prerequisite cloning.
The partitioning is then implemented during the existing LTO partitioning pass.
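
As a rough illustration of how a precomputed plan could be applied at
partitioning time, here is a simplified sketch modelled loosely on the
lto_locality_map function in this patch. ToyPartition and apply_locality_plan
are hypothetical names, symbols are plain strings, and the sketch assumes
every planned name has an entry in the size map.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical output partition; not GCC's ltrans_partition.
struct ToyPartition {
  std::vector<std::string> symbols;
  long insns = 0;
};

// Apply PLAN (one list of symbol names per planned partition) to OUT.
// Returns false when the plan is empty so the caller can fall back to a
// different scheme.  Assumes every planned name is a key of SYMBOL_INSNS.
bool apply_locality_plan(const std::vector<std::vector<std::string>> &plan,
                         const std::map<std::string, long> &symbol_insns,
                         long max_partition_insns,
                         std::vector<ToyPartition> &out) {
  if (plan.empty())
    return false;

  std::set<std::string> placed;
  auto place = [&](const std::string &name) {
    out.back().symbols.push_back(name);
    out.back().insns += symbol_insns.at(name);
    placed.insert(name);
  };

  // One output partition per planned partition; skip symbols already placed.
  for (const std::vector<std::string> &planned : plan) {
    out.emplace_back();
    for (const std::string &name : planned)
      if (!placed.count(name))
        place(name);
  }

  // Remaining symbols go to the last partition, opening a new one only
  // once the current one has grown past the size limit.
  for (const auto &entry : symbol_insns) {
    if (placed.count(entry.first))
      continue;
    if (out.back().insns > max_partition_insns)
      out.emplace_back();
    place(entry.first);
  }
  return true;
}
```

The size check runs before a leftover symbol is added, so a partition can end
up larger than the limit; the sketch keeps that behaviour to stay close to the
loop in the patch.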

To guide these locality heuristics we use PGO data.
In the absence of PGO data we use a static heuristic that uses the accumulated
estimated edge frequencies of the callees for each function to guide the
reordering.
We are investigating more elaborate static heuristics, in particular using
the demangled C++ names to group template instantiations together.
This is promising, but we are still working out some kinks in the
implementation and want to send it out as a follow-up once we're more
confident in it.
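
A minimal sketch of the static heuristic described above, assuming it amounts
to summing the estimated frequencies of each function's outgoing call edges
and ranking functions by that sum. How the pass actually consumes the
accumulated value is not stated here, so the ranking step is an assumption;
EstimatedEdge and order_by_callee_frequency are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a call edge with a statically estimated frequency.
struct EstimatedEdge {
  std::string caller;
  std::string callee;
  double estimated_frequency;
};

typedef std::pair<std::string, double> RankedFunction;

static bool higher_total(const RankedFunction &a, const RankedFunction &b) {
  return a.second > b.second;
}

// Accumulate the estimated frequencies of each function's outgoing call
// edges, then return function names from highest to lowest total.
std::vector<std::string>
order_by_callee_frequency(const std::vector<EstimatedEdge> &edges) {
  std::map<std::string, double> accumulated;
  for (const EstimatedEdge &e : edges) {
    accumulated[e.caller] += e.estimated_frequency;
    accumulated.emplace(e.callee, 0.0);  // leaf functions still get an entry
  }

  std::vector<RankedFunction> ranked(accumulated.begin(), accumulated.end());
  std::stable_sort(ranked.begin(), ranked.end(), higher_total);

  std::vector<std::string> order;
  for (const RankedFunction &entry : ranked)
    order.push_back(entry.first);
  return order;
}
```

Ties keep alphabetical order here only because the map iterates in key order
and the sort is stable; that tie-break is a property of the sketch, not of the
pass.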

A new bootstrap-lto-locality bootstrap config is added that allows us to test
this on GCC itself with either static or PGO heuristics.
GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).

Because this pass enables a new partitioning scheme, it is incompatible with
explicit -flto-partition= options, so an error is emitted when the user
passes both flags explicitly.
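
The rule can be summarised with a standalone sketch. It mirrors the intent of
validate_ipa_reorder_locality_lto_partition in this patch but does not use
GCC's option structures; ToyPartitionModel and check_locality_options are
hypothetical names, and returning a message string instead of calling error ()
is a simplification.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the partitioning models; Default means the user
// did not pass -flto-partition= explicitly.
enum class ToyPartitionModel { Default, None, One, Balanced, OneToOne, Max, Cache };

// Returns an empty string when the combination is accepted, otherwise the
// diagnostic text.
std::string check_locality_options(bool reorder_for_locality,
                                   ToyPartitionModel requested) {
  if (reorder_for_locality && requested != ToyPartitionModel::Default)
    return "-fipa-reorder-for-locality is incompatible with an explicit "
           "-flto-partition= option";
  return std::string();
}
```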

With this optimization we are seeing good performance gains on some large
internal workloads that stress the parts of the processor that are sensitive
to code locality, but we'd appreciate wider performance evaluation.

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for mainline?
Thanks,
Kyrill

Signed-off-by: Prachi Godbole <pgodbole@nvidia.com>
Co-authored-by: Kyrylo Tkachov <ktkachov@nvidia.com>

config/ChangeLog:

	* bootstrap-lto-locality.mk: New file.

gcc/ChangeLog:

	* Makefile.in (OBJS): Add ipa-locality-cloning.o.
	* cgraph.h (set_new_clone_decl_and_node_flags): Declare prototype.
	* cgraphclones.cc (set_new_clone_decl_and_node_flags): Remove static
	qualifier.
	* common.opt (fipa-reorder-for-locality): New flag.
	(LTO_PARTITION_DEFAULT): Declare.
	(flto-partition): Change default to LTO_PARTITION_DEFAULT.
	* doc/invoke.texi: Document -fipa-reorder-for-locality.
	* flag-types.h (enum lto_locality_cloning_model): Declare.
	(lto_partitioning_model): Add LTO_PARTITION_DEFAULT.
	* lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping of
	node and index.
	* opts.cc (validate_ipa_reorder_locality_lto_partition): Define.
	(finish_options): Handle LTO_PARTITION_DEFAULT.
	* params.opt (lto_locality_cloning_model): New enum.
	(lto-partition-locality-cloning): New param.
	(lto-partition-locality-frequency-cutoff): Likewise.
	(lto-partition-locality-size-cutoff): Likewise.
	(lto-max-locality-partition): Likewise.
	* passes.def: Register pass_ipa_locality_cloning.
	* timevar.def (TV_IPA_LC): New timevar.
	* tree-pass.h (make_pass_ipa_locality_cloning): Declare.
	* ipa-locality-cloning.cc: New file.
	* ipa-locality-cloning.h: New file.

gcc/lto/ChangeLog:

	* lto-partition.cc (add_node_references_to_partition): Define.
	(create_partition): Likewise.
	(lto_locality_map): Likewise.
	(lto_promote_cross_file_statics): Add extra dumping.
	* lto-partition.h (lto_locality_map): Declare prototype.
	* lto.cc (do_whole_program_analysis): Handle
	flag_ipa_reorder_for_locality.
Kyrylo Tkachov 2025-02-27 09:24:10 -08:00
parent b4cf69503b
commit 6d9fdf4bf5
18 changed files with 1423 additions and 11 deletions

config/bootstrap-lto-locality.mk

@ -0,0 +1,20 @@
# This option enables LTO and locality partitioning for stage2 and stage3 in slim mode
STAGE2_CFLAGS += -flto=jobserver -frandom-seed=1 -fipa-reorder-for-locality
STAGE3_CFLAGS += -flto=jobserver -frandom-seed=1 -fipa-reorder-for-locality
STAGEprofile_CFLAGS += -flto=jobserver -frandom-seed=1 -fipa-reorder-for-locality
STAGEtrain_CFLAGS += -flto=jobserver -frandom-seed=1 -fipa-reorder-for-locality
STAGEfeedback_CFLAGS += -flto=jobserver -frandom-seed=1 -fipa-reorder-for-locality
# assumes the host supports the linker plugin
LTO_AR = $$r/$(HOST_SUBDIR)/prev-gcc/gcc-ar$(exeext) -B$$r/$(HOST_SUBDIR)/prev-gcc/
LTO_RANLIB = $$r/$(HOST_SUBDIR)/prev-gcc/gcc-ranlib$(exeext) -B$$r/$(HOST_SUBDIR)/prev-gcc/
LTO_NM = $$r/$(HOST_SUBDIR)/prev-gcc/gcc-nm$(exeext) -B$$r/$(HOST_SUBDIR)/prev-gcc/
LTO_EXPORTS = AR="$(LTO_AR)"; export AR; \
RANLIB="$(LTO_RANLIB)"; export RANLIB; \
NM="$(LTO_NM)"; export NM;
LTO_FLAGS_TO_PASS = AR="$(LTO_AR)" RANLIB="$(LTO_RANLIB)" NM="$(LTO_NM)"
do-compare = $(SHELL) $(srcdir)/contrib/compare-lto $$f1 $$f2
extra-compare = gcc/lto1$(exeext)

gcc/Makefile.in

@ -1555,6 +1555,7 @@ OBJS = \
incpath.o \
init-regs.o \
internal-fn.o \
ipa-locality-cloning.o \
ipa-cp.o \
ipa-sra.o \
ipa-devirt.o \
@ -3026,6 +3027,7 @@ GTFILES = $(CPPLIB_H) $(srcdir)/input.h $(srcdir)/coretypes.h \
$(srcdir)/ipa-param-manipulation.h $(srcdir)/ipa-sra.cc \
$(srcdir)/ipa-modref.h $(srcdir)/ipa-modref.cc \
$(srcdir)/ipa-modref-tree.h \
$(srcdir)/ipa-locality-cloning.cc \
$(srcdir)/signop.h \
$(srcdir)/diagnostic-spec.h $(srcdir)/diagnostic-spec.cc \
$(srcdir)/dwarf2out.h \

gcc/cgraph.h

@ -2627,6 +2627,7 @@ void tree_function_versioning (tree, tree, vec<ipa_replace_map *, va_gc> *,
void dump_callgraph_transformation (const cgraph_node *original,
const cgraph_node *clone,
const char *suffix);
void set_new_clone_decl_and_node_flags (cgraph_node *new_node);
/* In cgraphbuild.cc */
int compute_call_stmt_bb_frequency (tree, basic_block bb);
void record_references_in_initializer (tree, bool);

gcc/cgraphclones.cc

@ -158,7 +158,7 @@ cgraph_edge::clone (cgraph_node *n, gcall *call_stmt, unsigned stmt_uid,
/* Set flags of NEW_NODE and its decl. NEW_NODE is a newly created private
clone or its thunk. */
static void
void
set_new_clone_decl_and_node_flags (cgraph_node *new_node)
{
DECL_EXTERNAL (new_node->decl) = 0;

gcc/common.opt

@ -2116,6 +2116,10 @@ fipa-modref
Common Var(flag_ipa_modref) Optimization
Perform interprocedural modref analysis.
fipa-reorder-for-locality
Common Var(flag_ipa_reorder_for_locality) Init(0) Optimization
Perform reordering and cloning of functions to maximize locality.
fipa-profile
Common Var(flag_ipa_profile) Init(0) Optimization
Perform interprocedural profile propagation.
@ -2274,6 +2278,9 @@ Number of cache entries in incremental LTO after which to prune old entries.
Enum
Name(lto_partition_model) Type(enum lto_partition_model) UnknownError(unknown LTO partitioning model %qs)
EnumValue
Enum(lto_partition_model) String(default) Value(LTO_PARTITION_DEFAULT)
EnumValue
Enum(lto_partition_model) String(none) Value(LTO_PARTITION_NONE)
@ -2293,7 +2300,7 @@ EnumValue
Enum(lto_partition_model) String(cache) Value(LTO_PARTITION_CACHE)
flto-partition=
Common Joined RejectNegative Enum(lto_partition_model) Var(flag_lto_partition) Init(LTO_PARTITION_BALANCED)
Common Joined RejectNegative Enum(lto_partition_model) Var(flag_lto_partition) Init(LTO_PARTITION_DEFAULT)
Specify the algorithm to partition symbols and vars at linktime.
; The initial value of -1 comes from Z_DEFAULT_COMPRESSION in zlib.h.

gcc/doc/invoke.texi

@ -593,7 +593,7 @@ Objective-C and Objective-C++ Dialects}.
-finline-functions -finline-functions-called-once -finline-limit=@var{n}
-finline-small-functions -fipa-modref -fipa-cp -fipa-cp-clone
-fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const
-fipa-reference -fipa-reference-addressable
-fipa-reference -fipa-reference-addressable -fipa-reorder-for-locality
-fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm}
-flate-combine-instructions -flifetime-dse -flive-patching=@var{level}
-fira-region=@var{region} -fira-hoist-pressure
@ -13871,6 +13871,21 @@ Enabled by default at @option{-O1} and higher.
Discover read-only, write-only and non-addressable static variables.
Enabled by default at @option{-O1} and higher.
@opindex fipa-reorder-for-locality
@item -fipa-reorder-for-locality
Group call chains close together in the binary layout to improve code
locality and minimize jump distances between frequently called functions.
Unlike @option{-freorder-functions} this pass considers the call
chains between functions and groups them together, rather than grouping all
hot/normal/cold/never-executed functions into separate sections.
Unlike @option{-fprofile-reorder-functions} it aims to improve code locality
throughout the runtime of the program rather than focusing on program startup.
This option is incompatible with an explicit
@option{-flto-partition=} option since it enforces a custom partitioning
scheme.
Profile feedback is recommended when using this option, but the option is
not enabled by default even when profile feedback is used.
@opindex fipa-stack-alignment
@item -fipa-stack-alignment
Reduce stack alignment on call sites if possible.
@ -14606,11 +14621,13 @@ Enabled for x86 at levels @option{-O2}, @option{-O3}, @option{-Os}.
@opindex freorder-functions
@item -freorder-functions
Reorder functions in the object file in order to
improve code locality. This is implemented by using special
subsections @code{.text.hot} for most frequently executed functions and
@code{.text.unlikely} for unlikely executed functions. Reordering is done by
the linker so object file format must support named sections and linker must
place them in a reasonable way.
improve code locality. Unlike @option{-fipa-reorder-for-locality} this option
prioritises grouping all functions within a category
(hot/normal/cold/never-executed) together.
This is implemented by using special subsections @code{.text.hot} for most
frequently executed functions and @code{.text.unlikely} for unlikely executed
functions. Reordering is done by the linker so object file format must support
named sections and linker must place them in a reasonable way.
This option isn't effective unless you either provide profile feedback
(see @option{-fprofile-arcs} for details) or manually annotate functions with
@ -15635,7 +15652,8 @@ Enabled by @option{-fprofile-generate}, @option{-fprofile-use}, and
@item -fprofile-reorder-functions
Function reordering based on profile instrumentation collects
first time of execution of a function and orders these functions
in ascending order.
in ascending order, aiming to optimize program startup through more
efficient loading of text segments.
Enabled with @option{-fprofile-use}.

gcc/flag-types.h

@ -404,7 +404,15 @@ enum lto_partition_model {
LTO_PARTITION_BALANCED = 2,
LTO_PARTITION_1TO1 = 3,
LTO_PARTITION_MAX = 4,
LTO_PARTITION_CACHE = 5
LTO_PARTITION_CACHE = 5,
LTO_PARTITION_DEFAULT= 6
};
/* flag_lto_locality_cloning initialization values. */
enum lto_locality_cloning_model {
LTO_LOCALITY_NO_CLONING = 0,
LTO_LOCALITY_NON_INTERPOSABLE_CLONING = 1,
LTO_LOCALITY_MAXIMAL_CLONING = 2,
};
/* flag_lto_linker_output initialization values. */

gcc/ipa-locality-cloning.cc: new file, 1137 lines added. Diff suppressed because it is too large.

gcc/ipa-locality-cloning.h

@ -0,0 +1,35 @@
/* LTO partitioning logic routines.
Copyright The GNU Toolchain Authors
This file is part of GCC.
GCC is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 3, or (at your option) any later
version.
GCC is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License
along with GCC; see the file COPYING3. If not see
<http://www.gnu.org/licenses/>. */
#ifndef IPA_LOCALITY_CLONING_H
#define IPA_LOCALITY_CLONING_H
/* Structure describing locality partitions. */
struct locality_partition_def
{
int part_id;
vec<cgraph_node *> nodes;
int insns;
};
typedef struct locality_partition_def *locality_partition;
extern vec<locality_partition> locality_partitions;
#endif /* IPA_LOCALITY_CLONING_H */

gcc/lto-cgraph.cc

@ -229,6 +229,8 @@ lto_set_symtab_encoder_in_partition (lto_symtab_encoder_t encoder,
symtab_node *node)
{
int index = lto_symtab_encoder_encode (encoder, node);
if (dump_file)
fprintf(dump_file, "Node %s, index %d\n", node->asm_name(), index);
encoder->nodes[index].in_partition = true;
}

gcc/lto/lto-partition.cc

@ -37,6 +37,7 @@ along with GCC; see the file COPYING3. If not see
#include "ipa-prop.h"
#include "ipa-fnsummary.h"
#include "lto-partition.h"
#include "ipa-locality-cloning.h"
#include <limits>
@ -1418,6 +1419,126 @@ lto_balanced_map (int n_lto_partitions, int max_partition_size)
}
}
/* Add all references of NODE into PARTITION. */
static void
add_node_references_to_partition (ltrans_partition partition, symtab_node *node)
{
struct ipa_ref *ref = NULL;
varpool_node *vnode;
for (int j = 0; node->iterate_reference (j, ref); j++)
if (is_a <varpool_node *> (ref->referred))
{
vnode = dyn_cast <varpool_node *> (ref->referred);
if (!symbol_partitioned_p (vnode)
&& !vnode->no_reorder
&& vnode->get_partitioning_class () == SYMBOL_PARTITION)
{
add_symbol_to_partition (partition, vnode);
if (dump_file)
fprintf (dump_file, "Varpool Node: %s\n", vnode->dump_asm_name ());
add_node_references_to_partition (partition, vnode);
}
}
for (int j = 0; node->iterate_referring (j, ref); j++)
if (is_a <varpool_node *> (ref->referring))
{
vnode = dyn_cast <varpool_node *> (ref->referring);
gcc_assert (vnode->definition);
if (!symbol_partitioned_p (vnode)
&& !vnode->no_reorder
&& !vnode->can_remove_if_no_refs_p ()
&& vnode->get_partitioning_class () == SYMBOL_PARTITION)
{
add_symbol_to_partition (partition, vnode);
if (dump_file)
fprintf (dump_file, "Varpool Node: %s\n", vnode->dump_asm_name ());
add_node_references_to_partition (partition, vnode);
}
}
if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
{
struct cgraph_edge *e;
/* Add all inline clones and callees that are duplicated. */
for (e = cnode->callees; e; e = e->next_callee)
if (e->callee->get_partitioning_class () == SYMBOL_DUPLICATE)
add_node_references_to_partition (partition, e->callee);
/* Add all thunks associated with the function. */
for (e = cnode->callers; e; e = e->next_caller)
if (e->caller->thunk && !e->caller->inlined_to)
add_node_references_to_partition (partition, e->caller);
}
}
/* Create and return the created partition of name NAME. */
static ltrans_partition
create_partition (int &npartitions, const char *name)
{
npartitions++;
return new_partition (name);
}
/* Partitioning for code locality.
The partitioning plan (and prerequisite cloning) will have been done by the
IPA locality cloning pass. This function just implements that plan by
assigning those partitions to ltrans_partitions. */
void
lto_locality_map (int max_partition_size)
{
symtab_node *snode;
int npartitions = 0;
auto_vec<varpool_node *> varpool_order;
struct cgraph_node *node;
if (locality_partitions.length () == 0)
{
if (dump_file)
{
fprintf (dump_file, "Locality partition: falling back to balanced "
"model\n");
}
lto_balanced_map (param_lto_partitions, param_max_partition_size);
return;
}
ltrans_partition partition = nullptr;
for (auto part : locality_partitions)
{
partition = create_partition (npartitions, "");
for (unsigned j = 0; j < part->nodes.length (); j++)
{
node = part->nodes[j];
if (symbol_partitioned_p (node))
continue;
add_symbol_to_partition (partition, node);
add_node_references_to_partition (partition, node);
}
}
int64_t partition_size = max_partition_size;
/* All other unpartitioned symbols. */
FOR_EACH_SYMBOL (snode)
{
if (snode->get_partitioning_class () == SYMBOL_PARTITION
&& !symbol_partitioned_p (snode))
{
if (partition->insns > partition_size)
partition = create_partition (npartitions, "");
add_symbol_to_partition (partition, snode);
if (dump_file)
fprintf (dump_file, "Un-ordered Node: %s\n", snode->dump_asm_name ());
}
}
}
/* Return true if we must not change the name of the NODE. The name as
extracted from the corresponding decl should be passed in NAME. */
@ -1732,7 +1853,12 @@ lto_promote_cross_file_statics (void)
{
ltrans_partition part
= ltrans_partitions[i];
if (dump_file)
fprintf (dump_file, "lto_promote_cross_file_statics for part %s %p\n",
part->name, (void *)part->encoder);
part->encoder = compute_ltrans_boundary (part->encoder);
if (dump_file)
fprintf (dump_file, "new encoder %p\n", (void *)part->encoder);
}
lto_clone_numbers = new hash_map<const char *, unsigned>;

gcc/lto/lto-partition.h

@ -37,6 +37,7 @@ void lto_1_to_1_map (void);
void lto_max_map (void);
void lto_cache_map (int, int);
void lto_balanced_map (int, int);
void lto_locality_map (int);
void lto_promote_cross_file_statics (void);
void free_ltrans_partitions (void);
void lto_promote_statics_nonwpa (void);

gcc/lto/lto.cc

@ -547,7 +547,9 @@ do_whole_program_analysis (void)
symtab_node::checking_verify_symtab_nodes ();
bitmap_obstack_release (NULL);
if (flag_lto_partition == LTO_PARTITION_1TO1)
if (flag_ipa_reorder_for_locality)
lto_locality_map (param_max_locality_partition_size);
else if (flag_lto_partition == LTO_PARTITION_1TO1)
lto_1_to_1_map ();
else if (flag_lto_partition == LTO_PARTITION_MAX)
lto_max_map ();

gcc/opts.cc

@ -1037,6 +1037,25 @@ report_conflicting_sanitizer_options (struct gcc_options *opts, location_t loc,
}
}
/* Validate from OPTS and OPTS_SET that when -fipa-reorder-for-locality is
enabled no explicit -flto-partition is also passed as the locality cloning
pass uses its own partitioning scheme. */
static void
validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts,
struct gcc_options *opts_set)
{
static bool validated_p = false;
if (opts->x_flag_lto_partition != LTO_PARTITION_DEFAULT)
{
if (opts_set->x_flag_ipa_reorder_for_locality && !validated_p)
error ("%<-fipa-reorder-for-locality%> is incompatible with"
" an explicit %qs option", "-flto-partition");
}
validated_p = true;
}
/* After all options at LOC have been read into OPTS and OPTS_SET,
finalize settings of those options and diagnose incompatible
combinations. */
@ -1249,6 +1268,10 @@ finish_options (struct gcc_options *opts, struct gcc_options *opts_set,
if (opts->x_flag_reorder_blocks_and_partition)
SET_OPTION_IF_UNSET (opts, opts_set, flag_reorder_functions, 1);
validate_ipa_reorder_locality_lto_partition (opts, opts_set);
if (opts_set->x_flag_lto_partition != LTO_PARTITION_DEFAULT)
opts_set->x_flag_lto_partition = opts->x_flag_lto_partition = LTO_PARTITION_BALANCED;
/* The -gsplit-dwarf option requires -ggnu-pubnames. */
if (opts->x_dwarf_split_debug_info)
opts->x_debug_generate_pub_sections = 2;

gcc/params.opt

@ -469,6 +469,33 @@ Minimal size of a partition for LTO (in estimated instructions).
Common Joined UInteger Var(param_lto_partitions) Init(128) IntegerRange(1, 65536) Param
Number of partitions the program should be split to.
Enum
Name(lto_locality_cloning_model) Type(enum lto_locality_cloning_model) UnknownError(unknown LTO partitioning model %qs)
EnumValue
Enum(lto_locality_cloning_model) String(no) Value(LTO_LOCALITY_NO_CLONING)
EnumValue
Enum(lto_locality_cloning_model) String(non_interposable) Value(LTO_LOCALITY_NON_INTERPOSABLE_CLONING)
EnumValue
Enum(lto_locality_cloning_model) String(maximal) Value(LTO_LOCALITY_MAXIMAL_CLONING)
-param=lto-partition-locality-cloning=
Common Joined RejectNegative Enum(lto_locality_cloning_model) Var(flag_lto_locality_cloning) Init(LTO_LOCALITY_MAXIMAL_CLONING) Optimization
-param=lto-partition-locality-frequency-cutoff=
Common Joined UInteger Var(param_lto_locality_frequency) Init(1) IntegerRange(0, 65536) Param Optimization
The denominator n of fraction 1/n of the execution frequency of callee to be cloned for a particular caller. Special value of 0 dictates to always clone without a cut-off.
-param=lto-partition-locality-size-cutoff=
Common Joined UInteger Var(param_lto_locality_size) Init(1000) IntegerRange(1, 65536) Param Optimization
Size cut-off for callee including inlined calls to be cloned for a particular caller.
-param=lto-max-locality-partition=
Common Joined UInteger Var(param_max_locality_partition_size) Init(1000000) Param
Maximal size of a locality partition for LTO (in estimated instructions). Value of 0 results in default value being used.
-param=max-average-unrolled-insns=
Common Joined UInteger Var(param_max_average_unrolled_insns) Init(80) Param Optimization
The maximum number of instructions to consider to unroll in a loop on average.

gcc/passes.def

@ -162,6 +162,7 @@ along with GCC; see the file COPYING3. If not see
NEXT_PASS (pass_ipa_sra);
NEXT_PASS (pass_ipa_fn_summary);
NEXT_PASS (pass_ipa_inline);
NEXT_PASS (pass_ipa_locality_cloning);
NEXT_PASS (pass_ipa_pure_const);
NEXT_PASS (pass_ipa_modref);
NEXT_PASS (pass_ipa_free_fn_summary, false /* small_p */);

gcc/timevar.def

@ -105,6 +105,7 @@ DEFTIMEVAR (TV_IPA_PURE_CONST , "ipa pure const")
DEFTIMEVAR (TV_IPA_ICF , "ipa icf")
DEFTIMEVAR (TV_IPA_PTA , "ipa points-to")
DEFTIMEVAR (TV_IPA_SRA , "ipa SRA")
DEFTIMEVAR (TV_IPA_LC , "ipa locality clone")
DEFTIMEVAR (TV_IPA_FREE_LANG_DATA , "ipa free lang data")
DEFTIMEVAR (TV_IPA_FREE_INLINE_SUMMARY, "ipa free inline summary")
DEFTIMEVAR (TV_IPA_MODREF , "ipa modref")

gcc/tree-pass.h

@ -551,6 +551,7 @@ extern ipa_opt_pass_d *make_pass_ipa_cdtor_merge (gcc::context *ctxt);
extern ipa_opt_pass_d *make_pass_ipa_single_use (gcc::context *ctxt);
extern ipa_opt_pass_d *make_pass_ipa_comdats (gcc::context *ctxt);
extern ipa_opt_pass_d *make_pass_ipa_modref (gcc::context *ctxt);
extern ipa_opt_pass_d *make_pass_ipa_locality_cloning (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_cleanup_cfg_post_optimizing (gcc::context
*ctxt);