Implement partitioning and cloning in the callgraph to help locality.
A new -fipa-reorder-for-locality flag is used to enable this.
The optimization has two components:
* Partitioning the callgraph so as to group callers and callees that frequently
call each other in the same partition
* Cloning functions that straddle multiple callchains and allowing each clone
to be local to the partition of its callchain.
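As a rough illustration of the second component, the sketch below finds
functions that are called from more than one partition and would therefore be
candidates for cloning. It is not the code in ipa-locality-cloning.cc: the
Call struct, the partition_of map and both helper functions are hypothetical
names made up for this example, and the real pass works on GCC's callgraph
data structures and applies further cutoffs.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical call record for illustration; not GCC's cgraph_edge.
struct Call {
  std::string caller;
  std::string callee;
};

// Map each callee to the set of partitions that contain at least one of its
// callers.  Callers without a partition assignment are skipped.
std::map<std::string, std::set<int>>
partitions_calling(const std::vector<Call> &calls,
                   const std::map<std::string, int> &partition_of) {
  std::map<std::string, std::set<int>> result;
  for (const Call &c : calls) {
    auto it = partition_of.find(c.caller);
    if (it != partition_of.end())
      result[c.callee].insert(it->second);
  }
  return result;
}

// A callee reached from more than one partition straddles callchains and is
// a candidate for one clone per calling partition.
std::vector<std::string>
clone_candidates(const std::map<std::string, std::set<int>> &callers) {
  std::vector<std::string> out;
  for (const auto &kv : callers)
    if (kv.second.size() > 1)
      out.push_back(kv.first);
  return out;
}
```

With callers "a" in partition 0 and "b" in partition 1 both calling "util",
only "util" is reported as a candidate; a function called from a single
partition is left alone.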
The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc.
It creates a partitioning plan and does the prerequisite cloning.
The partitioning is then implemented during the existing LTO partitioning pass.
To guide these locality heuristics we use PGO data.
In the absence of PGO data we fall back to a static heuristic based on the
accumulated estimated edge frequencies of the callees of each function to
guide the reordering.
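To make the static heuristic more concrete, here is a minimal sketch of
accumulating estimated call-edge frequencies per function and ordering
functions by the result. The CallEdge struct and both function names are
hypothetical and exist only for this example; how the pass actually turns the
accumulated values into a layout is not shown here, and the descending sort is
an assumption.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a callgraph edge; not GCC's cgraph_edge.
struct CallEdge {
  std::string caller;
  std::string callee;
  double est_freq;  // statically estimated frequency of this call
};

// Sum the estimated frequencies of every outgoing edge of each function.
std::unordered_map<std::string, double>
accumulate_callee_freq(const std::vector<CallEdge> &edges) {
  std::unordered_map<std::string, double> weight;
  for (const CallEdge &e : edges) {
    weight[e.caller] += e.est_freq;
    weight.emplace(e.callee, 0.0);  // leaf functions still get an entry
  }
  return weight;
}

// Order functions from highest to lowest accumulated frequency, breaking
// ties by name so the result is deterministic.  (Assumed ordering rule.)
std::vector<std::string>
order_by_weight(const std::unordered_map<std::string, double> &weight) {
  std::vector<std::string> names;
  for (const auto &kv : weight)
    names.push_back(kv.first);
  std::sort(names.begin(), names.end(),
            [&](const std::string &a, const std::string &b) {
              double wa = weight.at(a), wb = weight.at(b);
              if (wa != wb)
                return wa > wb;
              return a < b;
            });
  return names;
}
```

A function with many frequently executed calls ends up early in the order,
while leaf functions with no outgoing edges sort last.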
We are investigating some more elaborate static heuristics, in particular using
the demangled C++ names to group template instantiations together.
This is promising, but we are still working out some kinks in the
implementation and want to send it out as a follow-up once we're more
confident in it.
A new bootstrap-lto-locality bootstrap config is added that allows us to test
this on GCC itself with either static or PGO heuristics.
GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).
As this new pass enables a new partitioning scheme, it is incompatible with
explicit -flto-partition= options, so an error is now reported when the user
specifies both flags explicitly.
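The check can be pictured roughly as follows. This is only a sketch of the
idea, not the contents of validate_ipa_reorder_locality_lto_partition: the
Options struct, the PartitionModel enum and the function name are hypothetical,
and the only detail taken from the patch is that a default partitioning value
distinguishes "not specified by the user" from an explicit choice.

```cpp
// Hypothetical mirror of the partitioning model.  Default stands for
// "no explicit -flto-partition= given"; the other values are illustrative.
enum class PartitionModel { Default, Balanced, Max };

// Hypothetical option state for this example.
struct Options {
  bool reorder_for_locality = false;                   // -fipa-reorder-for-locality
  PartitionModel partition = PartitionModel::Default;  // -flto-partition=
};

// Returns true when locality reordering is requested together with an
// explicit partitioning choice, which is the case that must be rejected.
bool locality_options_conflict(const Options &opts) {
  return opts.reorder_for_locality
         && opts.partition != PartitionModel::Default;
}
```

Keeping a separate default value means the flag alone, or -flto-partition=
alone, is accepted, and only the explicit combination is diagnosed.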
With this optimization we are seeing good performance gains on some large
internal workloads that stress the parts of the processor that are sensitive
to code locality, but we'd appreciate wider performance evaluation.
Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for mainline?
Thanks,
Kyrill
Signed-off-by: Prachi Godbole <pgodbole@nvidia.com>
Co-authored-by: Kyrylo Tkachov <ktkachov@nvidia.com>
config/ChangeLog:
* bootstrap-lto-locality.mk: New file.
gcc/ChangeLog:
* Makefile.in (OBJS): Add ipa-locality-cloning.o.
* cgraph.h (set_new_clone_decl_and_node_flags): Declare prototype.
* cgraphclones.cc (set_new_clone_decl_and_node_flags): Remove static
qualifier.
* common.opt (fipa-reorder-for-locality): New flag.
(LTO_PARTITION_DEFAULT): Declare.
(flto-partition): Change default to LTO_PARTITION_DEFAULT.
* doc/invoke.texi: Document -fipa-reorder-for-locality.
* flag-types.h (enum lto_locality_cloning_model): Declare.
(lto_partitioning_model): Add LTO_PARTITION_DEFAULT.
* lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping of
node and index.
* opts.cc (validate_ipa_reorder_locality_lto_partition): Define.
(finish_options): Handle LTO_PARTITION_DEFAULT.
* params.opt (lto_locality_cloning_model): New enum.
(lto-partition-locality-cloning): New param.
(lto-partition-locality-frequency-cutoff): Likewise.
(lto-partition-locality-size-cutoff): Likewise.
(lto-max-locality-partition): Likewise.
* passes.def: Register pass_ipa_locality_cloning.
* timevar.def (TV_IPA_LC): New timevar.
* tree-pass.h (make_pass_ipa_locality_cloning): Declare.
* ipa-locality-cloning.cc: New file.
* ipa-locality-cloning.h: New file.
gcc/lto/ChangeLog:
* lto-partition.cc (add_node_references_to_partition): Define.
(create_partition): Likewise.
(lto_locality_map): Likewise.
(lto_promote_cross_file_statics): Add extra dumping.
* lto-partition.h (lto_locality_map): Declare prototype.
* lto.cc (do_whole_program_analysis): Handle
flag_ipa_reorder_for_locality.