gcc/libcpp
Jakub Jelinek eb4879ab90 c++: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
The following patch implements the
C++23 P2071R2 - Named universal character escapes
paper, which adds support for escapes such as \N{LATIN SMALL LETTER E}.
I've used Unicode 14.0.  It has 144803 character name properties
(including the ones generated by the Unicode NR1 and NR2 rules)
and correction/control/alternate aliases; stored as plain strings with
zero terminators they would take 3884745 bytes, which is clearly
unacceptable for libcpp.
This patch instead contains a generator which, from the UnicodeData.txt
and NameAliases.txt files, emits a space-optimized radix tree (208765
bytes long for 14.0), a single string literal dictionary (59418 bytes),
the maximum name length (currently 88 chars) and two small helper arrays
for the NR1/NR2 name generation.  The tree and the dictionary together
take 268183 bytes.
The radix tree needs 2 to 9 bytes per node; the exact format is
described in the generator program.  There could be ways to shrink
the dictionary size somewhat at the expense of slightly slower lookups.
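
For readers unfamiliar with the NR1 rule mentioned above: it derives
Hangul syllable names arithmetically from the code point, so those names
do not need to be stored one by one.  The sketch below follows the
algorithm from the Unicode Standard; it is not the code in charset.cc,
and the names hangul_syllable_name, kLead, kVowel and kTrail are
hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <string>

// Constants from the Unicode Standard's Hangul syllable arithmetic.
constexpr char32_t kSBase = 0xAC00;          // first precomposed syllable
constexpr int kLCount = 19, kVCount = 21, kTCount = 28;
constexpr int kNCount = kVCount * kTCount;   // 588 syllables per leading consonant
constexpr int kSCount = kLCount * kNCount;   // 11172 syllables in total

// Jamo short names (hypothetical table names, values from the standard).
static const char *const kLead[kLCount] = {
  "G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
  "SS", "", "J", "JJ", "C", "K", "T", "P", "H" };
static const char *const kVowel[kVCount] = {
  "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
  "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I" };
static const char *const kTrail[kTCount] = {
  "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB", "LS",
  "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C", "K", "T",
  "P", "H" };

// Build the NR1 name of a precomposed Hangul syllable.
std::string hangul_syllable_name (char32_t cp)
{
  assert (cp >= kSBase && cp < kSBase + kSCount);
  int s = static_cast<int> (cp - kSBase);
  int l = s / kNCount;              // leading consonant index
  int v = (s % kNCount) / kTCount;  // vowel index
  int t = s % kTCount;              // trailing consonant index (0 = none)
  return std::string ("HANGUL SYLLABLE ") + kLead[l] + kVowel[v] + kTrail[t];
}
```

For example, hangul_syllable_name (0xD4DB) yields
"HANGUL SYLLABLE PWILH".  A lookup by name has to run the same
arithmetic in reverse, which only needs the three short jamo tables.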

Currently the patch implements strict matching (which is what is needed
to actually implement the feature for valid code) and the Unicode
UAX44-LM2 loose matching algorithm to provide hints (that algorithm
essentially ignores case, spaces, underscores and hyphens between two
alphanumeric characters, with one exception for the hyphen).
In the attachment is a WIP patch that shows how to also implement
spellcheck.{h,cc} style discovery of misspellings, but I'll need to talk
to David Malcolm about it, as spellcheck.{h,cc} is in the gcc/ subdir
(so the WIP incremental patch instead prints all the names to stderr).
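
A minimal sketch of the loose matching idea, assuming plain ASCII input:
two names match loosely when their normalized keys compare equal.  This
is not the _cpp_uname2c_uax44_lm2 implementation; the function name
loose_key is hypothetical.  The one exception UAX44-LM2 makes is the
hyphen in U+1180 HANGUL JUNGSEONG O-E, which has to stay distinct from
U+116C HANGUL JUNGSEONG OE.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Reduce a character name to a key for UAX44-LM2 style comparison:
// drop spaces and underscores, drop medial hyphens (hyphens directly
// between two alphanumeric characters in the original spelling) and
// fold everything to upper case.
std::string loose_key (const std::string &name)
{
  std::string out;
  std::size_t last_hyphen = std::string::npos;  // position in OUT of last dropped hyphen
  for (std::size_t i = 0; i < name.size (); ++i)
    {
      unsigned char c = static_cast<unsigned char> (name[i]);
      if (c == ' ' || c == '_')
	continue;
      if (c == '-' && i > 0 && i + 1 < name.size ()
	  && std::isalnum (static_cast<unsigned char> (name[i - 1]))
	  && std::isalnum (static_cast<unsigned char> (name[i + 1])))
	{
	  last_hyphen = out.size ();
	  continue;
	}
      out.push_back (static_cast<char> (std::toupper (c)));
    }
  // Exception: keep the hyphen of HANGUL JUNGSEONG O-E.
  if (out == "HANGULJUNGSEONGOE" && last_hyphen == out.size () - 1)
    out.insert (last_hyphen, 1, '-');
  return out;
}
```

With this, "zero width space" and "ZERO-WIDTH_SPACE" produce the same
key, while "TIBETAN LETTER -A" keeps its hyphen (it is not medial) and
so stays distinct from "TIBETAN LETTER A".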

2022-08-26  Jakub Jelinek  <jakub@redhat.com>

	PR c++/106648
libcpp/
	* charset.cc: Implement C++23 P2071R2 - Named universal character
	escapes.  Include uname2c.h.
	(hangul_syllables, hangul_count): New variables.
	(struct uname2c_data): New type.
	(_cpp_uname2c, _cpp_uname2c_uax44_lm2): New functions.
	(_cpp_valid_ucn): Use them.  Handle named universal character escapes.
	(convert_ucn): Adjust comment.
	(convert_escape): Call convert_ucn even for \N.
	(_cpp_interpret_identifier): Handle named universal character escapes.
	* lex.cc (get_bidi_ucn): Fix up function comment formatting.
	(get_bidi_named): New function.
	(forms_identifier_p, lex_string): Handle named universal character
	escapes.
	* makeuname2c.cc: New file.  Small parts copied from makeucnid.cc.
	* uname2c.h: New generated file.
gcc/c-family/
	* c-cppbuiltin.cc (c_cpp_builtins): Predefine
	__cpp_named_character_escapes to 202207L.
gcc/testsuite/
	* c-c++-common/cpp/named-universal-char-escape-1.c: New test.
	* c-c++-common/cpp/named-universal-char-escape-2.c: New test.
	* c-c++-common/cpp/named-universal-char-escape-3.c: New test.
	* c-c++-common/cpp/named-universal-char-escape-4.c: New test.
	* c-c++-common/Wbidi-chars-25.c: New test.
	* gcc.dg/cpp/named-universal-char-escape-1.c: New test.
	* gcc.dg/cpp/named-universal-char-escape-2.c: New test.
	* g++.dg/cpp/named-universal-char-escape-1.C: New test.
	* g++.dg/cpp/named-universal-char-escape-2.C: New test.
	* g++.dg/cpp23/feat-cxx2b.C: Test __cpp_named_character_escapes.
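
A small usage sketch, assuming a compiler that may or may not predefine
__cpp_named_character_escapes: user code can test the macro and fall
back to a numeric escape otherwise.  The variable names small_e and
have_named_escapes are hypothetical.

```cpp
// Use a named escape when the compiler advertises P2071R2 support,
// otherwise spell the same character with a numeric escape.
#if defined (__cpp_named_character_escapes) \
    && __cpp_named_character_escapes >= 202207L
constexpr char32_t small_e = U'\N{LATIN SMALL LETTER E}';
constexpr bool have_named_escapes = true;
#else
constexpr char32_t small_e = U'\x65';
constexpr bool have_named_escapes = false;
#endif

// Both spellings denote U+0065.
static_assert (small_e == 0x65, "LATIN SMALL LETTER E is U+0065");
```

The named form sits in a skipped preprocessor group on compilers without
the feature, so it is never interpreted as an escape there.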
2022-08-26 09:27:39 +02:00
include libcpp: Implement C++23 P2290R3 - Delimited escape sequences [PR106645] 2022-08-20 10:26:55 +02:00
po Daily bump. 2022-05-05 00:16:29 +00:00
aclocal.m4 libcpp: Enable Intel CET on Intel CET enabled host for jit 2020-05-12 09:17:45 -07:00
ChangeLog Daily bump. 2022-08-25 00:16:33 +00:00
ChangeLog.jit
charset.cc c++: Implement C++23 P2071R2 - Named universal character escapes [PR106648] 2022-08-26 09:27:39 +02:00
config.in aix: handle 64bit inodes for include directories 2022-01-12 16:59:47 +01:00
configure aix: handle 64bit inodes for include directories 2022-01-12 16:59:47 +01:00
configure.ac aix: handle 64bit inodes for include directories 2022-01-12 16:59:47 +01:00
directives.cc preprocessor: Implement C++23 P2437R1 - Support for #warning [PR106646] 2022-08-24 09:55:57 +02:00
errors.cc Rename .c files to .cc files. 2022-01-17 22:12:04 +01:00
expr.cc libcpp: Ignore CPP_PADDING tokens in _cpp_parse_expr [PR105732] 2022-05-29 21:57:51 +02:00
files.cc Rename .c files to .cc files. 2022-01-17 22:12:04 +01:00
generated_cpp_wcwidth.h libcpp: Update cpp_wcwidth() to Unicode 14.0.0 2022-06-26 14:13:26 -04:00
identifiers.cc Rename .c files to .cc files. 2022-01-17 22:12:04 +01:00
init.cc preprocessor: Implement C++23 P2437R1 - Support for #warning [PR106646] 2022-08-24 09:55:57 +02:00
internal.h preprocessor: -Wbidi-chars and UCNs [PR104030] 2022-01-24 17:48:23 -05:00
lex.cc c++: Implement C++23 P2071R2 - Named universal character escapes [PR106648] 2022-08-26 09:27:39 +02:00
line-map.cc pack fields in line-map data structures 2022-01-18 14:33:01 +01:00
location-example.txt PR preprocessor/83173: Enhance -fdump-internal-locations output 2018-11-27 16:04:31 +00:00
macro.cc libcpp: Fix up padding handling in funlike_invocation_p [PR104147] 2022-02-01 20:48:03 +01:00
Makefile.in preprocessor: Extract messages from cpp_*_at calls for translation 2022-02-11 23:22:07 +00:00
makeucnid.cc Rename .c files to .cc files. 2022-01-17 22:12:04 +01:00
makeuname2c.cc c++: Implement C++23 P2071R2 - Named universal character escapes [PR106648] 2022-08-26 09:27:39 +02:00
mkdeps.cc Rename .c files to .cc files. 2022-01-17 22:12:04 +01:00
pch.cc Rename .c files to .cc files. 2022-01-17 22:12:04 +01:00
symtab.cc Rename .c files to .cc files. 2022-01-17 22:12:04 +01:00
system.h Update copyright years. 2022-01-03 10:42:10 +01:00
traditional.cc Change references of .c files to .cc files 2022-01-17 22:12:07 +01:00
ucnid.h libcpp: Update ucnid.h to Unicode 14 2022-06-28 17:33:37 -04:00
ucnid.tab Update copyright years. 2022-01-03 10:42:10 +01:00
uname2c.h c++: Implement C++23 P2071R2 - Named universal character escapes [PR106648] 2022-08-26 09:27:39 +02:00