gcc/contrib/unicode
Tomasz Kamiński 3b33d792cf libstdc++: Implement debug format for strings and characters formatters [PR109162]
This patch implements the part of P2286R8 that specifies the debug (escaped)
format for strings and character sequences. This includes both
handling of the '?' format specifier and the set_debug_format member.

To indicate partial support we define __glibcxx_format_ranges macro
value 1, without defining __cpp_lib_format_ranges.

We provide two separate escaping routines, selected by the literal
encoding of the corresponding character type. If the character
encoding is Unicode, we follow the specification in the standard
(__format::__write_escaped_unicode).
For other encodings, we escape only characters in the range [0x00, 0x80),
interpreting them as ASCII values: [0x00, 0x20), 0x7f, and '\t', '\r',
'\n', '\\', '"', '\'' are escaped. We assume every character outside
this range is printable (__format::__write_escaped_ascii).
In particular, we do not yet implement special handling of shift
sequences.
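
The non-Unicode rule described above can be sketched in a few lines of
Python. This is only an illustration of the stated ranges, not the libstdc++
code: the names should_escape_ascii and write_escaped_ascii are hypothetical,
the sketch escapes both quote characters exactly as listed (an implementation
may escape only the one matching the enclosing delimiter), and the \u{hex}
fallback is an assumption modelled on the escape form used by the C++ standard.

```python
# Characters that the text above lists as escaped within the ASCII range.
_NAMED_ESCAPES = {
    "\t": "\\t",
    "\r": "\\r",
    "\n": "\\n",
    "\\": "\\\\",
    '"': '\\"',
    "'": "\\'",
}


def should_escape_ascii(code: int) -> bool:
    """Return True if a code unit is escaped under the non-Unicode rule."""
    if code >= 0x80:
        # Values outside [0x00, 0x80) are assumed printable, never escaped.
        return False
    return code < 0x20 or code == 0x7F or chr(code) in _NAMED_ESCAPES


def write_escaped_ascii(units: bytes) -> str:
    """Escape a byte sequence, passing bytes >= 0x80 through unchanged."""
    out = []
    for code in units:
        ch = chr(code)
        if not should_escape_ascii(code):
            out.append(ch)
        elif ch in _NAMED_ESCAPES:
            out.append(_NAMED_ESCAPES[ch])
        else:
            # Assumption: remaining control characters use the \u{hex} form.
            out.append("\\u{%x}" % code)
    return "".join(out)
```

For example, write_escaped_ascii(b"a\tb") turns the tab into the two
characters backslash and t, while a byte such as 0xE9 is copied as is.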

For Unicode escaping, a new __unicode::__escape_edges table is introduced
that encodes whether a character belongs to a General_Category that is
escaped by the standard (Control or Other). This table is generated from
DerivedGeneralCategory.txt provided by Unicode. Only a boolean flag is
preserved, to reduce the number of entries. The additional rules for escaping
are handled by __format::__should_escape_unicode.
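
The idea of keeping only a boolean per range can be illustrated with a small
Python sketch. It is not the code in gen_libstdcxx_unicode_data.py, and the
actual layout of __escape_edges may differ: the names parse_categories,
escape_edges and should_escape are hypothetical, the set of escaped
categories is supplied by the caller, and the input is assumed to cover the
code space without gaps, as the real DerivedGeneralCategory.txt does.

```python
import bisect
import re

# A data line looks like "0000..001F    ; Cc # ..." or "00AD          ; Cf # ...".
_LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def parse_categories(lines):
    """Yield (first, last, category) for each data line, skipping comments."""
    for line in lines:
        m = _LINE.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2), 16) if m.group(2) else first
            yield first, last, m.group(3)


def escape_edges(lines, escaped_categories):
    """Return [(code_point, flag), ...], one entry where the flag changes."""
    # The file is grouped by category, so sort the ranges by code point first.
    ranges = sorted(
        (first, last, cat in escaped_categories)
        for first, last, cat in parse_categories(lines)
    )
    edges = []
    for first, _last, flag in ranges:
        # Adjacent ranges with the same flag collapse into a single edge.
        if not edges or edges[-1][1] != flag:
            edges.append((first, flag))
    return edges


def should_escape(edges, code_point):
    """Look up the flag of the last edge at or before code_point."""
    starts = [start for start, _flag in edges]
    i = bisect.bisect_right(starts, code_point) - 1
    return edges[i][1] if i >= 0 else False
```

Because neighbouring categories that share the same flag merge into one
entry, the table needs far fewer entries than one per category range.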

When width or precision is specified, we emit the escaped string to a temporary
buffer and format the resulting string according to the format spec.
For characters, we use a fixed-size stack buffer, for which a new _Fixedbuf_sink
is introduced. For strings, we use _Str_sink and, to avoid allocations,
we compute the estimated size of the (possibly truncated) input; if it is
larger than the width field, we print directly.

	PR libstdc++/109162

contrib/ChangeLog:

	* unicode/README: Mention DerivedGeneralCategory.txt.
	* unicode/gen_libstdcxx_unicode_data.py: Generate __escape_edges
	table from DerivedGeneralCategory.txt. Update file name in comments.
	* unicode/DerivedGeneralCategory.txt: Copy of file distributed by
	Unicode Consortium.

libstdc++-v3/ChangeLog:

	* include/bits/chrono_io.h (__detail::_Widen): Moved to std/format file.
	* include/bits/unicode-data.h: Regenerate.
	* include/bits/unicode.h (__unicode::_Utf_iterator::_M_units)
	(__unicode::__should_escape_category): Define.
	* include/std/format (_GLIBCXX_WIDEN_, _GLIBCXX_WIDEN): Copied from
	include/bits/chrono_io.h.
	(__format::_Widen): Moved from include/bits/chrono_io.h.
	(__format::_Term_char, __format::_Escapes, __format::_Separators)
	(__format::__should_escape_ascii, __format::__should_escape_unicode)
	(__format::__write_escape_seq, __format::__write_escaped_char)
	(__format::__write_escaped_ascii, __format::__write_escaped_unicode)
	(__format::__write_escaped): Define.
	(__formatter_str::_S_trunc): Extracted truncation of character
	sequences.
	(__formatter_str::format): Handle _Pres_esc.
	(__formatter_int::_M_do_parse) [__glibcxx_format_ranges]: Parse '?'.
	(__formatter_int::_M_format_character_escaped): Define.
	(formatter<_CharT, _CharT>::format, formatter<char, wchar_t>::format):
	Handle _Pres_esc.
	(__formatter_str::set_debug_format, formatter<...>::set_debug_format):
	Guard with __glibcxx_format_ranges.
	(__format::_Fixedbuf_sink): Define.
	* testsuite/23_containers/vector/bool/format.cc: Use __format::_Widen
	and remove unnecessary <chrono> include.
	* testsuite/std/format/debug.cc: New test.
	* testsuite/std/format/debug_nonunicode.cc: New test.
	* testsuite/std/format/parse_ctx.cc (escaped_strings_supported): Define
	to true if __glibcxx_format_ranges is defined.
	* testsuite/std/format/string.cc (escaped_strings_supported): Define to
	true if __glibcxx_format_ranges is defined.

Reviewed-by: Jonathan Wakely <jwakely@redhat.com>
Signed-off-by: Tomasz Kamiński <tkaminsk@redhat.com>
2025-04-11 08:43:50 +02:00
from_glibc Update copyright years. 2025-01-02 11:59:57 +01:00
DerivedCoreProperties.txt contrib, libcpp, libstdc++: Update to Unicode 16.0 2024-10-08 10:01:47 +02:00
DerivedGeneralCategory.txt libstdc++: Implement debug format for strings and characters formatters [PR109162] 2025-04-11 08:43:50 +02:00
DerivedNormalizationProps.txt contrib, libcpp, libstdc++: Update to Unicode 16.0 2024-10-08 10:01:47 +02:00
EastAsianWidth.txt contrib, libcpp, libstdc++: Update to Unicode 16.0 2024-10-08 10:01:47 +02:00
emoji-data.txt contrib, libcpp, libstdc++: Update to Unicode 16.0 2024-10-08 10:01:47 +02:00
gen-box-drawing-chars.py contrib: Remove C-style comments from Python files 2024-01-05 13:57:05 +00:00
gen-combining-chars.py contrib: Remove C-style comments from Python files 2024-01-05 13:57:05 +00:00
gen-printable-chars.py contrib: Remove C-style comments from Python files 2024-01-05 13:57:05 +00:00
gen_libstdcxx_unicode_data.py libstdc++: Implement debug format for strings and characters formatters [PR109162] 2025-04-11 08:43:50 +02:00
gen_wcwidth.py contrib: Remove C-style comments from Python files 2024-01-05 13:57:05 +00:00
GraphemeBreakProperty.txt contrib, libcpp, libstdc++: Update to Unicode 16.0 2024-10-08 10:01:47 +02:00
NameAliases.txt contrib, libcpp, libstdc++: Update to Unicode 16.0 2024-10-08 10:01:47 +02:00
PropList.txt contrib, libcpp, libstdc++: Update to Unicode 16.0 2024-10-08 10:01:47 +02:00
README libstdc++: Implement debug format for strings and characters formatters [PR109162] 2025-04-11 08:43:50 +02:00
unicode-license.txt
UnicodeData.txt contrib, libcpp, libstdc++: Update to Unicode 16.0 2024-10-08 10:01:47 +02:00
utf8-dump.py

This directory contains a mechanism for GCC to have its own internal
implementation of wcwidth functionality (cpp_wcwidth () in libcpp/charset.cc),
as well as a mechanism to update the information about codepoints permitted in
identifiers, which is encoded in libcpp/ucnid.h, and the mapping between Unicode
names and codepoints, which is encoded in libcpp/uname2c.h.

The idea is to produce the necessary lookup tables
(../../libcpp/{ucnid.h,uname2c.h,generated_cpp_wcwidth.h}) in a reproducible
way, starting from the following files that are distributed by the Unicode
Consortium:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt
ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt
ftp://ftp.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
ftp://ftp.unicode.org/Public/UNIDATA/NameAliases.txt

Three additional files are needed for lookup tables in libstdc++:

ftp://ftp.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
ftp://ftp.unicode.org/Public/UNIDATA/emoji/emoji-data.txt
ftp://ftp.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt

All these files have been added to source control in this directory;
please see unicode-license.txt for the relevant copyright information.
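
The nine files listed above could be refreshed with a small helper such as
the following. This is a minimal sketch, not a script that exists in this
directory: download_plan and download_all are hypothetical names, the base
URL is simply the one used in the lists above, and download_all needs
network access.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE = "ftp://ftp.unicode.org/Public/UNIDATA/"

# Paths relative to BASE, taken from the two lists above.
LIBCPP_FILES = [
    "UnicodeData.txt",
    "EastAsianWidth.txt",
    "PropList.txt",
    "DerivedNormalizationProps.txt",
    "DerivedCoreProperties.txt",
    "NameAliases.txt",
]
LIBSTDCXX_FILES = [
    "auxiliary/GraphemeBreakProperty.txt",
    "emoji/emoji-data.txt",
    "extracted/DerivedGeneralCategory.txt",
]


def download_plan(base=BASE, dest="."):
    """Map each source URL to the flat local file name kept in this directory."""
    return {
        base + rel: Path(dest) / Path(rel).name
        for rel in LIBCPP_FILES + LIBSTDCXX_FILES
    }


def download_all(base=BASE, dest="."):
    """Fetch every file in the plan, overwriting the local copies."""
    for url, target in download_plan(base, dest).items():
        urlretrieve(url, str(target))
```

Calling download_plan() first shows which URL ends up in which local file
before anything is overwritten.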

In order to keep in sync with glibc's wcwidth as much as possible, it is
desirable for the logic that processes the Unicode data to be the same as
glibc's.  To that end, the from_glibc/ subdirectory of this directory holds
the glibc Python code that implements that logic.  This code was
copied verbatim from glibc, and it can be updated at any time from the glibc
source code repository.  The files copied from that repository are:

localedata/unicode-gen/unicode_utils.py
localedata/unicode-gen/utf8_gen.py

The most recent versions added to GCC are from glibc git commit:
064c708c78cc2a6b5802dce73108fc0c1c6bfc80

The script gen_wcwidth.py found here contains the GCC-specific code to
map glibc's output to the lookup tables we require.  This script should not need
to change, unless there are structural changes to the Unicode data files or to
the glibc code.  Similarly, makeucnid.cc in ../../libcpp contains the logic to
produce ucnid.h.

The procedure to update GCC's Unicode support is the following:

1.  Update the six Unicode data files from the above URLs.

2.  Update the two glibc files in from_glibc/ from glibc's git.  Update
    the commit number above in this README.

3.  Run ./gen_wcwidth.py X.Y > ../../libcpp/generated_cpp_wcwidth.h
    (where X.Y is the version of the Unicode standard corresponding to the
    Unicode data files being used, most recently, 16.0.0).

4.  Update the Unicode copyright years in libcpp/makeucnid.cc and in
    libcpp/makeuname2c.cc up to the year in which the Unicode
    standard was released.

5.  Compile makeucnid, e.g. with:
      g++ -O2 ../../libcpp/makeucnid.cc -o ../../libcpp/makeucnid

6.  Generate ucnid.h as follows:
      ../../libcpp/makeucnid ../../libcpp/ucnid.tab UnicodeData.txt \
	DerivedNormalizationProps.txt DerivedCoreProperties.txt \
	> ../../libcpp/ucnid.h

7.  Read the corresponding version of the Unicode standard and update the
    generated_ranges table in libcpp/makeuname2c.cc accordingly (in
    Unicode 16, all the needed information was in Table 4-8).

8.  Compile makeuname2c, e.g. with:
      g++ -O2 ../../libcpp/makeuname2c.cc -o ../../libcpp/makeuname2c

9.  Generate uname2c.h as follows:
      ../../libcpp/makeuname2c UnicodeData.txt NameAliases.txt \
	> ../../libcpp/uname2c.h
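
The mechanical steps above (3, 5, 6, 8 and 9) could be driven from one
script, sketched here in Python. No such driver exists in this directory:
regeneration_steps and run_steps are hypothetical names, and the commands are
copied from the steps above and assumed to run from contrib/unicode. Steps 1,
2, 4 and 7 still have to be done by hand.

```python
import subprocess

LIBCPP = "../../libcpp"


def regeneration_steps(unicode_version):
    """Return (argv, output_file_or_None) pairs for steps 3, 5, 6, 8 and 9."""
    return [
        # Step 3: regenerate the wcwidth table.
        (["./gen_wcwidth.py", unicode_version],
         f"{LIBCPP}/generated_cpp_wcwidth.h"),
        # Step 5: compile makeucnid.
        (["g++", "-O2", f"{LIBCPP}/makeucnid.cc", "-o", f"{LIBCPP}/makeucnid"],
         None),
        # Step 6: generate ucnid.h.
        ([f"{LIBCPP}/makeucnid", f"{LIBCPP}/ucnid.tab", "UnicodeData.txt",
          "DerivedNormalizationProps.txt", "DerivedCoreProperties.txt"],
         f"{LIBCPP}/ucnid.h"),
        # Step 8: compile makeuname2c.
        (["g++", "-O2", f"{LIBCPP}/makeuname2c.cc", "-o", f"{LIBCPP}/makeuname2c"],
         None),
        # Step 9: generate uname2c.h.
        ([f"{LIBCPP}/makeuname2c", "UnicodeData.txt", "NameAliases.txt"],
         f"{LIBCPP}/uname2c.h"),
    ]


def run_steps(unicode_version):
    """Run each command in order, redirecting stdout where a file is named."""
    for argv, output in regeneration_steps(unicode_version):
        if output is None:
            subprocess.run(argv, check=True)
        else:
            with open(output, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

As with a shell redirection, the output file is truncated before the command
runs, so a failing step leaves an empty or partial header behind; check=True
stops the sequence at that point.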

See gen_libstdcxx_unicode_data.py for instructions on updating the lookup
tables in libstdc++.