procyberian/gcc - Masscollabs Services: Beyond Sharing , Liberating The Software World

procyberian/gcc

Commit graph

Author

SHA1

Message

Date

Jakub Jelinek

d0e8f58b81

contrib, libcpp, libstdc++: Update to Unicode 16.0

It is autumn again and there is a new Unicode version 16.0.

The following patch updates our Unicode stuff in contrib, libcpp and
libstdc++ from that Unicode version.

2024-10-08  Jakub Jelinek  <jakub@redhat.com>

contrib/
	* unicode/README: Update glibc git commit hash, replace
	Unicode 15 or 15.1 versions with 16.
	* unicode/gen_libstdcxx_unicode_data.py: Use 160000 instead of
	150100 in _GLIBCXX_GET_UNICODE_DATA test.
	* unicode/from_glibc/utf8_gen.py: Updated from glibc
	064c708c78cc2a6b5802dce73108fc0c1c6bfc80 commit.
	* unicode/DerivedCoreProperties.txt: Updated from Unicode 16.0.
	* unicode/emoji-data.txt: Likewise.
	* unicode/PropList.txt: Likewise.
	* unicode/GraphemeBreakProperty.txt: Likewise.
	* unicode/DerivedNormalizationProps.txt: Likewise.
	* unicode/NameAliases.txt: Likewise.
	* unicode/UnicodeData.txt: Likewise.
	* unicode/EastAsianWidth.txt: Likewise.
gcc/testsuite/
	* c-c++-common/cpp/named-universal-char-escape-1.c: Add tests
	for some Unicode 16.0 characters, both normal and generated.
libcpp/
	* makeucnid.cc (write_copyright): Update Unicode Copyright years.
	* makeuname2c.cc (generated_ranges): Adjust Unicode version from 15.1
	to 16.0.  Add EGYPTIAN HIEROGLYPH- generated range, adjust indexes in
	following entries.
	(write_copyright): Update Unicode Copyright years.
	* generated_cpp_wcwidth.h: Regenerated.
	* ucnid.h: Regenerated.
	* uname2c.h: Regenerated.
libstdc++-v3/
	* include/bits/unicode.h (std::__unicode::__v15_1_0): Rename inline
	namespace to ...
	(std::__unicode::__v16_0_0): ... this.
	(_GLIBCXX_GET_UNICODE_DATA): Change from 150100 to 160000.
	* include/bits/unicode-data.h: Regenerated.
	* testsuite/ext/unicode/properties.cc: Check for _Gcb_SpacingMark
	on U+11F03 rather than U+1D16D as the latter lost SpacingMark property
	in Unicode 16.0.
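
The two numeric values quoted above (150100 for Unicode 15.1.0, 160000 for Unicode 16.0.0) and the two inline namespace names are consistent with a simple positional encoding of the version number. The sketch below is only an inference from those values; the function names are hypothetical and do not come from the GCC sources.

```python
def unicode_version_number(major, minor, patch=0):
    """Encode a Unicode version as a single integer.

    Assumption: major * 10000 + minor * 100 + patch, which reproduces the
    two values mentioned in the ChangeLog (150100 and 160000).
    """
    return major * 10000 + minor * 100 + patch


def inline_namespace_name(major, minor, patch=0):
    """Hypothetical helper mirroring the __vMAJOR_MINOR_PATCH naming pattern."""
    return f"__v{major}_{minor}_{patch}"


if __name__ == "__main__":
    for version in [(15, 1, 0), (16, 0, 0)]:
        print(version, unicode_version_number(*version),
              inline_namespace_name(*version))
```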

2024-10-08 10:01:47 +02:00

Jonathan Wakely

37a4c5c23a

libstdc++: Add Unicode-aware width estimation for std::format

This implements the requirements in the following proposals, which
dictate how std::format deals with non-ASCII strings:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf

There are two parts to this. The width estimation for strings must only
count the width of the first character in an extended grapheme cluster.
That requires implementing the algorithm for detecting cluster breaks,
which requires a number of lookup tables of the grapheme cluster break
properties (and Indic_Conjunct_Break and Extended_Pictographic
properties) of every code point. Additionally, some characters have a
field width of 2, which requires another lookup table of field widths
for every code point.  The tables added in this commit do not contain
entries for every code point from 0 to 0x10FFFF as that would be very
inefficient and use too much memory. Instead the tables only contain the
code points that form an "edge" for a property, omitting all the code
points that have the same property as the preceding one. We can use a
binary search to find the closest code point in the table that is not
greater than the one we're looking for.
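
The edge-table idea described above can be illustrated in a few lines of Python. This is a minimal sketch, not the libstdc++ implementation: the table contents are toy values (not real Unicode data), and the names EDGES, STARTS and lookup are made up for the example.

```python
from bisect import bisect_right

# Toy edge table (NOT real Unicode data): each entry is
# (first code point of a run, property value for that run).
# A run extends up to, but not including, the next entry's code point.
EDGES = [
    (0x00, "A"),
    (0x41, "B"),   # 'A'..'Z'
    (0x5B, "A"),
    (0x61, "C"),   # 'a'..'z'
    (0x7B, "A"),   # everything from '{' up to U+10FFFF
]
STARTS = [start for start, _ in EDGES]


def lookup(code_point):
    """Return the property of code_point via binary search over the edges."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Index of the last edge whose start is not greater than code_point.
    i = bisect_right(STARTS, code_point) - 1
    return EDGES[i][1]


if __name__ == "__main__":
    for ch in "@AZ[az{":
        print(repr(ch), lookup(ord(ch)))
```

Five entries describe all 0x110000 code points here, and each lookup costs O(log n) in the number of edges rather than requiring one table slot per code point.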

The tables are generated by a new Python script added to the
contrib/unicode directory, and a new data file downloaded from the
Unicode Consortium website.
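
As a rough idea of what such a generator has to do, the sketch below turns range lines in the Unicode Character Database text format ("XXXX..YYYY ; Value # comment") into an edge table. It is an illustration under assumptions, not the contents of gen_libstdcxx_unicode_data.py: the function names, the sample lines and the "Other" default are chosen for the example, and input ranges are assumed not to overlap.

```python
def parse_ranges(lines):
    """Yield (first, last, value) tuples from UCD-style property lines."""
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        first, _, last = fields[0].partition("..")
        yield int(first, 16), int(last or first, 16), fields[1]


def build_edges(ranges, default):
    """Compress non-overlapping ranges into (start, value) edges."""
    edges = []

    def push(code_point, value):
        # Only record a new edge when the property actually changes.
        if not edges or edges[-1][1] != value:
            edges.append((code_point, value))

    next_cp = 0
    for first, last, value in sorted(ranges):
        if first > next_cp:
            push(next_cp, default)   # gap before this range
        push(first, value)
        next_cp = last + 1
    if next_cp <= 0x10FFFF:
        push(next_cp, default)       # tail after the last range
    return edges


# Illustrative lines in the style of GraphemeBreakProperty.txt.
SAMPLE = """\
# sample data
000D          ; CR # Cc       <control-000D>
000A          ; LF # Cc       <control-000A>
0300..036F    ; Extend # Mn [112] COMBINING GRAVE ACCENT..
""".splitlines()

if __name__ == "__main__":
    for start, value in build_edges(parse_ranges(SAMPLE), "Other"):
        print(f"U+{start:04X} {value}")
```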

The rules for extended grapheme cluster breaking are implemented for the
latest Unicode standard, version 15.1.0.
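
To make the "first character of each cluster" rule concrete, here is a deliberately crude Python sketch. It does not implement the extended grapheme cluster rules the commit refers to: cluster breaking is approximated by attaching combining marks and ZWJ-joined code points to the preceding cluster, and the width of a code point is approximated with East_Asian_Width from Python's unicodedata module rather than the tables libstdc++ generates. Both simplifications and all function names are assumptions made for illustration.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def clusters(text):
    """Very rough stand-in for extended grapheme cluster breaking."""
    out = []
    for ch in text:
        extends = bool(out) and (
            unicodedata.category(ch).startswith("M")  # combining marks
            or ch == ZWJ                              # joiner itself
            or out[-1].endswith(ZWJ)                  # code point after a joiner
        )
        if extends:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def estimated_width(text):
    """Sum the width of only the first code point of each cluster."""
    total = 0
    for cluster in clusters(text):
        wide = unicodedata.east_asian_width(cluster[0]) in ("W", "F")
        total += 2 if wide else 1
    return total


if __name__ == "__main__":
    for sample in ["abc", "e\u0301", "\u65e5\u672c\u8a9e"]:
        print(repr(sample), estimated_width(sample))
```

The point of the sketch is the shape of the computation: a combining accent or a joined emoji sequence contributes nothing beyond the width of the code point that starts its cluster.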

libstdc++-v3/ChangeLog:

	* include/Makefile.am: Add new headers.
	* include/Makefile.in: Regenerate.
	* include/bits/unicode.h: New file.
	* include/bits/unicode-data.h: New file.
	* include/std/format: Include <bits/unicode.h>.
	(__literal_encoding_is_utf8): Move to <bits/unicode.h>.
	(_Spec::_M_fill): Change type to char32_t.
	(_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
	instead of a single character.
	(__write_padded): Change __fill_char parameter to char32_t and
	encode it into the output.
	(__formatter_str::format): Use new __unicode::__field_width and
	__unicode::__truncate functions.
	* include/std/ostream: Adjust namespace qualification for
	__literal_encoding_is_utf8.
	* include/std/print: Likewise.
	* src/c++23/print.cc: Add [[unlikely]] attribute to error path.
	* testsuite/ext/unicode/view.cc: New test.
	* testsuite/std/format/functions/format.cc: Add missing examples
	from the standard demonstrating alignment with non-ASCII
	characters. Add examples checking correct handling of extended
	grapheme clusters.
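
The fill-character change listed above can be pictured with the following Python sketch: the fill is a single Unicode scalar value that is encoded into the output once per padding column, so a non-ASCII fill produces more bytes than columns. The function name, its parameters and the assumption that the fill occupies exactly one column are illustrative choices, not the signature of the libstdc++ __write_padded helper.

```python
def write_padded(text, text_width, field_width, fill=" ", align="<"):
    """Pad text to field_width columns and return the UTF-8 encoded result.

    text_width is the already-estimated display width of text; the fill is
    assumed to occupy one column per repetition.
    """
    if len(fill) != 1:
        raise ValueError("fill must be a single Unicode scalar value")
    padding = max(field_width - text_width, 0)
    if align == "<":
        left, right = 0, padding
    elif align == ">":
        left, right = padding, 0
    elif align == "^":
        left = padding // 2          # any odd column goes to the right
        right = padding - left
    else:
        raise ValueError("align must be '<', '>' or '^'")
    return (fill * left + text + fill * right).encode("utf-8")


if __name__ == "__main__":
    out = write_padded("x", 1, 4, fill="\u00e9", align=">")
    print(out, len(out))  # three 2-byte fill characters plus one ASCII byte
```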

contrib/ChangeLog:

	* unicode/README: Add notes about generating libstdc++ tables.
	* unicode/GraphemeBreakProperty.txt: New file.
	* unicode/emoji-data.txt: New file.
	* unicode/gen_libstdcxx_unicode_data.py: New file.

2024-01-08 01:14:50 +00:00

2 commits