Commit graph

20 commits

Author SHA1 Message Date
Tomasz Kamiński
3b33d792cf libstdc++: Implement debug format for strings and characters formatters [PR109162]
This patch implements the part of P2286R8 that specifies the debug (escaped)
format for strings and character sequences. This includes both
handling of the '?' format specifier and the set_debug_format member.

To indicate partial support we define the __glibcxx_format_ranges macro
with value 1, without defining __cpp_lib_format_ranges.

We provide two separate escaping routines, selected by the literal
encoding of the corresponding character type. If the character
encoding is Unicode, we follow the specification in the standard
(__format::__write_escaped_unicode).
For other encodings, we escape only characters in the range [0x00, 0x80),
interpreting them as ASCII values: [0x00, 0x20), 0x7f, and '\t', '\r',
'\n', '\\', '"', '\'' are escaped. We assume every character outside
this range is printable (__format::__write_escaped_ascii).
In particular, we do not yet implement special handling of shift
sequences.
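
As an illustration only, the non-Unicode rule above can be sketched in
Python. The names should_escape_ascii and escape_ascii are hypothetical and
are not the libstdc++ functions. The sketch assumes two details that come
from the P2286 debug format rather than from this message: only the quote
character that delimits the output is escaped, and other escaped values are
written as \u{hex}.

```python
# Hypothetical sketch of the non-Unicode escaping rule; not the libstdc++ code.
SIMPLE_ESCAPES = {
    0x09: "\\t",
    0x0A: "\\n",
    0x0D: "\\r",
    0x5C: "\\\\",
    0x22: '\\"',
    0x27: "\\'",
}


def should_escape_ascii(unit: int, quote: str) -> bool:
    """Decide whether one code unit is escaped under the ASCII rule."""
    if unit >= 0x80:
        return False  # outside [0x00, 0x80): assumed printable
    if unit < 0x20 or unit == 0x7F:
        return True  # control characters and DEL
    if unit == 0x5C:
        return True  # backslash
    # Assumption: only the quote that delimits the output is escaped.
    return unit in (0x22, 0x27) and chr(unit) == quote


def escape_ascii(data: bytes, quote: str = '"') -> str:
    """Return the quoted, escaped form of a byte string."""
    out = [quote]
    for unit in data:
        if not should_escape_ascii(unit, quote):
            out.append(chr(unit))
        elif unit in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[unit])
        else:
            out.append("\\u{%x}" % unit)
    out.append(quote)
    return "".join(out)
```

With these definitions, escape_ascii(b'a\tb"c\x01') yields the text
"a\tb\"c\u{1}" including the surrounding double quotes.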

For Unicode escaping, a new __unicode::__escape_edges table is introduced.
It records whether a character belongs to a General_Category that is
escaped by the standard (Control or Other). This table is generated from
DerivedGeneralCategory.txt provided by Unicode. Only a boolean flag is
preserved, to reduce the number of entries. The additional rules for escaping
are handled by __format::__should_escape_unicode.
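
A minimal Python sketch of why keeping only a boolean flag shrinks the table:
adjacent ranges whose categories differ but whose flag is the same collapse
into a single edge. The function name boolean_edges, the placeholder category
labels and the toy ranges are assumptions for illustration; this is not the
generator script's code.

```python
# Hypothetical sketch: collapse (first, last, label) ranges into boolean edges.
def boolean_edges(ranges, escaped):
    """Return [(code_point, flag)] entries, one per change of the flag.

    ranges  -- sorted, non-overlapping (first, last, label) tuples
    escaped -- set of labels that count as "escaped"
    Code points not covered by any range are treated as not escaped.
    """
    edges = []
    flag = False  # state before the first code point
    next_cp = 0   # first code point not yet covered
    for first, last, label in ranges:
        if first > next_cp and flag:
            # A gap follows an escaped range: the flag drops back to False.
            edges.append((next_cp, False))
            flag = False
        want = label in escaped
        if want != flag:
            edges.append((first, want))
            flag = want
        next_cp = last + 1
    if flag:
        edges.append((next_cp, False))
    return edges


# Toy input: "A" and "B" are both escaped, so their ranges share one edge.
toy = [(0, 9, "A"), (10, 19, "B"), (20, 29, "C"), (40, 49, "A")]
toy_edges = boolean_edges(toy, {"A", "B"})
```

Here the ranges labelled "A" and "B" need only one entry between them because
the flag does not change at code point 10.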

When width or precision is specified, we emit the escaped string to a temporary
buffer and format the resulting string according to the format spec.
For characters we use a fixed-size stack buffer, for which a new _Fixedbuf_sink
is introduced. For strings we use _Str_sink and, to avoid allocations,
we compute the estimated size of the (possibly truncated) input, and if it is
larger than the width field we print directly.

	PR libstdc++/109162

contrib/ChangeLog:

	* unicode/README: Mention DerivedGeneralCategory.txt.
	* unicode/gen_libstdcxx_unicode_data.py: Generate __escape_edges
	table from DerivedGeneralCategory.txt. Update file name in comments.
	* unicode/DerivedGeneralCategory.txt: Copy of file distributed by
	Unicode Consortium.

libstdc++-v3/ChangeLog:

	* include/bits/chrono_io.h (__detail::_Widen): Moved to std/format file.
	* include/bits/unicode-data.h: Regenerate.
	* include/bits/unicode.h (__unicode::_Utf_iterator::_M_units)
	(__unicode::__should_escape_category): Define.
	* include/std/format (_GLIBCXX_WIDEN_, _GLIBCXX_WIDEN): Copied from
	include/bits/chrono_io.h.
	(__format::_Widen): Moved from include/bits/chrono_io.h.
	(__format::_Term_char, __format::_Escapes, __format::_Separators)
	(__format::__should_escape_ascii, __format::__should_escape_unicode)
	(__format::__write_escape_seq, __format::__write_escaped_char)
	(__format::__write_escaped_ascii, __format::__write_escaped_unicode)
	(__format::__write_escaped): Define.
	(__formatter_str::_S_trunc): Extracted truncation of character
	sequences.
	(__formatter_str::format): Handle _Pres_esc.
	(__formatter_int::_M_do_parse) [__glibcxx_format_ranges]: Parse '?'.
	(__formatter_int::_M_format_character_escaped): Define.
	(formatter<_CharT, _CharT>::format, formatter<char, wchar_t>::format):
	Handle _Pres_esc.
	(__formatter_str::set_debug_format, formatter<...>::set_debug_format):
	Guard with __glibcxx_format_ranges.
	(__format::_Fixedbuf_sink): Define.
	* testsuite/23_containers/vector/bool/format.cc: Use __format::_Widen
	and remove unnecessary <chrono> include.
	* testsuite/std/format/debug.cc: New test.
	* testsuite/std/format/debug_nonunicode.cc: New test.
	* testsuite/std/format/parse_ctx.cc (escaped_strings_supported): Define
	to true if __glibcxx_format_ranges is defined.
	* testsuite/std/format/string.cc (escaped_strings_supported): Define to
	true if __glibcxx_format_ranges is defined.

Reviewed-by: Jonathan Wakely <jwakely@redhat.com>
Signed-off-by: Tomasz Kamiński <tkaminsk@redhat.com>
2025-04-11 08:43:50 +02:00
Jakub Jelinek
6441eb6dc0 Update copyright years. 2025-01-02 11:59:57 +01:00
Jakub Jelinek
d0e8f58b81 contrib, libcpp, libstdc++: Update to Unicode 16.0
It is autumn again and there is a new Unicode version 16.0.

The following patch updates our Unicode stuff in contrib, libcpp and
libstdc++ from that Unicode version.

2024-10-08  Jakub Jelinek  <jakub@redhat.com>

contrib/
	* unicode/README: Update glibc git commit hash, replace
	Unicode 15 or 15.1 versions with 16.
	* unicode/gen_libstdcxx_unicode_data.py: Use 160000 instead of
	150100 in _GLIBCXX_GET_UNICODE_DATA test.
	* unicode/from_glibc/utf8_gen.py: Updated from glibc
	064c708c78cc2a6b5802dce73108fc0c1c6bfc80 commit.
	* unicode/DerivedCoreProperties.txt: Updated from Unicode 16.0.
	* unicode/emoji-data.txt: Likewise.
	* unicode/PropList.txt: Likewise.
	* unicode/GraphemeBreakProperty.txt: Likewise.
	* unicode/DerivedNormalizationProps.txt: Likewise.
	* unicode/NameAliases.txt: Likewise.
	* unicode/UnicodeData.txt: Likewise.
	* unicode/EastAsianWidth.txt: Likewise.
gcc/testsuite/
	* c-c++-common/cpp/named-universal-char-escape-1.c: Add tests
	for some Unicode 16.0 characters, both normal and generated.
libcpp/
	* makeucnid.cc (write_copyright): Update Unicode Copyright years.
	* makeuname2c.cc (generated_ranges): Adjust Unicode version from 15.1
	to 16.0.  Add EGYPTIAN HIEROGLYPH- generated range, adjust indexes in
	following entries.
	(write_copyright): Update Unicode Copyright years.
	* generated_cpp_wcwidth.h: Regenerated.
	* ucnid.h: Regenerated.
	* uname2c.h: Regenerated.
libstdc++-v3/
	* include/bits/unicode.h (std::__unicode::__v15_1_0): Rename inline
	namespace to ...
	(std::__unicode::__v16_0_0): ... this.
	(_GLIBCXX_GET_UNICODE_DATA): Change from 150100 to 160000.
	* include/bits/unicode-data.h: Regenerated.
	* testsuite/ext/unicode/properties.cc: Check for _Gcb_SpacingMark
	on U+11F03 rather than U+1D16D as the latter lost SpacingMark property
	in Unicode 16.0.
2024-10-08 10:01:47 +02:00
Jonathan Wakely
ef2efc53fd libstdc++: Fix Python scripts to output the correct filename
These scripts both print a "generated by $file, do not edit" header, but
one of them prints the wrong filename. Use the built-in __file__
attribute to ensure it's correct.
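
A small sketch of the idea, assuming a hypothetical helper named
generated_header and an assumed header wording; the real scripts may build
the line differently. Deriving the name from the script's own path means the
header cannot drift from the file that produced it.

```python
import os


# Hypothetical helper: build the "do not edit" line from a script path.
def generated_header(script_path: str) -> str:
    """Return a header naming the script that generated the file."""
    name = os.path.basename(script_path)
    return f"// Generated by {name}, do not edit."


# Inside a generator script the call would be:
#     print(generated_header(__file__))
```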

contrib/ChangeLog:

	* unicode/gen_libstdcxx_unicode_data.py: Fix header of generated
	file to name the correct script.

libstdc++-v3/ChangeLog:

	* include/bits/text_encoding-data.h: Regenerate.
	* include/bits/unicode-data.h: Regenerate.
	* scripts/gen_text_encoding_data.py: Fix header of generated
	file to name the correct script.
2024-03-19 15:20:07 +00:00
Jonathan Wakely
e99d9607f0 libstdc++: Add copyright and license text to new generated headers
contrib/ChangeLog:

	* unicode/gen_libstdcxx_unicode_data.py: Add copyright and
	license text to the output.

libstdc++-v3/ChangeLog:

	* include/bits/text_encoding-data.h: Regenerate.
	* include/bits/unicode-data.h: Regenerate.
	* scripts/gen_text_encoding_data.py: Add copyright and license
	text to the output.
2024-02-04 21:40:23 +00:00
Jonathan Wakely
ea314ccd62 libstdc++: Fix Unicode property detection functions
Fix some copy & pasted logic in __is_extended_pictographic. This
function should yield false for the values before the first edge, not
true. Also add a missing boundary condition check in __incb_property.

Also fix an off-by-one error in _Utf_iterator::operator++() that would
make dereferencing a past-the-end iterator undefined (where the intended
design is that the iterator is always incrementable and dereferenceable,
for better memory safety).

Also simplify the grapheme view iterator, which still contained some
remnants of an earlier design I was experimenting with.

Slightly tweak the gen_libstdcxx_unicode_data.py script so that the
_Gcb_property enumerators are in the order we encounter them in the data
file, instead of sorting them alphabetically. Start with the "Other"
property at value 0, because that's the default property for anything
not in the file. This makes no practical difference, but seems cleaner.
It causes the values in the __gcb_edges table to change, so can only be
done now before anybody is using this code yet. The enumerator values
and table entries become ABI artefacts for the function using them.
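
The numbering scheme described above can be sketched as follows. The function
name assign_enumerators and the sample property names are illustrative
assumptions, not taken from the script.

```python
# Hypothetical sketch: number properties in the order they are first seen.
def assign_enumerators(property_names):
    """Map each property name to an integer value.

    "Other" is fixed at 0 because it is the default for any code point
    that the data file does not list.
    """
    values = {"Other": 0}
    for name in property_names:
        if name not in values:
            values[name] = len(values)
    return values


# Properties in the order a data file might mention them (toy input).
seen = ["Prepend", "CR", "LF", "Control", "CR", "Extend"]
enumerators = assign_enumerators(seen)
```

A repeated name such as "CR" keeps the value it received on first sight, so
the result depends only on the order of first appearance.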

contrib/ChangeLog:

	* unicode/gen_libstdcxx_unicode_data.py: Print out Gcb_property
	enumerators in the order they're seen, not alphabetical order.

libstdc++-v3/ChangeLog:

	* include/bits/unicode-data.h: Regenerate.
	* include/bits/unicode.h (_Utf_iterator::operator++()): Fix off
	by one error.
	(__incb_property): Add missing check for values before the
	first edge.
	(__is_extended_pictographic): Invert return values to fix
	copy&pasted logic.
	(_Grapheme_cluster_view::_Iterator): Remove second iterator
	member and find end of cluster lazily.
	* testsuite/ext/unicode/grapheme_view.cc: New test.
	* testsuite/ext/unicode/properties.cc: New test.
	* testsuite/ext/unicode/view.cc: New test.
2024-01-09 23:42:59 +00:00
Jonathan Wakely
37a4c5c23a libstdc++: Add Unicode-aware width estimation for std::format
This implements the requirements in the following proposals, which
dictate how std::format deals with non-ASCII strings:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf

There are two parts to this. The width estimation for strings must only
count the width of the first character in an extended grapheme cluster.
That requires implementing the algorithm for detecting cluster breaks,
which requires a number of lookup tables of the grapheme cluster break
properties (and Indic_Conjunct_Break and Extended_Pictographic
properties) of every code point. Additionally, some characters have a
field width of 2, which requires another lookup table of field widths
for every code point.  The tables added in this commit do not contain
entries for every code point from 0 to 0x10FFFF as that would be very
inefficient and use too much memory. Instead the tables only contain the
code points that form an "edge" for a property, omitting all the code
points that have the same property as the preceding one. We can use a
binary search to find the closest code point in the table that is not
greater than the one we're looking for.
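
The edge-table idea can be sketched in Python with the standard bisect
module. The names build_edges and lookup and the toy data are assumptions for
illustration; the real tables are generated C++ arrays.

```python
from bisect import bisect_right


# Hypothetical sketch of an "edge" table and its binary-search lookup.
def build_edges(props):
    """Keep only the code points where the property value changes."""
    edges = []
    for cp, prop in enumerate(props):
        if not edges or edges[-1][1] != prop:
            edges.append((cp, prop))
    return edges


def lookup(edges, cp, default=0):
    """Find the last edge that is not greater than cp."""
    starts = [start for start, _ in edges]
    i = bisect_right(starts, cp) - 1
    # Values before the first edge fall back to the default property.
    return edges[i][1] if i >= 0 else default


# Toy data: 12 "code points" with widths 1, 2, 1 collapse to three edges.
widths = [1] * 5 + [2] * 3 + [1] * 4
edges = build_edges(widths)
```

Twelve entries shrink to three edges here, and each lookup costs a binary
search over the edges instead of an index into a full table.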

The tables are generated by a new Python script added to the
contrib/unicode directory, and a new data file downloaded from the
Unicode Consortium website.

The rules for extended grapheme cluster breaking are implemented for the
latest Unicode standard, version 15.1.0.

libstdc++-v3/ChangeLog:

	* include/Makefile.am: Add new headers.
	* include/Makefile.in: Regenerate.
	* include/bits/unicode.h: New file.
	* include/bits/unicode-data.h: New file.
	* include/std/format: Include <bits/unicode.h>.
	(__literal_encoding_is_utf8): Move to <bits/unicode.h>.
	(_Spec::_M_fill): Change type to char32_t.
	(_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
	instead of a single character.
	(__write_padded): Change __fill_char parameter to char32_t and
	encode it into the output.
	(__formatter_str::format): Use new __unicode::__field_width and
	__unicode::__truncate functions.
	* include/std/ostream: Adjust namespace qualification for
	__literal_encoding_is_utf8.
	* include/std/print: Likewise.
	* src/c++23/print.cc: Add [[unlikely]] attribute to error path.
	* testsuite/ext/unicode/view.cc: New test.
	* testsuite/std/format/functions/format.cc: Add missing examples
	from the standard demonstrating alignment with non-ASCII
	characters. Add examples checking correct handling of extended
	grapheme clusters.

contrib/ChangeLog:

	* unicode/README: Add notes about generating libstdc++ tables.
	* unicode/GraphemeBreakProperty.txt: New file.
	* unicode/emoji-data.txt: New file.
	* unicode/gen_libstdcxx_unicode_data.py: New file.
2024-01-08 01:14:50 +00:00
Jonathan Wakely
29abd09a74 contrib: Remove C-style comments from Python files
These Python scripts have "*/" at the end of the license header comment
blocks, presumably copy&pasted from C files.

contrib/ChangeLog:

	* analyze_brprob.py: Remove stray text at end of comment.
	* analyze_brprob_spec.py: Likewise.
	* check-params-in-docs.py: Likewise.
	* check_GNU_style.py: Likewise.
	* check_GNU_style_lib.py: Likewise.
	* filter-clang-warnings.py: Likewise.
	* gcc-changelog/git_check_commit.py: Likewise.
	* gcc-changelog/git_commit.py: Likewise.
	* gcc-changelog/git_email.py: Likewise.
	* gcc-changelog/git_repository.py: Likewise.
	* gcc-changelog/git_update_version.py: Likewise.
	* gcc-changelog/test_email.py: Likewise.
	* gen_autofdo_event.py: Likewise.
	* mark_spam.py: Likewise.
	* unicode/gen-box-drawing-chars.py: Likewise.
	* unicode/gen-combining-chars.py: Likewise.
	* unicode/gen-printable-chars.py: Likewise.
	* unicode/gen_wcwidth.py: Likewise.
2024-01-05 13:57:05 +00:00
Jonathan Wakely
1bc9eddb80 contrib: Add script name to usage error in gen_wcwidth.py
contrib/ChangeLog:

	* unicode/gen_wcwidth.py: Add sys.argv[0] to usage error.
2024-01-05 13:57:04 +00:00
Jakub Jelinek
a945c346f5 Update copyright years. 2024-01-03 12:19:35 +01:00
Jakub Jelinek
d64b7c82da libcpp, contrib: Update to Unicode 15.1
The following patch updates from the Unicode 15.0 we had last year to
Unicode 15.1.  (In plaintext it is just a pseudo-patch in which the overly
large parts of the wget-downloaded or regenerated files are replaced with
...; the full patch is attached compressed.)  Apparently Unicode forgot to
add a new range to the 4-8 table we are using, but from the other files it
is clear what should have been added; I've filed a bug report against Unicode.

2023-11-14  Jakub Jelinek  <jakub@redhat.com>

contrib/
	* unicode/README: Adjust glibc git commit hash, number of Unicode
	data files to be updated and latest Unicode version.
	* unicode/from_glibc/utf8_gen.py: Update from glibc.
	* unicode/UnicodeData.txt: Update from Unicode 15.1.
	* unicode/EastAsianWidth.txt: Likewise.
	* unicode/DerivedNormalizationProps.txt: Likewise.
	* unicode/NameAliases.txt: Likewise.
	* unicode/DerivedCoreProperties.txt: Likewise.
	* unicode/PropList.txt: Likewise.
libcpp/
	* makeucnid.cc (write_copyright): Update copyright year.
	* makeuname2c.cc (write_copyright): Likewise.
	(struct generated): Update latest Unicode version.
	(generated_ranges): Add 2ebf0-2ee5d CJK UNIFIED IDEOGRAPH
	range which was forgotten to be added to 4-8 table, but
	clearly is expected to be there from the 15.1 additions.
	* ucnid.h: Regenerated.
	* uname2c.h: Regenerated.
	* generated_cpp_wcwidth.h: Regenerated.
2023-11-14 18:32:37 +01:00
David Malcolm
4f01ae3761 diagnostics: add support for "text art" diagrams
Existing text output in GCC has to be implemented by writing
sequentially to a pretty_printer instance.  This makes it
hard to implement some kinds of diagnostic output (see e.g.
diagnostic-show-locus.cc).

This patch adds more flexible ways of creating text output:
- a canvas class, which can be "painted" to via random-access (rather
than sequentially)
- a table class for 2D grid layout, supporting items that span
multiple rows/columns
- a widget class for organizing diagrams hierarchically.
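
To illustrate what random-access painting means, here is a toy canvas in
Python. The Canvas class and its methods are hypothetical and far simpler
than the text_art classes in this patch, which are written in C++.

```python
# Hypothetical toy canvas: cells can be painted in any order, then rendered.
class Canvas:
    def __init__(self, width: int, height: int, fill: str = " "):
        self.rows = [[fill] * width for _ in range(height)]

    def paint(self, x: int, y: int, ch: str) -> None:
        """Set the single cell at column x, row y."""
        self.rows[y][x] = ch

    def paint_text(self, x: int, y: int, text: str) -> None:
        """Paint a run of characters starting at column x, row y."""
        for offset, ch in enumerate(text):
            self.paint(x + offset, y, ch)

    def render(self) -> str:
        """Join the rows into text, dropping trailing spaces."""
        return "\n".join("".join(row).rstrip() for row in self.rows)


# The bottom border is painted first: output order is not painting order.
canvas = Canvas(5, 3)
canvas.paint_text(0, 2, "+---+")
canvas.paint_text(0, 0, "+---+")
canvas.paint(0, 1, "|")
canvas.paint(4, 1, "|")
canvas.paint_text(1, 1, "abc")
```

A sequential printer would have to emit the top border first. The canvas
defers that ordering until render() is called.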

The patch also expands GCC's diagnostics subsystem so that diagnostics
can have "text art" diagrams - think ASCII art, but potentially
including some Unicode characters, such as box-drawing chars.

The new code is in a new "gcc/text-art" subdirectory and "text_art"
namespace.

The patch adds a new "-fdiagnostics-text-art-charset=VAL" option, with
values:
- "none": don't emit diagrams (added to -fdiagnostics-plain-output)
- "ascii": use pure ASCII in diagrams
- "unicode": allow for conservative use of unicode drawing characters
(such as box-drawing characters).
- "emoji" (the default): as "unicode", but potentially allow for
conservative use of emoji in the output (such as U+26A0 WARNING SIGN).
I made it possible to disable emoji separately from unicode as I believe
there's a generation gap in acceptance of these characters (some older
programmers have a visceral reaction against them, whereas younger
programmers may have no problem with them).

Diagrams are emitted to stderr by default.  With SARIF output they are
captured as a location in "relatedLocations", with the diagram as a
code block in Markdown within a "markdown" property of a message.

This patch doesn't add any such diagram usage to GCC, saving that for
followups, apart from adding a plugin to the test suite to exercise the
functionality.

contrib/ChangeLog:
	* unicode/gen-box-drawing-chars.py: New file.
	* unicode/gen-combining-chars.py: New file.
	* unicode/gen-printable-chars.py: New file.

gcc/ChangeLog:
	* Makefile.in (OBJS-libcommon): Add text-art/box-drawing.o,
	text-art/canvas.o, text-art/ruler.o, text-art/selftests.o,
	text-art/style.o, text-art/styled-string.o, text-art/table.o,
	text-art/theme.o, and text-art/widget.o.
	* color-macros.h (COLOR_FG_BRIGHT_BLACK): New.
	(COLOR_FG_BRIGHT_RED): New.
	(COLOR_FG_BRIGHT_GREEN): New.
	(COLOR_FG_BRIGHT_YELLOW): New.
	(COLOR_FG_BRIGHT_BLUE): New.
	(COLOR_FG_BRIGHT_MAGENTA): New.
	(COLOR_FG_BRIGHT_CYAN): New.
	(COLOR_FG_BRIGHT_WHITE): New.
	(COLOR_BG_BRIGHT_BLACK): New.
	(COLOR_BG_BRIGHT_RED): New.
	(COLOR_BG_BRIGHT_GREEN): New.
	(COLOR_BG_BRIGHT_YELLOW): New.
	(COLOR_BG_BRIGHT_BLUE): New.
	(COLOR_BG_BRIGHT_MAGENTA): New.
	(COLOR_BG_BRIGHT_CYAN): New.
	(COLOR_BG_BRIGHT_WHITE): New.
	* common.opt (fdiagnostics-text-art-charset=): New option.
	(diagnostic-text-art.h): New SourceInclude.
	(diagnostic_text_art_charset): New Enum and EnumValues.
	* configure: Regenerate.
	* configure.ac (gccdepdir): Add text-art to loop.
	* diagnostic-diagram.h: New file.
	* diagnostic-format-json.cc (json_emit_diagram): New.
	(diagnostic_output_format_init_json): Wire it up to
	context->m_diagrams.m_emission_cb.
	* diagnostic-format-sarif.cc: Include "diagnostic-diagram.h" and
	"text-art/canvas.h".
	(sarif_result::on_nested_diagnostic): Move code to...
	(sarif_result::add_related_location): ...this new function.
	(sarif_result::on_diagram): New.
	(sarif_builder::emit_diagram): New.
	(sarif_builder::make_message_object_for_diagram): New.
	(sarif_emit_diagram): New.
	(diagnostic_output_format_init_sarif): Set
	context->m_diagrams.m_emission_cb to sarif_emit_diagram.
	* diagnostic-text-art.h: New file.
	* diagnostic.cc: Include "diagnostic-text-art.h",
	"diagnostic-diagram.h", and "text-art/theme.h".
	(diagnostic_initialize): Initialize context->m_diagrams and
	call diagnostics_text_art_charset_init.
	(diagnostic_finish): Clean up context->m_diagrams.m_theme.
	(diagnostic_emit_diagram): New.
	(diagnostics_text_art_charset_init): New.
	* diagnostic.h (text_art::theme): New forward decl.
	(class diagnostic_diagram): Likewise.
	(diagnostic_context::m_diagrams): New field.
	(diagnostic_emit_diagram): New decl.
	* doc/invoke.texi (Diagnostic Message Formatting Options): Add
	-fdiagnostics-text-art-charset=.
	(-fdiagnostics-plain-output): Add
	-fdiagnostics-text-art-charset=none.
	* gcc.cc: Include "diagnostic-text-art.h".
	(driver_handle_option): Handle OPT_fdiagnostics_text_art_charset_.
	* opts-common.cc (decode_cmdline_options_to_array): Add
	"-fdiagnostics-text-art-charset=none" to expanded_args for
	-fdiagnostics-plain-output.
	* opts.cc: Include "diagnostic-text-art.h".
	(common_handle_option): Handle OPT_fdiagnostics_text_art_charset_.
	* pretty-print.cc (pp_unicode_character): New.
	* pretty-print.h (pp_unicode_character): New decl.
	* selftest-run-tests.cc: Include "text-art/selftests.h".
	(selftest::run_tests): Call text_art_tests.
	* text-art/box-drawing-chars.inc: New file, generated by
	contrib/unicode/gen-box-drawing-chars.py.
	* text-art/box-drawing.cc: New file.
	* text-art/box-drawing.h: New file.
	* text-art/canvas.cc: New file.
	* text-art/canvas.h: New file.
	* text-art/ruler.cc: New file.
	* text-art/ruler.h: New file.
	* text-art/selftests.cc: New file.
	* text-art/selftests.h: New file.
	* text-art/style.cc: New file.
	* text-art/styled-string.cc: New file.
	* text-art/table.cc: New file.
	* text-art/table.h: New file.
	* text-art/theme.cc: New file.
	* text-art/theme.h: New file.
	* text-art/types.h: New file.
	* text-art/widget.cc: New file.
	* text-art/widget.h: New file.

gcc/testsuite/ChangeLog:
	* gcc.dg/plugin/diagnostic-test-text-art-ascii-bw.c: New test.
	* gcc.dg/plugin/diagnostic-test-text-art-ascii-color.c: New test.
	* gcc.dg/plugin/diagnostic-test-text-art-none.c: New test.
	* gcc.dg/plugin/diagnostic-test-text-art-unicode-bw.c: New test.
	* gcc.dg/plugin/diagnostic-test-text-art-unicode-color.c: New test.
	* gcc.dg/plugin/diagnostic_plugin_test_text_art.c: New test plugin.
	* gcc.dg/plugin/plugin.exp (plugin_test_list): Add them.

libcpp/ChangeLog:
	* charset.cc (get_cppchar_property): New function template, based
	on...
	(cpp_wcwidth): ...this function.  Rework to use the above.
	Include "combining-chars.inc".
	(cpp_is_combining_char): New function.
	Include "printable-chars.inc".
	(cpp_is_printable_char): New function.
	* combining-chars.inc: New file, generated by
	contrib/unicode/gen-combining-chars.py.
	* include/cpplib.h (cpp_is_combining_char): New function decl.
	(cpp_is_printable_char): New function decl.
	* printable-chars.inc: New file, generated by
	contrib/unicode/gen-printable-chars.py.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
2023-06-21 21:49:00 -04:00
Jakub Jelinek
63b25b8012 contrib: Update instructions regarding Unicode updates
I've noticed we have instructions on how to update from newer Unicode
standard, but it didn't mention uname2c.h regeneration.

The following patch mentions that, also mentions that the Copyright years
of Unicode should be updated and adds a copy of NameAliases.txt which
is used for uname2c.h generation.

2023-03-16  Jakub Jelinek  <jakub@redhat.com>

	* unicode/README: Update to mention also makeuname2c.
	* unicode/NameAliases.txt: New file.
2023-03-16 10:28:25 +01:00
Lewis Hyatt
73dd5c6c88 libcpp: Update cpp_wcwidth() to Unicode 15
Updates cpp_wcwidth() to Unicode 15, following the procedure in
contrib/unicode/README mechanically without incident.

contrib/ChangeLog:

	* unicode/DerivedCoreProperties.txt: Update to Unicode 15.
	* unicode/DerivedNormalizationProps.txt: Likewise.
	* unicode/EastAsianWidth.txt: Likewise.
	* unicode/PropList.txt: Likewise.
	* unicode/README: Likewise.
	* unicode/UnicodeData.txt: Likewise.

libcpp/ChangeLog:

	* generated_cpp_wcwidth.h: Regenerated for Unicode 15.
2023-03-13 07:40:50 -04:00
Jakub Jelinek
83ffe9cde7 Update copyright years. 2023-01-16 11:52:17 +01:00
Lewis Hyatt
4fda776a2f libcpp: Update ucnid.h to Unicode 14
This patch updates ucnid.h from Unicode 13 to Unicode 14.  Additionally, the
procedure detailed in contrib/unicode/README, which updates
generated_cpp_wcwidth.h, has been expanded with instructions for updating this
file as well, so that both may be done at the same time conveniently.  Two
additional Unicode data files which are needed to create ucnid.h are also
added to source control in contrib/unicode.

contrib/ChangeLog:

	* unicode/README: Added instructions for updating ucnid.h.
	* unicode/DerivedCoreProperties.txt: New file added to source
	control from Unicode 14.0 release.
	* unicode/DerivedNormalizationProps.txt: Likewise.

libcpp/ChangeLog:

	* ucnid.h: Regenerated for Unicode 14.0.
2022-06-28 17:33:37 -04:00
Lewis Hyatt
57988cbe73 libcpp: Update cpp_wcwidth() to Unicode 14.0.0
The procedure detailed in contrib/unicode/README was followed with nothing
notable coming up. The glibc scripts did not require any update, so the
only change was retrieving new versions of the Unicode data files and
rerunning gen_wcwidth.py.

contrib/ChangeLog:

	* unicode/EastAsianWidth.txt: Update to Unicode 14.0.0.
	* unicode/PropList.txt: Likewise.
	* unicode/README: Likewise.
	* unicode/UnicodeData.txt: Likewise.

libcpp/ChangeLog:

	* generated_cpp_wcwidth.h: Generated from updated Unicode data files.
2022-06-26 14:13:26 -04:00
David Malcolm
b050653c4c contrib: add unicode/utf8-dump.py
This script may be useful when debugging issues relating to Unicode
encoding (e.g. when investigating source files with bidirectional control
characters).

It dumps a UTF-8 file as a list of numbered lines (mimicking GCC's
diagnostic output format), interleaved with lines per character showing
the Unicode codepoints, the UTF-8 encoding bytes, the name of the
character, and, where printable, the characters themselves.
The lines are printed in logical order, which may help the reader to grok
the relationship between visual and logical ordering in bi-di files.

For example:

$ cat test.c
int གྷ;
const char *אבג = "ALEF-BET-GIMEL";

$ ./contrib/unicode/utf8-dump.py test.c
   1 | int གྷ;
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0F43  0xe0 0xbd 0x83                       TIBETAN LETTER GHA གྷ
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
   2 | const char *אבג = "ALEF-BET-GIMEL";
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+006F            0x6f                     LATIN SMALL LETTER O o
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+0068            0x68                     LATIN SMALL LETTER H h
     |   U+0061            0x61                     LATIN SMALL LETTER A a
     |   U+0072            0x72                     LATIN SMALL LETTER R r
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+05D0       0xd7 0x90                       HEBREW LETTER ALEF א
     |   U+05D1       0xd7 0x91                        HEBREW LETTER BET ב
     |   U+05D2       0xd7 0x92                      HEBREW LETTER GIMEL ג
     |   U+0020            0x20                                    SPACE (separator)
     |   U+003D            0x3d                              EQUALS SIGN =
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+0041            0x41                   LATIN CAPITAL LETTER A A
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0046            0x46                   LATIN CAPITAL LETTER F F
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0042            0x42                   LATIN CAPITAL LETTER B B
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0054            0x54                   LATIN CAPITAL LETTER T T
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0047            0x47                   LATIN CAPITAL LETTER G G
     |   U+0049            0x49                   LATIN CAPITAL LETTER I I
     |   U+004D            0x4d                   LATIN CAPITAL LETTER M M
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
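
The per-character lines shown above can be approximated with Python's
standard unicodedata module. This is a minimal sketch under assumptions, not
the code of utf8-dump.py: the helper names describe and dump_line are
hypothetical, the column layout is simplified, and characters without a
Unicode name (such as control characters) get a placeholder.

```python
import unicodedata


# Hypothetical helper: one summary line for a single character.
def describe(ch: str) -> str:
    """Return the code point, UTF-8 bytes and Unicode name of ch."""
    encoded = " ".join(f"0x{b:02x}" for b in ch.encode("utf-8"))
    # Control characters have no name in unicodedata, so use a placeholder.
    name = unicodedata.name(ch, "(no name)")
    return f"U+{ord(ch):04X} {encoded} {name}"


def dump_line(line: str) -> list:
    """Describe every character of a line in logical order."""
    return [describe(ch) for ch in line]


# HEBREW LETTER ALEF followed by a semicolon.
report = dump_line("\u05d0;")
```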

Tested with Python 3.8

contrib/ChangeLog:
	* unicode/utf8-dump.py: New file.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
2021-11-01 11:52:28 -04:00
Lewis Hyatt
497c9f8d4d libcpp: Update cpp_wcwidth() to Unicode 13.0.0
generated_cpp_wcwidth.h was regenerated using Unicode 13.0.0 data files. No
material changes to the parsing scripts (either GCC- or glibc-sourced) were
necessary; glibc's utf8_gen.py was tweaked slightly by glibc and matched here.

contrib/ChangeLog:

	* unicode/EastAsianWidth.txt: Update to Unicode 13.0.0.
	* unicode/PropList.txt: Likewise.
	* unicode/README: Likewise.
	* unicode/UnicodeData.txt: Likewise.
	* unicode/from_glibc/unicode_utils.py: Update to latest glibc version.
	* unicode/from_glibc/utf8_gen.py: Likewise.

libcpp/ChangeLog:

	* generated_cpp_wcwidth.h: Regenerated from Unicode 13.0.0 data.
2020-11-07 09:36:43 -05:00
Lewis Hyatt
ee9256409f Byte vs column awareness for diagnostic-show-locus.c (PR 49973)
contrib/ChangeLog

2019-12-09  Lewis Hyatt  <lhyatt@gmail.com>

	PR preprocessor/49973
	* unicode/from_glibc/unicode_utils.py: Support script from
	glibc (commit 464cd3) to extract character widths from Unicode data
	files.
	* unicode/from_glibc/utf8_gen.py: Likewise.
	* unicode/UnicodeData.txt: Unicode v. 12.1.0 data file.
	* unicode/EastAsianWidth.txt: Likewise.
	* unicode/PropList.txt: Likewise.
	* unicode/gen_wcwidth.py: New utility to generate
	libcpp/generated_cpp_wcwidth.h with help from the glibc support
	scripts and the Unicode data files.
	* unicode/unicode-license.txt: Added.
	* unicode/README: New explanatory file.

libcpp/ChangeLog

2019-12-09  Lewis Hyatt  <lhyatt@gmail.com>

	PR preprocessor/49973
	* generated_cpp_wcwidth.h: New file generated by
	../contrib/unicode/gen_wcwidth.py, supports new cpp_wcwidth function.
	* charset.c (compute_next_display_width): New function to help
	implement display columns.
	(cpp_byte_column_to_display_column): Likewise.
	(cpp_display_column_to_byte_column): Likewise.
	(cpp_wcwidth): Likewise.
	* include/cpplib.h (cpp_byte_column_to_display_column): Declare.
	(cpp_display_column_to_byte_column): Declare.
	(cpp_wcwidth): Declare.
	(cpp_display_width): New function.

gcc/ChangeLog

2019-12-09  Lewis Hyatt  <lhyatt@gmail.com>

	PR preprocessor/49973
	* input.c (location_compute_display_column): New function to help with
	multibyte awareness in diagnostics.
	(test_cpp_utf8): New self-test.
	(input_c_tests): Call the new test.
	* input.h (location_compute_display_column): Declare.
	* diagnostic-show-locus.c: Pervasive changes to add multibyte awareness
	to all classes and functions.
	(enum column_unit): New enum.
	(class exploc_with_display_col): New class.
	(class layout_point): Convert m_column member to array m_columns[2].
	(layout_range::contains_point): Add col_unit argument.
	(test_layout_range_for_single_point): Pass new argument.
	(test_layout_range_for_single_line): Likewise.
	(test_layout_range_for_multiple_lines): Likewise.
	(line_bounds::convert_to_display_cols): New function.
	(layout::get_state_at_point): Add col_unit argument.
	(make_range): Use empty filename rather than dummy filename.
	(get_line_width_without_trailing_whitespace): Rename to...
	(get_line_bytes_without_trailing_whitespace): ...this.
	(test_get_line_width_without_trailing_whitespace): Rename to...
	(test_get_line_bytes_without_trailing_whitespace): ...this.
	(class layout): m_exploc changed to exploc_with_display_col from
	plain expanded_location.
	(layout::get_linenum_width): New accessor member function.
	(layout::get_x_offset_display): Likewise.
	(layout::calculate_linenum_width): New subroutine for the constructor.
	(layout::calculate_x_offset_display): Likewise.
	(layout::layout): Use the new subroutines. Add multibyte awareness.
	(layout::print_source_line): Add multibyte awareness.
	(layout::print_line): Likewise.
	(layout::print_annotation_line): Likewise.
	(line_label::line_label): Likewise.
	(layout::print_any_labels): Likewise.
	(layout::annotation_line_showed_range_p): Likewise.
	(get_printed_columns): Likewise.
	(class line_label): Rename m_length to m_display_width.
	(get_affected_columns): Rename to...
	(get_affected_range): ...this; add col_unit argument and multibyte
	awareness.
	(class correction): Add m_affected_bytes and m_display_cols
	members.  Rename m_len to m_byte_length for clarity.  Add multibyte
	awareness throughout.
	(correction::insertion_p): Add multibyte awareness.
	(correction::compute_display_cols): New function.
	(correction::ensure_terminated): Use new member name m_byte_length.
	(line_corrections::add_hint): Add multibyte awareness.
	(layout::print_trailing_fixits): Likewise.
	(layout::get_x_bound_for_row): Likewise.
	(test_one_liner_simple_caret_utf8): New self-test analogous to the one
	with _utf8 suffix removed, testing multibyte awareness.
	(test_one_liner_caret_and_range_utf8): Likewise.
	(test_one_liner_multiple_carets_and_ranges_utf8): Likewise.
	(test_one_liner_fixit_insert_before_utf8): Likewise.
	(test_one_liner_fixit_insert_after_utf8): Likewise.
	(test_one_liner_fixit_remove_utf8): Likewise.
	(test_one_liner_fixit_replace_utf8): Likewise.
	(test_one_liner_fixit_replace_non_equal_range_utf8): Likewise.
	(test_one_liner_fixit_replace_equal_secondary_range_utf8): Likewise.
	(test_one_liner_fixit_validation_adhoc_locations_utf8): Likewise.
	(test_one_liner_many_fixits_1_utf8): Likewise.
	(test_one_liner_many_fixits_2_utf8): Likewise.
	(test_one_liner_labels_utf8): Likewise.
	(test_diagnostic_show_locus_one_liner_utf8): Likewise.
	(test_overlapped_fixit_printing_utf8): Likewise.
	(test_overlapped_fixit_printing): Adapt for changes to
	get_affected_columns, get_printed_columns and class corrections.
	(test_overlapped_fixit_printing_2): Likewise.
	(test_linenum_sep): New constant.
	(test_left_margin): Likewise.
	(test_offset_impl): Helper function for new test.
	(test_layout_x_offset_display_utf8): New test.
	(diagnostic_show_locus_c_tests): Call new tests.

gcc/testsuite/ChangeLog:

2019-12-09  Lewis Hyatt  <lhyatt@gmail.com>

	PR preprocessor/49973
	* gcc.dg/plugin/diagnostic_plugin_test_show_locus.c
	(test_show_locus): Tweak so that expected output is the same as
	before the diagnostic-show-locus.c changes.
	* gcc.dg/cpp/pr66415-1.c: Likewise.

From-SVN: r279137
2019-12-09 20:03:47 +00:00