procyberian/gcc

Author	SHA1	Message	Date
Tomasz Kamiński	3b33d792cf	libstdc++: Implement debug format for strings and characters formatters [PR109162] This patch implements the part of P2286R8 that specifies the debug (escaped) format for strings and characters. This includes both handling of the '?' format specifier and the set_debug_format member. To indicate partial support we define the __glibcxx_format_ranges macro with value 1, without defining __cpp_lib_format_ranges. We provide two separate escaping routines depending on the literal encoding for the corresponding character types. If the character encoding is Unicode, we follow the specification from the standard (__format::__write_escaped_unicode). For other encodings, we escape only characters in range [0x00, 0x80), interpreting them as ASCII values: [0x00, 0x20), 0x7f and '\t', '\r', '\n', '\\', '"', '\'' are escaped. We assume every character outside this range is printable (__format::__write_escaped_ascii). In particular we do not yet implement special handling of shift sequences. For Unicode escaping a new __unicode::__escape_edges table is introduced, which encodes whether a character belongs to a General_Category that is escaped by the standard (Control or Other). This table is generated from DerivedGeneralCategory.txt provided by Unicode. Only a boolean flag is preserved to reduce the number of entries. The additional rules for escaping are handled by __format::__should_escape_unicode. When width or precision is specified, we emit the escaped string to a temporary buffer and format the resulting string according to the format spec. For characters we use a fixed-size stack buffer, for which a new _Fixedbuf_sink is introduced. For strings, we use _Str_sink and, to avoid allocations, we compute the estimated size of the (possibly truncated) input, and if it is larger than the width field we print directly. PR libstdc++/109162 contrib/ChangeLog: * unicode/README: Mentioned DerivedGeneralCategory.txt. 
* unicode/gen_libstdcxx_unicode_data.py: Generate __escape_edges table from DerivedGeneralCategory.txt. Update file name in comments. * unicode/DerivedGeneralCategory.txt: Copy of file distributed by Unicode Consortium. libstdc++-v3/ChangeLog: * include/bits/chrono_io.h (__detail::_Widen): Moved to std/format file. * include/bits/unicode-data.h: Regenerate. * include/bits/unicode.h (__unicode::_Utf_iterator::_M_units) (__unicode::__should_escape_category): Define. * include/std/format (_GLIBCXX_WIDEN_, _GLIBCXX_WIDEN): Copied from include/bits/chrono_io.h. (__format::_Widen): Moved from include/bits/chrono_io.h. (__format::_Term_char, __format::_Escapes, __format::_Separators) (__format::__should_escape_ascii, __format::__should_escape_unicode) (__format::__write_escape_seq, __format::__write_escaped_char) (__format::__write_escaped_ascii, __format::__write_escaped_unicode) (__format::__write_escaped): Define. (__formatter_str::_S_trunc): Extracted truncation of character sequences. (__formatter_str::format): Handle _Pres_esc. (__formatter_int::_M_do_parse) [__glibcxx_format_ranges]: Parse '?'. (__formatter_int::_M_format_character_escaped): Define. (formatter<_CharT, _CharT>::format, formatter<char, wchar_t>::format): Handle _Pres_esc. (__formatter_str::set_debug_format, formatter<...>::set_debug_format): Guard with __glibcxx_format_ranges. (__format::_Fixedbuf_sink): Define. * testsuite/23_containers/vector/bool/format.cc: Use __format::_Widen and remove unnecessary <chrono> include. * testsuite/std/format/debug.cc: New test. * testsuite/std/format/debug_nonunicode.cc: New test. * testsuite/std/format/parse_ctx.cc (escaped_strings_supported): Define to true if __glibcxx_format_ranges is defined. * testsuite/std/format/string.cc (escaped_strings_supported): Define to true if __glibcxx_format_ranges is defined. Reviewed-by: Jonathan Wakely <jwakely@redhat.com> Signed-off-by: Tomasz Kamiński <tkaminsk@redhat.com>	2025-04-11 08:43:50 +02:00
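The non-Unicode escaping rule described in the commit above can be illustrated with a short Python sketch. This is not the libstdc++ implementation: the name `escape_ascii` is hypothetical, the `\u{...}` spelling follows the escaped-string format of the C++ standard, and escaping only the quote character that matches the delimiter is an assumption taken from that standard (the commit message lists both quotes).

```python
# Simple escapes named in the commit message for the non-Unicode path.
SIMPLE_ESCAPES = {"\t": "\\t", "\r": "\\r", "\n": "\\n", "\\": "\\\\"}


def escape_ascii(text: str, quote: str = '"') -> str:
    """Quote and escape `text`, treating only code units below 0x80 as ASCII.

    Hypothetical helper for illustration; not a libstdc++ API.
    """
    out = [quote]
    for ch in text:
        code = ord(ch)
        if ch in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[ch])
        elif ch == quote:
            # Assumption: only the delimiter's own quote character is escaped.
            out.append("\\" + ch)
        elif code < 0x20 or code == 0x7F:
            # Remaining control characters use the shortest lower-case hex form.
            out.append("\\u{%x}" % code)
        else:
            # Everything else, including values >= 0x80, is assumed printable.
            out.append(ch)
    out.append(quote)
    return "".join(out)


print(escape_ascii('a\tb"c'))
print(escape_ascii("\x01\x7f"))
```

Shift sequences get no special handling here either, mirroring the limitation stated in the commit message.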
Jakub Jelinek	6441eb6dc0	Update copyright years.	2025-01-02 11:59:57 +01:00
Jakub Jelinek	d0e8f58b81	contrib, libcpp, libstdc++: Update to Unicode 16.0 It is autumn again and there is a new Unicode version 16.0. The following patch updates our Unicode stuff in contrib, libcpp and libstdc++ from that Unicode version. 2024-10-08 Jakub Jelinek <jakub@redhat.com> contrib/ * unicode/README: Update glibc git commit hash, replace Unicode 15 or 15.1 versions with 16. * unicode/gen_libstdcxx_unicode_data.py: Use 160000 instead of 150100 in _GLIBCXX_GET_UNICODE_DATA test. * unicode/from_glibc/utf8_gen.py: Updated from glibc 064c708c78cc2a6b5802dce73108fc0c1c6bfc80 commit. * unicode/DerivedCoreProperties.txt: Updated from Unicode 16.0. * unicode/emoji-data.txt: Likewise. * unicode/PropList.txt: Likewise. * unicode/GraphemeBreakProperty.txt: Likewise. * unicode/DerivedNormalizationProps.txt: Likewise. * unicode/NameAliases.txt: Likewise. * unicode/UnicodeData.txt: Likewise. * unicode/EastAsianWidth.txt: Likewise. gcc/testsuite/ * c-c++-common/cpp/named-universal-char-escape-1.c: Add tests for some Unicode 16.0 characters, both normal and generated. libcpp/ * makeucnid.cc (write_copyright): Update Unicode Copyright years. * makeuname2c.cc (generated_ranges): Adjust Unicode version from 15.1 to 16.0. Add EGYPTIAN HIEROGLYPH- generated range, adjust indexes in following entries. (write_copyright): Update Unicode Copyright years. * generated_cpp_wcwidth.h: Regenerated. * ucnid.h: Regenerated. * uname2c.h: Regenerated. libstdc++-v3/ * include/bits/unicode.h (std::__unicode::__v15_1_0): Rename inline namespace to ... (std::__unicode::__v16_0_0): ... this. (_GLIBCXX_GET_UNICODE_DATA): Change from 150100 to 160000. * include/bits/unicode-data.h: Regenerated. * testsuite/ext/unicode/properties.cc: Check for _Gcb_SpacingMark on U+11F03 rather than U+1D16D as the latter lost SpacingMark property in Unicode 16.0.	2024-10-08 10:01:47 +02:00
Jonathan Wakely	ef2efc53fd	libstdc++: Fix Python scripts to output the correct filename These scripts both print "generated by $file, do not edit" header but one of them prints the wrong filename. Use the built-in __file__ attribute to ensure it's correct. contrib/ChangeLog: * unicode/gen_libstdcxx_unicode_data.py: Fix header of generated file to name the correct script. libstdc++-v3/ChangeLog: * include/bits/text_encoding-data.h: Regenerate. * include/bits/unicode-data.h: Regenerate. * scripts/gen_text_encoding_data.py: Fix header of generated file to name the correct script.	2024-03-19 15:20:07 +00:00
Jonathan Wakely	e99d9607f0	libstdc++: Add copyright and license text to new generated headers contrib/ChangeLog: * unicode/gen_libstdcxx_unicode_data.py: Add copyright and license text to the output. libstdc++-v3/ChangeLog: * include/bits/text_encoding-data.h: Regenerate. * include/bits/unicode-data.h: Regenerate. * scripts/gen_text_encoding_data.py: Add copyright and license text to the output.	2024-02-04 21:40:23 +00:00
Jonathan Wakely	ea314ccd62	libstdc++: Fix Unicode property detection functions Fix some copy & pasted logic in __is_extended_pictographic. This function should yield false for the values before the first edge, not true. Also add a missing boundary condition check in __incb_property. Also fix an off-by-one error in _Utf_iterator::operator++() that would make dereferencing a past-the-end iterator undefined (where the intended design is that the iterator is always incrementable and dereferenceable, for better memory safety). Also simplify the grapheme view iterator, which still contained some remnants of an earlier design I was experimenting with. Slightly tweak the gen_libstdcxx_unicode_data.py script so that the _Gcb_property enumerators are in the order we encounter them in the data file, instead of sorting them alphabetically. Start with the "Other" property at value 0, because that's the default property for anything not in the file. This makes no practical difference, but seems cleaner. It causes the values in the __gcb_edges table to change, so can only be done now before anybody is using this code yet. The enumerator values and table entries become ABI artefacts for the function using them. contrib/ChangeLog: * unicode/gen_libstdcxx_unicode_data.py: Print out Gcb_property enumerators in the order they're seen, not alphabetical order. libstdc++-v3/ChangeLog: * include/bits/unicode-data.h: Regenerate. * include/bits/unicode.h (_Utf_iterator::operator++()): Fix off by one error. (__incb_property): Add missing check for values before the first edge. (__is_extended_pictographic): Invert return values to fix copy&pasted logic. (_Grapheme_cluster_view::_Iterator): Remove second iterator member and find end of cluster lazily. * testsuite/ext/unicode/grapheme_view.cc: New test. * testsuite/ext/unicode/properties.cc: New test. * testsuite/ext/unicode/view.cc: New test.	2024-01-09 23:42:59 +00:00
Jonathan Wakely	37a4c5c23a	libstdc++: Add Unicode-aware width estimation for std::format This implements the requirements in the following proposals, which dictate how std::format deals with non-ASCII strings: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf There are two parts to this. The width estimation for strings must only count the width of the first character in an extended grapheme cluster. That requires implementing the algorithm for detecting cluster breaks, which requires a number of lookup tables of the grapheme cluster break properties (and Indic_Conjunct_Break and Extended_Pictographic properties) of every code point. Additionally, some characters have a field width of 2, which requires another lookup table of field widths for every code point. The tables added in this commit do not contain entries for every code point from 0 to 0x10FFFF as that would be very inefficient and use too much memory. Instead the tables only contain the code points that form an "edge" for a property, omitting all the code points that have the same property as the preceding one. We can use a binary search to find the closest code point in the table that is not greater than the one we're looking for. The tables are generated by a new Python script added to the contrib/unicode directory, and a new data file downloaded from the Unicode Consortium website. The rules for extended grapheme cluster breaking are implemented for the latest Unicode standard, version 15.1.0. libstdc++-v3/ChangeLog: * include/Makefile.am: Add new headers. * include/Makefile.in: Regenerate. * include/bits/unicode.h: New file. * include/bits/unicode-data.h: New file. * include/std/format: Include <bits/unicode.h>. (__literal_encoding_is_utf8): Move to <bits/unicode.h>. (_Spec::_M_fill): Change type to char32_t. 
(_Spec::_M_parse_fill_and_align): Read a Unicode scalar value instead of a single character. (__write_padded): Change __fill_char parameter to char32_t and encode it into the output. (__formatter_str::format): Use new __unicode::__field_width and __unicode::__truncate functions. * include/std/ostream: Adjust namespace qualification for __literal_encoding_is_utf8. * include/std/print: Likewise. * src/c++23/print.cc: Add [[unlikely]] attribute to error path. * testsuite/ext/unicode/view.cc: New test. * testsuite/std/format/functions/format.cc: Add missing examples from the standard demonstrating alignment with non-ASCII characters. Add examples checking correct handling of extended grapheme clusters. contrib/ChangeLog: * unicode/README: Add notes about generating libstdc++ tables. * unicode/GraphemeBreakProperty.txt: New file. * unicode/emoji-data.txt: New file. * unicode/gen_libstdcxx_unicode_data.py: New file.	2024-01-08 01:14:50 +00:00
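The "edge" table lookup described in the commit above can be sketched in Python as follows. The table is a deliberately truncated toy (it records only the Hangul Jamo block U+1100..U+115F as width 2), and the names `EDGES` and `field_width` are hypothetical. The layout actually generated for libstdc++ may differ.

```python
from bisect import bisect_right

# Toy edge table: (first code point, width) pairs sorted by code point.
# Only the points where the width changes are stored, not every code point.
EDGES = [(0x1100, 2), (0x1160, 1)]
EDGE_STARTS = [start for start, _ in EDGES]


def field_width(code_point: int, default: int = 1) -> int:
    """Return the width of the closest edge that is not greater than code_point."""
    idx = bisect_right(EDGE_STARTS, code_point) - 1
    if idx < 0:
        # Code points before the first edge fall back to the default property.
        return default
    return EDGES[idx][1]


print(field_width(0x41), field_width(0x1100), field_width(0x1160))
```

Storing only the edges keeps the table small, and the binary search makes each lookup logarithmic in the number of edges.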
Jonathan Wakely	29abd09a74	contrib: Remove C-style comments from Python files These Python scripts have "*/" at the end of the license header comment blocks, presumably copy&pasted from C files. contrib/ChangeLog: * analyze_brprob.py: Remove stray text at end of comment. * analyze_brprob_spec.py: Likewise. * check-params-in-docs.py: Likewise. * check_GNU_style.py: Likewise. * check_GNU_style_lib.py: Likewise. * filter-clang-warnings.py: Likewise. * gcc-changelog/git_check_commit.py: Likewise. * gcc-changelog/git_commit.py: Likewise. * gcc-changelog/git_email.py: Likewise. * gcc-changelog/git_repository.py: Likewise. * gcc-changelog/git_update_version.py: Likewise. * gcc-changelog/test_email.py: Likewise. * gen_autofdo_event.py: Likewise. * mark_spam.py: Likewise. * unicode/gen-box-drawing-chars.py: Likewise. * unicode/gen-combining-chars.py: Likewise. * unicode/gen-printable-chars.py: Likewise. * unicode/gen_wcwidth.py: Likewise.	2024-01-05 13:57:05 +00:00
Jonathan Wakely	1bc9eddb80	contrib: Add script name to usage error in gen_wcwidth.py contrib/ChangeLog: * unicode/gen_wcwidth.py: Add sys.argv[0] to usage error.	2024-01-05 13:57:04 +00:00
Jakub Jelinek	a945c346f5	Update copyright years.	2024-01-03 12:19:35 +01:00
Jakub Jelinek	d64b7c82da	libcpp, contrib: Update to Unicode 15.1 The following patch (in plaintext just a pseudo-patch where I've left out the too big parts of either wget downloaded or regenerated files out with ..., full patch attached compressed) updates to Unicode 15.1 from 15.0 we had last year. Apparently Unicode forgot to add a new range to 4-8 Table we are using, but from the other files it is clear what should have been added; I've filed a bugreport against Unicode. 2023-11-14 Jakub Jelinek <jakub@redhat.com> contrib/ * unicode/README: Adjust glibc git commit hash, number of Unicode data files to be updated and latest Unicode version. * unicode/from_glibc/utf8_gen.py: Update from glibc. * unicode/UnicodeData.txt: Update from Unicode 15.1. * unicode/EastAsianWidth.txt: Likewise. * unicode/DerivedNormalizationProps.txt: Likewise. * unicode/NameAliases.txt: Likewise. * unicode/DerivedCoreProperties.txt: Likewise. * unicode/PropList.txt: Likewise. libcpp/ * makeucnid.cc (write_copyright): Update copyright year. * makeuname2c.cc (write_copyright): Likewise. (struct generated): Update latest Unicode version. (generated_ranges): Add 2ebf0-2ee5d CJK UNIFIED IDEOGRAPH range which was forgotten to be added to 4-8 table, but clearly is expected to be there from the 15.1 additions. * ucnid.h: Regenerated. * uname2c.h: Regenerated. * generated_cpp_wcwidth.h: Regenerated.	2023-11-14 18:32:37 +01:00
David Malcolm	4f01ae3761	diagnostics: add support for "text art" diagrams Existing text output in GCC has to be implemented by writing sequentially to a pretty_printer instance. This makes it hard to implement some kinds of diagnostic output (see e.g. diagnostic-show-locus.cc). This patch adds more flexible ways of creating text output: - a canvas class, which can be "painted" to via random-access (rather than sequentially) - a table class for 2D grid layout, supporting items that span multiple rows/columns - a widget class for organizing diagrams hierarchically. The patch also expands GCC's diagnostics subsystem so that diagnostics can have "text art" diagrams - think ASCII art, but potentially including some Unicode characters, such as box-drawing chars. The new code is in a new "gcc/text-art" subdirectory and "text_art" namespace. The patch adds a new "-fdiagnostics-text-art-charset=VAL" option, with values: - "none": don't emit diagrams (added to -fdiagnostics-plain-output) - "ascii": use pure ASCII in diagrams - "unicode": allow for conservative use of unicode drawing characters (such as box-drawing characters). - "emoji" (the default): as "unicode", but potentially allow for conservative use of emoji in the output (such as U+26A0 WARNING SIGN). I made it possible to disable emoji separately from unicode as I believe there's a generation gap in acceptance of these characters (some older programmers have a visceral reaction against them, whereas younger programmers may have no problem with them). Diagrams are emitted to stderr by default. With SARIF output they are captured as a location in "relatedLocations", with the diagram as a code block in Markdown within a "markdown" property of a message. This patch doesn't add any such diagram usage to GCC, saving that for followups, apart from adding a plugin to the test suite to exercise the functionality. contrib/ChangeLog: * unicode/gen-box-drawing-chars.py: New file. * unicode/gen-combining-chars.py: New file. 
* unicode/gen-printable-chars.py: New file. gcc/ChangeLog: * Makefile.in (OBJS-libcommon): Add text-art/box-drawing.o, text-art/canvas.o, text-art/ruler.o, text-art/selftests.o, text-art/style.o, text-art/styled-string.o, text-art/table.o, text-art/theme.o, and text-art/widget.o. * color-macros.h (COLOR_FG_BRIGHT_BLACK): New. (COLOR_FG_BRIGHT_RED): New. (COLOR_FG_BRIGHT_GREEN): New. (COLOR_FG_BRIGHT_YELLOW): New. (COLOR_FG_BRIGHT_BLUE): New. (COLOR_FG_BRIGHT_MAGENTA): New. (COLOR_FG_BRIGHT_CYAN): New. (COLOR_FG_BRIGHT_WHITE): New. (COLOR_BG_BRIGHT_BLACK): New. (COLOR_BG_BRIGHT_RED): New. (COLOR_BG_BRIGHT_GREEN): New. (COLOR_BG_BRIGHT_YELLOW): New. (COLOR_BG_BRIGHT_BLUE): New. (COLOR_BG_BRIGHT_MAGENTA): New. (COLOR_BG_BRIGHT_CYAN): New. (COLOR_BG_BRIGHT_WHITE): New. * common.opt (fdiagnostics-text-art-charset=): New option. (diagnostic-text-art.h): New SourceInclude. (diagnostic_text_art_charset) New Enum and EnumValues. * configure: Regenerate. * configure.ac (gccdepdir): Add text-art to loop. * diagnostic-diagram.h: New file. * diagnostic-format-json.cc (json_emit_diagram): New. (diagnostic_output_format_init_json): Wire it up to context->m_diagrams.m_emission_cb. * diagnostic-format-sarif.cc: Include "diagnostic-diagram.h" and "text-art/canvas.h". (sarif_result::on_nested_diagnostic): Move code to... (sarif_result::add_related_location): ...this new function. (sarif_result::on_diagram): New. (sarif_builder::emit_diagram): New. (sarif_builder::make_message_object_for_diagram): New. (sarif_emit_diagram): New. (diagnostic_output_format_init_sarif): Set context->m_diagrams.m_emission_cb to sarif_emit_diagram. * diagnostic-text-art.h: New file. * diagnostic.cc: Include "diagnostic-text-art.h", "diagnostic-diagram.h", and "text-art/theme.h". (diagnostic_initialize): Initialize context->m_diagrams and call diagnostics_text_art_charset_init. (diagnostic_finish): Clean up context->m_diagrams.m_theme. (diagnostic_emit_diagram): New. 
(diagnostics_text_art_charset_init): New. * diagnostic.h (text_art::theme): New forward decl. (class diagnostic_diagram): Likewise. (diagnostic_context::m_diagrams): New field. (diagnostic_emit_diagram): New decl. * doc/invoke.texi (Diagnostic Message Formatting Options): Add -fdiagnostics-text-art-charset=. (-fdiagnostics-plain-output): Add -fdiagnostics-text-art-charset=none. * gcc.cc: Include "diagnostic-text-art.h". (driver_handle_option): Handle OPT_fdiagnostics_text_art_charset_. * opts-common.cc (decode_cmdline_options_to_array): Add "-fdiagnostics-text-art-charset=none" to expanded_args for -fdiagnostics-plain-output. * opts.cc: Include "diagnostic-text-art.h". (common_handle_option): Handle OPT_fdiagnostics_text_art_charset_. * pretty-print.cc (pp_unicode_character): New. * pretty-print.h (pp_unicode_character): New decl. * selftest-run-tests.cc: Include "text-art/selftests.h". (selftest::run_tests): Call text_art_tests. * text-art/box-drawing-chars.inc: New file, generated by contrib/unicode/gen-box-drawing-chars.py. * text-art/box-drawing.cc: New file. * text-art/box-drawing.h: New file. * text-art/canvas.cc: New file. * text-art/canvas.h: New file. * text-art/ruler.cc: New file. * text-art/ruler.h: New file. * text-art/selftests.cc: New file. * text-art/selftests.h: New file. * text-art/style.cc: New file. * text-art/styled-string.cc: New file. * text-art/table.cc: New file. * text-art/table.h: New file. * text-art/theme.cc: New file. * text-art/theme.h: New file. * text-art/types.h: New file. * text-art/widget.cc: New file. * text-art/widget.h: New file. gcc/testsuite/ChangeLog: * gcc.dg/plugin/diagnostic-test-text-art-ascii-bw.c: New test. * gcc.dg/plugin/diagnostic-test-text-art-ascii-color.c: New test. * gcc.dg/plugin/diagnostic-test-text-art-none.c: New test. * gcc.dg/plugin/diagnostic-test-text-art-unicode-bw.c: New test. * gcc.dg/plugin/diagnostic-test-text-art-unicode-color.c: New test. 
* gcc.dg/plugin/diagnostic_plugin_test_text_art.c: New test plugin. * gcc.dg/plugin/plugin.exp (plugin_test_list): Add them. libcpp/ChangeLog: * charset.cc (get_cppchar_property): New function template, based on... (cpp_wcwidth): ...this function. Rework to use the above. Include "combining-chars.inc". (cpp_is_combining_char): New function Include "printable-chars.inc". (cpp_is_printable_char): New function * combining-chars.inc: New file, generated by contrib/unicode/gen-combining-chars.py. * include/cpplib.h (cpp_is_combining_char): New function decl. (cpp_is_printable_char): New function decl. * printable-chars.inc: New file, generated by contrib/unicode/gen-printable-chars.py. Signed-off-by: David Malcolm <dmalcolm@redhat.com>	2023-06-21 21:49:00 -04:00
Jakub Jelinek	63b25b8012	contrib: Update instructions regarding Unicode updates I've noticed we have instructions on how to update from newer Unicode standard, but it didn't mention uname2c.h regeneration. The following patch mentions that, also mentions that the Copyright years of Unicode should be updated and adds a copy of NameAliases.txt which is used for uname2c.h generation. 2023-03-16 Jakub Jelinek <jakub@redhat.com> * unicode/README: Update to mention also makeuname2c. * unicode/NameAliases.txt: New file.	2023-03-16 10:28:25 +01:00
Lewis Hyatt	73dd5c6c88	libcpp: Update cpp_wcwidth() to Unicode 15 Updates cpp_wcwidth() to Unicode 15, following the procedure in contrib/unicode/README mechanically without incident. contrib/ChangeLog: * unicode/DerivedCoreProperties.txt: Update to Unicode 15. * unicode/DerivedNormalizationProps.txt: Likewise. * unicode/EastAsianWidth.txt: Likewise. * unicode/PropList.txt: Likewise. * unicode/README: Likewise. * unicode/UnicodeData.txt: Likewise. libcpp/ChangeLog: * generated_cpp_wcwidth.h: Regenerated for Unicode 15.	2023-03-13 07:40:50 -04:00
Jakub Jelinek	83ffe9cde7	Update copyright years.	2023-01-16 11:52:17 +01:00
Lewis Hyatt	4fda776a2f	libcpp: Update ucnid.h to Unicode 14 This patch updates ucnid.h from Unicode 13 to Unicode 14. Additionally, the procedure detailed in contrib/unicode/README, which updates generated_wcwidth.h, has been expanded with instructions for updating this file as well, so that both may be done at the same time conveniently. Two additional Unicode data files which are needed to create ucnid.h are also added to source control in contrib/unicode. contrib/ChangeLog: * unicode/README: Added instructions for updating ucnid.h. * unicode/DerivedCoreProperties.txt: New file added to source control from Unicode 14.0 release. * unicode/DerivedNormalizationProps.txt: Likewise. libcpp/ChangeLog: * ucnid.h: Regenerated for Unicode 14.0.	2022-06-28 17:33:37 -04:00
Lewis Hyatt	57988cbe73	libcpp: Update cpp_wcwidth() to Unicode 14.0.0 The procedure detailed in contrib/unicode/README was followed with nothing notable coming up. The glibc scripts did not require any update, so the only change was retrieving new versions of the Unicode data files and rerunning gen_wcwidth.py. contrib/ChangeLog: * unicode/EastAsianWidth.txt: Update to Unicode 14.0.0. * unicode/PropList.txt: Likewise. * unicode/README: Likewise. * unicode/UnicodeData.txt: Likewise. libcpp/ChangeLog: * generated_cpp_wcwidth.h: Generated from updated Unicode data files.	2022-06-26 14:13:26 -04:00
David Malcolm	b050653c4c	contrib: add unicode/utf8-dump.py This script may be useful when debugging issues relating to Unicode encoding (e.g. when investigating source files with bidirectional control characters). It dumps a UTF-8 file as a list of numbered lines (mimicking GCC's diagnostic output format), interleaved with lines per character showing the Unicode codepoints, the UTF-8 encoding bytes, the name of the character, and, where printable, the characters themselves. The lines are printed in logical order, which may help the reader to grok the relationship between visual and logical ordering in bi-di files. For example: $ cat test.c int གྷ; const char *אבג = "ALEF-BET-GIMEL"; $ ./contrib/unicode/utf8-dump.py test.c 1 | int གྷ; | U+0069 0x69 LATIN SMALL LETTER I i | U+006E 0x6e LATIN SMALL LETTER N n | U+0074 0x74 LATIN SMALL LETTER T t | U+0020 0x20 SPACE (separator) | U+0F43 0xe0 0xbd 0x83 TIBETAN LETTER GHA གྷ | U+003B 0x3b SEMICOLON ; | U+000A 0x0a LINE FEED (LF) (control character) 2 | const char *אבג = "ALEF-BET-GIMEL"; | U+0063 0x63 LATIN SMALL LETTER C c | U+006F 0x6f LATIN SMALL LETTER O o | U+006E 0x6e LATIN SMALL LETTER N n | U+0073 0x73 LATIN SMALL LETTER S s | U+0074 0x74 LATIN SMALL LETTER T t | U+0020 0x20 SPACE (separator) | U+0063 0x63 LATIN SMALL LETTER C c | U+0068 0x68 LATIN SMALL LETTER H h | U+0061 0x61 LATIN SMALL LETTER A a | U+0072 0x72 LATIN SMALL LETTER R r | U+0020 0x20 SPACE (separator) | U+002A 0x2a ASTERISK * | U+05D0 0xd7 0x90 HEBREW LETTER ALEF א | U+05D1 0xd7 0x91 HEBREW LETTER BET ב | U+05D2 0xd7 0x92 HEBREW LETTER GIMEL ג | U+0020 0x20 SPACE (separator) | U+003D 0x3d EQUALS SIGN = | U+0020 0x20 SPACE (separator) | U+0022 0x22 QUOTATION MARK " | U+0041 0x41 LATIN CAPITAL LETTER A A | U+004C 0x4c LATIN CAPITAL LETTER L L | U+0045 0x45 LATIN CAPITAL LETTER E E | U+0046 0x46 LATIN CAPITAL LETTER F F | U+002D 0x2d HYPHEN-MINUS - | U+0042 0x42 LATIN CAPITAL LETTER B B | U+0045 0x45 LATIN CAPITAL LETTER E E | U+0054 0x54 LATIN CAPITAL LETTER T T | U+002D 0x2d HYPHEN-MINUS - | U+0047 0x47 LATIN CAPITAL LETTER G G | U+0049 0x49 LATIN CAPITAL LETTER I I | U+004D 0x4d LATIN CAPITAL LETTER M M | U+0045 0x45 LATIN CAPITAL LETTER E E | U+004C 0x4c LATIN CAPITAL LETTER L L | U+0022 0x22 QUOTATION MARK " | U+003B 0x3b SEMICOLON ; | U+000A 0x0a LINE FEED (LF) (control character) Tested with Python 3.8 contrib/ChangeLog: * unicode/utf8-dump.py: New file. Signed-off-by: David Malcolm <dmalcolm@redhat.com>	2021-11-01 11:52:28 -04:00
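The per-character lines in the dump above can be approximated with Python's standard `unicodedata` module. This is a minimal sketch, not the contents of utf8-dump.py: the helper name `describe` and the `(unnamed)` placeholder are assumptions, and the real script prints its own annotations such as "(separator)" and "(control character)".

```python
import unicodedata


def describe(ch: str) -> str:
    """Return code point, UTF-8 bytes, character name and, if printable, the character."""
    code_point = "U+%04X" % ord(ch)
    utf8 = " ".join("0x%02x" % b for b in ch.encode("utf-8"))
    # Control characters have no name in unicodedata, so supply a placeholder.
    name = unicodedata.name(ch, "(unnamed)")
    shown = ch if ch.isprintable() else ""
    return " ".join(part for part in (code_point, utf8, name, shown) if part)


for ch in "i\u05d0\n":
    print(describe(ch))
```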
Lewis Hyatt	497c9f8d4d	libcpp: Update cpp_wcwidth() to Unicode 13.0.0 generated_cpp_wcwidth.h was regenerated using Unicode 13.0.0 data files. No material changes to the parsing scripts (either GCC- or glibc-sourced) were necessary; glibc's utf8_gen.py was tweaked slightly by glibc and matched here. contrib/ChangeLog: * unicode/EastAsianWidth.txt: Update to Unicode 13.0.0. * unicode/PropList.txt: Likewise. * unicode/README: Likewise. * unicode/UnicodeData.txt: Likewise. * unicode/from_glibc/unicode_utils.py: Update to latest glibc version. * unicode/from_glibc/utf8_gen.py: Likewise. libcpp/ChangeLog: * generated_cpp_wcwidth.h: Regenerated from Unicode 13.0.0 data.	2020-11-07 09:36:43 -05:00
Lewis Hyatt	ee9256409f	Byte vs column awareness for diagnostic-show-locus.c (PR 49973) contrib/ChangeLog 2019-12-09 Lewis Hyatt <lhyatt@gmail.com> PR preprocessor/49973 * unicode/from_glibc/unicode_utils.py: Support script from glibc (commit 464cd3) to extract character widths from Unicode data files. * unicode/from_glibc/utf8_gen.py: Likewise. * unicode/UnicodeData.txt: Unicode v. 12.1.0 data file. * unicode/EastAsianWidth.txt: Likewise. * unicode/PropList.txt: Likewise. * unicode/gen_wcwidth.py: New utility to generate libcpp/generated_cpp_wcwidth.h with help from the glibc support scripts and the Unicode data files. * unicode/unicode-license.txt: Added. * unicode/README: New explanatory file. libcpp/ChangeLog 2019-12-09 Lewis Hyatt <lhyatt@gmail.com> PR preprocessor/49973 * generated_cpp_wcwidth.h: New file generated by ../contrib/unicode/gen_wcwidth.py, supports new cpp_wcwidth function. * charset.c (compute_next_display_width): New function to help implement display columns. (cpp_byte_column_to_display_column): Likewise. (cpp_display_column_to_byte_column): Likewise. (cpp_wcwidth): Likewise. * include/cpplib.h (cpp_byte_column_to_display_column): Declare. (cpp_display_column_to_byte_column): Declare. (cpp_wcwidth): Declare. (cpp_display_width): New function. gcc/ChangeLog 2019-12-09 Lewis Hyatt <lhyatt@gmail.com> PR preprocessor/49973 * input.c (location_compute_display_column): New function to help with multibyte awareness in diagnostics. (test_cpp_utf8): New self-test. (input_c_tests): Call the new test. * input.h (location_compute_display_column): Declare. * diagnostic-show-locus.c: Pervasive changes to add multibyte awareness to all classes and functions. (enum column_unit): New enum. (class exploc_with_display_col): New class. (class layout_point): Convert m_column member to array m_columns[2]. (layout_range::contains_point): Add col_unit argument. (test_layout_range_for_single_point): Pass new argument. (test_layout_range_for_single_line): Likewise. 
(test_layout_range_for_multiple_lines): Likewise. (line_bounds::convert_to_display_cols): New function. (layout::get_state_at_point): Add col_unit argument. (make_range): Use empty filename rather than dummy filename. (get_line_width_without_trailing_whitespace): Rename to... (get_line_bytes_without_trailing_whitespace): ...this. (test_get_line_width_without_trailing_whitespace): Rename to... (test_get_line_bytes_without_trailing_whitespace): ...this. (class layout): m_exploc changed to exploc_with_display_col from plain expanded_location. (layout::get_linenum_width): New accessor member function. (layout::get_x_offset_display): Likewise. (layout::calculate_linenum_width): New subroutine for the constructor. (layout::calculate_x_offset_display): Likewise. (layout::layout): Use the new subroutines. Add multibyte awareness. (layout::print_source_line): Add multibyte awareness. (layout::print_line): Likewise. (layout::print_annotation_line): Likewise. (line_label::line_label): Likewise. (layout::print_any_labels): Likewise. (layout::annotation_line_showed_range_p): Likewise. (get_printed_columns): Likewise. (class line_label): Rename m_length to m_display_width. (get_affected_columns): Rename to... (get_affected_range): ...this; add col_unit argument and multibyte awareness. (class correction): Add m_affected_bytes and m_display_cols members. Rename m_len to m_byte_length for clarity. Add multibyte awareness throughout. (correction::insertion_p): Add multibyte awareness. (correction::compute_display_cols): New function. (correction::ensure_terminated): Use new member name m_byte_length. (line_corrections::add_hint): Add multibyte awareness. (layout::print_trailing_fixits): Likewise. (layout::get_x_bound_for_row): Likewise. (test_one_liner_simple_caret_utf8): New self-test analogous to the one with _utf8 suffix removed, testing multibyte awareness. (test_one_liner_caret_and_range_utf8): Likewise. (test_one_liner_multiple_carets_and_ranges_utf8): Likewise. 
(test_one_liner_fixit_insert_before_utf8): Likewise. (test_one_liner_fixit_insert_after_utf8): Likewise. (test_one_liner_fixit_remove_utf8): Likewise. (test_one_liner_fixit_replace_utf8): Likewise. (test_one_liner_fixit_replace_non_equal_range_utf8): Likewise. (test_one_liner_fixit_replace_equal_secondary_range_utf8): Likewise. (test_one_liner_fixit_validation_adhoc_locations_utf8): Likewise. (test_one_liner_many_fixits_1_utf8): Likewise. (test_one_liner_many_fixits_2_utf8): Likewise. (test_one_liner_labels_utf8): Likewise. (test_diagnostic_show_locus_one_liner_utf8): Likewise. (test_overlapped_fixit_printing_utf8): Likewise. (test_overlapped_fixit_printing): Adapt for changes to get_affected_columns, get_printed_columns and class corrections. (test_overlapped_fixit_printing_2): Likewise. (test_linenum_sep): New constant. (test_left_margin): Likewise. (test_offset_impl): Helper function for new test. (test_layout_x_offset_display_utf8): New test. (diagnostic_show_locus_c_tests): Call new tests. gcc/testsuite/ChangeLog: 2019-12-09 Lewis Hyatt <lhyatt@gmail.com> PR preprocessor/49973 * gcc.dg/plugin/diagnostic_plugin_test_show_locus.c (test_show_locus): Tweak so that expected output is the same as before the diagnostic-show-locus.c changes. * gcc.dg/cpp/pr66415-1.c: Likewise. From-SVN: r279137	2019-12-09 20:03:47 +00:00
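The distinction between byte columns and display columns that the commit above introduces can be sketched in Python. This only approximates the idea and is not a port of cpp_byte_column_to_display_column: the width rule (0 for characters with a nonzero combining class, 2 for East Asian Wide or Fullwidth, 1 otherwise) is a simplifying assumption, and columns are taken to be 1-based with the byte column pointing at the start of a character.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate display width of one character (simplified assumption)."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def byte_column_to_display_column(line: bytes, byte_column: int) -> int:
    """Map a 1-based byte column in a UTF-8 line to a 1-based display column."""
    # Decode everything before the target byte and sum the widths.
    prefix = line[: byte_column - 1].decode("utf-8")
    return 1 + sum(char_width(ch) for ch in prefix)


line = "\u4e2d\u6587 x".encode("utf-8")  # two 3-byte wide characters, space, "x"
print(byte_column_to_display_column(line, 8))  # "x" is at byte 8
```

In this example the "x" sits at byte column 8 but display column 6, which is the kind of mismatch that misplaces carets when diagnostics count bytes.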

20 commits