diagnostics: escape non-ASCII source bytes for certain diagnostics

This patch adds support to GCC's diagnostic subsystem for escaping certain
bytes and Unicode characters when quoting source code.

Specifically, this patch adds a new flag rich_location::m_escape_on_output
which is a hint from a diagnostic that non-ASCII bytes in the pertinent
lines of the user's source code should be escaped when printed.

The patch sets this for the following diagnostics:
- when complaining about stray bytes in the program (when these
  are non-printable)
- when complaining about "null character(s) ignored"
- for -Wnormalized= (and generate source ranges for such warnings)

The escaping is controlled by a new option:
  -fdiagnostics-escape-format=[unicode|bytes]

For example, consider a diagnostic involving a source line containing the
string "before" followed by the Unicode character U+03C0 ("GREEK SMALL
LETTER PI", with UTF-8 encoding 0xCF 0x80) followed by the byte 0xBF
(a stray UTF-8 trailing byte), followed by the string "after", where the
diagnostic highlights the U+03C0 character.

By default, this line will be printed verbatim to the user when
reporting a diagnostic at it, as:

  beforeπXafter
        ^

(using X for the stray byte to avoid putting invalid UTF-8 in this
commit message)

If the diagnostic sets the "escape" flag, it will be printed as:

  before<U+03C0><BF>after
        ^~~~~~~~

with -fdiagnostics-escape-format=unicode (the default), or as:

  before<CF><80><BF>after
        ^~~~~~~~

if the user supplies -fdiagnostics-escape-format=bytes.

This only affects how the source is printed; it does not affect
how column numbers are printed (as per -fdiagnostics-column-unit=
and -fdiagnostics-column-origin=).

gcc/c-family/ChangeLog:
	* c-lex.c (c_lex_with_flags): When complaining about non-printable
	CPP_OTHER tokens, set the "escape on output" flag.

gcc/ChangeLog:
	* common.opt (fdiagnostics-escape-format=): New.
	(diagnostics_escape_format): New enum.
	(DIAGNOSTICS_ESCAPE_FORMAT_UNICODE): New enum value.
	(DIAGNOSTICS_ESCAPE_FORMAT_BYTES): Likewise.
	* diagnostic-format-json.cc (json_end_diagnostic): Add
	"escape-source" attribute.
	* diagnostic-show-locus.c
	(exploc_with_display_col::exploc_with_display_col): Replace
	"tabstop" param with a cpp_char_column_policy and add an "aspect"
	param. Use these to compute m_display_col accordingly.
	(struct char_display_policy): New struct.
	(layout::m_policy): New field.
	(layout::m_escape_on_output): New field.
	(def_policy): New function.
	(make_range): Update for changes to exploc_with_display_col ctor.
	(default_print_decoded_ch): New.
	(width_per_escaped_byte): New.
	(escape_as_bytes_width): New.
	(escape_as_bytes_print): New.
	(escape_as_unicode_width): New.
	(escape_as_unicode_print): New.
	(make_policy): New.
	(layout::layout): Initialize new fields. Update m_exploc ctor
	call for above change to ctor.
	(layout::maybe_add_location_range): Update for changes to
	exploc_with_display_col ctor.
	(layout::calculate_x_offset_display): Update for change to
	cpp_display_width.
	(layout::print_source_line): Pass policy to
	cpp_display_width_computation. Capture cpp_decoded_char when
	calling process_next_codepoint. Move printing of source code to
	m_policy.m_print_cb.
	(line_label::line_label): Pass in policy rather than context.
	(layout::print_any_labels): Update for change to line_label ctor.
	(get_affected_range): Pass in policy rather than context, updating
	calls to location_compute_display_column accordingly.
	(get_printed_columns): Likewise, also for cpp_display_width.
	(correction::correction): Pass in policy rather than tabstop.
	(correction::compute_display_cols): Pass m_policy rather than
	m_tabstop to cpp_display_width.
	(correction::m_tabstop): Replace with...
	(correction::m_policy): ...this.
	(line_corrections::line_corrections): Pass in policy rather than
	context.
	(line_corrections::m_context): Replace with...
	(line_corrections::m_policy): ...this.
	(line_corrections::add_hint): Update to use m_policy rather than
	m_context.
	(line_corrections::add_hint): Likewise.
	(layout::print_trailing_fixits): Likewise.
	(selftest::test_display_widths): New.
	(selftest::test_layout_x_offset_display_utf8): Update to use
	policy rather than tabstop.
	(selftest::test_one_liner_labels_utf8): Add test of escaping
	source lines.
	(selftest::test_diagnostic_show_locus_one_liner_utf8): Update to
	use policy rather than tabstop.
	(selftest::test_overlapped_fixit_printing): Likewise.
	(selftest::test_overlapped_fixit_printing_utf8): Likewise.
	(selftest::test_overlapped_fixit_printing_2): Likewise.
	(selftest::test_tab_expansion): Likewise.
	(selftest::test_escaping_bytes_1): New.
	(selftest::test_escaping_bytes_2): New.
	(selftest::diagnostic_show_locus_c_tests): Call the new tests.
	* diagnostic.c (diagnostic_initialize): Initialize
	context->escape_format.
	(convert_column_unit): Update to use default character width
	policy.
	(selftest::test_diagnostic_get_location_text): Likewise.
	* diagnostic.h (enum diagnostics_escape_format): New enum.
	(diagnostic_context::escape_format): New field.
	* doc/invoke.texi (-fdiagnostics-escape-format=): New option.
	(-fdiagnostics-format=): Add "escape-source" attribute to examples
	of JSON output, and document it.
	* input.c (location_compute_display_column): Pass in "policy"
	rather than "tabstop", passing to
	cpp_byte_column_to_display_column.
	(selftest::test_cpp_utf8): Update to use cpp_char_column_policy.
	* input.h (class cpp_char_column_policy): New forward decl.
	(location_compute_display_column): Pass in "policy" rather than
	"tabstop".
	* opts.c (common_handle_option): Handle
	OPT_fdiagnostics_escape_format_.
	* selftest.c (temp_source_file::temp_source_file): New ctor
	overload taking a size_t.
	* selftest.h (temp_source_file::temp_source_file): Likewise.

gcc/testsuite/ChangeLog:
	* c-c++-common/diagnostic-format-json-1.c: Add regexp to consume
	"escape-source" attribute.
	* c-c++-common/diagnostic-format-json-2.c: Likewise.
	* c-c++-common/diagnostic-format-json-3.c: Likewise.
	* c-c++-common/diagnostic-format-json-4.c: Likewise, twice.
	* c-c++-common/diagnostic-format-json-5.c: Likewise.
	* gcc.dg/cpp/warn-normalized-4-bytes.c: New test.
	* gcc.dg/cpp/warn-normalized-4-unicode.c: New test.
	* gcc.dg/encoding-issues-bytes.c: New test.
	* gcc.dg/encoding-issues-unicode.c: New test.
	* gfortran.dg/diagnostic-format-json-1.F90: Add regexp to consume
	"escape-source" attribute.
	* gfortran.dg/diagnostic-format-json-2.F90: Likewise.
	* gfortran.dg/diagnostic-format-json-3.F90: Likewise.

libcpp/ChangeLog:
	* charset.c (convert_escape): Use encoding_rich_location when
	complaining about nonprintable unknown escape sequences.
	(cpp_display_width_computation::cpp_display_width_computation):
	Pass in policy rather than tabstop.
	(cpp_display_width_computation::process_next_codepoint): Add "out"
	param and populate *out if non-NULL.
	(cpp_display_width_computation::advance_display_cols): Pass NULL
	to process_next_codepoint.
	(cpp_byte_column_to_display_column): Pass in policy rather than
	tabstop. Pass NULL to process_next_codepoint.
	(cpp_display_column_to_byte_column): Pass in policy rather than
	tabstop.
	* errors.c (cpp_diagnostic_get_current_location): New function,
	splitting out the logic from...
	(cpp_diagnostic): ...here.
	(cpp_warning_at): New function.
	(cpp_pedwarning_at): New function.
	* include/cpplib.h (cpp_warning_at): New decl for rich_location.
	(cpp_pedwarning_at): Likewise.
	(struct cpp_decoded_char): New.
	(struct cpp_char_column_policy): New.
	(cpp_display_width_computation::cpp_display_width_computation):
	Replace "tabstop" param with "policy".
	(cpp_display_width_computation::process_next_codepoint): Add "out"
	param.
	(cpp_display_width_computation::m_tabstop): Replace with...
	(cpp_display_width_computation::m_policy): ...this.
	(cpp_byte_column_to_display_column): Replace "tabstop" param with
	"policy".
	(cpp_display_width): Likewise.
	(cpp_display_column_to_byte_column): Likewise.
	* include/line-map.h (rich_location::escape_on_output_p): New.
	(rich_location::set_escape_on_output): New.
	(rich_location::m_escape_on_output): New.
	* internal.h (cpp_diagnostic_get_current_location): New decl.
	(class encoding_rich_location): New.
	* lex.c (skip_whitespace): Use encoding_rich_location when
	complaining about null characters.
	(warn_about_normalization): Generate a source range when
	complaining about improperly normalized tokens, rather than just a
	point, and use encoding_rich_location so that the source code is
	escaped on printing.
	* line-map.c (rich_location::rich_location): Initialize
	m_escape_on_output.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
2021-10-18 18:55:31 -04:00 · bd5e882cf6
commit bd5e882cf6
parent 91bac9fed5
31 changed files with 942 additions and 168 deletions
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1268,6 +1268,14 @@ extern bool cpp_warning_syshdr (cpp_reader *, enum cpp_warning_reason reason,
 				const char *msgid, ...)
  ATTRIBUTE_PRINTF_3;

+/* As their counterparts above, but use RICHLOC.  */
+extern bool cpp_warning_at (cpp_reader *, enum cpp_warning_reason,
+			    rich_location *richloc, const char *msgid, ...)
+  ATTRIBUTE_PRINTF_4;
+extern bool cpp_pedwarning_at (cpp_reader *, enum cpp_warning_reason,
+			       rich_location *richloc, const char *msgid, ...)
+  ATTRIBUTE_PRINTF_4;
+
 /* Output a diagnostic with "MSGID: " preceding the
   error string of errno.  No location is printed.  */
 extern bool cpp_errno (cpp_reader *, enum cpp_diagnostic_level,
@@ -1442,43 +1450,95 @@ extern const char * cpp_get_userdef_suffix

 /* In charset.c */

+/* The result of attempting to decode a run of UTF-8 bytes.  */
+
+struct cpp_decoded_char
+{
+  const char *m_start_byte;
+  const char *m_next_byte;
+
+  bool m_valid_ch;
+  cppchar_t m_ch;
+};
+
+/* Information for mapping between code points and display columns.
+
+   This is a tabstop value, along with a callback for getting the
+   widths of characters.  Normally this callback is cpp_wcwidth, but we
+   support other schemes for escaping non-ASCII unicode as a series of
+   ASCII chars when printing the user's source code in diagnostic-show-locus.c
+
+   For example, consider:
+   - the Unicode character U+03C0 "GREEK SMALL LETTER PI" (UTF-8: 0xCF 0x80)
+   - the Unicode character U+1F642 "SLIGHTLY SMILING FACE"
+     (UTF-8: 0xF0 0x9F 0x99 0x82)
+   - the byte 0xBF (a stray trailing byte of a UTF-8 character)
+   Normally U+03C0 would occupy one display column, U+1F642
+   would occupy two display columns, and the stray byte would be
+   printed verbatim as one display column.
+
+   However when escaping them as unicode code points as "<U+03C0>"
+   and "<U+1F642>" they occupy 8 and 9 display columns respectively,
+   and when escaping them as bytes as "<CF><80>" and "<F0><9F><99><82>"
+   they occupy 8 and 16 display columns respectively.  In both cases
+   the stray byte is escaped to <BF> as 4 display columns.  */
+
+struct cpp_char_column_policy
+{
+  cpp_char_column_policy (int tabstop,
+			  int (*width_cb) (cppchar_t c))
+  : m_tabstop (tabstop),
+    m_undecoded_byte_width (1),
+    m_width_cb (width_cb)
+  {}
+
+  int m_tabstop;
+  /* Width in display columns of a stray byte that isn't decodable
+     as UTF-8.  */
+  int m_undecoded_byte_width;
+  int (*m_width_cb) (cppchar_t c);
+};
+
 /* A class to manage the state while converting a UTF-8 sequence to cppchar_t
   and computing the display width one character at a time.  */
 class cpp_display_width_computation {
 public:
  cpp_display_width_computation (const char *data, int data_length,
-				 int tabstop);
+				 const cpp_char_column_policy &policy);
  const char *next_byte () const { return m_next; }
  int bytes_processed () const { return m_next - m_begin; }
  int bytes_left () const { return m_bytes_left; }
  bool done () const { return !bytes_left (); }
  int display_cols_processed () const { return m_display_cols; }

-  int process_next_codepoint ();
+  int process_next_codepoint (cpp_decoded_char *out);
  int advance_display_cols (int n);

 private:
  const char *const m_begin;
  const char *m_next;
  size_t m_bytes_left;
-  const int m_tabstop;
+  const cpp_char_column_policy &m_policy;
  int m_display_cols;
 };

 /* Convenience functions that are simple use cases for class
   cpp_display_width_computation.  Tab characters will be expanded to spaces
-   as determined by TABSTOP.  */
+   as determined by POLICY.m_tabstop, and non-printable-ASCII characters
+   will be escaped as per POLICY.  */

 int cpp_byte_column_to_display_column (const char *data, int data_length,
-				       int column, int tabstop);
+				       int column,
+				       const cpp_char_column_policy &policy);
 inline int cpp_display_width (const char *data, int data_length,
-			      int tabstop)
+			      const cpp_char_column_policy &policy)
 {
  return cpp_byte_column_to_display_column (data, data_length, data_length,
-					    tabstop);
+					    policy);
 }
 int cpp_display_column_to_byte_column (const char *data, int data_length,
-				       int display_col, int tabstop);
+				       int display_col,
+				       const cpp_char_column_policy &policy);
 int cpp_wcwidth (cppchar_t c);

 bool cpp_input_conversion_is_trivial (const char *input_charset);