c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]
The following patch implements the easy parts of the paper.
When @$` are added to the basic character set, it means that
R"@$`()@$`" should now be valid (here I've noticed most of the
raw string tests were tested solely with -std=c++11 or -std=gnu++11
and I've tried to change that), and on the other side even if
by extension $ is allowed in identifiers, \u0024 or \U00000024
or \u{24} should not be, similarly to how \u0041 is not allowed.

The paper in 3.1 claims though that
  #include <stdio.h>

  #define STR(x) #x

  int main()
  {
    printf("%s", STR(\u0060)); // U+0060 is ` GRAVE ACCENT
  }
should have been accepted before this paper (and rejected after it),
but g++ rejects it.

I've tried to understand it, but am confused on what is the right
behavior and why.

Consider
  #define STR(x) #x
  const char *a = "\u00b7";
  const char *b = STR(\u00b7);
  const char *c = "\u0041";
  const char *d = STR(\u0041);
  const char *e = STR(a\u00b7);
  const char *f = STR(a\u0041);
  const char *g = STR(a \u00b7);
  const char *h = STR(a \u0041);
  const char *i = "\u066d";
  const char *j = STR(\u066d);
  const char *k = "\u0040";
  const char *l = STR(\u0040);
  const char *m = STR(a\u066d);
  const char *n = STR(a\u0040);
  const char *o = STR(a \u066d);
  const char *p = STR(a \u0040);
Neither clang nor gcc emit any diagnostics on the a, c, i and k
initializers, those are certainly valid (c is invalid in C23 though).
With -pedantic-errors, g++ emits errors on all the others, while
clang++ only on the ones with STR involving \u0041, \u0040 and a\u066d.
The chosen values are \u0040 '@' as something being changed by this
paper, \u0041 'A' as a basic character set char valid in identifiers
before/after, \u00b7 as an example of a character which is pedantically
valid in identifiers if not at the start and \u066d as something
pedantically not valid in identifiers.
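As a rough illustration of why those four values were picked, the sketch below models membership in the basic character set before and after P2558R2. The name in_basic_charset and its cxx26 flag are hypothetical and exist only for this note; nothing like it is claimed to be in libcpp.

```cpp
#include <cassert>

// Hypothetical helper (not a libcpp function): is code point c a member of
// the basic character set?  cxx26 selects the set as extended by P2558R2.
constexpr bool in_basic_charset(char32_t c, bool cxx26)
{
  // Whitespace members: HT, LF, VT, FF and space.
  return ((c >= 0x09 && c <= 0x0C) || c == 0x20)
             ? true
         // Everything else outside printable ASCII is not in the set.
         : (c < 0x21 || c > 0x7E)
             ? false
         // '$', '@' and '`' only join the set with P2558R2.
         : (c == 0x24 || c == 0x40 || c == 0x60)
             ? cxx26
             : true;
}

// The four values used in the STR examples above.
static_assert(!in_basic_charset(0x40, false), "@ is outside the set before the paper");
static_assert(in_basic_charset(0x40, true), "@ is inside the set in C++26");
static_assert(in_basic_charset(0x41, false), "A is always inside the set");
static_assert(!in_basic_charset(0xB7, true), "U+00B7 is never inside the set");
static_assert(!in_basic_charset(0x66D, true), "U+066D is never inside the set");
```

Only \u0040 changes its answer between the two modes, which is why it is the interesting case for this paper.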
Now, https://eel.is/c++draft/lex.charset#6 says that a UCN used outside
of a string/character literal which corresponds to a basic character set
character (or control character) is ill-formed, which would make the
d, f, h cases invalid for C++ and the l, n, p cases invalid for C++26.
https://eel.is/c++draft/lex.name states which characters can appear
at the start of the identifier and which can appear after the start.
And https://eel.is/c++draft/lex.pptoken states that a preprocessing-token
is either an identifier, or tons of other things, or "each non-whitespace
character that cannot be one of the above".
Then https://eel.is/c++draft/lex.pptoken#1 says that this last category
is invalid if the preprocessing token is being converted into a token.
And https://eel.is/c++draft/lex.pptoken#2 includes "If any character not
in the basic character set matches the last category, the program is
ill-formed."

Now, e.g. for the C++23 STR(\u0040) case, \u0040 is there not in the
basic character set, so valid outside of the literals (not the case
anymore in C++26), but it isn't a nondigit and doesn't have the XID_Start
property, so it isn't IMHO an identifier and so must be the "each
non-whitespace character that cannot be one of the above" case.
Why doesn't the above mentioned https://eel.is/c++draft/lex.pptoken#2
sentence make that invalid?  Ignoring that, I'd say it would be then
stringized and that feels like it is what clang++ is doing.

Now, e.g. for the STR(a\u066d) case, I wonder why that isn't lexed as
the identifier a followed by a \u066d "each non-whitespace character that
cannot be one of the above" token and stringified similarly; clang++
rejects that.

What GCC libcpp seems to be doing is that forms_identifier_p calls
_cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells whether
the character is first or second+ in the identifier, and e.g.
_cpp_valid_ucn then for UCNs valid in string literals calls
  else if (identifier_pos)
    {
      int validity = ucn_valid_in_identifier (pfile, result, nst);

      if (validity == 0)
        cpp_error (pfile, CPP_DL_ERROR,
                   "universal character %.*s is not valid in an identifier",
                   (int) (str - base), base);
      else if (validity == 2 && identifier_pos == 1)
        cpp_error (pfile, CPP_DL_ERROR,
                   "universal character %.*s is not valid at the start of an identifier",
                   (int) (str - base), base);
    }
so basically all those invalid in identifiers cases emit an error and
pretend to be valid in identifiers, rather than what e.g. _cpp_valid_utf8
does for C but not for C++ and only for the chars completely invalid
in identifiers rather than just valid in identifiers but not at the start:
  /* In C++, this is an error for invalid character in an identifier
     because logically, the UTF-8 was converted to a UCN during
     translation phase 1 (even though we don't physically do it that
     way).  In C, this byte rather becomes grammatically a separate
     token.  */
  if (CPP_OPTION (pfile, cplusplus))
    cpp_error (pfile, CPP_DL_ERROR,
               "extended character %.*s is not valid in an identifier",
               (int) (*pstr - base), base);
  else
    {
      *pstr = base;
      return false;
    }
The comment doesn't really match what is done in recent C++ versions
because there UCNs are translated to characters and not the other way
around.

2024-07-25  Jakub Jelinek  <jakub@redhat.com>

	PR c++/110343
libcpp/
	* lex.cc: C++26 P2558R2 - Add @, $, and ` to the basic character set.
	(lex_raw_string): For C++26 allow $@` characters in prefix.
	* charset.cc (_cpp_valid_ucn): For C++26 reject \u0024 in identifiers.
gcc/testsuite/
	* c-c++-common/raw-string-1.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-2.c: Likewise.
	* c-c++-common/raw-string-4.c: Likewise.
	* c-c++-common/raw-string-5.c: Likewise.  Expect some diagnostics
	only for non-c++26, for c++26 expect different.
	* c-c++-common/raw-string-6.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-11.c: Likewise.
	* c-c++-common/raw-string-13.c: Likewise.
	* c-c++-common/raw-string-14.c: Likewise.
	* c-c++-common/raw-string-15.c: Use { c || c++11 } effective target,
	change c++ specific dg-options to just -Wtrigraphs.
	* c-c++-common/raw-string-16.c: Likewise.
	* c-c++-common/raw-string-17.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-18.c: Use { c || c++11 } effective target,
	remove -std=c++11 from c++ specific dg-options.
	* c-c++-common/raw-string-19.c: Likewise.
	* g++.dg/cpp26/raw-string1.C: New test.
	* g++.dg/cpp26/raw-string2.C: New test.
parent 34fb0feca7
commit 29341f21ce
17 changed files with 50 additions and 34 deletions
@@ -1,7 +1,6 @@
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
@@ -1,7 +1,7 @@
 // PR preprocessor/48740
+// { dg-do run { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -trigraphs -save-temps" { target c } }
-// { dg-options "-std=c++0x -save-temps" { target c++ } }
-// { dg-do run }
+// { dg-options "-save-temps" { target c++ } }
 
 int main ()
 {
@@ -9,4 +9,3 @@ int main ()
 "foo%sbar%sfred%sbob?""?""?""?""?",
 sizeof ("foo%sbar%sfred%sbob?""?""?""?""?"));
 }
-
@@ -1,8 +1,7 @@
 // PR preprocessor/57620
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++11" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
@@ -1,7 +1,6 @@
 // PR preprocessor/57620
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -trigraphs" { target c } }
-// { dg-options "-std=c++11" { target c++ } }
 
 const void *s0 = R"abc\
 def()abcdef" 0;
@@ -1,8 +1,8 @@
 // PR preprocessor/57620
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -Wtrigraphs" { target c } }
-// { dg-options "-std=gnu++11 -Wtrigraphs" { target c++ } }
+// { dg-options "-Wtrigraphs" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
@@ -1,7 +1,7 @@
 // PR preprocessor/57620
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -Wtrigraphs" { target c } }
-// { dg-options "-std=gnu++11 -Wtrigraphs" { target c++ } }
+// { dg-options "-Wtrigraphs" { target c++ } }
 
 const void *s0 = R"abc\
 def()abcdef" 0;
@@ -1,7 +1,6 @@
 /* PR preprocessor/57824 */
-/* { dg-do run } */
+/* { dg-do run { target { c || c++11 } } } */
 /* { dg-options "-std=gnu99" { target c } } */
-/* { dg-options "-std=c++11" { target c++ } } */
 
 #define S(s) s
 #define T(s) s "\n"
@@ -1,7 +1,7 @@
 /* PR preprocessor/57824 */
-/* { dg-do compile } */
+/* { dg-do compile { target { c || c++11 } } } */
 /* { dg-options "-std=gnu99 -fdump-tree-optimized-lineno" { target c } } */
-/* { dg-options "-std=c++11 -fdump-tree-optimized-lineno" { target c++ } } */
+/* { dg-options "-fdump-tree-optimized-lineno" { target c++ } } */
 
 const char x[] = R"(
 abc
@@ -1,7 +1,7 @@
 /* PR preprocessor/57824 */
-/* { dg-do compile } */
+// { dg-do compile { target { c || c++11 } } }
 /* { dg-options "-std=gnu99 -fdump-tree-optimized-lineno -save-temps" { target c } } */
-/* { dg-options "-std=c++11 -fdump-tree-optimized-lineno -save-temps" { target c++ } } */
+/* { dg-options "-fdump-tree-optimized-lineno -save-temps" { target c++ } } */
 
 const char x[] = R"(
 abc
@@ -1,7 +1,6 @@
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
@@ -1,7 +1,6 @@
 // R is not applicable for character literals.
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const int i0 = R'a'; // { dg-error "was not declared|undeclared" "undeclared" }
 // { dg-error "expected ',' or ';'" "expected" { target c } .-1 }
@@ -1,6 +1,5 @@
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const void *s0 = R"0123456789abcdefg()0123456789abcdefg" 0;
 // { dg-error "raw string delimiter longer" "longer" { target *-*-* } .-1 }
@@ -15,12 +14,18 @@ const void *s3 = R")())" 0;
 // { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
 // { dg-error "stray" "stray" { target *-*-* } .-2 }
 const void *s4 = R"@()@" 0;
-// { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
-// { dg-error "stray" "stray" { target *-*-* } .-2 }
+// { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+// { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+// { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
 const void *s5 = R"$()$" 0;
-// { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
-// { dg-error "stray" "stray" { target *-*-* } .-2 }
-const void *s6 = R"\u0040()\u0040" 0;
+// { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+// { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+// { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
+const void *s6 = R"`()`" 0;
+// { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+// { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+// { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
+const void *s7 = R"\u0040()\u0040" 0;
 // { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
 // { dg-error "stray" "stray" { target *-*-* } .-2 }
 
@@ -1,6 +1,5 @@
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const void *s0 = R"ouch()ouCh"; // { dg-error "unterminated raw string" "unterminated" }
 // { dg-error "at end of input" "end" { target *-*-* } .-1 }
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp26/raw-string1.C
@@ -0,0 +1,4 @@
+// C++26 P2558R2 - Add @, $, and ` to the basic character set
+// { dg-do compile { target c++26 } }
+
+const char *s0 = R"`@$$@`@`$()`@$$@`@`$";
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp26/raw-string2.C
@@ -0,0 +1,7 @@
+// C++26 P2558R2 - Add @, $, and ` to the basic character set
+// { dg-do compile { target { ! { avr*-*-* mmix*-*-* *-*-aix* } } } }
+// { dg-options "" }
+
+int a$b;
+int a\u0024c; // { dg-error "universal character \\\\u0024 is not valid in an identifier" "" { target c++26 } }
+int a\U00000024d; // { dg-error "universal character \\\\U00000024 is not valid in an identifier" "" { target c++26 } }
@@ -1808,7 +1808,12 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
       result = 1;
     }
   else if (identifier_pos && result == 0x24
-           && CPP_OPTION (pfile, dollars_in_ident))
+           && CPP_OPTION (pfile, dollars_in_ident)
+           /* In C++26 when dollars are allowed in identifiers,
+              we should still reject \u0024 as $ is part of the basic
+              character set.  */
+           && !(CPP_OPTION (pfile, cplusplus)
+                && CPP_OPTION (pfile, lang) > CLK_CXX23))
     {
       if (CPP_OPTION (pfile, warn_dollars) && !pfile->state.skipping)
         {
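The condition changed in this hunk can be read in isolation as a small predicate. The sketch below is only a model of that one condition: the Opts struct, its cxx26_or_later field (standing in for CPP_OPTION (pfile, lang) > CLK_CXX23) and the function name are hypothetical, and the rest of _cpp_valid_ucn is deliberately left out.

```cpp
#include <cassert>

// Hypothetical stand-in for the few libcpp options the condition reads.
struct Opts
{
  bool cplusplus;        // compiling C++ rather than C
  bool cxx26_or_later;   // models CPP_OPTION (pfile, lang) > CLK_CXX23
  bool dollars_in_ident; // '$' accepted in identifiers as an extension
};

// True when a UCN naming U+0024 inside an identifier takes the
// "treat it like a literal $" branch; false means it falls through to the
// later identifier validity checks instead.
constexpr bool ucn_dollar_taken_as_dollar(const Opts &o, bool identifier_pos,
                                          char32_t cp)
{
  return identifier_pos && cp == 0x24 && o.dollars_in_ident
         && !(o.cplusplus && o.cxx26_or_later);
}

// C++23 with the $ extension: the branch is still taken.
static_assert(ucn_dollar_taken_as_dollar(Opts{true, false, true}, true, 0x24),
              "pre-C++26 behaviour is unchanged");
// C++26 with the $ extension: the branch is skipped.
static_assert(!ucn_dollar_taken_as_dollar(Opts{true, true, true}, true, 0x24),
              "C++26 no longer takes the $ branch for \\u0024");
```

This matches what raw-string2.C expects: a$b stays accepted, while a\u0024c is diagnosed only for the c++26 target.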
@@ -2718,7 +2718,10 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
               || c == '*' || c == '+' || c == '-' || c == '/'
               || c == '^' || c == '&' || c == '|' || c == '~'
               || c == '!' || c == '=' || c == ','
-              || c == '"' || c == '\''))
+              || c == '"' || c == '\''
+              || ((c == '$' || c == '@' || c == '`')
+                  && CPP_OPTION (pfile, cplusplus)
+                  && CPP_OPTION (pfile, lang) > CLK_CXX23)))
         prefix[prefix_len++] = c;
       else
         {
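For readers who want the effect of this hunk without the surrounding lexer, here is a minimal standalone sketch of a raw string delimiter check. It assumes the usual rule that a delimiter is at most 16 characters from the basic character set, excluding space, parentheses, backslash and control characters; is_delimiter_char and is_valid_delimiter are hypothetical names, not libcpp code, and the cxx26 flag stands in for the cplusplus and lang > CLK_CXX23 test.

```cpp
#include <cassert>

// Hypothetical helper: may c appear in a raw string delimiter?
constexpr bool is_delimiter_char(char c, bool cxx26)
{
  // '$', '@' and '`' are only allowed once P2558R2 applies.
  return (c == '$' || c == '@' || c == '`')
             ? cxx26
             // Otherwise: printable ASCII except space, parentheses, backslash.
             : (c > 0x20 && c < 0x7F && c != '(' && c != ')' && c != '\\');
}

// Hypothetical helper: is the NUL-terminated string d usable as a delimiter?
constexpr bool is_valid_delimiter(const char *d, bool cxx26)
{
  int len = 0;
  for (; d[len] != '\0'; ++len)
    if (!is_delimiter_char(d[len], cxx26))
      return false;
  return len <= 16; // delimiters longer than 16 characters are rejected
}

// The delimiter used in g++.dg/cpp26/raw-string1.C.
static_assert(is_valid_delimiter("`@$$@`@`$", true), "valid with P2558R2");
static_assert(!is_valid_delimiter("`@$$@`@`$", false), "invalid without it");
// The 17-character delimiter from raw-string-5.c.
static_assert(!is_valid_delimiter("0123456789abcdefg", true), "too long");
```

The sketch needs C++14 or later because of the loop in a constexpr function.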