* cpplex.c: add comment describing lexer algorithm.

From-SVN: r33443
2000-04-26 10:17:32 +00:00 · 2000-04-26 10:17:32 +00:00 · d6d5f7955b
commit d6d5f7955b
parent 6f0ae5b4f2
1 changed files with 90 additions and 0 deletions
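
The token-revision scheme that the new comment describes can be illustrated with a minimal sketch. This is not the cpplib implementation: the names `tok_kind`, `tok_stack`, `tok_push` and `lex_logical_line` are hypothetical, only the `~`, `=`, backslash, newline and backslash-trigraph handlers are modelled, and every other character becomes a generic token.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for illustration; not cpplib's own names. */
enum tok_kind { TOK_COMPL, TOK_COMPL_EQ, TOK_EQ, TOK_BACKSLASH,
                TOK_VSPACE, TOK_OTHER };

struct tok {
  enum tok_kind kind;
  int prev_white;               /* nonzero if whitespace preceded the token */
};

#define MAX_TOKS 64

struct tok_stack {
  struct tok toks[MAX_TOKS];
  size_t n;
};

void tok_push(struct tok_stack *s, enum tok_kind kind, int prev_white)
{
  assert(s->n < MAX_TOKS);
  s->toks[s->n].kind = kind;
  s->toks[s->n].prev_white = prev_white;
  s->n++;
}

/* Lex one logical line of SRC into S.  Returns the number of
   characters consumed, including the terminating newline if any.  */
size_t lex_logical_line(const char *src, struct tok_stack *s)
{
  const char *p = src;
  int white = 0;                /* whitespace seen since the last token? */

  s->n = 0;
  while (*p) {
    char c = *p++;

    /* '?' handler: look ahead for the backslash trigraph ('?', '?', '/')
       and treat it exactly like a literal backslash.  */
    if (c == '?' && p[0] == '?' && p[1] == '/') {
      c = '\\';
      p += 2;
    }

    switch (c) {
    case ' ':
    case '\t':
      white = 1;
      continue;
    case '\\':
      tok_push(s, TOK_BACKSLASH, white);
      break;
    case '\n':
      if (s->n > 0 && s->toks[s->n - 1].kind == TOK_BACKSLASH && !white) {
        /* Escaped newline: discard both the backslash and the newline,
           restoring the whitespace state from before the backslash.  */
        s->n--;
        white = s->toks[s->n].prev_white;
        continue;
      }
      tok_push(s, TOK_VSPACE, white);
      return (size_t)(p - src);
    case '=':
      if (s->n > 0 && s->toks[s->n - 1].kind == TOK_COMPL && !white)
        s->toks[s->n - 1].kind = TOK_COMPL_EQ;  /* revise "~" to "~=" */
      else
        tok_push(s, TOK_EQ, white);
      break;
    case '~':
      tok_push(s, TOK_COMPL, white);
      break;
    default:
      tok_push(s, TOK_OTHER, white);
      break;
    }
    white = 0;
  }
  return (size_t)(p - src);
}
```

Under these assumptions, lexing a `~`, a backslash trigraph, a newline and an `=` yields a single `TOK_COMPL_EQ` followed by `TOK_VSPACE`, whereas `~ =` stays as two tokens with `prev_white` set on the second. When writing such inputs as C string literals, spell the trigraph with the `\?` escape (for example `"~?\?/\n=\n"`) so that the compiler itself does not replace it.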
--- a/gcc/cpplex.c
+++ b/gcc/cpplex.c
@@ -2050,6 +2050,96 @@ _cpp_init_input_buffer (pfile)

 #if 0

+/* Lexing algorithm.
+
+ The original lexer in cpplib was made up of two passes: a first pass
+ that replaced trigraphs and deleted escaped newlines, and a second
+ pass that tokenized the result of the first pass.  Tokenisation was
+ performed by peeking at the next character in the input stream.  For
+ example, if the input stream contained "~=", the handler for the ~
+ character would peek at the next character, and if it were a '='
+ would skip over it, and return a "~=" token, otherwise it would
+ return just the "~" token.
+
+ To implement a single-pass lexer, this peeking ahead is unworkable.
+ An arbitrary number of escaped newlines, and trigraphs (in particular
+ ??/ which translates to the escape \), could separate the '~' and '='
+ in the input stream, yet the next token is still a "~=".
+
+ Suppose instead that we lex by one logical line at a time, producing
+ a token list or stack for each logical line, and when seeing the '~'
+ push a CPP_COMPLEMENT token on the list.  Then if the '~' is part of
+ a longer token ("~=") we know we must see the remainder of the token
+ by the time we reach the end of the logical line.  Thus we can have
+ the '=' handler look at the previous token (at the end of the list /
+ top of the stack) and see if it is a "~" token, and if so, instead of
+ pushing a "=" token revise the existing token to be a "~=" token.
+
+ This works in the presence of escaped newlines, because the '\' would
+ have been pushed on the top of the stack as a CPP_BACKSLASH.  The
+ newline ('\n' or '\r') handler looks at the token at the top of the
+ stack to see if it is a CPP_BACKSLASH, and if so discards both.
+ Otherwise it pushes the newline (CPP_VSPACE) token as normal.  Hence
+ the '=' handler would never see any intervening escaped newlines.
+
+ Trigraphs have the highest precedence and are converted before
+ anything else.  To make them work here, the '?' handler looks ahead
+ to see if it starts a trigraph, and if so skips the trigraph and
+ pushes the token it represents onto the top of the stack.  This
+ also works in the particular case of a CPP_BACKSLASH trigraph.
+
+ To the preprocessor, whitespace is only significant to the point of
+ knowing whether whitespace precedes a particular token.  For example,
+ the '=' handler needs to know whether there was whitespace between it
+ and a "~" token on the top of the stack, to make the token conversion
+ decision correctly.  So each token has a PREV_WHITESPACE flag to
+ indicate this - the standard permits consecutive whitespace to be
+ regarded as a single space.  The compiler front ends are not
+ interested in whitespace at all; they just require a token stream.
+ Another place where whitespace is significant to the preprocessor is
+ a #define statement - if there is whitespace between the macro name
+ and an initial "(" token the macro is "object-like", otherwise it is
+ a function-like macro that takes arguments.
+
+ However, all is not rosy.  Parsing of identifiers, numbers, comments
+ and strings becomes trickier because of the possibility of raw
+ trigraphs and escaped newlines in the input stream.
+
+ The trigraphs are three consecutive characters beginning with two
+ question marks.  A question mark is not valid as part of a number
+ or identifier, so parsing of a number or identifier terminates
+ normally upon reaching it, returning to the mainloop which handles
+ the trigraph just like it would in any other position.  Similarly for
+ the backslash of a backslash-newline combination.  So we just need
+ the escaped-newline dropper in the mainloop to check if the token on
+ the top of the stack is a number or identifier, and to continue the
+ processing of the token as if nothing had happened.
+
+ For strings, we replace trigraphs whenever we reach a quote or
+ newline, because there might be a backslash trigraph escaping them.
+ We need to be careful that we start trigraph replacing from where we
+ left off previously, because it is possible for a first scan to leave
+ "fake" trigraphs that a second scan would pick up as real (e.g. the
+ sequence "????/\n=" would find a fake ??= trigraph after removing the
+ escaped newline.)
+
+ For line comments, on reaching a newline we scan the previous
+ character(s) to see if it is escaped, and continue if so.  Block
+ comments ignore everything and just focus on finding the comment
+ termination mark.  The only difficult thing, and it is surprisingly
+ tricky, is checking if an asterisk precedes the final slash since
+ they could be separated by escaped newlines.  If the preprocessor is
+ invoked with the output comments option, we don't bother removing
+ escaped newlines and replacing trigraphs for output.
+
+ Finally, numbers can begin with a period, which is pushed initially
+ as a CPP_DOT token in its own right.  The digit handler checks if the
+ previous token was a CPP_DOT not separated by whitespace, and if so
+ pops it off the stack and pushes a period into the number's buffer
+ before calling the number parser.
+
+*/
+
 static void expand_comment_space PARAMS ((cpp_toklist *));
 void init_trigraph_map PARAMS ((void));
 static unsigned char* trigraph_replace PARAMS ((cpp_reader *, unsigned char *,