* cppinternals.texi: New file.

From-SVN: r37990
2000-12-04 07:34:21 +00:00
commit 6951bc4a54
parent 614c7d3716
2 changed files with 229 additions and 0 deletions
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,7 @@
+2000-12-04  Neil Booth  <neilb@earthling.net>
+
+        * cppinternals.texi: New file.
+
 2000-12-04  Neil Booth  <neilb@earthling.net>

        * cppfiles.c (cpp_make_system_header): Take 2 booleans,
--- a/gcc/cppinternals.texi
+++ b/gcc/cppinternals.texi
@@ -0,0 +1,225 @@
+\input texinfo
+@setfilename cppinternals.info
+@settitle The GNU C Preprocessor Internals
+
+@ifinfo
+@dircategory Programming
+@direntry
+* Cpplib:		       Cpplib internals.
+@end direntry
+@end ifinfo
+
+@c @smallbook
+@c @cropmarks
+@c @finalout
+@setchapternewpage odd
+@ifinfo
+This file documents the internals of the GNU C Preprocessor.
+
+Copyright 2000 Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+@ignore
+Permission is granted to process this file through TeX and print the
+results, provided the printed document carries copying permission
+notice identical to this one except for the removal of this paragraph
+(this paragraph not being relevant to the printed manual).
+
+@end ignore
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided also that
+the entire resulting derived work is distributed under the terms of a
+permission notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions.
+@end ifinfo
+
+@titlepage
+@c @finalout
+@title Cpplib Internals
+@subtitle Last revised Dec 2000
+@subtitle for GCC version 3.0
+@author Neil Booth
+@page
+@vskip 0pt plus 1filll
+@c man begin COPYRIGHT
+Copyright @copyright{} 2000
+Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided also that
+the entire resulting derived work is distributed under the terms of a
+permission notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions.
+@c man end
+@end titlepage
+@page
+
+@node Top, Conventions,, (DIR)
+@chapter Cpplib - the core of the GNU C Preprocessor
+
+The GNU C preprocessor in GCC 3.0 has been completely rewritten.  It is
+now implemented as a library, cpplib, so it can be easily shared between
+a stand-alone preprocessor and a preprocessor integrated with the C,
+C++ and Objective C front ends.  It is also available for use by other
+programs, though this is not recommended as its exposed interface has
+not yet reached a point of reasonable stability.
+
+This library has been written to be re-entrant, so that it can be used
+to preprocess many files simultaneously if necessary.  It has also been
+written with the preprocessing token as the fundamental unit; the
+preprocessor in previous versions of GCC would operate on text strings
+as the fundamental unit.
+
+This brief manual documents some of the internals of cpplib, and a few
+tricky issues encountered.  It also describes certain behaviour we would
+like to preserve, such as the format and spacing of its output.
+
+Identifiers, macro expansion, hash nodes, lexing.
+
+@menu
+* Conventions::	    Conventions used in the code.
+* Lexer::	    The combined C, C++ and Objective C Lexer.
+* Whitespace::      Input and output newlines and whitespace.
+* Concept Index::   Index of concepts and terms.
+* Index::           Index.
+@end menu
+
+@node Conventions, Lexer, Top, Top
+
+cpplib has two interfaces - one is exposed internally only, and the
+other is for both internal and external use.
+
+The convention is that functions and types that are exposed to multiple
+files internally are prefixed with @samp{_cpp_}, and are to be found in
+the file @samp{cpphash.h}.  Functions and types exposed to external
+clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}.
+
+We are striving to reduce the information exposed in @samp{cpplib.h} to the
+bare minimum necessary, and then to keep it there.  This makes clear
+exactly what external clients are entitled to assume, and allows us to
+change internals in the future without worrying whether library clients
+are perhaps relying on some kind of undocumented implementation-specific
+behaviour.
+
+@node Lexer, Whitespace, Conventions, Top
+
+The lexer is contained in the file @samp{cpplex.c}.  We want to have a
+lexer that is single-pass, for efficiency reasons.  We would also like
+the lexer to only step forwards through the input files, and not step
+back.  This will make future changes to support different character
+sets, in particular state or shift-dependent ones, much easier.
+
+This file also contains all information needed to spell a token, i.e. to
+output it either in a diagnostic or to a preprocessed output file.  This
+information is not exported, but made available to clients through such
+functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
+
+The most painful aspect of lexing ISO-standard C and C++ is handling
+trigraphs and backslash-escaped newlines.  Trigraphs are processed before
+any interpretation of the meaning of a character is made, and unfortunately
+there is a trigraph representation for a backslash, so it is possible for
+the trigraph @samp{??/} to introduce an escaped newline.
+
+Escaped newlines are tedious because theoretically they can occur
+anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
+within the characters of an identifier, and even between the @samp{*}
+and @samp{/} that terminates a comment.  Moreover, you cannot be sure
+there is just one - there might be an arbitrarily long sequence of them.
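+
+As an illustration only (this is ordinary C, not cpplib code, and the
+function names are invented), each function below hides an escaped
+newline in one of the places just listed.  A conforming lexer must
+treat them exactly as if the backslash and newline were absent.
+
```c
/* Each function contains a backslash followed by a newline in an
   awkward place.  Translation phase 2 deletes every such pair before
   tokenisation, so the lexer has to see through all of them.  */

/* Escaped newline between the '+' and '=' of a '+=' token.  */
int split_operator(void)
{
    int x = 1;
    x +\
= 2;
    return x;
}

/* Two escaped newlines in a row inside an identifier.  */
int split_identifier(void)
{
    int value = 40;
    return va\
\
lue + 2;
}

/* Escaped newline between the '*' and '/' that end a comment.  */
int split_comment_end(void)
{
    int y = 7; /* this comment ends on the next line *\
/
    return y;
}
```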
+
+So the routine @samp{parse_identifier}, which lexes an identifier, cannot
+assume that it can scan forwards until the first non-identifier
+character and be done with it, because this could be the @samp{\}
+introducing an escaped newline, or the @samp{?} introducing the trigraph
+sequence that represents the @samp{\} of an escaped newline.  The same
+applies to the routine that handles numbers, @samp{parse_number}.  If these
+routines stumble upon a @samp{?} or @samp{\}, they call
+@samp{skip_escaped_newlines} to skip over any potential escaped newlines
+before checking whether they can finish.
+
+Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
+check for a @samp{=} after a @samp{+} character to determine whether it
+has a @samp{+=} token; it needs to be prepared for an escaped newline of
+some sort.  These cases use the function @samp{get_effective_char},
+which returns the first character after any intervening escaped newlines.
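+
+The idea can be sketched as follows.  This is a simplified stand-in
+written for this manual, not the code in @samp{cpplex.c}: the helper
+names are hypothetical, it works on a NUL-terminated string, it always
+honours the backslash trigraph, and it recognises only a bare newline
+directly after the backslash.
+
```c
#include <stddef.h>

/* Hypothetical helper: if P points at an escaped newline (a backslash,
   or the three-character backslash trigraph, immediately followed by a
   newline), return a pointer just past it; otherwise return NULL.  */
static const char *skip_one_escaped_newline(const char *p)
{
    if (p[0] == '\\' && p[1] == '\n')
        return p + 2;
    if (p[0] == '?' && p[1] == '?' && p[2] == '/' && p[3] == '\n')
        return p + 4;
    return NULL;
}

/* Return the first character at or after *PP that is not part of an
   escaped newline, and advance *PP past it.  Returns '\0' at the end
   of the string without advancing.  */
static char effective_char(const char **pp)
{
    const char *next;

    while ((next = skip_one_escaped_newline(*pp)) != NULL)
        *pp = next;
    if (**pp == '\0')
        return '\0';
    return *(*pp)++;
}

/* Return 1 if the text at S begins with a "+=" token, allowing any
   number of escaped newlines between the two characters.  */
int starts_with_plus_eq(const char *s)
{
    const char *p = s;

    if (effective_char(&p) != '+')
        return 0;
    return effective_char(&p) == '=';
}
```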
+
+The lexer needs to keep track of the correct column position,
+including counting tabs as specified by the @samp{-ftabstop=} option.
+This should be done even within comments; C-style comments can appear in
+the middle of a line, and we want to report diagnostics in the correct
+position for text appearing after the end of the comment.
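+
+A minimal sketch of the column arithmetic, assuming columns are
+reported 1-based and each tab advances to the next multiple of the tab
+stop.  The function name and signature are hypothetical and are not
+taken from cpplib.
+
```c
/* Hypothetical sketch: return the 1-based column of the character at
   offset TARGET within LINE, expanding each tab to the next multiple
   of TABSTOP columns.  TABSTOP must be positive.  */
unsigned int column_of(const char *line, unsigned int target,
                       unsigned int tabstop)
{
    unsigned int col = 0;   /* 0-based while counting */
    unsigned int i;

    for (i = 0; i < target && line[i] != '\0'; i++) {
        if (line[i] == '\t')
            col += tabstop - col % tabstop;   /* jump to next tab stop */
        else
            col++;   /* comment text is counted like any other text */
    }
    return col + 1;
}
```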
+
+Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
+may be invalid and require a diagnostic.  However, if they appear in a
+macro expansion we don't want to complain with each use of the macro.
+It is therefore best to catch them during the lexing stage, in
+@samp{parse_identifier}.  In both cases, whether a diagnostic is needed
+or not is dependent upon lexer state.  For example, we don't want to
+issue a diagnostic for re-poisoning a poisoned identifier, or for using
+@samp{__VA_ARGS__} in the expansion of a variable-argument macro.
+Therefore @samp{parse_identifier} makes use of flags to determine
+whether a diagnostic is appropriate.  Since we change state on a
+per-token basis, and don't lex whole lines at a time, this is not a
+problem.
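+
+The following sketch shows the general shape of such a state-dependent
+check.  The structure, its fields and the function are all invented for
+illustration; cpplib's real state flags are organised differently.
+
```c
#include <string.h>

/* Hypothetical per-token lexer state.  */
struct lexer_state {
    int va_args_ok;   /* nonzero while lexing a variable-argument
                         macro's expansion */
    int poisoning;    /* nonzero while lexing the operands of a
                         poison directive */
};

/* Return 1 if lexing the identifier NAME in state ST should produce a
   diagnostic, 0 otherwise.  IS_POISONED says whether NAME has already
   been poisoned.  */
int needs_diagnostic(const struct lexer_state *st, const char *name,
                     int is_poisoned)
{
    if (is_poisoned)
        return !st->poisoning;    /* re-poisoning is silently allowed */
    if (strcmp(name, "__VA_ARGS__") == 0)
        return !st->va_args_ok;   /* valid only in a variadic macro */
    return 0;
}
```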
+
+Another place where state flags are used to change behaviour is whilst
+parsing header names.  Normally, a @samp{<} would be lexed as a single
+token on its own.  After a @samp{#include} directive, though, everything
+from the @samp{<} as far as the nearest @samp{>} character should be
+lexed as a single header-name token.  Note that
+we don't allow the terminators of header names to be escaped; the first
+@samp{"} or @samp{>} terminates the header name.
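+
+A sketch of the header-name rule, under the assumption that the caller
+has already seen the @samp{#include} and is positioned at the opening
+delimiter.  The function is hypothetical and is not cpplib's code.
+
```c
#include <stddef.h>

/* Hypothetical sketch.  S points at the opening '<' or '"' of a header
   name.  Return the length of the whole header-name token including
   both delimiters, or 0 if the line ends before a terminator is found.
   A backslash has no special meaning here: the first terminator always
   ends the name.  */
size_t header_name_len(const char *s)
{
    char terminator;
    size_t i;

    if (s[0] == '<')
        terminator = '>';
    else if (s[0] == '"')
        terminator = '"';
    else
        return 0;

    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == terminator)
            return i + 1;
    return 0;
}
```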
+
+Interpretation of some character sequences depends upon whether we are
+lexing C, C++ or Objective C, and on the revision of the standard in
+force.  For example, @samp{@@foo} is a single identifier token in
+Objective C, but two separate tokens @samp{@@} and @samp{foo} in C or
+C++.  Such cases are handled in the main function @samp{_cpp_lex_token},
+based upon the flags set in the @samp{cpp_options} structure.
+
+Note we have almost, but not quite, achieved the goal of not stepping
+backwards in the input stream.  Currently @samp{skip_escaped_newlines}
+does step back, though with care it should be possible to adjust it so
+that this does not happen.  One tricky case arises when we meet a
+trigraph while the command line option @samp{-trigraphs} is not in
+force but @samp{-Wtrigraphs} is: we need to warn about it, but then
+buffer it and continue to treat it as 3 separate characters.
+
+@node Whitespace, Concept Index, Lexer, Top
+
+The lexer has been written to treat each of @samp{\r}, @samp{\n},
+@samp{\r\n} and @samp{\n\r} as a single newline indicator.  This allows
+it to transparently preprocess MS-DOS, Macintosh and Unix files without
+their needing to pass through a special filter beforehand.
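+
+One way to recognise the four forms is sketched below.  This is an
+illustrative stand-in rather than the real @samp{handle_newline}; the
+names are hypothetical, and it assumes that a two-character terminator
+is always a carriage return and a line feed in either order.
+
```c
#include <stddef.h>

/* Hypothetical sketch.  P points at a '\r' or '\n'.  Return the number
   of characters that make up this one line terminator: 2 when the next
   character is the other member of the pair, otherwise 1.  */
static size_t newline_len(const char *p)
{
    if ((p[1] == '\r' || p[1] == '\n') && p[1] != p[0])
        return 2;
    return 1;
}

/* Count the line terminators in the NUL-terminated string TEXT.  */
unsigned int count_newlines(const char *text)
{
    unsigned int lines = 0;
    const char *p = text;

    while (*p != '\0') {
        if (*p == '\r' || *p == '\n') {
            p += newline_len(p);
            lines++;
        } else {
            p++;
        }
    }
    return lines;
}
```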
+
+We also decided to treat a backslash, either @samp{\} or the trigraph
+@samp{??/}, separated from one of the above newline forms by whitespace
+only (one or more space, tab, form-feed, vertical tab or NUL characters),
+as an intended escaped newline.  The library issues a diagnostic in this
+case.
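+
+The check can be sketched like this.  The enumeration and function are
+hypothetical; the buffer is passed with an explicit length because NUL
+counts as whitespace here and so cannot serve as a string terminator.
+
```c
#include <stddef.h>

/* Hypothetical classification of the text that follows a backslash.  */
enum escape_kind {
    NOT_ESCAPED = 0,      /* something other than a newline follows */
    ESCAPED_CLEAN,        /* the newline follows immediately */
    ESCAPED_WITH_SPACE    /* only whitespace intervenes; diagnose it */
};

/* BUF[0..LEN) is the text just after a backslash or its trigraph.
   Skip space, tab, form-feed, vertical tab and NUL characters, then
   report whether a newline character was reached.  */
enum escape_kind classify_escape(const char *buf, size_t len)
{
    size_t i = 0;

    while (i < len && (buf[i] == ' ' || buf[i] == '\t' || buf[i] == '\f'
                       || buf[i] == '\v' || buf[i] == '\0'))
        i++;
    if (i == len || (buf[i] != '\n' && buf[i] != '\r'))
        return NOT_ESCAPED;
    return i == 0 ? ESCAPED_CLEAN : ESCAPED_WITH_SPACE;
}
```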
+
+Handling newlines in this way is made simpler by doing it in one place
+only.  The function @samp{handle_newline} takes care of all newline
+characters, and @samp{skip_escaped_newlines} takes care of all escaping
+of newlines, deferring to @samp{handle_newline} to handle the newlines
+themselves.
+
+@node Concept Index, Index, Whitespace, Top
+@unnumbered Concept Index
+@printindex cp
+
+@node Index,, Concept Index, Top
+@unnumbered Index of Directives, Macros and Options
+@printindex fn
+
+@contents
+@bye