* cppinternals.texi: New file.
From-SVN: r37990
parent
614c7d3716
commit
6951bc4a54
2 changed files with 229 additions and 0 deletions
|
@ -1,3 +1,7 @@
|
|||
2000-12-04 Neil Booth <neilb@earthling.net>
|
||||
|
||||
* cppinternals.texi: New file.
|
||||
|
||||
2000-12-04 Neil Booth <neilb@earthling.net>
|
||||
|
||||
* cppfiles.c (cpp_make_system_header): Take 2 booleans,
|
||||
|
|
225
gcc/cppinternals.texi
Normal file
|
@ -0,0 +1,225 @@
|
|||
\input texinfo
|
||||
@setfilename cppinternals.info
|
||||
@settitle The GNU C Preprocessor Internals
|
||||
|
||||
@ifinfo
|
||||
@dircategory Programming
|
||||
@direntry
|
||||
* Cpplib: Cpplib internals.
|
||||
@end direntry
|
||||
@end ifinfo
|
||||
|
||||
@c @smallbook
|
||||
@c @cropmarks
|
||||
@c @finalout
|
||||
@setchapternewpage odd
|
||||
@ifinfo
|
||||
This file documents the internals of the GNU C Preprocessor.
|
||||
|
||||
Copyright 2000 Free Software Foundation, Inc.
|
||||
|
||||
Permission is granted to make and distribute verbatim copies of
|
||||
this manual provided the copyright notice and this permission notice
|
||||
are preserved on all copies.
|
||||
|
||||
@ignore
|
||||
Permission is granted to process this file through TeX and print the
|
||||
results, provided the printed document carries copying permission
|
||||
notice identical to this one except for the removal of this paragraph
|
||||
(this paragraph not being relevant to the printed manual).
|
||||
|
||||
@end ignore
|
||||
Permission is granted to copy and distribute modified versions of this
|
||||
manual under the conditions for verbatim copying, provided also that
|
||||
the entire resulting derived work is distributed under the terms of a
|
||||
permission notice identical to this one.
|
||||
|
||||
Permission is granted to copy and distribute translations of this manual
|
||||
into another language, under the above conditions for modified versions.
|
||||
@end ifinfo
|
||||
|
||||
@titlepage
|
||||
@c @finalout
|
||||
@title Cpplib Internals
|
||||
@subtitle Last revised Dec 2000
|
||||
@subtitle for GCC version 3.0
|
||||
@author Neil Booth
|
||||
@page
|
||||
@vskip 0pt plus 1filll
|
||||
@c man begin COPYRIGHT
|
||||
Copyright @copyright{} 2000
|
||||
Free Software Foundation, Inc.
|
||||
|
||||
Permission is granted to make and distribute verbatim copies of
|
||||
this manual provided the copyright notice and this permission notice
|
||||
are preserved on all copies.
|
||||
|
||||
Permission is granted to copy and distribute modified versions of this
|
||||
manual under the conditions for verbatim copying, provided also that
|
||||
the entire resulting derived work is distributed under the terms of a
|
||||
permission notice identical to this one.
|
||||
|
||||
Permission is granted to copy and distribute translations of this manual
|
||||
into another language, under the above conditions for modified versions.
|
||||
@c man end
|
||||
@end titlepage
|
||||
@page
|
||||
|
||||
@node Top, Conventions,, (DIR)
|
||||
@chapter Cpplib - the core of the GNU C Preprocessor
|
||||
|
||||
The GNU C preprocessor in GCC 3.0 has been completely rewritten. It is
|
||||
now implemented as a library, cpplib, so it can be easily shared between
|
||||
a stand-alone preprocessor and a preprocessor integrated with the C,
|
||||
C++ and Objective C front ends. It is also available for use by other
|
||||
programs, though this is not recommended as its exposed interface has
|
||||
not yet reached a point of reasonable stability.
|
||||
|
||||
This library has been written to be re-entrant, so that it can be used
|
||||
to preprocess many files simultaneously if necessary. It has also been
|
||||
written with the preprocessing token as the fundamental unit; the
|
||||
preprocessor in previous versions of GCC would operate on text strings
|
||||
as the fundamental unit.
|
||||
|
||||
This brief manual documents some of the internals of cpplib, and a few
|
||||
tricky issues encountered. It also describes certain behaviour we would
|
||||
like to preserve, such as the format and spacing of its output.
|
||||
|
||||
Identifiers, macro expansion, hash nodes, lexing.
|
||||
|
||||
@menu
|
||||
* Conventions:: Conventions used in the code.
|
||||
* Lexer:: The combined C, C++ and Objective C Lexer.
|
||||
* Whitespace:: Input and output newlines and whitespace.
|
||||
* Concept Index:: Index of concepts and terms.
|
||||
* Index:: Index.
|
||||
@end menu
|
||||
|
||||
@node Conventions, Lexer, Top, Top
|
||||
|
||||
cpplib has two interfaces - one is exposed internally only, and the
|
||||
other is for both internal and external use.
|
||||
|
||||
The convention is that functions and types that are exposed to multiple
|
||||
files internally are prefixed with @samp{_cpp_}, and are to be found in
|
||||
the file @samp{cpphash.h}. Functions and types exposed to external
|
||||
clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}.
|
||||
|
||||
We are striving to reduce the information exposed in cpplib.h to the
|
||||
bare minimum necessary, and then to keep it there. This makes clear
|
||||
exactly what external clients are entitled to assume, and allows us to
|
||||
change internals in the future without worrying whether library clients
|
||||
are perhaps relying on some kind of undocumented implementation-specific
|
||||
behaviour.
|
||||
|
||||
@node Lexer, Whitespace, Conventions, Top
|
||||
|
||||
The lexer is contained in the file @samp{cpplex.c}. We want to have a
|
||||
lexer that is single-pass, for efficiency reasons. We would also like
|
||||
the lexer to only step forwards through the input files, and not step
|
||||
back. This will make future changes to support different character
|
||||
sets, in particular state or shift-dependent ones, much easier.
|
||||
|
||||
This file also contains all information needed to spell a token, i.e. to
|
||||
output it either in a diagnostic or to a preprocessed output file. This
|
||||
information is not exported, but made available to clients through such
|
||||
functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
|
||||
|
||||
The most painful aspect of lexing ISO-standard C and C++ is handling
|
||||
trigraphs and backslash-escaped newlines. Trigraphs are processed before
|
||||
any interpretation of the meaning of a character is made, and unfortunately
|
||||
there is a trigraph representation for a backslash, so it is possible for
|
||||
the trigraph @samp{??/} to introduce an escaped newline.
|
||||
|
||||
Escaped newlines are tedious because theoretically they can occur
|
||||
anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
|
||||
within the characters of an identifier, and even between the @samp{*}
|
||||
and @samp{/} that terminate a comment. Moreover, you cannot be sure
|
||||
there is just one - there might be an arbitrarily long sequence of them.
|
||||
|
||||
So the routine @samp{parse_identifier}, which lexes an identifier, cannot
|
||||
assume that it can scan forwards until the first non-identifier
|
||||
character and be done with it, because this could be the @samp{\}
|
||||
introducing an escaped newline, or the @samp{?} introducing the trigraph
|
||||
sequence that represents the @samp{\} of an escaped newline. Similarly
|
||||
for the routine that handles numbers, @samp{parse_number}. If these
|
||||
routines stumble upon a @samp{?} or @samp{\}, they call
|
||||
@samp{skip_escaped_newlines} to skip over any potential escaped newlines
|
||||
before checking whether they can finish.
|
||||
|
||||
Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
|
||||
check for a @samp{=} after a @samp{+} character to determine whether it
|
||||
has a @samp{+=} token; it needs to be prepared for an escaped newline of
|
||||
some sort. These cases use the function @samp{get_effective_char},
|
||||
which returns the first character after any intervening escaped newlines.
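As a rough illustration of the idea behind @samp{get_effective_char}, the following C sketch skips any run of escaped newlines, written with either a backslash or the backslash trigraph, before reporting the next character. It is not the cpplib implementation: the names @samp{skip_one_escaped_newline} and @samp{effective_char} are invented for this example, the input is assumed to be a NUL-terminated string with @samp{\n} line endings, and trigraphs are assumed to be enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not cpplib's code: if P points at an escaped
   newline (a backslash, or the three-character trigraph spelling of
   one, followed directly by '\n'), return a pointer just past it;
   otherwise return P unchanged.  */
const char *
skip_one_escaped_newline (const char *p)
{
  const char *q = p;

  if (q[0] == '\\')
    q += 1;
  else if (q[0] == '?' && q[1] == '?' && q[2] == '/')
    q += 3;
  else
    return p;

  return *q == '\n' ? q + 1 : p;
}

/* Return the first character at or after P that is not part of an
   escaped newline, and store its position in *END.  There may be an
   arbitrarily long run of escaped newlines, hence the loop.  */
char
effective_char (const char *p, const char **end)
{
  const char *next;

  assert (p != NULL && end != NULL);
  while ((next = skip_one_escaped_newline (p)) != p)
    p = next;
  *end = p;
  return *p;
}
```

A caller that has just read a @samp{+} could pass the following position to @samp{effective_char} and compare the result with @samp{=} to decide whether it has a @samp{+=} token.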
|
||||
|
||||
The lexer needs to keep track of the correct column position,
|
||||
including counting tabs as specified by the @samp{-ftabstop=} option.
|
||||
This should be done even within comments; C-style comments can appear in
|
||||
the middle of a line, and we want to report diagnostics in the correct
|
||||
position for text appearing after the end of the comment.
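The column bookkeeping described above might be sketched in C as follows. This is an illustrative model only: the function names are hypothetical, columns are assumed to be numbered from 1, and the tab stop width is passed in as a parameter standing for the value given with @samp{-ftabstop=}.

```c
#include <assert.h>

/* Return the column of the character following C, given that C sits in
   1-based column COL and tab stops fall every TABSTOP columns.  */
unsigned int
next_column (unsigned int col, char c, unsigned int tabstop)
{
  assert (col >= 1 && tabstop >= 1);
  if (c == '\t')
    /* Advance to the next tab stop: columns 1, TABSTOP + 1, ...  */
    return ((col - 1) / tabstop + 1) * tabstop + 1;
  return col + 1;
}

/* Column of the character at offset OFFSET in LINE.  Every character,
   including those inside a comment, contributes to the count.  */
unsigned int
column_of (const char *line, unsigned int offset, unsigned int tabstop)
{
  unsigned int col = 1, i;

  for (i = 0; i < offset && line[i] != '\0'; i++)
    col = next_column (col, line[i], tabstop);
  return col;
}
```

Because the characters of a comment are counted like any others, a token that follows a mid-line comment still gets the column a text editor would show for it.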
|
||||
|
||||
Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
|
||||
may be invalid and require a diagnostic. However, if they appear in a
|
||||
macro expansion we don't want to complain with each use of the macro.
|
||||
It is therefore best to catch them during the lexing stage, in
|
||||
@samp{parse_identifier}. In both cases, whether a diagnostic is needed
|
||||
or not is dependent upon lexer state. For example, we don't want to
|
||||
issue a diagnostic for re-poisoning a poisoned identifier, or for using
|
||||
@samp{__VA_ARGS__} in the expansion of a variable-argument macro.
|
||||
Therefore @samp{parse_identifier} makes use of flags to determine
|
||||
whether a diagnostic is appropriate. Since we change state on a
|
||||
per-token basis, and don't lex whole lines at a time, this is not a
|
||||
problem.
|
||||
|
||||
Another place where state flags are used to change behaviour is whilst
|
||||
parsing header names. Normally, a @samp{<} would be lexed as a single
|
||||
token. After a @samp{#include} directive, though, it should begin a header name lexed
|
||||
as a single token as far as the nearest @samp{>} character. Note that
|
||||
we don't allow the terminators of header names to be escaped; the first
|
||||
@samp{"} or @samp{>} terminates the header name.
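The rule that header-name terminators cannot be escaped can be illustrated with a small C sketch. The function @samp{header_name_length} is a hypothetical name invented here, not a cpplib routine, and it assumes the caller has already consumed the opening @samp{<} or @samp{"}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: S points just past the opening '<' or '"' of a
   header name.  Return the length of the name, i.e. the distance to the
   first TERMINATOR, or -1 if the line ends first.  A backslash is an
   ordinary character here, so it cannot protect a terminator.  */
long
header_name_length (const char *s, char terminator)
{
  const char *p;

  assert (s != NULL);
  for (p = s; *p != '\0' && *p != '\n'; p++)
    if (*p == terminator)
      return (long) (p - s);
  return -1;
}
```

Under this sketch, a quoted header name containing a backslash followed by @samp{"} ends at that @samp{"}, which matches the behaviour described above.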
|
||||
|
||||
Interpretation of some character sequences depends upon whether we are
|
||||
lexing C, C++ or Objective C, and on the revision of the standard in
|
||||
force. For example, @samp{@@foo} is a single identifier token in
|
||||
Objective C, but two separate tokens @samp{@@} and @samp{foo} in C or
|
||||
C++. Such cases are handled in the main function @samp{_cpp_lex_token},
|
||||
based upon the flags set in the @samp{cpp_options} structure.
|
||||
|
||||
Note we have almost, but not quite, achieved the goal of not stepping
|
||||
backwards in the input stream. Currently @samp{skip_escaped_newlines}
|
||||
does step back, though with care it should be possible to adjust it so
|
||||
that this does not happen. One tricky case is meeting a trigraph when
|
||||
the command line option @samp{-trigraphs} is not in force but
|
||||
@samp{-Wtrigraphs} is: we need to warn about the trigraph, but then
|
||||
buffer it and continue to treat it as 3 separate characters.
|
||||
|
||||
@node Whitespace, Concept Index, Lexer, Top
|
||||
|
||||
The lexer has been written to treat each of @samp{\r}, @samp{\n},
|
||||
@samp{\r\n} and @samp{\n\r} as a single new line indicator. This allows
|
||||
it to transparently preprocess MS-DOS, Macintosh and Unix files without
|
||||
their needing to pass through a special filter beforehand.
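One way to picture the treatment of the four newline forms is the C sketch below. It is a simplified model under stated assumptions rather than the code of @samp{handle_newline}: the names @samp{skip_newline} and @samp{count_newlines} are invented for the example, and the input is assumed to be a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>

/* P points at '\r' or '\n'.  Return a pointer just past the whole
   newline indicator, consuming the second character of a "\r\n" or
   "\n\r" pair.  */
const char *
skip_newline (const char *p)
{
  char first = *p++;

  assert (first == '\r' || first == '\n');
  /* A different newline character directly after the first belongs to
     the same line ending; a repeat of the same one is a blank line.  */
  if ((*p == '\r' || *p == '\n') && *p != first)
    p++;
  return p;
}

/* Count the newline indicators in TEXT, whatever their form.  */
unsigned int
count_newlines (const char *text)
{
  unsigned int lines = 0;

  while (*text != '\0')
    if (*text == '\r' || *text == '\n')
      {
        text = skip_newline (text);
        lines++;
      }
    else
      text++;
  return lines;
}
```

With this pairing rule, two consecutive @samp{\n} characters count as two newlines, whereas @samp{\r\n} counts as one.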
|
||||
|
||||
We also decided to treat a backslash, either @samp{\} or the trigraph
|
||||
@samp{??/}, separated from one of the above newline forms by whitespace
|
||||
only (one or more space, tab, form-feed, vertical tab or NUL characters),
|
||||
as an intended escaped newline. The library issues a diagnostic in this
|
||||
case.
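The check for an intended escaped newline could look roughly like the following C sketch. The function name @samp{escapes_newline} and its interface are assumptions made for illustration. Since NUL counts as whitespace here, the buffer is delimited by a @samp{limit} pointer instead of a terminating NUL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: P points just past a backslash (or past the
   trigraph for one) and LIMIT is the end of the buffer.  Return true
   if only whitespace separates P from a newline character.  *HAD_SPACE
   is set when at least one whitespace character was skipped, which is
   the case where the caller would issue a diagnostic.  */
bool
escapes_newline (const char *p, const char *limit, bool *had_space)
{
  const char *start = p;

  assert (p != NULL && limit != NULL && had_space != NULL);
  while (p < limit
         && (*p == ' ' || *p == '\t' || *p == '\f' || *p == '\v'
             || *p == '\0'))
    p++;
  *had_space = p > start;
  return p < limit && (*p == '\r' || *p == '\n');
}
```

The caller would treat a true result as an escaped newline in either case, and consult the flag only to decide whether to warn.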
|
||||
|
||||
Handling newlines in this way is made simpler by doing it in one place
|
||||
only. The function @samp{handle_newline} takes care of all newline
|
||||
characters, and @samp{skip_escaped_newlines} takes care of all escaping
|
||||
of newlines, deferring to @samp{handle_newline} to handle the newlines
|
||||
themselves.
|
||||
|
||||
@node Concept Index, Index, Whitespace, Top
|
||||
@unnumbered Concept Index
|
||||
@printindex cp
|
||||
|
||||
@node Index,, Concept Index, Top
|
||||
@unnumbered Index of Directives, Macros and Options
|
||||
@printindex fn
|
||||
|
||||
@contents
|
||||
@bye