Supporting Programming Languages.

How to work with a programming language not yet supported by Inweb.

@h Introduction.
To a very large extent, Inweb works the same way regardless of what language
its webs are using, and that is deliberate. On the other hand, when a web
is woven, it will look much nicer with syntax-colouring, and that clearly
can't be done without at least a surface understanding of what programs in
the language mean.

As we've seen, the Contents section of a web has to specify its language.
For example,

	|Language: Perl|

declares that the program expressed by the web is a Perl script. The language
name must be one which Inweb knows, or, more exactly, one for which it can
find a "language definition file". These are stored in the |Languages|
subdirectory of the |inweb| distribution, and if a language is called |L|
then its file is |L.ildf|. You can see the languages currently available
to Inweb by using |-show-languages|. At present, a newly installed Inweb
replies like so:

[[languages.txt]]

@ So what if you want to write a literate program in a language not on that
list? One option is to give the language as |None|. (Note that this is
different from simply not declaring a language -- if your web doesn't say
what language it is, Inweb assumes C.) |None| is fine for tangling, though
it has the minor annoyance that it tangles to a file with the filename
extension |.txt|, not knowing any better. But you can cure that with
|-tangle-to F| for any filename |F| of your choice. With weaving, though,
|None| makes for drab-looking weaves, because there's very little syntax
colouring.

An even more extreme option is |Plain Text|, which has no syntax colouring
at all. (But this could still be useful if what you want is to produce an
annotated explanation of some complicated configuration file in plain text.)

@ In fact, though, it's easy to make new language definitions, and if you're
spending any serious effort on a web of a program in an unsupported language
then it's probably worth making one. Contributions of these to the Inweb
open source project are welcome, and then this effort might also benefit others.
This section of the manual is about how to do it.

Once you have written a definition, use |-read-language L| at the command
line, where |L| is the file defining it. If you have many custom languages,
|-read-languages D| reads all of the definitions in a directory |D|. Or, if
the language in question is really quite specific to a single web, you can
make a |Private Languages| subdirectory of the web and put it in there.

@h Structure of language definitions.
Each language is defined by a single file written in ILDF ("Inweb Language
Definition Format"). In this section, we'll call that file the ILD.

The ILD is a plain text file, which is read in line by line. Leading and
trailing whitespace on each line is ignored; blank lines are ignored; and
so are comments, which are lines beginning with a |#| character.

The ILD contains three sorts of thing:
(a) Properties, set by lines in the form |Name: "C++"|.
(b) Keywords, set by lines in the form |keyword int|.
(c) A colouring program, introduced by |colouring {| and continuing until the
last block of it is closed with a |}|.

Everything in an ILD is optional, so a minimal ILD is in principle empty. In
practice, though, every ILD should open like so:

= (sample ILDF code)
Name: "C"
Details: "The C programming language"
Extension: ".c"

@h Properties.
Inevitably, there's a miscellaneous shopping list of these, but let's start
with the semi-compulsory ones.

|Name|. This is the one used by webs in their |Language: X| lines, and should
match the ILD's own filename: wherever it is stored, the ILD for language |X|
should be filenamed |X.ildf|.

|Details|. This is used only by |-show-languages|.

|Extension|. The default file extension used when a web in this format is
tangled. Thus, a web for a C program called |something| will normally tangle
to a file called |something.c|.

@ Most programming languages contain comments. In some, like Perl, a comment
begins with a triggering notation (in Perl's case, |#|) occurring outside of
quoted material; and it continues to the end of its line. We'll call that a
"line comment". There are also languages where comments must be the only
non-whitespace items on their lines: in that case, we'll call them "whole
line comments". In others, like Inform 7, a comment begins with one notation
|[| and ends with another |]|, not necessarily on the same line. We'll call
those "multiline comments".

|Line Comment| is the notation for line comments, and |Whole Line Comment| is
the notation for whole line comments.

|Multiline Comment Open| and |Multiline Comment Close|, which should exist
as a pair or not at all, are the notations for multiline comments.

For example, C defines:

= (sample ILDF code)
    Multiline Comment Open: "/*"
    Multiline Comment Close: "*/"
    Line Comment: "//"

@ As noted, comments occur only outside of string or character literals. We
can give notations for these as follows:

|String Literal| must be a single character, and marks both start and end.

|String Literal Escape| is the escape character used within a string literal
to express an instance of the |String Literal| character without ending the
literal.

|Character Literal| and |Character Literal Escape| are the same thing for
character literals.

Here, C defines:

= (sample ILDF code)
    String Literal: "\""
    String Literal Escape: "\\"
    Character Literal: "'"
    Character Literal Escape: "\\"

@ Next, numeric literals, like |0xFE45| in C, or |$$10011110| in Inform 6.
It's assumed that every language allows non-negative decimal numbers.

|Binary Literal Prefix|, |Octal Literal Prefix|, and |Hexadecimal Literal Prefix|
are notations for non-decimal numbers, if they exist.

|Negative Literal Prefix| allows negative decimals: this is usually |-| if set.

Here, C has:

= (sample ILDF code)
    Hexadecimal Literal Prefix: "0x"
    Binary Literal Prefix: "0b"
    Negative Literal Prefix: "-"

@ |Shebang| is used only in tangling, and is a (probably short) text added at
the very beginning of a tangled program. This is useful for scripting languages
in Unix, where the opening line must be a "shebang" indicating their language.
For example, Perl defines:
= (sample ILDF code)
    Shebang: "#!/usr/bin/perl\n\n"
=
Most languages do not have a shebang.

@ In order for C compilers to report C syntax errors on the correct line,
despite rearranging by automatic tools, C conventionally recognises the
preprocessor directive |#line| to tell it that a contiguous extract follows
from the given file. Quite a few languages support notations like this,
which most users never use.

When tangling, Inweb is just such a rearranging tool, and it inserts line
markers automatically for languages which support them: |Line Marker| specifies
that this language does, and gives the notation. For example, C provides:
= (sample ILDF code)
    Line Marker: "#line %d \"%f\"\n"
=
Here |%d| expands to the line number of origin, and |%f| to the filename of
origin.
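
As a sketch of how the same property might look for another language: Perl
treats a comment of the form |# line 42 "file.pl"| as a line directive, so an
ILD for it could plausibly declare the following. This is an illustration
only, not a quotation from the Perl ILD supplied with Inweb.
= (sample ILDF code)
    Line Marker: "# line %d \"%f\"\n"
=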

@ When a named paragraph is used in code, and the tangler is "expanding" it
to its contents, it can optionally place some material before and after the
matter added. This material is in |Before Named Paragraph Expansion| and
|After Named Paragraph Expansion|, which are by default empty.

For C and all similar languages, we recommend this:
= (sample ILDF code)
    Before Named Paragraph Expansion: "\n{\n"
    After Named Paragraph Expansion: "}\n"
=
The effect of this is to ensure that code such as:
= (not code)
    if (x == y) @<Do something dramatic@>;
=
tangles to something like this:
= (not code)
    if (x == y)
    {
    ...
    }
=
so that any variables defined inside "Do something dramatic" have limited
scope, and so that multi-line macros are treated as a single statement by |if|,
|while| and so on.

(The new-line before the opening brace is not for aesthetic purposes; we never
care much about the aesthetics of tangled C code, which is not for human eyes.
It's in case of any problems arising with line comments.)

@ When the author of a web makes definitions with |@d| or |@e|, Inweb will
need to tangle those into valid constant definitions in the language concerned.
It can only do so if the language provides a notation for that.

|Start Definition| begins; |Prolong Definition|, if given, shows how to
continue a multiline definition (if they are allowed); and |End Definition|,
if given, places any ending notation. For example, Inform 6 defines:
= (sample ILDF code)
    Start Definition: "Constant %S =\s"
    End Definition: ";\n"
=
where |%S| expands to the name of the term to be defined. Thus, we might tangle
out to:
= (not code)
    Constant TAXICAB = 1729;
=
Inweb ignores all definitions unless one of these three properties is given.
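
For comparison, here is a sketch of how a C-like language might use all three
properties, given that a C |#define| runs to the end of its line and is
continued onto the next with a backslash. The values are an illustrative
assumption, not a quotation from the C ILD supplied with Inweb:
= (sample ILDF code)
    Start Definition: "#define %S\s"
    Prolong Definition: "\\\n\s\s\s\s"
    End Definition: "\n"
=
With these settings, the same definition would be expected to tangle to
something like |#define TAXICAB 1729|.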

@ Inweb needs a notation for conditional compilation in order to handle some
of its advanced features for tangling tagged material: the Inform project
makes use of this to handle code dependent on the operating system in use.
If the language supports it, the notation is in |Start Ifdef| and |End Ifdef|,
and in |Start Ifndef| and |End Ifndef|. For example, Inform 6 has:
= (sample ILDF code)
    Start Ifdef: "#ifdef %S;\n"
    End Ifdef: "#endif; ! %S\n"
    Start Ifndef: "#ifndef %S;\n"
    End Ifndef: "#endif; ! %S\n"
=
which is a subtly different notation from the C one. Again, |%S| expands to
the name of the term we are conditionally compiling on.
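
A C-style equivalent might look like the following sketch. It is written from
the usual C preprocessor syntax rather than copied from the C ILD supplied
with Inweb, so treat the exact strings as an assumption:
= (sample ILDF code)
    Start Ifdef: "#ifdef %S\n"
    End Ifdef: "#endif /* %S */\n"
    Start Ifndef: "#ifndef %S\n"
    End Ifndef: "#endif /* %S */\n"
=
The differences from Inform 6 are the missing semicolons and the comment
notation used to annotate the |#endif|.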

@ |Supports Namespaces| must be either |true| or |false|, and is by default
|false|. If set, then the language allows identifier names to include
dividers with the notation |::|; section headings can declare that all of
their code belongs to a single namespace; and any functions detected in that
code must have a name using that namespace.

This is a rudimentary way to provide namespaces to languages not otherwise
having them: InC uses it to extend C.

@ |Suppress Disclaimer| is again |true| or |false|, and by default |false|.
The disclaimer is a comment placed into a tangle declaring that the file
has been auto-generated by Inweb and shouldn't be edited. (The comment
only appears if comment notation has been declared for the language: so
e.g., the Plain Text ILD doesn't need to be told to |Suppress Disclaimer|
since it cannot tangle comments anyway.)
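
@ To pull several of these properties together, here is a sketch of the
opening of an ILD for a made-up scripting language. The language "Toy", its
file extension and its interpreter path are all hypothetical, invented for
this illustration; only the property names come from the descriptions above.
= (sample ILDF code)
    # Hypothetical definition, to be saved as Toy.ildf
    Name: "Toy"
    Details: "A made-up scripting language, for illustration only"
    Extension: ".toy"

    # Comments run from a hash to the end of the line
    Line Comment: "#"

    # Literals
    String Literal: "\""
    String Literal Escape: "\\"
    Hexadecimal Literal Prefix: "0x"
    Negative Literal Prefix: "-"

    # Used only when tangling
    Shebang: "#!/usr/bin/env toy\n"
=
Properties which are left out, such as |Line Marker| here, simply take their
default values, since everything in an ILD is optional.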

@h Secret Features.
It is not quite true that everything a language can do is defined by the ILD.
Additional features are provided to C-like languages to detect functions
and |typedef|s. At present, these are hard-wired into Inweb, and it will take
further thought to work out how to express them in ILDF.

The property |C-Like|, by default |false|, enables these features.

(In addition, a language whose name is |InC| gets still more features. Those
are not so much a failing of ILDF as a consequence of Inweb itself being a
sort of compiler for |InC| -- see elsewhere in this manual.)

@h Keywords.
Syntax colouring is greatly helped by knowing that certain identifier names
are special: for example, |void| is special in C. These are often called
"reserved words", in that they can't be used as variable or function names
in the language in question. For C, then, we include the line:
= (sample ILDF code)
    keyword void
=
Keywords can be declared in a number of categories, which are identified by
colour name: the default is |!reserved|, the colour for reserved words. But
for example:
= (sample ILDF code)
    keyword isdigit of !function
=
makes a keyword of colour |!function|.
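
A slightly larger sketch, for a hypothetical C-like language, might read as
follows. The choice of words is purely illustrative, and it assumes that each
|keyword| line declares exactly one word, as in the two examples above.
= (sample ILDF code)
    keyword if
    keyword else
    keyword while
    keyword return
    keyword printf of !function
    keyword NULL of !constant
=
The first four lines fall into the default |!reserved| category. The colouring
program can then test for each category with |keyword of C|, described below.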

@h Syntax colouring program.
That leads nicely into how syntax colouring is done.

ILDs have no control over what colours or typefaces are used: that's all
controllable, but is done by changing the weave pattern. So we can't colour
a word "green": instead we colour it semantically, from the following
palette of possibilities:
= (sample ILDF code)
!character  !comment     !constant  !definition  !element  !extract
!function   !identifier  !plain     !reserved    !string
=
Each character has its own colour. At the start of the process, every
character is |!plain|.

@ At the first stage, Inweb uses the language's comment syntax to work out
what part of the code is commentary, and what part is "live". Only the live
part goes forward into stage two. All comment material is coloured |!comment|.

At the second stage, Inweb uses the syntax for literals. Character literals
are painted in |!character|, string literals in |!string|, identifiers in
|!identifier|, and numeric literals as |!constant|.

At the third stage, Inweb runs the colouring program for the language (if
one is provided): it has the opportunity to apply some polish. Note that this
runs only on the live material; it cannot affect the commented-out matter.

When a colouring program begins running, then, every character is coloured in
one of the following: |!character|, |!string|, |!identifier|, |!constant|,
or |!plain|.

@ A colouring program begins with |colouring {| and ends with |}|. The
empty program is legal but does nothing:
= (sample ILDF code)
    colouring {
    }
=
The material between the braces is called a "block". Each block runs on a
given stretch of contiguous text, called the "snippet". For the outermost
block, that's a line of source code. Blocks normally contain one or more
"rules":
= (sample ILDF code)
    colouring {
        marble => !function
    }
=
Rules take the form of "if X, then Y", and the |=>| divides the X from the Y.
This one says that if the snippet consists of the word "marble", then colour
it |!function|. Of course this is not very useful, since it would only catch
lines containing only that one word. So we really want to narrow in on smaller
snippets. This, for example, applies its rule to each individual character
in turn:
= (sample ILDF code)
    colouring {
        characters {
            K => !identifier
        }
    }
=

@ In the above examples, |K| and |marble| appeared without quotation marks,
but they were only allowed to do that because (a) they were single words,
(b) those words had no other meaning, and (c) they didn't contain any
awkward characters. For any more complicated texts, always use quotation
marks. For example, in
= (sample ILDF code)
	"=>" => !reserved
=
the |=>| in quotes is just text, whereas the one outside quotes is being
used to divide a rule.

If you need a literal double quote inside the double-quotes, use |\"|; and
use |\\| for a literal backslash. For example:
= (sample ILDF code)
    "\\\"" => !reserved
=
actually matches the text |\"|.
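
As a further sketch, braces already have a meaning in a colouring program,
since they open and close blocks, so by condition (b) a rule matching a
literal brace needs quotation marks. This fragment is an illustration and is
not taken from any ILD supplied with Inweb:
= (sample ILDF code)
    colouring {
        characters {
            "{" => !reserved
            "}" => !reserved
        }
    }
=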

@h The six splits.
|characters| is an example of a "split", which splits up the original snippet
of text -- say, the line |let K = 2| -- into smaller, non-overlapping snippets
-- in this case, nine of them: |l|, |e|, |t|, | |, |K|, | |, |=|, | |, and |2|.
Every split is followed by a block of rules, which is applied to each of the
pieces in turn. Inweb works sideways-first: thus, if the block contains rules
R1, R2, ..., then R1 is applied to each piece first, then R2 to each piece,
and so on.

There are several different ways to split, all of them written in the
plural, to emphasize that they work on what are usually multiple things.
Rules, on the other hand, are written in the singular. Splits are not allowed
to be followed by |=>|: they always begin a block.

1. |characters| splits the snippet into each of its characters.

2. |characters in T| splits the snippet into each of its characters which
lie inside the text |T|. For example, here is a not very useful ILD for
plain text in which all vowels are in red:

[[../Private Languages/VowelsExample.ildf as ILDF]]

Given the text:
= (not code)
A noir, E blanc, I rouge, U vert, O bleu : voyelles,
Je dirai quelque jour vos naissances latentes :
A, noir corset velu des mouches éclatantes
Qui bombinent autour des puanteurs cruelles,
=
this produces:
= (sample VowelsExample code)
A noir, E blanc, I rouge, U vert, O bleu : voyelles,
Je dirai quelque jour vos naissances latentes :
A, noir corset velu des mouches éclatantes
Qui bombinent autour des puanteurs cruelles,
=

3. The split |instances of X| narrows in on each usage of the text |X| inside
the snippet. For example,
[[../Private Languages/LineageExample.ildf as ILDF]]
acts on the text:
= (not code)
Jacob first appears in the Book of Genesis, the son of Isaac and Rebecca, the
grandson of Abraham, Sarah and Bethuel, the nephew of Ishmael.
=
to produce:
= (sample LineageExample code)
Jacob first appears in the Book of Genesis, the son of Isaac and Rebecca, the
grandson of Abraham, Sarah and Bethuel, the nephew of Ishmael.
=
Note that it never runs in an overlapping way: the snippet |===| would be
considered as having only one instance of |==| (the first two characters),
while |====| would have two.

4. The split |runs of C|, where |C| describes a colour, splits the snippet
into non-overlapping contiguous pieces which have that colour. For example:
[[../Private Languages/RunningExample.ildf as ILDF]]
acts on:
= (not code)
Napoleon Bonaparte (1769-1821) took 167 scientists to Egypt in 1798,
who published their so-called Memoirs over the period 1798-1801.
=
to produce:
= (sample RunningExample code)
Napoleon Bonaparte (1769-1821) took 167 scientists to Egypt in 1798,
who published their so-called Memoirs over the period 1798-1801.
=
Here the hyphens in number ranges have been coloured, but not the hyphen
in "so-called".

A more computer-science sort of example would be:
[[../Private Languages/StdioExample.ildf as ILDF]]
which acts on:
= (not code)
if (x == 1) printf("Hello!");
=
to produce:
= (sample StdioExample code)
if (x == 1) printf("Hello!");
=
The split divides the line up into three runs, and the inner block runs three
times: on |if|, then |x|, then |printf|. Only the third time has any effect.

As a special form, |runs of unquoted| means "runs of characters painted with
neither |!string| nor |!character|". This is special because |unquoted| is
not a colour.

5. The split |matches of /E/|, where |/E/| is a regular expression (see below),
splits the snippet up into non-overlapping pieces which match it: possibly
none at all, of course, in which case the block of rules is never used.
This is easier to demonstrate than explain:
[[../Private Languages/AssemblageExample.ildf as ILDF]]
which acts on:
= (not code)
		JSR .initialise
		LDR A, #.data
		RTS
	.initialise
		TAX
=
to produce:
= (sample AssemblageExample code)
		JSR .initialise
		LDR A, #.data
		RTS
	.initialise
		TAX
=

6. Lastly, the split |brackets in /E/| matches the snippet against the
regular expression |E|, and then runs the rules on each bracketed
subexpression in turn. (If there is no match, or there are no bracketed
terms in |E|, nothing happens.)
[[../Private Languages/EquationsExample.ildf as ILDF]]
acts on:
= (not code)
	A = 2716
	B=3
	C =715 + B
	D < 14
=
to produce:
= (sample EquationsExample code)
	A = 2716
	B=3
	C =715 + B
	D < 14
=
What happens here is that the expression has two bracketed terms, one for
the letter, one for the number; the rule is run first on the letter, then
on the number, and both are turned to |!function|.
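
@ Splits can also be placed one inside another. The following sketch paints
each |::| divider as |!reserved|, but only where it occurs outside string and
character literals. It assumes that a split may appear inside the block of
another split in the same way that one may appear inside the outermost
|colouring| block, and it is not taken from any ILD supplied with Inweb.
= (sample ILDF code)
    colouring {
        runs of unquoted {
            instances of "::" {
                => !reserved
            }
        }
    }
=
The outer split first narrows the line down to its unquoted stretches, and
the inner split then looks for |::| only within those.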

@h The seven ways rules can apply.
Rules are the lines with a |=>| in. As noted, they take the form "if X, then
Y". The following are the possibilities for X, the condition.

1. The easiest thing is to give nothing at all, and then the rule always
applies. For example, this somewhat nihilistic program gets rid of colouring
entirely:
= (sample ILDF code)
    colouring {
        => !plain
    }
=

2. If X is a piece of literal text, the rule applies when the snippet is
exactly that text. For example,
= (sample ILDF code)
    printf => !function
=

3. X can require the whole snippet to be of a particular colour, by writing
|coloured C|. For example:
= (sample ILDF code)
    colouring {
        characters {
            coloured !character => !plain
        }
    }
=
removes the syntax colouring on character literals.

4. X can require the snippet to be one of the language's known keywords, as
declared earlier in the ILD by a |keyword| command. The syntax here is
|keyword of C|, where |C| is a colour. For example:
= (sample ILDF code)
    keyword of !element => !element
=
says: if the snippet is a keyword declared as being of colour |!element|,
then actually colour it that way. (This is much faster than making many
comparison rules in a row, one for each keyword in the language; Inweb has
put all of the registered keywords into a hash table for rapid lookup.)

5. X can look at a little context before or after the snippet, testing it
with one of the following: |prefix P|, |spaced prefix P|,
|optionally spaced prefix P|. These qualifiers have to do with whether white
space must appear after |P| and before the snippet. For example,
= (sample ILDF code)
    runs of !identifier {
        optionally spaced prefix -> => !element
    }
=
means that any identifier occurring after a |->| token will be coloured
as |!element|. Similarly for |suffix|.

6. X can test the snippet against a regular expression, with |matching /E/|.
For example:
= (sample ILDF code)
    runs of !identifier {
        matching /.*x.*/ => !element
    }
=
...turns any identifier containing a lower-case |x| into |!element| colour.
Note that |matching /x/| would not have worked, because our regular expression
is required to match the entire snippet, not just somewhere inside.
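Similarly,
= (sample ILDF code)
    runs of !constant {
        matching /\d\d\d\d/ => !element
    }
=
...colours four-digit decimal numbers, but no others. (Splitting with
|characters| would not work here: each snippet would then be a single
character, which a four-digit expression can never match.)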

7. Whenever a split takes place, Inweb keeps count of how many pieces there are,
and different rules can apply to differently numbered pieces. The notation
is |number N|, where |N| is the number, counting from 1. For example,
[[../Private Languages/ThirdExample.ildf as ILDF]]
acts on:
= (not code)
With how sad steps, O Moon, thou climb'st the skies! 
How silently, and with how wan a face! 
What, may it be that even in heav'nly place 
That busy archer his sharp arrows tries! 
Sure, if that long-with love-acquainted eyes 
Can judge of love, thou feel'st a lover's case, 
I read it in thy looks; thy languish'd grace 
To me, that feel the like, thy state descries. 
Then, ev'n of fellowship, O Moon, tell me, 
Is constant love deem'd there but want of wit? 
Are beauties there as proud as here they be? 
Do they above love to be lov'd, and yet 
Those lovers scorn whom that love doth possess? 
Do they call virtue there ungratefulness?
=
to produce:
= (sample ThirdExample code)
With how sad steps, O Moon, thou climb'st the skies! 
How silently, and with how wan a face! 
What, may it be that even in heav'nly place 
That busy archer his sharp arrows tries! 
Sure, if that long-with love-acquainted eyes 
Can judge of love, thou feel'st a lover's case, 
I read it in thy looks; thy languish'd grace 
To me, that feel the like, thy state descries. 
Then, ev'n of fellowship, O Moon, tell me, 
Is constant love deem'd there but want of wit? 
Are beauties there as proud as here they be? 
Do they above love to be lov'd, and yet 
Those lovers scorn whom that love doth possess? 
Do they call virtue there ungratefulness?
=

@ Any condition can be reversed by preceding it with |not|. For example,
= (sample ILDF code)
    not coloured !string => !plain
=
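
@ Conditions of different kinds can sit side by side in one block. The
following sketch, for a hypothetical C-like language, combines keyword tests
with a prefix test. It assumes that the ILD has declared keywords in the
|!reserved| and |!function| categories, and it quotes |->| on the principle
that quotation marks are the safe choice for anything but a plain word.
= (sample ILDF code)
    colouring {
        runs of !identifier {
            keyword of !reserved => !reserved
            keyword of !function => !function
            optionally spaced prefix "->" => !element
        }
    }
=
Because Inweb works sideways-first, the first rule is tried on every
identifier in the line before the second rule is tried on any of them.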

@h The three ways rules can take effect.
Now let's look at the conclusion Y of a rule. Here the possibilities are
simpler:

1. If Y is the name of a colour, the snippet is painted in that colour.

By default, the colour is applied to the snippet. For prefix or suffix
rules (see above), it can also be applied to the prefix or suffix: use
the notation |=> C on both| or |=> C on suffix| or |=> C on prefix|.

2. If Y is an open brace |{|, then it introduces a block of rules which are
applied to the snippet only if this rule has matched. For example,
= (sample ILDF code)
    keyword of !element => {
        optionally spaced prefix . => !element
        optionally spaced prefix -> => !element
    }
=
means that if the original condition |keyword of !element| applies, then two
further rules are applied.

3. If Y is the word |debug|, then the current snippet and its colouring
are printed out on the command line. Thus:
= (sample ILDF code)
    colouring {
        matches of /\d\S+/ {
            => debug
        }
    }
=
The rule |=> debug| is unconditional, and will print whenever it's reached.
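
@ As a sketch of the |on both| notation described above, the following
variation on an earlier example paints not only an identifier which follows
|->| but also the |->| itself. It is an illustration only, and assumes that
the |on| clause is written directly after the colour name, as the notation
|=> C on both| suggests.
= (sample ILDF code)
    colouring {
        runs of !identifier {
            optionally spaced prefix "->" => !element on both
        }
    }
=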

@h The worm, Ouroboros.
Inweb Language Definition Format is a kind of language in itself, and in
fact Inweb is supplied with an ILD for ILDF itself, which Inweb used to
syntax-colour the examples above. Here it is, as syntax-coloured by itself:

[[../Languages/ILDF.ildf as ILDF]]