Pattern Matching

To provide a limited regular-expression parser.

§1. Character types
§3. Simple parsing
§6. A Worse PCRE
§14. Replacement

§1. Character types. We will define white space as spaces and tabs only, since the various kinds of line terminator will always be stripped out before this is applied.

    int Regexp::white_space(int c) {
        if ((c == ' ') || (c == '\t')) return TRUE;
        return FALSE;
    }

The function Regexp::white_space is used in §5.

§2. The presence of : here is perhaps a bit surprising, since it's illegal in C and has other meanings in other languages, but it's legal in C-for-Inform identifiers.

    int Regexp::identifier_char(int c) {
        if ((c == '_') || (c == ':') ||
            ((c >= 'A') && (c <= 'Z')) ||
            ((c >= 'a') && (c <= 'z')) ||
            ((c >= '0') && (c <= '9'))) return TRUE;
        return FALSE;
    }

The function Regexp::identifier_char is used in §13.

§3. Simple parsing. The following finds the earliest minimal-length substring of a string, delimited by two pairs of characters: for example, << and >>. This could easily be done as a regular expression using Regexp::match, but the routine here is much quicker.

    int Regexp::find_expansion(text_stream *text, wchar_t on1, wchar_t on2,
        wchar_t off1, wchar_t off2, int *len) {
        for (int i = 0; i < Str::len(text); i++)
            if ((Str::get_at(text, i) == on1) && (Str::get_at(text, i+1) == on2)) {
                for (int j=i+2; j < Str::len(text); j++)
                    if ((Str::get_at(text, j) == off1) && (Str::get_at(text, j+1) == off2)) {
                        *len = j+2-i;
                        return i;
                    }
            }
        return -1;
    }

The function Regexp::find_expansion appears nowhere else.

§4. Still more simply:

    int Regexp::find_open_brace(text_stream *text) {
        for (int i=0; i < Str::len(text); i++)
            if (Str::get_at(text, i) == '{')
                return i;
        return -1;
    }

The function Regexp::find_open_brace appears nowhere else.

§5. Note that we count the empty string as being white space. Again, this is equivalent to Regexp::match(p, " *"), but much faster.

    int Regexp::string_is_white_space(text_stream *text) {
        LOOP_THROUGH_TEXT(P, text)
            if (Regexp::white_space(Str::get(P)) == FALSE)
                return FALSE;
        return TRUE;
    }

The function Regexp::string_is_white_space appears nowhere else.

§6. A Worse PCRE. I originally wanted to call the function in this section a_better_sscanf, then thought perhaps a_worse_PCRE would be more true. (PCRE is Philip Hazel's superb C implementation of regular-expression parsing, but I didn't need its full strength, and I didn't want to complicate the build process by linking to it.)

This is a very minimal regular expression parser, simply for convenience of parsing short texts against particularly simple patterns. Here is an example of use:

        match_results mr = Regexp::create_mr();
        if (Regexp::match(&mr, text, L"fish (%d+) ([a-zA-Z_][a-zA-Z0-9_]*) *") {
            PRINT("Fish number: %S\n", mr.exp[0]);
            PRINT("Fish name: %S\n", mr.exp[1]);
        }
        Regexp::dispose_of(&mr);

Note the L at the front of the regex itself: this is a wide string.

This tries to match the given text to see if it consists of the word fish, then any amount of whitespace, then a string of digits which are copied into mr->exp[0], then whitespace again, and then an alphanumeric identifier to be copied into mr->exp[1], and finally optional whitespace. (If no match is made, the contents of the found strings are undefined.)

Note that this differs from, for example, Perl's regular expression matcher in several ways. The regular expression syntax is slightly different and in general simpler. A match has to be made from start to end, so it's as if there were an implicit ^ at the front and $ at the back (in Perl terms). The full match text is therefore always the entire text put in, so there's no need to record this. In Perl, matching against m/(.*) plus (.*)/ would set three subexpressions: number 0 would be the whole text matched, number 1 would be the first bracketed part, number 2 the second. Here, though, the corresponding regex would be written L"(%c*) plus (%c*)", and the bracketed terms would be subexpressions 0 and 1.

    define MAX_BRACKETED_SUBEXPRESSIONS 5  this many bracketed subexpressions can be extracted

§7. The internal state of the matcher is stored as follows:

    typedef struct match_position {
        int tpos;  position within text being matched
        int ppos;  position within pattern
        int bc;  count of bracketed subexpressions so far begun
        int bl;  bracket indentation level
        int bracket_nesting[MAX_BRACKETED_SUBEXPRESSIONS];
         which subexpression numbers (0, 1, 2, 3) correspond to which nesting
        int brackets_start[MAX_BRACKETED_SUBEXPRESSIONS], brackets_end[MAX_BRACKETED_SUBEXPRESSIONS];
         positions in text being matched, inclusive
    } match_position;

The structure match_position is private to this section.

§8. It may appear that match texts are limited to 64 characters here, but they are not. They are simply a little faster to access if short.

    define MATCH_TEXT_INITIAL_ALLOCATION 64

    typedef struct match_result {
        wchar_t match_text_storage[MATCH_TEXT_INITIAL_ALLOCATION];
        struct text_stream match_text_struct;
    } match_result;
    typedef struct match_results {
        int no_matched_texts;
        struct match_result exp_storage[MAX_BRACKETED_SUBEXPRESSIONS];
        struct text_stream *exp[MAX_BRACKETED_SUBEXPRESSIONS];
        int exp_at[MAX_BRACKETED_SUBEXPRESSIONS];
    } match_results;

The structure match_result is private to this section.

The structure match_results is accessed in 3/cla, 8/ws, 8/wm, 8/bf and here.

§9. Match result objects are inherently ephemeral, and we can expect to be creating them and throwing them away frequently. This must be done explicitly. Note that the storage required is on the C stack (unless some result strings grow very large), so that it's very quick to allocate and deallocate.

    match_results Regexp::create_mr(void) {
        match_results mr;
        mr.no_matched_texts = 0;
        for (int i=0; i<MAX_BRACKETED_SUBEXPRESSIONS; i++) {
            mr.exp[i] = NULL;
            mr.exp_at[i] = -1;
        }
        return mr;
    }

    void Regexp::dispose_of(match_results *mr) {
        if (mr) {
            for (int i=0; i<MAX_BRACKETED_SUBEXPRESSIONS; i++)
                if (mr->exp[i]) {
                    STREAM_CLOSE(mr->exp[i]);
                    mr->exp[i] = NULL;
                }
            mr->no_matched_texts = 0;
        }
    }

The function Regexp::create_mr is used in §14, 3/cla (§11, §12), 8/ws (§7.3.2, §7.3.3.2, §7.3.3.2.1, §7.2.1, §7.2.2.1, §7.2.2.3), 8/wm (§9), 8/bf (§3).

The function Regexp::dispose_of is used in §10, §14, 3/cla (§11), 8/ws (§7.3.2, §7.3.3.2, §7.3.3.2.1, §7.2.1, §7.2.2.1, §7.2.2.3), 8/wm (§9), 8/bf (§3).

§10. So, then: the matcher itself.

    int Regexp::match(match_results *mr, text_stream *text, wchar_t *pattern) {
        if (mr) Regexp::prepare(mr);
        int rv = (Regexp::match_r(mr, text, pattern, NULL, FALSE) >= 0)?TRUE:FALSE;
        if ((mr) && (rv == FALSE)) Regexp::dispose_of(mr);
        return rv;
    }

    int Regexp::match_from(match_results *mr, text_stream *text, wchar_t *pattern,
        int x, int allow_partial) {
        int match_to = x;
        if (x < Str::len(text)) {
            if (mr) Regexp::prepare(mr);
            match_position at;
            at.tpos = x; at.ppos = 0; at.bc = 0; at.bl = 0;
            match_to = Regexp::match_r(mr, text, pattern, &at, allow_partial);
            if (match_to == -1) {
                match_to = x;
                if (mr) Regexp::dispose_of(mr);
            }
        }
        return match_to - x;
    }

    void Regexp::prepare(match_results *mr) {
        if (mr) {
            mr->no_matched_texts = 0;
            for (int i=0; i<MAX_BRACKETED_SUBEXPRESSIONS; i++) {
                mr->exp_at[i] = -1;
                if (mr->exp[i]) STREAM_CLOSE(mr->exp[i]);
                mr->exp_storage[i].match_text_struct =
                    Streams::new_buffer(
                        MATCH_TEXT_INITIAL_ALLOCATION, mr->exp_storage[i].match_text_storage);
                mr->exp_storage[i].match_text_struct.stream_flags |= FOR_RE_STRF;
                mr->exp[i] = &(mr->exp_storage[i].match_text_struct);
            }
        }
    }

The function Regexp::match is used in 3/cla (§11, §12), 8/ws (§7.3.2, §7.3.3.2, §7.3.3.2.1, §7.2.1, §7.2.2.1, §7.2.2.3), 8/wm (§9), 8/bf (§3).

The function Regexp::match_from appears nowhere else.

The function Regexp::prepare is used in §14.

§11.

    int Regexp::match_r(match_results *mr, text_stream *text, wchar_t *pattern,
        match_position *scan_from, int allow_partial) {
        match_position at;
        if (scan_from) at = *scan_from;
        else { at.tpos = 0; at.ppos = 0; at.bc = 0; at.bl = 0; }

        while ((Str::get_at(text, at.tpos)) || (pattern[at.ppos])) {
            if ((allow_partial) && (pattern[at.ppos] == 0)) break;
            <Parentheses in the match pattern set up substrings to extract 11.1>;

            int chcl,  what class of characters to match: a *_CLASS value
                range_from, range_to,  for LITERAL_CLASS only
                reverse = FALSE;  require a non-match rather than a match
            <Extract the character class to match from the pattern 11.2>;

            int rep_from = 1, rep_to = 1;  minimum and maximum number of repetitions
            int greedy = TRUE;  go for a maximal-length match if possible
            <Extract repetition markers from the pattern 11.3>;

            int reps = 0;
            <Count how many repetitions can be made here 11.4>;
            if (reps < rep_from) return -1;

             we can now accept anything from rep_from to reps repetitions
            if (rep_from == reps) { at.tpos += reps; continue; }
            <Try all possible match lengths until we find a match 11.5>;

             no match length worked, so no match
            return -1;
        }
        <Copy the bracketed texts found into the global strings 11.6>;
        return at.tpos;
    }

The function Regexp::match_r is used in §10, §11.5, §14.

§11.1. <Parentheses in the match pattern set up substrings to extract 11.1> =

        if (pattern[at.ppos] == '(') {
            if (at.bl < MAX_BRACKETED_SUBEXPRESSIONS) at.bracket_nesting[at.bl] = -1;
            if (at.bc < MAX_BRACKETED_SUBEXPRESSIONS) {
                at.bracket_nesting[at.bl] = at.bc;
                at.brackets_start[at.bc] = at.tpos; at.brackets_end[at.bc] = -1;
            }
            at.bl++; at.bc++; at.ppos++;
            continue;
        }
        if (pattern[at.ppos] == ')') {
            at.bl--;
            if ((at.bl >= 0) && (at.bl < MAX_BRACKETED_SUBEXPRESSIONS) && (at.bracket_nesting[at.bl] >= 0))
                at.brackets_end[at.bracket_nesting[at.bl]] = at.tpos-1;
            at.ppos++;
            continue;
        }

This code is used in §11.

§11.2. <Extract the character class to match from the pattern 11.2> =

        int len;
        chcl = Regexp::get_cclass(pattern, at.ppos, &len, &range_from, &range_to, &reverse);
        at.ppos += len;

This code is used in §11.

§11.3. This is standard regular-expression notation, except that I haven't bothered to implement numeric repetition counts, which we won't need:

<Extract repetition markers from the pattern 11.3> =

        if (chcl == WHITESPACE_CLASS) {
            rep_from = 1; rep_to = Str::len(text)-at.tpos;
        }
        if (pattern[at.ppos] == '+') {
            rep_from = 1; rep_to = Str::len(text)-at.tpos; at.ppos++;
        } else if (pattern[at.ppos] == '*') {
            rep_from = 0; rep_to = Str::len(text)-at.tpos; at.ppos++;
        }
        if (pattern[at.ppos] == '?') { greedy = FALSE; at.ppos++; }

This code is used in §11.

§11.4. <Count how many repetitions can be made here 11.4> =

        for (reps = 0; ((Str::get_at(text, at.tpos+reps)) && (reps < rep_to)); reps++)
            if (Regexp::test_cclass(Str::get_at(text, at.tpos+reps), chcl,
                range_from, range_to, pattern, reverse) == FALSE)
                break;

This code is used in §11.

§11.5. <Try all possible match lengths until we find a match 11.5> =

        int from = rep_from, to = reps, dj = 1, from_tpos = at.tpos;
        if (greedy) { from = reps; to = rep_from; dj = -1; }
        for (int j = from; j != to+dj; j += dj) {
            at.tpos = from_tpos + j;
            int try = Regexp::match_r(mr, text, pattern, &at, allow_partial);
            if (try >= 0) return try;
        }

This code is used in §11.

§11.6. <Copy the bracketed texts found into the global strings 11.6> =

        if (mr) {
            for (int i=0; i<at.bc; i++) {
                Str::clear(mr->exp[i]);
                for (int j = at.brackets_start[i]; j <= at.brackets_end[i]; j++)
                    PUT_TO(mr->exp[i], Str::get_at(text, j));
                mr->exp_at[i] = at.brackets_start[i];
            }
            mr->no_matched_texts = at.bc;
        }

This code is used in §11.

§12. So then: most characters in the pattern are taken literally (if the pattern says q, the only match is with a lower-case letter "q"), except that:

(a) a space means "one or more characters of white space";
(b) %d means any decimal digit;
(c) %c means any character at all;
(d) %C means any character which isn't white space;
(e) %i means any character from the identifier class (see above);
(f) %p means any character which can be used in the name of a Preform nonterminal, which is to say, an identifier character or a hyphen;
(g) %P means the same or else a colon;
(h) %t means a tab;
(i) %q means a double-quote.

% otherwise makes a literal escape; a space means any whitespace character; square brackets enclose literal alternatives, and note as usual with grep engines that []xyz] is legal and makes a set of four possibilities, the first of which is a literal close square; within a set, a hyphen makes a character range; an initial ^ negates the result; and otherwise everything is literal.

    define ANY_CLASS 1
    define DIGIT_CLASS 2
    define WHITESPACE_CLASS 3
    define NONWHITESPACE_CLASS 4
    define IDENTIFIER_CLASS 5
    define PREFORM_CLASS 6
    define PREFORMC_CLASS 7
    define LITERAL_CLASS 8
    define TAB_CLASS 9
    define QUOTE_CLASS 10

    int Regexp::get_cclass(wchar_t *pattern, int ppos, int *len, int *from, int *to, int *reverse) {
        if (pattern[ppos] == '^') { ppos++; *reverse = TRUE; } else { *reverse = FALSE; }
        switch (pattern[ppos]) {
            case '%':
                ppos++;
                *len = 2;
                switch (pattern[ppos]) {
                    case 'd': return DIGIT_CLASS;
                    case 'c': return ANY_CLASS;
                    case 'C': return NONWHITESPACE_CLASS;
                    case 'i': return IDENTIFIER_CLASS;
                    case 'p': return PREFORM_CLASS;
                    case 'P': return PREFORMC_CLASS;
                    case 'q': return QUOTE_CLASS;
                    case 't': return TAB_CLASS;
                }
                *from = ppos; *to = ppos; return LITERAL_CLASS;
            case '[':
                *from = ppos+1;
                ppos += 2;
                while ((pattern[ppos]) && (pattern[ppos] != ']')) ppos++;
                *to = ppos - 1; *len = ppos - *from + 2;
                return LITERAL_CLASS;
            case ' ':
                *len = 1; return WHITESPACE_CLASS;
        }
        *len = 1; *from = ppos; *to = ppos; return LITERAL_CLASS;
    }

The function Regexp::get_cclass is used in §11.2.

§13.

    int Regexp::test_cclass(int c, int chcl, int range_from, int range_to, wchar_t *drawn_from, int reverse) {
        int match = FALSE;
        switch (chcl) {
            case ANY_CLASS: if (c) match = TRUE; break;
            case DIGIT_CLASS: if (isdigit(c)) match = TRUE; break;
            case WHITESPACE_CLASS: if (Characters::is_space_or_tab(c)) match = TRUE; break;
            case TAB_CLASS: if (c == '\t') match = TRUE; break;
            case NONWHITESPACE_CLASS: if (!(Characters::is_space_or_tab(c))) match = TRUE; break;
            case QUOTE_CLASS: if (c != '\"') match = TRUE; break;
            case IDENTIFIER_CLASS: if (Regexp::identifier_char(c)) match = TRUE; break;
            case PREFORM_CLASS: if ((c == '-') || (c == '_') ||
                ((c >= 'a') && (c <= 'z')) ||
                ((c >= '0') && (c <= '9'))) match = TRUE; break;
            case PREFORMC_CLASS: if ((c == '-') || (c == '_') || (c == ':') ||
                ((c >= 'a') && (c <= 'z')) ||
                ((c >= '0') && (c <= '9'))) match = TRUE; break;
            case LITERAL_CLASS:
                if ((range_to > range_from) && (drawn_from[range_from] == '^')) {
                    range_from++; reverse = reverse?FALSE:TRUE;
                }
                for (int j = range_from; j <= range_to; j++) {
                    int c1 = drawn_from[j], c2 = c1;
                    if ((j+1 < range_to) && (drawn_from[j+1] == '-')) { c2 = drawn_from[j+2]; j += 2; }
                    if ((c >= c1) && (c <= c2)) {
                        match = TRUE; break;
                    }
                }
                break;
        }
        if (reverse) match = (match)?FALSE:TRUE;
        return match;
    }

The function Regexp::test_cclass is used in §11.4.

§14. Replacement. And this routine conveniently handles searching and replacing. This time we can match at substrings of the text (i.e., we are not forced to match from the start right to the end), and multiple replacements can be made. For example,

        Regexp::replace(text, L"[aeiou]", L"!", REP_REPEATING);

will turn the text "goose eggs" into "g!!s! !ggs".

    define REP_REPEATING 1
    define REP_ATSTART 2

    int Regexp::replace(text_stream *text, wchar_t *pattern, wchar_t *replacement, int options) {
        TEMPORARY_TEXT(altered);
        match_results mr = Regexp::create_mr();
        int changes = 0;
        for (int i=0, L=Str::len(text); i<L; i++) {
            match_position mp; mp.tpos = i; mp.ppos = 0; mp.bc = 0; mp.bl = 0;
            Regexp::prepare(&mr);
            int try = Regexp::match_r(&mr, text, pattern, &mp, TRUE);
            if (try >= 0) {
                if (replacement)
                    for (int j=0; replacement[j]; j++) {
                        int c = replacement[j];
                        if (c == '%') {
                            j++;
                            int ind = replacement[j] - '0';
                            if ((ind >= 0) && (ind < MAX_BRACKETED_SUBEXPRESSIONS))
                                WRITE_TO(altered, "%S", mr.exp[ind]);
                            else
                                PUT_TO(altered, replacement[j]);
                        } else {
                            PUT_TO(altered, replacement[j]);
                        }
                    }
                int left = L - try;
                changes++;
                Regexp::dispose_of(&mr);
                L = Str::len(text); i = L-left-1;
                if ((options & REP_REPEATING) == 0) { <Add the rest 14.1>; break; }
                continue;
            } else PUT_TO(altered, Str::get_at(text, i));
            if (options & REP_ATSTART) { <Add the rest 14.1>; break; }
        }
        Regexp::dispose_of(&mr);
        if (changes > 0) Str::copy(text, altered);
        DISCARD_TEXT(altered);
        return changes;
    }

The function Regexp::replace appears nowhere else.

§14.1. <Add the rest 14.1> =

        for (i++; i<L; i++)
            PUT_TO(altered, Str::get_at(text, i));

This code is used in §14 (twice).

Back to 'Tries and Avinues'
(This section ends Chapter 4: Text Handling.)