Convenient routines for manipulating strings of text.


§1. Strings are streams. Although Foundation provides limited facilities for handling standard or wide C-style strings — that is, null-terminated arrays of char or wchar_t — these are not encouraged.

Instead, a standard string for a program using Foundation is nothing more than a text stream (see Chapter 2). These are unbounded in size, with memory allocation being automatic; they are encoded as an array of Unicode code points (not as UTF-8, -16 or -32); and they do not use a null or indeed any terminator. This has the advantage that finding the length of a string, and appending characters to it, run in constant time regardless of the string's length. It is entirely feasible to write hundreds of megabytes of output into a string, if that's useful, and no substantial slowing down will occur in handling the result (except, of course, that printing it out on screen would take a while). Strings are also very well protected against buffer overruns.
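To make the constant-time claim concrete, here is a minimal standalone sketch of a counted, unterminated string. It is not Foundation's text_stream: the type model_string and the functions model_open, model_len, model_put and model_close are hypothetical, and the doubling growth policy is an assumption made for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical model of a counted string: NOT Foundation's text_stream. */
typedef struct model_string {
    wchar_t *chars;   /* code points, with no terminator */
    int len;          /* number of characters currently held */
    int capacity;     /* number of characters allocated */
} model_string;

/* Open with a small initial block; returns 1 on success, 0 on failure. */
int model_open(model_string *M, int capacity) {
    if (capacity < 1) capacity = 1;
    M->chars = (wchar_t *) malloc((size_t) capacity * sizeof(wchar_t));
    M->len = 0;
    M->capacity = (M->chars) ? capacity : 0;
    return (M->chars != NULL);
}

/* The length is a stored field, so this is O(1) however long the string is. */
int model_len(const model_string *M) {
    return M->len;
}

/* Appending is O(1) amortised: the block doubles whenever it fills up
   (an assumed policy, chosen only to keep the sketch short). */
int model_put(model_string *M, wchar_t c) {
    if (M->chars == NULL) return 0;
    if (M->len == M->capacity) {
        int bigger = 2 * M->capacity;
        wchar_t *grown = (wchar_t *) realloc(M->chars,
            (size_t) bigger * sizeof(wchar_t));
        if (grown == NULL) return 0;
        M->chars = grown;
        M->capacity = bigger;
    }
    M->chars[M->len++] = c;
    return 1;
}

void model_close(model_string *M) {
    free(M->chars);
    M->chars = NULL; M->len = 0; M->capacity = 0;
}
```

Because every write goes through model_put, which checks the capacity first, there is no way to run off the end of the block; that is the sense in which a counted string resists buffer overruns.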

The present section of code provides convenient routines for creating, duplicating, modifying and examining such strings.

§2. New strings. Sometimes we want to make a new string in the sense of allocating more memory to hold it. These objects won't automatically be destroyed, so we shouldn't call these routines too casually. If we need a string just for some space to play with for a short while, it's better to create one with TEMPORARY_TEXT and then get rid of it with DISCARD_TEXT, macros defined in Chapter 2.

The capacity of these strings is unlimited in principle: the number passed to Str::new_with_capacity is just the size of the initial memory block, which is the fastest to access.

    text_stream *Str::new(void) {
        return Str::new_with_capacity(32);
    }

    text_stream *Str::new_with_capacity(int c) {
        text_stream *S = CREATE(text_stream);
        if (Streams::open_to_memory(S, c)) return S;
        return NULL;
    }

    void Str::dispose_of(text_stream *text) {
        if (text) STREAM_CLOSE(text);
    }

The function Str::new is used in §3, 2/dl (§4.1), 2/dct (§7.3.1), 3/cla (§6, §11), 5/ee (§5), 7/vn (§7).

The function Str::new_with_capacity is used in 3/pth (§4), 3/fln (§2).

The function Str::dispose_of is used in 2/dct (§7.3.2, §11).

§3. Duplication of an existing string is complicated only by the issue that we want the duplicate always to be writeable, so that NULL can't be duplicated as NULL.

    text_stream *Str::duplicate(text_stream *E) {
        if (E == NULL) return Str::new();
        text_stream *S = CREATE(text_stream);
        if (Streams::open_to_memory(S, Str::len(E)+4)) {
            Streams::copy(S, E);
            return S;
        }
        return NULL;
    }

The function Str::duplicate is used in 2/dct (§7.3.1), 3/cla (§5.1), 5/ee (§5), 7/vn (§7.1).

§4. Converting from C strings. Here we open text streams initially equal to the given C strings, and with the capacity of the initial block large enough to hold the whole thing plus a little extra, for efficiency's sake.

    text_stream *Str::new_from_wide_string(wchar_t *C_string) {
        text_stream *S = CREATE(text_stream);
        if (Streams::open_from_wide_string(S, C_string)) return S;
        return NULL;
    }

    text_stream *Str::new_from_ISO_string(char *C_string) {
        text_stream *S = CREATE(text_stream);
        if (Streams::open_from_ISO_string(S, C_string)) return S;
        return NULL;
    }

    text_stream *Str::new_from_UTF8_string(char *C_string) {
        text_stream *S = CREATE(text_stream);
        if (Streams::open_from_UTF8_string(S, C_string)) return S;
        return NULL;
    }

    text_stream *Str::new_from_locale_string(char *C_string) {
        text_stream *S = CREATE(text_stream);
        if (Streams::open_from_locale_string(S, C_string)) return S;
        return NULL;
    }

The function Str::new_from_wide_string is used in 3/cla (§5, §14), 3/pth (§3), 5/ee (§5).

The function Str::new_from_ISO_string appears nowhere else.

The function Str::new_from_UTF8_string appears nowhere else.

The function Str::new_from_locale_string is used in 3/pth (§2, §3).

§5. And sometimes we want to use an existing stream object:

    text_stream *Str::from_wide_string(text_stream *S, wchar_t *c_string) {
        if (Streams::open_from_wide_string(S, c_string) == FALSE) return NULL;
        return S;
    }

    text_stream *Str::from_locale_string(text_stream *S, char *c_string) {
        if (Streams::open_from_locale_string(S, c_string) == FALSE) return NULL;
        return S;
    }

The function Str::from_wide_string appears nowhere else.

The function Str::from_locale_string appears nowhere else.

§6. Converting to C strings.

    void Str::copy_to_ISO_string(char *C_string, text_stream *S, int buffer_size) {
        Streams::write_as_ISO_string(C_string, S, buffer_size);
    }

    void Str::copy_to_UTF8_string(char *C_string, text_stream *S, int buffer_size) {
        Streams::write_as_UTF8_string(C_string, S, buffer_size);
    }

    void Str::copy_to_wide_string(wchar_t *C_string, text_stream *S, int buffer_size) {
        Streams::write_as_wide_string(C_string, S, buffer_size);
    }

    void Str::copy_to_locale_string(char *C_string, text_stream *S, int buffer_size) {
        Streams::write_as_locale_string(C_string, S, buffer_size);
    }

The function Str::copy_to_ISO_string appears nowhere else.

The function Str::copy_to_UTF8_string appears nowhere else.

The function Str::copy_to_wide_string appears nowhere else.

The function Str::copy_to_locale_string is used in 3/pth (§8, §9), 3/fln (§10), 3/drc (§2).

§7. Converting to integers.

    int Str::atoi(text_stream *S, int index) {
        char buffer[32];
        int i = 0;
        for (string_position P = Str::at(S, index);
            ((i < 31) && (P.index < Str::len(S))); P = Str::forward(P))
            buffer[i++] = (char) Str::get(P);
        buffer[i] = 0;
        return atoi(buffer);
    }

The function Str::atoi is used in 3/cla (§12), 7/vn (§10).
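The same technique can be sketched outside Foundation. In this hedged model, the hypothetical function model_atoi reads from a counted array of wide characters rather than from a text_stream; it copies at most 31 characters into a narrow buffer, terminates it, and hands it to the C library's atoi.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for Str::atoi, reading from a counted wide-character
   array instead of a text_stream. */
int model_atoi(const wchar_t *chars, int len, int index) {
    char buffer[32];
    int i = 0;
    if (index < 0) index = 0;             /* clamp, as Str::at does */
    if (index > len) index = len;
    for (int p = index; (i < 31) && (p < len); p++)
        buffer[i++] = (char) chars[p];    /* narrowing is harmless for digits */
    buffer[i] = 0;                        /* atoi needs a terminator */
    return atoi(buffer);
}
```

Since atoi stops at the first character which cannot be part of a number, trailing text is ignored: starting at index 1 of "v12x" yields 12, and a string with no leading digits yields 0.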

§8. Length. A puritan would return a size_t here, but I am not a puritan.

    int Str::len(text_stream *S) {
        return Streams::get_position(S);
    }

The function Str::len is used in §3, §7, §10, §11, §12, §13, §14, §15, §16, §18, §19, §20, §21, §23, §24, §25, 2/dl (§9), 3/cla (§13, §14, §14.1), 3/pth (§4, §5, §7), 3/fln (§2, §3, §5, §9), 4/taa (§2.1), 4/pm (§3, §4, §11.3, §14), 5/htm (§7, §15), 5/ee (§7.1, §7.2.3, §7.2.4), 7/vn (§7, §7.1, §10).

§9. Position markers. A position marker is a lightweight way to refer to a particular position in a given string. Position 0 is before the first character; if, for example, the string contains the word "gazpacho", then position 8 represents the end of the string, after the "o". Negative positions are not allowed, but positive ones well past the end of the string are legal. (Doing things at those positions may well not be, of course.)

    typedef struct string_position {
        struct text_stream *S;
        int index;
    } string_position;

The structure string_position is private to this section.

§10. You can then find a position in a given string thus:

    string_position Str::start(text_stream *S) {
        string_position P; P.S = S; P.index = 0; return P;
    }

    string_position Str::at(text_stream *S, int i) {
        if (i < 0) i = 0;
        if (i > Str::len(S)) i = Str::len(S);
        string_position P; P.S = S; P.index = i; return P;
    }

    string_position Str::end(text_stream *S) {
        string_position P; P.S = S; P.index = Str::len(S); return P;
    }

The function Str::start is used in §12, §19, §23, 3/pth (§5).

The function Str::at is used in §7, §13, §14, §15, §16, §24, 3/pth (§4, §5), 3/fln (§2, §3, §9).

The function Str::end is used in §12, 3/fln (§9).

§11. And you can step forwards or backwards:

    string_position Str::back(string_position P) {
        if (P.index > 0) P.index--;
        return P;
    }

    string_position Str::forward(string_position P) {
        P.index++; return P;
    }

    string_position Str::plus(string_position P, int increment) {
        P.index += increment; return P;
    }

    int Str::width_between(string_position P1, string_position P2) {
        if (P1.S != P2.S) internal_error("positions are in different strings");
        return P2.index - P1.index;
    }

    int Str::in_range(string_position P) {
        if (P.index < Str::len(P.S)) return TRUE;
        return FALSE;
    }

    int Str::index(string_position P) {
        return P.index;
    }

The function Str::back is used in §12, 3/fln (§9).

The function Str::forward is used in §7, §16, §19, §23, §24, 3/fln (§2).

The function Str::plus appears nowhere else.

The function Str::width_between appears nowhere else.

The function Str::in_range appears nowhere else.

The function Str::index appears nowhere else.

§12. This leads to the following convenient loop macros:

    define LOOP_THROUGH_TEXT(P, ST)
        for (string_position P = Str::start(ST); P.index < Str::len(P.S); P.index++)
    define LOOP_BACKWARDS_THROUGH_TEXT(P, ST)
        for (string_position P = Str::back(Str::end(ST)); P.index >= 0; P.index--)
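As a rough illustration of how such loops behave, here is a standalone model in which a position is simply a counted array plus an index. The names model_position, model_at, model_forward_copy and model_reverse_copy are hypothetical. Note that this model guards the empty string explicitly in the backwards loop: as far as can be told from the definitions above, Str::back never goes below 0, so the backwards macro would begin at index 0 even when the string is empty.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical model: a position is just a counted array plus an index. */
typedef struct model_position {
    const wchar_t *chars;
    int len;
    int index;
} model_position;

model_position model_at(const wchar_t *chars, int len, int i) {
    if (i < 0) i = 0;          /* clamp to the string, as Str::at does */
    if (i > len) i = len;
    model_position P = { chars, len, i };
    return P;
}

/* A forward loop in the style of LOOP_THROUGH_TEXT: copy each character. */
int model_forward_copy(const wchar_t *chars, int len, wchar_t *out) {
    int n = 0;
    for (model_position P = model_at(chars, len, 0); P.index < P.len; P.index++)
        out[n++] = P.chars[P.index];
    return n;
}

/* A backward loop in the style of LOOP_BACKWARDS_THROUGH_TEXT. The empty
   string is excluded first, since index 0 names no character there. */
int model_reverse_copy(const wchar_t *chars, int len, wchar_t *out) {
    int n = 0;
    if (len == 0) return 0;
    for (model_position P = model_at(chars, len, len - 1); P.index >= 0; P.index--)
        out[n++] = P.chars[P.index];
    return n;
}
```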

§13. Character operations. How to get at individual characters, then, now that we can refer to positions:

    wchar_t Str::get(string_position P) {
        if ((P.S == NULL) || (P.index < 0)) return 0;
        return Streams::get_char_at_index(P.S, P.index);
    }

    wchar_t Str::get_at(text_stream *S, int index) {
        if ((S == NULL) || (index < 0)) return 0;
        return Streams::get_char_at_index(S, index);
    }

    wchar_t Str::get_first_char(text_stream *S) {
        return Str::get(Str::at(S, 0));
    }

    wchar_t Str::get_last_char(text_stream *S) {
        int L = Str::len(S);
        if (L == 0) return 0;
        return Str::get(Str::at(S, L-1));
    }

The function Str::get is used in §7, §16, §19, §21, §22, §23, §24, §25, 2/dct (§4), 3/pth (§4, §5), 3/fln (§2, §3, §7, §8, §9), 3/shl (§1), 4/pm (§5), 5/ee (§5), 7/vn (§7, §10).

The function Str::get_at is used in §20, §23, §25, 3/pth (§7), 3/fln (§5), 4/taa (§2), 4/pm (§3, §4, §11, §11.4, §11.6, §14, §14.1).

The function Str::get_first_char is used in 7/vn (§10).

The function Str::get_last_char appears nowhere else.

§14.

    void Str::put(string_position P, wchar_t C) {
        if (P.index < 0) internal_error("wrote before start of string");
        if (P.S == NULL) internal_error("wrote to null stream");
        int ext = Str::len(P.S);
        if (P.index > ext) internal_error("wrote beyond end of string");
        if (P.index == ext) {
            if (C) PUT_TO(P.S, (int) C);
            return;
        }
        Streams::put_char_at_index(P.S, P.index, C);
    }

    void Str::put_at(text_stream *S, int index, wchar_t C) {
        Str::put(Str::at(S, index), C);
    }

The function Str::put is used in §15, §23, §24, 5/ee (§5).

The function Str::put_at is used in 4/tf (§5.3, §6).

§15. Truncation.

    void Str::clear(text_stream *S) {
        Str::truncate(S, 0);
    }

    void Str::truncate(text_stream *S, int len) {
        if (len < 0) len = 0;
        if (len < Str::len(S)) Str::put(Str::at(S, len), 0);
    }

The function Str::clear is used in §16, §17, §24, 3/drc (§2), 4/tf (§6), 4/pm (§11.6), 7/vn (§7.1).

The function Str::truncate is used in §23, §24.

§16. Copying.

    void Str::concatenate(text_stream *S1, text_stream *S2) {
        Streams::copy(S1, S2);
    }

    void Str::copy(text_stream *S1, text_stream *S2) {
        if (S1 == S2) return;
        Str::clear(S1);
        Streams::copy(S1, S2);
    }

    void Str::copy_tail(text_stream *S1, text_stream *S2, int from) {
        Str::clear(S1);
        int L = Str::len(S2);
        if (from < L)
            for (string_position P = Str::at(S2, from); P.index < L; P = Str::forward(P))
                PUT_TO(S1, Str::get(P));
    }

The function Str::concatenate appears nowhere else.

The function Str::copy is used in 3/cla (§12), 4/pm (§14), 5/ee (§5, §7.1).

The function Str::copy_tail appears nowhere else.

§17. A subtly different operation is to set a string equal to a given C string:

    void Str::copy_ISO_string(text_stream *S, char *C_string) {
        Str::clear(S);
        Streams::write_ISO_string(S, C_string);
    }

    void Str::copy_UTF8_string(text_stream *S, char *C_string) {
        Str::clear(S);
        Streams::write_UTF8_string(S, C_string);
    }

    void Str::copy_wide_string(text_stream *S, wchar_t *C_string) {
        Str::clear(S);
        Streams::write_wide_string(S, C_string);
    }

The function Str::copy_ISO_string appears nowhere else.

The function Str::copy_UTF8_string appears nowhere else.

The function Str::copy_wide_string appears nowhere else.

§18. Comparisons. We provide both case sensitive and insensitive versions.

    int Str::eq(text_stream *S1, text_stream *S2) {
        if ((Str::len(S1) == Str::len(S2)) && (Str::cmp(S1, S2) == 0)) return TRUE;
        return FALSE;
    }

    int Str::eq_insensitive(text_stream *S1, text_stream *S2) {
        if ((Str::len(S1) == Str::len(S2)) && (Str::cmp_insensitive(S1, S2) == 0)) return TRUE;
        return FALSE;
    }

    int Str::ne(text_stream *S1, text_stream *S2) {
        if ((Str::len(S1) != Str::len(S2)) || (Str::cmp(S1, S2) != 0)) return TRUE;
        return FALSE;
    }

    int Str::ne_insensitive(text_stream *S1, text_stream *S2) {
        if ((Str::len(S1) != Str::len(S2)) || (Str::cmp_insensitive(S1, S2) != 0)) return TRUE;
        return FALSE;
    }

The function Str::eq is used in 2/dl (§9), 2/dct (§7.3), 3/pth (§3), 3/fln (§11).

The function Str::eq_insensitive appears nowhere else.

The function Str::ne is used in 7/vn (§8).

The function Str::ne_insensitive appears nowhere else.

§19. These two routines produce a numerical string difference suitable for alphabetic sorting, like strcmp in the C standard library.

    int Str::cmp(text_stream *S1, text_stream *S2) {
        for (string_position P = Str::start(S1), Q = Str::start(S2);
            (P.index < Str::len(S1)) && (Q.index < Str::len(S2));
            P = Str::forward(P), Q = Str::forward(Q)) {
            int d = (int) Str::get(P) - (int) Str::get(Q);
            if (d != 0) return d;
        }
        return Str::len(S1) - Str::len(S2);
    }

    int Str::cmp_insensitive(text_stream *S1, text_stream *S2) {
        for (string_position P = Str::start(S1), Q = Str::start(S2);
            (P.index < Str::len(S1)) && (Q.index < Str::len(S2));
            P = Str::forward(P), Q = Str::forward(Q)) {
            int d = tolower((int) Str::get(P)) - tolower((int) Str::get(Q));
            if (d != 0) return d;
        }
        return Str::len(S1) - Str::len(S2);
    }

The function Str::cmp is used in §18, 7/vn (§8).

The function Str::cmp_insensitive is used in §18, 3/cla (§15).
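The sign convention is the one familiar from strcmp: negative if the first string sorts earlier, zero if the two are equal, positive otherwise. A minimal sketch on counted wide-character arrays follows; model_cmp and model_cmp_insensitive are hypothetical names, and the insensitive version uses towlower from the C library, which is an assumption of this sketch rather than what the routines above call.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical model of Str::cmp on counted arrays: the first differing
   character decides; if one string is a prefix of the other, the shorter
   string sorts first. */
int model_cmp(const wchar_t *A, int LA, const wchar_t *B, int LB) {
    for (int i = 0; (i < LA) && (i < LB); i++) {
        int d = (int) A[i] - (int) B[i];
        if (d != 0) return d;
    }
    return LA - LB;
}

/* The same, but folding both characters to lower case before comparing. */
int model_cmp_insensitive(const wchar_t *A, int LA, const wchar_t *B, int LB) {
    for (int i = 0; (i < LA) && (i < LB); i++) {
        int d = (int) towlower((wint_t) A[i]) - (int) towlower((wint_t) B[i]);
        if (d != 0) return d;
    }
    return LA - LB;
}
```

So "apple" sorts before "apply" because 'e' precedes 'y', and "app" sorts before "apple" because it is shorter by two characters.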

§20. It's sometimes useful to see whether two strings agree on their last N characters, or their first N. For example,

        Str::suffix_eq(I"wayzgoose", I"snow goose", N)

will return TRUE for N equal to 0 to 5, and FALSE thereafter.

(The Oxford English Dictionary defines a "wayzgoose" as a holiday outing for the staff of a publishing house.)

    int Str::prefix_eq(text_stream *S1, text_stream *S2, int N) {
        int L1 = Str::len(S1), L2 = Str::len(S2);
        if ((N > L1) || (N > L2)) return FALSE;
        for (int i=0; i<N; i++)
            if (Str::get_at(S1, i) != Str::get_at(S2, i))
                return FALSE;
        return TRUE;
    }

    int Str::suffix_eq(text_stream *S1, text_stream *S2, int N) {
        int L1 = Str::len(S1), L2 = Str::len(S2);
        if ((N > L1) || (N > L2)) return FALSE;
        for (int i=1; i<=N; i++)
            if (Str::get_at(S1, L1-i) != Str::get_at(S2, L2-i))
                return FALSE;
        return TRUE;
    }

    int Str::begins_with_wide_string(text_stream *S, wchar_t *prefix) {
        if ((prefix == NULL) || (*prefix == 0)) return TRUE;
        if (S == NULL) return FALSE;
        for (int i = 0; prefix[i]; i++)
            if (Str::get_at(S, i) != prefix[i])
                return FALSE;
        return TRUE;
    }

    int Str::ends_with_wide_string(text_stream *S, wchar_t *suffix) {
        if ((suffix == NULL) || (*suffix == 0)) return TRUE;
        if (S == NULL) return FALSE;
        for (int i = 0, at = Str::len(S) - (int) wcslen(suffix); suffix[i]; i++)
            if (Str::get_at(S, at+i) != suffix[i])
                return FALSE;
        return TRUE;
    }

The function Str::prefix_eq is used in 3/pth (§7), 3/fln (§5).

The function Str::suffix_eq appears nowhere else.

The function Str::begins_with_wide_string is used in 3/cla (§5.1).

The function Str::ends_with_wide_string appears nowhere else.
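The "wayzgoose" example can be checked with a small standalone model. Here the hypothetical function model_suffix_eq works on ordinary null-terminated wide strings, purely so that the sketch needs nothing from Foundation; the logic otherwise mirrors Str::suffix_eq.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical model of Str::suffix_eq on C wide strings: do A and B agree
   on their last N characters? Returns 1 for yes, 0 for no. */
int model_suffix_eq(const wchar_t *A, const wchar_t *B, int N) {
    int LA = (int) wcslen(A), LB = (int) wcslen(B);
    if ((N > LA) || (N > LB)) return 0;   /* not enough characters to compare */
    for (int i = 1; i <= N; i++)
        if (A[LA - i] != B[LB - i])       /* count back from each end */
            return 0;
    return 1;
}
```

The two words share the suffix "goose", so N from 0 to 5 succeeds; at N equal to 6 the comparison reaches "z" against a space and fails.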

§21.

    int Str::eq_wide_string(text_stream *S1, wchar_t *S2) {
        if (Str::len(S1) == (int) wcslen(S2)) {
            int i=0;
            LOOP_THROUGH_TEXT(P, S1)
                if (Str::get(P) != S2[i++])
                    return FALSE;
            return TRUE;
        }
        return FALSE;
    }

    int Str::eq_narrow_string(text_stream *S1, char *S2) {
        if (Str::len(S1) == (int) strlen(S2)) {
            int i=0;
            LOOP_THROUGH_TEXT(P, S1)
                if (Str::get(P) != (wchar_t) S2[i++])
                    return FALSE;
            return TRUE;
        }
        return FALSE;
    }

    int Str::ne_wide_string(text_stream *S1, wchar_t *S2) {
        return (Str::eq_wide_string(S1, S2)?FALSE:TRUE);
    }

The function Str::eq_wide_string is used in 2/dl (§9), 3/fln (§9), 5/ee (§5, §7.2.1, §7.3.2.1).

The function Str::eq_narrow_string appears nowhere else.

The function Str::ne_wide_string appears nowhere else.

§22. White space.

    int Str::is_whitespace(text_stream *S) {
        LOOP_THROUGH_TEXT(pos, S)
            if (Characters::is_space_or_tab(Str::get(pos)) == FALSE)
                return FALSE;
        return TRUE;
    }

The function Str::is_whitespace is used in 3/cla (§11).

§23. This removes spaces and tabs from both ends:

    void Str::trim_white_space(text_stream *S) {
        int len = Str::len(S), i = 0, j = 0;
        string_position F = Str::start(S);
        LOOP_THROUGH_TEXT(P, S) {
            if (!(Characters::is_space_or_tab(Str::get(P)))) { F = P; break; }
            i++;
        }
        LOOP_BACKWARDS_THROUGH_TEXT(Q, S) {
            if (!(Characters::is_space_or_tab(Str::get(Q)))) break;
            j++;
        }
        if (i+j > Str::len(S)) Str::truncate(S, 0);
        else {
            len = len - j;
            Str::truncate(S, len);
            if (i > 0) {
                string_position P = Str::start(S);
                wchar_t c = 0;
                do {
                    c = Str::get(F);
                    Str::put(P, c);
                    P = Str::forward(P); F = Str::forward(F);
                } while (c != 0);
                len = len - i;
                Str::truncate(S, len);
            }
        }
    }

    int Str::trim_white_space_at_end(text_stream *S) {
        int shortened = FALSE;
        for (int j = Str::len(S)-1; j >= 0; j--) {
            if (Characters::is_space_or_tab(Str::get_at(S, j))) {
                Str::truncate(S, j);
                shortened = TRUE;
            } else break;
        }
        return shortened;
    }

    int Str::trim_all_white_space_at_end(text_stream *S) {
        int shortened = FALSE;
        for (int j = Str::len(S)-1; j >= 0; j--) {
            if (Characters::is_babel_whitespace(Str::get_at(S, j))) {
                Str::truncate(S, j);
                shortened = TRUE;
            } else break;
        }
        return shortened;
    }

The function Str::trim_white_space appears nowhere else.

The function Str::trim_white_space_at_end appears nowhere else.

The function Str::trim_all_white_space_at_end appears nowhere else.
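Trimming at both ends amounts to finding the first and last non-space characters and then sliding the surviving middle down to the start. The following sketch does that on a counted buffer and returns the new length. The names model_trim and model_is_space_or_tab are hypothetical; the latter is only a guess at what Characters::is_space_or_tab tests, based on its name.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical stand-in for Characters::is_space_or_tab. */
static int model_is_space_or_tab(wchar_t c) {
    return (c == L' ') || (c == L'\t');
}

/* Hypothetical model of Str::trim_white_space: trims the counted buffer in
   place and returns its new length. */
int model_trim(wchar_t *chars, int len) {
    int start = 0, end = len;
    while ((start < end) && model_is_space_or_tab(chars[start])) start++;
    while ((end > start) && model_is_space_or_tab(chars[end - 1])) end--;
    if (start > 0)   /* slide the kept characters down to index 0 */
        wmemmove(chars, chars + start, (size_t) (end - start));
    return end - start;
}
```

A string consisting only of spaces and tabs trims to length 0, which corresponds to the case handled above by truncating to nothing.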

§24. Deleting characters.

    void Str::delete_first_character(text_stream *S) {
        Str::delete_nth_character(S, 0);
    }

    void Str::delete_last_character(text_stream *S) {
        if (Str::len(S) > 0)
            Str::truncate(S, Str::len(S) - 1);
    }

    void Str::delete_nth_character(text_stream *S, int n) {
        for (string_position P = Str::at(S, n); P.index < Str::len(P.S); P = Str::forward(P))
            Str::put(P, Str::get(Str::forward(P)));
    }

    void Str::delete_n_characters(text_stream *S, int n) {
        int L = Str::len(S) - n;
        if (L <= 0) Str::clear(S);
        else {
            for (int i=0; i<L; i++)
                Str::put(Str::at(S, i), Str::get(Str::at(S, i+n)));
            Str::truncate(S, L);
        }
    }

The function Str::delete_first_character appears nowhere else.

The function Str::delete_last_character appears nowhere else.

The function Str::delete_nth_character appears nowhere else.

The function Str::delete_n_characters is used in 3/cla (§5.1), 3/pth (§7), 3/fln (§5).
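Deleting the first n characters is a matter of shifting everything after them down by n places and then shortening the string. A hedged standalone model, using the hypothetical name model_delete_n and a counted buffer in place of a text_stream:

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical model of Str::delete_n_characters: removes the first n
   characters of a counted buffer in place and returns the new length. */
int model_delete_n(wchar_t *chars, int len, int n) {
    if (n <= 0) return len;        /* nothing to delete */
    int L = len - n;               /* how many characters survive */
    if (L <= 0) return 0;          /* deleting everything clears the string */
    for (int i = 0; i < L; i++)
        chars[i] = chars[i + n];   /* shift each survivor down by n places */
    return L;
}
```

For instance, deleting the first four characters of "wayzgoose" leaves "goose".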

§25. Substrings.

    void Str::substr(OUTPUT_STREAM, string_position from, string_position to) {
        if (from.S != to.S) internal_error("substr on two different strings");
        for (int i = from.index; i < to.index; i++)
            PUT(Str::get_at(from.S, i));
    }

    int Str::includes_character(text_stream *S, wchar_t c) {
        if (S)
            LOOP_THROUGH_TEXT(pos, S)
                if (Str::get(pos) == c)
                    return TRUE;
        return FALSE;
    }

    int Str::includes_wide_string_at(text_stream *S, wchar_t *prefix, int j) {
        if ((prefix == NULL) || (*prefix == 0)) return TRUE;
        if (S == NULL) return FALSE;
        for (int i = 0; prefix[i]; i++)
            if (Str::get_at(S, i+j) != prefix[i])
                return FALSE;
        return TRUE;
    }

    int Str::includes_wide_string_at_insensitive(text_stream *S, wchar_t *prefix, int j) {
        if ((prefix == NULL) || (*prefix == 0)) return TRUE;
        if (S == NULL) return FALSE;
        for (int i = 0; prefix[i]; i++)
            if (Characters::tolower(Str::get_at(S, i+j)) != Characters::tolower(prefix[i]))
                return FALSE;
        return TRUE;
    }

    int Str::includes(text_stream *S, text_stream *T) {
        int LS = Str::len(S);
        int LT = Str::len(T);
        for (int i=0; i<=LS-LT; i++) {
            int failed = FALSE;
            for (int j=0; j<LT; j++)
                if (Str::get_at(S, i+j) != Str::get_at(T, j)) {
                    failed = TRUE;
                    break;
                }
            if (failed == FALSE) return TRUE;
        }
        return FALSE;
    }

The function Str::substr is used in 3/fln (§3).

The function Str::includes_character appears nowhere else.

The function Str::includes_wide_string_at appears nowhere else.

The function Str::includes_wide_string_at_insensitive appears nowhere else.

The function Str::includes appears nowhere else.
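A substring search of this naive kind tries each possible starting offset in turn. In a standalone model it might look as follows, with the hypothetical name model_includes and counted arrays standing in for text streams. The outer bound is inclusive so that a match ending exactly at the end of S is still found.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical model of Str::includes: does the counted array S (length LS)
   contain T (length LT) as a contiguous run? Returns 1 for yes, 0 for no. */
int model_includes(const wchar_t *S, int LS, const wchar_t *T, int LT) {
    for (int i = 0; i <= LS - LT; i++) {   /* last valid start is LS-LT */
        int failed = 0;
        for (int j = 0; j < LT; j++)
            if (S[i + j] != T[j]) {
                failed = 1;
                break;
            }
        if (!failed) return 1;
    }
    return 0;
}
```

If T is longer than S then LS-LT is negative, the outer loop never runs, and the answer is correctly no.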

§26. Shim for literal storage. This is where all of those I-literals created by Inweb are stored at run-time. Note that every instance of, say, I"fish" would return the same string, that is, the same text_stream * value. To prevent nasty accidents, this is marked so that the stream value, "fish", cannot be modified at run-time.

The dictionary look-up here would not be thread-safe, so it's protected by a mutex. There's no real performance concern because the following routine is run just once per I-literal in the source code, when the program starts up.

    dictionary *string_literals_dictionary = NULL;

    text_stream *Str::literal(wchar_t *wide_C_string) {
        text_stream *answer = NULL;
        CREATE_MUTEX(mutex);
        LOCK_MUTEX(mutex);
        <Look in dictionary of string literals 26.1>;
        UNLOCK_MUTEX(mutex);
        return answer;
    }

The function Str::literal appears nowhere else.

§26.1. <Look in dictionary of string literals 26.1> =

        if (string_literals_dictionary == NULL)
            string_literals_dictionary = Dictionaries::new(100, TRUE);
        answer = Dictionaries::get_text_literal(string_literals_dictionary, wide_C_string);
        if (answer == NULL) {
            Dictionaries::create_literal(string_literals_dictionary, wide_C_string);
            answer = Dictionaries::get_text_literal(string_literals_dictionary, wide_C_string);
            WRITE_TO(answer, "%w", wide_C_string);
            Streams::mark_as_read_only(answer);
        }

This code is used in §26.
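The effect of this look-up is what is usually called interning: equal literals share a single object. The sketch below models only that idea. The names model_literal, model_intern and MODEL_MAX_LITERALS are hypothetical; a small linear table stands in for the dictionary, the read-only marking is reduced to a flag, and the mutex is omitted altogether, so unlike the routine above this model is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define MODEL_MAX_LITERALS 64

/* Hypothetical model of a stored literal: NOT Foundation's text_stream. */
typedef struct model_literal {
    wchar_t *text;
    int read_only;   /* stands in for Streams::mark_as_read_only */
} model_literal;

static model_literal *model_table[MODEL_MAX_LITERALS];
static int model_count = 0;

/* Return the one shared object for this text, creating it on first use.
   Returns NULL if the table is full or memory runs out. */
model_literal *model_intern(const wchar_t *wide_C_string) {
    for (int i = 0; i < model_count; i++)
        if (wcscmp(model_table[i]->text, wide_C_string) == 0)
            return model_table[i];            /* seen before: same pointer */
    if (model_count == MODEL_MAX_LITERALS) return NULL;
    model_literal *L = (model_literal *) malloc(sizeof(model_literal));
    if (L == NULL) return NULL;
    size_t n = wcslen(wide_C_string) + 1;     /* include the terminator */
    L->text = (wchar_t *) malloc(n * sizeof(wchar_t));
    if (L->text == NULL) { free(L); return NULL; }
    wmemcpy(L->text, wide_C_string, n);
    L->read_only = 1;
    model_table[model_count++] = L;
    return L;
}
```

Two requests for "fish" return the same pointer, which is exactly why the stored text must not be modifiable: a change made through one would be visible through the other.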