diff --git a/doc/lispref/elisp.texi b/doc/lispref/elisp.texi index c4bd97bf815..6057691239f 100644 --- a/doc/lispref/elisp.texi +++ b/doc/lispref/elisp.texi @@ -1316,6 +1316,7 @@ Regular Expressions * Rx Notation:: An alternative, structured regexp notation. @end ifnottex * Regexp Functions:: Functions for operating on regular expressions. +* Regexp Problems:: Some problems and how they may be avoided. Syntax of Regular Expressions diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index d27cfb8c0c7..ce79765b733 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi @@ -263,6 +263,7 @@ case-sensitive. * Rx Notation:: An alternative, structured regexp notation. @end ifnottex * Regexp Functions:: Functions for operating on regular expressions. +* Regexp Problems:: Some problems and how they may be avoided. @end menu @node Syntax of Regexps @@ -343,15 +344,6 @@ first tries to match all three @samp{a}s; but the rest of the pattern is The next alternative is for @samp{a*} to match only two @samp{a}s. With this choice, the rest of the regexp matches successfully. -@strong{Warning:} Nested repetition operators can run for a very -long time, if they lead to ambiguous matching. For -example, trying to match the regular expression @samp{\(x+y*\)*a} -against the string @samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz} could -take hours before it ultimately fails. Emacs may try each way of -grouping the @samp{x}s before concluding that none of them can work. -In general, avoid expressions that can match the same string in -multiple ways. - @item @samp{+} @cindex @samp{+} in regexp is a postfix operator, similar to @samp{*} except that it must match @@ -1884,6 +1876,73 @@ variables that may be set to a pattern that actually matches something. @end defvar +@node Regexp Problems +@subsection Problems with Regular Expressions +@cindex regular expression problems +@cindex regexp stack overflow +@cindex stack overflow in regexp + +The Emacs regexp implementation, like many of its kind, is generally +robust but occasionally causes trouble in either of two ways: matching +may run out of internal stack space and signal an error, and it can +take a long time to complete. The advice below will make these +symptoms less likely and help alleviate problems that do arise. + +@itemize +@item +Anchor regexps at the beginning of a line, string or buffer using +zero-width assertions (@samp{^} and @code{\`}). This takes advantage +of fast paths in the implementation and can avoid futile matching +attempts. Other zero-width assertions may also bring benefits by +causing a match to fail early. + +@item +Avoid or-patterns in favour of character alternatives: write +@samp{[ab]} instead of @samp{a\|b}. Recall that @samp{\s-} and @samp{\sw} +are equivalent to @samp{[[:space:]]} and @samp{[[:word:]]}, respectively. + +@item +Since the last branch of an or-pattern does not add a backtrack point +on the stack, consider putting the most likely matched pattern last. +For example, @samp{^\(?:a\|.b\)*c} will run out of stack if trying to +match a very long string of @samp{a}s, but the equivalent +@samp{^\(?:.b\|a\)*c} will not. + +(It is a trade-off: successfully matched or-patterns run faster with +the most frequently matched pattern first.) + +@item +Try to ensure that any part of the text can only match in a single +way. For example, @samp{a*a*} will match the same set of strings as +@samp{a*}, but the former can do so in many ways and will therefore +cause slow backtracking if the match fails later on. Make or-pattern +branches mutually exclusive if possible, so that matching will not go +far into more than one branch before failing. + +Be especially careful with nested repetitions: they can easily result +in very slow matching in the presence of ambiguities. For example, +@samp{\(?:a*b*\)+c} will take a long time attempting to match even a +moderately long string of @samp{a}s before failing. The equivalent +@samp{\(?:a\|b\)*c} is much faster, and @samp{[ab]*c} better still. + +@item +Don't use capturing groups unless they are really needed; that is, use +@samp{\(?:@dots{}\)} instead of @samp{\(@dots{}\)} for bracketing +purposes. + +@ifnottex +@item +Consider using @code{rx} (@pxref{Rx Notation}); it can optimise some +or-patterns automatically and will never introduce capturing groups +unless explicitly requested. +@end ifnottex +@end itemize + +If you run into regexp stack overflow despite following the above +advice, don't be afraid of performing the matching in multiple +function calls, each using a simpler regexp where backtracking can +more easily be contained. + @node Regexp Search @section Regular Expression Searching @cindex regular expression searching diff --git a/etc/PROBLEMS b/etc/PROBLEMS index daff102a0d3..ede83a6e7c0 100644 --- a/etc/PROBLEMS +++ b/etc/PROBLEMS @@ -742,18 +742,6 @@ completed" message that tls.el relies upon, causing affected Emacs functions to hang. To work around the problem, use older or newer versions of gnutls-cli, or use Emacs's built-in gnutls support. -*** Stack overflow in regexp matcher. -Due to fundamental limitations in the way Emacs' regular expression -engine is designed, you might run into combinatorial explosions in -backtracking with certain regexps. - -Avoid "\(...\(...\)*...\)*" and "\(...\)*\(...\)*". Look for a way to -anchor your regular expression, to avoid matching the null string in -infinite ways. The latter is what creates backtrack points, and -eventual overflow in practice. - -(Also prefer "\(?:...\)" to "\(...\)" unless you need the latter.) - * Runtime problems related to font handling ** Characters are displayed as empty boxes or with wrong font under X.