Don't distort character ranges in rx translation

The Emacs regexp engine interprets character ranges from ASCII to raw
bytes, such as [a-\xfe], as not including non-ASCII Unicode at all;
ranges from non-ASCII Unicode to raw bytes, such as [ü-\x91], are
ignored entirely.

To make rx produce a translation that works as intended, split ranges
that go from ordinary characters to raw bytes.  Such ranges may appear
from set manipulation and regexp optimisation.

* lisp/emacs-lisp/rx.el (rx--generate-alt):
Split intervals that straddle the char-raw boundary when rendering
a string regexp from an interval set.
* test/lisp/emacs-lisp/rx-tests.el (rx-char-any-raw-byte):
Add test cases.
2023-07-17 13:05:21 +02:00
commit 157e735ce8
parent 7446a8c34e
2 changed files with 17 additions and 1 deletions
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -484,6 +484,12 @@ classes."
                             (char-to-string (car item)))
                            ((eq (1+ (car item)) (cdr item))
                             (string (car item) (cdr item)))
+                            ;; Ranges that go between normal chars and raw bytes
+                            ;; must be split to avoid being mutilated
+                            ;; by Emacs's regexp parser.
+                            ((<= (car item) #x3fff7f (cdr item))
+                             (string (car item) ?- #x3fff7f
+                                     #x3fff80 ?- (cdr item)))
                            (t
                             (string (car item) ?- (cdr item)))))
                    items nil)
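
The effect of the new clause can be tried in isolation with a small sketch. The helper name `my-rx-demo-render-interval` is hypothetical and not part of rx.el, and its first `cond` test is an assumption, because that line lies above the hunk shown here. The other three clauses copy the hunk. An interval is taken to be a cons (FROM . TO) of character codes, where #x3fff7f is the last ordinary character and #x3fff80 is the first raw byte.

```elisp
;; Hypothetical stand-alone helper for illustration; not part of rx.el.
(defun my-rx-demo-render-interval (item)
  "Render ITEM, a cons (FROM . TO), as bracket-expression text."
  (cond
   ;; Assumed: the test for a single character is not visible in the hunk.
   ((eq (car item) (cdr item))
    (char-to-string (car item)))
   ;; Two adjacent characters are listed without a hyphen.
   ((eq (1+ (car item)) (cdr item))
    (string (car item) (cdr item)))
   ;; The interval straddles the char/raw-byte boundary: emit two ranges,
   ;; one ending at #x3fff7f and one starting at #x3fff80.
   ((<= (car item) #x3fff7f (cdr item))
    (string (car item) ?- #x3fff7f
            #x3fff80 ?- (cdr item)))
   ;; Ordinary range.
   (t
    (string (car item) ?- (cdr item)))))

;; An interval from ?a to raw byte \xfe (character code #x3ffffe)
;; is rendered as two ranges, seven characters in total.
(my-rx-demo-render-interval '(?a . #x3ffffe))
```

With this split, neither range crosses the boundary, so the regexp parser no longer has to interpret a mixed range such as [a-\xfe].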