ACL2 Centaur Regular Expressions
ACRE is a regular expression package with features somewhat similar to Perl's regular expressions.
Important note for writing regular expressions: the Common Lisp string reader treats backslashes as escape characters, so all backslashes in the parse tree below need to be escaped. A good way to work around this is to use the ACL2::fancy-string-reader instead of regular double-quoted strings. For example, to match a whitespace character (\s in regex syntax) followed by a backslash (\\ in regex syntax) followed by a double quote, you'd need to enter the following double-quoted string:
"\\s\\\\\""
or the following \"fancy string\":
#{"""\s\\""""}
If you print either of these out as follows, you'll see what the string's actual contents are:
(cw "'~s0'~%" "\\s\\\\\"") (prints '\s\\"') (cw "'~s0'~%" #{"""\s\\""""}) (prints '\s\\"')
So please either use fancy-strings or else keep in mind that, in regular double-quoted strings, any backslash in the following grammar must really be written as two backslashes. It is best to debug your regular expressions by printing them out as above to avoid this sort of confusion.
Here is the syntax for parsing regular expressions, and following it are descriptions of their semantics.
regex = concat concat "|" regex (Disjunction -- match either one) concat = "" (Empty string -- always matches) repeat concat repeat = atom atom repeatop atom = group primitive group = "(" regex ")" (positional capture) "(?:" regex ")" (noncapturing) "(?i:" regex ")" (noncapturing, case insensitive) "(?^i:" regex ")" (noncapturing, case sensitive) "(?<" name ">" regex ")" (named capture) "(?=" regex ")" (zero-length lookahead) "(?!" regex ")" (zero-length lookahead, negated) "(?<=" regex ")" (zero-length lookbehind, constant width) "(?<!" regex ")" (zero-length lookbehind, constant width, negated) primitive = non_meta_character (matches the given character) "." (matches any character) "[" characterset "]" (matches any character in the class) "[^" characterset "]" (matches any character not in the class) "^" (matches beginning of string) "$" (matches end of string) backref "\" metacharacter (escape) "\" charsetchar (character set abbreviations) "\n" (newline) "\t" (tab) "\r" (carriage return) "\f" (form feed) "\o" NNN (octal character code) "\x" NN (hexadecimal character code) backref = "\" digit (matches nth capture group) "\g" digit "\g{" number "}" (matches nth capture group -- may be greater than 9) "\g{" name "}" (matches named capture group) "\k{" name "}" "\k<" name ">" "\k'" name "'" repeatop = repeatbase repeatbase repeatmod repeatbase = "*" (0 or more times) "+" (1 or more times) "?" (0 or 1 times) "{" n "}" (exactly n times) "{" n ",}" (n or more times) "{" n "," m "}" (minimum n and maximum m times) repeatmod = "?" (nongreedy) "+" (nonbacktracking) characterset = "" cset_elem characterset cset_elem = cset_set cset_atom "-" cset_atom (range -- second must have higher char code) cset_atom cset_set = "\w" (word character -- alphanumeric or _) "\d" (decimal digit) "\s" (whitespace character) "\h" (horizontal whitespace character) "\v" (vertical whitespace character) cset_atom = non_cset_metacharacter "\\" (backslash) "\]" (close bracket) "\-" (dash) "\n" (newline) "\t" (tab) "\r" (carriage return) "\f" (form feed) "\o" NNN (octal character code) "\x" NN (hexadecimal character code)
The following types and functions make up the high-level regular expression API.