Search-engine friendly clone of the ACL2 documentation.

Projects

Acre

ACL2 Centaur Regular Expressions

ACRE is a regular expression package with features somewhat similar to Perl's regular expressions.

Important note for writing regular expressions: the Common Lisp string reader treats backslashes as escape characters, so all backslashes in the parse tree below need to be escaped. A good way to work around this is to use the ACL2::fancy-string-reader instead of regular double-quoted strings. For example, to match a whitespace character (\s in regex syntax) followed by a backslash (\\ in regex syntax) followed by a double quote, you'd need to enter the following double-quoted string:

"\\s\\\\\""

or the following \"fancy string\":

#{"""\s\\""""}

If you print either of these out as follows, you'll see what the string's actual contents are:

(cw "'~s0'~%" "\\s\\\\\"")       (prints '\s\\"') 
(cw "'~s0'~%" #{"""\s\\""""})    (prints '\s\\"')

So please either use fancy-strings or else keep in mind that, in regular double-quoted strings, any backslash in the following grammar must really be written as two backslashes. It is best to debug your regular expressions by printing them out as above to avoid this sort of confusion.

Here is the syntax for parsing regular expressions, and following it are descriptions of their semantics.

regex = concat
        concat "|" regex              (Disjunction -- match either one)

concat = ""                           (Empty string -- always matches)
         repeat concat

repeat = atom
         atom repeatop

atom = group
       primitive

group = "(" regex ")"                 (positional capture)
        "(?:" regex ")"               (noncapturing)
        "(?i:" regex ")"              (noncapturing, case insensitive)
        "(?^i:" regex ")"             (noncapturing, case sensitive)
        "(?<" name ">" regex ")"      (named capture)
        "(?=" regex ")"               (zero-length lookahead)
        "(?!" regex ")"               (zero-length lookahead, negated)
        "(?<=" regex ")"              (zero-length lookbehind, constant width)
        "(?<!" regex ")"              (zero-length lookbehind, constant width, negated)

primitive = non_meta_character        (matches the given character)
            "."                       (matches any character)
            "[" characterset "]"      (matches any character in the class)
            "[^" characterset "]"     (matches any character not in the class)
            "^"                       (matches beginning of string)
            "$"                       (matches end of string)
            backref
            "\" metacharacter         (escape)
            "\" charsetchar           (character set abbreviations)
            "\n"                      (newline)
            "\t"                      (tab)
            "\r"                      (carriage return)
            "\f"                      (form feed)
            "\o" NNN                  (octal character code)
            "\x" NN                   (hexadecimal character code)

backref = "\" digit                   (matches nth capture group)
          "\g" digit
          "\g{" number "}"            (matches nth capture group -- may be greater than 9)
          "\g{" name "}"              (matches named capture group)
          "\k{" name "}"
          "\k<" name ">"
          "\k'" name "'"

repeatop = repeatbase 
           repeatbase repeatmod

repeatbase = "*"                      (0 or more times)
             "+"                      (1 or more times)
             "?"                      (0 or 1 times)
             "{" n "}"                (exactly n times)
             "{" n ",}"               (n or more times)
             "{" n "," m "}"          (minimum n and maximum m times)

repeatmod = "?"                       (nongreedy)
            "+"                       (nonbacktracking)

characterset = ""
               cset_elem characterset

cset_elem = cset_set
            cset_atom "-" cset_atom   (range -- second must have higher char code)
            cset_atom

cset_set = "\w"                       (word character -- alphanumeric or _)
           "\d"                       (decimal digit)
           "\s"                       (whitespace character)
           "\h"                       (horizontal whitespace character)
           "\v"                       (vertical whitespace character)

cset_atom =  non_cset_metacharacter
             "\\"                     (backslash)
             "\]"                     (close bracket)
             "\-"                     (dash)
             "\n"                     (newline)
             "\t"                     (tab)
             "\r"                     (carriage return)
             "\f"                     (form feed)
             "\o" NNN                 (octal character code)
             "\x" NN                  (hexadecimal character code)

API

The following types and functions make up the high-level regular expression API.

Types

regex-p -- internal representation of regular expression patterns
matchresult-p -- result of matching against a string

Functions

(parse pattern :legible t) --> (mv err regex) -- Pattern is a string; returns either an error or a regex. If legible is nonnil (the default), then the pattern string is preprocessed before parsing in a way that allows patterns to be written more legibly: the preprocessing step removes non-escaped whitespace and removes comments consisting of the rest of the line after (and including) any non-escaped # character.
(match-regex regex str :case-insens nil) --> matchresult -- Matches the given regex against str, using case insensitive matching if case-insens is set to nonnil.
(parse-and-match-regex pattern str :case-insens nil :legible t) --> (mv err matchresult) -- Combines parse and match-regex.
(match pattern str :case-insens nil :legible t) --> matchresult -- Macro which parses the given pattern into a regex at macroexpansion time, expanding to a call of match-regex on the resulting regular expression. Any parse error becomes an error at macroexpansion time.
(matchresult->matchedp matchresult) --> boolean -- returns T if the regular expression matched the string and NIL otherwise.
(matchresult->matched-substr matchresult) --> substr-if-matched -- returns the matched substring of the input string if there was a match, else NIL.
(matchresult->captured-substr name matchresult) --> substr-if-matched -- returns the substring that was captured by the given capture group, or NIL if the regex didn't match or the capture group was in part of the regex that didn't match. (As an example of the latter, consider matching the pattern "(aa)|bb" against the string "bb".) Name may be a positive number, which accesses capture groups by position, or a string, which accesses named capture groups (names are case sensitive).
(matchresult->captured-substr! name matchresult) --> substr -- Same as the above, but simply returns the empty string if the above would return NIL.

B* binders for capture groups

((captures binding1 binding2 ...) matchresult) binds variables to captured substrings from the given matchresult. A binding may be simply a variable, in which case a positional capture group is accessed, or (var name), in which case var will be bound to the result of looking up the named capture group name. If a capture group doesn't exist, the variable is bound to NIL. The captures! binding does the same thing but instead binds the variable to the empty string if the capture group doesn't exist.
((named-captures binding1 binding2 ...) matchresult) is similar to captures, differing only when the binding is simply a variable. Instead of getting a capture group by position, this looks up the downcase of the name of the variable. Named-captures! behaves analogously to captures!.

Subtopics

Acre-internals: Umbrella topic for implementation-level documentation of acre.
Parse-and-match-regex: Parse a regular expression from a pattern string and match it against a string x.
Match-regex: Perform regular expression matching on a string.
Parse: Parse a pattern string into a regular expression object.
Matchresult->matchedp: Boolean flag indicating whether the regular expression matched the string
Match: Match a string against a regular expression, which is parsed at macroexpansion time.