A very basic lexer written using some CLEX utilities.
We implement a simple lexer for the following, contrived language:
    // supporting definitions:
    Letter      ::= 'A' | ... | 'Z' | 'a' | ... | 'z'
    Number      ::= '0' | ... | '9'

    // top-level tokens:
    Whitespace  ::= (Space | Newline | Tab)*
    Punctuation ::= ';' | '+' | '-' | '*' | '/' | '++' | '--'
    Keyword     ::= 'void' | 'int' | 'float'
    Identifier  ::= Letter ( Letter | Number | '_' )*    // except keywords
    Comment     ::= '//' [anything but newline]* (Newline | EOF)
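To make the grammar concrete, here is a small illustrative sketch of such a lexer in Python. This is not the CLEX implementation; the names `Token`, `TOKEN_SPEC`, and `lex` are our own, and the rule ordering encodes the usual longest-match conventions (comments before `/`, `++` and `--` before the single-character operators).

```python
import re
from typing import List, NamedTuple

KEYWORDS = {"void", "int", "float"}

# Order matters: comments must be tried before '/', and the two-character
# punctuation '++' / '--' before the single-character forms.
TOKEN_SPEC = [
    ("WHITESPACE",  r"[ \t\n]+"),
    ("COMMENT",     r"//[^\n]*(\n|$)"),
    ("PUNCTUATION", r"\+\+|--|[;+\-*/]"),
    ("IDENTIFIER",  r"[A-Za-z][A-Za-z0-9_]*"),
]

class Token(NamedTuple):
    type: str
    text: str

def lex(src: str) -> List[Token]:
    """Tokenize src, raising ValueError if no rule matches."""
    tokens, pos = [], 0
    while pos < len(src):
        for ttype, pattern in TOKEN_SPEC:
            m = re.match(pattern, src[pos:])
            if m:
                text = m.group(0)
                # An identifier that spells a keyword becomes a Keyword token.
                if ttype == "IDENTIFIER" and text in KEYWORDS:
                    ttype = "KEYWORD"
                tokens.append(Token(ttype, text))
                pos += len(text)
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return tokens
```

Note that every matched character is stored in some token's `text`, which is what makes the flattening property discussed below possible.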
Our lexer produces simple token-p structures that carry a token type together with the text that was matched.
The main lexer loop is lex*. As basic correctness properties, we prove that it returns a valid tokenlist-p and that, on success, we can flatten out the tokens it produces to recreate the original input stream:
Theorem:
    (defthm tokenlist-all-text-of-lex*
      (b* (((mv okp tokens ?new-sin) (lex* sin)))
        (implies okp
                 (equal (tokenlist-all-text tokens)
                        (implode (strin-left sin))))))
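The theorem says that, whenever lexing succeeds, concatenating the text of every token reproduces the remaining input exactly. A Python analogue of this round-trip property might look as follows (the names `tokenlist_all_text` and `lex_star` are our own, and the stand-in lexer simply splits the input into runs of spaces and non-spaces so that every character lands in exactly one token):

```python
import re

def tokenlist_all_text(tokens):
    """Analogue of tokenlist-all-text: concatenate each token's text."""
    return "".join(text for (_type, text) in tokens)

def lex_star(src):
    """Stand-in lexer returning (okp, tokens); every input character is
    consumed by exactly one token, so flattening recreates the input."""
    tokens = [("WS" if s[0] == " " else "WORD", s)
              for s in re.findall(r" +|[^ ]+", src)]
    return True, tokens

# The correctness property, as a runtime check rather than a theorem:
src = "void  main ;"
ok, tokens = lex_star(src)
assert ok and tokenlist_all_text(tokens) == src
```

Of course, in ACL2 this property is proved once and for all, rather than checked on particular inputs.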
This is a useful guarantee, but it is not a total correctness statement. For that, we might also want to know something like: if there exists any valid way to tokenize the input, then our lexer will find it. Or we might want to know: there is always exactly one valid way to tokenize any input. These properties seem harder to state and prove.
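To see why the round-trip theorem alone does not pin down a unique tokenization, consider the string "++": both the maximal-munch reading (one `++` token) and the greedy-free reading (two `+` tokens) flatten back to the original input, so both satisfy the theorem. A tiny illustration:

```python
# Two different tokenizations of the same string "++".  Both flatten back
# to the input, so the round-trip property cannot distinguish them.
greedy     = [("PUNCTUATION", "++")]                          # maximal munch
non_greedy = [("PUNCTUATION", "+"), ("PUNCTUATION", "+")]     # two tokens

def flatten(tokens):
    return "".join(text for (_type, text) in tokens)

assert flatten(greedy) == "++"
assert flatten(non_greedy) == "++"
```

A uniqueness theorem would therefore have to build in the maximal-munch discipline (or some other disambiguation rule) explicitly.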