A very basic lexer written using some CLEX utilities.
We implement a simple lexer for the following, contrived language:
    // supporting definitions:
    Letter      ::= 'A' | ... | 'Z' | 'a' | ... | 'z'
    Number      ::= '0' | ... | '9'

    // top-level tokens:
    Whitespace  ::= (Space | Newline | Tab)*
    Punctuation ::= ';' | '+' | '-' | '*' | '/' | '++' | '--'
    Keyword     ::= 'void' | 'int' | 'float'
    Identifier  ::= Letter ( Letter | Number | '_' )*    // except keywords
    Comment     ::= '//' [anything but newline]* (Newline | EOF)
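To make the grammar concrete, here is a small illustrative sketch of such a lexer in Python. This is not the CLEX implementation; the names `Token`, `TOKEN_SPEC`, and `lex` are our own, and the rule ordering encodes the usual longest-match conventions (comments before `/`, `++` and `--` before the single-character operators).

```python
import re
from typing import List, NamedTuple

KEYWORDS = {"void", "int", "float"}

# Order matters: comments must be tried before '/', and the two-character
# punctuation '++' / '--' before the single-character forms.
TOKEN_SPEC = [
    ("WHITESPACE",  r"[ \t\n]+"),
    ("COMMENT",     r"//[^\n]*(\n|$)"),
    ("PUNCTUATION", r"\+\+|--|[;+\-*/]"),
    ("IDENTIFIER",  r"[A-Za-z][A-Za-z0-9_]*"),
]

class Token(NamedTuple):
    type: str
    text: str

def lex(src: str) -> List[Token]:
    """Tokenize src, raising ValueError if no rule matches."""
    tokens, pos = [], 0
    while pos < len(src):
        for ttype, pattern in TOKEN_SPEC:
            m = re.match(pattern, src[pos:])
            if m:
                text = m.group(0)
                # An identifier that spells a keyword becomes a Keyword token.
                if ttype == "IDENTIFIER" and text in KEYWORDS:
                    ttype = "KEYWORD"
                tokens.append(Token(ttype, text))
                pos += len(text)
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return tokens
```

Note that every matched character is stored in some token's `text`, which is what makes the flattening property discussed below possible.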
Our lexer produces simple token-p structures that carry a token type together with the text that was matched.
The main lexer loop is lex*. As basic correctness properties, we prove that it returns a valid tokenlist-p and that, on success, we can flatten out the tokens it produces to recreate the original input stream:
Theorem:
    (defthm tokenlist-all-text-of-lex*
      (b* (((mv okp tokens ?new-sin) (lex* sin)))
        (implies okp
                 (equal (tokenlist-all-text tokens)
                        (implode (strin-left sin))))))
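The theorem says that, whenever lexing succeeds, concatenating the text of every token reproduces the remaining input exactly. A Python analogue of this round-trip property might look as follows (the names `tokenlist_all_text` and `lex_star` are our own, and the stand-in lexer simply splits the input into runs of spaces and non-spaces so that every character lands in exactly one token):

```python
import re

def tokenlist_all_text(tokens):
    """Analogue of tokenlist-all-text: concatenate each token's text."""
    return "".join(text for (_type, text) in tokens)

def lex_star(src):
    """Stand-in lexer returning (okp, tokens); every input character is
    consumed by exactly one token, so flattening recreates the input."""
    tokens = [("WS" if s[0] == " " else "WORD", s)
              for s in re.findall(r" +|[^ ]+", src)]
    return True, tokens

# The correctness property, as a runtime check rather than a theorem:
src = "void  main ;"
ok, tokens = lex_star(src)
assert ok and tokenlist_all_text(tokens) == src
```

Of course, in ACL2 this property is proved once and for all, rather than checked on particular inputs.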
This is a useful guarantee, but it is not a total correctness statement. For that, we might also want to know something like: if there exists any valid way to tokenize the input, then our lexer will find it. Or we might want to know: there is always exactly one valid way to tokenize any input. These properties seem harder to state and prove.
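To see why the round-trip theorem alone does not pin down a unique tokenization, consider the string "++": both the maximal-munch reading (one `++` token) and the greedy-free reading (two `+` tokens) flatten back to the original input, so both satisfy the theorem. A tiny illustration:

```python
# Two different tokenizations of the same string "++".  Both flatten back
# to the input, so the round-trip property cannot distinguish them.
greedy     = [("PUNCTUATION", "++")]                          # maximal munch
non_greedy = [("PUNCTUATION", "+"), ("PUNCTUATION", "+")]     # two tokens

def flatten(tokens):
    return "".join(text for (_type, text) in tokens)

assert flatten(greedy) == "++"
assert flatten(non_greedy) == "++"
```

A uniqueness theorem would therefore have to build in the maximal-munch discipline (or some other disambiguation rule) explicitly.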