Search-engine friendly clone of the ACL2 documentation.

Grammar-parser

Grammar-parser-implementation

Implementation of the parser of ABNF grammars.

This is a recursive-descent, backtracking parser. There is a parsing function for every rule, and parsing functions for certain groups, options, and repetitions. There are also parameterized parsing functions for terminals (natural numbers) matching exact values, ranges, and (case-insensitively) characters.

Parsing Primitive

Some of the parsing functions that make up the parser are Seq parsing primitives. The remarks below apply to those functions as well.

Inputs and Outputs

Each of these parsing functions takes as input a list of natural numbers (i.e. terminals), and returns as outputs (i) an indication of success or failure, (ii) the tree or list of trees obtained from parsing a prefix of the input (or nil if parsing fails), and (iii) the remaining suffix of the input that must still be parsed. The indication of success or failure is either nil to indicate success, or a message to describe the failure. This is consistent with the Seq macros, with which these parsing functions are implemented.

The parse-grammar top-level function takes as input a list of natural numbers and returns as output just a tree, or nil to indicate failure; this function consumes all the input, failing if there is unparsed input. The parse-grammar-from-file function takes as input a file name and calls parse-grammar on its content, returning a tree or nil.

Each parsing function definition is accompanied by a theorem stating that the function fixes its input list of natural numbers. The proof of each such theorem uses, as rewrite rules, the theorems for the parsing functions called by the parsing function whose theorem is being proved.

Disambiguation and Look-Ahead

As explained in the documentation of parse-grammar*, the grammar of the ABNF concrete syntax [RFC:4] is ambiguous. The rule rulelist allows strings `c-nl WSP' either to be split into an ending c-nl under a rule and a starting WSP under an immediately following (*c-wsp c-nl), or to be made into a c-wsp in the ending part of elements under the rule. The same kind of choice applies when, instead of a rule immediately followed by a (*c-wsp c-nl), there are two subsequent (*c-wsp c-nl)s: the string `c-nl WSP' can be either split between the two (*c-wsp c-nl)s or put all under the first one. Indeed, expanding elements in the definiens of rule gives ... *c-wsp c-nl, which ends in the same way as the group (*c-wsp c-nl): this is why the ambiguity applies equally to a rule immediately followed by a (*c-wsp c-nl) and to a (*c-wsp c-nl) immediately followed by a (*c-wsp c-nl).

Aside from the rulelist rule, the rest of the grammar is LL(*):

In the repeat rule, a look-ahead of an unbounded number of DIGITs is needed to determine the alternative (the second alternative if "*" is found after the DIGITs, otherwise the first alternative).
In the concatenation rule, a look-ahead of an unbounded number of c-wsps is needed to determine where a concatenation ends (it does if no repetition is found after the c-wsps, otherwise the concatenation continues with the found repetition).

Aside from the rulelist, repeat, and concatenation rules, the rest of the grammar is LL(2):

In the defined-as rule, a look-ahead of two symbols is needed to distinguish "=" and "=/".
In the element rule, a look-ahead of two symbols is needed to distinguish num-val and char-val (the two symbols are "%" and the one after).
In the char-val rule, a look-ahead of two symbols is needed to distinguish case-insensitive-string and case-sensitive-string (the two symbols are "%" and the one after).

In each of the three rules listed above, the two choices have the first character in common. Thus, it may seem that these rules are actually LL(1), by first parsing the first character in common and then deciding how to proceed based on the next character. However, each character pair like "=/" and "%s" is parsed in one shot via one call to parse-ichar2 which produces a single leaf tree with the list of those two character codes, not via two calls to parse-ichar which would produce two leaf trees each with a singleton list of one character code. If the rules were formulated as concatenations of single-character strings (e.g. "=" "/" and "%" "s") instead, these rules would be LL(1).

Aside from the rulelist, repeat, concatenation, defined-as, element, and char-val rules, the rest of the grammar is LL(1).

The parser resolves the rulelist ambiguity by keeping strings `c-nl WSP' as c-wsps under rule or under the first (*c-wsp c-nl) of two subsequent (*c-wsp c-nl)s, instead of splitting them into a c-nl to end the rule or to end the first (*c-wsp c-nl) of two subsequent (*c-wsp c-nl)s, and a WSP to start the subsequent (*c-wsp c-nl). The decision point is when a c-nl is encountered while parsing the ending *c-wsp of elements or while parsing the *c-wsp of a (*c-wsp c-nl): should the *c-wsp be considered finished and the c-nl used to end the rule or (*c-wsp c-nl), or should the parser attempt to extend the *c-wsp with an extra c-wsp, if the c-nl is followed by a WSP? By having parse-*cwsp always try the extra c-wsp, we never split strings `c-nl WSP'. Thus, parse-*cwsp tries to parse as many c-wsps as possible, like all the other parse-*... parsing functions. If the c-nl is not followed by a WSP, the parsing of the extra c-wsp fails and the only possibility left is to finish the *c-wsp and use the c-nl to end the rule or the (*c-wsp c-nl); there is no ambiguity in this case.

The look-ahead for the LL(*), LL(2), and LL(1) rules is handled via backtracking. The amount of backtracking is expected to be small in reasonable grammars.

Termination

The termination of the singly recursive parsing functions (e.g. parse-*bit) is proved by showing that the size of the input decreases.

The termination of the mutually recursive parsing functions (i.e. parse-alternation, parse-concatenation, etc.) is proved via a lexicographic measure consisting of the size of the input and an ordering of the parsing functions. This is explained in the following paragraphs.

Since parse-alternation calls parse-concatenation on the same input, the size of the input alone is not sufficient to show that the mutually recursive parsing functions terminate. But parse-concatenation never (indirectly) calls parse-alternation on the same input: it has to go through parse-group or parse-option, which consume a "(" or a "[" before calling parse-alternation on, therefore, a smaller input. So if we order the parsing functions, by assigning numbers to them, so that parse-alternation has a larger order number than parse-concatenation, either the size of the input goes down, or it stays the same but the parsing function order number goes down. In other words, the lexicographic measure goes down.

To establish the relative ordering of the parsing functions, we look at which ones (may) call which other ones on the same input: the former must be (assigned) larger (order numbers) than the latter. Thus, we have the following ordering constraints:

parse-alternation must be larger than parse-concatenation.
parse-concatenation must be larger than parse-repetition.
parse-repetition must be larger than parse-element. (The former calls the latter on the same input if parse-?repeat does not consume any input.)
parse-element must be larger than parse-group.
parse-element must be larger than parse-option.
parse-alt-rest must be larger than parse-alt-rest-comp.
parse-conc-rest must be larger than parse-conc-rest-comp.

These constraints provide a partial order on the parsing function, which we can totalize as follows (from smallest to largest):

Note that when a smaller function calls a larger or equal function, it does so on a smaller input. In particular: parse-group and parse-option call parse-alternation only after consuming a "(" or a "["; parse-alt-rest-comp calls parse-concatenation only after consuming at least a "/"; and parse-conc-rest-comp calls parse-repetition only after consuming at least one c-wsp, which is never empty.

The theorems about input lengths that accompany the parsing function definitions are used in the termination proofs, both of the singly recursive functions and of the mutually recursive functions.

The termination of the mutually recursive parsing functions involves an additional complication. After parse-alternation calls parse-concatenation, it may call parse-alt-rest, which involves a proof obligation that the input to parse-alt-rest, which is the remaining input after parse-concatenation, has length less than or equal to the input to parse-alternation. This is the case, because parse-concatenation always returns a remaining input of length less than or equal to its input. But this property can be only proved after parse-concatenation is admitted, that is after its (mutual) termination has been proved: for the termination proof, it is like an uninterpreted function. The same holds for the interaction between other mutually recursive parsing functions.

The known solution to the issue just discussed is to add tests saying that the remaining input of parse-concatenation is smaller than the initial input to parse-alternation (which is also the input to parse-concatenation). This way the termination proof can proceeed. However, since those tests are always true, they can be wrapped in mbt, which engenders the obligation to prove that are always true, as part of guard verification (and additionally, their execution is avoided). The Seq macros provide operators :s= and :w= for this purpose (see their documentation): they generate mbt tests of inequalities on the lengths of inputs and remaining inputs. So we use those in our mutually recursive parsing functions.

Subtopics

Parse-grammar-from-file: Parse a file into an ABNF grammar.
Parse-ichar2: Parse two natural numbers that encode, case-insensitively, two given characters, into a tree that matches a case-insensitive character value notation that consists of those two characters.
Parse-ichar: Parse a natural number that encodes, case-insensitively, a given character, into a tree that matches a case-insensitive character value notation that consists of that character.
Parse-in-either-range: Parse a natural number in one of two given ranges into a tree that matches a group consisting of an alternation of the corresponding range numeric value notations.
Parse-*-in-either-range: Parse zero or more natural numbers, each of which in one of two given ranges, into a list of trees that matches the repetition of zero or more alternations of the corresponding range numeric value notations.
Parse-bit: Parse a bit.
Parse-equal-/-equal-slash: Parse the group ("=" / "=/").
Parse-wsp: Parse a space or horizontal tab.
Parse-*digit-star-*digit: Parse a group (*DIGIT "*" *DIGIT).
Parse-hexdig: Parse a hexadecimal digit.
Parse-cwsp: Parse either white space, or a comment and new line followed by white space.
Parse-concatenation: Parse a concatenation.
Parse-case-insensitive-string: Parse a case-insensitive character value notation.
Parse-alpha: Parse a letter.
Parse-dot-1*bit: Parse a group ("." 1*BIT).
Parse-case-sensitive-string: Parse a case-sensitive character value notation.
Parse-bin/dec/hex-val: Parse a group (bin-val / dec-val / hex-val).
Parse-alpha/digit/dash: Parse a group (ALPHA / DIGIT / "-").
Parse-rule: Parse a rule.
Parse-repeat: Parse a repetition range.
Parse-cnl: Parse a comment or new line.
Parse-sp: Parse a space.
Parse-rule-/-*cwsp-cnl: Parse a group ( rule / (*c-wsp c-nl) ).
Parse-element: Parse an element.
Parse-dec-val: Parse a decimal numeric value notation.
Parse-bin-val: Parse a binary numeric value notation.
Parse-alternation: Parse an alternation.
Parse-*cwsp-cnl: Parse a group (*c-wsp c-nl).
Parse-*bit: Parse a repetition of zero or more bits.
Parse-wsp/vchar: Parse a group (WSP / VCHAR).
Parse-quoted-string: Parse a quoted string.
Parse-prose-val: Parse a prose value notation.
Parse-hex-val: Parse a hexadecimal numeric value notation.
Parse-grammar: Parse a sequence of natural numbers into an ABNF grammar.
Parse-elements: Parse the definiens of a rule.
Parse-defined-as: Parse the text between a rule name and its definiens.
Parse-dash-1*hexdig: Parse a group ("-" 1*HEXDIG).
Parse-comment: Parse a comment.
Parse-char-val: Parse a character value notation.
Parse-1*bit: Parse a repetition of one or more bits.
Parse-rulename: Parse a rule name.
Parse-num-val: Parse a numeric value notation.
Parse-htab: Parse a horizontal tab.
Parse-hex-val-rest: Parse an option [ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ], which is the rest of the definiens of hex-val.
Parse-dot-1*hexdig: Parse a group ("." 1*HEXDIG).
Parse-dot-1*digit: Parse a group ("." 1*DIGIT).
Parse-digit: Parse a decimal digit.
Parse-dec-val-rest: Parse an option [ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ], which is the rest of the definiens of dec-val.
Parse-dash-1*digit: Parse a group ("-" 1*DIGIT).
Parse-dash-1*bit: Parse a group ("-" 1*BIT).
Parse-crlf: Parse a carriage return followed by a line feed.
Parse-cr: Parse a carriage return.
Parse-conc-rest-comp: Parse a group (1*c-wsp repetition), which is a component of the rest of the definiens of concatenation.
Parse-cnl-wsp: Parse a group (c-nl WSP).
Parse-bin-val-rest: Parse an option [ 1*("." 1*BIT) / ("-" 1*BIT) ], which is the rest of the definiens of bin-val.
Parse-alt-rest-comp: Parse a group (*c-wsp "/" *c-wsp concatenation), which is a component of the rest of the definiens of alternation.
Parse-*cwsp: Parse a repetition of zero or more c-wsps.
Parse-vchar: Parse a visible character.
Parse-rulelist: Parse a rule list.
Parse-option: Parse an option.
Parse-group: Parse a group.
Parse-1*-dot-1*hexdig: Parse a repetition 1*("." 1*HEXDIG).
Parse-1*-dot-1*digit: Parse a repetition 1*("." 1*DIGIT).
Parse-1*-dot-1*bit: Parse a repetition 1*("." 1*BIT).
Parse-*-rule-/-*cwsp-cnl: Parse a repetition *( rule / (*c-wsp c-nl) ).
Parse-*-alpha/digit/dash: Parse a repetition *(ALPHA / DIGIT / "-").
Parse-lf: Parse a line feed.
Parse-dquote: Parse a double quote.
Parse-1*hexdig: Parse a repetition of one or more hexadecimal digits.
Parse-1*cwsp: Parse a repetition of one or more c-wsps.
Parse-?repeat: Parse an optional repetition range.
Parse-*-dot-1*hexdig: Parse a repetition *("." 1*HEXDIG).
Parse-*-dot-1*digit: Parse a repetition *("." 1*DIGIT).
Parse-*-dot-1*bit: Parse a repetition *("." 1*BIT).
Parse-repetition: Parse a repetition.
Parse-1*digit: Parse a repetition of one or more decimal digits.
Parse-?%i: Parse an option [ "%i" ].
Parse-*wsp/vchar: Parse a repetition *(WSP / VCHAR).
Parse-*hexdig: Parse a repetition of zero or more hexadecimal digits.
Parse-*digit: Parse a repetition of zero or more decimal digits.
Parse-alt-rest: Parse a repetition *(*c-wsp "/" *c-wsp concatenation), which is the rest of the definiens of alternation.
Parse-conc-rest: Parse a repetition *(1*c-wsp repetition), which is the rest of the definiens of concatenation.
*grammar-parser-error-msg*: Message for grammar parsing errors.