Grammar-parser-implementation
Implementation of the parser of ABNF grammars.
This is a recursive-descent, backtracking parser.
There is a parsing function for every rule,
and parsing functions for certain groups, options, and repetitions.
There are also parameterized parsing functions for
terminals (natural numbers) matching
exact values, ranges, and (case-insensitively) characters.
Parsing Primitive
Some of the parsing functions that make up the parser are Seq parsing primitives. The remarks below apply to those functions as well.
Inputs and Outputs
Each of these parsing functions
takes as input a list of natural numbers (i.e. terminals),
and returns as outputs
(i) an indication of success or failure,
(ii) the tree or list of trees
obtained from parsing a prefix of the input
(or nil if parsing fails),
and (iii) the remaining suffix of the input that must still be parsed.
The indication of success or failure is
either nil to indicate success,
or a message to describe the failure.
This is consistent with the Seq macros,
with which these parsing functions are implemented.
The parse-grammar top-level function
takes as input a list of natural numbers
and returns as output just a tree, or nil to indicate failure;
this function consumes all the input, failing if there is unparsed input.
The parse-grammar-from-file function
takes as input a file name and calls parse-grammar on its content,
returning a tree or nil.
Each parsing function definition is accompanied by a theorem stating that
the function fixes its input list of natural numbers.
The proof of each such theorem uses, as rewrite rules,
the theorems for the parsing functions called by
the parsing function whose theorem is being proved.
Disambiguation and Look-Ahead
As explained in the documentation of parse-grammar*,
the grammar of the ABNF concrete syntax [RFC:4] is ambiguous.
The rule rulelist allows strings `c-nl WSP'
either to be split into an ending c-nl under a rule
and a starting WSP under an immediately following (*c-wsp c-nl),
or to be made into a c-wsp
in the ending part of elements under the rule.
The same kind of choice applies when,
instead of a rule immediately followed by a (*c-wsp c-nl),
there are two subsequent (*c-wsp c-nl)s:
the string `c-nl WSP' can be
either split between the two (*c-wsp c-nl)s
or put all under the first one.
Indeed, expanding elements in the definiens of rule
gives ... *c-wsp c-nl,
which ends in the same way as the group (*c-wsp c-nl):
this is why the ambiguity applies equally to
a rule immediately followed by a (*c-wsp c-nl)
and to a (*c-wsp c-nl)
immediately followed by a (*c-wsp c-nl).
Aside from the rulelist rule,
the rest of the grammar is LL(*):
- In the repeat rule,
a look-ahead of an unbounded number of DIGITs
is needed to determine the alternative
(the second alternative if "*" is found after the DIGITs,
otherwise the first alternative).
- In the concatenation rule,
a look-ahead of an unbounded number of c-wsps
is needed to determine where a concatenation ends
(it does if no repetition is found after the c-wsps,
otherwise the concatenation
continues with the found repetition).
Aside from the rulelist, repeat, and concatenation rules,
the rest of the grammar is LL(2):
- In the defined-as rule,
a look-ahead of two symbols
is needed to distinguish "=" and "=/".
- In the element rule,
a look-ahead of two symbols
is needed to distinguish num-val and char-val
(the two symbols are "%" and the one after).
- In the char-val rule,
a look-ahead of two symbols is needed
to distinguish case-insensitive-string and case-sensitive-string
(the two symbols are "%" and the one after).
In each of the three rules listed above,
the two choices have the first character in common.
Thus, it may seem that these rules are actually LL(1),
by first parsing the first character in common
and then deciding how to proceed based on the next character.
However, each character pair like "=/" and "%s"
is parsed in one shot via one call to parse-ichar2
which produces a single leaf tree
with the list of those two character codes,
not via two calls to parse-ichar
which would produce two leaf trees
each with a singleton list of one character code.
If the rules were formulated as concatenations of single-character strings
(e.g. "=" "/" and "%" "s") instead,
these rules would be LL(1).
Aside from the
rulelist,
repeat,
concatenation,
defined-as,
element, and
char-val rules,
the rest of the grammar is LL(1).
The parser resolves the rulelist ambiguity
by keeping strings `c-nl WSP' as c-wsps
under rule or
under the first (*c-wsp c-nl) of two subsequent (*c-wsp c-nl)s,
instead of splitting them into a c-nl
to end the rule or
to end the first (*c-wsp c-nl) of two subsequent (*c-wsp c-nl)s,
and a WSP to start the subsequent (*c-wsp c-nl).
The decision point is when a c-nl is encountered
while parsing the ending *c-wsp of elements
or while parsing the *c-wsp of a (*c-wsp c-nl):
should the *c-wsp be considered finished
and the c-nl used to end the rule or (*c-wsp c-nl),
or should the parser attempt to extend the *c-wsp
with an extra c-wsp, if the c-nl is followed by a WSP?
By having parse-*cwsp always try the extra c-wsp,
we never split strings `c-nl WSP'.
Thus, parse-*cwsp tries to parse as many c-wsps as possible,
like all the other parse-*... parsing functions.
If the c-nl is not followed by a WSP,
the parsing of the extra c-wsp fails
and the only possibility left is to finish the *c-wsp
and use the c-nl to end the rule or the (*c-wsp c-nl);
there is no ambiguity in this case.
The look-ahead for the LL(*), LL(2), and LL(1) rules
is handled via backtracking.
The amount of backtracking
is expected to be small in reasonable grammars.
Termination
The termination of the singly recursive parsing functions
(e.g. parse-*bit)
is proved by showing that the size of the input decreases.
The termination of the mutually recursive parsing functions
(i.e. parse-alternation, parse-concatenation, etc.)
is proved via a lexicographic measure consisting of
the size of the input and an ordering of the parsing functions.
This is explained in the following paragraphs.
Since parse-alternation calls parse-concatenation
on the same input,
the size of the input alone is not sufficient
to show that the mutually recursive parsing functions terminate.
But parse-concatenation never
(indirectly) calls parse-alternation on the same input:
it has to go through parse-group or parse-option,
which consume a "(" or a "["
before calling parse-alternation on, therefore, a smaller input.
So if we order the parsing functions, by assigning numbers to them,
so that parse-alternation has
a larger order number than parse-concatenation,
either the size of the input goes down,
or it stays the same but the parsing function order number goes down.
In other words, the lexicographic measure goes down.
To establish the relative ordering of the parsing functions,
we look at which ones (may) call which other ones on the same input:
the former must be (assigned) larger (order numbers) than the latter.
Thus, we have the following ordering constraints:
These constraints provide a partial order on the parsing function,
which we can totalize as follows (from smallest to largest):
- parse-conc-rest-comp
- parse-conc-rest
- parse-alt-rest-comp
- parse-alt-rest
- parse-option
- parse-group
- parse-element
- parse-repetition
- parse-concatenation
- parse-alternation
Note that when a smaller function calls a larger or equal function,
it does so on a smaller input.
In particular:
parse-group and parse-option call parse-alternation
only after consuming a "(" or a "[";
parse-alt-rest-comp calls parse-concatenation
only after consuming at least a "/"; and
parse-conc-rest-comp calls parse-repetition
only after consuming at least one c-wsp, which is never empty.
The theorems about input lengths
that accompany the parsing function definitions
are used in the termination proofs,
both of the singly recursive functions
and of the mutually recursive functions.
The termination of the mutually recursive parsing functions
involves an additional complication.
After parse-alternation calls parse-concatenation,
it may call parse-alt-rest,
which involves a proof obligation that
the input to parse-alt-rest,
which is the remaining input after parse-concatenation,
has length less than or equal to the input to parse-alternation.
This is the case, because parse-concatenation always returns
a remaining input of length less than or equal to its input.
But this property can be only proved
after parse-concatenation is admitted,
that is after its (mutual) termination has been proved:
for the termination proof, it is like an uninterpreted function.
The same holds for the interaction between
other mutually recursive parsing functions.
The known solution to the issue just discussed
is to add tests saying that
the remaining input of parse-concatenation
is smaller than the initial input to parse-alternation
(which is also the input to parse-concatenation).
This way the termination proof can proceeed.
However, since those tests are always true,
they can be wrapped in mbt,
which engenders the obligation to prove that are always true,
as part of guard verification
(and additionally, their execution is avoided).
The Seq macros provide operators :s= and :w=
for this purpose (see their documentation):
they generate mbt tests of inequalities
on the lengths of inputs and remaining inputs.
So we use those in our mutually recursive parsing functions.
Subtopics
- Parse-grammar-from-file
- Parse a file into an ABNF grammar.
- Parse-ichar2
- Parse two natural numbers
that encode, case-insensitively, two given characters,
into a tree that matches
a case-insensitive character value notation
that consists of those two characters.
- Parse-ichar
- Parse a natural number
that encodes, case-insensitively, a given character,
into a tree that matches
a case-insensitive character value notation
that consists of that character.
- Parse-in-either-range
- Parse a natural number in one of two given ranges
into a tree that matches
a group consisting of
an alternation of
the corresponding range numeric value notations.
- Parse-*-in-either-range
- Parse zero or more natural numbers,
each of which in one of two given ranges,
into a list of trees that matches
the repetition of zero or more alternations
of the corresponding range numeric value notations.
- Parse-bit
- Parse a bit.
- Parse-equal-/-equal-slash
- Parse the group ("=" / "=/").
- Parse-wsp
- Parse a space or horizontal tab.
- Parse-*digit-star-*digit
- Parse a group (*DIGIT "*" *DIGIT).
- Parse-hexdig
- Parse a hexadecimal digit.
- Parse-cwsp
- Parse either white space,
or a comment and new line followed by white space.
- Parse-concatenation
- Parse a concatenation.
- Parse-case-insensitive-string
- Parse a case-insensitive character value notation.
- Parse-alpha
- Parse a letter.
- Parse-dot-1*bit
- Parse a group ("." 1*BIT).
- Parse-case-sensitive-string
- Parse a case-sensitive character value notation.
- Parse-bin/dec/hex-val
- Parse a group (bin-val / dec-val / hex-val).
- Parse-alpha/digit/dash
- Parse a group (ALPHA / DIGIT / "-").
- Parse-rule
- Parse a rule.
- Parse-repeat
- Parse a repetition range.
- Parse-cnl
- Parse a comment or new line.
- Parse-sp
- Parse a space.
- Parse-rule-/-*cwsp-cnl
- Parse a group ( rule / (*c-wsp c-nl) ).
- Parse-element
- Parse an element.
- Parse-dec-val
- Parse a decimal numeric value notation.
- Parse-bin-val
- Parse a binary numeric value notation.
- Parse-alternation
- Parse an alternation.
- Parse-*cwsp-cnl
- Parse a group (*c-wsp c-nl).
- Parse-*bit
- Parse a repetition of zero or more bits.
- Parse-wsp/vchar
- Parse a group (WSP / VCHAR).
- Parse-quoted-string
- Parse a quoted string.
- Parse-prose-val
- Parse a prose value notation.
- Parse-hex-val
- Parse a hexadecimal numeric value notation.
- Parse-grammar
- Parse a sequence of natural numbers into an ABNF grammar.
- Parse-elements
- Parse the definiens of a rule.
- Parse-defined-as
- Parse the text between a rule name and its definiens.
- Parse-dash-1*hexdig
- Parse a group ("-" 1*HEXDIG).
- Parse-comment
- Parse a comment.
- Parse-char-val
- Parse a character value notation.
- Parse-1*bit
- Parse a repetition of one or more bits.
- Parse-rulename
- Parse a rule name.
- Parse-num-val
- Parse a numeric value notation.
- Parse-htab
- Parse a horizontal tab.
- Parse-hex-val-rest
- Parse an option [ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ],
which is the rest of the definiens of hex-val.
- Parse-dot-1*hexdig
- Parse a group ("." 1*HEXDIG).
- Parse-dot-1*digit
- Parse a group ("." 1*DIGIT).
- Parse-digit
- Parse a decimal digit.
- Parse-dec-val-rest
- Parse an option [ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ],
which is the rest of the definiens of dec-val.
- Parse-dash-1*digit
- Parse a group ("-" 1*DIGIT).
- Parse-dash-1*bit
- Parse a group ("-" 1*BIT).
- Parse-crlf
- Parse a carriage return followed by a line feed.
- Parse-cr
- Parse a carriage return.
- Parse-conc-rest-comp
- Parse a group (1*c-wsp repetition),
which is a component of
the rest of the definiens of concatenation.
- Parse-cnl-wsp
- Parse a group (c-nl WSP).
- Parse-bin-val-rest
- Parse an option [ 1*("." 1*BIT) / ("-" 1*BIT) ],
which is the rest of the definiens of bin-val.
- Parse-alt-rest-comp
- Parse a group (*c-wsp "/" *c-wsp concatenation),
which is a component of
the rest of the definiens of alternation.
- Parse-*cwsp
- Parse a repetition of zero or more c-wsps.
- Parse-vchar
- Parse a visible character.
- Parse-rulelist
- Parse a rule list.
- Parse-option
- Parse an option.
- Parse-group
- Parse a group.
- Parse-1*-dot-1*hexdig
- Parse a repetition 1*("." 1*HEXDIG).
- Parse-1*-dot-1*digit
- Parse a repetition 1*("." 1*DIGIT).
- Parse-1*-dot-1*bit
- Parse a repetition 1*("." 1*BIT).
- Parse-*-rule-/-*cwsp-cnl
- Parse a repetition *( rule / (*c-wsp c-nl) ).
- Parse-*-alpha/digit/dash
- Parse a repetition *(ALPHA / DIGIT / "-").
- Parse-lf
- Parse a line feed.
- Parse-dquote
- Parse a double quote.
- Parse-1*hexdig
- Parse a repetition of one or more hexadecimal digits.
- Parse-1*cwsp
- Parse a repetition of one or more c-wsps.
- Parse-?repeat
- Parse an optional repetition range.
- Parse-*-dot-1*hexdig
- Parse a repetition *("." 1*HEXDIG).
- Parse-*-dot-1*digit
- Parse a repetition *("." 1*DIGIT).
- Parse-*-dot-1*bit
- Parse a repetition *("." 1*BIT).
- Parse-repetition
- Parse a repetition.
- Parse-1*digit
- Parse a repetition of one or more decimal digits.
- Parse-?%i
- Parse an option [ "%i" ].
- Parse-*wsp/vchar
- Parse a repetition *(WSP / VCHAR).
- Parse-*hexdig
- Parse a repetition of zero or more hexadecimal digits.
- Parse-*digit
- Parse a repetition of zero or more decimal digits.
- Parse-alt-rest
- Parse a repetition *(*c-wsp "/" *c-wsp concatenation),
which is the rest of the definiens of alternation.
- Parse-conc-rest
- Parse a repetition *(1*c-wsp repetition),
which is the rest of the definiens of concatenation.
- *grammar-parser-error-msg*
- Message for grammar parsing errors.