Parstate
Fixtype of parser states.
This is a product type introduced by fty::defprod.
Fields
- bytes — byte-list
- position — position
- chars-read — char+position-list
- chars-unread — char+position-list
- tokens-read — token+span-list
- tokens-read-len — natp
- tokens-unread — token+span-list
- checkpoints — nat-list
- gcc — booleanp
- size — natp
Additional Requirements
The following invariant is enforced on the fields:
(and (equal size
(+ (len bytes)
(len chars-unread)
(len tokens-unread)))
(equal tokens-read-len (len tokens-read)))
Our parsing functions take and return parser states.
The primary component of a parser state
is the input sequence of bytes remaining,
which initially comes from a file (see files).
Bytes are read and removed from this component,
and turned into characters.
Given the need for look-ahead during lexing,
we use the common technique of ``unreading'' characters sometimes,
i.e. putting them back into the parser state.
But since we have already turned bytes into characters,
we do not want to put back the bytes:
thus, the chars-unread component of the parser state
contains a list of characters that have been put back, in character form;
the form is natural numbers, i.e. Unicode code points.
The list is initially empty.
When non-empty, it must be thought of
as preceding the bytes byte list.
If the list is not empty,
the next character will be read directly from that list,
not from the byte list.
To avoid putting back the wrong character by mistake,
i.e. a character that was not actually read,
we use the chars-read component of the parser state
to keep track of which characters have been read and could be unread.
Thus, every time we read a character,
we add it to the chars-read list,
which can be visualized, reversed, to the left of
the chars-unread and bytes lists:
+----------+ read +--------+-------+
| chars | <----- | chars | bytes |
| read | | unread | |
| reversed | -----> | | |
+----------+ unread +--------+-------+
The reading of a character moves the character from right to left,
from the chars-unread list to the reversed chars-read list
if chars-unread is not empty,
or from the bytes list to the reversed chars-read list
where one or more bytes are UTF-8-decoded into the character.
When a character is unread, it is moved from left to right,
i.e. from the reversed chars-read list to the chars-unread list.
If chars-read is empty, it is an internal error:
if the parser code is correct, this must never happen.
The reading and unreading of characters happens at the lexing level.
A similar look-ahead happens at the proper parsing level,
where the elements of the read and unread lists
are not characters but tokens.
The parser state has lists tokens-read and tokens-unread
whose handling is similar to the ones for characters.
The lists can be visualized as follows
(there are two different situations, explained after the diagram):
+----------+----------+ read +--------+-------+
| tokens | chars | <----- | chars | bytes |
| read | read | | unread | |
| reversed | reversed | -----> | | |
+----------+----------+ unread +--------+-------+
+----------+ read +--------+--------+-------+
| tokens | <----- | tokens | chars | bytes |
| read | | unread | unread | |
| reversed | -----> | | | |
+----------+ unread +--------+--------+-------+
When a token is successfully read,
it is moved left to the reversed tokens-read list,
and chars-read is cleared.
The reason is that, once a token is read,
all the characters read form the token,
and there is no need to ever unread them:
they have already formed the token.
But after reading that token, in order to read the next token,
the chars-read list is populated with
the characters that are read in order to read the next token.
So in general the reversed tokens-read list
is at the left of the reversed chars-read list,
and the latter is cleared again when another token
is added to tokens-read.
This is the first situation depicted in the diagram above.
Characters are read only as a subroutine of reading tokens.
If a token is unread, it is unread just after having been read,
with the chars-read list still cleared.
The unread token is moved right, added to the tokens-unread list.
As more tokens are unread, they are also moved right.
As they are read again, they are moved left.
This is the second situation depicted in the diagram above.
Note that we may have some unread characters,
which were read and then unread in the course of reading tokens,
so the chars-unread list is
between the token-unread and bytes lists.
While chars-read is cleared every time a token is read,
tokens-read is never cleared.
This lets us do some backtracking when needed,
but without saving and copying the whole parser state:
we just need to keep track of how many tokens must be put back
for backtracking.
For reporting errors, it is useful to keep track of
where in a file the constructs being parsed occur.
To this end, the position component of the parser state
is the position, in the file, of the next character to be read
from the bytes component;
if this component is empty,
the position is just one past the last character in the file.
For read and unread characters,
since we have already found their position and have moved past it,
we store their positions in the chars-read and chars-unread.
This is why they contain not just characters,
but char+position pairs.
Similarly, for tokens, we also store their spans.
To support backtracking,
we also keep track of zero or more checkpoints,
which indicate positions in the tokens-read list.
When a checkpoint is recorded,
the current length of tokens-read is stored as a checkpoint,
by consing it to the checkpoints list.
Later, the checkpoint can be simply cleared,
in which case it is simply removed from the check,
by replacing checkpoints with its cdr.
Alternatively, we can backtrack to the checkpoint,
which involves moving tokens from tokens-read to tokens-unread
until tokens-read has the length of the checkpoint in question;
then the checkpoint is removed from checkpoints as well.
That is, we have the ability to backtrack to earlier tokens,
without having to keep track of how many tokens we have read
since the potential point of backtrack.
The reason why checkpoints is a list of natural numbers
and not just an optional natural number
is that we may need to support ``nested'' backtracking
while parsing something that may also backtrack.
We include a boolean flag saying whether
certain GCC extensions should be accepted or not.
These GCC extensions are limited to the ones
currently captured in our abstract syntax.
This parser state component is set at the beginning and never changes,
but it is useful to have it as part of the parser state
to avoid passing an additional parameter.
This parser state component could potentially evolve into
a richer set of options for different versions and dialects of C.
We could look into turning the parser state into a stobj in the future,
if efficiency is an issue.
The code of the parser already treats the parser state
in a single-threaded way.
For speed, we cache the value returned by
the function parsize defined later.
Some profiling revealed that significant time was spent there,
due to the need to save the parser state size
before certain mbt tests.
Ideally we should obtain this optimization using apt::isodata,
but that transformation currently does not quite handle
all of the parser's functions.
Also for speed, we cache the number of the tokens read so far.
The checkpointing and backtracking mechanism described above
calculates that length in order to record it as a checkpoint.
When there is a significant number of read token, that can take time,
as revealed by some profiling.
Subtopics
- Parstatep
- Recognizer for parstate structures.
- Parstate-fix
- Fixing function for parstate structures.
- Make-parstate
- Basic constructor macro for parstate structures.
- Parstate-equiv
- Basic equivalence relation for parstate structures.
- Parstate->tokens-read-len
- Get the tokens-read-len field from a parstate.
- Parstate->tokens-unread
- Get the tokens-unread field from a parstate.
- Parstate->size
- Get the size field from a parstate.
- Parstate->chars-unread
- Get the chars-unread field from a parstate.
- Change-parstate
- Modifying constructor for parstate structures.
- Parstate->tokens-read
- Get the tokens-read field from a parstate.
- Parstate->checkpoints
- Get the checkpoints field from a parstate.
- Parstate->chars-read
- Get the chars-read field from a parstate.
- Parstate->position
- Get the position field from a parstate.
- Parstate->gcc
- Get the gcc field from a parstate.
- Parstate->bytes
- Get the bytes field from a parstate.