Atc-abstract-syntax
Abstract syntax of C for ATC.
ATC generates C code by
generating abstract syntax trees
and pretty-printing them.
For now we use the abstract syntax from the C language formalization.
In fact, that abstract syntax has been initially motivated by ATC.
However, in the future we may have a separate syntax for ATC,
for the following reasons:
- The abstract syntax from the C language formalization
captures syntax after preprocessing,
but at some point we may want ATC to generate code
with at least some of the preprocessing constructs,
such as #includes, and possibly also some (simple) macros.
This means that the ATC abstract syntax will have to mix
preprocessing constructs with the preprocessed constructs:
this is something that may not be part, in this mixed form,
of the language formalization,
which should differentiate between
preprocessing translation units and
(preprocessed) translation units.
- We might want ATC to generate certain comments in the code,
which would require the ATC abstract syntax to incorporate
some information about comments.
However, in the language formalization,
comments, and their removal during one of C's translation phases,
would be captured differently,
not as part of the abstract syntax
over which the language semantics is defined.
- While the abstract syntax from the language formalization
may be generalized to cover much more of the C language,
the abstract syntax for ATC can incorporate restrictions
that make it simpler and that fit the C code generated by ATC.
In particular,
the C syntax for declarations and related entities is quite complex,
with significant mutual recursion,
and with many constraints not directly captured by the C grammar.
For instance, typedef is classified as a storage class specifier,
for syntactic convenience apparently [C:6.7.1/5],
even though its role is very different from the others.
Thus, by differentiating more explicitly, in our ATC abstract syntax,
among different kinds of declarations and related entities,
we make things more clear overall, as far as ATC is concerned.
Notwithstanding the above three reasons,
in the short term, for expediency, we might actually
incorporate preprocessing directives and comments,
and impose restrictions,
on the abstract syntax from the language formalization,
rather than creating a separate abstract syntax for ATC.
So long as the language formalization appropriately covers a subset of C,
there is nothing inherently wrong with that approach.
However, longer-term, as the language formalization is made more general,
in particular covering the translation phases [C:5.1.12] explicitly,
we will likely need to differentiate the abstract syntax for ATC
from the one(s) from the language formalization.
For proof generation,
we would provide a mapping from the former to the latter,
or in fact we may have the proof apply to the actual concrete syntax,
if the language formalization includes parsing.
Some observations about some parts of the abstract syntax
in relation to ATC:
- The fixtype ident allows any ACL2 string,
which may thus not be valid C identifiers.
However, ATC always generates valid C identifiers.
- The fixtype ident models only ASCII identifiers.
For ATC, ASCII identifiers may be all that we need
in the foreseeable future:
we may not need to generate C programs with
identifiers that include non-ASCII characters,
some aspects of which may be
implementation-dependent or locale-dependent.
Since ASCII identifiers are portable,
we plan for ATC to generate only ASCII identifiers,
unless there will be reasons to support a broader range.
- The fixtype ident allows identifiers of any length.
In the interest of portability, it is our intention to have ATC
generate identifiers of 31 characters or less
(which is the minimum of significant characters in (external) identifiers
[C:6.4.2.1/5] [C:6.4.2.1/6] [C:5.2.4.1]
which may not be a significant limitation.
- The fixtype iconst does not capture leading 0s
in octal and hexadecimal constants.
This is not a severe limitation,
but at some point we may want ATC to generate
something like 0x0000ffff.
- The fixtype const includes enuemration constants,
which, as discussed there,
cannot be differentiated from identifier expressions
during parsing, but only after static semantic checking.
This is not an issue for ATC, which generates, does not parse, C code:
ATC can generate one or the other case in the code.
- The fixtypes for declarations do not support function pointers.
Most likely, this support will not be neded for ATC,
given that ACL2 is first-order,
and thus cannot readily represent C function pointers.
(However, perhaps there is a way
to represent function pointers with apply$.)
- The fixtype block-item only supports object declarations.
This suffices for ATC currently.
- The fixtype fundef could be alteratively defined as
consisting of a function declaration and a body.
However, even though this would work in abstract syntax,
in concrete syntax a function declaration has to end with a semicolon
(and that is why the grammar rule in [C:6.9.1/1]
does not use a declaration, but rather its components):
thus, for the ATC pretty-printer,
we want to differentiate between
the type specifier sequences and declarators
that form a full function declaration,
and the type specifier sequences and declarators
that are part of a function definition.
Subtopics
- Irr-tyspecseq
- An irrelevant type specifier sequence.
- Irr-tag-declon
- An irrelevant structure/union/enumeration declaration.
- Irr-param-declon
- An irrelevant parameter declaration.
- Irr-iconst-length
- An irrelevant length suffix.
- Irr-iconst
- An irrelevant integer constant.
- Irr-fundef
- An irrelevant function definition.
- Irr-ext-declon
- An irrelevant external declaration.
- Irr-unop
- An irrelevant unary operator.
- Irr-tyname
- An irrelevant type name.
- Irr-transunit
- An irrelevant translation unit.
- Irr-stmt
- An irrelevant statement.
- Irr-initer
- An irrelevant initializer.
- Irr-ident
- An irrelevant identifier.
- Irr-fileset
- An irrelevant file set.
- Irr-expr
- An irrelevant expression.
- Irr-block-item
- An irrelevant block-item.
- Irr-binop
- An irrelevant binary operator.
- Irr-file
- An irrelevant file.