Source character set.
[C:5.2.1] prescribes the presence of certain members in the source character set. These prescribed members form the basic source character set, and correspond naturally to certain (ACL2) ASCII characters. However, [C] does not prescribe ASCII, even though many C implementations may use (some extension of) ASCII.
We formalize the source character set as a constrained ACL2 predicate, along with an injection into them from the ASCII characters mentioned above (i.e. the ones that correspond naturally to the ones prescribed by [C]). The image of the injection is the basic source character set, whose members are the basic source characters. The whole constrained predicate is the extended source character set, whose members that are not basic source characters are extended source characters. This is according to the nomenclature in [C:5.2.1].
Because the extended source character set is formalized here as a constrained predicate, there is no immediate way to refer to the extended characters (while the basic ones may be referenced via the aforementioned injection). Thus, we introduce a second injection, from the constrained predicate into the natural numbers: this means that each character has a unique associated number, which may be used to refer to the character. This second injection and numbering has no counterpart in [C]: its significance is limited to our ACL2 formalization.
[C:5.2.1.2] discusses encodings of source characters, including a notion of shift states that affect the encodings. This encoding provides a way to refer to the members of the source characters set; this is how source files are read. However, because of the constraints on single bytes for basic characters, and because of the possibility of shift states, this encoding does not seem the most straightforward way to refer to the source character set's members in our formalization. This is why we introduce the second injection described above. We plan to formalize the encoding discussed in [C:5.2.1.2] elsewhere.