Unicode-escapes
First Java lexical translation step: Unicode escapes [JLS14:3.3].
A Java program is a finite sequence of Unicode characters
satisfying a number of constraints.
The first set of constraints is described in [JLS14:3.3],
which also describes how
a sequence of Unicode characters satisfying these constraints
is turned into another sequence of Unicode characters,
which is then subjected to
further constraint checks and associated transformations.
The grammar rules in [JLS14:3.3] express
part of this first set of constraints,
but they are ambiguous if taken alone:
one could always choose the raw-input-character alternative
for unicode-input-character,
without recognizing any unicode-escape,
and therefore leave the original Unicode sequence unchanged.
The English text in [JLS14:3.3] provides additional constraints.
One additional constraint in this first set is that,
in order for a Unicode escape to be recognized
(i.e. in order for unicode-escape to be chosen),
the backslash that starts the escape must be preceded
by an even number of backslashes.
Without this constraint, it would be impossible, for example,
to have the Java literal string
"The Unicode escape \\u0020 is the ASCII space character.",
where the double backslash is
a (non-Unicode) escape sequence for a single backslash [JLS14:3.10.6]:
without the constraint, this Java literal string would be equivalent to
"The Unicode escape \ is the ASCII space character.".
[JLS14:3.3] introduces the notion of `eligible' backslash
as one preceded by an even number of backslashes (possibly none).
In the example string above,
the backslash just before the u is not eligible,
because it is preceded by an odd number of backslashes.
For each eligible backslash, there are two cases:
either the eligible backslash is followed by one or more u letters,
or it is not.
In the second case, there is no Unicode escape,
because the grammar requires the presence of one or more u letters;
the backslash must be presumably
the start of a (non-Unicode) escape sequence [JLS14:3.10.6],
e.g. the backslash in the Java string literal
"A line.\nAnother line."
is eligible,
but there is no Unicode escape.
If instead an eligible backslash is followed by one or more u letters,
there are two sub-cases:
either the last u is followed by four hexadecimal digits,
or it is not.
In the second sub-case,
the original sequence of Unicode characters is not a valid Java program;
none of the (non-Unicode) escape sequences [JSL:3.10.6]
has u following the backslash.
In the first sub-case,
we have a possible Unicode escape according to the grammar,
and it must be recognized as such,
thus removing the inherent grammatical ambiguity.
We formalize the processing of Unicode escapes via a function
from lists of Unicode characters to lists of Unicode characters
that recognizes the Unicode escapes according to the above rules
and turns them into the corresponding single Unicode characters.
This function is constructed as the composition of
(i) a parser from lists of Unicode characters
to lists of unicode-input-character trees
and (ii) an abstractor from lists of unicode-input-character trees
to lists of Unicode characters.
The parser always succeeds,
even if there is an eligible backslash
followed by one of more u letters
but with the last u not followed by four hexadecimal digits
(in which case, as noted above, the processing of Unicode escapes fails).
In this case, the parser just leaves the characters as they are,
without recognizing any Unicode escape (since there is not one).
This makes the parser slightly more general,
and perhaps useful in other circumstances.
Presumably [JLS14] prescribes an error in this situation
(as opposed to simply leaving the characters unchanged)
because the resulting Unicode character sequence
could never be a valid Java program anyhow,
and so parsing can stop immediately instead of stopping later anyhow.
Nonetheless, when we compose the parser with the abstractor (see above),
we perform the check for that situation,
and return an error if the situation occurs,
as prescribed by [JLS14].
In other words, our formalization of Unicode escape processing
is faithful to [JLS14],
but its parser component is a bit more general.
Subtopics
- Uniescape-parse-constraints-p
- All the input/output constraints for the Unicode escape parser.
- Even-backslashes-tree-constraints-p
- Necessary condition for parsed trees to be Unicode escapes.
- Len-of-string-of-prefix-of-unicode-input-character-trees
- A theorem about the length of the string of
a (proper) prefix of a list of unicode-input-character trees.
- Uniescape-parse
- Parse the Unicode escapes in a list of Unicode characters.
- Nonzero-uletters-after-p
- Check if, in a list of Unicode characters,
the character at position pos is followed by
a non-zero number of `u' letters.
- Abs-unicode-escape
- Abstract a unicode-escape tree to a Unicode character.
- Even-backslashes-before-p
- Check if, in a list of Unicode characters,
the character at position pos is preceded by
an even number of backslashes.
- Uniescape-tree-constraints-p
- Sufficient condition for parsed trees to be Unicode escapes.
- Uniescape-process
- Perform Java's first lexical translation step.
- Uniescape-candidate-valid-p
- Check if a candidate Unicode escape is valid.
- Uniescape-candidate-p
- Check if, in a list of Unicode characters,
a Unicode escape may start at position pos.
- Abs-raw-input-character
- Abstract a raw-input-character tree to a Unicode character.
- Abs-hex-digit
- Abstract a hex-digit tree to a natural number below 16.
- Abs-unicode-input-character
- Abstract a unicode-input-character tree to a Unicode character.
- Some-uniescape-candidate-invalid-p
- Check if a list of Unicode characters includes
an invalid Unicode escape candidate.
- Uniescapep
- Check if, in a list of Unicode characters,
there is a Unicode escape at position pos.
- Abs-unicode-input-character-list
- Abstract a list of unicode-input-character trees
to a list of Unicode characters.
- Uniescape-parse-p
- Check if a list of Unicode characters
can be parsed into a list of trees.
- Eligible-backslash-p
- Check if, in a list of Unicode characters,
the character at position pos is an eligible backslash.
- Unicode-input-character-tree-is-escape-p
- Check if a unicode-input-character tree
is a Unicode escape.