Read a character.
(read-char parstate) → (mv erp char? pos new-parstate)
Besides the character, we also return its position. No character is returned if we are at the end of the file; in this case, the returned file position is the one of the non-existent character just past the end of the file (i.e. the position of a character if we added it to the end of the file).
First we check whether the sequence of unread characters is not empty.
If it is not empty, we move a character from that sequence
to the sequence of read characters,
which amounts to incrementing
If the sequence of unread characters is empty,
we need to read the next character from the bytes.
If there are no more bytes, we have reached the end of file.
We return
Otherwise, we read the first byte, which is removed from the parser state. Since we support Unicode, we perform UTF-8 decoding, which may involve reading additional bytes.
Looking at the rules in the ABNF grammar for basic and extended characters, we see that the codes of the three ASCII non-new-line extended characters (namely dollar, at sign, and backquote) fill gaps in the ASCII codes of the basic characters, so that the codes 9, 11, 12, and 32-126 are all valid ASCII characters, which are thus returned, incrementing the position by one column. If instead the byte is the ASCII code 10 for line feed, we increment the position by one line. If instead the byte is the ASCII code 13 for carriage return, there are two cases based on whether the immediately following byte exists and is the ASCII code 10 for line feed: if it does, we consume two bytes instead of one, but we return just a line feed, since we only really need one new-line character (also in line with [C:5.2.1/3]); if it does not, we just consume the carriage return, but return a line feed, again for normalizing the new-line character. In both cases, we increment the position by one line, which also resets the column to 0 (see position-inc-line).
Note that horizontal tab, vertical tab, and form feed just increment the column number by 1 and leave the line number unchanged, like most other characters. This may not match the visual appearance, but the parser has no way to know how many columns a horizontal tab takes, or how many lines a vertical tab or form feed takes. So, at least for now, we just treat these as most other characters.
We exclude most ASCII control characters, except for the basic ones and for the new-line ones, since there should be little need to use those in C code. Furthermore, some are dangerous, particularly backspace, since it may make the code look different from what it is, similarly to Trojan Source in Unicode.
If the first byte is 128 or more, it must start a non-ASCII Unicode character. There are three cases to consider. It is easy to find references to UTF-8 encoding/decoding, for instance this Wikipedia page.
First case:
If the first byte has the form
Second case:
If the first byte has the form
Third case:
If the first byte has the form
If the first byte read has any other value,
either it is an invalid UTF-8 encoding (e.g.
Note that all the non-ASCII characters that we accept just increment the column number by 1 and leave the line number unchanged. This may not be appropriate for certain Unicode characters, but for now we treat them in this simplified way.
Function:
(defun read-char (parstate) (declare (xargs :stobjs (parstate))) (declare (xargs :guard (parstatep parstate))) (let ((__function__ 'read-char)) (declare (ignorable __function__)) (b* (((reterr) nil (irr-position) parstate) (parstate.bytes (parstate->bytes parstate)) (parstate.position (parstate->position parstate)) (parstate.chars-read (parstate->chars-read parstate)) (parstate.chars-unread (parstate->chars-unread parstate)) (parstate.size (parstate->size parstate)) ((when (and (> parstate.chars-unread 0) (> parstate.size 0))) (b* (((unless (< parstate.chars-read (parstate->chars-length parstate))) (raise "Internal error: index ~x0 out of bound ~x1." parstate.chars-read (parstate->chars-length parstate)) (reterr t)) (char+pos (parstate->char parstate.chars-read parstate)) (parstate (update-parstate->chars-unread (1- parstate.chars-unread) parstate)) (parstate (update-parstate->chars-read (1+ parstate.chars-read) parstate)) (parstate (update-parstate->size (1- parstate.size) parstate))) (retok (char+position->char char+pos) (char+position->position char+pos) parstate))) ((unless (and (consp parstate.bytes) (> parstate.size 0))) (retok nil parstate.position parstate)) (byte (car parstate.bytes)) (bytes (cdr parstate.bytes)) ((when (or (= byte 9) (= byte 11) (= byte 12) (and (<= 32 byte) (<= byte 126)))) (b* ((parstate (update-parstate->bytes bytes parstate)) (parstate (update-parstate->position (position-inc-column 1 parstate.position) parstate)) ((unless (< parstate.chars-read (parstate->chars-length parstate))) (raise "Internal error: index ~x0 out of bound ~x1." parstate.chars-read (parstate->chars-length parstate)) (reterr t)) (parstate (update-parstate->char parstate.chars-read (make-char+position :char byte :position parstate.position) parstate)) (parstate (update-parstate->chars-read (1+ parstate.chars-read) parstate)) (parstate (update-parstate->size (1- parstate.size) parstate))) (retok byte parstate.position parstate))) ((when (= byte 10)) (b* ((parstate (update-parstate->bytes bytes parstate)) (parstate (update-parstate->position (position-inc-line 1 parstate.position) parstate)) ((unless (< parstate.chars-read (parstate->chars-length parstate))) (raise "Internal error: index ~x0 out of bound ~x1." parstate.chars-read (parstate->chars-length parstate)) (reterr t)) (parstate (update-parstate->char parstate.chars-read (make-char+position :char 10 :position parstate.position) parstate)) (parstate (update-parstate->chars-read (1+ parstate.chars-read) parstate)) (parstate (update-parstate->size (1- parstate.size) parstate))) (retok 10 parstate.position parstate))) ((when (= byte 13)) (b* (((mv bytes count) (if (and (consp bytes) (> parstate.size 1) (= (car bytes) 10)) (mv (cdr bytes) 2) (mv bytes 1))) (parstate (update-parstate->bytes bytes parstate)) (parstate (update-parstate->position (position-inc-line 1 parstate.position) parstate)) ((unless (< parstate.chars-read (parstate->chars-length parstate))) (raise "Internal error: index ~x0 out of bound ~x1." parstate.chars-read (parstate->chars-length parstate)) (reterr t)) (parstate (update-parstate->char parstate.chars-read (make-char+position :char 10 :position parstate.position) parstate)) (parstate (update-parstate->chars-read (1+ parstate.chars-read) parstate)) (parstate (update-parstate->size (- parstate.size count) parstate))) (retok 10 parstate.position parstate))) ((when (= (logand byte 224) 192)) (b* (((unless (and (consp bytes) (> parstate.size 1))) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "another byte after ~ the first byte ~x0 ~ of the form 110... ~ (i.e. between 192 and 223) ~ of a two-byte UTF-8 encoding" byte) :found "end of file")) (byte2 (car bytes)) (bytes (cdr bytes)) ((unless (= (logand byte2 192) 128)) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a byte of the form 10... ~ (i.e. between 128 and 191) ~ after the first byte ~x0 ~ of the form 110... ~ (i.e. between 192 and 223) ~ of a two-byte UTF-8 encoding" byte) :found (msg "the byte ~x0" byte2))) (code (+ (ash (logand byte 31) 6) (logand byte2 63))) ((when (< code 128)) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a value between 80h and 7FFh ~ UTF-8-encoded in the two bytes ~ (~x0 ~x1)" byte byte2) :found (msg "the value ~x0" code))) (parstate (update-parstate->bytes bytes parstate)) (parstate (update-parstate->position (position-inc-column 1 parstate.position) parstate)) ((unless (< parstate.chars-read (parstate->chars-length parstate))) (raise "Internal error: index ~x0 out of bound ~x1." parstate.chars-read (parstate->chars-length parstate)) (reterr t)) (parstate (update-parstate->char parstate.chars-read (make-char+position :char code :position parstate.position) parstate)) (parstate (update-parstate->chars-read (1+ parstate.chars-read) parstate)) (parstate (update-parstate->size (- parstate.size 2) parstate))) (retok code parstate.position parstate))) ((when (= (logand byte 240) 224)) (b* (((unless (and (consp bytes) (> parstate.size 1))) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "another byte after ~ the first byte ~x0 ~ of the form 1110... ~ (i.e. between 224 to 239) ~ of a three-byte UTF-8 encoding" byte) :found "end of file")) (byte2 (car bytes)) (bytes (cdr bytes)) ((unless (= (logand byte2 192) 128)) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a byte of the form 10... ~ (i.e. between 128 and 191) ~ after the first byte ~x0 ~ of the form 1110... ~ (i.e. between 224 and 239) ~ of a three-byte UTF-8 encoding" byte) :found (msg "the byte ~x0" byte2))) ((unless (and (consp bytes) (> parstate.size 2))) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "another byte after ~ the first byte ~x0 ~ of the form 1110... ~ (i.e. between 224 to 239) ~ and the second byte ~x1 ~ of the form 10... ~ (i.e. between 128 and 191) ~ of a three-byte UTF-8 encoding" byte byte2) :found "end of file")) (byte3 (car bytes)) (bytes (cdr bytes)) ((unless (= (logand byte3 192) 128)) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a byte of the form 10... ~ (i.e. between 128 and 191) ~ after the first byte ~x0 ~ of the form 1110... ~ (i.e. between 224 and 239) ~ and the second byte ~x1 ~ of the form 10... ~ (i.e. between 128 and 191) ~ of a three-byte UTF-8 encoding" byte byte2) :found (msg "the byte ~x0" byte3))) (code (+ (ash (logand byte 15) 12) (ash (logand byte2 63) 6) (logand byte3 63))) ((when (< code 2048)) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a value between 800h and FFFFh ~ UTF-8-encoded in the three bytes ~ (~x0 ~x1 ~x2)" byte byte2 byte3) :found (msg "the value ~x0" code))) ((when (or (and (<= 8234 code) (<= code 8238)) (and (<= 8294 code) (<= code 8297)))) (reterr-msg :where (position-to-msg parstate.position) :expected "a Unicode character with code ~ in the range 9-13 or 32-126 ~ or 128-8233 or 8239-8293 or ~ or 8298-55295 or 57344-1114111" :found (char-to-msg code))) (parstate (update-parstate->bytes bytes parstate)) (parstate (update-parstate->position (position-inc-column 1 parstate.position) parstate)) ((unless (< parstate.chars-read (parstate->chars-length parstate))) (raise "Internal error: index ~x0 out of bound ~x1." parstate.chars-read (parstate->chars-length parstate)) (reterr t)) (parstate (update-parstate->char parstate.chars-read (make-char+position :char code :position parstate.position) parstate)) (parstate (update-parstate->chars-read (1+ parstate.chars-read) parstate)) (parstate (update-parstate->size (- parstate.size 3) parstate))) (retok code parstate.position parstate))) ((when (= (logand 248 byte) 240)) (b* (((unless (and (consp bytes) (> parstate.size 1))) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "another byte after ~ the first byte ~x0 ~ of the form 11110... ~ (i.e. between 240 to 247) ~ of a four-byte UTF-8 encoding" byte) :found "end of file")) (byte2 (car bytes)) (bytes (cdr bytes)) ((unless (= (logand byte2 192) 128)) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a byte of the form 10... ~ (i.e. between 128 and 191) ~ after the first byte ~x0 ~ of the form 11110... ~ (i.e. between 240 and 247) ~ of a four-byte UTF-8 encoding" byte) :found (msg "the byte ~x0" byte2))) ((unless (and (consp bytes) (> parstate.size 2))) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "another byte after ~ the first byte ~x0 ~ of the form 11110... ~ (i.e. between 240 to 247) ~ and the second byte ~x1 ~ of the form 10... ~ (i.e. between 128 and 191) ~ of a four-byte UTF-8 encoding" byte byte2) :found "end of file")) (byte3 (car bytes)) (bytes (cdr bytes)) ((unless (= (logand byte3 192) 128)) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a byte of the form 10... ~ (i.e. between 128 and 191) ~ after the first byte ~x0 ~ of the form 11110... ~ (i.e. between 240 and 247) ~ and the second byte ~x1 ~ of the form 10... ~ (i.e. between 128 and 191) ~ of a four-byte UTF-8 encoding" byte byte2) :found (msg "the byte ~x0" byte3))) ((unless (and (consp bytes) (> parstate.size 3))) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "another byte after ~ the first byte ~x0 ~ of the form 11110... ~ (i.e. between 240 to 247) ~ and the second byte ~x1 ~ of the form 10... ~ (i.e. between 128 and 191) ~ and the third byte ~x2 ~ of the form 10... ~ (i.e. between 128 and 191) ~ of a four-byte UTF-8 encoding" byte byte2 byte3) :found "end of file")) (byte4 (car bytes)) (bytes (cdr bytes)) ((unless (= (logand byte4 192) 128)) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a byte of the form 10... ~ (i.e. between 128 and 191) ~ after the first byte ~x0 ~ of the form 11110... ~ (i.e. between 240 and 247) ~ and the second byte ~x1 ~ of the form 10... ~ (i.e. between 128 and 191) ~ and the third byte ~x2 ~ of the form 10... ~ (i.e. between 128 and 191) ~ of a four-byte UTF-8 encoding" byte byte2 byte3) :found (msg "the byte ~x0" byte4))) (code (+ (ash (logand byte 7) 18) (ash (logand byte2 63) 12) (ash (logand byte3 63) 6) (logand byte4 63))) ((when (or (< code 65536) (> code 1114111))) (reterr-msg :where (position-to-msg parstate.position) :expected (msg "a value between 10000h and 10FFFFh ~ UTF-8-encoded in the four bytes ~ (~x0 ~x1 ~x2 ~x3)" byte byte2 byte3 byte4) :found (msg "the value ~x0" code))) (parstate (update-parstate->bytes bytes parstate)) (parstate (update-parstate->position (position-inc-column 1 parstate.position) parstate)) ((unless (< parstate.chars-read (parstate->chars-length parstate))) (raise "Internal error: index ~x0 out of bound ~x1." parstate.chars-read (parstate->chars-length parstate)) (reterr t)) (parstate (update-parstate->char parstate.chars-read (make-char+position :char code :position parstate.position) parstate)) (parstate (update-parstate->chars-read (1+ parstate.chars-read) parstate)) (parstate (update-parstate->size (- parstate.size 4) parstate))) (retok code parstate.position parstate)))) (reterr-msg :where (position-to-msg parstate.position) :expected "a byte in the range 9-13 or 32-126 or 192-223" :found (msg "the byte ~x0" byte)))))
Theorem:
(defthm nat-optionp-of-read-char.char? (b* (((mv acl2::?erp ?char? acl2::?pos ?new-parstate) (read-char parstate))) (nat-optionp char?)) :rule-classes :rewrite)
Theorem:
(defthm positionp-of-read-char.pos (b* (((mv acl2::?erp ?char? acl2::?pos ?new-parstate) (read-char parstate))) (positionp pos)) :rule-classes :rewrite)
Theorem:
(defthm parstatep-of-read-char.new-parstate (implies (parstatep parstate) (b* (((mv acl2::?erp ?char? acl2::?pos ?new-parstate) (read-char parstate))) (parstatep new-parstate))) :rule-classes :rewrite)
Theorem:
(defthm parsize-of-read-char-uncond (b* (((mv acl2::?erp ?char? acl2::?pos ?new-parstate) (read-char parstate))) (<= (parsize new-parstate) (parsize parstate))) :rule-classes :linear)
Theorem:
(defthm parsize-of-read-char-cond (b* (((mv acl2::?erp ?char? acl2::?pos ?new-parstate) (read-char parstate))) (implies (and (not erp) char?) (<= (parsize new-parstate) (1- (parsize parstate))))) :rule-classes :linear)