Search-engine friendly clone of the ACL2 documentation.

Parser

Read-char

Read a character.

Signature
(read-char pstate) → (mv erp char? pos new-pstate)
Arguments: pstate — Guard (parstatep pstate).
Returns: char? — Type (nat-optionp char?).; pos — Type (positionp pos).; new-pstate — Type (parstatep new-pstate).

Besides the character, we also return its position. No character is returned if we are at the end of the file; in this case, the returned file position is the one of the non-existent character just past the end of the file (i.e. the position of a character if we added it to the end of the file).

If a character is read, it is added to the list of read characters. See parstate.

If some character was put back earlier, we get the character directly from there, removing it from the list; see parstate. There is no change in position in the parser state in this case, because the position is the one at the start of the byte list. Otherwise, we look at the bytes.

If there are no more bytes, we have reached the end of file. We return nil, for no character, and we leave the parser state unchanged.

Otherwise, we read the first byte, which is removed from the parser state. Since we support Unicode, we perform UTF-8 decoding, which may involve reading additional bytes.

Looking at the rules in the ABNF grammar for basic and extended characters, we see that the codes of the three ASCII non-new-line extended characters (namely dollar, at sign, and backquote) fill gaps in the ASCII codes of the basic characters, so that the codes 9, 11, 12, and 32-126 are all valid ASCII characters, which are thus returned, incrementing the position by one column. If instead the byte is the ASCII code 10 for line feed, we increment the position by one line. If instead the byte is the ASCII code 13 for carriage return, there are two cases based on whether the immediately following byte exists and is the ASCII code 10 for line feed: if it does, we consume two bytes instead of one, but we return just a line feed, since we only really need one new-line character (also in line with [C:5.2.1/3]); if it does not, we just consume the carriage return, but return a line feed, again for normalizing the new-line character. In both cases, we increment the position by one line, which also resets the column to 0 (see position-inc-line).

Note that horizontal tab, vertical tab, and form feed just increment the column number by 1 and leave the line number unchanged, like most other characters. This may not match the visual appearance, but the parser has no way to know how many columns a horizontal tab takes, or how many lines a vertical tab or form feed takes. So, at least for now, we just treat these as most other characters.

We exclude most ASCII control characters, except for the basic ones and for the new-line ones, since there should be little need to use those in C code. Furthermore, some are dangerous, particularly backspace, since it may make the code look different from what it is, similarly to Trojan Source in Unicode.

If the first byte is 128 or more, it must start a non-ASCII Unicode character. There are three cases to consider. It is easy to find references to UTF-8 encoding/decoding, for instance this Wikipedia page.

First case: If the first byte has the form 110xxxyy, it must be followed by a second byte of the form 10yyzzzz, which together encode xxxyyyyzzzz, which covers the range from 0 to 7FFh. We return an error if there is no second byte. We return an error if the encoded value is below 80h. If all these checks pass, the code covers the character range from U+80 to U+7FF.

Second case: If the first byte has the form 1110xxxx, it must be followed by a second byte of the form 10yyyyzz and by a third byte of the form 10zzwwww, which together encode xxxxyyyyzzzzwwww, which covers the range from 0 to FFFFh. We return an error if there is no second or third byte. We return an error if the encoded value is below 800h. If all these checks pass, the code covers the character range from U+800 to U+FFFF, but we return an error if the character is between U+202A and U+202E or between U+2066 and U+2069; see the safe-nonascii rule in our ABNF grammar.

Third case: If the first byte has the form 11110xyy, it must be followed by a second byte of the form 10yyzzzz and by a third byte of the form 10wwwwuu and by a fourth byte of the form 10uuvvvv, which together encode xyyyyzzzzwwwwuuuuvvvv, which covers the range from 0 to 1FFFFFh. We return an error if there is no second or third or fourth byte. We return an error if the encoded value is below 10000h or above 10FFFFh. If all these checks pass, the code covers the character range from U+10000 to U+1FFFFF.

If the first byte read has any other value, either it is an invalid UTF-8 encoding (e.g. 111...) or it is an ASCII character that we do not accept (e.g. 00000000). We return an error in this case.

Note that all the non-ASCII characters that we accept just increment the column number by 1 and leave the line number unchanged. This may not be appropriate for certain Unicode characters, but for now we treat them in this simplified way.

Definitions and Theorems

Function: read-char

(defun read-char (pstate)
 (declare (xargs :guard (parstatep pstate)))
 (let ((__function__ 'read-char))
  (declare (ignorable __function__))
  (b*
   (((reterr)
     nil (irr-position)
     (irr-parstate))
    ((parstate pstate) pstate)
    ((when (consp pstate.chars-unread))
     (b* ((char+pos (car pstate.chars-unread)))
       (retok (char+position->char char+pos)
              (char+position->position char+pos)
              (change-parstate
                   pstate
                   :chars-unread (cdr pstate.chars-unread)
                   :chars-read (cons char+pos pstate.chars-read)
                   :size (1- pstate.size)))))
    ((unless (consp pstate.bytes))
     (retok nil
            pstate.position (parstate-fix pstate)))
    (byte (car pstate.bytes))
    (bytes (cdr pstate.bytes))
    ((when (or (= byte 9)
               (= byte 11)
               (= byte 12)
               (and (<= 32 byte) (<= byte 126))))
     (retok
          byte pstate.position
          (change-parstate
               pstate
               :bytes bytes
               :position (position-inc-column 1 pstate.position)
               :chars-read
               (cons (make-char+position :char byte
                                         :position pstate.position)
                     pstate.chars-read)
               :size (1- pstate.size))))
    ((when (= byte 10))
     (retok
          10 pstate.position
          (change-parstate
               pstate
               :bytes bytes
               :position (position-inc-line 1 pstate.position)
               :chars-read
               (cons (make-char+position :char 10
                                         :position pstate.position)
                     pstate.chars-read)
               :size (1- pstate.size))))
    ((when (= byte 13))
     (b* (((mv bytes count)
           (if (and (consp bytes) (= (car bytes) 10))
               (mv (cdr bytes) 2)
             (mv bytes 1))))
      (retok
           10 pstate.position
           (change-parstate
                pstate
                :bytes bytes
                :position (position-inc-line 1 pstate.position)
                :chars-read
                (cons (make-char+position :char 10
                                          :position pstate.position)
                      pstate.chars-read)
                :size (- pstate.size count)))))
    ((when (= (logand byte 224) 192))
     (b*
      (((unless (consp bytes))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 110... ~
                                          (i.e. between 192 and 223) ~
                                          of a two-byte UTF-8 encoding"
          byte)
         :found "end of file"))
       (byte2 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte2 192) 128))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 110... ~
                                          (i.e. between 192 and 223) ~
                                          of a two-byte UTF-8 encoding"
          byte)
         :found (msg "the byte ~x0" byte2)))
       (code (+ (ash (logand byte 31) 6)
                (logand byte2 63)))
       ((when (< code 128))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a value between 80h and 7FFh ~
                                          UTF-8-encoded in the two bytes ~
                                          (~x0 ~x1)"
          byte byte2)
         :found (msg "the value ~x0" code))))
      (retok
           code pstate.position
           (change-parstate
                pstate
                :bytes bytes
                :position (position-inc-column 1 pstate.position)
                :chars-read
                (cons (make-char+position :char code
                                          :position pstate.position)
                      pstate.chars-read)
                :size (- pstate.size 2)))))
    ((when (= (logand byte 240) 224))
     (b*
      (((unless (consp bytes))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 1110... ~
                                          (i.e. between 224 to 239) ~
                                          of a three-byte UTF-8 encoding"
          byte)
         :found "end of file"))
       (byte2 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte2 192) 128))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 1110... ~
                                          (i.e. between 224 and 239) ~
                                          of a three-byte UTF-8 encoding"
          byte)
         :found (msg "the byte ~x0" byte2)))
       ((unless (consp bytes))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 1110... ~
                                          (i.e. between 224 to 239) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a three-byte UTF-8 encoding"
          byte byte2)
         :found "end of file"))
       (byte3 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte3 192) 128))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 1110... ~
                                          (i.e. between 224 and 239) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a three-byte UTF-8 encoding"
          byte byte2)
         :found (msg "the byte ~x0" byte3)))
       (code (+ (ash (logand byte 15) 12)
                (ash (logand byte2 63) 6)
                (logand byte3 63)))
       ((when (< code 2048))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a value between 800h and FFFFh ~
                                          UTF-8-encoded in the three bytes ~
                                          (~x0 ~x1 ~x2)"
          byte byte2 byte3)
         :found (msg "the value ~x0" code)))
       ((when (or (and (<= 8234 code) (<= code 8238))
                  (and (<= 8294 code) (<= code 8297))))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         "a Unicode character with code ~
                                     in the range 9-13 or 32-126 ~
                                     or 128-8233 or 8239-8293 or ~
                                     or 8298-55295 or 57344-1114111"
         :found (char-to-msg code))))
      (retok
           code pstate.position
           (change-parstate
                pstate
                :bytes bytes
                :position (position-inc-column 1 pstate.position)
                :chars-read
                (cons (make-char+position :char code
                                          :position pstate.position)
                      pstate.chars-read)
                :size (- pstate.size 3)))))
    ((when (= (logand 248 byte) 240))
     (b*
      (((unless (consp bytes))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 to 247) ~
                                          of a four-byte UTF-8 encoding"
          byte)
         :found "end of file"))
       (byte2 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte2 192) 128))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 and 247) ~
                                          of a four-byte UTF-8 encoding"
          byte)
         :found (msg "the byte ~x0" byte2)))
       ((unless (consp bytes))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 to 247) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a four-byte UTF-8 encoding"
          byte byte2)
         :found "end of file"))
       (byte3 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte3 192) 128))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 and 247) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a four-byte UTF-8 encoding"
          byte byte2)
         :found (msg "the byte ~x0" byte3)))
       ((unless (consp bytes))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "another byte after ~
                                          the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 to 247) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          and the third byte ~x2 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a four-byte UTF-8 encoding"
          byte byte2 byte3)
         :found "end of file"))
       (byte4 (car bytes))
       (bytes (cdr bytes))
       ((unless (= (logand byte4 192) 128))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a byte of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          after the first byte ~x0 ~
                                          of the form 11110... ~
                                          (i.e. between 240 and 247) ~
                                          and the second byte ~x1 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          and the third byte ~x2 ~
                                          of the form 10... ~
                                          (i.e. between 128 and 191) ~
                                          of a four-byte UTF-8 encoding"
          byte byte2 byte3)
         :found (msg "the byte ~x0" byte4)))
       (code (+ (ash (logand byte 7) 18)
                (ash (logand byte2 63) 12)
                (ash (logand byte3 63) 6)
                (logand byte4 63)))
       ((when (or (< code 65536) (> code 1114111)))
        (reterr-msg
         :where (position-to-msg pstate.position)
         :expected
         (msg
          "a value between 10000h and 10FFFFh ~
                                          UTF-8-encoded in the four bytes ~
                                          (~x0 ~x1 ~x2 ~x3)"
          byte byte2 byte3 byte4)
         :found (msg "the value ~x0" code))))
      (retok
           code pstate.position
           (change-parstate
                pstate
                :bytes bytes
                :position (position-inc-column 1 pstate.position)
                :chars-read
                (cons (make-char+position :char code
                                          :position pstate.position)
                      pstate.chars-read)
                :size (- pstate.size 4))))))
   (reterr-msg
        :where (position-to-msg pstate.position)
        :expected "a byte in the range 9-13 or 32-126 or 192-223"
        :found (msg "the byte ~x0" byte)))))

Theorem: nat-optionp-of-read-char.char?

(defthm nat-optionp-of-read-char.char?
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-pstate)
        (read-char pstate)))
    (nat-optionp char?))
  :rule-classes :rewrite)

Theorem: positionp-of-read-char.pos

(defthm positionp-of-read-char.pos
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-pstate)
        (read-char pstate)))
    (positionp pos))
  :rule-classes :rewrite)

Theorem: parstatep-of-read-char.new-pstate

(defthm parstatep-of-read-char.new-pstate
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-pstate)
        (read-char pstate)))
    (parstatep new-pstate))
  :rule-classes :rewrite)

Theorem: parsize-of-read-char-uncond

(defthm parsize-of-read-char-uncond
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-pstate)
        (read-char pstate)))
    (<= (parsize new-pstate)
        (parsize pstate)))
  :rule-classes :linear)

Theorem: parsize-of-read-char-cond

(defthm parsize-of-read-char-cond
  (b* (((mv acl2::?erp
            ?char? acl2::?pos ?new-pstate)
        (read-char pstate)))
    (implies (and (not erp) char?)
             (<= (parsize new-pstate)
                 (1- (parsize pstate)))))
  :rule-classes :linear)