unicode by default

John Machin sjmachin at lexicon.net
Thu May 12 03:58:24 EDT 2011


On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:

>
> So, the UTF-16 UTF-32 is INTERNAL only, for Python

NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8, are
encodings for the EXTERNAL representation of Unicode characters in byte
streams.
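
To make the "external" point concrete, here is a small sketch, assuming
Python 3.3 or later. The sample string and the variable names are mine,
picked only for illustration. One str object is serialised three ways,
and each byte string decodes back to the same text.

```python
# One piece of text, three external byte representations.
# 'é' is U+00E9; the last character, U+1F600, lies outside the BMP.
text = "caf\u00e9 \U0001F600"

# The "-le" variants fix the byte order, so no BOM is prepended.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

# Code points in the string, then bytes used by each encoding.
print(len(text), len(utf8), len(utf16), len(utf32))

# Every encoding round-trips to the identical str.
round_trips = [
    utf8.decode("utf-8"),
    utf16.decode("utf-16-le"),
    utf32.decode("utf-32-le"),
]
print(all(decoded == text for decoded in round_trips))
```

Whatever Python uses to hold the str in memory is a separate matter and
does not show up anywhere in this code.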

> I also was not aware that UTF-8 chars could be up to six(6) bytes long
> from left to right.

It could be, once upon a time in ISO faerieland, when it was thought that
Unicode could grow to 2**31 codepoints. However, ISO and the Unicode
Consortium have agreed that 17 planes (codepoints 0 to 0x10FFFF) is the
utter max, and accordingly a valid UTF-8 byte sequence can be no longer
than 4 bytes ... see below:

    >>> chr(17 * 65536)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: chr() arg not in range(0x110000)
    >>> chr(17 * 65536 - 1)
    '\U0010ffff'
    >>> _.encode('utf8')
    b'\xf4\x8f\xbf\xbf'
    >>> b'\xf5\x8f\xbf\xbf'.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0: invalid start byte
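
The same limit can be checked as a script. This is a rough sketch,
assuming Python 3; utf8_length and is_valid_utf8 are hypothetical helper
names of mine, not anything from the standard library. It measures the
encoded length at each boundary where UTF-8 needs one more byte, then
confirms that an old-style 5-byte sequence is rejected.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 needs for one code point."""
    return len(chr(code_point).encode("utf-8"))


def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Last code point of each length class, and the first of the next one.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = [utf8_length(cp) for cp in boundaries]
print(lengths)

# The highest code point takes 4 bytes and decodes fine.
print(is_valid_utf8(b"\xf4\x8f\xbf\xbf"))

# 0xF8 was the lead byte of a 5-byte sequence in the original design;
# the decoder no longer accepts it.
print(is_valid_utf8(b"\xf8\x88\x80\x80\x80"))
```

The surrogate range 0xD800 to 0xDFFF is left out of the boundary list on
purpose: those code points cannot be encoded as UTF-8 at all.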




