A few questions about encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Jun 12 21:40:44 EDT 2013


On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:

> The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
> total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
> 20 bits).

Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because 
that is what Unicode is limited to.
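
A quick way to see that shared ceiling, sketched for Python 3. The
helper name encoded_lengths is my own invention for illustration; only
chr() and str.encode() are real built-ins here.

```python
# Hypothetical helper (not from the thread): how many bytes each UTF
# needs for a given code point. The -le variants avoid counting a BOM.
def encoded_lengths(codepoint):
    ch = chr(codepoint)
    return {name: len(ch.encode(name))
            for name in ('utf-8', 'utf-16-le', 'utf-32-le')}

# U+10FFFF is the highest code point, and all three encodings handle it.
top = encoded_lengths(0x10FFFF)
print(top)

# One past the limit is not a code point at all, so there is nothing
# for any of the encodings to encode.
try:
    chr(0x110000)
except ValueError as err:
    print('rejected:', err)
```

All three encodings happen to need four bytes for U+10FFFF; the limit
comes from Unicode itself, not from any one encoding.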

The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but 
that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the 
mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you 
don't have Unicode characters any more, and hence your byte string is 
not valid UTF-32:

py> b = b'\xFF'*8
py> b.decode('UTF-32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: codepoint not in range(0x110000)
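
The same holds for UTF-8. Below is a minimal sketch, assuming Python 3;
is_valid_utf8 is a name I made up for illustration, and the byte strings
are hand-built by following the UTF-8 bit layout.

```python
# Hypothetical helper (not from the thread): does the codec accept it?
def is_valid_utf8(data):
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

highest = b'\xf4\x8f\xbf\xbf'         # U+10FFFF, the last code point
too_big = b'\xf4\x90\x80\x80'         # same 4-byte layout, would be U+110000
five_byte = b'\xf8\x88\x80\x80\x80'   # 5-byte layout, would be U+200000

for data in (highest, too_big, five_byte):
    print(data, is_valid_utf8(data))
```

Only the first byte string decodes. The other two follow the mechanism
correctly, but the values they carry are not Unicode code points, so the
decoder refuses them.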


-- 
Steven


