UTF-8 question from Dive into Python 3

Antoine Pitrou solipsis at pitrou.net
Wed Jan 19 09:41:00 EST 2011


On Wed, 19 Jan 2011 14:00:13 +0000 (UTC)
Tim Harig <usernet at ilthio.net> wrote:
> 
> - Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
> - yes, then can I still assume the remaining UTF-8 bytes are in big-endian
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> - order?
>   ^^^^^^
> - 
> - A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
>      ^^^ 
> - to the endianness of the byte stream. UTF-8 always has the same byte
>                            ^^^^         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> - order.
>   ^^^^^^

Which certainly doesn't mean that byte order can be called "big
endian" under any recognized definition of the latter. Similarly, ASCII
text has its own order which certainly can't be characterized as either
"little endian" or "big endian".
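
To illustrate the BOM point from the quoted Q&A, here is a minimal
Python 3 sketch. The sample string is an arbitrary choice of mine; the
codec names and `codecs.BOM_UTF8` come from the standard library. It
only shows that the UTF-8 "BOM" is a fixed three-byte prefix and that
the bytes after it are the same with or without it.

```python
import codecs

text = "caf\u00e9"  # arbitrary sample string (assumption, not from the thread)

plain = text.encode("utf-8")         # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with the BOM prepended

# The UTF-8 BOM is always the same three bytes: EF BB BF.
print(codecs.BOM_UTF8)

# The payload after the BOM is byte-for-byte identical to the plain encoding.
print(with_bom == codecs.BOM_UTF8 + plain)

# Decoding with "utf-8-sig" strips the BOM again.
print(with_bom.decode("utf-8-sig") == text)
```

In other words, the BOM here acts as a signature, not as a byte order
switch.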

> UTF-8 has no apparent endianness if you only store it as a byte stream.
> It does however have a byte order.  If you store it using multibytes
> (six bytes for all UTF-8 possibilities), which is useful if you want
> to have one storage container for each letter as opposed to one for
> each byte(1)

That's a ridiculous proposition. Why would you waste so much space?
UTF-8 exists *precisely* so that you can save space with most scripts.
If you are ready to use 4+ bytes per character, just use UTF-32 which
has much nicer properties.
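
The space argument can be checked with a rough sketch like the one
below. The helper name `sizes` and the sample strings are my own
illustrative choices, not anything from the thread; "utf-32-le" is used
only so that no BOM is added to the count.

```python
def sizes(s):
    """Return (UTF-8 size, UTF-32 size) in bytes for the string s."""
    utf8 = len(s.encode("utf-8"))
    utf32 = len(s.encode("utf-32-le"))  # explicit byte order, so no BOM
    return utf8, utf32

# Hypothetical sample texts, chosen only for illustration.
samples = {
    "English": "Hello, world",
    "French": "d\u00e9j\u00e0 vu",
    "Greek": "\u03b1\u03b2\u03b3",
}

for name, s in samples.items():
    utf8, utf32 = sizes(s)
    print(name, len(s), "chars:", utf8, "bytes UTF-8,", utf32, "bytes UTF-32")
```

UTF-32 always costs four bytes per code point, while UTF-8 costs one
byte for ASCII and two for the accented Latin and Greek letters used
here.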

Bottom line: you are not describing UTF-8, only your own foolish
interpretation of it. UTF-8 does not have any endianness since it is a
byte stream and does not care about "machine words".
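
A small sketch of the difference, assuming Python 3 and an arbitrary
sample character: UTF-16 is defined in terms of 16-bit units, so it
comes in little-endian and big-endian flavours, while UTF-8 has exactly
one byte sequence per character. The "utf-8-le" name below is
deliberately made up to show that no such variant is registered.

```python
import codecs

ch = "\u20ac"  # EURO SIGN, an arbitrary sample character

# UTF-16 works on 16-bit units, so the two byte orders differ.
le = ch.encode("utf-16-le")
be = ch.encode("utf-16-be")
print(le, be, le == be[::-1])

# UTF-8 is defined directly as a sequence of bytes: one form only.
utf8 = ch.encode("utf-8")
print(utf8)

# There is no endian-specific UTF-8 codec to ask for.
try:
    codecs.lookup("utf-8-le")  # made-up name, expected to fail
    has_le_variant = True
except LookupError:
    has_le_variant = False
print(has_le_variant)
```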

Antoine.




