UTF-8 question from Dive into Python 3

Wed Jan 19 09:00:13 EST 2011

Considering your post contained no information or evidence for your
negations, I shouldn't even bother responding.  I will bite once.
Hopefully next time your arguments will contain some pith.

On 2011-01-19, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Wed, 19 Jan 2011 11:34:53 +0000 (UTC)
> Tim Harig <usernet at ilthio.net> wrote:
>> That is why the FAQ I linked to
>> says yes to the fact that you can consider UTF-8 to always be in big-endian
>> order.
>
> It certainly doesn't. Read better.

- Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
- yes, then can I still assume the remaining UTF-8 bytes are in big-endian
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- order?
  ^^^^^^
- 
- A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
     ^^^ 
- to the endianness of the byte stream. UTF-8 always has the same byte
                           ^^^^         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- order. An initial BOM is only used as a signature -- an indication that
  ^^^^^^
- an otherwise unmarked text file is in UTF-8. Note that some recipients of
- UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently
- in 8-bit environments, the use of a BOM will interfere with any protocol
- or file format that expects specific ASCII characters at the beginning,
- such as the use of "#!" of at the beginning of Unix shell scripts.
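
The BOM point in the FAQ answer can be checked with a short Python 3
sketch.  The script text below is a made-up example, and the variable
names are only illustrative; the codecs used ("utf-8" and "utf-8-sig")
are the standard library ones.

```python
import codecs

# Hypothetical payload: a shell script that some tool saved with a UTF-8 BOM.
script = "#!/bin/sh\necho hi\n"
with_bom = codecs.BOM_UTF8 + script.encode("utf-8")

# The BOM is three fixed bytes; it is a signature, not a byte-order switch.
print(codecs.BOM_UTF8)

# A consumer that expects "#!" as the very first bytes no longer sees it.
starts_with_shebang = with_bom.startswith(b"#!")
print(starts_with_shebang)

# The plain "utf-8" codec keeps the BOM as U+FEFF; "utf-8-sig" strips it.
plain = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(repr(plain[0]), stripped == script)
```

Either way the remaining bytes are decoded identically; the BOM only
marks the stream as UTF-8.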

The question that was not addressed was whether you can consider UTF-8
to be little endian.  I pointed out why you cannot always make that
assumption in my previous post.

UTF-8 has no apparent endianness if you only store it as a byte stream.
It does, however, have a byte order.  If you store it in multibyte units
(six bytes for all UTF-8 possibilities), which is useful if you want
to have one storage container for each letter as opposed to one for
each byte(1), the bytes will still have the same order but you have
interrupted its sole existence as a byte stream and have returned it
to the underlying multibyte-oriented representation.  If you attempt
any numeric or binary operations on what is now a multibyte sequence,
the processor will interpret the data using its own endian rules.

If your processor is big-endian, then you don't have any problems.
The processor will interpret the data in the order that it is stored.
If your processor is little endian, then it will effectively change the
order of the bytes for its own evaluation.
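
A rough Python 3 illustration of the above, as a sketch rather than a
description of what any particular processor does: int.from_bytes()
stands in for loading the stored bytes into a multibyte register, and
the helper name as_word is my own invention.  The two characters were
picked because their UTF-8 encodings make the difference visible.

```python
import sys

def as_word(data, byteorder):
    """Interpret a UTF-8 sequence as one integer, as a CPU register would."""
    return int.from_bytes(data, byteorder)

# U+00E9 encodes to C3 A9 and U+0100 encodes to C4 80.
low = "\u00e9".encode("utf-8")
high = "\u0100".encode("utf-8")

# Read big-endian, numeric order matches the stored byte order
# (and the code point order): 0xC3A9 < 0xC480.
big_ordered = as_word(low, "big") < as_word(high, "big")

# Read little-endian, the bytes are effectively swapped for evaluation:
# 0xA9C3 > 0x80C4, so the comparison comes out the other way.
little_ordered = as_word(low, "little") < as_word(high, "little")

print(hex(as_word(low, "big")), hex(as_word(low, "little")))
print(big_ordered, little_ordered)
print("this machine is", sys.byteorder, "endian")
```

The stored bytes never change; only the numeric interpretation does.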

So, you can always assume big-endian and things will work out correctly,
while you cannot always assume little-endian without potential
issues.  The same holds true for any byte stream data.
That is why I say that byte streams are essentially big endian.  It is
all a matter of how you look at it.

I prefer to look at all data as endian even if it doesn't create
endian issues because it forces me to consider any endian issues that
might arise.  If none do, I haven't really lost anything.  If you simply
assume that any byte sequence cannot have endian issues, you ignore the
possibility that such issues might arise.  When an issue like the
above does, you end up with a potential bug.

(1) For Unicode it is probably better to convert characters to
UTF-32/UCS-4 for internal processing; but creating a container large
enough to hold any length of UTF-8 character will work.
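
A minimal Python 3 sketch of the UTF-32 route from the footnote,
assuming a three-character sample string of my own choosing.  Decoding
gives one fixed-size value per character, and the byte order only
becomes a question again once those values are serialized, which is
why the example names "utf-32-be" explicitly.

```python
import struct

# Sample text: one-, two- and three-byte UTF-8 characters (a, U+00E9, U+20AC).
text = "a\u00e9\u20ac"
utf8 = text.encode("utf-8")

# One integer per character, independent of how many bytes UTF-8 needed.
code_points = [ord(ch) for ch in utf8.decode("utf-8")]

# Serializing as UTF-32 forces an explicit byte-order choice.
utf32_be = text.encode("utf-32-be")
unpacked = struct.unpack(">%dI" % len(text), utf32_be)

print(len(utf8), len(utf32_be))
print([hex(cp) for cp in code_points])
print(unpacked == tuple(code_points))
```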