Is this a bug? BOM decoded with UTF8

Diez B. Roggisch deetsNOSPAM at web.de
Fri Feb 11 09:08:51 EST 2005


> What are you talking about? The BOM and UTF-16 go hand in hand.
> Without a Byte Order Mark, you can't unambiguously determine whether big
> or little endian UTF-16 was used. If, for example, you came across a
> UTF-16 text file containing this hexadecimal data: 2200
> what would you assume? That it is a quote character in little-endian
> format or that it is the for-all symbol in big-endian format?

I'm well aware of the need for a BOM with multi-byte encodings like UTF-16,
where the byte order would otherwise be ambiguous.
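
(Just to make that ambiguity concrete - a quick sketch in Python 3 syntax,
using the 2200 bytes from your example and nothing but the stdlib codecs
module:)

import codecs

data = b"\x22\x00"

# Without a BOM the same two bytes decode both ways:
print(repr(data.decode("utf-16-le")))   # a quote character, U+0022
print(repr(data.decode("utf-16-be")))   # the for-all symbol, U+2200

# With a BOM in front, the generic 'utf-16' codec picks the right order:
print(repr((codecs.BOM_UTF16_LE + data).decode("utf-16")))   # '"'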

But I don't see the need for one on a UTF-8 byte sequence. I first
encountered it in the output of some MS tool - I can't remember when or what
exactly that was - and I have to confess I put it down as a stupidity from
MS. But according to the FAQ you mentioned, it is apparently legal in UTF-8
too. Nevertheless the FAQ states:

"""
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?


A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the
endianness of the byte stream. UTF-8 always has the same byte order. An
initial BOM is only used as a signature - an indication that an otherwise
unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded
data do not expect a BOM. Where UTF-8 is used transparently in 8-bit
environments, the use of a BOM will interfere with any protocol or file
format that expects specific ASCII characters at the beginning, such as the
use of "#!" at the beginning of Unix shell scripts. [AF] & [MD]
"""

So they admit that it makes no sense - especially as decoding a UTF-8 byte
string with any 8-bit encoding like latin-1 will succeed anyway, BOM or no
BOM.
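
(For example - once more Python 3 syntax, and the sample text is arbitrary:)

import codecs

data = codecs.BOM_UTF8 + "Hällo".encode("utf-8")

# Every byte value is valid latin-1, so this never raises - it just
# silently produces mojibake, BOM or no BOM:
print(data.decode("latin-1"))          # 'ï»¿HÃ¤llo'

# Plain 'utf-8' decodes fine too, but leaves U+FEFF at the front,
# while 'utf-8-sig' strips the signature if it is present:
print(repr(data.decode("utf-8")))      # '\ufeffHällo'
print(repr(data.decode("utf-8-sig")))  # 'Hällo'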

So in the end, I stand corrected. But I still think it's crap - just not MS
crap. :)

-- 
Regards,

Diez B. Roggisch


