[I18n-sig] UTF-8 and BOM

M.-A. Lemburg mal@lemburg.com
Wed, 16 May 2001 20:48:51 +0200


Paul Prescod wrote:
> 
> Notepad always saves UTF-8 documents with a BOM. Visual Studio 7 gives
> users an option.
> 
> Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading
> character. The UTF-16 decoder removes it. I recognize that the BOM is
> not useful as a "byte order mark" for UTF-8 data but I would still
> suggest that the UTF-8 decoder should remove it for these reasons:
 
>  1) Microsoft has taken the stance that a BOM is legal on UTF-8 data

BOMs are standard Unicode code points (U+FEFF), so they are legal in all
Unicode encodings.
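
To illustrate the point, here is a minimal sketch in present-day Python 3
syntax (not the Python 2.1 discussed in this thread). It checks that U+FEFF
has an ordinary character name and round-trips unchanged through the codecs
that carry an explicit byte order. The names BOM, codecs_to_try and
round_trips are made up for the example.

```python
import unicodedata

BOM = "\ufeff"

# U+FEFF is an ordinary assigned code point with its own character name.
name = unicodedata.name(BOM)

# Codecs with an explicit byte order do not treat U+FEFF specially,
# so encoding and then decoding gives back the same one-character string.
codecs_to_try = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
round_trips = {enc: BOM.encode(enc).decode(enc) == BOM for enc in codecs_to_try}

print(name)
print(round_trips)
```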
 
>  2) Doing so is legal:
> 
> "Q: Is the UTF-8 encoding scheme the same irrespective of whether the
> underlying processor is little endian or big endian?
> 
> A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no
> endian problem as there is for encoding forms that use 16-bit or 32-bit
> code units. Where a BOM is used with UTF-8, it is only to distinguish
> UTF-8 from other UTF encodings - it has nothing to do with byte order.
> [KW]"
> 
> http://www.unicode.org/unicode/faq/utf_bom.html

... as I said :-)
 
>  3) I think that distinguishing UTF-8 from other encodings through the
> BOM is actually a great idea and I wish that every UTF-8 creator would
> do it!

Uhm, I can't follow you here... BOMs in UTF-8 look like this:

>>> u'\ufeff'.encode('utf-8')
'\xef\xbb\xbf'

which is somewhat different from the UTF-16 BOMs '\xff\xfe' and '\xfe\xff'.
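
Because the byte signatures differ, a reader could in principle use them to
guess the encoding of incoming data. The following is only a sketch of that
idea in Python 3 syntax. The function sniff_bom and the table _SIGNATURES are
hypothetical names, not part of any codec. The BOM_* constants come from the
standard codecs module.

```python
import codecs

# Longer signatures are checked first: the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM. This means UTF-16-LE data whose first
# character after the BOM is U+0000 would be misreported as UTF-32-LE.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) guessed from a leading BOM, or (None, 0)."""
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


print(sniff_bom(b"\xef\xbb\xbfabc"))
print(sniff_bom(b"abc"))
```

Data without a BOM yields (None, 0), so a caller would still need a default
encoding or some other heuristic for that case.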
 
>  4) The behavior would be consistent with the UTF-16 behavior.

>>> u'\ufeff'.encode('utf-16')   # BOM in native byte order, then U+FEFF itself
'\xff\xfe\xff\xfe'

>>> u'\ufeff'.encode('utf-16-le')
'\xff\xfe'

>>> u'\ufeff'.encode('utf-16-be')
'\xfe\xff'

>>> u'\ufeff'.encode('utf-8')
'\xef\xbb\xbf'
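
The decoding side, which is what the proposal is about, can be sketched as
follows in Python 3 syntax. The helper decode_utf8_strip_bom is a
hypothetical name used only for illustration, not an existing codec function.
It drops a single leading BOM before decoding, which is roughly the behavior
requested for the UTF-8 decoder.

```python
import codecs


def decode_utf8_strip_bom(data):
    """Decode UTF-8 bytes, dropping one leading BOM if present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")

# The plain UTF-8 decoder keeps the BOM as a leading U+FEFF character.
plain = with_bom.decode("utf-8")

# The helper removes it.
stripped = decode_utf8_strip_bom(with_bom)

# The generic UTF-16 codec writes a BOM on encode and consumes it on decode.
utf16 = "abc".encode("utf-16").decode("utf-16")

print(repr(plain), repr(stripped), repr(utf16))
```

Python releases from 2.5 onward, later than this message, also ship a
"utf-8-sig" codec that skips a leading BOM when decoding.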

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/