[I18n-sig] UTF-8 and BOM
M.-A. Lemburg
mal@lemburg.com
Wed, 16 May 2001 20:48:51 +0200
Paul Prescod wrote:
>
> Notepad always saves UTF-8 documents with a BOM. Visual Studio 7 gives
> users an option.
>
> Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading
> character. The UTF-16 decoder removes it. I recognize that the BOM is
> not useful as a "byte order mark" for UTF-8 data but I would still
> suggest that the UTF-8 decoder should remove it for these reasons:
> 1) Microsoft has taken the stance that a BOM is legal on UTF-8 data
A BOM is an ordinary Unicode code point (U+FEFF), so it is legal in all
Unicode encodings.
> 2) Doing so is legal:
>
> "Q: Is the UTF-8 encoding scheme the same irrespective of whether the
> underlying processor is little endian or big endian?
>
> A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no
> endian problem as there is for encoding forms that use 16-bit or 32-bit
> code units. Where a BOM is used with UTF-8, it is only to distinguish
> UTF-8 from other UTF encodings - it has nothing to do with byte order.
> [KW]"
>
> http://www.unicode.org/unicode/faq/utf_bom.html
... as I said :-)
> 3) I think that distinguishing UTF-8 from other encodings through the
> BOM is actually a great idea and I wish that every UTF-8 creator would
> do it!
Uhm, I can't follow you here... BOMs in UTF-8 look like this:
>>> u'\ufeff'.encode('utf-8')
'\xef\xbb\xbf'
which is somewhat different from the UTF-16 BOMs '\xff\xfe' (little
endian) or '\xfe\xff' (big endian).
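Since the three byte patterns differ, a reader can tell these encodings
apart from the first few bytes alone. A minimal sketch of that idea,
written for a current Python 3 rather than 2.1; sniff_bom is a made-up
helper name, not something from the standard library, and it only covers
the UTF-8 and UTF-16 marks discussed here:

```python
import codecs

# Hypothetical helper, not part of the standard library.
# Only the UTF-8 and UTF-16 marks are listed; none of these three
# is a prefix of another, so the order does not matter.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # b'\xff\xfe'
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # b'\xfe\xff'
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no known BOM leads data."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Data without any of these marks gives (None, 0), so the caller still
needs some other way to pick an encoding in that case.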
> 4) The behavior would be consistent with the UTF-16 behavior.
>>> u'\ufeff'.encode('utf-16')
'\xff\xfe\xff\xfe'
>>> u'\ufeff'.encode('utf-16-le')
'\xff\xfe'
>>> u'\ufeff'.encode('utf-16-be')
'\xfe\xff'
>>> u'\ufeff'.encode('utf-8')
'\xef\xbb\xbf'
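The decoding side of the same comparison, which is what point 4 is
about, could be sketched as follows. This is written for a current
Python 3, not 2.1, and decode_utf8_dropping_bom is a hypothetical helper
name; it drops at most one leading U+FEFF and leaves any later ones
alone:

```python
import codecs

def decode_utf8_dropping_bom(data):
    """Decode UTF-8 bytes, dropping one leading BOM if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

# The plain UTF-8 decoder keeps the BOM as a real leading character ...
plain = b"\xef\xbb\xbfabc".decode("utf-8")

# ... while the generic UTF-16 decoder consumes it.
utf16 = b"\xff\xfea\x00b\x00c\x00".decode("utf-16")

# The helper makes the UTF-8 case behave like the UTF-16 one.
stripped = decode_utf8_dropping_bom(b"\xef\xbb\xbfabc")
```

Later Python releases also ship a 'utf-8-sig' codec that drops a leading
BOM on decoding, so a helper like this is mostly of illustrative value.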
--
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/