Unicode BOM marks

Mon Mar 7 14:24:42 EST 2005

Hi,

For the first time in my programming life, I have to deal with character 
encoding. I have a question about BOM marks. 

If I understand correctly, some systems (Windows?) add a BOM at the beginning 
of a UTF-8 encoded file, and some (Linux?) don't. Therefore, the exact same 
text encoded in the same UTF-8 will result in two different binary files of 
slightly different lengths. 
Right?
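
To make the size difference concrete, here is a minimal sketch. It assumes a 
modern Python 3 interpreter, where the "utf-8-sig" codec is the variant of 
UTF-8 that writes the 3-byte signature; the sample string is only an 
illustration.

```python
text = "caf\u00e9"

# Plain UTF-8: no byte order mark is written.
without_bom = text.encode("utf-8")

# "utf-8-sig" prepends the 3-byte UTF-8 signature (EF BB BF) before the text.
with_bom = text.encode("utf-8-sig")

print(without_bom)
print(with_bom)
print(len(with_bom) - len(without_bom))  # size difference in bytes
```

The two byte strings carry the same text and differ only by the leading 
signature.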

I guess that this leading BOM consists of special marker bytes that cannot, 
in any way, be decoded as valid text.
Right?
(I really, really hope the answer is yes; otherwise we're in hell when moving 
files from one platform to another, even with the same Unicode encoding.)

I also guess that this leading BOM is silently ignored by any Unicode-aware 
file stream reader that has already been told that the file follows the 
UTF-8 encoding standard.
Right?

If so, is that the case with the Python codecs decoders?
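
One way to check this is to decode a byte string that starts with the UTF-8 
signature. The sketch below assumes a modern Python 3 interpreter; behaviour 
in older releases may differ. In that setting the plain "utf-8" codec keeps 
the signature as the character U+FEFF, while "utf-8-sig" drops it.

```python
import codecs

# Bytes as a BOM-writing editor might save them.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain codec decodes the signature as an ordinary character, U+FEFF.
plain = raw.decode("utf-8")

# The "-sig" variant removes a leading signature when one is present.
stripped = raw.decode("utf-8-sig")

# It also accepts input that has no signature at all.
no_bom = "hello".encode("utf-8").decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
print(repr(no_bom))
```

So the three signature bytes are valid UTF-8 in their own right, and whether 
they are skipped depends on which codec name is requested.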

In the Python documentation, I see these constants. The documentation is not 
clear about which encodings these constants apply to. Here's my understanding:

BOM : UTF-8 only or UTF-8 and UTF-32 ?
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_UTF8 : UTF-8 only
BOM_UTF16 : UTF-16 only
BOM_UTF16_BE : UTF-16 only
BOM_UTF16_LE : UTF-16 only
BOM_UTF32 : UTF-32 only
BOM_UTF32_BE : UTF-32 only
BOM_UTF32_LE : UTF-32 only
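
The values of these constants can be inspected directly. This is a small 
sketch assuming the codecs module of a modern Python 3; it compares the 
unqualified names against the explicit ones and against the byte order of the 
machine running it.

```python
import codecs
import sys

# The unqualified names are aliases of the UTF-16 constants.
bom_is_utf16 = codecs.BOM == codecs.BOM_UTF16
be_is_utf16_be = codecs.BOM_BE == codecs.BOM_UTF16_BE
le_is_utf16_le = codecs.BOM_LE == codecs.BOM_UTF16_LE

# BOM_UTF16 and BOM_UTF32 follow the native byte order of this machine.
little = sys.byteorder == "little"
native16 = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
native32 = codecs.BOM_UTF32_LE if little else codecs.BOM_UTF32_BE

print(bom_is_utf16, be_is_utf16_be, le_is_utf16_le)
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF32_BE)  # b'\x00\x00\xfe\xff'
print(codecs.BOM_UTF32_LE)  # b'\xff\xfe\x00\x00'
```

Under that assumption, BOM, BOM_BE and BOM_LE belong to UTF-16 rather than to 
UTF-8 or UTF-32.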

Why should I need these constants if the codecs decoders can handle the BOM 
without my help, given only the encoding?
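
One possible use is guessing the encoding of a file whose encoding is not 
known in advance. The sketch below is only an illustration: sniff_bom and its 
default argument are hypothetical names, not part of the codecs module, and 
it assumes modern Python 3 codec names.

```python
import codecs

# UTF-32 must be tested before UTF-16, because the UTF-32 little-endian
# BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data, default="utf-8"):
    """Return a codec name guessed from a leading BOM (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    # No BOM found: fall back to the caller's assumption.
    return default


sample = "hi".encode("utf-16")  # the "utf-16" codec writes a BOM first
print(sniff_bom(sample))
print(sample.decode(sniff_bom(sample)))
```

The codec names returned here are the ones that consume the BOM while 
decoding, so the decoded text does not start with U+FEFF. A file without a 
BOM cannot be identified this way and falls back to the default.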

Thank you

Francis Girard

Python tells me to use an encoding declaration at the top of my files (the 
message refers to http://www.python.org/peps/pep-0263.html).
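
For illustration, here is a sketch of what such a declaration looks like and 
how it affects the way the source bytes are read. It assumes a modern 
Python 3, where compile() accepts source as bytes; the variable names and the 
"<example>" file name are made up for the demonstration.

```python
# A source file as raw bytes. The first line is the PEP 263 declaration.
# The byte 0xE9 is "e with acute accent" in Latin-1 and is not valid UTF-8
# in this position.
source = b"# -*- coding: latin-1 -*-\nname = 'Andr\xe9'\n"

namespace = {}
# compile() reads the declaration and decodes the rest accordingly.
exec(compile(source, "<example>", "exec"), namespace)

print(namespace["name"])
```

The declaration has to appear on the first or second line of the file to be 
recognised.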

I expected to see there a list of acceptable