Is this a bug? BOM decoded with UTF8

pekka niiranen pekka.niiranen at wlanmail.com
Fri Feb 11 07:51:54 EST 2005


> pekka niiranen wrote:
> 
>> I have two files "my.utf8" and "my.utf16" which
>> both contain BOM and two "a" characters.
>>
>> Contents of "my.utf8" in HEX:
>>     EFBBBF6161
>>
>> Contents of "my.utf16" in HEX:
>>     FEFF6161
> 
> 
> This is not true: this byte string does not denote
> two "a" characters. Instead, it is a single character
> U+6161.
> 
Correct, I used a hex editor to create those files.
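
To illustrate the point about the byte layout, here is a minimal sketch, assuming Python 3 syntax (which postdates this thread). The variable names are only illustrative. It decodes the bytes FE FF 61 61 as UTF-16 and then builds what the file would need to contain for two "a" characters.

```python
import codecs

# The bytes from the original "my.utf16": BOM (FE FF) followed by 61 61.
original = bytes.fromhex("FEFF6161")
decoded = original.decode("utf-16")   # 61 61 is read as one 16-bit code unit

# What the file would need to hold for two "a" characters:
# a big-endian BOM followed by 00 61 00 61.
intended = codecs.BOM_UTF16_BE + "aa".encode("utf-16-be")

print(len(decoded), hex(ord(decoded)))
print(intended.hex())
```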

>> Is there a trick to read UTF8 encoded file with BOM not decoded?
> 
> 
> It's very easy: just drop the first character if it is the BOM.

I know it's easy (string.replace()), but why does the UTF-16 codec
do it on its own, then? Is that according to the Unicode standard,
or just a Python convention?
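
The difference being asked about can be sketched as follows, again assuming Python 3 syntax. The helper name decode_utf8_drop_bom is hypothetical, not part of any library. It removes only a single leading U+FEFF, as suggested in the quoted reply, rather than replacing every occurrence in the text.

```python
import codecs

def decode_utf8_drop_bom(data):
    """Decode UTF-8 bytes and drop one leading BOM (U+FEFF), if present."""
    text = data.decode("utf-8")
    if text.startswith("\ufeff"):
        text = text[1:]
    return text

utf8_bytes = codecs.BOM_UTF8 + b"aa"                            # EF BB BF 61 61
utf16_bytes = codecs.BOM_UTF16_BE + "aa".encode("utf-16-be")    # FE FF 00 61 00 61

plain = utf8_bytes.decode("utf-8")          # BOM stays in the result as U+FEFF
stripped = decode_utf8_drop_bom(utf8_bytes)  # BOM dropped by hand
utf16_text = utf16_bytes.decode("utf-16")    # BOM consumed to pick the byte order

print(repr(plain), repr(stripped), repr(utf16_text))
```

The "utf-16" codec needs the BOM to choose between big-endian and little-endian, which is why it consumes it. UTF-8 has no byte-order ambiguity, so the plain codec passes the character through.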

> 
> The UTF-8 codec will never do this on its own.


Never? Hmm, so that is not going to change in future versions?
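
As a later footnote to this question: Python releases after this thread (2.5 onwards, as far as I know) include a separate "utf-8-sig" codec that skips a leading BOM on decoding, while the plain "utf-8" codec is left unchanged. A minimal sketch, assuming Python 3 syntax and illustrative variable names:

```python
import codecs

data = codecs.BOM_UTF8 + b"aa"            # EF BB BF 61 61

with_sig = data.decode("utf-8-sig")       # leading BOM is skipped
no_bom_input = b"aa".decode("utf-8-sig")  # input without a BOM decodes as usual

print(repr(with_sig), repr(no_bom_input))
```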

> Regards,
> Martin


