Is this a bug? BOM decoded with UTF8

Brian Quinlan brian at sweetapp.com
Fri Feb 11 10:09:36 EST 2005


Diez B. Roggisch wrote:
> I'm well aware of the need of a bom for fixed-size multibyte-characters like
> utf16.
> 
> But I don't see the need for that on an utf-8 byte sequence, and I first
> encountered that in MS tool output - can't remember when and what exactly
> that was. And I have to confess that I attributed that as a stupidity from
> MS. But according to the FAQ you mentioned, it is apparently legal in utf-8
> too. Nevertheless the FAQ states:
> 
[snipped]
> So they admit that it makes no sense - especially as decoding a utf-8 string
> given any 8-bit encoding like latin1 will succeed.

They say that it makes no sense as a byte-order indicator but they 
indicate that it can be used as a file signature.

And I'm not sure what you mean about decoding a UTF-8 string given any 
8-bit encoding. Of course the encoding must be known:

 >>> (u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'
 ...  .encode('utf-8').decode('latin1').encode('latin1'))
 'T\xc3\xbcr'

I can assure you that most Germans can differentiate between "Tür" and 
"TÃ¼r".

Using a BOM with UTF-8 makes it easy to identify it as such AND it 
shouldn't break any properly written Unicode-aware tools.
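
To make the "file signature" idea concrete, here is a rough sketch of how a 
tool might use the BOM to pick a decoder. It is written for Python 3 rather 
than the interpreter used in the session above, the function name 
decode_with_signature is made up for illustration, and the latin-1 fallback 
is only an assumption about what the unsigned data might be:

```python
import codecs


def decode_with_signature(data, fallback="latin-1"):
    """Return (text, encoding_used) for a byte string.

    If the data starts with the UTF-8 BOM, treat the BOM as a signature,
    strip it, and decode the rest as UTF-8. Otherwise decode with the
    caller-supplied fallback, which is only a guess.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    return data.decode(fallback), fallback


word = "T\N{LATIN SMALL LETTER U WITH DIAERESIS}r"

# UTF-8 bytes with the signature: identified and decoded correctly.
signed = codecs.BOM_UTF8 + word.encode("utf-8")

# UTF-8 bytes without the signature: the latin-1 guess "succeeds" but is wrong.
unsigned = word.encode("utf-8")

signed_text, signed_encoding = decode_with_signature(signed)
unsigned_text, unsigned_encoding = decode_with_signature(unsigned)

print(signed_encoding, signed_text == word)
print(unsigned_encoding, unsigned_text == word)
```

Without the signature the decode raises no error and simply returns the 
wrong text, which is exactly the problem with guessing an 8-bit encoding.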

Cheers,
Brian


