Is this a bug? BOM decoded with UTF8

Diez B. Roggisch deetsNOSPAM at web.de
Fri Feb 11 12:10:34 EST 2005


> They say that it makes no sense as a byte-order indicator but they
> indicate that it can be used as a file signature.
> 
> And I'm not sure what you mean about decoding a UTF-8 string given any
> 8-bit encoding. Of course the encoder must be known:

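I mean that every utf-8 string can be decoded with any single-byte encoding
that assigns a character to every byte value, such as latin1. Does the result
make sense? No. But does the decoding fail (as decoding arbitrary bytes as
utf-8 frequently does)? No.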
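
A minimal sketch of that asymmetry, assuming Python 3 (where str is Unicode
and encode() returns bytes). The variable names are only illustrative:

```python
# "T\u00fcr" is "Tür"; the escape keeps this snippet independent of the
# encoding of the source file itself.
utf8_bytes = "T\u00fcr".encode("utf-8")      # b'T\xc3\xbcr'

# latin1 assigns a character to every byte value 0-255, so decoding any
# byte string with it never raises. The result is mojibake, though.
wrong = utf8_bytes.decode("latin1")

# The other direction is strict: the latin1 bytes for the same word are
# not a valid utf-8 sequence, so the utf-8 decoder refuses them.
latin1_bytes = "T\u00fcr".encode("latin1")   # b'T\xfcr'
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

So a successful decode with a single-byte codec tells you nothing, while a
failed utf-8 decode at least rules utf-8 out.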

So if you are in a situation where you _don't_ know the encoding, a decoding
can only be based on a heuristic. And a utf-8 BOM can be part of that
heuristic - but it still is only a hint. Besides that, lots of tools don't
produce it. E.g. everything that produces/consumes xml doesn't need it.
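
One way such a heuristic might look, as a rough sketch assuming Python 3.
guess_encoding is a hypothetical helper, and the latin1 fallback is an
arbitrary choice of mine, not something any standard prescribes:

```python
import codecs

def guess_encoding(data):
    """Hypothetical helper: guess a codec name for the bytes in `data`."""
    if data.startswith(codecs.BOM_UTF8):
        # The BOM is a hint, not proof. The "utf-8-sig" codec decodes
        # utf-8 and drops a leading BOM if one is present.
        return "utf-8-sig"
    try:
        data.decode("utf-8")
        return "utf-8"      # valid utf-8, which is still only a guess
    except UnicodeDecodeError:
        return "latin1"     # arbitrary fallback that cannot fail to decode
```

Note that pure ASCII input comes back as "utf-8" here although it would be
equally valid in latin1; the function guesses, it does not know.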

>  >>> u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'
> ...   .encode('utf-8').decode('latin1').encode('latin1')
> 'T\xc3\xbcr'

If the encoding has to be known anyway, the BOM becomes redundant.

> I can assure you that most Germans can differentiate between "Tür" and
> "TÃ¼r".

Oh, Germans can. Computers, OTOH, can't. You could try to use common words
like "für" and so on for a heuristic. But that is no guarantee.
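
A sketch of the common-word idea, assuming Python 3. The word list, the
candidate codecs and the name guess_by_words are all made up for
illustration:

```python
# Illustrative word list: "für", "über", "Tür", written with escapes.
COMMON_WORDS = ("f\u00fcr", "\u00fcber", "T\u00fcr")

def score(text):
    """Count occurrences of the common words in decoded text."""
    return sum(text.count(word) for word in COMMON_WORDS)

def guess_by_words(data, candidates=("utf-8", "latin1")):
    """Hypothetical helper: pick the candidate codec with the best score."""
    best, best_score = None, -1
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue            # this codec is ruled out entirely
        current = score(text)
        if current > best_score:
            best, best_score = encoding, current
    return best
```

If none of the words occur, every candidate that decodes scores zero and
the first one in the list wins, which is exactly the "no guarantee" part.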

> Using a BOM with UTF-8 makes it easy to identify it as such AND it
> shouldn't break any properly written Unicode-aware tools.

As the FAQ states, such breakage can very well happen.
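
To illustrate one way it can go wrong (my own example, not one taken from
the FAQ), assuming Python 3: the plain "utf-8" codec keeps a leading BOM as
the character U+FEFF, so a consumer that inspects the first characters can
be thrown off.

```python
import codecs

# A shell script that some editor saved with a utf-8 BOM in front.
data = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

text = data.decode("utf-8")           # the BOM survives as U+FEFF
naive_ok = text.startswith("#!")      # a naive check no longer matches

stripped = data.decode("utf-8-sig")   # this codec drops the leading BOM
aware_ok = stripped.startswith("#!")
```

A tool has to expect the BOM and strip it explicitly; being Unicode-aware
alone is not enough.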

-- 
Regards,

Diez B. Roggisch


