Is this a bug? BOM decoded with UTF8
Diez B. Roggisch
deetsNOSPAM at web.de
Fri Feb 11 12:10:34 EST 2005
> They say that it makes no sense as a byte-order indicator but they
> indicate that it can be used as a file signature.
>
> And I'm not sure what you mean about decoding a UTF-8 string given any
> 8-bit encoding. Of course the encoder must be known:
I mean that every UTF-8 string can be decoded with any 8-bit encoding that
defines all 256 byte values (latin1, for example). Does the result make
sense? No. But does the decoding fail (as decoding arbitrary bytes as UTF-8
frequently does)? No.
So if you are in a situation where you _don't_ know the encoding, decoding
can only be based on a heuristic. And a UTF-8 BOM can be part of that
heuristic, but it is still only a hint. Besides that, lots of tools don't
produce it. E.g. everything that produces/consumes XML doesn't need it.
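
To make the point concrete, here is a rough sketch in current Python 3 syntax. The helper name guess_encoding and the latin-1 fallback are my own assumptions for illustration, not anything from the FAQ: the BOM is checked first, then a strict UTF-8 decode is tried, and latin-1 is the last resort because it never fails.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: return a best-guess codec name for `data`.

    This is only a heuristic. A BOM is a hint, and a successful
    UTF-8 decode does not prove the data really was UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # BOM present: strong hint, still no proof
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # assumed fallback: maps every byte, never fails


utf8_bytes = "Tür".encode("utf-8")      # b'T\xc3\xbcr'
latin1_bytes = "Tür".encode("latin-1")  # b'T\xfcr'

# Decoding UTF-8 bytes as latin-1 "works", but yields nonsense.
mojibake = utf8_bytes.decode("latin-1")

# Decoding latin-1 bytes as UTF-8 fails, because 0xFC cannot start a sequence.
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

Note that the latin-1 branch is reached for any non-UTF-8 input, whatever its real encoding was, which is exactly why this stays a guess.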
> >>> u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'
> ... .encode('utf-8').decode('latin1').encode('latin1')
> 'T\xc3\xbcr'
If the encoding has to be known anyway, the BOM becomes superfluous.
> I can assure you that most Germans can differentiate between "Tür" and
> "TÃ¼r".
Oh, Germans can. Computers, OTOH, can't. You could try to use common words
like "für" and so on for a heuristic. But that is no guarantee.
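
A minimal sketch of such a word-based heuristic, in current Python 3 syntax. The word list, the candidate encodings and the function names are all made up for illustration; a real detector would need far more data.

```python
# Illustrative only: a tiny set of common German words.
COMMON_WORDS = {"für", "über", "Tür", "und", "der"}


def score(text: str) -> int:
    """Count whitespace-separated tokens that are known common words."""
    return sum(1 for word in text.split() if word.strip(".,;:!?") in COMMON_WORDS)


def pick_encoding(data: bytes, candidates=("utf-8", "latin-1")):
    """Return the candidate whose decoding contains the most known words.

    Candidates that fail to decode are skipped. Ties go to the earlier
    candidate, so the answer for pure ASCII input is arbitrary.
    """
    best, best_score = None, -1
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        current = score(text)
        if current > best_score:
            best, best_score = encoding, current
    return best


sentence = "Die Tür ist für dich"
from_utf8 = pick_encoding(sentence.encode("utf-8"))
from_latin1 = pick_encoding(sentence.encode("latin-1"))
ambiguous = pick_encoding(b"abc")  # both candidates score 0
```

The last call shows the "no guarantee" part: when no known word turns up, the result only reflects the order of the candidates.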
> Using a BOM with UTF-8 makes it easy to identify it as such AND it
> shouldn't break any properly written Unicode-aware tools.
As the FAQ states, such breakage can very well happen.
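
One way this can show up, sketched in current Python 3 syntax (the sample data is made up, and the "utf-8-sig" codec is assumed to be available, as it is in current Python versions): a consumer that decodes with plain "utf-8" keeps the BOM as a leading U+FEFF character, which then trips up naive comparisons.

```python
import codecs

# Made-up sample: a UTF-8 payload with the BOM used as a file signature.
data = codecs.BOM_UTF8 + "Tür".encode("utf-8")

# The plain codec decodes the BOM into a leading U+FEFF character.
plain = data.decode("utf-8")
starts_with_t = plain.startswith("T")  # False: U+FEFF comes first

# The signature-aware codec drops a leading BOM if there is one.
stripped = data.decode("utf-8-sig")
```

Whether a given tool behaves like the first or the second decode is precisely what you cannot rely on.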
--
Regards,
Diez B. Roggisch