[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Fri Jan 8 16:56:46 CET 2010

On Fri, Jan 8, 2010 at 1:05 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>>> description of the issues:
>>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>.  Basically, some
>>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>>> being UTF-8, so it's become a convention to do that.  That's not good
>>> enough, so you need to guess the encoding as well to make sure, but if there
>>> is a BOM and you can otherwise verify that the file is probably UTF-8
>>> encoded, you should discard it.
>>
>> That doesn't make sense. If the file isn't UTF-8 you can't see the
>> BOM, because the BOM itself is UTF-8-encoded.
>
> I think what Glyph meant is this: if a file starts with the UTF-8
> signature, assume it's UTF-8. Then validate the assumption against the
> rest of the file also, and then process it as UTF-8. If the rest clearly
> is not UTF-8, assume that the UTF-8 signature is bogus.
>
> I understood this proposal as a general processing guideline, not
> something the io library should do (but, say, a text editor).
>
> FWIW, I'm personally in favor of using the UTF-8 signature. If people
> consider it crazy talk, that may be because UTF-8 can't possibly have
> a byte order - hence I call it a signature, not the BOM. As a signature,
> I don't consider it crazy at all. There is a long tradition of having
> magic bytes in files (executable files, PostScript, PDF, ... - see
> /etc/magic). Having a magic byte sequence for plain text to denote the
> encoding is useful and helps reduce mojibake. This is the reason it's
> used on Windows: Notepad would normally assume that text is in the ANSI
> code page, and for compatibility, it can't stop doing that. So the UTF-8
> signature gives them an exit strategy.

Sure. I said "crazy talk" only to stir up discussion. Which worked. :-)

Also, I don't want Python's default behavior to change -- sniffing the
encoding should be a separate option.
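
For illustration only, the guideline Martin describes above could be
sketched as an explicit, opt-in helper rather than a change to open().
The name sniff_utf8_signature and the latin-1 fallback are assumptions
made for this sketch, not anything proposed in the thread:

```python
import codecs


def sniff_utf8_signature(data, fallback="latin-1"):
    """Decode bytes, trusting a leading UTF-8 signature only if it checks out.

    Hypothetical helper: the name and the fallback encoding are
    illustrative assumptions, not part of the io library.
    Returns a (text, encoding_used) tuple.
    """
    if data.startswith(codecs.BOM_UTF8):
        try:
            # Validate the assumption against the rest of the data.
            text = data[len(codecs.BOM_UTF8):].decode("utf-8")
            return text, "utf-8"
        except UnicodeDecodeError:
            # The rest is clearly not UTF-8, so treat the signature as bogus.
            pass
    # No signature, or a bogus one: keep all bytes and use the fallback.
    return data.decode(fallback), fallback
```

When the caller already knows the file is UTF-8, the existing
"utf-8-sig" codec strips a leading signature on decode, e.g.
open(path, encoding="utf-8-sig"), and the default stays untouched.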

-- 
--Guido van Rossum (python.org/~guido)