[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Fri Jan 8 22:49:23 CET 2010

On Jan 8, 2010, at 4:14 PM, Tres Seaver wrote:
>> I understood this proposal as a general processing guideline, not
>> something the io library should do (but, say, a text editor).
>>
>> FWIW, I'm personally in favor of using the UTF-8 signature. If people
>> consider it crazy talk, that may be because UTF-8 can't possibly have
>> a byte order - hence I call it a signature, not the BOM. As a
>> signature, I don't consider it crazy at all. There is a long tradition
>> of having magic bytes in files (executable files, PostScript, PDF,
>> ... - see /etc/magic). Having a magic byte sequence for plain text to
>> denote the encoding is useful and helps reduce mojibake. This is the
>> reason it's used on Windows: Notepad would normally assume that text
>> is in the ANSI code page, and for compatibility, it can't stop doing
>> that. So the UTF-8 signature gives them an exit strategy.
>
> Agreed.  Having that marker at the start of the file makes interop with
> other tools *much* easier.

Putting a BOM at the beginning of UTF-8 text files is not a good idea:
it makes interop much *worse* on a Unix system, not better. Without the
BOM, most commands do the right thing with UTF-8 text. E.g. to
concatenate two files:

$ cat file-1 file-2 > file-3

With a BOM at the beginning of each file, that won't work right: the BOM
from file-2 ends up in the middle of file-3. Of course, you could modify
"cat" (and every other stream processing command) to know how to consume
and emit BOMs, and omit the extra one that would show up in the middle
of the stream... but even that can't work; what about:

$ (cat file-1; cat file-2) > file-3

Should the shell now know that when you run multiple commands, it  
should eat the BOM emitted from the second command?
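
To make the problem concrete, here is a minimal Python sketch of what a
byte-wise concatenation does to two BOM-prefixed files. The file
contents and the concatenate() helper are made up for illustration;
concatenate() only stands in for what "cat" does to the bytes.

```python
import codecs


def concatenate(*chunks: bytes) -> bytes:
    """Join byte strings end to end, the way `cat` joins files."""
    return b"".join(chunks)


# Two hypothetical UTF-8 files, each starting with the 3-byte signature.
file_1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file_2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")

file_3 = concatenate(file_1, file_2)

# A BOM-aware decoder strips only a signature at the very start, so the
# second signature survives as U+FEFF in the middle of the text.
text = file_3.decode("utf-8-sig")
print(repr(text))  # 'first\n\ufeffsecond\n'
```

The stray U+FEFF is invisible in most terminals and editors, but it is
still part of the data as far as any tool comparing or parsing the text
is concerned.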

Basically, using a BOM in a UTF-8 file is just not a good idea: it
completely ruins interop with every standard Unix tool.

This is not to say that Python shouldn't have a way to read a file  
with a UTF-8 BOM: it just shouldn't encourage you to *write* such files.
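
As a rough sketch of that split, Python's existing "utf-8-sig" codec
skips a leading signature when decoding, while plain "utf-8" never
writes one. The read_text() and write_text() helpers and the file names
below are hypothetical, and this is only one possible way to do it, not
a proposal for how open() itself should behave.

```python
import codecs
import os
import tempfile


def read_text(path):
    """Read UTF-8 text, skipping a leading signature if one is present."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


def write_text(path, text):
    """Write UTF-8 text without a signature (newline="" avoids translation)."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical input file that starts with the UTF-8 signature.
    with_bom = os.path.join(tmp, "with-bom.txt")
    with open(with_bom, "wb") as f:
        f.write(codecs.BOM_UTF8 + "hello\n".encode("utf-8"))

    # Read it tolerantly, then write it back out without the signature.
    without_bom = os.path.join(tmp, "without-bom.txt")
    write_text(without_bom, read_text(with_bom))

    with open(without_bom, "rb") as f:
        raw = f.read()

print(raw)  # b'hello\n'
```

Files without a signature go through read_text() unchanged, so the same
call works for both kinds of input.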

James