[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Fri Jan 8 03:23:08 CET 2010

Guido van Rossum wrote:
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
> talk. And for the other two, perhaps it would make more sense to have
> a separate encoding-guessing function that takes a binary stream and
> returns a text stream wrapping it with the proper encoding?
> 
Alternatively, have a universal UTF-8/16/32 encoding, i.e. one that
expects UTF-8, with or without a BOM, or UTF-16/32 with a BOM.
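
A rough sketch of what such BOM-based selection might look like, written
as the kind of wrapping function suggested above. The helper name
open_text_guessing_bom and its default parameter are hypothetical, not an
existing API; it assumes a seekable binary stream and falls back to UTF-8
when no BOM is found. Note that a UTF-16-LE file whose first character is
U+0000 is indistinguishable from a UTF-32-LE BOM, so this is a guess, not
a guarantee.

```python
import codecs
import io

# Check the longest BOMs first: the UTF-32-LE BOM starts with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_text_guessing_bom(binary_stream, default="utf-8"):
    """Hypothetical helper: wrap a seekable binary stream as text.

    The encoding is chosen from the BOM, if any; otherwise `default`.
    """
    start = binary_stream.tell()
    head = binary_stream.read(4)
    binary_stream.seek(start)  # leave the BOM for the codec to consume

    encoding = default
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    # The "utf-16", "utf-32" and "utf-8-sig" codecs strip the BOM themselves.
    return io.TextIOWrapper(binary_stream, encoding=encoding)
```

Leaving the BOM in the stream and handing it to the "utf-16"/"utf-32"
codecs means the byte order is worked out by the codec rather than by the
helper.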
> 
> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
> <victor.stinner at haypocalc.com> wrote:
>> Hi,
>>
>> The builtin open() function is unable to open a UTF-16/32 file starting with a
>> BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8
>> file starting with a BOM, read()/readline() also return the BOM, whereas the
>> BOM should be "ignored".
>>
>> See recent issues related to reading a UTF-8 text file that includes a BOM: #7185
>> (csv) and #7519 (ConfigParser). Such a file can be opened in unicode mode with
>> the UTF-8-SIG encoding, but it's possible to do better.
>>
>> I propose to improve open() (TextIOWrapper) by using the BOM to choose the
>> right encoding. I think that only files opened in read-only mode should
>> support this new feature. *Reading* the BOM in a *write*-only file would cause
>> unexpected behaviour.
>>
>> Since my proposal changes the result of TextIOWrapper.read()/readline() for
>> files starting with a BOM, we might introduce an option to open() to enable
>> the new behaviour. But is it really necessary to keep backward compatibility?
>>
>> I wrote a proof of concept attached to issue #7651. My patch only changes
>> the behaviour of TextIOWrapper for reading files starting with a BOM. It
>> doesn't work yet if seek() is used before the first read.
>>
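
For reference, the current behaviour described at the top of the quoted
message can be reproduced with in-memory streams. This is only an
illustration: the sample data is made up, and an explicit "utf-8" encoding
stands in for whatever default the locale would supply when open() is
called without one.

```python
import codecs
import io

raw_utf8 = codecs.BOM_UTF8 + "spam\n".encode("utf-8")

# Plain UTF-8 hands the BOM back as U+FEFF at the start of the first line.
plain = io.TextIOWrapper(io.BytesIO(raw_utf8), encoding="utf-8").readline()

# UTF-8-SIG strips it.
stripped = io.TextIOWrapper(io.BytesIO(raw_utf8), encoding="utf-8-sig").readline()

# A UTF-16 file with a BOM cannot be decoded as UTF-8 at all:
# the BOM bytes are not valid UTF-8.
raw_utf16 = codecs.BOM_UTF16_LE + "spam\n".encode("utf-16-le")
try:
    io.TextIOWrapper(io.BytesIO(raw_utf16), encoding="utf-8").read()
    utf16_failed = False
except UnicodeDecodeError:
    utf16_failed = True

print(repr(plain), repr(stripped), utf16_failed)
```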