[Python-Dev] Improve open() to support reading file starting with an unicode BOM

M.-A. Lemburg mal at egenix.com
Mon Jan 11 21:44:34 CET 2010


Olemis Lang wrote:
>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
>> <victor.stinner at haypocalc.com> wrote:
>>> Hi,
>>>
>>> The builtin open() function is unable to open a UTF-16/32 file starting with a
>>> BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8
>>> file starting with a BOM, read()/readline() also return the BOM, whereas the
>>> BOM should be "ignored".
>>>
> [...]
>>
> 
> I had similar issues too (please read below ;o) ...
> 
> On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <guido at python.org> wrote:
>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
>> talk. And for the other two, perhaps it would make more sense to have
>> a separate encoding-guessing function that takes a binary stream and
>> returns a text stream wrapping it with the proper encoding?
>>
> 
> About guessing the encoding, I experienced this issue while I was
> developing a Trac plugin. What I was doing is as follows :
> 
> - I guessed the MIME type + charset encoding using Trac MIME API (it
> was a CSV file encoded using UTF-16)
> - I read the file using `open`
> - Then wrapped the file using `codecs.EncodedFile`
> - Then used `csv.reader`
> 
> ... and still get the BOM in the first value of the first row in the CSV file.

You didn't say, but I presume that the charset guessing logic
returned either 'utf-16-le' or 'utf-16-be' - those encodings don't
remove the leading BOM. The 'utf-16' codec will remove the BOM.
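
A minimal sketch of the difference, assuming Python 3 and made-up sample
bytes (the data and variable names are illustrative, not taken from the
plugin in question):

```python
import codecs

# Sample bytes (not from the thread): a BOM followed by UTF-16-LE text.
data = codecs.BOM_UTF16_LE + "a,b".encode("utf-16-le")

# Byte-order-specific codec: the BOM is decoded like any other character.
kept = data.decode("utf-16-le")

# Generic codec: the BOM selects the byte order and is then dropped.
dropped = data.decode("utf-16")

print(repr(kept))
print(repr(dropped))
```

With the byte-order-specific codec the first character of the result is
U+FEFF; with 'utf-16' it is the first real character of the text.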

> {{{
> #!python
> 
> >>> mimetype
> 'utf-16-le'
> >>> ef = EncodedFile(f, 'utf-8', mimetype)
> }}}

Same here: the UTF-8 codec will not remove the BOM; you have
to use the 'utf-8-sig' codec for that.
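
For the CSV case, something along these lines should show the effect. This
is only a sketch, assuming Python 3; the sample bytes and the helper
first_cell() are hypothetical and stand in for the real file:

```python
import codecs
import csv
import io

# Hypothetical CSV content: a UTF-8 BOM followed by two rows.
raw = codecs.BOM_UTF8 + b"name,value\r\n1,2\r\n"

def first_cell(encoding):
    """Return the first value of the first CSV row, decoded with `encoding`."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return next(csv.reader(text))[0]

with_bom = first_cell("utf-8")         # BOM ends up inside the first value
without_bom = first_cell("utf-8-sig")  # BOM is skipped by the codec

print(repr(with_bom))
print(repr(without_bom))
```

The 'utf-8-sig' codec also accepts input that has no BOM at all, so it can
be used when it is not known in advance whether one is present.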

> IMO I think I am +1 for leaving `open` just like it is, and use module
> `codecs` to deal with encodings, but I am strongly -1 for returning
> the BOM while using `EncodedFile` (mainly because encoding is
> explicitly supplied in ;o)

Note that EncodedFile() doesn't do any fancy BOM detection or
filtering. This is the job of the codecs.
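
To illustrate, here is a rough sketch (Python 3, in-memory sample data,
hypothetical helper recode()) that passes the same bytes through
EncodedFile() with two different file encodings. Only the codec choice
changes the outcome:

```python
import codecs
import io

# Sample bytes (illustrative only): a BOM followed by UTF-16-LE text.
raw = codecs.BOM_UTF16_LE + "a,b".encode("utf-16-le")

def recode(file_encoding):
    # EncodedFile(file, data_encoding, file_encoding): reads decode the
    # file with file_encoding and hand back bytes in data_encoding.
    ef = codecs.EncodedFile(io.BytesIO(raw), "utf-8", file_encoding)
    return ef.read()

le_result = recode("utf-16-le")    # BOM survives, re-encoded as UTF-8
generic_result = recode("utf-16")  # BOM consumed by the codec

print(le_result)
print(generic_result)
```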

Also note that BOM removal is only valid at the beginning of
a file. All subsequent BOM bytes have to be read as-is, without
removing them (they map to U+FEFF, the zero-width no-break space).
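
A small sketch of that behaviour, assuming Python 3 and made-up input with
a second BOM sequence in the middle of the data:

```python
import codecs
import unicodedata

# Illustrative input: a leading BOM, then "a", a second BOM sequence, then "b".
raw = codecs.BOM_UTF8 + b"a" + codecs.BOM_UTF8 + b"b"

# Only the leading BOM is stripped; the interior one is kept as a character.
text = raw.decode("utf-8-sig")
interior = text[1]
name = unicodedata.name(interior)

print(repr(text))
print(name)
```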

-- 
Marc-Andre Lemburg
eGenix.com



