[Python-Dev] Improve open() to support reading file starting with an unicode BOM

MRAB python at mrabarnett.plus.com
Mon Jan 11 18:35:35 CET 2010


Lennart Regebro wrote:
> On Mon, Jan 11, 2010 at 11:37, Walter Dörwald <walter at livinglogic.de> wrote:
>> UTF-8 might be a good choice
> 
> No, fallback if there is no BOM should be the local settings, just as
> fallback is today if you don't specify a codec.
> I mean, what if you want to look for a BOM but fall back to something
> else? How far will we go with encoding special information in the
> codecs names? codec='BOM else UTF-16 else locale'? :-)
> 
> BOM is not a locale, and should not be a locale. Having a locale
> called BOM is wrong per se. It should either be default to look for a
> BOM when codec=None, or a special parameter. If none of these are
> desired, then we need a special function that takes a filename or file
> handle, and looks for a BOM and returns the codec found or None. But
> I find that much less natural and obvious than checking the BOM when codec=None.
> 
Or pass a function that accepts a byte stream (or just the first few
bytes) and returns the encoding together with any bytes it read but did
not consume as a BOM (because the byte stream might not be seekable)?

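def guess_encoding(byte_stream):
    # Peek at the first two bytes to look for a UTF-16 big-endian BOM.
    data = byte_stream.read(2)
    if data == b"\xFE\xFF":
        # BOM found and consumed: nothing left over.
        return "UTF-16BE", b""
    # No BOM: fall back to UTF-8 and hand back the bytes already read.
    return "UTF-8", data

# Proposed usage: open() would accept a callable for 'encoding'.
text_file = open(filename, encoding=guess_encoding)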
...
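
For what it's worth, something close to this can probably be sketched on
top of the existing io module, without changing open() at all. The
names open_guessing and _PrefixedRaw below are made up for illustration
and are not part of any proposal or of the standard library; the
fallback to UTF-8 just mirrors the example above. The guess function
keeps the same contract (return the encoding plus any unused bytes), and
the unused bytes are replayed in front of the rest of the stream so no
seek is needed.

```python
import codecs
import io

# Longest BOMs first, so UTF-32-LE is not mistaken for UTF-16-LE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(byte_stream):
    """Return (encoding, unused_bytes) based on a leading BOM, if any."""
    data = byte_stream.read(4)
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return "utf-8", data  # assumed fallback, as in the example above


class _PrefixedRaw(io.RawIOBase):
    """Hypothetical helper: replay leftover bytes, then read from `raw`."""

    def __init__(self, prefix, raw):
        self._prefix = prefix
        self._raw = raw

    def readable(self):
        return True

    def readinto(self, buf):
        if self._prefix:
            n = min(len(buf), len(self._prefix))
            buf[:n] = self._prefix[:n]
            self._prefix = self._prefix[n:]
            return n
        return self._raw.readinto(buf)

    def close(self):
        self._raw.close()
        super().close()


def open_guessing(filename, guess=guess_encoding):
    """Open `filename` for reading text, letting `guess` pick the encoding."""
    raw = open(filename, "rb")
    try:
        encoding, unused = guess(raw)
    except Exception:
        raw.close()
        raise
    stream = io.BufferedReader(_PrefixedRaw(unused, raw))
    return io.TextIOWrapper(stream, encoding=encoding)
```

Usage would then be text_file = open_guessing(filename), with a
different guess function passed in if another fallback (say, the locale
encoding) is wanted.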


