Guessing the encoding from a BOM

Chris Angelico rosuav at gmail.com
Thu Jan 16 13:06:16 EST 2014


On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjourne at gmail.com> wrote:
> 2014/1/16 Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
>> def guess_encoding_from_bom(filename, default):
>>     with open(filename, 'rb') as f:
>>         sig = f.read(4)
>>     # Test UTF-32 first: its little-endian BOM begins with the UTF-16 one.
>>     if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
>>     elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     else:
>>         return default
>
> You might want to add the UTF-8 BOM too: b'\xEF\xBB\xBF'.

I'd actually rather not. It would tempt people to pollute UTF-8 files
with a BOM, which is not necessary unless you are MS Notepad.
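
For what it's worth, here is a rough sketch of the same idea built on
the BOM constants in the stdlib codecs module. The table name _BOMS is
my own invention. Signatures are tried longest first, since the
UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE one. One
known weakness: a UTF-16-LE file whose first character after the BOM is
U+0000 would be reported as UTF-32.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM (FF FE 00 00) starts with
# the UTF-16-LE BOM (FF FE), so the order of the checks matters.
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf_32'),
    (codecs.BOM_UTF32_LE, 'utf_32'),
    (codecs.BOM_UTF16_BE, 'utf_16'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
)

def guess_encoding_from_bom(filename, default):
    """Return a codec name based on the file's BOM, else `default`."""
    with open(filename, 'rb') as f:
        sig = f.read(4)
    for bom, encoding in _BOMS:
        if sig.startswith(bom):
            return encoding
    return default
```

Anyone who does have to read Notepad's output can pass 'utf_8_sig' as
the default. When decoding, that codec skips a leading UTF-8 BOM if one
is present and otherwise behaves like plain UTF-8, so the function
itself needs no UTF-8 case.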

ChrisA


