Guessing the encoding from a BOM
Albert-Jan Roskam
fomcl at yahoo.com
Thu Jan 16 14:37:29 EST 2014
--------------------------------------------
On Thu, 1/16/14, Chris Angelico <rosuav at gmail.com> wrote:
Subject: Re: Guessing the encoding from a BOM
To:
Cc: "python-list at python.org" <python-list at python.org>
Date: Thursday, January 16, 2014, 7:06 PM
On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjourne at gmail.com> wrote:
> 2014/1/16 Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
>> def guess_encoding_from_bom(filename, default):
>>     with open(filename, 'rb') as f:
>>         sig = f.read(4)
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
>>     else:
>>         return default
>
> You might want to add the utf8 bom too: '\xEF\xBB\xBF'.
I'd actually rather not. It would tempt people to pollute UTF-8 files
with a BOM, which is not necessary unless you are MS Notepad.
===> Can you elaborate on that? Unless your UTF-8 files contain only ASCII characters, I do not understand why you would not want a UTF-8 BOM.
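To make the question concrete, here is a small sketch of how Python's own codecs treat a UTF-8 BOM. The sample string is made up for illustration; 'utf-8' and 'utf-8-sig' are the standard codec names.

```python
import codecs

# Hypothetical sample text containing a non-ASCII character.
text = "naïve"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character in the result.
decoded_plain = with_bom.decode("utf-8")

# 'utf-8-sig' strips a leading BOM if present and otherwise acts like 'utf-8'.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_sig_no_bom = without_bom.decode("utf-8-sig")

print(repr(decoded_plain))
print(repr(decoded_sig))
print(repr(decoded_sig_no_bom))
```

So a reader that does not expect the BOM ends up with a stray U+FEFF at the start of the text, while the data itself decodes the same way with or without it.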
Btw, isn't "read_encoding_from_bom" a better function name than "guess_encoding_from_bom"? I thought the point of BOMs was that there would be no more need to guess?
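For what it's worth, here is a rough sketch of how the function might look with the UTF-8 BOM included, as Björn suggests. It is only an illustration: using the codecs.BOM_* constants, returning 'utf_8_sig', and testing the UTF-32 marks before the UTF-16 ones are my assumptions, not part of the code quoted above. The reordering matters because the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE), so checking UTF-16 first would never reach the UTF-32 branch for little-endian files.

```python
import codecs


def guess_encoding_from_bom(filename, default):
    """Return a codec name based on a leading BOM, or `default` if none is found."""
    with open(filename, "rb") as f:
        sig = f.read(4)
    # Check the 4-byte UTF-32 marks first, because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    if sig.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf_32"
    # Assumption: report 'utf_8_sig' so the BOM is stripped when decoding.
    if sig.startswith(codecs.BOM_UTF8):
        return "utf_8_sig"
    if sig.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf_16"
    return default
```

It is still a guess in one respect: a UTF-16 LE file whose first character after the BOM is U+0000 would look like UTF-32 LE to this check.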
Thanks!
Albert-Jan