Guessing the encoding from a BOM
Björn Lindqvist
bjourne at gmail.com
Thu Jan 16 13:01:51 EST 2014
2014/1/16 Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> def guess_encoding_from_bom(filename, default):
>     with open(filename, 'rb') as f:
>         sig = f.read(4)
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'
>     else:
>         return default
You might want to add the UTF-8 BOM too: b'\xEF\xBB\xBF'.
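
A rough sketch of how the function could look with the UTF-8 BOM added. The
table name _BOMS and the helper guess_encoding_from_sig are made up for this
example and are not from the original post. It assumes 'utf_8_sig' is the
codec you want for UTF-8 with a BOM, since that codec drops the BOM when
decoding. It also tests the UTF-32 signatures before the UTF-16 ones, because
the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16
little-endian BOM.

```python
import codecs

# Hypothetical lookup table (not from the original post).
# Longer signatures come first: the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf_32'),
    (codecs.BOM_UTF32_LE, 'utf_32'),
    (codecs.BOM_UTF8, 'utf_8_sig'),   # assumption: strip the BOM on decode
    (codecs.BOM_UTF16_BE, 'utf_16'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
)


def guess_encoding_from_sig(sig, default):
    """Return the encoding implied by the leading bytes, else default."""
    for bom, encoding in _BOMS:
        if sig.startswith(bom):
            return encoding
    return default


def guess_encoding_from_bom(filename, default):
    """Read the first four bytes of the file and guess from the BOM."""
    with open(filename, 'rb') as f:
        sig = f.read(4)
    return guess_encoding_from_sig(sig, default)
```

One caveat: a UTF-16 little-endian file whose first character after the BOM
is U+0000 would be reported as UTF-32 here. That should be rare in real text
files, but it means the result is still a guess.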
> (4) Don't return anything, but raise an exception. (But
> which exception?)
I like this option the most because it is the most "fail fast". If you
return 'undefined', the error might surface hours later, or in some cases
not at all.
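
To illustrate the fail-fast variant, here is a minimal sketch. The thread
leaves open which exception to raise, so the NoBOMError class (a ValueError
subclass) is only one possible choice, and both it and the function names
are invented for this example.

```python
import codecs


class NoBOMError(ValueError):
    """Hypothetical exception: the data does not start with a known BOM."""


def encoding_from_sig_or_raise(sig):
    """Return the encoding implied by the BOM, or raise NoBOMError."""
    # UTF-32 is tested first because its LE BOM starts with the UTF-16 LE BOM.
    if sig.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return 'utf_32'
    if sig.startswith(codecs.BOM_UTF8):
        return 'utf_8_sig'
    if sig.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf_16'
    raise NoBOMError('no recognised BOM in %r' % (sig,))


def encoding_from_bom_or_raise(filename):
    """Fail immediately if the file has no BOM, instead of returning a marker."""
    with open(filename, 'rb') as f:
        sig = f.read(4)
    return encoding_from_sig_or_raise(sig)
```

With this version the caller sees the problem at the point where the file is
inspected, and can catch NoBOMError (or plain ValueError) to supply a fallback
encoding explicitly.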
--
mvh/best regards Björn Lindqvist