Guessing the encoding from a BOM

Björn Lindqvist bjourne at gmail.com
Thu Jan 16 13:01:51 EST 2014


2014/1/16 Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> def guess_encoding_from_bom(filename, default):
>     with open(filename, 'rb') as f:
>         sig = f.read(4)
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'
>     else:
>         return default

You might want to add the UTF-8 BOM too: b'\xEF\xBB\xBF'.
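
Roughly what I have in mind, as a sketch rather than a finished version. It keeps the function name from the quoted code and uses the BOM constants from the standard `codecs` module. Two things here are my own assumptions: returning 'utf_8_sig' for the UTF-8 case (so the BOM is stripped when decoding), and testing the UTF-32 signatures before the UTF-16 ones, since the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs


def guess_encoding_from_bom(filename, default):
    """Guess a codec name from the file's byte order mark, if any."""
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith(codecs.BOM_UTF8):
        # Assumption: 'utf_8_sig' so the BOM is dropped on decoding.
        return 'utf_8_sig'
    elif sig.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        # Checked before UTF-16: b'\xFF\xFE\x00\x00' also starts
        # with the UTF-16 little-endian BOM b'\xFF\xFE'.
        return 'utf_32'
    elif sig.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf_16'
    else:
        return default
```

One caveat with this ordering: a UTF-16 little-endian file whose first character is U+0000 would be reported as UTF-32.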

>     (4) Don't return anything, but raise an exception. (But
>         which exception?)

I like this option the most because it is the most "fail fast". If you
return 'undefined', the error might surface hours later, or in some
cases not at all.
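
A rough sketch of the fail-fast variant, to show what I mean. As for which exception: the `MissingBOMError` class (a `ValueError` subclass) and the `encoding_from_bom` name are hypothetical, made up for this example, and 'utf_8_sig' for the UTF-8 case is likewise my assumption. The table lists the four-byte signatures first so that a UTF-32 little-endian BOM is not mistaken for UTF-16.

```python
import codecs


class MissingBOMError(ValueError):
    """Hypothetical exception: the file has no recognised BOM."""


# Longer signatures first, so UTF-32 LE is not taken for UTF-16 LE.
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf_32'),
    (codecs.BOM_UTF32_LE, 'utf_32'),
    (codecs.BOM_UTF8, 'utf_8_sig'),
    (codecs.BOM_UTF16_BE, 'utf_16'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
)


def encoding_from_bom(filename):
    """Return a codec name based on the BOM, or raise if there is none."""
    with open(filename, 'rb') as f:
        sig = f.read(4)
    for bom, encoding in _BOMS:
        if sig.startswith(bom):
            return encoding
    # Fail here, at the call site, instead of handing back a
    # placeholder that blows up (or does not) somewhere else later.
    raise MissingBOMError('no BOM found in %r' % (filename,))
```

The caller then has to decide explicitly what to do about files without a BOM, for example by catching the exception and choosing a fallback encoding on the spot.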


-- 
mvh/best regards Björn Lindqvist


