Guessing the encoding from a BOM

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Jan 15 21:13:55 EST 2014


I have a function which guesses the likely encoding used by text files by 
reading the BOM (byte order mark) at the beginning of the file. A 
simplified version:


def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    # Test the 4-byte UTF-32 BOMs first: the little-endian UTF-32 BOM
    # (FF FE 00 00) begins with the UTF-16 one (FF FE).
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    else:
        return default


The idea is that you can call the function with a file name and a default 
encoding to return if one can't be guessed. I want to provide a default 
value for the default argument (a default default), but one which will 
unconditionally fail if you blindly go ahead and use it.

E.g. I want to either provide a default:

enc = guess_encoding_from_bom("filename", 'latin1')
f = open("filename", encoding=enc)


or I want to write:

enc = guess_encoding_from_bom("filename")
if enc == something:
    # Can't guess, fall back on an alternative strategy
    ...
else:
    f = open("filename", encoding=enc)


If I forget to check the returned result, I should get an explicit 
failure as soon as I try to use it, rather than silently getting the 
wrong results.
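
To make the "fail as soon as I try to use it" requirement concrete, here is 
a minimal sketch. It assumes 'undefined' as the default default (just one of 
the candidates, not a decision), and read_text_unchecked and the temporary 
file are hypothetical names used only for the demonstration:

```python
import os
import tempfile

def guess_encoding_from_bom(filename, default='undefined'):
    """Guess from the BOM; the 'undefined' default is an assumption."""
    with open(filename, 'rb') as f:
        sig = f.read(4)
    # The 4-byte UTF-32 BOMs are tested before the 2-byte UTF-16 ones.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    return default

def read_text_unchecked(filename):
    """Hypothetical caller that forgets to check the guessed encoding."""
    enc = guess_encoding_from_bom(filename)
    with open(filename, encoding=enc) as f:
        return f.read()

# Demonstration file with no BOM, so the guess falls through to the default.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'plain bytes, no BOM')
try:
    read_text_unchecked(path)
except UnicodeError as err:
    outcome = 'failed loudly: %s' % err
else:
    outcome = 'silently succeeded'
finally:
    os.remove(path)
print(outcome)
```

With this default the failure shows up at the point where the text is 
actually decoded, not inside the guessing function itself.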

What should I return as the default default? I have four possibilities:

    (1) 'undefined', which is a standard encoding guaranteed to 
        raise an exception when used;

    (2) 'unknown', which best describes the result, and currently 
        there is no encoding with that name;

    (3) None, which is not the name of an encoding; or

    (4) Don't return anything, but raise an exception. (But 
        which exception?)
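
For comparison, option (4) might look something like the sketch below. The 
exception class EncodingGuessError and the _MISSING sentinel are names I 
made up for illustration; neither exists in the standard library, and 
subclassing ValueError is only one possible answer to "which exception?":

```python
import os
import tempfile

_MISSING = object()  # sentinel: "no default given", distinct from None

class EncodingGuessError(ValueError):
    """Hypothetical exception for option (4); not a standard name."""

def guess_encoding_from_bom(filename, default=_MISSING):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    # The 4-byte UTF-32 BOMs are tested before the 2-byte UTF-16 ones.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    if default is _MISSING:
        # No BOM and no caller-supplied default: refuse to guess.
        raise EncodingGuessError('no BOM found in %r' % filename)
    return default

# Demonstration file with no BOM.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'no BOM here')
try:
    with_default = guess_encoding_from_bom(path, 'latin1')
    try:
        guess_encoding_from_bom(path)
    except EncodingGuessError:
        raised = True
    else:
        raised = False
finally:
    os.remove(path)
```

The difference from options (1) to (3) is that the failure happens inside 
the guessing function, before any encoding name reaches open().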


Apart from option (4), here are the exceptions you get from blindly using 
options (1) through (3):

py> 'abc'.encode('undefined')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/encodings/undefined.py", line 19, in encode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

py> 'abc'.encode('unknown')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: unknown

py> 'abc'.encode(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encode() argument 1 must be str, not None
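
The three sessions above can be condensed into one small check. This is 
only a sketch; failure_type is a hypothetical helper name, and 'latin1' is 
included as a control that should not fail:

```python
def failure_type(encoding):
    """Return the name of the exception str.encode raises, or None."""
    try:
        'abc'.encode(encoding)
    except (UnicodeError, LookupError, TypeError) as err:
        return type(err).__name__
    return None

# Candidate return values for options (1) to (3), plus a working control.
for candidate in ('undefined', 'unknown', None, 'latin1'):
    print(repr(candidate), failure_type(candidate))
```

Note that UnicodeError is a subclass of ValueError, so a caller who already 
catches ValueError would also catch the failure from option (1), while 
options (2) and (3) need LookupError and TypeError respectively.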


At the moment, I'm leaning towards option (1). Thoughts?



-- 
Steven


