Guessing the encoding from a BOM

Steven D'Aprano steve at pearwood.info
Thu Jan 16 01:55:16 EST 2014


On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:

> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
> 
>> enc = guess_encoding_from_bom("filename")
>> if enc == something:
>>      # Can't guess, fall back on an alternative strategy ...
>> else:
>>      f = open("filename", encoding=enc)
>>
>>
>> If I forget to check the returned result, I should get an explicit
>> failure as soon as I try to use it, rather than silently returning the
>> wrong results.
> 
> Yes, agreed.
> 
>> What should I return as the default default? I have four possibilities:
>>
>>     (1) 'undefined', which is a standard encoding guaranteed to
>>         raise an exception when used;
> 
> +0.5. This describes the outcome of the guess.
> 
>>     (2) 'unknown', which best describes the result, and currently
>>         there is no encoding with that name;
> 
> +0. This *better* describes the outcome, but I don't think adding a new
> name is needed nor very helpful.

And there is a chance -- albeit a small chance -- that someday the std 
lib will gain an encoding called "unknown".
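
For comparison, here is a minimal sketch of how the two candidate names
fail today when someone forgets to check the result. The helper name
failure_type is made up for illustration, and the 'unknown' case assumes
no codec of that name has been registered.

```python
def failure_type(encoding):
    """Return the name of the exception raised when decoding with
    `encoding`, or None if decoding succeeds. Illustrative helper only."""
    try:
        b"abc".decode(encoding)
    except LookupError:
        # No codec is registered under this name.
        return "LookupError"
    except UnicodeError:
        # The codec exists but refuses the conversion.
        return "UnicodeError"
    return None


print(failure_type("undefined"))  # the codec exists and always fails
print(failure_type("unknown"))    # assumes no such codec is registered
print(failure_type("utf-8"))
```

Either way the failure is loud. The difference is that 'undefined' is
documented to fail, while 'unknown' fails only for as long as nobody
registers a codec with that name.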


>>     (4) Don't return anything, but raise an exception. (But
>>         which exception?)
> 
> +1. I'd like a custom exception class, sub-classed from ValueError.

Why ValueError? It's not really an "invalid value" error, it's more of a 
"my heuristic isn't good enough" failure. (Maybe the file starts with 
another sort of BOM which I don't know about.)

If I go with an exception, I'd choose RuntimeError, or a custom error 
that inherits directly from Exception.
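
A rough sketch of the exception-raising variant, under some assumptions:
the class name EncodingGuessError, the helper bom_to_encoding and the
particular BOM table are all illustrative choices, not a settled design.

```python
import codecs


class EncodingGuessError(Exception):
    """Hypothetical error: no recognised BOM, so no guess can be made."""


# Check the UTF-32 BOMs before the UTF-16 ones, because the UTF-32-LE
# BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def bom_to_encoding(head):
    """Map the leading bytes of a file to a codec name, or raise."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    raise EncodingGuessError("no recognised BOM in %r" % (head[:4],))


def guess_encoding_from_bom(filename):
    """Read the first four bytes of `filename` and guess from the BOM."""
    with open(filename, "rb") as f:
        return bom_to_encoding(f.read(4))
```

With this version the caller cannot forget to check a sentinel: an
unhandled EncodingGuessError stops the program before open() is ever
called with a bogus encoding.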



Thanks to everyone for the feedback.



-- 
Steven


