Guessing the encoding from a BOM
Steven D'Aprano
steve at pearwood.info
Thu Jan 16 01:55:16 EST 2014
On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:
> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
>
>> enc = guess_encoding_from_bom("filename")
>> if enc == something:
>>     # Can't guess, fall back on an alternative strategy ...
>> else:
>>     f = open("filename", encoding=enc)
>>
>>
>> If I forget to check the returned result, I should get an explicit
>> failure as soon as I try to use it, rather than silently returning the
>> wrong results.
>
> Yes, agreed.
>
>> What should I return as the default default? I have four possibilities:
>>
>> (1) 'undefined', which is a standard encoding guaranteed to
>> raise an exception when used;
>
> +0.5. This describes the outcome of the guess.
>
>> (2) 'unknown', which best describes the result, and currently
>> there is no encoding with that name;
>
> +0. This *better* describes the outcome, but I don't think adding a new
> name is needed nor very helpful.
And there is a chance -- albeit a small chance -- that someday the std
lib will gain an encoding called "unknown".
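To make the difference between options (1) and (2) concrete, here is a small sketch. The helper name decode_outcome is made up for illustration; the only library behaviour it relies on is bytes.decode() and the standard 'undefined' codec.

```python
def decode_outcome(encoding_name):
    """Return the class name of the exception raised when decoding, or None.

    Hypothetical helper, for illustration only.
    """
    try:
        b"abc".decode(encoding_name)
    except (LookupError, UnicodeError) as exc:
        return type(exc).__name__
    return None


for name in ("undefined", "unknown", "ascii"):
    print(name, "->", decode_outcome(name))
```

On current CPython, 'undefined' is found but fails with UnicodeError as soon as it is used, while 'unknown' fails at lookup time with LookupError. Either way the caller gets a loud failure; the second only holds for as long as nobody registers a codec under that name.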
>> (4) Don't return anything, but raise an exception. (But
>> which exception?)
>
> +1. I'd like a custom exception class, sub-classed from ValueError.
Why ValueError? It's not really an "invalid value" error, it's more of a
"my heuristic isn't good enough" failure. (Maybe the file starts with
another sort of BOM which I don't know about.)
If I go with an exception, I'd choose RuntimeError, or a custom error
that inherits directly from Exception.
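For the record, a minimal sketch of what option (4) might look like with a custom error inheriting directly from Exception. The class name EncodingGuessError and the table _BOMS are placeholders I made up here, and the set of BOMs checked is only the ones exposed by the codecs module, so treat this as an assumption-laden outline rather than the final code.

```python
import codecs


class EncodingGuessError(Exception):
    """Hypothetical error: no known BOM was found at the start of the file."""


# UTF-32 must be tested before UTF-16, because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(filename):
    """Return an encoding name based on the file's BOM, or raise."""
    with open(filename, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is four bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    raise EncodingGuessError("no recognised BOM in %r" % (filename,))
```

A caller who wants a fallback strategy wraps the call in try/except EncodingGuessError; a caller who forgets gets a traceback at the point of the guess instead of wrong results later.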
Thanks to everyone for the feedback.
--
Steven