Possible bug with stability of mimetypes.guess_* function output
Peter Otten
__peter__ at web.de
Fri Feb 7 14:40:06 EST 2014
Asaf Las wrote:
> On Friday, February 7, 2014 8:06:36 PM UTC+2, Johannes Bauer wrote:
>> Hi group,
>>
>> I'm using Python 3.3.2+ (default, Oct 9 2013, 14:50:09) [GCC 4.8.1] on
>> linux and have found what is very peculiar behavior at best and a bug at
>> worst. It regards the mimetypes module and in particular the
>> guess_all_extensions and guess_extension functions.
>>
>> I've found that these do not return stable output. When running the
>> following commands, it returns one of:
>>
>> $ python3 -c 'import mimetypes;
>> print(mimetypes.guess_all_extensions("text/html"),
>> mimetypes.guess_extension("text/html"))'
>> ['.htm', '.html', '.shtml'] .htm
>>
>> $ python3 -c 'import mimetypes;
>> print(mimetypes.guess_all_extensions("text/html"),
>> mimetypes.guess_extension("text/html"))'
>> ['.html', '.htm', '.shtml'] .html
>>
>> So guess_extension(x) seems to always return guess_all_extensions(x)[0].
>>
>> Curiously, "shtml" is never the first element. The other two are mixed
>> with a probability of around 50% which leads me to believe they're
>> internally managed as a set and are therefore affected by the
>> (relatively new) nondeterministic hashing function initialization.
>>
>>
>> I don't know if stable output is guaranteed for these functions, but it
>> sure would be nice. Messes up a whole bunch of things otherwise :-/
>>
>> Please let me know if this is a bug or expected behavior.
>>
>> Best regards,
>>
>> Johannes
>
> dictionary. same for v3.3.3 as well.
>
> it might be you could try to query using sequence below :
>
> import mimetypes
> mimetypes.init()
> mimetypes.guess_extension("text/html")
>
> i got only 'htm' for 5 consecutive attempts
As Johannes mentioned, this depends on the hash seed:
$ PYTHONHASHSEED=0 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.html
$ PYTHONHASHSEED=1 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.htm
$ PYTHONHASHSEED=2 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.shtml
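
To check more seeds than can comfortably be typed by hand, the same experiment can be scripted. This is only a sketch: pop_with_seed and SNIPPET are names made up for illustration, and which element comes out for a given seed depends on the Python version, so the tally will not necessarily match the output above.

```python
import collections
import os
import subprocess
import sys

# The one-liner from the shell commands above, run in a child interpreter.
SNIPPET = 'print({".htm", ".html", ".shtml"}.pop())'


def pop_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with PYTHONHASHSEED set to seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Tally which element is popped for the first few seeds.
    tally = collections.Counter(pop_with_seed(seed) for seed in range(10))
    for ext, count in sorted(tally.items()):
        print(ext, count)
```

A fixed seed always gives the same answer; only a change of seed (or the default random seed) changes the result.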
You never see ".shtml" as the guessed extension because it is not in the
original mimetypes.types_map dict, but instead programmatically read from a
file like /etc/mime.types and then added to a list of extensions.

Johannes,

I'd like the guessed extension to be consistent, too, but even if that
request is rejected, the current behaviour should be documented.
Please file a bug report.
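
In the meantime, one possible workaround is to pick from guess_all_extensions() yourself rather than rely on guess_extension(). The following is a minimal sketch, not anything the mimetypes module provides: pick_extension and stable_guess_extension are hypothetical helper names, and "smallest string wins unless a preferred extension is present" is an arbitrary rule chosen only because it is deterministic.

```python
import mimetypes


def pick_extension(candidates, preferred=()):
    """Pick one extension from candidates, independent of their order.

    Extensions listed in preferred win in the order given; otherwise the
    lexicographically smallest candidate is returned (None if empty).
    """
    for ext in preferred:
        if ext in candidates:
            return ext
    return min(candidates, default=None)


def stable_guess_extension(mime_type, preferred=(), strict=True):
    """Like mimetypes.guess_extension(), but with a fixed tie-break."""
    candidates = mimetypes.guess_all_extensions(mime_type, strict=strict)
    return pick_extension(candidates, preferred)


if __name__ == "__main__":
    print(stable_guess_extension("text/html", preferred=(".html",)))
```

The result still depends on which extensions the local mime.types files contribute, but no longer on the order in which they happen to be returned.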