Possible bug with stability of mimetypes.guess_* function output

Peter Otten __peter__ at web.de
Fri Feb 7 14:40:06 EST 2014


Asaf Las wrote:

> On Friday, February 7, 2014 8:06:36 PM UTC+2, Johannes Bauer wrote:
>> Hi group,
>> 
>> I'm using Python 3.3.2+ (default, Oct  9 2013, 14:50:09) [GCC 4.8.1] on
>> linux and have found what is very peculiar behavior at best and a bug at
>> worst. It regards the mimetypes module and in particular the
>> guess_all_extensions and guess_extension functions.
>> 
>> I've found that these do not return stable output. When running the
>> following commands, it returns one of:
>> 
>> $ python3 -c 'import mimetypes;
>> print(mimetypes.guess_all_extensions("text/html"),
>> mimetypes.guess_extension("text/html"))'
>> ['.htm', '.html', '.shtml'] .htm
>> 
>> $ python3 -c 'import mimetypes;
>> print(mimetypes.guess_all_extensions("text/html"),
>> mimetypes.guess_extension("text/html"))'
>> ['.html', '.htm', '.shtml'] .html
>> 
>> So guess_extension(x) seems to always return guess_all_extensions(x)[0].
>> 
>> Curiously, "shtml" is never the first element. The other two are mixed
>> with a probability of around 50% which leads me to believe they're
>> internally managed as a set and are therefore affected by the
>> (relatively new) nondeterministic hashing function initialization.
>> 
>> 
>> I don't know if stable output is guaranteed for these functions, but it
>> sure would be nice. Messes up a whole bunch of things otherwise :-/
>> 
>> Please let me know if this is a bug or expected behavior.
>> 
>> Best regards,
>> 
>> Johannes
> 
> It's a dictionary. Same for v3.3.3 as well.
> 
> You could try querying with the sequence below:
> 
> import mimetypes
> mimetypes.init()
> mimetypes.guess_extension("text/html")
> 
> I got only 'htm' for 5 consecutive attempts

As Johannes mentioned, this depends on the hash seed:

$ PYTHONHASHSEED=0 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.html
$ PYTHONHASHSEED=1 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.htm
$ PYTHONHASHSEED=2 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.shtml
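
The same experiment can be run from Python itself. This is only a sketch: 
pop_with_seed is a name made up for this mail, and it assumes that 
sys.executable points to the interpreter you want to test. Which element 
comes out for a given seed can differ between Python versions and builds.

```python
import os
import subprocess
import sys

# The one-liner from the shell session above.
SNIPPET = 'print({".htm", ".html", ".shtml"}.pop())'


def pop_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with PYTHONHASHSEED set to seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output([sys.executable, "-c", SNIPPET], env=env)
    return out.decode().strip()


if __name__ == "__main__":
    for seed in range(5):
        # The same seed always gives the same answer; different seeds may not.
        print(seed, pop_with_seed(seed))
```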

You never see ".shtml" as the guessed extension because it is not in the 
original mimetypes.types_map dict; instead it is read programmatically from a 
file like /etc/mime.types and then added to a list of extensions.

Johannes, 
I'd like the guessed extension to be consistent, too, but even if that is 
rejected, the current behaviour should be documented. 
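
In the meantime, one possible workaround is to pick from the result of 
guess_all_extensions() by a rule that does not depend on set order. The 
following is a minimal sketch, not what the module does; 
stable_guess_extension is a hypothetical helper, and taking the 
alphabetically smallest candidate is an arbitrary choice.

```python
import mimetypes


def stable_guess_extension(mime_type, strict=True):
    """Return the alphabetically smallest known extension, or None.

    Hypothetical helper: the result does not depend on the hash seed
    because min() ignores the order of the candidate list.
    """
    candidates = mimetypes.guess_all_extensions(mime_type, strict=strict)
    return min(candidates) if candidates else None


if __name__ == "__main__":
    print(stable_guess_extension("text/html"))
```

The choice is stable across runs as long as the set of known extensions 
(for example the contents of /etc/mime.types) does not change.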

Please file a bug report.



