Best way to hash a dictionary

Alex Martelli aleax at aleax.it
Mon Mar 3 07:31:34 EST 2003


Miki Tebeka wrote:
   ...
> Yap. It design decision. I'm doing a NLP M.Sc. thesis where I tag
> words with part of speech. I Hebrew words have complex part of speech
> with many attributes.
> An example might be: {'POS':'Verb', 'Gender':'Male', 'Tense':'Past'}
> My algorithm is a learning one and I need to store some statistics per
> tag, the most efficient way is hash tables.

Can you please clarify for us why your desired mapping -- a separate
hash table where, e.g.,

{'POS':'Verb', 'Gender':'Male', 'Tense':'Past'}  -->  extrainfo

is necessarily more efficient than adding a non-otherwise-occurring
key to the dict object and having, e.g.:

{'POS':'Verb', 'Gender':'Male', 'Tense':'Past', 'Xinfo':extrainfo}

???  This is not entirely clear to me from your posts so far.  Similar
doubts apply to many other possible structures -- e.g. why would it
be unsuitable to have the tag be (('POS', 'Verb'), ('Gender', 'Male'), 
('Tense', 'Past')), a tuple you'd have no problem using as a key in 
any other dict.

Basically, the fact that you tag words with sets of (tagname,tagvalue)
pairs does not necessarily explain why the sets need to be dicts -- that
depends on how you build/find out the tagnames and values for a word
and how you then proceed to use what you have built or found out.


But, this is by the by -- just trying to get a better perspective
on your issues.  Assuming that, indeed, dictionaries are exactly
what you need in both cases, we move on to...:

>> Please note, I'm not saying you definitely don't need to use dictionaries
>> as dictionary keys. Simply that it sounds like there may be a better way.
> If you find one, please tell me. Basically I need immutable hash.

Sets (in the sets module in the 2.3 standard Python library) have the
ability to present themselves as temporarily immutable, and that of
giving out an immutable copy of themselves, but such abilities
are unfortunately not (yet?) present for standard dictionaries, and
I fear it may be too late to introduce it in Python 2.3.   However...:


>> Finally: the reason you can't just use the dictionaries as keys, of
>> course, is that dictionaries are mutable. Does you design preclude any
>> change to the dicionaries after they become keys in the other dictionary?
>> If so then you may just be able to subclass the standard dict type.
> This has performance penalties, and since an average run of my thesis

I'm confused -- how do you *know* you'd have performance issues here?
How have you measured the option of subclassing dict?

> is 5 days I can't do that. Plus if you try:
>>>> class D(dict):
> pass
> 
>>>> d = D()
>>>> h = {}
>>>> h[d] = 1
> Traceback (most recent call last):
>   File "<pyshell#37>", line 1, in ?
>     h[d] = 1
> TypeError: dict objects are unhashable

Well of course, you haven't added any __hash__ method -- what did you
expect?!  Surely you wouldn't expect Python to add a __hash__ method
automatically for you...?  From such an observation as this I suspect
we may be talking entirely at cross-purposes.

The obvious idea would be something like:

class HashableDict(dict):
    def __init__(self, *args, **kwds):
        dict.__init__(self, *args, **kwds)
        my_hash = 0
        for key in self: my_hash ^= hash(key)
        self.my_hash = my_hash
    def __hash__(self):
        return self.my_hash

plus perhaps some precaution to ensure instances of HashableDict are
never modified after their __init__, but that's basically just to
ward against possible bugs, so let's skip it for now.  So where's
the performance penalty here?  When you have finished preparing (and
thus finished modifying) a dict, you build the equivalent hashable
dict ONCE -- paying a one-off price to compute its hash, of course,
but how could you ever possibly avoid THAT? -- and that's it.

What does a profile of your 5-days run tell you about where your
code is spending its time currently?


Alex





More information about the Python-list mailing list